A few months ago I took part in a Q&A session at the Sydney Power BI User Group, and one person asked whether we should write our queries in datasets or in dataflows. The Q&A format doesn’t allow much time for a single question, so in this blog post I’m providing a more comprehensive comparison of the two options.
If you’re new to dataflows, refer to my previous blog post on dataflows considerations.
Dataflows vs. datasets — let’s go!
1. Data refresh frequency
Dataflows
Refreshes can be managed independently and efficiently: different dataflows can be refreshed daily, weekly, or on demand (and with Power Automate, you can use other triggers too). This is especially helpful when some of your data sources are slow to refresh, or when their underlying data changes less often than that of other sources.
Datasets
When you refresh a dataset, data is pulled from all of its data sources, even those that don’t need to be refreshed as frequently as the others. If there’s a slow, infrequently updated data source, the whole dataset pays the price.
Winner
Dataflows.
2. Refresh failures
Dataflows
If one data source fails to refresh, the dataflow that uses it fails, but a dataset built on top of that dataflow can still refresh with the last successfully loaded data. You can also download a detailed report of which table or tables failed to refresh and why. With a recent update, you can notify non-owners of dataflows about refresh failures too.
Datasets
When one data source fails to refresh, the whole dataset fails to refresh. The Power BI service only tells you which table failed to refresh first, even though it may not be the only one. Error messages also tend to be less clear than those for dataflows.
Winner
Dataflows.
3. Report refresh during development
Dataflows
Report refresh times during development are considerably shorter when you use dataflows, because your data is already transformed and readily available as CSV files.
Datasets
Some data sources may take a long time to refresh, making the whole report slow to refresh during development.
Winner
Dataflows.
4. Report development
Dataflows
You can split the roles of those who prepare data in dataflows and those who model data, so multiple people can effectively work on the same report without third-party tools.
Datasets
Without third-party tools, only one person at a time can work on a report.
Winner
Dataflows.
5. Validating the schema
Dataflows
Validating the schema and saving a dataflow takes a long time. Depending on the data volume and transformation steps, schema validation can take as long as an hour before your dataflow is saved.
Datasets
Power BI Desktop doesn’t take long to validate the schema when you save your changes and close the Power Query Editor.
Winner
Datasets, for now. The Power Query team is actively working on background schema validation for dataflows.
6. Using gateways
Dataflows
There’s no need to map a data source to a gateway when using dataflows in your datasets, because refreshing a dataset based on dataflows doesn’t require a gateway. Furthermore, you can separate the roles of those who prepare data and those who consume it: people who only read from dataflows, without developing them, don’t need access to the original underlying data.
Datasets
When a dataset changes hands, the new owner may need to re-map each of its data sources to a gateway.
Winner
Dataflows.
7. Reusing queries in Power BI
Dataflows
Common dimensions and other tables that are used across different reports can be reused with dataflows, which is especially handy when you apply the same transformations to those tables.
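To reuse such a table, you connect to the dataflow from Power BI Desktop with the Power Platform dataflows connector. The navigation query Power BI Desktop generates looks roughly like the sketch below; the workspace, dataflow, and table names here are placeholders, and the exact navigation steps may differ slightly between connector versions:

```
let
    // Standard dataflows connector in Power BI Desktop
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id = "Workspaces"]}[Data],
    // "Sales", "Shared Dimensions" and "DimDate" are placeholder names
    Workspace = Workspaces{[workspaceName = "Sales"]}[Data],
    Dataflow = Workspace{[dataflowName = "Shared Dimensions"]}[Data],
    DimDate = Dataflow{[entity = "DimDate", version = ""]}[Data]
in
    DimDate
```

Every dataset that needs the DimDate table points at the same dataflow entity, so the transformation logic lives in exactly one place.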
Datasets
If you need to apply transformations to tables you use in different datasets, you’ll need to do that in every dataset.
Winner
Dataflows.
8. Reusing queries outside of Power BI
Dataflows
You can bring your own ADLS Gen2 storage account and load dataflows there, which makes the results of your queries available not just to Power BI but also for other purposes, such as machine learning.
Datasets
While you can connect to datasets in Premium capacity via XMLA endpoints, that’s not a viable solution for consuming large amounts of data in applications other than Power BI.
Winner
Dataflows.
9. Total refresh times
Dataflows
Refreshing a dataflow takes time, and you’ll still need to refresh the dependent dataset afterwards, unless you’re using DirectQuery to dataflows. If you rely on the user interface alone, you’ll need to schedule the refreshes far enough apart that a dataflow finishes refreshing before the dataset starts. Refreshes can be chained with Power Automate, though that may cost extra if it’s not included in your Microsoft 365 license.
Datasets
When you use Power Query in a dataset directly, you only need to refresh the dataset; there’s no need to wait for dataflows to finish refreshing first.
Winner
Datasets.
10. Incremental refresh
Dataflows
With Power BI Premium and the enhanced compute engine, you can configure incremental refresh on dataflows. Unfortunately, the incremental refresh settings aren’t as flexible in dataflows as in datasets: you must have a column of type DateTime, unlike in datasets.
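If your source only exposes an integer date key, you can derive the DateTime column that dataflow incremental refresh requires. Here’s a minimal Power Query M sketch, assuming a hypothetical FactSales table with a DateKey column in YYYYMMDD format (the server, database, and column names are placeholders):

```
let
    // Placeholder source; replace with your own connection
    Source = Sql.Database("myserver", "mydb"),
    Sales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // DateKey is an integer such as 20210610; dataflow incremental
    // refresh needs a DateTime column, so derive one from the key
    WithDateTime = Table.AddColumn(
        Sales,
        "OrderDateTime",
        each DateTime.From(
            #date(
                Number.IntegerDivide([DateKey], 10000),
                Number.IntegerDivide(Number.Mod([DateKey], 10000), 100),
                Number.Mod([DateKey], 100)
            )
        ),
        type datetime
    )
in
    WithDateTime
```

You can then pick OrderDateTime as the filter column in the dataflow’s incremental refresh settings. Note that this extra step isn’t needed in datasets, as the next point shows.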
Datasets
In datasets, you can configure incremental refresh even on tables whose date columns are stored as integers, for example, and you don’t need Power BI Premium.
Winner
Datasets.
Conclusions
Dataflows win 7:3 (soon to be 8:2, with background schema validation).
Unless refresh times are a major issue, in most cases I’d go with dataflows.
I’d like to thank Krishna Karnati, who came up with the bulk of the points above, as well as Claire Mitchell, who added her points too.
Happy querying! 🙂
Edited on 10 June 2021:
Tom Sun reminded me that you can load dataflows to your own ADLS folders.
Sandeep Pawar reminded me of incremental refresh in both dataflows and datasets.
Edited on 30 November 2021:
Kirill in the comments section reminded me that the dataflows connector was updated in August 2021 to support incremental refresh.