Power Query in dataflows vs. datasets

A few months ago I participated in a Q&A session at the Sydney Power BI User Group, and one person asked whether we should write our queries in datasets or dataflows. The Q&A format doesn’t allow you to spend much time on a single question, so in this blog post I’m providing a more comprehensive comparison of the two options.

If you’re new to dataflows, refer to my previous blog post on dataflows considerations.

Dataflows vs. datasets — let’s go!

1. Data refresh frequency

Dataflows

Refreshes can be managed independently and efficiently: different dataflows can be refreshed daily, weekly, or on demand (and with Power Automate, you can use other triggers too). This is especially helpful when some of your data sources are slow to refresh or their underlying data changes less often than that of other sources.

Datasets

Data is pulled from all data sources every time you refresh a dataset, even though some sources may not need to be refreshed as frequently as others. If one data source is slow or infrequently updated, the whole dataset pays the price.

Winner

Dataflows.

2. Refresh failures

Dataflows

If one data source fails to refresh, only that dataflow fails to refresh; datasets built on it keep using the last successfully refreshed data. You can also download a detailed report showing which table or tables failed to refresh and why. With a recent update, you can notify non-owners of dataflows about refresh failures too.

Datasets

When one data source fails to refresh, the whole dataset fails to refresh. The Power BI service only tells you which table failed to refresh first, even though it may not be the only one, and the error messages tend to be less clear than those for dataflows.

Winner

Dataflows.

3. Report refresh during development

Dataflows

The report refresh time during development is considerably reduced when using dataflows, because your data is already transformed and readily available as CSV files.

Datasets

Some data sources may take a long time to refresh, making the whole report slow to refresh during development.

Winner

Dataflows.

4. Report development

Dataflows

You can split the roles of those who prepare data in dataflows and those who model data, so multiple people can work on the same report effectively without third-party tools.

Datasets

Without third-party tools, only one person at a time can work on a report.

Winner

Dataflows.

5. Validating the schema

Dataflows

Validating the schema and saving a dataflow takes a long time. Depending on data volume and the transformation steps, schema validation can take as long as an hour before your dataflow is saved.

Datasets

Power BI Desktop doesn’t take long to validate the schema when you save your changes and close the Power Query Editor.

Winner

Datasets, for now. The Power Query team is actively working on background schema validation for dataflows.

6. Using gateways

Dataflows

There’s no need to map a data source to a gateway when using dataflows in your datasets, because no gateway is required to refresh a dataset that is based on dataflows. Furthermore, you can separate the roles of those who prepare data and those who consume it: people who only read dataflows, without developing them, don’t need access to the original underlying data sources.

Datasets

If a dataset changes hands over time, the new owner may need to re-map each data source to a gateway every time they take over the dataset.

Winner

Dataflows.

7. Reusing queries in Power BI

Dataflows

Common dimensions and other tables that are used across different reports can be reused with dataflows, which is especially handy if you apply the same transformations to tables.
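As an illustration, connecting to a shared dataflow table from Power BI Desktop produces a short navigation query in Power Query M. The sketch below uses the PowerBI.Dataflows connector; the workspace, dataflow, and table names are hypothetical examples, and the exact navigation keys may differ depending on your version of the connector.

```powerquery
let
    // Connect to Power BI dataflows (the names below are hypothetical)
    Source = PowerBI.Dataflows(null),
    // Navigate to the workspace, then the dataflow, then the table (entity)
    Workspace = Source{[workspaceName = "Sales Analytics"]}[Data],
    Dataflow = Workspace{[dataflowName = "Shared Dimensions"]}[Data],
    DimDate = Dataflow{[entity = "DimDate"]}[Data]
in
    DimDate
```

Every dataset that connects this way reuses the same transformed DimDate table, so the transformation logic lives in one place instead of being copied into each dataset.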

Datasets

If you need to apply transformations to tables you use in different datasets, you’ll need to do that in every dataset.

Winner

Dataflows.

8. Reusing queries outside of Power BI

Dataflows

You can use your own ADLS (Azure Data Lake Storage) account and load dataflows there, which makes the results of your queries available not just to Power BI but also for other purposes, such as machine learning.

Datasets

While you can connect to datasets in a Premium capacity by using XMLA endpoints, that’s not a viable solution for consuming large amounts of data in applications other than Power BI.

Winner

Dataflows.

9. Total refresh times

Dataflows

Refreshing a dataflow takes time, and you’ll need to refresh the dependent dataset afterwards too, unless you’re using DirectQuery to dataflows. Using the user interface alone, you’ll need to schedule the refreshes sufficiently far apart so that a dataflow finishes refreshing before a dataset starts. Refreshes can be chained by using Power Automate, though this may cost extra if it’s not included in your Microsoft 365 license.

Datasets

When using Power Query in a dataset, you only need to refresh a dataset, and there’s no need to wait for dataflows to finish refreshing.

Winner

Datasets.

10. Incremental refresh

Dataflows

Dataflows themselves can use incremental refresh, but you cannot use a dataflow as a data source and refresh it incrementally, even if you use the enhanced compute engine. Furthermore, incremental refresh settings aren’t as flexible in dataflows as in datasets: in a dataflow you must have a column of type DateTime, unlike in datasets.

Datasets

In datasets, you can configure incremental refresh even on tables with date columns that are stored as integers, for example.
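As a sketch of how that can work: incremental refresh requires the DateTime parameters RangeStart and RangeEnd, and the filter below converts them into matching integer keys so they can be compared against an integer date column. The source, table, and column names here are hypothetical.

```powerquery
let
    // Hypothetical source with an integer DateKey column in yyyyMMdd format (e.g. 20210610)
    Source = Sql.Database("myserver", "mydatabase"),
    Sales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // Convert the RangeStart/RangeEnd DateTime parameters into integer keys,
    // then filter the table so only the current partition's rows are loaded
    Filtered = Table.SelectRows(
        Sales,
        each [DateKey] >= Number.FromText(DateTime.ToText(RangeStart, "yyyyMMdd"))
            and [DateKey] < Number.FromText(DateTime.ToText(RangeEnd, "yyyyMMdd"))
    )
in
    Filtered
```

Note the asymmetric comparison (>= for RangeStart, < for RangeEnd), which keeps partition boundaries from overlapping or leaving gaps.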

Winner

Datasets.

Conclusions

Dataflows win 7:3 (soon to be 8:2, with background schema validation).

Unless refresh times are a major issue, in most cases I’d go with dataflows.

I’d like to thank Krishna Karnati, who came up with the bulk of the points above, as well as Claire Mitchell, who added her points too.

Happy querying! 🙂

Edited on 10 June 2021:

Tom Sun reminded me that you can load dataflows to your own ADLS folders.

Sandeep Pawar reminded me of incremental refresh in both dataflows and datasets.