Grouping text values in Power BI

The August 2020 release of Power BI Desktop includes a new Power Query function: Table.AddFuzzyClusterColumn. This function allows you to clean, cluster, or group text values. In this blog post I’m showing an example of how it works.

Update 22 September 2020: There’s a new button in Power Query online that makes this blog post obsolete 🙂 Hopefully it’s coming soon to Power BI Desktop!

The problem

Let’s say you have the following text values in your table:

  • This Is Text
  • ThisIsText
  • thisistext
  • this is text

Note how for Power Query they’re all different, because some of the values have spaces and some don’t. Furthermore, some are all lowercase, while some use title case.

Now let’s say you’d like all these values to be consistent with each other, because according to your business logic, it’s just one value. In the past, you either had to solve this in your data source, or you had to write some complex M code.

The solution

The new function greatly simplifies the process!

At minimum, you need to specify the table, the existing text column name, and the new column name, and you’ll get a new column like so:

Here’s the corresponding query:

let
    Source = #table(
        type table [Text = text],
        {
            {"This Is Text"},
            {"ThisIsText"},
            {"thisistext"},
            {"this is text"}
        }
    ),
    ClusteredText = Table.AddFuzzyClusterColumn(
        Source,
        "Text",
        "Clustered Text"
    )
in
    ClusteredText

Options

There are several options you can specify too. For example, if you care about spaces, which are ignored by default, you can instruct the function to retain them like so:

...
    ClusteredText = Table.AddFuzzyClusterColumn(
        Source,
        "Text",
        "Clustered Text",
        [IgnoreSpace = false]
    )
...

The result:

As you probably noticed, the function name includes “Fuzzy”, so other options you can specify are similar to those you get with fuzzy matching (joins) in Power Query:

  • Culture—culture neutral by default.
  • IgnoreCase—case is ignored by default.
  • Threshold—the default value is 0.8.
  • TransformationTable—no transformation table by default.

The last option, TransformationTable, is especially useful for custom translations like NYC → New York City.

You can read the documentation on the new function on Microsoft Docs.

Happy clustering! 🙂