Simple linear regression in DAX

Random data points and a simple linear regression line

DAX, originating in Power Pivot, shares many functions with Excel. As of 2017, some of the functions, such as SLOPE and INTERCEPT, exist in the latter but not in the former. The two functions can be used for a simple linear regression analysis, and in this article I am sharing patterns to easily replicate them in DAX. But first, why would you want to do such analysis?

Update 2 December 2017: the sales example was updated to display the correct Estimated Sales figure at the grand total level.

How can I use simple linear regression?

With simple linear regression, you can estimate the quantitative relationship between any two variables. Let’s say you have the following data model:

Name Height (cm) Weight (kg)
Harry 166 54
Hermione 164 51
Ron 173 73
Ginny 171 55
Dumbledore 181

Note how Dumbledore’s weight is unknown — we are going to predict it with simple linear regression.

Linear regression has been done in DAX before (by Rob Collie and Greg Deckler, for instance), but my approach uses the new DAX syntax, which makes the calculations very easy.

Simple linear regression pattern

Simple linear regression can be done in just one measure:

Simple linear regression =
VAR Known =
    FILTER (
        SELECTCOLUMNS (
            ALLSELECTED ( Table[Column] ),
            "Known[X]", [Measure X],
            "Known[Y]", [Measure Y]
        ),
        AND (
            NOT ( ISBLANK ( Known[X] ) ),
            NOT ( ISBLANK ( Known[Y] ) )
        )
    )
VAR Count_Items =
    COUNTROWS ( Known )
VAR Sum_X =
    SUMX ( Known, Known[X] )
VAR Sum_X2 =
    SUMX ( Known, Known[X] ^ 2 )
VAR Sum_Y =
    SUMX ( Known, Known[Y] )
VAR Sum_XY =
    SUMX ( Known, Known[X] * Known[Y] )
VAR Average_X =
    AVERAGEX ( Known, Known[X] )
VAR Average_Y =
    AVERAGEX ( Known, Known[Y] )
VAR Slope =
    DIVIDE (
        Count_Items * Sum_XY - Sum_X * Sum_Y,
        Count_Items * Sum_X2 - Sum_X ^ 2
    )
VAR Intercept =
    Average_Y - Slope * Average_X
RETURN
    Intercept + Slope * [Measure X]

To make it work for you, replace the

  1. Red part with your category values column reference
  2. Green parts with your independent values measure (twice)
  3. Blue part with your dependent values measure

Estimating Dumbledore’s weight

We can now estimate Dumbledore’s weight and, with a little more DAX, build the following graph:

Dumbledore's weight is highlighted as estimated one

Note how we show known weight values when weight is known, but we display estimated weight in case where weight is not known.

Other ways to use the pattern

You are not limited to using measures for known X and Y values — you can also use columns, thanks to row context being present in SELECTCOLUMNS. For example, to estimate future sales, you can use dates in place of Measure X, like so:

Simple linear regression =
VAR Known =
    FILTER (
        SELECTCOLUMNS (
            ALLSELECTED ( 'Date'[Date] ),
            "Known[X]", 'Date'[Date],
            "Known[Y]", [Measure Y]
        ),
        AND (
            NOT ( ISBLANK ( Known[X] ) ),
            NOT ( ISBLANK ( Known[Y] ) )
        )
    )
VAR Count_Items =
    COUNTROWS ( Known )
VAR Sum_X =
    SUMX ( Known, Known[X] )
VAR Sum_X2 =
    SUMX ( Known, Known[X] ^ 2 )
VAR Sum_Y =
    SUMX ( Known, Known[Y] )
VAR Sum_XY =
    SUMX ( Known, Known[X] * Known[Y] )
VAR Average_X =
    AVERAGEX ( Known, Known[X] )
VAR Average_Y =
    AVERAGEX ( Known, Known[Y] )
VAR Slope =
    DIVIDE (
        Count_Items * Sum_XY - Sum_X * Sum_Y,
        Count_Items * Sum_X2 - Sum_X ^ 2
    )
VAR Intercept =
    Average_Y - Slope * Average_X
RETURN
    SUMX (
        DISTINCT ( 'Date'[Date] ),
        Intercept + Slope * 'Date'[Date]
    )

Note that to display the correct amount at the grand total level, you also need to modify the RETURN expression. In case of sales, it is appropriate to use SUMX; if you deal with temperatures, for example, you will probably use AVERAGEX.

You can then build the following graph:

Actual and estimated sales on one graph

The red line looks a lot like a trend line, doesn’t it? In fact, if you add a trend line to the graph, it will be exactly the same as red line:

Actual and estimated sales with trend line

There are at least two reasons to consider calculated trend lines:

  1. With the built-in trend line, you can only infer its values from the Y axis, while the calculated trend line allows you to see the values explicitly
  2. As of September 2017, Power BI allows you to only add trend lines for numeric or datetime axes. As soon as you use strings (month names, for instance), you lose the ability to add trend lines. With simple linear regression, you can calculate them yourself, as long as you have sequential numeric values to use as known X values

If you think this pattern is useful, please give kudos to it in the Quick Measures Gallery.

Happy forecasting!

Sample Power BI file: Simple linear regression.pbix

  • Mim

    wow, this is brilliant !!!!

  • Mack Cage

    Good Stuff thanks!

  • Rainar Essenson

    Hi Daniil!

    Great job with combining all the variables in one measure!
    I have a question though – what if i want to show Sales Estimate on a different value then Date.
    Instead of ‘date’ values I added ‘Date'[Month Name] but got an error, that SUMX cannot work with ‘string’ values. I did try using numeric values from ‘Date’ table (like month number & relative month) – these work well, but on the visual I’d like to show [Month Name] – unfortunately it does not calculate correctly if i use y-axis as something different then the column selection I have made in the measure.

    Anyway my goal is to show sales per customer for the past selected period (usually rolling 12 months) and then give an indication if sales trend is going up or down based on the ‘slope’ variable. But I also want to show the actual trend line on the visual using a user friendly format (January 2018).

    • Hi, Rainar, thanks for your question! You can try the following formula (note the changes in the Known variable and the RETURN statement):
      Estimated Sales =
      VAR Known =
      FILTER (
      SELECTCOLUMNS (
      SUMMARIZE (
      ALLSELECTED ( 'Date' ),
      'Date'[Calendar Year Month],
      'Date'[Calendar Year Month Sequential Number]
      ),
      "Known[X]", 'Date'[Calendar Year Month Sequential Number],
      "Known[Y]", [Sales Amount]
      ),
      AND (
      NOT ( ISBLANK ( Known[X] ) ),
      NOT ( ISBLANK ( Known[Y] ) )
      )
      )
      VAR Count_Items =
      COUNTROWS ( Known )
      VAR Sum_X =
      SUMX ( Known, Known[X] )
      VAR Sum_X2 =
      SUMX ( Known, Known[X] ^ 2 )
      VAR Sum_Y =
      SUMX ( Known, Known[Y] )
      VAR Sum_XY =
      SUMX ( Known, Known[X] * Known[Y] )
      VAR Average_X =
      AVERAGEX ( Known, Known[X] )
      VAR Average_Y =
      AVERAGEX ( Known, Known[Y] )
      VAR Slope =
      DIVIDE (
      Count_Items * Sum_XY - Sum_X * Sum_Y,
      Count_Items * Sum_X2 - Sum_X ^ 2
      )
      VAR Intercept =
      Average_Y - Slope * Average_X
      RETURN
      SUMX (
      DISTINCT ( 'Date'[Calendar Year Month Sequential Number] ),
      Intercept + Slope * 'Date'[Calendar Year Month Sequential Number]
      )

      You just need to make sure that the sorting column consists of sequential numbers. By that I mean it is “year * 12 + month”, not “year * 100 + month”.

      • Rainar Essenson

        Thank you Daniil! Спасибо 😉
        I did try using ‘relative month’ (Current month = 0, previous month = -1 etc) field from my date table, but that did not work out that well – so yes I created month sequence in my date table.
        I ended up multiplying my sales slopes with sales amount in order to also take account the magnitude of change in trend. From that number I created dynamic segments – aggressive/moderate/flat/declining etc. The outcome is quite interesting – possible to filter customers according to their past sales trend.
        Not completely sure how my data holds up to being evaluated on a linear trend – daily sales are fluctuating a lot, monthly looks better in this case.

        Thanks again for your help! I really like your blog posts!

        • Your multiplication of slope and sales amount with subsequent dynamic segmentation sounds like a very interesting application of linear regression. You should consider writing a blog post about it 🙂

          Aitäh for your feedback, Rainar!

  • Prashanth Mohan

    Hi Daniil,

    Thanks for the post. It is really very helpful. But, I wanted to know if it is possible to have this regression line for separate time frames.

    Say, a regression line from 3rd sep to 10th sep. another regression line from 10th sep to 17th Sep..

    I wanted to show trends for each time-frame in my chart in power BI, wanted to know if that is achievable?

    Thanks and Regards,
    Vishnu

    • Hello, Prashanth. Yes, I think what you want is possible, though it’s not going to look great, in my opinion, because different segments will still be connected: https://uploads.disquscdn.com/images/19e1e2540ab640b1c3b720464d224e6fd49be01e006c7154bfd9eed2a04ceed9.png

      Here is one way you could do this:
      Estimated Sales 2 =
      VAR ThisWeek =
      WEEKNUM ( SELECTEDVALUE ( ‘Date'[Date] ), 2 )
      VAR Known =
      FILTER (
      SELECTCOLUMNS (
      FILTER ( ALLSELECTED ( ‘Date'[Date] ), WEEKNUM ( ‘Date'[Date], 2 ) = ThisWeek ),
      “Known[X]”, ‘Date'[Date],
      “Known[Y]”, [Actual Sales]
      ),
      AND ( NOT ( ISBLANK ( Known[X] ) ), NOT ( ISBLANK ( Known[Y] ) ) )
      )
      VAR Count_Items =
      COUNTROWS ( Known )
      VAR Sum_X =
      SUMX ( Known, Known[X] )
      VAR Sum_X2 =
      SUMX ( Known, Known[X] ^ 2 )
      VAR Sum_Y =
      SUMX ( Known, Known[Y] )
      VAR Sum_XY =
      SUMX ( Known, Known[X] * Known[Y] )
      VAR Average_X =
      AVERAGEX ( Known, Known[X] )
      VAR Average_Y =
      AVERAGEX ( Known, Known[Y] )
      VAR Slope =
      DIVIDE (
      Count_Items * Sum_XY
      – Sum_X * Sum_Y,
      Count_Items * Sum_X2
      – Sum_X ^ 2
      )
      VAR Intercept = Average_Y
      – Slope * Average_X
      RETURN
      Intercept
      + Slope * SELECTEDVALUE ( ‘Date'[Date] )