Pearson correlation coefficient in DAX

If you had tried calculating the Pearson correlation coefficient (PCC) in DAX, you would have likely read Gerhard Brueckl’s excellent blog post. If you haven’t, I encourage you to read it, as it contains a high-level overview of what PCC is. In this article I am showing how the same calculation can be done more efficiently using the new DAX functions, which are available starting with Power Pivot in Excel 2016, Power BI Desktop, SQL Server 2016 Analysis Services, and Azure Analysis Services.

The new DAX functions allow us to compute PCC in just one easy-to-use measure. Furthermore, depending on your data model, you may see a significant performance improvement. I have tested the new approach on two data models, and in one of them the calculation ran 93% faster, while the other model saw only a 20% increase in performance. The comparison is not exactly fair though, because the new formula has one additional benefit: it treats missing values in the same way as Excel’s CORREL function.

To illustrate the last point, let’s have a look at the very basic data model below:

Item	Value X	Value Y
A	2	0
B	3	1
C	5	1
D	7	2

If we calculate the correlation coefficient between Value X and Value Y using the CORREL function in Excel, we will get 0.92, and we will get the same result by following Gerhard’s approach in DAX. So far, so good.

However, if we remove one of the values, the calculations will no longer match. For example, let’s remove the number seven from our data.

Item	Value X	Value Y
A	2	0
B	3	1
C	5	1
D		2

Now the CORREL function returns 0.76 and the original formula returns -0.39. Note that Excel treats missing values differently from zeros. Essentially, Excel ignores the whole pairs where at least one value is missing. If we replaced 7 with 0, we would get -0.39 — same as using Gerhard’s formula.

The new DAX correlation coefficient formula matches Excel’s logic. Here is the DAX formula:

Correlation Coefficient :=
VAR Correlation_Table =
    FILTER (
        ADDCOLUMNS (
            VALUES ( 'Table'[Column] ),
            "Value_X", [Measure_X],
            "Value_Y", [Measure_Y]
        ),
        AND (
            NOT ( ISBLANK ( [Value_X] ) ),
            NOT ( ISBLANK ( [Value_Y] ) )
        )
    )
VAR Count_Items =
    COUNTROWS ( Correlation_Table )
VAR Sum_X =
    SUMX ( Correlation_Table, [Value_X] )
VAR Sum_X2 =
    SUMX ( Correlation_Table, [Value_X] ^ 2 )
VAR Sum_Y =
    SUMX ( Correlation_Table, [Value_Y] )
VAR Sum_Y2 =
    SUMX ( Correlation_Table, [Value_Y] ^ 2 )
VAR Sum_XY =
    SUMX ( Correlation_Table, [Value_X] * [Value_Y] )
VAR Pearson_Numerator =
    Count_Items * Sum_XY - Sum_X * Sum_Y
VAR Pearson_Denominator_X =
    Count_Items * Sum_X2 - Sum_X ^ 2
VAR Pearson_Denominator_Y =
    Count_Items * Sum_Y2 - Sum_Y ^ 2
VAR Pearson_Denominator =
    SQRT ( Pearson_Denominator_X * Pearson_Denominator_Y )
RETURN
    DIVIDE ( Pearson_Numerator, Pearson_Denominator )

To use the formula, simply replace:

The red part with a table expression that returns the items for which you calculate your measures X and Y
The green part with your measure X
The blue part with your measure Y

You can also insert the formulas for [Measure_X] and [Measure_Y] directly into the code above — just don’t forget to wrap them in CALCULATE to trigger context transition.

If you think this pattern is useful, please give kudos to it in the Quick Measures Gallery.

Sample Excel 2016 workbook: Correlation.xlsx

Evan Willenson

Helpful article Daniil. I’ve modified this template to use the Correlation_Table similar to what i consider a temp table to do in SQL. Works great and improves speed 4x for measures where I iterate over Dates contained in the row context but i’m having trouble for cumulative measures where I want to override Date in “Correlation_Table” based on the first Date for the row context. (example but syntax doesn’t work) SUMX(Filter(Correlation_Table,Correlation_Table[Date]>= FirstDate(Correlation_Table[Date]) && Correlation_Table[Date]<= LastDate(Correlation_Table[Date]) , [Value_X]) — any ideas?
- Daniil Maslyuk
  
  Hi Evan, thanks for your question. I see two issues here.
  
  First, to call a column in a table that is stored in a variable, you need to specify the column name only, such as [Date].
  
  Second, FIRSTDATE and LASTDATE are a bit tricky to use with tables stored as variables. Furthermore, they may provide unexpected results due to context transition (remember, they return tables instead of scalar values).
  
  The following syntax works:
  SUMX (
  FILTER (
  Correlation_Table,
  [Date] >= MIN ( [Date] )
  && [Date] <= MAX ( [Date] )
  ),
  [Value_X]
  )
  It’s a bit difficult to give advice without my understanding the end goal though 🙂
  - Evan Willenson
    
    Thanks for the quick response Daniil. When I try this I get “Cannot identify the table that contains [Date] column” error.
    
    To give a brief description of the end goal, I am issuing a DAX template where the columns and measures are defined through a UI. The basic DAX template uses a SUMMARIZECOLUMNS and ROLLUPADDISSUBTOTAL with additional date and filter logic and the selected measures. I want to iterate cumulatively through my date table, while still retaining the row context from ROLLUPADDISSUBTOTAL for one of the measures. I have a current version that works (Current Version of Cumulative Measure), but it can be quite cumbersome. The size of the data i’m working with isn’t too huge, it’s the complexity of the calculations (time link returns) that causes the biggest drain on speed. I was thinking creating a table to store only the values necessary for the calculation would be more efficient.
    
    I mocked up the output in the hopes that its easily translatable across data models.
    
    Thanks again!
    
    EVALUATE
    (
    SUMMARIZECOLUMNS (
    ROLLUPADDISSUBTOTAL ( FactTable[Grouping], “isGrouping” ),
    DATESBETWEEN ( DateTable[Date], DATE ( 2017, 1, 1 ), DATE ( 2017, 6, 12 ) ),
    FILTER ( VALUES ( FactTable[Variable] ), FactTable[Variable] = “something” ),
    “Current Version of Cumulative Measure”, IF (
    DISTINCTCOUNT ( PositionPerformance[Date] ) > 0,
    PRODUCTX (
    DATESBETWEEN (
    DateTable[Date],
    DATE ( 2017, 1, 1 ),
    LASTDATE ( DateTable[Date] )
    ),
    1 + DIVIDE ( FactTable[numerator], FactTable[denominator], 0 )
    )
    – 1
    ),
    “Trying New Version of Cumulative Measure”,
    VAR Temp_Table =
    ADDCOLUMNS (
    VALUES ( DateTable[Date] ),
    “Value_numerator”, FactTable[numerator],
    “Value_denominator”, FactTable[denominator]
    )
    VAR rOne = [Value_numerator] / [Value_denominator]
    VAR rCum =
    PRODUCTX (
    FILTER ( Temp_Table, [Date] >= MIN ( [Date] ) && [Date] <= MAX ( [Date] ) ),
    1 + rOne
    )
    – 1
    RETURN
    rCum
    )
    )
    - Daniil Maslyuk
      
      There is something in SUMMARIZECOLUMNS that doesn’t let you access the Temp_Table[Date] column, and I’m not sure what it is.
      
      This syntax works:
      DEFINE
      MEASURE FactTable[Trying New Version of Cumulative Measure] =
      VAR Temp_Table =
      ADDCOLUMNS (
      VALUES ( DateTable[Date] ),
      “rOne”, DIVIDE ( [numerator], [denominator] )
      )
      VAR rCum =
      PRODUCTX (
      FILTER ( Temp_Table, [Date] >= MIN ( [Date] ) && [Date] 0,
      PRODUCTX (
      DATESBETWEEN (
      DateTable[Date],
      DATE ( 2017, 1, 1 ),
      LASTDATE ( DateTable[Date] )
      ),
      1 + DIVIDE ( FactTable[numerator], FactTable[denominator], 0 )
      )
      – 1
      ),
      “Trying New Version of Cumulative Measure”,
      [Trying New Version of Cumulative Measure]
      )
      )
Mike Ennis

This is very useful! Thanks!

But I faced a challenge implementing this code to calculate multiple correlations at once and report them in the same table.

I am calculating the correlation coefficient as an estimate of item discrimination, that is, the correlation between write-wrong answers on individual test questions with each testtaker’s total score on the test. The challenge was that my data table was not as neatly structured as yours: there were rows which should not be included in the calculation–but are needed for other purposes–and the data was “stacked” by question. So, in the measures I needed to filter out the errenous rows and I needed to segment the data according items. The first required a simple filter of the source table. The second required using DISTINCT(SELECTCOLUMNS()) instead of simply VALUES() in order to combine all columns which produced the correct SUM()s. This is the coded that workrd for the table var:

Correlation_Table =
FILTER (
ADDCOLUMNS (
DISTINCT(SELECTCOLUMNS( FILTER(‘All Responses’,
‘All Responses'[Responses] 0),
“ExamineeID”, ‘All Responses'[examinee_id],
“ItemID”, ‘All Responses'[Item Id],
“ResponseOrder”, ‘All Responses'[Response Order])),
“Value_X”, [X_MEASURE],
“Value_Y”, [Y_MEASURE]
),
AND (
NOT ( ISBLANK ( [Value_X] ) ),
NOT ( ISBLANK ( [Value_Y] ) )
)
)

Note that FILTER(‘All Responses’, ‘All Responses'[Responses] 0) was also applied to X_MEASURE and Y_MEASURE, but I haven’t checked if this is necessary since the table var already applies this filter.

The result is pretty cool!

https://uploads.disquscdn.com/images/0718d0f897336da04ee7c7e3fa63f15e0d2c38d41152b4e4df16b9901a011392.png
- Daniil Maslyuk
  
  Well done! Glad you found the formula useful 🙂