Power BI- Outlier Detection in Time Series Data with Z scores
Hi there,
This is an article on getting into Time Series Anomaly with Power BI. We have Machine Learning algorithms in Python and R to do the same, however I could not find many articles on this in the web so want to share this use case.
I will be using Z score method for this, refer to the below document for concepts.
We are going to practice with the StoreSalesData (download from link below).
First get the data from the excel sheet and get to the query editor.
You will see a table in the query editor section. I am planning to create separate queries for all the Stores separately and calculate the Z score and flag them if it is above 3.
(This might be inefficient, I would love a feedback about any other way to do this)
Duplicate the Query and name it after the specific store you want the query to be built on.
Filter the store name (in this case "Hi tech city")
Add a new query from the revenue field (so that we can calculate Average and Standard Deviation)
Duplicate the list and name it in such a manner so that you can reference it later. (Remember! you will have to reference them in Power BI's M language, with no intellisense! So you have to get the Table and Column references right)
Calculate the Average and Standard Deviation (quite easily from the drop down)
Convert them to a Table
Do the same for the Standard Deviation List. The end result would look like this
Now get those metrics in your working table. Add a custom column and refer them with M
You will get a List instead of the data. Just scroll and click "Extract Values"
Do not select any delimiter and do the same for Standard deviation. Now we have everything ready in our query.
Lets Play!
Add a custom column which would calculate absolute value of X-Mean of X
Number.Abs([Revenue]-[Average])
Calculate the Z Score now
And Voilà! You have all the anomalies ready to be flagged
Add a conditional custom column to highlight Zscores above 3
Now all you need to do is filter the month to latest and filter outlier to "is outlier"
The Table comes as empty, that just means the latest data is all good and need not be highlighted
Now comes the painful part, you will have to replicate this for all the territories separately ( would love to hear about any suggestion to do this in a better way).
Replicate the three queries you used for Hitechcity and all you need to do in change the filter.
Simple way to do this is copy the query script that got generated for your first query and change the variables and table names to refer to the right table.
You can copy the query from the advanced editor.
Copy the query and paste that in the new query for the other store.
So we check Adra and its no Outlier
When we replicate it for Head Quarter, outlier!!
I will let you do this for the other stores, in the meantime lets look at how we will bring this out to a list so that we have a list of all stores which showed abnormal revenue in the latest month.
Create a new query and go to append queries from three or more tables. Append the query to three or more tables
I just need the store name as the primary key and the value (whether outlier or not). As we have only one outlier for HQ, we see only one in the list.
My objective is to quickly spot the store with anomaly as opposed to the view I see below where I have to go through all the stores one by one to find anomaly.
Now data modelling comes to play! We need to link our outlier table (HQ(2)) and map it to Table 1.
Go to the data modelling section and create a relation between the two.
Now you can create a slicer and bring in the HQ(2) outlier field. If you select "Outlier", you will see only Head Quarter (Just the way we wanted it!).
I will end this here, I also want to bring in a Year on Year comparison to check seasonality. I will be doing that in my next post.
Stay tuned! :)