Alteryx + Microsoft SQL Server: Tips and Tricks

Alteryx has been the leader in the data blending tool arena for a while, and SQL Server is THE go-to RDBMS for building highly performant, business-facing data marts that are scalable, easy to maintain, and carry a low TCO (Total Cost of Ownership).

An Alteryx plus SQL Server combination provides a nimble, analyst-friendly, cost-effective addition to the enterprise analytics arsenal, complementing typical enterprise-wide tools like Informatica, Teradata, and Oracle.

The genesis of this post was the journey we recently experienced while building a Pricing Analytics solution for a large pharma company. The platform was built on a SQL Server, Alteryx & Tableau foundation, and it processes over 500M records per month from multiple internal and external sources to give pricing analysts and executives up-to-date intel on market movements. The glamorous side of the solution was definitely the analyst- and executive-facing dashboards, but the lessons learned in the data blending pipeline are hopefully interesting too! The pipeline and the data mart needed to be optimized for quicker interactivity via Tableau while maintaining a reasonable data processing velocity.

A few tips and tricks go a long way in the making of an effective and well-tuned analytics solution:

Keep the model simple: Focus on simplifying the Fact/Dimension model to the known universe of requirements. This seems obvious, but there is a tendency to try to architect an infinitely scalable solution. Keep the granular data in staging and add appropriate grains of fact tables later when needed. Use natural keys for joins to avoid a complicated surrogate key management problem down the road.
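As a rough sketch of what joining on natural keys looks like (the table and column names here are hypothetical, not from the actual solution), the fact joins straight to its dimensions on business keys, with no surrogate key lookup step in the load pipeline:

```sql
-- Hypothetical schema: dimensions are joined directly on natural
-- (business) keys rather than generated surrogate keys.
SELECT f.PriceDate,
       p.ProductName,
       c.CustomerName,
       f.NetPrice
FROM dbo.FactPricing AS f
JOIN dbo.DimProduct  AS p ON p.NDC          = f.NDC           -- natural key
JOIN dbo.DimCustomer AS c ON c.CustomerCode = f.CustomerCode; -- natural key
```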

Right-size the columns: Auto-generating data models with varchar(255) might seem like the easy route, but it invariably bites you later. Use the Auto Field tool to create columns that are the correct size, particularly the table keys. The size and speed of index creation depend heavily on correct sizing.
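To illustrate the difference (column names and sizes below are hypothetical), compare an auto-generated model against a right-sized one. Wide varchar(255) keys bloat every index built on them and slow index creation:

```sql
-- Auto-generated default: every string lands as varchar(255).
CREATE TABLE dbo.FactPricing_Loose (
    NDC          varchar(255),
    CustomerCode varchar(255)
);

-- Right-sized: columns sized to the actual data they hold.
CREATE TABLE dbo.FactPricing_Tight (
    NDC          varchar(11) NOT NULL,  -- assumed max length for this key
    CustomerCode char(8)     NOT NULL
);
```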

Clustered indexes: Large SQL Server tables work much better if a single column is chosen as the clustered index. Since we were using natural keys for joins, a 6+ column clustered index would simply take too much space without providing any business value. We chose to concatenate the natural keys, create an MD5 ASCII key in Alteryx, and use that as the Primary Key/Clustered Index. The concatenation can be done using the "Concatenate Macro".
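We generated the key in Alteryx, but for illustration, the equivalent logic in T-SQL might look like the sketch below (CONCAT_WS requires SQL Server 2017+; the table and column names are hypothetical):

```sql
-- Build a 32-character hex MD5 key from the natural key columns.
-- The '|' separator guards against collisions such as ('AB','C') vs ('A','BC').
SELECT CONVERT(char(32),
               HASHBYTES('MD5',
                         CONCAT_WS('|', NDC, CustomerCode, PriceDate)),
               2) AS PricingKey   -- style 2 = hex string without the 0x prefix
FROM dbo.Stg_FactPricing;
```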

Filtered indexes: Filtered indexes work really well for repeated, known data access patterns. In our case, a slice of the pricing history data was accessed very frequently, but the query filters did not change. Instead of hitting 500M rows, we created a couple of filtered indexes that delivered sub-2s performance on a complicated query that would otherwise run for over 2 minutes.
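A filtered index is just a nonclustered index with a WHERE clause, so it only covers the slice the known queries hit. A minimal sketch, with hypothetical names and filter values:

```sql
-- Only rows matching the predicate are indexed, so the index stays
-- small even on a 500M-row table; INCLUDE makes the hot query covering.
CREATE NONCLUSTERED INDEX IX_PricingHistory_RecentWAC
ON dbo.PricingHistory (NDC, PriceDate)
INCLUDE (ListPrice, NetPrice)
WHERE PriceDate >= '2023-01-01' AND PriceType = 'WAC';
```

For the optimizer to use it, the query's filters must match (or be a subset of) the index's WHERE clause.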

Statistics on filter columns: SQL Server Management Studio's Database Engine Tuning Advisor (quite a mouthful!) recommends a ton of indexes and statistics, so you have to take it with a grain of salt. However, it is a great starting point: feed in a history of queries and get recommendations on indexes/stats. At a bare minimum, create statistics on the commonly used filter columns.
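Creating a statistic by hand is a one-liner; a sketch with hypothetical names:

```sql
-- Give the optimizer row-count estimates for a commonly filtered column
-- that isn't the leading column of any index.
CREATE STATISTICS ST_PricingHistory_PriceType
ON dbo.PricingHistory (PriceType)
WITH FULLSCAN;  -- full scan instead of sampling; slower to build, more accurate
```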

Alteryx In-DB Processing: This is one of my favorite features of Alteryx. I often get asked: why use In-DB when someone who knows SQL can just write the same complicated SQL in the Input tool? Readability! Alteryx's In-DB capability gives you a great platform to visually depict the SQL logic as a series of steps instead of 60 lines of code. It makes it super simple for the next analyst who looks at the workflow to maintain and change. I also advocate In-DB over the regular tools because it minimizes the transfer of data across the network and speeds up processing. Note: not all Alteryx tool functionality is available In-DB (e.g. there is no Transpose tool), so get creative in doing as much as possible In-DB and then bring the data out to regular Alteryx tools using Data Stream Out.

SQL Server Bulk Connection: Instead of the regular ODBC connection, Alteryx’s SQL Server Bulk Loader option significantly speeds up the writes into the database. 

Intermediate tables: As the fact table grew past 500M records, the Delete+Insert strategy did not work well when we were replacing more than 30-40% of the data. In an Alteryx + SQL Server environment, it is far faster to write into an empty or heap table (a table without a clustered index) from Alteryx and then do a Select+Insert into the final fact table than to write directly from Alteryx into the large fact table. If less than 20-30% of the data is being replaced, the extra step is a wash. A couple of pointers when using an intermediate table approach:

  • Use a SELECT tool prior to the Intermediate output to make sure the structure of the Intermediate & Final Fact is EXACTLY the same. If not, the errors are painful to resolve.
  • Disable all indexes (except the Clustered Index and any indexes used by the delete) on the Final Fact
  • We had performance issues with deleting 30%+ of the Final Fact data, with the database logs running out of space to grow. We weren’t worried about losing the data in case of a processing failure so we chose to break the Deletes into smaller deletes and utilize the DBCC SHRINKFILE command between deletes to keep the log in check.
  • Insert into the Final fact from the Intermediate table
  • Rebuild the indexes

This approach cut down processing the final write significantly for our solution.
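The steps above can be sketched in T-SQL roughly as follows (table, index, and log file names are hypothetical; the batch size and the assumption that losing in-flight data is acceptable are ours, as noted in the pointers):

```sql
-- 1) Batched delete of the slice being replaced, shrinking the log
--    between batches to keep it from running out of space.
WHILE 1 = 1
BEGIN
    DELETE TOP (500000) FROM dbo.FactPricing
    WHERE LoadMonth = '2023-06';
    IF @@ROWCOUNT = 0 BREAK;              -- stop when the slice is gone
    DBCC SHRINKFILE (FactDB_Log, 1024);   -- target size in MB
END;

-- 2) Disable nonclustered indexes (keep the clustered index and any
--    index the delete relied on) before the big insert.
ALTER INDEX IX_FactPricing_NDC ON dbo.FactPricing DISABLE;

-- 3) Insert from the heap staging table Alteryx wrote into;
--    column structures must match exactly.
INSERT INTO dbo.FactPricing WITH (TABLOCK)
SELECT * FROM dbo.Stg_FactPricing;

-- 4) Rebuild the disabled indexes.
ALTER INDEX ALL ON dbo.FactPricing REBUILD;
```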

Alteryx database: We learned this lesson the hard way: after processing over 100M records, the Alteryx workflow failed during the final write. After this painful setback, we utilized the awesome "Parallel Block Until Done" CReW macro in conjunction with the Alteryx database (YXDB) output to avoid mass re-processing. Reads off the Alteryx YXDB are incredibly fast, and avoiding re-processing many millions of rows is always a good idea.


Conclusion: As with all analytics platforms, there are an infinite number of ways to tinker with and tune the solution, but hopefully some of these tips can be incorporated from the get-go so history does not repeat itself! If you have other suggestions from your own experience, I would love to hear them in the comments.

Nikhil Prakash is an Analytics Senior Leader with the Information Management & Analytics practice at Slalom Silicon Valley. He has deep expertise in the Enterprise Data Architecture domain and passionately embraces new age analytics technologies to help companies better understand & improve their businesses.

#Alteryx #Analytics #SQLServer #Tips


