Best Practices for Snowflake Implementation

Lately I have been involved in a few data analytics projects built on #Snowflake, and every customer has asked about the best practices for implementing a data warehouse or a data mart on Snowflake, and how it differs from other relational databases. The fact is that Snowflake is at once very similar to and very different from other relational databases. It is similar because it is a columnar-stored relational database, and pretty much everything that holds true for any relational database holds true for Snowflake for the most part. That is why it is so easy to get started on Snowflake and master it quickly. The founders of Snowflake came from Oracle, which I think explains why Snowflake has all the qualities of a relational database. At the same time it is very different from other relational databases because it is built on cloud principles and is truly a cloud database: it is genuinely elastic, it requires practically zero maintenance, it is up and running in hours if not minutes, and it has native support for structured and semi-structured (JSON) data. The other great thing about Snowflake is that it discourages copying data again and again, which happens in every company: the same data gets copied to multiple storage locations and databases until there is a total loss of control over it.

Now, I did not write this article to discuss Snowflake's advantages or disadvantages compared with other databases, cloud-based or not. The reason for bringing up the concept of a truly cloud database is to be able to address the question of best practices for building a data warehouse in Snowflake. To be honest, most of the design best practices for building a data warehouse on any database platform hold true here, and there is no reason why they should not, because Snowflake is a relational database at heart. For example, here are a few of them.

  • Data must be categorized and appropriately marked, particularly data that is highly protected
  • Data must maintain its history through lineage and audit 
  • Data must be validated at source wherever possible
  • Data should be processed in micro-batches wherever possible
  • Data should be ingested and loaded first, and then transformed as required, i.e., ELT rather than ETL (see the sketch after this list, which also shows CDC change tracking with a stream)
  • Process CDC data end to end wherever possible to avoid performance challenges
  • Create separate data mart instances based on business-group-specific requirements. These are mostly views on top of EDW core tables. Give the business independence to build these semantic views, and build governance around semantic view creation
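
As a minimal sketch of the ELT and CDC points above (all schema, table and stage names here are purely illustrative, not from a real project), raw data can be landed with COPY INTO, changes tracked with a stream, and transformation done afterwards with plain SnowSQL inside the warehouse:

    -- ELT: land the raw files first, transform later inside Snowflake
    COPY INTO staging.orders_raw
      FROM @staging.orders_stage
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- A stream captures changes (CDC) on the staging table
    CREATE STREAM IF NOT EXISTS staging.orders_raw_stream
      ON TABLE staging.orders_raw;

    -- Transform only the changed rows into the curated core layer
    MERGE INTO core.orders AS tgt
    USING (
        SELECT order_id, customer_id, order_date, amount
        FROM staging.orders_raw_stream
        WHERE METADATA$ACTION = 'INSERT'
    ) AS src
    ON tgt.order_id = src.order_id
    WHEN MATCHED THEN UPDATE SET
        tgt.customer_id = src.customer_id,
        tgt.order_date  = src.order_date,
        tgt.amount      = src.amount
    WHEN NOT MATCHED THEN INSERT (order_id, customer_id, order_date, amount)
        VALUES (src.order_id, src.customer_id, src.order_date, src.amount);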

Having said that, there are some best practices that apply only to Snowflake because of its architectural differences from other relational databases and big data platforms. Here are some of the Snowflake-specific best practices.

  • Use Snowflake’s multi-cluster, shared-data architecture, which separates scale-out compute from storage, to optimize computing for different job workloads. For example, the landing and staging schemas can sit in one database while the curated core layer sits in another; assign separate virtual warehouses to these workloads to optimize performance (see the warehouse sketch after this list)
  • Assign a separate virtual warehouse to each data mart for an optimized data consumption experience
  • Keep semi-structured data in its original format where possible to improve data processing performance. JSON is often processed faster in place than after being converted to relational tables, because Snowflake optimizes storage based on the repeated elements within the semi-structured documents (see the VARIANT sketch after this list)
  • Load data in small chunks instead of one large file, and load in parallel using multiple files and nodes (see the parallel-load sketch after this list). At one client we were able to load 12 months of event telemetry data in less than 2 weeks with small node clusters.
  • Consider clustering large tables to improve query performance, and recluster if performance degrades. Sometimes clustering or reclustering can reduce query performance, so it is recommended to analyze the table data before making such adjustments (see the clustering sketch after this list).
  • Snowflake stores metadata (min and max values, number of distinct values, etc.) about each micro-partition so it can effectively prune the micro-partitions that a query needs to scan. If you are querying the last month of a table that stores 10 years of data, only a fraction of the data will be scanned. This holds when data is loaded in chronological order, so event-based data ingestion is recommended to preserve that order.
  • Create data pipelines that use the processing power of Snowflake. The flexible, framework-based pipelines developed by #Capgemini keep the SnowSQL completely separated from the processing framework. This means minimal code changes when the source or target schema changes, while still using the power of SnowSQL for all of the loading, transformation, aggregation and overall processing.
  • Automate all console-related tasks via automation scripts
  • Build a robust audit, balance and control framework that traces not only data lineage and data quality but also lets you weigh database performance optimization against compute cost in credits (a credit-monitoring sketch follows this list)
  • Start with small node clusters and scale up as needed.
  • Make zero-copy clones to create testing or validation databases and avoid data duplication (see the clone sketch after this list).
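
To make a few of the bullets above concrete, here is a minimal warehouse-separation sketch (warehouse names and sizes are illustrative, and multi-cluster warehouses assume an edition that supports them): separate warehouses for loading and for data mart consumption, started small and resized only when monitoring shows the need:

    -- Dedicated warehouse for loading and staging workloads
    CREATE WAREHOUSE IF NOT EXISTS load_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE;

    -- Dedicated multi-cluster warehouse for data mart consumption
    CREATE WAREHOUSE IF NOT EXISTS mart_wh
      WAREHOUSE_SIZE = 'SMALL'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 3
      SCALING_POLICY = 'STANDARD'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE;

    -- Start small and scale up only when the workload justifies it
    ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'SMALL';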
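
For the semi-structured data point, a sketch of keeping JSON in a VARIANT column and querying it in place (object and attribute names are made up for illustration):

    -- Store the JSON documents as-is in a VARIANT column
    CREATE TABLE IF NOT EXISTS staging.events_raw (event VARIANT);

    COPY INTO staging.events_raw
      FROM @staging.events_stage
      FILE_FORMAT = (TYPE = 'JSON');

    -- Query nested attributes directly, without flattening first
    SELECT
        event:device:os::STRING AS device_os,
        event:user_id::NUMBER   AS user_id,
        COUNT(*)                AS event_count
    FROM staging.events_raw
    GROUP BY 1, 2;

    -- Flatten only where a relational shape is genuinely needed
    SELECT e.event:user_id::NUMBER AS user_id, f.value::STRING AS tag
    FROM staging.events_raw e,
         LATERAL FLATTEN(INPUT => e.event:tags) f;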
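
For the small-chunks loading point, a sketch (file, stage and table names are illustrative): split large extracts into many compressed files so COPY INTO can load them in parallel, rather than pushing one giant file through a single thread. Note that PUT is a SnowSQL client command, not something run in the web UI:

    -- Upload many smaller compressed files to the stage
    PUT file:///data/telemetry/telemetry_2019_*.csv.gz @staging.telemetry_stage;

    -- COPY INTO loads all matching files in parallel
    COPY INTO staging.telemetry_raw
      FROM @staging.telemetry_stage
      PATTERN = '.*telemetry_2019_.*[.]csv[.]gz'
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 COMPRESSION = 'GZIP');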
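
For the clustering and micro-partition pruning points, a sketch (table and column names are illustrative): check how well the table is already clustered before defining a clustering key, and filter on the chronologically loaded date column so pruning can kick in:

    -- Inspect natural clustering on the date column first
    SELECT SYSTEM$CLUSTERING_INFORMATION('core.events', '(event_date)');

    -- Define a clustering key only if the analysis justifies it
    ALTER TABLE core.events CLUSTER BY (event_date);

    -- A filter on the date column lets Snowflake prune micro-partitions
    -- instead of scanning 10 years of history
    SELECT COUNT(*)
    FROM core.events
    WHERE event_date >= DATEADD(month, -1, CURRENT_DATE());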
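
For the audit and cost-control point, a sketch of tracking compute cost in credits per warehouse (this uses the standard SNOWFLAKE.ACCOUNT_USAGE share, which has some reporting latency; the time window is illustrative):

    -- Credits consumed per warehouse per day over the last 7 days
    SELECT warehouse_name,
           DATE_TRUNC('day', start_time) AS usage_day,
           SUM(credits_used)             AS credits
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY 1, 2;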
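
Finally, for the zero-copy clone point, a sketch (database names are illustrative): a clone shares the underlying micro-partitions with its source, so a full test copy costs no extra storage until the data diverges:

    -- Instant, storage-free copy of production for testing or validation
    CREATE DATABASE edw_test CLONE edw_prod;

    -- Clones can also be taken from a point in time using Time Travel
    CREATE DATABASE edw_validation CLONE edw_prod
      AT (OFFSET => -3600);   -- state as of one hour ago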

Snowflake is a powerful database with unique features that allow for quick implementation of data analytics projects, but like all other databases it requires careful design. At Capgemini we have developed unique design approaches and robust frameworks that take advantage of Snowflake's capabilities to build robust, scalable, low-TCO data analytics platforms that solve critical business challenges.

Hello Mainak, I am trying to find out the credits consumed by tasks by joining task history, query history and warehouse metering history, but I am getting too many rows. I think I am missing something here. Could you help me? Thanks, Nagaraj

What is the best practice for designing the data model? Flat denormalized, star schema, or OLAP?

I am wondering how to differentiate between when you should extract or not!

It would be interesting to see some tests around JSON in a VARIANT column outperforming the same contents flattened out into relational tables.
