Best Practices for Snowflake Implementation

Lately I have been involved in a few data analytics projects built on #Snowflake, and every customer has asked about the best practices for implementing a data warehouse or a data mart on Snowflake, and how it differs from other relational databases. The fact is that Snowflake is at once very similar to and very different from other relational databases. It is similar because it is a columnar-stored relational database, and pretty much everything that holds true for any relational database holds true for Snowflake for the most part. That is why it is so easy to get started on Snowflake and master it quickly. The founders of Snowflake came from Oracle, which I think explains why Snowflake has all the qualities of a relational database. At the same time it is very different from other relational databases because it is built on cloud principles and is truly a cloud database: it is genuinely elastic, it requires practically zero maintenance, it is up and running in hours if not minutes, and it has native support for structured and semi-structured (JSON) data. The other great thing about Snowflake is that it discourages copying data again and again, which happens in every company: the same data gets copied to multiple storage locations and databases until there is a total loss of control over it.

Now, I did not write this article to discuss Snowflake's advantages or disadvantages compared with other databases, cloud-based or not. The reason for bringing up the concept of a truly cloud database is to be able to address the question of best practices for building a data warehouse in Snowflake. To be honest, most of the design best practices for building a data warehouse on any database platform hold true here, and there is no reason why they should not, because Snowflake is a relational database at heart. For example, here are a few of them.

  • Data must be categorized and appropriately marked, particularly data that is highly protected
  • Data must maintain its history through lineage and audit 
  • Data must be validated at source wherever possible
  • Data should be processed in micro-batches wherever possible
  • Data should be ingested and loaded first, and then transformed as required, i.e., ELT rather than ETL (see the sketch after this list, which also shows CDC change tracking with a stream)
  • Process CDC data end to end wherever possible to avoid performance challenges
  • Create separate data mart instances based on business-group-specific requirements. These are mostly views on top of EDW core tables. Give the business independence to build these semantic views, and build governance around semantic view creation
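
As a minimal sketch of the ELT and CDC points above (all schema, table and stage names here are purely illustrative, not from a real project), raw data can be landed with COPY INTO, changes tracked with a stream, and transformation done afterwards with plain SnowSQL inside the warehouse:

    -- ELT: land the raw files first, transform later inside Snowflake
    COPY INTO staging.orders_raw
      FROM @staging.orders_stage
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- A stream captures changes (CDC) on the staging table
    CREATE STREAM IF NOT EXISTS staging.orders_raw_stream
      ON TABLE staging.orders_raw;

    -- Transform only the changed rows into the curated core layer
    MERGE INTO core.orders AS tgt
    USING (
        SELECT order_id, customer_id, order_date, amount
        FROM staging.orders_raw_stream
        WHERE METADATA$ACTION = 'INSERT'
    ) AS src
    ON tgt.order_id = src.order_id
    WHEN MATCHED THEN UPDATE SET
        tgt.customer_id = src.customer_id,
        tgt.order_date  = src.order_date,
        tgt.amount      = src.amount
    WHEN NOT MATCHED THEN INSERT (order_id, customer_id, order_date, amount)
        VALUES (src.order_id, src.customer_id, src.order_date, src.amount);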

Having said that, there are some best practices that apply only to Snowflake because of its architectural differences from other relational databases and big data platforms. Here are some of the Snowflake-specific best practices.

  • Use Snowflake’s multi-cluster, shared-data architecture, which separates scale-out compute from storage, to optimize computing for different job workloads. For example, the landing and staging schemas can sit in one database while the curated core layer sits in another; assign separate virtual warehouses to these workloads to optimize performance (see the warehouse sketch after this list)
  • Assign a separate virtual warehouse to each data mart for an optimized data consumption experience
  • Keep semi-structured data in its original format where possible to improve data processing performance. JSON is often processed faster in place than after being converted to relational tables, because Snowflake optimizes storage based on the repeated elements within the semi-structured documents (see the VARIANT sketch after this list)
  • Load data in small chunks instead of one large file, and load in parallel using multiple files and nodes (see the parallel-load sketch after this list). At one client we were able to load 12 months of event telemetry data in less than 2 weeks with small node clusters.
  • Consider clustering large tables to improve query performance, and recluster if performance degrades. Sometimes clustering or reclustering can reduce query performance, so it is recommended to analyze the table data before making such adjustments (see the clustering sketch after this list).
  • Snowflake stores metadata (min and max values, number of distinct values, etc.) about each micro-partition so it can effectively prune the micro-partitions that a query needs to scan. If you are querying the last month of a table that stores 10 years of data, only a fraction of the data will be scanned. This holds when data is loaded in chronological order, so event-based data ingestion is recommended to preserve that order.
  • Create data pipelines that use the processing power of Snowflake. The flexible, framework-based pipelines developed by #Capgemini keep the SnowSQL completely separated from the processing framework. This means minimal code changes when the source or target schema changes, while still using the power of SnowSQL for all of the loading, transformation, aggregation and overall processing.
  • Automate all console-related tasks via automation scripts
  • Build a robust audit, balance and control framework that traces not only data lineage and data quality but also lets you weigh database performance optimization against compute cost in credits (a credit-monitoring sketch follows this list)
  • Start with small node clusters and scale up as needed.
  • Make zero-copy clones to create testing or validation databases and avoid data duplication (see the clone sketch after this list).
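
To make a few of the bullets above concrete, here is a minimal warehouse-separation sketch (warehouse names and sizes are illustrative, and multi-cluster warehouses assume an edition that supports them): separate warehouses for loading and for data mart consumption, started small and resized only when monitoring shows the need:

    -- Dedicated warehouse for loading and staging workloads
    CREATE WAREHOUSE IF NOT EXISTS load_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE;

    -- Dedicated multi-cluster warehouse for data mart consumption
    CREATE WAREHOUSE IF NOT EXISTS mart_wh
      WAREHOUSE_SIZE = 'SMALL'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 3
      SCALING_POLICY = 'STANDARD'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE;

    -- Start small and scale up only when the workload justifies it
    ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'SMALL';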
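
For the semi-structured data point, a sketch of keeping JSON in a VARIANT column and querying it in place (object and attribute names are made up for illustration):

    -- Store the JSON documents as-is in a VARIANT column
    CREATE TABLE IF NOT EXISTS staging.events_raw (event VARIANT);

    COPY INTO staging.events_raw
      FROM @staging.events_stage
      FILE_FORMAT = (TYPE = 'JSON');

    -- Query nested attributes directly, without flattening first
    SELECT
        event:device:os::STRING AS device_os,
        event:user_id::NUMBER   AS user_id,
        COUNT(*)                AS event_count
    FROM staging.events_raw
    GROUP BY 1, 2;

    -- Flatten only where a relational shape is genuinely needed
    SELECT e.event:user_id::NUMBER AS user_id, f.value::STRING AS tag
    FROM staging.events_raw e,
         LATERAL FLATTEN(INPUT => e.event:tags) f;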
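
For the small-chunks loading point, a sketch (file, stage and table names are illustrative): split large extracts into many compressed files so COPY INTO can load them in parallel, rather than pushing one giant file through a single thread. Note that PUT is a SnowSQL client command, not something run in the web UI:

    -- Upload many smaller compressed files to the stage
    PUT file:///data/telemetry/telemetry_2019_*.csv.gz @staging.telemetry_stage;

    -- COPY INTO loads all matching files in parallel
    COPY INTO staging.telemetry_raw
      FROM @staging.telemetry_stage
      PATTERN = '.*telemetry_2019_.*[.]csv[.]gz'
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 COMPRESSION = 'GZIP');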
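
For the clustering and micro-partition pruning points, a sketch (table and column names are illustrative): check how well the table is already clustered before defining a clustering key, and filter on the chronologically loaded date column so pruning can kick in:

    -- Inspect natural clustering on the date column first
    SELECT SYSTEM$CLUSTERING_INFORMATION('core.events', '(event_date)');

    -- Define a clustering key only if the analysis justifies it
    ALTER TABLE core.events CLUSTER BY (event_date);

    -- A filter on the date column lets Snowflake prune micro-partitions
    -- instead of scanning 10 years of history
    SELECT COUNT(*)
    FROM core.events
    WHERE event_date >= DATEADD(month, -1, CURRENT_DATE());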
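
For the audit and cost-control point, a sketch of tracking compute cost in credits per warehouse (this uses the standard SNOWFLAKE.ACCOUNT_USAGE share, which has some reporting latency; the time window is illustrative):

    -- Credits consumed per warehouse per day over the last 7 days
    SELECT warehouse_name,
           DATE_TRUNC('day', start_time) AS usage_day,
           SUM(credits_used)             AS credits
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY 1, 2;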
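
Finally, for the zero-copy clone point, a sketch (database names are illustrative): a clone shares the underlying micro-partitions with its source, so a full test copy costs no extra storage until the data diverges:

    -- Instant, storage-free copy of production for testing or validation
    CREATE DATABASE edw_test CLONE edw_prod;

    -- Clones can also be taken from a point in time using Time Travel
    CREATE DATABASE edw_validation CLONE edw_prod
      AT (OFFSET => -3600);   -- state as of one hour ago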

Snowflake is a powerful database with unique features that allow for quick implementation of data analytics projects, but like all other databases it requires careful design. At Capgemini we have developed unique design approaches and robust frameworks that take advantage of Snowflake's capabilities to build robust, scalable, low-TCO data analytics platforms that solve critical business challenges.

Hello Mainak, I am trying to find out the credits consumed by tasks by joining task history, query history and warehouse metering history, but I am getting too many rows. I think I am missing something here. Could you help me? Thanks, Nagaraj

What is the best practice for designing the data model? Flat denormalized, star schema, or OLAP?

I am wondering how to differentiate between when you should extract or not!

It would be interesting to see some tests around JSON in a VARIANT column outperforming the same contents flattened out into relational tables.
