Does your Snowflake platform need to be optimized? A few tips from my experience

By now, many companies have bought into the magic of Snowflake's Data Cloud, and rightfully so. Snowflake is one of those software products that makes you wonder if it is too good to be true. Snowflake is easy, flexible, scalable and hassle free. In fact, it is so easy that you could start pumping in data in a matter of hours and start building your data warehouse. With more new features announced at the recent Snowflake Data Summit, Snowflake is ready to take on any data in this world. I have been working with many customers who have already implemented Snowflake, and some who are in the middle of adopting it. It is not only the ease of ingesting and transforming data but also the ease of accessing the platform, directly or through any of the BI tools, that attracts more and more users and adds to the proliferation of Snowflake within the enterprise.

While all this might sound great for the data management team, I have seen them struggle to keep up with managing the platform optimally. Usage is high with no checks and balances. Cost goes through the roof. Users are onboarded with little governance to check for appropriate access to sensitive data. Some are not taking advantage of new features and functionality that could prevent cost overruns and a suboptimal implementation of the platform. Many customers tend to replicate their existing on-premise data warehouse solutions, which are built on a completely different architecture, resulting in a platform that does not deliver value as quickly as intended. Based on my experience, the following are the key items that should be addressed in order to take full advantage of Snowflake and take it to the next level.

  1. Security: Snowflake takes security seriously and comes with quite advanced security features that are often overlooked, because Snowflake can be easily accessed over the internet (HTTPS) without requiring any network or infrastructure team support. This can lead to some amount of vulnerability and non-compliance with the company's stringent data security policies. I have customers whose Snowflake environments host critical sensitive data requiring stringent security and monitoring. So it is important to ask the following questions to understand whether your Snowflake implementation is optimal in terms of data security, compliance and governance.
  • Do you have critical sensitive data (PII/PCI) in Snowflake?
  • Are you using Snowflake-native features like PII masking, third-party encryption or tokenization to share such sensitive data securely? (A minimal policy sketch follows the key takeaways below.)
  • Are there requirements for row level access restrictions?
  • Are you accessing Snowflake via HTTPS only? Is there Direct Connect (AWS) or ExpressRoute (Azure) set up for secure data transfer between on-premise and your cloud environment?
  • Is there PrivateLink set up between your cloud environment and Snowflake for secure data transfer between the two VPCs (AWS) or VNets (Azure)?
  • Are you using federated Identity and Access Management tools like Okta or Ping for SSO authentication to Snowflake? Is Snowflake configured for user lifecycle management with Okta, Azure AD or other SCIM applications?
  • Are your critical users, such as admins (ACCOUNTADMIN, SYSADMIN, etc.), using multi-factor authentication (MFA) to log into Snowflake for increased login security?
  • Are your Snowflake-supported ELT and BI applications using OAuth to access Snowflake securely?
  • Is your Snowflake account configured with network policies so that access is allowed only from approved IP addresses?
  • Are you sharing data within and outside your company boundaries? If so, is your Snowflake configured for secure data sharing?
  • Is your Snowflake platform configured for security monitoring?
  • Role Based Access Control - RBAC in Snowflake is built on a role hierarchy and requires a careful role design strategy. Otherwise, it can become so complex and convoluted that users end up with unauthorized access privileges through multiple roles with conflicting privileges. Users in Snowflake can be assigned multiple roles but can only assume one role at a time. Is there a custom role setup that builds a set of object-based roles, which carry database object-level privileges, and persona-based roles, which represent the different business roles within your company? Object roles are granted to persona-based roles, and users are granted one or more persona-based roles depending on their access needs. For example, a user may switch to a specific role to access sensitive HR data but use another standard reporting role at other times. A minimal SQL sketch of this pattern follows the figure below.
Customer Role setup in Snowflake
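
To make the object-role/persona-role split concrete, here is a minimal sketch. The database, schema, role and user names (analytics, hr, hr_data_read, hr_analyst, jdoe) are hypothetical; your own hierarchy will be shaped by your data domains and business functions.

```sql
-- Object-based role: bundles object-level privileges on the HR schema (hypothetical names)
CREATE ROLE IF NOT EXISTS hr_data_read;
GRANT USAGE ON DATABASE analytics TO ROLE hr_data_read;
GRANT USAGE ON SCHEMA analytics.hr TO ROLE hr_data_read;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.hr TO ROLE hr_data_read;

-- Persona-based role: represents a business function and inherits object roles
CREATE ROLE IF NOT EXISTS hr_analyst;
GRANT ROLE hr_data_read TO ROLE hr_analyst;

-- Roll persona roles up to SYSADMIN so administrators keep visibility of the hierarchy
GRANT ROLE hr_analyst TO ROLE SYSADMIN;

-- Users receive persona roles, never object roles directly
GRANT ROLE hr_analyst TO USER jdoe;
```

The user would then run USE ROLE hr_analyst only when HR data is actually needed, and work under a standard reporting role the rest of the time.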

Given the complexity that a custom role hierarchy can create, it is not easy to see the relationships between a user and the privileges granted through the hierarchical roles. The following article by Venkatesh Sekar can come in really handy for capturing these relationships in Snowflake tables and ultimately in some kind of easy-to-access report.

HOW TO CAPTURE SNOWFLAKE USERS, ROLES, AND GRANTS INTO A TABLE
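
The linked article builds the report by capturing SHOW GRANTS output into tables; as a lighter-weight starting point, the ACCOUNT_USAGE views can be joined directly. The query below is only a sketch: it flattens direct user-to-role-to-privilege grants, while nested role-to-role grants still need the recursive expansion the article describes.

```sql
-- Direct grants only: which users hold which roles, and what those roles can do
SELECT
    u.grantee_name AS user_name,
    u.role         AS granted_role,
    r.privilege,
    r.granted_on   AS object_type,
    r.name         AS object_name
FROM snowflake.account_usage.grants_to_users u
JOIN snowflake.account_usage.grants_to_roles r
  ON r.grantee_name = u.role
 AND r.granted_to   = 'ROLE'
WHERE u.deleted_on IS NULL
  AND r.deleted_on IS NULL
ORDER BY user_name, granted_role, object_name;
```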

Key takeaways: Companies should invest time in understanding the different security features delivered by Snowflake and the leading practices, options and approaches to meet their data privacy requirements, network security policies, data governance needs and integration with existing security and authentication applications. Apply leading practices in privacy compliance, network security, RBAC, user onboarding and offboarding, SSO integration, secure data sharing and security monitoring to analyze the current setup, identify gaps and find ways to optimize overall security.
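
As a concrete illustration of a few of the checklist items above (dynamic masking, row-level access and network policies), here is a minimal sketch. All object names, the HR_ANALYST role and the region_map mapping table are hypothetical, and the masking and row access policies assume an Enterprise Edition account.

```sql
-- Dynamic data masking: hide SSNs from everyone except a role entitled to see HR data
CREATE MASKING POLICY ssn_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('HR_ANALYST') THEN val ELSE '***MASKED***' END;

ALTER TABLE analytics.hr.employees
  MODIFY COLUMN ssn SET MASKING POLICY ssn_mask;

-- Row-level restriction: non-admin roles see only the regions mapped to them
CREATE ROW ACCESS POLICY region_filter AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'SYSADMIN'
  OR EXISTS (SELECT 1
             FROM analytics.security.region_map m
             WHERE m.role_name = CURRENT_ROLE()
               AND m.region    = region);

ALTER TABLE analytics.sales.orders
  ADD ROW ACCESS POLICY region_filter ON (region);

-- Network policy: allow access only from the corporate egress range
CREATE NETWORK POLICY corp_only ALLOWED_IP_LIST = ('203.0.113.0/24');
ALTER ACCOUNT SET NETWORK_POLICY = corp_only;
```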

2. Usage and Cost Management: I have often talked about my guiding principles of modern data architecture in my previous articles. One of the key guiding principles is about usage and cost - "Scale DOWN is the challenge on Cloud, not Scale UP." So many times I have seen customers struggle with managing cost: administration of cost and usage is not configured for a granular showback/chargeback view; resource monitors are not set up to alert on overuse; cost and usage monitoring is always retrospective rather than current; and no auto-scaling features are used while running ELT/ETL jobs.

Snowflake comes with strong features for managing cost at a very granular level. The Snowflake account administration allows you to review and analyze credit usage at the warehouse level by date in the Billing & Usage tab. Combine this with query history (QUERY_HISTORY) to understand the rogue queries that could be using up precious credits. Furthermore, Snowflake allows you to set up Resource Monitors at the account or virtual warehouse level for active usage monitoring. If more detailed analysis is required, the Snowflake ACCOUNT_USAGE schema provides granular usage data on data transfer (DATA_TRANSFER_HISTORY), warehouse metering (WAREHOUSE_METERING_HISTORY) and total credits used daily (METERING_DAILY_HISTORY). Scott Smith from Tableau used these tables to build a powerful Tableau-based Account Usage dashboard. This monitoring dashboard allows you to monitor and forecast compute cost, credit usage and storage cost, and to identify common recurring errors that can impact performance and compute cost.

Tableau Usage Monitoring Dashboard
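
The building blocks behind a dashboard like this are plain SQL. Here is a minimal sketch; the monitor name, credit quota, warehouse name (elt_wh) and 30-day window are hypothetical, and creating resource monitors requires the ACCOUNTADMIN role.

```sql
-- Hypothetical resource monitor: notify at 80% of a monthly quota, suspend at 100%
CREATE RESOURCE MONITOR monthly_elt_quota
  WITH CREDIT_QUOTA = 500
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80  PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE elt_wh SET RESOURCE_MONITOR = monthly_elt_quota;

-- Top credit-consuming warehouses over the last 30 days
SELECT warehouse_name, SUM(credits_used) AS credits_30d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_30d DESC;

-- Long-running ("rogue") queries over the same window
SELECT query_id, user_name, warehouse_name,
       total_elapsed_time / 1000 AS elapsed_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 20;
```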

Snowflake allows virtual warehouses to be scaled up and down almost instantaneously. It is important to understand which ELT tools can take advantage of such native Snowflake scaling features so that cost can be controlled while running data pipelines. It is also quite important to understand how to configure the virtual warehouses optimally. Data pipeline job runtimes, recent warehouse usage, the data needs of specific use cases, changes in the overall transformation architecture, recent consumption patterns, onboarding of new source applications and new users, and governance over managing the warehouses should all be taken into account when configuring them.
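
A minimal configuration sketch, assuming a hypothetical ELT warehouse named elt_wh; note that the multi-cluster settings (MIN/MAX_CLUSTER_COUNT) require Enterprise Edition.

```sql
-- Hypothetical ELT warehouse: auto-scale between 1 and 3 clusters under load,
-- and suspend after 60 seconds of inactivity so idle time does not burn credits
CREATE WAREHOUSE IF NOT EXISTS elt_wh
  WAREHOUSE_SIZE      = 'MEDIUM'
  MIN_CLUSTER_COUNT   = 1
  MAX_CLUSTER_COUNT   = 3
  SCALING_POLICY      = 'STANDARD'
  AUTO_SUSPEND        = 60
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Size up temporarily for a heavy backfill, then size back down when it completes
ALTER WAREHOUSE elt_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE elt_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```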

Key takeaways: Usage and cost management is a critical component of Snowflake optimization. Active monitoring with Resource Monitors, query history monitoring and the Account Usage dashboard can help address runaway costs, rogue queries and performance bottlenecks. Experienced Snowflake data architects can analyze and identify resource utilization challenges using Snowflake-delivered features and functionality, and then help optimize overall Snowflake usage to deliver the intended value.

3. Architecture and constantly upgraded Snowflake features: Many customers who have been on Snowflake for a while built their architecture around the features available at the time. To Snowflake's credit, it is constantly innovating and coming out with new features to improve the overall analytics experience. For example, until recently dynamic data masking was not available, so several companies implemented custom solutions to secure sensitive data in Snowflake. Often these custom solutions add complexity to data processing and may increase compute, and hence credit, usage.

Another great example is the age-old problem of change data capture (CDC) for processing type 2 dimensions. Snowflake's introduction of Streams and Tasks has made change data processing very easy. A Snowflake Stream object tracks any changes to a table, including inserts, updates and deletes, and can then be consumed by other DML statements. A Snowflake Task can consume the Stream offsets with a DML statement whenever the Stream has data, loading it into target tables for further transformation.
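
A minimal sketch of the pattern, with hypothetical table, stream, task and warehouse names. For brevity it performs a type 1 style upsert and ignores hard deletes; a true type 2 dimension load would expire the old row version and insert a new one inside the same task.

```sql
-- Track changes on a hypothetical staging table
CREATE OR REPLACE STREAM orders_stream ON TABLE staging.orders;

-- The task wakes up on a schedule but only runs when the stream actually has data
CREATE OR REPLACE TASK merge_orders
  WAREHOUSE = elt_wh
  SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  MERGE INTO edw.orders tgt
  USING (
      -- standard streams emit updates as DELETE + INSERT pairs;
      -- keeping the INSERT side gives the new image of each changed row
      SELECT * FROM orders_stream WHERE metadata$action = 'INSERT'
  ) src
    ON tgt.order_id = src.order_id
  WHEN MATCHED THEN UPDATE SET tgt.status = src.status, tgt.updated_at = src.updated_at
  WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
      VALUES (src.order_id, src.status, src.updated_at);

-- Tasks are created in a suspended state and must be resumed explicitly
ALTER TASK merge_orders RESUME;
```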

Another Snowflake component, Snowpipe, can be used to ingest any data loaded into cloud storage into Snowflake in near real time. Streaming data from Kafka can be ingested using Snowpipe in a simple architecture. Combining Snowpipe with Streams and Tasks can be a real game changer for straightforward linear data pipelines. It can eliminate the complex ETL jobs otherwise created to ingest, capture changes and transform. The article ELT DATA PIPELINING IN SNOWFLAKE DATA WAREHOUSE is an excellent write-up on using Snowpipe, Streams and Tasks. What I like about using Snowpipe and Streams is that it simplifies the architecture and eliminates complex data pipelines built with external ELT tools.
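
A minimal Snowpipe sketch, assuming a hypothetical S3 bucket, an existing storage integration named s3_int and a landing table staging.orders_raw; AUTO_INGEST also requires event notifications (for example S3 to SQS) to be configured on the bucket.

```sql
-- External stage over the cloud storage location where raw files land
CREATE OR REPLACE STAGE raw_orders_stage
  URL = 's3://my-company-raw/orders/'
  STORAGE_INTEGRATION = s3_int
  FILE_FORMAT = (TYPE = 'JSON');

-- Snowpipe: copy new files into the landing table as soon as they arrive
CREATE OR REPLACE PIPE orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO staging.orders_raw
  FROM @raw_orders_stage
  FILE_FORMAT = (TYPE = 'JSON');
```

From there, a Stream on staging.orders_raw and a Task like the one sketched above complete a simple linear pipeline.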

I have written about building robust Modern Data Architecture using Snowflake in several of my previous articles.

 Best Practices for Snowflake Implementation

Fast Lane to Analytics: Rise of Replication tools and Snowflake in Modern Data Architecture

Characteristics of Modern Data Architecture and Key Guiding Principles

Here are some more Snowflake features that could be investigated further for an optimal implementation of Snowflake. It is not an exhaustive list by any means, but these are the most important ones based on my experience.

Zero copy cloning: An extremely powerful Snowflake feature that allows you to create an exact replica of a database, or a portion of it, within seconds. This allows you to clone production databases into zero-copy Dev and Test environments. There is a really great article written about the best practices for data life cycle management using zero copy cloning and access controls.
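
A minimal sketch, with hypothetical database names; the second statement assumes the source is still within its Time Travel retention window.

```sql
-- Clone production into a dev environment; no data is physically copied
-- until either side starts modifying it
CREATE DATABASE dev_analytics CLONE analytics;

-- Clones can also be taken as of a point in time using Time Travel
CREATE DATABASE dev_analytics_yday CLONE analytics
  AT (OFFSET => -86400);  -- state as of 24 hours ago
```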

Replication/Failover: Snowflake can be configured to replicate databases at a specific interval to the same or a different region, on the same or a different cloud platform, for failover (failover itself does not happen automatically). The interval can be determined by your RPO/RTO requirements. Most database objects are replicated, but non-database objects like users, roles, Snowpipes, shares and virtual warehouses are not replicated automatically and must be handled via custom Snowflake stored procedures, Streams and Tasks.
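
A minimal sketch of the database replication side, with hypothetical organization, account and database names; enabling actual failover additionally requires the failover feature (Business Critical Edition) and a corresponding ENABLE FAILOVER statement.

```sql
-- On the primary account: allow the database to be replicated to a secondary account
ALTER DATABASE analytics ENABLE REPLICATION TO ACCOUNTS myorg.dr_account;

-- On the secondary account: create the replica, then refresh it on whatever
-- schedule your RPO requires (for example, from a scheduled Task)
CREATE DATABASE analytics AS REPLICA OF myorg.primary_account.analytics;
ALTER DATABASE analytics REFRESH;
```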

Materialized Views: Improved SQL performance without manual tuning. Since the results are pre-computed, queries can be several times faster than querying the underlying table. I find them very useful when the same set of queries runs repeatedly against a very large table, especially an external table, or when queries are complex and require significant processing, particularly on semi-structured datasets. Materialized views are maintained by Snowflake and refreshed automatically.
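
A minimal sketch, assuming a hypothetical events table; note that Snowflake materialized views require Enterprise Edition and can only reference a single table.

```sql
-- Pre-aggregate a large (possibly external) events table once,
-- instead of re-scanning it for every dashboard query
CREATE MATERIALIZED VIEW analytics.web.daily_events_mv AS
SELECT event_date, event_type, COUNT(*) AS event_count
FROM analytics.web.events
GROUP BY event_date, event_type;
```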

Geospatial Data Type: Native support for geospatial features such as points, lines and polygons on the Earth's surface. It integrates with BI tools for visualization and works with spatial ETL tools for data integration. This is an extremely useful feature for use cases that require geospatial calculations, visualizations, distance-optimization algorithms and so on.
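
A minimal sketch with a hypothetical stores table: GEOGRAPHY columns can be queried with functions like ST_DISTANCE, which returns the distance between two points in metres.

```sql
-- Hypothetical table with a GEOGRAPHY column
CREATE TABLE stores (store_id INT, location GEOGRAPHY);

-- Five closest stores to a given point (longitude, latitude)
SELECT s.store_id,
       ST_DISTANCE(s.location, TO_GEOGRAPHY('POINT(-122.35 37.55)')) AS metres_away
FROM stores s
ORDER BY metres_away
LIMIT 5;
```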

Snowsight: Understand and analyze your data directly from this Snowflake UI. I find it extremely useful when analyzing data in a table or a query result. It provides interactive results and lets you build dashboards quickly. Using Snowsight for data quality analysis, usage analysis and data auditing is easy yet very powerful.

More great new features were announced at the last Data Summit. Constantly reviewing the existing architecture and finding ways to optimize the Snowflake implementation using these features, while balancing the cost of change, will keep the platform implemented optimally. You want to take advantage of new features and functionality for your existing use cases and workloads, and you may want to modify existing implementations to be more efficient. The bottom line is that Snowflake, like any other technology, needs some extra "love" to be really efficient at producing value for you.

