Introduction
Many organizations are migrating to or selecting Snowflake as their data platform, often to gain data lakehouse capabilities among others. This document collects tips and tricks drawn from previous Snowflake implementations.
Snowflake playing nicely with other data lakes:
- At the time of writing, Snowflake supports three cloud storage services for external stages and external tables: AWS S3, Azure Blob Storage, and GCP Cloud Storage.
- You can use an external table to point at a data lake that is not housed in the Snowflake environment while still running read-only compute against it from Snowflake (see the sketch after this list).
- When using Snowpipe to load data into Snowflake you can use serverless compute, which reduces cost because a running virtual warehouse is not required.
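As a minimal sketch of these capabilities (the bucket, storage integration, stage, table, and pipe names are all illustrative), an external stage and external table over an existing S3 data lake, plus a serverless Snowpipe load, might look like this:

    -- External stage over an existing S3 data lake (names are illustrative).
    CREATE OR REPLACE STAGE lake_stage
      URL = 's3://example-data-lake/events/'
      STORAGE_INTEGRATION = s3_lake_integration   -- assumes a pre-configured integration
      FILE_FORMAT = (TYPE = PARQUET);

    -- Read-only external table pointing at the lake; the data stays in S3.
    CREATE OR REPLACE EXTERNAL TABLE ext_events
      LOCATION = @lake_stage
      FILE_FORMAT = (TYPE = PARQUET)
      AUTO_REFRESH = FALSE;

    -- Snowpipe continuously loads into a native table (here, a single VARIANT
    -- column) using serverless compute, so no running warehouse is required.
    CREATE TABLE IF NOT EXISTS raw_events (v VARIANT);
    CREATE OR REPLACE PIPE events_pipe AUTO_INGEST = TRUE AS
      COPY INTO raw_events FROM @lake_stage;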
Data governance in Snowflake:
- There is no native lineage through the lifecycle of data from source files to stage to Snowflake. If full lineage is required, a separate process has to write this metadata out to another table (see the sketch below).
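A hedged sketch of such a process (the governance.lineage_log table is hypothetical; access to the ACCOUNT_USAGE copy history view is assumed) could copy Snowflake's load metadata into your own lineage table on a schedule:

    -- Hypothetical lineage table fed from Snowflake's own load metadata.
    CREATE TABLE IF NOT EXISTS governance.lineage_log (
      file_name      STRING,
      stage_location STRING,
      target_table   STRING,
      loaded_at      TIMESTAMP_LTZ
    );

    -- Run on a schedule (e.g., via a task) to capture file-to-table lineage.
    INSERT INTO governance.lineage_log
    SELECT file_name, stage_location, table_name, last_load_time
    FROM snowflake.account_usage.copy_history
    WHERE last_load_time > DATEADD('day', -1, CURRENT_TIMESTAMP());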
Working with files and Snowflake:
- Always split very large files into 100MB to 250MB (gzipped) chunks before ingestion. Then you can use a large warehouse to process the files in parallel.
- Snowflake supports GeoJSON but does not yet support KML, GML, or Esri formats. If you need to store those files, a conversion to GeoJSON needs to be performed first.
- Most DBMS solutions have a bulk file importer that significantly speeds up ingestion, such as the MySQL bulk loader for MySQL and MariaDB. Snowflake's equivalent is the COPY INTO command (see the sketch after this list).
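A minimal COPY INTO sketch, assuming gzipped CSV chunks already staged (the stage and table names are illustrative):

    -- Bulk-load gzipped CSV chunks; files load in parallel across the
    -- warehouse, which is why 100-250 MB chunks are recommended.
    COPY INTO raw_sales
    FROM @sales_stage/
    FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1 COMPRESSION = GZIP)
    ON_ERROR = CONTINUE;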
Performance considerations in Snowflake:
- RAM spillage from queries onto the warehouse's local SSD is not too impactful, but spilling to remote storage is, so when that happens you should investigate (typically by moving the workload to a larger warehouse).
- Reduce your use of SELECT * and choose specific columns instead. Especially without a TOP or LIMIT clause, SELECT * burns large amounts of uncached compute, which is costly.
- Build communities around data domains and design warehouses to service those communities.
- Use a combination of the query history and user/role views in the SNOWFLAKE database to identify which roles are hitting which tables and which roles should be sharing which warehouses (see the sketch after this list).
- Snowflake has "approximate" (estimation) versions of functions such as MEDIAN and COUNT DISTINCT, which significantly lower compute time at the expense of small (<2%) estimation errors. These are called "high-performing" functions.
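Two hedged examples of the points above (the tables, columns, and time window are illustrative): identifying which roles drive load on which warehouses from the query history, and using approximate aggregates in place of exact ones:

    -- Which roles are running how much work on which warehouses (last 30 days)?
    SELECT role_name,
           warehouse_name,
           COUNT(*)                AS query_count,
           SUM(total_elapsed_time) AS total_elapsed_ms
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY role_name, warehouse_name
    ORDER BY total_elapsed_ms DESC;

    -- Approximate aggregates trade a small estimation error for much less compute.
    SELECT APPROX_COUNT_DISTINCT(customer_id)   AS approx_customers,
           APPROX_PERCENTILE(order_amount, 0.5) AS approx_median_amount
    FROM sales.orders;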
Key to data performance in Snowflake (reducing compute cost):
- Flocon De Neige (FDN) is Snowflake's internal file format, which is similar to Parquet. It does not handle schema evolution well, so you will need to implement change data capture (CDC) if the logical schema changes over time; however, you get the performance and compression you would expect from Parquet.
- When you execute canned dashboards against a Snowflake warehouse, the results go into the query result cache, making subsequent runs of the dashboard "free" unless you drill down or filter through to new data.
- Snowflake also caches regularly used micro-partitions on the warehouse's local storage, which reduces the cost of subsequent queries that touch them.
- Snowflake hashes queries and stores their results in a cache for 24 hours; each re-execution of the exact same query resets the 24-hour counter. Subsequent executions incur no compute cost because the cached result is returned. If any partition the query reads is updated, the cached result is invalidated.
- Snowflake also saves cost by serving some queries from a metadata cache. Queries answered from this metadata do not require a warehouse and are therefore "free."
- JSON stored in VARIANT columns is stored as FDN and split into micro-partitions, just like other records, so you still get the expected performance.
- Snowflake uses pruning to reduce table scans by elevating some of the data, even from JSON, into the metadata layer.
- Snowflake only makes money on compute. All other costs are passthrough for storage or IOPS from the cloud provider.
- Overlapping data across micro-partitions is not in itself a problem for Snowflake; overlap among partitions that a single query could return at the same time (depth) is. When this is the case, you can create materialized views to reorder the data and increase performance.
- Cluster keys can be used to rearrange partitions into a more appropriate order for querying if there is significant depth (see the sketch after this list).
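A minimal sketch of checking clustering depth and adding a cluster key (the table and key column are illustrative):

    -- Inspect overlap/depth of micro-partitions for a candidate key.
    SELECT SYSTEM$CLUSTERING_INFORMATION('sales.orders', '(order_date)');

    -- If depth is significant, define a cluster key so Snowflake reclusters
    -- the micro-partitions in the background (note: reclustering consumes credits).
    ALTER TABLE sales.orders CLUSTER BY (order_date);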
Be careful in your architectural considerations:
- If you suspend or drop a Snowflake warehouse, all micro-partition caches on that warehouse are also destroyed, but the query result cache is not.
- Balance the warehouse auto-suspend timeout between the cost saved by not running the warehouse and the extra time taken by queries that could have used its warm cache.
- For user groups with very high volume requirements, you can create a multi-cluster warehouse that auto-scales additional clusters of the same size (see the sketch after this list).
- Group together users that need access to similar data and have similar compute complexity requirements, and assign a virtual warehouse to each group.
- Capacity storage (pre-purchased per TB) is a lot cheaper than on-demand, but with compression it is difficult to know how much you will actually need. Start with on-demand and switch to capacity storage later; otherwise you may over-commit based on a guess.
- When you clone a Snowflake database it is effectively instant because the clone is a set of virtual pointers to the existing micro-partitions. Any data subsequently added or changed in the clone is captured as an extension to it.
- For very complex business algorithms that need significant processing power and data access, it may be worth writing those algorithms as Snowflake stored procedures rather than, for example, in traditional Docker containers, so the processing runs next to the data.
- Cluster keys are a great way to rearrange data to be more performant, but keep a close eye on compute: reclustering may cost more credits than it saves in some scenarios.
- Set the scaling policy appropriately. "Economy" suits non-business-critical workloads (a cluster is only added if there is enough queued work to keep it busy for at least six minutes), while "Standard" suits business-critical workloads.
- Scaling up or scaling out should be considered in relation to compute volume requirements, frequency of load, complexity of load, and duration of workload.
- Time travel is possible with most types of tables and schemas, which allows you to see what the data looked like at a previous point in time depending on configuration.
- Snowflake works well if you plan to implement a data fabric. It supports full data pipelines from source systems through to exposing the data in marketplaces, and it has data governance capabilities built in. However, enabling full data lineage and explainability requires some extra, non-out-of-the-box work.
- There are three levels of metadata schemas: Information Schema (per schema – six months), Snowflake Database (all schemas – six months), and Account Usage Schema (all schemas – one year).
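A hedged sketch of several of the points above, with illustrative names and sizes: a multi-cluster warehouse with auto-suspend and a scaling policy, a zero-copy clone, and a Time Travel query.

    -- Multi-cluster warehouse (an Enterprise edition feature) with a short
    -- auto-suspend to balance idle cost against losing the warm cache.
    CREATE WAREHOUSE IF NOT EXISTS analytics_wh
      WAREHOUSE_SIZE    = 'MEDIUM'
      AUTO_SUSPEND      = 120          -- seconds
      AUTO_RESUME       = TRUE
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 3
      SCALING_POLICY    = 'STANDARD';  -- use 'ECONOMY' for non-critical workloads

    -- Zero-copy clone: instant, metadata-only pointers to existing partitions.
    CREATE DATABASE dev_sales CLONE prod_sales;

    -- Time Travel: query the table as it looked one hour ago.
    SELECT * FROM sales.orders AT (OFFSET => -3600);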
Clashes with traditional terminology in Snowflake:
- A Snowflake “virtual warehouse” is not a virtual version of a traditional data warehouse but rather just the compute layer (compute cluster) of one.
- Snowflake Scripting is just an SQL version of stored procedures. Stored procedures used to be available only in JavaScript.
Future Snowflake considerations and predictions:
- With FlashBlade and Dell (and others, I’m sure) you could bring Snowflake on-premises.
- Support for ESRI, GML, and KML will be coming so that the spatial queries in Snowflake become ubiquitous.
- Using data sharing allows data monetization. Snowflake gives commission to the data owner for the compute that is used.
Compliance and regulation in Snowflake:
- To comply with the GDPR right to be forgotten, you can set Time Travel to 0 for the table or the schema and then remove the personally identifiable information (PII); the new version of the record will not contain the PII (see the sketch below).
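A minimal sketch of that approach (the table, columns, and customer ID are illustrative):

    -- Disable Time Travel for the table so prior versions are not retained.
    ALTER TABLE crm.customers SET DATA_RETENTION_TIME_IN_DAYS = 0;

    -- Remove the PII; the new version of the record no longer contains it.
    UPDATE crm.customers
    SET email = NULL, phone = NULL, full_name = NULL
    WHERE customer_id = 12345;  -- the data subject's ID (illustrative)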
Choosing to process data in Snowflake:
- Stored procedures are available in native SQL (Snowflake Scripting) or JavaScript (see the sketch after this list).
- Snowpark is a product that allows you to operate in other languages such as Java and Scala.
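A hedged sketch of a stored procedure written in Snowflake Scripting (SQL); the procedure name, tables, and logic are illustrative:

    CREATE OR REPLACE PROCEDURE archive_old_orders(cutoff_days NUMBER)
    RETURNS STRING
    LANGUAGE SQL
    AS
    $$
    DECLARE
      moved NUMBER DEFAULT 0;
    BEGIN
      -- Heavy data movement runs next to the data, on Snowflake compute.
      INSERT INTO sales.orders_archive
        SELECT * FROM sales.orders
        WHERE order_date < DATEADD('day', -1 * :cutoff_days, CURRENT_DATE());
      moved := SQLROWCOUNT;
      DELETE FROM sales.orders
        WHERE order_date < DATEADD('day', -1 * :cutoff_days, CURRENT_DATE());
      RETURN 'Archived ' || moved || ' orders';
    END;
    $$;

    CALL archive_old_orders(365);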
Data marketplace:
- You can unload data to export it to Excel and other formats, but zero-copy sharing is a better way to go: the data is not duplicated and does not take on a lifecycle of its own, and you keep the processing power of Snowflake behind the data service.
- There are three models for sharing access to data in Snowflake: secure direct shares (private), Snowflake Data Exchange (an invited sub-community), and Snowflake Data Marketplace (public).
- You can use a Snowflake reader account with shares for read-only sharing of data with consumers that do not have their own Snowflake account (see the sketch after this list).
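A hedged sketch of a direct secure share and a reader account (all object names, the partner account identifier, and the password placeholder are illustrative):

    -- Package a database, schema, and secure view into a share.
    CREATE SHARE sales_share;
    GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
    GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
    GRANT SELECT ON VIEW sales_db.public.v_orders_secure TO SHARE sales_share;

    -- Share with an existing Snowflake account...
    ALTER SHARE sales_share ADD ACCOUNTS = partner_org_account;

    -- ...or create a reader account for a consumer without their own account.
    CREATE MANAGED ACCOUNT partner_reader
      ADMIN_NAME = 'reader_admin',
      ADMIN_PASSWORD = 'ChangeMe-Str0ng!',
      TYPE = READER;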
Snowflake security:
- You could use Direct Connect to connect fiber directly to the cloud provider, taking all of the network traffic off the public internet.
- Insight: Snowflake has four layers of security: access (IP- and policy-driven), authentication (MFA, SAML 2.0, OAuth, Key-Pair, SCIM, SSO), authorization (RBAC), and data protection (replication, failsafe, and encryption).
- Secure views allow you to hide all of the proprietary formulas, names, and definitions that make up that dataset.
- There is a full replication feature in Snowflake that allows near-real-time replication between accounts in the same organization, but replicating across cloud providers or regions may incur extra cost. This is useful for disaster recovery.
- Secure views and secure functions can be included in shares for data consumers or reader accounts, but the consumers do not get to see the underlying DDL.
- Permissions are set at the role level and inheritance goes up the hierarchy. Parents inherit permissions from the children, not the other way around.
- Secondary roles give an individual access to all of the roles they have been granted at the same time within the same session. A more deliberate approach is to create another role that inherits all of the roles required for the task.
- You can use ON FUTURE with a grant to ensure that all tables created in the future in a given schema or database are accessible to a certain role without relying on inheritance (see the sketch after this list).
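A minimal sketch of the secure view, role hierarchy, and future-grant points above (names are illustrative):

    -- A secure view hides the underlying DDL, formulas, and definitions.
    CREATE OR REPLACE SECURE VIEW sales_db.public.v_orders_secure AS
      SELECT order_id, region, order_amount
      FROM sales_db.public.orders;

    -- Parent roles inherit the privileges of the child roles granted to them.
    GRANT ROLE analyst_role TO ROLE analytics_lead_role;

    -- Future grant: tables created later in this schema are automatically
    -- accessible to the role, without relying on role inheritance.
    GRANT SELECT ON FUTURE TABLES IN SCHEMA sales_db.public TO ROLE analyst_role;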
Training:
- The Snowflake fundamentals course does not go into detail regarding data governance or data architecture; therefore, you should be educated in these areas in addition to fundamentals before embarking upon an enterprise Snowflake project.
Bottom Line
Snowflake can be a powerful tool if you use it for the right job and configure the environment to fit the types of data and workloads required. Removing the need for performance indexing, query optimization, and traditional DBA tasks reduces operational cost, and decoupling the warehouse (compute) layer from the storage layer reduces the overall service cost. However, a number of architectural decisions have to be made about the environment to get true ROI from the product, and a full understanding of the product is a must for successful Snowflake implementations.