In cloud data warehousing environments, the cost of operations is a critical factor. Data storage, data transfer, and computational resources all contribute to expenses. Efficient resource usage is essential to keep costs in check.
Query performance refers to the time it takes for a query to execute and return results. In traditional data warehousing systems, complex queries, especially those involving large datasets, can take a significant amount of time to complete.
Running complex and resource-intensive queries can lead to extended query execution times, increased resource consumption, and higher costs. This can be particularly challenging when dealing with large volumes of data.
Caching is like having a fast-access memory for your data. It allows you to store and retrieve frequently accessed data quickly, which can significantly speed up query performance. Think of it as keeping a smaller copy of the most important pages from a big book right next to you for quick reference.
Snowflake caching is a powerful feature that helps enhance query performance by reducing the time it takes to access frequently used data. Snowflake cache mechanism automatically accelerates query performance without requiring manual intervention. By reusing cached query results, Snowflake reduces the need for redundant query execution. This optimization not only speeds up results but also minimizes the consumption of compute resources, resulting in potential cost savings.
In Snowflake, there are three primary types of caches that enhance query performance:
These caches come pre-enabled in your Snowflake environment, ready to optimize your data processing tasks.
The Query Result Cache and Metadata Cache are integral components of the Cloud Services Layer within Snowflake’s architecture. Due to their placement in this layer, they are accessible to all virtual warehouses in your Snowflake environment. This means that these caches can serve their purpose without the need for an active virtual warehouse, offering consistent benefits across the platform.
On the other hand, the Virtual Warehouse Cache is specifically housed within the Query Processing Layer. This location makes it local to each individual virtual warehouse, ensuring that the cache’s benefits are tailored to the specific needs and workloads of that warehouse.
1. Query Result Cache:
In Snowflake, the “Result Cache” is one of the types of caching mechanisms used to improve query performance. It stores the results of previously executed queries, making them readily available for future queries.
The following conditions should be met for the query results to be reused:
For example, let’s use the following base query.
Now, let’s consider different scenarios:
Identical Query:
Re-running the exact same query as the base query will benefit from 100% cache reuse. For example:
Column Order Difference:
If you change the order of columns in the SELECT statement, even if it retrieves the same data, it will not trigger cache reuse. For example:
Table Alias Introduction:
Introducing a table alias to the query will prevent cache reuse, as the query structure is different. For example:
Lowercase Keywords:
Using lowercase keywords in the query, even if the table and column names are the same, will inhibit cache reuse due to syntax differences. For example:
Points to Keep in Mind:
2. Metadata Cache
The Metadata Cache stores metadata information about tables, views, and databases. When you query metadata about your Snowflake objects, the Metadata Cache can speed up response times by reducing the need to access the underlying metadata tables. This table information stored in metadata is referred as Metadata Cache in Snowflake.
Snowflake’s metadata stores valuable information about each table, including:
Demonstration:
Run the below SQL query which queries the total row count from a table.
The query profile will show a single node with the message METADATA-BASED RESULT indicating that the query result was returned from metadata and time taken would be only few milliseconds.
3. Virtual Warehouse Cache / Local Disk Cache
Whenever a virtual warehouse retrieves data from a table, it creates a local cache containing that data. Subsequent queries have the option to reuse this cached data rather than fetching it from the cloud storage where the table resides. This local cache, stored on the virtual warehouse’s local disk, is known as the Virtual Warehouse Cache or Local Disk Cache in Snowflake.
Accessing data from a local cache is notably more efficient than retrieving it from remote cloud storage. As a result, the warehouse cache significantly enhances the performance of queries that can leverage this cache, resulting in faster query execution and improved efficiency.
Expiry Of Virtual Warehouse:
The Virtual Warehouse Cache is automatically cleared when a virtual warehouse is suspended.
Upon resuming the virtual warehouse, the cache gradually rebuilds as queries are executed. This dynamic behavior prompts consideration: whether suspending a virtual warehouse for cost savings outweighs maintaining it running for the cache’s performance enhancement.
It’s important to note that the cache’s size is directly related to the virtual warehouse’s size. A larger virtual warehouse results in a larger cache.
Demonstration:
you initially run the query.
After the query is complete, review the query plan of the query. The query profile indicates that the Percentage of data scanned from the cache is zero indicating that the virtual warehouse cache had no data in it initially.
Subsequently, when you run the below query.
Note that the condition in the query is changed so that the query is not answered using Query Result Cache. But it is kept close enough that it can make use of the Virtual Warehouse Cache.
After the query is complete, review the query plan of the query. The query profile indicates that almost 78% of data is scanned from cache indicating that the virtual warehouse cache is reused.
Let us suspend the virtual warehouse and run a slightly different query. Note that suspending the warehouse results in the existing virtual warehouse cache being purged.
After the query is complete, review the query plan of the query. The query profile indicates that the Percentage of data scanned from the cache is zero indicating that no virtual warehouse cache is used after suspending the warehouse.
In summary, the performance of your queries depends on whether the data is cached in memory or stored in permanent storage, with cached data being retrieved much faster than data that needs to be fetched from permanent storage.