Crushing Long-Running Queries in Snowflake: A Case Study with Custom Stored Procedures and Caching

Introduction

In the fast-paced world of enterprise data warehousing, query performance is paramount. Long-running queries can cripple your data pipelines, especially when accessed through APIs, leading to timeouts and frustrating delays. Additionally, processing large amounts of data in server-side memory can strain resources and degrade performance. This blog post dives into the real-world challenges we faced with these problematic queries, the various strategies we attempted, and the innovative solution we developed using Snowflake’s powerful caching mechanisms. You will learn about our journey, the breakthrough we achieved, and practical tips for implementing a similar solution.

Our Pain Points: The Timeout Trap

Our team was constantly battling long-running queries that often exceeded the timeout limits of our HTTP API calls. The consequences were severe:

  • Incomplete or Inaccurate Results: Timeouts meant we couldn’t trust the data returned by our queries.
  • Disrupted Workflows: Delays rippled through our processes, impacting downstream tasks and decision-making.
  • Frustrated Users: Slow response times led to a poor user experience and decreased confidence in our data systems.
  • Resource Wastage: Timeouts also led to resource wastage, as partial results consumed processing power without delivering value.
  • Server-Side Memory Strain: Processing large amounts of data in server-side memory increased resource usage and risked system performance degradation.

We knew we needed a more robust and reliable solution to ensure consistent query performance and eliminate the constant threat of timeouts.

The Journey to a Solution

Our initial efforts focused on traditional query optimization techniques:

  • Indexing and Query Rewriting: We meticulously reviewed and refined our query structures, added indexes where appropriate, and experimented with different query plans.
  • Hardware Scaling: Upgrading our infrastructure was not an option for us, but throwing more raw processing power at the problem is a common approach to resolving performance issues.
  • Materialized Views: While materialized views can accelerate frequently used queries, they were not suitable for our specific use cases due to the high frequency of underlying data updates and the need for near real-time results. Materialized views introduce a trade-off between query speed and data freshness that didn’t align with our requirements.

Despite these improvements, our timeout issues persisted. Long-running queries continued to disrupt our operations, prompting us to explore more innovative solutions.

A Snowflake Breakthrough

Our turning point came when we built the GET_PERSISTED_QUERY_RESULT_ID stored procedure, leveraging a Snowflake feature called the persisted query result ID. This procedure opened up a new avenue for optimization by utilizing Snowflake’s built-in caching capabilities. This section will explain how this procedure works and the caching mechanisms that significantly improved our query performance.

How It Works

Query Execution and ID Retrieval

The procedure executes your SQL query and captures essential metadata, including a unique query ID and the total number of rows returned. Here’s the SQL code for creating the stored procedure:

CREATE OR REPLACE PROCEDURE GET_PERSISTED_QUERY_RESULT_ID("SQL_QUERY" VARCHAR(16777216))
RETURNS VARIANT
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
var result = {}; // Initialize an empty object to hold the result
try {
    // Construct the SQL query to be executed
    var constructedQuery = `SELECT * FROM (${SQL_QUERY})`;
    
    // Create a statement object to execute the SQL query
    var statement = snowflake.createStatement({ sqlText: constructedQuery });
    
    // Execute the SQL query and store the result set
    var resultSet = statement.execute();
    
    // Retrieve the row count
    var rowCount = resultSet.getRowCount();

    // Always return the QUERY_ID if it exists, otherwise set to null
    result["QUERY_ID"] = resultSet.getQueryId() || null;
    
    // Set the total row count
    result["TOTALCOUNT"] = rowCount;
} catch (err) {
    // If an error occurs during query execution, store error details in the result object
    result = {
        "status": "Failed", // Status of the result
        "error_code": err.code || "Unknown", // Error code
        "error_state": err.state || "Unknown", // Error state
        "error_message": err.message || "Unknown error occurred", // Error message
        "stack_trace": err.stackTraceTxt || "No stack trace available", // Stack trace of the error
        "QUERY_ID": null, // Set QUERY_ID to null in case of failure
        "TOTALCOUNT": 0 // Default TOTALCOUNT to 0 in case of failure
    };
}
// Return the result object
return result;
$$;

Caching Magic

This query ID becomes the key to unlock Snowflake’s result cache. If an identical query is executed again within the cache retention period (typically 24 hours), Snowflake can efficiently retrieve the cached results, bypassing the need to re-execute the entire query. This powerful mechanism significantly accelerated our query performance, proving to be a game-changer in our fight against timeouts. However, it’s important to note that Snowflake will automatically invalidate cached results if the underlying data or query structure changes.
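
To make the mechanism concrete, here is a minimal sketch of the behavior this relies on (the table and column names are purely illustrative): run a query once, note its ID, then issue the identical statement again within the retention period.

-- First execution: runs on the warehouse and its result is persisted.
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;

-- Capture the ID of the query that was just executed.
SELECT LAST_QUERY_ID();

-- An identical statement issued within the retention period can be answered
-- from the result cache instead of being re-executed on the warehouse.
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;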

Implementing the Solution

Calling the Procedure

To use the GET_PERSISTED_QUERY_RESULT_ID stored procedure, you call it with a SQL query as its input parameter:

-- Define the SQL query you want to execute
SET SQL_QUERY = 'SELECT * FROM employees';

-- Call the stored procedure with the SQL query
CALL GET_PERSISTED_QUERY_RESULT_ID($SQL_QUERY);

The result returned by the procedure will be a JSON object containing the QUERY_ID and TOTALCOUNT if the query executes successfully, or error details if it fails. Here are some example results:

  • Successful response with 1,000 records:
    {
      "QUERY_ID": "01b509e1-0605-02b3-0075-9283945888fe",
      "TOTALCOUNT": 1000
    }
  • Successful response with no records:
    {
      "QUERY_ID": "01b509fd-0604-fdf5-0075-9283945a135a",
      "TOTALCOUNT": 0
    }
  • Failed response:
    {
      "QUERY_ID": null,
      "TOTALCOUNT": 0,
      "error_code": 2003,
      "error_message": "SQL compilation error:\nObject 'employees' does not exist or not authorized.",
      "error_state": "42S02",
      "stack_trace": "Statement.execute, line 11 position 30",
      "status": "Failed"
    }
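
If you want to consume these fields programmatically in the same session rather than reading the JSON by eye, one option, sketched below under the assumption that the CALL is the most recent statement in the session, is to feed LAST_QUERY_ID() into RESULT_SCAN and extract the fields from the returned VARIANT:

-- Run the procedure, then pull QUERY_ID and TOTALCOUNT out of the VARIANT it returns.
CALL GET_PERSISTED_QUERY_RESULT_ID($SQL_QUERY);

SELECT result:QUERY_ID::STRING   AS query_id,
       result:TOTALCOUNT::NUMBER AS total_count
FROM (SELECT $1 AS result FROM TABLE(RESULT_SCAN(LAST_QUERY_ID())));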

Using the Persisted Query Result ID to Get Cached Results

Once you have the query ID, you can use it to retrieve cached results from Snowflake’s result cache. Here’s an example of how to use the query ID to get cached results:

-- Use the query ID to retrieve cached results
-- (replace 'QUERY_RESULT_ID' with the actual ID returned by the procedure)
SELECT * FROM TABLE(RESULT_SCAN('QUERY_RESULT_ID'));

Batch Processing Benefits

The GET_PERSISTED_QUERY_RESULT_ID procedure is particularly beneficial for batch processing large datasets. For example, consider a scenario where you need to process 1 million records. Fetching all these records at once and keeping them in memory would be impractical. Instead, you can use the stored procedure in conjunction with a window function like ROW_NUMBER() to process the data in smaller, more manageable batches:

SET SQL_QUERY = 'SELECT ROW_NUMBER() OVER (ORDER BY UNIQUE_ID) as rowNumber, * FROM my_large_table';
CALL GET_PERSISTED_QUERY_RESULT_ID($SQL_QUERY);

-- Example result:
{
  "QUERY_ID": "01b50a80-0605-02b3-0075-92839460a922",
  "TOTALCOUNT": 4561
}
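
With the rowNumber column in place, each batch can be pulled straight from the cached result by range. A minimal sketch (substitute the QUERY_ID returned by your own call):

-- Fetch the first batch of 1,000 rows from the cached result set.
SELECT *
FROM TABLE(RESULT_SCAN('01b50a80-0605-02b3-0075-92839460a922'))
WHERE rowNumber BETWEEN 1 AND 1000;

-- Fetch the next batch.
SELECT *
FROM TABLE(RESULT_SCAN('01b50a80-0605-02b3-0075-92839460a922'))
WHERE rowNumber BETWEEN 1001 AND 2000;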

Using Python for Batch Processing

To handle large datasets efficiently, we implemented a simple Python script that processes data in manageable batches. This approach ensures that we don’t overwhelm the system’s memory and that we maintain high performance even with extensive data.

The Python function process_batches takes the service endpoint, query ID, and total record count, then makes sequential calls to the endpoint, fetching data in batches (the same pattern can be parallelized). Here’s the Python code used for this purpose:

import requests

def process_batches(query_endpoint, query_id, total_count, batch_size=1000):
    # Initialize the current row number
    current_row_number = 1
    
    # Process the data in batches
    while current_row_number <= total_count:
        # Calculate the range for the current batch
        end_row_number = min(current_row_number + batch_size - 1, total_count)
        
        # Create the query string for the current batch
        query = {
            "query": f"SELECT * FROM TABLE(RESULT_SCAN('{query_id}')) WHERE rowNumber BETWEEN {current_row_number} AND {end_row_number}"
        }
        
        # Make the POST request to the service endpoint with the query string
        batch_response = requests.post(query_endpoint, json=query)
        
        # Check if the request was successful
        if batch_response.status_code != 200:
            raise Exception(f"Failed to fetch batch data, status code: {batch_response.status_code}")
        
        # Process the batch response (replace this with actual processing logic)
        batch_data = batch_response.json()
        print(f"Processed rows {current_row_number} to {end_row_number}, data: {batch_data}")
        
        # Update the current row number for the next batch
        current_row_number += batch_size

# Example usage
service_endpoint = "https://my_snowflake_service.com/api/query"  # Replace with your actual endpoint
query_id = "01b50a5e-0605-0467-0075-9283945eec5e"  # Replace with the actual query ID
total_count = 4561  # Replace with the actual total count

process_batches(service_endpoint, query_id, total_count)

This script enhances our ability to handle large datasets by dividing the data retrieval process into smaller chunks, ensuring efficient memory usage and faster processing times. By iterating over the results in batches, we can process each subset of data independently, significantly improving the overall performance and reliability of our data pipelines. And because the results are cached, each batch is fetched from the same deterministic result set, even if the underlying data changes between fetches.

Beyond Basic Caching

While basic caching significantly improved our query performance, we explored additional advanced techniques to further optimize our processes:

  • Cache Warming: We proactively pre-loaded the cache with results from frequently executed queries, ensuring they were immediately available when needed (a sketch follows this list).
  • Partial Caching: For massive datasets, we strategically cached only the most relevant subsets or aggregations to optimize storage and retrieval.
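
As a concrete illustration of cache warming, here is a minimal sketch (the warehouse name, task name, and query are assumptions, not part of our production setup) that uses a scheduled Snowflake Task to re-run a frequently requested query so its result is already persisted before API consumers ask for it:

-- Illustrative cache-warming task; adjust the warehouse, schedule, and query to your needs.
CREATE OR REPLACE TASK WARM_SALES_SUMMARY_CACHE
  WAREHOUSE = MY_WH
  SCHEDULE = '60 MINUTE'
AS
  CALL GET_PERSISTED_QUERY_RESULT_ID('SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region');

-- Tasks are created in a suspended state and must be resumed.
ALTER TASK WARM_SALES_SUMMARY_CACHE RESUME;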

By fine-tuning our caching strategy, we achieved unprecedented levels of query performance and reliability.

Monitoring and Maintaining Cache Efficiency

The effectiveness of your caching strategy depends on active monitoring and maintenance. Here are some tips to ensure you’re getting the most out of your cached data:

  • Cache Metrics: Snowflake provides several system views and functions that reveal insights into your cached data (a monitoring sketch follows this list):
    • QUERY_HISTORY: Shows whether a query used the cache or not.
    • RESULT_CACHE_USAGE_HISTORY: Provides statistics on cache usage over time.
    • SYSTEM$RESULT_CACHE_GET_STATUS: Lets you check if a specific query result is cached.
  • Cache Invalidation: Snowflake automatically invalidates cached results when the underlying data changes. However, you can also manually clear the cache using the SYSTEM$DROP_RESULT_CACHE_ENTRIES function, e.g. SYSTEM$DROP_RESULT_CACHE_ENTRIES('01b50a5e-0605-0467-0075-9283945eec5e'). This removes the cached result associated with that query result ID and is useful for forcing a refresh when you suspect stale data.
    • Important Considerations:
      • The function returns the number of cache entries dropped (either 0 or 1).
      • Dropping a cache entry does not affect the underlying data; it only removes the cached result from memory.
      • Use this function cautiously, as clearing the cache unnecessarily can negatively impact query performance.
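
As a starting point for this kind of monitoring, the sketch below (the query text and limits are illustrative) inspects recent executions of a query through the INFORMATION_SCHEMA.QUERY_HISTORY table function; very low elapsed times combined with zero bytes scanned are a typical sign that a run was served from the result cache:

-- Review recent executions of a given query and look for cache-hit signatures.
SELECT QUERY_ID,
       START_TIME,
       TOTAL_ELAPSED_TIME,   -- milliseconds
       BYTES_SCANNED         -- 0 bytes scanned often indicates a cached result
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 100))
WHERE QUERY_TEXT ILIKE 'SELECT region, SUM(amount)%'
ORDER BY START_TIME DESC;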

Exploring Advanced Caching Techniques

Snowflake offers even more advanced caching capabilities for those who want to take their optimization to the next level:

  • Micro-Partitions: Snowflake’s micro-partition architecture can significantly improve the efficiency of cached results. By partitioning your data effectively, you can isolate changes and only invalidate the relevant micro-partitions, rather than the entire cache.
  • Materialized Views: While not strictly a caching mechanism, materialized views can serve a similar purpose by pre-computing and storing the results of complex queries. This can be particularly beneficial for dashboards and reports that require near-real-time data access.
  • Snowflake Scripting: Automate cache management tasks using Snowflake Scripting. Create scripts that periodically warm the cache for critical queries or clear out stale entries, and schedule them with Snowflake Tasks for optimal efficiency.
  • Clustered Tables: For tables with high query volumes and selective filtering, consider clustering them on the columns used in your queries. This can dramatically improve query performance, especially when combined with caching (see the sketch below).
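
To ground two of these techniques, here is a brief sketch (table, view, and column names are illustrative): a materialized view that pre-computes an expensive aggregation, and a clustering key on the columns most often used for selective filtering.

-- Pre-compute a commonly requested aggregation.
CREATE OR REPLACE MATERIALIZED VIEW SALES_BY_REGION_MV AS
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;

-- Cluster a large table on its most selective filter columns to improve pruning.
ALTER TABLE my_large_table CLUSTER BY (event_date, customer_id);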

Real-World Results

The impact was undeniable:

  • Dramatic Reduction in Timeouts: Timeouts became a rare occurrence, even for the most complex and resource-intensive queries.
  • Lightning-Fast Response Times: Users experienced significantly faster response times, leading to increased productivity and satisfaction.
  • Improved Resource Utilization: Our systems were no longer bogged down by long-running queries, allowing us to allocate resources more efficiently.

Practical Tips and Considerations

  • Analyze Query Patterns: Identify the most frequently executed queries and the tables they access. These are prime candidates for caching.
  • Profile Query Performance: Use Snowflake’s query profiler to identify bottlenecks and opportunities for optimization. Caching may not always be the best solution.
  • Experiment and Iterate: Test different caching strategies and configurations to find what works best for your specific workload.
  • Monitor and Adapt: Continuously monitor cache performance and adjust your strategy as your data and workload evolve.
  • Be Mindful of Limitations: While caching is a powerful tool, it’s not a silver bullet. Be aware of potential limitations like cache invalidation and storage costs, and tailor your strategy accordingly.

Limitations

While the GET_PERSISTED_QUERY_RESULT_ID procedure and caching mechanisms offer significant benefits, there are some limitations to consider:

  • Cache Invalidation: Cached results may become outdated if the underlying data changes frequently. Ensure proper cache invalidation strategies are in place.
  • Storage Costs: While Snowflake’s result cache itself doesn’t incur additional storage costs, implementing other caching mechanisms like materialized views can increase storage consumption. Be sure to monitor and manage your overall caching strategy, including the size and retention of cached data, to optimize costs and performance.
  • Not Always Beneficial: In scenarios with highly dynamic data or infrequent query repetition, caching may not provide significant performance gains.

Conclusion

By leveraging the GET_PERSISTED_QUERY_RESULT_ID stored procedure and implementing a thoughtful caching strategy, you can transform Snowflake into a high-performance, data-driven powerhouse. Effective caching requires continuous monitoring, optimization, and adaptation to evolving needs. With the right tools and techniques, you can unlock the full potential of your data and propel your organization towards success.
