Tuesday, 26 November 2024

Snowflake Architecture & Caching Mechanisms

Snowflake combines shared-disk and shared-nothing database architectures, exploiting the scalability and performance of a shared-nothing architecture and the data-management simplicity of a shared-disk architecture. It uses a central data repository to persist the data, which is accessible from all compute nodes, similar to a shared-disk architecture. At the same time, Snowflake processes queries using MPP (massively parallel processing) compute clusters, where each node in the cluster persists part of the dataset locally, similar to shared-nothing architectures.

Shared-Disk Architecture:

One of the early scaling approaches, designed to keep the data in a central storage location accessible by all the database cluster nodes. Because all modifications are written to a shared disk, every cluster node sees a consistent view of the data. Example: Oracle RAC.

 

Figure 1: shared-disk architecture

Shared-Nothing Architecture:

In a shared-nothing architecture, compute and storage are scaled together, and, as the name suggests, each node shares no resources with the other nodes, i.e. memory, CPU, or storage. In this architecture, data needs to be shuffled between nodes, which adds overhead.

Figure 2: shared-nothing architecture

The Snowflake Architecture:

An entirely new, modern data platform built for the cloud that allows multiple users to concurrently share the data. This design physically separates but logically integrates storage and compute, while also providing services for data security and data management.

Snowflake’s unique hybrid-model architecture consists of three key layers:

  • Cloud Services
  • Query Processing
  • Database Storage

And three Snowflake caches:

  • Results Cache
  • Metadata Storage Cache
  • Virtual Warehouse Cache

Figure 3: Snowflake’s hybrid columnar architecture

Cloud Services Layer:

Also known as the global services layer or the brain of Snowflake, this is a collection of services that coordinate activities such as authentication, access control, and encryption. It also includes management functions for handling infrastructure and metadata, as well as query parsing and optimization. The cloud services layer is what enables the SQL client interface for Data Definition Language (DDL) and Data Manipulation Language (DML) operations on data.
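
For illustration, here is a minimal pair of statements a SQL client might send, assuming a hypothetical table named sales_raw; the DDL statement is handled by the cloud services layer (parsing, optimization, access control), while the DML statement is additionally dispatched to a virtual warehouse for execution.

    -- DDL: parsed, optimized, and access-checked by the cloud services layer
    CREATE TABLE sales_raw (id NUMBER, amount NUMBER, sold_at TIMESTAMP_NTZ);

    -- DML: coordinated by cloud services, executed on a virtual warehouse
    INSERT INTO sales_raw (id, amount, sold_at) VALUES (1, 99.95, CURRENT_TIMESTAMP());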

The Query Processing (Virtual Warehouse) Compute Layer:

A Snowflake compute cluster, often known as a virtual warehouse, is a dynamic cluster of compute resources consisting of CPU, memory, and temporary storage. In Snowflake, compute is separate from storage, which means any virtual warehouse can access the same data as another without any impact on the performance of the other warehouses. A Snowflake virtual warehouse can be scaled up by resizing the warehouse and scaled out by adding clusters to the warehouse.
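
As a rough sketch (the warehouse name and sizes are illustrative, and multi-cluster warehouses require Enterprise edition or above), scaling up and scaling out look like this:

    -- Scale up: resize the warehouse to handle heavier individual queries
    ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

    -- Scale out: allow a multi-cluster warehouse to add clusters under high concurrency
    ALTER WAREHOUSE analytics_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3;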

Centralized (Hybrid Columnar) Database Storage Layer:

Data loaded into Snowflake is optimally reorganized into a compressed, columnar format and automatically stored in micro-partitions. Snowflake’s underlying file system is implemented on Amazon, Microsoft, or Google cloud storage. There are two unique features in the storage layer architecture, illustrated in the short sketch after this list:

  • Time Travel: the ability to restore a previous version of a database, table, or schema
  • Zero-Copy Cloning: a mechanism to snapshot a Snowflake database, schema, or table along with its associated data, without additional storage cost
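
A minimal sketch of both features, assuming the same hypothetical sales_raw table and the default one-day Time Travel retention:

    -- Time Travel: query the table as it existed one hour (3600 seconds) ago
    SELECT * FROM sales_raw AT (OFFSET => -3600);

    -- Time Travel: recover a dropped table while it is still within the retention window
    UNDROP TABLE sales_raw;

    -- Zero-Copy Cloning: snapshot the table without duplicating its micro-partitions
    CREATE TABLE sales_raw_clone CLONE sales_raw;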

Snowflake also uses three different caching mechanisms to optimize query processing.

Query Result Cache:

The results of a Snowflake query are persisted for 24 hours. The 24-hour clock is reset every time the query is re-executed, up to a maximum of 31 days from the time the query was first executed, after which the results are purged. The result cache is fully managed by the Snowflake global cloud services (GCS) layer and is available across all virtual warehouses. The process of retrieving a cached result is managed by GCS, and if the result size exceeds a certain threshold, the results are stored in and retrieved from cloud storage. The query result cache can be enabled or disabled, as shown in the example below.
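
For example, reuse of cached results can be toggled with a session-level parameter; this is often turned off when benchmarking warehouse performance:

    -- Disable reuse of cached query results for the current session
    ALTER SESSION SET USE_CACHED_RESULT = FALSE;

    -- Re-enable it (the default behaviour)
    ALTER SESSION SET USE_CACHED_RESULT = TRUE;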

Metadata Cache:

The information stored in the metadata cache is used to build the query execution plan and is fully managed by the global services layer. Snowflake collects metadata about tables, micro-partitions, and clustering. It also stores the row count, table size in bytes, file references, table version, range of values (MIN and MAX), the NULL count, and the number of distinct values. As a result, queries like SELECT COUNT(*) on a table do not require a virtual warehouse at all; they are answered by cloud services alone, as in the example below.
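
As a small illustration (again using the hypothetical sales_raw table), queries like the following can typically be answered from metadata by the cloud services layer, without a running virtual warehouse:

    -- Row count is served from micro-partition metadata
    SELECT COUNT(*) FROM sales_raw;

    -- MIN/MAX range values are also tracked per micro-partition
    SELECT MIN(sold_at), MAX(sold_at) FROM sales_raw;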

Virtual Warehouse Local Disk Cache:

Also referred to as the raw data cache, SSD cache, or simply data cache, this cache is specific to the virtual warehouse used to process the query. The virtual warehouse data cache is limited in size and uses an LRU (Least Recently Used) eviction policy. The virtual warehouse uses its local SSD to store the micro-partitions that are pulled from the remote database storage layer when a query is processed. Whenever a virtual warehouse receives a query to execute, it scans the SSD cache first before accessing remote disk storage, making query execution faster. Hence the size of the SSD cache plays a pivotal role in query execution time and is proportional to the size of the virtual warehouse’s compute resources.

Because the SSD cache is local to the virtual warehouse, it is dropped once the virtual warehouse is suspended. Although the SSD cache is specific to a warehouse, which operates independently of other virtual warehouses, the global services layer handles overall data freshness. It does so via the query optimizer, which checks the freshness of each data segment on the assigned virtual warehouse and builds a query plan to update any stale segment by replacing it with data from remote disk storage.

This design leads to a trade-off between keeping a warehouse running to exploit the cache for better query performance and the compute cost of doing so (see the sketch below). Wherever feasible, it is advisable to assign the same virtual warehouse to users who will be accessing the same data in their queries.
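
As a sketch of that trade-off (the warehouse name and timeout are illustrative), a longer auto-suspend window keeps the SSD cache warm at the cost of extra compute credits, while an explicit suspend drops the cache immediately:

    -- Keep the warehouse (and its local SSD cache) alive for 10 minutes of idle time
    ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 600;

    -- Suspending the warehouse releases its compute nodes and drops the local data cache
    ALTER WAREHOUSE analytics_wh SUSPEND;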

 

