Snowflake is a combination of shared-disk and shared-nothing database architecture, exploiting the scalability and performance factors of shared-nothing architecture and data management simplicity of shared disk architecture. It uses a central data repository to persist the data, which is accessible from all the compute nodes, similar to a shared-disk architecture. At the same time, Snowflake also processes queries using MPP (massively parallel processing) compute cluster where a node in the cluster persists part of the dataset locally, similar to shared-nothing architectures.
Shared-Disk Architecture:
One of the early scaling approach, designed to keep the data
in a central storage location accessible by all the database cluster nodes. As
all the modifications are written to a shared disk, the data accessible by each
cluster node is consistently available. Example: Oracle RAC.
Figure 1: shared-disk architecture
Shared-Nothing Architecture:
In shared-nothing architecture the compute and sorage are
scaled together, and as the name suggests each node does not share any resource
with the other node i.e. memory, cpu, and storage. In this architecture data
needs to be shuffled between nodes adding to the overhead.
Figure 2: shared-nothing architecture
The Snowflake Architecture:
An entirely new modern data platform built for the cloud,
that allows multiple users to concurently share the data. This new desgin
physically separaetes but logically integrates storage and compute along with
providing service for data security and data management.
The Snowflake’s unique hybrid-model architecture consist of
the three key layers:
- Cloud
Services
- Query
Processing
- Database
Storage
And three snowflake caches:
- Results
Cache
- Metadata
Storage Cache
- Virtual
Warehouse Cache
Figure 3: Snowflake’s hybrid columnar architecture
Cloud Services Layer:
Also known as the global service layer or brain of Snowflake is
a collection of services that coordinate activities such as authentication,
access control, and encryption. It also includes management functions for
handling infrastructure and metadata, as well as performing query parsing and
optimization. The cloud service layer is what enables the SQL client interface
for Data Definition Language and Data Manipulation Language (DML) operations on
data.
The Query Processing (Virtual Warehouse) Compute Layer:
A Snowflake Compute Cluster often known as a Virtual
Warehouse is a dynamic resource of compute resources consisting of CPU, memory, and temporary storage. In Snowflake, compute is separate from storage which
means any virtual warehouse can access the same data as another, without any
impact on the performance of the other warehouses. Snowflake virtual warehouse can
be scaled up by resizing a warehouse and can be scaled out by adding clusters
to a warehouse.
Centralized (Hybrid Columnar) Database Storage Layer:
Data loaded to Snowflake is optimally reorganized into a
compressed, columnar format, and automatically stored into micro-partitions.
Snowflake’s underlying file system is implemented on Amazon, Microsoft, or
Google Cloud. There are two unique features in the storage layer architecture.
- Time
Travel: Ability to restore a previous version of database, table, or
schema
- Zero-Copy
cloning: A mechanism to snapshot a Snowflake Database, Schema, or
Table along with its associated data without additional storage cost.
Snowflake also uses 3 different caching mechanisms to
optimize the computing process.
Query Result Cache:
The results of a Snowflake query are cached for 24 hours and
then persisted. The 24-hour clock is reset every time the query is re-executed
up to a maximum of 31 days from the time the query was first executed, post which
the results are purged anyways. The result cache is fully managed by the
Snowflake global cloud service layer (GCS) and available across all virtual
warehouses. The process of retrieving the cached result is managed by GCS and if
the results size exceeds a certain threshold, the results are stored in and
retrieved from cloud storage. The query results cache can be enabled or
disabled.
Metadata Cache:
The information stored in the metadata cache is used to build
the query execution plan and is fully managed by the global service layer.
Snowflake collects metadata about tables, micro-partitions, and clustering. It
also stores row count, table size in bytes, file references, table version,
range of values in terms of MIN and MAX, the NULL count, and the number of
distinct values as a result, queries like SELCT COUNT(*) on a table, would not
require virtual compute only cloud services.
Virtual Warehouse Local Disk Cache:
Also referred to as raw data cache, SSD cache or simply data
cache, is specific to the virtual warehouse used to process the query. The
virtual warehouse data cache is limited in size and uses the LRU (Least
Recently Used) algorithm. The virtual warehouse uses the SSD to store the micro
partitions that are pulled from the remote database storage layer when a query
is processed. Whenever a virtual warehouse receives a query to execute, first
it scans the SSD cache first before accessing the remote disk storage making
the query execution faster. Hence the size of the SSD cache plays a pivotal
role in query execution time and is proportional to the size of virtual
warehouse’s compute resource. SSD cache being local to the virtual warehouse
gets dropped once the virtual warehouse is suspended. Although the SSD cache is
specific to a warehouse, which operates independently of other virtual warehouses, the global service layer handles the overall data freshness. And it
does so via the query optimizer which checks the freshness of each data segment
of the assigned virtual warehouse and then builds a query plan to update any
segment by replacing it with data from the remote disk storage. This design approach leads to a trade-off between keeping a warehouse running to exploit the
cache mechanism for better query performance and incurring cost. Its advisable wherever feasible assign the same virtual warehouse to users who will be
accessing the same data for their queries.
No comments:
Post a Comment