Amazon Redshift is a high-performance, fully managed data
warehouse optimized for analytical queries on large-scale datasets. Its core
strengths lie in its massively parallel processing (MPP) architecture and a
robust data distribution mechanism that ensures efficient query execution. This
article examines the key data distribution styles supported by Redshift—EVEN,
KEY, and ALL—and their applicability in various scenarios. Additionally, we
explore Redshift's internal architecture, which underpins its high scalability
and performance, and its slicing mechanism for parallel query execution.
1. Introduction
Data warehouses serve as the backbone of analytical workloads
that deal with hot data requiring low-latency access. They enable organizations
to analyze massive datasets for business insights through Business Intelligence
(BI) tools. Amazon Redshift is a leading solution in this space, especially for
organizations whose data platform runs on the AWS cloud, due to its scalability,
flexibility, and performance. Redshift distributes data across compute nodes
using customizable distribution styles, which directly influence query
performance and workload balancing.
This article provides a detailed exploration of Redshift’s
distribution styles—EVEN, KEY, and ALL—and explains how
these styles align with different data processing needs. We also introduce
Redshift's internal architecture, focusing on its MPP framework, node and slice
organization, and query execution processes.
Figure: AWS Redshift as a low-latency DWH in an AWS data platform
2. Distribution Styles in Amazon Redshift
Redshift uses distribution styles to determine how table
data is stored across the cluster's compute nodes. The chosen style
significantly affects query efficiency, resource utilization, and data
shuffling. Redshift also offers an AUTO style, in which it chooses a style
based on table size; the subsections below detail the three explicit styles:
EVEN, KEY, and ALL.
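For orientation, a distribution style is declared when a table is created. The sketch below builds such statements from Python. The helper `create_table_ddl` and the `sales` table with its columns are hypothetical names used only for illustration; the `DISTSTYLE` and `DISTKEY` clauses are the Redshift DDL keywords.

```python
def create_table_ddl(table, columns, diststyle, distkey=None):
    """Build a CREATE TABLE statement with a Redshift distribution clause.

    Hypothetical helper: it only formats text and does not talk to a cluster.
    """
    diststyle = diststyle.upper()
    if diststyle not in {"EVEN", "KEY", "ALL"}:
        raise ValueError(f"unsupported distribution style: {diststyle}")
    # A distribution key is required for KEY and meaningless for the others.
    if (diststyle == "KEY") != (distkey is not None):
        raise ValueError("DISTKEY must be given for KEY, and only for KEY")

    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns)
    clause = f"DISTSTYLE {diststyle}"
    if distkey is not None:
        clause += f" DISTKEY ({distkey})"
    return f"CREATE TABLE {table} ({cols}) {clause};"


# Illustrative table definition (assumed schema, not from a real system).
sales_ddl = create_table_ddl(
    "sales",
    [("sale_id", "BIGINT"), ("customer_id", "BIGINT"), ("amount", "DECIMAL(12,2)")],
    "KEY",
    distkey="customer_id",
)
print(sales_ddl)
```

The same helper with "EVEN" or "ALL" and no `distkey` produces the corresponding clause for the other two styles.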
2.1 EVEN Distribution Style
EVEN distribution spreads table data uniformly across all
slices in the cluster, without regard to content. This ensures balanced storage
and computation across slices.
Use Case: This style is optimal when:
-> No specific relationship exists between rows in a
table and other tables.
-> Data lacks a natural key suitable for distribution.
-> Queries do not involve frequent joins with other
tables.
For instance, in cases where a large fact table does not
join with a dimension table, EVEN distribution minimizes data skew and avoids
bottlenecks.
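To make the idea concrete, here is a minimal sketch of content-agnostic placement, modeled as round-robin assignment. It is a toy simulation, not Redshift's internal code; `distribute_even`, the `event_id` column, and the slice count of 4 are assumptions chosen for illustration.

```python
from itertools import cycle


def distribute_even(rows, num_slices):
    """Assign rows to slices in round-robin order, ignoring row content."""
    slices = {s: [] for s in range(num_slices)}
    for row, s in zip(rows, cycle(range(num_slices))):
        slices[s].append(row)
    return slices


# Ten hypothetical rows spread over four slices.
rows = [{"event_id": i} for i in range(10)]
even_slices = distribute_even(rows, num_slices=4)
even_counts = [len(even_slices[s]) for s in range(4)]
print(even_counts)  # [3, 3, 2, 2]
```

Slice sizes differ by at most one row regardless of the values stored, which is why this style avoids skew when no join column matters.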
2.2 KEY Distribution Style
In KEY distribution, rows are distributed based on a column
designated as the "distribution key." A hashing algorithm assigns
rows with the same key value to the same slice, ensuring the colocation of
related data.
Use Case: KEY distribution is ideal for:
-> Tables frequently joined or aggregated on the
distribution key column.
-> Reducing data shuffling during query execution.
-> Scenarios involving large fact and dimension table
joins.
For example, joining a sales fact table and a customer
dimension table on customer_id benefits from specifying customer_id as the
distribution key, improving query performance through localized data
processing.
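The colocation effect can be sketched with a small simulation. Redshift's actual hash function is internal, so `zlib.crc32` is used here purely as a stand-in; the table contents and the helper names are hypothetical.

```python
import zlib


def slice_for_key(key, num_slices):
    """Map a distribution-key value to a slice with a stable hash (stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_slices


def distribute_by_key(rows, key_column, num_slices):
    """Place each row on the slice chosen by hashing its distribution key."""
    slices = {s: [] for s in range(num_slices)}
    for row in rows:
        slices[slice_for_key(row[key_column], num_slices)].append(row)
    return slices


NUM_SLICES = 4
sales = [{"sale_id": i, "customer_id": i % 5} for i in range(20)]
customers = [{"customer_id": c, "name": f"customer-{c}"} for c in range(5)]

sales_slices = distribute_by_key(sales, "customer_id", NUM_SLICES)
customer_slices = distribute_by_key(customers, "customer_id", NUM_SLICES)


def local_join(slice_id):
    """Join sales to customers using only rows stored on one slice."""
    local_customers = {c["customer_id"]: c for c in customer_slices[slice_id]}
    return [
        (s["sale_id"], local_customers[s["customer_id"]]["name"])
        for s in sales_slices[slice_id]
    ]


joined = [pair for s in range(NUM_SLICES) for pair in local_join(s)]
print(len(joined))  # 20: every sale found its customer without leaving its slice
```

Because both tables hash the same column, every lookup in `local_join` succeeds locally; no row has to be moved between slices to complete the join.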
2.3 ALL Distribution Style
ALL distribution replicates the entire table to every compute
node. Each node holds a full copy of the table, so joins against it require no
data movement during query execution.
Use Case: This style is best suited for small,
frequently accessed tables, such as lookup tables. Typical scenarios include:
-> Small dimension tables joined with large fact tables.
-> Queries that would otherwise broadcast or redistribute the
table at run time.
Caution must be exercised when applying ALL distribution to
large tables: storage grows with the number of nodes, and every load or update
must be written to each copy, which reduces efficiency.
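The storage trade-off is simple arithmetic, sketched below. The table sizes and the 4-node cluster are made-up figures for illustration, and `all_style_footprint` is a hypothetical helper.

```python
def all_style_footprint(table_size_gb, num_nodes):
    """Total cluster storage when every node holds a full copy of the table."""
    return table_size_gb * num_nodes


# Hypothetical sizes on a 4-node cluster.
small_lookup_gb = all_style_footprint(0.5, num_nodes=4)  # 2.0 GB in total
large_fact_gb = all_style_footprint(500, num_nodes=4)    # 2000 GB in total

print(small_lookup_gb, large_fact_gb)
```

Replicating a half-gigabyte lookup table costs little, while doing the same with a 500 GB table multiplies its footprint by the node count.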
Figure: AWS Redshift Distribution Style
3. Internal Architecture of Amazon Redshift
Figure: AWS Redshift Internal Architecture
Amazon Redshift’s internal architecture is designed to
support high scalability, parallelism, and fault tolerance. It is composed of
three primary components:
3.1 Cluster Nodes
A Redshift cluster comprises a leader node and
multiple compute nodes:
Leader Node: Manages query parsing, optimization, and
coordination of execution across compute nodes. It does not store data.
Compute Nodes: Store data and execute queries. Each
compute node is divided into slices, where each slice is responsible for a
portion of the node's data and workload.
3.2 Slicing Mechanism
Each compute node is partitioned into slices. The number of
slices per node is determined by the node type and size rather than strictly by
its vCPU count. For example, a dc2.large node has 2 slices, while a dc2.8xlarge
node has 16.
Key Functions:
- Data Allocation: Data is distributed to slices based on the
distribution style (EVEN, KEY, or ALL).
- Parallel Query Execution: Queries are processed concurrently
across slices to reduce execution time.
- Load Balancing: EVEN distribution ensures that slices handle
approximately equal amounts of data, minimizing hotspots.
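A rough way to reason about slices and hotspots is sketched below: one function lays out slice ids across nodes, and another computes a skew ratio (largest slice divided by the average). Both helpers, the 2-node cluster, and the row counts are illustrative assumptions rather than Redshift internals.

```python
def slice_layout(num_nodes, slices_per_node):
    """Return {slice_id: node_id} for a cluster of identical nodes."""
    return {
        node * slices_per_node + s: node
        for node in range(num_nodes)
        for s in range(slices_per_node)
    }


def skew_ratio(rows_per_slice):
    """Largest slice divided by the average; 1.0 means perfectly balanced."""
    average = sum(rows_per_slice) / len(rows_per_slice)
    return max(rows_per_slice) / average


layout = slice_layout(num_nodes=2, slices_per_node=2)
print(layout)  # {0: 0, 1: 0, 2: 1, 3: 1}

balanced = skew_ratio([250, 250, 250, 250])  # 1.0
skewed = skew_ratio([700, 100, 100, 100])    # 2.8
print(balanced, skewed)
```

Since a parallel step finishes only when its slowest slice does, a ratio well above 1.0 suggests that one slice is doing a disproportionate share of the work.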
3.3 Massively Parallel Processing (MPP)
Redshift’s MPP framework enables distributed query
execution:
-> Queries are decomposed into steps executed in parallel
by the slices.
-> Intermediate results are exchanged between slices through
a high-speed network.
This architecture ensures efficient utilization of cluster
resources and high throughput for complex analytical queries.
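The fan-out and merge pattern described above can be sketched in a few lines. Threads stand in for slices here, and `partial_aggregate`, `run_query`, and the sample rows are hypothetical; this is a toy model of the idea, not how Redshift schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(slice_rows):
    """Step executed on one slice: aggregate only the rows stored locally."""
    total = sum(row["amount"] for row in slice_rows)
    return total, len(slice_rows)


def run_query(slices):
    """Leader-style coordination: fan out to slices, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(partial_aggregate, slices))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count  # global average from per-slice sums and counts


# Three hypothetical slices holding different numbers of rows.
slices = [
    [{"amount": 10}, {"amount": 20}],
    [{"amount": 30}],
    [{"amount": 40}, {"amount": 50}, {"amount": 60}],
]
average_amount = run_query(slices)
print(average_amount)  # 35.0
```

Each slice returns a sum and a count rather than its own average, so the merged result is correct even when slices hold different numbers of rows.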
4. Conclusion
Amazon Redshift offers a highly optimized data warehouse
solution tailored for large-scale analytics. By selecting an appropriate
distribution style—EVEN, KEY, or ALL—users can optimize query performance based
on their workload characteristics. Meanwhile, the slicing mechanism and MPP
architecture enable Redshift to handle massive datasets efficiently.
Understanding the internal architecture of Redshift,
including its leader and compute nodes, slicing mechanism, and MPP execution,
provides a foundation for designing effective data models. With these features,
organizations can leverage Redshift for scalable and high-performance data
analytics.
For more articles like this, please follow my blog, The Data Cook.