Understanding Slowly Changing Dimensions in Data Warehousing
Written on
What Are Slowly Changing Dimensions?
Slowly Changing Dimensions (SCDs) represent the gradual changes that occur within dimension tables over time. These changes can include updates or deletions of existing records. Properly tracking these changes is essential for accurate reporting, particularly when dimension tables are associated with fact tables through foreign keys. For instance, if a record in a dimension table is altered, it’s vital to keep a history of these modifications to ensure correct historical analysis and reporting.
Fact and Dimension Tables Defined
Fact tables capture data reflecting events such as transactions at a retail store, hotel reservations, or doctor visits. In contrast, dimension tables provide context that enhances the data in fact tables. The following graphic illustrates how a fact table connects to a dimension table with various examples.
While fact tables undergo frequent updates to mirror business events, dimension tables typically experience fewer modifications. However, what occurs when a dimension does change? What if the dimension undergoes multiple changes? How can historical data be preserved?
Enter slowly changing dimensions.
SCDs form a structured approach to updating and maintaining data in dimension tables as those dimensions evolve. Various methods exist for managing these changing dimensions, commonly referred to as SCD "types." This article delves into three such methods, examining their definitions, common use cases, and implementation code.
Data architects and engineers traditionally collaborate to design, develop, and sustain a data model utilizing slowly changing dimensions. Subsequently, data scientists and analysts leverage these fact and dimension tables, integral to the star schema, to train machine learning models, populate dashboards, and facilitate various data-driven tasks. A foundational understanding of slowly changing dimensions enhances efficiency and reliability while decreasing time-to-insights across the data profession.
Understanding Slowly Changing Dimensions (SCDs)
SCDs facilitate the management and storage of data over time, enabling historical tracking within a data warehouse. Below is an overview of the six SCD types:
- Type-0: Data remains unchanged; no changes are monitored. (DO NOTHING)
- Type-1: Existing records are replaced with new data. (OVERWRITE RECORD)
- Type-2: New records are created with versioning to maintain history. (INSERT NEW ROW)
- Type-3: Both current and previous values are stored in the same record. (UPDATE WITH NEW COLUMN)
- Type-4: Separate tables are used for current and historical data. (STORE IN SEPARATE TABLE)
- Type-6: A hybrid approach that combines Types 1, 2, and 3. (COMBINE OVERWRITE, INSERT, AND UPDATE)
SCD Type 1
With SCD Type 1, if a record in a dimension table changes, the existing record is updated or overwritten using a standard MERGE statement. If no record exists, a new entry is added to the dimension table. This approach ensures that dimension table records consistently reflect the current state, but it does not maintain any historical data.
For instance, consider a grocery store tracking sold items. If an item’s details change, the existing record gets updated with the new information; if no record exists, a new entry is created.
Implementing SCD Type 1
To demonstrate SCD Type 1, we can examine the employee table, which contains essential employee information such as name, title, and contact details. The following SQL code illustrates this table's creation and initial data insertion:
CREATE OR REPLACE TABLE employees (
employee_id INT,
name VARCHAR,
title VARCHAR,
birthday DATE,
email VARCHAR,
phone VARCHAR
);
INSERT INTO employees (
employee_id,
name,
title,
birthday,
email,
phone
) VALUES
(477379, 'Emily Verplank', 'Manager', '1989-07-28', '[email protected]', '928-144-8201'),
(392005, 'Josh Murray', 'Cashier', '2002-12-11', '[email protected]', '717-304-5547');
Using SCD Type 1, the existing record would be overwritten with the most recent data. If, for instance, Emily’s phone number changes to 928–652–9704, the updated table would reflect this change.
To implement this in Snowflake, the MERGE INTO command can be used. This command allows data practitioners to provide a match key and conditions. If these conditions are met, the record can be updated; otherwise, it can be inserted, or the operation can be halted.
Before using the MERGE INTO command, we first create a staging table named stage_employees, which will hold all updated records since the last refresh of the employees table:
CREATE OR REPLACE TABLE stage_employees (
employee_id INT,
name VARCHAR,
title VARCHAR,
birthday DATE,
email VARCHAR,
phone VARCHAR
);
INSERT INTO stage_employees (
employee_id,
name,
title,
birthday,
email,
phone
) VALUES
(477379, 'Emily Verplank', 'Manager', '1989-07-28', '[email protected]', '928-652-9704');
Now we can utilize Snowflake’s MERGE functionality to “upsert” the existing record.
The key to merging data between the employees and stage_employees tables is the employee_id field. If the employee_id matches, the name, title, email, and phone attributes will be updated with the values from the stage_employees table. If the stage_employees record does not match any in the employees table, it will be inserted.
A good practice after merging data from staging tables to base tables is to truncate the staging table to ensure only newly inserted or updated records are tracked each time.
SCD Type 1 guarantees no duplicate records in the table and ensures that the data reflects the most recent state of the dimension. This is particularly beneficial for real-time dashboards and predictive modeling, where only the current state is relevant.
However, since only the latest information is stored, data practitioners cannot analyze changes in dimensions over time. For example, a data analyst would struggle to identify the revenue increase for Potato Chips after they were relocated to aisle 6 without additional information.
While SCD Type 1 simplifies current state reporting and analytics, it has limitations for historical analysis.
SCD Type 2
Although maintaining a table that only reflects the current state can be beneficial, there are instances when tracking historical changes to a dimension is either convenient or necessary. SCD Type 2 preserves historical data by adding a new row whenever a dimension changes, marking this new row as current while designating the old record as historical.
Implementing SCD Type 2
We will utilize a table similar to the previous example, but with an additional column. The is_current column will store a boolean value indicating whether the record reflects the most current value.
CREATE OR REPLACE TABLE items (
item_id INT,
name VARCHAR,
price FLOAT,
category_id INT,
placement VARCHAR,
start_date DATE,
end_date DATE
);
INSERT INTO items (
item_id,
name,
price,
category_id,
placement,
start_date,
end_date
) VALUES
(747295, 'Sports Jersey', 59.99, 743, 'Aisle 8', '2024-08-01', NULL),
(667812, 'Socks', 8.99, 156, 'Aisle 11', '2024-08-01', NULL);
This table provides information about specific items sold at a retail store, including name, price, and placement, along with a foreign key to the item's category. To implement SCD Type 2, we will perform an "upsert" using the start_date and end_date to maintain both current and historical data.
For instance, when sports jerseys are relocated to the front of the store for better visibility and their price is reduced, the existing record will be updated with an end date, and a new entry will be created.
Next, we create a staging table named stage_items to store records for implementing SCD Type 2 in the corresponding items dimension:
CREATE OR REPLACE TABLE stage_items (
item_id INT,
name VARCHAR,
price FLOAT,
category_id INT,
placement VARCHAR,
start_date DATE,
end_date DATE
);
INSERT INTO stage_items (
item_id,
name,
price,
category_id,
placement,
start_date,
end_date
) VALUES
(747295, 'Sports Jersey', 49.99, 743, 'Entry Display', '2024-08-03', NULL);
We will now use the MERGE INTO function in Snowflake to implement SCD Type 2. This process is slightly more complex than the previous example and requires careful thought.
First, we establish a match condition with the following three criteria:
- The item_id in both the items and stage_items tables must match.
- The start_date in the stage_items table must be greater than that in the items table.
- The end_date in the items table must be NULL.
If these conditions are met, the original record in the items table must be updated. The items.end_date column will no longer be NULL; it will take the value from stage_items.start_date. No logic is applied if there’s no match in this first statement.
Next, we will use another MERGE INTO statement to insert the new record. For a new record to be added, the match condition must not be satisfied.
This can be determined by checking if the item_id in both tables matches and if the items.end_date is NULL. Here’s a breakdown of this logic:
- If the item_id matches and items.end_date is NULL, there is already a current record in the items table, so no new entry should be inserted.
- If there are no matches between item_id in both tables, the match condition is not satisfied, and a new row is inserted.
- If the item_id in stage_items matches a record in items and end_date is not NULL, a new value will be added to maintain historical data while ensuring a current record exists in the items table.
The implementation using two MERGE INTO statements to first update the existing record and then insert the most current data is as follows:
MERGE INTO items USING stage_items
ON items.item_id = stage_items.item_id
AND items.start_date < stage_items.start_date
AND items.end_date IS NULL
WHEN MATCHED
THEN UPDATE SET
items.name = stage_items.name,
items.price = stage_items.price,
items.category_id = stage_items.category_id,
items.placement = stage_items.placement,
items.start_date = items.start_date,
items.end_date = stage_items.start_date;
MERGE INTO items USING stage_items
ON items.item_id = stage_items.item_id
AND items.end_date IS NULLWHEN NOT MATCHED THEN INSERT (
item_id,
name,
price,
category_id,
placement,
start_date,
end_date
) VALUES (
stage_items.item_id,
stage_items.name,
stage_items.price,
stage_items.category_id,
stage_items.placement,
stage_items.start_date,
NULL
);
SCD Type 3
When dealing with data that is anticipated to change infrequently or when only the most recent historical record is of interest, SCD Type 3 is particularly advantageous. Instead of “upserting” a modified dimension or adding a new row for the change, SCD Type 3 utilizes a column to document the modification. Let’s explore this concept with an example.
Implementing SCD Type 3
In this scenario, the discounts table contains information about various discounts available to customers during checkout. This table includes the discount ID, name, the percentage off, and a classification indicating whether the discount is seasonal. Below are two potential records in the discounts table:
CREATE TABLE discounts (
discount_id INTEGER,
name VARCHAR,
is_seasonal BOOLEAN,
percent_off INTEGER,
previous_percent_off INTEGER
);
INSERT INTO discounts (
discount_id,
name,
is_seasonal,
percent_off,
previous_percent_off
) VALUES
(994863, 'Rewards Member', FALSE, 10, NULL),
(467782, 'Employee Discount', FALSE, 50, NULL);
Since the retailer does not expect frequent changes to discounts, this dimension is well-suited for a Type 3 implementation. When the percentage off changes, the previous value shifts to the previous_percent_off column, while the new value occupies the percent_off column.
This approach allows for the retention of historical data while showcasing the latest value in the percent_off column.
To implement this in Snowflake, we create a stage_discounts table and insert a single record with the new percent_off value:
CREATE TABLE stage_discounts (
discount_id INTEGER,
name VARCHAR,
is_seasonal BOOLEAN,
percent_off INTEGER
);
INSERT INTO stage_discounts (
discount_id,
name,
is_seasonal,
percent_off
) VALUES
(467782, 'Rewards Member', FALSE, 35);
We again utilize MERGE INTO to implement SCD Type 3. The match condition is straightforward: if the discount_id in the discounts and stage_discounts tables match and the percent_off values differ, the existing record in the discounts table will be updated. The previous percent_off value will be assigned to the previous_percent_off field, and if the discount_id does not match, a new record will be inserted with a value of NULL for historical tracking.
MERGE INTO discounts USING stage_discounts
ON discounts.discount_id = stage_discounts.discount_id
WHEN MATCHED
AND discounts.percent_off <> stage_discounts.percent_offTHEN UPDATE SET
discounts.previous_percent_off = discounts.percent_off,
discounts.percent_off = stage_discounts.percent_off
WHEN NOT MATCHED
THEN INSERT (
discount_id,
name,
is_seasonal,
percent_off,
previous_percent_off
) VALUES (
stage_discounts.discount_id,
stage_discounts.name,
stage_discounts.is_seasonal,
stage_discounts.percent_off,
NULL
);
SCD Type 3 is best suited for data that seldom changes, requiring only the most recent historical entry to be preserved. If multiple changes to the dimension are expected, SCD Type 2 is typically a better choice.
Challenges When Implementing SCD
Duplicate Data
When applying any technique for slowly changing dimensions, it’s crucial to consider the potential for duplicate data. Two types of duplicates should be monitored: intra-batch and inter-batch duplicates.
Intra-batch Duplicates
Intra-batch duplicates occur between different data batches. If a dimension table already exists, and two files intended to update this table contain duplicate records, handling this issue is essential.
To manage this, it’s vital to add constraints to the logic that is "upserting" and/or loading data into a dimension table. In the examples provided earlier, we included logic to prevent duplicates, such as:
- Inserting data only with SCD Type 1 if a record with a matching employee_id did not exist.
- Adding conditions to our SCD Type 2 implementation to avoid re-inserting data if a current record was already present in the table.
- Verifying that percent_off values differed in the items and stage_items tables before updating a record.
Inter-batch Duplicates
Inter-batch duplicates occur within the same batch of data. For example, if a file contains two entries meant to update a single record in a dimension table, precautions must be implemented. Similar to intra-batch duplicates, constraints should be added to the logic used for SCD implementations.
If conflicting records exist within the same file, these records must be differentiated, potentially using metadata or a source-provided timestamp.
What Type to Choose?
The selection of the appropriate type will depend entirely on the project requirements. Additionally, different SCD types can be applied to various columns within the same table.
For instance, in a table containing various data, such as an employee’s city and date of birth, we could apply Type 2 for the city and Type 1 for the date of birth.
Frequent changes to records may lead to rapid table growth. In such cases, it might not be wise to use SCD Type 2 to maintain dimensional data. Switching to SCD Type 1 or Type 3 may offer similar functionality while significantly enhancing system performance. However, the trade-off would be an incomplete representation of historical data. Collaborate with your team to weigh these trade-offs before altering your SCD implementation approach.
Conclusion
Mastering slowly changing dimensions (SCD) is a valuable skill when developing your own data model.
In this article, we covered the fundamentals of star schemas and provided an overview of SCDs. We explored SCD Types 1, 2, and 3 for preserving historical data while capturing the current state.
Utilizing Snowflake, we demonstrated the implementation of each SCD technique within a retail context. Finally, we discussed some of the technical challenges associated with implementing SCDs and strategies for addressing them.