Menu

Data Warehouse Modeling Star Schema vs. Snowflake Schema

اردیبهشت ۴, ۱۳۹۷ - Business Intelligence, علمی

 The First Difference: Normalization

As mentioned, normalization is a key difference between star and snowflake schemas. Regarding this, there are a couple of things to know:

Let’s move on to the second major difference between these two schemas.

The Second Difference: Query Complexity

In our first two articles, we demonstrated a query that could be used on the sales model to get the quantity of all phone-type products sold in Berlin stores in 2016.

The star schema query looks like this:

SELECT
  dim_store.store_address,
  SUM(fact_sales.quantity) AS quantity_sold
FROM
  fact_sales
  INNER JOIN dim_product ON fact_sales.product_id = dim_product.product_id
  INNER JOIN dim_time ON fact_sales.time_id = dim_time.time_id
  INNER JOIN dim_store ON fact_sales.store_id = dim_store.store_id
WHERE
  dim_time.action_year = 2016
  AND dim_store.city = 'Berlin'
  AND dim_product.product_type = 'phone'
GROUP BY
  dim_store.store_id,
  dim_store.store_address

To get the same result from the snowflake schema, we have to use this query:

SELECT
  dim_store.store_address,
  SUM(fact_sales.quantity) AS quantity_sold
FROM
  fact_sales
  INNER JOIN dim_product ON fact_sales.product_id = dim_product.product_id
  INNER JOIN dim_product_type ON dim_product.product_type_id = dim_product_type.product_type_id
  INNER JOIN dim_time ON fact_sales.time_id = dim_time.time_id
  INNER JOIN dim_year ON dim_time.year_id = dim_year.year_id
  INNER JOIN dim_store ON fact_sales.store_id = dim_store.store_id
  INNER JOIN dim_city ON dim_store.city_id = dim_city.city_id
WHERE
  dim_year.action_year = 2016
  AND dim_city.city = 'Berlin'
  AND dim_product_type.product_type_name = 'phone'
GROUP BY
  dim_store.store_id,
  dim_store.store_address

Obviously, the snowflake schema query is more complex. Because the dimension tables are normalized, we need to dig deeper to get the name of the product type and the city. We have to add another JOIN for every new level inside the same dimension.

In the star schema, we only join the fact table with those dimension tables we need. At most, we’ll have only one JOIN per dimension table. And if we’re not using a dimension table, we don’t even need to bother with it. In the snowflake schema query, we don’t know how deep we’ll have to go to get the right dimension level, so that complicates the process of writing queries.

Joining two tables takes time because of the DMBS takes longer to process the request. The dim_store and  dim_city tables are placed in close proximity in our model, but they may not be located anywhere near each other on the disk. There is a better possibility that data will be physically closer on the disk if it lives inside the same table.Basically, a query ran against a snowflake schema data mart will execute more slowly. But in most cases this won’t present a problem: it doesn’t matter much if we get the result in one millisecond or one second.

Speeding Things Up

To speed up reporting, we can:

Snowflake vs. Star Schemas: Which Should You Use?

Now that we’ve looked at theory and query speeds, let’s get right into the heart of the matter: how do you know which schema to use on any given project?

Consider using the snowflake schema:

Consider using the star schema:

Both the star schema and the snowflake schema are relational models used to organize data warehouses and/or data marts. No matter how similar they are, they demonstrate two different approaches and have their own benefits and disadvantages. Personally, I would go with the snowflake schema when implementing a data warehouse (to save storage space) and with the star schema for data marts (to make life easier for business users).

دیدگاهتان را بنویسید

نشانی ایمیل شما منتشر نخواهد شد. بخش‌های موردنیاز علامت‌گذاری شده‌اند *

Time limit is exhausted. Please reload CAPTCHA.