We have discussed our data stack incident multiple times, but have not delved into the design principles of our underlying data warehouse. This blog post will provide an overview of the structure of our data warehouse and then dive deep into its layers.
Following dbt Labs’ high-level approach to data modeling, we segment our data warehouse into staging (stg_[source]__[model name]), intermediate (int_[source]__[model name]), and marts layers (dim/fct for internal data models and insights for customer-facing data models).
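As a rough illustration, the project layout looks something like this (the specific file names below are hypothetical, aside from dim_salesforce_contacts, which we discuss later):

```
models/
├── staging/              # stg_[source]__[model name]
│   └── stg_salesforce__contacts.sql
├── intermediate/         # int_[source]__[model name]
│   └── int_salesforce__contact_activity.sql
└── marts/
    ├── internal/         # dim_ / fct_ models
    │   └── dim_salesforce_contacts.sql
    └── insights/         # customer-facing models
        └── insights_contact_engagement.sql
```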
We also adhere to certain design principles to achieve flexible yet consistent data modeling in our BI tool:
– Use pre-joined, modeled marts tables to answer most queries, and give power users the access they need to answer the rest
– Avoid displaying intermediate tables in the BI tool
– Do not over-model marts models
– Do not incorporate staging tables into pre-joined datasets in the BI tool, and avoid saving custom columns there
1. Staging models
Purpose: rename columns, cast data types, and perform basic calculations.
Staging models typically consist of one dbt model per source table. In cases where data sources have highly denormalized structures, like Stripe or Salesforce, we may combine multiple source tables into a single staging model to streamline the process.
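A minimal sketch of what a staging model can look like; the source and column names here are illustrative, not our actual schema:

```sql
-- models/staging/salesforce/stg_salesforce__contacts.sql
-- Illustrative staging model: rename columns, cast types, basic calculations.
with source as (

    select * from {{ source('salesforce', 'contact') }}

),

renamed as (

    select
        id                              as contact_id,
        account_id                      as account_id,
        lower(email)                    as email,
        first_name || ' ' || last_name  as full_name,   -- basic calculation
        cast(created_date as timestamp) as created_at   -- cast data type
    from source

)

select * from renamed
```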
2. Intermediate models
Purpose: join tables together and perform more complex calculations.
Intermediate models serve as a bridge between staging and marts layers, enabling complex calculations involving multiple tables to be performed before data is pulled into the marts layer.
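For example, an intermediate model might join two staging models and pre-compute an aggregate before anything reaches the marts layer (the model and column names below are hypothetical):

```sql
-- models/intermediate/salesforce/int_salesforce__contact_activity.sql
-- Illustrative intermediate model: join staging models and do the heavier work here.
with contacts as (

    select * from {{ ref('stg_salesforce__contacts') }}

),

tasks as (

    select * from {{ ref('stg_salesforce__tasks') }}

),

task_counts as (

    select
        contact_id,
        count(*)          as completed_task_count,
        max(completed_at) as last_task_completed_at
    from tasks
    where is_completed
    group by 1

)

select
    contacts.contact_id,
    contacts.account_id,
    contacts.email,
    coalesce(task_counts.completed_task_count, 0) as completed_task_count,
    task_counts.last_task_completed_at
from contacts
left join task_counts
    on contacts.contact_id = task_counts.contact_id
```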
3. Marts models
Purpose: surface data internally or to customers.
Our marts layer comprises customer-facing insights models and internal-facing dim/fct models. We maintain separate folders for customer-facing and internal-facing models to ensure neatness and prevent unintended data exposure.
Internal marts models are dimension or fact tables, with one model per level of granularity so that wide tables with many columns do not become confusing.
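A sketch of what an internal dimension model at a single granularity might look like, one row per account in this hypothetical case:

```sql
-- models/marts/internal/dim_accounts.sql
-- Illustrative dimension model: exactly one row per account.
with accounts as (

    select * from {{ ref('stg_salesforce__accounts') }}

),

contact_rollup as (

    select
        account_id,
        count(*) as contact_count
    from {{ ref('stg_salesforce__contacts') }}
    group by 1

)

select
    accounts.account_id,
    accounts.account_name,
    accounts.industry,
    coalesce(contact_rollup.contact_count, 0) as contact_count
from accounts
left join contact_rollup
    on accounts.account_id = contact_rollup.account_id
```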
Our data warehouse is structured to support efficient data modeling and analysis, keeping data organized and accessible for both internal and external users, and for now this approach is effective for us.
Have you ever found yourself constantly doing ad-hoc joins in your data warehouse? We certainly did, which is why we made the decision to create a dim_salesforce_contacts
model. This allowed us to include it in our pre-joined datasets, saving us time and effort.
This approach has helped us avoid unnecessary modeling: duplicate dbt marts models that are nothing more than a SELECT *
from a staging model, and countless dbt calculations that may never be used. Most importantly, it has kept our data team from becoming a bottleneck.
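A hedged sketch of what such a pre-joined model could look like; the joined columns are illustrative:

```sql
-- models/marts/internal/dim_salesforce_contacts.sql
-- Illustrative pre-joined contacts dimension, so the BI tool never needs
-- an ad-hoc join back to accounts.
select
    contacts.contact_id,
    contacts.full_name,
    contacts.email,
    accounts.account_id,
    accounts.account_name,
    accounts.industry
from {{ ref('stg_salesforce__contacts') }} as contacts
left join {{ ref('stg_salesforce__accounts') }} as accounts
    on contacts.account_id = accounts.account_id
```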
🧠 Closing thoughts
While this method works well for us as a smaller company, it can also scale effectively:
– You can analyze database logs to see which staging tables your BI tool queries most frequently
– You can use the same logs to identify underutilized columns and models and trim them down
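As one possible starting point, assuming a Snowflake warehouse and a dedicated BI service user (both assumptions on our part; other warehouses expose similar query logs), a query like this can surface which staging models the BI tool touches:

```sql
-- Hypothetical usage check against Snowflake's query log; adjust the user
-- name, lookback window, and naming pattern to your own setup.
select
    regexp_substr(query_text, 'stg_[a-z0-9_]+', 1, 1, 'i') as staging_model,
    count(*)                                               as query_count
from snowflake.account_usage.query_history
where user_name = 'BI_TOOL_USER'   -- queries issued by the BI tool
  and start_time >= dateadd('day', -90, current_timestamp())
  and query_text ilike '%stg_%'
group by 1
order by query_count desc;
```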
We hope this post has offered you valuable insights into our data warehouse structure and how we leverage it within our BI tool.