Practical Data Warehouse Design Considerations

Over on my blog at leadernode.com, I've been writing about the Single Source of Truth data warehouse we've built at Teachable. This is a repost describing the major design considerations we had to take into account, how we dealt with them, and some general practical advice.

Single Source of Truth, or SST, is a data warehouse paradigm that we use at Teachable. The idea is to have a single place for the entire organization to access data that represents our best understanding of the truth, with transparency around how metrics are derived and a discrete place for derivations to live, so that you never calculate the same metric in two different ways. It works really well, but many organizations don't have a single source of truth, so I want to talk a bit about how to build one. In this post I'll cover four major design considerations in building an SST data warehouse:

  • Cleaning messy data
  • How much raw data to store
  • Making facts easily available
  • How to store, and surface, documentation

Massaging messy data into clean data

Every production database I’ve ever encountered has had messy data in it. Sometimes you have redundancies; sometimes you have gaps; sometimes schemas have changed, or field names don’t represent what they really contain; sometimes there are incorrect values, or mixes of currencies or all kinds of other things that trip you up when you try to run what ought to be simple queries. One of the points of having a data warehouse is to clean all that stuff up. You want your single source of truth to be simple to query, and always correct. Sometimes that means completely rebuilding tables to give an accurate view.

When we began building the SST data warehouse at Teachable, one of the major problems we had was reporting from our production 'subscriptions' table: a row is inserted every time one of our customers (schools) subscribes to a different plan (plans being, e.g., Basic/Professional/Business and Monthly/Annual). The table records the start of the subscription and the end of the current billing period (so records get updated every time the subscription renews), but it has no entries before a school subscribes to a paid plan, for any gaps between plans, or after the final plan; furthermore, if a plan change will take place in the future (at the end of the current billing period), there is a future-dated record for the upcoming plan.

This is all fine for the admin application, but it's a nightmare for reporting; even simple questions like "what was the subscription state for a school on a given date?" require complex SQL to answer. This is not an unusual problem; our subscriptions fall into the broad category of state changes, where you need to be able to easily derive the state of a thing at any given point in time. The state of the subscription plan is therefore a slowly-changing dimension, or SCD. In an SCD table we keep track of the state of a thing as changes occur (in a future post, I'll describe the table in detail, including how it took several iterations to get to an efficient ETL process with no edge cases, so far).
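To make the idea concrete, here is a minimal sketch of what such an SCD table and a point-in-time lookup might look like; the table and column names (subscription_plan_scd, valid_from, valid_to) are illustrative, not our actual schema:

    -- Hypothetical SCD table: one row per school per contiguous plan interval,
    -- with explicit 'no plan' rows filling the gaps before, between, and after paid plans.
    -- subscription_plan_scd(school_id, plan_name, valid_from, valid_to)
    -- valid_to is NULL for the currently-effective row.

    -- "What plan was school 42 on as of 2019-06-30?" becomes a simple lookup:
    SELECT plan_name
    FROM subscription_plan_scd
    WHERE school_id = 42
      AND valid_from <= DATE '2019-06-30'
      AND (valid_to > DATE '2019-06-30' OR valid_to IS NULL);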

We also built a process to clean up some messy fields in a payment transaction table, where different fees and refunds were not always recorded consistently. And we built a process to calculate and make readily available our MRR (monthly recurring revenue), broken into the standard categories (new, continuing, upgraded, downgraded, reactivated, churned). Getting MRR right is somewhat complicated, and it relies heavily on the subscription plan SCD table, since you have to know the correct plan for each school at the end of each month to know what its MRR was that month.
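As a rough illustration of the categorization step (a sketch, not our actual implementation), assume a hypothetical monthly rollup built from the SCD table, monthly_plan_snapshot(school_id, month_start, mrr, had_prior_paid_plan), with one row per school per month and mrr = 0 when the school has no paid plan:

    -- Compare each school's MRR at the end of a month to the prior month
    -- to assign one of the standard MRR change categories.
    SELECT
        cur.school_id,
        CASE
            WHEN COALESCE(prev.mrr, 0) = 0 AND cur.mrr > 0 THEN
                CASE WHEN cur.had_prior_paid_plan THEN 'reactivated' ELSE 'new' END
            WHEN prev.mrr > 0 AND cur.mrr = 0 THEN 'churned'
            WHEN cur.mrr > prev.mrr THEN 'upgraded'
            WHEN cur.mrr < prev.mrr THEN 'downgraded'
            ELSE 'continuing'
        END AS mrr_category,
        cur.mrr - COALESCE(prev.mrr, 0) AS mrr_delta
    FROM monthly_plan_snapshot cur
    LEFT JOIN monthly_plan_snapshot prev
           ON prev.school_id = cur.school_id
          AND prev.month_start = DATEADD(month, -1, cur.month_start)
    WHERE cur.month_start = DATE '2019-06-01'
      AND (cur.mrr > 0 OR COALESCE(prev.mrr, 0) > 0);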

Generally, each messy table should be dealt with in its own discrete ETL process. The process itself should be easy to access, it should be obvious which process is responsible for cleaning each table, and wherever the cleaning logic isn't self-evident, the process should be well-documented. To have a successful single source of truth, it's vital that all stakeholders be able to confirm that metrics and dimensions are derived correctly: this is a shared understanding of the truth, so everyone has to agree that this is, in fact, our best understanding.

How much raw data to store in the data warehouse

When we started planning the single source of truth data warehouse at the beginning of 2019, our Redshift cluster contained copies of most of the tables in the production database for each of our applications. This added up to hundreds of tables and thousands or tens of thousands of columns. Most of these were never used in analytics. One of the first decisions I made was to vastly reduce the volume of what we were ingesting into the data warehouse from the production databases. The data team inventoried all of our analytics and BI views, documenting which fields from which tables were actually in use. As a result, we removed roughly 80-85% of the tables and fields being synced. The rule we follow now is: don't add a field until you need it (the story of how we migrated our database syncing from a 3rd-party provider to Apache Airflow, and the reason it is simple for a data analyst to add a field to the sync, is a story I may tell in another post).

In addition to synced tables from the production databases, we collect events from our applications and from some 3rd parties, and we ingest a few other 3rd-party data sources (such as the customer team's tickets from Zendesk). The general rule, though, remains: only ingest data that we know we need. It's always available somewhere, so if a new use case arises and we need to add something, we can do it; we don't try to guess what future use cases will be and ingest data 'just in case.' We are, however, developing a plan to automatically clean and load events added to our applications by the engineering team, specifically so that product managers don't have to wait on the data team to add their events to ETL before being able to analyze them.

How to make facts easily available

Broadly speaking, we have two kinds of tables that present dimensions and aggregated metrics to end users: fact tables, which have one entry per entity (an entity being, in our case, a school, a student, a school owner, etc., with a separate table for each type of entity), and ‘period’ tables, which have one entry per entity per period (typically daily or monthly).

Entity fact tables are used throughout the business by end users to do things like cohorting schools or students based on the month that they first performed some action (schools by month of first sale, or month of most recent course launch, for example), or segmenting them based on some metric (schools binned by lifetime sales). These are generally ad-hoc inquiries by the marketing, product, finance, or customer teams.
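For example, a typical cohorting query can be answered directly from the school fact table; f_school_facts, first_sale_at, and lifetime_gmv are illustrative names here rather than our exact schema:

    -- Cohort schools by the month of their first sale and summarize lifetime sales.
    SELECT
        DATE_TRUNC('month', first_sale_at) AS first_sale_month,
        COUNT(*)                           AS schools,
        SUM(lifetime_gmv)                  AS cohort_lifetime_gmv
    FROM f_school_facts
    WHERE first_sale_at IS NOT NULL
    GROUP BY 1
    ORDER BY 1;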

Metrics in ‘period’ tables have two distinct formats: values that refer to the specific period, and cumulative values through the period. We try to name columns in the period tables to make it clear what the value represents, e.g., in a daily_earnings table the gross_merchandise_value field refers to GMV earned on a specific date, whereas in daily_school_facts, lifetime_gmv refers to cumulative GMV through a specific date. The daily_school_facts table also has a column for the school’s subscription plan, which indicates the plan that was in effect at the end of the date. 
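Put side by side, the two formats look something like this (the join keys and exact column names are illustrative):

    -- Period-specific vs. cumulative values for the same underlying metric.
    SELECT
        e.school_id,
        e.calendar_date,
        e.gross_merchandise_value AS gmv_on_date,       -- earned on that date only
        f.lifetime_gmv            AS gmv_through_date,  -- cumulative through end of date
        f.subscription_plan                             -- plan in effect at end of date
    FROM daily_earnings e
    JOIN daily_school_facts f
      ON f.school_id     = e.school_id
     AND f.calendar_date = e.calendar_date
    WHERE e.calendar_date BETWEEN '2019-06-01' AND '2019-06-30';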

Naming conventions can be very important here. We group tables by kind through a prefix, and add the suffix '_facts' to tables that have cumulative values in them. The specific convention you choose isn't necessarily important, but consistency is. In our SST, tables named f_<entity>_facts contain only cumulative or as-of-now facts: lifetime metrics, and flags interpreted as 'has ever used feature X'.

An important thing to keep in mind when designing your single source of truth is that the facts needed by the business on a regular basis will not all be obvious, even to business users, until they have fact tables available to start playing with. You need to have a process for end users to request new facts, and for adding them in a timely manner. There are plenty of ways you can make that work; in our case, the open Slack channel #sst-requests is where users can surface their needs. This works well for us because it makes it easy to have a team-wide conversation about a new fact when it is requested: if someone on the data team is unclear about how to derive the new fact, or what its business use is, we can hash it out; and other business users can chime in if they think the fact should be defined differently or if they would like something similar but different. A critical part of the process is moving requests from the Slack channel to our sprint board. In general, our lead data analyst owns getting new requests onto the board and making sure that the work gets scheduled in a sprint. This whole process keeps things very transparent: anyone who requests a new fact can easily keep track of which sprint it is scheduled in, and we try to make sure to report back to the requests channel when the fact has been added to the warehouse.

Documentation

There’s no point in making a bunch of great fact tables and SCDs without documenting them; in fact, documentation is critical to the success of a single source of truth, since the whole point is to have agreement among stakeholders on the correct derivation of every fact. When we started building our SST, we wrote a Manifesto which emphasized the importance of documentation, and we spent a lot of time debating how to store and surface our SST docs. After a lot of discussion among the data team, we decided that for maximum flexibility and the ability, in future, to build new tools on top of the docs, we would store the SST documentation in a database. 

This is still an evolving part of the data store, but we currently have an implementation with a simple front-end on it for inserting and updating records. Each data source, table, metric, dimension, and derivation has an entry in the documentation database. The unique key for each object is its name; this helps ensure that metrics and dimensions always correspond to one and only one derivation, and that we never use the same name to mean two different things. The documentation database is currently exposed to end users through the BI tool; users can search by table to see all the fields in the table, or by keyword to find any record that uses the search term in its name, description, or derivation. Every derivation also includes a permalink to the specific lines of code (in github) where the metric or dimension is derived; so end users have full transparency around the meaning of a column.
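As a rough sketch of what that can look like (the schema below is an assumption for illustration, not our actual implementation), a single table keyed on object name is enough to enforce the one-name, one-derivation rule:

    -- Hypothetical documentation table; every source, table, metric, dimension,
    -- and derivation gets one row, keyed by its unique name.
    CREATE TABLE sst_docs.objects (
        name        VARCHAR(256)  NOT NULL,  -- unique key: one name, one meaning
        object_type VARCHAR(32)   NOT NULL,  -- 'source' | 'table' | 'metric' | 'dimension' | 'derivation'
        parent      VARCHAR(256),            -- e.g. the table a metric lives in
        description VARCHAR(4096),           -- plain-language definition for end users
        derivation  VARCHAR(4096),           -- how the value is derived
        code_url    VARCHAR(1024),           -- permalink to the GitHub lines that derive it
        PRIMARY KEY (name)
    );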

In the coming year one of our major projects will be building out a full-featured front end for the documentation store: one that has both an elegant interface for the data team for CRUD operations and a search-and-browse interface that will make it easier for business users to find the metrics and dimensions that they need. This will also incorporate a complete event dictionary, where product engineers can describe the events they add to the application and product managers can find them.
