Ecommerce Data Warehouse on Amazon Redshift Serverless: A Production-Grade Star Schema with Historical Accuracy

Terraform, AWS Glue, dbt Core, and a Medallion Architecture for scalable ecommerce analytics

TL;DR

I designed and implemented a production-ready Ecommerce Data Warehouse on Amazon Redshift Serverless using:

  • Amazon S3 as raw/staging data lake
  • AWS Glue for serverless ETL
  • Redshift Serverless as analytics engine
  • dbt Core for transformation and testing
  • Terraform for Infrastructure as Code
  • AWS Step Functions for orchestration
  • CloudWatch for monitoring

The data model is a star schema with:

  • fact_order_items at item-level grain
  • SCD Type 2 dimensions for customers and products
  • Full support for point-in-time reporting

This prevents double-counting, ensures historical accuracy, and scales with ecommerce growth.


The Problem: Ecommerce Data Breaks Easily

Revenue numbers drift. Customer metrics change retroactively. Product prices update without history.

The moment you lack historical tracking, dashboards stop reflecting reality.

The solution is not “more queries.” The solution is architecture.


Architecture Overview

This implementation follows a layered, production-ready design built for reliability, scalability, and cost control.

Storage Layer – Amazon S3

Raw ecommerce events land in S3.

  • Durable storage
  • Cheap at scale
  • Clear separation between raw and processed data

S3 acts as the immutable source of truth before transformation.
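
As an illustrative sketch only (bucket and prefix names are hypothetical, not taken from the project), raw events can be landed under a date-partitioned raw/ prefix, physically separate from processed data:

```python
import json
import uuid
from datetime import datetime, timezone

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

def land_raw_event(event: dict, bucket: str = "ecommerce-dl") -> str:
    """Write one raw ecommerce event to a date-partitioned raw/ prefix."""
    now = datetime.now(timezone.utc)
    key = (
        f"raw/orders/ingest_date={now:%Y-%m-%d}/"
        f"{now:%H%M%S}-{uuid.uuid4()}.json"
    )
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(event).encode("utf-8"))
    return key

# Example usage; processed outputs would live under a separate staging/ prefix.
land_raw_event({"order_id": 1001, "customer_id": 42, "items": [{"sku": "A-1", "qty": 2}]})
```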


ETL Layer – AWS Glue

AWS Glue performs serverless data processing:

  • Schema normalization
  • Data cleaning
  • Format standardization
  • Loading into staging tables

No cluster management. Compute runs only when needed.
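
A minimal Glue job sketch of the normalize-clean-standardize-load pattern (paths, column names, and casts are illustrative assumptions, not the actual project code):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw JSON events from the data lake's raw prefix.
raw = spark.read.json("s3://ecommerce-dl/raw/orders/")

# Schema normalization, cleaning, and format standardization.
staged = (
    raw.withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("unit_price", F.col("unit_price").cast("decimal(12,2)"))
       .dropDuplicates(["order_id", "order_item_id"])
       .filter(F.col("order_id").isNotNull())
)

# Write columnar staging output, partitioned by order date.
(staged
    .withColumn("order_date", F.to_date("order_ts"))
    .write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3://ecommerce-dl/staging/order_items/"))

job.commit()
```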


Warehouse Layer – Amazon Redshift Serverless

Redshift Serverless provides:

  • Auto-scaling compute
  • Pay-per-query pricing
  • Zero cluster maintenance

This aligns infrastructure cost directly with analytical demand.
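
Because compute is billed per query, downstream jobs can reach the warehouse through the Redshift Data API without holding connections open. A hedged sketch (workgroup, database, and table names are placeholders):

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Submit a query to the serverless workgroup; compute runs only for this work.
resp = redshift_data.execute_statement(
    WorkgroupName="ecommerce-analytics",   # placeholder workgroup name
    Database="analytics",                  # placeholder database name
    Sql="SELECT order_date, SUM(net_revenue) FROM marts.fact_order_items GROUP BY 1",
)

# A real job would poll in a loop; here we check once and fetch the rows.
desc = redshift_data.describe_statement(Id=resp["Id"])
if desc["Status"] == "FINISHED":
    rows = redshift_data.get_statement_result(Id=resp["Id"])["Records"]
```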


Transformation Layer – dbt Core

All warehouse modeling is managed in dbt Core.

Key practices implemented:

  • Incremental models
  • Data quality tests
  • Source freshness checks
  • Clear model lineage
  • Modular SQL design

Transformations are version-controlled and reproducible.
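
From the orchestration side, the dbt project can be invoked programmatically. A minimal sketch using dbt Core's documented programmatic runner (available since dbt 1.5; the selector below is illustrative, not the project's actual selection):

```python
from dbt.cli.main import dbtRunner, dbtRunnerResult  # dbt Core >= 1.5

dbt = dbtRunner()

# Build models, tests, and snapshots for the marts layer in dependency order.
result: dbtRunnerResult = dbt.invoke(["build", "--select", "marts.*"])

# Inspect per-node outcomes; a failed test or model fails the pipeline step.
for node_result in result.result:
    print(node_result.node.name, node_result.status)
```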


Orchestration – AWS Step Functions

Step Functions coordinate:

  1. Raw ingestion
  2. Glue processing
  3. dbt transformations
  4. Validation

Workflows are explicit, observable, and fault-tolerant.
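
The state machine itself is provisioned through Terraform; a scheduled trigger or backfill script only needs to start an execution. A sketch with a placeholder ARN and input payload:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Kick off the ingest -> Glue -> dbt -> validation workflow for one batch date.
execution = sfn.start_execution(
    stateMachineArn=(
        "arn:aws:states:us-east-1:123456789012:"
        "stateMachine:ecommerce-dwh-pipeline"   # placeholder ARN
    ),
    name="backfill-2024-01-15",                 # execution names must be unique
    input=json.dumps({"batch_date": "2024-01-15"}),
)

print(execution["executionArn"])
```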


Monitoring – CloudWatch

CloudWatch provides:

  • Logging
  • Alerts
  • Execution monitoring

Failures are visible. Latency is measurable. Cost spikes are traceable.
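
As one example of making failures visible (alarm name, thresholds, and the SNS topic are placeholders, not the project's actual configuration), an alarm on failed Step Functions executions can notify the team:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm whenever the pipeline state machine records a failed execution.
cloudwatch.put_metric_alarm(
    AlarmName="ecommerce-dwh-pipeline-failed",
    Namespace="AWS/States",
    MetricName="ExecutionsFailed",
    Dimensions=[{
        "Name": "StateMachineArn",
        "Value": "arn:aws:states:us-east-1:123456789012:stateMachine:ecommerce-dwh-pipeline",
    }],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-alerts"],  # placeholder topic
)
```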


Data Model: Star Schema Optimized for Analytics

The warehouse uses a dimensional model specifically optimized for analytical queries and BI performance.

Fact Table

fact_order_items – grain: one row per order item.

Why item-level grain?

  • Prevents revenue double-counting
  • Enables product-level profitability
  • Supports basket analysis
  • Preserves analytical flexibility

All metrics roll up cleanly from this base grain.
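
A tiny illustration of why the item-level grain matters (toy data, not project data): order-level and product-level totals both roll up from the same rows, so revenue is never counted twice:

```python
import pandas as pd

# Toy fact_order_items rows: one row per order item.
fact_order_items = pd.DataFrame([
    {"order_id": 1, "product_id": "A", "quantity": 2, "unit_price": 10.0},
    {"order_id": 1, "product_id": "B", "quantity": 1, "unit_price": 25.0},
    {"order_id": 2, "product_id": "A", "quantity": 3, "unit_price": 10.0},
])
fact_order_items["net_revenue"] = (
    fact_order_items["quantity"] * fact_order_items["unit_price"]
)

# Both roll-ups come from the same base grain, so totals always reconcile.
by_order = fact_order_items.groupby("order_id")["net_revenue"].sum()
by_product = fact_order_items.groupby("product_id")["net_revenue"].sum()

assert by_order.sum() == by_product.sum() == fact_order_items["net_revenue"].sum()
```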


Dimension Tables

dim_customers – SCD Type 2. Tracks historical changes to customer attributes, enabling accurate point-in-time reporting and customer lifetime value analysis.

dim_products – SCD Type 2. Captures price changes and product attribute updates, so margin and revenue are calculated using the correct historical context.

dim_date – Standard calendar dimension supporting time-series analysis.

dim_payment_methods – Supports revenue breakdown by payment channel.

dim_shipment_status – Enables operational analysis and delivery performance tracking.
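
To make the SCD Type 2 mechanics concrete, here is a minimal, framework-free sketch of how a product price change is versioned rather than overwritten. In the actual stack this lives in dbt; the column names (valid_from, valid_to, is_current) are common conventions assumed for the example:

```python
from datetime import date

# Current dimension row for a product, SCD Type 2 style.
dim_products = [
    {"product_id": "A", "unit_price": 10.0,
     "valid_from": date(2024, 1, 1), "valid_to": None, "is_current": True},
]

def apply_price_change(rows, product_id, new_price, change_date):
    """Close the open version and append a new one instead of overwriting."""
    for row in rows:
        if row["product_id"] == product_id and row["is_current"]:
            row["valid_to"] = change_date
            row["is_current"] = False
    rows.append({
        "product_id": product_id, "unit_price": new_price,
        "valid_from": change_date, "valid_to": None, "is_current": True,
    })

apply_price_change(dim_products, "A", 12.0, date(2024, 6, 1))
# Two versions now exist: history is preserved, not overwritten.
```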


Why SCD Type 2 Matters

Without SCD Type 2:

  • Historical reports shift retroactively
  • Customer segmentation becomes unreliable
  • Margin analysis loses accuracy

With SCD Type 2:

  • You can reconstruct the business at any past point in time
  • Revenue reflects the actual product price at purchase
  • Customer attributes remain historically consistent

This is critical for executive-level reporting and forecasting.
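
Point-in-time reporting then reduces to joining each fact row to the dimension version that was valid at the order date. A toy sketch continuing the price-change example above, again with assumed column names:

```python
from datetime import date

# Two SCD Type 2 versions of the same product.
dim_products = [
    {"product_id": "A", "unit_price": 10.0,
     "valid_from": date(2024, 1, 1), "valid_to": date(2024, 6, 1), "is_current": False},
    {"product_id": "A", "unit_price": 12.0,
     "valid_from": date(2024, 6, 1), "valid_to": None, "is_current": True},
]

def price_as_of(dim_rows, product_id, as_of):
    """Return the unit price that was valid for this product on a given date."""
    for row in dim_rows:
        end = row["valid_to"] or date.max
        if row["product_id"] == product_id and row["valid_from"] <= as_of < end:
            return row["unit_price"]
    raise LookupError(f"No dimension version for {product_id} on {as_of}")

# A March order is valued at the old price, a July order at the new one.
assert price_as_of(dim_products, "A", date(2024, 3, 15)) == 10.0
assert price_as_of(dim_products, "A", date(2024, 7, 1)) == 12.0
```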


Business Value

For C-Level

  • Single source of truth for revenue and customer metrics
  • Accurate historical tracking
  • Trend analysis and forecasting capability
  • Serverless cost model aligned with usage
  • Reduced operational overhead


For Technical Teams

  • Infrastructure as Code with Terraform
  • Serverless ETL via AWS Glue
  • Redshift Serverless optimization
  • dbt-based analytics engineering
  • Incremental processing patterns
  • Star schema optimized for BI tools
  • Full orchestration and monitoring stack

This is not a demo project. It reflects production architecture decisions.


Cost and Performance Strategy

Several deliberate decisions reduce cost and improve performance:

  • Incremental dbt models to avoid full recomputation
  • Item-level fact table to eliminate metric distortion
  • Serverless compute to avoid idle cluster cost
  • Separation of raw, staging, and marts
  • Query-optimized star schema design

Compute is applied where it produces business value.


What This Demonstrates as a Data Engineer

End-to-end ownership:

  • Designing storage and ingestion
  • Building ETL pipelines
  • Modeling dimensional schemas
  • Implementing SCD Type 2
  • Managing infrastructure with Terraform
  • Orchestrating workflows
  • Monitoring production systems

This is the difference between writing queries and designing systems.

I design, therefore I exist.
