Building a data lake with the Apache stack: Architecture, setup, and code

A modern data lake on the Apache stack gives you scalable storage, flexible processing, and open standards—without lock‑in. This guide walks through the “why,” the architecture, and the “how,” with working commands and code for each component.

Architecture explained

A data lake is more than storage—it’s an ecosystem. The Apache stack provides modular building blocks that you can compose cleanly:

  • Ingestion (Kafka/NiFi/Airflow): Bring in data from apps, devices, logs, and databases—both streaming and batch.
  • Storage (HDFS or object storage): Durable, low‑cost storage for raw, cleansed, and curated zones.
  • Processing (Spark): Transform, enrich, and aggregate data—batch and streaming.
  • Table format (Iceberg): ACID transactions, schema evolution, partitioning, and time travel across engines.
  • SQL/query (Hive/Trino): Interactive analytics with ANSI SQL and BI tool connectivity.
  • Governance (Ranger/Atlas): Access control, masking, lineage, and metadata.
  • Observability (Prometheus/Grafana): Metrics, alerts, and capacity planning.

Architecture diagram (text layout)

                +-----------------------------+
                |        Producers            |
                |  Apps | IoT | DB | Logs     |
                +-----------------------------+
                           |
                           v
+------------------+   +------------------+   +------------------+
|  Batch Ingestion |   |  Stream Ingest   |   |  Orchestration   |
|  (NiFi / Sqoop)  |   |  (Kafka topics)  |   |  (Airflow)       |
+------------------+   +------------------+   +------------------+
           \                 |                       /
            \                v                      /
             \        +------------------+        /
              \       |   Raw Zone       |       /
               \      | (HDFS/S3/GCS)    |      /
                \     +------------------+     /
                 \             |              /
                  \            v             /
                   \   +------------------+ /
                    \  | Cleansed Zone    |/
                     \ | (Parquet/ORC)    |
                      \+------------------+
                               |
                               v
                     +----------------------+
                     |   Iceberg Tables     |
                     |  (ACID, partitions)  |
                     +----------------------+
                               |
             +-----------------+------------------+
             |                                    |
             v                                    v
+----------------------+               +----------------------+
|  Spark (ETL/ML/SQL)  |               | Hive/Trino (SQL)     |
+----------------------+               +----------------------+
             |                                    |
             v                                    v
     +------------------+                 +------------------+
     | BI / Notebooks   |                 | Apps / Services  |
     +------------------+                 +------------------+

Governance: Ranger (policies), Atlas (lineage)
Observability: Prometheus/Grafana, logs
        

Why Apache for data lakes

  • Open standards: Avoid proprietary lock‑in; integrate with anything.
  • Scale & cost: Horizontal scaling with commodity hardware or cloud storage.
  • Flexibility: Handle CSV, JSON, Parquet, ORC, images, and more.
  • Interoperability: Spark + Hive + Iceberg work across engines and clouds.

Data zones and lifecycle

  • Raw zone: Immutable, source‑of‑truth dumps (CSV/JSON/Avro). No transformations.
  • Cleansed zone: Validated, standardized formats (Parquet/ORC), basic quality checks.
  • Curated zone: Analytics‑ready Iceberg tables with partitions, ACID, and governance.

Best practice: Never mutate raw; enforce contracts in cleansed; publish curated for consumption.
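To make the lifecycle concrete, here is a minimal raw-to-cleansed batch sketch. The paths (/raw/events, /cleansed/events) and the specific quality checks are illustrative assumptions, not fixed conventions:

python

# raw_to_cleansed.py - illustrative sketch: promote raw JSON into the cleansed zone as Parquet
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("RawToCleansed").getOrCreate()

# Hypothetical raw-zone path; raw data itself is never modified
raw = spark.read.json("hdfs://localhost:9000/raw/events/")

cleansed = (raw
    .dropDuplicates(["event_id"])                  # basic quality check: de-duplicate on the event key
    .filter(col("event_id").isNotNull())           # drop records that fail the contract
    .withColumn("ts", col("ts").cast("timestamp")))

# Write columnar Parquet into the cleansed zone
cleansed.write.mode("append").parquet("hdfs://localhost:9000/cleansed/events/")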

Setup: step‑by‑step with working code

1) Hadoop HDFS (storage)

  • Install Java & Hadoop:

bash

sudo apt-get update && sudo apt-get install -y openjdk-11-jdk
HADOOP_VERSION=3.3.6
wget https://downloads.apache.org/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
sudo tar -xzf hadoop-$HADOOP_VERSION.tar.gz -C /opt
sudo ln -s /opt/hadoop-$HADOOP_VERSION /opt/hadoop
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' | sudo tee -a /opt/hadoop/etc/hadoop/hadoop-env.sh
echo 'export PATH=$PATH:/opt/hadoop/bin:/opt/hadoop/sbin' | sudo tee -a /etc/profile.d/hadoop.sh
source /etc/profile.d/hadoop.sh
        

  • Configure single‑node (dev):

xml

<!-- /opt/hadoop/etc/hadoop/core-site.xml -->
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
</configuration>
        

xml

<!-- /opt/hadoop/etc/hadoop/hdfs-site.xml -->
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.namenode.name.dir</name><value>file:///var/lib/hadoop/hdfs/namenode</value></property>
  <property><name>dfs.datanode.data.dir</name><value>file:///var/lib/hadoop/hdfs/datanode</value></property>
</configuration>
        

  • Format & start:

bash

sudo mkdir -p /var/lib/hadoop/hdfs/namenode /var/lib/hadoop/hdfs/datanode
sudo chown -R "$USER" /var/lib/hadoop
hdfs namenode -format
start-dfs.sh   # requires passwordless SSH to localhost
hdfs dfs -mkdir -p /warehouse /checkpoints
        

2) Apache Kafka (stream ingestion)

  • Install & start (KRaft mode):

bash

KAFKA_VERSION=3.7.0; SCALA_VERSION=2.13
wget https://downloads.apache.org/kafka/$KAFKA_VERSION/kafka_$SCALA_VERSION-$KAFKA_VERSION.tgz
tar -xzf kafka_$SCALA_VERSION-$KAFKA_VERSION.tgz && cd kafka_$SCALA_VERSION-$KAFKA_VERSION
bin/kafka-storage.sh format -t "$(bin/kafka-storage.sh random-uuid)" -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties   # runs in the foreground; use a second terminal for the next steps
        

  • Create topic & test:

bash

bin/kafka-topics.sh --create --topic raw_events --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
bin/kafka-console-producer.sh --topic raw_events --bootstrap-server localhost:9092
# Paste JSON lines, e.g.:
# {"event_id":"e1","user_id":"u1","action":"view","ts":"2025-01-01T10:00:00Z"}
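For programmatic ingestion instead of the console producer, here is a minimal sketch using the kafka-python client; the library choice is an assumption (pip install kafka-python), and any Kafka client works the same way:

python

# produce_events.py - illustrative producer for the raw_events topic
import json
from datetime import datetime, timezone
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

event = {
    "event_id": "e3",
    "user_id": "u2",
    "action": "click",
    "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")}

producer.send("raw_events", value=event)   # topic created above
producer.flush()                           # block until the broker acknowledges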
        

3) Apache Spark (processing)

  • Install Spark:

bash

SPARK_VERSION=3.5.1
wget https://downloads.apache.org/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3.tgz
tar -xzf spark-$SPARK_VERSION-bin-hadoop3.tgz
sudo mv spark-$SPARK_VERSION-bin-hadoop3 /opt/spark
echo 'export SPARK_HOME=/opt/spark' | sudo tee -a /etc/profile.d/spark.sh
echo 'export PATH=$PATH:/opt/spark/bin' | sudo tee -a /etc/profile.d/spark.sh
source /etc/profile.d/spark.sh
        

  • Streaming ETL: Kafka → Iceberg:

python

# kafka_to_iceberg.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (SparkSession.builder
    .appName("KafkaToIceberg")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType())
])

df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "raw_events")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("data"))
    .select("data.*"))

query = (df.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "hdfs://localhost:9000/checkpoints/kafka_to_iceberg")
    .toTable("lake.events"))  # toTable() starts the streaming query

query.awaitTermination()
        

bash

$SPARK_HOME/bin/spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 \
  kafka_to_iceberg.py
        

4) Apache Hive (SQL engine)

  • Metastore (PostgreSQL):

bash

sudo apt-get install -y postgresql
sudo -u postgres psql -c "CREATE DATABASE hive_metastore;"
sudo -u postgres psql -c "CREATE USER hive WITH ENCRYPTED PASSWORD 'hivepass';"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE hive_metastore TO hive;"
        

  • Configure Hive:

xml

<!-- hive-site.xml -->
<configuration>
  <property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:postgresql://localhost:5432/hive_metastore</value></property>
  <property><name>javax.jdo.option.ConnectionDriverName</name><value>org.postgresql.Driver</value></property>
  <property><name>javax.jdo.option.ConnectionUserName</name><value>hive</value></property>
  <property><name>javax.jdo.option.ConnectionPassword</name><value>hivepass</value></property>
</configuration>
        

  • Initialize & run:

bash

# Requires the PostgreSQL JDBC driver jar in $HIVE_HOME/lib
schematool -dbType postgres -initSchema
hiveserver2 &
        

  • Query Iceberg tables (via Spark SQL or Hive):

sql

CREATE DATABASE lake;
USE lake;

-- Create Iceberg table (Spark SQL)
CREATE TABLE lake.events (
  event_id string,
  user_id string,
  action string,
  ts timestamp
) USING iceberg
PARTITIONED BY (days(ts));

SELECT action, COUNT(*) AS cnt
FROM lake.events
WHERE ts >= TIMESTAMP '2025-01-01 00:00:00'
GROUP BY action
ORDER BY cnt DESC;
        

5) Apache Iceberg (table format)

  • Key capabilities: ACID transactions, schema evolution, hidden partitioning via transforms (e.g., days(ts)), and time travel across snapshots.
  • Examples:

sql

-- Add a column
ALTER TABLE lake.events ADD COLUMN device string;

-- Time travel (the version is a snapshot ID; see the listing sketch below)
SELECT * FROM lake.events VERSION AS OF 3;
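Snapshot IDs for time travel can be listed from Iceberg's metadata tables. A minimal sketch, reusing the catalog configuration from the streaming job and assuming the lake.events table created above:

python

# list_snapshots.py - inspect Iceberg snapshots to pick a VERSION AS OF target
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("ListSnapshots")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

# Every commit (stream micro-batch, INSERT, compaction) produces a snapshot
spark.sql("""
SELECT snapshot_id, committed_at, operation
FROM lake.events.snapshots
ORDER BY committed_at DESC
""").show(truncate=False)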
        

Governance, security, and observability

  • Apache Ranger (authorization): centralized access-control policies, row filtering, and column masking across HDFS, Hive, and Kafka.
  • Apache Atlas (metadata & lineage): classification, business glossary, and end-to-end lineage for tables and pipelines.
  • Prometheus & Grafana (monitoring): scrape metrics from Hadoop, Kafka, and Spark; dashboards and alerts for lag, capacity, and job health.

Best practice: Treat governance as code—version policies, automate audits, and integrate with CI/CD.
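As one illustration of policy-as-code, the sketch below pushes an HDFS read policy to Ranger's public REST API. The host, credentials, service name, and exact JSON fields are assumptions to verify against your Ranger version:

python

# ranger_policy.py - illustrative sketch: a version-controlled policy applied via Ranger's REST API
import requests

# Hypothetical endpoint and credentials; adjust for your deployment
RANGER_URL = "http://localhost:6080/service/public/v2/api/policy"
AUTH = ("admin", "admin-password")

policy = {
    "service": "hadoopdev",                       # assumed Ranger service name for HDFS
    "name": "analysts-read-curated",
    "isEnabled": True,
    "resources": {
        "path": {"values": ["/warehouse"], "isRecursive": True}},
    "policyItems": [{
        "users": ["analyst"],
        "accesses": [
            {"type": "read", "isAllowed": True},
            {"type": "execute", "isAllowed": True}]}]}

resp = requests.post(RANGER_URL, json=policy, auth=AUTH)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))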

Cloud deployment options

  • AWS EMR: Managed Hadoop/Spark/Hive; use S3 as the lake with Iceberg tables in Glue/Hive catalogs.
  • GCP Dataproc: Managed Spark/Hadoop; use GCS as storage; integrate with BigQuery via connectors.
  • Azure HDInsight/Synapse: Managed Hadoop/Spark; use ADLS; Ranger available for governance.

Pattern: Use object storage (S3/GCS/ADLS) as the lake, Spark for processing, Iceberg for tables, and a managed metastore/catalog. Keep compute ephemeral; keep storage persistent.
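With this pattern, the same PySpark jobs only need different catalog settings. A sketch for S3 with the AWS Glue catalog (the bucket name is a placeholder, and the Iceberg AWS bundle plus S3 filesystem jars are assumed to be on the classpath):

python

# Sketch: Iceberg catalog backed by AWS Glue and S3 instead of HDFS
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("CloudLake")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-lake-bucket/warehouse")   # placeholder bucket
    .config("spark.sql.catalog.lake.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate())

# Tables are then created and queried through the same catalog,
# with data files landing in S3 and metadata tracked in Glue.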

End‑to‑end example: streaming to curated analytics

  1. Produce events to Kafka:

bash

bin/kafka-console-producer.sh --topic raw_events --bootstrap-server localhost:9092
# {"event_id":"e2","user_id":"u9","action":"purchase","ts":"2025-01-02T12:00:00Z"}
        

  2. Spark streaming writes to Iceberg (code above).
  3. Run an aggregation job (batch):

python

# daily_agg.py
from pyspark.sql import SparkSession
spark = (SparkSession.builder
    .appName("DailyAgg")
    .config("spark.sql.catalog.lake","org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type","hadoop")
    .config("spark.sql.catalog.lake.warehouse","hdfs://localhost:9000/warehouse")
    .getOrCreate())

spark.sql("""
CREATE TABLE IF NOT EXISTS lake.daily_actions (
  day date,
  action string,
  cnt bigint
) USING iceberg
PARTITIONED BY (day)
""")

spark.sql("""
INSERT INTO lake.daily_actions
SELECT DATE(ts) AS day, action, COUNT(*) AS cnt
FROM lake.events
GROUP BY DATE(ts), action
""")
# Note: INSERT INTO appends; re-running this job will duplicate rows.
# Use INSERT OVERWRITE for an idempotent daily refresh.
        

bash

$SPARK_HOME/bin/spark-submit --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 daily_agg.py
        

  4. Query results:

sql

SELECT * FROM lake.daily_actions ORDER BY day DESC, cnt DESC;
        

Performance and reliability best practices

  • File formats: Use Parquet/ORC; avoid tiny files—compact regularly (see the maintenance sketch after this list).
  • Partitioning: Partition by date/time; avoid over‑partitioning (too many small files).
  • Checkpointing: Use durable checkpoint locations for streaming jobs.
  • Resource tuning: Size executors for Spark (cores/memory); cache hot datasets.
  • Schema contracts: Validate inputs; enforce schemas at ingestion.
  • Cost control: Separate raw/cleansed/curated; lifecycle policies on object storage.
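Compaction and snapshot cleanup can be scheduled as a small Spark job using Iceberg's built-in procedures. A minimal sketch against the lake.events table from earlier (the cutoff timestamp is an example value):

python

# maintenance.py - illustrative sketch: compact small files and expire old snapshots
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("LakeMaintenance")
    # Iceberg SQL extensions are required for CALL procedures
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

# Rewrite many small data files into fewer, larger ones
spark.sql("CALL lake.system.rewrite_data_files(table => 'events')")

# Drop snapshots older than the cutoff to reclaim storage (keeps time travel bounded)
spark.sql("CALL lake.system.expire_snapshots(table => 'events', older_than => TIMESTAMP '2025-01-01 00:00:00')")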

Official documentation links

  • Hadoop: https://hadoop.apache.org/docs/stable/
  • Kafka: https://kafka.apache.org/documentation/
  • Spark: https://spark.apache.org/docs/latest/
  • Hive: https://hive.apache.org/
  • Iceberg: https://iceberg.apache.org/docs/latest/
  • Ranger: https://ranger.apache.org/
  • Atlas: https://atlas.apache.org/
  • Airflow: https://airflow.apache.org/docs/
  • Prometheus: https://prometheus.io/docs/
  • Grafana: https://grafana.com/docs/