Building a data lake with the Apache stack: Architecture, setup, and code
A modern data lake on the Apache stack gives you scalable storage, flexible processing, and open standards—without lock‑in. This guide walks through the “why,” the architecture, and the “how,” with working commands and code for each component.
Architecture explained
A data lake is more than storage—it’s an ecosystem. The Apache stack provides modular building blocks that you can compose cleanly:
Architecture diagram (text layout)
+-----------------------------+
| Producers |
| Apps | IoT | DB | Logs |
+-----------------------------+
|
v
+------------------+ +------------------+ +------------------+
| Batch Ingestion | | Stream Ingest | | Orchestration |
| (NiFi / Sqoop) | | (Kafka topics) | | (Airflow) |
+------------------+ +------------------+ +------------------+
\ | /
\ v /
\ +------------------+ /
\ | Raw Zone | /
\ | (HDFS/S3/GCS) | /
\ +------------------+ /
\ | /
\ v /
\ +------------------+ /
\ | Cleansed Zone |/
\ | (Parquet/ORC) |
\+------------------+
|
v
+----------------------+
| Iceberg Tables |
| (ACID, partitions) |
+----------------------+
|
+-----------------+------------------+
| |
v v
+----------------------+ +----------------------+
| Spark (ETL/ML/SQL) | | Hive/Trino (SQL) |
+----------------------+ +----------------------+
| |
v v
+------------------+ +------------------+
| BI / Notebooks | | Apps / Services |
+------------------+ +------------------+
Governance: Ranger (policies), Atlas (lineage)
Observability: Prometheus/Grafana, logs
Why Apache for data lakes
Data zones and lifecycle
Best practice: Never mutate raw; enforce contracts in cleansed; publish curated for consumption.
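To make the zone boundaries concrete, here is a minimal batch-promotion sketch in PySpark. The /raw and /cleansed HDFS paths and the event schema are illustrative placeholders, not something the stack prescribes: raw stays untouched, and the cleansed copy enforces a simple contract (required fields, typed timestamp, deduplication) before landing as Parquet.
python
# raw_to_cleansed.py (illustrative; paths and fields are placeholders)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.appName("RawToCleansed").getOrCreate()

# Raw zone: immutable JSON exactly as delivered by producers
raw = spark.read.json("hdfs://localhost:9000/raw/events/2025-01-01/")

# Cleansed zone: enforce the contract, then store as columnar Parquet
cleansed = (raw
    .dropna(subset=["event_id", "user_id", "action", "ts"])
    .withColumn("ts", to_timestamp(col("ts")))
    .dropDuplicates(["event_id"]))

cleansed.write.mode("append").parquet("hdfs://localhost:9000/cleansed/events/")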
Setup: step‑by‑step with working code
1) Hadoop HDFS (storage)
bash
sudo apt-get update && sudo apt-get install -y openjdk-11-jdk
HADOOP_VERSION=3.3.6
wget https://downloads.apache.org/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
sudo tar -xzf hadoop-$HADOOP_VERSION.tar.gz -C /opt
sudo ln -s /opt/hadoop-$HADOOP_VERSION /opt/hadoop
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' | sudo tee -a /opt/hadoop/etc/hadoop/hadoop-env.sh
echo 'export PATH=$PATH:/opt/hadoop/bin:/opt/hadoop/sbin' | sudo tee -a /etc/profile.d/hadoop.sh
source /etc/profile.d/hadoop.sh
xml
<!-- /opt/hadoop/etc/hadoop/core-site.xml -->
<configuration>
<property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
</configuration>
xml
<!-- /opt/hadoop/etc/hadoop/hdfs-site.xml -->
<configuration>
<property><name>dfs.replication</name><value>1</value></property>
<property><name>dfs.namenode.name.dir</name><value>file:///var/lib/hadoop/hdfs/namenode</value></property>
<property><name>dfs.datanode.data.dir</name><value>file:///var/lib/hadoop/hdfs/datanode</value></property>
</configuration>
bash
sudo mkdir -p /var/lib/hadoop/hdfs/namenode /var/lib/hadoop/hdfs/datanode
hdfs namenode -format
start-dfs.sh
hdfs dfs -mkdir -p /warehouse /checkpoints
2) Apache Kafka (stream ingestion)
bash
KAFKA_VERSION=3.7.0; SCALA_VERSION=2.13
wget https://downloads.apache.org/kafka/$KAFKA_VERSION/kafka_$SCALA_VERSION-$KAFKA_VERSION.tgz
tar -xzf kafka_$SCALA_VERSION-$KAFKA_VERSION.tgz && cd kafka_$SCALA_VERSION-$KAFKA_VERSION
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties
bash
bin/kafka-topics.sh --create --topic raw_events --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
bin/kafka-console-producer.sh --topic raw_events --bootstrap-server localhost:9092
# Paste JSON lines, e.g.:
# {"event_id":"e1","user_id":"u1","action":"view","ts":"2025-01-01T10:00:00Z"}
3) Apache Spark (processing)
bash
SPARK_VERSION=3.5.1
wget https://downloads.apache.org/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3.tgz
tar -xzf spark-$SPARK_VERSION-bin-hadoop3.tgz
sudo mv spark-$SPARK_VERSION-bin-hadoop3 /opt/spark
echo 'export SPARK_HOME=/opt/spark' | sudo tee -a /etc/profile.d/spark.sh
echo 'export PATH=$PATH:/opt/spark/bin' | sudo tee -a /etc/profile.d/spark.sh
source /etc/profile.d/spark.sh
python
# kafka_to_iceberg.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
spark = (SparkSession.builder
    .appName("KafkaToIceberg")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType())
])

df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "raw_events")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("data"))
    .select("data.*"))

# The lake.events Iceberg table must exist before the stream starts (see step 4).
# toTable() starts the streaming query and returns it, so no separate start() call is needed.
query = (df.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "hdfs://localhost:9000/checkpoints/kafka_to_iceberg")
    .toTable("lake.events"))
query.awaitTermination()
bash
$SPARK_HOME/bin/spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 \
kafka_to_iceberg.py
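Once the job is running, a quick batch read of the Iceberg table confirms that rows are landing. This sketch reuses the same catalog settings as the streaming job; run it via pyspark or spark-submit with the Iceberg runtime package shown above.
python
# verify_events.py (same catalog config as kafka_to_iceberg.py)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("VerifyEvents")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

# Show a sample of streamed rows and the current row count
spark.table("lake.events").show(truncate=False)
print(spark.table("lake.events").count())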
4) Apache Hive (SQL engine)
bash
sudo apt-get install -y postgresql
sudo -u postgres psql -c "CREATE DATABASE hive_metastore;"
sudo -u postgres psql -c "CREATE USER hive WITH ENCRYPTED PASSWORD 'hivepass';"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE hive_metastore TO hive;"
xml
<!-- $HIVE_HOME/conf/hive-site.xml -->
<configuration>
<property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:postgresql://localhost:5432/hive_metastore</value></property>
<property><name>javax.jdo.option.ConnectionDriverName</name><value>org.postgresql.Driver</value></property>
<property><name>javax.jdo.option.ConnectionUserName</name><value>hive</value></property>
<property><name>javax.jdo.option.ConnectionPassword</name><value>hivepass</value></property>
</configuration>
bash
# Assumes Apache Hive is installed with $HIVE_HOME/bin on PATH and the
# PostgreSQL JDBC driver jar copied into $HIVE_HOME/lib
schematool -dbType postgres -initSchema
hiveserver2 &
sql
-- In Hive (via beeline), create a database for metastore-registered tables:
CREATE DATABASE lake;
USE lake;
-- In Spark SQL (with the Iceberg catalog from step 3), create the Iceberg table
-- that the streaming job writes to:
CREATE TABLE lake.events (
    event_id string,
    user_id string,
    action string,
    ts timestamp
) USING iceberg
PARTITIONED BY (days(ts));
SELECT action, COUNT(*) AS cnt
FROM lake.events
WHERE ts >= TIMESTAMP '2025-01-01 00:00:00'
GROUP BY action
ORDER BY cnt DESC;
5) Apache Iceberg (table format)
sql
-- Add a column
ALTER TABLE lake.events ADD COLUMN device string;
-- Time travel: query an earlier snapshot (VERSION AS OF takes a snapshot_id;
-- list them with: SELECT * FROM lake.events.snapshots)
SELECT * FROM lake.events VERSION AS OF 3;  -- replace 3 with a real snapshot_id
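To find snapshot ids for time travel (or to undo a bad write), Iceberg exposes metadata tables and Spark procedures. A brief sketch against the same catalog; the procedure name follows Iceberg's documented Spark procedures, but verify it against your Iceberg version.
python
# iceberg_metadata.py (same catalog config as step 3)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("IcebergMetadata")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

# List snapshots: ids and timestamps usable with VERSION AS OF / rollback
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM lake.events.snapshots"
).show(truncate=False)

# Roll the table back to an earlier snapshot (the id below is a placeholder)
# spark.sql("CALL lake.system.rollback_to_snapshot('events', 1234567890123456789)")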
Governance, security, and observability
Best practice: Treat governance as code—version policies, automate audits, and integrate with CI/CD.
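One way to make "governance as code" tangible: keep policy definitions in version control and push them to Ranger from CI. The sketch below is assumption-heavy; it targets Ranger's public v2 REST API (/service/public/v2/api/policy), and the admin URL, service name, paths, and credentials are placeholders. Check the policy JSON model for your Ranger version before relying on the exact field names.
python
# apply_ranger_policy.py (illustrative; endpoint and field names per Ranger's public v2 API)
import json
import requests  # assumes the requests package is available in the CI environment

RANGER_URL = "http://localhost:6080"   # placeholder Ranger Admin URL
AUTH = ("admin", "admin-password")     # placeholder credentials; use a secret store in CI

# Policy definition kept in git; grants read-only access to the raw zone
policy = {
    "service": "hadoop_lake",          # placeholder HDFS service name in Ranger
    "name": "raw-zone-read-only",
    "resources": {"path": {"values": ["/raw"], "isRecursive": True}},
    "policyItems": [{
        "users": ["analyst"],
        "accesses": [{"type": "read", "isAllowed": True}],
    }],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy",
    auth=AUTH,
    headers={"Content-Type": "application/json"},
    data=json.dumps(policy),
)
resp.raise_for_status()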
Cloud deployment options
Pattern: Use object storage (S3/GCS/ADLS) as the lake, Spark for processing, Iceberg for tables, and a managed metastore/catalog. Keep compute ephemeral; keep storage persistent.
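In Spark terms, the switch from HDFS to object storage is mostly a change of warehouse URI plus S3A filesystem settings. A hedged sketch of that pattern; the bucket name is a placeholder, and it assumes the hadoop-aws bundle and AWS credentials are available on the cluster.
python
# lake_on_object_storage.py (bucket name is a placeholder; requires hadoop-aws on the classpath)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("LakeOnObjectStorage")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    # Point the Iceberg warehouse at a bucket instead of HDFS
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-lake-bucket/warehouse")
    # Pick up credentials from the environment or instance profile
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate())

spark.sql("SELECT COUNT(*) FROM lake.events").show()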
End‑to‑end example: streaming to curated analytics
bash
bin/kafka-console-producer.sh --topic raw_events --bootstrap-server localhost:9092
# {"event_id":"e2","user_id":"u9","action":"purchase","ts":"2025-01-02T12:00:00Z"}
python
# daily_agg.py
from pyspark.sql import SparkSession
spark = (SparkSession.builder
    .appName("DailyAgg")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.daily_actions (
        day date,
        action string,
        cnt bigint
    ) USING iceberg
    PARTITIONED BY (day)
""")

spark.sql("""
    INSERT INTO lake.daily_actions
    SELECT DATE(ts) AS day, action, COUNT(*) AS cnt
    FROM lake.events
    GROUP BY DATE(ts), action
""")
bash
$SPARK_HOME/bin/spark-submit --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 daily_agg.py
sql
SELECT * FROM lake.daily_actions ORDER BY day DESC, cnt DESC;
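Note that rerunning daily_agg.py as written appends duplicate rows. A hedged variant that stays idempotent uses Iceberg's MERGE INTO support in Spark SQL: recompute the aggregate into a temporary view, then upsert it (same catalog config as daily_agg.py; the view name is just an illustration).
python
# daily_agg_merge.py (idempotent variant of daily_agg.py)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("DailyAggMerge")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

# Recompute the aggregate into a temporary view...
spark.sql("""
    SELECT DATE(ts) AS day, action, COUNT(*) AS cnt
    FROM lake.events
    GROUP BY DATE(ts), action
""").createOrReplaceTempView("daily_actions_new")

# ...then upsert, so reruns update existing rows instead of duplicating them
spark.sql("""
    MERGE INTO lake.daily_actions t
    USING daily_actions_new s
    ON t.day = s.day AND t.action = s.action
    WHEN MATCHED THEN UPDATE SET t.cnt = s.cnt
    WHEN NOT MATCHED THEN INSERT (day, action, cnt) VALUES (s.day, s.action, s.cnt)
""")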
Performance and reliability best practices
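One practice worth automating early: the streaming writer produces many small files and snapshots, so compact data files and expire old snapshots on a schedule. A minimal sketch using Iceberg's Spark maintenance procedures (procedure names per recent Iceberg releases; run with the catalog config from step 3 and adjust the retention cutoff to your needs).
python
# iceberg_maintenance.py (same catalog config as step 3; cutoff timestamp is a placeholder)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("IcebergMaintenance")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

# Compact small files written by the streaming job into larger ones
spark.sql("CALL lake.system.rewrite_data_files(table => 'events')")

# Expire snapshots older than a cutoff to keep metadata and storage bounded
spark.sql(
    "CALL lake.system.expire_snapshots(table => 'events', "
    "older_than => TIMESTAMP '2025-01-08 00:00:00')"
)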
Official documentation links
Hadoop: https://hadoop.apache.org/docs/stable/
Kafka: https://kafka.apache.org/documentation/
Spark: https://spark.apache.org/docs/latest/
Hive: https://hive.apache.org/
Iceberg: https://iceberg.apache.org/docs/latest/
NiFi: https://nifi.apache.org/
Airflow: https://airflow.apache.org/docs/
Ranger: https://ranger.apache.org/
Atlas: https://atlas.apache.org/