Building a data lake with the Apache stack: Architecture, setup, and code

A modern data lake on the Apache stack gives you scalable storage, flexible processing, and open standards—without lock‑in. This guide walks through the “why,” the architecture, and the “how,” with working commands and code for each component.

Architecture explained

A data lake is more than storage—it’s an ecosystem. The Apache stack provides modular building blocks that you can compose cleanly:

  • Ingestion (Kafka/NiFi/Airflow): Bring in data from apps, devices, logs, and databases—both streaming and batch.
  • Storage (HDFS or object storage): Durable, low‑cost storage for raw, cleansed, and curated zones.
  • Processing (Spark): Transform, enrich, and aggregate data—batch and streaming.
  • Table format (Iceberg): ACID transactions, schema evolution, partitioning, and time travel across engines.
  • SQL/query (Hive/Trino): Interactive analytics with ANSI SQL and BI tool connectivity.
  • Governance (Ranger/Atlas): Access control, masking, lineage, and metadata.
  • Observability (Prometheus/Grafana): Metrics, alerts, and capacity planning.

Architecture diagram (text layout)

                +-----------------------------+
                |        Producers            |
                |  Apps | IoT | DB | Logs     |
                +-----------------------------+
                           |
                           v
+------------------+   +------------------+   +------------------+
|  Batch Ingestion |   |  Stream Ingest   |   |  Orchestration   |
|  (NiFi / Sqoop)  |   |  (Kafka topics)  |   |  (Airflow)       |
+------------------+   +------------------+   +------------------+
           \                 |                       /
            \                v                      /
             \        +------------------+        /
              \       |   Raw Zone       |       /
               \      | (HDFS/S3/GCS)    |      /
                \     +------------------+     /
                 \             |              /
                  \            v             /
                   \   +------------------+ /
                    \  | Cleansed Zone    |/
                     \ | (Parquet/ORC)    |
                      \+------------------+
                               |
                               v
                     +----------------------+
                     |   Iceberg Tables     |
                     |  (ACID, partitions)  |
                     +----------------------+
                               |
             +-----------------+------------------+
             |                                    |
             v                                    v
+----------------------+               +----------------------+
|  Spark (ETL/ML/SQL)  |               | Hive/Trino (SQL)     |
+----------------------+               +----------------------+
             |                                    |
             v                                    v
     +------------------+                 +------------------+
     | BI / Notebooks   |                 | Apps / Services  |
     +------------------+                 +------------------+

Governance: Ranger (policies), Atlas (lineage)
Observability: Prometheus/Grafana, logs
        

Why Apache for data lakes

  • Open standards: Avoid proprietary lock‑in; integrate with anything.
  • Scale & cost: Horizontal scaling with commodity hardware or cloud storage.
  • Flexibility: Handle CSV, JSON, Parquet, ORC, images, and more.
  • Interoperability: Spark + Hive + Iceberg work across engines and clouds.

Data zones and lifecycle

  • Raw zone: Immutable, source‑of‑truth dumps (CSV/JSON/Avro). No transformations.
  • Cleansed zone: Validated, standardized formats (Parquet/ORC), basic quality checks.
  • Curated zone: Analytics‑ready Iceberg tables with partitions, ACID, and governance.

Best practice: Never mutate raw; enforce contracts in cleansed; publish curated for consumption.
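To make the lifecycle concrete, here is a minimal raw-to-cleansed batch sketch. The paths (/raw/events, /cleansed/events) and the specific quality checks are illustrative assumptions, not fixed conventions:

python

# raw_to_cleansed.py - illustrative sketch: promote raw JSON into the cleansed zone as Parquet
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("RawToCleansed").getOrCreate()

# Hypothetical raw-zone path; raw data itself is never modified
raw = spark.read.json("hdfs://localhost:9000/raw/events/")

cleansed = (raw
    .dropDuplicates(["event_id"])                  # basic quality check: de-duplicate on the event key
    .filter(col("event_id").isNotNull())           # drop records that fail the contract
    .withColumn("ts", col("ts").cast("timestamp")))

# Write columnar Parquet into the cleansed zone
cleansed.write.mode("append").parquet("hdfs://localhost:9000/cleansed/events/")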

Setup: step‑by‑step with working code

1) Hadoop HDFS (storage)

  • Install Java & Hadoop:

bash

sudo apt-get update && sudo apt-get install -y openjdk-11-jdk
HADOOP_VERSION=3.3.6
wget https://downloads.apache.org/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
sudo tar -xzf hadoop-$HADOOP_VERSION.tar.gz -C /opt
sudo ln -s /opt/hadoop-$HADOOP_VERSION /opt/hadoop
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' | sudo tee -a /opt/hadoop/etc/hadoop/hadoop-env.sh
echo 'export PATH=$PATH:/opt/hadoop/bin:/opt/hadoop/sbin' | sudo tee -a /etc/profile.d/hadoop.sh
source /etc/profile.d/hadoop.sh
        

  • Configure single‑node (dev):

xml

<!-- /opt/hadoop/etc/hadoop/core-site.xml -->
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
</configuration>
        

xml

<!-- /opt/hadoop/etc/hadoop/hdfs-site.xml -->
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.namenode.name.dir</name><value>file:///var/lib/hadoop/hdfs/namenode</value></property>
  <property><name>dfs.datanode.data.dir</name><value>file:///var/lib/hadoop/hdfs/datanode</value></property>
</configuration>
        

  • Format & start:

bash

sudo mkdir -p /var/lib/hadoop/hdfs/namenode /var/lib/hadoop/hdfs/datanode
sudo chown -R "$USER" /var/lib/hadoop
hdfs namenode -format
start-dfs.sh   # requires passwordless SSH to localhost
hdfs dfs -mkdir -p /warehouse /checkpoints
        

2) Apache Kafka (stream ingestion)

  • Install & start (KRaft mode):

bash

KAFKA_VERSION=3.7.0; SCALA_VERSION=2.13
wget https://downloads.apache.org/kafka/$KAFKA_VERSION/kafka_$SCALA_VERSION-$KAFKA_VERSION.tgz
tar -xzf kafka_$SCALA_VERSION-$KAFKA_VERSION.tgz && cd kafka_$SCALA_VERSION-$KAFKA_VERSION
bin/kafka-storage.sh format -t "$(bin/kafka-storage.sh random-uuid)" -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties   # runs in the foreground; use a second terminal for the next steps
        

  • Create topic & test:

bash

bin/kafka-topics.sh --create --topic raw_events --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
bin/kafka-console-producer.sh --topic raw_events --bootstrap-server localhost:9092
# Paste JSON lines, e.g.:
# {"event_id":"e1","user_id":"u1","action":"view","ts":"2025-01-01T10:00:00Z"}
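For programmatic ingestion instead of the console producer, here is a minimal sketch using the kafka-python client; the library choice is an assumption (pip install kafka-python), and any Kafka client works the same way:

python

# produce_events.py - illustrative producer for the raw_events topic
import json
from datetime import datetime, timezone
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

event = {
    "event_id": "e3",
    "user_id": "u2",
    "action": "click",
    "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")}

producer.send("raw_events", value=event)   # topic created above
producer.flush()                           # block until the broker acknowledges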
        

3) Apache Spark (processing)

  • Install Spark:

bash

SPARK_VERSION=3.5.1
wget https://downloads.apache.org/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3.tgz
tar -xzf spark-$SPARK_VERSION-bin-hadoop3.tgz
sudo mv spark-$SPARK_VERSION-bin-hadoop3 /opt/spark
echo 'export SPARK_HOME=/opt/spark' | sudo tee -a /etc/profile.d/spark.sh
echo 'export PATH=$PATH:/opt/spark/bin' | sudo tee -a /etc/profile.d/spark.sh
source /etc/profile.d/spark.sh
        

  • Streaming ETL: Kafka → Iceberg:

python

# kafka_to_iceberg.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (SparkSession.builder
    .appName("KafkaToIceberg")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType())
])

df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "raw_events")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("data"))
    .select("data.*"))

query = (df.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "hdfs://localhost:9000/checkpoints/kafka_to_iceberg")
    .toTable("lake.events"))  # toTable() starts the streaming query

query.awaitTermination()
        

bash

$SPARK_HOME/bin/spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 \
  kafka_to_iceberg.py
        

4) Apache Hive (SQL engine)

  • Metastore (PostgreSQL):

bash

sudo apt-get install -y postgresql
sudo -u postgres psql -c "CREATE DATABASE hive_metastore;"
sudo -u postgres psql -c "CREATE USER hive WITH ENCRYPTED PASSWORD 'hivepass';"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE hive_metastore TO hive;"
        

  • Configure Hive:

xml

<!-- hive-site.xml -->
<configuration>
  <property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:postgresql://localhost:5432/hive_metastore</value></property>
  <property><name>javax.jdo.option.ConnectionDriverName</name><value>org.postgresql.Driver</value></property>
  <property><name>javax.jdo.option.ConnectionUserName</name><value>hive</value></property>
  <property><name>javax.jdo.option.ConnectionPassword</name><value>hivepass</value></property>
</configuration>
        

  • Initialize & run:

bash

# Requires the PostgreSQL JDBC driver jar in $HIVE_HOME/lib
schematool -dbType postgres -initSchema
hiveserver2 &
        

  • Query Iceberg tables (via Spark SQL or Hive):

sql

CREATE DATABASE lake;
USE lake;

-- Create Iceberg table (Spark SQL)
CREATE TABLE lake.events (
  event_id string,
  user_id string,
  action string,
  ts timestamp
) USING iceberg
PARTITIONED BY (days(ts));

SELECT action, COUNT(*) AS cnt
FROM lake.events
WHERE ts >= TIMESTAMP '2025-01-01 00:00:00'
GROUP BY action
ORDER BY cnt DESC;
        

5) Apache Iceberg (table format)

  • Key capabilities: ACID transactions, schema evolution, hidden partitioning via transforms (e.g., days(ts)), and time travel across snapshots.
  • Examples:

sql

-- Add a column
ALTER TABLE lake.events ADD COLUMN device string;

-- Time travel (the version is a snapshot ID; see the listing sketch below)
SELECT * FROM lake.events VERSION AS OF 3;
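Snapshot IDs for time travel can be listed from Iceberg's metadata tables. A minimal sketch, reusing the catalog configuration from the streaming job and assuming the lake.events table created above:

python

# list_snapshots.py - inspect Iceberg snapshots to pick a VERSION AS OF target
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("ListSnapshots")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

# Every commit (stream micro-batch, INSERT, compaction) produces a snapshot
spark.sql("""
SELECT snapshot_id, committed_at, operation
FROM lake.events.snapshots
ORDER BY committed_at DESC
""").show(truncate=False)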
        

Governance, security, and observability

  • Apache Ranger (authorization): centralized access-control policies, row filtering, and column masking across HDFS, Hive, and Kafka.
  • Apache Atlas (metadata & lineage): classification, business glossary, and end-to-end lineage for tables and pipelines.
  • Prometheus & Grafana (monitoring): scrape metrics from Hadoop, Kafka, and Spark; dashboards and alerts for lag, capacity, and job health.

Best practice: Treat governance as code—version policies, automate audits, and integrate with CI/CD.
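As one illustration of policy-as-code, the sketch below pushes an HDFS read policy to Ranger's public REST API. The host, credentials, service name, and exact JSON fields are assumptions to verify against your Ranger version:

python

# ranger_policy.py - illustrative sketch: a version-controlled policy applied via Ranger's REST API
import requests

# Hypothetical endpoint and credentials; adjust for your deployment
RANGER_URL = "http://localhost:6080/service/public/v2/api/policy"
AUTH = ("admin", "admin-password")

policy = {
    "service": "hadoopdev",                       # assumed Ranger service name for HDFS
    "name": "analysts-read-curated",
    "isEnabled": True,
    "resources": {
        "path": {"values": ["/warehouse"], "isRecursive": True}},
    "policyItems": [{
        "users": ["analyst"],
        "accesses": [
            {"type": "read", "isAllowed": True},
            {"type": "execute", "isAllowed": True}]}]}

resp = requests.post(RANGER_URL, json=policy, auth=AUTH)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))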

Cloud deployment options

  • AWS EMR: Managed Hadoop/Spark/Hive; use S3 as the lake with Iceberg tables in Glue/Hive catalogs.
  • GCP Dataproc: Managed Spark/Hadoop; use GCS as storage; integrate with BigQuery via connectors.
  • Azure HDInsight/Synapse: Managed Hadoop/Spark; use ADLS; Ranger available for governance.

Pattern: Use object storage (S3/GCS/ADLS) as the lake, Spark for processing, Iceberg for tables, and a managed metastore/catalog. Keep compute ephemeral; keep storage persistent.
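With this pattern, the same PySpark jobs only need different catalog settings. A sketch for S3 with the AWS Glue catalog (the bucket name is a placeholder, and the Iceberg AWS bundle plus S3 filesystem jars are assumed to be on the classpath):

python

# Sketch: Iceberg catalog backed by AWS Glue and S3 instead of HDFS
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("CloudLake")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-lake-bucket/warehouse")   # placeholder bucket
    .config("spark.sql.catalog.lake.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate())

# Tables are then created and queried through the same catalog,
# with data files landing in S3 and metadata tracked in Glue.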

End‑to‑end example: streaming to curated analytics

  1. Produce events to Kafka:

bash

bin/kafka-console-producer.sh --topic raw_events --bootstrap-server localhost:9092
# {"event_id":"e2","user_id":"u9","action":"purchase","ts":"2025-01-02T12:00:00Z"}
        

  2. Spark streaming writes to Iceberg (code above).
  3. Run an aggregation job (batch):

python

# daily_agg.py
from pyspark.sql import SparkSession
spark = (SparkSession.builder
    .appName("DailyAgg")
    .config("spark.sql.catalog.lake","org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type","hadoop")
    .config("spark.sql.catalog.lake.warehouse","hdfs://localhost:9000/warehouse")
    .getOrCreate())

spark.sql("""
CREATE TABLE IF NOT EXISTS lake.daily_actions (
  day date,
  action string,
  cnt bigint
) USING iceberg
PARTITIONED BY (day)
""")

spark.sql("""
INSERT INTO lake.daily_actions
SELECT DATE(ts) AS day, action, COUNT(*) AS cnt
FROM lake.events
GROUP BY DATE(ts), action
""")
# Note: INSERT INTO appends; re-running this job will duplicate rows.
# Use INSERT OVERWRITE for an idempotent daily refresh.
        

bash

$SPARK_HOME/bin/spark-submit --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 daily_agg.py
        

  4. Query results:

sql

SELECT * FROM lake.daily_actions ORDER BY day DESC, cnt DESC;
        

Performance and reliability best practices

  • File formats: Use Parquet/ORC; avoid tiny files—compact regularly (see the maintenance sketch after this list).
  • Partitioning: Partition by date/time; avoid over‑partitioning (too many small files).
  • Checkpointing: Use durable checkpoint locations for streaming jobs.
  • Resource tuning: Size executors for Spark (cores/memory); cache hot datasets.
  • Schema contracts: Validate inputs; enforce schemas at ingestion.
  • Cost control: Separate raw/cleansed/curated; lifecycle policies on object storage.
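Compaction and snapshot cleanup can be scheduled as a small Spark job using Iceberg's built-in procedures. A minimal sketch against the lake.events table from earlier (the cutoff timestamp is an example value):

python

# maintenance.py - illustrative sketch: compact small files and expire old snapshots
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("LakeMaintenance")
    # Iceberg SQL extensions are required for CALL procedures
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://localhost:9000/warehouse")
    .getOrCreate())

# Rewrite many small data files into fewer, larger ones
spark.sql("CALL lake.system.rewrite_data_files(table => 'events')")

# Drop snapshots older than the cutoff to reclaim storage (keeps time travel bounded)
spark.sql("CALL lake.system.expire_snapshots(table => 'events', older_than => TIMESTAMP '2025-01-01 00:00:00')")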

Official documentation links

  • Hadoop: https://hadoop.apache.org/docs/stable/
  • Kafka: https://kafka.apache.org/documentation/
  • Spark: https://spark.apache.org/docs/latest/
  • Hive: https://hive.apache.org/
  • Iceberg: https://iceberg.apache.org/docs/latest/
  • Ranger: https://ranger.apache.org/
  • Atlas: https://atlas.apache.org/
  • Airflow: https://airflow.apache.org/docs/
  • Prometheus: https://prometheus.io/docs/
  • Grafana: https://grafana.com/docs/