Evolution of Spark 🌟

Shiva Kumar Talari

Published Jul 4, 2025

📅 Evolution of Distributed Computing & Data Lakes - Key Points

📚 Background

Google created GFS (Google File System) for large-scale data storage.
HDFS (Hadoop Distributed File System) is the open-source implementation modeled on GFS.
Google also introduced MapReduce for distributed data processing; Hadoop MapReduce is its open-source counterpart.
These tools let clusters of regular (cheap) computers store and process big data.
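
To make the MapReduce idea concrete, here is a minimal single-machine sketch of its three phases (map, shuffle, reduce) applied to a word count. It is an illustration only: the function names are hypothetical, and a real framework such as Hadoop would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # On a real cluster, each document (or file block) would be mapped on a different machine.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data needs big clusters", "clusters store data"]
    print(word_count(docs))
```

Because each map call only sees its own document and each reduce call only sees one key, the work can be spread over many cheap machines.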

📊 Comparison to Traditional Data Warehouses (e.g., Teradata, Exadata)

Data Warehouses collected data from OLTP systems and extracted insights.
Hadoop challenged Data Warehouses in 3 major ways:

🌎 Rise of the Data Lake

Term "Data Lake" coined by James Dixon (Pentaho CTO).
A Data Lake collects data from various sources into HDFS.
Spark, which later largely replaced MapReduce, is used to process the data.
Processed data is stored again for BI/Reporting, ML, and AI.

🔧 Limitations of Early Data Lakes

Missing features from Data Warehouses:
To fix this, Data Lakes were integrated with Data Warehouses for BI.
Final flow: Ingest raw data → Process with Spark → Store in DW → BI tools connect to DW
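
The flow above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions: plain Python stands in for Spark, an in-memory SQLite database stands in for the Data Warehouse, and the event data, table name, and function names are all hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events, standing in for files landed in the lake.
RAW_EVENTS = [
    '{"user_id": "a", "amount": 10.0}',
    '{"user_id": "b", "amount": 5.5}',
    '{"user_id": "a", "amount": 2.5}',
]


def ingest(lines):
    """Ingest: keep the raw records exactly as received."""
    return list(lines)


def process(raw_lines):
    """Process: parse and aggregate (the step Spark would run at scale)."""
    totals = {}
    for line in raw_lines:
        event = json.loads(line)
        totals[event["user_id"]] = totals.get(event["user_id"], 0.0) + event["amount"]
    return totals


def store(conn, totals):
    """Store: load the processed result into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_totals (user_id TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO user_totals VALUES (?, ?)", sorted(totals.items())
    )
    conn.commit()


def bi_query(conn):
    """BI: a reporting tool would issue SQL like this against the warehouse."""
    return conn.execute(
        "SELECT user_id, total FROM user_totals ORDER BY total DESC"
    ).fetchall()


def run_pipeline():
    conn = sqlite3.connect(":memory:")
    store(conn, process(ingest(RAW_EVENTS)))
    return bi_query(conn)


if __name__ == "__main__":
    print(run_pipeline())
```

The point of the split is that BI tools only ever see the small, curated table, while the raw events stay untouched in the lake.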

⚡ Cloud and the Modern Data Lake

Cloud made Data Lakes more flexible, cheaper, and widely adopted.
A modern Data Lake has 4 key capabilities: storage, ingestion, processing, and access.

📁 Data Storage

The core is storage: HDFS or cloud object storage (Amazon S3, Azure Blob Storage, Google Cloud Storage).
Cloud preferred due to scalability, low cost, and quick access.

🚗 Data Ingestion Layer

Data is ingested raw and immutable (not modified).
There is no one-size-fits-all tool; multiple ingestion tools exist.
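
One way to picture "raw and immutable" is a write-once landing step. The sketch below is an assumption-laden illustration, not a standard tool: the `land_raw` helper, the `raw/<source>/ingest_date=<date>/` layout, and the file names are hypothetical, and a local directory stands in for HDFS or cloud storage.

```python
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root, source, filename, payload, ingest_date):
    """Write payload bytes once under raw/<source>/ingest_date=<date>/ and never overwrite."""
    target_dir = (
        Path(lake_root) / "raw" / source / f"ingest_date={ingest_date.isoformat()}"
    )
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # Mode "xb" raises FileExistsError if the file already exists,
    # so a landed raw file is never modified.
    with open(target, "xb") as handle:
        handle.write(payload)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        path = land_raw(root, "orders", "batch-001.json", b'{"id": 1}', date(2025, 7, 4))
        print(path.relative_to(root))
        try:
            land_raw(root, "orders", "batch-001.json", b'{"id": 2}', date(2025, 7, 4))
        except FileExistsError:
            print("refused to overwrite existing raw file")
```

Keeping the raw copy untouched means any later processing step can be rerun from the original data.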

⚖️ Data Processing Layer

Key activities:
Apache Spark is commonly used here.

📚 Data Access Layer

Consumers (analysts, dashboards, apps) want data in various forms:
Tools must support all these formats.
Data Warehouses and RDBMSs are still popular for consumption.

🔒 Additional Critical Capabilities for Full Data Lake Implementation

Security & Access Control
Workflow & Scheduling Tools
Metadata & Data Catalogs
Lifecycle Management & Governance
Monitoring & Operations Tools

📄 Summary Quote for Interviews

"A Data Lake is a scalable platform that allows storage of raw, structured, and unstructured data. Spark enables processing, while integration with Data Warehouses supports BI. A mature data lake also includes ingestion, governance, access control, and multi-format access tools."
