Databricks: A Cloud-Agnostic Solution for Data Engineering

Databricks is a powerful Big Data platform. Many comparisons exist between Databricks and proprietary platforms like Azure Synapse or AWS Glue. However, this article focuses on achieving independence in your data engineering solution.

New Projects vs. Legacy Systems

It's important to note that Databricks excels at building new solutions, not revamping legacy Hadoop/Pig/pre-Spark 3 systems. For those scenarios, consider GCP Dataproc, AWS EMR, or Azure HDInsight.

Databricks shines when embarking on new ventures. This story follows a small team of engineers at a young startup. They needed to build a user-facing web application connected to a sophisticated data pipeline.

The pipeline sourced data from web scrapers collecting social network information. A curated zone and an enriched zone contained unified data joined with machine learning model outputs. The team also required a data warehouse-like solution for data scientists and startup partners, along with data ingestion into an Elasticsearch cluster to serve B2C clients via the web app.
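The step from the curated zone to the enriched zone is essentially a join between unified records and machine learning model outputs. As a minimal sketch, here is that join in plain Python standing in for the actual Spark job; the `enrich` function, the `profile_id` key, and the record fields are hypothetical illustrations, not details from the team's pipeline:

```python
# Simplified stand-in for the curated -> enriched transformation.
# In the real pipeline this would be a Spark job joining Delta tables;
# plain dicts are used here only to illustrate the shape of the step.

def enrich(curated_rows, model_scores):
    """Join curated social-network records with ML model outputs by profile_id."""
    scores_by_id = {s["profile_id"]: s["score"] for s in model_scores}
    return [
        {**row, "score": scores_by_id.get(row["profile_id"])}
        for row in curated_rows
    ]

curated = [{"profile_id": 1, "handle": "@a"}, {"profile_id": 2, "handle": "@b"}]
scores = [{"profile_id": 1, "score": 0.9}]
print(enrich(curated, scores))
```

In Spark this would be a left join, so curated records without a model score (like `profile_id` 2 above) still land in the enriched zone with a null score.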

Cloud-Agnostic Platform Selection

The team chose Databricks as their underlying data platform due to its cloud-agnostic nature. They began with Azure Databricks for early experiments, proof-of-concepts (POCs), and team training on Spark and Delta Lake.

The Minimum Viable Product (MVP) was built on Google Cloud Platform (GCP). Databricks' clean separation of storage and compute proved valuable: data landed in Google Cloud Storage (GCS), while transformations ran on a Kubernetes (K8s) cluster. The junction between Spark workloads and Kubernetes, a platform that runs on any infrastructure, is a true point of cloud independence. Initially, the Kubernetes cluster ran on GCP Compute Engine virtual machines.
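To make the storage/compute split concrete, a Spark job on a self-managed Kubernetes cluster reading from GCS can be submitted roughly like this. This is a hedged sketch: the API server address, container image, bucket, and job name are placeholders, not values from the project described here:

```shell
# Sketch: Spark on a self-managed Kubernetes cluster, data in GCS.
# All names below (K8S_API_SERVER, my-registry, my-bucket) are placeholders.
spark-submit \
  --master k8s://https://K8S_API_SERVER:6443 \
  --deploy-mode cluster \
  --name enrich-job \
  --conf spark.kubernetes.container.image=my-registry/spark:3.4.0 \
  --conf spark.executor.instances=4 \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  gs://my-bucket/jobs/enrich.py
```

Because the compute layer is just Kubernetes and the storage layer is just an object store behind a Hadoop filesystem connector, either side can be swapped without rewriting the jobs.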

DevOps and Custom Kubernetes Cluster

The team considered using Databricks on Google Kubernetes Engine (GKE) as an intermediate step, reducing the need for strong DevOps expertise. While a valid option, the team had a capable DevOps engineer and opted for a custom Kubernetes cluster.

The final "cloud escape" occurred when the team had accumulated a large amount of data. They migrated to cheaper, more basic compute resources from a provider outside the GCP-AWS-Azure triad. Data was migrated to S3-compatible object storage running on the same Kubernetes cluster.
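After the migration, the same jobs can target S3-compatible storage simply by switching the filesystem configuration. A minimal sketch, assuming a MinIO-style endpoint; the endpoint, credentials, and bucket below are placeholders:

```shell
# Sketch: pointing the same Spark jobs at S3-compatible object storage
# (e.g. MinIO) after leaving GCP. Endpoint, keys and bucket are placeholders.
spark-submit \
  --master k8s://https://K8S_API_SERVER:6443 \
  --deploy-mode cluster \
  --conf spark.hadoop.fs.s3a.endpoint=https://minio.internal:9000 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key=ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=SECRET_KEY \
  s3a://lakehouse/jobs/enrich.py
```

Only the `--conf` lines and the `s3a://` path change; the job code itself stays untouched, which is what makes this kind of migration tractable.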

Cloud Lock-In and Costs

Big Data on public clouds can be expensive, regardless of company size. The pay-per-compute model, on top of storage fees, can lead to significant bills. Cloud lock-in, a consequence of relying on proprietary platforms, further restricts flexibility and cost optimization.

Databricks: Not a Silver Bullet

While Databricks offers cloud independence, it's not a simple solution. Established cloud platforms like Azure Synapse often have better support and documentation. Choosing open-source solutions requires weighing your engineering team's capabilities, the higher operational complexity, and your tolerance for technical risk.

A Scalable Solution for Growing Businesses

However, Databricks is an almost ideal solution for growing systems and teams. Your journey can start with a couple of notebooks and containers on Azure's ADLS (Azure Data Lake Storage), and finish as a massive hybrid cloud solution processing petabytes of data and powering a growing, sustainable business. Databricks scales alongside your needs, offering flexibility and cost-efficiency throughout your data engineering journey.
