Databricks: A Cloud-Agnostic Solution for Data Engineering

Databricks is a powerful Big Data platform. Many comparisons exist between Databricks and proprietary platforms like Azure Synapse or AWS Glue. However, this article focuses on achieving independence in your data engineering solution.

New Projects vs. Legacy Systems

It's important to note that Databricks excels at building new solutions, not revamping legacy Hadoop/Pig/pre-Spark 3 systems. For those scenarios, consider GCP Dataproc, AWS EMR, or Azure HDInsight.

Databricks shines when embarking on new ventures. This story follows a small team of engineers at a young startup. They needed to build a user-facing web application connected to a sophisticated data pipeline.

The pipeline sourced data from web scrapers collecting social network information. A curated zone and an enriched zone contained unified data joined with machine learning model outputs. The team also required a data warehouse-like solution for data scientists and startup partners, along with data ingestion into an Elasticsearch cluster to serve B2C clients via the web app.
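The step from the curated zone to the enriched zone is essentially a join between unified records and machine learning model outputs. As a minimal sketch, here is that join in plain Python standing in for the actual Spark job; the `enrich` function, the `profile_id` key, and the record fields are hypothetical illustrations, not details from the team's pipeline:

```python
# Simplified stand-in for the curated -> enriched transformation.
# In the real pipeline this would be a Spark job joining Delta tables;
# plain dicts are used here only to illustrate the shape of the step.

def enrich(curated_rows, model_scores):
    """Join curated social-network records with ML model outputs by profile_id."""
    scores_by_id = {s["profile_id"]: s["score"] for s in model_scores}
    return [
        {**row, "score": scores_by_id.get(row["profile_id"])}
        for row in curated_rows
    ]

curated = [{"profile_id": 1, "handle": "@a"}, {"profile_id": 2, "handle": "@b"}]
scores = [{"profile_id": 1, "score": 0.9}]
print(enrich(curated, scores))
```

In Spark this would be a left join, so curated records without a model score (like `profile_id` 2 above) still land in the enriched zone with a null score.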

Cloud-Agnostic Platform Selection

The team chose Databricks as their underlying data platform due to its cloud-agnostic nature. They began with Azure Databricks for early experiments, proof-of-concepts (POCs), and team training on Spark and Delta Lake.

The Minimum Viable Product (MVP) was built on Google Cloud Platform (GCP). Databricks' clean separation of storage and compute proved valuable: data landed in Google Cloud Storage (GCS), while transformations ran on a Kubernetes (K8s) cluster. The junction between Spark workloads and Kubernetes, a platform that runs on any infrastructure, is a true point of cloud independence. Initially, the Kubernetes cluster ran on GCP Compute Engine virtual machines.
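To make the storage/compute split concrete, a Spark job on a self-managed Kubernetes cluster reading from GCS can be submitted roughly like this. This is a hedged sketch: the API server address, container image, bucket, and job name are placeholders, not values from the project described here:

```shell
# Sketch: Spark on a self-managed Kubernetes cluster, data in GCS.
# All names below (K8S_API_SERVER, my-registry, my-bucket) are placeholders.
spark-submit \
  --master k8s://https://K8S_API_SERVER:6443 \
  --deploy-mode cluster \
  --name enrich-job \
  --conf spark.kubernetes.container.image=my-registry/spark:3.4.0 \
  --conf spark.executor.instances=4 \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  gs://my-bucket/jobs/enrich.py
```

Because the compute layer is just Kubernetes and the storage layer is just an object store behind a Hadoop filesystem connector, either side can be swapped without rewriting the jobs.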

DevOps and Custom Kubernetes Cluster

The team considered using Databricks on Google Kubernetes Engine (GKE) as an intermediate step, reducing the need for strong DevOps expertise. While a valid option, the team had a capable DevOps engineer and opted for a custom Kubernetes cluster.

The final "cloud escape" occurred when the team had accumulated a large amount of data. They migrated to cheaper, more basic compute resources from a provider outside the GCP-AWS-Azure triad. Data was migrated to S3-compatible object storage running on the same Kubernetes cluster.
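After the migration, the same jobs can target S3-compatible storage simply by switching the filesystem configuration. A minimal sketch, assuming a MinIO-style endpoint; the endpoint, credentials, and bucket below are placeholders:

```shell
# Sketch: pointing the same Spark jobs at S3-compatible object storage
# (e.g. MinIO) after leaving GCP. Endpoint, keys and bucket are placeholders.
spark-submit \
  --master k8s://https://K8S_API_SERVER:6443 \
  --deploy-mode cluster \
  --conf spark.hadoop.fs.s3a.endpoint=https://minio.internal:9000 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key=ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=SECRET_KEY \
  s3a://lakehouse/jobs/enrich.py
```

Only the `--conf` lines and the `s3a://` path change; the job code itself stays untouched, which is what makes this kind of migration tractable.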

Cloud Lock-In and Costs

Big Data on public clouds can be expensive, regardless of company size. The pay-per-compute model, on top of storage fees, can lead to significant bills. Cloud lock-in, a consequence of relying on proprietary platforms, further restricts flexibility and cost optimization.

Databricks: Not a Silver Bullet

While Databricks offers cloud independence, it's not a simple solution. Established cloud platforms like Azure Synapse often have better support and documentation. Choosing open-source solutions requires weighing your engineering team's capabilities, the higher operational complexity, and your tolerance for technical risk.

A Scalable Solution for Growing Businesses

However, Databricks is an almost ideal solution for growing systems and teams. Your journey can start with a couple of notebooks and containers on Azure's ADLS (Azure Data Lake Storage), and finish as a massive hybrid cloud solution processing petabytes of data and powering a growing, sustainable business. Databricks scales alongside your needs, offering flexibility and cost-efficiency throughout your data engineering journey.
