Machine Learning at Scale on Azure with Apache Spark

From raw data to reliable real-time predictions, safely and repeatably.

Modern ML isn’t a single model; it’s a production system. The architecture below (Azure + Spark/Databricks + Azure ML + AKS + Synapse/SQL + Power BI) is a proven blueprint for moving from messy data to governed, observable, cost-aware AI at scale.

(Figure: Azure Landscape reference architecture)

1) Ingestion & Data Foundation: How to implement

Create the lakehouse

  • Storage account: ADLS Gen2 with Hierarchical Namespace = On; enable soft delete and versioning.
  • Networking: Private Endpoint + disallow public network access.
  • Containers & folders: one container per medallion layer (bronze, silver, gold).
  • IAM: Use Managed Identities; grant Storage Blob Data Reader/Contributor at container scope, not account.

Ingest batch data

  • In Azure Data Factory: build parameterized Copy pipelines from source systems into Bronze, with incremental watermarks, retries, and scheduled or event-based triggers.

Ingest streaming data

  • Use Event Hubs or IoT Hub, then Databricks Structured Streaming to land the data as Delta in Bronze.
  • Example (PySpark + Autoloader):
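
A minimal sketch, assuming JSON events, a Databricks-provided spark session, and illustrative abfss:// paths:

# Auto Loader stream: JSON files land in Bronze as Delta.
raw_path = "abfss://landing@<account>.dfs.core.windows.net/events/"
bronze_path = "abfss://lake@<account>.dfs.core.windows.net/bronze/events/"
checkpoint = "abfss://lake@<account>.dfs.core.windows.net/checkpoints/events/"

stream = (spark.readStream
    .format("cloudFiles")                              # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint)   # schema inference + evolution
    .load(raw_path))

(stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint)
    .outputMode("append")
    .start(bronze_path))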

Quality & contracts

  • Delta Live Tables: declare expectations that drop or quarantine failing rows (sketch after this list).
  • Data contracts: publish schemas (Avro/JSON) in Git, version them, and validate at ingestion with a schema registry (Confluent or custom).
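
A minimal Delta Live Tables sketch, assuming an upstream bronze_events table (names are illustrative):

import dlt

@dlt.table(comment="Validated events promoted from Bronze to Silver")
@dlt.expect_or_drop("valid_id", "event_id IS NOT NULL")
@dlt.expect_or_drop("valid_time", "event_time IS NOT NULL")
def silver_events():
    # Rows failing the expectations above are dropped and surfaced in DLT quality metrics
    return dlt.read_stream("bronze_events")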

2) Feature Engineering & Lakehouse Compute: How to implement

Transform at scale

  • Use Photon (Databricks Runtime 12.x+) for ETL performance. Partition by natural keys/date; run OPTIMIZE with ZORDER on frequently queried columns, as in the sketch below.
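
For example (table and column names are illustrative):

# Compact small files and co-locate data on the most common filter column
spark.sql("OPTIMIZE silver.customer_events ZORDER BY (customer_id)")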

Build feature tables

  • Keep surrogate key + event_time + feature_time; include effective_from/effective_to for point-in-time joins.
  • Register features:
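
A minimal registration sketch, assuming the Databricks Feature Store client and illustrative names:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()
fs.create_table(
    name="ml.churn_features",            # illustrative schema.table
    primary_keys=["customer_id"],
    timestamp_keys=["feature_time"],     # enables point-in-time lookups
    df=features_df,                      # the feature DataFrame built above
    description="Churn features with effective_from/effective_to windows",
)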

Keep offline = online

  • Nightly job to push hot features to Redis/Cosmos/SQL with the same keys and freshness SLA metrics (write the freshness timestamp with each row).
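
A minimal sketch of the nightly push, assuming Redis as the online store (host, key scheme, and row shape are illustrative):

import json, time
import redis

r = redis.Redis(host="<redis-host>", port=6380, ssl=True, password="<from-key-vault>")

def push_features(rows):
    """rows: iterable of dicts keyed by customer_id plus feature values."""
    freshness_ts = int(time.time())        # written with every row for freshness SLAs
    pipe = r.pipeline()
    for row in rows:
        key = f"features:churn:{row['customer_id']}"   # same key as offline
        pipe.set(key, json.dumps({**row, "_freshness_ts": freshness_ts}))
    pipe.execute()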

3) Experimentation & Training: How to implement

Track experiments with MLflow

import mlflow
import mlflow.sklearn

with mlflow.start_run() as run:
    mlflow.log_params({"max_depth": 8, "lr": 0.03})
    mlflow.log_metric("auc", 0.881)
    mlflow.sklearn.log_model(model, "model")
    # Register the logged model under a stable name for promotion
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn_xgb")

Reproducibility

  • Version data via Delta time travel: spark.read.format("delta").option("versionAsOf","123").load(path).
  • Pin Python & libs in requirements.txt or conda.yaml; use custom Docker for Azure ML.

Compute choices

  • Databricks job clusters with autoscale and spot; or Azure ML Compute for GPU sweeps.
  • Example Azure ML sweep (YAML, or the SDK v2 equivalent shown below):
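
A minimal sketch via the azure-ai-ml SDK v2, equivalent to the YAML job schema (workspace, environment, and script names are assumptions):

from azure.ai.ml import MLClient, command
from azure.ai.ml.sweep import Choice, Uniform
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace>"
)

# Base training job; assumes train.py logs "auc" via MLflow
job = command(
    code="./ml",
    command="python train.py --max_depth ${{inputs.max_depth}} --lr ${{inputs.lr}}",
    environment="<training-environment>@latest",
    compute="gpu-cluster",
    inputs={"max_depth": 8, "lr": 0.03},
)

# Bind the search space, then sweep on the logged metric
job_for_sweep = job(max_depth=Choice([4, 6, 8, 10]), lr=Uniform(0.01, 0.1))
sweep_job = job_for_sweep.sweep(
    sampling_algorithm="bayesian",
    primary_metric="auc",
    goal="Maximize",
)
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4)
ml_client.jobs.create_or_update(sweep_job)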

Promotion policy

  • Only promote if: AUC >= incumbent_AUC + 0.01, fairness and robustness checks pass, and explainability drift < threshold. Enforce via CI gate (see §7).
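
A minimal sketch of the CI gate predicate (the 0.05 drift threshold is an assumed default; tune per model):

def should_promote(candidate_auc: float,
                   incumbent_auc: float,
                   fairness_ok: bool,
                   robustness_ok: bool,
                   explainability_drift: float,
                   drift_threshold: float = 0.05) -> bool:
    # Mirrors the policy: quality uplift, passing checks, bounded drift
    return (candidate_auc >= incumbent_auc + 0.01
            and fairness_ok
            and robustness_ok
            and explainability_drift < drift_threshold)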

4) Model Serving & Inference: How to implement

Batch scoring

# "model" is the trained Spark ML pipeline; score the Silver table into Gold
pred = model.transform(silver_df)
(pred.write.format("delta")
     .mode("overwrite")
     .save(".../gold/predictions/churn/"))

Real-time serving options

  • Azure ML Managed Online Endpoint: quickest path.
  • AKS: for custom runtimes/routing. Package with FastAPI + Uvicorn, add /healthz probes, push as ACR image, deploy with Helm.
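
A minimal FastAPI scoring app for the AKS path (registry URI and request shape are assumptions; the model is assumed to return a 1-D array):

import pandas as pd
import mlflow.pyfunc
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/churn_xgb/Production")  # load once at startup

class ScoreRequest(BaseModel):
    features: dict  # flat feature_name -> value mapping

@app.get("/healthz")
def healthz():
    # Target for readiness/liveness probes
    return {"status": "ok"}

@app.post("/score")
def score(req: ScoreRequest):
    df = pd.DataFrame([req.features])
    return {"prediction": model.predict(df).tolist()}

Packaged with Uvicorn in the ACR image, e.g. uvicorn app:app --host 0.0.0.0 --port 8080.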

Autoscaling & health

  • Enable HPA on CPU/GPU and request-queue length; set readiness/liveness probes (e.g., 2s interval, 3 failures → restart).
  • Add Redis cache for idempotent or repeat queries.

Release strategies

  • Start with shadow traffic (0% user impact), then canary 10/25/50/100%. Gate each step on latency, error rate, and business KPI deltas.

Monitoring

  • Log request id, model version, feature set hash, latency, status, and cost token estimates. Send to App Insights/Prometheus, push model metrics to MLflow.
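
A minimal structured-logging sketch (field names mirror the list above; wiring the handler to App Insights/Prometheus is separate):

import json, logging, time, uuid

logger = logging.getLogger("inference")

def log_prediction(model_version: str, feature_set_hash: str,
                   latency_ms: float, status: int, cost_tokens_est: int = 0):
    # One JSON line per request, ready for a log-forwarding handler
    logger.info(json.dumps({
        "request_id": str(uuid.uuid4()),
        "model_version": model_version,
        "feature_set_hash": feature_set_hash,
        "latency_ms": latency_ms,
        "status": status,
        "cost_tokens_est": cost_tokens_est,
        "ts": time.time(),
    }))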

5) Analytics & Decisioning: How to implement

Expose data to analysts

  • Synapse Serverless external table or OPENROWSET query over Delta (sketch after this list).
  • Power BI: build a semantic model with row-level security; choose DirectQuery for low latency or Import for performance at scale.
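
A minimal sketch querying Gold from Synapse Serverless over Delta (server, database, and storage path are placeholders):

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=<database>;Authentication=ActiveDirectoryMsi;"
)

# Serverless SQL reads the Delta table in place; no data movement
query = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/lake/gold/predictions/churn/',
    FORMAT = 'DELTA'
) AS preds;
"""
for row in conn.execute(query):
    print(row)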

Operational dashboards

  • Include lift vs. baseline, coverage, drift indicators, ROI (e.g., revenue saved per 1k predictions).

6) Security, Privacy, Governance: How to implement

  • Key Vault + Managed Identities: no secrets in notebooks/pipelines.
  • Unity Catalog: create metastore, external locations, and storage credentials, then grant USAGE/SELECT at catalog/schema/table scope (grant sketch after this list).
  • Private Link for Storage, Databricks, Synapse, AML endpoints.
  • Purview: scan ADLS/Synapse/SQL weekly, auto-classify PII, and publish glossary terms.
  • PII handling in Silver with tokenization or hashing; store keys in Key Vault; restrict decrypt UDFs to specific groups.
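
A minimal grants sketch in Spark SQL (catalog, schema, and group names are illustrative; Unity Catalog spells the usage privileges USE CATALOG / USE SCHEMA):

# Run from a Databricks context with sufficient privileges
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data-scientists`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.gold TO `data-scientists`")
spark.sql("GRANT SELECT ON TABLE main.gold.churn_features TO `data-scientists`")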

7) MLOps & CI/CD: How to implement

Repo layout

/infra/ (Bicep/Terraform)
/etl/ (dbx or DABs jobs)
/features/
/ml/ (training, eval, registry, scoring)
/serving/ (FastAPI/AML specs/Helm)
/dashboards/ (PBI)
        

Pipelines (GitHub Actions or Azure DevOps)

  • Build: lint, unit tests, DQ tests (Great Expectations), small sample training run; publish wheel/Docker.
  • Release: deploy to a staging endpoint, run shadow then canary stages (see §4), and promote the registered model on green checks.
  • Policy gates: block if AUC ↓, latency ↑, bias test fails, or cost per 1k ↑ beyond threshold.

8) Reliability & Observability: How to implement

  • Define SLOs: e.g., endpoint P95 < 120 ms, availability 99.9%, feature freshness < 15 min, training time < 2h.
  • Send logs/metrics/traces via OpenTelemetry to Azure Monitor or Grafana.
  • Write runbooks: rollback steps, cache warm, feature backfill; test quarterly via game days.
  • Use dead-letter queues for bad events and alert on DLQ depth.

9) Cost-to-Serve & FinOps: How to implement

  • Tagging policy via Azure Policy; budgets/alerts in Cost Management per env/team.
  • Databricks cluster policies to enforce autoscale, max nodes, spot, and auto-termination.
  • Schedule off-hours shutdowns for dev/test.
  • Track cost per experiment (cluster minutes × rate) and per 1k predictions; put both on the Ops dashboard.
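
A minimal sketch of both metrics (rates and counts come from Cost Management exports; names are illustrative):

def cost_per_experiment(cluster_minutes: float, rate_per_hour: float) -> float:
    # Cluster minutes x hourly rate, as described above
    return (cluster_minutes / 60.0) * rate_per_hour

def cost_per_1k_predictions(total_cost: float, prediction_count: int) -> float:
    return total_cost / prediction_count * 1000.0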

10) KPIs That Matter: How to implement

  • Publish KPIs to a shared Ops & Product dashboard: model quality (AUC/lift vs. baseline), endpoint P95 latency and availability, feature freshness, drift indicators, cost per 1k predictions, and ROI.

Minimal Viable Platform (MVP) — First 2–4 weeks

  1. ADLS Gen2 + Delta (Bronze/Silver/Gold), private networking
  2. Databricks for ETL/Autoloader + DLT expectations
  3. MLflow tracking + registry
  4. Azure ML Managed Online Endpoint for serving
  5. Synapse Serverless + Power BI for analytics
  6. Key Vault, Unity Catalog, Purview scans enabled

Common Pitfalls (and fixes)

  • Model ≠ product → Treat system as the product; invest in CI/CD, observability, runbooks.
  • Offline/online skew → Single feature codepath; parity tests; freshness monitors.
  • Notebook-only orchestration → Use ADF/Workflows; add retries and lineage.
  • Governance late → Enable Unity Catalog/Purview day one.
  • GPU first → Profile, quantify ROI, and right-size before scaling.

