Machine Learning at Scale on Azure with Apache Spark

From raw data to reliable real-time predictions, safely and repeatably.

Modern ML isn’t a single model; it’s a production system. The architecture below (Azure + Spark/Databricks + Azure ML + AKS + Synapse/SQL + Power BI) is a proven blueprint for moving from messy data to governed, observable, cost-aware AI at scale.

(Figure: Azure Landscape reference architecture)

1) Ingestion & Data Foundation: How to implement

Create the lakehouse

  • Storage account: ADLS Gen2 with Hierarchical Namespace = On; enable soft delete and versioning.
  • Networking: Private Endpoint + disallow public network access.
  • Containers & folders: one container per medallion layer (bronze, silver, gold).
  • IAM: Use Managed Identities; grant Storage Blob Data Reader/Contributor at container scope, not account.

Ingest batch data

  • In Azure Data Factory: build parameterized Copy pipelines from source systems into Bronze, with incremental watermarks, retries, and scheduled or event-based triggers.

Ingest streaming data

  • Use Event Hubs or IoT Hub, then Databricks Structured Streaming to land the data as Delta in Bronze.
  • Example (PySpark + Autoloader):
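
A minimal sketch, assuming JSON events, a Databricks-provided spark session, and illustrative abfss:// paths:

# Auto Loader stream: JSON files land in Bronze as Delta.
raw_path = "abfss://landing@<account>.dfs.core.windows.net/events/"
bronze_path = "abfss://lake@<account>.dfs.core.windows.net/bronze/events/"
checkpoint = "abfss://lake@<account>.dfs.core.windows.net/checkpoints/events/"

stream = (spark.readStream
    .format("cloudFiles")                              # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint)   # schema inference + evolution
    .load(raw_path))

(stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint)
    .outputMode("append")
    .start(bronze_path))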

Quality & contracts

  • Delta Live Tables: declare expectations that drop or quarantine failing rows (sketch after this list).
  • Data contracts: publish schemas (Avro/JSON) in Git, version them, and validate at ingestion with a schema registry (Confluent or custom).
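
A minimal Delta Live Tables sketch, assuming an upstream bronze_events table (names are illustrative):

import dlt

@dlt.table(comment="Validated events promoted from Bronze to Silver")
@dlt.expect_or_drop("valid_id", "event_id IS NOT NULL")
@dlt.expect_or_drop("valid_time", "event_time IS NOT NULL")
def silver_events():
    # Rows failing the expectations above are dropped and surfaced in DLT quality metrics
    return dlt.read_stream("bronze_events")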

2) Feature Engineering & Lakehouse Compute: How to implement

Transform at scale

  • Use Photon (Databricks Runtime 12.x+) for ETL performance. Partition by natural keys/date; run OPTIMIZE with ZORDER on frequently queried columns, as in the sketch below.
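
For example (table and column names are illustrative):

# Compact small files and co-locate data on the most common filter column
spark.sql("OPTIMIZE silver.customer_events ZORDER BY (customer_id)")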

Build feature tables

  • Keep surrogate key + event_time + feature_time; include effective_from/effective_to for point-in-time joins.
  • Register features:
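
A minimal registration sketch, assuming the Databricks Feature Store client and illustrative names:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()
fs.create_table(
    name="ml.churn_features",            # illustrative schema.table
    primary_keys=["customer_id"],
    timestamp_keys=["feature_time"],     # enables point-in-time lookups
    df=features_df,                      # the feature DataFrame built above
    description="Churn features with effective_from/effective_to windows",
)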

Keep offline = online

  • Nightly job to push hot features to Redis/Cosmos/SQL with the same keys and freshness SLA metrics (write the freshness timestamp with each row).
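
A minimal sketch of the nightly push, assuming Redis as the online store (host, key scheme, and row shape are illustrative):

import json, time
import redis

r = redis.Redis(host="<redis-host>", port=6380, ssl=True, password="<from-key-vault>")

def push_features(rows):
    """rows: iterable of dicts keyed by customer_id plus feature values."""
    freshness_ts = int(time.time())        # written with every row for freshness SLAs
    pipe = r.pipeline()
    for row in rows:
        key = f"features:churn:{row['customer_id']}"   # same key as offline
        pipe.set(key, json.dumps({**row, "_freshness_ts": freshness_ts}))
    pipe.execute()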

3) Experimentation & Training: How to implement

Track experiments with MLflow

import mlflow
import mlflow.sklearn

with mlflow.start_run() as run:
    mlflow.log_params({"max_depth": 8, "lr": 0.03})
    mlflow.log_metric("auc", 0.881)
    mlflow.sklearn.log_model(model, "model")
    # Register the logged model under a stable name for promotion
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn_xgb")

Reproducibility

  • Version data via Delta time travel: spark.read.format("delta").option("versionAsOf","123").load(path).
  • Pin Python & libs in requirements.txt or conda.yaml; use custom Docker for Azure ML.

Compute choices

  • Databricks job clusters with autoscale and spot; or Azure ML Compute for GPU sweeps.
  • Example Azure ML sweep (YAML, or the SDK v2 equivalent shown below):
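
A minimal sketch via the azure-ai-ml SDK v2, equivalent to the YAML job schema (workspace, environment, and script names are assumptions):

from azure.ai.ml import MLClient, command
from azure.ai.ml.sweep import Choice, Uniform
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace>"
)

# Base training job; assumes train.py logs "auc" via MLflow
job = command(
    code="./ml",
    command="python train.py --max_depth ${{inputs.max_depth}} --lr ${{inputs.lr}}",
    environment="<training-environment>@latest",
    compute="gpu-cluster",
    inputs={"max_depth": 8, "lr": 0.03},
)

# Bind the search space, then sweep on the logged metric
job_for_sweep = job(max_depth=Choice([4, 6, 8, 10]), lr=Uniform(0.01, 0.1))
sweep_job = job_for_sweep.sweep(
    sampling_algorithm="bayesian",
    primary_metric="auc",
    goal="Maximize",
)
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4)
ml_client.jobs.create_or_update(sweep_job)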

Promotion policy

  • Only promote if: AUC >= incumbent_AUC + 0.01, fairness and robustness checks pass, and explainability drift < threshold. Enforce via CI gate (see §7).
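
A minimal sketch of the CI gate predicate (the 0.05 drift threshold is an assumed default; tune per model):

def should_promote(candidate_auc: float,
                   incumbent_auc: float,
                   fairness_ok: bool,
                   robustness_ok: bool,
                   explainability_drift: float,
                   drift_threshold: float = 0.05) -> bool:
    # Mirrors the policy: quality uplift, passing checks, bounded drift
    return (candidate_auc >= incumbent_auc + 0.01
            and fairness_ok
            and robustness_ok
            and explainability_drift < drift_threshold)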

4) Model Serving & Inference: How to implement

Batch scoring

# "model" is the trained Spark ML pipeline; score the Silver table into Gold
pred = model.transform(silver_df)
(pred.write.format("delta")
     .mode("overwrite")
     .save(".../gold/predictions/churn/"))

Real-time serving options

  • Azure ML Managed Online Endpoint: quickest path.
  • AKS: for custom runtimes/routing. Package with FastAPI + Uvicorn, add /healthz probes, push as ACR image, deploy with Helm.
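
A minimal FastAPI scoring app for the AKS path (registry URI and request shape are assumptions; the model is assumed to return a 1-D array):

import pandas as pd
import mlflow.pyfunc
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/churn_xgb/Production")  # load once at startup

class ScoreRequest(BaseModel):
    features: dict  # flat feature_name -> value mapping

@app.get("/healthz")
def healthz():
    # Target for readiness/liveness probes
    return {"status": "ok"}

@app.post("/score")
def score(req: ScoreRequest):
    df = pd.DataFrame([req.features])
    return {"prediction": model.predict(df).tolist()}

Packaged with Uvicorn in the ACR image, e.g. uvicorn app:app --host 0.0.0.0 --port 8080.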

Autoscaling & health

  • Enable HPA on CPU/GPU and request-queue length; set readiness/liveness probes (e.g., 2s interval, 3 failures → restart).
  • Add Redis cache for idempotent or repeat queries.

Release strategies

  • Start with shadow traffic (0% user impact), then canary 10/25/50/100%. Gate each step on latency, error rate, and business KPI deltas.

Monitoring

  • Log request id, model version, feature set hash, latency, status, and cost token estimates. Send to App Insights/Prometheus, push model metrics to MLflow.
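
A minimal structured-logging sketch (field names mirror the list above; wiring the handler to App Insights/Prometheus is separate):

import json, logging, time, uuid

logger = logging.getLogger("inference")

def log_prediction(model_version: str, feature_set_hash: str,
                   latency_ms: float, status: int, cost_tokens_est: int = 0):
    # One JSON line per request, ready for a log-forwarding handler
    logger.info(json.dumps({
        "request_id": str(uuid.uuid4()),
        "model_version": model_version,
        "feature_set_hash": feature_set_hash,
        "latency_ms": latency_ms,
        "status": status,
        "cost_tokens_est": cost_tokens_est,
        "ts": time.time(),
    }))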

5) Analytics & Decisioning: How to implement

Expose data to analysts

  • Synapse Serverless external table or OPENROWSET query over Delta (sketch after this list).
  • Power BI: build a semantic model with row-level security; choose DirectQuery for low latency or Import for performance at scale.
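
A minimal sketch querying Gold from Synapse Serverless over Delta (server, database, and storage path are placeholders):

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=<database>;Authentication=ActiveDirectoryMsi;"
)

# Serverless SQL reads the Delta table in place; no data movement
query = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/lake/gold/predictions/churn/',
    FORMAT = 'DELTA'
) AS preds;
"""
for row in conn.execute(query):
    print(row)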

Operational dashboards

  • Include lift vs. baseline, coverage, drift indicators, ROI (e.g., revenue saved per 1k predictions).

6) Security, Privacy, Governance: How to implement

  • Key Vault + Managed Identities: no secrets in notebooks/pipelines.
  • Unity Catalog: create metastore, external locations, and storage credentials, then grant USAGE/SELECT at catalog/schema/table scope (grant sketch after this list).
  • Private Link for Storage, Databricks, Synapse, AML endpoints.
  • Purview: scan ADLS/Synapse/SQL weekly, auto-classify PII, and publish glossary terms.
  • PII handling in Silver with tokenization or hashing; store keys in Key Vault; restrict decrypt UDFs to specific groups.
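
A minimal grants sketch in Spark SQL (catalog, schema, and group names are illustrative; Unity Catalog spells the usage privileges USE CATALOG / USE SCHEMA):

# Run from a Databricks context with sufficient privileges
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data-scientists`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.gold TO `data-scientists`")
spark.sql("GRANT SELECT ON TABLE main.gold.churn_features TO `data-scientists`")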

7) MLOps & CI/CD: How to implement

Repo layout

/infra/ (Bicep/Terraform)
/etl/ (dbx or DABs jobs)
/features/
/ml/ (training, eval, registry, scoring)
/serving/ (FastAPI/AML specs/Helm)
/dashboards/ (PBI)
        

Pipelines (GitHub Actions or Azure DevOps)

  • Build: lint, unit tests, DQ tests (Great Expectations), small sample training run; publish wheel/Docker.
  • Release: deploy to a staging endpoint, run shadow then canary stages (see §4), and promote the registered model on green checks.
  • Policy gates: block if AUC ↓, latency ↑, bias test fails, or cost per 1k ↑ beyond threshold.

8) Reliability & Observability: How to implement

  • Define SLOs: e.g., endpoint P95 < 120 ms, availability 99.9%, feature freshness < 15 min, training time < 2h.
  • Send logs/metrics/traces via OpenTelemetry to Azure Monitor or Grafana.
  • Write runbooks: rollback steps, cache warm, feature backfill; test quarterly via game days.
  • Use dead-letter queues for bad events and alert on DLQ depth.

9) Cost-to-Serve & FinOps: How to implement

  • Tagging policy via Azure Policy; budgets/alerts in Cost Management per env/team.
  • Databricks cluster policies to enforce autoscale, max nodes, spot, and auto-termination.
  • Schedule off-hours shutdowns for dev/test.
  • Track cost per experiment (cluster minutes × rate) and per 1k predictions; put both on the Ops dashboard.
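
A minimal sketch of both metrics (rates and counts come from Cost Management exports; names are illustrative):

def cost_per_experiment(cluster_minutes: float, rate_per_hour: float) -> float:
    # Cluster minutes x hourly rate, as described above
    return (cluster_minutes / 60.0) * rate_per_hour

def cost_per_1k_predictions(total_cost: float, prediction_count: int) -> float:
    return total_cost / prediction_count * 1000.0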

10) KPIs That Matter: How to implement

  • Publish KPIs to a shared Ops & Product dashboard: model quality (AUC/lift vs. baseline), endpoint P95 latency and availability, feature freshness, drift indicators, cost per 1k predictions, and ROI.

Minimal Viable Platform (MVP) — First 2–4 weeks

  1. ADLS Gen2 + Delta (Bronze/Silver/Gold), private networking
  2. Databricks for ETL/Autoloader + DLT expectations
  3. MLflow tracking + registry
  4. Azure ML Managed Online Endpoint for serving
  5. Synapse Serverless + Power BI for analytics
  6. Key Vault, Unity Catalog, Purview scans enabled

Common Pitfalls (and fixes)

  • Model ≠ product → Treat system as the product; invest in CI/CD, observability, runbooks.
  • Offline/online skew → Single feature codepath; parity tests; freshness monitors.
  • Notebook-only orchestration → Use ADF/Workflows; add retries and lineage.
  • Governance late → Enable Unity Catalog/Purview day one.
  • GPU first → Profile, quantify ROI, and right-size before scaling.

