Just Published: My Third End-to-End MLOps Project – Email Spam Detection System

Building on my previous MLOps projects, I’ve now developed a production-ready Email Spam Detection system with advanced workflow orchestration and deployment strategies. This project focuses on learning and applying Airflow, Kubernetes, and boosting models for scalable ML pipelines. The goal was to build production-grade MLOps pipelines (experiment tracking, pipeline orchestration, and containerized deployment) rather than to deploy every model.

📌 What I built
• Full ML workflow with Airflow DAGs for pipeline orchestration, integrating a separate MLflow instance for experiment tracking.
• Multiple boosting classification models (AdaBoost, Gradient Boosting, etc.) running within the same pipeline.
• Containerized deployment using Docker and orchestration using Kubernetes (Minikube + kubectl).
• Modular project structure covering data ingestion, preprocessing, model training, evaluation, and deployment.
• Trained two models and deployed the best-performing one; the pipeline is modular and ready for future model experimentation.

🛠 Key Skills & Tools
Python • MLOps • Airflow • MLflow (metric tracking, logging, and deploying trained models) • Docker • Kubernetes (K8s) • CI/CD • Boosting Models • Production-grade ML pipelines

💡 Improvements over my 2nd project
• Airflow for robust workflow orchestration and explicit task dependencies.
• Separate MLflow setup instead of ZenML’s built-in tracking.
• Multiple models in the same pipeline, supporting experimentation with boosting classifiers.
• Kubernetes basics & deployment: pods, deployments, services, and Minikube setup.
• Learned how to serialize large objects, manage XComArgs, and troubleshoot common Airflow errors.
• Hands-on experience with Docker + K8s deployment and managing containerized ML applications.

💡 Key Learnings
• Airflow DAG structure, task dependencies, and XCom management.
• Integration of Airflow with MLflow for reproducible experiment tracking.
• Boosting models and running multiple classifiers in a single pipeline.
• Kubernetes concepts for managing scalable containerized applications.
• Ensuring version consistency between training and deployment environments.

🔗 Check out the repo here: https://lnkd.in/eVTYEm5C

This project was a deep dive into scalable MLOps architectures, bridging the gap between pipeline orchestration, experiment tracking, and cloud-native deployment.

#MLOps #MachineLearning #DataScience #Python #Airflow #MLFlow #Docker #Kubernetes #BoostingModels #PipelineOrchestration #ProductionML #ProjectShowcase #UKTech #TechJobsUK #UKJobs #HiringUK #LondonTech
Email Spam Detection System with Airflow & Kubernetes
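The "train several boosting models, deploy the best one" step described above can be sketched in a few lines. This is a minimal illustration on a synthetic dataset, not the project's actual code; the model choices, hyperparameters, and accuracy metric are assumptions:

```python
# Sketch: fit several boosting classifiers, keep the best one for deployment.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the featurized spam dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

candidates = {
    "adaboost": AdaBoostClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

best_name = max(scores, key=scores.get)  # this one would go to deployment
```

In an Airflow + MLflow setup, each `fit`/`score` pair would typically be its own task, with the metric logged to MLflow before the selection step runs.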
Here are the top Python skill areas dominating 2026 — the ones that actually matter in real-world engineering:

🐍 1. Data Analysis with Pandas
Python engineers must be comfortable manipulating datasets, generating insights, and supporting data-driven decisions — even in DevOps and cloud operations.

🤖 2. Machine Learning with scikit-learn
ML is no longer optional. Basic predictive models, feature engineering, and pipeline understanding give engineers a massive edge.

⚙️ 3. Automation & Scripting
Still the strongest use case. Whether it’s CI/CD, cloud ops, or API automation — Python scripts save time, reduce errors, and boost efficiency.

⚡ 4. FastAPI & Modern REST APIs
Lightweight, fast, async-ready APIs are replacing old frameworks. FastAPI is now the new standard for high-performance backend services.

📊 5. Data Visualization with Matplotlib / Seaborn
If you can’t visualize data, you can’t explain data. Python charts help engineers communicate insights clearly, especially in monitoring and reporting.

☁️ 6. Cloud & AWS SDK (Boto3)
Python + Cloud is the strongest combo. From provisioning AWS resources to automating deployments — Boto3 skills make engineers stand out.

🔥 If you master these 6 areas in 2026, you will stay ahead of 90% of engineers in automation, DevOps, data, and cloud roles.

Python isn’t slowing down. It’s becoming unstoppable. 🐍💥

#Python #Programming #Automation #DevOps #CloudComputing #MachineLearning #FastAPI #AWS #TechSkills2026
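As a taste of area 3, here is the kind of small automation script the post has in mind, using only the standard library. The directory names and `.log` suffix are made-up examples:

```python
# Sketch: move rotated log files into an archive directory.
import shutil
import tempfile
from pathlib import Path

def archive_logs(src: Path, dest: Path, suffix: str = ".log") -> int:
    """Move every file matching *suffix from src into dest; return the count."""
    dest.mkdir(parents=True, exist_ok=True)
    moved = 0
    for f in src.glob(f"*{suffix}"):
        shutil.move(str(f), dest / f.name)
        moved += 1
    return moved

# Demo on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    src, dest = Path(tmp) / "in", Path(tmp) / "archive"
    src.mkdir()
    (src / "a.log").write_text("x")
    (src / "b.log").write_text("y")
    (src / "keep.txt").write_text("z")   # non-matching file stays put
    n = archive_logs(src, dest)
```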
In 2025, I became more involved with Dagster and its open-source community. A significant highlight has been contributing support for async operations to the Dagster framework. As Data Engineering increasingly relies on cloud-based services and AI/LLM APIs, optimizing I/O-intensive operations in data pipelines has become a bigger challenge. Collaborating with the brilliant team at Dagster Labs, we created a blog post that provides an in-depth look at how this new async feature works. https://lnkd.in/eG4ymcpU #DataEngineering #Dagster #Python #Async #WorkflowOrchestration
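The payoff from async in I/O-bound pipelines is easy to demonstrate with the standard library alone. This is a generic asyncio sketch, not Dagster's API; the simulated "API calls" are stand-ins for real HTTP or LLM requests:

```python
import asyncio
import time

async def call_api(name: str, delay: float) -> str:
    # Stand-in for an I/O-bound call (HTTP request, cloud service, LLM API...)
    await asyncio.sleep(delay)
    return name

async def main():
    start = time.perf_counter()
    # All three "calls" wait concurrently instead of back-to-back,
    # so total wall time is ~0.1s rather than ~0.3s
    results = await asyncio.gather(
        call_api("users", 0.1),
        call_api("orders", 0.1),
        call_api("events", 0.1),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
```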
🚀 Refactoring My Final Project: Turning a Working App into a Clean System

A month ago, I completed my AI Trip Planner as the final project of the WBS Coding School Data Science Bootcamp. It worked. Users could enter preferences, and the app generated optimized city itineraries. But under the hood? Let’s just say deadlines won… and the code turned into spaghetti 🍝

So I decided to rebuild it — not to add features, but to fix the foundation. This time, the goal is simple:
👉 a clean, maintainable, production-ready system I can confidently deploy and extend.

🧠 What I’m Building Differently This Time
I’m restructuring the entire project around a strict 4-layer architecture, where every layer has one clear responsibility:
• Frontend (Streamlit): Pure UI — input, output, nothing else
• Backend API (FastAPI): Orchestration — validation, database access, coordination
• Intelligence Layer: Pure Python logic — scoring places & optimizing routes
• Data Pipeline: Independent data ingestion & processing (Google Places / Distance APIs)

This separation alone changed how I think about building ML-driven applications.

🔧 What I’ve Learned Along the Way
This refactor has been a crash course in software engineering fundamentals that go far beyond models:
• Designing RESTful APIs with FastAPI
• Using Pydantic schemas to enforce data contracts
• Modeling relational data with SQLAlchemy ORM
• Managing database sessions, transactions, and relationships
• Keeping secrets out of code with environment-based configuration
• Writing framework-agnostic business logic that’s testable in isolation
• Structuring projects so future changes don’t break everything

One of the biggest mindset shifts: good architecture isn’t about complexity — it’s about clarity. Every function does one thing. Every layer knows only what it needs to know.

🎯 Why This Matters (Especially for Data Scientists)
It’s easy to focus on models, metrics, and notebooks. But real-world systems live longer than experiments. This project taught me that the difference between a portfolio project and production-ready software is architectural thinking — knowing where things belong and why.

I’m documenting the rebuild step by step and will share the deployed version soon. More updates coming ✈️📍

#DataScience #SoftwareEngineering #CleanArchitecture #Python #FastAPI #Streamlit #LearningInPublic
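The "framework-agnostic business logic" idea from the intelligence layer can be sketched as a plain Python function with no FastAPI or Streamlit imports. The `Place` fields, weights, and cutoff distance here are invented for illustration, not the project's actual scoring model:

```python
# Sketch of a framework-agnostic place-scoring function (intelligence layer).
from dataclasses import dataclass

@dataclass
class Place:
    name: str
    rating: float        # 0–5 star rating
    distance_km: float   # distance from the itinerary's anchor point

def score_place(place: Place, max_distance_km: float = 10.0) -> float:
    """Blend rating and proximity into a single score in [0, 1]."""
    proximity = max(0.0, 1.0 - place.distance_km / max_distance_km)
    return round(0.7 * (place.rating / 5.0) + 0.3 * proximity, 3)

# Pure logic means ranking is trivially testable without any web framework
ranked = sorted(
    [Place("Museum", 4.8, 1.2), Place("Park", 4.2, 0.3), Place("Tower", 3.9, 7.5)],
    key=score_place,
    reverse=True,
)
```

Because the function knows nothing about HTTP or UI, the FastAPI layer can call it directly and unit tests can exercise it in isolation.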
Seeking Best Practices: Custom Base Images for Multi-Language Data and ML Pipelines

Our team recently standardized a dual-strategy approach for managing dependencies in our data engineering pipelines:

Python Approach:
- Shared base images with locked dependencies (pip-tools)
- Monthly automated rebuilds with CI validation
- Reusable across Cloud Run, Vertex AI, and Cloud Functions

Java Approach:
- Application-specific images only (no shared base images)
- Maven-based builds with compiled JARs
- Each application owns its full dependency graph
- Used by Dataflow Flex Templates (not optimized via base images)

The Challenge: We're balancing control, security, and maintainability while avoiding dependency hell and ensuring reproducible builds across environments.

Question for the community: How are you managing base images and dependencies for multi-language data platforms? Are you using shared images, application-specific images, or a hybrid approach?

Would love to hear your experiences, especially around:
- Dependency locking strategies
- CI/CD patterns for image updates
- Handling Python vs Java/Scala differently
- Security and vulnerability scanning workflows

Drop your thoughts in the comments! 💬

#DataEngineering #DevOps #Docker #CloudNative #GCP #BestPractices #SoftwareEngineering
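For the Python side, a shared base image with pip-tools-style locking might look roughly like this. A hedged sketch only: the image tag, file names, and registry path are invented, not the team's actual setup:

```dockerfile
# Shared Python base image, rebuilt monthly by CI (all names are illustrative)
FROM python:3.12-slim

# requirements.txt is generated by `pip-compile requirements.in`,
# so every downstream image inherits the same fully locked dependency set
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Downstream services then extend the shared base, e.g.:
#   FROM registry.example.com/data-platform/python-base:2025-06
```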
⚡ Stop Building Batch Pipelines. Real-Time Is the New Default.

One of the most underrated yet fast-trending GitHub projects right now is Pathway — and it’s a big deal for anyone building AI or data-driven products.

🔗 GitHub: https://lnkd.in/g94iR7-c

Why Pathway is gaining traction 👇
🔁 Real-time data processing (not batch-only like traditional ETL)
🧠 Built for AI pipelines — streaming → embeddings → inference
🐍 Python-native (no JVM pain)
⚙️ Unified ETL + streaming + analytics
🚀 Perfect for LLM apps, RAG systems, fraud detection, live analytics

If you’re building:
• AI agents
• LLM-powered products
• Streaming analytics
• Real-time dashboards
…this is a framework you should absolutely explore.

The future of data isn’t offline processing. It’s live, continuous, and AI-native.

⭐ Star it. Study it. Build with it.

#GitHubTrending #DataEngineering #AI #LLM #RealTimeData #OpenSource #Startups
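The streaming-versus-batch distinction the post leans on can be shown with a plain Python generator: instead of computing one answer over a finished dataset, the aggregate updates on every event. This is a generic illustration, not Pathway's API:

```python
# Incremental (streaming-style) aggregation vs. waiting for a full batch.
from typing import Iterable, Iterator

def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Emit an updated mean after every event instead of once at the end."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count

# Each incoming event immediately refreshes the live aggregate
snapshots = list(running_mean([2, 4, 6]))  # → [2.0, 3.0, 4.0]
```

Real streaming engines add windowing, consistency, and out-of-order handling on top of this basic idea, which is where a framework earns its keep over hand-rolled generators.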
𝗪𝗵𝘆 𝗜 𝗺𝗼𝘃𝗲𝗱 𝗺𝘆 𝗗𝗮𝘁𝗮 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲𝘀 𝗳𝗿𝗼𝗺 𝗦𝗰𝗿𝗶𝗽𝘁𝘀 𝘁𝗼 𝗔𝗶𝗿𝗳𝗹𝗼𝘄 + 𝗗𝗼𝗰𝗸𝗲𝗿. 🐳

Is Airflow overkill for a cleaning pipeline? A simple Python script is definitely faster to write, but it fails at scale. Here is the engineering reasoning behind the architecture (see diagram). 👇

The Problem with "Just Scripts"
A simple Python script works fine... until it crashes halfway through. Then you have to figure out:
- Which row did it fail on?
- Do I re-run the whole thing? (Duplicating data.)
- How do I run this on a different machine without dependency hell?

💡 The "Containerized Orchestration" Solution
I didn't just build a DAG; I built a resilient ecosystem.

Environment Isolation (Docker)
The entire Airflow instance runs in a Docker container. This means the pipeline is "Environment Agnostic." It runs exactly the same on my laptop as it would on an AWS EC2 instance. No "it works on my machine" excuses.

Atomicity (The 7-Step Chain)
I broke the cleaning logic into 7 atomic tasks. Why? If Step 4 (Imputation) fails, Airflow retries only Step 4. It doesn't waste time re-ingesting the data (Step 1).

Observability
With a standard script, logging is a mess. With this setup, I have a UI that visually flags bottlenecks in the transformation logic.

The Takeaway
Over-engineering? Maybe for a small CSV. But for LLMOps, where bad data = hallucinating models, reliability is not optional.

#DataEngineering #Docker #ApacheAirflow #DevOps #Python #Architecture
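The per-task retry semantics that make atomic steps worthwhile can be mimicked in plain Python to see the idea in isolation. This is a conceptual sketch, not Airflow's API; the flaky imputation task is contrived:

```python
# Sketch: retry one atomic task without re-running the rest of the pipeline.
import functools

def retry(times: int = 3):
    """Re-run a task up to `times` attempts, like a per-task retry policy."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_err = None
            for _ in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception as err:
                    last_err = err
            raise last_err
        return wrapper
    return deco

calls = {"n": 0}

@retry(times=3)
def impute_step():
    calls["n"] += 1
    if calls["n"] < 3:          # fails twice, succeeds on the third attempt
        raise RuntimeError("transient failure")
    return "imputed"

result = impute_step()  # earlier pipeline steps are never re-executed
```

In Airflow, the equivalent is setting `retries` on the task: only the failed task re-runs, upstream results stay cached.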
I decided to set up Kubeflow Pipelines locally to better understand MLOps workflows. The setup was straightforward, but I hit some interesting errors where the UI worked while the backend services were stuck. Troubleshooting those dependencies taught me how the architecture actually connects. I documented the process and the fix in my blog post. Check it out here: https://lnkd.in/dNByqzSz #MLOps #Kubernetes #Kubeflow #medium
🤖 𝐀𝐈-𝐍𝐚𝐭𝐢𝐯𝐞 𝐓𝐨𝐝𝐨 𝐒𝐲𝐬𝐭𝐞𝐦 | 𝐒𝐩𝐞𝐜𝐊𝐢𝐭 𝐏𝐥𝐮𝐬 𝐇𝐚𝐜𝐤𝐚𝐭𝐡𝐨𝐧 𝐈𝐈 | 𝐀𝐥𝐥 𝐏𝐡𝐚𝐬𝐞𝐬 𝐂𝐨𝐦𝐩𝐥𝐞𝐭𝐞

Two weeks ago, this project started as a simple Python CLI todo app. Today, it is a production-grade, AI-native, event-driven system deployed on Kubernetes with conversational AI and distributed services. This hackathon was not about features. It was about learning how modern software is actually built in 2025.

1️⃣ 𝐅𝐫𝐨𝐦 𝐂𝐋𝐈 → 𝐂𝐥𝐨𝐮𝐝 → 𝐀𝐈
This was a single system evolved across five tightly connected phases. Each phase introduced real architectural complexity, not cosmetic changes.
• Phase I — Python CLI, in-memory storage, spec-driven foundations
• Phase II — Full-stack web app, JWT auth, PostgreSQL persistence
• Phase III — AI chatbot mode using OpenAI Agents + MCP tools
• Phase IV — Docker, Kubernetes (Minikube), Helm, cluster ops
• Phase V — Production DOKS deployment, Kafka, Dapr, CI/CD
All phases delivered on time.

2️⃣ 𝐖𝐡𝐚𝐭 𝐭𝐡𝐞 𝐒𝐲𝐬𝐭𝐞𝐦 𝐈𝐬 𝐓𝐨𝐝𝐚𝐲
This is no longer a “todo app”.
• Multi-user, secure, full-stack application
• Conversational task management via AI agents
• Event-driven microservices with Kafka
• Distributed runtime powered by Dapr
• Fully containerized and orchestrated on Kubernetes

3️⃣ 𝐀𝐈-𝐍𝐚𝐭𝐢𝐯𝐞 𝐂𝐚𝐩𝐚𝐛𝐢𝐥𝐢𝐭𝐢𝐞𝐬 (𝐏𝐡𝐚𝐬𝐞 𝐈𝐈𝐈)
• Natural-language task creation and updates
• AI agents that execute actions, not just respond
• MCP tools for safe, structured agent operations
• Context-aware conversations backed by real database state

4️⃣ 𝐀𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞 𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬
• Frontend: Next.js, TypeScript, Tailwind, ChatKit
• Backend: FastAPI, OpenAI Agents SDK, SQLModel
• Data: Neon PostgreSQL, Redis caching
• Events: Kafka (Redpanda) with Dapr Pub/Sub
• Infra: Docker, Kubernetes, Helm, DigitalOcean DOKS

5️⃣ 𝐖𝐡𝐲 𝐒𝐩𝐞𝐜-𝐃𝐫𝐢𝐯𝐞𝐧 𝐃𝐞𝐯𝐞𝐥𝐨𝐩𝐦𝐞𝐧𝐭 𝐌𝐚𝐭𝐭𝐞𝐫𝐞𝐝
• Specifications acted as the single source of truth
• AI-generated code followed the architecture, not guesses
• Frontend, backend, and agents stayed consistent
• Scaling from CLI to Kubernetes never broke the system

6️⃣ 𝐊𝐞𝐲 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠𝐬
• Clear specs beat fast coding
• Event-driven systems scale better
• AI is powerful when guided, useless when vague
• Kubernetes is a baseline, not an advanced topic
• Architecture matters more than frameworks

7️⃣ 𝐖𝐡𝐚𝐭’𝐬 𝐍𝐞𝐱𝐭
Hackathon II is complete, but the journey continues. Hackathon III is coming soon.

🔗 Project Links
🌐 Live App: https://lnkd.in/eePjQ8ue

Credits & Community
Panaversity | PIAIC | Governor Sindh Initiative for GenAI, Web3, and Metaverse
Zia Khan | Muhammad Qasim | Ameen Alam | Maryam Mumtaz Ali Aftab Sheikh | Muhammad Junaid Shaukat

#AINative #SpecDrivenDevelopment #HackathonII #SpecKitPlus #OpenAI #Agents #MCP #Kubernetes #Kafka #Dapr #CloudNative #SoftwareArchitecture #FullStack #PIAIC #GIAIC
Launching SPECTRA 🚀 An ML-Powered Log Analytics & Observability Platform

I’m excited to share my latest project, Spectra – an intelligent observability platform designed to detect system anomalies using Machine Learning. Building this challenged me to combine scalable API engineering with data science pipelines. It’s not just about storing logs; it’s about using unsupervised learning to proactively identify critical system failures.

🛠 Tech Stack:
• Backend & API: Python, FastAPI, Pydantic (Async Architecture)
• Machine Learning: Scikit-learn (Isolation Forest), TF-IDF Vectorization
• Database: PostgreSQL (SQLAlchemy Async)
• Frontend: React 19, Recharts, TailwindCSS
• DevOps: Render (Dockerized Backend) & GitHub Pages

💡 Key Features:
• AI-Driven Anomaly Detection: Uses Isolation Forest models to flag irregular log patterns.
• High-Performance API: Fully asynchronous REST endpoints for handling concurrent data streams.
• Real-time Visualization: Dynamic dashboards powered by React and Recharts.
• Secure Auth: Integrated Google OAuth 2.0 flow.

Check it out live here: https://lnkd.in/gbHmWmRF

I’d love to hear your feedback on the ML pipeline or API structure!

#machinelearning #backend #python #fastapi #datascience #deeplearning #fullstack #hiring
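The TF-IDF + Isolation Forest core of this kind of anomaly detector can be sketched in a few lines. A toy illustration only, not Spectra's actual pipeline; the sample log lines and `contamination` value are invented:

```python
# Sketch: vectorize log lines with TF-IDF, score them with Isolation Forest.
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

logs = [
    "INFO user login ok",
    "INFO user login ok",
    "INFO request served in 12ms",
    "INFO user logout ok",
    "CRITICAL disk failure on node 7",
]

X = TfidfVectorizer().fit_transform(logs).toarray()
clf = IsolationForest(contamination=0.2, random_state=0).fit(X)
labels = clf.predict(X)  # +1 = normal, -1 = flagged as anomalous
```

Because Isolation Forest is unsupervised, no labeled "bad log" examples are needed; lines whose token patterns isolate quickly from the rest tend to get flagged.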
🚀 Pydantic Data Modeling

I’ve been learning Pydantic and strengthening my understanding of structured data validation in Python. Some of the key concepts I’ve covered include:
✅ Basic models and schema design
✅ Field types, built-in data types, and aliasing
✅ Data validations at field and model levels
✅ Nested models for clean, reusable structures
✅ Recursive models for hierarchical data representations

💡 Why Pydantic matters:
• Ensures data correctness early, reducing runtime bugs
• Acts as a clear contract between different system components
• Improves maintainability by keeping validation close to the data
• Widely used in modern Python frameworks like FastAPI

💡 Pydantic in Data Engineering
✅ Pydantic enforces clear schemas between ingestion, transformation, and storage stages, reducing schema drift in data pipelines.
✅ Early data validation – it catches incorrect or malformed data at ingestion time, preventing bad data from propagating downstream.
✅ Simplifies working with semi-structured data.
✅ Explicit schemas and validations lead to predictable behavior and easier debugging in batch and streaming systems.
✅ Widely used in modern data stacks – commonly paired with FastAPI, Airflow, Kafka, and ML pipelines as a lightweight validation layer.

Link to class/model implementations: https://lnkd.in/gEwgWBMJ
Link to Article: https://lnkd.in/gBr-8Wap

Thank you DataCamp for the detailed article on Pydantic models. 🚀

One important point: Pydantic follows a keyword-argument (kwarg) approach, where data is explicitly mapped to field names. This makes schemas predictable, readable, and much safer when working with external or untrusted data.

#Python #Pydantic #BackendDevelopment #DataValidation #LearningJourney #APIDesign #DataEngineering #databricks
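A minimal example ties several of these concepts together: a nested model, keyword-argument construction, and early failure on malformed input. The field names here are made up for illustration:

```python
# Sketch: nested Pydantic models with validation at construction time.
from pydantic import BaseModel, ValidationError

class Address(BaseModel):
    city: str
    zip_code: str

class User(BaseModel):
    name: str
    age: int
    address: Address           # nested model, validated recursively

# Valid payload: kwargs map explicitly to fields; the nested dict
# is coerced into an Address model
user = User(name="Ada", age=36, address={"city": "London", "zip_code": "N1 9GU"})

# Malformed payload: "age" can't be parsed as an int, so validation
# fails immediately instead of letting bad data propagate downstream
try:
    User(name="Bob", age="not-a-number", address={"city": "X", "zip_code": "Y"})
    caught = False
except ValidationError:
    caught = True
```

In a pipeline, the same models can validate records at the ingestion boundary, so later stages can rely on the schema holding.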