🚀 Day 9/10 — Optimization Series: Config-Driven Pipelines (Avoid Hardcoding)

👉 Basics are done.
👉 Now we move from working code → optimized code.

You build a pipeline… it works perfectly… but you hardcode everything 😐

file_path = "data/sales_2024.csv"
api_url = "https://lnkd.in/gsfHEDWP"

👉 Looks simple… but becomes a problem later.

🔹 The Problem
Hard to update values ❌
Not reusable ❌
Breaks across environments ❌

🔹 What is the Config-Driven Approach?
👉 Move all dynamic values into a config file.

🔹 Example (config.json)
{
  "file_path": "data/sales_2024.csv",
  "api_url": "https://lnkd.in/gsfHEDWP"
}

🔹 Use it in Python
import json

with open("config.json") as f:
    config = json.load(f)

file_path = config["file_path"]
api_url = config["api_url"]

🔹 Why This Matters
Easy to update 🔄
Reusable pipelines ♻️
Environment-friendly 🌍

🔹 Real-World Use
👉 Dev / Test / Prod configs (see the sketch below)
👉 Data pipelines
👉 API integrations

💡 Quick Summary
Config-driven = flexible + scalable pipelines

💡 Something to remember
If your values change often… they don’t belong in your code.

#Python #DataEngineering #LearningInPublic #TechLearning
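A hedged sketch of the Dev / Test / Prod idea above (the per-environment file names and the APP_ENV variable are illustrative assumptions, not from the original post):

import json
import os

def load_config(env=None):
    # Pick the environment from APP_ENV (hypothetical variable), defaulting to "dev"
    env = env or os.getenv("APP_ENV", "dev")
    # One file per environment, e.g. config.dev.json / config.test.json / config.prod.json
    with open(f"config.{env}.json") as f:
        return json.load(f)

config = load_config()
file_path = config["file_path"]
api_url = config["api_url"]

The same pipeline code then runs unchanged in every environment; only the config file it reads differs.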
Stop writing "Airflow boilerplate" and start writing actual Python.

If your DAGs still look like a tangled web of PythonOperator and manual xcom_pull calls, you aren’t just building pipelines, you’re doing manual plumbing. It’s time to lean into the TaskFlow API.

Here is why TaskFlow is quietly becoming the gold standard for Data Engineers:

1. The "Pythonic" Dream
Traditional Airflow forces you to wrap every function in an operator and manually set task_id. With TaskFlow, a simple @task decorator is all you need. Your functions stay functions, and your code stays readable.

2. XComs that actually flow
The old way of moving data required explicit pushes and pulls that felt like sending telegrams between tasks.
Old way: task_instance.xcom_pull(task_ids='get_data')
TaskFlow: data = get_data()
It’s that simple. Airflow handles the backend plumbing while you focus on the logic.

3. Less Code, Fewer Bugs
By removing the need for bitshift operators (>>) and redundant configuration, you're looking at a 40-60% reduction in boilerplate. Clean code isn't just a "nice-to-have"; it's less surface area for bugs to hide.

Is the classic PythonOperator dead? Not entirely. It still has its niche for specific legacy patterns. But for custom logic? If you aren't using @task, you're working harder, not smarter.

Are you still bitshifting your way through life, or have you embraced the decorator?

#DataEngineering #Airflow #Python #TaskFlow #BigData #CodeQuality
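For reference, a minimal TaskFlow sketch (assuming Airflow 2.4+; the task names and values are illustrative, not from the post):

import pendulum
from airflow.decorators import dag, task

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def etl():
    @task
    def get_data():
        # Return value becomes an XCom automatically
        return {"rows": 42}

    @task
    def load(data):
        print(f"Loading {data['rows']} rows")

    # Dependency and data passing in one line, no >> and no xcom_pull
    load(get_data())

etl()

The return value of get_data() flows into load() as an XCom behind the scenes, which is exactly point 2 above.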
Recursion used to confuse me a lot… until I started thinking of it like a corporate manager delegating work. 😅 🌳 Maximum Depth of Binary Tree (LeetCode 104 — Easy | Blind 75) Whenever I saw a Binary Tree problem, I used to panic. Keeping track of depth while moving through the tree felt overwhelming. The question was always: 👉 How do I keep track of everything at once? I realized something simple but powerful: 👉 I don’t need to track everything myself — I can ask my subtrees Think of each node as a Manager. The manager asks the left subtree: “What’s your max depth?” Then asks the right subtree: “What’s your max depth?” Takes the maximum of both and adds 1 (for itself) That’s exactly what this does: 👉 1 + max(left_depth, right_depth) 🔑 Key Learnings: ✔️ Base Case is your foundation If there’s no node, return 0 → Like an employee saying: “No team under me.” ✔️ Trust Recursion Don’t try to track every path manually → Assume subproblems are solved correctly ✔️ Recognize the Pattern This is a classic Depth-First Search (DFS) approach (bottom-up) ⚙️ Complexity Time Complexity: O(N) — visit each node once Space Complexity: O(H) — recursion stack (tree height) Trees used to be my biggest fear… Now they’re becoming one of my favorite topics Curious to hear from you 👇 What’s a data structure or concept that finally “clicked” for you recently? #LeetCode #BinaryTrees #Blind75 #DataStructures #Recursion #DFS #ProblemSolving #CodingJourney #Python #TechCareers
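The whole manager analogy fits in a few lines of Python (assuming the standard LeetCode TreeNode with left and right attributes):

def max_depth(root):
    # Base case: no node means "no team under me"
    if root is None:
        return 0
    # Ask both subtrees for their depth, then add 1 for the current node
    return 1 + max(max_depth(root.left), max_depth(root.right))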
New blog post. You've finished developing an ML model with {tidymodels}, and you're ready to automate it in Dagster. You hand things off to data engineering. Their reply: "Sorry, we need this rewritten in Python to deploy." But the model pipeline code is solid. It's wrapped in an R package; there's good test coverage, a {pkgdown} website documenting everything, the works. It's just written in R. Do we really need to do all of that work all over again? Not anymore. I built the R package {dagsterpipes} to solve this problem. It implements Dagster's Pipes Protocol for the R language, allowing you to run R code inside of Dagster without losing its logging and observability features. Walkthrough with a working example in the post: https://lnkd.in/gfxjadQy #rstats
When code runs millions of times a day, even minor enhancements lead to significant compute savings. So I built xmltodict-fast. 🦀🐍

xmltodict is a Python library many of us use without a second thought. With ~5K GitHub stars, it’s a quiet workhorse powering ETL pipelines, SOAP clients, and invoice processors.

xmltodict-fast is a drop-in replacement that keeps the same public API but rewrites the performance-critical sections in Rust using PyO3 and quick-xml. Importantly: if the Rust extension isn't available on a platform, it seamlessly falls back to the original Python implementation, so it's completely safe for incremental adoption.

Local benchmarks:
🚀 parse(): 2.1× faster on typical XML
🚀 unparse(): 5.9× faster (massive for serialization-heavy workflows)
On pathologically deep XML (500+ nesting levels), the Rust version is actually slower. :(

(Side note: thanks to my kind and patient AI coding assistant for helping me build this!)

If you work with XML in Python, I welcome your feedback, testing, and pull requests!
🔗 Repo & Benchmarks: https://lnkd.in/exhfBuD7

#Python #RustLang #PyO3 #OpenSource #DataEngineering #PerformanceOptimization
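Since the fork keeps xmltodict's public API, here is a quick reminder of what that API looks like (shown with the original xmltodict import; the fast fork's exact import name is whatever its README documents, so treat that substitution as an assumption):

import xmltodict  # or the drop-in fork, per its README

doc = xmltodict.parse("<order id='1'><item qty='2'>widget</item></order>")
print(doc["order"]["item"]["#text"])   # "widget"
print(doc["order"]["item"]["@qty"])    # "2"

xml = xmltodict.unparse(doc, pretty=True)  # back to an XML string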
Half my context window was gone before I typed a single prompt.

Claude Code indexed my entire monorepo at session start — Python files, Airflow DAGs, three months of task logs. Then it generated a migration that referenced a table that doesn't exist.

I spent weeks rebuilding my project setup from scratch. Token usage dropped over 60%. But the real win was rework time going down significantly.

Here's what actually moved the needle:
- permissions.deny in settings.json — the official way to block files Claude shouldn't read. Read(./.env), Read(./airflow/logs/), Read(./.venv/). The airflow/logs line alone cut 15%. (Config sketch below.)
- .claudeignore — an unofficial shortcut that works like .gitignore. Not in the docs yet, but a lot of people use it. Same result, cleaner syntax.
- CLAUDE.md hierarchy — root file under 200 lines. Subdirectory files load only when needed. Past 200 lines, Claude starts treating your instructions as optional.
- MCP servers (BigQuery + Airflow) — live database access without pre-loading schemas into context. Deferred by default, costs almost nothing until Claude actually queries one.
- Skills & agents — on-demand workflows at ~100 tokens each instead of 3,000-5,000 tokens baked into CLAUDE.md every session.
- /compact and /context — the two commands I run multiple times a day to manage what's eating my context window.

30 minutes of setup. Every session after that starts lean.

Full walkthrough with real configs from a data pipeline project: https://lnkd.in/gaNuSUta

What does your Claude Code project setup look like? Are you using permissions.deny or .claudeignore — or just letting it index everything?

#AICoding #SoftwareEngineering #DataEngineering #ClaudeCode #DeveloperTools #AIEngineering #SystemDesign
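A rough sketch of the settings.json idea from the first bullet (the deny patterns are quoted straight from the post; the exact JSON nesting reflects how Claude Code documents permissions, so verify it against your version):

{
  "permissions": {
    "deny": [
      "Read(./.env)",
      "Read(./airflow/logs/)",
      "Read(./.venv/)"
    ]
  }
}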
Most “learn data engineering” tutorials fail for one reason: they make you set up everything before you learn anything.
→ 6-hour videos → Docker pain → config hell → you quit by Sunday

So I built something different.

Skilance - a free, browser-based path to actually learn data engineering by doing. No setup. No accounts. No cost. You write real Python. Pipelines actually run. Concepts click faster.

What you’ll learn (the real stuff teams use):
• Airflow (DAGs, scheduling, retries)
• SQL for pipelines (CTEs, window functions)
• dbt (models, tests, macros)
• Spark & PySpark
• Kafka & streaming
• DataOps (lineage, contracts, CI)

Airflow just launched: 8 hands-on lessons + runnable DAGs + visual skill map

👉 Check it out: https://lnkd.in/ghvrudwb

Would genuinely love your feedback - what works, what doesn’t?

#DataEngineering #Python #OpenSource #BuildInPublic #LearnInPublic 😊
Let’s talk about something fun and interesting I did quite a while ago. I optimized a keyword-driven query system, focusing on improving throughput and stability under constraints.

The core problem: maximize queries/hour while avoiding conflicts, throttling, and system instability.

Key optimizations:
• Parallel processing with controlled concurrency
• Keyword-based query pipeline for structured input distribution
• User-agent rotation to distribute request patterns
• Retry + backoff mechanisms for handling transient failures
• Idempotent execution to avoid duplicate processing

One interesting tweak that made a noticeable difference: I introduced a keyword expansion strategy, combining each keyword with incremental alphabet variations (e.g., keyword + a, keyword + b, ...). This helped:
• Increase result coverage without changing the core keyword set
• Avoid repetitive query patterns
• Improve overall discovery efficiency per keyword

(A rough sketch of the expansion + backoff idea is below.)

After multiple iterations, the system stabilized at ~70 leads/hour, up from ~15–20 leads/hour, with consistent performance.

This was one of the most interesting things I worked on. It may not be flashy, but it shows how such a small change can have such a great impact!

Curious to know your thoughts!

#Optimizations #Python #Software #SaaS
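The post doesn't share code, so here is a minimal, self-contained sketch of the two ideas (all function names, the failure simulation, and the retry parameters are illustrative assumptions, not the author's implementation):

import random
import string
import time

def expand_keywords(keywords):
    # "data engineer" -> "data engineer a", "data engineer b", ..., "data engineer z"
    return [f"{kw} {letter}" for kw in keywords for letter in string.ascii_lowercase]

def run_query(keyword):
    # Stand-in for the real query call; randomly fails to simulate throttling
    if random.random() < 0.3:
        raise ConnectionError("throttled")
    return [f"lead for '{keyword}'"]

def query_with_backoff(keyword, max_retries=3):
    # Retry transient failures with exponential backoff plus jitter
    for attempt in range(max_retries):
        try:
            return run_query(keyword)
        except ConnectionError:
            time.sleep((2 ** attempt) + random.random())
    return []

results = [lead
           for kw in expand_keywords(["data engineer"])
           for lead in query_with_backoff(kw)]
print(f"collected {len(results)} leads")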
📣 SynapseKit v1.4.7 + v1.4.8 just dropped. Back to back. Huge thanks to Dhruv Garg and Abhay Krishna who drove most of this sprint. 🙌

Two themes in these releases: getting data in, and making workflows resilient.

Getting data in: 5 new loaders
The gap between "I have a RAG pipeline" and "I can actually feed it my company's data" is a loader problem. These close it:
📨 SlackLoader — pull channel messages directly into your pipeline
📝 NotionLoader — ingest pages and databases from Notion
📖 WikipediaLoader — single article or multiple, pipe-separated
📄 ArXivLoader — search arXiv, download PDFs, extract text automatically
📧 EmailLoader — any IMAP mailbox, stdlib only, zero extra dependencies
SynapseKit now has 24 loaders. Your data is probably already covered.

Better retrieval — ColBERT
ColBERTRetriever brings late-interaction ColBERT via RAGatouille. Instead of comparing a single query vector against a single document vector, ColBERT scores every query token against every document token (MaxSim). On long documents the recall improvement is significant: single-vector approaches lose detail in the compression. Token-level scoring doesn't.

Resilient graph workflows
Subgraph error handling now ships with three strategies — retry with backoff, fallback to an alternative graph, skip and continue. Production workflows break. The question is whether they break gracefully.

Where SynapseKit stands today:
27 providers · 9 vector backends · 42 tools · 24 loaders · 2 hard dependencies

⚡ pip install synapsekit==1.4.8
📖 https://lnkd.in/dvr6Nyhx
🔗 https://lnkd.in/d2fGSPkX

#Python #LLM #RAG #AI #OpenSource #MachineLearning #Agents #SynapseKit
I built a complete 𝗨𝘀𝗲𝗱 𝗖𝗮𝗿 𝗣𝗿𝗶𝗰𝗲 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗼𝗿 from scratch, creating a full end-to-end pipeline that handles everything from raw data to a live application. Instead of relying on a pre-built dataset, I identified a unique problem and built my own data source using web scraping. My goal was to move beyond tutorials and mimic a real-world 𝗱𝗮𝘁𝗮 𝘀𝗰𝗶𝗲𝗻𝗰𝗲 workflow.

• 𝗦𝗰𝗿𝗮𝗽𝗶𝗻𝗴: Automated data collection to get real-time market prices.
• 𝗣𝗿𝗲𝗽𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴: Cleaning messy web data into a machine-learning-ready format.
• 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴: Training a robust regressor to find the patterns.
• 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁: Building a Flask web app to make the model accessible to anyone.

The Workflow: 𝗦𝗰𝗿𝗮𝗽𝗲 𝗗𝗮𝘁𝗮 → 𝗖𝗹𝗲𝗮𝗻 & 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺 → 𝗧𝗿𝗮𝗶𝗻 𝗠𝗼𝗱𝗲𝗹 → 𝗗𝗲𝗽𝗹𝗼𝘆 (a minimal sketch of the deployment step is below)

#MachineLearning #DataScience #Python #Flask #WebScraping #PortfolioProject

Check out the full documentation and code on GitHub: https://lnkd.in/gAZp4iKq
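The post doesn't include code, so here is a hedged sketch of what the 𝗗𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁 step typically looks like with Flask (model.pkl and the feature names are placeholders, not the author's actual artifacts):

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained regressor once at startup (hypothetical artifact name)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Feature order must match what the model was trained on (illustrative names)
    features = [[payload["year"], payload["mileage"], payload["engine_size"]]]
    price = model.predict(features)[0]
    return jsonify({"predicted_price": float(price)})

if __name__ == "__main__":
    app.run(debug=True)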
Model Serialization Deployment using modelkit
#machinelearning #datascience #modelserializationdeployment #modelkit

modelkit is a minimalist yet powerful MLOps library for Python, built for people who want to deploy ML models to production. It packs several features which make your go-to-production journey a breeze, and ensures that the same exact code will run in production, on your machine, or on data processing pipelines.

Features
Wrapping your prediction code in modelkit instantly gives access to all features:
• fast: Model predictions can be batched for speed (you define the batching logic) with minimal overhead.
• composable: Models can depend on other models, and evaluate them however you need to.
• extensible: Models can rely on arbitrary supporting configuration files called assets, hosted on local or cloud object stores.
• type-safe: Models' inputs and outputs can be validated by pydantic; you get type annotations for your predictions and can catch errors with static type analysis tools during development.
• async: Models support async and sync prediction functions. modelkit supports calling async code from sync code so you don't have to suffer from partially async code.
• testable: Models carry their own unit test cases, and unit testing fixtures are available for pytest.
• fast to deploy: Models can be served in a single CLI call using fastapi.

In addition, you will find that modelkit is:
• simple: Use pip to install modelkit; it is just a Python library.
• robust: Follow software development best practices: version and test all your configurations and artifacts.
• customizable: Go beyond off-the-shelf models with custom processing, heuristics, business logic, different frameworks, etc.
• framework agnostic: Bring your own framework to the table, and use whatever code or library you want. modelkit is not opinionated about how you build or train your models.
• organized: Version and share your ML library and artifacts with others, as a Python package or as a service.
• fast to code: Just write the prediction logic and that's it. No cumbersome pre- or post-processing logic, branching options, etc. The boilerplate code is minimal and sensible.

https://lnkd.in/genAAUCg