Is writing custom Python scripts for ingestion a sign of seniority, or a sign of inefficiency? 🐍💻

In 2026, the "hand-coded" vs. "low-code" debate has moved past the surface level. We are finally asking the right question: where should a data engineer spend their "Code Capital"?

If you are still writing boilerplate scripts to move data from a Postgres database to an S3 bucket, you might be falling into the Maintenance Trap. Here is why the industry is shifting toward a hybrid model:

1. The Maintenance Trap 🪤
Writing the first script is fun. Maintaining 100+ individual ingestion scripts is a nightmare. Every time an API version changes, a primary key is renamed, or a source schema drifts, your weekend is gone. Managed ELT tools like Airbyte or Fivetran treat these "connectors" as a commodity, handling the boring parts so you don't have to.

2. Spend Your "Code Capital" Wisely 💎
Your time is your most valuable asset. Spending it on basic data movement is like an architect laying bricks—it's necessary work, but it's not where the value is created.
The Rule: use low-code for the "pipes" (ingestion). Save the custom Python/SQL for the "engine" (transformations, business logic, and complex SCD logic).

3. The Hybrid Reality 🛠️
Low-code isn't a silver bullet. High-seniority engineering comes into play when you hit the limits of a managed tool:
→ Complex API rate limits: when you need custom backoff strategies (a rough sketch follows below).
→ Deeply nested JSON: when the out-of-the-box flattener creates a mess.
→ Proprietary sources: when a pre-built connector simply doesn't exist.

4. Productivity = Control + Speed 🚀
Seniority in 2026 isn't about how much code you write; it's about how much value you deliver with the least amount of code to maintain. Choosing a managed tool for 80% of your sources allows you to focus 100% of your energy on the 20% that actually drives business insights.

The Bottom Line: don't be a "script collector." Be a Platform Architect. Build systems that scale, not just scripts that run.

Are you still writing custom ingestion code for standard sources, or have you made the leap to fully managed ELT? Let's hear your take in the comments! 👇

#DataEngineering #Airbyte #Python #ETL #ModernDataStack #DataArchitecture #CloudComputing #SoftwareEngineering #DataOps #BigData
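To make the "custom backoff" case from point 3 concrete, here is a rough, minimal sketch of a retry-with-backoff wrapper for a rate-limited API, the kind of logic managed connectors rarely let you tune. The URL, status codes, and delays are illustrative assumptions, not a specific tool's behavior.

import random
import time

import requests

def fetch_with_backoff(url, max_attempts=5, base_delay=1.0):
    # Retry on HTTP 429/5xx with exponential backoff and jitter (illustrative sketch)
    for attempt in range(1, max_attempts + 1):
        response = requests.get(url, timeout=30)
        if response.status_code not in (429, 500, 502, 503, 504):
            response.raise_for_status()   # raise on other 4xx, return on success
            return response.json()
        if attempt == max_attempts:
            response.raise_for_status()   # out of retries: surface the error
        # Honor Retry-After if the API provides it, otherwise back off exponentially
        retry_after = response.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else base_delay * 2 ** (attempt - 1)
        time.sleep(delay + random.uniform(0, 0.5))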
DataArch Consultancy LLP’s Post
More Relevant Posts
Our database ran out of connections at 3 AM. Every pipeline stopped. Every report failed. My phone was ringing at 3:15 AM.

The cause? I had been leaking database connections for 3 months. Every pipeline run opened a new connection. None of them ever closed.

The fix was 2 lines of Python. I just didn't know they existed. 👇
────────────────
What was happening:

# BEFORE — connection never closes if code crashes
conn = get_db_connection()
cursor = conn.cursor()
cursor.execute("SELECT * FROM orders")
results = cursor.fetchall()
# if ANYTHING crashes above — conn stays open forever
# 100 pipeline runs = 100 open connections
conn.close()  # never reached on error
────────────────
The fix — a Python context manager:

from contextlib import contextmanager

@contextmanager
def get_connection(db_config):
    conn = get_db_connection(db_config)
    try:
        yield conn        # your code runs here
    finally:
        conn.close()      # ALWAYS runs — crash or success

# Now use it with the 'with' keyword
with get_connection(config) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM orders")
    results = cursor.fetchall()
# connection closed here — automatically,
# even if cursor.execute() crashes halfway
────────────────
Why this works: the finally block runs no matter what. Success → closes the connection. Crash → closes the connection. Timeout → closes the connection.

The with keyword is Python's way of saying: "Use this resource. I'll handle the cleanup."
────────────────
4 places every data engineer should use this:
→ Database connections (never leave them open)
→ File handles (always close after reading)
→ Spark sessions (release cluster resources)
→ Temp directories (auto-cleanup after processing)
(The last two are sketched in code below.)
────────────────
That 3 AM call cost us 4 hours of downtime. Two lines of Python would have prevented all of it.

Context managers are not advanced Python. They are basic production hygiene.

What's your most painful Python mistake in prod? Drop it below 👇

#Python #DataEngineering #ETL #DataEngineer #PythonProgramming #DataPipeline #BestPractices #SoftwareEngineering #TechTips #OpenToWork #DataCommunity #HiringDataEngineers #100DaysOfPython #Databricks
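To make the last two items on that list concrete, here is a rough sketch of the same pattern applied to a Spark session, alongside the standard-library context manager for temp directories. The app name and paths are placeholders, not part of the original story.

from contextlib import contextmanager
import tempfile

from pyspark.sql import SparkSession

@contextmanager
def spark_session(app_name="nightly-load"):   # placeholder app name
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    try:
        yield spark
    finally:
        spark.stop()   # release cluster resources even if the job fails

# tempfile.TemporaryDirectory() is already a context manager, no custom code needed
with tempfile.TemporaryDirectory() as tmp_dir, spark_session() as spark:
    df = spark.read.json("/mnt/raw/orders/")                 # placeholder source path
    df.write.mode("overwrite").parquet(f"{tmp_dir}/staged")  # staged output
# tmp_dir is deleted and the session stopped here, success or crash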
🚨 Every data team has that one Python script.

You know the one. Someone wrote it "just for now" two years ago. It's still running in production. No retries. No logging. Hardcoded credentials. And every time it breaks at 3 AM, someone has to SSH into a server and pray.

I just published a new article on what actually separates a script from a pipeline. Spoiler: it's not complexity. It's whether the code was designed to fail gracefully.

In the article, I cover:
⚙️ Why idempotency is the single most important property your pipeline can have, and how to test it in 30 seconds (a minimal sketch follows below)
🔁 How to handle transient vs. permanent errors the right way
🔐 The Twelve-Factor config test: could you open source your codebase right now without leaking credentials?
📊 Why print() is not observability, and what to log instead
🧪 The uncomfortable truth about data testing: only 3% of tests are business-logic tests
🚫 The notebook trap and other anti-patterns killing your pipelines in production

If your team is stuck between "it works on my laptop" and "production grade," this one is for you.

Read it here 👉 https://lnkd.in/dwMDTUSD
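The idempotency point is easy to sketch outside the article: design each load so that re-running it for the same partition leaves the table in exactly the same state, then test it by running it twice and checking that the row count doesn't change. The table name, columns, and delete-then-insert strategy below are illustrative assumptions, not taken from the article.

import sqlite3

def load_partition(conn: sqlite3.Connection, rows: list[tuple], ds: str) -> None:
    # Idempotent by construction: wipe the partition, then rewrite it in one transaction.
    # Running this twice for the same ds produces exactly the same end state.
    with conn:  # sqlite3 connection as a context manager: commit on success, rollback on error
        conn.execute("DELETE FROM fact_orders WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO fact_orders (ds, order_id, amount) VALUES (?, ?, ?)",
            [(ds, order_id, amount) for order_id, amount in rows],
        )

# The 30-second test: run it twice for the same ds and assert the row count is unchanged.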
Here's my Ultimate Advanced Python Tricks Cheatsheet for Data Analysts:
(Save this - these are the ones that actually matter in real work)

Every analyst knows pd.read_csv() and df.head(). The ones getting promoted know what comes after that.

Here are 15 advanced Python tricks that separate junior analysts from senior ones 👇

1. Memory-Optimized Data Loading
Specify data types while loading to reduce memory and speed up processing.

2. Select Columns Efficiently
Always load only the columns you need — never the entire dataset.

3. Conditional Filtering with Multiple Rules
Apply complex business logic to slice data precisely in one line.

4. Vectorized Feature Engineering
Multiply columns directly instead of looping — faster and more scalable.

5. Use query() for Cleaner Filtering
Write SQL-like filter conditions that are readable and easy to maintain.

6. Advanced GroupBy with Multiple Aggregations
Generate sum, mean, and max insights across categories in one operation.

7. Window Functions, SQL Style
Rank rows within groups directly in Python — exactly like SQL window functions.

8. Rolling Window Analysis
Calculate 7-day moving averages to smooth trends for time-series reporting.

9. Handle Missing Data Strategically
Fill nulls with the median — it preserves the distribution instead of distorting it.

10. Efficient Deduplication with Priority
Sort by date first, then drop duplicates — keeps the most recent record per user.

11. Merge Datasets Like SQL Joins
Combine two dataframes on a key column exactly like a SQL LEFT JOIN.

12. Pivot Tables for Quick Reporting
Summarize revenue by category and region instantly without building a dashboard.

13. Explode Nested Data
Transform list-like columns into individual rows for deeper, more granular analysis.

14. Apply Custom Functions Efficiently
Use np.where for conditional logic - significantly faster than apply() on large datasets.

15. Chain Operations for Clean Pipelines
Drop nulls, filter, and engineer features in one readable chained expression.

(A few of these are sketched in code below.)

Most analysts use Python like a calculator. Senior analysts use it like a pipeline.

The difference is not knowing more functions. It is knowing how to chain them together to go from raw, messy data to a clean business insight in minutes.

Save this. Practice each one on a real dataset. Watching is not learning. Building is.

Which of these are you not using yet?

♻️ Repost to help someone level up their Python skills
💭 Tag a data analyst who needs to see this
📩 Get my full Python analytics guide: https://lnkd.in/gjUqmQ5H
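A rough sketch of what tricks 1, 2, 4, 5, 6, 14, and 15 look like when chained together. The file name, columns, and thresholds are invented for illustration; a real dataset needs its own dtypes and business rules.

import numpy as np
import pandas as pd

# Tricks 1 & 2: load only the columns you need, with explicit dtypes
df = pd.read_csv(
    "orders.csv",  # placeholder file name
    usecols=["order_id", "region", "category", "price", "quantity"],
    dtype={"order_id": "int64", "region": "category", "category": "category"},
)

# Tricks 4, 5, 6, 14 & 15: vectorized features, query-style filtering,
# np.where conditional logic, and grouped aggregation, all in one chain
summary = (
    df.assign(revenue=lambda d: d["price"] * d["quantity"])      # no loops
      .query("revenue > 0 and region == 'EU'")                   # SQL-like filter
      .assign(tier=lambda d: np.where(d["revenue"] > 500, "high", "standard"))
      .groupby("category", as_index=False)
      .agg(total_revenue=("revenue", "sum"), avg_order=("revenue", "mean"))
)
print(summary.head())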
This PySpark code is a classic implementation of a reliable streaming pipeline.

⚙️ Phase 1: The Continuous Engine
This part of the code tells Databricks to keep the "engine" running 24/7.

(spark.readStream.table("source_append_table")
    .filter("(status IS NULL) AND (record_type = 'file_type')")
    .writeStream
    .foreachBatch(load_all_and_route_errors)   # Calls the logic below
    .option("checkpointLocation", "/mnt/delta/checkpoints/dual_target_load")
    .trigger(processingTime='10 seconds')      # ✅ Makes it run continually
    .start()
)

🛠️ Phase 2: The Validation & Routing Function
This is the internal logic (load_all_and_route_errors) that runs every time new data is detected.

1. Persisting Data (The Memory Guard) 💾 🧠
microBatchDf.persist()
Action: Saves the incoming micro-batch in RAM.
Why: Since we are writing to two tables (main and error), we don't want Spark to do the work twice. Caching it here makes the job roughly twice as fast.

2. The Validation Engine (The Inspector) ⚖️ 🔍
errors = F.array_remove(F.array(
    F.when(F.col("order_id").isNull(), "Missing order_id"),
    F.when(F.col("price") < 0, "Negative price")
), None)
Action: Captures WHY a row failed by building a list of error reasons for every row.
Note: Unlike simple filters, this gives you an audit trail of reasons for every bad record.

3. Flagging the Data 🏷️ 🚩
validated_df = (microBatchDf
    .withColumn("validation_status",
        F.when(F.size(errors) > 0, "Invalid").otherwise("Valid"))
)
Action: Tags every single row as either Valid or Invalid based on the results of the Validation Engine.

🍴 Phase 3: The Fork in the Road (Dual Write)

Path A: The Clean Production Table ✅ 🏦
only_valid_records = validated_df.filter("validation_status = 'Valid'")
(only_valid_records.write
    .format("delta")
    .mode("append")
    .saveAsTable("main_target_table"))
Strategy: Only rows with zero errors move forward. This keeps your business dashboards clean and trustworthy.

Path B: The Quarantine/Error Table 🚨 🚧
invalid_records = validated_df.filter("validation_status = 'Invalid'")
if not invalid_records.isEmpty():
    (invalid_records.write
        .format("delta")
        .mode("append")
        .saveAsTable("error_records_table"))
Strategy: Redirects bad data to a separate log. Because we captured the reasons, engineers can immediately see that "row X failed because of a negative price."

🧹 Phase 4: Final Cleanup 🧼
microBatchDf.unpersist()
Action: Clears the cached memory block.
Why: In a continually running job, if you forget this, your cluster memory will fill up over time and eventually crash (OOM error).

💡 Summary of "Continuous" Best Practices
Use Job Clusters: In Databricks, run this as a "Continuous" job type so Databricks automatically restarts it if the cloud provider has a hiccup.
Final tip: since you are now running this continually, make sure your cluster is sized for a 24/7 workload!
"It used to take us three weeks to ship a single data pipeline. Today, an analyst with zero Python experience does it in a day. Here’s how we got there." Don't miss Kiril Kazlou's insightful recap of his company's successful move away from Python pipelines.
Scaling Python backends with asyncio and PostgreSQL (asyncpg) requires thinking beyond async/await syntax. If you don't map your coroutines to the underlying OS-level sockets and memory buffers, you will hit silent deadlocks, connection exhaustion, and OOM crashes.

I've spent a lot of time reading and building lately, and I wanted to share the most important aspects of building high-performance async database access. Here's what I've learned:

Throttle with asyncio.BoundedSemaphore: Don't just dump 10,000 tasks onto the event loop. Match your semaphore limit to your connection pool's max_size. This provides backpressure, preventing task queue timeouts and event loop thrashing. (Tip: always use BoundedSemaphore over Semaphore to catch rogue .release() calls. A minimal sketch of this pattern follows below.)

Pipeline with executemany(): Stop running .execute() in a loop. executemany leverages the Postgres extended query protocol (PARSE once, BIND/EXECUTE many) to pack the TCP window and eliminate thousands of network round-trip times (RTTs).

Isolate State with Savepoints: Use nested async with conn.transaction() blocks to handle partial payload failures. When an inner block fails, it just flags the Postgres SubXID as aborted (leaving dead tuples for the VACUUM process) while allowing the parent transaction to safely commit.

Prevent OOMs with Server-Side Cursors: Never use .fetch() for massive multi-million-row exports. Stream them via async for row in conn.cursor(query, prefetch=chunk_size). This guarantees your Python process memory stays strictly bounded to the chunk size, no matter how large the table gets.

Shield Your Cleanup: If a client abruptly drops an HTTP connection, the ASGI server will inject an asyncio.CancelledError. If you don't wrap your pool.release() and tx.rollback() in asyncio.shield() inside your Unit of Work, the network socket will be left permanently checked out, leading to a silent pool deadlock.

Adopt asyncio.TaskGroup (Python 3.11+): Move away from naked asyncio.gather(). TaskGroups provide structured concurrency—if one concurrent validation query fails, the siblings are safely and instantly cancelled, returning their leased connections to the pool immediately.

Avoid Distributed Transactions: Don't attempt two-phase commits (2PC) across microservices on the event loop; it destroys throughput. Rely on the Transactional Outbox pattern: commit your local database mutation and an event payload in the same transaction, and let your message broker manage eventual consistency.

Stop treating the event loop like magic. Treat it like an I/O multiplexing coordinator.

#Python #Asyncio #PostgreSQL #BackendEngineering #SoftwareArchitecture #DistributedSystems
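A minimal sketch of the first and TaskGroup points together: a semaphore sized to the pool plus structured concurrency. The DSN, table, and query are placeholders, and real code would also apply the shielded cleanup and server-side cursor streaming described above.

import asyncio
import asyncpg

POOL_MAX = 10

async def fetch_order(pool, sem, order_id):
    async with sem:                                 # backpressure: never more in-flight queries than connections
        async with pool.acquire() as conn:
            return await conn.fetchrow("SELECT * FROM orders WHERE id = $1", order_id)

async def main(order_ids):
    pool = await asyncpg.create_pool(
        dsn="postgresql://user:pass@localhost/db",  # placeholder DSN
        min_size=2,
        max_size=POOL_MAX,
    )
    sem = asyncio.BoundedSemaphore(POOL_MAX)        # matches the pool's max_size
    try:
        async with asyncio.TaskGroup() as tg:       # Python 3.11+: one failure cancels the siblings
            tasks = [tg.create_task(fetch_order(pool, sem, oid)) for oid in order_ids]
        return [t.result() for t in tasks]
    finally:
        await pool.close()

# asyncio.run(main([1, 2, 3]))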
𝗦𝗽𝗮𝗿𝗸 𝗜𝗻𝘁𝗲𝗿𝗻𝗮𝗹𝘀 #2: 𝗨𝗗𝗙𝘀 — 𝗧𝗵𝗲 𝗦𝗺𝗮𝗿𝘁 𝗖𝗼𝗱𝗲 𝗧𝗵𝗮𝘁 𝗕𝗿𝗲𝗮𝗸𝘀 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲

I used to think UDFs were the cleanest way to write Spark code. Clean. Reusable. Easy. Until a discussion with my architect changed my perspective: "This is why your job is slow." Then I looked under the hood, and everything clicked.

👉 UDFs (User Defined Functions) look powerful, but inside Spark they break optimization.

⚠️ What actually happens when you use a #UDF:
❌ #Spark treats it as a black box
❌ The #CatalystOptimizer can't analyze your logic
❌ No predicate pushdown
❌ No #WholeStageCodegen
And it gets worse:
💥 #JVM ↔ #Python serialization overhead
💥 Execution becomes row-by-row
💥 Vectorized (batch) execution is lost

🧠 What's happening internally (the real reason)
Spark doesn't execute your Python code directly. It follows this pipeline:
1️⃣ Build the logical plan
2️⃣ Optimize it using Catalyst
3️⃣ Convert it to a physical plan
4️⃣ Generate JVM bytecode (WholeStageCodegen)
5️⃣ Execute it in a distributed manner

✅ With built-in functions:
Spark understands expressions like when, filter, join, and agg, so it can:
✔ Apply rule-based + cost-based optimization
✔ Push filters down to the data source
✔ Reorder joins
✔ Eliminate unnecessary columns
✔ Combine multiple operations into a single stage
👉 Result: fewer stages + less I/O + faster execution

❌ With a UDF:
Your logic lives in Python, so for Spark it becomes an opaque expression. Spark doesn't know what the function does. Because of this:
🚫 Catalyst cannot rewrite or optimize it
🚫 Filters cannot be pushed below the UDF
🚫 Column pruning stops at the UDF boundary
🚫 WholeStageCodegen cannot include it

💥 The Real Bottleneck
👉 Crossing the JVM ↔ Python boundary. Spark runs in the JVM, but the UDF runs in Python. So for every row (or batch):
→ Serialize data (JVM → Python)
→ Execute the function
→ Deserialize back (Python → JVM)
This causes:
💥 High CPU overhead
💥 Serialization cost
💥 Loss of #vectorizedexecution
💥 More GC pressure

🔍 Example

❌ Using a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def categorize(age):
    return "minor" if age < 18 else "adult"

df = df.withColumn("category", udf(categorize, StringType())(df.age))

✅ Using built-in functions:
from pyspark.sql.functions import when

df = df.withColumn(
    "category",
    when(df.age < 18, "minor").otherwise("adult")
)

💡 Same logic. Completely different execution plan.
✔ Built-in → optimized DAG + codegen + vectorized
❌ UDF → isolated, row-based, non-optimizable

🚀 What to do instead:
✔ Prefer #SparkSQL functions (when, expr, concat)
✔ Think in columnar transformations
✔ Use a #PandasUDF only when unavoidable (a short sketch below)

🧠 Spark is not a Python engine. It's a distributed SQL engine with a Python interface.

Happy to share more Databricks tutorials & Spark insights — just DM me
#ApacheSpark #SparkInternals #DataEngineering #Databricks #BigData
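For the cases where a built-in genuinely can't express the logic, a Pandas UDF processes Arrow batches instead of crossing the JVM/Python boundary row by row. This is a sketch of the API shape only; the column name is illustrative, and built-ins still win whenever they exist.

import numpy as np
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("string")
def categorize(age: pd.Series) -> pd.Series:
    # Runs on whole Arrow batches (vectorized), not one row at a time
    return pd.Series(np.where(age < 18, "minor", "adult"))

df = df.withColumn("category", categorize(col("age")))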
📌 Tuples and Their Methods in Python #Day34

You've probably come across tuples. They might look simple, but they play a powerful role in writing efficient and reliable code.

🔹 What are Tuples?
Tuples are ordered, immutable collections of elements. Once created, their values cannot be changed 🚫
👉 Example:
my_tuple = (1, 2, 3, "Python")

🔹 Key Features of Tuples
✅ Ordered → elements maintain their position
✅ Immutable → cannot be modified after creation
✅ Allow duplicates → the same value can appear more than once
✅ Can store multiple data types → int, string, list, etc.

🔹 Creating Tuples
You can create tuples in multiple ways:
t1 = (1, 2, 3)
t2 = "a", "b", "c"   # Without parentheses
t3 = (5,)            # Single-element tuple (the comma is important!)

🔹 Accessing Tuple Elements
Use indexing, just like lists:
t = (10, 20, 30, 40)
print(t[0])    # Output: 10
print(t[-1])   # Output: 40

🔹 Tuple Slicing
t = (1, 2, 3, 4, 5)
print(t[1:4])  # Output: (2, 3, 4)

🔹 Why are Tuples Immutable? 🤔
Immutability ensures:
🔒 Data safety
⚡ Faster performance than lists
📌 Suitability for fixed data

🔹 Tuple Methods
Tuples have only 2 built-in methods:
t = (1, 2, 2, 3)
print(t.count(2))   # Count occurrences
print(t.index(3))   # Find index

🔹 Tuple Packing & Unpacking 🎁
👉 Packing:
data = (1, "Python", True)
👉 Unpacking:
a, b, c = data
print(a, b, c)

🔹 Tuples vs. Lists ⚔️
Feature → Tuple 🧊 | List 🔥
Mutability → No ❌ | Yes ✅
Speed → Faster ⚡ | Slower 🐢
Use case → Fixed data | Dynamic data

🔹 Nested Tuples
Tuples can contain other tuples:
nested = ((1, 2), (3, 4))
print(nested[1][0])  # Output: 3

🔹 When to Use Tuples? 🎯
✔ When data should not change
✔ When performance matters
✔ When returning multiple values from functions (a small example below)

🚀 Final Thoughts
Tuples may seem simple, but they are a powerful tool for writing clean and efficient Python code. Master them, and your Python skills will level up! 💯

#Python #DataAnalysts #DataAnalysis #DataVisualization #DataCleaning #DataHandling #DataCollection #Consistency #CodeWithHarry #DataAnalytics #PowerBI #Excel #MicrosoftExcel #MicrosoftPowerBI #TuplesInPython #PythonProgramming #Learning #LearningJourney
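The last use case (returning multiple values) deserves a tiny example of its own, since Python functions that "return two things" really return one tuple and unpacking keeps the call site clean. The function name and data are made up for illustration.

def min_max(values):
    # Returning two values actually returns one tuple: (smallest, largest)
    return min(values), max(values)

lowest, highest = min_max([7, 2, 9, 4])   # unpacked straight into two names
print(lowest, highest)                    # Output: 2 9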
🙇 Lists in Python #Day33

One concept you simply can't ignore is lists. They are one of the most powerful and flexible data structures in Python.

🔹 What is a List?
A list is a collection of items stored in a single variable. It can hold multiple values, even of different data types!
📌 Example:
my_list = [10, "Python", 3.14, True]

🔹 Key Features of Lists
✅ Ordered (items have a fixed position)
✅ Mutable (you can change values)
✅ Allow duplicates
✅ Can store different data types

🔹 Creating Lists
numbers = [1, 2, 3, 4]
names = ["Ishu", "Rahul", "Aman"]
empty_list = []

🔹 Accessing Elements (Indexing)
print(numbers[0])    # First element
print(numbers[-1])   # Last element

🔹 Slicing Lists ✂️
print(numbers[1:3])  # Elements from index 1 to 2

🔹 Modifying Lists 🔄
numbers[0] = 100

🔹 Adding Elements ➕
numbers.append(5)       # Add at the end
numbers.insert(1, 50)   # Add at a specific position

🔹 Removing Elements ❌
numbers.remove(50)   # Remove by value
numbers.pop()        # Remove the last item
del numbers[0]       # Delete by index

🔹 Common List Methods 🛠️
numbers.sort()       # Sort the list
numbers.reverse()    # Reverse the list
numbers.count(2)     # Count occurrences
numbers.index(3)     # Find an index

🔹 Looping Through Lists 🔁
for item in numbers:
    print(item)

🔹 List Comprehension ⚡ (Advanced & Powerful)
A compact way to create lists!
squares = [x**2 for x in range(5)]
(A slightly bigger example with filtering follows below.)

🔹 Nested Lists (Lists inside Lists)
matrix = [[1, 2], [3, 4], [5, 6]]

🔹 Why Lists Matter in Data Analytics 📊
Lists are used for:
📌 Storing datasets
📌 Data manipulation
📌 Iterations & transformations
📌 Building logic before moving to libraries like Pandas

🔥 Final Thought
Lists are not just a topic — they are a foundation. The better you understand them, the stronger your Python skills will be!

#Python #DataAnalytics #DataAnalysis #DataAnalysts #PowerBI #Excel #MicrosoftExcel #MicrosoftPowerBI #SQL #PythonProgramming #CodeWithHarry #DataHandling #DataVisualization #DataCollection #DataCleaning #Consistency
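To push the comprehension example one small step toward real analytics work, here is a sketch that filters and transforms in a single pass. The sample values are invented for illustration.

# Keep only valid order amounts, converted from cents to dollars, in one pass
raw_amounts = [1299, -50, 0, 4599, 820]           # made-up sample data (cents)
clean_dollars = [amount / 100 for amount in raw_amounts if amount > 0]
print(clean_dollars)                              # Output: [12.99, 45.99, 8.2]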