I wrote a bug today that took me 20 minutes to find. The function looked completely fine.

━━━━━━━━━━━━━━━━━━━━━━

def add_item(item, data=[]):
    data.append(item)
    return data

━━━━━━━━━━━━━━━━━━━━━━

I called it three times — expecting three separate lists. Got this instead:

▶ add_item("apple") → ["apple"]
▶ add_item("banana") → ["apple", "banana"]
▶ add_item("cherry") → ["apple", "banana", "cherry"]

Same list. Growing every time. I never passed a list — Python was reusing the same default list across every single call.

━━━━━━━━━━━━━━━━━━━━━━

This is Python's mutable default argument trap. The default value [] is created once, when the function is defined — not every time it's called. So every call without an argument shares the exact same list object in memory.

My software engineering brain expected fresh memory every time. That's how C++ and Java work. Python doesn't work that way.

━━━━━━━━━━━━━━━━━━━━━━

The fix:

def add_item(item, data=None):
    if data is None:
        data = []
    data.append(item)
    return data

None as the default. Fresh list created inside. Done.

━━━━━━━━━━━━━━━━━━━━━━

The scary part? This bug doesn't crash your program. It silently gives you wrong results. In a data science pipeline, that means corrupted data with zero error messages.

━━━━━━━━━━━━━━━━━━━━━━

Senior developers — what's the silent bug that once corrupted your data without a single error? Would love to know I'm not alone in this.

SE → Data Science | OOP Series #2 | IUB

#Python #OOP #DataScience #100DaysOfCode #SoftwareEngineering
Python's Mutable Default Argument Trap: Silent Data Corruption
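One way to see the shared object directly: Python keeps default values on the function object itself, so a minimal sketch like this makes the trap visible.

def add_item(item, data=[]):  # the buggy version from the post
    data.append(item)
    return data

# The default list is created once and stored on the function object.
print(add_item.__defaults__)   # ([],)

add_item("apple")
add_item("banana")

# Same object, still attached to the function, now holding both items.
print(add_item.__defaults__)   # (['apple', 'banana'],)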
More Relevant Posts
In academic theory, datasets are clean. In the industrial trenches, legacy ERPs export garbage. Attempting to dump a static Bill of Materials (BOM) directly from a corporate Excel into a relational database is the perfect recipe for destroying your system's referential integrity. Hidden blank spaces, inconsistent nomenclatures, and mixed data typing (text vs. numbers) will crash any automated risk model. In Operations Engineering, we do not rely on manual data entry; we build firewalls. In part four of the Obsolescence series on Datalaria, I deconstruct how to build a ruthless ETL (Extract, Transform, Load) pipeline to sanitize this structural entropy before it impacts your P&L: 1️⃣ Radical Cleansing (Pandas): Utilizing Python as a strict gatekeeper to standardize MPNs (Manufacturer Part Numbers) and enforce strict numerical typing. 2️⃣ Graph Shattering: Breaking the flat 2D Excel table into true hierarchical vectors (End Product -> Subassembly -> Component). 3️⃣ The Golden Rule (Idempotency): Implementing upsert architecture in Supabase. The system must allow you to run the ingestion script 1,000 times consecutively without duplicating a single node. Sandbox Strategy: Theory is no longer enough. Don't take my word for it; run it yourself. I have embedded a secure, interactive environment (Google Colab) inside the article. Without installing anything, you will watch a Python script ingest a corrupt CSV and build a relational tree in milliseconds. 👉 Access the interactive Sandbox and full analysis here: https://lnkd.in/eaxyv_mQ #OperationsEngineering #DataEngineering #Python #Pandas #Supabase #SupplyChain #ETL #BOMManagement #FirstPrinciples
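For readers who want a feel for points 1 and 3 before opening the article, here is a minimal sketch. The column names, the components table, and the commented-out connection are illustrative assumptions, not the schema used on Datalaria.

import pandas as pd

# 1) Radical cleansing: strict gatekeeping on an exported BOM (assumed columns).
bom = pd.read_csv("bom_export.csv", dtype=str)
bom["mpn"] = bom["mpn"].str.strip().str.upper()          # kill hidden blanks, normalize MPNs
bom["qty"] = pd.to_numeric(bom["qty"], errors="coerce")  # enforce strict numeric typing
bom = bom.dropna(subset=["mpn", "qty"]).drop_duplicates("mpn")

# 3) Idempotency: an upsert keyed on the MPN, so re-running the load never duplicates a node.
UPSERT_SQL = """
INSERT INTO components (mpn, qty)
VALUES (%(mpn)s, %(qty)s)
ON CONFLICT (mpn) DO UPDATE SET qty = EXCLUDED.qty;
"""
# with some_postgres_connection.cursor() as cur:  # placeholder connection to the Postgres/Supabase instance
#     cur.executemany(UPSERT_SQL, bom.to_dict(orient="records"))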
🚨Stop Treating Them Like They’re the Same! 🚨 If you’ve ever looked at a dataset and felt like you were staring into a black hole of "Nothingness," you aren’t alone. But in the world of data, not all "nothings" are created equal. Is None the same as NaN? Is Null just a fancy word for zero? No. Mixing these up is a one-way ticket to buggy code and broken pipelines. Here is the "No-Nonsense" breakdown: The terms None, NaN, and Null are used to represent missing or invalid data, but they belong to different programming environments and behave differently. 1. None (The Python Specialist) In Python, None is a built-in constant used to represent the absence of a value. None is a literal object. It represents the intentional absence of a value. Type: It is a singleton of the NoneType class. Behavior: It is not equal to 0, False, or an empty string. Comparison: You should check for it using the is operator (e.g., x is None). Usage: Commonly used as a default return value for functions that don't return anything or to initialize variables that don't have a value yet. 2. NaN (Not a Number) NaN is a special numeric value used to represent a value that is undefined or unrepresentable, particularly in floating-point calculations. Type: In Python's NumPy and Pandas libraries, it belongs to the float class. Comparison: A unique property of NaN is that it is not equal to itself (np.nan == np.nan returns False). Use special functions like pd.isna() or np.isnan() to detect it. Behavior: Mathematical operations involving NaN usually result in NaN (e.g., 5 + NaN = NaN). 3. Null Null is a keyword used in many languages (like SQL, Java, C#, and JavaScript) to indicate that a variable does not point to any object or memory address. Context: SQL: Used to represent missing or unknown values in a database. It’s a placeholder, not a value. In SQL, Null != Null, which is why we have to use IS NULL. JavaScript: Represents the intentional absence of an object value. Python: Does not have a null keyword; it uses None instead. Pandas/Polars: Modern data libraries like Polars use null as their primary indicator for any missing data across all types, whereas Pandas traditionally converts None to NaN in numeric columns. 💡 The Bottom Line: None is an object. NaN is for missing/invalid numbers. Null is for missing database entries. #DataScience #Python #Programming #SQL #DataEngineering #CodingTips
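To see those comparison rules in action, here is a small self-contained sketch (pure Python plus NumPy and Pandas):

import numpy as np
import pandas as pd

x = None
print(x is None)                   # True  — check None with `is`, not `==`
print(x == 0, x == False, x == "") # False False False — None is none of these

y = np.nan
print(y == y)                      # False — NaN is never equal to itself
print(np.isnan(y))                 # True
print(pd.isna(y), pd.isna(None))   # True True — pd.isna catches both

s = pd.Series([1, None, np.nan])
print(s)                           # Pandas stores None in this numeric column as NaN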
Meet SLayer, the semantic layer for AI agents and humans we've built at Motley, open-sourced now! It's the best way to let your agent explore a database: create and edit semantic models on the fly, define new metrics, but most importantly – query data using a very straightforward format that is easily understandable by both LLMs and humans (much more so than SQL). Talk to it over MCP, CLI, API, or a Python client if you want dataframes. Power talk-to-your-data bots, data analyst agents, dashboards, and non-agentic apps too. My favorite SLayer feature is ease of integration. Without needing to run the server, you can use the CLI, MCP (stdio-based), or just import it into your Python app. Quickstart & more on GitHub: https://lnkd.in/dxxmCE_G
When I joined my current team, we ran ETL. Extract from source. Transform in Python. Load clean data to BigQuery. Six months later, we switched to ELT. Load raw data to BigQuery first. Transform Inside BigQuery using dbt. Here's exactly why - and what we got wrong the first time. ───────────────── The ETL problems we kept hitting: Python transform scripts were getting complex fast. Business logic kept changing. Every new metric required updating Python, code review, redeploy, rerun. Worse: no way to replay history with new logic. Raw data was already transformed and gone. Business rule changes meant we couldn't reprocess old data. We painted ourselves into corners every sprint. ───────────────── What switching to ELT changed: → Analysts now change transformation logic themselves - in SQL, not Python → Business rule changes? Rerun dbt on historical raw data. Done in minutes. → Python pipeline went from 800 lines to ~100. The rest is dbt models. → dbt gave us automatic documentation and lineage for free ───────────────── But - ELT is Not always right. If you handle sensitive personal data (healthcare, financial), you may Not be allowed to land raw PII in your warehouse. ETL is correct here - mask or encrypt before data touches storage. ───────────────── The honest decision rule: Can your warehouse handle transformation compute? → ELT Can you store raw data affordably? → ELT Does your team prefer SQL over Python for transforms? → ELT Is data sensitivity a hard constraint? → ETL Which does your team use - and what drove that decision? 👇 #DataEngineering #ETL #ELT #dbt #BigQuery #LearningInPublic
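As a rough illustration of the "load raw first" half of ELT, here is a minimal sketch using the BigQuery Python client. The project, dataset, table, and file names are invented, and the transforms themselves live in dbt SQL models rather than in Python.

from google.cloud import bigquery

client = bigquery.Client()

# ELT: land the raw file untouched; schema is auto-detected, nothing is transformed here.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
    write_disposition="WRITE_APPEND",
)
with open("orders_raw.csv", "rb") as f:
    client.load_table_from_file(f, "my_project.raw.orders", job_config=job_config).result()

# All business logic then lives in dbt models (SQL), e.g. models/staging/stg_orders.sql,
# and can be re-run over the full raw history whenever a rule changes.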
🚨 𝗦𝗺𝗮𝗹𝗹 𝗦𝗵𝗼𝗿𝘁𝗰𝘂𝘁, 𝗕𝗶𝗴 𝗣𝗿𝗼𝗯𝗹𝗲𝗺: 𝗶𝗺𝗽𝗼𝗿𝘁 * 𝗶𝗻 𝗣𝘆𝗦𝗽𝗮𝗿𝗸 A lot of PySpark code works perfectly in notebooks… but fails to scale in production for small reasons like this 👇 👨💻𝗠𝗼𝘀𝘁 𝗯𝗲𝗴𝗶𝗻𝗻𝗲𝗿𝘀 𝘄𝗿𝗶𝘁𝗲: from pyspark.sql.functions import * ✔️ It works ❌ But this small shortcut creates long-term problems in real data pipelines ⚠️ 𝗪𝗵𝗮𝘁 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗵𝗮𝗽𝗽𝗲𝗻𝘀 𝘄𝗵𝗲𝗻 𝘆𝗼𝘂 𝘂𝘀𝗲 𝗶𝗺𝗽𝗼𝗿𝘁 * 👉When you use import *, you’re pulling 100+ functions into your global namespace That leads to: • ❓ Function ambiguity — where is this function coming from? • ⚠️ Overriding built-ins like sum(), max(), min() • 🐛 Harder debugging as pipelines grow • 👀 Poor readability in shared codebases 📌𝗘𝘅𝗮𝗺𝗽𝗹𝗲: sum([1,2,3]) 🤔 Is this Python’s sum or Spark’s aggregation? 👉 In large projects, this confusion is not theoretical 👉 It causes real production bugs ✅ 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻-𝗟𝗲𝘃𝗲𝗹 𝗔𝗽𝗽𝗿𝗼𝗮𝗰𝗵: from pyspark.sql import functions as F 📌 𝗨𝘀𝗮𝗴𝗲: F.col(“salary”) F.sum(“salary”) F.when(…) ✔️ Same output 🔥 Much better code quality 🔍 𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀 𝗶𝗻 𝗿𝗲𝗮𝗹 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲𝘀 In small notebooks → both approaches feel identical But in production 👇 • 📏 Pipelines are hundreds of lines long • 👥 Multiple engineers contribute • ⚡ Debugging must be fast & precise ⚖️ 𝗡𝗼𝘄 𝗰𝗼𝗺𝗽𝗮𝗿𝗲: df.groupBy(“dept”).agg(sum(“salary”)) vs df.groupBy(“dept”).agg(F.sum(“salary”)) 💡The second version instantly tells you: 👉 This is a Spark transformation 👉 Not a Python function 👉 Not a custom utility 🧠 That clarity = faster debugging + cleaner reviews + easier maintenance ⚙️ 𝗪𝗵𝗲𝗿𝗲 𝘁𝗵𝗶𝘀 𝗯𝗲𝗰𝗼𝗺𝗲𝘀 𝗰𝗿𝗶𝘁𝗶𝗰𝗮𝗹 • ☁️ Azure Databricks jobs (scheduled ETL pipelines) • 🔄 Incremental loads & complex transformations • 👨💻 Team-based code reviews • 🚨 Debugging production failures 👉 When something breaks…you don’t want to guess where a function came from 🧠𝗧𝗵𝗶𝘀 𝗶𝘀𝗻’𝘁 𝗮𝗯𝗼𝘂𝘁 𝗳𝗲𝘄𝗲𝗿 𝗸𝗲𝘆𝘀𝘁𝗿𝗼𝗸𝗲𝘀 It’s about: • 🎯 Clear ownership of logic • 🔍 Predictable behavior • 🧱 Maintainable pipelines 🔥 𝗦𝗵𝗼𝗿𝘁 𝗰𝗼𝗱𝗲 𝗶𝘀 𝗻𝗼𝘁 𝗮𝗹𝘄𝗮𝘆𝘀 𝗯𝗲𝘁𝘁𝗲𝗿…𝗖𝗹𝗲𝗮𝗿 𝗰𝗼𝗱𝗲 𝗶𝘀. 💬 What do you use in your projects — import * or F? #DataEngineering #PySpark #AzureDatabricks #BigData #ETL #CodingStandards #Spark
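A small runnable sketch of the difference, with an invented toy DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F   # explicit namespace instead of import *

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("IT", 100), ("IT", 200), ("HR", 300)], ["dept", "salary"])

# With `from pyspark.sql.functions import *`, the name `sum` would be Spark's
# Column-returning sum, silently shadowing Python's built-in sum().
print(sum([1, 2, 3]))   # still Python's built-in here: 6

df.groupBy("dept").agg(F.sum("salary").alias("total")).show()   # unambiguously Spark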
How I Built an AI-Powered DBA Assistant (Step-by-Step) In my previous post, I shared how I built an AI assistant to handle tablespace alerts automatically. A few people asked for implementation details — so here’s a simplified breakdown of how this works 👇 🔹 Step 1: Alert Source (OEM) We start with OEM alerts like: PRODDB1 Tablespace [SYSAUX] is [85 percent] full 🔹 Step 2: Alert Parsing (Python) Using Python (regex), we extract: Database name Tablespace name Usage percentage This converts raw alert → structured data 🔹 Step 3: Decision Logic Simple rule-based logic: 90% → Critical → Immediate action 80% → Warning → Plan action This acts as a mini “AI decision engine” 🔹 Step 4: AI Assistant Layer Built using Streamlit: Accepts alert input Detects context (DB + tablespace) Provides recommendation Shows action button 🔹 Step 5: Automation Integration On clicking action: 👉 Redirect to Datafile Management module 👉 Auto-fill: Database Tablespace 🔹 Step 6: Execution Layer Generate SQL automatically Execute with confirmation 🔹 Final Flow: Alert → Parse → Analyze → Recommend → Execute 💡 Key takeaway: You don’t need complex AI to start. Even simple logic + automation can significantly improve DBA workflows. Next, I’m exploring: 👉 AI-based recommendation for datafile sizing 👉 Fully automated (self-healing) execution Would love to hear how others are approaching automation in database operations. #OracleDBA #Automation #AIOps #Python #DevOps #MachineLearning #DatabaseEngineering
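Here is roughly what Steps 2 and 3 can look like in code. The regex and field names are assumptions based on the sample alert above, not the exact production parser.

import re

ALERT_PATTERN = re.compile(
    r"(?P<db>\S+)\s+Tablespace\s+\[(?P<tablespace>[^\]]+)\]\s+is\s+\[(?P<pct>\d+)\s*percent\]\s+full"
)

def parse_alert(text: str) -> dict | None:
    """Turn a raw OEM alert line into structured fields, or None if it doesn't match."""
    m = ALERT_PATTERN.search(text)
    if m is None:
        return None
    pct = int(m.group("pct"))
    return {
        "database": m.group("db"),
        "tablespace": m.group("tablespace"),
        "usage_pct": pct,
        # Step 3: the simple rule-based decision layer.
        "severity": "CRITICAL" if pct >= 90 else "WARNING" if pct >= 80 else "INFO",
    }

print(parse_alert("PRODDB1 Tablespace [SYSAUX] is [85 percent] full"))
# {'database': 'PRODDB1', 'tablespace': 'SYSAUX', 'usage_pct': 85, 'severity': 'WARNING'}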
One of the problems I'm running into as a n00bie data engineer moving from a beginner to an intermediate comprehension of Python is understanding what is a variable versus what is a function or an argument.

This may seem silly, but bear in mind that I didn't start out as a computer programmer or software engineer--I started out as an International Affairs/Econ student, jumped into trade show logistics, then Education, then into Corporate Finance, then into lean manufacturing, and then into software development with a focus on requirement validation and verification. I do not have a prior mental map for variables versus functions and arguments; I'm willing to bet a lot of other people are like me, and need the extra attention to differentiating those concepts.

Below is a screenshot from my BoK, where I'm covering K-Means and agglomerative clustering as unsupervised machine learning techniques. The focus here is on K-Means; the screenshot is of a function that is written to calculate inertia for three clusters. What I'm going to be doing here, as part of a two-pronged approach to: a) better understand how to write a function with for and while loops (because I'm sadly weak here); b) better understand how to utilize a K-Means function on a scaled dataset, is to go through this function and amend all the variables with the prefix "df_var", or "df_variable_", or "df_model_" in my BoK.

What this does is create an immediate explanatory model for me within my Book of Knowledge, demonstrating which variables I have created versus which functions, arguments, and calls come from the libraries and packages I've imported. For example: I'd write 'num_clusters' as 'df_var_num_clusters'; 'x_vals' as 'df_var_x_vals'. Then for the actual function, I'd write 'kmeans_inertia' as 'df_calc_kmeans_inertia'.

Then, when I'm confronted with a similar coding challenge in the future, I can pull up my BoK, use the Find function and key words to get to this specific example, and because I know what my variable prefixes are, I can visually look at my example, then at the parameters for what I have to write, and I can then confidently design my function knowing which terms are variables that I'm responsible for, versus the functions and arguments that actually execute what I want to do.

I'll be putting this up on GitHub either tonight or tomorrow, along with a README to explain this. It's my hope that this helps address some of the angst and fear that project managers and other professionals might have about jumping into software design and machine learning.
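To make the naming convention concrete, here is a hedged sketch of a K-Means inertia function with those prefixes applied; the data, column names, and cluster counts are invented for illustration and are not the function from the screenshot.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Variables I own get a prefix; library names (KMeans, StandardScaler, fit, inertia_) do not.
df_var_raw = pd.DataFrame({"height": [150, 160, 170, 180, 190, 200],
                           "weight": [50, 55, 65, 72, 80, 95]})
df_var_x_vals = StandardScaler().fit_transform(df_var_raw)

def df_calc_kmeans_inertia(df_var_x_vals, df_var_num_clusters_list=(1, 2, 3)):
    """Return the inertia for each candidate number of clusters."""
    df_var_inertias = []
    for df_var_k in df_var_num_clusters_list:          # the loop I own
        df_var_model = KMeans(n_clusters=df_var_k, n_init=10, random_state=42)
        df_var_model.fit(df_var_x_vals)                # library call
        df_var_inertias.append(df_var_model.inertia_)  # library attribute
    return df_var_inertias

print(df_calc_kmeans_inertia(df_var_x_vals))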
Is R Becoming a Niche Guest in its Own House? For those of us who grew up in the Tidyverse, the recent ripples in the data ecosystem feel more like a tidal wave. After planning a 200,000-line codebase transition from R to Python, I've been reflecting on five pivotal shifts that signal a "New World Order" in Data Science:

1. The "Pandas" Effect & The Memory Revolution. Wes McKinney didn't just give Python a DataFrame; he (and the subsequent Apache Arrow movement) unified the underlying data infrastructure. By bringing Wes into the fold, the industry shifted focus from "language-specific" tools to "language-agnostic" high-performance kernels.

2. The End of an Era: From RMarkdown to Quarto. The departure of Yihui Xie from Posit wasn't just a personnel change; it was a symbolic turning point. As Quarto supersedes RMarkdown, we see a move toward a multi-language future. R is no longer the center of the solar system—it's just one of the planets orbiting the "publish anything" sun.

3. The Shiny Expansion (and Dilution?). Shiny for Python is a technical marvel, but it marks the fall of R's last "monopoly." When the most efficient tool for interactive dashboards goes cross-platform, the gravity inevitably pulls toward the broader Python ecosystem for production-grade deployment.

4. The SparkR Sunset. With SparkR deprecated and the baton passed to sparklyr, the message from big-data platforms is clear: core development is moving elsewhere. R is being reframed as a specialized "interface" rather than a first-class citizen in massive-scale parallel computing.

5. The Infrastructure Barrier: The "Shared Cluster" Problem. In modern cloud environments like Databricks, the lack of R support on Shared Clusters is a deal-breaker for many enterprise architects. When you can't share resources or scale multi-user environments in R, you aren't just losing a language; you're losing the battle for ROI and stability.

My Takeaway: I am not pessimistic about R's survival—it will always remain the "Gold Standard" for deep statistical rigor and validated research, especially in the Pharmaceutical industry. However, for AI Automation and Big Data Engineering, the "Great Consolidation" toward Python is no longer a trend—it's a finished reality. If you are building for the next 10 years of stability (and avoiding the 3-year re-validation nightmare), it's time to stop fighting the current and start mastering the new stack.

What do you think? Is R returning to its roots as a specialist's tool, or is it losing its seat at the head of the table? #DataScience #RStats #Python #BigData #AI #Databricks #Pharmaceuticals #Quarto #TechTrends #DataEngineering
Today I shipped the first meaningful Rust commit for DYFJ — my open-source sovereign personal AI stack. It's testing an architectural idea I've come to recently: the same reasoning that draws me to Rust — strong types create predictable failure patterns — should apply at the database boundary, not just at the language. The default I keep seeing is to define the data contract in language constructs — the framework's native classes, types, or interfaces — and treat schema as something *exported* from those, rather than the other way around. Every "agent framework" I've looked at recently does this with Python or TypeScript classes, sometimes producing JSON Schema or OpenAPI specs that pretend to be the contract. The class is the source of truth. The runtime is whatever interprets that class. The database (when it exists at all) has a schema that drifts silently from the language one until something breaks in production at 3am. The data outlives the language. In DYFJ, the schema is committed to the repo as DDL — the contract that every language binding consumes, never the other way around. Whatever database runs that DDL is itself a modular component. Today's tracer bullet enforces that stance at the language boundary. Two Rust functions: events::write() and events::read_by_id(). Both use sqlx::query! macros, which check the SQL at compile time against the actual database. If I rename a column in the schema, the Rust code fails to compile until I update the queries. The build is the contract. Full post: https://lnkd.in/eY8nypMA #Rust #SovereignAI #OpenSource