The Hidden Cost of a Missing Value

You start with Employee IDs: 101, 102, 103. After a join: 101.0, 102.0, NaN.

That .0 isn't cosmetic; it's a silent upcast. Under the hood, NumPy-backed systems (like Pandas) require columns to have a fixed dtype for vectorized performance. Standard integers don't support nulls, but IEEE 754 floating point does (via NaN). So to accommodate a single missing value, the entire column is upcast to float64.

This is a classic leaky abstraction:
1) Very large integers can lose precision when converted to floating point (float64 represents integers exactly only up to 2^53).
2) Memory usage may increase, depending on the original dtype.

Modern solutions (Pandas nullable dtypes, Apache Arrow, Polars) solve this using validity bitmasks: integers stay integers while nulls are tracked separately, at a very minimal tradeoff in compatibility and compute.

Even "nothing" has a cost in system design.

#DataEngineering #Python #SystemDesign
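A minimal sketch of the upcast and the nullable-dtype fix in Pandas (the "badge" column and values are invented for illustration):

```python
import pandas as pd

# Employee IDs, with a plain int64 "badge" column on the right side.
left = pd.DataFrame({"emp_id": [101, 102, 103]})
right = pd.DataFrame({"emp_id": [101, 102], "badge": [7, 8]})

# Employee 103 has no match, so the joined column needs a null:
# the whole int column gets silently upcast to float64.
merged = left.merge(right, on="emp_id", how="left")
print(merged["badge"].dtype)  # float64

# Pandas nullable integers keep the dtype and track nulls in a mask.
right_nullable = right.astype({"badge": "Int64"})
merged_nullable = left.merge(right_nullable, on="emp_id", how="left")
print(merged_nullable["badge"].dtype)  # Int64
```

The missing badge shows up as pd.NA instead of NaN, and the surviving values stay exact integers.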
Hidden Cost of Missing Values in Data Systems
Day 68/75 | LeetCode 75

Problem: 136. Single Number
Difficulty: Easy

Problem Summary: Given a non-empty array where every element appears twice except for one, find the element that appears only once.

Constraints:
• Linear time complexity required
• Constant extra space

My Approach: This problem is efficiently solved using bit manipulation (the XOR operation).
• Key properties of XOR:
– a ^ a = 0
– a ^ 0 = a
– XOR is commutative and associative
• Logic:
– XOR all elements in the array
– Duplicate numbers cancel each other out
– The remaining value is the unique element
• Implementation:
– Initialize a variable (e.g., result = 0)
– Traverse the array and XOR each element with result
– The final result holds the single number

This works because pairs eliminate themselves, leaving only the non-repeating number.

Complexity Analysis:
• Time Complexity: O(n)
• Space Complexity: O(1)

Key Takeaway: Bit manipulation can simplify problems that seem to require extra space. XOR is especially powerful when dealing with pairs or duplicates.

Question Link: https://lnkd.in/guP-fJnC

#Day68of75 #LeetCode75 #DSA #Java #Python #BitManipulation #Algorithms #MachineLearning #DataScience #ML #DataAnalyst #LearningInPublic #TechJourney #LeetCode
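The XOR fold described above fits in a few lines; a sketch:

```python
from functools import reduce
from operator import xor

def single_number(nums):
    # Pairs cancel (a ^ a == 0) and XOR with 0 is the identity (a ^ 0 == a),
    # so folding XOR over the array leaves only the unpaired element.
    return reduce(xor, nums, 0)

print(single_number([4, 1, 2, 1, 2]))  # 4
```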
Visualizing Multithreading: From Race Conditions to Thread Safety 🧵💻

I've always believed that the best way to master complex operating system concepts is to build them from scratch. I recently developed a Multithreading Task Manager in Python to simulate how modern OSs handle concurrent tasks.

Key features I implemented:
✅ Thread Lifecycle Simulation: visualizing threads moving from New → Ready → Running → Terminated.
✅ Race Condition Demo: showing how data corruption occurs without proper locking mechanisms.
✅ Mutex Locks: using threading.Lock() to ensure critical sections are protected.
✅ Producer-Consumer Pattern: implementing a thread-safe task queue.

Building this helped me bridge the gap between theoretical OS concepts and practical, thread-safe Python code. Check out the demo below! 👇

#Python #Multithreading #OperatingSystems #ComputerScience #SoftwareEngineering #BackendDevelopment #Concurrency
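A minimal sketch of the mutex-lock idea (not the author's task manager, just the core threading.Lock pattern):

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        # Critical section: the read-modify-write on `counter` is
        # protected, so concurrent workers can't interleave mid-update.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 -- without the lock, updates can be lost
```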
The most dangerous phrase in software is: "But it works on my machine." ⚠️

I decided to eliminate that phrase entirely for my latest project. This video captures the "black box" transformation: using the Nuitka compiler to translate thousands of lines of Python into optimized C code. It's a tense few minutes watching the machine stitch together ezdxf, win32com, and Tkinter into a single, self-contained executable.

The Mystery: Why go through the trouble of a C-level compilation instead of just running a script? Because when a tool hits the production floor, it needs to be more than just "code." It needs to be a professional-grade solution for engineering efficiency. The goal was to take a complex environment and vanish it, leaving behind nothing but a high-performance, standalone tool.

The Result: The "Successfully created" message at the end isn't just a build log. It's the birth of LCPL_Grey_Scaler.exe (v1.1). No Python installation needed. No dependency errors. Just pure, compiled speed. 🚀

#Python #SoftwareEngineering #Automation #CAD #Nuitka #Innovation #CodingLife #Engineering #Deployment
🚀 Stop iterating through your DataFrames like it's 2010.

I recently refactored a pipeline processing 50M rows. We were using a standard loop to calculate a rolling average, which was choking our CPU and stalling the entire cluster.

Before optimisation:

for i in range(len(df)):
    df.loc[i, 'avg'] = df['val'].iloc[max(0, i - 4):i + 1].mean()

After optimisation:

df['avg'] = df['val'].rolling(window=5, min_periods=1).mean()

Performance gain: 45x faster execution time. By moving from a row-based loop to a vectorised rolling-window function, we cut the execution time from 12 minutes down to 16 seconds. The underlying implementations of Pandas and Polars handle these operations in highly optimised C/Rust, which no Python loop can match.

Stop writing row-wise operations and start leveraging vectorisation. It's the single biggest win for data processing throughput.

What is the most expensive loop you have ever managed to replace with a vectorised operation?

#DataEngineering #Python #PerformanceTuning #DataScience #CodingTips
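A reproducible, smaller-scale sketch of that comparison (exact speedups vary by machine; 20k rows stands in for the 50M):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({"val": np.random.rand(20_000)})

# Row-wise loop: a Python-level slice + mean per row (trailing 5-row window).
start = time.perf_counter()
loop_avg = [df["val"].iloc[max(0, i - 4): i + 1].mean() for i in range(len(df))]
t_loop = time.perf_counter() - start

# Vectorized: one call, windows computed in optimized native code.
start = time.perf_counter()
vec_avg = df["val"].rolling(window=5, min_periods=1).mean()
t_vec = time.perf_counter() - start

assert np.allclose(loop_avg, vec_avg)  # identical results
print(f"loop: {t_loop:.2f}s  vectorized: {t_vec:.4f}s  ({t_loop / t_vec:.0f}x)")
```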
Handling Time Data: Logic over Strings.

Today I worked on a common challenge: comparing time values like '7:15' and '10:30' stored in a list.

The Problem: Standard string comparison is unreliable ('7:15' sorts after '10:30' lexicographically), and representing times as floats invites rounding errors.

The Solution: I converted all time entries into a single unit: total minutes from the start of the day (hours * 60 + minutes). This transformation turns time strings into simple integers, creating robust and scalable logic for sorting and filtering.

A solid foundation is everything, whether it's infrastructure or code. 🛡️🦾

#Python #Coding #ProblemSolving #SoftwareEngineering #Backend #Summerson
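A minimal sketch of the total-minutes conversion and why lexicographic sorting fails:

```python
def to_minutes(t: str) -> int:
    """Convert 'H:MM' or 'HH:MM' to minutes since midnight."""
    hours, minutes = map(int, t.split(":"))
    return hours * 60 + minutes

times = ["10:30", "7:15", "9:05"]
print(sorted(times))                  # ['10:30', '7:15', '9:05'] -- lexicographic, wrong
print(sorted(times, key=to_minutes))  # ['7:15', '9:05', '10:30'] -- correct
```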
1. When "Simple" Features Meet Scalable Architecture

A client came to me asking for a "Read Time" badge on their enterprise blog. On the surface, it's just division, right? Total words / 200. Done.

But as we peeled back the layers, it became a fascinating system design challenge:

The Content Problem: How do you handle code snippets, technical tables, or 50+ images? (Hint: they don't read as fast as prose.)

The Scale Problem: Recalculating this on every page load for 100k+ users is a waste of CPU cycles.

The Solution: We moved the logic to the write path. Calculate once during the "Publish" event, store it as metadata, and serve it via CDN.

The Result: A snappier UI and a more accurate "time-to-value" promise for the readers.

Check out the Python logic I used to handle the heavy lifting below! 👇

2. Python Backend Implementation

This script handles the logic of write-time processing. It strips out distractions and accounts for "image fatigue" (where users scan images faster the more there are).

#SystemDesign #Python #SoftwareEngineering #Backend
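The attached script isn't reproduced here, so this is a hypothetical sketch of a write-time estimator with a decaying per-image cost (the 200 wpm and 12s/3s constants are illustrative, not the client's values):

```python
import math

def estimate_read_time_minutes(word_count, image_count=0, wpm=200):
    """Estimate read time once, at publish time, not per page load."""
    seconds = word_count / wpm * 60
    # "Image fatigue": the first image costs 12s, each subsequent one
    # a second less, down to a 3-second floor.
    for i in range(image_count):
        seconds += max(12 - i, 3)
    return math.ceil(seconds / 60)

print(estimate_read_time_minutes(1000, image_count=5))  # 6
```

The result would be stored as post metadata in the publish handler, so serving the badge is a plain field read behind the CDN.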
Spent 5 days chasing ghosts: DLL hell and ABI mismatches. I followed the agentic debugger down the wrong path as it hallucinated at the wrong layer, misreading WinError 1114 as a load-path issue rather than a missing export. The actual fix was two lines: I used TORCH_LIBRARY when I needed PYBIND11_MODULE.

The Architecture Gap:
- TORCH_LIBRARY registers ops into the PyTorch C++ dispatcher (accessed via torch.ops). It fires static C++ constructors at DLL load time but does not create a PyInit_* function, so Python can't "see" it as a module.
- PYBIND11_MODULE generates the standard Python C extension entry point: the PyInit_{name} function Python needs to import the module.

The error was literal: "dynamic module does not define module export function." No PyInit_* existed, because TORCH_LIBRARY isn't meant to be imported directly.

{just correcting the record}

#CPP #PyTorch #SystemsProgramming #MachineLearning #barebones #3D
I added more threads to fix my pipeline. Throughput dropped.

That was the moment I realized I had been debugging the wrong thing for days. We were processing ~8 million events in a strict time window, and the system was falling behind. My first instinct: more threads, more parallelism. Classic move.

But the system wasn't compute-bound. Every event was triggering datastore lookups, config reads, network calls. The threads weren't competing for CPU; they were all just waiting, and I gave them more company.

Python 3.13 introduced a free-threaded build where the GIL can be disabled. True parallel execution across cores. No more serialization. A lot of engineers read that and thought: "Finally, Python is fixed." It's not that simple.

Your system's throughput is capped by three things, and free-threading only addresses one:
→ I/O ceiling: how fast your external dependencies respond
→ Thread overhead ceiling: context-switching cost beyond the optimal thread count
→ Execution ceiling: where the GIL used to apply

Removing the GIL lifts the execution ceiling. If your system is sitting behind the I/O ceiling, nothing moves.

What actually fixed my pipeline wasn't threads or Python versions. It was pulling config data out of the critical path and caching state locally. One design decision. More impact than any concurrency change.

Removing the GIL doesn't remove bad architecture.

Full breakdown with the three-ceiling model and where free-threading genuinely helps: link in comments.

#Python #SystemDesign #BackendEngineering #SoftwareEngineering #DistributedSystems #Concurrency
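A minimal sketch of "pulling config out of the critical path" with an in-process cache (the simulated 10 ms lookup stands in for a real datastore round trip):

```python
import time
from functools import lru_cache

def fetch_config(key):
    time.sleep(0.01)  # stand-in for a datastore/network round trip
    return {"batch_size": 500}[key]

@lru_cache(maxsize=None)
def fetch_config_cached(key):
    return fetch_config(key)

start = time.perf_counter()
for _ in range(100):
    fetch_config_cached("batch_size")  # 1 real lookup, 99 cache hits
elapsed_cached = time.perf_counter() - start

print(f"{elapsed_cached * 1000:.1f} ms for 100 reads")  # ~10 ms, not ~1000 ms
```

No thread count or interpreter build changes the fact that the hot path no longer waits on the network; a real system would add TTL-based invalidation on top.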
I deleted code and my CPU usage dropped by 90%.

I recently spent time optimizing a PDF processing microservice (FastAPI + PyMuPDF) and stumbled upon a surprising performance killer: gc.collect().

- By simply removing manual garbage collection calls, my CPU utilization plummeted from spikes of 80%+ down to a steady 5-6%.
- In Python, we often get tempted to call gc.collect() manually when handling memory-heavy tasks (like splitting large PDFs) to "help" the system. However, this often causes more harm than good.
- Every time you call gc.collect(), Python pauses your execution to traverse every tracked object in memory. In a high-concurrency environment, this creates massive CPU churn.

The Takeaway: Trust the runtime. If you feel the need to manually trigger the GC, it's usually a sign that your object management needs a refactor, not that the collector is being "lazy."

Sometimes, the best code is the code you delete.

#Python #SoftwareEngineering #Performance #FastAPI #CleanCode #BackendDevelopment
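A small sketch of why per-request gc.collect() calls hurt (timings vary by machine; the dict churn is a stand-in for PDF splitting):

```python
import gc
import timeit

def churn(manual_gc):
    # Simulate a handler that allocates many short-lived objects per request.
    for _ in range(200):
        pages = [{"page": i, "text": "x" * 50} for i in range(1000)]
        if manual_gc:
            gc.collect()  # full pass over every tracked object, every request

t_auto = timeit.timeit(lambda: churn(False), number=1)
t_manual = timeit.timeit(lambda: churn(True), number=1)
print(f"refcounting only: {t_auto:.2f}s   with gc.collect(): {t_manual:.2f}s")
```

Reference counting already frees the short-lived objects as soon as they go out of scope; the manual collections only add CPU work.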
A quick exercise for anyone working in computer vision: implement the following in Python from scratch (no external libraries), within 50 minutes:

• Brightness adjustment
• Contrast enhancement
• High-impact contrast & color enhancement
• RGB to grayscale conversion
• Image normalization

The implementations themselves are straightforward. The real point is understanding how pixel intensities behave under transformation, and how small changes affect visual outcomes. A simple task on the surface, but a good check on fundamentals.

I'd be happy if you share your code, results, and your thoughts and impressions during and after completing these tasks.
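As a warm-up for the grayscale item, here is a from-scratch sketch using the ITU-R BT.601 luma weights, with the image represented as nested lists of RGB tuples:

```python
def to_grayscale(image):
    # BT.601 luma: green dominates because the eye is most
    # sensitive to it; blue contributes least.
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in image]

img = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
print(to_grayscale(img))  # [[76, 150], [29, 255]]
```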