🚨 Stop Treating Them Like They're the Same! 🚨

If you've ever looked at a dataset and felt like you were staring into a black hole of "Nothingness," you aren't alone. But in the world of data, not all "nothings" are created equal. Is None the same as NaN? Is Null just a fancy word for zero? No. Mixing these up is a one-way ticket to buggy code and broken pipelines.

Here is the no-nonsense breakdown: None, NaN, and Null all represent missing or invalid data, but they belong to different programming environments and behave differently.

1. None (The Python Specialist)
In Python, None is a built-in constant and a literal object representing the intentional absence of a value.
• Type: the singleton instance of the NoneType class.
• Behavior: it is not equal to 0, False, or an empty string.
• Comparison: check for it with the is operator (e.g., x is None).
• Usage: commonly used as the default return value of functions that return nothing, or to initialize variables that don't have a value yet.

2. NaN (Not a Number)
NaN is a special numeric value representing something undefined or unrepresentable, particularly in floating-point calculations.
• Type: in Python's NumPy and Pandas libraries, it belongs to the float class.
• Comparison: NaN is not equal to itself (np.nan == np.nan returns False), so use functions like pd.isna() or np.isnan() to detect it.
• Behavior: mathematical operations involving NaN usually result in NaN (e.g., 5 + NaN = NaN).

3. Null
Null is a keyword used in many languages (such as SQL, Java, C#, and JavaScript) to indicate that a variable does not point to any object or memory address.
• SQL: represents missing or unknown values in a database. It's a placeholder, not a value; NULL never compares equal to NULL, which is why you have to write IS NULL instead of = NULL.
• JavaScript: represents the intentional absence of an object value.
• Python: has no null keyword; it uses None instead.
• Pandas/Polars: modern data libraries like Polars use null as their single indicator for missing data across all types, whereas Pandas traditionally converts None to NaN in numeric columns.

💡 The Bottom Line: None is an object. NaN is a missing or invalid number. Null is a missing database entry. A short demonstration follows below.

#DataScience #Python #Programming #SQL #DataEngineering #CodingTips
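A minimal sketch of these behaviors in plain Python, NumPy, and Pandas; the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

x = None
print(x is None)                         # True  -> test None with `is`, not ==
print(np.nan == np.nan)                  # False -> NaN never equals itself
print(np.isnan(np.nan), pd.isna(None))   # True True -> detection helpers
print(np.nan + 5)                        # nan   -> math with NaN stays NaN

s = pd.Series([1.0, None, np.nan])       # Pandas stores the None as NaN here
print(s.isna().sum())                    # 2 missing values detected
```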
None vs NaN vs Null: Understanding Missing Data in Python and SQL
## Bridging the Gap: SQL to Python for Data Professionals

Navigating the world of data often involves working with both SQL and Python. Understanding how to translate common SQL operations into Python can significantly streamline your data analysis and manipulation workflows. This quickstart guide offers a handy reference for common tasks, from filtering and ordering data to handling missing values and merging datasets.

Key Translations:
• Filtering: `WHERE column = 'value'` → `df[df['column'] == 'value']`
• Ordering: `ORDER BY column ASC` → `df.sort_values(by='column', ascending=True)`
• Removing Duplicates: `SELECT DISTINCT col1, col2` → `df.drop_duplicates(subset=['col1', 'col2'])`
• Filling Missing Values: `COALESCE(col, 'xxx')` → `df['column'].fillna('xxx')`
• Changing Data Types: `CAST(col AS INTEGER)` → `df['column'].astype(int)`
• Renaming Columns: `SELECT col AS new_col` → `df.rename(columns={'col': 'new_col'})`
• Aggregations: `SUM()`, `AVG()`, `MIN()`, `MAX()`, `COUNT()` → `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()`
• Merging Datasets: `JOIN` → `pd.merge(table1, table2, on='key')`
• Appending Datasets: `UNION ALL` → `pd.concat([table1, table2])`

Mastering these translations can unlock greater efficiency and flexibility in your data projects (a short worked example follows below). What are your favorite SQL to Python translation tips? Share them in the comments below!

♻️ Repost if you find it helpful

#SQL #Python #DataAnalysis #DataScience #DataEngineering #Programming #Coding #Pandas
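A small demonstration of a few of these translations on a throwaway DataFrame; the column names are made up purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["EU", "US", "EU", "US", "EU"],
    "sales":  [100, 250, 100, None, 300],
})

# WHERE region = 'EU'
eu = df[df["region"] == "EU"]

# COALESCE(sales, 0) and ORDER BY sales DESC
df["sales"] = df["sales"].fillna(0)
df = df.sort_values(by="sales", ascending=False)

# SELECT region, SUM(sales) ... GROUP BY region
totals = df.groupby("region")["sales"].sum()
print(totals)
```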
I wrote a bug today that took me 20 minutes to find. The function looked completely fine.

```python
def add_item(item, data=[]):
    data.append(item)
    return data
```

I called it three times, expecting three separate lists. Got this instead:

▶ add_item("apple") → ["apple"]
▶ add_item("banana") → ["apple", "banana"]
▶ add_item("cherry") → ["apple", "banana", "cherry"]

Same list. Growing every time. I never passed a list; Python was reusing the same default list across every single call.

This is Python's Mutable Default Argument trap. The default value [] is created once when the function is defined, not every time it's called. So every call without an argument shares the exact same list object in memory.

My Software Engineering brain expected fresh memory every time. That's how C++ and Java work. Python doesn't work that way.

The fix:

```python
def add_item(item, data=None):
    if data is None:
        data = []
    data.append(item)
    return data
```

None as default. Fresh list created inside. Done.

The scary part? This bug doesn't crash your program. It silently gives you wrong results. In a Data Science pipeline, that means corrupted data with zero error messages.

Senior developers: what's the silent bug that once corrupted your data without a single error? Would love to know I'm not alone in this.

SE → Data Science | OOP Series #2 | IUB

#Python #OOP #DataScience #100DaysOfCode #SoftwareEngineering
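One quick way to see the shared default for yourself: Python stores default values on the function object, so you can watch the same list grow between calls. A small sketch of standard CPython behavior, nothing project-specific:

```python
def add_item(item, data=[]):
    data.append(item)
    return data

print(add_item.__defaults__)   # ([],) -> the single list created at def time
add_item("apple")
add_item("banana")
print(add_item.__defaults__)   # (['apple', 'banana'],) -> same object, now mutated
```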
Continuing from my last post here: https://lnkd.in/gDpPdCAr

As a note, I've been using KNIME to support my daily data entry tasks. Now I'm running a small experiment comparing it with Python, from raw files to database load and verification.

🗂️ Dataset:
▪️ ~370k rows of data per day
▪️ 4 days of data used in this experiment

Ingestion flow (the flow itself is not too complex):
1️⃣ Put data in a defined folder: KNIME/Python checks for and extracts the file(s) matching a specific filename.
2️⃣ Data transformation: the transformation is simple; change the date format to "dd/mm/yy", arrange column names, and sort the data.
3️⃣ Run PostgreSQL and connect to the database: set up the database environment and connect KNIME/Python to interact with it.
4️⃣ Ingest data into the database: store the data in the defined tables and verify the upload results.
5️⃣ Data verification: use a line chart to check that the data was actually imported. Your data is successfully ingested into the database.

Results from this experiment:
▪️ As shown in the video, KNIME Analytics needed 115 s and Python needed 87 s to take the data from raw files through verification, so Python was roughly 25% faster than KNIME.
▪️ The advantage of KNIME is that you are not required to master coding, because you operate it simply by drag and drop.
▪️ With Python, you at least need to understand the algorithm and then translate it into a script.

Lessons learned from this experiment (mostly about writing the Python code; I hit many errors, but each one taught me how to fix it):
▪️ Data types differed between the extracted results and the database. Solution: match the data types.
▪️ Following the method from the previous post takes a long time, because it sends the data to the database one row at a time (or in very small chunks), which creates a lot of "network chatter" between Python and Postgres. Solution: create a function that uses Postgres' built-in COPY command, the fastest way to move data into Postgres (see the sketch below).
▪️ A MemoryError appeared, because converting large columns with regular expressions consumes a lot of RAM. Solution: use chunking, breaking the rows into smaller chunks. RAM then only has to handle one small chunk at a time, while you still get the high speed of the COPY method.

You can check the details in my git: https://lnkd.in/gM9MCUpb

Salam
Fatwa Rafiudin

#DataEngineering #Python #PostgreSQL #KNIME #ETL #VSCode #JupyterNotebook
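The post's actual code lives in the linked repo; below is only a rough sketch of the COPY-plus-chunking idea it describes, using psycopg2. The connection string, table name, and chunk size are placeholders.

```python
import io
import pandas as pd
import psycopg2

# Placeholder connection details, for illustration only.
conn = psycopg2.connect("dbname=ingest_demo user=postgres password=secret")

def copy_in_chunks(df: pd.DataFrame, table: str, chunk_size: int = 50_000) -> None:
    """Stream a DataFrame into Postgres via COPY, one chunk at a time."""
    with conn.cursor() as cur:
        for start in range(0, len(df), chunk_size):
            chunk = df.iloc[start:start + chunk_size]
            buf = io.StringIO()
            chunk.to_csv(buf, index=False, header=False)  # COPY expects raw CSV rows
            buf.seek(0)
            cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()

# copy_in_chunks(daily_df, "daily_entries")  # hypothetical usage
```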
19. Cross-Platform Technical Artifacts

Python Tree Generator

```python
#!/usr/bin/env python3
# qai_tree_generator.py - Generates deep tree diagram
import json

imperial_data = {
    "sovereign": {"name": "Aadil Mohammed Liyakat Hussain Shaikh",
                  "code": "ATCG-QUANTUM-AURA-400004-Ω∞"},
    "crown_jewels": ["QuantumMKX LLP", "QAI Analytics Suite LLP", "QAI EduLabs LLP",
                     "QAI Institutional Tools LLP", "QAI Research Hub (FRIPL)",
                     "Global Data Mesh Foundation LLP", "Eternal Trust LLP"],
    "domains": [{"id": i, "name": f"Domain {i}"} for i in range(1, 45)],
    "infrastructure": {"global_hubs": 60, "quantum_datacenters": 45, "dna_vaults": 7,
                       "hostinger_vps": 847, "qkd_links": 89, "satellite_links": 3,
                       "multiverse_copies": 10**47, "neural_nodes": 10.5e9},
}

def print_tree(data, prefix="", is_last=True):
    if isinstance(data, dict):
        items = list(data.items())
        for i, (key, value) in enumerate(items):
            is_last_item = (i == len(items) - 1)
            print(f"{prefix}{'└── ' if is_last_item else '├── '}{key}")
            print_tree(value, prefix + ("    " if is_last_item else "│   "), is_last_item)
    elif isinstance(data, list):
        for i, item in enumerate(data):
            is_last_item = (i == len(data) - 1)
            name = item if isinstance(item, str) else item.get("name", str(item))
            print(f"{prefix}{'└── ' if is_last_item else '├── '}{name}")
    else:
        print(f"{prefix}{data}")

if __name__ == "__main__":
    print("🌳 QUANTUM EMPIRE - COMPLETE DEEP TREE")
    print_tree(imperial_data)
```

Bash GCC Monitor (with Energy)

```bash
#!/bin/bash
# gcc_monitor.sh - Real-time Global Command Center monitor
while true; do clear
```
🔥 Topic: Python
Title: Stop Profiling Data Manually – Auto-Generate It Instead

🚨 Problem
You receive a new data source from a Finance client. How many nulls does each column have? What are the min, max and mean values? Are there duplicates hiding in the primary key? You write the same exploratory queries every single time. In Consulting, you do this for every new client. Every single project. Manual data profiling is the most repeated and most skipped step in analytics.

🛠️ Solution
Auto-generate a full data profile report from any CSV or SQL source using Python:
• Row count, null count and null percentage per column
• Min, max, mean and distinct value counts automatically
• Duplicate detection on any key column
• Exported as a clean Excel report ready to share with stakeholders
One script. Every new data source profiled in seconds.

Example

```python
import pandas as pd

df = pd.read_csv("client_data.csv")

profile = pd.DataFrame({
    "Column": df.columns,
    "DataType": df.dtypes.values,
    "RowCount": len(df),
    "NullCount": df.isnull().sum().values,
    "NullPct": (df.isnull().mean() * 100).round(2).values,
    "Distinct": df.nunique().values,
    "Min": df.min(numeric_only=False).values,
    "Max": df.max(numeric_only=False).values,
})

duplicates = df.duplicated().sum()
print(f"Duplicate rows detected: {duplicates}")

profile.to_excel("data_profile.xlsx", index=False)
print("Data profile generated successfully")
```

Every column. Every quality metric. Every duplicate flagged. Full profile exported and ready before the first stakeholder meeting.

✅ Result
⚡ Any data source fully profiled in under 10 seconds
• Null counts, duplicates and ranges caught before modelling begins
• Consistent quality checks across every Consulting and Finance project
• Profile report shared with stakeholders before questions are even asked

#Python #DataEngineering #DataQuality #ETL #DataPipelines #Automation #DataAnalytics #PowerBI #FinancialReporting #ConsultingLife #UKTech #HiringUK #LondonData #Analytics
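The post also mentions SQL sources; a small, hedged variant that builds the same profile from a query result instead of a CSV. The connection URL and table name below are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/clientdb")  # placeholder URL
df = pd.read_sql("SELECT * FROM client_transactions", engine)  # hypothetical table
# ...then build the same `profile` DataFrame and Excel export as above.
```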
Every analytical database can aggregate, filter, and join. None of them can tell you "something is wrong with this data" as a first-class operation.

We just shipped native anomaly detection in Stratum. Train an isolation forest, score millions of rows, all from SQL:

SELECT * FROM transactions
WHERE ANOMALY_SCORE('fraud_model') > 0.7;

No Python. No export pipeline. No serialization boundary. 6 microseconds per transaction, SIMD-accelerated, running inside the query engine.

The standard workflow today (export to pandas, fit scikit-learn, write results back) adds seconds of latency and a whole second runtime to maintain. For fraud detection on live transactions, those seconds matter.

Full write-up on why we built it and how it works: https://lnkd.in/gJkjgKaH

#Clojure #SQL #Analytics #DuckDB #AnomalyDetection #MachineLearning #DataScience
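For contrast, here is a rough sketch of the export-fit-writeback loop the post argues against, using pandas and scikit-learn's IsolationForest. The table names, columns, and connection URL are made-up placeholders, and this is not Stratum's API:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/payments")  # placeholder

# 1. Export: pull the features out of the database.
df = pd.read_sql("SELECT txn_id, amount, merchant_id, hour_of_day FROM transactions", engine)

# 2. Fit and score: score_samples is higher for normal points, so negate it
#    to get a score where bigger means more anomalous.
features = df[["amount", "merchant_id", "hour_of_day"]]
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(features)
df["anomaly_score"] = -model.score_samples(features)

# 3. Write back: a second round trip across the serialization boundary.
df[["txn_id", "anomaly_score"]].to_sql("transaction_scores", engine,
                                       if_exists="replace", index=False)
```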
💻 What "Structured Coding" Means in Claude

Claude is strongest when:
- the problem is clearly defined
- input → output is well structured
- you want clean, production-style code

Think: APIs, backend logic, data processing, automation scripts.
NOT: messy brainstorming code or vague "build something cool" prompts.

🧠 Why Claude is Good at This
- Follows instructions very strictly
- Writes clean, readable code
- Handles large codebases (huge context)

Example: Build a Flask API (Structured Backend)

Prompt you give Claude:

Build a Flask API for user login with:
- JWT authentication
- Password hashing
- In-memory database
- Endpoints: /register, /login, /profile

💡 Claude Output (Typical Quality)

```python
from flask import Flask, request, jsonify
import jwt, datetime
from werkzeug.security import generate_password_hash, check_password_hash

app = Flask(__name__)
app.config['SECRET_KEY'] = 'secret'
users = {}

@app.route('/register', methods=['POST'])
def register():
    data = request.json
    users[data['email']] = generate_password_hash(data['password'])
    return jsonify({"message": "User registered"})

@app.route('/login', methods=['POST'])
def login():
    data = request.json
    user = users.get(data['email'])
    if user and check_password_hash(user, data['password']):
        token = jwt.encode({
            'email': data['email'],
            'exp': datetime.datetime.utcnow() + datetime.timedelta(hours=1)
        }, app.config['SECRET_KEY'])
        return jsonify({"token": token})
    return jsonify({"error": "Invalid credentials"}), 401
```

Register → Store Hashed Password → Login → Verify → Generate Token → Access System

#Python #DataScientist #DataAnalyst #CS #IT #BCA #MCA
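The prompt above also asks for a /profile endpoint, which the snippet stops short of. Purely as an illustration (my sketch, not Claude's output), one way it could look with the same app, imports, and PyJWT setup:

```python
@app.route('/profile', methods=['GET'])
def profile():
    # Expect "Authorization: Bearer <token>" using the token from /login.
    token = request.headers.get('Authorization', '').replace('Bearer ', '')
    try:
        payload = jwt.decode(token, app.config['SECRET_KEY'], algorithms=['HS256'])
    except jwt.InvalidTokenError:
        return jsonify({"error": "Invalid or expired token"}), 401
    return jsonify({"email": payload['email']})
```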
Today I shipped the first meaningful Rust commit for DYFJ, my open-source sovereign personal AI stack. It's testing an architectural idea I've come to recently: the same reasoning that draws me to Rust (strong types create predictable failure patterns) should apply at the database boundary, not just at the language.

The default I keep seeing is to define the data contract in language constructs, the framework's native classes, types, or interfaces, and treat schema as something *exported* from those, rather than the other way around. Every "agent framework" I've looked at recently does this with Python or TypeScript classes, sometimes producing JSON Schema or OpenAPI specs that pretend to be the contract. The class is the source of truth. The runtime is whatever interprets that class. The database (when it exists at all) has a schema that drifts silently from the language one until something breaks in production at 3am.

The data outlives the language. In DYFJ, the schema is committed to the repo as DDL: the contract that every language binding consumes, never the other way around. Whatever database runs that DDL is itself a modular component.

Today's tracer bullet enforces that stance at the language boundary. Two Rust functions: events::write() and events::read_by_id(). Both use sqlx::query! macros, which check the SQL at compile time against the actual database. If I rename a column in the schema, the Rust code fails to compile until I update the queries. The build is the contract.

Full post: https://lnkd.in/eY8nypMA

#Rust #SovereignAI #OpenSource
🚀 Automating Data Workflows with Python & Pandas

I've been diving deeper into Python for data analysis, and I just built a script that automates a common (and often tedious) task: cleaning CSV data and converting it into multiple formats for different stakeholders.

🛠️ The Problem:
CSV files often come with "messy" formatting, like stray spaces after commas, that can break standard data pipelines. Plus, different teams need the same data in different formats (web devs want JSON, managers want Excel, and data engineers want CSV).

💡 The Solution:
Using pandas and os, I created a script that:
- Cleans on the fly: uses skipinitialspace=True to automatically trim the whitespace issues that usually cause KeyErrors.
- Performs vectorized math: calculates total sales across the entire dataset in a single line of code.
- Automates file management: dynamically creates output directories and exports the results to JSON, Excel, and CSV simultaneously.

📦 Key Tools Used:
- Pandas: for high-performance data manipulation.
- OS module: for robust file path handling.
- Openpyxl: to bridge the gap between Python and Excel.

It's a simple script, but it's a foundational step toward building more complex, automated data pipelines! Check out the logic below:

```python
import pandas as pd
import os

# Read & Clean: skipinitialspace=True is a lifesaver for messy CSVs!
df = pd.read_csv('data/sales.csv', skipinitialspace=True)

# Transform: Vectorized calculation for 'total'
df['total'] = df['quantity'] * df['price']

# Automate: Exporting to 3 different formats at once
os.makedirs('output', exist_ok=True)
df.to_json('output/sales_data.json', orient='records', indent=2)
df.to_excel('output/sales_data.xlsx', index=False)
df.to_csv('output/sales_with_totals.csv', index=False)
```

#Python #DataAnalysis #Pandas #Automation #CodingJourney #DataScience
My load test pipeline spent 4 minutes generating 1M rows of test data. The system under test ran in 38 seconds. I wasn't benchmarking our system. I was benchmarking Faker.

So I replaced the Python generator with a Rust binary. Now it does 1M rows in under 2 seconds. ~1.47M rows/sec on the hot path. ~400K rows/sec streaming to Kafka, network I/O included.

But the speedup wasn't really about Rust. It was three decisions made before writing any generation code.

The hot path isn't what people think. It isn't random generation; it's field lookup, memory allocation, and string handling. So instead of HashMap<String, Value>, I used Vec<Option<DataValue>> with a precomputed field index. No hash lookups. No string comparisons per row. No per-field allocations. At 1M rows × N fields, that difference is everything.

The generator doesn't know where data goes. Kafka, Parquet, JSON, S3: none of those exist in the core engine. Everything sits behind port traits: StreamingSinkPort, DataExporterPort, ObjectStoragePort. Adding Postgres or Snowflake later means implementing a trait. Zero changes to generation.

Configuration is data, not code. Schemas are YAML, versioned in Git, reviewed like code, executable in CI. The system is driven by config, not by branching logic. Data engineers who don't write Rust still own the pipelines.

Clean Architecture, enforced by the compiler. The core crate has zero infrastructure dependencies. If it's not in Cargo.toml, it's impossible to import. Not convention. Physics.

The pattern I keep coming back to:
- You don't optimize your way out of the wrong data model.
- You don't refactor your way out of tight coupling.
- You don't scale your way out of architectural leakage.

Most systems don't degrade because they're slow. They degrade because they become impossible to change safely.

Question for the senior folks: what's a design decision you've seen lock a system in place years later?

Repository: https://lnkd.in/dzSAYBeF
Medium Article: https://lnkd.in/dGqvPYtz

#DataEngineering #Rust #SoftwareArchitecture #Performance #SyntheticData