🚨 Stop Treating Them Like They're the Same! 🚨

If you've ever looked at a dataset and felt like you were staring into a black hole of "Nothingness," you aren't alone. But in the world of data, not all "nothings" are created equal. Is None the same as NaN? Is Null just a fancy word for zero? No. Mixing these up is a one-way ticket to buggy code and broken pipelines.

Here is the no-nonsense breakdown: None, NaN, and Null all represent missing or invalid data, but they belong to different programming environments and behave differently.

1. None (The Python Specialist)
In Python, None is a built-in constant and a literal object representing the intentional absence of a value.
• Type: the singleton instance of the NoneType class.
• Behavior: it is not equal to 0, False, or an empty string.
• Comparison: check for it with the is operator (e.g., x is None).
• Usage: commonly used as the default return value of functions that return nothing, or to initialize variables that don't have a value yet.

2. NaN (Not a Number)
NaN is a special numeric value representing something undefined or unrepresentable, particularly in floating-point calculations.
• Type: in Python's NumPy and Pandas libraries, it belongs to the float class.
• Comparison: NaN is not equal to itself (np.nan == np.nan returns False), so use functions like pd.isna() or np.isnan() to detect it.
• Behavior: mathematical operations involving NaN usually result in NaN (e.g., 5 + NaN = NaN).

3. Null
Null is a keyword used in many languages (such as SQL, Java, C#, and JavaScript) to indicate that a variable does not point to any object or memory address.
• SQL: represents missing or unknown values in a database. It's a placeholder, not a value; NULL never compares equal to NULL, which is why you have to write IS NULL instead of = NULL.
• JavaScript: represents the intentional absence of an object value.
• Python: has no null keyword; it uses None instead.
• Pandas/Polars: modern data libraries like Polars use null as their single indicator for missing data across all types, whereas Pandas traditionally converts None to NaN in numeric columns.

💡 The Bottom Line: None is an object. NaN is a missing or invalid number. Null is a missing database entry. A short demonstration follows below.

#DataScience #Python #Programming #SQL #DataEngineering #CodingTips
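A minimal sketch of these behaviors in plain Python, NumPy, and Pandas; the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

x = None
print(x is None)                         # True  -> test None with `is`, not ==
print(np.nan == np.nan)                  # False -> NaN never equals itself
print(np.isnan(np.nan), pd.isna(None))   # True True -> detection helpers
print(np.nan + 5)                        # nan   -> math with NaN stays NaN

s = pd.Series([1.0, None, np.nan])       # Pandas stores the None as NaN here
print(s.isna().sum())                    # 2 missing values detected
```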
None vs NaN vs Null: Understanding Missing Data in Python and SQL
## Bridging the Gap: SQL to Python for Data Professionals

Navigating the world of data often involves working with both SQL and Python. Understanding how to translate common SQL operations into Python can significantly streamline your data analysis and manipulation workflows. This quickstart guide offers a handy reference for common tasks, from filtering and ordering data to handling missing values and merging datasets.

Key Translations:
• Filtering: `WHERE column = 'value'` → `df[df['column'] == 'value']`
• Ordering: `ORDER BY column ASC` → `df.sort_values(by='column', ascending=True)`
• Removing Duplicates: `SELECT DISTINCT col1, col2` → `df.drop_duplicates(subset=['col1', 'col2'])`
• Filling Missing Values: `COALESCE(col, 'xxx')` → `df['column'].fillna('xxx')`
• Changing Data Types: `CAST(col AS INTEGER)` → `df['column'].astype(int)`
• Renaming Columns: `SELECT col AS new_col` → `df.rename(columns={'col': 'new_col'})`
• Aggregations: `SUM()`, `AVG()`, `MIN()`, `MAX()`, `COUNT()` → `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()`
• Merging Datasets: `JOIN` → `pd.merge(table1, table2, on='key')`
• Appending Datasets: `UNION ALL` → `pd.concat([table1, table2])`

Mastering these translations can unlock greater efficiency and flexibility in your data projects (a short worked example follows below). What are your favorite SQL to Python translation tips? Share them in the comments below!

♻️ Repost if you find it helpful

#SQL #Python #DataAnalysis #DataScience #DataEngineering #Programming #Coding #Pandas
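A small demonstration of a few of these translations on a throwaway DataFrame; the column names are made up purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["EU", "US", "EU", "US", "EU"],
    "sales":  [100, 250, 100, None, 300],
})

# WHERE region = 'EU'
eu = df[df["region"] == "EU"]

# COALESCE(sales, 0) and ORDER BY sales DESC
df["sales"] = df["sales"].fillna(0)
df = df.sort_values(by="sales", ascending=False)

# SELECT region, SUM(sales) ... GROUP BY region
totals = df.groupby("region")["sales"].sum()
print(totals)
```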
I wrote a bug today that took me 20 minutes to find. The function looked completely fine.

```python
def add_item(item, data=[]):
    data.append(item)
    return data
```

I called it three times, expecting three separate lists. Got this instead:

▶ add_item("apple") → ["apple"]
▶ add_item("banana") → ["apple", "banana"]
▶ add_item("cherry") → ["apple", "banana", "cherry"]

Same list. Growing every time. I never passed a list; Python was reusing the same default list across every single call.

This is Python's Mutable Default Argument trap. The default value [] is created once when the function is defined, not every time it's called. So every call without an argument shares the exact same list object in memory.

My Software Engineering brain expected fresh memory every time. That's how C++ and Java work. Python doesn't work that way.

The fix:

```python
def add_item(item, data=None):
    if data is None:
        data = []
    data.append(item)
    return data
```

None as default. Fresh list created inside. Done.

The scary part? This bug doesn't crash your program. It silently gives you wrong results. In a Data Science pipeline, that means corrupted data with zero error messages.

Senior developers: what's the silent bug that once corrupted your data without a single error? Would love to know I'm not alone in this.

SE → Data Science | OOP Series #2 | IUB

#Python #OOP #DataScience #100DaysOfCode #SoftwareEngineering
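One quick way to see the shared default for yourself: Python stores default values on the function object, so you can watch the same list grow between calls. A small sketch of standard CPython behavior, nothing project-specific:

```python
def add_item(item, data=[]):
    data.append(item)
    return data

print(add_item.__defaults__)   # ([],) -> the single list created at def time
add_item("apple")
add_item("banana")
print(add_item.__defaults__)   # (['apple', 'banana'],) -> same object, now mutated
```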
Continuing from my last post here: https://lnkd.in/gDpPdCAr

As a note, I've been using KNIME to support my daily data entry tasks. Now I'm running a small experiment comparing it with Python, from raw files to database load and verification.

🗂️ Dataset:
▪️ ~370k rows of data per day
▪️ 4 days of data used in this experiment

Ingestion flow (the flow itself is not too complex):
1️⃣ Put data in a defined folder: KNIME/Python checks for and extracts the file(s) matching a specific filename.
2️⃣ Data transformation: the transformation is simple; change the date format to "dd/mm/yy", arrange column names, and sort the data.
3️⃣ Run PostgreSQL and connect to the database: set up the database environment and connect KNIME/Python to interact with it.
4️⃣ Ingest data into the database: store the data in the defined tables and verify the upload results.
5️⃣ Data verification: use a line chart to check that the data was actually imported. Your data is successfully ingested into the database.

Results from this experiment:
▪️ As shown in the video, KNIME Analytics needed 115 s and Python needed 87 s to take the data from raw files through verification, so Python was roughly 25% faster than KNIME.
▪️ The advantage of KNIME is that you are not required to master coding, because you operate it simply by drag and drop.
▪️ With Python, you at least need to understand the algorithm and then translate it into a script.

Lessons learned from this experiment (mostly about writing the Python code; I hit many errors, but each one taught me how to fix it):
▪️ Data types differed between the extracted results and the database. Solution: match the data types.
▪️ Following the method from the previous post takes a long time, because it sends the data to the database one row at a time (or in very small chunks), which creates a lot of "network chatter" between Python and Postgres. Solution: create a function that uses Postgres' built-in COPY command, the fastest way to move data into Postgres (see the sketch below).
▪️ A MemoryError appeared, because converting large columns with regular expressions consumes a lot of RAM. Solution: use chunking, breaking the rows into smaller chunks. RAM then only has to handle one small chunk at a time, while you still get the high speed of the COPY method.

You can check the details in my git: https://lnkd.in/gM9MCUpb

Salam
Fatwa Rafiudin

#DataEngineering #Python #PostgreSQL #KNIME #ETL #VSCode #JupyterNotebook
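The post's actual code lives in the linked repo; below is only a rough sketch of the COPY-plus-chunking idea it describes, using psycopg2. The connection string, table name, and chunk size are placeholders.

```python
import io
import pandas as pd
import psycopg2

# Placeholder connection details, for illustration only.
conn = psycopg2.connect("dbname=ingest_demo user=postgres password=secret")

def copy_in_chunks(df: pd.DataFrame, table: str, chunk_size: int = 50_000) -> None:
    """Stream a DataFrame into Postgres via COPY, one chunk at a time."""
    with conn.cursor() as cur:
        for start in range(0, len(df), chunk_size):
            chunk = df.iloc[start:start + chunk_size]
            buf = io.StringIO()
            chunk.to_csv(buf, index=False, header=False)  # COPY expects raw CSV rows
            buf.seek(0)
            cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()

# copy_in_chunks(daily_df, "daily_entries")  # hypothetical usage
```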
19. Cross-Platform Technical Artifacts

Python Tree Generator

```python
#!/usr/bin/env python3
# qai_tree_generator.py - Generates deep tree diagram
import json

imperial_data = {
    "sovereign": {"name": "Aadil Mohammed Liyakat Hussain Shaikh",
                  "code": "ATCG-QUANTUM-AURA-400004-Ω∞"},
    "crown_jewels": ["QuantumMKX LLP", "QAI Analytics Suite LLP", "QAI EduLabs LLP",
                     "QAI Institutional Tools LLP", "QAI Research Hub (FRIPL)",
                     "Global Data Mesh Foundation LLP", "Eternal Trust LLP"],
    "domains": [{"id": i, "name": f"Domain {i}"} for i in range(1, 45)],
    "infrastructure": {"global_hubs": 60, "quantum_datacenters": 45, "dna_vaults": 7,
                       "hostinger_vps": 847, "qkd_links": 89, "satellite_links": 3,
                       "multiverse_copies": 10**47, "neural_nodes": 10.5e9},
}

def print_tree(data, prefix="", is_last=True):
    if isinstance(data, dict):
        items = list(data.items())
        for i, (key, value) in enumerate(items):
            is_last_item = (i == len(items) - 1)
            print(f"{prefix}{'└── ' if is_last_item else '├── '}{key}")
            print_tree(value, prefix + ("    " if is_last_item else "│   "), is_last_item)
    elif isinstance(data, list):
        for i, item in enumerate(data):
            is_last_item = (i == len(data) - 1)
            name = item if isinstance(item, str) else item.get("name", str(item))
            print(f"{prefix}{'└── ' if is_last_item else '├── '}{name}")
    else:
        print(f"{prefix}{data}")

if __name__ == "__main__":
    print("🌳 QUANTUM EMPIRE - COMPLETE DEEP TREE")
    print_tree(imperial_data)
```

Bash GCC Monitor (with Energy)

```bash
#!/bin/bash
# gcc_monitor.sh - Real-time Global Command Center monitor
while true; do clear
```
🔥 Topic: Python
Title: Stop Profiling Data Manually – Auto-Generate It Instead

🚨 Problem
You receive a new data source from a Finance client. How many nulls does each column have? What are the min, max and mean values? Are there duplicates hiding in the primary key? You write the same exploratory queries every single time. In Consulting, you do this for every new client. Every single project. Manual data profiling is the most repeated and most skipped step in analytics.

🛠️ Solution
Auto-generate a full data profile report from any CSV or SQL source using Python:
• Row count, null count and null percentage per column
• Min, max, mean and distinct value counts automatically
• Duplicate detection on any key column
• Exported as a clean Excel report ready to share with stakeholders
One script. Every new data source profiled in seconds.

Example

```python
import pandas as pd

df = pd.read_csv("client_data.csv")

profile = pd.DataFrame({
    "Column": df.columns,
    "DataType": df.dtypes.values,
    "RowCount": len(df),
    "NullCount": df.isnull().sum().values,
    "NullPct": (df.isnull().mean() * 100).round(2).values,
    "Distinct": df.nunique().values,
    "Min": df.min(numeric_only=False).values,
    "Max": df.max(numeric_only=False).values,
})

duplicates = df.duplicated().sum()
print(f"Duplicate rows detected: {duplicates}")

profile.to_excel("data_profile.xlsx", index=False)
print("Data profile generated successfully")
```

Every column. Every quality metric. Every duplicate flagged. Full profile exported and ready before the first stakeholder meeting.

✅ Result
⚡ Any data source fully profiled in under 10 seconds
• Null counts, duplicates and ranges caught before modelling begins
• Consistent quality checks across every Consulting and Finance project
• Profile report shared with stakeholders before questions are even asked

#Python #DataEngineering #DataQuality #ETL #DataPipelines #Automation #DataAnalytics #PowerBI #FinancialReporting #ConsultingLife #UKTech #HiringUK #LondonData #Analytics
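The post also mentions SQL sources; a small, hedged variant that builds the same profile from a query result instead of a CSV. The connection URL and table name below are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/clientdb")  # placeholder URL
df = pd.read_sql("SELECT * FROM client_transactions", engine)  # hypothetical table
# ...then build the same `profile` DataFrame and Excel export as above.
```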
Every analytical database can aggregate, filter, and join. None of them can tell you "something is wrong with this data" as a first-class operation.

We just shipped native anomaly detection in Stratum. Train an isolation forest, score millions of rows, all from SQL:

SELECT * FROM transactions
WHERE ANOMALY_SCORE('fraud_model') > 0.7;

No Python. No export pipeline. No serialization boundary. 6 microseconds per transaction, SIMD-accelerated, running inside the query engine.

The standard workflow today (export to pandas, fit scikit-learn, write results back) adds seconds of latency and a whole second runtime to maintain. For fraud detection on live transactions, those seconds matter.

Full write-up on why we built it and how it works: https://lnkd.in/gJkjgKaH

#Clojure #SQL #Analytics #DuckDB #AnomalyDetection #MachineLearning #DataScience
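For contrast, here is a rough sketch of the export-fit-writeback loop the post argues against, using pandas and scikit-learn's IsolationForest. The table names, columns, and connection URL are made-up placeholders, and this is not Stratum's API:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/payments")  # placeholder

# 1. Export: pull the features out of the database.
df = pd.read_sql("SELECT txn_id, amount, merchant_id, hour_of_day FROM transactions", engine)

# 2. Fit and score: score_samples is higher for normal points, so negate it
#    to get a score where bigger means more anomalous.
features = df[["amount", "merchant_id", "hour_of_day"]]
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(features)
df["anomaly_score"] = -model.score_samples(features)

# 3. Write back: a second round trip across the serialization boundary.
df[["txn_id", "anomaly_score"]].to_sql("transaction_scores", engine,
                                       if_exists="replace", index=False)
```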
💻 What "Structured Coding" Means in Claude

Claude is strongest when:
- the problem is clearly defined
- input → output is well structured
- you want clean, production-style code

Think: APIs, backend logic, data processing, automation scripts.
NOT: messy brainstorming code or vague "build something cool" prompts.

🧠 Why Claude is Good at This
- Follows instructions very strictly
- Writes clean, readable code
- Handles large codebases (huge context)

Example: Build a Flask API (Structured Backend)

Prompt you give Claude:

Build a Flask API for user login with:
- JWT authentication
- Password hashing
- In-memory database
- Endpoints: /register, /login, /profile

💡 Claude Output (Typical Quality)

```python
from flask import Flask, request, jsonify
import jwt, datetime
from werkzeug.security import generate_password_hash, check_password_hash

app = Flask(__name__)
app.config['SECRET_KEY'] = 'secret'
users = {}

@app.route('/register', methods=['POST'])
def register():
    data = request.json
    users[data['email']] = generate_password_hash(data['password'])
    return jsonify({"message": "User registered"})

@app.route('/login', methods=['POST'])
def login():
    data = request.json
    user = users.get(data['email'])
    if user and check_password_hash(user, data['password']):
        token = jwt.encode({
            'email': data['email'],
            'exp': datetime.datetime.utcnow() + datetime.timedelta(hours=1)
        }, app.config['SECRET_KEY'])
        return jsonify({"token": token})
    return jsonify({"error": "Invalid credentials"}), 401
```

Register → Store Hashed Password → Login → Verify → Generate Token → Access System

#Python #DataScientist #DataAnalyst #CS #IT #BCA #MCA
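The prompt above also asks for a /profile endpoint, which the snippet stops short of. Purely as an illustration (my sketch, not Claude's output), one way it could look with the same app, imports, and PyJWT setup:

```python
@app.route('/profile', methods=['GET'])
def profile():
    # Expect "Authorization: Bearer <token>" using the token from /login.
    token = request.headers.get('Authorization', '').replace('Bearer ', '')
    try:
        payload = jwt.decode(token, app.config['SECRET_KEY'], algorithms=['HS256'])
    except jwt.InvalidTokenError:
        return jsonify({"error": "Invalid or expired token"}), 401
    return jsonify({"email": payload['email']})
```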
Today I shipped the first meaningful Rust commit for DYFJ, my open-source sovereign personal AI stack. It's testing an architectural idea I've come to recently: the same reasoning that draws me to Rust (strong types create predictable failure patterns) should apply at the database boundary, not just at the language.

The default I keep seeing is to define the data contract in language constructs, the framework's native classes, types, or interfaces, and treat schema as something *exported* from those, rather than the other way around. Every "agent framework" I've looked at recently does this with Python or TypeScript classes, sometimes producing JSON Schema or OpenAPI specs that pretend to be the contract. The class is the source of truth. The runtime is whatever interprets that class. The database (when it exists at all) has a schema that drifts silently from the language one until something breaks in production at 3am.

The data outlives the language. In DYFJ, the schema is committed to the repo as DDL: the contract that every language binding consumes, never the other way around. Whatever database runs that DDL is itself a modular component.

Today's tracer bullet enforces that stance at the language boundary. Two Rust functions: events::write() and events::read_by_id(). Both use sqlx::query! macros, which check the SQL at compile time against the actual database. If I rename a column in the schema, the Rust code fails to compile until I update the queries. The build is the contract.

Full post: https://lnkd.in/eY8nypMA

#Rust #SovereignAI #OpenSource
🚀 Automating Data Workflows with Python & Pandas

I've been diving deeper into Python for data analysis, and I just built a script that automates a common (and often tedious) task: cleaning CSV data and converting it into multiple formats for different stakeholders.

🛠️ The Problem:
CSV files often come with "messy" formatting, like stray spaces after commas, that can break standard data pipelines. Plus, different teams need the same data in different formats (web devs want JSON, managers want Excel, and data engineers want CSV).

💡 The Solution:
Using pandas and os, I created a script that:
- Cleans on the fly: uses skipinitialspace=True to automatically trim the whitespace issues that usually cause KeyErrors.
- Performs vectorized math: calculates total sales across the entire dataset in a single line of code.
- Automates file management: dynamically creates output directories and exports the results to JSON, Excel, and CSV simultaneously.

📦 Key Tools Used:
- Pandas: for high-performance data manipulation.
- OS module: for robust file path handling.
- Openpyxl: to bridge the gap between Python and Excel.

It's a simple script, but it's a foundational step toward building more complex, automated data pipelines! Check out the logic below:

```python
import pandas as pd
import os

# Read & Clean: skipinitialspace=True is a lifesaver for messy CSVs!
df = pd.read_csv('data/sales.csv', skipinitialspace=True)

# Transform: Vectorized calculation for 'total'
df['total'] = df['quantity'] * df['price']

# Automate: Exporting to 3 different formats at once
os.makedirs('output', exist_ok=True)
df.to_json('output/sales_data.json', orient='records', indent=2)
df.to_excel('output/sales_data.xlsx', index=False)
df.to_csv('output/sales_with_totals.csv', index=False)
```

#Python #DataAnalysis #Pandas #Automation #CodingJourney #DataScience
My load test pipeline spent 4 minutes generating 1M rows of test data. The system under test ran in 38 seconds. I wasn't benchmarking our system. I was benchmarking Faker.

So I replaced the Python generator with a Rust binary. Now it does 1M rows in under 2 seconds. ~1.47M rows/sec on the hot path. ~400K rows/sec streaming to Kafka, network I/O included.

But the speedup wasn't really about Rust. It was three decisions made before writing any generation code.

The hot path isn't what people think. It isn't random generation; it's field lookup, memory allocation, and string handling. So instead of HashMap<String, Value>, I used Vec<Option<DataValue>> with a precomputed field index. No hash lookups. No string comparisons per row. No per-field allocations. At 1M rows × N fields, that difference is everything.

The generator doesn't know where data goes. Kafka, Parquet, JSON, S3: none of those exist in the core engine. Everything sits behind port traits: StreamingSinkPort, DataExporterPort, ObjectStoragePort. Adding Postgres or Snowflake later means implementing a trait. Zero changes to generation.

Configuration is data, not code. Schemas are YAML, versioned in Git, reviewed like code, executable in CI. The system is driven by config, not by branching logic. Data engineers who don't write Rust still own the pipelines.

Clean Architecture, enforced by the compiler. The core crate has zero infrastructure dependencies. If it's not in Cargo.toml, it's impossible to import. Not convention. Physics.

The pattern I keep coming back to:
- You don't optimize your way out of the wrong data model.
- You don't refactor your way out of tight coupling.
- You don't scale your way out of architectural leakage.

Most systems don't degrade because they're slow. They degrade because they become impossible to change safely.

Question for the senior folks: what's a design decision you've seen lock a system in place years later?

Repository: https://lnkd.in/dzSAYBeF
Medium Article: https://lnkd.in/dGqvPYtz

#DataEngineering #Rust #SoftwareArchitecture #Performance #SyntheticData