A fraud model reports 92% accuracy in testing. Two weeks later, false positives surge. Customers get blocked. Revenue takes a hit. No one changed the model. So what failed?

Not the algorithm. The data flow. Late-arriving records weren't handled. Duplicates weren't removed properly. Training logic didn't match serving logic.

In production, models rarely break because of machine learning theory. They break because the underlying data system isn't designed for reality. After building and reviewing multiple ML systems in production environments, one thing is clear: strong SQL patterns are what separate demo projects from production-grade AI systems.

Here are 14 SQL patterns that actually matter in real-world data science systems:

1. Deduplication using window functions: ensure only the latest or correct record per entity survives noisy event streams (a sketch follows after this post).
2. Handling late-arriving data: design logic that updates aggregates when delayed records arrive.
3. Idempotent transformations: make pipelines safe to re-run without corrupting outputs.
4. Feature consistency (training vs. serving parity): use identical logic to generate features across batch and real-time systems.
5. Incremental model feature builds: process only new or changed data instead of recomputing everything.
6. Slowly Changing Dimensions (SCD): track historical changes in user or entity attributes accurately.
7. Sessionization patterns: group events into logical sessions using time-based rules.
8. Rolling and windowed aggregations: compute features like 7-day averages or 30-day sums efficiently.
9. Event ordering and sequencing: preserve chronological integrity for behavioral modeling.
10. Data validation checks in SQL: catch null spikes, schema drifts, and anomalies early.
11. Outlier filtering and anomaly flags: prevent extreme values from poisoning training data.
12. Partition-aware queries: optimize performance and cost for large-scale datasets.
13. Experiment tracking joins: correctly map users to experiments for clean A/B analysis.
14. Reproducible feature snapshots: store versioned datasets to recreate past model states exactly.

Final thought: models get the spotlight, SQL pipelines carry the weight. If your data foundation is weak, your model will eventually expose it. Build patterns that survive real traffic, messy data, and scale. That's how production AI systems stay reliable.

If this helped, repost and follow Sumit Gupta for more insights!
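A minimal sketch of pattern 1, assuming an events table with a user_id key and an updated_at timestamp (all names are illustrative): keep only the latest row per entity with ROW_NUMBER().

-- Keep the most recent record per user_id from a noisy event stream.
-- Table and column names (events, user_id, updated_at) are illustrative.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY user_id          -- one partition per entity
            ORDER BY updated_at DESC      -- latest record first
        ) AS rn
    FROM events
)
SELECT *
FROM ranked
WHERE rn = 1;   -- the surviving row per entity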
Data Quality Management Strategies in SQL
Explore top LinkedIn content from expert professionals.
Summary
Data quality management strategies in SQL help ensure that the data stored and processed in databases is reliable, accurate, and consistent, which is essential for trustworthy reporting and analytics. These strategies use SQL techniques to identify, prevent, and handle issues such as duplicates, missing values, errors, or unexpected changes in the data.
- Monitor and validate: Set up checks for row counts, null values, and schema changes to catch problems before they impact your reports or models.
- Isolate bad data: Use tools and patterns that log errors and quarantine faulty records so one mistake doesn’t disrupt the entire process.
- Document your logic: Make your data quality rules and assumptions clear and accessible so downstream teams understand how data is handled and can trust the insights.
-
One bad row of data shouldn't take down your entire pipeline.

If you've ever had a 100M-row INSERT fail at row 99,999,999 because of a single NOT NULL violation — rolling back everything — you know how expensive that is. Not just compute. The reprocessing, the debugging, the 2am pages.

This is a solved problem in other parts of the stack. Kafka has dead-letter queues. APIs return partial success. Stream processors route bad records to side outputs. The philosophy is the same everywhere: isolate the failure, keep the pipeline moving, deal with the bad data separately.

In the database world, Oracle figured this out a long time ago with LOG ERRORS INTO. It became a staple of enterprise ETL playbooks. But most other databases never adopted the pattern — leaving teams to build workarounds with staging tables, TRY/CATCH wrappers, and row-by-row error handling that doesn't scale.

We're happy to announce that DML Error Logging is Generally Available, and the implementation is worth looking at. Tell Cortex Code to enable DML error logging, or, if you're old fashioned :-):

ALTER TABLE orders SET ERROR_LOGGING = true;

-- good rows land, bad rows get captured
SELECT * FROM ERROR_TABLE(orders);

No separate error table DDL. No procedural wrappers. Every logged error includes the error code, message, source column, and the actual data that failed — structured and queryable. You can put streams on error tables for real-time alerting and route them into your data quality monitoring.

At enterprise scale this changes the operational model. Pipelines that used to fail-and-retry can now succeed-and-triage. Your SLAs stay intact while data quality issues get surfaced, tracked, and resolved in parallel — not in the critical path.

If you're running large-scale ingestion into Snowflake, or migrating workloads from Oracle where your team relied on this pattern, it's worth a look, and we'd love your feedback on how we can keep evolving.

Docs Here: https://lnkd.in/gmcxRKnA!
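For comparison, the classic Oracle pattern referenced above looks roughly like this (a sketch; table names are illustrative, and the error log table must be created once before the first load):

-- one-time setup: creates the ERR$_ORDERS error log table
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name => 'ORDERS');
END;

-- bulk load: good rows commit, failing rows land in ERR$_ORDERS with a tag
INSERT INTO orders
SELECT * FROM staging_orders
LOG ERRORS INTO err$_orders ('nightly_load') REJECT LIMIT UNLIMITED;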
-
The dashboard didn't lie. The data did, quietly. And that's how bad decisions get made.

Most data issues don't show up as errors. They show up as slightly wrong numbers that snowball into wrong strategy, wrong forecasts, and wrong outcomes.

Here are the data quality checks that keep your business from steering off-course:

1. Row Count Drift Check: catches sudden jumps or drops in record counts before they distort metrics (a sketch follows after this post).
2. Null Values in Critical Fields: ensures key identifiers and revenue fields are never missing.
3. Duplicate Record Detection: flags repeated data caused by retries or broken idempotency.
4. Referential Integrity Validation: checks whether all foreign keys correctly map to parent records.
5. Schema Change Monitoring: alerts you when columns are added, removed, or renamed so pipelines don't break silently.
6. Freshness & Latency Checks: confirms dashboards are showing timely data within agreed SLAs.
7. Value Range Validation: detects impossible values like negative revenue or unrealistic outliers.
8. Historical Trend Comparison: surfaces metric shifts that don't match past behavior or known events.
9. Source-to-Target Reconciliation: validates that transformed totals match upstream source data.
10. Late-Arriving Data Detection: prevents delayed events from corrupting historical reporting.
11. Business Rule Validation: ensures domain rules, like order states or status transitions, are always respected.
12. Aggregation Consistency Checks: confirms daily, weekly, and monthly totals all align.
13. Cardinality Anomaly Detection: catches unexpected drops or spikes in unique users, products, or transactions.
14. Data Completeness by Segment: ensures every region, product line, or channel is fully represented.

Bad data rarely screams, it whispers. These checks make sure you hear it before your business does.
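A minimal sketch of check 1, assuming a fact table you can count by load date (fact_orders, load_date, and the 30% threshold are all illustrative): compare each day's row count with the trailing 7-day average and flag large deviations.

-- Row count drift: flag days whose volume deviates more than 30% from the
-- trailing 7-day average.
WITH daily_counts AS (
    SELECT load_date, COUNT(*) AS row_cnt
    FROM fact_orders
    GROUP BY load_date
),
with_baseline AS (
    SELECT
        load_date,
        row_cnt,
        AVG(row_cnt) OVER (
            ORDER BY load_date
            ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
        ) AS avg_prev_7d
    FROM daily_counts
)
SELECT load_date, row_cnt, avg_prev_7d
FROM with_baseline
WHERE avg_prev_7d IS NOT NULL
  AND ABS(row_cnt - avg_prev_7d) > 0.3 * avg_prev_7d;   -- drift threshold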
-
🚨 You can't control what you don't own — especially in data engineering.

One of the hardest truths about working with data is this: we don't always have control over the source system. 📥 The incoming data may be inconsistent, missing fields, wrongly typed, duplicated, or just plain dirty. Yet we're still expected to build reliable pipelines and deliver trustworthy insights downstream.

So what do you do? 🎯 You enforce data quality checks at the points you can control — your ingestion, staging, and transformation layers.

Here's what has worked well for me and my team:

✅ Set expectations early. Have conversations with upstream data owners. Learn what fields are likely to be unstable or late. Understand the business meaning of each data point — it matters more than the datatype.

✅ Build smart validations. If customer_id should always be a UUID and never null — enforce it. If transaction_date sometimes arrives in the wrong format — catch and log it. You can't block every issue, but you can track and route it (a sketch follows after this post).

✅ Design for imperfection. Build your pipelines to be resilient. Expect bad data and handle it gracefully. Quarantine rows. Use retry logic. Add lineage. Create alerts. Don't let one row crash the entire batch.

✅ Document & communicate. Your downstream teams need to know the data quality rules you enforce. Make your assumptions and logic transparent — and version-controlled.

💡 Pro tip: sometimes the issue isn't the data — it's the assumptions. Double-check the logic you've written. What seemed like an edge case might actually be the norm in another part of the business.

At the end of the day, data engineering is part technical, part detective, part negotiator.

👉 How do you handle poor source data in your pipeline?
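A minimal sketch of the "catch and route" idea in SQL, assuming a Postgres-flavored engine and illustrative names (stg_transactions, transactions_clean, transactions_quarantine): rows with a malformed customer_id or transaction_date are quarantined instead of crashing the load.

-- Route invalid rows to a quarantine table, then load only the rows that pass.
-- Names and checks are illustrative; the regex operator (~) is Postgres-flavored.
INSERT INTO transactions_quarantine (raw_row, reject_reason, rejected_at)
SELECT to_jsonb(t), 'bad customer_id or transaction_date', now()
FROM stg_transactions t
WHERE t.customer_id IS NULL
   OR t.customer_id !~* '^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$'
   OR t.transaction_date IS NULL
   OR t.transaction_date !~ '^\d{4}-\d{2}-\d{2}$';

INSERT INTO transactions_clean
SELECT t.*
FROM stg_transactions t
WHERE t.customer_id IS NOT NULL
  AND t.customer_id ~* '^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$'
  AND t.transaction_date IS NOT NULL
  AND t.transaction_date ~ '^\d{4}-\d{2}-\d{2}$';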
-
It took me 10 years to learn about the different types of data quality checks; I'll teach it to you in 5 minutes:

1. Check table constraints
The goal is to ensure your table's structure is what you expect:
* Uniqueness
* Not null
* Enum check
* Referential integrity
Ensuring the table's constraints is an excellent way to cover your data quality base.

2. Check business criteria
Work with the subject matter expert to understand what data users check for:
* Min/max permitted value
* Order of events check
* Data format check, e.g., check for the presence of the '$' symbol
Business criteria catch data quality issues specific to your data/business.

3. Table schema checks
Schema checks ensure that no inadvertent schema changes happened:
* Using an incorrect transformation function, leading to a different data type
* Upstream schema changes

4. Anomaly detection
Metrics change over time; ensure it's not due to a bug.
* Check the percentage change of metrics over time
* Use simple percentage change across runs
* Use standard deviation checks to ensure values are within the "normal" range
Detecting value deviations over time is critical for business metrics (revenue, etc.). A sketch follows after this post.

5. Data distribution checks
Ensure your data size remains similar over time.
* Ensure the row counts remain similar across days
* Ensure critical segments of data remain similar in size over time
Distribution checks catch rows silently dropped by faulty joins or filters.

6. Reconciliation checks
Check that your output has the same number of entities as your input.
* Check that your output didn't lose data due to buggy code

7. Audit logs
Log the number of rows input and output for each "transformation step" in your pipeline.
* Having a log of the number of rows going in and coming out is crucial for debugging
* Audit logs can also help you answer business questions
Debugging data questions? Look at the audit log to see where data duplication/dropping happens.

DQ warning levels: make sure that your data quality checks are tagged with appropriate warning levels (e.g., INFO, DEBUG, WARN, ERROR). Based on the criticality of the check, you can block the pipeline.

Get started with the business and constraint checks, adding more only as needed. Before you know it, your data quality will skyrocket! Good luck!

Like this thread? Read about the types of data quality checks in detail here 👇 https://lnkd.in/eBdmNbKE

Please let me know what you think in the comments below. Also, follow me for more actionable data content.

#data #dataengineering #dataquality
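A minimal sketch of the standard-deviation anomaly check from point 4, assuming a daily_revenue table with metric_date and revenue columns (illustrative names): flag days more than three standard deviations from the trailing 30-day mean.

-- Flag daily revenue outside +/- 3 standard deviations of the trailing 30 days.
-- Table/column names and the threshold are illustrative.
WITH stats AS (
    SELECT
        metric_date,
        revenue,
        AVG(revenue) OVER (
            ORDER BY metric_date
            ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
        ) AS mean_30d,
        STDDEV(revenue) OVER (
            ORDER BY metric_date
            ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
        ) AS stddev_30d
    FROM daily_revenue
)
SELECT metric_date, revenue, mean_30d, stddev_30d
FROM stats
WHERE stddev_30d IS NOT NULL
  AND stddev_30d > 0
  AND ABS(revenue - mean_30d) > 3 * stddev_30d;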
-
Data quality isn't a single check, it's a lifecycle. 🔄

Most data pipelines struggle to guarantee quality because they lack end-to-end control. dlt bridges this gap by owning the entire runtime, from ingestion to staging to production.

dlt ensures quality across 5 core dimensions:

1️⃣ Structural Integrity. Does the data fit? dlt automatically normalizes column names and types to prevent SQL errors. For stricter control, use Schema Contracts to reject undocumented fields.

2️⃣ Semantic Validity. Does it make business sense? Attach Pydantic models to your resources to enforce logic like "age > 0" or email validation in-stream.

3️⃣ Uniqueness & Relations. Is the dataset consistent? Handle deduplication automatically using primary keys and merge dispositions.

4️⃣ Privacy & Governance. Is the data safe? Hash PII or drop sensitive columns in-stream before they ever touch the disk.

5️⃣ Operational Health. Is the pipeline reliable? Monitor volume metrics and set up alerts to catch schema drift the moment it happens.

It's time to move beyond simple "null checks" and treat data quality as a comprehensive lifecycle.

Here are the docs to help you implement some of this:
📌 Alerting on Schema Changes: https://lnkd.in/d8dGX-2b
📌 Data Normalization & Type Management: https://lnkd.in/dsSr3CPf
🚀 Commercial Early Access: dltHub Data Quality Checks https://lnkd.in/dCjcug_F

#DataEngineering #DataQuality #Python #dlt #DataGovernance #ETL #SchemaEvolution
-
Just checking for NULL is not enough.

If you want to validate your data and make sure it's good quality for downstream users, you need a more comprehensive approach. Validate these and more:

↳ Data ranges that make business sense
↳ Referential integrity between related tables
↳ Date formats and impossible date combinations
↳ String patterns and data type consistency
↳ Record counts and expected data volumes

As someone who cares about the functioning of the business, you should be guarding against dirty data flowing downstream. (A sketch of a few of these checks follows after this post.)

Why comprehensive validation matters:
✅ Catches data quality issues before they reach dashboards
✅ Creates alerting opportunities for pipeline monitoring
✅ Prevents incorrect business decisions from bad data
✅ Makes your pipelines more resilient to source system changes

Your source data isn't going to be clean. Build validation checks into all your SQL transformations. Start with these validation patterns and expand based on what you discover in your own data.

What are some cool data validation checks you've built into your SQL models?

🔔 Follow me for more SQL and data engineering tips.
♻️ Repost if you think your network will benefit.

#sql #dataengineering #dataanalytics
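A minimal sketch of a few of these checks rolled into one validation query, assuming illustrative orders and customers tables; each failing rule is labeled so it can feed an alert.

-- Union of validation failures, one row per offending record and rule.
-- Table/column names and rules are illustrative.
SELECT order_id, 'negative_or_zero_amount' AS failed_check
FROM orders
WHERE order_amount <= 0

UNION ALL

SELECT o.order_id, 'missing_customer' AS failed_check
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL          -- referential integrity

UNION ALL

SELECT order_id, 'impossible_dates' AS failed_check
FROM orders
WHERE ship_date < order_date          -- cross-column date sanity
   OR order_date > CURRENT_DATE;      -- dates in the future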
-
Data cleaning is crucial for ensuring accurate and reliable datasets for analysis. SQL offers robust tools to efficiently manage and clean data within databases. By following key data cleaning steps, organizations can uphold high-quality data for better decision-making.

Key data cleaning steps in SQL:
- Removing Duplicates: Use `DISTINCT` or `GROUP BY` clauses to eliminate duplicate records from tables.
- Handling Missing Data: Identify and fill or remove `NULL` values using `COALESCE` or filtering out irrelevant data with `WHERE` clauses.
- Normalizing Data: Standardize data formats by converting date formats or correcting inconsistent text with `UPDATE` statements.
- Validating Data: Ensure data integrity by checking for errors or anomalies with `CASE`, `IF`, or `CHECK` constraints, and update records as necessary.

These steps are essential in guaranteeing clean, trustworthy data ready for analysis, making SQL an indispensable tool for data professionals. (A brief sketch of the first two steps follows after this post.)

#SQL #Datacleaning #Missingvalues #outliers #Duplicates #Normalization
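A brief sketch of the first two steps, using illustrative table and column names (raw_customers, country, marketing_opt_in).

-- Removing duplicates: collapse exact duplicate rows with DISTINCT
-- (or keep one row per business key with GROUP BY).
SELECT DISTINCT customer_id, email, signup_date
FROM raw_customers;

-- Handling missing data: substitute defaults with COALESCE and
-- drop rows whose key field is unusable.
SELECT
    customer_id,
    COALESCE(country, 'UNKNOWN') AS country,
    COALESCE(marketing_opt_in, FALSE) AS marketing_opt_in
FROM raw_customers
WHERE customer_id IS NOT NULL;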
-
Day 9 – Data Quality Framework (In Depth)

1. What is Data Quality?
Data Quality ensures that data is: accurate, complete, consistent, timely, valid, unique.

Data quality dimensions (the examples below assume: from pyspark.sql.functions import col):

1. Completeness – check for NULL values.
Example:
df.filter(col("product_id").isNull()).count()

2. Uniqueness – ensure txn_id is unique.
df.groupBy("txn_id").count().filter(col("count") > 1)

3. Validity – quantity must be > 0.
df.filter(col("quantity") <= 0)

4. Consistency – foreign key relationships exist, e.g. sales.product_id must exist in the product table.

2. Where Data Quality Fits in the Lakehouse
Bronze → basic validation
Silver → business rules validation
Gold → aggregation validation

3. Data Quality Framework Architecture
Source
↓
Bronze
↓
Validation Layer
↓
Valid records → Silver
Invalid records → Quarantine table

4. Implementing Data Quality in PySpark (Retail Example)

Step 1 – Load sales
sales_df = spark.read.format("delta") \
    .load("/retail/bronze/sales")

Step 2 – Define rules
valid_df = sales_df.filter(
    (col("quantity") > 0) &
    (col("price") > 0) &
    (col("txn_id").isNotNull())
)
invalid_df = sales_df.subtract(valid_df)

Step 3 – Write to separate tables
valid_df.write.format("delta") \
    .mode("overwrite") \
    .save("/retail/silver/sales_valid")

invalid_df.write.format("delta") \
    .mode("overwrite") \
    .save("/retail/quarantine/sales_invalid")

5. Monitoring & Alerting
When invalid records > threshold: send a Slack alert, send an email, trigger an Airflow failure.

#CommonInterviewQuestions
1. How do you handle bad records?
2. What is schema drift?
3. How do you detect volume anomalies?
4. How do you prevent duplicate processing?
5. What is data observability?
6. How do you implement referential integrity checks?
7. How do you automate quality validation?

Karthik K.

#DataEngineering #DataQuality #DataGovernance #Lakehouse #DeltaLake #BigData #DataValidation #GreatExpectations #DataObservability #ETL #InterviewPreparation #VineshDataEngineer

#Scenario
A retail company ingests daily sales data. You must:
- Validate data quality rules
- Separate valid and invalid records
- Store invalid records in a quarantine table
- Track quality metrics
- Store validation audit results
- Ensure the pipeline is production-ready

COMPLETE END-TO-END CODE:
-
As data engineers, we often talk about scalability, performance, and automation — but there's one thing that silently determines the success or failure of every pipeline: Data Quality.

No matter how advanced your stack, if your data is inconsistent, incomplete, or inaccurate, your downstream dashboards, ML models, and decisions will all be compromised.

Here's a detailed list of 25 critical checks that every modern data engineer should implement 👇

🔹 1. Null or Missing Value Checks: Ensure no essential field (like customer_id, transaction_id) contains missing data.
🔹 2. Primary Key Uniqueness Validation: Verify that key columns (like IDs) remain unique to prevent duplicate business entities or revenue double counting.
🔹 3. Duplicate Record Detection: Detect duplicates across ingestion stages.
🔹 4. Referential Integrity Validation: Confirm that all foreign key relationships hold true.
🔹 5. Data Type Validation: Ensure incoming data matches schema definitions — no strings in numeric fields, no invalid dates.
🔹 6. Numeric Range Validation: Catch impossible values (e.g., negative ages, >100% percentages, invalid ratings).
🔹 7. String Length & Pattern Checks: Enforce length constraints and validate formats (emails, phone numbers, IDs) with regex rules.
🔹 8. Allowed Value / Domain Validation: Ensure categorical columns only contain valid entries — e.g., gender ∈ {'M', 'F', 'Other'}.
🔹 9. Business Rule Consistency: Check rules like order_amount = item_price * quantity or revenue = sum(product_sales).
🔹 10. Cross-Column Consistency: Validate logical dependencies — e.g., delivery_date ≥ order_date.
🔹 11. Timeliness / Freshness Checks: Detect data delays and SLA breaches — especially important for near real-time systems.
🔹 12. Completeness Check: Verify all partitions, expected files, or dates are present — no missing data slices.
🔹 13. Volume Check Against Historical Data: Compare record counts or data sizes vs. previous runs to detect anomalies in ingestion.
🔹 14. Statistical Distribution Checks: Validate stability of metrics like mean, median, and standard deviation to catch silent drifts.
🔹 15. Outlier Detection: Identify records that deviate significantly from normal ranges.
🔹 16. Schema Drift Detection: Automatically detect added, removed, or renamed columns — common in dynamic source systems.
🔹 17. Duplicate File Ingestion Check: Prevent reprocessing of already-loaded files or data across multiple sources.
🔹 18. Negative / Invalid Value Checks: Block impossible values like negative prices or zero quantities where not allowed.
🔹 19. Percentage / Total Consistency Check: Ensure calculated percentages correctly sum to 100% or totals match constituent values.
🔹 20. Hierarchy Validation: Validate hierarchical consistency.
🔹 21. Audit Column Consistency: Confirm audit columns like created_by, updated_at, and load_date are properly populated.

(A sketch of checks 9, 10, and 11 follows after this post.)

#DataEngineering #DataQuality #Databricks #ETL #DataPipelines #DataGovernance
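A minimal sketch of checks 9, 10, and 11 over an illustrative orders table (all names, the SLA, the tolerance, and the Postgres-flavored interval syntax are assumptions).

-- 9. Business rule consistency: amount must equal price * quantity (small tolerance).
SELECT order_id, order_amount, item_price * quantity AS expected_amount
FROM orders
WHERE ABS(order_amount - item_price * quantity) > 0.01;

-- 10. Cross-column consistency: delivery can never precede the order.
SELECT order_id, order_date, delivery_date
FROM orders
WHERE delivery_date < order_date;

-- 11. Freshness: flag a breach if the newest record is older than a 2-hour SLA.
SELECT MAX(loaded_at) AS latest_load,
       CASE WHEN MAX(loaded_at) < CURRENT_TIMESTAMP - INTERVAL '2 hours'
            THEN 'SLA_BREACH' ELSE 'OK' END AS freshness_status
FROM orders;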