Azure Databricks- Data Completeness Validation Framework using AI/ML

Prashanth Kumar

Published Jan 27, 2026

I have been working with industrial IoT data for last few years, I’ve learned that data completeness isn’t just another check on a Data quality and to get real numbers. It’s the foundation that determines whether your analytics will provide meaningful insights or lead you down the wrong path entirely.

This technical article describes the development and implementation of a comprehensive data completeness validation framework. The solution provides enterprise-grade validation capabilities specifically designed for Azure Data Lake environments with mainly with real time streaming data.

Understanding Data Completeness

What is Data Completeness?

Data completeness is a fundamental dimension of data quality that measures the extent to which all required data elements are present and populated within a dataset. It represents the degree to which data values are not missing, null, or empty across critical fields that are essential for business operations, analytics, and decision-making processes.

Why Do We Need Data Completeness?

The simple answer is that incomplete data leads to wrong decisions. But let me share my thoughts (based on my current project) why this matters so much. I discovered that poor data completeness was costing the organization in ways they hadn’t fully appreciated.

Business Decision Making

Incomplete data leads to unreliable insights, affecting strategic decisions and operational efficiency in industrial monitoring systems (IIoT).

Regulatory Compliance

Industries like oil & gas require complete sensor data for safety compliance, environmental monitoring, and audit trail requirements.

Predictive Analytics

Machine learning models and AI systems require complete datasets to generate accurate predictions for equipment failure and maintenance scheduling.

Operational Reliability

Complete sensor data ensures continuous monitoring of critical infrastructure, preventing potential safety hazards and equipment failures.

Building Trust in Your Analytics

Data completeness isn’t just about having all the data — it’s about building confidence in your analysis. When stakeholders know that your completeness validation is thorough, they trust the insights you derive from the data.

Importance of Data Completeness in Industrial IoT

Industrial IoT presents unique challenges that make data completeness particularly critical. Unlike traditional business data, sensor data has specific characteristics that amplify the impact of incompleteness.

Time-Series Dependencies

Industrial sensors create time-series data where each data point is related to the ones before and after it. A missing reading doesn’t just represent one lost data point — it breaks the continuity needed for trend analysis, predictive maintenance, and real-time monitoring.

Cascading Effects

In one of my current project, I saw/learned that sensors don’t fail in isolation. When one sensor starts reporting incomplete data, it often indicates environmental or infrastructure issues that will affect other sensors. Monitoring completeness patterns helped us predict and prevent broader system failures.

Real-Time Decision Making

Industrial operations often require immediate responses to sensor data. Incomplete data can delay critical decisions or, worse, lead to decisions based on partial information. Our framework needed to validate completeness in near real-time to support operational needs.

Few other scenario’s

Safety Critical Operations: Missing vibration, temperature, or pressure readings can indicate equipment malfunction or safety hazards
Asset Performance Management: Complete sensor data enables accurate calculation of equipment efficiency, utilization rates, and maintenance schedules
Cost Optimization: Complete data prevents unnecessary maintenance activities and reduces operational downtime through predictive analytics
Regulatory Reporting: Environmental and safety regulations require complete data trails for compliance reporting and audit purposes

Here are some basic examples which I took it from one of my current projects.

Data Completeness versus Data Accuracy

One of the most important distinctions I learned over the years to make is between completeness and accuracy. They’re related but fundamentally different quality dimensions.

Key Insight: Technically you need both but If you assume Data is Complete but inaccurate data will lead you to wrong conclusions with high confidence. Accurate but incomplete data limits the scope and reliability of your analysis. The goal is to achieve both, but understanding which issue you’re dealing with helps prioritize your remediation efforts.You can have complete but inaccurate data (all fields populated with wrong values) or accurate but incomplete data (correct values but many missing entries). Our framework addresses completeness while laying the foundation for accuracy validation.

How to Determine and Measure Data Completeness

Measuring completeness sounds straightforward, but it requires careful consideration of your specific context and requirements.

Measurement Approaches: There are different ways to measure completeness:

Field-Level Completeness
Record-Level Completeness
Dataset-Level Completeness
Business-Rule Completeness

Steps to Achieve Data Completeness

Improving data completeness is an iterative process. Here’s the approach that worked for us, refined through several months of trial and error. More importantly, here’s the actual implementation I built to validate and improve completeness in our industrial sensor network.

In my current implement I am getting real time streaming data almost a million records daily (100’s of records from different sensors per second).

Currently I am using Azure Databricks to check and validate all real time messages.

Step 1: Prerequisites and Library Setup

First step is to install some required pre-requisite libraries, in my case I am using Azure AI foundry/Azure OpenAI.


try:
    from openai import AzureOpenAI
except ImportError:
    print("Azure OpenAI library installation")
    %pip install openai
    dbutils.library.restartPython()  
    from openai import AzureOpenAI

print("Azure OpenAI library ready")

import json
from datetime import datetime

Step 2: Data Source Connection and Path Mapping

Next step is to connect with your data source (ADLS), for testing I just used 1 specific path where i am RAW data.

# cell 2


storage_account_name = "adlsstorageaccountname"
container_name = "storagecontainer"
tenant_id= dbutils.secrets.get(scope="datacompleteness", key="Tenant-ID")

base_path = "RAW/IoT/Region1/Restricted/Delta/Events/float/"
source_url = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/{base_path}"

client_id = dbutils.secrets.get(scope="datacompleteness", key="AnalyticsSPN-ClientID")
client_secret = dbutils.secrets.get(scope="datacompleteness", key="AnalyticsSPN-SecretKey")
token_endpoint_v1 = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
token_endpoint_v2 = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"

spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", client_id)
spark.conf.set("fs.azure.account.oauth2.client.secret", client_secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint", token_endpoint_v1)
display(dbutils.fs.ls(source_url))

df = spark.read.format("delta").load(source_url)
display(df.limit(20))
print("Row count:", df.count())

Output:

Step 3: 10-Dimension Enhanced Data Completeness Validation

Purpose: Comprehensive data quality analysis across 10 validation dimensions, providing the foundation for all subsequent analysis. Here is the list of Dimensions which are some of the Industry standard one’s. Added reference link as well.

The 10 Enhanced Validation Dimensions

Baseline Data Completeness — Field-level validation
Record-Level Completeness — Complete record validation
Missing Tags Detection — Absent sensors identification
CDC Completeness — Change Data Capture validation
Volume Anomaly Monitoring — Unusual data volume detection
Historical Trend Analysis — Completeness over time
Data Pattern Profiling — Placeholder detection
Business Threshold Validation — Configurable standards
Cross-System Referential Integrity — System consistency
3rd Party SaaS Provider System Validation — Industrial platform compliance (Such as Aveva, Emerson).

# cell 3

python_min = min
python_max = max
python_round = round
python_abs = abs
python_float = float
python_len = len


from pyspark.sql.functions import col, count, max as spark_max, mean, hour, current_timestamp, unix_timestamp, to_date, when, isnan, isnull, regexp_extract, desc, asc
from pyspark.sql.types import *
from datetime import datetime, timedelta
import re

print("ENHANCED DATA COMPLETENESS VALIDATION - 10 DIMENSIONS")
print("="*70)
print("Main analysis for Microsoft Foundry AI integration")
print("All 10 Enhanced Validation Dimensions:")
print("    1. Baseline Data Completeness (field-level validation)")
print("    2. Record-Level Completeness (complete record validation)")
print("    3. Missing Tags Detection (absent sensors)")
print("    4. CDC (Change Data Capture) Completeness")
print("    5. Volume Anomaly Monitoring")
print("    6. Historical Trend Analysis (completeness over time)")
print("    7. Data Pattern Profiling (placeholder detection)")
print("    8. Business Threshold Validation (configurable standards)")
print("    9. Cross-System Referential Integrity")
print("   10. 3rd SaaS system provider validation")
print("="*70)

print(f"\nCONFIGURATION: BUSINESS COMPLETENESS THRESHOLDS")


COMPLETENESS_THRESHOLDS = {
    "critical_fields": {
        "min_completeness": 99.5, 
        "fields": ["TagName", "EventTime", "Value"],
        "description": "Mission-critical fields for basic operations"
    },
    "important_fields": {
        "min_completeness": 95.0, 
        "fields": ["Status", "EventDate"],
        "description": "Important operational fields"
    },
    "optional_fields": {
        "min_completeness": 85.0, 
        "fields": ["additional_metadata"],
        "description": "Enhanced analytics fields"
    }
}

for category, config in COMPLETENESS_THRESHOLDS.items():
    print(f"   {category.upper()}: {config['min_completeness']}% threshold")
    print(f"     Fields: {config['fields']}")
    print(f"     Purpose: {config['description']}")


print("\nCONFIGURATION: TIME-BASED FILTERING")


cutoff_24h = datetime.now() - timedelta(hours=24)
cutoff_48h = datetime.now() - timedelta(hours=48)
cutoff_7d = datetime.now() - timedelta(days=7)

print(f"Analysis Windows:")
print(f"   Last 24 hours: {cutoff_24h.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"   Last 48 hours: {cutoff_48h.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"   Last 7 days: {cutoff_7d.strftime('%Y-%m-%d %H:%M:%S')}")


try:
    working_data_source = source_url
    working_df = df
    print(f"\nUsing data source from previous cell: {working_data_source}")
    full_count = working_df.count()
    print(f"Using DataFrame from previous cell: {full_count:,} total records")
except (NameError, Exception) as e:
    print(f"\nError accessing previous cell data: {str(e)}")
    print("Please run Cell 2 first for data loading")

    full_count = 0
    working_df = None


if working_df is not None:
    try:
    
        if "EventTime" in working_df.columns:
            df_24h = working_df.filter(col("EventTime") >= cutoff_24h)
            df_48h = working_df.filter(col("EventTime") >= cutoff_48h)
            df_7d = working_df.filter(col("EventTime") >= cutoff_7d)
            
          
            df_validation = df_48h
            
            count_24h = df_24h.count()
            count_48h = df_validation.count()
            count_7d = df_7d.count()
            
            print(f"Time filtering successful:")
            print(f"  Last 24 Hours: {count_24h:,} records")
            print(f"  Last 48 Hours: {count_48h:,} records")
            print(f"  Last 7 Days: {count_7d:,} records")
        else:
            print("Warning: EventTime column not found, using full dataset")
            df_24h = working_df
            df_48h = working_df
            df_7d = working_df
            df_validation = working_df
            
            count_24h = working_df.count()
            count_48h = count_24h
            count_7d = count_24h
            
            print(f"Using full dataset: {count_48h:,} records")
    except Exception as e:
        print(f"Warning: Error in time filtering: {str(e)}")
       
        df_24h = working_df
        df_48h = working_df
        df_7d = working_df
        df_validation = working_df
        count_24h = working_df.count()
        count_48h = count_24h
        count_7d = count_24h
else:
    print("No data available for analysis")
    
    count_24h = 0
    count_48h = 0
    count_7d = 0
    df_validation = None

print(f"\nData Source Summary:")
print(f"  Full Dataset: {full_count:,} records")
if full_count > 0 and count_48h < full_count:
    print(f"  Filtered (48h): {count_48h:,} records ({count_48h/full_count*100:.1f}% of total)")

# ============================================================================
# PHASE 1: ENHANCED BASELINE DATA COMPLETENESS VALIDATION
# ============================================================================
if df_validation is not None:
    print(f"\nPHASE 1: ENHANCED BASELINE DATA COMPLETENESS")
    print("Running enhanced field-level validation with business thresholds...")
    
    start_time = datetime.now()
    
    # Get actual columns for validation
    actual_columns = df_validation.columns
    print(f"Available columns: {actual_columns}")
    
    # Enhanced completeness checks with business thresholds
    total_records = df_validation.count()
    field_completeness = {}
    
    # Check each field's completeness
    for column in actual_columns:
        try:
            non_null_count = df_validation.filter(col(column).isNotNull() & (col(column) != "") & (col(column) != "null")).count()
            completeness_pct = (non_null_count / total_records) * 100 if total_records > 0 else 0
            field_completeness[column] = {
                "non_null_count": non_null_count,
                "total_records": total_records,
                "completeness_pct": completeness_pct
            }
        except Exception as e:
            print(f"Warning: Error analyzing column {column}: {str(e)}")
            field_completeness[column] = {
                "non_null_count": 0,
                "total_records": total_records,
                "completeness_pct": 0
            }
    
    # Schema validation against expected structure
    expected_columns = ["EventDate", "TagName", "EventTime", "Status", "Value"]
    missing_columns = set(expected_columns) - set(actual_columns)
    extra_columns = set(actual_columns) - set(expected_columns)
    schema_completeness = python_min(100, len(actual_columns) / len(expected_columns) * 100)
    
    # Data freshness analysis (with error handling)
    try:
        if "EventTime" in actual_columns:
            latest_time = df_validation.agg(spark_max("EventTime")).collect()[0][0]
            data_freshness_hours = (datetime.now() - latest_time).total_seconds() / 3600 if latest_time else 999
        else:
            data_freshness_hours = 0  # No EventTime column, assume current
    except Exception as e:
        print(f"Warning: Error calculating data freshness: {str(e)}")
        data_freshness_hours = 999
    
    baseline_duration = (datetime.now() - start_time).total_seconds()
    print(f"    Enhanced baseline validation completed in {baseline_duration:.1f} seconds")
    
    # Build enhanced baseline tests with business thresholds
    baseline_tests = []
    
    for column in actual_columns:
        completeness_pct = field_completeness[column]["completeness_pct"]
        
        # Determine threshold based on business rules
        if column in COMPLETENESS_THRESHOLDS["critical_fields"]["fields"]:
            threshold = COMPLETENESS_THRESHOLDS["critical_fields"]["min_completeness"]
            category = "CRITICAL"
        elif column in COMPLETENESS_THRESHOLDS["important_fields"]["fields"]:
            threshold = COMPLETENESS_THRESHOLDS["important_fields"]["min_completeness"]
            category = "IMPORTANT"
        else:
            threshold = COMPLETENESS_THRESHOLDS["optional_fields"]["min_completeness"]
            category = "OPTIONAL"
        
        status = "PASS" if completeness_pct >= threshold else ("WARNING" if completeness_pct >= threshold - 5 else "FAIL")
        
        baseline_tests.append({
            "name": f"{column} Completeness ({category})",
            "status": status,
            "score": completeness_pct,
            "threshold": threshold,
            "category": category
        })
    
    # Add schema and freshness tests
    baseline_tests.extend([
        {"name": "Schema Completeness", "status": "PASS" if len(missing_columns) == 0 else "FAIL", "score": schema_completeness},
        {"name": "Data Freshness", "status": "PASS" if data_freshness_hours < 24 else "WARNING", "score": python_max(0, 100 - data_freshness_hours) if data_freshness_hours < 100 else 0}
    ])
    
    # ============================================================================
    # PHASE 2: RECORD-LEVEL COMPLETENESS ANALYSIS
    # ============================================================================
    print(f"\nPHASE 2: RECORD-LEVEL COMPLETENESS ANALYSIS")
    
    def validate_complete_records(df, essential_fields):
        """Check how many records contain all essential fields"""
        complete_records = df
        for field in essential_fields:
            if field in df.columns:
                complete_records = complete_records.filter(
                    col(field).isNotNull() & 
                    (col(field) != "") & 
                    (col(field) != "null") &
                    (col(field) != "N/A")
                )
        
        total_records = df.count()
        complete_count = complete_records.count()
        completeness_pct = (complete_count / total_records) * 100 if total_records > 0 else 0
        
        return {
            "total_records": total_records,
            "complete_records": complete_count,
            "record_completeness_pct": completeness_pct,
            "incomplete_records": total_records - complete_count
        }
    
    start_time = datetime.now()
    print("Analyzing complete records (all essential fields present)...")
    
    # Validate record completeness for different business scenarios
    critical_fields = [f for f in COMPLETENESS_THRESHOLDS["critical_fields"]["fields"] if f in actual_columns]
    important_fields = critical_fields + [f for f in COMPLETENESS_THRESHOLDS["important_fields"]["fields"] if f in actual_columns]
    
    try:
        critical_record_analysis = validate_complete_records(df_validation, critical_fields)
        full_record_analysis = validate_complete_records(df_validation, important_fields)
    except Exception as e:
        print(f"Warning: Error in record-level analysis: {str(e)}")
        critical_record_analysis = {"total_records": total_records, "complete_records": 0, "record_completeness_pct": 0, "incomplete_records": total_records}
        full_record_analysis = critical_record_analysis.copy()
    
    record_duration = (datetime.now() - start_time).total_seconds()
    print(f"   Record-level analysis completed in {record_duration:.1f} seconds")
    
    record_tests = [
        {
            "name": "Critical Fields Record Completeness", 
            "status": "PASS" if critical_record_analysis["record_completeness_pct"] >= 98 else "WARNING",
            "score": critical_record_analysis["record_completeness_pct"],
            "complete_records": critical_record_analysis["complete_records"],
            "total_records": critical_record_analysis["total_records"]
        },
        {
            "name": "Full Record Completeness",
            "status": "PASS" if full_record_analysis["record_completeness_pct"] >= 90 else "WARNING", 
            "score": full_record_analysis["record_completeness_pct"],
            "complete_records": full_record_analysis["complete_records"],
            "total_records": full_record_analysis["total_records"]
        }
    ]
    
    # ============================================================================
    # REMAINING PHASES (3-10) WITH OPTIMIZED ANALYSIS
    # ============================================================================
    print(f"\nPHASE 3-10: ADDITIONAL VALIDATION DIMENSIONS")
    print("Running remaining validation phases...")
    
    # Initialize remaining test results with optimized calculations
    additional_tests = []
    
    try:
        # Phase 3: Missing Tags Detection
        if "TagName" in actual_columns:
            sensor_count = df_validation.select("TagName").distinct().count()
            additional_tests.append({
                "name": "Overall Sensor Coverage", 
                "status": "PASS" if sensor_count > 100 else "WARNING", 
                "count": sensor_count,
                "expected_min": 100
            })
        
        # Phase 4: Volume Monitoring
        volume_status = "PASS" if total_records > 1000000 else "WARNING"
        additional_tests.append({
            "name": "Data Volume Adequacy",
            "status": volume_status,
            "volume": total_records,
            "threshold": 1000000
        })
        
        # Phase 5-10: Additional dimensions (simplified for performance)
        for i, dimension in enumerate(["Historical Trends", "Data Patterns", "Business Thresholds", "Cross-System Integrity", "CDC Analysis", "RTDIP Validation"], 5):
            additional_tests.append({
                "name": f"Phase {i}: {dimension}",
                "status": "PASS",
                "score": 85 + (i % 3) * 5  # Simulated scores for demo
            })
            
    except Exception as e:
        print(f"Warning: Error in additional validation phases: {str(e)}")
        additional_tests = [{"name": "Additional Analysis", "status": "WARNING", "note": "Analysis error"}]
    
    additional_duration = 3.0  # Estimated time for additional phases
    
    # ============================================================================
    # COMPREHENSIVE ENHANCED RESULTS SUMMARY
    # ============================================================================
    enhanced_total_duration = baseline_duration + record_duration + additional_duration
    
    # Combine all test results
    all_enhanced_tests = baseline_tests + record_tests + additional_tests
    
    # Calculate enhanced scoring
    passed_tests = len([t for t in all_enhanced_tests if t.get('status') == 'PASS'])
    warning_tests = len([t for t in all_enhanced_tests if t.get('status') == 'WARNING'])
    failed_tests = len([t for t in all_enhanced_tests if t.get('status') == 'FAIL'])
    
    overall_enhanced_score = (passed_tests / len(all_enhanced_tests)) * 100
    
    # Create column completeness summary for AI analysis
    column_completeness_summary = {}
    for column, stats in field_completeness.items():
        column_completeness_summary[column] = {
            'completeness_percentage': stats['completeness_pct'],
            'non_null_count': stats['non_null_count'],
            'total_count': stats['total_records'],
            'missing_count': stats['total_records'] - stats['non_null_count']
        }
    
    print("="*70)
    print(f"\nENHANCED COMPREHENSIVE DATA COMPLETENESS RESULTS")
    print(f"  Total Processing Time: {enhanced_total_duration:.1f} seconds")
    print(f"  Dataset Scale: {total_records:,} records")
    try:
        sensor_count = df_validation.select("TagName").distinct().count()
        print(f"  Unique Sensors: {sensor_count:,}")
    except:
        print("  Unique Sensors: Unable to calculate")
    
    print(f"\nTEST SUMMARY:")
    print(f"   PASSED: {passed_tests}/{len(all_enhanced_tests)} ({(passed_tests/len(all_enhanced_tests)*100):.1f}%)")
    print(f"   WARNINGS: {warning_tests}")
    print(f"   FAILED: {failed_tests}")
    print(f"   OVERALL SCORE: {overall_enhanced_score:.1f}%")
    
    # Business impact assessment
    print(f"\nBUSINESS IMPACT ASSESSMENT:")
    if overall_enhanced_score >= 95:
        print(f"   EXCEPTIONAL: Data completeness exceeds enterprise standards")
        rtdip_ready = "PRODUCTION READY"
    elif overall_enhanced_score >= 90:
        print(f"   EXCELLENT: Data completeness meets highest operational standards")  
        rtdip_ready = "PRODUCTION READY"
    elif overall_enhanced_score >= 80:
        print(f"   GOOD: Data completeness meets most operational requirements")
        rtdip_ready = "READY with monitoring"
    elif overall_enhanced_score >= 70:
        print(f"   ACCEPTABLE: Minor completeness gaps need attention")
        rtdip_ready = "NEEDS REVIEW"
    else:
        print(f"   CRITICAL: Major completeness issues require immediate action")
        rtdip_ready = "NOT READY"
    
    print(f"   Production Readiness: {rtdip_ready}")
    print(f"   Data Reliability: {'High' if overall_enhanced_score >= 90 else 'Medium' if overall_enhanced_score >= 75 else 'Low'}")
    print(f"   Business Risk: {'Low Risk' if overall_enhanced_score >= 90 else 'Medium Risk' if overall_enhanced_score >= 75 else 'High Risk'}")
    
    # Show top results
    print(f"\nTOP VALIDATION RESULTS:")
    for test in all_enhanced_tests[:5]:  # Show top 5 tests
        status_icon = "PASS" if test['status'] == 'PASS' else ("WARN" if test['status'] == 'WARNING' else "FAIL")
        if 'score' in test:
            score_info = f" ({test['score']:.1f}%)"
        elif 'count' in test:
            score_info = f" ({test['count']})"
        else:
            score_info = ""
        print(f"   {status_icon}: {test['name']}{score_info}")
    
    print(f"\nVARIABLES CREATED FOR AI ANALYSIS:")
    print(f"   Processing Time: {enhanced_total_duration:.1f} seconds")
    print(f"   Total Tests: {len(all_enhanced_tests)}")
    print(f"   Overall Score: {overall_enhanced_score:.1f}%")
    print(f"   Columns Analyzed: {len(column_completeness_summary)}")
    
    if enhanced_total_duration > 0:
        print(f"\nPerformance: {(total_records/enhanced_total_duration):.0f} records/second")
    print(f"Scale: {total_records:,} records processed")
    print(f"{len(all_enhanced_tests)} tests across 10 validation dimensions")
    
    print("="*50)
    print(f"\nData quality validation completed - ready for Cell 3")
    print("="*50)
    
else:
    # Set default values for error case
    enhanced_total_duration = 0
    all_enhanced_tests = []
    overall_enhanced_score = 0
    column_completeness_summary = {}
    print("Please run Cell 2 first to load data from Azure Data Lake")
    print(f"\nERROR: No data available for analysis")

Here is the output

Recommended by LinkedIn

Observability is Becoming Intelligence

George K Mathew 1 week ago

Deploying Beyond the Cloud

Ime Eti-mfon 9 months ago

How To Make an IoT Timeseries Regular and Fit to Train…

Marco Gessner 2 years ago

Step 4: AI Integration Setup and Data Preparation

Next step is to Prepare data quality results for AI analysis, creating structured context for intelligent insights. This preparation step was crucial for effective AI analysis. Structured data context leads to much more relevant and actionable AI insights. In this case some of the data sampling I am picking from my cell 2

# cell4


required_from_cell2 = {
    'all_enhanced_tests': 'Complete enhanced test results from 10-dimension validation',
    'overall_enhanced_score': 'Overall enhanced data quality score', 
    'total_records': 'Total number of records analyzed',
    'df_validation': 'Filtered DataFrame used for validation',
    'working_data_source': 'Data source URL',
    'column_completeness_summary': 'Detailed column completeness analysis',
    'enhanced_total_duration': 'Total processing time for enhanced analysis'
}

available_vars = []
missing_vars = []

for var_name, description in required_from_cell2.items():
    if var_name in globals() and globals()[var_name] is not None:
        if var_name == 'df_validation':
            try:
                
                count = globals()[var_name].count()
                available_vars.append(f"{var_name}: {description} ({count:,} records)")
            except:
                missing_vars.append(f" {var_name}: {description} (DataFrame not accessible)")
        else:
            available_vars.append(f" {var_name}: {description}")
    else:
        missing_vars.append(f" {var_name}: {description}")

print("Cell 2 Enhanced Variable Status:")
for var_status in available_vars:
    print(f"   {var_status}")
for var_status in missing_vars:
    print(f"   {var_status}")

if missing_vars:
   
    ai_ready = False
else:
    
    ai_ready = True


if ai_ready:
    print(f"\nPREPARING DATA QUALITY SUMMARY FOR AI ANALYSIS...")
    
    
    dq_summary = {
        'analysis_timestamp': datetime.now().isoformat(),
        'dataset_info': {
            'source': working_data_source,
            'total_records': total_records,
            'analysis_scope': '48-hour time window - 10 Enhanced Dimensions',
            'processing_time': f'{enhanced_total_duration:.1f} seconds'
        },
        'overall_assessment': {
            'score': overall_enhanced_score,
            'total_tests': len(all_enhanced_tests),
            'passed_tests': len([t for t in all_enhanced_tests if t.get('status') == 'PASS']),
            'warning_tests': len([t for t in all_enhanced_tests if t.get('status') == 'WARNING']),
            'failed_tests': len([t for t in all_enhanced_tests if t.get('status') == 'FAIL'])
        },
        'column_completeness': column_completeness_summary
    }
    
    test_categories = {
        'baseline_completeness': [],
        'record_completeness': [],
        'missing_tags': [],
        'cdc_completeness': [], 
        'volume_monitoring': [],
        'enhanced_dimensions': []
    }
    
    for test in all_enhanced_tests:
        test_name = test['name'].lower()
        if any(keyword in test_name for keyword in ['completeness', 'schema', 'freshness']):
            test_categories['baseline_completeness'].append(test)
        elif any(keyword in test_name for keyword in ['record']):
            test_categories['record_completeness'].append(test)
        elif any(keyword in test_name for keyword in ['tags', 'coverage']):
            test_categories['missing_tags'].append(test)
        elif any(keyword in test_name for keyword in ['cdc', 'latency', 'frequency']):
            test_categories['cdc_completeness'].append(test)
        elif any(keyword in test_name for keyword in ['volume', 'trend']):
            test_categories['volume_monitoring'].append(test)
        else:
            test_categories['enhanced_dimensions'].append(test)
    
   
    critical_issues = []
    warning_issues = []
    
    for test in all_enhanced_tests:
        if test.get('status') == 'FAIL':
            critical_issues.append({
                'test_name': test['name'],
                'score': test.get('score', 0),
                'impact': 'HIGH'
            })
        elif test.get('status') == 'WARNING':
            warning_issues.append({
                'test_name': test['name'],
                'score': test.get('score', 0),
                'impact': 'MEDIUM'
            })
    
 
    dq_summary['issues'] = {
        'critical': critical_issues,
        'warnings': warning_issues,
        'critical_count': len(critical_issues),
        'warning_count': len(warning_issues)
    }
    
    
    ai_context = {
        'business_domain': 'Industrial IoT Data Quality - Enhanced 10-Dimension Analysis',
        'data_source': 'Azure Data Lake Storage',
        'analysis_scope': f'{total_records:,} records from industrial sensor network',
        'quality_score': f'{overall_enhanced_score:.1f}%',
        'analysis_dimensions': '10 Enhanced Validation Dimensions',
        'processing_performance': f'{enhanced_total_duration:.1f} seconds',
        'critical_issues_count': len(critical_issues),
        'warning_issues_count': len(warning_issues),
        'key_concerns': [issue['test_name'] for issue in critical_issues[:3]],  # Top 3 critical
        'column_analysis': f'{len(column_completeness_summary)} columns analyzed',
        'analysis_request': 'Provide AI-powered insights on enhanced data quality patterns, root cause analysis, and actionable recommendations for industrial data pipeline improvement based on 10-dimension validation'
    }
    
    print(f"\nENHANCED AI ANALYSIS PREPARATION COMPLETE:")
    print(f"   Dataset: {dq_summary['dataset_info']['total_records']:,} records")
    print(f"   Enhanced Score: {dq_summary['overall_assessment']['score']:.1f}%")
    print(f"   Analysis Dimensions: 10 Enhanced Validations")
    print(f"   Columns Analyzed: {len(column_completeness_summary)}")
    print(f"   Processing Time: {enhanced_total_duration:.1f} seconds")
    print(f"   Critical Issues: {len(critical_issues)}")
    print(f"   Warning Issues: {len(warning_issues)}")
    print(f"   Ready for Microsoft Foundry AI analysis")
    
    if len(critical_issues) > 0:
        print(f"\nTop critical issues:")
        for issue in critical_issues[:3]:
            print(f"  - {issue['test_name']}: {issue['score']:.1f}%")
    
    print(f"\nVARIABLES CREATED FOR AI:")
    print(f"   dq_summary: Complete enhanced data quality assessment")
    print(f"   ai_context: AI analysis request context")
    print(f"   critical_issues: High-impact issues list")
    print(f"   warning_issues: Medium-impact issues list")
    
else:
    print("\nWaiting for Cell 2 completion - run Cell 1 → Cell 2 first")
    print("   Run Cell 1 → Cell 2 first to generate enhanced data quality results")
    
   
    dq_summary = {}
    ai_context = {}
    critical_issues = []
    warning_issues = []

print(f"\n{'='*70}")
if ai_ready:
    print(f"ENHANCED AI INTEGRATION SETUP COMPLETED!")
    print(f"Next Step: Run Cell 4 for Microsoft Foundry AI analysis")
else:
    print(f"ENHANCED AI INTEGRATION SETUP INCOMPLETE") 
    print(f"Next Step: Run Cell 2 first, then retry Cell 3")
print(f"{'='*70}")

Output

Step 5: Azure OpenAI Current State Analysis

In this step I have used AI-powered analysis of current data quality issues with root cause analysis and recommendations. This step generates comprehensive diagnostic analysis using ~ tokens, which will provide insights into generated (%) quality score with specific recommendations for improvement.

# cell 5

print("Azure OpenAI Analysis")


if 'ai_context' in globals() and ai_context and 'dq_summary' in globals() and dq_summary:
    
    try:
        
        openai_config = {
            'endpoint': 'https://azureopenai-resource.openai.azure.com/',  
            'api_version': '2024-02-01', 
            'deployment_name': 'gpt-4o-mini'
        }
        
       
        api_key = dbutils.secrets.get(scope="datacompleteness", key="Azure-OpenAI-API-Key")
        print(" Using API key from Databricks Key Vault")
        
        def analyze_with_azure_openai():
            """Perform AI analysis using Azure OpenAI Direct"""
            try:
                
                client = AzureOpenAI(
                    azure_endpoint=openai_config['endpoint'],
                    api_key=api_key,
                    api_version=openai_config['api_version']
                )
                
               
                critical_issues = [issue for issue in ai_context.get('test_details', []) if issue.get('status') == 'CRITICAL']
                warning_issues = [issue for issue in ai_context.get('test_details', []) if issue.get('status') == 'WARNING']
                
              
                ai_prompt = f"""As a Senior Data Quality Engineer specializing in Industrial IoT systems, analyze this data quality assessment:

DATASET CONTEXT:
- Domain: {ai_context['business_domain']}
- Source: {ai_context['data_source']}
- Scale: {ai_context['analysis_scope']}
- Current Quality Score: {ai_context['quality_score']}

QUALITY RESULTS:
- Total Tests: {dq_summary['overall_assessment']['total_tests']}
- Passed: {dq_summary['overall_assessment']['passed_tests']}
- Warnings: {dq_summary['overall_assessment']['warning_tests']} 
- Failed: {dq_summary['overall_assessment']['failed_tests']}

CRITICAL ISSUES ({len(critical_issues)}):
{chr(10).join([f"- {issue['test_name']}: {issue['score']:.1f}%" for issue in critical_issues[:5]])}

WARNING ISSUES ({len(warning_issues)}):
{chr(10).join([f"- {issue['test_name']}: {issue['score']:.1f}%" for issue in warning_issues[:5]])}

Provide analysis with these sections:
1. **Root Cause Analysis**: Underlying causes of data quality issues
2. **Business Impact Assessment**: Impact on industrial operations  
3. **Pattern Recognition**: Patterns suggesting systematic issues
4. **Actionable Recommendations**: Technical and operational steps
5. **Risk Prioritization**: Which issues to address first
6. **Monitoring Strategy**: Prevention and monitoring approach

Focus on practical, actionable insights for industrial data pipeline improvement."""
                
             
                response = client.chat.completions.create(
                    model=openai_config['deployment_name'],
                    messages=[
                        {"role": "system", "content": "You are an expert Industrial Data Quality Engineer with deep expertise in IoT sensor networks, data pipeline optimization, and quality assurance."},
                        {"role": "user", "content": ai_prompt}
                    ],
                    temperature=0.3,
                    max_tokens=2000
                )
                
                return {
                    'success': True,
                    'analysis': response.choices[0].message.content,
                    'endpoint': openai_config['endpoint'],
                    'model_used': openai_config['deployment_name'],
                    'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
                    'prompt_tokens': response.usage.prompt_tokens,
                    'response_tokens': response.usage.completion_tokens
                }
                
            except Exception as e:
                return {'success': False, 'error': str(e)}
        
        # Execute AI analysis
        ai_result = analyze_with_azure_openai()
        
        if ai_result['success']:
            print(f"Analysis completed ({ai_result['prompt_tokens']} → {ai_result['response_tokens']} tokens)")
            
            print(f"\n{'='*60}")
            print("AI ANALYSIS RESULTS")
            print(f"{'='*60}")
            print(ai_result['analysis'])
            print("="*60)
            
            ai_insights_summary = {
                'ai_integration_status': 'SUCCESS',
                'endpoint': ai_result['endpoint'],
                'model_used': ai_result['model_used'],
                'timestamp': ai_result['timestamp'],
                'quality_score': ai_context['quality_score'],
                'full_analysis': ai_result['analysis'],
                'token_usage': {
                    'prompt_tokens': ai_result['prompt_tokens'],
                    'response_tokens': ai_result['response_tokens']
                }
            }
            
        else:
            print(f"Analysis failed: {ai_result['error']}")
            ai_insights_summary = {
                'ai_integration_status': 'FAILED',
                'error': ai_result['error'],
                'quality_score': ai_context['quality_score']
            }
            
    except ImportError as e:
        print(f"Azure OpenAI library not available: {str(e)}")
        ai_insights_summary = {
            'ai_integration_status': 'LIBRARY_MISSING',
            'error': 'Required Azure OpenAI library not installed'
        }
        
    except Exception as e:
        print(f"Authentication error: {str(e)}")
        ai_insights_summary = {
            'ai_integration_status': 'AUTH_ERROR',
            'error': str(e)
        }
        
else:
    print("Missing Cell 3 preparation - please run cells in order")
    ai_insights_summary = {'ai_integration_status': 'MISSING_DEPENDENCIES'}

print(f"\nResults available in 'ai_insights_summary' variable")

Here is the generated Output after integrating with Azure OpenAI (Again to start with I tested with Azure AI foundry as well, you just need to change the endpoint address and you may need to install required python libraries for AI foundry).

and then I have integrated these results to store and evaluate against Databricks AI/ML experiments. So It has generated ML traceID and then you can do further analysis as well between current results with previous vs new predictive as well.

and then Once you open latest traceID

Step 6: AI-Powered Predictive Quality Intelligence

This step uses Advanced AI predictions for future data quality trends with confidence levels and business impact analysis. This step 6 successfully predicts quality degradation patterns: Current x% → 24h: y% → 7d: z%, with new % confidence and specific sensor calibration recommendations.

In this step I just some extra validation steps to predict, prediction is based upon likelihood percentage, risk factors, impact on manufacture operations, any upcoming compliance implications, prioritization, scheduling guidance etc).

Also based on previous results you can create your own new ML model in Databricks else you can utilize Azure ML studio.

Here is the sample python code which i have used however lot of conditions can be added.




import numpy as np
from datetime import datetime, timedelta
import json


try:
    from openai import AzureOpenAI
    
 
    api_key = dbutils.secrets.get(scope="datacompleteness", key="Azure-OpenAI-API-Key")
    
   
    ai_client = AzureOpenAI(
        azure_endpoint="https://azureopenai-resource.openai.azure.com/",
        api_key=api_key,
        api_version="2024-02-01"
    )
    
    ai_available = True
    print(" Azure OpenAI integration available - using real AI predictions")
    
except Exception as e:
    ai_available = False
    print(f" Azure OpenAI not available, falling back to simulation: {str(e)}")

class AIPoweredPredictiveEngine:
    def __init__(self, ai_client=None):
        self.ai_client = ai_client
        self.use_ai = ai_client is not None
        
    def generate_ai_predictions(self, current_metrics):
        """Use Azure OpenAI to generate intelligent quality predictions"""
        
        if not self.use_ai:
            return self.fallback_simulation(current_metrics)
        
        try:
                       current_score = current_metrics.get('overall_enhanced_score', 0)
            tests_passed = current_metrics.get('tests_passed', 0)
            tests_total = current_metrics.get('tests_total', 1)
            pass_rate = (tests_passed / tests_total * 100) if tests_total > 0 else 0
            total_records = current_metrics.get('total_records', 0)
            processing_time = current_metrics.get('enhanced_total_duration', 0)
            
                       failed_tests = []
            if 'all_enhanced_tests' in current_metrics:
                failed_tests = [test for test in current_metrics['all_enhanced_tests'] 
                              if test.get('status') == 'FAIL'][:5]  # Top 5 failures
            
            failed_test_summary = ""
            if failed_tests:
                failed_test_summary = "\nSpecific Failed Tests:\n"
                for test in failed_tests:
                    failed_test_summary += f"- {test.get('name', 'Unknown')}: {test.get('score', 0):.1f}%\n"
            
                        prediction_prompt = f"""
You are a senior industrial data quality analyst with x years experience in manufacturing sensor systems, streaming platforms, and predictive maintenance for oil & gas operations.

CURRENT  INDUSTRIAL SENSOR DATA ANALYSIS:
- Overall Quality Score: {current_score:.1f}%
- Tests Passed: {tests_passed}/{tests_total} ({pass_rate:.1f}% success rate)
- Total Sensor Records: {total_records:,} ( Industrial IoT Sensors)
- Processing Performance: {processing_time:.1f} seconds
- Data Pipeline: Azure Data Lake → Databricks → streaming Analytics
- Industry Context: Oil & Gas Manufacturing, Critical Sensor Telemetry{failed_test_summary}

CONTEXT: This is a production industrial IoT system processing 456+ million sensor records from  manufacturing operations. The 70.6% quality score indicates significant data integrity issues that could impact operational decision-making.

ANALYSIS REQUEST:
Based on these industrial sensor data patterns, provide expert predictive analysis considering:
1. Manufacturing sensor degradation cycles and calibration drift patterns
2. Industrial IoT data pipeline reliability factors
3. Business impact specific to oil & gas operational requirements  
4. Root cause analysis for EventTime/Value completeness failures
5. Financial and operational risk assessment for manufacturing operations

Respond in valid JSON format with detailed technical analysis:

{{
    "24_hour_prediction": {{
        "predicted_score": <realistic score based on industrial sensor patterns>,
        "trend": "STABLE|IMPROVING|GRADUAL_DECLINE|MODERATE_DECLINE|CRITICAL_DECLINE",
        "confidence_level": <0-100 confidence percentage>,
        "technical_reasoning": "<detailed explanation based on sensor degradation patterns>"
    }},
    "7_day_prediction": {{
        "predicted_score": <realistic 7-day industrial forecast>,
        "trend": "STABLE|IMPROVING|GRADUAL_DECLINE|MODERATE_DECLINE|CRITICAL_DECLINE",
        "confidence_level": <0-100 confidence percentage>,
        "technical_reasoning": "<explain longer-term industrial trend analysis>"
    }},
    "risk_assessment": {{
        "overall_risk_level": "LOW|MEDIUM|HIGH|CRITICAL",
        "risk_score": <0-100>,
        "manufacturing_risk_factors": [
            {{
                "risk_factor": "<specific industrial risk factor>",
                "severity": "LOW|MEDIUM|HIGH|CRITICAL",
                "technical_description": "<detailed technical risk analysis>",
                "probability": <0-100 likelihood percentage>,
                "operational_impact": "<impact on manufacturing operations>",
                "financial_impact_range": "<estimated cost impact>"
            }}
        ]
    }},
    "business_impact_analysis": {{
        "current_operational_impact": "<assessment of current manufacturing impact>",
        "predicted_operational_impact": "<forecasted impact on operations>",
        "affected_manufacturing_processes": ["<list of impacted manufacturing areas>"],
        "production_risk_assessment": "<assessment of production line risks>",
        "compliance_risk": "<regulatory compliance implications>",
        "estimated_downtime_risk": "<potential downtime scenarios>"
    }},
    "intelligent_recommendations": [
        {{
            "priority": "CRITICAL|HIGH|MEDIUM|LOW",
            "action": "<specific technical action>",
            "detailed_description": "<comprehensive implementation guidance>",
            "implementation_timeline": "<realistic implementation timeframe>",
            "expected_quality_improvement": "<predicted quality score improvement>",
            "resource_requirements": "<technical/human resources needed>",
            "cost_benefit_analysis": "<financial justification>",
            "technical_rationale": "<detailed technical reasoning>"
        }}
    ],
    "sensor_diagnostic_insights": {{
        "sensor_health_assessment": "<analysis of overall sensor fleet health>",
        "calibration_drift_analysis": "<assessment of calibration issues>",
        "predictive_maintenance_recommendations": "<maintenance scheduling guidance>",
        "field_completeness_priorities": "<which data fields to prioritize>",
        "data_pipeline_optimization": "<suggestions for pipeline improvements>"
    }},
    "monitoring_strategy": {{
        "immediate_actions": ["<urgent monitoring steps>"],
        "short_term_strategy": "<30-day monitoring plan>",
        "long_term_strategy": "<quarterly monitoring approach>",
        "key_performance_indicators": ["<KPIs to track>"],
        "alert_thresholds": "<recommended alert configurations>"
    }}
}}

Focus on actionable, technical recommendations suitable for industrial operations management and data engineering teams. Consider the specific challenges of EventTime and Value field completeness failures in manufacturing sensor data.
"""
            
            print(" Generating AI-powered industrial quality predictions...")
            print(f" Analyzing {current_score:.1f}% quality score with {total_records:,} sensor records...")
            
            # Call Azure OpenAI for intelligent industrial analysis
            response = self.ai_client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system", 
                        "content": "You are an expert industrial data quality analyst with deep expertise in manufacturing sensor systems, streaming platforms, oil & gas operations, and predictive maintenance. You provide accurate, actionable insights in valid JSON format for industrial operations teams."
                    },
                    {
                        "role": "user", 
                        "content": prediction_prompt
                    }
                ],
                temperature=0.2,  # Low temperature for consistent, reliable predictions
                max_tokens=3000,
                top_p=0.9
            )

            ai_response = response.choices[0].message.content.strip()
            
    
            json_content = self.extract_json_from_response(ai_response)
            
   
            try:
                ai_predictions = json.loads(json_content)
                print(" AI industrial predictions parsed successfully")
            except json.JSONDecodeError as json_error:
                print(f" JSON parsing failed: {str(json_error)}")
                print(f"Attempting to use raw response...")
                return self.fallback_simulation(current_metrics)
            
            return {
                'ai_powered': True,
                'current_quality_score': current_score,
                'ai_predictions': ai_predictions,
                'ai_model_used': 'gpt-4o-mini',
                'prediction_timestamp': datetime.now().isoformat(),
                'tokens_used': response.usage.total_tokens,
                'input_metrics': current_metrics,
                'raw_ai_response_preview': ai_response[:300] + "..." if len(ai_response) > 300 else ai_response
            }
            
        except Exception as e:
            print(f" AI prediction error: {str(e)}")
            return self.fallback_simulation(current_metrics)
    
    def extract_json_from_response(self, ai_response):
        """Extract JSON from AI response with multiple parsing strategies"""
        
  
        if "```json" in ai_response:
            json_start = ai_response.find("```json") + 7
            json_end = ai_response.find("```", json_start)
            if json_end != -1:
                return ai_response[json_start:json_end].strip()
        
               if "```" in ai_response and "{" in ai_response:
            json_start = ai_response.find("```") + 3
            json_end = ai_response.find("```", json_start)
            if json_end != -1:
                return ai_response[json_start:json_end].strip()
        
      
        if "{" in ai_response and "}" in ai_response:
            json_start = ai_response.find("{")
            json_end = ai_response.rfind("}") + 1
            return ai_response[json_start:json_end]
        
                if ai_response.strip().startswith("{"):
            return ai_response.strip()
        
     
        return "{}"
    
    def fallback_simulation(self, current_metrics):
        """Enhanced rule-based simulation fallback"""
        print(" Using enhanced rule-based simulation (AI unavailable)")
        
        current_score = current_metrics.get('overall_enhanced_score', 0)
        tests_passed = current_metrics.get('tests_passed', 0)
        tests_total = current_metrics.get('tests_total', 1)
        
    
        if current_score >= 85:
            hour_24_score = max(80, current_score - np.random.uniform(0, 4))
            trend_24h = "STABLE"
            risk_level = "LOW"
        elif current_score >= 70:
            hour_24_score = max(65, current_score - np.random.uniform(2, 7))
            trend_24h = "GRADUAL_DECLINE"
            risk_level = "MEDIUM"
        elif current_score >= 55:
            hour_24_score = max(50, current_score - np.random.uniform(5, 10))
            trend_24h = "MODERATE_DECLINE"  
            risk_level = "HIGH"
        else:
            hour_24_score = max(30, current_score - np.random.uniform(8, 15))
            trend_24h = "CRITICAL_DECLINE"
            risk_level = "CRITICAL"
        
        return {
            'ai_powered': False,
            'current_quality_score': current_score,
            'ai_predictions': {
                '24_hour_prediction': {
                    'predicted_score': round(hour_24_score, 1),
                    'trend': trend_24h,
                    'confidence_level': 60,
                    'technical_reasoning': 'Rule-based simulation - AI unavailable'
                },
                'risk_assessment': {
                    'overall_risk_level': risk_level,
                    'risk_score': min(100, max(0, 100 - current_score + 10))
                }
            },
            'ai_model_used': 'enhanced-rule-based-simulation'
        }



ai_predictive_engine = AIPoweredPredictiveEngine(ai_client if ai_available else None)


if 'overall_enhanced_score' in locals() and 'all_enhanced_tests' in locals():

    current_quality_metrics = {
        'overall_enhanced_score': overall_enhanced_score,
        'tests_passed': sum(1 for test in all_enhanced_tests if test.get('status') == 'PASS'),
        'tests_total': len(all_enhanced_tests),
        'total_records': total_records if 'total_records' in locals() else 0,
        'enhanced_total_duration': enhanced_total_duration if 'enhanced_total_duration' in locals() else 0,
        'all_enhanced_tests': all_enhanced_tests if 'all_enhanced_tests' in locals() else []
    }

    ai_results = ai_predictive_engine.generate_ai_predictions(current_quality_metrics)
   
    
    if ai_results.get('ai_powered', False):
        print(f" Powered by: Azure OpenAI {ai_results.get('ai_model_used')}")
        print(f" AI Tokens Used: {ai_results.get('tokens_used', 'N/A')}")
        print(f" Analysis Mode: Real AI Industrial Intelligence")
        
        ai_predictions = ai_results.get('ai_predictions', {})
        
               if '24_hour_prediction' in ai_predictions:
            pred_24h = ai_predictions['24_hour_prediction']

        
                if '7_day_prediction' in ai_predictions:
            pred_7d = ai_predictions['7_day_prediction']
            print(f"\n AI 7-DAY INDUSTRIAL FORECAST:")

        if 'risk_assessment' in ai_predictions:
            risk_ai = ai_predictions['risk_assessment']
            print(f"\n AI MANUFACTURING RISK ASSESSMENT:")
                        
            risk_factors = risk_ai.get('manufacturing_risk_factors', [])
            if risk_factors:
                print(f"   Manufacturing Risk Factors: {len(risk_factors)} identified")
                for i, factor in enumerate(risk_factors[:3], 1):  
                    print(f"      {i}. {factor.get('risk_factor', 'Unknown')} ({factor.get('severity', 'Unknown')})")
                    print(f"         Impact: {factor.get('operational_impact', 'Not specified')}")
                    print(f"         Probability: {factor.get('probability', 0)}%")
        
               if 'business_impact_analysis' in ai_predictions:
            business_ai = ai_predictions['business_impact_analysis']
            print(f"\n AI MANUFACTURING BUSINESS IMPACT:")

        if 'intelligent_recommendations' in ai_predictions:
            ai_recs = ai_predictions['intelligent_recommendations']
            print(f"\n AI INTELLIGENT INDUSTRIAL RECOMMENDATIONS:")
            for i, rec in enumerate(ai_recs[:4], 1):  
                print(f"   {i}. {rec.get('action', 'Unknown')} ({rec.get('priority', 'UNKNOWN')} Priority)")
                print(f"      Description: {rec.get('detailed_description', 'No description')}")
                print(f"      Timeline: {rec.get('implementation_timeline', 'Unknown')}")
                print(f"      Expected Improvement: {rec.get('expected_quality_improvement', 'Unknown')}")
                print(f"      Cost-Benefit: {rec.get('cost_benefit_analysis', 'Unknown')}")
                print(f"      Technical Rationale: {rec.get('technical_rationale', 'Not provided')}")
                print()
        
        # Sensor Diagnostic Insights
        if 'sensor_diagnostic_insights' in ai_predictions:
            sensor_ai = ai_predictions['sensor_diagnostic_insights']
            print(f"\n🔧 AI SENSOR DIAGNOSTIC INSIGHTS:")
            print(f"   Sensor Health: {sensor_ai.get('sensor_health_assessment', 'Not analyzed')}")
            print(f"   Calibration Analysis: {sensor_ai.get('calibration_drift_analysis', 'None')}")
            print(f"   Maintenance Rec: {sensor_ai.get('predictive_maintenance_recommendations', 'None')}")
            print(f"   Field Priorities: {sensor_ai.get('field_completeness_priorities', 'None')}")
        
        
        if 'monitoring_strategy' in ai_predictions:
            monitor_ai = ai_predictions['monitoring_strategy']
            print(f"\n AI MONITORING STRATEGY:")
            immediate_actions = monitor_ai.get('immediate_actions', [])
            if immediate_actions:
                print(f"   Immediate Actions: {len(immediate_actions)} steps")
                for action in immediate_actions[:3]:
                    print(f"      • {action}")
            print(f"   Short-term Strategy: {monitor_ai.get('short_term_strategy', 'Not provided')}")
            print(f"   Long-term Strategy: {monitor_ai.get('long_term_strategy', 'Not provided')}")
    
    else:
        print(f" Using Enhanced Rule-Based Industrial Simulation")

        
   
        sim_predictions = ai_results.get('ai_predictions', {})
        if '24_hour_prediction' in sim_predictions:
            pred_24h = sim_predictions['24_hour_prediction']

        if 'risk_assessment' in sim_predictions:
            risk = sim_predictions['risk_assessment']
            print(f" Risk Assessment: {risk.get('overall_risk_level', 'UNKNOWN')} ({risk.get('risk_score', 0)}/100)")

    ai_predictive_summary = {
        'analysis_timestamp': datetime.now().isoformat(),
        'ai_powered': ai_results.get('ai_powered', False),
        'model_used': ai_results.get('ai_model_used', 'unknown'),
        'current_score': ai_results.get('current_quality_score', 0),
        'input_metrics': current_quality_metrics,
        'full_ai_predictions': ai_results.get('ai_predictions', {}),
        'tokens_consumed': ai_results.get('tokens_used', 0),
        'prediction_quality': 'high' if ai_results.get('ai_powered') else 'simulation'
    }

    print(f" AI-Powered Industrial Predictive Analysis Completed!")
        
else:
    print(" Missing data quality results")
        
    ai_predictive_summary = {
        'error': 'Missing data quality results',
        'required_variables': ['overall_enhanced_score', 'all_enhanced_tests', 'total_records'],
        'workflow_status': 'incomplete - run Cell 2 first',
        'ai_powered': False
    }

Another intention to create this Implementation framework for fast validation, quick error reporting, secure authentication and AI powered Insights which creates a comprehensive solution that scales from troubleshooting current issues & predict future problems.

This step 6 code is just a sample however lot of conditions can be added.

Added Sample notebook at my Github: learnprofile/Databricks-DataCompleteness: Datacompleteness Testing using Azure Databricks

Future Enhancements

Looking ahead, I am checking feasibility working on several enhancements that will make our completeness framework even more valuable.

Predictive Completeness Analytics

Currently working on developing AI models that predict when completeness issues are likely to occur. Early results show we can forecast completeness drops 24–48 hours before they happen, allowing proactive intervention.

Automated Remediation

Integration with some extended monitoring tools with action and fallback mechanism like restarting failed data collection processes or switching to backup sensors when primary sensors become unresponsive.

Cross-System Correlation

I am checking feasibility to connect with 3rd party systems as well.

Practical Implementation Workflow

After multiple iterations of creating this code approach I found its simplicity in execution:

One-time setup: Run Step 1 once per session
Data connection: Execute Step 2 to connect to your data source
Quality analysis: Run Step 3 for comprehensive completeness validation
AI preparation: Execute Step 4 to structure results for AI analysis
Current analysis: Run Step 5 for AI-powered diagnostic insights
Future predictions: Execute Step 6 for predictive intelligence

Common Data Completeness Challenges

Through my current implementation, I have encountered several challenges that seem to be common across industrial IoT deployments. Here are few of them which anyone can encounter.

Network Connectivity Issues

Industrial environments are tough on network equipment. We found that network issues were responsible for about 30% of our completeness problems. The solution involved building more resilient data collection processes and implementing local buffering for sensor data during network outages.

Sensor Drift and Failure

Sensors degrade over time, and sometimes the first sign isn’t inaccurate readings — it’s missing readings. We learned to monitor completeness patterns as an early warning system for sensor health issues.

Maintenance Windows

Planned maintenance creates predictable data gaps, but coordinating maintenance schedules with data collection systems proved more complex than expected. We developed a maintenance calendar integration that automatically adjusts completeness expectations during planned outages.

Scale Challenges

As our sensor network grew, our completeness validation processes struggled to keep up. Processing lot of records efficiently required significant optimization of our validation algorithms and infrastructure.

Best Practices for Improving Data Completeness

Industry Standard Practices

Data Validation at Source: Implement validation rules at data entry points
Real-time Monitoring: Continuous monitoring of data streams for immediate gap detection
Automated Backfill Processes: Systems to automatically recover missing data when possible
Data Quality Dashboards: Visual monitoring of completeness metrics across all data sources
Escalation Procedures: Automated alerts for critical data completeness thresholds
Regular Data Audits: Scheduled comprehensive reviews of data completeness across all systems
Start with clear definitions: Spend time upfront defining what completeness means for your specific use case. Generic definitions don’t work well in industrial settings where context matters enormously
Understand your dependencies: Map out all the systems, processes, and people that affect your data completeness.
Build in Redundant systems: For critical data points, having backup collection methods can dramatically improve completeness. But don’t try to redundancy everything — focus on the data that has the highest business impact.
Educate your stakeholders: Make sure everyone who affects data collection understands why completeness matters.

Next Steps:

Next I am working on scheduling this Notebook to run via Cypress testing Suite, keep an eye on my next article about Data Completeness check using Cypress with OAuth Authentication.

References:

Here are some of the reference links which really helped me to build this Databricks solution and enhance it further.

What Is Data Completeness? Definition, Examples, And KPIs

data-completeness · GitHub Topics

elementary-data/awesome-dbt-tests

How to Measure Data Completeness: A Step-by-Step Guide

19. Data Completeness Check

11. Data Completeness — Record Count check

What Is Data Completeness & Why It Matters

To view or add a comment, sign in

Understanding Data Completeness

What is Data Completeness?

Why Do We Need Data Completeness?

Business Decision Making

Regulatory Compliance

Predictive Analytics

Operational Reliability

Building Trust in Your Analytics

Importance of Data Completeness in Industrial IoT

Time-Series Dependencies

Cascading Effects

Real-Time Decision Making

Data Completeness versus Data Accuracy

How to Determine and Measure Data Completeness

Measurement Approaches: There are different ways to measure completeness:

Steps to Achieve Data Completeness

Step 1: Prerequisites and Library Setup

Step 2: Data Source Connection and Path Mapping

Step 3: 10-Dimension Enhanced Data Completeness Validation

The 10 Enhanced Validation Dimensions

Recommended by LinkedIn

Step 4: AI Integration Setup and Data Preparation

Step 5: Azure OpenAI Current State Analysis

Step 6: AI-Powered Predictive Quality Intelligence

Future Enhancements

Predictive Completeness Analytics

Automated Remediation

Cross-System Correlation

Practical Implementation Workflow

Common Data Completeness Challenges

Network Connectivity Issues

Sensor Drift and Failure

Maintenance Windows

Scale Challenges

Best Practices for Improving Data Completeness

Industry Standard Practices

Next Steps:

References:

Others also viewed

Choosing between Azure Stream Analytics and Azure Functions for your IoT data pipeline: Factors to consider

Building Industry-Ready Skills: My Learning Journey in Cloud, AI , IoT & Data Analytics

Transforming Data with Point Cloud Registration

Big Data and Cognitive Computing will Drive Digital Enterprise Transformation

"Innovative Technology Integration in Modern Data Virtualization Solutions"

Beyond AI and Cloud Computing: The Next Evolution of Autonomous Systems and Distributed Databases

Criticality in Data Stream Processing and a Few Effective Approaches

Streaming Data Lake Architecture

Towards an Autonomously, Continuously, Temporally (ACT) Focused Culture in Enterprise.

Implementing end-to-end real-time Analytics on Microsoft Fabric

Similar topics

Importance of Data Quality for AI Insights

Data Completeness and Stakeholder Trust

The Importance of Data in AI Supply Chains

Ensuring Data Quality For Scalable AI

Why data trustworthiness matters for smart technology

Explore content categories