Left-Shifting Defect Detection with Sensor Data
Image: Semiconductor fabrication process flow


*Disclaimer: This article reflects my personal views and does not represent the opinions or positions of my employer.*

Introduction:

In semiconductor manufacturing, discovering a defective wafer post-fabrication is expensive. A defective wafer represents thousands of dollars in wasted investment. What if you could predict failures early in the manufacturing process, saving a significant portion of those costs?

Left-shifting defect detection with sensor data is the foundation for building predictive, data-driven models that enable early fault detection and intelligent process optimization, turning manufacturing data into actionable insights.

The UCI-SECOM dataset provides a glimpse into this opportunity. With ~590 sensors monitoring wafer fabrication and pass/fail outcomes for 1,567 wafers, we can build models that predict defects early. However, this opportunity comes with challenges: significant class imbalance, sparse sensor data with substantial missingness, and a critical trade-off between false alarms and missed defects.

Dataset Overview and Challenges:

The UCI-SECOM dataset contains anonymized features to protect proprietary manufacturing data. Sensor readings are indexed from 0 to 589, representing 590 predictor variables per wafer, along with a Pass/Fail target variable. The dataset includes 1,567 samples, each representing a wafer tracked across all sensors. The sensor data most likely represents a comprehensive view of the manufacturing environment: temperatures and pressures, for example.

However, the dataset presents three critical challenges that make modelling difficult.

  1. Class imbalance: 104 failures out of 1,567 wafers (roughly a 14:1 Pass:Fail ratio). While not extreme, the imbalance tilts the model towards the majority class, in this case the Pass class.
  2. High dimensionality: 590 features create information overload. Too many features make it hard for the model to identify what actually matters for predicting failures.
  3. Missing data: most sensors have either minimal (<10%) or heavy (>40%) missing data, requiring strategic handling.

[Figure: histogram of missing-data percentages across sensors]
The missing data has a multi-modal character: four groups of predictors exhibit missingness ranging from under 10% to over 40%.

These challenges must be addressed before building any predictive model. In machine learning, exploratory data analysis and data preparation are critical steps, and with real-world manufacturing data they are often the hardest part.

Data Preparation Strategy:

The key concerns in data preparation are missing data and high dimensionality.

Missing Data Handling:

Step 1: Remove columns with 100% missing data.

First, we identified and removed predictors/features that were entirely missing; these provide zero information.

Step 2: Apply a 40% threshold

Next, we removed features with more than 40% missing values. This threshold was chosen based on the histogram pattern: sensors with over 40% missing data are so sparse that imputation strategies are unlikely to yield meaningful information.

Step 3: Impute remaining predictors with missing data.

For features with <40% missing data, we used median imputation via scikit-learn's SimpleImputer. Median imputation is more robust to outliers than mean imputation, which can aggravate the skewness of the data.
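The three steps above can be sketched with pandas and scikit-learn. The toy DataFrame below is only a stand-in for the real SECOM matrix; the column names and missingness fractions are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the SECOM sensor matrix (the real data is 1567 rows x 590 cols).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"s{i}" for i in range(5)])
df["s0"] = np.nan                                                  # 100% missing
df.loc[df.sample(frac=0.6, random_state=0).index, "s1"] = np.nan   # >40% missing
df.loc[df.sample(frac=0.1, random_state=1).index, "s2"] = np.nan   # <40% missing

# Step 1: drop columns that are entirely missing.
df = df.dropna(axis=1, how="all")

# Step 2: drop columns with more than 40% missing values.
missing_frac = df.isna().mean()
df = df.loc[:, missing_frac <= 0.40]

# Step 3: median-impute what remains (robust to outliers, unlike the mean).
imputer = SimpleImputer(strategy="median")
X = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(X.shape, X.isna().sum().sum())
```

On the toy data, s0 falls to step 1, s1 to step 2, and s2 is imputed in step 3, leaving a complete matrix.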

With this three-step data preparation strategy, the feature set was reduced from 590 to 558.

Once the missing data is handled, it is important to address the high dimensionality of the data.

High Data Dimensionality Handling:

558 predictors still represent information overload. It is important to look at strategies for reducing dimensionality so the model receives meaningful data.

What makes data meaningful?

Ask any statistician what makes data insightful and the answer most likely would be "variance". Variance gives data meaning, and understanding the variance of the data leads to critical insights for dimensionality reduction. The variance plot of the data is captured below.

[Figure: variance plot of sensors, log scale]
The variance plot shows sensors ranked from lowest to highest variance. The first ~100 sensors have extremely low variance (near constant readings) as indicated by the steep drop on the left side of the distribution.

The variance plot shows two critical insights. First, sensors on the extreme left (very low variance) are near-constant and carry little information. Second, sensors on the extreme right have very high variance, which may indicate either highly sensitive measurements or unstable/faulty sensors. The log-scale visualisation helps identify these outliers on both ends. As a strategy, a variance band of 0.001 to 10,000 was applied to filter out the extremes, retaining sensors that capture meaningful process variations without excessive noise.

This filtering strategy brings the data dimension down from 558 to 307, eliminating just over 250 features!
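A minimal sketch of this two-sided variance filter with NumPy. The variance band of 0.001 to 10,000 comes from the article; the synthetic "sensors" below (one near-constant, one wildly noisy) are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 6))
X[:, 0] = 5.0                        # near-zero variance: essentially a constant sensor
X[:, 1] = rng.normal(0, 500, 200)    # extreme variance: possibly an unstable sensor

# Keep only sensors whose variance falls inside the band (0.001, 10_000).
variances = X.var(axis=0)
keep = (variances > 1e-3) & (variances < 1e4)
X_filtered = X[:, keep]

print(X_filtered.shape)  # the constant and the extreme-variance sensor are dropped
```

Note that scikit-learn's VarianceThreshold only enforces a lower bound, which is why the upper cut is done by hand here.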

Can data dimensionality be further reduced?

Principal Component Analysis (PCA) is a powerful dimensionality reduction technique. PCA transforms correlated features into a smaller set of uncorrelated principal components while preserving the most important information in the data, i.e. its variance. PCA further reduces the high-dimensional data by:

  1. Identifying the direction of maximum variance in the data.
  2. Eliminating correlated sensor readings that carry similar information.
  3. Retaining 95% of the variance in data with fewer principal components.

In this project, PCA transformed 307 sensors into 112 principal components while still retaining 95% of the variance in the data. This makes the data far more manageable for machine learning models.

The key advantage of PCA is that it creates new features that are linear combinations of the original sensors, capturing the most important process variations in a compact representation.
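A sketch of this step with scikit-learn's PCA. The correlated toy "sensors" below are an assumption standing in for the 307 filtered SECOM features; in practice you would typically also standardise the features before fitting:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Toy correlated sensors: 20 underlying signals fanned out to 100 readings.
latent = rng.normal(size=(300, 20))
mixing = rng.normal(size=(20, 100))
X = latent @ mixing + 0.01 * rng.normal(size=(300, 100))

# n_components=0.95 keeps the fewest components explaining >= 95% of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)

print(X_pca.shape[1], round(pca.explained_variance_ratio_.sum(), 3))
```

Because the toy data has only 20 underlying signals, far fewer than 100 components survive, mirroring the 307-to-112 reduction reported above.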

In summary, variance analysis and PCA reduced the data dimensionality from 558 features to 112 principal components. This is almost a 5x reduction!

Addressing Class Imbalance:

The UCI-SECOM data suffers from a significant class imbalance problem, with a 14:1 ratio of Pass to Fail wafers. This creates a critical problem for machine learning models: the model tends to memorise the majority class and predict "Pass" for every wafer. For example, a model achieving 93% accuracy might seem impressive, but its real performance is terrible: the recall and precision for the Fail class would be extremely poor. Such a model simply cannot detect failing wafers, defeating the entire purpose of defect prediction.

The solution lies in balancing the Pass/Fail classes through a two-pronged approach.

  1. SMOTE (Synthetic Minority Over-sampling Technique): SMOTE generates synthetic samples of the minority class (Fail wafers). It creates synthetic samples by interpolating between a data point and its nearest neighbors. This balances the training data and forces the model to learn actual defect patterns rather than memorising the majority class.
  2. XGBoost classifier as the prediction model: XGBoost (Extreme Gradient Boosting) is selected as the prediction model for its exceptional performance on imbalanced data. It's a powerful ensemble learning algorithm that builds decision trees sequentially, where each tree learns from the mistakes of the previous trees. The native scale_pos_weight=100 parameter is used to heavily penalise missed defects. With 500 trees, a max depth of 8, and a conservative learning rate of 0.05, the model is trained to maximise recall while maintaining acceptable precision.

Model Performance and Result:

After training the XGBoost pipeline with SMOTE and optimised hyper-parameters, the model was evaluated on a held-out test set using the default 0.5 probability threshold.

Baseline Performance (Default Threshold = 0.5)

[Table: classification report at the default 0.5 threshold]
classification_report is a method in scikit-learn's metrics package. It generates a report outlining the per-class performance of the model.
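A minimal usage sketch; the labels below are made up purely to show the call, not the SECOM results:

```python
from sklearn.metrics import classification_report

# Toy labels: 0 = Pass, 1 = Fail (illustrative values only).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

# Prints per-class precision, recall, f1-score and support, plus averages.
print(classification_report(y_true, y_pred, target_names=["Pass", "Fail"]))
```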

At first glance, 88% accuracy seems reasonable. However, looking deeper:

  • Recall for Fail class: 0.29 - the model catches only 6 out of 21 defective wafers. This means that 15 defects slip through undetected (a 71% miss rate!).
  • Precision for Fail class: 0.21 - only about 1 in 5 wafers flagged as defective is truly defective, indicating a high false-alarm rate.
  • Pass class performs well: 92% recall and 95% precision show the model handles the majority class effectively.

The Problem:

While this is better than predicting "Pass" for everything (which would catch 0 defects), detecting only 29% of defects would be unacceptable for manufacturing. The majority of the defective wafers would still complete the expensive fabrication process, resulting in significant waste.

Understanding the Low Recall:

The default probability threshold of 0.5 treats both classes equally: the model must be 50% confident that a wafer will fail before flagging it. In an imbalanced dataset where failures are rare, the model becomes overly cautious, resulting in many missed defects.

Probability Threshold Tuning: Aligning the Model with Business Objectives

Not all errors cost the same!

False Positive: flagging a good wafer as defective. The wafer is stopped early in fabrication, losing ~$100 (an imaginary number used solely for comparison).

False Negative: flagging a defective wafer as non-defective (missing a defect). The wafer completes fabrication, losing ~$1,500 (an imaginary number showing that false negatives in manufacturing can be orders of magnitude more expensive than false positives).

The above is a critical insight: recall matters more than precision. Reducing false negatives (missed defects) should be preferred over reducing false positives.

To reduce false negatives, we lower the probability threshold so that more defective wafers are caught. By making the model more sensitive towards the Fail class, we accept more false positives in exchange for catching more actual defects.

Our experiments showed that at a probability threshold of 0.10, recall increases to 0.38, i.e. the model catches 8 out of 21 defects instead of just 6. While this generates more false positives, their cost is negligible compared to the savings from catching more defects.
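Threshold tuning can be sketched as a simple sweep over candidate thresholds, scoring each by recall and by expected cost. The probabilities, labels, and dollar figures below are hypothetical; in the project the scores would come from the trained pipeline's predict_proba on the test set:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical test-set scores and labels (1 = Fail).
rng = np.random.default_rng(7)
y_true = np.r_[np.zeros(140, int), np.ones(10, int)]
proba = np.clip(np.where(y_true == 1,
                         rng.normal(0.25, 0.15, 150),   # Fail wafers score higher
                         rng.normal(0.05, 0.05, 150)),  # Pass wafers score low
                0, 1)

COST_FP, COST_FN = 100, 1500   # illustrative per-error costs from the article

for threshold in (0.5, 0.25, 0.10):
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    cost = fp * COST_FP + fn * COST_FN
    print(f"t={threshold:.2f}  recall={tp / (tp + fn):.2f}  "
          f"false_alarms={fp}  expected_cost=${cost}")
```

Lowering the threshold can only grow the set of flagged wafers, so recall never decreases; the sweep makes the recall-versus-false-alarm trade explicit in cost terms.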

Summary:

The project transformed the UCI-SECOM dataset into a practical defect prediction system. Through variance filtering, PCA, SMOTE, and XGBoost with a tuned decision threshold, we built a model that catches defects early in fabrication.

The Takeaway:

Model success is not measured by accuracy; it's measured by business impact. Missing a defect costs orders of magnitude more than incorrectly flagging a good part as defective. By optimising for recall rather than accuracy, we increased defect detection from 29% to 38%, delivering cost savings.

Left shifting defect detection from post-fabrication to early intervention is the key to smart manufacturing. Machine learning makes it possible!
