The Process of Building an Explainable Fraud Detection System Using ML and Streamlit
Following my previous project, 'Building a Fraud Detection System Using XGBoost', I wanted to go further and build a dashboard on Streamlit. To simulate tackling the risk of fraudulent refund abuse, I built an end-to-end machine learning dashboard for fraudulent user detection, using a synthetic Kaggle dataset that mimics this behaviour.
This project included data preprocessing, feature engineering, model training, explainability through SHAP, and deployment with Streamlit. It is important to reflect critically on the process, especially the dataset's limitations, and to highlight the project's genuine strengths.
Shortcomings and Challenges
1. Synthetic Nature of the Dataset
The dataset was artificially generated rather than drawn from real-world customer behaviour, which introduced several limitations:
Predictability: The data patterns were simpler and cleaner than what real-world data would present.
Perfect Accuracy: The model achieved an accuracy of 1.0, which is not realistic in production environments where user behaviour is noisy and evolving.
Lack of Outliers: Real fraud often involves rare, outlier behavior, which synthetic datasets struggle to replicate.
2. Data Quality Issues
Before modelling, significant inconsistencies were identified:
Return Dates Earlier than Order Dates: Logically invalid records, which required removal.
Loss of Balance: Removing invalid entries slightly skewed the balanced nature of the original dataset, although not dramatically.
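The date-validity check above can be sketched with pandas. The column names (`Order_Date`, `Return_Date`) are hypothetical stand-ins for whatever the real dataset uses:

```python
import pandas as pd

# Hypothetical schema; the actual dataset's column names may differ.
df = pd.DataFrame({
    "Order_Date": pd.to_datetime(["2024-01-05", "2024-01-10", "2024-01-20"]),
    "Return_Date": pd.to_datetime(["2024-01-12", "2024-01-08", "2024-01-25"]),
})

# Drop logically invalid rows where the return precedes the order.
valid = df[df["Return_Date"] >= df["Order_Date"]].reset_index(drop=True)
print(len(valid))  # 2 — the one impossible record is removed
```

After a removal like this, it is worth re-checking the class balance, since invalid records may not be evenly distributed across fraud and non-fraud labels.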
3. Simplified Feature Relationships
Although feature engineering was done, the relationships between the features and fraud were more linear than would be expected in reality.
4. Streamlit Deployment Constraints
While Streamlit enabled rapid deployment, it also posed some challenges.
Strengths and Achievements
Despite these limitations, the project has several important strengths worth celebrating:
1. End-to-End Pipeline
Built a full system from raw data ingestion ➔ cleaning ➔ feature generation ➔ model prediction ➔ visualisation.
Automated the fraud detection pipeline.
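The pipeline stages above can be sketched with scikit-learn. This is a minimal illustration, with a gradient-boosted classifier standing in for the project's XGBoost model and randomly generated features standing in for the engineered ones:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the engineered feature matrix and fraud labels.
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Cleaning/feature steps feed a single Pipeline so the whole flow is automated.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipe.fit(X_tr, y_tr)
scores = pipe.predict_proba(X_te)[:, 1]  # fraud probabilities for the dashboard
```

Keeping every step inside one `Pipeline` object means the same transformations are applied at training time and at prediction time, which is what makes the end-to-end flow reproducible.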
2. Feature Engineering from Domain Intuition
Created meaningful features like Days_to_Return_Corrected, Suspicious_Score, and High_Returner_Flag based on logical business behaviour.
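A minimal sketch of how such features could be derived is below. The thresholds and the `Suspicious_Score` formula are illustrative assumptions, not the project's exact definitions:

```python
import pandas as pd

# Hypothetical schema; the real dataset's columns may differ.
df = pd.DataFrame({
    "User_ID": [1, 1, 2, 2, 2],
    "Order_Date": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-01-03", "2024-01-05", "2024-01-07"]),
    "Return_Date": pd.to_datetime(
        ["2024-01-03", "2024-02-20", "2024-01-04", "2024-01-06", "2024-01-08"]),
})

# Days between order and return, clipped at zero to guard against bad records.
df["Days_to_Return_Corrected"] = (
    (df["Return_Date"] - df["Order_Date"]).dt.days.clip(lower=0)
)

# Flag users whose return count exceeds a chosen threshold (here: > 2 returns).
returns_per_user = df.groupby("User_ID")["Return_Date"].transform("count")
df["High_Returner_Flag"] = (returns_per_user > 2).astype(int)

# Illustrative composite: fast returns by frequent returners look riskier.
df["Suspicious_Score"] = df["High_Returner_Flag"] * (
    1 / (1 + df["Days_to_Return_Corrected"])
)
```

The value of features like these is that they encode business intuition (rapid, repeated returns are suspicious) in a form the model can learn from directly.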
3. Model Explainability Integrated
Integrated SHAP so that individual fraud predictions come with per-feature explanations rather than being black-box scores, feeding the user-level explainability view in the dashboard.
4. Professional Dashboard Experience
Designed a clean, tabbed Streamlit app allowing easy exploration of KPIs, fraud scores, and detailed user explainability.
5. Problem Solving and Adaptability
Going forward, applying similar methodologies to real, messy data would introduce new challenges such as handling concept drift, building feedback loops, threshold tuning, and minimising false positives. Embracing these complexities will be key to building robust, production-ready fraud detection systems.
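As one concrete example of those challenges, threshold tuning against a false-positive budget can be sketched with scikit-learn's precision-recall curve. The scores and the precision target here are made up for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical held-out labels and model scores; in production these
# would come from a validation set scored by the live model.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.2, 0.15, 0.4, 0.8, 0.9, 0.7, 0.3, 0.6, 0.35])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Pick the lowest threshold that keeps precision at or above a target,
# trading some recall for fewer false positives (wrongly flagged users).
target_precision = 0.9
ok = precision[:-1] >= target_precision
chosen = thresholds[ok].min() if ok.any() else thresholds.max()
```

On real data this threshold would need periodic re-tuning as behaviour drifts, which is exactly where the feedback loops mentioned above come in.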