Insurance Fraud Detection Model
An interactive dashboard summarizing the findings from an end-to-end machine learning project to identify fraudulent insurance claims.
A Project by Gurpreet Singh
View Source Code on GitHub
Project at a Glance
Total Claims Analyzed
1,000
Observed Fraud Rate
24.7%
Best Model F1-Score (Fraud)
0.64
Project Workflow
This project followed a standard data science lifecycle, from data exploration through modeling to insights for deployment.
Data Exploration
This section provides insights from the Exploratory Data Analysis (EDA). The goal was to understand data distributions, patterns, and relationships to inform feature engineering and modeling. Select a chart from the dropdown to explore different aspects of the dataset.
Model Performance Comparison
Here, we evaluate and compare the performance of three different classification models. For fraud detection, we prioritize metrics like **Recall** (to catch as many fraud cases as possible) and **F1-Score** (to balance recall with precision). Select a model to view its detailed performance metrics.
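To make the metric definitions above concrete, here is a minimal sketch of how precision, recall, and F1 for the fraud class follow from confusion-matrix counts. The `ConfusionMatrix` class and the counts in `example` are hypothetical illustrations, not the project's actual results.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Counts for a binary classifier where the positive class is 'fraud'."""
    tp: int  # fraud claims correctly flagged
    fp: int  # legitimate claims wrongly flagged
    fn: int  # fraud claims the model missed
    tn: int  # legitimate claims correctly passed


def precision(cm: ConfusionMatrix) -> float:
    """Share of flagged claims that were actually fraud."""
    flagged = cm.tp + cm.fp
    return cm.tp / flagged if flagged else 0.0


def recall(cm: ConfusionMatrix) -> float:
    """Share of actual fraud claims that were flagged."""
    actual_fraud = cm.tp + cm.fn
    return cm.tp / actual_fraud if actual_fraud else 0.0


def f1_score(cm: ConfusionMatrix) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(cm), recall(cm)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for illustration only.
example = ConfusionMatrix(tp=30, fp=20, fn=10, tn=140)
print(f"precision={precision(example):.2f}")
print(f"recall={recall(example):.2f}")
print(f"f1={f1_score(example):.2f}")
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high, which is why it is a useful single number when fraud is the minority class.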
XGBoost Performance Metrics
XGBoost Confusion Matrix
Model Interpretability with SHAP
Understanding *why* a model makes its predictions is crucial for trust and actionable insights. We used SHAP (SHapley Additive exPlanations) to determine which features have the most impact on fraud prediction for the best-performing model (XGBoost).
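The idea behind SHAP can be illustrated with exact Shapley values on a toy scoring function. The sketch below is not the project's code and does not use the SHAP library, which computes or approximates these values far more efficiently for real models. The function `toy_score`, its feature names, and its numbers are all made up for illustration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    value: Callable[[FrozenSet[str]], float],
    features: List[str],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating every coalition of features.

    `value(S)` is the model output when only the features in S are known.
    The cost grows as 2^n, so this is only practical for a few features.
    """
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of f when added to coalition s.
                total += weight * (value(s | {f}) - value(s))
        result[f] = total
    return result


def toy_score(known: FrozenSet[str]) -> float:
    """Hypothetical fraud score; the numbers are invented for this example."""
    score = 0.2  # baseline when no feature is known
    if "severity" in known:
        score += 0.4
    if "premium" in known:
        score -= 0.1
    if "severity" in known and "hobby" in known:
        score += 0.2  # interaction term, shared between the two features
    return score


features = ["severity", "hobby", "premium"]
phi = shapley_values(toy_score, features)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.2f}")
```

The contributions always sum to the difference between the full score and the baseline, which is the property that makes per-feature attributions add up to the prediction being explained.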
Top Features Driving Fraud Prediction
How Top Features Influence Predictions
Based on SHAP beeswarm plot analysis (not shown here), we can interpret the influence of the top features:
Incident Severity
Higher severity incidents (e.g., 'Major Damage', 'Total Loss') strongly increase the likelihood of a fraud prediction. This is the most significant driver.
Insured Hobbies
Certain hobbies, like 'chess' and 'cross-fit' in this synthetic dataset, were surprisingly strong predictors, pushing predictions towards fraud. This highlights how models can find non-obvious correlations.
Policy Annual Premium
Higher annual premiums tend to be associated with non-fraudulent claims (pushing SHAP values lower), whereas lower premiums are associated with a higher likelihood of fraud.
Conclusion & Business Impact
This project successfully demonstrates a data-driven approach to a critical business problem, providing a blueprint for real-world applications in the insurance industry.
Project Summary
An end-to-end pipeline was built to clean, explore, and model insurance claims data. After comparing Logistic Regression, Random Forest, and XGBoost, the **XGBoost model, enhanced with hyperparameter tuning and SMOTE to address class imbalance**, proved to be the most effective and balanced solution for detecting fraudulent claims.
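For readers unfamiliar with SMOTE, the sketch below shows its core idea in plain Python: new minority-class samples are interpolated between an existing minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the project's implementation. The function name `smote_like`, its parameters, and the sample rows are hypothetical, and it handles numeric features only.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def smote_like(
    minority: List[Point], n_new: int, k: int = 3, seed: int = 0
) -> List[List[float]]:
    """Create n_new synthetic samples by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority samples (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical, already-scaled numeric features for four fraud rows.
fraud_rows = [[0.9, 0.2], [0.8, 0.3], [0.95, 0.25], [0.7, 0.1]]
extra = smote_like(fraud_rows, n_new=4)
print(len(extra))
```

Oversampling of this kind is normally applied to the training split only, so that the evaluation data keeps the original class balance.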
Business Insights & Use Cases
- Automated Risk Scoring: The model can be deployed to score incoming claims in real time, allowing fraud investigation units to prioritize high-risk cases and fast-track low-risk ones.
- Resource Optimization: By automating the initial screening, the model enables claims handlers and investigators to focus their efforts where they are most needed, significantly improving operational efficiency.
- Improved Loss Ratios: By more accurately identifying and preventing fraudulent payouts, the model can directly improve the insurer's loss ratio and overall profitability.
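The risk-scoring and triage use cases above could be wired together roughly as follows. This is a sketch under stated assumptions: the thresholds, queue names, and the functions `route_claim` and `triage` are hypothetical, and in practice the cut-offs would be chosen from the precision and recall trade-off and the investigators' capacity.

```python
from typing import Dict, List, Tuple

# Hypothetical cut-offs; real values would be tuned on validation data.
REVIEW_THRESHOLD = 0.7
FAST_TRACK_THRESHOLD = 0.2


def route_claim(fraud_probability: float) -> str:
    """Map a model's fraud probability to a handling queue."""
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")
    if fraud_probability >= REVIEW_THRESHOLD:
        return "investigate"
    if fraud_probability <= FAST_TRACK_THRESHOLD:
        return "fast_track"
    return "standard_review"


def triage(scored_claims: List[Tuple[str, float]]) -> Dict[str, List[str]]:
    """Group claim IDs by queue, highest-risk claims first within each queue."""
    queues: Dict[str, List[str]] = {
        "investigate": [],
        "standard_review": [],
        "fast_track": [],
    }
    for claim_id, p in sorted(scored_claims, key=lambda c: c[1], reverse=True):
        queues[route_claim(p)].append(claim_id)
    return queues


# Hypothetical claim IDs and scores.
queues = triage([("A", 0.91), ("B", 0.05), ("C", 0.45), ("D", 0.75)])
print(queues)
```

Keeping the thresholds as named constants makes it easy to revisit them as fraud patterns or investigation capacity change.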
Future Improvements
- Model claim severity (`total_claim_amount`) using specialized regression models like Tweedie GLM.
- Develop a user-friendly front-end interface (e.g., using Streamlit) for interactive risk scoring.
- Incorporate text analysis on claim descriptions to capture more nuanced fraud indicators.