Insurance Fraud Detection Model

An interactive dashboard summarizing the findings from an end-to-end machine learning project to identify fraudulent insurance claims.

A Project by Gurpreet Singh

View Source Code on GitHub

Project at a Glance

  • Total Claims Analyzed: 1,000
  • Observed Fraud Rate: 24.7%
  • Best Model F1-Score (Fraud): 0.64

Project Workflow

This project followed a standard data science lifecycle, from data exploration and cleaning through model evaluation and interpretability.

1. Data Exploration & Cleaning
2. Feature Engineering
3. Model Training & Tuning
4. Evaluation & Interpretability

Data Exploration

This section provides insights from the Exploratory Data Analysis (EDA). The goal was to understand data distributions, patterns, and relationships to inform feature engineering and modeling. Select a chart from the dropdown to explore different aspects of the dataset.

Model Performance Comparison

Here, we evaluate and compare the performance of three different classification models. For fraud detection, we prioritize metrics like **Recall** (to catch as many fraud cases as possible) and **F1-Score** (to balance recall with precision). Select a model to view its detailed performance metrics.
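To make these metrics concrete, precision, recall, and F1 can be computed directly from confusion-matrix counts. The counts below are illustrative only, not the project's actual results:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Illustrative counts for the fraud class (not the project's numbers):
# 40 true positives, 15 false positives, 30 false negatives.
p, r, f1 = precision_recall_f1(tp=40, fp=15, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.727 0.571 0.64
```

Note how F1 penalizes the gap between the two: a model can only score well on F1 if it both catches fraud (recall) and avoids flagging legitimate claims (precision).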

XGBoost Performance Metrics

XGBoost Confusion Matrix

(Interactive chart: cell counts for Predicted Not Fraud / Predicted Fraud versus Actual Not Fraud / Actual Fraud are populated in the dashboard when a model is selected.)

Model Interpretability with SHAP

Understanding *why* a model makes its predictions is crucial for trust and actionable insights. We used SHAP (SHapley Additive exPlanations) to determine which features have the most impact on fraud prediction for the best-performing model (XGBoost).
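SHAP values are grounded in the game-theoretic Shapley value: a feature's attribution is its marginal effect on the prediction, averaged over every order in which features could be revealed. The sketch below computes exact Shapley values for a toy hand-written scoring function, purely to illustrate the idea; the project itself uses the SHAP library against the trained XGBoost model, and the feature names and weights here are invented:

```python
from itertools import permutations

def shapley_values(features, predict):
    """Exact Shapley values: average each feature's marginal contribution
    to predict() over every ordering in which features are revealed."""
    names = list(features)
    values = {n: 0.0 for n in names}
    orderings = list(permutations(names))
    for order in orderings:
        revealed = {}
        for name in order:
            before = predict(revealed)
            revealed[name] = features[name]
            values[name] += predict(revealed) - before
    return {n: v / len(orderings) for n, v in values.items()}

# Toy scoring function (illustrative weights, not the real model):
# severe incidents push the score up, high premiums pull it down.
def toy_model(x):
    score = 0.2  # base rate
    if x.get("incident_severity") == "major":
        score += 0.5
    if x.get("policy_annual_premium", 0) > 1500:
        score -= 0.1
    return score

phi = shapley_values(
    {"incident_severity": "major", "policy_annual_premium": 2000}, toy_model
)
print(phi)  # contributions sum to toy_model(x) minus the base rate
```

Exhaustive enumeration over orderings is exponential in the number of features, which is why SHAP's `TreeExplainer` uses a tree-specific algorithm rather than this brute-force form.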

Top Features Driving Fraud Prediction

How Top Features Influence Predictions

Based on SHAP beeswarm plot analysis (not shown here), we can interpret the influence of the top features:

Incident Severity

Higher severity incidents (e.g., 'Major Damage', 'Total Loss') strongly increase the likelihood of a fraud prediction. This is the most significant driver.

Insured Hobbies

Certain hobbies, like 'chess' and 'cross-fit' in this synthetic dataset, were surprisingly strong predictors, pushing predictions towards fraud. This highlights how models can find non-obvious correlations.

Policy Annual Premium

Higher annual premiums tend to be associated with non-fraudulent claims (pushing SHAP values lower), whereas lower premiums are associated with a higher likelihood of fraud.

Conclusion & Business Impact

This project demonstrates a data-driven approach to a critical business problem, providing a blueprint for real-world applications in the insurance industry.

Project Summary

An end-to-end pipeline was built to clean, explore, and model insurance claims data. After comparing Logistic Regression, Random Forest, and XGBoost, the **XGBoost model, enhanced with hyperparameter tuning and SMOTE for imbalance**, proved to be the most effective and balanced solution for detecting fraudulent claims.
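SMOTE rebalances the training data by synthesizing new minority-class (fraud) samples rather than duplicating existing rows: each synthetic point is an interpolation between a fraud sample and one of its nearest fraud neighbours. The sketch below shows only that core interpolation idea on invented numeric points; in practice the project would use imbalanced-learn's `SMOTE` inside the training pipeline:

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Minimal SMOTE sketch: synthesize minority samples by interpolating
    between a sample and one of its k nearest minority neighbours.
    Numeric features only; real use should rely on imbalanced-learn."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment between x and nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

fraud = [(1.0, 0.2), (1.2, 0.1), (0.9, 0.4)]  # invented minority-class points
new_points = smote_sketch(fraud, n_new=4)
print(len(new_points))  # 4 synthetic fraud samples
```

Because synthetic points lie on segments between real fraud samples, the classifier sees a denser fraud region instead of exact copies, which reduces overfitting compared with naive oversampling.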

Business Insights & Use Cases

  • Automated Risk Scoring: The model can be deployed to score incoming claims in real-time, allowing fraud investigation units to prioritize high-risk cases and fast-track low-risk ones.
  • Resource Optimization: By automating the initial screening, the model enables claims handlers and investigators to focus their efforts where they are most needed, significantly improving operational efficiency.
  • Improved Loss Ratios: By more accurately identifying and preventing fraudulent payouts, the model can directly improve the insurer's loss ratio and overall profitability.

Future Improvements

  • Model claim severity (`total_claim_amount`) using specialized regression models like Tweedie GLM.
  • Develop a user-friendly front-end interface (e.g., using Streamlit) for interactive risk scoring.
  • Incorporate text analysis on claim descriptions to capture more nuanced fraud indicators.
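As a pointer for the first item above, scikit-learn ships a Tweedie GLM directly. A Tweedie power between 1 and 2 gives a compound Poisson-gamma distribution, suited to claim amounts that mix zeros with heavy-tailed positive values. The sketch below runs on synthetic data with invented features and coefficients, purely to show the API shape:

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 3))  # e.g. age, vehicle value, region score
# Synthetic positive, right-skewed claim amounts driven by the second feature.
claim_amount = np.exp(1.0 + 2.0 * X[:, 1]) * rng.gamma(2.0, 0.5, size=200)

# power=1.5 selects a compound Poisson-gamma family; the log link keeps
# predicted severities positive.
model = TweedieRegressor(power=1.5, alpha=0.01, link="log", max_iter=1000)
model.fit(X, claim_amount)
pred = model.predict(X[:5])
print(pred.shape)  # (5,)
```

Combining this severity model with the fraud classifier's probability would give an expected-cost score per claim, which is often more actionable for triage than a fraud flag alone.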