Project Overview
This project addresses the critical challenge of credit card fraud, a significant threat to financial stability. The core difficulty lies in the extreme class imbalance of the dataset, where fraudulent transactions constitute only 0.172% of the data. This imbalance often leads to models that are biased towards the majority class, failing to identify fraud effectively.
The primary objective is to develop and deploy a robust machine learning model that excels at identifying fraudulent activities (high recall) while maintaining a low rate of false positives (high precision) to ensure customer trust and operational efficiency. The entire machine learning lifecycle, from experimentation to deployment readiness, is managed using MLflow to ensure reproducibility and scalability.
Features
Data Exploration & Preprocessing: Comprehensive analysis of credit card transaction data, including handling duplicates, log transformation of skewed features, and robust scaling.
Class Imbalance Handling: Implementation and evaluation of various resampling techniques: Undersampling (TomekLinks), Oversampling (SMOTE), and Hybrid Sampling (SMOTE + Tomek).
Baseline Model Experimentation: Training and evaluation of multiple machine learning models (MLPClassifier, Logistic Regression, RandomForest, XGBoost) across different resampled datasets.
MLflow Integration: Full lifecycle management of machine learning experiments, including tracking of metrics, parameters, artifacts (confusion matrices, ROC curves, PR curves), and model versions.
Hyperparameter Tuning: Optimization of the best-performing model using Optuna to maximize Average Precision.
Model Evaluation & Registration: Rigorous evaluation of the final model on a hold-out test set and registration to the MLflow Model Registry for seamless deployment.
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Prerequisites
Ensure you have Python 3.8+ installed. It is recommended to use a virtual environment.
Installation
Clone the repository:
Install the required Python packages:
Usage
Download the Dataset: The
fraud_detection_notebook.ipynbwill automatically download thecreditcardfraud.zipdataset from Kaggle into thedata/directory. No manual download is required.Run the Jupyter Notebook:
Open
fraud_detection_notebook.ipynbin your browser. Run all cells to execute the entire machine learning workflow, from data loading and preprocessing to model training, evaluation, and MLflow tracking.View MLflow UI: To inspect the logged experiments, models, and artifacts, start the MLflow UI:
Then, open your web browser and navigate to
http://127.0.0.1:5000(or the address displayed in your terminal).
Dataset
The dataset contains credit card transactions made by European cardholders in September 2013. It includes 284,807 transactions, out of which only 492 are fraudulent, making the dataset highly imbalanced (fraud cases represent just 0.172% of the total).
To preserve confidentiality, all features (except Time and Amount) have been transformed using Principal Component Analysis (PCA), resulting in 28 anonymized features labeled V1 to V28.
Time: Seconds elapsed since the first transaction in the dataset.Amount: Transaction value.Class: Target variable, where1indicates fraud and0indicates a legitimate transaction.
Technologies Used
Python: Primary programming language
MLflow: For experiment tracking, model management, and deployment.
Optuna: For hyperparameter optimization.
Scikit-learn: For machine learning models (Logistic Regression, RandomForestClassifier, MLPClassifier), data preprocessing (RobustScaler, train_test_split), and evaluation metrics.
Imbalanced-learn: For handling class imbalance (SMOTE, TomekLinks, SMOTETomek).
XGBoost: For gradient boosting models.
Pandas & NumPy: For data manipulation and numerical operations.
Matplotlib & Seaborn: For data visualization.
