Sibi Krishnamoorthy committed
Commit · 8a08300
Parent(s): 2098ced
prod

This view is limited to 50 files because it contains too many changes.
- .agent/rules/architecture.md +175 -0
- .coverage +0 -0
- .dockerignore +34 -0
- .env.example +23 -0
- .github/workflows/main.yml +49 -0
- .gitignore +17 -0
- .python-version +1 -0
- Dockerfile +68 -0
- README.md +162 -0
- assets/Screenshot_18-1-2026_22489_localhost.jpeg +3 -0
- assets/banner.jpeg +3 -0
- assets/banner_0.jpeg +3 -0
- assets/dashboard_screenshot.png +3 -0
- configs/model_config.yaml +29 -0
- configs/redis_config.yaml +19 -0
- docker-compose.yml +63 -0
- docker/docker-compose.yml +42 -0
- docs/DEVELOPMENT.md +135 -0
- entrypoint.sh +7 -0
- logs/shadow_predictions.jsonl +55 -0
- main.py +6 -0
- models/fraud_model.pkl +3 -0
- models/threshold.json +10 -0
- notebook/.ipynb_checkpoints/Untitled-checkpoint.ipynb +6 -0
- notebook/.ipynb_checkpoints/credit-card-fraud-detection-checkpoint.ipynb +0 -0
- notebook/README.md +122 -0
- notebook/credit-card-fraud-detection.ipynb +0 -0
- notebook/credit-card-fraud-detection.md +0 -0
- notebook/credit-card-fraud-detection.pdf +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_23_0.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_25_1.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_26_1.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_28_1.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_29_1.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_30_0.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_32_0.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_34_1.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_40_1.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_58_0.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_59_0.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_60_0.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_68_0.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_70_0.png +3 -0
- notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_72_1.png +3 -0
- notebook/dataset/download.py +27 -0
- pyproject.toml +81 -0
- scripts/demo_phase1.py +84 -0
- scripts/verify_phase1.py +192 -0
- src/__init__.py +1 -0
- src/api/__init__.py +1 -0
.agent/rules/architecture.md
ADDED
@@ -0,0 +1,175 @@
---
trigger: always_on
---

This file documents the high-level system design, data flow, and infrastructure decisions, serving as a reference for system design reviews and technical interviews.

# 🏗️ System Architecture: Real-Time Fraud Detection

This document outlines the end-to-end architecture of the Fraud Detection System, designed for **low-latency inference (<50ms)**, **high throughput**, and **explainability**.

The system follows a **Lambda Architecture** pattern, splitting workflows into an **Offline Training Layer** (batch processing) and an **Online Serving Layer** (real-time inference).

---

## 1. High-Level Diagram

```mermaid
graph TD
    User[Payment Gateway] -->|POST /predict| API[FastAPI Inference Service]

    subgraph "Online Serving Layer"
        API -->|1. Get Velocity| Redis[(Redis Feature Store)]
        API -->|2. Compute Features| Preproc[Scikit-Learn Pipeline]
        Preproc -->|3. Inference| Model[XGBoost Classifier]
        Model -->|4. Log Prediction| Logger[Async Logger]
    end

    subgraph "Offline Training Layer"
        DW[(Data Warehouse)] -->|Batch Load| Trainer[Training Pipeline]
        Trainer -->|Track Experiments| MLflow[MLflow Server]
        Trainer -->|Save Artifact| Registry[Model Registry]
        Registry -->|Load .pkl| API
    end

    subgraph "Explainability & Monitoring"
        Logger -->|Drift Analysis| Dashboard[Streamlit / Grafana]
        API -->|Request Waterfall| SHAP[SHAP Explainer]
    end
```

---

## 2. Component Breakdown

### 🟢 Online Serving Layer (Real-Time)

The critical path for live transactions, designed for sub-50ms latency.

* **FastAPI Service:** Asynchronous Python web server. Handles request validation (Pydantic) and orchestrates the inference flow.
* **Redis Feature Store:**
  * **Role:** Solves the "Stateful Feature" problem.
  * **Implementation:** Uses Redis Sorted Sets (`ZSET`) to maintain a rolling window of transaction counts for the last 24 hours.
  * **Latency:** <2ms lookups via pipelining.
* **Inference Engine:**
  * **Model:** XGBoost Classifier (serialized via Joblib).
  * **Pipeline:** `ColumnTransformer` (WOEEncoder + RobustScaler) ensures raw input is transformed exactly as it was during training.
  * **Shadow Mode:** A configuration toggle allowing the model to run silently alongside legacy rules for A/B testing.

### 🔵 Offline Training Layer (Batch)

Where models are built, validated, and versioned.

* **Data Ingestion:** Loads historical transaction logs (CSV/Parquet).
* **Feature Engineering:**
  * Calculates Haversine distances.
  * Generates cyclical time encodings (`hour_sin`, `hour_cos`).
  * Computes historical aggregates for the Feature Store.
  * See the notebook markdown at `notebook/credit-card-fraud-detection.md` for more detail.
* **Experiment Tracking (MLflow):**
  * Logs hyperparameters (`max_depth`, `learning_rate`).
  * Logs metrics: Precision, Recall, PR-AUC.
  * Stores the model artifact and the optimized decision threshold (e.g., `0.895`).
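The Haversine-distance and cyclical-time features listed above can be sketched in a few lines. This is an illustrative, self-contained version; the function names are hypothetical and the production code lives in the training modules:

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between cardholder and merchant, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def cyclical_hour(hour: int) -> tuple[float, float]:
    """Encode hour-of-day on the unit circle so 23:00 and 00:00 end up close."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


# Cardholder vs. merchant coordinates taken from the example payloads in this repo
dist = haversine_km(40.7128, -74.0060, 40.7306, -73.9352)
hour_sin, hour_cos = cyclical_hour(23)
```

The cyclical encoding matters because a raw `hour` feature would make 23:00 and 00:00 look maximally far apart, which distorts tree splits on late-night fraud patterns.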

### 🟣 Explainability Layer (Compliance)

Ensures decisions are transparent and auditable.

* **SHAP (SHapley Additive exPlanations):**
  * **Global:** Summary plots to understand model drivers.
  * **Local:** Waterfall plots generated on demand for specific high-risk blocks.
  * **Optimization:** Uses the TreeExplainer with the "JSON Serialization Patch" to handle XGBoost 2.0+ metadata compatibility.

---

## 3. Data Flow Lifecycle

### Step 1: Request Ingestion

The Payment Gateway sends a raw transaction payload:

```json
{
  "user_id": "u8312",
  "amt": 150.00,
  "lat": 40.7128,
  "long": -74.0060,
  "category": "grocery_pos",
  "job": "systems_analyst"
}
```

### Step 2: Real-Time Feature Enrichment

The API queries Redis to fetch dynamic context that isn't in the payload:

* *Query:* `ZCARD user:u8312:tx_history` (transactions in the last 24h)
* *Result:* `5` (velocity)
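The rolling 24-hour count behind this query can be sketched in pure Python to show the ZSET semantics the service relies on: scores are timestamps, expired members are trimmed before counting. This is an illustrative stand-in, not the production Redis client code:

```python
import bisect


class SlidingWindowVelocity:
    """Mimics ZADD + ZREMRANGEBYSCORE + ZCARD on a per-user sorted set."""

    def __init__(self, window_seconds: int = 24 * 3600):
        self.window = window_seconds
        self._scores: dict[str, list[float]] = {}  # user_id -> sorted timestamps

    def record(self, user_id: str, ts: float) -> None:
        # ZADD user:<id>:tx_history <ts> <member>
        bisect.insort(self._scores.setdefault(user_id, []), ts)

    def velocity(self, user_id: str, now: float) -> int:
        scores = self._scores.get(user_id, [])
        cutoff = now - self.window
        # ZREMRANGEBYSCORE user:<id>:tx_history -inf (now - window)
        del scores[:bisect.bisect_left(scores, cutoff)]
        return len(scores)  # ZCARD


win = SlidingWindowVelocity()
for hours_ago in (30, 20, 10, 5, 2, 1):  # six swipes; one falls outside the window
    win.record("u8312", 100_000.0 - hours_ago * 3600)
count = win.velocity("u8312", 100_000.0)  # only the last 24h survive
```

In Redis the same three commands are typically sent in one pipeline, which is where the sub-2ms lookup figure comes from.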

### Step 3: Preprocessing & Inference

The combined vector (raw + Redis features) is passed to the pipeline:

1. **Imputation:** Handle missing values.
2. **Encoding:** `category` -> Weight of Evidence (float).
3. **Scaling:** `amt` -> RobustScaler (centered/scaled).
4. **Prediction:** XGBoost outputs probability `0.982`.
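Weight of Evidence replaces each category with the log-ratio of its share among fraud versus legitimate transactions, computed from training counts. A minimal sketch under one common convention — the smoothing constant and fitting function are illustrative, not the project's actual encoder:

```python
import math
from collections import Counter


def fit_woe(categories: list[str], is_fraud: list[int], eps: float = 0.5) -> dict[str, float]:
    """Map each category to ln(P(cat | fraud) / P(cat | legit)), with additive smoothing."""
    fraud = Counter(c for c, y in zip(categories, is_fraud) if y == 1)
    legit = Counter(c for c, y in zip(categories, is_fraud) if y == 0)
    n_fraud = sum(is_fraud)
    n_legit = len(is_fraud) - n_fraud
    woe = {}
    for cat in set(categories):
        p_fraud = (fraud[cat] + eps) / (n_fraud + eps)
        p_legit = (legit[cat] + eps) / (n_legit + eps)
        woe[cat] = math.log(p_fraud / p_legit)
    return woe


# Toy training sample: misc_net is fraud-heavy, grocery_pos is legit-heavy
cats = ["grocery_pos"] * 6 + ["misc_net"] * 4
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
table = fit_woe(cats, labels)
```

At inference time the encoder is a dictionary lookup, so it adds effectively nothing to the latency budget.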

### Step 4: Decision & Action

* **Threshold Check:** `0.982 > 0.895` → **FRAUD**.
* **Shadow Mode Check:** If enabled, return `APPROVE` but log `[SHADOW_BLOCK]`.
* **Response:**

```json
{
  "status": "BLOCK",
  "risk_score": 0.982,
  "latency_ms": 35
}
```
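The decision logic above reduces to a threshold comparison plus the shadow-mode override. A minimal sketch — the function and field names are illustrative, not the service's actual schema:

```python
def decide(probability: float, threshold: float = 0.895, shadow_mode: bool = False) -> dict:
    """Apply the tuned threshold; in shadow mode, flag the block but approve anyway."""
    would_block = probability > threshold
    if would_block and shadow_mode:
        # In production this path emits an async log line like "[SHADOW_BLOCK] ..."
        return {"status": "APPROVE", "risk_score": probability, "shadow_block": True}
    return {
        "status": "BLOCK" if would_block else "APPROVE",
        "risk_score": probability,
        "shadow_block": False,
    }


live = decide(0.982)                    # blocks, matching the response above
shadow = decide(0.982, shadow_mode=True)  # approved, but flagged for analysis
```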

---

## 4. Deployment Strategy

### Shadow Deployment (Dark Launch)

To mitigate risk, new model versions are deployed in "Shadow Mode" before they can block real money.

1. **Deploy:** Version v2 is deployed to production.
2. **Listen:** It receives 100% of traffic but has **no write permissions** (cannot decline cards).
3. **Compare:** We compare v2's "Virtual Blocks" against v1's "Actual Blocks" and customer complaints.
4. **Promote:** If Precision > 95% and complaints == 0, toggle Shadow Mode **OFF**.

---

## 5. Technology Stack

| Component | Technology | Rationale |
| --- | --- | --- |
| **Language** | Python 3.10+ | Standard for the ML ecosystem. |
| **Package Manager** | astral-uv | An extremely fast Python package and project manager. |
| **API Framework** | FastAPI | Async native, high performance, auto-documentation. |
| **Model** | XGBoost | Best-in-class for tabular fraud data. |
| **Feature Store** | Redis | Sub-millisecond latency for sliding windows. |
| **Container** | Docker | Reproducible environments. |
| **Orchestration** | Docker Compose | Local development and testing. |
| **Tracking** | MLflow | Experiment management and artifact versioning. |
| **Frontend** | Streamlit | Rapid prototyping of analyst dashboards. |

.coverage
ADDED
Binary file (53.2 kB)
.dockerignore
ADDED
@@ -0,0 +1,34 @@
# Git and Version Control
.git
.gitignore
.github

# Python
__pycache__
*.py[cod]
*$py.class
.venv
.env
.pytest_cache
.ruff_cache
.coverage
htmlcov
.ipynb_checkpoints

# Project Data & Artifacts
# models/
logs/
notebooks/
data/
*.csv
*.parquet
*.pkl
# *.json

# UV
# uv.lock

# Docker
Dockerfile
docker-compose.yml
.dockerignore
.env.example
ADDED
@@ -0,0 +1,23 @@
# Environment Variables

# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_DB=0
REDIS_PASSWORD=

# MLflow Configuration
MLFLOW_TRACKING_URI=http://localhost:5000

# Application Settings
ENV=development
LOG_LEVEL=INFO

# Model Settings
MODEL_PATH=models/fraud_model.pkl
THRESHOLD=0.895
SHADOW_MODE=false
ENABLE_EXPLAINABILITY=true

# Performance
MAX_LATENCY_MS=50.0
.github/workflows/main.yml
ADDED
@@ -0,0 +1,49 @@
name: PayShield CI

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest

    services:
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5
        with:
          version: "latest"
          enable-cache: true

      - name: Set up Python
        run: uv python install

      - name: Install the project
        run: uv sync --all-extras --dev

      - name: Run Ruff (Linter)
        run: uv run ruff check .

      - name: Run Ruff (Formatter check)
        run: uv run ruff format --check .

      - name: Run tests with coverage
        run: uv run pytest --cov=src tests/
        env:
          REDIS_HOST: localhost
          REDIS_PORT: 6379
.gitignore
ADDED
@@ -0,0 +1,17 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
.env

.ruff_cache
.pytest_cache
.vscode
mlruns/
notebook/dataset/*.csv
.python-version
ADDED
@@ -0,0 +1 @@
3.12
Dockerfile
ADDED
@@ -0,0 +1,68 @@
# Stage 1: Builder
FROM python:3.12-slim-trixie AS builder

# Install uv by copying the binary from the official image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# Set working directory
WORKDIR /app

# Enable bytecode compilation
ENV UV_COMPILE_BYTECODE=1

# Copy only dependency files first to leverage Docker layer caching
COPY pyproject.toml uv.lock ./

# Disable development dependencies
ENV UV_NO_DEV=1

# Sync the project into a new environment
RUN uv sync --frozen --no-install-project

# Stage 2: Final
FROM python:3.12-slim-trixie

# Set working directory
WORKDIR /app

# Install system dependencies (curl for healthchecks)
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

# Create logs directory
RUN mkdir -p logs

# Create a non-root user for security
# Note: HF Spaces runs as user 1000 by default, so we align with that
RUN useradd -m -u 1000 payshield

# Set PYTHONPATH to include the app directory
ENV PYTHONPATH=/app

# Copy the virtual environment from the builder stage
COPY --from=builder /app/.venv /app/.venv

# Ensure the installed binary is on the `PATH`
ENV PATH="/app/.venv/bin:$PATH"

# Copy the rest of the application code
COPY src /app/src
COPY configs /app/configs
COPY models /app/models
COPY entrypoint.sh /app/entrypoint.sh

# Change ownership to the non-root user
RUN chown -R payshield:payshield /app

# Switch to the non-root user
USER 1000

# Expose ports
# 7860 is the default port for Hugging Face Spaces
EXPOSE 7860 8000

# Set environment variables for HF Spaces
ENV API_URL="http://localhost:8000/v1/predict"
ENV PORT=7860

# Use entrypoint script to run both services
CMD ["/bin/bash", "/app/entrypoint.sh"]
README.md
CHANGED
@@ -0,0 +1,162 @@
# 🛡️ PayShield-ML: Real-Time Fraud Engine

**A production-grade MLOps system providing low-latency fraud detection with stateful feature hydration and human-in-the-loop explainability.**

---

## 🏗️ System Architecture

The system follows a **Lambda Architecture** pattern, decoupling the high-latency model training from the low-latency inference path.

```mermaid
graph TD
    User[Payment Gateway] -->|POST /predict| API[FastAPI Inference Service]

    subgraph "Online Serving Layer (<50ms)"
        API -->|1. Hydrate Velocity| Redis[(Redis Feature Store)]
        API -->|2. Preprocess| Pipeline[Scikit-Learn Pipeline]
        Pipeline -->|3. Predict| Model[XGBoost Classifier]
        Model -->|4. Explain| SHAP[SHAP Engine]
    end

    subgraph "Offline Training Layer"
        Data[(Transactional Data)] -->|Batch Load| Trainer[Training Pipeline]
        Trainer -->|CV & Tuning| MLflow[MLflow Tracking]
        Trainer -->|Export Artifact| Registry[Model Registry]
    end

    API -->|Async Log| ShadowLogs[Shadow Mode Logs]
    API -->|Visualize| Dashboard[Analyst Dashboard]
```

### 🖥️ Analyst Workbench

*Real-time interface for fraud analysts to review blocked transactions and SHAP explanations.*

---

## 🚀 Key Features

### 1. Stateful Feature Store (Redis)
Traditional stateless APIs struggle with "Velocity Features" (e.g., *how many times did this user swipe in 24 hours?*). Our engine utilizes **Redis Sorted Sets (ZSET)** to maintain rolling windows, allowing feature hydration in **<2ms** with $O(\log N)$ complexity.

### 2. Shadow Mode (Dark Launch)
To mitigate the risk of model drift or false positives, the system supports a **Shadow Mode** configuration. The model runs in production, receives real traffic, and logs decisions, but never blocks a transaction. This allows for risk-free A/B testing against legacy rule engines.

### 3. Business-Aligned Optimization
Standard accuracy is misleading in fraud detection due to class imbalance. We implement a **Recall-Constraint Strategy** (target: 80% recall), ensuring the model captures the vast majority of fraud while maintaining a strict upper bound on false-positive rates, as required by financial compliance.

### 4. Cold-Start Handling
Engineered logic handles new users with zero history by defaulting to global medians and "warm-up" priors, preventing the system from unfairly blocking legitimate first-time customers.

---

## 📊 Performance & Metrics

The following metrics were achieved on a hold-out temporal test set (out-of-time validation):

| Metric | Result | Target / Bench |
| :--- | :--- | :--- |
| **PR-AUC** | 0.9245 | Excellent |
| **Precision** | 93.18% | Low false positives |
| **Recall** | 80.06% | Target: 80% |
| **Inference Latency (p95)** | ~30ms | < 50ms |

---

## 🛠️ Quick Start

### Prerequisites
- Docker & Docker Compose
- (Optional) Python 3.12+ (managed via `uv`)

### Installation & Deployment

1. **Clone the Repository**
   ```bash
   git clone https://github.com/Sibikrish3000/realtime-fraud-engine.git
   cd realtime-fraud-engine
   ```

2. **Launch the Stack**
   ```bash
   docker-compose up --build
   ```
   *This starts the FastAPI backend, Redis feature store, and Streamlit dashboard.*

3. **Access the Services**
   - **Analyst Dashboard:** [http://localhost:8501](http://localhost:8501)
   - **API Documentation:** [http://localhost:8000/docs](http://localhost:8000/docs)
   - **Redis Instance:** `localhost:6379`

---

## 📂 Project Structure

```text
src/
├── api/              # FastAPI service, schemas (Pydantic), and config
├── features/         # Redis feature store logic and sliding window constants
├── models/           # Training pipelines, metrics calculation, and XGBoost wrappers
├── frontend/         # Streamlit-based analyst workbench
├── data/             # Data ingestion and cleaning utilities
└── explainability.py # SHAP-based waterfall plots and global importance
```

See the [Development & MLOps Guide](docs/DEVELOPMENT.md) for detailed instructions on training and local development.

---

## 📡 API Reference

### Predict Transaction
`POST /v1/predict`

**Request:**
```bash
curl -X 'POST' \
  'http://localhost:8000/v1/predict' \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "u12345",
    "trans_date_trans_time": "2024-01-20 14:30:00",
    "amt": 150.75,
    "lat": 40.7128,
    "long": -74.0060,
    "merch_lat": 40.7306,
    "merch_long": -73.9352,
    "category": "grocery_pos"
  }'
```

**Response:**
```json
{
  "decision": "APPROVE",
  "probability": 0.12,
  "risk_score": 12.0,
  "latency_ms": 28.5,
  "shadow_mode": false,
  "shap_values": {
    "amt": 0.05,
    "dist": 0.02
  }
}
```

---

## 🗺️ Future Roadmap

- [ ] **Kafka Integration:** Transition to an asynchronous event-driven architecture for high-throughput stream processing.
- [ ] **KServe Deployment:** Migrate from standalone FastAPI to KServe for automated scaling and model versioning.
- [ ] **Graph Features:** Incorporate Neo4j-based features to detect fraud rings and synthetic identities.

---

**Author:** [Sibi Krishnamoorthy](https://github.com/Sibikrish3000)
*Machine Learning Engineer | Fintech & Risk Analytics*
assets/Screenshot_18-1-2026_22489_localhost.jpeg
ADDED
Git LFS Details

assets/banner.jpeg
ADDED
Git LFS Details

assets/banner_0.jpeg
ADDED
Git LFS Details

assets/dashboard_screenshot.png
ADDED
Git LFS Details
configs/model_config.yaml
ADDED
@@ -0,0 +1,29 @@
# Model Training Configuration

model:
  # XGBoost Hyperparameters (Aligned with Notebook)
  n_estimators: 500      # Notebook used 500 trees (script had 100)
  max_depth: 8           # Notebook used depth 8 (script had 6)
  learning_rate: 0.1     # Matches notebook
  subsample: 0.8         # Row sampling for regularization
  colsample_bytree: 0.8  # Column sampling for regularization
  scale_pos_weight: 100  # Will be overridden by the calculated imbalance ratio
  eval_metric: "aucpr"
  random_state: 42

# Training Settings
training:
  test_size: 0.2
  validation_size: 0.1
  random_state: 42

# Decision Threshold
threshold:
  optimal_threshold: 0.9016819596290588  # From notebook PR curve analysis
  min_precision: 0.80                    # Business requirement

# Feature Engineering
features:
  use_cyclical_encoding: true
  use_distance_features: true
  use_temporal_features: true
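The `optimal_threshold` above comes from the notebook's precision–recall curve analysis. The selection rule — take the threshold that maximizes recall while keeping precision above the configured floor — can be sketched in pure Python (the scores and labels below are toy values, not the notebook's data):

```python
def pick_threshold(scores: list[float], labels: list[int], min_precision: float = 0.80):
    """Return (threshold, recall) for the highest-recall threshold meeting the precision floor."""
    best = None
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if precision >= min_precision and (best is None or recall > best[1]):
            best = (t, recall)
    return best


scores = [0.05, 0.10, 0.30, 0.55, 0.70, 0.85, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, recall = pick_threshold(scores, labels)
```

In practice the notebook sweeps the full PR curve from the validation set; this sketch just makes the constraint explicit.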
configs/redis_config.yaml
ADDED
@@ -0,0 +1,19 @@
# Redis Feature Store Configuration

# Connection Settings
host: localhost
port: 6379
db: 0
password: null  # Set via REDIS_PASSWORD env var in production

# Connection Pool
max_connections: 50
socket_connect_timeout: 2  # seconds
socket_timeout: 1          # seconds

# Feature Store Settings
ema_alpha: 0.08   # 2/(24+1) for 24-hour window
key_ttl: 604800   # 7 days in seconds

# Window Configuration
sliding_window_hours: 24
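The `ema_alpha: 0.08` value follows the standard span formula α = 2/(N+1) with N = 24 hourly observations, and feeds a one-line exponential-moving-average update. A minimal sketch (the running-average usage is illustrative, not the feature store's actual code):

```python
N = 24               # hours in the window
alpha = 2 / (N + 1)  # ≈ 0.08, matching ema_alpha in this config


def ema_update(prev_ema: float, new_value: float, a: float = alpha) -> float:
    """Exponential moving average: recent transactions weigh more than old ones."""
    return a * new_value + (1 - a) * prev_ema


ema = 100.0  # running average transaction amount for a user
for amt in (120.0, 80.0, 500.0):  # a spike pulls the EMA up only gradually
    ema = ema_update(ema, amt)
```

Because each update only needs the previous EMA and the new value, the feature store can keep a single float per user instead of the full transaction history.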
docker-compose.yml
ADDED
@@ -0,0 +1,63 @@
version: '3.8'

services:
  redis:
    image: redis:7-alpine
    container_name: payshield-redis
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - payshield-network

  api:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: payshield-api
    depends_on:
      redis:
        condition: service_healthy
    ports:
      - "8000:8000"
    environment:
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - MODEL_PATH=/app/models/fraud_model.pkl
      - THRESHOLD_PATH=/app/models/threshold.json
      - SHADOW_MODE=false
    volumes:
      - ./models:/app/models:ro
      - ./logs:/app/logs
    networks:
      - payshield-network

  # Placeholder for Phase 6 Dashboard
  dashboard:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: payshield-dashboard
    depends_on:
      - api
    command: ["streamlit", "run", "src/frontend/app.py", "--server.port", "8501", "--server.address", "0.0.0.0"]
    ports:
      - "8501:8501"
    environment:
      - API_URL=http://api:8000/v1/predict
    networks:
      - payshield-network
    restart: unless-stopped

networks:
  payshield-network:
    driver: bridge

volumes:
  redis_data:
docker/docker-compose.yml
ADDED
@@ -0,0 +1,42 @@
version: '3.8'

services:
  # Redis Feature Store
  redis:
    image: redis:7-alpine
    container_name: payshield-redis
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - payshield-network

  # MLflow Tracking Server (for future phases)
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.10.0
    container_name: payshield-mlflow
    ports:
      - "5000:5000"
    environment:
      - MLFLOW_BACKEND_STORE_URI=sqlite:///mlflow/mlflow.db
      - MLFLOW_ARTIFACTS_DESTINATION=/mlflow/artifacts
    volumes:
      - mlflow_data:/mlflow
    command: mlflow server --host 0.0.0.0 --port 5000
    networks:
      - payshield-network

volumes:
  redis_data:
  mlflow_data:

networks:
  payshield-network:
    driver: bridge
docs/DEVELOPMENT.md
ADDED
@@ -0,0 +1,135 @@
# 🛠️ Development & MLOps Guide

This guide provides detailed instructions on setting up the environment, running experiments, and developing the PayShield-ML system.

---

## 🏗️ 1. Environment Setup

The project uses `uv` for lightning-fast Python package and project management.

### Prerequisites
- [uv](https://github.com/astral-sh/uv) installed
- Docker & Docker Compose
- Redis (can be run via Docker)

### Installation
```bash
# Sync dependencies and create virtual environment
uv sync

# Activate the environment
source .venv/bin/activate
```

---

## 📊 2. MLflow Tracking

We use MLflow to track hyperparameters, metrics, and model artifacts.

### Start MLflow Server
Run this in a separate terminal to view the UI:
```bash
uv run mlflow ui --host 0.0.0.0 --port 5000
```
Then access the dashboard at [http://localhost:5000](http://localhost:5000).

---

## 🚂 3. Model Training Pipeline

The training script handles data ingestion, feature engineering, cross-validation, and MLflow logging.

### Basic Training
```bash
uv run python src/models/train.py --data_path data/fraud_sample.csv
```

### Advanced Training with Custom Params
```bash
uv run python src/models/train.py \
    --data_path data/fraud_sample.csv \
    --experiment_name fraud_v2 \
    --min_recall 0.85 \
    --output_dir models/
```

| Argument | Description | Default |
| :--- | :--- | :--- |
| `--data_path` | Path to CSV/Parquet training data | (Required) |
| `--params_path` | Path to model config YAML | `configs/model_config.yaml` |
| `--experiment_name` | MLflow experiment grouping | `fraud_detection` |
| `--min_recall` | Target recall for threshold optimization | `0.80` |

---

## 🔌 4. Running Services Locally

For rapid iteration without rebuilding Docker images.

### Start Redis (Required)
```bash
docker run -d --name payshield-redis -p 6379:6379 redis:7-alpine
```

### Start FastAPI Backend
```bash
# Running with hot-reload
uv run uvicorn src.api.main:app --host 0.0.0.0 --port 8000 --reload
```
- **Swagger Docs:** [http://localhost:8000/docs](http://localhost:8000/docs)

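With the backend running, you can smoke-test the prediction endpoint directly. This is only a sketch: the authoritative request schema is the Pydantic model in `src/api/main.py`, and the field names below (`user_id`, `amt`, `category`) are assumptions taken from the shadow-prediction logs.

```shell
# Assumed payload shape -- adjust to the real Pydantic model in src/api/main.py
curl -s -X POST http://localhost:8000/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"user_id": "u12345", "amt": 150.0, "category": "grocery_pos"}'
```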
### Start Streamlit Dashboard
```bash
export API_URL="http://localhost:8000/v1/predict"
uv run streamlit run src/frontend/app.py --server.port 8501
```
- **Dashboard:** [http://localhost:8501](http://localhost:8501)

---

## 🐳 5. Full-Stack Development (Docker Compose)

The easiest way to replicate a production-like environment.

### Build and Launch
```bash
docker-compose up --build
```

### Useful Commands
```bash
# Run in background
docker-compose up -d

# Check all service logs
docker-compose logs -f

# Stop and remove containers
docker-compose down
```

### Service Map
| Service | Port | Endpoint |
| :--- | :--- | :--- |
| **API** | 8000 | `http://localhost:8000` |
| **Dashboard** | 8501 | `http://localhost:8501` |
| **Redis** | 6379 | `localhost:6379` |

---

## 🧪 6. Testing

We use `pytest` for unit and integration tests.

```bash
# Run all tests
uv run pytest

# Run with coverage report
uv run pytest --cov=src
```

---
**Note:** Ensure you have the `models/fraud_model.pkl` and `models/threshold.json` artifacts present before starting the API. These are generated by the training pipeline.
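The threshold artifact mentioned in that note is a small JSON file (see `models/threshold.json` later in this diff). A minimal sketch of how an API process might consume it, assuming the actual loading logic in `src/api/main.py` may differ:

```python
import json
import os
import tempfile

# Write a sample artifact mirroring the shape of models/threshold.json
artifact = {
    "optimal_threshold": 0.8199467062950134,
    "metrics": {"precision": 0.9318, "recall": 0.8006},
}
path = os.path.join(tempfile.mkdtemp(), "threshold.json")
with open(path, "w") as f:
    json.dump(artifact, f)

# At API startup: load the tuned decision threshold once
with open(path) as f:
    THRESHOLD = json.load(f)["optimal_threshold"]

def decide(probability: float) -> str:
    """Map a model probability to the decision logged in shadow mode."""
    return "BLOCK" if probability >= THRESHOLD else "APPROVE"

print(decide(0.9997), decide(2.7e-07))  # -> BLOCK APPROVE
```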
entrypoint.sh
ADDED
@@ -0,0 +1,7 @@
#!/bin/bash

# Start FastAPI in the background
uvicorn src.api.main:app --host 0.0.0.0 --port 8000 &

# Start Streamlit in the foreground
streamlit run src/frontend/app.py --server.port 7860 --server.address 0.0.0.0
logs/shadow_predictions.jsonl
ADDED
@@ -0,0 +1,55 @@
{"timestamp": "2026-01-18T12:37:19.214473Z", "user_id": "u12345", "amt": 150.0, "category": "grocery_pos", "real_decision": "APPROVE", "probability": 9.402751288689615e-07, "latency_ms": 72.96919822692871, "shadow_mode": true}
{"timestamp": "2026-01-18T12:44:56.721604Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.37005043029785, "shadow_mode": true}
{"timestamp": "2026-01-18T12:47:36.067083Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 54.149627685546875, "shadow_mode": true}
{"timestamp": "2026-01-18T12:47:39.609355Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 25.424957275390625, "shadow_mode": true}
{"timestamp": "2026-01-18T12:47:59.207044Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 50.46653747558594, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:01.070087Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 51.763057708740234, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:01.865266Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 29.25252914428711, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:02.489046Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 23.40531349182129, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:03.161965Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 44.34394836425781, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:03.765913Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.629220962524414, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:04.394367Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 27.244091033935547, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:04.986422Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 45.80378532409668, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:05.901805Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 51.35393142700195, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:06.666034Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 51.82218551635742, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:07.345695Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 48.212528228759766, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:07.919900Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 26.331424713134766, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:08.553960Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 41.85080528259277, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:09.265804Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 68.79472732543945, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:09.853970Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.57653045654297, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:10.481686Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 21.60501480102539, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:11.133950Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 50.62055587768555, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:11.734621Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.61753845214844, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:12.292397Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 21.454334259033203, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:12.847376Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 42.4959659576416, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:13.340940Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 47.911882400512695, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:13.923384Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 59.66687202453613, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:14.445859Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 50.5223274230957, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:14.990366Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 24.456501007080078, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:15.709852Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 45.6852912902832, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:15.913553Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 25.067567825317383, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:16.193654Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 44.05331611633301, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:16.345675Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.55793380737305, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:16.590079Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.01242256164551, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:16.789928Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 46.65184020996094, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:16.971762Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 29.859066009521484, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:17.185712Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.96466636657715, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:17.345695Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.675235748291016, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:17.524961Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 40.1458740234375, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:17.683832Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 24.51038360595703, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:17.851388Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 23.29730987548828, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:18.065689Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 47.719478607177734, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:18.381860Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 45.37034034729004, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:18.830235Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.47638511657715, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:19.041672Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 41.09072685241699, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:19.269802Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 37.813663482666016, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:19.570256Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 30.001163482666016, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:19.910120Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 47.04141616821289, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:20.254294Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 52.63257026672363, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:20.520307Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 24.410486221313477, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:20.933840Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 42.56725311279297, "shadow_mode": true}
{"timestamp": "2026-01-18T12:48:58.996285Z", "user_id": "u12345", "amt": 2500.0, "category": "shopping_net", "real_decision": "BLOCK", "probability": 0.999806821346283, "latency_ms": 38.36369514465332, "shadow_mode": true}
{"timestamp": "2026-01-18T14:12:06.614568Z", "user_id": "u12345", "amt": 150.0, "category": "grocery_pos", "real_decision": "APPROVE", "probability": 8.649675874039531e-05, "latency_ms": 86.48848533630371, "shadow_mode": true}
{"timestamp": "2026-01-18T14:15:46.042042Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 7.321957014028158e-07, "latency_ms": 86.42172813415527, "shadow_mode": true}
{"timestamp": "2026-01-18T14:16:12.912011Z", "user_id": "u12345", "amt": 1500.0, "category": "misc_net", "real_decision": "BLOCK", "probability": 0.9997040629386902, "latency_ms": 49.93748664855957, "shadow_mode": true}
{"timestamp": "2026-01-18T14:17:14.040831Z", "user_id": "u12345", "amt": 1500.0, "category": "misc_net", "real_decision": "BLOCK", "probability": 0.9997040629386902, "latency_ms": 43.45536231994629, "shadow_mode": true}
main.py
ADDED
@@ -0,0 +1,6 @@
def main():
    print("Hello from payshield-ml!")


if __name__ == "__main__":
    main()
models/fraud_model.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9be403668e04f1458d5eb5e0dd805495e7f6fa1f322dc8b006dcb2fccc78061d
size 3933261
models/threshold.json
ADDED
@@ -0,0 +1,10 @@
{
  "optimal_threshold": 0.8199467062950134,
  "metrics": {
    "precision": 0.9318377911993098,
    "recall": 0.8005930318754633,
    "f1": 0.861244019138756,
    "pr_auc": 0.924508450881234,
    "threshold_used": 0.8199467062950134
  }
}
notebook/.ipynb_checkpoints/Untitled-checkpoint.ipynb
ADDED
@@ -0,0 +1,6 @@
{
  "cells": [],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 5
}
notebook/.ipynb_checkpoints/credit-card-fraud-detection-checkpoint.ipynb
ADDED
The diff for this file is too large to render. See raw diff
notebook/README.md
ADDED
@@ -0,0 +1,122 @@
# 🛡️ Behavioral Fraud Detection System (XGBoost + SHAP)

**A production-grade machine learning pipeline identifying fraudulent credit card transactions with 97% precision and explainable AI.**

## 💼 Executive Summary & Business Impact

Financial fraud detection is not just about accuracy; it's about the **Precision-Recall trade-off**. A model that flags too many legitimate transactions (False Positives) causes customer churn, while missing fraud (False Negatives) causes direct financial loss.

This project implements a cost-sensitive **XGBoost** classifier engineered to minimize **operational friction**. By optimizing the decision threshold to **0.895**, the system achieved:

| Metric | Performance | Business Value |
| --- | --- | --- |
| **Precision** | **97%** | Only 3% of alerts are false alarms, drastically reducing manual review costs. |
| **Recall** | **80%** | Captures 80% of all fraud attempts (approx. $810k in prevented loss). |
| **Net Savings** | **$810,470** | Calculated ROI on the test set (Loss Prevented - Operational Costs). |
| **Latency** | **<50ms** | Inference speed optimized via `scikit-learn` Pipeline serialization. |

---

## 🏗️ Technical Architecture

The solution moves beyond basic "fit-predict" workflows by implementing a robust preprocessing pipeline designed to prevent **data leakage** and handle extreme class imbalance (0.5% fraud rate).

### 1. Advanced Feature Engineering

Raw transaction data is insufficient for detecting modern fraud. I engineered **14 behavioral features** to capture context:

* **Velocity Metrics (`trans_count_24h`, `amt_to_avg_ratio_24h`)**: Detects "burst" behavior where a card is used rapidly or for amounts exceeding the user's historical norm.
* **Geospatial Analysis (`distance_km`)**: Calculates the Haversine distance between the cardholder's home and the merchant.
* **Cyclical Temporal Encoding (`hour_sin`, `hour_cos`)**: Captures high-risk time windows (e.g., 3 AM surges) while preserving the 24-hour cycle continuity.
* **Risk Profiling (`WOEEncoder`)**: Replaces high-cardinality categorical features (Merchant, Job) with their "Weight of Evidence" - a measure of how much a specific category supports the "Fraud" hypothesis.

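Two of these features can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual implementation, which presumably lives in the feature-engineering pipeline:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) between cardholder home and merchant."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def encode_hour(hour):
    """Cyclical encoding: 23:00 and 00:00 end up close together."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

# One degree of longitude at the equator is ~111 km
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))  # -> 111.2
# 03:00 maps to a distinct point on the unit circle
print(tuple(round(v, 3) for v in encode_hour(3)))   # -> (0.707, 0.707)
```

The sin/cos pair avoids the artificial jump a raw `hour` column would have between 23 and 0, which is exactly the late-night window the model cares about.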
### 2. The Pipeline

To ensure production stability, all steps are wrapped in a single Scikit-Learn `Pipeline`:

```python
pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('cat', WOEEncoder(), ['job', 'category']),
        ('num', RobustScaler(), numerical_features)
    ])),
    ('classifier', XGBClassifier(scale_pos_weight=imbalance_ratio, ...))
])
```

---

## 📊 Model Performance

### Precision-Recall Strategy

Instead of optimizing for ROC-AUC (which can be misleading in imbalanced datasets), I optimized for **PR-AUC (0.998)**.

* **Default Threshold (0.50):** Precision was 65%. Too many false alarms.
* **Optimized Threshold (0.895):** Precision increased to **97%**, with minimal loss in Recall.

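The threshold sweep behind that trade-off can be sketched as follows. This is a toy illustration with synthetic scores, under the assumption that the notebook picks the highest threshold that still meets a recall floor:

```python
def pick_threshold(y_true, y_prob, min_recall=0.80):
    """Return the highest threshold whose recall stays >= min_recall,
    along with the precision and recall achieved there."""
    best = None
    for t in sorted(set(y_prob)):
        preds = [int(p >= t) for p in y_prob]
        tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        if recall >= min_recall:
            best = (t, precision, recall)  # keeps climbing while recall holds
    return best

# Synthetic example: fraud cases score high, one legit case scores mid-range
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_prob = [0.01, 0.02, 0.05, 0.10, 0.60, 0.55, 0.90, 0.92, 0.95, 0.99]
print(pick_threshold(y_true, y_prob, min_recall=0.8))  # -> (0.9, 1.0, 0.8)
```

Raising the threshold sheds the mid-range false positive while four of the five fraud cases still clear it, which is the same mechanism that moves precision from 65% to 97% above.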
*(Insert your Precision-Recall Curve image here)*

---

## 🔍 Explainability & "The Why"

Black-box models are dangerous in finance. I implemented **SHAP (SHapley Additive exPlanations)** to provide reason codes for every decision.

### The "Smoking Gun" (Fraud Example)

For a transaction flagged with **99.9% confidence**, the SHAP Waterfall plot reveals the exact drivers:

1. **`amt_log` (+8.83)**: The transaction amount was significantly higher than normal.
2. **`hour_sin` (+2.46)**: The transaction occurred during a high-risk time window (late night).
3. **`job` (+1.72)**: The cardholder's profession falls into a statistically higher-risk segment.
4. **`amt_to_avg_ratio_24h` (+1.24)**: The amount was an outlier *specifically* for this user's 24-hour history.

*(Insert your SHAP Waterfall Plot image here)*

---

## 🚀 How to Run

### Prerequisites

```bash
pip install pandas xgboost category_encoders shap scikit-learn
```

### Inference

The model is serialized as a `.pkl` file. You can load it to predict on new data immediately without re-training.

```python
import joblib
import pandas as pd

# Load the production pipeline
model = joblib.load('fraud_detection_model_v1.pkl')

# Define a new transaction (Example)
new_transaction = pd.DataFrame([{
    'amt_log': 5.2,
    'distance_km': 120.5,
    'trans_count_24h': 12,
    'amt_to_avg_ratio_24h': 4.5,
    # ... include all 14 features
}])

# Get prediction (Probability of Fraud)
fraud_prob = model.predict_proba(new_transaction)[:, 1]
is_fraud = (fraud_prob >= 0.895).astype(int)

print(f"Fraud Probability: {fraud_prob[0]:.4f}")
print(f"Action: {'BLOCK' if is_fraud[0] else 'APPROVE'}")
```

---

### **Author**

**Sibi Krishnamoorthy** *Machine Learning Engineer | Fintech & Risk Analytics*
notebook/credit-card-fraud-detection.ipynb
ADDED
The diff for this file is too large to render. See raw diff
notebook/credit-card-fraud-detection.md
ADDED
The diff for this file is too large to render. See raw diff
notebook/credit-card-fraud-detection.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1f604362aa5035ae24f4b5b1dcd4d67926ee9246030eba352eb694ca05fd18aa
size 817131
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_23_0.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_25_1.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_26_1.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_28_1.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_29_1.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_30_0.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_32_0.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_34_1.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_40_1.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_58_0.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_59_0.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_60_0.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_68_0.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_70_0.png
ADDED
Git LFS Details
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_72_1.png
ADDED
Git LFS Details
notebook/dataset/download.py
ADDED
@@ -0,0 +1,27 @@
#!/usr/bin/env -S uv run --script
#
# /// script
# dependencies = [
#     "pandas",
#     "rich",
# ]
# ///
import os
import zipfile

path = os.path.join(os.path.dirname(__file__), "dataset")

def download_data():
    os.system(f"curl -L -o {path}/fraud-detection.zip https://www.kaggle.com/api/v1/datasets/download/kartik2112/fraud-detection")

    with zipfile.ZipFile(f"{path}/fraud-detection.zip", 'r') as zip_ref:
        zip_ref.extractall(f"{path}/")

def combine_data():
    import pandas as pd
    train = pd.read_csv(f"{path}/fraud-detection/fraudTrain.csv")
    test = pd.read_csv(f"{path}/fraud-detection/fraudTest.csv")
    df = pd.concat([train, test], axis=0).reset_index(drop=True)
    df.to_csv(f"{path}/fraud.csv", index=False)

if __name__ == "__main__":
    download_data()
    combine_data()
pyproject.toml
ADDED
@@ -0,0 +1,81 @@
[project]
name = "payshield-ml"
version = "0.1.0"
description = "Real-Time Financial Fraud Detection System with sub-50ms latency"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    # ML Stack
    "xgboost>=2.0.0",
    "scikit-learn>=1.4.0",
    "shap>=0.50.0",  # Use latest stable version compatible with Python 3.12
    "mlflow>=2.10.0",
    # API Framework
    "fastapi>=0.109.0",
    "uvicorn[standard]>=0.27.0",
    "pydantic>=2.5.0",
    # Data Processing
    "pandas>=2.2.0",
    "numpy>=2.0.0",
    "pyarrow>=15.0.0",
    # Feature Store
    "redis>=5.0.0",
    "hiredis>=2.3.0",
    # Utilities
    "python-dotenv>=1.0.0",
    "pyyaml>=6.0.1",
    "joblib>=1.5.0",
    "category-encoders>=2.6.0",
    "seaborn>=0.13.2",
    "matplotlib>=3.10.8",
    "pydantic-settings>=2.12.0",
    "streamlit>=1.53.0",
    "plotly>=6.5.2",
    "requests>=2.32.5",
    "typing-extensions>=4.15.0",
]

[project.optional-dependencies]
dev = [
    # Testing
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "pytest-cov>=4.1.0",

    # Type Checking & Linting
    "mypy>=1.8.0",
    "ruff>=0.1.0",

    # Notebook Support
    "jupyter>=1.0.0",
    "ipykernel>=6.29.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"
python_classes = "Test*"
python_functions = "test_*"
addopts = "-v --cov=src --cov-report=term-missing"

[tool.mypy]
python_version = "3.12"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true

[tool.ruff]
line-length = 100
target-version = "py312"

[tool.hatch.build.targets.wheel]
packages = ["src"]

[dependency-groups]
dev = [
    "ruff>=0.14.13",
]
scripts/demo_phase1.py
ADDED
@@ -0,0 +1,84 @@
"""
Example script demonstrating the data ingestion and feature store usage.

This shows how to use the modules we just built.
"""

from src.data.ingest import load_dataset, InferenceTransactionSchema
from src.features.store import RedisFeatureStore


def main():
    """Demonstrate basic usage of Phase 1 modules."""

    print("=== PayShield-ML Phase 1 Demo ===\n")

    # 1. Data Validation Example
    print("1. Testing Data Validation...")
    try:
        valid_transaction = InferenceTransactionSchema(
            user_id="u12345",
            amt=150.00,
            lat=40.7128,
            long=-74.0060,
            category="grocery_pos",
            job="Engineer, biomedical",
            merch_lat=40.7200,
            merch_long=-74.0100,
            unix_time=1234567890,
        )
        print(f"   ✓ Valid transaction: ${valid_transaction.amt:.2f}")
    except Exception as e:
        print(f"   ✗ Validation failed: {e}")

    # 2. Feature Store Example
    print("\n2. Testing Feature Store...")
    try:
        store = RedisFeatureStore(host="localhost", port=6379, db=0)

        # Check health
        health = store.health_check()
        print(f"   ✓ Redis connection: {health['status']} (ping: {health['ping_ms']}ms)")

        # Add some transactions
        user_id = "demo_user_001"
        import time

        base_time = int(time.time())

        print(f"\n   Adding 3 transactions for {user_id}...")
        for i, amount in enumerate([50.00, 75.00, 100.00]):
            store.add_transaction(
                user_id=user_id,
                amount=amount,
                timestamp=base_time + (i * 3600),  # 1 hour apart
            )
            print(f"     Transaction {i + 1}: ${amount:.2f}")

        # Get features
        features = store.get_features(user_id, current_timestamp=base_time + 7200)
        print(f"\n   Features computed:")
        print(f"     - Transaction count (24h): {features['trans_count_24h']:.0f}")
        print(f"     - Average spend (24h): ${features['avg_spend_24h']:.2f}")

        # Get history
        history = store.get_transaction_history(user_id, lookback_hours=24)
        print(f"\n   Transaction history ({len(history)} transactions):")
        for ts, amt in history[:3]:
            print(f"     - Timestamp {ts}: ${amt:.2f}")

        # Cleanup
        deleted = store.delete_user_data(user_id)
        print(f"\n   ✓ Cleaned up {deleted} keys")

    except Exception as e:
        print(f"   ✗ Feature store error: {e}")
        print(
            "   Make sure Redis is running: docker-compose -f docker/docker-compose.yml up -d redis"
        )

    print("\n=== Demo Complete ===")


if __name__ == "__main__":
    main()
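The demo treats `store.get_features` as a black box that returns 24-hour aggregates. The Redis implementation lives in `src/features/store.py` and is not shown in this diff, so the following is only a minimal in-memory sketch of the assumed semantics — a trailing 24-hour window over `(timestamp, amount)` pairs:

```python
WINDOW_SECONDS = 24 * 3600


def compute_window_features(history: list[tuple[int, float]], now: int) -> dict[str, float]:
    """Aggregate transactions inside the trailing 24-hour window ending at `now`."""
    recent = [amt for ts, amt in history if now - WINDOW_SECONDS <= ts <= now]
    count = len(recent)
    return {
        "trans_count_24h": float(count),
        "avg_spend_24h": sum(recent) / count if count else 0.0,
    }
```

With the demo's three transactions spaced one hour apart, all fall inside the window, so the count is 3 and the average is the mean of the three amounts.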
scripts/verify_phase1.py
ADDED
@@ -0,0 +1,192 @@
#!/usr/bin/env python3
"""
Quick verification script for Phase 1 implementation.
Tests imports and basic module functionality without requiring Redis.
"""


def test_imports():
    """Test that all Phase 1 modules can be imported."""
    print("Testing imports...")

    try:
        from src.data.ingest import TransactionSchema, InferenceTransactionSchema, load_dataset

        print("  ✓ Data ingestion module imports successfully")
    except ImportError as e:
        print(f"  ✗ Data ingestion import failed: {e}")
        return False

    try:
        from src.features.store import RedisFeatureStore

        print("  ✓ Feature store module imports successfully")
    except ImportError as e:
        print(f"  ✗ Feature store import failed: {e}")
        return False

    try:
        from src.features.constants import category_names, job_names

        print(
            f"  ✓ Constants module loaded ({len(category_names)} categories, {len(job_names)} jobs)"
        )
    except ImportError as e:
        print(f"  ✗ Constants import failed: {e}")
        return False

    return True


def test_pydantic_validation():
    """Test Pydantic validation logic."""
    print("\nTesting Pydantic validation...")

    from src.data.ingest import TransactionSchema, InferenceTransactionSchema
    from pydantic import ValidationError

    # Test valid transaction
    try:
        valid_data = {
            "trans_date_trans_time": "2019-01-01 00:00:18",
            "cc_num": 2703186189652095,
            "merchant": "Test Merchant",
            "category": "misc_net",
            "amt": 100.50,
            "first": "John",
            "last": "Doe",
            "gender": "M",
            "street": "123 Main St",
            "city": "Springfield",
            "state": "IL",
            "zip": 62701,
            "lat": 39.7817,
            "long": -89.6501,
            "city_pop": 100000,
            "job": "Engineer, biomedical",
            "dob": "1990-01-01",
            "trans_num": "abc123",
            "unix_time": 1325376018,
            "merch_lat": 39.7900,
            "merch_long": -89.6600,
            "is_fraud": 0,
        }
        schema = TransactionSchema(**valid_data)
        print(f"  ✓ Valid transaction accepted (amt: ${schema.amt:.2f})")
    except ValidationError as e:
        print(f"  ✗ Valid transaction rejected: {e}")
        return False

    # Test invalid amount
    try:
        invalid_data = valid_data.copy()
        invalid_data["amt"] = -50.00
        schema = TransactionSchema(**invalid_data)
        print("  ✗ Negative amount not rejected!")
        return False
    except ValidationError:
        print("  ✓ Negative amount correctly rejected")

    # Test invalid coordinates
    try:
        invalid_data = valid_data.copy()
        invalid_data["lat"] = 200.0
        schema = TransactionSchema(**invalid_data)
        print("  ✗ Invalid latitude not rejected!")
        return False
    except ValidationError:
        print("  ✓ Invalid latitude correctly rejected")

    # Test inference schema
    try:
        inference_data = {
            "user_id": "u12345",
            "amt": 150.00,
            "lat": 40.7128,
            "long": -74.0060,
            "category": "grocery_pos",
            "job": "Engineer, biomedical",
            "merch_lat": 40.7200,
            "merch_long": -74.0100,
            "unix_time": 1234567890,
        }
        schema = InferenceTransactionSchema(**inference_data)
        print(f"  ✓ Inference schema works (user: {schema.user_id})")
    except ValidationError as e:
        print(f"  ✗ Inference schema failed: {e}")
        return False

    return True


def test_constants():
    """Test that constants are properly loaded."""
    print("\nTesting constants...")

    from src.features.constants import category_names, job_names

    # Check categories
    expected_categories = ["misc_net", "gas_transport", "grocery_pos", "entertainment"]
    for cat in expected_categories:
        if cat not in category_names:
            print(f"  ✗ Missing category: {cat}")
            return False
    print(f"  ✓ All expected categories present ({len(category_names)} total)")

    # Check jobs
    expected_jobs = ["Engineer, biomedical", "Data scientist", "Mechanical engineer"]
    found_jobs = [j for j in expected_jobs if j in job_names]
    print(
        f"  ✓ Job list loaded ({len(job_names)} jobs, {len(found_jobs)}/{len(expected_jobs)} test jobs found)"
    )

    return True


def main():
    """Run all verification tests."""
    print("=" * 60)
    print("Phase 1 Verification Script")
    print("=" * 60)

    tests = [
        ("Imports", test_imports),
        ("Pydantic Validation", test_pydantic_validation),
        ("Constants", test_constants),
    ]

    results = []
    for name, test_func in tests:
        try:
            result = test_func()
            results.append((name, result))
        except Exception as e:
            print(f"\n  ✗ {name} test crashed: {e}")
            results.append((name, False))

    print("\n" + "=" * 60)
    print("Summary")
    print("=" * 60)

    for name, passed in results:
        status = "✓ PASS" if passed else "✗ FAIL"
        print(f"{status:>10} - {name}")

    all_passed = all(r[1] for r in results)

    print("\n" + "=" * 60)
    if all_passed:
        print("✅ All verification tests passed!")
        print("\nPhase 1 implementation is working correctly.")
        print("\nNext steps:")
        print("  1. Install dependencies: uv sync")
        print("  2. Start Redis: docker-compose -f docker/docker-compose.yml up -d redis")
        print("  3. Run full test suite: pytest tests/ -v")
        print("  4. Try demo: python scripts/demo_phase1.py")
    else:
        print("❌ Some tests failed. Check the output above.")
    print("=" * 60)


if __name__ == "__main__":
    main()
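The negative-amount and out-of-range-latitude cases above exercise constraints declared on the Pydantic schemas in `src/data/ingest.py`, which this diff does not show. As an illustration of the two rules being probed, here is a stdlib-only sketch — the class and field set are hypothetical stand-ins, not the project's schema:

```python
from dataclasses import dataclass


@dataclass
class MiniTransaction:
    """Stdlib stand-in illustrating two of the checks the verification script probes."""

    amt: float
    lat: float

    def __post_init__(self) -> None:
        # Mirrors a "positive amount" constraint
        if self.amt <= 0:
            raise ValueError(f"amt must be positive, got {self.amt}")
        # Mirrors a "valid latitude" constraint
        if not -90.0 <= self.lat <= 90.0:
            raise ValueError(f"lat out of range [-90, 90]: {self.lat}")
```

Pydantic raises `ValidationError` for such violations; this sketch raises `ValueError`, but the same invalid inputs are rejected.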
src/__init__.py
ADDED
@@ -0,0 +1 @@
# Empty __init__.py for src package
src/api/__init__.py
ADDED
@@ -0,0 +1 @@
# API module __init__