Sibi Krishnamoorthy committed on
Commit 8a08300
1 Parent(s): 2098ced
This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .agent/rules/architecture.md +175 -0
  2. .coverage +0 -0
  3. .dockerignore +34 -0
  4. .env.example +23 -0
  5. .github/workflows/main.yml +49 -0
  6. .gitignore +17 -0
  7. .python-version +1 -0
  8. Dockerfile +68 -0
  9. README.md +162 -0
  10. assets/Screenshot_18-1-2026_22489_localhost.jpeg +3 -0
  11. assets/banner.jpeg +3 -0
  12. assets/banner_0.jpeg +3 -0
  13. assets/dashboard_screenshot.png +3 -0
  14. configs/model_config.yaml +29 -0
  15. configs/redis_config.yaml +19 -0
  16. docker-compose.yml +63 -0
  17. docker/docker-compose.yml +42 -0
  18. docs/DEVELOPMENT.md +135 -0
  19. entrypoint.sh +7 -0
  20. logs/shadow_predictions.jsonl +55 -0
  21. main.py +6 -0
  22. models/fraud_model.pkl +3 -0
  23. models/threshold.json +10 -0
  24. notebook/.ipynb_checkpoints/Untitled-checkpoint.ipynb +6 -0
  25. notebook/.ipynb_checkpoints/credit-card-fraud-detection-checkpoint.ipynb +0 -0
  26. notebook/README.md +122 -0
  27. notebook/credit-card-fraud-detection.ipynb +0 -0
  28. notebook/credit-card-fraud-detection.md +0 -0
  29. notebook/credit-card-fraud-detection.pdf +3 -0
  30. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_23_0.png +3 -0
  31. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_25_1.png +3 -0
  32. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_26_1.png +3 -0
  33. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_28_1.png +3 -0
  34. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_29_1.png +3 -0
  35. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_30_0.png +3 -0
  36. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_32_0.png +3 -0
  37. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_34_1.png +3 -0
  38. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_40_1.png +3 -0
  39. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_58_0.png +3 -0
  40. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_59_0.png +3 -0
  41. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_60_0.png +3 -0
  42. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_68_0.png +3 -0
  43. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_70_0.png +3 -0
  44. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_72_1.png +3 -0
  45. notebook/dataset/download.py +27 -0
  46. pyproject.toml +81 -0
  47. scripts/demo_phase1.py +84 -0
  48. scripts/verify_phase1.py +192 -0
  49. src/__init__.py +1 -0
  50. src/api/__init__.py +1 -0
.agent/rules/architecture.md ADDED
@@ -0,0 +1,175 @@
+ ---
+ trigger: always_on
+ ---
+
+ This file documents the high-level system design, data flow, and infrastructure decisions, and serves as a reference for system design reviews.
+
+ ---
+
+ # 🏗️ System Architecture: Real-Time Fraud Detection
+
+ This document outlines the end-to-end architecture of the Fraud Detection System, designed for **low-latency inference (<50ms)**, **high throughput**, and **explainability**.
+
+ The system follows a **Lambda Architecture** pattern, splitting workflows into an **Offline Training Layer** (batch processing) and an **Online Serving Layer** (real-time inference).
+
+ ---
+
+ ## 1. High-Level Diagram
+
+ ```mermaid
+ graph TD
+     User[Payment Gateway] -->|POST /predict| API[FastAPI Inference Service]
+
+     subgraph "Online Serving Layer"
+         API -->|1. Get Velocity| Redis[(Redis Feature Store)]
+         API -->|2. Compute Features| Preproc[Scikit-Learn Pipeline]
+         Preproc -->|3. Inference| Model[XGBoost Classifier]
+         Model -->|4. Log Prediction| Logger[Async Logger]
+     end
+
+     subgraph "Offline Training Layer"
+         DW[(Data Warehouse)] -->|Batch Load| Trainer[Training Pipeline]
+         Trainer -->|Track Experiments| MLflow[MLflow Server]
+         Trainer -->|Save Artifact| Registry[Model Registry]
+         Registry -->|Load .pkl| API
+     end
+
+     subgraph "Explainability & Monitoring"
+         Logger -->|Drift Analysis| Dashboard[Streamlit / Grafana]
+         API -->|Request Waterfall| SHAP[SHAP Explainer]
+     end
+
+ ```
+
+ ---
+
+ ## 2. Component Breakdown
+
+ ### 🟢 Online Serving Layer (Real-Time)
+
+ The critical path for live transactions, designed for sub-50ms latency.
+
+ * **FastAPI Service:** Asynchronous Python web server. Handles request validation (Pydantic) and orchestrates the inference flow.
+ * **Redis Feature Store:**
+     * **Role:** Solves the "Stateful Feature" problem.
+     * **Implementation:** Uses Redis Sorted Sets (`ZSET`) to maintain a rolling window of transaction counts for the last 24 hours.
+     * **Latency:** <2ms lookups via pipelining.
+
+
+ * **Inference Engine:**
+     * **Model:** XGBoost Classifier (serialized via Joblib).
+     * **Pipeline:** `ColumnTransformer` (WOEEncoder + RobustScaler) ensures raw input is transformed exactly as it was during training.
+     * **Shadow Mode:** A configuration toggle allowing the model to run silently alongside legacy rules for A/B testing.
+
+
+
+ ### 🔵 Offline Training Layer (Batch)
+
+ Where models are built, validated, and versioned.
+
+ * **Data Ingestion:** Loads historical transaction logs (CSV/Parquet).
+ * **Feature Engineering:**
+     * Calculates Haversine distances.
+     * Generates cyclical time encodings (`hour_sin`, `hour_cos`).
+     * Computes historical aggregates for the Feature Store.
+     * See the notebook markdown at `notebook/credit-card-fraud-detection.md` for details.
+
+
+ * **Experiment Tracking (MLflow):**
+     * Logs hyperparameters (`max_depth`, `learning_rate`).
+     * Logs metrics: Precision, Recall, PR-AUC.
+     * Stores the model artifact and the optimized decision threshold (e.g., `0.895`).
+
+
+
+ ### 🟣 Explainability Layer (Compliance)
+
+ Ensures decisions are transparent and auditable.
+
+ * **SHAP (SHapley Additive exPlanations):**
+     * **Global:** Summary plots to understand model drivers.
+     * **Local:** Waterfall plots generated on-demand for specific high-risk blocks.
+     * **Optimization:** Uses the TreeExplainer with the "JSON Serialization Patch" to handle XGBoost 2.0+ metadata compatibility.
+
+
+
+ ---
+
+ ## 3. Data Flow Lifecycle
+
+ ### Step 1: Request Ingestion
+
+ The Payment Gateway sends a raw transaction payload:
+
+ ```json
+ {
+   "user_id": "u8312",
+   "amt": 150.00,
+   "lat": 40.7128,
+   "long": -74.0060,
+   "category": "grocery_pos",
+   "job": "systems_analyst"
+ }
+
+ ```
+
+ ### Step 2: Real-Time Feature Enrichment
+
+ The API queries Redis to fetch dynamic context that isn't in the payload:
+
+ * *Query:* `ZCARD user:u8312:tx_history` (Transactions in last 24h)
+ * *Result:* `5` (Velocity)
+
+ ### Step 3: Preprocessing & Inference
+
+ The combined vector (Raw + Redis Features) is passed to the Pipeline:
+
+ 1. **Imputation:** Handle missing values.
+ 2. **Encoding:** `category` -> Weight of Evidence (Float).
+ 3. **Scaling:** `amt` -> RobustScaler (centered/scaled).
+ 4. **Prediction:** XGBoost outputs probability `0.982`.
+
+ ### Step 4: Decision & Action
+
+ * **Threshold Check:** `0.982 > 0.895` → **FRAUD**.
+ * **Shadow Mode Check:** If enabled, return `APPROVE` but log `[SHADOW_BLOCK]`.
+ * **Response:**
+ ```json
+ {
+   "status": "BLOCK",
+   "risk_score": 0.982,
+   "latency_ms": 35
+ }
+
+ ```
+
+
+
+ ---
+
+ ## 4. Deployment Strategy
+
+ ### Shadow Deployment (Dark Launch)
+
+ To mitigate risk, new model versions are deployed in "Shadow Mode" before they are allowed to block real money.
+
+ 1. **Deploy:** Version v2 is deployed to production.
+ 2. **Listen:** It receives 100% of traffic but has **no write permissions** (cannot decline cards).
+ 3. **Compare:** We compare v2's "Virtual Blocks" against v1's "Actual Blocks" and customer complaints.
+ 4. **Promote:** If Precision > 95% and complaints == 0, toggle Shadow Mode **OFF**.
+
+ ---
+
+ ## 5. Technology Stack
+
+ | Component | Technology | Rationale |
+ | --- | --- | --- |
+ | **Language** | Python 3.10+ | Standard for ML ecosystem. |
+ | **Package Manager** | uv (Astral) | Extremely fast Python package and project manager. |
+ | **API Framework** | FastAPI | Async native, high performance, auto-documentation. |
+ | **Model** | XGBoost | Best-in-class for tabular fraud data. |
+ | **Feature Store** | Redis | Sub-millisecond latency for sliding windows. |
+ | **Container** | Docker | Reproducible environments. |
+ | **Orchestration** | Docker Compose | Local development and testing. |
+ | **Tracking** | MLflow | Experiment management and artifact versioning. |
+ | **Frontend** | Streamlit | Rapid prototyping of analyst dashboards. |
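The `ZSET` sliding-window velocity mechanism described in the architecture above can be sketched with a small in-memory stand-in. This is illustrative only — the real implementation lives in `src/features/`, which is not shown in this diff, and the class and method names here are hypothetical. Comments note the equivalent Redis commands a client such as redis-py would issue.

```python
import time

WINDOW_SECONDS = 24 * 3600  # 24-hour sliding window, matching sliding_window_hours


class VelocityStore:
    """In-memory stand-in for the Redis ZSET sliding-window counter."""

    def __init__(self):
        # key -> {member: score(timestamp)}, mirroring a Redis sorted set
        self._zsets = {}

    def record_tx(self, user_id, tx_id, ts=None):
        ts = ts if ts is not None else time.time()
        key = f"user:{user_id}:tx_history"
        # Redis equivalent: ZADD user:<id>:tx_history <ts> <tx_id>
        self._zsets.setdefault(key, {})[tx_id] = ts
        # Redis equivalent: ZREMRANGEBYSCORE key 0 (<ts> - window)
        cutoff = ts - WINDOW_SECONDS
        self._zsets[key] = {m: s for m, s in self._zsets[key].items() if s > cutoff}

    def velocity(self, user_id):
        # Redis equivalent: ZCARD user:<id>:tx_history
        return len(self._zsets.get(f"user:{user_id}:tx_history", {}))


store = VelocityStore()
now = 1_700_000_000
store.record_tx("u8312", "tx1", now - 90_000)  # older than 24h
store.record_tx("u8312", "tx2", now - 60)      # recent; recording it prunes tx1
print(store.velocity("u8312"))  # 1
```

In production the prune and count would run in a single Redis pipeline, which is what makes the <2ms lookup figure plausible.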
.coverage ADDED
Binary file (53.2 kB).
 
.dockerignore ADDED
@@ -0,0 +1,34 @@
+ # Git and Version Control
+ .git
+ .gitignore
+ .github
+
+ # Python
+ __pycache__
+ *.py[cod]
+ *$py.class
+ .venv
+ .env
+ .pytest_cache
+ .ruff_cache
+ .coverage
+ htmlcov
+ .ipynb_checkpoints
+
+ # Project Data & Artifacts
+ # models/
+ logs/
+ notebooks/
+ data/
+ *.csv
+ *.parquet
+ *.pkl
+ # *.json
+
+ # UV
+ # uv.lock
+
+ # Docker
+ Dockerfile
+ docker-compose.yml
+ .dockerignore
.env.example ADDED
@@ -0,0 +1,23 @@
+ # Environment Variables
+
+ # Redis Configuration
+ REDIS_HOST=localhost
+ REDIS_PORT=6379
+ REDIS_DB=0
+ REDIS_PASSWORD=
+
+ # MLflow Configuration
+ MLFLOW_TRACKING_URI=http://localhost:5000
+
+ # Application Settings
+ ENV=development
+ LOG_LEVEL=INFO
+
+ # Model Settings
+ MODEL_PATH=models/fraud_model.pkl
+ THRESHOLD=0.895
+
+ shadow_mode=false
+ enable_explainability=true
+ # Performance
+ max_latency_ms=50.0
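As a rough illustration of how the variables in `.env.example` might be consumed: the project's actual settings loader is not shown in this diff, so `load_settings` below is a hypothetical helper (a pydantic `Settings` class would be the more idiomatic choice in a FastAPI service).

```python
import os


def load_settings(env=None):
    """Minimal sketch of reading the variables from .env.example.

    Variable names and defaults mirror the example file; the lowercase
    names (shadow_mode, max_latency_ms) are read as-is.
    """
    env = os.environ if env is None else env
    return {
        "threshold": float(env.get("THRESHOLD", "0.895")),
        "shadow_mode": env.get("shadow_mode", "false").lower() == "true",
        "max_latency_ms": float(env.get("max_latency_ms", "50.0")),
    }


cfg = load_settings({"THRESHOLD": "0.9", "shadow_mode": "true"})
print(cfg)  # {'threshold': 0.9, 'shadow_mode': True, 'max_latency_ms': 50.0}
```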
.github/workflows/main.yml ADDED
@@ -0,0 +1,49 @@
+ name: PayShield CI
+
+ on:
+   push:
+     branches: [ "master" ]
+   pull_request:
+     branches: [ "master" ]
+
+ jobs:
+   lint-and-test:
+     runs-on: ubuntu-latest
+
+     services:
+       redis:
+         image: redis:7-alpine
+         ports:
+           - 6379:6379
+         options: >-
+           --health-cmd "redis-cli ping"
+           --health-interval 10s
+           --health-timeout 5s
+           --health-retries 5
+
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Install uv
+         uses: astral-sh/setup-uv@v5
+         with:
+           version: "latest"
+           enable-cache: true
+
+       - name: Set up Python
+         run: uv python install
+
+       - name: Install the project
+         run: uv sync --all-extras --dev
+
+       - name: Run Ruff (Linter)
+         run: uv run ruff check .
+
+       - name: Run Ruff (Formatter check)
+         run: uv run ruff format --check .
+
+       - name: Run tests with coverage
+         run: uv run pytest --cov=src tests/
+         env:
+           REDIS_HOST: localhost
+           REDIS_PORT: 6379
.gitignore ADDED
@@ -0,0 +1,17 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
+ .env
+
+ .ruff_cache
+ .pytest_cache
+ .vscode
+ mlruns/
+ notebook/dataset/*.csv
.python-version ADDED
@@ -0,0 +1 @@
+ 3.12
Dockerfile ADDED
@@ -0,0 +1,68 @@
+ # Stage 1: Builder
+ FROM python:3.12-slim-trixie AS builder
+
+ # Install uv by copying the binary from the official image
+ COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+
+ # Set working directory
+ WORKDIR /app
+
+ # Enable bytecode compilation
+ ENV UV_COMPILE_BYTECODE=1
+
+ # Copy only dependency files first to leverage Docker layer caching
+ COPY pyproject.toml uv.lock ./
+
+ # Disable development dependencies
+ ENV UV_NO_DEV=1
+
+ # Sync the project into a new environment
+ RUN uv sync --frozen --no-install-project
+
+ # Stage 2: Final
+ FROM python:3.12-slim-trixie
+
+ # Set working directory
+ WORKDIR /app
+
+ # Install system dependencies (curl for healthchecks)
+ RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
+
+ # Create logs directory
+ RUN mkdir -p logs
+
+ # Create a non-root user for security
+ # Note: HF Spaces runs as user 1000 by default, so we align with that
+ RUN useradd -m -u 1000 payshield
+
+ # Set PYTHONPATH to include the app directory
+ ENV PYTHONPATH=/app
+
+ # Copy the virtual environment from the builder stage
+ COPY --from=builder /app/.venv /app/.venv
+
+ # Ensure the installed binary is on the `PATH`
+ ENV PATH="/app/.venv/bin:$PATH"
+
+ # Copy the rest of the application code
+ COPY src /app/src
+ COPY configs /app/configs
+ COPY models /app/models
+ COPY entrypoint.sh /app/entrypoint.sh
+
+ # Change ownership to the non-root user
+ RUN chown -R payshield:payshield /app
+
+ # Switch to the non-root user
+ USER 1000
+
+ # Expose ports
+ # 7860 is the default port for Hugging Face Spaces
+ EXPOSE 7860 8000
+
+ # Set environment variables for HF Spaces
+ ENV API_URL="http://localhost:8000/v1/predict"
+ ENV PORT=7860
+
+ # Use entrypoint script to run both services
+ CMD ["/bin/bash", "/app/entrypoint.sh"]
README.md CHANGED
@@ -0,0 +1,164 @@
+ # 🛡️ PayShield-ML: Real-Time Fraud Engine
+
+ ![Banner](assets/banner_0.jpeg)
+ ![Python](https://img.shields.io/badge/python-3.12-blue.svg)
+ ![FastAPI](https://img.shields.io/badge/FastAPI-0.100+-009688.svg)
+ ![Redis](https://img.shields.io/badge/redis-7.0+-DC382D.svg)
+ ![XGBoost](https://img.shields.io/badge/XGBoost-2.0+-FF6F00.svg)
+ ![Docker](https://img.shields.io/badge/docker-%230db7ed.svg)
+
+ **A production-grade MLOps system providing low-latency fraud detection with stateful feature hydration and human-in-the-loop explainability.**
+
+ ---
+
+ ## 🏗️ System Architecture
+
+ The system follows a **Lambda Architecture** pattern, decoupling the high-latency model training from the low-latency inference path.
+
+ ```mermaid
+ graph TD
+     User[Payment Gateway] -->|POST /predict| API[FastAPI Inference Service]
+
+     subgraph "Online Serving Layer (<50ms)"
+         API -->|1. Hydrate Velocity| Redis[(Redis Feature Store)]
+         API -->|2. Preprocess| Pipeline[Scikit-Learn Pipeline]
+         Pipeline -->|3. Predict| Model[XGBoost Classifier]
+         Model -->|4. Explain| SHAP[SHAP Engine]
+     end
+
+     subgraph "Offline Training Layer"
+         Data[(Transactional Data)] -->|Batch Load| Trainer[Training Pipeline]
+         Trainer -->|CV & Tuning| MLflow[MLflow Tracking]
+         Trainer -->|Export Artifact| Registry[Model Registry]
+     end
+
+     API -->|Async Log| ShadowLogs[Shadow Mode Logs]
+     API -->|Visualize| Dashboard[Analyst Dashboard]
+ ```
+ ### 🖥️ Analyst Workbench
+ ![Dashboard Screenshot](assets/dashboard_screenshot.png)
+ *Real-time interface for fraud analysts to review blocked transactions and SHAP explanations.*
+
+ ---
+
+ ## 🚀 Key Features
+
+ ### 1. Stateful Feature Store (Redis)
+ Traditional stateless APIs struggle with "Velocity Features" (e.g., *how many times did this user swipe in 24 hours?*). Our engine utilizes **Redis Sorted Sets (ZSET)** to maintain rolling windows, allowing feature hydration in **<2ms** with $O(\log N)$ complexity.
+
+ ### 2. Shadow Mode (Dark Launch)
+ To mitigate the risk of model drift or false positives, the system supports a **Shadow Mode** configuration. The model runs in production, receives real traffic, and logs decisions, but never blocks a transaction. This allows for risk-free A/B testing against legacy rule engines.
+
+ ### 3. Business-Aligned Optimization
+ Standard accuracy is misleading in fraud detection due to class imbalance. We implement a **Recall-Constraint Strategy** (Target: 80% Recall) ensuring the model captures the vast majority of fraud while maintaining a strict upper bound on False Positive Rates, as required by financial compliance.
+
+ ### 4. Cold-Start Handling
+ Engineered logic handles new users with zero history by defaulting to global medians and "warm-up" priors, preventing the system from unfairly blocking legitimate first-time customers.
+
+ ---
+
+ ## 📊 Performance & Metrics
+
+ The following metrics were achieved on a hold-out temporal test set (out-of-time validation):
+
+ | Metric | Result | Target / Bench |
+ | :--- | :--- | :--- |
+ | **PR-AUC** | 0.9245 | Excellent |
+ | **Precision** | 93.18% | Low False Positives |
+ | **Recall** | 80.06% | Target: 80% |
+ | **Inference Latency (p95)** | ~30ms | < 50ms |
+
+ ---
+
+ ## 🛠️ Quick Start
+
+ ### Prerequisites
+ - Docker & Docker Compose
+ - (Optional) Python 3.12+ (managed via `uv`)
+
+ ### Installation & Deployment
+
+ 1. **Clone the Repository**
+    ```bash
+    git clone https://github.com/Sibikrish3000/realtime-fraud-engine.git
+    cd realtime-fraud-engine
+    ```
+
+ 2. **Launch the Stack**
+    ```bash
+    docker-compose up --build
+    ```
+    *This starts the FastAPI Backend, Redis Feature Store, and Streamlit Dashboard.*
+
+ 3. **Access the Services**
+    - **Analyst Dashboard:** [http://localhost:8501](http://localhost:8501)
+    - **API Documentation:** [http://localhost:8000/docs](http://localhost:8000/docs)
+    - **Redis Instance:** `localhost:6379`
+
+ ---
+
+ ## 📂 Project Structure
+
+ ```text
+ src/
+ ├── api/              # FastAPI service, schemas (Pydantic), and config
+ ├── features/         # Redis feature store logic and sliding window constants
+ ├── models/           # Training pipelines, metrics calculation, and XGBoost wrappers
+ ├── frontend/         # Streamlit-based analyst workbench
+ ├── data/             # Data ingestion and cleaning utilities
+ └── explainability.py # SHAP-based waterfall plots and global importance
+
+
+ ```
+ See the [Development & MLOps Guide](docs/DEVELOPMENT.md) for detailed instructions on training and local development.
+
+ ---
+
+ ## 📡 API Reference
+
+ ### Predict Transaction
+ `POST /v1/predict`
+
+ **Request:**
+ ```bash
+ curl -X 'POST' \
+   'http://localhost:8000/v1/predict' \
+   -H 'Content-Type: application/json' \
+   -d '{
+     "user_id": "u12345",
+     "trans_date_trans_time": "2024-01-20 14:30:00",
+     "amt": 150.75,
+     "lat": 40.7128,
+     "long": -74.0060,
+     "merch_lat": 40.7306,
+     "merch_long": -73.9352,
+     "category": "grocery_pos"
+   }'
+ ```
+
+ **Response:**
+ ```json
+ {
+   "decision": "APPROVE",
+   "probability": 0.12,
+   "risk_score": 12.0,
+   "latency_ms": 28.5,
+   "shadow_mode": false,
+   "shap_values": {
+     "amt": 0.05,
+     "dist": 0.02
+   }
+ }
+ ```
+
+ ---
+
+ ## 🗺️ Future Roadmap
+
+ - [ ] **Kafka Integration:** Transition to an asynchronous event-driven architecture for high-throughput stream processing.
+ - [ ] **KServe Deployment:** Migrate from standalone FastAPI to KServe for automated scaling and model versioning.
+ - [ ] **Graph Features:** Incorporate Neo4j-based features to detect fraud rings and synthetic identities.
+
+ ---
+ **Author:** [Sibi Krishnamoorthy](https://github.com/Sibikrish3000)
+ *Machine Learning Engineer | Fintech & Risk Analytics*
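The README's "Recall-Constraint Strategy" amounts to picking the highest decision threshold whose recall still meets the target, which maximizes precision subject to that constraint. Below is a minimal pure-Python sketch on toy data; the actual pipeline presumably derives the threshold from a PR-curve analysis (e.g. `sklearn.metrics.precision_recall_curve`), and `pick_threshold` is an illustrative name, not the project's API.

```python
def pick_threshold(y_true, y_prob, min_recall=0.80):
    """Return the highest threshold whose recall still meets min_recall.

    Scanning candidate thresholds from high to low, recall only grows,
    so the first threshold that satisfies the constraint is the most
    precise one that does.
    """
    candidates = sorted(set(y_prob), reverse=True)
    total_pos = sum(y_true)
    for t in candidates:
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        if tp / total_pos >= min_recall:
            return t
    return min(candidates)


# Toy data: 5 fraud cases (label 1), 5 legitimate (label 0)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.99, 0.95, 0.92, 0.90, 0.40, 0.30, 0.20, 0.10, 0.05, 0.01]
t = pick_threshold(y_true, y_prob, min_recall=0.80)
print(t)  # 0.90 — catches 4 of 5 frauds (80% recall) with no false positives
```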
assets/Screenshot_18-1-2026_22489_localhost.jpeg ADDED

Git LFS Details

  • SHA256: fe0692585014893507c4adfaeb3b6fb4a4fb13bb77e5f3909f131251c11d4c2d
  • Pointer size: 131 Bytes
  • Size of remote file: 226 kB
assets/banner.jpeg ADDED

Git LFS Details

  • SHA256: a1728be71a279f0bd5d0902ddc40d0888e69ab4f4e5b365731d8efcbd0c16c49
  • Pointer size: 131 Bytes
  • Size of remote file: 134 kB
assets/banner_0.jpeg ADDED

Git LFS Details

  • SHA256: 5a3aa5bc40a100a8cebdce866dace9a7a62a8f7fb9af76a11de389474ade7cee
  • Pointer size: 131 Bytes
  • Size of remote file: 155 kB
assets/dashboard_screenshot.png ADDED

Git LFS Details

  • SHA256: 25d04e9de048d3aa39e8411af2faf8b924bd0eb5fb60d32ee8ea10b0b260863b
  • Pointer size: 131 Bytes
  • Size of remote file: 241 kB
configs/model_config.yaml ADDED
@@ -0,0 +1,29 @@
+ # Model Training Configuration
+
+ model:
+   # XGBoost Hyperparameters (Aligned with Notebook)
+   n_estimators: 500       # Notebook used 500 trees (script had 100)
+   max_depth: 8            # Notebook used depth 8 (script had 6)
+   learning_rate: 0.1      # Matches notebook
+   subsample: 0.8          # Row sampling for regularization
+   colsample_bytree: 0.8   # Column sampling for regularization
+   scale_pos_weight: 100   # Will be overridden by calculated imbalance ratio
+   eval_metric: "aucpr"
+   random_state: 42
+
+ # Training Settings
+ training:
+   test_size: 0.2
+   validation_size: 0.1
+   random_state: 42
+
+ # Decision Threshold
+ threshold:
+   optimal_threshold: 0.9016819596290588  # From notebook PR curve analysis
+   min_precision: 0.80                    # Business requirement
+
+ # Feature Engineering
+ features:
+   use_cyclical_encoding: true
+   use_distance_features: true
+   use_temporal_features: true
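The `use_cyclical_encoding` and `use_distance_features` flags correspond to standard transforms: mapping hour-of-day onto the unit circle (so 23:00 and 00:00 end up close together) and computing the great-circle distance between cardholder and merchant. A minimal sketch, where the function names are illustrative rather than the project's actual API:

```python
import math


def cyclical_hour(hour):
    # use_cyclical_encoding: hour_sin / hour_cos features
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def haversine_km(lat1, lon1, lat2, lon2):
    # use_distance_features: great-circle distance in km
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


print(cyclical_hour(6))  # (1.0, ~0.0): 06:00 sits a quarter-turn around the circle
print(haversine_km(40.7128, -74.0060, 40.7306, -73.9352))  # ~6.3 km (NYC example)
```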
configs/redis_config.yaml ADDED
@@ -0,0 +1,19 @@
+ # Redis Feature Store Configuration
+
+ # Connection Settings
+ host: localhost
+ port: 6379
+ db: 0
+ password: null  # Set via REDIS_PASSWORD env var in production
+
+ # Connection Pool
+ max_connections: 50
+ socket_connect_timeout: 2  # seconds
+ socket_timeout: 1          # seconds
+
+ # Feature Store Settings
+ ema_alpha: 0.08   # 2/(24+1) for 24-hour window
+ key_ttl: 604800   # 7 days in seconds
+
+ # Window Configuration
+ sliding_window_hours: 24
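The `ema_alpha` comment reflects the span-based smoothing factor α = 2/(N+1), here 2/25 = 0.08 for the 24-hour window. A quick check, with a per-user spend baseline update shown for illustration (what exactly the EMA tracks is an assumption; the consuming code is not in this diff):

```python
def ema_alpha(window_hours):
    # Span-based smoothing factor: alpha = 2 / (N + 1)
    return 2 / (window_hours + 1)


def ema_update(prev, value, alpha):
    # Standard exponential-moving-average update
    return alpha * value + (1 - alpha) * prev


alpha = ema_alpha(24)
print(alpha)  # 0.08
# Hypothetical: user's average spend baseline was 100.0, new tx is 150.0
print(ema_update(100.0, 150.0, alpha))  # ~104.0
```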
docker-compose.yml ADDED
@@ -0,0 +1,63 @@
+ version: '3.8'
+
+ services:
+   redis:
+     image: redis:7-alpine
+     container_name: payshield-redis
+     ports:
+       - "6379:6379"
+     volumes:
+       - redis_data:/data
+     healthcheck:
+       test: ["CMD", "redis-cli", "ping"]
+       interval: 5s
+       timeout: 3s
+       retries: 5
+     networks:
+       - payshield-network
+
+   api:
+     build:
+       context: .
+       dockerfile: Dockerfile
+     container_name: payshield-api
+     depends_on:
+       redis:
+         condition: service_healthy
+     ports:
+       - "8000:8000"
+     environment:
+       - REDIS_HOST=redis
+       - REDIS_PORT=6379
+       - MODEL_PATH=/app/models/fraud_model.pkl
+       - THRESHOLD_PATH=/app/models/threshold.json
+       - SHADOW_MODE=false
+     volumes:
+       - ./models:/app/models:ro
+       - ./logs:/app/logs
+     networks:
+       - payshield-network
+
+   # Placeholder for Phase 6 Dashboard
+   dashboard:
+     build:
+       context: .
+       dockerfile: Dockerfile
+     container_name: payshield-dashboard
+     depends_on:
+       - api
+     command: ["streamlit", "run", "src/frontend/app.py", "--server.port", "8501", "--server.address", "0.0.0.0"]
+     ports:
+       - "8501:8501"
+     environment:
+       - API_URL=http://api:8000/v1/predict
+     networks:
+       - payshield-network
+     restart: unless-stopped
+
+ networks:
+   payshield-network:
+     driver: bridge
+
+ volumes:
+   redis_data:
docker/docker-compose.yml ADDED
@@ -0,0 +1,42 @@
+ version: '3.8'
+
+ services:
+   # Redis Feature Store
+   redis:
+     image: redis:7-alpine
+     container_name: payshield-redis
+     ports:
+       - "6379:6379"
+     volumes:
+       - redis_data:/data
+     command: redis-server --appendonly yes
+     healthcheck:
+       test: ["CMD", "redis-cli", "ping"]
+       interval: 5s
+       timeout: 3s
+       retries: 5
+     networks:
+       - payshield-network
+
+   # MLflow Tracking Server (for future phases)
+   mlflow:
+     image: ghcr.io/mlflow/mlflow:v2.10.0
+     container_name: payshield-mlflow
+     ports:
+       - "5000:5000"
+     environment:
+       - MLFLOW_BACKEND_STORE_URI=sqlite:///mlflow/mlflow.db
+       - MLFLOW_ARTIFACTS_DESTINATION=/mlflow/artifacts
+     volumes:
+       - mlflow_data:/mlflow
+     command: mlflow server --host 0.0.0.0 --port 5000
+     networks:
+       - payshield-network
+
+ volumes:
+   redis_data:
+   mlflow_data:
+
+ networks:
+   payshield-network:
+     driver: bridge
docs/DEVELOPMENT.md ADDED
@@ -0,0 +1,135 @@
+ # 🛠️ Development & MLOps Guide
+
+ This guide provides detailed instructions on setting up the environment, running experiments, and developing the PayShield-ML system.
+
+ ---
+
+ ## 🏗️ 1. Environment Setup
+
+ The project uses `uv` for lightning-fast Python package and project management.
+
+ ### Prerequisites
+ - [uv](https://github.com/astral-sh/uv) installed
+ - Docker & Docker Compose
+ - Redis (can be run via Docker)
+
+ ### Installation
+ ```bash
+ # Sync dependencies and create virtual environment
+ uv sync
+
+ # Activate the environment
+ source .venv/bin/activate
+ ```
+
+ ---
+
+ ## 📊 2. MLflow Tracking
+
+ We use MLflow to track hyperparameters, metrics, and model artifacts.
+
+ ### Start MLflow Server
+ Run this in a separate terminal to view the UI:
+ ```bash
+ uv run mlflow ui --host 0.0.0.0 --port 5000
+ ```
+ Then access the dashboard at [http://localhost:5000](http://localhost:5000).
+
+ ---
+
+ ## 🚂 3. Model Training Pipeline
+
+ The training script handles data ingestion, feature engineering, cross-validation, and MLflow logging.
+
+ ### Basic Training
+ ```bash
+ uv run python src/models/train.py --data_path data/fraud_sample.csv
+ ```
+
+ ### Advanced Training with Custom Params
+ ```bash
+ uv run python src/models/train.py \
+   --data_path data/fraud_sample.csv \
+   --experiment_name fraud_v2 \
+   --min_recall 0.85 \
+   --output_dir models/
+ ```
+
+ | Argument | Description | Default |
+ | :--- | :--- | :--- |
+ | `--data_path` | Path to CSV/Parquet training data | (Required) |
+ | `--params_path` | Path to model config YAML | `configs/model_config.yaml` |
+ | `--experiment_name` | MLflow experiment grouping | `fraud_detection` |
+ | `--min_recall` | Target recall for threshold optimization | `0.80` |
+
+ ---
+
+ ## 🔌 4. Running Services Locally
+
+ For rapid iteration without rebuilding Docker images.
+
+ ### Start Redis (Required)
+ ```bash
+ docker run -d --name payshield-redis -p 6379:6379 redis:7-alpine
+ ```
+
+ ### Start FastAPI Backend
+ ```bash
+ # Running with hot-reload
+ uv run uvicorn src.api.main:app --host 0.0.0.0 --port 8000 --reload
+ ```
+ - **Swagger Docs:** [http://localhost:8000/docs](http://localhost:8000/docs)
+
+ ### Start Streamlit Dashboard
+ ```bash
+ export API_URL="http://localhost:8000/v1/predict"
+ uv run streamlit run src/frontend/app.py --server.port 8501
+ ```
+ - **Dashboard:** [http://localhost:8501](http://localhost:8501)
+
+ ---
+
+ ## 🐳 5. Full-Stack Development (Docker Compose)
+
+ The easiest way to replicate a production-like environment.
+
+ ### Build and Launch
+ ```bash
+ docker-compose up --build
+ ```
+
+ ### Useful Commands
+ ```bash
+ # Run in background
+ docker-compose up -d
+
+ # Check all service logs
+ docker-compose logs -f
+
+ # Stop and remove containers
+ docker-compose down
+ ```
+
+ ### Service Map
+ | Service | Port | Endpoint |
+ | :--- | :--- | :--- |
+ | **API** | 8000 | `http://localhost:8000` |
+ | **Dashboard** | 8501 | `http://localhost:8501` |
+ | **Redis** | 6379 | `localhost:6379` |
+
+ ---
+
+ ## 🧪 6. Testing
+
+ We use `pytest` for unit and integration tests.
+
+ ```bash
+ # Run all tests
+ uv run pytest
+
+ # Run with coverage report
+ uv run pytest --cov=src
+ ```
+
+ ---
+ **Note:** Ensure you have the `models/fraud_model.pkl` and `models/threshold.json` artifacts present before starting the API. These are generated by the training pipeline.
entrypoint.sh ADDED
@@ -0,0 +1,7 @@
+ #!/bin/bash
+
+ # Start FastAPI in the background
+ uvicorn src.api.main:app --host 0.0.0.0 --port 8000 &
+
+ # Start Streamlit in the foreground
+ streamlit run src/frontend/app.py --server.port 7860 --server.address 0.0.0.0
logs/shadow_predictions.jsonl ADDED
@@ -0,0 +1,55 @@
+ {"timestamp": "2026-01-18T12:37:19.214473Z", "user_id": "u12345", "amt": 150.0, "category": "grocery_pos", "real_decision": "APPROVE", "probability": 9.402751288689615e-07, "latency_ms": 72.96919822692871, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:44:56.721604Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.37005043029785, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:47:36.067083Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 54.149627685546875, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:47:39.609355Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 25.424957275390625, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:47:59.207044Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 50.46653747558594, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:01.070087Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 51.763057708740234, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:01.865266Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 29.25252914428711, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:02.489046Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 23.40531349182129, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:03.161965Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 44.34394836425781, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:03.765913Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.629220962524414, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:04.394367Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 27.244091033935547, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:04.986422Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 45.80378532409668, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:05.901805Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 51.35393142700195, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:06.666034Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 51.82218551635742, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:07.345695Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 48.212528228759766, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:07.919900Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 26.331424713134766, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:08.553960Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 41.85080528259277, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:09.265804Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 68.79472732543945, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:09.853970Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.57653045654297, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:10.481686Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 21.60501480102539, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:11.133950Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 50.62055587768555, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:11.734621Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.61753845214844, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:12.292397Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 21.454334259033203, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:12.847376Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 42.4959659576416, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:13.340940Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 47.911882400512695, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:13.923384Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 59.66687202453613, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:14.445859Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 50.5223274230957, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:14.990366Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 24.456501007080078, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:15.709852Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 45.6852912902832, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:15.913553Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 25.067567825317383, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:16.193654Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 44.05331611633301, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:16.345675Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.55793380737305, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:16.590079Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.01242256164551, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:16.789928Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 46.65184020996094, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:16.971762Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 29.859066009521484, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:17.185712Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.96466636657715, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:17.345695Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.675235748291016, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:17.524961Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 40.1458740234375, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:17.683832Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 24.51038360595703, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:17.851388Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 23.29730987548828, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:18.065689Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 47.719478607177734, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:18.381860Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 45.37034034729004, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:18.830235Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.47638511657715, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:19.041672Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 41.09072685241699, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:19.269802Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 37.813663482666016, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:19.570256Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 30.001163482666016, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:19.910120Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 47.04141616821289, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:20.254294Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 52.63257026672363, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:20.520307Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 24.410486221313477, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:20.933840Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 42.56725311279297, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:58.996285Z", "user_id": "u12345", "amt": 2500.0, "category": "shopping_net", "real_decision": "BLOCK", "probability": 0.999806821346283, "latency_ms": 38.36369514465332, "shadow_mode": true}
+ {"timestamp": "2026-01-18T14:12:06.614568Z", "user_id": "u12345", "amt": 150.0, "category": "grocery_pos", "real_decision": "APPROVE", "probability": 8.649675874039531e-05, "latency_ms": 86.48848533630371, "shadow_mode": true}
+ {"timestamp": "2026-01-18T14:15:46.042042Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 7.321957014028158e-07, "latency_ms": 86.42172813415527, "shadow_mode": true}
+ {"timestamp": "2026-01-18T14:16:12.912011Z", "user_id": "u12345", "amt": 1500.0, "category": "misc_net", "real_decision": "BLOCK", "probability": 0.9997040629386902, "latency_ms": 49.93748664855957, "shadow_mode": true}
+ {"timestamp": "2026-01-18T14:17:14.040831Z", "user_id": "u12345", "amt": 1500.0, "category": "misc_net", "real_decision": "BLOCK", "probability": 0.9997040629386902, "latency_ms": 43.45536231994629, "shadow_mode": true}
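Because the shadow log is plain JSONL, offline analysis needs nothing beyond the standard library. A sketch (field names taken from the log itself; the inline `sample` stands in for reading `logs/shadow_predictions.jsonl`):

```python
import json
from statistics import mean

# Three representative lines from the shadow log, inlined so this runs standalone.
sample = """\
{"real_decision": "APPROVE", "probability": 9.4e-07, "latency_ms": 72.97}
{"real_decision": "BLOCK", "probability": 0.9998, "latency_ms": 38.36}
{"real_decision": "BLOCK", "probability": 0.9997, "latency_ms": 49.94}
"""

records = [json.loads(line) for line in sample.splitlines()]
blocked = sum(1 for r in records if r["real_decision"] == "BLOCK")
avg_latency = mean(r["latency_ms"] for r in records)
print(blocked, round(avg_latency, 2))  # 2 53.76
```

In practice you would replace `sample.splitlines()` with iterating over the open log file.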
main.py ADDED
@@ -0,0 +1,6 @@
+ def main():
+     print("Hello from payshield-ml!")
+
+
+ if __name__ == "__main__":
+     main()
models/fraud_model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9be403668e04f1458d5eb5e0dd805495e7f6fa1f322dc8b006dcb2fccc78061d
+ size 3933261
models/threshold.json ADDED
@@ -0,0 +1,10 @@
+ {
+     "optimal_threshold": 0.8199467062950134,
+     "metrics": {
+         "precision": 0.9318377911993098,
+         "recall": 0.8005930318754633,
+         "f1": 0.861244019138756,
+         "pr_auc": 0.924508450881234,
+         "threshold_used": 0.8199467062950134
+     }
+ }
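The stored metrics are internally consistent: F1 is the harmonic mean of precision and recall, which can be checked directly from the values above:

```python
# Values copied from models/threshold.json
precision = 0.9318377911993098
recall = 0.8005930318754633

# F1 = harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # 0.861244, matching the "f1" field above
```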
notebook/.ipynb_checkpoints/Untitled-checkpoint.ipynb ADDED
@@ -0,0 +1,6 @@
+ {
+  "cells": [],
+  "metadata": {},
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }
notebook/.ipynb_checkpoints/credit-card-fraud-detection-checkpoint.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebook/README.md ADDED
@@ -0,0 +1,122 @@
+ # 🛡️ Behavioral Fraud Detection System (XGBoost + SHAP)
+
+ **A production-grade machine learning pipeline that identifies fraudulent credit card transactions with 97% precision and explainable AI.**
+
+ ## 💼 Executive Summary & Business Impact
+
+ Financial fraud detection is not just about accuracy; it is about the **Precision-Recall trade-off**. A model that flags too many legitimate transactions (false positives) causes customer churn, while missed fraud (false negatives) causes direct financial loss.
+
+ This project implements a cost-sensitive **XGBoost** classifier engineered to minimize **operational friction**. By optimizing the decision threshold to **0.895**, the system achieved:
+
+ | Metric | Performance | Business Value |
+ | --- | --- | --- |
+ | **Precision** | **97%** | Only 3% of alerts are false alarms, drastically reducing manual review costs. |
+ | **Recall** | **80%** | Captures 80% of all fraud attempts (approx. $810k in prevented loss). |
+ | **Net Savings** | **$810,470** | Calculated ROI on the test set (loss prevented minus operational costs). |
+ | **Latency** | **<50ms** | Inference speed optimized via `scikit-learn` Pipeline serialization. |
+
+ ---
+
+ ## 🏗️ Technical Architecture
+
+ The solution moves beyond basic "fit-predict" workflows by implementing a robust preprocessing pipeline designed to prevent **data leakage** and handle extreme class imbalance (0.5% fraud rate).
+
+ ### 1. Advanced Feature Engineering
+
+ Raw transaction fields alone are insufficient for detecting modern fraud, so I engineered **14 behavioral features** to capture context:
+
+ * **Velocity Metrics (`trans_count_24h`, `amt_to_avg_ratio_24h`)**: Detect "burst" behavior where a card is used rapidly or for amounts exceeding the user's historical norm.
+ * **Geospatial Analysis (`distance_km`)**: Calculates the Haversine distance between the cardholder's home and the merchant.
+ * **Cyclical Temporal Encoding (`hour_sin`, `hour_cos`)**: Captures high-risk time windows (e.g., 3 AM surges) while preserving the continuity of the 24-hour cycle.
+ * **Risk Profiling (`WOEEncoder`)**: Replaces high-cardinality categorical features (merchant, job) with their "Weight of Evidence": a measure of how strongly a specific category supports the "fraud" hypothesis.
+
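The three feature families above can be sketched with the standard library alone (function names here are illustrative, not the notebook's actual code):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between cardholder and merchant, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def hour_cyclical(hour):
    """Encode hour-of-day on the unit circle so 23:00 and 00:00 stay adjacent."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def weight_of_evidence(frauds_in_cat, frauds_total, goods_in_cat, goods_total):
    """WOE: log of the ratio between a category's fraud share and its legitimate share."""
    return math.log((frauds_in_cat / frauds_total) / (goods_in_cat / goods_total))

# A short hop across Manhattan is under a kilometre:
print(round(haversine_km(40.7128, -74.0060, 40.7200, -74.0100), 2))
print(hour_cyclical(6))  # (1.0, ~0.0): 6 AM sits at the peak of the sine component
```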
+ ### 2. The Pipeline
+
+ To ensure production stability, all steps are wrapped in a single scikit-learn `Pipeline`:
+
+ ```python
+ pipeline = Pipeline(steps=[
+     ('preprocessor', ColumnTransformer(transformers=[
+         ('cat', WOEEncoder(), ['job', 'category']),
+         ('num', RobustScaler(), numerical_features)
+     ])),
+     ('classifier', XGBClassifier(scale_pos_weight=imbalance_ratio, ...))
+ ])
+ ```
+
+ ---
+
+ ## 📊 Model Performance
+
+ ### Precision-Recall Strategy
+
+ Instead of optimizing for ROC-AUC (which can be misleading on imbalanced datasets), I optimized for **PR-AUC (0.998)**.
+
+ * **Default Threshold (0.50):** Precision was 65%. Too many false alarms.
+ * **Optimized Threshold (0.895):** Precision increased to **97%**, with minimal loss in recall.
+
+ *(Insert your Precision-Recall Curve image here)*
+
+ ---
+
+ ## 🔍 Explainability & "The Why"
+
+ Black-box models are dangerous in finance. I implemented **SHAP (SHapley Additive exPlanations)** to provide reason codes for every decision.
+
+ ### The "Smoking Gun" (Fraud Example)
+
+ For a transaction flagged with **99.9% confidence**, the SHAP waterfall plot reveals the exact drivers:
+
+ 1. **`amt_log` (+8.83)**: The transaction amount was significantly higher than normal.
+ 2. **`hour_sin` (+2.46)**: The transaction occurred during a high-risk time window (late night).
+ 3. **`job` (+1.72)**: The cardholder's profession falls into a statistically higher-risk segment.
+ 4. **`amt_to_avg_ratio_24h` (+1.24)**: The amount was an outlier *specifically* for this user's 24-hour history.
+
+ *(Insert your SHAP Waterfall Plot image here)*
+
+ ---
+
+ ## 🚀 How to Run
+
+ ### Prerequisites
+
+ ```bash
+ pip install pandas xgboost category_encoders shap scikit-learn
+ ```
+
+ ### Inference
+
+ The model is serialized as a `.pkl` file. You can load it to predict on new data immediately without re-training.
+
+ ```python
+ import joblib
+ import pandas as pd
+
+ # Load the production pipeline
+ model = joblib.load('fraud_detection_model_v1.pkl')
+
+ # Define a new transaction (example)
+ new_transaction = pd.DataFrame([{
+     'amt_log': 5.2,
+     'distance_km': 120.5,
+     'trans_count_24h': 12,
+     'amt_to_avg_ratio_24h': 4.5,
+     # ... include all 14 features
+ }])
+
+ # Get prediction (probability of fraud)
+ fraud_prob = model.predict_proba(new_transaction)[:, 1]
+ is_fraud = (fraud_prob >= 0.895).astype(int)
+
+ print(f"Fraud Probability: {fraud_prob[0]:.4f}")
+ print(f"Action: {'BLOCK' if is_fraud[0] else 'APPROVE'}")
+ ```
+
+ ---
+
+ ### **Author**
+
+ **Sibi Krishnamoorthy**
+ *Machine Learning Engineer | Fintech & Risk Analytics*
notebook/credit-card-fraud-detection.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebook/credit-card-fraud-detection.md ADDED
The diff for this file is too large to render. See raw diff
 
notebook/credit-card-fraud-detection.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1f604362aa5035ae24f4b5b1dcd4d67926ee9246030eba352eb694ca05fd18aa
+ size 817131
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_23_0.png ADDED

Git LFS Details

  • SHA256: 88ffca8cc308c74d266feabb0de1ab67df6350b8d4b44798ee4343ae8ff0acf5
  • Pointer size: 130 Bytes
  • Size of remote file: 41.9 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_25_1.png ADDED

Git LFS Details

  • SHA256: ec441ac40b9528b442b4e16b82518a8ce46bc52af746a5c3aa67538f47f4f1e8
  • Pointer size: 130 Bytes
  • Size of remote file: 14.9 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_26_1.png ADDED

Git LFS Details

  • SHA256: 91aec463d1c6742e9f7946858926c1db6c5ee1b775db8d9895366d82e198b465
  • Pointer size: 130 Bytes
  • Size of remote file: 29.9 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_28_1.png ADDED

Git LFS Details

  • SHA256: 7a8174cf93ce7eaf8c5363509fb410c7759ee1bfce43ee0ceda0543484a594e8
  • Pointer size: 130 Bytes
  • Size of remote file: 59.2 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_29_1.png ADDED

Git LFS Details

  • SHA256: 89719cdfbdd9d862df5275f0ef5f164e7325e82b1835b0612620a014aeedad69
  • Pointer size: 130 Bytes
  • Size of remote file: 60.2 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_30_0.png ADDED

Git LFS Details

  • SHA256: 0570aba24636b052993e0cb228f5cfd0c7afbb7c5dcb0301a16bf34c3213cbcb
  • Pointer size: 130 Bytes
  • Size of remote file: 59.5 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_32_0.png ADDED

Git LFS Details

  • SHA256: 14a9f5ef856ee4f33aaa939bac4d63f4f262757d8ffdad380756331219ccd8f3
  • Pointer size: 130 Bytes
  • Size of remote file: 17.8 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_34_1.png ADDED

Git LFS Details

  • SHA256: 11d0708dfd6f1403d04f04fb3e64c5d2cfe26b0647c706c85d532fcbcd3aeeca
  • Pointer size: 130 Bytes
  • Size of remote file: 25.8 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_40_1.png ADDED

Git LFS Details

  • SHA256: 193a7c99eaae83869e529ad3a877c9161a49b2bcc8a3513e4e892f6b429d70bb
  • Pointer size: 130 Bytes
  • Size of remote file: 15.4 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_58_0.png ADDED

Git LFS Details

  • SHA256: be3db171ddbd2eb4327efcc89366d6ba50c272fae85efd980a8e8abfb1d21214
  • Pointer size: 130 Bytes
  • Size of remote file: 63.1 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_59_0.png ADDED

Git LFS Details

  • SHA256: 48c2d93bf385e18fa45c9c96edeb99443b55d34064285f6a7eb5f3ce5291606d
  • Pointer size: 130 Bytes
  • Size of remote file: 49.7 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_60_0.png ADDED

Git LFS Details

  • SHA256: bb83069c4308aaf513b127ca55e46b1d6bcfccd278a5fe722649493ac6fc9df7
  • Pointer size: 130 Bytes
  • Size of remote file: 41.9 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_68_0.png ADDED

Git LFS Details

  • SHA256: aa61ce6ca4f34ca70fab9e28bf3465aeac9792248032e6f3f129febe52d00c90
  • Pointer size: 131 Bytes
  • Size of remote file: 103 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_70_0.png ADDED

Git LFS Details

  • SHA256: 4f6571337f98d8023d78afa3c172296bbf81c6d2e58b1d96a3973feb69cbc334
  • Pointer size: 130 Bytes
  • Size of remote file: 65.7 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_72_1.png ADDED

Git LFS Details

  • SHA256: af5649b380ca39394ce291aab39bd04aaec5697dc3d376e2808f66ad4442dab8
  • Pointer size: 130 Bytes
  • Size of remote file: 61.6 kB
notebook/dataset/download.py ADDED
@@ -0,0 +1,27 @@
+ #!/usr/bin/env -S uv run --script
+ #
+ # /// script
+ # dependencies = [
+ #     "pandas",
+ #     "rich",
+ # ]
+ # ///
+ import os
+ import zipfile
+
+ import pandas as pd
+
+ path = os.path.join(os.path.dirname(__file__), "dataset")
+
+
+ def download_data():
+     os.makedirs(path, exist_ok=True)  # ensure the target directory exists before downloading
+     os.system(f"curl -L -o {path}/fraud-detection.zip https://www.kaggle.com/api/v1/datasets/download/kartik2112/fraud-detection")
+
+     with zipfile.ZipFile(f"{path}/fraud-detection.zip", "r") as zip_ref:
+         zip_ref.extractall(f"{path}/")
+
+
+ def combine_data():
+     train = pd.read_csv(f"{path}/fraud-detection/fraudTrain.csv")
+     test = pd.read_csv(f"{path}/fraud-detection/fraudTest.csv")
+     df = pd.concat([train, test], axis=0).reset_index(drop=True)
+     df.to_csv(f"{path}/fraud.csv", index=False)
+
+
+ if __name__ == "__main__":
+     download_data()
+     combine_data()
pyproject.toml ADDED
@@ -0,0 +1,81 @@
+ [project]
+ name = "payshield-ml"
+ version = "0.1.0"
+ description = "Real-Time Financial Fraud Detection System with sub-50ms latency"
+ readme = "README.md"
+ requires-python = ">=3.12"
+ dependencies = [
+     # ML Stack
+     "xgboost>=2.0.0",
+     "scikit-learn>=1.4.0",
+     "shap>=0.50.0",  # Use latest stable version compatible with Python 3.12
+     "mlflow>=2.10.0",
+     # API Framework
+     "fastapi>=0.109.0",
+     "uvicorn[standard]>=0.27.0",
+     "pydantic>=2.5.0",
+     # Data Processing
+     "pandas>=2.2.0",
+     "numpy>=2.0.0",
+     "pyarrow>=15.0.0",
+     # Feature Store
+     "redis>=5.0.0",
+     "hiredis>=2.3.0",
+     # Utilities
+     "python-dotenv>=1.0.0",
+     "pyyaml>=6.0.1",
+     "joblib>=1.5.0",
+     "category-encoders>=2.6.0",
+     "seaborn>=0.13.2",
+     "matplotlib>=3.10.8",
+     "pydantic-settings>=2.12.0",
+     "streamlit>=1.53.0",
+     "plotly>=6.5.2",
+     "requests>=2.32.5",
+     "typing-extensions>=4.15.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     # Testing
+     "pytest>=8.0.0",
+     "pytest-asyncio>=0.23.0",
+     "pytest-cov>=4.1.0",
+
+     # Type Checking & Linting
+     "mypy>=1.8.0",
+     "ruff>=0.1.0",
+
+     # Notebook Support
+     "jupyter>=1.0.0",
+     "ipykernel>=6.29.0",
+ ]
+
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
+
+ [tool.pytest.ini_options]
+ testpaths = ["tests"]
+ python_files = "test_*.py"
+ python_classes = "Test*"
+ python_functions = "test_*"
+ addopts = "-v --cov=src --cov-report=term-missing"
+
+ [tool.mypy]
+ python_version = "3.12"
+ warn_return_any = true
+ warn_unused_configs = true
+ disallow_untyped_defs = true
+
+ [tool.ruff]
+ line-length = 100
+ target-version = "py312"
+
+ [tool.hatch.build.targets.wheel]
+ packages = ["src"]
+
+ [dependency-groups]
+ dev = [
+     "ruff>=0.14.13",
+ ]
scripts/demo_phase1.py ADDED
@@ -0,0 +1,84 @@
+ """
+ Example script demonstrating the data ingestion and feature store usage.
+
+ This shows how to use the modules we just built.
+ """
+
+ import time
+
+ from src.data.ingest import InferenceTransactionSchema
+ from src.features.store import RedisFeatureStore
+
+
+ def main():
+     """Demonstrate basic usage of Phase 1 modules."""
+     print("=== PayShield-ML Phase 1 Demo ===\n")
+
+     # 1. Data Validation Example
+     print("1. Testing Data Validation...")
+     try:
+         valid_transaction = InferenceTransactionSchema(
+             user_id="u12345",
+             amt=150.00,
+             lat=40.7128,
+             long=-74.0060,
+             category="grocery_pos",
+             job="Engineer, biomedical",
+             merch_lat=40.7200,
+             merch_long=-74.0100,
+             unix_time=1234567890,
+         )
+         print(f"  ✓ Valid transaction: ${valid_transaction.amt:.2f}")
+     except Exception as e:
+         print(f"  ✗ Validation failed: {e}")
+
+     # 2. Feature Store Example
+     print("\n2. Testing Feature Store...")
+     try:
+         store = RedisFeatureStore(host="localhost", port=6379, db=0)
+
+         # Check health
+         health = store.health_check()
+         print(f"  ✓ Redis connection: {health['status']} (ping: {health['ping_ms']}ms)")
+
+         # Add some transactions
+         user_id = "demo_user_001"
+         base_time = int(time.time())
+
+         print(f"\n  Adding 3 transactions for {user_id}...")
+         for i, amount in enumerate([50.00, 75.00, 100.00]):
+             store.add_transaction(
+                 user_id=user_id,
+                 amount=amount,
+                 timestamp=base_time + (i * 3600),  # 1 hour apart
+             )
+             print(f"    Transaction {i + 1}: ${amount:.2f}")
+
+         # Get features
+         features = store.get_features(user_id, current_timestamp=base_time + 7200)
+         print("\n  Features computed:")
+         print(f"    - Transaction count (24h): {features['trans_count_24h']:.0f}")
+         print(f"    - Average spend (24h): ${features['avg_spend_24h']:.2f}")
+
+         # Get history
+         history = store.get_transaction_history(user_id, lookback_hours=24)
+         print(f"\n  Transaction history ({len(history)} transactions):")
+         for ts, amt in history[:3]:
+             print(f"    - Timestamp {ts}: ${amt:.2f}")
+
+         # Cleanup
+         deleted = store.delete_user_data(user_id)
+         print(f"\n  ✓ Cleaned up {deleted} keys")
+
+     except Exception as e:
+         print(f"  ✗ Feature store error: {e}")
+         print("    Make sure Redis is running: docker-compose -f docker/docker-compose.yml up -d redis")
+
+     print("\n=== Demo Complete ===")
+
+
+ if __name__ == "__main__":
+     main()
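The windowed aggregates the demo prints can be expressed independently of Redis. A stdlib sketch of the 24-hour window logic (illustrative only; the real `RedisFeatureStore` computes these server-side):

```python
def window_features(history, now, window_s=24 * 3600):
    """history: list of (unix_ts, amount) pairs.
    Returns (transaction count, average amount) within the trailing window."""
    recent = [amt for ts, amt in history if 0 <= now - ts <= window_s]
    count = len(recent)
    avg = sum(recent) / count if count else 0.0
    return count, avg

# Mirror the demo: three transactions spaced one hour apart.
base = 1_700_000_000
history = [(base, 50.0), (base + 3600, 75.0), (base + 7200, 100.0)]
print(window_features(history, now=base + 7200))  # (3, 75.0)
```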
scripts/verify_phase1.py ADDED
@@ -0,0 +1,192 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Quick verification script for Phase 1 implementation.
4
+ Tests imports and basic module functionality without requiring Redis.
5
+ """
6
+
7
+
8
+ def test_imports():
9
+ """Test that all Phase 1 modules can be imported."""
10
+ print("Testing imports...")
11
+
12
+ try:
13
+ from src.data.ingest import TransactionSchema, InferenceTransactionSchema, load_dataset
14
+
15
+ print(" ✓ Data ingestion module imports successfully")
16
+ except ImportError as e:
17
+ print(f" ✗ Data ingestion import failed: {e}")
18
+ return False
19
+
20
+ try:
21
+ from src.features.store import RedisFeatureStore
22
+
23
+ print(" ✓ Feature store module imports successfully")
24
+ except ImportError as e:
25
+ print(f" ✗ Feature store import failed: {e}")
26
+ return False
27
+
28
+ try:
29
+ from src.features.constants import category_names, job_names
30
+
31
+ print(
32
+ f" ✓ Constants module loaded ({len(category_names)} categories, {len(job_names)} jobs)"
33
+ )
34
+ except ImportError as e:
35
+ print(f" ✗ Constants import failed: {e}")
36
+ return False
37
+
38
+ return True
39
+
40
+
41
+ def test_pydantic_validation():
42
+ """Test Pydantic validation logic."""
43
+ print("\nTesting Pydantic validation...")
44
+
45
+ from src.data.ingest import TransactionSchema, InferenceTransactionSchema
46
+ from pydantic import ValidationError
47
+
48
+ # Test valid transaction
49
+ try:
50
+ valid_data = {
51
+ "trans_date_trans_time": "2019-01-01 00:00:18",
52
+ "cc_num": 2703186189652095,
53
+            "merchant": "Test Merchant",
+            "category": "misc_net",
+            "amt": 100.50,
+            "first": "John",
+            "last": "Doe",
+            "gender": "M",
+            "street": "123 Main St",
+            "city": "Springfield",
+            "state": "IL",
+            "zip": 62701,
+            "lat": 39.7817,
+            "long": -89.6501,
+            "city_pop": 100000,
+            "job": "Engineer, biomedical",
+            "dob": "1990-01-01",
+            "trans_num": "abc123",
+            "unix_time": 1325376018,
+            "merch_lat": 39.7900,
+            "merch_long": -89.6600,
+            "is_fraud": 0,
+        }
+        schema = TransactionSchema(**valid_data)
+        print(f" ✓ Valid transaction accepted (amt: ${schema.amt:.2f})")
+    except ValidationError as e:
+        print(f" ✗ Valid transaction rejected: {e}")
+        return False
+
+    # Test invalid amount
+    try:
+        invalid_data = valid_data.copy()
+        invalid_data["amt"] = -50.00
+        schema = TransactionSchema(**invalid_data)
+        print(" ✗ Negative amount not rejected!")
+        return False
+    except ValidationError:
+        print(" ✓ Negative amount correctly rejected")
+
+    # Test invalid coordinates
+    try:
+        invalid_data = valid_data.copy()
+        invalid_data["lat"] = 200.0
+        schema = TransactionSchema(**invalid_data)
+        print(" ✗ Invalid latitude not rejected!")
+        return False
+    except ValidationError:
+        print(" ✓ Invalid latitude correctly rejected")
+
+    # Test inference schema
+    try:
+        inference_data = {
+            "user_id": "u12345",
+            "amt": 150.00,
+            "lat": 40.7128,
+            "long": -74.0060,
+            "category": "grocery_pos",
+            "job": "Engineer, biomedical",
+            "merch_lat": 40.7200,
+            "merch_long": -74.0100,
+            "unix_time": 1234567890,
+        }
+        schema = InferenceTransactionSchema(**inference_data)
+        print(f" ✓ Inference schema works (user: {schema.user_id})")
+    except ValidationError as e:
+        print(f" ✗ Inference schema failed: {e}")
+        return False
+
+    return True
+
+
+def test_constants():
+    """Test that constants are properly loaded."""
+    print("\nTesting constants...")
+
+    from src.features.constants import category_names, job_names
+
+    # Check categories
+    expected_categories = ["misc_net", "gas_transport", "grocery_pos", "entertainment"]
+    for cat in expected_categories:
+        if cat not in category_names:
+            print(f" ✗ Missing category: {cat}")
+            return False
+    print(f" ✓ All expected categories present ({len(category_names)} total)")
+
+    # Check jobs
+    expected_jobs = ["Engineer, biomedical", "Data scientist", "Mechanical engineer"]
+    found_jobs = [j for j in expected_jobs if j in job_names]
+    print(
+        f" ✓ Job list loaded ({len(job_names)} jobs, {len(found_jobs)}/{len(expected_jobs)} test jobs found)"
+    )
+
+    return True
+
+
+def main():
+    """Run all verification tests."""
+    print("=" * 60)
+    print("Phase 1 Verification Script")
+    print("=" * 60)
+
+    tests = [
+        ("Imports", test_imports),
+        ("Pydantic Validation", test_pydantic_validation),
+        ("Constants", test_constants),
+    ]
+
+    results = []
+    for name, test_func in tests:
+        try:
+            result = test_func()
+            results.append((name, result))
+        except Exception as e:
+            print(f"\n ✗ {name} test crashed: {e}")
+            results.append((name, False))
+
+    print("\n" + "=" * 60)
+    print("Summary")
+    print("=" * 60)
+
+    for name, passed in results:
+        status = "✓ PASS" if passed else "✗ FAIL"
+        print(f"{status:>10} - {name}")
+
+    all_passed = all(r[1] for r in results)
+
+    print("\n" + "=" * 60)
+    if all_passed:
+        print("✅ All verification tests passed!")
+        print("\nPhase 1 implementation is working correctly.")
+        print("\nNext steps:")
+        print(" 1. Install dependencies: uv sync")
+        print(" 2. Start Redis: docker-compose -f docker/docker-compose.yml up -d redis")
+        print(" 3. Run full test suite: pytest tests/ -v")
+        print(" 4. Try demo: python scripts/demo_phase1.py")
+    else:
+        print("❌ Some tests failed. Check the output above.")
+    print("=" * 60)
+
+
+if __name__ == "__main__":
+    main()
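The validation checks above exercise a `TransactionSchema` and `InferenceTransactionSchema` defined elsewhere in the repo (not shown in this diff). A minimal Pydantic sketch consistent with what the script asserts — a positive `amt` and a latitude within [-90, 90] — might look like the following; the field constraints here are inferred from the rejected test cases, not copied from the actual source:

```python
from pydantic import BaseModel, Field, ValidationError

class TransactionSchema(BaseModel):
    """Hypothetical sketch of the schema the verification script exercises.

    Only the fields the tests probe are constrained; the remaining fields
    (first, last, gender, street, ... as in valid_data) are omitted.
    """
    merchant: str
    category: str
    amt: float = Field(gt=0)          # negative amounts must be rejected
    lat: float = Field(ge=-90, le=90) # lat=200.0 must be rejected
    long: float = Field(ge=-180, le=180)

if __name__ == "__main__":
    # A well-formed transaction is accepted
    ok = TransactionSchema(merchant="Test Merchant", category="misc_net",
                           amt=100.50, lat=39.7817, long=-89.6501)
    print(f"accepted amt={ok.amt}")

    # Out-of-range values raise ValidationError, mirroring the script's checks
    for bad in ({"amt": -50.0}, {"lat": 200.0}):
        data = {"merchant": "m", "category": "c",
                "amt": 1.0, "lat": 0.0, "long": 0.0, **bad}
        try:
            TransactionSchema(**data)
        except ValidationError:
            print(f"rejected {bad}")
```

`Field(gt=..., ge=..., le=...)` constraints work in both Pydantic v1 and v2, so this sketch stays compatible with either; the real schema may use custom validators instead.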
src/__init__.py ADDED
@@ -0,0 +1 @@
+# Empty __init__.py for src package
src/api/__init__.py ADDED
@@ -0,0 +1 @@
+# API module __init__