Sibi Krishnamoorthy committed on
Commit 8a08300
1 Parent(s): 2098ced
This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .agent/rules/architecture.md +175 -0
  2. .coverage +0 -0
  3. .dockerignore +34 -0
  4. .env.example +23 -0
  5. .github/workflows/main.yml +49 -0
  6. .gitignore +17 -0
  7. .python-version +1 -0
  8. Dockerfile +68 -0
  9. README.md +162 -0
  10. assets/Screenshot_18-1-2026_22489_localhost.jpeg +3 -0
  11. assets/banner.jpeg +3 -0
  12. assets/banner_0.jpeg +3 -0
  13. assets/dashboard_screenshot.png +3 -0
  14. configs/model_config.yaml +29 -0
  15. configs/redis_config.yaml +19 -0
  16. docker-compose.yml +63 -0
  17. docker/docker-compose.yml +42 -0
  18. docs/DEVELOPMENT.md +135 -0
  19. entrypoint.sh +7 -0
  20. logs/shadow_predictions.jsonl +55 -0
  21. main.py +6 -0
  22. models/fraud_model.pkl +3 -0
  23. models/threshold.json +10 -0
  24. notebook/.ipynb_checkpoints/Untitled-checkpoint.ipynb +6 -0
  25. notebook/.ipynb_checkpoints/credit-card-fraud-detection-checkpoint.ipynb +0 -0
  26. notebook/README.md +122 -0
  27. notebook/credit-card-fraud-detection.ipynb +0 -0
  28. notebook/credit-card-fraud-detection.md +0 -0
  29. notebook/credit-card-fraud-detection.pdf +3 -0
  30. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_23_0.png +3 -0
  31. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_25_1.png +3 -0
  32. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_26_1.png +3 -0
  33. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_28_1.png +3 -0
  34. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_29_1.png +3 -0
  35. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_30_0.png +3 -0
  36. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_32_0.png +3 -0
  37. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_34_1.png +3 -0
  38. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_40_1.png +3 -0
  39. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_58_0.png +3 -0
  40. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_59_0.png +3 -0
  41. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_60_0.png +3 -0
  42. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_68_0.png +3 -0
  43. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_70_0.png +3 -0
  44. notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_72_1.png +3 -0
  45. notebook/dataset/download.py +27 -0
  46. pyproject.toml +81 -0
  47. scripts/demo_phase1.py +84 -0
  48. scripts/verify_phase1.py +192 -0
  49. src/__init__.py +1 -0
  50. src/api/__init__.py +1 -0
.agent/rules/architecture.md ADDED
@@ -0,0 +1,175 @@
+ ---
+ trigger: always_on
+ ---
+
+ This file documents the high-level system design, data flow, and infrastructure decisions, and serves as a reference for system design reviews.
+
+ ---
+
+ # 🏗️ System Architecture: Real-Time Fraud Detection
+
+ This document outlines the end-to-end architecture of the Fraud Detection System, designed for **low-latency inference (<50ms)**, **high throughput**, and **explainability**.
+
+ The system follows a **Lambda Architecture** pattern, splitting workflows into an **Offline Training Layer** (batch processing) and an **Online Serving Layer** (real-time inference).
+
+ ---
+
+ ## 1. High-Level Diagram
+
+ ```mermaid
+ graph TD
+     User[Payment Gateway] -->|POST /predict| API[FastAPI Inference Service]
+
+     subgraph "Online Serving Layer"
+         API -->|1. Get Velocity| Redis[(Redis Feature Store)]
+         API -->|2. Compute Features| Preproc[Scikit-Learn Pipeline]
+         Preproc -->|3. Inference| Model[XGBoost Classifier]
+         Model -->|4. Log Prediction| Logger[Async Logger]
+     end
+
+     subgraph "Offline Training Layer"
+         DW[(Data Warehouse)] -->|Batch Load| Trainer[Training Pipeline]
+         Trainer -->|Track Experiments| MLflow[MLflow Server]
+         Trainer -->|Save Artifact| Registry[Model Registry]
+         Registry -->|Load .pkl| API
+     end
+
+     subgraph "Explainability & Monitoring"
+         Logger -->|Drift Analysis| Dashboard[Streamlit / Grafana]
+         API -->|Request Waterfall| SHAP[SHAP Explainer]
+     end
+
+ ```
+
+ ---
+
+ ## 2. Component Breakdown
+
+ ### 🟢 Online Serving Layer (Real-Time)
+
+ The critical path for live transactions, designed for sub-50ms latency.
+
+ * **FastAPI Service:** Asynchronous Python web server. Handles request validation (Pydantic) and orchestrates the inference flow.
+ * **Redis Feature Store:**
+     * **Role:** Solves the "Stateful Feature" problem.
+     * **Implementation:** Uses Redis Sorted Sets (`ZSET`) to maintain a rolling window of transaction counts for the last 24 hours.
+     * **Latency:** <2ms lookups via pipelining.
+
+
+ * **Inference Engine:**
+     * **Model:** XGBoost Classifier (serialized via Joblib).
+     * **Pipeline:** `ColumnTransformer` (WOEEncoder + RobustScaler) ensures raw input is transformed exactly as it was during training.
+     * **Shadow Mode:** A configuration toggle allowing the model to run silently alongside legacy rules for A/B testing.
+
+
+
+ ### 🔵 Offline Training Layer (Batch)
+
+ Where models are built, validated, and versioned.
+
+ * **Data Ingestion:** Loads historical transaction logs (CSV/Parquet).
+ * **Feature Engineering:**
+     * Calculates Haversine distances.
+     * Generates cyclical time encodings (`hour_sin`, `hour_cos`).
+     * Computes historical aggregates for the Feature Store.
+     * See the notebook markdown at `notebook/credit-card-fraud-detection.md` for details.
+
+
+ * **Experiment Tracking (MLflow):**
+     * Logs hyperparameters (`max_depth`, `learning_rate`).
+     * Logs metrics: Precision, Recall, PR-AUC.
+     * Stores the model artifact and the optimized decision threshold (e.g., `0.895`).
+
+
+
+ ### 🟣 Explainability Layer (Compliance)
+
+ Ensures decisions are transparent and auditable.
+
+ * **SHAP (SHapley Additive exPlanations):**
+     * **Global:** Summary plots to understand model drivers.
+     * **Local:** Waterfall plots generated on-demand for specific high-risk blocks.
+     * **Optimization:** Uses the TreeExplainer with the "JSON Serialization Patch" to handle XGBoost 2.0+ metadata compatibility.
+
+
+
+ ---
+
+ ## 3. Data Flow Lifecycle
+
+ ### Step 1: Request Ingestion
+
+ The Payment Gateway sends a raw transaction payload:
+
+ ```json
+ {
+   "user_id": "u8312",
+   "amt": 150.00,
+   "lat": 40.7128,
+   "long": -74.0060,
+   "category": "grocery_pos",
+   "job": "systems_analyst"
+ }
+
+ ```
+
+ ### Step 2: Real-Time Feature Enrichment
+
+ The API queries Redis to fetch dynamic context that isn't in the payload:
+
+ * *Query:* `ZCARD user:u8312:tx_history` (Transactions in last 24h)
+ * *Result:* `5` (Velocity)
+
+ ### Step 3: Preprocessing & Inference
+
+ The combined vector (Raw + Redis Features) is passed to the Pipeline:
+
+ 1. **Imputation:** Handle missing values.
+ 2. **Encoding:** `category` -> Weight of Evidence (Float).
+ 3. **Scaling:** `amt` -> RobustScaler (centered/scaled).
+ 4. **Prediction:** XGBoost outputs probability `0.982`.
+
+ ### Step 4: Decision & Action
+
+ * **Threshold Check:** `0.982 > 0.895` → **FRAUD**.
+ * **Shadow Mode Check:** If enabled, return `APPROVE` but log `[SHADOW_BLOCK]`.
+ * **Response:**
+ ```json
+ {
+   "status": "BLOCK",
+   "risk_score": 0.982,
+   "latency_ms": 35
+ }
+
+ ```
+
+
+
+ ---
+
+ ## 4. Deployment Strategy
+
+ ### Shadow Deployment (Dark Launch)
+
+ To mitigate risk, new model versions are deployed in "Shadow Mode" before they are allowed to block real money.
+
+ 1. **Deploy:** Version v2 is deployed to production.
+ 2. **Listen:** It receives 100% of traffic but has **no write permissions** (cannot decline cards).
+ 3. **Compare:** We compare v2's "Virtual Blocks" against v1's "Actual Blocks" and customer complaints.
+ 4. **Promote:** If Precision > 95% and complaints == 0, toggle Shadow Mode **OFF**.
+
+ ---
+
+ ## 5. Technology Stack
+
+ | Component | Technology | Rationale |
+ | --- | --- | --- |
+ | **Language** | Python 3.10+ | Standard for ML ecosystem. |
+ | **Package Manager** | uv (Astral) | Extremely fast Python package and project manager. |
+ | **API Framework** | FastAPI | Async native, high performance, auto-documentation. |
+ | **Model** | XGBoost | Best-in-class for tabular fraud data. |
+ | **Feature Store** | Redis | Sub-millisecond latency for sliding windows. |
+ | **Container** | Docker | Reproducible environments. |
+ | **Orchestration** | Docker Compose | Local development and testing. |
+ | **Tracking** | MLflow | Experiment management and artifact versioning. |
+ | **Frontend** | Streamlit | Rapid prototyping of analyst dashboards. |
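The `ZSET` sliding-window velocity mechanism described in the architecture above can be sketched with a small in-memory stand-in. This is illustrative only — the real implementation lives in `src/features/`, which is not shown in this diff, and the class and method names here are hypothetical. Comments note the equivalent Redis commands a client such as redis-py would issue.

```python
import time

WINDOW_SECONDS = 24 * 3600  # 24-hour sliding window, matching sliding_window_hours


class VelocityStore:
    """In-memory stand-in for the Redis ZSET sliding-window counter."""

    def __init__(self):
        # key -> {member: score(timestamp)}, mirroring a Redis sorted set
        self._zsets = {}

    def record_tx(self, user_id, tx_id, ts=None):
        ts = ts if ts is not None else time.time()
        key = f"user:{user_id}:tx_history"
        # Redis equivalent: ZADD user:<id>:tx_history <ts> <tx_id>
        self._zsets.setdefault(key, {})[tx_id] = ts
        # Redis equivalent: ZREMRANGEBYSCORE key 0 (<ts> - window)
        cutoff = ts - WINDOW_SECONDS
        self._zsets[key] = {m: s for m, s in self._zsets[key].items() if s > cutoff}

    def velocity(self, user_id):
        # Redis equivalent: ZCARD user:<id>:tx_history
        return len(self._zsets.get(f"user:{user_id}:tx_history", {}))


store = VelocityStore()
now = 1_700_000_000
store.record_tx("u8312", "tx1", now - 90_000)  # older than 24h
store.record_tx("u8312", "tx2", now - 60)      # recent; recording it prunes tx1
print(store.velocity("u8312"))  # 1
```

In production the prune and count would run in a single Redis pipeline, which is what makes the <2ms lookup figure plausible.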
.coverage ADDED
Binary file (53.2 kB).
 
.dockerignore ADDED
@@ -0,0 +1,34 @@
+ # Git and Version Control
+ .git
+ .gitignore
+ .github
+
+ # Python
+ __pycache__
+ *.py[cod]
+ *$py.class
+ .venv
+ .env
+ .pytest_cache
+ .ruff_cache
+ .coverage
+ htmlcov
+ .ipynb_checkpoints
+
+ # Project Data & Artifacts
+ # models/
+ logs/
+ notebooks/
+ data/
+ *.csv
+ *.parquet
+ *.pkl
+ # *.json
+
+ # UV
+ # uv.lock
+
+ # Docker
+ Dockerfile
+ docker-compose.yml
+ .dockerignore
.env.example ADDED
@@ -0,0 +1,23 @@
+ # Environment Variables
+
+ # Redis Configuration
+ REDIS_HOST=localhost
+ REDIS_PORT=6379
+ REDIS_DB=0
+ REDIS_PASSWORD=
+
+ # MLflow Configuration
+ MLFLOW_TRACKING_URI=http://localhost:5000
+
+ # Application Settings
+ ENV=development
+ LOG_LEVEL=INFO
+
+ # Model Settings
+ MODEL_PATH=models/fraud_model.pkl
+ THRESHOLD=0.895
+
+ shadow_mode=false
+ enable_explainability=true
+ # Performance
+ max_latency_ms=50.0
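As a rough illustration of how the variables in `.env.example` might be consumed: the project's actual settings loader is not shown in this diff, so `load_settings` below is a hypothetical helper (a pydantic `Settings` class would be the more idiomatic choice in a FastAPI service).

```python
import os


def load_settings(env=None):
    """Minimal sketch of reading the variables from .env.example.

    Variable names and defaults mirror the example file; the lowercase
    names (shadow_mode, max_latency_ms) are read as-is.
    """
    env = os.environ if env is None else env
    return {
        "threshold": float(env.get("THRESHOLD", "0.895")),
        "shadow_mode": env.get("shadow_mode", "false").lower() == "true",
        "max_latency_ms": float(env.get("max_latency_ms", "50.0")),
    }


cfg = load_settings({"THRESHOLD": "0.9", "shadow_mode": "true"})
print(cfg)  # {'threshold': 0.9, 'shadow_mode': True, 'max_latency_ms': 50.0}
```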
.github/workflows/main.yml ADDED
@@ -0,0 +1,49 @@
+ name: PayShield CI
+
+ on:
+   push:
+     branches: [ "master" ]
+   pull_request:
+     branches: [ "master" ]
+
+ jobs:
+   lint-and-test:
+     runs-on: ubuntu-latest
+
+     services:
+       redis:
+         image: redis:7-alpine
+         ports:
+           - 6379:6379
+         options: >-
+           --health-cmd "redis-cli ping"
+           --health-interval 10s
+           --health-timeout 5s
+           --health-retries 5
+
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Install uv
+         uses: astral-sh/setup-uv@v5
+         with:
+           version: "latest"
+           enable-cache: true
+
+       - name: Set up Python
+         run: uv python install
+
+       - name: Install the project
+         run: uv sync --all-extras --dev
+
+       - name: Run Ruff (Linter)
+         run: uv run ruff check .
+
+       - name: Run Ruff (Formatter check)
+         run: uv run ruff format --check .
+
+       - name: Run tests with coverage
+         run: uv run pytest --cov=src tests/
+         env:
+           REDIS_HOST: localhost
+           REDIS_PORT: 6379
.gitignore ADDED
@@ -0,0 +1,17 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
+ .env
+
+ .ruff_cache
+ .pytest_cache
+ .vscode
+ mlruns/
+ notebook/dataset/*.csv
.python-version ADDED
@@ -0,0 +1 @@
+ 3.12
Dockerfile ADDED
@@ -0,0 +1,68 @@
+ # Stage 1: Builder
+ FROM python:3.12-slim-trixie AS builder
+
+ # Install uv by copying the binary from the official image
+ COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+
+ # Set working directory
+ WORKDIR /app
+
+ # Enable bytecode compilation
+ ENV UV_COMPILE_BYTECODE=1
+
+ # Copy only dependency files first to leverage Docker layer caching
+ COPY pyproject.toml uv.lock ./
+
+ # Disable development dependencies
+ ENV UV_NO_DEV=1
+
+ # Sync the project into a new environment
+ RUN uv sync --frozen --no-install-project
+
+ # Stage 2: Final
+ FROM python:3.12-slim-trixie
+
+ # Set working directory
+ WORKDIR /app
+
+ # Install system dependencies (curl for healthchecks)
+ RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
+
+ # Create logs directory
+ RUN mkdir -p logs
+
+ # Create a non-root user for security
+ # Note: HF Spaces runs as user 1000 by default, so we align with that
+ RUN useradd -m -u 1000 payshield
+
+ # Set PYTHONPATH to include the app directory
+ ENV PYTHONPATH=/app
+
+ # Copy the virtual environment from the builder stage
+ COPY --from=builder /app/.venv /app/.venv
+
+ # Ensure the installed binary is on the `PATH`
+ ENV PATH="/app/.venv/bin:$PATH"
+
+ # Copy the rest of the application code
+ COPY src /app/src
+ COPY configs /app/configs
+ COPY models /app/models
+ COPY entrypoint.sh /app/entrypoint.sh
+
+ # Change ownership to the non-root user
+ RUN chown -R payshield:payshield /app
+
+ # Switch to the non-root user
+ USER 1000
+
+ # Expose ports
+ # 7860 is the default port for Hugging Face Spaces
+ EXPOSE 7860 8000
+
+ # Set environment variables for HF Spaces
+ ENV API_URL="http://localhost:8000/v1/predict"
+ ENV PORT=7860
+
+ # Use entrypoint script to run both services
+ CMD ["/bin/bash", "/app/entrypoint.sh"]
README.md CHANGED
@@ -0,0 +1,164 @@
+ # 🛡️ PayShield-ML: Real-Time Fraud Engine
+
+ ![Banner](assets/banner_0.jpeg)
+ ![Python](https://img.shields.io/badge/python-3.12-blue.svg)
+ ![FastAPI](https://img.shields.io/badge/FastAPI-0.100+-009688.svg)
+ ![Redis](https://img.shields.io/badge/redis-7.0+-DC382D.svg)
+ ![XGBoost](https://img.shields.io/badge/XGBoost-2.0+-FF6F00.svg)
+ ![Docker](https://img.shields.io/badge/docker-%230db7ed.svg)
+
+ **A production-grade MLOps system providing low-latency fraud detection with stateful feature hydration and human-in-the-loop explainability.**
+
+ ---
+
+ ## 🏗️ System Architecture
+
+ The system follows a **Lambda Architecture** pattern, decoupling the high-latency model training from the low-latency inference path.
+
+ ```mermaid
+ graph TD
+     User[Payment Gateway] -->|POST /predict| API[FastAPI Inference Service]
+
+     subgraph "Online Serving Layer (<50ms)"
+         API -->|1. Hydrate Velocity| Redis[(Redis Feature Store)]
+         API -->|2. Preprocess| Pipeline[Scikit-Learn Pipeline]
+         Pipeline -->|3. Predict| Model[XGBoost Classifier]
+         Model -->|4. Explain| SHAP[SHAP Engine]
+     end
+
+     subgraph "Offline Training Layer"
+         Data[(Transactional Data)] -->|Batch Load| Trainer[Training Pipeline]
+         Trainer -->|CV & Tuning| MLflow[MLflow Tracking]
+         Trainer -->|Export Artifact| Registry[Model Registry]
+     end
+
+     API -->|Async Log| ShadowLogs[Shadow Mode Logs]
+     API -->|Visualize| Dashboard[Analyst Dashboard]
+ ```
+ ### 🖥️ Analyst Workbench
+ ![Dashboard Screenshot](assets/dashboard_screenshot.png)
+ *Real-time interface for fraud analysts to review blocked transactions and SHAP explanations.*
+
+ ---
+
+ ## 🚀 Key Features
+
+ ### 1. Stateful Feature Store (Redis)
+ Traditional stateless APIs struggle with "Velocity Features" (e.g., *how many times did this user swipe in 24 hours?*). Our engine utilizes **Redis Sorted Sets (ZSET)** to maintain rolling windows, allowing feature hydration in **<2ms** with $O(\log N)$ complexity.
+
+ ### 2. Shadow Mode (Dark Launch)
+ To mitigate the risk of model drift or false positives, the system supports a **Shadow Mode** configuration. The model runs in production, receives real traffic, and logs decisions, but never blocks a transaction. This allows for risk-free A/B testing against legacy rule engines.
+
+ ### 3. Business-Aligned Optimization
+ Standard accuracy is misleading in fraud detection due to class imbalance. We implement a **Recall-Constraint Strategy** (Target: 80% Recall) ensuring the model captures the vast majority of fraud while maintaining a strict upper bound on False Positive Rates, as required by financial compliance.
+
+ ### 4. Cold-Start Handling
+ Engineered logic handles new users with zero history by defaulting to global medians and "warm-up" priors, preventing the system from unfairly blocking legitimate first-time customers.
+
+ ---
+
+ ## 📊 Performance & Metrics
+
+ The following metrics were achieved on a hold-out temporal test set (out-of-time validation):
+
+ | Metric | Result | Target / Bench |
+ | :--- | :--- | :--- |
+ | **PR-AUC** | 0.9245 | Excellent |
+ | **Precision** | 93.18% | Low False Positives |
+ | **Recall** | 80.06% | Target: 80% |
+ | **Inference Latency (p95)** | ~30ms | < 50ms |
+
+ ---
+
+ ## 🛠️ Quick Start
+
+ ### Prerequisites
+ - Docker & Docker Compose
+ - (Optional) Python 3.12+ (managed via `uv`)
+
+ ### Installation & Deployment
+
+ 1. **Clone the Repository**
+    ```bash
+    git clone https://github.com/Sibikrish3000/realtime-fraud-engine.git
+    cd realtime-fraud-engine
+    ```
+
+ 2. **Launch the Stack**
+    ```bash
+    docker-compose up --build
+    ```
+    *This starts the FastAPI Backend, Redis Feature Store, and Streamlit Dashboard.*
+
+ 3. **Access the Services**
+    - **Analyst Dashboard:** [http://localhost:8501](http://localhost:8501)
+    - **API Documentation:** [http://localhost:8000/docs](http://localhost:8000/docs)
+    - **Redis Instance:** `localhost:6379`
+
+ ---
+
+ ## 📂 Project Structure
+
+ ```text
+ src/
+ ├── api/              # FastAPI service, schemas (Pydantic), and config
+ ├── features/         # Redis feature store logic and sliding window constants
+ ├── models/           # Training pipelines, metrics calculation, and XGBoost wrappers
+ ├── frontend/         # Streamlit-based analyst workbench
+ ├── data/             # Data ingestion and cleaning utilities
+ └── explainability.py # SHAP-based waterfall plots and global importance
+
+
+ ```
+ See the [Development & MLOps Guide](docs/DEVELOPMENT.md) for detailed instructions on training and local development.
+
+ ---
+
+ ## 📡 API Reference
+
+ ### Predict Transaction
+ `POST /v1/predict`
+
+ **Request:**
+ ```bash
+ curl -X 'POST' \
+   'http://localhost:8000/v1/predict' \
+   -H 'Content-Type: application/json' \
+   -d '{
+     "user_id": "u12345",
+     "trans_date_trans_time": "2024-01-20 14:30:00",
+     "amt": 150.75,
+     "lat": 40.7128,
+     "long": -74.0060,
+     "merch_lat": 40.7306,
+     "merch_long": -73.9352,
+     "category": "grocery_pos"
+   }'
+ ```
+
+ **Response:**
+ ```json
+ {
+   "decision": "APPROVE",
+   "probability": 0.12,
+   "risk_score": 12.0,
+   "latency_ms": 28.5,
+   "shadow_mode": false,
+   "shap_values": {
+     "amt": 0.05,
+     "dist": 0.02
+   }
+ }
+ ```
+
+ ---
+
+ ## 🗺️ Future Roadmap
+
+ - [ ] **Kafka Integration:** Transition to an asynchronous event-driven architecture for high-throughput stream processing.
+ - [ ] **KServe Deployment:** Migrate from standalone FastAPI to KServe for automated scaling and model versioning.
+ - [ ] **Graph Features:** Incorporate Neo4j-based features to detect fraud rings and synthetic identities.
+
+ ---
+ **Author:** [Sibi Krishnamoorthy](https://github.com/Sibikrish3000)
+ *Machine Learning Engineer | Fintech & Risk Analytics*
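The README's "Recall-Constraint Strategy" amounts to picking the highest decision threshold whose recall still meets the target, which maximizes precision subject to that constraint. Below is a minimal pure-Python sketch on toy data; the actual pipeline presumably derives the threshold from a PR-curve analysis (e.g. `sklearn.metrics.precision_recall_curve`), and `pick_threshold` is an illustrative name, not the project's API.

```python
def pick_threshold(y_true, y_prob, min_recall=0.80):
    """Return the highest threshold whose recall still meets min_recall.

    Scanning candidate thresholds from high to low, recall only grows,
    so the first threshold that satisfies the constraint is the most
    precise one that does.
    """
    candidates = sorted(set(y_prob), reverse=True)
    total_pos = sum(y_true)
    for t in candidates:
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        if tp / total_pos >= min_recall:
            return t
    return min(candidates)


# Toy data: 5 fraud cases (label 1), 5 legitimate (label 0)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.99, 0.95, 0.92, 0.90, 0.40, 0.30, 0.20, 0.10, 0.05, 0.01]
t = pick_threshold(y_true, y_prob, min_recall=0.80)
print(t)  # 0.90 — catches 4 of 5 frauds (80% recall) with no false positives
```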
assets/Screenshot_18-1-2026_22489_localhost.jpeg ADDED

Git LFS Details

  • SHA256: fe0692585014893507c4adfaeb3b6fb4a4fb13bb77e5f3909f131251c11d4c2d
  • Pointer size: 131 Bytes
  • Size of remote file: 226 kB
assets/banner.jpeg ADDED

Git LFS Details

  • SHA256: a1728be71a279f0bd5d0902ddc40d0888e69ab4f4e5b365731d8efcbd0c16c49
  • Pointer size: 131 Bytes
  • Size of remote file: 134 kB
assets/banner_0.jpeg ADDED

Git LFS Details

  • SHA256: 5a3aa5bc40a100a8cebdce866dace9a7a62a8f7fb9af76a11de389474ade7cee
  • Pointer size: 131 Bytes
  • Size of remote file: 155 kB
assets/dashboard_screenshot.png ADDED

Git LFS Details

  • SHA256: 25d04e9de048d3aa39e8411af2faf8b924bd0eb5fb60d32ee8ea10b0b260863b
  • Pointer size: 131 Bytes
  • Size of remote file: 241 kB
configs/model_config.yaml ADDED
@@ -0,0 +1,29 @@
+ # Model Training Configuration
+
+ model:
+   # XGBoost Hyperparameters (Aligned with Notebook)
+   n_estimators: 500       # Notebook used 500 trees (script had 100)
+   max_depth: 8            # Notebook used depth 8 (script had 6)
+   learning_rate: 0.1      # Matches notebook
+   subsample: 0.8          # Row sampling for regularization
+   colsample_bytree: 0.8   # Column sampling for regularization
+   scale_pos_weight: 100   # Will be overridden by calculated imbalance ratio
+   eval_metric: "aucpr"
+   random_state: 42
+
+ # Training Settings
+ training:
+   test_size: 0.2
+   validation_size: 0.1
+   random_state: 42
+
+ # Decision Threshold
+ threshold:
+   optimal_threshold: 0.9016819596290588  # From notebook PR curve analysis
+   min_precision: 0.80                    # Business requirement
+
+ # Feature Engineering
+ features:
+   use_cyclical_encoding: true
+   use_distance_features: true
+   use_temporal_features: true
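The `use_cyclical_encoding` and `use_distance_features` flags correspond to standard transforms: mapping hour-of-day onto the unit circle (so 23:00 and 00:00 end up close together) and computing the great-circle distance between cardholder and merchant. A minimal sketch, where the function names are illustrative rather than the project's actual API:

```python
import math


def cyclical_hour(hour):
    # use_cyclical_encoding: hour_sin / hour_cos features
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def haversine_km(lat1, lon1, lat2, lon2):
    # use_distance_features: great-circle distance in km
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


print(cyclical_hour(6))  # (1.0, ~0.0): 06:00 sits a quarter-turn around the circle
print(haversine_km(40.7128, -74.0060, 40.7306, -73.9352))  # ~6.3 km (NYC example)
```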
configs/redis_config.yaml ADDED
@@ -0,0 +1,19 @@
+ # Redis Feature Store Configuration
+
+ # Connection Settings
+ host: localhost
+ port: 6379
+ db: 0
+ password: null  # Set via REDIS_PASSWORD env var in production
+
+ # Connection Pool
+ max_connections: 50
+ socket_connect_timeout: 2  # seconds
+ socket_timeout: 1          # seconds
+
+ # Feature Store Settings
+ ema_alpha: 0.08   # 2/(24+1) for 24-hour window
+ key_ttl: 604800   # 7 days in seconds
+
+ # Window Configuration
+ sliding_window_hours: 24
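The `ema_alpha` comment reflects the span-based smoothing factor α = 2/(N+1), here 2/25 = 0.08 for the 24-hour window. A quick check, with a per-user spend baseline update shown for illustration (what exactly the EMA tracks is an assumption; the consuming code is not in this diff):

```python
def ema_alpha(window_hours):
    # Span-based smoothing factor: alpha = 2 / (N + 1)
    return 2 / (window_hours + 1)


def ema_update(prev, value, alpha):
    # Standard exponential-moving-average update
    return alpha * value + (1 - alpha) * prev


alpha = ema_alpha(24)
print(alpha)  # 0.08
# Hypothetical: user's average spend baseline was 100.0, new tx is 150.0
print(ema_update(100.0, 150.0, alpha))  # ~104.0
```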
docker-compose.yml ADDED
@@ -0,0 +1,63 @@
+ version: '3.8'
+
+ services:
+   redis:
+     image: redis:7-alpine
+     container_name: payshield-redis
+     ports:
+       - "6379:6379"
+     volumes:
+       - redis_data:/data
+     healthcheck:
+       test: ["CMD", "redis-cli", "ping"]
+       interval: 5s
+       timeout: 3s
+       retries: 5
+     networks:
+       - payshield-network
+
+   api:
+     build:
+       context: .
+       dockerfile: Dockerfile
+     container_name: payshield-api
+     depends_on:
+       redis:
+         condition: service_healthy
+     ports:
+       - "8000:8000"
+     environment:
+       - REDIS_HOST=redis
+       - REDIS_PORT=6379
+       - MODEL_PATH=/app/models/fraud_model.pkl
+       - THRESHOLD_PATH=/app/models/threshold.json
+       - SHADOW_MODE=false
+     volumes:
+       - ./models:/app/models:ro
+       - ./logs:/app/logs
+     networks:
+       - payshield-network
+
+   # Placeholder for Phase 6 Dashboard
+   dashboard:
+     build:
+       context: .
+       dockerfile: Dockerfile
+     container_name: payshield-dashboard
+     depends_on:
+       - api
+     command: ["streamlit", "run", "src/frontend/app.py", "--server.port", "8501", "--server.address", "0.0.0.0"]
+     ports:
+       - "8501:8501"
+     environment:
+       - API_URL=http://api:8000/v1/predict
+     networks:
+       - payshield-network
+     restart: unless-stopped
+
+ networks:
+   payshield-network:
+     driver: bridge
+
+ volumes:
+   redis_data:
docker/docker-compose.yml ADDED
@@ -0,0 +1,42 @@
+ version: '3.8'
+
+ services:
+   # Redis Feature Store
+   redis:
+     image: redis:7-alpine
+     container_name: payshield-redis
+     ports:
+       - "6379:6379"
+     volumes:
+       - redis_data:/data
+     command: redis-server --appendonly yes
+     healthcheck:
+       test: ["CMD", "redis-cli", "ping"]
+       interval: 5s
+       timeout: 3s
+       retries: 5
+     networks:
+       - payshield-network
+
+   # MLflow Tracking Server (for future phases)
+   mlflow:
+     image: ghcr.io/mlflow/mlflow:v2.10.0
+     container_name: payshield-mlflow
+     ports:
+       - "5000:5000"
+     environment:
+       - MLFLOW_BACKEND_STORE_URI=sqlite:///mlflow/mlflow.db
+       - MLFLOW_ARTIFACTS_DESTINATION=/mlflow/artifacts
+     volumes:
+       - mlflow_data:/mlflow
+     command: mlflow server --host 0.0.0.0 --port 5000
+     networks:
+       - payshield-network
+
+ volumes:
+   redis_data:
+   mlflow_data:
+
+ networks:
+   payshield-network:
+     driver: bridge
docs/DEVELOPMENT.md ADDED
@@ -0,0 +1,135 @@
+ # 🛠️ Development & MLOps Guide
+
+ This guide provides detailed instructions on setting up the environment, running experiments, and developing the PayShield-ML system.
+
+ ---
+
+ ## 🏗️ 1. Environment Setup
+
+ The project uses `uv` for lightning-fast Python package and project management.
+
+ ### Prerequisites
+ - [uv](https://github.com/astral-sh/uv) installed
+ - Docker & Docker Compose
+ - Redis (can be run via Docker)
+
+ ### Installation
+ ```bash
+ # Sync dependencies and create virtual environment
+ uv sync
+
+ # Activate the environment
+ source .venv/bin/activate
+ ```
+
+ ---
+
+ ## 📊 2. MLflow Tracking
+
+ We use MLflow to track hyperparameters, metrics, and model artifacts.
+
+ ### Start MLflow Server
+ Run this in a separate terminal to view the UI:
+ ```bash
+ uv run mlflow ui --host 0.0.0.0 --port 5000
+ ```
+ Then access the dashboard at [http://localhost:5000](http://localhost:5000).
+
+ ---
+
+ ## 🚂 3. Model Training Pipeline
+
+ The training script handles data ingestion, feature engineering, cross-validation, and MLflow logging.
+
+ ### Basic Training
+ ```bash
+ uv run python src/models/train.py --data_path data/fraud_sample.csv
+ ```
+
+ ### Advanced Training with Custom Params
+ ```bash
+ uv run python src/models/train.py \
+   --data_path data/fraud_sample.csv \
+   --experiment_name fraud_v2 \
+   --min_recall 0.85 \
+   --output_dir models/
+ ```
+
+ | Argument | Description | Default |
+ | :--- | :--- | :--- |
+ | `--data_path` | Path to CSV/Parquet training data | (Required) |
+ | `--params_path` | Path to model config YAML | `configs/model_config.yaml` |
+ | `--experiment_name` | MLflow experiment grouping | `fraud_detection` |
+ | `--min_recall` | Target recall for threshold optimization | `0.80` |
+
+ ---
+
+ ## 🔌 4. Running Services Locally
+
+ For rapid iteration without rebuilding Docker images.
+
+ ### Start Redis (Required)
+ ```bash
+ docker run -d --name payshield-redis -p 6379:6379 redis:7-alpine
+ ```
+
+ ### Start FastAPI Backend
+ ```bash
+ # Running with hot-reload
+ uv run uvicorn src.api.main:app --host 0.0.0.0 --port 8000 --reload
+ ```
+ - **Swagger Docs:** [http://localhost:8000/docs](http://localhost:8000/docs)
+
+ ### Start Streamlit Dashboard
+ ```bash
+ export API_URL="http://localhost:8000/v1/predict"
+ uv run streamlit run src/frontend/app.py --server.port 8501
+ ```
+ - **Dashboard:** [http://localhost:8501](http://localhost:8501)
+
+ ---
+
+ ## 🐳 5. Full-Stack Development (Docker Compose)
+
+ The easiest way to replicate a production-like environment.
+
+ ### Build and Launch
+ ```bash
+ docker-compose up --build
+ ```
+
+ ### Useful Commands
+ ```bash
+ # Run in background
+ docker-compose up -d
+
+ # Check all service logs
+ docker-compose logs -f
+
+ # Stop and remove containers
+ docker-compose down
+ ```
+
+ ### Service Map
+ | Service | Port | Endpoint |
+ | :--- | :--- | :--- |
+ | **API** | 8000 | `http://localhost:8000` |
+ | **Dashboard** | 8501 | `http://localhost:8501` |
+ | **Redis** | 6379 | `localhost:6379` |
+
+ ---
+
+ ## 🧪 6. Testing
+
+ We use `pytest` for unit and integration tests.
+
+ ```bash
+ # Run all tests
+ uv run pytest
+
+ # Run with coverage report
+ uv run pytest --cov=src
+ ```
+
+ ---
+ **Note:** Ensure you have the `models/fraud_model.pkl` and `models/threshold.json` artifacts present before starting the API. These are generated by the training pipeline.
entrypoint.sh ADDED
@@ -0,0 +1,7 @@
+ #!/bin/bash
+
+ # Start FastAPI in the background
+ uvicorn src.api.main:app --host 0.0.0.0 --port 8000 &
+
+ # Start Streamlit in the foreground
+ streamlit run src/frontend/app.py --server.port 7860 --server.address 0.0.0.0
logs/shadow_predictions.jsonl ADDED
@@ -0,0 +1,55 @@
+ {"timestamp": "2026-01-18T12:37:19.214473Z", "user_id": "u12345", "amt": 150.0, "category": "grocery_pos", "real_decision": "APPROVE", "probability": 9.402751288689615e-07, "latency_ms": 72.96919822692871, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:44:56.721604Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.37005043029785, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:47:36.067083Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 54.149627685546875, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:47:39.609355Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 25.424957275390625, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:47:59.207044Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 50.46653747558594, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:01.070087Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 51.763057708740234, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:01.865266Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 29.25252914428711, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:02.489046Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 23.40531349182129, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:03.161965Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 44.34394836425781, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:03.765913Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.629220962524414, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:04.394367Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 27.244091033935547, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:04.986422Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 45.80378532409668, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:05.901805Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 51.35393142700195, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:06.666034Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 51.82218551635742, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:07.345695Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 48.212528228759766, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:07.919900Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 26.331424713134766, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:08.553960Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 41.85080528259277, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:09.265804Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 68.79472732543945, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:09.853970Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.57653045654297, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:10.481686Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 21.60501480102539, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:11.133950Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 50.62055587768555, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:11.734621Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.61753845214844, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:12.292397Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 21.454334259033203, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:12.847376Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 42.4959659576416, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:13.340940Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 47.911882400512695, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:13.923384Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 59.66687202453613, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:14.445859Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 50.5223274230957, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:14.990366Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 24.456501007080078, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:15.709852Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 45.6852912902832, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:15.913553Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 25.067567825317383, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:16.193654Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 44.05331611633301, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:16.345675Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.55793380737305, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:16.590079Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.01242256164551, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:16.789928Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 46.65184020996094, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:16.971762Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 29.859066009521484, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:17.185712Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.96466636657715, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:17.345695Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 39.675235748291016, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:17.524961Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 40.1458740234375, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:17.683832Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 24.51038360595703, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:17.851388Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 23.29730987548828, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:18.065689Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 47.719478607177734, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:18.381860Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 45.37034034729004, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:18.830235Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 49.47638511657715, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:19.041672Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 41.09072685241699, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:19.269802Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 37.813663482666016, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:19.570256Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 30.001163482666016, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:19.910120Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 47.04141616821289, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:20.254294Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 52.63257026672363, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:20.520307Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 24.410486221313477, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:20.933840Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 2.6684321596803784e-07, "latency_ms": 42.56725311279297, "shadow_mode": true}
+ {"timestamp": "2026-01-18T12:48:58.996285Z", "user_id": "u12345", "amt": 2500.0, "category": "shopping_net", "real_decision": "BLOCK", "probability": 0.999806821346283, "latency_ms": 38.36369514465332, "shadow_mode": true}
+ {"timestamp": "2026-01-18T14:12:06.614568Z", "user_id": "u12345", "amt": 150.0, "category": "grocery_pos", "real_decision": "APPROVE", "probability": 8.649675874039531e-05, "latency_ms": 86.48848533630371, "shadow_mode": true}
+ {"timestamp": "2026-01-18T14:15:46.042042Z", "user_id": "u12345", "amt": 150.0, "category": "misc_net", "real_decision": "APPROVE", "probability": 7.321957014028158e-07, "latency_ms": 86.42172813415527, "shadow_mode": true}
+ {"timestamp": "2026-01-18T14:16:12.912011Z", "user_id": "u12345", "amt": 1500.0, "category": "misc_net", "real_decision": "BLOCK", "probability": 0.9997040629386902, "latency_ms": 49.93748664855957, "shadow_mode": true}
+ {"timestamp": "2026-01-18T14:17:14.040831Z", "user_id": "u12345", "amt": 1500.0, "category": "misc_net", "real_decision": "BLOCK", "probability": 0.9997040629386902, "latency_ms": 43.45536231994629, "shadow_mode": true}
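Because the shadow log is plain JSONL, offline analysis needs nothing beyond the standard library. A sketch (field names taken from the log itself; the inline `sample` stands in for reading `logs/shadow_predictions.jsonl`):

```python
import json
from statistics import mean

# Three representative lines from the shadow log, inlined so this runs standalone.
sample = """\
{"real_decision": "APPROVE", "probability": 9.4e-07, "latency_ms": 72.97}
{"real_decision": "BLOCK", "probability": 0.9998, "latency_ms": 38.36}
{"real_decision": "BLOCK", "probability": 0.9997, "latency_ms": 49.94}
"""

records = [json.loads(line) for line in sample.splitlines()]
blocked = sum(1 for r in records if r["real_decision"] == "BLOCK")
avg_latency = mean(r["latency_ms"] for r in records)
print(blocked, round(avg_latency, 2))  # 2 53.76
```

In practice you would replace `sample.splitlines()` with iterating over the open log file.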
main.py ADDED
@@ -0,0 +1,6 @@
+ def main():
+     print("Hello from payshield-ml!")
+
+
+ if __name__ == "__main__":
+     main()
models/fraud_model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9be403668e04f1458d5eb5e0dd805495e7f6fa1f322dc8b006dcb2fccc78061d
+ size 3933261
models/threshold.json ADDED
@@ -0,0 +1,10 @@
+ {
+     "optimal_threshold": 0.8199467062950134,
+     "metrics": {
+         "precision": 0.9318377911993098,
+         "recall": 0.8005930318754633,
+         "f1": 0.861244019138756,
+         "pr_auc": 0.924508450881234,
+         "threshold_used": 0.8199467062950134
+     }
+ }
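The stored metrics are internally consistent: F1 is the harmonic mean of precision and recall, which can be checked directly from the values above:

```python
# Values copied from models/threshold.json
precision = 0.9318377911993098
recall = 0.8005930318754633

# F1 = harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # 0.861244, matching the "f1" field above
```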
notebook/.ipynb_checkpoints/Untitled-checkpoint.ipynb ADDED
@@ -0,0 +1,6 @@
+ {
+  "cells": [],
+  "metadata": {},
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }
notebook/.ipynb_checkpoints/credit-card-fraud-detection-checkpoint.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebook/README.md ADDED
@@ -0,0 +1,122 @@
+ # 🛡️ Behavioral Fraud Detection System (XGBoost + SHAP)
+
+ **A production-grade machine learning pipeline that identifies fraudulent credit card transactions with 97% precision and explainable AI.**
+
+ ## 💼 Executive Summary & Business Impact
+
+ Financial fraud detection is not just about accuracy; it is about the **Precision-Recall trade-off**. A model that flags too many legitimate transactions (false positives) causes customer churn, while missed fraud (false negatives) causes direct financial loss.
+
+ This project implements a cost-sensitive **XGBoost** classifier engineered to minimize **operational friction**. By optimizing the decision threshold to **0.895**, the system achieved:
+
+ | Metric | Performance | Business Value |
+ | --- | --- | --- |
+ | **Precision** | **97%** | Only 3% of alerts are false alarms, drastically reducing manual review costs. |
+ | **Recall** | **80%** | Captures 80% of all fraud attempts (approx. $810k in prevented loss). |
+ | **Net Savings** | **$810,470** | Calculated ROI on the test set (loss prevented minus operational costs). |
+ | **Latency** | **<50ms** | Inference speed optimized via `scikit-learn` Pipeline serialization. |
+
+ ---
+
+ ## 🏗️ Technical Architecture
+
+ The solution moves beyond basic "fit-predict" workflows by implementing a robust preprocessing pipeline designed to prevent **data leakage** and handle extreme class imbalance (0.5% fraud rate).
+
+ ### 1. Advanced Feature Engineering
+
+ Raw transaction fields alone are insufficient for detecting modern fraud, so I engineered **14 behavioral features** to capture context:
+
+ * **Velocity Metrics (`trans_count_24h`, `amt_to_avg_ratio_24h`)**: Detect "burst" behavior where a card is used rapidly or for amounts exceeding the user's historical norm.
+ * **Geospatial Analysis (`distance_km`)**: Calculates the Haversine distance between the cardholder's home and the merchant.
+ * **Cyclical Temporal Encoding (`hour_sin`, `hour_cos`)**: Captures high-risk time windows (e.g., 3 AM surges) while preserving the continuity of the 24-hour cycle.
+ * **Risk Profiling (`WOEEncoder`)**: Replaces high-cardinality categorical features (merchant, job) with their "Weight of Evidence": a measure of how strongly a specific category supports the "fraud" hypothesis.
+
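The three feature families above can be sketched with the standard library alone (function names here are illustrative, not the notebook's actual code):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between cardholder and merchant, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def hour_cyclical(hour):
    """Encode hour-of-day on the unit circle so 23:00 and 00:00 stay adjacent."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def weight_of_evidence(frauds_in_cat, frauds_total, goods_in_cat, goods_total):
    """WOE: log of the ratio between a category's fraud share and its legitimate share."""
    return math.log((frauds_in_cat / frauds_total) / (goods_in_cat / goods_total))

# A short hop across Manhattan is under a kilometre:
print(round(haversine_km(40.7128, -74.0060, 40.7200, -74.0100), 2))
print(hour_cyclical(6))  # (1.0, ~0.0): 6 AM sits at the peak of the sine component
```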
+ ### 2. The Pipeline
+
+ To ensure production stability, all steps are wrapped in a single scikit-learn `Pipeline`:
+
+ ```python
+ pipeline = Pipeline(steps=[
+     ('preprocessor', ColumnTransformer(transformers=[
+         ('cat', WOEEncoder(), ['job', 'category']),
+         ('num', RobustScaler(), numerical_features)
+     ])),
+     ('classifier', XGBClassifier(scale_pos_weight=imbalance_ratio, ...))
+ ])
+ ```
+
+ ---
+
+ ## 📊 Model Performance
+
+ ### Precision-Recall Strategy
+
+ Instead of optimizing for ROC-AUC (which can be misleading on imbalanced datasets), I optimized for **PR-AUC (0.998)**.
+
+ * **Default Threshold (0.50):** Precision was 65%. Too many false alarms.
+ * **Optimized Threshold (0.895):** Precision increased to **97%**, with minimal loss in recall.
+
+ *(Insert your Precision-Recall Curve image here)*
+
+ ---
+
+ ## 🔍 Explainability & "The Why"
+
+ Black-box models are dangerous in finance. I implemented **SHAP (SHapley Additive exPlanations)** to provide reason codes for every decision.
+
+ ### The "Smoking Gun" (Fraud Example)
+
+ For a transaction flagged with **99.9% confidence**, the SHAP waterfall plot reveals the exact drivers:
+
+ 1. **`amt_log` (+8.83)**: The transaction amount was significantly higher than normal.
+ 2. **`hour_sin` (+2.46)**: The transaction occurred during a high-risk time window (late night).
+ 3. **`job` (+1.72)**: The cardholder's profession falls into a statistically higher-risk segment.
+ 4. **`amt_to_avg_ratio_24h` (+1.24)**: The amount was an outlier *specifically* for this user's 24-hour history.
+
+ *(Insert your SHAP Waterfall Plot image here)*
+
+ ---
+
+ ## 🚀 How to Run
+
+ ### Prerequisites
+
+ ```bash
+ pip install pandas xgboost category_encoders shap scikit-learn
+ ```
+
+ ### Inference
+
+ The model is serialized as a `.pkl` file. You can load it to predict on new data immediately without re-training.
+
+ ```python
+ import joblib
+ import pandas as pd
+
+ # Load the production pipeline
+ model = joblib.load('fraud_detection_model_v1.pkl')
+
+ # Define a new transaction (example)
+ new_transaction = pd.DataFrame([{
+     'amt_log': 5.2,
+     'distance_km': 120.5,
+     'trans_count_24h': 12,
+     'amt_to_avg_ratio_24h': 4.5,
+     # ... include all 14 features
+ }])
+
+ # Get prediction (probability of fraud)
+ fraud_prob = model.predict_proba(new_transaction)[:, 1]
+ is_fraud = (fraud_prob >= 0.895).astype(int)
+
+ print(f"Fraud Probability: {fraud_prob[0]:.4f}")
+ print(f"Action: {'BLOCK' if is_fraud[0] else 'APPROVE'}")
+ ```
+
+ ---
+
+ ### **Author**
+
+ **Sibi Krishnamoorthy**
+ *Machine Learning Engineer | Fintech & Risk Analytics*
notebook/credit-card-fraud-detection.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebook/credit-card-fraud-detection.md ADDED
The diff for this file is too large to render. See raw diff
 
notebook/credit-card-fraud-detection.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1f604362aa5035ae24f4b5b1dcd4d67926ee9246030eba352eb694ca05fd18aa
+ size 817131
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_23_0.png ADDED

Git LFS Details

  • SHA256: 88ffca8cc308c74d266feabb0de1ab67df6350b8d4b44798ee4343ae8ff0acf5
  • Pointer size: 130 Bytes
  • Size of remote file: 41.9 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_25_1.png ADDED

Git LFS Details

  • SHA256: ec441ac40b9528b442b4e16b82518a8ce46bc52af746a5c3aa67538f47f4f1e8
  • Pointer size: 130 Bytes
  • Size of remote file: 14.9 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_26_1.png ADDED

Git LFS Details

  • SHA256: 91aec463d1c6742e9f7946858926c1db6c5ee1b775db8d9895366d82e198b465
  • Pointer size: 130 Bytes
  • Size of remote file: 29.9 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_28_1.png ADDED

Git LFS Details

  • SHA256: 7a8174cf93ce7eaf8c5363509fb410c7759ee1bfce43ee0ceda0543484a594e8
  • Pointer size: 130 Bytes
  • Size of remote file: 59.2 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_29_1.png ADDED

Git LFS Details

  • SHA256: 89719cdfbdd9d862df5275f0ef5f164e7325e82b1835b0612620a014aeedad69
  • Pointer size: 130 Bytes
  • Size of remote file: 60.2 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_30_0.png ADDED

Git LFS Details

  • SHA256: 0570aba24636b052993e0cb228f5cfd0c7afbb7c5dcb0301a16bf34c3213cbcb
  • Pointer size: 130 Bytes
  • Size of remote file: 59.5 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_32_0.png ADDED

Git LFS Details

  • SHA256: 14a9f5ef856ee4f33aaa939bac4d63f4f262757d8ffdad380756331219ccd8f3
  • Pointer size: 130 Bytes
  • Size of remote file: 17.8 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_34_1.png ADDED

Git LFS Details

  • SHA256: 11d0708dfd6f1403d04f04fb3e64c5d2cfe26b0647c706c85d532fcbcd3aeeca
  • Pointer size: 130 Bytes
  • Size of remote file: 25.8 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_40_1.png ADDED

Git LFS Details

  • SHA256: 193a7c99eaae83869e529ad3a877c9161a49b2bcc8a3513e4e892f6b429d70bb
  • Pointer size: 130 Bytes
  • Size of remote file: 15.4 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_58_0.png ADDED

Git LFS Details

  • SHA256: be3db171ddbd2eb4327efcc89366d6ba50c272fae85efd980a8e8abfb1d21214
  • Pointer size: 130 Bytes
  • Size of remote file: 63.1 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_59_0.png ADDED

Git LFS Details

  • SHA256: 48c2d93bf385e18fa45c9c96edeb99443b55d34064285f6a7eb5f3ce5291606d
  • Pointer size: 130 Bytes
  • Size of remote file: 49.7 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_60_0.png ADDED

Git LFS Details

  • SHA256: bb83069c4308aaf513b127ca55e46b1d6bcfccd278a5fe722649493ac6fc9df7
  • Pointer size: 130 Bytes
  • Size of remote file: 41.9 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_68_0.png ADDED

Git LFS Details

  • SHA256: aa61ce6ca4f34ca70fab9e28bf3465aeac9792248032e6f3f129febe52d00c90
  • Pointer size: 131 Bytes
  • Size of remote file: 103 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_70_0.png ADDED

Git LFS Details

  • SHA256: 4f6571337f98d8023d78afa3c172296bbf81c6d2e58b1d96a3973feb69cbc334
  • Pointer size: 130 Bytes
  • Size of remote file: 65.7 kB
notebook/credit-card-fraud-detection_files/credit-card-fraud-detection_72_1.png ADDED

Git LFS Details

  • SHA256: af5649b380ca39394ce291aab39bd04aaec5697dc3d376e2808f66ad4442dab8
  • Pointer size: 130 Bytes
  • Size of remote file: 61.6 kB
notebook/dataset/download.py ADDED
@@ -0,0 +1,27 @@
+ #!/usr/bin/env -S uv run --script
+ #
+ # /// script
+ # dependencies = [
+ #     "pandas",
+ #     "rich",
+ # ]
+ # ///
+ import os
+ import zipfile
+
+ import pandas as pd
+
+ path = os.path.join(os.path.dirname(__file__), "dataset")
+
+
+ def download_data():
+     os.makedirs(path, exist_ok=True)  # ensure the target directory exists before downloading
+     os.system(f"curl -L -o {path}/fraud-detection.zip https://www.kaggle.com/api/v1/datasets/download/kartik2112/fraud-detection")
+
+     with zipfile.ZipFile(f"{path}/fraud-detection.zip", "r") as zip_ref:
+         zip_ref.extractall(f"{path}/")
+
+
+ def combine_data():
+     train = pd.read_csv(f"{path}/fraud-detection/fraudTrain.csv")
+     test = pd.read_csv(f"{path}/fraud-detection/fraudTest.csv")
+     df = pd.concat([train, test], axis=0).reset_index(drop=True)
+     df.to_csv(f"{path}/fraud.csv", index=False)
+
+
+ if __name__ == "__main__":
+     download_data()
+     combine_data()
pyproject.toml ADDED
@@ -0,0 +1,81 @@
+ [project]
+ name = "payshield-ml"
+ version = "0.1.0"
+ description = "Real-Time Financial Fraud Detection System with sub-50ms latency"
+ readme = "README.md"
+ requires-python = ">=3.12"
+ dependencies = [
+     # ML Stack
+     "xgboost>=2.0.0",
+     "scikit-learn>=1.4.0",
+     "shap>=0.50.0",  # Use latest stable version compatible with Python 3.12
+     "mlflow>=2.10.0",
+     # API Framework
+     "fastapi>=0.109.0",
+     "uvicorn[standard]>=0.27.0",
+     "pydantic>=2.5.0",
+     # Data Processing
+     "pandas>=2.2.0",
+     "numpy>=2.0.0",
+     "pyarrow>=15.0.0",
+     # Feature Store
+     "redis>=5.0.0",
+     "hiredis>=2.3.0",
+     # Utilities
+     "python-dotenv>=1.0.0",
+     "pyyaml>=6.0.1",
+     "joblib>=1.5.0",
+     "category-encoders>=2.6.0",
+     "seaborn>=0.13.2",
+     "matplotlib>=3.10.8",
+     "pydantic-settings>=2.12.0",
+     "streamlit>=1.53.0",
+     "plotly>=6.5.2",
+     "requests>=2.32.5",
+     "typing-extensions>=4.15.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     # Testing
+     "pytest>=8.0.0",
+     "pytest-asyncio>=0.23.0",
+     "pytest-cov>=4.1.0",
+
+     # Type Checking & Linting
+     "mypy>=1.8.0",
+     "ruff>=0.1.0",
+
+     # Notebook Support
+     "jupyter>=1.0.0",
+     "ipykernel>=6.29.0",
+ ]
+
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
+
+ [tool.pytest.ini_options]
+ testpaths = ["tests"]
+ python_files = "test_*.py"
+ python_classes = "Test*"
+ python_functions = "test_*"
+ addopts = "-v --cov=src --cov-report=term-missing"
+
+ [tool.mypy]
+ python_version = "3.12"
+ warn_return_any = true
+ warn_unused_configs = true
+ disallow_untyped_defs = true
+
+ [tool.ruff]
+ line-length = 100
+ target-version = "py312"
+
+ [tool.hatch.build.targets.wheel]
+ packages = ["src"]
+
+ [dependency-groups]
+ dev = [
+     "ruff>=0.14.13",
+ ]
scripts/demo_phase1.py ADDED
@@ -0,0 +1,84 @@
+ """
+ Example script demonstrating the data ingestion and feature store usage.
+
+ This shows how to use the modules we just built.
+ """
+
+ import time
+
+ from src.data.ingest import InferenceTransactionSchema
+ from src.features.store import RedisFeatureStore
+
+
+ def main():
+     """Demonstrate basic usage of Phase 1 modules."""
+     print("=== PayShield-ML Phase 1 Demo ===\n")
+
+     # 1. Data Validation Example
+     print("1. Testing Data Validation...")
+     try:
+         valid_transaction = InferenceTransactionSchema(
+             user_id="u12345",
+             amt=150.00,
+             lat=40.7128,
+             long=-74.0060,
+             category="grocery_pos",
+             job="Engineer, biomedical",
+             merch_lat=40.7200,
+             merch_long=-74.0100,
+             unix_time=1234567890,
+         )
+         print(f"  ✓ Valid transaction: ${valid_transaction.amt:.2f}")
+     except Exception as e:
+         print(f"  ✗ Validation failed: {e}")
+
+     # 2. Feature Store Example
+     print("\n2. Testing Feature Store...")
+     try:
+         store = RedisFeatureStore(host="localhost", port=6379, db=0)
+
+         # Check health
+         health = store.health_check()
+         print(f"  ✓ Redis connection: {health['status']} (ping: {health['ping_ms']}ms)")
+
+         # Add some transactions
+         user_id = "demo_user_001"
+         base_time = int(time.time())
+
+         print(f"\n  Adding 3 transactions for {user_id}...")
+         for i, amount in enumerate([50.00, 75.00, 100.00]):
+             store.add_transaction(
+                 user_id=user_id,
+                 amount=amount,
+                 timestamp=base_time + (i * 3600),  # 1 hour apart
+             )
+             print(f"    Transaction {i + 1}: ${amount:.2f}")
+
+         # Get features
+         features = store.get_features(user_id, current_timestamp=base_time + 7200)
+         print("\n  Features computed:")
+         print(f"    - Transaction count (24h): {features['trans_count_24h']:.0f}")
+         print(f"    - Average spend (24h): ${features['avg_spend_24h']:.2f}")
+
+         # Get history
+         history = store.get_transaction_history(user_id, lookback_hours=24)
+         print(f"\n  Transaction history ({len(history)} transactions):")
+         for ts, amt in history[:3]:
+             print(f"    - Timestamp {ts}: ${amt:.2f}")
+
+         # Cleanup
+         deleted = store.delete_user_data(user_id)
+         print(f"\n  ✓ Cleaned up {deleted} keys")
+
+     except Exception as e:
+         print(f"  ✗ Feature store error: {e}")
+         print("    Make sure Redis is running: docker-compose -f docker/docker-compose.yml up -d redis")
+
+     print("\n=== Demo Complete ===")
+
+
+ if __name__ == "__main__":
+     main()
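The windowed aggregates the demo prints can be expressed independently of Redis. A stdlib sketch of the 24-hour window logic (illustrative only; the real `RedisFeatureStore` computes these server-side):

```python
def window_features(history, now, window_s=24 * 3600):
    """history: list of (unix_ts, amount) pairs.
    Returns (transaction count, average amount) within the trailing window."""
    recent = [amt for ts, amt in history if 0 <= now - ts <= window_s]
    count = len(recent)
    avg = sum(recent) / count if count else 0.0
    return count, avg

# Mirror the demo: three transactions spaced one hour apart.
base = 1_700_000_000
history = [(base, 50.0), (base + 3600, 75.0), (base + 7200, 100.0)]
print(window_features(history, now=base + 7200))  # (3, 75.0)
```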
scripts/verify_phase1.py ADDED
@@ -0,0 +1,192 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Quick verification script for Phase 1 implementation.
4
+ Tests imports and basic module functionality without requiring Redis.
5
+ """
6
+
7
+
8
+ def test_imports():
9
+ """Test that all Phase 1 modules can be imported."""
10
+ print("Testing imports...")
11
+
12
+ try:
13
+ from src.data.ingest import TransactionSchema, InferenceTransactionSchema, load_dataset
14
+
15
+ print(" ✓ Data ingestion module imports successfully")
16
+ except ImportError as e:
17
+ print(f" ✗ Data ingestion import failed: {e}")
18
+ return False
19
+
20
+ try:
21
+ from src.features.store import RedisFeatureStore
22
+
23
+ print(" ✓ Feature store module imports successfully")
24
+ except ImportError as e:
25
+ print(f" ✗ Feature store import failed: {e}")
26
+ return False
27
+
28
+ try:
29
+ from src.features.constants import category_names, job_names
30
+
31
+ print(
32
+ f" ✓ Constants module loaded ({len(category_names)} categories, {len(job_names)} jobs)"
33
+ )
34
+ except ImportError as e:
35
+ print(f" ✗ Constants import failed: {e}")
36
+ return False
37
+
38
+ return True
39
+
40
+
41
+ def test_pydantic_validation():
42
+ """Test Pydantic validation logic."""
43
+ print("\nTesting Pydantic validation...")
44
+
45
+ from src.data.ingest import TransactionSchema, InferenceTransactionSchema
46
+ from pydantic import ValidationError
47
+
48
+ # Test valid transaction
49
+ try:
50
+ valid_data = {
51
+ "trans_date_trans_time": "2019-01-01 00:00:18",
52
+ "cc_num": 2703186189652095,
53
+            "merchant": "Test Merchant",
+            "category": "misc_net",
+            "amt": 100.50,
+            "first": "John",
+            "last": "Doe",
+            "gender": "M",
+            "street": "123 Main St",
+            "city": "Springfield",
+            "state": "IL",
+            "zip": 62701,
+            "lat": 39.7817,
+            "long": -89.6501,
+            "city_pop": 100000,
+            "job": "Engineer, biomedical",
+            "dob": "1990-01-01",
+            "trans_num": "abc123",
+            "unix_time": 1325376018,
+            "merch_lat": 39.7900,
+            "merch_long": -89.6600,
+            "is_fraud": 0,
+        }
+        schema = TransactionSchema(**valid_data)
+        print(f" ✓ Valid transaction accepted (amt: ${schema.amt:.2f})")
+    except ValidationError as e:
+        print(f" ✗ Valid transaction rejected: {e}")
+        return False
+
+    # Test invalid amount
+    try:
+        invalid_data = valid_data.copy()
+        invalid_data["amt"] = -50.00
+        schema = TransactionSchema(**invalid_data)
+        print(" ✗ Negative amount not rejected!")
+        return False
+    except ValidationError:
+        print(" ✓ Negative amount correctly rejected")
+
+    # Test invalid coordinates
+    try:
+        invalid_data = valid_data.copy()
+        invalid_data["lat"] = 200.0
+        schema = TransactionSchema(**invalid_data)
+        print(" ✗ Invalid latitude not rejected!")
+        return False
+    except ValidationError:
+        print(" ✓ Invalid latitude correctly rejected")
+
+    # Test inference schema
+    try:
+        inference_data = {
+            "user_id": "u12345",
+            "amt": 150.00,
+            "lat": 40.7128,
+            "long": -74.0060,
+            "category": "grocery_pos",
+            "job": "Engineer, biomedical",
+            "merch_lat": 40.7200,
+            "merch_long": -74.0100,
+            "unix_time": 1234567890,
+        }
+        schema = InferenceTransactionSchema(**inference_data)
+        print(f" ✓ Inference schema works (user: {schema.user_id})")
+    except ValidationError as e:
+        print(f" ✗ Inference schema failed: {e}")
+        return False
+
+    return True
+
+
+def test_constants():
+    """Test that constants are properly loaded."""
+    print("\nTesting constants...")
+
+    from src.features.constants import category_names, job_names
+
+    # Check categories
+    expected_categories = ["misc_net", "gas_transport", "grocery_pos", "entertainment"]
+    for cat in expected_categories:
+        if cat not in category_names:
+            print(f" ✗ Missing category: {cat}")
+            return False
+    print(f" ✓ All expected categories present ({len(category_names)} total)")
+
+    # Check jobs
+    expected_jobs = ["Engineer, biomedical", "Data scientist", "Mechanical engineer"]
+    found_jobs = [j for j in expected_jobs if j in job_names]
+    print(
+        f" ✓ Job list loaded ({len(job_names)} jobs, {len(found_jobs)}/{len(expected_jobs)} test jobs found)"
+    )
+
+    return True
+
+
+def main():
+    """Run all verification tests."""
+    print("=" * 60)
+    print("Phase 1 Verification Script")
+    print("=" * 60)
+
+    tests = [
+        ("Imports", test_imports),
+        ("Pydantic Validation", test_pydantic_validation),
+        ("Constants", test_constants),
+    ]
+
+    results = []
+    for name, test_func in tests:
+        try:
+            result = test_func()
+            results.append((name, result))
+        except Exception as e:
+            print(f"\n ✗ {name} test crashed: {e}")
+            results.append((name, False))
+
+    print("\n" + "=" * 60)
+    print("Summary")
+    print("=" * 60)
+
+    for name, passed in results:
+        status = "✓ PASS" if passed else "✗ FAIL"
+        print(f"{status:>10} - {name}")
+
+    all_passed = all(r[1] for r in results)
+
+    print("\n" + "=" * 60)
+    if all_passed:
+        print("✅ All verification tests passed!")
+        print("\nPhase 1 implementation is working correctly.")
+        print("\nNext steps:")
+        print(" 1. Install dependencies: uv sync")
+        print(" 2. Start Redis: docker-compose -f docker/docker-compose.yml up -d redis")
+        print(" 3. Run full test suite: pytest tests/ -v")
+        print(" 4. Try demo: python scripts/demo_phase1.py")
+    else:
+        print("❌ Some tests failed. Check the output above.")
+    print("=" * 60)
+
+
+if __name__ == "__main__":
+    main()
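The validation checks above exercise a `TransactionSchema` and `InferenceTransactionSchema` defined elsewhere in the repo (not shown in this diff). A minimal Pydantic sketch consistent with what the script asserts — a positive `amt` and a latitude within [-90, 90] — might look like the following; the field constraints here are inferred from the rejected test cases, not copied from the actual source:

```python
from pydantic import BaseModel, Field, ValidationError

class TransactionSchema(BaseModel):
    """Hypothetical sketch of the schema the verification script exercises.

    Only the fields the tests probe are constrained; the remaining fields
    (first, last, gender, street, ... as in valid_data) are omitted.
    """
    merchant: str
    category: str
    amt: float = Field(gt=0)          # negative amounts must be rejected
    lat: float = Field(ge=-90, le=90) # lat=200.0 must be rejected
    long: float = Field(ge=-180, le=180)

if __name__ == "__main__":
    # A well-formed transaction is accepted
    ok = TransactionSchema(merchant="Test Merchant", category="misc_net",
                           amt=100.50, lat=39.7817, long=-89.6501)
    print(f"accepted amt={ok.amt}")

    # Out-of-range values raise ValidationError, mirroring the script's checks
    for bad in ({"amt": -50.0}, {"lat": 200.0}):
        data = {"merchant": "m", "category": "c",
                "amt": 1.0, "lat": 0.0, "long": 0.0, **bad}
        try:
            TransactionSchema(**data)
        except ValidationError:
            print(f"rejected {bad}")
```

`Field(gt=..., ge=..., le=...)` constraints work in both Pydantic v1 and v2, so this sketch stays compatible with either; the real schema may use custom validators instead.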
src/__init__.py ADDED
@@ -0,0 +1 @@
+# Empty __init__.py for src package
src/api/__init__.py ADDED
@@ -0,0 +1 @@
+# API module __init__