Sanket-Setu / TASKS.md
devrajsinh2012's picture
Remove Fly.io and GitHub Actions workflows
c67369f
# SanketSetu β€” Execution TODO & Implementation Tracker
## Model Analysis (Reviewed 2026-03-02)
All 5 model files inspected. Three distinct inference pipelines exist:
| Pipeline | Files | Input | Process | Output |
|---|---|---|---|---|
| **A β€” Primary (Fastest)** | `Mediapipe_XGBoost/model.pkl` | 63 MediaPipe coords (21 landmarks Γ— x,y,z) | XGBClassifier (50 trees) | 34-class probability |
| **B β€” Autoencoder + LGBM** | `CNN_Autoencoder_LightGBM/autoencoder_model.pkl` + `lgbm_model.pkl` | 63 MediaPipe coords | Encoder (63β†’32β†’**16** bottleneck) + LGBMClassifier | 34-class probability |
| **C β€” Vision CNN + SVM** | `CNN_PreTrained/cnn_model.pkl` + `svm_model.pkl` | 128Γ—128Γ—3 RGB image | ResNet50-based CNN (179 layers) β†’ 256 features + SVC(C=10) | 34-class probability w/ probability=True |
### Key Architecture Facts
- **34 classes** (Gujarati Sign Language alphabet + digits, labels 0–33)
- **Pipeline A** input: 63 floats β€” directly from MediaPipe `hand_landmarks` (x, y, z per landmark, flattened)
- **Pipeline B** input: same 63 floats β†’ takes only the encoder half (first 3 Dense layers, output of `dense_1` layer = 16 features)
- **Pipeline C** input: 128Γ—128 BGR/RGB cropped hand image, normalized to [0,1]
- All `.pth` files are identical copies of the `.pkl` files (same objects, different extension)
- Model quality strategy: A is primary (sub-ms); if confidence < threshold, query B or C for ensemble
---
## Project Folder Structure to Create
```
SanketSetu/
β”œβ”€β”€ backend/ ← FastAPI server
β”‚ β”œβ”€β”€ app/
β”‚ β”‚ β”œβ”€β”€ main.py ← FastAPI entry, WebSocket + REST
β”‚ β”‚ β”œβ”€β”€ models/
β”‚ β”‚ β”‚ β”œβ”€β”€ loader.py ← Singleton model loader
β”‚ β”‚ β”‚ └── label_map.py ← 0–33 β†’ Gujarati sign name mapping
β”‚ β”‚ β”œβ”€β”€ inference/
β”‚ β”‚ β”‚ β”œβ”€β”€ pipeline_a.py ← XGBoost inference (63 landmarks)
β”‚ β”‚ β”‚ β”œβ”€β”€ pipeline_b.py ← Autoencoder encoder + LightGBM
β”‚ β”‚ β”‚ β”œβ”€β”€ pipeline_c.py ← ResNet CNN + SVM (image-based)
β”‚ β”‚ β”‚ └── ensemble.py ← Confidence-weighted ensemble logic
β”‚ β”‚ β”œβ”€β”€ schemas.py ← Pydantic request/response models
β”‚ β”‚ └── config.py ← Settings (confidence threshold, etc.)
β”‚ β”œβ”€β”€ weights/ ← Symlink or copy of model pkl files
β”‚ β”œβ”€β”€ requirements.txt
β”‚ β”œβ”€β”€ Dockerfile
β”‚
β”œβ”€β”€ frontend/ ← Vite + React + TS
β”‚ β”œβ”€β”€ src/
β”‚ β”‚ β”œβ”€β”€ components/
β”‚ β”‚ β”‚ β”œβ”€β”€ WebcamFeed.tsx ← Webcam + canvas landmark overlay
β”‚ β”‚ β”‚ β”œβ”€β”€ LandmarkCanvas.tsx ← Draws 21 hand points + connections
β”‚ β”‚ β”‚ β”œβ”€β”€ PredictionHUD.tsx ← Live sign, confidence bar, history
β”‚ β”‚ β”‚ β”œβ”€β”€ OnboardingGuide.tsx ← Animated intro wizard
β”‚ β”‚ β”‚ └── Calibration.tsx ← Lighting/distance check UI
β”‚ β”‚ β”œβ”€β”€ hooks/
β”‚ β”‚ β”‚ β”œβ”€β”€ useWebSocket.ts ← WS connection, send/receive
β”‚ β”‚ β”‚ β”œβ”€β”€ useMediaPipe.ts ← MediaPipe Hands JS integration
β”‚ β”‚ β”‚ └── useWebcam.ts ← Camera permissions + stream
β”‚ β”‚ β”œβ”€β”€ lib/
β”‚ β”‚ β”‚ └── landmarkUtils.ts ← Landmark normalization (mirror XGBoost preprocessing)
β”‚ β”‚ β”œβ”€β”€ App.tsx
β”‚ β”‚ └── main.tsx
β”‚ β”œβ”€β”€ public/
β”‚ β”œβ”€β”€ index.html
β”‚ β”œβ”€β”€ tailwind.config.ts
β”‚ β”œβ”€β”€ vite.config.ts
β”‚ └── package.json
β”‚
β”œβ”€β”€ CNN_Autoencoder_LightGBM/ ← (existing)
β”œβ”€β”€ CNN_PreTrained/ ← (existing)
β”œβ”€β”€ Mediapipe_XGBoost/ ← (existing)
└── .github/
└── workflows/
β”œβ”€β”€ deploy-backend.yml
└── deploy-frontend.yml
```
---
## Phase 1 β€” Backend Core (FastAPI + Model Integration)
### 1.1 Project Bootstrap
- [x] Create `backend/` folder and `app/` package structure
- [x] Create `backend/requirements.txt` with: `fastapi`, `uvicorn[standard]`, `websockets`, `xgboost`, `lightgbm`, `scikit-learn`, `keras==3.13.2`, `tensorflow-cpu`, `numpy`, `opencv-python-headless`, `pillow`, `python-dotenv`
- [x] Create `backend/app/config.py` β€” confidence threshold (default 0.7), WebSocket max connections, pipeline mode (A/B/C/ensemble)
- [x] Create `backend/app/models/label_map.py` β€” map class indices 0–33 to Gujarati sign names
### 1.2 Model Loader (Singleton)
- [x] Create `backend/app/models/loader.py`
- Load `model.pkl` (XGBoost) at startup
- Load `autoencoder_model.pkl` (extract encoder layers only: input β†’ dense β†’ dense_1) and `lgbm_model.pkl`
- Load `cnn_model.pkl` (full ResNet50 feature extractor, strip any classification head) and `svm_model.pkl`
- Expose `ModelStore` singleton accessed via `get_model_store()` dependency
- Log load times for each model
### 1.3 Pipeline A β€” XGBoost (Primary, Landmarks)
- [x] Create `backend/app/inference/pipeline_a.py`
- Input: `List[float]` of length 63 (x,y,z per landmark, already normalized by MediaPipe)
- Output: `{"sign": str, "confidence": float, "probabilities": List[float]}`
- Use `model.predict_proba(np.array(landmarks).reshape(1,-1))[0]`
- Return `classes_[argmax]` and `max(probabilities)` as confidence
### 1.4 Pipeline B β€” Autoencoder Encoder + LightGBM
- [x] Create `backend/app/inference/pipeline_b.py`
- Build encoder-only submodel: `encoder = keras.Model(inputs=model.input, outputs=model.layers[2].output)` (output of `dense_1`, the 16-D bottleneck)
- Input: 63 MediaPipe coords
- Encode: `features = encoder.predict(np.array(landmarks).reshape(1,-1))[0]` β†’ shape (16,)
- Classify: `lgbm.predict_proba(features.reshape(1,-1))[0]`
### 1.5 Pipeline C β€” CNN + SVM (Image-based)
- [x] Create `backend/app/inference/pipeline_c.py`
- Input: base64-encoded JPEG or raw bytes of the cropped hand region (128Γ—128 px)
- Decode β†’ numpy array (128,128,3) uint8 β†’ normalize to float32 [0,1]
- `features = cnn_model.predict(img[np.newaxis])[0]` β†’ shape (256,)
- `proba = svm.predict_proba(features.reshape(1,-1))[0]`
- Note: CNN inference is slower (~50–200ms on CPU); only call when Pipeline A confidence < threshold
### 1.6 Ensemble Logic
- [x] Create `backend/app/inference/ensemble.py`
- Call Pipeline A first
- If `confidence < config.THRESHOLD` (default 0.7), call Pipeline B
- If still below threshold and image data available, call Pipeline C
- Final result: weighted average of probabilities from each pipeline that was called
- Return the top predicted class and ensemble confidence score
### 1.7 WebSocket Handler
- [x] Create `backend/app/main.py` with FastAPI app
- [x] Implement `GET /health` β€” returns `{"status": "ok", "models_loaded": true}`
- [x] Implement `WS /ws/landmarks` β€” primary endpoint
- Client sends JSON: `{"landmarks": [63 floats], "session_id": "..."}`
- Server responds: `{"sign": "...", "confidence": 0.95, "pipeline": "A", "label_index": 12}`
- Handle disconnect gracefully
- [x] Implement `WS /ws/image` β€” optional image-based endpoint for Pipeline C
- Client sends JSON: `{"image_b64": "...", "session_id": "..."}`
- [x] Implement `POST /api/predict` β€” REST fallback for non-WS clients
- Body: `{"landmarks": [63 floats]}`
- Returns same response schema as WS
### 1.8 Schemas & Validation
- [x] Create `backend/app/schemas.py`
- `LandmarkMessage(BaseModel)`: `landmarks: List[float]` (must be length 63), `session_id: str`
- `ImageMessage(BaseModel)`: `image_b64: str`, `session_id: str`
- `PredictionResponse(BaseModel)`: `sign: str`, `confidence: float`, `pipeline: str`, `label_index: int`, `probabilities: Optional[List[float]]`
### 1.9 CORS & Middleware
- [x] Configure CORS for Vercel frontend domain + localhost:5173
- [x] Add request logging middleware (log session_id, pipeline used, latency ms)
- [x] Add global exception handler returning proper JSON errors
---
## Phase 2 β€” Frontend (React + Vite + Tailwind + Framer Motion)
### 2.1 Project Bootstrap
- [x] Run `npm create vite@latest frontend -- --template react-ts` inside `SanketSetu/`
- [x] Install deps: `tailwindcss`, `framer-motion`, `lucide-react`, `@mediapipe/tasks-vision`
- [x] Configure Tailwind with custom palette (dark neon-cyan glassmorphism theme)
- [x] Set up `vite.config.ts` proxy: `/api` β†’ backend URL, `/ws` β†’ backend WS URL
### 2.2 Webcam Hook (`useWebcam.ts`)
- [x] Request `getUserMedia({ video: { width: 1280, height: 720 } })`
- [x] Expose `videoRef`, `isReady`, `error`, `switchCamera()` (for mobile front/back toggle)
- [x] Handle permission denied state with instructional UI
### 2.3 MediaPipe Hook (`useMediaPipe.ts`)
- [x] Initialize `HandLandmarker` from `@mediapipe/tasks-vision` (WASM backend)
- [x] Process video frames at target 30fps using `requestAnimationFrame`
- [x] Extract `landmarks[0]` (first hand) β†’ flatten to 63 floats `[x0,y0,z0, x1,y1,z1, ...]`
- [x] Normalize: subtract wrist (landmark 0) position to make translation-invariant β€” **must match training preprocessing**
- [x] Expose `landmarks: number[] | null`, `handedness: string`, `isDetecting: boolean`
### 2.4 WebSocket Hook (`useWebSocket.ts`)
- [x] Connect to `wss://backend-url/ws/landmarks` on mount
- [x] Auto-reconnect with exponential backoff on disconnect
- [x] `sendLandmarks(landmarks: number[])` β€” throttled to max 15 sends/sec
- [x] Expose `lastPrediction: PredictionResponse | null`, `isConnected: boolean`, `latency: number`
### 2.5 Landmark Canvas (`LandmarkCanvas.tsx`)
- [x] Overlay `<canvas>` on top of `<video>` with `position: absolute`
- [x] Draw 21 hand landmark dots (cyan glow: `shadowBlur`, `shadowColor`)
- [x] Draw 21 bone connections following MediaPipe hand topology (finger segments)
- [x] On successful prediction: animate landmarks to pulse/glow with Framer Motion spring
- [x] Use `requestAnimationFrame` for smooth 60fps rendering
### 2.6 Prediction HUD (`PredictionHUD.tsx`)
- [x] Glassmorphism card: `backdrop-blur`, `bg-white/10`, `border-white/20`
- [x] Large Gujarati sign name (mapped from label index)
- [x] Confidence bar: animated width transition via Framer Motion `animate={{ width: confidence% }}`
- [x] Color coding: green (>85%), yellow (60–85%), red (<60%)
- [x] Rolling history list: last 10 recognized signs (Framer Motion `AnimatePresence` for enter/exit)
- [x] Pipeline badge: shows which pipeline (A/B/C) produced the result
- [x] Latency display: shows WS round-trip time in ms
### 2.7 Onboarding Guide (`OnboardingGuide.tsx`)
- [x] 3-step animated wizard using Framer Motion page transitions
1. "Position your hand 30–60cm from camera"
2. "Ensure good lighting, avoid dark backgrounds"
3. "Show signs clearly β€” palm facing camera"
- [x] Skip button + "Don't show again" (localStorage)
### 2.8 Calibration Screen (`Calibration.tsx`)
- [x] Brief 2-second "Ready?" screen after onboarding
- [x] Check: hand detected by MediaPipe β†’ show green checkmark animation
- [x] Auto-transitions to main translation view when hand is stable for 1 second
### 2.9 Main App Layout (`App.tsx`)
- [x] Full-screen dark background with subtle animated gradient
- [x] Three-panel layout (desktop): webcam | HUD | history
- [x] Mobile: stacked layout with webcam top, HUD bottom
- [x] Header: "SanketSetu | ΰͺΈΰͺ‚ΰͺ•ેΰͺ€-ΰͺΈΰ«‡ΰͺ€ΰ«" with glowing text effect
- [x] Settings gear icon β†’ modal for pipeline selection (A / B / C / Ensemble), confidence threshold slider
---
## Phase 3 β€” Dockerization & Deployment
### 3.1 Backend Dockerfile
- [x] Create `Dockerfile` (repo root, build context includes models)
- [x] Add `.dockerignore` (excludes `.venv`, `node_modules`, `*.pth`, tests)
- [ ] Test locally: `docker build -t sanketsetu-backend . && docker run -p 8000:8000 sanketsetu-backend`
### 3.2 Hugging Face Spaces Configuration
- [x] Create Hugging Face Spaces repository for backend deployment
- [x] Note: Keras/TF will increase Docker image size β€” use `tensorflow-cpu` to keep slim
- [ ] Push Docker image to Hugging Face Container Registry
### 3.3 Vercel Frontend Deployment
- [x] Create `frontend/vercel.json` with SPA rewrite + WASM Content-Type header
- [x] Add `VITE_WS_URL` and `VITE_API_URL` to Vercel environment variables (via CI vars)
- [ ] Ensure `@mediapipe/tasks-vision` WASM files are served correctly (add to `public/`)
---
## Phase 4 β€” Testing & Hardening
### 4.1 Backend Tests
- [x] `tests/test_pipeline_a.py` β€” 8 unit tests, XGBoost inference (4s)
- [x] `tests/test_pipeline_b.py` β€” 6 unit tests, encoder + LightGBM (49s)
- [x] `tests/test_pipeline_c.py` β€” 7 unit tests, CNN + SVM with real 128Γ—128 images (14s)
- [x] `tests/test_websocket.py` β€” 7 integration tests, health + REST + WS round-trip
### 4.2 Frontend Error Handling
- [ ] No-camera fallback UI (file upload for image mode)
- [x] WS reconnecting banner (red banner when `!isConnected && stage === 'running'`)
- [x] Low-bandwidth mode: reduce send rate to 5fps if latency > 500ms + yellow "LB" badge in HUD
- [x] MediaPipe WASM load failure fallback message (shown in header via `mpError`)
### 4.3 Label Map (Critical)
- [ ] Create `backend/app/models/label_map.py` mapping classes 0–33 to actual Gujarati signs
- You need to confirm the exact mapping used during training (check your original dataset/notebook)
- Placeholder: `LABEL_MAP = { 0: "ΰͺ•", 1: "ΰͺ–", ... , 33: "?" }`
- This file must exactly mirror what was used in training
---
## Execution Order (Start Here)
```
Week 1: Phase 1.1 β†’ 1.3 β†’ 1.7 (get WS working with Pipeline A alone, test in browser)
Week 2: Phase 1.4 β†’ 1.5 β†’ 1.6 (add other pipelines + ensemble)
Week 3: Phase 2.1 β†’ 2.2 β†’ 2.3 β†’ 2.4 (React skeleton + WS connected)
Week 4: Phase 2.5 β†’ 2.6 β†’ 2.7 β†’ 2.8 β†’ 2.9 (full UI)
Week 5: Phase 3 + 4 (deploy + tests)
```
---
## Critical Decision Points
| Decision | Default | Notes |
|---|---|---|
| Primary pipeline | **A (XGBoost)** | Sub-ms inference, uses MediaPipe landmarks already extracted client-side |
| Confidence threshold for fallback | **0.70** | Tune after testing - if XGBoost < 70%, call Pipeline B |
| Enable Pipeline C (CNN) | **Optional / off by default** | Adds ~150ms latency and requires image upload, not just landmarks |
| MediaPipe model variant | **lite** | Use `hand_landmarker_lite.task` for mobile performance |
| WebSocket frame rate | **15fps** | Sufficient for sign recognition, avoids server overload |
| Gujarati label map | **CONFIRM WITH DATASET** | Classes 0–33 must match training data exactly |