# SanketSetu — Execution TODO & Implementation Tracker

## Model Analysis (Reviewed 2026-03-02)

All 5 model files inspected. Three distinct inference pipelines exist:

| Pipeline | Files | Input | Process | Output |
|---|---|---|---|---|
| **A — Primary (Fastest)** | `Mediapipe_XGBoost/model.pkl` | 63 MediaPipe coords (21 landmarks × x,y,z) | XGBClassifier (50 trees) | 34-class probability |
| **B — Autoencoder + LGBM** | `CNN_Autoencoder_LightGBM/autoencoder_model.pkl` + `lgbm_model.pkl` | 63 MediaPipe coords | Encoder (63→32→**16** bottleneck) + LGBMClassifier | 34-class probability |
| **C — Vision CNN + SVM** | `CNN_PreTrained/cnn_model.pkl` + `svm_model.pkl` | 128×128×3 RGB image | ResNet50-based CNN (179 layers) → 256 features + SVC(C=10) | 34-class probability (`probability=True`) |

### Key Architecture Facts

- **34 classes** (Gujarati Sign Language alphabet + digits, labels 0–33)
- **Pipeline A** input: 63 floats — taken directly from MediaPipe `hand_landmarks` (x, y, z per landmark, flattened)
- **Pipeline B** input: the same 63 floats → uses only the encoder half (the first three layers: input → dense → dense_1; the output of `dense_1` is the 16-feature bottleneck)
- **Pipeline C** input: 128×128 BGR/RGB cropped hand image, normalized to [0,1]
- All `.pth` files are identical copies of the `.pkl` files (same objects, different extension)
- Model quality strategy: A is primary (sub-ms); if its confidence is below threshold, query B or C for an ensemble

---

## Project Folder Structure to Create

```
SanketSetu/
├── backend/                          ← FastAPI server
│   ├── app/
│   │   ├── main.py                   ← FastAPI entry, WebSocket + REST
│   │   ├── models/
│   │   │   ├── loader.py             ← Singleton model loader
│   │   │   └── label_map.py          ← 0–33 → Gujarati sign name mapping
│   │   ├── inference/
│   │   │   ├── pipeline_a.py         ← XGBoost inference (63 landmarks)
│   │   │   ├── pipeline_b.py         ← Autoencoder encoder + LightGBM
│   │   │   ├── pipeline_c.py         ← ResNet CNN + SVM (image-based)
│   │   │   └── ensemble.py           ← Confidence-weighted ensemble logic
│   │   ├── schemas.py                ← Pydantic request/response models
│   │   └── config.py                 ← Settings (confidence threshold, etc.)
│   ├── weights/                      ← Symlink or copy of model pkl files
│   ├── requirements.txt
│   └── Dockerfile
│
├── frontend/                         ← Vite + React + TS
│   ├── src/
│   │   ├── components/
│   │   │   ├── WebcamFeed.tsx        ← Webcam + canvas landmark overlay
│   │   │   ├── LandmarkCanvas.tsx    ← Draws 21 hand points + connections
│   │   │   ├── PredictionHUD.tsx     ← Live sign, confidence bar, history
│   │   │   ├── OnboardingGuide.tsx   ← Animated intro wizard
│   │   │   └── Calibration.tsx       ← Lighting/distance check UI
│   │   ├── hooks/
│   │   │   ├── useWebSocket.ts       ← WS connection, send/receive
│   │   │   ├── useMediaPipe.ts       ← MediaPipe Hands JS integration
│   │   │   └── useWebcam.ts          ← Camera permissions + stream
│   │   ├── lib/
│   │   │   └── landmarkUtils.ts      ← Landmark normalization (mirror XGBoost preprocessing)
│   │   ├── App.tsx
│   │   └── main.tsx
│   ├── public/
│   ├── index.html
│   ├── tailwind.config.ts
│   ├── vite.config.ts
│   └── package.json
│
├── CNN_Autoencoder_LightGBM/         ← (existing)
├── CNN_PreTrained/                   ← (existing)
├── Mediapipe_XGBoost/                ← (existing)
└── .github/
    └── workflows/
        ├── deploy-backend.yml
        └── deploy-frontend.yml
```

---

## Phase 1 — Backend Core (FastAPI + Model Integration)

### 1.1 Project Bootstrap

- [x] Create `backend/` folder and `app/` package structure
- [x] Create `backend/requirements.txt` with: `fastapi`, `uvicorn[standard]`, `websockets`, `xgboost`, `lightgbm`, `scikit-learn`, `keras==3.13.2`, `tensorflow-cpu`, `numpy`, `opencv-python-headless`, `pillow`, `python-dotenv`
- [x] Create `backend/app/config.py` — confidence threshold (default 0.7), WebSocket max connections, pipeline mode (A/B/C/ensemble)
- [x] Create `backend/app/models/label_map.py` — map class indices 0–33 to Gujarati sign names

### 1.2 Model Loader (Singleton)

- [x] Create `backend/app/models/loader.py`
  - Load `model.pkl` (XGBoost) at startup
  - Load `autoencoder_model.pkl` (extract encoder layers only: input → dense → dense_1) and `lgbm_model.pkl`
  - Load `cnn_model.pkl` (full ResNet50 feature extractor,
  strip any classification head) and `svm_model.pkl`
  - Expose `ModelStore` singleton accessed via `get_model_store()` dependency
  - Log load times for each model

### 1.3 Pipeline A — XGBoost (Primary, Landmarks)

- [x] Create `backend/app/inference/pipeline_a.py`
  - Input: `List[float]` of length 63 (x,y,z per landmark, already normalized by MediaPipe)
  - Output: `{"sign": str, "confidence": float, "probabilities": List[float]}`
  - Use `model.predict_proba(np.array(landmarks).reshape(1,-1))[0]`
  - Return `classes_[argmax]` and `max(probabilities)` as confidence

### 1.4 Pipeline B — Autoencoder Encoder + LightGBM

- [x] Create `backend/app/inference/pipeline_b.py`
  - Build encoder-only submodel: `encoder = keras.Model(inputs=model.input, outputs=model.layers[2].output)` (output of `dense_1`, the 16-D bottleneck)
  - Input: 63 MediaPipe coords
  - Encode: `features = encoder.predict(np.array(landmarks).reshape(1,-1))[0]` → shape (16,)
  - Classify: `lgbm.predict_proba(features.reshape(1,-1))[0]`

### 1.5 Pipeline C — CNN + SVM (Image-based)

- [x] Create `backend/app/inference/pipeline_c.py`
  - Input: base64-encoded JPEG or raw bytes of the cropped hand region (128×128 px)
  - Decode → numpy array (128,128,3) uint8 → normalize to float32 [0,1]
  - `features = cnn_model.predict(img[np.newaxis])[0]` → shape (256,)
  - `proba = svm.predict_proba(features.reshape(1,-1))[0]`
  - Note: CNN inference is slower (~50–200 ms on CPU); only call it when Pipeline A confidence < threshold

### 1.6 Ensemble Logic

- [x] Create `backend/app/inference/ensemble.py`
  - Call Pipeline A first
  - If `confidence < config.THRESHOLD` (default 0.7), call Pipeline B
  - If still below threshold and image data is available, call Pipeline C
  - Final result: weighted average of probabilities from each pipeline that was called
  - Return the top predicted class and ensemble confidence score

### 1.7 WebSocket Handler

- [x] Create `backend/app/main.py` with FastAPI app
- [x] Implement `GET /health` — returns `{"status": "ok", "models_loaded": true}`
- [x] Implement `WS /ws/landmarks` — primary endpoint
  - Client sends JSON: `{"landmarks": [63 floats], "session_id": "..."}`
  - Server responds: `{"sign": "...", "confidence": 0.95, "pipeline": "A", "label_index": 12}`
  - Handle disconnect gracefully
- [x] Implement `WS /ws/image` — optional image-based endpoint for Pipeline C
  - Client sends JSON: `{"image_b64": "...", "session_id": "..."}`
- [x] Implement `POST /api/predict` — REST fallback for non-WS clients
  - Body: `{"landmarks": [63 floats]}`
  - Returns same response schema as WS

### 1.8 Schemas & Validation

- [x] Create `backend/app/schemas.py`
  - `LandmarkMessage(BaseModel)`: `landmarks: List[float]` (must be length 63), `session_id: str`
  - `ImageMessage(BaseModel)`: `image_b64: str`, `session_id: str`
  - `PredictionResponse(BaseModel)`: `sign: str`, `confidence: float`, `pipeline: str`, `label_index: int`, `probabilities: Optional[List[float]]`

### 1.9 CORS & Middleware

- [x] Configure CORS for Vercel frontend domain + localhost:5173
- [x] Add request logging middleware (log session_id, pipeline used, latency ms)
- [x] Add global exception handler returning proper JSON errors

---

## Phase 2 — Frontend (React + Vite + Tailwind + Framer Motion)

### 2.1 Project Bootstrap

- [x] Run `npm create vite@latest frontend -- --template react-ts` inside `SanketSetu/`
- [x] Install deps: `tailwindcss`, `framer-motion`, `lucide-react`, `@mediapipe/tasks-vision`
- [x] Configure Tailwind with custom palette (dark neon-cyan glassmorphism theme)
- [x] Set up `vite.config.ts` proxy: `/api` → backend URL, `/ws` → backend WS URL

### 2.2 Webcam Hook (`useWebcam.ts`)

- [x] Request `getUserMedia({ video: { width: 1280, height: 720 } })`
- [x] Expose `videoRef`, `isReady`, `error`, `switchCamera()` (for mobile front/back toggle)
- [x] Handle permission-denied state with instructional UI

### 2.3 MediaPipe Hook (`useMediaPipe.ts`)

- [x] Initialize `HandLandmarker` from `@mediapipe/tasks-vision` (WASM backend)
- [x]
  Process video frames at a target 30 fps using `requestAnimationFrame`
- [x] Extract `landmarks[0]` (first hand) → flatten to 63 floats `[x0,y0,z0, x1,y1,z1, ...]`
- [x] Normalize: subtract the wrist (landmark 0) position to make the features translation-invariant — **must match training preprocessing**
- [x] Expose `landmarks: number[] | null`, `handedness: string`, `isDetecting: boolean`

### 2.4 WebSocket Hook (`useWebSocket.ts`)

- [x] Connect to `wss://backend-url/ws/landmarks` on mount
- [x] Auto-reconnect with exponential backoff on disconnect
- [x] `sendLandmarks(landmarks: number[])` — throttled to max 15 sends/sec
- [x] Expose `lastPrediction: PredictionResponse | null`, `isConnected: boolean`, `latency: number`

### 2.5 Landmark Canvas (`LandmarkCanvas.tsx`)

- [x] Overlay `<canvas>` on top of `
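The §2.3 requirement that frontend normalization **must match training preprocessing** is the easiest place for silent drift between the TS client and the Python training code. A minimal sketch of the flatten-and-recenter step that `lib/landmarkUtils.ts` could implement — the `Landmark` interface and the `toFeatureVector` name are illustrative assumptions, not part of this tracker:

```typescript
// Sketch: flatten MediaPipe's 21 hand landmarks into the 63-float
// vector Pipeline A expects, recentered on the wrist (landmark 0)
// so the features are translation-invariant (per §2.3).
interface Landmark {
  x: number;
  y: number;
  z: number;
}

export function toFeatureVector(landmarks: Landmark[]): number[] {
  if (landmarks.length !== 21) {
    throw new Error(`expected 21 landmarks, got ${landmarks.length}`);
  }
  const wrist = landmarks[0];
  // Flatten to [x0,y0,z0, x1,y1,z1, ...] with the wrist subtracted;
  // the wrist itself maps to (0, 0, 0).
  return landmarks.flatMap((lm) => [
    lm.x - wrist.x,
    lm.y - wrist.y,
    lm.z - wrist.z,
  ]);
}
```

If the XGBoost training script also scaled by hand size or mirrored left hands, those steps would have to be reproduced here as well; only the wrist subtraction is stated in the tracker.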