# SanketSetu – Execution TODO & Implementation Tracker
## Model Analysis (Reviewed 2026-03-02)
All 5 model files inspected. Three distinct inference pipelines exist:
| Pipeline | Files | Input | Process | Output |
|---|---|---|---|---|
| **A – Primary (Fastest)** | `Mediapipe_XGBoost/model.pkl` | 63 MediaPipe coords (21 landmarks × x,y,z) | XGBClassifier (50 trees) | 34-class probability |
| **B – Autoencoder + LGBM** | `CNN_Autoencoder_LightGBM/autoencoder_model.pkl` + `lgbm_model.pkl` | 63 MediaPipe coords | Encoder (63→32→**16** bottleneck) + LGBMClassifier | 34-class probability |
| **C – Vision CNN + SVM** | `CNN_PreTrained/cnn_model.pkl` + `svm_model.pkl` | 128×128×3 RGB image | ResNet50-based CNN (179 layers) → 256 features + SVC(C=10) | 34-class probability w/ probability=True |
### Key Architecture Facts
- **34 classes** (Gujarati Sign Language alphabet + digits, labels 0–33)
- **Pipeline A** input: 63 floats – directly from MediaPipe `hand_landmarks` (x, y, z per landmark, flattened)
- **Pipeline B** input: same 63 floats – takes only the encoder half (first 3 Dense layers, output of `dense_1` layer = 16 features)
- **Pipeline C** input: 128×128 BGR/RGB cropped hand image, normalized to [0,1]
- All `.pth` files are identical copies of the `.pkl` files (same objects, different extension)
- Model quality strategy: A is primary (sub-ms); if confidence < threshold, query B or C for ensemble
---
## Project Folder Structure to Create
```
SanketSetu/
├── backend/                        – FastAPI server
│   ├── app/
│   │   ├── main.py                 – FastAPI entry, WebSocket + REST
│   │   ├── models/
│   │   │   ├── loader.py           – Singleton model loader
│   │   │   └── label_map.py        – 0–33 → Gujarati sign name mapping
│   │   ├── inference/
│   │   │   ├── pipeline_a.py       – XGBoost inference (63 landmarks)
│   │   │   ├── pipeline_b.py       – Autoencoder encoder + LightGBM
│   │   │   ├── pipeline_c.py       – ResNet CNN + SVM (image-based)
│   │   │   └── ensemble.py         – Confidence-weighted ensemble logic
│   │   ├── schemas.py              – Pydantic request/response models
│   │   └── config.py               – Settings (confidence threshold, etc.)
│   ├── weights/                    – Symlink or copy of model pkl files
│   ├── requirements.txt
│   └── Dockerfile
│
├── frontend/                       – Vite + React + TS
│   ├── src/
│   │   ├── components/
│   │   │   ├── WebcamFeed.tsx      – Webcam + canvas landmark overlay
│   │   │   ├── LandmarkCanvas.tsx  – Draws 21 hand points + connections
│   │   │   ├── PredictionHUD.tsx   – Live sign, confidence bar, history
│   │   │   ├── OnboardingGuide.tsx – Animated intro wizard
│   │   │   └── Calibration.tsx     – Lighting/distance check UI
│   │   ├── hooks/
│   │   │   ├── useWebSocket.ts     – WS connection, send/receive
│   │   │   ├── useMediaPipe.ts     – MediaPipe Hands JS integration
│   │   │   └── useWebcam.ts       – Camera permissions + stream
│   │   ├── lib/
│   │   │   └── landmarkUtils.ts    – Landmark normalization (mirror XGBoost preprocessing)
│   │   ├── App.tsx
│   │   └── main.tsx
│   ├── public/
│   ├── index.html
│   ├── tailwind.config.ts
│   ├── vite.config.ts
│   └── package.json
│
├── CNN_Autoencoder_LightGBM/       – (existing)
├── CNN_PreTrained/                 – (existing)
├── Mediapipe_XGBoost/              – (existing)
└── .github/
    └── workflows/
        ├── deploy-backend.yml
        └── deploy-frontend.yml
```
---
## Phase 1 – Backend Core (FastAPI + Model Integration)
### 1.1 Project Bootstrap
- [x] Create `backend/` folder and `app/` package structure
- [x] Create `backend/requirements.txt` with: `fastapi`, `uvicorn[standard]`, `websockets`, `xgboost`, `lightgbm`, `scikit-learn`, `keras==3.13.2`, `tensorflow-cpu`, `numpy`, `opencv-python-headless`, `pillow`, `python-dotenv`
- [x] Create `backend/app/config.py` β confidence threshold (default 0.7), WebSocket max connections, pipeline mode (A/B/C/ensemble)
- [x] Create `backend/app/models/label_map.py` – map class indices 0–33 to Gujarati sign names
### 1.2 Model Loader (Singleton)
- [x] Create `backend/app/models/loader.py`
- Load `model.pkl` (XGBoost) at startup
- Load `autoencoder_model.pkl` (extract encoder layers only: input → dense → dense_1) and `lgbm_model.pkl`
- Load `cnn_model.pkl` (full ResNet50 feature extractor, strip any classification head) and `svm_model.pkl`
- Expose `ModelStore` singleton accessed via `get_model_store()` dependency
- Log load times for each model
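The loader above could be sketched as follows. This is a minimal sketch, not the actual `loader.py`: the weights path, attribute names, and the lazy-singleton shape are all assumptions.

```python
# Hypothetical sketch of backend/app/models/loader.py.
# Paths and attribute names are assumptions, not confirmed code.
import pickle
import time
from pathlib import Path

DEFAULT_WEIGHTS_DIR = Path("backend/weights")  # assumed location

class ModelStore:
    """Holds all five pickled models; constructed once per process."""
    _instance = None

    def __init__(self, weights_dir):
        self.xgb = self._load(weights_dir, "model.pkl")
        self.autoencoder = self._load(weights_dir, "autoencoder_model.pkl")
        self.lgbm = self._load(weights_dir, "lgbm_model.pkl")
        self.cnn = self._load(weights_dir, "cnn_model.pkl")
        self.svm = self._load(weights_dir, "svm_model.pkl")

    @staticmethod
    def _load(weights_dir, name):
        # log load time per model, as the checklist above asks
        start = time.perf_counter()
        with open(Path(weights_dir) / name, "rb") as f:
            obj = pickle.load(f)
        print(f"loaded {name} in {(time.perf_counter() - start) * 1000:.1f} ms")
        return obj

def get_model_store(weights_dir=DEFAULT_WEIGHTS_DIR):
    # used as a FastAPI dependency; loads lazily on first call
    if ModelStore._instance is None:
        ModelStore._instance = ModelStore(weights_dir)
    return ModelStore._instance
```

Because `get_model_store()` returns the same instance every time, every WebSocket handler shares one copy of each model in memory.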
### 1.3 Pipeline A – XGBoost (Primary, Landmarks)
- [x] Create `backend/app/inference/pipeline_a.py`
- Input: `List[float]` of length 63 (x,y,z per landmark, already normalized by MediaPipe)
- Output: `{"sign": str, "confidence": float, "probabilities": List[float]}`
- Use `model.predict_proba(np.array(landmarks).reshape(1,-1))[0]`
- Return `classes_[argmax]` and `max(probabilities)` as confidence
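The Pipeline A steps above can be sketched as a single function. The model object and label map are injected (names are assumptions); the classifier is stubbed in testing, so this only shows the contract, not the fitted model.

```python
# Sketch of pipeline_a.py: 63 landmark floats -> prediction dict.
import numpy as np

def predict_pipeline_a(model, landmarks, label_map):
    """model: fitted XGBClassifier; landmarks: 63 floats (21 x x,y,z)."""
    if len(landmarks) != 63:
        raise ValueError("expected 63 landmark values")
    x = np.asarray(landmarks, dtype=np.float32).reshape(1, -1)
    proba = model.predict_proba(x)[0]          # 34-class probability vector
    idx = int(np.argmax(proba))
    return {
        "sign": label_map[idx],
        "confidence": float(proba[idx]),       # max probability as confidence
        "probabilities": proba.tolist(),
    }
```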
### 1.4 Pipeline B – Autoencoder Encoder + LightGBM
- [x] Create `backend/app/inference/pipeline_b.py`
- Build encoder-only submodel: `encoder = keras.Model(inputs=model.input, outputs=model.layers[2].output)` (output of `dense_1`, the 16-D bottleneck)
- Input: 63 MediaPipe coords
- Encode: `features = encoder.predict(np.array(landmarks).reshape(1,-1))[0]` → shape (16,)
- Classify: `lgbm.predict_proba(features.reshape(1,-1))[0]`
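A sketch of Pipeline B, assuming the layer order noted above (`layers[2]` = `dense_1`, the 16-D bottleneck); verify the index against the real autoencoder before relying on it.

```python
# Sketch of pipeline_b.py: encoder slice + LightGBM classification.
import numpy as np

def build_encoder(autoencoder):
    import keras  # lazy import; TensorFlow backend assumed
    # keep only the input -> dense -> dense_1 half of the autoencoder
    return keras.Model(inputs=autoencoder.input,
                       outputs=autoencoder.layers[2].output)

def predict_pipeline_b(encoder, lgbm, landmarks):
    x = np.asarray(landmarks, dtype=np.float32).reshape(1, -1)
    features = encoder.predict(x, verbose=0)[0]         # shape (16,)
    proba = lgbm.predict_proba(features.reshape(1, -1))[0]
    return int(np.argmax(proba)), float(proba.max())
```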
### 1.5 Pipeline C – CNN + SVM (Image-based)
- [x] Create `backend/app/inference/pipeline_c.py`
- Input: base64-encoded JPEG or raw bytes of the cropped hand region (128×128 px)
- Decode → numpy array (128,128,3) uint8 → normalize to float32 [0,1]
- `features = cnn_model.predict(img[np.newaxis])[0]` → shape (256,)
- `proba = svm.predict_proba(features.reshape(1,-1))[0]`
- Note: CNN inference is slower (~50–200 ms on CPU); only call when Pipeline A confidence < threshold
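The decode/normalize/classify steps could look like this. Pillow (already in requirements.txt) handles the JPEG decode; the model objects are stand-ins for the loaded `cnn_model` and `svm`.

```python
# Sketch of pipeline_c.py: base64 image -> CNN features -> SVM probability.
import base64
import io
import numpy as np

def decode_image(image_b64):
    from PIL import Image  # lazy import so landmark-only deployments skip it
    img = Image.open(io.BytesIO(base64.b64decode(image_b64)))
    img = img.convert("RGB").resize((128, 128))
    return np.asarray(img, dtype=np.float32) / 255.0   # (128,128,3) in [0,1]

def predict_pipeline_c(cnn, svm, img):
    features = cnn.predict(img[np.newaxis], verbose=0)[0]  # shape (256,)
    proba = svm.predict_proba(features.reshape(1, -1))[0]
    return int(np.argmax(proba)), float(proba.max())
```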
### 1.6 Ensemble Logic
- [x] Create `backend/app/inference/ensemble.py`
- Call Pipeline A first
- If `confidence < config.THRESHOLD` (default 0.7), call Pipeline B
- If still below threshold and image data available, call Pipeline C
- Final result: weighted average of probabilities from each pipeline that was called
- Return the top predicted class and ensemble confidence score
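The cascade above, sketched with an unweighted average over whichever pipelines actually ran. The callable-based interface and the averaging scheme are assumptions to tune, not the confirmed `ensemble.py`.

```python
# Sketch of ensemble.py: A first, fall back to B, then C if available.
import numpy as np

THRESHOLD = 0.70  # mirrors the config.THRESHOLD default

def ensemble_predict(run_a, run_b, run_c=None, threshold=THRESHOLD):
    """Each run_* is a zero-arg callable returning a 34-class proba vector."""
    probas = [run_a()]
    used = ["A"]
    if probas[-1].max() < threshold:
        probas.append(run_b())
        used.append("B")
    if probas[-1].max() < threshold and run_c is not None:
        probas.append(run_c())          # image data available -> Pipeline C
        used.append("C")
    avg = np.mean(probas, axis=0)       # unweighted here; could be weighted
    return int(np.argmax(avg)), float(avg.max()), "+".join(used)
```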
### 1.7 WebSocket Handler
- [x] Create `backend/app/main.py` with FastAPI app
- [x] Implement `GET /health` – returns `{"status": "ok", "models_loaded": true}`
- [x] Implement `WS /ws/landmarks` – primary endpoint
- Client sends JSON: `{"landmarks": [63 floats], "session_id": "..."}`
- Server responds: `{"sign": "...", "confidence": 0.95, "pipeline": "A", "label_index": 12}`
- Handle disconnect gracefully
- [x] Implement `WS /ws/image` β optional image-based endpoint for Pipeline C
- Client sends JSON: `{"image_b64": "...", "session_id": "..."}`
- [x] Implement `POST /api/predict` – REST fallback for non-WS clients
- Body: `{"landmarks": [63 floats]}`
- Returns same response schema as WS
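The WS round-trip above can be sketched with the JSON handling pulled into a pure function (testable without a server); the wiring function mirrors the endpoint described, and `predict_fn` plus the hardcoded `"pipeline": "A"` are simplifications of this sketch.

```python
# Sketch of the /ws/landmarks message handling in main.py.
import json

def handle_landmark_message(raw, predict_fn):
    """raw: JSON text from the client; predict_fn: landmarks -> (sign, conf, idx)."""
    msg = json.loads(raw)
    landmarks = msg["landmarks"]
    if len(landmarks) != 63:
        return {"error": "expected 63 landmark values"}
    sign, confidence, label_index = predict_fn(landmarks)
    # pipeline label hardcoded here; the real handler reports whichever ran
    return {"sign": sign, "confidence": confidence,
            "pipeline": "A", "label_index": label_index}

def register_ws(app, predict_fn):
    # called from main.py after the FastAPI app is created (assumed layout)
    from fastapi import WebSocket, WebSocketDisconnect

    @app.websocket("/ws/landmarks")
    async def ws_landmarks(ws: WebSocket):
        await ws.accept()
        try:
            while True:
                raw = await ws.receive_text()
                await ws.send_json(handle_landmark_message(raw, predict_fn))
        except WebSocketDisconnect:
            pass  # client went away; nothing to clean up in this sketch
```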
### 1.8 Schemas & Validation
- [x] Create `backend/app/schemas.py`
- `LandmarkMessage(BaseModel)`: `landmarks: List[float]` (must be length 63), `session_id: str`
- `ImageMessage(BaseModel)`: `image_b64: str`, `session_id: str`
- `PredictionResponse(BaseModel)`: `sign: str`, `confidence: float`, `pipeline: str`, `label_index: int`, `probabilities: Optional[List[float]]`
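A dependency-free stand-in for these schemas, using stdlib dataclasses so the sketch runs anywhere; the real `schemas.py` would use `pydantic.BaseModel` with the same field names and the length-63 constraint enforced by a validator.

```python
# Dataclass stand-in for the Pydantic schemas described above.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LandmarkMessage:
    landmarks: List[float]
    session_id: str

    def __post_init__(self):
        # Pydantic would enforce this via a validator
        if len(self.landmarks) != 63:
            raise ValueError("landmarks must contain exactly 63 floats")

@dataclass
class PredictionResponse:
    sign: str
    confidence: float
    pipeline: str
    label_index: int
    probabilities: Optional[List[float]] = None
```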
### 1.9 CORS & Middleware
- [x] Configure CORS for Vercel frontend domain + localhost:5173
- [x] Add request logging middleware (log session_id, pipeline used, latency ms)
- [x] Add global exception handler returning proper JSON errors
---
## Phase 2 – Frontend (React + Vite + Tailwind + Framer Motion)
### 2.1 Project Bootstrap
- [x] Run `npm create vite@latest frontend -- --template react-ts` inside `SanketSetu/`
- [x] Install deps: `tailwindcss`, `framer-motion`, `lucide-react`, `@mediapipe/tasks-vision`
- [x] Configure Tailwind with custom palette (dark neon-cyan glassmorphism theme)
- [x] Set up `vite.config.ts` proxy: `/api` → backend URL, `/ws` → backend WS URL
### 2.2 Webcam Hook (`useWebcam.ts`)
- [x] Request `getUserMedia({ video: { width: 1280, height: 720 } })`
- [x] Expose `videoRef`, `isReady`, `error`, `switchCamera()` (for mobile front/back toggle)
- [x] Handle permission denied state with instructional UI
### 2.3 MediaPipe Hook (`useMediaPipe.ts`)
- [x] Initialize `HandLandmarker` from `@mediapipe/tasks-vision` (WASM backend)
- [x] Process video frames at target 30fps using `requestAnimationFrame`
- [x] Extract `landmarks[0]` (first hand) → flatten to 63 floats `[x0,y0,z0, x1,y1,z1, ...]`
- [x] Normalize: subtract wrist (landmark 0) position to make translation-invariant – **must match training preprocessing**
- [x] Expose `landmarks: number[] | null`, `handedness: string`, `isDetecting: boolean`
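The wrist-relative normalization (implemented in `landmarkUtils.ts`) is sketched here in Python to make the contract explicit, since the exact same transform must have been applied at training time; confirm against the training notebook.

```python
# Wrist-relative landmark normalization (mirrors landmarkUtils.ts).
def normalize_landmarks(flat):
    """flat: [x0,y0,z0, x1,y1,z1, ...] (63 values); subtracts landmark 0."""
    if len(flat) != 63:
        raise ValueError("expected 63 values")
    wx, wy, wz = flat[0], flat[1], flat[2]   # wrist coordinates
    out = []
    for i in range(0, 63, 3):
        out.extend([flat[i] - wx, flat[i + 1] - wy, flat[i + 2] - wz])
    return out
```

After this transform the wrist always sits at (0, 0, 0), so predictions no longer depend on where the hand appears in the frame.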
### 2.4 WebSocket Hook (`useWebSocket.ts`)
- [x] Connect to `wss://backend-url/ws/landmarks` on mount
- [x] Auto-reconnect with exponential backoff on disconnect
- [x] `sendLandmarks(landmarks: number[])` – throttled to max 15 sends/sec
- [x] Expose `lastPrediction: PredictionResponse | null`, `isConnected: boolean`, `latency: number`
### 2.5 Landmark Canvas (`LandmarkCanvas.tsx`)
- [x] Overlay `<canvas>` on top of `<video>` with `position: absolute`
- [x] Draw 21 hand landmark dots (cyan glow: `shadowBlur`, `shadowColor`)
- [x] Draw 21 bone connections following MediaPipe hand topology (finger segments)
- [x] On successful prediction: animate landmarks to pulse/glow with Framer Motion spring
- [x] Use `requestAnimationFrame` for smooth 60fps rendering
### 2.6 Prediction HUD (`PredictionHUD.tsx`)
- [x] Glassmorphism card: `backdrop-blur`, `bg-white/10`, `border-white/20`
- [x] Large Gujarati sign name (mapped from label index)
- [x] Confidence bar: animated width transition via Framer Motion `animate={{ width: confidence% }}`
- [x] Color coding: green (>85%), yellow (60–85%), red (<60%)
- [x] Rolling history list: last 10 recognized signs (Framer Motion `AnimatePresence` for enter/exit)
- [x] Pipeline badge: shows which pipeline (A/B/C) produced the result
- [x] Latency display: shows WS round-trip time in ms
### 2.7 Onboarding Guide (`OnboardingGuide.tsx`)
- [x] 3-step animated wizard using Framer Motion page transitions
1. "Position your hand 30–60 cm from camera"
2. "Ensure good lighting, avoid dark backgrounds"
3. "Show signs clearly – palm facing camera"
- [x] Skip button + "Don't show again" (localStorage)
### 2.8 Calibration Screen (`Calibration.tsx`)
- [x] Brief 2-second "Ready?" screen after onboarding
- [x] Check: hand detected by MediaPipe → show green checkmark animation
- [x] Auto-transitions to main translation view when hand is stable for 1 second
### 2.9 Main App Layout (`App.tsx`)
- [x] Full-screen dark background with subtle animated gradient
- [x] Three-panel layout (desktop): webcam | HUD | history
- [x] Mobile: stacked layout with webcam top, HUD bottom
- [x] Header: "SanketSetu | સંકેત-સેતુ" with glowing text effect
- [x] Settings gear icon → modal for pipeline selection (A / B / C / Ensemble), confidence threshold slider
---
## Phase 3 – Dockerization & Deployment
### 3.1 Backend Dockerfile
- [x] Create `Dockerfile` (repo root, build context includes models)
- [x] Add `.dockerignore` (excludes `.venv`, `node_modules`, `*.pth`, tests)
- [ ] Test locally: `docker build -t sanketsetu-backend . && docker run -p 8000:8000 sanketsetu-backend`
### 3.2 Hugging Face Spaces Configuration
- [x] Create Hugging Face Spaces repository for backend deployment
- [x] Note: Keras/TF will increase Docker image size – use `tensorflow-cpu` to keep it slim
- [ ] Push Docker image to Hugging Face Container Registry
### 3.3 Vercel Frontend Deployment
- [x] Create `frontend/vercel.json` with SPA rewrite + WASM Content-Type header
- [x] Add `VITE_WS_URL` and `VITE_API_URL` to Vercel environment variables (via CI vars)
- [ ] Ensure `@mediapipe/tasks-vision` WASM files are served correctly (add to `public/`)
---
## Phase 4 – Testing & Hardening
### 4.1 Backend Tests
- [x] `tests/test_pipeline_a.py` – 8 unit tests, XGBoost inference (4s)
- [x] `tests/test_pipeline_b.py` – 6 unit tests, encoder + LightGBM (49s)
- [x] `tests/test_pipeline_c.py` – 7 unit tests, CNN + SVM with real 128×128 images (14s)
- [x] `tests/test_websocket.py` – 7 integration tests, health + REST + WS round-trip
### 4.2 Frontend Error Handling
- [ ] No-camera fallback UI (file upload for image mode)
- [x] WS reconnecting banner (red banner when `!isConnected && stage === 'running'`)
- [x] Low-bandwidth mode: reduce send rate to 5fps if latency > 500ms + yellow "LB" badge in HUD
- [x] MediaPipe WASM load failure fallback message (shown in header via `mpError`)
### 4.3 Label Map (Critical)
- [ ] Create `backend/app/models/label_map.py` mapping classes 0–33 to actual Gujarati signs
- Confirm the exact mapping used during training (check the original dataset/notebook)
- Placeholder: `LABEL_MAP = { 0: "ક", 1: "ખ", ... , 33: "?" }`
- This file must exactly mirror what was used in training
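A skeleton for `label_map.py` that at least guards against a wrong class count; every glyph below is a placeholder (the example entry for index 0 is illustrative only) and must be replaced with the mapping from the training dataset.

```python
# Skeleton of backend/app/models/label_map.py.
# All entries are placeholders until confirmed against the training data.
LABEL_MAP = {i: "?" for i in range(34)}   # fill from the training notebook
LABEL_MAP[0] = "ક"  # example only -- verify that index 0 really is this sign

# the models emit 34-class probability vectors, so the map must cover 0-33
assert len(LABEL_MAP) == 34
assert set(LABEL_MAP) == set(range(34))
```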
---
## Execution Order (Start Here)
```
Week 1: Phase 1.1 → 1.3 → 1.7 (get WS working with Pipeline A alone, test in browser)
Week 2: Phase 1.4 → 1.5 → 1.6 (add other pipelines + ensemble)
Week 3: Phase 2.1 → 2.2 → 2.3 → 2.4 (React skeleton + WS connected)
Week 4: Phase 2.5 → 2.6 → 2.7 → 2.8 → 2.9 (full UI)
Week 5: Phase 3 + 4 (deploy + tests)
```
---
## Critical Decision Points
| Decision | Default | Notes |
|---|---|---|
| Primary pipeline | **A (XGBoost)** | Sub-ms inference, uses MediaPipe landmarks already extracted client-side |
| Confidence threshold for fallback | **0.70** | Tune after testing – if XGBoost < 70%, call Pipeline B |
| Enable Pipeline C (CNN) | **Optional / off by default** | Adds ~150ms latency and requires image upload, not just landmarks |
| MediaPipe model variant | **lite** | Use `hand_landmarker_lite.task` for mobile performance |
| WebSocket frame rate | **15fps** | Sufficient for sign recognition, avoids server overload |
| Gujarati label map | **CONFIRM WITH DATASET** | Classes 0–33 must match training data exactly |