# SanketSetu β€” Execution TODO & Implementation Tracker

## Model Analysis (Reviewed 2026-03-02)

All 5 model files inspected. Three distinct inference pipelines exist:

| Pipeline | Files | Input | Process | Output |
|---|---|---|---|---|
| **A β€” Primary (Fastest)** | `Mediapipe_XGBoost/model.pkl` | 63 MediaPipe coords (21 landmarks Γ— x,y,z) | XGBClassifier (50 trees) | 34-class probability |
| **B β€” Autoencoder + LGBM** | `CNN_Autoencoder_LightGBM/autoencoder_model.pkl` + `lgbm_model.pkl` | 63 MediaPipe coords | Encoder (63β†’32β†’**16** bottleneck) + LGBMClassifier | 34-class probability |
| **C β€” Vision CNN + SVM** | `CNN_PreTrained/cnn_model.pkl` + `svm_model.pkl` | 128Γ—128Γ—3 RGB image | ResNet50-based CNN (179 layers) β†’ 256 features + SVC(C=10) | 34-class probability w/ probability=True |

### Key Architecture Facts
- **34 classes** (Gujarati Sign Language alphabet + digits, labels 0–33)
- **Pipeline A** input: 63 floats β€” directly from MediaPipe `hand_landmarks` (x, y, z per landmark, flattened)
- **Pipeline B** input: same 63 floats β†’ takes only the encoder half (first 3 Dense layers, output of `dense_1` layer = 16 features) 
- **Pipeline C** input: 128Γ—128 BGR/RGB cropped hand image, normalized to [0,1]
- All `.pth` files are identical copies of the `.pkl` files (same objects, different extension)
- Model quality strategy: A is primary (sub-ms); if confidence < threshold, query B or C for ensemble

---

## Project Folder Structure to Create

```
SanketSetu/
β”œβ”€β”€ backend/                    ← FastAPI server
β”‚   β”œβ”€β”€ app/
β”‚   β”‚   β”œβ”€β”€ main.py             ← FastAPI entry, WebSocket + REST
β”‚   β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”‚   β”œβ”€β”€ loader.py       ← Singleton model loader
β”‚   β”‚   β”‚   └── label_map.py    ← 0–33 β†’ Gujarati sign name mapping
β”‚   β”‚   β”œβ”€β”€ inference/
β”‚   β”‚   β”‚   β”œβ”€β”€ pipeline_a.py   ← XGBoost inference (63 landmarks)
β”‚   β”‚   β”‚   β”œβ”€β”€ pipeline_b.py   ← Autoencoder encoder + LightGBM
β”‚   β”‚   β”‚   β”œβ”€β”€ pipeline_c.py   ← ResNet CNN + SVM (image-based)
β”‚   β”‚   β”‚   └── ensemble.py     ← Confidence-weighted ensemble logic
β”‚   β”‚   β”œβ”€β”€ schemas.py          ← Pydantic request/response models
β”‚   β”‚   └── config.py           ← Settings (confidence threshold, etc.)
β”‚   β”œβ”€β”€ weights/                ← Symlink or copy of model pkl files
β”‚   β”œβ”€β”€ requirements.txt
β”‚   β”œβ”€β”€ Dockerfile

β”‚
β”œβ”€β”€ frontend/                   ← Vite + React + TS
β”‚   β”œβ”€β”€ src/
β”‚   β”‚   β”œβ”€β”€ components/
β”‚   β”‚   β”‚   β”œβ”€β”€ WebcamFeed.tsx       ← Webcam + canvas landmark overlay
β”‚   β”‚   β”‚   β”œβ”€β”€ LandmarkCanvas.tsx   ← Draws 21 hand points + connections
β”‚   β”‚   β”‚   β”œβ”€β”€ PredictionHUD.tsx    ← Live sign, confidence bar, history
β”‚   β”‚   β”‚   β”œβ”€β”€ OnboardingGuide.tsx  ← Animated intro wizard
β”‚   β”‚   β”‚   └── Calibration.tsx      ← Lighting/distance check UI
β”‚   β”‚   β”œβ”€β”€ hooks/
β”‚   β”‚   β”‚   β”œβ”€β”€ useWebSocket.ts      ← WS connection, send/receive
β”‚   β”‚   β”‚   β”œβ”€β”€ useMediaPipe.ts      ← MediaPipe Hands JS integration
β”‚   β”‚   β”‚   └── useWebcam.ts         ← Camera permissions + stream
β”‚   β”‚   β”œβ”€β”€ lib/
β”‚   β”‚   β”‚   └── landmarkUtils.ts     ← Landmark normalization (mirror XGBoost preprocessing)
β”‚   β”‚   β”œβ”€β”€ App.tsx
β”‚   β”‚   └── main.tsx
β”‚   β”œβ”€β”€ public/
β”‚   β”œβ”€β”€ index.html
β”‚   β”œβ”€β”€ tailwind.config.ts
β”‚   β”œβ”€β”€ vite.config.ts
β”‚   └── package.json
β”‚
β”œβ”€β”€ CNN_Autoencoder_LightGBM/   ← (existing)
β”œβ”€β”€ CNN_PreTrained/             ← (existing)
β”œβ”€β”€ Mediapipe_XGBoost/          ← (existing)
└── .github/
    └── workflows/
        β”œβ”€β”€ deploy-backend.yml
        └── deploy-frontend.yml
```

---

## Phase 1 β€” Backend Core (FastAPI + Model Integration)

### 1.1 Project Bootstrap
- [x] Create `backend/` folder and `app/` package structure
- [x] Create `backend/requirements.txt` with: `fastapi`, `uvicorn[standard]`, `websockets`, `xgboost`, `lightgbm`, `scikit-learn`, `keras==3.13.2`, `tensorflow-cpu`, `numpy`, `opencv-python-headless`, `pillow`, `python-dotenv`
- [x] Create `backend/app/config.py` β€” confidence threshold (default 0.7), WebSocket max connections, pipeline mode (A/B/C/ensemble)
- [x] Create `backend/app/models/label_map.py` β€” map class indices 0–33 to Gujarati sign names

### 1.2 Model Loader (Singleton)
- [x] Create `backend/app/models/loader.py`
  - Load `model.pkl` (XGBoost) at startup
  - Load `autoencoder_model.pkl` (extract encoder layers only: input β†’ dense β†’ dense_1) and `lgbm_model.pkl`
  - Load `cnn_model.pkl` (full ResNet50 feature extractor, strip any classification head) and `svm_model.pkl`
  - Expose `ModelStore` singleton accessed via `get_model_store()` dependency
  - Log load times for each model
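A minimal sketch of the singleton pattern described above; the file names, the lazy `get()` accessor, and the `weights_dir` default are illustrative, not the actual backend code:

```python
import pickle
import time
from functools import lru_cache


class ModelStore:
    """Loads each pickle at most once and caches it in memory."""

    def __init__(self, weights_dir: str = "weights"):
        self.weights_dir = weights_dir
        self._cache: dict = {}

    def get(self, name: str):
        # Lazy load on first access; subsequent calls hit the cache.
        if name not in self._cache:
            t0 = time.perf_counter()
            with open(f"{self.weights_dir}/{name}.pkl", "rb") as f:
                self._cache[name] = pickle.load(f)
            print(f"loaded {name} in {time.perf_counter() - t0:.3f}s")
        return self._cache[name]


@lru_cache(maxsize=1)
def get_model_store() -> ModelStore:
    # Used as a FastAPI dependency: every request shares one store instance.
    return ModelStore()
```

`lru_cache(maxsize=1)` on a zero-argument factory is a simple way to get a process-wide singleton that still plays nicely with FastAPI's `Depends()`.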

### 1.3 Pipeline A β€” XGBoost (Primary, Landmarks)
- [x] Create `backend/app/inference/pipeline_a.py`
  - Input: `List[float]` of length 63 (x,y,z per landmark, already normalized by MediaPipe)
  - Output: `{"sign": str, "confidence": float, "probabilities": List[float]}`
  - Use `model.predict_proba(np.array(landmarks).reshape(1,-1))[0]`
  - Return `classes_[argmax]` and `max(probabilities)` as confidence
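The steps above can be sketched as one pure function; `model` and `label_map` stand in for objects held by the model store, and all names are assumptions rather than the real module:

```python
import numpy as np


def predict_sign(model, landmarks, label_map):
    """Pipeline A sketch: 63 floats -> sign, confidence, probabilities.

    `model` is assumed to expose scikit-learn's predict_proba API
    (XGBClassifier does).
    """
    x = np.asarray(landmarks, dtype=np.float32).reshape(1, -1)
    if x.shape[1] != 63:
        raise ValueError(f"expected 63 landmark values, got {x.shape[1]}")
    proba = model.predict_proba(x)[0]
    idx = int(np.argmax(proba))
    return {
        "sign": label_map[idx],
        "confidence": float(proba[idx]),
        "probabilities": proba.tolist(),
    }
```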

### 1.4 Pipeline B β€” Autoencoder Encoder + LightGBM
- [x] Create `backend/app/inference/pipeline_b.py`
  - Build encoder-only submodel: `encoder = keras.Model(inputs=model.input, outputs=model.layers[2].output)` (output of `dense_1`, the 16-D bottleneck)
  - Input: 63 MediaPipe coords
  - Encode: `features = encoder.predict(np.array(landmarks).reshape(1,-1))[0]`  β†’ shape (16,)
  - Classify: `lgbm.predict_proba(features.reshape(1,-1))[0]`
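As a sketch, with the keras encoder submodel and the LightGBM classifier passed in as arguments (names are illustrative; the encoder is assumed to be built exactly as described above):

```python
import numpy as np


def predict_sign_b(encoder, lgbm, landmarks):
    """Pipeline B sketch: encode 63 coords to the 16-D bottleneck,
    then classify with LightGBM.

    `encoder` is assumed to be the submodel built as
    keras.Model(inputs=model.input, outputs=model.layers[2].output).
    """
    x = np.asarray(landmarks, dtype=np.float32).reshape(1, -1)
    features = encoder.predict(x, verbose=0)       # shape (1, 16)
    proba = lgbm.predict_proba(features)[0]        # shape (34,)
    return int(np.argmax(proba)), float(np.max(proba))
```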

### 1.5 Pipeline C β€” CNN + SVM (Image-based)
- [x] Create `backend/app/inference/pipeline_c.py`
  - Input: base64-encoded JPEG or raw bytes of the cropped hand region (128Γ—128 px)
  - Decode β†’ numpy array (128,128,3) uint8 β†’ normalize to float32 [0,1]
  - `features = cnn_model.predict(img[np.newaxis])[0]`  β†’ shape (256,)
  - `proba = svm.predict_proba(features.reshape(1,-1))[0]`
  - Note: CNN inference is slower (~50–200ms on CPU); only call when Pipeline A confidence < threshold
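A sketch of the inference step, assuming the base64 payload has already been decoded upstream into a `(128, 128, 3)` uint8 array (the decode itself would use OpenCV or Pillow, per the requirements list):

```python
import numpy as np


def predict_sign_c(cnn, svm, img_u8):
    """Pipeline C sketch: normalize a decoded 128x128 RGB image,
    extract 256 CNN features, classify with the SVM."""
    if img_u8.shape != (128, 128, 3):
        raise ValueError(f"expected (128, 128, 3), got {img_u8.shape}")
    img = img_u8.astype(np.float32) / 255.0              # normalize to [0, 1]
    features = cnn.predict(img[np.newaxis], verbose=0)[0]  # shape (256,)
    proba = svm.predict_proba(features.reshape(1, -1))[0]
    return int(np.argmax(proba)), float(np.max(proba))
```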

### 1.6 Ensemble Logic
- [x] Create `backend/app/inference/ensemble.py`
  - Call Pipeline A first
  - If `confidence < config.THRESHOLD` (default 0.7), call Pipeline B
  - If still below threshold and image data available, call Pipeline C
  - Final result: weighted average of probabilities from each pipeline that was called
  - Return the top predicted class and ensemble confidence score
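The escalation-then-average logic can be sketched as below; `pipelines` is an ordered list of `(name, fn)` pairs, and equal weights stand in for whatever tuned weights the real ensemble ends up using:

```python
import numpy as np


def ensemble_predict(pipelines, threshold=0.7):
    """Ensemble sketch. Each fn returns a 34-length probability
    vector, or None if it cannot run (e.g. Pipeline C with no image).
    Stops escalating as soon as one pipeline clears the threshold.
    """
    collected, used = [], []
    for name, fn in pipelines:
        proba = fn()
        if proba is None:
            continue
        collected.append(np.asarray(proba, dtype=np.float64))
        used.append(name)
        if float(np.max(proba)) >= threshold:
            break  # confident enough; skip the slower pipelines
    avg = np.mean(collected, axis=0)  # equal weights; swap in tuned weights
    idx = int(np.argmax(avg))
    return {
        "label_index": idx,
        "confidence": float(avg[idx]),
        "pipeline": "+".join(used),
    }
```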

### 1.7 WebSocket Handler
- [x] Create `backend/app/main.py` with FastAPI app
- [x] Implement `GET /health` β€” returns `{"status": "ok", "models_loaded": true}`
- [x] Implement `WS /ws/landmarks` β€” primary endpoint
  - Client sends JSON: `{"landmarks": [63 floats], "session_id": "..."}`
  - Server responds: `{"sign": "...", "confidence": 0.95, "pipeline": "A", "label_index": 12}`
  - Handle disconnect gracefully
- [x] Implement `WS /ws/image` β€” optional image-based endpoint for Pipeline C
  - Client sends JSON: `{"image_b64": "...", "session_id": "..."}`
- [x] Implement `POST /api/predict` β€” REST fallback for non-WS clients
  - Body: `{"landmarks": [63 floats]}`
  - Returns same response schema as WS
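Keeping the frame-handling logic separate from the socket plumbing makes it unit-testable. A sketch with illustrative names (`predict_fn` stands in for whichever pipeline/ensemble the config selects):

```python
import json


def handle_landmark_message(raw: str, predict_fn):
    """Validate one /ws/landmarks frame and build the response dict
    the socket will send back. Bad frames return an error payload
    instead of raising, so the connection stays open."""
    try:
        msg = json.loads(raw)
        landmarks = msg["landmarks"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return {"error": f"bad frame: {exc}"}
    if len(landmarks) != 63:
        return {"error": f"expected 63 landmarks, got {len(landmarks)}"}
    sign, confidence, label_index, pipeline = predict_fn(landmarks)
    return {
        "sign": sign,
        "confidence": confidence,
        "pipeline": pipeline,
        "label_index": label_index,
    }
```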

### 1.8 Schemas & Validation
- [x] Create `backend/app/schemas.py`
  - `LandmarkMessage(BaseModel)`: `landmarks: List[float]` (must be length 63), `session_id: str`
  - `ImageMessage(BaseModel)`: `image_b64: str`, `session_id: str`
  - `PredictionResponse(BaseModel)`: `sign: str`, `confidence: float`, `pipeline: str`, `label_index: int`, `probabilities: Optional[List[float]]`
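A Pydantic v2 sketch of the two most important models, with the length-63 check as a field validator (assumed API; adjust if the backend pins Pydantic v1):

```python
from typing import List, Optional

from pydantic import BaseModel, field_validator


class LandmarkMessage(BaseModel):
    landmarks: List[float]
    session_id: str

    @field_validator("landmarks")
    @classmethod
    def must_be_63(cls, v: List[float]) -> List[float]:
        if len(v) != 63:
            raise ValueError(f"expected 63 values, got {len(v)}")
        return v


class PredictionResponse(BaseModel):
    sign: str
    confidence: float
    pipeline: str
    label_index: int
    probabilities: Optional[List[float]] = None
```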

### 1.9 CORS & Middleware
- [x] Configure CORS for Vercel frontend domain + localhost:5173
- [x] Add request logging middleware (log session_id, pipeline used, latency ms)
- [x] Add global exception handler returning proper JSON errors
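A hedged configuration sketch of the CORS setup; the Vercel domain is a placeholder to be replaced with the real deployment URL:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:5173",        # Vite dev server
        "https://your-app.vercel.app",  # placeholder: real Vercel domain
    ],
    allow_methods=["*"],
    allow_headers=["*"],
)
```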

---

## Phase 2 β€” Frontend (React + Vite + Tailwind + Framer Motion)

### 2.1 Project Bootstrap
- [x] Run `npm create vite@latest frontend -- --template react-ts` inside `SanketSetu/`
- [x] Install deps: `tailwindcss`, `framer-motion`, `lucide-react`, `@mediapipe/tasks-vision`
- [x] Configure Tailwind with custom palette (dark neon-cyan glassmorphism theme)
- [x] Set up `vite.config.ts` proxy: `/api` β†’ backend URL, `/ws` β†’ backend WS URL

### 2.2 Webcam Hook (`useWebcam.ts`)
- [x] Request `getUserMedia({ video: { width: 1280, height: 720 } })`
- [x] Expose `videoRef`, `isReady`, `error`, `switchCamera()` (for mobile front/back toggle)
- [x] Handle permission denied state with instructional UI

### 2.3 MediaPipe Hook (`useMediaPipe.ts`)
- [x] Initialize `HandLandmarker` from `@mediapipe/tasks-vision` (WASM backend)
- [x] Process video frames at target 30fps using `requestAnimationFrame`
- [x] Extract `landmarks[0]` (first hand) β†’ flatten to 63 floats `[x0,y0,z0, x1,y1,z1, ...]`
- [x] Normalize: subtract wrist (landmark 0) position to make translation-invariant β€” **must match training preprocessing**
- [x] Expose `landmarks: number[] | null`, `handedness: string`, `isDetecting: boolean`
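Since the TypeScript `landmarkUtils.ts` must mirror the training preprocessing exactly, it helps to pin down the math once as a reference. A Python sketch of the assumed wrist-relative normalization (confirm against the original training notebook before trusting it):

```python
import numpy as np


def normalize_landmarks(flat63):
    """Subtract the wrist (landmark 0) from every landmark so the
    63-vector is translation-invariant. ASSUMED to mirror training
    preprocessing; verify against the original notebook."""
    pts = np.asarray(flat63, dtype=np.float32).reshape(21, 3)
    pts = pts - pts[0]          # wrist becomes the origin
    return pts.reshape(-1).tolist()
```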

### 2.4 WebSocket Hook (`useWebSocket.ts`)
- [x] Connect to `wss://backend-url/ws/landmarks` on mount
- [x] Auto-reconnect with exponential backoff on disconnect
- [x] `sendLandmarks(landmarks: number[])` β€” throttled to max 15 sends/sec
- [x] Expose `lastPrediction: PredictionResponse | null`, `isConnected: boolean`, `latency: number`

### 2.5 Landmark Canvas (`LandmarkCanvas.tsx`)
- [x] Overlay `<canvas>` on top of `<video>` with `position: absolute`
- [x] Draw 21 hand landmark dots (cyan glow: `shadowBlur`, `shadowColor`)
- [x] Draw 21 bone connections following MediaPipe hand topology (finger segments)
- [x] On successful prediction: animate landmarks to pulse/glow with Framer Motion spring
- [x] Use `requestAnimationFrame` for smooth 60fps rendering

### 2.6 Prediction HUD (`PredictionHUD.tsx`)
- [x] Glassmorphism card: `backdrop-blur`, `bg-white/10`, `border-white/20`
- [x] Large Gujarati sign name (mapped from label index)
- [x] Confidence bar: animated width transition via Framer Motion `animate={{ width: confidence% }}`
- [x] Color coding: green (>85%), yellow (60–85%), red (<60%)
- [x] Rolling history list: last 10 recognized signs (Framer Motion `AnimatePresence` for enter/exit)
- [x] Pipeline badge: shows which pipeline (A/B/C) produced the result
- [x] Latency display: shows WS round-trip time in ms

### 2.7 Onboarding Guide (`OnboardingGuide.tsx`)
- [x] 3-step animated wizard using Framer Motion page transitions
  1. "Position your hand 30–60cm from camera"
  2. "Ensure good lighting, avoid dark backgrounds"
  3. "Show signs clearly β€” palm facing camera"
- [x] Skip button + "Don't show again" (localStorage)

### 2.8 Calibration Screen (`Calibration.tsx`)
- [x] Brief 2-second "Ready?" screen after onboarding
- [x] Check: hand detected by MediaPipe β†’ show green checkmark animation
- [x] Auto-transitions to main translation view when hand is stable for 1 second

### 2.9 Main App Layout (`App.tsx`)
- [x] Full-screen dark background with subtle animated gradient
- [x] Three-panel layout (desktop): webcam | HUD | history
- [x] Mobile: stacked layout with webcam top, HUD bottom
- [x] Header: "SanketSetu | ΰͺΈΰͺ‚ΰͺ•ેΰͺ€-ΰͺΈΰ«‡ΰͺ€ΰ«" with glowing text effect
- [x] Settings gear icon β†’ modal for pipeline selection (A / B / C / Ensemble), confidence threshold slider

---

## Phase 3 β€” Dockerization & Deployment

### 3.1 Backend Dockerfile
- [x] Create `Dockerfile` (repo root, build context includes models)
- [x] Add `.dockerignore` (excludes `.venv`, `node_modules`, `*.pth`, tests)
- [ ] Test locally: `docker build -t sanketsetu-backend . && docker run -p 8000:8000 sanketsetu-backend`

### 3.2 Hugging Face Spaces Configuration
- [x] Create Hugging Face Spaces repository for backend deployment
- [x] Note: Keras/TF will increase Docker image size β€” use `tensorflow-cpu` to keep slim
- [ ] Push Docker image to Hugging Face Container Registry

### 3.3 Vercel Frontend Deployment
- [x] Create `frontend/vercel.json` with SPA rewrite + WASM Content-Type header
- [x] Add `VITE_WS_URL` and `VITE_API_URL` to Vercel environment variables (via CI vars)
- [ ] Ensure `@mediapipe/tasks-vision` WASM files are served correctly (add to `public/`)

---

## Phase 4 β€” Testing & Hardening

### 4.1 Backend Tests
- [x] `tests/test_pipeline_a.py` β€” 8 unit tests, XGBoost inference (4s)
- [x] `tests/test_pipeline_b.py` β€” 6 unit tests, encoder + LightGBM (49s)
- [x] `tests/test_pipeline_c.py` β€” 7 unit tests, CNN + SVM with real 128Γ—128 images (14s)
- [x] `tests/test_websocket.py` β€” 7 integration tests, health + REST + WS round-trip

### 4.2 Frontend Error Handling
- [ ] No-camera fallback UI (file upload for image mode)
- [x] WS reconnecting banner (red banner when `!isConnected && stage === 'running'`)
- [x] Low-bandwidth mode: reduce send rate to 5fps if latency > 500ms + yellow "LB" badge in HUD
- [x] MediaPipe WASM load failure fallback message (shown in header via `mpError`)

### 4.3 Label Map (Critical)
- [ ] Create `backend/app/models/label_map.py` mapping classes 0–33 to actual Gujarati signs
  - You need to confirm the exact mapping used during training (check your original dataset/notebook)
  - Placeholder: `LABEL_MAP = { 0: "ΰͺ•", 1: "ΰͺ–", ... , 33: "?" }`
  - This file must exactly mirror what was used in training
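A placeholder-only sketch of the shape such a file could take; the glyph order below is NOT verified against the training dataset and every entry must be replaced with the confirmed mapping:

```python
# PLACEHOLDER: order is unverified; replace with the mapping from training.
_KNOWN_PREFIX = ["ક", "ખ", "ગ", "ઘ"]  # illustrative first consonants only

LABEL_MAP = {
    i: (_KNOWN_PREFIX[i] if i < len(_KNOWN_PREFIX) else "?")
    for i in range(34)
}


def sign_name(label_index: int) -> str:
    """Safe lookup so an out-of-range index never crashes the server."""
    return LABEL_MAP.get(label_index, "?")
```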

---

## Execution Order (Start Here)

```
Week 1: Phase 1.1 β†’ 1.3 β†’ 1.7 (get WS working with Pipeline A alone, test in browser)
Week 2: Phase 1.4 β†’ 1.5 β†’ 1.6 (add other pipelines + ensemble)
Week 3: Phase 2.1 β†’ 2.2 β†’ 2.3 β†’ 2.4 (React skeleton + WS connected)
Week 4: Phase 2.5 β†’ 2.6 β†’ 2.7 β†’ 2.8 β†’ 2.9 (full UI)
Week 5: Phase 3 + 4 (deploy + tests)
```

---

## Critical Decision Points

| Decision | Default | Notes |
|---|---|---|
| Primary pipeline | **A (XGBoost)** | Sub-ms inference, uses MediaPipe landmarks already extracted client-side |
| Confidence threshold for fallback | **0.70** | Tune after testing; if Pipeline A confidence < 70%, fall back to Pipeline B |
| Enable Pipeline C (CNN) | **Optional / off by default** | Adds ~150ms latency and requires image upload, not just landmarks |
| MediaPipe model variant | **lite** | Use `hand_landmarker_lite.task` for mobile performance |
| WebSocket frame rate | **15fps** | Sufficient for sign recognition, avoids server overload |
| Gujarati label map | **CONFIRM WITH DATASET** | Classes 0–33 must match training data exactly |