SanketSetu – Execution TODO & Implementation Tracker
Model Analysis (Reviewed 2026-03-02)
All 5 model files inspected. Three distinct inference pipelines exist:
| Pipeline | Files | Input | Process | Output |
|---|---|---|---|---|
| A – Primary (fastest) | `Mediapipe_XGBoost/model.pkl` | 63 MediaPipe coords (21 landmarks × x,y,z) | XGBClassifier (50 trees) | 34-class probability |
| B – Autoencoder + LGBM | `CNN_Autoencoder_LightGBM/autoencoder_model.pkl` + `lgbm_model.pkl` | 63 MediaPipe coords | Encoder (63→32→16 bottleneck) + LGBMClassifier | 34-class probability |
| C – Vision CNN + SVM | `CNN_PreTrained/cnn_model.pkl` + `svm_model.pkl` | 128×128×3 RGB image | ResNet50-based CNN (179 layers) → 256 features + SVC(C=10, probability=True) | 34-class probability |
Key Architecture Facts
- 34 classes (Gujarati Sign Language alphabet + digits, labels 0–33)
- Pipeline A input: 63 floats → taken directly from MediaPipe `hand_landmarks` (x, y, z per landmark, flattened)
- Pipeline B input: same 63 floats → uses only the encoder half (first 3 Dense layers; output of the `dense_1` layer = 16 features)
- Pipeline C input: 128×128 BGR/RGB cropped hand image, normalized to [0,1]
- All `.pth` files are identical copies of the `.pkl` files (same objects, different extension)
- Model quality strategy: A is primary (sub-ms); if confidence < threshold, query B or C for an ensemble
Project Folder Structure to Create
```
SanketSetu/
├── backend/                      → FastAPI server
│   ├── app/
│   │   ├── main.py               → FastAPI entry, WebSocket + REST
│   │   ├── models/
│   │   │   ├── loader.py         → Singleton model loader
│   │   │   └── label_map.py      → 0–33 → Gujarati sign name mapping
│   │   ├── inference/
│   │   │   ├── pipeline_a.py     → XGBoost inference (63 landmarks)
│   │   │   ├── pipeline_b.py     → Autoencoder encoder + LightGBM
│   │   │   ├── pipeline_c.py     → ResNet CNN + SVM (image-based)
│   │   │   └── ensemble.py      → Confidence-weighted ensemble logic
│   │   ├── schemas.py            → Pydantic request/response models
│   │   └── config.py             → Settings (confidence threshold, etc.)
│   ├── weights/                  → Symlink or copy of model pkl files
│   ├── requirements.txt
│   └── Dockerfile
│
├── frontend/                     → Vite + React + TS
│   ├── src/
│   │   ├── components/
│   │   │   ├── WebcamFeed.tsx       → Webcam + canvas landmark overlay
│   │   │   ├── LandmarkCanvas.tsx   → Draws 21 hand points + connections
│   │   │   ├── PredictionHUD.tsx    → Live sign, confidence bar, history
│   │   │   ├── OnboardingGuide.tsx  → Animated intro wizard
│   │   │   └── Calibration.tsx      → Lighting/distance check UI
│   │   ├── hooks/
│   │   │   ├── useWebSocket.ts      → WS connection, send/receive
│   │   │   ├── useMediaPipe.ts      → MediaPipe Hands JS integration
│   │   │   └── useWebcam.ts         → Camera permissions + stream
│   │   ├── lib/
│   │   │   └── landmarkUtils.ts     → Landmark normalization (mirror XGBoost preprocessing)
│   │   ├── App.tsx
│   │   └── main.tsx
│   ├── public/
│   ├── index.html
│   ├── tailwind.config.ts
│   ├── vite.config.ts
│   └── package.json
│
├── CNN_Autoencoder_LightGBM/     → (existing)
├── CNN_PreTrained/               → (existing)
├── Mediapipe_XGBoost/            → (existing)
└── .github/
    └── workflows/
        ├── deploy-backend.yml
        └── deploy-frontend.yml
```
Phase 1 – Backend Core (FastAPI + Model Integration)
1.1 Project Bootstrap
- Create the `backend/` folder and `app/` package structure
- Create `backend/requirements.txt` with: `fastapi`, `uvicorn[standard]`, `websockets`, `xgboost`, `lightgbm`, `scikit-learn`, `keras==3.13.2`, `tensorflow-cpu`, `numpy`, `opencv-python-headless`, `pillow`, `python-dotenv`
- Create `backend/app/config.py` – confidence threshold (default 0.7), WebSocket max connections, pipeline mode (A/B/C/ensemble)
- Create `backend/app/models/label_map.py` – map class indices 0–33 to Gujarati sign names
1.2 Model Loader (Singleton)
- Create `backend/app/models/loader.py`
- Load `model.pkl` (XGBoost) at startup
- Load `autoencoder_model.pkl` (extract the encoder layers only: input → dense → dense_1) and `lgbm_model.pkl`
- Load `cnn_model.pkl` (full ResNet50 feature extractor, strip any classification head) and `svm_model.pkl`
- Expose a `ModelStore` singleton accessed via a `get_model_store()` dependency
- Log load times for each model
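The loader/singleton design above can be sketched as follows. `ModelStore` and `get_model_store()` are the names this plan specifies; the injected-callable loading style and the timing log are illustrative assumptions, chosen so the sketch stays importable without keras/xgboost installed:

```python
import time
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


class ModelStore:
    """Holds all loaded models; constructed once per process at startup."""

    def __init__(self, weights_dir: str = "weights"):
        self.weights_dir = Path(weights_dir)
        self.xgb = None      # Pipeline A: XGBClassifier
        self.encoder = None  # Pipeline B: encoder half of the autoencoder
        self.lgbm = None     # Pipeline B: LGBMClassifier
        self.cnn = None      # Pipeline C: ResNet50 feature extractor
        self.svm = None      # Pipeline C: SVC(probability=True)

    def load_all(self, loaders: dict) -> None:
        """`loaders` maps attribute name -> zero-arg callable returning a model.

        Injecting callables keeps this module free of hard keras/xgboost imports.
        """
        for name, load in loaders.items():
            t0 = time.perf_counter()
            setattr(self, name, load())
            logger.info("loaded %s in %.1f ms", name, (time.perf_counter() - t0) * 1e3)


_store = None


def get_model_store() -> ModelStore:
    """FastAPI dependency: returns the process-wide singleton."""
    global _store
    if _store is None:
        _store = ModelStore()
    return _store
```

Used as a FastAPI dependency (`Depends(get_model_store)`), every request handler sees the same already-loaded models.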
1.3 Pipeline A – XGBoost (Primary, Landmarks)
- Create `backend/app/inference/pipeline_a.py`
- Input: `List[float]` of length 63 (x,y,z per landmark, already normalized by MediaPipe)
- Output: `{"sign": str, "confidence": float, "probabilities": List[float]}`
- Use `model.predict_proba(np.array(landmarks).reshape(1, -1))[0]`
- Return `classes_[argmax]` and `max(probabilities)` as the confidence
1.4 Pipeline B – Autoencoder Encoder + LightGBM
- Create `backend/app/inference/pipeline_b.py`
- Build the encoder-only submodel: `encoder = keras.Model(inputs=model.input, outputs=model.layers[2].output)` (output of `dense_1`, the 16-D bottleneck)
- Input: 63 MediaPipe coords
- Encode: `features = encoder.predict(np.array(landmarks).reshape(1, -1))[0]` → shape (16,)
- Classify: `lgbm.predict_proba(features.reshape(1, -1))[0]`
1.5 Pipeline C – CNN + SVM (Image-based)
- Create `backend/app/inference/pipeline_c.py`
- Input: base64-encoded JPEG or raw bytes of the cropped hand region (128×128 px)
- Decode → numpy array (128, 128, 3) uint8 → normalize to float32 [0,1]
- Extract: `features = cnn_model.predict(img[np.newaxis])[0]` → shape (256,)
- Classify: `proba = svm.predict_proba(features.reshape(1, -1))[0]`
- Note: CNN inference is slower (~50–200 ms on CPU); only call it when Pipeline A confidence < threshold
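The decode-and-normalize step can be sketched with the standard library plus numpy. For simplicity this sketch assumes the base64 payload carries raw RGB bytes (128×128×3 = 49152 bytes); the real `pipeline_c.py` would decode a JPEG with `cv2.imdecode` or `PIL.Image.open` instead:

```python
import base64

import numpy as np


def preprocess_image_b64(image_b64: str) -> np.ndarray:
    """Decode a base64 payload into a float32 [0,1] array of shape (128,128,3)."""
    raw = base64.b64decode(image_b64)
    img = np.frombuffer(raw, dtype=np.uint8)
    if img.size != 128 * 128 * 3:
        raise ValueError(f"expected 49152 bytes, got {img.size}")
    # Reshape to H x W x C and scale 0..255 -> 0.0..1.0, matching training.
    return img.reshape(128, 128, 3).astype(np.float32) / 255.0
```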
1.6 Ensemble Logic
- Create `backend/app/inference/ensemble.py`
- Call Pipeline A first
- If `confidence < config.THRESHOLD` (default 0.7), call Pipeline B
- If still below threshold and image data is available, call Pipeline C
- Final result: weighted average of the probabilities from each pipeline that was called
- Return the top predicted class and the ensemble confidence score
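One way to realize this cascade-plus-weighted-average, using injected callables instead of the real pipelines. The peak-confidence weighting scheme is an assumption; the plan only specifies "weighted average":

```python
import numpy as np

THRESHOLD = 0.70  # config.THRESHOLD default from this plan


def ensemble_predict(pipelines: dict, threshold: float = THRESHOLD) -> dict:
    """Cascade A -> B -> C, then confidence-weight the probability vectors.

    `pipelines` maps a name ("A"/"B"/"C") to a zero-arg callable returning a
    34-element probability array, ordered by preference.
    """
    called = {}
    for name, run in pipelines.items():
        proba = np.asarray(run())
        called[name] = proba
        if proba.max() >= threshold:
            break  # confident enough, stop cascading

    # Weighted average: each called pipeline's weight is its peak confidence.
    weights = np.array([p.max() for p in called.values()])
    stacked = np.stack(list(called.values()))
    combined = (weights[:, None] * stacked).sum(axis=0) / weights.sum()

    top = int(np.argmax(combined))
    return {
        "label_index": top,
        "confidence": float(combined[top]),
        "pipelines_used": list(called.keys()),
    }
```

If Pipeline A already clears the threshold, the loop breaks immediately and the result is just A's own distribution.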
1.7 WebSocket Handler
- Create `backend/app/main.py` with the FastAPI app
- Implement `GET /health` → returns `{"status": "ok", "models_loaded": true}`
- Implement `WS /ws/landmarks` → primary endpoint
  - Client sends JSON: `{"landmarks": [63 floats], "session_id": "..."}`
  - Server responds: `{"sign": "...", "confidence": 0.95, "pipeline": "A", "label_index": 12}`
  - Handle disconnects gracefully
- Implement `WS /ws/image` → optional image-based endpoint for Pipeline C
  - Client sends JSON: `{"image_b64": "...", "session_id": "..."}`
- Implement `POST /api/predict` → REST fallback for non-WS clients
  - Body: `{"landmarks": [63 floats]}`
  - Returns the same response schema as WS
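The frame shapes above can be exercised with plain `json`; the length-63 validation rule and the field names come from this spec, while the helper names are illustrative:

```python
import json


def parse_landmark_message(raw: str) -> dict:
    """Validate an incoming /ws/landmarks frame per the schema above."""
    msg = json.loads(raw)
    landmarks = msg.get("landmarks")
    if not isinstance(landmarks, list) or len(landmarks) != 63:
        raise ValueError("landmarks must be a list of 63 floats")
    if "session_id" not in msg:
        raise ValueError("session_id is required")
    return msg


def build_prediction_frame(sign: str, confidence: float,
                           pipeline: str, label_index: int) -> str:
    """Serialize the server response in the agreed shape."""
    return json.dumps({
        "sign": sign,
        "confidence": confidence,
        "pipeline": pipeline,
        "label_index": label_index,
    })
```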
1.8 Schemas & Validation
- Create `backend/app/schemas.py`
- `LandmarkMessage(BaseModel)`: `landmarks: List[float]` (must be length 63), `session_id: str`
- `ImageMessage(BaseModel)`: `image_b64: str`, `session_id: str`
- `PredictionResponse(BaseModel)`: `sign: str`, `confidence: float`, `pipeline: str`, `label_index: int`, `probabilities: Optional[List[float]]`
1.9 CORS & Middleware
- Configure CORS for Vercel frontend domain + localhost:5173
- Add request logging middleware (log session_id, pipeline used, latency ms)
- Add global exception handler returning proper JSON errors
Phase 2 – Frontend (React + Vite + Tailwind + Framer Motion)
2.1 Project Bootstrap
- Run `npm create vite@latest frontend -- --template react-ts` inside `SanketSetu/`
- Install deps: `tailwindcss`, `framer-motion`, `lucide-react`, `@mediapipe/tasks-vision`
- Configure Tailwind with a custom palette (dark neon-cyan glassmorphism theme)
- Set up the `vite.config.ts` proxy: `/api` → backend URL, `/ws` → backend WS URL
2.2 Webcam Hook (useWebcam.ts)
- Request `getUserMedia({ video: { width: 1280, height: 720 } })`
- Expose `videoRef`, `isReady`, `error`, `switchCamera()` (for the mobile front/back toggle)
- Handle the permission-denied state with instructional UI
2.3 MediaPipe Hook (useMediaPipe.ts)
- Initialize `HandLandmarker` from `@mediapipe/tasks-vision` (WASM backend)
- Process video frames at a target 30 fps using `requestAnimationFrame`
- Extract `landmarks[0]` (first hand) → flatten to 63 floats `[x0, y0, z0, x1, y1, z1, ...]`
- Normalize: subtract the wrist (landmark 0) position to make the input translation-invariant → must match training preprocessing
- Expose `landmarks: number[] | null`, `handedness: string`, `isDetecting: boolean`
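Because the frontend normalization must match the backend/training preprocessing exactly, the wrist-subtraction step is worth pinning down with a tiny reference implementation (shown in Python to match the backend; whether training applied exactly this step still needs to be confirmed against the original notebook):

```python
import numpy as np


def normalize_landmarks(flat: list) -> np.ndarray:
    """Translate 63 flattened coords so the wrist (landmark 0) is the origin.

    Input layout is [x0, y0, z0, x1, y1, z1, ...], i.e. MediaPipe's 21 hand
    landmarks flattened in order.
    """
    pts = np.asarray(flat, dtype=np.float32).reshape(21, 3)
    pts = pts - pts[0]      # subtract wrist position -> translation-invariant
    return pts.reshape(-1)  # back to 63 floats
```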
2.4 WebSocket Hook (useWebSocket.ts)
- Connect to `wss://backend-url/ws/landmarks` on mount
- Auto-reconnect with exponential backoff on disconnect
- `sendLandmarks(landmarks: number[])` → throttled to max 15 sends/sec
- Expose `lastPrediction: PredictionResponse | null`, `isConnected: boolean`, `latency: number`
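The reconnect schedule can be made concrete; shown in Python to keep this document's examples in one language, with `base`, `cap`, and the retry count as illustrative defaults rather than spec values:

```python
def backoff_delays(base: float = 0.5, cap: float = 30.0, retries: int = 8) -> list:
    """Exponential backoff schedule in seconds: base * 2^n, clamped at cap."""
    return [min(base * (2 ** n), cap) for n in range(retries)]
```

The TypeScript hook would wait `backoff_delays()[attempt]` seconds before the next `new WebSocket(...)` attempt and reset the counter on a successful open.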
2.5 Landmark Canvas (LandmarkCanvas.tsx)
- Overlay a `<canvas>` on top of the `<video>` with `position: absolute`
- Draw the 21 hand-landmark dots (cyan glow: `shadowBlur`, `shadowColor`)
- Draw the 21 bone connections following the MediaPipe hand topology (finger segments)
- On a successful prediction: animate the landmarks to pulse/glow with a Framer Motion spring
- Use `requestAnimationFrame` for smooth 60 fps rendering
2.6 Prediction HUD (PredictionHUD.tsx)
- Glassmorphism card: `backdrop-blur`, `bg-white/10`, `border-white/20`
- Large Gujarati sign name (mapped from the label index)
- Confidence bar: animated width transition via Framer Motion `animate={{ width: confidence% }}`
- Color coding: green (>85%), yellow (60–85%), red (<60%)
- Rolling history list: the last 10 recognized signs (Framer Motion `AnimatePresence` for enter/exit)
- Pipeline badge: shows which pipeline (A/B/C) produced the result
- Latency display: shows the WS round-trip time in ms
2.7 Onboarding Guide (OnboardingGuide.tsx)
- 3-step animated wizard using Framer Motion page transitions
- "Position your hand 30–60 cm from the camera"
- "Ensure good lighting, avoid dark backgrounds"
- "Show signs clearly β palm facing camera"
- Skip button + "Don't show again" (localStorage)
2.8 Calibration Screen (Calibration.tsx)
- Brief 2-second "Ready?" screen after onboarding
- Check: hand detected by MediaPipe β show green checkmark animation
- Auto-transitions to main translation view when hand is stable for 1 second
2.9 Main App Layout (App.tsx)
- Full-screen dark background with subtle animated gradient
- Three-panel layout (desktop): webcam | HUD | history
- Mobile: stacked layout with webcam top, HUD bottom
- Header: "SanketSetu | સંકેત-સેતુ" with glowing text effect
- Settings gear icon β modal for pipeline selection (A / B / C / Ensemble), confidence threshold slider
Phase 3 – Dockerization & Deployment
3.1 Backend Dockerfile
- Create the `Dockerfile` (repo root, build context includes models)
- Add `.dockerignore` (excludes `.venv`, `node_modules`, `*.pth`, tests)
- Test locally: `docker build -t sanketsetu-backend . && docker run -p 8000:8000 sanketsetu-backend`
3.2 Hugging Face Spaces Configuration
- Create Hugging Face Spaces repository for backend deployment
- Note: Keras/TF will increase the Docker image size → use `tensorflow-cpu` to keep it slim
- Push the Docker image to the Hugging Face Container Registry
3.3 Vercel Frontend Deployment
- Create `frontend/vercel.json` with the SPA rewrite + WASM Content-Type header
- Add `VITE_WS_URL` and `VITE_API_URL` to the Vercel environment variables (via CI vars)
- Ensure the `@mediapipe/tasks-vision` WASM files are served correctly (add to `public/`)
Phase 4 – Testing & Hardening
4.1 Backend Tests
- `tests/test_pipeline_a.py` → 8 unit tests, XGBoost inference (4 s)
- `tests/test_pipeline_b.py` → 6 unit tests, encoder + LightGBM (49 s)
- `tests/test_pipeline_c.py` → 7 unit tests, CNN + SVM with real 128×128 images (14 s)
- `tests/test_websocket.py` → 7 integration tests, health + REST + WS round-trip
4.2 Frontend Error Handling
- No-camera fallback UI (file upload for image mode)
- WS reconnecting banner (red banner when `!isConnected && stage === 'running'`)
- Low-bandwidth mode: reduce the send rate to 5 fps if latency > 500 ms, plus a yellow "LB" badge in the HUD
- MediaPipe WASM load-failure fallback message (shown in the header via `mpError`)
4.3 Label Map (Critical)
- Create `backend/app/models/label_map.py` mapping classes 0–33 to the actual Gujarati signs
  - You need to confirm the exact mapping used during training (check your original dataset/notebook)
  - Placeholder: `LABEL_MAP = { 0: "ક", 1: "ખ", ... , 33: "?" }`
  - This file must exactly mirror what was used in training
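A structural skeleton for the file, with dummy strings where the confirmed Gujarati names must go (the `label_for` helper is an assumption; only the 0–33 key range is specified by the plan):

```python
# backend/app/models/label_map.py -- SKELETON ONLY.
# The real Gujarati strings MUST be copied from the training dataset/notebook;
# the generated entries below are placeholders, not actual sign names.
LABEL_MAP = {i: f"sign_{i}" for i in range(34)}


def label_for(index: int) -> str:
    """Map a model class index (0-33) to its Gujarati sign name."""
    if index not in LABEL_MAP:
        raise KeyError(f"class index {index} outside 0-33")
    return LABEL_MAP[index]
```

Keeping the lookup behind `label_for` means an out-of-range model output fails loudly instead of silently showing a wrong sign.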
Execution Order (Start Here)
Week 1: Phase 1.1 → 1.3 → 1.7 (get WS working with Pipeline A alone, test in the browser)
Week 2: Phase 1.4 → 1.5 → 1.6 (add the other pipelines + ensemble)
Week 3: Phase 2.1 → 2.2 → 2.3 → 2.4 (React skeleton + WS connected)
Week 4: Phase 2.5 → 2.6 → 2.7 → 2.8 → 2.9 (full UI)
Week 5: Phase 3 + 4 (deploy + tests)
Critical Decision Points
| Decision | Default | Notes |
|---|---|---|
| Primary pipeline | A (XGBoost) | Sub-ms inference, uses MediaPipe landmarks already extracted client-side |
| Confidence threshold for fallback | 0.70 | Tune after testing; if XGBoost confidence < 70%, call Pipeline B |
| Enable Pipeline C (CNN) | Optional / off by default | Adds ~150ms latency and requires image upload, not just landmarks |
| MediaPipe model variant | lite | Use `hand_landmarker_lite.task` for mobile performance |
| WebSocket frame rate | 15fps | Sufficient for sign recognition, avoids server overload |
| Gujarati label map | CONFIRM WITH DATASET | Classes 0–33 must match training data exactly |