Sanket-Setu / TASKS.md
devrajsinh2012's picture
Remove Fly.io and GitHub Actions workflows
c67369f

SanketSetu β€” Execution TODO & Implementation Tracker

Model Analysis (Reviewed 2026-03-02)

All 5 model files inspected. Three distinct inference pipelines exist:

Pipeline Files Input Process Output
A β€” Primary (Fastest) Mediapipe_XGBoost/model.pkl 63 MediaPipe coords (21 landmarks Γ— x,y,z) XGBClassifier (50 trees) 34-class probability
B β€” Autoencoder + LGBM CNN_Autoencoder_LightGBM/autoencoder_model.pkl + lgbm_model.pkl 63 MediaPipe coords Encoder (63β†’32β†’16 bottleneck) + LGBMClassifier 34-class probability
C β€” Vision CNN + SVM CNN_PreTrained/cnn_model.pkl + svm_model.pkl 128Γ—128Γ—3 RGB image ResNet50-based CNN (179 layers) β†’ 256 features + SVC(C=10) 34-class probability w/ probability=True

Key Architecture Facts

  • 34 classes (Gujarati Sign Language alphabet + digits, labels 0–33)
  • Pipeline A input: 63 floats β€” directly from MediaPipe hand_landmarks (x, y, z per landmark, flattened)
  • Pipeline B input: same 63 floats β†’ takes only the encoder half (first 3 Dense layers, output of dense_1 layer = 16 features)
  • Pipeline C input: 128Γ—128 BGR/RGB cropped hand image, normalized to [0,1]
  • All .pth files are identical copies of the .pkl files (same objects, different extension)
  • Model quality strategy: A is primary (sub-ms); if confidence < threshold, query B or C for ensemble

Project Folder Structure to Create

SanketSetu/
β”œβ”€β”€ backend/                    ← FastAPI server
β”‚   β”œβ”€β”€ app/
β”‚   β”‚   β”œβ”€β”€ main.py             ← FastAPI entry, WebSocket + REST
β”‚   β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”‚   β”œβ”€β”€ loader.py       ← Singleton model loader
β”‚   β”‚   β”‚   └── label_map.py    ← 0–33 β†’ Gujarati sign name mapping
β”‚   β”‚   β”œβ”€β”€ inference/
β”‚   β”‚   β”‚   β”œβ”€β”€ pipeline_a.py   ← XGBoost inference (63 landmarks)
β”‚   β”‚   β”‚   β”œβ”€β”€ pipeline_b.py   ← Autoencoder encoder + LightGBM
β”‚   β”‚   β”‚   β”œβ”€β”€ pipeline_c.py   ← ResNet CNN + SVM (image-based)
β”‚   β”‚   β”‚   └── ensemble.py     ← Confidence-weighted ensemble logic
β”‚   β”‚   β”œβ”€β”€ schemas.py          ← Pydantic request/response models
β”‚   β”‚   └── config.py           ← Settings (confidence threshold, etc.)
β”‚   β”œβ”€β”€ weights/                ← Symlink or copy of model pkl files
β”‚   β”œβ”€β”€ requirements.txt
β”‚   β”œβ”€β”€ Dockerfile

β”‚
β”œβ”€β”€ frontend/                   ← Vite + React + TS
β”‚   β”œβ”€β”€ src/
β”‚   β”‚   β”œβ”€β”€ components/
β”‚   β”‚   β”‚   β”œβ”€β”€ WebcamFeed.tsx       ← Webcam + canvas landmark overlay
β”‚   β”‚   β”‚   β”œβ”€β”€ LandmarkCanvas.tsx   ← Draws 21 hand points + connections
β”‚   β”‚   β”‚   β”œβ”€β”€ PredictionHUD.tsx    ← Live sign, confidence bar, history
β”‚   β”‚   β”‚   β”œβ”€β”€ OnboardingGuide.tsx  ← Animated intro wizard
β”‚   β”‚   β”‚   └── Calibration.tsx      ← Lighting/distance check UI
β”‚   β”‚   β”œβ”€β”€ hooks/
β”‚   β”‚   β”‚   β”œβ”€β”€ useWebSocket.ts      ← WS connection, send/receive
β”‚   β”‚   β”‚   β”œβ”€β”€ useMediaPipe.ts      ← MediaPipe Hands JS integration
β”‚   β”‚   β”‚   └── useWebcam.ts         ← Camera permissions + stream
β”‚   β”‚   β”œβ”€β”€ lib/
β”‚   β”‚   β”‚   └── landmarkUtils.ts     ← Landmark normalization (mirror XGBoost preprocessing)
β”‚   β”‚   β”œβ”€β”€ App.tsx
β”‚   β”‚   └── main.tsx
β”‚   β”œβ”€β”€ public/
β”‚   β”œβ”€β”€ index.html
β”‚   β”œβ”€β”€ tailwind.config.ts
β”‚   β”œβ”€β”€ vite.config.ts
β”‚   └── package.json
β”‚
β”œβ”€β”€ CNN_Autoencoder_LightGBM/   ← (existing)
β”œβ”€β”€ CNN_PreTrained/             ← (existing)
β”œβ”€β”€ Mediapipe_XGBoost/          ← (existing)
└── .github/
    └── workflows/
        β”œβ”€β”€ deploy-backend.yml
        └── deploy-frontend.yml

Phase 1 β€” Backend Core (FastAPI + Model Integration)

1.1 Project Bootstrap

  • Create backend/ folder and app/ package structure
  • Create backend/requirements.txt with: fastapi, uvicorn[standard], websockets, xgboost, lightgbm, scikit-learn, keras==3.13.2, tensorflow-cpu, numpy, opencv-python-headless, pillow, python-dotenv
  • Create backend/app/config.py β€” confidence threshold (default 0.7), WebSocket max connections, pipeline mode (A/B/C/ensemble)
  • Create backend/app/models/label_map.py β€” map class indices 0–33 to Gujarati sign names

1.2 Model Loader (Singleton)

  • Create backend/app/models/loader.py
    • Load model.pkl (XGBoost) at startup
    • Load autoencoder_model.pkl (extract encoder layers only: input β†’ dense β†’ dense_1) and lgbm_model.pkl
    • Load cnn_model.pkl (full ResNet50 feature extractor, strip any classification head) and svm_model.pkl
    • Expose ModelStore singleton accessed via get_model_store() dependency
    • Log load times for each model

1.3 Pipeline A β€” XGBoost (Primary, Landmarks)

  • Create backend/app/inference/pipeline_a.py
    • Input: List[float] of length 63 (x,y,z per landmark, already normalized by MediaPipe)
    • Output: {"sign": str, "confidence": float, "probabilities": List[float]}
    • Use model.predict_proba(np.array(landmarks).reshape(1,-1))[0]
    • Return classes_[argmax] and max(probabilities) as confidence

1.4 Pipeline B β€” Autoencoder Encoder + LightGBM

  • Create backend/app/inference/pipeline_b.py
    • Build encoder-only submodel: encoder = keras.Model(inputs=model.input, outputs=model.layers[2].output) (output of dense_1, the 16-D bottleneck)
    • Input: 63 MediaPipe coords
    • Encode: features = encoder.predict(np.array(landmarks).reshape(1,-1))[0] β†’ shape (16,)
    • Classify: lgbm.predict_proba(features.reshape(1,-1))[0]

1.5 Pipeline C β€” CNN + SVM (Image-based)

  • Create backend/app/inference/pipeline_c.py
    • Input: base64-encoded JPEG or raw bytes of the cropped hand region (128Γ—128 px)
    • Decode β†’ numpy array (128,128,3) uint8 β†’ normalize to float32 [0,1]
    • features = cnn_model.predict(img[np.newaxis])[0] β†’ shape (256,)
    • proba = svm.predict_proba(features.reshape(1,-1))[0]
    • Note: CNN inference is slower (~50–200ms on CPU); only call when Pipeline A confidence < threshold

1.6 Ensemble Logic

  • Create backend/app/inference/ensemble.py
    • Call Pipeline A first
    • If confidence < config.THRESHOLD (default 0.7), call Pipeline B
    • If still below threshold and image data available, call Pipeline C
    • Final result: weighted average of probabilities from each pipeline that was called
    • Return the top predicted class and ensemble confidence score

1.7 WebSocket Handler

  • Create backend/app/main.py with FastAPI app
  • Implement GET /health β€” returns {"status": "ok", "models_loaded": true}
  • Implement WS /ws/landmarks β€” primary endpoint
    • Client sends JSON: {"landmarks": [63 floats], "session_id": "..."}
    • Server responds: {"sign": "...", "confidence": 0.95, "pipeline": "A", "label_index": 12}
    • Handle disconnect gracefully
  • Implement WS /ws/image β€” optional image-based endpoint for Pipeline C
    • Client sends JSON: {"image_b64": "...", "session_id": "..."}
  • Implement POST /api/predict β€” REST fallback for non-WS clients
    • Body: {"landmarks": [63 floats]}
    • Returns same response schema as WS

1.8 Schemas & Validation

  • Create backend/app/schemas.py
    • LandmarkMessage(BaseModel): landmarks: List[float] (must be length 63), session_id: str
    • ImageMessage(BaseModel): image_b64: str, session_id: str
    • PredictionResponse(BaseModel): sign: str, confidence: float, pipeline: str, label_index: int, probabilities: Optional[List[float]]

1.9 CORS & Middleware

  • Configure CORS for Vercel frontend domain + localhost:5173
  • Add request logging middleware (log session_id, pipeline used, latency ms)
  • Add global exception handler returning proper JSON errors

Phase 2 β€” Frontend (React + Vite + Tailwind + Framer Motion)

2.1 Project Bootstrap

  • Run npm create vite@latest frontend -- --template react-ts inside SanketSetu/
  • Install deps: tailwindcss, framer-motion, lucide-react, @mediapipe/tasks-vision
  • Configure Tailwind with custom palette (dark neon-cyan glassmorphism theme)
  • Set up vite.config.ts proxy: /api β†’ backend URL, /ws β†’ backend WS URL

2.2 Webcam Hook (useWebcam.ts)

  • Request getUserMedia({ video: { width: 1280, height: 720 } })
  • Expose videoRef, isReady, error, switchCamera() (for mobile front/back toggle)
  • Handle permission denied state with instructional UI

2.3 MediaPipe Hook (useMediaPipe.ts)

  • Initialize HandLandmarker from @mediapipe/tasks-vision (WASM backend)
  • Process video frames at target 30fps using requestAnimationFrame
  • Extract landmarks[0] (first hand) β†’ flatten to 63 floats [x0,y0,z0, x1,y1,z1, ...]
  • Normalize: subtract wrist (landmark 0) position to make translation-invariant β€” must match training preprocessing
  • Expose landmarks: number[] | null, handedness: string, isDetecting: boolean

2.4 WebSocket Hook (useWebSocket.ts)

  • Connect to wss://backend-url/ws/landmarks on mount
  • Auto-reconnect with exponential backoff on disconnect
  • sendLandmarks(landmarks: number[]) β€” throttled to max 15 sends/sec
  • Expose lastPrediction: PredictionResponse | null, isConnected: boolean, latency: number

2.5 Landmark Canvas (LandmarkCanvas.tsx)

  • Overlay <canvas> on top of <video> with position: absolute
  • Draw 21 hand landmark dots (cyan glow: shadowBlur, shadowColor)
  • Draw 21 bone connections following MediaPipe hand topology (finger segments)
  • On successful prediction: animate landmarks to pulse/glow with Framer Motion spring
  • Use requestAnimationFrame for smooth 60fps rendering

2.6 Prediction HUD (PredictionHUD.tsx)

  • Glassmorphism card: backdrop-blur, bg-white/10, border-white/20
  • Large Gujarati sign name (mapped from label index)
  • Confidence bar: animated width transition via Framer Motion animate={{ width: confidence% }}
  • Color coding: green (>85%), yellow (60–85%), red (<60%)
  • Rolling history list: last 10 recognized signs (Framer Motion AnimatePresence for enter/exit)
  • Pipeline badge: shows which pipeline (A/B/C) produced the result
  • Latency display: shows WS round-trip time in ms

2.7 Onboarding Guide (OnboardingGuide.tsx)

  • 3-step animated wizard using Framer Motion page transitions
    1. "Position your hand 30–60cm from camera"
    2. "Ensure good lighting, avoid dark backgrounds"
    3. "Show signs clearly β€” palm facing camera"
  • Skip button + "Don't show again" (localStorage)

2.8 Calibration Screen (Calibration.tsx)

  • Brief 2-second "Ready?" screen after onboarding
  • Check: hand detected by MediaPipe β†’ show green checkmark animation
  • Auto-transitions to main translation view when hand is stable for 1 second

2.9 Main App Layout (App.tsx)

  • Full-screen dark background with subtle animated gradient
  • Three-panel layout (desktop): webcam | HUD | history
  • Mobile: stacked layout with webcam top, HUD bottom
  • Header: "SanketSetu | ΰͺΈΰͺ‚ΰͺ•ેΰͺ€-ΰͺΈΰ«‡ΰͺ€ΰ«" with glowing text effect
  • Settings gear icon β†’ modal for pipeline selection (A / B / C / Ensemble), confidence threshold slider

Phase 3 β€” Dockerization & Deployment

3.1 Backend Dockerfile

  • Create Dockerfile (repo root, build context includes models)
  • Add .dockerignore (excludes .venv, node_modules, *.pth, tests)
  • Test locally: docker build -t sanketsetu-backend . && docker run -p 8000:8000 sanketsetu-backend

3.2 Hugging Face Spaces Configuration

  • Create Hugging Face Spaces repository for backend deployment
  • Note: Keras/TF will increase Docker image size β€” use tensorflow-cpu to keep slim
  • Push Docker image to Hugging Face Container Registry

3.3 Vercel Frontend Deployment

  • Create frontend/vercel.json with SPA rewrite + WASM Content-Type header
  • Add VITE_WS_URL and VITE_API_URL to Vercel environment variables (via CI vars)
  • Ensure @mediapipe/tasks-vision WASM files are served correctly (add to public/)

Phase 4 β€” Testing & Hardening

4.1 Backend Tests

  • tests/test_pipeline_a.py β€” 8 unit tests, XGBoost inference (4s)
  • tests/test_pipeline_b.py β€” 6 unit tests, encoder + LightGBM (49s)
  • tests/test_pipeline_c.py β€” 7 unit tests, CNN + SVM with real 128Γ—128 images (14s)
  • tests/test_websocket.py β€” 7 integration tests, health + REST + WS round-trip

4.2 Frontend Error Handling

  • No-camera fallback UI (file upload for image mode)
  • WS reconnecting banner (red banner when !isConnected && stage === 'running')
  • Low-bandwidth mode: reduce send rate to 5fps if latency > 500ms + yellow "LB" badge in HUD
  • MediaPipe WASM load failure fallback message (shown in header via mpError)

4.3 Label Map (Critical)

  • Create backend/app/models/label_map.py mapping classes 0–33 to actual Gujarati signs
    • You need to confirm the exact mapping used during training (check your original dataset/notebook)
    • Placeholder: LABEL_MAP = { 0: "ΰͺ•", 1: "ΰͺ–", ... , 33: "?" }
    • This file must exactly mirror what was used in training

Execution Order (Start Here)

Week 1: Phase 1.1 β†’ 1.3 β†’ 1.7 (get WS working with Pipeline A alone, test in browser)
Week 2: Phase 1.4 β†’ 1.5 β†’ 1.6 (add other pipelines + ensemble)
Week 3: Phase 2.1 β†’ 2.2 β†’ 2.3 β†’ 2.4 (React skeleton + WS connected)
Week 4: Phase 2.5 β†’ 2.6 β†’ 2.7 β†’ 2.8 β†’ 2.9 (full UI)
Week 5: Phase 3 + 4 (deploy + tests)

Critical Decision Points

Decision Default Notes
Primary pipeline A (XGBoost) Sub-ms inference, uses MediaPipe landmarks already extracted client-side
Confidence threshold for fallback 0.70 Tune after testing - if XGBoost < 70%, call Pipeline B
Enable Pipeline C (CNN) Optional / off by default Adds ~150ms latency and requires image upload, not just landmarks
MediaPipe model variant lite Use hand_landmarker_lite.task for mobile performance
WebSocket frame rate 15fps Sufficient for sign recognition, avoids server overload
Gujarati label map CONFIRM WITH DATASET Classes 0–33 must match training data exactly