Sanket-Setu / SanketSetu_ Production-Grade Implementation Plan.md
devrajsinh2012's picture
Remove Fly.io and GitHub Actions workflows
c67369f

SanketSetu: Production-Grade Implementation Plan

1. Executive Summary

SanketSetu (Bridge of Signs) is a high-performance, real-time Gujarati Sign Language (GSL) recognition system. This document outlines a production-ready architecture designed to run entirely on free-tier cloud services. The system leverages a decoupled architecture with a React-based interactive frontend and a FastAPI backend, ensuring low-latency inference and a seamless user experience.


2. High-Level System Architecture

The system follows a modern microservices-inspired pattern to ensure scalability and ease of updates.

Component Technology Role Hosting (Free Tier)
Frontend React + Vite + TS User interface, webcam capture, real-time feedback Vercel
Backend API FastAPI (Python) WebSocket management, API gateway, logic Hugging Face Spaces
Inference Engine ONNX Runtime / XGBoost High-speed model execution Hugging Face Spaces
Storage Hugging Face Model Hub Model weights and assets Hugging Face
Real-time WebSockets (WSS) Low-latency frame-by-frame data transfer N/A

3. Backend Implementation Details

3.1 API Design (FastAPI)

The backend is built for speed. It handles binary data from WebSockets to minimize overhead.

  • WebSocket Protocol: The client sends a stream of normalized hand landmark coordinates (63 points per frame) extracted locally via MediaPipe. This reduces bandwidth significantly compared to sending raw video frames.
  • Concurrency: Uses asyncio to handle multiple simultaneous user connections without blocking the event loop.
  • Model Loading: Models are loaded into memory at startup using a Singleton pattern to ensure zero-latency on the first request.

3.2 Model Serving Strategy

  1. Primary Model: The XGBoost model is used as the default due to its sub-millisecond inference time.
  2. Backup/Ensemble: The system can optionally query the CNN+SVM or LGBM models for high-confidence verification if the XGBoost score is below a certain threshold.
  3. Optimization: Models are converted to ONNX format to leverage the ONNX Runtime's hardware-specific optimizations, even on free-tier CPU instances.

4. Frontend & Interactive UI/UX

The frontend is designed to be "cool," responsive, and highly interactive, providing users with a "futuristic" feel.

4.1 Tech Stack

  • Styling: Tailwind CSS for rapid, modern UI development.
  • Animations: Framer Motion for smooth transitions, layout changes, and interactive elements.
  • Icons: Lucide React for a clean, consistent icon set.

4.2 Key UI Features

  • Glassmorphism Design: Use of semi-transparent backgrounds with blur effects for a modern look.
  • Interactive Landmark Overlay: A canvas overlay on the webcam feed that draws the 21 hand landmarks in real-time. Landmarks will "glow" when a sign is successfully recognized.
  • Dynamic Prediction HUD: A Head-Up Display (HUD) style interface that shows the current prediction, confidence level, and a history of recently detected signs.
  • Responsive Layout: Fully functional on mobile and desktop, with optimized camera controls for both.

4.3 User Experience Flow

  1. Onboarding: A quick, animated guide on how to position the hand for best results.
  2. Calibration: A brief "Ready?" state that ensures the lighting and hand distance are optimal.
  3. Real-time Translation: Instant feedback as the user signs, with the translated Gujarati text appearing in a stylized "speech bubble" or text box.

4. Deployment & DevOps

4.1 Deployment Strategy

  1. Frontend: Manually push to Vercel using Vercel CLI or GitHub integration (when needed).
  2. Backend: Manually deploy to Hugging Face Spaces using Git push or Hugging Face Hub CLI.

4.2 Scalability & Cost Management

  • Auto-scaling: Hugging Face Spaces manages resource allocation automatically with free-tier CPU instances.
  • CDN Caching: Vercel's Edge Network will cache all static assets, ensuring fast load times globally.

5. Implementation Roadmap

Phase 1: Core Backend & ML Integration

  • Set up FastAPI project structure.
  • Implement WebSocket handler for landmark data.
  • Integrate the trained XGBoost model for real-time inference.

Phase 2: Advanced Frontend Development

  • Initialize Vite + React project with Tailwind.
  • Implement webcam capture and MediaPipe landmark extraction (client-side).
  • Create the interactive HUD and glassmorphism UI.

Phase 3: Production Hardening

  • Set up GitHub Actions for automated deployment.
  • Implement error handling for low-bandwidth scenarios.
  • Finalize documentation and user guide.

6. References

[1] FastAPI Documentation - High-performance web framework for building APIs. [2] MediaPipe Hands - Real-time hand landmark detection. [3] Framer Motion - A production-ready motion library for React. [4] Hugging Face Spaces - Free-tier hosting for ML applications. [5] Vercel Deployment - Global CDN and hosting for frontend applications.