# 🎙️ Real-Time Speech-to-Text Service with Kyutai Moshi

Just built a production-ready STT service using Kyutai's Moshi model for ultra-low-latency speech recognition!
## System Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                  stt-gpu-service-python-v4                   │
│                      (Nvidia T4 Small)                       │
│                                                              │
│  ┌─────────────────┐    ┌─────────────────────────────────┐  │
│  │   Moshi Model   │    │         API Interfaces          │  │
│  │  kyutai/stt-1b  │    │                                 │  │
│  │    (Cached)     │    │ 🔌 WebSocket /ws/stream         │  │
│  │                 │    │    └─ 80ms audio chunks         │  │
│  │ • 0.5s delay    ├────┤    └─ Real-time transcription   │  │
│  │ • EN/FR         │    │                                 │  │
│  │ • 1B params     │    │ 📡 REST /transcribe             │  │
│  │                 │    │    └─ Audio file upload         │  │
│  └─────────────────┘    │    └─ JSON transcription        │  │
│                         │                                 │  │
│                         │ 📊 GET /health                  │  │
│                         │    └─ Service status check      │  │
│                         └─────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                                │
                ┌───────────────┼───────────────┐
                │               │               │
        ┌───────┴──────┐  ┌─────┴─────┐   ┌─────┴─────┐
        │   Client 1   │  │  Client 2 │   │   Test    │
        │ (Streaming)  │  │(Streaming)│   │ (Upload)  │
        └──────────────┘  └───────────┘   └───────────┘
```
## API Interface Details
### 🔌 **WebSocket Streaming** `/ws/stream`

Primary interface for real-time speech recognition with 80ms audio chunks. Achieves ~200ms end-to-end latency with bidirectional communication for live conversations.
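A client feeding `/ws/stream` first has to slice its audio into fixed 80ms frames. The sketch below shows one way to do that in plain Python, assuming 16-bit mono PCM at 24 kHz (the sample rate Moshi's Mimi codec typically uses); both the format and the rate are assumptions here, and the actual WebSocket message framing is not shown.

```python
SAMPLE_RATE = 24_000   # assumed: Moshi's codec operates on 24 kHz audio
SAMPLE_WIDTH = 2       # assumed: 16-bit (2-byte) PCM samples
FRAME_MS = 80
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * SAMPLE_WIDTH  # 3840 bytes per frame

def chunk_pcm(pcm: bytes):
    """Yield fixed-size 80ms frames, zero-padding the final partial frame."""
    for offset in range(0, len(pcm), FRAME_BYTES):
        yield pcm[offset:offset + FRAME_BYTES].ljust(FRAME_BYTES, b"\x00")

# One second of silence: 12 full frames plus 1 zero-padded tail frame
chunks = list(chunk_pcm(bytes(SAMPLE_RATE * SAMPLE_WIDTH)))
```

Each chunk would then be sent as a binary message over the WebSocket connection (e.g. with the `websockets` library), with transcription results read back on the same socket.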
### 📡 **REST Upload** `/transcribe`

Secondary endpoint for testing with complete audio files. A simple POST request with an audio file returns the full transcription with word-level timestamps.
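A quick way to sanity-check the upload path is to POST a file and inspect the response. The JSON shape below, with a `text` field and a `words` array carrying `start`/`end` times, is a hypothetical example of what a word-timestamped transcription could look like; the field names are assumptions, not the service's documented schema.

```python
import json

# Hypothetical response for POST /transcribe -- the field names ("text",
# "words", "start", "end") are assumed, not taken from the service's docs.
sample_response = json.loads("""
{
  "text": "hello world",
  "words": [
    {"word": "hello", "start": 0.0, "end": 0.42},
    {"word": "world", "start": 0.48, "end": 0.9}
  ]
}
""")

def words_with_timestamps(resp: dict) -> list:
    """Flatten the word-level timestamps into (word, start, end) tuples."""
    return [(w["word"], w["start"], w["end"]) for w in resp["words"]]

timed_words = words_with_timestamps(sample_response)
```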
### 📊 **Health Check** `/health`

Basic service monitoring endpoint for deployment status verification. Returns model readiness and GPU resource availability.
## Technical Highlights

- **Ultra-Low Latency**: 80ms frame processing with Moshi's native streaming
- **Model Optimization**: Pre-cached in the Docker image for instant startup
- **Cost Efficient**: T4 Small GPU with 30-minute auto-sleep
- **Production Ready**: Supports 2 concurrent streaming connections
- **Multi-Language**: English and French recognition support
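The two-connection cap in the list above can be enforced with a simple semaphore gate. This is a minimal sketch of the idea, assuming an asyncio-based server that rejects (rather than queues) a third streaming client; it is not the service's actual code.

```python
import asyncio

MAX_STREAMS = 2  # the stated limit of concurrent streaming connections

async def handle_stream(slots: asyncio.Semaphore, client_id: int, results: list):
    """Serve a streaming client if a slot is free, otherwise reject it."""
    if slots.locked():                 # both slots taken -> refuse immediately
        results.append((client_id, "rejected"))
        return
    async with slots:
        await asyncio.sleep(0.01)      # stand-in for the transcription loop
        results.append((client_id, "served"))

async def main():
    slots = asyncio.Semaphore(MAX_STREAMS)
    results: list = []
    # Three clients connect at once; only two can stream.
    await asyncio.gather(*(handle_stream(slots, i, results) for i in range(3)))
    return results

results = asyncio.run(main())
```

Rejecting instead of queuing keeps the latency story honest: a waiting client would otherwise see unbounded delay rather than a clear "busy" signal.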
Perfect for real-time voice applications, live transcription services, and conversational AI systems!

#AI #SpeechRecognition #RealTime #MachineLearning #HuggingFace #Python #Docker