Real-Time Speech-to-Text Service with Kyutai Moshi
Just built a production-ready STT service using Kyutai's Moshi model for ultra-low latency speech recognition!
System Architecture
┌─────────────────────────────────────────────────────────────┐
│                  stt-gpu-service-python-v4                  │
│                      (Nvidia T4 Small)                      │
│                                                             │
│  ┌─────────────────┐    ┌─────────────────────────────────┐ │
│  │   Moshi Model   │    │         API Interfaces          │ │
│  │  kyutai/stt-1b  │    │                                 │ │
│  │    (Cached)     │    │  WebSocket /ws/stream           │ │
│  │                 │    │  ├─ 80ms audio chunks           │ │
│  │ • 0.5s delay    │◄───┤  └─ Real-time transcription     │ │
│  │ • EN/FR         │    │                                 │ │
│  │ • 1B params     │    │  REST /transcribe               │ │
│  │                 │    │  ├─ Audio file upload           │ │
│  └─────────────────┘    │  └─ JSON transcription          │ │
│                         │                                 │ │
│                         │  GET /health                    │ │
│                         │  └─ Service status check        │ │
│                         └─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
                                │
                 ┌──────────────┼──────────────┐
                 │              │              │
         ┌───────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
         │  Client 1   │  │ Client 2  │  │   Test    │
         │ (Streaming) │  │(Streaming)│  │ (Upload)  │
         └─────────────┘  └───────────┘  └───────────┘
API Interface Details
WebSocket Streaming /ws/stream
Primary interface for real-time speech recognition. Clients stream audio in 80ms chunks and receive transcription back over the same bidirectional connection, with ~200ms end-to-end latency, which suits live conversations.
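As a rough client-side sketch of the 80ms chunking described above: the code below splits raw PCM audio into fixed-size frames and hands each one to a `send` callable. The 24 kHz sample rate, the 16-bit mono PCM format, and the idea that the server accepts raw binary frames are all assumptions here, not details confirmed by the service; the helper names are made up for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Iterator

SAMPLE_RATE = 24_000    # assumption: 24 kHz mono input
FRAME_MS = 80           # chunk duration stated for /ws/stream
BYTES_PER_SAMPLE = 2    # assumption: 16-bit PCM


def frame_size_bytes(sample_rate: int = SAMPLE_RATE,
                     frame_ms: int = FRAME_MS,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Number of bytes in one audio frame of `frame_ms` milliseconds."""
    samples = sample_rate * frame_ms // 1000
    return samples * bytes_per_sample


def iter_frames(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield fixed-size frames, zero-padding the final partial frame."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size].ljust(size, b"\x00")


async def stream_pcm(pcm: bytes,
                     send: Callable[[bytes], Awaitable[None]],
                     realtime: bool = False) -> int:
    """Send `pcm` frame by frame through `send`; return the frame count."""
    count = 0
    for frame in iter_frames(pcm, frame_size_bytes()):
        await send(frame)
        count += 1
        if realtime:
            # Pace the upload like a live microphone would.
            await asyncio.sleep(FRAME_MS / 1000)
    return count
```

With a WebSocket client library, `send` could be the connection's own send method (for example `ws.send` from the `websockets` package); the message format the server actually expects should be checked against the service code.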
REST Upload /transcribe
Secondary testing endpoint for complete audio file processing. Simple POST request with audio file returns full transcription with word-level timestamps.
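A minimal sketch of what such an upload could look like from Python, using only the standard library. The form field name `file`, the `audio/wav` content type, and the example URL are assumptions for illustration; the real endpoint may expect something different.

```python
import uuid
import urllib.request


def build_upload_request(url: str, filename: str, audio: bytes,
                         field: str = "file",
                         content_type: str = "audio/wav") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one audio file.

    `field` and `content_type` are hypothetical defaults, not confirmed by the service.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url,
        data=head + audio + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

The request can then be sent with `urllib.request.urlopen(req)` and the body decoded with `json.loads`, for example against a hypothetical `http://localhost:8000/transcribe`.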
Health Check /health
Basic service monitoring endpoint for deployment status verification. Returns model readiness and GPU resource availability.
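Because the service may be asleep or still loading the model, a client might poll this endpoint before connecting. The sketch below assumes the response is a JSON object with a boolean `model_loaded` field; that field name is hypothetical, as is the helper itself.

```python
import json
import time
from typing import Callable


def wait_until_ready(fetch: Callable[[], str],
                     attempts: int = 5,
                     delay_s: float = 1.0) -> bool:
    """Poll a health endpoint until it reports readiness.

    `fetch` returns the raw JSON body of GET /health and may raise OSError
    on network failure.
    """
    for attempt in range(attempts):
        try:
            payload = json.loads(fetch())
        except (OSError, ValueError):
            payload = None  # unreachable or not valid JSON yet
        # "model_loaded" is an assumed field name, not taken from the service.
        if isinstance(payload, dict) and payload.get("model_loaded") is True:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

In practice `fetch` could be something like `lambda: urllib.request.urlopen(health_url, timeout=5).read().decode()`, where `health_url` points at the deployed `/health` route.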
Technical Highlights
- Ultra-Low Latency: 80ms frame processing with Moshi's native streaming
- Model Optimization: Weights pre-cached in the Docker image, so startup does not wait on a model download
- Cost Efficient: T4 Small GPU with 30-minute auto-sleep
- Production Ready: Supports 2 concurrent streaming connections
- Multi-Language: English and French recognition support
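One way the two-connection cap from the list above could be enforced on the server side is a small admission counter checked when a WebSocket connects. This is an illustrative sketch, not the service's actual code; the class name and its methods are made up.

```python
class StreamLimiter:
    """Admit at most `max_streams` concurrent streaming connections.

    Safe inside a single asyncio event loop, because there is no await
    between the capacity check and the increment.
    """

    def __init__(self, max_streams: int = 2) -> None:
        self.max_streams = max_streams
        self.active = 0

    def try_acquire(self) -> bool:
        """Reserve a slot; return False if the service is already full."""
        if self.active >= self.max_streams:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        """Free a slot when a connection closes."""
        if self.active > 0:
            self.active -= 1
```

A handler would call `try_acquire()` on connect, reject the client when it returns `False`, and call `release()` in a `finally` block once the stream ends.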
Perfect for real-time voice applications, live transcription services, and conversational AI systems!
#AI #SpeechRecognition #RealTime #MachineLearning #HuggingFace #Python #Docker