# 🎙️ Real-Time Speech-to-Text Service with Kyutai Moshi

Just built a production-ready STT service using Kyutai's Moshi model for ultra-low latency speech recognition!

## System Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                  stt-gpu-service-python-v4                  │
│                      (Nvidia T4 Small)                      │
│                                                             │
│  ┌─────────────────┐    ┌─────────────────────────────────┐ │
│  │   Moshi Model   │    │         API Interfaces          │ │
│  │  kyutai/stt-1b  │    │                                 │ │
│  │    (Cached)     │    │  🌐 WebSocket /ws/stream        │ │
│  │                 │    │     ↓ 80ms audio chunks         │ │
│  │  • 0.5s delay   │◄───┤     ↑ Real-time transcription   │ │
│  │  • EN/FR        │    │                                 │ │
│  │  • 1B params    │    │  📡 REST /transcribe            │ │
│  │                 │    │     ↓ Audio file upload         │ │
│  └─────────────────┘    │     ↑ JSON transcription        │ │
│                         │                                 │ │
│                         │  💓 GET /health                 │ │
│                         │     ↑ Service status check      │ │
│                         └─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
                               │
                ┌──────────────┼──────────────┐
                │              │              │
        ┌───────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
        │  Client 1   │  │ Client 2  │  │   Test    │
        │ (Streaming) │  │(Streaming)│  │ (Upload)  │
        └─────────────┘  └───────────┘  └───────────┘
```

## API Interface Details

### 🌐 **WebSocket Streaming** `/ws/stream`

Primary interface for real-time speech recognition with 80ms audio chunks. Achieves ~200ms end-to-end latency with bidirectional communication for live conversations.

### 📡 **REST Upload** `/transcribe`

Secondary testing endpoint for complete audio file processing. Simple POST request with audio file returns full transcription with word-level timestamps.

### 💓 **Health Check** `/health`

Basic service monitoring endpoint for deployment status verification. Returns model readiness and GPU resource availability.

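
To make the 80ms chunking concrete, here is a minimal client-side sketch of how audio could be cut into chunks before being pushed to `/ws/stream`. It is not the service's actual client code. The 24 kHz sample rate, the 16-bit mono PCM format, and the helper names (`chunk_size_bytes`, `iter_chunks`, `stream_pcm`) are all assumptions made for illustration, and `send` is a hypothetical stand-in for whatever function writes bytes to the socket.

```python
import asyncio
from typing import Awaitable, Callable, Iterator

SAMPLE_RATE_HZ = 24_000  # assumption: 24 kHz mono input
CHUNK_MS = 80            # from the post: 80ms audio chunks
BYTES_PER_SAMPLE = 2     # assumption: 16-bit PCM samples


def chunk_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     chunk_ms: int = CHUNK_MS,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Return the number of bytes in one chunk of mono PCM audio."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield fixed-size chunks, zero-padding the last one with silence."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size].ljust(size, b"\x00")


async def stream_pcm(send: Callable[[bytes], Awaitable[None]],
                     pcm: bytes,
                     pace_seconds: float = CHUNK_MS / 1000) -> int:
    """Send PCM audio chunk by chunk, sleeping between chunks to
    approximate real-time capture. Returns the number of chunks sent."""
    size = chunk_size_bytes()
    sent = 0
    for chunk in iter_chunks(pcm, size):
        await send(chunk)  # hypothetical transport, e.g. a WebSocket send
        sent += 1
        await asyncio.sleep(pace_seconds)
    return sent
```

Under these assumptions one chunk is 1,920 samples, or 3,840 bytes. With a WebSocket client library, `send` could be the connection's own send method; whether the endpoint expects raw binary frames or some other encoding is not stated in this post, so check that against the service before relying on it.
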
## Technical Highlights

- **Ultra-Low Latency**: 80ms frame processing with Moshi's native streaming
- **Model Optimization**: Pre-cached in Docker image for instant startup
- **Cost Efficient**: T4 Small GPU with 30-minute auto-sleep
- **Production Ready**: Supports 2 concurrent streaming connections
- **Multi-Language**: English and French recognition support

Perfect for real-time voice applications, live transcription services, and conversational AI systems!

#AI #SpeechRecognition #RealTime #MachineLearning #HuggingFace #Python #Docker