Peter Michael Gits

🎙️ Real-Time Speech-to-Text Service with Kyutai Moshi

Just built a production-ready STT service using Kyutai's Moshi model for ultra-low-latency speech recognition!

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                  stt-gpu-service-python-v4                  │
│                      (Nvidia T4 Small)                      │
│                                                             │
│  ┌─────────────────┐    ┌─────────────────────────────────┐ │
│  │   Moshi Model   │    │        API Interfaces           │ │
│  │ kyutai/stt-1b   │    │                                 │ │
│  │   (Cached)      │    │  🌐 WebSocket /ws/stream        │ │
│  │                 │    │     ↓ 80ms audio chunks         │ │
│  │  • 0.5s delay   │◄───┤     ↑ Real-time transcription   │ │
│  │  • EN/FR        │    │                                 │ │
│  │  • 1B params    │    │  📡 REST /transcribe            │ │
│  │                 │    │     ↓ Audio file upload         │ │
│  └─────────────────┘    │     ↑ JSON transcription        │ │
│                         │                                 │ │
│                         │  💓 GET /health                 │ │
│                         │     ↑ Service status check      │ │
│                         └─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    │              │              │
            ┌───────▼─────┐ ┌──────▼────┐ ┌───────▼───┐
            │   Client 1  │ │ Client 2  │ │   Test    │
            │ (Streaming) │ │(Streaming)│ │ (Upload)  │
            └─────────────┘ └───────────┘ └───────────┘

API Interface Details

🌐 WebSocket Streaming /ws/stream

Primary interface for real-time speech recognition with 80ms audio chunks. Achieves ~200ms end-to-end latency with bidirectional communication for live conversations.
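
To make the 80ms framing concrete, here is a minimal client-side sketch that slices raw audio into fixed-size chunks before they would be sent to /ws/stream. The 24 kHz sample rate, 16-bit mono PCM format, and the `iter_chunks` helper are assumptions for illustration; the post does not specify the wire format.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # assumption: sample rate is not stated in the post
CHUNK_MS = 80            # chunk duration taken from the post
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM


def iter_chunks(
    pcm: bytes,
    sample_rate: int = SAMPLE_RATE_HZ,
    chunk_ms: int = CHUNK_MS,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
) -> Iterator[bytes]:
    """Yield fixed-size chunks of raw PCM, zero-padding the last one."""
    # Bytes per chunk: samples in one chunk times bytes per sample.
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        # Pad the final chunk with silence so every frame has equal length.
        yield chunk.ljust(chunk_bytes, b"\x00")
```

Under these assumptions each chunk holds 1,920 samples (3,840 bytes), and a client would send one chunk per WebSocket message while reading transcription messages on the same connection.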

📡 REST Upload /transcribe

Secondary testing endpoint for complete audio file processing. Simple POST request with audio file returns full transcription with word-level timestamps.
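
A rough sketch of how a test client might build the upload request using only the Python standard library. The form field name `file`, the default filename, and the `build_transcribe_request` helper are hypothetical; the post does not document the request schema, so check the service before relying on them.

```python
import urllib.request
import uuid


def build_transcribe_request(
    base_url: str,
    audio: bytes,
    filename: str = "sample.wav",  # hypothetical default
    field_name: str = "file",      # assumption: field name is not documented
) -> urllib.request.Request:
    """Build a multipart/form-data POST request for the /transcribe endpoint."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/transcribe",
        data=head + audio + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Passing the returned request to `urllib.request.urlopen` would send the file; the JSON response could then be parsed with `json.load`.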

💓 Health Check /health

Basic service monitoring endpoint for deployment status verification. Returns model readiness and GPU resource availability.
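
Because the service may still be loading or waking up, a client could poll /health before opening a stream. The sketch below assumes the response is JSON with `model_loaded` and `gpu_available` fields; those key names and both helper functions are hypothetical, since the post only says the endpoint reports model readiness and GPU availability.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET /health and decode the JSON body."""
    url = base_url.rstrip("/") + "/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict],
    attempts: int = 10,
    delay_s: float = 3.0,
) -> bool:
    """Poll until both (hypothetical) readiness flags are true."""
    for attempt in range(attempts):
        try:
            status = fetch()
        except OSError:
            # Connection errors are treated as "not ready yet".
            status = {}
        if status.get("model_loaded") and status.get("gpu_available"):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Typical usage would be `wait_until_ready(lambda: fetch_health(base_url))`, where `base_url` is wherever the service is deployed.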

Technical Highlights

  • Ultra-Low Latency: 80ms frame processing with Moshi's native streaming
  • Model Optimization: Pre-cached in Docker image for instant startup
  • Cost Efficient: T4 Small GPU with 30-minute auto-sleep
  • Production Ready: Supports 2 concurrent streaming connections
  • Multi-Language: English and French recognition support
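
One simple way to enforce the limit of 2 concurrent streaming connections is a counter checked when a WebSocket connects. This is an illustrative sketch, not necessarily how the service does it; the `ConnectionLimiter` class is hypothetical.

```python
class ConnectionLimiter:
    """Track active streaming connections against a fixed cap."""

    def __init__(self, max_connections: int = 2) -> None:
        self.max_connections = max_connections
        self.active = 0

    def try_acquire(self) -> bool:
        """Reserve a slot; return False when the cap is already reached."""
        if self.active >= self.max_connections:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        """Free a slot when a connection closes."""
        if self.active > 0:
            self.active -= 1
```

A handler would call `try_acquire()` on connect, reject the client if it returns `False`, and call `release()` in a `finally` block on disconnect. A plain counter is enough when all handlers run on a single asyncio event loop.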

Perfect for real-time voice applications, live transcription services, and conversational AI systems!

#AI #SpeechRecognition #RealTime #MachineLearning #HuggingFace #Python #Docker