Peter Michael Gits
# πŸŽ™οΈ Real-Time Speech-to-Text Service with Kyutai Moshi
Just built a production-ready STT service using Kyutai's Moshi model for ultra-low-latency speech recognition!
## System Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ stt-gpu-service-python-v4 β”‚
β”‚ (Nvidia T4 Small) β”‚
β”‚ β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Moshi Model β”‚ β”‚ API Interfaces β”‚ β”‚
β”‚ β”‚ kyutai/stt-1b β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ (Cached) β”‚ β”‚ 🌐 WebSocket /ws/stream β”‚ β”‚
β”‚ β”‚ β”‚ β”‚ ↓ 80ms audio chunks β”‚ β”‚
β”‚ β”‚ β€’ 0.5s delay │◄──── ↑ Real-time transcription β”‚ β”‚
β”‚ β”‚ β€’ EN/FR β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ β€’ 1B params β”‚ β”‚ πŸ“‘ REST /transcribe β”‚ β”‚
β”‚ β”‚ β”‚ β”‚ ↓ Audio file upload β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ ↑ JSON transcription β”‚ β”‚
β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ πŸ’“ GET /health β”‚ β”‚
β”‚ β”‚ ↑ Service status check β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”
β”‚ Client 1 β”‚ β”‚ Client 2 β”‚ β”‚ Test β”‚
β”‚ (Streaming) β”‚ β”‚(Streaming)β”‚ β”‚ (Upload) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
## API Interface Details
### 🌐 **WebSocket Streaming** `/ws/stream`
Primary interface for real-time speech recognition with 80ms audio chunks. Achieves ~200ms end-to-end latency with bidirectional communication for live conversations.
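A minimal client sketch for the streaming endpoint. Assumptions not confirmed by the post: the server accepts raw 16-bit PCM mono binary frames at Moshi's 24 kHz sample rate, the `ws://…/ws/stream` URI shape, and the empty-frame end-of-stream marker; the actual wire format may differ.

```python
# Hedged sketch of a /ws/stream client. Wire format (raw PCM binary
# frames, empty end-of-stream frame) is an assumption for illustration.
import asyncio

SAMPLE_RATE = 24_000      # Moshi operates on 24 kHz audio
CHUNK_MS = 80             # one 80 ms frame per WebSocket message
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHUNK_BYTES = SAMPLE_RATE * CHUNK_MS // 1000 * BYTES_PER_SAMPLE  # 3840

def chunk_audio(pcm: bytes):
    """Split raw PCM bytes into 80 ms chunks for streaming."""
    for i in range(0, len(pcm), CHUNK_BYTES):
        yield pcm[i:i + CHUNK_BYTES]

async def stream(uri: str, pcm: bytes) -> None:
    # Requires the third-party `websockets` package; imported lazily so
    # the chunking helpers above stay usable without it.
    import websockets
    async with websockets.connect(uri) as ws:
        for chunk in chunk_audio(pcm):
            await ws.send(chunk)          # 80 ms of audio per message
        await ws.send(b"")                # hypothetical end-of-stream marker
        async for message in ws:
            print(message)                # partial/final transcripts

# asyncio.run(stream("ws://localhost:8000/ws/stream", pcm_bytes))
```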
### πŸ“‘ **REST Upload** `/transcribe`
Secondary testing endpoint for complete audio file processing. A simple POST with an audio file returns the full transcription with word-level timestamps.
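The upload flow can be sketched in a few lines. The multipart field name `file` and the `{"words": [...]}` response shape are assumptions about the endpoint, not documented behavior.

```python
# Hedged sketch of the /transcribe upload flow. Field name "file" and
# the JSON response shape are assumptions for illustration.

def transcribe_file(base_url: str, path: str) -> dict:
    """POST an audio file to /transcribe and return the parsed JSON."""
    import requests  # third-party; imported lazily
    with open(path, "rb") as f:
        resp = requests.post(f"{base_url}/transcribe", files={"file": f})
    resp.raise_for_status()
    return resp.json()

def words_to_text(response: dict) -> str:
    """Join assumed word-level entries like {"word": ..., "start": ...}
    back into a plain transcript line."""
    return " ".join(w["word"] for w in response.get("words", []))
```

`words_to_text` is a convenience for consumers that only need the text and not the timestamps.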
### πŸ’“ **Health Check** `/health`
Basic service monitoring endpoint for deployment status verification. Returns model readiness and GPU resource availability.
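A stdlib-only probe for the health endpoint. The payload keys `model_loaded` and `gpu_available` are guesses at what "model readiness and GPU resource availability" looks like on the wire, not a documented schema.

```python
# Hedged sketch of a /health probe; the response keys checked in
# is_ready() are assumptions about the payload, not a documented schema.
import json
import urllib.request

def check_health(base_url: str) -> dict:
    """GET /health and parse the JSON status payload."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)

def is_ready(status: dict) -> bool:
    """True when the (assumed) readiness flags are both set."""
    return bool(status.get("model_loaded")) and bool(status.get("gpu_available"))
```

A deployment platform's readiness check could call `check_health` and gate traffic on `is_ready`.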
## Technical Highlights
- **Ultra-Low Latency**: 80ms frame processing with Moshi's native streaming
- **Model Optimization**: Model weights pre-cached in the Docker image for instant startup
- **Cost Efficient**: T4 Small GPU with 30-minute auto-sleep
- **Production Ready**: Supports 2 concurrent streaming connections
- **Multi-Language**: English and French recognition support
Perfect for real-time voice applications, live transcription services, and conversational AI systems!
#AI #SpeechRecognition #RealTime #MachineLearning #HuggingFace #Python #Docker