Peter Michael Gits

Initial commit: STT GPU Service Python v4 with WebSocket streaming

16b78bc 7 months ago

🎙️ Real-Time Speech-to-Text Service with Kyutai Moshi

Just built a production-ready STT service using Kyutai's Moshi model for ultra-low latency speech recognition!

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                  stt-gpu-service-python-v4                 │
│                     (Nvidia T4 Small)                      │
│                                                             │
│  ┌─────────────────┐    ┌─────────────────────────────────┐ │
│  │   Moshi Model   │    │        API Interfaces           │ │
│  │ kyutai/stt-1b   │    │                                 │ │
│  │   (Cached)      │    │  🌐 WebSocket /ws/stream        │ │
│  │                 │    │     ↓ 80ms audio chunks         │ │
│  │  • 0.5s delay  │◄───┤     ↑ Real-time transcription   │ │
│  │  • EN/FR       │    │                                 │ │
│  │  • 1B params   │    │  📡 REST /transcribe            │ │
│  │                 │    │     ↓ Audio file upload        │ │
│  └─────────────────┘    │     ↑ JSON transcription       │ │
│                         │                                 │ │
│                         │  💓 GET /health                 │ │
│                         │     ↑ Service status check     │ │
│                         └─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    │              │              │
            ┌───────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
            │   Client 1  │ │ Client 2  │ │   Test    │
            │ (Streaming) │ │(Streaming)│ │ (Upload)  │
            └─────────────┘ └───────────┘ └───────────┘

API Interface Details

🌐 WebSocket Streaming `/ws/stream`

Primary interface for real-time speech recognition. Clients stream audio in 80ms chunks and receive transcription back over the same bidirectional connection, with ~200ms end-to-end latency for live conversations.
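
To make the 80ms chunking concrete, here is a minimal client-side sketch that splits raw PCM audio into fixed-size chunks before they are sent over the WebSocket. The sample rate (24 kHz), sample format (16-bit mono PCM), and the helper name `chunk_pcm` are assumptions for illustration; the post does not specify the wire format that `/ws/stream` expects.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # assumption: 24 kHz mono input
CHUNK_MS = 80            # from the post: 80ms audio chunks
BYTES_PER_SAMPLE = 2     # assumption: 16-bit PCM


def chunk_pcm(pcm: bytes,
              sample_rate_hz: int = SAMPLE_RATE_HZ,
              chunk_ms: int = CHUNK_MS,
              bytes_per_sample: int = BYTES_PER_SAMPLE) -> Iterator[bytes]:
    """Yield fixed-size chunks of raw PCM, zero-padding the final one."""
    chunk_bytes = sample_rate_hz * chunk_ms // 1000 * bytes_per_sample
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        # Pad the tail with silence so every message has the same length.
        yield chunk.ljust(chunk_bytes, b"\x00")


one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # 1s of silence
chunks = list(chunk_pcm(one_second))
print(len(chunks), len(chunks[0]))  # 13 3840
```

Under these assumptions each chunk holds 1,920 samples (3,840 bytes), and one second of audio becomes 13 messages, the last one half padding. A client would send each chunk as one binary WebSocket message and read transcription messages as they arrive.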

📡 REST Upload `/transcribe`

Secondary endpoint, intended for testing, that processes a complete audio file. A single POST request with the audio file returns the full transcription with word-level timestamps.
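
As a rough illustration of the upload call, the sketch below builds a multipart POST to `/transcribe` using only the Python standard library. The form field name `file`, the `audio/wav` content type, the base URL, and the helper name `build_transcribe_request` are all assumptions; the post does not document the request schema.

```python
import urllib.request
import uuid


def build_transcribe_request(base_url: str, audio: bytes,
                             filename: str = "sample.wav",
                             field: str = "file") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST to /transcribe."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/transcribe",
        data=head + audio + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Hypothetical base URL and placeholder audio bytes, for illustration only.
req = build_transcribe_request("http://localhost:7860", b"RIFF....")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=30)
```

The request is built separately from sending so it can be inspected or tested without a running service.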

💓 Health Check `/health`

Basic monitoring endpoint for verifying deployment status. It reports model readiness and GPU resource availability.
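
One possible shape for the health response is sketched below. The field names, the `gpu_free_mb` value, and the choice of HTTP 503 when the service is not ready are assumptions, not the service's documented schema.

```python
import json
from typing import Optional, Tuple


def health_response(model_loaded: bool, gpu_available: bool,
                    gpu_free_mb: Optional[int] = None) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) describing service readiness."""
    ready = model_loaded and gpu_available
    body = {
        "status": "ok" if ready else "unavailable",  # hypothetical field names
        "model_loaded": model_loaded,
        "gpu_available": gpu_available,
        "gpu_free_mb": gpu_free_mb,
    }
    # 503 lets load balancers and uptime checks treat "not ready" as a failure.
    return (200 if ready else 503), json.dumps(body)


status, body = health_response(True, True, gpu_free_mb=12000)
print(status, body)
```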

Technical Highlights

Ultra-Low Latency: 80ms frame processing with Moshi's native streaming
Model Optimization: Weights pre-cached in the Docker image, so startup skips the model download
Cost Efficient: T4 Small GPU with auto-sleep after 30 minutes
Production Ready: Supports 2 concurrent streaming connections
Multi-Language: English and French recognition support
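
The two-connection limit could be enforced with a simple slot counter, sketched here with `asyncio`. The class name `StreamSlots`, the handler, and the reject-when-full behavior are assumptions for illustration; the post does not say how the service handles a third client.

```python
import asyncio

MAX_STREAMS = 2  # from the post: 2 concurrent streaming connections


class StreamSlots:
    """Counts active streams; safe without locks on a single event loop."""

    def __init__(self, limit: int = MAX_STREAMS) -> None:
        self._limit = limit
        self._active = 0

    def try_acquire(self) -> bool:
        if self._active >= self._limit:
            return False
        self._active += 1
        return True

    def release(self) -> None:
        self._active = max(0, self._active - 1)


async def handle_stream(slots: StreamSlots, client_id: str,
                        hold_s: float = 0.01) -> str:
    """Hypothetical handler: reject when full, otherwise hold a slot."""
    if not slots.try_acquire():
        return f"{client_id}: rejected"
    try:
        await asyncio.sleep(hold_s)  # stand-in for a streaming session
        return f"{client_id}: served"
    finally:
        slots.release()


async def demo() -> list:
    slots = StreamSlots()
    return await asyncio.gather(
        *(handle_stream(slots, f"client-{i}") for i in (1, 2, 3))
    )


results = asyncio.run(demo())
print(results)
# ['client-1: served', 'client-2: served', 'client-3: rejected']
```

Rejecting the third client immediately, rather than queueing it, keeps latency predictable for the two active streams; a queue would be an equally reasonable choice.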

Perfect for real-time voice applications, live transcription services, and conversational AI systems!

#AI #SpeechRecognition #RealTime #MachineLearning #HuggingFace #Python #Docker