Peter Michael Gits
# πŸŽ™οΈ Real-Time Speech-to-Text Service with Kyutai Moshi
Just built a production-ready STT service using Kyutai's Moshi model for ultra-low-latency speech recognition!
## System Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ stt-gpu-service-python-v4 β”‚
β”‚ (Nvidia T4 Small) β”‚
β”‚ β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Moshi Model β”‚ β”‚ API Interfaces β”‚ β”‚
β”‚ β”‚ kyutai/stt-1b β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ (Cached) β”‚ β”‚ 🌐 WebSocket /ws/stream β”‚ β”‚
β”‚ β”‚ β”‚ β”‚ ↓ 80ms audio chunks β”‚ β”‚
β”‚ β”‚ β€’ 0.5s delay │◄──── ↑ Real-time transcription β”‚ β”‚
β”‚ β”‚ β€’ EN/FR β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ β€’ 1B params β”‚ β”‚ πŸ“‘ REST /transcribe β”‚ β”‚
β”‚ β”‚ β”‚ β”‚ ↓ Audio file upload β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ ↑ JSON transcription β”‚ β”‚
β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ πŸ’“ GET /health β”‚ β”‚
β”‚ β”‚ ↑ Service status check β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”
β”‚ Client 1 β”‚ β”‚ Client 2 β”‚ β”‚ Test β”‚
β”‚ (Streaming) β”‚ β”‚(Streaming)β”‚ β”‚ (Upload) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
## API Interface Details
### 🌐 **WebSocket Streaming** `/ws/stream`
Primary interface for real-time speech recognition with 80ms audio chunks. Achieves ~200ms end-to-end latency with bidirectional communication for live conversations.
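A minimal client sketch for the streaming endpoint. Assumptions not confirmed by the post: the server accepts raw 16-bit PCM mono binary frames at Moshi's 24 kHz sample rate, the `ws://…/ws/stream` URI shape, and the empty-frame end-of-stream marker; the actual wire format may differ.

```python
# Hedged sketch of a /ws/stream client. Wire format (raw PCM binary
# frames, empty end-of-stream frame) is an assumption for illustration.
import asyncio

SAMPLE_RATE = 24_000      # Moshi operates on 24 kHz audio
CHUNK_MS = 80             # one 80 ms frame per WebSocket message
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHUNK_BYTES = SAMPLE_RATE * CHUNK_MS // 1000 * BYTES_PER_SAMPLE  # 3840

def chunk_audio(pcm: bytes):
    """Split raw PCM bytes into 80 ms chunks for streaming."""
    for i in range(0, len(pcm), CHUNK_BYTES):
        yield pcm[i:i + CHUNK_BYTES]

async def stream(uri: str, pcm: bytes) -> None:
    # Requires the third-party `websockets` package; imported lazily so
    # the chunking helpers above stay usable without it.
    import websockets
    async with websockets.connect(uri) as ws:
        for chunk in chunk_audio(pcm):
            await ws.send(chunk)          # 80 ms of audio per message
        await ws.send(b"")                # hypothetical end-of-stream marker
        async for message in ws:
            print(message)                # partial/final transcripts

# asyncio.run(stream("ws://localhost:8000/ws/stream", pcm_bytes))
```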
### πŸ“‘ **REST Upload** `/transcribe`
Secondary testing endpoint for complete audio file processing. A simple POST with an audio file returns the full transcription with word-level timestamps.
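The upload flow can be sketched in a few lines. The multipart field name `file` and the `{"words": [...]}` response shape are assumptions about the endpoint, not documented behavior.

```python
# Hedged sketch of the /transcribe upload flow. Field name "file" and
# the JSON response shape are assumptions for illustration.

def transcribe_file(base_url: str, path: str) -> dict:
    """POST an audio file to /transcribe and return the parsed JSON."""
    import requests  # third-party; imported lazily
    with open(path, "rb") as f:
        resp = requests.post(f"{base_url}/transcribe", files={"file": f})
    resp.raise_for_status()
    return resp.json()

def words_to_text(response: dict) -> str:
    """Join assumed word-level entries like {"word": ..., "start": ...}
    back into a plain transcript line."""
    return " ".join(w["word"] for w in response.get("words", []))
```

`words_to_text` is a convenience for consumers that only need the text and not the timestamps.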
### πŸ’“ **Health Check** `/health`
Basic service monitoring endpoint for deployment status verification. Returns model readiness and GPU resource availability.
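A stdlib-only probe for the health endpoint. The payload keys `model_loaded` and `gpu_available` are guesses at what "model readiness and GPU resource availability" looks like on the wire, not a documented schema.

```python
# Hedged sketch of a /health probe; the response keys checked in
# is_ready() are assumptions about the payload, not a documented schema.
import json
import urllib.request

def check_health(base_url: str) -> dict:
    """GET /health and parse the JSON status payload."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)

def is_ready(status: dict) -> bool:
    """True when the (assumed) readiness flags are both set."""
    return bool(status.get("model_loaded")) and bool(status.get("gpu_available"))
```

A deployment platform's readiness check could call `check_health` and gate traffic on `is_ready`.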
## Technical Highlights
- **Ultra-Low Latency**: 80ms frame processing with Moshi's native streaming
- **Model Optimization**: Model weights pre-cached in the Docker image for instant startup
- **Cost Efficient**: T4 Small GPU with 30-minute auto-sleep
- **Production Ready**: Supports 2 concurrent streaming connections
- **Multi-Language**: English and French recognition support
Perfect for real-time voice applications, live transcription services, and conversational AI systems!
#AI #SpeechRecognition #RealTime #MachineLearning #HuggingFace #Python #Docker