metadata
title: Kyutai STT GPU Service v3
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
hardware: t4-small
app_port: 7860
Kyutai STT GPU Service v3
A cost-optimized Speech-to-Text WebSocket server powered by Kyutai's Delayed Streams Modeling, deployed on HuggingFace Spaces with GPU acceleration.
Features
- Real-time WebSocket streaming for audio transcription
- Choice of a multilingual (English/French) model or an English-only model
- Cost-optimized deployment with auto-sleep functionality
- GPU acceleration with CUDA support
- Word-level timestamps and Voice Activity Detection
Models
- kyutai/stt-1b-en_fr: Multilingual model (~1B parameters) with VAD support
- kyutai/stt-2.6b-en: English-only model (~2.6B parameters)
WebSocket API
Connect to the WebSocket endpoint and send JSON messages:
Start Streaming
{
  "type": "start",
  "config": {
    "enable_timestamps": true,
    "enable_vad": true,
    "language": "en"
  }
}
Send Audio Data
{
  "type": "audio",
  "data": "base64_encoded_audio_data",
  "sample_rate": 16000,
  "channels": 1,
  "timestamp": 1234567890
}
Stop Streaming
{
  "type": "stop"
}
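The three client messages above can be assembled with the Python standard library alone. This is a minimal sketch, not the service's official client: the helper names are hypothetical, the audio is assumed to be raw 16-bit mono PCM (this README does not state the expected encoding), and `timestamp` is assumed to be Unix time in seconds. Sending the resulting strings over a WebSocket connection is left to whichever client library you use.

```python
import base64
import json
import time


def start_message(language="en", enable_timestamps=True, enable_vad=True):
    """Build the JSON text of a "start" message."""
    return json.dumps({
        "type": "start",
        "config": {
            "enable_timestamps": enable_timestamps,
            "enable_vad": enable_vad,
            "language": language,
        },
    })


def audio_message(chunk: bytes, sample_rate=16000, channels=1):
    """Build the JSON text of an "audio" message carrying one audio chunk."""
    return json.dumps({
        "type": "audio",
        # Raw bytes are base64-encoded so they can travel inside JSON.
        "data": base64.b64encode(chunk).decode("ascii"),
        "sample_rate": sample_rate,
        "channels": channels,
        # Assumption: Unix time in seconds; the unit is not documented here.
        "timestamp": int(time.time()),
    })


def stop_message():
    """Build the JSON text of a "stop" message."""
    return json.dumps({"type": "stop"})


# Assumption: 0.1 s of 16-bit mono silence at 16 kHz (1600 samples, 2 bytes each).
chunk = b"\x00\x00" * 1600
messages = [start_message(), audio_message(chunk), stop_message()]
```

Each element of `messages` is a text frame to send in order: one `start`, any number of `audio` frames, then `stop`.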
Response Format
{
  "type": "transcription",
  "result": {
    "text": "Transcribed text",
    "confidence": 0.95,
    "start_time": 0.0,
    "end_time": 2.5
  }
}
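On the client side, a response in this shape can be decoded into a typed record. The sketch below is illustrative: `Transcription` and `parse_message` are hypothetical names, and the assumption that `start_time` and `end_time` are in seconds is not confirmed by this README. Messages of any other `type` are ignored rather than treated as errors, since the full set of server message types is not listed here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcription:
    text: str
    confidence: float
    start_time: float  # assumed to be seconds
    end_time: float    # assumed to be seconds


def parse_message(raw: str) -> Optional[Transcription]:
    """Return a Transcription for "transcription" messages, otherwise None."""
    msg = json.loads(raw)
    if msg.get("type") != "transcription":
        return None
    result = msg["result"]
    return Transcription(
        text=result["text"],
        confidence=float(result["confidence"]),
        start_time=float(result["start_time"]),
        end_time=float(result["end_time"]),
    )


raw = json.dumps({
    "type": "transcription",
    "result": {
        "text": "Transcribed text",
        "confidence": 0.95,
        "start_time": 0.0,
        "end_time": 2.5,
    },
})
segment = parse_message(raw)
```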
Cost Management
- Auto-sleep: Space sleeps after 30-60 minutes of inactivity
- No charges during sleep: GPU billing stops completely
- Fast wake-up: 30-90 seconds with preloaded model
Estimated Cost by Usage Pattern
- On-demand (10 hours/week): ~$29/month
- Business hours (8 hours/day × 5 days/week): ~$89/month
- Daily use (4 hours/day, every day): ~$69/month
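To estimate cost for a different usage pattern, billed GPU hours can be converted to a monthly figure. The sketch below is a rough calculator under stated assumptions: the function names are hypothetical, a month is taken as 52/12 weeks, and the hourly rate is a placeholder you must replace with the current price for your hardware tier. Because the rate behind the figures above is not stated, this will not necessarily reproduce them.

```python
WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a month


def monthly_hours(hours_per_week: float) -> float:
    """Convert weekly awake (billed) hours into average monthly hours."""
    return hours_per_week * WEEKS_PER_MONTH


def monthly_cost(hours_per_week: float, rate_per_hour: float) -> float:
    """Estimate monthly cost; sleeping hours are assumed to cost nothing."""
    return monthly_hours(hours_per_week) * rate_per_hour


# Placeholder rate in USD per hour: substitute your tier's actual price.
RATE = 1.00

patterns = {
    "on-demand": 10,          # 10 hours/week
    "business hours": 8 * 5,  # 8 hours/day, 5 days/week
    "daily use": 4 * 7,       # 4 hours/day, every day
}
estimates = {name: monthly_cost(hours, RATE) for name, hours in patterns.items()}
```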
Development
Built with Rust and the Candle ML framework for optimal performance and GPU utilization.