Peter Michael Gits

feat: Force fresh deployment with clear STT vs TTS identification

808ab30 8 months ago


metadata

title: Kyutai STT GPU Service v3
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
hardware: t4-small
app_port: 7860

Kyutai STT GPU Service v3

A cost-optimized speech-to-text (STT) WebSocket server powered by Kyutai's Delayed Streams Modeling, deployed on Hugging Face Spaces with GPU acceleration.

Features

Real-time WebSocket streaming for audio transcription
Multilingual support (English/French) and English-only models
Cost-optimized deployment with auto-sleep functionality
GPU acceleration with CUDA support
Word-level timestamps and Voice Activity Detection (VAD)

Models

kyutai/stt-1b-en_fr: Multilingual model (~1B parameters) with VAD support
kyutai/stt-2.6b-en: English-only model (~2.6B parameters)

WebSocket API

Connect to the WebSocket endpoint and send JSON messages:

Start Streaming

{
  "type": "start",
  "config": {
    "enable_timestamps": true,
    "enable_vad": true,
    "language": "en"
  }
}

Send Audio Data

{
  "type": "audio",
  "data": "base64_encoded_audio_data",
  "sample_rate": 16000,
  "channels": 1,
  "timestamp": 1234567890
}

Stop Streaming

{
  "type": "stop"
}
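
As a rough illustration of the three message types above, the following Python sketch builds the text frames for one session: a "start" message, one "audio" message per chunk, and a final "stop". It only constructs the JSON; sending it over a socket is left out because this README does not give the endpoint URL. The function name `session_messages`, the 16-bit PCM sample format, the chunk size, and the use of milliseconds for `timestamp` are assumptions made for the example, not documented behavior of the service.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE = 16000   # matches the "sample_rate" field in the example message
CHANNELS = 1          # matches the "channels" field in the example message
BYTES_PER_SAMPLE = 2  # assumption: 16-bit PCM; the README does not state the format


def session_messages(pcm: bytes, chunk_bytes: int = 3200, language: str = "en") -> Iterator[str]:
    """Yield the JSON text frames for one session: start, audio chunks, stop."""
    yield json.dumps({
        "type": "start",
        "config": {"enable_timestamps": True, "enable_vad": True, "language": language},
    })
    bytes_per_second = SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE
    for offset in range(0, len(pcm), chunk_bytes):
        chunk = pcm[offset:offset + chunk_bytes]
        yield json.dumps({
            "type": "audio",
            "data": base64.b64encode(chunk).decode("ascii"),
            "sample_rate": SAMPLE_RATE,
            "channels": CHANNELS,
            # Assumption: chunk start position in milliseconds; the unit is not documented.
            "timestamp": offset * 1000 // bytes_per_second,
        })
    yield json.dumps({"type": "stop"})


# Example: a quarter second of silence split into 100 ms chunks.
frames = list(session_messages(b"\x00" * 8000))
```

Each yielded string can be passed as one text frame to whichever WebSocket client library is in use.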

Response Format

{
  "type": "transcription",
  "result": {
    "text": "Transcribed text",
    "confidence": 0.95,
    "start_time": 0.0,
    "end_time": 2.5
  }
}
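
A client might decode this response along the following lines. This is a minimal sketch that assumes every transcription message carries exactly the four fields shown above; the `Transcription` class and `parse_response` function are hypothetical names, and messages of any other type are simply ignored.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcription:
    text: str
    confidence: float
    start_time: float
    end_time: float

    @property
    def duration(self) -> float:
        """Length of the transcribed span, in the same unit as the timestamps."""
        return self.end_time - self.start_time


def parse_response(raw: str) -> Optional[Transcription]:
    """Return a Transcription, or None if the message is not a transcription."""
    message = json.loads(raw)
    if message.get("type") != "transcription":
        return None
    result = message["result"]
    return Transcription(
        text=result["text"],
        confidence=float(result["confidence"]),
        start_time=float(result["start_time"]),
        end_time=float(result["end_time"]),
    )


sample = ('{"type": "transcription", "result": {"text": "Transcribed text", '
          '"confidence": 0.95, "start_time": 0.0, "end_time": 2.5}}')
parsed = parse_response(sample)
```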

Cost Management

Auto-sleep: Space sleeps after 30-60 minutes of inactivity
No charges during sleep: GPU billing stops completely
Fast wake-up: 30-90 seconds with preloaded model
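
Because a sleeping Space can take 30-90 seconds to wake, a client may want to retry its first connection rather than fail immediately. The sketch below shows one way to do that with a plain retry loop. The `connect` callable is a stand-in for whatever opens the WebSocket, and the assumption that a failed attempt raises `OSError` (or a subclass such as `ConnectionRefusedError`) depends on the client library; neither is specified by this README.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_wakeup(
    connect: Callable[[], T],
    timeout_s: float = 120.0,   # a little above the 90 second upper bound
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return connect()
        except OSError:
            # Give up if another wait would run past the deadline.
            if time.monotonic() + interval_s > deadline:
                raise
            sleep(interval_s)
```

The `sleep` parameter is injectable so that the loop can be exercised in tests without real delays.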

Usage Examples

On-demand (10 hours/week): ~$29/month
Business hours (8h × 5 days): ~$89/month
Daily use (4 hours/day): ~$69/month
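
The estimates above do not state the hourly rate they assume. The sketch below converts weekly usage into monthly hours and back-computes the effective hourly rate each figure implies. The three figures work out to different effective rates, so they presumably include more than active GPU time. One plausible contribution is the idle period billed before each auto-sleep, though the README does not confirm this. The helper names are hypothetical, and any rate passed to `monthly_cost` should come from current Hugging Face pricing rather than from this example.

```python
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12


def monthly_hours(hours_per_week: float) -> float:
    """Average billed hours per month for a given weekly usage."""
    return hours_per_week * WEEKS_PER_YEAR / MONTHS_PER_YEAR


def monthly_cost(hours_per_week: float, hourly_rate_usd: float) -> float:
    """Estimated monthly cost at a given hourly rate (rate supplied by the caller)."""
    return monthly_hours(hours_per_week) * hourly_rate_usd


def implied_rate(monthly_usd: float, hours_per_week: float) -> float:
    """Effective hourly rate implied by a monthly estimate."""
    return monthly_usd / monthly_hours(hours_per_week)


# Weekly hours for the three scenarios listed above.
scenarios = {
    "on-demand": (29, 10),           # 10 hours/week
    "business hours": (89, 8 * 5),   # 8h x 5 days
    "daily use": (69, 4 * 7),        # 4 hours/day
}
for name, (usd, hours) in scenarios.items():
    print(f"{name}: ~${implied_rate(usd, hours):.2f} per hour of use")
```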

Development

Built with Rust and the Candle ML framework for optimal performance and GPU utilization.