Peter Michael Gits
metadata
title: Kyutai STT GPU Service v3
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
hardware: t4-small
app_port: 7860

Kyutai STT GPU Service v3

A cost-optimized speech-to-text WebSocket server powered by Kyutai's Delayed Streams Modeling, deployed on Hugging Face Spaces with GPU acceleration.

Features

  • Real-time WebSocket streaming for audio transcription
  • Two model options: bilingual (English/French) and English-only
  • Cost-optimized deployment with auto-sleep functionality
  • GPU acceleration with CUDA support
  • Word-level timestamps and Voice Activity Detection

Models

  • kyutai/stt-1b-en_fr: Bilingual English/French model (~1B parameters) with VAD support
  • kyutai/stt-2.6b-en: English-only model (~2.6B parameters)

WebSocket API

Connect to the WebSocket endpoint and send JSON messages:

Start Streaming

{
  "type": "start",
  "config": {
    "enable_timestamps": true,
    "enable_vad": true,
    "language": "en"
  }
}

Send Audio Data

{
  "type": "audio",
  "data": "base64_encoded_audio_data",
  "sample_rate": 16000,
  "channels": 1,
  "timestamp": 1234567890
}
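The "data" field carries the audio as a base64 string. A minimal sketch of building one such message in Python follows. It assumes the payload is raw 16-bit little-endian mono PCM, which this README does not state; `make_audio_message` and `sine_pcm16` are hypothetical helper names used only for illustration.

```python
import base64
import json
import math
import struct


def make_audio_message(pcm_bytes: bytes, sample_rate: int = 16000,
                       channels: int = 1, timestamp: int = 0) -> str:
    """Wrap raw audio bytes in the "audio" JSON message shown above."""
    return json.dumps({
        "type": "audio",
        "data": base64.b64encode(pcm_bytes).decode("ascii"),
        "sample_rate": sample_rate,
        "channels": channels,
        "timestamp": timestamp,
    })


def sine_pcm16(freq_hz: float, duration_s: float, sample_rate: int = 16000) -> bytes:
    """Generate a half-amplitude test tone as 16-bit little-endian mono PCM.

    The sample format is an assumption; check what the server expects.
    """
    n_samples = round(duration_s * sample_rate)
    return b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_samples)
    )


# 100 ms of a 440 Hz tone, wrapped in a single "audio" message.
message = make_audio_message(sine_pcm16(440.0, 0.1), timestamp=1234567890)
```

The resulting string can be sent as a text frame with any WebSocket client library.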

Stop Streaming

{
  "type": "stop"
}
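Taken together, the three message types suggest a session of the form start, one or more audio messages, then stop. The sketch below generates that sequence for a buffer of audio in Python. The 3200-byte chunk size (100 ms of 16-bit mono audio at 16 kHz) and the function name `session_messages` are assumptions for illustration, and the "timestamp" value is passed through unchanged because its unit is not specified here.

```python
import base64
import json
from typing import Iterator


def session_messages(pcm_bytes: bytes, chunk_bytes: int = 3200,
                     sample_rate: int = 16000, timestamp: int = 0) -> Iterator[str]:
    """Yield the JSON messages for one session: start, audio chunks, stop."""
    yield json.dumps({
        "type": "start",
        "config": {"enable_timestamps": True, "enable_vad": True, "language": "en"},
    })
    # Split the buffer into fixed-size chunks; the last one may be shorter.
    for offset in range(0, len(pcm_bytes), chunk_bytes):
        chunk = pcm_bytes[offset:offset + chunk_bytes]
        yield json.dumps({
            "type": "audio",
            "data": base64.b64encode(chunk).decode("ascii"),
            "sample_rate": sample_rate,
            "channels": 1,
            "timestamp": timestamp,
        })
    yield json.dumps({"type": "stop"})


# One second of silence as 16-bit mono PCM at 16 kHz (32000 bytes).
messages = list(session_messages(bytes(32000)))
types = [json.loads(m)["type"] for m in messages]
```

Each yielded string would be sent in order over the open WebSocket connection.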

Response Format

{
  "type": "transcription",
  "result": {
    "text": "Transcribed text",
    "confidence": 0.95,
    "start_time": 0.0,
    "end_time": 2.5
  }
}
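On the client side, a response like this can be parsed into a small typed record. The Python sketch below assumes that "start_time" and "end_time" are in seconds and that messages with any other "type" can be ignored; neither is stated in this README. `Transcription` and `parse_response` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcription:
    text: str
    confidence: float
    start_time: float  # assumed to be seconds
    end_time: float    # assumed to be seconds

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


def parse_response(raw: str) -> Optional[Transcription]:
    """Return a Transcription for "transcription" messages, else None."""
    msg = json.loads(raw)
    if msg.get("type") != "transcription":
        return None  # other message types are not documented here
    result = msg["result"]
    return Transcription(
        text=result["text"],
        confidence=float(result["confidence"]),
        start_time=float(result["start_time"]),
        end_time=float(result["end_time"]),
    )


raw = ('{"type": "transcription", "result": {"text": "Transcribed text", '
       '"confidence": 0.95, "start_time": 0.0, "end_time": 2.5}}')
segment = parse_response(raw)
```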

Cost Management

  • Auto-sleep: Space sleeps after 30-60 minutes of inactivity
  • No charges during sleep: GPU billing stops completely
  • Fast wake-up: 30-90 seconds with preloaded model

Usage Examples

  • On-demand (10 hours/week): ~$29/month
  • Business hours (8 hours/day × 5 days/week): ~$89/month
  • Daily use (4 hours/day, 7 days/week): ~$69/month
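These estimates depend on the hourly GPU rate and on how many billable hours a usage pattern produces. A rough Python sketch for recomputing them under your own assumptions follows. The hourly rate is a placeholder parameter, not a quoted price, and the calculation ignores any time the Space stays awake before auto-sleep takes effect.

```python
WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a month


def monthly_cost(hours_per_week: float, hourly_rate_usd: float) -> float:
    """Estimate monthly GPU cost from weekly active hours and an hourly rate."""
    return hours_per_week * WEEKS_PER_MONTH * hourly_rate_usd


# Placeholder rate for illustration only; substitute the current price
# of the hardware tier you actually deploy on.
HYPOTHETICAL_RATE_USD = 0.50

for label, hours_per_week in [
    ("On-demand", 10),
    ("Business hours", 8 * 5),
    ("Daily use", 4 * 7),
]:
    cost = monthly_cost(hours_per_week, HYPOTHETICAL_RATE_USD)
    print(f"{label}: ~${cost:.0f}/month")
```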

Development

Built with Rust and the Candle ML framework for performance and efficient GPU utilization.