Peter Michael Gits

HuggingFace Spaces Deployment Guide

Space Configuration

  • Space Name: stt-gpu-service-v3
  • Space Type: Docker
  • Hardware: GPU (T4 Small recommended for cost optimization)
  • Visibility: Public

Deployment Steps

1. Create HuggingFace Space

  1. Go to HuggingFace Spaces
  2. Click "Create new Space"
  3. Name: stt-gpu-service-v3
  4. Select "Docker" as the SDK
  5. Choose GPU hardware (T4 Small for cost efficiency)
  6. Set visibility to Public

2. Upload Files

Upload all files from this repository to the HuggingFace Space:

# Clone your space repository
git clone https://huggingface.co/spaces/{your_username}/stt-gpu-service-v3
cd stt-gpu-service-v3

# Copy all files from this project
cp -r /path/to/kyutai-rustServer/* .

# Add and commit
git add .
git commit -m "Initial deployment of Kyutai STT Server v3"

git push

3. Space Settings

In your HuggingFace Space settings:

  • Hardware: T4 Small GPU (~$0.50/hour)
  • Auto-sleep: Enable (sleeps after 30-60 minutes of inactivity)
  • Environment Variables: None required (all configured in code)

4. Monitoring Deployment

  1. Watch the build logs in the HuggingFace Space
  2. Build process includes:
    • Rust compilation with CUDA support
    • Model downloading and caching
    • Server startup and health checks

5. Testing the Deployment

Once deployed, the Space will provide:

  • Web Interface: Main Space URL (Gradio + FastAPI)
  • WebSocket Endpoint: wss://your-space-url/ws
  • Health Check: https://your-space-url/health
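The endpoints above can be exercised from Python with only the standard library. This is a minimal sketch: `your-space-url` is a placeholder for your actual Space hostname, and the shape of the `/health` response body is defined by the server, not assumed here.

```python
import json
import urllib.request

def endpoints(base):
    """Map a Space's base hostname to the endpoints listed above."""
    return {
        "web": f"https://{base}/",
        "websocket": f"wss://{base}/ws",
        "health": f"https://{base}/health",
    }

def check_health(base, timeout=10):
    """GET /health and return the decoded JSON body (shape depends on the server)."""
    with urllib.request.urlopen(endpoints(base)["health"], timeout=timeout) as resp:
        return json.loads(resp.read())
```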

Expected Build Time

  • Initial Build: 10-15 minutes (downloading models)
  • Subsequent Builds: 5-8 minutes (cached models)
  • Cold Start: 30-90 seconds (preloaded models)

Cost Optimization

Auto-Sleep Configuration

  • Space automatically sleeps after 30-60 minutes of inactivity
  • Wake-up time: 30-90 seconds (vs 10-15 minutes without model preloading)
  • No GPU charges during sleep

Manual Control

# Pause space (API call)
curl -X POST "https://huggingface.co/api/spaces/{username}/stt-gpu-service-v3/pause" \
  -H "Authorization: Bearer {your-hf-token}"

# Resume space (API call)  
curl -X POST "https://huggingface.co/api/spaces/{username}/stt-gpu-service-v3/restart" \
  -H "Authorization: Bearer {your-hf-token}"
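The same pause/restart calls can be issued from Python. This sketch mirrors the curl commands above using only the standard library; `username`, the Space name, and the token are placeholders you supply, and the token needs write access to the Space.

```python
import urllib.request

API = "https://huggingface.co/api/spaces"

def space_request(username, space, action, token):
    """Build the POST request used above to pause ('pause') or resume ('restart') a Space."""
    return urllib.request.Request(
        f"{API}/{username}/{space}/{action}",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )

# Sending it (requires a valid write token):
# urllib.request.urlopen(space_request("user", "stt-gpu-service-v3", "pause", "hf_..."))
```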

Cost Examples

  • On-demand (10 hours/week): ~$29/month
  • Business hours (40 hours/week): ~$89/month
  • Daily use (4 hours/day): ~$69/month
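These figures can be sanity-checked against the ~$0.50/hour T4 rate. The raw GPU-hours formula below comes out somewhat under the table above (which presumably also accounts for wake-up time and other overhead), so treat it as a lower bound, not a billing prediction.

```python
def monthly_gpu_cost(hours_per_week, rate_per_hour=0.50, weeks_per_month=52 / 12):
    """Raw GPU-hours cost: billed hours per week, averaged over a month."""
    return hours_per_week * weeks_per_month * rate_per_hour

# Business hours (40 hours/week) at $0.50/hour:
# monthly_gpu_cost(40) ≈ 86.7 (the table above lists ~$89/month)
```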

Troubleshooting

Build Failures

  1. Check Rust compilation errors in build logs
  2. Ensure CUDA dependencies are correctly installed
  3. Verify model paths in config files

Runtime Issues

  1. Check GPU availability: nvidia-smi in Space terminal
  2. Monitor memory usage for OOM errors
  3. Verify WebSocket connections in browser developer tools
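For steps 1 and 2, nvidia-smi's CSV query mode gives machine-readable numbers that are easier to monitor than the default table. A small sketch (the query flags are standard nvidia-smi options; this only runs on GPU hardware):

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(x) for x in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total})
    return gpus

def gpu_memory():
    """Run nvidia-smi (GPU Spaces only) and return per-GPU memory usage."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)
```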

Model Loading Issues

  1. Check HuggingFace model access permissions
  2. Verify internet connectivity for model downloads
  3. Monitor disk space for model caching

API Usage Examples

JavaScript WebSocket Client

const ws = new WebSocket('wss://your-space-url/ws');

ws.onopen = () => {
    // Start streaming
    ws.send(JSON.stringify({
        "type": "start",
        "config": {"enable_timestamps": true}
    }));
};

ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === 'transcription') {
        console.log('Transcription:', data.result.text);
    }
};

// Send audio data (base64 encoded) — only after the socket is open,
// otherwise the send will fail
function sendAudio(audioBase64) {
    ws.send(JSON.stringify({
        "type": "audio",
        "data": audioBase64,
        "sample_rate": 16000,
        "channels": 1,
        "timestamp": Date.now()
    }));
}

Python Client Example

import asyncio
import json
import websockets

async def test_stt():
    uri = "wss://your-space-url/ws"
    async with websockets.connect(uri) as websocket:
        # Start streaming
        await websocket.send(json.dumps({
            "type": "start",
            "config": {"enable_timestamps": True}
        }))
        
        # Listen for responses
        async for message in websocket:
            data = json.loads(message)
            print(f"Received: {data}")

asyncio.run(test_stt())
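The client above only starts a session and listens. An audio frame matching the JavaScript example's message shape could be built like this; the field names are taken from that example, the exact schema is defined by the server, and `pcm_bytes` stands in for a raw audio chunk you capture yourself.

```python
import base64
import json
import time

def audio_message(pcm_bytes, sample_rate=16000, channels=1):
    """Wrap raw audio bytes in the JSON envelope shown in the JavaScript example."""
    return json.dumps({
        "type": "audio",
        "data": base64.b64encode(pcm_bytes).decode("ascii"),
        "sample_rate": sample_rate,
        "channels": channels,
        "timestamp": int(time.time() * 1000),
    })

# Inside test_stt(), after the "start" message:
# await websocket.send(audio_message(chunk))
```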