Peter Michael Gits
feat: Create STT GPU Service - eliminates Streamlit iframe barriers
ffff531

STT GPU Service - WebRTC Speech-to-Text

GPU-accelerated Speech-to-Text microservice designed to eliminate Streamlit iframe communication barriers for VoiceCalendar integration.

🎯 Purpose

This service solves the iframe communication issues encountered with the previous Streamlit approach by providing:

  • Direct HTTP API endpoints for WebRTC audio processing
  • GPU-accelerated transcription using OpenAI Whisper
  • Base64 audio support for seamless WebRTC integration
  • No iframe/postMessage complexity - pure HTTP communication
  • Scalable microservice architecture ready for production deployment

🚀 Key Features

✅ GPU Acceleration - CUDA-optimized Whisper models
✅ WebRTC Compatible - Direct base64 audio processing
✅ Multiple Models - Runtime model switching (tiny to large)
✅ Real-time Processing - Optimized for voice applications
✅ HuggingFace Ready - Gradio interface with API endpoints
✅ Production Scalable - $0.40/hour GPU infrastructure

πŸ—οΈ Architecture

VoiceCalendar WebRTC → Direct HTTP POST → STT GPU Service → Transcription
                     (no iframe barriers)

Previous Issues Eliminated:

  • ❌ window.Streamlit undefined errors
  • ❌ iframe postMessage failures
  • ❌ Complex bridge polling mechanisms
  • ❌ Component communication timeouts

📡 API Endpoints

Core Transcription

POST /api/transcribe
Content-Type: application/json

{
  "audio_base64": "base64_encoded_webm_audio", 
  "language": "en",
  "model_size": "base"
}

Health Check

GET /api/health
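Server-side, the transcribe endpoint boils down to: validate the JSON payload, decode the base64 audio, and hand the bytes to Whisper. A minimal sketch of that handler logic, with the actual GPU-backed Whisper call injected as a callable so the validation is framework-agnostic (function and field names here are assumptions, not the service's actual code):

```python
import base64

VALID_MODELS = {"tiny", "base", "small", "medium", "large"}

def handle_transcribe(payload: dict, transcribe_fn) -> dict:
    """Validate a /api/transcribe payload and run the injected transcriber.

    `transcribe_fn(audio_bytes, language, model_size)` stands in for the
    actual Whisper call on the GPU.
    """
    audio_b64 = payload.get("audio_base64")
    if not audio_b64:
        return {"error": "audio_base64 is required"}

    model_size = payload.get("model_size", "base")
    if model_size not in VALID_MODELS:
        return {"error": f"unknown model_size: {model_size}"}

    try:
        # validate=True rejects characters outside the base64 alphabet
        audio_bytes = base64.b64decode(audio_b64, validate=True)
    except ValueError:
        return {"error": "audio_base64 is not valid base64"}

    text = transcribe_fn(audio_bytes, payload.get("language", "en"), model_size)
    return {"transcription": text, "model_size": model_size}
```

Returning an `error` field instead of raising keeps the client-side handling to a single JSON check.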

🎤 WebRTC Integration

JavaScript Example

// Eliminates iframe communication complexity!
async function processVoiceChunk(audioBlob, chunkIndex) {
    // Convert WebRTC audio to base64, chunked to avoid
    // call-stack limits when spreading large buffers
    const arrayBuffer = await audioBlob.arrayBuffer();
    const audioArray = new Uint8Array(arrayBuffer);
    let binary = '';
    const CHUNK = 0x8000;
    for (let i = 0; i < audioArray.length; i += CHUNK) {
        binary += String.fromCharCode(...audioArray.subarray(i, i + CHUNK));
    }
    const audioBase64 = btoa(binary);

    // Direct API call - no iframe barriers
    const response = await fetch('/api/transcribe', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            audio_base64: audioBase64,
            language: 'en',
            model_size: 'base'
        })
    });
    if (!response.ok) {
        throw new Error(`STT request failed: ${response.status}`);
    }

    const result = await response.json();
    console.log(`Chunk ${chunkIndex}: ${result.transcription}`);
    return result.transcription;
}

🔧 Model Performance

Model    GPU Memory   Speed     Accuracy    Use Case
tiny     ~1GB         Fastest   Good        Real-time
base     ~1GB         Fast      Better      Balanced
small    ~2GB         Medium    Very good   Quality
medium   ~5GB         Slower    Excellent   High accuracy
large    ~10GB        Slowest   Best        Production
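Runtime model switching is only practical if already-loaded models stay resident, so a request for `base` doesn't reload weights that are already on the GPU. A sketch of that caching pattern, with the loader injected so it can stand in for `whisper.load_model` (the function and cache names are assumptions):

```python
# Cache of loaded models, keyed by size ("tiny" ... "large")
_model_cache: dict = {}

def get_model(model_size: str, loader):
    """Return a cached model, loading it on first request.

    `loader(model_size)` stands in for e.g. whisper.load_model; caching
    avoids re-initializing weights on every transcription request.
    """
    if model_size not in _model_cache:
        _model_cache[model_size] = loader(model_size)
    return _model_cache[model_size]
```

With the memory figures in the table above, caching every size at once would need roughly 19GB, so a production deployment would likely evict unused models rather than keep all five resident.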

🚀 Deployment

HuggingFace Spaces (GPU)

# Create new HF Space with GPU
# Upload: app.py, requirements.txt, README.md
# Set Hardware: A10G Small ($0.40/hour)

Docker Local

docker build -t stt-gpu-service .
docker run --gpus all -p 7860:7860 stt-gpu-service

🔗 VoiceCalendar Integration

The STT service integrates seamlessly with VoiceCalendar's unmute.sh methodology:

  1. WebRTC captures audio with voice activity detection
  2. Direct HTTP POST to STT service (no iframe complexity)
  3. GPU transcription with minimal latency
  4. Real-time display of transcription results

No more bridge communication barriers!
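The same flow works from any non-browser client: read bytes, base64-encode, POST JSON. A sketch of building that request in Python with only the standard library (the field names mirror the API above; the URL and `transcribe` helper are illustrative and assume the service is reachable):

```python
import base64
import json
import urllib.request

def build_transcribe_request(audio_bytes: bytes, language="en",
                             model_size="base") -> bytes:
    """Encode raw audio bytes into the JSON body /api/transcribe expects."""
    return json.dumps({
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "language": language,
        "model_size": model_size,
    }).encode("utf-8")

def transcribe(url: str, audio_bytes: bytes) -> str:
    """POST audio to the STT service and return the transcription text."""
    req = urllib.request.Request(
        url,
        data=build_transcribe_request(audio_bytes),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["transcription"]
```

Example call: `transcribe("http://localhost:7860/api/transcribe", open("clip.webm", "rb").read())`.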

📊 Benefits vs Previous Approach

Previous (Streamlit)        New (STT Service)
iframe communication        Direct HTTP API
postMessage barriers        Pure JSON requests
Bridge polling complexity   Simple HTTP calls
Streamlit constraints       Native WebRTC support
Limited scalability         Microservice architecture

🎯 Next Steps

1. ✅ STT Service - Complete
  2. 🚧 TTS Service - Port 7861
  3. 🚧 VoiceCalendar Native App - No Streamlit constraints
  4. 🚧 Production Deployment - GPU infrastructure