Peter Michael Gits
feat: Create STT GPU Service - eliminates Streamlit iframe barriers
ffff531

STT GPU Service - WebRTC Speech-to-Text

GPU-accelerated Speech-to-Text microservice designed to eliminate Streamlit iframe communication barriers for VoiceCalendar integration.

🎯 Purpose

This service solves the iframe communication issues encountered with the previous Streamlit approach by providing:

  • Direct HTTP API endpoints for WebRTC audio processing
  • GPU-accelerated transcription using OpenAI Whisper
  • Base64 audio support for seamless WebRTC integration
  • No iframe/postMessage complexity - pure HTTP communication
  • Scalable microservice architecture ready for production deployment

🚀 Key Features

✅ GPU Acceleration - CUDA-optimized Whisper models
✅ WebRTC Compatible - Direct base64 audio processing
✅ Multiple Models - Runtime model switching (tiny to large)
✅ Real-time Processing - Optimized for voice applications
✅ HuggingFace Ready - Gradio interface with API endpoints
✅ Production Scalable - $0.40/hour GPU infrastructure

πŸ—οΈ Architecture

VoiceCalendar WebRTC → Direct HTTP POST → STT GPU Service → Transcription
                     (no iframe barriers)

Previous Issues Eliminated:

  • ❌ window.Streamlit undefined errors
  • ❌ iframe postMessage failures
  • ❌ Complex bridge polling mechanisms
  • ❌ Component communication timeouts

📡 API Endpoints

Core Transcription

POST /api/transcribe
Content-Type: application/json

{
  "audio_base64": "base64_encoded_webm_audio", 
  "language": "en",
  "model_size": "base"
}

Health Check

GET /api/health
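Server-side, the transcribe endpoint boils down to: validate the JSON payload, decode the base64 audio, and hand the bytes to Whisper. A minimal sketch of that handler logic, with the actual GPU-backed Whisper call injected as a callable so the validation is framework-agnostic (function and field names here are assumptions, not the service's actual code):

```python
import base64

VALID_MODELS = {"tiny", "base", "small", "medium", "large"}

def handle_transcribe(payload: dict, transcribe_fn) -> dict:
    """Validate a /api/transcribe payload and run the injected transcriber.

    `transcribe_fn(audio_bytes, language, model_size)` stands in for the
    actual Whisper call on the GPU.
    """
    audio_b64 = payload.get("audio_base64")
    if not audio_b64:
        return {"error": "audio_base64 is required"}

    model_size = payload.get("model_size", "base")
    if model_size not in VALID_MODELS:
        return {"error": f"unknown model_size: {model_size}"}

    try:
        # validate=True rejects characters outside the base64 alphabet
        audio_bytes = base64.b64decode(audio_b64, validate=True)
    except ValueError:
        return {"error": "audio_base64 is not valid base64"}

    text = transcribe_fn(audio_bytes, payload.get("language", "en"), model_size)
    return {"transcription": text, "model_size": model_size}
```

Returning an `error` field instead of raising keeps the client-side handling to a single JSON check.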

🎤 WebRTC Integration

JavaScript Example

// Eliminates iframe communication complexity!
async function processVoiceChunk(audioBlob, chunkIndex) {
    // Convert WebRTC audio to base64, chunked to avoid
    // call-stack limits when spreading large buffers
    const arrayBuffer = await audioBlob.arrayBuffer();
    const audioArray = new Uint8Array(arrayBuffer);
    let binary = '';
    const CHUNK = 0x8000;
    for (let i = 0; i < audioArray.length; i += CHUNK) {
        binary += String.fromCharCode(...audioArray.subarray(i, i + CHUNK));
    }
    const audioBase64 = btoa(binary);

    // Direct API call - no iframe barriers
    const response = await fetch('/api/transcribe', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            audio_base64: audioBase64,
            language: 'en',
            model_size: 'base'
        })
    });
    if (!response.ok) {
        throw new Error(`STT request failed: ${response.status}`);
    }

    const result = await response.json();
    console.log(`Chunk ${chunkIndex}: ${result.transcription}`);
    return result.transcription;
}

🔧 Model Performance

Model    GPU Memory   Speed     Accuracy    Use Case
tiny     ~1GB         Fastest   Good        Real-time
base     ~1GB         Fast      Better      Balanced
small    ~2GB         Medium    Very good   Quality
medium   ~5GB         Slower    Excellent   High accuracy
large    ~10GB        Slowest   Best        Production
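Runtime model switching is only practical if already-loaded models stay resident, so a request for `base` doesn't reload weights that are already on the GPU. A sketch of that caching pattern, with the loader injected so it can stand in for `whisper.load_model` (the function and cache names are assumptions):

```python
# Cache of loaded models, keyed by size ("tiny" ... "large")
_model_cache: dict = {}

def get_model(model_size: str, loader):
    """Return a cached model, loading it on first request.

    `loader(model_size)` stands in for e.g. whisper.load_model; caching
    avoids re-initializing weights on every transcription request.
    """
    if model_size not in _model_cache:
        _model_cache[model_size] = loader(model_size)
    return _model_cache[model_size]
```

With the memory figures in the table above, caching every size at once would need roughly 19GB, so a production deployment would likely evict unused models rather than keep all five resident.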

🚀 Deployment

HuggingFace Spaces (GPU)

# Create new HF Space with GPU
# Upload: app.py, requirements.txt, README.md
# Set Hardware: A10G Small ($0.40/hour)

Docker Local

docker build -t stt-gpu-service .
docker run --gpus all -p 7860:7860 stt-gpu-service

🔗 VoiceCalendar Integration

The STT service integrates seamlessly with VoiceCalendar's unmute.sh methodology:

  1. WebRTC captures audio with voice activity detection
  2. Direct HTTP POST to STT service (no iframe complexity)
  3. GPU transcription with minimal latency
  4. Real-time display of transcription results

No more bridge communication barriers!
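The same flow works from any non-browser client: read bytes, base64-encode, POST JSON. A sketch of building that request in Python with only the standard library (the field names mirror the API above; the URL and `transcribe` helper are illustrative and assume the service is reachable):

```python
import base64
import json
import urllib.request

def build_transcribe_request(audio_bytes: bytes, language="en",
                             model_size="base") -> bytes:
    """Encode raw audio bytes into the JSON body /api/transcribe expects."""
    return json.dumps({
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "language": language,
        "model_size": model_size,
    }).encode("utf-8")

def transcribe(url: str, audio_bytes: bytes) -> str:
    """POST audio to the STT service and return the transcription text."""
    req = urllib.request.Request(
        url,
        data=build_transcribe_request(audio_bytes),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["transcription"]
```

Example call: `transcribe("http://localhost:7860/api/transcribe", open("clip.webm", "rb").read())`.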

📊 Benefits vs Previous Approach

Previous (Streamlit)        New (STT Service)
iframe communication        Direct HTTP API
postMessage barriers        Pure JSON requests
Bridge polling complexity   Simple HTTP calls
Streamlit constraints       Native WebRTC support
Limited scalability         Microservice architecture

🎯 Next Steps

1. ✅ STT Service - Complete
  2. 🚧 TTS Service - Port 7861
  3. 🚧 VoiceCalendar Native App - No Streamlit constraints
  4. 🚧 Production Deployment - GPU infrastructure