# STT GPU Service - WebRTC Speech-to-Text

GPU-accelerated Speech-to-Text microservice designed to eliminate Streamlit iframe communication barriers for VoiceCalendar integration.
## 🎯 Purpose
This service solves the iframe communication issues encountered with the previous Streamlit approach by providing:
- Direct HTTP API endpoints for WebRTC audio processing
- GPU-accelerated transcription using OpenAI Whisper
- Base64 audio support for seamless WebRTC integration
- No iframe/postMessage complexity - pure HTTP communication
- Scalable microservice architecture ready for production deployment
## 🚀 Key Features

- ✅ GPU Acceleration - CUDA-optimized Whisper models
- ✅ WebRTC Compatible - Direct base64 audio processing
- ✅ Multiple Models - Runtime model switching (tiny to large)
- ✅ Real-time Processing - Optimized for voice applications
- ✅ HuggingFace Ready - Gradio interface with API endpoints
- ✅ Production Scalable - $0.40/hour GPU infrastructure
## 🏗️ Architecture

```
VoiceCalendar WebRTC → Direct HTTP POST → STT GPU Service → Transcription
                      (no iframe barriers)
```

Previous issues eliminated:
- ❌ `window.Streamlit` undefined errors
- ❌ iframe postMessage failures
- ❌ Complex bridge polling mechanisms
- ❌ Component communication timeouts
## 📡 API Endpoints

### Core Transcription

```http
POST /api/transcribe
Content-Type: application/json

{
  "audio_base64": "base64_encoded_webm_audio",
  "language": "en",
  "model_size": "base"
}
```

### Health Check

```http
GET /api/health
```
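From any non-browser client, the request body above can be built with the standard library alone. The sketch below assumes the JSON schema shown; `SERVICE_URL` is a placeholder for your deployed Space's endpoint.

```python
import base64
import json

# Hypothetical endpoint URL - substitute your deployed Space's address.
SERVICE_URL = "http://localhost:7860/api/transcribe"

def build_transcribe_payload(audio_bytes: bytes,
                             language: str = "en",
                             model_size: str = "base") -> str:
    """Encode raw WebM audio bytes into the JSON body /api/transcribe expects."""
    return json.dumps({
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "language": language,
        "model_size": model_size,
    })

# The base64 field round-trips back to the original bytes.
payload = build_transcribe_payload(b"\x1aE\xdf\xa3fake-webm-data")
body = json.loads(payload)
assert base64.b64decode(body["audio_base64"]) == b"\x1aE\xdf\xa3fake-webm-data"
```

The payload string can then be sent with any HTTP client (`requests`, `urllib`, `curl`); the service decodes the base64 back to audio on its side.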
## 🎤 WebRTC Integration

### JavaScript Example

```javascript
// Eliminates iframe communication complexity!
async function processVoiceChunk(audioBlob, chunkIndex) {
  // Convert WebRTC audio to base64. Encode in 32 KB slices: spreading a
  // large Uint8Array directly into String.fromCharCode can exceed the
  // JavaScript call-stack limit.
  const arrayBuffer = await audioBlob.arrayBuffer();
  const audioArray = new Uint8Array(arrayBuffer);
  let binary = '';
  for (let i = 0; i < audioArray.length; i += 0x8000) {
    binary += String.fromCharCode(...audioArray.subarray(i, i + 0x8000));
  }
  const audioBase64 = btoa(binary);

  // Direct API call - no iframe barriers
  const response = await fetch('/api/transcribe', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      audio_base64: audioBase64,
      language: 'en',
      model_size: 'base'
    })
  });

  const result = await response.json();
  console.log(`Chunk ${chunkIndex}: ${result.transcription}`);
  return result.transcription;
}
```
## 🧠 Model Performance
| Model | GPU Memory | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny | ~1GB | Fastest | Good | Real-time |
| base | ~1GB | Fast | Better | Balanced |
| small | ~2GB | Medium | Very Good | Quality |
| medium | ~5GB | Slower | Excellent | High accuracy |
| large | ~10GB | Slowest | Best | Production |
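Runtime model switching can be sketched as a load-once cache keyed by `model_size`, so repeated requests for the same model reuse the already-loaded weights instead of paying the GPU load cost again. The loader below is a stand-in to keep the example self-contained; in the actual service it would be something like `whisper.load_model(model_size)`.

```python
# Sketch of runtime model switching via a load-once cache (assumed design,
# not the service's verbatim implementation).
VALID_SIZES = ("tiny", "base", "small", "medium", "large")
_model_cache = {}

def _load_model(model_size):
    # Stand-in for the expensive load, e.g. whisper.load_model(model_size).
    return f"<{model_size}-model>"

def get_model(model_size="base"):
    """Return a cached model, loading it on first use."""
    if model_size not in VALID_SIZES:
        raise ValueError(f"unknown model_size: {model_size!r}")
    if model_size not in _model_cache:
        _model_cache[model_size] = _load_model(model_size)
    return _model_cache[model_size]
```

With this shape, a request for `"base"` after a `"tiny"` request loads a second model rather than evicting the first, trading GPU memory (see the table above) for switch latency.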
## 🚀 Deployment

### HuggingFace Spaces (GPU)

```shell
# Create a new HF Space with GPU hardware
# Upload: app.py, requirements.txt, README.md
# Set Hardware: A10G Small ($0.40/hour)
```

### Docker (Local)

```shell
docker build -t stt-gpu-service .
docker run --gpus all -p 7860:7860 stt-gpu-service
```
## 🔗 VoiceCalendar Integration

The STT service integrates seamlessly with VoiceCalendar's unmute.sh methodology:

1. WebRTC captures audio with voice activity detection
2. Direct HTTP POST to STT service (no iframe complexity)
3. GPU transcription with minimal latency
4. Real-time display of transcription results

No more bridge communication barriers!
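Because each voice-activity chunk is posted independently, per-chunk transcriptions may complete out of order. A minimal client-side assembly step (a hypothetical helper, not part of the service API) keeps the displayed transcript coherent:

```python
# Hypothetical client-side helper: collect (chunk_index, transcription)
# pairs as responses arrive, then join them in index order for display.
def assemble_transcript(chunks):
    """chunks: iterable of (chunk_index, transcription) pairs."""
    ordered = sorted(chunks, key=lambda pair: pair[0])
    return " ".join(text.strip() for _, text in ordered if text.strip())

# Results arriving out of order still read correctly:
results = [(1, "book a meeting"), (0, "Hey calendar,"), (2, "for Friday.")]
print(assemble_transcript(results))
# → Hey calendar, book a meeting for Friday.
```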
## 📊 Benefits vs. Previous Approach
| Previous (Streamlit) | New (STT Service) |
|---|---|
| iframe communication | Direct HTTP API |
| postMessage barriers | Pure JSON requests |
| Bridge polling complexity | Simple HTTP calls |
| Streamlit constraints | Native WebRTC support |
| Limited scalability | Microservice architecture |
## 🎯 Next Steps

- ✅ STT Service - Complete
- 🚧 TTS Service - Port 7861
- 🚧 VoiceCalendar Native App - No Streamlit constraints
- 🚧 Production Deployment - GPU infrastructure