Implementation Notes
Architecture Overview
The STT system is built in 5 progressive steps, each adding functionality on top of the previous:
- Step 1: Basic offline transcription (Whisper/Vosk)
- Step 2: HTTP API for file uploads
- Step 3: WebSocket streaming for real-time audio
- Step 4: Telephony audio format support (Twilio/Exotel)
- Step 5: Production-ready with stability features
Key Components
Audio Processing
- TelephonyAudioConverter: Handles format conversion
- Twilio: 8kHz μ-law → 16kHz PCM
- Exotel: 8kHz PCM → 16kHz PCM
- Uses scipy.signal.resample for sample rate conversion
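A minimal sketch of this conversion path, assuming float32 output in [-1, 1] for Whisper. `mulaw_decode` and `convert_telephony_chunk` are illustrative names, not the actual `TelephonyAudioConverter` interface; the μ-law decode follows G.711 so it avoids the deprecated `audioop` module:

```python
import numpy as np
from scipy.signal import resample

def mulaw_decode(mulaw_bytes: bytes) -> np.ndarray:
    """Decode 8-bit G.711 mu-law to float32 PCM in [-1, 1]."""
    u = ~np.frombuffer(mulaw_bytes, dtype=np.uint8)        # bytes are stored complemented
    t = (((u & 0x0F).astype(np.int32) << 3) + 0x84) << ((u & 0x70) >> 4)
    samples = np.where(u & 0x80, 0x84 - t, t - 0x84)       # apply sign, remove bias
    return (samples / 32768.0).astype(np.float32)

def convert_telephony_chunk(raw: bytes, encoding: str = "mulaw") -> np.ndarray:
    """8kHz telephony chunk -> 16kHz float32 PCM for Whisper."""
    if encoding == "mulaw":                                # Twilio media streams
        samples = mulaw_decode(raw)
    else:                                                  # Exotel: 16-bit LE PCM
        samples = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    return resample(samples, len(samples) * 2).astype(np.float32)  # 8kHz -> 16kHz
```

`scipy.signal.resample` uses FFT-based resampling, which matches the note above; for long streams a polyphase filter (`scipy.signal.resample_poly`) would avoid edge artifacts at chunk boundaries.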
Voice Activity Detection (VAD)
- Simple energy-based VAD in Step 5
- Threshold: 0.01 (configurable)
- Frame-based analysis (25ms frames)
- Detects speech vs silence
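The energy-based VAD above can be sketched as follows; the function names are illustrative, but the 0.01 threshold and 25ms frame size mirror the values stated above:

```python
import numpy as np

def is_speech(frame: np.ndarray, threshold: float = 0.01) -> bool:
    """Energy-based VAD: RMS energy of one frame vs a fixed threshold."""
    return float(np.sqrt(np.mean(frame ** 2))) > threshold

def vad_flags(audio: np.ndarray, sample_rate: int = 16000,
              frame_ms: int = 25, threshold: float = 0.01) -> list[bool]:
    """Classify each 25ms frame of float32 audio as speech (True) or silence."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16kHz
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, frame_len)]
    return [is_speech(f, threshold) for f in frames]
```

Note that RMS energy only separates loud from quiet, which is why (as mentioned below) this filters silence but not noise.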
Audio Buffering
- AudioBuffer: Accumulates audio chunks
- Configurable chunk duration (default: 1.0s)
- Minimum interval between transcriptions (0.5s)
- Handles silence timeouts (3.0s)
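A sketch of how such a buffer might look; the class interface is an assumption, but the defaults (1.0s chunk, 0.5s minimum interval, 3.0s silence timeout) are the ones listed above:

```python
import time
import numpy as np

class AudioBuffer:
    """Accumulates audio chunks and signals when enough is buffered to transcribe."""

    def __init__(self, sample_rate=16000, chunk_duration=1.0,
                 min_interval=0.5, silence_timeout=3.0):
        self.chunk_samples = int(sample_rate * chunk_duration)
        self.min_interval = min_interval
        self.silence_timeout = silence_timeout
        self._chunks: list[np.ndarray] = []
        self._last_transcription = 0.0
        self._last_audio = time.monotonic()

    def add(self, samples: np.ndarray) -> None:
        self._chunks.append(samples)
        self._last_audio = time.monotonic()

    def ready(self) -> bool:
        """True when >= chunk_duration of audio is buffered and min_interval has passed."""
        buffered = sum(len(c) for c in self._chunks)
        return (buffered >= self.chunk_samples
                and time.monotonic() - self._last_transcription >= self.min_interval)

    def silent_too_long(self) -> bool:
        return time.monotonic() - self._last_audio >= self.silence_timeout

    def flush(self) -> np.ndarray:
        """Return all buffered audio and reset the buffer."""
        audio = np.concatenate(self._chunks) if self._chunks else np.empty(0, np.float32)
        self._chunks.clear()
        self._last_transcription = time.monotonic()
        return audio
```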
Duplicate Prevention
- Compares each new transcription with the previous one
- Prevents sending identical text multiple times
- Simple substring matching (can be enhanced)
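The substring-matching approach could look like this (a sketch with an assumed interface, not the project's actual code):

```python
class DuplicateFilter:
    """Suppress transcriptions that repeat or are contained in the previous one."""

    def __init__(self):
        self._previous = ""

    def should_emit(self, text: str) -> bool:
        cleaned = text.strip().lower()
        if not cleaned or cleaned in self._previous:   # empty, identical, or substring
            return False
        self._previous = cleaned
        return True
```

As noted, this is easy to enhance, e.g. with edit-distance or token-overlap thresholds for near-duplicates.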
Things to Consider
Performance
Model Loading: Whisper models are loaded per connection (lazy loading)
- Consider model caching/pooling for production
- Larger models (medium/large) are more accurate but slower
Chunk Size: Balance between latency and accuracy
- Smaller chunks = lower latency but less context
- Larger chunks = better accuracy but higher latency
Concurrent Connections: Each connection loads its own model
- Consider shared model instances for multiple connections
- Monitor memory usage with many concurrent calls
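One way to share model instances across connections is a process-wide pool with lazy, lock-protected loading. This is a sketch: `ModelPool` is a hypothetical name, and in this project `load_fn` would be something like `whisper.load_model`:

```python
import threading
from typing import Any, Callable

class ModelPool:
    """Cache so concurrent connections share one model instance per model name."""

    def __init__(self, load_fn: Callable[[str], Any]):
        self._load_fn = load_fn
        self._models: dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, name: str = "base") -> Any:
        with self._lock:                       # load each model at most once
            if name not in self._models:
                self._models[name] = self._load_fn(name)
            return self._models[name]
```

The lock is held during loading so two connections arriving at once do not both load a multi-GB model; if load time matters, a per-name lock would reduce contention.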
Audio Quality
Sample Rate: Whisper works best with 16kHz
- Telephony audio (8kHz) must be upsampled
- Quality may be reduced compared to native 16kHz
Noise: Telephony audio often has background noise
- Consider noise reduction preprocessing
- VAD helps filter silence but not noise
Format Conversion: μ-law to PCM conversion may introduce artifacts
- Test with real telephony audio
- Consider alternative conversion methods if quality is poor
Stability & Reliability
Disconnections: Handled gracefully in Step 5
- Final transcription on remaining buffer
- Session cleanup on disconnect
Error Handling: Comprehensive error catching
- Logs errors per call
- Continues processing on individual failures
Logging: Per-call logging in Step 5
- Logs stored in logs/stt.log
- Includes call_id for tracking
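A per-call logger along these lines can be built with the stdlib's `LoggerAdapter`; the function name and log format are assumptions, but the `logs/stt.log` path and `call_id` field match the notes above:

```python
import logging
import os

def get_call_logger(call_id: str) -> logging.LoggerAdapter:
    """Logger that stamps every record with the call_id, writing to logs/stt.log."""
    logger = logging.getLogger("stt")
    if not logger.handlers:                  # configure the shared logger once
        os.makedirs("logs", exist_ok=True)
        handler = logging.FileHandler("logs/stt.log")
        handler.setFormatter(logging.Formatter(
            "%(asctime)s [%(call_id)s] %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logging.LoggerAdapter(logger, {"call_id": call_id})
```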
Scaling Considerations
Model Memory: Whisper models are large (base ~150MB, large ~3GB)
- Consider GPU acceleration for faster inference
- Model quantization for reduced memory
API Rate Limiting: No rate limiting implemented
- Add rate limiting for production
- Consider request queuing
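A token-bucket limiter is one common way to add per-client rate limiting; this is a sketch with assumed parameters, not something present in the codebase:

```python
import time

class RateLimiter:
    """Token bucket: allow up to `rate` requests/second per client, with bursts."""

    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self._buckets: dict[str, tuple[float, float]] = {}  # id -> (tokens, last_ts)

    def allow(self, client_id: str) -> bool:
        tokens, last = self._buckets.get(client_id, (float(self.burst), time.monotonic()))
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        if tokens >= 1.0:
            self._buckets[client_id] = (tokens - 1.0, now)
            return True
        self._buckets[client_id] = (tokens, now)
        return False
```

A rejected request would map to HTTP 429 on the upload API, or to closing/throttling the WebSocket on the streaming path.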
Database: No persistent storage
- Add database for call transcripts
- Store session metadata
Load Balancing: Single server implementation
- Consider multiple workers/instances
- Use message queue for audio processing
Security
Authentication: No authentication implemented
- Add API keys/tokens
- WebSocket authentication
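A minimal key check could look like this; `STT_API_KEY` is an assumed environment variable name, and `hmac.compare_digest` keeps the comparison constant-time. For WebSockets the key would typically arrive as a query parameter or first message, since browsers cannot set custom headers on the handshake:

```python
import hmac
import os

def verify_api_key(provided: str) -> bool:
    """Constant-time check of a client-supplied key against the configured one."""
    expected = os.environ.get("STT_API_KEY", "")
    return bool(expected) and hmac.compare_digest(provided, expected)
```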
Input Validation: Basic validation
- Validate audio format/size
- Rate limit per client
Data Privacy: Transcripts logged
- Consider encryption for sensitive data
- Implement data retention policies
Testing Recommendations
- Unit Tests: Test audio conversion functions
- Integration Tests: Test WebSocket streaming with real audio
- Load Tests: Test with multiple concurrent connections
- Telephony Tests: Test with actual Twilio/Exotel audio streams
Future Enhancements
- Better VAD: Use more sophisticated VAD (e.g., WebRTC VAD)
- Streaming Model: Use streaming-capable models for lower latency
- Language Detection: Auto-detect language
- Speaker Diarization: Identify different speakers
- Punctuation: Better punctuation in transcripts
- Timestamping: Word-level timestamps
- Confidence Scores: Return confidence scores per word