---
title: Voice-to-Voice Translator
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app/main.py
pinned: false
---
# Voice-to-Voice Translator
A real-time voice-to-voice translation system using WebSocket connections for low-latency audio streaming and translation between multiple languages.
## Features
- Real-time Audio Streaming: WebSocket-based bidirectional audio communication
- Multi-language Support: English and Hindi with extensible language support
- Low-Latency Pipeline: Optimized STT → Translation → TTS pipeline
- Room-based Architecture: Support for multiple concurrent translation sessions
- Offline Capable: Uses local models (Vosk, Argos, Coqui TTS)
- Scalable Design: Worker-based architecture for handling concurrent users
## Architecture
The system consists of several key components:
- WebSocket Server: Manages real-time connections and audio streaming
- Speech-to-Text (STT): Vosk-based speech recognition
- Translation Engine: Argos Translate for language translation
- Text-to-Speech (TTS): Coqui TTS for natural voice synthesis
- Room Manager: Handles multi-user session management
- Pipeline Manager: Orchestrates the complete translation flow
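The overall flow the Pipeline Manager orchestrates can be sketched as follows. This is an illustrative sketch only: the class and function names (`PipelineManager`, `process`) and the stub components are hypothetical stand-ins for the project's actual Vosk, Argos Translate, and Coqui TTS integrations.

```python
# Hypothetical sketch of the STT -> Translation -> TTS flow; component
# names are illustrative, not the project's actual classes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class PipelineManager:
    stt: Callable[[bytes], str]                 # audio bytes -> recognized text
    translate: Callable[[str, str, str], str]   # text, src, tgt -> translated text
    tts: Callable[[str], bytes]                 # text -> synthesized audio bytes

    def process(self, audio: bytes, source_lang: str, target_lang: str) -> bytes:
        text = self.stt(audio)
        translated = self.translate(text, source_lang, target_lang)
        return self.tts(translated)

# Stub components standing in for Vosk (STT), Argos Translate, and Coqui TTS.
pipeline = PipelineManager(
    stt=lambda audio: "hello",
    translate=lambda text, src, tgt: f"[{src}->{tgt}] {text}",
    tts=lambda text: text.encode("utf-8"),
)
result = pipeline.process(b"\x00\x01", "en", "hi")
print(result)  # b'[en->hi] hello'
```

In the real system each stage would run against a loaded model, and the Room Manager would route the resulting audio back to the other participants in the session.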
## Prerequisites
- Python 3.9+
- 4GB RAM minimum (8GB recommended)
- 5GB disk space for models
- Linux/Windows/macOS
## Quick Start
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd voice-to-voice-translator
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Download models

   ```bash
   python scripts/download_models.py
   ```

4. Configure environment

   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```

5. Run the server

   ```bash
   python app/main.py
   ```
The server will start on ws://localhost:8000 by default.
## Configuration
Key configuration options in .env:
- `HOST`: Server host (default: `0.0.0.0`)
- `PORT`: Server port (default: `8000`)
- `LOG_LEVEL`: Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`)
- `MAX_CONNECTIONS`: Maximum concurrent connections
- `AUDIO_SAMPLE_RATE`: Audio sample rate (default: `16000`)
- `AUDIO_CHUNK_SIZE`: Audio chunk size in bytes
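A loader for these settings could look like the sketch below. The defaults for `HOST`, `PORT`, and `AUDIO_SAMPLE_RATE` come from the list above; the fallback values for `MAX_CONNECTIONS` and `AUDIO_CHUNK_SIZE` are assumptions, since the README does not state their defaults.

```python
# Illustrative settings loader; the project's actual config module may differ.
import os

def load_settings() -> dict:
    return {
        "host": os.getenv("HOST", "0.0.0.0"),
        "port": int(os.getenv("PORT", "8000")),
        "log_level": os.getenv("LOG_LEVEL", "INFO"),
        # Fallbacks below are assumed, not documented defaults.
        "max_connections": int(os.getenv("MAX_CONNECTIONS", "100")),
        "audio_sample_rate": int(os.getenv("AUDIO_SAMPLE_RATE", "16000")),
        "audio_chunk_size": int(os.getenv("AUDIO_CHUNK_SIZE", "4096")),
    }

settings = load_settings()
```

Reading everything through `os.getenv` keeps the `.env` file authoritative while still allowing per-deployment overrides from the environment.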
## WebSocket Protocol
Clients connect to the WebSocket endpoint and exchange JSON messages:
```json
{
  "type": "join_room",
  "room_id": "room123",
  "user_id": "user1",
  "source_lang": "en",
  "target_lang": "hi"
}
```
See docs/websocket-protocol.md for complete protocol documentation.
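A client would serialize the message shown above before sending it over the socket. The helper below is a hypothetical sketch: the field names follow the example message, but the complete set of message types and fields is defined in `docs/websocket-protocol.md`.

```python
# Hypothetical builder for the join_room message shown above.
import json

def make_join_room(room_id: str, user_id: str,
                   source_lang: str, target_lang: str) -> str:
    """Serialize a join_room message as the JSON text frame a client sends."""
    return json.dumps({
        "type": "join_room",
        "room_id": room_id,
        "user_id": user_id,
        "source_lang": source_lang,
        "target_lang": target_lang,
    })

msg = make_join_room("room123", "user1", "en", "hi")
parsed = json.loads(msg)  # round-trips back to the original fields
```

With a WebSocket client library, `msg` would then be passed to the connection's send call after connecting to `ws://localhost:8000/ws`.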
## API Endpoints
- `ws://host:port/ws`: Main WebSocket endpoint
- `http://host:port/health`: Health check endpoint
- `http://host:port/metrics`: Metrics endpoint (optional)
## Testing
```bash
# Run all tests
pytest tests/

# Run specific test
pytest tests/test_stt.py

# Run with coverage
pytest --cov=app tests/
```
## Docker Deployment
```bash
# Build image
docker build -f docker/Dockerfile -t voice-translator .

# Run with docker-compose
docker-compose -f docker/docker-compose.yml up
```
## Performance
- STT Latency: ~100-200ms
- Translation Latency: ~50-100ms
- TTS Latency: ~200-300ms
- Total End-to-End Latency: ~500ms (target)
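Summing the per-stage figures above gives a quick sanity check on the ~500 ms target: the best-case total leaves headroom for network transfer and buffering, while the worst-case total already exceeds the target, so the budget assumes the stages usually run closer to their lower bounds.

```python
# Latency budget from the per-stage figures above (milliseconds).
lower_ms = 100 + 50 + 200   # STT + Translation + TTS, best case
upper_ms = 200 + 100 + 300  # STT + Translation + TTS, worst case
target_ms = 500

headroom_ms = target_ms - lower_ms  # budget left for network/buffering, best case
print(lower_ms, upper_ms, headroom_ms)  # 350 600 150
```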
## Project Structure
```
voice-to-voice-translator/
├── app/                  # Application code
│   ├── main.py           # Entry point
│   ├── config/           # Configuration
│   ├── server/           # WebSocket server
│   ├── rooms/            # Room management
│   ├── pipeline/         # STT, Translation, TTS
│   ├── audio/            # Audio processing
│   ├── messaging/        # WebSocket messages
│   ├── security/         # Auth and rate limiting
│   ├── workers/          # Background workers
│   └── utils/            # Utilities
├── models/               # ML models storage
├── scripts/              # Utility scripts
├── tests/                # Test suite
└── docs/                 # Documentation
```
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
## License
MIT License - see LICENSE file for details
## Support
For issues and questions:
- GitHub Issues: /issues
- Documentation: docs/
## Roadmap
- Add more language pairs
- Implement GPU acceleration
- Add speaker diarization
- Web-based client interface
- Mobile app support
- Cloud deployment guides