Roni Egbu committed · Commit 95ee228 · Parent(s): 9849826

feat: Revise README.md for improved clarity and structure, update features and architecture sections

README.md CHANGED
@@ -1,109 +1,268 @@
- ## 🏗️
- └── setup_models.sh   # Automation script to build/download models
- 2. **VAD Analysis:** The server analyzes incoming chunks. When ~1.2s of silence is detected, it triggers the pipeline.
- 3. **The Pipeline:**
-    * **ASR:** `whisper.cpp` converts the buffered PCM into English text.
-    * **MT:** `Argos` translates English text to French.
-    * **TTS:** `Piper` generates a French `.wav` file from the translated text.
---
title: LinguaCall Backend
emoji: 🌍
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# 🌍 LinguaCall — Real-Time Voice Translation Backend

LinguaCall is a real-time, bi-directional voice translation system that acts as a seamless language bridge between two users. It automatically detects speech, transcribes it, translates it, and plays synthesized audio to the other person in their native language — all with no push-to-talk required.

---

## ✨ Features

- **Hands-Free VAD** — Voice Activity Detection automatically triggers the pipeline when you stop speaking (~1.2s of silence). No button needed.
- **Bi-Directional Translation** — Both users speak in their own language simultaneously. Each hears the other translated in real time.
- **5 Supported Languages** — English, French, German, Spanish, and Chinese, in any combination.
- **Synchronized Captions** — Translated text appears in sync with audio playback, not when processing finishes.
- **Hallucination Filtering** — ASR output is filtered for Whisper hallucinations (loops, silence artifacts, subtitle injections) before translation.
- **Drop-on-Busy** — If a pipeline is already running for a user, new segments are dropped rather than queued, preventing cascading lag.
- **Fully Offline-Capable Stack** — All AI components run locally with no external API calls.
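
The drop-on-busy behavior can be sketched with a per-user flag. This is an illustrative sketch only — `busy`, `try_start_pipeline`, and `finish_pipeline` are hypothetical names, not the backend's actual bookkeeping:

```python
# Illustrative drop-on-busy gate: at most one in-flight pipeline per user;
# new segments are dropped (not queued) while one is running.
busy: dict[str, bool] = {}

def try_start_pipeline(user_id: str) -> bool:
    """Return True if this segment should be processed, False if dropped."""
    if busy.get(user_id):
        return False  # a pipeline is already running for this user: drop
    busy[user_id] = True
    return True

def finish_pipeline(user_id: str) -> None:
    """Mark the user's pipeline as finished so the next segment can run."""
    busy[user_id] = False
```

Dropping instead of queuing means a slow translation never builds a backlog: the user's next pause simply produces a fresh segment.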

---

## 🏗️ Architecture

```
Browser (React Frontend)
        │
        │  WebSocket (PCM16 audio @ 16kHz)
        ▼
FastAPI Backend ─────────────────────────────────────────────
        │
        ├── StreamingVAD              # RMS energy-based speech detection
        │
        └── TranslationPipeline
              ├── WhisperASR          # faster-whisper (base, CPU int8)
              ├── ArgosTranslator     # Argos Translate (offline NMT)
              └── PiperTTS            # Piper ONNX (persistent subprocess)
```

### WebSocket Protocol

```
Client → Server: JSON handshake { "native_lang": "en" }
Client → Server: Binary PCM16 chunks (4096 samples @ 16kHz, ~256ms each)
Client → Server: JSON ping { "type": "ping" }

Server → Client: JSON { "type": "connected", "user_id": "...", "room": "..." }
Server → Client: JSON { "type": "peer_joined", "peer_id": "..." }
Server → Client: JSON { "type": "peer_left", "peer_id": "..." }
Server → Client: Binary framed [4-byte JSON length][JSON metadata][WAV bytes]
                 metadata: { "type": "audio_with_caption", "text": "...", "original": "..." }
```
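
The framed binary message can be unpacked in a few lines. A minimal sketch, assuming the 4-byte length prefix is a big-endian unsigned int (check the backend's framing code for the actual byte order; the sample WAV bytes below are placeholders):

```python
import json
import struct

def parse_framed_message(payload: bytes):
    """Split [4-byte JSON length][JSON metadata][WAV bytes] into its parts."""
    (meta_len,) = struct.unpack(">I", payload[:4])  # big-endian assumed
    metadata = json.loads(payload[4 : 4 + meta_len])
    wav_bytes = payload[4 + meta_len :]
    return metadata, wav_bytes

# Build a frame the same way, then round-trip it.
meta = {"type": "audio_with_caption", "text": "Bonjour", "original": "Hello"}
meta_json = json.dumps(meta).encode("utf-8")
frame = struct.pack(">I", len(meta_json)) + meta_json + b"RIFF....WAVEfmt "

parsed_meta, wav = parse_framed_message(frame)
```

The length prefix lets the client split metadata from audio without scanning for a delimiter, so arbitrary caption text never collides with the WAV payload.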

---

## 🤖 AI Stack

| Component | Library | Model | Notes |
|-----------|---------|-------|-------|
| ASR | [faster-whisper](https://github.com/SYSTRAN/faster-whisper) | `base` (CPU int8) | ~2–4s on modern CPU |
| MT | [Argos Translate](https://github.com/argosopentech/argos-translate) | Offline NMT packages | Direct pairs + English pivot |
| TTS | [Piper](https://github.com/rhasspy/piper) | Low-quality ONNX voices | Persistent subprocess per language |

### Supported Language Pairs

Direct Argos packages are installed for all available combinations. Pairs without a direct package (e.g. `zh→de`) automatically pivot through English at runtime.

|        | EN | FR | DE | ES | ZH |
|--------|----|----|----|----|----|
| **EN** | —  | ✅ | ✅ | ✅ | ✅ |
| **FR** | ✅ | —  | ✅ | ✅ | → EN |
| **DE** | ✅ | ✅ | —  | ✅ | → EN |
| **ES** | ✅ | ✅ | ✅ | —  | → EN |
| **ZH** | ✅ | → EN | → EN | → EN | —  |
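
The English-pivot fallback can be sketched as follows. `direct_pairs` and `translate_direct` are hypothetical stand-ins for the installed Argos packages and their direct-pair translate call, not the project's actual wrapper:

```python
# Pairs with a direct Argos package installed (mirrors the table above).
direct_pairs = {
    ("en", "fr"), ("fr", "en"), ("en", "de"), ("de", "en"),
    ("en", "es"), ("es", "en"), ("en", "zh"), ("zh", "en"),
    ("fr", "de"), ("de", "fr"), ("fr", "es"), ("es", "fr"),
    ("de", "es"), ("es", "de"),
}

def translate_direct(text: str, src: str, dst: str) -> str:
    # Placeholder: a real implementation would invoke the Argos package here.
    return f"[{src}->{dst}] {text}"

def translate(text: str, src: str, dst: str) -> str:
    """Use a direct package when installed; otherwise pivot through English."""
    if (src, dst) in direct_pairs:
        return translate_direct(text, src, dst)
    # No direct package (e.g. zh->de): go src -> en -> dst.
    return translate_direct(translate_direct(text, src, "en"), "en", dst)
```

Pivoting doubles the translation work for pairs like `zh→de`, which is why direct packages are preferred whenever one exists.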

---

## 📡 API Reference

### `GET /health`

Returns server status and active rooms.

```json
{
  "status": "ok",
  "rooms": {
    "room1": ["user_abc123", "user_def456"]
  }
}
```

### `GET /rooms/{room_id}`

Checks whether a room exists and how many users are in it.

```json
{
  "exists": true,
  "room_id": "room1",
  "occupants": 1
}
```

### `WS /ws/call/{room_id}/{user_id}`

Main WebSocket endpoint. See the protocol above.

- Rooms support a maximum of **2 users**
- Attempting to join a full room returns a `4003` close code
- Each user must send a handshake JSON within **10 seconds** of connecting
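
The room-capacity rule can be sketched as below. The names are illustrative, not the backend's actual handler; only the `2`-user limit and `4003` close code come from the endpoint description:

```python
MAX_USERS_PER_ROOM = 2
FULL_ROOM_CLOSE_CODE = 4003

rooms: dict[str, set[str]] = {}

def try_join(room_id: str, user_id: str):
    """Return None on success, or the WebSocket close code to reject with."""
    occupants = rooms.setdefault(room_id, set())
    if len(occupants) >= MAX_USERS_PER_ROOM:
        return FULL_ROOM_CLOSE_CODE  # caller closes the socket with this code
    occupants.add(user_id)
    return None
```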

---

## 🛠️ Local Development

### Prerequisites

```bash
sudo apt install ffmpeg build-essential cmake
```

### 1. Download Models

```bash
chmod +x scripts/models_setup.sh
./scripts/models_setup.sh
```

This downloads and compiles whisper.cpp, downloads Piper voice models, and installs Argos language packs.

### 2. Python Environment

```bash
cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### 3. Run the Backend

```bash
# from project root
python3 -m app.main
```

Wait for: `✅ System Ready!`

### 4. Run the Test Frontend

```bash
cd frontend
python3 -m http.server 3000
```

Open `http://localhost:3000` in two tabs, set different languages and the same Room ID, and speak.

---

## 🐳 Docker

### Build & Run

```bash
docker build -t linguacall-backend .
docker run -p 7860:7860 linguacall-backend
```

### Docker Compose (local dev with frontend)

```bash
docker-compose up --build
```

---

## ☁️ Deployment (Hugging Face Spaces)

This repo is configured for direct deployment to Hugging Face Spaces using the Docker SDK.

1. Create a new Space → SDK: **Docker**
2. Add this repo as a remote and push:
   ```bash
   git remote add space https://huggingface.co/spaces/YOUR_USERNAME/linguacall-backend
   git push space main
   ```
3. Monitor the build in the **Logs** tab (~10–15 min for the first build)
4. Once live, your backend is at:
   ```
   https://YOUR_USERNAME-linguacall-backend.hf.space
   ```
5. Connect your frontend using:
   ```
   wss://YOUR_USERNAME-linguacall-backend.hf.space
   ```

---

## 📁 Project Structure

```
linguacall-backend/
├── Dockerfile               # Production build (HF Spaces / any Docker host)
├── docker-compose.yml       # Local dev with nginx frontend
├── backend/
│   ├── requirements.txt
│   └── app/
│       ├── main.py          # FastAPI app, WebSocket handler, room management
│       ├── logger.py        # Rotating file + console logger
│       ├── models/
│       │   ├── asr_model.py # WhisperASR + hallucination filtering
│       │   ├── mt_model.py  # ArgosTranslator
│       │   └── tts_model.py # PiperTTS with persistent subprocesses
│       └── services/
│           ├── translation_pipeline.py # ASR → MT → TTS orchestration
│           └── vad_processor.py        # RMS energy VAD with adaptive thresholding
├── frontend/
│   └── index.html           # Standalone test client (not for production)
├── scripts/
│   └── models_setup.sh      # One-shot model download + compile script
└── models/                  # AI models (git-ignored, built at Docker build time)
    ├── asr/
    ├── mt/
    └── tts/
```

---

## ⚙️ Configuration

Key parameters in `main.py` and `vad_processor.py` you may want to tune:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `silence_threshold` | `1.2s` | Silence duration before triggering the pipeline |
| `min_speech_duration` | `0.8s` | Minimum speech length to process (filters short blips) |
| `energy_threshold` | `800.0` | RMS energy threshold for speech detection |
| `max_speech_duration` | `15.0s` | Hard cap before force-processing |
| `MAX_USERS_PER_ROOM` | `2` | Max concurrent users per room |
| Whisper model size | `base` | Change to `tiny` if RAM-constrained |
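
The RMS-energy check underlying the VAD can be sketched as follows. This is illustrative only — the project's `vad_processor.py` also does adaptive thresholding; the `800.0` default comes from the table above:

```python
import array
import math

ENERGY_THRESHOLD = 800.0  # default from the configuration table

def rms_energy(pcm16_chunk: bytes) -> float:
    """RMS amplitude of a little-endian PCM16 chunk."""
    samples = array.array("h")  # signed 16-bit samples
    samples.frombytes(pcm16_chunk)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(pcm16_chunk: bytes) -> bool:
    return rms_energy(pcm16_chunk) > ENERGY_THRESHOLD

# A loud constant tone vs. digital silence (4096 samples, as in the protocol):
loud = array.array("h", [4000] * 4096).tobytes()
quiet = bytes(2 * 4096)
```

Chunks above the threshold reset the silence timer; once `silence_threshold` seconds pass below it, the buffered speech is handed to the pipeline.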

---

## 🔧 Troubleshooting

**OOM crash on startup (Hugging Face Spaces)**
Switch Whisper to `tiny` in `backend/app/services/translation_pipeline.py`:
```python
self.asr = WhisperASR(model_size="tiny")
```

**VAD not triggering / triggering too often**
Watch the `[VAD] avg=... peak=...` logs and adjust `energy_threshold` in `main.py`. Typical values: quiet mic ~250–400, normal room ~600–1000, noisy environment ~1200+.

**Piper TTS producing no audio**
Check that `.onnx` and `.onnx.json` files are both present in `models/tts/`. Both files are required per voice.

**WebSocket disconnecting on Hugging Face**
HF Spaces has a proxy timeout. The frontend sends a ping every 25s to keep the connection alive — make sure your frontend implements this keepalive.
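
The 25-second keepalive is easy to reproduce in any client. A sketch, where `send` stands in for whatever async call your frontend or client uses to transmit a WebSocket text message:

```python
import asyncio
import json

PING_INTERVAL = 25.0  # seconds, matching the keepalive described above

async def keepalive(send, interval=PING_INTERVAL, count=None):
    """Send {"type": "ping"} every `interval` seconds.

    `send` is any async callable taking a text message; `count` limits the
    number of pings (None = run until cancelled).
    """
    sent = 0
    while count is None or sent < count:
        await send(json.dumps({"type": "ping"}))
        sent += 1
        if count is None or sent < count:
            await asyncio.sleep(interval)
```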

---

## 📄 License

MIT — see [LICENSE](LICENSE) for details.