Roni Egbu committed on
Commit 95ee228 · 1 Parent(s): 9849826

feat: Revise README.md for improved clarity and structure, update features and architecture sections

Files changed (1): README.md (+214 −55)

README.md CHANGED
@@ -1,109 +1,268 @@
- # 🌐 Voice Translation Bridge

- A real-time, bi-directional voice translation system designed to act as a seamless "language bridge" between two users. It detects speech automatically, translates it, and plays the synthesized voice to the peer in their native language.

  ## ✨ Features

- * **Hands-Free VAD:** Automatic Voice Activity Detection (VAD) detects when you stop speaking — no "Push-to-Talk" needed.
- * **Intelligent Routing:** Automatically identifies the peer in the room and routes the translated audio to them.
- * **Full Offline-Capable Stack:**
-   * **ASR:** [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) (Tiny model) for high-speed speech-to-text.
-   * **MT:** [Argos Translate](https://github.com/argosopentech/argos-translate) for open-source, offline Neural Machine Translation.
-   * **TTS:** [Piper](https://github.com/rhasspy/piper) (ONNX-based) for near-instant, human-like synthesized speech.
- * **PCM Streaming:** Low-latency raw audio streaming via the Web Audio API.

  ---

- ## 🏗️ Project Structure

- ```text
- voice-translation-app/
- ├── backend/
- │   ├── app/
- │   │   ├── services/        # VAD processing & Translation Pipeline
- │   │   └── main.py          # FastAPI WebSocket Server
- │   └── requirements.txt     # Python Dependencies
- ├── frontend/
- │   └── index.html           # Web Interface (PCM Audio Logic)
- ├── models/                  # AI Models (Stored locally)
- │   ├── asr/                 # whisper.cpp source + tiny model
- │   ├── mt/                  # argos-translate language packs
- │   └── tts/                 # .onnx + .json voice models
- └── scripts/
-     └── setup_models.sh      # Automation script to build/download models
  ```

  ---

- ## 🛠️ Setup Instructions

- ### 1. System Dependencies

- The backend requires **FFmpeg** for final audio containerization and **CMake/build-essential** to compile the ASR engine.

- ```bash
- sudo apt update && sudo apt install ffmpeg build-essential cmake
  ```

- ### 2. Automated Model Setup

- We provide a setup script that clones `whisper.cpp`, compiles it, downloads the Piper voices, and installs Argos language packs.

  ```bash
- chmod +x setup_models.sh
- ./setup_models.sh
  ```

- ### 3. Python Environment

  ```bash
  cd backend
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```

  ---

- ## 🚀 Running the App

- ### 1. Start the Backend

  ```bash
- # From the backend directory
- python3 app/main.py
  ```

- Wait for: `✅ Models Warm. System Ready on Port 8000.`

- ### 2. Launch the Frontend

- Serve the frontend using a simple server:

  ```bash
- cd frontend
- python3 -m http.server 3000
  ```

- 1. Open `http://localhost:3000` in **Tab 1** (set to English).
- 2. Open `http://localhost:3000` in **Tab 2** (set to French).
- 3. Speak naturally in Tab 1; the translated audio will play automatically in Tab 2.

  ---

- ## 🔄 How the "Bridge" Works

- 1. **PCM Streaming:** The browser captures audio at 16kHz and sends raw PCM bytes via WebSocket.
- 2. **VAD Analysis:** The server analyzes incoming chunks. When ~1.2s of silence is detected, it triggers the pipeline.
- 3. **The Pipeline:**
-    * **ASR:** `whisper.cpp` converts the buffered PCM into English text.
-    * **MT:** `Argos` translates English text to French.
-    * **TTS:** `Piper` generates a French `.wav` file from the translated text.
- 4. **Targeted Delivery:** The server identifies the peer in the room and sends the `.wav` bytes + JSON captions to **only** that user.
+ ---
+ title: LinguaCall Backend
+ emoji: 🌐
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker
+ pinned: false
+ ---
+
+ # 🌐 LinguaCall — Real-Time Voice Translation Backend

+ LinguaCall is a real-time, bi-directional voice translation system that acts as a seamless language bridge between two users. It automatically detects speech, transcribes it, translates it, and plays synthesized audio to the other person in their native language — all with no push-to-talk required.
+
+ ---

  ## ✨ Features

+ - **Hands-Free VAD** — Voice Activity Detection automatically triggers the pipeline when you stop speaking (~1.2s silence). No button needed.
+ - **Bi-Directional Translation** — Both users speak in their own language simultaneously. Each hears the other translated in real time.
+ - **5 Supported Languages** — English, French, German, Spanish, and Chinese in any combination.
+ - **Synchronized Captions** — Translated text appears in sync with audio playback, not when processing finishes.
+ - **Hallucination Filtering** — ASR output is filtered for Whisper hallucinations (loops, silence artifacts, subtitle injections) before translation.
+ - **Drop-on-Busy** — If a pipeline is already running for a user, new segments are dropped rather than queued, preventing cascading lag.
+ - **Fully Offline-Capable Stack** — All AI components run locally with no external API calls.
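The hallucination filtering described above can be sketched with a few heuristics. This is an illustrative Python sketch only: the function name, the junk-phrase list, and the repetition cutoff are assumptions, not the project's actual rules (the real filter lives in `backend/app/models/asr_model.py`).

```python
# Illustrative heuristics for filtering Whisper hallucinations.
# JUNK_PHRASES and the 0.34 repetition cutoff are assumed values.
JUNK_PHRASES = {
    "thank you",
    "thanks for watching",
    "subtitles by the amaraorg community",
}

def looks_hallucinated(text: str) -> bool:
    """Return True if an ASR segment should be dropped before translation."""
    cleaned = text.strip().lower()
    if not cleaned:
        return True  # silence artifact: empty transcript
    # Known phrases Whisper injects on silence or music.
    if cleaned.strip(".!") in JUNK_PHRASES:
        return True
    # Repetition loop: very few distinct words in a long segment.
    words = cleaned.split()
    if len(words) >= 6 and len(set(words)) / len(words) < 0.34:
        return True
    return False
```

A segment that passes the filter continues to the MT stage; anything caught here is silently discarded.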

  ---

+ ## 🏗️ Architecture

+ ```
+ Browser (React Frontend)
+         │
+         │  WebSocket (PCM16 audio @ 16kHz)
+         ▼
+ FastAPI Backend ──────────────────────────────────────────────
+   │
+   ├── StreamingVAD            # RMS energy-based speech detection
+   │
+   └── TranslationPipeline
+         ├── WhisperASR        # faster-whisper (base, CPU int8)
+         ├── ArgosTranslator   # Argos Translate (offline NMT)
+         └── PiperTTS          # Piper ONNX (persistent subprocess)
+ ```
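The ASR → MT → TTS flow in the diagram can be sketched as a small orchestrator. This is a minimal illustration with stub stages; the class shape and method names are assumptions, while the real orchestration lives in `backend/app/services/translation_pipeline.py`.

```python
# Minimal sketch of the pipeline stages wired in sequence.
class TranslationPipeline:
    def __init__(self, asr, mt, tts):
        self.asr = asr   # speech -> text
        self.mt = mt     # text -> translated text
        self.tts = tts   # text -> audio bytes

    def run(self, pcm_bytes: bytes, src: str, dst: str):
        text = self.asr(pcm_bytes, src)
        translated = self.mt(text, src, dst)
        wav = self.tts(translated, dst)
        # The caller sends `wav` plus the captions to the peer.
        return wav, {"text": translated, "original": text}

# Stub stages to show the data flow end to end:
pipeline = TranslationPipeline(
    asr=lambda pcm, src: "hello",
    mt=lambda text, src, dst: {"fr": "bonjour"}[dst],
    tts=lambda text, dst: text.encode(),
)
wav, caption = pipeline.run(b"\x00\x00", "en", "fr")
```

Keeping the three stages behind plain callables makes it easy to swap the `base` Whisper model for `tiny`, or a Piper voice per language, without touching the orchestration.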
 
+ ### WebSocket Protocol
+
+ ```
+ Client → Server: JSON handshake  { "native_lang": "en" }
+ Client → Server: Binary PCM16 chunks (4096 samples @ 16kHz, ~256ms each)
+ Client → Server: JSON ping       { "type": "ping" }
+
+ Server → Client: JSON { "type": "connected", "user_id": "...", "room": "..." }
+ Server → Client: JSON { "type": "peer_joined", "peer_id": "..." }
+ Server → Client: JSON { "type": "peer_left", "peer_id": "..." }
+ Server → Client: Binary framed  [4-byte JSON length][JSON metadata][WAV bytes]
+     metadata: { "type": "audio_with_caption", "text": "...", "original": "..." }
  ```
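The framed binary server-to-client message above can be packed and parsed with `struct`. A minimal sketch: the big-endian byte order and the function names are assumptions, so verify the byte order against the frontend's decoding code.

```python
import json
import struct

def pack_frame(metadata: dict, wav: bytes) -> bytes:
    """Build [4-byte JSON length][JSON metadata][WAV bytes]."""
    meta = json.dumps(metadata).encode("utf-8")
    return struct.pack(">I", len(meta)) + meta + wav

def unpack_frame(frame: bytes):
    """Split a frame back into (metadata dict, WAV bytes)."""
    (meta_len,) = struct.unpack_from(">I", frame, 0)
    metadata = json.loads(frame[4:4 + meta_len])
    wav = frame[4 + meta_len:]
    return metadata, wav

frame = pack_frame(
    {"type": "audio_with_caption", "text": "bonjour", "original": "hello"},
    b"RIFF....WAVE",
)
meta, wav = unpack_frame(frame)
```

The length prefix lets the client slice the JSON captions out of a single binary WebSocket message without a second round trip for the audio.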

  ---

+ ## 🤖 AI Stack

+ | Component | Library | Model | Notes |
+ |-----------|---------|-------|-------|
+ | ASR | [faster-whisper](https://github.com/SYSTRAN/faster-whisper) | `base` (CPU int8) | ~2–4s on a modern CPU |
+ | MT  | [Argos Translate](https://github.com/argosopentech/argos-translate) | Offline NMT packages | Direct pairs + English pivot |
+ | TTS | [Piper](https://github.com/rhasspy/piper) | Low-quality ONNX voices | Persistent subprocess per language |

+ ### Supported Language Pairs

+ Direct Argos packages are installed for all available combinations. Pairs without a direct package (e.g. `zh↔de`) automatically pivot through English at runtime.
+
+ |        | EN | FR | DE | ES | ZH |
+ |--------|----|----|----|----|------|
+ | **EN** | —  | ✅ | ✅ | ✅ | ✅   |
+ | **FR** | ✅ | —  | ✅ | ✅ | → EN |
+ | **DE** | ✅ | ✅ | —  | ✅ | → EN |
+ | **ES** | ✅ | ✅ | ✅ | —  | → EN |
+ | **ZH** | ✅ | → EN | → EN | → EN | — |
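The pivot behaviour in the table can be sketched as a routing function. The `DIRECT_PAIRS` set below is transcribed from the table as an assumption for illustration; the real code would query Argos for the installed packages at runtime.

```python
# Language pairs with a direct Argos package, per the table above (assumed).
DIRECT_PAIRS = {
    ("en", "fr"), ("en", "de"), ("en", "es"), ("en", "zh"),
    ("fr", "en"), ("fr", "de"), ("fr", "es"),
    ("de", "en"), ("de", "fr"), ("de", "es"),
    ("es", "en"), ("es", "fr"), ("es", "de"),
    ("zh", "en"),
}

def route(src: str, dst: str) -> list:
    """Return the translation hops needed to go from src to dst."""
    if (src, dst) in DIRECT_PAIRS:
        return [(src, dst)]
    # No direct package installed: pivot through English.
    return [(src, "en"), ("en", dst)]
```

So `zh → de` becomes two hops (`zh → en`, then `en → de`), at the cost of some extra latency and translation drift.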
+
+ ---
+
+ ## 🚀 API Reference
+
+ ### `GET /health`
+ Returns server status and active rooms.

+ ```json
+ {
+   "status": "ok",
+   "rooms": {
+     "room1": ["user_abc123", "user_def456"]
+   }
+ }
  ```

+ ### `GET /rooms/{room_id}`
+ Check if a room exists and how many users are in it.

+ ```json
+ {
+   "exists": true,
+   "room_id": "room1",
+   "occupants": 1
+ }
+ ```
+
+ ### `WS /ws/call/{room_id}/{user_id}`
+ Main WebSocket endpoint. See the protocol above.
+
+ - Rooms support a maximum of **2 users**
+ - Attempting to join a full room returns a `4003` close code
+ - Each user must send a handshake JSON within **10 seconds** of connecting
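The join rules above reduce to a capacity check. A sketch, assuming illustrative names (`try_join`, `rooms`); the real room management is in `main.py`.

```python
# Room capacity rule: at most 2 users per room, reject with close code 4003.
MAX_USERS_PER_ROOM = 2
FULL_ROOM_CLOSE_CODE = 4003

rooms = {}  # room_id -> set of user_ids

def try_join(room_id: str, user_id: str):
    """Return (accepted, close_code). close_code is None on success."""
    occupants = rooms.setdefault(room_id, set())
    if len(occupants) >= MAX_USERS_PER_ROOM:
        return False, FULL_ROOM_CLOSE_CODE  # caller closes the WebSocket
    occupants.add(user_id)
    return True, None
```

On the server side the rejected socket would be closed with `await websocket.close(code=4003)` before any handshake processing.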
+
+ ---
+
+ ## 🛠️ Local Development
+
+ ### Prerequisites

  ```bash
+ sudo apt install ffmpeg build-essential cmake
+ ```
+
+ ### 1. Download Models

+ ```bash
+ chmod +x scripts/models_setup.sh
+ ./scripts/models_setup.sh
  ```

+ This downloads and compiles whisper.cpp, downloads the Piper voice models, and installs the Argos language packs.
+
+ ### 2. Python Environment

  ```bash
  cd backend
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
+ ```
+
+ ### 3. Run the Backend

+ ```bash
+ # from the project root
+ python3 -m app.main
  ```

+ Wait for: `✅ System Ready!`
+
+ ### 4. Run the Test Frontend
+
+ ```bash
+ cd frontend
+ python3 -m http.server 3000
+ ```
+
+ Open `http://localhost:3000` in two tabs, set different languages and the same Room ID, and speak.
+
  ---
 
+ ## 🐳 Docker

+ ### Build & Run

  ```bash
+ docker build -t linguacall-backend .
+ docker run -p 7860:7860 linguacall-backend
+ ```
+
+ ### Docker Compose (local dev with frontend)

+ ```bash
+ docker-compose up --build
  ```

+ ---

+ ## ☁️ Deployment (Hugging Face Spaces)

+ This repo is configured for direct deployment to Hugging Face Spaces using the Docker SDK.

+ 1. Create a new Space → SDK: **Docker**
+ 2. Add this repo as a remote and push:
+    ```bash
+    git remote add space https://huggingface.co/spaces/YOUR_USERNAME/linguacall-backend
+    git push space main
+    ```
+ 3. Monitor the build in the **Logs** tab (~10–15 min for the first build)
+ 4. Once live, your backend is at:
+    ```
+    https://YOUR_USERNAME-linguacall-backend.hf.space
+    ```
+ 5. Connect your frontend using:
+    ```
+    wss://YOUR_USERNAME-linguacall-backend.hf.space
+    ```

+ ---
+
+ ## 📁 Project Structure
+
+ ```
+ linguacall-backend/
+ ├── Dockerfile                 # Production build (HF Spaces / any Docker host)
+ ├── docker-compose.yml         # Local dev with nginx frontend
+ ├── backend/
+ │   ├── requirements.txt
+ │   └── app/
+ │       ├── main.py            # FastAPI app, WebSocket handler, room management
+ │       ├── logger.py          # Rotating file + console logger
+ │       ├── models/
+ │       │   ├── asr_model.py   # WhisperASR + hallucination filtering
+ │       │   ├── mt_model.py    # ArgosTranslator
+ │       │   └── tts_model.py   # PiperTTS with persistent subprocesses
+ │       └── services/
+ │           ├── translation_pipeline.py  # ASR → MT → TTS orchestration
+ │           └── vad_processor.py         # RMS energy VAD with adaptive thresholding
+ ├── frontend/
+ │   └── index.html             # Standalone test client (not for production)
+ ├── scripts/
+ │   └── models_setup.sh        # One-shot model download + compile script
+ └── models/                    # AI models (git-ignored, built at Docker build time)
+     ├── asr/
+     ├── mt/
+     └── tts/
  ```

+ ---
+
+ ## ⚙️ Configuration
+
+ Key parameters in `main.py` and `vad_processor.py` you may want to tune:
+
+ | Parameter | Default | Description |
+ |-----------|---------|-------------|
+ | `silence_threshold` | `1.2s` | Silence duration before triggering the pipeline |
+ | `min_speech_duration` | `0.8s` | Minimum speech length to process (filters short blips) |
+ | `energy_threshold` | `800.0` | RMS energy threshold for speech detection |
+ | `max_speech_duration` | `15.0s` | Hard cap before force-processing |
+ | `MAX_USERS_PER_ROOM` | `2` | Max concurrent users per room |
+ | Whisper model size | `base` | Change to `tiny` if RAM-constrained |
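The VAD defaults above fit a short sketch of RMS energy segmentation, assuming the protocol's 4096-sample chunks at 16kHz (~256ms per chunk). This is an illustration of how the parameters interact; the real `vad_processor.py` also adapts the threshold, which this sketch omits.

```python
import math

class StreamingVAD:
    """Toy RMS-energy VAD using the defaults from the table above."""

    def __init__(self, energy_threshold=800.0, silence_threshold=1.2,
                 min_speech_duration=0.8, chunk_seconds=4096 / 16000):
        self.energy_threshold = energy_threshold
        self.silence_threshold = silence_threshold
        self.min_speech_duration = min_speech_duration
        self.chunk_seconds = chunk_seconds
        self.speech_time = 0.0
        self.silence_time = 0.0

    def feed(self, samples):
        """Feed one chunk of int16 samples; return True when a segment ends."""
        rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        if rms >= self.energy_threshold:
            self.speech_time += self.chunk_seconds
            self.silence_time = 0.0
            return False
        if self.speech_time > 0:
            self.silence_time += self.chunk_seconds
            if (self.silence_time >= self.silence_threshold
                    and self.speech_time >= self.min_speech_duration):
                # ~1.2s of silence after real speech: trigger the pipeline.
                self.speech_time = self.silence_time = 0.0
                return True
        return False
```

The `min_speech_duration` guard is what filters the "short blips" mentioned in the table: a door slam that crosses the energy threshold for one chunk never accumulates 0.8s of speech, so it never triggers.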

  ---

+ ## 🔧 Troubleshooting
+
+ **OOM crash on startup (Hugging Face Spaces)**
+ Switch Whisper to `tiny` in `backend/app/services/translation_pipeline.py`:
+ ```python
+ self.asr = WhisperASR(model_size="tiny")
+ ```
+
+ **VAD not triggering / triggering too often**
+ Watch the `[VAD] avg=... peak=...` logs and adjust `energy_threshold` in `main.py`. Typical values: quiet mic ~250–400, normal room ~600–1000, noisy environment ~1200+.
+
+ **Piper TTS producing no audio**
+ Check that the `.onnx` and `.onnx.json` files are both present in `models/tts/`. Both files are required per voice.
+
+ **WebSocket disconnecting on Hugging Face**
+ HF Spaces has a proxy timeout. The frontend sends a ping every 25s to keep the connection alive — make sure your frontend implements this keepalive.
+
+ ---

+ ## 📄 License

+ MIT — see [LICENSE](LICENSE) for details.