google-labs-jules[bot] committed on
Commit
d6fcf92
·
1 Parent(s): ed699be

feat: Create web application for voice translator


This commit transforms the command-line voice translator into a web application with a FastAPI backend and a simple HTML/JS frontend.

Key changes include:
- Refactoring the `VoiceTranslator` class into a separate `translator.py` module and adapting it to use asyncio queues for audio I/O.
- Creating a `server.py` with a WebSocket endpoint to handle real-time audio streaming.
- Implementing server-side audio conversion from WebM to PCM using FFmpeg.
- Adding an `index.html` file for the user interface.
- Including a `Dockerfile` and `packages.txt` for deployment on HuggingFace Spaces.
- Updating `requirements.txt` with the necessary web and audio processing dependencies.
- Revising the `README.md` with instructions for running the web app locally and deploying it.
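The server-side WebM-to-PCM conversion mentioned above might be wired up roughly like this. This is a sketch, not the code from `server.py`: the function names and the exact FFmpeg flags are assumptions, chosen to produce the raw 16 kHz mono 16-bit PCM that Google STT expects.

```python
import subprocess

def ffmpeg_webm_to_pcm_cmd(sample_rate: int = 16000) -> list:
    # Build an FFmpeg argv that reads WebM/Opus from stdin and writes
    # raw signed 16-bit little-endian mono PCM to stdout.
    # Flag choices are illustrative, not taken from server.py.
    return [
        "ffmpeg",
        "-i", "pipe:0",          # read WebM from stdin
        "-f", "s16le",           # raw PCM container
        "-acodec", "pcm_s16le",  # 16-bit signed little-endian samples
        "-ac", "1",              # mono
        "-ar", str(sample_rate), # resample to the STT rate
        "pipe:1",                # write PCM to stdout
    ]

def convert_webm_chunk(webm_bytes: bytes) -> bytes:
    # One blocking conversion per call; a real server would keep a
    # long-lived FFmpeg process and stream chunks through it instead.
    proc = subprocess.run(
        ffmpeg_webm_to_pcm_cmd(),
        input=webm_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    return proc.stdout
```

The long-lived-process variant matters in practice: browser `MediaRecorder` output is one continuous WebM stream, so later chunks lack headers and cannot be decoded in isolation.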

Files changed (8)
  1. Dockerfile +22 -0
  2. README.md +42 -43
  3. app.py +0 -579
  4. index.html +137 -0
  5. packages.txt +1 -0
  6. requirements.txt +4 -1
  7. server.py +96 -0
  8. translator.py +276 -0
Dockerfile ADDED
@@ -0,0 +1,22 @@
+ # Use an official Python runtime as a parent image
+ FROM python:3.9-slim
+
+ # Set the working directory in the container
+ WORKDIR /app
+
+ # Install system dependencies from packages.txt
+ COPY packages.txt .
+ RUN apt-get update && apt-get install -y $(cat packages.txt)
+
+ # Copy the requirements file and install Python dependencies
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy the rest of the application code
+ COPY . .
+
+ # Expose the port the app runs on
+ EXPOSE 8000
+
+ # Command to run the application
+ CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -1,6 +1,8 @@
- # Real-Time English/French Voice Translator

- This project provides a real-time, bidirectional voice translation tool that runs in your terminal. Speak in English or French, and hear the translation in the other language almost instantly.

  It uses a combination of cutting-edge APIs for high-quality speech recognition, translation, and synthesis:
@@ -8,39 +10,35 @@ It uses a combination of cutting-edge APIs for high-quality speech recognition,
  - **Translation:** DeepL API
  - **Text-to-Speech (TTS):** ElevenLabs API

-
- *(Note: You can replace this with a real GIF of the application in action.)*
-
  ## Features

  - **Bidirectional Translation:** Simultaneously listens for both English and French and translates to the other language.
- - **Low Latency:** Built with `asyncio` and multithreading for a responsive, conversational experience.
  - **High-Quality Voice:** Leverages ElevenLabs for natural-sounding synthesized speech.
  - **Echo Suppression:** The translator is smart enough not to translate its own spoken output.
- - **Robust Streaming:** Automatically manages and restarts API connections to handle pauses in conversation.
- - **Simple CLI:** Easy to start and stop from the command line.

  ## How It Works

- The application orchestrates several processes concurrently:

- 1. **Audio Capture:** A dedicated thread captures audio from your default microphone.
- 2. **Dual STT Streams:** The captured audio is fed into two separate Google Cloud STT streams in parallel: one configured for `en-US` and one for `fr-FR`.
- 3. **Transcription & Translation:** When either STT stream detects a final utterance, it's sent to the DeepL API for translation.
- 4. **Speech Synthesis:** The translated text is sent to the ElevenLabs streaming TTS API.
- 5. **Audio Playback:** The synthesized audio is played back through your speakers.
-
- To prevent the system from re-translating its own output, the application pauses microphone processing during TTS playback and intelligently discards any recognized text that matches its last-spoken phrase.

  ## Requirements

  ### 1. Software
  - Python 3.8+
  - `pip` and `venv`
- - **PortAudio:** This is a dependency for the `pyaudio` library.
-     - **macOS (via Homebrew):** `brew install portaudio`
-     - **Debian/Ubuntu:** `sudo apt-get install portaudio19-dev`
-     - **Windows:** `pyaudio` can often be installed via `pip` without manual PortAudio installation.

  ### 2. API Keys
  You will need active accounts and API keys for the following services:
@@ -51,14 +49,14 @@ You will need active accounts and API keys for the following services:
  - **DeepL:**
    - A DeepL API plan (the Free plan is sufficient for moderate use).
  - **ElevenLabs:**
-   - An ElevenLabs account. You will also need your **Voice ID** for the voice you wish to use.

  ## Installation & Setup

  1. **Clone the Repository**
     ```bash
     git clone <your-repository-url>
-    cd realtime-translator
     ```

  2. **Create a Virtual Environment**
@@ -68,15 +66,7 @@ You will need active accounts and API keys for the following services:
     ```

  3. **Install Dependencies**
-    Create a `requirements.txt` file with the following content:
-    ```
-    pyaudio
-    websockets
-    google-cloud-speech
-    deepl
-    python-dotenv
-    ```
-    Then, install the packages:
     ```bash
     pip install -r requirements.txt
     ```
@@ -86,27 +76,36 @@ You will need active accounts and API keys for the following services:

  ```env
  # Path to your Google Cloud service account JSON file
- GOOGLE_APPLICATION_CREDENTIALS="C:/path/to/your/google-credentials.json"

  # Your DeepL API Key
- DEEPL_API_KEY="YOUR_DEEPL_API_KEY"

  # Your ElevenLabs API Key and Voice ID
  ELEVENLABS_API_KEY="YOUR_ELEVENLABS_API_KEY"
  ELEVENLABS_VOICE_ID="YOUR_ELEVENLABS_VOICE_ID"
  ```

- ## Usage

- Once set up, run the main application script:

- ```bash
- python app.py
- ```

- The application will prompt you to press `ENTER` to start and stop the translation session.

- - Press `ENTER` to start recording.
- - Speak in either English or French.
- - Press `ENTER` again to stop the session.
- - Press `Ctrl+C` to quit the application gracefully.
+ # Real-Time English/French Voice Translator Web App
+
+ This project provides a real-time, bidirectional voice translation web application. Speak in English or French into your browser, and hear the translation in the other language almost instantly.
+
+ It is built to be easily deployed as a HuggingFace Space.

  It uses a combination of cutting-edge APIs for high-quality speech recognition, translation, and synthesis:

  - **Translation:** DeepL API
  - **Text-to-Speech (TTS):** ElevenLabs API

  ## Features

+ - **Web-Based UI:** A simple and clean browser interface for real-time translation.
  - **Bidirectional Translation:** Simultaneously listens for both English and French and translates to the other language.
+ - **Low Latency:** Built with `asyncio`, WebSockets, and multithreading for a responsive, conversational experience.
  - **High-Quality Voice:** Leverages ElevenLabs for natural-sounding synthesized speech.
  - **Echo Suppression:** The translator is smart enough not to translate its own spoken output.

  ## How It Works

+ The application is composed of a web frontend and a Python backend:
+
+ 1. **Audio Capture (Frontend):** The browser's JavaScript captures audio from your microphone using the Web Audio API.
+ 2. **WebSocket Streaming:** The audio is chunked and streamed over a WebSocket connection to the FastAPI backend.
+ 3. **Backend Processing:**
+     - The `VoiceTranslator` class receives the audio stream.
+     - The audio is fed into two separate Google Cloud STT streams in parallel (`en-US` and `fr-FR`).
+     - When an STT stream detects a final utterance, it's sent to the DeepL API for translation.
+     - The translated text is sent to the ElevenLabs streaming TTS API.
+ 4. **Audio Playback (Frontend):** The synthesized audio from ElevenLabs is streamed back to the browser through the WebSocket and played instantly.
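The streaming flow in steps 2-4 boils down to a pair of `asyncio` queues: one fed by the WebSocket receiver, one drained by the sender. A minimal sketch of that queue discipline follows; the queue names are illustrative, and the upper-casing translator is a stand-in for the real STT-to-DeepL-to-TTS pipeline.

```python
import asyncio

async def pipeline_demo() -> list:
    # Incoming audio chunks from the browser; outgoing TTS audio back to it.
    inbound: asyncio.Queue = asyncio.Queue()
    outbound: asyncio.Queue = asyncio.Queue()

    async def translator():
        # Stand-in for the real pipeline: consume a chunk, emit a result.
        while True:
            chunk = await inbound.get()
            if chunk is None:  # sentinel: client disconnected
                await outbound.put(None)
                break
            await outbound.put(chunk.upper())

    task = asyncio.create_task(translator())

    # Simulate the WebSocket receiver feeding audio chunks.
    for chunk in ["bonjour", "hello"]:
        await inbound.put(chunk)
    await inbound.put(None)

    # Simulate the WebSocket sender draining results.
    results = []
    while (item := await outbound.get()) is not None:
        results.append(item)
    await task
    return results
```

Running `asyncio.run(pipeline_demo())` yields `['BONJOUR', 'HELLO']`. The sentinel-based shutdown mirrors what a WebSocket handler needs on disconnect: without it, the translator coroutine would block forever on an empty queue.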

  ## Requirements

  ### 1. Software
  - Python 3.8+
  - `pip` and `venv`
+ - **FFmpeg:** This is a system dependency for audio format conversion.
+     - **macOS (via Homebrew):** `brew install ffmpeg`
+     - **Debian/Ubuntu:** `sudo apt-get install ffmpeg`

  ### 2. API Keys
  You will need active accounts and API keys for the following services:
  - **DeepL:**
    - A DeepL API plan (the Free plan is sufficient for moderate use).
  - **ElevenLabs:**
+   - An ElevenLabs account and your **Voice ID** for the desired voice.

  ## Installation & Setup

  1. **Clone the Repository**
     ```bash
     git clone <your-repository-url>
+    cd realtime-translator-webapp  # Or your directory name
     ```

  2. **Create a Virtual Environment**
     ```

  3. **Install Dependencies**
+    Install the Python packages from `requirements.txt`:
     ```bash
     pip install -r requirements.txt
     ```

  ```env
  # Path to your Google Cloud service account JSON file
+ GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/google-credentials.json"

  # Your DeepL API Key
+ DEEPL_API_KEY="YOUR_DEEPL_API_KEY"

  # Your ElevenLabs API Key and Voice ID
  ELEVENLABS_API_KEY="YOUR_ELEVENLABS_API_KEY"
  ELEVENLABS_VOICE_ID="YOUR_ELEVENLABS_VOICE_ID"
  ```

+ ## Local Usage
+
+ 1. **Start the Server**
+    Run the Uvicorn server from the project root:
+    ```bash
+    uvicorn server:app --reload
+    ```
+
+ 2. **Use the Application**
+    - Open your web browser and navigate to `http://127.0.0.1:8000`.
+    - Click the "Start Translation" button. Your browser will ask for microphone permission.
+    - Speak in either English or French.
+    - The translated audio will play back automatically.
+    - Click "Stop Translation" to end the session.

+ ## Deploying to HuggingFace Spaces
+
+ This application is ready to be deployed as a HuggingFace Space.
+
+ 1. Create a new Space on HuggingFace, selecting the "Docker" template.
+ 2. Upload the entire project contents to the Space repository.
+ 3. In the Space "Settings" tab, add your API keys (`GOOGLE_APPLICATION_CREDENTIALS`, `DEEPL_API_KEY`, `ELEVENLABS_API_KEY`, `ELEVENLABS_VOICE_ID`) as secrets, and upload your Google Cloud credentials JSON file as well.
+ 4. The Space will automatically build the Docker image and start the application. Your translator will be live!
app.py DELETED
@@ -1,579 +0,0 @@
- #!/usr/bin/env python3
- """
- Real-Time French/English Voice Translator — cleaned version
-
- Fixes applied:
- - Fixed TTS echo caused by double-writing audio chunks
- - Removed prebuffer re-injection that could cause echoes
- - Added empty transcript filtering
- - Added within-stream deduplication
- - Removed unnecessary sleeps (reduced latency by ~900ms)
- - Reduced TTS prebuffer from 1s to 0.5s for faster playback start
- - Cleaned up diagnostic logging
-
- Keep your env vars:
- - GOOGLE_APPLICATION_CREDENTIALS, DEEPL_API_KEY, ELEVENLABS_API_KEY, ELEVENLABS_VOICE_ID
- """
-
- import asyncio
- import json
- import queue
- import threading
- import time
- import os
- import base64
- from collections import deque
- from typing import Dict, Optional
-
- import pyaudio
- import websockets
- from google.cloud import speech
- import deepl
- from dotenv import load_dotenv
-
- # -----------------------------------------------------------------------------
- # VoiceTranslator
- # -----------------------------------------------------------------------------
- class VoiceTranslator:
-     def __init__(self, deepl_api_key: str, elevenlabs_api_key: str, elevenlabs_voice_id: str):
-         # External clients
-         self.deepl_client = deepl.Translator(deepl_api_key)
-         self.elevenlabs_api_key = elevenlabs_api_key
-         self.voice_id = elevenlabs_voice_id
-         self.stt_client = speech.SpeechClient()
-
-         # Audio params
-         self.audio_rate = 16000
-         self.audio_chunk = 1024
-
-         # Per-language audio queues (raw mic frames)
-         self.lang_queues: Dict[str, queue.Queue] = {
-             "en-US": queue.Queue(),
-             "fr-FR": queue.Queue(),
-         }
-
-         # Small rolling prebuffer to avoid missing the first bits after a restart
-         self.prebuffer = deque(maxlen=12)
-
-         # State flags
-         self.is_recording = False
-         self.is_speaking = False
-         self.speaking_event = threading.Event()
-
-         # Deduplication
-         self.last_processed_transcript = ""
-         self.last_tts_text_en = ""
-         self.last_tts_text_fr = ""
-
-         # Threshold
-         self.min_confidence_threshold = 0.5
-
-         # PyAudio
-         self.pyaudio_instance = pyaudio.PyAudio()
-         self.audio_stream = None
-
-         # Threads + async
-         self.recording_thread: Optional[threading.Thread] = None
-         self.async_loop = asyncio.new_event_loop()
-
-         # TTS queue + consumer task
-         self._tts_queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
-         self._tts_consumer_task: Optional[asyncio.Task] = None
-
-         # Start async loop in separate thread
-         self.async_thread = threading.Thread(target=self._run_async_loop, daemon=True)
-         self.async_thread.start()
-
-         # schedule tts consumer creation inside the async loop
-         def _start_consumer():
-             self._tts_consumer_task = asyncio.create_task(self._tts_consumer())
-         self.async_loop.call_soon_threadsafe(_start_consumer)
-
-         self.stt_threads: Dict[str, threading.Thread] = {}
-
-         # Per-language restart events (used to tell threads when to start new streams)
-         self.restart_events: Dict[str, threading.Event] = {
-             "en-US": threading.Event(),
-             "fr-FR": threading.Event(),
-         }
-
-         # Per-language stream started flag
-         self._stream_started = {"en-US": False, "fr-FR": False}
-
-         # Per-language cancel events to force request_generator to stop
-         self.stream_cancel_events: Dict[str, threading.Event] = {
-             "en-US": threading.Event(),
-             "fr-FR": threading.Event(),
-         }
-
-         # Diagnostics
-         self._tts_job_counter = 0
-
-     def _run_async_loop(self):
-         asyncio.set_event_loop(self.async_loop)
-         try:
-             self.async_loop.run_forever()
-         except Exception as e:
-             print("[async_loop] stopped with error:", e)
-     # ---------------------------
-     # Audio capture
-     # ---------------------------
-     def _record_audio(self):
-         try:
-             stream = self.pyaudio_instance.open(
-                 format=pyaudio.paInt16,
-                 channels=1,
-                 rate=self.audio_rate,
-                 input=True,
-                 frames_per_buffer=self.audio_chunk,
-             )
-             print("🎀 Recording started...")
-
-             while self.is_recording:
-                 if self.speaking_event.is_set():
-                     time.sleep(0.01)
-                     continue
-
-                 try:
-                     data = stream.read(self.audio_chunk, exception_on_overflow=False)
-                 except Exception as e:
-                     print(f"[recorder] read error: {e}")
-                     continue
-
-                 if not data:
-                     continue
-
-                 self.prebuffer.append(data)
-                 self.lang_queues["en-US"].put(data)
-                 self.lang_queues["fr-FR"].put(data)
-
-             try:
-                 stream.stop_stream()
-                 stream.close()
-             except Exception:
-                 pass
-             print("🎀 Recording stopped.")
-         except Exception as e:
-             print(f"[recorder] fatal: {e}")
-
-     # ---------------------------
-     # TTS streaming (ElevenLabs) - async
-     # ---------------------------
-     async def _stream_tts(self, text: str):
-         uri = (
-             f"wss://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}"
-             f"/stream-input?model_id=eleven_flash_v2_5&output_format=pcm_16000"
-         )
-         tts_audio_stream = None
-         websocket = None
-         try:
-             # Mark speaking and set event so recorder & STT pause
-             self.is_speaking = True
-             self.speaking_event.set()
-
-             # Clear prebuffer to avoid re-injecting TTS audio later
-             self.prebuffer.clear()
-
-             # Clear queued frames to avoid replay
-             for q in self.lang_queues.values():
-                 with q.mutex:
-                     q.queue.clear()
-
-             websocket = await websockets.connect(uri)
-             await websocket.send(json.dumps({
-                 "text": " ",
-                 "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
-                 "xi_api_key": self.elevenlabs_api_key,
-             }))
-             await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))
-             await websocket.send(json.dumps({"text": ""}))
-
-             tts_audio_stream = self.pyaudio_instance.open(
-                 format=pyaudio.paInt16,
-                 channels=1,
-                 rate=16000,
-                 output=True,
-                 frames_per_buffer=1024,
-             )
-
-             prebuffer = bytearray()
-             playback_started = False
-
-             try:
-                 while True:
-                     try:
-                         message = await asyncio.wait_for(websocket.recv(), timeout=8.0)
-                     except asyncio.TimeoutError:
-                         if playback_started:
-                             break
-                         else:
-                             continue
-
-                     if isinstance(message, bytes):
-                         if not playback_started:
-                             prebuffer.extend(message)
-                             if len(prebuffer) >= 8000:
-                                 tts_audio_stream.write(bytes(prebuffer))
-                                 prebuffer.clear()
-                                 playback_started = True
-                         else:
-                             tts_audio_stream.write(message)
-                         continue
-
-                     try:
-                         data = json.loads(message)
-                     except Exception:
-                         continue
-
-                     if data.get("audio"):
-                         audio_bytes = base64.b64decode(data["audio"])
-                         if not playback_started:
-                             prebuffer.extend(audio_bytes)
-                             if len(prebuffer) >= 8000:
-                                 tts_audio_stream.write(bytes(prebuffer))
-                                 prebuffer.clear()
-                                 playback_started = True
-                         else:
-                             tts_audio_stream.write(audio_bytes)
-                     elif data.get("isFinal"):
-                         break
-                     elif data.get("error"):
-                         print("TTS error:", data["error"])
-                         break
-
-                 # Handle case where playback never started (very short audio)
-                 if prebuffer and not playback_started:
-                     tts_audio_stream.write(bytes(prebuffer))
-
-             finally:
-                 try:
-                     await websocket.close()
-                 except Exception:
-                     pass
-
-         except Exception as e:
-             pass
-         finally:
-             if tts_audio_stream:
-                 try:
-                     tts_audio_stream.stop_stream()
-                     tts_audio_stream.close()
-                 except Exception:
-                     pass
-
-             # Force the STT request generators to exit by setting cancel events
-             for lang, ev in self.stream_cancel_events.items():
-                 ev.set()
-
-             # Don't re-inject prebuffer - just clear the queues and let fresh audio come in
-             for q in self.lang_queues.values():
-                 with q.mutex:
-                     q.queue.clear()
-
-             # Clear speaking state and signal STT threads to restart
-             self.is_speaking = False
-             self.speaking_event.clear()
-
-             # Signal restart for both language streams
-             for lang, ev in self.restart_events.items():
-                 ev.set()
-
-             await asyncio.sleep(0.1)
-
-     # ---------------------------
-     # TTS consumer (serializes TTS)
-     # ---------------------------
-     async def _tts_consumer(self):
-         print("[tts_consumer] started")
-         while True:
-             item = await self._tts_queue.get()
-             if item is None:
-                 print("[tts_consumer] shutdown sentinel received")
-                 break
-             text = item.get("text", "")
-             self._tts_job_counter += 1
-             job_id = self._tts_job_counter
-             print(f"[tts_consumer] job #{job_id} dequeued (len={len(text)})")
-             try:
-                 await asyncio.wait_for(self._stream_tts(text), timeout=35.0)
-             except asyncio.TimeoutError:
-                 print(f"[tts_consumer] job #{job_id} _stream_tts timed out; proceeding.")
-             except Exception as e:
-                 print(f"[tts_consumer] job #{job_id} error during _stream_tts: {e}")
-             finally:
-                 await asyncio.sleep(0.05)
-         print("[tts_consumer] exiting")
-
-     # ---------------------------
-     # Translation & TTS trigger
-     # ---------------------------
-     async def _process_result(self, transcript: str, confidence: float, language: str):
-         lang_flag = "🇫🇷" if language == "fr-FR" else "🇬🇧"
-         print(f"{lang_flag} Heard ({language}, conf {confidence:.2f}): {transcript}")
-
-         # echo suppression vs last TTS in same language
-         if language == "fr-FR":
-             if transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
-                 print("   (echo suppressed)")
-                 return
-         else:
-             if transcript.strip().lower() == self.last_tts_text_en.strip().lower():
-                 print("   (echo suppressed)")
-                 return
-
-         try:
-             if language == "fr-FR":
-                 translated = self.deepl_client.translate_text(transcript, target_lang="EN-US").text
-                 print(f"🌐 FR → EN: {translated}")
-                 await self._tts_queue.put({"text": translated, "source_lang": language})
-                 self.last_tts_text_en = translated
-             else:
-                 translated = self.deepl_client.translate_text(transcript, target_lang="FR").text
-                 print(f"🌐 EN → FR: {translated}")
-                 await self._tts_queue.put({"text": translated, "source_lang": language})
-                 self.last_tts_text_fr = translated
-             print("🔊 Queued for speaking...")
-         except Exception as e:
-             print(f"Translation error: {e}")
-
-     # ---------------------------
-     # STT streaming (run per language)
-     # ---------------------------
-     def _run_stt_stream(self, language: str):
-         print(f"[stt:{language}] Thread starting, thread_id={threading.get_ident()}")
-         self._stream_started[language] = False
-         last_transcript_in_stream = ""
-
-         while self.is_recording:
-             try:
-                 if self._stream_started[language]:
-                     print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Waiting for restart signal...")
-                     signaled = self.restart_events[language].wait(timeout=30)
-                     if not signaled and self.is_recording:
-                         print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Timeout waiting for restart, restarting anyway")
-                     if not self.is_recording:
-                         break
-                     try:
-                         self.restart_events[language].clear()
-                     except Exception:
-                         pass
-                     time.sleep(0.01)
-
-                 self._stream_started[language] = True
-                 last_transcript_in_stream = ""
-                 print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Starting new stream...")
-
-                 config = speech.RecognitionConfig(
-                     encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
-                     sample_rate_hertz=self.audio_rate,
-                     language_code=language,
-                     enable_automatic_punctuation=True,
-                     model="latest_short",
-                 )
-                 streaming_config = speech.StreamingRecognitionConfig(
-                     config=config,
-                     interim_results=True,
-                     single_utterance=False,
-                 )
-
-                 # Request generator yields StreamingRecognizeRequest messages
-                 def request_generator():
-                     while self.is_recording:
-                         # If TTS is playing, skip sending mic frames to STT
-                         if self.speaking_event.is_set():
-                             time.sleep(0.01)
-                             continue
-                         # If cancel event set, clear and break to end stream
-                         if self.stream_cancel_events[language].is_set():
-                             try:
-                                 self.stream_cancel_events[language].clear()
-                             except Exception:
-                                 pass
-                             break
-                         try:
-                             chunk = self.lang_queues[language].get(timeout=1.0)
-                         except queue.Empty:
-                             continue
-                         yield speech.StreamingRecognizeRequest(audio_content=chunk)
-
-                 responses = self.stt_client.streaming_recognize(streaming_config, request_generator())
-
-                 response_count = 0
-                 final_received = False
-
-                 for response in responses:
-                     if not self.is_recording:
-                         print(f"[stt:{language}] Stopped by user")
-                         break
-                     if not response.results:
-                         continue
-
-                     response_count += 1
-                     for result in response.results:
-                         if not result.alternatives:
-                             continue
-                         alt = result.alternatives[0]
-                         transcript = alt.transcript.strip()
-                         conf = getattr(alt, "confidence", 0.0)
-                         is_final = bool(result.is_final)
-
-                         if is_final:
-                             now = time.strftime("%H:%M:%S")
-                             print(f"[{now}] [stt:{language}] → '{transcript}' (final={is_final}, conf={conf:.2f})")
-
-                             # Filter empty transcripts - don't break stream
-                             if not transcript or len(transcript.strip()) == 0:
-                                 print(f"[{now}] [stt:{language}] Empty transcript -> ignoring, continuing stream")
-                                 continue
-
-                             # Deduplicate within same stream
-                             if transcript.strip().lower() == last_transcript_in_stream.strip().lower():
-                                 print(f"[{now}] [stt:{language}] Duplicate final in same stream -> suppressed")
-                                 continue
-
-                             if conf < self.min_confidence_threshold:
-                                 print(f"[{now}] [stt:{language}] Final received but confidence {conf:.2f} < threshold -> suppressed")
-                                 continue
-
-                             last_transcript_in_stream = transcript
-
-                             if language == "fr-FR" and transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
-                                 print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_fr)")
-                                 continue
-                             if language == "en-US" and transcript.strip().lower() == self.last_tts_text_en.strip().lower():
-                                 print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_en)")
-                                 continue
-
-                             asyncio.run_coroutine_threadsafe(
-                                 self._process_result(transcript, conf, language),
-                                 self.async_loop
-                             )
-                             final_received = True
-                             break
-
-                     if final_received:
-                         break
-
-                 print(f"[stt:{language}] Stream ended after {response_count} responses")
-
-                 if self.is_recording and final_received:
-                     print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Final result processed. Waiting for TTS to complete and signal restart.")
-                 elif self.is_recording and not final_received:
-                     print(f"[stt:{language}] Stream ended unexpectedly, reconnecting...")
-                     time.sleep(0.5)
-                 else:
-                     break
-
-             except Exception as e:
-                 if self.is_recording:
-                     import traceback
-                     print(f"[stt:{language}] Error: {e}")
-                     print(traceback.format_exc())
-                     time.sleep(1.0)
-                 else:
-                     break
-
-         print(f"[stt:{language}] Thread exiting")
-
-     # ---------------------------
-     # Control
-     # ---------------------------
-     def start_translation(self):
-         if self.is_recording:
-             print("Already recording!")
-             return
-         self.is_recording = True
-         self.last_processed_transcript = ""
-
-         for ev in self.restart_events.values():
-             try:
-                 ev.clear()
-             except Exception:
-                 pass
-         self.speaking_event.clear()
-
-         for q in self.lang_queues.values():
-             with q.mutex:
-                 q.queue.clear()
-
-         self.recording_thread = threading.Thread(target=self._record_audio, daemon=True)
-         self.recording_thread.start()
-
-         for lang in ("en-US", "fr-FR"):
-             t = threading.Thread(target=self._run_stt_stream, args=(lang,), daemon=True)
-             self.stt_threads[lang] = t
-             t.start()
-             print(f"[main] STT thread {lang} started: {t.is_alive()} at {time.strftime('%H:%M:%S')}")
-
-         for ev in self.restart_events.values():
-             ev.set()
-
-     def stop_translation(self):
-         print("\n⏹️ Stopping translation...")
-         self.is_recording = False
-         for ev in self.restart_events.values():
-             ev.set()
-         self.speaking_event.clear()
-
-         if self._tts_consumer_task and not (self._tts_consumer_task.done() if hasattr(self._tts_consumer_task, 'done') else False):
-             try:
-                 def _put_sentinel():
-                     try:
-                         self._tts_queue.put_nowait(None)
-                     except Exception:
-                         asyncio.create_task(self._tts_queue.put(None))
-                 self.async_loop.call_soon_threadsafe(_put_sentinel)
-             except Exception:
-                 pass
-
-         time.sleep(0.2)
-
-     def cleanup(self):
-         self.stop_translation()
-         try:
-             if self.async_loop.is_running():
-                 def _stop_loop():
-                     if self._tts_consumer_task and not self._tts_consumer_task.done():
-                         try:
-                             self._tts_queue.put_nowait(None)
-                         except Exception:
-                             pass
-                     self.async_loop.stop()
-                 self.async_loop.call_soon_threadsafe(_stop_loop)
-         except Exception:
-             pass
-         try:
-             self.pyaudio_instance.terminate()
-         except Exception:
-             pass
-
- # -----------------------------------------------------------------------------
- # Main entry
- # -----------------------------------------------------------------------------
- def main():
-     load_dotenv()
-     google_creds = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
-     deepl_key = os.getenv("DEEPL_API_KEY")
-     eleven_key = os.getenv("ELEVENLABS_API_KEY")
-     voice_id = os.getenv("ELEVENLABS_VOICE_ID")
-
-     if not all([google_creds, deepl_key, eleven_key, voice_id]):
-         print("Missing API keys or credentials.")
-         return
-
-     translator = VoiceTranslator(deepl_key, eleven_key, voice_id)
-     print("Ready! Press ENTER to start, ENTER again to stop, Ctrl+C to quit.\n")
-
-     try:
-         while True:
-             input("Press ENTER to start speaking...")
-             translator.start_translation()
-             input("Press ENTER to stop...\n")
-             translator.stop_translation()
-     except KeyboardInterrupt:
-         print("\nKeyboardInterrupt received — cleaning up.")
-         translator.cleanup()
-
- if __name__ == "__main__":
-     main()
index.html ADDED
@@ -0,0 +1,142 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>Real-Time Voice Translator</title>
+     <style>
+         body { font-family: sans-serif; display: flex; flex-direction: column; align-items: center; justify-content: center; height: 100vh; margin: 0; background-color: #f0f0f0; }
+         #controls { margin-bottom: 20px; }
+         button { font-size: 1.2em; padding: 10px 20px; cursor: pointer; }
+         #status { font-size: 1.1em; color: #333; }
+     </style>
+ </head>
+ <body>
+     <h1>Real-Time Voice Translator</h1>
+     <div id="controls">
+         <button id="startButton">Start Translation</button>
+         <button id="stopButton" disabled>Stop Translation</button>
+     </div>
+     <p id="status">Status: Not connected</p>
+
+     <script>
+         const startButton = document.getElementById('startButton');
+         const stopButton = document.getElementById('stopButton');
+         const statusDiv = document.getElementById('status');
+         let socket;
+         let mediaRecorder;
+         let audioContext;
+         let audioQueue = [];
+         let isPlaying = false;
+
+         const connectWebSocket = () => {
+             const proto = window.location.protocol === "https:" ? "wss:" : "ws:";
+             socket = new WebSocket(`${proto}//${window.location.host}/ws`);
+
+             socket.onopen = () => {
+                 statusDiv.textContent = 'Status: Connected. Press Start.';
+                 startButton.disabled = false;
+             };
+
+             socket.onmessage = (event) => {
+                 if (event.data instanceof Blob) {
+                     const reader = new FileReader();
+                     reader.onload = function() {
+                         // The server relays raw 16-bit little-endian PCM at 16 kHz,
+                         // which decodeAudioData cannot parse; build the AudioBuffer by hand.
+                         const pcm = new Int16Array(this.result);
+                         const buffer = audioContext.createBuffer(1, pcm.length, 16000);
+                         const channel = buffer.getChannelData(0);
+                         for (let i = 0; i < pcm.length; i++) {
+                             channel[i] = pcm[i] / 32768;
+                         }
+                         audioQueue.push(buffer);
+                         if (!isPlaying) {
+                             playNextInQueue();
+                         }
+                     };
+                     reader.readAsArrayBuffer(event.data);
+                 }
+             };
+
+             socket.onclose = () => {
+                 statusDiv.textContent = 'Status: Disconnected';
+                 startButton.disabled = true;
+                 stopButton.disabled = true;
+             };
+
+             socket.onerror = (error) => {
+                 console.error("WebSocket Error:", error);
+                 statusDiv.textContent = 'Status: Connection error';
+             };
+         };
+
+         const playNextInQueue = () => {
+             if (audioQueue.length > 0) {
+                 isPlaying = true;
+                 const buffer = audioQueue.shift();
+                 const source = audioContext.createBufferSource();
+                 source.buffer = buffer;
+                 source.connect(audioContext.destination);
+                 source.onended = () => {
+                     isPlaying = false;
+                     playNextInQueue();
+                 };
+                 source.start();
+             }
+         };
+
+
+         startButton.onclick = async () => {
+             if (!socket || socket.readyState !== WebSocket.OPEN) {
+                 connectWebSocket();
+             }
+
+             audioContext = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
+
+             if (audioContext.state === 'suspended') {
+                 await audioContext.resume();
+             }
+
+             navigator.mediaDevices.getUserMedia({ audio: { sampleRate: 16000, channelCount: 1 } })
+                 .then(stream => {
+                     mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm; codecs=opus' });
+                     mediaRecorder.ondataavailable = event => {
+                         if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
+                             socket.send(event.data);
+                         }
+                     };
+                     mediaRecorder.start(250); // Send a chunk every 250 ms
+
+                     startButton.disabled = true;
+                     stopButton.disabled = false;
+                     statusDiv.textContent = 'Status: Translating...';
+                 })
+                 .catch(err => {
+                     console.error('Error getting user media:', err);
+                     statusDiv.textContent = 'Error: Could not access microphone.';
+                 });
+         };
+
+         stopButton.onclick = () => {
+             if (mediaRecorder) {
+                 mediaRecorder.stop();
+             }
+             if (socket && socket.readyState === WebSocket.OPEN) {
+                 socket.send(JSON.stringify({type: "stop"}));
+                 socket.close();
+             }
+             startButton.disabled = false;
+             stopButton.disabled = true;
+             statusDiv.textContent = 'Status: Stopped. Re-connect to start again.';
+         };
+
+         window.onload = () => {
+             startButton.disabled = false;
+             stopButton.disabled = true;
+             statusDiv.textContent = 'Status: Ready to connect.';
+         };
+
+     </script>
+ </body>
+ </html>
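The `onmessage`/`playNextInQueue` pair above implements a serialized playback queue: decoded chunks are appended, each finished chunk triggers the next, and a new chain starts only when nothing is playing. The ordering logic can be sketched in Python, with a plain callback standing in for `AudioBufferSourceNode` playback (`play_chunk` and the string "chunks" are illustrative stand-ins, not part of the page):

```python
from collections import deque

# Sketch of index.html's playback queue: each finished chunk triggers
# the next one; a new chain starts only when nothing is playing.
# play_chunk() is a stand-in for source.start() + the onended event.
played = []
audio_queue = deque()
is_playing = False

def play_chunk(chunk, on_ended):
    played.append(chunk)   # stand-in for starting playback
    on_ended()             # stand-in for the onended event firing

def play_next_in_queue():
    global is_playing
    if not audio_queue:
        is_playing = False
        return
    is_playing = True
    play_chunk(audio_queue.popleft(), play_next_in_queue)

def enqueue(chunk):
    audio_queue.append(chunk)
    if not is_playing:
        play_next_in_queue()

for c in ("a", "b", "c"):
    enqueue(c)
print("".join(played))  # abc
```

The `is_playing` flag is what prevents two overlapping playback chains when chunks arrive faster than they play.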
packages.txt ADDED
@@ -0,0 +1 @@
+ ffmpeg
requirements.txt CHANGED
@@ -1,5 +1,8 @@
 google-cloud-speech
 deepl
- pyaudio
 websockets
 python-dotenv
+ fastapi
+ uvicorn
+ python-multipart
+ ffmpeg-python
server.py ADDED
@@ -0,0 +1,98 @@
+ import asyncio
+ import os
+ import ffmpeg
+ from fastapi import FastAPI, WebSocket, WebSocketDisconnect
+ from fastapi.responses import HTMLResponse
+ from dotenv import load_dotenv
+ from translator import VoiceTranslator
+
+ load_dotenv()
+
+ app = FastAPI()
+
+ # Load environment variables for API keys
+ google_creds = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
+ deepl_key = os.getenv("DEEPL_API_KEY")
+ eleven_key = os.getenv("ELEVENLABS_API_KEY")
+ voice_id = os.getenv("ELEVENLABS_VOICE_ID")
+
+ if not all([google_creds, deepl_key, eleven_key, voice_id]):
+     raise ValueError("Missing one or more required API keys in .env file.")
+
+ translator = VoiceTranslator(deepl_key, eleven_key, voice_id)
+
+ @app.get("/")
+ async def get():
+     return HTMLResponse(open("index.html", "r").read())
+
+ async def audio_output_sender(ws: WebSocket, output_queue: asyncio.Queue):
+     print("Audio output sender started.")
+     while True:
+         try:
+             audio_chunk = await output_queue.get()
+             if audio_chunk is None:
+                 break
+             await ws.send_bytes(audio_chunk)
+         except asyncio.CancelledError:
+             break
+     print("Audio output sender stopped.")
+
+ async def handle_audio_input(websocket: WebSocket, input_queue: asyncio.Queue):
+     print("Audio input handler started.")
+     while True:
+         try:
+             data = await websocket.receive_bytes()
+             # Convert WebM/Opus to 16 kHz mono PCM for the STT streams
+             process = (
+                 ffmpeg
+                 .input('pipe:0')
+                 .output('pipe:1', format='s16le', acodec='pcm_s16le', ac=1, ar='16k')
+                 .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True)
+             )
+             # communicate() blocks, so run it off the event loop
+             stdout, stderr = await asyncio.to_thread(process.communicate, input=data)
+             await input_queue.put(stdout)
+         except WebSocketDisconnect:
+             break
+         except Exception as e:
+             print(f"Audio input error: {e}")
+             break
+     print("Audio input handler stopped.")
+
+
+ @app.websocket("/ws")
+ async def websocket_endpoint(websocket: WebSocket):
+     await websocket.accept()
+     print("WebSocket connection accepted.")
+
+     output_sender_task = None
+     input_handler_task = None
+
+     try:
+         # Start translation and audio processing tasks
+         translator.start_translation()
+         output_sender_task = asyncio.create_task(
+             audio_output_sender(websocket, translator.output_queue)
+         )
+         input_handler_task = asyncio.create_task(
+             handle_audio_input(websocket, translator.input_queue)
+         )
+
+         await asyncio.gather(input_handler_task, output_sender_task)
+
+     except WebSocketDisconnect:
+         print("Client disconnected.")
+     except Exception as e:
+         print(f"An error occurred: {e}")
+     finally:
+         print("Stopping translation and cleaning up...")
+         if output_sender_task:
+             output_sender_task.cancel()
+         if input_handler_task:
+             input_handler_task.cancel()
+         translator.stop_translation()
+         try:
+             await websocket.close()
+         except RuntimeError:
+             pass
+         print("WebSocket connection closed.")
translator.py ADDED
@@ -0,0 +1,276 @@
+ import asyncio
+ import base64
+ import json
+ import queue
+ import threading
+ import time
+ from collections import deque
+ from typing import Dict, Optional
+
+ import deepl
+ import websockets
+ from google.cloud import speech
+
+
+ class VoiceTranslator:
+     def __init__(self, deepl_api_key: str, elevenlabs_api_key: str, elevenlabs_voice_id: str):
+         self.deepl_client = deepl.Translator(deepl_api_key)
+         self.elevenlabs_api_key = elevenlabs_api_key
+         self.voice_id = elevenlabs_voice_id
+         self.stt_client = speech.SpeechClient()
+
+         self.audio_rate = 16000
+         self.audio_chunk = 1024
+
+         self.input_queue = asyncio.Queue()
+         self.output_queue = asyncio.Queue()
+
+         self.lang_queues: Dict[str, queue.Queue] = {
+             "en-US": queue.Queue(),
+             "fr-FR": queue.Queue(),
+         }
+         self.prebuffer = deque(maxlen=12)
+
+         self.is_recording = False
+         self.is_speaking = False
+         self.speaking_event = threading.Event()
+
+         self.last_processed_transcript = ""
+         self.last_tts_text_en = ""
+         self.last_tts_text_fr = ""
+         self.min_confidence_threshold = 0.5
+
+         self.async_loop = asyncio.new_event_loop()
+         self.async_thread = threading.Thread(target=self._run_async_loop, daemon=True)
+         self.async_thread.start()
+
+         self._tts_queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
+         self._tts_consumer_task: Optional[asyncio.Task] = None
+         self._process_audio_task: Optional[asyncio.Task] = None
+
+         def _start_consumer():
+             self._tts_consumer_task = asyncio.create_task(self._tts_consumer())
+
+         self.async_loop.call_soon_threadsafe(_start_consumer)
+
+         self.stt_threads: Dict[str, threading.Thread] = {}
+         self.restart_events: Dict[str, threading.Event] = {
+             "en-US": threading.Event(),
+             "fr-FR": threading.Event(),
+         }
+         self.stream_cancel_events: Dict[str, threading.Event] = {
+             "en-US": threading.Event(),
+             "fr-FR": threading.Event(),
+         }
+         self._stream_started = {"en-US": False, "fr-FR": False}
+         self._tts_job_counter = 0
+
+     def _run_async_loop(self):
+         asyncio.set_event_loop(self.async_loop)
+         try:
+             self.async_loop.run_forever()
+         except Exception as e:
+             print(f"[async_loop] stopped with error: {e}")
+
+     async def _process_input_audio(self):
+         print("🎤 Audio processing task started...")
+         while self.is_recording:
+             try:
+                 data = await self.input_queue.get()
+                 if data is None:
+                     break
+                 if not self.speaking_event.is_set():
+                     self.prebuffer.append(data)
+                     self.lang_queues["en-US"].put(data)
+                     self.lang_queues["fr-FR"].put(data)
+             except asyncio.CancelledError:
+                 break
+             except Exception as e:
+                 print(f"[audio_processor] error: {e}")
+         print("🎤 Audio processing task stopped.")
+
+     async def _stream_tts(self, text: str):
+         uri = (
+             f"wss://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}"
+             f"/stream-input?model_id=eleven_flash_v2_5&output_format=pcm_16000"
+         )
+         try:
+             self.is_speaking = True
+             self.speaking_event.set()
+             self.prebuffer.clear()
+             for q in self.lang_queues.values():
+                 with q.mutex:
+                     q.queue.clear()
+
+             async with websockets.connect(uri) as websocket:
+                 await websocket.send(json.dumps({
+                     "text": " ",
+                     "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
+                     "xi_api_key": self.elevenlabs_api_key,
+                 }))
+                 await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))
+                 await websocket.send(json.dumps({"text": ""}))
+
+                 while True:
+                     try:
+                         message = await websocket.recv()
+                         data = json.loads(message)
+                         if data.get("audio"):
+                             audio_chunk = base64.b64decode(data["audio"])
+                             await self.output_queue.put(audio_chunk)
+                         elif data.get("isFinal"):
+                             break
+                     except websockets.exceptions.ConnectionClosed:
+                         break
+                     except Exception:
+                         continue
+         except Exception as e:
+             print(f"TTS streaming error: {e}")
+         finally:
+             for lang, ev in self.stream_cancel_events.items():
+                 ev.set()
+             for q in self.lang_queues.values():
+                 with q.mutex:
+                     q.queue.clear()
+
+             self.is_speaking = False
+             self.speaking_event.clear()
+             for lang, ev in self.restart_events.items():
+                 ev.set()
+             await asyncio.sleep(0.1)
+
+     async def _tts_consumer(self):
+         print("[tts_consumer] started")
+         while True:
+             try:
+                 item = await self._tts_queue.get()
+                 if item is None:
+                     break
+                 text = item.get("text", "")
+                 self._tts_job_counter += 1
+                 job_id = self._tts_job_counter
+                 print(f"[tts_consumer] job #{job_id} dequeued (len={len(text)})")
+                 await self._stream_tts(text)
+             except asyncio.CancelledError:
+                 break
+             except Exception as e:
+                 print(f"[tts_consumer] error: {e}")
+         print("[tts_consumer] exiting")
+
+     async def _process_result(self, transcript: str, confidence: float, language: str):
+         lang_flag = "🇫🇷" if language == "fr-FR" else "🇬🇧"
+         print(f"{lang_flag} Heard ({language}, conf {confidence:.2f}): {transcript}")
+
+         if language == "fr-FR" and transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
+             print("   (echo suppressed)")
+             return
+         if language == "en-US" and transcript.strip().lower() == self.last_tts_text_en.strip().lower():
+             print("   (echo suppressed)")
+             return
+
+         try:
+             if language == "fr-FR":
+                 translated = self.deepl_client.translate_text(transcript, target_lang="EN-US").text
+                 print(f"🌐 FR → EN: {translated}")
+                 await self._tts_queue.put({"text": translated, "source_lang": language})
+                 self.last_tts_text_en = translated
+             else:
+                 translated = self.deepl_client.translate_text(transcript, target_lang="FR").text
+                 print(f"🌐 EN → FR: {translated}")
+                 await self._tts_queue.put({"text": translated, "source_lang": language})
+                 self.last_tts_text_fr = translated
+             print("🔊 Queued for speaking...")
+         except Exception as e:
+             print(f"Translation error: {e}")
+
+     def _run_stt_stream(self, language: str):
+         print(f"[stt:{language}] Thread starting...")
+         while self.is_recording:
+             try:
+                 self.restart_events[language].wait()
+                 self.restart_events[language].clear()
+
+                 config = speech.RecognitionConfig(
+                     encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
+                     sample_rate_hertz=self.audio_rate,
+                     language_code=language,
+                     enable_automatic_punctuation=True,
+                     model="latest_short",
+                 )
+                 streaming_config = speech.StreamingRecognitionConfig(
+                     config=config, interim_results=True, single_utterance=False
+                 )
+
+                 def request_generator():
+                     while self.is_recording:
+                         if self.speaking_event.is_set():
+                             time.sleep(0.01)
+                             continue
+                         if self.stream_cancel_events[language].is_set():
+                             self.stream_cancel_events[language].clear()
+                             break
+                         try:
+                             chunk = self.lang_queues[language].get(timeout=0.1)
+                             yield speech.StreamingRecognizeRequest(audio_content=chunk)
+                         except queue.Empty:
+                             continue
+
+                 responses = self.stt_client.streaming_recognize(streaming_config, request_generator())
+
+                 for response in responses:
+                     if not self.is_recording:
+                         break
+                     for result in response.results:
+                         if result.is_final and result.alternatives:
+                             alt = result.alternatives[0]
+                             transcript = alt.transcript.strip()
+                             conf = getattr(alt, "confidence", 0.0)
+                             if transcript and conf >= self.min_confidence_threshold:
+                                 asyncio.run_coroutine_threadsafe(
+                                     self._process_result(transcript, conf, language), self.async_loop
+                                 )
+             except Exception as e:
+                 print(f"[stt:{language}] Error: {e}")
+                 time.sleep(1)
+         print(f"[stt:{language}] Thread exiting")
+
+     def start_translation(self):
+         if self.is_recording:
+             return
+         self.is_recording = True
+
+         for ev in self.restart_events.values():
+             ev.clear()
+         self.speaking_event.clear()
+
+         def _start_tasks():
+             self._process_audio_task = asyncio.create_task(self._process_input_audio())
+
+         self.async_loop.call_soon_threadsafe(_start_tasks)
+
+         for lang in ("en-US", "fr-FR"):
+             thread = threading.Thread(target=self._run_stt_stream, args=(lang,), daemon=True)
+             self.stt_threads[lang] = thread
+             thread.start()
+             self.restart_events[lang].set()
+
+     def stop_translation(self):
+         if not self.is_recording:
+             return
+         self.is_recording = False
+         for ev in self.restart_events.values():
+             ev.set()
+
+         def _cancel_tasks():
+             if self._process_audio_task:
+                 self._process_audio_task.cancel()
+             if self._tts_queue:
+                 self._tts_queue.put_nowait(None)
+
+         self.async_loop.call_soon_threadsafe(_cancel_tasks)
+
+     def cleanup(self):
+         self.stop_translation()
+         time.sleep(0.5)
+         if self.async_loop.is_running():
+             self.async_loop.call_soon_threadsafe(self.async_loop.stop)
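`_process_result` above suppresses feedback loops by comparing each final transcript against the last TTS output for that language, case- and whitespace-insensitively: if the recognizer heard exactly what the translator just spoke, it is treated as an echo and dropped. That comparison, extracted as a standalone helper (`is_echo` is an illustrative name, not a method of the class):

```python
# Sketch of translator.py's echo-suppression check: a transcript that
# matches the last TTS text for that language (ignoring case and
# surrounding whitespace) is assumed to be the app hearing itself.
def is_echo(transcript: str, last_tts_text: str) -> bool:
    return transcript.strip().lower() == last_tts_text.strip().lower()

print(is_echo("Bonjour tout le monde.", " bonjour tout le monde. "))  # True
print(is_echo("Bonjour.", "Au revoir."))                              # False
```

Exact-match suppression only catches verbatim echoes; a partially re-recognized echo still gets translated, which is why the class also mutes the STT queues (via `speaking_event`) while TTS audio is playing.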