Mike W committed
Commit ed699be · 0 parents

Initial commit of real-time voice translator

Files changed (8)
  1. .gitignore +9 -0
  2. README.md +112 -0
  3. app.py +579 -0
  4. app_backup.py +597 -0
  5. checkpoint_nov2.py +570 -0
  6. mic_check.py +5 -0
  7. requirements.txt +5 -0
  8. working.py +334 -0
.gitignore ADDED
@@ -0,0 +1,9 @@
# Environment variables
.env

# Python virtual environment
venv/

# Python cache
__pycache__/
*.pyc
README.md ADDED
@@ -0,0 +1,112 @@
# Real-Time English/French Voice Translator

This project provides a real-time, bidirectional voice translation tool that runs in your terminal. Speak in English or French, and hear the translation in the other language almost instantly.

It uses a combination of cutting-edge APIs for high-quality speech recognition, translation, and synthesis:

- **Speech-to-Text (STT):** Google Cloud Speech-to-Text
- **Translation:** DeepL API
- **Text-to-Speech (TTS):** ElevenLabs API

*(Note: You can replace this with a real GIF of the application in action.)*

## Features

- **Bidirectional Translation:** Simultaneously listens for both English and French and translates to the other language.
- **Low Latency:** Built with `asyncio` and multithreading for a responsive, conversational experience.
- **High-Quality Voice:** Leverages ElevenLabs for natural-sounding synthesized speech.
- **Echo Suppression:** The translator is smart enough not to translate its own spoken output.
- **Robust Streaming:** Automatically manages and restarts API connections to handle pauses in conversation.
- **Simple CLI:** Easy to start and stop from the command line.

## How It Works

The application orchestrates several processes concurrently:

1. **Audio Capture:** A dedicated thread captures audio from your default microphone.
2. **Dual STT Streams:** The captured audio is fed into two separate Google Cloud STT streams in parallel: one configured for `en-US` and one for `fr-FR`.
3. **Transcription & Translation:** When either STT stream detects a final utterance, it's sent to the DeepL API for translation.
4. **Speech Synthesis:** The translated text is sent to the ElevenLabs streaming TTS API.
5. **Audio Playback:** The synthesized audio is played back through your speakers.

To prevent the system from re-translating its own output, the application pauses microphone processing during TTS playback and discards any recognized text that matches its last-spoken phrase.
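The fan-out and echo-check ideas above can be sketched in a few lines (illustrative names only, not the full implementation):

```python
import queue

# One queue per STT stream; every mic frame is duplicated into both,
# so the en-US and fr-FR recognizers see identical audio.
lang_queues = {"en-US": queue.Queue(), "fr-FR": queue.Queue()}

def fan_out(frame: bytes) -> None:
    """Push one captured mic frame into every language queue."""
    for q in lang_queues.values():
        q.put(frame)

def is_echo(transcript: str, last_tts_text: str) -> bool:
    """Case-insensitive comparison against the last phrase the TTS spoke."""
    return transcript.strip().lower() == last_tts_text.strip().lower()

fan_out(b"\x00\x01")
print(lang_queues["en-US"].qsize(), lang_queues["fr-FR"].qsize())  # 1 1
print(is_echo("  Bonjour !", "bonjour !"))  # True
print(is_echo("Bonjour", "Hello"))  # False
```

The real application additionally gates the check on which language stream produced the transcript, since each direction keeps its own last-spoken text.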
## Requirements

### 1. Software
- Python 3.8+
- `pip` and `venv`
- **PortAudio:** This is a dependency for the `pyaudio` library.
  - **macOS (via Homebrew):** `brew install portaudio`
  - **Debian/Ubuntu:** `sudo apt-get install portaudio19-dev`
  - **Windows:** `pyaudio` can often be installed via `pip` without manual PortAudio installation.

### 2. API Keys
You will need active accounts and API keys for the following services:

- **Google Cloud:**
  - A Google Cloud Platform project with the **Speech-to-Text API** enabled.
  - A service account key file (`.json`).
- **DeepL:**
  - A DeepL API plan (the Free plan is sufficient for moderate use).
- **ElevenLabs:**
  - An ElevenLabs account. You will also need your **Voice ID** for the voice you wish to use.

## Installation & Setup

1. **Clone the Repository**
   ```bash
   git clone <your-repository-url>
   cd realtime-translator
   ```

2. **Create a Virtual Environment**
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. **Install Dependencies**
   Create a `requirements.txt` file with the following content:
   ```
   pyaudio
   websockets
   google-cloud-speech
   deepl
   python-dotenv
   ```
   Then, install the packages:
   ```bash
   pip install -r requirements.txt
   ```

4. **Configure Environment Variables**
   Create a file named `.env` in the project root and add your credentials. This file is ignored by Git to keep your keys safe.

   ```env
   # Path to your Google Cloud service account JSON file
   GOOGLE_APPLICATION_CREDENTIALS="C:/path/to/your/google-credentials.json"

   # Your DeepL API Key
   DEEPL_API_KEY="YOUR_DEEPL_API_KEY"

   # Your ElevenLabs API Key and Voice ID
   ELEVENLABS_API_KEY="YOUR_ELEVENLABS_API_KEY"
   ELEVENLABS_VOICE_ID="YOUR_ELEVENLABS_VOICE_ID"
   ```

## Usage

Once set up, run the main application script:

```bash
python app.py
```

The application will prompt you to press `ENTER` to start and stop the translation session.

- Press `ENTER` to start recording.
- Speak in either English or French.
- Press `ENTER` again to stop the session.
- Press `Ctrl+C` to quit the application gracefully.
app.py ADDED
@@ -0,0 +1,579 @@
#!/usr/bin/env python3
"""
Real-Time French/English Voice Translator — cleaned version

Fixes applied:
- Fixed TTS echo caused by double-writing audio chunks
- Removed prebuffer re-injection that could cause echoes
- Added empty transcript filtering
- Added within-stream deduplication
- Removed unnecessary sleeps (reduced latency by ~900ms)
- Reduced TTS prebuffer for faster playback start
- Cleaned up diagnostic logging

Keep your env vars:
- GOOGLE_APPLICATION_CREDENTIALS, DEEPL_API_KEY, ELEVENLABS_API_KEY, ELEVENLABS_VOICE_ID
"""

import asyncio
import json
import queue
import threading
import time
import os
import base64
from collections import deque
from typing import Dict, Optional

import pyaudio
import websockets
from google.cloud import speech
import deepl
from dotenv import load_dotenv

# -----------------------------------------------------------------------------
# VoiceTranslator
# -----------------------------------------------------------------------------
class VoiceTranslator:
    def __init__(self, deepl_api_key: str, elevenlabs_api_key: str, elevenlabs_voice_id: str):
        # External clients
        self.deepl_client = deepl.Translator(deepl_api_key)
        self.elevenlabs_api_key = elevenlabs_api_key
        self.voice_id = elevenlabs_voice_id
        self.stt_client = speech.SpeechClient()

        # Audio params
        self.audio_rate = 16000
        self.audio_chunk = 1024

        # Per-language audio queues (raw mic frames)
        self.lang_queues: Dict[str, queue.Queue] = {
            "en-US": queue.Queue(),
            "fr-FR": queue.Queue(),
        }

        # Small rolling prebuffer to avoid missing the first bits after a restart
        self.prebuffer = deque(maxlen=12)

        # State flags
        self.is_recording = False
        self.is_speaking = False
        self.speaking_event = threading.Event()

        # Deduplication
        self.last_processed_transcript = ""
        self.last_tts_text_en = ""
        self.last_tts_text_fr = ""

        # Threshold
        self.min_confidence_threshold = 0.5

        # PyAudio
        self.pyaudio_instance = pyaudio.PyAudio()
        self.audio_stream = None

        # Threads + async
        self.recording_thread: Optional[threading.Thread] = None
        self.async_loop = asyncio.new_event_loop()

        # TTS queue + consumer task
        self._tts_queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
        self._tts_consumer_task: Optional[asyncio.Task] = None

        # Start async loop in separate thread
        self.async_thread = threading.Thread(target=self._run_async_loop, daemon=True)
        self.async_thread.start()

        # Schedule TTS consumer creation inside the async loop
        def _start_consumer():
            self._tts_consumer_task = asyncio.create_task(self._tts_consumer())
        self.async_loop.call_soon_threadsafe(_start_consumer)

        self.stt_threads: Dict[str, threading.Thread] = {}

        # Per-language restart events (used to tell threads when to start new streams)
        self.restart_events: Dict[str, threading.Event] = {
            "en-US": threading.Event(),
            "fr-FR": threading.Event(),
        }

        # Per-language stream started flag
        self._stream_started = {"en-US": False, "fr-FR": False}

        # Per-language cancel events to force request_generator to stop
        self.stream_cancel_events: Dict[str, threading.Event] = {
            "en-US": threading.Event(),
            "fr-FR": threading.Event(),
        }

        # Diagnostics
        self._tts_job_counter = 0

    def _run_async_loop(self):
        asyncio.set_event_loop(self.async_loop)
        try:
            self.async_loop.run_forever()
        except Exception as e:
            print("[async_loop] stopped with error:", e)

    # ---------------------------
    # Audio capture
    # ---------------------------
    def _record_audio(self):
        try:
            stream = self.pyaudio_instance.open(
                format=pyaudio.paInt16,
                channels=1,
                rate=self.audio_rate,
                input=True,
                frames_per_buffer=self.audio_chunk,
            )
            print("🎤 Recording started...")

            while self.is_recording:
                if self.speaking_event.is_set():
                    time.sleep(0.01)
                    continue

                try:
                    data = stream.read(self.audio_chunk, exception_on_overflow=False)
                except Exception as e:
                    print(f"[recorder] read error: {e}")
                    continue

                if not data:
                    continue

                self.prebuffer.append(data)
                self.lang_queues["en-US"].put(data)
                self.lang_queues["fr-FR"].put(data)

            try:
                stream.stop_stream()
                stream.close()
            except Exception:
                pass
            print("🎤 Recording stopped.")
        except Exception as e:
            print(f"[recorder] fatal: {e}")

    # ---------------------------
    # TTS streaming (ElevenLabs) - async
    # ---------------------------
    async def _stream_tts(self, text: str):
        uri = (
            f"wss://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}"
            f"/stream-input?model_id=eleven_flash_v2_5&output_format=pcm_16000"
        )
        tts_audio_stream = None
        websocket = None
        try:
            # Mark speaking and set event so recorder & STT pause
            self.is_speaking = True
            self.speaking_event.set()

            # Clear prebuffer to avoid re-injecting TTS audio later
            self.prebuffer.clear()

            # Clear queued frames to avoid replay
            for q in self.lang_queues.values():
                with q.mutex:
                    q.queue.clear()

            websocket = await websockets.connect(uri)
            await websocket.send(json.dumps({
                "text": " ",
                "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
                "xi_api_key": self.elevenlabs_api_key,
            }))
            await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))
            await websocket.send(json.dumps({"text": ""}))

            tts_audio_stream = self.pyaudio_instance.open(
                format=pyaudio.paInt16,
                channels=1,
                rate=16000,
                output=True,
                frames_per_buffer=1024,
            )

            prebuffer = bytearray()
            playback_started = False

            try:
                while True:
                    try:
                        message = await asyncio.wait_for(websocket.recv(), timeout=8.0)
                    except asyncio.TimeoutError:
                        if playback_started:
                            break
                        else:
                            continue

                    if isinstance(message, bytes):
                        if not playback_started:
                            prebuffer.extend(message)
                            if len(prebuffer) >= 8000:
                                tts_audio_stream.write(bytes(prebuffer))
                                prebuffer.clear()
                                playback_started = True
                        else:
                            tts_audio_stream.write(message)
                        continue

                    try:
                        data = json.loads(message)
                    except Exception:
                        continue

                    if data.get("audio"):
                        audio_bytes = base64.b64decode(data["audio"])
                        if not playback_started:
                            prebuffer.extend(audio_bytes)
                            if len(prebuffer) >= 8000:
                                tts_audio_stream.write(bytes(prebuffer))
                                prebuffer.clear()
                                playback_started = True
                        else:
                            tts_audio_stream.write(audio_bytes)
                    elif data.get("isFinal"):
                        break
                    elif data.get("error"):
                        print("TTS error:", data["error"])
                        break

                # Handle case where playback never started (very short audio)
                if prebuffer and not playback_started:
                    tts_audio_stream.write(bytes(prebuffer))

            finally:
                try:
                    await websocket.close()
                except Exception:
                    pass

        except Exception:
            pass
        finally:
            if tts_audio_stream:
                try:
                    tts_audio_stream.stop_stream()
                    tts_audio_stream.close()
                except Exception:
                    pass

            # Force the STT request generators to exit by setting cancel events
            for lang, ev in self.stream_cancel_events.items():
                ev.set()

            # Don't re-inject prebuffer - just clear the queues and let fresh audio come in
            for q in self.lang_queues.values():
                with q.mutex:
                    q.queue.clear()

            # Clear speaking state and signal STT threads to restart
            self.is_speaking = False
            self.speaking_event.clear()

            # Signal restart for both language streams
            for lang, ev in self.restart_events.items():
                ev.set()

            await asyncio.sleep(0.1)

    # ---------------------------
    # TTS consumer (serializes TTS)
    # ---------------------------
    async def _tts_consumer(self):
        print("[tts_consumer] started")
        while True:
            item = await self._tts_queue.get()
            if item is None:
                print("[tts_consumer] shutdown sentinel received")
                break
            text = item.get("text", "")
            self._tts_job_counter += 1
            job_id = self._tts_job_counter
            print(f"[tts_consumer] job #{job_id} dequeued (len={len(text)})")
            try:
                await asyncio.wait_for(self._stream_tts(text), timeout=35.0)
            except asyncio.TimeoutError:
                print(f"[tts_consumer] job #{job_id} _stream_tts timed out; proceeding.")
            except Exception as e:
                print(f"[tts_consumer] job #{job_id} error during _stream_tts: {e}")
            finally:
                await asyncio.sleep(0.05)
        print("[tts_consumer] exiting")

    # ---------------------------
    # Translation & TTS trigger
    # ---------------------------
    async def _process_result(self, transcript: str, confidence: float, language: str):
        lang_flag = "🇫🇷" if language == "fr-FR" else "🇬🇧"
        print(f"{lang_flag} Heard ({language}, conf {confidence:.2f}): {transcript}")

        # Echo suppression vs last TTS in same language
        if language == "fr-FR":
            if transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
                print("  (echo suppressed)")
                return
        else:
            if transcript.strip().lower() == self.last_tts_text_en.strip().lower():
                print("  (echo suppressed)")
                return

        try:
            if language == "fr-FR":
                translated = self.deepl_client.translate_text(transcript, target_lang="EN-US").text
                print(f"🌐 FR → EN: {translated}")
                await self._tts_queue.put({"text": translated, "source_lang": language})
                self.last_tts_text_en = translated
            else:
                translated = self.deepl_client.translate_text(transcript, target_lang="FR").text
                print(f"🌐 EN → FR: {translated}")
                await self._tts_queue.put({"text": translated, "source_lang": language})
                self.last_tts_text_fr = translated
            print("🔊 Queued for speaking...")
        except Exception as e:
            print(f"Translation error: {e}")

    # ---------------------------
    # STT streaming (run per language)
    # ---------------------------
    def _run_stt_stream(self, language: str):
        print(f"[stt:{language}] Thread starting, thread_id={threading.get_ident()}")
        self._stream_started[language] = False
        last_transcript_in_stream = ""

        while self.is_recording:
            try:
                if self._stream_started[language]:
                    print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Waiting for restart signal...")
                    signaled = self.restart_events[language].wait(timeout=30)
                    if not signaled and self.is_recording:
                        print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Timeout waiting for restart, restarting anyway")
                    if not self.is_recording:
                        break
                    try:
                        self.restart_events[language].clear()
                    except Exception:
                        pass
                    time.sleep(0.01)

                self._stream_started[language] = True
                last_transcript_in_stream = ""
                print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Starting new stream...")

                config = speech.RecognitionConfig(
                    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
                    sample_rate_hertz=self.audio_rate,
                    language_code=language,
                    enable_automatic_punctuation=True,
                    model="latest_short",
                )
                streaming_config = speech.StreamingRecognitionConfig(
                    config=config,
                    interim_results=True,
                    single_utterance=False,
                )

                # Request generator yields StreamingRecognizeRequest messages
                def request_generator():
                    while self.is_recording:
                        # If TTS is playing, skip sending mic frames to STT
                        if self.speaking_event.is_set():
                            time.sleep(0.01)
                            continue
                        # If cancel event set, clear and break to end stream
                        if self.stream_cancel_events[language].is_set():
                            try:
                                self.stream_cancel_events[language].clear()
                            except Exception:
                                pass
                            break
                        try:
                            chunk = self.lang_queues[language].get(timeout=1.0)
                        except queue.Empty:
                            continue
                        yield speech.StreamingRecognizeRequest(audio_content=chunk)

                responses = self.stt_client.streaming_recognize(streaming_config, request_generator())

                response_count = 0
                final_received = False

                for response in responses:
                    if not self.is_recording:
                        print(f"[stt:{language}] Stopped by user")
                        break
                    if not response.results:
                        continue

                    response_count += 1
                    for result in response.results:
                        if not result.alternatives:
                            continue
                        alt = result.alternatives[0]
                        transcript = alt.transcript.strip()
                        conf = getattr(alt, "confidence", 0.0)
                        is_final = bool(result.is_final)

                        if is_final:
                            now = time.strftime("%H:%M:%S")
                            print(f"[{now}] [stt:{language}] → '{transcript}' (final={is_final}, conf={conf:.2f})")

                            # Filter empty transcripts - don't break stream
                            if not transcript:
                                print(f"[{now}] [stt:{language}] Empty transcript -> ignoring, continuing stream")
                                continue

                            # Deduplicate within same stream
                            if transcript.strip().lower() == last_transcript_in_stream.strip().lower():
                                print(f"[{now}] [stt:{language}] Duplicate final in same stream -> suppressed")
                                continue

                            if conf < self.min_confidence_threshold:
                                print(f"[{now}] [stt:{language}] Final received but confidence {conf:.2f} < threshold -> suppressed")
                                continue

                            last_transcript_in_stream = transcript

                            if language == "fr-FR" and transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
                                print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_fr)")
                                continue
                            if language == "en-US" and transcript.strip().lower() == self.last_tts_text_en.strip().lower():
                                print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_en)")
                                continue

                            asyncio.run_coroutine_threadsafe(
                                self._process_result(transcript, conf, language),
                                self.async_loop
                            )
                            final_received = True
                            break

                    if final_received:
                        break

                print(f"[stt:{language}] Stream ended after {response_count} responses")

                if self.is_recording and final_received:
                    print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Final result processed. Waiting for TTS to complete and signal restart.")
                elif self.is_recording and not final_received:
                    print(f"[stt:{language}] Stream ended unexpectedly, reconnecting...")
                    time.sleep(0.5)
                else:
                    break

            except Exception as e:
                if self.is_recording:
                    import traceback
                    print(f"[stt:{language}] Error: {e}")
                    print(traceback.format_exc())
                    time.sleep(1.0)
                else:
                    break

        print(f"[stt:{language}] Thread exiting")

    # ---------------------------
    # Control
    # ---------------------------
    def start_translation(self):
        if self.is_recording:
            print("Already recording!")
            return
        self.is_recording = True
        self.last_processed_transcript = ""

        for ev in self.restart_events.values():
            try:
                ev.clear()
            except Exception:
                pass
        self.speaking_event.clear()

        for q in self.lang_queues.values():
            with q.mutex:
                q.queue.clear()

        self.recording_thread = threading.Thread(target=self._record_audio, daemon=True)
        self.recording_thread.start()

        for lang in ("en-US", "fr-FR"):
            t = threading.Thread(target=self._run_stt_stream, args=(lang,), daemon=True)
            self.stt_threads[lang] = t
            t.start()
            print(f"[main] STT thread {lang} started: {t.is_alive()} at {time.strftime('%H:%M:%S')}")

        for ev in self.restart_events.values():
            ev.set()

    def stop_translation(self):
        print("\n⏹️ Stopping translation...")
        self.is_recording = False
        for ev in self.restart_events.values():
            ev.set()
        self.speaking_event.clear()

        if self._tts_consumer_task and not self._tts_consumer_task.done():
            try:
                def _put_sentinel():
                    try:
                        self._tts_queue.put_nowait(None)
                    except Exception:
                        asyncio.create_task(self._tts_queue.put(None))
                self.async_loop.call_soon_threadsafe(_put_sentinel)
            except Exception:
                pass

        time.sleep(0.2)

    def cleanup(self):
        self.stop_translation()
        try:
            if self.async_loop.is_running():
                def _stop_loop():
                    if self._tts_consumer_task and not self._tts_consumer_task.done():
                        try:
                            self._tts_queue.put_nowait(None)
                        except Exception:
                            pass
                    self.async_loop.stop()
                self.async_loop.call_soon_threadsafe(_stop_loop)
        except Exception:
            pass
        try:
            self.pyaudio_instance.terminate()
        except Exception:
            pass

# -----------------------------------------------------------------------------
# Main entry
# -----------------------------------------------------------------------------
def main():
    load_dotenv()
    google_creds = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
    deepl_key = os.getenv("DEEPL_API_KEY")
    eleven_key = os.getenv("ELEVENLABS_API_KEY")
    voice_id = os.getenv("ELEVENLABS_VOICE_ID")

    if not all([google_creds, deepl_key, eleven_key, voice_id]):
        print("Missing API keys or credentials.")
        return

    translator = VoiceTranslator(deepl_key, eleven_key, voice_id)
    print("Ready! Press ENTER to start, ENTER again to stop, Ctrl+C to quit.\n")

    try:
        while True:
            input("Press ENTER to start speaking...")
            translator.start_translation()
            input("Press ENTER to stop...\n")
            translator.stop_translation()
    except KeyboardInterrupt:
        print("\nKeyboardInterrupt received — cleaning up.")
        translator.cleanup()

if __name__ == "__main__":
    main()
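For reference, the byte thresholds used for the TTS playback prebuffers map to wall-clock time as follows: 16-bit mono PCM at 16 kHz is 32,000 bytes per second, so an 8,000-byte prebuffer holds 0.25 s of audio and a 16,000-byte prebuffer holds 0.5 s. A quick check of that arithmetic:

```python
def pcm_bytes_to_seconds(n_bytes: int, rate_hz: int = 16000,
                         sample_width: int = 2, channels: int = 1) -> float:
    """Duration of a raw PCM buffer: bytes / (rate * sample_width * channels)."""
    return n_bytes / (rate_hz * sample_width * channels)

print(pcm_bytes_to_seconds(8000))   # 0.25
print(pcm_bytes_to_seconds(16000))  # 0.5
```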
app_backup.py ADDED
@@ -0,0 +1,597 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Real-Time French/English Voice Translator — patched single-file v2
4
+
5
+ Changes from previous:
6
+ - Adds per-language stream_cancel_events that force the STT request_generator
7
+ to exit, allowing streaming_recognize to terminate and be restarted cleanly.
8
+ - _stream_tts sets the cancel events immediately after playback finishes (before
9
+ prebuffer re-injection and restart events).
10
+ - Request generator checks cancel event frequently and breaks to end the stream.
11
+
12
+ Keep your env vars:
13
+ - GOOGLE_APPLICATION_CREDENTIALS, DEEPL_API_KEY, ELEVENLABS_API_KEY, ELEVENLABS_VOICE_ID
14
+ """
15
+
16
+ import asyncio
17
+ import json
18
+ import queue
19
+ import threading
20
+ import time
21
+ import os
22
+ import base64
23
+ from collections import deque
24
+ from typing import Dict, Optional
25
+
26
+ import pyaudio
27
+ import websockets
28
+ from google.cloud import speech
29
+ import deepl
30
+ from dotenv import load_dotenv
31
+
32
+ # -----------------------------------------------------------------------------
33
+ # VoiceTranslator
34
+ # -----------------------------------------------------------------------------
35
+ class VoiceTranslator:
36
+ def __init__(self, deepl_api_key: str, elevenlabs_api_key: str, elevenlabs_voice_id: str):
37
+ # External clients
38
+ self.deepl_client = deepl.Translator(deepl_api_key)
39
+ self.elevenlabs_api_key = elevenlabs_api_key
40
+ self.voice_id = elevenlabs_voice_id
41
+ self.stt_client = speech.SpeechClient()
42
+
43
+ # Audio params
44
+ self.audio_rate = 16000
45
+ self.audio_chunk = 1024
46
+
47
+ # Per-language audio queues (raw mic frames)
48
+ self.lang_queues: Dict[str, queue.Queue] = {
49
+ "en-US": queue.Queue(),
50
+ "fr-FR": queue.Queue(),
51
+ }
52
+
53
+ # Small rolling prebuffer to avoid missing the first bits after a restart
54
+ self.prebuffer = deque(maxlen=12)
55
+
56
+ # State flags
57
+ self.is_recording = False
58
+ self.is_speaking = False
59
+ self.speaking_event = threading.Event()
60
+
61
+ # Deduplication
62
+ self.last_processed_transcript = ""
63
+ self.last_tts_text_en = ""
64
+ self.last_tts_text_fr = ""
65
+
66
+ # Threshold
67
+ self.min_confidence_threshold = 0.5
68
+
69
+ # PyAudio
70
+ self.pyaudio_instance = pyaudio.PyAudio()
71
+ self.audio_stream = None
72
+
73
+ # Threads + async
74
+ self.recording_thread: Optional[threading.Thread] = None
75
+ self.async_loop = asyncio.new_event_loop()
76
+
77
+ # TTS queue + consumer task
78
+ self._tts_queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
79
+ self._tts_consumer_task: Optional[asyncio.Task] = None
80
+
81
+ # Start async loop in separate thread
82
+ self.async_thread = threading.Thread(target=self._run_async_loop, daemon=True)
83
+ self.async_thread.start()
84
+
85
+ # schedule tts consumer creation inside the async loop
86
+ def _start_consumer():
87
+ self._tts_consumer_task = asyncio.create_task(self._tts_consumer())
88
+ self.async_loop.call_soon_threadsafe(_start_consumer)
89
+
90
+ self.stt_threads: Dict[str, threading.Thread] = {}
91
+
92
+ # Per-language restart events (used to tell threads when to start new streams)
93
+ self.restart_events: Dict[str, threading.Event] = {
94
+ "en-US": threading.Event(),
95
+ "fr-FR": threading.Event(),
96
+ }
97
+
98
+ # Per-language stream started flag
99
+ self._stream_started = {"en-US": False, "fr-FR": False}
100
+
101
+ # **NEW**: per-language cancel events to force request_generator to stop
102
+ self.stream_cancel_events: Dict[str, threading.Event] = {
103
+ "en-US": threading.Event(),
104
+ "fr-FR": threading.Event(),
105
+ }
106
+
107
+ # Diagnostics
108
+ self._tts_job_counter = 0
109
+
110
+ def _run_async_loop(self):
111
+ asyncio.set_event_loop(self.async_loop)
112
+ try:
113
+ self.async_loop.run_forever()
114
+ except Exception as e:
115
+ print("[async_loop] stopped with error:", e)
116
+
117
+ # ---------------------------
118
+ # Audio capture
119
+ # ---------------------------
120
+ def _record_audio(self):
121
+ try:
122
+ stream = self.pyaudio_instance.open(
123
+ format=pyaudio.paInt16,
124
+ channels=1,
125
+ rate=self.audio_rate,
126
+ input=True,
127
+ frames_per_buffer=self.audio_chunk,
128
+ )
129
+ print("🎤 Recording started...")
130
+
131
+ while self.is_recording:
132
+ if self.speaking_event.is_set():
133
+ time.sleep(0.01)
134
+ continue
135
+
136
+ try:
137
+ data = stream.read(self.audio_chunk, exception_on_overflow=False)
138
+ except Exception as e:
139
+ print(f"[recorder] read error: {e}")
140
+ continue
141
+
142
+ if not data:
143
+ continue
144
+
145
+ self.prebuffer.append(data)
146
+ self.lang_queues["en-US"].put(data)
147
+ self.lang_queues["fr-FR"].put(data)
148
+
149
+ try:
150
+ stream.stop_stream()
151
+ stream.close()
152
+ except Exception:
153
+ pass
154
+ print("🎤 Recording stopped.")
155
+ except Exception as e:
156
+ print(f"[recorder] fatal: {e}")
157
+
158
+ # ---------------------------
159
+ # TTS streaming (ElevenLabs) - async
160
+ # ---------------------------
161
+ async def _stream_tts(self, text: str):
162
+ uri = (
163
+ f"wss://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}"
164
+ f"/stream-input?model_id=eleven_flash_v2_5&output_format=pcm_16000"
165
+ )
166
+ tts_audio_stream = None
167
+ websocket = None
168
+ try:
169
+ # Mark speaking and set event so recorder & STT pause
170
+ self.is_speaking = True
171
+ self.speaking_event.set()
172
+ # print(f"[{time.strftime('%H:%M:%S')}] [tts] speaking -> True")
173
+
174
+ # Clear prebuffer to avoid re-injecting TTS audio later
175
+ self.prebuffer.clear()
176
+
177
+ # Clear queued frames to avoid replay; we'll re-inject prebuffer after we cancel streams
178
+ for q in self.lang_queues.values():
179
+ with q.mutex:
180
+ q.queue.clear()
181
+
182
+ # Brief pause to ensure recorder sees speaking_event before we start TTS
183
+ await asyncio.sleep(0.1)
184
+
185
+ websocket = await websockets.connect(uri)
186
+ await websocket.send(json.dumps({
187
+ "text": " ",
188
+ "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
189
+ "xi_api_key": self.elevenlabs_api_key,
190
+ }))
191
+ await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))
192
+ await websocket.send(json.dumps({"text": ""}))
193
+
194
+ tts_audio_stream = self.pyaudio_instance.open(
195
+ format=pyaudio.paInt16,
196
+ channels=1,
197
+ rate=16000,
198
+ output=True,
199
+ frames_per_buffer=1024,
200
+ )
201
+
202
+ prebuffer = bytearray()
203
+ playback_started = False
204
+
205
+ try:
206
+ while True:
207
+ try:
208
+ message = await asyncio.wait_for(websocket.recv(), timeout=8.0)
209
+ except asyncio.TimeoutError:
210
+ if playback_started:
211
+ break
212
+ else:
213
+ continue
214
+
215
+ if isinstance(message, bytes):
+ if playback_started:
+ tts_audio_stream.write(message)
+ else:
+ prebuffer.extend(message)
+ if len(prebuffer) >= 8000:
+ tts_audio_stream.write(bytes(prebuffer))
+ prebuffer.clear()
+ playback_started = True
+ continue
224
+
225
+ try:
226
+ data = json.loads(message)
227
+ except Exception:
228
+ continue
229
+
230
+ if data.get("audio"):
231
+ audio_bytes = base64.b64decode(data["audio"])
232
+ if not playback_started:
233
+ prebuffer.extend(audio_bytes)
234
+ if len(prebuffer) >= 16000:
235
+ print(f"[tts] Starting playback, prebuffer size: {len(prebuffer)}")
236
+ tts_audio_stream.write(bytes(prebuffer))
237
+ prebuffer.clear()
238
+ playback_started = True
239
+ else:
240
+ tts_audio_stream.write(audio_bytes)
241
+
242
+ elif data.get("isFinal"):
243
+ print(f"[tts] Received isFinal, prebuffer remaining: {len(prebuffer)}")
244
+ break
245
+
246
+ elif data.get("error"):
247
+ print("TTS error:", data["error"])
248
+ break
249
+
250
+ # Prebuffer should be empty after playback starts, but just in case
251
+ if prebuffer and not playback_started:
252
+ print(f"[tts] Writing final prebuffer: {len(prebuffer)} bytes (playback never started)")
253
+ tts_audio_stream.write(bytes(prebuffer))
254
+ elif prebuffer:
255
+ print(f"[tts] WARNING: prebuffer has {len(prebuffer)} bytes after playback - this is a bug!")
256
+
257
+ finally:
258
+ try:
259
+ await websocket.close()
260
+ except Exception:
261
+ pass
262
+
263
+ except Exception as e:
264
+ # print(f"[tts] error: {e}")
265
+ pass
266
+ finally:
267
+ if tts_audio_stream:
268
+ try:
269
+ tts_audio_stream.stop_stream()
270
+ tts_audio_stream.close()
271
+ except Exception:
272
+ pass
273
+ # **NEW**: force the STT request generators to exit by setting cancel events.
274
+ # This makes streaming_recognize finish; threads will then wait for restart_events
275
+ # and start fresh streams.
276
+ for lang, ev in self.stream_cancel_events.items():
277
+ ev.set()
278
+ # print(f"[{time.strftime('%H:%M:%S')}] [cancel] set -> {lang}")
279
+
280
+ # Don't re-inject prebuffer - it may contain TTS echo
281
+ # Just clear the queues and let fresh audio come in
282
+ for q in self.lang_queues.values():
283
+ with q.mutex:
284
+ q.queue.clear()
285
+
286
+ # Wait for TTS audio to clear from environment (acoustic decay)
287
+ await asyncio.sleep(0.1)
288
+
289
+ # Clear speaking state and signal STT threads to restart (robustly)
290
+ self.is_speaking = False
291
+ self.speaking_event.clear()
292
+ # print(f"[{time.strftime('%H:%M:%S')}] [tts] speaking -> False")
293
+
294
+ # Primary restart: set both events
295
+ for lang, ev in self.restart_events.items():
296
+ ev.set()
297
+ # print(f"[{time.strftime('%H:%M:%S')}] [restart] set -> {lang}")
298
+
299
+ await asyncio.sleep(0.25)
300
+ for lang, ev in self.restart_events.items():
301
+ ev.set()
302
+ await asyncio.sleep(0.25)
303
+
304
+ # ---------------------------
305
+ # TTS consumer (serializes TTS)
306
+ # ---------------------------
307
+ async def _tts_consumer(self):
308
+ print("[tts_consumer] started")
309
+ while True:
310
+ item = await self._tts_queue.get()
311
+ if item is None:
312
+ print("[tts_consumer] shutdown sentinel received")
313
+ break
314
+ text = item.get("text", "")
315
+ self._tts_job_counter += 1
316
+ job_id = self._tts_job_counter
317
+ print(f"[tts_consumer] job #{job_id} dequeued: '{text}'")
318
+ try:
319
+ await asyncio.wait_for(self._stream_tts(text), timeout=35.0)
320
+ except asyncio.TimeoutError:
321
+ print(f"[tts_consumer] job #{job_id} _stream_tts timed out; proceeding.")
322
+ except Exception as e:
323
+ print(f"[tts_consumer] job #{job_id} error during _stream_tts: {e}")
324
+ finally:
325
+ await asyncio.sleep(0.05)
326
+ print("[tts_consumer] exiting")
327
+
328
+ # ---------------------------
329
+ # Translation & TTS trigger
330
+ # ---------------------------
331
+ async def _process_result(self, transcript: str, confidence: float, language: str):
332
+ lang_flag = "🇫🇷" if language == "fr-FR" else "🇬🇧"
333
+ print(f"{lang_flag} Heard ({language}, conf {confidence:.2f}): {transcript}")
334
+
335
+ # echo suppression vs last TTS in same language
336
+ if language == "fr-FR":
337
+ if transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
338
+ print(" (echo suppressed)")
339
+ return
340
+ else:
341
+ if transcript.strip().lower() == self.last_tts_text_en.strip().lower():
342
+ print(" (echo suppressed)")
343
+ return
344
+
345
+ try:
346
+ if language == "fr-FR":
347
+ translated = self.deepl_client.translate_text(transcript, target_lang="EN-US").text
348
+ print(f"🌐 FR → EN: {translated}")
349
+ await self._tts_queue.put({"text": translated, "source_lang": language})
350
+ self.last_tts_text_en = translated
351
+ else:
352
+ translated = self.deepl_client.translate_text(transcript, target_lang="FR").text
353
+ print(f"🌐 EN → FR: {translated}")
354
+ await self._tts_queue.put({"text": translated, "source_lang": language})
355
+ self.last_tts_text_fr = translated
356
+ print("🔊 Queued for speaking...")
357
+ except Exception as e:
358
+ print(f"Translation error: {e}")
359
+
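The echo checks above compare exact lowercased strings, but STT often re-punctuates what it hears, so an exact match can miss echoes. One way to loosen the comparison is to strip punctuation before comparing; the `normalize` and `is_echo` helpers below are hypothetical, not part of the code above:

```python
import string

def normalize(text: str) -> str:
    # Lowercase, drop punctuation, and trim whitespace so that
    # "Hello, world!" and "hello world" compare equal.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).lower().strip()

def is_echo(transcript: str, last_tts_text: str) -> bool:
    # True when the recognized transcript is just our own TTS output.
    return bool(last_tts_text) and normalize(transcript) == normalize(last_tts_text)
```

A punctuation-insensitive check like this could replace the `strip().lower()` comparisons in both `_process_result` and the STT loop.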
360
+ # ---------------------------
361
+ # STT streaming (run per language)
362
+ # ---------------------------
363
+ def _run_stt_stream(self, language: str):
364
+ print(f"[stt:{language}] Thread starting, thread_id={threading.get_ident()}")
365
+ self._stream_started[language] = False
366
+ last_transcript_in_stream = ""
367
+
368
+ while self.is_recording:
369
+ try:
370
+ if self._stream_started[language]:
371
+ print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Waiting for restart signal...")
372
+ signaled = self.restart_events[language].wait(timeout=30)
373
+ if not signaled and self.is_recording:
374
+ print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Timeout waiting for restart, restarting anyway")
375
+ if not self.is_recording:
376
+ break
377
+ try:
378
+ self.restart_events[language].clear()
379
+ except Exception:
380
+ pass
381
+ time.sleep(0.01)
382
+
383
+ self._stream_started[language] = True
384
+ print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Starting new stream...")
385
+
386
+ config = speech.RecognitionConfig(
387
+ encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
388
+ sample_rate_hertz=self.audio_rate,
389
+ language_code=language,
390
+ enable_automatic_punctuation=True,
391
+ model="latest_short",
392
+ )
393
+ streaming_config = speech.StreamingRecognitionConfig(
394
+ config=config,
395
+ interim_results=True,
396
+ single_utterance=False,
397
+ )
398
+
399
+ # Request generator yields StreamingRecognizeRequest messages
400
+ def request_generator():
401
+ while self.is_recording:
402
+ # If TTS is playing, skip sending mic frames to STT
403
+ if self.speaking_event.is_set():
404
+ time.sleep(0.01)
405
+ continue
406
+ # If cancel event set, clear and break to end stream
407
+ if self.stream_cancel_events[language].is_set():
408
+ # print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] request_generator observed cancel -> exiting generator")
409
+ try:
410
+ self.stream_cancel_events[language].clear()
411
+ except Exception:
412
+ pass
413
+ break
414
+ try:
415
+ chunk = self.lang_queues[language].get(timeout=1.0)
416
+ except queue.Empty:
417
+ continue
418
+ yield speech.StreamingRecognizeRequest(audio_content=chunk)
419
+
420
+ responses = self.stt_client.streaming_recognize(streaming_config, request_generator())
421
+
422
+ response_count = 0
423
+ final_received = False
424
+
425
+ for response in responses:
426
+ if not self.is_recording:
427
+ print(f"[stt:{language}] Stopped by user")
428
+ break
429
+ if not response.results:
430
+ continue
431
+
432
+ response_count += 1
433
+ for result in response.results:
434
+ if not result.alternatives:
435
+ continue
436
+ alt = result.alternatives[0]
437
+ transcript = alt.transcript.strip()
438
+ conf = getattr(alt, "confidence", 0.0)
439
+ is_final = bool(result.is_final)
440
+
441
+ if is_final:
442
+ now = time.strftime("%H:%M:%S")
443
+ print(f"[{now}] [stt:{language}] → '{transcript}' (final={is_final}, conf={conf:.2f})")
444
+
445
+ # Filter empty transcripts - don't break stream
446
+ if not transcript or len(transcript.strip()) == 0:
447
+ print(f"[{now}] [stt:{language}] Empty transcript -> ignoring, continuing stream")
448
+ continue
449
+
450
+ # Deduplicate within same stream
451
+ if transcript.strip().lower() == last_transcript_in_stream.strip().lower():
452
+ print(f"[{now}] [stt:{language}] Duplicate final in same stream -> suppressed")
453
+ continue
454
+
455
+ if conf < self.min_confidence_threshold:
456
+ print(f"[{now}] [stt:{language}] Final received but confidence {conf:.2f} < threshold -> suppressed")
457
+ continue
458
+
459
+ if language == "fr-FR" and transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
460
+ print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_fr)")
461
+ continue
462
+ if language == "en-US" and transcript.strip().lower() == self.last_tts_text_en.strip().lower():
463
+ print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_en)")
464
+ continue
465
+
466
+ asyncio.run_coroutine_threadsafe(
467
+ self._process_result(transcript, conf, language),
468
+ self.async_loop
469
+ )
470
+ final_received = True
471
+ break
472
+
473
+ if final_received:
474
+ break
475
+
476
+ print(f"[stt:{language}] Stream ended after {response_count} responses")
477
+
478
+ if self.is_recording and final_received:
479
+ print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Final result processed. Waiting for TTS to complete and signal restart.")
480
+ elif self.is_recording and not final_received:
481
+ print(f"[stt:{language}] Stream ended unexpectedly, reconnecting...")
482
+ time.sleep(0.5)
483
+ else:
484
+ break
485
+
486
+ except Exception as e:
487
+ if self.is_recording:
488
+ import traceback
489
+ print(f"[stt:{language}] Error: {e}")
490
+ print(traceback.format_exc())
491
+ time.sleep(1.0)
492
+ else:
493
+ break
494
+
495
+ print(f"[stt:{language}] Thread exiting")
496
+
497
+ # ---------------------------
498
+ # Control
499
+ # ---------------------------
500
+ def start_translation(self):
501
+ if self.is_recording:
502
+ print("Already recording!")
503
+ return
504
+ self.is_recording = True
505
+ self.last_processed_transcript = ""
506
+
507
+ for ev in self.restart_events.values():
508
+ try:
509
+ ev.clear()
510
+ except Exception:
511
+ pass
512
+ self.speaking_event.clear()
513
+
514
+ for q in self.lang_queues.values():
515
+ with q.mutex:
516
+ q.queue.clear()
517
+
518
+ self.recording_thread = threading.Thread(target=self._record_audio, daemon=True)
519
+ self.recording_thread.start()
520
+
521
+ for lang in ("en-US", "fr-FR"):
522
+ t = threading.Thread(target=self._run_stt_stream, args=(lang,), daemon=True)
523
+ self.stt_threads[lang] = t
524
+ t.start()
525
+ print(f"[main] STT thread {lang} started: {t.is_alive()} at {time.strftime('%H:%M:%S')}")
526
+
527
+ for ev in self.restart_events.values():
528
+ ev.set()
529
+
530
+ def stop_translation(self):
531
+ print("\n⏹️ Stopping translation...")
532
+ self.is_recording = False
533
+ for ev in self.restart_events.values():
534
+ ev.set()
535
+ self.speaking_event.clear()
536
+
537
+ if self._tts_consumer_task and not self._tts_consumer_task.done():
538
+ try:
539
+ def _put_sentinel():
540
+ try:
541
+ self._tts_queue.put_nowait(None)
542
+ except Exception:
543
+ asyncio.create_task(self._tts_queue.put(None))
544
+ self.async_loop.call_soon_threadsafe(_put_sentinel)
545
+ except Exception:
546
+ pass
547
+
548
+ time.sleep(0.2)
549
+
550
+ def cleanup(self):
551
+ self.stop_translation()
552
+ try:
553
+ if self.async_loop.is_running():
554
+ def _stop_loop():
555
+ if self._tts_consumer_task and not self._tts_consumer_task.done():
556
+ try:
557
+ self._tts_queue.put_nowait(None)
558
+ except Exception:
559
+ pass
560
+ self.async_loop.stop()
561
+ self.async_loop.call_soon_threadsafe(_stop_loop)
562
+ except Exception:
563
+ pass
564
+ try:
565
+ self.pyaudio_instance.terminate()
566
+ except Exception:
567
+ pass
568
+
569
+ # -----------------------------------------------------------------------------
570
+ # Main entry
571
+ # -----------------------------------------------------------------------------
572
+ def main():
573
+ load_dotenv()
574
+ google_creds = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
575
+ deepl_key = os.getenv("DEEPL_API_KEY")
576
+ eleven_key = os.getenv("ELEVENLABS_API_KEY")
577
+ voice_id = os.getenv("ELEVENLABS_VOICE_ID")
578
+
579
+ if not all([google_creds, deepl_key, eleven_key, voice_id]):
580
+ print("Missing API keys or credentials.")
581
+ return
582
+
583
+ translator = VoiceTranslator(deepl_key, eleven_key, voice_id)
584
+ print("Ready! Press ENTER to start, ENTER again to stop, Ctrl+C to quit.\n")
585
+
586
+ try:
587
+ while True:
588
+ input("Press ENTER to start speaking...")
589
+ translator.start_translation()
590
+ input("Press ENTER to stop...\n")
591
+ translator.stop_translation()
592
+ except KeyboardInterrupt:
593
+ print("\nKeyboardInterrupt received — cleaning up.")
594
+ translator.cleanup()
595
+
596
+ if __name__ == "__main__":
597
+ main()
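The cancel-event mechanism that `request_generator` relies on can be exercised in isolation: a generator drains a queue until a `threading.Event` is set, which is what lets `streaming_recognize` terminate cleanly after TTS playback. A minimal sketch using plain byte chunks in place of `StreamingRecognizeRequest`:

```python
import queue
import threading

def request_generator(frames: "queue.Queue[bytes]", cancel: threading.Event):
    # Yield audio chunks until the cancel event is set; clearing the event
    # here lets the same Event object be reused for the next stream.
    while True:
        if cancel.is_set():
            cancel.clear()
            return
        try:
            yield frames.get(timeout=0.05)
        except queue.Empty:
            continue

frames: "queue.Queue[bytes]" = queue.Queue()
cancel = threading.Event()
for chunk in (b"a", b"b"):
    frames.put(chunk)

gen = request_generator(frames, cancel)
received = [next(gen), next(gen)]
cancel.set()           # simulates TTS finishing and requesting a stream restart
remaining = list(gen)  # generator exits on its next iteration
```

Because the event is cleared inside the generator, the STT thread can immediately reuse it for the next stream without racing the TTS side.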
checkpoint_nov2.py ADDED
@@ -0,0 +1,570 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Real-Time French/English Voice Translator — patched single-file v2
4
+
5
+ Changes from previous:
6
+ - Adds per-language stream_cancel_events that force the STT request_generator
7
+ to exit, allowing streaming_recognize to terminate and be restarted cleanly.
8
+ - _stream_tts sets the cancel events immediately after playback finishes (before
9
+ prebuffer re-injection and restart events).
10
+ - Request generator checks cancel event frequently and breaks to end the stream.
11
+
12
+ Keep your env vars:
13
+ - GOOGLE_APPLICATION_CREDENTIALS, DEEPL_API_KEY, ELEVENLABS_API_KEY, ELEVENLABS_VOICE_ID
14
+ """
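The `_tts_consumer` described in this docstring — one task draining an `asyncio.Queue` so TTS jobs run strictly one at a time, with `None` as a shutdown sentinel — reduces to the following sketch (`consumer` stands in for `_tts_consumer`, and the `upper()` call stands in for `_stream_tts`):

```python
import asyncio

async def consumer(jobs: "asyncio.Queue", results: list) -> None:
    # Serialize work: one job at a time; None shuts the consumer down.
    while True:
        item = await jobs.get()
        if item is None:
            break
        results.append(item.upper())  # stand-in for awaiting _stream_tts(item)

async def main() -> list:
    jobs: "asyncio.Queue" = asyncio.Queue()
    results: list = []
    task = asyncio.create_task(consumer(jobs, results))
    for text in ("bonjour", "hello"):
        await jobs.put(text)
    await jobs.put(None)  # sentinel, mirrors stop_translation()
    await task
    return results

out = asyncio.run(main())
```

Serializing through a single consumer is what prevents two translations from speaking over each other, regardless of how many STT threads enqueue work.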
15
+
16
+ import asyncio
17
+ import json
18
+ import queue
19
+ import threading
20
+ import time
21
+ import os
22
+ import base64
23
+ from collections import deque
24
+ from typing import Dict, Optional
25
+
26
+ import pyaudio
27
+ import websockets
28
+ from google.cloud import speech
29
+ import deepl
30
+ from dotenv import load_dotenv
31
+
32
+ # -----------------------------------------------------------------------------
33
+ # VoiceTranslator
34
+ # -----------------------------------------------------------------------------
35
+ class VoiceTranslator:
36
+ def __init__(self, deepl_api_key: str, elevenlabs_api_key: str, elevenlabs_voice_id: str):
37
+ # External clients
38
+ self.deepl_client = deepl.Translator(deepl_api_key)
39
+ self.elevenlabs_api_key = elevenlabs_api_key
40
+ self.voice_id = elevenlabs_voice_id
41
+ self.stt_client = speech.SpeechClient()
42
+
43
+ # Audio params
44
+ self.audio_rate = 16000
45
+ self.audio_chunk = 1024
46
+
47
+ # Per-language audio queues (raw mic frames)
48
+ self.lang_queues: Dict[str, queue.Queue] = {
49
+ "en-US": queue.Queue(),
50
+ "fr-FR": queue.Queue(),
51
+ }
52
+
53
+ # Small rolling prebuffer to avoid missing the first bits after a restart
54
+ self.prebuffer = deque(maxlen=12)
55
+
56
+ # State flags
57
+ self.is_recording = False
58
+ self.is_speaking = False
59
+ self.speaking_event = threading.Event()
60
+
61
+ # Deduplication
62
+ self.last_processed_transcript = ""
63
+ self.last_tts_text_en = ""
64
+ self.last_tts_text_fr = ""
65
+
66
+ # Threshold
67
+ self.min_confidence_threshold = 0.5
68
+
69
+ # PyAudio
70
+ self.pyaudio_instance = pyaudio.PyAudio()
71
+ self.audio_stream = None
72
+
73
+ # Threads + async
74
+ self.recording_thread: Optional[threading.Thread] = None
75
+ self.async_loop = asyncio.new_event_loop()
76
+
77
+ # TTS queue + consumer task
78
+ self._tts_queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
79
+ self._tts_consumer_task: Optional[asyncio.Task] = None
80
+
81
+ # Start async loop in separate thread
82
+ self.async_thread = threading.Thread(target=self._run_async_loop, daemon=True)
83
+ self.async_thread.start()
84
+
85
+ # schedule tts consumer creation inside the async loop
86
+ def _start_consumer():
87
+ self._tts_consumer_task = asyncio.create_task(self._tts_consumer())
88
+ self.async_loop.call_soon_threadsafe(_start_consumer)
89
+
90
+ self.stt_threads: Dict[str, threading.Thread] = {}
91
+
92
+ # Per-language restart events (used to tell threads when to start new streams)
93
+ self.restart_events: Dict[str, threading.Event] = {
94
+ "en-US": threading.Event(),
95
+ "fr-FR": threading.Event(),
96
+ }
97
+
98
+ # Per-language stream started flag
99
+ self._stream_started = {"en-US": False, "fr-FR": False}
100
+
101
+ # **NEW**: per-language cancel events to force request_generator to stop
102
+ self.stream_cancel_events: Dict[str, threading.Event] = {
103
+ "en-US": threading.Event(),
104
+ "fr-FR": threading.Event(),
105
+ }
106
+
107
+ # Diagnostics
108
+ self._tts_job_counter = 0
109
+
110
+ def _run_async_loop(self):
111
+ asyncio.set_event_loop(self.async_loop)
112
+ try:
113
+ self.async_loop.run_forever()
114
+ except Exception as e:
115
+ print("[async_loop] stopped with error:", e)
116
+
117
+ # ---------------------------
118
+ # Audio capture
119
+ # ---------------------------
120
+ def _record_audio(self):
121
+ try:
122
+ stream = self.pyaudio_instance.open(
123
+ format=pyaudio.paInt16,
124
+ channels=1,
125
+ rate=self.audio_rate,
126
+ input=True,
127
+ frames_per_buffer=self.audio_chunk,
128
+ )
129
+ print("🎤 Recording started...")
130
+
131
+ while self.is_recording:
132
+ if self.speaking_event.is_set():
133
+ time.sleep(0.01)
134
+ continue
135
+
136
+ try:
137
+ data = stream.read(self.audio_chunk, exception_on_overflow=False)
138
+ except Exception as e:
139
+ print(f"[recorder] read error: {e}")
140
+ continue
141
+
142
+ if not data:
143
+ continue
144
+
145
+ self.prebuffer.append(data)
146
+ self.lang_queues["en-US"].put(data)
147
+ self.lang_queues["fr-FR"].put(data)
148
+
149
+ try:
150
+ stream.stop_stream()
151
+ stream.close()
152
+ except Exception:
153
+ pass
154
+ print("🎤 Recording stopped.")
155
+ except Exception as e:
156
+ print(f"[recorder] fatal: {e}")
157
+
158
+ # ---------------------------
159
+ # TTS streaming (ElevenLabs) - async
160
+ # ---------------------------
161
+ async def _stream_tts(self, text: str):
162
+ uri = (
163
+ f"wss://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}"
164
+ f"/stream-input?model_id=eleven_flash_v2_5&output_format=pcm_16000"
165
+ )
166
+ tts_audio_stream = None
167
+ websocket = None
168
+ try:
169
+ # Mark speaking and set event so recorder & STT pause
170
+ self.is_speaking = True
171
+ self.speaking_event.set()
172
+ # print(f"[{time.strftime('%H:%M:%S')}] [tts] speaking -> True")
173
+
174
+ # Clear queued frames to avoid replay; we'll re-inject prebuffer after we cancel streams
175
+ for q in self.lang_queues.values():
176
+ with q.mutex:
177
+ q.queue.clear()
178
+
179
+ websocket = await websockets.connect(uri)
180
+ await websocket.send(json.dumps({
181
+ "text": " ",
182
+ "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
183
+ "xi_api_key": self.elevenlabs_api_key,
184
+ }))
185
+ await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))
186
+ await websocket.send(json.dumps({"text": ""}))
187
+
188
+ tts_audio_stream = self.pyaudio_instance.open(
189
+ format=pyaudio.paInt16,
190
+ channels=1,
191
+ rate=16000,
192
+ output=True,
193
+ frames_per_buffer=1024,
194
+ )
195
+
196
+ prebuffer = bytearray()
197
+ playback_started = False
198
+
199
+ try:
200
+ while True:
201
+ try:
202
+ message = await asyncio.wait_for(websocket.recv(), timeout=8.0)
203
+ except asyncio.TimeoutError:
204
+ if playback_started:
205
+ break
206
+ else:
207
+ continue
208
+
209
+ if isinstance(message, bytes):
210
+ prebuffer.extend(message)
211
+ if not playback_started and len(prebuffer) >= 16000:
212
+ tts_audio_stream.write(bytes(prebuffer))
213
+ prebuffer.clear()
214
+ playback_started = True
215
+ elif playback_started:
216
+ tts_audio_stream.write(message)
217
+ continue
218
+
219
+ try:
220
+ data = json.loads(message)
221
+ except Exception:
222
+ continue
223
+
224
+ if data.get("audio"):
225
+ audio_bytes = base64.b64decode(data["audio"])
226
+ prebuffer.extend(audio_bytes)
227
+ if not playback_started and len(prebuffer) >= 16000:
228
+ tts_audio_stream.write(bytes(prebuffer))
229
+ prebuffer.clear()
230
+ playback_started = True
231
+ elif playback_started:
232
+ tts_audio_stream.write(audio_bytes)
233
+ elif data.get("isFinal"):
234
+ break
235
+ elif data.get("error"):
236
+ print("TTS error:", data["error"])
237
+ break
238
+
239
+ if prebuffer:
240
+ tts_audio_stream.write(bytes(prebuffer))
241
+
242
+ finally:
243
+ try:
244
+ await websocket.close()
245
+ except Exception:
246
+ pass
247
+
248
+ except Exception as e:
249
+ # print(f"[tts] error: {e}")
250
+ pass
251
+ finally:
252
+ if tts_audio_stream:
253
+ try:
254
+ tts_audio_stream.stop_stream()
255
+ tts_audio_stream.close()
256
+ except Exception:
257
+ pass
258
+
259
+ # **NEW**: force the STT request generators to exit by setting cancel events.
260
+ # This makes streaming_recognize finish; threads will then wait for restart_events
261
+ # and start fresh streams.
262
+ for lang, ev in self.stream_cancel_events.items():
263
+ ev.set()
264
+ # print(f"[{time.strftime('%H:%M:%S')}] [cancel] set -> {lang}")
265
+
266
+ # Now re-inject prebuffer so new streams start with warm audio
267
+ pre_list = list(self.prebuffer)
268
+ if pre_list:
269
+ print(f"[{time.strftime('%H:%M:%S')}] [prebuffer] re-injecting {len(pre_list)} chunks into queues")
270
+ for chunk in pre_list:
271
+ self.lang_queues["en-US"].put(chunk)
272
+ self.lang_queues["fr-FR"].put(chunk)
273
+
274
+ # Clear speaking state and signal STT threads to restart (robustly)
275
+ self.is_speaking = False
276
+ self.speaking_event.clear()
277
+ # print(f"[{time.strftime('%H:%M:%S')}] [tts] speaking -> False")
278
+
279
+ # Primary restart: set both events
280
+ for lang, ev in self.restart_events.items():
281
+ ev.set()
282
+ # print(f"[{time.strftime('%H:%M:%S')}] [restart] set -> {lang}")
283
+
284
+ await asyncio.sleep(0.25)
285
+ for lang, ev in self.restart_events.items():
286
+ ev.set()
287
+ await asyncio.sleep(0.25)
288
+
289
+ # ---------------------------
290
+ # TTS consumer (serializes TTS)
291
+ # ---------------------------
292
+ async def _tts_consumer(self):
293
+ print("[tts_consumer] started")
294
+ while True:
295
+ item = await self._tts_queue.get()
296
+ if item is None:
297
+ print("[tts_consumer] shutdown sentinel received")
298
+ break
299
+ text = item.get("text", "")
300
+ self._tts_job_counter += 1
301
+ job_id = self._tts_job_counter
302
+ print(f"[tts_consumer] job #{job_id} dequeued (len={len(text)})")
303
+ try:
304
+ await asyncio.wait_for(self._stream_tts(text), timeout=35.0)
305
+ except asyncio.TimeoutError:
306
+ print(f"[tts_consumer] job #{job_id} _stream_tts timed out; proceeding.")
307
+ except Exception as e:
308
+ print(f"[tts_consumer] job #{job_id} error during _stream_tts: {e}")
309
+ finally:
310
+ await asyncio.sleep(0.05)
311
+ print("[tts_consumer] exiting")
312
+
313
+ # ---------------------------
314
+ # Translation & TTS trigger
315
+ # ---------------------------
316
+ async def _process_result(self, transcript: str, confidence: float, language: str):
317
+ lang_flag = "🇫🇷" if language == "fr-FR" else "🇬🇧"
318
+ print(f"{lang_flag} Heard ({language}, conf {confidence:.2f}): {transcript}")
319
+
320
+ # echo suppression vs last TTS in same language
321
+ if language == "fr-FR":
322
+ if transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
323
+ print(" (echo suppressed)")
324
+ return
325
+ else:
326
+ if transcript.strip().lower() == self.last_tts_text_en.strip().lower():
327
+ print(" (echo suppressed)")
328
+ return
329
+
330
+ try:
331
+ if language == "fr-FR":
332
+ translated = self.deepl_client.translate_text(transcript, target_lang="EN-US").text
333
+ print(f"🌐 FR → EN: {translated}")
334
+ await self._tts_queue.put({"text": translated, "source_lang": language})
335
+ self.last_tts_text_en = translated
336
+ else:
337
+ translated = self.deepl_client.translate_text(transcript, target_lang="FR").text
338
+ print(f"🌐 EN → FR: {translated}")
339
+ await self._tts_queue.put({"text": translated, "source_lang": language})
340
+ self.last_tts_text_fr = translated
341
+ print("🔊 Queued for speaking...")
342
+ except Exception as e:
343
+ print(f"Translation error: {e}")
344
+
345
+ # ---------------------------
346
+ # STT streaming (run per language)
347
+ # ---------------------------
348
+ def _run_stt_stream(self, language: str):
349
+ print(f"[stt:{language}] Thread starting, thread_id={threading.get_ident()}")
350
+ self._stream_started[language] = False
351
+
352
+ while self.is_recording:
353
+ try:
354
+ if self._stream_started[language]:
355
+ print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Waiting for restart signal...")
356
+ signaled = self.restart_events[language].wait(timeout=30)
357
+ if not signaled and self.is_recording:
358
+ print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Timeout waiting for restart, restarting anyway")
359
+ if not self.is_recording:
360
+ break
361
+ try:
362
+ self.restart_events[language].clear()
363
+ except Exception:
364
+ pass
365
+ time.sleep(0.01)
366
+
367
+ self._stream_started[language] = True
368
+ print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Starting new stream...")
369
+
370
+ config = speech.RecognitionConfig(
371
+ encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
372
+ sample_rate_hertz=self.audio_rate,
373
+ language_code=language,
374
+ enable_automatic_punctuation=True,
375
+ model="latest_short",
376
+ )
377
+ streaming_config = speech.StreamingRecognitionConfig(
378
+ config=config,
379
+ interim_results=True,
380
+ single_utterance=False,
381
+ )
382
+
383
+ # Request generator yields StreamingRecognizeRequest messages
384
+ def request_generator():
385
+ while self.is_recording:
386
+ # If TTS is playing, skip sending mic frames to STT
387
+ if self.speaking_event.is_set():
388
+ time.sleep(0.01)
389
+ continue
390
+ # If cancel event set, clear and break to end stream
391
+ if self.stream_cancel_events[language].is_set():
392
+ # print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] request_generator observed cancel -> exiting generator")
393
+ try:
394
+ self.stream_cancel_events[language].clear()
395
+ except Exception:
396
+ pass
397
+ break
398
+ try:
399
+ chunk = self.lang_queues[language].get(timeout=1.0)
400
+ except queue.Empty:
401
+ continue
402
+ yield speech.StreamingRecognizeRequest(audio_content=chunk)
403
+
404
+ responses = self.stt_client.streaming_recognize(streaming_config, request_generator())
405
+
406
+ response_count = 0
407
+ final_received = False
408
+
409
+ for response in responses:
410
+ if not self.is_recording:
411
+ print(f"[stt:{language}] Stopped by user")
412
+ break
413
+ if not response.results:
414
+ continue
415
+
416
+ response_count += 1
417
+ for result in response.results:
418
+ if not result.alternatives:
419
+ continue
420
+ alt = result.alternatives[0]
421
+                         transcript = alt.transcript.strip()
+                         conf = getattr(alt, "confidence", 0.0)
+                         is_final = bool(result.is_final)
+
+                         if is_final:
+                             now = time.strftime("%H:%M:%S")
+                             print(f"[{now}] [stt:{language}] → '{transcript}' (final={is_final}, conf={conf:.2f})")
+                             if conf < self.min_confidence_threshold:
+                                 print(f"[{now}] [stt:{language}] Final received but confidence {conf:.2f} < threshold -> suppressed")
+                                 continue
+
+                             # Echo suppression: ignore transcripts that match our own TTS output
+                             if language == "fr-FR" and transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
+                                 print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_fr)")
+                                 continue
+                             if language == "en-US" and transcript.strip().lower() == self.last_tts_text_en.strip().lower():
+                                 print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_en)")
+                                 continue
+
+                             asyncio.run_coroutine_threadsafe(
+                                 self._process_result(transcript, conf, language),
+                                 self.async_loop
+                             )
+                             final_received = True
+                             break
+
+                     if final_received:
+                         break
+
+                 print(f"[stt:{language}] Stream ended after {response_count} responses")
+
+                 if self.is_recording and final_received:
+                     print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Final result processed. Waiting for TTS to complete and signal restart.")
+                 elif self.is_recording and not final_received:
+                     print(f"[stt:{language}] Stream ended unexpectedly, reconnecting...")
+                     time.sleep(0.5)
+                 else:
+                     break
+
+             except Exception as e:
+                 if self.is_recording:
+                     import traceback
+                     print(f"[stt:{language}] Error: {e}")
+                     print(traceback.format_exc())
+                     time.sleep(1.0)
+                 else:
+                     break
+
+         print(f"[stt:{language}] Thread exiting")
+
+     # ---------------------------
+     # Control
+     # ---------------------------
+     def start_translation(self):
+         if self.is_recording:
+             print("Already recording!")
+             return
+         self.is_recording = True
+         self.last_processed_transcript = ""
+
+         # Reset restart/speaking events before spinning up fresh streams
+         for ev in self.restart_events.values():
+             try:
+                 ev.clear()
+             except Exception:
+                 pass
+         self.speaking_event.clear()
+
+         # Drop any stale audio left over from a previous session
+         for q in self.lang_queues.values():
+             with q.mutex:
+                 q.queue.clear()
+
+         self.recording_thread = threading.Thread(target=self._record_audio, daemon=True)
+         self.recording_thread.start()
+
+         for lang in ("en-US", "fr-FR"):
+             t = threading.Thread(target=self._run_stt_stream, args=(lang,), daemon=True)
+             self.stt_threads[lang] = t
+             t.start()
+             print(f"[main] STT thread {lang} started: {t.is_alive()} at {time.strftime('%H:%M:%S')}")
+
+         for ev in self.restart_events.values():
+             ev.set()
+
+     def stop_translation(self):
+         print("\n⏹️ Stopping translation...")
+         self.is_recording = False
+         for ev in self.restart_events.values():
+             ev.set()
+         self.speaking_event.clear()
+
+         # Ask the TTS consumer task to exit by queueing a sentinel
+         if self._tts_consumer_task and not self._tts_consumer_task.done():
+             try:
+                 def _put_sentinel():
+                     try:
+                         self._tts_queue.put_nowait(None)
+                     except Exception:
+                         asyncio.create_task(self._tts_queue.put(None))
+                 self.async_loop.call_soon_threadsafe(_put_sentinel)
+             except Exception:
+                 pass
+
+         time.sleep(0.2)
+
+     def cleanup(self):
+         self.stop_translation()
+         try:
+             if self.async_loop.is_running():
+                 def _stop_loop():
+                     if self._tts_consumer_task and not self._tts_consumer_task.done():
+                         try:
+                             self._tts_queue.put_nowait(None)
+                         except Exception:
+                             pass
+                     self.async_loop.stop()
+                 self.async_loop.call_soon_threadsafe(_stop_loop)
+         except Exception:
+             pass
+         try:
+             self.pyaudio_instance.terminate()
+         except Exception:
+             pass
+
+ # -----------------------------------------------------------------------------
+ # Main entry
+ # -----------------------------------------------------------------------------
+ def main():
+     load_dotenv()
+     google_creds = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
+     deepl_key = os.getenv("DEEPL_API_KEY")
+     eleven_key = os.getenv("ELEVENLABS_API_KEY")
+     voice_id = os.getenv("ELEVENLABS_VOICE_ID")
+
+     if not all([google_creds, deepl_key, eleven_key, voice_id]):
+         print("Missing API keys or credentials.")
+         return
+
+     translator = VoiceTranslator(deepl_key, eleven_key, voice_id)
+     print("Ready! Press ENTER to start, ENTER again to stop, Ctrl+C to quit.\n")
+
+     try:
+         while True:
+             input("Press ENTER to start speaking...")
+             translator.start_translation()
+             input("Press ENTER to stop...\n")
+             translator.stop_translation()
+     except KeyboardInterrupt:
+         print("\nKeyboardInterrupt received — cleaning up.")
+         translator.cleanup()
+
+ if __name__ == "__main__":
+     main()
mic_check.py ADDED
@@ -0,0 +1,5 @@
+ import pyaudio
+ p = pyaudio.PyAudio()
+ print("Default input device index:", p.get_default_input_device_info()['index'])
+ print("Default input name:", p.get_default_input_device_info()['name'])
+ p.terminate()
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ google-cloud-speech
+ deepl
+ pyaudio
+ websockets
+ python-dotenv
working.py ADDED
@@ -0,0 +1,334 @@
+ """
+ Real-Time French/English Voice Translator - FIXED VERSION v4.2
+ Improvements:
+ - Removed noisy [audio_gen]/[tts] prints
+ - Added a TTS pre-buffer to eliminate start-of-playback bursts
+ - Added silence-based auto-finalization when no STT final is detected
+ - Switched to the "latest_long" model for better segmentation
+ - Added echo suppression (skip self-spoken TTS text)
+ """
+
+ import asyncio
+ import base64
+ import json
+ import os
+ import queue
+ import threading
+ import time
+ from typing import Optional
+
+ import deepl
+ import pyaudio
+ import websockets
+ from dotenv import load_dotenv
+ from google.cloud import speech
+
+
+ class VoiceTranslator:
+     def __init__(self, deepl_api_key: str, elevenlabs_api_key: str, elevenlabs_voice_id: str):
+         self.stt_client = speech.SpeechClient()
+         self.deepl_client = deepl.Translator(deepl_api_key)
+         self.elevenlabs_api_key = elevenlabs_api_key
+         self.voice_id = elevenlabs_voice_id
+
+         self.audio_rate = 16000
+         self.audio_chunk = 1024
+
+         self.audio_queue_en = queue.Queue()
+         self.audio_queue_fr = queue.Queue()
+         self.result_queue = queue.Queue()
+         self.is_recording = False
+         self.processing_lock = threading.Lock()
+
+         self.last_processed_transcript = ""
+         self.last_tts_text = ""
+
+         self.pyaudio_instance = pyaudio.PyAudio()
+         self.audio_stream = None
+
+     # ---------- AUDIO CAPTURE ----------
+
+     def _audio_generator(self, audio_queue: queue.Queue):
+         while self.is_recording:
+             try:
+                 chunk = audio_queue.get(timeout=0.2)
+                 if chunk:
+                     yield chunk
+             except queue.Empty:
+                 continue
+
+     def _record_audio(self):
+         try:
+             stream = self.pyaudio_instance.open(
+                 format=pyaudio.paInt16,
+                 channels=1,
+                 rate=self.audio_rate,
+                 input=True,
+                 frames_per_buffer=self.audio_chunk,
+             )
+             print("🎤 Recording started...")
+             while self.is_recording:
+                 try:
+                     data = stream.read(self.audio_chunk, exception_on_overflow=False)
+                     if not data:
+                         continue
+                     # Duplicate the audio so both language recognizers hear everything
+                     self.audio_queue_en.put(data)
+                     self.audio_queue_fr.put(data)
+                 except Exception as e:
+                     print(f"[recorder] error: {e}")
+                     break
+             stream.stop_stream()
+             stream.close()
+             print("🎤 Recording stopped.")
+         except Exception as e:
+             print(f"[recorder] fatal: {e}")
+
+     # ---------- TEXT TO SPEECH ----------
+
+     async def _stream_tts(self, text: str):
+         """Stream TTS with a small pre-buffer to smooth playback."""
+         uri = (
+             f"wss://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}"
+             f"/stream-input?model_id=eleven_flash_v2_5&output_format=pcm_16000"
+         )
+
+         try:
+             async with websockets.connect(uri) as websocket:
+                 await websocket.send(json.dumps({
+                     "text": " ",
+                     "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
+                     "xi_api_key": self.elevenlabs_api_key,
+                 }))
+                 await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))
+                 await websocket.send(json.dumps({"text": ""}))
+
+                 if self.audio_stream is None:
+                     self.audio_stream = self.pyaudio_instance.open(
+                         format=pyaudio.paInt16,
+                         channels=1,
+                         rate=16000,
+                         output=True,
+                         frames_per_buffer=1024,
+                     )
+
+                 prebuffer = bytearray()
+                 playback_started = False
+
+                 async for message in websocket:
+                     if isinstance(message, bytes):
+                         if playback_started:
+                             self.audio_stream.write(message)
+                         else:
+                             # Buffer ~0.5 s before starting playback
+                             # (16000 bytes = 0.5 s of 16 kHz 16-bit mono PCM)
+                             prebuffer.extend(message)
+                             if len(prebuffer) >= 16000:
+                                 self.audio_stream.write(bytes(prebuffer))
+                                 prebuffer.clear()
+                                 playback_started = True
+                         continue
+
+                     try:
+                         data = json.loads(message)
+                     except Exception:
+                         continue
+
+                     if data.get("audio"):
+                         audio_bytes = base64.b64decode(data["audio"])
+                         if playback_started:
+                             self.audio_stream.write(audio_bytes)
+                         else:
+                             prebuffer.extend(audio_bytes)
+                             if len(prebuffer) >= 16000:
+                                 self.audio_stream.write(bytes(prebuffer))
+                                 prebuffer.clear()
+                                 playback_started = True
+                     elif data.get("isFinal"):
+                         break
+                     elif data.get("error"):
+                         print("TTS error:", data["error"])
+                         break
+
+                 # Flush whatever is still buffered (utterances shorter than the pre-buffer)
+                 if prebuffer:
+                     self.audio_stream.write(bytes(prebuffer))
+         except Exception as e:
+             print(f"[tts] error: {e}")
+
+     # ---------- TRANSLATION ----------
+
+     async def _process_result(self, transcript: str, confidence: Optional[float], language: str):
+         lang_flag = "🇫🇷" if language == "fr-FR" else "🇬🇧"
+         conf_display = f"{confidence:.2f}" if confidence is not None else "n/a"
+         print(f"{lang_flag} Heard ({language}, conf {conf_display}): {transcript}")
+
+         # Simple echo suppression: skip what we just spoke ourselves
+         if transcript.strip().lower() == self.last_tts_text.strip().lower():
+             return
+
+         try:
+             if language == "fr-FR":
+                 translated = self.deepl_client.translate_text(transcript, target_lang="EN-US").text
+                 print(f"🌐 FR → EN: {translated}")
+             else:
+                 translated = self.deepl_client.translate_text(transcript, target_lang="FR").text
+                 print(f"🌐 EN → FR: {translated}")
+
+             self.last_tts_text = translated
+             print("🔊 Speaking...")
+             await self._stream_tts(translated)
+             print("✅ Done\n")
+
+         except Exception as e:
+             print(f"Translation error: {e}")
+
+     # ---------- STT STREAMING ----------
+
+     def _run_stt_stream(self, language: str, audio_queue: queue.Queue):
+         print(f"[stt] Thread start for {language}")
+
+         config = speech.RecognitionConfig(
+             encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
+             sample_rate_hertz=self.audio_rate,
+             language_code=language,
+             enable_automatic_punctuation=True,
+             model="latest_long",
+         )
+
+         streaming_config = speech.StreamingRecognitionConfig(
+             config=config, interim_results=True, single_utterance=False
+         )
+
+         def requests():
+             for content in self._audio_generator(audio_queue):
+                 yield speech.StreamingRecognizeRequest(audio_content=content)
+
+         try:
+             responses = self.stt_client.streaming_recognize(streaming_config, requests())
+
+             last_update_time = time.time()
+             current_text = ""
+             for response in responses:
+                 if not self.is_recording:
+                     break
+                 if not response.results:
+                     continue
+
+                 for result in response.results:
+                     if not result.alternatives:
+                         continue
+                     alt = result.alternatives[0]
+                     transcript = alt.transcript.strip()
+                     conf = getattr(alt, "confidence", None)
+                     current_text = transcript
+                     last_update_time = time.time()
+
+                     self.result_queue.put({
+                         "transcript": transcript,
+                         "confidence": conf,
+                         "language": language,
+                         "is_final": bool(result.is_final),
+                     })
+
+                 # If we haven't heard anything new for 1.2 s, flush it as "final"
+                 # (checked only when a response arrives, so this is a best-effort fallback)
+                 if time.time() - last_update_time > 1.2 and current_text:
+                     self.result_queue.put({
+                         "transcript": current_text,
+                         "confidence": 0.5,
+                         "language": language,
+                         "is_final": True,
+                     })
+                     current_text = ""
+
+         except Exception as e:
+             print(f"[stt:{language}] exception: {e}")
+
+     # ---------- RESULT AGGREGATION ----------
+
+     async def _process_results_queue(self):
+         while self.is_recording:
+             try:
+                 r = self.result_queue.get(timeout=0.2)
+                 if r["is_final"] and r["transcript"] != self.last_processed_transcript:
+                     with self.processing_lock:
+                         self.last_processed_transcript = r["transcript"]
+                         await self._process_result(
+                             r["transcript"], r.get("confidence"), r["language"]
+                         )
+                 await asyncio.sleep(0.01)
+             except queue.Empty:
+                 await asyncio.sleep(0.05)
+             except Exception as e:
+                 print("Queue error:", e)
+                 await asyncio.sleep(0.1)
+
+     # ---------- CONTROL ----------
+
+     async def _run_dual_streams(self):
+         print("🔄 Dual-stream: English ⇄ French\n")
+         en_thread = threading.Thread(target=self._run_stt_stream, args=("en-US", self.audio_queue_en), daemon=True)
+         fr_thread = threading.Thread(target=self._run_stt_stream, args=("fr-FR", self.audio_queue_fr), daemon=True)
+         en_thread.start()
+         fr_thread.start()
+         await self._process_results_queue()
+
+     def start_translation(self):
+         if self.is_recording:
+             print("Already recording!")
+             return
+         self.is_recording = True
+         self.last_processed_transcript = ""
+         # Drain results left over from a previous session
+         while not self.result_queue.empty():
+             try:
+                 self.result_queue.get_nowait()
+             except queue.Empty:
+                 break
+         threading.Thread(target=self._record_audio, daemon=True).start()
+         try:
+             asyncio.run(self._run_dual_streams())
+         except KeyboardInterrupt:
+             self.stop_translation()
+
+     def stop_translation(self):
+         print("\n⏹️ Stopping translation...")
+         self.is_recording = False
+         if self.audio_stream:
+             try:
+                 self.audio_stream.stop_stream()
+                 self.audio_stream.close()
+             except Exception:
+                 pass
+             self.audio_stream = None
+
+     def cleanup(self):
+         self.stop_translation()
+         try:
+             self.pyaudio_instance.terminate()
+         except Exception:
+             pass
+
+
+ # ---------- MAIN ----------
+
+ def main():
+     load_dotenv()
+     google_creds = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
+     deepl_key = os.getenv("DEEPL_API_KEY")
+     eleven_key = os.getenv("ELEVENLABS_API_KEY")
+     voice_id = os.getenv("ELEVENLABS_VOICE_ID")
+
+     if not all([google_creds, deepl_key, eleven_key, voice_id]):
+         print("Missing API keys or credentials.")
+         return
+
+     translator = VoiceTranslator(deepl_key, eleven_key, voice_id)
+     print("Ready! Press ENTER to start, ENTER again to stop, Ctrl+C to quit.\n")
+
+     try:
+         while True:
+             input("Press ENTER to start speaking...")
+             threading.Thread(target=translator.start_translation, daemon=True).start()
+             input("Press ENTER to stop...\n")
+             translator.stop_translation()
+     except KeyboardInterrupt:
+         translator.cleanup()
+
+
+ if __name__ == "__main__":
+     main()