feat: Create web application for voice translator
This commit transforms the command-line voice translator into a web application with a FastAPI backend and a simple HTML/JS frontend.
Key changes include:
- Refactoring the `VoiceTranslator` class into a separate `translator.py` module and adapting it to use asyncio queues for audio I/O.
- Creating a `server.py` with a WebSocket endpoint to handle real-time audio streaming.
- Implementing server-side audio conversion from WebM to PCM using FFmpeg.
- Adding an `index.html` file for the user interface.
- Including a `Dockerfile` and `packages.txt` for deployment on HuggingFace Spaces.
- Updating `requirements.txt` with the necessary web and audio processing dependencies.
- Revising the `README.md` with instructions for running the web app locally and deploying it.
- Dockerfile +22 -0
- README.md +42 -43
- app.py +0 -579
- index.html +137 -0
- packages.txt +1 -0
- requirements.txt +4 -1
- server.py +96 -0
- translator.py +276 -0
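The server-side WebM-to-PCM conversion mentioned above can be sketched as follows. This is an illustrative assumption, not the actual `server.py` code: the function names (`build_ffmpeg_cmd`, `webm_chunk_to_pcm`) and the one-process-per-blob design are made up for clarity.

```python
import subprocess

def build_ffmpeg_cmd(rate: int = 16000) -> list:
    """FFmpeg arguments that decode WebM/Opus from stdin into raw
    16-bit mono PCM on stdout, matching the 16 kHz STT input format."""
    return [
        "ffmpeg", "-loglevel", "quiet",
        "-i", "pipe:0",            # WebM blob from the browser on stdin
        "-f", "s16le",             # raw signed 16-bit little-endian samples
        "-acodec", "pcm_s16le",
        "-ac", "1",                # mono
        "-ar", str(rate),          # resample to the STT rate
        "pipe:1",                  # PCM out on stdout
    ]

def webm_chunk_to_pcm(webm_bytes: bytes, rate: int = 16000) -> bytes:
    """Convert one recorded WebM blob to raw PCM (requires ffmpeg on PATH)."""
    proc = subprocess.run(
        build_ffmpeg_cmd(rate),
        input=webm_bytes,
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout
```

The real server presumably pipes chunks through a long-lived FFmpeg process rather than spawning one per blob; the one-shot form above is just easier to follow.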
Dockerfile
@@ -0,0 +1,22 @@
+# Use an official Python runtime as a parent image
+FROM python:3.9-slim
+
+# Set the working directory in the container
+WORKDIR /app
+
+# Install system dependencies from packages.txt
+COPY packages.txt .
+RUN apt-get update && apt-get install -y $(cat packages.txt)
+
+# Copy the requirements file and install Python dependencies
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy the rest of the application code
+COPY . .
+
+# Expose the port the app runs on
+EXPOSE 8000
+
+# Command to run the application
+CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
README.md
@@ -1,6 +1,8 @@
-# Real-Time English/French Voice Translator
+# Real-Time English/French Voice Translator Web App
 
-This project provides a real-time, bidirectional voice translation application. Speak in English or French, and hear the translation in the other language almost instantly.
+This project provides a real-time, bidirectional voice translation web application. Speak in English or French into your browser, and hear the translation in the other language almost instantly.
+
+It is built to be easily deployed as a HuggingFace Space.
 
 It uses a combination of cutting-edge APIs for high-quality speech recognition, translation, and synthesis:
 
@@ -8,39 +10,35 @@ It uses a combination of cutting-edge APIs for high-quality speech recognition,
 - **Translation:** DeepL API
 - **Text-to-Speech (TTS):** ElevenLabs API
 
-
-*(Note: You can replace this with a real GIF of the application in action.)*
-
 ## Features
 
+- **Web-Based UI:** A simple and clean browser interface for real-time translation.
 - **Bidirectional Translation:** Simultaneously listens for both English and French and translates to the other language.
-- **Low Latency:** Built with `asyncio` and multithreading for a responsive, conversational experience.
+- **Low Latency:** Built with `asyncio`, WebSockets, and multithreading for a responsive, conversational experience.
 - **High-Quality Voice:** Leverages ElevenLabs for natural-sounding synthesized speech.
 - **Echo Suppression:** The translator is smart enough not to translate its own spoken output.
-- **Robust Streaming:** Automatically manages and restarts API connections to handle pauses in conversation.
-- **Simple CLI:** Easy to start and stop from the command line.
 
 ## How It Works
 
-The application
+The application is composed of a web frontend and a Python backend:
 
-1. **Audio Capture:**
-2. **
-3. **
-
-
-
+1. **Audio Capture (Frontend):** The browser's JavaScript captures audio from your microphone using the Web Audio API.
+2. **WebSocket Streaming:** The audio is chunked and streamed over a WebSocket connection to the FastAPI backend.
+3. **Backend Processing:**
+   - The `VoiceTranslator` class receives the audio stream.
+   - The audio is fed into two separate Google Cloud STT streams in parallel (`en-US` and `fr-FR`).
+   - When an STT stream detects a final utterance, it's sent to the DeepL API for translation.
+   - The translated text is sent to the ElevenLabs streaming TTS API.
+4. **Audio Playback (Frontend):** The synthesized audio from ElevenLabs is streamed back to the browser through the WebSocket and played instantly.
 
 ## Requirements
 
 ### 1. Software
 - Python 3.8+
 - `pip` and `venv`
-- **
-  - **macOS (via Homebrew):** `brew install
-  - **Debian/Ubuntu:** `sudo apt-get install
-  - **Windows:** `pyaudio` can often be installed via `pip` without manual PortAudio installation.
+- **FFmpeg:** This is a system dependency for audio format conversion.
+  - **macOS (via Homebrew):** `brew install ffmpeg`
+  - **Debian/Ubuntu:** `sudo apt-get install ffmpeg`
 
 ### 2. API Keys
 You will need active accounts and API keys for the following services:
@@ -51,14 +49,14 @@ You will need active accounts and API keys for the following services:
 - **DeepL:**
   - A DeepL API plan (the Free plan is sufficient for moderate use).
 - **ElevenLabs:**
-  - An ElevenLabs account
+  - An ElevenLabs account and your **Voice ID** for the desired voice.
 
 ## Installation & Setup
 
 1. **Clone the Repository**
    ```bash
   git clone <your-repository-url>
-   cd realtime-translator
+   cd realtime-translator-webapp # Or your directory name
    ```
 
 2. **Create a Virtual Environment**
@@ -68,15 +66,7 @@ You will need active accounts and API keys for the following services:
    ```
 
 3. **Install Dependencies**
-
-   ```
-   pyaudio
-   websockets
-   google-cloud-speech
-   deepl
-   python-dotenv
-   ```
-   Then, install the packages:
+   Install the Python packages from `requirements.txt`:
    ```bash
    pip install -r requirements.txt
    ```
@@ -86,27 +76,36 @@ You will need active accounts and API keys for the following services:
 
    ```env
    # Path to your Google Cloud service account JSON file
-   GOOGLE_APPLICATION_CREDENTIALS="
+   GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/google-credentials.json"
 
    # Your DeepL API Key
-
+   DEEPL_API_KEY="YOUR_DEEPL_API_KEY"
 
    # Your ElevenLabs API Key and Voice ID
    ELEVENLABS_API_KEY="YOUR_ELEVENLABS_API_KEY"
    ELEVENLABS_VOICE_ID="YOUR_ELEVENLABS_VOICE_ID"
    ```
 
-## Usage
-
-   ```bash
-   python app.py
-   ```
-
-
+## Local Usage
+
+1. **Start the Server**
+   Run the Uvicorn server from the project root:
+   ```bash
+   uvicorn server:app --reload
+   ```
+
+2. **Use the Application**
+   - Open your web browser and navigate to `http://127.0.0.1:8000`.
+   - Click the "Start Translation" button. Your browser will ask for microphone permission.
+   - Speak in either English or French.
+   - The translated audio will play back automatically.
+   - Click "Stop Translation" to end the session.
+
+## Deploying to HuggingFace Spaces
+
+This application is ready to be deployed as a HuggingFace Space.
+
+1. Create a new Space on HuggingFace, selecting the "Docker" template.
+2. Upload the entire project contents to the Space repository.
+3. In the Space "Settings" tab, add your API keys (`GOOGLE_APPLICATION_CREDENTIALS`, `DEEPL_API_KEY`, `ELEVENLABS_API_KEY`, `ELEVENLABS_VOICE_ID`) as secrets. Make sure to also add your Google credentials file.
+4. The Space will automatically build the Docker image and start the application. Your translator will be live!
@@ -1,579 +0,0 @@
|
|
| 1 |
-
#!/usr/bin/env python3
|
| 2 |
-
"""
|
| 3 |
-
Real-Time French/English Voice Translator β cleaned version
|
| 4 |
-
|
| 5 |
-
Fixes applied:
|
| 6 |
-
- Fixed TTS echo caused by double-writing audio chunks
|
| 7 |
-
- Removed prebuffer re-injection that could cause echoes
|
| 8 |
-
- Added empty transcript filtering
|
| 9 |
-
- Added within-stream deduplication
|
| 10 |
-
- Removed unnecessary sleeps (reduced latency by ~900ms)
|
| 11 |
-
- Reduced TTS prebuffer from 1s to 0.5s for faster playback start
|
| 12 |
-
- Cleaned up diagnostic logging
|
| 13 |
-
|
| 14 |
-
Keep your env vars:
|
| 15 |
-
- GOOGLE_APPLICATION_CREDENTIALS, DEEPL_API_KEY, ELEVENLABS_API_KEY, ELEVENLABS_VOICE_ID
|
| 16 |
-
"""
|
| 17 |
-
|
| 18 |
-
import asyncio
|
| 19 |
-
import json
|
| 20 |
-
import queue
|
| 21 |
-
import threading
|
| 22 |
-
import time
|
| 23 |
-
import os
|
| 24 |
-
import base64
|
| 25 |
-
from collections import deque
|
| 26 |
-
from typing import Dict, Optional
|
| 27 |
-
|
| 28 |
-
import pyaudio
|
| 29 |
-
import websockets
|
| 30 |
-
from google.cloud import speech
|
| 31 |
-
import deepl
|
| 32 |
-
from dotenv import load_dotenv
|
| 33 |
-
|
| 34 |
-
# -----------------------------------------------------------------------------
|
| 35 |
-
# VoiceTranslator
|
| 36 |
-
# -----------------------------------------------------------------------------
|
| 37 |
-
class VoiceTranslator:
|
| 38 |
-
def __init__(self, deepl_api_key: str, elevenlabs_api_key: str, elevenlabs_voice_id: str):
|
| 39 |
-
# External clients
|
| 40 |
-
self.deepl_client = deepl.Translator(deepl_api_key)
|
| 41 |
-
self.elevenlabs_api_key = elevenlabs_api_key
|
| 42 |
-
self.voice_id = elevenlabs_voice_id
|
| 43 |
-
self.stt_client = speech.SpeechClient()
|
| 44 |
-
|
| 45 |
-
# Audio params
|
| 46 |
-
self.audio_rate = 16000
|
| 47 |
-
self.audio_chunk = 1024
|
| 48 |
-
|
| 49 |
-
# Per-language audio queues (raw mic frames)
|
| 50 |
-
self.lang_queues: Dict[str, queue.Queue] = {
|
| 51 |
-
"en-US": queue.Queue(),
|
| 52 |
-
"fr-FR": queue.Queue(),
|
| 53 |
-
}
|
| 54 |
-
|
| 55 |
-
# Small rolling prebuffer to avoid missing the first bits after a restart
|
| 56 |
-
self.prebuffer = deque(maxlen=12)
|
| 57 |
-
|
| 58 |
-
# State flags
|
| 59 |
-
self.is_recording = False
|
| 60 |
-
self.is_speaking = False
|
| 61 |
-
self.speaking_event = threading.Event()
|
| 62 |
-
|
| 63 |
-
# Deduplication
|
| 64 |
-
self.last_processed_transcript = ""
|
| 65 |
-
self.last_tts_text_en = ""
|
| 66 |
-
self.last_tts_text_fr = ""
|
| 67 |
-
|
| 68 |
-
# Threshold
|
| 69 |
-
self.min_confidence_threshold = 0.5
|
| 70 |
-
|
| 71 |
-
# PyAudio
|
| 72 |
-
self.pyaudio_instance = pyaudio.PyAudio()
|
| 73 |
-
self.audio_stream = None
|
| 74 |
-
|
| 75 |
-
# Threads + async
|
| 76 |
-
self.recording_thread: Optional[threading.Thread] = None
|
| 77 |
-
self.async_loop = asyncio.new_event_loop()
|
| 78 |
-
|
| 79 |
-
# TTS queue + consumer task
|
| 80 |
-
self._tts_queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
|
| 81 |
-
self._tts_consumer_task: Optional[asyncio.Task] = None
|
| 82 |
-
|
| 83 |
-
# Start async loop in separate thread
|
| 84 |
-
self.async_thread = threading.Thread(target=self._run_async_loop, daemon=True)
|
| 85 |
-
self.async_thread.start()
|
| 86 |
-
|
| 87 |
-
# schedule tts consumer creation inside the async loop
|
| 88 |
-
def _start_consumer():
|
| 89 |
-
self._tts_consumer_task = asyncio.create_task(self._tts_consumer())
|
| 90 |
-
self.async_loop.call_soon_threadsafe(_start_consumer)
|
| 91 |
-
|
| 92 |
-
self.stt_threads: Dict[str, threading.Thread] = {}
|
| 93 |
-
|
| 94 |
-
# Per-language restart events (used to tell threads when to start new streams)
|
| 95 |
-
self.restart_events: Dict[str, threading.Event] = {
|
| 96 |
-
"en-US": threading.Event(),
|
| 97 |
-
"fr-FR": threading.Event(),
|
| 98 |
-
}
|
| 99 |
-
|
| 100 |
-
# Per-language stream started flag
|
| 101 |
-
self._stream_started = {"en-US": False, "fr-FR": False}
|
| 102 |
-
|
| 103 |
-
# Per-language cancel events to force request_generator to stop
|
| 104 |
-
self.stream_cancel_events: Dict[str, threading.Event] = {
|
| 105 |
-
"en-US": threading.Event(),
|
| 106 |
-
"fr-FR": threading.Event(),
|
| 107 |
-
}
|
| 108 |
-
|
| 109 |
-
# Diagnostics
|
| 110 |
-
self._tts_job_counter = 0
|
| 111 |
-
|
| 112 |
-
def _run_async_loop(self):
|
| 113 |
-
asyncio.set_event_loop(self.async_loop)
|
| 114 |
-
try:
|
| 115 |
-
self.async_loop.run_forever()
|
| 116 |
-
except Exception as e:
|
| 117 |
-
print("[async_loop] stopped with error:", e)
|
| 118 |
-
|
| 119 |
-
# ---------------------------
|
| 120 |
-
# Audio capture
|
| 121 |
-
# ---------------------------
|
| 122 |
-
def _record_audio(self):
|
| 123 |
-
try:
|
| 124 |
-
stream = self.pyaudio_instance.open(
|
| 125 |
-
format=pyaudio.paInt16,
|
| 126 |
-
channels=1,
|
| 127 |
-
rate=self.audio_rate,
|
| 128 |
-
input=True,
|
| 129 |
-
frames_per_buffer=self.audio_chunk,
|
| 130 |
-
)
|
| 131 |
-
print("π€ Recording started...")
|
| 132 |
-
|
| 133 |
-
while self.is_recording:
|
| 134 |
-
if self.speaking_event.is_set():
|
| 135 |
-
time.sleep(0.01)
|
| 136 |
-
continue
|
| 137 |
-
|
| 138 |
-
try:
|
| 139 |
-
data = stream.read(self.audio_chunk, exception_on_overflow=False)
|
| 140 |
-
except Exception as e:
|
| 141 |
-
print(f"[recorder] read error: {e}")
|
| 142 |
-
continue
|
| 143 |
-
|
| 144 |
-
if not data:
|
| 145 |
-
continue
|
| 146 |
-
|
| 147 |
-
self.prebuffer.append(data)
|
| 148 |
-
self.lang_queues["en-US"].put(data)
|
| 149 |
-
self.lang_queues["fr-FR"].put(data)
|
| 150 |
-
|
| 151 |
-
try:
|
| 152 |
-
stream.stop_stream()
|
| 153 |
-
stream.close()
|
| 154 |
-
except Exception:
|
| 155 |
-
pass
|
| 156 |
-
print("π€ Recording stopped.")
|
| 157 |
-
except Exception as e:
|
| 158 |
-
print(f"[recorder] fatal: {e}")
|
| 159 |
-
|
| 160 |
-
# ---------------------------
|
| 161 |
-
# TTS streaming (ElevenLabs) - async
|
| 162 |
-
# ---------------------------
|
| 163 |
-
async def _stream_tts(self, text: str):
|
| 164 |
-
uri = (
|
| 165 |
-
f"wss://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}"
|
| 166 |
-
f"/stream-input?model_id=eleven_flash_v2_5&output_format=pcm_16000"
|
| 167 |
-
)
|
| 168 |
-
tts_audio_stream = None
|
| 169 |
-
websocket = None
|
| 170 |
-
try:
|
| 171 |
-
# Mark speaking and set event so recorder & STT pause
|
| 172 |
-
self.is_speaking = True
|
| 173 |
-
self.speaking_event.set()
|
| 174 |
-
|
| 175 |
-
# Clear prebuffer to avoid re-injecting TTS audio later
|
| 176 |
-
self.prebuffer.clear()
|
| 177 |
-
|
| 178 |
-
# Clear queued frames to avoid replay
|
| 179 |
-
for q in self.lang_queues.values():
|
| 180 |
-
with q.mutex:
|
| 181 |
-
q.queue.clear()
|
| 182 |
-
|
| 183 |
-
websocket = await websockets.connect(uri)
|
| 184 |
-
await websocket.send(json.dumps({
|
| 185 |
-
"text": " ",
|
| 186 |
-
"voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
|
| 187 |
-
"xi_api_key": self.elevenlabs_api_key,
|
| 188 |
-
}))
|
| 189 |
-
await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))
|
| 190 |
-
await websocket.send(json.dumps({"text": ""}))
|
| 191 |
-
|
| 192 |
-
tts_audio_stream = self.pyaudio_instance.open(
|
| 193 |
-
format=pyaudio.paInt16,
|
| 194 |
-
channels=1,
|
| 195 |
-
rate=16000,
|
| 196 |
-
output=True,
|
| 197 |
-
frames_per_buffer=1024,
|
| 198 |
-
)
|
| 199 |
-
|
| 200 |
-
prebuffer = bytearray()
|
| 201 |
-
playback_started = False
|
| 202 |
-
|
| 203 |
-
try:
|
| 204 |
-
while True:
|
| 205 |
-
try:
|
| 206 |
-
message = await asyncio.wait_for(websocket.recv(), timeout=8.0)
|
| 207 |
-
except asyncio.TimeoutError:
|
| 208 |
-
if playback_started:
|
| 209 |
-
break
|
| 210 |
-
else:
|
| 211 |
-
continue
|
| 212 |
-
|
| 213 |
-
if isinstance(message, bytes):
|
| 214 |
-
if not playback_started:
|
| 215 |
-
prebuffer.extend(message)
|
| 216 |
-
if len(prebuffer) >= 8000:
|
| 217 |
-
tts_audio_stream.write(bytes(prebuffer))
|
| 218 |
-
prebuffer.clear()
|
| 219 |
-
playback_started = True
|
| 220 |
-
else:
|
| 221 |
-
tts_audio_stream.write(message)
|
| 222 |
-
continue
|
| 223 |
-
|
| 224 |
-
try:
|
| 225 |
-
data = json.loads(message)
|
| 226 |
-
except Exception:
|
| 227 |
-
continue
|
| 228 |
-
|
| 229 |
-
if data.get("audio"):
|
| 230 |
-
audio_bytes = base64.b64decode(data["audio"])
|
| 231 |
-
if not playback_started:
|
| 232 |
-
prebuffer.extend(audio_bytes)
|
| 233 |
-
if len(prebuffer) >= 8000:
|
| 234 |
-
tts_audio_stream.write(bytes(prebuffer))
|
| 235 |
-
prebuffer.clear()
|
| 236 |
-
playback_started = True
|
| 237 |
-
else:
|
| 238 |
-
tts_audio_stream.write(audio_bytes)
|
| 239 |
-
elif data.get("isFinal"):
|
| 240 |
-
break
|
| 241 |
-
elif data.get("error"):
|
| 242 |
-
print("TTS error:", data["error"])
|
| 243 |
-
break
|
| 244 |
-
|
| 245 |
-
# Handle case where playback never started (very short audio)
|
| 246 |
-
if prebuffer and not playback_started:
|
| 247 |
-
tts_audio_stream.write(bytes(prebuffer))
|
| 248 |
-
|
| 249 |
-
finally:
|
| 250 |
-
try:
|
| 251 |
-
await websocket.close()
|
| 252 |
-
except Exception:
|
| 253 |
-
pass
|
| 254 |
-
|
| 255 |
-
except Exception as e:
|
| 256 |
-
pass
|
| 257 |
-
finally:
|
| 258 |
-
if tts_audio_stream:
|
| 259 |
-
try:
|
| 260 |
-
tts_audio_stream.stop_stream()
|
| 261 |
-
tts_audio_stream.close()
|
| 262 |
-
except Exception:
|
| 263 |
-
pass
|
| 264 |
-
|
| 265 |
-
# Force the STT request generators to exit by setting cancel events
|
| 266 |
-
for lang, ev in self.stream_cancel_events.items():
|
| 267 |
-
ev.set()
|
| 268 |
-
|
| 269 |
-
# Don't re-inject prebuffer - just clear the queues and let fresh audio come in
|
| 270 |
-
for q in self.lang_queues.values():
|
| 271 |
-
with q.mutex:
|
| 272 |
-
q.queue.clear()
|
| 273 |
-
|
| 274 |
-
# Clear speaking state and signal STT threads to restart
|
| 275 |
-
self.is_speaking = False
|
| 276 |
-
self.speaking_event.clear()
|
| 277 |
-
|
| 278 |
-
# Signal restart for both language streams
|
| 279 |
-
for lang, ev in self.restart_events.items():
|
| 280 |
-
ev.set()
|
| 281 |
-
|
| 282 |
-
await asyncio.sleep(0.1)
|
| 283 |
-
|
| 284 |
-
# ---------------------------
|
| 285 |
-
# TTS consumer (serializes TTS)
|
| 286 |
-
# ---------------------------
|
| 287 |
-
async def _tts_consumer(self):
|
| 288 |
-
print("[tts_consumer] started")
|
| 289 |
-
while True:
|
| 290 |
-
item = await self._tts_queue.get()
|
| 291 |
-
if item is None:
|
| 292 |
-
print("[tts_consumer] shutdown sentinel received")
|
| 293 |
-
break
|
| 294 |
-
text = item.get("text", "")
|
| 295 |
-
self._tts_job_counter += 1
|
| 296 |
-
job_id = self._tts_job_counter
|
| 297 |
-
print(f"[tts_consumer] job #{job_id} dequeued (len={len(text)})")
|
| 298 |
-
try:
|
| 299 |
-
await asyncio.wait_for(self._stream_tts(text), timeout=35.0)
|
| 300 |
-
except asyncio.TimeoutError:
|
| 301 |
-
print(f"[tts_consumer] job #{job_id} _stream_tts timed out; proceeding.")
|
| 302 |
-
except Exception as e:
|
| 303 |
-
print(f"[tts_consumer] job #{job_id} error during _stream_tts: {e}")
|
| 304 |
-
finally:
|
| 305 |
-
await asyncio.sleep(0.05)
|
| 306 |
-
print("[tts_consumer] exiting")
|
| 307 |
-
|
| 308 |
-
# ---------------------------
|
| 309 |
-
# Translation & TTS trigger
|
| 310 |
-
# ---------------------------
|
| 311 |
-
async def _process_result(self, transcript: str, confidence: float, language: str):
|
| 312 |
-
lang_flag = "π«π·" if language == "fr-FR" else "π¬π§"
|
| 313 |
-
print(f"{lang_flag} Heard ({language}, conf {confidence:.2f}): {transcript}")
|
| 314 |
-
|
| 315 |
-
# echo suppression vs last TTS in same language
|
| 316 |
-
if language == "fr-FR":
|
| 317 |
-
if transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
|
| 318 |
-
print(" (echo suppressed)")
|
| 319 |
-
return
|
| 320 |
-
else:
|
| 321 |
-
if transcript.strip().lower() == self.last_tts_text_en.strip().lower():
|
| 322 |
-
print(" (echo suppressed)")
|
| 323 |
-
return
|
| 324 |
-
|
| 325 |
-
try:
|
| 326 |
-
if language == "fr-FR":
|
| 327 |
-
translated = self.deepl_client.translate_text(transcript, target_lang="EN-US").text
|
| 328 |
-
print(f"π FR β EN: {translated}")
|
| 329 |
-
await self._tts_queue.put({"text": translated, "source_lang": language})
|
| 330 |
-
self.last_tts_text_en = translated
|
| 331 |
-
else:
|
| 332 |
-
translated = self.deepl_client.translate_text(transcript, target_lang="FR").text
|
| 333 |
-
print(f"π EN β FR: {translated}")
|
| 334 |
-
await self._tts_queue.put({"text": translated, "source_lang": language})
|
| 335 |
-
self.last_tts_text_fr = translated
|
| 336 |
-
print("π Queued for speaking...")
|
| 337 |
-
except Exception as e:
|
| 338 |
-
print(f"Translation error: {e}")
|
| 339 |
-
|
| 340 |
-
# ---------------------------
|
| 341 |
-
# STT streaming (run per language)
|
| 342 |
-
# ---------------------------
|
| 343 |
-
def _run_stt_stream(self, language: str):
|
| 344 |
-
print(f"[stt:{language}] Thread starting, thread_id={threading.get_ident()}")
|
| 345 |
-
self._stream_started[language] = False
|
| 346 |
-
last_transcript_in_stream = ""
|
| 347 |
-
|
| 348 |
-
while self.is_recording:
|
| 349 |
-
try:
|
| 350 |
-
if self._stream_started[language]:
|
| 351 |
-
print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Waiting for restart signal...")
|
| 352 |
-
signaled = self.restart_events[language].wait(timeout=30)
|
| 353 |
-
if not signaled and self.is_recording:
|
| 354 |
-
print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Timeout waiting for restart, restarting anyway")
|
| 355 |
-
if not self.is_recording:
|
| 356 |
-
break
|
| 357 |
-
try:
|
| 358 |
-
self.restart_events[language].clear()
|
| 359 |
-
except Exception:
|
| 360 |
-
pass
|
| 361 |
-
time.sleep(0.01)
|
| 362 |
-
|
| 363 |
-
self._stream_started[language] = True
|
| 364 |
-
last_transcript_in_stream = ""
|
| 365 |
-
print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Starting new stream...")
|
| 366 |
-
|
| 367 |
-
config = speech.RecognitionConfig(
|
| 368 |
-
encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
|
| 369 |
-
sample_rate_hertz=self.audio_rate,
|
| 370 |
-
language_code=language,
|
| 371 |
-
enable_automatic_punctuation=True,
|
| 372 |
-
model="latest_short",
|
| 373 |
-
)
|
| 374 |
-
streaming_config = speech.StreamingRecognitionConfig(
|
| 375 |
-
config=config,
|
| 376 |
-
interim_results=True,
|
| 377 |
-
single_utterance=False,
|
| 378 |
-
)
|
| 379 |
-
|
| 380 |
-
# Request generator yields StreamingRecognizeRequest messages
|
| 381 |
-
def request_generator():
|
| 382 |
-
while self.is_recording:
|
| 383 |
-
# If TTS is playing, skip sending mic frames to STT
|
| 384 |
-
if self.speaking_event.is_set():
|
| 385 |
-
time.sleep(0.01)
|
| 386 |
-
continue
|
| 387 |
-
# If cancel event set, clear and break to end stream
|
| 388 |
-
if self.stream_cancel_events[language].is_set():
|
| 389 |
-
try:
|
| 390 |
-
self.stream_cancel_events[language].clear()
|
| 391 |
-
except Exception:
|
| 392 |
-
pass
|
| 393 |
-
break
|
| 394 |
-
try:
|
| 395 |
-
chunk = self.lang_queues[language].get(timeout=1.0)
|
| 396 |
-
except queue.Empty:
|
| 397 |
-
continue
|
| 398 |
-
yield speech.StreamingRecognizeRequest(audio_content=chunk)
|
| 399 |
-
|
| 400 |
-
responses = self.stt_client.streaming_recognize(streaming_config, request_generator())
|
| 401 |
-
|
| 402 |
-
response_count = 0
|
| 403 |
-
final_received = False
|
| 404 |
-
|
| 405 |
-
for response in responses:
|
| 406 |
-
if not self.is_recording:
|
| 407 |
-
print(f"[stt:{language}] Stopped by user")
|
| 408 |
-
break
|
| 409 |
-
if not response.results:
|
| 410 |
-
continue
|
| 411 |
-
|
| 412 |
-
response_count += 1
|
| 413 |
-
for result in response.results:
|
| 414 |
-
if not result.alternatives:
|
| 415 |
-
continue
|
| 416 |
-
alt = result.alternatives[0]
|
| 417 |
-
transcript = alt.transcript.strip()
|
| 418 |
-
conf = getattr(alt, "confidence", 0.0)
|
| 419 |
-
is_final = bool(result.is_final)
|
| 420 |
-
|
| 421 |
-
if is_final:
|
| 422 |
-
now = time.strftime("%H:%M:%S")
|
| 423 |
-
print(f"[{now}] [stt:{language}] β '{transcript}' (final={is_final}, conf={conf:.2f})")
|
| 424 |
-
|
| 425 |
-
# Filter empty transcripts - don't break stream
|
| 426 |
-
if not transcript or len(transcript.strip()) == 0:
|
| 427 |
-
print(f"[{now}] [stt:{language}] Empty transcript -> ignoring, continuing stream")
|
| 428 |
-
continue
|
| 429 |
-
|
| 430 |
-
# Deduplicate within same stream
|
| 431 |
-
if transcript.strip().lower() == last_transcript_in_stream.strip().lower():
|
| 432 |
-
print(f"[{now}] [stt:{language}] Duplicate final in same stream -> suppressed")
|
| 433 |
-
continue
|
| 434 |
-
|
| 435 |
-
if conf < self.min_confidence_threshold:
|
| 436 |
-
print(f"[{now}] [stt:{language}] Final received but confidence {conf:.2f} < threshold -> suppressed")
|
| 437 |
-
continue
|
| 438 |
-
|
| 439 |
-
last_transcript_in_stream = transcript
|
| 440 |
-
|
| 441 |
-
if language == "fr-FR" and transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
|
| 442 |
-
print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_fr)")
|
| 443 |
-
continue
|
| 444 |
-
if language == "en-US" and transcript.strip().lower() == self.last_tts_text_en.strip().lower():
|
| 445 |
-
print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_en)")
|
| 446 |
-
continue
|
| 447 |
-
|
| 448 |
-
asyncio.run_coroutine_threadsafe(
|
| 449 |
-
self._process_result(transcript, conf, language),
|
| 450 |
-
self.async_loop
|
| 451 |
-
)
|
| 452 |
-
final_received = True
|
| 453 |
-
break
|
| 454 |
-
|
| 455 |
-
if final_received:
|
| 456 |
-
break
|
| 457 |
-
|
| 458 |
-
print(f"[stt:{language}] Stream ended after {response_count} responses")
|
| 459 |
-
|
| 460 |
-
if self.is_recording and final_received:
|
| 461 |
-
print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Final result processed. Waiting for TTS to complete and signal restart.")
|
| 462 |
-
elif self.is_recording and not final_received:
|
| 463 |
-
print(f"[stt:{language}] Stream ended unexpectedly, reconnecting...")
|
| 464 |
-
time.sleep(0.5)
|
| 465 |
-
else:
|
| 466 |
-
break
|
| 467 |
-
|
| 468 |
-
except Exception as e:
|
| 469 |
-
if self.is_recording:
|
| 470 |
-
import traceback
|
| 471 |
-
print(f"[stt:{language}] Error: {e}")
|
| 472 |
-
print(traceback.format_exc())
|
| 473 |
-
time.sleep(1.0)
|
| 474 |
-
else:
|
| 475 |
-
break
|
| 476 |
-
|
| 477 |
-
print(f"[stt:{language}] Thread exiting")
|
| 478 |
-
|
| 479 |
-
# ---------------------------
|
| 480 |
-
# Control
|
| 481 |
-
# ---------------------------
|
| 482 |
-
def start_translation(self):
|
| 483 |
-
if self.is_recording:
|
| 484 |
-
print("Already recording!")
|
| 485 |
-
return
|
| 486 |
-
self.is_recording = True
|
| 487 |
-
self.last_processed_transcript = ""
|
| 488 |
-
|
| 489 |
-
for ev in self.restart_events.values():
|
| 490 |
-
try:
|
| 491 |
-
ev.clear()
|
| 492 |
-
except Exception:
|
| 493 |
-
pass
|
| 494 |
-
self.speaking_event.clear()
|
| 495 |
-
|
| 496 |
-
for q in self.lang_queues.values():
|
| 497 |
-
with q.mutex:
|
| 498 |
-
q.queue.clear()
|
| 499 |
-
|
| 500 |
-
self.recording_thread = threading.Thread(target=self._record_audio, daemon=True)
|
| 501 |
-
self.recording_thread.start()
|
| 502 |
-
|
| 503 |
-
        for lang in ("en-US", "fr-FR"):
            t = threading.Thread(target=self._run_stt_stream, args=(lang,), daemon=True)
            self.stt_threads[lang] = t
            t.start()
            print(f"[main] STT thread {lang} started: {t.is_alive()} at {time.strftime('%H:%M:%S')}")

        for ev in self.restart_events.values():
            ev.set()

    def stop_translation(self):
        print("\n⏹️ Stopping translation...")
        self.is_recording = False
        for ev in self.restart_events.values():
            ev.set()
        self.speaking_event.clear()

        if self._tts_consumer_task and not (self._tts_consumer_task.done() if hasattr(self._tts_consumer_task, 'done') else False):
            try:
                def _put_sentinel():
                    try:
                        self._tts_queue.put_nowait(None)
                    except Exception:
                        asyncio.create_task(self._tts_queue.put(None))
                self.async_loop.call_soon_threadsafe(_put_sentinel)
            except Exception:
                pass

        time.sleep(0.2)

    def cleanup(self):
        self.stop_translation()
        try:
            if self.async_loop.is_running():
                def _stop_loop():
                    if self._tts_consumer_task and not self._tts_consumer_task.done():
                        try:
                            self._tts_queue.put_nowait(None)
                        except Exception:
                            pass
                    self.async_loop.stop()
                self.async_loop.call_soon_threadsafe(_stop_loop)
        except Exception:
            pass
        try:
            self.pyaudio_instance.terminate()
        except Exception:
            pass

# -----------------------------------------------------------------------------
# Main entry
# -----------------------------------------------------------------------------
def main():
    load_dotenv()
    google_creds = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
    deepl_key = os.getenv("DEEPL_API_KEY")
    eleven_key = os.getenv("ELEVENLABS_API_KEY")
    voice_id = os.getenv("ELEVENLABS_VOICE_ID")

    if not all([google_creds, deepl_key, eleven_key, voice_id]):
        print("Missing API keys or credentials.")
        return

    translator = VoiceTranslator(deepl_key, eleven_key, voice_id)
    print("Ready! Press ENTER to start, ENTER again to stop, Ctrl+C to quit.\n")

    try:
        while True:
            input("Press ENTER to start speaking...")
            translator.start_translation()
            input("Press ENTER to stop...\n")
            translator.stop_translation()
    except KeyboardInterrupt:
        print("\nKeyboardInterrupt received - cleaning up.")
        translator.cleanup()

if __name__ == "__main__":
    main()
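The `stop_translation` path above wakes the TTS consumer by pushing a `None` sentinel into an `asyncio.Queue` from a non-loop thread via `call_soon_threadsafe`. A minimal runnable sketch of that hand-off pattern, using only the standard library (all names here are illustrative, not from the app):

```python
import asyncio
import threading

def run_loop(loop: asyncio.AbstractEventLoop):
    # Dedicated event-loop thread, like VoiceTranslator's async_thread.
    asyncio.set_event_loop(loop)
    loop.run_forever()

async def consumer(q: "asyncio.Queue", out: list):
    # Drain items until the None sentinel arrives, like _tts_consumer.
    while True:
        item = await q.get()
        if item is None:
            break
        out.append(item)

async def make_queue() -> "asyncio.Queue":
    # Create the queue on the loop's own thread so it binds to that loop.
    return asyncio.Queue()

loop = asyncio.new_event_loop()
t = threading.Thread(target=run_loop, args=(loop,), daemon=True)
t.start()

q = asyncio.run_coroutine_threadsafe(make_queue(), loop).result()
seen: list = []
done = asyncio.run_coroutine_threadsafe(consumer(q, seen), loop)

# Feed work, then the shutdown sentinel, from the main (non-loop) thread.
for item in ("hello", "world", None):
    loop.call_soon_threadsafe(q.put_nowait, item)

done.result(timeout=5)          # consumer exits cleanly on the sentinel
loop.call_soon_threadsafe(loop.stop)
print(seen)                     # ['hello', 'world']
```

The sentinel lets the consumer finish in-flight work and return normally, instead of being cancelled mid-job.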
@@ -0,0 +1,137 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Real-Time Voice Translator</title>
    <style>
        body { font-family: sans-serif; display: flex; flex-direction: column; align-items: center; justify-content: center; height: 100vh; margin: 0; background-color: #f0f0f0; }
        #controls { margin-bottom: 20px; }
        button { font-size: 1.2em; padding: 10px 20px; cursor: pointer; }
        #status { font-size: 1.1em; color: #333; }
    </style>
</head>
<body>
    <h1>Real-Time Voice Translator</h1>
    <div id="controls">
        <button id="startButton">Start Translation</button>
        <button id="stopButton" disabled>Stop Translation</button>
    </div>
    <p id="status">Status: Not connected</p>

    <script>
        const startButton = document.getElementById('startButton');
        const stopButton = document.getElementById('stopButton');
        const statusDiv = document.getElementById('status');
        let socket;
        let mediaRecorder;
        let audioContext;
        let audioQueue = [];
        let isPlaying = false;

        const connectWebSocket = () => {
            const proto = window.location.protocol === "https:" ? "wss:" : "ws:";
            socket = new WebSocket(`${proto}//${window.location.host}/ws`);

            socket.onopen = () => {
                statusDiv.textContent = 'Status: Connected. Press Start.';
                startButton.disabled = false;
            };

            socket.onmessage = (event) => {
                if (event.data instanceof Blob) {
                    const reader = new FileReader();
                    reader.onload = function() {
                        const arrayBuffer = this.result;
                        audioContext.decodeAudioData(arrayBuffer, (buffer) => {
                            audioQueue.push(buffer);
                            if (!isPlaying) {
                                playNextInQueue();
                            }
                        });
                    };
                    reader.readAsArrayBuffer(event.data);
                }
            };

            socket.onclose = () => {
                statusDiv.textContent = 'Status: Disconnected';
                startButton.disabled = true;
                stopButton.disabled = true;
            };

            socket.onerror = (error) => {
                console.error("WebSocket Error:", error);
                statusDiv.textContent = 'Status: Connection error';
            };
        };

        const playNextInQueue = () => {
            if (audioQueue.length > 0) {
                isPlaying = true;
                const buffer = audioQueue.shift();
                const source = audioContext.createBufferSource();
                source.buffer = buffer;
                source.connect(audioContext.destination);
                source.onended = () => {
                    isPlaying = false;
                    playNextInQueue();
                };
                source.start();
            }
        };

        startButton.onclick = async () => {
            if (!socket || socket.readyState !== WebSocket.OPEN) {
                connectWebSocket();
            }

            audioContext = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });

            if (audioContext.state === 'suspended') {
                await audioContext.resume();
            }

            navigator.mediaDevices.getUserMedia({ audio: { sampleRate: 16000, channelCount: 1 } })
                .then(stream => {
                    mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm; codecs=opus' });
                    mediaRecorder.ondataavailable = event => {
                        if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
                            socket.send(event.data);
                        }
                    };
                    mediaRecorder.start(250); // Send a chunk every 250 ms

                    startButton.disabled = true;
                    stopButton.disabled = false;
                    statusDiv.textContent = 'Status: Translating...';
                })
                .catch(err => {
                    console.error('Error getting user media:', err);
                    statusDiv.textContent = 'Error: Could not access microphone.';
                });
        };

        stopButton.onclick = () => {
            if (mediaRecorder) {
                mediaRecorder.stop();
            }
            if (socket && socket.readyState === WebSocket.OPEN) {
                socket.send(JSON.stringify({type: "stop"}));
                socket.close();
            }
            startButton.disabled = false;
            stopButton.disabled = true;
            statusDiv.textContent = 'Status: Stopped. Re-connect to start again.';
        };

        window.onload = () => {
            startButton.disabled = false;
            stopButton.disabled = true;
            statusDiv.textContent = 'Status: Ready to connect.';
        };
    </script>
</body>
</html>
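The recorder above emits one WebM/Opus blob every 250 ms, which the server decodes to 16 kHz mono 16-bit PCM; at that format a fully decoded chunk works out to about 8,000 bytes. A quick sketch of the arithmetic (the constants mirror the page and the ffmpeg flags; the helper name is made up):

```python
SAMPLE_RATE = 16_000      # Hz, matches the AudioContext and ffmpeg ar=16k
BYTES_PER_SAMPLE = 2      # s16le = signed 16-bit little-endian
CHANNELS = 1              # mono, matches channelCount: 1 and ffmpeg ac=1
CHUNK_MS = 250            # mediaRecorder.start(250)

def pcm_bytes(ms: int) -> int:
    """Bytes of raw PCM for `ms` milliseconds at the format above."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * ms // 1000

print(pcm_bytes(CHUNK_MS))   # 8000
print(pcm_bytes(1000))       # 32000 bytes per second of audio
```

Useful when budgeting WebSocket throughput: the raw PCM going back to the browser is roughly 32 kB/s per stream.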
@@ -0,0 +1 @@
ffmpeg

@@ -1,5 +1,8 @@
 google-cloud-speech
 deepl
-pyaudio
 websockets
 python-dotenv
+fastapi
+uvicorn
+python-multipart
+ffmpeg-python

@@ -0,0 +1,96 @@
import asyncio
import os

import ffmpeg
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles
from dotenv import load_dotenv

from translator import VoiceTranslator

load_dotenv()

app = FastAPI()
app.mount("/static", StaticFiles(directory="static"), name="static")

# Load environment variables for API keys
google_creds = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
deepl_key = os.getenv("DEEPL_API_KEY")
eleven_key = os.getenv("ELEVENLABS_API_KEY")
voice_id = os.getenv("ELEVENLABS_VOICE_ID")

if not all([google_creds, deepl_key, eleven_key, voice_id]):
    raise ValueError("Missing one or more required API keys in .env file.")

translator = VoiceTranslator(deepl_key, eleven_key, voice_id)

@app.get("/")
async def get():
    return HTMLResponse(open("index.html", "r").read())

async def audio_output_sender(ws: WebSocket, output_queue: asyncio.Queue):
    print("Audio output sender started.")
    while True:
        try:
            audio_chunk = await output_queue.get()
            if audio_chunk is None:
                break
            await ws.send_bytes(audio_chunk)
        except asyncio.CancelledError:
            break
    print("Audio output sender stopped.")

async def handle_audio_input(websocket: WebSocket, input_queue: asyncio.Queue):
    print("Audio input handler started.")
    while True:
        try:
            data = await websocket.receive_bytes()
            # Convert WebM to raw 16 kHz mono PCM
            process = (
                ffmpeg
                .input('pipe:0')
                .output('pipe:1', format='s16le', acodec='pcm_s16le', ac=1, ar='16k')
                .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True)
            )
            stdout, stderr = process.communicate(input=data)
            await input_queue.put(stdout)
        except WebSocketDisconnect:
            break
        except Exception as e:
            print(f"Audio input error: {e}")
            break
    print("Audio input handler stopped.")


@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    print("WebSocket connection accepted.")

    output_sender_task = None
    input_handler_task = None

    try:
        # Start translation and audio processing tasks
        translator.start_translation()
        output_sender_task = asyncio.create_task(
            audio_output_sender(websocket, translator.output_queue)
        )
        input_handler_task = asyncio.create_task(
            handle_audio_input(websocket, translator.input_queue)
        )

        await asyncio.gather(input_handler_task, output_sender_task)

    except WebSocketDisconnect:
        print("Client disconnected.")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        print("Stopping translation and cleaning up...")
        if output_sender_task:
            output_sender_task.cancel()
        if input_handler_task:
            input_handler_task.cancel()
        translator.stop_translation()
        await websocket.close()
        print("WebSocket connection closed.")
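`audio_output_sender` above is an instance of a small, testable pattern: forward items from an `asyncio.Queue` to a sink until a `None` sentinel arrives or the task is cancelled. A standalone sketch with a list-appending coroutine standing in for `WebSocket.send_bytes` (names are illustrative, not from server.py):

```python
import asyncio
from typing import Awaitable, Callable

async def relay(queue: "asyncio.Queue",
                send: Callable[[bytes], Awaitable[None]]) -> int:
    """Forward chunks from `queue` to `send` until a None sentinel.

    Returns the number of chunks forwarded; swallows cancellation so the
    caller can tear the task down without a traceback.
    """
    sent = 0
    while True:
        try:
            chunk = await queue.get()
            if chunk is None:
                break
            await send(chunk)
            sent += 1
        except asyncio.CancelledError:
            break
    return sent

async def main() -> int:
    received: list = []

    async def fake_send(data: bytes) -> None:
        # Stand-in for ws.send_bytes(data).
        received.append(data)

    q: "asyncio.Queue" = asyncio.Queue()
    for item in (b"\x00\x01", b"\x02\x03", None):
        q.put_nowait(item)
    count = await relay(q, fake_send)
    assert received == [b"\x00\x01", b"\x02\x03"]
    return count

print(asyncio.run(main()))   # 2
```

Injecting the sink as a coroutine keeps the loop body identical in tests and in production, where the sink would be the WebSocket.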
@@ -0,0 +1,276 @@
import asyncio
import base64
import json
import queue
import threading
import time
from collections import deque
from typing import Dict, Optional

import deepl
import websockets
from google.cloud import speech


class VoiceTranslator:
    def __init__(self, deepl_api_key: str, elevenlabs_api_key: str, elevenlabs_voice_id: str):
        self.deepl_client = deepl.Translator(deepl_api_key)
        self.elevenlabs_api_key = elevenlabs_api_key
        self.voice_id = elevenlabs_voice_id
        self.stt_client = speech.SpeechClient()

        self.audio_rate = 16000
        self.audio_chunk = 1024

        self.input_queue = asyncio.Queue()
        self.output_queue = asyncio.Queue()

        self.lang_queues: Dict[str, queue.Queue] = {
            "en-US": queue.Queue(),
            "fr-FR": queue.Queue(),
        }
        self.prebuffer = deque(maxlen=12)

        self.is_recording = False
        self.is_speaking = False
        self.speaking_event = threading.Event()

        self.last_processed_transcript = ""
        self.last_tts_text_en = ""
        self.last_tts_text_fr = ""
        self.min_confidence_threshold = 0.5

        self.async_loop = asyncio.new_event_loop()
        self.async_thread = threading.Thread(target=self._run_async_loop, daemon=True)
        self.async_thread.start()

        self._tts_queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
        self._tts_consumer_task: Optional[asyncio.Task] = None
        self._process_audio_task: Optional[asyncio.Task] = None

        def _start_consumer():
            self._tts_consumer_task = asyncio.create_task(self._tts_consumer())

        self.async_loop.call_soon_threadsafe(_start_consumer)

        self.stt_threads: Dict[str, threading.Thread] = {}
        self.restart_events: Dict[str, threading.Event] = {
            "en-US": threading.Event(),
            "fr-FR": threading.Event(),
        }
        self.stream_cancel_events: Dict[str, threading.Event] = {
            "en-US": threading.Event(),
            "fr-FR": threading.Event(),
        }
        self._stream_started = {"en-US": False, "fr-FR": False}
        self._tts_job_counter = 0

    def _run_async_loop(self):
        asyncio.set_event_loop(self.async_loop)
        try:
            self.async_loop.run_forever()
        except Exception as e:
            print(f"[async_loop] stopped with error: {e}")

    async def _process_input_audio(self):
        print("🎤 Audio processing task started...")
        while self.is_recording:
            try:
                data = await self.input_queue.get()
                if data is None:
                    break
                if not self.speaking_event.is_set():
                    self.prebuffer.append(data)
                    self.lang_queues["en-US"].put(data)
                    self.lang_queues["fr-FR"].put(data)
            except asyncio.CancelledError:
                break
            except Exception as e:
                print(f"[audio_processor] error: {e}")
        print("🎤 Audio processing task stopped.")

    async def _stream_tts(self, text: str):
        uri = (
            f"wss://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}"
            f"/stream-input?model_id=eleven_flash_v2_5&output_format=pcm_16000"
        )
        try:
            self.is_speaking = True
            self.speaking_event.set()
            self.prebuffer.clear()
            for q in self.lang_queues.values():
                with q.mutex:
                    q.queue.clear()

            async with websockets.connect(uri) as websocket:
                await websocket.send(json.dumps({
                    "text": " ",
                    "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
                    "xi_api_key": self.elevenlabs_api_key,
                }))
                await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))
                await websocket.send(json.dumps({"text": ""}))

                while True:
                    try:
                        message = await websocket.recv()
                        data = json.loads(message)
                        if data.get("audio"):
                            audio_chunk = base64.b64decode(data["audio"])
                            await self.output_queue.put(audio_chunk)
                        elif data.get("isFinal"):
                            break
                    except websockets.exceptions.ConnectionClosed:
                        break
                    except Exception:
                        continue
        except Exception as e:
            print(f"TTS streaming error: {e}")
        finally:
            for lang, ev in self.stream_cancel_events.items():
                ev.set()
            for q in self.lang_queues.values():
                with q.mutex:
                    q.queue.clear()

            self.is_speaking = False
            self.speaking_event.clear()
            for lang, ev in self.restart_events.items():
                ev.set()
            await asyncio.sleep(0.1)

    async def _tts_consumer(self):
        print("[tts_consumer] started")
        while True:
            try:
                item = await self._tts_queue.get()
                if item is None:
                    break
                text = item.get("text", "")
                self._tts_job_counter += 1
                job_id = self._tts_job_counter
                print(f"[tts_consumer] job #{job_id} dequeued (len={len(text)})")
                await self._stream_tts(text)
            except asyncio.CancelledError:
                break
            except Exception as e:
                print(f"[tts_consumer] error: {e}")
        print("[tts_consumer] exiting")

    async def _process_result(self, transcript: str, confidence: float, language: str):
        lang_flag = "🇫🇷" if language == "fr-FR" else "🇬🇧"
        print(f"{lang_flag} Heard ({language}, conf {confidence:.2f}): {transcript}")

        if language == "fr-FR" and transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
            print("  (echo suppressed)")
            return
        if language == "en-US" and transcript.strip().lower() == self.last_tts_text_en.strip().lower():
            print("  (echo suppressed)")
            return

        try:
            if language == "fr-FR":
                translated = self.deepl_client.translate_text(transcript, target_lang="EN-US").text
                print(f"🔄 FR → EN: {translated}")
                await self._tts_queue.put({"text": translated, "source_lang": language})
                self.last_tts_text_en = translated
            else:
                translated = self.deepl_client.translate_text(transcript, target_lang="FR").text
                print(f"🔄 EN → FR: {translated}")
                await self._tts_queue.put({"text": translated, "source_lang": language})
                self.last_tts_text_fr = translated
            print("🔊 Queued for speaking...")
        except Exception as e:
            print(f"Translation error: {e}")

    def _run_stt_stream(self, language: str):
        print(f"[stt:{language}] Thread starting...")
        while self.is_recording:
            try:
                self.restart_events[language].wait()
                self.restart_events[language].clear()

                config = speech.RecognitionConfig(
                    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
                    sample_rate_hertz=self.audio_rate,
                    language_code=language,
                    enable_automatic_punctuation=True,
                    model="latest_short",
                )
                streaming_config = speech.StreamingRecognitionConfig(
                    config=config, interim_results=True, single_utterance=False
                )

                def request_generator():
                    while self.is_recording:
                        if self.speaking_event.is_set():
                            time.sleep(0.01)
                            continue
                        if self.stream_cancel_events[language].is_set():
                            self.stream_cancel_events[language].clear()
                            break
                        try:
                            chunk = self.lang_queues[language].get(timeout=0.1)
                            yield speech.StreamingRecognizeRequest(audio_content=chunk)
                        except queue.Empty:
                            continue

                responses = self.stt_client.streaming_recognize(streaming_config, request_generator())

                for response in responses:
                    if not self.is_recording:
                        break
                    for result in response.results:
                        if result.is_final and result.alternatives:
                            alt = result.alternatives[0]
                            transcript = alt.transcript.strip()
                            conf = getattr(alt, "confidence", 0.0)
                            if transcript and conf >= self.min_confidence_threshold:
                                asyncio.run_coroutine_threadsafe(
                                    self._process_result(transcript, conf, language), self.async_loop
                                )
            except Exception as e:
                print(f"[stt:{language}] Error: {e}")
                time.sleep(1)
        print(f"[stt:{language}] Thread exiting")

    def start_translation(self):
        if self.is_recording:
            return
        self.is_recording = True

        for ev in self.restart_events.values():
            ev.clear()
        self.speaking_event.clear()

        def _start_tasks():
            self._process_audio_task = asyncio.create_task(self._process_input_audio())

        self.async_loop.call_soon_threadsafe(_start_tasks)

        for lang in ("en-US", "fr-FR"):
            thread = threading.Thread(target=self._run_stt_stream, args=(lang,), daemon=True)
            self.stt_threads[lang] = thread
            thread.start()
            self.restart_events[lang].set()

    def stop_translation(self):
        if not self.is_recording:
            return
        self.is_recording = False
        for ev in self.restart_events.values():
            ev.set()

        def _cancel_tasks():
            if self._process_audio_task:
                self._process_audio_task.cancel()
            if self._tts_queue:
                self._tts_queue.put_nowait(None)

        self.async_loop.call_soon_threadsafe(_cancel_tasks)

    def cleanup(self):
        self.stop_translation()
        time.sleep(0.5)
        if self.async_loop.is_running():
            self.async_loop.call_soon_threadsafe(self.async_loop.stop)
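`_stream_tts` above treats each incoming ElevenLabs frame as JSON carrying a base64 `audio` payload, stopping when an `isFinal` frame arrives; that frame shape is taken from this handler, not verified against API docs. The decode step in isolation, run against fabricated frames:

```python
import base64
import json

def decode_tts_messages(messages: list) -> bytes:
    """Collect base64 'audio' payloads from JSON frames until 'isFinal'."""
    pcm = bytearray()
    for raw in messages:
        frame = json.loads(raw)
        if frame.get("audio"):
            pcm.extend(base64.b64decode(frame["audio"]))
        elif frame.get("isFinal"):
            break
    return bytes(pcm)

# Fabricated frames mimicking the stream shape handled above.
frames = [
    json.dumps({"audio": base64.b64encode(b"\x01\x02").decode()}),
    json.dumps({"audio": base64.b64encode(b"\x03\x04").decode()}),
    json.dumps({"isFinal": True}),
]
print(decode_tts_messages(frames))   # b'\x01\x02\x03\x04'
```

Isolating the decode makes it easy to test the reassembly logic without opening a live TTS socket.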