feat: Create web application for voice translator
This commit transforms the command-line voice translator into a web application with a FastAPI backend and a simple HTML/JS frontend.
Key changes include:
- Refactoring the `VoiceTranslator` class into a separate `translator.py` module and adapting it to use asyncio queues for audio I/O.
- Creating a `server.py` with a WebSocket endpoint to handle real-time audio streaming.
- Implementing server-side audio conversion from WebM to PCM using FFmpeg.
- Adding an `index.html` file for the user interface.
- Including a `Dockerfile` and `packages.txt` for deployment on HuggingFace Spaces.
- Updating `requirements.txt` with the necessary web and audio processing dependencies.
- Revising the `README.md` with instructions for running the web app locally and deploying it.
- Dockerfile +22 -0
- README.md +42 -43
- app.py +0 -579
- index.html +137 -0
- packages.txt +1 -0
- requirements.txt +4 -1
- server.py +96 -0
- translator.py +276 -0
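The server-side WebM-to-PCM conversion mentioned above can be sketched as follows. This is an illustrative assumption, not the actual `server.py` code: the function names (`build_ffmpeg_cmd`, `webm_chunk_to_pcm`) and the one-process-per-blob design are made up for clarity.

```python
import subprocess

def build_ffmpeg_cmd(rate: int = 16000) -> list:
    """FFmpeg arguments that decode WebM/Opus from stdin into raw
    16-bit mono PCM on stdout, matching the 16 kHz STT input format."""
    return [
        "ffmpeg", "-loglevel", "quiet",
        "-i", "pipe:0",            # WebM blob from the browser on stdin
        "-f", "s16le",             # raw signed 16-bit little-endian samples
        "-acodec", "pcm_s16le",
        "-ac", "1",                # mono
        "-ar", str(rate),          # resample to the STT rate
        "pipe:1",                  # PCM out on stdout
    ]

def webm_chunk_to_pcm(webm_bytes: bytes, rate: int = 16000) -> bytes:
    """Convert one recorded WebM blob to raw PCM (requires ffmpeg on PATH)."""
    proc = subprocess.run(
        build_ffmpeg_cmd(rate),
        input=webm_bytes,
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout
```

The real server presumably pipes chunks through a long-lived FFmpeg process rather than spawning one per blob; the one-shot form above is just easier to follow.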
Dockerfile
@@ -0,0 +1,22 @@
+# Use an official Python runtime as a parent image
+FROM python:3.9-slim
+
+# Set the working directory in the container
+WORKDIR /app
+
+# Install system dependencies from packages.txt
+COPY packages.txt .
+RUN apt-get update && apt-get install -y $(cat packages.txt)
+
+# Copy the requirements file and install Python dependencies
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy the rest of the application code
+COPY . .
+
+# Expose the port the app runs on
+EXPOSE 8000
+
+# Command to run the application
+CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
README.md
@@ -1,6 +1,8 @@
-# Real-Time English/French Voice Translator
+# Real-Time English/French Voice Translator Web App
 
-This project provides a real-time, bidirectional voice translation application. Speak in English or French, and hear the translation in the other language almost instantly.
+This project provides a real-time, bidirectional voice translation web application. Speak in English or French into your browser, and hear the translation in the other language almost instantly.
+
+It is built to be easily deployed as a HuggingFace Space.
 
 It uses a combination of cutting-edge APIs for high-quality speech recognition, translation, and synthesis:
 
@@ -8,39 +10,35 @@ It uses a combination of cutting-edge APIs for high-quality speech recognition,
 - **Translation:** DeepL API
 - **Text-to-Speech (TTS):** ElevenLabs API
 
-
-*(Note: You can replace this with a real GIF of the application in action.)*
-
 ## Features
 
+- **Web-Based UI:** A simple and clean browser interface for real-time translation.
 - **Bidirectional Translation:** Simultaneously listens for both English and French and translates to the other language.
-- **Low Latency:** Built with `asyncio` and multithreading for a responsive, conversational experience.
+- **Low Latency:** Built with `asyncio`, WebSockets, and multithreading for a responsive, conversational experience.
 - **High-Quality Voice:** Leverages ElevenLabs for natural-sounding synthesized speech.
 - **Echo Suppression:** The translator is smart enough not to translate its own spoken output.
-- **Robust Streaming:** Automatically manages and restarts API connections to handle pauses in conversation.
-- **Simple CLI:** Easy to start and stop from the command line.
 
 ## How It Works
 
-The application
+The application is composed of a web frontend and a Python backend:
 
-1. **Audio Capture:**
-2. **
-3. **
-
-
-
+1. **Audio Capture (Frontend):** The browser's JavaScript captures audio from your microphone using the Web Audio API.
+2. **WebSocket Streaming:** The audio is chunked and streamed over a WebSocket connection to the FastAPI backend.
+3. **Backend Processing:**
+   - The `VoiceTranslator` class receives the audio stream.
+   - The audio is fed into two separate Google Cloud STT streams in parallel (`en-US` and `fr-FR`).
+   - When an STT stream detects a final utterance, it's sent to the DeepL API for translation.
+   - The translated text is sent to the ElevenLabs streaming TTS API.
+4. **Audio Playback (Frontend):** The synthesized audio from ElevenLabs is streamed back to the browser through the WebSocket and played instantly.
 
 ## Requirements
 
 ### 1. Software
 - Python 3.8+
 - `pip` and `venv`
-- **
-  - **macOS (via Homebrew):** `brew install
-  - **Debian/Ubuntu:** `sudo apt-get install
-  - **Windows:** `pyaudio` can often be installed via `pip` without manual PortAudio installation.
+- **FFmpeg:** This is a system dependency for audio format conversion.
+  - **macOS (via Homebrew):** `brew install ffmpeg`
+  - **Debian/Ubuntu:** `sudo apt-get install ffmpeg`
 
 ### 2. API Keys
 You will need active accounts and API keys for the following services:
@@ -51,14 +49,14 @@ You will need active accounts and API keys for the following services:
 - **DeepL:**
   - A DeepL API plan (the Free plan is sufficient for moderate use).
 - **ElevenLabs:**
-  - An ElevenLabs account
+  - An ElevenLabs account and your **Voice ID** for the desired voice.
 
 ## Installation & Setup
 
 1. **Clone the Repository**
    ```bash
   git clone <your-repository-url>
-   cd realtime-translator
+   cd realtime-translator-webapp # Or your directory name
    ```
 
 2. **Create a Virtual Environment**
@@ -68,15 +66,7 @@ You will need active accounts and API keys for the following services:
    ```
 
 3. **Install Dependencies**
-
-   ```
-   pyaudio
-   websockets
-   google-cloud-speech
-   deepl
-   python-dotenv
-   ```
-   Then, install the packages:
+   Install the Python packages from `requirements.txt`:
    ```bash
    pip install -r requirements.txt
    ```
@@ -86,27 +76,36 @@ You will need active accounts and API keys for the following services:
 
    ```env
    # Path to your Google Cloud service account JSON file
-   GOOGLE_APPLICATION_CREDENTIALS="
+   GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/google-credentials.json"
 
    # Your DeepL API Key
-
+   DEEPL_API_KEY="YOUR_DEEPL_API_KEY"
 
    # Your ElevenLabs API Key and Voice ID
    ELEVENLABS_API_KEY="YOUR_ELEVENLABS_API_KEY"
    ELEVENLABS_VOICE_ID="YOUR_ELEVENLABS_VOICE_ID"
    ```
 
-## Usage
-
-   ```bash
-   python app.py
-   ```
-
-
+## Local Usage
+
+1. **Start the Server**
+   Run the Uvicorn server from the project root:
+   ```bash
+   uvicorn server:app --reload
+   ```
+
+2. **Use the Application**
+   - Open your web browser and navigate to `http://127.0.0.1:8000`.
+   - Click the "Start Translation" button. Your browser will ask for microphone permission.
+   - Speak in either English or French.
+   - The translated audio will play back automatically.
+   - Click "Stop Translation" to end the session.
+
+## Deploying to HuggingFace Spaces
+
+This application is ready to be deployed as a HuggingFace Space.
+
+1. Create a new Space on HuggingFace, selecting the "Docker" template.
+2. Upload the entire project contents to the Space repository.
+3. In the Space "Settings" tab, add your API keys (`GOOGLE_APPLICATION_CREDENTIALS`, `DEEPL_API_KEY`, `ELEVENLABS_API_KEY`, `ELEVENLABS_VOICE_ID`) as secrets. Make sure to also add your Google credentials file.
+4. The Space will automatically build the Docker image and start the application. Your translator will be live!
@@ -1,579 +0,0 @@
|
|
| 1 |
-
#!/usr/bin/env python3
|
| 2 |
-
"""
|
| 3 |
-
Real-Time French/English Voice Translator β cleaned version
|
| 4 |
-
|
| 5 |
-
Fixes applied:
|
| 6 |
-
- Fixed TTS echo caused by double-writing audio chunks
|
| 7 |
-
- Removed prebuffer re-injection that could cause echoes
|
| 8 |
-
- Added empty transcript filtering
|
| 9 |
-
- Added within-stream deduplication
|
| 10 |
-
- Removed unnecessary sleeps (reduced latency by ~900ms)
|
| 11 |
-
- Reduced TTS prebuffer from 1s to 0.5s for faster playback start
|
| 12 |
-
- Cleaned up diagnostic logging
|
| 13 |
-
|
| 14 |
-
Keep your env vars:
|
| 15 |
-
- GOOGLE_APPLICATION_CREDENTIALS, DEEPL_API_KEY, ELEVENLABS_API_KEY, ELEVENLABS_VOICE_ID
|
| 16 |
-
"""
|
| 17 |
-
|
| 18 |
-
import asyncio
|
| 19 |
-
import json
|
| 20 |
-
import queue
|
| 21 |
-
import threading
|
| 22 |
-
import time
|
| 23 |
-
import os
|
| 24 |
-
import base64
|
| 25 |
-
from collections import deque
|
| 26 |
-
from typing import Dict, Optional
|
| 27 |
-
|
| 28 |
-
import pyaudio
|
| 29 |
-
import websockets
|
| 30 |
-
from google.cloud import speech
|
| 31 |
-
import deepl
|
| 32 |
-
from dotenv import load_dotenv
|
| 33 |
-
|
| 34 |
-
# -----------------------------------------------------------------------------
|
| 35 |
-
# VoiceTranslator
|
| 36 |
-
# -----------------------------------------------------------------------------
|
| 37 |
-
class VoiceTranslator:
|
| 38 |
-
def __init__(self, deepl_api_key: str, elevenlabs_api_key: str, elevenlabs_voice_id: str):
|
| 39 |
-
# External clients
|
| 40 |
-
self.deepl_client = deepl.Translator(deepl_api_key)
|
| 41 |
-
self.elevenlabs_api_key = elevenlabs_api_key
|
| 42 |
-
self.voice_id = elevenlabs_voice_id
|
| 43 |
-
self.stt_client = speech.SpeechClient()
|
| 44 |
-
|
| 45 |
-
# Audio params
|
| 46 |
-
self.audio_rate = 16000
|
| 47 |
-
self.audio_chunk = 1024
|
| 48 |
-
|
| 49 |
-
# Per-language audio queues (raw mic frames)
|
| 50 |
-
self.lang_queues: Dict[str, queue.Queue] = {
|
| 51 |
-
"en-US": queue.Queue(),
|
| 52 |
-
"fr-FR": queue.Queue(),
|
| 53 |
-
}
|
| 54 |
-
|
| 55 |
-
# Small rolling prebuffer to avoid missing the first bits after a restart
|
| 56 |
-
self.prebuffer = deque(maxlen=12)
|
| 57 |
-
|
| 58 |
-
# State flags
|
| 59 |
-
self.is_recording = False
|
| 60 |
-
self.is_speaking = False
|
| 61 |
-
self.speaking_event = threading.Event()
|
| 62 |
-
|
| 63 |
-
# Deduplication
|
| 64 |
-
self.last_processed_transcript = ""
|
| 65 |
-
self.last_tts_text_en = ""
|
| 66 |
-
self.last_tts_text_fr = ""
|
| 67 |
-
|
| 68 |
-
# Threshold
|
| 69 |
-
self.min_confidence_threshold = 0.5
|
| 70 |
-
|
| 71 |
-
# PyAudio
|
| 72 |
-
self.pyaudio_instance = pyaudio.PyAudio()
|
| 73 |
-
self.audio_stream = None
|
| 74 |
-
|
| 75 |
-
# Threads + async
|
| 76 |
-
self.recording_thread: Optional[threading.Thread] = None
|
| 77 |
-
self.async_loop = asyncio.new_event_loop()
|
| 78 |
-
|
| 79 |
-
# TTS queue + consumer task
|
| 80 |
-
self._tts_queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
|
| 81 |
-
self._tts_consumer_task: Optional[asyncio.Task] = None
|
| 82 |
-
|
| 83 |
-
# Start async loop in separate thread
|
| 84 |
-
self.async_thread = threading.Thread(target=self._run_async_loop, daemon=True)
|
| 85 |
-
self.async_thread.start()
|
| 86 |
-
|
| 87 |
-
# schedule tts consumer creation inside the async loop
|
| 88 |
-
def _start_consumer():
|
| 89 |
-
self._tts_consumer_task = asyncio.create_task(self._tts_consumer())
|
| 90 |
-
self.async_loop.call_soon_threadsafe(_start_consumer)
|
| 91 |
-
|
| 92 |
-
self.stt_threads: Dict[str, threading.Thread] = {}
|
| 93 |
-
|
| 94 |
-
# Per-language restart events (used to tell threads when to start new streams)
|
| 95 |
-
self.restart_events: Dict[str, threading.Event] = {
|
| 96 |
-
"en-US": threading.Event(),
|
| 97 |
-
"fr-FR": threading.Event(),
|
| 98 |
-
}
|
| 99 |
-
|
| 100 |
-
# Per-language stream started flag
|
| 101 |
-
self._stream_started = {"en-US": False, "fr-FR": False}
|
| 102 |
-
|
| 103 |
-
# Per-language cancel events to force request_generator to stop
|
| 104 |
-
self.stream_cancel_events: Dict[str, threading.Event] = {
|
| 105 |
-
"en-US": threading.Event(),
|
| 106 |
-
"fr-FR": threading.Event(),
|
| 107 |
-
}
|
| 108 |
-
|
| 109 |
-
# Diagnostics
|
| 110 |
-
self._tts_job_counter = 0
|
| 111 |
-
|
| 112 |
-
def _run_async_loop(self):
|
| 113 |
-
asyncio.set_event_loop(self.async_loop)
|
| 114 |
-
try:
|
| 115 |
-
self.async_loop.run_forever()
|
| 116 |
-
except Exception as e:
|
| 117 |
-
print("[async_loop] stopped with error:", e)
|
| 118 |
-
|
| 119 |
-
# ---------------------------
|
| 120 |
-
# Audio capture
|
| 121 |
-
# ---------------------------
|
| 122 |
-
def _record_audio(self):
|
| 123 |
-
try:
|
| 124 |
-
stream = self.pyaudio_instance.open(
|
| 125 |
-
format=pyaudio.paInt16,
|
| 126 |
-
channels=1,
|
| 127 |
-
rate=self.audio_rate,
|
| 128 |
-
input=True,
|
| 129 |
-
frames_per_buffer=self.audio_chunk,
|
| 130 |
-
)
|
| 131 |
-
print("π€ Recording started...")
|
| 132 |
-
|
| 133 |
-
while self.is_recording:
|
| 134 |
-
if self.speaking_event.is_set():
|
| 135 |
-
time.sleep(0.01)
|
| 136 |
-
continue
|
| 137 |
-
|
| 138 |
-
try:
|
| 139 |
-
data = stream.read(self.audio_chunk, exception_on_overflow=False)
|
| 140 |
-
except Exception as e:
|
| 141 |
-
print(f"[recorder] read error: {e}")
|
| 142 |
-
continue
|
| 143 |
-
|
| 144 |
-
if not data:
|
| 145 |
-
continue
|
| 146 |
-
|
| 147 |
-
self.prebuffer.append(data)
|
| 148 |
-
self.lang_queues["en-US"].put(data)
|
| 149 |
-
self.lang_queues["fr-FR"].put(data)
|
| 150 |
-
|
| 151 |
-
try:
|
| 152 |
-
stream.stop_stream()
|
| 153 |
-
stream.close()
|
| 154 |
-
except Exception:
|
| 155 |
-
pass
|
| 156 |
-
print("π€ Recording stopped.")
|
| 157 |
-
except Exception as e:
|
| 158 |
-
print(f"[recorder] fatal: {e}")
|
| 159 |
-
|
| 160 |
-
# ---------------------------
|
| 161 |
-
# TTS streaming (ElevenLabs) - async
|
| 162 |
-
# ---------------------------
|
| 163 |
-
async def _stream_tts(self, text: str):
|
| 164 |
-
uri = (
|
| 165 |
-
f"wss://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}"
|
| 166 |
-
f"/stream-input?model_id=eleven_flash_v2_5&output_format=pcm_16000"
|
| 167 |
-
)
|
| 168 |
-
tts_audio_stream = None
|
| 169 |
-
websocket = None
|
| 170 |
-
try:
|
| 171 |
-
# Mark speaking and set event so recorder & STT pause
|
| 172 |
-
self.is_speaking = True
|
| 173 |
-
self.speaking_event.set()
|
| 174 |
-
|
| 175 |
-
# Clear prebuffer to avoid re-injecting TTS audio later
|
| 176 |
-
self.prebuffer.clear()
|
| 177 |
-
|
| 178 |
-
# Clear queued frames to avoid replay
|
| 179 |
-
for q in self.lang_queues.values():
|
| 180 |
-
with q.mutex:
|
| 181 |
-
q.queue.clear()
|
| 182 |
-
|
| 183 |
-
websocket = await websockets.connect(uri)
|
| 184 |
-
await websocket.send(json.dumps({
|
| 185 |
-
"text": " ",
|
| 186 |
-
"voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
|
| 187 |
-
"xi_api_key": self.elevenlabs_api_key,
|
| 188 |
-
}))
|
| 189 |
-
await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))
|
| 190 |
-
await websocket.send(json.dumps({"text": ""}))
|
| 191 |
-
|
| 192 |
-
tts_audio_stream = self.pyaudio_instance.open(
|
| 193 |
-
format=pyaudio.paInt16,
|
| 194 |
-
channels=1,
|
| 195 |
-
rate=16000,
|
| 196 |
-
output=True,
|
| 197 |
-
frames_per_buffer=1024,
|
| 198 |
-
)
|
| 199 |
-
|
| 200 |
-
prebuffer = bytearray()
|
| 201 |
-
playback_started = False
|
| 202 |
-
|
| 203 |
-
try:
|
| 204 |
-
while True:
|
| 205 |
-
try:
|
| 206 |
-
message = await asyncio.wait_for(websocket.recv(), timeout=8.0)
|
| 207 |
-
except asyncio.TimeoutError:
|
| 208 |
-
if playback_started:
|
| 209 |
-
break
|
| 210 |
-
else:
|
| 211 |
-
continue
|
| 212 |
-
|
| 213 |
-
if isinstance(message, bytes):
|
| 214 |
-
if not playback_started:
|
| 215 |
-
prebuffer.extend(message)
|
| 216 |
-
if len(prebuffer) >= 8000:
|
| 217 |
-
tts_audio_stream.write(bytes(prebuffer))
|
| 218 |
-
prebuffer.clear()
|
| 219 |
-
playback_started = True
|
| 220 |
-
else:
|
| 221 |
-
tts_audio_stream.write(message)
|
| 222 |
-
continue
|
| 223 |
-
|
| 224 |
-
try:
|
| 225 |
-
data = json.loads(message)
|
| 226 |
-
except Exception:
|
| 227 |
-
continue
|
| 228 |
-
|
| 229 |
-
if data.get("audio"):
|
| 230 |
-
audio_bytes = base64.b64decode(data["audio"])
|
| 231 |
-
if not playback_started:
|
| 232 |
-
prebuffer.extend(audio_bytes)
|
| 233 |
-
if len(prebuffer) >= 8000:
|
| 234 |
-
tts_audio_stream.write(bytes(prebuffer))
|
| 235 |
-
prebuffer.clear()
|
| 236 |
-
playback_started = True
|
| 237 |
-
else:
|
| 238 |
-
tts_audio_stream.write(audio_bytes)
|
| 239 |
-
elif data.get("isFinal"):
|
| 240 |
-
break
|
| 241 |
-
elif data.get("error"):
|
| 242 |
-
print("TTS error:", data["error"])
|
| 243 |
-
break
|
| 244 |
-
|
| 245 |
-
# Handle case where playback never started (very short audio)
|
| 246 |
-
if prebuffer and not playback_started:
|
| 247 |
-
tts_audio_stream.write(bytes(prebuffer))
|
| 248 |
-
|
| 249 |
-
finally:
|
| 250 |
-
try:
|
| 251 |
-
await websocket.close()
|
| 252 |
-
except Exception:
|
| 253 |
-
pass
|
| 254 |
-
|
| 255 |
-
except Exception as e:
|
| 256 |
-
pass
|
| 257 |
-
finally:
|
| 258 |
-
if tts_audio_stream:
|
| 259 |
-
try:
|
| 260 |
-
tts_audio_stream.stop_stream()
|
| 261 |
-
tts_audio_stream.close()
|
| 262 |
-
except Exception:
|
| 263 |
-
pass
|
| 264 |
-
|
| 265 |
-
# Force the STT request generators to exit by setting cancel events
|
| 266 |
-
for lang, ev in self.stream_cancel_events.items():
|
| 267 |
-
ev.set()
|
| 268 |
-
|
| 269 |
-
# Don't re-inject prebuffer - just clear the queues and let fresh audio come in
|
| 270 |
-
for q in self.lang_queues.values():
|
| 271 |
-
with q.mutex:
|
| 272 |
-
q.queue.clear()
|
| 273 |
-
|
| 274 |
-
# Clear speaking state and signal STT threads to restart
|
| 275 |
-
self.is_speaking = False
|
| 276 |
-
self.speaking_event.clear()
|
| 277 |
-
|
| 278 |
-
# Signal restart for both language streams
|
| 279 |
-
for lang, ev in self.restart_events.items():
|
| 280 |
-
ev.set()
|
| 281 |
-
|
| 282 |
-
await asyncio.sleep(0.1)
|
| 283 |
-
|
| 284 |
-
# ---------------------------
|
| 285 |
-
# TTS consumer (serializes TTS)
|
| 286 |
-
# ---------------------------
|
| 287 |
-
async def _tts_consumer(self):
|
| 288 |
-
print("[tts_consumer] started")
|
| 289 |
-
while True:
|
| 290 |
-
item = await self._tts_queue.get()
|
| 291 |
-
if item is None:
|
| 292 |
-
print("[tts_consumer] shutdown sentinel received")
|
| 293 |
-
break
|
| 294 |
-
text = item.get("text", "")
|
| 295 |
-
self._tts_job_counter += 1
|
| 296 |
-
job_id = self._tts_job_counter
|
| 297 |
-
print(f"[tts_consumer] job #{job_id} dequeued (len={len(text)})")
|
| 298 |
-
try:
|
| 299 |
-
await asyncio.wait_for(self._stream_tts(text), timeout=35.0)
|
| 300 |
-
except asyncio.TimeoutError:
|
| 301 |
-
print(f"[tts_consumer] job #{job_id} _stream_tts timed out; proceeding.")
|
| 302 |
-
except Exception as e:
|
| 303 |
-
print(f"[tts_consumer] job #{job_id} error during _stream_tts: {e}")
|
| 304 |
-
finally:
|
| 305 |
-
await asyncio.sleep(0.05)
|
| 306 |
-
print("[tts_consumer] exiting")
|
| 307 |
-
|
| 308 |
-
# ---------------------------
|
| 309 |
-
# Translation & TTS trigger
|
| 310 |
-
# ---------------------------
|
| 311 |
-
async def _process_result(self, transcript: str, confidence: float, language: str):
|
| 312 |
-
lang_flag = "π«π·" if language == "fr-FR" else "π¬π§"
|
| 313 |
-
print(f"{lang_flag} Heard ({language}, conf {confidence:.2f}): {transcript}")
|
| 314 |
-
|
| 315 |
-
# echo suppression vs last TTS in same language
|
| 316 |
-
if language == "fr-FR":
|
| 317 |
-
if transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
|
| 318 |
-
print(" (echo suppressed)")
|
| 319 |
-
return
|
| 320 |
-
else:
|
| 321 |
-
if transcript.strip().lower() == self.last_tts_text_en.strip().lower():
|
| 322 |
-
print(" (echo suppressed)")
|
| 323 |
-
return
|
| 324 |
-
|
| 325 |
-
try:
|
| 326 |
-
if language == "fr-FR":
|
| 327 |
-
translated = self.deepl_client.translate_text(transcript, target_lang="EN-US").text
|
| 328 |
-
print(f"π FR β EN: {translated}")
|
| 329 |
-
await self._tts_queue.put({"text": translated, "source_lang": language})
|
| 330 |
-
self.last_tts_text_en = translated
|
| 331 |
-
else:
|
| 332 |
-
translated = self.deepl_client.translate_text(transcript, target_lang="FR").text
|
| 333 |
-
print(f"π EN β FR: {translated}")
|
| 334 |
-
await self._tts_queue.put({"text": translated, "source_lang": language})
|
| 335 |
-
self.last_tts_text_fr = translated
|
| 336 |
-
print("π Queued for speaking...")
|
| 337 |
-
except Exception as e:
|
| 338 |
-
print(f"Translation error: {e}")
|
| 339 |
-
|
| 340 |
-
# ---------------------------
|
| 341 |
-
# STT streaming (run per language)
|
| 342 |
-
# ---------------------------
|
| 343 |
-
def _run_stt_stream(self, language: str):
|
| 344 |
-
print(f"[stt:{language}] Thread starting, thread_id={threading.get_ident()}")
|
| 345 |
-
self._stream_started[language] = False
|
| 346 |
-
last_transcript_in_stream = ""
|
| 347 |
-
|
| 348 |
-
while self.is_recording:
|
| 349 |
-
try:
|
| 350 |
-
if self._stream_started[language]:
|
| 351 |
-
print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Waiting for restart signal...")
|
| 352 |
-
signaled = self.restart_events[language].wait(timeout=30)
|
| 353 |
-
if not signaled and self.is_recording:
|
| 354 |
-
print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Timeout waiting for restart, restarting anyway")
|
| 355 |
-
if not self.is_recording:
|
| 356 |
-
break
|
| 357 |
-
try:
|
| 358 |
-
self.restart_events[language].clear()
|
| 359 |
-
except Exception:
|
| 360 |
-
pass
|
| 361 |
-
time.sleep(0.01)
|
| 362 |
-
|
| 363 |
-
self._stream_started[language] = True
|
| 364 |
-
last_transcript_in_stream = ""
|
| 365 |
-
print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Starting new stream...")
|
| 366 |
-
|
| 367 |
-
config = speech.RecognitionConfig(
|
| 368 |
-
encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
|
| 369 |
-
sample_rate_hertz=self.audio_rate,
|
| 370 |
-
language_code=language,
|
| 371 |
-
enable_automatic_punctuation=True,
|
| 372 |
-
model="latest_short",
|
| 373 |
-
)
|
| 374 |
-
streaming_config = speech.StreamingRecognitionConfig(
|
| 375 |
-
config=config,
|
| 376 |
-
interim_results=True,
|
| 377 |
-
single_utterance=False,
|
| 378 |
-
)
|
| 379 |
-
|
| 380 |
-
# Request generator yields StreamingRecognizeRequest messages
|
| 381 |
-
def request_generator():
|
| 382 |
-
while self.is_recording:
|
| 383 |
-
# If TTS is playing, skip sending mic frames to STT
|
| 384 |
-
if self.speaking_event.is_set():
|
| 385 |
-
time.sleep(0.01)
|
| 386 |
-
continue
|
| 387 |
-
# If cancel event set, clear and break to end stream
|
| 388 |
-
if self.stream_cancel_events[language].is_set():
|
| 389 |
-
try:
|
| 390 |
-
self.stream_cancel_events[language].clear()
|
| 391 |
-
except Exception:
|
| 392 |
-
pass
|
| 393 |
-
break
|
| 394 |
-
try:
|
| 395 |
-
chunk = self.lang_queues[language].get(timeout=1.0)
|
| 396 |
-
except queue.Empty:
|
| 397 |
-
continue
|
| 398 |
-
yield speech.StreamingRecognizeRequest(audio_content=chunk)
|
| 399 |
-
|
| 400 |
-
responses = self.stt_client.streaming_recognize(streaming_config, request_generator())
|
| 401 |
-
|
| 402 |
-
response_count = 0
|
| 403 |
-
final_received = False
|
| 404 |
-
|
| 405 |
-
for response in responses:
|
| 406 |
-
if not self.is_recording:
|
| 407 |
-
print(f"[stt:{language}] Stopped by user")
|
| 408 |
-
break
|
| 409 |
-
if not response.results:
|
| 410 |
-
continue
|
| 411 |
-
|
| 412 |
-
response_count += 1
|
| 413 |
-
for result in response.results:
|
| 414 |
-
if not result.alternatives:
|
| 415 |
-
continue
|
| 416 |
-
alt = result.alternatives[0]
|
| 417 |
-
transcript = alt.transcript.strip()
|
| 418 |
-
conf = getattr(alt, "confidence", 0.0)
|
| 419 |
-
is_final = bool(result.is_final)
|
| 420 |
-
|
| 421 |
-
if is_final:
|
| 422 |
-
now = time.strftime("%H:%M:%S")
|
| 423 |
-
print(f"[{now}] [stt:{language}] β '{transcript}' (final={is_final}, conf={conf:.2f})")
|
| 424 |
-
|
| 425 |
-
# Filter empty transcripts - don't break stream
|
| 426 |
-
if not transcript or len(transcript.strip()) == 0:
|
| 427 |
-
print(f"[{now}] [stt:{language}] Empty transcript -> ignoring, continuing stream")
|
| 428 |
-
continue
|
| 429 |
-
|
| 430 |
-
# Deduplicate within same stream
|
| 431 |
-
if transcript.strip().lower() == last_transcript_in_stream.strip().lower():
|
| 432 |
-
print(f"[{now}] [stt:{language}] Duplicate final in same stream -> suppressed")
|
| 433 |
-
continue
|
| 434 |
-
|
| 435 |
-
if conf < self.min_confidence_threshold:
|
| 436 |
-
print(f"[{now}] [stt:{language}] Final received but confidence {conf:.2f} < threshold -> suppressed")
|
| 437 |
-
continue
|
| 438 |
-
|
| 439 |
-
last_transcript_in_stream = transcript
|
| 440 |
-
|
| 441 |
-
if language == "fr-FR" and transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
|
| 442 |
-
print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_fr)")
|
| 443 |
-
continue
|
| 444 |
-
if language == "en-US" and transcript.strip().lower() == self.last_tts_text_en.strip().lower():
|
| 445 |
-
print(f"[{now}] [stt:{language}] (echo suppressed - matches last_tts_text_en)")
|
| 446 |
-
continue
|
| 447 |
-
|
| 448 |
-
asyncio.run_coroutine_threadsafe(
|
| 449 |
-
self._process_result(transcript, conf, language),
|
| 450 |
-
self.async_loop
|
| 451 |
-
)
|
| 452 |
-
final_received = True
|
| 453 |
-
break
|
| 454 |
-
|
| 455 |
-
if final_received:
|
| 456 |
-
break
|
| 457 |
-
|
| 458 |
-
print(f"[stt:{language}] Stream ended after {response_count} responses")
|
| 459 |
-
|
| 460 |
-
if self.is_recording and final_received:
|
| 461 |
-
print(f"[{time.strftime('%H:%M:%S')}] [stt:{language}] Final result processed. Waiting for TTS to complete and signal restart.")
|
| 462 |
-
elif self.is_recording and not final_received:
|
| 463 |
-
print(f"[stt:{language}] Stream ended unexpectedly, reconnecting...")
|
| 464 |
-
time.sleep(0.5)
|
| 465 |
-
else:
|
| 466 |
-
break
|
| 467 |
-
|
| 468 |
-
except Exception as e:
|
| 469 |
-
if self.is_recording:
|
| 470 |
-
import traceback
|
| 471 |
-
print(f"[stt:{language}] Error: {e}")
|
| 472 |
-
print(traceback.format_exc())
|
| 473 |
-
time.sleep(1.0)
|
| 474 |
-
else:
|
| 475 |
-
break
|
| 476 |
-
|
| 477 |
-
print(f"[stt:{language}] Thread exiting")
|
| 478 |
-
|
| 479 |
-
# ---------------------------
|
| 480 |
-
# Control
|
| 481 |
-
# ---------------------------
|
| 482 |
-
def start_translation(self):
|
| 483 |
-
if self.is_recording:
|
| 484 |
-
print("Already recording!")
|
| 485 |
-
return
|
| 486 |
-
self.is_recording = True
|
| 487 |
-
self.last_processed_transcript = ""
|
| 488 |
-
|
| 489 |
-
for ev in self.restart_events.values():
|
| 490 |
-
try:
|
| 491 |
-
ev.clear()
|
| 492 |
-
except Exception:
|
| 493 |
-
pass
|
| 494 |
-
self.speaking_event.clear()
|
| 495 |
-
|
| 496 |
-
for q in self.lang_queues.values():
|
| 497 |
-
with q.mutex:
|
| 498 |
-
q.queue.clear()
|
| 499 |
-
|
| 500 |
-
self.recording_thread = threading.Thread(target=self._record_audio, daemon=True)
|
| 501 |
-
self.recording_thread.start()
|
| 502 |
-
|
| 503 |
-
        for lang in ("en-US", "fr-FR"):
            t = threading.Thread(target=self._run_stt_stream, args=(lang,), daemon=True)
            self.stt_threads[lang] = t
            t.start()
            print(f"[main] STT thread {lang} started: {t.is_alive()} at {time.strftime('%H:%M:%S')}")

        for ev in self.restart_events.values():
            ev.set()

    def stop_translation(self):
        print("\n⏹️ Stopping translation...")
        self.is_recording = False
        for ev in self.restart_events.values():
            ev.set()
        self.speaking_event.clear()

        if self._tts_consumer_task and not (self._tts_consumer_task.done() if hasattr(self._tts_consumer_task, 'done') else False):
            try:
                def _put_sentinel():
                    try:
                        self._tts_queue.put_nowait(None)
                    except Exception:
                        asyncio.create_task(self._tts_queue.put(None))
                self.async_loop.call_soon_threadsafe(_put_sentinel)
            except Exception:
                pass

        time.sleep(0.2)

    def cleanup(self):
        self.stop_translation()
        try:
            if self.async_loop.is_running():
                def _stop_loop():
                    if self._tts_consumer_task and not self._tts_consumer_task.done():
                        try:
                            self._tts_queue.put_nowait(None)
                        except Exception:
                            pass
                    self.async_loop.stop()
                self.async_loop.call_soon_threadsafe(_stop_loop)
        except Exception:
            pass
        try:
            self.pyaudio_instance.terminate()
        except Exception:
            pass

# -----------------------------------------------------------------------------
# Main entry
# -----------------------------------------------------------------------------
def main():
    load_dotenv()
    google_creds = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
    deepl_key = os.getenv("DEEPL_API_KEY")
    eleven_key = os.getenv("ELEVENLABS_API_KEY")
    voice_id = os.getenv("ELEVENLABS_VOICE_ID")

    if not all([google_creds, deepl_key, eleven_key, voice_id]):
        print("Missing API keys or credentials.")
        return

    translator = VoiceTranslator(deepl_key, eleven_key, voice_id)
    print("Ready! Press ENTER to start, ENTER again to stop, Ctrl+C to quit.\n")

    try:
        while True:
            input("Press ENTER to start speaking...")
            translator.start_translation()
            input("Press ENTER to stop...\n")
            translator.stop_translation()
    except KeyboardInterrupt:
        print("\nKeyboardInterrupt received - cleaning up.")
        translator.cleanup()

if __name__ == "__main__":
    main()
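The `stop_translation` path above wakes the TTS consumer by pushing a `None` sentinel into an `asyncio.Queue` from a non-loop thread via `call_soon_threadsafe`. A minimal runnable sketch of that hand-off pattern, using only the standard library (all names here are illustrative, not from the app):

```python
import asyncio
import threading

def run_loop(loop: asyncio.AbstractEventLoop):
    # Dedicated event-loop thread, like VoiceTranslator's async_thread.
    asyncio.set_event_loop(loop)
    loop.run_forever()

async def consumer(q: "asyncio.Queue", out: list):
    # Drain items until the None sentinel arrives, like _tts_consumer.
    while True:
        item = await q.get()
        if item is None:
            break
        out.append(item)

async def make_queue() -> "asyncio.Queue":
    # Create the queue on the loop's own thread so it binds to that loop.
    return asyncio.Queue()

loop = asyncio.new_event_loop()
t = threading.Thread(target=run_loop, args=(loop,), daemon=True)
t.start()

q = asyncio.run_coroutine_threadsafe(make_queue(), loop).result()
seen: list = []
done = asyncio.run_coroutine_threadsafe(consumer(q, seen), loop)

# Feed work, then the shutdown sentinel, from the main (non-loop) thread.
for item in ("hello", "world", None):
    loop.call_soon_threadsafe(q.put_nowait, item)

done.result(timeout=5)          # consumer exits cleanly on the sentinel
loop.call_soon_threadsafe(loop.stop)
print(seen)                     # ['hello', 'world']
```

The sentinel lets the consumer finish in-flight work and return normally, instead of being cancelled mid-job.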
@@ -0,0 +1,137 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Real-Time Voice Translator</title>
    <style>
        body { font-family: sans-serif; display: flex; flex-direction: column; align-items: center; justify-content: center; height: 100vh; margin: 0; background-color: #f0f0f0; }
        #controls { margin-bottom: 20px; }
        button { font-size: 1.2em; padding: 10px 20px; cursor: pointer; }
        #status { font-size: 1.1em; color: #333; }
    </style>
</head>
<body>
    <h1>Real-Time Voice Translator</h1>
    <div id="controls">
        <button id="startButton">Start Translation</button>
        <button id="stopButton" disabled>Stop Translation</button>
    </div>
    <p id="status">Status: Not connected</p>

    <script>
        const startButton = document.getElementById('startButton');
        const stopButton = document.getElementById('stopButton');
        const statusDiv = document.getElementById('status');
        let socket;
        let mediaRecorder;
        let audioContext;
        let audioQueue = [];
        let isPlaying = false;

        const connectWebSocket = () => {
            const proto = window.location.protocol === "https:" ? "wss:" : "ws:";
            socket = new WebSocket(`${proto}//${window.location.host}/ws`);

            socket.onopen = () => {
                statusDiv.textContent = 'Status: Connected. Press Start.';
                startButton.disabled = false;
            };

            socket.onmessage = (event) => {
                if (event.data instanceof Blob) {
                    const reader = new FileReader();
                    reader.onload = function() {
                        const arrayBuffer = this.result;
                        audioContext.decodeAudioData(arrayBuffer, (buffer) => {
                            audioQueue.push(buffer);
                            if (!isPlaying) {
                                playNextInQueue();
                            }
                        });
                    };
                    reader.readAsArrayBuffer(event.data);
                }
            };

            socket.onclose = () => {
                statusDiv.textContent = 'Status: Disconnected';
                startButton.disabled = true;
                stopButton.disabled = true;
            };

            socket.onerror = (error) => {
                console.error("WebSocket Error:", error);
                statusDiv.textContent = 'Status: Connection error';
            };
        };

        const playNextInQueue = () => {
            if (audioQueue.length > 0) {
                isPlaying = true;
                const buffer = audioQueue.shift();
                const source = audioContext.createBufferSource();
                source.buffer = buffer;
                source.connect(audioContext.destination);
                source.onended = () => {
                    isPlaying = false;
                    playNextInQueue();
                };
                source.start();
            }
        };

        startButton.onclick = async () => {
            if (!socket || socket.readyState !== WebSocket.OPEN) {
                connectWebSocket();
            }

            audioContext = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });

            if (audioContext.state === 'suspended') {
                await audioContext.resume();
            }

            navigator.mediaDevices.getUserMedia({ audio: { sampleRate: 16000, channelCount: 1 } })
                .then(stream => {
                    mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm; codecs=opus' });
                    mediaRecorder.ondataavailable = event => {
                        if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
                            socket.send(event.data);
                        }
                    };
                    mediaRecorder.start(250); // Send a chunk every 250 ms

                    startButton.disabled = true;
                    stopButton.disabled = false;
                    statusDiv.textContent = 'Status: Translating...';
                })
                .catch(err => {
                    console.error('Error getting user media:', err);
                    statusDiv.textContent = 'Error: Could not access microphone.';
                });
        };

        stopButton.onclick = () => {
            if (mediaRecorder) {
                mediaRecorder.stop();
            }
            if (socket && socket.readyState === WebSocket.OPEN) {
                socket.send(JSON.stringify({type: "stop"}));
                socket.close();
            }
            startButton.disabled = false;
            stopButton.disabled = true;
            statusDiv.textContent = 'Status: Stopped. Re-connect to start again.';
        };

        window.onload = () => {
            startButton.disabled = false;
            stopButton.disabled = true;
            statusDiv.textContent = 'Status: Ready to connect.';
        };
    </script>
</body>
</html>
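The recorder above emits one WebM/Opus blob every 250 ms, which the server decodes to 16 kHz mono 16-bit PCM; at that format a fully decoded chunk works out to about 8,000 bytes. A quick sketch of the arithmetic (the constants mirror the page and the ffmpeg flags; the helper name is made up):

```python
SAMPLE_RATE = 16_000      # Hz, matches the AudioContext and ffmpeg ar=16k
BYTES_PER_SAMPLE = 2      # s16le = signed 16-bit little-endian
CHANNELS = 1              # mono, matches channelCount: 1 and ffmpeg ac=1
CHUNK_MS = 250            # mediaRecorder.start(250)

def pcm_bytes(ms: int) -> int:
    """Bytes of raw PCM for `ms` milliseconds at the format above."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * ms // 1000

print(pcm_bytes(CHUNK_MS))   # 8000
print(pcm_bytes(1000))       # 32000 bytes per second of audio
```

Useful when budgeting WebSocket throughput: the raw PCM going back to the browser is roughly 32 kB/s per stream.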
@@ -0,0 +1 @@
ffmpeg

@@ -1,5 +1,8 @@
 google-cloud-speech
 deepl
-pyaudio
 websockets
 python-dotenv
+fastapi
+uvicorn
+python-multipart
+ffmpeg-python

@@ -0,0 +1,96 @@
import asyncio
import os

import ffmpeg
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles
from dotenv import load_dotenv

from translator import VoiceTranslator

load_dotenv()

app = FastAPI()
app.mount("/static", StaticFiles(directory="static"), name="static")

# Load environment variables for API keys
google_creds = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
deepl_key = os.getenv("DEEPL_API_KEY")
eleven_key = os.getenv("ELEVENLABS_API_KEY")
voice_id = os.getenv("ELEVENLABS_VOICE_ID")

if not all([google_creds, deepl_key, eleven_key, voice_id]):
    raise ValueError("Missing one or more required API keys in .env file.")

translator = VoiceTranslator(deepl_key, eleven_key, voice_id)

@app.get("/")
async def get():
    return HTMLResponse(open("index.html", "r").read())

async def audio_output_sender(ws: WebSocket, output_queue: asyncio.Queue):
    print("Audio output sender started.")
    while True:
        try:
            audio_chunk = await output_queue.get()
            if audio_chunk is None:
                break
            await ws.send_bytes(audio_chunk)
        except asyncio.CancelledError:
            break
    print("Audio output sender stopped.")

async def handle_audio_input(websocket: WebSocket, input_queue: asyncio.Queue):
    print("Audio input handler started.")
    while True:
        try:
            data = await websocket.receive_bytes()
            # Convert WebM to raw 16 kHz mono PCM
            process = (
                ffmpeg
                .input('pipe:0')
                .output('pipe:1', format='s16le', acodec='pcm_s16le', ac=1, ar='16k')
                .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True)
            )
            stdout, stderr = process.communicate(input=data)
            await input_queue.put(stdout)
        except WebSocketDisconnect:
            break
        except Exception as e:
            print(f"Audio input error: {e}")
            break
    print("Audio input handler stopped.")


@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    print("WebSocket connection accepted.")

    output_sender_task = None
    input_handler_task = None

    try:
        # Start translation and audio processing tasks
        translator.start_translation()
        output_sender_task = asyncio.create_task(
            audio_output_sender(websocket, translator.output_queue)
        )
        input_handler_task = asyncio.create_task(
            handle_audio_input(websocket, translator.input_queue)
        )

        await asyncio.gather(input_handler_task, output_sender_task)

    except WebSocketDisconnect:
        print("Client disconnected.")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        print("Stopping translation and cleaning up...")
        if output_sender_task:
            output_sender_task.cancel()
        if input_handler_task:
            input_handler_task.cancel()
        translator.stop_translation()
        await websocket.close()
        print("WebSocket connection closed.")
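`audio_output_sender` above is an instance of a small, testable pattern: forward items from an `asyncio.Queue` to a sink until a `None` sentinel arrives or the task is cancelled. A standalone sketch with a list-appending coroutine standing in for `WebSocket.send_bytes` (names are illustrative, not from server.py):

```python
import asyncio
from typing import Awaitable, Callable

async def relay(queue: "asyncio.Queue",
                send: Callable[[bytes], Awaitable[None]]) -> int:
    """Forward chunks from `queue` to `send` until a None sentinel.

    Returns the number of chunks forwarded; swallows cancellation so the
    caller can tear the task down without a traceback.
    """
    sent = 0
    while True:
        try:
            chunk = await queue.get()
            if chunk is None:
                break
            await send(chunk)
            sent += 1
        except asyncio.CancelledError:
            break
    return sent

async def main() -> int:
    received: list = []

    async def fake_send(data: bytes) -> None:
        # Stand-in for ws.send_bytes(data).
        received.append(data)

    q: "asyncio.Queue" = asyncio.Queue()
    for item in (b"\x00\x01", b"\x02\x03", None):
        q.put_nowait(item)
    count = await relay(q, fake_send)
    assert received == [b"\x00\x01", b"\x02\x03"]
    return count

print(asyncio.run(main()))   # 2
```

Injecting the sink as a coroutine keeps the loop body identical in tests and in production, where the sink would be the WebSocket.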
@@ -0,0 +1,276 @@
import asyncio
import base64
import json
import queue
import threading
import time
from collections import deque
from typing import Dict, Optional

import deepl
import websockets
from google.cloud import speech


class VoiceTranslator:
    def __init__(self, deepl_api_key: str, elevenlabs_api_key: str, elevenlabs_voice_id: str):
        self.deepl_client = deepl.Translator(deepl_api_key)
        self.elevenlabs_api_key = elevenlabs_api_key
        self.voice_id = elevenlabs_voice_id
        self.stt_client = speech.SpeechClient()

        self.audio_rate = 16000
        self.audio_chunk = 1024

        self.input_queue = asyncio.Queue()
        self.output_queue = asyncio.Queue()

        self.lang_queues: Dict[str, queue.Queue] = {
            "en-US": queue.Queue(),
            "fr-FR": queue.Queue(),
        }
        self.prebuffer = deque(maxlen=12)

        self.is_recording = False
        self.is_speaking = False
        self.speaking_event = threading.Event()

        self.last_processed_transcript = ""
        self.last_tts_text_en = ""
        self.last_tts_text_fr = ""
        self.min_confidence_threshold = 0.5

        self.async_loop = asyncio.new_event_loop()
        self.async_thread = threading.Thread(target=self._run_async_loop, daemon=True)
        self.async_thread.start()

        self._tts_queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
        self._tts_consumer_task: Optional[asyncio.Task] = None
        self._process_audio_task: Optional[asyncio.Task] = None

        def _start_consumer():
            self._tts_consumer_task = asyncio.create_task(self._tts_consumer())

        self.async_loop.call_soon_threadsafe(_start_consumer)

        self.stt_threads: Dict[str, threading.Thread] = {}
        self.restart_events: Dict[str, threading.Event] = {
            "en-US": threading.Event(),
            "fr-FR": threading.Event(),
        }
        self.stream_cancel_events: Dict[str, threading.Event] = {
            "en-US": threading.Event(),
            "fr-FR": threading.Event(),
        }
        self._stream_started = {"en-US": False, "fr-FR": False}
        self._tts_job_counter = 0

    def _run_async_loop(self):
        asyncio.set_event_loop(self.async_loop)
        try:
            self.async_loop.run_forever()
        except Exception as e:
            print(f"[async_loop] stopped with error: {e}")

    async def _process_input_audio(self):
        print("🎤 Audio processing task started...")
        while self.is_recording:
            try:
                data = await self.input_queue.get()
                if data is None:
                    break
                if not self.speaking_event.is_set():
                    self.prebuffer.append(data)
                    self.lang_queues["en-US"].put(data)
                    self.lang_queues["fr-FR"].put(data)
            except asyncio.CancelledError:
                break
            except Exception as e:
                print(f"[audio_processor] error: {e}")
        print("🎤 Audio processing task stopped.")

    async def _stream_tts(self, text: str):
        uri = (
            f"wss://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}"
            f"/stream-input?model_id=eleven_flash_v2_5&output_format=pcm_16000"
        )
        try:
            self.is_speaking = True
            self.speaking_event.set()
            self.prebuffer.clear()
            for q in self.lang_queues.values():
                with q.mutex:
                    q.queue.clear()

            async with websockets.connect(uri) as websocket:
                await websocket.send(json.dumps({
                    "text": " ",
                    "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
                    "xi_api_key": self.elevenlabs_api_key,
                }))
                await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))
                await websocket.send(json.dumps({"text": ""}))

                while True:
                    try:
                        message = await websocket.recv()
                        data = json.loads(message)
                        if data.get("audio"):
                            audio_chunk = base64.b64decode(data["audio"])
                            await self.output_queue.put(audio_chunk)
                        elif data.get("isFinal"):
                            break
                    except websockets.exceptions.ConnectionClosed:
                        break
                    except Exception:
                        continue
        except Exception as e:
            print(f"TTS streaming error: {e}")
        finally:
            for lang, ev in self.stream_cancel_events.items():
                ev.set()
            for q in self.lang_queues.values():
                with q.mutex:
                    q.queue.clear()

            self.is_speaking = False
            self.speaking_event.clear()
            for lang, ev in self.restart_events.items():
                ev.set()
            await asyncio.sleep(0.1)

    async def _tts_consumer(self):
        print("[tts_consumer] started")
        while True:
            try:
                item = await self._tts_queue.get()
                if item is None:
                    break
                text = item.get("text", "")
                self._tts_job_counter += 1
                job_id = self._tts_job_counter
                print(f"[tts_consumer] job #{job_id} dequeued (len={len(text)})")
                await self._stream_tts(text)
            except asyncio.CancelledError:
                break
            except Exception as e:
                print(f"[tts_consumer] error: {e}")
        print("[tts_consumer] exiting")

    async def _process_result(self, transcript: str, confidence: float, language: str):
        lang_flag = "🇫🇷" if language == "fr-FR" else "🇬🇧"
        print(f"{lang_flag} Heard ({language}, conf {confidence:.2f}): {transcript}")

        if language == "fr-FR" and transcript.strip().lower() == self.last_tts_text_fr.strip().lower():
            print("  (echo suppressed)")
            return
        if language == "en-US" and transcript.strip().lower() == self.last_tts_text_en.strip().lower():
            print("  (echo suppressed)")
            return

        try:
            if language == "fr-FR":
                translated = self.deepl_client.translate_text(transcript, target_lang="EN-US").text
                print(f"🔄 FR → EN: {translated}")
                await self._tts_queue.put({"text": translated, "source_lang": language})
                self.last_tts_text_en = translated
            else:
                translated = self.deepl_client.translate_text(transcript, target_lang="FR").text
                print(f"🔄 EN → FR: {translated}")
                await self._tts_queue.put({"text": translated, "source_lang": language})
                self.last_tts_text_fr = translated
            print("🔊 Queued for speaking...")
        except Exception as e:
            print(f"Translation error: {e}")

    def _run_stt_stream(self, language: str):
        print(f"[stt:{language}] Thread starting...")
        while self.is_recording:
            try:
                self.restart_events[language].wait()
                self.restart_events[language].clear()

                config = speech.RecognitionConfig(
                    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
                    sample_rate_hertz=self.audio_rate,
                    language_code=language,
                    enable_automatic_punctuation=True,
                    model="latest_short",
                )
                streaming_config = speech.StreamingRecognitionConfig(
                    config=config, interim_results=True, single_utterance=False
                )

                def request_generator():
                    while self.is_recording:
                        if self.speaking_event.is_set():
                            time.sleep(0.01)
                            continue
                        if self.stream_cancel_events[language].is_set():
                            self.stream_cancel_events[language].clear()
                            break
                        try:
                            chunk = self.lang_queues[language].get(timeout=0.1)
                            yield speech.StreamingRecognizeRequest(audio_content=chunk)
                        except queue.Empty:
                            continue

                responses = self.stt_client.streaming_recognize(streaming_config, request_generator())

                for response in responses:
                    if not self.is_recording:
                        break
                    for result in response.results:
                        if result.is_final and result.alternatives:
                            alt = result.alternatives[0]
                            transcript = alt.transcript.strip()
                            conf = getattr(alt, "confidence", 0.0)
                            if transcript and conf >= self.min_confidence_threshold:
                                asyncio.run_coroutine_threadsafe(
                                    self._process_result(transcript, conf, language), self.async_loop
                                )
            except Exception as e:
                print(f"[stt:{language}] Error: {e}")
                time.sleep(1)
        print(f"[stt:{language}] Thread exiting")

    def start_translation(self):
        if self.is_recording:
            return
        self.is_recording = True

        for ev in self.restart_events.values():
            ev.clear()
        self.speaking_event.clear()

        def _start_tasks():
            self._process_audio_task = asyncio.create_task(self._process_input_audio())

        self.async_loop.call_soon_threadsafe(_start_tasks)

        for lang in ("en-US", "fr-FR"):
            thread = threading.Thread(target=self._run_stt_stream, args=(lang,), daemon=True)
            self.stt_threads[lang] = thread
            thread.start()
            self.restart_events[lang].set()

    def stop_translation(self):
        if not self.is_recording:
            return
        self.is_recording = False
        for ev in self.restart_events.values():
            ev.set()

        def _cancel_tasks():
            if self._process_audio_task:
                self._process_audio_task.cancel()
            if self._tts_queue:
                self._tts_queue.put_nowait(None)

        self.async_loop.call_soon_threadsafe(_cancel_tasks)

    def cleanup(self):
        self.stop_translation()
        time.sleep(0.5)
        if self.async_loop.is_running():
            self.async_loop.call_soon_threadsafe(self.async_loop.stop)
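`_stream_tts` above treats each incoming ElevenLabs frame as JSON carrying a base64 `audio` payload, stopping when an `isFinal` frame arrives; that frame shape is taken from this handler, not verified against API docs. The decode step in isolation, run against fabricated frames:

```python
import base64
import json

def decode_tts_messages(messages: list) -> bytes:
    """Collect base64 'audio' payloads from JSON frames until 'isFinal'."""
    pcm = bytearray()
    for raw in messages:
        frame = json.loads(raw)
        if frame.get("audio"):
            pcm.extend(base64.b64decode(frame["audio"]))
        elif frame.get("isFinal"):
            break
    return bytes(pcm)

# Fabricated frames mimicking the stream shape handled above.
frames = [
    json.dumps({"audio": base64.b64encode(b"\x01\x02").decode()}),
    json.dumps({"audio": base64.b64encode(b"\x03\x04").decode()}),
    json.dumps({"isFinal": True}),
]
print(decode_tts_messages(frames))   # b'\x01\x02\x03\x04'
```

Isolating the decode makes it easy to test the reassembly logic without opening a live TTS socket.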