flozi00 committed on
Commit
a86cdfa
·
1 Parent(s): 9c1a0c2
README.md CHANGED
@@ -1,11 +1,181 @@
  ---
- title: Chatterbox-Multilingual-TTS
- emoji: 🌎
  colorFrom: indigo
  colorTo: blue
  sdk: gradio
  sdk_version: 5.29.0
- app_file: app.py
  pinned: false
- short_description: Chatterbox TTS supporting 23 languages
- ---
  ---
+ title: Telefonansagen TTS Engine
+ emoji: 📞
  colorFrom: indigo
  colorTo: blue
  sdk: gradio
  sdk_version: 5.29.0
+ app_file: app_new.py
  pinned: false
+ short_description: Professional phone announcements with AI TTS
+ ---
+
+ # Telefonansagen TTS Engine
+
+ A modular text-to-speech engine for generating professional phone announcements (Telefonansagen) with support for 23 languages and voice cloning.
+
+ ## Features
+
+ - 🎙️ **High-Quality TTS**: Uses Chatterbox Multilingual for natural speech synthesis
+ - 🌍 **23 Languages**: German, English, French, Spanish, Italian, and many more
+ - 🎭 **Voice Cloning**: Clone any voice from a short audio sample
+ - 🔌 **Modular Architecture**: Easy to swap TTS backends
+ - 🎵 **Background Music**: Optional background music mixing
+ - 💾 **Caching**: Local and HuggingFace Hub caching support
+
+ ## Quick Start
+
+ ```bash
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the application
+ python app_new.py
+ ```
+
+ ## Architecture
+
+ The engine uses a modular backend system that allows easy swapping of TTS providers:
+
+ ```
+ engine/
+ ├── __init__.py           # Main exports
+ ├── tts_engine.py         # Core TTS engine
+ ├── audio_processor.py    # Post-processing (music, fades)
+ ├── cache.py              # Caching system
+ └── backends/
+     ├── base.py                  # Abstract backend interface
+     ├── chatterbox_backend.py    # Default: Chatterbox Multilingual
+     └── gemini_backend.py        # Optional: Google Gemini TTS
+ ```
+
+ ## Usage
+
+ ### Simple Usage
+
+ ```python
+ from engine import TTSEngine
+
+ # Create engine with defaults
+ engine = TTSEngine()
+
+ # Generate German announcement (default)
+ audio = engine.generate("Willkommen bei unserem Service.")
+
+ # Generate with specific language
+ audio = engine.generate(
+     "Welcome to our customer service.",
+     language="en"
+ )
+ ```
+
+ ### Voice Cloning
+
+ ```python
+ # Clone a voice from reference audio
+ audio = engine.generate(
+     "Herzlich willkommen!",
+     language="de",
+     voice_audio="path/to/reference.wav"
+ )
+ ```
+
+ ### Switch Backend
+
+ ```python
+ # Use Gemini instead of Chatterbox (requires GEMINI_API_KEY)
+ engine.set_backend("gemini")
+ audio = engine.generate("Hello world!", language="en")
+ ```
+
+ ### With Background Music
+
+ ```python
+ # Add background music (place .mp3 files in engine/data/assets/)
+ audio = engine.generate(
+     "Bitte warten Sie.",
+     background_music="hold_music"
+ )
+ ```
+
+ ## Creating a Custom Backend
+
+ To add a new TTS backend, inherit from `TTSBackend`:
+
+ ```python
+ from engine.backends.base import TTSBackend, TTSResult, BackendConfig
+
+ class MyCustomBackend(TTSBackend):
+     @property
+     def name(self) -> str:
+         return "My Custom TTS"
+
+     @property
+     def supports_voice_cloning(self) -> bool:
+         return False
+
+     @property
+     def supported_languages(self) -> dict[str, str]:
+         return {"en": "English", "de": "German"}
+
+     def load(self) -> None:
+         # Load your model
+         self._is_loaded = True
+
+     def unload(self) -> None:
+         # Cleanup
+         self._is_loaded = False
+
+     def generate(self, text: str, language: str = "de", **kwargs) -> TTSResult:
+         # Generate audio
+         audio = your_tts_function(text, language)
+         return TTSResult(audio=audio, sample_rate=22050)
+
+ # Register the backend
+ from engine import TTSEngine
+ TTSEngine.register_backend("my_custom", MyCustomBackend)
+ ```
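The `register_backend` call above plugs a class into the engine's backend registry. As a standalone illustration of that pattern, here is a minimal sketch; all names below (`Backend`, `Registry`, `EchoBackend`) are illustrative only and not part of the engine's actual API:

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Illustrative stand-in for an abstract TTS backend interface."""

    @abstractmethod
    def generate(self, text: str) -> str: ...


class Registry:
    """Maps backend names to backend classes, mirroring the register/create flow."""

    _backends: dict[str, type[Backend]] = {}

    @classmethod
    def register_backend(cls, name: str, backend_cls: type[Backend]) -> None:
        cls._backends[name] = backend_cls

    @classmethod
    def create(cls, name: str) -> Backend:
        if name not in cls._backends:
            raise KeyError(f"Unknown backend: {name}")
        return cls._backends[name]()


class EchoBackend(Backend):
    """Toy backend that returns a tagged string instead of audio."""

    def generate(self, text: str) -> str:
        return f"audio({text})"


Registry.register_backend("echo", EchoBackend)
print(Registry.create("echo").generate("hello"))  # prints "audio(hello)"
```

Instantiating through the registry by name is what lets `set_backend("gemini")` work without the caller importing the backend class directly.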
+
+ ## Configuration
+
+ ### Engine Configuration
+
+ ```python
+ from engine.tts_engine import TTSEngine, EngineConfig
+
+ config = EngineConfig(
+     default_backend="chatterbox",
+     device="cuda",  # or "cpu", "mps", "auto"
+     default_language="de",
+     enable_cache=True,
+     local_cache_dir="./cache",
+ )
+
+ engine = TTSEngine(config)
+ ```
+
+ ### Environment Variables
+
+ - `HF_TOKEN`: HuggingFace token for model downloads
+ - `GEMINI_API_KEY`: Google API key (for Gemini backend)
+
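Both variables can be exported in the shell before launching the app. The values below are placeholders, not working credentials; only the Gemini backend needs `GEMINI_API_KEY`, while the default Chatterbox backend runs without it:

```shell
# Placeholder values; substitute real tokens before running the app
export HF_TOKEN="hf_xxxxxxxxxxxxxxxx"
export GEMINI_API_KEY="your-gemini-api-key"
```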
+ ## Supported Languages
+
+ | Code | Language | Code | Language |
+ |------|----------|------|----------|
+ | de | German | ja | Japanese |
+ | en | English | ko | Korean |
+ | fr | French | ms | Malay |
+ | es | Spanish | nl | Dutch |
+ | it | Italian | no | Norwegian |
+ | pt | Portuguese | pl | Polish |
+ | ru | Russian | sv | Swedish |
+ | zh | Chinese | sw | Swahili |
+ | ar | Arabic | tr | Turkish |
+ | da | Danish | fi | Finnish |
+ | el | Greek | he | Hebrew |
+ | hi | Hindi | | |
+
+ ## License
+
+ MIT License
app.py CHANGED
@@ -1,321 +1,322 @@
  import random
  import numpy as np
  import torch
- from src.chatterbox.mtl_tts import ChatterboxMultilingualTTS, SUPPORTED_LANGUAGES
- import gradio as gr
- import spaces
-
- DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
- print(f"🚀 Running on device: {DEVICE}")
-
- # --- Global Model Initialization ---
- MODEL = None
-
- LANGUAGE_CONFIG = {
-     "ar": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ar_f/ar_prompts2.flac",
-         "text": "في الشهر الماضي، وصلنا إلى معلم جديد بمليارين من المشاهدات على قناتنا على يوتيوب."
-     },
-     "da": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/da_m1.flac",
-         "text": "Sidste måned nåede vi en ny milepæl med to milliarder visninger på vores YouTube-kanal."
-     },
-     "de": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/de_f1.flac",
-         "text": "Letzten Monat haben wir einen neuen Meilenstein erreicht: zwei Milliarden Aufrufe auf unserem YouTube-Kanal."
-     },
-     "el": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/el_m.flac",
-         "text": "Τον περασμένο μήνα, φτάσαμε σε ένα νέο ορόσημο με δύο δισεκατομμύρια προβολές στο κανάλι μας στο YouTube."
-     },
-     "en": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/en_f1.flac",
-         "text": "Last month, we reached a new milestone with two billion views on our YouTube channel."
-     },
-     "es": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/es_f1.flac",
-         "text": "El mes pasado alcanzamos un nuevo hito: dos mil millones de visualizaciones en nuestro canal de YouTube."
-     },
-     "fi": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/fi_m.flac",
-         "text": "Viime kuussa saavutimme uuden virstanpylvään kahden miljardin katselukerran kanssa YouTube-kanavallamme."
-     },
-     "fr": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/fr_f1.flac",
-         "text": "Le mois dernier, nous avons atteint un nouveau jalon avec deux milliards de vues sur notre chaîne YouTube."
-     },
-     "he": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/he_m1.flac",
-         "text": "בחודש שעבר הגענו לאבן דרך חדשה עם שני מיליארד צפיות בערוץ היוטיוב שלנו."
-     },
-     "hi": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/hi_f1.flac",
-         "text": "पिछले महीने हमने एक नया मील का पत्थर छुआ: हमारे YouTube चैनल पर दो अरब व्यूज़।"
-     },
-     "it": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/it_m1.flac",
-         "text": "Il mese scorso abbiamo raggiunto un nuovo traguardo: due miliardi di visualizzazioni sul nostro canale YouTube."
-     },
-     "ja": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ja/ja_prompts1.flac",
-         "text": "先月、私たちのYouTubeチャンネルで二十億回の再生回数という新たなマイルストーンに到達しました。"
-     },
-     "ko": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ko_f.flac",
-         "text": "지난달 우리는 유튜브 채널에서 이십억 조회수라는 새로운 이정표에 도달했습니다."
-     },
-     "ms": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ms_f.flac",
-         "text": "Bulan lepas, kami mencapai pencapaian baru dengan dua bilion tontonan di saluran YouTube kami."
-     },
-     "nl": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/nl_m.flac",
-         "text": "Vorige maand bereikten we een nieuwe mijlpaal met twee miljard weergaven op ons YouTube-kanaal."
-     },
-     "no": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/no_f1.flac",
-         "text": "Forrige måned nådde vi en ny milepæl med to milliarder visninger på YouTube-kanalen vår."
-     },
-     "pl": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/pl_m.flac",
-         "text": "W zeszłym miesiącu osiągnęliśmy nowy kamień milowy z dwoma miliardami wyświetleń na naszym kanale YouTube."
-     },
-     "pt": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/pt_m1.flac",
-         "text": "No mês passado, alcançámos um novo marco: dois mil milhões de visualizações no nosso canal do YouTube."
-     },
-     "ru": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ru_m.flac",
-         "text": "В прошлом месяце мы достигли нового рубежа: два миллиарда просмотров на нашем YouTube-канале."
-     },
-     "sv": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/sv_f.flac",
-         "text": "Förra månaden nådde vi en ny milstolpe med två miljarder visningar på vår YouTube-kanal."
-     },
-     "sw": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/sw_m.flac",
-         "text": "Mwezi uliopita, tulifika hatua mpya ya maoni ya bilioni mbili kweny kituo chetu cha YouTube."
-     },
-     "tr": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/tr_m.flac",
-         "text": "Geçen ay YouTube kanalımızda iki milyar görüntüleme ile yeni bir dönüm noktasına ulaştık."
-     },
-     "zh": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/zh_f2.flac",
-         "text": "上个月，我们达到了一个新的里程碑。 我们的YouTube频道观看次数达到了二十亿次，这绝对令人难以置信。"
-     },
  }

- # --- UI Helpers ---
- def default_audio_for_ui(lang: str) -> str | None:
-     return LANGUAGE_CONFIG.get(lang, {}).get("audio")

- def default_text_for_ui(lang: str) -> str:
-     return LANGUAGE_CONFIG.get(lang, {}).get("text", "")

- def get_supported_languages_display() -> str:
-     """Generate a formatted display of all supported languages."""
-     language_items = []
-     for code, name in sorted(SUPPORTED_LANGUAGES.items()):
-         language_items.append(f"**{name}** (`{code}`)")
-
-     # Split into 2 lines
-     mid = len(language_items) // 2
-     line1 = " • ".join(language_items[:mid])
-     line2 = " • ".join(language_items[mid:])
-
-     return f"""
- ### 🌍 Supported Languages ({len(SUPPORTED_LANGUAGES)} total)
- {line1}

- {line2}
- """

- def get_or_load_model():
-     """Loads the ChatterboxMultilingualTTS model if it hasn't been loaded already,
-     and ensures it's on the correct device."""
-     global MODEL
-     if MODEL is None:
-         print("Model not loaded, initializing...")
-         try:
-             MODEL = ChatterboxMultilingualTTS.from_pretrained(DEVICE)
-             if hasattr(MODEL, 'to') and str(MODEL.device) != DEVICE:
-                 MODEL.to(DEVICE)
-             print(f"Model loaded successfully. Internal device: {getattr(MODEL, 'device', 'N/A')}")
-         except Exception as e:
-             print(f"Error loading model: {e}")
-             raise
-     return MODEL
-
- # Attempt to load the model at startup.
  try:
-     get_or_load_model()
  except Exception as e:
-     print(f"CRITICAL: Failed to load model on startup. Application may not function. Error: {e}")
-
- def set_seed(seed: int):
-     """Sets the random seed for reproducibility across torch, numpy, and random."""
-     torch.manual_seed(seed)
-     if DEVICE == "cuda":
-         torch.cuda.manual_seed(seed)
-         torch.cuda.manual_seed_all(seed)
-     random.seed(seed)
-     np.random.seed(seed)
-
- def resolve_audio_prompt(language_id: str, provided_path: str | None) -> str | None:
-     """
-     Decide which audio prompt to use:
-     - If user provided a path (upload/mic/url), use it.
-     - Else, fall back to language-specific default (if any).
-     """
-     if provided_path and str(provided_path).strip():
-         return provided_path
-     return LANGUAGE_CONFIG.get(language_id, {}).get("audio")

  @spaces.GPU
- def generate_tts_audio(
-     text_input: str,
-     language_id: str,
-     audio_prompt_path_input: str = None,
-     exaggeration_input: float = 0.5,
-     temperature_input: float = 0.8,
-     seed_num_input: int = 0,
-     cfgw_input: float = 0.5
  ) -> tuple[int, np.ndarray]:
      """
-     Generate high-quality speech audio from text using Chatterbox Multilingual model with optional reference audio styling.
-     Supported languages: English, French, German, Spanish, Italian, Portuguese, and Hindi.
-
-     This tool synthesizes natural-sounding speech from input text. When a reference audio file
-     is provided, it captures the speaker's voice characteristics and speaking style. The generated audio
-     maintains the prosody, tone, and vocal qualities of the reference speaker, or uses default voice if no reference is provided.

      Args:
-         text_input (str): The text to synthesize into speech (maximum 300 characters)
-         language_id (str): The language code for synthesis (eg. en, fr, de, es, it, pt, hi)
-         audio_prompt_path_input (str, optional): File path or URL to the reference audio file that defines the target voice style. Defaults to None.
-         exaggeration_input (float, optional): Controls speech expressiveness (0.25-2.0, neutral=0.5, extreme values may be unstable). Defaults to 0.5.
-         temperature_input (float, optional): Controls randomness in generation (0.05-5.0, higher=more varied). Defaults to 0.8.
-         seed_num_input (int, optional): Random seed for reproducible results (0 for random generation). Defaults to 0.
-         cfgw_input (float, optional): CFG/Pace weight controlling generation guidance (0.2-1.0). Defaults to 0.5, 0 for language transfer.

      Returns:
-         tuple[int, np.ndarray]: A tuple containing the sample rate (int) and the generated audio waveform (numpy.ndarray)
      """
-     current_model = get_or_load_model()
-
-     if current_model is None:
-         raise RuntimeError("TTS model is not loaded.")
-
-     if seed_num_input != 0:
-         set_seed(int(seed_num_input))
-
-     print(f"Generating audio for text: '{text_input[:50]}...'")
-
-     # Handle optional audio prompt
-     chosen_prompt = audio_prompt_path_input or default_audio_for_ui(language_id)
-
-     generate_kwargs = {
-         "exaggeration": exaggeration_input,
-         "temperature": temperature_input,
-         "cfg_weight": cfgw_input,
-     }
-     if chosen_prompt:
-         generate_kwargs["audio_prompt_path"] = chosen_prompt
-         print(f"Using audio prompt: {chosen_prompt}")
-     else:
-         print("No audio prompt provided; using default voice.")
-
-     wav = current_model.generate(
-         text_input[:300],  # Truncate text to max chars
-         language_id=language_id,
-         **generate_kwargs
-     )
-     print("Audio generation complete.")
-     return (current_model.sr, wav.squeeze(0).numpy())
-
- with gr.Blocks() as demo:
-     gr.Markdown(
-         """
-         # Chatterbox Multilingual Demo
-         Generate high-quality multilingual speech from text with reference audio styling, supporting 23 languages.
-
-         For a hosted version of Chatterbox Multilingual and for finetuning, please visit [resemble.ai](https://app.resemble.ai)
-         """
      )
-
-     # Display supported languages
-     gr.Markdown(get_supported_languages_display())
-     with gr.Row():
-         with gr.Column():
-             initial_lang = "fr"
-             text = gr.Textbox(
-                 value=default_text_for_ui(initial_lang),
-                 label="Text to synthesize (max chars 300)",
-                 max_lines=5
-             )
-
-             language_id = gr.Dropdown(
-                 choices=list(ChatterboxMultilingualTTS.get_supported_languages().keys()),
-                 value=initial_lang,
-                 label="Language",
-                 info="Select the language for text-to-speech synthesis"
-             )
-
-             ref_wav = gr.Audio(
-                 sources=["upload", "microphone"],
-                 type="filepath",
-                 label="Reference Audio File (Optional)",
-                 value=default_audio_for_ui(initial_lang)
-             )

-             gr.Markdown(
-                 "💡 **Note**: Ensure that the reference clip matches the specified language tag. Otherwise, language transfer outputs may inherit the accent of the reference clip's language. To mitigate this, set the CFG weight to 0.",
-                 elem_classes=["audio-note"]
-             )

-             exaggeration = gr.Slider(
-                 0.25, 2, step=.05, label="Exaggeration (Neutral = 0.5, extreme values can be unstable)", value=.5
-             )
-             cfg_weight = gr.Slider(
-                 0.2, 1, step=.05, label="CFG/Pace", value=0.5
-             )

-             with gr.Accordion("More options", open=False):
-                 seed_num = gr.Number(value=0, label="Random seed (0 for random)")
-                 temp = gr.Slider(0.05, 5, step=.05, label="Temperature", value=.8)

-             run_btn = gr.Button("Generate", variant="primary")

-         with gr.Column():
-             audio_output = gr.Audio(label="Output Audio")

-     def on_language_change(lang, current_ref, current_text):
-         return default_audio_for_ui(lang), default_text_for_ui(lang)

-     language_id.change(
          fn=on_language_change,
-         inputs=[language_id, ref_wav, text],
-         outputs=[ref_wav, text],
-         show_progress=False
      )

-     run_btn.click(
-         fn=generate_tts_audio,
-         inputs=[
-             text,
-             language_id,
-             ref_wav,
-             exaggeration,
-             temp,
-             seed_num,
-             cfg_weight,
-         ],
-         outputs=[audio_output],
-     )

- demo.launch(mcp_server=True)
+ """
+ Telefonansagen TTS - Simplified Gradio Application
+
+ A streamlined interface for generating professional phone announcements
+ using the modular TTS engine with Chatterbox Multilingual as default backend.
+ """
+
  import random
+
+ import gradio as gr
  import numpy as np
  import torch
+
+ try:
+     import spaces
+
+     HAS_SPACES = True
+ except ImportError:
+     HAS_SPACES = False
+
+     # Create a dummy decorator
+     class spaces:
+         @staticmethod
+         def GPU(func):
+             return func
+
+
+ from loguru import logger
+
+ from engine import TTSEngine
+ from engine.backends.chatterbox_backend import DEFAULT_VOICE_PROMPTS
+
+ # --- Configuration ---
+ DEVICE = (
+     "cuda"
+     if torch.cuda.is_available()
+     else "mps" if torch.backends.mps.is_available() else "cpu"
+ )
+ logger.info(f"🚀 Running on device: {DEVICE}")
+
+ # Language display configuration
+ LANGUAGE_DISPLAY = {
+     "de": "🇩🇪 Deutsch",
+     "en": "🇬🇧 English",
+     "fr": "🇫🇷 Français",
+     "es": "🇪🇸 Español",
+     "it": "🇮🇹 Italiano",
+     "nl": "🇳🇱 Nederlands",
+     "pl": "🇵🇱 Polski",
+     "pt": "🇵🇹 Português",
+     "ru": "🇷🇺 Русский",
+     "tr": "🇹🇷 Türkçe",
+     "ar": "🇸🇦 العربية",
+     "zh": "🇨🇳 中文",
+     "ja": "🇯🇵 日本語",
+     "ko": "🇰🇷 한국어",
+     "hi": "🇮🇳 हिन्दी",
+     "da": "🇩🇰 Dansk",
+     "el": "🇬🇷 Ελληνικά",
+     "fi": "🇫🇮 Suomi",
+     "he": "🇮🇱 עברית",
+     "ms": "🇲🇾 Bahasa Melayu",
+     "no": "🇳🇴 Norsk",
+     "sv": "🇸🇪 Svenska",
+     "sw": "🇰🇪 Kiswahili",
  }

+ # Example texts per language
+ EXAMPLE_TEXTS = {
+     "de": "Herzlich willkommen. Sie sind mit unserem Kundenservice verbunden. Bitte haben Sie einen Moment Geduld, wir sind gleich für Sie da.",
+     "en": "Welcome to our customer service. Please hold the line, one of our representatives will be with you shortly.",
+     "fr": "Bienvenue sur notre service client. Veuillez patienter, un conseiller va prendre votre appel.",
+     "es": "Bienvenido a nuestro servicio de atención al cliente. Por favor, espere un momento.",
+     "it": "Benvenuto nel nostro servizio clienti. La preghiamo di attendere in linea.",
+     "nl": "Welkom bij onze klantenservice. Een moment geduld alstublieft.",
+     "pl": "Witamy w naszej obsłudze klienta. Proszę czekać na połączenie.",
+     "pt": "Bem-vindo ao nosso serviço de apoio ao cliente. Por favor, aguarde um momento.",
+     "ru": "Добро пожаловать в службу поддержки. Пожалуйста, оставайтесь на линии.",
+     "tr": "Müşteri hizmetlerimize hoş geldiniz. Lütfen hatta kalın.",
+     "ar": "مرحباً بكم في خدمة العملاء. يرجى الانتظار على الخط.",
+     "zh": "欢迎致电客户服务中心。请稍候，我们的客服代表将很快为您服务。",
+     "ja": "お電話ありがとうございます。担当者におつなぎしますので、少々お待ちください。",
+     "ko": "고객 서비스에 오신 것을 환영합니다. 잠시만 기다려 주세요.",
+     "hi": "हमारी ग्राहक सेवा में आपका स्वागत है। कृपया प्रतीक्षा करें।",
+     "da": "Velkommen til vores kundeservice. Vent venligst.",
+     "el": "Καλώς ήρθατε στην εξυπηρέτηση πελατών. Παρακαλώ περιμένετε.",
+     "fi": "Tervetuloa asiakaspalveluumme. Odottakaa hetki.",
+     "he": "ברוכים הבאים לשירות הלקוחות שלנו. אנא המתינו על הקו.",
+     "ms": "Selamat datang ke perkhidmatan pelanggan kami. Sila tunggu sebentar.",
+     "no": "Velkommen til vår kundeservice. Vennligst vent.",
+     "sv": "Välkommen till vår kundtjänst. Vänligen vänta.",
+     "sw": "Karibu kwa huduma yetu ya wateja. Tafadhali subiri.",
+ }

+ # --- Global Engine ---
+ ENGINE = None

+ def get_engine() -> TTSEngine:
+     """Get or initialize the TTS engine."""
+     global ENGINE
+     if ENGINE is None:
+         from engine import TTSEngine
+         from engine.tts_engine import EngineConfig
+
+         logger.info("Initializing TTS Engine...")
+         ENGINE = TTSEngine(
+             EngineConfig(
+                 default_backend="chatterbox",
+                 device=DEVICE,
+                 default_language="de",
+             )
+         )
+         # Pre-load the model
+         ENGINE.load_backend()
+         logger.info("TTS Engine ready!")
+
+     return ENGINE

+ # Initialize on startup
  try:
+     get_engine()
  except Exception as e:
+     logger.error(f"Failed to initialize engine on startup: {e}")
+
+
+ # --- Helper Functions ---
+ def get_language_choices() -> list[tuple[str, str]]:
+     """Get language choices for dropdown."""
+     engine = get_engine()
+     supported = engine.get_supported_languages()
+     choices = []
+     for code in supported.keys():
+         display = LANGUAGE_DISPLAY.get(code, f"{supported[code]} ({code})")
+         choices.append((display, code))
+     # Sort by display name, but put German first
+     choices.sort(key=lambda x: (x[1] != "de", x[0]))
+     return choices
+
+
+ def get_example_text(language: str) -> str:
+     """Get example text for a language."""
+     return EXAMPLE_TEXTS.get(language, EXAMPLE_TEXTS["en"])
+
+
+ def get_default_voice(language: str) -> str:
+     """Get default voice prompt URL for a language."""
+     return DEFAULT_VOICE_PROMPTS.get(language)

+ # --- Main Generation Function ---
  @spaces.GPU
+ def generate_announcement(
+     text: str,
+     language: str,
+     voice_audio: str = None,
+     seed: int = 0,
  ) -> tuple[int, np.ndarray]:
      """
+     Generate a phone announcement.

      Args:
+         text: Text to synthesize (max 500 characters)
+         language: Language code
+         voice_audio: Optional path to reference audio for voice cloning
+         seed: Random seed (0 = random)

      Returns:
+         Tuple of (sample_rate, audio_array) for Gradio audio component
      """
+     engine = get_engine()
+
+     # Set seed for reproducibility
+     if seed != 0:
+         torch.manual_seed(seed)
+         random.seed(seed)
+         np.random.seed(seed)
+         if DEVICE == "cuda":
+             torch.cuda.manual_seed_all(seed)
+
+     # Truncate text
+     text = text[:500]
+
+     # Use default voice if none provided
+     if not voice_audio or not str(voice_audio).strip():
+         voice_audio = get_default_voice(language)
+
+     logger.info(f"Generating: lang={language}, text='{text[:50]}...'")
+
+     # Generate audio
+     result = engine.generate(
+         text=text,
+         language=language,
+         voice_audio=voice_audio,
      )
+
+     return result
+
+
+ def on_language_change(language: str):
+     """Handle language selection change."""
+     return get_example_text(language), get_default_voice(language)
+
+
+ # --- Gradio Interface ---
+ def create_interface():
+     """Create the Gradio interface."""
+
+     with gr.Blocks(
+         title="Telefonansagen Generator",
+         theme=gr.themes.Soft(),
+         css="""
+         .main-title { text-align: center; margin-bottom: 1rem; }
+         .generate-btn { min-height: 50px; font-size: 1.1rem; }
+         """,
+     ) as demo:
+         gr.Markdown(
+             """
+             # 📞 Telefonansagen Generator

+             Erstellen Sie professionelle Telefonansagen mit KI-gestützter Sprachsynthese.
+             Unterstützt 23 Sprachen mit optionaler Stimmklonung.

+             ---
+             """,
+             elem_classes=["main-title"],
+         )

+         with gr.Row():
+             # Left column - Input
+             with gr.Column(scale=1):
+                 language = gr.Dropdown(
+                     choices=get_language_choices(),
+                     value="de",
+                     label="🌍 Sprache / Language",
+                     info="Wählen Sie die Sprache der Ansage",
+                 )

+                 text = gr.Textbox(
+                     value=EXAMPLE_TEXTS["de"],
+                     label="📝 Text der Ansage",
+                     placeholder="Geben Sie hier den Text Ihrer Telefonansage ein...",
+                     lines=5,
+                     max_lines=10,
+                     info="Maximal 500 Zeichen",
+                 )

+                 with gr.Accordion("🎤 Stimmeinstellungen (Optional)", open=False):
+                     voice_audio = gr.Audio(
+                         sources=["upload", "microphone"],
+                         type="filepath",
+                         label="Referenz-Audio für Stimmklonung",
+                         value=get_default_voice("de"),
+                     )
+                     gr.Markdown(
+                         """
+                         💡 **Tipp:** Laden Sie eine Audioaufnahme hoch, um die Stimme zu klonen.
+                         Die Standardstimme wird verwendet, wenn keine Aufnahme bereitgestellt wird.
+                         """
+                     )

+                 with gr.Accordion("⚙️ Erweiterte Einstellungen", open=False):
+                     seed = gr.Number(
+                         value=0,
+                         label="Zufallswert (Seed)",
+                         info="0 = zufällig, andere Werte für reproduzierbare Ergebnisse",
+                         precision=0,
+                     )

+                 generate_btn = gr.Button(
+                     "🎙️ Ansage generieren",
+                     variant="primary",
+                     elem_classes=["generate-btn"],
+                 )
+
+             # Right column - Output
+             with gr.Column(scale=1):
+                 audio_output = gr.Audio(
+                     label="📢 Generierte Ansage", type="numpy", interactive=False
+                 )
+
+                 gr.Markdown(
+                     """
+                     ### ℹ️ Hinweise
+
+                     - Die Generierung kann einige Sekunden dauern
+                     - Für beste Ergebnisse verwenden Sie klare, kurze Sätze
+                     - Referenz-Audio sollte 5-15 Sekunden lang sein
+
+                     ---
+
+                     **Unterstützte Sprachen:** Deutsch, Englisch, Französisch, Spanisch,
+                     Italienisch, Niederländisch, Polnisch, Portugiesisch, Russisch,
+                     Türkisch, Arabisch, Chinesisch, Japanisch, Koreanisch, Hindi,
+                     Dänisch, Griechisch, Finnisch, Hebräisch, Malaiisch, Norwegisch,
+                     Schwedisch, Swahili
+                     """
+                 )
+
+         # Event handlers
+         language.change(
              fn=on_language_change,
+             inputs=[language],
+             outputs=[text, voice_audio],
+             show_progress=False,
          )

+         generate_btn.click(
+             fn=generate_announcement,
+             inputs=[text, language, voice_audio, seed],
+             outputs=[audio_output],
+         )
+
+     return demo
+

+ # --- Main ---
+ if __name__ == "__main__":
+     demo = create_interface()
+     demo.launch(server_name="0.0.0.0", server_port=7860, share=False)
engine/__init__.py ADDED
@@ -0,0 +1,23 @@
+ # Telefonansagen TTS Engine
+ # A modular text-to-speech engine for generating phone announcements
+
+ from .audio_processor import AudioProcessingConfig, AudioProcessor
+ from .backends.base import BackendConfig, TTSBackend, TTSResult
+ from .backends.chatterbox_backend import ChatterboxBackend
+ from .cache import AudioCache, CacheConfig
+ from .tts_engine import EngineConfig, TTSEngine
+
+ __all__ = [
+     "TTSEngine",
+     "EngineConfig",
+     "TTSBackend",
+     "TTSResult",
+     "BackendConfig",
+     "ChatterboxBackend",
+     "AudioProcessor",
+     "AudioProcessingConfig",
+     "AudioCache",
+     "CacheConfig",
+ ]
+
+ __version__ = "1.0.0"
engine/audio_processor.py ADDED
@@ -0,0 +1,201 @@
+"""
+Audio post-processing for phone announcements.
+Handles background music mixing, normalization, and export.
+"""
+
+import io
+import os
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Optional, Union
+
+import numpy as np
+from loguru import logger
+
+
+@dataclass
+class AudioProcessingConfig:
+    """Configuration for audio post-processing."""
+
+    # Background music settings
+    background_music_path: Optional[str] = None
+    music_volume_db: float = -20.0  # Relative volume of background music
+
+    # Fade settings
+    fade_in_ms: int = 500
+    fade_out_ms: int = 500
+
+    # Padding (silence before/after speech)
+    padding_start_ms: int = 300
+    padding_end_ms: int = 300
+
+    # Output settings
+    normalize: bool = True
+    target_loudness_db: float = -16.0  # Target LUFS for normalization
+    output_sample_rate: int = 44100
+    output_format: str = "mp3"
+
+
+class AudioProcessor:
+    """
+    Post-processor for TTS audio.
+    Adds background music, applies fades, normalizes, and exports.
+    """
+
+    # Default background music directory
+    ASSETS_DIR = Path(__file__).parent.parent / "data" / "assets"
+
+    def __init__(self, config: Optional[AudioProcessingConfig] = None):
+        self.config = config or AudioProcessingConfig()
+
+    def process(
+        self,
+        audio: np.ndarray,
+        sample_rate: int,
+        output_path: Optional[str] = None,
+        **override_config,
+    ) -> Union[bytes, str]:
+        """
+        Process audio with background music, fades, and normalization.
+
+        Args:
+            audio: Input audio as numpy array
+            sample_rate: Sample rate of input audio
+            output_path: Optional path to save the output (returns bytes if None)
+            **override_config: Override any config settings for this call
+
+        Returns:
+            Path to output file if output_path is provided, otherwise MP3 bytes
+        """
+        from pydub import AudioSegment
+
+        # Merge config overrides
+        config = AudioProcessingConfig(**{**self.config.__dict__, **override_config})
+
+        # Convert numpy array to AudioSegment
+        speech = self._numpy_to_audiosegment(audio, sample_rate)
+
+        # Boost speech slightly for clarity
+        speech = speech + 3  # +3 dB
+
+        # Add padding
+        if config.padding_start_ms > 0 or config.padding_end_ms > 0:
+            silence_start = AudioSegment.silent(
+                duration=config.padding_start_ms, frame_rate=sample_rate
+            )
+            silence_end = AudioSegment.silent(
+                duration=config.padding_end_ms, frame_rate=sample_rate
+            )
+            speech = silence_start + speech + silence_end
+
+        # Mix with background music if specified
+        if config.background_music_path:
+            speech = self._add_background_music(
+                speech, config.background_music_path, config.music_volume_db
+            )
+
+        # Apply fades
+        if config.fade_in_ms > 0:
+            speech = speech.fade_in(config.fade_in_ms)
+        if config.fade_out_ms > 0:
+            speech = speech.fade_out(config.fade_out_ms)
+
+        # Normalize if requested
+        if config.normalize:
+            speech = self._normalize(speech, config.target_loudness_db)
+
+        # Resample if needed
+        if speech.frame_rate != config.output_sample_rate:
+            speech = speech.set_frame_rate(config.output_sample_rate)
+
+        # Export
+        if output_path:
+            speech.export(output_path, format=config.output_format)
+            return output_path
+        else:
+            buffer = io.BytesIO()
+            speech.export(buffer, format=config.output_format)
+            return buffer.getvalue()
+
+    def _numpy_to_audiosegment(
+        self, audio: np.ndarray, sample_rate: int
+    ) -> "AudioSegment":
+        """Convert numpy array to pydub AudioSegment."""
+        from pydub import AudioSegment
+
+        # Ensure float32
+        if audio.dtype != np.float32:
+            audio = audio.astype(np.float32)
+
+        # Clip and convert to int16
+        audio = np.clip(audio, -1.0, 1.0)
+        audio_int16 = (audio * 32767).astype(np.int16)
+
+        # Create AudioSegment
+        return AudioSegment(
+            data=audio_int16.tobytes(),
+            sample_width=2,  # 16-bit
+            frame_rate=sample_rate,
+            channels=1,  # Mono
+        )
+
+    def _add_background_music(
+        self, speech: "AudioSegment", music_path: str, volume_db: float
+    ) -> "AudioSegment":
+        """Mix background music with speech."""
+        from pydub import AudioSegment
+
+        # Resolve path
+        if not os.path.isabs(music_path):
+            # Check in assets directory
+            assets_path = self.ASSETS_DIR / f"{music_path}.mp3"
+            if assets_path.exists():
+                music_path = str(assets_path)
+            else:
+                assets_path = self.ASSETS_DIR / music_path
+                if assets_path.exists():
+                    music_path = str(assets_path)
+
+        if not os.path.exists(music_path):
+            logger.warning(f"Background music not found: {music_path}")
+            return speech
+
+        try:
+            music = AudioSegment.from_file(music_path)
+
+            # Adjust volume
+            music = music + volume_db
+
+            # Match sample rate
+            if music.frame_rate != speech.frame_rate:
+                music = music.set_frame_rate(speech.frame_rate)
+
+            # Loop music to match speech length
+            if len(music) < len(speech):
+                loops_needed = (len(speech) // len(music)) + 1
+                music = music * loops_needed
+
+            # Trim to exact length
+            music = music[: len(speech)]
+
+            # Overlay
+            return speech.overlay(music)
+
+        except Exception as e:
+            logger.error(f"Failed to add background music: {e}")
+            return speech
+
+    def _normalize(self, audio: "AudioSegment", target_db: float) -> "AudioSegment":
+        """Normalize audio to target loudness."""
+        change_in_db = target_db - audio.dBFS
+        return audio.apply_gain(change_in_db)
+
+    def list_available_music(self) -> list[str]:
+        """List available background music files in the assets directory."""
+        if not self.ASSETS_DIR.exists():
+            return []
+
+        music_files = []
+        for ext in ["mp3", "wav", "flac", "ogg"]:
+            music_files.extend([f.stem for f in self.ASSETS_DIR.glob(f"*.{ext}")])
+        return sorted(set(music_files))
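
A note on the per-call override mechanism in `process()`: it builds a fresh `AudioProcessingConfig` by merging `**override_config` over the instance config's `__dict__`, so keyword arguments win for that call only. A minimal stdlib sketch of the same merge, using a simplified stand-in dataclass (`Config` and its fields are illustrative, not part of the engine):

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Simplified stand-in for AudioProcessingConfig."""
    fade_in_ms: int = 500
    normalize: bool = True


base = Config(fade_in_ms=250)

# Per-call overrides win over the instance config, as in AudioProcessor.process()
override = {"normalize": False}
merged = Config(**{**base.__dict__, **override})

print(merged.fade_in_ms, merged.normalize)  # 250 False
print(base.normalize)  # True — the instance config is untouched
```

Because the merge constructs a new dataclass instance, unknown keys raise a `TypeError` immediately, which catches typos in override names.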
engine/backends/__init__.py ADDED
@@ -0,0 +1,13 @@
+# TTS Backend implementations
+from .base import BackendConfig, TTSBackend, TTSResult
+from .chatterbox_backend import ChatterboxBackend
+
+__all__ = ["TTSBackend", "TTSResult", "BackendConfig", "ChatterboxBackend"]
+
+# Optional backends
+try:
+    from .gemini_backend import GeminiBackend
+
+    __all__.append("GeminiBackend")
+except ImportError:
+    pass  # google-genai not installed
engine/backends/base.py ADDED
@@ -0,0 +1,129 @@
+"""
+Abstract base class for TTS backends.
+All TTS backends must implement this interface to be compatible with the engine.
+"""
+
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from typing import Optional
+
+import numpy as np
+
+
+@dataclass
+class TTSResult:
+    """Result from TTS generation."""
+
+    audio: np.ndarray  # Audio waveform as numpy array
+    sample_rate: int  # Sample rate in Hz
+
+    def to_int16(self) -> np.ndarray:
+        """Convert audio to 16-bit integer format."""
+        audio = self.audio
+        if audio.dtype == np.float32 or audio.dtype == np.float64:
+            audio = np.clip(audio, -1.0, 1.0)
+            audio = (audio * 32767).astype(np.int16)
+        return audio
+
+
+@dataclass
+class BackendConfig:
+    """Configuration for TTS backends."""
+
+    device: str = "auto"  # "auto", "cuda", "mps", "cpu"
+
+    def resolve_device(self) -> str:
+        """Resolve 'auto' to the best available device."""
+        if self.device != "auto":
+            return self.device
+
+        import torch
+
+        if torch.cuda.is_available():
+            return "cuda"
+        elif torch.backends.mps.is_available():
+            return "mps"
+        return "cpu"
+
+
+class TTSBackend(ABC):
+    """
+    Abstract base class for TTS backends.
+
+    To create a new backend:
+    1. Inherit from this class
+    2. Implement all abstract methods
+    3. Register the backend in the engine
+    """
+
+    def __init__(self, config: Optional[BackendConfig] = None):
+        self.config = config or BackendConfig()
+        self._is_loaded = False
+
+    @property
+    @abstractmethod
+    def name(self) -> str:
+        """Human-readable name of the backend."""
+        pass
+
+    @property
+    @abstractmethod
+    def supports_voice_cloning(self) -> bool:
+        """Whether this backend supports voice cloning from audio."""
+        pass
+
+    @property
+    @abstractmethod
+    def supported_languages(self) -> dict[str, str]:
+        """
+        Dictionary of supported language codes to language names.
+        Example: {"en": "English", "de": "German"}
+        """
+        pass
+
+    @property
+    def is_loaded(self) -> bool:
+        """Whether the backend model is loaded and ready."""
+        return self._is_loaded
+
+    @abstractmethod
+    def load(self) -> None:
+        """
+        Load the model and prepare for inference.
+        Should set self._is_loaded = True when complete.
+        """
+        pass
+
+    @abstractmethod
+    def unload(self) -> None:
+        """
+        Unload the model to free memory.
+        Should set self._is_loaded = False when complete.
+        """
+        pass
+
+    @abstractmethod
+    def generate(
+        self,
+        text: str,
+        language: str = "de",
+        voice_audio_path: Optional[str] = None,
+        **kwargs,
+    ) -> TTSResult:
+        """
+        Generate speech from text.
+
+        Args:
+            text: The text to synthesize
+            language: Language code (e.g., "de", "en")
+            voice_audio_path: Optional path to reference audio for voice cloning
+            **kwargs: Backend-specific parameters
+
+        Returns:
+            TTSResult containing audio waveform and sample rate
+        """
+        pass
+
+    def __repr__(self) -> str:
+        status = "loaded" if self._is_loaded else "not loaded"
+        return f"{self.__class__.__name__}(name='{self.name}', status={status})"
engine/backends/chatterbox_backend.py ADDED
@@ -0,0 +1,220 @@
+"""
+Chatterbox Multilingual TTS Backend with Voice Cloning support.
+This is the default backend for the Telefonansagen engine.
+"""
+
+from typing import Optional
+
+import numpy as np
+from loguru import logger
+
+from .base import BackendConfig, TTSBackend, TTSResult
+
+# Default voice prompts per language (high-quality reference samples)
+DEFAULT_VOICE_PROMPTS = {
+    "ar": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ar_f/ar_prompts2.flac",
+    "da": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/da_m1.flac",
+    "de": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/de_f1.flac",
+    "el": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/el_m.flac",
+    "en": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/en_f1.flac",
+    "es": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/es_f1.flac",
+    "fi": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/fi_m.flac",
+    "fr": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/fr_f1.flac",
+    "he": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/he_m1.flac",
+    "hi": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/hi_f1.flac",
+    "it": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/it_m1.flac",
+    "ja": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ja/ja_prompts1.flac",
+    "ko": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ko_f.flac",
+    "ms": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ms_f.flac",
+    "nl": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/nl_m.flac",
+    "no": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/no_f1.flac",
+    "pl": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/pl_m.flac",
+    "pt": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/pt_m1.flac",
+    "ru": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ru_m.flac",
+    "sv": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/sv_f.flac",
+    "sw": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/sw_m.flac",
+    "tr": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/tr_m.flac",
+    "zh": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/zh_f2.flac",
+}
+
+
+class ChatterboxBackend(TTSBackend):
+    """
+    Chatterbox Multilingual TTS Backend.
+
+    Features:
+    - 23 language support
+    - High-quality voice cloning
+    - Expressive speech synthesis
+
+    This backend uses the ResembleAI Chatterbox model for synthesis.
+    """
+
+    # Optimal defaults for phone announcements (clear, professional)
+    DEFAULT_EXAGGERATION = 0.35  # Slightly less expressive for professional announcements
+    DEFAULT_TEMPERATURE = 0.7  # Balanced randomness
+    DEFAULT_CFG_WEIGHT = 0.5  # Standard guidance
+
+    SUPPORTED_LANGUAGES = {
+        "ar": "Arabic",
+        "da": "Danish",
+        "de": "German",
+        "el": "Greek",
+        "en": "English",
+        "es": "Spanish",
+        "fi": "Finnish",
+        "fr": "French",
+        "he": "Hebrew",
+        "hi": "Hindi",
+        "it": "Italian",
+        "ja": "Japanese",
+        "ko": "Korean",
+        "ms": "Malay",
+        "nl": "Dutch",
+        "no": "Norwegian",
+        "pl": "Polish",
+        "pt": "Portuguese",
+        "ru": "Russian",
+        "sv": "Swedish",
+        "sw": "Swahili",
+        "tr": "Turkish",
+        "zh": "Chinese",
+    }
+
+    def __init__(self, config: Optional[BackendConfig] = None):
+        super().__init__(config)
+        self._model = None
+        self._device = None
+
+    @property
+    def name(self) -> str:
+        return "Chatterbox Multilingual"
+
+    @property
+    def supports_voice_cloning(self) -> bool:
+        return True
+
+    @property
+    def supported_languages(self) -> dict[str, str]:
+        return self.SUPPORTED_LANGUAGES.copy()
+
+    def load(self) -> None:
+        """Load the Chatterbox model."""
+        if self._is_loaded:
+            logger.info("Chatterbox model already loaded")
+            return
+
+        logger.info("Loading Chatterbox Multilingual model...")
+
+        from src.chatterbox.mtl_tts import ChatterboxMultilingualTTS
+
+        self._device = self.config.resolve_device()
+        logger.info(f"Using device: {self._device}")
+
+        try:
+            self._model = ChatterboxMultilingualTTS.from_pretrained(self._device)
+            self._is_loaded = True
+            logger.info("Chatterbox model loaded successfully")
+        except Exception as e:
+            logger.error(f"Failed to load Chatterbox model: {e}")
+            raise
+
+    def unload(self) -> None:
+        """Unload the model to free memory."""
+        if self._model is not None:
+            import torch
+
+            del self._model
+            self._model = None
+            if self._device == "cuda":
+                torch.cuda.empty_cache()
+        self._is_loaded = False
+        logger.info("Chatterbox model unloaded")
+
+    def get_default_voice(self, language: str) -> Optional[str]:
+        """Get the default voice prompt URL for a language."""
+        return DEFAULT_VOICE_PROMPTS.get(language.lower())
+
+    def generate(
+        self,
+        text: str,
+        language: str = "de",
+        voice_audio_path: Optional[str] = None,
+        exaggeration: Optional[float] = None,
+        temperature: Optional[float] = None,
+        cfg_weight: Optional[float] = None,
+        seed: Optional[int] = None,
+        **kwargs,
+    ) -> TTSResult:
+        """
+        Generate speech from text using Chatterbox.
+
+        Args:
+            text: Text to synthesize
+            language: Language code (default: "de" for German)
+            voice_audio_path: Path/URL to reference audio for voice cloning
+            exaggeration: Speech expressiveness (0.25-2.0, default: 0.35)
+            temperature: Generation randomness (0.05-5.0, default: 0.7)
+            cfg_weight: CFG guidance weight (0.2-1.0, default: 0.5)
+            seed: Random seed for reproducibility (default: None = random)
+
+        Returns:
+            TTSResult with audio waveform and sample rate
+        """
+        if not self._is_loaded:
+            self.load()
+
+        import random
+
+        import torch
+
+        # Apply seed if provided
+        if seed is not None and seed != 0:
+            torch.manual_seed(seed)
+            random.seed(seed)
+            np.random.seed(seed)
+            if self._device == "cuda":
+                torch.cuda.manual_seed_all(seed)
+
+        # Use defaults for unspecified parameters
+        exaggeration = (
+            exaggeration if exaggeration is not None else self.DEFAULT_EXAGGERATION
+        )
+        temperature = (
+            temperature if temperature is not None else self.DEFAULT_TEMPERATURE
+        )
+        cfg_weight = cfg_weight if cfg_weight is not None else self.DEFAULT_CFG_WEIGHT
+
+        # Resolve voice prompt
+        audio_prompt = voice_audio_path or self.get_default_voice(language)
+
+        # Validate language
+        lang_code = language.lower()
+        if lang_code not in self.SUPPORTED_LANGUAGES:
+            available = ", ".join(sorted(self.SUPPORTED_LANGUAGES.keys()))
+            raise ValueError(
+                f"Unsupported language '{language}'. Available: {available}"
+            )
+
+        logger.info(f"Generating speech: lang={lang_code}, text='{text[:50]}...'")
+
+        try:
+            wav = self._model.generate(
+                text=text,
+                language_id=lang_code,
+                audio_prompt_path=audio_prompt,
+                exaggeration=exaggeration,
+                temperature=temperature,
+                cfg_weight=cfg_weight,
+            )
+
+            # Move to CPU before converting to numpy (the model may run on GPU)
+            audio_np = wav.squeeze().cpu().numpy()
+
+            return TTSResult(audio=audio_np, sample_rate=self._model.sr)
+
+        except Exception as e:
+            logger.error(f"TTS generation failed: {e}")
+            raise
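
The seed handling in `generate()` works by re-seeding every RNG involved (torch, `random`, numpy) before synthesis, so the same seed reproduces the same stochastic generation path. The mechanism can be demonstrated with just the stdlib `random` module (the helper name is illustrative):

```python
import random


def seeded_draws(seed: int, n: int = 3) -> list[float]:
    """Re-seeding the RNG before drawing makes the sequence reproducible,
    mirroring the seed handling in ChatterboxBackend.generate()."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


a = seeded_draws(1234)
b = seeded_draws(1234)  # same seed -> identical sequence
c = seeded_draws(9)     # different seed -> different sequence
print(a == b, a == c)   # True False
```

Note the backend treats `seed=0` the same as `None` (no seeding), so callers wanting a fixed seed should pass a nonzero value.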
engine/backends/gemini_backend.py ADDED
@@ -0,0 +1,236 @@
+"""
+Google Gemini TTS Backend.
+Uses Google's Gemini API for text-to-speech synthesis.
+"""
+
+import io
+import os
+from typing import Optional
+
+import numpy as np
+from loguru import logger
+
+from .base import BackendConfig, TTSBackend, TTSResult
+
+
+class GeminiBackend(TTSBackend):
+    """
+    Google Gemini TTS Backend.
+
+    Features:
+    - High-quality neural TTS
+    - Multiple preset voices
+    - No voice cloning (uses preset voices)
+
+    Requires GEMINI_API_KEY environment variable.
+    """
+
+    # Available Gemini voices
+    AVAILABLE_VOICES = [
+        "Puck",
+        "Charon",
+        "Kore",
+        "Fenrir",
+        "Aoede",
+        "Leda",
+        "Orus",
+        "Zephyr",
+    ]
+
+    # Gemini has limited language support compared to Chatterbox
+    SUPPORTED_LANGUAGES = {
+        "en": "English",
+        "de": "German",
+        "es": "Spanish",
+        "fr": "French",
+        "it": "Italian",
+        "pt": "Portuguese",
+        "ja": "Japanese",
+        "ko": "Korean",
+        "zh": "Chinese",
+    }
+
+    def __init__(self, config: Optional[BackendConfig] = None, voice: str = "Kore"):
+        super().__init__(config)
+        self._client = None
+        self.voice = voice if voice in self.AVAILABLE_VOICES else "Kore"
+
+    @property
+    def name(self) -> str:
+        return "Google Gemini TTS"
+
+    @property
+    def supports_voice_cloning(self) -> bool:
+        return False
+
+    @property
+    def supported_languages(self) -> dict[str, str]:
+        return self.SUPPORTED_LANGUAGES.copy()
+
+    def load(self) -> None:
+        """Initialize the Gemini client."""
+        if self._is_loaded:
+            return
+
+        api_key = os.environ.get("GEMINI_API_KEY")
+        if not api_key:
+            raise ValueError("GEMINI_API_KEY environment variable not set")
+
+        try:
+            import google.genai as genai
+
+            self._client = genai.Client(api_key=api_key)
+            self._is_loaded = True
+            logger.info("Gemini client initialized successfully")
+        except Exception as e:
+            logger.error(f"Failed to initialize Gemini client: {e}")
+            raise
+
+    def unload(self) -> None:
+        """Clean up Gemini client."""
+        self._client = None
+        self._is_loaded = False
+        logger.info("Gemini client unloaded")
+
+    def set_voice(self, voice: str) -> None:
+        """Set the voice to use for synthesis."""
+        if voice not in self.AVAILABLE_VOICES:
+            raise ValueError(
+                f"Unknown voice '{voice}'. Available: {self.AVAILABLE_VOICES}"
+            )
+        self.voice = voice
+
+    def generate(
+        self,
+        text: str,
+        language: str = "de",
+        voice_audio_path: Optional[str] = None,
+        voice: Optional[str] = None,
+        **kwargs,
+    ) -> TTSResult:
+        """
+        Generate speech from text using Gemini.
+
+        Args:
+            text: Text to synthesize
+            language: Language code (for text processing, voice determines actual synthesis)
+            voice_audio_path: Ignored (Gemini doesn't support voice cloning)
+            voice: Voice name to use (default: instance voice setting)
+
+        Returns:
+            TTSResult with audio waveform and sample rate
+        """
+        if not self._is_loaded:
+            self.load()
+
+        if voice_audio_path:
+            logger.warning(
+                "Gemini backend doesn't support voice cloning, ignoring voice_audio_path"
+            )
+
+        from google.genai import types as genai_types
+
+        selected_voice = voice or self.voice
+
+        logger.info(
+            f"Generating speech with Gemini: voice={selected_voice}, text='{text[:50]}...'"
+        )
+
+        contents = [
+            genai_types.Content(
+                role="user", parts=[genai_types.Part.from_text(text=text)]
+            )
+        ]
+
+        config = genai_types.GenerateContentConfig(
+            temperature=1,
+            response_modalities=["audio"],
+            speech_config=genai_types.SpeechConfig(
+                voice_config=genai_types.VoiceConfig(
+                    prebuilt_voice_config=genai_types.PrebuiltVoiceConfig(
+                        voice_name=selected_voice
+                    )
+                )
+            ),
+        )
+
+        try:
+            audio_chunks = []
+            mime_type = None
+
+            for chunk in self._client.models.generate_content_stream(
+                model="gemini-2.5-pro-preview-tts",
+                contents=contents,
+                config=config,
+            ):
+                if chunk.candidates:
+                    inline_data = chunk.candidates[0].content.parts[0].inline_data
+                    audio_chunks.append(inline_data.data)
+                    if mime_type is None:
+                        mime_type = inline_data.mime_type
+
+            if not audio_chunks:
+                raise RuntimeError("No audio data received from Gemini API")
+
+            raw_audio = b"".join(audio_chunks)
+
+            # Convert to numpy array
+            audio_np, sample_rate = self._process_audio(raw_audio, mime_type)
+
+            return TTSResult(audio=audio_np, sample_rate=sample_rate)
+
+        except Exception as e:
+            logger.error(f"Gemini TTS generation failed: {e}")
+            raise
+
+    def _process_audio(
+        self, raw_audio: bytes, mime_type: str
+    ) -> tuple[np.ndarray, int]:
+        """Process raw audio data from Gemini into numpy array."""
+        from pydub import AudioSegment
+
+        # Parse MIME type for audio parameters
+        sample_rate = 24000  # Default
+        bits_per_sample = 16
+
+        if mime_type and "audio/L" in mime_type:
+            # Parse format like audio/L16;rate=24000
+            parts = mime_type.split(";")
+            for part in parts:
+                part = part.strip()
+                if part.startswith("audio/L"):
+                    try:
+                        bits_per_sample = int(part.split("L")[1])
+                    except (ValueError, IndexError):
+                        pass
+                elif part.lower().startswith("rate="):
+                    try:
+                        sample_rate = int(part.split("=")[1])
+                    except (ValueError, IndexError):
+                        pass
+
+            # Create AudioSegment from raw PCM
+            audio_segment = AudioSegment(
+                data=raw_audio,
+                sample_width=bits_per_sample // 8,
+                frame_rate=sample_rate,
+                channels=1,
+            )
+        elif mime_type == "audio/mpeg":
+            audio_segment = AudioSegment.from_file(io.BytesIO(raw_audio), format="mp3")
+            sample_rate = audio_segment.frame_rate
+        else:
+            # Try auto-detection
+            audio_segment = AudioSegment.from_file(io.BytesIO(raw_audio))
+            sample_rate = audio_segment.frame_rate
+
+        # Convert to numpy array
+        samples = np.array(audio_segment.get_array_of_samples())
+
+        # Normalize to float32 [-1, 1]
+        if audio_segment.sample_width == 2:  # 16-bit
+            samples = samples.astype(np.float32) / 32768.0
+        elif audio_segment.sample_width == 1:  # 8-bit
+            samples = (samples.astype(np.float32) - 128) / 128.0
+
+        return samples, sample_rate
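
The raw-PCM branch of `_process_audio()` hinges on parsing MIME types like `audio/L16;rate=24000` into a bit depth and sample rate. That parsing logic is pure string handling and can be isolated into a stdlib-only sketch (the function name is illustrative):

```python
def parse_pcm_mime(mime_type: str) -> tuple[int, int]:
    """Extract (bits_per_sample, sample_rate) from a MIME type such as
    'audio/L16;rate=24000', with the same defaults used by
    GeminiBackend._process_audio()."""
    bits_per_sample, sample_rate = 16, 24000  # defaults
    for part in mime_type.split(";"):
        part = part.strip()
        if part.startswith("audio/L"):
            try:
                bits_per_sample = int(part.split("L")[1])
            except (ValueError, IndexError):
                pass  # malformed subtype: keep the default
        elif part.lower().startswith("rate="):
            try:
                sample_rate = int(part.split("=")[1])
            except (ValueError, IndexError):
                pass  # malformed rate parameter: keep the default
    return bits_per_sample, sample_rate


print(parse_pcm_mime("audio/L16;rate=24000"))  # (16, 24000)
print(parse_pcm_mime("audio/L24;rate=48000"))  # (24, 48000)
```

Keeping the defaults on parse failure means a malformed header degrades to the common 16-bit/24 kHz case instead of raising mid-stream.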
engine/cache.py ADDED
@@ -0,0 +1,171 @@
+"""
+Caching system for generated audio.
+Supports local and Hugging Face Hub storage.
+"""
+
+import hashlib
+import os
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Optional
+
+from loguru import logger
+
+
+@dataclass
+class CacheConfig:
+    """Configuration for audio caching."""
+
+    enabled: bool = True
+    local_cache_dir: Optional[str] = None  # Local cache directory
+    hf_repo_id: Optional[str] = None  # Hugging Face Hub repo for remote cache
+    max_duration_seconds: float = 30.0  # Only cache audio shorter than this
+
+
+class AudioCache:
+    """
+    Cache for generated TTS audio.
+    Supports both local filesystem and Hugging Face Hub storage.
+    """
+
+    def __init__(self, config: Optional[CacheConfig] = None):
+        self.config = config or CacheConfig()
+        self._hf_fs = None
+
+    def _get_cache_key(self, text: str, voice_id: str, backend: str) -> str:
+        """Generate a unique cache key for the given parameters."""
+        content = f"{backend}:{voice_id}:{text}"
+        return hashlib.md5(content.encode()).hexdigest()
+
+    def _get_hf_fs(self):
+        """Get HuggingFace filesystem (lazy initialization)."""
+        if self._hf_fs is None and self.config.hf_repo_id:
+            try:
+                from huggingface_hub import HfFileSystem
+
+                self._hf_fs = HfFileSystem(token=os.environ.get("HF_TOKEN"))
+            except Exception as e:
+                logger.warning(f"Could not initialize HF filesystem: {e}")
+        return self._hf_fs
+
+    def get(self, text: str, voice_id: str, backend: str) -> Optional[bytes]:
+        """
+        Retrieve cached audio if it exists.
+
+        Args:
+            text: Original text that was synthesized
+            voice_id: Voice identifier used
+            backend: Backend name used for synthesis
+
+        Returns:
+            Cached audio bytes or None if not found
+        """
+        if not self.config.enabled:
+            return None
+
+        cache_key = self._get_cache_key(text, voice_id, backend)
+
+        # Try local cache first
+        if self.config.local_cache_dir:
+            local_path = Path(self.config.local_cache_dir) / f"{cache_key}.mp3"
+            if local_path.exists():
+                logger.debug(f"Cache hit (local): {cache_key}")
+                return local_path.read_bytes()
+
+        # Try HF Hub cache
+        if self.config.hf_repo_id:
+            fs = self._get_hf_fs()
+            if fs:
+                hf_path = f"{self.config.hf_repo_id}/{voice_id}/{cache_key}.mp3"
+                try:
+                    if fs.exists(hf_path):
+                        with fs.open(hf_path, "rb") as f:
+                            logger.debug(f"Cache hit (HF Hub): {cache_key}")
+                            return f.read()
+                except Exception as e:
+                    logger.debug(f"HF cache lookup failed: {e}")
+
+        return None
+
+    def set(
+        self,
+        text: str,
+        voice_id: str,
+        backend: str,
+        audio_data: bytes,
+        duration_seconds: Optional[float] = None,
+    ) -> bool:
+        """
+        Store audio in cache.
+
+        Args:
+            text: Original text that was synthesized
+            voice_id: Voice identifier used
+            backend: Backend name used for synthesis
+            audio_data: Audio bytes to cache
+            duration_seconds: Duration of the audio (for max duration check)
+
+        Returns:
+            True if cached successfully, False otherwise
+        """
+        if not self.config.enabled:
+            return False
+
+        # Check duration limit
+        if duration_seconds and duration_seconds > self.config.max_duration_seconds:
+            logger.debug(
+                f"Audio too long to cache: {duration_seconds}s > {self.config.max_duration_seconds}s"
+            )
+            return False
+
+        cache_key = self._get_cache_key(text, voice_id, backend)
+        success = False
+
+        # Save to local cache
+        if self.config.local_cache_dir:
+            try:
+                cache_dir = Path(self.config.local_cache_dir)
+                cache_dir.mkdir(parents=True, exist_ok=True)
+                local_path = cache_dir / f"{cache_key}.mp3"
+                local_path.write_bytes(audio_data)
+                logger.debug(f"Cached locally: {cache_key}")
+                success = True
+            except Exception as e:
+                logger.warning(f"Failed to cache locally: {e}")
+
+        # Save to HF Hub
+        if self.config.hf_repo_id:
+            fs = self._get_hf_fs()
+            if fs:
+                try:
+                    voice_dir = f"{self.config.hf_repo_id}/{voice_id}"
+                    if not fs.exists(voice_dir):
+                        fs.makedirs(voice_dir, exist_ok=True)
+
+                    hf_path = f"{voice_dir}/{cache_key}.mp3"
+                    with fs.open(hf_path, "wb") as f:
+                        f.write(audio_data)
+
+                    logger.debug(f"Cached to HF Hub: {cache_key}")
+                    success = True
+                except Exception as e:
+                    logger.warning(f"Failed to cache to HF Hub: {e}")
+
+        return success
+
+    def clear_local(self) -> int:
+        """Clear local cache. Returns number of files deleted."""
+        if not self.config.local_cache_dir:
+            return 0
+
+        cache_dir = Path(self.config.local_cache_dir)
+        if not cache_dir.exists():
+            return 0
+
+        count = 0
+        for file in cache_dir.glob("*.mp3"):
+            file.unlink()
+            count += 1
+
+        logger.info(f"Cleared {count} files from local cache")
+        return count
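
The cache identity in `_get_cache_key()` is an MD5 hex digest over `backend:voice_id:text`, so only an identical (backend, voice, text) triple hits the same file. A stdlib sketch of the same scheme (the free function is illustrative; the engine uses a method):

```python
import hashlib


def cache_key(text: str, voice_id: str, backend: str) -> str:
    """Same key scheme as AudioCache._get_cache_key(): an MD5 hex digest
    over backend, voice, and text, so identical requests map to the same
    cached file."""
    return hashlib.md5(f"{backend}:{voice_id}:{text}".encode()).hexdigest()


k1 = cache_key("Willkommen!", "de_f1", "chatterbox")
k2 = cache_key("Willkommen!", "de_f1", "chatterbox")
k3 = cache_key("Willkommen!", "de_f1", "gemini")
print(k1 == k2, k1 == k3, len(k1))  # True False 32
```

MD5 is fine here because the digest is a cache address, not a security boundary; the 32-character hex form also doubles as a filesystem-safe filename.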
engine/data/assets/.gitkeep ADDED
@@ -0,0 +1,2 @@
+# Place background music files (.mp3) here
+# They will be automatically detected by the audio processor
engine/tts_engine.py ADDED
@@ -0,0 +1,270 @@
+ """
+ Main TTS Engine for Telefonansagen (Phone Announcements).
+
+ This engine provides a unified interface for generating phone announcements
+ using different TTS backends. It handles:
+ - Backend management (loading, switching, unloading)
+ - Audio generation with sensible defaults
+ - Post-processing (background music, fades, normalization)
+ - Caching for efficiency
+ """
+
+ import os
+ from dataclasses import dataclass
+ from pathlib import Path
+ from typing import Optional, Type, Union
+
+ import numpy as np
+ from loguru import logger
+
+ from .audio_processor import AudioProcessingConfig, AudioProcessor
+ from .backends.base import BackendConfig, TTSBackend, TTSResult
+ from .backends.chatterbox_backend import ChatterboxBackend
+ from .cache import AudioCache, CacheConfig
+
+
+ @dataclass
+ class EngineConfig:
+     """Configuration for the TTS Engine."""
+
+     # Backend settings
+     default_backend: str = "chatterbox"
+     device: str = "auto"  # "auto", "cuda", "mps", "cpu"
+
+     # Default generation settings
+     default_language: str = "de"  # German for phone announcements
+
+     # Audio processing defaults
+     add_background_music: bool = False
+     default_music: Optional[str] = None
+     music_volume_db: float = -20.0
+     fade_in_ms: int = 500
+     fade_out_ms: int = 500
+
+     # Caching
+     enable_cache: bool = True
+     local_cache_dir: Optional[str] = None
+     hf_cache_repo: Optional[str] = None
+
+
+ class TTSEngine:
+     """
+     Main TTS Engine for generating phone announcements.
+
+     Usage:
+         # Simple usage with defaults
+         engine = TTSEngine()
+         audio = engine.generate("Willkommen bei unserem Service.")
+
+         # With voice cloning
+         audio = engine.generate(
+             "Willkommen bei unserem Service.",
+             voice_audio="path/to/reference.wav",
+         )
+
+         # Switch backend
+         engine.set_backend("gemini")
+         audio = engine.generate("Welcome to our service.", language="en")
+     """
+
+     # Registry of available backends
+     _backend_registry: dict[str, Type[TTSBackend]] = {
+         "chatterbox": ChatterboxBackend,
+     }
+
+     def __init__(self, config: Optional[EngineConfig] = None):
+         self.config = config or EngineConfig()
+
+         # Initialize components
+         self._backends: dict[str, TTSBackend] = {}
+         self._current_backend_name: str = self.config.default_backend
+
+         # Audio processor
+         self._processor = AudioProcessor(
+             AudioProcessingConfig(
+                 music_volume_db=self.config.music_volume_db,
+                 fade_in_ms=self.config.fade_in_ms,
+                 fade_out_ms=self.config.fade_out_ms,
+             )
+         )
+
+         # Cache
+         self._cache = AudioCache(
+             CacheConfig(
+                 enabled=self.config.enable_cache,
+                 local_cache_dir=self.config.local_cache_dir,
+                 hf_repo_id=self.config.hf_cache_repo,
+             )
+         )
+
+     @classmethod
+     def register_backend(cls, name: str, backend_class: Type[TTSBackend]) -> None:
+         """Register a new backend class."""
+         cls._backend_registry[name] = backend_class
+         logger.info(f"Registered backend: {name}")
+
+     @classmethod
+     def available_backends(cls) -> list[str]:
+         """List available backend names."""
+         return list(cls._backend_registry.keys())
+
+     def _get_backend(self, name: Optional[str] = None) -> TTSBackend:
+         """Get or create a backend instance."""
+         name = name or self._current_backend_name
+
+         if name not in self._backend_registry:
+             available = ", ".join(self._backend_registry.keys())
+             raise ValueError(f"Unknown backend '{name}'. Available: {available}")
+
+         if name not in self._backends:
+             backend_config = BackendConfig(device=self.config.device)
+             self._backends[name] = self._backend_registry[name](backend_config)
+
+         return self._backends[name]
+
+     @property
+     def current_backend(self) -> TTSBackend:
+         """Get the currently active backend."""
+         return self._get_backend()
+
+     def set_backend(self, name: str) -> None:
+         """Switch to a different backend."""
+         if name not in self._backend_registry:
+             available = ", ".join(self._backend_registry.keys())
+             raise ValueError(f"Unknown backend '{name}'. Available: {available}")
+
+         self._current_backend_name = name
+         logger.info(f"Switched to backend: {name}")
+
+     def load_backend(self, name: Optional[str] = None) -> None:
+         """Pre-load a backend's model."""
+         backend = self._get_backend(name)
+         if not backend.is_loaded:
+             backend.load()
+
+     def unload_backend(self, name: Optional[str] = None) -> None:
+         """Unload a backend's model to free memory."""
+         backend = self._get_backend(name)
+         if backend.is_loaded:
+             backend.unload()
+
+     def get_supported_languages(self, backend: Optional[str] = None) -> dict[str, str]:
+         """Get supported languages for a backend."""
+         return self._get_backend(backend).supported_languages
+
+     def generate(
+         self,
+         text: str,
+         language: Optional[str] = None,
+         voice_audio: Optional[str] = None,
+         background_music: Optional[str] = None,
+         output_path: Optional[str] = None,
+         use_cache: bool = True,
+         **kwargs,
+     ) -> Union[bytes, str, tuple[int, np.ndarray]]:
+         """
+         Generate a phone announcement.
+
+         Args:
+             text: Text to synthesize
+             language: Language code (default: "de")
+             voice_audio: Path/URL to reference audio for voice cloning
+             background_music: Name/path of background music file
+             output_path: Optional path to save the output file
+             use_cache: Whether to use caching (default: True)
+             **kwargs: Additional backend-specific parameters
+
+         Returns:
+             - If output_path is set: path to the saved file
+             - If neither output_path nor background_music is set:
+               tuple(sample_rate, audio_array) for Gradio
+             - Otherwise: MP3 bytes
+         """
+         language = language or self.config.default_language
+         backend = self.current_backend
+
+         # Derive a stable voice ID for caching
+         if not voice_audio:
+             voice_id = "default"
+         elif os.path.exists(voice_audio):
+             voice_id = Path(voice_audio).stem
+         else:
+             voice_id = "custom"
+
+         # Check cache
+         if use_cache and self._cache.config.enabled:
+             cached = self._cache.get(text, voice_id, backend.name)
+             if cached:
+                 logger.info("Using cached audio")
+                 if output_path:
+                     Path(output_path).write_bytes(cached)
+                     return output_path
+                 return cached
+
+         # Generate audio
+         logger.info(f"Generating TTS: backend={backend.name}, lang={language}")
+         result = backend.generate(
+             text=text, language=language, voice_audio_path=voice_audio, **kwargs
+         )
+
+         # Determine whether post-processing is needed
+         use_music = background_music or (
+             self.config.add_background_music and self.config.default_music
+         )
+         music_path = background_music or self.config.default_music
+
+         if use_music or output_path:
+             # Process audio with pydub
+             processed = self._processor.process(
+                 audio=result.audio,
+                 sample_rate=result.sample_rate,
+                 output_path=output_path,
+                 background_music_path=music_path if use_music else None,
+             )
+
+             # Cache if appropriate
+             if use_cache and isinstance(processed, bytes):
+                 duration = len(result.audio) / result.sample_rate
+                 self._cache.set(text, voice_id, backend.name, processed, duration)
+
+             return processed
+         else:
+             # Return raw audio for Gradio (sample_rate, audio_array)
+             return (result.sample_rate, result.audio)
+
+     def generate_raw(
+         self,
+         text: str,
+         language: Optional[str] = None,
+         voice_audio: Optional[str] = None,
+         **kwargs,
+     ) -> TTSResult:
+         """
+         Generate raw audio without post-processing.
+
+         Returns:
+             TTSResult with audio array and sample rate
+         """
+         language = language or self.config.default_language
+         return self.current_backend.generate(
+             text=text, language=language, voice_audio_path=voice_audio, **kwargs
+         )
+
+     def list_background_music(self) -> list[str]:
+         """List available background music files."""
+         return self._processor.list_available_music()
+
+     def clear_cache(self) -> int:
+         """Clear the local audio cache. Returns the number of files deleted."""
+         return self._cache.clear_local()
+
+
+ # Register additional backends if available
+ try:
+     from .backends.gemini_backend import GeminiBackend
+
+     TTSEngine.register_backend("gemini", GeminiBackend)
+ except ImportError:
+     pass  # Gemini backend not available
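`TTSEngine` resolves backends through a class-level registry and instantiates them lazily on first use, so unused (and potentially heavy) models never load. The mechanism reduced to its core, with stand-in classes rather than the real `ChatterboxBackend`:

```python
from typing import Type


class Backend:
    """Minimal stand-in for the TTSBackend interface."""

    name = "base"


class DummyBackend(Backend):
    name = "dummy"


class MiniEngine:
    # Class-level registry shared by all engine instances
    _registry: dict[str, Type[Backend]] = {"dummy": DummyBackend}

    def __init__(self) -> None:
        # Per-instance cache of constructed backends
        self._instances: dict[str, Backend] = {}

    @classmethod
    def register_backend(cls, name: str, backend_class: Type[Backend]) -> None:
        cls._registry[name] = backend_class

    def _get_backend(self, name: str) -> Backend:
        if name not in self._registry:
            raise ValueError(f"Unknown backend '{name}'")
        # Lazy instantiation: construct only on first request, then reuse
        if name not in self._instances:
            self._instances[name] = self._registry[name]()
        return self._instances[name]
```

Registering after the class is defined mirrors the guarded `TTSEngine.register_backend("gemini", GeminiBackend)` call at the bottom of `tts_engine.py`.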
requirements.txt CHANGED
@@ -1,7 +1,17 @@
- gradio
+ # Requirements for Telefonansagen TTS Engine
+
+ # Core dependencies
+ gradio>=4.0.0
  numpy==1.26.0
- resampy==0.4.3
+ torch>=2.0.0
+
+ # Audio processing
  librosa==0.10.0
+ resampy==0.4.3
+ pydub>=0.25.0
+ soundfile>=0.12.0
+
+ # TTS Model dependencies
  s3tokenizer
  transformers==4.46.3
  diffusers==0.29.0
@@ -10,10 +20,20 @@ resemble-perth==1.0.1
  silero-vad==5.1.2
  conformer==0.3.2
  safetensors
+ huggingface_hub>=0.20.0
+
+ # Logging
+ loguru>=0.7.0
+
+ # Optional: Gemini backend
+ # google-genai>=0.3.0
+
+ # Optional: Caching to HuggingFace Hub
+ # pandas>=2.0.0

  # Optional language-specific dependencies
  # Uncomment the ones you need for specific languages:
- spacy_pkuseg # For Chinese text segmentation
- pykakasi>=2.2.0 # For Japanese text processing (Kanji to Hiragana)
- russian-text-stresser @ git+https://github.com/Vuizur/add-stress-to-epub
+ # spacy_pkuseg # For Chinese text segmentation
+ # pykakasi>=2.2.0 # For Japanese text processing (Kanji to Hiragana)
+ # russian-text-stresser @ git+https://github.com/Vuizur/add-stress-to-epub
  # dicta-onnx>=0.1.0 # For Hebrew diacritization
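Several of the entries above are optional and stay commented out; the engine tolerates their absence via an import guard, like the `try`/`except ImportError` that registers the Gemini backend. A sketch of probing for an optional package without importing it (a generic pattern, not code from this repository):

```python
import importlib.util


def optional_available(module_name: str) -> bool:
    """Return True if an optional dependency can be imported."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # find_spec raises when a parent package of a dotted name is missing
        return False


# Feature flags derived from what is actually installed
HAS_PYDUB = optional_available("pydub")
HAS_GEMINI = optional_available("google.genai")
```

Checking the spec instead of importing keeps startup cheap: the heavy module is only loaded later, when the feature behind the flag is actually used.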