flozi00 committed on
Commit
a86cdfa
·
1 Parent(s): 9c1a0c2
README.md CHANGED
@@ -1,11 +1,181 @@
  ---
- title: Chatterbox-Multilingual-TTS
- emoji: 🌎
  colorFrom: indigo
  colorTo: blue
  sdk: gradio
  sdk_version: 5.29.0
- app_file: app.py
  pinned: false
- short_description: Chatterbox TTS supporting 23 languages
- ---
  ---
+ title: Telefonansagen TTS Engine
+ emoji: 📞
  colorFrom: indigo
  colorTo: blue
  sdk: gradio
  sdk_version: 5.29.0
+ app_file: app_new.py
  pinned: false
+ short_description: Professional phone announcements with AI TTS
+ ---
+
+ # Telefonansagen TTS Engine
+
+ A modular text-to-speech engine for generating professional phone announcements (Telefonansagen) with support for 23 languages and voice cloning.
+
+ ## Features
+
+ - 🎙️ **High-Quality TTS**: Uses Chatterbox Multilingual for natural speech synthesis
+ - 🌍 **23 Languages**: German, English, French, Spanish, Italian, and many more
+ - 🎭 **Voice Cloning**: Clone any voice from a short audio sample
+ - 🔌 **Modular Architecture**: Easy to swap TTS backends
+ - 🎵 **Background Music**: Optional background music mixing
+ - 💾 **Caching**: Local and HuggingFace Hub caching support
+
+ ## Quick Start
+
+ ```bash
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the application
+ python app_new.py
+ ```
+
+ ## Architecture
+
+ The engine uses a modular backend system that allows easy swapping of TTS providers:
+
+ ```
+ engine/
+ ├── __init__.py           # Main exports
+ ├── tts_engine.py         # Core TTS engine
+ ├── audio_processor.py    # Post-processing (music, fades)
+ ├── cache.py              # Caching system
+ └── backends/
+     ├── base.py                  # Abstract backend interface
+     ├── chatterbox_backend.py    # Default: Chatterbox Multilingual
+     └── gemini_backend.py        # Optional: Google Gemini TTS
+ ```
+
+ ## Usage
+
+ ### Simple Usage
+
+ ```python
+ from engine import TTSEngine
+
+ # Create engine with defaults
+ engine = TTSEngine()
+
+ # Generate German announcement (default)
+ audio = engine.generate("Willkommen bei unserem Service.")
+
+ # Generate with specific language
+ audio = engine.generate(
+     "Welcome to our customer service.",
+     language="en"
+ )
+ ```
+
+ ### Voice Cloning
+
+ ```python
+ # Clone a voice from reference audio
+ audio = engine.generate(
+     "Herzlich willkommen!",
+     language="de",
+     voice_audio="path/to/reference.wav"
+ )
+ ```
+
+ ### Switch Backend
+
+ ```python
+ # Use Gemini instead of Chatterbox (requires GEMINI_API_KEY)
+ engine.set_backend("gemini")
+ audio = engine.generate("Hello world!", language="en")
+ ```
+
+ ### With Background Music
+
+ ```python
+ # Add background music (place .mp3 files in engine/data/assets/)
+ audio = engine.generate(
+     "Bitte warten Sie.",
+     background_music="hold_music"
+ )
+ ```
+
+ ## Creating a Custom Backend
+
+ To add a new TTS backend, inherit from `TTSBackend`:
+
+ ```python
+ from engine.backends.base import TTSBackend, TTSResult, BackendConfig
+
+ class MyCustomBackend(TTSBackend):
+     @property
+     def name(self) -> str:
+         return "My Custom TTS"
+
+     @property
+     def supports_voice_cloning(self) -> bool:
+         return False
+
+     @property
+     def supported_languages(self) -> dict[str, str]:
+         return {"en": "English", "de": "German"}
+
+     def load(self) -> None:
+         # Load your model
+         self._is_loaded = True
+
+     def unload(self) -> None:
+         # Cleanup
+         self._is_loaded = False
+
+     def generate(self, text: str, language: str = "de", **kwargs) -> TTSResult:
+         # Generate audio
+         audio = your_tts_function(text, language)
+         return TTSResult(audio=audio, sample_rate=22050)
+
+ # Register the backend
+ from engine import TTSEngine
+ TTSEngine.register_backend("my_custom", MyCustomBackend)
+ ```
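The `register_backend` call above plugs a class into the engine's backend registry. As a standalone illustration of that pattern, here is a minimal sketch; all names below (`Backend`, `Registry`, `EchoBackend`) are illustrative only and not part of the engine's actual API:

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Illustrative stand-in for an abstract TTS backend interface."""

    @abstractmethod
    def generate(self, text: str) -> str: ...


class Registry:
    """Maps backend names to backend classes, mirroring the register/create flow."""

    _backends: dict[str, type[Backend]] = {}

    @classmethod
    def register_backend(cls, name: str, backend_cls: type[Backend]) -> None:
        cls._backends[name] = backend_cls

    @classmethod
    def create(cls, name: str) -> Backend:
        if name not in cls._backends:
            raise KeyError(f"Unknown backend: {name}")
        return cls._backends[name]()


class EchoBackend(Backend):
    """Toy backend that returns a tagged string instead of audio."""

    def generate(self, text: str) -> str:
        return f"audio({text})"


Registry.register_backend("echo", EchoBackend)
print(Registry.create("echo").generate("hello"))  # prints "audio(hello)"
```

Instantiating through the registry by name is what lets `set_backend("gemini")` work without the caller importing the backend class directly.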
+
+ ## Configuration
+
+ ### Engine Configuration
+
+ ```python
+ from engine.tts_engine import TTSEngine, EngineConfig
+
+ config = EngineConfig(
+     default_backend="chatterbox",
+     device="cuda",  # or "cpu", "mps", "auto"
+     default_language="de",
+     enable_cache=True,
+     local_cache_dir="./cache",
+ )
+
+ engine = TTSEngine(config)
+ ```
+
+ ### Environment Variables
+
+ - `HF_TOKEN`: HuggingFace token for model downloads
+ - `GEMINI_API_KEY`: Google API key (for Gemini backend)
+
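Both variables can be exported in the shell before launching the app. The values below are placeholders, not working credentials; only the Gemini backend needs `GEMINI_API_KEY`, while the default Chatterbox backend runs without it:

```shell
# Placeholder values; substitute real tokens before running the app
export HF_TOKEN="hf_xxxxxxxxxxxxxxxx"
export GEMINI_API_KEY="your-gemini-api-key"
```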
+ ## Supported Languages
+
+ | Code | Language | Code | Language |
+ |------|----------|------|----------|
+ | de | German | ja | Japanese |
+ | en | English | ko | Korean |
+ | fr | French | ms | Malay |
+ | es | Spanish | nl | Dutch |
+ | it | Italian | no | Norwegian |
+ | pt | Portuguese | pl | Polish |
+ | ru | Russian | sv | Swedish |
+ | zh | Chinese | sw | Swahili |
+ | ar | Arabic | tr | Turkish |
+ | da | Danish | fi | Finnish |
+ | el | Greek | he | Hebrew |
+ | hi | Hindi | | |
+
+ ## License
+
+ MIT License
app.py CHANGED
@@ -1,321 +1,322 @@
  import random
  import numpy as np
  import torch
- from src.chatterbox.mtl_tts import ChatterboxMultilingualTTS, SUPPORTED_LANGUAGES
- import gradio as gr
- import spaces
-
- DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
- print(f"🚀 Running on device: {DEVICE}")
-
- # --- Global Model Initialization ---
- MODEL = None
-
- LANGUAGE_CONFIG = {
-     "ar": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ar_f/ar_prompts2.flac",
-         "text": "في الشهر الماضي، وصلنا إلى معلم جديد بمليارين من المشاهدات على قناتنا على يوتيوب."
-     },
-     "da": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/da_m1.flac",
-         "text": "Sidste måned nåede vi en ny milepæl med to milliarder visninger på vores YouTube-kanal."
-     },
-     "de": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/de_f1.flac",
-         "text": "Letzten Monat haben wir einen neuen Meilenstein erreicht: zwei Milliarden Aufrufe auf unserem YouTube-Kanal."
-     },
-     "el": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/el_m.flac",
-         "text": "Τον περασμένο μήνα, φτάσαμε σε ένα νέο ορόσημο με δύο δισεκατομμύρια προβολές στο κανάλι μας στο YouTube."
-     },
-     "en": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/en_f1.flac",
-         "text": "Last month, we reached a new milestone with two billion views on our YouTube channel."
-     },
-     "es": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/es_f1.flac",
-         "text": "El mes pasado alcanzamos un nuevo hito: dos mil millones de visualizaciones en nuestro canal de YouTube."
-     },
-     "fi": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/fi_m.flac",
-         "text": "Viime kuussa saavutimme uuden virstanpylvään kahden miljardin katselukerran kanssa YouTube-kanavallamme."
-     },
-     "fr": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/fr_f1.flac",
-         "text": "Le mois dernier, nous avons atteint un nouveau jalon avec deux milliards de vues sur notre chaîne YouTube."
-     },
-     "he": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/he_m1.flac",
-         "text": "בחודש שעבר הגענו לאבן דרך חדשה עם שני מיליארד צפיות בערוץ היוטיוב שלנו."
-     },
-     "hi": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/hi_f1.flac",
-         "text": "पिछले महीने हमने एक नया मील का पत्थर छुआ: हमारे YouTube चैनल पर दो अरब व्यूज़।"
-     },
-     "it": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/it_m1.flac",
-         "text": "Il mese scorso abbiamo raggiunto un nuovo traguardo: due miliardi di visualizzazioni sul nostro canale YouTube."
-     },
-     "ja": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ja/ja_prompts1.flac",
-         "text": "先月、私たちのYouTubeチャンネルで二十億回の再生回数という新たなマイルストーンに到達しました。"
-     },
-     "ko": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ko_f.flac",
-         "text": "지난달 우리는 유튜브 채널에서 이십억 조회수라는 새로운 이정표에 도달했습니다."
-     },
-     "ms": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ms_f.flac",
-         "text": "Bulan lepas, kami mencapai pencapaian baru dengan dua bilion tontonan di saluran YouTube kami."
-     },
-     "nl": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/nl_m.flac",
-         "text": "Vorige maand bereikten we een nieuwe mijlpaal met twee miljard weergaven op ons YouTube-kanaal."
-     },
-     "no": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/no_f1.flac",
-         "text": "Forrige måned nådde vi en ny milepæl med to milliarder visninger på YouTube-kanalen vår."
-     },
-     "pl": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/pl_m.flac",
-         "text": "W zeszłym miesiącu osiągnęliśmy nowy kamień milowy z dwoma miliardami wyświetleń na naszym kanale YouTube."
-     },
-     "pt": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/pt_m1.flac",
-         "text": "No mês passado, alcançámos um novo marco: dois mil milhões de visualizações no nosso canal do YouTube."
-     },
-     "ru": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ru_m.flac",
-         "text": "В прошлом месяце мы достигли нового рубежа: два миллиарда просмотров на нашем YouTube-канале."
-     },
-     "sv": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/sv_f.flac",
-         "text": "Förra månaden nådde vi en ny milstolpe med två miljarder visningar på vår YouTube-kanal."
-     },
-     "sw": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/sw_m.flac",
-         "text": "Mwezi uliopita, tulifika hatua mpya ya maoni ya bilioni mbili kweny kituo chetu cha YouTube."
-     },
-     "tr": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/tr_m.flac",
-         "text": "Geçen ay YouTube kanalımızda iki milyar görüntüleme ile yeni bir dönüm noktasına ulaştık."
-     },
-     "zh": {
-         "audio": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/zh_f2.flac",
-         "text": "上个月，我们达到了一个新的里程碑。 我们的YouTube频道观看次数达到了二十亿次，这绝对令人难以置信。"
-     },
  }

- # --- UI Helpers ---
- def default_audio_for_ui(lang: str) -> str | None:
-     return LANGUAGE_CONFIG.get(lang, {}).get("audio")

- def default_text_for_ui(lang: str) -> str:
-     return LANGUAGE_CONFIG.get(lang, {}).get("text", "")

- def get_supported_languages_display() -> str:
-     """Generate a formatted display of all supported languages."""
-     language_items = []
-     for code, name in sorted(SUPPORTED_LANGUAGES.items()):
-         language_items.append(f"**{name}** (`{code}`)")
-
-     # Split into 2 lines
-     mid = len(language_items) // 2
-     line1 = " • ".join(language_items[:mid])
-     line2 = " • ".join(language_items[mid:])
-
-     return f"""
- ### 🌍 Supported Languages ({len(SUPPORTED_LANGUAGES)} total)
- {line1}

- {line2}
- """

- def get_or_load_model():
-     """Loads the ChatterboxMultilingualTTS model if it hasn't been loaded already,
-     and ensures it's on the correct device."""
-     global MODEL
-     if MODEL is None:
-         print("Model not loaded, initializing...")
-         try:
-             MODEL = ChatterboxMultilingualTTS.from_pretrained(DEVICE)
-             if hasattr(MODEL, 'to') and str(MODEL.device) != DEVICE:
-                 MODEL.to(DEVICE)
-             print(f"Model loaded successfully. Internal device: {getattr(MODEL, 'device', 'N/A')}")
-         except Exception as e:
-             print(f"Error loading model: {e}")
-             raise
-     return MODEL
-
- # Attempt to load the model at startup.
  try:
-     get_or_load_model()
  except Exception as e:
-     print(f"CRITICAL: Failed to load model on startup. Application may not function. Error: {e}")
-
- def set_seed(seed: int):
-     """Sets the random seed for reproducibility across torch, numpy, and random."""
-     torch.manual_seed(seed)
-     if DEVICE == "cuda":
-         torch.cuda.manual_seed(seed)
-         torch.cuda.manual_seed_all(seed)
-     random.seed(seed)
-     np.random.seed(seed)
-
- def resolve_audio_prompt(language_id: str, provided_path: str | None) -> str | None:
-     """
-     Decide which audio prompt to use:
-     - If user provided a path (upload/mic/url), use it.
-     - Else, fall back to language-specific default (if any).
-     """
-     if provided_path and str(provided_path).strip():
-         return provided_path
-     return LANGUAGE_CONFIG.get(language_id, {}).get("audio")

  @spaces.GPU
- def generate_tts_audio(
-     text_input: str,
-     language_id: str,
-     audio_prompt_path_input: str = None,
-     exaggeration_input: float = 0.5,
-     temperature_input: float = 0.8,
-     seed_num_input: int = 0,
-     cfgw_input: float = 0.5
  ) -> tuple[int, np.ndarray]:
      """
-     Generate high-quality speech audio from text using Chatterbox Multilingual model with optional reference audio styling.
-     Supported languages: English, French, German, Spanish, Italian, Portuguese, and Hindi.
-
-     This tool synthesizes natural-sounding speech from input text. When a reference audio file
-     is provided, it captures the speaker's voice characteristics and speaking style. The generated audio
-     maintains the prosody, tone, and vocal qualities of the reference speaker, or uses default voice if no reference is provided.

      Args:
-         text_input (str): The text to synthesize into speech (maximum 300 characters)
-         language_id (str): The language code for synthesis (eg. en, fr, de, es, it, pt, hi)
-         audio_prompt_path_input (str, optional): File path or URL to the reference audio file that defines the target voice style. Defaults to None.
-         exaggeration_input (float, optional): Controls speech expressiveness (0.25-2.0, neutral=0.5, extreme values may be unstable). Defaults to 0.5.
-         temperature_input (float, optional): Controls randomness in generation (0.05-5.0, higher=more varied). Defaults to 0.8.
-         seed_num_input (int, optional): Random seed for reproducible results (0 for random generation). Defaults to 0.
-         cfgw_input (float, optional): CFG/Pace weight controlling generation guidance (0.2-1.0). Defaults to 0.5, 0 for language transfer.

      Returns:
-         tuple[int, np.ndarray]: A tuple containing the sample rate (int) and the generated audio waveform (numpy.ndarray)
      """
-     current_model = get_or_load_model()
-
-     if current_model is None:
-         raise RuntimeError("TTS model is not loaded.")
-
-     if seed_num_input != 0:
-         set_seed(int(seed_num_input))
-
-     print(f"Generating audio for text: '{text_input[:50]}...'")
-
-     # Handle optional audio prompt
-     chosen_prompt = audio_prompt_path_input or default_audio_for_ui(language_id)
-
-     generate_kwargs = {
-         "exaggeration": exaggeration_input,
-         "temperature": temperature_input,
-         "cfg_weight": cfgw_input,
-     }
-     if chosen_prompt:
-         generate_kwargs["audio_prompt_path"] = chosen_prompt
-         print(f"Using audio prompt: {chosen_prompt}")
-     else:
-         print("No audio prompt provided; using default voice.")
-
-     wav = current_model.generate(
-         text_input[:300],  # Truncate text to max chars
-         language_id=language_id,
-         **generate_kwargs
-     )
-     print("Audio generation complete.")
-     return (current_model.sr, wav.squeeze(0).numpy())
-
- with gr.Blocks() as demo:
-     gr.Markdown(
-         """
-         # Chatterbox Multilingual Demo
-         Generate high-quality multilingual speech from text with reference audio styling, supporting 23 languages.
-
-         For a hosted version of Chatterbox Multilingual and for finetuning, please visit [resemble.ai](https://app.resemble.ai)
-         """
      )
-
-     # Display supported languages
-     gr.Markdown(get_supported_languages_display())
-     with gr.Row():
-         with gr.Column():
-             initial_lang = "fr"
-             text = gr.Textbox(
-                 value=default_text_for_ui(initial_lang),
-                 label="Text to synthesize (max chars 300)",
-                 max_lines=5
-             )
-
-             language_id = gr.Dropdown(
-                 choices=list(ChatterboxMultilingualTTS.get_supported_languages().keys()),
-                 value=initial_lang,
-                 label="Language",
-                 info="Select the language for text-to-speech synthesis"
-             )
-
-             ref_wav = gr.Audio(
-                 sources=["upload", "microphone"],
-                 type="filepath",
-                 label="Reference Audio File (Optional)",
-                 value=default_audio_for_ui(initial_lang)
-             )

-             gr.Markdown(
-                 "💡 **Note**: Ensure that the reference clip matches the specified language tag. Otherwise, language transfer outputs may inherit the accent of the reference clip's language. To mitigate this, set the CFG weight to 0.",
-                 elem_classes=["audio-note"]
-             )

-             exaggeration = gr.Slider(
-                 0.25, 2, step=.05, label="Exaggeration (Neutral = 0.5, extreme values can be unstable)", value=.5
-             )
-             cfg_weight = gr.Slider(
-                 0.2, 1, step=.05, label="CFG/Pace", value=0.5
-             )

-             with gr.Accordion("More options", open=False):
-                 seed_num = gr.Number(value=0, label="Random seed (0 for random)")
-                 temp = gr.Slider(0.05, 5, step=.05, label="Temperature", value=.8)

-             run_btn = gr.Button("Generate", variant="primary")

-         with gr.Column():
-             audio_output = gr.Audio(label="Output Audio")

-     def on_language_change(lang, current_ref, current_text):
-         return default_audio_for_ui(lang), default_text_for_ui(lang)

-     language_id.change(
          fn=on_language_change,
-         inputs=[language_id, ref_wav, text],
-         outputs=[ref_wav, text],
-         show_progress=False
      )

-     run_btn.click(
-         fn=generate_tts_audio,
-         inputs=[
-             text,
-             language_id,
-             ref_wav,
-             exaggeration,
-             temp,
-             seed_num,
-             cfg_weight,
-         ],
-         outputs=[audio_output],
-     )

- demo.launch(mcp_server=True)
+ """
+ Telefonansagen TTS - Simplified Gradio Application
+
+ A streamlined interface for generating professional phone announcements
+ using the modular TTS engine with Chatterbox Multilingual as default backend.
+ """
+
  import random
+
+ import gradio as gr
  import numpy as np
  import torch
+
+ try:
+     import spaces
+
+     HAS_SPACES = True
+ except ImportError:
+     HAS_SPACES = False
+
+     # Create a dummy decorator
+     class spaces:
+         @staticmethod
+         def GPU(func):
+             return func
+
+
+ from loguru import logger
+
+ from engine import TTSEngine
+ from engine.backends.chatterbox_backend import DEFAULT_VOICE_PROMPTS
+
+ # --- Configuration ---
+ DEVICE = (
+     "cuda"
+     if torch.cuda.is_available()
+     else "mps" if torch.backends.mps.is_available() else "cpu"
+ )
+ logger.info(f"🚀 Running on device: {DEVICE}")
+
+ # Language display configuration
+ LANGUAGE_DISPLAY = {
+     "de": "🇩🇪 Deutsch",
+     "en": "🇬🇧 English",
+     "fr": "🇫🇷 Français",
+     "es": "🇪🇸 Español",
+     "it": "🇮🇹 Italiano",
+     "nl": "🇳🇱 Nederlands",
+     "pl": "🇵🇱 Polski",
+     "pt": "🇵🇹 Português",
+     "ru": "🇷🇺 Русский",
+     "tr": "🇹🇷 Türkçe",
+     "ar": "🇸🇦 العربية",
+     "zh": "🇨🇳 中文",
+     "ja": "🇯🇵 日本語",
+     "ko": "🇰🇷 한국어",
+     "hi": "🇮🇳 हिन्दी",
+     "da": "🇩🇰 Dansk",
+     "el": "🇬🇷 Ελληνικά",
+     "fi": "🇫🇮 Suomi",
+     "he": "🇮🇱 עברית",
+     "ms": "🇲🇾 Bahasa Melayu",
+     "no": "🇳🇴 Norsk",
+     "sv": "🇸🇪 Svenska",
+     "sw": "🇰🇪 Kiswahili",
  }

+ # Example texts per language
+ EXAMPLE_TEXTS = {
+     "de": "Herzlich willkommen. Sie sind mit unserem Kundenservice verbunden. Bitte haben Sie einen Moment Geduld, wir sind gleich für Sie da.",
+     "en": "Welcome to our customer service. Please hold the line, one of our representatives will be with you shortly.",
+     "fr": "Bienvenue sur notre service client. Veuillez patienter, un conseiller va prendre votre appel.",
+     "es": "Bienvenido a nuestro servicio de atención al cliente. Por favor, espere un momento.",
+     "it": "Benvenuto nel nostro servizio clienti. La preghiamo di attendere in linea.",
+     "nl": "Welkom bij onze klantenservice. Een moment geduld alstublieft.",
+     "pl": "Witamy w naszej obsłudze klienta. Proszę czekać na połączenie.",
+     "pt": "Bem-vindo ao nosso serviço de apoio ao cliente. Por favor, aguarde um momento.",
+     "ru": "Добро пожаловать в службу поддержки. Пожалуйста, оставайтесь на линии.",
+     "tr": "Müşteri hizmetlerimize hoş geldiniz. Lütfen hatta kalın.",
+     "ar": "مرحباً بكم في خدمة العملاء. يرجى الانتظار على الخط.",
+     "zh": "欢迎致电客户服务中心。请稍候，我们的客服代表将很快为您服务。",
+     "ja": "お電話ありがとうございます。担当者におつなぎしますので、少々お待ちください。",
+     "ko": "고객 서비스에 오신 것을 환영합니다. 잠시만 기다려 주세요.",
+     "hi": "हमारी ग्राहक सेवा में आपका स्वागत है। कृपया प्रतीक्षा करें।",
+     "da": "Velkommen til vores kundeservice. Vent venligst.",
+     "el": "Καλώς ήρθατε στην εξυπηρέτηση πελατών. Παρακαλώ περιμένετε.",
+     "fi": "Tervetuloa asiakaspalveluumme. Odottakaa hetki.",
+     "he": "ברוכים הבאים לשירות הלקוחות שלנו. אנא המתינו על הקו.",
+     "ms": "Selamat datang ke perkhidmatan pelanggan kami. Sila tunggu sebentar.",
+     "no": "Velkommen til vår kundeservice. Vennligst vent.",
+     "sv": "Välkommen till vår kundtjänst. Vänligen vänta.",
+     "sw": "Karibu kwa huduma yetu ya wateja. Tafadhali subiri.",
+ }

+ # --- Global Engine ---
+ ENGINE = None

+ def get_engine() -> TTSEngine:
+     """Get or initialize the TTS engine."""
+     global ENGINE
+     if ENGINE is None:
+         from engine import TTSEngine
+         from engine.tts_engine import EngineConfig
+
+         logger.info("Initializing TTS Engine...")
+         ENGINE = TTSEngine(
+             EngineConfig(
+                 default_backend="chatterbox",
+                 device=DEVICE,
+                 default_language="de",
+             )
+         )
+         # Pre-load the model
+         ENGINE.load_backend()
+         logger.info("TTS Engine ready!")
+
+     return ENGINE

+ # Initialize on startup
  try:
+     get_engine()
  except Exception as e:
+     logger.error(f"Failed to initialize engine on startup: {e}")
+
+
+ # --- Helper Functions ---
+ def get_language_choices() -> list[tuple[str, str]]:
+     """Get language choices for dropdown."""
+     engine = get_engine()
+     supported = engine.get_supported_languages()
+     choices = []
+     for code in supported.keys():
+         display = LANGUAGE_DISPLAY.get(code, f"{supported[code]} ({code})")
+         choices.append((display, code))
+     # Sort by display name, but put German first
+     choices.sort(key=lambda x: (x[1] != "de", x[0]))
+     return choices
+
+
+ def get_example_text(language: str) -> str:
+     """Get example text for a language."""
+     return EXAMPLE_TEXTS.get(language, EXAMPLE_TEXTS["en"])
+
+
+ def get_default_voice(language: str) -> str:
+     """Get default voice prompt URL for a language."""
+     return DEFAULT_VOICE_PROMPTS.get(language)

+ # --- Main Generation Function ---
  @spaces.GPU
+ def generate_announcement(
+     text: str,
+     language: str,
+     voice_audio: str = None,
+     seed: int = 0,
  ) -> tuple[int, np.ndarray]:
      """
+     Generate a phone announcement.

      Args:
+         text: Text to synthesize (max 500 characters)
+         language: Language code
+         voice_audio: Optional path to reference audio for voice cloning
+         seed: Random seed (0 = random)

      Returns:
+         Tuple of (sample_rate, audio_array) for Gradio audio component
      """
+     engine = get_engine()
+
+     # Set seed for reproducibility
+     if seed != 0:
+         torch.manual_seed(seed)
+         random.seed(seed)
+         np.random.seed(seed)
+         if DEVICE == "cuda":
+             torch.cuda.manual_seed_all(seed)
+
+     # Truncate text
+     text = text[:500]
+
+     # Use default voice if none provided
+     if not voice_audio or not str(voice_audio).strip():
+         voice_audio = get_default_voice(language)
+
+     logger.info(f"Generating: lang={language}, text='{text[:50]}...'")
+
+     # Generate audio
+     result = engine.generate(
+         text=text,
+         language=language,
+         voice_audio=voice_audio,
      )
+
+     return result
+
+
+ def on_language_change(language: str):
+     """Handle language selection change."""
+     return get_example_text(language), get_default_voice(language)
+
+
+ # --- Gradio Interface ---
+ def create_interface():
+     """Create the Gradio interface."""
+
+     with gr.Blocks(
+         title="Telefonansagen Generator",
+         theme=gr.themes.Soft(),
+         css="""
+         .main-title { text-align: center; margin-bottom: 1rem; }
+         .generate-btn { min-height: 50px; font-size: 1.1rem; }
+         """,
+     ) as demo:
+         gr.Markdown(
+             """
+             # 📞 Telefonansagen Generator

+             Erstellen Sie professionelle Telefonansagen mit KI-gestützter Sprachsynthese.
+             Unterstützt 23 Sprachen mit optionaler Stimmklonung.

+             ---
+             """,
+             elem_classes=["main-title"],
+         )

+         with gr.Row():
+             # Left column - Input
+             with gr.Column(scale=1):
+                 language = gr.Dropdown(
+                     choices=get_language_choices(),
+                     value="de",
+                     label="🌍 Sprache / Language",
+                     info="Wählen Sie die Sprache der Ansage",
+                 )

+                 text = gr.Textbox(
+                     value=EXAMPLE_TEXTS["de"],
+                     label="📝 Text der Ansage",
+                     placeholder="Geben Sie hier den Text Ihrer Telefonansage ein...",
+                     lines=5,
+                     max_lines=10,
+                     info="Maximal 500 Zeichen",
+                 )

+                 with gr.Accordion("🎤 Stimmeinstellungen (Optional)", open=False):
+                     voice_audio = gr.Audio(
+                         sources=["upload", "microphone"],
+                         type="filepath",
+                         label="Referenz-Audio für Stimmklonung",
+                         value=get_default_voice("de"),
+                     )
+                     gr.Markdown(
+                         """
+                         💡 **Tipp:** Laden Sie eine Audioaufnahme hoch, um die Stimme zu klonen.
+                         Die Standardstimme wird verwendet, wenn keine Aufnahme bereitgestellt wird.
+                         """
+                     )

+                 with gr.Accordion("⚙️ Erweiterte Einstellungen", open=False):
+                     seed = gr.Number(
+                         value=0,
+                         label="Zufallswert (Seed)",
+                         info="0 = zufällig, andere Werte für reproduzierbare Ergebnisse",
+                         precision=0,
+                     )

+                 generate_btn = gr.Button(
+                     "🎙️ Ansage generieren",
+                     variant="primary",
+                     elem_classes=["generate-btn"],
+                 )
+
+             # Right column - Output
+             with gr.Column(scale=1):
+                 audio_output = gr.Audio(
+                     label="📢 Generierte Ansage", type="numpy", interactive=False
+                 )
+
+                 gr.Markdown(
+                     """
+                     ### ℹ️ Hinweise
+
+                     - Die Generierung kann einige Sekunden dauern
+                     - Für beste Ergebnisse verwenden Sie klare, kurze Sätze
+                     - Referenz-Audio sollte 5-15 Sekunden lang sein
+
+                     ---
+
+                     **Unterstützte Sprachen:** Deutsch, Englisch, Französisch, Spanisch,
+                     Italienisch, Niederländisch, Polnisch, Portugiesisch, Russisch,
+                     Türkisch, Arabisch, Chinesisch, Japanisch, Koreanisch, Hindi,
+                     Dänisch, Griechisch, Finnisch, Hebräisch, Malaiisch, Norwegisch,
+                     Schwedisch, Swahili
+                     """
+                 )
+
+         # Event handlers
+         language.change(
              fn=on_language_change,
+             inputs=[language],
+             outputs=[text, voice_audio],
+             show_progress=False,
          )

+         generate_btn.click(
+             fn=generate_announcement,
+             inputs=[text, language, voice_audio, seed],
+             outputs=[audio_output],
+         )
+
+     return demo
+

+ # --- Main ---
+ if __name__ == "__main__":
+     demo = create_interface()
+     demo.launch(server_name="0.0.0.0", server_port=7860, share=False)
engine/__init__.py ADDED
@@ -0,0 +1,23 @@
+ # Telefonansagen TTS Engine
+ # A modular text-to-speech engine for generating phone announcements
+
+ from .audio_processor import AudioProcessingConfig, AudioProcessor
+ from .backends.base import BackendConfig, TTSBackend, TTSResult
+ from .backends.chatterbox_backend import ChatterboxBackend
+ from .cache import AudioCache, CacheConfig
+ from .tts_engine import EngineConfig, TTSEngine
+
+ __all__ = [
+     "TTSEngine",
+     "EngineConfig",
+     "TTSBackend",
+     "TTSResult",
+     "BackendConfig",
+     "ChatterboxBackend",
+     "AudioProcessor",
+     "AudioProcessingConfig",
+     "AudioCache",
+     "CacheConfig",
+ ]
+
+ __version__ = "1.0.0"
engine/audio_processor.py ADDED
@@ -0,0 +1,201 @@
+"""
+Audio post-processing for phone announcements.
+Handles background music mixing, normalization, and export.
+"""
+
+import io
+import os
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Optional, Union
+
+import numpy as np
+from loguru import logger
+
+
+@dataclass
+class AudioProcessingConfig:
+    """Configuration for audio post-processing."""
+
+    # Background music settings
+    background_music_path: Optional[str] = None
+    music_volume_db: float = -20.0  # Relative volume of background music
+
+    # Fade settings
+    fade_in_ms: int = 500
+    fade_out_ms: int = 500
+
+    # Padding (silence before/after speech)
+    padding_start_ms: int = 300
+    padding_end_ms: int = 300
+
+    # Output settings
+    normalize: bool = True
+    target_loudness_db: float = -16.0  # Target LUFS for normalization
+    output_sample_rate: int = 44100
+    output_format: str = "mp3"
+
+
+class AudioProcessor:
+    """
+    Post-processor for TTS audio.
+    Adds background music, applies fades, normalizes, and exports.
+    """
+
+    # Default background music directory
+    ASSETS_DIR = Path(__file__).parent.parent / "data" / "assets"
+
+    def __init__(self, config: Optional[AudioProcessingConfig] = None):
+        self.config = config or AudioProcessingConfig()
+
+    def process(
+        self,
+        audio: np.ndarray,
+        sample_rate: int,
+        output_path: Optional[str] = None,
+        **override_config,
+    ) -> Union[bytes, str]:
+        """
+        Process audio with background music, fades, and normalization.
+
+        Args:
+            audio: Input audio as numpy array
+            sample_rate: Sample rate of input audio
+            output_path: Optional path to save the output (returns bytes if None)
+            **override_config: Override any config settings for this call
+
+        Returns:
+            Path to output file if output_path is provided, otherwise MP3 bytes
+        """
+        from pydub import AudioSegment
+
+        # Merge config overrides
+        config = AudioProcessingConfig(**{**self.config.__dict__, **override_config})
+
+        # Convert numpy array to AudioSegment
+        speech = self._numpy_to_audiosegment(audio, sample_rate)
+
+        # Boost speech slightly for clarity
+        speech = speech + 3  # +3 dB
+
+        # Add padding
+        if config.padding_start_ms > 0 or config.padding_end_ms > 0:
+            silence_start = AudioSegment.silent(
+                duration=config.padding_start_ms, frame_rate=sample_rate
+            )
+            silence_end = AudioSegment.silent(
+                duration=config.padding_end_ms, frame_rate=sample_rate
+            )
+            speech = silence_start + speech + silence_end
+
+        # Mix with background music if specified
+        if config.background_music_path:
+            speech = self._add_background_music(
+                speech, config.background_music_path, config.music_volume_db
+            )
+
+        # Apply fades
+        if config.fade_in_ms > 0:
+            speech = speech.fade_in(config.fade_in_ms)
+        if config.fade_out_ms > 0:
+            speech = speech.fade_out(config.fade_out_ms)
+
+        # Normalize if requested
+        if config.normalize:
+            speech = self._normalize(speech, config.target_loudness_db)
+
+        # Resample if needed
+        if speech.frame_rate != config.output_sample_rate:
+            speech = speech.set_frame_rate(config.output_sample_rate)
+
+        # Export
+        if output_path:
+            speech.export(output_path, format=config.output_format)
+            return output_path
+        else:
+            buffer = io.BytesIO()
+            speech.export(buffer, format=config.output_format)
+            return buffer.getvalue()
+
+    def _numpy_to_audiosegment(
+        self, audio: np.ndarray, sample_rate: int
+    ) -> "AudioSegment":
+        """Convert numpy array to pydub AudioSegment."""
+        from pydub import AudioSegment
+
+        # Ensure float32
+        if audio.dtype != np.float32:
+            audio = audio.astype(np.float32)
+
+        # Clip and convert to int16
+        audio = np.clip(audio, -1.0, 1.0)
+        audio_int16 = (audio * 32767).astype(np.int16)
+
+        # Create AudioSegment
+        return AudioSegment(
+            data=audio_int16.tobytes(),
+            sample_width=2,  # 16-bit
+            frame_rate=sample_rate,
+            channels=1,  # Mono
+        )
+
+    def _add_background_music(
+        self, speech: "AudioSegment", music_path: str, volume_db: float
+    ) -> "AudioSegment":
+        """Mix background music with speech."""
+        from pydub import AudioSegment
+
+        # Resolve path
+        if not os.path.isabs(music_path):
+            # Check in assets directory
+            assets_path = self.ASSETS_DIR / f"{music_path}.mp3"
+            if assets_path.exists():
+                music_path = str(assets_path)
+            else:
+                assets_path = self.ASSETS_DIR / music_path
+                if assets_path.exists():
+                    music_path = str(assets_path)
+
+        if not os.path.exists(music_path):
+            logger.warning(f"Background music not found: {music_path}")
+            return speech
+
+        try:
+            music = AudioSegment.from_file(music_path)
+
+            # Adjust volume
+            music = music + volume_db
+
+            # Match sample rate
+            if music.frame_rate != speech.frame_rate:
+                music = music.set_frame_rate(speech.frame_rate)
+
+            # Loop music to match speech length
+            if len(music) < len(speech):
+                loops_needed = (len(speech) // len(music)) + 1
+                music = music * loops_needed
+
+            # Trim to exact length
+            music = music[: len(speech)]
+
+            # Overlay
+            return speech.overlay(music)
+
+        except Exception as e:
+            logger.error(f"Failed to add background music: {e}")
+            return speech
+
+    def _normalize(self, audio: "AudioSegment", target_db: float) -> "AudioSegment":
+        """Normalize audio to target loudness."""
+        change_in_db = target_db - audio.dBFS
+        return audio.apply_gain(change_in_db)
+
+    def list_available_music(self) -> list[str]:
+        """List available background music files in the assets directory."""
+        if not self.ASSETS_DIR.exists():
+            return []
+
+        music_files = []
+        for ext in ["mp3", "wav", "flac", "ogg"]:
+            music_files.extend([f.stem for f in self.ASSETS_DIR.glob(f"*.{ext}")])
+        return sorted(set(music_files))
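
A note on the per-call override mechanism in `process()`: it builds a fresh `AudioProcessingConfig` by merging `**override_config` over the instance config's `__dict__`, so keyword arguments win for that call only. A minimal stdlib sketch of the same merge, using a simplified stand-in dataclass (`Config` and its fields are illustrative, not part of the engine):

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Simplified stand-in for AudioProcessingConfig."""
    fade_in_ms: int = 500
    normalize: bool = True


base = Config(fade_in_ms=250)

# Per-call overrides win over the instance config, as in AudioProcessor.process()
override = {"normalize": False}
merged = Config(**{**base.__dict__, **override})

print(merged.fade_in_ms, merged.normalize)  # 250 False
print(base.normalize)  # True — the instance config is untouched
```

Because the merge constructs a new dataclass instance, unknown keys raise a `TypeError` immediately, which catches typos in override names.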
engine/backends/__init__.py ADDED
@@ -0,0 +1,13 @@
+# TTS Backend implementations
+from .base import BackendConfig, TTSBackend, TTSResult
+from .chatterbox_backend import ChatterboxBackend
+
+__all__ = ["TTSBackend", "TTSResult", "BackendConfig", "ChatterboxBackend"]
+
+# Optional backends
+try:
+    from .gemini_backend import GeminiBackend
+
+    __all__.append("GeminiBackend")
+except ImportError:
+    pass  # google-genai not installed
engine/backends/base.py ADDED
@@ -0,0 +1,129 @@
+"""
+Abstract base class for TTS backends.
+All TTS backends must implement this interface to be compatible with the engine.
+"""
+
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from typing import Optional
+
+import numpy as np
+
+
+@dataclass
+class TTSResult:
+    """Result from TTS generation."""
+
+    audio: np.ndarray  # Audio waveform as numpy array
+    sample_rate: int  # Sample rate in Hz
+
+    def to_int16(self) -> np.ndarray:
+        """Convert audio to 16-bit integer format."""
+        audio = self.audio
+        if audio.dtype == np.float32 or audio.dtype == np.float64:
+            audio = np.clip(audio, -1.0, 1.0)
+            audio = (audio * 32767).astype(np.int16)
+        return audio
+
+
+@dataclass
+class BackendConfig:
+    """Configuration for TTS backends."""
+
+    device: str = "auto"  # "auto", "cuda", "mps", "cpu"
+
+    def resolve_device(self) -> str:
+        """Resolve 'auto' to the best available device."""
+        if self.device != "auto":
+            return self.device
+
+        import torch
+
+        if torch.cuda.is_available():
+            return "cuda"
+        elif torch.backends.mps.is_available():
+            return "mps"
+        return "cpu"
+
+
+class TTSBackend(ABC):
+    """
+    Abstract base class for TTS backends.
+
+    To create a new backend:
+    1. Inherit from this class
+    2. Implement all abstract methods
+    3. Register the backend in the engine
+    """
+
+    def __init__(self, config: Optional[BackendConfig] = None):
+        self.config = config or BackendConfig()
+        self._is_loaded = False
+
+    @property
+    @abstractmethod
+    def name(self) -> str:
+        """Human-readable name of the backend."""
+        pass
+
+    @property
+    @abstractmethod
+    def supports_voice_cloning(self) -> bool:
+        """Whether this backend supports voice cloning from audio."""
+        pass
+
+    @property
+    @abstractmethod
+    def supported_languages(self) -> dict[str, str]:
+        """
+        Dictionary of supported language codes to language names.
+        Example: {"en": "English", "de": "German"}
+        """
+        pass
+
+    @property
+    def is_loaded(self) -> bool:
+        """Whether the backend model is loaded and ready."""
+        return self._is_loaded
+
+    @abstractmethod
+    def load(self) -> None:
+        """
+        Load the model and prepare for inference.
+        Should set self._is_loaded = True when complete.
+        """
+        pass
+
+    @abstractmethod
+    def unload(self) -> None:
+        """
+        Unload the model to free memory.
+        Should set self._is_loaded = False when complete.
+        """
+        pass
+
+    @abstractmethod
+    def generate(
+        self,
+        text: str,
+        language: str = "de",
+        voice_audio_path: Optional[str] = None,
+        **kwargs,
+    ) -> TTSResult:
+        """
+        Generate speech from text.
+
+        Args:
+            text: The text to synthesize
+            language: Language code (e.g., "de", "en")
+            voice_audio_path: Optional path to reference audio for voice cloning
+            **kwargs: Backend-specific parameters
+
+        Returns:
+            TTSResult containing audio waveform and sample rate
+        """
+        pass
+
+    def __repr__(self) -> str:
+        status = "loaded" if self._is_loaded else "not loaded"
+        return f"{self.__class__.__name__}(name='{self.name}', status={status})"
engine/backends/chatterbox_backend.py ADDED
@@ -0,0 +1,220 @@
+"""
+Chatterbox Multilingual TTS Backend with Voice Cloning support.
+This is the default backend for the Telefonansagen engine.
+"""
+
+from typing import Optional
+
+import numpy as np
+from loguru import logger
+
+from .base import BackendConfig, TTSBackend, TTSResult
+
+# Default voice prompts per language (high-quality reference samples)
+DEFAULT_VOICE_PROMPTS = {
+    "ar": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ar_f/ar_prompts2.flac",
+    "da": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/da_m1.flac",
+    "de": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/de_f1.flac",
+    "el": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/el_m.flac",
+    "en": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/en_f1.flac",
+    "es": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/es_f1.flac",
+    "fi": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/fi_m.flac",
+    "fr": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/fr_f1.flac",
+    "he": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/he_m1.flac",
+    "hi": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/hi_f1.flac",
+    "it": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/it_m1.flac",
+    "ja": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ja/ja_prompts1.flac",
+    "ko": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ko_f.flac",
+    "ms": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ms_f.flac",
+    "nl": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/nl_m.flac",
+    "no": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/no_f1.flac",
+    "pl": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/pl_m.flac",
+    "pt": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/pt_m1.flac",
+    "ru": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/ru_m.flac",
+    "sv": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/sv_f.flac",
+    "sw": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/sw_m.flac",
+    "tr": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/tr_m.flac",
+    "zh": "https://storage.googleapis.com/chatterbox-demo-samples/mtl_prompts/zh_f2.flac",
+}
+
+
+class ChatterboxBackend(TTSBackend):
+    """
+    Chatterbox Multilingual TTS Backend.
+
+    Features:
+    - 23 language support
+    - High-quality voice cloning
+    - Expressive speech synthesis
+
+    This backend uses the ResembleAI Chatterbox model for synthesis.
+    """
+
+    # Optimal defaults for phone announcements (clear, professional)
+    DEFAULT_EXAGGERATION = 0.35  # Slightly less expressive for professional announcements
+    DEFAULT_TEMPERATURE = 0.7  # Balanced randomness
+    DEFAULT_CFG_WEIGHT = 0.5  # Standard guidance
+
+    SUPPORTED_LANGUAGES = {
+        "ar": "Arabic",
+        "da": "Danish",
+        "de": "German",
+        "el": "Greek",
+        "en": "English",
+        "es": "Spanish",
+        "fi": "Finnish",
+        "fr": "French",
+        "he": "Hebrew",
+        "hi": "Hindi",
+        "it": "Italian",
+        "ja": "Japanese",
+        "ko": "Korean",
+        "ms": "Malay",
+        "nl": "Dutch",
+        "no": "Norwegian",
+        "pl": "Polish",
+        "pt": "Portuguese",
+        "ru": "Russian",
+        "sv": "Swedish",
+        "sw": "Swahili",
+        "tr": "Turkish",
+        "zh": "Chinese",
+    }
+
+    def __init__(self, config: Optional[BackendConfig] = None):
+        super().__init__(config)
+        self._model = None
+        self._device = None
+
+    @property
+    def name(self) -> str:
+        return "Chatterbox Multilingual"
+
+    @property
+    def supports_voice_cloning(self) -> bool:
+        return True
+
+    @property
+    def supported_languages(self) -> dict[str, str]:
+        return self.SUPPORTED_LANGUAGES.copy()
+
+    def load(self) -> None:
+        """Load the Chatterbox model."""
+        if self._is_loaded:
+            logger.info("Chatterbox model already loaded")
+            return
+
+        logger.info("Loading Chatterbox Multilingual model...")
+
+        from src.chatterbox.mtl_tts import ChatterboxMultilingualTTS
+
+        self._device = self.config.resolve_device()
+        logger.info(f"Using device: {self._device}")
+
+        try:
+            self._model = ChatterboxMultilingualTTS.from_pretrained(self._device)
+            self._is_loaded = True
+            logger.info("Chatterbox model loaded successfully")
+        except Exception as e:
+            logger.error(f"Failed to load Chatterbox model: {e}")
+            raise
+
+    def unload(self) -> None:
+        """Unload the model to free memory."""
+        if self._model is not None:
+            import torch
+
+            del self._model
+            self._model = None
+            if self._device == "cuda":
+                torch.cuda.empty_cache()
+        self._is_loaded = False
+        logger.info("Chatterbox model unloaded")
+
+    def get_default_voice(self, language: str) -> Optional[str]:
+        """Get the default voice prompt URL for a language."""
+        return DEFAULT_VOICE_PROMPTS.get(language.lower())
+
+    def generate(
+        self,
+        text: str,
+        language: str = "de",
+        voice_audio_path: Optional[str] = None,
+        exaggeration: Optional[float] = None,
+        temperature: Optional[float] = None,
+        cfg_weight: Optional[float] = None,
+        seed: Optional[int] = None,
+        **kwargs,
+    ) -> TTSResult:
+        """
+        Generate speech from text using Chatterbox.
+
+        Args:
+            text: Text to synthesize
+            language: Language code (default: "de" for German)
+            voice_audio_path: Path/URL to reference audio for voice cloning
+            exaggeration: Speech expressiveness (0.25-2.0, default: 0.35)
+            temperature: Generation randomness (0.05-5.0, default: 0.7)
+            cfg_weight: CFG guidance weight (0.2-1.0, default: 0.5)
+            seed: Random seed for reproducibility (default: None = random)
+
+        Returns:
+            TTSResult with audio waveform and sample rate
+        """
+        if not self._is_loaded:
+            self.load()
+
+        import random
+
+        import torch
+
+        # Apply seed if provided
+        if seed is not None and seed != 0:
+            torch.manual_seed(seed)
+            random.seed(seed)
+            np.random.seed(seed)
+            if self._device == "cuda":
+                torch.cuda.manual_seed_all(seed)
+
+        # Use defaults for unspecified parameters
+        exaggeration = (
+            exaggeration if exaggeration is not None else self.DEFAULT_EXAGGERATION
+        )
+        temperature = (
+            temperature if temperature is not None else self.DEFAULT_TEMPERATURE
+        )
+        cfg_weight = cfg_weight if cfg_weight is not None else self.DEFAULT_CFG_WEIGHT
+
+        # Resolve voice prompt
+        audio_prompt = voice_audio_path or self.get_default_voice(language)
+
+        # Validate language
+        lang_code = language.lower()
+        if lang_code not in self.SUPPORTED_LANGUAGES:
+            available = ", ".join(sorted(self.SUPPORTED_LANGUAGES.keys()))
+            raise ValueError(
+                f"Unsupported language '{language}'. Available: {available}"
+            )
+
+        logger.info(f"Generating speech: lang={lang_code}, text='{text[:50]}...'")
+
+        try:
+            wav = self._model.generate(
+                text=text,
+                language_id=lang_code,
+                audio_prompt_path=audio_prompt,
+                exaggeration=exaggeration,
+                temperature=temperature,
+                cfg_weight=cfg_weight,
+            )
+
+            # Move to CPU before converting to numpy (the model may run on GPU)
+            audio_np = wav.squeeze().cpu().numpy()
+
+            return TTSResult(audio=audio_np, sample_rate=self._model.sr)
+
+        except Exception as e:
+            logger.error(f"TTS generation failed: {e}")
+            raise
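
The seed handling in `generate()` works by re-seeding every RNG involved (torch, `random`, numpy) before synthesis, so the same seed reproduces the same stochastic generation path. The mechanism can be demonstrated with just the stdlib `random` module (the helper name is illustrative):

```python
import random


def seeded_draws(seed: int, n: int = 3) -> list[float]:
    """Re-seeding the RNG before drawing makes the sequence reproducible,
    mirroring the seed handling in ChatterboxBackend.generate()."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


a = seeded_draws(1234)
b = seeded_draws(1234)  # same seed -> identical sequence
c = seeded_draws(9)     # different seed -> different sequence
print(a == b, a == c)   # True False
```

Note the backend treats `seed=0` the same as `None` (no seeding), so callers wanting a fixed seed should pass a nonzero value.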
engine/backends/gemini_backend.py ADDED
@@ -0,0 +1,236 @@
+"""
+Google Gemini TTS Backend.
+Uses Google's Gemini API for text-to-speech synthesis.
+"""
+
+import io
+import os
+from typing import Optional
+
+import numpy as np
+from loguru import logger
+
+from .base import BackendConfig, TTSBackend, TTSResult
+
+
+class GeminiBackend(TTSBackend):
+    """
+    Google Gemini TTS Backend.
+
+    Features:
+    - High-quality neural TTS
+    - Multiple preset voices
+    - No voice cloning (uses preset voices)
+
+    Requires GEMINI_API_KEY environment variable.
+    """
+
+    # Available Gemini voices
+    AVAILABLE_VOICES = [
+        "Puck",
+        "Charon",
+        "Kore",
+        "Fenrir",
+        "Aoede",
+        "Leda",
+        "Orus",
+        "Zephyr",
+    ]
+
+    # Gemini has limited language support compared to Chatterbox
+    SUPPORTED_LANGUAGES = {
+        "en": "English",
+        "de": "German",
+        "es": "Spanish",
+        "fr": "French",
+        "it": "Italian",
+        "pt": "Portuguese",
+        "ja": "Japanese",
+        "ko": "Korean",
+        "zh": "Chinese",
+    }
+
+    def __init__(self, config: Optional[BackendConfig] = None, voice: str = "Kore"):
+        super().__init__(config)
+        self._client = None
+        self.voice = voice if voice in self.AVAILABLE_VOICES else "Kore"
+
+    @property
+    def name(self) -> str:
+        return "Google Gemini TTS"
+
+    @property
+    def supports_voice_cloning(self) -> bool:
+        return False
+
+    @property
+    def supported_languages(self) -> dict[str, str]:
+        return self.SUPPORTED_LANGUAGES.copy()
+
+    def load(self) -> None:
+        """Initialize the Gemini client."""
+        if self._is_loaded:
+            return
+
+        api_key = os.environ.get("GEMINI_API_KEY")
+        if not api_key:
+            raise ValueError("GEMINI_API_KEY environment variable not set")
+
+        try:
+            import google.genai as genai
+
+            self._client = genai.Client(api_key=api_key)
+            self._is_loaded = True
+            logger.info("Gemini client initialized successfully")
+        except Exception as e:
+            logger.error(f"Failed to initialize Gemini client: {e}")
+            raise
+
+    def unload(self) -> None:
+        """Clean up Gemini client."""
+        self._client = None
+        self._is_loaded = False
+        logger.info("Gemini client unloaded")
+
+    def set_voice(self, voice: str) -> None:
+        """Set the voice to use for synthesis."""
+        if voice not in self.AVAILABLE_VOICES:
+            raise ValueError(
+                f"Unknown voice '{voice}'. Available: {self.AVAILABLE_VOICES}"
+            )
+        self.voice = voice
+
+    def generate(
+        self,
+        text: str,
+        language: str = "de",
+        voice_audio_path: Optional[str] = None,
+        voice: Optional[str] = None,
+        **kwargs,
+    ) -> TTSResult:
+        """
+        Generate speech from text using Gemini.
+
+        Args:
+            text: Text to synthesize
+            language: Language code (for text processing, voice determines actual synthesis)
+            voice_audio_path: Ignored (Gemini doesn't support voice cloning)
+            voice: Voice name to use (default: instance voice setting)
+
+        Returns:
+            TTSResult with audio waveform and sample rate
+        """
+        if not self._is_loaded:
+            self.load()
+
+        if voice_audio_path:
+            logger.warning(
+                "Gemini backend doesn't support voice cloning, ignoring voice_audio_path"
+            )
+
+        from google.genai import types as genai_types
+
+        selected_voice = voice or self.voice
+
+        logger.info(
+            f"Generating speech with Gemini: voice={selected_voice}, text='{text[:50]}...'"
+        )
+
+        contents = [
+            genai_types.Content(
+                role="user", parts=[genai_types.Part.from_text(text=text)]
+            )
+        ]
+
+        config = genai_types.GenerateContentConfig(
+            temperature=1,
+            response_modalities=["audio"],
+            speech_config=genai_types.SpeechConfig(
+                voice_config=genai_types.VoiceConfig(
+                    prebuilt_voice_config=genai_types.PrebuiltVoiceConfig(
+                        voice_name=selected_voice
+                    )
+                )
+            ),
+        )
+
+        try:
+            audio_chunks = []
+            mime_type = None
+
+            for chunk in self._client.models.generate_content_stream(
+                model="gemini-2.5-pro-preview-tts",
+                contents=contents,
+                config=config,
+            ):
+                if chunk.candidates:
+                    inline_data = chunk.candidates[0].content.parts[0].inline_data
+                    audio_chunks.append(inline_data.data)
+                    if mime_type is None:
+                        mime_type = inline_data.mime_type
+
+            if not audio_chunks:
+                raise RuntimeError("No audio data received from Gemini API")
+
+            raw_audio = b"".join(audio_chunks)
+
+            # Convert to numpy array
+            audio_np, sample_rate = self._process_audio(raw_audio, mime_type)
+
+            return TTSResult(audio=audio_np, sample_rate=sample_rate)
+
+        except Exception as e:
+            logger.error(f"Gemini TTS generation failed: {e}")
+            raise
+
+    def _process_audio(
+        self, raw_audio: bytes, mime_type: str
+    ) -> tuple[np.ndarray, int]:
+        """Process raw audio data from Gemini into numpy array."""
+        from pydub import AudioSegment
+
+        # Parse MIME type for audio parameters
+        sample_rate = 24000  # Default
+        bits_per_sample = 16
+
+        if mime_type and "audio/L" in mime_type:
+            # Parse format like audio/L16;rate=24000
+            parts = mime_type.split(";")
+            for part in parts:
+                part = part.strip()
+                if part.startswith("audio/L"):
+                    try:
+                        bits_per_sample = int(part.split("L")[1])
+                    except (ValueError, IndexError):
+                        pass
+                elif part.lower().startswith("rate="):
+                    try:
+                        sample_rate = int(part.split("=")[1])
+                    except (ValueError, IndexError):
+                        pass
+
+            # Create AudioSegment from raw PCM
+            audio_segment = AudioSegment(
+                data=raw_audio,
+                sample_width=bits_per_sample // 8,
+                frame_rate=sample_rate,
+                channels=1,
+            )
+        elif mime_type == "audio/mpeg":
+            audio_segment = AudioSegment.from_file(io.BytesIO(raw_audio), format="mp3")
+            sample_rate = audio_segment.frame_rate
+        else:
+            # Try auto-detection
+            audio_segment = AudioSegment.from_file(io.BytesIO(raw_audio))
+            sample_rate = audio_segment.frame_rate
+
+        # Convert to numpy array
+        samples = np.array(audio_segment.get_array_of_samples())
+
+        # Normalize to float32 [-1, 1]
+        if audio_segment.sample_width == 2:  # 16-bit
+            samples = samples.astype(np.float32) / 32768.0
+        elif audio_segment.sample_width == 1:  # 8-bit
+            samples = (samples.astype(np.float32) - 128) / 128.0
+
+        return samples, sample_rate
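
The raw-PCM branch of `_process_audio()` hinges on parsing MIME types like `audio/L16;rate=24000` into a bit depth and sample rate. That parsing logic is pure string handling and can be isolated into a stdlib-only sketch (the function name is illustrative):

```python
def parse_pcm_mime(mime_type: str) -> tuple[int, int]:
    """Extract (bits_per_sample, sample_rate) from a MIME type such as
    'audio/L16;rate=24000', with the same defaults used by
    GeminiBackend._process_audio()."""
    bits_per_sample, sample_rate = 16, 24000  # defaults
    for part in mime_type.split(";"):
        part = part.strip()
        if part.startswith("audio/L"):
            try:
                bits_per_sample = int(part.split("L")[1])
            except (ValueError, IndexError):
                pass  # malformed subtype: keep the default
        elif part.lower().startswith("rate="):
            try:
                sample_rate = int(part.split("=")[1])
            except (ValueError, IndexError):
                pass  # malformed rate parameter: keep the default
    return bits_per_sample, sample_rate


print(parse_pcm_mime("audio/L16;rate=24000"))  # (16, 24000)
print(parse_pcm_mime("audio/L24;rate=48000"))  # (24, 48000)
```

Keeping the defaults on parse failure means a malformed header degrades to the common 16-bit/24 kHz case instead of raising mid-stream.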
engine/cache.py ADDED
@@ -0,0 +1,171 @@
+"""
+Caching system for generated audio.
+Supports local and Hugging Face Hub storage.
+"""
+
+import hashlib
+import os
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Optional
+
+from loguru import logger
+
+
+@dataclass
+class CacheConfig:
+    """Configuration for audio caching."""
+
+    enabled: bool = True
+    local_cache_dir: Optional[str] = None  # Local cache directory
+    hf_repo_id: Optional[str] = None  # Hugging Face Hub repo for remote cache
+    max_duration_seconds: float = 30.0  # Only cache audio shorter than this
+
+
+class AudioCache:
+    """
+    Cache for generated TTS audio.
+    Supports both local filesystem and Hugging Face Hub storage.
+    """
+
+    def __init__(self, config: Optional[CacheConfig] = None):
+        self.config = config or CacheConfig()
+        self._hf_fs = None
+
+    def _get_cache_key(self, text: str, voice_id: str, backend: str) -> str:
+        """Generate a unique cache key for the given parameters."""
+        content = f"{backend}:{voice_id}:{text}"
+        return hashlib.md5(content.encode()).hexdigest()
+
+    def _get_hf_fs(self):
+        """Get HuggingFace filesystem (lazy initialization)."""
+        if self._hf_fs is None and self.config.hf_repo_id:
+            try:
+                from huggingface_hub import HfFileSystem
+
+                self._hf_fs = HfFileSystem(token=os.environ.get("HF_TOKEN"))
+            except Exception as e:
+                logger.warning(f"Could not initialize HF filesystem: {e}")
+        return self._hf_fs
+
+    def get(self, text: str, voice_id: str, backend: str) -> Optional[bytes]:
+        """
+        Retrieve cached audio if it exists.
+
+        Args:
+            text: Original text that was synthesized
+            voice_id: Voice identifier used
+            backend: Backend name used for synthesis
+
+        Returns:
+            Cached audio bytes or None if not found
+        """
+        if not self.config.enabled:
+            return None
+
+        cache_key = self._get_cache_key(text, voice_id, backend)
+
+        # Try local cache first
+        if self.config.local_cache_dir:
+            local_path = Path(self.config.local_cache_dir) / f"{cache_key}.mp3"
+            if local_path.exists():
+                logger.debug(f"Cache hit (local): {cache_key}")
+                return local_path.read_bytes()
+
+        # Try HF Hub cache
+        if self.config.hf_repo_id:
+            fs = self._get_hf_fs()
+            if fs:
+                hf_path = f"{self.config.hf_repo_id}/{voice_id}/{cache_key}.mp3"
+                try:
+                    if fs.exists(hf_path):
+                        with fs.open(hf_path, "rb") as f:
+                            logger.debug(f"Cache hit (HF Hub): {cache_key}")
+                            return f.read()
+                except Exception as e:
+                    logger.debug(f"HF cache lookup failed: {e}")
+
+        return None
+
+    def set(
+        self,
+        text: str,
+        voice_id: str,
+        backend: str,
+        audio_data: bytes,
+        duration_seconds: Optional[float] = None,
+    ) -> bool:
+        """
+        Store audio in cache.
+
+        Args:
+            text: Original text that was synthesized
+            voice_id: Voice identifier used
+            backend: Backend name used for synthesis
+            audio_data: Audio bytes to cache
+            duration_seconds: Duration of the audio (for max duration check)
+
+        Returns:
+            True if cached successfully, False otherwise
+        """
+        if not self.config.enabled:
+            return False
+
+        # Check duration limit
+        if duration_seconds and duration_seconds > self.config.max_duration_seconds:
+            logger.debug(
+                f"Audio too long to cache: {duration_seconds}s > {self.config.max_duration_seconds}s"
+            )
+            return False
+
+        cache_key = self._get_cache_key(text, voice_id, backend)
+        success = False
+
+        # Save to local cache
+        if self.config.local_cache_dir:
+            try:
+                cache_dir = Path(self.config.local_cache_dir)
+                cache_dir.mkdir(parents=True, exist_ok=True)
+                local_path = cache_dir / f"{cache_key}.mp3"
+                local_path.write_bytes(audio_data)
+                logger.debug(f"Cached locally: {cache_key}")
+                success = True
+            except Exception as e:
+                logger.warning(f"Failed to cache locally: {e}")
+
+        # Save to HF Hub
+        if self.config.hf_repo_id:
+            fs = self._get_hf_fs()
+            if fs:
+                try:
+                    voice_dir = f"{self.config.hf_repo_id}/{voice_id}"
+                    if not fs.exists(voice_dir):
+                        fs.makedirs(voice_dir, exist_ok=True)
+
+                    hf_path = f"{voice_dir}/{cache_key}.mp3"
+                    with fs.open(hf_path, "wb") as f:
+                        f.write(audio_data)
+
+                    logger.debug(f"Cached to HF Hub: {cache_key}")
+                    success = True
+                except Exception as e:
+                    logger.warning(f"Failed to cache to HF Hub: {e}")
+
+        return success
+
+    def clear_local(self) -> int:
+        """Clear local cache. Returns number of files deleted."""
+        if not self.config.local_cache_dir:
+            return 0
+
+        cache_dir = Path(self.config.local_cache_dir)
+        if not cache_dir.exists():
+            return 0
+
+        count = 0
+        for file in cache_dir.glob("*.mp3"):
+            file.unlink()
+            count += 1
+
+        logger.info(f"Cleared {count} files from local cache")
+        return count
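
The cache identity in `_get_cache_key()` is an MD5 hex digest over `backend:voice_id:text`, so only an identical (backend, voice, text) triple hits the same file. A stdlib sketch of the same scheme (the free function is illustrative; the engine uses a method):

```python
import hashlib


def cache_key(text: str, voice_id: str, backend: str) -> str:
    """Same key scheme as AudioCache._get_cache_key(): an MD5 hex digest
    over backend, voice, and text, so identical requests map to the same
    cached file."""
    return hashlib.md5(f"{backend}:{voice_id}:{text}".encode()).hexdigest()


k1 = cache_key("Willkommen!", "de_f1", "chatterbox")
k2 = cache_key("Willkommen!", "de_f1", "chatterbox")
k3 = cache_key("Willkommen!", "de_f1", "gemini")
print(k1 == k2, k1 == k3, len(k1))  # True False 32
```

MD5 is fine here because the digest is a cache address, not a security boundary; the 32-character hex form also doubles as a filesystem-safe filename.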
engine/data/assets/.gitkeep ADDED
@@ -0,0 +1,2 @@
+# Place background music files (.mp3) here
+# They will be automatically detected by the audio processor
engine/tts_engine.py ADDED
@@ -0,0 +1,270 @@
+ """
+ Main TTS Engine for Telefonansagen (Phone Announcements).
+
+ This engine provides a unified interface for generating phone announcements
+ using different TTS backends. It handles:
+ - Backend management (loading, switching, unloading)
+ - Audio generation with sensible defaults
+ - Post-processing (background music, fades, normalization)
+ - Caching for efficiency
+ """
+
+ import os
+ from dataclasses import dataclass
+ from pathlib import Path
+ from typing import Optional, Type, Union
+
+ import numpy as np
+ from loguru import logger
+
+ from .audio_processor import AudioProcessingConfig, AudioProcessor
+ from .backends.base import BackendConfig, TTSBackend, TTSResult
+ from .backends.chatterbox_backend import ChatterboxBackend
+ from .cache import AudioCache, CacheConfig
+
+
+ @dataclass
+ class EngineConfig:
+     """Configuration for the TTS Engine."""
+
+     # Backend settings
+     default_backend: str = "chatterbox"
+     device: str = "auto"  # "auto", "cuda", "mps", "cpu"
+
+     # Default generation settings
+     default_language: str = "de"  # German for phone announcements
+
+     # Audio processing defaults
+     add_background_music: bool = False
+     default_music: Optional[str] = None
+     music_volume_db: float = -20.0
+     fade_in_ms: int = 500
+     fade_out_ms: int = 500
+
+     # Caching
+     enable_cache: bool = True
+     local_cache_dir: Optional[str] = None
+     hf_cache_repo: Optional[str] = None
+
+
+ class TTSEngine:
+     """
+     Main TTS Engine for generating phone announcements.
+
+     Usage:
+         # Simple usage with defaults
+         engine = TTSEngine()
+         audio = engine.generate("Willkommen bei unserem Service.")
+
+         # With voice cloning
+         audio = engine.generate(
+             "Willkommen bei unserem Service.",
+             voice_audio="path/to/reference.wav",
+         )
+
+         # Switch backend
+         engine.set_backend("gemini")
+         audio = engine.generate("Welcome to our service.", language="en")
+     """
+
+     # Registry of available backends
+     _backend_registry: dict[str, Type[TTSBackend]] = {
+         "chatterbox": ChatterboxBackend,
+     }
+
+     def __init__(self, config: Optional[EngineConfig] = None):
+         self.config = config or EngineConfig()
+
+         # Initialize components
+         self._backends: dict[str, TTSBackend] = {}
+         self._current_backend_name: str = self.config.default_backend
+
+         # Audio processor
+         self._processor = AudioProcessor(
+             AudioProcessingConfig(
+                 music_volume_db=self.config.music_volume_db,
+                 fade_in_ms=self.config.fade_in_ms,
+                 fade_out_ms=self.config.fade_out_ms,
+             )
+         )
+
+         # Cache
+         self._cache = AudioCache(
+             CacheConfig(
+                 enabled=self.config.enable_cache,
+                 local_cache_dir=self.config.local_cache_dir,
+                 hf_repo_id=self.config.hf_cache_repo,
+             )
+         )
+
+     @classmethod
+     def register_backend(cls, name: str, backend_class: Type[TTSBackend]) -> None:
+         """Register a new backend class."""
+         cls._backend_registry[name] = backend_class
+         logger.info(f"Registered backend: {name}")
+
+     @classmethod
+     def available_backends(cls) -> list[str]:
+         """List available backend names."""
+         return list(cls._backend_registry.keys())
+
+     def _get_backend(self, name: Optional[str] = None) -> TTSBackend:
+         """Get or create a backend instance."""
+         name = name or self._current_backend_name
+
+         if name not in self._backend_registry:
+             available = ", ".join(self._backend_registry.keys())
+             raise ValueError(f"Unknown backend '{name}'. Available: {available}")
+
+         if name not in self._backends:
+             backend_config = BackendConfig(device=self.config.device)
+             self._backends[name] = self._backend_registry[name](backend_config)
+
+         return self._backends[name]
+
+     @property
+     def current_backend(self) -> TTSBackend:
+         """Get the currently active backend."""
+         return self._get_backend()
+
+     def set_backend(self, name: str) -> None:
+         """Switch to a different backend."""
+         if name not in self._backend_registry:
+             available = ", ".join(self._backend_registry.keys())
+             raise ValueError(f"Unknown backend '{name}'. Available: {available}")
+
+         self._current_backend_name = name
+         logger.info(f"Switched to backend: {name}")
+
+     def load_backend(self, name: Optional[str] = None) -> None:
+         """Pre-load a backend's model."""
+         backend = self._get_backend(name)
+         if not backend.is_loaded:
+             backend.load()
+
+     def unload_backend(self, name: Optional[str] = None) -> None:
+         """Unload a backend's model to free memory."""
+         backend = self._get_backend(name)
+         if backend.is_loaded:
+             backend.unload()
+
+     def get_supported_languages(self, backend: Optional[str] = None) -> dict[str, str]:
+         """Get supported languages for a backend."""
+         return self._get_backend(backend).supported_languages
+
+     def generate(
+         self,
+         text: str,
+         language: Optional[str] = None,
+         voice_audio: Optional[str] = None,
+         background_music: Optional[str] = None,
+         output_path: Optional[str] = None,
+         use_cache: bool = True,
+         **kwargs,
+     ) -> Union[bytes, str, tuple[int, np.ndarray]]:
+         """
+         Generate a phone announcement.
+
+         Args:
+             text: Text to synthesize
+             language: Language code (default: "de")
+             voice_audio: Path/URL to reference audio for voice cloning
+             background_music: Name/path of background music file
+             output_path: Optional path to save the output file
+             use_cache: Whether to use caching (default: True)
+             **kwargs: Additional backend-specific parameters
+
+         Returns:
+             - If output_path is set: path to the saved file
+             - If neither output_path nor background_music is set:
+               tuple(sample_rate, audio_array) for Gradio
+             - Otherwise: MP3 bytes
+         """
+         language = language or self.config.default_language
+         backend = self.current_backend
+
+         # Derive a stable voice ID for caching
+         if not voice_audio:
+             voice_id = "default"
+         elif os.path.exists(voice_audio):
+             voice_id = Path(voice_audio).stem
+         else:
+             voice_id = "custom"
+
+         # Check cache
+         if use_cache and self._cache.config.enabled:
+             cached = self._cache.get(text, voice_id, backend.name)
+             if cached:
+                 logger.info("Using cached audio")
+                 if output_path:
+                     Path(output_path).write_bytes(cached)
+                     return output_path
+                 return cached
+
+         # Generate audio
+         logger.info(f"Generating TTS: backend={backend.name}, lang={language}")
+         result = backend.generate(
+             text=text, language=language, voice_audio_path=voice_audio, **kwargs
+         )
+
+         # Determine whether post-processing is needed
+         use_music = background_music or (
+             self.config.add_background_music and self.config.default_music
+         )
+         music_path = background_music or self.config.default_music
+
+         if use_music or output_path:
+             # Process audio with pydub
+             processed = self._processor.process(
+                 audio=result.audio,
+                 sample_rate=result.sample_rate,
+                 output_path=output_path,
+                 background_music_path=music_path if use_music else None,
+             )
+
+             # Cache if appropriate
+             if use_cache and isinstance(processed, bytes):
+                 duration = len(result.audio) / result.sample_rate
+                 self._cache.set(text, voice_id, backend.name, processed, duration)
+
+             return processed
+         else:
+             # Return raw audio for Gradio (sample_rate, audio_array)
+             return (result.sample_rate, result.audio)
+
+     def generate_raw(
+         self,
+         text: str,
+         language: Optional[str] = None,
+         voice_audio: Optional[str] = None,
+         **kwargs,
+     ) -> TTSResult:
+         """
+         Generate raw audio without post-processing.
+
+         Returns:
+             TTSResult with audio array and sample rate
+         """
+         language = language or self.config.default_language
+         return self.current_backend.generate(
+             text=text, language=language, voice_audio_path=voice_audio, **kwargs
+         )
+
+     def list_background_music(self) -> list[str]:
+         """List available background music files."""
+         return self._processor.list_available_music()
+
+     def clear_cache(self) -> int:
+         """Clear the local audio cache. Returns the number of files deleted."""
+         return self._cache.clear_local()
+
+
+ # Register additional backends if available
+ try:
+     from .backends.gemini_backend import GeminiBackend
+
+     TTSEngine.register_backend("gemini", GeminiBackend)
+ except ImportError:
+     pass  # Gemini backend not available
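`TTSEngine` resolves backends through a class-level registry and instantiates them lazily on first use, so unused (and potentially heavy) models never load. The mechanism reduced to its core, with stand-in classes rather than the real `ChatterboxBackend`:

```python
from typing import Type


class Backend:
    """Minimal stand-in for the TTSBackend interface."""

    name = "base"


class DummyBackend(Backend):
    name = "dummy"


class MiniEngine:
    # Class-level registry shared by all engine instances
    _registry: dict[str, Type[Backend]] = {"dummy": DummyBackend}

    def __init__(self) -> None:
        # Per-instance cache of constructed backends
        self._instances: dict[str, Backend] = {}

    @classmethod
    def register_backend(cls, name: str, backend_class: Type[Backend]) -> None:
        cls._registry[name] = backend_class

    def _get_backend(self, name: str) -> Backend:
        if name not in self._registry:
            raise ValueError(f"Unknown backend '{name}'")
        # Lazy instantiation: construct only on first request, then reuse
        if name not in self._instances:
            self._instances[name] = self._registry[name]()
        return self._instances[name]
```

Registering after the class is defined mirrors the guarded `TTSEngine.register_backend("gemini", GeminiBackend)` call at the bottom of `tts_engine.py`.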
requirements.txt CHANGED
@@ -1,7 +1,17 @@
- gradio
+ # Requirements for Telefonansagen TTS Engine
+
+ # Core dependencies
+ gradio>=4.0.0
  numpy==1.26.0
- resampy==0.4.3
+ torch>=2.0.0
+
+ # Audio processing
  librosa==0.10.0
+ resampy==0.4.3
+ pydub>=0.25.0
+ soundfile>=0.12.0
+
+ # TTS Model dependencies
  s3tokenizer
  transformers==4.46.3
  diffusers==0.29.0
@@ -10,10 +20,20 @@ resemble-perth==1.0.1
  silero-vad==5.1.2
  conformer==0.3.2
  safetensors
+ huggingface_hub>=0.20.0
+
+ # Logging
+ loguru>=0.7.0
+
+ # Optional: Gemini backend
+ # google-genai>=0.3.0
+
+ # Optional: Caching to HuggingFace Hub
+ # pandas>=2.0.0

  # Optional language-specific dependencies
  # Uncomment the ones you need for specific languages:
- spacy_pkuseg # For Chinese text segmentation
- pykakasi>=2.2.0 # For Japanese text processing (Kanji to Hiragana)
- russian-text-stresser @ git+https://github.com/Vuizur/add-stress-to-epub
+ # spacy_pkuseg # For Chinese text segmentation
+ # pykakasi>=2.2.0 # For Japanese text processing (Kanji to Hiragana)
+ # russian-text-stresser @ git+https://github.com/Vuizur/add-stress-to-epub
  # dicta-onnx>=0.1.0 # For Hebrew diacritization
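Several of the entries above are optional and stay commented out; the engine tolerates their absence via an import guard, like the `try`/`except ImportError` that registers the Gemini backend. A sketch of probing for an optional package without importing it (a generic pattern, not code from this repository):

```python
import importlib.util


def optional_available(module_name: str) -> bool:
    """Return True if an optional dependency can be imported."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # find_spec raises when a parent package of a dotted name is missing
        return False


# Feature flags derived from what is actually installed
HAS_PYDUB = optional_available("pydub")
HAS_GEMINI = optional_available("google.genai")
```

Checking the spec instead of importing keeps startup cheap: the heavy module is only loaded later, when the feature behind the flag is actually used.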