Spaces: jkorstad/AudioBook

jkorstad committed on Apr 23

Commit

a33bc67

1 Parent(s): 453cba9

Initial deploy: AudioBook Forge with Qwen3-TTS backend, character voice mapping, and dark Gradio UI

Files changed (4)

README.md +41 -6
app.py +531 -0
backend.py +514 -0
requirements.txt +10 -0

README.md CHANGED

@@ -1,15 +1,50 @@
 ---
-title: AudioBook
-emoji: 📉
-colorFrom: green
-colorTo: indigo
 sdk: gradio
 sdk_version: 6.13.0
 python_version: '3.12'
 app_file: app.py
 pinned: false
 license: apache-2.0
-short_description: Create audiobooks with various custom speaker voices
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: AudioBook Forge
+emoji: 🎧
+colorFrom: indigo
+colorTo: cyan
 sdk: gradio
 sdk_version: 6.13.0
 python_version: '3.12'
 app_file: app.py
 pinned: false
 license: apache-2.0
+short_description: High-fidelity audiobook generator with AI character voices using Qwen3-TTS
 ---
+# AudioBook Forge
+**Model-agnostic, high-fidelity audiobook generator** powered by [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS). Create audiobooks where every character speaks with their own unique voice.
+## Features
+- 🎙️ **Character Voice Mapping** — Automatically detect characters from your story and assign unique voices to each one
+- 🎭 **Three Voice Modes**
+  - **Preset** — 9 premium built-in speakers (English, Chinese, Japanese, Korean, dialects)
+  - **Clone** — Upload a 3–10 second voice sample to clone any voice
+  - **Design** — Describe a voice in text (e.g., "a raspy old man with a warm chuckle") and the AI creates it
+- 📖 **Smart Text Processing** — Automatically distinguishes narration from dialogue and routes each segment to the correct voice
+- 🌐 **Multilingual** — Supports 10 languages via Qwen3-TTS
+- ⚡ **ZeroGPU** — Runs on Hugging Face ZeroGPU (free A100 compute)
+- 🔧 **Model Agnostic** — Backend is swappable; upgrade to future SOTA TTS models without changing the UI
+## How to Use
+1. **Paste your story** in the 📖 Story Setup tab.
+2. **Extract characters** automatically with the 🔍 button (or add them manually).
+3. **Configure voices** in the 🎭 Voice Cast tab:
+   - Set the **Narrator** voice (preset, cloned, or designed)
+   - Assign a voice to each **Character**
+4. **Generate** in the ⚡ Generate tab and download your MP3 audiobook.
+## Architecture
+- `app.py` — Gradio frontend with dark-themed custom UI
+- `backend.py` — Model-agnostic TTS engine, dialogue parser, and audio stitcher
+- **TTS Backend:** Qwen3-TTS 1.7B (CustomVoice + Base + VoiceDesign)
+- **Text Processing:** Paragraph-aware chunking, sentence-boundary splitting, quote detection
+- **Audio Pipeline:** Per-segment synthesis → crossfade stitching → peak normalization → MP3 export
+## License
+The application code is licensed under Apache 2.0. The underlying Qwen3-TTS models are also released under Apache 2.0, so the whole stack can be used commercially under the terms of that license.
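
The README's "Text Processing" bullet mentions paragraph-aware chunking with sentence-boundary splitting. The sketch below is one way that idea could look, not the commit's own implementation: `chunk_text` is a hypothetical helper, the regexes are assumptions, and the 380-character limit mirrors the `MAX_CHUNK_CHARS` constant in `backend.py`. A single sentence longer than the limit is kept whole rather than cut mid-sentence.

```python
import re
from typing import List

MAX_CHUNK_CHARS = 380  # mirrors the constant defined in backend.py


def chunk_text(text: str, max_chars: int = MAX_CHUNK_CHARS) -> List[str]:
    """Split text into chunks that never cross a paragraph or sentence boundary."""
    chunks: List[str] = []
    # Paragraphs are separated by one or more blank lines.
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # collapse internal whitespace
        if not para:
            continue
        current = ""
        # Split after sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if current and len(current) + 1 + len(sentence) > max_chars:
                # Adding this sentence would exceed the limit: close the chunk.
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)  # flush the remainder at the paragraph end
    return chunks
```

Flushing at every paragraph end keeps each chunk inside one paragraph, which gives the TTS model natural pause points at the cost of some short chunks.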

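The README's "Audio Pipeline" bullet lists crossfade stitching followed by peak normalization. A minimal NumPy sketch of those two steps follows, assuming mono float arrays at a shared sample rate and a non-empty segment list. `crossfade_concat`, `peak_normalize`, the linear fade curve, and the 0.95 target peak are illustrative assumptions; only the 80 ms value comes from `CROSSFADE_MS` in `backend.py`. The fade length is clamped to the shorter of the two segments so very short clips still join cleanly.

```python
import numpy as np

CROSSFADE_MS = 80  # mirrors the constant defined in backend.py


def crossfade_concat(segments, sr: int, fade_ms: int = CROSSFADE_MS) -> np.ndarray:
    """Join mono segments, blending each boundary with a linear crossfade."""
    out = np.asarray(segments[0], dtype=np.float32)
    for seg in segments[1:]:
        seg = np.asarray(seg, dtype=np.float32)
        # Never fade over more samples than either side actually has.
        n = min(int(sr * fade_ms / 1000), len(out), len(seg))
        if n == 0:
            out = np.concatenate([out, seg])
            continue
        fade_out = np.linspace(1.0, 0.0, n, dtype=np.float32)
        # The tail of `out` fades down while the head of `seg` fades up.
        overlap = out[-n:] * fade_out + seg[:n] * (1.0 - fade_out)
        out = np.concatenate([out[:-n], overlap, seg[n:]])
    return out


def peak_normalize(audio: np.ndarray, target_peak: float = 0.95) -> np.ndarray:
    """Scale audio so its largest absolute sample equals target_peak."""
    peak = float(np.max(np.abs(audio))) if len(audio) else 0.0
    if peak == 0.0:
        return audio  # silence or empty input: nothing to scale
    return audio * (target_peak / peak)
```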
app.py ADDED

	@@ -0,0 +1,531 @@

+"""
+AudioBook Forge - Gradio Frontend
+High-fidelity audiobook generator with character voice mapping.
+"""
+import os
+import json
+from pathlib import Path
+from typing import Dict, List, Optional
+import gradio as gr
+import numpy as np
+import soundfile as sf
+from backend import (
+    AudiobookPipeline,
+    VoiceConfig,
+    PRESET_SPEAKERS,
+)
+# ---------------------------------------------------------------------------
+# CSS & Theme
+# ---------------------------------------------------------------------------
+CUSTOM_CSS = """
+@import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap');
+body, .gradio-container {
+    font-family: 'Inter', sans-serif !important;
+    background: #0f172a !important;
+    color: #f8fafc !important;
+}
+.gradio-container {
+    max-width: 1200px !important;
+}
+.ab-header {
+    text-align: center;
+    padding: 2.2rem 1rem 1.8rem;
+    background: linear-gradient(135deg, rgba(99,102,241,0.12) 0%, rgba(34,211,238,0.06) 100%);
+    border-radius: 18px;
+    margin-bottom: 1.5rem;
+    border: 1px solid rgba(99,102,241,0.18);
+}
+.ab-header h1 {
+    font-size: 2.6rem;
+    font-weight: 700;
+    margin: 0;
+    background: linear-gradient(90deg, #a5b4fc, #22d3ee);
+    -webkit-background-clip: text;
+    -webkit-text-fill-color: transparent;
+}
+.ab-header p {
+    color: #94a3b8;
+    margin-top: 0.6rem;
+    font-size: 1.05rem;
+}
+.ab-card {
+    background: #1e293b !important;
+    border: 1px solid #334155 !important;
+    border-radius: 14px !important;
+    padding: 1.25rem !important;
+}
+button.primary {
+    background: linear-gradient(135deg, #6366f1, #4f46e5) !important;
+    border: none !important;
+    border-radius: 10px !important;
+    font-weight: 600 !important;
+    transition: all 0.2s ease !important;
+}
+button.primary:hover {
+    transform: translateY(-1px);
+    box-shadow: 0 4px 14px rgba(99,102,241,0.4) !important;
+}
+button.secondary {
+    background: #334155 !important;
+    border: 1px solid #475569 !important;
+    border-radius: 10px !important;
+    color: #f8fafc !important;
+}
+input, textarea, select {
+    background: #0f172a !important;
+    border: 1px solid #334155 !important;
+    border-radius: 8px !important;
+    color: #f8fafc !important;
+}
+input:focus, textarea:focus, select:focus {
+    border-color: #6366f1 !important;
+    box-shadow: 0 0 0 3px rgba(99,102,241,0.15) !important;
+}
+.gr-box, .gr-form {
+    background: #1e293b !important;
+    border-color: #334155 !important;
+}
+.gr-panel {
+    background: #1e293b !important;
+}
+.tabitem {
+    background: #1e293b !important;
+    border-color: #334155 !important;
+}
+"""
+# ---------------------------------------------------------------------------
+# Global State
+# ---------------------------------------------------------------------------
+_pipeline: Optional[AudiobookPipeline] = None
+def get_pipeline() -> AudiobookPipeline:
+    global _pipeline
+    if _pipeline is None:
+        device = "cuda" if os.system("nvidia-smi > /dev/null 2>&1") == 0 else "cpu"
+        _pipeline = AudiobookPipeline(device=device)
+    return _pipeline
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+def on_mode_change(mode: str) -> tuple:
+    if mode == "preset":
+        return gr.update(visible=True), gr.update(visible=False), gr.update(visible=False), gr.update(visible=False)
+    elif mode == "clone":
+        return gr.update(visible=False), gr.update(visible=True), gr.update(visible=True), gr.update(visible=False)
+    else:
+        return gr.update(visible=False), gr.update(visible=False), gr.update(visible=False), gr.update(visible=True)
+def extract_chars(text: str, use_ai: bool) -> tuple:
+    if not text or len(text.strip()) < 20:
+        return [], "Text too short. Please paste at least a paragraph."
+    pipe = get_pipeline()
+    chars = pipe.extract_characters(text, use_ai=use_ai)
+    status = f"Found {len(chars)} characters: {', '.join(c['name'] for c in chars)}" if chars else "No characters auto-detected. Add them manually below."
+    return chars, status
+def _build_char_dict(
+    names, descs, modes, presets, audios, ref_texts, designs, instructs, langs
+) -> List[Dict]:
+    chars = []
+    for i in range(8):
+        if names[i]:
+            chars.append({
+                "name": names[i],
+                "description": descs[i] or "",
+                "voice_mode": modes[i],
+                "voice_preset": presets[i] if modes[i] == "preset" else None,
+                "voice_ref_audio": audios[i] if modes[i] == "clone" else None,
+                "voice_ref_text": ref_texts[i] if modes[i] == "clone" else None,
+                "voice_design_desc": designs[i] if modes[i] == "design" else None,
+                "voice_instruct": instructs[i] or "",
+                "language": langs[i],
+            })
+    return chars
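+def generate_audiobook(
+    text,
+    nar_mode, nar_preset, nar_audio, nar_ref_text, nar_design, nar_instruct, nar_lang,
+    gen_temp, gen_seed,
+    *char_inputs,
+):
+    # Gradio passes the character components as flat positional values
+    # (9 fields x 8 rows, field-major), so regroup them into per-field tuples.
+    names, descs, modes, presets, audios, ref_texts, designs, instructs, langs = (
+        char_inputs[i * 8:(i + 1) * 8] for i in range(9)
+    )
+    if not text or len(text.strip()) < 50:
+        return None, "Error: Please provide at least 50 characters of story text."
+    pipe = get_pipeline()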
+    nar_cfg = VoiceConfig(
+        name="Narrator",
+        mode=nar_mode,
+        preset=nar_preset if nar_mode == "preset" else None,
+        ref_audio=nar_audio if nar_mode == "clone" and nar_audio else None,
+        ref_text=nar_ref_text if nar_mode == "clone" else None,
+        design_desc=nar_design if nar_mode == "design" else None,
+        instruct=nar_instruct,
+        language=nar_lang,
+    )
+    char_configs = {}
+    for i in range(8):
+        if not names[i]:
+            continue
+        vc = VoiceConfig(
+            name=names[i],
+            mode=modes[i],
+            preset=presets[i] if modes[i] == "preset" else None,
+            ref_audio=audios[i] if modes[i] == "clone" and audios[i] else None,
+            ref_text=ref_texts[i] if modes[i] == "clone" else None,
+            design_desc=designs[i] if modes[i] == "design" else None,
+            instruct=instructs[i] or "",
+            language=langs[i],
+        )
+        char_configs[names[i]] = vc
+    progress_text = ""
+    def prog_cb(ratio: float, msg: str):
+        nonlocal progress_text
+        progress_text = f"[{ratio*100:.0f}%] {msg}"
+        print(progress_text)
+    try:
+        output_path, _ = pipe.generate(
+            text=text,
+            narrator_config=nar_cfg,
+            character_configs=char_configs,
+            progress_callback=prog_cb,
+            temperature=gen_temp,
+            seed=int(gen_seed),
+        )
+        return output_path, "Done! Audiobook generated."
+    except Exception as e:
+        import traceback
+        traceback.print_exc()
+        return None, f"Error: {str(e)}"
+def preview_narrator(mode, preset, audio, ref_text, design, instruct, lang):
+    pipe = get_pipeline()
+    vc = VoiceConfig(
+        name="Narrator",
+        mode=mode,
+        preset=preset if mode == "preset" else None,
+        ref_audio=audio if mode == "clone" and audio else None,
+        ref_text=ref_text if mode == "clone" else None,
+        design_desc=design if mode == "design" else None,
+        instruct=instruct,
+        language=lang,
+    )
+    try:
+        wav, sr = pipe.preview_voice(vc)
+        return (sr, wav), "Preview ready!"
+    except Exception as e:
+        import traceback
+        traceback.print_exc()
+        return None, f"Preview failed: {e}"
+# ---------------------------------------------------------------------------
+# Build UI
+# ---------------------------------------------------------------------------
+def build_app():
+    theme = gr.themes.Soft(
+        primary_hue="indigo",
+        secondary_hue="cyan",
+        neutral_hue="slate",
+    ).set(
+        body_background_fill="#0f172a",
+        body_background_fill_dark="#0f172a",
+        body_text_color="#f8fafc",
+        body_text_color_subdued="#94a3b8",
+        background_fill_primary="#1e293b",
+        background_fill_secondary="#0f172a",
+        border_color_accent="#334155",
+        color_accent_soft="#22d3ee",
+        button_primary_background_fill="linear-gradient(135deg, #6366f1, #4f46e5)",
+        button_primary_background_fill_hover="linear-gradient(135deg, #4f46e5, #4338ca)",
+        button_primary_text_color="#ffffff",
+        input_background_fill="#0f172a",
+        input_border_color="#334155",
+        block_title_text_color="#f8fafc",
+        block_label_text_color="#94a3b8",
+    )
+    with gr.Blocks(theme=theme, css=CUSTOM_CSS, title="AudioBook Forge") as demo:
+        gr.HTML("""
+        <div class="ab-header">
+            <h1>AudioBook Forge</h1>
+            <p>High-fidelity audiobooks with AI character voices. Model-agnostic TTS powered by Qwen3-TTS.</p>
+        </div>
+        """)
+        with gr.Tabs():
+            # ==================== TAB 1 ====================
+            with gr.TabItem("📖 Story Setup"):
+                with gr.Row():
+                    with gr.Column(scale=2):
+                        story_input = gr.TextArea(
+                            label="Story Text",
+                            placeholder="Paste your book chapter, short story, or script here...",
+                            lines=20,
+                            max_lines=40,
+                        )
+                    with gr.Column(scale=1):
+                        gr.Markdown("### Character Detection")
+                        use_ai_check = gr.Checkbox(
+                            label="Use AI enhancement (slower, more accurate)",
+                            value=False,
+                        )
+                        extract_btn = gr.Button("🔍 Extract Characters", variant="primary")
+                        gr.Markdown("---")
+                        gr.Markdown("**Tips:**")
+                        gr.Markdown("- Use `Character: \"dialogue\"` format for best results.")
+                        gr.Markdown("- Or standard prose with quoted dialogue.")
+                        gr.Markdown("- AI mode uses a small LLM for deeper analysis.")
+                extract_status = gr.Textbox(label="Status", interactive=False)
+                # Hidden states to hold character data
+                char_state = gr.State(value=[])
+            # ==================== TAB 2 ====================
+            with gr.TabItem("🎭 Voice Cast"):
+                with gr.Row():
+                    with gr.Column(scale=1):
+                        gr.Markdown("## Narrator")
+                        with gr.Column(elem_classes="ab-card"):
+                            nar_mode = gr.Dropdown(
+                                choices=["preset", "clone", "design"],
+                                value="preset",
+                                label="Narrator Mode",
+                            )
+                            nar_preset = gr.Dropdown(
+                                choices=list(PRESET_SPEAKERS.keys()),
+                                value="Ryan",
+                                label="Preset Voice",
+                            )
+                            nar_audio = gr.Audio(
+                                label="Upload Voice Sample (3–10s)",
+                                type="filepath",
+                                visible=False,
+                            )
+                            nar_ref_text = gr.Textbox(
+                                label="Reference Transcript",
+                                placeholder="What does the reference audio say?",
+                                visible=False,
+                            )
+                            nar_design = gr.TextArea(
+                                label="Voice Description",
+                                placeholder="e.g. A warm, raspy baritone with a slight British accent.",
+                                visible=False,
+                                lines=2,
+                            )
+                            nar_instruct = gr.Textbox(
+                                label="Style Instruction",
+                                placeholder="e.g. Calm, measured storytelling pace.",
+                            )
+                            nar_lang = gr.Dropdown(
+                                choices=["English", "Chinese", "Japanese", "Korean", "German", "French", "Spanish", "Italian", "Portuguese", "Russian"],
+                                value="English",
+                                label="Language",
+                            )
+                            nar_preview_btn = gr.Button("🔊 Preview Narrator", variant="secondary")
+                            nar_preview_audio = gr.Audio(label="Preview", interactive=False)
+                            nar_preview_status = gr.Textbox(show_label=False, interactive=False)
+                            nar_mode.change(
+                                on_mode_change,
+                                inputs=nar_mode,
+                                outputs=[nar_preset, nar_audio, nar_ref_text, nar_design],
+                            )
+                            nar_preview_btn.click(
+                                preview_narrator,
+                                inputs=[nar_mode, nar_preset, nar_audio, nar_ref_text, nar_design, nar_instruct, nar_lang],
+                                outputs=[nar_preview_audio, nar_preview_status],
+                            )
+                    with gr.Column(scale=2):
+                        gr.Markdown("## Character Voices")
+                        gr.Markdown("Configure up to 8 characters. Use **preset** for built-in speakers, **clone** to upload a voice sample, or **design** to describe a voice from text.")
+                        # Dynamic character rows — we'll create 8 static rows and toggle visibility
+                        char_names = []
+                        char_descs = []
+                        char_modes = []
+                        char_presets = []
+                        char_audios = []
+                        char_ref_texts = []
+                        char_designs = []
+                        char_instructs = []
+                        char_langs = []
+                        char_rows = []
+                        for i in range(8):
+                            visible_default = (i == 0)
+                            with gr.Group(visible=visible_default) as row:
+                                with gr.Row():
+                                    cn = gr.Textbox(label="Name", placeholder="e.g. Alice", visible=visible_default)
+                                    cd = gr.Textbox(label="Description", placeholder="Personality note", visible=visible_default)
+                                    cm = gr.Dropdown(label="Mode", choices=["preset", "clone", "design"], value="preset", visible=visible_default)
+                                    cp = gr.Dropdown(label="Preset", choices=list(PRESET_SPEAKERS.keys()), value="Ryan", visible=visible_default)
+                                with gr.Row():
+                                    ca = gr.Audio(label="Voice Sample", type="filepath", visible=False)
+                                    crt = gr.Textbox(label="Ref Transcript", placeholder="What the sample says", visible=False)
+                                    cdes = gr.TextArea(label="Voice Description", placeholder="e.g. A shrill, nervous teenager.", visible=False, lines=2)
+                                    cinstr = gr.Textbox(label="Style Instruction", placeholder="e.g. Angry and loud.", visible=visible_default)
+                                    cl = gr.Dropdown(label="Language", choices=["English", "Chinese", "Japanese", "Korean", "German", "French", "Spanish", "Italian", "Portuguese", "Russian"], value="English", visible=visible_default)
+                                cm.change(
+                                    on_mode_change,
+                                    inputs=cm,
+                                    outputs=[cp, ca, crt, cdes],
+                                )
+                            char_rows.append(row)
+                            char_names.append(cn)
+                            char_descs.append(cd)
+                            char_modes.append(cm)
+                            char_presets.append(cp)
+                            char_audios.append(ca)
+                            char_ref_texts.append(crt)
+                            char_designs.append(cdes)
+                            char_instructs.append(cinstr)
+                            char_langs.append(cl)
+            # ==================== TAB 3 ====================
+            with gr.TabItem("⚡ Generate"):
+                with gr.Row():
+                    with gr.Column(scale=1):
+                        gr.Markdown("### Settings")
+                        gen_temp = gr.Slider(minimum=0.1, maximum=1.0, value=0.7, step=0.05, label="Temperature")
+                        gen_seed = gr.Number(value=42, precision=0, label="Seed (fix for consistency)")
+                        gen_btn = gr.Button("▶️ Generate Audiobook", variant="primary", size="lg")
+                        gen_progress = gr.Textbox(label="Progress", interactive=False, value="Ready.")
+                    with gr.Column(scale=2):
+                        gr.Markdown("### Output")
+                        output_audio = gr.Audio(label="Generated Audiobook", type="filepath", interactive=False)
+                        output_status = gr.Textbox(label="Status", interactive=False)
+            # ==================== TAB 4 ====================
+            with gr.TabItem("ℹ️ About"):
+                gr.Markdown("""
+                ## AudioBook Forge
+                **Model-agnostic, high-fidelity audiobook generation** using state-of-the-art open TTS.
+                ### Current Backend: Qwen3-TTS
+                - **1.7B CustomVoice** — 9 premium preset speakers with style control
+                - **1.7B Base** — High-quality voice cloning from 3–10 second samples
+                - **1.7B VoiceDesign** — Create voices from text descriptions
+                - **10 languages** supported
+                - **Apache 2.0** license — commercially usable
+                ### Workflow
+                1. **Paste your story** in the Story Setup tab.
+                2. **Extract characters** automatically or define them manually.
+                3. **Assign voices** — choose presets, upload samples for cloning, or describe voices.
+                4. **Generate** — the engine detects narration vs dialogue and routes each segment to the right voice.
+                5. **Download** your finished audiobook as MP3.
+                ### Architecture
+                The TTS engine is fully model-agnostic. Swapping to a future SOTA model only requires updating the backend adapter.
+                ### Tips for Best Quality
+                - Use clean, noise-free voice samples for cloning (3–10 seconds).
+                - Keep reference transcripts accurate — they guide the cloning quality.
+                - Lower temperature (0.5–0.6) for stable narration; higher (0.8–0.9) for expressive dialogue.
+                - Use a fixed seed across chunks to prevent voice drift.
+                """)
+        # ---------- Extract wiring ----------
+        def do_extract(text, use_ai):
+            chars, status = extract_chars(text, use_ai)
+            # Build visibility updates
+            updates = []
+            for i in range(8):
+                if i < len(chars):
+                    updates.extend([
+                        gr.update(visible=True),   # row
+                        gr.update(value=chars[i].get("name", ""), visible=True),
+                        gr.update(value=chars[i].get("description", ""), visible=True),
+                        gr.update(value=chars[i].get("voice_mode", "preset"), visible=True),
+                        gr.update(value=chars[i].get("voice_preset", "Ryan"), visible=True),
+                        gr.update(visible=False),   # audio
+                        gr.update(visible=False),   # ref text
+                        gr.update(visible=False),   # design
+                        gr.update(value=chars[i].get("voice_instruct", ""), visible=True),
+                        gr.update(value=chars[i].get("language", "English"), visible=True),
+                    ])
+                elif i == 0 and not chars:
+                    # Nothing detected: leave the first row untouched so a character can be added manually.
+                    updates.extend([gr.update() for _ in range(10)])
+                else:
+                    # Hide every component of an unused row (the row group plus its nine fields).
+                    updates.extend([gr.update(visible=False) for _ in range(10)])
+            return [status] + updates
+        extract_btn.click(
+            do_extract,
+            inputs=[story_input, use_ai_check],
+            outputs=[extract_status] + [
+                item for sublist in [
+                    [char_rows[i], char_names[i], char_descs[i], char_modes[i], char_presets[i],
+                     char_audios[i], char_ref_texts[i], char_designs[i], char_instructs[i], char_langs[i]]
+                    for i in range(8)
+                ] for item in sublist
+            ],
+        )
+        # ---------- Generate wiring ----------
+        all_char_inputs = (
+            char_names + char_descs + char_modes + char_presets +
+            char_audios + char_ref_texts + char_designs + char_instructs + char_langs
+        )
+        gen_btn.click(
+            generate_audiobook,
+            inputs=[
+                story_input,
+                nar_mode, nar_preset, nar_audio, nar_ref_text, nar_design, nar_instruct, nar_lang,
+                gen_temp, gen_seed,
+            ] + all_char_inputs,
+            outputs=[output_audio, output_status],
+        )
+    return demo
+demo = build_app()
+if __name__ == "__main__":
+    demo.launch(server_name="0.0.0.0", server_port=7860)
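
The generate wiring above concatenates the character components field by field (`char_names + char_descs + ...`), so a Gradio handler receives them as one flat run of 72 positional values: 9 fields times 8 rows, in field-major order. A small sketch of how such a flat sequence could be regrouped into per-character records is shown below. `regroup_flat_inputs` and the `FIELDS` labels are hypothetical names chosen for illustration, not part of the commit.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical field labels, in the same order the component lists are concatenated.
FIELDS = [
    "name", "description", "mode", "preset", "ref_audio",
    "ref_text", "design_desc", "instruct", "language",
]
NUM_ROWS = 8  # the UI builds eight static character rows


def regroup_flat_inputs(
    flat: Sequence[Any],
    fields: Sequence[str] = FIELDS,
    num_rows: int = NUM_ROWS,
) -> List[Dict[str, Any]]:
    """Turn field-major flat inputs into one dict per character with a name."""
    expected = len(fields) * num_rows
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    records = []
    for row in range(num_rows):
        # Field f of row r sits at index f * num_rows + r in the flat sequence.
        record = {
            field: flat[f_idx * num_rows + row]
            for f_idx, field in enumerate(fields)
        }
        if record["name"]:  # rows without a name are unused and skipped
            records.append(record)
    return records
```

Validating the length up front surfaces a wiring mismatch as a clear error instead of silently shifting fields between characters.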

backend.py ADDED

	@@ -0,0 +1,514 @@

+"""
+AudioBook Forge - Backend
+Model-agnostic TTS engine with Qwen3-TTS support.
+Character extraction, dialogue parsing, and audio stitching.
+"""
+import os
+import re
+import json
+import hashlib
+import tempfile
+from pathlib import Path
+from typing import List, Dict, Optional, Tuple, Any
+from dataclasses import dataclass, field
+from collections import defaultdict
+import warnings
+import numpy as np
+import soundfile as sf
+from pydub import AudioSegment
+warnings.filterwarnings("ignore")
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+PRESET_SPEAKERS = {
+    "Ryan": {"lang": "English", "desc": "Dynamic, expressive male"},
+    "Aiden": {"lang": "English", "desc": "Sunny, warm male"},
+    "Serena": {"lang": "Chinese", "desc": "Young female (Chinese)"},
+    "Vivian": {"lang": "Chinese", "desc": "Young female (Chinese)"},
+    "Uncle_Fu": {"lang": "Chinese", "desc": "Seasoned elder male (Chinese)"},
+    "Ono_Anna": {"lang": "Japanese", "desc": "Playful female (Japanese)"},
+    "Sohee": {"lang": "Korean", "desc": "Warm female (Korean)"},
+    "Dylan": {"lang": "Chinese", "desc": "Beijing dialect male"},
+    "Eric": {"lang": "Chinese", "desc": "Sichuan dialect male"},
+}
+MAX_CHUNK_CHARS = 380
+MIN_CHUNK_CHARS = 80
+CROSSFADE_MS = 80
+# ---------------------------------------------------------------------------
+# Data Classes
+# ---------------------------------------------------------------------------
+@dataclass
+class VoiceConfig:
+    name: str = "Narrator"
+    mode: str = "preset"          # preset | clone | design
+    preset: Optional[str] = None   # e.g., "Ryan"
+    ref_audio: Optional[str] = None
+    ref_text: Optional[str] = None
+    design_desc: Optional[str] = None
+    instruct: str = ""            # style instruction
+    language: str = "English"
+@dataclass
+class TextSegment:
+    text: str
+    seg_type: str                 # narration | dialogue
+    speaker: Optional[str] = None
+    emotion_hint: Optional[str] = None
+@dataclass
+class CharacterProfile:
+    name: str
+    description: str = ""
+    voice: VoiceConfig = field(default_factory=VoiceConfig)
+    occurrences: int = 0
+# ---------------------------------------------------------------------------
+# TTS Engine (Model-Agnostic Wrapper)
+# ---------------------------------------------------------------------------
+class TTSEngine:
+    """
+    Model-agnostic TTS engine.
+    Currently backed by Qwen3-TTS. Swappable architecture.
+    """
+    def __init__(self, device: str = "cuda"):
+        self.device = device
+        self._custom_voice_model = None
+        self._base_model = None
+        self._design_model = None
+        self._model_ids = {
+            "custom": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
+            "base": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
+            "design": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
+        }
+        self._cache_dir = Path(tempfile.gettempdir()) / "audiobook_cache"
+        self._cache_dir.mkdir(exist_ok=True)
+    def _load_custom_voice(self):
+        if self._custom_voice_model is not None:
+            return self._custom_voice_model
+        try:
+            from qwen_tts import Qwen3TTSModel
+            import torch
+            print("[TTS] Loading CustomVoice model...")
+            self._custom_voice_model = Qwen3TTSModel.from_pretrained(
+                self._model_ids["custom"],
+                device_map=self.device,
+                dtype=torch.bfloat16,
+            )
+            print("[TTS] CustomVoice ready.")
+        except Exception as e:
+            print(f"[TTS] CustomVoice load failed: {e}")
+            raise
+        return self._custom_voice_model
+    def _load_base(self):
+        if self._base_model is not None:
+            return self._base_model
+        try:
+            from qwen_tts import Qwen3TTSModel
+            import torch
+            print("[TTS] Loading Base (clone) model...")
+            self._base_model = Qwen3TTSModel.from_pretrained(
+                self._model_ids["base"],
+                device_map=self.device,
+                dtype=torch.bfloat16,
+            )
+            print("[TTS] Base ready.")
+        except Exception as e:
+            print(f"[TTS] Base load failed: {e}")
+            raise
+        return self._base_model
+    def _load_design(self):
+        if self._design_model is not None:
+            return self._design_model
+        try:
+            from qwen_tts import Qwen3TTSModel
+            import torch
+            print("[TTS] Loading VoiceDesign model...")
+            self._design_model = Qwen3TTSModel.from_pretrained(
+                self._model_ids["design"],
+                device_map=self.device,
+                dtype=torch.bfloat16,
+            )
+            print("[TTS] VoiceDesign ready.")
+        except Exception as e:
+            print(f"[TTS] VoiceDesign load failed: {e}")
+            raise
+        return self._design_model
+    def _cache_key(self, text: str, voice: VoiceConfig, temperature: float = 0.7, seed: int = 42) -> str:
+        # Hash every input that affects the audio so changed settings never reuse a stale cache entry.
+        payload = (
+            f"{text}|{voice.mode}|{voice.preset}|{voice.ref_audio}|{voice.ref_text}|"
+            f"{voice.design_desc}|{voice.instruct}|{voice.language}|{temperature}|{seed}"
+        )
+        return hashlib.md5(payload.encode()).hexdigest()
+    def _cached_path(self, key: str) -> Path:
+        return self._cache_dir / f"{key}.wav"
+    def synthesize(
+        self,
+        text: str,
+        voice: VoiceConfig,
+        temperature: float = 0.7,
+        seed: int = 42,
+    ) -> Tuple[np.ndarray, int]:
+        """Generate audio for a text chunk. Returns (audio_array, sample_rate)."""
+        cache_key = self._cache_key(text, voice, temperature, seed)
+        cache_path = self._cached_path(cache_key)
+        if cache_path.exists():
+            audio, sr = sf.read(str(cache_path))
+            return audio, sr
+        if voice.mode == "preset":
+            model = self._load_custom_voice()
+            wavs, sr = model.generate_custom_voice(
+                text=text,
+                language=voice.language,
+                speaker=voice.preset or "Ryan",
+                instruct=voice.instruct or "Narrate clearly and expressively.",
+                temperature=temperature,
+                seed=seed,
+            )
+        elif voice.mode == "clone":
+            model = self._load_base()
+            if not voice.ref_audio or not Path(voice.ref_audio).exists():
+                raise ValueError("Clone mode requires ref_audio path.")
+            wavs, sr = model.generate_voice_clone(
+                text=text,
+                language=voice.language,
+                ref_audio=voice.ref_audio,
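+                # NOTE: ref_text should be the transcript of ref_audio; falling back to
+                # the target text is only a placeholder and may degrade clone quality.
+                ref_text=voice.ref_text or text[:100],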
+                temperature=temperature,
+                seed=seed,
+            )
+        elif voice.mode == "design":
+            model = self._load_design()
+            desc = voice.design_desc or "A clear, expressive narrator voice."
+            wavs, sr = model.generate_voice_design(
+                text=text,
+                language=voice.language,
+                instruct=desc,
+                temperature=temperature,
+                seed=seed,
+            )
+        else:
+            raise ValueError(f"Unknown voice mode: {voice.mode}")
+        # Handle stereo or list returns
+        if isinstance(wavs, list):
+            wavs = wavs[0]
+        if wavs.ndim > 1:
+            wavs = wavs.mean(axis=1)
+        sf.write(str(cache_path), wavs, sr)
+        return wavs, sr
+    def status(self) -> Dict[str, Any]:
+        return {
+            "custom_loaded": self._custom_voice_model is not None,
+            "base_loaded": self._base_model is not None,
+            "design_loaded": self._design_model is not None,
+        }
+# ---------------------------------------------------------------------------
+# Text Processing
+# ---------------------------------------------------------------------------
+class TextProcessor:
+    """Extract characters, parse dialogue, chunk text."""
+    DIALOGUE_RE = re.compile(
+        r'(?:^|[.!?\n]\s+)\s*"([^"]{3,500})"'  # quoted dialogue
+    )
+    SPEAKER_RE = re.compile(
+        r'(?:^|\n)\s*([A-Z][a-zA-Z\s]{1,20})(?:\s*[:\-–])\s*"([^"]+)"'
+    )
+    NAME_RE = re.compile(
+        r'\b([A-Z][a-z]{1,15})\b'
+    )
+    @staticmethod
+    def extract_characters(text: str, use_ai: bool = False) -> List[CharacterProfile]:
+        """Extract character names and basic stats from text."""
+        profiles: Dict[str, CharacterProfile] = {}
+        # Pattern: Name: "dialogue"
+        for match in TextProcessor.SPEAKER_RE.finditer(text):
+            name = match.group(1).strip()
+            if len(name) > 2:
+                if name not in profiles:
+                    profiles[name] = CharacterProfile(name=name)
+                profiles[name].occurrences += 1
+        # Pattern: quoted dialogue near "he said / she said"
+        for match in TextProcessor.DIALOGUE_RE.finditer(text):
+            quote = match.group(1)
+            before = text[max(0, match.start() - 120):match.start()]
+            said_match = re.search(r'([A-Z][a-z]{1,15})\s+(?:said|cried|shouted|whispered|replied|asked)', before)
+            if said_match:
+                name = said_match.group(1)
+                if name not in profiles:
+                    profiles[name] = CharacterProfile(name=name)
+                profiles[name].occurrences += 1
+        # Fallback: capitalized names appearing frequently
+        all_names = TextProcessor.NAME_RE.findall(text)
+        from collections import Counter
+        common = Counter(all_names).most_common(30)
+        for name, count in common:
+            if count >= 3 and len(name) > 2 and name not in profiles:
+                # Filter common words
+                if name.lower() in {"the", "and", "but", "for", "are", "was", "were", "had", "have", "has", "his", "her", "she", "him", "they", "them", "said", "with", "from", "that", "this", "what", "when", "where", "would", "could", "should"}:
+                    continue
+                profiles[name] = CharacterProfile(name=name, occurrences=count)
+        result = sorted(profiles.values(), key=lambda p: p.occurrences, reverse=True)
+        return result[:12]  # Cap at 12 characters
+    @staticmethod
+    def segment_text(text: str, characters: List[str]) -> List[TextSegment]:
+        """Split text into narration/dialogue segments."""
+        segments = []
+        # Normalize newlines
+        text = text.replace("\r\n", "\n").replace("\r", "\n")
+        # Split by paragraphs first
+        paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
+        for para in paragraphs:
+            # Check if paragraph starts with Character: "dialogue"
+            speaker_match = re.match(r'^([A-Z][a-zA-Z\s]{1,20})[:\-–]\s*"([^"]+)"', para)
+            if speaker_match:
+                speaker = speaker_match.group(1).strip()
+                dialogue = speaker_match.group(2)
+                segments.append(TextSegment(text=dialogue, seg_type="dialogue", speaker=speaker))
+                # Remainder of paragraph as narration
+                remainder = para[speaker_match.end():].strip()
+                if remainder:
+                    segments.append(TextSegment(text=remainder, seg_type="narration"))
+                continue
+            # Check for inline quotes
+            parts = re.split(r'"([^"]{3,500})"', para)
+            for i, part in enumerate(parts):
+                part = part.strip()
+                if not part:
+                    continue
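+                if i % 2 == 1:
+                    # This was inside quotes. Heuristic attribution: use a known character
+                    # named right after the quote ('..." said Alice'), else right before it.
+                    context = (parts[i + 1][:120], parts[i - 1][-120:])
+                    speaker = next(
+                        (c for ctx in context for c in characters if re.search(rf'\b{re.escape(c)}\b', ctx)),
+                        None,
+                    )
+                    segments.append(TextSegment(text=part, seg_type="dialogue", speaker=speaker))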
+                else:
+                    segments.append(TextSegment(text=part, seg_type="narration"))
+        # Merge adjacent narration segments
+        merged = []
+        for seg in segments:
+            if merged and seg.seg_type == "narration" and merged[-1].seg_type == "narration":
+                merged[-1].text += " " + seg.text
+            else:
+                merged.append(seg)
+        return merged
+    @staticmethod
+    def chunk_segments(segments: List[TextSegment], max_chars: int = MAX_CHUNK_CHARS) -> List[TextSegment]:
+        """Break long segments into smaller chunks at sentence boundaries."""
+        result = []
+        for seg in segments:
+            if len(seg.text) <= max_chars:
+                result.append(seg)
+                continue
+            # Split into sentences
+            sentences = re.split(r'(?<=[.!?])\s+', seg.text)
+            current_text = ""
+            current_speaker = seg.speaker
+            current_type = seg.seg_type
+            for sent in sentences:
+                if len(current_text) + len(sent) + 1 <= max_chars:
+                    current_text += (" " if current_text else "") + sent
+                else:
+                    if current_text:
+                        result.append(TextSegment(text=current_text.strip(), seg_type=current_type, speaker=current_speaker))
+                    # A single sentence longer than max_chars is kept whole, so a chunk can exceed the limit.
+                    current_text = sent
+            if current_text:
+                result.append(TextSegment(text=current_text.strip(), seg_type=current_type, speaker=current_speaker))
+        return result
+# ---------------------------------------------------------------------------
+# Audio Utils
+# ---------------------------------------------------------------------------
+def stitch_audio(paths: List[str], crossfade_ms: int = CROSSFADE_MS) -> AudioSegment:
+    """Concatenate WAV files with crossfade."""
+    if not paths:
+        return AudioSegment.silent(duration=0)
+    combined = AudioSegment.from_wav(paths[0])
+    for p in paths[1:]:
+        next_seg = AudioSegment.from_wav(p)
+        # Simple overlap crossfade
+        if crossfade_ms > 0 and len(combined) > crossfade_ms and len(next_seg) > crossfade_ms:
+            combined = combined.append(next_seg, crossfade=crossfade_ms)
+        else:
+            combined += next_seg
+    return combined
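+def normalize_audio(audio: AudioSegment, target_dBFS: float = -1.5) -> AudioSegment:
+    """Peak normalize audio."""
+    # Pure silence (or empty audio) has max_dBFS == -inf, so there is no finite gain to apply.
+    if audio.max_dBFS == float("-inf"):
+        return audio
+    change = target_dBFS - audio.max_dBFS
+    return audio.apply_gain(change)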
+def save_audiobook(segments_paths: List[str], output_path: str, title: str = "Audiobook") -> str:
+    """Stitch segments and export final audiobook."""
+    if not segments_paths:
+        return ""
+    combined = stitch_audio(segments_paths)
+    combined = normalize_audio(combined)
+    combined.export(output_path, format="mp3", bitrate="192k", tags={"title": title, "artist": "AudioBook Forge"})
+    return output_path
+# ---------------------------------------------------------------------------
+# Optional: AI Character Extraction via HF Inference
+# ---------------------------------------------------------------------------
+def ai_extract_characters(text: str, api_token: Optional[str] = None) -> List[CharacterProfile]:
+    """Use a small HF model to extract characters with descriptions."""
+    try:
+        from huggingface_hub import InferenceClient
+        client = InferenceClient(token=api_token or os.getenv("HF_TOKEN"))
+        # Truncate text for context window
+        sample = text[:4000] + ("\n...[truncated]" if len(text) > 4000 else "")
+        prompt = (
+            "Extract all named characters from the following story excerpt. "
+            "For each character, provide their name and a brief description of their personality/role. "
+            "Return ONLY a JSON array like: [{\"name\":\"Alice\",\"description\":\"Curious young girl\"},...]\n\n"
+            f"STORY:\n{sample}\n\nJSON:"
+        )
+        response = client.text_generation(
+            model="Qwen/Qwen3-1.7B",
+            prompt=prompt,
+            max_new_tokens=512,
+            temperature=0.3,
+            return_full_text=False,
+        )
+        # Extract JSON from response
+        json_match = re.search(r'\[.*?\]', response, re.DOTALL)
+        if json_match:
+            data = json.loads(json_match.group())
+            profiles = []
+            for item in data:
+                name = item.get("name", "")
+                desc = item.get("description", "")
+                if name:
+                    profiles.append(CharacterProfile(name=name, description=desc))
+            return profiles
+    except Exception as e:
+        print(f"[AI Extraction] Failed: {e}")
+    return []
+# ---------------------------------------------------------------------------
+# Main Pipeline
+# ---------------------------------------------------------------------------
+class AudiobookPipeline:
+    def __init__(self, device: str = "cuda"):
+        self.tts = TTSEngine(device=device)
+        self.processor = TextProcessor()
+        self.temp_dir = Path(tempfile.gettempdir()) / "audiobook_segments"
+        self.temp_dir.mkdir(exist_ok=True)
+    def extract_characters(self, text: str, use_ai: bool = False) -> List[Dict]:
+        if use_ai:
+            profiles = ai_extract_characters(text)
+            if not profiles:
+                profiles = self.processor.extract_characters(text)
+        else:
+            profiles = self.processor.extract_characters(text)
+        return [
+            {
+                "name": p.name,
+                "description": p.description,
+                "occurrences": p.occurrences,
+                "voice_mode": "preset",
+                "voice_preset": "Ryan",
+                "voice_instruct": "",
+            }
+            for p in profiles
+        ]
+    def generate(
+        self,
+        text: str,
+        narrator_config: VoiceConfig,
+        character_configs: Dict[str, VoiceConfig],
+        progress_callback=None,
+        temperature: float = 0.7,
+        seed: int = 42,
+    ) -> Tuple[str, List[str]]:
+        """
+        Generate audiobook.
+        Returns (final_mp3_path, list_of_segment_wav_paths).
+        """
+        segments = self.processor.segment_text(text, list(character_configs.keys()))
+        segments = self.processor.chunk_segments(segments)
+        segment_paths = []
+        total = len(segments)
+        for i, seg in enumerate(segments):
+            if progress_callback:
+                progress_callback(i / total, f"Generating segment {i+1}/{total} ({seg.seg_type})...")
+            # Determine voice
+            if seg.seg_type == "dialogue" and seg.speaker and seg.speaker in character_configs:
+                voice = character_configs[seg.speaker]
+            else:
+                voice = narrator_config
+            try:
+                wav, sr = self.tts.synthesize(seg.text, voice, temperature=temperature, seed=seed)
+                seg_path = self.temp_dir / f"seg_{i:04d}_{voice.name}.wav"
+                sf.write(str(seg_path), wav, sr)
+                segment_paths.append(str(seg_path))
+            except Exception as e:
+                print(f"[Pipeline] Segment {i} failed: {e}")
+                # Insert silence to maintain timing
+                silent = AudioSegment.silent(duration=500)
+                seg_path = self.temp_dir / f"seg_{i:04d}_silent.wav"
+                silent.export(str(seg_path), format="wav")
+                segment_paths.append(str(seg_path))
+        if progress_callback:
+            progress_callback(1.0, "Stitching final audiobook...")
+        output_path = str(self.temp_dir / "audiobook_final.mp3")
+        save_audiobook(segment_paths, output_path, title="Generated Audiobook")
+        return output_path, segment_paths
+    def preview_voice(
+        self,
+        voice: VoiceConfig,
+        sample_text: str = "Hello, this is a preview of my voice. I hope you enjoy the story.",
+    ) -> Tuple[np.ndarray, int]:
+        return self.tts.synthesize(sample_text, voice, temperature=0.7, seed=42)
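
Review note on `TextProcessor.chunk_segments`: it packs whole sentences greedily, so a single sentence longer than the limit passes through unsplit. The following is a minimal standalone sketch of one way to add a hard-split fallback, not the commit's implementation. The function name `chunk_text` and the default `max_chars=300` are assumptions for illustration; the real limit is `MAX_CHUNK_CHARS`, whose value is not shown in this diff.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 300) -> List[str]:
    """Pack sentences greedily; hard-split any sentence longer than max_chars.

    Hypothetical helper: operates on plain strings rather than TextSegment.
    """
    chunks: List[str] = []
    current = ""
    for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
        # A sentence that cannot fit in any chunk is cut at the last space
        # before the limit, or mid-word when it contains no space at all.
        while len(sent) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            cut = sent.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            chunks.append(sent[:cut].strip())
            sent = sent[cut:].strip()
        if not sent:
            continue
        # Same greedy packing rule as the original loop.
        if len(current) + len(sent) + 1 <= max_chars:
            current += (" " if current else "") + sent
        else:
            if current:
                chunks.append(current)
            current = sent
    if current:
        chunks.append(current)
    return chunks
```

With this fallback every returned chunk is at most `max_chars` characters long, at the cost of occasionally breaking a sentence at a word boundary instead of at punctuation.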

requirements.txt ADDED

@@ -0,0 +1,10 @@

+gradio>=6.13.0,<7.0
+qwen-tts>=0.1.0
+torch>=2.2.0
+torchaudio>=2.2.0
+transformers>=4.40.0
+accelerate>=0.30.0
+huggingface-hub>=0.23.0
+soundfile>=0.12.0
+pydub>=0.25.0
+numpy>=1.26.0