jkorstad committed on
Commit a33bc67 · 1 Parent(s): 453cba9

Initial deploy: AudioBook Forge with Qwen3-TTS backend, character voice mapping, and dark Gradio UI

Files changed (4)
  1. README.md +41 -6
  2. app.py +531 -0
  3. backend.py +514 -0
  4. requirements.txt +10 -0
README.md CHANGED
@@ -1,15 +1,50 @@
  ---
- title: AudioBook
- emoji: 📉
- colorFrom: green
- colorTo: indigo
  sdk: gradio
  sdk_version: 6.13.0
  python_version: '3.12'
  app_file: app.py
  pinned: false
  license: apache-2.0
- short_description: Create audiobooks with various custom speaker voices
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: AudioBook Forge
+ emoji: 🎧
+ colorFrom: indigo
+ colorTo: cyan
  sdk: gradio
  sdk_version: 6.13.0
  python_version: '3.12'
  app_file: app.py
  pinned: false
  license: apache-2.0
+ short_description: High-fidelity audiobook generator with AI character voices using Qwen3-TTS
  ---

+ # AudioBook Forge
+
+ **Model-agnostic, high-fidelity audiobook generator** powered by [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS). Create audiobooks where every character speaks with their own unique voice.
+
+ ## Features
+
+ - 🎙️ **Character Voice Mapping**: Automatically detect characters from your story and assign unique voices to each one
+ - 🎭 **Three Voice Modes**
+   - **Preset**: 9 premium built-in speakers (English, Chinese, Japanese, Korean, dialects)
+   - **Clone**: Upload a 3–10 second voice sample to clone any voice
+   - **Design**: Describe a voice in text (e.g., "a raspy old man with a warm chuckle") and the AI creates it
+ - 📖 **Smart Text Processing**: Automatically distinguishes narration from dialogue and routes each segment to the correct voice
+ - 🌐 **Multilingual**: Supports 10 languages via Qwen3-TTS
+ - ⚡ **ZeroGPU**: Runs on Hugging Face ZeroGPU (free A100 compute)
+ - 🔧 **Model Agnostic**: Backend is swappable; upgrade to future SOTA TTS models without changing the UI
+
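The "Smart Text Processing" feature above can be illustrated with a minimal sketch. It is not the parser in `backend.py`; the function name `split_segments` and the simple straight-quote rule are assumptions made for illustration, and speaker attribution is left out.

```python
import re
from typing import List, Tuple

# Hypothetical rule: anything between straight double quotes is dialogue.
QUOTE_RE = re.compile(r'"([^"]+)"')


def split_segments(paragraph: str) -> List[Tuple[str, str]]:
    """Split a paragraph into ordered (segment_type, text) pairs."""
    segments: List[Tuple[str, str]] = []
    cursor = 0
    for match in QUOTE_RE.finditer(paragraph):
        # Text between the previous quote and this one is narration.
        before = paragraph[cursor:match.start()].strip()
        if before:
            segments.append(("narration", before))
        segments.append(("dialogue", match.group(1).strip()))
        cursor = match.end()
    # Whatever follows the last quote is narration as well.
    tail = paragraph[cursor:].strip()
    if tail:
        segments.append(("narration", tail))
    return segments


if __name__ == "__main__":
    for seg_type, text in split_segments('Alice smiled. "Hello there," she said.'):
        print(seg_type, "->", text)
```

Each `dialogue` segment would then be routed to a character voice and each `narration` segment to the narrator. Typographic quotes and nested quotes would need extra handling that this sketch omits.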
+ ## How to Use
+
+ 1. **Paste your story** in the 📖 Story Setup tab.
+ 2. **Extract characters** automatically with the 🔍 button (or add them manually).
+ 3. **Configure voices** in the 🎭 Voice Cast tab:
+    - Set the **Narrator** voice (preset, cloned, or designed)
+    - Assign a voice to each **Character**
+ 4. **Generate** in the ⚡ Generate tab and download your MP3 audiobook.
+
+ ## Architecture
+
+ - `app.py`: Gradio frontend with dark-themed custom UI
+ - `backend.py`: Model-agnostic TTS engine, dialogue parser, and audio stitcher
+ - **TTS Backend:** Qwen3-TTS 1.7B (CustomVoice + Base + VoiceDesign)
+ - **Text Processing:** Paragraph-aware chunking, sentence-boundary splitting, quote detection
+ - **Audio Pipeline:** Per-segment synthesis → crossfade stitching → peak normalization → MP3 export
+
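The audio pipeline listed above (crossfade stitching followed by peak normalization) can be sketched in plain Python. This is an illustration under assumptions, not the stitcher in `backend.py`: the helper names `crossfade`, `peak_normalize`, and `stitch` are hypothetical, clips are mono lists of floats, the fade is linear, and the 0.95 target peak is an arbitrary choice. The 80 ms default mirrors the `CROSSFADE_MS` constant in `backend.py`. MP3 export is left out because it needs an external encoder.

```python
from typing import List, Sequence


def crossfade(a: List[float], b: List[float], overlap: int) -> List[float]:
    """Join two clips, linearly blending the end of `a` into the start of `b`."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of `b`, rising from near 0 to near 1
        blended.append(a[len(a) - overlap + i] * (1.0 - w) + b[i] * w)
    return a[:-overlap] + blended + b[overlap:]


def peak_normalize(samples: Sequence[float], target_peak: float = 0.95) -> List[float]:
    """Scale the signal so its largest absolute sample equals `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def stitch(clips: Sequence[Sequence[float]], sample_rate: int, crossfade_ms: int = 80) -> List[float]:
    """Crossfade consecutive clips, then peak-normalize the result."""
    overlap = int(sample_rate * crossfade_ms / 1000)
    out: List[float] = []
    for clip in clips:
        out = crossfade(out, list(clip), overlap) if out else list(clip)
    return peak_normalize(out)


if __name__ == "__main__":
    mixed = stitch([[0.5] * 10, [0.25] * 10], sample_rate=100, crossfade_ms=20)
    print(len(mixed), max(mixed))
```

Each crossfade shortens the total length by the overlap, so long books lose a small amount of duration per joint. Normalizing once at the end, rather than per segment, keeps relative loudness between voices intact.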
+ ## License
+
+ The application code is licensed under Apache 2.0. The underlying Qwen3-TTS models are also Apache 2.0, so the whole stack can be used commercially.
app.py ADDED
@@ -0,0 +1,531 @@
+ """
+ AudioBook Forge - Gradio Frontend
+ High-fidelity audiobook generator with character voice mapping.
+ """
+
+ import os
+ import json
+ from pathlib import Path
+ from typing import Dict, List, Optional
+
+ import gradio as gr
+ import numpy as np
+ import soundfile as sf
+
+ from backend import (
+     AudiobookPipeline,
+     VoiceConfig,
+     PRESET_SPEAKERS,
+ )
+
+ # ---------------------------------------------------------------------------
+ # CSS & Theme
+ # ---------------------------------------------------------------------------
+
+ CUSTOM_CSS = """
+ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap');
+
+ body, .gradio-container {
+     font-family: 'Inter', sans-serif !important;
+     background: #0f172a !important;
+     color: #f8fafc !important;
+ }
+
+ .gradio-container {
+     max-width: 1200px !important;
+ }
+
+ .ab-header {
+     text-align: center;
+     padding: 2.2rem 1rem 1.8rem;
+     background: linear-gradient(135deg, rgba(99,102,241,0.12) 0%, rgba(34,211,238,0.06) 100%);
+     border-radius: 18px;
+     margin-bottom: 1.5rem;
+     border: 1px solid rgba(99,102,241,0.18);
+ }
+ .ab-header h1 {
+     font-size: 2.6rem;
+     font-weight: 700;
+     margin: 0;
+     background: linear-gradient(90deg, #a5b4fc, #22d3ee);
+     -webkit-background-clip: text;
+     -webkit-text-fill-color: transparent;
+ }
+ .ab-header p {
+     color: #94a3b8;
+     margin-top: 0.6rem;
+     font-size: 1.05rem;
+ }
+
+ .ab-card {
+     background: #1e293b !important;
+     border: 1px solid #334155 !important;
+     border-radius: 14px !important;
+     padding: 1.25rem !important;
+ }
+
+ button.primary {
+     background: linear-gradient(135deg, #6366f1, #4f46e5) !important;
+     border: none !important;
+     border-radius: 10px !important;
+     font-weight: 600 !important;
+     transition: all 0.2s ease !important;
+ }
+ button.primary:hover {
+     transform: translateY(-1px);
+     box-shadow: 0 4px 14px rgba(99,102,241,0.4) !important;
+ }
+ button.secondary {
+     background: #334155 !important;
+     border: 1px solid #475569 !important;
+     border-radius: 10px !important;
+     color: #f8fafc !important;
+ }
+
+ input, textarea, select {
+     background: #0f172a !important;
+     border: 1px solid #334155 !important;
+     border-radius: 8px !important;
+     color: #f8fafc !important;
+ }
+ input:focus, textarea:focus, select:focus {
+     border-color: #6366f1 !important;
+     box-shadow: 0 0 0 3px rgba(99,102,241,0.15) !important;
+ }
+
+ .gr-box, .gr-form {
+     background: #1e293b !important;
+     border-color: #334155 !important;
+ }
+ .gr-panel {
+     background: #1e293b !important;
+ }
+
+ .tabitem {
+     background: #1e293b !important;
+     border-color: #334155 !important;
+ }
+ """
+
+ # ---------------------------------------------------------------------------
+ # Global State
+ # ---------------------------------------------------------------------------
+
+ _pipeline: Optional[AudiobookPipeline] = None
+
+
+ def get_pipeline() -> AudiobookPipeline:
+     """Create the pipeline lazily on first use and reuse it afterwards."""
+     global _pipeline
+     if _pipeline is None:
+         # nvidia-smi exits with status 0 only when an NVIDIA driver and GPU are present.
+         device = "cuda" if os.system("nvidia-smi > /dev/null 2>&1") == 0 else "cpu"
+         _pipeline = AudiobookPipeline(device=device)
+     return _pipeline
+
+
+ # ---------------------------------------------------------------------------
+ # Helpers
+ # ---------------------------------------------------------------------------
+
+ def on_mode_change(mode: str) -> tuple:
+     # Visibility updates, in order: preset dropdown, audio upload, reference transcript, design text.
+     if mode == "preset":
+         return gr.update(visible=True), gr.update(visible=False), gr.update(visible=False), gr.update(visible=False)
+     elif mode == "clone":
+         return gr.update(visible=False), gr.update(visible=True), gr.update(visible=True), gr.update(visible=False)
+     else:
+         return gr.update(visible=False), gr.update(visible=False), gr.update(visible=False), gr.update(visible=True)
+
+
+ def extract_chars(text: str, use_ai: bool) -> tuple:
+     if not text or len(text.strip()) < 20:
+         return [], "Text too short. Please paste at least a paragraph."
+     pipe = get_pipeline()
+     chars = pipe.extract_characters(text, use_ai=use_ai)
+     status = f"Found {len(chars)} characters: {', '.join(c['name'] for c in chars)}" if chars else "No characters auto-detected. Add them manually below."
+     return chars, status
+
+
+ def _build_char_dict(
+     names, descs, modes, presets, audios, ref_texts, designs, instructs, langs
+ ) -> List[Dict]:
+     chars = []
+     for i in range(8):
+         if names[i]:
+             chars.append({
+                 "name": names[i],
+                 "description": descs[i] or "",
+                 "voice_mode": modes[i],
+                 "voice_preset": presets[i] if modes[i] == "preset" else None,
+                 "voice_ref_audio": audios[i] if modes[i] == "clone" else None,
+                 "voice_ref_text": ref_texts[i] if modes[i] == "clone" else None,
+                 "voice_design_desc": designs[i] if modes[i] == "design" else None,
+                 "voice_instruct": instructs[i] or "",
+                 "language": langs[i],
+             })
+     return chars
+
+
+ def generate_audiobook(
+     text,
+     nar_mode, nar_preset, nar_audio, nar_ref_text, nar_design, nar_instruct, nar_lang,
+     gen_temp, gen_seed,
+     *char_fields,
+ ):
+     if not text or len(text.strip()) < 50:
+         return None, "Error: Please provide at least 50 characters of story text."
+
+     # Gradio passes each input component as its own positional argument, so the
+     # nine 8-element character lists arrive flattened; regroup them in wiring order.
+     names, descs, modes, presets, audios, ref_texts, designs, instructs, langs = (
+         char_fields[i * 8:(i + 1) * 8] for i in range(9)
+     )
+
+     pipe = get_pipeline()
+
+     nar_cfg = VoiceConfig(
+         name="Narrator",
+         mode=nar_mode,
+         preset=nar_preset if nar_mode == "preset" else None,
+         ref_audio=nar_audio if nar_mode == "clone" and nar_audio else None,
+         ref_text=nar_ref_text if nar_mode == "clone" else None,
+         design_desc=nar_design if nar_mode == "design" else None,
+         instruct=nar_instruct,
+         language=nar_lang,
+     )
+
+     char_configs = {}
+     for i in range(8):
+         if not names[i]:
+             continue
+         vc = VoiceConfig(
+             name=names[i],
+             mode=modes[i],
+             preset=presets[i] if modes[i] == "preset" else None,
+             ref_audio=audios[i] if modes[i] == "clone" and audios[i] else None,
+             ref_text=ref_texts[i] if modes[i] == "clone" else None,
+             design_desc=designs[i] if modes[i] == "design" else None,
+             instruct=instructs[i] or "",
+             language=langs[i],
+         )
+         char_configs[names[i]] = vc
+
+     progress_text = ""
+
+     def prog_cb(ratio: float, msg: str):
+         nonlocal progress_text
+         progress_text = f"[{ratio*100:.0f}%] {msg}"
+         print(progress_text)
+
+     try:
+         output_path, _ = pipe.generate(
+             text=text,
+             narrator_config=nar_cfg,
+             character_configs=char_configs,
+             progress_callback=prog_cb,
+             temperature=gen_temp,
+             seed=int(gen_seed),
+         )
+         return output_path, "Done! Audiobook generated."
+     except Exception as e:
+         import traceback
+         traceback.print_exc()
+         return None, f"Error: {str(e)}"
+
+
+ def preview_narrator(mode, preset, audio, ref_text, design, instruct, lang):
+     pipe = get_pipeline()
+     vc = VoiceConfig(
+         name="Narrator",
+         mode=mode,
+         preset=preset if mode == "preset" else None,
+         ref_audio=audio if mode == "clone" and audio else None,
+         ref_text=ref_text if mode == "clone" else None,
+         design_desc=design if mode == "design" else None,
+         instruct=instruct,
+         language=lang,
+     )
+     try:
+         wav, sr = pipe.preview_voice(vc)
+         return (sr, wav), "Preview ready!"
+     except Exception as e:
+         import traceback
+         traceback.print_exc()
+         return None, f"Preview failed: {e}"
+
+
+ # ---------------------------------------------------------------------------
+ # Build UI
+ # ---------------------------------------------------------------------------
+
+ def build_app():
+     theme = gr.themes.Soft(
+         primary_hue="indigo",
+         secondary_hue="cyan",
+         neutral_hue="slate",
+     ).set(
+         body_background_fill="#0f172a",
+         body_background_fill_dark="#0f172a",
+         body_text_color="#f8fafc",
+         body_text_color_subdued="#94a3b8",
+         background_fill_primary="#1e293b",
+         background_fill_secondary="#0f172a",
+         border_color_accent="#334155",
+         color_accent_soft="#22d3ee",
+         button_primary_background_fill="linear-gradient(135deg, #6366f1, #4f46e5)",
+         button_primary_background_fill_hover="linear-gradient(135deg, #4f46e5, #4338ca)",
+         button_primary_text_color="#ffffff",
+         input_background_fill="#0f172a",
+         input_border_color="#334155",
+         block_title_text_color="#f8fafc",
+         block_label_text_color="#94a3b8",
+     )
+
+     with gr.Blocks(theme=theme, css=CUSTOM_CSS, title="AudioBook Forge") as demo:
+         gr.HTML("""
+         <div class="ab-header">
+             <h1>AudioBook Forge</h1>
+             <p>High-fidelity audiobooks with AI character voices. Model-agnostic TTS powered by Qwen3-TTS.</p>
+         </div>
+         """)
+
+         with gr.Tabs():
+             # ==================== TAB 1 ====================
+             with gr.TabItem("📖 Story Setup"):
+                 with gr.Row():
+                     with gr.Column(scale=2):
+                         story_input = gr.TextArea(
+                             label="Story Text",
+                             placeholder="Paste your book chapter, short story, or script here...",
+                             lines=20,
+                             max_lines=40,
+                         )
+                     with gr.Column(scale=1):
+                         gr.Markdown("### Character Detection")
+                         use_ai_check = gr.Checkbox(
+                             label="Use AI enhancement (slower, more accurate)",
+                             value=False,
+                         )
+                         extract_btn = gr.Button("🔍 Extract Characters", variant="primary")
+                         gr.Markdown("---")
+                         gr.Markdown("**Tips:**")
+                         gr.Markdown("- Use `Character: \"dialogue\"` format for best results.")
+                         gr.Markdown("- Or standard prose with quoted dialogue.")
+                         gr.Markdown("- AI mode uses a small LLM for deeper analysis.")
+
+                 extract_status = gr.Textbox(label="Status", interactive=False)
+
+                 # Hidden states to hold character data
+                 char_state = gr.State(value=[])
+
+             # ==================== TAB 2 ====================
+             with gr.TabItem("🎭 Voice Cast"):
+                 with gr.Row():
+                     with gr.Column(scale=1):
+                         gr.Markdown("## Narrator")
+                         with gr.Column(elem_classes="ab-card"):
+                             nar_mode = gr.Dropdown(
+                                 choices=["preset", "clone", "design"],
+                                 value="preset",
+                                 label="Narrator Mode",
+                             )
+                             nar_preset = gr.Dropdown(
+                                 choices=list(PRESET_SPEAKERS.keys()),
+                                 value="Ryan",
+                                 label="Preset Voice",
+                             )
+                             nar_audio = gr.Audio(
+                                 label="Upload Voice Sample (3–10s)",
+                                 type="filepath",
+                                 visible=False,
+                             )
+                             nar_ref_text = gr.Textbox(
+                                 label="Reference Transcript",
+                                 placeholder="What does the reference audio say?",
+                                 visible=False,
+                             )
+                             nar_design = gr.TextArea(
+                                 label="Voice Description",
+                                 placeholder="e.g. A warm, raspy baritone with a slight British accent.",
+                                 visible=False,
+                                 lines=2,
+                             )
+                             nar_instruct = gr.Textbox(
+                                 label="Style Instruction",
+                                 placeholder="e.g. Calm, measured storytelling pace.",
+                             )
+                             nar_lang = gr.Dropdown(
+                                 choices=["English", "Chinese", "Japanese", "Korean", "German", "French", "Spanish", "Italian", "Portuguese", "Russian"],
+                                 value="English",
+                                 label="Language",
+                             )
+                             nar_preview_btn = gr.Button("🔊 Preview Narrator", variant="secondary")
+                             nar_preview_audio = gr.Audio(label="Preview", interactive=False)
+                             nar_preview_status = gr.Textbox(show_label=False, interactive=False)
+
+                         nar_mode.change(
+                             on_mode_change,
+                             inputs=nar_mode,
+                             outputs=[nar_preset, nar_audio, nar_ref_text, nar_design],
+                         )
+                         nar_preview_btn.click(
+                             preview_narrator,
+                             inputs=[nar_mode, nar_preset, nar_audio, nar_ref_text, nar_design, nar_instruct, nar_lang],
+                             outputs=[nar_preview_audio, nar_preview_status],
+                         )
+
+                     with gr.Column(scale=2):
+                         gr.Markdown("## Character Voices")
+                         gr.Markdown("Configure up to 8 characters. Use **preset** for built-in speakers, **clone** to upload a voice sample, or **design** to describe a voice from text.")
+
+                         # Character rows: create 8 static rows and toggle their visibility
+                         char_names = []
+                         char_descs = []
+                         char_modes = []
+                         char_presets = []
+                         char_audios = []
+                         char_ref_texts = []
+                         char_designs = []
+                         char_instructs = []
+                         char_langs = []
+                         char_rows = []
+
+                         for i in range(8):
+                             visible_default = (i == 0)
+                             with gr.Group(visible=visible_default) as row:
+                                 with gr.Row():
+                                     cn = gr.Textbox(label="Name", placeholder="e.g. Alice", visible=visible_default)
+                                     cd = gr.Textbox(label="Description", placeholder="Personality note", visible=visible_default)
+                                     cm = gr.Dropdown(label="Mode", choices=["preset", "clone", "design"], value="preset", visible=visible_default)
+                                     cp = gr.Dropdown(label="Preset", choices=list(PRESET_SPEAKERS.keys()), value="Ryan", visible=visible_default)
+                                 with gr.Row():
+                                     ca = gr.Audio(label="Voice Sample", type="filepath", visible=False)
+                                     crt = gr.Textbox(label="Ref Transcript", placeholder="What the sample says", visible=False)
+                                     cdes = gr.TextArea(label="Voice Description", placeholder="e.g. A shrill, nervous teenager.", visible=False, lines=2)
+                                     cinstr = gr.Textbox(label="Style Instruction", placeholder="e.g. Angry and loud.", visible=visible_default)
+                                     cl = gr.Dropdown(label="Language", choices=["English", "Chinese", "Japanese", "Korean", "German", "French", "Spanish", "Italian", "Portuguese", "Russian"], value="English", visible=visible_default)
+
+                             cm.change(
+                                 on_mode_change,
+                                 inputs=cm,
+                                 outputs=[cp, ca, crt, cdes],
+                             )
+
+                             char_rows.append(row)
+                             char_names.append(cn)
+                             char_descs.append(cd)
+                             char_modes.append(cm)
+                             char_presets.append(cp)
+                             char_audios.append(ca)
+                             char_ref_texts.append(crt)
+                             char_designs.append(cdes)
+                             char_instructs.append(cinstr)
+                             char_langs.append(cl)
+
+             # ==================== TAB 3 ====================
+             with gr.TabItem("⚡ Generate"):
+                 with gr.Row():
+                     with gr.Column(scale=1):
+                         gr.Markdown("### Settings")
+                         gen_temp = gr.Slider(minimum=0.1, maximum=1.0, value=0.7, step=0.05, label="Temperature")
+                         gen_seed = gr.Number(value=42, precision=0, label="Seed (fix for consistency)")
+                         gen_btn = gr.Button("▶️ Generate Audiobook", variant="primary", size="lg")
+                         gen_progress = gr.Textbox(label="Progress", interactive=False, value="Ready.")
+
+                     with gr.Column(scale=2):
+                         gr.Markdown("### Output")
+                         output_audio = gr.Audio(label="Generated Audiobook", type="filepath", interactive=False)
+                         output_status = gr.Textbox(label="Status", interactive=False)
+
+             # ==================== TAB 4 ====================
+             with gr.TabItem("ℹ️ About"):
+                 gr.Markdown("""
+                 ## AudioBook Forge
+
+                 **Model-agnostic, high-fidelity audiobook generation** using state-of-the-art open TTS.
+
+                 ### Current Backend: Qwen3-TTS
+                 - **1.7B CustomVoice**: 9 premium preset speakers with style control
+                 - **1.7B Base**: High-quality voice cloning from 3–10 second samples
+                 - **1.7B VoiceDesign**: Create voices from text descriptions
+                 - **10 languages** supported
+                 - **Apache 2.0** license, commercially usable
+
+                 ### Workflow
+                 1. **Paste your story** in the Story Setup tab.
+                 2. **Extract characters** automatically or define them manually.
+                 3. **Assign voices**: choose presets, upload samples for cloning, or describe voices.
+                 4. **Generate**: the engine detects narration vs dialogue and routes each segment to the right voice.
+                 5. **Download** your finished audiobook as MP3.
+
+                 ### Architecture
+                 The TTS engine is fully model-agnostic. Swapping to a future SOTA model only requires updating the backend adapter.
+
+                 ### Tips for Best Quality
+                 - Use clean, noise-free voice samples for cloning (3–10 seconds).
+                 - Keep reference transcripts accurate, since they guide the cloning quality.
+                 - Lower temperature (0.5–0.6) for stable narration; higher (0.8–0.9) for expressive dialogue.
+                 - Use a fixed seed across chunks to prevent voice drift.
+                 """)
+
+         # ---------- Extract wiring ----------
+         def do_extract(text, use_ai):
+             chars, status = extract_chars(text, use_ai)
+             # Build visibility updates: 10 per character row, in the same order as `outputs` below
+             updates = []
+             for i in range(8):
+                 if i < len(chars):
+                     updates.extend([
+                         gr.update(visible=True),  # row
+                         gr.update(value=chars[i].get("name", ""), visible=True),
+                         gr.update(value=chars[i].get("description", ""), visible=True),
+                         gr.update(value=chars[i].get("voice_mode", "preset"), visible=True),
+                         gr.update(value=chars[i].get("voice_preset", "Ryan"), visible=True),
+                         gr.update(visible=False),  # audio
+                         gr.update(visible=False),  # ref text
+                         gr.update(visible=False),  # design
+                         gr.update(value=chars[i].get("voice_instruct", ""), visible=True),
+                         gr.update(value=chars[i].get("language", "English"), visible=True),
+                     ])
+                 else:
+                     # Hide the row and all nine of its fields
+                     updates.extend([gr.update(visible=False) for _ in range(10)])
+             return [status] + updates
+
+         extract_btn.click(
+             do_extract,
+             inputs=[story_input, use_ai_check],
+             outputs=[extract_status] + [
+                 item for sublist in [
+                     [char_rows[i], char_names[i], char_descs[i], char_modes[i], char_presets[i],
+                      char_audios[i], char_ref_texts[i], char_designs[i], char_instructs[i], char_langs[i]]
+                     for i in range(8)
+                 ] for item in sublist
+             ],
+         )
+
+         # ---------- Generate wiring ----------
+         all_char_inputs = (
+             char_names + char_descs + char_modes + char_presets +
+             char_audios + char_ref_texts + char_designs + char_instructs + char_langs
+         )
+
+         gen_btn.click(
+             generate_audiobook,
+             inputs=[
+                 story_input,
+                 nar_mode, nar_preset, nar_audio, nar_ref_text, nar_design, nar_instruct, nar_lang,
+                 gen_temp, gen_seed,
+             ] + all_char_inputs,
+             outputs=[output_audio, output_status],
+         )
+
+     return demo
+
+
+ demo = build_app()
+
+ if __name__ == "__main__":
+     demo.launch(server_name="0.0.0.0", server_port=7860)
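The generate wiring above hands Gradio one flat list of input components, and Gradio calls the handler with one positional argument per component. The nine 8-element character lists therefore reach the handler as 72 separate values laid out field by field. The following is a minimal sketch of regrouping that flat tuple into per-character records. The helper name `regroup` and the `FIELDS` labels are hypothetical and chosen for illustration; only the field-major ordering is taken from `all_char_inputs`.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical labels, listed in the same order the component lists are concatenated.
FIELDS = ["name", "description", "mode", "preset", "audio",
          "ref_text", "design", "instruct", "language"]


def regroup(flat: Sequence[Any], rows: int = 8,
            fields: Sequence[str] = FIELDS) -> List[Dict[str, Any]]:
    """Turn a field-major flat sequence into one dict per non-empty character row."""
    if len(flat) != rows * len(fields):
        raise ValueError(f"expected {rows * len(fields)} values, got {len(flat)}")
    # Slice out one column per field: all names, then all descriptions, and so on.
    columns = [flat[i * rows:(i + 1) * rows] for i in range(len(fields))]
    records = []
    for row in zip(*columns):
        record = dict(zip(fields, row))
        if record["name"]:  # rows without a name are unused slots
            records.append(record)
    return records


if __name__ == "__main__":
    names = ["Alice", "Bob"] + [""] * 6
    rest = [[f"{field}{i}" for i in range(8)] for field in FIELDS[1:]]
    flat = tuple(names) + tuple(value for column in rest for value in column)
    for record in regroup(flat):
        print(record["name"], record["mode"], record["language"])
```

A handler can accept the trailing values with `*args` and pass them to a helper like this, which keeps the handler signature independent of the number of character rows.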
backend.py ADDED
@@ -0,0 +1,514 @@
+ """
+ AudioBook Forge - Backend
+ Model-agnostic TTS engine with Qwen3-TTS support.
+ Character extraction, dialogue parsing, and audio stitching.
+ """
+
+ import os
+ import re
+ import json
+ import hashlib
+ import tempfile
+ from pathlib import Path
+ from typing import List, Dict, Optional, Tuple, Any
+ from dataclasses import dataclass, field
+ from collections import defaultdict
+ import warnings
+
+ import numpy as np
+ import soundfile as sf
+ from pydub import AudioSegment
+
+ warnings.filterwarnings("ignore")
+
+ # ---------------------------------------------------------------------------
+ # Configuration
+ # ---------------------------------------------------------------------------
+ PRESET_SPEAKERS = {
+     "Ryan": {"lang": "English", "desc": "Dynamic, expressive male"},
+     "Aiden": {"lang": "English", "desc": "Sunny, warm male"},
+     "Serena": {"lang": "Chinese", "desc": "Young female (Chinese)"},
+     "Vivian": {"lang": "Chinese", "desc": "Young female (Chinese)"},
+     "Uncle_Fu": {"lang": "Chinese", "desc": "Seasoned elder male (Chinese)"},
+     "Ono_Anna": {"lang": "Japanese", "desc": "Playful female (Japanese)"},
+     "Sohee": {"lang": "Korean", "desc": "Warm female (Korean)"},
+     "Dylan": {"lang": "Chinese", "desc": "Beijing dialect male"},
+     "Eric": {"lang": "Chinese", "desc": "Sichuan dialect male"},
+ }
+
+ MAX_CHUNK_CHARS = 380
+ MIN_CHUNK_CHARS = 80
+ CROSSFADE_MS = 80
+
+ # ---------------------------------------------------------------------------
+ # Data Classes
+ # ---------------------------------------------------------------------------
+
+ @dataclass
+ class VoiceConfig:
+     name: str = "Narrator"
+     mode: str = "preset"  # preset | clone | design
+     preset: Optional[str] = None  # e.g., "Ryan"
+     ref_audio: Optional[str] = None
+     ref_text: Optional[str] = None
+     design_desc: Optional[str] = None
+     instruct: str = ""  # style instruction
+     language: str = "English"
+
+
+ @dataclass
+ class TextSegment:
+     text: str
+     seg_type: str  # narration | dialogue
+     speaker: Optional[str] = None
+     emotion_hint: Optional[str] = None
+
+
+ @dataclass
+ class CharacterProfile:
+     name: str
+     description: str = ""
+     voice: VoiceConfig = field(default_factory=VoiceConfig)
+     occurrences: int = 0
+
+
+ # ---------------------------------------------------------------------------
+ # TTS Engine (Model-Agnostic Wrapper)
+ # ---------------------------------------------------------------------------
+
+ class TTSEngine:
+     """
+     Model-agnostic TTS engine.
+     Currently backed by Qwen3-TTS. Swappable architecture.
+     """
+
+     def __init__(self, device: str = "cuda"):
+         self.device = device
+         self._custom_voice_model = None
+         self._base_model = None
+         self._design_model = None
+         self._model_ids = {
+             "custom": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
+             "base": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
+             "design": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
+         }
+         self._cache_dir = Path(tempfile.gettempdir()) / "audiobook_cache"
+         self._cache_dir.mkdir(exist_ok=True)
+
+     def _load_custom_voice(self):
+         if self._custom_voice_model is not None:
+             return self._custom_voice_model
+         try:
+             from qwen_tts import Qwen3TTSModel
+             import torch
+             print("[TTS] Loading CustomVoice model...")
+             self._custom_voice_model = Qwen3TTSModel.from_pretrained(
+                 self._model_ids["custom"],
+                 device_map=self.device,
+                 dtype=torch.bfloat16,
+             )
+             print("[TTS] CustomVoice ready.")
+         except Exception as e:
+             print(f"[TTS] CustomVoice load failed: {e}")
+             raise
+         return self._custom_voice_model
+
+     def _load_base(self):
+         if self._base_model is not None:
+             return self._base_model
+         try:
+             from qwen_tts import Qwen3TTSModel
+             import torch
+             print("[TTS] Loading Base (clone) model...")
+             self._base_model = Qwen3TTSModel.from_pretrained(
+                 self._model_ids["base"],
+                 device_map=self.device,
+                 dtype=torch.bfloat16,
+             )
+             print("[TTS] Base ready.")
+         except Exception as e:
+             print(f"[TTS] Base load failed: {e}")
+             raise
+         return self._base_model
+
+     def _load_design(self):
+         if self._design_model is not None:
+             return self._design_model
+         try:
+             from qwen_tts import Qwen3TTSModel
+             import torch
+             print("[TTS] Loading VoiceDesign model...")
+             self._design_model = Qwen3TTSModel.from_pretrained(
+                 self._model_ids["design"],
+                 device_map=self.device,
+                 dtype=torch.bfloat16,
+             )
+             print("[TTS] VoiceDesign ready.")
+         except Exception as e:
+             print(f"[TTS] VoiceDesign load failed: {e}")
+             raise
+         return self._design_model
+
+     def _cache_key(self, text: str, voice: VoiceConfig, temperature: float, seed: int) -> str:
+         # Hash every input that changes the audio, so a new seed, temperature,
+         # or reference transcript is never served a stale cached clip.
+         payload = (
+             f"{text}|{voice.mode}|{voice.preset}|{voice.ref_audio}|{voice.ref_text}|"
+             f"{voice.design_desc}|{voice.instruct}|{voice.language}|{temperature}|{seed}"
+         )
+         return hashlib.md5(payload.encode()).hexdigest()
+
+     def _cached_path(self, key: str) -> Path:
+         return self._cache_dir / f"{key}.wav"
+
+     def synthesize(
+         self,
+         text: str,
+         voice: VoiceConfig,
+         temperature: float = 0.7,
+         seed: int = 42,
+     ) -> Tuple[np.ndarray, int]:
+         """Generate audio for a text chunk. Returns (audio_array, sample_rate)."""
+         cache_key = self._cache_key(text, voice, temperature, seed)
+         cache_path = self._cached_path(cache_key)
+         if cache_path.exists():
+             audio, sr = sf.read(str(cache_path))
+             return audio, sr
+
+         if voice.mode == "preset":
+             model = self._load_custom_voice()
+             wavs, sr = model.generate_custom_voice(
+                 text=text,
+                 language=voice.language,
+                 speaker=voice.preset or "Ryan",
+                 instruct=voice.instruct or "Narrate clearly and expressively.",
+                 temperature=temperature,
+                 seed=seed,
+             )
+         elif voice.mode == "clone":
+             model = self._load_base()
+             if not voice.ref_audio or not Path(voice.ref_audio).exists():
+                 raise ValueError("Clone mode requires ref_audio path.")
+             wavs, sr = model.generate_voice_clone(
+                 text=text,
+                 language=voice.language,
+                 ref_audio=voice.ref_audio,
+                 ref_text=voice.ref_text or text[:100],
+                 temperature=temperature,
+                 seed=seed,
+             )
+         elif voice.mode == "design":
+             model = self._load_design()
+             desc = voice.design_desc or "A clear, expressive narrator voice."
+             wavs, sr = model.generate_voice_design(
+                 text=text,
+                 language=voice.language,
+                 instruct=desc,
+                 temperature=temperature,
+                 seed=seed,
+             )
+         else:
+             raise ValueError(f"Unknown voice mode: {voice.mode}")
+
+         # Handle stereo or list returns
+         if isinstance(wavs, list):
+             wavs = wavs[0]
+         if wavs.ndim > 1:
+             wavs = wavs.mean(axis=1)
+
+         sf.write(str(cache_path), wavs, sr)
+         return wavs, sr
+
+     def status(self) -> Dict[str, Any]:
+         return {
+             "custom_loaded": self._custom_voice_model is not None,
+             "base_loaded": self._base_model is not None,
+             "design_loaded": self._design_model is not None,
+         }
223
+
224
+
+ # ---------------------------------------------------------------------------
+ # Text Processing
+ # ---------------------------------------------------------------------------
+
+ class TextProcessor:
+     """Extract characters, parse dialogue, chunk text."""
+
+     DIALOGUE_RE = re.compile(
+         r'(?:^|[.!?\n]\s+)\s*"([^"]{3,500})"'  # quoted dialogue
+     )
+     SPEAKER_RE = re.compile(
+         r'(?:^|\n)\s*([A-Z][a-zA-Z\s]{1,20})(?:\s*[:\-–])\s*"([^"]+)"'
+     )
+     NAME_RE = re.compile(
+         r'\b([A-Z][a-z]{1,15})\b'
+     )
+
+     @staticmethod
+     def extract_characters(text: str, use_ai: bool = False) -> List[CharacterProfile]:
+         """Extract character names and basic stats from text."""
+         profiles: Dict[str, CharacterProfile] = {}
+
+         # Pattern: Name: "dialogue"
+         for match in TextProcessor.SPEAKER_RE.finditer(text):
+             name = match.group(1).strip()
+             if len(name) > 2:
+                 if name not in profiles:
+                     profiles[name] = CharacterProfile(name=name)
+                 profiles[name].occurrences += 1
+
+         # Pattern: quoted dialogue near "he said / she said"
+         for match in TextProcessor.DIALOGUE_RE.finditer(text):
+             quote = match.group(1)
+             before = text[max(0, match.start() - 120):match.start()]
+             said_match = re.search(r'([A-Z][a-z]{1,15})\s+(?:said|cried|shouted|whispered|replied|asked)', before)
+             if said_match:
+                 name = said_match.group(1)
+                 if name not in profiles:
+                     profiles[name] = CharacterProfile(name=name)
+                 profiles[name].occurrences += 1
+
+         # Fallback: capitalized names appearing frequently
+         all_names = TextProcessor.NAME_RE.findall(text)
+         from collections import Counter
+         common = Counter(all_names).most_common(30)
+         for name, count in common:
+             if count >= 3 and len(name) > 2 and name not in profiles:
+                 # Filter common words
+                 if name.lower() in {
+                     "the", "and", "but", "for", "are", "was", "were", "had", "have",
+                     "has", "his", "her", "she", "him", "they", "them", "said", "with",
+                     "from", "that", "this", "what", "when", "where", "would", "could",
+                     "should",
+                 }:
+                     continue
+                 profiles[name] = CharacterProfile(name=name, occurrences=count)
+
+         result = sorted(profiles.values(), key=lambda p: p.occurrences, reverse=True)
+         return result[:12]  # Cap at 12 characters
+
+     @staticmethod
+     def segment_text(text: str, characters: List[str]) -> List[TextSegment]:
+         """Split text into narration/dialogue segments."""
+         segments = []
+         # Normalize newlines
+         text = text.replace("\r\n", "\n").replace("\r", "\n")
+
+         # Split by paragraphs first
+         paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
+
+         for para in paragraphs:
+             # Check if paragraph starts with Character: "dialogue"
+             speaker_match = re.match(r'^([A-Z][a-zA-Z\s]{1,20})[:\-–]\s*"([^"]+)"', para)
+             if speaker_match:
+                 speaker = speaker_match.group(1).strip()
+                 dialogue = speaker_match.group(2)
+                 segments.append(TextSegment(text=dialogue, seg_type="dialogue", speaker=speaker))
+                 # Remainder of paragraph as narration
+                 remainder = para[speaker_match.end():].strip()
+                 if remainder:
+                     segments.append(TextSegment(text=remainder, seg_type="narration"))
+                 continue
+
+             # Check for inline quotes: re.split with a capture group puts
+             # quoted text at odd indices and the surrounding text at even ones
+             parts = re.split(r'"([^"]{3,500})"', para)
+             for i, part in enumerate(parts):
+                 part = part.strip()
+                 if not part:
+                     continue
+                 if i % 2 == 1:
+                     # This was inside quotes.
+                     # Speaker attribution from surrounding text is not implemented
+                     # here (`characters` is unused), so the speaker stays None.
+                     speaker = None
+                     segments.append(TextSegment(text=part, seg_type="dialogue", speaker=speaker))
+                 else:
+                     segments.append(TextSegment(text=part, seg_type="narration"))
+
+         # Merge adjacent narration segments
+         merged = []
+         for seg in segments:
+             if merged and seg.seg_type == "narration" and merged[-1].seg_type == "narration":
+                 merged[-1].text += " " + seg.text
+             else:
+                 merged.append(seg)
+         return merged
+
+     @staticmethod
+     def chunk_segments(segments: List[TextSegment], max_chars: int = MAX_CHUNK_CHARS) -> List[TextSegment]:
+         """Break long segments into smaller chunks at sentence boundaries."""
+         result = []
+         for seg in segments:
+             if len(seg.text) <= max_chars:
+                 result.append(seg)
+                 continue
+             # Split into sentences
+             sentences = re.split(r'(?<=[.!?])\s+', seg.text)
+             current_text = ""
+             current_speaker = seg.speaker
+             current_type = seg.seg_type
+             for sent in sentences:
+                 if len(current_text) + len(sent) + 1 <= max_chars:
+                     current_text += (" " if current_text else "") + sent
+                 else:
+                     if current_text:
+                         result.append(TextSegment(text=current_text.strip(), seg_type=current_type, speaker=current_speaker))
+                     current_text = sent
+             if current_text:
+                 result.append(TextSegment(text=current_text.strip(), seg_type=current_type, speaker=current_speaker))
+         return result
+
+
+ # ---------------------------------------------------------------------------
+ # Audio Utils
+ # ---------------------------------------------------------------------------
+
+ def stitch_audio(paths: List[str], crossfade_ms: int = CROSSFADE_MS) -> AudioSegment:
+     """Concatenate WAV files with crossfade."""
+     if not paths:
+         return AudioSegment.silent(duration=0)
+     combined = AudioSegment.from_wav(paths[0])
+     for p in paths[1:]:
+         next_seg = AudioSegment.from_wav(p)
+         # Simple overlap crossfade
+         if crossfade_ms > 0 and len(combined) > crossfade_ms and len(next_seg) > crossfade_ms:
+             combined = combined.append(next_seg, crossfade=crossfade_ms)
+         else:
+             combined += next_seg
+     return combined
+
+
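+ def normalize_audio(audio: AudioSegment, target_dBFS: float = -1.5) -> AudioSegment:
+     """Peak normalize audio."""
+     # Digital silence peaks at -inf dBFS; skip the gain step rather than
+     # request an infinite boost (possible if every segment fell back to silence).
+     if audio.max_dBFS == float("-inf"):
+         return audio
+     change = target_dBFS - audio.max_dBFS
+     return audio.apply_gain(change)
+
+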
+ def save_audiobook(segments_paths: List[str], output_path: str, title: str = "Audiobook") -> str:
+     """Stitch segments and export final audiobook."""
+     if not segments_paths:
+         return ""
+     combined = stitch_audio(segments_paths)
+     combined = normalize_audio(combined)
+     combined.export(output_path, format="mp3", bitrate="192k", tags={"title": title, "artist": "AudioBook Forge"})
+     return output_path
+
+
+ # ---------------------------------------------------------------------------
+ # Optional: AI Character Extraction via HF Inference
+ # ---------------------------------------------------------------------------
+
+ def ai_extract_characters(text: str, api_token: Optional[str] = None) -> List[CharacterProfile]:
+     """Use a small HF model to extract characters with descriptions."""
+     try:
+         from huggingface_hub import InferenceClient
+         client = InferenceClient(token=api_token or os.getenv("HF_TOKEN"))
+
+         # Truncate text for context window
+         sample = text[:4000] + ("\n...[truncated]" if len(text) > 4000 else "")
+
+         prompt = (
+             "Extract all named characters from the following story excerpt. "
+             "For each character, provide their name and a brief description of their personality/role. "
+             "Return ONLY a JSON array like: [{\"name\":\"Alice\",\"description\":\"Curious young girl\"},...]\n\n"
+             f"STORY:\n{sample}\n\nJSON:"
+         )
+
+         response = client.text_generation(
+             model="Qwen/Qwen3-1.7B",
+             prompt=prompt,
+             max_new_tokens=512,
+             temperature=0.3,
+             return_full_text=False,
+         )
+
+         # Extract JSON from response
+         json_match = re.search(r'\[.*?\]', response, re.DOTALL)
+         if json_match:
+             data = json.loads(json_match.group())
+             profiles = []
+             for item in data:
+                 name = item.get("name", "")
+                 desc = item.get("description", "")
+                 if name:
+                     profiles.append(CharacterProfile(name=name, description=desc))
+             return profiles
+     except Exception as e:
+         print(f"[AI Extraction] Failed: {e}")
+     return []
+
+
+ # ---------------------------------------------------------------------------
+ # Main Pipeline
+ # ---------------------------------------------------------------------------
+
+ class AudiobookPipeline:
+     def __init__(self, device: str = "cuda"):
+         self.tts = TTSEngine(device=device)
+         self.processor = TextProcessor()
+         self.temp_dir = Path(tempfile.gettempdir()) / "audiobook_segments"
+         self.temp_dir.mkdir(exist_ok=True)
+
+     def extract_characters(self, text: str, use_ai: bool = False) -> List[Dict]:
+         if use_ai:
+             profiles = ai_extract_characters(text)
+             if not profiles:
+                 profiles = self.processor.extract_characters(text)
+         else:
+             profiles = self.processor.extract_characters(text)
+         return [
+             {
+                 "name": p.name,
+                 "description": p.description,
+                 "occurrences": p.occurrences,
+                 "voice_mode": "preset",
+                 "voice_preset": "Ryan",
+                 "voice_instruct": "",
+             }
+             for p in profiles
+         ]
+
+     def generate(
+         self,
+         text: str,
+         narrator_config: VoiceConfig,
+         character_configs: Dict[str, VoiceConfig],
+         progress_callback=None,
+         temperature: float = 0.7,
+         seed: int = 42,
+     ) -> Tuple[str, List[str]]:
+         """
+         Generate audiobook.
+         Returns (final_mp3_path, list_of_segment_wav_paths).
+         """
+         segments = self.processor.segment_text(text, list(character_configs.keys()))
+         segments = self.processor.chunk_segments(segments)
+
+         segment_paths = []
+         total = len(segments)
+
+         for i, seg in enumerate(segments):
+             if progress_callback:
+                 progress_callback(i / total, f"Generating segment {i+1}/{total} ({seg.seg_type})...")
+
+             # Determine voice
+             if seg.seg_type == "dialogue" and seg.speaker and seg.speaker in character_configs:
+                 voice = character_configs[seg.speaker]
+             else:
+                 voice = narrator_config
+
+             try:
+                 wav, sr = self.tts.synthesize(seg.text, voice, temperature=temperature, seed=seed)
+                 seg_path = self.temp_dir / f"seg_{i:04d}_{voice.name}.wav"
+                 sf.write(str(seg_path), wav, sr)
+                 segment_paths.append(str(seg_path))
+             except Exception as e:
+                 print(f"[Pipeline] Segment {i} failed: {e}")
+                 # Insert silence to maintain timing
+                 silent = AudioSegment.silent(duration=500)
+                 seg_path = self.temp_dir / f"seg_{i:04d}_silent.wav"
+                 silent.export(str(seg_path), format="wav")
+                 segment_paths.append(str(seg_path))
+
+         if progress_callback:
+             progress_callback(1.0, "Stitching final audiobook...")
+
+         output_path = str(self.temp_dir / "audiobook_final.mp3")
+         save_audiobook(segment_paths, output_path, title="Generated Audiobook")
+         return output_path, segment_paths
+
+     def preview_voice(
+         self,
+         voice: VoiceConfig,
+         sample_text: str = "Hello, this is a preview of my voice. I hope you enjoy the story.",
+     ) -> Tuple[np.ndarray, int]:
+         return self.tts.synthesize(sample_text, voice, temperature=0.7, seed=42)
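
Reviewer's sketch (not part of the commit): the two text-handling steps in `TextProcessor` can be tried in isolation. The standalone helpers below, `split_quotes` and `chunk_sentences`, are hypothetical names introduced only for illustration; they reuse the same regular expressions as `segment_text` and `chunk_segments` but return plain tuples and strings instead of `TextSegment` objects. Treat this as a minimal sketch under those assumptions, not the author's implementation.

```python
import re
from typing import List, Tuple

# Same quote pattern as segment_text: 3 to 500 characters between double quotes.
QUOTE_RE = re.compile(r'"([^"]{3,500})"')


def split_quotes(paragraph: str) -> List[Tuple[str, str]]:
    """Return (kind, text) pairs; kind is 'narration' or 'dialogue'.

    re.split with one capture group alternates unquoted text (even
    indices) with the captured quote contents (odd indices).
    """
    out = []
    for i, piece in enumerate(QUOTE_RE.split(paragraph)):
        piece = piece.strip()
        if not piece:
            continue
        out.append(("dialogue" if i % 2 == 1 else "narration", piece))
    return out


def chunk_sentences(text: str, max_chars: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is emitted unsplit, which
    mirrors the behaviour of chunk_segments in the diff.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ""
    for sent in sentences:
        if len(current) + len(sent) + 1 <= max_chars:
            current += (" " if current else "") + sent
        else:
            if current:
                chunks.append(current)
            current = sent
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(split_quotes('Mara sighed. "We are late," she said.'))
    print(chunk_sentences("One two. Three four. Five six.", max_chars=20))
```

Running the module prints the narration/dialogue split for one paragraph and the packed sentence chunks, which makes it easy to check how a given `max_chars` value affects segment boundaries before any TTS model is loaded.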
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ gradio>=6.13.0,<7.0
+ qwen-tts>=0.1.0
+ torch>=2.2.0
+ torchaudio>=2.2.0
+ transformers>=4.40.0
+ accelerate>=0.30.0
+ huggingface-hub>=0.23.0
+ soundfile>=0.12.0
+ pydub>=0.25.0
+ numpy>=1.26.0