Manmay committed
Commit be91ea3 · verified · 1 Parent(s): af0e7aa

Rewrite model card: proper inference docs, full parameter table, working audio embeds

  1. README.md +115 -85
README.md CHANGED
@@ -22,132 +22,162 @@ base_model_relation: finetune
22
  > **Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks.**
23
  > Dramabox is **Resemble AI's** expressive TTS, trained on top of the LTX-2.3 audio branch under the LTX-2 Community License. Huge thanks to the Lightricks team for open-sourcing the base.
24
 
25
- Dramabox generates expressive, emotionally rich speech from scene descriptions with optional voice cloning. It is an IC-LoRA fine-tune of the LTX-2.3 audio-only branch, a 3.3B Diffusion Transformer with flow matching, conditioned on Gemma 3 12B text embeddings.
26
 
27
- ## Audio Samples
 
 
 
 
 
 
28
 
29
- ### Regal Queen - Cold Fury to Venomous Whisper
30
 
31
- **Prompt:** A regal woman speaks with cold fury in a measured, low voice. She sighs deeply, "I have told you a thousand times, and yet here we are again." Her voice sharpens with rising anger, "Do you honestly think I enjoy repeating myself?! Do you?!" She lets out a cold, mocking laugh, "Hahaha, how utterly pathetic you are." She drops to a venomous whisper, leaning close, "Now get out of my sight before I do something we will both regret."
32
 
33
- <audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/01_queen_sighs_rage.wav"></audio>
 
34
 
35
- ### Catgirl - Uncontrollable Giggling
36
 
37
- **Prompt:** A playful girl speaks in a bright, singsong voice, already mid-giggle, "Hehehe, oh my gosh you should see your face right now, it is priceless!" She gasps for air between giggles, "Oh my, hehe, oh my, I cannot stop laughing!" She tries to compose herself with a long sigh, "Ahhhhh okay okay okay, I will stop, I promise I will stop." She leans in and whispers conspiratorially, "But seriously though, between you and me," then immediately loses it again, "Haha, no I, hehehe, I just cannot! You are way too funny, haha!" She snorts mid-laugh, "Pfft, oh no no no, that was so embarrassing, pretend you did not hear that!"
 
 
 
 
 
 
 
 
 
 
38
 
39
- <audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/04_catgirl_giggles_snort.wav"></audio>
40
 
41
- ### Action Hero - Panting Triumph
 
 
 
 
 
 
42
 
43
- **Prompt:** A muscular man speaks with a thick accent, panting heavily, completely out of breath, "Hah... hah... we made it, we actually made it." He coughs roughly, "Ugh, that was the hardest fight of my entire life, I swear." He groans and clutches his side, "Argh, my ribs, I think something is broken." But then a grin spreads and he laughs heartily despite the pain, "Hahaha! But we WON! Can you believe it? We actually won!" He takes a deep, shuddering breath, "I told you, heh, I told you we would make it. Ahhh, it is finally over."
44
 
45
- <audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/06_arnie_panting_triumph.wav"></audio>
 
 
 
 
46
 
47
- ### Villain - Sinister Laugh
48
 
49
- **Prompt:** A deep-voiced villain speaks with theatrical menace, chuckling softly at first, "Heheheh. Hahahahahahaha! Oh, forgive me, forgive me." He catches his breath with a sinister grin, He clears his throat. "It is just SO amusing when they struggle, is it not?" His voice drips with contempt, "I expected more from you, truly I did. How disappointing." He leans in close and whispers with vicious intensity, "But fear not, my dear. The REAL entertainment has only just begun." He chuckles one last time, "Heheheh."
50
 
51
- <audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/09_villain_sinister_laugh.wav"></audio>
 
 
 
 
 
 
 
 
 
 
 
52
 
53
- ### Talk Show Host - Wheezing Laughter
54
 
55
- **Prompt:** A talk show host speaks with animated enthusiasm. He gasps with exaggerated shock, "No! You did NOT just say that, tell me you did not just say that!" He bursts into uncontrollable laughter, "HAHAHA! Oh my god, oh my god!" He wheezes, barely getting words out, "I cannot, I literally cannot breathe right now!" He wipes his eyes, sniffling, "Oh that is so good, that is really genuinely good." He sighs happily, "Ahhh okay okay, let me compose myself, I am a professional." He takes one breath then immediately cracks up again, "Pfft hehehe, no I absolutely cannot, I am so sorry everybody!" He claps, "Folks, THIS, this right here, is why I love my job!"
56
 
57
- <audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/13_conan_wheezing_laughter.wav"></audio>
 
 
 
 
 
 
58
 
59
- ---
60
 
61
- ## Model Description
62
 
63
- Dramabox is a prompt-driven TTS model where **the text prompt controls everything** - speaker identity, emotion, delivery style, laughs, sighs, pauses, and transitions. With voice cloning, a 10-second reference clip conditions the model to reproduce the speaker's timbre and characteristics.
64
 
65
- ### Key Features
66
 
67
- - **Prompt-driven expressiveness** - laughs, sighs, whispers, shouts, emotional transitions all controlled by the scene description
68
- - **Voice cloning** from 10s reference audio
69
- - **English** speech synthesis
70
- - **Fast inference** - ~2.5s per generation with warm server on H100
71
 
72
- ### Architecture
 
 
73
 
74
- | Component | Details |
75
- |-----------|---------|
76
- | **Transformer** | 3.3B parameter DiT, 48 layers, flow matching (30-step Euler) |
77
- | **Text Encoder** | Gemma 3 12B (q4 quantized) + learned embeddings processor |
78
- | **Audio VAE** | Encodes/decodes 48kHz audio via mel spectrogram latents |
79
- | **Voice Cloning** | Reference audio tokens appended to target with asymmetric attention mask |
80
 
81
- ## Files
82
 
83
- | File | Size | Description |
84
- |------|------|-------------|
85
- | `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (voice cloning weights merged) |
86
- | `dramabox-audio-components.safetensors` | 2.7 GB | Audio VAE encoder/decoder + vocoder + text projection |
87
- | `assets/silence_latent_frame.pt` | 1.5 KB | VAE-encoded silence frame |
88
- | `config.json` | - | Model configuration |
89
 
90
- **Additional requirement**: [unsloth/gemma-3-12b-it-bnb-4bit](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) (text encoder, pre-quantized 4-bit, auto-downloaded)
91
 
92
- ## Quick Start
93
 
94
- ```python
95
- from inference_server import TTSServer
96
 
97
- # Models auto-download from HuggingFace
98
- server = TTSServer(device="cuda", bnb_4bit=True)
99
 
100
- # Text-to-speech
101
- server.generate_to_file(
102
- prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
103
- output="output.wav",
104
- )
105
 
106
- # Voice cloning
107
- server.generate_to_file(
108
- prompt='A woman speaks warmly, "Hello, how are you today?"',
109
- output="cloned.wav",
110
- voice_ref="reference.wav", # 10+ seconds of target voice
111
- )
112
- ```
113
 
114
- ## Prompt Format
115
 
116
- The prompt is a scene description that controls how the model speaks:
117
 
118
- ```
119
- <speaker description>, "<dialogue>" <action direction> "<more dialogue>"
120
- ```
 
 
121
 
122
- ### What Works Inside Quotes (model produces actual sounds)
123
- - Laughs: `"Hahaha"` `"Hehehe"` (always as one word, never separated)
124
- - Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`
125
 
126
- ### What Goes Outside Quotes (stage directions)
127
- - `She sighs deeply.` `He gulps nervously.` `A long pause.`
128
- - `Her voice cracks.` `He clears his throat.` `She scoffs.`
129
 
130
- ### Never Inside Quotes (model speaks them literally)
131
- - Ahem, Pfft, Sigh, Gasp, Cough
132
 
133
- ## Inference Settings
134
 
135
- | Parameter | Default | Notes |
136
- |-----------|---------|-------|
137
- | cfg_scale | 2.5 | Text adherence (lower = more natural) |
138
- | stg_scale | 1.5 | Skip-token guidance |
139
- | rescale | 0.0 | No rescaling |
140
- | modality | 1.0 | No modality guidance |
141
- | duration_multiplier | 1.1 | 10% extra breathing room |
142
- | steps | 30 | Euler flow matching |
143
 
144
- ## VRAM Requirements
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
145
 
146
- | Setup | VRAM | Speed |
147
- |-------|------|-------|
148
- | Warm server (recommended) | **~24 GB** | **~2.5s** |
149
- | Cold inference (per-call loading) | ~8 GB peak | ~30s |
150
 
151
  ## License & acknowledgement
152
 
153
- Dramabox is a Resemble AI fine-tune of [LTX-2](https://github.com/Lightricks/LTX-2). Distributed under the LTX-2 Community License Agreement; see [`LICENSE`](https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE). Thanks again to Lightricks for releasing the base model.
 
22
  > **Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks.**
23
  > Dramabox is **Resemble AI's** expressive TTS, trained on top of the LTX-2.3 audio branch under the LTX-2 Community License. Huge thanks to the Lightricks team for open-sourcing the base.
24
 
25
+ Dramabox is a prompt-driven TTS where **the prompt itself controls everything**: speaker identity, emotion, delivery, laughs, sighs, breaths, pauses, transitions. An optional 10-second voice reference clones the target timbre. It is an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only** model (Diffusion Transformer + flow matching), conditioned on Gemma 3 12B text embeddings.
26
 
27
+ | | |
28
+ |---|---|
29
+ | 🤗 Model | [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox) |
30
+ | 🎭 Demo Space | [`ResembleAI/Dramabox`](https://huggingface.co/spaces/ResembleAI/Dramabox) (ZeroGPU) |
31
+ | 🎚️ Sample audio | [`ResembleAI/Dramabox-samples`](https://huggingface.co/datasets/ResembleAI/Dramabox-samples) |
32
+ | 🏗️ Base model | [`Lightricks/LTX-2`](https://huggingface.co/Lightricks/LTX-2) |
33
+ | 📜 License | LTX-2 Community License; see [LICENSE](https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE) |
34
 
35
+ ## Quick start
36
 
37
+ ### Python (warm server, recommended, ~2.5 s / generation)
38
 
39
+ ```python
40
+ from src.inference_server import TTSServer
41
 
42
+ server = TTSServer(device="cuda") # downloads weights on first run
43
 
44
+ server.generate_to_file(
45
+ prompt='A woman speaks warmly, "Hello, how are you today?" '
46
+ 'She laughs, "Hahaha, it is so good to see you!"',
47
+ output="output.wav",
48
+ voice_ref="reference.wav", # optional, 10+ seconds of target voice
49
+ cfg_scale=2.5,
50
+ stg_scale=1.5,
51
+ duration_multiplier=1.1,
52
+ seed=42,
53
+ )
54
+ ```
55
 
56
+ ### CLI
57
 
58
+ ```bash
59
+ python src/inference.py \
60
+ --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
61
+ --voice-sample reference.wav \
62
+ --output output.wav \
63
+ --cfg-scale 2.5 --stg-scale 1.5
64
+ ```
65
 
66
+ For **long-form** generation (>20 s music or multi-section scenes), set the explicit target duration:
67
 
68
+ ```bash
69
+ python src/inference.py --gen-duration 30 --prompt '...' # CLI
70
+ # or in Python:
71
+ # server.generate_to_file(prompt=..., gen_duration=30.0)
72
+ ```
73
 
74
+ ## Inference parameters
75
 
76
+ Every knob exposed by the warm server, the CLI, and the Gradio Space sliders:
77
 
78
+ | Parameter | Default | What it does |
79
+ |---|---|---|
80
+ | `prompt` | - | The scene description. Dialogue inside `"double quotes"`, stage directions outside. See "Prompt format" below. |
81
+ | `voice_ref` (`--voice-sample`) | `None` | Optional 10+ s audio clip whose timbre the model clones. Without it, the model picks a voice that fits the description. |
82
+ | `cfg_scale` | 2.5 | Classifier-free guidance: how strictly the output follows the prompt. Lower = more natural, higher = more text-faithful but more dramatic. Auto-rescaled internally to prevent clipping at high cfg (see *Auto rescale* below). |
83
+ | `stg_scale` | 1.5 | Skip-token guidance, applied through the perturbed transformer block path (block 29). Increases expressive emphasis without saturating like cfg. |
84
+ | `duration_multiplier` (`--duration-multiplier`) | 1.1 | Multiplier on the auto-estimated speech length (10 % breathing-room headroom). Only used when `gen_duration` (or `--gen-duration`) is 0. |
85
+ | `gen_duration` (`--gen-duration`, "Target duration" slider) | 0 (auto) | Explicit output duration in seconds. Set to 20–60 s for music or long scenes. Overrides the prompt-based estimate when > 0. |
86
+ | `ref_duration` (`--ref-duration`, "Reference duration" slider) | 10.0 | How many seconds of the voice reference the model conditions on (3–30 s). Longer ref → richer timbre capture, shorter ref → faster encode. |
87
+ | `seed` | 42 | Reproducibility. |
88
+ | `rescale_scale` (`--rescale-scale`) | `"auto"` | Latent-side CFG std-rescale. The default is a cfg-aware schedule (0 below cfg=2, ramping to 1.0 by cfg=10) that keeps the output peak below 0 dBFS at every cfg. Pass any float in [0, 1] to override or 0 to disable. |
89
+ | `watermark` (`--no-watermark` to disable) | `True` | Apply [Resemble Perth](https://github.com/resemble-ai/Perth) imperceptible neural watermark to the output. Survives MP3/AAC, common edits; ≈ 100 % detection accuracy. |
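
The interplay between `duration_multiplier` and `gen_duration` described in the table can be sketched as follows. This is a minimal illustration, not the repository's code: `resolve_duration` and its `estimated_seconds` argument are hypothetical names, and how Dramabox estimates speech length from the prompt is not covered here.

```python
def resolve_duration(estimated_seconds: float,
                     duration_multiplier: float = 1.1,
                     gen_duration: float = 0.0) -> float:
    """Return the output length in seconds (hypothetical helper)."""
    if gen_duration > 0:
        # An explicit target overrides the prompt-based estimate.
        return gen_duration
    # Otherwise pad the estimate with breathing-room headroom.
    return estimated_seconds * duration_multiplier
```

With the defaults, a 10 s estimate becomes roughly 11 s; passing `gen_duration=30.0` returns 30 s regardless of the estimate.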
90
 
91
+ ### Auto rescale (CFG safety)
92
 
93
+ CFG amplifies the latent (`pred = cond + (cfg-1)·(cond - uncond)`). With no compensation, outputs hard-clip at `cfg ≥ 3`. Dramabox automatically applies a CFG-aware std-rescale schedule:
94
 
95
+ | cfg | auto rescale | output peak |
96
+ |---|---|---|
97
+ | ≤ 2 | 0.0 (disabled) | safely below 0 dBFS |
97
+ | 3 | 0.6 | ~−1.8 dBFS |
98
+ | 4–8 | 0.8 | ~−1 to −3 dBFS |
99
+ | 9 | 0.9 | ~−2.7 dBFS |
100
+ | 10 | 1.0 | ~−4.4 dBFS |
102
 
103
+ No clipping at any CFG, no manual tuning needed. Pass `rescale_scale=<float>` to override.
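
A rough sketch of how a cfg-aware std-rescale could look, using the CFG formula and the table values above. Several things here are assumptions rather than Dramabox's actual implementation: the function names are hypothetical, values between table rows are linearly interpolated, the std is taken over the whole tensor, and the blend follows the commonly used guidance-rescale form (scale the guided prediction back to the conditional std, then mix by the rescale factor).

```python
import numpy as np

# (cfg, rescale) points copied from the table; interpolation between them
# is an assumption of this sketch.
_SCHEDULE = [(2.0, 0.0), (3.0, 0.6), (4.0, 0.8), (8.0, 0.8), (9.0, 0.9), (10.0, 1.0)]


def auto_rescale(cfg: float) -> float:
    """Look up the rescale factor for a cfg value (clamped outside 2..10)."""
    xs, ys = zip(*_SCHEDULE)
    return float(np.interp(cfg, xs, ys))


def guided_prediction(cond, uncond, cfg, rescale=None):
    """Apply CFG, then blend towards a std-matched version of the result."""
    phi = auto_rescale(cfg) if rescale is None else rescale
    pred = cond + (cfg - 1.0) * (cond - uncond)
    # Bring the guided prediction back to the conditional branch's std.
    rescaled = pred * (cond.std() / (pred.std() + 1e-8))
    return phi * rescaled + (1.0 - phi) * pred
```

Passing `rescale=0.0` reproduces plain CFG, which mirrors the "0 to disable" override described for `rescale_scale`.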
104
 
105
+ ### End-of-clip silence patch (long-form safety)
106
 
107
+ The base LTX-2.3 DiT was trained on audio ≤ ~20 s and learned a strong end-of-clip silence prior at the next patchifier-aligned latent boundary (frame 513 ≈ 20.4 s). Dramabox automatically interpolates frames 512–513 from their neighbours before VAE decode whenever the output crosses 20.5 s, eliminating the ~30 ms silence dip that would otherwise show up in long generations. No flag, no override needed.
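
The idea of interpolating the boundary frames from their neighbours can be illustrated with a small NumPy sketch. It is not the repository's implementation: `patch_boundary_frames` is a hypothetical name, the latent is assumed to be laid out as `(frames, channels)` with 0-based frame indices, and linear interpolation is an assumed choice.

```python
import numpy as np


def patch_boundary_frames(latent: np.ndarray, first: int = 512, last: int = 513) -> np.ndarray:
    """Linearly interpolate frames first..last from their outer neighbours."""
    if latent.shape[0] <= last + 1:
        # The clip ends before the boundary has a right-hand neighbour.
        return latent
    out = latent.copy()
    left, right = latent[first - 1], latent[last + 1]
    n = last - first + 1
    for i in range(n):
        w = (i + 1) / (n + 1)  # 1/3 and 2/3 for a two-frame gap
        out[first + i] = (1.0 - w) * left + w * right
    return out
```

Short clips pass through unchanged, so the patch only affects generations long enough to cross the boundary.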
108
 
109
+ ## Prompt format
110
 
111
+ ```
112
+ <speaker description>, "<dialogue>" <action direction> "<more dialogue>"
113
+ ```
 
114
 
115
+ **Inside double quotes** (the model speaks these literally):
116
+ - Dialogue: `"Hello, how are you?"`
117
+ - Phonetic vocalisations (one word, no separators): `"Hahaha"`, `"Hehehe"`, `"Mmmmm"`, `"Ugh"`, `"Argh"`, `"Hmm"`
118
 
119
+ **Outside quotes** (stage directions interpreted as performance cues, never spoken):
120
+ - `She sighs deeply.` · `He clears his throat.` · `A long pause.` · `Her voice cracks.` · `He gulps nervously.`
 
 
 
 
121
 
122
+ **Avoid inside quotes** (the model will speak the word literally): `Sigh`, `Gasp`, `Cough`, `Ahem`, `Pfft`.
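
A small pre-flight check can catch the words listed above before a prompt is sent to the model. The helper below is a hypothetical sketch and not part of the Dramabox codebase; it only scans text inside straight double quotes and flags the five words named in this section.

```python
import re

# Words this section says the model would speak literally inside quotes.
LITERAL_WORDS = {"sigh", "gasp", "cough", "ahem", "pfft"}


def quoted_spans(prompt: str) -> list[str]:
    """Return the text of every "double-quoted" dialogue span."""
    return re.findall(r'"([^"]*)"', prompt)


def find_literal_words(prompt: str) -> list[str]:
    """List flagged words that appear inside quoted dialogue."""
    hits = []
    for span in quoted_spans(prompt):
        for word in re.findall(r"[A-Za-z]+", span):
            if word.lower() in LITERAL_WORDS:
                hits.append(word)
    return hits
```

Stage directions such as `She sighs deeply.` are outside quotes and therefore never flagged; only an exact word like `"Ahem"` inside dialogue is reported.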
123
 
124
+ ## Sample outputs
 
 
 
 
 
125
 
126
+ Each clip below was generated by passing the prompt + voice ref to the warm server with default settings (cfg = 2.5, stg = 1.5).
127
 
128
+ ### Regal Queen - Cold Fury to Venomous Whisper
129
 
130
+ > A regal woman speaks with cold fury in a measured, low voice. She sighs deeply, "I have told you a thousand times, and yet here we are again." Her voice sharpens with rising anger, "Do you honestly think I enjoy repeating myself?! Do you?!" She lets out a cold, mocking laugh, "Hahaha, how utterly pathetic you are." She drops to a venomous whisper, leaning close, "Now get out of my sight before I do something we will both regret."
 
131
 
132
+ <audio controls src="https://huggingface.co/datasets/ResembleAI/Dramabox-samples/resolve/main/01_queen_sighs_rage.wav"></audio>
 
133
 
134
+ ### Catgirl - Uncontrollable Giggling
 
 
 
 
135
 
136
+ > A playful girl speaks in a bright, singsong voice, already mid-giggle, "Hehehe, oh my gosh you should see your face right now, it is priceless!" She gasps for air between giggles, "Oh my, hehe, oh my, I cannot stop laughing!" She tries to compose herself with a long sigh, "Ahhhhh okay okay okay, I will stop, I promise I will stop." She leans in and whispers conspiratorially, "But seriously though, between you and me," then immediately loses it again, "Haha, no I, hehehe, I just cannot! You are way too funny, haha!" She snorts mid-laugh, "Pfft, oh no no no, that was so embarrassing, pretend you did not hear that!"
 
 
 
 
 
 
137
 
138
+ <audio controls src="https://huggingface.co/datasets/ResembleAI/Dramabox-samples/resolve/main/04_catgirl_giggles_snort.wav"></audio>
139
 
140
+ ### Action Hero - Panting Triumph
141
 
142
+ > A muscular man speaks with a thick accent, panting heavily, completely out of breath, "Hah... hah... we made it, we actually made it." He coughs roughly, "Ugh, that was the hardest fight of my entire life, I swear." He groans and clutches his side, "Argh, my ribs, I think something is broken." But then a grin spreads and he laughs heartily despite the pain, "Hahaha! But we WON! Can you believe it? We actually won!" He takes a deep, shuddering breath, "I told you, heh, I told you we would make it. Ahhh, it is finally over."
143
+
144
+ <audio controls src="https://huggingface.co/datasets/ResembleAI/Dramabox-samples/resolve/main/06_arnie_panting_triumph.wav"></audio>
145
+
146
+ ### Villain - Sinister Laugh
147
 
148
+ > A deep-voiced villain speaks with theatrical menace, chuckling softly at first, "Heh heh heh, ha ha ha ha ha! Oh, forgive me, forgive me." He catches his breath with a sinister grin, He clears his throat. "It is just SO amusing when they struggle, is it not?" His voice drips with contempt, "I expected more from you, truly I did. How disappointing." He leans in close and whispers with vicious intensity, "But fear not, my dear. The REAL entertainment has only just begun." He chuckles one last time, "Heh heh heh."
 
 
149
 
150
+ <audio controls src="https://huggingface.co/datasets/ResembleAI/Dramabox-samples/resolve/main/09_villain_sinister_laugh.wav"></audio>
 
 
151
 
152
+ ### Talk Show Host - Wheezing Laughter
 
153
 
154
+ > A talk show host speaks with animated enthusiasm. He gasps with exaggerated shock, "No! You did NOT just say that, tell me you did not just say that!" He bursts into uncontrollable laughter, "HAHAHA! Oh my god, oh my god!" He wheezes, barely getting words out, "I cannot, I literally cannot breathe right now!" He wipes his eyes, sniffling, "Oh that is so good, that is really genuinely good." He sighs happily, "Ahhh okay okay, let me compose myself, I am a professional." He takes one breath then immediately cracks up again, "Pfft hehehe, no I absolutely cannot, I am so sorry everybody!" He claps, "Folks, THIS, this right here, is why I love my job!"
155
 
156
+ <audio controls src="https://huggingface.co/datasets/ResembleAI/Dramabox-samples/resolve/main/13_conan_wheezing_laughter.wav"></audio>
 
 
 
 
 
 
 
157
 
158
+ ## Files
159
+
160
+ | File | Size | Contents |
161
+ |---|---|---|
162
+ | `dramabox-dit-v1.safetensors` | 6.6 GB | Audio-only DiT (LoRA already merged into base) |
163
+ | `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
164
+ | [`unsloth/gemma-3-12b-it-bnb-4bit`](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) | ~8 GB | Text encoder (auto-downloaded on first run) |
165
+
166
+ **VRAM**: ~24 GB peak, warm server. **Speed**: ~2.5 s / generation on H100 once warm.
167
+
168
+ ## Watermarking
169
+
170
+ Every output of `inference.py` and `TTSServer.generate_to_file` is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth), an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100 % detection accuracy.
171
+
172
+ ```python
173
+ import perth, librosa
174
+ wav, sr = librosa.load("output.wav", sr=None, mono=True)
175
+ detector = perth.PerthImplicitWatermarker()
176
+ print(detector.get_watermark(wav, sample_rate=sr)) # ≈ 1.0 for our outputs
177
+ ```
178
 
179
+ Pass `--no-watermark` (CLI) or `watermark=False` (Python) to disable for debugging.
 
 
 
180
 
181
  ## License & acknowledgement
182
 
183
+ Dramabox is a Resemble AI fine-tune of [LTX-2](https://github.com/Lightricks/LTX-2). Distributed under the LTX-2 Community License Agreement; see [LICENSE](https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE). Thanks again to Lightricks for releasing the base model.