Rewrite model card: proper inference docs, full parameter table, working audio embeds
README.md
CHANGED
|
@@ -22,132 +22,162 @@ base_model_relation: finetune
|
|
| 22 |
> **Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks.**
|
| 23 |
> Dramabox is **Resemble AI's** expressive TTS, trained on top of the LTX-2.3 audio branch under the LTX-2 Community License. Huge thanks to the Lightricks team for open-sourcing the base.
|
| 24 |
|
| 25 |
-
Dramabox
|
| 26 |
|
| 27 |
-
|
| 28 |
|
| 29 |
-
##
|
| 30 |
|
| 31 |
-
|
| 32 |
|
| 33 |
-
|
|
|
|
| 34 |
|
| 35 |
-
#
|
| 36 |
|
| 37 |
-
|
| 38 |
|
| 39 |
-
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
-
**
|
| 44 |
|
| 45 |
-
|
| 46 |
|
| 47 |
-
##
|
| 48 |
|
| 49 |
-
|
| 50 |
|
| 51 |
-
|
| 52 |
|
| 53 |
-
###
|
| 54 |
|
| 55 |
-
|
| 56 |
|
| 57 |
-
|
| 58 |
|
| 59 |
-
|
| 60 |
|
| 61 |
-
##
|
| 62 |
|
| 63 |
-
|
| 64 |
|
| 65 |
-
##
|
| 66 |
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
- **Fast inference** - ~2.5s per generation with warm server on H100
|
| 71 |
|
| 72 |
-
|
|
|
|
|
|
|
| 73 |
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
| **Transformer** | 3.3B parameter DiT, 48 layers, flow matching (30-step Euler) |
|
| 77 |
-
| **Text Encoder** | Gemma 3 12B (q4 quantized) + learned embeddings processor |
|
| 78 |
-
| **Audio VAE** | Encodes/decodes 48kHz audio via mel spectrogram latents |
|
| 79 |
-
| **Voice Cloning** | Reference audio tokens appended to target with asymmetric attention mask |
|
| 80 |
|
| 81 |
-
|
| 82 |
|
| 83 |
-
|
| 84 |
-
|------|------|-------------|
|
| 85 |
-
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (voice cloning weights merged) |
|
| 86 |
-
| `dramabox-audio-components.safetensors` | 2.7 GB | Audio VAE encoder/decoder + vocoder + text projection |
|
| 87 |
-
| `assets/silence_latent_frame.pt` | 1.5 KB | VAE-encoded silence frame |
|
| 88 |
-
| `config.json` | - | Model configuration |
|
| 89 |
|
| 90 |
-
|
| 91 |
|
| 92 |
-
##
|
| 93 |
|
| 94 |
-
|
| 95 |
-
from inference_server import TTSServer
|
| 96 |
|
| 97 |
-
|
| 98 |
-
server = TTSServer(device="cuda", bnb_4bit=True)
|
| 99 |
|
| 100 |
-
#
|
| 101 |
-
server.generate_to_file(
|
| 102 |
-
prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
|
| 103 |
-
output="output.wav",
|
| 104 |
-
)
|
| 105 |
|
| 106 |
-
|
| 107 |
-
server.generate_to_file(
|
| 108 |
-
prompt='A woman speaks warmly, "Hello, how are you today?"',
|
| 109 |
-
output="cloned.wav",
|
| 110 |
-
voice_ref="reference.wav", # 10+ seconds of target voice
|
| 111 |
-
)
|
| 112 |
-
```
|
| 113 |
|
| 114 |
-
|
| 115 |
|
| 116 |
-
|
| 117 |
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
|
|
|
|
|
|
| 121 |
|
| 122 |
-
|
| 123 |
-
- Laughs: `"Hahaha"` `"Hehehe"` (always as one word, never separated)
|
| 124 |
-
- Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`
|
| 125 |
|
| 126 |
-
|
| 127 |
-
- `She sighs deeply.` `He gulps nervously.` `A long pause.`
|
| 128 |
-
- `Her voice cracks.` `He clears his throat.` `She scoffs.`
|
| 129 |
|
| 130 |
-
###
|
| 131 |
-
- Ahem, Pfft, Sigh, Gasp, Cough
|
| 132 |
|
| 133 |
-
|
| 134 |
|
| 135 |
-
|
| 136 |
-
|-----------|---------|-------|
|
| 137 |
-
| cfg_scale | 2.5 | Text adherence (lower = more natural) |
|
| 138 |
-
| stg_scale | 1.5 | Skip-token guidance |
|
| 139 |
-
| rescale | 0.0 | No rescaling |
|
| 140 |
-
| modality | 1.0 | No modality guidance |
|
| 141 |
-
| duration_multiplier | 1.1 | 10% extra breathing room |
|
| 142 |
-
| steps | 30 | Euler flow matching |
|
| 143 |
|
| 144 |
-
##
|
| 145 |
|
| 146 |
-
|
| 147 |
-
|-------|------|-------|
|
| 148 |
-
| Warm server (recommended) | **~24 GB** | **~2.5s** |
|
| 149 |
-
| Cold inference (per-call loading) | ~8 GB peak | ~30s |
|
| 150 |
|
| 151 |
## License & acknowledgement
|
| 152 |
|
| 153 |
-
Dramabox is a Resemble AI fine-tune of [LTX-2](https://github.com/Lightricks/LTX-2). Distributed under the LTX-2 Community License Agreement; see [
|
|
|
|
| 22 |
> **Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks.**
|
| 23 |
> Dramabox is **Resemble AI's** expressive TTS, trained on top of the LTX-2.3 audio branch under the LTX-2 Community License. Huge thanks to the Lightricks team for open-sourcing the base.
|
| 24 |
|
| 25 |
+
Dramabox is a prompt-driven TTS where **the prompt itself controls everything**: speaker identity, emotion, delivery, laughs, sighs, breaths, pauses, transitions. An optional 10-second voice reference clones the target timbre. It is an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only** model (Diffusion Transformer + flow matching), conditioned on Gemma 3 12B text embeddings.
|
| 26 |
|
| 27 |
+
| | |
|
| 28 |
+
|---|---|
|
| 29 |
+
| Model | [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox) |
|
| 30 |
+
| Demo Space | [`ResembleAI/Dramabox`](https://huggingface.co/spaces/ResembleAI/Dramabox) (ZeroGPU) |
|
| 31 |
+
| Sample audio | [`ResembleAI/Dramabox-samples`](https://huggingface.co/datasets/ResembleAI/Dramabox-samples) |
|
| 32 |
+
| Base model | [`Lightricks/LTX-2`](https://huggingface.co/Lightricks/LTX-2) |
|
| 33 |
+
| License | LTX-2 Community License; see [LICENSE](https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE) |
|
| 34 |
|
| 35 |
+
## Quick start
|
| 36 |
|
| 37 |
+
### Python (warm server; recommended, ~2.5 s / generation)
|
| 38 |
|
| 39 |
+
```python
|
| 40 |
+
from src.inference_server import TTSServer
|
| 41 |
|
| 42 |
+
server = TTSServer(device="cuda") # downloads weights on first run
|
| 43 |
|
| 44 |
+
server.generate_to_file(
|
| 45 |
+
prompt='A woman speaks warmly, "Hello, how are you today?" '
|
| 46 |
+
'She laughs, "Hahaha, it is so good to see you!"',
|
| 47 |
+
output="output.wav",
|
| 48 |
+
voice_ref="reference.wav", # optional, 10+ seconds of target voice
|
| 49 |
+
cfg_scale=2.5,
|
| 50 |
+
stg_scale=1.5,
|
| 51 |
+
duration_multiplier=1.1,
|
| 52 |
+
seed=42,
|
| 53 |
+
)
|
| 54 |
+
```
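To audition several takes of the same line, the call above can be wrapped in a small loop over seeds. This is a minimal sketch: `generate_variants` is a hypothetical helper, not part of the repository, and it only assumes a server object exposing `generate_to_file` with the `prompt`, `output`, `voice_ref`, and `seed` arguments shown in the quick-start example.

```python
from pathlib import Path


def generate_variants(server, prompt, seeds, out_dir="takes", voice_ref=None):
    """Render one file per seed and return the output paths.

    `server` is any object with a `generate_to_file` method that accepts
    the keyword arguments used in the quick-start example (assumption).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for seed in seeds:
        target = out / f"take_seed{seed}.wav"
        kwargs = {"prompt": prompt, "output": str(target), "seed": seed}
        # Only forward a voice reference when the caller supplied one.
        if voice_ref is not None:
            kwargs["voice_ref"] = voice_ref
        server.generate_to_file(**kwargs)
        paths.append(target)
    return paths
```

With a warm `TTSServer`, `generate_variants(server, prompt, seeds=[1, 2, 3])` would write three files under `takes/`, one per seed.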
|
| 55 |
|
| 56 |
+
### CLI
|
| 57 |
|
| 58 |
+
```bash
|
| 59 |
+
python src/inference.py \
|
| 60 |
+
--prompt 'A woman speaks warmly, "Hello, how are you today?"' \
|
| 61 |
+
--voice-sample reference.wav \
|
| 62 |
+
--output output.wav \
|
| 63 |
+
--cfg-scale 2.5 --stg-scale 1.5
|
| 64 |
+
```
|
| 65 |
|
| 66 |
+
For **long-form** generation (>20 s music or multi-section scenes), set an explicit target duration:
|
| 67 |
|
| 68 |
+
```bash
|
| 69 |
+
python src/inference.py --gen-duration 30 --prompt '...' # CLI
|
| 70 |
+
# or in Python:
|
| 71 |
+
#   server.generate_to_file(prompt=..., gen_duration=30.0)
|
| 72 |
+
```
|
| 73 |
|
| 74 |
+
## Inference parameters
|
| 75 |
|
| 76 |
+
Every knob exposed by the warm server, the CLI, and the Gradio Space sliders:
|
| 77 |
|
| 78 |
+
| Parameter | Default | What it does |
|
| 79 |
+
|---|---|---|
|
| 80 |
+
| `prompt` | - | The scene description. Dialogue inside `"double quotes"`, stage directions outside. See "Prompt format" below. |
|
| 81 |
+
| `voice_ref` (`--voice-sample`) | `None` | Optional 10+ s audio clip whose timbre the model clones. Without it, the model picks a voice that fits the description. |
|
| 82 |
+
| `cfg_scale` | 2.5 | Classifier-free guidance: how strictly the output follows the prompt. Lower = more natural, higher = more text-faithful but more dramatic. Auto-rescaled internally to prevent clipping at high cfg (see *Auto rescale* below). |
|
| 83 |
+
| `stg_scale` | 1.5 | Skip-token guidance, applied through the perturbed transformer block path (block 29). Increases expressive emphasis without saturating the way cfg does. |
|
| 84 |
+
| `duration_multiplier` (`--duration-multiplier`) | 1.1 | Multiplier on the auto-estimated speech length (10 % breathing-room headroom). Only used when `gen_duration` (or `--gen-duration`) is 0. |
|
| 85 |
+
| `gen_duration` (`--gen-duration`, "Target duration" slider) | 0 (auto) | Explicit output duration in seconds. Set to 20-60 s for music or long scenes. Overrides the prompt-based estimate when > 0. |
|
| 86 |
+
| `ref_duration` (`--ref-duration`, "Reference duration" slider) | 10.0 | How many seconds of the voice reference the model conditions on (3-30 s). A longer reference captures richer timbre; a shorter one encodes faster. |
|
| 87 |
+
| `seed` | 42 | Reproducibility. |
|
| 88 |
+
| `rescale_scale` (`--rescale-scale`) | `"auto"` | Latent-side CFG std-rescale. The default is a cfg-aware schedule (0 below cfg=2, ramping to 1.0 by cfg=10) that keeps the output peak below 0 dBFS at every cfg. Pass any float in [0, 1] to override or 0 to disable. |
|
| 89 |
+
| `watermark` (`--no-watermark` to disable) | `True` | Apply [Resemble Perth](https://github.com/resemble-ai/Perth) imperceptible neural watermark to the output. Survives MP3/AAC compression and common edits; nearly 100 % detection accuracy. |
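The interaction between `gen_duration` and `duration_multiplier` described in the table can be sketched as follows. This is an illustration only: `resolve_duration` and `clamp_ref_duration` are hypothetical names, the prompt-based length estimate is passed in because the table does not describe how it is computed, and treating the 3-30 s reference range as a hard clamp is an assumption.

```python
def resolve_duration(estimated_seconds, gen_duration=0.0, duration_multiplier=1.1):
    """Return the output length in seconds.

    An explicit `gen_duration` > 0 wins; otherwise the prompt-based
    estimate is stretched by `duration_multiplier`.
    """
    if gen_duration > 0:
        return float(gen_duration)
    return estimated_seconds * duration_multiplier


def clamp_ref_duration(ref_duration=10.0, low=3.0, high=30.0):
    """Keep the reference length inside the documented 3-30 s range (assumed clamp)."""
    return max(low, min(high, float(ref_duration)))
```

For example, an 8 s estimate with the defaults resolves to roughly 8.8 s, while passing `gen_duration=30` ignores the estimate and the multiplier entirely.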
|
| 90 |
|
| 91 |
+
### Auto rescale (CFG safety)
|
| 92 |
|
| 93 |
+
CFG amplifies the latent (`pred = cond + (cfg-1)·(cond - uncond)`). With no compensation, outputs hard-clip at `cfg ≥ 3`. Dramabox automatically applies a CFG-aware std-rescale schedule:
|
| 94 |
|
| 95 |
+
| cfg | auto rescale | output peak |
|
| 96 |
+
|---|---|---|
|
| 97 |
+
| ≤ 2 | 0.0 (disabled) | safely below 0 dBFS |
|
| 98 |
+
| 3 | 0.6 | ~-1.8 dBFS |
|
| 99 |
+
| 4-8 | 0.8 | ~-1 to -3 dBFS |
|
| 100 |
+
| 9 | 0.9 | ~-2.7 dBFS |
|
| 101 |
+
| 10 | 1.0 | ~-4.4 dBFS |
|
| 102 |
|
| 103 |
+
No clipping at any CFG, no manual tuning needed. Pass `rescale_scale=<float>` to override.
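The schedule and the rescale step can be sketched in a few lines of plain Python. This is not the Dramabox implementation: the `(cfg, rescale)` rows are copied from the table above, linear interpolation between rows is an assumption, and the std-matching blend is the commonly used CFG-rescale form, assumed here rather than confirmed by the source. Latents are flat lists of floats for simplicity.

```python
from statistics import pstdev

# (cfg, rescale) rows copied from the table above.
_SCHEDULE = [(2.0, 0.0), (3.0, 0.6), (4.0, 0.8), (8.0, 0.8), (9.0, 0.9), (10.0, 1.0)]


def auto_rescale(cfg):
    """Piecewise-linear lookup; values between table rows are interpolated (assumption)."""
    if cfg <= _SCHEDULE[0][0]:
        return _SCHEDULE[0][1]
    for (c0, r0), (c1, r1) in zip(_SCHEDULE, _SCHEDULE[1:]):
        if cfg <= c1:
            return r0 + (r1 - r0) * (cfg - c0) / (c1 - c0)
    return _SCHEDULE[-1][1]


def guided_prediction(cond, uncond, cfg, rescale_scale="auto"):
    """Apply CFG, then pull the result's std back toward that of `cond`."""
    phi = auto_rescale(cfg) if rescale_scale == "auto" else float(rescale_scale)
    # pred = cond + (cfg - 1) * (cond - uncond), as in the formula above.
    pred = [c + (cfg - 1.0) * (c - u) for c, u in zip(cond, uncond)]
    pred_std = pstdev(pred)
    if phi > 0.0 and pred_std > 0.0:
        factor = pstdev(cond) / pred_std
        # Blend the std-matched prediction with the raw one by phi.
        pred = [phi * p * factor + (1.0 - phi) * p for p in pred]
    return pred
```

At `phi = 1.0` the guided latent has the same standard deviation as the conditional prediction, which is what keeps the decoded peak from growing with cfg; at `phi = 0.0` the raw CFG result passes through unchanged.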
|
| 104 |
|
| 105 |
+
### End-of-clip silence patch (long-form safety)
|
| 106 |
|
| 107 |
+
The base LTX-2.3 DiT was trained on audio ≤ ~20 s and learned a strong end-of-clip silence prior at the next patchifier-aligned latent boundary (frame 513 ≈ 20.4 s). Dramabox automatically interpolates frames 512-513 from their neighbours before VAE decode whenever the output crosses 20.5 s, eliminating the ~30 ms silence dip that would otherwise show up in long generations. No flag, no override needed.
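The idea of the patch can be illustrated with a short sketch. It is not the shipped code: `patch_frames` is a hypothetical name, the latent is modelled as a list of frames (each a list of floats), the frame indices are treated as 0-based, and linear interpolation between the two outer neighbours is an assumption about what "interpolates from their neighbours" means. The sketch triggers on frame count rather than on the 20.5 s threshold.

```python
def patch_frames(latent, first=512, last=513):
    """Linearly interpolate frames first..last from their outer neighbours.

    Returns a patched copy; latents too short to contain the affected
    frames plus a right-hand neighbour are returned unchanged.
    """
    patched = [list(frame) for frame in latent]
    if first < 1 or last + 1 >= len(latent):
        return patched
    left, right = latent[first - 1], latent[last + 1]
    span = last - first + 2  # number of steps from left neighbour to right neighbour
    for i in range(first, last + 1):
        t = (i - first + 1) / span
        patched[i] = [(1.0 - t) * a + t * b for a, b in zip(left, right)]
    return patched
```

Because only the two affected frames are rewritten, shorter generations pass through untouched and no user-facing flag is needed.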
|
| 108 |
|
| 109 |
+
## Prompt format
|
| 110 |
|
| 111 |
+
```
|
| 112 |
+
<speaker description>, "<dialogue>" <action direction> "<more dialogue>"
|
| 113 |
+
```
|
|
|
|
| 114 |
|
| 115 |
+
**Inside double quotes** (the model speaks these literally):
|
| 116 |
+
- Dialogue: `"Hello, how are you?"`
|
| 117 |
+
- Phonetic vocalisations (one word, no separators): `"Hahaha"`, `"Hehehe"`, `"Mmmmm"`, `"Ugh"`, `"Argh"`, `"Hmm"`
|
| 118 |
|
| 119 |
+
**Outside quotes** (stage directions interpreted as performance cues, never spoken):
|
| 120 |
+
- `She sighs deeply.` · `He clears his throat.` · `A long pause.` · `Her voice cracks.` · `He gulps nervously.`
|
| 121 |
|
| 122 |
+
**Avoid inside quotes** (the model will speak the word literally): `Sigh`, `Gasp`, `Cough`, `Ahem`, `Pfft`.
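A prompt can be checked against these rules before it is sent to the model. The helper below is a hypothetical, warn-only sketch (`lint_prompt` is not part of the repository): it extracts the quoted segments with a regular expression and flags the words listed above when they appear inside quotes.

```python
import re

# Words the section above says will be read out literally inside quotes.
AVOID_IN_QUOTES = {"sigh", "gasp", "cough", "ahem", "pfft"}


def spoken_segments(prompt):
    """Return the text of every double-quoted segment, in order."""
    return re.findall(r'"([^"]*)"', prompt)


def lint_prompt(prompt):
    """Return a list of human-readable warnings (empty if none)."""
    warnings = []
    if prompt.count('"') % 2 != 0:
        warnings.append("unbalanced double quotes")
    for segment in spoken_segments(prompt):
        for word in re.findall(r"[A-Za-z]+", segment):
            if word.lower() in AVOID_IN_QUOTES:
                warnings.append(f'"{word}" inside quotes will be spoken literally')
    return warnings
```

The check is advisory: a prompt that deliberately wants the literal word spoken can ignore the warning.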
|
| 123 |
|
| 124 |
+
## Sample outputs
|
| 125 |
|
| 126 |
+
Each clip below was generated by passing the prompt + voice ref to the warm server with default settings (cfg = 2.5, stg = 1.5).
|
| 127 |
|
| 128 |
+
### Regal Queen: Cold Fury to Venomous Whisper
|
| 129 |
|
| 130 |
+
> A regal woman speaks with cold fury in a measured, low voice. She sighs deeply, "I have told you a thousand times, and yet here we are again." Her voice sharpens with rising anger, "Do you honestly think I enjoy repeating myself?! Do you?!" She lets out a cold, mocking laugh, "Hahaha, how utterly pathetic you are." She drops to a venomous whisper, leaning close, "Now get out of my sight before I do something we will both regret."
|
|
|
|
| 131 |
|
| 132 |
+
<audio controls src="https://huggingface.co/datasets/ResembleAI/Dramabox-samples/resolve/main/01_queen_sighs_rage.wav"></audio>
|
|
|
|
| 133 |
|
| 134 |
+
### Catgirl: Uncontrollable Giggling
|
| 135 |
|
| 136 |
+
> A playful girl speaks in a bright, singsong voice, already mid-giggle, "Hehehe, oh my gosh you should see your face right now, it is priceless!" She gasps for air between giggles, "Oh my, hehe, oh my, I cannot stop laughing!" She tries to compose herself with a long sigh, "Ahhhhh okay okay okay, I will stop, I promise I will stop." She leans in and whispers conspiratorially, "But seriously though, between you and me," then immediately loses it again, "Haha, no I, hehehe, I just cannot! You are way too funny, haha!" She snorts mid-laugh, "Pfft, oh no no no, that was so embarrassing, pretend you did not hear that!"
|
| 137 |
|
| 138 |
+
<audio controls src="https://huggingface.co/datasets/ResembleAI/Dramabox-samples/resolve/main/04_catgirl_giggles_snort.wav"></audio>
|
| 139 |
|
| 140 |
+
### Action Hero: Panting Triumph
|
| 141 |
|
| 142 |
+
> A muscular man speaks with a thick accent, panting heavily, completely out of breath, "Hah... hah... we made it, we actually made it." He coughs roughly, "Ugh, that was the hardest fight of my entire life, I swear." He groans and clutches his side, "Argh, my ribs, I think something is broken." But then a grin spreads and he laughs heartily despite the pain, "Hahaha! But we WON! Can you believe it? We actually won!" He takes a deep, shuddering breath, "I told you, heh, I told you we would make it. Ahhh, it is finally over."
|
| 143 |
+
|
| 144 |
+
<audio controls src="https://huggingface.co/datasets/ResembleAI/Dramabox-samples/resolve/main/06_arnie_panting_triumph.wav"></audio>
|
| 145 |
+
|
| 146 |
+
### Villain: Sinister Laugh
|
| 147 |
|
| 148 |
+
> A deep-voiced villain speaks with theatrical menace, chuckling softly at first, "Heh heh heh, ha ha ha ha ha! Oh, forgive me, forgive me." He catches his breath with a sinister grin, He clears his throat. "It is just SO amusing when they struggle, is it not?" His voice drips with contempt, "I expected more from you, truly I did. How disappointing." He leans in close and whispers with vicious intensity, "But fear not, my dear. The REAL entertainment has only just begun." He chuckles one last time, "Heh heh heh."
|
|
|
|
|
|
|
| 149 |
|
| 150 |
+
<audio controls src="https://huggingface.co/datasets/ResembleAI/Dramabox-samples/resolve/main/09_villain_sinister_laugh.wav"></audio>
|
|
|
|
|
|
|
| 151 |
|
| 152 |
+
### Talk Show Host: Wheezing Laughter
|
|
|
|
| 153 |
|
| 154 |
+
> A talk show host speaks with animated enthusiasm. He gasps with exaggerated shock, "No! You did NOT just say that, tell me you did not just say that!" He bursts into uncontrollable laughter, "HAHAHA! Oh my god, oh my god!" He wheezes, barely getting words out, "I cannot, I literally cannot breathe right now!" He wipes his eyes, sniffling, "Oh that is so good, that is really genuinely good." He sighs happily, "Ahhh okay okay, let me compose myself, I am a professional." He takes one breath then immediately cracks up again, "Pfft hehehe, no I absolutely cannot, I am so sorry everybody!" He claps, "Folks, THIS, this right here, is why I love my job!"
|
| 155 |
|
| 156 |
+
<audio controls src="https://huggingface.co/datasets/ResembleAI/Dramabox-samples/resolve/main/13_conan_wheezing_laughter.wav"></audio>
|
| 157 |
|
| 158 |
+
## Files
|
| 159 |
+
|
| 160 |
+
| File | Size | Contents |
|
| 161 |
+
|---|---|---|
|
| 162 |
+
| `dramabox-dit-v1.safetensors` | 6.6 GB | Audio-only DiT (LoRA already merged into base) |
|
| 163 |
+
| `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
|
| 164 |
+
| [`unsloth/gemma-3-12b-it-bnb-4bit`](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) | ~8 GB | Text encoder (auto-downloaded on first run) |
|
| 165 |
+
|
| 166 |
+
**VRAM**: ~24 GB peak, warm server. **Speed**: ~2.5 s / generation on H100 once warm.
|
| 167 |
+
|
| 168 |
+
## Watermarking
|
| 169 |
+
|
| 170 |
+
Every output of `inference.py` and `TTSServer.generate_to_file` is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth), an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100 % detection accuracy.
|
| 171 |
+
|
| 172 |
+
```python
|
| 173 |
+
import librosa
import perth
|
| 174 |
+
wav, sr = librosa.load("output.wav", sr=None, mono=True)
|
| 175 |
+
detector = perth.PerthImplicitWatermarker()
|
| 176 |
+
print(detector.get_watermark(wav, sample_rate=sr))  # ~1.0 for our outputs
|
| 177 |
+
```
|
| 178 |
|
| 179 |
+
Pass `--no-watermark` (CLI) or `watermark=False` (Python) to disable for debugging.
|
| 180 |
|
| 181 |
## License & acknowledgement
|
| 182 |
|
| 183 |
+
Dramabox is a Resemble AI fine-tune of [LTX-2](https://github.com/Lightricks/LTX-2). Distributed under the LTX-2 Community License Agreement; see [LICENSE](https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE). Thanks again to Lightricks for releasing the base model.
|