---
title: DramaBox
emoji: π
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: true
license: other
license_name: ltx-2-community
license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
hf_oauth: false
short_description: Expressive TTS with voice cloning - DramaBox demo
---
# DramaBox: Expressive TTS with Voice Cloning
> **Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks.**
> DramaBox is **Resemble AI's** expressive TTS, trained on top of the LTX-2.3 audio branch under the LTX-2 Community License. Huge thanks to the Lightricks team for open-sourcing the base.
Prompt-driven TTS with voice cloning. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses, and transitions; an optional voice reference of 10 seconds or more clones the target timbre. DramaBox is an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only** model.
| | |
|---|---|
| **Model** | [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox) |
| **Demo Space** | [`ResembleAI/Dramabox`](https://huggingface.co/spaces/ResembleAI/Dramabox) (ZeroGPU) |
| **Base model** | [`Lightricks/LTX-2`](https://huggingface.co/Lightricks/LTX-2) |
| **License** | LTX-2 Community License, see [`LICENSE`](LICENSE) |
## Models
Auto-downloaded from the HF model repo on first run.
| File | Size | Description |
|---|---|---|
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (LoRA already merged into base) |
| `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
| [`unsloth/gemma-3-12b-it-bnb-4bit`](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) | ~8 GB | Text encoder |
**VRAM**: ~24 GB peak · **Speed**: ~2.5 s / generation (warm server, H100)
## Quick Start
### Warm server (recommended)
```python
from src.inference_server import TTSServer

# Load the models once; later calls reuse the warm server.
server = TTSServer(device="cuda")
server.generate_to_file(
    prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
    output="output.wav",
    voice_ref="reference.wav",  # optional, 10+ seconds
)
```
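Because the server stays warm, several prompts can share one loaded model. The following is a minimal sketch, assuming only the `generate_to_file` keyword arguments shown above; `generate_batch` and the `line_NNN.wav` naming are hypothetical helpers, not part of the repo.

```python
from pathlib import Path


def generate_batch(server, prompts, out_dir, voice_ref=None):
    """Render each prompt to its own wav file, reusing one warm server.

    `server` is any object exposing generate_to_file(prompt=, output=, voice_ref=),
    as in the snippet above. This helper itself is hypothetical.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for i, prompt in enumerate(prompts):
        out_path = out_dir / f"line_{i:03d}.wav"
        kwargs = {"prompt": prompt, "output": str(out_path)}
        if voice_ref is not None:
            # Only pass the reference when one was supplied; it is optional.
            kwargs["voice_ref"] = voice_ref
        server.generate_to_file(**kwargs)
        outputs.append(out_path)
    return outputs
```

With a real server this would be called as `generate_batch(TTSServer(device="cuda"), prompts, "out", voice_ref="reference.wav")`.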
### CLI
```bash
python src/inference.py \
    --voice-sample reference.wav \
    --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
    --output output.wav \
    --cfg-scale 2.5 --stg-scale 1.5
```
### Gradio app
```bash
CUDA_VISIBLE_DEVICES=4 python app.py  # set to the index of the GPU you want to use
```
## Inference Settings
| Parameter | Default | Notes |
|---|---|---|
| `cfg-scale` | 2.5 | Lower = more natural, higher = more text-faithful |
| `stg-scale` | 1.5 | Skip-token guidance strength |
| `rescale` | 0 | 0 disables rescaling |
| `modality` | 1 | 1 disables modality guidance |
| `duration-multiplier` | 1.1 | Adds 10% breathing room to the auto-estimated length |
| `steps` | 30 | Number of Euler flow-matching sampling steps |
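The defaults in the table can be kept in one place and turned into a command line. This is a sketch under an assumption: it treats every table row as a `--<name>` flag on `src/inference.py`. Only `--cfg-scale` and `--stg-scale` appear in the CLI example above; the other flag names are inferred from the parameter names and may differ in the actual script. `build_command` is a hypothetical helper.

```python
import shlex

# Defaults copied from the table above.
DEFAULTS = {
    "cfg-scale": 2.5,
    "stg-scale": 1.5,
    "rescale": 0,
    "modality": 1,
    "duration-multiplier": 1.1,
    "steps": 30,
}


def build_command(prompt, output, voice_sample=None, **overrides):
    """Return an argv list for src/inference.py with defaults plus overrides.

    Override keys use underscores (cfg_scale=3.0) and map to dashed flag names.
    Flags other than --cfg-scale and --stg-scale are assumed, not confirmed.
    """
    settings = dict(DEFAULTS)
    for key, value in overrides.items():
        name = key.replace("_", "-")
        if name not in settings:
            raise KeyError(f"unknown setting: {name}")
        settings[name] = value
    argv = ["python", "src/inference.py", "--prompt", prompt, "--output", output]
    if voice_sample is not None:
        argv += ["--voice-sample", voice_sample]
    for name, value in settings.items():
        argv += [f"--{name}", str(value)]
    return argv


cmd = build_command('A woman speaks warmly, "Hello."', "output.wav", cfg_scale=3.0)
print(shlex.join(cmd))  # shell-quoted command, ready to paste into a terminal
```

Returning an argv list rather than a string keeps the quoted prompt intact when the command is passed to `subprocess.run`.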
## Prompt Writing Guide
**Structure:** `<speaker description>, "<dialogue>" <action direction> "<more dialogue>"`
**Inside quotes** (model produces actual sounds):
- Laughs: `"Hahaha"` `"Hehehe"` (always one word, never separated)
- Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`
**Outside quotes** (stage directions):
- `She sighs deeply.` · `He gulps nervously.` · `A long pause.`
- `Her voice cracks.` · `He clears his throat.` · `She scoffs.`
**Avoid inside quotes** (model speaks them literally): `Ahem`, `Pfft`, `Sigh`, `Gasp`, `Cough`.
**Tips**
- Match gender/age in the speaker description to the voice reference
- Break long dialogue into segments with action directions in between
- End the prompt at the last closing quote mark (no trailing description)
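
The rules above are easy to check mechanically before spending GPU time. The following is a rough sketch of such a check; `lint_prompt` is a hypothetical helper built only from the guide's rules, and its regular expressions are simple heuristics rather than anything the model or repo enforces.

```python
import re

# Words the guide says are spoken literally when placed inside quotes.
LITERAL_WORDS = {"ahem", "pfft", "sigh", "gasp", "cough"}


def lint_prompt(prompt):
    """Return a list of warnings for a prompt, based on the guide above."""
    warnings = []
    if prompt.count('"') % 2 != 0:
        warnings.append("unbalanced quote marks")
    quoted = re.findall(r'"([^"]*)"', prompt)
    if not quoted:
        warnings.append("no quoted dialogue found")
    if not prompt.rstrip().endswith('"'):
        warnings.append("prompt should end at the last closing quote mark")
    for segment in quoted:
        for word in re.findall(r"[A-Za-z]+", segment):
            if word.lower() in LITERAL_WORDS:
                warnings.append(f"'{word}' inside quotes is spoken literally")
        # Laughs should be one word ("Hahaha"), not separated ("ha ha ha").
        if re.search(r"\b(ha|he)(\s+(ha|he))+\b", segment, flags=re.IGNORECASE):
            warnings.append("write laughs as one word, e.g. Hahaha")
    return warnings


print(lint_prompt('A woman speaks warmly, "Hello, how are you today?"'))
print(lint_prompt('A man says, "Ahem, ha ha ha." He leaves.'))
```

The first prompt follows the guide and yields no warnings; the second trips the trailing-description, literal-word, and separated-laugh checks.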
## Watermarking
Every audio output from `inference.py` and `inference_server.TTSServer.generate_to_file` is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth), an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
```python
import librosa
import perth

wav, sr = librosa.load("output.wav", sr=None, mono=True)
detector = perth.PerthImplicitWatermarker()
print(detector.get_watermark(wav, sample_rate=sr))  # confidence ≈ 1.0
```
Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable for debugging.
## Training a LoRA on top of DramaBox
You can fine-tune your own LoRA using DramaBox itself as the base, with no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.
### 1. Prepare your index file
The preprocessor accepts four formats. The `text` field is the **target transcript**; if you want to attach a scene-style prompt (the part the model conditions on at inference time), prepend it to the transcript in the same format the model was trained on:
> `A woman speaks warmly, "<your transcript here>"`
Both forms are supported, with or without the prompt wrapper. Without the wrapper the model treats the entry as plain text-to-speech.
**Format A: `manifest` (JSONL)**, recommended for new datasets:
```jsonl
{"audio_filepath": "wavs/spk01_001.wav", "text": "A woman speaks warmly, \"Hello, how are you today?\""}
{"audio_filepath": "wavs/spk01_002.wav", "text": "Hello, how are you today?"}
{"audio_filepath": "wavs/spk02_001.flac", "text": "An exhausted father sighs, \"Sweetie, daddy is asking very nicely.\"", "duration": 4.7}
```
Fields: `audio_filepath` (or `audio_path`) is required, `text` (or `transcript`) is required, `duration` is optional.
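
Writing the manifest by hand makes it easy to get the escaped quotes wrong, so it can help to generate it. The following is a minimal sketch that uses the `audio_filepath`, `text`, and `duration` fields described above; the input dictionaries (`transcript`, `description`) and both helper functions are hypothetical.

```python
import json
from pathlib import Path


def wrap_with_prompt(transcript, speaker_description=None):
    """Return the `text` field: the bare transcript, or description + quoted transcript."""
    if speaker_description is None:
        return transcript
    return f'{speaker_description}, "{transcript}"'


def write_manifest(samples, manifest_path):
    """Write one JSON object per line; `duration` is included only when known."""
    with Path(manifest_path).open("w", encoding="utf-8") as f:
        for sample in samples:
            row = {
                "audio_filepath": sample["audio_filepath"],
                "text": wrap_with_prompt(sample["transcript"], sample.get("description")),
            }
            if "duration" in sample:
                row["duration"] = sample["duration"]
            # json.dumps escapes the inner double quotes for us.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


samples = [
    {"audio_filepath": "wavs/spk01_001.wav", "transcript": "Hello, how are you today?",
     "description": "A woman speaks warmly"},
    {"audio_filepath": "wavs/spk01_002.wav", "transcript": "Hello, how are you today?"},
]
# write_manifest(samples, "your_data.jsonl")
```

The first sample produces the prompt-wrapped form and the second the plain transcript, matching the first two lines of the example manifest.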
**Format B: `tsv`**, the simplest, with one tab-separated `path<TAB>text` line per sample:
```
wavs/spk01_001.wav	A woman speaks warmly, "Hello, how are you today?"
wavs/spk01_002.wav	Hello, how are you today?
```
**Format C: `gemini_synthetic`**, `~`-separated, used for prompted synthetic data:
```
id~speaker~lang~sr~samples~dur~phonemes~text
spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"
```
**Format D: `libriheavy`**, `~`-separated, for unprompted text-only data:
```
id~speaker~lang~samples~dur_ms~phonemes~text
spk01_001~spk01~en~93000~3875~_~Hello, how are you today?
```
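A quick way to sanity-check `~`-separated rows before preprocessing is to split them against the field lists shown above. This sketch is an illustration only: `parse_row` and `duration_matches` are hypothetical helpers, and whether the real preprocessor expects the header line or validates durations is not stated here.

```python
# Field orders copied from the header lines of Formats C and D above.
GEMINI_FIELDS = ["id", "speaker", "lang", "sr", "samples", "dur", "phonemes", "text"]
LIBRIHEAVY_FIELDS = ["id", "speaker", "lang", "samples", "dur_ms", "phonemes", "text"]


def parse_row(line, fields):
    """Split one `~`-separated row into a dict keyed by the given field names.

    The text is the last field, so the split is capped to keep any `~` inside it.
    """
    parts = line.rstrip("\n").split("~", len(fields) - 1)
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(parts)}")
    return dict(zip(fields, parts))


def duration_matches(row, tolerance=1e-3):
    """Check that samples / sr agrees with the stated duration (Format C rows only)."""
    return abs(int(row["samples"]) / int(row["sr"]) - float(row["dur"])) <= tolerance


row = parse_row(
    'spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"',
    GEMINI_FIELDS,
)
print(row["text"], duration_matches(row))
```

For the Format C example row, 93000 samples at 24000 Hz is 3.875 s, which agrees with the `dur` field.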
### 2. Preprocess
```bash
python src/preprocess.py \
    --dataset-type manifest \
    --index your_data.jsonl \
    --audio-dir /path/to/wavs \
    --output-dir /path/to/preprocessed/ \
    --checkpoint /path/to/dramabox-audio-components.safetensors \
    --gemma-root /path/to/gemma-3-12b-it-bnb-4bit/ \
    --max-duration 20.0 --min-duration 2.0
```
Output layout (training-ready `.pt` files):
```
preprocessed/
├── audio_latents/sample_*.pt   # Audio VAE-encoded latents
├── conditions/sample_*.pt      # Gemma text embeddings
└── latents/sample_*.pt         # Dummy video latents (placeholder)
```
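Before launching training it can be worth confirming that every sample made it into all three folders. The sketch below assumes that one sample uses the same file name in each subdirectory (for example `sample_0001.pt` in all three); the layout above suggests this but does not state it, and `check_preprocessed` is a hypothetical helper.

```python
from pathlib import Path

# Subdirectories from the output layout above.
SUBDIRS = ("audio_latents", "conditions", "latents")


def check_preprocessed(root):
    """Return {subdir: sorted file names missing there} for samples seen elsewhere.

    An empty dict means all three folders contain the same set of sample files.
    """
    root = Path(root)
    names = {sub: {p.name for p in (root / sub).glob("sample_*.pt")} for sub in SUBDIRS}
    all_names = set().union(*names.values())
    return {sub: sorted(all_names - found) for sub, found in names.items() if all_names - found}
```

Calling `check_preprocessed("/path/to/preprocessed/")` and getting `{}` back means no sample is missing a counterpart; it does not verify the tensor contents.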
### 3. Train
Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the DramaBox files, then launch with Hugging Face `accelerate`. Any flag passed on the CLI overrides the YAML.
```bash
accelerate launch src/train.py \
    --config configs/training_args.example.yaml
```
The trainer attaches a fresh LoRA to the audio branch on top of the DramaBox checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML; `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.
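
The numbers in the trainer description can be reproduced with a few lines of arithmetic. This is an illustrative sketch, not the trainer's code: the `transformer_blocks.<i>` prefix is a hypothetical naming scheme, and the schedule assumes a linear warmup that counts toward the 10k steps followed by cosine decay to zero, which the description above does not spell out.

```python
import math

NUM_BLOCKS = 48
ATTN_TARGETS = ["to_q", "to_k", "to_v", "to_out.0"]
FF_TARGETS = ["net.0.proj", "net.2"]


def lora_target_names(prefix="transformer_blocks"):
    """List every targeted module path; the `prefix` naming is hypothetical."""
    names = []
    for block in range(NUM_BLOCKS):
        names += [f"{prefix}.{block}.audio_attn1.{t}" for t in ATTN_TARGETS]
        names += [f"{prefix}.{block}.audio_ff.{t}" for t in FF_TARGETS]
    return names


def learning_rate(step, peak=1e-4, warmup=500, total=10_000):
    """Linear warmup to `peak`, then cosine decay to zero at `total` steps (assumed shape)."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


print(len(lora_target_names()))  # 6 targets x 48 blocks = 288
print(learning_rate(0), learning_rate(499), learning_rate(10_000))
```

With rank 128 and alpha 128, the usual LoRA scaling factor alpha / rank equals 1, so the adapter output is added to the base layer unscaled.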
### Inference with your trained LoRA
```bash
python src/inference.py \
    --lora /path/to/your/lora_step_5000.safetensors \
    --voice-sample reference.wav \
    --prompt 'A woman speaks warmly, "..."' \
    --output output.wav
```
Always load the LoRA at inference rather than pre-merging it: pre-merged checkpoints have produced degraded output in our runs.
## Language
English.
## License & acknowledgement
DramaBox is a Resemble AI fine-tune of [LTX-2](https://github.com/Lightricks/LTX-2). Distributed under the LTX-2 Community License Agreement; see [`LICENSE`](LICENSE). Thanks again to Lightricks for releasing the base model.