---
title: DramaBox
emoji: 🎭
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 5.7.1
app_file: app.py
pinned: true
license: other
license_name: ltx-2-community
license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
hf_oauth: false
short_description: Expressive TTS with voice cloning - DramaBox demo
---
# DramaBox: Expressive TTS with Voice Cloning
> **Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks.**
> DramaBox is **Resemble AI's** expressive TTS, trained on top of the LTX-2.3 audio branch under the LTX-2 Community License. Huge thanks to the Lightricks team for open-sourcing the base.
Prompt-driven TTS with voice cloning. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses, and transitions; an optional voice reference of ten seconds or more clones the target timbre. DramaBox is an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only** model.
| | |
|---|---|
| 🤗 **Model** | [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox) |
| 🎭 **Demo Space** | [`ResembleAI/Dramabox`](https://huggingface.co/spaces/ResembleAI/Dramabox) (ZeroGPU) |
| 🏗️ **Base model** | [`Lightricks/LTX-2`](https://huggingface.co/Lightricks/LTX-2) |
| 📜 **License** | LTX-2 Community License (see [`LICENSE`](LICENSE)) |
## Models
All weights are auto-downloaded from the Hugging Face Hub on first run.
| File | Size | Description |
|---|---|---|
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (LoRA already merged into base) |
| `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
| [`unsloth/gemma-3-12b-it-bnb-4bit`](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) | ~8 GB | Text encoder |
**VRAM**: ~24 GB peak · **Speed**: ~2.5 s / generation (warm server, H100)
## Quick Start
### Warm server (recommended)
```python
from src.inference_server import TTSServer
server = TTSServer(device="cuda")
server.generate_to_file(
    prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
    output="output.wav",
    voice_ref="reference.wav",  # optional, 10+ seconds
)
```
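Because `TTSServer` keeps all weights resident, repeated calls skip the model reload. Continuing from the snippet above (output names are illustrative):

```python
# Reuse the warm server across several prompts; only the first call pays startup cost
lines = [
    'A woman speaks warmly, "Good morning!"',
    'A woman speaks warmly, "See you tomorrow." She sighs softly.',
]
for i, prompt in enumerate(lines):
    server.generate_to_file(
        prompt=prompt,
        output=f"line_{i:02d}.wav",
        voice_ref="reference.wav",  # same optional reference as above
    )
```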
### CLI
```bash
python src/inference.py \
    --voice-sample reference.wav \
    --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
    --output output.wav \
    --cfg-scale 2.5 --stg-scale 1.5
```
### Gradio app
```bash
CUDA_VISIBLE_DEVICES=4 python app.py
```
## Inference Settings
| Parameter | Default | Notes |
|---|---|---|
| `cfg-scale` | 2.5 | Lower = more natural, higher = more text-faithful |
| `stg-scale` | 1.5 | Skip-token guidance |
| `rescale` | 0 | No rescaling |
| `modality` | 1 | No modality guidance |
| `duration-multiplier` | 1.1 | 10% breathing room on auto-estimated length |
| `steps` | 30 | Euler flow matching |
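All of these map to `inference.py` flags; `--cfg-scale` and `--stg-scale` appear in the CLI example above, while the other flag spellings below are assumed to mirror the table's parameter names. For example, to favor naturalness and give the auto-estimated length extra room:

```bash
python src/inference.py \
    --voice-sample reference.wav \
    --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
    --output output.wav \
    --cfg-scale 2.0 \
    --steps 30 \
    --duration-multiplier 1.2
```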
## Prompt Writing Guide
**Structure:** `<speaker description>, "<dialogue>" <action direction> "<more dialogue>"`
**Inside quotes** (model produces actual sounds):
- Laughs: `"Hahaha"` `"Hehehe"` (always one word, never `"Ha ha ha"`)
- Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`
**Outside quotes** (stage directions):
- `She sighs deeply.` · `He gulps nervously.` · `A long pause.`
- `Her voice cracks.` · `He clears his throat.` · `She scoffs.`
**Avoid inside quotes** (model speaks them literally): `Ahem`, `Pfft`, `Sigh`, `Gasp`, `Cough`.
**Tips**
- Match gender/age in the speaker description to the voice reference
- Break long dialogue into segments with action directions in between
- End the prompt at the last closing quote mark (no trailing description)
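Putting these together, an illustrative prompt assembled from the elements above:

```
An exhausted father speaks slowly, "Sweetie, daddy is asking very nicely." He sighs deeply. A long pause. He laughs softly, "Hehehe, okay, you win."
```

Note that it ends at the last closing quote, with laughs inside quotes and stage directions outside them.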
## Watermarking
Every audio output from `inference.py` and `inference_server.TTSServer.generate_to_file` is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth), an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
```python
import librosa
import perth

# Load the generated audio and verify the embedded watermark
wav, sr = librosa.load("output.wav", sr=None, mono=True)
detector = perth.PerthImplicitWatermarker()
print(detector.get_watermark(wav, sample_rate=sr))  # confidence ≈ 1.0
```
Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable watermarking for debugging.
## Training a LoRA on top of DramaBox
You can fine-tune your own LoRA using DramaBox itself as the base; there is no need to start from raw LTX-2.3. This is useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.
### 1. Prepare your index file
The preprocessor accepts four formats. The `text` field is the **target transcript**; if you want to attach a scene-style prompt (the part the model conditions on at inference time), prepend it to the transcript in the same format the model was trained on:
> `A woman speaks warmly, "<your transcript here>"`
Both forms are supported, with or without the prompt wrapper. Without the wrapper the model treats the entry as plain text-to-speech.
**Format A: `manifest` (JSONL)**, recommended for new datasets:
```jsonl
{"audio_filepath": "wavs/spk01_001.wav", "text": "A woman speaks warmly, \"Hello, how are you today?\""}
{"audio_filepath": "wavs/spk01_002.wav", "text": "Hello, how are you today?"}
{"audio_filepath": "wavs/spk02_001.flac", "text": "An exhausted father sighs, \"Sweetie, daddy is asking very nicely.\"", "duration": 4.7}
```
Fields: `audio_filepath` (or `audio_path`) and `text` (or `transcript`) are required; `duration` is optional.
**Format B: `tsv`**, the simplest, with one tab-separated line per sample:
```
wavs/spk01_001.wav A woman speaks warmly, "Hello, how are you today?"
wavs/spk01_002.wav Hello, how are you today?
```
**Format C: `gemini_synthetic`**, `~`-separated, used for prompted synthetic data:
```
id~speaker~lang~sr~samples~dur~phonemes~text
spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"
```
**Format D: `libriheavy`**, `~`-separated, for unprompted text-only data:
```
id~speaker~lang~samples~dur_ms~phonemes~text
spk01_001~spk01~en~93000~3875~_~Hello, how are you today?
```
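If you are assembling a new dataset, Format A is easy to generate programmatically. A minimal sketch (the file paths and the `build_prompt` helper are illustrative, not part of this repo):

```python
import json

def build_prompt(description: str, transcript: str) -> str:
    # Wrap the transcript in the format the model was trained on:
    #   <speaker description>, "<transcript>"
    return f'{description}, "{transcript}"'

samples = [
    ("wavs/spk01_001.wav", "A woman speaks warmly", "Hello, how are you today?"),
    ("wavs/spk01_002.wav", None, "Hello, how are you today?"),  # plain TTS, no wrapper
]

with open("your_data.jsonl", "w", encoding="utf-8") as f:
    for path, description, transcript in samples:
        text = build_prompt(description, transcript) if description else transcript
        f.write(json.dumps({"audio_filepath": path, "text": text}) + "\n")
```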
### 2. Preprocess
```bash
python src/preprocess.py \
    --dataset-type manifest \
    --index your_data.jsonl \
    --audio-dir /path/to/wavs \
    --output-dir /path/to/preprocessed/ \
    --checkpoint /path/to/dramabox-audio-components.safetensors \
    --gemma-root /path/to/gemma-3-12b-it-bnb-4bit/ \
    --max-duration 20.0 --min-duration 2.0
```
Output layout (training-ready `.pt` files):
```
preprocessed/
├── audio_latents/sample_*.pt   # Audio VAE-encoded latents
├── conditions/sample_*.pt      # Gemma text embeddings
└── latents/sample_*.pt         # Dummy video latents (placeholder)
```
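To sanity-check the preprocessed output before training, a small sketch; the per-file payload layout (bare tensor vs. dict) is an assumption, so adjust the inspection to whatever `torch.load` returns for your files:

```python
from pathlib import Path

import torch

root = Path("/path/to/preprocessed")
for sub in ("audio_latents", "conditions", "latents"):
    files = sorted((root / sub).glob("sample_*.pt"))
    print(f"{sub}: {len(files)} files")
    if files:
        payload = torch.load(files[0], map_location="cpu")
        # The preprocessor may store a bare tensor or a dict of tensors
        if torch.is_tensor(payload):
            print("  first sample shape:", tuple(payload.shape))
        else:
            print("  first sample keys:", list(payload))
```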
### 3. Train
Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` and `full_checkpoint` to the DramaBox files, then launch with Hugging Face `accelerate`. Any flag passed on the CLI overrides the YAML.
```bash
accelerate launch src/train.py \
    --config configs/training_args.example.yaml
```
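For orientation, a minimal sketch of the YAML keys called out above; the paths are placeholders and the mapping of the two DramaBox files onto `checkpoint` / `full_checkpoint` is an assumption, so treat the shipped example config as authoritative:

```yaml
data_dir: /path/to/preprocessed/
speaker_index: /path/to/speaker_index.json          # illustrative path
checkpoint: /path/to/dramabox-audio-components.safetensors
full_checkpoint: /path/to/dramabox-dit-v1.safetensors
val_config: configs/val_config.example.yaml          # optional, see below
```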
The trainer attaches a fresh LoRA to the audio branch on top of the DramaBox checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` across 48 transformer blocks (6 modules × 48 blocks = 288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
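A tiny sketch of that enumeration, handy when auditing a saved LoRA's keys (the `transformer_blocks.{i}` prefix is an assumed naming scheme, not confirmed by this repo):

```python
ATTN = ["to_q", "to_k", "to_v", "to_out.0"]
FF = ["net.0.proj", "net.2"]

# 6 target modules per block, 48 blocks -> 288 LoRA pairs
targets = [
    f"transformer_blocks.{i}.{name}"
    for i in range(48)
    for name in [f"audio_attn1.{m}" for m in ATTN] + [f"audio_ff.{m}" for m in FF]
]
assert len(targets) == 288
```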
To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML; `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.
### Inference with your trained LoRA
```bash
python src/inference.py \
    --lora /path/to/your/lora_step_5000.safetensors \
    --voice-sample reference.wav \
    --prompt 'A woman speaks warmly, "..."' \
    --output output.wav
```
Always load the LoRA at inference rather than pre-merging it; pre-merged checkpoints have produced degraded output in our runs.
## Language
English.
## License & acknowledgement
DramaBox is a Resemble AI fine-tune of [LTX-2](https://github.com/Lightricks/LTX-2). Distributed under the LTX-2 Community License Agreement (see [`LICENSE`](LICENSE)). Thanks again to Lightricks for releasing the base model.