---
title: DramaBox
emoji: π
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 5.7.1
app_file: app.py
pinned: true
license: other
license_name: ltx-2-community
license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
hf_oauth: false
short_description: Expressive TTS with voice cloning (DramaBox demo)
---
DramaBox: Expressive TTS with Voice Cloning
Built on LTX-2 by Lightricks. DramaBox is Resemble AI's expressive TTS, trained on top of the LTX-2.3 audio branch under the LTX-2 Community License. Huge thanks to the Lightricks team for open-sourcing the base.
Prompt-driven TTS with voice cloning. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses and transitions; an optional 10-second voice reference clones the target timbre. DramaBox is an IC-LoRA fine-tune of the LTX-2.3 3.3B audio-only model.
| | |
|---|---|
| Model | ResembleAI/Dramabox |
| Demo Space | ResembleAI/Dramabox (ZeroGPU) |
| Base model | Lightricks/LTX-2 |
| License | LTX-2 Community License (see LICENSE) |
Models
Auto-downloaded from the HF model repo on first run.
| File | Size | Description |
|---|---|---|
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (LoRA already merged into base) |
| `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
| `unsloth/gemma-3-12b-it-bnb-4bit` | ~8 GB | Text encoder |
VRAM: ~24 GB peak · Speed: ~2.5 s / generation (warm server, H100)
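If you prefer to pre-fetch the weights (for example, before going offline), a minimal sketch with huggingface_hub, assuming the two DramaBox checkpoints live in the ResembleAI/Dramabox model repo linked above and the Gemma text encoder is pulled from its own repo:
from huggingface_hub import snapshot_download

# Pre-fetch the checkpoints; app.py and the inference scripts will otherwise
# download them automatically on first run.
dramabox_dir = snapshot_download("ResembleAI/Dramabox")
gemma_dir = snapshot_download("unsloth/gemma-3-12b-it-bnb-4bit")
print(dramabox_dir, gemma_dir)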
Quick Start
Warm server (recommended)
from src.inference_server import TTSServer
server = TTSServer(device="cuda")
server.generate_to_file(
prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
output="output.wav",
voice_ref="reference.wav", # optional, 10+ seconds
)
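The server keeps the models resident, so one TTSServer instance can be reused across many prompts; a minimal sketch (prompts and file names are illustrative):
prompts = [
    'A woman speaks warmly, "Hello, how are you today?"',
    'An exhausted father sighs, "Sweetie, daddy is asking very nicely."',
]
for i, prompt in enumerate(prompts):
    # Reusing the same server avoids reloading the checkpoints on every call.
    server.generate_to_file(
        prompt=prompt,
        output=f"line_{i:02d}.wav",
        voice_ref="reference.wav",
    )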
CLI
python src/inference.py \
--voice-sample reference.wav \
--prompt 'A woman speaks warmly, "Hello, how are you today?"' \
--output output.wav \
--cfg-scale 2.5 --stg-scale 1.5
Gradio app
CUDA_VISIBLE_DEVICES=4 python app.py
Inference Settings
| Parameter | Default | Notes |
|---|---|---|
| `cfg-scale` | 2.5 | Lower = more natural, higher = more text-faithful |
| `stg-scale` | 1.5 | Skip-token guidance |
| `rescale` | 0 | No rescaling |
| `modality` | 1 | No modality guidance |
| `duration-multiplier` | 1.1 | 10% breathing room on auto-estimated length |
| `steps` | 30 | Euler flow matching |
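To hear how cfg-scale trades naturalness against text fidelity, one option is a small sweep over the documented CLI flags; the values and file names below are just examples:
import subprocess

for cfg in (1.5, 2.5, 3.5):
    # Same flags as the Quick Start CLI example; only cfg-scale changes.
    subprocess.run(
        [
            "python", "src/inference.py",
            "--voice-sample", "reference.wav",
            "--prompt", 'A woman speaks warmly, "Hello, how are you today?"',
            "--output", f"output_cfg{cfg}.wav",
            "--cfg-scale", str(cfg),
            "--stg-scale", "1.5",
        ],
        check=True,
    )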
Prompt Writing Guide
Structure: `<speaker description>, "<dialogue>" <action direction> "<more dialogue>"` (a worked example follows the tips below).
Inside quotes (model produces actual sounds):
- Laughs: "Hahaha", "Hehehe" (always one word, never separated)
- Sounds: "Mmmmm", "Ugh", "Argh", "Ahhh", "Hmm"
Outside quotes (stage directions):
She sighs deeply. · He gulps nervously. · A long pause. · Her voice cracks. · He clears his throat. · She scoffs.
Avoid inside quotes (model speaks them literally): Ahem, Pfft, Sigh, Gasp, Cough.
Tips
- Match gender/age in the speaker description to the voice reference
- Break long dialogue into segments with action directions in between
- End the prompt at the last closing quote mark (no trailing description)
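Putting the guide together, a prompt might look like this (the dialogue is illustrative; the laugh stays inside the quotes as one word, the sigh is a stage direction outside them, and the prompt ends at the final closing quote):
prompt = (
    'An exhausted father speaks softly, "Sweetie, it is time for bed." '
    "He sighs deeply. "
    '"Hahaha, alright, one more story, but that is the last one."'
)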
Watermarking
Every audio output from inference.py and inference_server.TTSServer.generate_to_file is automatically watermarked with Resemble Perth, an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
import perth, librosa
wav, sr = librosa.load("output.wav", sr=None, mono=True)
detector = perth.PerthImplicitWatermarker()
print(detector.get_watermark(wav, sample_rate=sr))  # confidence ≈ 1.0
Pass --no-watermark to inference.py (or watermark=False to generate_to_file) to disable for debugging.
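The same switch works on the warm server, for example during quick debugging runs (this continues the TTSServer example from Quick Start):
server.generate_to_file(
    prompt='A woman speaks warmly, "Testing without a watermark."',
    output="debug_unwatermarked.wav",
    watermark=False,  # debugging only; keep the watermark for anything you publish
)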
Training a LoRA on top of DramaBox
You can fine-tune your own LoRA using DramaBox itself as the base; there is no need to start from raw LTX-2.3. This is useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.
1. Prepare your index file
The preprocessor accepts four formats. The text field is the target transcript; if you want to attach a scene-style prompt (the part the model conditions on at inference time), prepend it to the transcript in the same format the model was trained on:
A woman speaks warmly, "<your transcript here>"
Both forms are supported, with or without the prompt wrapper. Without the wrapper the model treats the entry as plain text-to-speech.
Format A: manifest (JSONL), recommended for new datasets:
{"audio_filepath": "wavs/spk01_001.wav", "text": "A woman speaks warmly, \"Hello, how are you today?\""}
{"audio_filepath": "wavs/spk01_002.wav", "text": "Hello, how are you today?"}
{"audio_filepath": "wavs/spk02_001.flac", "text": "An exhausted father sighs, \"Sweetie, daddy is asking very nicely.\"", "duration": 4.7}
Fields: audio_filepath (or audio_path) is required, text (or transcript) is required, duration is optional.
Format B: tsv (tab-separated), simplest, one line per sample:
wavs/spk01_001.wav A woman speaks warmly, "Hello, how are you today?"
wavs/spk01_002.wav Hello, how are you today?
Format C: gemini_synthetic, ~-separated, used for prompted synthetic data:
id~speaker~lang~sr~samples~dur~phonemes~text
spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"
Format D: libriheavy, ~-separated, for unprompted text-only data:
id~speaker~lang~samples~dur_ms~phonemes~text
spk01_001~spk01~en~93000~3875~_~Hello, how are you today?
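For Format A, a minimal sketch that writes the manifest with the standard json module; the paths and transcripts are illustrative:
import json

samples = [
    ("wavs/spk01_001.wav", 'A woman speaks warmly, "Hello, how are you today?"'),
    ("wavs/spk01_002.wav", "Hello, how are you today?"),
]
with open("your_data.jsonl", "w", encoding="utf-8") as f:
    for audio_path, text in samples:
        # One JSON object per line; "duration" can be added as an optional field.
        f.write(json.dumps({"audio_filepath": audio_path, "text": text}) + "\n")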
2. Preprocess
python src/preprocess.py \
--dataset-type manifest \
--index your_data.jsonl \
--audio-dir /path/to/wavs \
--output-dir /path/to/preprocessed/ \
--checkpoint /path/to/dramabox-audio-components.safetensors \
--gemma-root /path/to/gemma-3-12b-it-bnb-4bit/ \
--max-duration 20.0 --min-duration 2.0
Output layout (training-ready .pt files):
preprocessed/
├── audio_latents/sample_*.pt   # Audio VAE-encoded latents
├── conditions/sample_*.pt      # Gemma text embeddings
└── latents/sample_*.pt         # Dummy video latents (placeholder)
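An optional sanity check before training: the exact keys and shapes inside each .pt depend on the preprocessor, so this sketch only confirms the files exist and load:
import glob
import torch

for sub in ("audio_latents", "conditions", "latents"):
    files = sorted(glob.glob(f"/path/to/preprocessed/{sub}/sample_*.pt"))
    print(f"{sub}: {len(files)} files")
    if files:
        # Load one sample on CPU just to verify it deserializes.
        sample = torch.load(files[0], map_location="cpu")
        print("  first sample:", type(sample))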
3. Train
Copy configs/training_args.example.yaml, point data_dir / speaker_index at your preprocessed output, set checkpoint + full_checkpoint to the DramaBox files, then launch with HuggingFace accelerate. Any flag passed on the CLI overrides the YAML.
accelerate launch src/train.py \
--config configs/training_args.example.yaml
The trainer attaches a fresh LoRA to the audio branch on top of the DramaBox checkpoint. LoRA targets: audio_attn1.{to_q,to_k,to_v,to_out.0} + audio_ff.{net.0.proj,net.2} × 48 transformer blocks (288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
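The count works out to 4 attention projections plus 2 feed-forward projections per block over 48 blocks; a small sketch that enumerates the names (the transformer_blocks.{i} prefix here is illustrative, not necessarily the exact module path):
attn_targets = ["to_q", "to_k", "to_v", "to_out.0"]
ff_targets = ["net.0.proj", "net.2"]
targets = [
    # Prefix is illustrative; check the trainer for the real module paths.
    f"transformer_blocks.{i}.audio_attn1.{name}"
    for i in range(48)
    for name in attn_targets
] + [
    f"transformer_blocks.{i}.audio_ff.{name}"
    for i in range(48)
    for name in ff_targets
]
print(len(targets))  # 288 LoRA target modules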
To monitor training, set val_config: configs/val_config.example.yaml in your training YAML; src/validate.py is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.
Inference with your trained LoRA
python src/inference.py \
--lora /path/to/your/lora_step_5000.safetensors \
--voice-sample reference.wav \
--prompt 'A woman speaks warmly, "..."' \
--output output.wav
Always load the LoRA at inference rather than pre-merging it: pre-merged checkpoints have produced degraded output in our runs.
Language
English.
License & acknowledgement
DramaBox is a Resemble AI fine-tune of LTX-2. Distributed under the LTX-2 Community License Agreement (see LICENSE). Thanks again to Lightricks for releasing the base model.