---
title: DramaBox
emoji: 🎭
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 5.7.1
app_file: app.py
pinned: true
license: other
license_name: ltx-2-community
license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
hf_oauth: false
short_description: Expressive TTS with voice cloning - DramaBox demo
---

# DramaBox: Expressive TTS with Voice Cloning

> **Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks.**
> DramaBox is **Resemble AI's** expressive TTS, trained on top of the LTX-2.3 audio branch under the LTX-2 Community License. Huge thanks to the Lightricks team for open-sourcing the base.

Prompt-driven TTS with voice cloning. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses, and transitions; an optional 10-second voice reference clones the target timbre. DramaBox is an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only** model.

| | |
|---|---|
| 🤗 **Model** | [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox) |
| 🎭 **Demo Space** | [`ResembleAI/Dramabox`](https://huggingface.co/spaces/ResembleAI/Dramabox) (ZeroGPU) |
| 🏗️ **Base model** | [`Lightricks/LTX-2`](https://huggingface.co/Lightricks/LTX-2) |
| 📜 **License** | LTX-2 Community License (see [`LICENSE`](LICENSE)) |

## Models

Auto-downloaded from the HF model repo on first run.

| File | Size | Description |
|---|---|---|
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (LoRA already merged into the base) |
| `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
| [`unsloth/gemma-3-12b-it-bnb-4bit`](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) | ~8 GB | Text encoder |

**VRAM**: ~24 GB peak · **Speed**: ~2.5 s / generation (warm server, H100)

## Quick Start

### Warm server (recommended)

```python
from src.inference_server import TTSServer

server = TTSServer(device="cuda")
server.generate_to_file(
    prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
    output="output.wav",
    voice_ref="reference.wav",  # optional, 10+ seconds
)
```

### CLI

```bash
python src/inference.py \
    --voice-sample reference.wav \
    --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
    --output output.wav \
    --cfg-scale 2.5 --stg-scale 1.5
```

### Gradio app

```bash
CUDA_VISIBLE_DEVICES=4 python app.py
```

## Inference Settings

| Parameter | Default | Notes |
|---|---|---|
| `cfg-scale` | 2.5 | Lower = more natural, higher = more text-faithful |
| `stg-scale` | 1.5 | Skip-layer guidance (STG) strength |
| `rescale` | 0 | No rescaling |
| `modality` | 1 | No modality guidance |
| `duration-multiplier` | 1.1 | 10% breathing room on the auto-estimated length |
| `steps` | 30 | Euler flow-matching steps |

## Prompt Writing Guide

**Structure:** `<speaker description>, "<speech>" <stage direction> "<speech>"`

**Inside quotes** (the model produces the actual sounds):

- Laughs: `"Hahaha"` `"Hehehe"` (always one word, never separated)
- Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`

**Outside quotes** (stage directions):

- `She sighs deeply.` · `He gulps nervously.` · `A long pause.`
- `Her voice cracks.` · `He clears his throat.` · `She scoffs.`

**Avoid inside quotes** (the model speaks them literally): `Ahem`, `Pfft`, `Sigh`, `Gasp`, `Cough`.
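Putting these elements together, a complete prompt might look like this (an illustrative example; the speaker description and dialogue are invented here, not taken from the model card):

```
A tired middle-aged man speaks slowly, "Mmmmm, I suppose you are right." He sighs deeply. A long pause. "Hahaha, fine, you win this time!"
```

The sounds stay inside the quotes, the stage directions stay outside them, and the prompt ends at the final closing quote.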
**Tips**

- Match gender/age in the speaker description to the voice reference
- Break long dialogue into segments with action directions in between
- End the prompt at the last closing quote mark (no trailing description)

## Watermarking

Every audio output from `inference.py` and `inference_server.TTSServer.generate_to_file` is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth), an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

```python
import perth, librosa

wav, sr = librosa.load("output.wav", sr=None, mono=True)
detector = perth.PerthImplicitWatermarker()
print(detector.get_watermark(wav, sample_rate=sr))  # confidence ≈ 1.0
```

Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable watermarking for debugging.

## Training a LoRA on top of DramaBox

You can fine-tune your own LoRA using DramaBox itself as the base; there is no need to start from raw LTX-2.3. This is useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.

### 1. Prepare your index file

The preprocessor accepts four formats. The `text` field is the **target transcript**; if you want to attach a scene-style prompt (the part the model conditions on at inference time), prepend it to the transcript in the same format the model was trained on:

> `A woman speaks warmly, "<transcript>"`

Both forms are supported, with or without the prompt wrapper. Without the wrapper the model treats the entry as plain text-to-speech.

**Format A: `manifest` (JSONL)**, recommended for new datasets:

```jsonl
{"audio_filepath": "wavs/spk01_001.wav", "text": "A woman speaks warmly, \"Hello, how are you today?\""}
{"audio_filepath": "wavs/spk01_002.wav", "text": "Hello, how are you today?"}
{"audio_filepath": "wavs/spk02_001.flac", "text": "An exhausted father sighs, \"Sweetie, daddy is asking very nicely.\"", "duration": 4.7}
```

Fields: `audio_filepath` (or `audio_path`) is required, `text` (or `transcript`) is required, `duration` is optional.

**Format B: `tsv`**, simplest, one tab-separated line per sample:

```
wavs/spk01_001.wav	A woman speaks warmly, "Hello, how are you today?"
wavs/spk01_002.wav	Hello, how are you today?
```

**Format C: `gemini_synthetic`**, `~`-separated, used for prompted synthetic data:

```
id~speaker~lang~sr~samples~dur~phonemes~text
spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"
```

**Format D: `libriheavy`**, `~`-separated, for unprompted text-only data:

```
id~speaker~lang~samples~dur_ms~phonemes~text
spk01_001~spk01~en~93000~3875~_~Hello, how are you today?
```

### 2. Preprocess

```bash
python src/preprocess.py \
    --dataset-type manifest \
    --index your_data.jsonl \
    --audio-dir /path/to/wavs \
    --output-dir /path/to/preprocessed/ \
    --checkpoint /path/to/dramabox-audio-components.safetensors \
    --gemma-root /path/to/gemma-3-12b-it-bnb-4bit/ \
    --max-duration 20.0 --min-duration 2.0
```

Output layout (training-ready `.pt` files):

```
preprocessed/
├── audio_latents/sample_*.pt   # Audio VAE-encoded latents
├── conditions/sample_*.pt      # Gemma text embeddings
└── latents/sample_*.pt         # Dummy video latents (placeholder)
```

### 3. Train

Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the DramaBox files, then launch with HuggingFace `accelerate` (a config sketch and the launch command follow below).
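To make the config step concrete, here is a minimal sketch of what such a YAML might contain. Only `data_dir`, `speaker_index`, `checkpoint`, `full_checkpoint`, and `val_config` are named in this README; every other key name, the file-to-key mapping, and the paths are assumptions, so copy the real `configs/training_args.example.yaml` rather than this sketch.

```yaml
# Hypothetical sketch: NOT the actual schema of configs/training_args.example.yaml.
data_dir: /path/to/preprocessed/                   # output of step 2
speaker_index: /path/to/speaker_index.json         # assumed filename
checkpoint: /path/to/dramabox-dit-v1.safetensors   # assumed file-to-key mapping
full_checkpoint: /path/to/dramabox-audio-components.safetensors  # assumed mapping
val_config: configs/val_config.example.yaml        # optional, enables per-save validation

# Trainer defaults quoted later in this README; these key names are assumed:
lora_rank: 128
lora_alpha: 128
lora_dropout: 0.1
learning_rate: 1.0e-4
warmup_steps: 500
max_steps: 10000
```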
Any flag passed on the CLI overrides the YAML.

```bash
accelerate launch src/train.py \
    --config configs/training_args.example.yaml
```

The trainer attaches a fresh LoRA to the audio branch on top of the DramaBox checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (288 LoRA pairs total). Defaults: rank 128 / alpha 128 / dropout 0.1, with a cosine LR schedule from 1e-4, a 500-step warmup, and 10k total steps.

To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML; `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.

### Inference with your trained LoRA

```bash
python src/inference.py \
    --lora /path/to/your/lora_step_5000.safetensors \
    --voice-sample reference.wav \
    --prompt 'A woman speaks warmly, "..."' \
    --output output.wav
```

Always load the LoRA at inference rather than pre-merging it; pre-merged checkpoints have produced degraded output in our runs.

## Language

English.

## License & acknowledgement

DramaBox is a Resemble AI fine-tune of [LTX-2](https://github.com/Lightricks/LTX-2). It is distributed under the LTX-2 Community License Agreement; see [`LICENSE`](LICENSE). Thanks again to Lightricks for releasing the base model.