---
title: DramaBox
emoji: π
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 5.7.1
app_file: app.py
pinned: true
license: other
license_name: ltx-2-community
license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
hf_oauth: false
short_description: Expressive TTS with voice cloning – DramaBox demo
---
# DramaBox: Expressive TTS with Voice Cloning

> **Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks.**
> DramaBox is **Resemble AI's** expressive TTS, trained on top of the LTX-2.3 audio branch under the LTX-2 Community License. Huge thanks to the Lightricks team for open-sourcing the base.

Prompt-driven TTS with voice cloning. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses and transitions; an optional 10-second voice reference clones the target timbre. DramaBox is an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only** model.
| | |
|---|---|
| **Model** | [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox) |
| **Demo Space** | [`ResembleAI/Dramabox`](https://huggingface.co/spaces/ResembleAI/Dramabox) (ZeroGPU) |
| **Base model** | [`Lightricks/LTX-2`](https://huggingface.co/Lightricks/LTX-2) |
| **License** | LTX-2 Community License – see [`LICENSE`](LICENSE) |
## Models

Auto-downloaded from the HF model repo on first run.

| File | Size | Description |
|---|---|---|
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (LoRA already merged into base) |
| `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
| [`unsloth/gemma-3-12b-it-bnb-4bit`](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) | ~8 GB | Text encoder |

**VRAM**: ~24 GB peak · **Speed**: ~2.5 s / generation (warm server, H100)
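If you'd rather warm the cache ahead of the first run (e.g. in a build step), the same repos can be fetched explicitly with `huggingface_hub`; a minimal sketch, assuming the app resolves checkpoints from the default HF cache:

```python
from huggingface_hub import snapshot_download

# Pre-fetch the DramaBox checkpoints and the Gemma text encoder so the
# first generation doesn't pay the download cost. Assumes the app reads
# from the default Hugging Face cache directory.
snapshot_download(repo_id="ResembleAI/Dramabox")
snapshot_download(repo_id="unsloth/gemma-3-12b-it-bnb-4bit")
```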
## Quick Start

### Warm server (recommended)

```python
from src.inference_server import TTSServer

server = TTSServer(device="cuda")
server.generate_to_file(
    prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
    output="output.wav",
    voice_ref="reference.wav",  # optional, 10+ seconds
)
```
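`TTSServer` keeps the models resident between calls, which is what the ~2.5 s warm-generation figure above assumes; prefer it over repeated CLI invocations when synthesizing many clips.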
### CLI

```bash
python src/inference.py \
    --voice-sample reference.wav \
    --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
    --output output.wav \
    --cfg-scale 2.5 --stg-scale 1.5
```

### Gradio app

```bash
CUDA_VISIBLE_DEVICES=4 python app.py
```
## Inference Settings

| Parameter | Default | Notes |
|---|---|---|
| `cfg-scale` | 2.5 | Lower = more natural, higher = more text-faithful |
| `stg-scale` | 1.5 | Skip-token guidance |
| `rescale` | 0 | No rescaling |
| `modality` | 1 | No modality guidance |
| `duration-multiplier` | 1.1 | 10% breathing room on auto-estimated length |
| `steps` | 30 | Euler flow matching |
## Prompt Writing Guide

**Structure:** `<speaker description>, "<dialogue>" <action direction> "<more dialogue>"`

**Inside quotes** (model produces actual sounds):

- Laughs: `"Hahaha"` `"Hehehe"` (always one word, never separated)
- Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`

**Outside quotes** (stage directions):

- `She sighs deeply.` · `He gulps nervously.` · `A long pause.`
- `Her voice cracks.` · `He clears his throat.` · `She scoffs.`

**Avoid inside quotes** (model speaks them literally): `Ahem`, `Pfft`, `Sigh`, `Gasp`, `Cough`.

**Tips**

- Match gender/age in the speaker description to the voice reference
- Break long dialogue into segments with action directions in between
- End the prompt at the last closing quote mark (no trailing description)
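Putting the guide together: a minimal sketch that builds a rule-following prompt and feeds it to the `TTSServer` from the Quick Start (the dialogue and the `voice_ref` filename are made up for illustration):

```python
from src.inference_server import TTSServer

# Speaker description, then quoted dialogue, with stage directions outside
# the quotes and the prompt ending on the last closing quote mark.
prompt = (
    'An elderly man speaks slowly, "I was not expecting you tonight." '
    'He clears his throat. A long pause. '
    '"Hahaha, but I am glad you came."'
)

server = TTSServer(device="cuda")
server.generate_to_file(prompt=prompt, output="old_man.wav", voice_ref="old_man_ref.wav")
```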
## Watermarking

Every audio output from `inference.py` and `inference_server.TTSServer.generate_to_file` is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth), an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

```python
import librosa
import perth

wav, sr = librosa.load("output.wav", sr=None, mono=True)
detector = perth.PerthImplicitWatermarker()
print(detector.get_watermark(wav, sample_rate=sr))  # confidence ≈ 1.0
```

Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable it for debugging.
## Training a LoRA on top of DramaBox

You can fine-tune your own LoRA using DramaBox itself as the base; no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.

### 1. Prepare your index file

The preprocessor accepts four formats. The `text` field is the **target transcript**; if you want to attach a scene-style prompt (the part the model conditions on at inference time), prepend it to the transcript in the same format the model was trained on:

> `A woman speaks warmly, "<your transcript here>"`

Both forms are supported, with or without the prompt wrapper. Without the wrapper the model treats the entry as plain text-to-speech.

**Format A – `manifest` (JSONL)** – recommended for new datasets:

```jsonl
{"audio_filepath": "wavs/spk01_001.wav", "text": "A woman speaks warmly, \"Hello, how are you today?\""}
{"audio_filepath": "wavs/spk01_002.wav", "text": "Hello, how are you today?"}
{"audio_filepath": "wavs/spk02_001.flac", "text": "An exhausted father sighs, \"Sweetie, daddy is asking very nicely.\"", "duration": 4.7}
```

Fields: `audio_filepath` (or `audio_path`) is required, `text` (or `transcript`) is required, `duration` is optional.
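For instance, a small sketch that writes Format A entries from `(audio_filepath, transcript)` pairs, optionally applying the prompt wrapper described above (`write_manifest` and its arguments are hypothetical helpers, not part of the repo):

```python
import json

def write_manifest(pairs, out_path, wrapper=None):
    """pairs: iterable of (audio_filepath, transcript).
    wrapper: e.g. 'A woman speaks warmly' for the prompted form,
    or None for plain text-to-speech entries."""
    with open(out_path, "w", encoding="utf-8") as f:
        for audio_filepath, transcript in pairs:
            # json.dumps escapes the inner quotes, matching the JSONL above.
            text = f'{wrapper}, "{transcript}"' if wrapper else transcript
            f.write(json.dumps({"audio_filepath": audio_filepath, "text": text}) + "\n")

write_manifest(
    [("wavs/spk01_001.wav", "Hello, how are you today?")],
    "your_data.jsonl",
    wrapper="A woman speaks warmly",
)
```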
**Format B – `tsv`** – simplest, one line per sample:

```
wavs/spk01_001.wav	A woman speaks warmly, "Hello, how are you today?"
wavs/spk01_002.wav	Hello, how are you today?
```

**Format C – `gemini_synthetic`** – `~`-separated, used for prompted synthetic data:

```
id~speaker~lang~sr~samples~dur~phonemes~text
spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"
```

**Format D – `libriheavy`** – `~`-separated, for unprompted text-only data:

```
id~speaker~lang~samples~dur_ms~phonemes~text
spk01_001~spk01~en~93000~3875~_~Hello, how are you today?
```
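The two `~`-separated layouts differ only in field order and duration units (seconds vs. milliseconds); a hypothetical reader, assuming `~` never appears before the final text column:

```python
def parse_tilde_row(line, fmt):
    """Return (id, speaker, lang, duration_seconds, text) for the formats above."""
    if fmt == "gemini_synthetic":
        # id~speaker~lang~sr~samples~dur~phonemes~text (dur already in seconds)
        uid, speaker, lang, sr, samples, dur, _ph, text = line.rstrip("\n").split("~", 7)
        # Sanity check: dur = samples / sr (holds for the sample row above).
        assert abs(float(dur) - int(samples) / int(sr)) < 1e-3
        return uid, speaker, lang, float(dur), text
    # libriheavy: id~speaker~lang~samples~dur_ms~phonemes~text (milliseconds)
    uid, speaker, lang, _samples, dur_ms, _ph, text = line.rstrip("\n").split("~", 6)
    return uid, speaker, lang, float(dur_ms) / 1000.0, text
```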
### 2. Preprocess

```bash
python src/preprocess.py \
    --dataset-type manifest \
    --index your_data.jsonl \
    --audio-dir /path/to/wavs \
    --output-dir /path/to/preprocessed/ \
    --checkpoint /path/to/dramabox-audio-components.safetensors \
    --gemma-root /path/to/gemma-3-12b-it-bnb-4bit/ \
    --max-duration 20.0 --min-duration 2.0
```
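Because `--min-duration`/`--max-duration` drop out-of-range clips at preprocessing time, it can help to count them beforehand; a sketch over a Format A manifest (the paths and thresholds mirror the command above, `soundfile` is an assumed dependency):

```python
import json
import os

import soundfile as sf

audio_dir = "/path/to/wavs"
kept = dropped = 0
with open("your_data.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        # Use the manifest duration when present, otherwise read the file header.
        dur = entry.get("duration") or sf.info(os.path.join(audio_dir, entry["audio_filepath"])).duration
        if 2.0 <= dur <= 20.0:
            kept += 1
        else:
            dropped += 1
print(f"{kept} clips in range, {dropped} will be filtered out")
```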
Output layout (training-ready `.pt` files):

```
preprocessed/
├── audio_latents/sample_*.pt   # Audio VAE-encoded latents
├── conditions/sample_*.pt      # Gemma text embeddings
└── latents/sample_*.pt         # Dummy video latents (placeholder)
```
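To sanity-check a sample after preprocessing, the `.pt` files load directly with `torch` (the payload layout isn't documented here, so this sketch just prints whatever it finds):

```python
import glob
import torch

path = sorted(glob.glob("preprocessed/audio_latents/sample_*.pt"))[0]
obj = torch.load(path, map_location="cpu")
# The payload may be a tensor or a dict of tensors; print what is there.
if isinstance(obj, dict):
    for k, v in obj.items():
        print(k, getattr(v, "shape", type(v)))
else:
    print(type(obj), getattr(obj, "shape", None))
```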
### 3. Train

Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the DramaBox files, then launch with Hugging Face `accelerate`. Any flag passed on the CLI overrides the YAML.

```bash
accelerate launch src/train.py \
    --config configs/training_args.example.yaml
```

The trainer attaches a fresh LoRA to the audio branch on top of the DramaBox checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
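The 288 figure is just 6 target modules × 48 blocks; a sketch that enumerates them (the `transformer_blocks.{i}` prefix is an assumed module path, for illustration only):

```python
ATTN = ["to_q", "to_k", "to_v", "to_out.0"]
FF = ["net.0.proj", "net.2"]

targets = [
    f"transformer_blocks.{i}.{parent}.{leaf}"  # assumed prefix, for illustration
    for i in range(48)
    for parent, leaf in (
        [("audio_attn1", m) for m in ATTN] + [("audio_ff", m) for m in FF]
    )
]
assert len(targets) == 48 * 6 == 288  # one LoRA A/B pair per target module
```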
To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML; `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.
### Inference with your trained LoRA

```bash
python src/inference.py \
    --lora /path/to/your/lora_step_5000.safetensors \
    --voice-sample reference.wav \
    --prompt 'A woman speaks warmly, "..."' \
    --output output.wav
```

Always load the LoRA at inference rather than pre-merging it; pre-merged checkpoints have produced degraded output in our runs.
## Language

English.

## License & acknowledgement

DramaBox is a Resemble AI fine-tune of [LTX-2](https://github.com/Lightricks/LTX-2). Distributed under the LTX-2 Community License Agreement; see [`LICENSE`](LICENSE). Thanks again to Lightricks for releasing the base model.