LAION-Box Emotional v0.7: expressive voice-acting TTS (fully-merged checkpoints)
Three fully-merged, standalone audio-DiT checkpoints of the LAION-Box / DramaBox text-to-speech model, fine-tuned to produce more emotionally expressive speech. Each file is a complete model; no LoRA loading is required.
Lineage (important)
- Base: Lightricks/LTX-2.3, specifically the LTX-2.3 3.3B audio-only DiT (flow-matching audio latent transformer, AVTransformer3DModel, caption_channels=3840, metadata model_version 2.3.0). (It is LTX-2.3, not "LTX-2"; the audio DiT has 3.3B params. The "22b" in the original filename refers to the full multimodal LTX-2.3, not this audio branch.)
- DramaBox (ResembleAI/Dramabox): Resemble AI's expressive TTS, an IC-LoRA fine-tune of the LTX-2.3 3.3B audio-only model.
- run16 "v0.7" LoRA (rank 256): LAION continued fine-tune on the diverse DramaBox tuning mix (German 70% + diverse voice-acting 30%), step 19,500, merged in.
- Emotion LoRA (rank 32, α 32): trained here for 10 epochs on a high-emotion subset, merged in.
The result is a single standalone DiT you use exactly like the base LTX-2.3 audio model.
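The "merged in" steps in the lineage above can be pictured with a minimal sketch of a LoRA merge. It assumes the standard LoRA convention W' = W + (α / r)·B·A, which this card does not confirm was the exact procedure used; `merge_lora` and `matmul` are hypothetical helpers, and the matrices are toy values in plain Python lists rather than real checkpoint tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Fold a LoRA update into a base weight: W' = W + (alpha / r) * B @ A.

    weight: (out x in), lora_a: (r x in), lora_b: (out x r).
    Assumes the standard LoRA scaling; not taken from the training code.
    """
    rank = len(lora_a)               # r = number of rows of A
    scale = alpha / rank             # alpha == rank gives scale 1.0
    delta = matmul(lora_b, lora_a)   # low-rank update, same shape as weight
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 2x2 identity weight, rank-1 adapter, alpha equal to the rank.
base = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # (1 x 2)
B = [[0.5], [1.0]]      # (2 x 1)
merged = merge_lora(base, A, B, alpha=1.0)
print(merged)  # [[1.5, 1.0], [1.0, 3.0]]
```

With α equal to the rank, as for the emotion LoRA (rank 32, α 32), the scale factor under this convention is 1, so the low-rank product is added to the weights unscaled. After merging, no adapter needs to be loaded at inference time.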
Files
The model (this repo)
| file | source LoRA step | flow loss | notes |
|---|---|---|---|
| LAION-Box-Emotional-v0.7_best1_step850.safetensors | 850 (~epoch 9.7) | 0.111 | strongest emotional fit (recommended) |
| LAION-Box-Emotional-v0.7_best2_step800.safetensors | 800 (~epoch 9.1) | 0.123 | near-best |
| LAION-Box-Emotional-v0.7_best3_step150.safetensors | 150 (~epoch 1.7) | 0.145 | lightest adaptation, closest to base |
| dramabox-audio-components.safetensors | n/a | n/a | VAE + vocoder + audio connector (from ResembleAI/Dramabox, ~1.9 GB). Required to turn DiT latents into a waveform. |
| inference.py, download_components.py | n/a | n/a | runnable example, plus a script that fetches the two third-party foundation models listed below |
Each *Emotional*.safetensors is LTX-2.3 base + DramaBox + run16 LoRA + emotion LoRA, all merged
(emotion LoRA: α=32, rank=32). The three files are interchangeable standalone checkpoints.
Components needed for inference
| role | what | where | size |
|---|---|---|---|
| audio DiT | this repo's *Emotional*.safetensors | included | 6.1 GB each |
| VAE + vocoder | dramabox-audio-components.safetensors | included | 1.9 GB |
| text / prompt encoder | unsloth/gemma-3-12b-it-bnb-4bit (Google Gemma 3 12B, 4-bit) | fetched by download_components.py | ~7.4 GB |
| reference denoiser (RE-USE) | nvidia/RE-USE (SEMamba) | fetched by download_components.py | small |
| pipeline code | DramaBox / LTX-2.3 ltx2 core + src/ | ResembleAI/Dramabox | n/a |
The two foundation models (Google Gemma as the prompt encoder, NVIDIA RE-USE as the reference denoiser) are not re-hosted here; they are fetched from their canonical repos by
download_components.py, under their own licenses (Gemma / NVIDIA). Everything DramaBox/LTX-2.3-specific (the DiT + VAE + vocoder) is in this repo.
Selection note: on this small (5.6k-sample) fine-tune, flow-matching loss is flat across epochs and only weakly tied to emotional expressivity, so A/B the three checkpoints on your own prompts rather than trusting the loss ranking.
Requirements
pip install torch safetensors librosa soundfile huggingface_hub transformers
# + the DramaBox / LTX-2.3 pipeline (ltx2 core + src/) from ResembleAI/Dramabox
python download_components.py # fetches Gemma + RE-USE
GPU: ≥ ~24 GB VRAM (bf16; 4-bit Gemma option).
Usage
CLI (DramaBox src/inference.py)
python src/inference.py \
--checkpoint LAION-Box-Emotional-v0.7_best1_step850.safetensors \
--full-checkpoint dramabox-audio-components.safetensors \
--prompt "A woman, trembling with grief: 'I can't do this anymore.'" \
--voice-ref reference_voice.wav \
--output out.wav \
--cfg-scale 2.5 --stg-scale 1.5 --seed 42
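To A/B the three checkpoints as the selection note suggests, the CLI call above can be repeated once per checkpoint with the prompt, voice reference, and seed held fixed. The sketch below only builds and prints the commands; it uses the flags shown above and nothing else. `build_ab_commands` and the `ab_<tag>.wav` output names are illustrative choices, not part of the DramaBox code.

```python
import shlex

CHECKPOINTS = [
    "LAION-Box-Emotional-v0.7_best1_step850.safetensors",
    "LAION-Box-Emotional-v0.7_best2_step800.safetensors",
    "LAION-Box-Emotional-v0.7_best3_step150.safetensors",
]


def build_ab_commands(prompt, voice_ref, seed=42, cfg_scale=2.5, stg_scale=1.5):
    """Return one src/inference.py command (as an argv list) per checkpoint."""
    commands = []
    for ckpt in CHECKPOINTS:
        tag = ckpt.split("_")[1]  # "best1", "best2", "best3"
        commands.append([
            "python", "src/inference.py",
            "--checkpoint", ckpt,
            "--full-checkpoint", "dramabox-audio-components.safetensors",
            "--prompt", prompt,
            "--voice-ref", voice_ref,
            "--output", f"ab_{tag}.wav",   # hypothetical output naming
            "--cfg-scale", str(cfg_scale),
            "--stg-scale", str(stg_scale),
            "--seed", str(seed),           # same seed so only the checkpoint varies
        ])
    return commands


commands = build_ab_commands(
    prompt="A woman, trembling with grief: 'I can't do this anymore.'",
    voice_ref="reference_voice.wav",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the DramaBox checkout to actually render the files. Keeping the seed and reference fixed means audible differences come from the checkpoint rather than from sampling noise.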
Python (TTSServer)
import sys; sys.path.insert(0, "DramaBox/src")
from inference_server import TTSServer
tts = TTSServer(
checkpoint="LAION-Box-Emotional-v0.7_best1_step850.safetensors", # this DiT
full_checkpoint="dramabox-audio-components.safetensors", # VAE + vocoder
gemma_root="<gemma snapshot dir from download_components.py>", # prompt encoder
device="cuda", dtype="bf16", bnb_4bit=True,
)
tts.generate_to_file(
prompt="An old man, warm and amused, chuckling: 'You remind me of myself at your age.'",
output="out.wav",
voice_ref="reference_voice.wav", # 5-10 s clean speaker reference
cfg_scale=2.5, stg_scale=1.5, duration_multiplier=1.1,
ref_duration=10.0, denoise_ref=True, seed=42,
)
Prompting for emotion
The model conditions on a natural-language description (via Gemma) + a voice reference.
Put the emotional direction in the prompt ("furious, shouting", "tender and hushed",
"nervous, voice shaking"). This checkpoint biases delivery toward stronger emotion than the base.
Key knobs: cfg_scale (higher = follows the emotional prompt harder, ~2-4), stg_scale (stability, ~1-2),
voice_ref (timbre), denoise_ref (clean the reference via RE-USE).
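One way to compare emotional directions is to keep the spoken line fixed and vary only the description. The sketch below assumes the "speaker, emotion: 'line'" pattern seen in the usage examples; that pattern is inferred from those examples, not a documented requirement, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(speaker, emotion, line):
    """Compose a prompt in the 'speaker, emotion: quoted line' shape."""
    return f"{speaker}, {emotion}: '{line}'"


# Emotional directions taken from the examples in this section.
EMOTIONS = ["furious, shouting", "tender and hushed", "nervous, voice shaking"]

line = "I can't do this anymore."
variants = [build_prompt("A woman", emotion, line) for emotion in EMOTIONS]
for prompt in variants:
    print(prompt)
```

Each resulting string can be passed as `prompt` to the CLI or to `generate_to_file`, with the same `voice_ref` and `seed`, so that the emotional direction is the only thing that changes between renders.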
Training (emotion stage)
rank 32, α 32, dropout 0.0; lr 1e-4, 8× GPU, grad-accum 8, bf16; 10 epochs (~875 steps), trained
directly on precomputed DramaBox latents (tgt_latent + cond). The 3 checkpoints with the lowest
flow-matching loss were each merged into the base and shipped.
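The step and epoch figures quoted in this card are consistent with one another under a simple assumption. The sketch below takes the 5,596-clip set from the provenance section, 8 GPUs, and grad-accum 8, and assumes a per-GPU micro-batch of 1. That micro-batch size is not stated anywhere in this card; it is simply the value that reproduces the reported numbers.

```python
import math

clips = 5596        # fine-tune set size (from the provenance section)
gpus = 8
grad_accum = 8
micro_batch = 1     # ASSUMPTION: per-GPU batch size, not stated in the card
epochs = 10

effective_batch = gpus * grad_accum * micro_batch   # clips per optimizer step
steps_per_epoch = clips / effective_batch
total_steps = math.ceil(steps_per_epoch * epochs)


def step_to_epoch(step):
    """Convert an optimizer step to a (fractional) epoch."""
    return step / steps_per_epoch


print(effective_batch)                 # 64
print(total_steps)                     # 875
for step in (850, 800, 150):
    print(step, round(step_to_epoch(step), 1))   # 9.7, 9.1, 1.7
```

Under that assumption the effective batch is 64 clips per step, which gives about 87.4 steps per epoch and matches the "~875 steps" total as well as the epoch labels on the three shipped checkpoints.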
Emotion data provenance
Fine-tune set = top-emotional subset (5,596 clips) of the DramaBox mix: every clip was scored with
laion/Empathic-Insight-Voice-Plus (40 EmoNet emotions), and the top 10 % per source dataset
(Elise: 30 %) by intensity, Σ(score − per-dim mean), was kept.
Dataset: TTS-AGI/emotional-voice-acting-subset-v0.7 (private).
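The selection rule above can be sketched as follows. This is an illustration under stated assumptions, not the script that built the dataset: the per-dimension mean is taken over all clips (it could equally have been per source), the sum runs over all emotion dimensions, the cut-off count is rounded to the nearest integer, and `select_top_emotional` and the clip dictionary layout are hypothetical.

```python
from collections import defaultdict


def select_top_emotional(clips, default_frac=0.10, per_source_frac=None):
    """Keep the most intense clips per source dataset.

    clips: list of dicts with 'id', 'source', and 'scores' (one float per emotion).
    Intensity = sum over dimensions of (score - per-dimension mean).
    """
    per_source_frac = per_source_frac or {}
    n_dims = len(clips[0]["scores"])
    # ASSUMPTION: per-dimension mean is computed over all clips, not per source.
    means = [sum(c["scores"][d] for c in clips) / len(clips) for d in range(n_dims)]

    by_source = defaultdict(list)
    for c in clips:
        intensity = sum(s - m for s, m in zip(c["scores"], means))
        by_source[c["source"]].append((intensity, c["id"]))

    kept = []
    for source, scored in by_source.items():
        frac = per_source_frac.get(source, default_frac)
        k = max(1, round(frac * len(scored)))   # ASSUMPTION: rounding rule
        scored.sort(reverse=True)               # highest intensity first
        kept.extend(clip_id for _, clip_id in scored[:k])
    return kept


# Toy data: 2 emotion dimensions instead of 40, 10 clips per source.
toy = [{"id": f"o{i}", "source": "other", "scores": [i * 0.1, 0.0]} for i in range(10)]
toy += [{"id": f"e{i}", "source": "Elise", "scores": [i * 0.1, 0.05]} for i in range(10)]
kept = select_top_emotional(toy, per_source_frac={"Elise": 0.30})
print(sorted(kept))  # ['e7', 'e8', 'e9', 'o9']
```

Under the global-mean assumption, the centering shifts every clip's intensity by the same constant, so it changes the intensity values but not the ranking. A per-source mean would behave the same way within each source.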
Credits & license
- LTX-2.3 base model © Lightricks, LTX-2 Community License.
- DramaBox expressive TTS © Resemble AI (IC-LoRA fine-tune of LTX-2.3).
- Prompt encoder: Google Gemma (Gemma license). Reference denoiser: NVIDIA RE-USE.
- This emotional fine-tune by LAION. Governed by the LTX-2 Community License; research/eval use.