DramaBox DiT INT8 – Selective Weight-Only Quantization
A selectively quantized version of the DramaBox TTS 3.3B DiT (Diffusion Transformer) model from Resemble AI. Reduces VRAM by 20% and checkpoint size by 45% while preserving audio quality.
Base model: ResembleAI/Dramabox | Code: resemble-ai/DramaBox | Architecture: LTX-2.3 DiT + Gemma 3 12B
What's included
| File | Size | Description |
|---|---|---|
| `dramabox-dit-int8-selective.safetensors` | 3.37 GB | Quantized DiT weights (INT8 data + BF16 scales) |
| `config.json` | 28 KB | Layer map: which 562 layers are quantized |
| `load_int8.py` | 3.6 KB | Loader script (works with or without torchao) |
| `inference_optimized.py` | 4.3 KB | Full pipeline with INT8 + Gemma CPU offload |
You still need the other components from ResembleAI/Dramabox:
- `dramabox-audio-components.safetensors` (1.9 GB) – VAE + vocoder
- `unsloth/gemma-3-12b-it-bnb-4bit` (~8 GB) – text encoder
Results
| Metric | Baseline (BF16) | This model (INT8) | Change |
|---|---|---|---|
| DiT checkpoint size | 6.1 GB | 3.37 GB | -45% |
| Peak VRAM | 17.39 GB | 13.8 GB | -20.6% |
| VRAM during denoising | 17.39 GB | 5.93 GB | -65.9% |
| Audio quality (MCD) | 0.0 dB | 4.98 dB | Within threshold |
| Generation time | 2.62s | 3.22s | +23% |
MCD (Mel-Cepstral Distortion) measures spectral distance from the BF16 baseline. Lower is better. Scores below 5.0 dB are perceptually near-identical for speech.
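For reference, here is a minimal sketch of computing an MCD score against a baseline waveform. It assumes librosa MFCCs as the cepstral features and simple frame truncation in place of DTW alignment; the exact evaluation settings behind the numbers above are not specified here.

```python
# Minimal MCD sketch; feature settings and alignment are assumptions,
# not the exact harness used for the numbers above.
import numpy as np
import librosa

def mcd_db(ref_wav: str, test_wav: str, sr: int = 22050, n_mfcc: int = 13) -> float:
    ref, _ = librosa.load(ref_wav, sr=sr)
    test, _ = librosa.load(test_wav, sr=sr)
    c_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)
    c_test = librosa.feature.mfcc(y=test, sr=sr, n_mfcc=n_mfcc)
    n = min(c_ref.shape[1], c_test.shape[1])  # truncate instead of DTW-aligning
    diff = c_ref[1:, :n] - c_test[1:, :n]     # drop c0 (overall energy)
    per_frame = np.sqrt(2.0 * np.sum(diff**2, axis=0))
    return float((10.0 / np.log(10.0)) * per_frame.mean())
```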
Quantization details
Method: Selective INT8 weight-only quantization via torchao Int8WeightOnlyConfig. Weights are stored as INT8 with per-channel BF16 scales and dequantized at runtime during matrix multiplication.
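Conceptually, each selected weight matrix is stored as INT8 with one scale per output channel and rebuilt to BF16 on use. A plain-PyTorch sketch of that round trip (this mirrors the idea; torchao's actual kernels fuse the dequantization into the matmul):

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    # w: [out_features, in_features]; symmetric quantization,
    # one scale per output channel (row).
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale.to(torch.bfloat16)

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.bfloat16) * scale

w = torch.randn(4096, 4096)
q, scale = quantize_weight_int8(w)      # what gets stored on disk
w_hat = dequantize(q, scale)            # what the matmul sees at runtime
print((w - w_hat.float()).abs().max())  # per-element quantization error
```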
What's quantized (562 layers, ~81.5% of DiT parameters):
- All attention projections (`to_q`, `to_k`, `to_v`, `to_out`) across all 48 transformer blocks
- All `gate_logits` layers
- All FFN GELU projections (`audio_ff.net.0.proj`) across all 48 blocks
- FFN output projections (`audio_ff.net.2`) in blocks 15–47, excluding block 17
- Input/output projections (`audio_patchify_proj`, `audio_proj_out`)
What's NOT quantized (kept in BF16):
- All normalization layers – extremely sensitive to precision changes
- AdaLN conditioning layers – control the diffusion process globally
- Timestep embedder – conditioning pathway, highly sensitive
- FFN output projections in blocks 0–14 – early blocks are most sensitive to quantization
- FFN output projection in block 17 – an anomalously sensitive individual block
This layer map was discovered through 80+ automated experiments using Andrej Karpathy's auto-research methodology, systematically testing each layer type and block index.
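The schema of `config.json` isn't documented here; assuming it maps layer names to storage dtypes, a quick inspection might look like the following (check the actual file before relying on it):

```python
import json

# Assumption: config.json maps fully qualified layer names to dtypes,
# e.g. {"transformer_blocks.0.attn.to_q": "int8", ...}. The schema here
# is hypothetical; verify against the real file.
with open("config.json") as f:
    layer_map = json.load(f)

n_int8 = sum(1 for v in layer_map.values() if str(v).lower() == "int8")
print(f"{n_int8} layers marked int8 (expect 562)")
```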
Usage
Option 1: Runtime quantization (simplest, no extra downloads)
If you just want VRAM savings without downloading this checkpoint, you can apply quantization at load time to the original DramaBox model:
```python
import re
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# After loading the standard DramaBox TTSServer:
attn_proj_keys = ("to_q", "to_k", "to_v", "to_out")

def dit_filter(mod, fqn):
    """Select the 562-layer subset that tolerates INT8 (see layer map above)."""
    if not isinstance(mod, torch.nn.Linear):
        return False
    if "norm" in fqn:
        return False
    if "gate_logits" in fqn:
        return True
    if any(k in fqn for k in attn_proj_keys):
        return True
    if "audio_ff" in fqn:
        m = re.search(r"transformer_blocks\.(\d+)\.", fqn)
        if m:
            idx = int(m.group(1))
            # FFN output projections: blocks 15-47 only, excluding block 17
            if "net.2" in fqn and idx >= 15 and idx != 17:
                return True
            # FFN GELU projections: all blocks
            if "net.0.proj" in fqn:
                return True
    return False

def io_filter(mod, fqn):
    return fqn in ("audio_patchify_proj", "audio_proj_out") and isinstance(mod, torch.nn.Linear)

quantize_(tts._velocity_model, Int8WeightOnlyConfig(), filter_fn=dit_filter)
quantize_(tts._velocity_model, Int8WeightOnlyConfig(), filter_fn=io_filter)
```
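Before calling `quantize_`, you can sanity-check the filters by counting how many Linear layers they select; per the layer map above, this should come to 562:

```python
# Count the layers selected for INT8 (run before the quantize_ calls).
n = sum(
    1
    for fqn, mod in tts._velocity_model.named_modules()
    if dit_filter(mod, fqn) or io_filter(mod, fqn)
)
print(f"{n} layers selected for INT8")  # expect 562
```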
Option 2: Load pre-quantized weights (faster startup)
```python
from load_int8 import load_int8_dit

# Loads the INT8 safetensors and reconstructs the quantized Linear layers
load_int8_dit(tts._velocity_model, "dramabox-dit-int8-selective.safetensors")
```
Option 3: Full optimized pipeline with Gemma offload
For maximum VRAM savings (5.93 GB during denoising), use the included `inference_optimized.py`, which also offloads Gemma 12B to CPU between text encoding and audio generation.
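The offload pattern itself is simple; a hedged sketch of the idea (the attribute and method names below are illustrative, not the real DramaBox API; see `inference_optimized.py` for the actual implementation):

```python
import torch

# Illustrative offload pattern; text_encoder, encode, and generate_audio
# are assumed names, not the real DramaBox API.
def generate(tts, text: str):
    tts.text_encoder.to("cuda")        # Gemma on GPU only while encoding text
    cond = tts.text_encoder.encode(text)
    tts.text_encoder.to("cpu")         # free ~8 GB before the denoising loop
    torch.cuda.empty_cache()
    return tts.generate_audio(cond)    # DiT + vocoder fit in ~5.93 GB
```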
Requirements
- PyTorch >= 2.4
- torchao >= 0.15.0
- CUDA GPU with >= 16 GB VRAM (14 GB with Gemma offload)
- The original DramaBox model and its dependencies
How this was made
We ran 80+ experiments using an automated loop inspired by Karpathy's auto-research methodology (a runnable skeleton follows the steps below):
- Start from the BF16 baseline
- Modify quantization config (which layers, which precision, which blocks)
- Generate 3 evaluation audio samples with fixed prompts/seeds
- Measure peak VRAM, generation time, and MCD vs baseline
- Keep the change if MCD < 5.0 dB, discard otherwise
- Repeat
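A runnable skeleton of that loop, with `propose` and `evaluate` as stubs standing in for the real harness (the actual experiments generated audio and measured VRAM, time, and MCD):

```python
import random

def propose(config: frozenset) -> frozenset:
    # Stub: toggle one candidate layer in or out of the INT8 set.
    layer = f"transformer_blocks.{random.randrange(48)}.audio_ff.net.2"
    return config ^ {layer}

def evaluate(config: frozenset) -> dict:
    # Stub metrics; the real loop generated 3 fixed-seed samples and measured these.
    return {"mcd_db": random.uniform(3.0, 8.0),
            "peak_vram_gb": 17.4 - 0.006 * len(config)}

best = frozenset()
best_metrics = evaluate(best)
for _ in range(80):
    candidate = propose(best)
    metrics = evaluate(candidate)
    if metrics["mcd_db"] < 5.0 and metrics["peak_vram_gb"] < best_metrics["peak_vram_gb"]:
        best, best_metrics = candidate, metrics  # keep the change, else discard
```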
Key findings from the search:
- Flow-matching diffusion models are far more precision-sensitive than autoregressive LLMs. All 4-bit approaches (NF4, NVFP4, FP4, Int4) produced unacceptable quality (MCD 17–32 dB).
- FP8 is worse than INT8 for weight representation in this model (MCD 11.8 dB vs 4.35 dB).
- `torch.compile` breaks audio output even on the unquantized baseline (MCD 24–32 dB). The iterative denoising loop is numerically sensitive to graph optimizations.
- Early transformer blocks (0–14) are most sensitive in their FFN output projections. Block 17 is an outlier.
- Attention projections and GELU gates are universally robust to INT8 across all 48 blocks.
Citation
If you use this work, please cite the original DramaBox model:
```bibtex
@misc{dramabox2025,
  title={DramaBox: Expressive Text to Speech Model},
  author={Resemble AI},
  year={2025},
  url={https://github.com/resemble-ai/DramaBox}
}
```
License
Same as the base DramaBox model – LTX-2 Community License.