DramaBox DiT INT8: Selective Weight-Only Quantization

A selectively quantized version of the DramaBox TTS 3.3B DiT (Diffusion Transformer) model from Resemble AI. It reduces peak VRAM by ~21% and checkpoint size by 45% while preserving audio quality.

Base model: ResembleAI/Dramabox | Code: resemble-ai/DramaBox | Architecture: LTX-2.3 DiT + Gemma 3 12B

What's included

| File | Size | Description |
|------|------|-------------|
| dramabox-dit-int8-selective.safetensors | 3.37 GB | Quantized DiT weights (INT8 data + BF16 scales) |
| config.json | 28 KB | Layer map: which 562 layers are quantized |
| load_int8.py | 3.6 KB | Loader script (works with or without torchao) |
| inference_optimized.py | 4.3 KB | Full pipeline with INT8 + Gemma CPU offload |

You still need the other components from ResembleAI/Dramabox; this repository contains only the quantized DiT weights.

Results

| Metric | Baseline (BF16) | This model (INT8) | Change |
|--------|-----------------|-------------------|--------|
| DiT checkpoint size | 6.1 GB | 3.37 GB | -45% |
| Peak VRAM | 17.39 GB | 13.8 GB | -20.6% |
| VRAM during denoising | 17.39 GB | 5.93 GB | -65.9% |
| Audio quality (MCD) | 0.0 dB | 4.98 dB | Within threshold |
| Generation time | 2.62 s | 3.22 s | +23% |

MCD (Mel-Cepstral Distortion) measures spectral distance from the BF16 baseline. Lower is better. Scores below 5.0 dB are perceptually near-identical for speech.
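
For reference, here is a minimal sketch of how an MCD score of this kind can be computed (a generic implementation built on librosa MFCCs; the sample rate, coefficient count, and frame alignment are assumptions, not the exact evaluation script behind the table above):

import numpy as np
import librosa

def mcd_db(ref_wav, syn_wav, sr=24000, n_mfcc=13):
    """Mel-Cepstral Distortion (dB) between two roughly frame-aligned waveforms."""
    # Mel-cepstral coefficients; c0 (overall energy) is conventionally excluded.
    ref = librosa.feature.mfcc(y=ref_wav, sr=sr, n_mfcc=n_mfcc)[1:]
    syn = librosa.feature.mfcc(y=syn_wav, sr=sr, n_mfcc=n_mfcc)[1:]
    # Fixed prompts/seeds keep the outputs roughly aligned, so we truncate to the
    # common length instead of doing DTW alignment.
    n = min(ref.shape[1], syn.shape[1])
    diff = ref[:, :n] - syn[:, :n]
    # Standard MCD constant: (10 / ln 10) * sqrt(2).
    return (10.0 / np.log(10.0)) * np.sqrt(2.0) * np.mean(np.sqrt((diff ** 2).sum(axis=0)))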

Quantization details

Method: Selective INT8 weight-only quantization via torchao Int8WeightOnlyConfig. Weights are stored as INT8 with per-channel BF16 scales and dequantized at runtime during matrix multiplication.
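
For intuition, here is a minimal sketch of what per-channel weight-only INT8 storage amounts to, in plain PyTorch (illustrative arithmetic only, not the torchao internals):

import torch

def quantize_weight_int8(w):
    """Per-output-channel symmetric INT8 quantization of a Linear weight.

    w: BF16 tensor of shape (out_features, in_features).
    """
    # One scale per output channel, chosen so the max magnitude maps to 127.
    scale = (w.abs().amax(dim=1, keepdim=True).float() / 127.0).clamp_min(1e-12)
    w_int8 = torch.clamp(torch.round(w.float() / scale), -127, 127).to(torch.int8)
    return w_int8, scale.to(torch.bfloat16)  # stored side by side in the checkpoint

def dequantize_weight(w_int8, scale):
    # At matmul time the BF16 weight is reconstructed from the INT8 data and scales.
    return (w_int8.float() * scale.float()).to(torch.bfloat16)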

What's quantized (562 layers, ~81.5% of DiT parameters):

  • All attention projections (to_q, to_k, to_v, to_out) across all 48 transformer blocks
  • All gate_logits layers
  • All FFN GELU projections (audio_ff.net.0.proj) across all 48 blocks
  • FFN output projections (audio_ff.net.2) in blocks 15–47, excluding block 17
  • Input/output projections (audio_patchify_proj, audio_proj_out)

What's NOT quantized (kept in BF16):

  • All normalization layers - extremely sensitive to precision changes
  • AdaLN conditioning layers - these control the diffusion process globally
  • Timestep embedder - a highly sensitive conditioning pathway
  • FFN output projections in blocks 0–14 - early blocks are the most sensitive to quantization
  • FFN output projection in block 17 - an anomalously sensitive individual block

This layer map was discovered through 80+ automated experiments inspired by Andrej Karpathy's auto-research methodology, systematically testing each layer type and block index.

Usage

Option 1: Runtime quantization (simplest, no extra downloads)

If you just want VRAM savings without downloading this checkpoint, you can apply quantization at load time to the original DramaBox model:

import re

import torch
from torchao.quantization import Int8WeightOnlyConfig, quantize_

# After loading the standard DramaBox TTSServer as `tts`:
attn_proj_keys = ("to_q", "to_k", "to_v", "to_out")

def dit_filter(mod, fqn):
    """Select the Linear layers inside the transformer blocks that tolerate INT8."""
    if not isinstance(mod, torch.nn.Linear):
        return False
    if "norm" in fqn:  # never quantize normalization layers
        return False
    if "gate_logits" in fqn:  # gating layers are robust in every block
        return True
    if any(k in fqn for k in attn_proj_keys):  # all attention projections
        return True
    if "audio_ff" in fqn:
        m = re.search(r"transformer_blocks\.(\d+)\.", fqn)
        if m:
            idx = int(m.group(1))
            # FFN output projections: only blocks 15-47, excluding block 17.
            if "net.2" in fqn and idx >= 15 and idx != 17:
                return True
            # FFN GELU input projections are robust everywhere.
            if "net.0.proj" in fqn:
                return True
    return False

def io_filter(mod, fqn):
    # Top-level input/output projections, quantized in a second pass.
    return fqn in ("audio_patchify_proj", "audio_proj_out") and isinstance(mod, torch.nn.Linear)

quantize_(tts._velocity_model, Int8WeightOnlyConfig(), filter_fn=dit_filter)
quantize_(tts._velocity_model, Int8WeightOnlyConfig(), filter_fn=io_filter)
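
Two quantize_ passes are used because the top-level projections (audio_patchify_proj, audio_proj_out) match none of the patterns in dit_filter; the second pass picks them up with io_filter. Each pass only touches the modules its filter accepts, so the order of the two calls does not matter.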

Option 2: Load pre-quantized weights (faster startup)

from load_int8 import load_int8_dit

# Loads the INT8 safetensors and reconstructs quantized Linear layers
load_int8_dit(tts._velocity_model, "dramabox-dit-int8-selective.safetensors")
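
If you want to inspect the pre-quantized checkpoint before loading it, the safetensors format makes that straightforward (a generic sketch; the exact tensor names depend on how load_int8.py serializes the INT8 data and BF16 scales):

from safetensors import safe_open

with safe_open("dramabox-dit-int8-selective.safetensors", framework="pt") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(name, tuple(t.shape), t.dtype)  # INT8 weight tensors and BF16 scales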

Option 3: Full optimized pipeline with Gemma offload

For maximum VRAM savings (5.93 GB during denoising), use the included inference_optimized.py, which also offloads the Gemma 12B text encoder to CPU between text encoding and audio generation.
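
The offload pattern itself is simple; here is a minimal sketch of the idea (the attribute and method names below are hypothetical placeholders, not the actual inference_optimized.py API):

import torch

def generate_with_offload(tts, text):
    # Stage 1: text encoding. Only the Gemma encoder needs to be on the GPU.
    tts.text_encoder.to("cuda")            # hypothetical attribute name
    cond = tts.encode_text(text)           # hypothetical method
    # Stage 2: evict the 12B-parameter encoder before denoising starts.
    tts.text_encoder.to("cpu")
    torch.cuda.empty_cache()
    # Stage 3: iterative denoising with only the INT8 DiT resident on the GPU.
    return tts.generate_audio(cond)        # hypothetical method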

Requirements

  • PyTorch >= 2.4
  • torchao >= 0.15.0
  • CUDA GPU with >= 16 GB VRAM (14 GB with Gemma offload)
  • The original DramaBox model and its dependencies

How this was made

We ran 80+ experiments using an automated loop inspired by Karpathy's auto-research methodology:

  1. Start from the BF16 baseline
  2. Modify quantization config (which layers, which precision, which blocks)
  3. Generate 3 evaluation audio samples with fixed prompts/seeds
  4. Measure peak VRAM, generation time, and MCD vs baseline
  5. Keep the change if MCD < 5.0 dB, discard otherwise
  6. Repeat
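
In sketch form, the loop looks like this (propose_change and eval_candidate are hypothetical stand-ins for the actual automation):

KEEP_THRESHOLD_DB = 5.0  # MCD acceptance bound from step 5

def auto_search(cfg, propose_change, eval_candidate, n_trials=80):
    """Greedy accept/reject search over quantization configs."""
    for _ in range(n_trials):
        candidate = propose_change(cfg)        # e.g. quantize one more layer type/block
        metrics = eval_candidate(candidate)    # fixed prompts/seeds: VRAM, time, MCD
        if metrics["mcd_db"] < KEEP_THRESHOLD_DB:
            cfg = candidate                    # keep only near-transparent changes
    return cfg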

Key findings from the search:

  • Flow-matching diffusion models are far more precision-sensitive than autoregressive LLMs. All 4-bit approaches (NF4, NVFP4, FP4, Int4) produced unacceptable quality (MCD 17–32 dB).
  • FP8 is worse than INT8 for weight representation in this model (MCD 11.8 dB vs 4.35 dB).
  • torch.compile breaks audio output even on the unquantized baseline (MCD 24–32 dB). The iterative denoising loop is numerically sensitive to graph optimizations.
  • Early transformer blocks (0–14) are most sensitive in their FFN output projections. Block 17 is an outlier.
  • Attention projections and GELU gates are universally robust to INT8 across all 48 blocks.

Citation

If you use this work, please cite the original DramaBox model:

@misc{dramabox2025,
  title={DramaBox: Expressive Text to Speech Model},
  author={Resemble AI},
  year={2025},
  url={https://github.com/resemble-ai/DramaBox}
}

License

Same as the base DramaBox model: the LTX-2 Community License.
