# LongCat-AudioDiT-3.5B — FP8 Weight-Only Quantized (ComfyUI)
FP8 quantized version of meituan-longcat/LongCat-AudioDiT-3.5B.
Original Model | Paper | GitHub (Original) | ComfyUI Node
## What is this?

This is a weight-only FP8 quantization of LongCat-AudioDiT-3.5B — a state-of-the-art diffusion-based zero-shot TTS model by Meituan that operates directly in the waveform latent space. The quantization reduces the on-disk size from ~14 GB to ~4 GB and inference VRAM from ~20 GB to ~8 GB, with no perceptible quality loss in practice.
| | Original (3.5B FP32) | This (3.5B FP8) |
|---|---|---|
| Weight dtype | float32 | float8_e4m3fn |
| Activation dtype | bfloat16 | bfloat16 |
| Scale | — | per-tensor float32 |
| File size | ~14 GB | ~4 GB |
| VRAM (inference) | ~20 GB | ~8 GB |
| Extra dependencies | none | none |
## Quantization Details

What is quantized: all linear weight matrices in the DiT transformer backbone. All other weights (VAE, embeddings, layer norms, text encoder) remain in bfloat16.

Method: per-tensor symmetric FP8.

Each weight matrix is quantized using a per-tensor scale factor stored in `fp8_scales.json`:

```python
scale  = max(abs(W)) / FP8_MAX               # FP8_MAX = 448.0 for float8_e4m3fn
W_fp8  = (W_fp32 / scale).to(float8_e4m3fn)  # quantize (cast rounds to nearest)
W_bf16 = W_fp8.to(bfloat16) * scale          # dequantize at inference
```

No external quantization library is required. Dequantization is handled automatically by the ComfyUI-LongCat-AudioDIT-TTS node pack at load time. The model is fully compatible with the original `audiodit` inference code once dequantized.
File layout:

- `model.safetensors` — FP8 weight tensors
- `fp8_scales.json` — per-tensor float32 scale factors for dequantization
## Hardware Requirements
- GPU: NVIDIA GPU with CUDA support
- VRAM: ~8 GB
- Native FP8 tensor cores: Ada Lovelace, Hopper, or Blackwell (RTX 4090, RTX 5090, H100, etc.) — recommended for full speed
- Older GPUs (Ampere and below): Will load and run correctly. Dequantization to bfloat16 happens on all hardware, so you still get the reduced VRAM footprint even without native FP8 cores.
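Whether a GPU has native FP8 tensor cores can be probed from its CUDA compute capability (Ada Lovelace is SM 8.9; Hopper and Blackwell are newer) — a small sketch, assuming a CUDA build of PyTorch:

```python
import torch

def has_native_fp8_tensor_cores(device: int = 0) -> bool:
    """True on Ada Lovelace (SM 8.9) and newer (Hopper, Blackwell)."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability(device) >= (8, 9)
```

Either way the model runs: on older cards the weights are simply dequantized to bfloat16 as described above, so the VRAM savings still apply.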
## Usage — ComfyUI (Recommended)
The easiest way to use this model is with ComfyUI-LongCat-AudioDIT-TTS, which has native support for this FP8 model with zero extra setup.
### Installation
- Install the ComfyUI node via ComfyUI Manager (search `LongCat-AudioDiT`) or manually:

```shell
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git
```
- The model auto-downloads on first use — select `LongCat-AudioDiT-3.5B-fp8` from the model dropdown in any LongCat node. Or download manually:

```shell
huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-fp8 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-fp8
```
### Available Nodes
- LongCat AudioDiT TTS — Zero-shot text-to-speech
- LongCat AudioDiT Voice Clone TTS — Voice cloning from reference audio
- LongCat AudioDiT Multi-Speaker TTS — Multi-speaker conversation synthesis
### Recommended Settings
- `dtype`: `auto` or `bf16` — FP8 weights are dequantized to BF16 at load time
- `guidance_method`: `cfg` for TTS, `apg` for voice cloning
- `steps`: `16` (balanced), `32` (higher quality)
- `keep_model_loaded`: `True` for repeated use
## About LongCat-AudioDiT
LongCat-AudioDiT is a non-autoregressive diffusion-based TTS model from Meituan that achieves state-of-the-art zero-shot voice cloning performance on the Seed benchmark. Unlike previous methods relying on mel-spectrograms, it operates directly in the waveform latent space using only a Wav-VAE and a DiT backbone.
The 3.5B variant achieves 0.818 SIM on Seed-ZH and 0.797 SIM on Seed-Hard, surpassing both open-source and closed-source competitors.
## License
This model inherits the MIT License from meituan-longcat/LongCat-AudioDiT-3.5B.
The FP8 quantization was produced by drbaph and is released under the same license.