LongCat-AudioDiT-3.5B β€” FP8 Weight-Only Quantized (ComfyUI)

FP8 quantized version of meituan-longcat/LongCat-AudioDiT-3.5B.

Original Model | Paper | GitHub (Original) | ComfyUI Node



What is this?

This is a weight-only FP8 quantization of LongCat-AudioDiT-3.5B β€” a state-of-the-art diffusion-based zero-shot TTS model by Meituan that operates directly in the waveform latent space. The quantization cuts the on-disk size roughly in half (~14 GB β†’ ~4 GB) and reduces inference VRAM from ~20 GB to ~8 GB, with no perceptible quality loss in practice.

| | Original (3.5B FP32) | This (3.5B FP8) |
|---|---|---|
| Weight dtype | float32 | float8_e4m3fn |
| Activation dtype | bfloat16 | bfloat16 |
| Scale | β€” | per-tensor float32 |
| File size | ~14 GB | ~4 GB |
| VRAM (inference) | ~20 GB | ~8 GB |
| Extra dependencies | none | none |

Quantization Details

What is quantized: All linear weight matrices in the DiT transformer backbone. All other weights (VAE, embeddings, layer norms, text encoder) remain in bfloat16.

Method: Per-tensor symmetric FP8

Each weight matrix is quantized using a per-tensor scale factor stored in fp8_scales.json:

scale  = max(abs(W)) / FP8_MAX              # FP8_MAX = 448.0 for float8_e4m3fn
W_fp8  = (W_fp32 / scale).to(float8_e4m3fn) # quantize (the cast rounds to nearest FP8)
W_bf16 = W_fp8.to(bfloat16) * scale         # dequantize at inference
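The round trip can be sketched in NumPy. This is a simulation only: round-and-clamp to integer steps stands in for the actual float8_e4m3fn cast, so the error figures are illustrative, not the real FP8 quantization error.

```python
import numpy as np

FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_per_tensor(W: np.ndarray):
    """Per-tensor symmetric quantization: one scale for the whole matrix."""
    scale = float(np.abs(W).max()) / FP8_MAX
    # Simulate the FP8 cast with round + clamp; real code would cast
    # the scaled tensor to torch.float8_e4m3fn instead.
    W_q = np.clip(np.round(W / scale), -FP8_MAX, FP8_MAX)
    return W_q, scale

def dequantize(W_q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights at load time."""
    return W_q * scale

W = np.random.randn(4, 4).astype(np.float32)
W_q, scale = quantize_per_tensor(W)
W_hat = dequantize(W_q, scale)
# Reconstruction error is bounded by half a quantization step (scale / 2).
```

Because the scale is chosen as max(abs(W)) / FP8_MAX, the largest-magnitude weight maps exactly to Β±448 and everything else falls inside the representable range.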

No external quantization library required. Dequantization is handled automatically by the ComfyUI-LongCat-AudioDIT-TTS node pack at load time. The model is fully compatible with the original audiodit inference code once dequantized.

File layout:

  • model.safetensors β€” FP8 weight tensors
  • fp8_scales.json β€” per-tensor float32 scale factors for dequantization
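Given that layout, load-time dequantization reduces to a loop over tensors, multiplying each FP8 weight by its stored scale. Below is a minimal pure-Python sketch; the tensor names, JSON schema, and plain-list "tensors" are illustrative assumptions, not the node pack's actual format.

```python
import json

# Hypothetical fp8_scales.json contents: tensor name -> per-tensor scale.
scales_json = '{"blocks.0.attn.qkv.weight": 0.0021, "blocks.0.mlp.fc1.weight": 0.0017}'
scales = json.loads(scales_json)

# Stand-in for tensors read from model.safetensors
# (plain lists of floats instead of real FP8 tensors).
fp8_weights = {
    "blocks.0.attn.qkv.weight": [12.0, -7.0, 448.0],
    "blocks.0.mlp.fc1.weight": [3.0, 0.0, -448.0],
}

# Dequantize every quantized tensor: W_bf16 = W_fp8 * scale.
dequantized = {
    name: [v * scales[name] for v in values]
    for name, values in fp8_weights.items()
}
```

In the real pipeline the same loop runs over safetensors entries and casts the result to bfloat16, after which the weights behave exactly like an ordinary BF16 checkpoint.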

Hardware Requirements

  • GPU: NVIDIA GPU with CUDA support
  • VRAM: ~8 GB
  • Native FP8 tensor cores: Ada Lovelace or Blackwell (RTX 4090, RTX 5090, H100, etc.) β€” recommended for full speed
  • Older GPUs (Ampere and below): Will load and run correctly. Dequantization to bfloat16 happens on all hardware, so you still get the reduced VRAM footprint even without native FP8 cores.

Usage β€” ComfyUI (Recommended)

The easiest way to use this model is with ComfyUI-LongCat-AudioDIT-TTS, which has native support for this FP8 model with zero extra setup.

Installation

  1. Install the ComfyUI node via ComfyUI Manager (search LongCat-AudioDiT) or manually:

   cd ComfyUI/custom_nodes
   git clone https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git

  2. Let the model auto-download on first use β€” select LongCat-AudioDiT-3.5B-fp8 from the model dropdown in any LongCat node β€” or download it manually:

   huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-fp8 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-fp8

Available Nodes

  • LongCat AudioDiT TTS β€” Zero-shot text-to-speech
  • LongCat AudioDiT Voice Clone TTS β€” Voice cloning from reference audio
  • LongCat AudioDiT Multi-Speaker TTS β€” Multi-speaker conversation synthesis

Recommended Settings

  • dtype: auto or bf16 β€” FP8 weights are dequantized to BF16 at load time
  • guidance_method: cfg for TTS, apg for voice cloning
  • steps: 16 (balanced), 32 (higher quality)
  • keep_model_loaded: True for repeated use

About LongCat-AudioDiT

LongCat-AudioDiT is a non-autoregressive, diffusion-based TTS model from Meituan that achieves state-of-the-art zero-shot voice-cloning performance on the Seed benchmark. Unlike prior methods that rely on mel-spectrogram intermediates, it operates directly in the waveform latent space using only a Wav-VAE and a DiT backbone.

The 3.5B variant achieves 0.818 SIM on Seed-ZH and 0.797 SIM on Seed-Hard, surpassing both open-source and closed-source competitors.


License

This model inherits the MIT License from meituan-longcat/LongCat-AudioDiT-3.5B.

The FP8 quantization was produced by drbaph and is released under the same license.
