LongCat-AudioDiT-3.5B - BF16 Weight-Only Quantized (ComfyUI)

BF16 quantized version of meituan-longcat/LongCat-AudioDiT-3.5B.

Original Model | Paper | GitHub (Original) | ComfyUI Node



What is this?

This is a BF16 conversion of LongCat-AudioDiT-3.5B, a state-of-the-art diffusion-based zero-shot TTS model by Meituan that operates directly in the waveform latent space. Converting from FP32 to BF16 halves the on-disk size (~14 GB to ~7 GB) and cuts inference VRAM from ~20 GB to ~12 GB with negligible quality loss, making it the recommended variant for most users.

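|                    | Original (3.5B FP32) | This (3.5B BF16)    |
|--------------------|----------------------|---------------------|
| Weight dtype       | float32              | bfloat16            |
| Activation dtype   | float32              | bfloat16            |
| File size          | ~14 GB               | ~7 GB               |
| VRAM (inference)   | ~20 GB               | ~12 GB              |
| Quality            | Reference            | Virtually identical |
| Extra dependencies | none                 | none                |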
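The file sizes in the table follow from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes a nominal 3.5 billion parameters (taken from the model name, not from an exact count) and decimal gigabytes. The inference VRAM figures are higher than the raw weight size, presumably because activations and intermediate buffers also need memory.

```python
# Back-of-the-envelope weight size: parameters x bytes per parameter.
PARAMS = 3.5e9  # assumption: nominal count implied by the "3.5B" model name

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Return the raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_size_gb(PARAMS, dtype):.0f} GB")
```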

Conversion Details

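All model weights (DiT transformer backbone, Wav-VAE, and text encoder) are converted from float32 to bfloat16. BF16 keeps the same dynamic range as FP32 (8 exponent bits) but stores 7 explicit mantissa bits instead of 23, so each weight is rounded to roughly 2-3 significant decimal digits. The cast is therefore not strictly lossless, but it halves memory usage and is a practical choice for inference on modern GPUs.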
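To make the range-versus-precision trade-off concrete, here is a minimal pure-Python sketch of what a float32-to-bfloat16 cast does to a single value. It keeps the top 16 bits of the float32 bit pattern with round-to-nearest-even. This illustrates the number format only and is not the conversion script used for this checkpoint; the helper name `to_bf16` is hypothetical, and NaN inputs are not handled.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value and return it as a Python float.

    bfloat16 is the upper 16 bits of a float32: 1 sign bit, 8 exponent bits,
    7 mantissa bits. NaN handling is intentionally omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that will be dropped.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    # Large magnitudes survive (same exponent range as float32) ...
    print(to_bf16(3.0e38))
    # ... but fine detail near 1.0 is lost (spacing there is 2**-7).
    print(to_bf16(1.001), to_bf16(1.0 + 2**-7))
```

For normal-range values the relative rounding error stays within about 2**-8 (roughly 0.4%), which is consistent with the "virtually identical" quality reported above.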

No post-training quantization, calibration data, or scale factors are required. The model is a direct dtype cast and is fully compatible with the original audiodit inference code.


Hardware Requirements

  • GPU: NVIDIA GPU with CUDA support (BF16 supported on Ampere and newer; falls back gracefully on older hardware)
  • VRAM: ~12 GB for inference (the BF16 weights alone take ~7 GB)
  • CPU: Supported but slow
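
The Ampere requirement can be expressed as a compute-capability check, since Ampere GPUs report CUDA compute capability 8.x. The helper below is a hypothetical sketch: the function name, the "fp32" label, and the choice of float32 as the fallback are assumptions for illustration, not the node's actual fallback logic.

```python
from typing import Optional, Tuple

AMPERE_MAJOR = 8  # NVIDIA Ampere GPUs report compute capability 8.x


def pick_dtype(compute_capability: Optional[Tuple[int, int]]) -> str:
    """Pick a dtype label from a CUDA compute capability, or None for CPU.

    Assumption: float32 is used as the conservative fallback when native
    BF16 is unavailable; the real node may choose differently.
    """
    if compute_capability is None:
        return "fp32"  # CPU path
    major, _minor = compute_capability
    return "bf16" if major >= AMPERE_MAJOR else "fp32"


if __name__ == "__main__":
    print(pick_dtype((8, 6)))  # an Ampere-class capability
    print(pick_dtype((7, 5)))  # a pre-Ampere capability
    print(pick_dtype(None))    # no CUDA device
```

With PyTorch installed, the `(major, minor)` tuple can be obtained from `torch.cuda.get_device_capability()`, and `torch.cuda.is_bf16_supported()` offers a direct check.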

Usage - ComfyUI (Recommended)

The easiest way to use this model is with ComfyUI-LongCat-AudioDIT-TTS, which has native support for this BF16 model with zero extra setup.

Installation

  1. Install the ComfyUI node via ComfyUI Manager (search LongCat-AudioDiT) or manually:
   cd ComfyUI/custom_nodes
   git clone https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git

  2. The model auto-downloads on first use: select LongCat-AudioDiT-3.5B-bf16 from the model dropdown in any LongCat node.

  3. Or download manually:

   huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-bf16 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-bf16
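
The manual download can also be scripted from Python. This is a minimal sketch using `snapshot_download` from the `huggingface_hub` package (install it with `pip install huggingface_hub`); the helper names `model_dir` and `download` are hypothetical, and the target path mirrors the `--local-dir` used in the command above.

```python
from pathlib import Path

REPO_ID = "drbaph/LongCat-AudioDiT-3.5B-bf16"


def model_dir(comfy_root: str = "ComfyUI") -> Path:
    """Build the folder the CLI command above downloads into."""
    return Path(comfy_root) / "models" / "audiodit" / REPO_ID.split("/")[-1]


def download(comfy_root: str = "ComfyUI") -> Path:
    """Fetch the full repository snapshot into the ComfyUI models folder."""
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import snapshot_download

    target = model_dir(comfy_root)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target


if __name__ == "__main__":
    print(f"Model files stored in: {download()}")
```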

Available Nodes

  • LongCat AudioDiT TTS: Zero-shot text-to-speech
  • LongCat AudioDiT Voice Clone TTS: Voice cloning from reference audio
  • LongCat AudioDiT Multi-Speaker TTS: Multi-speaker conversation synthesis

Recommended Settings

  • dtype: auto or bf16 (matches this model's native dtype)
  • guidance_method: cfg for TTS, apg for voice cloning
  • steps: 16 (balanced), 32 (higher quality)
  • keep_model_loaded: True for repeated use
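
The settings above can be summarized as a small lookup. The sketch below is purely illustrative: the function and the task labels "tts" and "voice_clone" are hypothetical and are not part of the ComfyUI node's API; only the setting names and values come from the list above.

```python
def recommended_settings(task: str, high_quality: bool = False) -> dict:
    """Return the recommended node settings for a task.

    task: "tts" or "voice_clone" (hypothetical labels for this sketch).
    high_quality: use 32 steps instead of the balanced 16.
    """
    if task not in ("tts", "voice_clone"):
        raise ValueError(f"unknown task: {task!r}")
    return {
        "dtype": "bf16",  # "auto" is equally valid per the list above
        "guidance_method": "cfg" if task == "tts" else "apg",
        "steps": 32 if high_quality else 16,
        "keep_model_loaded": True,
    }


if __name__ == "__main__":
    print(recommended_settings("tts"))
    print(recommended_settings("voice_clone", high_quality=True))
```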

This is the recommended variant for most users β€” best balance of quality, VRAM usage, and compatibility.


About LongCat-AudioDiT

LongCat-AudioDiT is a non-autoregressive diffusion-based TTS model from Meituan that achieves state-of-the-art zero-shot voice cloning performance on the Seed benchmark. Unlike previous methods relying on mel-spectrograms, it operates directly in the waveform latent space using only a Wav-VAE and a DiT backbone.

The 3.5B variant achieves 0.818 SIM on Seed-ZH and 0.797 SIM on Seed-Hard, surpassing both open-source and closed-source competitors.


License

This model inherits the MIT License from meituan-longcat/LongCat-AudioDiT-3.5B.

The BF16 conversion was produced by drbaph and is released under the same license.
