LongCat-AudioDiT-3.5B — BF16 Weight-Only Quantized (ComfyUI)

BF16 (bfloat16) conversion of meituan-longcat/LongCat-AudioDiT-3.5B.

Original Model | Paper | GitHub (Original) | ComfyUI Node

What is this?

This is a BF16 conversion of LongCat-AudioDiT-3.5B, a state-of-the-art diffusion-based zero-shot TTS model by Meituan that operates directly in the waveform latent space. Converting from FP32 to BF16 halves the on-disk size (~14 GB to ~7 GB) and cuts inference VRAM from ~20 GB to ~12 GB with negligible quality loss, making it the recommended variant for most users.

	Original (3.5B FP32)	This (3.5B BF16)
Weight dtype	float32	bfloat16
Activation dtype	float32	bfloat16
File size	~14 GB	~7 GB
VRAM (inference)	~20 GB	~12 GB
Quality	Reference	Virtually identical
Extra dependencies	none	none
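
The file sizes in the table follow from the parameter count. A back-of-the-envelope check, assuming a round 3.5 billion parameters and decimal gigabytes (both simplifications; the real checkpoint also carries a small header, so actual sizes may differ slightly):

```python
# Rough weight size from parameter count and bytes per element.
# Assumptions: exactly 3.5e9 parameters, 1 GB = 1e9 bytes.
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}


def checkpoint_size_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the raw weights in decimal gigabytes."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 1e9


for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{checkpoint_size_gb(3.5e9, dtype):.0f} GB")
# float32: ~14 GB
# bfloat16: ~7 GB
```

Inference VRAM is higher than the weight size because activations and intermediate buffers also need memory, which is why the VRAM row does not halve exactly.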

Conversion Details

All model weights (DiT transformer backbone, Wav-VAE, and text encoder) are converted from float32 to bfloat16. BF16 keeps the same dynamic range as FP32 (8 exponent bits) but stores 7 mantissa bits instead of 23, so the cast is not bit-exact. In practice the precision loss is negligible for inference on modern GPUs, while the memory needed for the weights is halved.
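
To make the range-versus-precision trade-off concrete, here is a minimal standard-library sketch of a float32 to bfloat16 round trip. It uses round-to-nearest-even on the float32 bit pattern, which is a common convention; whether the actual conversion script rounded the same way is an assumption, and NaN/infinity are not handled specially here.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Round a float to bfloat16 and return its 16-bit pattern.

    bfloat16 is the upper 16 bits of the float32 encoding, so the cast
    keeps the sign and all 8 exponent bits and drops 16 mantissa bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, before dropping the low 16 bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def roundtrip(x: float) -> float:
    """Cast to bfloat16 and back, as a dtype cast of a single weight would."""
    return bf16_bits_to_float(float_to_bf16_bits(x))


print(roundtrip(1.0))    # 1.0 (exactly representable)
print(roundtrip(1.001))  # 1.0 (the step size near 1.0 is 2**-7)
print(roundtrip(1e38))   # still finite: the FP32 exponent range is preserved
```

The relative rounding error stays within 2**-8 (about 0.4%) for normal values, which is the precision cost paid for keeping the full FP32 exponent range.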

No post-training quantization, calibration data, or scale factors are required. The model is a direct dtype cast and is fully compatible with the original audiodit inference code.
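
Because the checkpoint is a plain dtype cast, every tensor in it should report BF16. One way to check this without loading the weights is to read only the safetensors header (an 8-byte little-endian length followed by a JSON index). The sketch below uses only the standard library; the function name is illustrative and not part of the audiodit code.

```python
import json
import struct
from collections import Counter


def safetensors_dtype_counts(path: str) -> Counter:
    """Count tensors per dtype by reading only the safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"]
        for name, entry in header.items()
        if name != "__metadata__"
    )
```

Calling `safetensors_dtype_counts(...)` on a weights file from the download directory should return a counter whose only key is `"BF16"`. The exact file name inside the repository is not stated here, so substitute whichever `.safetensors` file you downloaded.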

Hardware Requirements

GPU: NVIDIA GPU with CUDA support (BF16 supported on Ampere and newer; falls back gracefully on older hardware)
VRAM: ~12 GB for inference (the weights alone take ~7 GB)
CPU: Supported but slow
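
The Ampere requirement corresponds to CUDA compute capability 8.0 or higher. The following sketch shows one way to decide whether a machine can run BF16 natively. The function names and the `"fp32"` fallback label are assumptions made for illustration; they do not describe what the node actually does on older hardware.

```python
from typing import Optional, Tuple

AMPERE_MAJOR = 8  # NVIDIA Ampere GPUs report compute capability 8.x


def pick_dtype(capability: Optional[Tuple[int, int]]) -> str:
    """Pick a dtype label from a CUDA compute capability (major, minor).

    Pass None when no CUDA GPU is present. The "fp32" fallback is an
    assumption of this sketch, not documented behaviour of the node.
    """
    if capability is not None and capability[0] >= AMPERE_MAJOR:
        return "bf16"
    return "fp32"


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the current GPU's compute capability, or None if unavailable."""
    try:
        import torch  # optional: only needed for live detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


print(pick_dtype(detect_capability()))
```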

Usage — ComfyUI (Recommended)

The easiest way to use this model is with ComfyUI-LongCat-AudioDIT-TTS, which has native support for this BF16 model with zero extra setup.

Installation

Install the ComfyUI node via ComfyUI Manager (search LongCat-AudioDiT) or manually:

   cd ComfyUI/custom_nodes
   git clone https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git

The model auto-downloads on first use — select LongCat-AudioDiT-3.5B-bf16 from the model dropdown in any LongCat node.
Or download manually:

   huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-bf16 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-bf16

Available Nodes

LongCat AudioDiT TTS — Zero-shot text-to-speech
LongCat AudioDiT Voice Clone TTS — Voice cloning from reference audio
LongCat AudioDiT Multi-Speaker TTS — Multi-speaker conversation synthesis

Recommended Settings

dtype: auto or bf16 — matches this model's native dtype
guidance_method: cfg for TTS, apg for voice cloning
steps: 16 (balanced), 32 (higher quality)
keep_model_loaded: True for repeated use
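
The recommendations above can be summarized as a small lookup. This is only a sketch: the task labels `"tts"` and `"voice_clone"` and the function name are hypothetical, and the returned keys simply mirror the setting names listed above rather than any programmatic API of the node.

```python
def recommended_settings(task: str, high_quality: bool = False) -> dict:
    """Return the recommended settings for a task.

    `task` is "tts" or "voice_clone" (hypothetical labels for this sketch).
    """
    if task not in ("tts", "voice_clone"):
        raise ValueError(f"unknown task: {task!r}")
    return {
        "dtype": "auto",  # "bf16" is equivalent for this model
        "guidance_method": "cfg" if task == "tts" else "apg",
        "steps": 32 if high_quality else 16,
        "keep_model_loaded": True,  # avoids reloading between runs
    }


print(recommended_settings("voice_clone", high_quality=True))
```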

This is the recommended variant for most users — best balance of quality, VRAM usage, and compatibility.

About LongCat-AudioDiT

LongCat-AudioDiT is a non-autoregressive diffusion-based TTS model from Meituan that achieves state-of-the-art zero-shot voice cloning performance on the Seed benchmark. Unlike previous methods relying on mel-spectrograms, it operates directly in the waveform latent space using only a Wav-VAE and a DiT backbone.

The 3.5B variant achieves 0.818 SIM on Seed-ZH and 0.797 SIM on Seed-Hard, surpassing both open-source and closed-source competitors.

License

This model inherits the MIT License from meituan-longcat/LongCat-AudioDiT-3.5B.

The BF16 conversion was produced by drbaph and is released under the same license.
