LongCat-AudioDiT-3.5B - BF16 Weight-Only Quantized (ComfyUI)
BF16 quantized version of meituan-longcat/LongCat-AudioDiT-3.5B.
Original Model | Paper | GitHub (Original) | ComfyUI Node
What is this?
This is a BF16 conversion of LongCat-AudioDiT-3.5B, a state-of-the-art diffusion-based zero-shot TTS model by Meituan that operates directly in the waveform latent space. Converting from FP32 to BF16 halves the on-disk size and cuts inference VRAM from roughly 20 GB to roughly 12 GB with negligible quality loss, making it the recommended variant for most users.
| | Original (3.5B FP32) | This (3.5B BF16) |
|---|---|---|
| Weight dtype | float32 | bfloat16 |
| Activation dtype | float32 | bfloat16 |
| File size | ~14 GB | ~7 GB |
| VRAM (inference) | ~20 GB | ~12 GB |
| Quality | Reference | Virtually identical |
| Extra dependencies | none | none |
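The file sizes in the table follow from simple arithmetic: parameter count times bytes per parameter. A minimal sketch, assuming the nominal 3.5 billion parameters implied by the model name (the exact count may differ slightly) and 1 GB = 1e9 bytes; `weight_size_gb` is a hypothetical helper, not part of any library:

```python
# Back-of-the-envelope weight size: parameters x bytes per parameter.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NOMINAL_PARAMS = 3.5e9  # assumption: taken from the "3.5B" in the model name

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_size_gb(NOMINAL_PARAMS, dtype):.0f} GB")
```

Inference VRAM is higher than the weight size in both columns because activations and intermediate buffers also occupy memory during generation.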
Conversion Details
All model weights (DiT transformer backbone, Wav-VAE, and text encoder) are converted from float32 to bfloat16. BF16 keeps the same dynamic range as FP32 (8 exponent bits) but stores 7 mantissa bits instead of 23, so it halves memory usage at the cost of some precision. In practice the effect on output quality is negligible, which makes it a practical choice for inference on modern GPUs.
No post-training quantization, calibration data, or scale factors are required. The model is a direct dtype cast and is fully compatible with the original audiodit inference code.
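To illustrate what a float32 to bfloat16 cast does to a single value, here is a standard-library sketch that keeps the top 16 bits of the float32 bit pattern. Round-to-nearest-even is assumed as the rounding mode; the tooling actually used for this conversion is not described here, and the function names are hypothetical:

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE 754 float32 bit pattern of x as an unsigned 32-bit int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bf16_bits(x: float) -> int:
    """Round x to bfloat16 (round-to-nearest-even) and return its 16-bit pattern.

    bfloat16 is the upper half of a float32: 1 sign bit, 8 exponent bits,
    7 mantissa bits. NaN inputs are not handled specially in this sketch.
    """
    bits = f32_bits(x)
    # Bias that rounds the discarded low 16 bits to nearest, ties to even.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_to_float(b: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


for value in (1.0, 0.1, 3.0e38):
    restored = bf16_to_float(to_bf16_bits(value))
    print(f"{value!r} -> bf16 -> {restored!r}")
```

Values such as 3.0e38 stay finite because the exponent field is unchanged, while a value like 0.1 comes back slightly different because fewer mantissa bits are kept.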
Hardware Requirements
- GPU: NVIDIA GPU with CUDA support (BF16 supported on Ampere and newer; falls back gracefully on older hardware)
- VRAM: ~12 GB for inference (the weights alone take ~7 GB)
- CPU: Supported but slow
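A rough way to check whether a machine can run BF16 natively is to look at the CUDA compute capability, since Ampere corresponds to capability 8.x. The sketch below is illustrative only: `pick_dtype` and `detect_dtype` are hypothetical helpers, and falling back to float32 is an assumption, because the dtype the node actually falls back to on older hardware is not stated here.

```python
def pick_dtype(cuda_available: bool, capability_major: int = 0) -> str:
    """Choose a dtype name from the GPU's compute capability.

    Ampere and newer GPUs report a major capability of 8 or higher.
    The float32 fallback is an assumption made for this sketch.
    """
    if cuda_available and capability_major >= 8:
        return "bfloat16"
    return "float32"


def detect_dtype() -> str:
    """Probe the local machine with PyTorch if it is installed."""
    try:
        import torch
    except ImportError:
        return pick_dtype(False)
    if not torch.cuda.is_available():
        return pick_dtype(False)
    major, _minor = torch.cuda.get_device_capability()
    return pick_dtype(True, major)
```

Calling `detect_dtype()` inside the ComfyUI Python environment gives a quick indication of which path a given GPU would take under these assumptions.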
Usage: ComfyUI (Recommended)
The easiest way to use this model is with ComfyUI-LongCat-AudioDIT-TTS, which has native support for this BF16 model with zero extra setup.
Installation
- Install the ComfyUI node via ComfyUI Manager (search LongCat-AudioDiT) or manually:
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git
- The model auto-downloads on first use: select LongCat-AudioDiT-3.5B-bf16 from the model dropdown in any LongCat node. Or download manually:
huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-bf16 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-bf16
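The same download can be scripted from Python with the `huggingface_hub` library's `snapshot_download` function. This is a sketch rather than part of the node: `model_dir` and `download_model` are hypothetical helper names, and the default `comfy_root` assumes the script runs from the directory that contains the ComfyUI folder.

```python
from pathlib import Path

REPO_ID = "drbaph/LongCat-AudioDiT-3.5B-bf16"


def model_dir(comfy_root: str = "ComfyUI") -> Path:
    """Build the target folder used by the manual download command."""
    return Path(comfy_root) / "models" / "audiodit" / REPO_ID.split("/")[-1]


def download_model(comfy_root: str = "ComfyUI") -> Path:
    """Download the full repository snapshot into the ComfyUI models folder."""
    # Imported lazily so model_dir() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = model_dir(comfy_root)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `download_model()` requires network access and enough free disk space for the ~7 GB of weights.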
Available Nodes
- LongCat AudioDiT TTS: Zero-shot text-to-speech
- LongCat AudioDiT Voice Clone TTS: Voice cloning from reference audio
- LongCat AudioDiT Multi-Speaker TTS: Multi-speaker conversation synthesis
Recommended Settings
- dtype: auto or bf16 (matches this model's native dtype)
- guidance_method: cfg for TTS, apg for voice cloning
- steps: 16 (balanced), 32 (higher quality)
- keep_model_loaded: True for repeated use
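The recommendations above can be captured as a small lookup for anyone scripting workflows. This is only a sketch: `recommended_settings` and the task labels `"tts"` and `"voice_clone"` are hypothetical, and only the setting names and values come from the list above; the node's actual input schema may differ.

```python
def recommended_settings(task: str = "tts", high_quality: bool = False) -> dict:
    """Return the recommended node settings for a given task.

    task: "tts" or "voice_clone" (labels invented for this sketch).
    high_quality: use 32 steps instead of the balanced 16.
    """
    if task not in ("tts", "voice_clone"):
        raise ValueError(f"unknown task: {task!r}")
    return {
        "dtype": "bf16",  # matches the model's native dtype
        "guidance_method": "cfg" if task == "tts" else "apg",
        "steps": 32 if high_quality else 16,
        "keep_model_loaded": True,  # avoids reloading on repeated use
    }


print(recommended_settings("voice_clone", high_quality=True))
```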
This is the recommended variant for most users, offering the best balance of quality, VRAM usage, and compatibility.
About LongCat-AudioDiT
LongCat-AudioDiT is a non-autoregressive diffusion-based TTS model from Meituan that achieves state-of-the-art zero-shot voice cloning performance on the Seed benchmark. Unlike previous methods relying on mel-spectrograms, it operates directly in the waveform latent space using only a Wav-VAE and a DiT backbone.
The 3.5B variant achieves 0.818 SIM on Seed-ZH and 0.797 SIM on Seed-Hard, surpassing both open-source and closed-source competitors.
License
This model inherits the MIT License from meituan-longcat/LongCat-AudioDiT-3.5B.
The BF16 conversion was produced by drbaph and is released under the same license.
