---
title: LongCat-AudioDiT Enhanced
emoji: 🐱
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: "5.25.0"
app_file: spaces_app.py
pinned: false
license: apache-2.0
short_description: Voice cloning TTS with Whisper STT
---

# LongCat-AudioDiT Enhanced

> **This is a community-enhanced fork of [LongCat-AudioDiT](https://github.com/meituan-longcat/LongCat-AudioDiT) by [Meituan LongCat Team](https://github.com/meituan-longcat).**
> All credit for the model architecture, training, and weights goes to the original authors.
> We simply wrapped their excellent work in a more accessible package — a Gradio GUI, Whisper-based speech input, and one-click installers — so more people can try it out.

> [!IMPORTANT]
> **Research and Testing Only.** This project is provided strictly for research, educational, and personal experimentation purposes. It is **not** intended for generating deceptive, misleading, or harmful content. Do not use this software to impersonate real individuals without their explicit consent, to create non-consensual deepfakes, or for any activity that violates applicable laws or regulations. By using this software you agree to take full responsibility for ensuring your use complies with all relevant laws in your jurisdiction. The authors and contributors disclaim all liability for misuse.

---

## What We Added

Building on top of the original CLI-only release, this fork adds a few quality-of-life features:

| Feature | Original | This Fork |
|---|:---:|:---:|
| TTS (1B / 3.5B) | ✅ | ✅ |
| Zero-shot Voice Cloning | ✅ | ✅ |
| Gradio Web UI | — | ✅ |
| Whisper Turbo STT | — | ✅ |
| Whisper large-v3 STT | — | ✅ |
| Speech-to-Speech pipeline | — | ✅ |
| VRAM memory manager (auto/sequential) | — | ✅ |
| Cross-platform installer (Win + Linux) | — | ✅ |
| Model downloader | — | ✅ |
| HuggingFace Spaces demo (ZeroGPU) | — | ✅ |

## Quick Start

**Windows**

```bat
git clone https://github.com/cronos3k/LongCat-AudioDiT-Enhanced
cd LongCat-AudioDiT-Enhanced
install.bat
launch.bat
```

**Linux / macOS**

```bash
git clone https://github.com/cronos3k/LongCat-AudioDiT-Enhanced
cd LongCat-AudioDiT-Enhanced
bash install.sh
bash launch.sh
```

**Download models (optional — also auto-downloaded on first run)**

```bash
python download_models.py --tts 1B --whisper turbo   # minimal (~6 GB)
python download_models.py --all                      # everything
```

## Installation Options

### Option A — GUI (recommended)

```bash
# Linux/Mac
bash install.sh

# Windows
install.bat
```

Then launch:

```bash
bash launch.sh      # or launch.bat on Windows
```

### Option B — CLI / Python API only

```bash
pip install -r requirements_enhanced.txt
```

### Option C — Minimal (original, no GUI/Whisper)

```bash
pip install -r requirements.txt
```

## API Endpoints (Gradio REST API)

When running the GUI, all actions are also available as REST endpoints:

| Endpoint | Description |
|---|---|
| `POST /api/clone_voice` | Clone a voice: text + reference audio + transcription |
| `POST /api/transcribe_reference` | Transcribe reference audio with Whisper |
| `POST /api/plain_tts` | Generate speech without a reference voice |
| `POST /api/transcribe` | Transcribe any audio file |
| `POST /api/save_voice` | Save a voice to the library |
| `POST /api/load_voice` | Load a voice from the library by name |
| `POST /api/delete_voice` | Delete a voice from the library |
| `POST /api/list_voices` | List all saved voices |

## Models at a Glance

| Model | VRAM | Notes |
|---|---|---|
| AudioDiT-1B | ~4 GB | Fast, great quality |
| AudioDiT-3.5B | ~10 GB | SOTA quality |
| Whisper Turbo | ~1.6 GB | Fast transcription |
| Whisper large-v3 | ~3 GB | Most accurate |

---

# Original Project — LongCat-AudioDiT

*Everything below is from the [original repository](https://github.com/meituan-longcat/LongCat-AudioDiT) by the Meituan LongCat Team. Please cite their work if you use this model.*

## Introduction

LongCat-AudioDiT is a state-of-the-art (SOTA) diffusion-based text-to-speech (TTS) model that directly operates in the waveform latent space.

> **Abstract**: We present LongCat-TTS, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-TTS lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-TTS achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-TTS-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.

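
The abstract mentions replacing classifier-free guidance (CFG) with adaptive projection guidance (APG), which corresponds to the `guidance_method` option (`"cfg"` or `"apg"`) used in this README's inference examples. As a rough illustration only, the sketch below contrasts the two combination rules. It assumes the commonly published APG formulation, in which the guidance difference is split into components parallel and orthogonal to the conditional prediction and the parallel part is down-weighted. The function names `cfg_combine` and `apg_combine` and the parameter `eta` are hypothetical and not taken from this repository. The sketch omits APG's momentum and norm-rescaling terms and uses plain Python lists to stay dependency-free, so it may differ from what LongCat-AudioDiT actually implements.

```python
from typing import List, Sequence


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def apg_combine(
    cond: Sequence[float],
    uncond: Sequence[float],
    scale: float,
    eta: float = 0.0,  # hypothetical name: weight kept on the parallel component
) -> List[float]:
    """Simplified APG-style guidance (projection step only, an assumption).

    The difference (cond - uncond) is split into a part parallel to `cond`
    and a part orthogonal to it; the parallel part is scaled by `eta`.
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    cond_sq = sum(c * c for c in cond)
    # Projection coefficient of diff onto cond (0 if cond is the zero vector).
    coef = sum(d * c for d, c in zip(diff, cond)) / cond_sq if cond_sq > 0 else 0.0
    parallel = [coef * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [
        c + (scale - 1.0) * (o + eta * p)
        for c, o, p in zip(cond, orthogonal, parallel)
    ]


if __name__ == "__main__":
    cond, uncond = [1.0, 0.0], [0.0, 1.0]
    print(cfg_combine(cond, uncond, 4.0))           # plain CFG
    print(apg_combine(cond, uncond, 4.0, eta=0.0))  # parallel component removed
    print(apg_combine(cond, uncond, 4.0, eta=1.0))  # reduces to plain CFG
```

With `eta=1.0` the two rules coincide, since `cond + (scale - 1) * diff` equals `uncond + scale * diff`; smaller `eta` values suppress the component of the update that points along the conditional prediction.
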

![image](assets/architecture.png)

This repository provides the HuggingFace-compatible implementation, including model definition, weight conversion, and inference scripts.

## Experimental Results on Seed Benchmark

LongCat-AudioDiT obtains state-of-the-art (SOTA) voice cloning performance on the Seed benchmark, surpassing both closed-source and open-source models in speaker similarity (SIM) on the ZH and ZH-Hard test sets.

| **Model** | **ZH CER (%)** ↓ | **ZH SIM** ↑ | **EN WER (%)** ↓ | **EN SIM** ↑ | **ZH-Hard CER (%)** ↓ | **ZH-Hard SIM** ↑ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| GT | 1.26 | 0.755 | 2.14 | 0.734 | - | - |
| Seed-DiT | 1.18 | 0.809 | 1.73 | **0.790** | - | - |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.714 | 10.27 | 0.748 |
| E2 TTS | 1.97 | 0.730 | 2.19 | 0.710 | - | - |
| F5 TTS | 1.56 | 0.741 | 1.83 | 0.647 | 8.67 | 0.713 |
| F5R-TTS | 1.37 | 0.754 | - | - | 8.79 | 0.718 |
| ZipVoice | 1.40 | 0.751 | 1.64 | 0.668 | - | - |
| Seed-ICL | 1.12 | 0.796 | 2.25 | 0.762 | 7.59 | 0.776 |
| SparkTTS | 1.20 | 0.672 | 1.98 | 0.584 | - | - |
| FireRedTTS | 1.51 | 0.635 | 3.82 | 0.460 | 17.45 | 0.621 |
| Qwen2.5-Omni | 1.70 | 0.752 | 2.72 | 0.632 | 7.97 | 0.747 |
| Qwen2.5-Omni_RL | 1.42 | 0.754 | 2.33 | 0.641 | 6.54 | 0.752 |
| CosyVoice | 3.63 | 0.723 | 4.29 | 0.609 | 11.75 | 0.709 |
| CosyVoice2 | 1.45 | 0.748 | 2.57 | 0.652 | 6.83 | 0.724 |
| FireRedTTS-1S | 1.05 | 0.750 | 2.17 | 0.660 | 7.63 | 0.748 |
| CosyVoice3-1.5B | 1.12 | 0.781 | 2.21 | 0.720 | *5.83* | 0.758 |
| IndexTTS2 | 1.03 | 0.765 | 2.23 | 0.706 | 7.12 | 0.755 |
| DiTAR | 1.02 | 0.753 | 1.69 | 0.735 | - | - |
| MiniMax-Speech | 0.99 | 0.799 | 1.90 | 0.738 | - | - |
| VoxCPM | *0.93* | 0.772 | 1.85 | 0.729 | 8.87 | 0.730 |
| MOSS-TTS | 1.20 | 0.788 | 1.85 | 0.734 | - | - |
| Qwen3-TTS | 1.22 | 0.770 | **1.23** | 0.717 | 6.76 | 0.748 |
| CosyVoice3.5 | **0.87** | 0.797 | 1.57 | 0.738 | **5.71** | 0.786 |
| LongCat-AudioDiT-1B | 1.18 | *0.812* | 1.78 | 0.762 | 6.33 | *0.787* |
| LongCat-AudioDiT-3.5B | 1.09 | **0.818** | *1.50* | *0.786* | 6.04 | **0.797** |

Bold marks the best result in each column and italics the second best.

*Notes*:

1. Results of MOSS-TTS are from [MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS)
2. Results of CosyVoice3.5 are from [CosyVoice3.5](https://mp.weixin.qq.com/s/sTNC7bVphs9zofly3lBoUQ)

## CLI Inference

The example texts are in Chinese. The generation text reads roughly "Today is sunny and warm, turning to overcast with rain; air quality is excellent to good, and relative humidity is fairly low." The prompt text, which is the transcription of `assets/prompt.wav`, reads roughly "The thief, however, was not discouraged at all and kept rummaging through the drawer."

```bash
# TTS
python inference.py --text "今天晴暖转阴雨，空气质量优至良，空气相对湿度较低。" --output_audio output.wav --model_dir meituan-longcat/LongCat-AudioDiT-1B

# Voice cloning
python inference.py \
    --text "今天晴暖转阴雨，空气质量优至良，空气相对湿度较低。" \
    --prompt_text "小偷却一点也不气馁，继续在抽屉里翻找。" \
    --prompt_audio assets/prompt.wav \
    --output_audio output.wav \
    --model_dir meituan-longcat/LongCat-AudioDiT-1B \
    --guidance_method apg

# Batch inference (SeedTTS eval format, one item per line: uid|prompt_text|prompt_wav_path|gen_text)
python batch_inference.py \
    --lst /path/to/meta.lst \
    --output_dir /path/to/output \
    --model_dir meituan-longcat/LongCat-AudioDiT-1B \
    --guidance_method apg
```

## Inference (Python API)

### 1. TTS

```python
import audiodit  # auto-registers with transformers
from audiodit import AudioDiTModel
from transformers import AutoTokenizer
import torch, soundfile as sf

# Load model
model = AudioDiTModel.from_pretrained("meituan-longcat/LongCat-AudioDiT-1B").to("cuda")
model.vae.to_half()  # VAE runs in fp16 (matching original)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)

# Zero-shot synthesis
inputs = tokenizer(["今天晴暖转阴雨，空气质量优至良，空气相对湿度较低。"], padding="longest", return_tensors="pt")
output = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    duration=62,  # latent frames
    steps=16,
    cfg_strength=4.0,
    guidance_method="cfg",  # or "apg"
)
sf.write("output.wav", output.waveform.squeeze().cpu().numpy(), 24000)
```

### 2. Voice Cloning (with prompt audio)

```python
import librosa, torch
import soundfile as sf

# Reuses `model` and `tokenizer` from the TTS example above.

# Load prompt audio
audio, _ = librosa.load("assets/prompt.wav", sr=24000, mono=True)
prompt_wav = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0)  # (1, 1, T)

# Concatenate prompt_text + gen_text for the text encoder
prompt_text = "小偷却一点也不气馁，继续在抽屉里翻找。"
gen_text = "今天晴暖转阴雨，空气质量优至良，空气相对湿度较低。"
inputs = tokenizer([f"{prompt_text} {gen_text}"], padding="longest", return_tensors="pt")

output = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    prompt_audio=prompt_wav,
    duration=138,  # prompt_frames + gen_frames
    steps=16,
    cfg_strength=4.0,
    guidance_method="apg",
)

# Save the result the same way as in the TTS example
sf.write("output.wav", output.waveform.squeeze().cpu().numpy(), 24000)
```

## License

The original model weights and source code are released under the **MIT License** by the Meituan LongCat Team. The enhancements in this fork are released under the same MIT License.

Any contributions to this repository are licensed under the MIT License, unless otherwise stated. This license does not grant any rights to use Meituan trademarks or patents.

For details, see the [LICENSE](./LICENSE) file.

## Contact

For questions about the **original model**, please contact the LongCat team at longcat-team@meituan.com or open an issue on the [original repository](https://github.com/meituan-longcat/LongCat-AudioDiT).

For questions about **this enhanced fork**, please open an issue [here](https://github.com/cronos3k/LongCat-AudioDiT-Enhanced/issues).

#### WeChat Group (Original Team)