---
language:
- en
- zh
license: apache-2.0
library_name: transformers
tags:
- audio-language-model
- speech-to-speech
- voice-chat
pipeline_tag: any-to-any
---

# Fun-Audio-Chat-8B

English | 中文

Tongyi BaiLing (通义百聆) **Fun-Audio-Chat** is a large audio-language model built for natural, low-latency voice interaction.

[![arXiv](https://img.shields.io/badge/arXiv-2512.20156-red)](https://arxiv.org/pdf/2512.20156) [![GitHub](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/FunAudioLLM/Fun-Audio-Chat) [![Demo](https://img.shields.io/badge/Demo-Page-green)](https://funaudiollm.github.io/funaudiochat)
## Model Introduction

Fun-Audio-Chat is a large audio-language model built for natural, low-latency voice interaction. It introduces **dual-resolution speech representations** (an efficient 5 Hz shared backbone plus a 25 Hz refinement head), which substantially reduces computational cost while preserving high speech quality, and adopts a **Core-Cocktail training strategy** to retain the strong capabilities of the underlying text LLM. The model achieves top-tier results on benchmarks covering spoken question answering, audio understanding, speech function calling, speech instruction following, and empathetic speech interaction.

### Key Features

- **Dual-resolution speech representations**: an efficient 5 Hz frame rate (vs. 12.5 Hz or 25 Hz in other models) cuts GPU training time by nearly 50% while preserving high speech quality
- **State-of-the-art performance**: leads models of comparable size (~8B parameters) on OpenAudioBench, VoiceBench, UltraEval-Audio, MMAU, MMAU-Pro, MMSU, Speech-ACEBench, Speech-BFCL, Speech-SmartInteract, and VStyle
- **Comprehensive capabilities**: spoken question answering, audio understanding, speech function calling, speech instruction following, and empathetic speech interaction
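The frame-rate savings above come down to simple arithmetic. As an illustrative sketch (the `frames` helper is hypothetical, not part of the released API), the number of backbone frames per clip scales linearly with the frame rate:

```python
# Illustrative arithmetic only -- `frames` is a hypothetical helper, not part
# of the Fun-Audio-Chat API. It shows how the 5 Hz shared backbone shortens
# sequences relative to the 12.5 Hz / 25 Hz rates used by other models.
def frames(duration_s: float, rate_hz: float) -> int:
    """Number of representation frames for a clip of the given duration."""
    return round(duration_s * rate_hz)

clip = 10.0                    # seconds of audio
backbone = frames(clip, 5.0)   # 5 Hz shared backbone  -> 50 frames
head = frames(clip, 25.0)      # 25 Hz refinement head -> 250 frames
baseline = frames(clip, 12.5)  # a 12.5 Hz alternative -> 125 frames

print(backbone, head, baseline)          # 50 250 125
print(f"{1 - backbone / baseline:.0%}")  # 60% fewer backbone tokens than 12.5 Hz
```

Shorter backbone sequences are what drive the reported near-50% reduction in GPU training time, while the 25 Hz head recovers fine-grained acoustic detail.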

## Model Details

| Property | Value |
|----------|-------|
| Model size | ~8B parameters |
| Architecture | Dual-resolution speech representations |
| Supported languages | English, Chinese |
| License | Apache 2.0 |

## Requirements

- Python == 3.12
- PyTorch == 2.8.0
- ffmpeg
- GPU memory: ~24 GB for inference, 4×80 GB for training

## Installation

```bash
git clone --recurse-submodules https://github.com/FunAudioLLM/Fun-Audio-Chat
cd Fun-Audio-Chat
apt install ffmpeg
conda create -n FunAudioChat python=3.12 -y
conda activate FunAudioChat
pip install torch==2.8.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
```

## Quick Start

### Download Models

**With HuggingFace:**

```bash
pip install huggingface-hub
hf download FunAudioLLM/Fun-Audio-Chat-8B --local-dir ./pretrained_models/Fun-Audio-Chat-8B
hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --local-dir ./pretrained_models/Fun-CosyVoice3-0.5B-2512
```

**Or with ModelScope:**

```bash
modelscope download --model FunAudioLLM/Fun-Audio-Chat-8B --local_dir pretrained_models/Fun-Audio-Chat-8B
modelscope download --model FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --local_dir pretrained_models/Fun-CosyVoice3-0.5B-2512
```

### Inference

```bash
export PYTHONPATH=`pwd`
# Speech-to-text
python examples/infer_s2t.py
# Speech-to-speech
python examples/infer_s2s.py
```

## Evaluation

| Benchmark | Category |
|-----------|----------|
| OpenAudioBench | Spoken QA |
| VoiceBench | Spoken QA |
| UltraEval-Audio | Speech-to-speech |
| MMAU, MMAU-Pro, MMSU | Audio understanding |
| Speech-ACEBench, Speech-BFCL, Speech-SmartInteract | Speech function calling |
| VStyle | Speech instruction following |

See the [GitHub repository](https://github.com/FunAudioLLM/Fun-Audio-Chat) for detailed evaluation instructions.

## Citation

If you find this model helpful, please cite our papers:

```bibtex
@article{funaudiochat2025,
  title={Fun-Audio-Chat Technical Report},
  author={Qian Chen and Luyao Cheng and Chong Deng and Xiangang Li and Jiaqing Liu and Chao-Hong Tan and Wen Wang and Junhao Xu and Jieping Ye and Qinglin Zhang and Qiquan Zhang and Jingren Zhou},
  year={2025},
  eprint={2512.20156},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.20156},
}

@misc{tan2025drvoiceparallelspeechtextvoice,
  title={DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations},
  author={Chao-Hong Tan and Qian Chen and Wen Wang and Chong Deng and Qinglin Zhang and Luyao Cheng and Hai Yu and Xin Zhang and Xiang Lv and Tianyu Zhao and Chong Zhang and Yukun Ma and Yafeng Chen and Hui Wang and Jiaqing Liu and Xiangang Li and Jieping Ye},
  year={2025},
  eprint={2506.09349},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.09349},
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

## Acknowledgements

This project builds on the following excellent open-source projects:

- [Transformers](https://github.com/huggingface/transformers)
- [LlamaFactory](https://github.com/hiyouga/LLaMA-Factory)
- [Moshi](https://github.com/kyutai-labs/moshi)
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)

## Contact

- 🐛 File an [Issue](https://github.com/FunAudioLLM/Fun-Audio-Chat/issues)
- 💡 Submit a [Pull Request](https://github.com/FunAudioLLM/Fun-Audio-Chat/pulls)