---
language:
  - en
  - zh
license: apache-2.0
library_name: transformers
tags:
  - audio-language-model
  - speech-to-speech
  - voice-chat
pipeline_tag: any-to-any
---

# Fun-Audio-Chat-8B

<p align="right">
  <a href="Fun-Audio-Chat-8B/blob/main/README.md">English</a> | <a href="Fun-Audio-Chat-8B/blob/main/README_zh.md">中文</a>
</p>

<div align="center">

<img src="https://github.com/FunAudioLLM/Fun-Audio-Chat/blob/main/assets/通义百聆.png?raw=true" alt="通义百聆" height="80">

**Fun-Audio-Chat** is a large audio-language model built for natural, low-latency voice interaction.

[![arXiv](https://img.shields.io/badge/arXiv-2512.20156-red)](https://arxiv.org/pdf/2512.20156)
[![GitHub](https://img.shields.io/badge/GitHub-代码-blue)](https://github.com/FunAudioLLM/Fun-Audio-Chat)
[![Demo](https://img.shields.io/badge/演示-页面-green)](https://funaudiollm.github.io/funaudiochat)

</div>

## Introduction

Fun-Audio-Chat is a large audio-language model built for natural, low-latency voice interaction. It introduces **dual-resolution speech representations** (an efficient 5 Hz shared backbone plus a 25 Hz refinement head), which substantially reduces computation while preserving speech quality, and adopts a **Core-Cocktail training strategy** to retain the strengths of the underlying text LLM. The model achieves top-tier results on benchmarks covering spoken question answering, audio understanding, speech function calling, speech instruction following, and speech empathy.

<p align="center">
  <img width="95%" src="https://github.com/FunAudioLLM/Fun-Audio-Chat/blob/main/assets/Results.png?raw=true">
</p>
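The efficiency claim above largely comes down to sequence length: a lower backbone frame rate means fewer tokens per second of audio for the LLM to process. The following is illustrative back-of-the-envelope arithmetic only (it does not reflect the model's actual tokenization):

```python
def backbone_frames(seconds: float, frame_rate_hz: float) -> int:
    """Number of backbone frames the LLM must process for an audio clip."""
    return round(seconds * frame_rate_hz)

clip = 60.0  # one minute of audio
for hz in (5.0, 12.5, 25.0):
    print(f"{hz:>4} Hz -> {backbone_frames(clip, hz)} frames")

# Relative sequence-length reduction of a 5 Hz backbone vs. a 12.5 Hz one
saving = 1 - backbone_frames(clip, 5.0) / backbone_frames(clip, 12.5)
print(f"sequence-length reduction: {saving:.0%}")  # → 60%
```

Since self-attention cost grows superlinearly with sequence length, a 60% shorter sequence plausibly translates into the reported near-50% reduction in GPU training time.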

### Key Features

- **Dual-resolution speech representations**: an efficient 5 Hz frame rate (versus the 12.5 Hz or 25 Hz used by other models) cuts GPU training time by nearly 50% while preserving speech quality
- **State-of-the-art performance**: leads models of comparable size (~8B parameters) on OpenAudioBench, VoiceBench, UltraEval-Audio, MMAU, MMAU-Pro, MMSU, Speech-ACEBench, Speech-BFCL, Speech-SmartInteract, and VStyle
- **Broad capability coverage**: spoken question answering, audio understanding, speech function calling, speech instruction following, and speech empathy

<p align="center">
  <img width="95%" src="https://github.com/FunAudioLLM/Fun-Audio-Chat/blob/main/assets/Architecture.png?raw=true">
</p>

## Model Details

| Attribute | Value |
|------|-----|
| Model size | ~8B parameters |
| Architecture | Dual-resolution speech representations |
| Languages | English, Chinese |
| License | Apache 2.0 |

## Requirements

- Python == 3.12
- PyTorch == 2.8.0
- ffmpeg
- GPU memory: ~24 GB for inference, 4×80 GB for training
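The ~24 GB inference figure is consistent with simple weight-memory arithmetic. This is a rough sketch assuming bf16/fp16 weights; actual usage also depends on the KV cache, audio features, and framework overhead:

```python
params = 8e9         # ~8B parameters
bytes_per_param = 2  # bf16 / fp16 storage

# Memory needed just to hold the weights
weights_gib = params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gib:.1f} GiB")  # ~14.9 GiB

# Within a 24 GB budget, the remainder is headroom for the
# KV cache, activations, and the speech heads.
headroom_gib = 24 - weights_gib
print(f"headroom: ~{headroom_gib:.1f} GiB")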

## Installation

```bash
git clone --recurse-submodules https://github.com/FunAudioLLM/Fun-Audio-Chat
cd Fun-Audio-Chat

apt install ffmpeg
conda create -n FunAudioChat python=3.12 -y
conda activate FunAudioChat
pip install torch==2.8.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
```

## Quick Start

### Download Models

**Download with the Hugging Face CLI:**
```bash
pip install huggingface-hub
hf download FunAudioLLM/Fun-Audio-Chat-8B --local-dir ./pretrained_models/Fun-Audio-Chat-8B
hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --local-dir ./pretrained_models/Fun-CosyVoice3-0.5B-2512
```

**Or download with ModelScope:**
```bash
modelscope download --model FunAudioLLM/Fun-Audio-Chat-8B --local_dir pretrained_models/Fun-Audio-Chat-8B
modelscope download --model FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --local_dir pretrained_models/Fun-CosyVoice3-0.5B-2512
```

### Inference

```bash
export PYTHONPATH=$(pwd)
# speech-to-text
python examples/infer_s2t.py
# speech-to-speech
python examples/infer_s2s.py
```

## Evaluation

| Benchmark | Category |
|---------|------|
| OpenAudioBench | Spoken QA |
| VoiceBench | Spoken QA |
| UltraEval-Audio | Speech-to-speech |
| MMAU, MMAU-Pro, MMSU | Audio understanding |
| Speech-ACEBench, Speech-BFCL, Speech-SmartInteract | Speech function calling |
| VStyle | Speech instruction following |

See the [GitHub repository](https://github.com/FunAudioLLM/Fun-Audio-Chat) for detailed evaluation instructions.

## Citation

If you find this model helpful, please cite our papers:

```bibtex
@article{funaudiochat2025,
  title={Fun-Audio-Chat Technical Report},
  author={Qian Chen and Luyao Cheng and Chong Deng and Xiangang Li and Jiaqing Liu and Chao-Hong Tan and Wen Wang and Junhao Xu and Jieping Ye and Qinglin Zhang and Qiquan Zhang and Jingren Zhou},
  year={2025},
  eprint={2512.20156},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.20156},
}


@misc{tan2025drvoiceparallelspeechtextvoice,
  title={DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations}, 
  author={Chao-Hong Tan and Qian Chen and Wen Wang and Chong Deng and Qinglin Zhang and Luyao Cheng and Hai Yu and Xin Zhang and Xiang Lv and Tianyu Zhao and Chong Zhang and Yukun Ma and Yafeng Chen and Hui Wang and Jiaqing Liu and Xiangang Li and Jieping Ye},
  year={2025},
  eprint={2506.09349},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.09349}, 
}
```


## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

## Acknowledgements

This project builds on the following excellent open-source projects:

- [Transformers](https://github.com/huggingface/transformers)
- [LlamaFactory](https://github.com/hiyouga/LLaMA-Factory)
- [Moshi](https://github.com/kyutai-labs/moshi)
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)

## Contact

- 🐛 Submit an [Issue](https://github.com/FunAudioLLM/Fun-Audio-Chat/issues)
- 💡 Submit a [Pull Request](https://github.com/FunAudioLLM/Fun-Audio-Chat/pulls)