---
language:
- en
- zh
license: apache-2.0
library_name: transformers
tags:
- audio-language-model
- speech-to-speech
pipeline_tag: any-to-any
---

# Fun-Audio-Chat-8B

<p align="right">
  <a href="Fun-Audio-Chat-8B/blob/main/README.md">English</a> | <a href="Fun-Audio-Chat-8B/blob/main/README_zh.md">中文</a>
</p>

<div align="center">

<img src="https://github.com/FunAudioLLM/Fun-Audio-Chat/blob/main/assets/TONGYI Fun.png?raw=true" alt="TONGYI Fun" height="80">

**Fun-Audio-Chat** is a Large Audio Language Model built for natural, low-latency voice interactions.

[![arXiv](https://img.shields.io/badge/arXiv-2512.20156-red)](https://arxiv.org/pdf/2512.20156)
[![GitHub](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/FunAudioLLM/Fun-Audio-Chat)
[![Demo](https://img.shields.io/badge/Demo-Page-green)](https://funaudiollm.github.io/funaudiochat)

</div>

## Model Description

Fun-Audio-Chat introduces two techniques. **Dual-Resolution Speech Representations** (an efficient 5Hz shared backbone plus a 25Hz refined head) cut compute while keeping high speech quality, and **Core-Cocktail training** preserves the strong text capabilities of the underlying LLM. The model delivers top-tier results on spoken QA, audio understanding, speech function calling, speech instruction-following, and voice empathy benchmarks.

<p align="center">
  <img width="95%" src="https://github.com/FunAudioLLM/Fun-Audio-Chat/blob/main/assets/Results.png?raw=true">
</p>

### Key Features

- **Dual-Resolution Speech Representations**: Efficient 5Hz frame rate (vs. 12.5Hz or 25Hz for other models), reducing GPU hours by nearly 50% while maintaining high speech quality
- **State-of-the-Art Performance**: Ranks at the top among similarly sized models (around 8B parameters) on OpenAudioBench, VoiceBench, UltraEval-Audio, MMAU, MMAU-Pro, MMSU, Speech-ACEBench, Speech-BFCL, Speech-SmartInteract, and VStyle
- **Comprehensive Capabilities**: Supports spoken QA, audio understanding, speech function calling, speech instruction-following, and voice empathy
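
To make the frame rates above concrete, the following sketch counts how many speech frames cover one utterance at 5Hz, 12.5Hz, and 25Hz. The frame rates come from this card; the 10-second duration and the function name are illustrative assumptions, and the sketch says nothing about how the model itself is implemented.

```python
# Frame counts for one utterance at different speech-token frame rates.
# The 5 Hz, 12.5 Hz and 25 Hz rates are taken from the model card; the
# utterance length and helper name are hypothetical, chosen for illustration.

def frames_for(duration_s: float, frame_rate_hz: float) -> int:
    """Number of speech frames needed to cover `duration_s` seconds."""
    return round(duration_s * frame_rate_hz)


DURATION_S = 10.0  # hypothetical utterance length in seconds

counts = {rate: frames_for(DURATION_S, rate) for rate in (5.0, 12.5, 25.0)}
for rate, n in counts.items():
    print(f"{rate} Hz -> {n} frames")

# One 5 Hz backbone frame spans the same time as 25 / 5 = 5 frames at 25 Hz.
frames_per_backbone_step = int(25 / 5)
```

For a 10-second utterance this gives 50 frames at 5Hz, 125 at 12.5Hz, and 250 at 25Hz, so the shared backbone sees one fifth as many positions as it would at 25Hz.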

<p align="center">
  <img width="95%" src="https://github.com/FunAudioLLM/Fun-Audio-Chat/blob/main/assets/Architecture.png?raw=true">
</p>

## Model Details

| Attribute | Value |
|-----------|-------|
| Model Size | ~8B parameters |
| Architecture | Dual-Resolution Speech Representations |
| Languages | English, Chinese |
| License | Apache 2.0 |

## Requirements

- Python == 3.12
- PyTorch == 2.8.0
- ffmpeg
- GPU Memory: ~24GB for inference, 4×80GB for training
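
Before installing, it can help to check the current environment against the list above. The script below is a minimal sketch using only the Python standard library; the function names are hypothetical, and it does not check GPU memory.

```python
# Compare the local environment with the Requirements list.
# Helper names are hypothetical; GPU memory is not checked here.
import shutil
import sys
from importlib import metadata


def check_environment() -> dict:
    """Collect the Python version, installed torch version, and ffmpeg path."""
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "torch": torch_version,
        "ffmpeg": shutil.which("ffmpeg"),  # None if ffmpeg is not on PATH
    }


def report(env: dict) -> list[str]:
    """Return one human-readable message per unmet requirement."""
    problems = []
    if env["python"] != "3.12":
        problems.append(f"Python 3.12 expected, found {env['python']}")
    # Wheel versions may carry a local suffix such as "+cu128".
    if env["torch"] is None or not env["torch"].startswith("2.8.0"):
        problems.append(f"PyTorch 2.8.0 expected, found {env['torch']}")
    if env["ffmpeg"] is None:
        problems.append("ffmpeg not found on PATH")
    return problems


for problem in report(check_environment()):
    print(problem)
```

An empty output means the Python, PyTorch, and ffmpeg requirements are met.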

## Installation

```bash
git clone --recurse-submodules https://github.com/FunAudioLLM/Fun-Audio-Chat
cd Fun-Audio-Chat

# Install ffmpeg (Debian/Ubuntu shown; may require sudo)
apt install ffmpeg
conda create -n FunAudioChat python=3.12 -y
conda activate FunAudioChat
pip install torch==2.8.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
```

## Quick Start

### Download Models

**Using HuggingFace:**
```bash
pip install huggingface-hub
hf download FunAudioLLM/Fun-Audio-Chat-8B --local-dir ./pretrained_models/Fun-Audio-Chat-8B
hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --local-dir ./pretrained_models/Fun-CosyVoice3-0.5B-2512
```

**Or using ModelScope:**
```bash
modelscope download --model FunAudioLLM/Fun-Audio-Chat-8B --local_dir pretrained_models/Fun-Audio-Chat-8B
modelscope download --model FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --local_dir pretrained_models/Fun-CosyVoice3-0.5B-2512
```
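
The same two repositories can also be fetched from Python. This is a sketch that assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; the helper names and the `pretrained_models` root mirror the commands above but are otherwise illustrative.

```python
# Download both model repositories into ./pretrained_models from Python.
# Assumes `pip install huggingface-hub`; helper names are hypothetical.
from pathlib import Path

REPOS = (
    "FunAudioLLM/Fun-Audio-Chat-8B",
    "FunAudioLLM/Fun-CosyVoice3-0.5B-2512",
)


def local_dir_for(repo_id: str, root: str = "pretrained_models") -> Path:
    """Map 'org/name' to '<root>/name', matching the CLI commands."""
    return Path(root) / repo_id.split("/", 1)[1]


def download_all(root: str = "pretrained_models") -> None:
    """Fetch every repository in REPOS (requires network access)."""
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id in REPOS:
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir_for(repo_id, root)))


# Call download_all() to start the download.
```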

### Inference

```bash
# Run from the repository root so the example scripts can import the package
export PYTHONPATH="$(pwd)"
# Speech-to-Text
python examples/infer_s2t.py
# Speech-to-Speech
python examples/infer_s2s.py
```
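
If you prefer to launch the examples from a Python script, for instance in a batch job, the sketch below sets `PYTHONPATH` to the repository root and runs both scripts in turn. The script paths come from the commands above; the helper names are hypothetical, and the example scripts are assumed to take no arguments, as shown above.

```python
# Run the two example scripts with PYTHONPATH set to the repository root.
# Helper names are hypothetical; script paths are those listed above.
import os
import subprocess
import sys
from pathlib import Path

EXAMPLES = ("examples/infer_s2t.py", "examples/infer_s2s.py")


def build_env(repo_root: Path) -> dict:
    """Copy the current environment and point PYTHONPATH at the repo root."""
    env = dict(os.environ)
    env["PYTHONPATH"] = str(repo_root)
    return env


def build_command(script: str) -> list[str]:
    """Use the current interpreter so the active conda env is respected."""
    return [sys.executable, script]


def run_examples(repo_root: Path) -> None:
    """Run each example from the repo root; stop at the first failure."""
    for script in EXAMPLES:
        subprocess.run(
            build_command(script),
            cwd=repo_root,
            env=build_env(repo_root),
            check=True,
        )


# Call run_examples(Path.cwd()) from the Fun-Audio-Chat checkout.
```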

## Evaluation

| Benchmark | Category |
|-----------|----------|
| OpenAudioBench | Spoken QA |
| VoiceBench | Spoken QA |
| UltraEval-Audio | Speech-to-Speech |
| MMAU, MMAU-Pro, MMSU | Audio Understanding |
| Speech-ACEBench, Speech-BFCL, Speech-SmartInteract | Speech Function Calling |
| VStyle | Speech Instruction-Following |

For detailed evaluation instructions, please refer to the [GitHub repository](https://github.com/FunAudioLLM/Fun-Audio-Chat).

## Citation

If you find this model useful, please cite our papers:

```bibtex
@article{funaudiochat2025,
  title={Fun-Audio-Chat Technical Report},
  author={Qian Chen and Luyao Cheng and Chong Deng and Xiangang Li and Jiaqing Liu and Chao-Hong Tan and Wen Wang and Junhao Xu and Jieping Ye and Qinglin Zhang and Qiquan Zhang and Jingren Zhou},
  year={2025},
  eprint={2512.20156},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.20156},
}


@misc{tan2025drvoiceparallelspeechtextvoice,
  title={DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations}, 
  author={Chao-Hong Tan and Qian Chen and Wen Wang and Chong Deng and Qinglin Zhang and Luyao Cheng and Hai Yu and Xin Zhang and Xiang Lv and Tianyu Zhao and Chong Zhang and Yukun Ma and Yafeng Chen and Hui Wang and Jiaqing Liu and Xiangang Li and Jieping Ye},
  year={2025},
  eprint={2506.09349},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.09349}, 
}
```

## License

This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

## Acknowledgments

This project is based on the following excellent open-source projects:

- [Transformers](https://github.com/huggingface/transformers)
- [LlamaFactory](https://github.com/hiyouga/LLaMA-Factory)
- [Moshi](https://github.com/kyutai-labs/moshi)
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)

## Contact

- 🐛 Submit an [Issue](https://github.com/FunAudioLLM/Fun-Audio-Chat/issues)
- 💡 Submit a [Pull Request](https://github.com/FunAudioLLM/Fun-Audio-Chat/pulls)