Update README.md

README.md (changed)

@@ -1,23 +1,19 @@
 ---
 language:
-
-
 license: apache-2.0
 library_name: transformers
 tags:
-
-
-
-- speech-to-text
-- speech-to-speech
-- voice-chat
-pipeline_tag: audio-text-to-text
 ---

 # Fun-Audio-Chat-8B

 <p align="right">
-<a href="README.md">English</a> | <a href="README_zh.md">中文</a>
 </p>

 <div align="center">
@@ -36,12 +32,20 @@ pipeline_tag: audio-text-to-text

 Fun-Audio-Chat is a Large Audio Language Model built for natural, low-latency voice interactions. It introduces **Dual-Resolution Speech Representations** (an efficient 5Hz shared backbone + a 25Hz refined head) to cut compute while keeping high speech quality, and **Core-Cocktail training** to preserve strong text LLM capabilities. It delivers top-tier results on spoken QA, audio understanding, speech function calling, speech instruction-following, and voice empathy benchmarks.

 ### Key Features

 - **Dual-Resolution Speech Representations**: Efficient 5Hz frame rate (vs. 12.5Hz or 25Hz for other models), reducing GPU hours by nearly 50% while maintaining high speech quality
 - **State-of-the-Art Performance**: Ranks at the top among models of the same size (around 8B parameters) on OpenAudioBench, VoiceBench, UltraEval-Audio, MMAU, MMAU-Pro, MMSU, Speech-ACEBench, Speech-BFCL, Speech-SmartInteract, and VStyle
 - **Comprehensive Capabilities**: Supports spoken QA, audio understanding, speech function calling, speech instruction-following, and voice empathy

 ## Model Details

 | Attribute | Value |
@@ -117,10 +121,20 @@ If you find this model useful, please cite our paper:

 ```bibtex
 @article{funaudiochat2025,
-title={Fun-Audio-Chat
 author={Tongyi Fun Team},
 year={2025}
 }
 ```

 ## License
@@ -139,5 +153,4 @@ This project is based on the following excellent open-source projects:
 ## Contact

 - 🐛 Submit an [Issue](https://github.com/FunAudioLLM/Fun-Audio-Chat/issues)
-- 💡 Submit a [Pull Request](https://github.com/FunAudioLLM/Fun-Audio-Chat/pulls)
-
README.md after the change:

@@ -1,23 +1,19 @@
 ---
 language:
+- en
+- zh
 license: apache-2.0
 library_name: transformers
 tags:
+- audio-language-model
+- speech-to-speech
+pipeline_tag: any-to-any
 ---

 # Fun-Audio-Chat-8B

 <p align="right">
+<a href="Fun-Audio-Chat-8B/blob/main/README.md">English</a> | <a href="Fun-Audio-Chat-8B/blob/main/README_zh.md">中文</a>
 </p>

 <div align="center">
@@ -36,12 +32,20 @@ pipeline_tag: audio-text-to-text

 Fun-Audio-Chat is a Large Audio Language Model built for natural, low-latency voice interactions. It introduces **Dual-Resolution Speech Representations** (an efficient 5Hz shared backbone + a 25Hz refined head) to cut compute while keeping high speech quality, and **Core-Cocktail training** to preserve strong text LLM capabilities. It delivers top-tier results on spoken QA, audio understanding, speech function calling, speech instruction-following, and voice empathy benchmarks.

+<p align="center">
+<img width="95%" src="https://github.com/FunAudioLLM/Fun-Audio-Chat/blob/main/assets/Results.png?raw=true">
+</p>
+
 ### Key Features

 - **Dual-Resolution Speech Representations**: Efficient 5Hz frame rate (vs. 12.5Hz or 25Hz for other models), reducing GPU hours by nearly 50% while maintaining high speech quality
 - **State-of-the-Art Performance**: Ranks at the top among models of the same size (around 8B parameters) on OpenAudioBench, VoiceBench, UltraEval-Audio, MMAU, MMAU-Pro, MMSU, Speech-ACEBench, Speech-BFCL, Speech-SmartInteract, and VStyle
 - **Comprehensive Capabilities**: Supports spoken QA, audio understanding, speech function calling, speech instruction-following, and voice empathy

+<p align="center">
+<img width="95%" src="https://github.com/FunAudioLLM/Fun-Audio-Chat/blob/main/assets/Architecture.png?raw=true">
+</p>
+
 ## Model Details

 | Attribute | Value |
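The frame rates quoted in Key Features translate directly into sequence-length savings, which is where the claimed compute reduction comes from. A minimal sketch of that arithmetic (illustrative only; the `frames` helper and the one-minute example are not from the model's code, only the 5Hz/12.5Hz/25Hz rates come from the card):

```python
def frames(seconds: float, hz: float) -> int:
    """Number of representation frames produced for `seconds` of audio at `hz`."""
    return int(seconds * hz)

audio_seconds = 60.0                       # one minute of speech
backbone = frames(audio_seconds, 5.0)      # shared 5 Hz backbone
refined  = frames(audio_seconds, 25.0)     # 25 Hz refined head
baseline = frames(audio_seconds, 12.5)     # a 12.5 Hz single-resolution model

print(backbone, refined, baseline)         # 300 1500 750
```

The 5Hz backbone sequence is 2.5x shorter than a 12.5Hz baseline and 5x shorter than a 25Hz one; since attention cost grows superlinearly with sequence length, running most layers on the short 5Hz sequence and only the refined head at 25Hz plausibly accounts for the reported GPU-hour savings.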
@@ -117,10 +121,20 @@ If you find this model useful, please cite our paper:

 ```bibtex
 @article{funaudiochat2025,
+title={Fun-Audio-Chat Technical Report},
 author={Tongyi Fun Team},
 year={2025}
 }
+
+@misc{tan2025drvoiceparallelspeechtextvoice,
+title={DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations},
+author={Chao-Hong Tan and Qian Chen and Wen Wang and Chong Deng and Qinglin Zhang and Luyao Cheng and Hai Yu and Xin Zhang and Xiang Lv and Tianyu Zhao and Chong Zhang and Yukun Ma and Yafeng Chen and Hui Wang and Jiaqing Liu and Xiangang Li and Jieping Ye},
+year={2025},
+eprint={2506.09349},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2506.09349},
+}
 ```

 ## License
@@ -139,5 +153,4 @@ This project is based on the following excellent open-source projects:
 ## Contact

 - 🐛 Submit an [Issue](https://github.com/FunAudioLLM/Fun-Audio-Chat/issues)
+- 💡 Submit a [Pull Request](https://github.com/FunAudioLLM/Fun-Audio-Chat/pulls)