# FLM-Audio

FLM-Audio is an audio-language variant of [RoboEgo/FLM-Ego](https://arxiv.org/abs/2506.01934v1) -- an omnimodal model with native full duplexity. It simultaneously listens, speaks, and composes internal monologue, delivering low-latency, duplex conversational responses in both English and Chinese. FLM-Audio is robust to noise and user interruptions, prioritizing responsiveness and naturalness.
## 📄 Model Card

- **Language(s):** Chinese; English
## 📚 Technical Report

Motivation & Survey: [Toward Embodied AGI: A Review of Embodied AI and the Road Ahead](https://arxiv.org/abs/2505.14235)

FLM-Audio Research Paper: [FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training](https://arxiv.org/abs/2509.02521)

Omnimodal System Card: [RoboEgo System Card: An Omnimodal Model with Native Full Duplexity](https://arxiv.org/abs/2506.01934v1)
## ⚠️ Bias, Risks, and Limitations

Despite extensive data cleaning, FLM-Audio may still produce undesired content (e.g., biased or offensive language). Users should not disseminate unsafe outputs. Project authors are not responsible for misuse or harmful consequences.

## 🚀 Quick Start

Please refer to the repository of [FLM-Audio server](https://github.com/cofe-ai/flm-audio) to interact with FLM-Audio via WebUI.

## ℹ️ Usage Notice

This project is intended for research use only, in compliance with applicable laws. For commercial use, please contact us.

## 🏗️ Training Details

### Overview
We initialize the FLM-Audio backbone with a pre-trained language model. This initialization strategy significantly reduces computational cost while remaining effective for validating the core concepts of omnimodality and full duplexity. The training process of FLM-Audio consists of two stages: post-training and fine-tuning.
#### 1. Post-training

In post-training, we introduce audio-oriented capabilities to the backbone model using a large-scale corpus of audio data, while preserving the language modeling abilities of the pre-trained foundation model. This stage encompasses a broad spectrum of speech-related tasks, including automatic speech recognition (ASR) and text-to-speech (TTS) synthesis.
#### 2. Supervised Fine-tuning (SFT)

In this stage, we fine-tune FLM-Audio to function as a general-purpose, full-duplex audio-language chatbot. To this end, we primarily utilize synthesized multi-turn speech dialogues. This dataset is further augmented to support full-duplex interruption handling and to enhance robustness against environmental noise.
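To illustrate what such augmentation might look like, here is a minimal sketch. The dialogue schema, field names, and augmentation strategy are all hypothetical assumptions for illustration, not FLM-Audio's actual data pipeline: assistant turns are randomly truncated to simulate user barge-ins, and user turns are randomly tagged for background-noise mixing.

```python
import random

def augment_dialogue(turns, interrupt_prob=0.3, noise_prob=0.2, seed=0):
    """Illustrative duplex-data augmentation (hypothetical schema).

    `turns` is a list of dicts like {"role": "user"/"assistant", "text": ...}.
    """
    rng = random.Random(seed)
    augmented = []
    for turn in turns:
        t = dict(turn)
        if t["role"] == "assistant" and rng.random() < interrupt_prob:
            # Truncate the assistant turn to simulate a user interruption.
            cut = max(1, len(t["text"]) // 2)
            t["text"] = t["text"][:cut]
            t["interrupted"] = True
        if t["role"] == "user" and rng.random() < noise_prob:
            # Tag the turn so background noise can be mixed in at training time.
            t["noise"] = "babble"
        augmented.append(t)
    return augmented
```

A real pipeline would operate on audio waveforms and timestamps rather than text, but the control flow (sample a cut point, mark the turn, mix noise) would be analogous.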
### Model Architecture

To handle real-time language and audio, FLM-Audio features an LLM-based backbone with 7B parameters, enhanced by an audio encoder that embeds incoming speech into semantic + acoustic tokens, and a decoder that generates audio tokens. Listening, speaking, and internal monologue are interleaved in synchronized timesteps, with improved stream organization compared to related work (e.g., Moshi).
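The synchronized-timestep idea can be sketched as follows. This is a toy data-structure illustration under assumed channel names (`listen`, `monologue`, `speak`), not FLM-Audio's actual token layout: at every step the model consumes one incoming audio token while emitting one monologue token and one outgoing audio token, with empty tokens padding silent channels.

```python
from dataclasses import dataclass

@dataclass
class Timestep:
    """One synchronized step of a duplex stream (illustrative layout)."""
    listen: str     # incoming user-audio token
    monologue: str  # internal text-monologue token ("" if silent)
    speak: str      # outgoing audio token ("" if not speaking)

def interleave(listen_tokens, monologue_tokens, speak_tokens):
    """Zip the three channels into synchronized timesteps,
    padding shorter channels with empty tokens."""
    n = max(len(listen_tokens), len(monologue_tokens), len(speak_tokens))
    pad = lambda xs: list(xs) + [""] * (n - len(xs))
    return [Timestep(l, m, s)
            for l, m, s in zip(pad(listen_tokens),
                               pad(monologue_tokens),
                               pad(speak_tokens))]
```

Because all three channels advance in lockstep, the model can keep listening while it speaks, which is what enables barge-in handling without an explicit turn-taking protocol.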
## 🧪 Evaluation

### Audio Understanding and Generation

FLM-Audio performs comparably to strong audio-language models, most of which lack native duplexity.
| Model | ASR-zh (Fleurs-zh) ↓ | ASR-en (LibriSpeech-clean) ↓ | TTS-zh (Seed-tts-zh) ↓ | TTS-en (Seed-tts-en) ↓ |
|---------------|:----:|:----:|:----:|:----:|
| GPT-4o        | 5.4  | -    | -    | -    |
| MinMo         | 3.0  | 1.7  | 2.48 | 2.90 |
| GLM-4-Voice   | -    | 2.8  | 2.10 | 2.91 |
| Moshi         | -    | 5.7  | -    | -    |
| Qwen-2.5-omni | 3.0  | 1.8  | 1.70 | 2.72 |
| FLM-Audio     | 5.4  | 3.2  | 2.10 | 2.95 |
### Chat

Regarding chat experience, FLM-Audio demonstrates advantages in speech naturalness and responsiveness. The table below reports LLM-judged scores on audio chatting scenarios such as Alpaca-Eval, alongside human evaluation of video-grounded omnimodal chatting. The human scores for Naturalness and Responsiveness reflect the contribution of the same audio-oriented training used for FLM-Audio.
| Model | LLM score ↑ | Helpfulness ↑ | Naturalness ↑ | Responsiveness ↑ | Robustness ↑ |
|---------------|:----:|:---:|:---:|:---:|:---:|
| Qwen-2.5-omni | 6.36 | 7.4 | 7.9 | 8.1 | 7.7 |
| FLM-Audio     | 6.58 | 7.2 | 8.2 | 8.8 | 8.0 |
## 🙏 Acknowledgements

This work is supported by the National Science and Technology Major Project (No. 2022ZD0116314).

## 🗨️ Citation

If you find our work helpful, please consider citing the following papers.
```
@article{embodied-agi,
  title={Toward Embodied AGI: A Review of Embodied AI and the Road Ahead},
  author={Wang, Yequan and Sun, Aixin},
  journal={arXiv preprint arXiv:2505.14235},
  year={2025}
}
@article{roboego,
  title={RoboEgo System Card: An Omnimodal Model with Native Full Duplexity},
  author={Yao, Yiqun and Li, Xiang and Jiang, Xin and Fang, Xuezhi and Yu, Naitong and Sun, Aixin and Wang, Yequan},
  journal={arXiv preprint arXiv:2506.01934},
  year={2025}
}
```