# FLM-Audio

FLM-Audio is an audio-language variant of [RoboEgo/FLM-Ego](https://arxiv.org/abs/2506.01934v1) -- an omnimodal model with native full duplexity. It simultaneously listens, speaks, and composes internal monologue, delivering low-latency, duplex conversational responses in both English and Chinese. FLM-Audio is robust to noise and user interruptions, prioritizing responsiveness and naturalness.
## 📄 Model Card

- **Language(s):** Chinese; English
## 📚 Technical Report

Motivation & Survey: [Toward Embodied AGI: A Review of Embodied AI and the Road Ahead](https://arxiv.org/abs/2505.14235)

FLM-Audio Research Paper: [FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training](https://arxiv.org/abs/2509.02521)

Omnimodal System Card: [RoboEgo System Card: An Omnimodal Model with Native Full Duplexity](https://arxiv.org/abs/2506.01934v1)
## ⚠️ Bias, Risks, and Limitations

Despite extensive data cleaning, FLM-Audio may still produce undesired content (e.g., biased or offensive language). Users should not disseminate unsafe outputs. Project authors are not responsible for misuse or harmful consequences.

## 🚀 Quick Start

Please refer to the repository of [FLM-Audio server](https://github.com/cofe-ai/flm-audio) to interact with FLM-Audio via WebUI.

## ℹ️ Usage Notice

This project is intended for research use only, in compliance with applicable laws. For commercial use, please contact us.

## 🏗️ Training Details

### Overview
We initialize the FLM-Audio backbone with a pre-trained language model. This initialization strategy significantly reduces computational cost while remaining effective for validating the core concepts of omnimodality and full duplexity. The training process of FLM-Audio consists of two stages: post-training and fine-tuning.
#### 1. Post-training

In post-training, we introduce audio-oriented capabilities to the backbone model using a large-scale corpus of audio data, while preserving the language modeling abilities of the pre-trained foundation model. This stage encompasses a broad spectrum of speech-related tasks, including automatic speech recognition (ASR) and text-to-speech (TTS) synthesis.
#### 2. Supervised Fine-tuning (SFT)

In this stage, we fine-tune FLM-Audio to function as a general-purpose, full-duplex audio-language chatbot. To this end, we primarily utilize synthesized multi-turn speech dialogues. This dataset is further augmented to support full-duplex interruption handling and to enhance robustness against environmental noise.
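To illustrate what such augmentation might look like, here is a minimal sketch. The dialogue schema, field names, and augmentation strategy are all hypothetical assumptions for illustration, not FLM-Audio's actual data pipeline: assistant turns are randomly truncated to simulate user barge-ins, and user turns are randomly tagged for background-noise mixing.

```python
import random

def augment_dialogue(turns, interrupt_prob=0.3, noise_prob=0.2, seed=0):
    """Illustrative duplex-data augmentation (hypothetical schema).

    `turns` is a list of dicts like {"role": "user"/"assistant", "text": ...}.
    """
    rng = random.Random(seed)
    augmented = []
    for turn in turns:
        t = dict(turn)
        if t["role"] == "assistant" and rng.random() < interrupt_prob:
            # Truncate the assistant turn to simulate a user interruption.
            cut = max(1, len(t["text"]) // 2)
            t["text"] = t["text"][:cut]
            t["interrupted"] = True
        if t["role"] == "user" and rng.random() < noise_prob:
            # Tag the turn so background noise can be mixed in at training time.
            t["noise"] = "babble"
        augmented.append(t)
    return augmented
```

A real pipeline would operate on audio waveforms and timestamps rather than text, but the control flow (sample a cut point, mark the turn, mix noise) would be analogous.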
### Model Architecture

To handle real-time language and audio, FLM-Audio features an LLM-based backbone with 7B parameters, enhanced by an audio encoder that embeds incoming speech into semantic + acoustic tokens, and a decoder that generates audio tokens. Listening, speaking, and internal monologue are interleaved in synchronized timesteps, with improved stream organization compared to related work (e.g., Moshi).
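The synchronized-timestep idea can be sketched as follows. This is a toy data-structure illustration under assumed channel names (`listen`, `monologue`, `speak`), not FLM-Audio's actual token layout: at every step the model consumes one incoming audio token while emitting one monologue token and one outgoing audio token, with empty tokens padding silent channels.

```python
from dataclasses import dataclass

@dataclass
class Timestep:
    """One synchronized step of a duplex stream (illustrative layout)."""
    listen: str     # incoming user-audio token
    monologue: str  # internal text-monologue token ("" if silent)
    speak: str      # outgoing audio token ("" if not speaking)

def interleave(listen_tokens, monologue_tokens, speak_tokens):
    """Zip the three channels into synchronized timesteps,
    padding shorter channels with empty tokens."""
    n = max(len(listen_tokens), len(monologue_tokens), len(speak_tokens))
    pad = lambda xs: list(xs) + [""] * (n - len(xs))
    return [Timestep(l, m, s)
            for l, m, s in zip(pad(listen_tokens),
                               pad(monologue_tokens),
                               pad(speak_tokens))]
```

Because all three channels advance in lockstep, the model can keep listening while it speaks, which is what enables barge-in handling without an explicit turn-taking protocol.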
## 🧪 Evaluation

### Audio Understanding and Generation

FLM-Audio performs comparably to strong audio-language models, most of which lack native duplexity.
| Model | ASR-zh (Fleurs-zh) ↓ | ASR-en (LibriSpeech-clean) ↓ | TTS-zh (Seed-tts-zh) ↓ | TTS-en (Seed-tts-en) ↓ |
|---------------|:----:|:----:|:----:|:----:|
| GPT-4o        | 5.4  | -    | -    | -    |
| MinMo         | 3.0  | 1.7  | 2.48 | 2.90 |
| GLM-4-Voice   | -    | 2.8  | 2.10 | 2.91 |
| Moshi         | -    | 5.7  | -    | -    |
| Qwen-2.5-omni | 3.0  | 1.8  | 1.70 | 2.72 |
| FLM-Audio     | 5.4  | 3.2  | 2.10 | 2.95 |
### Chat

Regarding chat experience, FLM-Audio demonstrates advantages in speech naturalness and responsiveness. The table below reports LLM-judged scores on audio chatting scenarios such as Alpaca-Eval, alongside human evaluation of video-grounded omnimodal chatting. The human scores for Naturalness and Responsiveness reflect the contribution of the same audio-oriented training used for FLM-Audio.
| Model | LLM score ↑ | Helpfulness ↑ | Naturalness ↑ | Responsiveness ↑ | Robustness ↑ |
|---------------|:----:|:---:|:---:|:---:|:---:|
| Qwen-2.5-omni | 6.36 | 7.4 | 7.9 | 8.1 | 7.7 |
| FLM-Audio     | 6.58 | 7.2 | 8.2 | 8.8 | 8.0 |
## 🙏 Acknowledgements

This work is supported by the National Science and Technology Major Project (No. 2022ZD0116314).

## 🗨️ Citation

If you find our work helpful, please consider citing the following papers.
```
@article{embodied-agi,
  title={Toward Embodied AGI: A Review of Embodied AI and the Road Ahead},
  author={Wang, Yequan and Sun, Aixin},
  journal={arXiv preprint arXiv:2505.14235},
  year={2025}
}
@article{roboego,
  title={RoboEgo System Card: An Omnimodal Model with Native Full Duplexity},
  author={Yao, Yiqun and Li, Xiang and Jiang, Xin and Fang, Xuezhi and Yu, Naitong and Sun, Aixin and Wang, Yequan},
  journal={arXiv preprint arXiv:2506.01934},
  year={2025}
}
```