---
datasets:
- amaai-lab/MusicBench
base_model:
- Qwen/Qwen2.5-Omni-7B
---

# Ke-Omni-R: Achieving Advanced Audio Reasoning with a Concise 50-Words Think Process

Ke-Omni-R is an advanced audio reasoning model built upon [Qwen2.5-Omni-7B](https://github.com/QwenLM/Qwen2.5-Omni). With only 10k post-training samples, Ke-Omni-R achieves state-of-the-art performance on the MMAU *Test-mini* and *Test* benchmarks. Key insights from its development include:

- **GRPO Algorithm**: The GRPO algorithm significantly enhances the already strong base model (Qwen2.5-Omni-7B) and generalizes well even to unseen speech domains (an illustrative reward sketch follows below).
- **Think Process**: Incorporating a concise think process (fewer than 50 words) plays a crucial role in improving reasoning capability.
- **KL Divergence**: Leveraging KL divergence during GRPO training brings slight additional improvements.
- **Domain Ratio vs. Data Volume**: Domain diversity outweighs data volume. We used only 10k samples: 5k randomly selected from AVQA and another 5k from MusicBench.

If you wish to train or perform inference with the model, please visit the GitHub repository: [https://github.com/shuaijiang/Ke-Omni-R/](https://github.com/shuaijiang/Ke-Omni-R/).

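The GRPO and think-process insights above boil down to a simple output contract: a sub-50-word `<think>` span followed by an `<answer>` span. As a rough illustration only, here is a minimal sketch of the kind of rule-based rewards such a GRPO setup typically optimizes; the actual reward functions live in the [Ke-Omni-R repository](https://github.com/shuaijiang/Ke-Omni-R/), so the names and scoring below are assumptions, not the released implementation.

```python
import re

# Hypothetical reward sketch for GRPO-style training on <think>/<answer> outputs.
# This is NOT the released Ke-Omni-R training code; it only illustrates the
# output contract described in the model card.
PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """Reward completions that follow the <think>/<answer> format with a concise think."""
    match = PATTERN.search(completion)
    if match is None:
        return 0.0
    # "Concise" is operationalized here as the 50-word budget from the model card.
    return 1.0 if len(match.group(1).split()) < 50 else 0.5

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward completions whose <answer> span matches the reference choice."""
    match = PATTERN.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip().lower() == ground_truth.strip().lower() else 0.0

good = "<think>Buzzing pitch and rhythm match a honeybee, not a wasp.</think>\n<answer>honeybee</answer>"
print(format_reward(good), accuracy_reward(good, "honeybee"))  # -> 1.0 1.0
```
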
## Performance: Accuracies (%) on the MMAU Test-mini and Test benchmarks

| Model | Method | Sound (Test-mini) | Sound (Test) | Music (Test-mini) | Music (Test) | Speech (Test-mini) | Speech (Test) | Average (Test-mini) | Average (Test) |
|---|---|---|---|---|---|---|---|---|---|
| - | Human\* | 86.31 | - | 78.22 | - | 82.17 | - | 82.23 | - |
| Gemini Pro 2.0 Flash | Direct Inference\* | 56.46 | 61.73 | 58.68 | 56.53 | 51.65 | 61.53 | 55.60 | 59.93 |
| Audio Flamingo 2 | Direct Inference\* | 61.56 | 65.10 | **73.95** | **72.90** | 30.93 | 40.26 | 55.48 | 59.42 |
| GPT-4o + Strong Cap. | Direct Inference\* | 57.35 | 55.83 | 49.70 | 51.73 | 64.86 | **68.66** | 57.30 | 58.74 |
| Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
| Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
| SALMONN | Direct Inference\* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
| Audio-Reasoner (Qwen2-Audio-7B-Instruct) | \[1\] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
| Audio-CoT (Qwen2-Audio-7B-Instruct) | \[2\] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
| R1-AQA (Qwen2-Audio-7B-Instruct) | \[3\] | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
| Qwen2.5-Omni-7B | \[4\] | 67.87 | - | 69.16 | - | 59.76 | - | 65.60 | - |
| Ke-Omni-R (Qwen2.5-Omni-7B) | GRPO (ours) | **69.37** | **71.90** | 69.46 | 67.13 | **67.87** | 67.10 | **68.90** | **68.71** |

Note:
- \* The data are sourced from the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard).
- \[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318.
- \[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246.
- \[3\] Li, Gang, et al. "Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering." arXiv preprint arXiv:2503.11197.
- \[4\] Xu, Jin, et al. "Qwen2.5-Omni Technical Report." arXiv preprint arXiv:2503.20215.

## Usage

```python
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# You can insert a local file path, a URL, or a base64-encoded audio wherever
# you want in the text. Each inner list below is one independent conversation;
# the three are processed together as a batch.
messages = [
    # Free-form description request with the default Qwen system prompt.
    [{"role": "system", "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
     {"role": "user", "content": [{"type": "audio", "audio": "/path_to_avqa_wavs/-IBtBeR6B00_000000.wav"}, {"type": "text", "text": "Please describe this audio."}]}],
    # Multiple-choice questions using the <think>/<answer> prompt format.
    [{"role": "user", "content": [{"type": "audio", "audio": "/path_to_avqa_wavs/-IBtBeR6B00_000000.wav"}, {"type": "text", "text": "What is the main source of sound in the audio? ['aircraft', 'Car', 'Tank', 'Missile'] Output the thinking process (less than 50 words) in <think> </think> and final answer in <answer> </answer>."}]}],
    [{"role": "user", "content": [{"type": "audio", "audio": "/path_to_avqa_wavs/-IBXTktoom8_000030.wav"}, {"type": "text", "text": "What animal is the main source of sound in the video? ['dog', 'wasp', 'honeybee', 'dragonfly'] Output the thinking process (less than 50 words) in <think> </think> and final answer in <answer> </answer>."}]}],
]

model_path = "KE-Team/Ke-Omni-R"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(model_path)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
audios, images, videos = process_mm_info(messages, use_audio_in_video=False)
inputs = processor(text=text, images=images, videos=videos, audio=audios, padding=True, return_tensors="pt")

# Greedy decoding for the thinker; slice off the prompt tokens to keep only
# the generated completions.
generation = model.generate(**inputs, thinker_temperature=0, thinker_do_sample=False)
generated_ids = generation[:, inputs.input_ids.size(1):]
completions = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(completions)
```

The output should be:

```
["Well, it sounds like there's a car accelerating. You can hear the engine revving up, and there's a bit of a thump or thud sound too. It might be the car hitting something or just a part of the acceleration process. It gives off a sense of speed and power. What do you think about it? Do you have any other audio samples you want to talk about?", '<think>The audio features a vehicle accelerating and revving, which is characteristic of a car. The sound is consistent with a car engine, not an aircraft, tank, or missile.</think>\n<answer>Car</answer>', "<think>The main source of sound is a buzzing insect, which is consistent with the size and sound of a honeybee. The other options don't match the sound or context.</think>\n<answer>honeybee</answer>"]
```

Note that the first request carries no <think>/<answer> instruction, so it is answered as free-form text, while the two multiple-choice requests follow the tagged format.

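For downstream use you typically only need the `<answer>` span. A small helper (not part of the released code, just a convenience sketch) extracts it from a completion:

```python
import re

def extract_answer(completion: str) -> str | None:
    """Return the text inside <answer>...</answer>, or None if the tags are absent."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else None

# Applied to the example output above, the free-form description yields None,
# while the two multiple-choice completions yield 'Car' and 'honeybee'.
sample = "<think>The sound is consistent with a car engine.</think>\n<answer>Car</answer>"
print(extract_answer(sample))  # -> Car
```
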
## Acknowledgements

We express our gratitude to the following projects and teams for their contributions:

- **R1-AQA**: The GRPO-based training implementation references [R1-AQA](https://github.com/xiaomi-research/r1-aqa).
- **Qwen Team**: Special thanks to the [Qwen2.5-Omni-7B](https://github.com/QwenLM/Qwen2.5-Omni) model for providing a robust foundation.
- **Datasets**:
  - [AVQA](https://mn.cs.tsinghua.edu.cn/avqa/)
  - [MusicBench](https://amaai-lab.github.io/mustango/)
  - [MMAU](https://github.com/Sakshi113/MMAU/)

## Citation

```bib
@misc{zhao2025keomnir,
  author       = {Zhao, Shuaijiang and Guo, Tingwei and Wen, Cheng and Xiang, Bajian and Zou, Wei},
  title        = {Ke-Omni-R: Achieving Advanced Audio Reasoning with a Concise 50-Words Think Process},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub Repository},
  howpublished = {\url{https://github.com/shuaijiang/Ke-Omni-R}},
}
```