Update README.md
README.md
CHANGED
|
@@ -14,17 +14,228 @@ tags:
|
|
| 14 |
- chat,
|
| 15 |
- audio
|
| 16 |
---
|
| 17 |
|
| 18 |
-
# SeaLLMs-Audio: Large Audio
|
| 19 |
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
###
|
| 26 |
|
| 27 |
-
### Audio Analysis Inference
|
| 28 |
|
| 29 |
## Citation
|
| 30 |
|
|
|
|
| 14 |
- chat,
|
| 15 |
- audio
|
| 16 |
---
|
| 17 |
+
<p align="center">
|
| 18 |
+
<img src="https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/images/seallm-audio-logo.png" alt="SeaLLMs-Audio" width="20%">
|
| 19 |
+
</p>
|
| 20 |
|
| 21 |
+
# SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia
|
| 22 |
|
| 23 |
+
<p align="center">
|
| 24 |
+
<a href="https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/" target="_blank" rel="noopener">Website</a>
|
| 25 |
+
|
| 26 |
+
<a href="https://huggingface.co/spaces/SeaLLMs/SeaLLMs-Audio-Demo" target="_blank" rel="noopener"> 🤗 DEMO</a>
|
| 27 |
+
|
| 28 |
+
<a href="https://github.com/DAMO-NLP-SG/SeaLLMs-Audio" target="_blank" rel="noopener">Github</a>
|
| 29 |
+
|
| 30 |
+
<a href="https://huggingface.co/SeaLLMs/SeaLLMs-Audio-7B" target="_blank" rel="noopener">🤗 Model</a>
|
| 31 |
+
|
| 32 |
+
<!-- <a href="https://arxiv.org/pdf/2407.19672" target="_blank" rel="noopener">[NEW] Technical Report</a> -->
|
| 33 |
+
</p>
|
| 34 |
|
| 35 |
+
We introduce **SeaLLMs-Audio**, the multimodal (audio) extension of the [SeaLLMs](https://damo-nlp-sg.github.io/DAMO-SeaLLMs/) (Large Language Models for Southeast Asian languages) family. It is the first large audio-language model (LALM) designed to support multiple Southeast Asian languages, including **Indonesian (id), Thai (th), and Vietnamese (vi), alongside English (en) and Chinese (zh)**.
|
| 36 |
|
| 37 |
+
Trained on a large-scale audio dataset, SeaLLMs-Audio demonstrates strong performance across a range of audio-related tasks, including audio analysis and voice-based interaction. As a significant step toward advancing audio LLMs in Southeast Asia, we hope SeaLLMs-Audio will benefit both the research community and industry in the region.
|
| 38 |
|
| 39 |
+
### Key Features of SeaLLMs-Audio:
|
| 40 |
+
|
| 41 |
+
- **Multilingual**: The model mainly supports 5 languages: 🇮🇩 Indonesian, 🇹🇭 Thai, 🇻🇳 Vietnamese, 🇬🇧 English, and 🇨🇳 Chinese.
|
| 42 |
+
- **Multimodal**: The model supports flexible input formats, such as **audio only, text only, and audio with text**.
|
| 43 |
+
- **Multi-task**: The model supports a variety of tasks, including audio analysis tasks such as audio captioning, automatic speech recognition, speech-to-text translation, speech emotion recognition, speech question answering, and speech summarization. Additionally, it handles voice chat tasks, including answering factual, mathematical, and other general questions.
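
The three input formats listed under **Multimodal** map onto the conversation structure used in the Quickstart section. The following is a minimal sketch, assuming the `{"type": "audio", "audio_url": ...}` and `{"type": "text", "text": ...}` content schema shown there. The helper name `build_user_message`, the file name `question.wav`, and the example question are hypothetical and not part of the released code.

```python
def build_user_message(audio_path=None, text=None):
    """Build one user turn with audio, text, or both (hypothetical helper)."""
    if audio_path is None and text is None:
        raise ValueError("Provide audio_path, text, or both.")
    content = []
    if audio_path is not None:
        content.append({"type": "audio", "audio_url": audio_path})
    if text is not None:
        content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


# Audio only (voice chat), text only, and audio with a text instruction.
audio_only = build_user_message(audio_path="question.wav")
text_only = build_user_message(text="What is the capital of Indonesia?")
audio_with_text = build_user_message(
    audio_path="question.wav",
    text="Please write down what is spoken in the audio file.",
)
```

Each returned dictionary can be placed in a `conversation` list and passed to a `response_to_audio` function like the one in the Quickstart section.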
|
| 44 |
+
|
| 45 |
+
We release the weights of [SeaLLMs-Audio](https://huggingface.co/SeaLLMs/SeaLLMs-Audio-7B) on Hugging Face and provide a [demo](https://huggingface.co/spaces/SeaLLMs/SeaLLMs-Audio-Demo) for users to try the model.
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
# Training information:
|
| 49 |
+
SeaLLMs-Audio builds upon [Qwen2-Audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B) and [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct). We replaced the LLM module in Qwen2-Audio-7B with Qwen2.5-7B-Instruct and then performed full-parameter fine-tuning on a large-scale audio dataset. This dataset contains 1.58M conversations covering multiple tasks, of which 93% are single-turn. The tasks can be roughly grouped into the following categories: automatic speech recognition (ASR), audio captioning (AC), speech-to-text translation (S2TT), question answering (QA), speech summarization (SS), speech question answering (SQA), chat, math, and fact and mixed tasks (mixed).
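
As a rough illustration of the dataset figures above, the split between single-turn and multi-turn conversations can be estimated as follows. Both inputs are the rounded values quoted in the text (1.58M and 93%), so the outputs are approximations rather than exact counts.

```python
total_conversations = 1_580_000  # "1.58M conversations" (rounded in the text)
single_turn_share = 0.93         # "93% are single turn"

single_turn = round(total_conversations * single_turn_share)
multi_turn = total_conversations - single_turn

print(f"single-turn: ~{single_turn:,}")  # ~1,469,400
print(f"multi-turn:  ~{multi_turn:,}")   # ~110,600
```

In other words, roughly 1.47M conversations are single-turn and about 0.11M are multi-turn.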
|
| 50 |
+
|
| 51 |
+
The distribution of the data across languages and tasks is as follows:
|
| 52 |
+
|
| 53 |
+
<p align="center">
|
| 54 |
+
<strong>Distribution of SeaLLMs-Audio training data across languages and tasks</strong>
|
| 55 |
+
</p>
|
| 56 |
+
|
| 57 |
+
<p align="center">
|
| 58 |
+
<img src="https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/data_distribution/dist_lang.png" alt="Distribution of SeaLLMs-Audio training data across languages" width="70%">
|
| 59 |
+
<img src="https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/data_distribution/dist_task.png" alt="Distribution of SeaLLMs-Audio training data across tasks" width="70%">
|
| 60 |
+
</p>
|
| 61 |
+
|
| 62 |
+
The training dataset was curated from multiple data sources, including public datasets and in-house data. The public datasets include [gigaspeech](https://huggingface.co/datasets/speechcolab/gigaspeech), [gigaspeech2](https://huggingface.co/datasets/speechcolab/gigaspeech2), [common voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0), [AudioCaps](https://huggingface.co/datasets/OpenSound/AudioCaps), [VoiceAssistant-400K](https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K), [YODAS2](https://huggingface.co/datasets/espnet/yodas2), and [Multitask-National-Speech-Corpus](https://huggingface.co/datasets/MERaLiON/Multitask-National-Speech-Corpus-v1). We would like to thank the authors of these datasets for their contributions to the community!
|
| 63 |
+
|
| 64 |
+
We trained the model on the dataset for 1 epoch, which took ~6 days on 32 A800 GPUs.
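
For a sense of scale, the training budget stated above can be converted into GPU-hours. This is simple arithmetic on the quoted figures; because the duration is given as "~6 days", the result is an estimate.

```python
num_gpus = 32  # A800 GPUs, as stated in the text
days = 6       # "~6 days", so the result is approximate

gpu_hours = num_gpus * days * 24
print(gpu_hours)  # 4608
```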
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
# Performance
|
| 68 |
+
Due to the absence of standard audio benchmarks for evaluating audio LLMs in Southeast Asia, we have manually created a benchmark called **SeaBench-Audio**. It comprises nine tasks:
|
| 69 |
+
|
| 70 |
+
- **Tasks with both audio and text inputs:** Audio Captioning (AC), Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Recognition (SER), Speech Question Answering (SQA), and Speech Summarization (SS).
|
| 71 |
+
- **Tasks with only audio inputs:** Factuality, Math, and General.
|
| 72 |
+
|
| 73 |
+
We manually annotated 15 questions per task per language. For evaluation, qualified native speakers rated each response on a scale of 1 to 5, with 5 representing the highest quality.
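
The benchmark size and the scoring scheme described above can be sketched as follows. Two points are assumptions rather than statements from the source: that the benchmark covers all five supported languages, and that per-model scores are plain arithmetic means of the 1 to 5 ratings. The ratings in the example are made up for illustration and are not SeaBench-Audio results.

```python
from statistics import mean

TASKS = ["AC", "ASR", "S2TT", "SER", "SQA", "SS", "Factuality", "Math", "General"]
LANGUAGES = ["id", "th", "vi", "en", "zh"]  # assumption: all five supported languages
QUESTIONS_PER_TASK_PER_LANGUAGE = 15

total_questions = len(TASKS) * len(LANGUAGES) * QUESTIONS_PER_TASK_PER_LANGUAGE
print(total_questions)  # 675


def average_score(ratings):
    """Average a list of 1-5 human ratings, rejecting out-of-range values."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)


# Made-up ratings for illustration only; not results from SeaBench-Audio.
example_ratings = [5, 4, 4, 3, 5]
print(average_score(example_ratings))  # 4.2
```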
|
| 74 |
+
|
| 75 |
+
Because LALMs covering all three Southeast Asian languages are lacking, we compare SeaLLMs-Audio with related LALMs of similar size: [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) (Qwen2-Audio), [MERaLiON-AudioLLM-Whisper-SEA-LION](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION) (MERaLiON), [llama3.1-typhoon2-audio-8b-instruct](https://huggingface.co/scb10x/llama3.1-typhoon2-audio-8b-instruct) (typhoon2-audio), and [DiVA-llama-3-v0-8b](https://huggingface.co/WillHeld/DiVA-llama-3-v0-8b) (DiVA).
|
| 76 |
+
All the LALMs can accept audio with text as input. The results are shown in the figure below.
|
| 77 |
+
|
| 78 |
+
<center>
|
| 79 |
+
|
| 80 |
+
**Average scores of SeaLLMs-Audio vs. Other LALMs on SeaBench-Audio**
|
| 81 |
+
|
| 82 |
+

|
| 83 |
+
|
| 84 |
+
</center>
|
| 85 |
+
|
| 86 |
+
The results show that SeaLLMs-Audio achieves state-of-the-art performance in all five languages, demonstrating its effectiveness in supporting audio-related tasks in Southeast Asia.
|
| 87 |
+
|
| 88 |
+
|
| 89 |
+
# Quickstart
|
| 90 |
+
Our model is available on Hugging Face, and you can use it with either the `transformers` or the `vllm` library. Below are some examples to get you started.
|
| 91 |
+
|
| 92 |
+
## Get started with `transformers`
|
| 93 |
+
|
| 94 |
+
```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
import librosa
import os

model = Qwen2AudioForConditionalGeneration.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B", device_map="auto")
processor = AutoProcessor.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B")

def response_to_audio(conversation, model=None, processor=None):
    # Render the conversation into the model's chat prompt format.
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    # Load every audio clip referenced in the conversation at the processor's sampling rate.
    sampling_rate = processor.feature_extractor.sampling_rate
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio" and ele["audio_url"] is not None:
                    audios.append(librosa.load(ele["audio_url"], sr=sampling_rate)[0])
    if audios:
        inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True, sampling_rate=sampling_rate)
    else:
        inputs = processor(text=text, return_tensors="pt", padding=True)
    # Move all input tensors to the device that holds the model.
    inputs = {k: v.to(model.device) for k, v in inputs.items() if v is not None}
    # Greedy decoding (do_sample=False), so no temperature is passed.
    generate_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
    # Keep only the newly generated tokens, dropping the prompt.
    generate_ids = generate_ids[:, inputs["input_ids"].size(1):]
    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return response

# Voice Chat
os.system("wget -O fact_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/fact_en.wav")
os.system("wget -O general_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/general_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "fact_en.wav"},
    ]},
    {"role": "assistant", "content": "The most abundant gas in Earth's atmosphere is nitrogen. It makes up about 78 percent of the atmosphere by volume."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "general_en.wav"},
    ]},
]

response = response_to_audio(conversation, model=model, processor=processor)
print(response)

# Audio Analysis
os.system("wget -O ASR_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/ASR_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "ASR_en.wav"},
        {"type": "text", "text": "Please write down what is spoken in the audio file."},
    ]},
]

response = response_to_audio(conversation, model=model, processor=processor)
print(response)
```
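
The example above downloads the sample audio by calling `wget` through `os.system`, which assumes that `wget` is installed. On systems without it, a standard-library alternative could look like the sketch below. The helper name `fetch_audio` is hypothetical; `urllib.request.urlretrieve` is part of the Python standard library.

```python
import os
import urllib.request


def fetch_audio(url, dest):
    """Download url to dest unless dest already exists, then return dest."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


# Example call (requires network access):
# fetch_audio(
#     "https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/ASR_en.wav",
#     "ASR_en.wav",
# )
```

Skipping the download when the file already exists avoids fetching the same clip again on repeated runs.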
|
| 153 |
+
|
| 154 |
+
## Inference with `vllm`
|
| 155 |
+
|
| 156 |
+
```python
from vllm import LLM, SamplingParams
import librosa
import os
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("SeaLLMs/SeaLLMs-Audio-7B")
llm = LLM(
    model="SeaLLMs/SeaLLMs-Audio-7B", trust_remote_code=True, gpu_memory_utilization=0.5,
    enforce_eager=True, device="cuda",
    limit_mm_per_prompt={"audio": 5},
)

def response_to_audio(conversation, model=None, processor=None, temperature=0.1, repetition_penalty=1.1, top_p=0.9, max_new_tokens=4096):
    # Render the conversation into the model's chat prompt format.
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    # Load every audio clip referenced in the conversation at the processor's sampling rate.
    sampling_rate = processor.feature_extractor.sampling_rate
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio" and ele["audio_url"] is not None:
                    audios.append(librosa.load(ele["audio_url"], sr=sampling_rate)[0])

    sampling_params = SamplingParams(
        temperature=temperature, max_tokens=max_new_tokens, repetition_penalty=repetition_penalty, top_p=top_p, top_k=20,
        stop_token_ids=[],
    )

    # Named `request` to avoid shadowing the built-in `input`.
    request = {"prompt": text}
    if audios:
        # Attach audio only when the conversation actually contains clips.
        request["multi_modal_data"] = {"audio": [(audio, sampling_rate) for audio in audios]}

    output = model.generate([request], sampling_params=sampling_params)[0]
    response = output.outputs[0].text
    return response

# Voice Chat
os.system("wget -O fact_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/fact_en.wav")
os.system("wget -O general_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/general_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "fact_en.wav"},
    ]},
    {"role": "assistant", "content": "The most abundant gas in Earth's atmosphere is nitrogen. It makes up about 78 percent of the atmosphere by volume."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "general_en.wav"},
    ]},
]

response = response_to_audio(conversation, model=llm, processor=processor)
print(response)

# Audio Analysis
os.system("wget -O ASR_en.wav https://DAMO-NLP-SG.github.io/SeaLLMs-Audio/static/audios/ASR_en.wav")
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "ASR_en.wav"},
        {"type": "text", "text": "Please write down what is spoken in the audio file."},
    ]},
]

response = response_to_audio(conversation, model=llm, processor=processor)
print(response)
```
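
The `limit_mm_per_prompt={"audio": 5}` setting in the example above caps the number of audio clips per prompt at five. For long multi-turn conversations, it may help to check that count before calling the model. The sketch below assumes the same conversation schema as the example; the helper names `count_audio_clips` and `check_audio_limit` are hypothetical.

```python
MAX_AUDIO_PER_PROMPT = 5  # mirrors limit_mm_per_prompt={"audio": 5} in the example above


def count_audio_clips(conversation):
    """Count audio elements that carry a non-empty audio_url."""
    count = 0
    for message in conversation:
        content = message["content"]
        if isinstance(content, list):
            count += sum(
                1
                for ele in content
                if ele.get("type") == "audio" and ele.get("audio_url") is not None
            )
    return count


def check_audio_limit(conversation, limit=MAX_AUDIO_PER_PROMPT):
    """Return the clip count, or raise ValueError if it exceeds the limit."""
    n = count_audio_clips(conversation)
    if n > limit:
        raise ValueError(f"Conversation has {n} audio clips; the limit is {limit}.")
    return n
```

If a conversation exceeds the limit, one option is to raise the value passed to `limit_mm_per_prompt`; another is to drop older audio turns before generating.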
|
| 225 |
|
|
|
|
| 226 |
|
| 227 |
## Citation
|
| 228 |
+
If you find our project useful, please consider starring our [repo](https://github.com/DAMO-NLP-SG/SeaLLMs-Audio) and citing our work as follows.
|
| 229 |
+
Corresponding Author: Wenxuan Zhang ([wxzhang@sutd.edu.sg](mailto:wxzhang@sutd.edu.sg))
|
| 230 |
+
```
|
| 231 |
+
@misc{SeaLLMs-Audio,
|
| 232 |
+
author = {Chaoqun Liu and Mahani Aljunied and Guizhen Chen and Hou Pong Chan and Weiwen Xu and Yu Rong and Wenxuan Zhang},
|
| 233 |
+
title = {SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia},
|
| 234 |
+
year = {2025},
|
| 235 |
+
publisher = {GitHub},
|
| 236 |
+
journal = {GitHub repository},
|
| 237 |
+
howpublished = {\url{https://github.com/DAMO-NLP-SG/SeaLLMs-Audio}},
|
| 238 |
+
}
|
| 239 |
+
```
|
| 240 |
+
|
| 241 |
|