---
base_model:
- Qwen/Qwen2.5-7B
- openai/whisper-large-v3
- utter-project/mHuBERT-147
datasets:
- Nexdata/INTERSPEECH_2025_MLC-SLM_Challenge_Dataset
- bsmu/MLC-SLM-Eval
language:
- en
- fr
- it
- ja
- ko
- vi
- th
- pt
- ru
- es
- de
license: apache-2.0
metrics:
- cer
- wer
pipeline_tag: automatic-speech-recognition
tags:
- speech-llm
- conversational-asr
---

# MLC-SLM: Bridging the Gap in Multilingual Conversational ASR

This repository contains the models and code presented in the paper [Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR](https://huggingface.co/papers/2601.01461). 

The project was developed for the INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM). 

- **Paper:** [arXiv:2601.01461](https://huggingface.co/papers/2601.01461)
- **Code:** [GitHub - MLC-SLM](https://github.com/1535176727/MLC-SLM)

## Description

The proposed **Speech-LLM** is an enhanced framework that integrates fine-tuned Whisper and mHuBERT encoders with a Large Language Model (Qwen2.5-7B) to enrich speech representations for multilingual conversational ASR. It utilizes cross-attention-based fusion mechanisms to exploit complementary information between generative (Whisper) and discriminative (mHuBERT) speech features.
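Below is a minimal sketch of this cross-attention fusion idea, not the authors' released implementation. The layer names, residual connection, and query/key roles (Whisper features as queries attending over mHuBERT features) are assumptions for illustration; the dimensions follow the public configurations of Whisper-large-v3 (1280), mHuBERT-147 (768), and Qwen2.5-7B (3584).

```python
# Hedged sketch of cross-attention fusion between two speech encoders,
# projecting the fused features into the LLM embedding space.
# Not the official MLC-SLM code; layer names and residual are assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse Whisper (generative) and mHuBERT (discriminative) features."""

    def __init__(self, whisper_dim=1280, mhubert_dim=768, llm_dim=3584, num_heads=8):
        super().__init__()
        self.mhubert_proj = nn.Linear(mhubert_dim, whisper_dim)
        self.cross_attn = nn.MultiheadAttention(whisper_dim, num_heads, batch_first=True)
        self.to_llm = nn.Linear(whisper_dim, llm_dim)  # into Qwen2.5-7B embedding space

    def forward(self, whisper_feats, mhubert_feats):
        # whisper_feats: (B, T_w, 1280), mhubert_feats: (B, T_h, 768)
        kv = self.mhubert_proj(mhubert_feats)
        fused, _ = self.cross_attn(query=whisper_feats, key=kv, value=kv)
        fused = fused + whisper_feats   # residual on the Whisper stream (assumption)
        return self.to_llm(fused)       # speech embeddings consumed by the LLM

# Toy usage with random encoder outputs
fusion = CrossAttentionFusion()
speech_embeds = fusion(torch.randn(1, 100, 1280), torch.randn(1, 200, 768))
print(speech_embeds.shape)  # torch.Size([1, 100, 3584])
```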

## Results

Performance (CER/WER) on the MLC-SLM Challenge datasets:

| **System**                 | **Dev**   | **Eval**  | **CV-Test** |
|----------------------------|-----------|-----------|-------------|
| Whisper (LoRA fine-tuned)  | 11.40     | 10.71     | **11.47**   |
| Whisper (fully fine-tuned) | **10.99** | **10.07** | 13.11       |
| **Proposed Speech-LLM**    | 11.74     | 10.69     | 15.26       |

## Dataset

The models were trained on the official ~1,500-hour training set of the MLC-SLM Challenge, covering 11 languages split into 15 language/accent categories (English is represented by several regional accents).

## Citation

```bibtex
@article{mlcslm2025bridging,
  title={Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR},
  author={Yuxiang Mei and Dongxing Xu and Jiaen Liang and Yanhua Long},
  journal={arXiv preprint arXiv:2601.01461},
  year={2025}
}
```