---
base_model:
- Qwen/Qwen2.5-7B
- openai/whisper-large-v3
- utter-project/mHuBERT-147
datasets:
- Nexdata/INTERSPEECH_2025_MLC-SLM_Challenge_Dataset
- bsmu/MLC-SLM-Eval
language:
- en
- fr
- it
- ja
- ko
- vi
- th
- pt
- ru
- es
- de
license: apache-2.0
metrics:
- cer
- wer
pipeline_tag: automatic-speech-recognition
tags:
- speech-llm
- conversational-asr
---

# MLC-SLM: Bridging the Gap in Multilingual Conversational ASR
This repository contains the models and code presented in the paper *Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR*.
The project was developed for the INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM).
- Paper: arXiv:2601.01461
- Code: GitHub - MLC-SLM
## Description
The proposed Speech-LLM is an enhanced framework that integrates fine-tuned Whisper and mHuBERT encoders with a Large Language Model (Qwen2.5-7B) to enrich speech representations for multilingual conversational ASR. It utilizes cross-attention-based fusion mechanisms to exploit complementary information between generative (Whisper) and discriminative (mHuBERT) speech features.
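The fusion described above can be sketched roughly as follows. This is an illustrative PyTorch sketch, not the released implementation: the module name, the residual design, and the projection layout are assumptions, while the feature dimensions reflect the published model sizes (Whisper large-v3 encoder outputs 1280-d features, mHuBERT-147 outputs 768-d features, and Qwen2.5-7B uses 3584-d embeddings).

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion of two speech-encoder streams.

    Hypothetical sketch: the Whisper (generative) stream queries the
    mHuBERT (discriminative) stream, the attended features are added
    back residually, and the result is projected into the LLM's
    embedding space.
    """

    def __init__(self, d_whisper=1280, d_hubert=768, d_llm=3584, n_heads=8):
        super().__init__()
        self.proj_hubert = nn.Linear(d_hubert, d_whisper)  # align feature dims
        self.cross_attn = nn.MultiheadAttention(d_whisper, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_whisper)
        self.to_llm = nn.Linear(d_whisper, d_llm)  # project into LLM input space

    def forward(self, whisper_feats, hubert_feats):
        # whisper_feats: (B, T_w, 1280); hubert_feats: (B, T_h, 768)
        kv = self.proj_hubert(hubert_feats)
        attended, _ = self.cross_attn(query=whisper_feats, key=kv, value=kv)
        fused = self.norm(whisper_feats + attended)  # residual fusion
        return self.to_llm(fused)  # (B, T_w, 3584), fed to the LLM
```

The output sequence keeps the Whisper stream's time resolution, so the two encoders do not need frame-aligned outputs.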
## Results
Performance (CER/WER) on the MLC-SLM Challenge datasets:
| System | Dev | Eval | CV-Test |
|---|---|---|---|
| Whisper (LoRA-fine-tuned) | 11.40 | 10.71 | 11.47 |
| Whisper (Full-fine-tuned) | 10.99 | 10.07 | 13.11 |
| Proposed Speech-LLM | 11.74 | 10.69 | 15.26 |
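Both metrics in the table are edit-distance based (CER is conventionally used for languages without whitespace word boundaries, WER otherwise). As a reference for how such numbers are computed, here is a minimal self-contained implementation; it is not the challenge's official scoring script, which may apply additional text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling-array DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (free if match)
            prev = cur
    return dp[-1]

def wer(reference, hypothesis):
    """Word error rate in percent: word-level edits / reference word count."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate in percent, ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    return 100.0 * edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)
```

For example, `wer("how are you", "how is you")` counts one substitution over three reference words.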
## Dataset
The models were trained on the official ~1,500-hour training set of the MLC-SLM Challenge, which covers 11 languages in 15 categories (English is split into several regional accents).
## Citation

```bibtex
@article{mlcslm2025bridging,
  title={Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2601.01461},
  year={2025}
}
```