# MLC-SLM: Bridging the Gap in Multilingual Conversational ASR
This repository contains the models and code presented in the paper *Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR*.
The project was developed for the INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM).
- Paper: arXiv:2601.01461
- Code: GitHub - MLC-SLM
## Description
The proposed Speech-LLM is an enhanced framework that integrates fine-tuned Whisper and mHuBERT encoders with a Large Language Model (Qwen2.5-7B) to enrich speech representations for multilingual conversational ASR. It utilizes cross-attention-based fusion mechanisms to exploit complementary information between generative (Whisper) and discriminative (mHuBERT) speech features.
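The cross-attention fusion described above can be sketched in a few lines. The following is a minimal, illustrative NumPy example, not the authors' implementation: projection sizes, random initialization, and the final concatenation are assumptions made for clarity. Whisper frames act as queries attending over mHuBERT frames, so each Whisper feature is enriched with complementary discriminative information.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(whisper_feats, mhubert_feats, d_k=64, seed=0):
    """Single-head cross-attention fusion sketch (hypothetical shapes).

    whisper_feats: (T_w, d_w) generative features (queries)
    mhubert_feats: (T_h, d_h) discriminative features (keys/values)
    Returns a (T_w, 2 * d_k) fused representation.
    """
    rng = np.random.default_rng(seed)
    d_w, d_h = whisper_feats.shape[-1], mhubert_feats.shape[-1]
    # Randomly initialized projections stand in for learned weights.
    W_q = rng.standard_normal((d_w, d_k)) / np.sqrt(d_w)
    W_k = rng.standard_normal((d_h, d_k)) / np.sqrt(d_h)
    W_v = rng.standard_normal((d_h, d_k)) / np.sqrt(d_h)
    Q = whisper_feats @ W_q                    # (T_w, d_k)
    K = mhubert_feats @ W_k                    # (T_h, d_k)
    V = mhubert_feats @ W_v                    # (T_h, d_k)
    attn = softmax(Q @ K.T / np.sqrt(d_k))     # (T_w, T_h)
    fused = attn @ V                           # attended mHuBERT info
    # Concatenate the projected Whisper stream with the attended stream.
    return np.concatenate([Q, fused], axis=-1)

# Toy example: 100 Whisper frames (d=1280), 150 mHuBERT frames (d=768).
w = np.random.default_rng(1).standard_normal((100, 1280))
h = np.random.default_rng(2).standard_normal((150, 768))
out = cross_attention_fuse(w, h)
print(out.shape)  # → (100, 128)
```

In the full model, the fused representation would then be projected into the LLM's embedding space and consumed by Qwen2.5-7B as a prefix for transcription.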
## Results
Performance (CER/WER, %) on the MLC-SLM Challenge datasets; lower is better:
| System | Dev | Eval | CV-Test |
|---|---|---|---|
| Whisper (LoRA-fine-tuned) | 11.40 | 10.71 | 11.47 |
| Whisper (Full-fine-tuned) | 10.99 | 10.07 | 13.11 |
| Proposed Speech-LLM | 11.74 | 10.69 | 15.26 |
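For reference, word error rate is the word-level edit distance between reference and hypothesis, divided by the reference length. A minimal sketch (the challenge's official scoring tooling may normalize text differently, so this is illustrative only):

```python
def wer(ref, hyp):
    """Word error rate (%) via Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return 100.0 * dp[len(r)][len(h)] / len(r)

# One substitution ("world" -> "word") and one deletion ("are") in 5 words.
print(wer("hello world how are you", "hello word how you"))  # → 40.0
```

CER is computed the same way over characters instead of words, which is why it is used for languages without whitespace word boundaries.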
## Dataset
The models were trained on the official ~1500-hour training set from the MLC-SLM Challenge, covering 11 languages across 15 categories (including several English accents).
## Citation

```bibtex
@article{mlcslm2025bridging,
  title={Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2601.01461},
  year={2025}
}
```
## Model tree for YuCeong-May/MLC-SLM

- Base model: Qwen/Qwen2.5-7B