Tags: automatic-speech-recognition, speech-llm, conversational-asr

MLC-SLM: Bridging the Gap in Multilingual Conversational ASR

This repository contains the models and code presented in the paper "Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR".

The project was developed for the INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM).

Description

The proposed Speech-LLM is an enhanced framework that integrates fine-tuned Whisper and mHuBERT encoders with a Large Language Model (Qwen2.5-7B) to enrich speech representations for multilingual conversational ASR. It utilizes cross-attention-based fusion mechanisms to exploit complementary information between generative (Whisper) and discriminative (mHuBERT) speech features.
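The paper's exact fusion module is not reproduced here; the following is a minimal single-head cross-attention sketch in NumPy that illustrates the idea of letting one encoder's features (here, Whisper as queries) attend over the other's (mHuBERT as keys/values). The random projection matrices, dimensions, and the final concatenation are illustrative assumptions, not the published architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(whisper_feats, mhubert_feats, d_k=64, seed=0):
    """Fuse two encoder feature streams with single-head cross-attention.

    whisper_feats: (T_w, D) generative features, used as queries.
    mhubert_feats: (T_h, D) discriminative features, used as keys/values.
    The projection matrices are random stand-ins for learned weights.
    """
    rng = np.random.default_rng(seed)
    D = whisper_feats.shape[1]
    W_q = rng.standard_normal((D, d_k)) / np.sqrt(D)
    W_k = rng.standard_normal((D, d_k)) / np.sqrt(D)
    W_v = rng.standard_normal((D, d_k)) / np.sqrt(D)

    Q = whisper_feats @ W_q            # (T_w, d_k)
    K = mhubert_feats @ W_k            # (T_h, d_k)
    V = mhubert_feats @ W_v            # (T_h, d_k)

    # Scaled dot-product attention: each Whisper frame attends over
    # all mHuBERT frames, so the two streams need not be time-aligned.
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (T_w, T_h)
    fused = attn @ V                                  # (T_w, d_k)

    # Concatenate original queries with attended values; a downstream
    # projection would map this into the LLM embedding space.
    return np.concatenate([whisper_feats, fused], axis=-1)
```

In a real system these projections are trained jointly with the rest of the model, and the fused representation is projected to the LLM's hidden size before being prepended or interleaved with text tokens.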

Results

Performance (CER/WER) on the MLC-SLM Challenge datasets:

| System                    | Dev   | Eval  | CV-Test |
|---------------------------|-------|-------|---------|
| Whisper (LoRA fine-tuned) | 11.40 | 10.71 | 11.47   |
| Whisper (full fine-tuned) | 10.99 | 10.07 | 13.11   |
| Proposed Speech-LLM       | 11.74 | 10.69 | 15.26   |
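For reference, WER is the word-level Levenshtein (edit) distance between reference and hypothesis, normalized by the reference length; CER is the same computation over characters. The sketch below is a plain implementation, not the challenge's official scorer, which may additionally apply text normalization.

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: edit distance over word tokens / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

# One substitution out of three reference words → WER of 1/3.
print(wer("the cat sat", "the cat sit"))
```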

Dataset

The models were trained on the official ~1500h training set from the MLC-SLM Challenge, covering 11 languages and 15 categories (including various English accents).

Citation

```bibtex
@article{mlcslm2025bridging,
  title={Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2601.01461},
  year={2025}
}
```
Base model

This model is fine-tuned from Qwen/Qwen2.5-7B.