---
base_model:
  - Qwen/Qwen2.5-7B
  - openai/whisper-large-v3
  - utter-project/mHuBERT-147
datasets:
  - Nexdata/INTERSPEECH_2025_MLC-SLM_Challenge_Dataset
  - bsmu/MLC-SLM-Eval
language:
  - en
  - fr
  - it
  - ja
  - ko
  - vi
  - th
  - pt
  - ru
  - es
  - de
license: apache-2.0
metrics:
  - cer
  - wer
pipeline_tag: automatic-speech-recognition
tags:
  - speech-llm
  - conversational-asr
---

# MLC-SLM: Bridging the Gap in Multilingual Conversational ASR

This repository contains the models and code presented in the paper *Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR*.

The project was developed for the INTERSPEECH 2025 Multilingual Conversational Speech Language Model (MLC-SLM) Challenge.

## Description

The proposed Speech-LLM integrates fine-tuned Whisper and mHuBERT encoders with a large language model (Qwen2.5-7B) to enrich speech representations for multilingual conversational ASR. A cross-attention-based fusion mechanism exploits the complementary information carried by the generative (Whisper) and discriminative (mHuBERT) speech features.
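
The sketch below illustrates one way such a fusion module could look. It is a minimal PyTorch illustration, not the released implementation: the class name, the single-layer residual design, and the feature dimensions (1280 for whisper-large-v3, 768 for mHuBERT-147, 3584 for Qwen2.5-7B) are assumptions for exposition.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion of Whisper (generative) and
    mHuBERT (discriminative) frame features, projected into the LLM
    embedding space. NOT the released implementation; dimensions and
    layout are assumptions."""

    def __init__(self, whisper_dim=1280, hubert_dim=768, llm_dim=3584, n_heads=8):
        super().__init__()
        self.hubert_proj = nn.Linear(hubert_dim, whisper_dim)
        self.cross_attn = nn.MultiheadAttention(whisper_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(whisper_dim)
        self.to_llm = nn.Linear(whisper_dim, llm_dim)  # into Qwen2.5-7B hidden size

    def forward(self, whisper_feats, hubert_feats):
        # whisper_feats: (B, T_w, 1280); hubert_feats: (B, T_h, 768)
        kv = self.hubert_proj(hubert_feats)        # align mHuBERT dim to Whisper
        fused, _ = self.cross_attn(whisper_feats, kv, kv)
        fused = self.norm(whisper_feats + fused)   # residual + layer norm
        return self.to_llm(fused)                  # (B, T_w, 3584)

fusion = CrossAttentionFusion()
w = torch.randn(2, 1500, 1280)  # dummy Whisper encoder output (30 s window)
h = torch.randn(2, 1499, 768)   # dummy mHuBERT output (different frame rate)
print(fusion(w, h).shape)       # torch.Size([2, 1500, 3584])
```

Using Whisper frames as queries and mHuBERT frames as keys/values lets each generative frame attend to complementary discriminative cues, which is the intuition behind the fusion described above.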

## Results

Performance (CER/WER, %) on the MLC-SLM Challenge datasets:

| System | Dev | Eval | CV-Test |
|---|---|---|---|
| Whisper (LoRA fine-tuned) | 11.40 | 10.71 | 11.47 |
| Whisper (full fine-tuned) | 10.99 | 10.07 | 13.11 |
| Proposed Speech-LLM | 11.74 | 10.69 | 15.26 |
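
This card does not ship a scoring script. As an illustration only, WER/CER of this kind can be computed with the commonly used `jiwer` package; the challenge's official scoring may apply different text normalization:

```python
# Illustrative scoring with jiwer (an assumed, commonly used scorer; the
# challenge's official script may normalize punctuation/case differently).
import jiwer

refs = ["hello how are you", "ich bin müde"]
hyps = ["hello how are yo", "ich bin mude"]

print(f"WER: {jiwer.wer(refs, hyps):.3f}")  # word error rate (space-delimited languages)
print(f"CER: {jiwer.cer(refs, hyps):.3f}")  # character error rate (typically used for ja/ko/th)
```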

## Dataset

The models were trained on the official ~1,500 h training set from the MLC-SLM Challenge, covering 11 languages split into 15 language categories (English is represented by several regional accents).
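
A minimal loading sketch, assuming the Hub dataset repository exposes a standard `datasets`-compatible layout (in practice the repo may be gated or distributed as raw archives requiring manual download):

```python
# Hypothetical: assumes the repo is loadable via the standard `datasets` API
# and that any access terms on the Hub have been accepted first.
from datasets import load_dataset

ds = load_dataset("Nexdata/INTERSPEECH_2025_MLC-SLM_Challenge_Dataset")
print(ds)  # inspect the available splits/configs
```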

## Citation

```bibtex
@article{mlcslm2025bridging,
  title={Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2601.01461},
  year={2025}
}
```