---
base_model:
- Qwen/Qwen2.5-7B
- openai/whisper-large-v3
- utter-project/mHuBERT-147
datasets:
- Nexdata/INTERSPEECH_2025_MLC-SLM_Challenge_Dataset
- bsmu/MLC-SLM-Eval
language:
- en
- fr
- it
- ja
- ko
- vi
- th
- pt
- ru
- es
- de
license: apache-2.0
metrics:
- cer
- wer
pipeline_tag: automatic-speech-recognition
tags:
- speech-llm
- conversational-asr
---

# MLC-SLM: Bridging the Gap in Multilingual Conversational ASR
This repository contains the models and code presented in the paper *Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR*.
The project was developed for the INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM).
- Paper: arXiv:2601.01461
- Code: GitHub - MLC-SLM
## Description
The proposed Speech-LLM is an enhanced framework that integrates fine-tuned Whisper and mHuBERT encoders with a Large Language Model (Qwen2.5-7B) to enrich speech representations for multilingual conversational ASR. It utilizes cross-attention-based fusion mechanisms to exploit complementary information between generative (Whisper) and discriminative (mHuBERT) speech features.
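The fusion described above can be sketched roughly as follows. This is an illustrative PyTorch sketch, not the released implementation: the module name, the residual design, and the projection layout are assumptions, while the feature dimensions reflect the published model sizes (Whisper large-v3 encoder outputs 1280-d features, mHuBERT-147 outputs 768-d features, and Qwen2.5-7B uses 3584-d embeddings).

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion of two speech-encoder streams.

    Hypothetical sketch: the Whisper (generative) stream queries the
    mHuBERT (discriminative) stream, the attended features are added
    back residually, and the result is projected into the LLM's
    embedding space.
    """

    def __init__(self, d_whisper=1280, d_hubert=768, d_llm=3584, n_heads=8):
        super().__init__()
        self.proj_hubert = nn.Linear(d_hubert, d_whisper)  # align feature dims
        self.cross_attn = nn.MultiheadAttention(d_whisper, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_whisper)
        self.to_llm = nn.Linear(d_whisper, d_llm)  # project into LLM input space

    def forward(self, whisper_feats, hubert_feats):
        # whisper_feats: (B, T_w, 1280); hubert_feats: (B, T_h, 768)
        kv = self.proj_hubert(hubert_feats)
        attended, _ = self.cross_attn(query=whisper_feats, key=kv, value=kv)
        fused = self.norm(whisper_feats + attended)  # residual fusion
        return self.to_llm(fused)  # (B, T_w, 3584), fed to the LLM
```

The output sequence keeps the Whisper stream's time resolution, so the two encoders do not need frame-aligned outputs.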
## Results
Performance (CER/WER) on the MLC-SLM Challenge datasets:
| System | Dev | Eval | CV-Test |
|---|---|---|---|
| Whisper (LoRA-fine-tuned) | 11.40 | 10.71 | 11.47 |
| Whisper (Full-fine-tuned) | 10.99 | 10.07 | 13.11 |
| Proposed Speech-LLM | 11.74 | 10.69 | 15.26 |
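Both metrics in the table are edit-distance based (CER is conventionally used for languages without whitespace word boundaries, WER otherwise). As a reference for how such numbers are computed, here is a minimal self-contained implementation; it is not the challenge's official scoring script, which may apply additional text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling-array DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (free if match)
            prev = cur
    return dp[-1]

def wer(reference, hypothesis):
    """Word error rate in percent: word-level edits / reference word count."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate in percent, ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    return 100.0 * edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)
```

For example, `wer("how are you", "how is you")` counts one substitution over three reference words.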
## Dataset
The models were trained on the official ~1,500-hour training set of the MLC-SLM Challenge, which covers 11 languages in 15 categories (English is split into several regional accents).
## Citation

```bibtex
@article{mlcslm2025bridging,
  title={Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2601.01461},
  year={2025}
}
```