Add comprehensive model card for MCAT
This PR adds a comprehensive model card for the MCAT model, based on the paper [MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages](https://huggingface.co/papers/2512.01512).
The update includes:
- Relevant metadata such as `license`, `pipeline_tag`, `library_name`, `tags`, and `base_model`.
- Links to the paper and the GitHub repository.
- A brief description of the model and its capabilities.
- Detailed installation, model download, and inference demo instructions from the GitHub README.
Please review and merge if everything looks good.
README.md
ADDED
---
license: apache-2.0
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- speech-to-text-translation
- multilingual
base_model:
- openai/whisper-large-v3
- google/gemma-3-27b-it
---

# MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages

This repository contains the **MCAT** (Multilingual Cost-effective Accelerated Speech-to-Text Translator) framework, presented in the paper [MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages](https://huggingface.co/papers/2512.01512).

MCAT addresses two key challenges in applying Multimodal Large Language Models (MLLMs) to Speech-to-Text Translation (S2TT): language coverage and inference efficiency. It introduces a language scaling method that combines curriculum learning with a data balancing strategy to extend translation capabilities to 70 languages, and an optimized speech adapter module that shortens speech sequences to improve batch inference efficiency.
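The intuition behind the adapter's length reduction can be illustrated with simple frame stacking. This is a hypothetical sketch, not the actual MCAT adapter (which is a learned module described in the paper); it only shows why shortening the speech sequence cuts the number of tokens the LLM must process per batch.

```python
def stack_frames(frames, k=4):
    """Downsample a speech feature sequence by concatenating every k
    consecutive frames into one, shortening the sequence roughly k-fold."""
    # Pad with the last frame so the length is divisible by k.
    pad = (-len(frames)) % k
    frames = frames + [frames[-1]] * pad if pad else frames
    # Concatenate each group of k frames into a single wider frame.
    return [sum(frames[i:i + k], []) for i in range(0, len(frames), k)]

# A 10-frame sequence of 2-dim features becomes 3 stacked frames of dim 8.
feats = [[float(t), float(t)] for t in range(10)]
out = stack_frames(feats, k=4)
```

A sequence that is 4x shorter means 4x fewer positions in the LLM's attention, which is where the batch-inference savings come from.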

For more details, including training and evaluation scripts, please visit the [GitHub repository](https://github.com/yxduir/m2m-70).

✅ **Current Version: MCAT (v2.0)**
- **70 Supported Languages**: Afrikaans (afr), Amharic (amh), Arabic (ara), Assamese (asm), Azerbaijani (azj), Belarusian (bel), Bengali (ben), Bosnian (bos), Bulgarian (bul), Catalan (cat), Czech (ces), Chinese (cmn), Welsh (cym), Danish (dan), German (deu), Greek (ell), English (eng), Estonian (est), Persian (fas), Finnish (fin), French (fra), Galician (glg), Gujarati (guj), Hebrew (heb), Hindi (hin), Croatian (hrv), Hungarian (hun), Armenian (hye), Indonesian (ind), Icelandic (isl), Italian (ita), Javanese (jav), Japanese (jpn), Kannada (kan), Georgian (kat), Kazakh (kaz), Khmer (khm), Kyrgyz (kir), Korean (kor), Lao (lao), Latvian (lav), Lithuanian (lit), Malayalam (mal), Macedonian (mkd), Malay (msa), Burmese (mya), Dutch (nld), Norwegian (nob), Nepali (npi), Punjabi (pan), Polish (pol), Portuguese (por), Romanian (ron), Russian (rus), Slovak (slk), Slovenian (slv), Spanish (spa), Serbian (srp), Swedish (swe), Swahili (swh), Tamil (tam), Telugu (tel), Tagalog (tgl), Thai (tha), Turkish (tur), Ukrainian (ukr), Urdu (urd), Uzbek (uzb), Vietnamese (vie), Cantonese (yue)
- **4830 Translation Directions**: supports all 4830 possible translation directions (70×69 language pairs)
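The direction count is simply the number of ordered pairs of distinct languages. A quick sanity check over the ISO codes listed above:

```python
codes = (
    "afr amh ara asm azj bel ben bos bul cat ces cmn cym dan deu ell eng est "
    "fas fin fra glg guj heb hin hrv hun hye ind isl ita jav jpn kan kat kaz "
    "khm kir kor lao lav lit mal mkd msa mya nld nob npi pan pol por ron rus "
    "slk slv spa srp swe swh tam tel tgl tha tur ukr urd uzb vie yue"
).split()

# Every ordered pair of distinct languages is one translation direction.
directions = [(src, tgt) for src in codes for tgt in codes if src != tgt]
print(len(codes), len(directions))  # 70 languages, 70 * 69 = 4830 directions
```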

## Installation
```bash
conda create -n m2m-70 python=3.10
conda activate m2m-70

git clone https://github.com/yxduir/m2m-70
cd m2m-70/SLAM-LLM

sudo apt update
sudo apt install ffmpeg
sudo apt install git-lfs

pip install -r requirements.txt
pip install -e .
cd ..
```

## Download Model
| Encoder | Adapter | LLM |
|---|---|---|
| [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | [Adapter](https://huggingface.co/yxdu/mcat-large) | [Gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it) |
```bash
cd models/
# The three models require about 150 GB of storage in total.
git lfs clone https://huggingface.co/openai/whisper-large-v3
git lfs clone https://huggingface.co/yxdu/mcat-large
# Request access to the Gemma models on Hugging Face before cloning.
git lfs clone https://huggingface.co/google/gemma-3-27b-it
cd ..
```

## Infer Demo
This demo inference script covers translation between all 70 languages, for a total of 70×69 = 4,830 directions.
It downloads a 9 GB dataset from the Hugging Face Hub.
It requires a GPU with 80 GB of VRAM and supports BF16 only.
```bash
bash scripts/infer_demo.sh
```

## Citation
```bibtex
@misc{du2025mcatscalingmanytomanyspeechtotext,
      title={MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages},
      author={Yexing Du and Kaiyuan Liu and Youcheng Pan and Bo Yang and Keqi Deng and Xie Chen and Yang Xiang and Ming Liu and Bin Qin and YaoWei Wang},
      year={2025},
      eprint={2512.01512},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.01512},
}
@inproceedings{du2025speech2text,
  title     = {Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning},
  author    = {Du, Yexing and Pan, Youcheng and Ma, Ziyang and Yang, Bo and Yang, Yifang and Deng, Keqi and Chen, Xie and Xiang, Yang and Liu, Ming and Qin, Bing},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)},
  year      = {2025},
}
```