Lorenzoncina committed
Commit 557c9e8 · 1 Parent(s): c48a64a

Linear Projector and readme file

Files changed (3)
  1. README.md +138 -0
  2. images/eloquence_eu.png +0 -0
  3. model.pt +3 -0
README.md CHANGED
@@ -1,3 +1,141 @@
  ---
+ language:
+ - English
+ - French
+ - German
+ - Italian
+ - Spanish
+ - Portuguese
+ - Dutch
+ - Polish
+ - Hungarian
+ - Czech
+ - Romanian
+ - Bulgarian
+ - Slovak
+ - Slovene
+ - Serbian
+ - Greek
+ - Danish
+ - Swedish
+ - Finnish
+ - Latvian
+ - Lithuanian
+ - Estonian
+ - Welsh
+ - Maltese
+ - Breton
+ - Irish
+ - Galician
+ - Basque
+ pipeline_tag: automatic-speech-recognition
  license: cc-by-4.0
  ---
+ ## Model Details
+
+ ### Model Description
+
+ A 17.31M-parameter multilingual linear projector (version 2) trained for automatic speech recognition (ASR) using the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) speechLLM framework.
+ Within this framework, only the linear projector was trained, alongside a frozen speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and a frozen LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)).
+
+ - **Developed by:** SpeechTek Unit at Fondazione Bruno Kessler
+ - **Funded by:** This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
+ - **Model type:** Linear projector in a speechLLM framework
+ - **Supported Language(s):** English, Italian, Spanish, German, French
+ - **License:** CC-BY-4.0
+
+ ## Uses
+
+ This model is trained for automatic speech recognition (ASR) and is version 2 of the mEUltilingual speechLLM projectors collection.
+
+ ## How to Get Started with the Model
+
+ This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding with the shell scripts provided in the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) codebase. Please refer to the instructions there for further details.
+
+ Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.
+
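For orientation, the stated 17.31M parameter count is consistent with a concat-and-project design: stacking `encoder_projector_ds_rate = 5` consecutive Whisper frames (`encoder_dim = 1280`) and projecting them into the EuroLLM embedding space (`llm_dim = 2048`) through a 2048-unit hidden layer. The sketch below assumes this architecture; class and argument names are illustrative, not the exact SLAM-ASR API.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Sketch of a concat-and-project linear projector: stack ds_rate
    consecutive encoder frames, then map them to the LLM embedding space."""

    def __init__(self, encoder_dim=1280, llm_dim=2048, ds_rate=5, hidden=2048):
        super().__init__()
        self.ds_rate = ds_rate
        self.linear1 = nn.Linear(encoder_dim * ds_rate, hidden)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden, llm_dim)

    def forward(self, x):  # x: (batch, frames, encoder_dim)
        b, t, d = x.shape
        t -= t % self.ds_rate  # drop trailing frames that don't fill a group
        x = x[:, :t, :].reshape(b, t // self.ds_rate, d * self.ds_rate)
        return self.linear2(self.relu(self.linear1(x)))

proj = LinearProjector()
n_params = sum(p.numel() for p in proj.parameters())
feats = torch.randn(1, 1500, 1280)  # Whisper encoder output for 30 s of audio
out = proj(feats)
print(n_params)   # 17305600, i.e. the ~17.31M stated above
print(out.shape)  # torch.Size([1, 300, 2048])
```

Loading the released `model.pt` into such a module depends on the exact SLAM-ASR state-dict layout; use the repository's own scripts for actual fine-tuning or decoding.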
+ ## Training Details
+
+ ### Training Data
+
+ The linear projector was trained on a multilingual dataset covering 28 European languages, drawn from widely used speech corpora: [Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0), [Fleurs](https://huggingface.co/datasets/google/fleurs), and [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli). As the distribution of data across languages is highly imbalanced, we applied a cap of 100K audio samples per language per dataset, discarding any additional samples beyond this threshold. This strategy reduces data skew while keeping training computationally feasible. To assess the generalizability and robustness of our models on out-of-domain speech, we used the official evaluation set of the [INTERSPEECH 2025 MLC-SLM Challenge](https://www.nexdata.ai/competition/mlc-slm).
+
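The per-language cap described above can be sketched as follows. Function and variable names are illustrative, not from the actual pipeline, and the card does not specify how surplus samples are chosen for discarding; random selection is shown here.

```python
import random
from collections import defaultdict

def cap_per_language(samples, cap=100_000, seed=0):
    """Keep at most `cap` utterances per language for one dataset.

    `samples` is an iterable of (language, utterance_id) pairs; utterances
    beyond the cap are discarded (at random, under our assumption).
    """
    by_lang = defaultdict(list)
    for lang, utt in samples:
        by_lang[lang].append(utt)
    rng = random.Random(seed)
    return {
        lang: utts if len(utts) <= cap else rng.sample(utts, cap)
        for lang, utts in by_lang.items()
    }

# Toy example with a cap of 3: Italian is capped, Maltese is kept whole.
data = [("it", f"it_{i}") for i in range(10)] + [("mt", f"mt_{i}") for i in range(2)]
capped = cap_per_language(data, cap=3)
print(len(capped["it"]), len(capped["mt"]))  # 3 2
```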
+ ### Training Procedure
+
+ * The model was trained with `torchrun`, using the codebase provided by the official [SLAM-ASR GitHub repository](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech).
+ * Only the linear projector was trained.
+ * The speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and the LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen, and no LoRA was applied during training.
+ * A single English prompt was used for all languages during training: "Transcribe speech to text."
+ * Training was conducted on a single NVIDIA L40S (Ada Lovelace) GPU.
+
+ #### Training Hyperparameters
+
+ | Parameter | Value |
+ | -------- | ------- |
+ | llm_name | eurollm-1.7b |
+ | llm_dim | 2048 |
+ | context_length | 4096 |
+ | encoder_name | whisper |
+ | encoder_projector_ds_rate | 5 |
+ | encoder_dim | 1280 |
+ | encoder_projector | linear |
+ | input_type | mel |
+ | mel_size | 128 |
+ | epochs | 3 |
+ | freeze_encoder | true |
+ | freeze_llm | true |
+ | warmup_steps | 1000 |
+ | total_steps | 100000 |
+ | lr | 1e-4 |
+ | validation_interval | 1000 |
+ | batch_size_training | 4 |
+ | val_size_training | 4 |
+ | num_workers_dataloader | 2 |
+ | optimizer | AdamW |
+ | enable_fsdp | false |
+ | enable_ddp | true |
+ | use_fp16 | true |
+
+ ## Evaluation
+
+ The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library.
+
+ ### Results
+
+ Each column reports WER (%) for a projector trained on the listed data combination (CV = Common Voice, FL = Fleurs, VoxPop. = VoxPopuli) and evaluated on the test set named in parentheses (CV, FL, or MLC-SLM).
+
+ | Language | CV (CV test) | CV (FL test) | CV (MLC test) | CV+FL (CV test) | CV+FL (FL test) | CV+FL (MLC test) | CV+FL+VoxPop. (CV test) | CV+FL+VoxPop. (FL test) | CV+FL+VoxPop. (MLC test) |
+ |------------|--------------|--------------|---------------|-----------------|-----------------|------------------|--------------------------|--------------------------|---------------------------|
+ | Spanish | 5.45 | 18.58 | 33.99 | 5.71 | 3.09 | 22.56 | 5.22 | 4.09 | 21.86 |
+ | German | 7.72 | 23.54 | 53.06 | 7.67 | 6.88 | 47.87 | 7.11 | 7.79 | 32.77 |
+ | Dutch | 7.78 | 19.64 | - | 7.58 | 8.18 | - | 6.83 | 8.65 | - |
+ | Portuguese | 10.32 | 17.09 | 85.99 | 10.02 | 3.17 | 75.00 | 9.39 | 4.86 | 51.75 |
+ | Galician | 12.38 | 23.67 | - | 13.18 | 8.06 | - | 12.70 | 9.98 | - |
+ | English | 12.64 | 16.06 | 36.13 | 12.90 | 4.77 | 36.92 | 12.94 | 6.34 | 46.56 |
+ | Polish | 13.56 | 19.66 | - | 14.11 | 7.33 | - | 14.19 | 8.68 | - |
+ | Czech | 14.30 | 30.76 | - | 14.18 | 9.76 | - | 11.16 | 11.32 | - |
+ | French | - | - | 70.00 | - | - | 61.12 | 11.24 | 7.83 | 42.05 |
+ | Hungarian | 16.11 | 37.44 | - | 16.23 | 15.51 | - | 14.59 | 16.87 | - |
+ | Italian | 16.28 | 21.38 | 56.24 | 6.14 | 3.89 | 49.44 | 6.01 | 3.32 | 36.13 |
+ | Swedish | 17.51 | 24.76 | - | 17.05 | 7.69 | - | 15.99 | 10.94 | - |
+ | Romanian | 18.99 | 28.28 | - | 18.92 | 8.37 | - | 17.39 | 9.65 | - |
+ | Danish | 20.36 | 29.43 | - | 19.59 | 11.02 | - | 18.81 | 14.65 | - |
+ | Basque | 20.65 | - | - | 20.92 | - | - | 19.96 | - | - |
+ | Bulgarian | 24.05 | 33.73 | - | 23.93 | 13.21 | - | 24.26 | 15.20 | - |
+ | Finnish | 28.74 | 48.37 | - | 28.19 | 13.28 | - | 22.61 | 15.29 | - |
+ | Latvian | 29.28 | 44.25 | - | 29.78 | 15.23 | - | 27.12 | 17.23 | - |
+ | Lithuanian | 32.00 | 52.88 | - | 31.80 | 20.58 | - | 28.27 | 24.30 | - |
+ | Greek | 34.47 | 44.26 | - | 32.73 | 20.32 | - | 30.06 | 18.35 | - |
+ | Slovak | 40.38 | 29.86 | - | 44.77 | 8.31 | - | 35.84 | 9.71 | - |
+ | Slovenian | 42.19 | 43.82 | - | 40.43 | 18.11 | - | 34.72 | 19.41 | - |
+ | Estonian | 44.11 | 68.87 | - | 44.73 | 18.57 | - | 37.19 | 19.83 | - |
+ | Welsh | 54.80 | 88.41 | - | 55.65 | 51.52 | - | 50.40 | 39.96 | - |
+ | Serbian | 61.53 | 115.32 | - | 61.22 | 28.95 | - | 56.49 | 27.60 | - |
+ | Maltese | 69.71 | 112.25 | - | 69.57 | 52.96 | - | 58.84 | 44.89 | - |
+ | Breton | 98.01 | - | - | 102.70 | - | - | 95.68 | - | - |
+ | Irish | 104.79 | 135.80 | - | 100.91 | 135.61 | - | 82.23 | 88.06 | - |
+
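The scores above follow the standard WER definition (word-level edit distance divided by reference length), which is why values above 100 appear for languages such as Irish and Breton. A self-contained sketch equivalent in definition to what the `evaluate` library's `wer` metric computes (the card itself uses `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words.
    Can exceed 1.0 when the hypothesis contains many insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # DP row: distances against an empty reference
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                            # deletion
                       d[j - 1] + 1,                        # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return d[-1] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("a b c d", "a x c"))            # 0.5 (one substitution, one deletion)
```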
+ ## Acknowledgements
+
+ <img src="images/eloquence_eu.png" align="center" width="30%">
+
+ This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
+
images/eloquence_eu.png ADDED
model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d9f343bd5d77d8891ea91d91c47ef0e6cb76e0b3cc9109f7d81133981d9fb593
+ size 74757338