Lorenzoncina committed
Commit 557c9e8 · 1 Parent(s): c48a64a

Linear Projector and readme file

Files changed (3)
  1. README.md +138 -0
  2. images/eloquence_eu.png +0 -0
  3. model.pt +3 -0
README.md CHANGED
@@ -1,3 +1,141 @@
  ---
+ language:
+ - English
+ - French
+ - German
+ - Italian
+ - Spanish
+ - Portuguese
+ - Dutch
+ - Polish
+ - Hungarian
+ - Czech
+ - Romanian
+ - Bulgarian
+ - Slovak
+ - Slovene
+ - Serbian
+ - Greek
+ - Danish
+ - Swedish
+ - Finnish
+ - Latvian
+ - Lithuanian
+ - Estonian
+ - Welsh
+ - Maltese
+ - Breton
+ - Irish
+ - Galician
+ - Basque
+ pipeline_tag: automatic-speech-recognition
  license: cc-by-4.0
  ---
+ ## Model Details
+
+ ### Model Description
+
+ A 17.31M-parameter multilingual linear projector (version 2) trained for automatic speech recognition (ASR) using the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) speechLLM framework.
+ Within this framework, only the linear projector was trained, alongside a frozen speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and a frozen LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)).
+
+ - **Developed by:** SpeechTek Unit at Fondazione Bruno Kessler
+ - **Funded by:** This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
+ - **Model type:** Linear projector in a speechLLM framework
+ - **Supported Language(s):** English, Italian, Spanish, German, French
+ - **License:** CC-BY-4.0
+
+ ## Uses
+
+ This model is trained for automatic speech recognition (ASR) and is version 2 of the mEUltilingual speechLLM projectors collection.
+
+ ## How to Get Started with the Model
+
+ This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding with the shell scripts provided in the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) codebase. Please refer to the instructions there for further details.
+
+ Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.
+
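For orientation, the stated 17.31M parameter count is consistent with a concat-and-project design: stacking `encoder_projector_ds_rate = 5` consecutive Whisper frames (`encoder_dim = 1280`) and projecting them into the EuroLLM embedding space (`llm_dim = 2048`) through a 2048-unit hidden layer. The sketch below assumes this architecture; class and argument names are illustrative, not the exact SLAM-ASR API.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Sketch of a concat-and-project linear projector: stack ds_rate
    consecutive encoder frames, then map them to the LLM embedding space."""

    def __init__(self, encoder_dim=1280, llm_dim=2048, ds_rate=5, hidden=2048):
        super().__init__()
        self.ds_rate = ds_rate
        self.linear1 = nn.Linear(encoder_dim * ds_rate, hidden)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden, llm_dim)

    def forward(self, x):  # x: (batch, frames, encoder_dim)
        b, t, d = x.shape
        t -= t % self.ds_rate  # drop trailing frames that don't fill a group
        x = x[:, :t, :].reshape(b, t // self.ds_rate, d * self.ds_rate)
        return self.linear2(self.relu(self.linear1(x)))

proj = LinearProjector()
n_params = sum(p.numel() for p in proj.parameters())
feats = torch.randn(1, 1500, 1280)  # Whisper encoder output for 30 s of audio
out = proj(feats)
print(n_params)   # 17305600, i.e. the ~17.31M stated above
print(out.shape)  # torch.Size([1, 300, 2048])
```

Loading the released `model.pt` into such a module depends on the exact SLAM-ASR state-dict layout; use the repository's own scripts for actual fine-tuning or decoding.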
+ ## Training Details
+
+ ### Training Data
+
+ The linear projector was trained on a multilingual dataset covering 28 European languages, drawn from widely used speech corpora: [Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0), [Fleurs](https://huggingface.co/datasets/google/fleurs), and [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli). As the distribution of data across languages is highly imbalanced, we applied a cap of 100K audio samples per language per dataset, discarding any additional samples beyond this threshold. This strategy reduces data skew while keeping training computationally feasible. To assess the generalizability and robustness of our models on out-of-domain speech, we used the official evaluation set of the [INTERSPEECH 2025 MLC-SLM Challenge](https://www.nexdata.ai/competition/mlc-slm).
+
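The per-language cap described above can be sketched as follows. Function and variable names are illustrative, not from the actual pipeline, and the card does not specify how surplus samples are chosen for discarding; random selection is shown here.

```python
import random
from collections import defaultdict

def cap_per_language(samples, cap=100_000, seed=0):
    """Keep at most `cap` utterances per language for one dataset.

    `samples` is an iterable of (language, utterance_id) pairs; utterances
    beyond the cap are discarded (at random, under our assumption).
    """
    by_lang = defaultdict(list)
    for lang, utt in samples:
        by_lang[lang].append(utt)
    rng = random.Random(seed)
    return {
        lang: utts if len(utts) <= cap else rng.sample(utts, cap)
        for lang, utts in by_lang.items()
    }

# Toy example with a cap of 3: Italian is capped, Maltese is kept whole.
data = [("it", f"it_{i}") for i in range(10)] + [("mt", f"mt_{i}") for i in range(2)]
capped = cap_per_language(data, cap=3)
print(len(capped["it"]), len(capped["mt"]))  # 3 2
```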
+ ### Training Procedure
+
+ * The model was trained with `torchrun`, using the codebase provided by the official [SLAM-ASR GitHub repository](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech).
+ * Only the linear projector was trained.
+ * The speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and the LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen, and no LoRA was applied during training.
+ * A single English prompt was used for all languages during training: "Transcribe speech to text."
+ * Training was conducted on a single NVIDIA L40S (Ada Lovelace) GPU.
+
+ #### Training Hyperparameters
+
+ | Parameter | Value |
+ | -------- | ------- |
+ | llm_name | eurollm-1.7b |
+ | llm_dim | 2048 |
+ | context_length | 4096 |
+ | encoder_name | whisper |
+ | encoder_projector_ds_rate | 5 |
+ | encoder_dim | 1280 |
+ | encoder_projector | linear |
+ | input_type | mel |
+ | mel_size | 128 |
+ | epochs | 3 |
+ | freeze_encoder | true |
+ | freeze_llm | true |
+ | warmup_steps | 1000 |
+ | total_steps | 100000 |
+ | lr | 1e-4 |
+ | validation_interval | 1000 |
+ | batch_size_training | 4 |
+ | val_size_training | 4 |
+ | num_workers_dataloader | 2 |
+ | optimizer | AdamW |
+ | enable_fsdp | false |
+ | enable_ddp | true |
+ | use_fp16 | true |
+
+ ## Evaluation
+
+ The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library.
+
+ ### Results
+
+ Each column reports WER (%) for a projector trained on the listed data combination (CV = Common Voice, FL = Fleurs, VoxPop. = VoxPopuli) and evaluated on the test set named in parentheses (CV, FL, or MLC-SLM).
+
+ | Language | CV (CV test) | CV (FL test) | CV (MLC test) | CV+FL (CV test) | CV+FL (FL test) | CV+FL (MLC test) | CV+FL+VoxPop. (CV test) | CV+FL+VoxPop. (FL test) | CV+FL+VoxPop. (MLC test) |
+ |------------|--------------|--------------|---------------|-----------------|-----------------|------------------|--------------------------|--------------------------|---------------------------|
+ | Spanish | 5.45 | 18.58 | 33.99 | 5.71 | 3.09 | 22.56 | 5.22 | 4.09 | 21.86 |
+ | German | 7.72 | 23.54 | 53.06 | 7.67 | 6.88 | 47.87 | 7.11 | 7.79 | 32.77 |
+ | Dutch | 7.78 | 19.64 | - | 7.58 | 8.18 | - | 6.83 | 8.65 | - |
+ | Portuguese | 10.32 | 17.09 | 85.99 | 10.02 | 3.17 | 75.00 | 9.39 | 4.86 | 51.75 |
+ | Galician | 12.38 | 23.67 | - | 13.18 | 8.06 | - | 12.70 | 9.98 | - |
+ | English | 12.64 | 16.06 | 36.13 | 12.90 | 4.77 | 36.92 | 12.94 | 6.34 | 46.56 |
+ | Polish | 13.56 | 19.66 | - | 14.11 | 7.33 | - | 14.19 | 8.68 | - |
+ | Czech | 14.30 | 30.76 | - | 14.18 | 9.76 | - | 11.16 | 11.32 | - |
+ | French | - | - | 70.00 | - | - | 61.12 | 11.24 | 7.83 | 42.05 |
+ | Hungarian | 16.11 | 37.44 | - | 16.23 | 15.51 | - | 14.59 | 16.87 | - |
+ | Italian | 16.28 | 21.38 | 56.24 | 6.14 | 3.89 | 49.44 | 6.01 | 3.32 | 36.13 |
+ | Swedish | 17.51 | 24.76 | - | 17.05 | 7.69 | - | 15.99 | 10.94 | - |
+ | Romanian | 18.99 | 28.28 | - | 18.92 | 8.37 | - | 17.39 | 9.65 | - |
+ | Danish | 20.36 | 29.43 | - | 19.59 | 11.02 | - | 18.81 | 14.65 | - |
+ | Basque | 20.65 | - | - | 20.92 | - | - | 19.96 | - | - |
+ | Bulgarian | 24.05 | 33.73 | - | 23.93 | 13.21 | - | 24.26 | 15.20 | - |
+ | Finnish | 28.74 | 48.37 | - | 28.19 | 13.28 | - | 22.61 | 15.29 | - |
+ | Latvian | 29.28 | 44.25 | - | 29.78 | 15.23 | - | 27.12 | 17.23 | - |
+ | Lithuanian | 32.00 | 52.88 | - | 31.80 | 20.58 | - | 28.27 | 24.30 | - |
+ | Greek | 34.47 | 44.26 | - | 32.73 | 20.32 | - | 30.06 | 18.35 | - |
+ | Slovak | 40.38 | 29.86 | - | 44.77 | 8.31 | - | 35.84 | 9.71 | - |
+ | Slovenian | 42.19 | 43.82 | - | 40.43 | 18.11 | - | 34.72 | 19.41 | - |
+ | Estonian | 44.11 | 68.87 | - | 44.73 | 18.57 | - | 37.19 | 19.83 | - |
+ | Welsh | 54.80 | 88.41 | - | 55.65 | 51.52 | - | 50.40 | 39.96 | - |
+ | Serbian | 61.53 | 115.32 | - | 61.22 | 28.95 | - | 56.49 | 27.60 | - |
+ | Maltese | 69.71 | 112.25 | - | 69.57 | 52.96 | - | 58.84 | 44.89 | - |
+ | Breton | 98.01 | - | - | 102.70 | - | - | 95.68 | - | - |
+ | Irish | 104.79 | 135.80 | - | 100.91 | 135.61 | - | 82.23 | 88.06 | - |
+
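The scores above follow the standard WER definition (word-level edit distance divided by reference length), which is why values above 100 appear for languages such as Irish and Breton. A self-contained sketch equivalent in definition to what the `evaluate` library's `wer` metric computes (the card itself uses `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words.
    Can exceed 1.0 when the hypothesis contains many insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # DP row: distances against an empty reference
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                            # deletion
                       d[j - 1] + 1,                        # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return d[-1] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("a b c d", "a x c"))            # 0.5 (one substitution, one deletion)
```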
+ ## Acknowledgements
+
+ <img src="images/eloquence_eu.png" align="center" width="30%">
+
+ This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
+
images/eloquence_eu.png ADDED
model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d9f343bd5d77d8891ea91d91c47ef0e6cb76e0b3cc9109f7d81133981d9fb593
+ size 74757338