---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
language:
- eu
- gl
- ca
- es
- en
datasets:
- HiTZ/latxa-corpus-v1.1
base_model:
- Qwen/Qwen3-VL-8B-Instruct
---

# Model Card for HiTZ/Latxa-Qwen3-VL-8B-Instruct

<p align="center">
<img src="https://raw.githubusercontent.com/hitz-zentroa/latxa/refs/heads/main/assets/latxa_vision_circle.png" style="height: 350px;">
</p>

Latxa-Qwen3-VL-8B-Instruct is a Basque-adapted multimodal and multilingual instruct model built on top of Qwen3-VL-8B-Instruct, a powerful vision-language model that can process images and understand and generate text. It has been adapted by the HiTZ Research Center for improved performance on Basque (`mono_eu` variant), on Galician and Catalan as well (`multi` variant), and for interactive instruction following.

> [!WARNING]
> DISCLAIMER
>
> These models are still under development.
> The released models are preliminary and may be updated and improved in the future.

The release contains several versions (revisions):
- Multilingual (`multi`): in addition to Basque, the model has also been adapted to Galician and Catalan.
- Basque monolingual (`mono_eu`): adapted to Basque only.

You can choose a version by specifying the revision when loading the model, e.g. `revision="multi"`. By default (`main`), the multilingual variant is downloaded.

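The revision names can be kept in one place when switching between variants. This is a minimal sketch; the `REVISIONS` mapping and the `load_kwargs` helper are illustrative, not part of the model's API:

```python
# Revision names of the released variants ("main" points at the
# multilingual variant). Illustrative helper, not an official API.
REVISIONS = {
    "multilingual": "multi",
    "basque": "mono_eu",
    "default": "main",
}

def load_kwargs(variant: str) -> dict:
    """Return keyword arguments for transformers.pipeline for a variant."""
    return {
        "model": "HiTZ/Latxa-Qwen3-VL-8B-Instruct",
        "revision": REVISIONS[variant],
    }

# Example: pipe = pipeline("image-text-to-text", **load_kwargs("basque"))
```
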
## Model Details

### Model Description

Latxa Vision models are a family of vision-language models based on Qwen3-VL. The models were adapted to different languages following the adaptation method of [Sainz et al. (2025)](https://aclanthology.org/2025.emnlp-main.1484/). They are released in two language variants: `multi` (adapted to Basque, Galician and Catalan) and `mono_eu` (adapted to Basque only).

- **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
- **Funded by:** Ikergaitu and ALIA projects (Basque and Spanish Governments)
- **Model type:** Vision-language instruct model
- **Language(s) (NLP):** Basque, Galician, Catalan, Spanish, English and more
- **License:** Apache 2.0
- **Finetuned from model:** Qwen/Qwen3-VL-8B-Instruct

## Getting Started

Use the code below to get started with the model.

```python
from transformers import pipeline

# Load the image-text-to-text pipeline with the multilingual variant
pipe = pipeline("image-text-to-text", model="HiTZ/Latxa-Qwen3-VL-8B-Instruct", revision="multi")

# Messages follow the standard chat format and may mix images and text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    }
]

output = pipe(messages)
print(output)
```
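
The same message schema also accepts a local image path or a PIL image in place of a URL. A minimal sketch, with a hypothetical local file and a Basque prompt:

```python
# Same chat schema as above, with a local image instead of a URL.
# The file name and the Basque prompt are illustrative.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "cats.png"},  # hypothetical local file
            {"type": "text", "text": "Zer ikusten dugu irudi honetan?"},  # "What do we see in this image?"
        ],
    }
]
# output = pipe(messages)
```
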

## Uses

Latxa models are intended to be used with Basque data; for any other language, performance is not guaranteed. The `multi` variant was additionally adapted to Galician and Catalan.

### Direct Use

Latxa Instruct models are trained to follow instructions and to work as chat assistants.

### Out-of-Scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.


## Bias, Risks, and Limitations

To alleviate potentially disturbing or harmful content, Latxa has been trained on carefully selected and processed data drawn mainly from local media, national/regional newspapers, encyclopedias and blogs (see [Latxa Corpus v1.1](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1)). Still, the model is based on Qwen3-VL and can potentially carry the same biases, risks and limitations.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

## Training Details

### Training Data

For training details, please refer to our paper: [Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque](https://aclanthology.org/2025.emnlp-main.1484/).

## Evaluation

We evaluated the models using 5-shot settings on multiple-choice and generative tasks.

| Task | Qwen3-VL 2B | 2B `mono_eu` | 2B `multi` | Qwen3-VL 4B | 4B `mono_eu` | 4B `multi` |
|------|:-----------:|:----------:|:-----------------:|:------------:|:----------:|:-----------------:|
| arc_eu_challenge_mc | 36.95 | 51.28 <span style="color:green">(+14.33)</span> | 55.20 <span style="color:green">(+18.25)</span> | 53.75 | 75.09 <span style="color:green">(+21.34)</span> | 75.34 <span style="color:green">(+21.59)</span> |
| arc_eu_easy_mc | 43.27 | 65.99 <span style="color:green">(+22.72)</span> | 69.95 <span style="color:green">(+26.68)</span> | 66.20 | 87.58 <span style="color:green">(+21.38)</span> | 87.58 <span style="color:green">(+21.38)</span> |
| belebele_eus_Latn | 46.00 | 65.44 <span style="color:green">(+19.44)</span> | 60.67 <span style="color:green">(+14.67)</span> | 69.67 | 80.67 <span style="color:green">(+11.00)</span> | 79.00 <span style="color:green">(+9.33)</span> |
| bertaqa_eu_global | 46.03 | 53.43 <span style="color:green">(+7.40)</span> | 56.81 <span style="color:green">(+10.78)</span> | 60.66 | 69.06 <span style="color:green">(+8.40)</span> | 69.65 <span style="color:green">(+8.99)</span> |
| bertaqa_eu_local | 37.27 | 42.51 <span style="color:green">(+5.24)</span> | 44.46 <span style="color:green">(+7.19)</span> | 40.27 | 53.43 <span style="color:green">(+13.16)</span> | 54.36 <span style="color:green">(+14.09)</span> |
| bl2mp | 49.11 | 87.94 <span style="color:green">(+38.83)</span> | 89.22 <span style="color:green">(+40.11)</span> | 55.89 | 90.17 <span style="color:green">(+34.28)</span> | 90.28 <span style="color:green">(+34.39)</span> |
| eus_exams_eu | 33.81 | 42.44 <span style="color:green">(+8.63)</span> | 42.81 <span style="color:green">(+9.00)</span> | 47.21 | 55.39 <span style="color:green">(+8.18)</span> | 56.40 <span style="color:green">(+9.19)</span> |
| eus_proficiency | 25.69 | 36.45 <span style="color:green">(+10.76)</span> | 36.58 <span style="color:green">(+10.89)</span> | 28.98 | 51.00 <span style="color:green">(+22.02)</span> | 51.77 <span style="color:green">(+22.79)</span> |
| eus_trivia | 35.04 | 40.41 <span style="color:green">(+5.37)</span> | 42.04 <span style="color:green">(+7.00)</span> | 44.49 | 56.27 <span style="color:green">(+11.78)</span> | 57.55 <span style="color:green">(+13.06)</span> |
| mgsm_native_cot_eu | 13.10 | 33.20 <span style="color:green">(+20.10)</span> | 34.00 <span style="color:green">(+20.90)</span> | 39.20 | 58.40 <span style="color:green">(+19.20)</span> | 62.40 <span style="color:green">(+23.20)</span> |
| mmlu_eu | 34.07 | 43.33 <span style="color:green">(+9.26)</span> | 45.93 <span style="color:green">(+11.86)</span> | 51.48 | 55.19 <span style="color:green">(+3.71)</span> | 57.41 <span style="color:green">(+5.93)</span> |
| piqa_eu_mc | 53.70 | 55.17 <span style="color:green">(+1.47)</span> | 54.08 <span style="color:green">(+0.38)</span> | 56.81 | 64.49 <span style="color:green">(+7.68)</span> | 68.68 <span style="color:green">(+11.87)</span> |
| siqa_eu_mc | 38.18 | 48.26 <span style="color:green">(+10.08)</span> | 50.31 <span style="color:green">(+12.13)</span> | 47.54 | 61.67 <span style="color:green">(+14.13)</span> | 62.59 <span style="color:green">(+15.05)</span> |
| xstorycloze_eu | 50.50 | 56.98 <span style="color:green">(+6.48)</span> | 57.05 <span style="color:green">(+6.55)</span> | 50.63 | 61.22 <span style="color:green">(+10.59)</span> | 61.81 <span style="color:green">(+11.18)</span> |
| **AVG EU** | **38.77** | **51.63 <span style="color:green">(+12.86)</span>** | **52.79 <span style="color:green">(+14.02)</span>** | **50.91** | **65.69 <span style="color:green">(+14.78)</span>** | **66.77 <span style="color:green">(+15.86)</span>** |

> [!WARNING]
> DISCLAIMER
>
> These models are still under development.
> Results are reported for Basque tasks only; results for the remaining languages will be released in the near future.

## Citation

```bibtex
@inproceedings{sainz-etal-2025-instructing,
    title = "Instructing Large Language Models for Low-Resource Languages: A Systematic Study for {B}asque",
    author = "Sainz, Oscar and
      Perez, Naiara and
      Etxaniz, Julen and
      Fernandez de Landa, Joseba and
      Aldabe, Itziar and
      Garc{\'i}a-Ferrero, Iker and
      Zabala, Aimar and
      Azurmendi, Ekhi and
      Rigau, German and
      Agirre, Eneko and
      Artetxe, Mikel and
      Soroa, Aitor",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1484/",
    doi = "10.18653/v1/2025.emnlp-main.1484",
    pages = "29124--29148",
    ISBN = "979-8-89176-332-6",
    abstract = "Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, with synthetic instructions yielding robust models, and, most importantly, that using as backbone an instruction-tuned model outperforms using a base non-instructed model. Scaling up to Llama 3.1 Instruct 70B as backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque instructions. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation."
}
```

## Acknowledgements

This work has been partially supported by the Basque Government (research group funding IT1570-22 and the IKER-GAITU project), the Spanish Ministry for Digital Transformation and of Civil Service, and the EU-funded NextGenerationEU Recovery, Transformation and Resilience Plan (ALIA project). The models were trained on the Leonardo supercomputer at CINECA under the EuroHPC Joint Undertaking, project EHPC-EXT-2024E01-042.