Model Card for Llama-3.1-8B-Instruct-Multimodal (Basque)

⚠️ DEPRECATION NOTICE: This model is deprecated. Please use the updated models available in the HiTZ/latxa-vl collection.

This model is an open Multimodal Large Language Model (MLLM) developed specifically for the Basque language alongside English. It adapts the English-centric Llama-3.1-8B-Instruct backbone to accept both image and text inputs, demonstrating that an English-centric LLM can achieve strong multimodal performance in a low-resource language when trained with the right data mixture.

Model Details

  • Developed by: HiTZ Basque Center for Language Technology - Ixa NLP Group, University of the Basque Country UPV/EHU
  • Model type: Multimodal Large Language Model (Late-fusion architecture)
  • Language(s) (NLP): Basque (eu), English (en)
  • Backbone LLM: Llama-3.1-8B-Instruct
  • Vision Encoder: CLIP (clip-vit-large-patch14-336)
  • Vision-Language Connector: Single-layer fully connected linear layer
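The late-fusion design above can be sketched as follows. This is an illustrative sketch only, not the released implementation: class names and the forward logic are assumptions, while the dimensions follow the components named above (CLIP ViT-L/14-336 produces 1024-dim patch features over a 24x24 grid, and Llama-3.1-8B uses a 4096-dim hidden size).

```python
import torch
import torch.nn as nn

class LateFusionSketch(nn.Module):
    """Hypothetical sketch: CLIP patch features -> linear connector -> LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Single-layer fully connected connector, as described in the card.
        self.connector = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features, text_embeds):
        # patch_features: (batch, num_patches, vision_dim) from the frozen CLIP encoder
        # text_embeds:    (batch, seq_len, llm_dim) from the LLM's embedding table
        visual_tokens = self.connector(patch_features)
        # Visual tokens are concatenated with the text sequence and fed to the LLM.
        return torch.cat([visual_tokens, text_embeds], dim=1)

model = LateFusionSketch()
patches = torch.randn(1, 576, 1024)  # 24x24 patches for a 336px image, patch size 14
text = torch.randn(1, 8, 4096)
fused = model(patches, text)
print(fused.shape)  # torch.Size([1, 584, 4096])
```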

Uses

Direct Use

This model is intended for general-purpose multimodal understanding and generation tasks in Basque and English. Typical use cases include:

  • Image captioning
  • Visual Question Answering (VQA)
  • Open-ended text generation from visual inputs

Out-of-Scope Use

  • The model is not optimized for specialized multimodal skills such as Optical Character Recognition (OCR) or complex table/chart understanding.

Training Details

The model was developed using a two-stage training procedure specifically adapted for low-resource language constraints.

Stage 1: Vision-Language Alignment

  • Goal: Align the visual representations generated by the CLIP encoder with the embedding space of the Llama backbone.
  • Dataset: A mix of the original and translated Conceptual Captions dataset (CC3M and $CC3M_{Eus}$).
  • Data Mixture: 80% Basque and 20% English samples.
  • Trainable Parameters: Only the linear connector was trained; the vision encoder and LLM remained frozen.
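The Stage 1 freezing scheme can be illustrated with a minimal sketch; the stand-in modules below are assumptions (the real model uses the CLIP ViT-L/14-336 encoder and the Llama-3.1-8B-Instruct backbone), but the trainability pattern matches the description above.

```python
import torch.nn as nn

# Stand-in modules; in the real model these are CLIP ViT-L/14-336 and Llama-3.1-8B.
vision_encoder = nn.Linear(1024, 1024)
connector = nn.Linear(1024, 4096)
llm = nn.Linear(4096, 4096)

# Stage 1: freeze the vision encoder and the LLM; only the connector is trained.
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False

trainable = [p for m in (vision_encoder, connector, llm)
             for p in m.parameters() if p.requires_grad]
# Only the connector's weight and bias remain trainable.
print(len(trainable))  # 2
```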

Stage 2: Multimodal Instruction Tuning

  • Goal: Fine-tune both the connector and the backbone LLM to follow complex multimodal instructions.
  • Dataset Composition: The model was trained on a final mixture of 173k samples, consisting of 83% multimodal data and 17% text-only data.
    • Multimodal Data: Derived from the Pixmo-AMA dataset (translated to $Pixmo-AMA_{Eus}$). The mixture used was 80% Basque and 20% English multimodal instructions.
    • Text-only Data: Augmented with 29k text-only instructions (80% Basque and 20% English) sampled from the Magpie-Llama-3.1-8B-Instruct-Filtered-1M dataset. Incorporating this text-only instruction dataset helps counteract the decline in text-only tasks often caused by multimodal training, and actually improves multimodal performance in Basque.
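The Stage 2 composition above implies roughly the following sample counts (rounded from the stated percentages; the exact figures are in the paper):

```python
# Approximate Stage 2 sample counts implied by the stated percentages.
total = 173_000

multimodal = round(total * 0.83)  # ~143,590 multimodal samples
text_only = round(total * 0.17)   # ~29,410, matching the stated ~29k

# Both subsets use an 80% Basque / 20% English split.
mm_eu, mm_en = round(multimodal * 0.8), round(multimodal * 0.2)
txt_eu, txt_en = round(text_only * 0.8), round(text_only * 0.2)

print(multimodal, text_only)  # 143590 29410
print(mm_eu, mm_en)           # 114872 28718
```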

Limitations and Bias

  • Cultural Knowledge: Because the multimodal training and evaluation datasets were created by translating English-centric datasets via machine translation, the model does not inherently include Basque multimodal cultural knowledge.
  • Bias and Safety: Like other MLLMs, this model may amplify existing human biases; because the adaptation targets a low-resource language, such biases may also extend to local contexts and behaviors.

Citation

If you use this model, please cite the following paper:

@article{arana2025multimodal,
  title={Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque},
  author={Arana, Lukas and Etxaniz, Julen and Salaberria, Ander and Azkune, Gorka},
  journal={arXiv preprint arXiv:2511.09396},
  year={2025}
}