---
library_name: transformers
license: mit
language:
- gl
- pt
- es
- en
- ca
base_model:
- BSC-LT/salamandra-7b-instruct
pipeline_tag: text-generation
tags:
- Salamandra
- Instruction-tuning
- Multilingual
---
|
|
|
|
|
# Carvalho-Salamandra-Instruct |
|
|
|
|
|
> [!WARNING]
> This is a preliminary version of Carvalho-Salamandra-Instruct.
|
|
|
|
|
## Table of Contents |
|
|
<details> |
|
|
<summary>Click to expand</summary> |
|
|
|
|
|
- [Carvalho-Salamandra-Instruct](#carvalho-salamandra-instruct)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
  - [How to use](#how-to-use)
  - [Training](#training)
    - [Training data](#training-data)
    - [Training hyperparameters](#training-hyperparameters)
    - [Framework](#framework)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
    - [Funding](#funding)
    - [Cite this model](#cite-this-model)
|
|
|
|
|
</details> |
|
|
|
|
|
## Model description |
|
|
|
|
|
**Carvalho-Salamandra-Instruct** is a 7B-parameter instruction-tuned transformer model covering Galician, Portuguese, Spanish, English and Catalan. |
|
|
|
|
|
It is based on [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and was further adapted through a 1-epoch training run using high-quality multilingual corpora, with a marked emphasis on Galician and Portuguese. |
|
|
|
|
|
This model aims to provide strong instruction-following and generation capabilities for underrepresented languages while maintaining robust multilingual behavior. |
|
|
|
|
|
## Intended uses and limitations |
|
|
|
|
|
**Intended uses** |
|
|
- Instruction following and dialogue-style generation. |
|
|
- Multilingual text generation and content creation. |
|
|
- Downstream fine-tuning for tasks such as summarization, classification, or question answering (with appropriate supervised data). |
|
|
|
|
|
**Limitations** |
|
|
- Not intended as a sole source for high-stakes or safety-critical decisions. |
|
|
- May produce incorrect or biased information; verify outputs when accuracy matters.
|
|
- Performance may vary by language and domain; best results in Galician and Portuguese given training emphasis. |
|
|
|
|
|
## How to use |
|
|
|
|
|
```python
from datetime import datetime

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "proxectonos/Carvalho-Salamandra-Instruct"

text = "Que sabes sobre o Proxecto Nós?"  # "What do you know about the Nós Project?"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime('%Y-%m-%d')

# Build the prompt with the model's chat template, which expects the current date.
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string
)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# Decode only the newly generated tokens and cut the output at the end-of-turn token.
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```
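Building on the snippet above, sampling can be enabled for more varied output. The decoding values below are illustrative, not tuned for this model:

```python
# Continues from the example above; sampling parameters are illustrative.
outputs = model.generate(
    input_ids=inputs.to(model.device),
    max_new_tokens=200,
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.95,
)
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
print(response.split("<|reserved_token_1|>")[0].strip())
```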
|
|
|
|
|
## Training |
|
|
|
|
|
|
|
|
### Training data |
|
|
|
|
|
The model was trained with a mix of instruction data and high-quality monolingual corpora, designed to maximize performance in Galician and Portuguese while preserving broad multilingual capabilities. |
|
|
|
|
|
| **Dataset type** | **Languages** | **Tokens per language / Source** |
|----------------------|------------------------------|------------|
| Full instruction set | GL, ES, PT, CA, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
| High-quality corpus | GL, PT | 250M |
| Small high-quality corpus | EN, ES, CA | 30M |
|
|
|
|
|
### Training hyperparameters |
|
|
|
|
|
- **epochs:** 1 |
|
|
- **dtype:** bf16 |
|
|
- **block size:** 2048 |
|
|
- **total batch size:** 128 |
|
|
- **learning rate:** 2e-6 |
|
|
- **scheduler:** Linear |
|
|
- **optimizations:**
  - gradient checkpointing: True
  - flash attention: True
  - Liger kernels: True
  - DeepSpeed stage: 2
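
The exact training stack is not published here, but as a rough illustration these settings could be expressed with Hugging Face `TrainingArguments` as in the sketch below. The per-device batch size, accumulation steps, and DeepSpeed config path are assumptions (4 GPUs × 4 × 8 = 128 total batch size):

```python
# Illustrative sketch only: the actual training configuration is not published.
# The batch-size split and the DeepSpeed config path are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="carvalho-salamandra-instruct",
    num_train_epochs=1,               # epochs: 1
    bf16=True,                        # dtype: bf16
    per_device_train_batch_size=4,    # 4 GPUs x 4 x 8 accumulation = 128 total
    gradient_accumulation_steps=8,
    learning_rate=2e-6,
    lr_scheduler_type="linear",
    gradient_checkpointing=True,
    use_liger_kernel=True,            # requires the liger-kernel package
    deepspeed="ds_zero2.json",        # hypothetical DeepSpeed stage-2 config file
)
```

Flash attention would be enabled at model load time (e.g. `attn_implementation="flash_attention_2"` in `from_pretrained`), and the 2048-token block size applied when packing the training data.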
|
|
|
|
|
### Framework |
|
|
Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes**, each with **2× NVIDIA A100 40GB** (**4 GPUs** in total), over **2 days**.
|
|
|
|
|
## Evaluation |
|
|
|
|
|
Formal evaluation is ongoing. Preliminary internal tests show strong instruction-following ability and improved generation quality for Galician and Portuguese compared to the base model. Detailed benchmarks and quantitative results will be added when available. |
|
|
|
|
|
## Additional information |
|
|
|
|
|
### Funding
|
|
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the European Union (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.
|
|
|
|
|
### Cite this model |
|
|
Please cite this model as: |
|
|
|
|
|
```bibtex
|
|
@misc{carvalho_salamandra_instruct_2025, |
|
|
title = {Carvalho-Salamandra-Instruct: A Multilingual Instruction-Tuned Model for Underrepresented Languages}, |
|
|
author = {Proxecto Nós Team}, |
|
|
year = {2025}, |
|
|
publisher = {Hugging Face},
|
|
howpublished = {\url{https://huggingface.co/proxectonos/Carvalho-Salamandra-Instruct}}, |
|
|
} |
|
|
|
|
|
``` |
|
|
|