---
library_name: transformers
license: mit
language:
- gl
- pt
- es
- en
- ca
base_model:
- BSC-LT/salamandra-7b-instruct
pipeline_tag: text-generation
tags:
- Salamandra
- Instruction-tuning
- Multilingual
---
|
|
|
|
|
# Carvalho-Salamandra-Instruct |
|
|
|
|
|
> [!WARNING]
> This is a preliminary version of Carvalho-Salamandra-Instruct.
|
|
|
|
|
## Table of Contents |
|
|
<details> |
|
|
<summary>Click to expand</summary> |
|
|
|
|
|
- [Carvalho-Salamandra-Instruct](#carvalho-salamandra-instruct)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
  - [How to use](#how-to-use)
  - [Training](#training)
    - [Training data](#training-data)
    - [Training hyperparameters](#training-hyperparameters)
    - [Framework](#framework)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
    - [Funding](#funding)
    - [Cite this model](#cite-this-model)
|
|
|
|
|
</details> |
|
|
|
|
|
## Model description |
|
|
|
|
|
**Carvalho-Salamandra-Instruct** is a 7B-parameter instruction-tuned transformer model covering Galician, Portuguese, Spanish, English and Catalan. |
|
|
|
|
|
It is based on [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and was further adapted through a 1-epoch training run using high-quality multilingual corpora, with a marked emphasis on Galician and Portuguese. |
|
|
|
|
|
This model aims to provide strong instruction-following and generation capabilities for underrepresented languages while maintaining robust multilingual behavior. |
|
|
|
|
|
## Intended uses and limitations |
|
|
|
|
|
**Intended uses** |
|
|
- Instruction following and dialogue-style generation. |
|
|
- Multilingual text generation and content creation. |
|
|
- Downstream fine-tuning for tasks such as summarization, classification, or question answering (with appropriate supervised data). |
|
|
|
|
|
**Limitations** |
|
|
- Not intended as a sole source for high-stakes or safety-critical decisions. |
|
|
- May produce incorrect or biased information; verify outputs when accuracy matters.
|
|
- Performance may vary by language and domain; best results in Galician and Portuguese given training emphasis. |
|
|
|
|
|
## How to use |
|
|
|
|
|
```python
from datetime import datetime

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "proxectonos/Carvalho-Salamandra-Instruct"

text = "Que sabes sobre o Proxecto Nós?"  # "What do you know about the Nós Project?"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime('%Y-%m-%d')

# Build the prompt with the model's chat template, which expects the current date.
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string
)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# Decode only the newly generated tokens and cut the output at the end-of-turn token.
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```
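Building on the snippet above, sampling can be enabled for more varied output. The decoding values below are illustrative, not tuned for this model:

```python
# Continues from the example above; sampling parameters are illustrative.
outputs = model.generate(
    input_ids=inputs.to(model.device),
    max_new_tokens=200,
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.95,
)
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
print(response.split("<|reserved_token_1|>")[0].strip())
```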
|
|
|
|
|
## Training |
|
|
|
|
|
|
|
|
### Training data |
|
|
|
|
|
The model was trained with a mix of instruction data and high-quality monolingual corpora, designed to maximize performance in Galician and Portuguese while preserving broad multilingual capabilities. |
|
|
|
|
|
| **Dataset type** | **Languages** | **Tokens per language / Source** |
|----------------------|------------------------------|------------|
| Full instruction set | GL, ES, PT, CA, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
| High-quality corpus | GL, PT | 250M |
| Small high-quality corpus | EN, ES, CA | 30M |
|
|
|
|
|
### Training hyperparameters |
|
|
|
|
|
- **epochs:** 1 |
|
|
- **dtype:** bf16 |
|
|
- **block size:** 2048 |
|
|
- **total batch size:** 128 |
|
|
- **learning rate:** 2e-6 |
|
|
- **scheduler:** Linear |
|
|
- **optimizations:**
  - gradient checkpointing: True
  - flash attention: True
  - Liger kernels: True
  - DeepSpeed stage: 2
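
The exact training stack is not published here, but as a rough illustration these settings could be expressed with Hugging Face `TrainingArguments` as in the sketch below. The per-device batch size, accumulation steps, and DeepSpeed config path are assumptions (4 GPUs × 4 × 8 = 128 total batch size):

```python
# Illustrative sketch only: the actual training configuration is not published.
# The batch-size split and the DeepSpeed config path are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="carvalho-salamandra-instruct",
    num_train_epochs=1,               # epochs: 1
    bf16=True,                        # dtype: bf16
    per_device_train_batch_size=4,    # 4 GPUs x 4 x 8 accumulation = 128 total
    gradient_accumulation_steps=8,
    learning_rate=2e-6,
    lr_scheduler_type="linear",
    gradient_checkpointing=True,
    use_liger_kernel=True,            # requires the liger-kernel package
    deepspeed="ds_zero2.json",        # hypothetical DeepSpeed stage-2 config file
)
```

Flash attention would be enabled at model load time (e.g. `attn_implementation="flash_attention_2"` in `from_pretrained`), and the 2048-token block size applied when packing the training data.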
|
|
|
|
|
### Framework |
|
|
Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes**, each with **2× NVIDIA A100 40GB** (**4 GPUs** in total), over **2 days**.
|
|
|
|
|
## Evaluation |
|
|
|
|
|
Formal evaluation is ongoing. Preliminary internal tests show strong instruction-following ability and improved generation quality for Galician and Portuguese compared to the base model. Detailed benchmarks and quantitative results will be added when available. |
|
|
|
|
|
## Additional information |
|
|
|
|
|
### Funding
|
|
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the European Union (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.
|
|
|
|
|
### Cite this model |
|
|
Please cite this model as: |
|
|
|
|
|
```bibtex
|
|
@misc{carvalho_salamandra_instruct_2025, |
|
|
title = {Carvalho-Salamandra-Instruct: A Multilingual Instruction-Tuned Model for Underrepresented Languages}, |
|
|
author = {Proxecto Nós Team}, |
|
|
year = {2025}, |
|
|
publisher = {Hugging Face},
|
|
howpublished = {\url{https://huggingface.co/proxectonos/Carvalho-Salamandra-Instruct}}, |
|
|
} |
|
|
|
|
|
``` |
|
|
|