---
library_name: transformers
license: mit
language:
- gl
- pt
- es
- en
- ca
base_model:
- BSC-LT/salamandra-7b-instruct
pipeline_tag: text-generation
tags:
- Salamandra
- Instruction-tuning
- Multilingual
---

# Carvalho-Salamandra-Instruct

> [!WARNING]
> **WARNING:** This is a preliminary version of Carvalho-Salamandra-Instruct.

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Carvalho-Salamandra-Instruct](#carvalho-salamandra-instruct)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
  - [How to use](#how-to-use)
  - [Training](#training)
    - [Training data](#training-data)
    - [Training hyperparameters](#training-hyperparameters)
    - [Framework](#framework)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
    - [Funding](#funding)
    - [Cite this model](#cite-this-model)

</details>
## Model description

**Carvalho-Salamandra-Instruct** is a 7B-parameter instruction-tuned transformer model covering Galician, Portuguese, Spanish, English and Catalan. It is based on [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and was further adapted through a 1-epoch training run on high-quality multilingual corpora, with a marked emphasis on Galician and Portuguese.

This model aims to provide strong instruction-following and generation capabilities for underrepresented languages while maintaining robust multilingual behavior.

## Intended uses and limitations

**Intended uses**

- Instruction following and dialogue-style generation.
- Multilingual text generation and content creation.
- Downstream fine-tuning for tasks such as summarization, classification, or question answering (with appropriate supervised data).

**Limitations**

- Not intended as a sole source for high-stakes or safety-critical decisions.
- May produce incorrect or biased factual information — verify outputs when accuracy matters.
- Performance may vary by language and domain; the best results are expected in Galician and Portuguese, given the training emphasis.

## How to use

```python
from datetime import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "proxectonos/Carvalho-Salamandra-Instruct"

text = "Qué sabes sobre o Proxecto Nós?"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime('%Y-%m-%d')

prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string
)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# Decode only the newly generated tokens and cut the response at the end-of-turn token
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```

## Training

### Training data

The model was trained with a mix of instruction data and high-quality monolingual corpora, designed to maximize performance in Galician and Portuguese while preserving broad multilingual capabilities.

| **Dataset type**     | **Languages**      | **Tokens per language / source**                                                      |
|----------------------|--------------------|----------------------------------------------------------------------------------------|
| Full instruction set | GL, ES, PT, CA, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets)    |
| High-quality corpus  | GL, PT             | 250M                                                                                     |
| Small HQ corpus      | EN, ES, CA         | 30M                                                                                      |

### Training hyperparameters

- **epochs:** 1
- **dtype:** bf16
- **block size:** 2048
- **total batch size:** 128
- **learning rate:** 2e-6
- **scheduler:** linear
- **optimizations:**
  - gradient checkpointing: True
  - flash attention: True
  - Liger kernels: True
  - DeepSpeed stage: 2

An illustrative configuration sketch based on these values is shown further below.

### Framework

Training was performed at the **Galician Supercomputing Center (CESGA)** using **2 nodes** (each with **2× NVIDIA A100 40GB**), for a total of **4 GPUs**, over **2 days**.

## Evaluation

Formal evaluation is ongoing. Preliminary internal tests show strong instruction-following ability and improved generation quality for Galician and Portuguese compared to the base model. Detailed benchmarks and quantitative results will be added when available.
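For reference, the sketch below mirrors the hyperparameters listed under "Training hyperparameters" using TRL's `SFTTrainer`. It is illustrative only: the actual training framework, dataset files, per-device/accumulation split of the total batch size, and DeepSpeed configuration file are assumptions, not the project's real training scripts.

```python
# Illustrative sketch only: reproduces the reported hyperparameters with TRL's SFTTrainer.
# Dataset files, batch-size split, and the DeepSpeed config path are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "BSC-LT/salamandra-7b-instruct",          # base model listed in this card
    torch_dtype="bfloat16",                   # dtype: bf16
    attn_implementation="flash_attention_2",  # flash attention: True
)

# Placeholder: a chat-formatted instruction dataset (one "messages" list per example)
dataset = load_dataset("json", data_files="instructions.jsonl", split="train")

args = SFTConfig(
    output_dir="carvalho-salamandra-instruct",
    num_train_epochs=1,                # epochs: 1
    bf16=True,                         # dtype: bf16
    max_seq_length=2048,               # block size: 2048 (renamed to max_length in newer TRL)
    per_device_train_batch_size=8,     # 8 per GPU x 4 GPUs x 4 accumulation steps = 128 total
    gradient_accumulation_steps=4,
    learning_rate=2e-6,                # learning rate: 2e-6
    lr_scheduler_type="linear",        # scheduler: linear
    gradient_checkpointing=True,       # gradient checkpointing: True
    use_liger_kernel=True,             # Liger kernels: True
    deepspeed="ds_zero2_config.json",  # DeepSpeed ZeRO stage 2 (placeholder config file)
)

trainer = SFTTrainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```

Launching such a run across 2 nodes with 2 GPUs each would typically be done with a distributed launcher such as `accelerate launch` or `torchrun`; the exact launch setup used by the project is not specified in this card.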
## Additional information

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU – NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.

### Cite this model

Please cite this model as:

```
@misc{carvalho_salamandra_instruct_2025,
  title = {Carvalho-Salamandra-Instruct: A Multilingual Instruction-Tuned Model for Underrepresented Languages},
  author = {Proxecto Nós Team},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/proxectonos/Carvalho-Salamandra-Instruct}},
}
```