---
library_name: transformers
tags:
- legal
- instruction-tuning
- multilingual
license: mit
language:
- gl
- es
base_model:
- BSC-LT/salamandra-7b-instruct
pipeline_tag: text-generation
---

# Carballo-Legal

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Carballo-Legal](#carballo-legal)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
  - [How to use](#how-to-use)
  - [Training](#training)
    - [Training data](#training-data)
    - [Training hyperparameters](#training-hyperparameters)
    - [Framework](#framework)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
    - [Funding](#funding)
    - [Cite this model](#cite-this-model)

</details>
## Model description

**Carballo-Legal** is a specialized 7B-parameter, instruction-tuned model for **legal text understanding and generation** in **Galician (GL)** and **Spanish (ES)**. It is based on the foundation model [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and has been further trained on high-quality legal corpora extracted from official public institutions.

This model extends Salamandra's instruction-following abilities with the legal language, terminology, document structure, and reasoning patterns found in administrative and legislative texts.

## Intended uses and limitations

**Intended uses**

- Legal-oriented text generation (summaries, rephrasing, explanations).
- Chat-style legal assistance (non-professional).
- Downstream fine-tuning for specific legal domains or tasks.

**Limitations**

- Not a substitute for professional legal interpretation.
- May produce incomplete or incorrect legal statements.
- Not suitable for high-stakes or judicial decision-making.
- Works best for GL and ES; other languages are not reinforced in this checkpoint.

## How to use

```python
from datetime import datetime

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "proxectonos/Carballo-Legal"
text = "Qué sabes sobre o Proxecto Nós?"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [{"role": "user", "content": text}]
date_string = datetime.today().strftime("%Y-%m-%d")
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string,
)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
# Keep only the text before the end-of-turn marker.
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```

## Training

### Training data

The model was trained on a mixture of general instruction data and domain-specific legal texts.

| **Dataset type** | **Languages** | **Sources** |
|------------------|---------------|-------------|
| Instruction set | GL, ES, PT, CAT, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
| Legal corpus | GL, ES | DOGA, BOP Pontevedra, BOP A Coruña |

### Training hyperparameters

- **epochs:** 0.5
- **dtype:** bf16
- **block size:** 2048
- **total batch size:** 128
- **learning rate:** 2e-6
- **scheduler:** linear
- **optimizations:**
  - gradient checkpointing: True
  - flash attention: True
  - Liger kernels: True
  - DeepSpeed stage: 2

### Framework

Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes** with **2× NVIDIA A100 40 GB GPUs** each (**4 GPUs** in total), over **2 days**.

## Evaluation

Formal evaluation is in progress. Early observations show improved handling of legal terminology, structured documents, and administrative phrasing in GL and ES.
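The generation example in *How to use* trims the decoded output at the model's end-of-turn marker. That step can be factored into a small standalone helper; a minimal sketch, with the marker string taken from the example above (verify it against `tokenizer.special_tokens_map` for your checkpoint):

```python
def extract_response(decoded: str, end_marker: str = "<|reserved_token_1|>") -> str:
    """Return the text before the end-of-turn marker, stripped of whitespace.

    The default marker is the one used in the generation example above;
    check the tokenizer's special tokens if it differs for your checkpoint.
    """
    return decoded.split(end_marker)[0].strip()


# Works whether or not the marker is present in the decoded text.
print(extract_response("Ola, son Carballo-Legal. <|reserved_token_1|>"))
```

Splitting on the marker rather than decoding with `skip_special_tokens=True` keeps the rest of the special tokens intact, which can be useful when inspecting the raw model output.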
## Additional information

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU through NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.

### Cite this model

Please cite the model as follows:

```
@misc{carballo_legal_2025,
  title = {Carballo-Legal: A Legal Domain Instruction-Tuned Model for Galician and Spanish},
  author = {Proxecto Nós Team},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/proxectonos/Carballo-Legal}},
}
```