---
library_name: transformers
license: mit
language:
- gl
- es
base_model:
- BSC-LT/salamandra-7b-instruct
datasets:
- proxectonos/corpus_dominio_cientifico
---

# Carballo-Science

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Carballo-Science](#carballo-science)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
  - [How to use](#how-to-use)
  - [Training](#training)
    - [Training data](#training-data)
    - [Training hyperparameters](#training-hyperparameters)
    - [Framework](#framework)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
    - [Funding](#funding)
    - [Cite this model](#cite-this-model)

</details>

## Model description

**Carballo-Science** is a specialized 7B-parameter instruction-tuned model designed for **scientific text understanding and generation** in **Galician (GL)** and **Spanish (ES)**.

It is based on the foundation model [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and has been further trained on scientific corpora drawn from Wikipedia and PhD theses, mixed with general instruction data (see [Training data](#training-data)).


## Intended uses and limitations

**Intended uses**
- Science-oriented text generation (summaries, rephrasing, explanations).
- Chat-style scientific assistance (non-professional).

**Limitations**
- May produce incomplete or incorrect scientific statements.
- Not suitable for high-stakes scientific decision-making.
- Works best for GL and ES; other languages are not reinforced in this checkpoint.

## How to use

```python
from datetime import datetime

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "proxectonos/Carballo-Science"

text = "Qué sabes sobre o Proxecto Nós?"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build a single-turn chat; date_string is passed through to the chat template.
message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime("%Y-%m-%d")

prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string,
)

# The prompt is already templated, so no special tokens are added on encoding.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# Keep only the newly generated tokens (drop the prompt prefix).
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
# Cut the response at the first reserved token.
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```
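
The last two steps of the snippet above (decoding, then cutting at the `<|reserved_token_1|>` marker) can be factored into a small helper. This is a minimal sketch, not part of `transformers`: the function name `truncate_at_stop` and its `stop_markers` parameter are hypothetical and introduced only for illustration.

```python
def truncate_at_stop(text, stop_markers=("<|reserved_token_1|>",)):
    """Return `text` cut at the earliest occurrence of any stop marker.

    Hypothetical helper: the default marker is the one used in the
    snippet above; pass other markers if your setup needs them.
    """
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    # Strip surrounding whitespace, as the original snippet does.
    return text[:cut].strip()
```

With this helper, the final lines of the example reduce to `print(truncate_at_stop(tokenizer.decode(generated_tokens, skip_special_tokens=False)))`.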

## Training

### Training data

The model was trained on a mixture of general instructions and domain-specific scientific texts.

| **Dataset Type** | **Languages** | **Sources** |
|------------------|---------------|-------------|
| Instruction set  | GL, ES, PT, CAT, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
| Scientific corpus | GL, ES | Wikipedia, PhD theses |

### Training hyperparameters

- **epochs:** 0.5  
- **dtype:** bf16  
- **block size:** 2048  
- **total batch size:** 128  
- **learning rate:** 2e-6  
- **scheduler:** Linear  
- **optimizations:**  
  - gradient checkpointing: True  
  - flash attention: True  
  - liger kernels: True  
  - DeepSpeed stage: 2  

### Framework
Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes** with **2× NVIDIA A100 40GB** each (**4 GPUs** in total) and took **2 days**.
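
To relate the total batch size to the hardware, the sketch below works out the gradient accumulation steps and the number of tokens per optimizer step. The total batch size (128), block size (2048), and GPU count (4) come from this card; the per-device micro-batch size of 4 is an assumption made purely for illustration, since the card does not state it, and the helper `grad_accum_steps` is hypothetical.

```python
def grad_accum_steps(total_batch_size, num_gpus, per_device_batch_size):
    """Gradient accumulation steps needed to reach the total batch size."""
    per_step = num_gpus * per_device_batch_size
    if total_batch_size % per_step != 0:
        raise ValueError("total batch size must be divisible by num_gpus * per_device_batch_size")
    return total_batch_size // per_step


TOTAL_BATCH_SIZE = 128      # from this card
BLOCK_SIZE = 2048           # from this card
NUM_GPUS = 4                # 2 nodes x 2 GPUs, from this card
PER_DEVICE_BATCH_SIZE = 4   # ASSUMPTION: not stated in the card

accum = grad_accum_steps(TOTAL_BATCH_SIZE, NUM_GPUS, PER_DEVICE_BATCH_SIZE)
# Upper bound on tokens per optimizer step, assuming every block is fully packed.
tokens_per_step = TOTAL_BATCH_SIZE * BLOCK_SIZE

print(f"gradient accumulation steps: {accum}")
print(f"tokens per optimizer step (at most): {tokens_per_step}")
```

Under these assumptions each optimizer step accumulates 8 micro-batches per GPU and covers at most 262,144 tokens; a different micro-batch size changes only the accumulation count, not the tokens per step.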

## Evaluation

Formal evaluation is in progress. Early, informal observations suggest improved handling of scientific terminology and phrasing in GL and ES.

## Additional information

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.

### Cite this model
Please cite the model as follows:

```
@misc{carballo_science_2025,
    title     = {Carballo-Science: A Science Domain Instruction-Tuned Model for Galician and Spanish},
    author    = {Proxecto Nós Team},
    year      = {2025},
    publisher = {HuggingFace},
    howpublished = {\url{https://huggingface.co/proxectonos/Carballo-Science}},
}
```