---
language: es
library_name: transformers
license: apache-2.0
tags:
  - roberta
  - spanish
  - scientific
  - fill-mask
---

# Sci-BETO-base

**Sci-BETO** is a domain-specific RoBERTa encoder pretrained entirely on **Spanish scientific texts**.  

---

## Model Description

Sci-BETO-base is a transformer-based encoder following the **RoBERTa** architecture (125M parameters).  
It was pretrained from scratch using byte-level BPE tokenization on a large corpus of **Spanish open-access scientific publications**, including theses, dissertations, and peer-reviewed papers from Colombian universities and international repositories.

The model was designed to capture **scientific discourse**, terminology, and abstract reasoning patterns typical of research documents in economics, engineering, medicine, and the social sciences.

| Property | Value |
|-----------|--------|
| Architecture | RoBERTa-base |
| Parameters | ~125M |
| Vocabulary size | 50,262 |
| Tokenizer | Byte-Level BPE (trained from scratch) |
| Pretraining objective | Masked Language Modeling (MLM) |
| Pretraining steps | 85K |
| Max sequence length | 512 tokens |
| Framework | Transformers |
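
To make the masked language modeling objective in the table concrete, here is a minimal sketch of dynamic token masking. It assumes the standard BERT/RoBERTa recipe (select 15% of the tokens; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged). This card does not state the masking rate actually used for Sci-BETO, and the names `mask_tokens` and `MASK_ID` as well as the toy token ids are hypothetical.

```python
import random

MASK_ID = 4          # hypothetical id of the mask token
VOCAB_SIZE = 50_262  # vocabulary size from the table above
IGNORE = -100        # default ignore_index of PyTorch's cross-entropy loss


def mask_tokens(token_ids, rng, mask_prob=0.15):
    """Return (corrupted_ids, labels) for one MLM training example.

    Assumes the standard 80/10/10 recipe; not confirmed for Sci-BETO.
    """
    corrupted, labels = [], []
    for tok in token_ids:
        if rng.random() >= mask_prob:
            # Not selected: keep the token and exclude it from the loss.
            corrupted.append(tok)
            labels.append(IGNORE)
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_ID)                    # 80%: mask token
        elif roll < 0.9:
            corrupted.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
        else:
            corrupted.append(tok)                        # 10%: unchanged
    return corrupted, labels


rng = random.Random(0)
ids = list(range(100, 1100))  # toy sequence of 1,000 token ids
corrupted, labels = mask_tokens(ids, rng)
selected = sum(label != IGNORE for label in labels)
print(f"{selected} of {len(ids)} positions selected for prediction")
```

Because the selection is redrawn every time an example is processed, the model sees different masked positions for the same sentence across epochs.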

---

## Pretraining Data

The pretraining corpus includes over **11 billion tokens** from Spanish academic and scientific sources:

- Open-access repositories of Colombian universities (Universidad de los Andes, Universidad Nacional, Universidad Javeriana, Universidad del Rosario).
- CORE API and institutional repositories (theses, dissertations, working papers).
- Tax statutes in Colombia.

The final dataset covers multiple disciplines (economics, medicine, engineering, humanities), ensuring representation across scientific domains.

| **Source**                     | **# Documents** | **# Words (deduplicated)** | **Percentage (%)** |
|--------------------------------|----------------:|----------------------------:|-------------------:|
| Universidad de los Andes        | 33,858          | 365,752,780                 | 3.23               |
| Universidad Nacional            | 44,686          | 537,022,975                 | 4.75               |
| CORE API                        | 2,181,689       | 9,624,189,002               | 85.10              |
| Universidad del Rosario         | 22,404          | 183,356,109                 | 1.62               |
| Universidad Javeriana           | 25,624          | 323,918,445                 | 2.86               |
| Tax Statutes in Colombia        | 392             | 13,924,060                  | 0.12               |
| Extra                           | 2               | 261,131,453                 | 2.31               |
| **Total**                       | **2,308,655**   | **11,309,295,824**          | **100.00**         |
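
The percentage column can be reproduced from the word counts. A minimal sketch, using the per-source counts and the stated total from the table above; the variable names are illustrative.

```python
# Deduplicated word counts per source, copied from the table above.
word_counts = {
    "Universidad de los Andes": 365_752_780,
    "Universidad Nacional": 537_022_975,
    "CORE API": 9_624_189_002,
    "Universidad del Rosario": 183_356_109,
    "Universidad Javeriana": 323_918_445,
    "Tax Statutes in Colombia": 13_924_060,
    "Extra": 261_131_453,
}
stated_total = 11_309_295_824  # "Total" row of the table

# Share of each source as a percentage of the stated total, two decimals.
shares = {src: round(100 * n / stated_total, 2) for src, n in word_counts.items()}

# Print the sources from largest to smallest share.
for src, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{src:<26} {pct:6.2f}%")
```

CORE API accounts for roughly 85% of the corpus, while the four university repositories together contribute about 12.5%.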

---

## Benchmarks

Sci-BETO was fine-tuned and benchmarked across multiple downstream tasks, both general-domain and scientific:

| **Dataset**        | **Metric**      | **Sci-BETO Large** | **Sci-BETO Base** | **BETO** | **BERTIN** |
|---------------------|----------------|-------------------:|------------------:|----------:|------------:|
| **WikiCAT**         | F1 (macro)     | **0.7738**         | 0.7583            | 0.7624    | 0.7598      |
| **PAWS-X (es)**     | F1 (macro)     | **0.9148**         | 0.8794            | 0.8985    | 0.8961      |
| **PharmaCoNER**     | F1 (micro)     | **0.8959**         | 0.8733            | 0.8845    | 0.8802      |
| **CANTEMIST**       | F1 (micro)     | 0.8809             | 0.8784            | 0.8954    | **0.8956**  |
| **NLI (ESNLI-R)**   | F1 (micro)     | —                  | —                 | —         | —           |
| **BanRep (JEL)**    | Exact Match    | **0.6116**         | 0.6043            | 0.5933    | 0.5807      |
| **Rosario**         | F1 (macro)     | **0.9203**         | 0.9194            | 0.9079    | 0.9121      |
| **Econ-IE**         | F1 (micro)     | **0.5256**         | 0.5158            | 0.5199    | 0.4992      |


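**Sci-BETO Large** obtains the best score on six of the seven tasks with reported results; the exception is CANTEMIST, where both general-domain Spanish models (BETO and BERTIN) score higher. **Sci-BETO Base** outperforms BETO and BERTIN on BanRep (JEL) and Rosario, falls between them on Econ-IE, and trails both by less than 0.02 on WikiCAT, PAWS-X (es), PharmaCoNER, and CANTEMIST. No scores are reported for NLI (ESNLI-R).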
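
As a rough check of the summary above, the scores in the table can be averaged per model. This is only a sketch: it takes an unweighted mean over the seven tasks with reported scores (the NLI row is skipped because it has no values), and it mixes macro F1, micro F1, and exact match, so the result is indicative rather than a proper aggregate metric.

```python
from statistics import mean

# Scores copied from the table above, in the order:
# WikiCAT, PAWS-X (es), PharmaCoNER, CANTEMIST, BanRep (JEL), Rosario, Econ-IE.
scores = {
    "Sci-BETO Large": [0.7738, 0.9148, 0.8959, 0.8809, 0.6116, 0.9203, 0.5256],
    "Sci-BETO Base":  [0.7583, 0.8794, 0.8733, 0.8784, 0.6043, 0.9194, 0.5158],
    "BETO":           [0.7624, 0.8985, 0.8845, 0.8954, 0.5933, 0.9079, 0.5199],
    "BERTIN":         [0.7598, 0.8961, 0.8802, 0.8956, 0.5807, 0.9121, 0.4992],
}

# Unweighted mean per model over the seven reported tasks.
averages = {model: mean(vals) for model, vals in scores.items()}

# Print the models from highest to lowest average.
for model, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<15} {avg:.4f}")
```

Under this simple average, Sci-BETO Large ranks first (about 0.789), followed by BETO (0.780), Sci-BETO Base (0.776), and BERTIN (0.775).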

---

## Intended Use

- Research and experimentation in **Spanish scientific NLP**.
- Downstream fine-tuning for:
  - Text classification (scientific or academic domains),
  - Named Entity Recognition (NER),
  - Semantic similarity and paraphrase detection,
  - Knowledge extraction from academic documents.
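
Academic documents are usually much longer than the model's 512-token maximum sequence length, so fine-tuning or extraction pipelines typically split them into overlapping windows first. A minimal sketch of that step, operating on a plain list of token ids; the function name `chunk_token_ids`, the 64-token overlap, and the reservation of two positions for the `<s>` and `</s>` special tokens (the usual RoBERTa convention) are assumptions, not details from this card.

```python
def chunk_token_ids(token_ids, max_len=512, num_special=2, overlap=64):
    """Split a list of token ids into overlapping windows that fit the model.

    max_len:     maximum sequence length accepted by the model
    num_special: positions reserved for special tokens (assumed: <s> and </s>)
    overlap:     tokens shared between consecutive windows (illustrative value)
    """
    window = max_len - num_special
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window size")
    stride = window - overlap
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the document
        start += stride
    return chunks


doc_ids = list(range(1200))  # stand-in for a tokenized document
chunks = chunk_token_ids(doc_ids)
print([len(c) for c in chunks])  # → [510, 510, 308]
```

When working with the Hugging Face tokenizer directly, similar windows can be obtained by calling it with `truncation=True`, `max_length=512`, `stride=64`, and `return_overflowing_tokens=True`.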

---

## Limitations

- The model may underperform on highly informal or non-academic Spanish (e.g., social media).  
- It is not designed for generative tasks (e.g., text completion, chat).  
- It is biased toward the academic register and Latin American varieties of Spanish.
- The pretraining corpus contains no English or bilingual data.

---

## Example Usage

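```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Flaglab/Sci-BETO-base")
model = AutoModelForMaskedLM.from_pretrained("Flaglab/Sci-BETO-base")

# Use the tokenizer's own mask token; a literal "[mask]" would be split into
# ordinary subword tokens and no masked position would be found.
text = f"El Banco de la República va a subir las {tokenizer.mask_token} de interés."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and take the highest-scoring vocabulary entry.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token = tokenizer.decode(logits[0, masked_index].argmax(dim=-1))
print("Predicted token:", predicted_token)
```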