---
library_name: transformers
pipeline_tag: text-generation
language:
- eu
license: apache-2.0
base_model:
- Qwen/Qwen3-8B-Base
---

# HiTZ/eu_Qwen3-8B-Base

This is a **Basque (eu) language-specific base language model** trained by the HiTZ Research Center, starting from **Qwen3-8B-Base** and further pretrained on curated Basque data.

This model is released as a **base model**, intended for further fine-tuning or adaptation (e.g., instruction tuning, domain adaptation).

---

## Training Data

To train language-specific base LLMs, we followed the methodology proposed by [Etxaniz et al. (2024)](https://aclanthology.org/2024.acl-long.799/), originally developed for Basque, and extended it to other low-resource languages. To enable fair comparisons across languages, we limited the corpus size for each language to roughly the same number of tokens. We also included a small English subset to mitigate catastrophic forgetting.

### Corpus composition

| Language | Documents | Tokens (Qwen3) |
|----------|-----------|---------------:|
| Basque (eu) | 4.2M | ~3.5B |
| English (en) | 0.5M | ~0.3B |

Token counts vary slightly depending on the tokenizer, but the overall corpus size remains comparable.
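
As an illustrative check, token counts like the ones above can be recomputed with the Qwen3 tokenizer. The snippet below is a minimal sketch, assuming the corpus is available as an iterable of plain-text documents (the toy list passed to `count_tokens` is a placeholder, not the actual data).

```python
from transformers import AutoTokenizer

# Qwen3 tokenizer, i.e. the one used for the token counts reported above
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")

def count_tokens(corpus):
    """Count documents and tokens over an iterable of plain-text strings."""
    n_docs, n_tokens = 0, 0
    for text in corpus:
        n_docs += 1
        n_tokens += len(tokenizer(text, add_special_tokens=False)["input_ids"])
    return n_docs, n_tokens

# Toy placeholder corpus; in practice this would stream the full dataset
docs, tokens = count_tokens(["Kaixo, zer moduz?", "Euskara hizkuntza bat da."])
print(f"{docs} documents, {tokens} tokens")
```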

### Data sources

Basque data was obtained from the Latxa corpus, which consists primarily of large-scale web-crawled content, news articles, and encyclopedic text.  
The English subset was sampled from the FineWeb corpus.
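
For reference, a comparable mixture can be assembled with the `datasets` library. The snippet below is only a rough sketch, not the pipeline used for training: the dataset identifiers and the `text` column name are assumptions, and the interleaving probabilities simply mirror the approximate token split in the table above.

```python
from datasets import load_dataset, interleave_datasets

# Assumed dataset identifiers and column names; the actual training data may differ.
basque = load_dataset("HiTZ/latxa-corpus-v1.1", split="train", streaming=True)
english = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

# Mix Basque and English roughly in line with the ~3.5B / ~0.3B token split above
mixed = interleave_datasets([basque, english], probabilities=[0.92, 0.08], seed=42)

for example in mixed.take(3):
    print(example["text"][:80])
```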

---

## Model Training

- Sequence length: 8,192 tokens  
- Effective batch size: 256 sequences  
- Tokens per optimization step: ~2M  
- Learning rate schedule: cosine decay with 10% warm-up  
- Peak learning rate: 1e-5  

Training was conducted on the CINECA Leonardo high-performance computing cluster using Fully Sharded Data Parallel (FSDP) across 32 nodes, each equipped with 4 NVIDIA A100 GPUs (64 GB).
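
The hyperparameters above map onto a fairly standard `transformers` FSDP setup. The sketch below is an approximation for illustration only (the exact training stack is not specified in this card): argument names are standard `TrainingArguments` fields, mixed precision is assumed, and with 32 × 4 = 128 GPUs a per-device batch size of 2 reproduces the effective batch size of 256 sequences.

```python
from transformers import TrainingArguments

# Illustrative configuration mirroring the hyperparameters listed above.
# The 8K-token sequence length is enforced when packing the pretraining data,
# not through TrainingArguments.
training_args = TrainingArguments(
    output_dir="eu_qwen3-8b-base",   # placeholder path
    per_device_train_batch_size=2,   # 128 GPUs x 2 = 256-sequence effective batch
    gradient_accumulation_steps=1,
    learning_rate=1e-5,              # peak learning rate
    lr_scheduler_type="cosine",      # cosine decay ...
    warmup_ratio=0.1,                # ... with 10% warm-up
    bf16=True,                       # assumed mixed-precision setting
    fsdp="full_shard auto_wrap",     # Fully Sharded Data Parallel
    logging_steps=10,
    save_steps=500,
)
```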

---

## Getting Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiTZ/eu_Qwen3-8B-Base"

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Encode a Basque prompt ("Hello!") and generate a continuation
inputs = tokenizer("Kaixo!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Acknowledgements

This work has been partially supported by the Basque Government (Research group funding IT1570-22 and IKER-GAITU project), the Spanish Ministry for Digital Transformation and of Civil Service, and the EU-funded NextGenerationEU Recovery, Transformation and Resilience Plan (ILENIA project, 2022/TL22/00215335; and ALIA project).