---
license: apache-2.0
base_model:
- Qwen/Qwen3-14B-Base
---
# HiTZ/cat_Qwen3-14B-Base

This is a **Catalan (ca) language-specific base language model** trained by the HiTZ Research Center, starting from **Qwen3-14B-Base** and further pretrained on curated Catalan data.

This model is released as a **base model**, intended for further fine-tuning or adaptation (e.g., instruction tuning, domain adaptation).

---

## Training Data

To train language-specific base LLMs, we followed the methodology proposed by [Etxaniz et al. (2024)](https://aclanthology.org/2024.acl-long.799/), originally developed for Basque, and extended it to other low-resource languages. To enable fair comparisons across languages, we limited the corpus size for each language to roughly the same number of tokens. We also included a small English subset to mitigate catastrophic forgetting.

### Corpus composition

| Language | Documents | Tokens (Qwen3 tokenizer) |
|----------|-----------|-------------------------:|
| Catalan (ca) | 3.8M | ~3.8B |
| English (en) | 0.5M | ~0.3B |

Token counts vary slightly depending on the tokenizer, but remain comparable in overall size.
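
For reference, token counts like those above can be computed with the Qwen3 tokenizer. A minimal sketch (the in-memory sample is a hypothetical stand-in for the actual corpus files):

```python
from transformers import AutoTokenizer

# The tokenizer is shared across the Qwen3 family, so the base checkpoint suffices.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B-Base")

def count_tokens(documents):
    """Sum token counts over an iterable of raw-text documents."""
    return sum(len(tokenizer(doc)["input_ids"]) for doc in documents)

# Hypothetical usage on a tiny in-memory sample:
sample = ["Barcelona és la capital de Catalunya.", "El corpus conté text en català."]
print(count_tokens(sample))
```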
### Data sources

Catalan data was extracted from the multilingual CulturaX corpus. Given the substantially larger size of CulturaX compared to the Basque and Galician resources, we applied targeted filtering using the Dolma toolkit with the Gopher and C4 heuristics to obtain a representative subset.
The English subset was sampled from the FineWeb corpus.
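
Dolma's filters are configured declaratively, so the snippet below is not the exact pipeline we ran; it is an illustrative Python re-implementation of two representative heuristics (a Gopher-style mean-word-length bound and a simplified C4-style terminal-punctuation rule), assuming documents arrive as plain strings:

```python
def passes_gopher_word_length(text: str, lo: float = 3.0, hi: float = 10.0) -> bool:
    """Gopher-style heuristic: mean word length must fall within a sane range."""
    words = text.split()
    if not words:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    return lo <= mean_len <= hi

def passes_c4_terminal_punct(text: str) -> bool:
    """Simplified C4-style heuristic: every non-empty line ends in terminal punctuation."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and all(line.endswith((".", "!", "?", '"')) for line in lines)

docs = ["Una frase ben formada acaba amb punt.", "fragment sense puntuacio final"]
kept = [d for d in docs if passes_gopher_word_length(d) and passes_c4_terminal_punct(d)]
print(kept)  # only the first document survives both filters
```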
---

## Model Training
- Sequence length: 8,192 tokens
- Effective batch size: 256 sequences
- Tokens per optimization step: ~2M (see the arithmetic below)
- Learning rate schedule: cosine decay with 10% warm-up
- Peak learning rate: 1e-5
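
The tokens-per-step figure follows from the first two bullets: 256 sequences × 8,192 tokens ≈ 2.1M tokens per optimization step. A sketch of the schedule using the stock `transformers` helper (the model, optimizer, and step counts are placeholders, not our training configuration):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

print(256 * 8192)  # 2,097,152 tokens per optimization step (~2M)

# Placeholder model/optimizer, just to show the schedule shape.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # peak LR from above
total_steps = 2_000
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.10 * total_steps),  # 10% warm-up
    num_training_steps=total_steps,            # then cosine decay to ~0
)
```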
Training was conducted on the CINECA Leonardo high-performance computing cluster using Fully Sharded Data Parallel (FSDP) across 32 nodes, each equipped with 4 NVIDIA A100 GPUs (64 GB).
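
A minimal sketch of FSDP wrapping for a model of this family, assuming a `torchrun` launch; the decoder-layer wrap policy and the `Qwen3DecoderLayer` class path are our assumptions here, not a verbatim excerpt of the training script:

```python
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.qwen3.modeling_qwen3 import Qwen3DecoderLayer

# torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for each worker.
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-14B-Base", torch_dtype=torch.bfloat16
)

# Shard parameters at the decoder-layer boundary, the usual granularity for LLMs.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={Qwen3DecoderLayer}
)
model = FSDP(model, auto_wrap_policy=wrap_policy, device_id=torch.cuda.current_device())
```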
---

## Getting Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiTZ/cat_Qwen3-14B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hola!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
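
A 14B-parameter model needs roughly 28 GB of memory for bfloat16 weights alone, so on real hardware you will likely want something like the following (assumes the `accelerate` package is installed for `device_map="auto"`):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "HiTZ/cat_Qwen3-14B-Base",
    torch_dtype=torch.bfloat16,  # halves memory vs. fp32
    device_map="auto",           # spread layers across available devices
)
```

Since this is a base model, it will continue the prompt rather than follow instructions.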
## Acknowledgements

This work has been partially supported by the Basque Government (research group funding IT1570-22 and the IKER-GAITU project), the Spanish Ministry for Digital Transformation and the Civil Service, and the EU-funded NextGenerationEU Recovery, Transformation and Resilience Plan (ILENIA project, 2022/TL22/00215335; and the ALIA project).