---
license: llama3.1
base_model:
- meta-llama/Llama-3.1-8B
---

# HiTZ/es_Llama-3.1-8B

This is a **Spanish (es) language-specific base language model** trained by the HiTZ Research Center, starting from **Llama 3.1** and further pretrained on curated Spanish data.

This model is released as a **base model**, intended for further fine-tuning or adaptation (e.g., instruction tuning, domain adaptation).

---

## Training Data

To train language-specific base LLMs, we followed the methodology proposed by [Etxaniz et al. (2024)](https://aclanthology.org/2024.acl-long.799/), originally developed for Basque, and extended it to other low-resource languages. To enable fair comparisons across languages, we limited the corpus size for each language to roughly the same number of tokens. We also included a small English subset to mitigate catastrophic forgetting.

### Corpus composition

| Language | Documents | Tokens (Llama 3.1) |
|----------|-----------|-------------------:|
| Spanish (es) | 3.8M | ~3.4B |
| English (en) | 0.5M | ~0.3B |

Token counts vary slightly depending on the tokenizer, but remain comparable in overall size.
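
As a rough illustration, counts like those in the table can be reproduced with the Llama 3.1 tokenizer. This is a minimal sketch, not the actual counting script used for the statistics above:

```python
from transformers import AutoTokenizer

# Tokenizer matching the "Tokens (Llama 3.1)" column above
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

def count_tokens(texts):
    """Total number of tokens over an iterable of document strings."""
    return sum(len(tok(t)["input_ids"]) for t in texts)

print(count_tokens(["Hola, ¿qué tal?", "Esto es un documento de ejemplo."]))
```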

### Data sources

Spanish data was extracted from the multilingual CulturaX corpus. Given the substantially larger size of CulturaX compared to the Basque and Galician resources, we applied targeted filtering to obtain a more representative subset. Specifically, we retained only documents whose URLs indicate origin in Spain (i.e., containing the top-level domains `.es`, `.eus`, `.cat`, or `.gal`). In addition, the Spanish data was filtered using the Dolma toolkit with the Gopher and C4 heuristics. The English subset was sampled from the FineWeb corpus.
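
The URL-based filter can be sketched as below; the document field names and helper function are illustrative assumptions, not the actual pipeline code:

```python
from urllib.parse import urlparse

# Spain-associated top-level domains used for the URL filter described above
SPAIN_TLDS = (".es", ".eus", ".cat", ".gal")

def from_spain(doc: dict) -> bool:
    """Keep a document only if its URL's host falls under a Spain-associated TLD."""
    host = urlparse(doc["url"]).netloc.lower()
    return host.endswith(SPAIN_TLDS)

docs = [{"url": "https://ejemplo.es/noticia"}, {"url": "https://example.com/post"}]
kept = [d for d in docs if from_spain(d)]  # keeps only the .es document
```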

---

## Model Training

- Sequence length: 8,192 tokens
- Effective batch size: 256 sequences
- Tokens per optimization step: ~2M
- Learning rate schedule: cosine decay with 10% warm-up
- Peak learning rate: 1e-5
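
These settings pin down the token budget: 256 sequences × 8,192 tokens ≈ 2.1M tokens per optimization step. The schedule can be sketched with Hugging Face's `get_cosine_schedule_with_warmup`; the optimizer choice, stand-in module, and total step count below are illustrative assumptions, not values reported for this run:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in module for illustration

# Illustrative total: ~3.7B training tokens at ~2M tokens per step
total_steps = 1_800
warmup_steps = total_steps // 10  # 10% warm-up, as described above

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # peak LR from the card
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
```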

Training was conducted on the CINECA Leonardo high-performance computing cluster using Fully Sharded Data Parallel (FSDP) across 32 nodes, each equipped with 4 NVIDIA A100 GPUs (64 GB).
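
For orientation, a minimal FSDP setup in PyTorch looks like the sketch below. It shows only the wrapping pattern; the launcher flags, sharding policy, and checkpointing details of the actual run are not part of this card:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Typically launched with torchrun/SLURM, one process per GPU
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16
)
# Shard parameters, gradients, and optimizer state across all ranks
model = FSDP(model, device_id=torch.cuda.current_device())
```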

---

## Getting Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiTZ/es_Llama-3.1-8B"

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Encode a Spanish prompt and generate a continuation
inputs = tokenizer("¡Hola!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
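
Note that this is a base model with no chat template: prompts are treated as raw text to be continued, so instruction-style usage will generally require fine-tuning first.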

## Acknowledgements

This work has been partially supported by the Basque Government (research group funding IT1570-22 and the IKER-GAITU project), the Spanish Ministry for Digital Transformation and the Civil Service, and the EU-funded NextGenerationEU Recovery, Transformation and Resilience Plan (ILENIA project, 2022/TL22/00215335; and the ALIA project).