---
|
|
library_name: transformers |
|
|
tags: |
|
|
- MoroccanArabic |
|
|
- Darija |
|
|
- GemMaroc |
|
|
- DarijaLLM |
|
|
- conversational |
|
|
pipeline_tag: text-generation |
|
|
datasets: |
|
|
- GemMaroc/TULU-3-50k-darija-english |
|
|
language: |
|
|
- ar |
|
|
- ary |
|
|
- en |
|
|
base_model: |
|
|
- google/gemma-3-27b-it |
|
|
--- |
|
|
|
|
|
|
|
# GemMaroc‑27B |
|
|
|
|
|
Unlocking **Moroccan Darija** proficiency in a state‑of‑the‑art large language model, trained with a *minimal‑data, green‑AI* recipe that preserves Gemma‑27B’s strong reasoning abilities while adding fluent Darija generation. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model at a glance |
|
|
|
|
|
| | Details |
| ------------------- | ------- |
| **Model ID** | `AbderrahmanSkiredj1/GemMaroc-27b-it` |
| **Base model** | [`google/gemma-3-27b-it`](https://huggingface.co/google/gemma-3-27b-it) |
| **Architecture** | Decoder‑only Transformer (Gemma 3) |
| **Parameters** | 27 billion |
| **Context length** | 2 048 tokens |
| **Training regime** | Supervised fine‑tuning (LoRA → merged) on the 50 K high‑quality Darija/English instruction slice of TULU‑3 (`GemMaroc/TULU-3-50k-darija-english`) |
| **Compute budget** | 48 GPU·h (8 × H100‑80GB × 6 h) – ≈ 26 kWh / 10 kg CO₂e |
| **License** | Apache 2.0 |
|
|
|
|
|
--- |
|
|
|
|
|
## Why another Darija model? |
|
|
|
|
|
* **Inclusive AI.** More than 36 million speakers of Moroccan Arabic remain underserved by open LLMs.
* **Quality over quantity.** A carefully curated 50 K instruction set surfaces Darija competence without sacrificing cross‑lingual reasoning.
* **Green AI.** GemMaroc achieves Atlas‑Chat‑level Darija scores using less than 2 % of the energy.
|
|
|
|
|
--- |
|
|
|
|
|
## Benchmark summary |
|
|
|
|
|
| Model | Darija MMLU | Darija HellaSwag | GSM8K @5 | HellaSwag (EN) |
| ---------------- | ----------- | ---------------- | ---------- | -------------- |
| Atlas‑Chat‑27B | **61.9 %** | 48.4 % | 82.0 % | 77.8 % |
| **GemMaroc‑27B** | 61.6 % | **60.5 %** | **84.2 %** | **79.3 %** |
|
|
|
|
|
<sub>Zero‑shot accuracy except GSM8K (5‑shot); full table in the paper.</sub>
|
|
|
|
|
--- |
|
|
|
|
|
## Quick start |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "AbderrahmanSkiredj1/GemMaroc-27b-it"

# Load the tokenizer and model; device_map="auto" shards the 27 B weights across available GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# The model is already dispatched, so the pipeline needs no extra device argument.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0.7,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

messages = [
    # "What is the 'butterfly effect' theory? Explain it in Darija and give a simple example."
    {"role": "user", "content": "شنو هي نظرية ‘butterfly effect’؟ فسّرها بدارجة ونقّط مثال بسيط."}
]

# Build the Gemma chat prompt, then print only the newly generated reply.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(pipe(prompt)[0]["generated_text"][len(prompt):])
```
|
|
|
|
|
### Chat template (Gemma 3 format) |
|
|
|
|
|
The tokenizer provides a baked‑in Jinja template that starts with a **begin‑of‑sequence** token (`<bos>`), then alternates user/model turns, each wrapped by `<start_of_turn>` … `<end_of_turn>` markers. When you set `add_generation_prompt=True` it ends after the opening model tag so the model can continue: |
|
|
|
|
|
```
<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
```
|
|
|
|
|
The assistant will keep generating tokens until it decides to emit `<end_of_turn>`. |
|
|
|
|
|
```python
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
```
|
|
|
|
|
No manual token juggling required—the call above handles BOS, turn delimiters, and newline placement automatically. |
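
If you prefer calling `generate()` directly instead of the pipeline, the same template call can return tensors. A minimal sketch, reusing `model`, `tokenizer`, and `messages` from the quick‑start snippet:

```python
# Tokenize the chat turns directly, generate, then decode only the new tokens.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```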
|
|
|
|
|
--- |
|
|
|
|
|
## Quantised variants

Pre‑quantised checkpoints will be published under the same repo tags (`gemmaroc‑27b‑awq‑int4`, `gemmaroc‑27b‑gguf‑q4_k_m`).
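
Once those exports land, loading should mirror the bf16 flow. A minimal sketch, assuming the AWQ weights appear under a revision named after the tag above (the `revision` value is an assumption, and `autoawq` must be installed):

```python
# Hypothetical load of the pre-quantised AWQ checkpoint; the revision name is assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "AbderrahmanSkiredj1/GemMaroc-27b-it",
    revision="gemmaroc-27b-awq-int4",  # assumed tag for the AWQ int4 export
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("AbderrahmanSkiredj1/GemMaroc-27b-it")
```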
|
|
|
|
|
--- |
|
|
|
|
|
## Training recipe (one‑paragraph recap) |
|
|
|
|
|
1. **Data.** Translate a 44 K reasoning slice of the TULU‑3 50 K set into Darija, keeping 20 % in English for cross‑lingual robustness.
2. **LoRA SFT.** Rank 16, α = 32, 3 epochs, bf16, context 2 048.
3. **Merge & push.** Merge the LoRA adapter into the base weights (`peft.merge_and_unload`), convert to safetensors, and upload (sketched below).
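
A condensed sketch of steps 2 and 3 with `peft` and `trl`, as referenced above. The hyper‑parameters come from the recipe; the scaffolding (trainer choice, dataset split, output paths) is an assumption rather than the authors' exact script:

```python
# Assumed SFT + merge scaffolding; only the hyper-parameters are taken from the recipe.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("GemMaroc/TULU-3-50k-darija-english", split="train")

trainer = SFTTrainer(
    model="google/gemma-3-27b-it",
    train_dataset=dataset,
    args=SFTConfig(output_dir="gemmaroc-27b-sft", num_train_epochs=3, bf16=True),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # rank 16, α = 32
)
trainer.train()

# Step 3: fold the LoRA deltas into the dense weights and save as safetensors.
merged = trainer.model.merge_and_unload()
merged.save_pretrained("GemMaroc-27b-it", safe_serialization=True)
```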
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations & ethical considerations |
|
|
|
|
|
* Sentiment and abstractive summarisation still trail state‑of‑the‑art. |
|
|
* Tokeniser is unchanged; rare Darija spellings may fragment into multiple sub‑word pieces (see the quick check below).
|
|
* Model may inherit societal biases present in pre‑training data. |
|
|
* No RLHF / RLAIF safety alignment yet – apply a moderation layer in production. |
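
A quick way to see the tokeniser fragmentation mentioned above (the example word is illustrative; piece counts depend on the spelling):

```python
# Inspect how the unchanged Gemma tokenizer splits a Darija spelling.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("AbderrahmanSkiredj1/GemMaroc-27b-it")
print(tok.tokenize("واخّا"))  # rare spellings can split into many sub-word pieces
```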
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use GemMaroc in your work, please cite: |
|
|
|
|
|
```bibtex
@misc{skiredj2025gemmarocunlockingdarijaproficiency,
  title={GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data},
  author={Abderrahman Skiredj and Ferdaous Azhari and Houdaifa Atou and Nouamane Tazi and Ismail Berrada},
  year={2025},
  eprint={2505.17082},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.17082},
}
```