---
|
|
library_name: transformers |
|
|
tags: |
|
|
- MoroccanArabic |
|
|
- Darija |
|
|
- GemMaroc |
|
|
- DarijaLLM |
|
|
- conversational |
|
|
pipeline_tag: text-generation |
|
|
datasets: |
|
|
- GemMaroc/TULU-3-50k-darija-english |
|
|
language: |
|
|
- ar |
|
|
- ary |
|
|
- en |
|
|
base_model: |
|
|
- google/gemma-3-27b-it |
|
|
--- |
|
|
|
|
|
|
|
# GemMaroc‑27B |
|
|
|
|
|
Unlocking **Moroccan Darija** proficiency in a state‑of‑the‑art large language model, trained with a *minimal‑data, green‑AI* recipe that preserves Gemma‑27B’s strong reasoning abilities while adding fluent Darija generation. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model at a glance |
|
|
|
|
|
| | Details |
| ------------------- | ------- |
| **Model ID** | `AbderrahmanSkiredj1/GemMaroc-27b-it` |
| **Base model** | [`google/gemma-3-27b-it`](https://huggingface.co/google/gemma-3-27b-it) |
| **Architecture** | Decoder‑only Transformer (Gemma 3) |
| **Parameters** | 27 billion |
| **Context length** | 2 048 tokens |
| **Training regime** | Supervised fine‑tuning (LoRA → merged) on the 50 K high‑quality Darija/English instruction slice of TULU‑3 (`GemMaroc/TULU-3-50k-darija-english`) |
| **Compute budget** | 48 GPU·h (8 × H100‑80GB × 6 h) – ≈ 26 kWh / 10 kg CO₂e |
| **License** | Apache 2.0 |
|
|
|
|
|
--- |
|
|
|
|
|
## Why another Darija model? |
|
|
|
|
|
* **Inclusive AI.** More than 36 million speakers of Moroccan Arabic remain underserved by open LLMs.
* **Quality over quantity.** A carefully curated 50 K instruction set surfaces Darija competence without sacrificing cross‑lingual reasoning.
* **Green AI.** GemMaroc achieves Atlas‑Chat‑level Darija scores using less than 2 % of the energy.
|
|
|
|
|
--- |
|
|
|
|
|
## Benchmark summary |
|
|
|
|
|
| Model | Darija MMLU | Darija HellaSwag | GSM8K @5 | HellaSwag (EN) |
| ---------------- | ----------- | ---------------- | ---------- | -------------- |
| Atlas‑Chat‑27B | **61.9 %** | 48.4 % | 82.0 % | 77.8 % |
| **GemMaroc‑27B** | 61.6 % | **60.5 %** | **84.2 %** | **79.3 %** |
|
|
|
|
|
<sub>Zero‑shot accuracy except GSM8K (5‑shot); full table in the paper.</sub>
|
|
|
|
|
--- |
|
|
|
|
|
## Quick start |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "AbderrahmanSkiredj1/GemMaroc-27b-it"

# Load the tokenizer and model; device_map="auto" shards the 27 B weights across available GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# The model is already dispatched, so the pipeline needs no extra device argument.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0.7,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

messages = [
    # "What is the 'butterfly effect' theory? Explain it in Darija and give a simple example."
    {"role": "user", "content": "شنو هي نظرية ‘butterfly effect’؟ فسّرها بدارجة ونقّط مثال بسيط."}
]

# Build the Gemma chat prompt, then print only the newly generated reply.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(pipe(prompt)[0]["generated_text"][len(prompt):])
```
|
|
|
|
|
### Chat template (Gemma 3 format) |
|
|
|
|
|
The tokenizer provides a baked‑in Jinja template that starts with a **begin‑of‑sequence** token (`<bos>`), then alternates user/model turns, each wrapped by `<start_of_turn>` … `<end_of_turn>` markers. When you set `add_generation_prompt=True` it ends after the opening model tag so the model can continue: |
|
|
|
|
|
```
<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
```
|
|
|
|
|
The assistant will keep generating tokens until it decides to emit `<end_of_turn>`. |
|
|
|
|
|
```python
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
```
|
|
|
|
|
No manual token juggling required—the call above handles BOS, turn delimiters, and newline placement automatically. |
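
If you prefer calling `generate()` directly instead of the pipeline, the same template call can return tensors. A minimal sketch, reusing `model`, `tokenizer`, and `messages` from the quick‑start snippet:

```python
# Tokenize the chat turns directly, generate, then decode only the new tokens.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```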
|
|
|
|
|
--- |
|
|
|
|
|
## Quantised variants

Pre‑quantised checkpoints will be published under the same repo tags (`gemmaroc‑27b‑awq‑int4`, `gemmaroc‑27b‑gguf‑q4_k_m`).
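
Once those exports land, loading should mirror the bf16 flow. A minimal sketch, assuming the AWQ weights appear under a revision named after the tag above (the `revision` value is an assumption, and `autoawq` must be installed):

```python
# Hypothetical load of the pre-quantised AWQ checkpoint; the revision name is assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "AbderrahmanSkiredj1/GemMaroc-27b-it",
    revision="gemmaroc-27b-awq-int4",  # assumed tag for the AWQ int4 export
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("AbderrahmanSkiredj1/GemMaroc-27b-it")
```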
|
|
|
|
|
--- |
|
|
|
|
|
## Training recipe (one‑paragraph recap) |
|
|
|
|
|
1. **Data.** Translate a 44 K reasoning slice of the TULU‑3 50 K set into Darija, keeping 20 % in English for cross‑lingual robustness.
2. **LoRA SFT.** Rank 16, α = 32, 3 epochs, bf16, context 2 048.
3. **Merge & push.** Merge the LoRA adapter into the base weights (`peft.merge_and_unload`), convert to safetensors, and upload (sketched below).
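
A condensed sketch of steps 2 and 3 with `peft` and `trl`, as referenced above. The hyper‑parameters come from the recipe; the scaffolding (trainer choice, dataset split, output paths) is an assumption rather than the authors' exact script:

```python
# Assumed SFT + merge scaffolding; only the hyper-parameters are taken from the recipe.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("GemMaroc/TULU-3-50k-darija-english", split="train")

trainer = SFTTrainer(
    model="google/gemma-3-27b-it",
    train_dataset=dataset,
    args=SFTConfig(output_dir="gemmaroc-27b-sft", num_train_epochs=3, bf16=True),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # rank 16, α = 32
)
trainer.train()

# Step 3: fold the LoRA deltas into the dense weights and save as safetensors.
merged = trainer.model.merge_and_unload()
merged.save_pretrained("GemMaroc-27b-it", safe_serialization=True)
```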
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations & ethical considerations |
|
|
|
|
|
* Sentiment and abstractive summarisation still trail state‑of‑the‑art. |
|
|
* Tokeniser is unchanged; rare Darija spellings may fragment into multiple sub‑word pieces (see the quick check below).
|
|
* Model may inherit societal biases present in pre‑training data. |
|
|
* No RLHF / RLAIF safety alignment yet – apply a moderation layer in production. |
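
A quick way to see the tokeniser fragmentation mentioned above (the example word is illustrative; piece counts depend on the spelling):

```python
# Inspect how the unchanged Gemma tokenizer splits a Darija spelling.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("AbderrahmanSkiredj1/GemMaroc-27b-it")
print(tok.tokenize("واخّا"))  # rare spellings can split into many sub-word pieces
```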
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use GemMaroc in your work, please cite: |
|
|
|
|
|
```bibtex
@misc{skiredj2025gemmarocunlockingdarijaproficiency,
  title={GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data},
  author={Abderrahman Skiredj and Ferdaous Azhari and Houdaifa Atou and Nouamane Tazi and Ismail Berrada},
  year={2025},
  eprint={2505.17082},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.17082},
}
```