🔤 Gemma-3-4B Arabic Semantic Chunker
A fine-tuned google/gemma-3-4b-it model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.
📋 Table of Contents
- Model Overview
- Intended Use
- Training Details
- Training & Validation Loss
- Hardware & Infrastructure
- Dataset
- Quickstart / Inference
- Output Format
- Limitations
- Authors
- Citation
- License
🧠 Model Overview
| Attribute | Value |
|---|---|
| Base Model | google/gemma-3-4b-it |
| Task | Arabic Semantic Text Segmentation |
| Fine-tuning Method | Supervised Fine-Tuning (SFT) with LoRA |
| Precision | 4-bit NF4 quantisation (QLoRA) |
| Vocabulary Size | 262,144 tokens |
| Max Sequence Length | 2,048 tokens |
| Trainable Parameters | 32,788,480 (0.76% of 4.33B total) |
| Framework | Unsloth + Hugging Face TRL |
This model is a LoRA adapter merged into the base google/gemma-3-4b-it weights (saved in 16-bit precision for compatibility with vLLM and standard transformers pipelines). Given an Arabic paragraph or document, the model outputs a structured JSON object containing an ordered list of semantically self-contained sentences — with zero paraphrasing and zero hallucination of content.
🎯 Intended Use
This model is designed for any Arabic NLP pipeline that benefits from precise sentence-level granularity:
- Retrieval-Augmented Generation (RAG) — chunk documents into high-quality semantic units before embedding
- Arabic NLP preprocessing — replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
- Corpus annotation — automatically segment raw Arabic corpora for downstream labelling tasks
- Information extraction — isolate individual claims or facts before analysis
- Search & summarisation — improve context windows by feeding well-bounded sentence units
⚠️ This model is not intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.
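For the RAG use case above, the segmented sentences can then be grouped into overlapping embedding windows. A minimal sketch in pure Python (window size and overlap are illustrative defaults, not tuned values; `sentence_windows` is a hypothetical helper, not part of the model):

```python
def sentence_windows(sentences: list[str], size: int = 3, overlap: int = 1) -> list[str]:
    """Join consecutive sentences into overlapping windows for embedding."""
    step = size - overlap
    windows: list[str] = []
    for start in range(0, len(sentences), step):
        windows.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the last window already covers the tail
    return windows

# Example with placeholder sentences
sents = ["الجملة 1.", "الجملة 2.", "الجملة 3.", "الجملة 4.", "الجملة 5."]
print(sentence_windows(sents, size=3, overlap=1))
```

The one-sentence overlap keeps boundary context in adjacent embeddings; adjust both parameters to your embedder's input budget.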
🏋️ Training Details
LoRA Configuration
| Parameter | Value |
|---|---|
| LoRA Rank (r) | 16 |
| LoRA Alpha | 16 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Bias | None |
| Gradient Checkpointing | Unsloth (memory-optimised) |
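The table corresponds to an Unsloth LoRA setup along these lines (a sketch, assuming `model` is the already-loaded base model; this is not the exact training script):

```python
from unsloth import FastLanguageModel

# Attach a LoRA adapter matching the configuration table above
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    use_gradient_checkpointing="unsloth",  # memory-optimised checkpointing
    random_state=3407,
)
```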
SFT Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 5 |
| Per-device Batch Size | 2 |
| Gradient Accumulation | 16 steps |
| Effective Batch Size | 32 |
| Learning Rate | 1e-4 |
| LR Scheduler | Linear |
| Warmup Steps | 10 |
| Optimiser | adamw_8bit |
| Weight Decay | 0.01 |
| Max Gradient Norm | 0.3 |
| Evaluation Strategy | Every 10 steps |
| Best Model Metric | eval_loss |
| Total Training Steps | 85 |
| Mixed Precision | FP16 (T4 GPU) |
| Random Seed | 3407 |
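The step count follows directly from the table and the dataset split: with 527 training samples and an effective batch of 2 × 16 = 32, one epoch takes ⌈527 / 32⌉ = 17 optimiser steps, and 5 epochs give 85 total steps. A quick check:

```python
import math

per_device_batch = 2
grad_accum = 16
train_samples = 527
epochs = 5

effective_batch = per_device_batch * grad_accum               # 32
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 17
total_steps = steps_per_epoch * epochs                        # 85

print(effective_batch, steps_per_epoch, total_steps)  # 32 17 85
```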
📉 Training & Validation Loss
The model was evaluated on the held-out validation set every 10 steps throughout training. Both curves converge steadily across all 5 epochs; validation loss bottoms out at step 60 and stays essentially flat through the end of training, with no sign of overfitting.
| Step | Training Loss | Validation Loss |
|---|---|---|
| 10 | 1.9981 | 1.9311 |
| 20 | 1.3280 | 1.2628 |
| 30 | 1.1018 | 1.0792 |
| 40 | 1.0133 | 0.9678 |
| 50 | 0.9917 | 0.9304 |
| 60 | 0.9053 | 0.8815 |
| 70 | 0.9122 | 0.8845 |
| 80 | 0.8935 | 0.8894 |
| 85 | 0.9160 | 0.8910 |
Final overall training loss: 1.2197
Best validation loss: 0.8815 (Step 60)
Total training time: ~83 minutes 46 seconds
The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the model plateaus at a stable low loss — a hallmark of well-tuned LoRA fine-tuning on a focused, in-domain task.
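The best checkpoint reported above can be recovered mechanically from the logged values. A small sketch over the same table:

```python
# (step, training_loss, validation_loss) rows as logged above
history = [
    (10, 1.9981, 1.9311),
    (20, 1.3280, 1.2628),
    (30, 1.1018, 1.0792),
    (40, 1.0133, 0.9678),
    (50, 0.9917, 0.9304),
    (60, 0.9053, 0.8815),
    (70, 0.9122, 0.8845),
    (80, 0.8935, 0.8894),
    (85, 0.9160, 0.8910),
]

# "Best Model Metric: eval_loss" selects the row with minimal validation loss
best_step, _, best_eval = min(history, key=lambda row: row[2])
print(best_step, best_eval)  # 60 0.8815
```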
🖥️ Hardware & Infrastructure
| Component | Specification |
|---|---|
| GPU | NVIDIA Tesla T4 |
| VRAM | 15.6 GB |
| Peak VRAM Used | 15.19 GB |
| Platform | Google Colab (free tier) |
| CUDA | 12.8 (compute capability 7.5) |
| PyTorch | 2.10.0+cu128 |
📦 Dataset
The model was fine-tuned on a custom curated dataset of 586 Arabic text samples (dataset_final.json), each consisting of:
- `prompt` — a raw Arabic paragraph prefixed with `"Text to split:\n"`
- `response` — a gold-standard JSON object `{"sentences": [...]}` containing the correctly segmented sentences
| Split | Samples |
|---|---|
| Train | 527 |
| Validation | 59 |
| Total | 586 |
The dataset covers a range of Modern Standard Arabic (MSA) domains including science, history, and general knowledge, formatted to enforce strict Gemma 3 chat template conventions.
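A single training record in `dataset_final.json` therefore looks roughly like this (the Arabic content below is an illustrative placeholder, not an actual dataset entry):

```python
import json

record = {
    "prompt": "Text to split:\nالنص العربي هنا. جملة ثانية هنا.",
    "response": json.dumps(
        {"sentences": ["النص العربي هنا.", "جملة ثانية هنا."]},
        ensure_ascii=False,  # keep Arabic readable rather than \u-escaped
    ),
}

# The response field must round-trip as strict JSON
parsed = json.loads(record["response"])
print(len(parsed["sentences"]))  # 2
```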
🚀 Quickstart / Inference
Installation
```bash
pip install transformers torch accelerate
```
Using transformers (Recommended)
```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ── Configuration ────────────────────────────────────────────────────────────
MODEL_ID = "marioVIC/arabic-semantic-chunking"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# ── System prompt ────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

# ── Load model & tokenizer ───────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

# ── Inference function ───────────────────────────────────────────────────────
def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
    """Segment an Arabic paragraph into a list of semantic sentences.

    Args:
        text: Raw Arabic text to segment.
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        A list of Arabic sentence strings.
    """
    messages = [
        {"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for deterministic output
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens
    generated = output_ids[0][inputs["input_ids"].shape[-1]:]
    raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()

    # Parse JSON response
    parsed = json.loads(raw_output)
    return parsed["sentences"]

# ── Example ──────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    arabic_text = (
        "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة "
        "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
        "على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً "
        "ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة "
        "وتوافر كميات ضخمة من البيانات."
    )
    sentences = segment_arabic(arabic_text)
    print(f"✅ Segmented into {len(sentences)} sentence(s):\n")
    for i, sentence in enumerate(sentences, 1):
        print(f"  [{i}] {sentence}")
```
Expected Output
```
✅ Segmented into 3 sentence(s):

  [1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
  [2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
  [3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.
```
Using Unsloth (2× Faster Inference)
```python
import json
from unsloth import FastLanguageModel
from transformers import AutoProcessor

MODEL_ID = "marioVIC/arabic-semantic-chunking"
MAX_SEQ_LENGTH = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")

SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

def segment_arabic_unsloth(text: str) -> list[str]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Text to split:\n{text}"},
    ]
    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,
    )
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw)["sentences"]
```
📤 Output Format
The model always returns a strict JSON object with a single key "sentences" whose value is an ordered array of strings. Each string is an exact substring of the original Arabic input.
```json
{
  "sentences": [
    "الجملة الأولى.",
    "الجملة الثانية.",
    "الجملة الثالثة."
  ]
}
```
Guarantees:
- No paraphrasing — every sentence is a verbatim span of the source text
- No hallucination of new content
- No translation, grammar correction, or interpretation
- Deterministic output with `do_sample=False`
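These guarantees can also be enforced at the application layer. A minimal validator sketch (`validate_output` is an illustrative helper, not part of the model; the whitespace normalisation mirrors rule 5 of the system prompt):

```python
import json
import re

def validate_output(raw_output: str, source_text: str) -> list[str]:
    """Parse the model's JSON and verify each sentence is a verbatim span."""
    parsed = json.loads(raw_output)   # raises if not strict JSON
    sentences = parsed["sentences"]   # raises if the key is missing
    normalised_source = re.sub(r"\s+", " ", source_text)
    for sentence in sentences:
        if re.sub(r"\s+", " ", sentence) not in normalised_source:
            raise ValueError(f"sentence is not a span of the source: {sentence!r}")
    return sentences

raw = '{"sentences": ["الجملة الأولى.", "الجملة الثانية."]}'
src = "الجملة الأولى. الجملة الثانية."
print(len(validate_output(raw, src)))  # 2
```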
⚠️ Limitations
- Domain scope — Trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
- Dataset size — The training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
- Context length — Inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation.
- Language exclusivity — This model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
- Base model license — Usage is subject to Google's Gemma Terms of Use. Commercial use requires compliance with those terms.
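For documents near the context limit, one workable pattern is to split on paragraph boundaries before calling the model. A rough sketch (`prechunk` is a hypothetical helper; character count is used as a crude stand-in for token count, and the budget below is illustrative, not a measured value for Arabic):

```python
def prechunk(document: str, max_chars: int = 6000) -> list[str]:
    """Split a long document into paragraph-aligned pieces under max_chars."""
    pieces: list[str] = []
    current = ""
    for paragraph in document.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}".strip()
        if current and len(candidate) > max_chars:
            pieces.append(current)  # budget exceeded: close the current piece
            current = paragraph
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces

# Example with placeholder paragraphs
doc = "\n\n".join(f"فقرة رقم {i}" for i in range(100))
pieces = prechunk(doc, max_chars=300)
print(all(len(p) <= 300 for p in pieces))  # True
```

Each piece can then be segmented independently and the sentence lists concatenated in order.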
👥 Authors
This model was developed and trained by:
| Name | Role |
|---|---|
| Omar Abdelmoniem | Model development, training pipeline, LoRA configuration |
| Mariam Emad | Dataset curation, system prompt engineering, evaluation |
📖 Citation
If you use this model in your research or applications, please cite it as follows:
@misc{abdelmoniem2025arabicsemantic,
title = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
author = {Abdelmoniem, Omar and Emad, Mariam},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
}
📜 License
This model inherits the Gemma Terms of Use from the base google/gemma-3-4b-it model. By using this model, you agree to those terms.
The fine-tuning code, dataset format, and system prompt design are released under the MIT License.
Made with ❤️ for the Arabic NLP community
Fine-tuned with Unsloth · Built on Gemma 3 · Powered by Hugging Face 🤗