---
language:
- ar
license: gemma
base_model: google/gemma-3-4b-it
tags:
- arabic
- nlp
- text-segmentation
- semantic-chunking
- gemma3
- lora
- unsloth
- fine-tuned
- rag
- information-retrieval
pipeline_tag: text-generation
library_name: transformers
inference: true
---
# 🤗 Gemma-3-4B Arabic Semantic Chunker
**A fine-tuned `google/gemma-3-4b-it` model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.**
---
## 📋 Table of Contents
- [Model Overview](#-model-overview)
- [Intended Use](#-intended-use)
- [Training Details](#-training-details)
- [Training & Validation Loss](#-training--validation-loss)
- [Hardware & Infrastructure](#-hardware--infrastructure)
- [Dataset](#-dataset)
- [Quickstart / Inference](#-quickstart--inference)
- [Output Format](#-output-format)
- [Limitations](#-limitations)
- [Authors](#-authors)
- [Citation](#-citation)
- [License](#-license)
---
## 🧠 Model Overview
| Attribute | Value |
|-------------------------|--------------------------------------------|
| **Base Model** | `google/gemma-3-4b-it` |
| **Task** | Arabic Semantic Text Segmentation |
| **Fine-tuning Method** | Supervised Fine-Tuning (SFT) with LoRA |
| **Training Precision**  | 4-bit NF4 quantisation (QLoRA)             |
| **Vocabulary Size** | 262,144 tokens |
| **Max Sequence Length** | 2,048 tokens |
| **Trainable Parameters**| 32,788,480 (0.76% of 4.33B total) |
| **Framework** | Unsloth + Hugging Face TRL |
This model is a LoRA adapter merged into the base `google/gemma-3-4b-it` weights (saved in 16-bit precision for compatibility with vLLM and standard `transformers` pipelines). Given an Arabic paragraph or document, the model outputs a structured JSON object containing an ordered list of semantically self-contained sentences, with zero paraphrasing and zero hallucination of content.
---
## 🎯 Intended Use
This model is designed for **any Arabic NLP pipeline that benefits from precise sentence-level granularity**:
- **Retrieval-Augmented Generation (RAG)** – chunk documents into high-quality semantic units before embedding
- **Arabic NLP preprocessing** – replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
- **Corpus annotation** – automatically segment raw Arabic corpora for downstream labelling tasks
- **Information extraction** – isolate individual claims or facts before analysis
- **Search & summarisation** – improve context windows by feeding well-bounded sentence units
> ⚠️ This model is **not** intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.
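For RAG pipelines specifically, the per-sentence output can then be packed into embedding-sized chunks before indexing. A minimal, model-free sketch of that packing step (the character budget and greedy strategy are illustrative assumptions, not part of this model):

```python
def pack_sentences(sentences: list[str], max_chars: int = 300) -> list[str]:
    """Greedily pack consecutive sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Example with placeholder sentences:
chunks = pack_sentences(["a" * 120, "b" * 120, "c" * 120], max_chars=250)
print(len(chunks))  # 2: the first two sentences fit together, the third does not
```

A token-based budget (counting with the model's own tokenizer) is a drop-in refinement if character counts prove too coarse.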
---
## 🏋️ Training Details
### LoRA Configuration
| Parameter | Value |
|-------------------------|-----------------------------------------------------------------------------|
| **LoRA Rank (`r`)** | 16 |
| **LoRA Alpha** | 16 |
| **LoRA Dropout** | 0.05 |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Bias** | None |
| **Gradient Checkpointing** | Unsloth (memory-optimised) |
### SFT Hyperparameters
| Parameter | Value |
|------------------------------|--------------------|
| **Epochs** | 5 |
| **Per-device Batch Size** | 2 |
| **Gradient Accumulation** | 16 steps |
| **Effective Batch Size** | 32 |
| **Learning Rate** | 1e-4 |
| **LR Scheduler** | Linear |
| **Warmup Steps** | 10 |
| **Optimiser** | `adamw_8bit` |
| **Weight Decay** | 0.01 |
| **Max Gradient Norm** | 0.3 |
| **Evaluation Strategy** | Every 10 steps |
| **Best Model Metric** | `eval_loss` |
| **Total Training Steps** | 85 |
| **Mixed Precision** | FP16 (T4 GPU) |
| **Random Seed** | 3407 |
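The step count is consistent with the table: an effective batch of 2 × 16 = 32 over 527 training samples gives ⌈527 / 32⌉ = 17 optimiser steps per epoch, and 5 epochs yield the reported 85 total steps:

```python
import math

per_device_batch = 2
grad_accum = 16
train_samples = 527
epochs = 5

effective_batch = per_device_batch * grad_accum               # 32
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 17
total_steps = steps_per_epoch * epochs                        # 85
print(effective_batch, steps_per_epoch, total_steps)          # 32 17 85
```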
---
## 📉 Training & Validation Loss
The model was evaluated on the held-out validation set every 10 steps throughout training. Both curves converge steadily across all 5 epochs; validation loss bottoms out at step 60 and drifts up only marginally afterwards, which is why the best checkpoint is selected on `eval_loss` rather than taken from the final step.
| Step | Training Loss | Validation Loss |
|:----:|:-------------:|:---------------:|
| 10 | 1.9981 | 1.9311 |
| 20 | 1.3280 | 1.2628 |
| 30 | 1.1018 | 1.0792 |
| 40 | 1.0133 | 0.9678 |
| 50 | 0.9917 | 0.9304 |
| 60 | 0.9053 | 0.8815 |
| 70 | 0.9122 | 0.8845 |
| 80 | 0.8935 | 0.8894 |
| 85 | 0.9160 | 0.8910 |
**Final overall training loss: `1.2197`**
**Best validation loss: `0.8815`** (Step 60)
**Total training time: ~83 minutes 46 seconds**
The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the model plateaus at a stable low loss, consistent with well-tuned LoRA fine-tuning on a focused, in-domain task.
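Best-checkpoint selection on `eval_loss` can be reproduced from the table above:

```python
# Validation loss by step, copied from the training log table.
eval_log = {10: 1.9311, 20: 1.2628, 30: 1.0792, 40: 0.9678,
            50: 0.9304, 60: 0.8815, 70: 0.8845, 80: 0.8894, 85: 0.8910}

best_step = min(eval_log, key=eval_log.get)
print(best_step, eval_log[best_step])  # 60 0.8815
```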
---
## 🖥️ Hardware & Infrastructure
| Component | Specification |
|--------------|----------------------------|
| **GPU** | NVIDIA Tesla T4 |
| **VRAM** | 15.6 GB |
| **Peak VRAM Used** | 15.19 GB |
| **Platform** | Google Colab (free tier) |
| **CUDA**     | 12.8 (T4 compute capability 7.5) |
| **PyTorch** | 2.10.0+cu128 |
---
## 📦 Dataset
The model was fine-tuned on a custom curated dataset of **586 Arabic text samples** (`dataset_final.json`), each consisting of:
- **`prompt`** – a raw Arabic paragraph prefixed with `"Text to split:\n"`
- **`response`** – a gold-standard JSON object `{"sentences": [...]}` containing the correctly segmented sentences
| Split | Samples |
|-----------------|---------|
| **Train** | 527 |
| **Validation** | 59 |
| **Total** | 586 |
The dataset covers a range of Modern Standard Arabic (MSA) domains, including science, history, and general knowledge, and every sample is formatted according to the Gemma 3 chat template.
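Schematically, each record pairs a prompt with its gold JSON response (the Arabic below is a placeholder sentence, not an actual training sample), and the 586 samples split roughly 90/10:

```python
import json

# Hypothetical record illustrating the dataset_final.json schema.
record = {
    "prompt": "Text to split:\nالذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب.",
    "response": json.dumps(
        {"sentences": ["الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب."]},
        ensure_ascii=False,
    ),
}

train, val = 527, 59
print(train + val, round(val / (train + val), 2))  # 586 0.1
```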
---
## 🚀 Quickstart / Inference
### Installation
```bash
pip install transformers torch accelerate
```
### Using `transformers` (Recommended)
```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# โโ Configuration โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
MODEL_ID = "marioVIC/arabic-semantic-chunking"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# โโ System prompt โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly โ do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object โ no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["", "", ...]}
"""
# โโ Load model & tokenizer โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
torch_dtype=torch.float16,
device_map="auto",
)
model.eval()
# โโ Inference function โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
"""
Segment an Arabic paragraph into a list of semantic sentences.
Args:
text: Raw Arabic text to segment.
max_new_tokens: Maximum number of tokens to generate.
Returns:
A list of Arabic sentence strings.
"""
messages = [
{"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
with torch.no_grad():
output_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,
temperature=1.0,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens
generated = output_ids[0][inputs["input_ids"].shape[-1]:]
raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()
# Parse JSON response
parsed = json.loads(raw_output)
return parsed["sentences"]
# โโ Example โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
if __name__ == "__main__":
arabic_text = (
"ุงูุฐูุงุก ุงูุงุตุทูุงุนู ูู ู
ุฌุงู ู
ู ู
ุฌุงูุงุช ุนููู
ุงูุญุงุณูุจ ููุชู
ุจุชุทููุฑ ุฃูุธู
ุฉ "
"ูุงุฏุฑุฉ ุนูู ุชูููุฐ ู
ูุงู
ุชุชุทูุจ ุนุงุฏุฉู ุฐูุงุกู ุจุดุฑูุงู. ุชุดู
ู ูุฐู ุงูู
ูุงู
ุงูุชุนุฑู "
"ุนูู ุงูููุงู
ูุชุฑุฌู
ุฉ ุงููุบุงุช ูุงุชุฎุงุฐ ุงููุฑุงุฑุงุช. ููุฏ ุดูุฏ ูุฐุง ุงูู
ุฌุงู ุชุทูุฑุงู "
"ู
ูุญูุธุงู ูู ุงูุณููุงุช ุงูุฃุฎูุฑุฉ ุจูุถู ุงูุชูุฏู
ูู ุงูุดุจูุงุช ุงูุนุตุจูุฉ ุงูุนู
ููุฉ "
"ูุชูุงูุฑ ูู
ูุงุช ุถุฎู
ุฉ ู
ู ุงูุจูุงูุงุช."
)
sentences = segment_arabic(arabic_text)
print(f"โ
Segmented into {len(sentences)} sentence(s):\n")
for i, sentence in enumerate(sentences, 1):
print(f" [{i}] {sentence}")
```
### Expected Output
```
✅ Segmented into 3 sentence(s):

  [1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
  [2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
  [3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.
```
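The fine-tuned model is trained to emit bare JSON, but in production it is cheap to guard `json.loads` against stray markdown fences. A hedged parsing helper (this wrapper is an optional addition, not part of the model's contract):

```python
import json
import re

def parse_sentences(raw_output: str) -> list[str]:
    """Parse the model's {"sentences": [...]} reply, tolerating stray code fences."""
    cleaned = raw_output.strip()
    # Drop a leading ```json / ``` fence and a trailing ``` if present.
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
    cleaned = re.sub(r"\s*```$", "", cleaned)
    parsed = json.loads(cleaned)
    if not isinstance(parsed.get("sentences"), list):
        raise ValueError("missing 'sentences' array in model output")
    return parsed["sentences"]

print(parse_sentences('```json\n{"sentences": ["a", "b"]}\n```'))  # ['a', 'b']
```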
### Using Unsloth (2× Faster Inference)
```python
import json
from unsloth import FastLanguageModel
from transformers import AutoProcessor
MODEL_ID = "marioVIC/arabic-semantic-chunking"
MAX_SEQ_LENGTH = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = MODEL_ID,
max_seq_length = MAX_SEQ_LENGTH,
dtype = None, # auto-detect
load_in_4bit = True,
)
FastLanguageModel.for_inference(model)
processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly โ do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object โ no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["", "", ...]}
"""
def segment_arabic_unsloth(text: str) -> list[str]:
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": f"Text to split:\n{text}"},
]
prompt = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens=512,
use_cache=True,
do_sample=False,
)
generated = outputs[0][inputs["input_ids"].shape[-1]:]
raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
return json.loads(raw)["sentences"]
```
---
## 🔤 Output Format
The model always returns a **strict JSON object** with a single key `"sentences"` whose value is an ordered array of strings. Each string reproduces the original Arabic wording verbatim, apart from the whitespace normalisation permitted by the prompt rules.
```json
{
  "sentences": [
    "الجملة الأولى.",
    "الجملة الثانية.",
    "الجملة الثالثة."
  ]
}
```
**Guarantees:**
- No paraphrasing – every sentence is a verbatim span of the source text
- No hallucination of new content
- No translation, grammar correction, or interpretation
- Deterministic output with `do_sample=False`
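Since the prompt rules permit whitespace normalisation, "verbatim" here means equal after collapsing whitespace. The no-paraphrasing and no-hallucination guarantees can be spot-checked programmatically (this checker is an illustration, not shipped with the model):

```python
import re

def is_faithful(source: str, sentences: list[str]) -> bool:
    """True if every output sentence occurs in the source after whitespace collapsing."""
    normalised_source = re.sub(r"\s+", " ", source).strip()
    return all(
        re.sub(r"\s+", " ", s).strip() in normalised_source
        for s in sentences
    )

print(is_faithful("foo  bar. baz qux.", ["foo bar.", "baz qux."]))  # True
print(is_faithful("foo bar.", ["hallucinated text"]))               # False
```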
---
## ⚠️ Limitations
- **Domain scope** – Trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
- **Dataset size** – The training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
- **Context length** – Inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation.
- **Language exclusivity** – This model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
- **Base model license** – Usage is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Commercial use requires compliance with those terms.
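For documents beyond the context limit, a simple pre-chunking pass on paragraph boundaries keeps each call under budget. A rough sketch using a character proxy for tokens (the 4-characters-per-token ratio is an assumption; tune it against the Gemma tokenizer):

```python
def prechunk(text: str, max_tokens: int = 1800, chars_per_token: int = 4) -> list[str]:
    """Split text on paragraph boundaries so each chunk stays under a rough token budget."""
    budget = max_tokens * chars_per_token
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}".strip()
        if current and len(candidate) > budget:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the segmenter independently and the resulting sentence lists concatenated in order.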
---
## 👥 Authors
This model was developed and trained by:
| Name | Role |
|------|------|
| **Omar Abdelmoniem** | Model development, training pipeline, LoRA configuration |
| **Mariam Emad** | Dataset curation, system prompt engineering, evaluation |
---
## 📚 Citation
If you use this model in your research or applications, please cite it as follows:
```bibtex
@misc{abdelmoniem2025arabicsemantic,
title = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
author = {Abdelmoniem, Omar and Emad, Mariam},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
}
```
---
## 📄 License
This model inherits the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)** from the base `google/gemma-3-4b-it` model. By using this model, you agree to those terms.
The fine-tuning code, dataset format, and system prompt design are released under the **MIT License**.
---
Made with ❤️ for the Arabic NLP community
*Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) · Built on [Gemma 3](https://ai.google.dev/gemma) · Powered by [Hugging Face 🤗](https://huggingface.co)*