---
language:
- ar
license: gemma
base_model: google/gemma-3-4b-it
tags:
- arabic
- nlp
- text-segmentation
- semantic-chunking
- gemma3
- lora
- unsloth
- fine-tuned
- rag
- information-retrieval
pipeline_tag: text-generation
library_name: transformers
inference: true
---
<div align="center">
# 🤗 Gemma-3-4B Arabic Semantic Chunker
**A fine-tuned `google/gemma-3-4b-it` model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.**
[Model on the Hub](https://huggingface.co/marioVIC/arabic-semantic-chunking) ·
[Base model: google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) ·
[License: Gemma](https://ai.google.dev/gemma/terms) ·
[Language: Arabic](https://en.wikipedia.org/wiki/Arabic)
</div>
---
## 📑 Table of Contents
- [Model Overview](#-model-overview)
- [Intended Use](#-intended-use)
- [Training Details](#-training-details)
- [Training & Validation Loss](#-training--validation-loss)
- [Hardware & Infrastructure](#-hardware--infrastructure)
- [Dataset](#-dataset)
- [Quickstart / Inference](#-quickstart--inference)
- [Output Format](#-output-format)
- [Limitations](#-limitations)
- [Authors](#-authors)
- [Citation](#-citation)
- [License](#-license)
---
## 🧠 Model Overview
| Attribute | Value |
|-------------------------|--------------------------------------------|
| **Base Model** | `google/gemma-3-4b-it` |
| **Task** | Arabic Semantic Text Segmentation |
| **Fine-tuning Method** | Supervised Fine-Tuning (SFT) with LoRA |
| **Precision** | 4-bit NF4 quantisation (QLoRA) |
| **Vocabulary Size** | 262,144 tokens |
| **Max Sequence Length** | 2,048 tokens |
| **Trainable Parameters**| 32,788,480 (0.76% of 4.33B total) |
| **Framework** | Unsloth + Hugging Face TRL |
This model is a LoRA adapter merged into the base `google/gemma-3-4b-it` weights (saved in 16-bit precision for compatibility with vLLM and standard `transformers` pipelines). Given an Arabic paragraph or document, the model outputs a structured JSON object containing an ordered list of semantically self-contained sentences, with zero paraphrasing and zero hallucination of content.
---
## 🎯 Intended Use
This model is designed for **any Arabic NLP pipeline that benefits from precise sentence-level granularity**:
- **Retrieval-Augmented Generation (RAG)** – chunk documents into high-quality semantic units before embedding
- **Arabic NLP preprocessing** – replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
- **Corpus annotation** – automatically segment raw Arabic corpora for downstream labelling tasks
- **Information extraction** – isolate individual claims or facts before analysis
- **Search & summarisation** – improve context windows by feeding well-bounded sentence units
> ⚠️ This model is **not** intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.
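
Downstream of this model, RAG pipelines usually need chunks larger than a single sentence. A minimal, illustrative helper for packing the segmenter's sentences into embedding-sized chunks (not part of the model; the 500-character budget is an arbitrary placeholder, not a recommendation):

```python
def group_sentences(sentences: list[str], max_chars: int = 500) -> list[str]:
    """Greedily pack consecutive sentences into chunks of at most max_chars,
    keeping sentence boundaries intact."""
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # flush the full chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because sentence order is preserved, chunk boundaries always fall between semantic units rather than mid-idea.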
---
## 🏋️ Training Details
### LoRA Configuration
| Parameter | Value |
|-------------------------|-----------------------------------------------------------------------------|
| **LoRA Rank (`r`)** | 16 |
| **LoRA Alpha** | 16 |
| **LoRA Dropout** | 0.05 |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Bias** | None |
| **Gradient Checkpointing** | Unsloth (memory-optimised) |
### SFT Hyperparameters
| Parameter | Value |
|------------------------------|--------------------|
| **Epochs** | 5 |
| **Per-device Batch Size** | 2 |
| **Gradient Accumulation** | 16 steps |
| **Effective Batch Size** | 32 |
| **Learning Rate** | 1e-4 |
| **LR Scheduler** | Linear |
| **Warmup Steps** | 10 |
| **Optimiser** | `adamw_8bit` |
| **Weight Decay** | 0.01 |
| **Max Gradient Norm** | 0.3 |
| **Evaluation Strategy** | Every 10 steps |
| **Best Model Metric** | `eval_loss` |
| **Total Training Steps** | 85 |
| **Mixed Precision** | FP16 (T4 GPU) |
| **Random Seed** | 3407 |
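
In TRL terms, the hyperparameters above translate roughly to the following `SFTConfig` (argument names follow recent TRL releases and may differ slightly in older versions; a sketch, not the exact training script):

```python
from trl import SFTConfig

training_args = SFTConfig(
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,  # effective batch size: 2 * 16 = 32
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    max_grad_norm=0.3,
    eval_strategy="steps",
    eval_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    fp16=True,
    seed=3407,
    output_dir="outputs",
)
```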
---
## 📉 Training & Validation Loss
The model was evaluated on the held-out validation set every 10 steps throughout training. Both curves converge steadily across all 5 epochs; validation loss bottoms out at step 60 and drifts up only marginally afterwards, indicating at most mild overfitting late in training.
| Step | Training Loss | Validation Loss |
|:----:|:-------------:|:---------------:|
| 10 | 1.9981 | 1.9311 |
| 20 | 1.3280 | 1.2628 |
| 30 | 1.1018 | 1.0792 |
| 40 | 1.0133 | 0.9678 |
| 50 | 0.9917 | 0.9304 |
| 60 | 0.9053 | 0.8815 |
| 70 | 0.9122 | 0.8845 |
| 80 | 0.8935 | 0.8894 |
| 85 | 0.9160 | 0.8910 |
**Final overall training loss: `1.2197`**
**Best validation loss: `0.8815`** (Step 60)
**Total training time: ~83 minutes 46 seconds**
The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the model plateaus at a stable low loss, a hallmark of well-tuned LoRA fine-tuning on a focused, in-domain task.
---
## 🖥️ Hardware & Infrastructure
| Component | Specification |
|--------------|----------------------------|
| **GPU** | NVIDIA Tesla T4 |
| **VRAM** | 15.6 GB |
| **Peak VRAM Used** | 15.19 GB |
| **Platform** | Google Colab (free tier) |
| **CUDA** | Toolkit 12.8 (compute capability 7.5) |
| **PyTorch** | 2.10.0+cu128 |
---
## 📦 Dataset
The model was fine-tuned on a custom curated dataset of **586 Arabic text samples** (`dataset_final.json`), each consisting of:
- **`prompt`** – a raw Arabic paragraph prefixed with `"Text to split:\n"`
- **`response`** – a gold-standard JSON object `{"sentences": [...]}` containing the correctly segmented sentences
| Split | Samples |
|-----------------|---------|
| **Train** | 527 |
| **Validation** | 59 |
| **Total** | 586 |
The dataset covers a range of Modern Standard Arabic (MSA) domains including science, history, and general knowledge, formatted to enforce strict Gemma 3 chat template conventions.
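
For concreteness, a single record follows this shape (the Arabic below is an invented illustration, not an actual dataset sample):

```json
{
  "prompt": "Text to split:\nالذكاء الاصطناعي يتطور بسرعة. وهو يغير طريقة عملنا.",
  "response": "{\"sentences\": [\"الذكاء الاصطناعي يتطور بسرعة.\", \"وهو يغير طريقة عملنا.\"]}"
}
```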
---
## 🚀 Quickstart / Inference
### Installation
```bash
pip install transformers torch accelerate
```
### Using `transformers` (Recommended)
```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# ── Configuration ─────────────────────────────────────────────────────────────
MODEL_ID = "marioVIC/arabic-semantic-chunking"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# ── System prompt ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly – do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object – no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

# ── Load model & tokenizer ────────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

# ── Inference function ────────────────────────────────────────────────────────
def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
    """
    Segment an Arabic paragraph into a list of semantic sentences.

    Args:
        text: Raw Arabic text to segment.
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        A list of Arabic sentence strings.
    """
    # The instructions are prepended to the user turn rather than sent as a
    # separate system message.
    messages = [
        {"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for deterministic output
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens
    generated = output_ids[0][inputs["input_ids"].shape[-1]:]
    raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()
    # Parse JSON response
    parsed = json.loads(raw_output)
    return parsed["sentences"]

# ── Example ───────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    arabic_text = (
        "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب ويهتم بتطوير أنظمة "
        "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
        "على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً "
        "ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة "
        "وتوافر كميات ضخمة من البيانات."
    )
    sentences = segment_arabic(arabic_text)
    print(f"✅ Segmented into {len(sentences)} sentence(s):\n")
    for i, sentence in enumerate(sentences, 1):
        print(f"  [{i}] {sentence}")
```
### Expected Output
```
✅ Segmented into 3 sentence(s):

  [1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب ويهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
  [2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
  [3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.
```
### Using Unsloth (2× Faster Inference)
```python
import json

from transformers import AutoProcessor
from unsloth import FastLanguageModel

MODEL_ID = "marioVIC/arabic-semantic-chunking"
MAX_SEQ_LENGTH = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,          # auto-detect
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")

SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly – do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object – no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

def segment_arabic_unsloth(text: str) -> list[str]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Text to split:\n{text}"},
    ]
    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,
    )
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw)["sentences"]
```
---
## 🤖 Output Format
The model always returns a **strict JSON object** with a single key `"sentences"` whose value is an ordered array of strings. Each string is an exact substring of the original Arabic input.
```json
{
  "sentences": [
    "الجملة الأولى.",
    "الجملة الثانية.",
    "الجملة الثالثة."
  ]
}
```
**Guarantees:**
- No paraphrasing – every sentence is a verbatim span of the source text
- No hallucination of new content
- No translation, grammar correction, or interpretation
- Deterministic output with `do_sample=False`
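
Because every sentence is claimed to be a verbatim span, the output is cheap to validate downstream. A small, illustrative sanity check (whitespace is collapsed on both sides, since rule 5 allows the model to normalise whitespace):

```python
import re

def is_verbatim(source: str, sentences: list[str]) -> bool:
    """Return True if every sentence occurs verbatim in the source,
    comparing with runs of whitespace collapsed to single spaces."""
    def norm(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()
    flat = norm(source)
    return all(norm(s) in flat for s in sentences)
```

Segmentations that fail this check can be routed to a retry or a rule-based fallback splitter.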
---
## ⚠️ Limitations
- **Domain scope** – Trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
- **Dataset size** – The training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
- **Context length** – Inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation.
- **Language exclusivity** – This model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
- **Base model license** – Usage is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Commercial use requires compliance with those terms.
---
## 👥 Authors
This model was developed and trained by:
| Name | Role |
|------|------|
| **Omar Abdelmoniem** | Model development, training pipeline, LoRA configuration |
| **Mariam Emad** | Dataset curation, system prompt engineering, evaluation |
---
## 📖 Citation
If you use this model in your research or applications, please cite it as follows:
```bibtex
@misc{abdelmoniem2025arabicsemantic,
title = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
author = {Abdelmoniem, Omar and Emad, Mariam},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
}
```
---
## 📄 License
This model inherits the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)** from the base `google/gemma-3-4b-it` model. By using this model, you agree to those terms.
The fine-tuning code, dataset format, and system prompt design are released under the **MIT License**.
---
<div align="center">
Made with ❤️ for the Arabic NLP community

*Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) · Built on [Gemma 3](https://ai.google.dev/gemma) · Powered by [Hugging Face 🤗](https://huggingface.co)*
</div> |