---
language:
- ar
license: gemma
base_model: google/gemma-3-4b-it
tags:
- arabic
- nlp
- text-segmentation
- semantic-chunking
- gemma3
- lora
- unsloth
- fine-tuned
- rag
- information-retrieval
pipeline_tag: text-generation
library_name: transformers
inference: true
---
<div align="center">
# 🔤 Gemma-3-4B Arabic Semantic Chunker
**A fine-tuned `google/gemma-3-4b-it` model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.**
[![Model on HF](https://img.shields.io/badge/🤗%20Hugging%20Face-arabic--semantic--chunking-yellow)](https://huggingface.co/marioVIC/arabic-semantic-chunking)
[![Base Model](https://img.shields.io/badge/Base%20Model-google%2Fgemma--3--4b--it-blue)](https://huggingface.co/google/gemma-3-4b-it)
[![License](https://img.shields.io/badge/License-Gemma-orange)](https://ai.google.dev/gemma/terms)
[![Language](https://img.shields.io/badge/Language-Arabic%20🇸🇦-green)](https://en.wikipedia.org/wiki/Arabic)
</div>
---
## 📋 Table of Contents
- [Model Overview](#-model-overview)
- [Intended Use](#-intended-use)
- [Training Details](#-training-details)
- [Training & Validation Loss](#-training--validation-loss)
- [Hardware & Infrastructure](#-hardware--infrastructure)
- [Dataset](#-dataset)
- [Quickstart / Inference](#-quickstart--inference)
- [Output Format](#-output-format)
- [Limitations](#-limitations)
- [Authors](#-authors)
- [Citation](#-citation)
- [License](#-license)
---
## 🧠 Model Overview
| Attribute | Value |
|-------------------------|--------------------------------------------|
| **Base Model** | `google/gemma-3-4b-it` |
| **Task** | Arabic Semantic Text Segmentation |
| **Fine-tuning Method** | Supervised Fine-Tuning (SFT) with LoRA |
| **Precision** | 4-bit NF4 quantisation (QLoRA) |
| **Vocabulary Size** | 262,144 tokens |
| **Max Sequence Length** | 2,048 tokens |
| **Trainable Parameters**| 32,788,480 (0.76% of 4.33B total) |
| **Framework** | Unsloth + Hugging Face TRL |
This model is a LoRA adapter merged into the base `google/gemma-3-4b-it` weights (saved in 16-bit precision for compatibility with vLLM and standard `transformers` pipelines). Given an Arabic paragraph or document, the model outputs a structured JSON object containing an ordered list of semantically self-contained sentences — with zero paraphrasing and zero hallucination of content.
---
## 🎯 Intended Use
This model is designed for **any Arabic NLP pipeline that benefits from precise sentence-level granularity**:
- **Retrieval-Augmented Generation (RAG)** — chunk documents into high-quality semantic units before embedding
- **Arabic NLP preprocessing** — replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
- **Corpus annotation** — automatically segment raw Arabic corpora for downstream labelling tasks
- **Information extraction** — isolate individual claims or facts before analysis
- **Search & summarisation** — improve context windows by feeding well-bounded sentence units
> ⚠️ This model is **not** intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.
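For RAG use in particular, the model's sentence list is typically packed into embedding-sized chunks before indexing. A minimal sketch of that post-processing step (the `pack_sentences` helper, the token budget, and the whitespace word count are illustrative choices, not part of this model):

```python
# Hypothetical post-processing: greedily merge consecutive model-produced
# sentences into chunks that stay under an embedding model's token budget.
# Whitespace word count is a crude stand-in for a real tokenizer.

def pack_sentences(sentences: list[str], max_tokens: int = 128) -> list[str]:
    """Greedily merge consecutive sentences into bounded-size chunks."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sent in sentences:
        n = len(sent.split())  # approximate token count
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunk boundaries then always fall on semantic sentence boundaries, embeddings are computed over complete ideas rather than arbitrary character windows.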
---
## 🏋️ Training Details
### LoRA Configuration
| Parameter | Value |
|-------------------------|-----------------------------------------------------------------------------|
| **LoRA Rank (`r`)** | 16 |
| **LoRA Alpha** | 16 |
| **LoRA Dropout** | 0.05 |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Bias** | None |
| **Gradient Checkpointing** | Unsloth (memory-optimised) |
### SFT Hyperparameters
| Parameter | Value |
|------------------------------|--------------------|
| **Epochs** | 5 |
| **Per-device Batch Size** | 2 |
| **Gradient Accumulation** | 16 steps |
| **Effective Batch Size** | 32 |
| **Learning Rate** | 1e-4 |
| **LR Scheduler** | Linear |
| **Warmup Steps** | 10 |
| **Optimiser** | `adamw_8bit` |
| **Weight Decay** | 0.01 |
| **Max Gradient Norm** | 0.3 |
| **Evaluation Strategy** | Every 10 steps |
| **Best Model Metric** | `eval_loss` |
| **Total Training Steps** | 85 |
| **Mixed Precision** | FP16 (T4 GPU) |
| **Random Seed** | 3407 |
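The step count in the table follows directly from the dataset split and the batch settings: 527 training samples at an effective batch of 32 give ⌈527 / 32⌉ = 17 optimiser steps per epoch, and 17 × 5 epochs = 85 total steps. A quick arithmetic check:

```python
import math

# Sanity-check the step count implied by the hyperparameter table.
train_samples = 527
per_device_batch = 2
grad_accum = 16
epochs = 5

effective_batch = per_device_batch * grad_accum               # 32
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 17
total_steps = steps_per_epoch * epochs                        # 85

print(effective_batch, steps_per_epoch, total_steps)  # → 32 17 85
```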
---
## 📉 Training & Validation Loss
The model was evaluated on the held-out validation set every 10 steps throughout training. Both curves converge steadily across all 5 epochs; validation loss bottoms out at step 60 and rises only marginally afterwards, and the best checkpoint is selected on `eval_loss`, so there is no meaningful overfitting.
| Step | Training Loss | Validation Loss |
|:----:|:-------------:|:---------------:|
| 10 | 1.9981 | 1.9311 |
| 20 | 1.3280 | 1.2628 |
| 30 | 1.1018 | 1.0792 |
| 40 | 1.0133 | 0.9678 |
| 50 | 0.9917 | 0.9304 |
| 60 | 0.9053 | 0.8815 |
| 70 | 0.9122 | 0.8845 |
| 80 | 0.8935 | 0.8894 |
| 85 | 0.9160 | 0.8910 |
**Final overall training loss: `1.2197`**
**Best validation loss: `0.8815`** (Step 60)
**Total training time: ~83 minutes 46 seconds**
The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the model plateaus at a stable low loss — a hallmark of well-tuned LoRA fine-tuning on a focused, in-domain task.
---
## 🖥️ Hardware & Infrastructure
| Component | Specification |
|--------------|----------------------------|
| **GPU** | NVIDIA Tesla T4 |
| **VRAM** | 15.6 GB |
| **Peak VRAM Used** | 15.19 GB |
| **Platform** | Google Colab (free tier) |
| **CUDA** | Toolkit 12.8 (GPU compute capability 7.5) |
| **PyTorch** | 2.10.0+cu128 |
---
## 📦 Dataset
The model was fine-tuned on a custom curated dataset of **586 Arabic text samples** (`dataset_final.json`), each consisting of:
- **`prompt`** — a raw Arabic paragraph prefixed with `"Text to split:\n"`
- **`response`** — a gold-standard JSON object `{"sentences": [...]}` containing the correctly segmented sentences
| Split | Samples |
|-----------------|---------|
| **Train** | 527 |
| **Validation** | 59 |
| **Total** | 586 |
The dataset covers a range of Modern Standard Arabic (MSA) domains including science, history, and general knowledge, formatted to enforce strict Gemma 3 chat template conventions.
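For illustration, a single record would look like the following (the paragraph is taken from this card's own quickstart example; whether `response` is stored as a nested object or an escaped JSON string in `dataset_final.json` is an assumption):

```json
{
  "prompt": "Text to split:\nتشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.",
  "response": {
    "sentences": [
      "تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.",
      "وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات."
    ]
  }
}
```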
---
## 🚀 Quickstart / Inference
### Installation
```bash
pip install transformers torch accelerate
```
### Using `transformers` (Recommended)
```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ── Configuration ────────────────────────────────────────────────────────────
MODEL_ID = "marioVIC/arabic-semantic-chunking"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# ── System prompt ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.

Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.

The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

# ── Load model & tokenizer ────────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

# ── Inference function ────────────────────────────────────────────────────────
def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
    """
    Segment an Arabic paragraph into a list of semantic sentences.

    Args:
        text: Raw Arabic text to segment.
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        A list of Arabic sentence strings.
    """
    messages = [
        {"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding → deterministic output
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens
    generated = output_ids[0][inputs["input_ids"].shape[-1]:]
    raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()

    # Parse JSON response
    parsed = json.loads(raw_output)
    return parsed["sentences"]

# ── Example ───────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    arabic_text = (
        "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة "
        "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
        "على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً "
        "ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة "
        "وتوافر كميات ضخمة من البيانات."
    )
    sentences = segment_arabic(arabic_text)
    print(f"✅ Segmented into {len(sentences)} sentence(s):\n")
    for i, sentence in enumerate(sentences, 1):
        print(f"  [{i}] {sentence}")
```
### Expected Output
```
✅ Segmented into 3 sentence(s):
[1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
[2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
[3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.
```
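The quickstart assumes the model emits bare JSON, which the system prompt enforces in the common case. If you want to guard against occasional formatting drift (stray code fences or surrounding prose), a defensive parser such as the hypothetical helper below can replace the raw `json.loads` call:

```python
import json
import re

def parse_sentences(raw_output: str) -> list[str]:
    """Best-effort extraction of the {"sentences": [...]} object.

    Strips accidental markdown fences and surrounding noise before
    handing the remainder to json.loads.
    """
    text = raw_output.strip()
    # Drop ```json ... ``` fences if the model added them despite instructions.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Fall back to the outermost braces if extra prose surrounds the JSON.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end != -1:
        text = text[start:end + 1]
    return json.loads(text)["sentences"]
```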
### Using Unsloth (2× Faster Inference)
```python
import json
from unsloth import FastLanguageModel
from transformers import AutoProcessor
MODEL_ID = "marioVIC/arabic-semantic-chunking"
MAX_SEQ_LENGTH = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")

SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.

Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.

The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

def segment_arabic_unsloth(text: str) -> list[str]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Text to split:\n{text}"},
    ]
    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,
    )
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw)["sentences"]
```
---
## 📤 Output Format
The model always returns a **strict JSON object** with a single key `"sentences"` whose value is an ordered array of strings. Each string is a verbatim span of the original Arabic input, apart from the whitespace normalisation permitted by the system prompt.
```json
{
"sentences": [
"الجملة الأولى.",
"الجملة الثانية.",
"الجملة الثالثة."
]
}
```
**Guarantees:**
- No paraphrasing — every sentence is a verbatim span of the source text
- No hallucination of new content
- No translation, grammar correction, or interpretation
- Deterministic output with `do_sample=False`
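These guarantees can be checked mechanically. The sketch below (`is_verbatim` is my own helper, not shipped with the model) verifies the verbatim-span property, comparing after collapsing whitespace since rule 5 of the system prompt permits whitespace normalisation:

```python
def is_verbatim(source: str, sentences: list[str]) -> bool:
    """Check that every returned sentence occurs in the source text.

    Comparison collapses runs of whitespace, because the model may
    normalise whitespace but must not alter the words themselves.
    """
    normalised_source = " ".join(source.split())
    return all(" ".join(s.split()) in normalised_source for s in sentences)
```

Running such a check over model outputs is a cheap way to catch any segmentation that drifted from the source.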
---
## ⚠️ Limitations
- **Domain scope** — Trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
- **Dataset size** — The training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
- **Context length** — Inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation.
- **Language exclusivity** — This model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
- **Base model license** — Usage is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Commercial use requires compliance with those terms.
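For the context-length limitation in particular, one pragmatic workaround is to batch paragraphs before calling the segmenter. A rough sketch, using whitespace word count as a stand-in for real token counts (`split_long_document` and the 600-word budget are illustrative assumptions):

```python
def split_long_document(text: str, max_words: int = 600) -> list[str]:
    """Group blank-line-separated paragraphs into batches that should
    stay well under the 2,048-token context once the prompt template is
    added. Word count is a rough proxy; a real tokenizer is more precise."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    batches: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            batches.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        batches.append("\n\n".join(current))
    return batches
```

Each batch can then be passed to `segment_arabic` independently and the resulting sentence lists concatenated in order.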
---
## 👥 Authors
This model was developed and trained by:
| Name | Role |
|------|------|
| **Omar Abdelmoniem** | Model development, training pipeline, LoRA configuration |
| **Mariam Emad** | Dataset curation, system prompt engineering, evaluation |
---
## 📖 Citation
If you use this model in your research or applications, please cite it as follows:
```bibtex
@misc{abdelmoniem2025arabicsemantic,
title = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
author = {Abdelmoniem, Omar and Emad, Mariam},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
}
```
---
## 📜 License
This model inherits the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)** from the base `google/gemma-3-4b-it` model. By using this model, you agree to those terms.
The fine-tuning code, dataset format, and system prompt design are released under the **MIT License**.
---
<div align="center">
Made with ❤️ for the Arabic NLP community
*Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) · Built on [Gemma 3](https://ai.google.dev/gemma) · Powered by [Hugging Face 🤗](https://huggingface.co)*
</div>