---
language:
- ar
license: gemma
base_model: google/gemma-3-4b-it
tags:
- arabic
- nlp
- text-segmentation
- semantic-chunking
- gemma3
- lora
- unsloth
- fine-tuned
- rag
- information-retrieval
pipeline_tag: text-generation
library_name: transformers
inference: true
---
<div align="center">
# 🔤 Gemma-3-4B Arabic Semantic Chunker
**A fine-tuned `google/gemma-3-4b-it` model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.**
[![Model on HF](https://img.shields.io/badge/🤗%20Hugging%20Face-arabic--semantic--chunking-yellow)](https://huggingface.co/marioVIC/arabic-semantic-chunking)
[![Base Model](https://img.shields.io/badge/Base%20Model-google%2Fgemma--3--4b--it-blue)](https://huggingface.co/google/gemma-3-4b-it)
[![License](https://img.shields.io/badge/License-Gemma-orange)](https://ai.google.dev/gemma/terms)
[![Language](https://img.shields.io/badge/Language-Arabic%20🇸🇦-green)](https://en.wikipedia.org/wiki/Arabic)
</div>
---
## 📋 Table of Contents
- [Model Overview](#-model-overview)
- [Intended Use](#-intended-use)
- [Training Details](#-training-details)
- [Training & Validation Loss](#-training--validation-loss)
- [Hardware & Infrastructure](#-hardware--infrastructure)
- [Dataset](#-dataset)
- [Quickstart / Inference](#-quickstart--inference)
- [Output Format](#-output-format)
- [Limitations](#-limitations)
- [Authors](#-authors)
- [Citation](#-citation)
- [License](#-license)
---
## 🧠 Model Overview
| Attribute | Value |
|-------------------------|--------------------------------------------|
| **Base Model** | `google/gemma-3-4b-it` |
| **Task** | Arabic Semantic Text Segmentation |
| **Fine-tuning Method** | Supervised Fine-Tuning (SFT) with LoRA |
| **Precision** | 4-bit NF4 quantisation (QLoRA) |
| **Vocabulary Size** | 262,144 tokens |
| **Max Sequence Length** | 2,048 tokens |
| **Trainable Parameters**| 32,788,480 (0.76% of 4.33B total) |
| **Framework** | Unsloth + Hugging Face TRL |
This model is a LoRA adapter merged into the base `google/gemma-3-4b-it` weights (saved in 16-bit precision for compatibility with vLLM and standard `transformers` pipelines). Given an Arabic paragraph or document, the model outputs a structured JSON object containing an ordered list of semantically self-contained sentences — with zero paraphrasing and zero hallucination of content.
---
## 🎯 Intended Use
This model is designed for **any Arabic NLP pipeline that benefits from precise sentence-level granularity**:
- **Retrieval-Augmented Generation (RAG)** — chunk documents into high-quality semantic units before embedding
- **Arabic NLP preprocessing** — replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
- **Corpus annotation** — automatically segment raw Arabic corpora for downstream labelling tasks
- **Information extraction** — isolate individual claims or facts before analysis
- **Search & summarisation** — improve context windows by feeding well-bounded sentence units
> ⚠️ This model is **not** intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.
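For RAG use in particular, the model's sentence list is typically packed into embedding-sized chunks before indexing. A minimal sketch of that post-processing step (the `pack_sentences` helper, the token budget, and the whitespace word count are illustrative choices, not part of this model):

```python
# Hypothetical post-processing: greedily merge consecutive model-produced
# sentences into chunks that stay under an embedding model's token budget.
# Whitespace word count is a crude stand-in for a real tokenizer.

def pack_sentences(sentences: list[str], max_tokens: int = 128) -> list[str]:
    """Greedily merge consecutive sentences into bounded-size chunks."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sent in sentences:
        n = len(sent.split())  # approximate token count
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunk boundaries then always fall on semantic sentence boundaries, embeddings are computed over complete ideas rather than arbitrary character windows.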
---
## 🏋️ Training Details
### LoRA Configuration
| Parameter | Value |
|-------------------------|-----------------------------------------------------------------------------|
| **LoRA Rank (`r`)** | 16 |
| **LoRA Alpha** | 16 |
| **LoRA Dropout** | 0.05 |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Bias** | None |
| **Gradient Checkpointing** | Unsloth (memory-optimised) |
### SFT Hyperparameters
| Parameter | Value |
|------------------------------|--------------------|
| **Epochs** | 5 |
| **Per-device Batch Size** | 2 |
| **Gradient Accumulation** | 16 steps |
| **Effective Batch Size** | 32 |
| **Learning Rate** | 1e-4 |
| **LR Scheduler** | Linear |
| **Warmup Steps** | 10 |
| **Optimiser** | `adamw_8bit` |
| **Weight Decay** | 0.01 |
| **Max Gradient Norm** | 0.3 |
| **Evaluation Strategy** | Every 10 steps |
| **Best Model Metric** | `eval_loss` |
| **Total Training Steps** | 85 |
| **Mixed Precision** | FP16 (T4 GPU) |
| **Random Seed** | 3407 |
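The step count in the table follows directly from the dataset split and the batch settings: 527 training samples at an effective batch of 32 give ⌈527 / 32⌉ = 17 optimiser steps per epoch, and 17 × 5 epochs = 85 total steps. A quick arithmetic check:

```python
import math

# Sanity-check the step count implied by the hyperparameter table.
train_samples = 527
per_device_batch = 2
grad_accum = 16
epochs = 5

effective_batch = per_device_batch * grad_accum               # 32
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 17
total_steps = steps_per_epoch * epochs                        # 85

print(effective_batch, steps_per_epoch, total_steps)  # → 32 17 85
```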
---
## 📉 Training & Validation Loss
The model was evaluated on the held-out validation set every 10 steps throughout training. Both curves converge steadily across all 5 epochs; validation loss bottoms out at step 60 and rises only marginally afterwards, and the best checkpoint is selected on `eval_loss`, so there is no meaningful overfitting.
| Step | Training Loss | Validation Loss |
|:----:|:-------------:|:---------------:|
| 10 | 1.9981 | 1.9311 |
| 20 | 1.3280 | 1.2628 |
| 30 | 1.1018 | 1.0792 |
| 40 | 1.0133 | 0.9678 |
| 50 | 0.9917 | 0.9304 |
| 60 | 0.9053 | 0.8815 |
| 70 | 0.9122 | 0.8845 |
| 80 | 0.8935 | 0.8894 |
| 85 | 0.9160 | 0.8910 |
**Final overall training loss: `1.2197`**
**Best validation loss: `0.8815`** (Step 60)
**Total training time: ~83 minutes 46 seconds**
The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the model plateaus at a stable low loss — a hallmark of well-tuned LoRA fine-tuning on a focused, in-domain task.
---
## 🖥️ Hardware & Infrastructure
| Component | Specification |
|--------------|----------------------------|
| **GPU** | NVIDIA Tesla T4 |
| **VRAM** | 15.6 GB |
| **Peak VRAM Used** | 15.19 GB |
| **Platform** | Google Colab (free tier) |
| **CUDA** | Toolkit 12.8 (GPU compute capability 7.5) |
| **PyTorch** | 2.10.0+cu128 |
---
## 📦 Dataset
The model was fine-tuned on a custom curated dataset of **586 Arabic text samples** (`dataset_final.json`), each consisting of:
- **`prompt`** — a raw Arabic paragraph prefixed with `"Text to split:\n"`
- **`response`** — a gold-standard JSON object `{"sentences": [...]}` containing the correctly segmented sentences
| Split | Samples |
|-----------------|---------|
| **Train** | 527 |
| **Validation** | 59 |
| **Total** | 586 |
The dataset covers a range of Modern Standard Arabic (MSA) domains including science, history, and general knowledge, formatted to enforce strict Gemma 3 chat template conventions.
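For illustration, a single record would look like the following (the paragraph is taken from this card's own quickstart example; whether `response` is stored as a nested object or an escaped JSON string in `dataset_final.json` is an assumption):

```json
{
  "prompt": "Text to split:\nتشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.",
  "response": {
    "sentences": [
      "تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.",
      "وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات."
    ]
  }
}
```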
---
## 🚀 Quickstart / Inference
### Installation
```bash
pip install transformers torch accelerate
```
### Using `transformers` (Recommended)
```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ── Configuration ────────────────────────────────────────────────────────────
MODEL_ID = "marioVIC/arabic-semantic-chunking"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# ── System prompt ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.

Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.

The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

# ── Load model & tokenizer ────────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

# ── Inference function ────────────────────────────────────────────────────────
def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
    """
    Segment an Arabic paragraph into a list of semantic sentences.

    Args:
        text: Raw Arabic text to segment.
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        A list of Arabic sentence strings.
    """
    messages = [
        {"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding → deterministic output
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens
    generated = output_ids[0][inputs["input_ids"].shape[-1]:]
    raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()

    # Parse JSON response
    parsed = json.loads(raw_output)
    return parsed["sentences"]

# ── Example ───────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    arabic_text = (
        "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة "
        "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
        "على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً "
        "ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة "
        "وتوافر كميات ضخمة من البيانات."
    )
    sentences = segment_arabic(arabic_text)
    print(f"✅ Segmented into {len(sentences)} sentence(s):\n")
    for i, sentence in enumerate(sentences, 1):
        print(f"  [{i}] {sentence}")
```
### Expected Output
```
✅ Segmented into 3 sentence(s):
[1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
[2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
[3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.
```
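The quickstart assumes the model emits bare JSON, which the system prompt enforces in the common case. If you want to guard against occasional formatting drift (stray code fences or surrounding prose), a defensive parser such as the hypothetical helper below can replace the raw `json.loads` call:

```python
import json
import re

def parse_sentences(raw_output: str) -> list[str]:
    """Best-effort extraction of the {"sentences": [...]} object.

    Strips accidental markdown fences and surrounding noise before
    handing the remainder to json.loads.
    """
    text = raw_output.strip()
    # Drop ```json ... ``` fences if the model added them despite instructions.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Fall back to the outermost braces if extra prose surrounds the JSON.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end != -1:
        text = text[start:end + 1]
    return json.loads(text)["sentences"]
```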
### Using Unsloth (2× Faster Inference)
```python
import json
from unsloth import FastLanguageModel
from transformers import AutoProcessor
MODEL_ID = "marioVIC/arabic-semantic-chunking"
MAX_SEQ_LENGTH = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")

SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.

Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.

The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

def segment_arabic_unsloth(text: str) -> list[str]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Text to split:\n{text}"},
    ]
    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,
    )
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw)["sentences"]
```
---
## 📤 Output Format
The model always returns a **strict JSON object** with a single key `"sentences"` whose value is an ordered array of strings. Each string is a verbatim span of the original Arabic input, apart from the whitespace normalisation permitted by the system prompt.
```json
{
"sentences": [
"الجملة الأولى.",
"الجملة الثانية.",
"الجملة الثالثة."
]
}
```
**Guarantees:**
- No paraphrasing — every sentence is a verbatim span of the source text
- No hallucination of new content
- No translation, grammar correction, or interpretation
- Deterministic output with `do_sample=False`
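These guarantees can be checked mechanically. The sketch below (`is_verbatim` is my own helper, not shipped with the model) verifies the verbatim-span property, comparing after collapsing whitespace since rule 5 of the system prompt permits whitespace normalisation:

```python
def is_verbatim(source: str, sentences: list[str]) -> bool:
    """Check that every returned sentence occurs in the source text.

    Comparison collapses runs of whitespace, because the model may
    normalise whitespace but must not alter the words themselves.
    """
    normalised_source = " ".join(source.split())
    return all(" ".join(s.split()) in normalised_source for s in sentences)
```

Running such a check over model outputs is a cheap way to catch any segmentation that drifted from the source.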
---
## ⚠️ Limitations
- **Domain scope** — Trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
- **Dataset size** — The training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
- **Context length** — Inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation.
- **Language exclusivity** — This model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
- **Base model license** — Usage is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Commercial use requires compliance with those terms.
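For the context-length limitation in particular, one pragmatic workaround is to batch paragraphs before calling the segmenter. A rough sketch, using whitespace word count as a stand-in for real token counts (`split_long_document` and the 600-word budget are illustrative assumptions):

```python
def split_long_document(text: str, max_words: int = 600) -> list[str]:
    """Group blank-line-separated paragraphs into batches that should
    stay well under the 2,048-token context once the prompt template is
    added. Word count is a rough proxy; a real tokenizer is more precise."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    batches: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            batches.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        batches.append("\n\n".join(current))
    return batches
```

Each batch can then be passed to `segment_arabic` independently and the resulting sentence lists concatenated in order.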
---
## 👥 Authors
This model was developed and trained by:
| Name | Role |
|------|------|
| **Omar Abdelmoniem** | Model development, training pipeline, LoRA configuration |
| **Mariam Emad** | Dataset curation, system prompt engineering, evaluation |
---
## 📖 Citation
If you use this model in your research or applications, please cite it as follows:
```bibtex
@misc{abdelmoniem2025arabicsemantic,
title = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
author = {Abdelmoniem, Omar and Emad, Mariam},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
}
```
---
## 📜 License
This model inherits the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)** from the base `google/gemma-3-4b-it` model. By using this model, you agree to those terms.
The fine-tuning code, dataset format, and system prompt design are released under the **MIT License**.
---
<div align="center">
Made with ❤️ for the Arabic NLP community
*Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) · Built on [Gemma 3](https://ai.google.dev/gemma) · Powered by [Hugging Face 🤗](https://huggingface.co)*
</div>