---
language:
- ar
license: gemma
base_model: google/gemma-3-4b-it
tags:
- arabic
- nlp
- text-segmentation
- semantic-chunking
- gemma3
- lora
- unsloth
- fine-tuned
- rag
- information-retrieval
pipeline_tag: text-generation
library_name: transformers
inference: true
---

<div align="center">

# 🔤 Gemma-3-4B Arabic Semantic Chunker

**A fine-tuned `google/gemma-3-4b-it` model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.**

[Model on Hugging Face](https://huggingface.co/marioVIC/arabic-semantic-chunking) ·
[Base model: `google/gemma-3-4b-it`](https://huggingface.co/google/gemma-3-4b-it) ·
[License: Gemma](https://ai.google.dev/gemma/terms) ·
[Language: Arabic](https://en.wikipedia.org/wiki/Arabic)

</div>

---

## 📋 Table of Contents

- [Model Overview](#-model-overview)
- [Intended Use](#-intended-use)
- [Training Details](#-training-details)
- [Training & Validation Loss](#-training--validation-loss)
- [Hardware & Infrastructure](#-hardware--infrastructure)
- [Dataset](#-dataset)
- [Quickstart / Inference](#-quickstart--inference)
- [Output Format](#-output-format)
- [Limitations](#-limitations)
- [Authors](#-authors)
- [Citation](#-citation)
- [License](#-license)

---
## 🧠 Model Overview

| Attribute | Value |
|--------------------------|--------------------------------------------|
| **Base Model** | `google/gemma-3-4b-it` |
| **Task** | Arabic Semantic Text Segmentation |
| **Fine-tuning Method** | Supervised Fine-Tuning (SFT) with LoRA |
| **Precision** | 4-bit NF4 quantisation (QLoRA) |
| **Vocabulary Size** | 262,144 tokens |
| **Max Sequence Length** | 2,048 tokens |
| **Trainable Parameters** | 32,788,480 (0.76% of 4.33B total) |
| **Framework** | Unsloth + Hugging Face TRL |

This model is a LoRA adapter merged into the base `google/gemma-3-4b-it` weights (saved in 16-bit precision for compatibility with vLLM and standard `transformers` pipelines). Given an Arabic paragraph or document, the model outputs a structured JSON object containing an ordered list of semantically self-contained sentences — with zero paraphrasing and zero hallucination of content.

---

## 🎯 Intended Use

This model is designed for **any Arabic NLP pipeline that benefits from precise sentence-level granularity**:

- **Retrieval-Augmented Generation (RAG)** — chunk documents into high-quality semantic units before embedding
- **Arabic NLP preprocessing** — replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
- **Corpus annotation** — automatically segment raw Arabic corpora for downstream labelling tasks
- **Information extraction** — isolate individual claims or facts before analysis
- **Search & summarisation** — improve context windows by feeding well-bounded sentence units

> ⚠️ This model is **not** intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.

---

## 🏋️ Training Details

### LoRA Configuration

| Parameter | Value |
|-------------------------|-----------------------------------------------------------------------------|
| **LoRA Rank (`r`)** | 16 |
| **LoRA Alpha** | 16 |
| **LoRA Dropout** | 0.05 |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Bias** | None |
| **Gradient Checkpointing** | Unsloth (memory-optimised) |

### SFT Hyperparameters

| Parameter | Value |
|------------------------------|--------------------|
| **Epochs** | 5 |
| **Per-device Batch Size** | 2 |
| **Gradient Accumulation** | 16 steps |
| **Effective Batch Size** | 32 |
| **Learning Rate** | 1e-4 |
| **LR Scheduler** | Linear |
| **Warmup Steps** | 10 |
| **Optimiser** | `adamw_8bit` |
| **Weight Decay** | 0.01 |
| **Max Gradient Norm** | 0.3 |
| **Evaluation Strategy** | Every 10 steps |
| **Best Model Metric** | `eval_loss` |
| **Total Training Steps** | 85 |
| **Mixed Precision** | FP16 (T4 GPU) |
| **Random Seed** | 3407 |
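
For reference, the tables above map roughly onto the following Unsloth/TRL setup. This is an illustrative sketch, not the actual training script: `train_ds` and `eval_ds` are placeholders for the formatted splits, and exact argument names may differ slightly across `trl` versions.

```python
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-4b-it",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit NF4 base weights
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,  # placeholder: formatted training split
    eval_dataset=eval_ds,    # placeholder: formatted validation split
    args=SFTConfig(
        num_train_epochs=5,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,  # effective batch size: 2 × 16 = 32
        learning_rate=1e-4,
        lr_scheduler_type="linear",
        warmup_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        max_grad_norm=0.3,
        eval_strategy="steps",
        eval_steps=10,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        fp16=True,
        seed=3407,
    ),
)
```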

---

## 📉 Training & Validation Loss

The model was evaluated on the held-out validation set every 10 steps throughout training. Both curves converge steadily across all 5 epochs; validation loss bottoms out at step 60 and drifts up only marginally afterwards, which is why the step-60 checkpoint was retained as the best model (per the `eval_loss` criterion above).

| Step | Training Loss | Validation Loss |
|:----:|:-------------:|:---------------:|
| 10 | 1.9981 | 1.9311 |
| 20 | 1.3280 | 1.2628 |
| 30 | 1.1018 | 1.0792 |
| 40 | 1.0133 | 0.9678 |
| 50 | 0.9917 | 0.9304 |
| 60 | 0.9053 | 0.8815 |
| 70 | 0.9122 | 0.8845 |
| 80 | 0.8935 | 0.8894 |
| 85 | 0.9160 | 0.8910 |

**Final overall training loss: `1.2197`**
**Best validation loss: `0.8815`** (step 60)
**Total training time: ~83 minutes 46 seconds**

The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the model plateaus at a stable low loss — typical of LoRA fine-tuning on a focused, in-domain task.

---

## 🖥️ Hardware & Infrastructure

| Component | Specification |
|--------------------|----------------------------|
| **GPU** | NVIDIA Tesla T4 |
| **VRAM** | 15.6 GB |
| **Peak VRAM Used** | 15.19 GB |
| **Platform** | Google Colab (free tier) |
| **CUDA** | 12.8 (compute capability 7.5) |
| **PyTorch** | 2.10.0+cu128 |

---

## 📦 Dataset

The model was fine-tuned on a custom curated dataset of **586 Arabic text samples** (`dataset_final.json`), each consisting of:

- **`prompt`** — a raw Arabic paragraph prefixed with `"Text to split:\n"`
- **`response`** — a gold-standard JSON object `{"sentences": [...]}` containing the correctly segmented sentences
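
Concretely, a record has the following shape. This is an illustrative example only; the Arabic text is invented, not an actual row of `dataset_final.json`:

```python
import json

# Illustrative record in the dataset's prompt/response format
# (made-up Arabic content, not a real dataset row).
record = {
    "prompt": "Text to split:\nالذكاء الاصطناعي مجال واسع. يشمل تعلم الآلة.",
    "response": json.dumps(
        {"sentences": ["الذكاء الاصطناعي مجال واسع.", "يشمل تعلم الآلة."]},
        ensure_ascii=False,
    ),
}

# The response field must itself be valid JSON with a single "sentences" key.
parsed = json.loads(record["response"])
print(parsed["sentences"])
```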

| Split | Samples |
|-----------------|---------|
| **Train** | 527 |
| **Validation** | 59 |
| **Total** | 586 |

The dataset covers a range of Modern Standard Arabic (MSA) domains including science, history, and general knowledge, formatted to enforce strict Gemma 3 chat template conventions.

---

## 🚀 Quickstart / Inference

### Installation

```bash
pip install transformers torch accelerate
```

### Using `transformers` (Recommended)

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ── Configuration ────────────────────────────────────────────────────────────
MODEL_ID = "marioVIC/arabic-semantic-chunking"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# ── System prompt ────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

# ── Load model & tokenizer ───────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

# ── Inference function ───────────────────────────────────────────────────────
def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
    """
    Segment an Arabic paragraph into a list of semantic sentences.

    Args:
        text: Raw Arabic text to segment.
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        A list of Arabic sentence strings.
    """
    messages = [
        {"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding; sampling parameters are ignored
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens
    generated = output_ids[0][inputs["input_ids"].shape[-1]:]
    raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()

    # Parse JSON response
    parsed = json.loads(raw_output)
    return parsed["sentences"]


# ── Example ──────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    arabic_text = (
        "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة "
        "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
        "على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً "
        "ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة "
        "وتوافر كميات ضخمة من البيانات."
    )

    sentences = segment_arabic(arabic_text)

    print(f"✅ Segmented into {len(sentences)} sentence(s):\n")
    for i, sentence in enumerate(sentences, 1):
        print(f"  [{i}] {sentence}")
```

### Expected Output

```
✅ Segmented into 3 sentence(s):

  [1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
  [2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
  [3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.
```
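
The quickstart above parses the model reply with a bare `json.loads`. In practice, instruction-tuned models occasionally ignore rule 6 and wrap the JSON in markdown code fences; a small defensive parser (a suggested hardening, not part of the original snippet) keeps the call site robust:

```python
import json

def extract_sentences(raw_output: str) -> list[str]:
    """Parse the model's JSON reply, tolerating stray markdown code fences."""
    cleaned = raw_output.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (with optional language tag)
        # and the trailing closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit("```", 1)[0]
    return json.loads(cleaned)["sentences"]
```

Swapping `json.loads(raw_output)["sentences"]` for `extract_sentences(raw_output)` costs nothing on well-behaved outputs and avoids a crash on fenced ones.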

### Using Unsloth (2× Faster Inference)

```python
import json
from unsloth import FastLanguageModel
from transformers import AutoProcessor

MODEL_ID = "marioVIC/arabic-semantic-chunking"
MAX_SEQ_LENGTH = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# The base model's processor supplies the Gemma 3 chat template
processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")

SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

def segment_arabic_unsloth(text: str) -> list[str]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Text to split:\n{text}"},
    ]

    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,
    )

    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw)["sentences"]
```

---

## 📤 Output Format

The model always returns a **strict JSON object** with a single key `"sentences"` whose value is an ordered array of strings. Each string reproduces a span of the original Arabic input verbatim, apart from the whitespace normalisation the prompt permits.

```json
{
  "sentences": [
    "الجملة الأولى.",
    "الجملة الثانية.",
    "الجملة الثالثة."
  ]
}
```

**Guarantees:**
- No paraphrasing — every sentence is a verbatim span of the source text (whitespace normalisation aside)
- No hallucination of new content
- No translation, grammar correction, or interpretation
- Deterministic output with `do_sample=False`
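
Because whitespace normalisation is the only transformation the prompt allows, these properties can be checked mechanically after every call. A possible verification helper (illustrative; not shipped with the model):

```python
def is_verbatim(source: str, sentences: list[str]) -> bool:
    """Return True if the sentences, in order, reproduce the source text
    up to whitespace normalisation (the only edit the prompt allows)."""
    def norm(s: str) -> str:
        # Collapse all runs of whitespace/newlines to single spaces.
        return " ".join(s.split())
    return norm(" ".join(sentences)) == norm(source)
```

Running this on each response gives a cheap end-to-end check that no content was dropped, reordered, or paraphrased.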

---

## ⚠️ Limitations

- **Domain scope** — Trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
- **Dataset size** — The training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
- **Context length** — Inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation.
- **Language exclusivity** — This model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
- **Base model license** — Usage is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Commercial use requires compliance with those terms.
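
For documents above the context limit, one workable mitigation (a sketch, not part of the released pipeline) is to pre-split on paragraph boundaries and pack paragraphs greedily into token-budgeted windows, then segment each window separately. Here `count_tokens` stands for any token-counting callable, e.g. `lambda s: len(tokenizer(s)["input_ids"])` with the model's tokenizer:

```python
def pack_paragraphs(text: str, count_tokens, max_tokens: int = 1500) -> list[str]:
    """Greedily pack paragraphs into windows of at most max_tokens tokens,
    so each window fits well inside the model's 2,048-token context.
    A single paragraph longer than max_tokens still becomes its own window."""
    windows: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = count_tokens(para)
        if current and current_len + n > max_tokens:
            # Budget exceeded: close the current window and start a new one.
            windows.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        windows.append("\n\n".join(current))
    return windows
```

Each returned window can then be passed to the segmentation function independently and the resulting sentence lists concatenated in order.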

---

## 👥 Authors

This model was developed and trained by:

| Name | Role |
|------|------|
| **Omar Abdelmoniem** | Model development, training pipeline, LoRA configuration |
| **Mariam Emad** | Dataset curation, system prompt engineering, evaluation |

---

## 📖 Citation

If you use this model in your research or applications, please cite it as follows:

```bibtex
@misc{abdelmoniem2025arabicsemantic,
  title        = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
  author       = {Abdelmoniem, Omar and Emad, Mariam},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
}
```

---

## 📜 License

This model inherits the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)** from the base `google/gemma-3-4b-it` model. By using this model, you agree to those terms.

The fine-tuning code, dataset format, and system prompt design are released under the **MIT License**.

---

<div align="center">

Made with ❤️ for the Arabic NLP community

*Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) · Built on [Gemma 3](https://ai.google.dev/gemma) · Powered by [Hugging Face 🤗](https://huggingface.co)*

</div>