---
language:
- ar
license: gemma
base_model: google/gemma-3-4b-it
tags:
- arabic
- nlp
- text-segmentation
- semantic-chunking
- gemma3
- lora
- unsloth
- fine-tuned
- rag
- information-retrieval
pipeline_tag: text-generation
library_name: transformers
inference: true
---
<div align="center">
# 🤗 Gemma-3-4B Arabic Semantic Chunker
**A fine-tuned `google/gemma-3-4b-it` model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.**
[Model on the Hub](https://huggingface.co/marioVIC/arabic-semantic-chunking) ·
[Base model: google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) ·
[License: Gemma](https://ai.google.dev/gemma/terms) ·
[Language: Arabic](https://en.wikipedia.org/wiki/Arabic)
</div>
---
## 📑 Table of Contents
- [Model Overview](#-model-overview)
- [Intended Use](#-intended-use)
- [Training Details](#-training-details)
- [Training & Validation Loss](#-training--validation-loss)
- [Hardware & Infrastructure](#-hardware--infrastructure)
- [Dataset](#-dataset)
- [Quickstart / Inference](#-quickstart--inference)
- [Output Format](#-output-format)
- [Limitations](#-limitations)
- [Authors](#-authors)
- [Citation](#-citation)
- [License](#-license)
---
## 🧠 Model Overview
| Attribute | Value |
|-------------------------|--------------------------------------------|
| **Base Model** | `google/gemma-3-4b-it` |
| **Task** | Arabic Semantic Text Segmentation |
| **Fine-tuning Method** | Supervised Fine-Tuning (SFT) with LoRA |
| **Precision** | 4-bit NF4 quantisation (QLoRA) |
| **Vocabulary Size** | 262,144 tokens |
| **Max Sequence Length** | 2,048 tokens |
| **Trainable Parameters**| 32,788,480 (0.76% of 4.33B total) |
| **Framework** | Unsloth + Hugging Face TRL |
This model is a LoRA adapter merged into the base `google/gemma-3-4b-it` weights (saved in 16-bit precision for compatibility with vLLM and standard `transformers` pipelines). Given an Arabic paragraph or document, the model outputs a structured JSON object containing an ordered list of semantically self-contained sentences, with zero paraphrasing and zero hallucination of content.
---
## 🎯 Intended Use
This model is designed for **any Arabic NLP pipeline that benefits from precise sentence-level granularity**:
- **Retrieval-Augmented Generation (RAG)** – chunk documents into high-quality semantic units before embedding
- **Arabic NLP preprocessing** – replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
- **Corpus annotation** – automatically segment raw Arabic corpora for downstream labelling tasks
- **Information extraction** – isolate individual claims or facts before analysis
- **Search & summarisation** – improve context windows by feeding well-bounded sentence units
> ⚠️ This model is **not** intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.
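
Downstream of this model, RAG pipelines usually need chunks larger than a single sentence. A minimal, illustrative helper for packing the segmenter's sentences into embedding-sized chunks (not part of the model; the 500-character budget is an arbitrary placeholder, not a recommendation):

```python
def group_sentences(sentences: list[str], max_chars: int = 500) -> list[str]:
    """Greedily pack consecutive sentences into chunks of at most max_chars,
    keeping sentence boundaries intact."""
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # flush the full chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because sentence order is preserved, chunk boundaries always fall between semantic units rather than mid-idea.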
---
## 🏋️ Training Details
### LoRA Configuration
| Parameter | Value |
|-------------------------|-----------------------------------------------------------------------------|
| **LoRA Rank (`r`)** | 16 |
| **LoRA Alpha** | 16 |
| **LoRA Dropout** | 0.05 |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Bias** | None |
| **Gradient Checkpointing** | Unsloth (memory-optimised) |
### SFT Hyperparameters
| Parameter | Value |
|------------------------------|--------------------|
| **Epochs** | 5 |
| **Per-device Batch Size** | 2 |
| **Gradient Accumulation** | 16 steps |
| **Effective Batch Size** | 32 |
| **Learning Rate** | 1e-4 |
| **LR Scheduler** | Linear |
| **Warmup Steps** | 10 |
| **Optimiser** | `adamw_8bit` |
| **Weight Decay** | 0.01 |
| **Max Gradient Norm** | 0.3 |
| **Evaluation Strategy** | Every 10 steps |
| **Best Model Metric** | `eval_loss` |
| **Total Training Steps** | 85 |
| **Mixed Precision** | FP16 (T4 GPU) |
| **Random Seed** | 3407 |
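
In TRL terms, the hyperparameters above translate roughly to the following `SFTConfig` (argument names follow recent TRL releases and may differ slightly in older versions; a sketch, not the exact training script):

```python
from trl import SFTConfig

training_args = SFTConfig(
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,  # effective batch size: 2 * 16 = 32
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    max_grad_norm=0.3,
    eval_strategy="steps",
    eval_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    fp16=True,
    seed=3407,
    output_dir="outputs",
)
```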
---
## 📉 Training & Validation Loss
The model was evaluated on the held-out validation set every 10 steps throughout training. Both curves converge steadily across all 5 epochs; validation loss bottoms out at step 60 and drifts up only marginally afterwards, indicating at most mild overfitting late in training.
| Step | Training Loss | Validation Loss |
|:----:|:-------------:|:---------------:|
| 10 | 1.9981 | 1.9311 |
| 20 | 1.3280 | 1.2628 |
| 30 | 1.1018 | 1.0792 |
| 40 | 1.0133 | 0.9678 |
| 50 | 0.9917 | 0.9304 |
| 60 | 0.9053 | 0.8815 |
| 70 | 0.9122 | 0.8845 |
| 80 | 0.8935 | 0.8894 |
| 85 | 0.9160 | 0.8910 |
**Final overall training loss: `1.2197`**
**Best validation loss: `0.8815`** (Step 60)
**Total training time: ~83 minutes 46 seconds**
The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the model plateaus at a stable low loss, a hallmark of well-tuned LoRA fine-tuning on a focused, in-domain task.
---
## 🖥️ Hardware & Infrastructure
| Component | Specification |
|--------------|----------------------------|
| **GPU** | NVIDIA Tesla T4 |
| **VRAM** | 15.6 GB |
| **Peak VRAM Used** | 15.19 GB |
| **Platform** | Google Colab (free tier) |
| **CUDA** | Toolkit 12.8 (compute capability 7.5) |
| **PyTorch** | 2.10.0+cu128 |
---
## 📦 Dataset
The model was fine-tuned on a custom curated dataset of **586 Arabic text samples** (`dataset_final.json`), each consisting of:
- **`prompt`** – a raw Arabic paragraph prefixed with `"Text to split:\n"`
- **`response`** – a gold-standard JSON object `{"sentences": [...]}` containing the correctly segmented sentences
| Split | Samples |
|-----------------|---------|
| **Train** | 527 |
| **Validation** | 59 |
| **Total** | 586 |
The dataset covers a range of Modern Standard Arabic (MSA) domains including science, history, and general knowledge, formatted to enforce strict Gemma 3 chat template conventions.
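
For concreteness, a single record follows this shape (the Arabic below is an invented illustration, not an actual dataset sample):

```json
{
  "prompt": "Text to split:\nالذكاء الاصطناعي يتطور بسرعة. وهو يغير طريقة عملنا.",
  "response": "{\"sentences\": [\"الذكاء الاصطناعي يتطور بسرعة.\", \"وهو يغير طريقة عملنا.\"]}"
}
```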
---
## 🚀 Quickstart / Inference
### Installation
```bash
pip install transformers torch accelerate
```
### Using `transformers` (Recommended)
```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# ── Configuration ─────────────────────────────────────────────────────────────
MODEL_ID = "marioVIC/arabic-semantic-chunking"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# ── System prompt ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly – do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object – no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

# ── Load model & tokenizer ────────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

# ── Inference function ────────────────────────────────────────────────────────
def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
    """
    Segment an Arabic paragraph into a list of semantic sentences.

    Args:
        text: Raw Arabic text to segment.
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        A list of Arabic sentence strings.
    """
    # The instructions are prepended to the user turn rather than sent as a
    # separate system message.
    messages = [
        {"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for deterministic output
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens
    generated = output_ids[0][inputs["input_ids"].shape[-1]:]
    raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()
    # Parse JSON response
    parsed = json.loads(raw_output)
    return parsed["sentences"]

# ── Example ───────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    arabic_text = (
        "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب ويهتم بتطوير أنظمة "
        "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
        "على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً "
        "ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة "
        "وتوافر كميات ضخمة من البيانات."
    )
    sentences = segment_arabic(arabic_text)
    print(f"✅ Segmented into {len(sentences)} sentence(s):\n")
    for i, sentence in enumerate(sentences, 1):
        print(f"  [{i}] {sentence}")
```
### Expected Output
```
✅ Segmented into 3 sentence(s):

  [1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب ويهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
  [2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
  [3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.
```
### Using Unsloth (2× Faster Inference)
```python
import json

from transformers import AutoProcessor
from unsloth import FastLanguageModel

MODEL_ID = "marioVIC/arabic-semantic-chunking"
MAX_SEQ_LENGTH = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,          # auto-detect
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")

SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly – do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object – no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

def segment_arabic_unsloth(text: str) -> list[str]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Text to split:\n{text}"},
    ]
    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,
    )
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw)["sentences"]
```
---
## 🤖 Output Format
The model always returns a **strict JSON object** with a single key `"sentences"` whose value is an ordered array of strings. Each string is an exact substring of the original Arabic input.
```json
{
  "sentences": [
    "الجملة الأولى.",
    "الجملة الثانية.",
    "الجملة الثالثة."
  ]
}
```
**Guarantees:**
- No paraphrasing – every sentence is a verbatim span of the source text
- No hallucination of new content
- No translation, grammar correction, or interpretation
- Deterministic output with `do_sample=False`
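
Because every sentence is claimed to be a verbatim span, the output is cheap to validate downstream. A small, illustrative sanity check (whitespace is collapsed on both sides, since rule 5 allows the model to normalise whitespace):

```python
import re

def is_verbatim(source: str, sentences: list[str]) -> bool:
    """Return True if every sentence occurs verbatim in the source,
    comparing with runs of whitespace collapsed to single spaces."""
    def norm(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()
    flat = norm(source)
    return all(norm(s) in flat for s in sentences)
```

Segmentations that fail this check can be routed to a retry or a rule-based fallback splitter.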
---
## ⚠️ Limitations
- **Domain scope** – Trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
- **Dataset size** – The training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
- **Context length** – Inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation.
- **Language exclusivity** – This model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
- **Base model license** – Usage is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Commercial use requires compliance with those terms.
---
## 👥 Authors
This model was developed and trained by:
| Name | Role |
|------|------|
| **Omar Abdelmoniem** | Model development, training pipeline, LoRA configuration |
| **Mariam Emad** | Dataset curation, system prompt engineering, evaluation |
---
## 📖 Citation
If you use this model in your research or applications, please cite it as follows:
```bibtex
@misc{abdelmoniem2025arabicsemantic,
title = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
author = {Abdelmoniem, Omar and Emad, Mariam},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
}
```
---
## 📄 License
This model inherits the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)** from the base `google/gemma-3-4b-it` model. By using this model, you agree to those terms.
The fine-tuning code, dataset format, and system prompt design are released under the **MIT License**.
---
<div align="center">
Made with ❤️ for the Arabic NLP community

*Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) · Built on [Gemma 3](https://ai.google.dev/gemma) · Powered by [Hugging Face 🤗](https://huggingface.co)*
</div> |