🔤 Gemma-3-4B Arabic Semantic Chunker

A fine-tuned google/gemma-3-4b-it model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.

🧠 Model Overview

| Attribute | Value |
|---|---|
| Base Model | google/gemma-3-4b-it |
| Task | Arabic Semantic Text Segmentation |
| Fine-tuning Method | Supervised Fine-Tuning (SFT) with LoRA |
| Precision | 4-bit NF4 quantisation (QLoRA) |
| Vocabulary Size | 262,144 tokens |
| Max Sequence Length | 2,048 tokens |
| Trainable Parameters | 32,788,480 (0.76% of 4.33B total) |
| Framework | Unsloth + Hugging Face TRL |

This model is a LoRA adapter merged into the base google/gemma-3-4b-it weights (saved in 16-bit precision for compatibility with vLLM and standard transformers pipelines). Given an Arabic paragraph or document, the model outputs a structured JSON object containing an ordered list of semantically self-contained sentences — with zero paraphrasing and zero hallucination of content.


🎯 Intended Use

This model is designed for any Arabic NLP pipeline that benefits from precise sentence-level granularity:

  • Retrieval-Augmented Generation (RAG) — chunk documents into high-quality semantic units before embedding
  • Arabic NLP preprocessing — replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
  • Corpus annotation — automatically segment raw Arabic corpora for downstream labelling tasks
  • Information extraction — isolate individual claims or facts before analysis
  • Search & summarisation — improve context windows by feeding well-bounded sentence units

⚠️ This model is not intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.
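
For the RAG use case above, the model's sentence list usually still needs to be packed into embedding-sized chunks. A minimal sketch of greedy packing (the `pack_sentences` helper and the character budget are illustrative, not part of the model):

```python
def pack_sentences(sentences: list[str], max_chars: int = 300) -> list[str]:
    """Greedily pack consecutive sentences into chunks under a size budget,
    so semantic units are never split mid-sentence."""
    chunks, current = [], ""
    for s in sentences:
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Because chunk boundaries always coincide with sentence boundaries produced by the segmenter, each embedded chunk stays semantically self-contained.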


🏋️ Training Details

LoRA Configuration

| Parameter | Value |
|---|---|
| LoRA Rank (r) | 16 |
| LoRA Alpha | 16 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Bias | None |
| Gradient Checkpointing | Unsloth (memory-optimised) |
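
For reference, this table corresponds roughly to the following `peft` `LoraConfig` (a sketch only — training actually went through Unsloth's `get_peft_model` wrapper, so the exact call differs):

```python
from peft import LoraConfig

# Approximate equivalent of the LoRA settings in the table above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```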

SFT Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 5 |
| Per-device Batch Size | 2 |
| Gradient Accumulation | 16 steps |
| Effective Batch Size | 32 |
| Learning Rate | 1e-4 |
| LR Scheduler | Linear |
| Warmup Steps | 10 |
| Optimiser | adamw_8bit |
| Weight Decay | 0.01 |
| Max Gradient Norm | 0.3 |
| Evaluation Strategy | Every 10 steps |
| Best Model Metric | eval_loss |
| Total Training Steps | 85 |
| Mixed Precision | FP16 (T4 GPU) |
| Random Seed | 3407 |
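
A quick sanity check of the step count implied by these numbers (assuming the final partial batch of each epoch is kept rather than dropped):

```python
import math

train_samples = 527      # size of the training split
per_device_batch = 2
grad_accum = 16
epochs = 5

effective_batch = per_device_batch * grad_accum            # 2 × 16 = 32
steps_per_epoch = math.ceil(train_samples / effective_batch)  # ceil(527/32) = 17
total_steps = steps_per_epoch * epochs                     # 17 × 5 = 85

print(effective_batch, steps_per_epoch, total_steps)  # 32 17 85
```

This reproduces both the effective batch size and the 85 total training steps in the table.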

📉 Training & Validation Loss

The model was evaluated on the held-out validation set every 10 steps throughout training. Both curves converge steadily across all 5 epochs; validation loss bottoms out at step 60 and rises only marginally afterwards, and since the best checkpoint is selected on eval_loss, there is no meaningful overfitting in the final model.

| Step | Training Loss | Validation Loss |
|---|---|---|
| 10 | 1.9981 | 1.9311 |
| 20 | 1.3280 | 1.2628 |
| 30 | 1.1018 | 1.0792 |
| 40 | 1.0133 | 0.9678 |
| 50 | 0.9917 | 0.9304 |
| 60 | 0.9053 | 0.8815 |
| 70 | 0.9122 | 0.8845 |
| 80 | 0.8935 | 0.8894 |
| 85 | 0.9160 | 0.8910 |

Final overall training loss: 1.2197
Best validation loss: 0.8815 (Step 60)
Total training time: ~83 minutes 46 seconds

The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the model plateaus at a stable low loss — a hallmark of well-tuned LoRA fine-tuning on a focused, in-domain task.


🖥️ Hardware & Infrastructure

| Component | Specification |
|---|---|
| GPU | NVIDIA Tesla T4 |
| VRAM | 15.6 GB |
| Peak VRAM Used | 15.19 GB |
| Platform | Google Colab (free tier) |
| CUDA | 12.8 (compute capability 7.5) |
| PyTorch | 2.10.0+cu128 |

📦 Dataset

The model was fine-tuned on a custom-curated dataset of 586 Arabic text samples (dataset_final.json), each consisting of:

  • prompt — a raw Arabic paragraph prefixed with "Text to split:\n"
  • response — a gold-standard JSON object {"sentences": [...]} containing the correctly segmented sentences

| Split | Samples |
|---|---|
| Train | 527 |
| Validation | 59 |
| Total | 586 |

The dataset covers a range of Modern Standard Arabic (MSA) domains including science, history, and general knowledge, formatted to enforce strict Gemma 3 chat template conventions.
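
As an illustration, a single record in the format described above might look like the following (the Arabic content is invented for the example; the exact serialisation inside dataset_final.json may differ):

```python
import json

# One hypothetical training record in the prompt/response format described above.
record = {
    "prompt": "Text to split:\nالذكاء الاصطناعي مجال واسع. يتطور بسرعة.",
    "response": json.dumps(
        {"sentences": ["الذكاء الاصطناعي مجال واسع.", "يتطور بسرعة."]},
        ensure_ascii=False,
    ),
}

# Basic validation: the response must parse to a {"sentences": [...]} object.
parsed = json.loads(record["response"])
assert isinstance(parsed["sentences"], list)
assert all(isinstance(s, str) for s in parsed["sentences"])
```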


🚀 Quickstart / Inference

Installation

pip install transformers torch accelerate

Using transformers (Recommended)

import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ── Configuration ────────────────────────────────────────────────────────────
MODEL_ID = "marioVIC/arabic-semantic-chunking"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"

# ── System prompt ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

# ── Load model & tokenizer ────────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

# ── Inference function ────────────────────────────────────────────────────────
def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
    """
    Segment an Arabic paragraph into a list of semantic sentences.

    Args:
        text:           Raw Arabic text to segment.
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        A list of Arabic sentence strings.
    """
    messages = [
        {"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding → deterministic output
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens
    generated = output_ids[0][inputs["input_ids"].shape[-1]:]
    raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()

    # Parse the JSON response; tolerate stray code fences around the output
    if raw_output.startswith("```"):
        raw_output = raw_output.strip("`").removeprefix("json").strip()
    parsed = json.loads(raw_output)
    return parsed["sentences"]


# ── Example ────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    arabic_text = (
        "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة "
        "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
        "على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً "
        "ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة "
        "وتوافر كميات ضخمة من البيانات."
    )

    sentences = segment_arabic(arabic_text)

    print(f"✅ Segmented into {len(sentences)} sentence(s):\n")
    for i, sentence in enumerate(sentences, 1):
        print(f"  [{i}] {sentence}")

Expected Output

✅ Segmented into 3 sentence(s):

  [1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
  [2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
  [3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.

Using Unsloth (2× Faster Inference)

import json
from unsloth import FastLanguageModel
from transformers import AutoProcessor

MODEL_ID       = "marioVIC/arabic-semantic-chunking"
MAX_SEQ_LENGTH = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = MODEL_ID,
    max_seq_length = MAX_SEQ_LENGTH,
    dtype          = None,       # auto-detect
    load_in_4bit   = True,
)
FastLanguageModel.for_inference(model)

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")

SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

def segment_arabic_unsloth(text: str) -> list[str]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": f"Text to split:\n{text}"},
    ]

    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,
    )

    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw)["sentences"]

📤 Output Format

The model always returns a strict JSON object with a single key "sentences" whose value is an ordered array of strings. Each string is an exact substring of the original Arabic input.

{
  "sentences": [
    "الجملة الأولى.",
    "الجملة الثانية.",
    "الجملة الثالثة."
  ]
}

Guarantees:

  • No paraphrasing — every sentence is a verbatim span of the source text
  • No hallucination of new content
  • No translation, grammar correction, or interpretation
  • Deterministic output with do_sample=False
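
Because every sentence is guaranteed to be a verbatim span, outputs can be verified cheaply at inference time. A minimal sketch of such a check (the helper name is ours; whitespace is normalised on both sides because rule 5 of the system prompt allows the model to collapse excessive spaces and newlines):

```python
def is_verbatim(source: str, sentences: list[str]) -> bool:
    """Check that every returned sentence appears verbatim in the source text,
    up to whitespace normalisation."""
    normalised_source = " ".join(source.split())
    return all(" ".join(s.split()) in normalised_source for s in sentences)
```

Rejecting (or at least logging) any response that fails this check is a cheap safeguard against the rare malformed generation.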

⚠️ Limitations

  • Domain scope — Trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
  • Dataset size — The training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
  • Context length — Inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation.
  • Language exclusivity — This model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
  • Base model license — Usage is subject to Google's Gemma Terms of Use. Commercial use requires compliance with those terms.
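
For the context-length limitation above, one simple approach is to pre-chunk long documents at paragraph boundaries before calling the segmenter. A rough sketch, using word count as a cheap proxy for token count (the function name and budget are illustrative):

```python
def prechunk(document: str, max_words: int = 800) -> list[str]:
    """Split a long document into paragraph-aligned pieces under a word budget,
    so each piece fits comfortably inside the model's 2,048-token window."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    pieces, current, count = [], [], 0
    for p in paragraphs:
        words = len(p.split())
        # Close the current piece if this paragraph would overflow the budget.
        if current and count + words > max_words:
            pieces.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += words
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```

Each piece can then be passed to the segmenter independently and the resulting sentence lists concatenated in order.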

👥 Authors

This model was developed and trained by:

| Name | Role |
|---|---|
| Omar Abdelmoniem | Model development, training pipeline, LoRA configuration |
| Mariam Emad | Dataset curation, system prompt engineering, evaluation |

📖 Citation

If you use this model in your research or applications, please cite it as follows:

@misc{abdelmoniem2025arabicsemantic,
  title        = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
  author       = {Abdelmoniem, Omar and Emad, Mariam},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
}

📜 License

This model inherits the Gemma Terms of Use from the base google/gemma-3-4b-it model. By using this model, you agree to those terms.

The fine-tuning code, dataset format, and system prompt design are released under the MIT License.


Made with ❤️ for the Arabic NLP community

Fine-tuned with Unsloth · Built on Gemma 3 · Powered by Hugging Face 🤗
