---
language:
- ar
license: gemma
base_model: google/gemma-3-4b-it
tags:
- arabic
- nlp
- text-segmentation
- semantic-chunking
- gemma3
- lora
- unsloth
- fine-tuned
- rag
- information-retrieval
pipeline_tag: text-generation
library_name: transformers
inference: true
---
# 🔤 Gemma-3-4B Arabic Semantic Chunker

**A fine-tuned `google/gemma-3-4b-it` model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.**

[![Model on HF](https://img.shields.io/badge/🤗%20Hugging%20Face-arabic--semantic--chunking-yellow)](https://huggingface.co/marioVIC/arabic-semantic-chunking)
[![Base Model](https://img.shields.io/badge/Base%20Model-google%2Fgemma--3--4b--it-blue)](https://huggingface.co/google/gemma-3-4b-it)
[![License](https://img.shields.io/badge/License-Gemma-orange)](https://ai.google.dev/gemma/terms)
[![Language](https://img.shields.io/badge/Language-Arabic%20🇸🇦-green)](https://en.wikipedia.org/wiki/Arabic)
---

## 📋 Table of Contents

- [Model Overview](#-model-overview)
- [Intended Use](#-intended-use)
- [Training Details](#-training-details)
- [Training & Validation Loss](#-training--validation-loss)
- [Hardware & Infrastructure](#-hardware--infrastructure)
- [Dataset](#-dataset)
- [Quickstart / Inference](#-quickstart--inference)
- [Output Format](#-output-format)
- [Limitations](#-limitations)
- [Authors](#-authors)
- [Citation](#-citation)
- [License](#-license)

---

## 🧠 Model Overview

| Attribute | Value |
|-------------------------|--------------------------------------------|
| **Base Model** | `google/gemma-3-4b-it` |
| **Task** | Arabic Semantic Text Segmentation |
| **Fine-tuning Method** | Supervised Fine-Tuning (SFT) with LoRA |
| **Precision** | 4-bit NF4 quantisation (QLoRA) |
| **Vocabulary Size** | 262,144 tokens |
| **Max Sequence Length** | 2,048 tokens |
| **Trainable Parameters** | 32,788,480 (0.76% of 4.33B total) |
| **Framework** | Unsloth + Hugging Face TRL |

This model is a LoRA adapter merged into the base `google/gemma-3-4b-it` weights, saved in 16-bit precision for compatibility with vLLM and standard `transformers` pipelines. Given an Arabic paragraph or document, the model outputs a structured JSON object containing an ordered list of semantically self-contained sentences, with zero paraphrasing and zero hallucination of content.
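Because every output sentence is meant to be a verbatim span of the input, the no-paraphrasing property can be checked mechanically. A minimal sketch of such a check (the `is_verbatim_segmentation` helper is illustrative, not part of the released code); whitespace is normalised on both sides because the model is allowed to collapse excessive whitespace, but never to alter words:

```python
import re


def is_verbatim_segmentation(source: str, sentences: list[str]) -> bool:
    """Check that every sentence appears, in order, verbatim in the source.

    Whitespace is normalised before comparison, since the model may collapse
    excessive whitespace/newlines without changing any words.
    """
    normalised_source = re.sub(r"\s+", " ", source).strip()
    cursor = 0
    for sentence in sentences:
        normalised = re.sub(r"\s+", " ", sentence).strip()
        position = normalised_source.find(normalised, cursor)
        if position == -1:
            return False  # sentence missing, reordered, or paraphrased
        cursor = position + len(normalised)
    return True


# Toy two-sentence example:
text = "الجملة الأولى. الجملة الثانية."
assert is_verbatim_segmentation(text, ["الجملة الأولى.", "الجملة الثانية."])
assert not is_verbatim_segmentation(text, ["جملة مختلفة."])
```

A check like this is a cheap guardrail to run after `json.loads` on the model output before passing chunks downstream.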
---

## 🎯 Intended Use

This model is designed for **any Arabic NLP pipeline that benefits from precise sentence-level granularity**:

- **Retrieval-Augmented Generation (RAG)**: chunk documents into high-quality semantic units before embedding
- **Arabic NLP preprocessing**: replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
- **Corpus annotation**: automatically segment raw Arabic corpora for downstream labelling tasks
- **Information extraction**: isolate individual claims or facts before analysis
- **Search & summarisation**: improve context windows by feeding well-bounded sentence units

> ⚠️ This model is **not** intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.

---

## 🏋️ Training Details

### LoRA Configuration

| Parameter | Value |
|-------------------------|-----------------------------------------------------------------------------|
| **LoRA Rank (`r`)** | 16 |
| **LoRA Alpha** | 16 |
| **LoRA Dropout** | 0.05 |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Bias** | None |
| **Gradient Checkpointing** | Unsloth (memory-optimised) |

### SFT Hyperparameters

| Parameter | Value |
|------------------------------|--------------------|
| **Epochs** | 5 |
| **Per-device Batch Size** | 2 |
| **Gradient Accumulation** | 16 steps |
| **Effective Batch Size** | 32 |
| **Learning Rate** | 1e-4 |
| **LR Scheduler** | Linear |
| **Warmup Steps** | 10 |
| **Optimiser** | `adamw_8bit` |
| **Weight Decay** | 0.01 |
| **Max Gradient Norm** | 0.3 |
| **Evaluation Strategy** | Every 10 steps |
| **Best Model Metric** | `eval_loss` |
| **Total Training Steps** | 85 |
| **Mixed Precision** | FP16 (T4 GPU) |
| **Random Seed** | 3407 |

---

## 📉 Training & Validation Loss

The model was evaluated on the held-out validation set
every 10 steps throughout training. Both curves show consistent, stable convergence across all 5 epochs with no signs of overfitting.

| Step | Training Loss | Validation Loss |
|:----:|:-------------:|:---------------:|
| 10 | 1.9981 | 1.9311 |
| 20 | 1.3280 | 1.2628 |
| 30 | 1.1018 | 1.0792 |
| 40 | 1.0133 | 0.9678 |
| 50 | 0.9917 | 0.9304 |
| 60 | 0.9053 | 0.8815 |
| 70 | 0.9122 | 0.8845 |
| 80 | 0.8935 | 0.8894 |
| 85 | 0.9160 | 0.8910 |

**Final overall training loss: `1.2197`**
**Best validation loss: `0.8815`** (step 60)
**Total training time: ~83 minutes 46 seconds**

The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the model plateaus at a stable low loss, a hallmark of well-tuned LoRA fine-tuning on a focused, in-domain task.

---

## 🖥️ Hardware & Infrastructure

| Component | Specification |
|--------------|----------------------------|
| **GPU** | NVIDIA Tesla T4 |
| **VRAM** | 15.6 GB |
| **Peak VRAM Used** | 15.19 GB |
| **Platform** | Google Colab (free tier) |
| **CUDA** | 12.8 (compute capability 7.5) |
| **PyTorch** | 2.10.0+cu128 |

---

## 📦 Dataset

The model was fine-tuned on a custom curated dataset of **586 Arabic text samples** (`dataset_final.json`), each consisting of:

- **`prompt`**: a raw Arabic paragraph prefixed with `"Text to split:\n"`
- **`response`**: a gold-standard JSON object `{"sentences": [...]}` containing the correctly segmented sentences

| Split | Samples |
|-----------------|---------|
| **Train** | 527 |
| **Validation** | 59 |
| **Total** | 586 |

The dataset covers a range of Modern Standard Arabic (MSA) domains, including science, history, and general knowledge, formatted to enforce strict Gemma 3 chat template conventions.
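A single training record therefore pairs a prefixed prompt with a JSON-encoded gold segmentation. The sketch below shows what one record in this format could look like (the Arabic sample is illustrative, not taken from the dataset) and checks that the `response` field round-trips as strict JSON:

```python
import json

# Illustrative record in the dataset_final.json format described above:
# "prompt" holds the raw paragraph behind the "Text to split:\n" prefix,
# "response" holds the gold-standard {"sentences": [...]} JSON string.
record = {
    "prompt": "Text to split:\nالجملة الأولى. الجملة الثانية.",
    "response": json.dumps(
        {"sentences": ["الجملة الأولى.", "الجملة الثانية."]},
        ensure_ascii=False,
    ),
}

# The response must parse as strict JSON with the single "sentences" key.
gold = json.loads(record["response"])
assert list(gold) == ["sentences"]
assert record["prompt"].startswith("Text to split:\n")
print(len(gold["sentences"]))  # → 2
```

Keeping the target as a JSON *string* (rather than a nested object) matches how the model must emit it at inference time.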
---

## 🚀 Quickstart / Inference

### Installation

```bash
pip install transformers torch accelerate
```

### Using `transformers` (Recommended)

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ── Configuration ──────────────────────────────────────────────
MODEL_ID = "marioVIC/arabic-semantic-chunking"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# ── System prompt ──────────────────────────────────────────────
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.

Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly:
{"sentences": ["", "", ...]}
"""

# ── Load model & tokenizer ─────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

# ── Inference function ─────────────────────────────────────────
def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
    """
    Segment an Arabic paragraph into a list of semantic sentences.

    Args:
        text: Raw Arabic text to segment.
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        A list of Arabic sentence strings.
    """
    messages = [
        {"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for deterministic output
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens
    generated = output_ids[0][inputs["input_ids"].shape[-1]:]
    raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()

    # Parse JSON response
    parsed = json.loads(raw_output)
    return parsed["sentences"]

# ── Example ────────────────────────────────────────────────────
if __name__ == "__main__":
    arabic_text = (
        "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة "
        "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
        "على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً "
        "ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة "
        "وتوافر كميات ضخمة من البيانات."
    )

    sentences = segment_arabic(arabic_text)
    print(f"✅ Segmented into {len(sentences)} sentence(s):\n")
    for i, sentence in enumerate(sentences, 1):
        print(f"  [{i}] {sentence}")
```

### Expected Output

```
✅ Segmented into 3 sentence(s):

  [1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
  [2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
  [3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.
```

### Using Unsloth (2× Faster Inference)

```python
import json
from unsloth import FastLanguageModel
from transformers import AutoProcessor

MODEL_ID = "marioVIC/arabic-semantic-chunking"
MAX_SEQ_LENGTH = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")

SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.

Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. \
Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.

The JSON format must be exactly:
{"sentences": ["", "", ...]}
"""

def segment_arabic_unsloth(text: str) -> list[str]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Text to split:\n{text}"},
    ]
    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,
    )
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw)["sentences"]
```

---

## 📤 Output Format

The model always returns a **strict JSON object** with a single key `"sentences"` whose value is an ordered array of strings. Each string is an exact substring of the original Arabic input.

```json
{
  "sentences": [
    "الجملة الأولى.",
    "الجملة الثانية.",
    "الجملة الثالثة."
  ]
}
```

**Guarantees:**

- No paraphrasing: every sentence is a verbatim span of the source text
- No hallucination of new content
- No translation, grammar correction, or interpretation
- Deterministic output with `do_sample=False`

---

## ⚠️ Limitations

- **Domain scope**: trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
- **Dataset size**: the training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
- **Context length**: inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation.
- **Language exclusivity**: this model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
- **Base model license**: usage is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Commercial use requires compliance with those terms.

---

## 👥 Authors

This model was developed and trained by:

| Name | Role |
|------|------|
| **Omar Abdelmoniem** | Model development, training pipeline, LoRA configuration |
| **Mariam Emad** | Dataset curation, system prompt engineering, evaluation |

---

## 📖 Citation

If you use this model in your research or applications, please cite it as follows:

```bibtex
@misc{abdelmoniem2025arabicsemantic,
  title        = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
  author       = {Abdelmoniem, Omar and Emad, Mariam},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
}
```

---

## 📜 License

This model inherits the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)** from the base `google/gemma-3-4b-it` model. By using this model, you agree to those terms.

The fine-tuning code, dataset format, and system prompt design are released under the **MIT License**.

---
Made with โค๏ธ for the Arabic NLP community *Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) ยท Built on [Gemma 3](https://ai.google.dev/gemma) ยท Powered by [Hugging Face ๐Ÿค—](https://huggingface.co)*