🔤 Gemma-3-4B Arabic Semantic Chunker

A fine-tuned google/gemma-3-4b-it model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.

🧠 Model Overview

| Attribute | Value |
|---|---|
| Base Model | google/gemma-3-4b-it |
| Task | Arabic Semantic Text Segmentation |
| Fine-tuning Method | Supervised Fine-Tuning (SFT) with LoRA |
| Precision | 4-bit NF4 quantisation (QLoRA) |
| Vocabulary Size | 262,144 tokens |
| Max Sequence Length | 2,048 tokens |
| Trainable Parameters | 32,788,480 (0.76% of 4.33B total) |
| Framework | Unsloth + Hugging Face TRL |

This model is a LoRA adapter merged into the base google/gemma-3-4b-it weights (saved in 16-bit precision for compatibility with vLLM and standard transformers pipelines). Given an Arabic paragraph or document, the model outputs a structured JSON object containing an ordered list of semantically self-contained sentences — with zero paraphrasing and zero hallucination of content.


🎯 Intended Use

This model is designed for any Arabic NLP pipeline that benefits from precise sentence-level granularity:

  • Retrieval-Augmented Generation (RAG) — chunk documents into high-quality semantic units before embedding
  • Arabic NLP preprocessing — replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
  • Corpus annotation — automatically segment raw Arabic corpora for downstream labelling tasks
  • Information extraction — isolate individual claims or facts before analysis
  • Search & summarisation — improve context windows by feeding well-bounded sentence units

⚠️ This model is not intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.
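
For the RAG use case above, the model's sentence list usually still needs to be packed into embedding-sized chunks. A minimal sketch of greedy packing (the `pack_sentences` helper and the character budget are illustrative, not part of the model):

```python
def pack_sentences(sentences: list[str], max_chars: int = 300) -> list[str]:
    """Greedily pack consecutive sentences into chunks under a size budget,
    so semantic units are never split mid-sentence."""
    chunks, current = [], ""
    for s in sentences:
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Because chunk boundaries always coincide with sentence boundaries produced by the segmenter, each embedded chunk stays semantically self-contained.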


🏋️ Training Details

LoRA Configuration

| Parameter | Value |
|---|---|
| LoRA Rank (r) | 16 |
| LoRA Alpha | 16 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Bias | None |
| Gradient Checkpointing | Unsloth (memory-optimised) |
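
For reference, this table corresponds roughly to the following `peft` `LoraConfig` (a sketch only — training actually went through Unsloth's `get_peft_model` wrapper, so the exact call differs):

```python
from peft import LoraConfig

# Approximate equivalent of the LoRA settings in the table above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```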

SFT Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 5 |
| Per-device Batch Size | 2 |
| Gradient Accumulation | 16 steps |
| Effective Batch Size | 32 |
| Learning Rate | 1e-4 |
| LR Scheduler | Linear |
| Warmup Steps | 10 |
| Optimiser | adamw_8bit |
| Weight Decay | 0.01 |
| Max Gradient Norm | 0.3 |
| Evaluation Strategy | Every 10 steps |
| Best Model Metric | eval_loss |
| Total Training Steps | 85 |
| Mixed Precision | FP16 (T4 GPU) |
| Random Seed | 3407 |
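
A quick sanity check of the step count implied by these numbers (assuming the final partial batch of each epoch is kept rather than dropped):

```python
import math

train_samples = 527      # size of the training split
per_device_batch = 2
grad_accum = 16
epochs = 5

effective_batch = per_device_batch * grad_accum            # 2 × 16 = 32
steps_per_epoch = math.ceil(train_samples / effective_batch)  # ceil(527/32) = 17
total_steps = steps_per_epoch * epochs                     # 17 × 5 = 85

print(effective_batch, steps_per_epoch, total_steps)  # 32 17 85
```

This reproduces both the effective batch size and the 85 total training steps in the table.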

📉 Training & Validation Loss

The model was evaluated on the held-out validation set every 10 steps throughout training. Both curves converge steadily across all 5 epochs; validation loss bottoms out at step 60 and rises only marginally afterwards, and since the best checkpoint is selected on eval_loss, there is no meaningful overfitting in the final model.

| Step | Training Loss | Validation Loss |
|---|---|---|
| 10 | 1.9981 | 1.9311 |
| 20 | 1.3280 | 1.2628 |
| 30 | 1.1018 | 1.0792 |
| 40 | 1.0133 | 0.9678 |
| 50 | 0.9917 | 0.9304 |
| 60 | 0.9053 | 0.8815 |
| 70 | 0.9122 | 0.8845 |
| 80 | 0.8935 | 0.8894 |
| 85 | 0.9160 | 0.8910 |

Final overall training loss: 1.2197
Best validation loss: 0.8815 (Step 60)
Total training time: ~83 minutes 46 seconds

The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the model plateaus at a stable low loss — a hallmark of well-tuned LoRA fine-tuning on a focused, in-domain task.


🖥️ Hardware & Infrastructure

| Component | Specification |
|---|---|
| GPU | NVIDIA Tesla T4 |
| VRAM | 15.6 GB |
| Peak VRAM Used | 15.19 GB |
| Platform | Google Colab (free tier) |
| CUDA | 12.8 (compute capability 7.5) |
| PyTorch | 2.10.0+cu128 |

📦 Dataset

The model was fine-tuned on a custom-curated dataset of 586 Arabic text samples (dataset_final.json), each consisting of:

  • prompt — a raw Arabic paragraph prefixed with "Text to split:\n"
  • response — a gold-standard JSON object {"sentences": [...]} containing the correctly segmented sentences

| Split | Samples |
|---|---|
| Train | 527 |
| Validation | 59 |
| Total | 586 |

The dataset covers a range of Modern Standard Arabic (MSA) domains including science, history, and general knowledge, formatted to enforce strict Gemma 3 chat template conventions.
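
As an illustration, a single record in the format described above might look like the following (the Arabic content is invented for the example; the exact serialisation inside dataset_final.json may differ):

```python
import json

# One hypothetical training record in the prompt/response format described above.
record = {
    "prompt": "Text to split:\nالذكاء الاصطناعي مجال واسع. يتطور بسرعة.",
    "response": json.dumps(
        {"sentences": ["الذكاء الاصطناعي مجال واسع.", "يتطور بسرعة."]},
        ensure_ascii=False,
    ),
}

# Basic validation: the response must parse to a {"sentences": [...]} object.
parsed = json.loads(record["response"])
assert isinstance(parsed["sentences"], list)
assert all(isinstance(s, str) for s in parsed["sentences"])
```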


🚀 Quickstart / Inference

Installation

pip install transformers torch accelerate

Using transformers (Recommended)

import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ── Configuration ────────────────────────────────────────────────────────────
MODEL_ID = "marioVIC/arabic-semantic-chunking"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"

# ── System prompt ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

# ── Load model & tokenizer ────────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

# ── Inference function ────────────────────────────────────────────────────────
def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
    """
    Segment an Arabic paragraph into a list of semantic sentences.

    Args:
        text:           Raw Arabic text to segment.
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        A list of Arabic sentence strings.
    """
    messages = [
        {"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding → deterministic output
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens
    generated = output_ids[0][inputs["input_ids"].shape[-1]:]
    raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()

    # Parse the JSON response; tolerate stray code fences around the output
    if raw_output.startswith("```"):
        raw_output = raw_output.strip("`").removeprefix("json").strip()
    parsed = json.loads(raw_output)
    return parsed["sentences"]


# ── Example ────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    arabic_text = (
        "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة "
        "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
        "على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً "
        "ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة "
        "وتوافر كميات ضخمة من البيانات."
    )

    sentences = segment_arabic(arabic_text)

    print(f"✅ Segmented into {len(sentences)} sentence(s):\n")
    for i, sentence in enumerate(sentences, 1):
        print(f"  [{i}] {sentence}")

Expected Output

✅ Segmented into 3 sentence(s):

  [1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
  [2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
  [3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.

Using Unsloth (2× Faster Inference)

import json
from unsloth import FastLanguageModel
from transformers import AutoProcessor

MODEL_ID       = "marioVIC/arabic-semantic-chunking"
MAX_SEQ_LENGTH = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = MODEL_ID,
    max_seq_length = MAX_SEQ_LENGTH,
    dtype          = None,       # auto-detect
    load_in_4bit   = True,
)
FastLanguageModel.for_inference(model)

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")

SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

def segment_arabic_unsloth(text: str) -> list[str]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": f"Text to split:\n{text}"},
    ]

    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,
    )

    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw)["sentences"]

📤 Output Format

The model always returns a strict JSON object with a single key "sentences" whose value is an ordered array of strings. Each string is an exact substring of the original Arabic input.

{
  "sentences": [
    "الجملة الأولى.",
    "الجملة الثانية.",
    "الجملة الثالثة."
  ]
}

Guarantees:

  • No paraphrasing — every sentence is a verbatim span of the source text
  • No hallucination of new content
  • No translation, grammar correction, or interpretation
  • Deterministic output with do_sample=False
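
Because every sentence is guaranteed to be a verbatim span, outputs can be verified cheaply at inference time. A minimal sketch of such a check (the helper name is ours; whitespace is normalised on both sides because rule 5 of the system prompt allows the model to collapse excessive spaces and newlines):

```python
def is_verbatim(source: str, sentences: list[str]) -> bool:
    """Check that every returned sentence appears verbatim in the source text,
    up to whitespace normalisation."""
    normalised_source = " ".join(source.split())
    return all(" ".join(s.split()) in normalised_source for s in sentences)
```

Rejecting (or at least logging) any response that fails this check is a cheap safeguard against the rare malformed generation.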

⚠️ Limitations

  • Domain scope — Trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
  • Dataset size — The training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
  • Context length — Inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation.
  • Language exclusivity — This model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
  • Base model license — Usage is subject to Google's Gemma Terms of Use. Commercial use requires compliance with those terms.
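
For the context-length limitation above, one simple approach is to pre-chunk long documents at paragraph boundaries before calling the segmenter. A rough sketch, using word count as a cheap proxy for token count (the function name and budget are illustrative):

```python
def prechunk(document: str, max_words: int = 800) -> list[str]:
    """Split a long document into paragraph-aligned pieces under a word budget,
    so each piece fits comfortably inside the model's 2,048-token window."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    pieces, current, count = [], [], 0
    for p in paragraphs:
        words = len(p.split())
        # Close the current piece if this paragraph would overflow the budget.
        if current and count + words > max_words:
            pieces.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += words
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```

Each piece can then be passed to the segmenter independently and the resulting sentence lists concatenated in order.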

👥 Authors

This model was developed and trained by:

| Name | Role |
|---|---|
| Omar Abdelmoniem | Model development, training pipeline, LoRA configuration |
| Mariam Emad | Dataset curation, system prompt engineering, evaluation |

📖 Citation

If you use this model in your research or applications, please cite it as follows:

@misc{abdelmoniem2025arabicsemantic,
  title        = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
  author       = {Abdelmoniem, Omar and Emad, Mariam},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
}

📜 License

This model inherits the Gemma Terms of Use from the base google/gemma-3-4b-it model. By using this model, you agree to those terms.

The fine-tuning code, dataset format, and system prompt design are released under the MIT License.


Made with ❤️ for the Arabic NLP community

Fine-tuned with Unsloth · Built on Gemma 3 · Powered by Hugging Face 🤗
