---
language:
  - ar
license: gemma
base_model: google/gemma-3-4b-it
tags:
  - arabic
  - nlp
  - text-segmentation
  - semantic-chunking
  - gemma3
  - lora
  - unsloth
  - fine-tuned
  - rag
  - information-retrieval
pipeline_tag: text-generation
library_name: transformers
inference: true
---

<div align="center">

# 🔤 Gemma-3-4B Arabic Semantic Chunker

**A fine-tuned `google/gemma-3-4b-it` model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.**

[![Model on HF](https://img.shields.io/badge/🤗%20Hugging%20Face-arabic--semantic--chunking-yellow)](https://huggingface.co/marioVIC/arabic-semantic-chunking)
[![Base Model](https://img.shields.io/badge/Base%20Model-google%2Fgemma--3--4b--it-blue)](https://huggingface.co/google/gemma-3-4b-it)
[![License](https://img.shields.io/badge/License-Gemma-orange)](https://ai.google.dev/gemma/terms)
[![Language](https://img.shields.io/badge/Language-Arabic%20🇸🇦-green)](https://en.wikipedia.org/wiki/Arabic)

</div>

---

## 📋 Table of Contents

- [Model Overview](#-model-overview)
- [Intended Use](#-intended-use)
- [Training Details](#-training-details)
- [Training & Validation Loss](#-training--validation-loss)
- [Hardware & Infrastructure](#-hardware--infrastructure)
- [Dataset](#-dataset)
- [Quickstart / Inference](#-quickstart--inference)
- [Output Format](#-output-format)
- [Limitations](#-limitations)
- [Authors](#-authors)
- [Citation](#-citation)
- [License](#-license)

---

## 🧠 Model Overview

| Attribute               | Value                                      |
|-------------------------|--------------------------------------------|
| **Base Model**          | `google/gemma-3-4b-it`                     |
| **Task**                | Arabic Semantic Text Segmentation          |
| **Fine-tuning Method**  | Supervised Fine-Tuning (SFT) with LoRA     |
| **Precision**           | 4-bit NF4 quantisation (QLoRA)             |
| **Vocabulary Size**     | 262,144 tokens                             |
| **Max Sequence Length** | 2,048 tokens                               |
| **Trainable Parameters**| 32,788,480 (0.76% of 4.33B total)          |
| **Framework**           | Unsloth + Hugging Face TRL                 |

This model is a LoRA adapter merged into the base `google/gemma-3-4b-it` weights (saved in 16-bit precision for compatibility with vLLM and standard `transformers` pipelines). Given an Arabic paragraph or document, the model outputs a structured JSON object containing an ordered list of semantically self-contained sentences, with zero paraphrasing and zero hallucination of content.

---

## 🎯 Intended Use

This model is designed for **any Arabic NLP pipeline that benefits from precise sentence-level granularity**:

- **Retrieval-Augmented Generation (RAG)** – chunk documents into high-quality semantic units before embedding (see the sketch after this list)
- **Arabic NLP preprocessing** – replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
- **Corpus annotation** – automatically segment raw Arabic corpora for downstream labelling tasks
- **Information extraction** – isolate individual claims or facts before analysis
- **Search & summarisation** – improve context windows by feeding well-bounded sentence units

> ⚠️ This model is **not** intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.
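
A minimal sketch of the RAG use case, assuming the `segment_arabic` helper defined in the Quickstart below; the embedding library and model name (`sentence-transformers`, `intfloat/multilingual-e5-base`) are illustrative choices, not part of this release:

```python
from sentence_transformers import SentenceTransformer

# Illustrative multilingual embedder; swap in whatever your pipeline uses.
embedder = SentenceTransformer("intfloat/multilingual-e5-base")

def chunk_and_embed(document: str):
    """Segment an Arabic document, then embed one vector per semantic unit."""
    sentences = segment_arabic(document)      # defined in the Quickstart below
    embeddings = embedder.encode(sentences)   # one embedding per sentence
    return list(zip(sentences, embeddings))   # ready for a vector store
```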

---

## ๐Ÿ‹๏ธ Training Details

### LoRA Configuration

| Parameter               | Value                                                                       |
|-------------------------|-----------------------------------------------------------------------------|
| **LoRA Rank (`r`)**     | 16                                                                          |
| **LoRA Alpha**          | 16                                                                          |
| **LoRA Dropout**        | 0.05                                                                        |
| **Target Modules**      | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Bias**                | None                                                                        |
| **Gradient Checkpointing** | Unsloth (memory-optimised)                                              |
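
For reference, the configuration above can be expressed with Unsloth roughly as follows; this is a sketch reconstructed from the table, not the verbatim training script:

```python
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,                                   # the loaded gemma-3-4b-it base
    r=16,                                    # LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",    # memory-optimised checkpointing
    random_state=3407,
)
```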

### SFT Hyperparameters

| Parameter                    | Value              |
|------------------------------|--------------------|
| **Epochs**                   | 5                  |
| **Per-device Batch Size**    | 2                  |
| **Gradient Accumulation**    | 16 steps           |
| **Effective Batch Size**     | 32                 |
| **Learning Rate**            | 1e-4               |
| **LR Scheduler**             | Linear             |
| **Warmup Steps**             | 10                 |
| **Optimiser**                | `adamw_8bit`       |
| **Weight Decay**             | 0.01               |
| **Max Gradient Norm**        | 0.3                |
| **Evaluation Strategy**      | Every 10 steps     |
| **Best Model Metric**        | `eval_loss`        |
| **Total Training Steps**     | 85                 |
| **Mixed Precision**          | FP16 (T4 GPU)      |
| **Random Seed**              | 3407               |
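
The hyperparameters above map roughly onto the following TRL configuration; argument names follow current TRL/`transformers` releases, and the original training script may differ in minor details:

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="outputs",
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,    # effective batch size = 2 x 16 = 32
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    max_grad_norm=0.3,
    eval_strategy="steps",
    eval_steps=10,
    save_strategy="steps",             # must match eval for best-model loading
    save_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    fp16=True,                         # T4 has no bf16 support
    seed=3407,
)
```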

---

## 📉 Training & Validation Loss

The model was evaluated on the held-out validation set every 10 steps throughout training. Both curves converge steadily across all 5 epochs; validation loss bottoms out at step 60 and rises only marginally afterwards, indicating minimal overfitting.

| Step | Training Loss | Validation Loss |
|:----:|:-------------:|:---------------:|
| 10   | 1.9981        | 1.9311          |
| 20   | 1.3280        | 1.2628          |
| 30   | 1.1018        | 1.0792          |
| 40   | 1.0133        | 0.9678          |
| 50   | 0.9917        | 0.9304          |
| 60   | 0.9053        | 0.8815          |
| 70   | 0.9122        | 0.8845          |
| 80   | 0.8935        | 0.8894          |
| 85   | 0.9160        | 0.8910          |

**Final overall training loss: `1.2197`**  
**Best validation loss: `0.8815`** (Step 60)  
**Total training time: ~83 minutes 46 seconds**

The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the model plateaus at a stable low loss, a hallmark of well-tuned LoRA fine-tuning on a focused, in-domain task.

---

## 🖥️ Hardware & Infrastructure

| Component    | Specification              |
|--------------|----------------------------|
| **GPU**      | NVIDIA Tesla T4            |
| **VRAM**     | 15.6 GB                    |
| **Peak VRAM Used** | 15.19 GB             |
| **Platform** | Google Colab (free tier)   |
| **CUDA**     | Compute capability 7.5 / Toolkit 12.8 |
| **PyTorch**  | 2.10.0+cu128               |

---

## 📦 Dataset

The model was fine-tuned on a custom curated dataset of **586 Arabic text samples** (`dataset_final.json`), each consisting of:

- **`prompt`** – a raw Arabic paragraph prefixed with `"Text to split:\n"`
- **`response`** – a gold-standard JSON object `{"sentences": [...]}` containing the correctly segmented sentences

| Split           | Samples |
|-----------------|---------|
| **Train**       | 527     |
| **Validation**  | 59      |
| **Total**       | 586     |

The dataset covers a range of Modern Standard Arabic (MSA) domains including science, history, and general knowledge, with every sample formatted according to the strict Gemma 3 chat template conventions.
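
An illustrative record (the Arabic reuses a sentence from the Quickstart example below; real records share the same shape):

```json
{
  "prompt": "Text to split:\nتشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.",
  "response": {"sentences": ["تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات."]}
}
```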

---

## 🚀 Quickstart / Inference

### Installation

```bash
pip install transformers torch accelerate
```

### Using `transformers` (Recommended)

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ── Configuration ──────────────────────────────────────────────────────
MODEL_ID = "marioVIC/arabic-semantic-chunking"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"

# ── System prompt ──────────────────────────────────────────────────────
SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

# ── Load model & tokenizer ─────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

# ── Inference function ─────────────────────────────────────────────────
def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
    """
    Segment an Arabic paragraph into a list of semantic sentences.

    Args:
        text:           Raw Arabic text to segment.
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        A list of Arabic sentence strings.
    """
    messages = [
        {"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # apply_chat_template already prepends <bos>, so skip special tokens here
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(DEVICE)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for deterministic output
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens
    generated = output_ids[0][inputs["input_ids"].shape[-1]:]
    raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()

    # Parse JSON response
    parsed = json.loads(raw_output)
    return parsed["sentences"]


# ── Example ────────────────────────────────────────────────────────────
if __name__ == "__main__":
    arabic_text = (
        "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة "
        "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
        "على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً "
        "ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة "
        "وتوافر كميات ضخمة من البيانات."
    )

    sentences = segment_arabic(arabic_text)

    print(f"โœ… Segmented into {len(sentences)} sentence(s):\n")
    for i, sentence in enumerate(sentences, 1):
        print(f"  [{i}] {sentence}")
```

### Expected Output

```
✅ Segmented into 3 sentence(s):

  [1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
  [2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
  [3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.
```

### Using Unsloth (2× Faster Inference)

```python
import json
from unsloth import FastLanguageModel
from transformers import AutoProcessor

MODEL_ID       = "marioVIC/arabic-semantic-chunking"
MAX_SEQ_LENGTH = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = MODEL_ID,
    max_seq_length = MAX_SEQ_LENGTH,
    dtype          = None,       # auto-detect
    load_in_4bit   = True,
)
FastLanguageModel.for_inference(model)

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")

SYSTEM_PROMPT = """\
You are an expert Arabic text segmentation assistant. Your task is to split \
the given Arabic text into small, meaningful sentences.
Follow these rules strictly:
1. Each sentence must be a complete, self-contained meaningful unit.
2. Do NOT merge multiple ideas into one sentence.
3. Do NOT split a single idea across multiple sentences.
4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
5. Remove excessive whitespace or newlines, but keep the words intact.
6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
"""

def segment_arabic_unsloth(text: str) -> list[str]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": f"Text to split:\n{text}"},
    ]

    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # the chat template already includes <bos>; avoid adding it twice
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,
    )

    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return json.loads(raw)["sentences"]
```

---

## 📤 Output Format

The model always returns a **strict JSON object** with a single key `"sentences"` whose value is an ordered array of strings. Each string is an exact span of the original Arabic input, up to the whitespace normalisation described in rule 5 of the system prompt.

```json
{
  "sentences": [
    "ุงู„ุฌู…ู„ุฉ ุงู„ุฃูˆู„ู‰.",
    "ุงู„ุฌู…ู„ุฉ ุงู„ุซุงู†ูŠุฉ.",
    "ุงู„ุฌู…ู„ุฉ ุงู„ุซุงู„ุซุฉ."
  ]
}
```

**Guarantees:**
- No paraphrasing – every sentence is a verbatim span of the source text (modulo rule 5's whitespace cleanup)
- No hallucination of new content
- No translation, grammar correction, or interpretation
- Deterministic output with `do_sample=False`
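
Because faithfulness is the model's core contract, outputs can be spot-checked with a small helper like the one below (a sketch; the whitespace collapsing mirrors rule 5 of the system prompt):

```python
import re

def is_faithful(source: str, sentences: list[str]) -> bool:
    """Return True if every sentence occurs verbatim in the source,
    after collapsing whitespace runs on both sides (rule 5)."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip()
    return all(norm(s) in norm(source) for s in sentences)
```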

---

## ⚠️ Limitations

- **Domain scope** – Trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
- **Dataset size** – The training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
- **Context length** – Inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation (see the sketch after this list).
- **Language exclusivity** – This model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
- **Base model license** – Usage is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Commercial use requires compliance with those terms.
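
One possible pre-chunking approach, assuming the `tokenizer` and `segment_arabic` helpers from the Quickstart (the function name and the 1,500-token budget here are illustrative):

```python
def split_by_token_budget(text: str, budget: int = 1500) -> list[str]:
    """Greedily pack paragraphs into chunks that stay under `budget` tokens,
    so each call to the model fits comfortably within its context window."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}".strip()
        if current and len(tokenizer(candidate)["input_ids"]) > budget:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Segment each chunk independently, then concatenate the sentence lists:
# sentences = [s for c in split_by_token_budget(doc) for s in segment_arabic(c)]
```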

---

## 👥 Authors

This model was developed and trained by:

| Name | Role |
|------|------|
| **Omar Abdelmoniem** | Model development, training pipeline, LoRA configuration |
| **Mariam Emad** | Dataset curation, system prompt engineering, evaluation |

---

## 📖 Citation

If you use this model in your research or applications, please cite it as follows:

```bibtex
@misc{abdelmoniem2025arabicsemantic,
  title        = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
  author       = {Abdelmoniem, Omar and Emad, Mariam},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
}
```

---

## 📜 License

This model inherits the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)** from the base `google/gemma-3-4b-it` model. By using this model, you agree to those terms.

The fine-tuning code, dataset format, and system prompt design are released under the **MIT License**.

---

<div align="center">

Made with โค๏ธ for the Arabic NLP community

*Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) ยท Built on [Gemma 3](https://ai.google.dev/gemma) ยท Powered by [Hugging Face ๐Ÿค—](https://huggingface.co)*

</div>