marioVIC committed on
Commit
ef32462
·
verified ·
1 Parent(s): a00483b

Update README.md

  1. README.md +413 -14
README.md CHANGED
@@ -1,21 +1,420 @@
1
  ---
2
- base_model: unsloth/gemma-3-4b-it-unsloth-bnb-4bit
3
- tags:
4
- - text-generation-inference
5
- - transformers
6
- - unsloth
7
- - gemma3
8
- license: apache-2.0
9
  language:
10
- - en
11
  ---
12
 
13
- # Uploaded finetuned model
14
 
15
- - **Developed by:** marioVIC
16
- - **License:** apache-2.0
17
- - **Finetuned from model :** unsloth/gemma-3-4b-it-unsloth-bnb-4bit
18
 
19
- This gemma3 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
20
 
21
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
 
1
  ---
2
  language:
3
+ - ar
4
+ license: gemma
5
+ base_model: google/gemma-3-4b-it
6
+ tags:
7
+ - arabic
8
+ - nlp
9
+ - text-segmentation
10
+ - semantic-chunking
11
+ - gemma3
12
+ - lora
13
+ - unsloth
14
+ - fine-tuned
15
+ - rag
16
+ - information-retrieval
17
+ pipeline_tag: text-generation
18
+ library_name: transformers
19
+ inference: true
20
+ ---
21
+
22
+ <div align="center">
23
+
24
+ # 🔤 Gemma-3-4B Arabic Semantic Chunker
25
+
26
+ **A fine-tuned `google/gemma-3-4b-it` model for accurate, structure-preserving segmentation of Arabic text into semantically complete sentences.**
27
+
28
+ [![Model on HF](https://img.shields.io/badge/🤗%20Hugging%20Face-arabic--semantic--chunking-yellow)](https://huggingface.co/marioVIC/arabic-semantic-chunking)
29
+ [![Base Model](https://img.shields.io/badge/Base%20Model-google%2Fgemma--3--4b--it-blue)](https://huggingface.co/google/gemma-3-4b-it)
30
+ [![License](https://img.shields.io/badge/License-Gemma-orange)](https://ai.google.dev/gemma/terms)
31
+ [![Language](https://img.shields.io/badge/Language-Arabic%20🇸🇦-green)](https://en.wikipedia.org/wiki/Arabic)
32
+
33
+ </div>
34
+
35
+ ---
36
+
37
+ ## 📋 Table of Contents
38
+
39
+ - [Model Overview](#-model-overview)
40
+ - [Intended Use](#-intended-use)
41
+ - [Training Details](#-training-details)
42
+ - [Training & Validation Loss](#-training--validation-loss)
43
+ - [Hardware & Infrastructure](#-hardware--infrastructure)
44
+ - [Dataset](#-dataset)
45
+ - [Quickstart / Inference](#-quickstart--inference)
46
+ - [Output Format](#-output-format)
47
+ - [Limitations](#-limitations)
48
+ - [Authors](#-authors)
49
+ - [Citation](#-citation)
50
+ - [License](#-license)
51
+
52
+ ---
53
+
54
+ ## 🧠 Model Overview
55
+
56
+ | Attribute | Value |
57
+ |-------------------------|--------------------------------------------|
58
+ | **Base Model** | `google/gemma-3-4b-it` |
59
+ | **Task** | Arabic Semantic Text Segmentation |
60
+ | **Fine-tuning Method** | Supervised Fine-Tuning (SFT) with LoRA |
61
+ | **Precision** | 4-bit NF4 quantisation (QLoRA) |
62
+ | **Vocabulary Size** | 262,144 tokens |
63
+ | **Max Sequence Length** | 2,048 tokens |
64
+ | **Trainable Parameters**| 32,788,480 (0.76% of 4.33B total) |
65
+ | **Framework** | Unsloth + Hugging Face TRL |
66
+
67
+ This model is a LoRA adapter merged into the base `google/gemma-3-4b-it` weights and saved in 16-bit precision for compatibility with vLLM and standard `transformers` pipelines. Given an Arabic paragraph or document, it outputs a structured JSON object containing an ordered list of semantically self-contained sentences, and it is trained to copy the source text verbatim rather than paraphrase it or add content.
68
+
69
+ ---
70
+
71
+ ## 🎯 Intended Use
72
+
73
+ This model is designed for **any Arabic NLP pipeline that benefits from precise sentence-level granularity**:
74
+
75
+ - **Retrieval-Augmented Generation (RAG)**: chunk documents into high-quality semantic units before embedding
76
+ - **Arabic NLP preprocessing**: replace rule-based splitters (which fail on run-on sentences, parenthetical clauses, and informal text) with a learned segmenter
77
+ - **Corpus annotation**: automatically segment raw Arabic corpora for downstream labelling tasks
78
+ - **Information extraction**: isolate individual claims or facts before analysis
79
+ - **Search & summarisation**: fill context windows with well-bounded sentence units
80
+
81
+ > ⚠️ This model is **not** intended for tasks requiring paraphrasing, translation, summarisation, or content generation. It strictly preserves the original Arabic text.
82
+
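For the RAG use case listed above, one possible way to turn segmented sentences into embedding-ready chunks is to pack consecutive sentences up to a size budget. This is a minimal sketch only: `pack_sentences` and the `max_chars` budget are hypothetical names chosen for illustration, not part of this model or of any library.

```python
def pack_sentences(sentences: list[str], max_chars: int = 200) -> list[str]:
    """Greedily pack consecutive sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


demo = pack_sentences(["aaaa.", "bbbb.", "cccc."], max_chars=11)
print(demo)
```

Because the packing never splits a sentence, each chunk boundary coincides with a boundary the segmenter produced.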
83
+ ---
84
+
85
+ ## 🏋️ Training Details
86
+
87
+ ### LoRA Configuration
88
+
89
+ | Parameter | Value |
90
+ |-------------------------|-----------------------------------------------------------------------------|
91
+ | **LoRA Rank (`r`)** | 16 |
92
+ | **LoRA Alpha** | 16 |
93
+ | **LoRA Dropout** | 0.05 |
94
+ | **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
95
+ | **Bias** | None |
96
+ | **Gradient Checkpointing** | Unsloth (memory-optimised) |
97
+
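To make the rank and alpha values in the table concrete: a LoRA adapter on a `d_in × d_out` projection adds two low-rank matrices with `r * (d_in + d_out)` parameters in total, and the update is scaled by `alpha / r` in standard LoRA. The sketch below uses placeholder dimensions that are assumptions for illustration; they are not read from the Gemma 3 configuration.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / r


R, ALPHA = 16, 16  # values from the LoRA configuration table

# Placeholder dimensions for illustration only (not the real layer shapes).
added = lora_param_count(d_in=1024, d_out=4096, r=R)
full = 1024 * 4096

print(f"scaling = {lora_scaling(ALPHA, R)}")
print(f"adapter params = {added:,} vs full matrix = {full:,}")
```

With `r = alpha = 16` the scaling factor is 1.0, so the low-rank update is applied unscaled.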
98
+ ### SFT Hyperparameters
99
+
100
+ | Parameter | Value |
101
+ |------------------------------|--------------------|
102
+ | **Epochs** | 5 |
103
+ | **Per-device Batch Size** | 2 |
104
+ | **Gradient Accumulation** | 16 steps |
105
+ | **Effective Batch Size** | 32 |
106
+ | **Learning Rate** | 1e-4 |
107
+ | **LR Scheduler** | Linear |
108
+ | **Warmup Steps** | 10 |
109
+ | **Optimiser** | `adamw_8bit` |
110
+ | **Weight Decay** | 0.01 |
111
+ | **Max Gradient Norm** | 0.3 |
112
+ | **Evaluation Strategy** | Every 10 steps |
113
+ | **Best Model Metric** | `eval_loss` |
114
+ | **Total Training Steps** | 85 |
115
+ | **Mixed Precision** | FP16 (T4 GPU) |
116
+ | **Random Seed** | 3407 |
117
+
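The effective batch size and total step count in the table can be reproduced from the other hyperparameters and the 527-sample training split. The sketch below assumes a single GPU and that a final partial batch still counts as one optimiser step (ceiling rounding); both are assumptions about the training setup rather than something stated in the table.

```python
import math

train_samples = 527      # training split size from the Dataset section
per_device_batch = 2
grad_accum_steps = 16
epochs = 5
num_devices = 1          # assumption: a single T4 GPU

effective_batch = per_device_batch * grad_accum_steps * num_devices
# Assumption: the last, partially filled batch of an epoch still triggers a step.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 32 17 85
```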
118
+ ---
119
+
120
+ ## 📉 Training & Validation Loss
121
+
122
+ The model was evaluated on the held-out validation set every 10 steps throughout training. Both curves converge steadily across all 5 epochs; validation loss reaches its minimum at step 60 and rises only slightly afterwards (0.8815 to 0.8910), so any overfitting is marginal.
123
+
124
+ | Step | Training Loss | Validation Loss |
125
+ |:----:|:-------------:|:---------------:|
126
+ | 10 | 1.9981 | 1.9311 |
127
+ | 20 | 1.3280 | 1.2628 |
128
+ | 30 | 1.1018 | 1.0792 |
129
+ | 40 | 1.0133 | 0.9678 |
130
+ | 50 | 0.9917 | 0.9304 |
131
+ | 60 | 0.9053 | 0.8815 |
132
+ | 70 | 0.9122 | 0.8845 |
133
+ | 80 | 0.8935 | 0.8894 |
134
+ | 85 | 0.9160 | 0.8910 |
135
+
136
+ **Final overall training loss: `1.2197`**
137
+ **Best validation loss: `0.8815`** (Step 60)
138
+ **Total training time: ~83 minutes 46 seconds**
139
+
140
+ The sharp initial drop (steps 10–40) reflects rapid task adaptation, after which the loss plateaus at a stable low value, a pattern typical of LoRA fine-tuning on a focused, in-domain task.
141
+
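Since `eval_loss` is the best-model metric, the checkpoint to keep can be read off the loss table by taking the row with the lowest validation loss. The short sketch below just replays that selection on the logged values; the variable names are illustrative.

```python
# (step, training_loss, validation_loss) copied from the table above
history = [
    (10, 1.9981, 1.9311),
    (20, 1.3280, 1.2628),
    (30, 1.1018, 1.0792),
    (40, 1.0133, 0.9678),
    (50, 0.9917, 0.9304),
    (60, 0.9053, 0.8815),
    (70, 0.9122, 0.8845),
    (80, 0.8935, 0.8894),
    (85, 0.9160, 0.8910),
]

# Pick the evaluation point with the lowest validation loss.
best_step, _, best_val = min(history, key=lambda row: row[2])
final_val = history[-1][2]

print(f"best step: {best_step}, best validation loss: {best_val}")
print(f"gap between final and best validation loss: {final_val - best_val:.4f}")
```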
142
+ ---
143
+
144
+ ## 🖥️ Hardware & Infrastructure
145
+
146
+ | Component | Specification |
147
+ |--------------|----------------------------|
148
+ | **GPU** | NVIDIA Tesla T4 |
149
+ | **VRAM** | 15.6 GB |
150
+ | **Peak VRAM Used** | 15.19 GB |
151
+ | **Platform** | Google Colab (free tier) |
152
+ | **CUDA** | Toolkit 12.8 / compute capability 7.5 |
153
+ | **PyTorch** | 2.10.0+cu128 |
154
+
155
+ ---
156
+
157
+ ## 📦 Dataset
158
+
159
+ The model was fine-tuned on a custom curated dataset of **586 Arabic text samples** (`dataset_final.json`), each consisting of:
160
+
161
+ - **`prompt`**: a raw Arabic paragraph prefixed with `"Text to split:\n"`
162
+ - **`response`**: a gold-standard JSON object `{"sentences": [...]}` containing the correctly segmented sentences
163
+
164
+ | Split | Samples |
165
+ |-----------------|---------|
166
+ | **Train** | 527 |
167
+ | **Validation** | 59 |
168
+ | **Total** | 586 |
169
+
170
+ The dataset covers a range of Modern Standard Arabic (MSA) domains including science, history, and general knowledge, formatted to enforce strict Gemma 3 chat template conventions.
171
+
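As an illustration of the record layout described above, the sketch below builds one `prompt`/`response` pair. `make_record` is a hypothetical helper, and storing the response as a JSON string (rather than a nested object) is an assumption, since the layout of `dataset_final.json` is not shown here.

```python
import json


def make_record(paragraph: str, sentences: list[str]) -> dict[str, str]:
    """Build one training record in the prompt/response layout described above."""
    return {
        "prompt": f"Text to split:\n{paragraph}",
        # ensure_ascii=False keeps Arabic characters readable instead of \uXXXX escapes.
        "response": json.dumps({"sentences": sentences}, ensure_ascii=False),
    }


record = make_record(
    "جملة أولى. جملة ثانية.",
    ["جملة أولى.", "جملة ثانية."],
)
print(record["prompt"])
print(record["response"])
```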
172
+ ---
173
+
174
+ ## 🚀 Quickstart / Inference
175
+
176
+ ### Installation
177
+
178
+ ```bash
+ pip install transformers torch accelerate
+ ```
181
+
182
+ ### Using `transformers` (Recommended)
183
+
184
+ ```python
+ import json
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # ── Configuration ───────────────────────────────────────────────────
+ MODEL_ID = "marioVIC/arabic-semantic-chunking"
+
+ # ── System prompt ───────────────────────────────────────────────────
+ SYSTEM_PROMPT = """\
+ You are an expert Arabic text segmentation assistant. Your task is to split \
+ the given Arabic text into small, meaningful sentences.
+ Follow these rules strictly:
+ 1. Each sentence must be a complete, self-contained meaningful unit.
+ 2. Do NOT merge multiple ideas into one sentence.
+ 3. Do NOT split a single idea across multiple sentences.
+ 4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
+ 5. Remove excessive whitespace or newlines, but keep the words intact.
+ 6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
+ The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
+ """
+
+ # ── Load model & tokenizer ──────────────────────────────────────────
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_ID,
+     torch_dtype=torch.float16,
+     device_map="auto",  # lets accelerate choose where the weights live
+ )
+ model.eval()
+
+ # ── Inference function ──────────────────────────────────────────────
+ def segment_arabic(text: str, max_new_tokens: int = 512) -> list[str]:
+     """
+     Segment an Arabic paragraph into a list of semantic sentences.
+
+     Args:
+         text: Raw Arabic text to segment.
+         max_new_tokens: Maximum number of tokens to generate.
+
+     Returns:
+         A list of Arabic sentence strings.
+     """
+     messages = [
+         {"role": "user", "content": f"{SYSTEM_PROMPT}\nText to split:\n{text}"},
+     ]
+
+     prompt = tokenizer.apply_chat_template(
+         messages,
+         tokenize=False,
+         add_generation_prompt=True,
+     )
+
+     # The chat template already inserts the BOS token, so do not add it again.
+     # Inputs go to wherever device_map placed the model.
+     inputs = tokenizer(
+         prompt, return_tensors="pt", add_special_tokens=False
+     ).to(model.device)
+
+     with torch.no_grad():
+         output_ids = model.generate(
+             **inputs,
+             max_new_tokens=max_new_tokens,
+             do_sample=False,  # greedy decoding; sampling parameters are not needed
+             eos_token_id=tokenizer.eos_token_id,
+             pad_token_id=tokenizer.eos_token_id,
+         )
+
+     # Decode only the newly generated tokens
+     generated = output_ids[0][inputs["input_ids"].shape[-1]:]
+     raw_output = tokenizer.decode(generated, skip_special_tokens=True).strip()
+
+     # Parse JSON response (raises json.JSONDecodeError on malformed output)
+     parsed = json.loads(raw_output)
+     return parsed["sentences"]
+
+
+ # ── Example ─────────────────────────────────────────────────────────
+ if __name__ == "__main__":
+     arabic_text = (
+         "الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة "
+         "قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً. تشمل هذه المهام التعرف "
+         "على الكلام وترجمة اللغات واتخاذ القرارات. وقد شهد هذا المجال تطوراً "
+         "ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة "
+         "وتوافر كميات ضخمة من البيانات."
+     )
+
+     sentences = segment_arabic(arabic_text)
+
+     print(f"✅ Segmented into {len(sentences)} sentence(s):\n")
+     for i, sentence in enumerate(sentences, 1):
+         print(f" [{i}] {sentence}")
+ ```
275
+
276
+ ### Expected Output
277
+
278
+ ```
+ ✅ Segmented into 3 sentence(s):
+
+  [1] الذكاء الاصطناعي هو مجال من مجالات علوم الحاسوب يهتم بتطوير أنظمة قادرة على تنفيذ مهام تتطلب عادةً ذكاءً بشرياً.
+  [2] تشمل هذه المهام التعرف على الكلام وترجمة اللغات واتخاذ القرارات.
+  [3] وقد شهد هذا المجال تطوراً ملحوظاً في السنوات الأخيرة بفضل التقدم في الشبكات العصبية العميقة وتوافر كميات ضخمة من البيانات.
+ ```
285
+
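The quickstart parses the model output with a bare `json.loads`, which raises if the model ever wraps its answer in a code fence or adds stray text. A more defensive parser might look like the sketch below; `parse_sentences` is a hypothetical helper, not part of the released code, and it assumes the output contains a single JSON object.

```python
import json
import re


def parse_sentences(raw_output: str) -> list[str]:
    """Extract the "sentences" list from model output, tolerating extra wrapping."""
    # Take the span from the first "{" to the last "}" so that stray code
    # fences or commentary around the JSON object are ignored.
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    parsed = json.loads(match.group(0))
    sentences = parsed.get("sentences")
    if not isinstance(sentences, list) or not all(isinstance(s, str) for s in sentences):
        raise ValueError('expected {"sentences": [<str>, ...]}')
    return sentences


fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
wrapped = f'{fence}json\n{{"sentences": ["a.", "b."]}}\n{fence}'
print(parse_sentences(wrapped))
```

Replacing the final two lines of `segment_arabic` with a call to such a helper would keep the function's return type unchanged.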
286
+ ### Using Unsloth (2× Faster Inference)
287
+
288
+ ```python
+ import json
+ from unsloth import FastLanguageModel
+ from transformers import AutoProcessor
+
+ MODEL_ID = "marioVIC/arabic-semantic-chunking"
+ MAX_SEQ_LENGTH = 2048
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name = MODEL_ID,
+     max_seq_length = MAX_SEQ_LENGTH,
+     dtype = None,  # auto-detect
+     load_in_4bit = True,
+ )
+ FastLanguageModel.for_inference(model)
+
+ # Only used to render the Gemma 3 chat template
+ processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
+
+ SYSTEM_PROMPT = """\
+ You are an expert Arabic text segmentation assistant. Your task is to split \
+ the given Arabic text into small, meaningful sentences.
+ Follow these rules strictly:
+ 1. Each sentence must be a complete, self-contained meaningful unit.
+ 2. Do NOT merge multiple ideas into one sentence.
+ 3. Do NOT split a single idea across multiple sentences.
+ 4. Preserve the original Arabic text exactly — do not paraphrase, translate, or fix grammar.
+ 5. Remove excessive whitespace or newlines, but keep the words intact.
+ 6. Return ONLY a valid JSON object — no explanation, no markdown, no code fences.
+ The JSON format must be exactly: {"sentences": ["<sentence1>", "<sentence2>", ...]}
+ """
+
+ def segment_arabic_unsloth(text: str) -> list[str]:
+     messages = [
+         {"role": "system", "content": SYSTEM_PROMPT},
+         {"role": "user", "content": f"Text to split:\n{text}"},
+     ]
+
+     prompt = processor.apply_chat_template(
+         messages,
+         tokenize=False,
+         add_generation_prompt=True,
+     )
+
+     # Pass the prompt by keyword so this also works if Unsloth returns a
+     # processor; the chat template already contains the BOS token.
+     inputs = tokenizer(
+         text=prompt, return_tensors="pt", add_special_tokens=False
+     ).to("cuda")
+
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=512,
+         use_cache=True,
+         do_sample=False,
+     )
+
+     # Decode only the newly generated tokens
+     generated = outputs[0][inputs["input_ids"].shape[-1]:]
+     raw = tokenizer.decode(generated, skip_special_tokens=True).strip()
+     return json.loads(raw)["sentences"]
+ ```
344
+
345
+ ---
346
+
347
+ ## 📤 Output Format
348
+
349
+ The model always returns a **strict JSON object** with a single key `"sentences"` whose value is an ordered array of strings. Each string is an exact substring of the original Arabic input.
350
+
351
+ ```json
+ {
+   "sentences": [
+     "الجملة الأولى.",
+     "الجملة الثانية.",
+     "الجملة الثالثة."
+   ]
+ }
+ ```
360
+
361
+ **Guarantees:**
362
+ - No paraphrasing: every sentence is a verbatim span of the source text
363
+ - No hallucination of new content
364
+ - No translation, grammar correction, or interpretation
365
+ - Deterministic output with `do_sample=False`
366
+
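The verbatim-span property above can be checked mechanically on any output. The sketch below is one way to do it, assuming that whitespace differences are acceptable because the system prompt allows the model to remove excess whitespace and newlines; `is_verbatim_segmentation` is a hypothetical name.

```python
def is_verbatim_segmentation(source: str, sentences: list[str]) -> bool:
    """Return True if every sentence occurs in the source, in order, without overlap.

    Whitespace is collapsed first, because the system prompt allows the model
    to remove excess whitespace and newlines.
    """
    normalised = " ".join(source.split())
    cursor = 0
    for sentence in sentences:
        needle = " ".join(sentence.split())
        position = normalised.find(needle, cursor)
        if position == -1:
            return False
        # Continue searching after this sentence so order is enforced.
        cursor = position + len(needle)
    return True


print(is_verbatim_segmentation("a b.\n\n c d.", ["a b.", "c d."]))
```

A check like this is cheap enough to run on every response, with a fallback (for example, a rule-based splitter) when it fails.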
367
+ ---
368
+
369
+ ## ⚠️ Limitations
370
+
371
+ - **Domain scope**: Trained primarily on Modern Standard Arabic (MSA). Performance on dialectal Arabic (Egyptian, Levantine, Gulf, etc.) or highly technical jargon may vary.
372
+ - **Dataset size**: The training set is relatively small (527 examples). Edge cases with unusual punctuation, code-switching, or deeply nested clauses may not be handled optimally.
373
+ - **Context length**: Inputs exceeding ~1,800 tokens may be truncated. For long documents, consider chunking the input before segmentation.
374
+ - **Language exclusivity**: This model is purpose-built for Arabic. It is not suitable for multilingual or cross-lingual segmentation tasks.
375
+ - **Base model license**: Usage is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Commercial use requires compliance with those terms.
376
+
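For the context-length limitation, a long document can be pre-split into windows that each stay under the token budget before being sent to the model. The sketch below groups paragraphs greedily; `split_into_windows` is a hypothetical helper, and the default whitespace word count is only a crude stand-in that will underestimate real token counts for Arabic, so a tokenizer-based counter should be passed in practice.

```python
from typing import Callable, Optional


def _word_count(text: str) -> int:
    """Crude token estimate: number of whitespace-separated words."""
    return len(text.split())


def split_into_windows(
    text: str,
    max_tokens: int = 1800,
    count_tokens: Optional[Callable[[str], int]] = None,
) -> list[str]:
    """Group paragraphs into windows whose estimated size stays within max_tokens.

    A single paragraph larger than max_tokens becomes its own window.
    """
    counter = count_tokens or _word_count
    windows: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for paragraph in (p.strip() for p in text.split("\n") if p.strip()):
        n = counter(paragraph)
        if current and current_tokens + n > max_tokens:
            windows.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n
    if current:
        windows.append("\n".join(current))
    return windows


doc = "one two three\nfour five\nsix seven eight nine"
print(split_into_windows(doc, max_tokens=5))
```

Each window can then be segmented independently and the resulting sentence lists concatenated in order.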
377
+ ---
378
+
379
+ ## 👥 Authors
380
+
381
+ This model was developed and trained by:
382
+
383
+ | Name | Role |
384
+ |------|------|
385
+ | **Omar Abdelmoniem** | Model development, training pipeline, LoRA configuration |
386
+ | **Mariam Emad** | Dataset curation, system prompt engineering, evaluation |
387
+
388
+ ---
389
+
390
+ ## 📖 Citation
391
+
392
+ If you use this model in your research or applications, please cite it as follows:
393
+
394
+ ```bibtex
+ @misc{abdelmoniem2025arabicsemantic,
+   title        = {Gemma-3-4B Arabic Semantic Chunker: Fine-tuning Gemma 3 for Arabic Text Segmentation},
+   author       = {Abdelmoniem, Omar and Emad, Mariam},
+   year         = {2025},
+   publisher    = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/marioVIC/arabic-semantic-chunking}},
+ }
+ ```
403
+
404
+ ---
405
+
406
+ ## 📜 License
407
+
408
+ This model inherits the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)** from the base `google/gemma-3-4b-it` model. By using this model, you agree to those terms.
409
+
410
+ The fine-tuning code, dataset format, and system prompt design are released under the **MIT License**.
411
+
412
  ---
413
 
414
+ <div align="center">
415
 
416
+ Made with ❤️ for the Arabic NLP community
417
 
418
+ *Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) · Built on [Gemma 3](https://ai.google.dev/gemma) · Powered by [Hugging Face 🤗](https://huggingface.co)*
419
 
420
+ </div>