Update README.md
README.md
CHANGED
@@ -1,6 +1,39 @@
-
+---
+license: cc-by-nc-4.0
+language:
+- ar
+- en
+tags:
+- tokenizer
+- arabic
+- morphology
+- bpe
+- deeplatent
+- english
+- arabic
+pipeline_tag: text-generation
+---
 
-
+# DeepLatent SARF Tokenizer
+
+**Part of Suhail Project - Independent Research by Mohammed Almaghrabi**
+
+This is the **SARF** (Sarf-Aware Representation Framework) tokenizer designed for the DeepLatent language model, trained on bilingual Arabic/English data.
+
+## What is SARF?
+
+**SARF (صَرْف)** is the Arabic term for **morphology**. In classical and modern Arabic linguistics, *ṣarf* refers to the system that governs:
+
+- Word formation
+- Roots and patterns (جذر / وزن)
+- Prefixes, suffixes, infixes
+- Tense, gender, number, and derivation
+
+> **Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.**
+
+SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.
+
+Most tokenizers treat Arabic as **bytes or characters**. **SARF treats Arabic as a *language*.**
 
 ## Features
 
@@ -162,4 +195,4 @@ CC-BY-NC-4.0
 url={https://huggingface.co/almaghrabima/SARFTokenizer},
 note={Independent research, part of Suhail Project}
 }
-```
+```
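
Review note: the README text added in this commit says SARF combines morphological analysis with BPE tokenization. The snippet below is only a rough sketch of that general idea, not the SARF implementation. It assumes a hypothetical, hand-picked affix list (`PREFIXES`, `SUFFIXES`), an assumed minimum stem length (`MIN_STEM`), and a simple greedy rule that splits at most one prefix and one suffix off each word before the pieces would be handed to a BPE trainer.

```python
from typing import List

# Hypothetical, illustrative affix lists; NOT taken from the SARF tokenizer.
# Prefixes are ordered longest first so the greedy match prefers them.
PREFIXES = ("وال", "بال", "ال", "و", "ب", "ل")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")
MIN_STEM = 3  # assumed minimum stem length, so short words stay whole


def segment(word: str) -> List[str]:
    """Greedily split at most one prefix and one suffix off a word."""
    pieces: List[str] = []
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            pieces.append(prefix)
            word = word[len(prefix):]
            break
    suffix = ""
    for candidate in SUFFIXES:
        if word.endswith(candidate) and len(word) - len(candidate) >= MIN_STEM:
            suffix = candidate
            word = word[: -len(candidate)]
            break
    pieces.append(word)  # whatever remains is treated as the stem
    if suffix:
        pieces.append(suffix)
    return pieces


def pre_tokenize(text: str) -> List[str]:
    """Split on whitespace, then segment each word into morpheme-like pieces."""
    return [piece for word in text.split() for piece in segment(word)]


if __name__ == "__main__":
    print(pre_tokenize("الكتاب book"))
```

Words with no matching affix, including the English ones, pass through unchanged, so the same pre-tokenizer could sit in front of a BPE trainer for bilingual text. A real morphological analyzer would handle roots, patterns, and infixes, which this affix-stripping sketch does not attempt.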