almaghrabima committed on
Commit d3c23ec · verified · 1 Parent(s): 55db6a1

Update README.md

Files changed (1)
  1. README.md +36 -3
README.md CHANGED
@@ -1,6 +1,39 @@
- # Deeplatent
+ ---
+ license: cc-by-nc-4.0
+ language:
+ - ar
+ - en
+ tags:
+ - tokenizer
+ - arabic
+ - morphology
+ - bpe
+ - deeplatent
+ - english
+ - arabic
+ pipeline_tag: text-generation
+ ---

- High-performance Arabic tokenizer with morphology and parity awareness. Built with Rust for speed, with Python bindings for ease of use.
+ # DeepLatent SARF Tokenizer
+
+ **Part of Suhail Project - Independent Research by Mohammed Almaghrabi**
+
+ This is the **SARF** (Sarf-Aware Representation Framework) tokenizer designed for the DeepLatent language model, trained on bilingual Arabic/English data.
+
+ ## What is SARF?
+
+ **SARF (صَرْف)** is the Arabic term for **morphology**. In classical and modern Arabic linguistics, *ṣarf* refers to the system that governs:
+
+ - Word formation
+ - Roots and patterns (جذر / وزن)
+ - Prefixes, suffixes, infixes
+ - Tense, gender, number, and derivation
+
+ > **Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.**
+
+ SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.
+
+ Most tokenizers treat Arabic as **bytes or characters**. **SARF treats Arabic as a *language*.**

  ## Features

@@ -162,4 +195,4 @@ CC-BY-NC-4.0
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
  }
- ```
+ ```
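
The README text added in this commit says SARF combines morphological analysis with BPE tokenization. The following is a minimal sketch of that general idea, not the SARF implementation: the affix lists, the `segment`, `learn_bpe`, `merge_pair`, and `encode` helpers, and the rule that merges never cross a morpheme boundary are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, tiny affix lists for illustration only (not SARF's real rules).
# Longer affixes come first so the first match is the longest match.
PREFIXES = ("وال", "بال", "ال", "و", "ب")
SUFFIXES = ("ون", "ات", "ها", "ة")


def segment(word, min_stem=3):
    """Split a word into [prefix?, stem, suffix?] by naive affix stripping."""
    parts = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            parts.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            suffix = s
            word = word[:-len(s)]
            break
    parts.append(word)
    if suffix is not None:
        parts.append(suffix)
    return parts


def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn BPE merges inside morphological segments, never across them."""
    seqs = [list(seg) for w in words for seg in segment(w)]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges


def encode(word, merges):
    """Segment a word, then apply the learned merges to each segment in order."""
    tokens = []
    for seg in segment(word):
        seq = list(seg)
        for pair in merges:
            seq = merge_pair(seq, pair)
        tokens.extend(seq)
    return tokens


if __name__ == "__main__":
    corpus = ["والكتاب", "الكتاب", "كتاب"]
    merges = learn_bpe(corpus, num_merges=3)
    print(encode("والكتاب", merges))
```

Because the stem is isolated before any merges are counted, its characters are counted together across all three surface forms, so the merges consolidate the shared stem rather than fragments that straddle a prefix boundary. A real morphological analyzer would replace the hard-coded affix lists used here.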