Commit 7ca7ea7 (verified) by omarkamali · 1 Parent(s): 40d9b30

Upload all models and assets for arz (20251001)

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +297 -147
  2. models/embeddings/monolingual/arz_128d.bin +2 -2
  3. models/embeddings/monolingual/arz_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/arz_32d.bin +2 -2
  5. models/embeddings/monolingual/arz_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/arz_64d.bin +2 -2
  7. models/embeddings/monolingual/arz_64d_metadata.json +5 -3
  8. models/subword_markov/arz_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/arz_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/arz_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/arz_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/arz_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/arz_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/arz_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/arz_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/arz_2gram_subword.parquet +2 -2
  17. models/subword_ngram/arz_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/arz_3gram_subword.parquet +2 -2
  19. models/subword_ngram/arz_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/arz_4gram_subword.parquet +2 -2
  21. models/subword_ngram/arz_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/arz_tokenizer_16k.model +2 -2
  23. models/tokenizer/arz_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/arz_tokenizer_32k.model +2 -2
  25. models/tokenizer/arz_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/arz_tokenizer_64k.model +2 -2
  27. models/tokenizer/arz_tokenizer_64k.vocab +0 -0
  28. models/tokenizer/arz_tokenizer_8k.model +2 -2
  29. models/tokenizer/arz_tokenizer_8k.vocab +0 -0
  30. models/vocabulary/arz_vocabulary.parquet +2 -2
  31. models/vocabulary/arz_vocabulary_metadata.json +10 -9
  32. models/word_markov/arz_markov_ctx1_word.parquet +2 -2
  33. models/word_markov/arz_markov_ctx1_word_metadata.json +2 -2
  34. models/word_markov/arz_markov_ctx2_word.parquet +2 -2
  35. models/word_markov/arz_markov_ctx2_word_metadata.json +2 -2
  36. models/word_markov/arz_markov_ctx3_word.parquet +2 -2
  37. models/word_markov/arz_markov_ctx3_word_metadata.json +2 -2
  38. models/word_markov/arz_markov_ctx4_word.parquet +2 -2
  39. models/word_markov/arz_markov_ctx4_word_metadata.json +2 -2
  40. models/word_ngram/arz_2gram_word.parquet +2 -2
  41. models/word_ngram/arz_2gram_word_metadata.json +2 -2
  42. models/word_ngram/arz_3gram_word.parquet +2 -2
  43. models/word_ngram/arz_3gram_word_metadata.json +2 -2
  44. models/word_ngram/arz_4gram_word.parquet +2 -2
  45. models/word_ngram/arz_4gram_word_metadata.json +2 -2
  46. visualizations/embedding_isotropy.png +0 -0
  47. visualizations/embedding_norms.png +0 -0
  48. visualizations/embedding_similarity.png +2 -2
  49. visualizations/markov_branching.png +0 -0
  50. visualizations/markov_contexts.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
- value: 4.023
27
  - name: best_isotropy
28
  type: isotropy
29
- value: 0.8067
30
  - name: vocabulary_size
31
  type: vocab
32
- value: 1000000
33
- generated: 2025-12-27
34
  ---
35
 
36
  # Egyptian Arabic - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
- - N-gram models (2, 3, 4-gram)
48
- - Markov chains (context of 1, 2, 3 and 4)
49
  - Subword N-gram and Markov chains
50
- - Embeddings in various sizes and dimensions
51
  - Language Vocabulary
52
  - Language Statistics
 
53
  ![Performance Dashboard](visualizations/performance_dashboard.png)
54
 
55
  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
59
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
60
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
61
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
62
- - [6. Summary & Recommendations](#6-summary--recommendations)
 
63
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
64
  - [Visualizations Index](#visualizations-index)
65
 
@@ -68,64 +70,57 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
68
 
69
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
70
 
71
  ### Results
72
 
73
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
74
  |------------|-------------|---------------|----------|--------------|
75
- | **8k** | 3.017x | 2.93 | 0.4317% | 1,742,125 |
76
- | **16k** | 3.365x | 3.27 | 0.4814% | 1,562,212 |
77
- | **32k** | 3.703x | 3.60 | 0.5297% | 1,419,540 |
78
- | **64k** | 4.023x 🏆 | 3.91 | 0.5755% | 1,306,653 |
79
 
80
  ### Tokenization Examples
81
 
82
  Below are sample sentences tokenized with each vocabulary size:
83
 
84
- **Sample 1:** `دونج سارى فنان من كامبوديا.
85
-
86
- حياته
87
- دونج سارى من مواليد يوم 1 يناير 1957.
88
-
89
- لين...`
90
 
91
  | Vocab | Tokens | Count |
92
  |-------|--------|-------|
93
- | 8k | `▁دونج ▁س ارى ▁فنان ▁من ▁كامب ود يا . ▁حياته ... (+28 more)` | 38 |
94
- | 16k | `▁دونج ▁س ارى ▁فنان ▁من ▁كامب وديا . ▁حياته ▁دونج ... (+26 more)` | 36 |
95
- | 32k | `▁دونج ▁سارى ▁فنان ▁من ▁كامبوديا . ▁حياته ▁دونج ▁سارى ▁من ... (+22 more)` | 32 |
96
- | 64k | `▁دونج ▁سارى ▁فنان ▁من ▁كامبوديا . ▁حياته ▁دونج ▁سارى ▁من ... (+22 more)` | 32 |
97
-
98
- **Sample 2:** `بيريباروس ( الاسم العلمى: Periparus ) هوا جنس من الطيور بيتبع قرقفيات.
99
 
100
- لينكات ...`
101
 
102
  | Vocab | Tokens | Count |
103
  |-------|--------|-------|
104
- | 8k | `▁بير يب اروس ▁( ▁الاسم ▁العلم ى : ▁per ip ... (+36 more)` | 46 |
105
- | 16k | `▁بير يب اروس ▁( ▁الاسم ▁العلمى : ▁per ip arus ... (+31 more)` | 41 |
106
- | 32k | `▁بير يب اروس ▁( ▁الاسم ▁العلمى : ▁per ip arus ... (+27 more)` | 37 |
107
- | 64k | `▁بير يب اروس ▁( ▁الاسم ▁العلمى : ▁per ip arus ... (+26 more)` | 36 |
108
-
109
- **Sample 3:** `قلى بيغلو ( بالفارسى قلیبیگلو ) قريه فى ايران.
110
-
111
- لينكات برانيه
112
 
113
-
114
- سبب التسميه
115
- ...`
116
 
117
  | Vocab | Tokens | Count |
118
  |-------|--------|-------|
119
- | 8k | `▁ق لى ▁بي غل و ▁( ▁بالفارسى ▁قل ی ب ... (+21 more)` | 31 |
120
- | 16k | `▁ق لى ▁بي غلو ▁( ▁بالفارسى ▁قل ی ب ی ... (+20 more)` | 30 |
121
- | 32k | `▁ق لى ▁بي غلو ▁( ▁بالفارسى ▁قل ی بی گ ... (+19 more)` | 29 |
122
- | 64k | `▁ق لى ▁بي غلو ▁( ▁بالفارسى ▁قل ی بی گ ... (+19 more)` | 29 |
123
 
124
 
125
  ### Key Findings
126
 
127
- - **Best Compression:** 64k achieves 4.023x compression
128
- - **Lowest UNK Rate:** 8k with 0.4317% unknown tokens
129
  - **Trade-off:** Larger vocabularies improve compression but increase model size
130
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
131
 
@@ -134,57 +129,89 @@ Below are sample sentences tokenized with each vocabulary size:
134
 
135
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
136
 
137
  ![N-gram Coverage](visualizations/ngram_coverage.png)
138
 
139
  ### Results
140
 
141
- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
142
- |--------|------------|---------|----------------|------------------|-------------------|
143
- | **2-gram** | 6,568 🏆 | 12.68 | 1,353,863 | 27.8% | 64.6% |
144
- | **2-gram** | 384 🏆 | 8.58 | 15,875 | 59.5% | 97.6% |
145
- | **3-gram** | 11,727 | 13.52 | 2,306,233 | 23.3% | 59.5% |
146
- | **3-gram** | 2,432 | 11.25 | 170,638 | 29.7% | 70.0% |
147
- | **4-gram** | 21,664 | 14.40 | 4,880,347 | 21.3% | 54.9% |
148
- | **4-gram** | 8,554 | 13.06 | 1,045,558 | 20.0% | 54.3% |
149
 
150
  ### Top 5 N-grams by Size
151
 
152
- **2-grams:**
153
 
154
  | Rank | N-gram | Count |
155
  |------|--------|-------|
156
- | 1 | `تصنيف :` | 4,787,649 |
157
- | 2 | `مصادر تصنيف` | 1,597,448 |
158
- | 3 | `لينكات برانيه` | 1,294,464 |
159
- | 4 | `برانيه مصادر` | 1,167,374 |
160
- | 5 | `من مواليد` | 829,312 |
161
 
162
- **3-grams:**
163
 
164
  | Rank | N-gram | Count |
165
  |------|--------|-------|
166
- | 1 | `مصادر تصنيف :` | 1,597,447 |
167
- | 2 | `برانيه مصادر تصنيف` | 1,165,016 |
168
- | 3 | `لينكات برانيه مصادر` | 1,164,743 |
169
- | 4 | `من مواليد يوم` | 809,006 |
170
- | 5 | `. المطلع المستقيم` | 668,795 |
171
 
172
- **4-grams:**
173
 
174
  | Rank | N-gram | Count |
175
  |------|--------|-------|
176
- | 1 | `برانيه مصادر تصنيف :` | 1,165,016 |
177
- | 2 | `لينكات برانيه مصادر تصنيف` | 1,162,419 |
178
- | 3 | `. لينكات برانيه مصادر` | 558,227 |
179
- | 4 | `الدايره الساعيه لجرم سماوى` | 445,892 |
180
- | 5 | `خط الاستوا السماوى تكون` | 445,860 |
181
 
182
 
183
  ### Key Findings
184
 
185
- - **Best Perplexity:** 2-gram with 384
186
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
187
- - **Coverage:** Top-1000 patterns cover ~54% of corpus
188
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
189
 
190
  ---
@@ -192,55 +219,86 @@ Below are sample sentences tokenized with each vocabulary size:
192
 
193
  ![Markov Entropy](visualizations/markov_entropy.png)
194
 
195
  ![Markov Branching](visualizations/markov_branching.png)
196
 
197
  ### Results
198
 
199
- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
200
- |---------|-------------|------------|------------------|-----------------|----------------|
201
- | **1** | 0.8482 | 1.800 | 6.43 | 2,177,526 | 15.2% |
202
- | **1** | 1.2448 | 2.370 | 9.40 | 4,455 | 0.0% |
203
- | **2** | 0.3675 | 1.290 | 2.04 | 13,972,922 | 63.2% |
204
- | **2** | 0.9830 | 1.977 | 7.64 | 41,887 | 1.7% |
205
- | **3** | 0.1212 | 1.088 | 1.34 | 28,492,247 | 87.9% |
206
- | **3** | 0.9327 | 1.909 | 5.31 | 320,101 | 6.7% |
207
- | **4** | 0.0720 🏆 | 1.051 | 1.20 | 38,092,155 | 92.8% |
208
- | **4** | 0.7558 🏆 | 1.689 | 3.71 | 1,700,014 | 24.4% |
209
 
210
- ### Generated Text Samples
211
 
212
- Below are text samples generated from each Markov chain model:
213
 
214
  **Context Size 1:**
215
 
216
- 1. `. شوف كمان رواد الشركات فى كليه سانت لورانس و كيكرز . 0 . السرعه الشعاعيه`
217
- 2. `: مهندسين من تشيكوسلوفاكيا . hubble census finds galaxies at the ottoman empire and historical demog...`
218
- 3. `تصنيف : anguilla australis ) و خط الاستوا السماوى تكون قيمة بعده بالموجب ( ) بيتر`
219
 
220
  **Context Size 2:**
221
 
222
- 1. `تصنيف : لاعب كريكت من دومينيكا . لينكات برانيه مصادر تصنيف : قسيس كاتوليك من مملكة نيديرلاند`
223
- 2. `مصادر تصنيف : مسطحات مائيه فى كندا . جغرافيا نهر وايتاهايا بيصب فى نهر اوبى . اقرا`
224
- 3. `لينكات برانيه مصادر تصنيف : ناس تصنيف : عالم لاهوت من المانيا ) جورج فرانسيس ماكجينيس من`
225
 
226
  **Context Size 3:**
227
 
228
- 1. `مصادر تصنيف : موسم رياضى فى كورة قدم اتعمل فى پورتوجال سنة 2007 . لينكات برانيه مصادر تصنيف`
229
- 2. `برانيه مصادر تصنيف : ممثلين من البرازيل تصنيف : متعلمين فى جامعة ڤيرچينيا كومونولث و جامعة ڤيرچينيا ...`
230
- 3. `لينكات برانيه مصادر تصنيف : سياسيين تصنيف : سياسيين تصنيف : سياسيين من تشيكيا تصنيف : ناس لسا`
231
 
232
  **Context Size 4:**
233
 
234
- 1. `برانيه مصادر تصنيف : قرى اليمن تصنيف : قرى يافع`
235
- 2. `لينكات برانيه مصادر تصنيف : مغنيين تصنيف : مغنيين امريكان تصنيف : منتجين تيليڤزيون تصنيف : مخرجين تي...`
236
- 3. `. لينكات برانيه مصادر تصنيف : مسطحات مائيه فى فينلاندا`
237
 
238
 
239
  ### Key Findings
240
 
241
- - **Best Predictability:** Context-4 with 92.8% predictability
242
  - **Branching Factor:** Decreases with context size (more deterministic)
243
- - **Memory Trade-off:** Larger contexts require more storage (1,700,014 contexts)
244
  - **Recommendation:** Context-3 or Context-4 for text generation
245
 
246
  ---
@@ -256,64 +314,64 @@ Below are text samples generated from each Markov chain model:
256
 
257
  | Metric | Value |
258
  |--------|-------|
259
- | Vocabulary Size | 1,000,000 |
260
- | Total Tokens | 132,694,568 |
261
- | Mean Frequency | 132.69 |
262
- | Median Frequency | 5 |
263
- | Frequency Std Dev | 10016.26 |
264
 
265
  ### Most Common Words
266
 
267
  | Rank | Word | Frequency |
268
  |------|------|-----------|
269
- | 1 | تصنيف | 4,791,431 |
270
- | 2 | فى | 4,424,323 |
271
- | 3 | من | 3,918,728 |
272
- | 4 | و | 3,519,108 |
273
- | 5 | مصادر | 1,613,116 |
274
- | 6 | لينكات | 1,360,026 |
275
- | 7 | برانيه | 1,299,627 |
276
- | 8 | مواليد | 1,130,480 |
277
- | 9 | هيا | 1,062,827 |
278
- | 10 | اللى | 967,453 |
279
 
280
  ### Least Common Words (from vocabulary)
281
 
282
  | Rank | Word | Frequency |
283
  |------|------|-----------|
284
- | 1 | 1513504 | 2 |
285
- | 2 | 412321 | 2 |
286
- | 3 | j08004409 | 2 |
287
- | 4 | 1545353 | 2 |
288
- | 5 | 1073083 | 2 |
289
- | 6 | 0566528 | 2 |
290
- | 7 | 2103522 | 2 |
291
- | 8 | 1579797 | 2 |
292
- | 9 | 2051684 | 2 |
293
- | 10 | 600674 | 2 |
294
 
295
  ### Zipf's Law Analysis
296
 
297
  | Metric | Value |
298
  |--------|-------|
299
- | Zipf Coefficient | 1.2848 |
300
- | R² (Goodness of Fit) | 0.995303 |
301
  | Adherence Quality | **excellent** |
302
 
303
  ### Coverage Analysis
304
 
305
  | Top N Words | Coverage |
306
  |-------------|----------|
307
- | Top 100 | 44.5% |
308
- | Top 1,000 | 75.8% |
309
- | Top 5,000 | 85.7% |
310
- | Top 10,000 | 88.8% |
311
 
312
  ### Key Findings
313
 
314
- - **Zipf Compliance:** R²=0.9953 indicates excellent adherence to Zipf's law
315
- - **High Frequency Dominance:** Top 100 words cover 44.5% of corpus
316
- - **Long Tail:** 990,000 words needed for remaining 11.2% coverage
317
 
318
  ---
319
  ## 5. Word Embeddings Evaluation
@@ -326,24 +384,113 @@ Below are text samples generated from each Markov chain model:
326
 
327
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
328
 
329
- ### Model Comparison
330
 
331
- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
332
- |-------|------------|-----------|----------|----------|----------|
333
- | **mono_32d** | 573,439 | 32 | 5.170 | 1.489 | 0.8067 🏆 |
334
- | **mono_64d** | 573,439 | 64 | 5.558 | 1.367 | 0.7701 |
335
- | **mono_128d** | 573,439 | 128 | 5.982 | 1.280 | 0.7047 |
336
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
337
 
338
  ### Key Findings
339
 
340
- - **Best Isotropy:** mono_32d with 0.8067 (more uniform distribution)
341
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
342
- - **Vocabulary Coverage:** All models cover 573,439 words
343
- - **Recommendation:** 100d for balanced semantic capture and efficiency
344
 
345
  ---
346
- ## 6. Summary & Recommendations
347
 
348
  ![Performance Dashboard](visualizations/performance_dashboard.png)
349
 
@@ -351,11 +498,12 @@ Below are text samples generated from each Markov chain model:
351
 
352
  | Component | Recommended | Rationale |
353
  |-----------|-------------|-----------|
354
- | Tokenizer | **32k BPE** | Best compression (4.02x) with low UNK rate |
355
- | N-gram | **5-gram** | Lowest perplexity (384) |
356
- | Markov | **Context-4** | Highest predictability (92.8%) |
357
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
358
 
 
359
  ---
360
  ## Appendix: Metrics Glossary & Interpretation Guide
361
 
@@ -545,7 +693,8 @@ If you use these models in your research, please cite:
545
  author = {Kamali, Omar},
546
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
547
  year = {2025},
548
- publisher = {HuggingFace},
 
549
  url = {https://huggingface.co/wikilangs}
550
  institution = {Omneity Labs}
551
  }
@@ -561,7 +710,8 @@ MIT License - Free for academic and commercial use.
561
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
562
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
563
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 
564
  ---
565
  *Generated by Wikilangs Models Pipeline*
566
 
567
- *Report Date: 2025-12-27 18:23:46*
 
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
+ value: 3.905
27
  - name: best_isotropy
28
  type: isotropy
29
+ value: 0.7897
30
  - name: vocabulary_size
31
  type: vocab
32
+ value: 0
33
+ generated: 2026-01-03
34
  ---
35
 
36
  # Egyptian Arabic - Wikilangs Models
 
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
+ - N-gram models (2, 3, 4, 5-gram)
48
+ - Markov chains (context of 1, 2, 3, 4 and 5)
49
  - Subword N-gram and Markov chains
50
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
51
  - Language Vocabulary
52
  - Language Statistics
53
+
54
  ![Performance Dashboard](visualizations/performance_dashboard.png)
55
 
56
  ### Analysis and Evaluation
 
60
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
61
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
62
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
63
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
64
+ - [7. Summary & Recommendations](#7-summary--recommendations)
65
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
66
  - [Visualizations Index](#visualizations-index)
67
 
 
70
 
71
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
72
 
73
+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
74
+
75
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
76
+
77
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
78
+
79
  ### Results
80
 
81
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
82
  |------------|-------------|---------------|----------|--------------|
83
+ | **8k** | 2.876x | 2.88 | 0.8210% | 1,709,035 |
84
+ | **16k** | 3.215x | 3.22 | 0.9180% | 1,528,463 |
85
+ | **32k** | 3.559x | 3.56 | 1.0163% | 1,380,735 |
86
+ | **64k** | 3.905x 🏆 | 3.91 | 1.1149% | 1,258,558 |
87
 
88
  ### Tokenization Examples
89
 
90
  Below are sample sentences tokenized with each vocabulary size:
91
 
92
+ **Sample 1:** `تاملكوت هوا دوار فى المغرب. المكان تاملكوت موجود فى منطقه اداريه اسمها تماسين. س...`
93
 
94
  | Vocab | Tokens | Count |
95
  |-------|--------|-------|
96
+ | 8k | `▁تام لك وت ▁هوا ▁دوار ▁فى ▁المغرب . ▁المك ان ... (+24 more)` | 34 |
97
+ | 16k | `▁تام لك وت ▁هوا ▁دوار ▁فى ▁المغرب . ▁المك ان ... (+24 more)` | 34 |
98
+ | 32k | `▁تام لك وت ▁هوا ▁دوار ▁فى ▁المغرب . ▁المكان ▁تام ... (+23 more)` | 33 |
99
+ | 64k | `▁تام لك وت ▁هوا ▁دوار ▁فى ▁المغرب . ▁المكان ▁تام ... (+23 more)` | 33 |
100
 
101
+ **Sample 2:** `جيريمى ديفيدسون مخرج افلام من امريكا. حياته جيريمى ديفيدسون من مواليد يوم 24 ديس...`
102
 
103
  | Vocab | Tokens | Count |
104
  |-------|--------|-------|
105
+ | 8k | `▁جير يمى ▁ديفيد سون ▁مخرج ▁افلام ▁من ▁امريكا . ▁حياته ... (+23 more)` | 33 |
106
+ | 16k | `▁جيريمى ▁ديفيد سون ▁مخرج ▁افلام ▁من ▁امريكا . ▁حياته ▁جيريمى ... (+21 more)` | 31 |
107
+ | 32k | `▁جيريمى ▁ديفيدسون ▁مخرج ▁افلام ▁من ▁امريكا . ▁حياته ▁جيريمى ▁ديفيدسون ... (+19 more)` | 29 |
108
+ | 64k | `▁جيريمى ▁ديفيدسون ▁مخرج ▁افلام ▁من ▁امريكا . ▁حياته ▁جيريمى ▁ديفيدسون ... (+19 more)` | 29 |
 
109
 
110
+ **Sample 3:** `ابهينايا ممثله من الهند. حياتها ابهينايا من مواليد يوم 13 نوفمبر سنة فى كارناتاك...`
111
 
112
  | Vocab | Tokens | Count |
113
  |-------|--------|-------|
114
+ | 8k | `▁اب ه ينا يا ▁ممثله ▁من ▁الهند . ▁حياتها ▁اب ... (+28 more)` | 38 |
115
+ | 16k | `▁اب ه ينا يا ▁ممثله ▁من ▁الهند . ▁حياتها ▁اب ... (+27 more)` | 37 |
116
+ | 32k | `▁ابه ينا يا ▁ممثله ▁من ▁الهند . ▁حياتها ▁ابه ينا ... (+25 more)` | 35 |
117
+ | 64k | `▁ابه ينا يا ▁ممثله ▁من ▁الهند . ▁حياتها ▁ابه ينا ... (+24 more)` | 34 |
118
 
119
 
120
  ### Key Findings
121
 
122
+ - **Best Compression:** 64k achieves 3.905x compression
123
+ - **Lowest UNK Rate:** 8k with 0.8210% unknown tokens
124
  - **Trade-off:** Larger vocabularies improve compression but increase model size
125
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
126
 
 
129
 
130
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
131
 
132
+ ![N-gram Unique](visualizations/ngram_unique.png)
133
+
134
  ![N-gram Coverage](visualizations/ngram_coverage.png)
135
 
136
  ### Results
137
 
138
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
139
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
140
+ | **2-gram** | Word | 5,793 | 12.50 | 1,073,861 | 30.2% | 66.4% |
141
+ | **2-gram** | Subword | 316 🏆 | 8.30 | 15,451 | 62.6% | 98.6% |
142
+ | **3-gram** | Word | 8,299 | 13.02 | 1,682,809 | 28.5% | 62.7% |
143
+ | **3-gram** | Subword | 2,021 | 10.98 | 129,923 | 30.1% | 74.0% |
144
+ | **4-gram** | Word | 12,842 | 13.65 | 3,054,922 | 27.3% | 59.4% |
145
+ | **4-gram** | Subword | 7,215 | 12.82 | 788,718 | 19.6% | 56.9% |
146
 
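The Perplexity and Entropy columns in the table above appear to be related by perplexity = 2^entropy, with entropy in bits. This is inferred from the reported numbers, not stated in the README, so the sketch below only checks that the relation holds to within rounding. The helper names are hypothetical.

```python
def perplexity_from_entropy(entropy_bits):
    """Perplexity implied by an entropy given in bits (assumed relation)."""
    return 2.0 ** entropy_bits

# (entropy, reported perplexity) pairs copied from the results table above.
reported = {
    ("2-gram", "Word"): (12.50, 5793),
    ("2-gram", "Subword"): (8.30, 316),
    ("3-gram", "Word"): (13.02, 8299),
    ("3-gram", "Subword"): (10.98, 2021),
    ("4-gram", "Word"): (13.65, 12842),
    ("4-gram", "Subword"): (12.82, 7215),
}

def relative_gap(entropy_bits, reported_perplexity):
    """Relative difference between the implied and the reported perplexity."""
    implied = perplexity_from_entropy(entropy_bits)
    return abs(implied - reported_perplexity) / reported_perplexity

gaps = {key: relative_gap(*values) for key, values in reported.items()}

for key, gap in gaps.items():
    print(key, f"{gap:.2%}")
```

The small residual gaps are consistent with the entropy column being rounded to two decimals before display.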
147
  ### Top 5 N-grams by Size
148
 
149
+ **2-grams (Word):**
150
+
151
+ | Rank | N-gram | Count |
152
+ |------|--------|-------|
153
+ | 1 | `لينكات برانيه` | 1,293,684 |
154
+ | 2 | `برانيه مصادر` | 1,167,581 |
155
+ | 3 | `من مواليد` | 829,322 |
156
+ | 4 | `مواليد يوم` | 809,177 |
157
+ | 5 | `الاستوا السماوى` | 668,876 |
158
+
159
+ **3-grams (Word):**
160
 
161
  | Rank | N-gram | Count |
162
  |------|--------|-------|
163
+ | 1 | `لينكات برانيه مصادر` | 1,164,952 |
164
+ | 2 | `من مواليد يوم` | 809,029 |
165
+ | 3 | `خط الاستوا السماوى` | 630,228 |
166
+ | 4 | `الدايره الساعيه لجرم` | 445,892 |
167
+ | 5 | `الساعيه لجرم سماوى` | 445,892 |
168
 
169
+ **4-grams (Word):**
170
 
171
  | Rank | N-gram | Count |
172
  |------|--------|-------|
173
+ | 1 | `الدايره الساعيه لجرم سماوى` | 445,892 |
174
+ | 2 | `السماوى تكون قيمة بعده` | 445,860 |
175
+ | 3 | `خط الاستوا السماوى تكون` | 445,860 |
176
+ | 4 | `الاستوا السماوى تكون قيمة` | 445,860 |
177
+ | 5 | `لينكات برانيه مصادر من` | 320,727 |
178
 
179
+ **2-grams (Subword):**
180
 
181
  | Rank | N-gram | Count |
182
  |------|--------|-------|
183
+ | 1 | `_ ا` | 31,144,333 |
184
+ | 2 | ل` | 30,224,243 |
185
+ | 3 | _` | 17,180,633 |
186
+ | 4 | `_ م` | 13,559,836 |
187
+ | 5 | _` | 11,805,719 |
188
+
189
+ **3-grams (Subword):**
190
+
191
+ | Rank | N-gram | Count |
192
+ |------|--------|-------|
193
+ | 1 | `_ ا ل` | 25,116,125 |
194
+ | 2 | `ي ه _` | 6,396,587 |
195
+ | 3 | `ه _ ا` | 6,346,797 |
196
+ | 4 | `ا ل م` | 5,946,692 |
197
+ | 5 | `_ م ن` | 4,537,386 |
198
+
199
+ **4-grams (Subword):**
200
+
201
+ | Rank | N-gram | Count |
202
+ |------|--------|-------|
203
+ | 1 | `ه _ ا ل` | 5,297,759 |
204
+ | 2 | `_ ا ل م` | 5,200,038 |
205
+ | 3 | `_ ف ى _` | 4,251,301 |
206
+ | 4 | `_ م ن _` | 3,906,606 |
207
+ | 5 | `_ ا ل ا` | 3,578,656 |
208
 
209
 
210
  ### Key Findings
211
 
212
+ - **Best Perplexity:** 2-gram (subword) with 316
213
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
214
+ - **Coverage:** Top-1000 patterns cover ~57% of corpus
215
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
216
 
217
  ---
 
219
 
220
  ![Markov Entropy](visualizations/markov_entropy.png)
221
 
222
+ ![Markov Contexts](visualizations/markov_contexts.png)
223
+
224
  ![Markov Branching](visualizations/markov_branching.png)
225
 
226
  ### Results
227
 
228
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
229
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
230
+ | **1** | Word | 1.2217 | 2.332 | 9.13 | 1,353,062 | 0.0% |
231
+ | **1** | Subword | 1.0533 | 2.075 | 8.28 | 5,726 | 0.0% |
232
+ | **2** | Word | 0.3648 | 1.288 | 1.91 | 12,336,484 | 63.5% |
233
+ | **2** | Subword | 0.7848 | 1.723 | 5.54 | 47,379 | 21.5% |
234
+ | **3** | Word | 0.1139 | 1.082 | 1.28 | 23,517,673 | 88.6% |
235
+ | **3** | Subword | 0.7673 | 1.702 | 4.73 | 262,420 | 23.3% |
236
+ | **4** | Word | 0.0625 🏆 | 1.044 | 1.17 | 29,894,419 | 93.7% |
237
+ | **4** | Subword | 0.7433 | 1.674 | 3.81 | 1,241,425 | 25.7% |
238
 
239
+ ### Generated Text Samples (Word-based)
240
 
241
+ Below are text samples generated from each word-based Markov chain model:
242
 
243
  **Context Size 1:**
244
 
245
+ 1. `فى العالم حسب المساحه لستة اكبر بحيرات اوروبا لينكات مصادر من مملكه ايطاليا حياته الرياضيه بيلعب`
246
+ 2. `من مواليد يوم 16 يناير سنة فى ذا ماتشيس بتقدم الانواع الفنيه كانت دى لوبو من`
247
+ 3. `بتنقاس بالانزياح الاحمر المطلع المستقيم ممكن يتقاس بقوس دايره الاستواء السماويه من الجرى و نادى`
248
 
249
  **Context Size 2:**
250
 
251
+ 1. `لينكات برانيه مصادر عجل ناريه من المانيا حياته اليكساندر انتونوڤيتش ريزونى اليكساندر انستروثير اليكس...`
252
+ 2. `برانيه مصادر كوره قدم من الميكسيك حياته اڤير كاباليرو اڤيرالدو فيريرا لاعب كورة قدم من اليابان حياته`
253
+ 3. `من مواليد يوم 19 اغسطس لسا عايشين فى استانبول لينكات برانيه مصادر هوكى الجليد من امريكا حياته`
254
 
255
  **Context Size 3:**
256
 
257
+ 1. `لينكات برانيه مصادر سكان سكان فى ايران المكان ادم درهسى عليا adam darrehsi ye olya هيا تجمع سكان`
258
+ 2. `من مواليد يوم 7 ديسمبر فى مونتفيدو الحياه الرياضيه بيلعب فى مركز مُدَافِع و لعب مع فريق ريال`
259
+ 3. `خط الاستوا السماوى تكون قيمة بعده بالموجب و لو النجم جنوب خط الاستوا السماوى لو كان النجم شمال`
260
 
261
  **Context Size 4:**
262
 
263
+ 1. `الدايره الساعيه لجرم سماوى و الدايره الساعيه لنقطة الاعتدال الربيعى المطلع المستقيم ممكن يتقاس بقوس ...`
264
+ 2. `الاستوا السماوى تكون قيمة بعده بالسالب مصادر كوكبه`
265
+ 3. `السماوى تكون قيمة بعده بالسالب مصادر 2ماس كوكبه`
266
+
267
+
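As a rough illustration of how word-based samples like the ones above can be produced, here is a minimal sketch of a context-k Markov chain. It is built from a toy token list (two phrases taken from the samples), not from the published parquet files, whose schema is not shown here. `build_chain` and `generate` are hypothetical helper names.

```python
import random
from collections import defaultdict

def build_chain(tokens, context_size):
    """Map each context (tuple of k tokens) to the list of observed next tokens."""
    chain = defaultdict(list)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        chain[context].append(tokens[i + context_size])
    return chain

def generate(chain, seed_context, max_tokens, rng):
    """Extend the seed by sampling followers until a dead end or the length cap."""
    out = list(seed_context)
    k = len(seed_context)
    while len(out) < max_tokens:
        followers = chain.get(tuple(out[-k:]))
        if not followers:
            break
        out.append(rng.choice(followers))
    return out

# Toy corpus (assumption): two short phrases that occur in the samples above.
tokens = "من مواليد يوم 16 يناير من مواليد يوم 19 اغسطس".split()
chain = build_chain(tokens, context_size=2)
sample = generate(chain, ("من", "مواليد"), max_tokens=6, rng=random.Random(0))

# Number of distinct followers for one context: a per-context branching factor.
branching = len(set(chain[("مواليد", "يوم")]))
print(" ".join(sample), branching)
```

Longer contexts leave fewer distinct followers per context, which matches the falling branching factor in the results table.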
268
+ ### Generated Text Samples (Subword-based)
269
+
270
+ Below are text samples generated from each subword-based Markov chain model:
271
+
272
+ **Context Size 1:**
273
+
274
+ 1. `_مرا_ارو_لسيه_س_`
275
+ 2. `الدريه_اثر_كالال`
276
+ 3. `لخطونالمطة_جو_عب`
277
+
278
+ **Context Size 2:**
279
+
280
+ 1. `_الكريتالسمات_فى_`
281
+ 2. `اليه_عاعيه_مطلحجم`
282
+ 3. `ه_بقه_ليكا_بقوى_ا`
283
+
284
+ **Context Size 3:**
285
+
286
+ 1. `_المستقيم_محمد_بيس`
287
+ 2. `يه_مصادر_كورة_قدم_`
288
+ 3. `ه_العقبت_برات_السم`
289
+
290
+ **Context Size 4:**
291
+
292
+ 1. `ه_السماوى_مع_فريق_ن`
293
+ 2. `_المكافئ_الفلك._الم`
294
+ 3. `_فى_باردوه_مصادر_اس`
295
 
296
 
297
  ### Key Findings
298
 
299
+ - **Best Predictability:** Context-4 (word) with 93.7% predictability
300
  - **Branching Factor:** Decreases with context size (more deterministic)
301
+ - **Memory Trade-off:** Larger contexts require more storage (1,241,425 contexts)
302
  - **Recommendation:** Context-3 or Context-4 for text generation
303
 
304
  ---
 
314
 
315
  | Metric | Value |
316
  |--------|-------|
317
+ | Vocabulary Size | 856,070 |
318
+ | Total Tokens | 116,711,182 |
319
+ | Mean Frequency | 136.33 |
320
+ | Median Frequency | 4 |
321
+ | Frequency Std Dev | 9391.59 |
322
 
323
  ### Most Common Words
324
 
325
  | Rank | Word | Frequency |
326
  |------|------|-----------|
327
+ | 1 | فى | 4,414,661 |
328
+ | 2 | من | 3,909,776 |
329
+ | 3 | و | 3,512,508 |
330
+ | 4 | مصادر | 1,612,463 |
331
+ | 5 | لينكات | 1,359,404 |
332
+ | 6 | برانيه | 1,298,834 |
333
+ | 7 | هيا | 1,062,266 |
334
+ | 8 | اللى | 965,103 |
335
+ | 9 | يوم | 853,034 |
336
+ | 10 | مواليد | 836,295 |
337
 
338
  ### Least Common Words (from vocabulary)
339
 
340
  | Rank | Word | Frequency |
341
  |------|------|-----------|
342
+ | 1 | algeriens | 2 |
343
+ | 2 | وبتينا | 2 |
344
+ | 3 | روتلُف | 2 |
345
+ | 4 | bouabdellah | 2 |
346
+ | 5 | الخُضرة | 2 |
347
+ | 6 | impressionisms | 2 |
348
+ | 7 | assyriaca | 2 |
349
+ | 8 | جروكبيديا | 2 |
350
+ | 9 | grokipedia | 2 |
351
+ | 10 | grok | 2 |
352
 
353
  ### Zipf's Law Analysis
354
 
355
  | Metric | Value |
356
  |--------|-------|
357
+ | Zipf Coefficient | 1.2602 |
358
+ | R² (Goodness of Fit) | 0.994644 |
359
  | Adherence Quality | **excellent** |
360
 
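One common way to obtain a Zipf coefficient and an R² like those in the table is an ordinary least-squares fit of log frequency against log rank. The README does not say how the pipeline computes them, so the sketch below is an assumption. It is run on synthetic frequencies, not on the real vocabulary, and `fit_zipf` is a hypothetical name.

```python
import math

def fit_zipf(frequencies):
    """Fit log(freq) = a - s * log(rank) by least squares; return (s, R^2)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return -slope, 1.0 - ss_res / ss_tot

# Synthetic, exactly Zipfian frequencies with exponent 1.26 (illustrative only).
synthetic = [1_000_000 / rank ** 1.26 for rank in range(1, 1001)]
coefficient, r_squared = fit_zipf(synthetic)
print(round(coefficient, 4), round(r_squared, 6))
```

On real word counts the fit is not exact, which is why the table reports an R² slightly below 1.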
361
  ### Coverage Analysis
362
 
363
  | Top N Words | Coverage |
364
  |-------------|----------|
365
+ | Top 100 | 46.0% |
366
+ | Top 1,000 | 76.7% |
367
+ | Top 5,000 | 85.9% |
368
+ | Top 10,000 | 89.0% |
369
 
370
  ### Key Findings
371
 
372
+ - **Zipf Compliance:** R²=0.9946 indicates excellent adherence to Zipf's law
373
+ - **High Frequency Dominance:** Top 100 words cover 46.0% of corpus
374
+ - **Long Tail:** 846,070 words needed for remaining 11.0% coverage
375
 
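The vocabulary figures above can be cross-checked with simple arithmetic. The sketch below assumes that mean frequency is total tokens divided by vocabulary size, and that the long tail is everything outside the top 10,000 words. Both readings are inferred from the reported numbers, not documented.

```python
# Values copied from the Statistics and Coverage tables above.
vocabulary_size = 856_070
total_tokens = 116_711_182
top_10k_coverage = 89.0  # percent

# Assumed definition: mean frequency = total tokens / vocabulary size.
mean_frequency = total_tokens / vocabulary_size

# Assumed definition: long tail = words outside the top 10,000.
long_tail_words = vocabulary_size - 10_000
remaining_coverage = 100.0 - top_10k_coverage

print(round(mean_frequency, 2), long_tail_words, round(remaining_coverage, 1))
```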
376
  ---
377
  ## 5. Word Embeddings Evaluation
 
384
 
385
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
386
 
 
387
 
388
+ ### 5.1 Cross-Lingual Alignment
389
+
390
+ > *Note: Multilingual alignment visualization not available for this language.*
391
+
392
+
393
+ ### 5.2 Model Comparison
394
+
395
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
396
+ |-------|-----------|----------|------------------|---------------|----------------|
397
+ | **mono_32d** | 32 | 0.7897 🏆 | 0.3482 | N/A | N/A |
398
+ | **mono_64d** | 64 | 0.7690 | 0.2976 | N/A | N/A |
399
+ | **mono_128d** | 128 | 0.7177 | 0.2526 | N/A | N/A |
400
 
401
  ### Key Findings
402
 
403
+ - **Best Isotropy:** mono_32d with 0.7897 (more uniform distribution)
404
+ - **Semantic Density:** Average pairwise similarity of 0.2995. Lower values indicate better semantic separation.
405
+ - **Alignment Quality:** No aligned models evaluated in this run.
406
+ - **Recommendation:** 128d aligned for best cross-lingual performance
407
+
408
+ ---
409
+ ## 6. Morphological Analysis (Experimental)
410
+
411
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
412
+
413
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
414
+
415
+ ### 6.1 Productivity & Complexity
416
+
417
+ | Metric | Value | Interpretation | Recommendation |
418
+ |--------|-------|----------------|----------------|
419
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
420
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
421
+
422
+ ### 6.2 Affix Inventory (Productive Units)
423
+
424
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
425
+
426
+ #### Productive Prefixes
427
+ | Prefix | Examples |
428
+ |--------|----------|
429
+ | `-ال` | الخوذ, المندوبين, الدمرداشيه |
430
+
431
+ #### Productive Suffixes
432
+ | Suffix | Examples |
433
+ |--------|----------|
434
+ | `-ين` | كلوكيرين, بيرجرين, المندوبين |
435
+ | `-ان` | مالڤان, ملازمان, پايرلمان |
436
+
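The affix criterion described above (stripping the unit leaves a stem that is itself attested) can be sketched as a simple vocabulary lookup. This is an illustrative simplification, not the pipeline's sampling procedure. The toy vocabulary and the name `productive_stems` are assumptions.

```python
def productive_stems(vocabulary, affix, position):
    """Return stems left after stripping `affix` that are themselves vocabulary words."""
    vocab = set(vocabulary)
    stems = set()
    for word in vocab:
        if position == "prefix" and word.startswith(affix):
            stem = word[len(affix):]
        elif position == "suffix" and word.endswith(affix):
            stem = word[:-len(affix)]
        else:
            continue
        if stem and stem in vocab:
            stems.add(stem)
    return stems

# Toy vocabulary (assumption), assembled from pieces so the affixes are explicit.
stem = "سرياني"
vocabulary = [
    "ال" + stem + "ين",   # prefix + stem + suffix
    stem + "ين",          # stem + suffix
    stem,                 # bare stem
    "ال" + "ديمقراطي",    # prefix + stem
    "ديمقراطي",           # bare stem
    "المانيا",            # starts with the same letters, but the remainder is not a word
]

prefix_stems = productive_stems(vocabulary, "ال", "prefix")
suffix_stems = productive_stems(vocabulary, "ين", "suffix")
print(len(prefix_stems), len(suffix_stems))
```

The last toy entry shows why the attested-stem check matters: a word can begin with the letters of a prefix without being a prefixed form.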
437
+ ### 6.3 Bound Stems (Lexical Roots)
438
+
439
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
440
+
441
+ | Stem | Cohesion | Substitutability | Examples |
442
+ |------|----------|------------------|----------|
443
+ | `العا` | 1.85x | 296 contexts | العام, العاج, العال |
444
+ | `المج` | 1.79x | 267 contexts | المجد, المجر, المجئ |
445
+ | `انزي` | 1.95x | 165 contexts | انزيا, انزيت, انزيد |
446
+ | `الشع` | 2.11x | 103 contexts | الشعب, الشعف, الشعز |
447
+ | `ياته` | 2.11x | 96 contexts | عياته, آياته, حياته |
448
+ | `الاع` | 2.00x | 107 contexts | الاعور, الاعتر, الاعدا |
449
+ | `مستق` | 2.01x | 80 contexts | مستقل, مستقر, مستقله |
450
+ | `الاح` | 1.79x | 110 contexts | الاحد, صالاحى, الاحرش |
451
+ | `لموج` | 2.13x | 48 contexts | لموجة, الموج, الموجب |
452
+ | `لمجر` | 1.85x | 71 contexts | لمجره, المجر, لمجرة |
453
+ | `لساع` | 2.34x | 28 contexts | لساعة, الساعة, لساعات |
454
+ | `مريك` | 1.69x | 102 contexts | لمريك, مريكا, مريكن |
455
+
456
+ ### 6.4 Affix Compatibility (Co-occurrence)
457
+
458
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
459
+
460
+ | Prefix | Suffix | Frequency | Examples |
461
+ |--------|--------|-----------|----------|
462
+ | `-ال` | `-ين` | 47 words | الصديقين, الحدوديين |
463
+ | `-ال` | `-ان` | 11 words | الأخوان, الترامان |
464
+
465
+ ### 6.5 Recursive Morpheme Segmentation
466
+
467
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
468
+
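As a rough illustration of nested affix stripping, the sketch below greedily peels known prefixes and suffixes until nothing more can be removed. It is a simplification under stated assumptions: the function `segment`, the `min_stem` threshold, and the greedy strategy are hypothetical, and the confidence scoring shown in the table below is not modelled.

```python
def segment(word, prefixes, suffixes, min_stem=3):
    """Greedily peel known affixes, keeping at least `min_stem` characters.

    Hypothetical helper; returns [prefix..., stem, suffix...].
    """
    lead, trail = [], []
    changed = True
    while changed:
        changed = False
        for p in prefixes:
            if word.startswith(p) and len(word) - len(p) >= min_stem:
                lead.append(p)
                word = word[len(p):]
                changed = True
                break
        for s in suffixes:
            if word.endswith(s) and len(word) - len(s) >= min_stem:
                trail.insert(0, s)  # outermost suffix stays last
                word = word[: -len(s)]
                changed = True
                break
    return lead + [word] + trail


# Affixes taken from section 6.2 of this report.
PREFIXES = ["ال"]
SUFFIXES = ["ين", "ان"]
split_a = segment("السريانيين", PREFIXES, SUFFIXES)
split_b = segment("كانتيلينين", PREFIXES, SUFFIXES)
```

A greedy stripper like this will over-segment words that merely happen to end in an affix-like string, which is one reason a real analysis needs a confidence score.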
469
+ | Word | Suggested Split | Confidence | Stem |
470
+ |------|-----------------|------------|------|
471
+ | السريانيين | **`ال-سرياني-ين`** | 6.0 | `سرياني` |
472
+ | كانتيلينين | **`كانتيل-ين-ين`** | 6.0 | `كانتيل` |
473
+ | الجينومية | **`ال-جينومية`** | 4.5 | `جينومية` |
474
+ | البرمجيات | **`ال-برمجيات`** | 4.5 | `برمجيات` |
475
+ | الاستعلامات | **`ال-استعلامات`** | 4.5 | `استعلامات` |
476
+ | بيجلاندسفچوردين | **`بيجلاندسفچورد-ين`** | 4.5 | `بيجلاندسفچورد` |
477
+ | السينابون | **`ال-سينابون`** | 4.5 | `سينابون` |
478
+ | الديمقراطي | **`ال-ديمقراطي`** | 4.5 | `ديمقراطي` |
479
+ | الانبعاثية | **`ال-انبعاثية`** | 4.5 | `انبعاثية` |
480
+ | الميتانيه | **`ال-ميتانيه`** | 4.5 | `ميتانيه` |
481
+ | الطويحينه | **`ال-طويحينه`** | 4.5 | `طويحينه` |
482
+ | الصابونجى | **`ال-صابونجى`** | 4.5 | `صابونجى` |
483
+ | البنغاليه | **`ال-بنغاليه`** | 4.5 | `بنغاليه` |
484
+ | المتحدثون | **`ال-متحدثون`** | 4.5 | `متحدثون` |
485
+ | ستشميدلين | **`ستشميدل-ين`** | 4.5 | `ستشميدل` |
486
+
487
+ ### 6.6 Linguistic Interpretation
488
+
489
+ > **Automated Insight:**
490
+ Egyptian Arabic appears to be relatively isolating, or to have a highly fixed vocabulary. Word-level models perform nearly as well as subword models, which suggests fewer productive morphological processes.
491
 
492
  ---
493
+ ## 7. Summary & Recommendations
494
 
495
  ![Performance Dashboard](visualizations/performance_dashboard.png)
496
 
 
498
 
499
  | Component | Recommended | Rationale |
500
  |-----------|-------------|-----------|
501
+ | Tokenizer | **64k BPE** | Best compression (3.91x) |
502
+ | N-gram | **2-gram** | Lowest perplexity (316) |
503
+ | Markov | **Context-4** | Highest predictability (93.7%) |
504
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
505
 
506
+
507
  ---
508
  ## Appendix: Metrics Glossary & Interpretation Guide
509
 
 
693
  author = {Kamali, Omar},
694
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
695
  year = {2025},
696
+ doi = {10.5281/zenodo.18073153},
697
+ publisher = {Zenodo},
698
  url = {https://huggingface.co/wikilangs},
699
  institution = {Omneity Labs}
700
  }
 
710
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
711
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
712
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
713
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
714
  ---
715
  *Generated by Wikilangs Models Pipeline*
716
 
717
+ *Report Date: 2026-01-03 07:45:31*
models/embeddings/monolingual/arz_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:75a11f48624beba71cf5e2fb4ca6caff536a87ce2cd7ecf4cf2d74b7d061ab24
3
- size 1623402315
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5bdb36e12bc1a678fd5c157ad32e02341de1eca60c4bc2b5ef93fe49b61bd555
3
+ size 1527330535
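The `.bin` and `.parquet` diffs in this commit are Git LFS pointer files: three `key value` lines giving the spec version, the object's SHA-256, and its size in bytes. A minimal sketch for reading them follows; the helper name `parse_lfs_pointer` is an assumption, and the two pointers are the old and new versions of `arz_128d.bin` shown above.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file (hypothetical helper)."""
    fields = dict(line.strip().split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real object in bytes
    }


OLD_POINTER = """
version https://git-lfs.github.com/spec/v1
oid sha256:75a11f48624beba71cf5e2fb4ca6caff536a87ce2cd7ecf4cf2d74b7d061ab24
size 1623402315
"""

NEW_POINTER = """
version https://git-lfs.github.com/spec/v1
oid sha256:5bdb36e12bc1a678fd5c157ad32e02341de1eca60c4bc2b5ef93fe49b61bd555
size 1527330535
"""

old, new = parse_lfs_pointer(OLD_POINTER), parse_lfs_pointer(NEW_POINTER)
size_change = new["size"] - old["size"]  # negative: the new file is smaller
```

Comparing `size` across the two pointers is a quick way to see how much an artifact shrank or grew without downloading it.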
models/embeddings/monolingual/arz_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 128,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 573439
13
  }
 
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 128
13
  },
14
+ "vocab_size": 481203
15
  }
models/embeddings/monolingual/arz_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:20ccadbe8619b844c6eace4a02ce277749c846f73614cb161ecbc7a86c873672
3
- size 415001163
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:30582a755835303bdd9d0ea4ad7243b43b631aa397c3e567b21c8d4a5c4449d3
3
+ size 389766631
models/embeddings/monolingual/arz_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 32,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 573439
13
  }
 
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 32
13
  },
14
+ "vocab_size": 481203
15
  }
models/embeddings/monolingual/arz_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:4801d005678b73c1bb59d155df23fbc73c9f36b3c781e8ec0acef3f3d272ee93
3
- size 817801547
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5068a8b810efa630cdf0314c353a4495557bcd65811e48356936ecd7a645b35c
3
+ size 768954599
models/embeddings/monolingual/arz_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 64,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 573439
13
  }
 
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 64
13
  },
14
+ "vocab_size": 481203
15
  }
models/subword_markov/arz_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e7f161c6e8fab68e5fbc9600f3451111167fbdb2ccb9ee71b98eeb2ffb7dace1
3
- size 322433
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:93b9ef162389968b455ca3710639cb0faa0e67079e19324a4f3d232dc220101e
3
+ size 347493
models/subword_markov/arz_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "arz",
5
- "unique_contexts": 4455,
6
- "total_transitions": 825886705
7
  }
 
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "arz",
5
+ "unique_contexts": 5726,
6
+ "total_transitions": 693777470
7
  }
models/subword_markov/arz_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:cabc4c543608790d6c5880f1744ba45b47d2697724261238e8925f80995b0766
3
- size 2557134
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:98db3946a7c34d13376e437f6401ae1d31aa408e141da100b0793536f9682ba1
3
+ size 2267694
models/subword_markov/arz_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "arz",
5
- "unique_contexts": 41887,
6
- "total_transitions": 824256957
7
  }
 
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "arz",
5
+ "unique_contexts": 47379,
6
+ "total_transitions": 692148775
7
  }
models/subword_markov/arz_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:de040657fc371a339db59ea68152c56a326e76cc8924dac007de0473e86340c4
3
- size 13516070
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:791a2be4e464c30d58ac081746362b42037f50dd8d0c4b552c3b0e68c2518dea
3
+ size 10864076
models/subword_markov/arz_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "arz",
5
- "unique_contexts": 320101,
6
- "total_transitions": 822627209
7
  }
 
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "arz",
5
+ "unique_contexts": 262420,
6
+ "total_transitions": 690520080
7
  }
models/subword_markov/arz_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a0e33690e76e9f408ce70c7c674c98a698e70a625d4ecd0d337e959c13d77fe6
3
- size 53877656
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f9412f5ef232c0c613c068829b486f969ed98f144b0956cfba48ca7651d697cc
3
+ size 40934252
models/subword_markov/arz_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "arz",
5
- "unique_contexts": 1700014,
6
- "total_transitions": 820997461
7
  }
 
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "arz",
5
+ "unique_contexts": 1241425,
6
+ "total_transitions": 688891385
7
  }
models/subword_ngram/arz_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:10ff36eb8f35bdaef34f28641466ebac1bd3df6e7831eab6cec38563f7a83d00
3
- size 224935
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:54b1e94823bf6eda78e610a31f50839e3eb6945296345628a51c2775bbc535b5
3
+ size 225336
models/subword_ngram/arz_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "arz",
5
- "unique_ngrams": 15875,
6
- "total_ngrams": 825886705
7
  }
 
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "arz",
5
+ "unique_ngrams": 15451,
6
+ "total_ngrams": 693777470
7
  }
models/subword_ngram/arz_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:eee986ddda6f36ba1d3e23a8759ad8f5509e3c48dc4ac592f689fa826fe34187
3
- size 2146043
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6241036ff50c386080b4d36854fdc224c1c5099af471b3862e1464cd644b63ae
3
+ size 1697106
models/subword_ngram/arz_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "arz",
5
- "unique_ngrams": 170638,
6
- "total_ngrams": 824256957
7
  }
 
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "arz",
5
+ "unique_ngrams": 129923,
6
+ "total_ngrams": 692148775
7
  }
models/subword_ngram/arz_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2d7383505bdffee6fff1e255e8a66837be06fe62780607b993d354aed68b8d0c
3
- size 13255914
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c000b6a963ccc11a1bb0feebe6d49196497cbb6667e247ab3df4764a4e91003c
3
+ size 10220298
models/subword_ngram/arz_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "arz",
5
- "unique_ngrams": 1045558,
6
- "total_ngrams": 822627209
7
  }
 
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "arz",
5
+ "unique_ngrams": 788718,
6
+ "total_ngrams": 690520080
7
  }
models/tokenizer/arz_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3250e5b1fa849747bd31693274ccdde1fde493080110892d89c056f2ed133c9f
3
- size 553249
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8182c7627a2b642bc8213e1f478fb8483ae43e1decdfb9d36a8b81c0b1e5db70
3
+ size 553522
models/tokenizer/arz_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/arz_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1d22c8b625df815da9ab4ad818bdb3e662168bfa181b0747326154e8db1cc2c3
3
- size 880194
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3f58411587d93e3605bc0bfb493f2adb152e457978754608d4bfa1098bbe3484
3
+ size 874271
models/tokenizer/arz_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/arz_tokenizer_64k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e2a8f70071cee82c435cc2eff7086ba3060e694f61e640ad9482184f637090af
3
- size 1550950
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b7caa9ae0cc952f588395f385251d423792d50e832f759c19f62d2debfa53c97
3
+ size 1535709
models/tokenizer/arz_tokenizer_64k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/arz_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e3cf01b1aa16a4d259de9989643f646ceecca5967b87fc2b269ffd411f6ec306
3
- size 394841
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:75cf05a2c2e3d6160ef1a57d3e6c1105fcab185e64a3ffd3fdab63dacc685a1f
3
+ size 396360
models/tokenizer/arz_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/arz_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:08ac9dc5a3750de53ceee73bd67ec52e8f566950d994cd5585cbd78867c9790d
3
- size 13750687
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f49f18512e3b41813735d519aef4205264437e74599a3808b6c8fbea97f3bb17
3
+ size 12321602
models/vocabulary/arz_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
1
  {
2
  "language": "arz",
3
- "vocabulary_size": 1000000,
 
4
  "statistics": {
5
- "type_token_ratio": 0.016231404296933028,
6
  "coverage": {
7
- "top_100": 0.44057406359024887,
8
- "top_1000": 0.7499824220069986,
9
- "top_5000": 0.8482657952087583,
10
- "top_10000": 0.8783945758611642
11
  },
12
- "hapax_count": 918045,
13
- "hapax_ratio": 0.42167650913059435,
14
- "total_documents": 1629748
15
  }
16
  }
 
1
  {
2
  "language": "arz",
3
+ "vocabulary_size": 856070,
4
+ "variant": "full",
5
  "statistics": {
6
+ "type_token_ratio": 0.011546453807845636,
7
  "coverage": {
8
+ "top_100": 0.4579160902506231,
9
+ "top_1000": 0.7632735433913325,
10
+ "top_5000": 0.8553371329341142,
11
+ "top_10000": 0.885855315521865
12
  },
13
+ "hapax_count": 497272,
14
+ "hapax_ratio": 0.36744001146790684,
15
+ "total_documents": 1628695
16
  }
17
  }
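The statistics in this metadata file can be sketched from raw token counts as follows. The definitions are the conventional ones and are an assumption about the pipeline, which the diff does not show; `vocabulary_stats` and its `top_ks` parameter are hypothetical names. The final check only confirms that the new `hapax_ratio` is numerically consistent with `hapax_count / (vocabulary_size + hapax_count)`.

```python
from collections import Counter


def vocabulary_stats(tokens, top_ks=(2,)):
    """Compute type/token ratio, hapax statistics, and top-k coverage (a sketch)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True)
    hapax = sum(1 for c in counts.values() if c == 1)  # words seen exactly once
    return {
        "type_token_ratio": len(counts) / total,
        "hapax_count": hapax,
        "hapax_ratio": hapax / len(counts),
        "coverage": {f"top_{k}": sum(ranked[:k]) / total for k in top_ks},
    }


# Toy corpus, purely illustrative.
stats = vocabulary_stats("a a a b b c d".split())

# Consistency check against the new metadata values shown above.
reported_ratio = 0.36744001146790684
implied_ratio = 497272 / (856070 + 497272)
```

If that reading is right, `vocabulary_size` in the new file excludes hapax legomena, whereas the old value of exactly 1000000 looks like a cap; the diff itself does not confirm either point.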
models/word_markov/arz_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e5b7e0c4e188921694b306869a2449ce112fb199b792128f1ce21236b921cc84
3
- size 138534439
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:27fac99b38072bd41d19f9620ff6d00511cdf0cb2c23718cc24825c3425b0c81
3
+ size 125535197
models/word_markov/arz_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "arz",
5
- "unique_contexts": 2177526,
6
- "total_transitions": 156097037
7
  }
 
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "arz",
5
+ "unique_contexts": 1353062,
6
+ "total_transitions": 115579759
7
  }
models/word_markov/arz_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:40f8b7cfcf0a1b3e4aeec633859a0e58e1f83c410282fc81378b386a72be3a23
3
- size 440008287
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9256f52202a6653f8df7a39b233dce04b0731a6961f54294e451f0ff631a642f
3
+ size 413546149
models/word_markov/arz_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "arz",
5
- "unique_contexts": 13972922,
6
- "total_transitions": 154467289
7
  }
 
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "arz",
5
+ "unique_contexts": 12336484,
6
+ "total_transitions": 113951064
7
  }
models/word_markov/arz_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:10869a5b62f9dc9fced8f99ecf292c2e516552cafb1ebc22439f4c66b29703c6
3
- size 749801655
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a2fa6b997fa7dcafddfbda3b21ca4c509b51a1df648b8e5d0d2476f263f776b1
3
+ size 672892499
models/word_markov/arz_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "arz",
5
- "unique_contexts": 28492247,
6
- "total_transitions": 152837542
7
  }
 
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "arz",
5
+ "unique_contexts": 23517673,
6
+ "total_transitions": 112322369
7
  }
models/word_markov/arz_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1c3be90b2e9c7b80d31d2c18860bc2236215712a3340fa2f9225e2de14a91424
3
- size 1031396379
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d3b791771441b0e8ec65696094fed0fc807d2e95898fb465fd96d5dd181add4d
3
+ size 913306903
models/word_markov/arz_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "arz",
5
- "unique_contexts": 38092155,
6
- "total_transitions": 151207798
7
  }
 
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "arz",
5
+ "unique_contexts": 29894419,
6
+ "total_transitions": 110693674
7
  }
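The new `total_transitions` values fall by the same amount at each context size, and that amount equals the new `total_documents` value (1628695) in the vocabulary metadata. This is what one would expect if each document of n tokens contributes n - k transitions at context size k, though the diff does not state how the pipeline counts. The sketch below only checks the arithmetic on the numbers shown in this commit; the constant and function names are illustrative.

```python
TOTAL_DOCUMENTS = 1_628_695  # from arz_vocabulary_metadata.json (new value)

# New total_transitions per context size, copied from the metadata diffs.
SUBWORD = {1: 693_777_470, 2: 692_148_775, 3: 690_520_080, 4: 688_891_385}
WORD = {1: 115_579_759, 2: 113_951_064, 3: 112_322_369, 4: 110_693_674}


def step_deltas(transitions):
    """Return the drop in transitions between consecutive context sizes."""
    ctxs = sorted(transitions)
    return [transitions[a] - transitions[b] for a, b in zip(ctxs, ctxs[1:])]


subword_deltas = step_deltas(SUBWORD)
word_deltas = step_deltas(WORD)
consistent = all(d == TOTAL_DOCUMENTS for d in subword_deltas + word_deltas)
```

The same totals also appear as `total_ngrams` in the n-gram metadata (the 2-gram total matches the context-1 transition count, and so on), so one check covers both model families.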
models/word_ngram/arz_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:739d5abccec316bf509349d0e17742a1bcf59999ee5f8e810053baf5105c06e5
3
- size 25088974
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3d603ccb8a9bbb46f85861773239d3031d8c0e830c109fc753e1cadd69e9c1e2
3
+ size 22328144
models/word_ngram/arz_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "word",
4
  "language": "arz",
5
- "unique_ngrams": 1353863,
6
- "total_ngrams": 156097037
7
  }
 
2
  "n": 2,
3
  "variant": "word",
4
  "language": "arz",
5
+ "unique_ngrams": 1073861,
6
+ "total_ngrams": 115579759
7
  }
models/word_ngram/arz_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f744cb53e652ce297a655902bc6eb5ef9d19d5edfc79585d48fa10ea229f7ccb
3
- size 49189391
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:879ba3166c5b357094ce092093cc1eae670fb25bcde483ad58a4dd24786c83e5
3
+ size 41669189
models/word_ngram/arz_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "word",
4
  "language": "arz",
5
- "unique_ngrams": 2306233,
6
- "total_ngrams": 154467289
7
  }
 
2
  "n": 3,
3
  "variant": "word",
4
  "language": "arz",
5
+ "unique_ngrams": 1682809,
6
+ "total_ngrams": 113951064
7
  }
models/word_ngram/arz_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d78b4518cd7a719707b9135b4062484b3fba60daa9102960b10361304503a812
3
- size 109747485
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bfe08d57419c8d5a905c1468d1a4d8b376b38c13bc69a838fa0cb4e9e94d0d24
3
+ size 84435807
models/word_ngram/arz_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "word",
4
  "language": "arz",
5
- "unique_ngrams": 4880347,
6
- "total_ngrams": 152837542
7
  }
 
2
  "n": 4,
3
  "variant": "word",
4
  "language": "arz",
5
+ "unique_ngrams": 3054922,
6
+ "total_ngrams": 112322369
7
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details

  • SHA256: 9b565414fa6dee515b8e690f3dbc205f9826b0ce775f66f2a3da1a5455fce378
  • Pointer size: 131 Bytes
  • Size of remote file: 149 kB

Git LFS Details

  • SHA256: 6bdb3a6083cd153ca705ca2b3f8fe7f308b5e34447ca66be0e9ee0aab7abebda
  • Pointer size: 131 Bytes
  • Size of remote file: 154 kB
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED