omarkamali committed on
Commit
20d27c8
·
verified ·
1 Parent(s): 8fc0a68

Upload all models and assets for chr (20251001)

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +252 -128
  2. models/embeddings/monolingual/chr_128d.bin +2 -2
  3. models/embeddings/monolingual/chr_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/chr_32d.bin +2 -2
  5. models/embeddings/monolingual/chr_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/chr_64d.bin +2 -2
  7. models/embeddings/monolingual/chr_64d_metadata.json +5 -3
  8. models/subword_markov/chr_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/chr_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/chr_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/chr_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/chr_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/chr_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/chr_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/chr_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/chr_2gram_subword.parquet +2 -2
  17. models/subword_ngram/chr_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/chr_3gram_subword.parquet +2 -2
  19. models/subword_ngram/chr_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/chr_4gram_subword.parquet +2 -2
  21. models/subword_ngram/chr_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/chr_tokenizer_16k.model +2 -2
  23. models/tokenizer/chr_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/chr_tokenizer_32k.model +2 -2
  25. models/tokenizer/chr_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/chr_tokenizer_8k.model +2 -2
  27. models/tokenizer/chr_tokenizer_8k.vocab +0 -0
  28. models/vocabulary/chr_vocabulary.parquet +2 -2
  29. models/vocabulary/chr_vocabulary_metadata.json +10 -9
  30. models/word_markov/chr_markov_ctx1_word.parquet +2 -2
  31. models/word_markov/chr_markov_ctx1_word_metadata.json +2 -2
  32. models/word_markov/chr_markov_ctx2_word.parquet +2 -2
  33. models/word_markov/chr_markov_ctx2_word_metadata.json +2 -2
  34. models/word_markov/chr_markov_ctx3_word.parquet +2 -2
  35. models/word_markov/chr_markov_ctx3_word_metadata.json +2 -2
  36. models/word_markov/chr_markov_ctx4_word.parquet +2 -2
  37. models/word_markov/chr_markov_ctx4_word_metadata.json +2 -2
  38. models/word_ngram/chr_2gram_word.parquet +2 -2
  39. models/word_ngram/chr_2gram_word_metadata.json +2 -2
  40. models/word_ngram/chr_3gram_word.parquet +2 -2
  41. models/word_ngram/chr_3gram_word_metadata.json +2 -2
  42. models/word_ngram/chr_4gram_word.parquet +2 -2
  43. models/word_ngram/chr_4gram_word_metadata.json +2 -2
  44. visualizations/embedding_isotropy.png +0 -0
  45. visualizations/embedding_norms.png +0 -0
  46. visualizations/embedding_similarity.png +2 -2
  47. visualizations/markov_branching.png +0 -0
  48. visualizations/markov_contexts.png +0 -0
  49. visualizations/markov_entropy.png +0 -0
  50. visualizations/model_sizes.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
- value: 3.838
27
  - name: best_isotropy
28
  type: isotropy
29
- value: 0.2339
30
  - name: vocabulary_size
31
  type: vocab
32
- value: 4475
33
- generated: 2025-12-28
34
  ---
35
 
36
  # CHR - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
- - N-gram models (2, 3, 4-gram)
48
- - Markov chains (context of 1, 2, 3 and 4)
49
  - Subword N-gram and Markov chains
50
- - Embeddings in various sizes and dimensions
51
  - Language Vocabulary
52
  - Language Statistics
 
53
  ![Performance Dashboard](visualizations/performance_dashboard.png)
54
 
55
  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
59
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
60
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
61
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
62
- - [6. Summary & Recommendations](#6-summary--recommendations)
 
63
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
64
  - [Visualizations Index](#visualizations-index)
65
 
@@ -68,58 +70,53 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
68
 
69
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
70
 
 
 
 
 
 
 
71
  ### Results
72
 
73
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
74
  |------------|-------------|---------------|----------|--------------|
75
- | **8k** | 3.087x | 2.96 | 0.1137% | 95,883 |
76
- | **16k** | 3.454x | 3.32 | 0.1272% | 85,691 |
77
- | **32k** | 3.838x 🏆 | 3.68 | 0.1414% | 77,111 |
78
 
79
  ### Tokenization Examples
80
 
81
  Below are sample sentences tokenized with each vocabulary size:
82
 
83
- **Sample 1:** `ᏍᎪᏯ ᏗᏃᎷᏩᏘᏍᎩ"Consortium Word List." 2016-04-15. (sgoya dinoluwatisgi)
84
-
85
- ᏓᏓᏚᎬ ᎪᏪᎵ ...`
86
 
87
  | Vocab | Tokens | Count |
88
  |-------|--------|-------|
89
- | 8k | `▁ᏍᎪᏯ ▁ᏗᏃ ᎷᏩᏘᏍᎩ " consortium ▁word ▁list ."2 ... (+31 more)` | 41 |
90
- | 16k | `▁ᏍᎪᏯ ▁ᏗᏃ ᎷᏩᏘᏍᎩ " consortium ▁word ▁list ."2 ... (+30 more)` | 40 |
91
- | 32k | `▁ᏍᎪᏯ ▁ᏗᏃᎷᏩᏘᏍᎩ " consortium ▁word ▁list ."2 0 ... (+26 more)` | 36 |
92
 
93
- **Sample 2:** `ᎤᎧᏲᏗ ᎡᎶᎯᏱ ...
94
-
95
- Category:ᎠᏓᏕᏲᎲᏍᎩ ᎡᎶᎯ
96
- Category:To be checked`
97
 
98
  | Vocab | Tokens | Count |
99
  |-------|--------|-------|
100
- | 8k | `▁ᎤᎧ ᏲᏗ ▁ᎡᎶᎯ ▁... category : ᎠᏓᏕᏲᎲᏍᎩ ▁ᎡᎶᎯ ▁category ... (+4 more)` | 14 |
101
- | 16k | `▁ᎤᎧᏲᏗ ▁ᎡᎶᎯ ▁... category : ᎠᏓᏕᏲᎲᏍᎩ ▁ᎡᎶᎯ ▁category : ... (+3 more)` | 13 |
102
- | 32k | `▁ᎤᎧᏲᏗ ▁ᎡᎶᎯᏱ ▁...category : ᎠᏓᏕᏲᎲᏍᎩ ▁ᎡᎶᎯ ▁category : to ... (+2 more)` | 12 |
103
-
104
- **Sample 3:** `ᎤᏢᏅᏛ"Consortium Word List." 2016-04-15. (tlvnvdv)
105
-
106
- ᏓᏓᏚᎬ ᎪᏪᎵ
107
-
108
- ᏙᏯᏗᏢ ᏗᏕᎬᏔᏛ
109
 
110
- Cat...`
111
 
112
  | Vocab | Tokens | Count |
113
  |-------|--------|-------|
114
- | 8k | `▁ᎤᏢ ᏅᏛ " consortium ▁word ▁list ." 2 0 ... (+26 more)` | 36 |
115
- | 16k | `▁ᎤᏢᏅᏛ " consortium ▁word ▁list ." 2 0 1 ... (+25 more)` | 35 |
116
- | 32k | `▁ᎤᏢᏅᏛ " consortium ▁word ▁list ." 2 0 1 ... (+24 more)` | 34 |
117
 
118
 
119
  ### Key Findings
120
 
121
- - **Best Compression:** 32k achieves 3.838x compression
122
- - **Lowest UNK Rate:** 8k with 0.1137% unknown tokens
123
  - **Trade-off:** Larger vocabularies improve compression but increase model size
124
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
125
 
@@ -128,57 +125,89 @@ Cat...`
128
 
129
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
130
 
 
 
131
  ![N-gram Coverage](visualizations/ngram_coverage.png)
132
 
133
  ### Results
134
 
135
- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
136
- |--------|------------|---------|----------------|------------------|-------------------|
137
- | **2-gram** | 257 🏆 | 8.00 | 1,020 | 64.1% | 99.6% |
138
- | **2-gram** | 893 🏆 | 9.80 | 3,626 | 42.9% | 85.8% |
139
- | **3-gram** | 292 | 8.19 | 1,275 | 62.4% | 96.1% |
140
- | **3-gram** | 3,674 | 11.84 | 14,102 | 27.3% | 56.4% |
141
- | **4-gram** | 539 | 9.07 | 2,691 | 53.0% | 83.5% |
142
- | **4-gram** | 7,268 | 12.83 | 31,632 | 25.8% | 46.0% |
143
 
144
  ### Top 5 N-grams by Size
145
 
146
- **2-grams:**
 
 
 
 
 
 
 
 
 
 
147
 
148
  | Rank | N-gram | Count |
149
  |------|--------|-------|
150
- | 1 | `category :` | 2,341 |
151
- | 2 | `be checked` | 958 |
152
- | 3 | `to be` | 958 |
153
- | 4 | `: to` | 958 |
154
- | 5 | `ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ` | 609 |
155
 
156
- **3-grams:**
157
 
158
  | Rank | N-gram | Count |
159
  |------|--------|-------|
160
- | 1 | `to be checked` | 958 |
161
- | 2 | `: to be` | 958 |
162
- | 3 | `category : to` | 958 |
163
- | 4 | `ꮧꮥꭼꮤꮫ category :` | 596 |
164
- | 5 | `ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ category` | 566 |
165
 
166
- **4-grams:**
167
 
168
  | Rank | N-gram | Count |
169
  |------|--------|-------|
170
- | 1 | `category : to be` | 958 |
171
- | 2 | `: to be checked` | 958 |
172
- | 3 | `ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ category :` | 566 |
173
- | 4 | `ꮣꮣꮪꭼ ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ` | 444 |
174
- | 5 | `ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ category` | 428 |
175
 
176
 
177
  ### Key Findings
178
 
179
- - **Best Perplexity:** 2-gram with 257
180
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
181
- - **Coverage:** Top-1000 patterns cover ~46% of corpus
182
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
183
 
184
  ---
@@ -186,55 +215,86 @@ Cat...`
186
 
187
  ![Markov Entropy](visualizations/markov_entropy.png)
188
 
 
 
189
  ![Markov Branching](visualizations/markov_branching.png)
190
 
191
  ### Results
192
 
193
- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
194
- |---------|-------------|------------|------------------|-----------------|----------------|
195
- | **1** | 0.4793 | 1.394 | 2.62 | 13,897 | 52.1% |
196
- | **1** | 1.6917 | 3.230 | 17.49 | 449 | 0.0% |
197
- | **2** | 0.1273 | 1.092 | 1.24 | 36,338 | 87.3% |
198
- | **2** | 1.0597 | 2.084 | 4.84 | 7,851 | 0.0% |
199
- | **3** | 0.0430 | 1.030 | 1.08 | 45,090 | 95.7% |
200
- | **3** | 0.5746 | 1.489 | 2.30 | 38,007 | 42.5% |
201
- | **4** | 0.0217 🏆 | 1.015 | 1.04 | 48,532 | 97.8% |
202
- | **4** | 0.2783 🏆 | 1.213 | 1.46 | 87,252 | 72.2% |
203
 
204
- ### Generated Text Samples
205
 
206
- Below are text samples generated from each Markov chain model:
207
 
208
  **Context Size 1:**
209
 
210
- 1. `. ꮳꮃꭹ ᏼꮻ . ᏹꮒꭼꮫꮎ , 537 the united nations statistics ꭸꮎꮣ ꭹ ꭹꭲꮒ ( ꮧꮡꮻꮝꮧrobinson`
211
- 2. `: to be checked category : to be checked ꭿꭺꮹꮤ ꮎꮝꭶ ꮎꮝꮗ ꭶꮒꮭꭲ . " consortium`
212
- 3. `, 863 , bekasi ) ꮔ nu ꮕ nv ) ꮣꮣꮪꭼ ꭺꮺꮅ ꮧꭷꮑꮝꮩꮧ " consortium word`
213
 
214
  **Context Size 2:**
215
 
216
- 1. `category : to be checked category : ꭼꮓꮣ ꭰꭶꮞꮝꮩꮧ category : to be checked category : ꮒꭶꭵ`
217
- 2. `to be checked category : ꮳꮃꭹ category : ꭰꮒꮩꮃ category : ꭰᏸꮅ ꮪꮎꮩꮲꮹꮧꮢ category : ꮧꭶꭺꭿꮧ category`
218
- 3. `: to be checked category : ꮳꮃꭹ category : ꭶꮎꮭꭲ ( ꮷꭶꮓꮾ ꭰꮊꮅꭶ ) category : ꮣꮆꮒꭶꮝꮫ`
219
 
220
  **Context Size 3:**
221
 
222
- 1. `category : to be checked category : ꭴꮬꮹꮣ category : ꭶꮎꮭꭲ ( ꭱꮃꮧꮬ ) category : ꭶꮎꮭꭲ (`
223
- 2. `: to be checked nds : land # länner sv : världsgeografi # lista över länder`
224
- 3. `ꮧꮥꭼꮤꮫ category : ꭼꮓꮣ ꭰꭶꮞꮝꮩꮧ category : to be checked category : ꭶꮎꮭꭲ ( ꮷꭶꮓꮾ ꭰꮊꮅꭶ ) category`
225
 
226
  **Context Size 4:**
227
 
228
- 1. `category : to be checked category : ꭶꮎꮭꭲ ( ᏻꮃꮫ ) category : ꭶꮎꮭꭲ ( ꮷꭶꮓꮾ ꭰꮊꮅꭶ ) category`
229
- 2. `ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ category : ꭼꮓꮣ ꭰꭶꮞꮝꮩꮧ category : to be checked nds : land # länner sv : världsgeografi`
230
- 3. `ꮣꮣꮪꭼ ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ category : ꭴꮎꮩꮲꭿ category : to be checked category : ꭰᏸꮅ ꮪꮎꮩꮲꮹꮧꮢ category : to`
231
 
232
 
233
  ### Key Findings
234
 
235
- - **Best Predictability:** Context-4 with 97.8% predictability
236
  - **Branching Factor:** Decreases with context size (more deterministic)
237
- - **Memory Trade-off:** Larger contexts require more storage (87,252 contexts)
238
  - **Recommendation:** Context-3 or Context-4 for text generation
239
 
240
  ---
@@ -250,26 +310,26 @@ Below are text samples generated from each Markov chain model:
250
 
251
  | Metric | Value |
252
  |--------|-------|
253
- | Vocabulary Size | 4,475 |
254
- | Total Tokens | 42,573 |
255
- | Mean Frequency | 9.51 |
256
  | Median Frequency | 3 |
257
- | Frequency Std Dev | 53.38 |
258
 
259
  ### Most Common Words
260
 
261
  | Rank | Word | Frequency |
262
  |------|------|-----------|
263
- | 1 | category | 2,341 |
264
- | 2 | to | 976 |
265
- | 3 | be | 960 |
266
- | 4 | checked | 958 |
267
- | 5 | ꭰꮄ | 903 |
268
- | 6 | ꭿꭰ | 772 |
269
- | 7 | ꮧꮥꭼꮤꮫ | 649 |
270
- | 8 | ꮩꮿꮧꮲ | 611 |
271
- | 9 | ꭺꮺꮅ | 548 |
272
- | 10 | ꮣꮣꮪꭼ | 506 |
273
 
274
  ### Least Common Words (from vocabulary)
275
 
@@ -290,24 +350,24 @@ Below are text samples generated from each Markov chain model:
290
 
291
  | Metric | Value |
292
  |--------|-------|
293
- | Zipf Coefficient | 0.8972 |
294
- | R² (Goodness of Fit) | 0.986656 |
295
  | Adherence Quality | **excellent** |
296
 
297
  ### Coverage Analysis
298
 
299
  | Top N Words | Coverage |
300
  |-------------|----------|
301
- | Top 100 | 44.9% |
302
- | Top 1,000 | 76.0% |
303
  | Top 5,000 | 0.0% |
304
  | Top 10,000 | 0.0% |
305
 
306
  ### Key Findings
307
 
308
- - **Zipf Compliance:** R²=0.9867 indicates excellent adherence to Zipf's law
309
- - **High Frequency Dominance:** Top 100 words cover 44.9% of corpus
310
- - **Long Tail:** -5,525 words needed for remaining 100.0% coverage
311
 
312
  ---
313
  ## 5. Word Embeddings Evaluation
@@ -320,24 +380,85 @@ Below are text samples generated from each Markov chain model:
320
 
321
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
322
 
323
- ### Model Comparison
324
 
325
- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
326
- |-------|------------|-----------|----------|----------|----------|
327
- | **mono_32d** | 1,242 | 32 | 2.681 | 0.618 | 0.2339 🏆 |
328
- | **mono_64d** | 1,242 | 64 | 2.624 | 0.626 | 0.0550 |
329
- | **mono_128d** | 1,242 | 128 | 2.563 | 0.654 | 0.0093 |
330
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
 
 
 
 
 
 
331
 
332
  ### Key Findings
333
 
334
- - **Best Isotropy:** mono_32d with 0.2339 (more uniform distribution)
335
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
336
- - **Vocabulary Coverage:** All models cover 1,242 words
337
- - **Recommendation:** 100d for balanced semantic capture and efficiency
338
 
339
  ---
340
- ## 6. Summary & Recommendations
341
 
342
  ![Performance Dashboard](visualizations/performance_dashboard.png)
343
 
@@ -345,11 +466,12 @@ Below are text samples generated from each Markov chain model:
345
 
346
  | Component | Recommended | Rationale |
347
  |-----------|-------------|-----------|
348
- | Tokenizer | **32k BPE** | Best compression (3.84x) with low UNK rate |
349
- | N-gram | **5-gram** | Lowest perplexity (257) |
350
- | Markov | **Context-4** | Highest predictability (97.8%) |
351
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
352
 
 
353
  ---
354
  ## Appendix: Metrics Glossary & Interpretation Guide
355
 
@@ -539,7 +661,8 @@ If you use these models in your research, please cite:
539
  author = {Kamali, Omar},
540
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
541
  year = {2025},
542
- publisher = {HuggingFace},
 
543
  url = {https://huggingface.co/wikilangs}
544
  institution = {Omneity Labs}
545
  }
@@ -555,7 +678,8 @@ MIT License - Free for academic and commercial use.
555
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
556
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
557
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 
558
  ---
559
  *Generated by Wikilangs Models Pipeline*
560
 
561
- *Report Date: 2025-12-28 22:42:07*
 
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
+ value: 3.573
27
  - name: best_isotropy
28
  type: isotropy
29
+ value: 0.2005
30
  - name: vocabulary_size
31
  type: vocab
32
+ value: 0
33
+ generated: 2026-01-03
34
  ---
35
 
36
  # CHR - Wikilangs Models
 
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
+ - N-gram models (2, 3, 4, 5-gram)
48
+ - Markov chains (context of 1, 2, 3, 4 and 5)
49
  - Subword N-gram and Markov chains
50
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
51
  - Language Vocabulary
52
  - Language Statistics
53
+
54
  ![Performance Dashboard](visualizations/performance_dashboard.png)
55
 
56
  ### Analysis and Evaluation
 
60
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
61
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
62
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
63
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
64
+ - [7. Summary & Recommendations](#7-summary--recommendations)
65
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
66
  - [Visualizations Index](#visualizations-index)
67
 
 
70
 
71
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
72
 
73
+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
74
+
75
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
76
+
77
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
78
+
79
  ### Results
80
 
81
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
82
  |------------|-------------|---------------|----------|--------------|
83
+ | **8k** | 2.938x | 2.95 | 0.1472% | 83,560 |
84
+ | **16k** | 3.361x | 3.37 | 0.1684% | 73,046 |
85
+ | **32k** | 3.573x 🏆 | 3.59 | 0.1790% | 68,709 |
86
 
87
  ### Tokenization Examples
88
 
89
  Below are sample sentences tokenized with each vocabulary size:
90
 
91
+ **Sample 1:** `ᎠᏥᏂᏘᏂᎠ ᏙᏱᏗᏢᎦᏚᎲ ᏧᎦᎾᏮ ᎠᎹᏰᏟ-Ꮒ. Buenos Aires ᎠᏰᎵᏗᎦᎳᏫᎢᏍᏗ ᎦᏚᎲᎢ. ᏓᏓᏚᎬ ᎪᏪᎵ ᏙᏯᏗᏢ ᏗᏕᎬᏔᏛ ᎠᎹ...`
 
 
92
 
93
  | Vocab | Tokens | Count |
94
  |-------|--------|-------|
95
+ | 8k | `▁ᎠᏥᏂᏘᏂᎠ ▁ᏙᏱᏗᏢᎦᏚᎲ ▁ᏧᎦᎾᏮ ▁ᎠᎹᏰᏟ - . ▁buenos ▁aires ▁ᎠᏰᎵᏗᎦᎳᏫᎢᏍᏗ ... (+10 more)` | 20 |
96
+ | 16k | `▁ᎠᏥᏂᏘᏂᎠ ▁ᏙᏱᏗᏢᎦᏚᎲ ▁ᏧᎦᎾᏮ ▁ᎠᎹᏰᏟ - . ▁buenos ▁aires ▁ᎠᏰᎵᏗᎦᎳᏫᎢᏍᏗ ... (+10 more)` | 20 |
97
+ | 32k | `▁ᎠᏥᏂᏘᏂᎠ ▁ᏙᏱᏗᏢᎦᏚᎲ ▁ᏧᎦᎾᏮ ▁ᎠᎹᏰᏟ - . ▁buenos ▁aires ▁ᎠᏰᎵᏗᎦᎳᏫᎢᏍᏗ ... (+10 more)` | 20 |
98
 
99
+ **Sample 2:** `ᎠᎷᏆ"Tsalagi Anagalisgi." ᏙᏱᏗᏢᎦᏚᎲ ᎾᏍᎩ ᏁᏛᎳᏂ-Ꮒ. Oranjestad ᎠᏰᎵᏗᎦᎳᏫᎢᏍᏗ ᎦᏚᎲᎢ. ᏓᏓᏚᎬ ᎪᏪ...`
 
 
 
100
 
101
  | Vocab | Tokens | Count |
102
  |-------|--------|-------|
103
+ | 8k | `▁ᎠᎷ " tsalagianagalisgi ." ▁ᏙᏱᏗᏢᎦᏚᎲ ▁ᎾᏍᎩ ▁ᏁᏛᎳᏂ - ... (+18 more)` | 28 |
104
+ | 16k | `▁ᎠᎷᏆ " tsalagianagalisgi ." ▁ᏙᏱᏗᏢᎦᏚᎲ ▁ᎾᏍᎩ ▁ᏁᏛᎳᏂ - ... (+14 more)` | 24 |
105
+ | 32k | `▁ᎠᎷᏆ " tsalagianagalisgi ." ▁ᏙᏱᏗᏢᎦᏚᎲ ▁ᎾᏍᎩ ▁ᏁᏛᎳᏂ - ... (+13 more)` | 23 |
 
 
 
 
 
 
106
 
107
+ **Sample 3:** `ᎴᎹᏂ (Lemani) ᎠᏓᎶᏂᎨᎢ ᎤᏓᏔᏅᎯ ᎠᎩᏍᏗ ᎨᏐᎢ. ᎠᎩᏍᏗ be checked`
108
 
109
  | Vocab | Tokens | Count |
110
  |-------|--------|-------|
111
+ | 8k | `▁Ꮄ ᎹᏂ ▁( le mani ) ▁ᎠᏓᎶᏂᎨᎢ ▁ᎤᏓᏔᏅᎯ ▁ᎠᎩᏍᏗ ▁ᎨᏐᎢ ... (+4 more)` | 14 |
112
+ | 16k | `▁ᎴᎹᏂ ▁( lemani ) ▁ᎠᏓᎶᏂᎨᎢ ▁ᎤᏓᏔᏅᎯ ▁ᎠᎩᏍᏗ ▁ᎨᏐᎢ . ▁ᎠᎩᏍᏗ ... (+2 more)` | 12 |
113
+ | 32k | `▁ᎴᎹᏂ ▁( lemani ) ▁ᎠᏓᎶᏂᎨᎢ ▁ᎤᏓᏔᏅᎯ ▁ᎠᎩᏍᏗ ▁ᎨᏐᎢ . ▁ᎠᎩᏍᏗ ... (+2 more)` | 12 |
114
 
115
 
116
  ### Key Findings
117
 
118
+ - **Best Compression:** 32k achieves 3.573x compression
119
+ - **Lowest UNK Rate:** 8k with 0.1472% unknown tokens
120
  - **Trade-off:** Larger vocabularies improve compression but increase model size
121
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
122
 
 
125
 
126
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
127
 
128
+ ![N-gram Unique](visualizations/ngram_unique.png)
129
+
130
  ![N-gram Coverage](visualizations/ngram_coverage.png)
131
 
132
  ### Results
133
 
134
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
135
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
136
+ | **2-gram** | Word | 160 🏆 | 7.32 | 494 | 67.3% | 100.0% |
137
+ | **2-gram** | Subword | 938 | 9.87 | 3,277 | 40.0% | 86.1% |
138
+ | **3-gram** | Word | 222 | 7.80 | 667 | 63.1% | 100.0% |
139
+ | **3-gram** | Subword | 4,484 | 12.13 | 12,903 | 21.8% | 52.1% |
140
+ | **4-gram** | Word | 508 | 8.99 | 1,308 | 48.2% | 89.2% |
141
+ | **4-gram** | Subword | 9,908 | 13.27 | 28,877 | 18.5% | 39.2% |
142
 
143
  ### Top 5 N-grams by Size
144
 
145
+ **2-grams (Word):**
146
+
147
+ | Rank | N-gram | Count |
148
+ |------|--------|-------|
149
+ | 1 | `be checked` | 850 |
150
+ | 2 | `ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ` | 578 |
151
+ | 3 | `ꮣꮣꮪꭼ ꭺꮺꮅ` | 472 |
152
+ | 4 | `ꭺꮺꮅ ꮩꮿꮧꮲ` | 431 |
153
+ | 5 | `word list` | 345 |
154
+
155
+ **3-grams (Word):**
156
 
157
  | Rank | N-gram | Count |
158
  |------|--------|-------|
159
+ | 1 | `ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ` | 431 |
160
+ | 2 | `ꮣꮣꮪꭼ ꭺꮺꮅ ꮩꮿꮧꮲ` | 431 |
161
+ | 3 | `consortium word list` | 343 |
162
+ | 4 | `ꮧꮥꭼꮤꮫ be checked` | 227 |
163
+ | 5 | `ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ be` | 216 |
164
 
165
+ **4-grams (Word):**
166
 
167
  | Rank | N-gram | Count |
168
  |------|--------|-------|
169
+ | 1 | `ꮣꮣꮪꭼ ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ` | 431 |
170
+ | 2 | `ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ be checked` | 216 |
171
+ | 3 | `ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ be` | 163 |
172
+ | 4 | `ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ ꭰꭶꮞꮝꮤꮕ` | 96 |
173
+ | 5 | `ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ ꭰꭶꮞꮝꮤꮕ be` | 96 |
174
 
175
+ **2-grams (Subword):**
176
 
177
  | Rank | N-gram | Count |
178
  |------|--------|-------|
179
+ | 1 | `_ ꭰ` | 5,396 |
180
+ | 2 | `_ ꭴ` | 3,446 |
181
+ | 3 | `ꮧ _` | 2,866 |
182
+ | 4 | `. _` | 2,628 |
183
+ | 5 | `, _` | 2,120 |
184
+
185
+ **3-grams (Subword):**
186
+
187
+ | Rank | N-gram | Count |
188
+ |------|--------|-------|
189
+ | 1 | `ꮝ ꮧ _` | 1,406 |
190
+ | 2 | `_ c h` | 987 |
191
+ | 3 | `_ ꭰ ꮄ` | 975 |
192
+ | 4 | `c h e` | 965 |
193
+ | 5 | `ꭰ ꮄ _` | 899 |
194
+
195
+ **4-grams (Subword):**
196
+
197
+ | Rank | N-gram | Count |
198
+ |------|--------|-------|
199
+ | 1 | `_ c h e` | 918 |
200
+ | 2 | `_ ꭰ ꮄ _` | 892 |
201
+ | 3 | `e _ c h` | 857 |
202
+ | 4 | `_ b e _` | 851 |
203
+ | 5 | `c k e d` | 850 |
204
 
205
 
206
  ### Key Findings
207
 
208
+ - **Best Perplexity:** 2-gram (word) with 160
209
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
210
+ - **Coverage:** Top-1000 patterns cover ~39% of corpus
211
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
212
 
213
  ---
 
215
 
216
  ![Markov Entropy](visualizations/markov_entropy.png)
217
 
218
+ ![Markov Contexts](visualizations/markov_contexts.png)
219
+
220
  ![Markov Branching](visualizations/markov_branching.png)
221
 
222
  ### Results
223
 
224
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
225
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
226
+ | **1** | Word | 0.4942 | 1.409 | 2.32 | 13,243 | 50.6% |
227
+ | **1** | Subword | 1.6130 | 3.059 | 16.20 | 448 | 0.0% |
228
+ | **2** | Word | 0.0935 | 1.067 | 1.16 | 30,660 | 90.6% |
229
+ | **2** | Subword | 1.0024 | 2.003 | 4.68 | 7,258 | 0.0% |
230
+ | **3** | Word | 0.0291 | 1.020 | 1.05 | 35,274 | 97.1% |
231
+ | **3** | Subword | 0.5824 | 1.497 | 2.33 | 33,930 | 41.8% |
232
+ | **4** | Word | 0.0141 🏆 | 1.010 | 1.02 | 36,782 | 98.6% |
233
+ | **4** | Subword | 0.2770 | 1.212 | 1.46 | 79,071 | 72.3% |
234
+
235
+ ### Generated Text Samples (Word-based)
236
+
237
+ Below are text samples generated from each word-based Markov chain model:
238
+
239
+ **Context Size 1:**
240
+
241
+ 1. `ꭰꮄ ꭿꭰ ꮳꮃꭹ ꮧꭶꮳꭶꮈᏼ ꮧꮜꮓ ꮪꮎꮜꮓꭾꭲ ꭰꮒꮝꭶꮿꮓ ꭰꮗꮱꮝꮧ ꭰꮏꮼꭲ ꭵꮝꮚ ꭰꭺꮩꮝꭶ ꭴꮒꭺꮫ ᏼꮻ ꭴꮕꮤꮒꮣꮝꮩꮧ v ꮝ`
242
+ 2. `be checked ꭿꭺꮹꮤ ꮎꮝꭶ ꮎꮝꮗ ꭰꮝꮪꭲꮫ ꭿꭰ ꭰꮣꮄꮒꮝꭼ ꭰꮫꮝꭼ ꭰᏸꮈ ꭴꮣꮄꮴꮂ consortium word list sdagoi ꮣꮣꮪꭼ`
243
+ 3. `ꭿꭰ ꮳꮃꭹ ꭴꭼꮻᏻꭿ ꭸꮢ ꮎꮏ ꮌꮞꮋꮗꭹ ꮎ ꮎꮝꭹ ꮑꮫꮃꮒ ꮪꮎꮩꮲꮹꮧꮢ be checked ꭱꮃꮧꮬ thumb none thumb`
244
+
245
+ **Context Size 2:**
246
+
247
+ 1. `ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ be checked ꮪꮎꮩꮲꮹꮧꮢ`
248
+ 2. `ꮣꮣꮪꭼ ꭺꮺꮅ ꭴꮣꮞꭶꮴꮧ ꮧꮥꭼꮤꮫ ꮣꮆꮒꭶꮝꮫ ᏻꮃꮫ ꭼꮏꭸꮝꮫ ꮷᏼꮲ ꭰꮉᏸꮯ be checked ꭱꮃꮧꮬ ꮖꮗꭲꮴꭹꭲꮒ`
249
+ 3. `ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ be checked ꮤꮝꮊꮒꮿ ꮷꮑꭿ ꭰꮄ ꭿꮈꮝꮧ ꭰᏺꮅꭸ ꭰꮉᏸꮅ ꭷꮓꭾꭽ ꭰꮄ ꮶꭲ ꮏꮎ ꭸꮢ yolngu`
250
+
251
+ **Context Size 3:**
252
+
253
+ 1. `ꮣꮣꮪꭼ ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ ꮪꮎꮩꮲꮹꮧꮢ be checked`
254
+ 2. `ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ be checked ꭱꮃꮧꮬ ꮖꮗꭲꮴꭹꭲꮒ`
255
+ 3. `consortium word list dikaneisd digoluwatisgi ꮣꮣꮪꭼ ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ ꮔꮫꮏꮥꭼ be checked`
256
+
257
+ **Context Size 4:**
258
+
259
+ 1. `ꮣꮣꮪꭼ ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ ꭰꭶꮞꮝꮩꮧ be checked`
260
+ 2. `ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ be checked ꮷᏼꮲ ꭰꮉᏸꮯ ᏻꮃꮫ ꮣꮆꮒꭶꮝꮫ ꭼꮏꭸꮝꮫ`
261
+ 3. `ꭺꮺꮅ ꮩꮿꮧꮲ ꮧꮥꭼꮤꮫ ꭰꭶꮞꮝꮤꮕ be checked`
262
+
263
 
264
+ ### Generated Text Samples (Subword-based)
265
 
266
+ Below are text samples generated from each subword-based Markov chain model:
267
 
268
  **Context Size 1:**
269
 
270
+ 1. `_ꭻꮹꮧꮳꮕᏹꭹꮸꮟꮅꭳꮜꭲꮧ_`
271
+ 2. `ꮧꮲ_ꮣꮞꭲᏻꭿꮵꮅ_blipe`
272
+ 3. `ꮝꮩꮲꮹꮢ_ꭰ_ꭴꮓꭿᏻꭿꭰ_ꮣ`
273
 
274
  **Context Size 2:**
275
 
276
+ 1. `_ꭰꮄ_ꭴꮎꮴꮅ_ꮧꭶꮎꮕꮟꮣꮑꮈ`
277
+ 2. `_ꭴꮓꮈꮤꮕꭲ,_.._tr>_<`
278
+ 3. `ꮧ_list."_(iaꭰꮒꭺꮫ_`
279
 
280
  **Context Size 3:**
281
 
282
+ 1. `ꮝꮧ_ble._galv)._ꮣꮣꮪ`
283
+ 2. `_checked_(gila)_ꮗꮅ`
284
+ 3. `_ꭰꮄ_ꭴꮣꮔꮦᏺꮂ_ꮞꮅꮎ_seq`
285
 
286
  **Context Size 4:**
287
 
288
+ 1. `_checked_(ꭱꮃꮧꮭꭲ_ꭰꮣꭿ`
289
+ 2. `_ꭰꮄ_ꭰꮣꭿꭿ_polydeuces`
290
+ 3. `e_checked_(ꭱꮃꮧꮬ._re`
291
 
292
 
293
  ### Key Findings
294
 
295
+ - **Best Predictability:** Context-4 (word) with 98.6% predictability
296
  - **Branching Factor:** Decreases with context size (more deterministic)
297
+ - **Memory Trade-off:** Larger contexts require more storage (79,071 contexts)
298
  - **Recommendation:** Context-3 or Context-4 for text generation
299
 
300
  ---
 
310
 
311
  | Metric | Value |
312
  |--------|-------|
313
+ | Vocabulary Size | 4,236 |
314
+ | Total Tokens | 35,186 |
315
+ | Mean Frequency | 8.31 |
316
  | Median Frequency | 3 |
317
+ | Frequency Std Dev | 35.59 |
318
 
319
  ### Most Common Words
320
 
321
  | Rank | Word | Frequency |
322
  |------|------|-----------|
323
+ | 1 | ꭰꮄ | 903 |
324
+ | 2 | be | 852 |
325
+ | 3 | checked | 850 |
326
+ | 4 | ꭿꭰ | 797 |
327
+ | 5 | ꮧꮥꭼꮤꮫ | 611 |
328
+ | 6 | ꮩꮿꮧꮲ | 580 |
329
+ | 7 | ꭺꮺꮅ | 524 |
330
+ | 8 | ꮣꮣꮪꭼ | 482 |
331
+ | 9 | ꮳꮃꭹ | 475 |
332
+ | 10 | word | 346 |
333
 
334
  ### Least Common Words (from vocabulary)
335
 
 
350
 
351
  | Metric | Value |
352
  |--------|-------|
353
+ | Zipf Coefficient | 0.8714 |
354
+ | R² (Goodness of Fit) | 0.984546 |
355
  | Adherence Quality | **excellent** |
356
 
357
  ### Coverage Analysis
358
 
359
  | Top N Words | Coverage |
360
  |-------------|----------|
361
+ | Top 100 | 39.8% |
362
+ | Top 1,000 | 74.2% |
363
  | Top 5,000 | 0.0% |
364
  | Top 10,000 | 0.0% |
365
 
366
  ### Key Findings
367
 
368
+ - **Zipf Compliance:** R²=0.9845 indicates excellent adherence to Zipf's law
369
+ - **High Frequency Dominance:** Top 100 words cover 39.8% of corpus
370
+ - **Long Tail:** -5,764 words needed for remaining 100.0% coverage
371
 
372
  ---
373
  ## 5. Word Embeddings Evaluation
 
380
 
381
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
382
 
 
383
 
384
+ ### 5.1 Cross-Lingual Alignment
385
+
386
+ > *Note: Multilingual alignment visualization not available for this language.*
387
+
388
+
389
+ ### 5.2 Model Comparison
390
+
391
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
392
+ |-------|-----------|----------|------------------|---------------|----------------|
393
+ | **mono_32d** | 32 | 0.2005 🏆 | 0.5045 | N/A | N/A |
394
+ | **mono_64d** | 64 | 0.0662 | 0.4736 | N/A | N/A |
395
+ | **mono_128d** | 128 | 0.0103 | 0.4890 | N/A | N/A |
396
 
397
  ### Key Findings
398
 
399
+ - **Best Isotropy:** mono_32d with 0.2005 (more uniform distribution)
400
+ - **Semantic Density:** Average pairwise similarity of 0.4890. Lower values indicate better semantic separation.
401
+ - **Alignment Quality:** No aligned models evaluated in this run.
402
+ - **Recommendation:** 128d aligned for best cross-lingual performance
403
 
404
  ---
405
+ ## 6. Morphological Analysis (Experimental)
406
+
407
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
408
+
409
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
410
+
411
+ ### 6.1 Productivity & Complexity
412
+
413
+ | Metric | Value | Interpretation | Recommendation |
414
+ |--------|-------|----------------|----------------|
415
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
416
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
417
+
418
+ ### 6.2 Affix Inventory (Productive Units)
419
+
420
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
421
+
422
+ #### Productive Prefixes
423
+ | Prefix | Examples |
424
+ |--------|----------|
425
+
426
+ #### Productive Suffixes
427
+ | Suffix | Examples |
428
+ |--------|----------|
429
+ | `-ꮝꮧ` | ꮵꮣꮯꮆꮝꮧ, ꭴꮺꮕꮝꮧ, ꭰꭶꮤꮂꮝꮧ |
430
+
431
+ ### 6.3 Bound Stems (Lexical Roots)
432
+
433
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
434
+
435
+ *No significant bound stems detected.*
436
+
437
+
438
+ ### 6.4 Affix Compatibility (Co-occurrence)
439
+
440
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
441
+
442
+ *No significant affix co-occurrences detected.*
443
+
444
+
445
+ ### 6.5 Recursive Morpheme Segmentation
446
+
447
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
448
+
449
+ | Word | Suggested Split | Confidence | Stem |
450
+ |------|-----------------|------------|------|
451
+ | ꮒꭶꮅꮝꮧꮝꭸꮝꮧ | **`ꮒꭶꮅꮝꮧꮝꭸ-ꮝꮧ`** | 1.5 | `ꮒꭶꮅꮝꮧꮝꭸ` |
452
+ | ꭰᏸꮅꮧꭶꮃꮻꭲꮝꮧ | **`ꭰᏸꮅꮧꭶꮃꮻꭲ-ꮝꮧ`** | 1.5 | `ꭰᏸꮅꮧꭶꮃꮻꭲ` |
453
+ | ꭰꮜꮓꮧꮥꮆꮖꮝꮧ | **`ꭰꮜꮓꮧꮥꮆꮖ-ꮝꮧ`** | 1.5 | `ꭰꮜꮓꮧꮥꮆꮖ` |
454
+
455
+ ### 6.6 Linguistic Interpretation
456
+
457
+ > **Automated Insight:**
458
+ The language CHR appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
459
+
460
+ ---
461
+ ## 7. Summary & Recommendations
462
 
463
  ![Performance Dashboard](visualizations/performance_dashboard.png)
464
 
 
466
 
467
  | Component | Recommended | Rationale |
468
  |-----------|-------------|-----------|
469
+ | Tokenizer | **32k BPE** | Best compression (3.57x) |
470
+ | N-gram | **2-gram** | Lowest perplexity (160) |
471
+ | Markov | **Context-4** | Highest predictability (98.6%) |
472
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
473
 
474
+
475
  ---
476
  ## Appendix: Metrics Glossary & Interpretation Guide
477
 
 
661
  author = {Kamali, Omar},
662
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
663
  year = {2025},
664
+ doi = {10.5281/zenodo.18073153},
665
+ publisher = {Zenodo},
666
  url = {https://huggingface.co/wikilangs}
667
  institution = {Omneity Labs}
668
  }
 
678
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
679
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
680
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
681
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
682
  ---
683
  *Generated by Wikilangs Models Pipeline*
684
 
685
+ *Report Date: 2026-01-03 10:09:15*
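The perplexity and entropy columns in the updated README tables appear to be related by perplexity = 2^entropy. The excerpt does not state this definition, so treat it as an assumption (entropy in bits). A minimal sketch that checks the assumption against a few rows copied from the new tables; the names `REPORTED`, `perplexity_from_entropy`, and `relative_error` are illustrative and not part of the Wikilangs pipeline:

```python
# Rows copied from the updated N-gram table and the first word-level Markov row.
# Each entry: (label, reported entropy, reported perplexity).
REPORTED = [
    ("2-gram word", 7.32, 160),
    ("2-gram subword", 9.87, 938),
    ("3-gram word", 7.80, 222),
    ("3-gram subword", 12.13, 4484),
    ("4-gram word", 8.99, 508),
    ("4-gram subword", 13.27, 9908),
    ("markov ctx1 word", 0.4942, 1.409),
]


def perplexity_from_entropy(entropy_bits: float) -> float:
    """Assumed relationship: perplexity = 2 ** entropy (entropy in bits)."""
    return 2.0 ** entropy_bits


def relative_error(entropy_bits: float, reported_perplexity: float) -> float:
    """Relative gap between the recomputed and the reported perplexity."""
    recomputed = perplexity_from_entropy(entropy_bits)
    return abs(recomputed - reported_perplexity) / reported_perplexity


if __name__ == "__main__":
    for label, entropy, perplexity in REPORTED:
        recomputed = perplexity_from_entropy(entropy)
        err = relative_error(entropy, perplexity)
        print(f"{label}: reported={perplexity} recomputed={recomputed:.3f} rel_err={err:.4f}")
```

Small gaps are expected because the tables round entropy to two or four decimals before display.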
models/embeddings/monolingual/chr_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3d20dfbda9a0fad185b3fbdecaaf3911d08563e5eec2416cc1d49cabcfd19fbe
3
- size 1025297761
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b65a441c9c4908f0ffa5c095ba64382b4b323cde80a4f6d979c87e4b93102427
3
+ size 1025246107
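The `.bin` and `.parquet` entries in this commit are Git LFS pointer files, so each diff only shows a changed `oid` and `size` rather than the binary itself. A minimal sketch of reading such a pointer and comparing the two versions of `chr_128d.bin`; `parse_lfs_pointer` is a hypothetical helper written for illustration, not a function from Git LFS or the Hugging Face tooling:

```python
# Old and new pointer contents, copied from the chr_128d.bin diff above.
OLD_POINTER = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:3d20dfbda9a0fad185b3fbdecaaf3911d08563e5eec2416cc1d49cabcfd19fbe\n"
    "size 1025297761\n"
)
NEW_POINTER = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:b65a441c9c4908f0ffa5c095ba64382b4b323cde80a4f6d979c87e4b93102427\n"
    "size 1025246107\n"
)


def parse_lfs_pointer(text: str) -> dict:
    """Parse 'key value' lines of an LFS pointer into a small dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid line has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


if __name__ == "__main__":
    old, new = parse_lfs_pointer(OLD_POINTER), parse_lfs_pointer(NEW_POINTER)
    print("content changed:", old["oid"] != new["oid"])
    print("size delta (bytes):", new["size"] - old["size"])
```

Comparing `oid` values is enough to tell whether the stored binary changed; the `size` delta only indicates how much larger or smaller the new file is.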
models/embeddings/monolingual/chr_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 128,
  "version": "monolingual",
  "training_params": {
- "dim": 128,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 128
  },
- "vocab_size": 1242
+ "vocab_size": 1193
  }
models/embeddings/monolingual/chr_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c1700517d4093dd3dfbf1011faa55ebb1e1b7cd19fffeb86d693090d2c17c336
3
- size 256343905
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d9b5ecb349ca3c9d1d8f1f38b34647edbd029de966e73fe6f8043e5b64357e71
3
+ size 256329883
models/embeddings/monolingual/chr_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 32,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 1242
13
  }
 
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 32
13
  },
14
+ "vocab_size": 1193
15
  }
models/embeddings/monolingual/chr_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:09d69f872f3cb41d7d61e371667e89c7559d22c55e910dd6ec1cc05d27940abd
3
- size 512661857
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2ff3baa5798aac84304b756ac363fd816043d3a9775848343c3a39af84cf09bf
3
+ size 512635291
models/embeddings/monolingual/chr_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 64,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 1242
13
  }
 
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 64
13
  },
14
+ "vocab_size": 1193
15
  }
models/subword_markov/chr_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:94764e1f3d25aa83b86cc0bc93dae20352a14e0147c26b8923882474f061a379
3
- size 51927
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:47a719104c2e085b3f0210500f2e0d35dd3fad4c62155a28ff1885ad952f603e
3
+ size 48128
models/subword_markov/chr_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "chr",
5
- "unique_contexts": 449,
6
- "total_transitions": 304475
7
  }
 
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "chr",
5
+ "unique_contexts": 448,
6
+ "total_transitions": 244557
7
  }
models/subword_markov/chr_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:832beea5bece4732c22e619cef5564a0512d794894d07a9f5943bd7eea7020fb
3
- size 242322
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:90017c14d3413b7443aa8cfd4e8186bbbfa36eb5c061301510a67e33d59ced11
3
+ size 213598
models/subword_markov/chr_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "chr",
5
- "unique_contexts": 7851,
6
- "total_transitions": 303415
7
  }
 
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "chr",
5
+ "unique_contexts": 7258,
6
+ "total_transitions": 243650
7
  }
models/subword_markov/chr_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a6e96393ee13e9a83e555b4fc7719ccadbad4a3d5edcae2387524f2c1cb9bced
3
- size 695224
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3aa3158ad438f31c7b0fe3d234fab09e4e04d062e867a5f9276cd4c9c48b1d0c
3
+ size 611513
models/subword_markov/chr_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "chr",
5
- "unique_contexts": 38007,
6
- "total_transitions": 302355
7
  }
 
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "chr",
5
+ "unique_contexts": 33930,
6
+ "total_transitions": 242743
7
  }
models/subword_markov/chr_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:91ca7a6f04658e80a3bee03c2eb5c7a6e0575ba5e7b3311b5a5e42c2367c2de4
3
- size 1281671
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:041f8fb81d9f01bcb72c1f16d3266d79dd5c44b4454b909b4095a6dc7c4e41c9
3
+ size 1161581
models/subword_markov/chr_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "chr",
5
- "unique_contexts": 87252,
6
- "total_transitions": 301295
7
  }
 
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "chr",
5
+ "unique_contexts": 79071,
6
+ "total_transitions": 241836
7
  }
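
In the new subword Markov metadata, `total_transitions` falls by exactly 907 with each extra context token (244557, 243650, 242743, 241836), and 907 is also the new `total_documents` value in the vocabulary metadata. That pattern is what one would expect if transitions are counted within document boundaries, so that a document of n tokens contributes n - k transitions at context size k. The sketch below illustrates that counting scheme on toy data. It is an assumption about how the numbers arise, not the pipeline's actual code, and `count_transitions` and `summarize` are hypothetical names.

```python
from collections import Counter, defaultdict


def count_transitions(docs, context_size):
    """Map each context (a tuple of tokens) to a Counter of next tokens.

    Windows never cross document boundaries, so a document with n tokens
    contributes max(n - context_size, 0) transitions.
    """
    table = defaultdict(Counter)
    for tokens in docs:
        for i in range(len(tokens) - context_size):
            context = tuple(tokens[i:i + context_size])
            table[context][tokens[i + context_size]] += 1
    return table


def summarize(table):
    """Produce the two fields reported in the *_metadata.json files."""
    return {
        "unique_contexts": len(table),
        "total_transitions": sum(sum(nxt.values()) for nxt in table.values()),
    }


# Two toy documents, already tokenized.
docs = [["a", "b", "a", "b", "c"], ["a", "b", "d"]]
ctx1 = summarize(count_transitions(docs, 1))
ctx2 = summarize(count_transitions(docs, 2))

# Each extra context token costs one transition per (long enough) document.
print(ctx1["total_transitions"] - ctx2["total_transitions"] == len(docs))

# The same difference in the new subword metadata from this commit.
subword_totals = [244557, 243650, 242743, 241836]
deltas = [a - b for a, b in zip(subword_totals, subword_totals[1:])]
print(deltas)
```

The new word-level totals in this commit (43295, 42388, 41481, 40574) show the same step of 907.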
models/subword_ngram/chr_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d1b27ecd2e3bf4c97a30d980ece0b0db63017dc2003eabca6ea2fcbd56f7aba2
3
- size 41814
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8d97d4c240d2e544cee601293886cf200a3ca9744dc6f05309d9f3f6dedb5701
3
+ size 37349
models/subword_ngram/chr_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "chr",
5
- "unique_ngrams": 3626,
6
- "total_ngrams": 304475
7
  }
 
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "chr",
5
+ "unique_ngrams": 3277,
6
+ "total_ngrams": 244557
7
  }
models/subword_ngram/chr_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2491c3a5631b3d1d0f21d6d30f46d2658a5acdb741a45513792750076f809272
3
- size 171485
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4909748af2830cbec527b301ed9c11dc5fba2c6a9e7f48e1b13a633d5cfdcd87
3
+ size 157760
models/subword_ngram/chr_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "chr",
5
- "unique_ngrams": 14102,
6
- "total_ngrams": 303415
7
  }
 
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "chr",
5
+ "unique_ngrams": 12903,
6
+ "total_ngrams": 243650
7
  }
models/subword_ngram/chr_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1c9d8cf44e962d72f73232a0405cac11c2f2997bf107cc3e7f89f07a266eb115
3
- size 410498
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a19dcd2eca820072c87f80cca01672f0fafd765e2ca81e6dbfd95249c77828dc
3
+ size 374199
models/subword_ngram/chr_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "chr",
5
- "unique_ngrams": 31632,
6
- "total_ngrams": 302355
7
  }
 
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "chr",
5
+ "unique_ngrams": 28877,
6
+ "total_ngrams": 242743
7
  }
models/tokenizer/chr_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:4cd7d2cae1a887eaa13ebc4706df2231175ed02c2496506d48c55a25082a93c7
3
- size 560277
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:359423c436a228311e5a63ed11e5f00c42f4e1f43e6dab0726841d8a8cea7899
3
+ size 558319
models/tokenizer/chr_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/chr_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:590a2d79a920b938da84b347c50ae4007d77893ad6ba128d43467fac61235885
3
- size 876858
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bd42bf932a4c3f148e083933ebcc3b363a57f53b5b4db22c1e6a60d5017edabe
3
+ size 835851
models/tokenizer/chr_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/chr_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6b3ffbdb4812b8aa46b6ca458563e45453188e92b8db13d34f690953654b8071
3
- size 391397
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cb436a7afc74fa9ba90ff21df46b3064327fe8aa4d69c5a5859a405d4b67a6c2
3
+ size 392805
models/tokenizer/chr_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/chr_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:54b695ca6569f4eb86b2f393922a1789e9b40736180d2d6b5bcee701ba4c0a6d
3
- size 76700
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8068c404442dc10b84690c7100fc813e925cbe51c1e787ef250cbc0cea84d3d3
3
+ size 73331
models/vocabulary/chr_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
1
  {
2
  "language": "chr",
3
- "vocabulary_size": 4475,
 
4
  "statistics": {
5
- "type_token_ratio": 0.266443314849045,
6
  "coverage": {
7
- "top_100": 0.3678373382624769,
8
- "top_1000": 0.6233441158348737,
9
- "top_5000": 0.8298290203327172,
10
- "top_10000": 0.9261013555144794
11
  },
12
- "hapax_count": 9363,
13
- "hapax_ratio": 0.6766151177915883,
14
- "total_documents": 1060
15
  }
16
  }
 
1
  {
2
  "language": "chr",
3
+ "vocabulary_size": 4236,
4
+ "variant": "full",
5
  "statistics": {
6
+ "type_token_ratio": 0.2998054386679336,
7
  "coverage": {
8
+ "top_100": 0.3170671010361522,
9
+ "top_1000": 0.590561513053708,
10
+ "top_5000": 0.8133116148590561,
11
+ "top_10000": 0.9264286683860459
12
  },
13
+ "hapax_count": 9016,
14
+ "hapax_ratio": 0.6803501358285542,
15
+ "total_documents": 907
16
  }
17
  }
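
The fields in the vocabulary metadata are related by simple ratios, which can be checked against the new values. The sketch below computes the same kinds of statistics from a token list and then tests three relationships that the reported numbers appear to satisfy: `vocabulary_size` excludes hapaxes, `hapax_ratio` is hapaxes over all types, and `top_k` coverage is the share of tokens covered by the k most frequent types. These definitions are inferred from the numbers, not taken from pipeline documentation, and `vocabulary_stats` is a hypothetical helper.

```python
from collections import Counter


def vocabulary_stats(tokens, top_ks=(1, 2)):
    """Compute type/token statistics for a flat list of tokens."""
    counts = Counter(tokens)
    total_tokens = len(tokens)
    total_types = len(counts)
    hapax_count = sum(1 for c in counts.values() if c == 1)
    ranked = sorted(counts.values(), reverse=True)
    return {
        # Assumption: the reported vocabulary size leaves out hapaxes.
        "vocabulary_size": total_types - hapax_count,
        "type_token_ratio": total_types / total_tokens,
        "coverage": {f"top_{k}": sum(ranked[:k]) / total_tokens for k in top_ks},
        "hapax_count": hapax_count,
        "hapax_ratio": hapax_count / total_types,
    }


toy = vocabulary_stats(["a", "a", "a", "b", "b", "c", "d"])

# New values from chr_vocabulary_metadata.json in this commit.
meta = {
    "vocabulary_size": 4236,
    "type_token_ratio": 0.2998054386679336,
    "top_10000": 0.9264286683860459,
    "hapax_count": 9016,
    "hapax_ratio": 0.6803501358285542,
}

total_types = meta["vocabulary_size"] + meta["hapax_count"]  # 4236 + 9016 = 13252
implied_tokens = round(total_types / meta["type_token_ratio"])

# Hapaxes outnumber the types ranked below 10000, so every type outside the
# top 10000 occurs once and the uncovered token count equals the type count.
uncovered = total_types - 10000
hapax_ok = abs(meta["hapax_count"] / total_types - meta["hapax_ratio"]) < 1e-6
coverage_ok = abs(1 - uncovered / implied_tokens - meta["top_10000"]) < 1e-6
print(total_types, implied_tokens, hapax_ok, coverage_ok)
```

If these inferred definitions hold, the drop in `vocabulary_size` from 4475 to 4236 reflects fewer non-hapax types in the smaller 907-document corpus, while the hapax share stays close to 68%.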
models/word_markov/chr_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8918e88174d5169ca34a3d8546e070f207f78c82b6a54021b21f4749b2bcac59
3
- size 416475
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2e06367eae370c3314f083ee9fc62374591dc05fc32a00400e11de6f6a357de8
3
+ size 388759
models/word_markov/chr_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "chr",
5
- "unique_contexts": 13897,
6
- "total_transitions": 68467
7
  }
 
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "chr",
5
+ "unique_contexts": 13243,
6
+ "total_transitions": 43295
7
  }
models/word_markov/chr_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8dfcf759c0a6ff971c3a7533de300aed0be285590a97c16fc1cd0747fefc4a90
3
- size 763529
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:efac8874c4bac9908d723a0cd4c5b8e19f9b13e1c696f3e7076b4376fd315124
3
+ size 694832
models/word_markov/chr_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "chr",
5
- "unique_contexts": 36338,
6
- "total_transitions": 67407
7
  }
 
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "chr",
5
+ "unique_contexts": 30660,
6
+ "total_transitions": 42388
7
  }
models/word_markov/chr_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f281bedae46cd1bfd33bbc0700ca416910fa96689b0ac5c6f08a701d13123382
3
- size 963702
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:61472a61178571af03b7dbace2a9fde2feec3747df77700588b54d959fe32645
3
+ size 845527
models/word_markov/chr_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "chr",
5
- "unique_contexts": 45090,
6
- "total_transitions": 66349
7
  }
 
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "chr",
5
+ "unique_contexts": 35274,
6
+ "total_transitions": 41481
7
  }
models/word_markov/chr_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2bbf2d1e5afc57aa98fb73deb55260a7990021c69bacc60cfd97aeae6bc20ebf
3
- size 1109183
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6aa8b7bda0357275e7522519313380f562702793a851b30e31d9572a0788269c
3
+ size 962465
models/word_markov/chr_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "chr",
5
- "unique_contexts": 48532,
6
- "total_transitions": 65291
7
  }
 
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "chr",
5
+ "unique_contexts": 36782,
6
+ "total_transitions": 40574
7
  }
models/word_ngram/chr_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9b9df0a5e5cf8f1dd0a206e774645f09871457def4961be9f74cad43d9572d79
3
- size 19190
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d6dcb76c997efc3d0f422d57c9533c80f8b2f241a407922fa917f6630a6cb7a9
3
+ size 12045
models/word_ngram/chr_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "word",
4
  "language": "chr",
5
- "unique_ngrams": 1020,
6
- "total_ngrams": 68467
7
  }
 
2
  "n": 2,
3
  "variant": "word",
4
  "language": "chr",
5
+ "unique_ngrams": 494,
6
+ "total_ngrams": 43295
7
  }
models/word_ngram/chr_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c6229d1d90e9b36c1f6c18272e1dcff42ed4f2c3ed66c2d7a275f82dfa186aab
3
- size 26597
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3600f5f22489e28064419bd04a825cda3951b505b66b70ee4817bc8726929f00
3
+ size 16559
models/word_ngram/chr_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "word",
4
  "language": "chr",
5
- "unique_ngrams": 1275,
6
- "total_ngrams": 67407
7
  }
 
2
  "n": 3,
3
  "variant": "word",
4
  "language": "chr",
5
+ "unique_ngrams": 667,
6
+ "total_ngrams": 42388
7
  }
models/word_ngram/chr_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:cf1e68ff7493da7aea3281593ac2e5ae7a6c6111a57fd51b4dcc3cd0e77c021d
3
- size 56699
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:498ec6c6f2cbba90463444020e6eca072fa8d8029af3c11e05672af984d43fc6
3
+ size 34189
models/word_ngram/chr_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "word",
4
  "language": "chr",
5
- "unique_ngrams": 2691,
6
- "total_ngrams": 66349
7
  }
 
2
  "n": 4,
3
  "variant": "word",
4
  "language": "chr",
5
+ "unique_ngrams": 1308,
6
+ "total_ngrams": 41481
7
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED