omarkamali committed on
Commit
8164949
·
verified ·
1 Parent(s): 351da00

Upload all models and assets for am (20251001)

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +270 -140
  2. models/embeddings/monolingual/am_128d.bin +2 -2
  3. models/embeddings/monolingual/am_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/am_32d.bin +2 -2
  5. models/embeddings/monolingual/am_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/am_64d.bin +2 -2
  7. models/embeddings/monolingual/am_64d_metadata.json +5 -3
  8. models/subword_markov/am_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/am_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/am_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/am_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/am_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/am_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/am_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/am_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/am_2gram_subword.parquet +2 -2
  17. models/subword_ngram/am_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/am_3gram_subword.parquet +2 -2
  19. models/subword_ngram/am_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/am_4gram_subword.parquet +2 -2
  21. models/subword_ngram/am_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/am_tokenizer_16k.model +2 -2
  23. models/tokenizer/am_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/am_tokenizer_32k.model +2 -2
  25. models/tokenizer/am_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/am_tokenizer_64k.model +2 -2
  27. models/tokenizer/am_tokenizer_64k.vocab +0 -0
  28. models/tokenizer/am_tokenizer_8k.model +2 -2
  29. models/tokenizer/am_tokenizer_8k.vocab +0 -0
  30. models/vocabulary/am_vocabulary.parquet +2 -2
  31. models/vocabulary/am_vocabulary_metadata.json +10 -9
  32. models/word_markov/am_markov_ctx1_word.parquet +2 -2
  33. models/word_markov/am_markov_ctx1_word_metadata.json +2 -2
  34. models/word_markov/am_markov_ctx2_word.parquet +2 -2
  35. models/word_markov/am_markov_ctx2_word_metadata.json +2 -2
  36. models/word_markov/am_markov_ctx3_word.parquet +2 -2
  37. models/word_markov/am_markov_ctx3_word_metadata.json +2 -2
  38. models/word_markov/am_markov_ctx4_word.parquet +2 -2
  39. models/word_markov/am_markov_ctx4_word_metadata.json +2 -2
  40. models/word_ngram/am_2gram_word.parquet +2 -2
  41. models/word_ngram/am_2gram_word_metadata.json +2 -2
  42. models/word_ngram/am_3gram_word.parquet +2 -2
  43. models/word_ngram/am_3gram_word_metadata.json +2 -2
  44. models/word_ngram/am_4gram_word.parquet +2 -2
  45. models/word_ngram/am_4gram_word_metadata.json +2 -2
  46. visualizations/embedding_isotropy.png +0 -0
  47. visualizations/embedding_norms.png +0 -0
  48. visualizations/embedding_similarity.png +2 -2
  49. visualizations/markov_branching.png +0 -0
  50. visualizations/markov_contexts.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
- value: 3.278
27
  - name: best_isotropy
28
  type: isotropy
29
- value: 0.9070
30
  - name: vocabulary_size
31
  type: vocab
32
- value: 108024
33
- generated: 2025-12-27
34
  ---
35
 
36
  # AM - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
- - N-gram models (2, 3, 4-gram)
48
- - Markov chains (context of 1, 2, 3 and 4)
49
  - Subword N-gram and Markov chains
50
- - Embeddings in various sizes and dimensions
51
  - Language Vocabulary
52
  - Language Statistics
 
53
  ![Performance Dashboard](visualizations/performance_dashboard.png)
54
 
55
  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
59
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
60
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
61
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
62
- - [6. Summary & Recommendations](#6-summary--recommendations)
 
63
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
64
  - [Visualizations Index](#visualizations-index)
65
 
@@ -68,59 +70,57 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
68
 
69
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
70
 
71
  ### Results
72
 
73
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
74
  |------------|-------------|---------------|----------|--------------|
75
- | **8k** | 2.456x | 2.43 | 0.1639% | 455,103 |
76
- | **16k** | 2.758x | 2.73 | 0.1841% | 405,251 |
77
- | **32k** | 3.035x | 3.00 | 0.2026% | 368,183 |
78
- | **64k** | 3.278x 🏆 | 3.24 | 0.2188% | 340,895 |
79
 
80
  ### Tokenization Examples
81
 
82
  Below are sample sentences tokenized with each vocabulary size:
83
 
84
- **Sample 1:** `' ቻይና ንጉሥበር
85
-
86
- ዋቢ መጽሐፍት
87
-
88
- መደብ:የቻይና ነገሥታት`
89
 
90
  | Vocab | Tokens | Count |
91
  |-------|--------|-------|
92
- | 8k | `▁' ▁የቻይናንጉሥር። መጽሐፍት ▁መደብ : ቻይና ... (+1 more)` | 11 |
93
- | 16k | `▁' ▁ቻይናንጉሥር። መጽሐፍት ▁መደብ : ቻይና ... (+1 more)` | 11 |
94
- | 32k | `▁'ቻይናንጉሥ ▁ነር። መጽሐፍት ▁መደብ : የቻይና ... (+1 more)` | 11 |
95
- | 64k | `▁'ቻይናንጉሥ ▁ነር። ▁መጽሐፍት ▁መደብ : የቻይና ... (+1 more)` | 11 |
96
 
97
- **Sample 2:** `ኢትዮጵያ ጥ የሚ የምግብ አይ ሲሆ የሚ ከሱፍስንዴና አንድ አንድ ጊዜሽምብራ ቆሎ ነው።
98
-
99
- አዘገጃጀት
100
- ሊተ...`
101
 
102
  | Vocab | Tokens | Count |
103
  |-------|--------|-------|
104
- | 8k | `▁ኢትዮጵያ ▁ውስጥየሚየምግብ ▁አይነትሲሆን፣ ▁የሚሰራ ውምከሱ ... (+20 more)` | 30 |
105
- | 16k | `▁ኢትዮጵያ ▁ውስጥየሚየምግብ ▁አይነትሲሆን፣ ▁የሚሰራውምከሱ ስን ... (+16 more)` | 26 |
106
- | 32k | `▁ኢትዮጵያጥ ▁የሚየምግብአይነት ▁ሲሆን፣ ▁የሚሰራውምከሱ ፍ ስንዴ ... (+14 more)` | 24 |
107
- | 64k | `▁ኢትዮጵያጥ ▁የሚየምግብአይነት ▁ሲሆን፣ ▁የሚሰራውም ▁ከሱ ፍ ስ ... (+14 more)` | 24 |
108
 
109
- **Sample 3:** `1 January 1955 - 11 September 1955 እ.ኤ.ኣ. = 1947 አ.ም.
110
- 12 September 1955 - 31 Dec...`
111
 
112
  | Vocab | Tokens | Count |
113
  |-------|--------|-------|
114
- | 8k | `▁ 1january1 9 5 5 - ▁ ... (+59 more)` | 69 |
115
- | 16k | `▁ 1january1 9 5 5- ▁ ... (+59 more)` | 69 |
116
- | 32k | `▁ 1january1 9 5 5- ▁ ... (+59 more)` | 69 |
117
- | 64k | `▁ 1january1 9 5 5- ▁ ... (+59 more)` | 69 |
118
 
119
 
120
  ### Key Findings
121
 
122
- - **Best Compression:** 64k achieves 3.278x compression
123
- - **Lowest UNK Rate:** 8k with 0.1639% unknown tokens
124
  - **Trade-off:** Larger vocabularies improve compression but increase model size
125
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
126
 
@@ -129,55 +129,87 @@ Below are sample sentences tokenized with each vocabulary size:
129
 
130
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
131
 
132
  ![N-gram Coverage](visualizations/ngram_coverage.png)
133
 
134
  ### Results
135
 
136
- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
137
- |--------|------------|---------|----------------|------------------|-------------------|
138
- | **2-gram** | 10,759 🏆 | 13.39 | 48,164 | 21.6% | 40.3% |
139
- | **2-gram** | 2,321 🏆 | 11.18 | 26,048 | 32.4% | 67.6% |
140
- | **3-gram** | 16,194 | 13.98 | 68,935 | 19.7% | 36.8% |
141
- | **3-gram** | 21,529 | 14.39 | 173,382 | 11.5% | 34.1% |
142
- | **4-gram** | 45,509 | 15.47 | 148,975 | 14.6% | 27.1% |
143
- | **4-gram** | 104,202 | 16.67 | 623,179 | 6.7% | 19.0% |
144
 
145
  ### Top 5 N-grams by Size
146
 
147
- **2-grams:**
148
 
149
  | Rank | N-gram | Count |
150
  |------|--------|-------|
151
- | 1 | `ነው ` | 21,098 |
152
- | 2 | `መደብ :` | 19,279 |
153
- | 3 | ` ` | 9,600 |
154
- | 4 | `ነ` | 6,290 |
155
- | 5 | ` .` | 6,166 |
156
 
157
- **3-grams:**
158
 
159
  | Rank | N-gram | Count |
160
  |------|--------|-------|
161
- | 1 | `ምሳሌ ነው ።` | 5,880 |
162
- | 2 | `የአማርኛ ምሳሌ ነው` | 5,832 |
163
- | 3 | ` . ም` | 5,831 |
164
- | 4 | ` መደብ :` | 5,106 |
165
- | 5 | `. ም .` | 4,919 |
166
 
167
- **4-grams:**
168
 
169
  | Rank | N-gram | Count |
170
  |------|--------|-------|
171
- | 1 | `የአማርኛ ምሳሌ ነው ።` | 5,832 |
172
- | 2 | ` . ም .` | 4,753 |
173
- | 3 | ` . ኤ .` | 4,046 |
174
- | 4 | `. . አ` | 3,983 |
175
- | 5 | `ምሳሌ ነው ። ትርጉሙ` | 3,720 |
176
 
177
 
178
  ### Key Findings
179
 
180
- - **Best Perplexity:** 2-gram with 2,321
181
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
182
  - **Coverage:** Top-1000 patterns cover ~19% of corpus
183
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
@@ -187,55 +219,86 @@ Below are sample sentences tokenized with each vocabulary size:
187
 
188
  ![Markov Entropy](visualizations/markov_entropy.png)
189
 
190
  ![Markov Branching](visualizations/markov_branching.png)
191
 
192
  ### Results
193
 
194
- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
195
- |---------|-------------|------------|------------------|-----------------|----------------|
196
- | **1** | 0.6978 | 1.622 | 4.88 | 254,884 | 30.2% |
197
- | **1** | 1.4402 | 2.714 | 22.49 | 2,349 | 0.0% |
198
- | **2** | 0.1891 | 1.140 | 1.44 | 1,243,645 | 81.1% |
199
- | **2** | 1.1315 | 2.191 | 7.49 | 52,808 | 0.0% |
200
- | **3** | 0.0627 | 1.044 | 1.12 | 1,794,848 | 93.7% |
201
- | **3** | 0.6676 | 1.588 | 3.40 | 395,365 | 33.2% |
202
- | **4** | 0.0256 🏆 | 1.018 | 1.04 | 2,006,381 | 97.4% |
203
- | **4** | 0.4515 🏆 | 1.367 | 2.11 | 1,342,049 | 54.9% |
204
 
205
- ### Generated Text Samples
206
 
207
- Below are text samples generated from each Markov chain model:
208
 
209
  **Context Size 1:**
210
 
211
- 1. `። ትምህርት ቤት ነጣጥሎ መገ ይኖርበታል ተብሎ የሚታመነው የቶሪኖ ቀኖና ስተማና ትምህርት ሽግግር መግስት 1643 -`
212
- 2. `፡ ማስረጃ እንደሌለ ይቆጠራል የአማርኛ ምሳሌ ነው መደብ : / wiki / div id ) 2002`
213
- 3. `. ) እና የበለጠ እውነ ሆኖ ለፓውሜራስ ክለብ ሐረር ፣ 1943 ) የአሹር ንጉሥ ታቁ ፒተር`
214
 
215
  **Context Size 2:**
216
 
217
- 1. `ነው ። ትርጉሙ መልስ ሲታጣ ፣ ዝምታ ተፈጥሮው ሆነ ግሩም ድምጻዊ ነው ። በ13ው ክፍለ ዘመን የዪጂንግ`
218
- 2. `መ : ኮሪያ ነገሥታት`
219
- 3. `፡ ፡ አካላዊ ገጽታ እና ከህዝቡ አን አራኛውን ይሸፍናል ። ንም እንኳን መሳሳይ ጥምረት ከብዙ አሥርተ ዓመታት`
220
 
221
  **Context Size 3:**
222
 
223
- 1. `ምሳሌ ነው ። ዘመነ ግርምቢጥ ውሻ ወደ ሰርዶ አህያ ወደ ሊጥ ዘመን እንደንጉሱ አውድ እንደንፋሱ የአማኛ ምሳሌ ነው`
224
- 2. `የአማርኛ ምሳሌ ነው ትርጉሙ መደብ : ያልተረጎመ ምሳሌ መደብ : ተረትና ምሳሌ መደብ : ተረትና ምሳሌ መደብ`
225
- 3. `ዓ . . በኋላ ለሆኑት ዓመታት ግን በሌላ ቀን ላ መሆኑን ይገንዘቡ ። ለእነዚ ዓመቶች ይህ የቀን`
226
 
227
  **Context Size 4:**
228
 
229
- 1. `የአርኛ ምሳሌ ነው ። ሙ መደብ : ያልተተረጎመ ምሳሌ መደብ : ተረና ምሳሌ መደ : ተረትና ምሳሌ መደብ :`
230
- 2. `ዓ . ም . አስቀድሞ ይም ከ2091 ዓ . ም . ተክለጻድ ሪያ መደብ : ተክለጻድቅ መኩርያ መደብ :`
231
- 3. `እ . ኤ . አ . በ 1914 የሩሲያ ንጉሠ ገሥ ኒኮላስ ii ( 1894 - 1917 ) እ`
232
 
233
 
234
  ### Key Findings
235
 
236
- - **Best Predictability:** Context-4 with 97.4% predictability
237
  - **Branching Factor:** Decreases with context size (more deterministic)
238
- - **Memory Trade-off:** Larger contexts require more storage (1,342,049 contexts)
239
  - **Recommendation:** Context-3 or Context-4 for text generation
240
 
241
  ---
@@ -251,64 +314,64 @@ Below are text samples generated from each Markov chain model:
251
 
252
  | Metric | Value |
253
  |--------|-------|
254
- | Vocabulary Size | 108,024 |
255
- | Total Tokens | 1,810,273 |
256
- | Mean Frequency | 16.76 |
257
  | Median Frequency | 3 |
258
- | Frequency Std Dev | 184.67 |
259
 
260
  ### Most Common Words
261
 
262
  | Rank | Word | Frequency |
263
  |------|------|-----------|
264
- | 1 | ነው | 28,114 |
265
- | 2 | እና | 23,158 |
266
- | 3 | መደብ | 19,525 |
267
- | 4 | ላይ | 13,580 |
268
- | 5 | ምሳሌ | 12,239 |
269
- | 6 | ውስጥ | 9,959 |
270
- | 7 | ነበር | 9,632 |
271
- | 8 | ወደ | 9,166 |
272
- | 9 | | 8,691 |
273
- | 10 | | 8,629 |
274
 
275
  ### Least Common Words (from vocabulary)
276
 
277
  | Rank | Word | Frequency |
278
  |------|------|-----------|
279
- | 1 | ጂኒካ | 2 |
280
- | 2 | ዲኒካላ | 2 |
281
- | 3 | ወስደሽ | 2 |
282
- | 4 | አንኳኳ | 2 |
283
- | 5 | መዳልወ | 2 |
284
- | 6 | ረድእ | 2 |
285
- | 7 | አንደኛይቱ | 2 |
286
- | 8 | ወደሰልፍ | 2 |
287
- | 9 | የኒኮፖሊስ | 2 |
288
- | 10 | ጂምናዚየም | 2 |
289
 
290
  ### Zipf's Law Analysis
291
 
292
  | Metric | Value |
293
  |--------|-------|
294
- | Zipf Coefficient | 0.9297 |
295
- | R² (Goodness of Fit) | 0.994674 |
296
  | Adherence Quality | **excellent** |
297
 
298
  ### Coverage Analysis
299
 
300
  | Top N Words | Coverage |
301
  |-------------|----------|
302
- | Top 100 | 22.3% |
303
- | Top 1,000 | 44.9% |
304
- | Top 5,000 | 65.4% |
305
- | Top 10,000 | 74.1% |
306
 
307
  ### Key Findings
308
 
309
- - **Zipf Compliance:** R²=0.9947 indicates excellent adherence to Zipf's law
310
- - **High Frequency Dominance:** Top 100 words cover 22.3% of corpus
311
- - **Long Tail:** 98,024 words needed for remaining 25.9% coverage
312
 
313
  ---
314
  ## 5. Word Embeddings Evaluation
@@ -321,24 +384,88 @@ Below are text samples generated from each Markov chain model:
321
 
322
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
323
 
324
- ### Model Comparison
325
 
326
- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
327
- |-------|------------|-----------|----------|----------|----------|
328
- | **mono_32d** | 40,456 | 32 | 3.565 | 0.969 | 0.8976 |
329
- | **mono_64d** | 40,456 | 64 | 4.280 | 0.895 | 0.9070 🏆 |
330
- | **mono_128d** | 40,456 | 128 | 5.026 | 0.790 | 0.8490 |
331
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
332
 
333
  ### Key Findings
334
 
335
- - **Best Isotropy:** mono_64d with 0.9070 (more uniform distribution)
336
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
337
- - **Vocabulary Coverage:** All models cover 40,456 words
338
- - **Recommendation:** 100d for balanced semantic capture and efficiency
339
 
340
  ---
341
- ## 6. Summary & Recommendations
342
 
343
  ![Performance Dashboard](visualizations/performance_dashboard.png)
344
 
@@ -346,11 +473,12 @@ Below are text samples generated from each Markov chain model:
346
 
347
  | Component | Recommended | Rationale |
348
  |-----------|-------------|-----------|
349
- | Tokenizer | **32k BPE** | Best compression (3.28x) with low UNK rate |
350
- | N-gram | **5-gram** | Lowest perplexity (2,321) |
351
- | Markov | **Context-4** | Highest predictability (97.4%) |
352
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
353
 
 
354
  ---
355
  ## Appendix: Metrics Glossary & Interpretation Guide
356
 
@@ -540,7 +668,8 @@ If you use these models in your research, please cite:
540
  author = {Kamali, Omar},
541
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
542
  year = {2025},
543
- publisher = {HuggingFace},
 
544
  url = {https://huggingface.co/wikilangs},
545
  institution = {Omneity Labs}
546
  }
@@ -556,7 +685,8 @@ MIT License - Free for academic and commercial use.
556
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
557
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
558
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 
559
  ---
560
  *Generated by Wikilangs Models Pipeline*
561
 
562
- *Report Date: 2025-12-27 05:42:07*
 
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
+ value: 3.287
27
  - name: best_isotropy
28
  type: isotropy
29
+ value: 0.9163
30
  - name: vocabulary_size
31
  type: vocab
32
+ value: 0
33
+ generated: 2026-01-03
34
  ---
35
 
36
  # AM - Wikilangs Models
 
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
+ - N-gram models (2, 3, 4, 5-gram)
48
+ - Markov chains (context of 1, 2, 3, 4 and 5)
49
  - Subword N-gram and Markov chains
50
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
51
  - Language Vocabulary
52
  - Language Statistics
53
+
54
  ![Performance Dashboard](visualizations/performance_dashboard.png)
55
 
56
  ### Analysis and Evaluation
 
60
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
61
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
62
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
63
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
64
+ - [7. Summary & Recommendations](#7-summary--recommendations)
65
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
66
  - [Visualizations Index](#visualizations-index)
67
 
 
70
 
71
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
72
 
73
+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
74
+
75
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
76
+
77
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
78
+
79
  ### Results
80
 
81
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
82
  |------------|-------------|---------------|----------|--------------|
83
+ | **8k** | 2.436x | 2.44 | 0.1557% | 683,952 |
84
+ | **16k** | 2.745x | 2.75 | 0.1754% | 607,060 |
85
+ | **32k** | 3.031x | 3.03 | 0.1937% | 549,802 |
86
+ | **64k** | 3.287x 🏆 | 3.29 | 0.2101% | 506,938 |
87
 
88
  ### Tokenization Examples
89
 
90
  Below are sample sentences tokenized with each vocabulary size:
91
 
92
+ **Sample 1:** `አዋሳ ከነማ ስታዲ በአዋሳ፣ ኢትዮጵያ የሚገኝ ስታዲዮም ፳፭ ሺህ ሰዎችን መያዝ ሲችል የአዋሳ ከተማ የእግር ኳስ ክለብ...`
93
 
94
  | Vocab | Tokens | Count |
95
  |-------|--------|-------|
96
+ | 8k | `▁አዋ ከነ ስታዲየም ▁በሳ፣ ዮጵያ ▁የሚገኝ ... (+25 more)` | 35 |
97
+ | 16k | `▁አዋሳከነ ስታዲየም ▁በሳ፣ ዮጵያ ▁የሚገኝ ▁ስታ ... (+22 more)` | 32 |
98
+ | 32k | `▁አዋሳከነማስታዲየም ▁በሳ፣ ዮጵያየሚገኝ ▁ስታ ዲዮ ... (+20 more)` | 30 |
99
+ | 64k | `▁አዋሳከነማስታዲየም ▁በሳ፣ ዮጵያየሚገኝ ▁ስታ ዲዮ ... (+19 more)` | 29 |
100
 
101
+ **Sample 2:** `የዝንጀሮ በውሻ ጩኸት ይበተናል አማርኛ ሳሌው። የዝጀሮ ስብባ በ ጩኸት ይበተ ማርኛሳሌ ነው። ትር...`
102
 
103
  | Vocab | Tokens | Count |
104
  |-------|--------|-------|
105
+ | 8k | `▁የዝ ንጀሮስብበው ይበ ... (+29 more)` | 39 |
106
+ | 16k | `▁የዝ ንጀሮስብበው ኸትይበ ናል ... (+25 more)` | 35 |
107
+ | 32k | `▁የዝንጀሮ ▁ስበው ጩ ኸት ▁ይበ ተናል ▁የአማርኛምሳሌ ... (+21 more)` | 31 |
108
+ | 64k | `▁የዝንጀሮ ▁ስበውሻጩኸት ▁ይበ ተናል ▁የአማርኛ ▁ሳሌነው። ▁የዝጀሮ ... (+17 more)` | 27 |
109
 
110
+ **Sample 3:** `የሐረሪ ብሔራዊ ሊግ የኢትዮጵያ ፖለቲካ ፓርቲ ነው። ዓላማ ሊቀመንበር ታሪክ መደብ: በርጫ የተሳተፉ የኢትዮጵያ ፓርቲዎች መደብ...`
 
111
 
112
  | Vocab | Tokens | Count |
113
  |-------|--------|-------|
114
+ | 8k | `▁የሐ ብሔራዊሊግ ▁የኢትዮጵያ ▁ፖለቲካ ▁ፓርቲነው።ዓላማ ... (+13 more)` | 23 |
115
+ | 16k | `▁የሐ ረሪብሔራዊሊግ ▁የኢትዮጵያ ▁ፖለቲካ ▁ፓርቲ ▁ነው።ዓላማሊቀመንበር ... (+12 more)` | 22 |
116
+ | 32k | `▁የሐረሪ ▁ብሔራዊሊግየኢትዮጵያ ▁ፖለቲካ ▁ፓርቲ ▁ነው። ▁ዓላማሊቀመንበርታሪክ ... (+11 more)` | 21 |
117
+ | 64k | `▁የሐረሪ ▁ብሔራዊሊግየኢትዮጵያ ▁ፖለቲካ ▁ፓርቲ ▁ነው። ▁ዓላማሊቀመንበርታሪክ ... (+11 more)` | 21 |
118
 
119
 
120
  ### Key Findings
121
 
122
+ - **Best Compression:** 64k achieves 3.287x compression
123
+ - **Lowest UNK Rate:** 8k with 0.1557% unknown tokens
124
  - **Trade-off:** Larger vocabularies improve compression but increase model size
125
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
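
This excerpt does not define "Compression". One plausible reading, stated here as an assumption rather than something the report confirms, is characters per token. If that holds, compression multiplied by total tokens should come out roughly constant across vocabulary sizes, because every tokenizer encodes the same evaluation text. A minimal sketch of that consistency check, using only the figures from the Results table:

```python
# Figures copied from the Results table (new revision of the README).
# Assumption: "Compression" means characters per token, so
# compression * total_tokens approximates the length of the evaluation text.
results = {
    "8k": (2.436, 683_952),
    "16k": (2.745, 607_060),
    "32k": (3.031, 549_802),
    "64k": (3.287, 506_938),
}

# Implied character count for each vocabulary size.
implied_chars = {name: ratio * tokens for name, (ratio, tokens) in results.items()}

# Relative gap between the largest and smallest implied character count.
spread = max(implied_chars.values()) / min(implied_chars.values()) - 1.0

for name, chars in implied_chars.items():
    print(f"{name}: ~{chars:,.0f} characters implied")
print(f"relative spread: {spread:.4%}")
```

The four implied character counts agree closely, which is consistent with the characters-per-token reading but does not prove it.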
126
 
 
129
 
130
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
131
 
132
+ ![N-gram Unique](visualizations/ngram_unique.png)
133
+
134
  ![N-gram Coverage](visualizations/ngram_coverage.png)
135
 
136
  ### Results
137
 
138
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
139
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
140
+ | **2-gram** | Word | 8,988 | 13.13 | 27,901 | 19.7% | 39.7% |
141
+ | **2-gram** | Subword | 2,079 🏆 | 11.02 | 23,804 | 34.0% | 69.2% |
142
+ | **3-gram** | Word | 9,944 | 13.28 | 35,714 | 22.1% | 40.5% |
143
+ | **3-gram** | Subword | 19,139 | 14.22 | 153,027 | 11.8% | 35.5% |
144
+ | **4-gram** | Word | 36,744 | 15.17 | 90,792 | 13.9% | 25.8% |
145
+ | **4-gram** | Subword | 94,777 | 16.53 | 549,996 | 6.6% | 19.5% |
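
The Perplexity and Entropy columns appear to be linked by the standard relation perplexity = 2^entropy, with entropy in bits. The excerpt does not state this formula, so treat it as an assumption. The sketch below recomputes perplexity from the rounded entropy values in the table and compares it with the reported figures:

```python
# (entropy, reported perplexity) pairs copied from the Results table.
# Assumption: entropy is measured in bits and perplexity = 2 ** entropy.
rows = {
    ("2-gram", "Word"): (13.13, 8_988),
    ("2-gram", "Subword"): (11.02, 2_079),
    ("3-gram", "Word"): (13.28, 9_944),
    ("3-gram", "Subword"): (14.22, 19_139),
    ("4-gram", "Word"): (15.17, 36_744),
    ("4-gram", "Subword"): (16.53, 94_777),
}


def perplexity_from_entropy(entropy_bits: float) -> float:
    """Convert an entropy in bits to a perplexity."""
    return 2.0 ** entropy_bits


# Relative difference between recomputed and reported perplexity.
relative_errors = {
    key: abs(perplexity_from_entropy(entropy) - reported) / reported
    for key, (entropy, reported) in rows.items()
}

for (order, variant), error in relative_errors.items():
    print(f"{order} ({variant}): relative difference {error:.2%}")
```

The small residual differences are what one would expect from entropy being rounded to two decimal places.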
146
 
147
  ### Top 5 N-grams by Size
148
 
149
+ **2-grams (Word):**
150
+
151
+ | Rank | N-gram | Count |
152
+ |------|--------|-------|
153
+ | 1 | `ዓ ም` | 8,324 |
154
+ | 2 | `ምሳሌ ነው` | 5,625 |
155
+ | 3 | `የአማርኛ ምሳሌ` | 5,563 |
156
+ | 4 | `እ ኤ` | 4,026 |
157
+ | 5 | `ኤ አ` | 3,961 |
158
+
159
+ **3-grams (Word):**
160
+
161
+ | Rank | N-gram | Count |
162
+ |------|--------|-------|
163
+ | 1 | `የአማርኛ ምሳሌ ነው` | 5,563 |
164
+ | 2 | `እ ኤ አ` | 3,908 |
165
+ | 3 | `ምሳሌ ነው ትርጉሙ` | 3,454 |
166
+ | 4 | `መደብ ተረትና ምሳሌ` | 3,051 |
167
+ | 5 | `ነው ትርጉሙ መደብ` | 2,533 |
168
+
169
+ **4-grams (Word):**
170
 
171
  | Rank | N-gram | Count |
172
  |------|--------|-------|
173
+ | 1 | `የአማርኛ ምሳሌ ነው ትርጉሙ` | 3,452 |
174
+ | 2 | `ምሳሌ ነው ትርጉሙ መደብ` | 2,533 |
175
+ | 3 | `ትርጉሙ መደብ ያልተተረጎመ ምሳሌ` | 2,118 |
176
+ | 4 | `ነው ትጉሙ መደብ ያልተተረጎመ` | 2,114 |
177
+ | 5 | `ምሳሌ መደብ ተረትና ምሳሌ` | 1,854 |
178
 
179
+ **2-grams (Subword):**
180
 
181
  | Rank | N-gram | Count |
182
  |------|--------|-------|
183
+ | 1 | `_ ` | 170,716 |
184
+ | 2 | ` _` | 145,051 |
185
+ | 3 | `_ ` | 140,839 |
186
+ | 4 | ` _` | 132,909 |
187
+ | 5 | `_ ` | 113,769 |
188
 
189
+ **3-grams (Subword):**
190
 
191
  | Rank | N-gram | Count |
192
  |------|--------|-------|
193
+ | 1 | `_ ` | 32,319 |
194
+ | 2 | `_ ` | 26,511 |
195
+ | 3 | ` _` | 24,155 |
196
+ | 4 | `_ ` | 23,843 |
197
+ | 5 | ` _` | 22,397 |
198
+
199
+ **4-grams (Subword):**
200
+
201
+ | Rank | N-gram | Count |
202
+ |------|--------|-------|
203
+ | 1 | `_ እ ና _` | 22,267 |
204
+ | 2 | `_ ነ ው ።` | 19,378 |
205
+ | 3 | `ነ ው ። _` | 18,922 |
206
+ | 4 | `_ እ ን ደ` | 13,836 |
207
+ | 5 | `_ ላ ይ _` | 12,924 |
208
 
209
 
210
  ### Key Findings
211
 
212
+ - **Best Perplexity:** 2-gram (subword) with 2,079
213
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
214
  - **Coverage:** Top-1000 patterns cover ~19% of corpus
215
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
 
219
 
220
  ![Markov Entropy](visualizations/markov_entropy.png)
221
 
222
+ ![Markov Contexts](visualizations/markov_contexts.png)
223
+
224
  ![Markov Branching](visualizations/markov_branching.png)
225
 
226
  ### Results
227
 
228
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
229
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
230
+ | **1** | Word | 0.7502 | 1.682 | 4.80 | 236,353 | 25.0% |
231
+ | **1** | Subword | 1.2235 | 2.335 | 17.52 | 2,854 | 0.0% |
232
+ | **2** | Word | 0.1468 | 1.107 | 1.28 | 1,130,961 | 85.3% |
233
+ | **2** | Subword | 1.0397 | 2.056 | 6.98 | 49,981 | 0.0% |
234
+ | **3** | Word | 0.0355 | 1.025 | 1.06 | 1,446,616 | 96.4% |
235
+ | **3** | Subword | 0.6354 | 1.553 | 3.36 | 348,535 | 36.5% |
236
+ | **4** | Word | 0.0159 🏆 | 1.011 | 1.02 | 1,520,994 | 98.4% |
237
+ | **4** | Subword | 0.4515 | 1.367 | 2.14 | 1,171,344 | 54.9% |
238
+
239
+ ### Generated Text Samples (Word-based)
240
+
241
+ Below are text samples generated from each word-based Markov chain model:
242
+
243
+ **Context Size 1:**
244
+
245
+ 1. `ነው በጥሩ አስተዳደር የህዝብ ልውውጥ ኮሚሽን ሕንፃ በሥነ ሕንጻ ተጠናቆ ክፍት አረግንጓዴ ከጂዮርጂያ በእ አ የሆነ`
246
+ 2. `እና ሲሞት ቤተሰቦቹ ጋር ቢኮብለል ወይም ቀበሮ ዝርያ ያለበት ቦታ 4 5 14 847 ቅጥር የተለጠፈ`
247
+ 3. `ላይ መስኮት እና ከደቡብ ህንድ ቀጥሎ ዓ ም ቀዳማዊ ኃይለ ሥላሴ ለአፍሪካ ገሞጂማ ሣርማ ቦታዎች ይቀመጡ`
248
+
249
+ **Context Size 2:**
250
+
251
+ 1. `ዓ ም አስቀድሞ ወይም ዓ ም የአርጡስ ወንድም 4 ካርል 663 669 ዓ ም ሁላቸው ሲስማሙ ወደ`
252
+ 2. `ምሳሌ ነው ጀርባዬን እከክልኝ ለኔ ራቀኝ የአማርኛ ምሳሌ ነው ዝርክርክ ከወንፊት የባሰ ዝክዝክ የአማርኛ ምሳሌ ነው ትርጉሙ`
253
+ 3. `የአማርኛ ምሳሌ ነው የምትጠላው ሰው ፈሱ እሆዱ ውስጥ ሳለ ኔሽን ኦፍ ኢስላም ጋር ያለው ዝምድና ግልጽ ነው`
254
+
255
+ **Context Size 3:**
256
+
257
+ 1. `የአማርኛ ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ ቁና ሰፋች`
258
+ 2. `እ ኤ አ በ በሂትለር ተጽዕኖ ሙሶሎኒ በጣሊያን ፀረ ሴማዊ የዘር ህጎች እንዲፀድቁ ደገፈ በመጋቢት ጀርመን ቼኮዝሎቫኪያን ከቀላቀለች`
259
+ 3. `ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ ምግባር ሳይኖር ስም እንደማለት ነዉ`
260
 
261
+ **Context Size 4:**
262
+
263
+ 1. `የአማርኛ ምሳሌ ነው ትርጉሙ ሁለቱም አያዋጡም መደብ ተረትና ምሳሌ`
264
+ 2. `ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ መደብ ፈሊጣዊ አነጋገር መደብ ተረትና ምሳሌ ቁና ሰፋች`
265
+ 3. `ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ ሴት ሁሉን ቻይ ናት`
266
+
267
+
268
+ ### Generated Text Samples (Subword-based)
269
 
270
+ Below are text samples generated from each subword-based Markov chain model:
271
 
272
  **Context Size 1:**
273
 
274
+ 1. `_አ፣_ለማር_(“ሀ`
275
+ 2. `ን_(dicole_ገቡድ_ሞ`
276
+ 3. `ት_ቅ_ሓምበላስድ_ጋ_ይለት`
277
 
278
  **Context Size 2:**
279
 
280
+ 1. `_ኢትዮጵያ_አፖሎ_,00_`
281
+ 2. `ት_ተለሰን_ኣሉ።_ባህር`
282
+ 3. `_ዘመዴ_ሲፀቅ_በተ`
283
 
284
  **Context Size 3:**
285
 
286
+ 1. `_እንደ_ማርኮ_ከተማ_እና_ግሮ`
287
+ 2. `_ነው።_ከማው_ባልሞራል_እን`
288
+ 3. `ው።_ኮፕዩተራዝ_ካሊፎርኒያ`
289
 
290
  **Context Size 4:**
291
 
292
+ 1. `_እና_ከማቸ__ቁጭ_`
293
+ 2. `_ነው።_ዋጋው_ወቅት_የጀመሪያ`
294
+ 3. `ነው።_ርጉሙ_አንቴና_ይፈሳሉ፡`
295
 
296
 
297
  ### Key Findings
298
 
299
+ - **Best Predictability:** Context-4 (word) with 98.4% predictability
300
  - **Branching Factor:** Decreases with context size (more deterministic)
301
+ - **Memory Trade-off:** Larger contexts require more storage (1,171,344 contexts)
302
  - **Recommendation:** Context-3 or Context-4 for text generation
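
To make the context-size trade-off concrete, here is a toy word-level Markov chain. It is an illustrative sketch, not the Wikilangs pipeline: the function names are hypothetical, the corpus is a made-up English sentence, and "branching factor" is assumed to mean the average number of distinct successors per context, which the report does not define in this excerpt.

```python
import random
from collections import defaultdict


def build_chain(tokens, context_size):
    """Map each context (a tuple of `context_size` tokens) to its observed next tokens."""
    chain = defaultdict(list)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        chain[context].append(tokens[i + context_size])
    return dict(chain)


def branching_factor(chain):
    """Average number of distinct successors per context (assumed definition)."""
    return sum(len(set(successors)) for successors in chain.values()) / len(chain)


def generate(chain, start, max_tokens, seed=0):
    """Extend `start` by sampling successors until no continuation exists."""
    rng = random.Random(seed)
    output = list(start)
    context_size = len(start)
    while len(output) < max_tokens:
        successors = chain.get(tuple(output[-context_size:]))
        if not successors:
            break
        output.append(rng.choice(successors))
    return output


toy_corpus = "the cat sat on the mat the cat ate the rat".split()
chain1 = build_chain(toy_corpus, 1)
chain2 = build_chain(toy_corpus, 2)

print(branching_factor(chain1))  # 1.5
print(branching_factor(chain2))  # 1.125
print(" ".join(generate(chain2, ("the", "cat"), max_tokens=8)))
```

Even on this tiny corpus the longer context has more unique contexts (8 versus 6) and a lower branching factor, mirroring the direction of the trend in the Results table.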
303
 
304
  ---
 
314
 
315
  | Metric | Value |
316
  |--------|-------|
317
+ | Vocabulary Size | 99,716 |
318
+ | Total Tokens | 1,636,892 |
319
+ | Mean Frequency | 16.42 |
320
  | Median Frequency | 3 |
321
+ | Frequency Std Dev | 174.41 |
322
 
323
  ### Most Common Words
324
 
325
  | Rank | Word | Frequency |
326
  |------|------|-----------|
327
+ | 1 | ነው | 26,460 |
328
+ | 2 | እና | 22,392 |
329
+ | 3 | ላይ | 13,250 |
330
+ | 4 | ምሳሌ | 11,607 |
331
+ | 5 | ውስጥ | 9,622 |
332
+ | 6 | ነበር | 9,005 |
333
+ | 7 | | 8,679 |
334
+ | 8 | | 8,584 |
335
+ | 9 | ወደ | 8,446 |
336
+ | 10 | እንደ | 6,776 |
337
 
338
  ### Least Common Words (from vocabulary)
339
 
340
  | Rank | Word | Frequency |
341
  |------|------|-----------|
342
+ | 1 | ቫሊን | 2 |
343
+ | 2 | ግሎቡ | 2 |
344
+ | 3 | ኢንዛይሞች | 2 |
345
+ | 4 | የማከማቻ | 2 |
346
+ | 5 | ለph | 2 |
347
+ | 6 | ግብመልሶችን | 2 |
348
+ | 7 | behi | 2 |
349
+ | 8 | ቤሂ | 2 |
350
+ | 9 | goli | 2 |
351
+ | 10 | ክሩድስ | 2 |
352
 
353
  ### Zipf's Law Analysis
354
 
355
  | Metric | Value |
356
  |--------|-------|
357
+ | Zipf Coefficient | 0.9367 |
358
+ | R² (Goodness of Fit) | 0.995214 |
359
  | Adherence Quality | **excellent** |
360
 
361
  ### Coverage Analysis
362
 
363
  | Top N Words | Coverage |
364
  |-------------|----------|
365
+ | Top 100 | 22.7% |
366
+ | Top 1,000 | 45.8% |
367
+ | Top 5,000 | 66.2% |
368
+ | Top 10,000 | 74.9% |
369
 
370
  ### Key Findings
371
 
372
+ - **Zipf Compliance:** R²=0.9952 indicates excellent adherence to Zipf's law
373
+ - **High Frequency Dominance:** Top 100 words cover 22.7% of corpus
374
+ - **Long Tail:** 89,716 words needed for remaining 25.1% coverage
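
The summary figures in this section can be re-derived from the tables above. The sketch below reproduces the mean frequency and the long-tail numbers from the reported totals, and adds a small hypothetical helper, `top_n_coverage`, showing how a coverage percentage could be computed from raw word frequencies (the toy frequencies are invented for illustration):

```python
# Figures copied from the Vocabulary Analysis tables (new revision of the README).
vocabulary_size = 99_716
total_tokens = 1_636_892
top_10k_coverage_pct = 74.9

# Mean frequency is total occurrences divided by distinct words.
mean_frequency = total_tokens / vocabulary_size

# Long tail: every word outside the top 10,000, and the coverage left for them.
tail_words = vocabulary_size - 10_000
tail_coverage_pct = 100.0 - top_10k_coverage_pct


def top_n_coverage(frequencies, n):
    """Share of all token occurrences covered by the n most frequent words."""
    ranked = sorted(frequencies, reverse=True)
    return sum(ranked[:n]) / sum(ranked)


print(round(mean_frequency, 2))      # 16.42
print(tail_words)                    # 89716
print(round(tail_coverage_pct, 1))   # 25.1
print(top_n_coverage([50, 30, 10, 5, 3, 2], 2))  # 0.8 on toy frequencies
```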
375
 
376
  ---
377
  ## 5. Word Embeddings Evaluation
 
384
 
385
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
386
 
 
387
 
388
+ ### 5.1 Cross-Lingual Alignment
389
+
390
+ > *Note: Multilingual alignment visualization not available for this language.*
391
+
392
+
393
+ ### 5.2 Model Comparison
394
+
395
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
396
+ |-------|-----------|----------|------------------|---------------|----------------|
397
+ | **mono_32d** | 32 | 0.9125 | 0.3250 | N/A | N/A |
398
+ | **mono_64d** | 64 | 0.9163 🏆 | 0.2292 | N/A | N/A |
399
+ | **mono_128d** | 128 | 0.8535 | 0.1745 | N/A | N/A |
400
 
401
  ### Key Findings
402
 
403
+ - **Best Isotropy:** mono_64d with 0.9163 (more uniform distribution)
404
+ - **Semantic Density:** Average pairwise similarity of 0.2429. Lower values indicate better semantic separation.
405
+ - **Alignment Quality:** No aligned models evaluated in this run.
406
+ - **Recommendation:** 128d aligned for best cross-lingual performance
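
The 0.2429 figure quoted for Semantic Density matches the plain mean of the three per-model values in the Model Comparison table. Reading it as that mean is an inference from the numbers, not something the report states. A short check:

```python
# Values copied from the Model Comparison table (new revision of the README).
models = {
    "mono_32d": {"isotropy": 0.9125, "semantic_density": 0.3250},
    "mono_64d": {"isotropy": 0.9163, "semantic_density": 0.2292},
    "mono_128d": {"isotropy": 0.8535, "semantic_density": 0.1745},
}

# Model with the highest isotropy score.
best_isotropy = max(models, key=lambda name: models[name]["isotropy"])

# Assumption: the reported average is an unweighted mean over the three models.
mean_density = sum(m["semantic_density"] for m in models.values()) / len(models)

print(best_isotropy)
print(round(mean_density, 4))
```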
407
+
408
+ ---
409
+ ## 6. Morphological Analysis (Experimental)
410
+
411
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
412
+
413
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
414
+
415
+ ### 6.1 Productivity & Complexity
416
+
417
+ | Metric | Value | Interpretation | Recommendation |
418
+ |--------|-------|----------------|----------------|
419
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
420
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
421
+
422
+ ### 6.2 Affix Inventory (Productive Units)
423
+
424
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
425
+
426
+ *No productive affixes detected.*
427
+
428
+
429
+ ### 6.3 Bound Stems (Lexical Roots)
430
+
431
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
432
+
433
+ | Stem | Cohesion | Substitutability | Examples |
434
+ |------|----------|------------------|----------|
435
+ | `እንደሚ` | 2.46x | 153 contexts | እንደሚሹ, እንደሚል, እንደሚሉ |
436
+ | `ርስቲያ` | 2.48x | 60 contexts | ክርስቲያ, ክርስቲያኗ, ክርስቲያኖ |
437
+ | `ትዮጵያ` | 2.23x | 57 contexts | እትዮጵያ, ኢትዮጵያ, ኢትዮጵያው |
438
+ | `ግዚአብ` | 2.73x | 24 contexts | እግዚአብሔር, እግዚአብሐር, እግዚአብሄር |
439
+ | `ኢትዮጵ` | 2.24x | 46 contexts | ኢትዮጵያ, ኢትዮጵያው, ኢትዮጵስት |
440
+ | `መንግሥ` | 2.21x | 46 contexts | መንግሥተ, መንግሥት, መንግሥቱ |
441
+ | `መንግስ` | 2.16x | 48 contexts | መንግስት, መንግስተ, መንግስቱ |
442
+ | `ፈረንሳ` | 2.33x | 34 contexts | ፈረንሳዊ, ፈረንሳይ, በፈረንሳዩ |
443
+ | `አስተዳ` | 2.33x | 33 contexts | አስተዳዳሪ, አስተዳደጓ, አስተዳደረ |
444
+ | `እንግሊ` | 2.05x | 53 contexts | እንግሊዝ, እንግሊዙ, እንግሊኛ |
445
+ | `tion` | 2.82x | 17 contexts | nation, action, section |
446
+ | `ጀመሪያ` | 2.28x | 33 contexts | መጀመሪያ, ለመጀመሪያ, መጀመሪያው |
447
+
448
+ ### 6.4 Affix Compatibility (Co-occurrence)
449
+
450
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
451
+
452
+ *No significant affix co-occurrences detected.*
453
+
454
+
455
+ ### 6.5 Recursive Morpheme Segmentation
456
+
457
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
458
+
459
+ *Insufficient data for recursive segmentation.*
460
+
461
+
462
+ ### 6.6 Linguistic Interpretation
463
+
464
+ > **Automated Insight:**
465
+ The language AM appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
466
 
467
  ---
468
+ ## 7. Summary & Recommendations
469
 
470
  ![Performance Dashboard](visualizations/performance_dashboard.png)
471
 
 
473
 
474
  | Component | Recommended | Rationale |
475
  |-----------|-------------|-----------|
476
+ | Tokenizer | **64k BPE** | Best compression (3.29x) |
477
+ | N-gram | **2-gram** | Lowest perplexity (2,079) |
478
+ | Markov | **Context-4** | Highest predictability (98.4%) |
479
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
480
 
481
+
482
  ---
483
  ## Appendix: Metrics Glossary & Interpretation Guide
484
 
 
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
+ doi = {10.5281/zenodo.18073153},
+ publisher = {Zenodo},
  url = {https://huggingface.co/wikilangs}
  institution = {Omneity Labs}
  }
 
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
  ---
  *Generated by Wikilangs Models Pipeline*

+ *Report Date: 2026-01-03 05:13:17*
models/embeddings/monolingual/am_128d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:0efb3357d959a6e2e5481a2cb05a3adbd88201db376b5adc924dff9b57f0ef0e
- size 1066335087
+ oid sha256:1d4be827f7a31270ecd980640a7032e11c5ed3885d037c1b8cd46df3d9007492
+ size 1063990202
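
Every binary in this commit is stored as a Git LFS pointer with three fields: `version`, `oid` and `size`. The following is a minimal sketch of how a downloaded file could be checked against such a pointer. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and are not part of this repository or of the Git LFS tooling; only the pointer text itself comes from the diff above.

```python
import hashlib

def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid field has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }

def matches_pointer(path: str, pointer: dict) -> bool:
    """Return True if the file at `path` has the size and digest named in the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so that large model files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]

# New pointer for models/embeddings/monolingual/am_128d.bin, copied from the diff.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:1d4be827f7a31270ecd980640a7032e11c5ed3885d037c1b8cd46df3d9007492
size 1063990202
"""
pointer = parse_lfs_pointer(POINTER)
```

Calling `matches_pointer("am_128d.bin", pointer)` on a local copy would then report whether the download is the revision introduced by this commit.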
models/embeddings/monolingual/am_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 128,
  "version": "monolingual",
  "training_params": {
- "dim": 128,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 128
  },
- "vocab_size": 40456
+ "vocab_size": 38213
  }
models/embeddings/monolingual/am_32d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:25b527389da984b0a46121bb9f197602821de2917c7a5256185de9bc6db2e0bf
- size 267264879
+ oid sha256:1facd0cecd15328df01b8eb00c409adeaac9321a5d62748462e3055c6a4b976e
+ size 266642618
models/embeddings/monolingual/am_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 32,
  "version": "monolingual",
  "training_params": {
- "dim": 32,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 32
  },
- "vocab_size": 40456
+ "vocab_size": 38213
  }
models/embeddings/monolingual/am_64d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a2d6a5af7f3b896dee81096e1e86293fa2f34063054412b3ceda74be6e0c1ab5
- size 533621615
+ oid sha256:f42620197f1e1b51b39471cc1e5886ab38c161775048e9529ceb06fdf065e6e1
+ size 532425146
models/embeddings/monolingual/am_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 64,
  "version": "monolingual",
  "training_params": {
- "dim": 64,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 64
  },
- "vocab_size": 40456
+ "vocab_size": 38213
  }
models/subword_markov/am_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c2cb11e67c389141a2fadeda43f425f0f1bf4de68d4b119cf4340c1a6b943bc0
- size 365675
+ oid sha256:0dea8b9be146dc2d3f0da26601ce42020e0c84af4656d9f7692b3f4d34d5d6ae
+ size 356106
models/subword_markov/am_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "subword",
  "language": "am",
- "unique_contexts": 2349,
- "total_transitions": 10028204
+ "unique_contexts": 2854,
+ "total_transitions": 8936316
  }
models/subword_markov/am_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6fe5f25cdfc3e26bfa52e7017c7ccecd73d51852c672946b0a471baf4c253fc7
- size 2376324
+ oid sha256:0e9b01d5674bb78549bd0f7a41241639ddc7934db689c638aee8d0daacc70172
+ size 2154891
models/subword_markov/am_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "subword",
  "language": "am",
- "unique_contexts": 52808,
- "total_transitions": 10014026
+ "unique_contexts": 49981,
+ "total_transitions": 8923902
  }
models/subword_markov/am_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:0d8b1997b67fd57bb5cef7ecfef1a9045802e2e6af44a0c7b5fe11aa4286d3f4
- size 9540494
+ oid sha256:36e05d4c86f22f4d3f6c7dff80e4b868513b425317942829e61a871110f71ee0
+ size 8408029
models/subword_markov/am_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "subword",
  "language": "am",
- "unique_contexts": 395365,
- "total_transitions": 9999848
+ "unique_contexts": 348535,
+ "total_transitions": 8911488
  }
models/subword_markov/am_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:41248045001571748a27c9b0e03b64cd156b22e6a9687173d0b8c4cbdcd2460d
- size 26523692
+ oid sha256:ab3d345339fb5097d316a6acad43d3cbd7f781c24bd690b17950a9d30bdd7dfa
+ size 23278804
models/subword_markov/am_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "subword",
  "language": "am",
- "unique_contexts": 1342049,
- "total_transitions": 9985670
+ "unique_contexts": 1171344,
+ "total_transitions": 8899074
  }
models/subword_ngram/am_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:85dd052f630c86a220366e62124925683c30d80262595a95d2da8e61e4dc4b1d
- size 326227
+ oid sha256:6bcf32e20adcf3b9aa77e784d9fd125484c7c7d01f306652cf74a76522fbc536
+ size 300324
models/subword_ngram/am_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "subword",
  "language": "am",
- "unique_ngrams": 26048,
- "total_ngrams": 10028204
+ "unique_ngrams": 23804,
+ "total_ngrams": 8936316
  }
models/subword_ngram/am_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ff695ae4e7eab84ead000ce11ad4fc0c65484a5a1fd01a30d7ec14360faf19cc
- size 2119089
+ oid sha256:9c8cece5e05ace848d6a685201f3e49e33ed838dc7744aec46795f09bbfdb9e0
+ size 1881540
models/subword_ngram/am_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 3,
  "variant": "subword",
  "language": "am",
- "unique_ngrams": 173382,
- "total_ngrams": 10014026
+ "unique_ngrams": 153027,
+ "total_ngrams": 8923902
  }
models/subword_ngram/am_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:7d3f77f95c04e37e46abbe5dffc45c0c1e05d58dc384a30081cad87c6523fd88
- size 8027416
+ oid sha256:f6cf11a9b9599d5b462fc5775302067d643de8e8fd248de5060f89351b18b9ab
+ size 7082960
models/subword_ngram/am_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 4,
  "variant": "subword",
  "language": "am",
- "unique_ngrams": 623179,
- "total_ngrams": 9999848
+ "unique_ngrams": 549996,
+ "total_ngrams": 8911488
  }
models/tokenizer/am_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:cedf1c15aa7971e234b55d99070a4df149f61a8f6fd1250b6e3f1e2cef58ba64
- size 557859
+ oid sha256:f02cb4592939e59c4831f57ca855266e8d28172e5efebde508c8b57daefbafb6
+ size 559482
models/tokenizer/am_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/am_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:7093df805cbfa8e5f44b174f96bd89b7c4e418862fcb6b2806f03fb78060610d
- size 899806
+ oid sha256:93bd53271f8c4e6a918072a97dd0581c020a8af645616093bc8213a30b88940e
+ size 902409
models/tokenizer/am_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/am_tokenizer_64k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f9b1db1dd69389fa6546da7500a06e5d0fce2ab1f61a918dedf24fef3f7bf3a9
- size 1619242
+ oid sha256:b8481661d7e8d938bace980d6a5c315614597c92fb98ccbef70df78797efdecc
+ size 1589488
models/tokenizer/am_tokenizer_64k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/am_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c8c82fbbd74ca9cc772a93c5ca966ce88ce53a71386e90e51549f764cec258d2
- size 393601
+ oid sha256:4ad931bdd41158b2542d8aac191257d87f612a4ef884c47473efc0d7a86dd011
+ size 394754
models/tokenizer/am_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/am_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1c9e63246ff6b65932ad2b6cd7bb7743b7d9505ea2e1ac9ca771e738d442b5f8
- size 1928354
+ oid sha256:294dfe907fb7e7f0defd53a349bd532546806ddc295a92c292188997c0d498bd
+ size 1787217
models/vocabulary/am_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
  {
  "language": "am",
- "vocabulary_size": 108024,
+ "vocabulary_size": 99716,
+ "variant": "full",
  "statistics": {
- "type_token_ratio": 0.1301235738821781,
+ "type_token_ratio": 0.13335844069738334,
  "coverage": {
- "top_100": 0.20614026570786442,
- "top_1000": 0.4150476829002814,
- "top_5000": 0.6047674724025824,
- "top_10000": 0.685510039930788
+ "top_100": 0.20952339607919193,
+ "top_1000": 0.42309028051841446,
+ "top_5000": 0.6109410976729082,
+ "top_10000": 0.6909696930060957
  },
- "hapax_count": 146613,
- "hapax_ratio": 0.5757725703648724,
- "total_documents": 14178
+ "hapax_count": 136824,
+ "hapax_ratio": 0.5784391646233196,
+ "total_documents": 12414
  }
  }
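
The metadata does not say how `hapax_ratio` is derived. The numbers in both revisions appear consistent with `hapax_count / (vocabulary_size + hapax_count)`, which would mean that `vocabulary_size` excludes words seen only once. This is an inference from the figures in the diff, not a documented definition, and the function name below is hypothetical. A short check:

```python
def hapax_ratio(vocabulary_size: int, hapax_count: int) -> float:
    """Share of word types that occur exactly once.

    Assumes (not confirmed by the source) that vocabulary_size excludes hapaxes,
    so the total number of types is vocabulary_size + hapax_count.
    """
    return hapax_count / (vocabulary_size + hapax_count)

# Values copied from the old and new am_vocabulary_metadata.json.
old_ratio = hapax_ratio(vocabulary_size=108024, hapax_count=146613)
new_ratio = hapax_ratio(vocabulary_size=99716, hapax_count=136824)

print(f"old: {old_ratio:.6f}  (reported 0.5757725703648724)")
print(f"new: {new_ratio:.6f}  (reported 0.5784391646233196)")
```

If the assumption holds, the absolute vocabulary shrank in this revision while the proportion of single-occurrence types stayed almost unchanged at roughly 58%.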
models/word_markov/am_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:511a07287de3c7cf84b1e388ff1c84f89b6c4179a305abbcd9a9e9daa1347eda
- size 14354786
+ oid sha256:61264cf69517fe843352a0b6cc8383107c7cd20cb9613edb6a901779636cd1bd
+ size 13693182
models/word_markov/am_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "word",
  "language": "am",
- "unique_contexts": 254884,
- "total_transitions": 2461052
+ "unique_contexts": 236353,
+ "total_transitions": 1761302
  }
models/word_markov/am_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6a9fc5e0cbc0c30fa8a9ce98c35ab2c6fe838a6f56a9bb619aab4e816d85e97d
- size 32957872
+ oid sha256:c565bae97b8cf56a6c0d532f8f790eb6482096f8f6f9ee9069aec8edcf717f1d
+ size 29639293
models/word_markov/am_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "word",
  "language": "am",
- "unique_contexts": 1243645,
- "total_transitions": 2446875
+ "unique_contexts": 1130961,
+ "total_transitions": 1748889
  }
models/word_markov/am_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8612b4ff0daced324460640533c0da0a21f0036b72c30ba71fc85ca672380734
- size 44191826
+ oid sha256:a56abcadfce2f027d0d47be74e244c8bad54627001acca61fa2ccbdb09097197
+ size 37383956
models/word_markov/am_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "word",
  "language": "am",
- "unique_contexts": 1794848,
- "total_transitions": 2432701
+ "unique_contexts": 1446616,
+ "total_transitions": 1736476
  }
models/word_markov/am_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:7d3b579b0a2858994d2ed6dc10d89e51f2e2d480167145c883ec696b10015a1a
- size 51775915
+ oid sha256:ec1a431379039964d30885d1e5d82fe1a0b2e28f13cf9155ebc7fca6bdcb9ad8
+ size 42858855
models/word_markov/am_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "word",
  "language": "am",
- "unique_contexts": 2006381,
- "total_transitions": 2418531
+ "unique_contexts": 1520994,
+ "total_transitions": 1724063
  }
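
In the new metadata, `total_transitions` falls by a constant amount each time the context grows by one: 12,414 for the subword models, which equals `total_documents` in the vocabulary metadata, and 12,413 for the word models. One plausible reading, offered as an assumption and not something the source states, is that transitions are counted per document without crossing document boundaries, so a document of n tokens contributes n - k transitions at context size k. The sketch below only checks the arithmetic on the figures from this diff; `step_drops` is a hypothetical helper.

```python
# total_transitions per context size, copied from the new metadata files in this diff.
SUBWORD_TRANSITIONS = {1: 8936316, 2: 8923902, 3: 8911488, 4: 8899074}
WORD_TRANSITIONS = {1: 1761302, 2: 1748889, 3: 1736476, 4: 1724063}
TOTAL_DOCUMENTS = 12414  # from am_vocabulary_metadata.json (new revision)

def step_drops(transitions: dict) -> list:
    """Difference in total_transitions between consecutive context sizes."""
    sizes = sorted(transitions)
    return [transitions[a] - transitions[b] for a, b in zip(sizes, sizes[1:])]

subword_drops = step_drops(SUBWORD_TRANSITIONS)
word_drops = step_drops(WORD_TRANSITIONS)

print("subword drops:", subword_drops)
print("word drops:   ", word_drops)
print("documents:    ", TOTAL_DOCUMENTS)
```

Under the per-document reading, the word-level drop being one short of the document count would be explained by a single document that contains only one word token, but the diff alone cannot confirm this.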
models/word_ngram/am_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9b07feb909a23389d029c0062399e6e8ed00f7ccd83570e865a1da8c2eed6084
- size 832555
+ oid sha256:883d58efb5ccc7881009333e3d02b3fddd7520f5bf828d29b36423f3fd1b1647
+ size 543095
models/word_ngram/am_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "word",
  "language": "am",
- "unique_ngrams": 48164,
- "total_ngrams": 2461052
+ "unique_ngrams": 27901,
+ "total_ngrams": 1761302
  }
models/word_ngram/am_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8aae59e56d73699ee378c8da34b637013e6a0a71d3aed969e9c5a9774e23a880
- size 1311880
+ oid sha256:99b4534a7f6dc004d58b92078a7fffcda803fea6ba859a4e7db87b4fcb8dc3f9
+ size 757738
models/word_ngram/am_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 3,
  "variant": "word",
  "language": "am",
- "unique_ngrams": 68935,
- "total_ngrams": 2446875
+ "unique_ngrams": 35714,
+ "total_ngrams": 1748889
  }
models/word_ngram/am_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f623c1b0bd05d77bc71c826be908d3103d05a620611c25416ea052ee25389b0b
- size 3080431
+ oid sha256:6129a2517d2fe4d5b4e31b4535a6231670013b872e0921b742709f16592c413d
+ size 2114443
models/word_ngram/am_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 4,
  "variant": "word",
  "language": "am",
- "unique_ngrams": 148975,
- "total_ngrams": 2432701
+ "unique_ngrams": 90792,
+ "total_ngrams": 1736476
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details

  • SHA256: d7fc30cd5ab05e81c4d071e31f2ae295f6256e278daeced99815e503cc771cf9
  • Pointer size: 131 Bytes
  • Size of remote file: 134 kB

Git LFS Details

  • SHA256: 086a8c2f54d8a48386dc992dc235b2d89a213f63d90eb0e8eadeebf192cd5471
  • Pointer size: 131 Bytes
  • Size of remote file: 135 kB
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED