omarkamali committed on
Commit 7713c5e · verified · 1 Parent(s): 4a22dd8

Upload all models and assets for alt (20251001)

This view is limited to 50 files because it contains too many changes. See the raw diff for the complete set of changes.
Files changed (50)
  1. README.md +302 -140
  2. models/embeddings/monolingual/alt_128d.bin +2 -2
  3. models/embeddings/monolingual/alt_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/alt_32d.bin +2 -2
  5. models/embeddings/monolingual/alt_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/alt_64d.bin +2 -2
  7. models/embeddings/monolingual/alt_64d_metadata.json +5 -3
  8. models/subword_markov/alt_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/alt_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/alt_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/alt_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/alt_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/alt_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/alt_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/alt_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/alt_2gram_subword.parquet +2 -2
  17. models/subword_ngram/alt_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/alt_3gram_subword.parquet +2 -2
  19. models/subword_ngram/alt_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/alt_4gram_subword.parquet +2 -2
  21. models/subword_ngram/alt_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/alt_tokenizer_16k.model +2 -2
  23. models/tokenizer/alt_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/alt_tokenizer_8k.model +2 -2
  25. models/tokenizer/alt_tokenizer_8k.vocab +0 -0
  26. models/vocabulary/alt_vocabulary.parquet +2 -2
  27. models/vocabulary/alt_vocabulary_metadata.json +10 -9
  28. models/word_markov/alt_markov_ctx1_word.parquet +2 -2
  29. models/word_markov/alt_markov_ctx1_word_metadata.json +2 -2
  30. models/word_markov/alt_markov_ctx2_word.parquet +2 -2
  31. models/word_markov/alt_markov_ctx2_word_metadata.json +2 -2
  32. models/word_markov/alt_markov_ctx3_word.parquet +2 -2
  33. models/word_markov/alt_markov_ctx3_word_metadata.json +2 -2
  34. models/word_markov/alt_markov_ctx4_word.parquet +2 -2
  35. models/word_markov/alt_markov_ctx4_word_metadata.json +2 -2
  36. models/word_ngram/alt_2gram_word.parquet +2 -2
  37. models/word_ngram/alt_2gram_word_metadata.json +2 -2
  38. models/word_ngram/alt_3gram_word.parquet +2 -2
  39. models/word_ngram/alt_3gram_word_metadata.json +2 -2
  40. models/word_ngram/alt_4gram_word.parquet +2 -2
  41. models/word_ngram/alt_4gram_word_metadata.json +2 -2
  42. visualizations/embedding_isotropy.png +0 -0
  43. visualizations/embedding_norms.png +0 -0
  44. visualizations/embedding_similarity.png +2 -2
  45. visualizations/markov_branching.png +0 -0
  46. visualizations/markov_contexts.png +0 -0
  47. visualizations/markov_entropy.png +0 -0
  48. visualizations/model_sizes.png +0 -0
  49. visualizations/nearest_neighbors.png +0 -0
  50. visualizations/ngram_coverage.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
  metrics:
  - name: best_compression_ratio
    type: compression
- value: 4.265
+ value: 3.681
  - name: best_isotropy
    type: isotropy
- value: 0.8322
+ value: 0.8352
  - name: vocabulary_size
    type: vocab
- value: 27823
+ value: 0
- generated: 2025-12-27
+ generated: 2026-01-03
  ---
 
  # ALT - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  ### Models & Assets
 
  - Tokenizers (8k, 16k, 32k, 64k)
- - N-gram models (2, 3, 4-gram)
+ - N-gram models (2, 3, 4, 5-gram)
- - Markov chains (context of 1, 2, 3 and 4)
+ - Markov chains (context of 1, 2, 3, 4 and 5)
  - Subword N-gram and Markov chains
- - Embeddings in various sizes and dimensions
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
  - Language Vocabulary
  - Language Statistics
+
  ![Performance Dashboard](visualizations/performance_dashboard.png)
 
  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
- - [6. Summary & Recommendations](#6-summary--recommendations)
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+ - [7. Summary & Recommendations](#7-summary--recommendations)
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
  - [Visualizations Index](#visualizations-index)
 
@@ -68,59 +70,49 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
 
+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
+
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
+
  ### Results
 
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
  |------------|-------------|---------------|----------|--------------|
- | **8k** | 3.662x | 3.60 | 0.1372% | 996,183 |
- | **16k** | 3.919x | 3.85 | 0.1469% | 930,651 |
- | **32k** | 4.115x | 4.05 | 0.1542% | 886,408 |
- | **64k** | 4.265x 🏆 | 4.19 | 0.1598% | 855,191 |
+ | **8k** | 3.483x | 3.48 | 0.3997% | 976,020 |
+ | **16k** | 3.681x 🏆 | 3.68 | 0.4223% | 923,645 |
 
  ### Tokenization Examples
 
  Below are sample sentences tokenized with each vocabulary size:
 
- **Sample 1:** `Шӱлӱк () эмдеер тынду, чойлошкон.
-
- Тайантылар
-
- Категория:Азыранты`
+ **Sample 1:** `Тижимеева Галина Ивановна Кан-Оозы аймактыҥ аймак депутатды. Ӱстӱги Јалаҥый Ба...`
 
  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁шӱл ӱк ▁() ▁— ▁эмде ер ▁тынду , ▁ч ой ... (+9 more)` | 19 |
- | 16k | `▁шӱл ӱк ▁() ▁— ▁эмдеер ▁тынду , ▁чой ло шкон ... (+6 more)` | 16 |
- | 32k | `▁шӱл ӱк ▁() ▁— ▁эмдеер ▁тынду , ▁чой ло шкон ... (+5 more)` | 15 |
- | 64k | `▁шӱл ӱк ▁() ▁— ▁эмдеер ▁тынду , ▁чойлошкон . ▁тайантылар ... (+3 more)` | 13 |
+ | 8k | `▁ти жи ме ева ▁галина ▁ивановна ▁— ▁кан - оозы ... (+12 more)` | 22 |
+ | 16k | `▁тижимеева ▁галина ▁ивановна ▁— ▁кан - оозы ▁аймактыҥ ▁аймак ▁депутатды ... (+8 more)` | 18 |
 
- **Sample 2:** `Казахтар кош-агаштыҥ Алтайыста јадып турган казахтар. Тӧрӧли Казахстан.
-
- Катег...`
+ **Sample 2:** `«Кызалаҥду јылдар» (орус. «Трудные годы») баштапкы алтай тӱӱкилик роман. Автор...`
 
  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁казах тар ▁кош - агаштыҥ ▁— ▁алтай ы ста ▁јадып ... (+15 more)` | 25 |
- | 16k | `▁казахтар ▁кош - агаштыҥ ▁— ▁алтай ы ста ▁јадып ▁турган ... (+12 more)` | 22 |
- | 32k | `▁казахтар ▁кош - агаштыҥ ▁— ▁алтай ыста ▁јадып ▁турган ▁казахтар ... (+11 more)` | 21 |
- | 64k | `▁казахтар ▁кош - агаштыҥ ▁— ▁алтайыста ▁јадып ▁турган ▁казахтар . ... (+10 more)` | 20 |
+ | 8k | `▁« кы за ла ҥ ду ▁јылдар » ▁( орус ... (+19 more)` | 29 |
+ | 16k | `▁« кызалаҥду ▁јылдар » ▁( орус . ▁« трудные ▁годы ... (+14 more)` | 24 |
 
- **Sample 3:** `Тура:
- кижи јадар айыл.
- кала (темдектезе: Ойрот-Тура, Јаш-Тура, Том-Тура).`
+ **Sample 3:** `Эски Чечкаб (, ) — јурт Россияда Татарстан Республиканыҥ Кайбыч аймагында кирет....`
 
  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁тура : ▁кижи ▁јадар ▁айыл . ▁кала ▁( темде кт ... (+14 more)` | 24 |
- | 16k | `▁тура : ▁кижи ▁јадар ▁айыл . ▁кала ▁( темдектезе : ... (+12 more)` | 22 |
- | 32k | `▁тура : ▁кижи ▁јадар ▁айыл . ▁кала ▁( темдектезе : ... (+12 more)` | 22 |
- | 64k | `▁тура : ▁кижи ▁јадар ▁айыл . ▁кала ▁( темдектезе : ... (+12 more)` | 22 |
+ | 8k | `▁эски ▁че ч ка б ▁(, ▁) ▁— ▁јурт ▁россияда ... (+12 more)` | 22 |
+ | 16k | `▁эски ▁чечкаб ▁(, ▁) ▁— ▁јурт ▁россияда ▁татарстан ▁республиканыҥ ▁кайбыч ... (+7 more)` | 17 |
 
  ### Key Findings
 
- - **Best Compression:** 64k achieves 4.265x compression
+ - **Best Compression:** 16k achieves 3.681x compression
- - **Lowest UNK Rate:** 8k with 0.1372% unknown tokens
+ - **Lowest UNK Rate:** 8k with 0.3997% unknown tokens
  - **Trade-off:** Larger vocabularies improve compression but increase model size
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
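The compression, UNK-rate, and token-length columns above are plain corpus statistics over the tokenized text. A minimal sketch of recomputing them, assuming the shipped `.model` files are standard SentencePiece models (suggested by the `▁` word-boundary pieces, not documented in this diff) and that the evaluation text sits in a hypothetical `corpus.txt`:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="models/tokenizer/alt_tokenizer_16k.model")

# corpus.txt is a hypothetical plain-text dump of the evaluation corpus
text = open("corpus.txt", encoding="utf-8").read()

pieces = sp.encode(text, out_type=str)
ids = sp.encode(text, out_type=int)

compression = len(text) / len(ids)            # characters per token
unk_rate = ids.count(sp.unk_id()) / len(ids)  # share of <unk> tokens
avg_len = sum(len(p) for p in pieces) / len(pieces)

print(f"compression={compression:.3f}x  unk={unk_rate:.4%}  avg_token_len={avg_len:.2f}")
```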
@@ -129,57 +121,89 @@ Below are sample sentences tokenized with each vocabulary size:
 
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
 
+ ![N-gram Unique](visualizations/ngram_unique.png)
+
  ![N-gram Coverage](visualizations/ngram_coverage.png)
 
  ### Results
 
- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
- |--------|------------|---------|----------------|------------------|-------------------|
- | **2-gram** | 5,603 🏆 | 12.45 | 19,162 | 17.9% | 51.7% |
- | **2-gram** | 479 🏆 | 8.91 | 3,376 | 52.1% | 97.1% |
- | **3-gram** | 9,322 | 13.19 | 31,313 | 12.0% | 45.1% |
- | **3-gram** | 3,850 | 11.91 | 26,889 | 18.2% | 59.9% |
- | **4-gram** | 14,040 | 13.78 | 53,203 | 10.9% | 40.5% |
- | **4-gram** | 16,189 | 13.98 | 114,000 | 9.9% | 34.0% |
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
+ | **2-gram** | Word | 4,436 | 12.12 | 12,008 | 16.5% | 55.5% |
+ | **2-gram** | Subword | 413 🏆 | 8.69 | 2,712 | 55.2% | 98.2% |
+ | **3-gram** | Word | 5,478 | 12.42 | 16,272 | 15.6% | 52.1% |
+ | **3-gram** | Subword | 3,295 | 11.69 | 22,501 | 19.5% | 62.8% |
+ | **4-gram** | Word | 8,026 | 12.97 | 27,756 | 15.3% | 46.2% |
+ | **4-gram** | Subword | 14,033 | 13.78 | 96,739 | 10.5% | 35.6% |
 
  ### Top 5 N-grams by Size
 
- **2-grams:**
+ **2-grams (Word):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `республики алтай` | 1,480 |
+ | 2 | `ј чык` | 1,391 |
+ | 3 | `горно алтайск` | 1,246 |
+ | 4 | `алтай республиканыҥ` | 1,222 |
+ | 5 | `ј бож` | 1,072 |
+
+ **3-grams (Word):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `јылдыҥ ӱлӱрген айыныҥ` | 755 |
+ | 2 | `ӱлӱрген айыныҥ 15` | 730 |
+ | 3 | `алтайск ау ра` | 511 |
+ | 4 | `горно алтайск ау` | 511 |
+ | 5 | `јон јаткан јерлери` | 504 |
+
+ **4-grams (Word):**
 
  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `) ,` | 6,036 |
- | 2 | `. —` | 5,108 |
- | 3 | `) .` | 3,265 |
- | 4 | `. )` | 3,242 |
- | 5 | `( )` | 2,922 |
+ | 1 | `јылдыҥ ӱлӱрген айыныҥ 15` | 730 |
+ | 2 | `горно алтайск ау ра` | 511 |
+ | 3 | `болгон јылдыҥ ӱлӱрген айыныҥ` | 367 |
+ | 4 | `тоолоорго окылу конвертер датла` | 365 |
+ | 5 | `окылу конвертер датла тузаланарга` | 365 |
 
- **3-grams:**
+ **2-grams (Subword):**
 
  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `. ) ,` | 2,899 |
- | 2 | `чык . )` | 1,583 |
- | 3 | `. чык .` | 1,572 |
- | 4 | `. чык` | 1,391 |
- | 5 | `. бож .` | 1,334 |
+ | 1 | `_ к` | 74,491 |
+ | 2 | `, _` | 64,716 |
+ | 3 | `_ ј` | 55,670 |
+ | 4 | `_` | 55,340 |
+ | 5 | `_` | 54,127 |
 
- **4-grams:**
+ **3-grams (Subword):**
 
  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `. чык . )` | 1,570 |
- | 2 | `чык . ) ,` | 1,561 |
- | 3 | `. бож . )` | 1,331 |
- | 4 | `. чык .` | 1,308 |
- | 5 | `бож . ) ,` | 1,276 |
+ | 1 | `ҥ _` | 34,280 |
+ | 2 | `а _` | 17,047 |
+ | 3 | `_ _` | 16,876 |
+ | 4 | `ы ҥ` | 15,865 |
+ | 5 | `_ к а` | 15,102 |
+
+ **4-grams (Subword):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `н ы ҥ _` | 15,267 |
+ | 2 | `д ы ҥ _` | 13,210 |
+ | 3 | `_ к ӱ н` | 11,149 |
+ | 4 | `а л т а` | 9,638 |
+ | 5 | `_ ј ы л` | 9,359 |
 
  ### Key Findings
 
- - **Best Perplexity:** 2-gram with 479
+ - **Best Perplexity:** 2-gram (subword) with 413
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
- - **Coverage:** Top-1000 patterns cover ~34% of corpus
+ - **Coverage:** Top-1000 patterns cover ~36% of corpus
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
 
  ---
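The perplexity, entropy, and coverage columns follow directly from the count tables shipped in `models/word_ngram/` and `models/subword_ngram/`. A sketch under the assumption that each parquet stores `ngram` and `count` columns (the schema is not documented in this diff):

```python
import numpy as np
import pandas as pd

# assumed schema: one row per n-gram with columns "ngram" and "count"
df = pd.read_parquet("models/word_ngram/alt_2gram_word.parquet")

counts = df["count"].to_numpy(dtype=float)
p = counts / counts.sum()
entropy = float(-(p * np.log2(p)).sum())  # H = -sum_i p_i log2 p_i, in bits
perplexity = 2.0 ** entropy               # PPL = 2^H
top1000 = np.sort(counts)[-1000:].sum() / counts.sum()

print(f"unique={len(df):,}  entropy={entropy:.2f}  "
      f"ppl={perplexity:,.0f}  top-1000 coverage={top1000:.1%}")
```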
@@ -187,55 +211,86 @@ Below are sample sentences tokenized with each vocabulary size:
 
  ![Markov Entropy](visualizations/markov_entropy.png)
 
+ ![Markov Contexts](visualizations/markov_contexts.png)
+
  ![Markov Branching](visualizations/markov_branching.png)
 
  ### Results
 
- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
- |---------|-------------|------------|------------------|-----------------|----------------|
- | **1** | 0.6271 | 1.544 | 4.13 | 68,485 | 37.3% |
- | **1** | 1.7812 | 3.437 | 20.47 | 296 | 0.0% |
- | **2** | 0.2402 | 1.181 | 1.61 | 283,129 | 76.0% |
- | **2** | 1.3538 | 2.556 | 8.10 | 6,058 | 0.0% |
- | **3** | 0.0999 | 1.072 | 1.20 | 455,274 | 90.0% |
- | **3** | 0.8810 | 1.842 | 4.06 | 49,063 | 11.9% |
- | **4** | 0.0482 🏆 | 1.034 | 1.09 | 544,523 | 95.2% |
- | **4** | 0.5803 🏆 | 1.495 | 2.45 | 199,058 | 42.0% |
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
+ | **1** | Word | 0.7272 | 1.655 | 4.24 | 64,506 | 27.3% |
+ | **1** | Subword | 1.6383 | 3.113 | 16.08 | 301 | 0.0% |
+ | **2** | Word | 0.1675 | 1.123 | 1.34 | 273,261 | 83.2% |
+ | **2** | Subword | 1.3152 | 2.488 | 8.05 | 4,839 | 0.0% |
+ | **3** | Word | 0.0551 | 1.039 | 1.10 | 366,294 | 94.5% |
+ | **3** | Subword | 0.8839 | 1.845 | 4.16 | 38,940 | 11.6% |
+ | **4** | Word | 0.0265 🏆 | 1.019 | 1.05 | 402,354 | 97.4% |
+ | **4** | Subword | 0.6047 | 1.521 | 2.55 | 162,075 | 39.5% |
 
- ### Generated Text Samples
+ ### Generated Text Samples (Word-based)
 
- Below are text samples generated from each Markov chain model:
+ Below are text samples generated from each word-based Markov chain model:
 
  **Context Size 1:**
 
- 1. `, карасал . труд » 40 салковой безналичныйла тӧлӧзӧ ( 1354 ) , 2017 ,`
- 2. `. 2011 ) день памяти святой софии ( ортозында јадат . скобканыҥ ичинде кереестиҥ аҥылу`
- 3. `- оозы 19 категория : электронный . чаган айдыҥ 21 паспаул16 22 кӱнинде пермский государственный уни...`
+ 1. `ла эмчиликте фундаментал шиҥжӱлер эдип чотолот чике тоозын айдып салган аш курсактыҥ томский пивоныҥ...`
+ 2. `ле бийик эмес ортолой кеми 27 ноября года n 107 об образовании муниципальных образований наделении с...`
+ 3. `алтай республиканыҥ јурт јеезезине статус ла лесопильный ла иш аайынча министр сорокин почвоведение ...`
 
  **Context Size 2:**
 
- 1. `) , совет гимнаст , кöп сабазында бу профильный федерал министерстволор кандидатуралар аайынча jöптö...`
- 2. `. 267 с . . мигранттардыҥ тоозы москваныҥ јоныныҥ тоозы астаганыныҥ шылтагы миграционный отток н...`
- 3. `) . јылдыҥ учына јетире 37 кӱн арткан . куран айдыҥ 20 кӱни григориан кӱнтизӱ юлиан кӱнтизӱни`
+ 1. `республики алтай и верхний иртыш под ред и м краевед ада тӧрӧл учун улу јууныҥ туружаачызы канча`
+ 2. `чык британ черӱниҥ баштапкы јаан чууганга туштаган театрдыҥ сценазында јылда ачылган зимняя вишня ...`
+ 3. `горно алтайск гагу 267 с ил библиогр с 233 256 isbn текст электронный сууларда азый балыктыҥ кандыйы`
 
  **Context Size 3:**
 
- 1. `. ) , орус литературалык критик , кеендикте романтизм деп ууламјыга кирген . 1826 алхазов , яков`
- 2. `чык . ) , совет архитектор ( днепрогэс , театр - студия , студенческий театр « ринхбург »`
- 3. `. чык . ) , актёр театр ла киноныҥ . 2016 алиева фазу гамзатовна ( 1899 ј`
+ 1. `јылдыҥ ӱлӱрген айыныҥ 15 кӱнинеҥ ала кочкор айдыҥ 18 кӱнинде восход 2 корабльда космонавт а а леонов...`
+ 2. `ӱлӱрген айыныҥ 15 кӱнинеҥ ала кочкор айдыҥ 3 кӱни григориан кӱнтизӱде јылдыҥ 208 кӱни високосный јыл...`
+ 3. `алтайск ау ра литературно издательский дом алтын туу јери ле јолдоры јуртта 3 ором казаковтыҥ кыдраш...`
 
  **Context Size 4:**
 
- 1. `. чык . ) , армян билимчи - монах , просветитель , армян алфавитти эткен . 1673 — мольер`
- 2. `чык . ) , немец тÿÿкичи , литературовед , политик . отниел чарльз марш ( j . чык`
- 3. `. бож . ) , совет тӱӱкичи , археолог , топонимик ле этнограф . 1862 — джон тайлер (`
+ 1. `јылдыҥ ӱлӱрген айыныҥ 15 кӱнинеҥ ала чаган айдыҥ 17 кӱни юлиан кӱнтизӱ аайынча јылдыҥ ӱлӱрген айыныҥ...`
+ 2. `горно алтайск ау ра литературно издательский дом алтын туу јери ле јолдоры јурттыҥ текши јери 124 4 ...`
+ 3. `болгон јылдыҥ ӱлӱрген айыныҥ 15 кӱнине јетире болгон јылдыҥ ӱлӱрген айыныҥ 15 кӱнинеҥ ала кандык айд...`
+
+ ### Generated Text Samples (Subword-based)
+
+ Below are text samples generated from each subword-based Markov chain model:
+
+ **Context Size 1:**
+
+ 1. `_эдыҥ_оваралетик`
+ 2. `акен._ј._бачӱ_10`
+ 3. `рн_орнфилтӧрораа`
+
+ **Context Size 2:**
+
+ 1. `_ка_мештай,_эдищн`
+ 2. `,_ӱйматкальдынде_`
+ 3. `_јылдыҥ_мет_башен`
+
+ **Context Size 3:**
+
+ 1. `ыҥ_бичинентизӱлери`
+ 2. `да_эмчилевич_ј.бож`
+ 3. `_—_грицаныҥ_јаҥыс_`
+
+ **Context Size 4:**
+
+ 1. `ныҥ_15_кӱнде_фоновы`
+ 2. `дыҥ_эдеги_келтейинд`
+ 3. `_кӱн_айдыҥ_15_айдыҥ`
 
  ### Key Findings
 
- - **Best Predictability:** Context-4 with 95.2% predictability
+ - **Best Predictability:** Context-4 (word) with 97.4% predictability
  - **Branching Factor:** Decreases with context size (more deterministic)
- - **Memory Trade-off:** Larger contexts require more storage (199,058 contexts)
+ - **Memory Trade-off:** Larger contexts require more storage (162,075 contexts)
  - **Recommendation:** Context-3 or Context-4 for text generation
 
  ---
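Samples like the ones above can be drawn straight from the transition tables. A sketch assuming the word-level parquet holds `context`, `next`, and `count` columns (an assumption about the schema, not documented here):

```python
import random
import pandas as pd

df = pd.read_parquet("models/word_markov/alt_markov_ctx2_word.parquet")
# index transitions by context: context -> (candidate next words, their counts)
table = {ctx: (g["next"].tolist(), g["count"].tolist())
         for ctx, g in df.groupby("context")}

def generate(seed: str, steps: int = 20) -> str:
    words = seed.split()
    for _ in range(steps):
        ctx = " ".join(words[-2:])   # context size 2
        if ctx not in table:
            break                    # dead end: context never seen in training
        nxts, weights = table[ctx]
        words.append(random.choices(nxts, weights=weights, k=1)[0])
    return " ".join(words)

print(generate("алтай республиканыҥ"))
```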
@@ -251,64 +306,64 @@ Below are text samples generated from each Markov chain model:
 
  | Metric | Value |
  |--------|-------|
- | Vocabulary Size | 27,823 |
- | Total Tokens | 605,437 |
- | Mean Frequency | 21.76 |
+ | Vocabulary Size | 26,456 |
+ | Total Tokens | 567,020 |
+ | Mean Frequency | 21.43 |
  | Median Frequency | 3 |
- | Frequency Std Dev | 124.54 |
+ | Frequency Std Dev | 124.45 |
 
  ### Most Common Words
 
  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | ла | 6,610 |
- | 2 | алтай | 5,102 |
- | 3 | ле | 4,975 |
- | 4 | с | 3,949 |
- | 5 | деп | 3,904 |
- | 6 | јылда | 3,748 |
- | 7 | айдыҥ | 3,442 |
+ | 1 | ла | 6,612 |
+ | 2 | ле | 4,973 |
+ | 3 | алтай | 4,656 |
+ | 4 | деп | 3,921 |
+ | 5 | с | 3,896 |
+ | 6 | јылда | 3,763 |
+ | 7 | айдыҥ | 3,450 |
  | 8 | болгон | 3,231 |
- | 9 | јурт | 3,141 |
- | 10 | и | 3,049 |
+ | 9 | км | 3,151 |
+ | 10 | јурт | 3,140 |
 
  ### Least Common Words (from vocabulary)
 
  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | туузаланат | 2 |
- | 2 | узаныш | 2 |
- | 3 | эрессейде | 2 |
- | 4 | метеметике | 2 |
- | 5 | јеткилдери | 2 |
- | 6 | кӧмпӱтерлик | 2 |
- | 7 | чоотош | 2 |
- | 8 | кошлык | 2 |
- | 9 | програмалары | 2 |
- | 10 | türkiye | 2 |
+ | 1 | таскадуларды | 2 |
+ | 2 | туузаланат | 2 |
+ | 3 | узаныш | 2 |
+ | 4 | эрессейде | 2 |
+ | 5 | метеметике | 2 |
+ | 6 | јеткилдери | 2 |
+ | 7 | кӧмпӱтерлик | 2 |
+ | 8 | чоотош | 2 |
+ | 9 | кошлык | 2 |
+ | 10 | програмалары | 2 |
 
  ### Zipf's Law Analysis
 
  | Metric | Value |
  |--------|-------|
- | Zipf Coefficient | 1.1608 |
- | R² (Goodness of Fit) | 0.984170 |
+ | Zipf Coefficient | 1.1623 |
+ | R² (Goodness of Fit) | 0.985922 |
  | Adherence Quality | **excellent** |
 
  ### Coverage Analysis
 
  | Top N Words | Coverage |
  |-------------|----------|
- | Top 100 | 26.1% |
- | Top 1,000 | 64.2% |
- | Top 5,000 | 85.3% |
- | Top 10,000 | 91.9% |
+ | Top 100 | 27.1% |
+ | Top 1,000 | 65.6% |
+ | Top 5,000 | 85.8% |
+ | Top 10,000 | 92.3% |
 
  ### Key Findings
 
- - **Zipf Compliance:** R²=0.9842 indicates excellent adherence to Zipf's law
+ - **Zipf Compliance:** R²=0.9859 indicates excellent adherence to Zipf's law
- - **High Frequency Dominance:** Top 100 words cover 26.1% of corpus
+ - **High Frequency Dominance:** Top 100 words cover 27.1% of corpus
- - **Long Tail:** 17,823 words needed for remaining 8.1% coverage
+ - **Long Tail:** 16,456 words needed for remaining 7.7% coverage
 
  ---
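The Zipf coefficient and R² above come from an ordinary least-squares fit in log-log space, frequency ≈ C · rank^(−s). A sketch, assuming the vocabulary parquet exposes a `frequency` column (the schema is an assumption):

```python
import numpy as np
import pandas as pd

freq = (pd.read_parquet("models/vocabulary/alt_vocabulary.parquet")["frequency"]
        .sort_values(ascending=False)
        .to_numpy(dtype=float))
rank = np.arange(1, len(freq) + 1)

# fit log(freq) = intercept + slope * log(rank); Zipf coefficient s = -slope
slope, intercept = np.polyfit(np.log(rank), np.log(freq), 1)
r2 = np.corrcoef(np.log(rank), np.log(freq))[0, 1] ** 2

print(f"Zipf coefficient ≈ {-slope:.4f}, R² ≈ {r2:.6f}")
```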
  ## 5. Word Embeddings Evaluation
@@ -321,24 +376,128 @@ Below are text samples generated from each Markov chain model:
 
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
 
- ### Model Comparison
+ ### 5.1 Cross-Lingual Alignment
+
+ > *Note: Multilingual alignment visualization not available for this language.*
+
+ ### 5.2 Model Comparison
 
- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
- |-------|------------|-----------|----------|----------|----------|
- | **mono_32d** | 12,740 | 32 | 4.584 | 1.210 | 0.8322 🏆 |
- | **mono_64d** | 12,740 | 64 | 4.972 | 1.017 | 0.7431 |
- | **mono_128d** | 12,740 | 128 | 5.170 | 0.932 | 0.3673 |
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+ |-------|-----------|----------|------------------|---------------|----------------|
+ | **mono_32d** | 32 | 0.8352 🏆 | 0.3587 | N/A | N/A |
+ | **mono_64d** | 64 | 0.7406 | 0.3005 | N/A | N/A |
+ | **mono_128d** | 128 | 0.3709 | 0.2867 | N/A | N/A |
 
  ### Key Findings
 
- - **Best Isotropy:** mono_32d with 0.8322 (more uniform distribution)
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
- - **Vocabulary Coverage:** All models cover 12,740 words
- - **Recommendation:** 100d for balanced semantic capture and efficiency
+ - **Best Isotropy:** mono_32d with 0.8352 (more uniform distribution)
+ - **Semantic Density:** Average pairwise similarity of 0.3153. Lower values indicate better semantic separation.
+ - **Alignment Quality:** No aligned models evaluated in this run.
+ - **Recommendation:** 128d aligned for best cross-lingual performance
 
  ---
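The exact isotropy formula used by the pipeline is not documented in this diff; a common proxy is the eigenvalue spread of the embedding covariance (1.0 means all directions are used uniformly). Loading the `.bin` files with fastText is likewise an assumption based on the file extension:

```python
import numpy as np
import fasttext

# assumption: the monolingual .bin files are fastText binary models
model = fasttext.load_model("models/embeddings/monolingual/alt_32d.bin")
vecs = np.stack([model.get_word_vector(w) for w in model.words])

# isotropy proxy: ratio of smallest to largest eigenvalue of the covariance
centered = vecs - vecs.mean(axis=0)
eig = np.linalg.eigvalsh(np.cov(centered, rowvar=False))
print(f"isotropy proxy = {eig.min() / eig.max():.4f}")
```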
- ## 6. Summary & Recommendations
+ ## 6. Morphological Analysis (Experimental)
+
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+
+ ### 6.1 Productivity & Complexity
+
+ | Metric | Value | Interpretation | Recommendation |
+ |--------|-------|----------------|----------------|
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+
+ ### 6.2 Affix Inventory (Productive Units)
+
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
+
+ #### Productive Prefixes
+ | Prefix | Examples |
+ |--------|----------|
+ | `-ко` | корнелия, концертные, коруланар |
+ | `-ка` | каа, каталанской, казанды |
+
+ #### Productive Suffixes
+ | Suffix | Examples |
+ |--------|----------|
+ | `-ыҥ` | пятницаныҥ, јазатырдыҥ, экспедициязыныҥ |
+ | `-ий` | автобиографический, университетский, кентерберийский |
+ | `-кий` | автобиографический, университетский, кентерберийский |
+ | `-ский` | автобиографический, университетский, кентерберийский |
+ | `-ныҥ` | пятницаныҥ, экспедициязыныҥ, тартканыныҥ |
+ | `-иҥ` | унсеттиҥ, билимдериниҥ, эштектиҥ |
+ | `-да` | фонында, лида, украинада |
+ | `-ый` | сосновый, туберкулезный, маршрутный |
+
+ ### 6.3 Bound Stems (Lexical Roots)
+
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
+
+ | Stem | Cohesion | Substitutability | Examples |
+ |------|----------|------------------|----------|
+ | `ский` | 2.13x | 43 contexts | южский, айский, омский |
+ | `ында` | 1.56x | 51 contexts | мында, адында, ойында |
+ | `ыныҥ` | 1.77x | 30 contexts | зыныҥ, мыныҥ, ажыныҥ |
+ | `лтай` | 1.93x | 21 contexts | алтай, шылтай, алтайды |
+ | `лгон` | 2.28x | 12 contexts | болгон, толгон, болгоны |
+ | `аныҥ` | 1.77x | 23 contexts | кааныҥ, уфаныҥ, оканыҥ |
+ | `олго` | 1.78x | 22 contexts | јолго, колго, иолго |
+ | `осси` | 2.07x | 13 contexts | россии, россий, россия |
+ | `алта` | 1.64x | 26 contexts | алтам, алтан, алтая |
+ | `лган` | 1.67x | 24 contexts | алган, салган, алганы |
+ | `рген` | 1.53x | 27 contexts | юрген, мерген, тӱрген |
+ | `ылда` | 1.69x | 19 contexts | јылда, дылда, тылда |
+
+ ### 6.4 Affix Compatibility (Co-occurrence)
+
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
+
+ | Prefix | Suffix | Frequency | Examples |
+ |--------|--------|-----------|----------|
+ | `-ко` | `-ыҥ` | 26 words | комедияныҥ, командазыныҥ |
+ | `-ка` | `-ыҥ` | 23 words | каспаныҥ, кардыҥ |
+ | `-ко` | `-ныҥ` | 16 words | комедияныҥ, командазыныҥ |
+ | `-ка` | `-ий` | 15 words | калий, кавказский |
+ | `-ка` | `-ныҥ` | 13 words | каспаныҥ, калаларыныҥ |
+ | `-ка` | `-да` | 13 words | картазында, кампанияда |
+ | `-ка` | `-кий` | 12 words | кавказский, каледонский |
+ | `-ка` | `-ский` | 12 words | кавказский, каледонский |
+ | `-ка` | `-ар` | 11 words | кайыҥдар, каналдар |
+ | `-ко` | `-ар` | 11 words | космонавттар, коллекциялар |
+
+ ### 6.5 Recursive Morpheme Segmentation
+
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
+
+ | Word | Suggested Split | Confidence | Stem |
+ |------|-----------------|------------|------|
+ | молотовский | **`молот-ов-ский`** | 6.0 | `молот` |
+ | логиканыҥ | **`логика-ныҥ`** | 4.5 | `логика` |
+ | кереестериниҥ | **`кереестерин-иҥ`** | 4.5 | `кереестерин` |
+ | тӱӱкизиниҥ | **`тӱӱкизин-иҥ`** | 4.5 | `тӱӱкизин` |
+ | швейцарияда | **`швейцария-да`** | 4.5 | `швейцария` |
+ | съездиниҥ | **`съездин-иҥ`** | 4.5 | `съездин` |
+ | јӱрӱминиҥ | **`јӱрӱмин-иҥ`** | 4.5 | `јӱрӱмин` |
+ | политиканыҥ | **`политика-ныҥ`** | 4.5 | `политика` |
+ | алексеевский | **`алексеев-ский`** | 4.5 | `алексеев` |
+ | субъектов | **`субъект-ов`** | 4.5 | `субъект` |
+ | фабриканыҥ | **`фабрика-ныҥ`** | 4.5 | `фабрика` |
+ | улаганский | **`улаган-ский`** | 4.5 | `улаган` |
+ | бийигиниҥ | **`бийигин-иҥ`** | 4.5 | `бийигин` |
+ | черӱлериниҥ | **`черӱлерин-иҥ`** | 4.5 | `черӱлерин` |
+ | мьянманыҥ | **`мьянма-ныҥ`** | 4.5 | `мьянма` |
+
+ ### 6.6 Linguistic Interpretation
+
+ > **Automated Insight:**
+ The language ALT appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
+
+ ---
+ ## 7. Summary & Recommendations
 
  ![Performance Dashboard](visualizations/performance_dashboard.png)
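The affix tables in the new section 6 rest on a substitutability test: a candidate suffix counts as productive when stripping it leaves a stem that also occurs with other endings. A toy illustration of that idea, not the pipeline's actual code; the `word` column name is an assumption, and the probe endings are taken from the suffix table above:

```python
from collections import defaultdict
import pandas as pd

words = set(pd.read_parquet("models/vocabulary/alt_vocabulary.parquet")["word"])

# index candidate suffixes (2-4 chars) by the stems they attach to
stems = defaultdict(set)
for w in words:
    for k in (2, 3, 4):
        if len(w) > k + 2:
            stems[w[-k:]].add(w[:-k])

def substitutability(suffix, endings=("да", "ныҥ", "дыҥ", "ый")):
    """Count stems that also occur with a different ending (evidence of affix-hood)."""
    return sum(any(s + e in words for e in endings if e != suffix)
               for s in stems[suffix])

for suf in ("ыҥ", "ский", "да"):
    print(suf, len(stems[suf]), substitutability(suf))
```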
@@ -346,11 +505,12 @@ Below are text samples generated from each Markov chain model:
 
  | Component | Recommended | Rationale |
  |-----------|-------------|-----------|
- | Tokenizer | **32k BPE** | Best compression (4.27x) with low UNK rate |
+ | Tokenizer | **16k BPE** | Best compression (3.68x) |
- | N-gram | **5-gram** | Lowest perplexity (479) |
+ | N-gram | **2-gram** | Lowest perplexity (413) |
- | Markov | **Context-4** | Highest predictability (95.2%) |
+ | Markov | **Context-4** | Highest predictability (97.4%) |
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
 
+
  ---
  ## Appendix: Metrics Glossary & Interpretation Guide
@@ -540,7 +700,8 @@ If you use these models in your research, please cite:
    author = {Kamali, Omar},
    title = {Wikilangs: Open NLP Models for Wikipedia Languages},
    year = {2025},
- publisher = {HuggingFace},
+ doi = {10.5281/zenodo.18073153},
+ publisher = {Zenodo},
    url = {https://huggingface.co/wikilangs}
    institution = {Omneity Labs}
  }
@@ -556,7 +717,8 @@ MIT License - Free for academic and commercial use.
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
  ---
  *Generated by Wikilangs Models Pipeline*
 
- *Report Date: 2025-12-27 05:34:43*
+ *Report Date: 2026-01-03 05:04:55*
models/embeddings/monolingual/alt_128d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:25f39c60e203d2438da48c866dc0122d86895494620c2edb87b2287604fd5a3c
- size 1037346864
+ oid sha256:e396190247b1c989d377e3a31a5ca94405fd3ee9794d9a1f7bafcef3e5cf2c32
+ size 1036365432
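The two-line changes in these `.bin`, `.parquet`, and `.model` entries are edits to Git LFS pointer files, not to the binaries themselves; the real payload is addressed by the `oid` and `size` fields. A minimal sketch of reading a pointer that has not yet been smudged by `git lfs pull`:

```python
def read_lfs_pointer(path: str) -> dict:
    """Parse the key/value lines of a Git LFS spec-v1 pointer file."""
    fields = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, _, value = line.strip().partition(" ")
            fields[key] = value
    return fields

ptr = read_lfs_pointer("models/embeddings/monolingual/alt_128d.bin")
print(ptr["oid"], int(ptr["size"]))  # e.g. sha256:e396...  1036365432
```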
models/embeddings/monolingual/alt_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
    "dimension": 128,
    "version": "monolingual",
    "training_params": {
- "dim": 128,
+ "algorithm": "skipgram",
      "min_count": 5,
      "window": 5,
      "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 128
    },
- "vocab_size": 12740
+ "vocab_size": 11800
  }
models/embeddings/monolingual/alt_32d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3bd06d4b82d8979ba635eb09cfed9e13630bc7fee3c9520ad4c1989e80b32a1f
- size 259562544
+ oid sha256:61f263d2302c0b79944fb6dea7a5410f34344972105b8624236c585557cd9b72
+ size 259303032
models/embeddings/monolingual/alt_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
    "dimension": 32,
    "version": "monolingual",
    "training_params": {
- "dim": 32,
+ "algorithm": "skipgram",
      "min_count": 5,
      "window": 5,
      "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 32
    },
- "vocab_size": 12740
+ "vocab_size": 11800
  }
models/embeddings/monolingual/alt_64d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:db52613d7b4587ee9e0a772f70c33f1c2803a3b6ded1304f0d08823bb6254261
- size 518823984
+ oid sha256:0af70072fe6e458bf918c9d11f1a56126a09d6b5cade10dc1cf79494ec3cad2b
+ size 518323832
models/embeddings/monolingual/alt_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
    "dimension": 64,
    "version": "monolingual",
    "training_params": {
- "dim": 64,
+ "algorithm": "skipgram",
      "min_count": 5,
      "window": 5,
      "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 64
    },
- "vocab_size": 12740
+ "vocab_size": 11800
  }
models/subword_markov/alt_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:588f76957b8af3c47a7147669c4d774dc2bfa12d36bb00f3f9b9c83098608e1e
- size 50442
+ oid sha256:9835d764d81a373a43d4c69af7b885b80b3e7e6708cce0d6899e9b5ea4672187
+ size 43649
models/subword_markov/alt_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 1,
    "variant": "subword",
    "language": "alt",
- "unique_contexts": 296,
- "total_transitions": 4709685
+ "unique_contexts": 301,
+ "total_transitions": 4392884
  }
models/subword_markov/alt_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:202f7a9500c7c4a3d4e0c876818f2e61129809a6bcce24d5a91465259febeeab
- size 374335
+ oid sha256:8d66028d1955552f26a81718406af2dce47d0f3dedd00f8bb0b84c80b869c131
+ size 310925
models/subword_markov/alt_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 2,
    "variant": "subword",
    "language": "alt",
- "unique_contexts": 6058,
- "total_transitions": 4708580
+ "unique_contexts": 4839,
+ "total_transitions": 4391785
  }
models/subword_markov/alt_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4dace38255fd3c7ec3e49d088e674c5b23bf4c3c7755cbf939a65bfbd1c8b904
- size 1452315
+ oid sha256:93204d2e51859eacc761cb0757c0e6456d7c7bc8f68ee9401b438c2b0f12f236
+ size 1232693
models/subword_markov/alt_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 3,
    "variant": "subword",
    "language": "alt",
- "unique_contexts": 49063,
- "total_transitions": 4707475
+ "unique_contexts": 38940,
+ "total_transitions": 4390686
  }
models/subword_markov/alt_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9e8cd16eed7b77e9d51bc740beb2ec21f62d42dbe8b7f1b31b5235ad41e9fb70
- size 4344236
+ oid sha256:24dc5e341215c2a39c5ae7484dd8bd985f42205d269ff7f3a91b8cc25d862939
+ size 3689341
models/subword_markov/alt_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 4,
    "variant": "subword",
    "language": "alt",
- "unique_contexts": 199058,
- "total_transitions": 4706370
+ "unique_contexts": 162075,
+ "total_transitions": 4389587
  }
models/subword_ngram/alt_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1aa2c2c1c4c6fe8449d0737457837d21c94791c0ece82c6602065fb5b6c4e42f
- size 46118
+ oid sha256:a6cafd90f885b32dd8861ed71808430f5107a59536dc5f4e342a7bdc0fbbba4c
+ size 38120
models/subword_ngram/alt_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "n": 2,
    "variant": "subword",
    "language": "alt",
- "unique_ngrams": 3376,
- "total_ngrams": 4709685
+ "unique_ngrams": 2712,
+ "total_ngrams": 4392884
  }
models/subword_ngram/alt_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b7f013dc15eeae2ca077839f1e1f4a82ef2dcf39c0b39576ba5a79d0572f7302
- size 354929
+ oid sha256:34a836528a3396f9306f4ecc25205690e3dd0d56877599f1efb2e2f194507c84
+ size 295825
models/subword_ngram/alt_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "n": 3,
    "variant": "subword",
    "language": "alt",
- "unique_ngrams": 26889,
- "total_ngrams": 4708580
+ "unique_ngrams": 22501,
+ "total_ngrams": 4391785
  }
models/subword_ngram/alt_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ab13f71def8ec0b710fcbdef73871e6cd1a0b69db0016a454c131495f4d82bf9
- size 1469638
+ oid sha256:884f3c7557c6823455b4677c54106f38a9691634c5e0cfe29fac18815f11c7a2
+ size 1241123
models/subword_ngram/alt_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "n": 4,
    "variant": "subword",
    "language": "alt",
- "unique_ngrams": 114000,
- "total_ngrams": 4707475
+ "unique_ngrams": 96739,
+ "total_ngrams": 4390686
  }
models/tokenizer/alt_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:980b4133ea1daddc81173676db443164466831f3ca0e9db86af16461414f5a6e
- size 592388
+ oid sha256:6ed995fcbde5668b2f32931d416ecfd444547f4fccf04118ff4bf11e3c248ef4
+ size 600334
models/tokenizer/alt_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/alt_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:dbc2117cbe37352fbe0fb0a78326dd7a71e661fda8e7fb425ca1f800ac235000
- size 411942
+ oid sha256:452a0aec3e7e4b4e17384e2ff0d3b52a51f9cb273b8e8bbc7addbb7f2e51363f
+ size 410662
models/tokenizer/alt_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/alt_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a885262275b1763315e2f0b46972b066080bc341581c9b3933083e4efc882732
- size 527033
+ oid sha256:2c316fcfa2120415d93073b97d54878932bd6da42c81696b2da0093488988631
+ size 512673
models/vocabulary/alt_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
  {
    "language": "alt",
- "vocabulary_size": 27823,
+ "vocabulary_size": 26456,
+ "variant": "full",
    "statistics": {
- "type_token_ratio": 0.1059063546289431,
+ "type_token_ratio": 0.10662391749851259,
      "coverage": {
- "top_100": 0.2449394999945823,
- "top_1000": 0.6020404530418725,
- "top_5000": 0.7997656466465335,
- "top_10000": 0.8615411287039516
+ "top_100": 0.2540457460170556,
+ "top_1000": 0.6146790507040392,
+ "top_5000": 0.8042837310768824,
+ "top_10000": 0.8649781847028493
      },
- "hapax_count": 40596,
- "hapax_ratio": 0.5933439541647788,
- "total_documents": 1105
+ "hapax_count": 38060,
+ "hapax_ratio": 0.5899311798623598,
+ "total_documents": 1099
    }
  }
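The coverage figures stored above can be cross-checked against the vocabulary parquet. A sketch assuming a `frequency` column; recomputed values may differ slightly because the stored vocabulary apparently excludes hapax legomena (`hapax_count` exceeds `vocabulary_size`):

```python
import json
import pandas as pd

with open("models/vocabulary/alt_vocabulary_metadata.json", encoding="utf-8") as f:
    meta = json.load(f)

freq = (pd.read_parquet("models/vocabulary/alt_vocabulary.parquet")["frequency"]
        .sort_values(ascending=False))

recomputed = freq.head(100).sum() / freq.sum()
stored = meta["statistics"]["coverage"]["top_100"]
print(f"top-100 coverage: stored={stored:.4f}  recomputed={recomputed:.4f}")
```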
models/word_markov/alt_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:072d8e4929be96c8f98e65837e9bb62d7c0ad7b9399dd020e1db736fa445cf7d
3
- size 3441947
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6ed78756b17891f1d853be9b080f0dbe62a2e12d8ac311c5761369520b78a512
3
+ size 3264406
models/word_markov/alt_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "alt",
5
- "unique_contexts": 68485,
6
- "total_transitions": 917583
7
  }
 
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "alt",
5
+ "unique_contexts": 64506,
6
+ "total_transitions": 603981
7
  }
models/word_markov/alt_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:33891ed476ac087a46973edb85932340c5aa8a095bf6d74e8c7f5aecb2036ac2
3
- size 8484870
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:99cc5861c45675e04db4b52347a14f4dad4cced5b1fde724f96eb05b82e2b557
3
+ size 8258854
models/word_markov/alt_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "alt",
5
- "unique_contexts": 283129,
6
- "total_transitions": 916478
7
  }
 
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "alt",
5
+ "unique_contexts": 273261,
6
+ "total_transitions": 602882
7
  }
models/word_markov/alt_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c8c2326ff49e21f16d6438f45aa87c46782d543383b0a3ca4bb3c2a033c97859
3
- size 12475746
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:761a9c53530fa3d9cba8ad3b211af79457306c88812173b71d35bdd3d1faedac
3
+ size 11105253
models/word_markov/alt_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "alt",
5
- "unique_contexts": 455274,
6
- "total_transitions": 915373
7
  }
 
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "alt",
5
+ "unique_contexts": 366294,
6
+ "total_transitions": 601783
7
  }
models/word_markov/alt_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6bfb1a38815ea112cbe8d8484df23d09293f2f182435f4caf9bfb8cc389d527f
3
- size 15452158
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4483468f8c0c180566bb8295e44c88b85cd214ea65a3fca1dfdc4c1fd87d8d95
3
+ size 13560943
models/word_markov/alt_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "alt",
5
- "unique_contexts": 544523,
6
- "total_transitions": 914268
7
  }
 
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "alt",
5
+ "unique_contexts": 402354,
6
+ "total_transitions": 600684
7
  }
models/word_ngram/alt_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:06433ca90a29b7fc4f3dd8b44d5a816235fdd8957c630b28fcb469db2389c420
3
- size 412706
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5493baaabf8c3309446cf79df0d0518dd14efc63de47d09ffd902abb6b59cf0d
3
+ size 301065
models/word_ngram/alt_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "word",
4
  "language": "alt",
5
- "unique_ngrams": 19162,
6
- "total_ngrams": 917583
7
  }
 
2
  "n": 2,
3
  "variant": "word",
4
  "language": "alt",
5
+ "unique_ngrams": 12008,
6
+ "total_ngrams": 603981
7
  }
models/word_ngram/alt_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9396925135c898d843fdd5b52c715879fd9baa7387ffb5e60715bd57deafc479
3
- size 733761
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8a74da8a3a5caaddba92e392597b92357d19a6fc6d4c6d2f0b5c8824ced3fa20
3
+ size 470151
models/word_ngram/alt_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "word",
4
  "language": "alt",
5
- "unique_ngrams": 31313,
6
- "total_ngrams": 916478
7
  }
 
2
  "n": 3,
3
  "variant": "word",
4
  "language": "alt",
5
+ "unique_ngrams": 16272,
6
+ "total_ngrams": 602882
7
  }
models/word_ngram/alt_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:31e26d5fad3df318aea57c7bf608a001e4a1f4d6fea056b2a8dd62fd899d62e1
3
- size 1354732
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e79b9594793378b010dc641e26308ed766fabaa853ad5069f4bcdf76467d0a3d
3
+ size 898038
models/word_ngram/alt_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "word",
4
  "language": "alt",
5
- "unique_ngrams": 53203,
6
- "total_ngrams": 915373
7
  }
 
2
  "n": 4,
3
  "variant": "word",
4
  "language": "alt",
5
+ "unique_ngrams": 27756,
6
+ "total_ngrams": 601783
7
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details (before)

  • SHA256: 502565246483f705433497707ef1835e87aec2c62e5c31e29fa1cc8733d00059
  • Pointer size: 131 Bytes
  • Size of remote file: 145 kB

Git LFS Details (after)

  • SHA256: 89696067bd23e032f11a0e445bbb81292b23a5f9656ab9961f125d14691414ee
  • Pointer size: 131 Bytes
  • Size of remote file: 143 kB
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED
visualizations/nearest_neighbors.png CHANGED
visualizations/ngram_coverage.png CHANGED