omarkamali committed on
Commit 236e9c1 · verified · 1 parent: 022916d

Upload all models and assets for chy (20251001)

This view is limited to 50 files because it contains too many changes. See raw diff.
Files changed (50)
  1. README.md +278 -130
  2. models/embeddings/monolingual/chy_128d.bin +2 -2
  3. models/embeddings/monolingual/chy_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/chy_32d.bin +2 -2
  5. models/embeddings/monolingual/chy_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/chy_64d.bin +2 -2
  7. models/embeddings/monolingual/chy_64d_metadata.json +5 -3
  8. models/subword_markov/chy_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/chy_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/chy_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/chy_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/chy_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/chy_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/chy_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/chy_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/chy_2gram_subword.parquet +2 -2
  17. models/subword_ngram/chy_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/chy_3gram_subword.parquet +2 -2
  19. models/subword_ngram/chy_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/chy_4gram_subword.parquet +2 -2
  21. models/subword_ngram/chy_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/chy_tokenizer_8k.model +2 -2
  23. models/tokenizer/chy_tokenizer_8k.vocab +0 -0
  24. models/vocabulary/chy_vocabulary.parquet +2 -2
  25. models/vocabulary/chy_vocabulary_metadata.json +8 -7
  26. models/word_markov/chy_markov_ctx1_word.parquet +2 -2
  27. models/word_markov/chy_markov_ctx1_word_metadata.json +2 -2
  28. models/word_markov/chy_markov_ctx2_word.parquet +2 -2
  29. models/word_markov/chy_markov_ctx2_word_metadata.json +2 -2
  30. models/word_markov/chy_markov_ctx3_word.parquet +2 -2
  31. models/word_markov/chy_markov_ctx3_word_metadata.json +2 -2
  32. models/word_markov/chy_markov_ctx4_word.parquet +2 -2
  33. models/word_markov/chy_markov_ctx4_word_metadata.json +2 -2
  34. models/word_ngram/chy_2gram_word.parquet +2 -2
  35. models/word_ngram/chy_2gram_word_metadata.json +2 -2
  36. models/word_ngram/chy_3gram_word.parquet +2 -2
  37. models/word_ngram/chy_3gram_word_metadata.json +2 -2
  38. models/word_ngram/chy_4gram_word.parquet +2 -2
  39. models/word_ngram/chy_4gram_word_metadata.json +2 -2
  40. visualizations/embedding_isotropy.png +0 -0
  41. visualizations/embedding_norms.png +0 -0
  42. visualizations/embedding_similarity.png +2 -2
  43. visualizations/markov_branching.png +0 -0
  44. visualizations/markov_contexts.png +0 -0
  45. visualizations/markov_entropy.png +0 -0
  46. visualizations/model_sizes.png +0 -0
  47. visualizations/nearest_neighbors.png +0 -0
  48. visualizations/ngram_coverage.png +0 -0
  49. visualizations/ngram_entropy.png +0 -0
  50. visualizations/ngram_perplexity.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
- value: 3.456
27
  - name: best_isotropy
28
  type: isotropy
29
- value: 0.0028
30
  - name: vocabulary_size
31
  type: vocab
32
- value: 1659
33
- generated: 2025-12-28
34
  ---
35
 
36
  # CHY - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
- - N-gram models (2, 3, 4-gram)
48
- - Markov chains (context of 1, 2, 3 and 4)
49
  - Subword N-gram and Markov chains
50
- - Embeddings in various sizes and dimensions
51
  - Language Vocabulary
52
  - Language Statistics
53
  ![Performance Dashboard](visualizations/performance_dashboard.png)
54
 
55
  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
59
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
60
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
61
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
62
- - [6. Summary & Recommendations](#6-summary--recommendations)
63
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
64
  - [Visualizations Index](#visualizations-index)
65
 
@@ -68,50 +70,45 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
68
 
69
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
70
 
71
  ### Results
72
 
73
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
74
  |------------|-------------|---------------|----------|--------------|
75
- | **8k** | 3.426x | 3.37 | 0.0811% | 33,276 |
76
- | **16k** | 3.456x 🏆 | 3.40 | 0.0819% | 32,987 |
77
 
78
  ### Tokenization Examples
79
 
80
  Below are sample sentences tokenized with each vocabulary size:
81
 
82
- **Sample 1:** `Môxéhéó'o (vé'ho'énêstsestôtse: broom, "sweeping [thing]") Pl: môxéheonôtse.
83
-
84
-
85
- C...`
86
 
87
  | Vocab | Tokens | Count |
88
  |-------|--------|-------|
89
- | 8k | `▁môxéhéó ' o ▁( ' ho ' énêstsestôtse : ... (+18 more)` | 28 |
90
- | 16k | `▁môxéhéó ' o ▁( vé ' ho ' énêstsestôtse : ... (+17 more)` | 27 |
91
 
92
- **Sample 2:** `Brazil, na'éstse ho'e-éve, Amérika.
93
-
94
- Category:Brazil`
95
 
96
  | Vocab | Tokens | Count |
97
  |-------|--------|-------|
98
- | 8k | `▁brazil , ▁na ' éstse ▁ho ' e - éve ... (+6 more)` | 16 |
99
- | 16k | `▁brazil , ▁na ' éstse ▁ho ' e - éve ... (+6 more)` | 16 |
100
-
101
- **Sample 3:** `Boise, na'éstse manâhéno, Idaho.
102
 
103
- Category:Mâhoestôtse`
104
 
105
  | Vocab | Tokens | Count |
106
  |-------|--------|-------|
107
- | 8k | `▁boise ,na ' éstse manâhéno ,idaho .category ... (+2 more)` | 12 |
108
- | 16k | `▁boise , ▁na ' éstse ▁manâhéno , ▁idaho . ▁category ... (+2 more)` | 12 |
109
 
110
 
111
  ### Key Findings
112
 
113
- - **Best Compression:** 16k achieves 3.456x compression
114
- - **Lowest UNK Rate:** 8k with 0.0811% unknown tokens
115
  - **Trade-off:** Larger vocabularies improve compression but increase model size
116
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
117
 
@@ -120,57 +117,89 @@ Category:Mâhoestôtse`
120
 
121
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
122
 
 
 
123
  ![N-gram Coverage](visualizations/ngram_coverage.png)
124
 
125
  ### Results
126
 
127
- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
128
- |--------|------------|---------|----------------|------------------|-------------------|
129
- | **2-gram** | 237 🏆 | 7.89 | 654 | 65.5% | 100.0% |
130
- | **2-gram** | 360 🏆 | 8.49 | 1,127 | 58.7% | 99.4% |
131
- | **3-gram** | 533 | 9.06 | 1,211 | 49.6% | 95.0% |
132
- | **3-gram** | 1,561 | 10.61 | 4,876 | 32.9% | 74.3% |
133
- | **4-gram** | 1,077 | 10.07 | 2,302 | 37.1% | 77.9% |
134
- | **4-gram** | 3,419 | 11.74 | 11,151 | 25.8% | 57.5% |
135
 
136
  ### Top 5 N-grams by Size
137
 
138
- **2-grams:**
139
 
140
  | Rank | N-gram | Count |
141
  |------|--------|-------|
142
- | 1 | `category :` | 973 |
143
- | 2 | `' e` | 663 |
144
- | 3 | `ho '` | 511 |
145
- | 4 | `' o` | 391 |
146
- | 5 | `. category` | 332 |
147
 
148
- **3-grams:**
149
 
150
  | Rank | N-gram | Count |
151
  |------|--------|-------|
152
- | 1 | `. category :` | 331 |
153
- | 2 | `na ' éstse` | 288 |
154
- | 3 | `' ho '` | 225 |
155
- | 4 | `| thumb |` | 204 |
156
- | 5 | `| right |` | 201 |
157
 
158
- **4-grams:**
159
 
160
  | Rank | N-gram | Count |
161
  |------|--------|-------|
162
- | 1 | `, na ' éstse` | 199 |
163
- | 2 | `| thumb | right` | 167 |
164
- | 3 | `thumb | right |` | 155 |
165
- | 4 | ` ' ho '` | 131 |
166
- | 5 | `300px | thumb |` | 128 |
167
 
168
 
169
  ### Key Findings
170
 
171
- - **Best Perplexity:** 2-gram with 237
172
- **Entropy Trend:** Increases with larger n-grams (longer patterns are rarer and less predictable)
173
- - **Coverage:** Top-1000 patterns cover ~58% of corpus
174
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
175
 
176
  ---
@@ -178,55 +207,86 @@ Category:Mâhoestôtse`
178
 
179
  ![Markov Entropy](visualizations/markov_entropy.png)
180
 
 
 
181
  ![Markov Branching](visualizations/markov_branching.png)
182
 
183
  ### Results
184
 
185
- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
186
- |---------|-------------|------------|------------------|-----------------|----------------|
187
- | **1** | 0.3489 | 1.274 | 2.42 | 4,255 | 65.1% |
188
- | **1** | 1.3997 | 2.638 | 10.97 | 189 | 0.0% |
189
- | **2** | 0.1560 | 1.114 | 1.36 | 10,197 | 84.4% |
190
- | **2** | 1.2345 | 2.353 | 5.27 | 2,073 | 0.0% |
191
- | **3** | 0.0936 | 1.067 | 1.18 | 13,745 | 90.6% |
192
- | **3** | 0.6401 | 1.558 | 2.32 | 10,919 | 36.0% |
193
- | **4** | 0.0555 🏆 | 1.039 | 1.10 | 16,004 | 94.5% |
194
- | **4** | 0.2796 🏆 | 1.214 | 1.44 | 25,260 | 72.0% |
195
 
196
- ### Generated Text Samples
197
 
198
- Below are text samples generated from each Markov chain model:
199
 
200
  **Context Size 1:**
201
 
202
- 1. `' konénėhesó - éve . manâhestôtse 7 heše ' tavö ' éhoo ' he tsénėxhésemé '`
203
- 2. `, na ' he tsénėxhésemé ' evo ' éno ' e 1904 , na ' otsenáhkohe`
204
- 3. `: turkey , na ' o tsétsêhéstâhese - éve . gus . curitiba - éve .`
205
 
206
  **Context Size 2:**
207
 
208
- 1. `category : ó ' he ( ' ho ' énestse 71 , 740 6 , 418`
209
- 2. `' e 300px | thumb | amâho ' hestôtse amêške tsémo ' ôhtávoome amâho ' hestôtse category`
210
- 3. `ho ' xó ' mâhoéve ' ho ' énêstsestôtse : purgatoire river , picketwire river ) -`
211
 
212
  **Context Size 3:**
213
 
214
- 1. `. category : mâhoestôtse category : california`
215
- 2. `na ' éstse ho ' e - éve hóxovê - hooma , asia ) . *`
216
- 3. `' ho ' e - éve , meško . category : mâhoestôtse category : ho ' honáeo '`
217
 
218
  **Context Size 4:**
219
 
220
- 1. `, na ' éstse ho ' e - éve , amérika . *`
221
- 2. `| thumb | right | hóxeeséeto ' hamestôtse 300px | thumb | right | méstaa ' êhéhe category :`
222
- 3. `thumb | right | hotóhkeo ' o tsénésôhtôxese 300px | thumb | right | mámaa ' e mámaa '`
223
 
224
 
225
  ### Key Findings
226
 
227
- - **Best Predictability:** Context-4 with 94.5% predictability
228
  - **Branching Factor:** Decreases with context size (more deterministic)
229
- - **Memory Trade-off:** Larger contexts require more storage (25,260 contexts)
230
  - **Recommendation:** Context-3 or Context-4 for text generation
231
 
232
  ---
@@ -242,64 +302,64 @@ Below are text samples generated from each Markov chain model:
242
 
243
  | Metric | Value |
244
  |--------|-------|
245
- | Vocabulary Size | 1,659 |
246
- | Total Tokens | 14,360 |
247
- | Mean Frequency | 8.66 |
248
  | Median Frequency | 3 |
249
- | Frequency Std Dev | 38.60 |
250
 
251
  ### Most Common Words
252
 
253
  | Rank | Word | Frequency |
254
  |------|------|-----------|
255
- | 1 | category | 974 |
256
- | 2 | e | 690 |
257
- | 3 | ho | 531 |
258
- | 4 | o | 414 |
259
- | 5 | na | 293 |
260
- | 6 | éstse | 288 |
261
- | 7 | right | 260 |
262
- | 8 | éve | 259 |
263
- | 9 | thumb | 226 |
264
- | 10 | | 180 |
265
 
266
  ### Least Common Words (from vocabulary)
267
 
268
  | Rank | Word | Frequency |
269
  |------|------|-----------|
270
- | 1 | evenóse | 2 |
271
- | 2 | mountain | 2 |
272
- | 3 | cal | 2 |
273
- | 4 | poly | 2 |
274
- | 5 | mustangs | 2 |
275
- | 6 | sevonévo | 2 |
276
- | 7 | ėstovátamevéotse | 2 |
277
- | 8 | ėstova | 2 |
278
- | 9 | nėstse | 2 |
279
- | 10 | 2025 | 2 |
280
 
281
  ### Zipf's Law Analysis
282
 
283
  | Metric | Value |
284
  |--------|-------|
285
- | Zipf Coefficient | 0.8829 |
286
- | R² (Goodness of Fit) | 0.980523 |
287
  | Adherence Quality | **excellent** |
288
 
289
  ### Coverage Analysis
290
 
291
  | Top N Words | Coverage |
292
  |-------------|----------|
293
- | Top 100 | 57.7% |
294
- | Top 1,000 | 90.8% |
295
  | Top 5,000 | 0.0% |
296
  | Top 10,000 | 0.0% |
297
 
298
  ### Key Findings
299
 
300
- - **Zipf Compliance:** R²=0.9805 indicates excellent adherence to Zipf's law
301
- - **High Frequency Dominance:** Top 100 words cover 57.7% of corpus
302
- - **Long Tail:** -8,341 words needed for remaining 100.0% coverage
303
 
304
  ---
305
  ## 5. Word Embeddings Evaluation
@@ -312,24 +372,109 @@ Below are text samples generated from each Markov chain model:
312
 
313
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
314
 
315
- ### Model Comparison
316
 
317
- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
318
- |-------|------------|-----------|----------|----------|----------|
319
- | **mono_32d** | 223 | 32 | 1.577 | 0.877 | 0.0028 🏆 |
320
- | **mono_64d** | 223 | 64 | 1.556 | 0.897 | 0.0009 |
321
- | **mono_128d** | 223 | 128 | 1.593 | 0.888 | 0.0002 |
322
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
323
 
324
  ### Key Findings
325
 
326
- - **Best Isotropy:** mono_32d with 0.0028 (more uniform distribution)
327
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
328
- - **Vocabulary Coverage:** All models cover 223 words
329
- - **Recommendation:** 100d for balanced semantic capture and efficiency
330
 
331
  ---
332
- ## 6. Summary & Recommendations
333
 
334
  ![Performance Dashboard](visualizations/performance_dashboard.png)
335
 
@@ -337,11 +482,12 @@ Below are text samples generated from each Markov chain model:
337
 
338
  | Component | Recommended | Rationale |
339
  |-----------|-------------|-----------|
340
- | Tokenizer | **32k BPE** | Best compression (3.46x) with low UNK rate |
341
- | N-gram | **5-gram** | Lowest perplexity (237) |
342
- | Markov | **Context-4** | Highest predictability (94.5%) |
343
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
344
 
345
  ---
346
  ## Appendix: Metrics Glossary & Interpretation Guide
347
 
@@ -531,7 +677,8 @@ If you use these models in your research, please cite:
531
  author = {Kamali, Omar},
532
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
533
  year = {2025},
534
- publisher = {HuggingFace},
535
  url = {https://huggingface.co/wikilangs}
536
  institution = {Omneity Labs}
537
  }
@@ -547,7 +694,8 @@ MIT License - Free for academic and commercial use.
547
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
548
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
549
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
550
  ---
551
  *Generated by Wikilangs Models Pipeline*
552
 
553
- *Report Date: 2025-12-28 22:42:59*
 
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
+ value: 3.497
27
  - name: best_isotropy
28
  type: isotropy
29
+ value: 0.0023
30
  - name: vocabulary_size
31
  type: vocab
32
+ value: 0
33
+ generated: 2026-01-03
34
  ---
35
 
36
  # CHY - Wikilangs Models
 
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
+ - N-gram models (2, 3, 4, 5-gram)
48
+ - Markov chains (context of 1, 2, 3, 4 and 5)
49
  - Subword N-gram and Markov chains
50
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
51
  - Language Vocabulary
52
  - Language Statistics
53
+
54
  ![Performance Dashboard](visualizations/performance_dashboard.png)
55
 
56
  ### Analysis and Evaluation
 
60
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
61
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
62
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
63
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis-experimental)
64
+ - [7. Summary & Recommendations](#7-summary--recommendations)
65
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
66
  - [Visualizations Index](#visualizations-index)
67
 
 
70
 
71
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
72
 
73
+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
74
+
75
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
76
+
77
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
78
+
79
  ### Results
80
 
81
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
82
  |------------|-------------|---------------|----------|--------------|
83
+ | **8k** | 3.497x 🏆 | 3.52 | 0.0960% | 19,785 |
84
 
85
  ### Tokenization Examples
86
 
87
  Below are sample sentences tokenized with each vocabulary size:
88
 
89
+ **Sample 1:** `Vášêtaëno, Amâho'hestôtse (Pl: amâho'héstotôtse) Ama'éno'hamémôxe'êstoo'o Ama'én...`
90
 
91
  | Vocab | Tokens | Count |
92
  |-------|--------|-------|
93
+ | 8k | `▁vášêtaëno , ▁amâho ' hestôtse ▁( pl : ▁amâho ' ... (+16 more)` | 26 |
94
 
95
+ **Sample 2:** `Ma'xeamóvôhtó'hestôtse, Éam-óvôhtó'heo'o. thumb|right thumb|right thumb|Daimler-...`
96
 
97
  | Vocab | Tokens | Count |
98
  |-------|--------|-------|
99
+ | 8k | `▁ma ' xeamóvôhtó ' hestôtse , ▁éam - óvôhtó ' ... (+16 more)` | 26 |
100
 
101
+ **Sample 3:** `Mâhpémo'éhe (Alces alces) máto héva popóhpoévêsémo'éhe váótséva-éve.`
102
 
103
  | Vocab | Tokens | Count |
104
  |-------|--------|-------|
105
+ | 8k | `▁mâhpémo ' éhe ( alcesalces )máto ▁hévapopóhpoévêsémo ... (+6 more)` | 16 |
106
 
107
 
108
  ### Key Findings
109
 
110
+ - **Best Compression:** 8k achieves 3.497x compression
111
+ - **Lowest UNK Rate:** 8k with 0.0960% unknown tokens
112
  - **Trade-off:** Larger vocabularies improve compression but increase model size
113
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
114
 
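The shipped model can be exercised directly; a minimal sketch, assuming the `sentencepiece` Python package and the `models/tokenizer/chy_tokenizer_8k.model` file from this commit:

```python
# Minimal sketch: load the 8k SentencePiece tokenizer shipped in this
# commit and reproduce a tokenization like the samples above.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="models/tokenizer/chy_tokenizer_8k.model")

text = "Mâhpémo'éhe (Alces alces) máto héva popóhpoévêsémo'éhe váótséva-éve."
pieces = sp.encode(text, out_type=str)  # subword pieces, e.g. ['▁mâhpémo', "'", 'éhe', ...]
ids = sp.encode(text, out_type=int)     # integer ids for downstream models

print(len(pieces), pieces[:8])
print(sp.decode(ids))                   # decodes back to (normalized) text
```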
 
117
 
118
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
119
 
120
+ ![N-gram Unique](visualizations/ngram_unique.png)
121
+
122
  ![N-gram Coverage](visualizations/ngram_coverage.png)
123
 
124
  ### Results
125
 
126
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
127
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
128
+ | **2-gram** | Word | 102 🏆 | 6.68 | 159 | 86.3% | 100.0% |
129
+ | **2-gram** | Subword | 330 | 8.37 | 871 | 59.3% | 100.0% |
130
+ | **3-gram** | Word | 156 | 7.28 | 245 | 72.6% | 100.0% |
131
+ | **3-gram** | Subword | 1,700 | 10.73 | 3,811 | 27.1% | 73.0% |
132
+ | **4-gram** | Word | 310 | 8.27 | 449 | 52.1% | 100.0% |
133
+ | **4-gram** | Subword | 4,072 | 11.99 | 8,559 | 18.3% | 52.6% |
134
 
135
  ### Top 5 N-grams by Size
136
 
137
+ **2-grams (Word):**
138
+
139
+ | Rank | N-gram | Count |
140
+ |------|--------|-------|
141
+ | 1 | `na éstse` | 161 |
142
+ | 2 | `vé ho` | 119 |
143
+ | 3 | `ho énêstsestôtse` | 72 |
144
+ | 4 | `republic of` | 67 |
145
+ | 5 | `ho e` | 57 |
146
+
147
+ **3-grams (Word):**
148
+
149
+ | Rank | N-gram | Count |
150
+ |------|--------|-------|
151
+ | 1 | `vé ho énêstsestôtse` | 72 |
152
+ | 2 | `na éstse manâhéno` | 56 |
153
+ | 3 | `ho e éve` | 49 |
154
+ | 4 | `na éstse ho` | 48 |
155
+ | 5 | `éstse ho e` | 48 |
156
+
157
+ **4-grams (Word):**
158
+
159
+ | Rank | N-gram | Count |
160
+ |------|--------|-------|
161
+ | 1 | `éstse ho e éve` | 48 |
162
+ | 2 | `na éstse ho e` | 48 |
163
+ | 3 | `ma kaetaévôxe êstoo o` | 25 |
164
+ | 4 | `toháano éve ho etse` | 23 |
165
+ | 5 | `na éstse manâhéno ho` | 22 |
166
+
167
+ **2-grams (Subword):**
168
 
169
  | Rank | N-gram | Count |
170
  |------|--------|-------|
171
+ | 1 | `e _` | 1,534 |
172
+ | 2 | `s e` | 1,395 |
173
+ | 3 | `s t` | 1,310 |
174
+ | 4 | `t s` | 1,310 |
175
+ | 5 | `h e` | 1,012 |
176
 
177
+ **3-grams (Subword):**
178
 
179
  | Rank | N-gram | Count |
180
  |------|--------|-------|
181
+ | 1 | `t s e` | 1,002 |
182
+ | 2 | `s e _` | 580 |
183
+ | 3 | `e s t` | 468 |
184
+ | 4 | `s t s` | 459 |
185
+ | 5 | `h o '` | 443 |
186
 
187
+ **4-grams (Subword):**
188
 
189
  | Rank | N-gram | Count |
190
  |------|--------|-------|
191
+ | 1 | `t s e _` | 456 |
192
+ | 2 | `s t s e` | 436 |
193
+ | 3 | `t s e` | 287 |
194
+ | 4 | `t ô t s` | 208 |
195
+ | 5 | `e s t ô` | 198 |
196
 
197
 
198
  ### Key Findings
199
 
200
+ - **Best Perplexity:** 2-gram (word) with 102
201
- **Entropy Trend:** Increases with larger n-grams (longer patterns are rarer and less predictable)
202
+ - **Coverage:** Top-1000 patterns cover ~53% of corpus
203
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
204
 
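As an illustration of how these tables can be consumed, a short pandas sketch; the column names `ngram` and `count` are assumptions about the parquet schema, not documented in this commit:

```python
# Sketch: load a word n-gram table and recompute the coverage figures
# reported above. Column names (`ngram`, `count`) are assumed.
import pandas as pd

df = pd.read_parquet("models/word_ngram/chy_2gram_word.parquet")
df = df.sort_values("count", ascending=False).reset_index(drop=True)

total = df["count"].sum()
print(df.head(5))  # cf. the Top 5 table above
print("top-100 coverage:", df["count"].head(100).sum() / total)
print("top-1000 coverage:", df["count"].head(1000).sum() / total)
```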
205
  ---
 
207
 
208
  ![Markov Entropy](visualizations/markov_entropy.png)
209
 
210
+ ![Markov Contexts](visualizations/markov_contexts.png)
211
+
212
  ![Markov Branching](visualizations/markov_branching.png)
213
 
214
  ### Results
215
 
216
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
217
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
218
+ | **1** | Word | 0.4118 | 1.330 | 2.00 | 3,383 | 58.8% |
219
+ | **1** | Subword | 1.3726 | 2.589 | 9.72 | 175 | 0.0% |
220
+ | **2** | Word | 0.1118 | 1.081 | 1.20 | 6,516 | 88.8% |
221
+ | **2** | Subword | 1.2032 | 2.303 | 5.05 | 1,699 | 0.0% |
222
+ | **3** | Word | 0.0474 | 1.033 | 1.08 | 7,515 | 95.3% |
223
+ | **3** | Subword | 0.6524 | 1.572 | 2.34 | 8,541 | 34.8% |
224
+ | **4** | Word | 0.0269 🏆 | 1.019 | 1.04 | 7,792 | 97.3% |
225
+ | **4** | Subword | 0.2844 | 1.218 | 1.44 | 19,944 | 71.6% |
226
+
227
+ ### Generated Text Samples (Word-based)
228
+
229
+ Below are text samples generated from each word-based Markov chain model:
230
+
231
+ **Context Size 1:**
232
+
233
+ 1. `e he óonéma enóne éohkê héška ó he hohamháa há continentan naa nêhéóhe násáahéne enomóvóhe tsé`
234
+ 2. `ho honáemanėstóoseo o united states manâhestôtse 1 188 lwanda 100px mogadishu somali shilling swahil...`
235
+ 3. `o vé keehoohtsêstse vó kaehevôtse vé ho énêstsestôtse billingscheyenne english dictionary chief dull...`
236
+
237
+ **Context Size 2:**
238
+
239
+ 1. `na éstse ho e éve asia center thumb handelskade in willemstad curaçao`
240
+ 2. `vé ho énêstsestôtse black hills ho honáéšé e missouri ó he e pónoeo hé e na éstse`
241
+ 3. `ho énêstsestôtse cimarron river bull river forgan heévȧhetanéno`
242
+
243
+ **Context Size 3:**
244
+
245
+ 1. `vé ho énêstsestôtse bay horse variant tsé vó névóvâtse`
246
+ 2. `na éstse manâhéno ho honáéšé e united states óoetaneo o óoetanéno tsé amo eétâhéstove vé ho énêstses...`
247
+ 3. `ho e éve hóxovê hooma center frameless upright 1 5`
248
+
249
+ **Context Size 4:**
250
+
251
+ 1. `éstse ho e éve meško`
252
+ 2. `na éstse ho e éve vietnam dong hoi airport`
253
+ 3. `ma kaetaévôxe êstoo o sango toháano éve ho etse 622 984 4 216 666 1 198 chad republic of`
254
 
 
255
 
256
+ ### Generated Text Samples (Subword-based)
257
+
258
+ Below are text samples generated from each subword-based Markov chain model:
259
 
260
  **Context Size 1:**
261
 
262
+ 1. `e_a_29150002)_te`
263
+ 2. `_rkul_mâxpoeése.`
264
+ 3. `a_xema'ėh-évese:`
265
 
266
  **Context Size 2:**
267
 
268
+ 1. `e_(vé'še'tó'neadd`
269
+ 2. `seotó'o_poestôtse`
270
+ 3. `ts_wymnetugual_ju`
271
 
272
  **Context Size 3:**
273
 
274
+ 1. `tse._vé'ho'hé'e_bo`
275
+ 2. `se_rokese_mâhestȯt`
276
+ 3. `estôtse_manâhá'e_(`
277
 
278
  **Context Size 4:**
279
 
280
+ 1. `tse_hotómá'e_12_évȯ`
281
+ 2. `stseévenomo_hovanan`
282
+ 3. `ôtse:_ten_sage";_ar`
283
 
284
 
285
  ### Key Findings
286
 
287
+ - **Best Predictability:** Context-4 (word) with 97.3% predictability
288
  - **Branching Factor:** Decreases with context size (more deterministic)
289
+ - **Memory Trade-off:** Larger contexts require more storage (19,944 contexts)
290
  - **Recommendation:** Context-3 or Context-4 for text generation
291
 
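The sampling loop behind such generations is small; a toy sketch over the context-2 word table, where the schema (`context`, `next_token`, `count`) is an assumption for illustration:

```python
# Toy weighted sampler over a word-level Markov table. The parquet schema
# (`context`, `next_token`, `count`) is assumed, not documented here.
import random
import pandas as pd

df = pd.read_parquet("models/word_markov/chy_markov_ctx2_word.parquet")
table = {ctx: (g["next_token"].tolist(), g["count"].tolist())
         for ctx, g in df.groupby("context")}

def generate(seed: str, steps: int = 12) -> str:
    tokens = seed.split()
    for _ in range(steps):
        ctx = " ".join(tokens[-2:])  # context size 2
        if ctx not in table:
            break                    # unseen context: stop generating
        nexts, weights = table[ctx]
        tokens.append(random.choices(nexts, weights=weights)[0])
    return " ".join(tokens)

print(generate("na éstse"))
```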
292
  ---
 
302
 
303
  | Metric | Value |
304
  |--------|-------|
305
+ | Vocabulary Size | 1,237 |
306
+ | Total Tokens | 8,401 |
307
+ | Mean Frequency | 6.79 |
308
  | Median Frequency | 3 |
309
+ | Frequency Std Dev | 21.86 |
310
 
311
  ### Most Common Words
312
 
313
  | Rank | Word | Frequency |
314
  |------|------|-----------|
315
+ | 1 | e | 431 |
316
+ | 2 | ho | 377 |
317
+ | 3 | o | 237 |
318
+ | 4 | na | 165 |
319
+ | 5 | | 161 |
320
+ | 6 | éstse | 161 |
321
+ | 7 | éve | 154 |
322
+ | 8 | of | 118 |
323
+ | 9 | he | 108 |
324
+ | 10 | naa | 108 |
325
 
326
  ### Least Common Words (from vocabulary)
327
 
328
  | Rank | Word | Frequency |
329
  |------|------|-----------|
330
+ | 1 | mustangs | 2 |
331
+ | 2 | sevonévo | 2 |
332
+ | 3 | ėstovátamevéotse | 2 |
333
+ | 4 | ėstova | 2 |
334
+ | 5 | nėstse | 2 |
335
+ | 6 | kūnas | 2 |
336
+ | 7 | epsteins | 2 |
337
+ | 8 | ir | 2 |
338
+ | 9 | felon | 2 |
339
+ | 10 | immigrants | 2 |
340
 
341
  ### Zipf's Law Analysis
342
 
343
  | Metric | Value |
344
  |--------|-------|
345
+ | Zipf Coefficient | 0.8151 |
346
+ | R² (Goodness of Fit) | 0.976018 |
347
  | Adherence Quality | **excellent** |
348
 
349
  ### Coverage Analysis
350
 
351
  | Top N Words | Coverage |
352
  |-------------|----------|
353
+ | Top 100 | 54.8% |
354
+ | Top 1,000 | 94.4% |
355
  | Top 5,000 | 0.0% |
356
  | Top 10,000 | 0.0% |
357
 
358
  ### Key Findings
359
 
360
+ - **Zipf Compliance:** R²=0.9760 indicates excellent adherence to Zipf's law
361
+ - **High Frequency Dominance:** Top 100 words cover 54.8% of corpus
362
+ - **Long Tail:** -8,763 words needed for remaining 100.0% coverage
363
 
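The Zipf coefficient and R² above follow from a log-log regression of frequency on rank; a numpy sketch, where the `frequency` column name is an assumption about the vocabulary parquet:

```python
# Zipf fit sketch: regress log(frequency) on log(rank); the negated slope
# is the Zipf coefficient. The `frequency` column name is assumed.
import numpy as np
import pandas as pd

freqs = (pd.read_parquet("models/vocabulary/chy_vocabulary.parquet")["frequency"]
         .sort_values(ascending=False)
         .to_numpy(dtype=float))
ranks = np.arange(1, len(freqs) + 1)

slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
residuals = np.log(freqs) - (slope * np.log(ranks) + intercept)
r2 = 1 - residuals.var() / np.log(freqs).var()

print("zipf coefficient:", -slope)  # report: 0.8151
print("r^2:", r2)                   # report: 0.976018
```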
364
  ---
365
  ## 5. Word Embeddings Evaluation
 
372
 
373
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
374
 
375
 
376
+ ### 5.1 Cross-Lingual Alignment
377
+
378
+ > *Note: Multilingual alignment visualization not available for this language.*
379
+
380
+
381
+ ### 5.2 Model Comparison
382
+
383
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
384
+ |-------|-----------|----------|------------------|---------------|----------------|
385
+ | **mono_32d** | 32 | 0.0023 🏆 | 0.8533 | N/A | N/A |
386
+ | **mono_64d** | 64 | 0.0008 | 0.9264 | N/A | N/A |
387
+ | **mono_128d** | 128 | 0.0002 | 0.9821 | N/A | N/A |
388
 
389
  ### Key Findings
390
 
391
+ - **Best Isotropy:** mono_32d with 0.0023 (more uniform distribution)
392
+ - **Semantic Density:** Average pairwise similarity of 0.9206. Lower values indicate better semantic separation.
393
+ - **Alignment Quality:** No aligned models evaluated in this run.
394
+ - **Recommendation:** 128d (aligned, when available) for best cross-lingual performance
395
+
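For intuition, one common isotropy proxy is the ratio of the smallest to the largest singular value of the centered embedding matrix (1.0 would be perfectly uniform). The sketch below uses random stand-in vectors, since the `.bin` layout is not documented in this README; the 163 × 32 shape follows the 32d model's metadata:

```python
# Isotropy proxy: s_min / s_max of the centered embedding matrix.
# Random vectors stand in for the real ones (the .bin layout is not
# documented here); shape follows the 32d metadata (vocab 163, dim 32).
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(163, 32))

centered = vectors - vectors.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
print("isotropy proxy:", s[-1] / s[0])
```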
396
+ ---
397
+ ## 6. Morphological Analysis (Experimental)
398
+
399
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
400
+
401
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
402
+
403
+ ### 6.1 Productivity & Complexity
404
+
405
+ | Metric | Value | Interpretation | Recommendation |
406
+ |--------|-------|----------------|----------------|
407
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
408
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
409
+
410
+ ### 6.2 Affix Inventory (Productive Units)
411
+
412
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
413
+
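A toy rendition of that substitutability test (the vocabulary sample and thresholds below are illustrative, not the pipeline's actual implementation):

```python
# Toy substitutability check: a suffix counts as productive if stripping
# it from several vocabulary words leaves stems shared across endings.
from collections import defaultdict

VOCAB = {"kaehevotôtse", "xemenôtse", "manestôtse", "mâhoestôtsene",
         "mâheóne", "kane", "êstonêstove", "native"}

def productive_suffixes(vocab, min_len=2, max_len=4, min_stems=2):
    stems_by_suffix = defaultdict(set)
    for word in vocab:
        for k in range(min_len, max_len + 1):
            if len(word) > k + 2:  # require a non-trivial stem
                stems_by_suffix[word[-k:]].add(word[:-k])
    return {sfx: stems for sfx, stems in stems_by_suffix.items()
            if len(stems) >= min_stems}

for suffix, stems in sorted(productive_suffixes(VOCAB).items()):
    print(f"-{suffix}: {sorted(stems)}")
```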
414
+ #### Productive Prefixes
415
+ | Prefix | Examples |
416
+ |--------|----------|
417
+ | `ho-` | horse, hotoa, hoohëö |
418
+
419
+ #### Productive Suffixes
420
+ | Suffix | Examples |
421
+ |--------|----------|
422
+ | `-e` | néstovátamevéotse, kôhtse, where |
423
+ | `-se` | néstovátamevéotse, kôhtse, kaehevotôtse |
424
+ | `-tse` | néstovátamevéotse, kôhtse, kaehevotôtse |
425
+ | `-ne` | mâhoestôtsene, kane, mâheóne |
426
+ | `-ôtse` | kaehevotôtse, oestôtse, xemenôtse |
427
+ | `-ia` | anastacia, abkhazia, shepherdia |
428
+ | `-ve` | êstonêstove, native, hestoháatamaahéstove |
429
+
430
+ ### 6.3 Bound Stems (Lexical Roots)
431
+
432
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
433
+
434
+ *No significant bound stems detected.*
435
+
436
+
437
+ ### 6.4 Affix Compatibility (Co-occurrence)
438
+
439
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
440
+
441
+ | Prefix | Suffix | Frequency | Examples |
442
+ |--------|--------|-----------|----------|
443
+ | `ho-` | `-e` | 5 words | horse, hováhne |
444
+ | `ho-` | `-ne` | 2 words | hováhne, hovahne |
445
+ | `ho-` | `-se` | 1 word | horse, hotse |
446
+ | `ho-` | `-tse` | 1 word | hotse, hohpâhtsenámenôtse |
447
+ | `ho-` | `-ôtse` | 1 word | hohpâhtsenámenôtse |
448
+
449
+ ### 6.5 Recursive Morpheme Segmentation
450
+
451
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
452
+
453
+ | Word | Suggested Split | Confidence | Stem |
454
+ |------|-----------------|------------|------|
455
+ | mâhoestôtsene | **`mâhoest-ôtse-ne`** | 3.0 | `mâhoest` |
456
+ | sevoneóneve | **`sevoneó-ne-ve`** | 3.0 | `sevoneó` |
457
+ | náhkȯhehetanetse | **`náhkȯheheta-ne-tse`** | 3.0 | `náhkȯheheta` |
458
+ | enóseoneve | **`enóseo-ne-ve`** | 3.0 | `enóseo` |
459
+ | éestsėstóseoneve | **`éestsėstóseo-ne-ve`** | 3.0 | `éestsėstóseo` |
460
+ | kaehevotôtse | **`kaehevot-ôtse`** | 1.5 | `kaehevot` |
461
+ | anastacia | **`anastac-ia`** | 1.5 | `anastac` |
462
+ | névóvâtse | **`névóvâ-tse`** | 1.5 | `névóvâ` |
463
+ | shepherdia | **`shepherd-ia`** | 1.5 | `shepherd` |
464
+ | êstonêstove | **`êstonêsto-ve`** | 1.5 | `êstonêsto` |
465
+ | yellowstone | **`yellowsto-ne`** | 1.5 | `yellowsto` |
466
+ | hoohtseto | **`ho-ohtseto`** | 1.5 | `ohtseto` |
467
+ | xemenôtse | **`xemen-ôtse`** | 1.5 | `xemen` |
468
+ | véhonevoemėstse | **`véhonevoemės-tse`** | 1.5 | `véhonevoemės` |
469
+ | manestôtse | **`manest-ôtse`** | 1.5 | `manest` |
470
+
471
+ ### 6.6 Linguistic Interpretation
472
+
473
+ > **Automated Insight:**
474
The language CHY appears to be more isolating or to have a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
475
 
476
  ---
477
+ ## 7. Summary & Recommendations
478
 
479
  ![Performance Dashboard](visualizations/performance_dashboard.png)
480
 
482
 
483
  | Component | Recommended | Rationale |
484
  |-----------|-------------|-----------|
485
+ | Tokenizer | **8k BPE** | Best compression (3.50x) |
486
+ | N-gram | **2-gram** | Lowest perplexity (102) |
487
+ | Markov | **Context-4** | Highest predictability (97.3%) |
488
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
489
 
490
+
491
  ---
492
  ## Appendix: Metrics Glossary & Interpretation Guide
493
 
 
677
  author = {Kamali, Omar},
678
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
679
  year = {2025},
680
+ doi = {10.5281/zenodo.18073153},
681
+ publisher = {Zenodo},
682
  url = {https://huggingface.co/wikilangs}
683
  institution = {Omneity Labs}
684
  }
 
694
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
695
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
696
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
697
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
698
  ---
699
  *Generated by Wikilangs Models Pipeline*
700
 
701
+ *Report Date: 2026-01-03 10:13:21*
models/embeddings/monolingual/chy_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9abb00efdaee95eb235694ba4738708972eef813303beafecc2c4852c0a11360
3
- size 1024233321
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fe82c87b0cea7042a6a16f85cf1875011ca32f13b8c941e25d5c1d57849b2eed
3
+ size 1024170112
models/embeddings/monolingual/chy_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 128,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 223
13
  }
 
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 128
13
  },
14
+ "vocab_size": 163
15
  }
models/embeddings/monolingual/chy_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8c8e4f25fd101e1917ce06bf1768de9245362d868e23a5ebbbc943b0e4650a60
3
- size 256062057
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:70c8741bb1d28e9983cdf32850bd345db541b195a40f92c1ebc8ba49374864be
3
+ size 256044928
models/embeddings/monolingual/chy_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 32,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 223
13
  }
 
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 32
13
  },
14
+ "vocab_size": 163
15
  }
models/embeddings/monolingual/chy_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:4c920935ac87a1441f862361919b9313ce17a00634fb4b6dc1a1bb483d79a2f5
3
- size 512119145
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:235dc5785bb5f5c2123f5d894c3c7266a0f239882928a7be48b4ddbb112ff63d
3
+ size 512086656
models/embeddings/monolingual/chy_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 64,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 223
13
  }
 
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 64
13
  },
14
+ "vocab_size": 163
15
  }
models/subword_markov/chy_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6ea83b117f3a486981315ed6c06f66c8b179b52afa64753b10adcb037d01a299
3
- size 19111
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d18af851c234d84f1c8c493282bad10c2ce3cf38e5f884b0def3b69a8cc46a72
3
+ size 16739
models/subword_markov/chy_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "chy",
5
- "unique_contexts": 189,
6
- "total_transitions": 113177
7
  }
 
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "chy",
5
+ "unique_contexts": 175,
6
+ "total_transitions": 68722
7
  }
models/subword_markov/chy_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:cc4389e4a77f6c71ab2c352c280e4fc1281ab64c25a40abbcf7af73411115bda
3
- size 71565
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8ecd88390cc78af12a57e86676cd4ac3d721a2a3b6105dedf56498e4f9198ba6
3
+ size 55410
models/subword_markov/chy_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "chy",
5
- "unique_contexts": 2073,
6
- "total_transitions": 112352
7
  }
 
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "chy",
5
+ "unique_contexts": 1699,
6
+ "total_transitions": 68263
7
  }
models/subword_markov/chy_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e7f1d3cabcf310e4f6579f8d79db5b6ca9e6cb8c218d840ca5d728e17f8a9d72
3
- size 189294
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fd7301d331a8f6e5cf3904b5bd4befc8f3edf1b29659b35c4c4df3b64b73ed4d
3
+ size 150466
models/subword_markov/chy_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "chy",
5
- "unique_contexts": 10919,
6
- "total_transitions": 111527
7
  }
 
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "chy",
5
+ "unique_contexts": 8541,
6
+ "total_transitions": 67804
7
  }
models/subword_markov/chy_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6f537764e3ba64873ce0be3e727efce7efa3b3ee9a512520fdff8fe7b9f23660
3
- size 357786
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3e44005f2ed1b58ece31011bfe69d86d11229eda4839c36faadb0b8ac3af5479
3
+ size 281701
models/subword_markov/chy_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "chy",
5
- "unique_contexts": 25260,
6
- "total_transitions": 110702
7
  }
 
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "chy",
5
+ "unique_contexts": 19944,
6
+ "total_transitions": 67345
7
  }
models/subword_ngram/chy_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2ed544f13314cfc2324a3b1462dfcc47e71f1236aeb6319b89b7d6e18ac3920a
3
- size 14546
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0228df27f63ed413c576b4cbe4ab6f76c9785023f717763c1927ee318cba7061
3
+ size 11557
models/subword_ngram/chy_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "chy",
5
- "unique_ngrams": 1127,
6
- "total_ngrams": 113177
7
  }
 
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "chy",
5
+ "unique_ngrams": 871,
6
+ "total_ngrams": 68722
7
  }
models/subword_ngram/chy_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:4b077e5de241ab6ccaa7e1f45221e1202536459be7a01c5b4b63e9fd20ba730d
3
- size 54182
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c6c8121df7c4fff9c78c7deca418c3eaecb7394cf750b4681239ca320d9977e8
3
+ size 41181
models/subword_ngram/chy_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "chy",
5
- "unique_ngrams": 4876,
6
- "total_ngrams": 112352
7
  }
 
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "chy",
5
+ "unique_ngrams": 3811,
6
+ "total_ngrams": 68263
7
  }
models/subword_ngram/chy_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9b1c577153cd04ac40d0b50f9a07ec2270c218aff4d156a1ece01fd0edfec031
3
- size 130151
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4bfb26d6de423f9bad18bfb5f3a57a90fada9cbd23138dc9ac7e2fcdbe0c7831
3
+ size 100088
models/subword_ngram/chy_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "chy",
5
- "unique_ngrams": 11151,
6
- "total_ngrams": 111527
7
  }
 
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "chy",
5
+ "unique_ngrams": 8559,
6
+ "total_ngrams": 67804
7
  }
models/tokenizer/chy_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:17ad6a6d3558277d6925eb68f3855cb005e2fe3557d8485bf68a6f7274296dab
3
- size 375317
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7224c406ce7f0ceae9ae28bfb815c3ebc81aafabd231728ea1823e5477c23e0d
3
+ size 374664
models/tokenizer/chy_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/chy_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:470592a0b112b9a3a65f69af1557219cccb407f27f3a4236e60e71c7c2f40313
3
- size 29068
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:557659dcccab98922c4fc45c22d401a01f92a33330e2d06cbdbafd9e3d8f37d2
3
+ size 22392
models/vocabulary/chy_vocabulary_metadata.json CHANGED
@@ -1,14 +1,15 @@
1
  {
2
  "language": "chy",
3
- "vocabulary_size": 1659,
 
4
  "statistics": {
5
- "type_token_ratio": 0.24974895150333748,
6
  "coverage": {
7
- "top_100": 0.4891015417331207,
8
- "top_1000": 0.7703939984641739
9
  },
10
- "hapax_count": 2569,
11
- "hapax_ratio": 0.6076158940397351,
12
- "total_documents": 825
13
  }
14
  }
 
1
  {
2
  "language": "chy",
3
+ "vocabulary_size": 1237,
4
+ "variant": "full",
5
  "statistics": {
6
+ "type_token_ratio": 0.32707120045087357,
7
  "coverage": {
8
+ "top_100": 0.4324628968626714,
9
+ "top_1000": 0.7445989103888785
10
  },
11
+ "hapax_count": 2245,
12
+ "hapax_ratio": 0.644744399770247,
13
+ "total_documents": 459
14
  }
15
  }
models/word_markov/chy_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6fb77d185c917a1f8ece3debd54279262bbc2e1a01799fcb691731a7561bee31
3
- size 113562
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:386ae58f4271bb09f2a7300c1ffb43cb66268f9329c798213791212b7375c544
3
+ size 86902
models/word_markov/chy_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "chy",
5
- "unique_contexts": 4255,
6
- "total_transitions": 27559
7
  }
 
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "chy",
5
+ "unique_contexts": 3383,
6
+ "total_transitions": 10187
7
  }
models/word_markov/chy_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c1602a760151707012bb6aef66d181795727fa9d8cb5780233a0b6c1c90a2312
3
- size 189750
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:25b29a5fd2a035a5761b624f9763e3b2f1bd7218a31608a0a6f94197490478db
3
+ size 131833
models/word_markov/chy_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "chy",
5
- "unique_contexts": 10197,
6
- "total_transitions": 26734
7
  }
 
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "chy",
5
+ "unique_contexts": 6516,
6
+ "total_transitions": 9728
7
  }
models/word_markov/chy_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7a91e6dbb80dac8f5da14abcdfc3ba5c7122ed0a171bc88e3d506ea5a62e4790
3
- size 251639
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b4b7d0cf8b279e25f0c1dd5d12c4331f6647670846d67a4e626aa888e8687564
3
+ size 158524
models/word_markov/chy_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "chy",
5
- "unique_contexts": 13745,
6
- "total_transitions": 25909
7
  }
 
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "chy",
5
+ "unique_contexts": 7515,
6
+ "total_transitions": 9269
7
  }
models/word_markov/chy_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b827d8d8deb6ee8ee5a1c2f7e9164c6009220d08c889406bed208c143c67ef6f
3
- size 299200
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:46f60b22a8fcac07a7de898ad66d40c570966ec70737f00b5cf2d7b22bf2c6a2
3
+ size 174153
models/word_markov/chy_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "chy",
5
- "unique_contexts": 16004,
6
- "total_transitions": 25085
7
  }
 
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "chy",
5
+ "unique_contexts": 7792,
6
+ "total_transitions": 8810
7
  }
models/word_ngram/chy_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0dc44fada7846034391ea2690c56b864a908054940a91ca581f91cf63ba43b79
3
- size 11396
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f3cf90b3734f097b7881f2bd6766a12f1c1a78a49ef2c9af43b8bb1c04f684de
3
+ size 5216
models/word_ngram/chy_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "word",
4
  "language": "chy",
5
- "unique_ngrams": 654,
6
- "total_ngrams": 27559
7
  }
 
2
  "n": 2,
3
  "variant": "word",
4
  "language": "chy",
5
+ "unique_ngrams": 159,
6
+ "total_ngrams": 10187
7
  }
models/word_ngram/chy_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ac58b70ff5ea478cd859a06e2722ba50d36f1d0a7d83c8e35282a144a67b1369
3
- size 20739
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:18e0cd0236f9672b7cdc3748740203bb789cab4d98a9bba4f58941ef3779da01
3
+ size 7252
models/word_ngram/chy_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "word",
4
  "language": "chy",
5
- "unique_ngrams": 1211,
6
- "total_ngrams": 26734
7
  }
 
2
  "n": 3,
3
  "variant": "word",
4
  "language": "chy",
5
+ "unique_ngrams": 245,
6
+ "total_ngrams": 9728
7
  }
models/word_ngram/chy_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8fb9ac65e984d6fb53483e3e4ddaf089e90baeef0b877dc6f9e5067d9f4d0a9d
3
- size 39062
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:be5660989ec57f68e126189b60da16af9da01d4a13b7e53d4bc22f9849e36215
3
+ size 11340
models/word_ngram/chy_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "word",
4
  "language": "chy",
5
- "unique_ngrams": 2302,
6
- "total_ngrams": 25909
7
  }
 
2
  "n": 4,
3
  "variant": "word",
4
  "language": "chy",
5
+ "unique_ngrams": 449,
6
+ "total_ngrams": 9269
7
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details

  • SHA256: 125496fee51a0eff7033a820a5c76329e49094e721ffbc2269e6e45e220f6eb8
  • Pointer size: 131 Bytes
  • Size of remote file: 174 kB

Git LFS Details

  • SHA256: 64b3cbb726cc52039aef89526c5df1602ede7ff01d5aafa01666629a9683d23c
  • Pointer size: 131 Bytes
  • Size of remote file: 127 kB
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED
visualizations/nearest_neighbors.png CHANGED
visualizations/ngram_coverage.png CHANGED
visualizations/ngram_entropy.png CHANGED
visualizations/ngram_perplexity.png CHANGED