omarkamali committed on
Commit
b6d39c4
·
verified ·
1 Parent(s): 236e9c1

Upload all models and assets for chy (latest)

This view is limited to 50 files because it contains too many changes. See raw diff.
Files changed (50)
  1. .gitattributes +1 -0
  2. README.md +186 -149
  3. models/embeddings/aligned/chy_128d.bin +3 -0
  4. models/embeddings/aligned/chy_128d.meta.json +1 -0
  5. models/embeddings/aligned/chy_128d.projection.npy +3 -0
  6. models/embeddings/aligned/chy_128d_metadata.json +8 -0
  7. models/embeddings/aligned/chy_32d.bin +3 -0
  8. models/embeddings/aligned/chy_32d.meta.json +1 -0
  9. models/embeddings/aligned/chy_32d.projection.npy +3 -0
  10. models/embeddings/aligned/chy_32d_metadata.json +8 -0
  11. models/embeddings/aligned/chy_64d.bin +3 -0
  12. models/embeddings/aligned/chy_64d.meta.json +1 -0
  13. models/embeddings/aligned/chy_64d.projection.npy +3 -0
  14. models/embeddings/aligned/chy_64d_metadata.json +8 -0
  15. models/embeddings/monolingual/chy_128d.bin +2 -2
  16. models/embeddings/monolingual/chy_128d_metadata.json +1 -1
  17. models/embeddings/monolingual/chy_32d.bin +2 -2
  18. models/embeddings/monolingual/chy_32d_metadata.json +1 -1
  19. models/embeddings/monolingual/chy_64d.bin +2 -2
  20. models/embeddings/monolingual/chy_64d_metadata.json +1 -1
  21. models/subword_markov/chy_markov_ctx1_subword.parquet +2 -2
  22. models/subword_markov/chy_markov_ctx1_subword_metadata.json +2 -2
  23. models/subword_markov/chy_markov_ctx2_subword.parquet +2 -2
  24. models/subword_markov/chy_markov_ctx2_subword_metadata.json +2 -2
  25. models/subword_markov/chy_markov_ctx3_subword.parquet +2 -2
  26. models/subword_markov/chy_markov_ctx3_subword_metadata.json +2 -2
  27. models/subword_markov/chy_markov_ctx4_subword.parquet +2 -2
  28. models/subword_markov/chy_markov_ctx4_subword_metadata.json +2 -2
  29. models/subword_ngram/chy_2gram_subword.parquet +2 -2
  30. models/subword_ngram/chy_2gram_subword_metadata.json +2 -2
  31. models/subword_ngram/chy_3gram_subword.parquet +2 -2
  32. models/subword_ngram/chy_3gram_subword_metadata.json +2 -2
  33. models/subword_ngram/chy_4gram_subword.parquet +2 -2
  34. models/subword_ngram/chy_4gram_subword_metadata.json +2 -2
  35. models/subword_ngram/chy_5gram_subword.parquet +3 -0
  36. models/subword_ngram/chy_5gram_subword_metadata.json +7 -0
  37. models/tokenizer/chy_tokenizer_8k.model +2 -2
  38. models/tokenizer/chy_tokenizer_8k.vocab +0 -0
  39. models/vocabulary/chy_vocabulary.parquet +2 -2
  40. models/vocabulary/chy_vocabulary_metadata.json +7 -7
  41. models/word_markov/chy_markov_ctx1_word.parquet +2 -2
  42. models/word_markov/chy_markov_ctx1_word_metadata.json +2 -2
  43. models/word_markov/chy_markov_ctx2_word.parquet +2 -2
  44. models/word_markov/chy_markov_ctx2_word_metadata.json +2 -2
  45. models/word_markov/chy_markov_ctx3_word.parquet +2 -2
  46. models/word_markov/chy_markov_ctx3_word_metadata.json +2 -2
  47. models/word_markov/chy_markov_ctx4_word.parquet +2 -2
  48. models/word_markov/chy_markov_ctx4_word_metadata.json +2 -2
  49. models/word_ngram/chy_2gram_word.parquet +2 -2
  50. models/word_ngram/chy_2gram_word_metadata.json +2 -2
.gitattributes CHANGED
@@ -38,3 +38,4 @@ visualizations/performance_dashboard.png filter=lfs diff=lfs merge=lfs -text
38
  visualizations/tsne_sentences.png filter=lfs diff=lfs merge=lfs -text
39
  visualizations/tsne_words.png filter=lfs diff=lfs merge=lfs -text
40
  visualizations/zipf_law.png filter=lfs diff=lfs merge=lfs -text
41
+ visualizations/embedding_tsne_multilingual.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,6 +1,6 @@
1
  ---
2
  language: chy
3
- language_name: CHY
4
  language_family: american_algonquian
5
  tags:
6
  - wikilangs
@@ -10,11 +10,21 @@ tags:
10
  - n-gram
11
  - markov
12
  - wikipedia
13
  - monolingual
14
  - family-american_algonquian
15
  license: mit
16
  library_name: wikilangs
17
- pipeline_tag: feature-extraction
18
  datasets:
19
  - omarkamali/wikipedia-monthly
20
  dataset_info:
@@ -23,7 +33,7 @@ dataset_info:
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
- value: 3.497
27
  - name: best_isotropy
28
  type: isotropy
29
  value: 0.0023
@@ -33,10 +43,10 @@ metrics:
33
  generated: 2026-01-03
34
  ---
35
 
36
- # CHY - Wikilangs Models
37
  ## Comprehensive Research Report & Full Ablation Study
38
 
39
- This repository contains NLP models trained and evaluated by Wikilangs, specifically on **CHY** Wikipedia data.
40
  We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and word embeddings.
41
 
42
  ## 📋 Repository Contents
@@ -60,7 +70,7 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
60
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
61
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
62
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
63
- - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
64
  - [7. Summary & Recommendations](#7-summary--recommendations)
65
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
66
  - [Visualizations Index](#visualizations-index)
@@ -80,35 +90,35 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
80
 
81
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
82
  |------------|-------------|---------------|----------|--------------|
83
- | **8k** | 3.497x 🏆 | 3.52 | 0.0960% | 19,785 |
84
 
85
  ### Tokenization Examples
86
 
87
  Below are sample sentences tokenized with each vocabulary size:
88
 
89
- **Sample 1:** `Vášêtaëno, Amâho'hestôtse (Pl: amâho'héstotôtse) Ama'éno'hamémôxe'êstoo'o Ama'én...`
90
 
91
  | Vocab | Tokens | Count |
92
  |-------|--------|-------|
93
- | 8k | `▁vášêtaëno , ▁amâho ' hestôtse ▁( pl : amâho ' ... (+16 more)` | 26 |
94
 
95
- **Sample 2:** `Ma'xeamóvôhtó'hestôtse, Éam-óvôhtó'heo'o. thumb|right thumb|right thumb|Daimler-...`
96
 
97
  | Vocab | Tokens | Count |
98
  |-------|--------|-------|
99
- | 8k | `▁ma ' xeamóvôhtó ' hestôtse , ▁éam - óvôhtó ' ... (+16 more)` | 26 |
100
 
101
- **Sample 3:** `Mâhpémo'éhe (Alces alces) máto héva popóhpoévêsémo'éhe váótséva-éve.`
102
 
103
  | Vocab | Tokens | Count |
104
  |-------|--------|-------|
105
- | 8k | `▁mâhpémo ' éhe ▁( alces ▁alces ) ▁máto ▁héva ▁popóhpoévêsémo ... (+6 more)` | 16 |
106
 
107
 
108
  ### Key Findings
109
 
110
- - **Best Compression:** 8k achieves 3.497x compression
111
- - **Lowest UNK Rate:** 8k with 0.0960% unknown tokens
112
  - **Trade-off:** Larger vocabularies improve compression but increase model size
113
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
114
 
@@ -125,12 +135,14 @@ Below are sample sentences tokenized with each vocabulary size:
125
 
126
  | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
127
  |--------|---------|------------|---------|----------------|------------------|-------------------|
128
- | **2-gram** | Word | 102 🏆 | 6.68 | 159 | 86.3% | 100.0% |
129
- | **2-gram** | Subword | 330 | 8.37 | 871 | 59.3% | 100.0% |
130
- | **3-gram** | Word | 156 | 7.28 | 245 | 72.6% | 100.0% |
131
- | **3-gram** | Subword | 1,700 | 10.73 | 3,811 | 27.1% | 73.0% |
132
- | **4-gram** | Word | 310 | 8.27 | 449 | 52.1% | 100.0% |
133
- | **4-gram** | Subword | 4,072 | 11.99 | 8,559 | 18.3% | 52.6% |
 
 
134
 
135
  ### Top 5 N-grams by Size
136
 
@@ -138,68 +150,88 @@ Below are sample sentences tokenized with each vocabulary size:
138
 
139
  | Rank | N-gram | Count |
140
  |------|--------|-------|
141
- | 1 | `na éstse` | 161 |
142
  | 2 | `vé ho` | 119 |
143
  | 3 | `ho énêstsestôtse` | 72 |
144
  | 4 | `republic of` | 67 |
145
- | 5 | `ho e` | 57 |
146
 
147
  **3-grams (Word):**
148
 
149
  | Rank | N-gram | Count |
150
  |------|--------|-------|
151
  | 1 | `vé ho énêstsestôtse` | 72 |
152
- | 2 | `na éstse manâhéno` | 56 |
153
- | 3 | `ho e éve` | 49 |
154
- | 4 | `na éstse ho` | 48 |
155
- | 5 | `éstse ho e` | 48 |
156
 
157
  **4-grams (Word):**
158
 
159
  | Rank | N-gram | Count |
160
  |------|--------|-------|
161
- | 1 | `éstse ho e éve` | 48 |
162
- | 2 | `na éstse ho e` | 48 |
163
  | 3 | `ma kaetaévôxe êstoo o` | 25 |
164
  | 4 | `toháano éve ho etse` | 23 |
165
- | 5 | `na éstse manâhéno ho` | 22 |
 
 
 
 
 
 
 
 
 
 
166
 
167
  **2-grams (Subword):**
168
 
169
  | Rank | N-gram | Count |
170
  |------|--------|-------|
171
- | 1 | `e _` | 1,534 |
172
- | 2 | `s e` | 1,395 |
173
- | 3 | `s t` | 1,310 |
174
- | 4 | `t s` | 1,310 |
175
- | 5 | `h e` | 1,012 |
176
 
177
  **3-grams (Subword):**
178
 
179
  | Rank | N-gram | Count |
180
  |------|--------|-------|
181
- | 1 | `t s e` | 1,002 |
182
- | 2 | `s e _` | 580 |
183
- | 3 | `e s t` | 468 |
184
- | 4 | `s t s` | 459 |
185
- | 5 | `h o '` | 443 |
186
 
187
  **4-grams (Subword):**
188
 
189
  | Rank | N-gram | Count |
190
  |------|--------|-------|
191
- | 1 | `t s e _` | 456 |
192
- | 2 | `s t s e` | 436 |
193
- | 3 | `ô t s e` | 287 |
194
- | 4 | `t ô t s` | 208 |
195
- | 5 | `e s t ô` | 198 |
 
 
 
 
 
 
 
 
 
 
196
 
197
 
198
  ### Key Findings
199
 
200
- - **Best Perplexity:** 2-gram (word) with 102
201
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
202
- - **Coverage:** Top-1000 patterns cover ~53% of corpus
203
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
204
 
205
  ---
@@ -215,14 +247,14 @@ Below are sample sentences tokenized with each vocabulary size:
215
 
216
  | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
217
  |---------|---------|-------------|------------|------------------|-----------------|----------------|
218
- | **1** | Word | 0.4118 | 1.330 | 2.00 | 3,383 | 58.8% |
219
- | **1** | Subword | 1.3726 | 2.589 | 9.72 | 175 | 0.0% |
220
- | **2** | Word | 0.1118 | 1.081 | 1.20 | 6,516 | 88.8% |
221
- | **2** | Subword | 1.2032 | 2.303 | 5.05 | 1,699 | 0.0% |
222
- | **3** | Word | 0.0474 | 1.033 | 1.08 | 7,515 | 95.3% |
223
- | **3** | Subword | 0.6524 | 1.572 | 2.34 | 8,541 | 34.8% |
224
- | **4** | Word | 0.0269 🏆 | 1.019 | 1.04 | 7,792 | 97.3% |
225
- | **4** | Subword | 0.2844 | 1.218 | 1.44 | 19,944 | 71.6% |
226
 
227
  ### Generated Text Samples (Word-based)
228
 
@@ -230,27 +262,27 @@ Below are text samples generated from each word-based Markov chain model:
230
 
231
  **Context Size 1:**
232
 
233
- 1. `e he óonéma enóne éohkê héška ó he hohamháa continentan naa nêhéóhe násáahéne enomóvóhe tsé`
234
- 2. `ho honáemanėstóoseo o united states manâhestôtse 1 188 lwanda 100px mogadishu somali shilling swahil...`
235
- 3. `o keehoohtsêstse kaehevôtse vé ho énêstsestôtse billingscheyenne english dictionary chief dull...`
236
 
237
  **Context Size 2:**
238
 
239
- 1. `na éstse ho e éve asia center thumb handelskade in willemstad curaçao`
240
- 2. `vé ho énêstsestôtse black hills ho honáéšé e missouri ó he e pónoeo e na éstse`
241
- 3. `ho énêstsestôtse cimarron river bull river forgan heévȧhetanéno`
242
 
243
  **Context Size 3:**
244
 
245
- 1. `vé ho énêstsestôtse bay horse variant tsé vó névóvâtse`
246
- 2. `na éstse manâhéno ho honáéšé e united states óoetaneo o óoetanéno tsé amo eétâhéstove ho énêstses...`
247
- 3. `ho e éve hóxovê hooma center frameless upright 1 5`
248
 
249
  **Context Size 4:**
250
 
251
- 1. `éstse ho e éve meško`
252
- 2. `na éstse ho e éve vietnam dong hoi airport`
253
- 3. `ma kaetaévôxe êstoo o sango toháano éve ho etse 622 984 4 216 666 1 198 chad republic of`
254
 
255
 
256
  ### Generated Text Samples (Subword-based)
@@ -259,34 +291,34 @@ Below are text samples generated from each subword-based Markov chain model:
259
 
260
  **Context Size 1:**
261
 
262
- 1. `e_a_29150002)_te`
263
- 2. `_rkul_mâxpoeése.`
264
- 3. `a_xema'ėh-évese:`
265
 
266
  **Context Size 2:**
267
 
268
- 1. `e_(vé'še'tó'neadd`
269
- 2. `seotó'o_poestôtse`
270
- 3. `ts_wymnetugual_ju`
271
 
272
  **Context Size 3:**
273
 
274
- 1. `tse._vé'ho'hé'e_bo`
275
- 2. `se_rokese_mâhestȯt`
276
- 3. `estôtse_manâhá'e_(`
277
 
278
  **Context Size 4:**
279
 
280
- 1. `tse_hotómá'e_12_évȯ`
281
- 2. `stseévenomo_hovanan`
282
- 3. `ôtse:_ten_sage";_ar`
283
 
284
 
285
  ### Key Findings
286
 
287
- - **Best Predictability:** Context-4 (word) with 97.3% predictability
288
  - **Branching Factor:** Decreases with context size (more deterministic)
289
- - **Memory Trade-off:** Larger contexts require more storage (19,944 contexts)
290
  - **Recommendation:** Context-3 or Context-4 for text generation
291
 
292
  ---
@@ -302,64 +334,64 @@ Below are text samples generated from each subword-based Markov chain model:
302
 
303
  | Metric | Value |
304
  |--------|-------|
305
- | Vocabulary Size | 1,237 |
306
- | Total Tokens | 8,401 |
307
- | Mean Frequency | 6.79 |
308
  | Median Frequency | 3 |
309
- | Frequency Std Dev | 21.86 |
310
 
311
  ### Most Common Words
312
 
313
  | Rank | Word | Frequency |
314
  |------|------|-----------|
315
- | 1 | e | 431 |
316
- | 2 | ho | 377 |
317
- | 3 | o | 237 |
318
- | 4 | na | 165 |
319
- | 5 | | 161 |
320
- | 6 | éstse | 161 |
321
- | 7 | éve | 154 |
322
- | 8 | of | 118 |
323
- | 9 | he | 108 |
324
- | 10 | naa | 108 |
325
 
326
  ### Least Common Words (from vocabulary)
327
 
328
  | Rank | Word | Frequency |
329
  |------|------|-----------|
330
- | 1 | mustangs | 2 |
331
- | 2 | sevonévo | 2 |
332
- | 3 | ėstovátamevéotse | 2 |
333
- | 4 | ėstova | 2 |
334
- | 5 | nėstse | 2 |
335
- | 6 | kūnas | 2 |
336
- | 7 | epsteins | 2 |
337
- | 8 | ir | 2 |
338
- | 9 | felon | 2 |
339
- | 10 | immigrants | 2 |
340
 
341
  ### Zipf's Law Analysis
342
 
343
  | Metric | Value |
344
  |--------|-------|
345
- | Zipf Coefficient | 0.8151 |
346
- | R² (Goodness of Fit) | 0.976018 |
347
  | Adherence Quality | **excellent** |
348
 
349
  ### Coverage Analysis
350
 
351
  | Top N Words | Coverage |
352
  |-------------|----------|
353
- | Top 100 | 54.8% |
354
- | Top 1,000 | 94.4% |
355
  | Top 5,000 | 0.0% |
356
  | Top 10,000 | 0.0% |
357
 
358
  ### Key Findings
359
 
360
- - **Zipf Compliance:** R²=0.9760 indicates excellent adherence to Zipf's law
361
- - **High Frequency Dominance:** Top 100 words cover 54.8% of corpus
362
- - **Long Tail:** -8,763 words needed for remaining 100.0% coverage
363
 
364
  ---
365
  ## 5. Word Embeddings Evaluation
@@ -375,37 +407,40 @@ Below are text samples generated from each subword-based Markov chain model:
375
 
376
  ### 5.1 Cross-Lingual Alignment
377
 
378
- > *Note: Multilingual alignment visualization not available for this language.*
 
 
379
 
380
 
381
  ### 5.2 Model Comparison
382
 
383
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
384
  |-------|-----------|----------|------------------|---------------|----------------|
385
- | **mono_32d** | 32 | 0.0023 🏆 | 0.8533 | N/A | N/A |
386
- | **mono_64d** | 64 | 0.0008 | 0.9264 | N/A | N/A |
387
- | **mono_128d** | 128 | 0.0002 | 0.9821 | N/A | N/A |
 
 
 
388
 
389
  ### Key Findings
390
 
391
  - **Best Isotropy:** mono_32d with 0.0023 (more uniform distribution)
392
- - **Semantic Density:** Average pairwise similarity of 0.9206. Lower values indicate better semantic separation.
393
- - **Alignment Quality:** No aligned models evaluated in this run.
394
  - **Recommendation:** 128d aligned for best cross-lingual performance
395
 
396
  ---
397
  ## 6. Morphological Analysis (Experimental)
398
 
399
- > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
400
-
401
  This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
402
 
403
  ### 6.1 Productivity & Complexity
404
 
405
  | Metric | Value | Interpretation | Recommendation |
406
  |--------|-------|----------------|----------------|
407
- | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
408
- | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
409
 
410
  ### 6.2 Affix Inventory (Productive Units)
411
 
@@ -414,18 +449,18 @@ These are the most productive prefixes and suffixes identified by sampling the v
414
  #### Productive Prefixes
415
  | Prefix | Examples |
416
  |--------|----------|
417
- | `-ho` | horse, hotoa, hoohëö |
418
 
419
  #### Productive Suffixes
420
  | Suffix | Examples |
421
  |--------|----------|
422
- | `-e` | néstovátamevéotse, kôhtse, where |
423
- | `-se` | néstovátamevéotse, kôhtse, kaehevotôtse |
424
- | `-tse` | néstovátamevéotse, kôhtse, kaehevotôtse |
425
- | `-ne` | mâhoestôtsene, kane, mâheóne |
426
- | `-ôtse` | kaehevotôtse, oestôtse, xemenôtse |
427
- | `-ia` | anastacia, abkhazia, shepherdia |
428
- | `-ve` | êstonêstove, native, hestoháatamaahéstove |
429
 
430
  ### 6.3 Bound Stems (Lexical Roots)
431
 
@@ -440,9 +475,9 @@ This table shows which prefixes and suffixes most frequently co-occur on the sam
440
 
441
  | Prefix | Suffix | Frequency | Examples |
442
  |--------|--------|-----------|----------|
443
- | `-ho` | `-e` | 5 words | horse, hováhne |
444
- | `-ho` | `-ne` | 2 words | hováhne, hovahne |
445
- | `-ho` | `-se` | 1 word | horse, hotse |
446
  | `-ho` | `-tse` | 1 word | hotse, hohpâhtsenámenôtse |
447
  | `-ho` | `-ôtse` | 1 word | hohpâhtsenámenôtse |
448
 
@@ -454,24 +489,26 @@ Using **Recursive Hierarchical Substitutability**, we decompose complex words in
454
  |------|-----------------|------------|------|
455
  | mâhoestôtsene | **`mâhoest-ôtse-ne`** | 3.0 | `mâhoest` |
456
  | sevoneóneve | **`sevoneó-ne-ve`** | 3.0 | `sevoneó` |
457
- | náhkȯhehetanetse | **`náhkȯheheta-ne-tse`** | 3.0 | `náhkȯheheta` |
458
- | enóseoneve | **`enóseo-ne-ve`** | 3.0 | `enóseo` |
459
  | éestsėstóseoneve | **`éestsėstóseo-ne-ve`** | 3.0 | `éestsėstóseo` |
460
- | kaehevotôtse | **`kaehevot-ôtse`** | 1.5 | `kaehevot` |
461
- | anastacia | **`anastac-ia`** | 1.5 | `anastac` |
462
- | névóvâtse | **`névóvâ-tse`** | 1.5 | `névóvâ` |
463
  | shepherdia | **`shepherd-ia`** | 1.5 | `shepherd` |
464
- | êstonêstove | **`êstonêsto-ve`** | 1.5 | `êstonêsto` |
465
- | yellowstone | **`yellowsto-ne`** | 1.5 | `yellowsto` |
466
- | hoohtseto | **`ho-ohtseto`** | 1.5 | `ohtseto` |
467
- | xemenôtse | **`xemen-ôtse`** | 1.5 | `xemen` |
468
- | véhonevoemėstse | **`véhonevoemės-tse`** | 1.5 | `véhonevoemės` |
469
- | manestôtse | **`manest-ôtse`** | 1.5 | `manest` |
470
 
471
  ### 6.6 Linguistic Interpretation
472
 
473
  > **Automated Insight:**
474
- The language CHY appears to be relatively isolating or to have a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
 
 
475
 
476
  ---
477
  ## 7. Summary & Recommendations
@@ -482,9 +519,9 @@ The language CHY appears to be more isolating or has a highly fixed vocabulary.
482
 
483
  | Component | Recommended | Rationale |
484
  |-----------|-------------|-----------|
485
- | Tokenizer | **8k BPE** | Best compression (3.50x) |
486
- | N-gram | **2-gram** | Lowest perplexity (102) |
487
- | Markov | **Context-4** | Highest predictability (97.3%) |
488
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
489
 
490
 
@@ -698,4 +735,4 @@ MIT License - Free for academic and commercial use.
698
  ---
699
  *Generated by Wikilangs Models Pipeline*
700
 
701
- *Report Date: 2026-01-03 10:13:21*
 
1
  ---
2
  language: chy
3
+ language_name: Cheyenne
4
  language_family: american_algonquian
5
  tags:
6
  - wikilangs
 
10
  - n-gram
11
  - markov
12
  - wikipedia
13
+ - feature-extraction
14
+ - sentence-similarity
15
+ - tokenization
16
+ - n-grams
17
+ - markov-chain
18
+ - text-mining
19
+ - fasttext
20
+ - babelvec
21
+ - vocabulous
22
+ - vocabulary
23
  - monolingual
24
  - family-american_algonquian
25
  license: mit
26
  library_name: wikilangs
27
+ pipeline_tag: text-generation
28
  datasets:
29
  - omarkamali/wikipedia-monthly
30
  dataset_info:
 
33
  metrics:
34
  - name: best_compression_ratio
35
  type: compression
36
+ value: 3.494
37
  - name: best_isotropy
38
  type: isotropy
39
  value: 0.0023
 
43
  generated: 2026-01-03
44
  ---
45
 
46
+ # Cheyenne - Wikilangs Models
47
  ## Comprehensive Research Report & Full Ablation Study
48
 
49
+ This repository contains NLP models trained and evaluated by Wikilangs, specifically on **Cheyenne** Wikipedia data.
50
  We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and word embeddings.
51
 
52
  ## 📋 Repository Contents
 
70
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
71
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
72
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
73
+ - [6. Morphological Analysis (Experimental)](#6--morphological-analysis-experimental)
74
  - [7. Summary & Recommendations](#7-summary--recommendations)
75
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
76
  - [Visualizations Index](#visualizations-index)
 
90
 
91
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
92
  |------------|-------------|---------------|----------|--------------|
93
+ | **8k** | 3.494x 🏆 | 3.52 | 0.1022% | 18,598 |
94
 
95
  ### Tokenization Examples
96
 
97
  Below are sample sentences tokenized with each vocabulary size:
98
 
99
+ **Sample 1:** `Vóo'kooma, vóo'ooma (Melanerpes erythrocephalus) ve'kêseho-éve. Tôhohko`
100
 
101
  | Vocab | Tokens | Count |
102
  |-------|--------|-------|
103
+ | 8k | `▁vóo ' kooma , ▁vóo ' ooma ▁( melanerpeserythrocephalus ... (+8 more)` | 18 |
104
 
105
+ **Sample 2:** `Hestaahtsémeno (Ribes floridum), heso'xêhestaahtsémeno, na'éstse máhtáme.`
106
 
107
  | Vocab | Tokens | Count |
108
  |-------|--------|-------|
109
+ | 8k | `▁hestaahtsémeno ▁( ribes ▁floridum ), ▁heso ' xêhestaahtsémeno , ▁na ... (+4 more)` | 14 |
110
 
111
+ **Sample 3:** `'aehesanestôtse (vé'ho'énêstsestôtse: buckskin suit; "antelope-dress") Pl: '...`
112
 
113
  | Vocab | Tokens | Count |
114
  |-------|--------|-------|
115
+ | 8k | `▁ ' aehesanestôtse ▁( ' ho ' énêstsestôtse : ... (+20 more)` | 30 |
116
 
117
 
118
  ### Key Findings
119
 
120
+ - **Best Compression:** 8k achieves 3.494x compression
121
+ - **Lowest UNK Rate:** 8k with 0.1022% unknown tokens
122
  - **Trade-off:** Larger vocabularies improve compression but increase model size
123
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
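The compression and UNK-rate figures above can be reproduced from any tokenized sample. A minimal sketch, assuming compression is measured as input characters per output token (the helper name is illustrative; the shipped `chy_tokenizer_8k.model` is presumably a SentencePiece model loadable with the `sentencepiece` library):

```python
def tokenizer_stats(text: str, tokens: list[str], unk_token: str = "<unk>") -> tuple[float, float]:
    """Compression ratio = input characters per output token;
    UNK rate = fraction of tokens equal to the unknown token.
    (Definitions assumed from the table above, not from the pipeline source.)"""
    compression = len(text) / len(tokens)
    unk_rate = tokens.count(unk_token) / len(tokens)
    return compression, unk_rate
```

With real data, `text` would be a raw sentence and `tokens` the pieces produced by the 8k tokenizer.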
124
 
 
135
 
136
  | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
137
  |--------|---------|------------|---------|----------------|------------------|-------------------|
138
+ | **2-gram** | Word | 98 🏆 | 6.62 | 148 | 88.0% | 100.0% |
139
+ | **2-gram** | Subword | 325 | 8.34 | 853 | 59.8% | 100.0% |
140
+ | **3-gram** | Word | 150 | 7.23 | 229 | 74.0% | 100.0% |
141
+ | **3-gram** | Subword | 1,635 | 10.67 | 3,634 | 27.6% | 73.9% |
142
+ | **4-gram** | Word | 301 | 8.23 | 420 | 52.7% | 100.0% |
143
+ | **4-gram** | Subword | 3,873 | 11.92 | 8,064 | 18.7% | 53.5% |
144
+ | **5-gram** | Word | 213 | 7.74 | 290 | 59.9% | 100.0% |
145
+ | **5-gram** | Subword | 4,512 | 12.14 | 8,516 | 17.1% | 49.3% |
146
 
147
  ### Top 5 N-grams by Size
148
 
 
150
 
151
  | Rank | N-gram | Count |
152
  |------|--------|-------|
153
+ | 1 | `na éstse` | 140 |
154
  | 2 | `vé ho` | 119 |
155
  | 3 | `ho énêstsestôtse` | 72 |
156
  | 4 | `republic of` | 67 |
157
+ | 5 | `éstse manâhéno` | 55 |
158
 
159
  **3-grams (Word):**
160
 
161
  | Rank | N-gram | Count |
162
  |------|--------|-------|
163
  | 1 | `vé ho énêstsestôtse` | 72 |
164
+ | 2 | `na éstse manâhéno` | 55 |
165
+ | 3 | `ho honáéšé e` | 44 |
166
+ | 4 | `ho e éve` | 33 |
167
+ | 5 | `éstse ho e` | 32 |
168
 
169
  **4-grams (Word):**
170
 
171
  | Rank | N-gram | Count |
172
  |------|--------|-------|
173
+ | 1 | `na éstse ho e` | 32 |
174
+ | 2 | `éstse ho e éve` | 32 |
175
  | 3 | `ma kaetaévôxe êstoo o` | 25 |
176
  | 4 | `toháano éve ho etse` | 23 |
177
+ | 5 | `manâhéno ho honáéšé e` | 22 |
178
+
179
+ **5-grams (Word):**
180
+
181
+ | Rank | N-gram | Count |
182
+ |------|--------|-------|
183
+ | 1 | `na éstse ho e éve` | 32 |
184
+ | 2 | `ho honáéšé e united states` | 22 |
185
+ | 3 | `éstse manâhéno ho honáéšé e` | 22 |
186
+ | 4 | `na éstse manâhéno ho honáéšé` | 22 |
187
+ | 5 | `manâhéno ho honáéšé e united` | 21 |
188
 
189
  **2-grams (Subword):**
190
 
191
  | Rank | N-gram | Count |
192
  |------|--------|-------|
193
+ | 1 | `e _` | 1,450 |
194
+ | 2 | `s e` | 1,334 |
195
+ | 3 | `s t` | 1,269 |
196
+ | 4 | `t s` | 1,249 |
197
+ | 5 | `h e` | 974 |
198
 
199
  **3-grams (Subword):**
200
 
201
  | Rank | N-gram | Count |
202
  |------|--------|-------|
203
+ | 1 | `t s e` | 956 |
204
+ | 2 | `s e _` | 548 |
205
+ | 3 | `e s t` | 461 |
206
+ | 4 | `s t s` | 436 |
207
+ | 5 | `h o '` | 420 |
208
 
209
  **4-grams (Subword):**
210
 
211
  | Rank | N-gram | Count |
212
  |------|--------|-------|
213
+ | 1 | `t s e _` | 427 |
214
+ | 2 | `s t s e` | 413 |
215
+ | 3 | `ô t s e` | 276 |
216
+ | 4 | `t ô t s` | 204 |
217
+ | 5 | `e s t ô` | 194 |
218
+
219
+ **5-grams (Subword):**
220
+
221
+ | Rank | N-gram | Count |
222
+ |------|--------|-------|
223
+ | 1 | `s t s e _` | 216 |
224
+ | 2 | `t ô t s e` | 203 |
225
+ | 3 | `s t ô t s` | 190 |
226
+ | 4 | `e s t ô t` | 190 |
227
+ | 5 | `ê s t s e` | 170 |
228
 
229
 
230
  ### Key Findings
231
 
232
+ - **Best Perplexity:** 2-gram (word) with 98
233
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
234
+ - **Coverage:** Top-1000 patterns cover ~49% of corpus
235
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
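The perplexity and entropy columns above follow the standard definitions over the n-gram frequency distribution: Shannon entropy in bits, and perplexity as 2 raised to that entropy. A small self-contained sketch (the function name is an assumption, not the pipeline's API):

```python
import math

def entropy_and_perplexity(counts: dict) -> tuple[float, float]:
    """Shannon entropy H = -sum(p * log2(p)) over the n-gram
    distribution, and perplexity = 2 ** H."""
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy, 2 ** entropy
```

Four equally frequent bigrams, for example, give an entropy of 2 bits and a perplexity of 4.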
236
 
237
  ---
 
247
 
248
  | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
249
  |---------|---------|-------------|------------|------------------|-----------------|----------------|
250
+ | **1** | Word | 0.4049 | 1.324 | 1.97 | 3,214 | 59.5% |
251
+ | **1** | Subword | 1.3402 | 2.532 | 9.42 | 172 | 0.0% |
252
+ | **2** | Word | 0.1099 | 1.079 | 1.20 | 6,126 | 89.0% |
253
+ | **2** | Subword | 1.2169 | 2.324 | 5.05 | 1,620 | 0.0% |
254
+ | **3** | Word | 0.0453 | 1.032 | 1.08 | 7,065 | 95.5% |
255
+ | **3** | Subword | 0.6471 | 1.566 | 2.32 | 8,158 | 35.3% |
256
+ | **4** | Word | 0.0256 🏆 | 1.018 | 1.04 | 7,317 | 97.4% |
257
+ | **4** | Subword | 0.2799 | 1.214 | 1.44 | 18,852 | 72.0% |
258
 
259
  ### Generated Text Samples (Word-based)
260
 
 
262
 
263
  **Context Size 1:**
264
 
265
+ 1. `e éve ho honáéšé e cfa ma kaetaévôxe êstoo o toháano éve hóxovê hooma naa kánome`
266
+ 2. `ho éstova éhe nėstaane néstse vóonotse 30 hestáotse naa unie van zuid afrika hotómá e great`
267
+ 3. `o gdp ppp 72 7 afrikaans vé ho etse 56 785 6 coloured 9 indian tséh`
268
 
269
  **Context Size 2:**
270
 
271
+ 1. `na éstse manâhéno ho honáéšé e vehicle license kȧhkoetohko prefix 29 hotómá e mo hetaneho e hánêsóvó...`
272
+ 2. `vé ho énestse 71 740 6 144 562 903 somali federal republic of the congo congo kinshasa`
273
+ 3. `ho énêstsestôtse wyolacheyenne english dictionarychief dull knife college hoig stan the peace chiefs...`
274
 
275
  **Context Size 3:**
276
 
277
+ 1. `vé ho énêstsestôtse airplane this is`
278
+ 2. `na éstse manâhéno china republic of china republic of china republic of china republic of china repu...`
279
+ 3. `ho honáéšé e native news project`
280
 
281
  **Context Size 4:**
282
 
283
+ 1. `éstse ho e éve vietnam dong hoi airport`
284
+ 2. `na éstse ho e éve united states states of america`
285
+ 3. `ma kaetaévôxe êstoo o toháano éve ho etse 322 460 1 600 democratic republic of the congo of the`
286
 
287
 
288
  ### Generated Text Samples (Subword-based)
 
291
 
292
  **Context Size 1:**
293
 
294
+ 1. `etokfive_piente'`
295
+ 2. `_t:_manésé'e'e,_`
296
+ 3. `aliotse'étinoo's`
297
 
298
  **Context Size 2:**
299
 
300
+ 1. `e_100px_minestȯts`
301
+ 2. `se_cre_manéó'ho'ô`
302
+ 3. `stanjunt.thumb_la`
303
 
304
  **Context Size 3:**
305
 
306
+ 1. `tse_(lephonáéšé'e,`
307
+ 2. `se_odom_capid_city`
308
+ 3. `estôtsestôtsestôts`
309
 
310
  **Context Size 4:**
311
 
312
+ 1. `tse_évȯhkėha'etaneh`
313
+ 2. `stsestȯtse_kóhkonôh`
314
+ 3. `ôtsenáesëö'o_môxeov`
315
 
316
 
317
  ### Key Findings
318
 
319
+ - **Best Predictability:** Context-4 (word) with 97.4% predictability
320
  - **Branching Factor:** Decreases with context size (more deterministic)
321
+ - **Memory Trade-off:** Larger contexts require more storage (18,852 contexts)
322
  - **Recommendation:** Context-3 or Context-4 for text generation
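The word-based samples above come from walking a context-conditioned transition table. A toy sketch of that process, assuming weighted sampling over observed successors (helper names are illustrative; the released models store these tables as Parquet):

```python
import random
from collections import Counter, defaultdict

def build_chain(tokens: list[str], ctx: int) -> dict:
    """Map each window of `ctx` tokens to a Counter of observed next tokens."""
    chain = defaultdict(Counter)
    for i in range(len(tokens) - ctx):
        chain[tuple(tokens[i:i + ctx])][tokens[i + ctx]] += 1
    return chain

def generate(chain: dict, seed: tuple, length: int, rng: random.Random) -> list[str]:
    out = list(seed)
    for _ in range(length):
        nxt = chain.get(tuple(out[-len(seed):]))
        if not nxt:  # dead end: this context never occurred in training
            break
        out.append(rng.choices(list(nxt), weights=list(nxt.values()))[0])
    return out
```

Larger contexts make the walk more deterministic, which is exactly the branching-factor trend in the table above.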
323
 
324
  ---
 
334
 
335
  | Metric | Value |
336
  |--------|-------|
337
+ | Vocabulary Size | 1,174 |
338
+ | Total Tokens | 7,828 |
339
+ | Mean Frequency | 6.67 |
340
  | Median Frequency | 3 |
341
+ | Frequency Std Dev | 21.01 |
342
 
343
  ### Most Common Words
344
 
345
  | Rank | Word | Frequency |
346
  |------|------|-----------|
347
+ | 1 | e | 407 |
348
+ | 2 | ho | 351 |
349
+ | 3 | o | 229 |
350
+ | 4 | | 159 |
351
+ | 5 | na | 144 |
352
+ | 6 | éstse | 140 |
353
+ | 7 | éve | 133 |
354
+ | 8 | of | 117 |
355
+ | 9 | naa | 104 |
356
+ | 10 | he | 103 |
357
 
358
  ### Least Common Words (from vocabulary)
359
 
360
  | Rank | Word | Frequency |
361
  |------|------|-----------|
362
+ | 1 | pack | 2 |
363
+ | 2 | evenóse | 2 |
364
+ | 3 | mountain | 2 |
365
+ | 4 | cal | 2 |
366
+ | 5 | poly | 2 |
367
+ | 6 | mustangs | 2 |
368
+ | 7 | sevonévo | 2 |
369
+ | 8 | ėstovátamevéotse | 2 |
370
+ | 9 | ėstova | 2 |
371
+ | 10 | nėstse | 2 |
372
 
373
  ### Zipf's Law Analysis
374
 
375
  | Metric | Value |
376
  |--------|-------|
377
+ | Zipf Coefficient | 0.8142 |
378
+ | R² (Goodness of Fit) | 0.973597 |
379
  | Adherence Quality | **excellent** |
380
 
381
  ### Coverage Analysis
382
 
383
  | Top N Words | Coverage |
384
  |-------------|----------|
385
+ | Top 100 | 55.3% |
386
+ | Top 1,000 | 95.6% |
387
  | Top 5,000 | 0.0% |
388
  | Top 10,000 | 0.0% |
389
 
390
  ### Key Findings
391
 
392
+ - **Zipf Compliance:** R²=0.9736 indicates excellent adherence to Zipf's law
393
+ - **High Frequency Dominance:** Top 100 words cover 55.3% of corpus
394
+ - **Long Tail:** -8,826 words needed for remaining 100.0% coverage
395
 
396
  ---
397
  ## 5. Word Embeddings Evaluation
 
407
 
408
  ### 5.1 Cross-Lingual Alignment
409
 
410
+ ![Alignment Quality](visualizations/embedding_alignment_quality.png)
411
+
412
+ ![Multilingual t-SNE](visualizations/embedding_tsne_multilingual.png)
413
 
414
 
415
  ### 5.2 Model Comparison
416
 
417
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
418
  |-------|-----------|----------|------------------|---------------|----------------|
419
+ | **mono_32d** | 32 | 0.0023 🏆 | 0.8896 | N/A | N/A |
420
+ | **mono_64d** | 64 | 0.0007 | 0.9590 | N/A | N/A |
421
+ | **mono_128d** | 128 | 0.0002 | 0.9907 | N/A | N/A |
422
+ | **aligned_32d** | 32 | 0.0023 | 0.8896 | 0.0513 | 0.2179 |
423
+ | **aligned_64d** | 64 | 0.0007 | 0.9590 | 0.0385 | 0.1795 |
424
+ | **aligned_128d** | 128 | 0.0002 | 0.9907 | 0.0128 | 0.1667 |
425
 
 ### Key Findings
 
 - **Best Isotropy:** mono_32d with 0.0023 (more uniform distribution)
+ - **Semantic Density:** Average pairwise similarity of 0.9464. Lower values indicate better semantic separation.
+ - **Alignment Quality:** Aligned models achieve up to 5.1% R@1 in cross-lingual retrieval.
 - **Recommendation:** 32d aligned, which attains the best cross-lingual retrieval in this run (R@1 = 5.1%, R@10 = 21.8%); larger dimensions trade alignment accuracy for isotropy
 
 ---
 ## 6. Morphological Analysis (Experimental)
 
 This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
 
 ### 6.1 Productivity & Complexity
 
 | Metric | Value | Interpretation | Recommendation |
 |--------|-------|----------------|----------------|
+ | Productivity Index | **5.000** | High morphological productivity | Reliable analysis |
+ | Idiomaticity Gap | **1.027** | High formulaic/idiomatic content | - |
 
 ### 6.2 Affix Inventory (Productive Units)
 
 #### Productive Prefixes
 | Prefix | Examples |
 |--------|----------|
+ | `ho-` | hotóao, hohtóvá, hoéstónéó |
 
 #### Productive Suffixes
 | Suffix | Examples |
 |--------|----------|
+ | `-e` | ôhkêhenove, háahpe, manâhestôtse |
+ | `-se` | manâhestôtse, tsétsêhéstâhese, xaénéhetse |
+ | `-tse` | manâhestôtse, xaénéhetse, ôhnéménêstse |
+ | `-ôtse` | manâhestôtse, mâhoestôtse, ôtse |
+ | `-ne` | lione, mâhoestôtsene, nemâhmoteone |
+ | `-ve` | ôhkêhenove, ôhkemôxeonêstove, kêsaéve |
+ | `-ia` | alnifolia, austria, nitsvia |
465
  ### 6.3 Bound Stems (Lexical Roots)
466
 
 
475
 
476
  | Prefix | Suffix | Frequency | Examples |
477
  |--------|--------|-----------|----------|
478
+ | `-ho` | `-e` | 5 words | house, hovahne |
479
+ | `-ho` | `-ne` | 2 words | hovahne, hovane |
480
+ | `-ho` | `-se` | 1 words | house, hotse |
481
  | `-ho` | `-tse` | 1 words | hotse, hohpâhtsenámenôtse |
482
  | `-ho` | `-ôtse` | 1 words | hohpâhtsenámenôtse |
483
 
 
 |------|-----------------|------------|------|
 | mâhoestôtsene | **`mâhoest-ôtse-ne`** | 3.0 | `mâhoest` |
 | sevoneóneve | **`sevoneó-ne-ve`** | 3.0 | `sevoneó` |
 | éestsėstóseoneve | **`éestsėstóseo-ne-ve`** | 3.0 | `éestsėstóseo` |
+ | enóseoneve | **`enóseo-ne-ve`** | 3.0 | `enóseo` |
+ | náhkȯhehetanetse | **`náhkȯheheta-ne-tse`** | 3.0 | `náhkȯheheta` |
+ | ôhkêhenove | **`ôhkêheno-ve`** | 1.5 | `ôhkêheno` |
+ | manâhestôtse | **`manâhest-ôtse`** | 1.5 | `manâhest` |
+ | alnifolia | **`alnifol-ia`** | 1.5 | `alnifol` |
+ | ôhkemôxeonêstove | **`ôhkemôxeonêsto-ve`** | 1.5 | `ôhkemôxeonêsto` |
+ | hoéstónéó | **`ho-éstónéó`** | 1.5 | `éstónéó` |
+ | nemâhmoteone | **`nemâhmoteo-ne`** | 1.5 | `nemâhmoteo` |
+ | tsétsêhéstâhese | **`tsétsêhéstâhe-se`** | 1.5 | `tsétsêhéstâhe` |
+ | australia | **`austral-ia`** | 1.5 | `austral` |
 | shepherdia | **`shepherd-ia`** | 1.5 | `shepherd` |
+ | xaénéhetse | **`xaénéhe-tse`** | 1.5 | `xaénéhe` |
 
 
 
 
 
 ### 6.6 Linguistic Interpretation
 
 > **Automated Insight:**
+ The language Cheyenne shows high morphological productivity. The subword models are significantly more efficient than word models, suggesting a rich system of affixation or compounding.
+
+ > **Note on Idiomaticity:** The high Idiomaticity Gap suggests a large number of frequent multi-word expressions or formulaic sequences that are statistically distinct from their component parts.
 
  ---
 ## 7. Summary & Recommendations
 
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
+ | Tokenizer | **8k BPE** | Best compression (3.49x) |
+ | N-gram | **2-gram** | Lowest perplexity (98) |
+ | Markov | **Context-4** | Highest predictability (97.4%) |
 | Embeddings | **128d** | Balanced semantic capture and isotropy |
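The recommended Context-4 Markov table ships as a parquet file in this repo; a hedged sketch of sampling from it. The column names (`context`, `next`, `count`) are assumptions — inspect the parquet schema before relying on them:

```python
import random
import pandas as pd

def sample_next(df, context, rng=random):
    """Sample a next token from a Markov transition table with
    (assumed) columns: context, next, count."""
    rows = df[df["context"] == context]
    if rows.empty:
        return None
    return rng.choices(rows["next"].tolist(), weights=rows["count"].tolist())[0]

# Usage against this repo's files (schema is an assumption):
# df = pd.read_parquet("models/subword_markov/chy_markov_ctx4_subword.parquet")
# token = sample_next(df, some_context)
```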
 
527
 
 
735
  ---
736
  *Generated by Wikilangs Models Pipeline*
737
 
738
+ *Report Date: 2026-01-03 20:28:03*
models/embeddings/aligned/chy_128d.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4fcc832a966fb87260314fdde26c7aaca3eb6f87727d6172f71003e9cbd2c5eb
+ size 1024154505
models/embeddings/aligned/chy_128d.meta.json ADDED
@@ -0,0 +1 @@
+ {"lang": "chy", "dim": 128, "max_seq_len": 512, "is_aligned": true}
models/embeddings/aligned/chy_128d.projection.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5c45ba64aef67dfa385dc9f8ca17d2d96ad776d8ccbe75505f9e1f6875aad67b
+ size 65664
models/embeddings/aligned/chy_128d_metadata.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "language": "chy",
+ "dimension": 128,
+ "version": "aligned",
+ "hub_language": "en",
+ "seed_vocab_size": 78,
+ "vocab_size": 148
+ }
models/embeddings/aligned/chy_32d.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:207f94e4c6d8a4f9a8bc11e98f994580ddfbed45c5a57f6cc17e016f113b2fa3
+ size 256040841
models/embeddings/aligned/chy_32d.meta.json ADDED
@@ -0,0 +1 @@
+ {"lang": "chy", "dim": 32, "max_seq_len": 512, "is_aligned": true}
models/embeddings/aligned/chy_32d.projection.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bf25a3e94561de4f68e7c3c373a0986dd3586b6921811406570a582aa2c047ad
+ size 4224
models/embeddings/aligned/chy_32d_metadata.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "language": "chy",
+ "dimension": 32,
+ "version": "aligned",
+ "hub_language": "en",
+ "seed_vocab_size": 78,
+ "vocab_size": 148
+ }
models/embeddings/aligned/chy_64d.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f311db029de4600904491344614a6a0e06edf2dfd38492760620dcec861d4e73
+ size 512078729
models/embeddings/aligned/chy_64d.meta.json ADDED
@@ -0,0 +1 @@
+ {"lang": "chy", "dim": 64, "max_seq_len": 512, "is_aligned": true}
models/embeddings/aligned/chy_64d.projection.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5bade9f56bce2587b66a3e08b84b9549bcdb85c4f121a359ad5166f3f0e2fcf0
+ size 16512
models/embeddings/aligned/chy_64d_metadata.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "language": "chy",
+ "dimension": 64,
+ "version": "aligned",
+ "hub_language": "en",
+ "seed_vocab_size": 78,
+ "vocab_size": 148
+ }
models/embeddings/monolingual/chy_128d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:fe82c87b0cea7042a6a16f85cf1875011ca32f13b8c941e25d5c1d57849b2eed
- size 1024170112
+ oid sha256:4fcc832a966fb87260314fdde26c7aaca3eb6f87727d6172f71003e9cbd2c5eb
+ size 1024154505
models/embeddings/monolingual/chy_128d_metadata.json CHANGED
@@ -11,5 +11,5 @@
  "encoding_method": "rope",
  "dim": 128
  },
- "vocab_size": 163
+ "vocab_size": 148
  }
models/embeddings/monolingual/chy_32d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:70c8741bb1d28e9983cdf32850bd345db541b195a40f92c1ebc8ba49374864be
- size 256044928
+ oid sha256:207f94e4c6d8a4f9a8bc11e98f994580ddfbed45c5a57f6cc17e016f113b2fa3
+ size 256040841
models/embeddings/monolingual/chy_32d_metadata.json CHANGED
@@ -11,5 +11,5 @@
  "encoding_method": "rope",
  "dim": 32
  },
- "vocab_size": 163
+ "vocab_size": 148
  }
models/embeddings/monolingual/chy_64d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:235dc5785bb5f5c2123f5d894c3c7266a0f239882928a7be48b4ddbb112ff63d
- size 512086656
+ oid sha256:f311db029de4600904491344614a6a0e06edf2dfd38492760620dcec861d4e73
+ size 512078729
models/embeddings/monolingual/chy_64d_metadata.json CHANGED
@@ -11,5 +11,5 @@
  "encoding_method": "rope",
  "dim": 64
  },
- "vocab_size": 163
+ "vocab_size": 148
  }
models/subword_markov/chy_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:d18af851c234d84f1c8c493282bad10c2ce3cf38e5f884b0def3b69a8cc46a72
- size 16739
+ oid sha256:1c35468a046d4cdb8600723c2cc9f33f643f687ec3db096b191c13c05400bf4a
+ size 15097
models/subword_markov/chy_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "subword",
  "language": "chy",
- "unique_contexts": 175,
- "total_transitions": 68722
+ "unique_contexts": 172,
+ "total_transitions": 64543
  }
models/subword_markov/chy_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8ecd88390cc78af12a57e86676cd4ac3d721a2a3b6105dedf56498e4f9198ba6
- size 55410
+ oid sha256:68f0dc3358f63ea1ede0f6586e8b16d1fd372465d1bc2955b2c27e2422d47ff4
+ size 53541
models/subword_markov/chy_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "subword",
  "language": "chy",
- "unique_contexts": 1699,
- "total_transitions": 68263
+ "unique_contexts": 1620,
+ "total_transitions": 64114
  }
models/subword_markov/chy_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:fd7301d331a8f6e5cf3904b5bd4befc8f3edf1b29659b35c4c4df3b64b73ed4d
- size 150466
+ oid sha256:8d479318e787a4b6b2b6582e01bf5581981ed29a293db4c95bfe4803a02ceac4
+ size 141411
models/subword_markov/chy_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "subword",
  "language": "chy",
- "unique_contexts": 8541,
- "total_transitions": 67804
+ "unique_contexts": 8158,
+ "total_transitions": 63685
  }
models/subword_markov/chy_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3e44005f2ed1b58ece31011bfe69d86d11229eda4839c36faadb0b8ac3af5479
- size 281701
+ oid sha256:2a00c38db3647e5284daec765ca6502387e939c467196ad926faa95198224d6c
+ size 266437
models/subword_markov/chy_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "subword",
  "language": "chy",
- "unique_contexts": 19944,
- "total_transitions": 67345
+ "unique_contexts": 18852,
+ "total_transitions": 63256
  }
models/subword_ngram/chy_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:0228df27f63ed413c576b4cbe4ab6f76c9785023f717763c1927ee318cba7061
- size 11557
+ oid sha256:34dc6910378237012a37e238d7df974f017c91b1c3ebd46bf8d76be772f32ac2
+ size 11321
models/subword_ngram/chy_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "subword",
  "language": "chy",
- "unique_ngrams": 871,
- "total_ngrams": 68722
+ "unique_ngrams": 853,
+ "total_ngrams": 64543
  }
models/subword_ngram/chy_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c6c8121df7c4fff9c78c7deca418c3eaecb7394cf750b4681239ca320d9977e8
- size 41181
+ oid sha256:c62fad42c8e95d0d3266d32d66f8bd47a309860a70bb08f72065f1851f496ca7
+ size 39559
models/subword_ngram/chy_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 3,
  "variant": "subword",
  "language": "chy",
- "unique_ngrams": 3811,
- "total_ngrams": 68263
+ "unique_ngrams": 3634,
+ "total_ngrams": 64114
  }
models/subword_ngram/chy_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4bfb26d6de423f9bad18bfb5f3a57a90fada9cbd23138dc9ac7e2fcdbe0c7831
- size 100088
+ oid sha256:7f24a9511d5150fb308647b27c266b5a6f588d314eec67eede769dc1df1d15a0
+ size 93862
models/subword_ngram/chy_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 4,
  "variant": "subword",
  "language": "chy",
- "unique_ngrams": 8559,
- "total_ngrams": 67804
+ "unique_ngrams": 8064,
+ "total_ngrams": 63685
  }
models/subword_ngram/chy_5gram_subword.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1a5a05456a14449cd7ecfb447e8bd0b51ddacc83add84722e977dd5f7c8b8b64
+ size 107070
models/subword_ngram/chy_5gram_subword_metadata.json ADDED
@@ -0,0 +1,7 @@
+ {
+ "n": 5,
+ "variant": "subword",
+ "language": "chy",
+ "unique_ngrams": 8516,
+ "total_ngrams": 63256
+ }
models/tokenizer/chy_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:7224c406ce7f0ceae9ae28bfb815c3ebc81aafabd231728ea1823e5477c23e0d
- size 374664
+ oid sha256:f1c8a724c8c94c9c917aae80be9128fd26d48b89f29362db6780a3920aa83b5f
+ size 373477
models/tokenizer/chy_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
models/vocabulary/chy_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:557659dcccab98922c4fc45c22d401a01f92a33330e2d06cbdbafd9e3d8f37d2
- size 22392
+ oid sha256:b9b062a41dc39802b42fd2a6c1857efd43a84af7da75bb98cfda744f6ad48c62
+ size 21524
models/vocabulary/chy_vocabulary_metadata.json CHANGED
@@ -1,15 +1,15 @@
  {
  "language": "chy",
- "vocabulary_size": 1237,
+ "vocabulary_size": 1174,
  "variant": "full",
  "statistics": {
- "type_token_ratio": 0.32707120045087357,
+ "type_token_ratio": 0.33165930092406587,
  "coverage": {
- "top_100": 0.4324628968626714,
- "top_1000": 0.7445989103888785
+ "top_100": 0.4347127360385697,
+ "top_1000": 0.7513057452792286
  },
- "hapax_count": 2245,
- "hapax_ratio": 0.644744399770247,
- "total_documents": 459
+ "hapax_count": 2128,
+ "hapax_ratio": 0.6444579043004239,
+ "total_documents": 429
  }
  }
models/word_markov/chy_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:386ae58f4271bb09f2a7300c1ffb43cb66268f9329c798213791212b7375c544
- size 86902
+ oid sha256:d274a31557ebe9914adbbba1a99399f013526f094054439fc9c1a835e8d5e6a4
+ size 82511
models/word_markov/chy_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "word",
  "language": "chy",
- "unique_contexts": 3383,
- "total_transitions": 10187
+ "unique_contexts": 3214,
+ "total_transitions": 9527
  }
models/word_markov/chy_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:25b29a5fd2a035a5761b624f9763e3b2f1bd7218a31608a0a6f94197490478db
- size 131833
+ oid sha256:72ec884b474155c4fb42b336ac8db08240a62ef9d9b96702036915eb8ece49a9
+ size 124517
models/word_markov/chy_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "word",
  "language": "chy",
- "unique_contexts": 6516,
- "total_transitions": 9728
+ "unique_contexts": 6126,
+ "total_transitions": 9098
  }
models/word_markov/chy_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b4b7d0cf8b279e25f0c1dd5d12c4331f6647670846d67a4e626aa888e8687564
- size 158524
+ oid sha256:17f7aa921fc6af5108a2d17c0dfb401db36aa8745424ebfb48f77007070d9508
+ size 149507
models/word_markov/chy_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "word",
  "language": "chy",
- "unique_contexts": 7515,
- "total_transitions": 9269
+ "unique_contexts": 7065,
+ "total_transitions": 8669
  }
models/word_markov/chy_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:46f60b22a8fcac07a7de898ad66d40c570966ec70737f00b5cf2d7b22bf2c6a2
- size 174153
+ oid sha256:459da114df8b2b97b442e07a5ee700b713cda57712639a7ac65a2bb20621e0d4
+ size 163772
models/word_markov/chy_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "word",
  "language": "chy",
- "unique_contexts": 7792,
- "total_transitions": 8810
+ "unique_contexts": 7317,
+ "total_transitions": 8240
  }
models/word_ngram/chy_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f3cf90b3734f097b7881f2bd6766a12f1c1a78a49ef2c9af43b8bb1c04f684de
- size 5216
+ oid sha256:d3c85c0ba694c7e59c5083db686db2a289b676f8dc72587178d4e8b7e405f9d5
+ size 5096
models/word_ngram/chy_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "word",
  "language": "chy",
- "unique_ngrams": 159,
- "total_ngrams": 10187
+ "unique_ngrams": 148,
+ "total_ngrams": 9527
  }