diff --git a/.gitattributes b/.gitattributes index 9800be970571345a39f1f48b1c34335a62f77c48..e6a8b279f21bb839e7c4d279a1e89ce9f06c5726 100644 --- a/.gitattributes +++ b/.gitattributes @@ -38,3 +38,4 @@ visualizations/performance_dashboard.png filter=lfs diff=lfs merge=lfs -text visualizations/tsne_sentences.png filter=lfs diff=lfs merge=lfs -text visualizations/tsne_words.png filter=lfs diff=lfs merge=lfs -text visualizations/zipf_law.png filter=lfs diff=lfs merge=lfs -text +visualizations/embedding_tsne_multilingual.png filter=lfs diff=lfs merge=lfs -text diff --git a/README.md b/README.md index bd3ffb591ca583ce362b3bb3d988db14cfbfc8a2..55e9579677ca80fc01d3a5ed07099920c014dde3 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ --- language: chy -language_name: CHY +language_name: Cheyenne language_family: american_algonquian tags: - wikilangs @@ -10,11 +10,21 @@ tags: - n-gram - markov - wikipedia + - feature-extraction + - sentence-similarity + - tokenization + - n-grams + - markov-chain + - text-mining + - fasttext + - babelvec + - vocabulous + - vocabulary - monolingual - family-american_algonquian license: mit library_name: wikilangs -pipeline_tag: feature-extraction +pipeline_tag: text-generation datasets: - omarkamali/wikipedia-monthly dataset_info: @@ -23,7 +33,7 @@ dataset_info: metrics: - name: best_compression_ratio type: compression - value: 3.497 + value: 3.494 - name: best_isotropy type: isotropy value: 0.0023 @@ -33,10 +43,10 @@ metrics: generated: 2026-01-03 --- -# CHY - Wikilangs Models +# Cheyenne - Wikilangs Models ## Comprehensive Research Report & Full Ablation Study -This repository contains NLP models trained and evaluated by Wikilangs, specifically on **CHY** Wikipedia data. +This repository contains NLP models trained and evaluated by Wikilangs, specifically on **Cheyenne** Wikipedia data. We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and word embeddings. 
## 📋 Repository Contents @@ -60,7 +70,7 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and - [3. Markov Chain Evaluation](#3-markov-chain-evaluation) - [4. Vocabulary Analysis](#4-vocabulary-analysis) - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation) -- [6. Morphological Analysis (Experimental)](#6-morphological-analysis) +- [6. Morphological Analysis (Experimental)](#6--morphological-analysis-experimental) - [7. Summary & Recommendations](#7-summary--recommendations) - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide) - [Visualizations Index](#visualizations-index) @@ -80,35 +90,35 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens | |------------|-------------|---------------|----------|--------------| -| **8k** | 3.497x 🏆 | 3.52 | 0.0960% | 19,785 | +| **8k** | 3.494x 🏆 | 3.52 | 0.1022% | 18,598 | ### Tokenization Examples Below are sample sentences tokenized with each vocabulary size: -**Sample 1:** `Vášêtaëno, Amâho'hestôtse (Pl: amâho'héstotôtse) Ama'éno'hamémôxe'êstoo'o Ama'én...` +**Sample 1:** `Vóo'kooma, vóo'ooma (Melanerpes erythrocephalus) ve'kêseho-éve. Tôhohko` | Vocab | Tokens | Count | |-------|--------|-------| -| 8k | `▁vášêtaëno , ▁amâho ' hestôtse ▁( pl : ▁amâho ' ... (+16 more)` | 26 | +| 8k | `▁vóo ' kooma , ▁vóo ' ooma ▁( melanerpes ▁erythrocephalus ... (+8 more)` | 18 | -**Sample 2:** `Ma'xeamóvôhtó'hestôtse, Éam-óvôhtó'heo'o. thumb|right thumb|right thumb|Daimler-...` +**Sample 2:** `Hestaahtsémeno (Ribes floridum), heso'xêhestaahtsémeno, na'éstse máhtáme.` | Vocab | Tokens | Count | |-------|--------|-------| -| 8k | `▁ma ' xeamóvôhtó ' hestôtse , ▁éam - óvôhtó ' ... (+16 more)` | 26 | +| 8k | `▁hestaahtsémeno ▁( ribes ▁floridum ), ▁heso ' xêhestaahtsémeno , ▁na ... 
(+4 more)` | 14 | -**Sample 3:** `Mâhpémo'éhe (Alces alces) máto héva popóhpoévêsémo'éhe váótséva-éve.` +**Sample 3:** `Vó'aehesanestôtse (vé'ho'énêstsestôtse: buckskin suit; "antelope-dress") Pl: vó'...` | Vocab | Tokens | Count | |-------|--------|-------| -| 8k | `▁mâhpémo ' éhe ▁( alces ▁alces ) ▁máto ▁héva ▁popóhpoévêsémo ... (+6 more)` | 16 | +| 8k | `▁vó ' aehesanestôtse ▁( vé ' ho ' énêstsestôtse : ... (+20 more)` | 30 | ### Key Findings -- **Best Compression:** 8k achieves 3.497x compression -- **Lowest UNK Rate:** 8k with 0.0960% unknown tokens +- **Best Compression:** 8k achieves 3.494x compression +- **Lowest UNK Rate:** 8k with 0.1022% unknown tokens - **Trade-off:** Larger vocabularies improve compression but increase model size - **Recommendation:** 32k vocabulary provides optimal balance for production use @@ -125,12 +135,14 @@ Below are sample sentences tokenized with each vocabulary size: | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage | |--------|---------|------------|---------|----------------|------------------|-------------------| -| **2-gram** | Word | 102 🏆 | 6.68 | 159 | 86.3% | 100.0% | -| **2-gram** | Subword | 330 | 8.37 | 871 | 59.3% | 100.0% | -| **3-gram** | Word | 156 | 7.28 | 245 | 72.6% | 100.0% | -| **3-gram** | Subword | 1,700 | 10.73 | 3,811 | 27.1% | 73.0% | -| **4-gram** | Word | 310 | 8.27 | 449 | 52.1% | 100.0% | -| **4-gram** | Subword | 4,072 | 11.99 | 8,559 | 18.3% | 52.6% | +| **2-gram** | Word | 98 🏆 | 6.62 | 148 | 88.0% | 100.0% | +| **2-gram** | Subword | 325 | 8.34 | 853 | 59.8% | 100.0% | +| **3-gram** | Word | 150 | 7.23 | 229 | 74.0% | 100.0% | +| **3-gram** | Subword | 1,635 | 10.67 | 3,634 | 27.6% | 73.9% | +| **4-gram** | Word | 301 | 8.23 | 420 | 52.7% | 100.0% | +| **4-gram** | Subword | 3,873 | 11.92 | 8,064 | 18.7% | 53.5% | +| **5-gram** | Word | 213 | 7.74 | 290 | 59.9% | 100.0% | +| **5-gram** | Subword | 4,512 | 12.14 | 8,516 | 17.1% | 49.3% | ### Top 5 
N-grams by Size @@ -138,68 +150,88 @@ Below are sample sentences tokenized with each vocabulary size: | Rank | N-gram | Count | |------|--------|-------| -| 1 | `na éstse` | 161 | +| 1 | `na éstse` | 140 | | 2 | `vé ho` | 119 | | 3 | `ho énêstsestôtse` | 72 | | 4 | `republic of` | 67 | -| 5 | `ho e` | 57 | +| 5 | `éstse manâhéno` | 55 | **3-grams (Word):** | Rank | N-gram | Count | |------|--------|-------| | 1 | `vé ho énêstsestôtse` | 72 | -| 2 | `na éstse manâhéno` | 56 | -| 3 | `ho e éve` | 49 | -| 4 | `na éstse ho` | 48 | -| 5 | `éstse ho e` | 48 | +| 2 | `na éstse manâhéno` | 55 | +| 3 | `ho honáéšé e` | 44 | +| 4 | `ho e éve` | 33 | +| 5 | `éstse ho e` | 32 | **4-grams (Word):** | Rank | N-gram | Count | |------|--------|-------| -| 1 | `éstse ho e éve` | 48 | -| 2 | `na éstse ho e` | 48 | +| 1 | `na éstse ho e` | 32 | +| 2 | `éstse ho e éve` | 32 | | 3 | `ma kaetaévôxe êstoo o` | 25 | | 4 | `toháano éve ho etse` | 23 | -| 5 | `na éstse manâhéno ho` | 22 | +| 5 | `manâhéno ho honáéšé e` | 22 | + +**5-grams (Word):** + +| Rank | N-gram | Count | +|------|--------|-------| +| 1 | `na éstse ho e éve` | 32 | +| 2 | `ho honáéšé e united states` | 22 | +| 3 | `éstse manâhéno ho honáéšé e` | 22 | +| 4 | `na éstse manâhéno ho honáéšé` | 22 | +| 5 | `manâhéno ho honáéšé e united` | 21 | **2-grams (Subword):** | Rank | N-gram | Count | |------|--------|-------| -| 1 | `e _` | 1,534 | -| 2 | `s e` | 1,395 | -| 3 | `s t` | 1,310 | -| 4 | `t s` | 1,310 | -| 5 | `h e` | 1,012 | +| 1 | `e _` | 1,450 | +| 2 | `s e` | 1,334 | +| 3 | `s t` | 1,269 | +| 4 | `t s` | 1,249 | +| 5 | `h e` | 974 | **3-grams (Subword):** | Rank | N-gram | Count | |------|--------|-------| -| 1 | `t s e` | 1,002 | -| 2 | `s e _` | 580 | -| 3 | `e s t` | 468 | -| 4 | `s t s` | 459 | -| 5 | `h o '` | 443 | +| 1 | `t s e` | 956 | +| 2 | `s e _` | 548 | +| 3 | `e s t` | 461 | +| 4 | `s t s` | 436 | +| 5 | `h o '` | 420 | **4-grams (Subword):** | Rank | N-gram | Count | |------|--------|-------| -| 1 | 
`t s e _` | 456 | -| 2 | `s t s e` | 436 | -| 3 | `ô t s e` | 287 | -| 4 | `t ô t s` | 208 | -| 5 | `e s t ô` | 198 | +| 1 | `t s e _` | 427 | +| 2 | `s t s e` | 413 | +| 3 | `ô t s e` | 276 | +| 4 | `t ô t s` | 204 | +| 5 | `e s t ô` | 194 | + +**5-grams (Subword):** + +| Rank | N-gram | Count | +|------|--------|-------| +| 1 | `s t s e _` | 216 | +| 2 | `t ô t s e` | 203 | +| 3 | `s t ô t s` | 190 | +| 4 | `e s t ô t` | 190 | +| 5 | `ê s t s e` | 170 | ### Key Findings -- **Best Perplexity:** 2-gram (word) with 102 +- **Best Perplexity:** 2-gram (word) with 98 - **Entropy Trend:** Decreases with larger n-grams (more predictable) -- **Coverage:** Top-1000 patterns cover ~53% of corpus +- **Coverage:** Top-1000 patterns cover ~49% of corpus - **Recommendation:** 4-gram or 5-gram for best predictive performance --- @@ -215,14 +247,14 @@ Below are sample sentences tokenized with each vocabulary size: | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability | |---------|---------|-------------|------------|------------------|-----------------|----------------| -| **1** | Word | 0.4118 | 1.330 | 2.00 | 3,383 | 58.8% | -| **1** | Subword | 1.3726 | 2.589 | 9.72 | 175 | 0.0% | -| **2** | Word | 0.1118 | 1.081 | 1.20 | 6,516 | 88.8% | -| **2** | Subword | 1.2032 | 2.303 | 5.05 | 1,699 | 0.0% | -| **3** | Word | 0.0474 | 1.033 | 1.08 | 7,515 | 95.3% | -| **3** | Subword | 0.6524 | 1.572 | 2.34 | 8,541 | 34.8% | -| **4** | Word | 0.0269 🏆 | 1.019 | 1.04 | 7,792 | 97.3% | -| **4** | Subword | 0.2844 | 1.218 | 1.44 | 19,944 | 71.6% | +| **1** | Word | 0.4049 | 1.324 | 1.97 | 3,214 | 59.5% | +| **1** | Subword | 1.3402 | 2.532 | 9.42 | 172 | 0.0% | +| **2** | Word | 0.1099 | 1.079 | 1.20 | 6,126 | 89.0% | +| **2** | Subword | 1.2169 | 2.324 | 5.05 | 1,620 | 0.0% | +| **3** | Word | 0.0453 | 1.032 | 1.08 | 7,065 | 95.5% | +| **3** | Subword | 0.6471 | 1.566 | 2.32 | 8,158 | 35.3% | +| **4** | Word | 0.0256 🏆 | 1.018 | 1.04 | 7,317 | 
97.4% | +| **4** | Subword | 0.2799 | 1.214 | 1.44 | 18,852 | 72.0% | ### Generated Text Samples (Word-based) @@ -230,27 +262,27 @@ Below are text samples generated from each word-based Markov chain model: **Context Size 1:** -1. `e he óonéma enóne éohkê héška ó he hohamháa há continentan naa nêhéóhe násáahéne enomóvóhe tsé` -2. `ho honáemanėstóoseo o united states manâhestôtse 1 188 lwanda 100px mogadishu somali shilling swahil...` -3. `o vé keehoohtsêstse vó kaehevôtse vé ho énêstsestôtse billingscheyenne english dictionary chief dull...` +1. `e éve ho honáéšé e cfa ma kaetaévôxe êstoo o toháano éve hóxovê hooma naa kánome` +2. `ho éstova éhe nėstaane néstse vóonotse 30 hestáotse naa unie van zuid afrika hotómá e great` +3. `o gdp ppp 72 7 afrikaans vé ho etse 56 785 6 coloured 9 indian tséh` **Context Size 2:** -1. `na éstse ho e éve asia center thumb handelskade in willemstad curaçao` -2. `vé ho énêstsestôtse black hills ho honáéšé e missouri ó he e pónoeo hé e na éstse` -3. `ho énêstsestôtse cimarron river bull river forgan heévȧhetanéno` +1. `na éstse manâhéno ho honáéšé e vehicle license kȧhkoetohko prefix 29 hotómá e mo hetaneho e hánêsóvó...` +2. `vé ho énestse 71 740 6 144 562 903 somali federal republic of the congo congo kinshasa` +3. `ho énêstsestôtse wyolacheyenne english dictionarychief dull knife college hoig stan the peace chiefs...` **Context Size 3:** -1. `vé ho énêstsestôtse bay horse variant tsé vó névóvâtse` -2. `na éstse manâhéno ho honáéšé e united states óoetaneo o óoetanéno tsé amo eétâhéstove vé ho énêstses...` -3. `ho e éve hóxovê hooma center frameless upright 1 5` +1. `vé ho énêstsestôtse airplane this is` +2. `na éstse manâhéno china republic of china republic of china republic of china republic of china repu...` +3. `ho honáéšé e native news project` **Context Size 4:** -1. `éstse ho e éve meško` -2. `na éstse ho e éve vietnam dong hoi airport` -3. 
`ma kaetaévôxe êstoo o sango toháano éve ho etse 622 984 4 216 666 1 198 chad republic of` +1. `éstse ho e éve vietnam dong hoi airport` +2. `na éstse ho e éve united states states of america` +3. `ma kaetaévôxe êstoo o toháano éve ho etse 322 460 1 600 democratic republic of the congo of the` ### Generated Text Samples (Subword-based) @@ -259,34 +291,34 @@ Below are text samples generated from each subword-based Markov chain model: **Context Size 1:** -1. `e_a_29150002)_te` -2. `_rkul_mâxpoeése.` -3. `a_xema'ėh-évese:` +1. `etokfive_piente'` +2. `_t:_manésé'e'e,_` +3. `aliotse'étinoo's` **Context Size 2:** -1. `e_(vé'še'tó'neadd` -2. `seotó'o_poestôtse` -3. `ts_wymnetugual_ju` +1. `e_100px_minestȯts` +2. `se_cre_manéó'ho'ô` +3. `stanjunt.thumb_la` **Context Size 3:** -1. `tse._vé'ho'hé'e_bo` -2. `se_rokese_mâhestȯt` -3. `estôtse_manâhá'e_(` +1. `tse_(lephonáéšé'e,` +2. `se_odom_capid_city` +3. `estôtsestôtsestôts` **Context Size 4:** -1. `tse_hotómá'e_12_évȯ` -2. `stseévenomo_hovanan` -3. `ôtse:_ten_sage";_ar` +1. `tse_évȯhkėha'etaneh` +2. `stsestȯtse_kóhkonôh` +3. 
`ôtsenáesëö'o_môxeov` ### Key Findings -- **Best Predictability:** Context-4 (word) with 97.3% predictability +- **Best Predictability:** Context-4 (word) with 97.4% predictability - **Branching Factor:** Decreases with context size (more deterministic) -- **Memory Trade-off:** Larger contexts require more storage (19,944 contexts) +- **Memory Trade-off:** Larger contexts require more storage (18,852 contexts) - **Recommendation:** Context-3 or Context-4 for text generation --- @@ -302,64 +334,64 @@ Below are text samples generated from each subword-based Markov chain model: | Metric | Value | |--------|-------| -| Vocabulary Size | 1,237 | -| Total Tokens | 8,401 | -| Mean Frequency | 6.79 | +| Vocabulary Size | 1,174 | +| Total Tokens | 7,828 | +| Mean Frequency | 6.67 | | Median Frequency | 3 | -| Frequency Std Dev | 21.86 | +| Frequency Std Dev | 21.01 | ### Most Common Words | Rank | Word | Frequency | |------|------|-----------| -| 1 | e | 431 | -| 2 | ho | 377 | -| 3 | o | 237 | -| 4 | na | 165 | -| 5 | vé | 161 | -| 6 | éstse | 161 | -| 7 | éve | 154 | -| 8 | of | 118 | -| 9 | he | 108 | -| 10 | naa | 108 | +| 1 | e | 407 | +| 2 | ho | 351 | +| 3 | o | 229 | +| 4 | vé | 159 | +| 5 | na | 144 | +| 6 | éstse | 140 | +| 7 | éve | 133 | +| 8 | of | 117 | +| 9 | naa | 104 | +| 10 | he | 103 | ### Least Common Words (from vocabulary) | Rank | Word | Frequency | |------|------|-----------| -| 1 | mustangs | 2 | -| 2 | sevonévo | 2 | -| 3 | ėstovátamevéotse | 2 | -| 4 | ėstova | 2 | -| 5 | nėstse | 2 | -| 6 | kūnas | 2 | -| 7 | epsteins | 2 | -| 8 | ir | 2 | -| 9 | felon | 2 | -| 10 | immigrants | 2 | +| 1 | pack | 2 | +| 2 | evenóse | 2 | +| 3 | mountain | 2 | +| 4 | cal | 2 | +| 5 | poly | 2 | +| 6 | mustangs | 2 | +| 7 | sevonévo | 2 | +| 8 | ėstovátamevéotse | 2 | +| 9 | ėstova | 2 | +| 10 | nėstse | 2 | ### Zipf's Law Analysis | Metric | Value | |--------|-------| -| Zipf Coefficient | 0.8151 | -| R² (Goodness of Fit) | 0.976018 | +| Zipf Coefficient | 0.8142 
| +| R² (Goodness of Fit) | 0.973597 | | Adherence Quality | **excellent** | ### Coverage Analysis | Top N Words | Coverage | |-------------|----------| -| Top 100 | 54.8% | -| Top 1,000 | 94.4% | +| Top 100 | 55.3% | +| Top 1,000 | 95.6% | | Top 5,000 | 0.0% | | Top 10,000 | 0.0% | ### Key Findings -- **Zipf Compliance:** R²=0.9760 indicates excellent adherence to Zipf's law -- **High Frequency Dominance:** Top 100 words cover 54.8% of corpus -- **Long Tail:** -8,763 words needed for remaining 100.0% coverage +- **Zipf Compliance:** R²=0.9736 indicates excellent adherence to Zipf's law +- **High Frequency Dominance:** Top 100 words cover 55.3% of corpus +- **Long Tail:** -8,826 words needed for remaining 100.0% coverage --- ## 5. Word Embeddings Evaluation @@ -375,37 +407,40 @@ Below are text samples generated from each subword-based Markov chain model: ### 5.1 Cross-Lingual Alignment -> *Note: Multilingual alignment visualization not available for this language.* +![Alignment Quality](visualizations/embedding_alignment_quality.png) + +![Multilingual t-SNE](visualizations/embedding_tsne_multilingual.png) ### 5.2 Model Comparison | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 | |-------|-----------|----------|------------------|---------------|----------------| -| **mono_32d** | 32 | 0.0023 🏆 | 0.8533 | N/A | N/A | -| **mono_64d** | 64 | 0.0008 | 0.9264 | N/A | N/A | -| **mono_128d** | 128 | 0.0002 | 0.9821 | N/A | N/A | +| **mono_32d** | 32 | 0.0023 🏆 | 0.8896 | N/A | N/A | +| **mono_64d** | 64 | 0.0007 | 0.9590 | N/A | N/A | +| **mono_128d** | 128 | 0.0002 | 0.9907 | N/A | N/A | +| **aligned_32d** | 32 | 0.0023 | 0.8896 | 0.0513 | 0.2179 | +| **aligned_64d** | 64 | 0.0007 | 0.9590 | 0.0385 | 0.1795 | +| **aligned_128d** | 128 | 0.0002 | 0.9907 | 0.0128 | 0.1667 | ### Key Findings - **Best Isotropy:** mono_32d with 0.0023 (more uniform distribution) -- **Semantic Density:** Average pairwise similarity of 0.9206. 
Lower values indicate better semantic separation. -- **Alignment Quality:** No aligned models evaluated in this run. +- **Semantic Density:** Average pairwise similarity of 0.9464. Lower values indicate better semantic separation. +- **Alignment Quality:** Aligned models achieve up to 5.1% R@1 in cross-lingual retrieval. - **Recommendation:** 128d aligned for best cross-lingual performance --- ## 6. Morphological Analysis (Experimental) -> ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages. - This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data. ### 6.1 Productivity & Complexity | Metric | Value | Interpretation | Recommendation | |--------|-------|----------------|----------------| -| Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable | -| Idiomaticity Gap | **-1.000** | Low formulaic content | - | +| Productivity Index | **5.000** | High morphological productivity | Reliable analysis | +| Idiomaticity Gap | **1.027** | High formulaic/idiomatic content | - | ### 6.2 Affix Inventory (Productive Units) @@ -414,18 +449,18 @@ These are the most productive prefixes and suffixes identified by sampling the v #### Productive Prefixes | Prefix | Examples | |--------|----------| -| `-ho` | horse, hotoa, hoohëö | +| `-ho` | hotóao, hohtóvá, hoéstónéó | #### Productive Suffixes | Suffix | Examples | |--------|----------| -| `-e` | néstovátamevéotse, kôhtse, where | -| `-se` | néstovátamevéotse, kôhtse, kaehevotôtse | -| `-tse` | néstovátamevéotse, kôhtse, kaehevotôtse | -| `-ne` | mâhoestôtsene, kane, mâheóne | -| `-ôtse` | kaehevotôtse, oestôtse, xemenôtse | -| `-ia` | 
anastacia, abkhazia, shepherdia | -| `-ve` | êstonêstove, native, hestoháatamaahéstove | +| `-e` | ôhkêhenove, háahpe, manâhestôtse | +| `-se` | manâhestôtse, tsétsêhéstâhese, xaénéhetse | +| `-tse` | manâhestôtse, xaénéhetse, ôhnéménêstse | +| `-ôtse` | manâhestôtse, mâhoestôtse, ôtse | +| `-ne` | lione, mâhoestôtsene, nemâhmoteone | +| `-ve` | ôhkêhenove, ôhkemôxeonêstove, kêsaéve | +| `-ia` | alnifolia, austria, nitsvia | ### 6.3 Bound Stems (Lexical Roots) @@ -440,9 +475,9 @@ This table shows which prefixes and suffixes most frequently co-occur on the sam | Prefix | Suffix | Frequency | Examples | |--------|--------|-----------|----------| -| `-ho` | `-e` | 5 words | horse, hováhne | -| `-ho` | `-ne` | 2 words | hováhne, hovahne | -| `-ho` | `-se` | 1 words | horse, hotse | +| `-ho` | `-e` | 5 words | house, hovahne | +| `-ho` | `-ne` | 2 words | hovahne, hovane | +| `-ho` | `-se` | 1 words | house, hotse | | `-ho` | `-tse` | 1 words | hotse, hohpâhtsenámenôtse | | `-ho` | `-ôtse` | 1 words | hohpâhtsenámenôtse | @@ -454,24 +489,26 @@ Using **Recursive Hierarchical Substitutability**, we decompose complex words in |------|-----------------|------------|------| | mâhoestôtsene | **`mâhoest-ôtse-ne`** | 3.0 | `mâhoest` | | sevoneóneve | **`sevoneó-ne-ve`** | 3.0 | `sevoneó` | -| náhkȯhehetanetse | **`náhkȯheheta-ne-tse`** | 3.0 | `náhkȯheheta` | -| enóseoneve | **`enóseo-ne-ve`** | 3.0 | `enóseo` | | éestsėstóseoneve | **`éestsėstóseo-ne-ve`** | 3.0 | `éestsėstóseo` | -| kaehevotôtse | **`kaehevot-ôtse`** | 1.5 | `kaehevot` | -| anastacia | **`anastac-ia`** | 1.5 | `anastac` | -| névóvâtse | **`névóvâ-tse`** | 1.5 | `névóvâ` | +| enóseoneve | **`enóseo-ne-ve`** | 3.0 | `enóseo` | +| náhkȯhehetanetse | **`náhkȯheheta-ne-tse`** | 3.0 | `náhkȯheheta` | +| ôhkêhenove | **`ôhkêheno-ve`** | 1.5 | `ôhkêheno` | +| manâhestôtse | **`manâhest-ôtse`** | 1.5 | `manâhest` | +| alnifolia | **`alnifol-ia`** | 1.5 | `alnifol` | +| ôhkemôxeonêstove | **`ôhkemôxeonêsto-ve`** | 1.5 
| `ôhkemôxeonêsto` | +| hoéstónéó | **`ho-éstónéó`** | 1.5 | `éstónéó` | +| nemâhmoteone | **`nemâhmoteo-ne`** | 1.5 | `nemâhmoteo` | +| tsétsêhéstâhese | **`tsétsêhéstâhe-se`** | 1.5 | `tsétsêhéstâhe` | +| australia | **`austral-ia`** | 1.5 | `austral` | | shepherdia | **`shepherd-ia`** | 1.5 | `shepherd` | -| êstonêstove | **`êstonêsto-ve`** | 1.5 | `êstonêsto` | -| yellowstone | **`yellowsto-ne`** | 1.5 | `yellowsto` | -| hoohtseto | **`ho-ohtseto`** | 1.5 | `ohtseto` | -| xemenôtse | **`xemen-ôtse`** | 1.5 | `xemen` | -| véhonevoemėstse | **`véhonevoemės-tse`** | 1.5 | `véhonevoemės` | -| manestôtse | **`manest-ôtse`** | 1.5 | `manest` | +| xaénéhetse | **`xaénéhe-tse`** | 1.5 | `xaénéhe` | ### 6.6 Linguistic Interpretation > **Automated Insight:** -The language CHY appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes. +The language Cheyenne shows high morphological productivity. The subword models are significantly more efficient than word models, suggesting a rich system of affixation or compounding. + +> **Note on Idiomaticity:** The high Idiomaticity Gap suggests a large number of frequent multi-word expressions or formulaic sequences that are statistically distinct from their component parts. --- ## 7. Summary & Recommendations @@ -482,9 +519,9 @@ The language CHY appears to be more isolating or has a highly fixed vocabulary. 
| Component | Recommended | Rationale | |-----------|-------------|-----------| -| Tokenizer | **8k BPE** | Best compression (3.50x) | -| N-gram | **2-gram** | Lowest perplexity (102) | -| Markov | **Context-4** | Highest predictability (97.3%) | +| Tokenizer | **8k BPE** | Best compression (3.49x) | +| N-gram | **2-gram** | Lowest perplexity (98) | +| Markov | **Context-4** | Highest predictability (97.4%) | | Embeddings | **100d** | Balanced semantic capture and isotropy | @@ -698,4 +735,4 @@ MIT License - Free for academic and commercial use. --- *Generated by Wikilangs Models Pipeline* -*Report Date: 2026-01-03 10:13:21* +*Report Date: 2026-01-03 20:28:03* diff --git a/models/embeddings/aligned/chy_128d.bin b/models/embeddings/aligned/chy_128d.bin new file mode 100644 index 0000000000000000000000000000000000000000..1031dfff283cc2105cc3cc3d79b47b472034a081 --- /dev/null +++ b/models/embeddings/aligned/chy_128d.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4fcc832a966fb87260314fdde26c7aaca3eb6f87727d6172f71003e9cbd2c5eb +size 1024154505 diff --git a/models/embeddings/aligned/chy_128d.meta.json b/models/embeddings/aligned/chy_128d.meta.json new file mode 100644 index 0000000000000000000000000000000000000000..0cc7ab9163aff93d4bec4eef15e31a17df149a99 --- /dev/null +++ b/models/embeddings/aligned/chy_128d.meta.json @@ -0,0 +1 @@ +{"lang": "chy", "dim": 128, "max_seq_len": 512, "is_aligned": true} \ No newline at end of file diff --git a/models/embeddings/aligned/chy_128d.projection.npy b/models/embeddings/aligned/chy_128d.projection.npy new file mode 100644 index 0000000000000000000000000000000000000000..2a58b4f14c96c80600266ff8dedba072018b4c88 --- /dev/null +++ b/models/embeddings/aligned/chy_128d.projection.npy @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5c45ba64aef67dfa385dc9f8ca17d2d96ad776d8ccbe75505f9e1f6875aad67b +size 65664 diff --git a/models/embeddings/aligned/chy_128d_metadata.json 
b/models/embeddings/aligned/chy_128d_metadata.json
new file mode 100644
index 0000000000000000000000000000000000000000..41f4309899d622906546dc247928c8fb5e395c0a
--- /dev/null
+++ b/models/embeddings/aligned/chy_128d_metadata.json
@@ -0,0 +1,8 @@
+{
+ "language": "chy",
+ "dimension": 128,
+ "version": "aligned",
+ "hub_language": "en",
+ "seed_vocab_size": 78,
+ "vocab_size": 148
+}
\ No newline at end of file
diff --git a/models/embeddings/aligned/chy_32d.bin b/models/embeddings/aligned/chy_32d.bin
new file mode 100644
index 0000000000000000000000000000000000000000..c3ab75c93c9240e7ca69ccded7971412c199a541
--- /dev/null
+++ b/models/embeddings/aligned/chy_32d.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:207f94e4c6d8a4f9a8bc11e98f994580ddfbed45c5a57f6cc17e016f113b2fa3
+size 256040841
diff --git a/models/embeddings/aligned/chy_32d.meta.json b/models/embeddings/aligned/chy_32d.meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..4efe08fa710a3e07a8b6a51253e381b604c5effd
--- /dev/null
+++ b/models/embeddings/aligned/chy_32d.meta.json
@@ -0,0 +1 @@
+{"lang": "chy", "dim": 32, "max_seq_len": 512, "is_aligned": true}
\ No newline at end of file
diff --git a/models/embeddings/aligned/chy_32d.projection.npy b/models/embeddings/aligned/chy_32d.projection.npy
new file mode 100644
index 0000000000000000000000000000000000000000..e643e1281535a2f77a727067b6b2a16a8e6d4b66
--- /dev/null
+++ b/models/embeddings/aligned/chy_32d.projection.npy
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bf25a3e94561de4f68e7c3c373a0986dd3586b6921811406570a582aa2c047ad
+size 4224
diff --git a/models/embeddings/aligned/chy_32d_metadata.json b/models/embeddings/aligned/chy_32d_metadata.json
new file mode 100644
index 0000000000000000000000000000000000000000..6ed51c5c2ee803d19ecb72fdf49940c40fedb6b9
--- /dev/null
+++ b/models/embeddings/aligned/chy_32d_metadata.json
@@ -0,0 +1,8 @@
+{
+ "language": "chy",
+ "dimension": 32,
+ "version": "aligned",
+ "hub_language": "en",
+ "seed_vocab_size": 78,
+ "vocab_size": 148
+}
\ No newline at end of file
diff --git a/models/embeddings/aligned/chy_64d.bin b/models/embeddings/aligned/chy_64d.bin
new file mode 100644
index 0000000000000000000000000000000000000000..09de7c5b72642e5ebe7f3c2c1e1a9b58ee9a0727
--- /dev/null
+++ b/models/embeddings/aligned/chy_64d.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f311db029de4600904491344614a6a0e06edf2dfd38492760620dcec861d4e73
+size 512078729
diff --git a/models/embeddings/aligned/chy_64d.meta.json b/models/embeddings/aligned/chy_64d.meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..7fee8f8227971e0a82d57d76f3e14165ac86c10e
--- /dev/null
+++ b/models/embeddings/aligned/chy_64d.meta.json
@@ -0,0 +1 @@
+{"lang": "chy", "dim": 64, "max_seq_len": 512, "is_aligned": true}
\ No newline at end of file
diff --git a/models/embeddings/aligned/chy_64d.projection.npy b/models/embeddings/aligned/chy_64d.projection.npy
new file mode 100644
index 0000000000000000000000000000000000000000..42b38e37535880bab82c061214fece9af9fce1ab
--- /dev/null
+++ b/models/embeddings/aligned/chy_64d.projection.npy
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5bade9f56bce2587b66a3e08b84b9549bcdb85c4f121a359ad5166f3f0e2fcf0
+size 16512
diff --git a/models/embeddings/aligned/chy_64d_metadata.json b/models/embeddings/aligned/chy_64d_metadata.json
new file mode 100644
index 0000000000000000000000000000000000000000..7e84b2dac0543b544ae79e885b11b11934cf6192
--- /dev/null
+++ b/models/embeddings/aligned/chy_64d_metadata.json
@@ -0,0 +1,8 @@
+{
+ "language": "chy",
+ "dimension": 64,
+ "version": "aligned",
+ "hub_language": "en",
+ "seed_vocab_size": 78,
+ "vocab_size": 148
+}
\ No newline at end of file
diff --git a/models/embeddings/monolingual/chy_128d.bin b/models/embeddings/monolingual/chy_128d.bin
index cbbc1fbaa3af948312fcbc5028d7213731c479fd..1031dfff283cc2105cc3cc3d79b47b472034a081 100644
--- a/models/embeddings/monolingual/chy_128d.bin
+++ b/models/embeddings/monolingual/chy_128d.bin
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fe82c87b0cea7042a6a16f85cf1875011ca32f13b8c941e25d5c1d57849b2eed
-size 1024170112
+oid sha256:4fcc832a966fb87260314fdde26c7aaca3eb6f87727d6172f71003e9cbd2c5eb
+size 1024154505
diff --git a/models/embeddings/monolingual/chy_128d_metadata.json b/models/embeddings/monolingual/chy_128d_metadata.json
index 7cb412f2a05bf300795a1bd9c0deb3cc791e4e3a..19d819c98a9353592456ea6a6a83b4abd8fdb4d0 100644
--- a/models/embeddings/monolingual/chy_128d_metadata.json
+++ b/models/embeddings/monolingual/chy_128d_metadata.json
@@ -11,5 +11,5 @@
 "encoding_method": "rope",
 "dim": 128
 },
- "vocab_size": 163
+ "vocab_size": 148
 }
\ No newline at end of file
diff --git a/models/embeddings/monolingual/chy_32d.bin b/models/embeddings/monolingual/chy_32d.bin
index 53109a70fa560ac8e5056453088c57c9e63824b2..c3ab75c93c9240e7ca69ccded7971412c199a541 100644
--- a/models/embeddings/monolingual/chy_32d.bin
+++ b/models/embeddings/monolingual/chy_32d.bin
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:70c8741bb1d28e9983cdf32850bd345db541b195a40f92c1ebc8ba49374864be
-size 256044928
+oid sha256:207f94e4c6d8a4f9a8bc11e98f994580ddfbed45c5a57f6cc17e016f113b2fa3
+size 256040841
diff --git a/models/embeddings/monolingual/chy_32d_metadata.json b/models/embeddings/monolingual/chy_32d_metadata.json
index 8703755715898037e70b71b4a446ec1888736501..4d4989d1b52b8c0ed8ae8329e76b3b0427f5122f 100644
--- a/models/embeddings/monolingual/chy_32d_metadata.json
+++ b/models/embeddings/monolingual/chy_32d_metadata.json
@@ -11,5 +11,5 @@
 "encoding_method": "rope",
 "dim": 32
 },
- "vocab_size": 163
+ "vocab_size": 148
 }
\ No newline at end of file
diff --git a/models/embeddings/monolingual/chy_64d.bin b/models/embeddings/monolingual/chy_64d.bin
index 6323ef4dce20317572c8cdad66ae99dc2f9b2570..09de7c5b72642e5ebe7f3c2c1e1a9b58ee9a0727 100644
--- a/models/embeddings/monolingual/chy_64d.bin
+++ b/models/embeddings/monolingual/chy_64d.bin
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:235dc5785bb5f5c2123f5d894c3c7266a0f239882928a7be48b4ddbb112ff63d
-size 512086656
+oid sha256:f311db029de4600904491344614a6a0e06edf2dfd38492760620dcec861d4e73
+size 512078729
diff --git a/models/embeddings/monolingual/chy_64d_metadata.json b/models/embeddings/monolingual/chy_64d_metadata.json
index a5bb0711d4a69c2409647310e3406b462d88b0c6..01793da7f1d6e37cb20c97f81473e134a333ee4f 100644
--- a/models/embeddings/monolingual/chy_64d_metadata.json
+++ b/models/embeddings/monolingual/chy_64d_metadata.json
@@ -11,5 +11,5 @@
 "encoding_method": "rope",
 "dim": 64
 },
- "vocab_size": 163
+ "vocab_size": 148
 }
\ No newline at end of file
diff --git a/models/subword_markov/chy_markov_ctx1_subword.parquet b/models/subword_markov/chy_markov_ctx1_subword.parquet
index b65525baa7d3f7cd71d46cb27241c8622e7e0004..c8649d6a0a572760484b8d8db88de4455bb662d4 100644
--- a/models/subword_markov/chy_markov_ctx1_subword.parquet
+++ b/models/subword_markov/chy_markov_ctx1_subword.parquet
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d18af851c234d84f1c8c493282bad10c2ce3cf38e5f884b0def3b69a8cc46a72
-size 16739
+oid sha256:1c35468a046d4cdb8600723c2cc9f33f643f687ec3db096b191c13c05400bf4a
+size 15097
diff --git a/models/subword_markov/chy_markov_ctx1_subword_metadata.json b/models/subword_markov/chy_markov_ctx1_subword_metadata.json
index 61c636a7b2c4bed78761d65a9eb60dcffce0edaa..9be4e37dee6ca84564065d9393a80e91330054eb 100644
--- a/models/subword_markov/chy_markov_ctx1_subword_metadata.json
+++ b/models/subword_markov/chy_markov_ctx1_subword_metadata.json
@@ -2,6 +2,6 @@
 "context_size": 1,
 "variant": "subword",
 "language": "chy",
- "unique_contexts": 175,
- "total_transitions": 68722
+ "unique_contexts": 172,
+ "total_transitions": 64543
 }
\ No newline at end of file
diff --git a/models/subword_markov/chy_markov_ctx2_subword.parquet b/models/subword_markov/chy_markov_ctx2_subword.parquet
index b51219dc720f154bc0467c8881e036051db8889d..f2b810492310f15023df5ab0068bc2af411a2926 100644
--- a/models/subword_markov/chy_markov_ctx2_subword.parquet
+++ b/models/subword_markov/chy_markov_ctx2_subword.parquet
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8ecd88390cc78af12a57e86676cd4ac3d721a2a3b6105dedf56498e4f9198ba6
-size 55410
+oid sha256:68f0dc3358f63ea1ede0f6586e8b16d1fd372465d1bc2955b2c27e2422d47ff4
+size 53541
diff --git a/models/subword_markov/chy_markov_ctx2_subword_metadata.json b/models/subword_markov/chy_markov_ctx2_subword_metadata.json
index 5abcbba2adee7b69c55772e0e2126f9d51f9b219..0ccd677008944c0c0e09be71e0b0e9818d4fa9e2 100644
--- a/models/subword_markov/chy_markov_ctx2_subword_metadata.json
+++ b/models/subword_markov/chy_markov_ctx2_subword_metadata.json
@@ -2,6 +2,6 @@
 "context_size": 2,
 "variant": "subword",
 "language": "chy",
- "unique_contexts": 1699,
- "total_transitions": 68263
+ "unique_contexts": 1620,
+ "total_transitions": 64114
 }
\ No newline at end of file
diff --git a/models/subword_markov/chy_markov_ctx3_subword.parquet b/models/subword_markov/chy_markov_ctx3_subword.parquet
index cc95a10d7c95dacb2a56d9e697939efcd6c07628..2eaa11531cb7c298f239d7a940b6781f83a7e8e9 100644
--- a/models/subword_markov/chy_markov_ctx3_subword.parquet
+++ b/models/subword_markov/chy_markov_ctx3_subword.parquet
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fd7301d331a8f6e5cf3904b5bd4befc8f3edf1b29659b35c4c4df3b64b73ed4d
-size 150466
+oid sha256:8d479318e787a4b6b2b6582e01bf5581981ed29a293db4c95bfe4803a02ceac4
+size 141411
diff --git a/models/subword_markov/chy_markov_ctx3_subword_metadata.json b/models/subword_markov/chy_markov_ctx3_subword_metadata.json
index c24141efc8c80b4b285006966f092c5e10f45581..badac1b9afba30ae1f496bb00a1f6259ebc9150b 100644
--- a/models/subword_markov/chy_markov_ctx3_subword_metadata.json
+++ b/models/subword_markov/chy_markov_ctx3_subword_metadata.json
@@ -2,6 +2,6 @@
 "context_size": 3,
 "variant": "subword",
 "language": "chy",
- "unique_contexts": 8541,
- "total_transitions": 67804
+ "unique_contexts": 8158,
+ "total_transitions": 63685
 }
\ No newline at end of file
diff --git a/models/subword_markov/chy_markov_ctx4_subword.parquet b/models/subword_markov/chy_markov_ctx4_subword.parquet
index dc790e6a616230ca8a03aa4bb3c1f5ed316a9f83..72cef3c4a1772ef98d1c5cccfb4e362c5cb35d67 100644
--- a/models/subword_markov/chy_markov_ctx4_subword.parquet
+++ b/models/subword_markov/chy_markov_ctx4_subword.parquet
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3e44005f2ed1b58ece31011bfe69d86d11229eda4839c36faadb0b8ac3af5479
-size 281701
+oid sha256:2a00c38db3647e5284daec765ca6502387e939c467196ad926faa95198224d6c
+size 266437
diff --git a/models/subword_markov/chy_markov_ctx4_subword_metadata.json b/models/subword_markov/chy_markov_ctx4_subword_metadata.json
index c58d3718779bc58a7413fb367c8b4c8ef76e37a3..9abb4754b6a9546ff3181a76a22bc707aa285d0b 100644
--- a/models/subword_markov/chy_markov_ctx4_subword_metadata.json
+++ b/models/subword_markov/chy_markov_ctx4_subword_metadata.json
@@ -2,6 +2,6 @@
 "context_size": 4,
 "variant": "subword",
 "language": "chy",
- "unique_contexts": 19944,
- "total_transitions": 67345
+ "unique_contexts": 18852,
+ "total_transitions": 63256
 }
\ No newline at end of file
diff --git a/models/subword_ngram/chy_2gram_subword.parquet b/models/subword_ngram/chy_2gram_subword.parquet
index eeb1a3ba644fcd9c5a6fe84b4530a5066de6cb7a..c112cec4b2f42ca862632a006fad1c4c469a2476 100644
--- a/models/subword_ngram/chy_2gram_subword.parquet
+++ b/models/subword_ngram/chy_2gram_subword.parquet
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid
sha256:0228df27f63ed413c576b4cbe4ab6f76c9785023f717763c1927ee318cba7061 -size 11557 +oid sha256:34dc6910378237012a37e238d7df974f017c91b1c3ebd46bf8d76be772f32ac2 +size 11321 diff --git a/models/subword_ngram/chy_2gram_subword_metadata.json b/models/subword_ngram/chy_2gram_subword_metadata.json index 7d60949de5fe8c91d4b002d324ae6d41b1adc0bd..3e8c54dcca833ee43c4001d7e6f3cb9d14cb757b 100644 --- a/models/subword_ngram/chy_2gram_subword_metadata.json +++ b/models/subword_ngram/chy_2gram_subword_metadata.json @@ -2,6 +2,6 @@ "n": 2, "variant": "subword", "language": "chy", - "unique_ngrams": 871, - "total_ngrams": 68722 + "unique_ngrams": 853, + "total_ngrams": 64543 } \ No newline at end of file diff --git a/models/subword_ngram/chy_3gram_subword.parquet b/models/subword_ngram/chy_3gram_subword.parquet index 9f30cbda3681577623a233a22ba3ab9e73d419ca..e4ba1183ea716c3b0addef454e62e338b446c5ca 100644 --- a/models/subword_ngram/chy_3gram_subword.parquet +++ b/models/subword_ngram/chy_3gram_subword.parquet @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:c6c8121df7c4fff9c78c7deca418c3eaecb7394cf750b4681239ca320d9977e8 -size 41181 +oid sha256:c62fad42c8e95d0d3266d32d66f8bd47a309860a70bb08f72065f1851f496ca7 +size 39559 diff --git a/models/subword_ngram/chy_3gram_subword_metadata.json b/models/subword_ngram/chy_3gram_subword_metadata.json index d41fc983f291e48b27620d6b8b3577b107cf57a5..958aa6bdc9118c5ca7903f357e8360e1b0af832a 100644 --- a/models/subword_ngram/chy_3gram_subword_metadata.json +++ b/models/subword_ngram/chy_3gram_subword_metadata.json @@ -2,6 +2,6 @@ "n": 3, "variant": "subword", "language": "chy", - "unique_ngrams": 3811, - "total_ngrams": 68263 + "unique_ngrams": 3634, + "total_ngrams": 64114 } \ No newline at end of file diff --git a/models/subword_ngram/chy_4gram_subword.parquet b/models/subword_ngram/chy_4gram_subword.parquet index 98131800dc88baaa7e492a9a2e23e06e5cf8b1dc..b431c618a60afce291a703045032fc9bb04bb913 100644 --- 
a/models/subword_ngram/chy_4gram_subword.parquet +++ b/models/subword_ngram/chy_4gram_subword.parquet @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:4bfb26d6de423f9bad18bfb5f3a57a90fada9cbd23138dc9ac7e2fcdbe0c7831 -size 100088 +oid sha256:7f24a9511d5150fb308647b27c266b5a6f588d314eec67eede769dc1df1d15a0 +size 93862 diff --git a/models/subword_ngram/chy_4gram_subword_metadata.json b/models/subword_ngram/chy_4gram_subword_metadata.json index cb5dfa6b588bb1d733a482630adbddc954f31b32..20ef24e3fc52a19258d41d40b4f7177d3720d846 100644 --- a/models/subword_ngram/chy_4gram_subword_metadata.json +++ b/models/subword_ngram/chy_4gram_subword_metadata.json @@ -2,6 +2,6 @@ "n": 4, "variant": "subword", "language": "chy", - "unique_ngrams": 8559, - "total_ngrams": 67804 + "unique_ngrams": 8064, + "total_ngrams": 63685 } \ No newline at end of file diff --git a/models/subword_ngram/chy_5gram_subword.parquet b/models/subword_ngram/chy_5gram_subword.parquet new file mode 100644 index 0000000000000000000000000000000000000000..70b1dd6f0e4cb22ab68f9a391dccfb8d5522cf88 --- /dev/null +++ b/models/subword_ngram/chy_5gram_subword.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1a5a05456a14449cd7ecfb447e8bd0b51ddacc83add84722e977dd5f7c8b8b64 +size 107070 diff --git a/models/subword_ngram/chy_5gram_subword_metadata.json b/models/subword_ngram/chy_5gram_subword_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..09b732d6dda879fe0fc74b13148d3c1b89705f14 --- /dev/null +++ b/models/subword_ngram/chy_5gram_subword_metadata.json @@ -0,0 +1,7 @@ +{ + "n": 5, + "variant": "subword", + "language": "chy", + "unique_ngrams": 8516, + "total_ngrams": 63256 +} \ No newline at end of file diff --git a/models/tokenizer/chy_tokenizer_8k.model b/models/tokenizer/chy_tokenizer_8k.model index 2ef754ccf84ca9730755e4045c39b9a3d235e006..4d0b6697b9b715f2fc1517316ead38985b64a02e 100644 --- a/models/tokenizer/chy_tokenizer_8k.model 
+++ b/models/tokenizer/chy_tokenizer_8k.model @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:7224c406ce7f0ceae9ae28bfb815c3ebc81aafabd231728ea1823e5477c23e0d -size 374664 +oid sha256:f1c8a724c8c94c9c917aae80be9128fd26d48b89f29362db6780a3920aa83b5f +size 373477 diff --git a/models/tokenizer/chy_tokenizer_8k.vocab b/models/tokenizer/chy_tokenizer_8k.vocab index 080cb7183827b809db0c9bc73806dc658602c9ab..80cbe8978544fe91d3497fa492af51016b05859a 100644 --- a/models/tokenizer/chy_tokenizer_8k.vocab +++ b/models/tokenizer/chy_tokenizer_8k.vocab @@ -10,7906 +10,7906 @@ an -4 ▁m -5 tse -6 ▁n -7 -▁t -8 -stse -9 +stse -8 +on -9 ve -10 -on -11 +▁t -11 ▁( -12 hé -13 -en -14 -▁ho -15 -ôtse -16 -▁na -17 -in -18 -▁c -19 -én -20 -am -21 -▁v -22 -te -23 -ne -24 -sé -25 -ar -26 +▁ho -14 +no -15 +ne -16 +ôtse -17 +▁na -18 +me -19 +▁c -20 +sé -21 +▁a -22 +vé -23 +in -24 +ma -25 +te -26 ▁o -27 -éve -28 -▁p -29 -stôtse -30 -ht -31 -▁h -32 -▁s -33 -om -34 -hk -35 +ta -28 +stôtse -29 +ha -30 +▁p -31 +re -32 +va -33 +éve -34 +êstse -35 ▁he -36 -êstse -37 -éstse -38 -▁é -39 +hk -37 +▁s -38 +ri -39 sto -40 -no -41 -re -42 -ta -43 -vé -44 -▁a -45 -er -46 -to -47 -ic -48 -va -49 -ia -50 -em -51 -▁b -52 -ha -53 -▁k -54 -le -55 -sta -56 -tó -57 -▁d -58 -ane -59 -un -60 -us -61 -ma -62 -la -63 -▁man -64 -hp -65 -▁hé -66 -▁ma -67 -ri -68 -óne -69 -énêstse -70 -and -71 -âhe -72 -eo -73 -um -74 -▁naa -75 -ra -76 -âhé -77 -▁tsé -78 -vo -79 -es -80 -ti -81 -éno -82 -šé -83 -ná -84 -há -85 -xe -86 -htá -87 -▁hó -88 -is -89 -ts -90 +to -41 +▁é -42 +▁h -43 +éstse -44 +ht -45 +en -46 +sta -47 +vó -48 +▁k -49 +tó -50 +ia -51 +ic -52 +ane -53 +én -54 +▁b -55 +xe -56 +le -57 +un -58 +▁d -59 +ra -60 +vo -61 +▁tsé -62 +us -63 +▁hé -64 +▁man -65 +énêstse -66 +la -67 +âhe -68 +▁ma -69 +hp -70 +še -71 +šé -72 +âhé -73 +▁naa -74 +and -75 +ke -76 +ti -77 +one -78 +om -79 +ná -80 +er -81 +há -82 +vá -83 +énêstsestôtse -84 +hó -85 +li -86 +éno -87 +htá -88 +▁hó -89 +▁vé -90 ▁mo -91 -▁vé -92 
-ame -93 -énêstsestôtse -94 -še -95 -hum -96 -▁* -97 -ing -98 -▁of -99 -ke -100 -humb -101 -), -102 -ol -103 -▁sta -104 -ite -105 -▁g -106 -▁re -107 -▁state -108 -al -109 -âho -110 -▁w -111 -▁un -112 -▁vó -113 -▁states -114 -▁manâhé -115 -me -116 -ro -117 -▁e -118 -one -119 -êse -120 -óv -121 -êhe -122 -▁ch -123 -▁f -124 -tan -125 -honá -126 -né -127 -▁the -128 -▁thumb -129 -▁manâhéno -130 -or -131 -vá -132 -▁má -133 -hó -134 -ited -135 -▁united -136 -▁am -137 -▁ó -138 -est -139 -ian -140 -lo -141 -án -142 -ȯtse -143 -it -144 -ov -145 -htse -146 -honáé -147 -év -148 -oma -149 -ter -150 -as -151 -▁ts -152 -hko -153 -honáéšé -154 -il -155 -šk -156 -ina -157 -el -158 -éhe -159 -estôtse -160 -pu -161 -at -162 -ot -163 -▁" -164 -▁j -165 -má -166 -vó -167 -▁tse -168 -▁hotó -169 -ie -170 -ur -171 -▁ne -172 -tane -173 -ésto -174 -stȯtse -175 -sia -176 -otse -177 -▁hést -178 -ae -179 -ôxe -180 -tion -181 -htáme -182 -ad -183 -▁l -184 -ham -185 -ôhk -186 -▁pl -187 -xêse -188 -▁máhtáme -189 -ge -190 -ir -191 -ko -192 -oe -193 -eno -194 -xem -195 -▁china -196 -ch -197 -êhé -198 -ee -199 -na -200 -ôh -201 -▁i -202 -êst -203 -óva -204 -▁an -205 -ck -206 -ka -207 -hpe -208 -ėstse -209 -bl -210 -qu -211 -so -212 -her -213 -xov -214 -éma -215 -▁ar -216 -tanéno -217 -if -218 -mo -219 -ul -220 -ëö -221 -ver -222 -land -223 -ce -224 -han -225 -arian -226 -enter -227 -mé -228 -▁se -229 -▁ta -230 -ánóva -231 -enôtse -232 -hn -233 -sa -234 -êš -235 -hey -236 -sóv -237 -éne -238 -ëno -239 -▁mé -240 -enne -241 -publ -242 -êsto -243 -ariant -244 +es -92 +▁of -93 +▁sta -94 +né -95 +ite -96 +▁re -97 +), -98 +ing -99 +▁state -100 +ro -101 +▁g -102 +âho -103 +ts -104 +▁states -105 +ar -106 +▁un -107 +▁manâhé -108 +▁* -109 +▁w -110 +▁vó -111 +êse -112 +▁ch -113 +êhe -114 +honá -115 +nó -116 +ol -117 +▁the -118 +▁manâhéno -119 +▁e -120 +▁f -121 +ited -122 +▁united -123 +eo -124 +lo -125 +mo -126 +nė -127 +na -128 +mé -129 +▁ó -130 +ȯtse -131 +má -132 +or -133 +ina -134 +stó -135 +▁má -136 +honáé 
-137 +htse -138 +stȯtse -139 +hke -140 +tan -141 +ôxe -142 +honáéšé -143 +tane -144 +hko -145 +ge -146 +is -147 +otse -148 +pu -149 +th -150 +ésto -151 +šk -152 +▁" -153 +al -154 +as -155 +stá -156 +éhe -157 +óne -158 +▁ne -159 +vê -160 +hoo -161 +▁tse -162 +▁hotó -163 +ka -164 +tion -165 +ot -166 +só -167 +um -168 +év -169 +xêse -170 +it -171 +oe -172 +ur -173 +▁pl -174 +▁ame -175 +▁china -176 +fe -177 +ana -178 +hpe -179 +sia -180 +mâho -181 +ch -182 +xo -183 +ant -184 +▁há -185 +htáme -186 +estôtse -187 +ck -188 +ko -189 +êhé -190 +▁máhtáme -191 +ce -192 +ie -193 +il -194 +sa -195 +nóva -196 +da -197 +so -198 +ëö -199 +ôh -200 +éma -201 +▁ts -202 +stánóva -203 +ir -204 +▁j -205 +ver -206 +vari -207 +xa -208 +ric -209 +▁an -210 +tanéno -211 +ee -212 +▁i -213 +her -214 +hey -215 +lic -216 +men -217 +pub -218 +ške -219 +▁mé -220 +enne -221 +êsto -222 +heyenne -223 +variant -224 +▁héstánóva -225 +at -226 +el -227 +ni -228 +háa -229 +▁au -230 +estó -231 +land -232 +public -233 +▁hotómá -234 +qu -235 +ham -236 +man -237 +ëno -238 +▁ri -239 +▁ta -240 +htsé -241 +sóvó -242 +▁con -243 +stâhe -244 sóvóne -245 -heyenne -246 -co -247 -th -248 -▁au -249 -▁in -250 -estó -251 -htsé -252 -xovê -253 -▁con -254 -▁asia -255 -public -256 -▁hotómá -257 -ig -258 -les -259 -ome -260 -▁ka -261 -▁ri -262 -▁sa -263 -ther -264 -ôhtá -265 -hooma -266 -stâhe -267 -▁hóestó -268 -▁republic -269 -▁héstánóva -270 -ft -271 -id -272 -▁– -273 -ica -274 -man -275 -ném -276 -ohe -277 -▁oó -278 -tsêhé -279 -âhoestôtse -280 -ap -281 -ed -282 -▁co -283 -tahe -284 -▁hóxovê -285 -oo -286 -px -287 -ȯh -288 -eng -289 -heo -290 -ėse -291 -▁pa -292 -▁vo -293 -▁év -294 -hkon -295 -▁heév -296 -▁mâhe -297 -center -298 -▁oxêse -299 -stâhese -300 -") -301 -be -302 -hoo -303 -háa -304 -▁ve -305 -oland -306 -ȯhóne -307 -▁amer -308 -▁hova -309 -▁mâheóne -310 -▁variant -311 -ȯhónestȯtse -312 -▁mâhoestôtse -313 -ip -314 -kê -315 -▁- -316 -▁r -317 -ana -318 -hke -319 -stó -320 -ôhé -321 -▁(" -322 -▁mâ -323 -hése -324 
-▁éve -325 -ėsóvóne -326 -▁évȯhónestȯtse -327 -▁[ -328 -▁x -329 -ces -330 -esé -331 -hne -332 -ôse -333 -ȧhe -334 -▁gr -335 -▁mȧ -336 -tame -337 -êške -338 -▁hán -339 -▁not -340 -óhkon -341 -ôhném -342 -âhoéve -343 -▁river -344 -ôhnéménêstse -345 -ay -346 -go -347 -ib -348 -im -349 -po -350 -sev -351 -son -352 -êho -353 -▁la -354 -▁vá -355 -▁auto -356 -▁grand -357 -▁hesto -358 -▁theft -359 -▁tséhe -360 -tsêhéstâhese -361 -ea -362 -ty -363 -ôx -364 -ary -365 -ema -366 -fer -367 -ish -368 -seo -369 -êšk -370 -êšé -371 -óho -372 -ôhe -373 -▁há -374 -▁le -375 -▁col -376 -eotse -377 -mâhoéve -378 -▁poland -379 -▁cheyenne -380 -pa -381 -tsé -382 -âha -383 -évo -384 -▁da -385 -▁ná -386 -▁nė -387 -▁son -388 -feren -389 -hkohe -390 -▁coun -391 -vátame -392 -▁hoham -393 -âhestôtse -394 -▁mâhoestôtsene -395 -▁tsétsêhéstâhese -396 -ba -397 -ou -398 -pe -399 -ém -400 -ahe -401 -ano -402 -ent -403 -ile -404 -ire -405 -nif -406 -ȧhé -407 -▁ko -408 -▁mi -409 -▁ro -410 -hóom -411 -▁ame -412 -▁nor -413 -êstoo -414 -▁tsén -415 -▁america -416 -▁referen -417 -eé -418 -io -419 -▁u -420 -"), -421 -enó -422 -hno -423 -hto -424 -ine -425 -rea -426 -sus -427 -tal -428 -tor -429 -▁de -430 -▁me -431 -▁mó -432 -▁po -433 -kêse -434 -oehá -435 -▁can -436 -▁dic -437 -▁hoo -438 -▁oóo -439 -estse -440 -kêseho -441 -staane -442 -sêstse -443 -éstova -444 -▁sonic -445 -▁tsêhe -446 -énestse -447 -hestôtse -448 -hóomoehá -449 -▁heévȧhe -450 -▁oóoxêse -451 -▁tséhetó -452 -▁nėstaane -453 -▁references -454 -ca -455 -fr -456 -ik -457 -ph -458 -ėx -459 -šê -460 -ang -461 -art -462 -hon -463 -ial -464 -ief -465 -ity -466 -ong -467 -ull -468 -éhn -469 -▁km -470 -▁né -471 -▁pó -472 -▁to -473 -▁šé -474 -nife -475 -▁and -476 -▁éše -477 -lands -478 -âhtse -479 -ôhkem -480 -▁vóhp -481 -hetane -482 -hésemé -483 -véotse -484 -éhnėse -485 -▁knife -486 -▁nésto -487 -hanôtse -488 -▁county -489 -▁kóhkon -490 -▁tsénėx -491 -therlands -492 -▁hánėsóvóne -493 -vátamevéotse -494 -▁tsénėxhésemé -495 -). 
-496 -ax -497 -nó -498 -sk -499 -ud -500 -▁y -501 -ama -502 -ber -503 -evo -504 -hta -505 -kee -506 -ono -507 -per -508 -rig -509 -âhá -510 -äma -511 -êha -512 -▁be -513 -engl -514 -oneo -515 -staa -516 -šêta -517 -▁car -518 -▁hes -519 -▁tur -520 -thumb -521 -ôhtse -522 -▁dull -523 -▁heóv -524 -▁héva -525 -▁máto -526 -onôtse -527 -▁canad -528 -▁colle -529 -english -530 -tionary -531 -šêtaëno -532 -▁héstan -533 -kêhanôtse -534 -▁vášêtaëno -535 -êstsestôtse -536 -▁dictionary -537 -▁heévȧhetanéno -538 -%) -539 -li -540 -ss -541 -xo -542 -éo -543 -êh -544 -▁· -545 -all -546 -ara -547 -are -548 -hal -549 -ist -550 -kon -551 -ura -552 -êše -553 -▁bo -554 -▁ca -555 -▁lo -556 -▁mô -557 -▁ob -558 -▁on -559 -▁su -560 -▁sé -561 -▁tó -562 -hame -563 -kota -564 -tain -565 -vôse -566 -âhéó -567 -▁ama -568 -▁har -569 -▁óoe -570 -tsêhe -571 -véhon -572 -êston -573 -šêške -574 -htseto -575 -êstove -576 -menôtse -577 -▁notäma -578 -cheyenne -579 -htsêstse -580 -véhoname -581 -▁college -582 -▁hoohtseto -583 -▁mȧhóomoehá -584 -▁manâhestôtse -585 -da -586 -do -587 -fa -588 -ga -589 -hi -590 -ry -591 -ug -592 -wa -593 -aho -594 -amo -595 -car -596 -dge -597 -ené -598 -fic -599 -ira -600 -lan -601 -low -602 -nia -603 -noo -604 -ond -605 -ree -606 -ser -607 -sëö -608 -uck -609 -óve -610 -ôho -611 -šen -612 -▁te -613 -aehe -614 -amae -615 -aneo -616 -onal -617 -tive -618 -tová -619 -âheo -620 -êheo -621 -▁bri -622 -▁cro -623 -▁san -624 -▁tai -625 -stotó -626 -êstsé -627 -▁môxe -628 -nėstse -629 -▁chief -630 -▁canada -631 -▁dakota -632 -âhetanéno -633 -▁hohamháa -634 -hoohtsêstse -635 -▁netherlands -636 -▁nêstsestôtse -637 -." -638 -.) 
-639 -ac -640 -ct -641 -kâ -642 -mö -643 -ok -644 -up -645 -ut -646 -▁á -647 -▁ł -648 -cep -649 -dom -650 -emâ -651 -ene -652 -eve -653 -evó -654 -iet -655 -ila -656 -mar -657 -omo -658 -ral -659 -éso -660 -▁al -661 -▁fr -662 -▁no -663 -▁ok -664 -▁sq -665 -▁tä -666 -ille -667 -lack -668 -ohke -669 -otsé -670 -séeo -671 -êhëö -672 -ėheo -673 -▁bra -674 -▁com -675 -▁ehó -676 -▁for -677 -▁ind -678 -▁tsė -679 -anéve -680 -stótó -681 -thern -682 -šenov -683 -ȧhého -684 -▁city -685 -hkoohe -686 -ingdom -687 -▁black -688 -▁chile -689 -▁xamae -690 -estȯtse -691 -sevoneo -692 -▁ehóhtá -693 -emenôtse -694 -hetaneho -695 -▁šéstotó -696 -šenovôtse -697 -▁kóhkonêhëö -698 -▁heévâhetanéno -699 -"; -700 -bi -701 -ds -702 -hu -703 -hö -704 -jo -705 -mó -706 -ni -707 -pp -708 -pó -709 -si -710 -uc -711 -ya -712 -áa -713 -ôs -714 -▁į -715 -amá -716 -ané -717 -api -718 -des -719 -ero -720 -eto -721 -gra -722 -ill -723 -int -724 -iss -725 -key -726 -oná -727 -onó -728 -rse -729 -sem -730 -stá -731 -ted -732 -éri -733 -▁ad -734 -▁is -735 -▁tr -736 -▁tâ -737 -enëö -738 -eéve -739 -inus -740 -reat -741 -reek -742 -ėsto -743 -▁cen -744 -▁mar -745 -▁rus -746 -▁sou -747 -eséve -748 -hesto -749 -onald -750 -oneve -751 -tôtse -752 -érika -753 -óhomö -754 -óneve -755 -▁braz -756 -▁sena -757 -ficial -758 -htôtse -759 -nêstse -760 -ôhtávo -761 -▁great -762 -▁netse -763 -êsóvóne -764 -šeeséve -765 -▁census -766 -▁center -767 -▁contor -768 -▁tšêške -769 -ôhkemôxe -770 -▁amérika -771 -▁kingdom -772 -▁tsėhése -773 -anestôtse -774 -enestôtse -775 -▁héstanėheo -776 -ao -777 -de -778 -kó -779 -mê -780 -pr -781 -só -782 -tš -783 -wn -784 -ze -785 -ash -786 -bia -787 -era -788 -ery -789 -eta -790 -etó -791 -eét -792 -hpé -793 -htó -794 -ier -795 -ion -796 -kâs -797 -lep -798 -nam -799 -ona -800 -pul -801 -rid -802 -tap -803 -tin -804 -van -805 -via -806 -way -807 -xêš -808 -ést -809 -éše -810 -êsé -811 -óhe -812 -óné -813 -óon -814 -óxe -815 -ôhó -816 -▁ba -817 -▁gu -818 -▁oa -819 -▁oo -820 -▁qu -821 -▁ra 
-822 -▁sw -823 -▁wy -824 -erus -825 -ings -826 -less -827 -skin -828 -tana -829 -váše -830 -xeme -831 -xêhe -832 -éhno -833 -ótsé -834 -▁bel -835 -▁bir -836 -▁eng -837 -▁new -838 -▁ota -839 -anese -840 -carpa -841 -evoem -842 -frame -843 -graph -844 -hohpe -845 -kâsén -846 -right -847 -stahe -848 -stove -849 -ursus -850 -ôhtsé -851 -▁aust -852 -▁capi -853 -▁comm -854 -▁heva -855 -▁hoto -856 -▁hová -857 -▁náhk -858 -▁viet -859 -emahpe -860 -hpotsé -861 -notová -862 -sevone -863 -tapâha -864 -tional -865 -vášeta -866 -xemeno -867 -ótséva -868 -▁amâho -869 -▁creek -870 -▁mâhpé -871 -▁póevó -872 -seoneve -873 -uckskin -874 -variant -875 -▁americ -876 -▁donald -877 -▁hotóhk -878 -▁russia -879 -▁britain -880 -▁náhkohe -881 -▁vietnam -882 -frameless -883 -hnestôtse -884 -▁contorta -885 -▁manȧhého -886 -▁northern -887 -▁official -888 -▁vétapâha -889 -evoemêstse -890 -▁tsevášeta -891 -▁óoetanéno -892 -hpotséhohpe -893 -sevoneóneve -894 -▁hánêsóvóne -895 -nėstsestȯtse -896 -aa -897 -ah -898 -bo -899 -eb -900 -ev -901 -eš -902 -gy -903 -kh -904 -ly -905 -mâ -906 -pi -907 -pl -908 -sp -909 -ue -910 -vö -911 -ón -912 -▁' -913 -▁z -914 -▁š -915 -aan -916 -aná -917 -con -918 -cto -919 -ehe -920 -ena -921 -ená -922 -gar -923 -hka -924 -hle -925 -hog -926 -hou -927 -ino -928 -isi -929 -ism -930 -iti -931 -kem -932 -ket -933 -mus -934 -olf -935 -omé -936 -ple -937 -tar -938 -tho -939 -ton -940 -ump -941 -óhé -942 -óvé -943 -ėhe -944 -ȯxe -945 -▁ab -946 -▁ed -947 -▁kâ -948 -▁wa -949 -ance -950 -ania -951 -ants -952 -bean -953 -eehe -954 -eove -955 -esta -956 -esto -957 -etse -958 -haná -959 -háat -960 -hést -961 -inal -962 -kane -963 -kêsé -964 -mate -965 -náne -966 -pula -967 -quah -968 -star -969 -tóvé -970 -unip -971 -xêšé -972 -xėse -973 -óhtá -974 -▁cal -975 -▁den -976 -▁est -977 -▁hoi -978 -▁hon -979 -▁ire -980 -▁lit -981 -▁min -982 -▁mon -983 -▁nan -984 -▁ome -985 -▁par -986 -▁pro -987 -▁red -988 -▁sea -989 -▁slo -990 -▁sto -991 -▁tha -992 -▁tra -993 -anehe -994 -erman -995 
-erosa -996 -hnoma -997 -hpeno -998 -hésto -999 -mâheo -1000 -mâhéó -1001 -oming -1002 -thing -1003 -évohk -1004 -óhomo -1005 -ôhtáv -1006 -▁crow -1007 -▁heap -1008 -▁hešk -1009 -▁mešk -1010 -▁miss -1011 -▁moun -1012 -▁nemâ -1013 -▁pond -1014 -▁sint -1015 -▁site -1016 -dgehog -1017 -ibbean -1018 -kâsénâ -1019 -manâhé -1020 -nestse -1021 -tóvéto -1022 -xêšéne -1023 -âhonoo -1024 -êstséa -1025 -óonéma -1026 -ôhkêhe -1027 -ôhtáva -1028 -▁aésto -1029 -▁birds -1030 -▁hoóxe -1031 -▁môxem -1032 -▁south -1033 -▁tahle -1034 -▁tasem -1035 -▁vóhpo -1036 -emanese -1037 -háatama -1038 -âhtsená -1039 -▁brazil -1040 -▁otaesé -1041 -▁tsesto -1042 -▁tséana -1043 -keemahpe -1044 -notováhe -1045 -pulation -1046 -staahtsé -1047 -uniperus -1048 -xėseotse -1049 -êšéstótó -1050 -▁capital -1051 -▁ireland -1052 -▁tâhpeno -1053 -▁wyoming -1054 -évohkôtse -1055 -▁buckskin -1056 -▁hedgehog -1057 -▁national -1058 -▁vóhkoohe -1059 -hamestôtse -1060 -▁anėsóvóne -1061 -▁ponderosa -1062 -▁tahlequah -1063 -âhaemenôtse -1064 -êsenotováhe -1065 -▁hohamháahp -1066 -▁manestôtse -1067 -véhpotséhohpe -1068 -kâsénâhnestôtse -1069 -): -1070 -av -1071 -aó -1072 -dź -1073 -eá -1074 -gu -1075 -hã -1076 -iv -1077 -iz -1078 -ld -1079 -nt -1080 -op -1081 -tá -1082 -tâ -1083 -xi -1084 -ää -1085 -óo -1086 -▁) -1087 -▁+ -1088 -▁. 
-1089 -aen -1090 -bez -1091 -bud -1092 -cca -1093 -cem -1094 -che -1095 -cia -1096 -col -1097 -com -1098 -cus -1099 -eet -1100 -ers -1101 -ese -1102 -eso -1103 -eše -1104 -gas -1105 -gen -1106 -ges -1107 -gin -1108 -hen -1109 -hné -1110 -hov -1111 -hpa -1112 -hpo -1113 -hve -1114 -höö -1115 -ice -1116 -ies -1117 -ind -1118 -ith -1119 -kév -1120 -las -1121 -lig -1122 -nas -1123 -net -1124 -nii -1125 -ohk -1126 -ony -1127 -que -1128 -ria -1129 -ril -1130 -ron -1131 -rus -1132 -sed -1133 -sel -1134 -stâ -1135 -tem -1136 -tia -1137 -tic -1138 -too -1139 -tri -1140 -ulo -1141 -uly -1142 -vak -1143 -âhk -1144 -çao -1145 -éha -1146 -ése -1147 -êhá -1148 -êhó -1149 -ëva -1150 -ódź -1151 -óse -1152 -ôhö -1153 -ôsá -1154 -ôxa -1155 -ėho -1156 -ėst -1157 -▁ac -1158 -▁ap -1159 -▁bu -1160 -▁en -1161 -▁eu -1162 -▁jo -1163 -▁nē -1164 -▁sk -1165 -▁so -1166 -▁sp -1167 -▁sá -1168 -▁th -1169 -▁tá -1170 -▁ón -1171 -▁ôh -1172 -▁ło -1173 -ameo -1174 -axen -1175 -eden -1176 -eohé -1177 -hahe -1178 -hase -1179 -hoht -1180 -homa -1181 -háse -1182 -iber -1183 -ides -1184 -ilai -1185 -iten -1186 -kese -1187 -kôhó -1188 -pano -1189 -ress -1190 -sono -1191 -taly -1192 -tamá -1193 -unus -1194 -voto -1195 -vóon -1196 -xévo -1197 -zart -1198 -zech -1199 -áhto -1200 -âtse -1201 -éotó -1202 -ôheo -1203 -ôhéó -1204 -▁air -1205 -▁anê -1206 -▁ben -1207 -▁cli -1208 -▁des -1209 -▁fin -1210 -▁geo -1211 -▁gla -1212 -▁hae -1213 -▁kok -1214 -▁las -1215 -▁mas -1216 -▁mȧx -1217 -▁ová -1218 -▁pat -1219 -▁peo -1220 -▁spa -1221 -▁squ -1222 -▁ter -1223 -▁too -1224 -▁vir -1225 -▁web -1226 -▁wik -1227 -▁wor -1228 -▁yel -1229 -▁éšk -1230 -aehes -1231 -aehno -1232 -arten -1233 -elman -1234 -halus -1235 -hamém -1236 -horse -1237 -iland -1238 -irahã -1239 -keéno -1240 -lowst -1241 -mahpe -1242 -omêše -1243 -ondon -1244 -orone -1245 -pulus -1246 -ralia -1247 -sebud -1248 -taneo -1249 -tséno -1250 -éhavo -1251 -énohe -1252 -énéhe -1253 -êškév -1254 -ôhévo -1255 -ôhöne -1256 -öhtse -1257 -▁ange -1258 -▁cura -1259 -▁dong 
-1260 -▁hesó -1261 -▁hetó -1262 -▁heše -1263 -▁hohp -1264 -▁héve -1265 -▁mato -1266 -▁sage -1267 -▁tséx -1268 -▁voax -1269 -▁wolf -1270 -▁ôhmo -1271 -▁łódź -1272 -ashing -1273 -eétâhé -1274 -ginian -1275 -hovane -1276 -htsená -1277 -kêséhe -1278 -kôhóme -1279 -lahoma -1280 -ligion -1281 -onesto -1282 -onêheo -1283 -onêške -1284 -poland -1285 -staneo -1286 -sėstse -1287 -tamáno -1288 -tinent -1289 -xêšéhe -1290 -émâhéó -1291 -éstove -1292 -êhaseo -1293 -êhasëö -1294 -êškóne -1295 -ôhkêho -1296 -▁becca -1297 -▁congo -1298 -▁czech -1299 -▁hotse -1300 -▁italy -1301 -▁korea -1302 -▁meško -1303 -▁miles -1304 -▁north -1305 -▁pinus -1306 -▁vóhka -1307 -▁áháse -1308 -▁łobez -1309 -ameotse -1310 -hestohe -1311 -náhkohe -1312 -seohtsé -1313 -taheéve -1314 -énėstse -1315 -êheséeo -1316 -êstonem -1317 -▁harper -1318 -▁hetane -1319 -▁háeohé -1320 -▁háesto -1321 -▁mozart -1322 -▁people -1323 -▁sweden -1324 -▁turkey -1325 -cephalus -1326 -elmannii -1327 -enáhkohe -1328 -hamémôxe -1329 -lowstone -1330 -tsénooná -1331 -vóonotse -1332 -▁climate -1333 -▁curaçao -1334 -▁finland -1335 -▁hotohke -1336 -▁maarten -1337 -▁onéhavo -1338 -▁rosebud -1339 -ashington -1340 -axenôhöne -1341 -emestȯtse -1342 -juniperus -1343 -▁geograph -1344 -▁manâhého -1345 -▁mountain -1346 -▁móxêšéhe -1347 -▁okôhkêho -1348 -▁religion -1349 -▁thailand -1350 -▁tsetsêhe -1351 -▁váótséva -1352 -▁vóhpoomé -1353 -enóseoneve -1354 -háatamaahe -1355 -véhestôtse -1356 -▁australia -1357 -▁póevónáne -1358 -▁virginian -1359 -eétâhéstove -1360 -ôhtávaestse -1361 -▁héstanêheo -1362 -▁tsehestohe -1363 -polandnestse -1364 -êsenotováheé -1365 -▁engelmannii -1366 -▁héstánóvaan -1367 -▁vétapâhaeto -1368 -▁yellowstone -1369 -▁vóhkoohetane -1370 -keehoohtsêstse -1371 -▁héstánóvaanéve -1372 -▁tsetsêhestâhese -1373 -", -1374 -". 
-1375 -)- -1376 -ak -1377 -by -1378 -gs -1379 -hä -1380 -ix -1381 -ja -1382 -ki -1383 -kė -1384 -ns -1385 -pt -1386 -tr -1387 -ua -1388 -ui -1389 -xa -1390 -xé -1391 -yd -1392 -yl -1393 -äö -1394 -′′ -1395 -▁$ -1396 -▁: -1397 -.") -1398 -ade -1399 -aka -1400 -ami -1401 -ans -1402 -ate -1403 -axa -1404 -bel -1405 -boo -1406 -bra -1407 -chi -1408 -cro -1409 -dia -1410 -eho -1411 -ein -1412 -ght -1413 -haa -1414 -haz -1415 -hem -1416 -hir -1417 -hla -1418 -hán -1419 -ich -1420 -ick -1421 -ide -1422 -ima -1423 -ins -1424 -ise -1425 -itu -1426 -ivo -1427 -jug -1428 -kom -1429 -kra -1430 -lat -1431 -lay -1432 -nov -1433 -néó -1434 -oan -1435 -ohé -1436 -ola -1437 -oli -1438 -omá -1439 -omó -1440 -oug -1441 -pom -1442 -rib -1443 -sen -1444 -sóe -1445 -the -1446 -tiv -1447 -toa -1448 -tov -1449 -tsw -1450 -tôx -1451 -uba -1452 -uco -1453 -uel -1454 -ure -1455 -urr -1456 -uru -1457 -ved -1458 -vem -1459 -wan -1460 -xeo -1461 -yat -1462 -zon -1463 -ánó -1464 -âht -1465 -évé -1466 -êna -1467 -êxo -1468 -ëso -1469 -óma -1470 -óvó -1471 -óvô -1472 -ôht -1473 -ôto -1474 -ėjo -1475 -ėše -1476 -▁), -1477 -▁af -1478 -▁at -1479 -▁ea -1480 -▁es -1481 -▁fa -1482 -▁fo -1483 -▁ha -1484 -▁it -1485 -▁ke -1486 -▁ku -1487 -▁kö -1488 -▁kȧ -1489 -▁pe -1490 -▁ph -1491 -▁pé -1492 -▁ru -1493 -▁sc -1494 -▁sȯ -1495 -▁va -1496 -▁vi -1497 -▁še -1498 -apan -1499 -asia -1500 -asus -1501 -bies -1502 -book -1503 -bykh -1504 -cepc -1505 -ctos -1506 -eave -1507 -enen -1508 -eneo -1509 -etan -1510 -eved -1511 -fast -1512 -fire -1513 -gent -1514 -gypt -1515 -hael -1516 -hahk -1517 -here -1518 -heóv -1519 -hone -1520 -hová -1521 -héne -1522 -ibik -1523 -idae -1524 -ifor -1525 -igra -1526 -ills -1527 -inst -1528 -ires -1529 -irne -1530 -khaz -1531 -king -1532 -kome -1533 -késo -1534 -kêsa -1535 -lans -1536 -lect -1537 -lope -1538 -mami -1539 -mark -1540 -meno -1541 -môxe -1542 -nehe -1543 -nâha -1544 -néhe -1545 -néta -1546 -nóse -1547 -ohko -1548 -olia -1549 -pher -1550 -pora -1551 -pper -1552 -póno -1553 
-ries -1554 -road -1555 -rope -1556 -seph -1557 -sian -1558 -skie -1559 -stsé -1560 -sêhe -1561 -tohe -1562 -town -1563 -tâhá -1564 -tôxá -1565 -vada -1566 -vian -1567 -vovó -1568 -véhe -1569 -vêsé -1570 -vóhp -1571 -xemé -1572 -xove -1573 -xéve -1574 -xôse -1575 -áahe -1576 -éest -1577 -éstó -1578 -êšev -1579 -óhév -1580 -óseo -1581 -ôhke -1582 -ėstó -1583 -šeno -1584 -škee -1585 -▁ahk -1586 -▁amo -1587 -▁ant -1588 -▁bat -1589 -▁cam -1590 -▁cla -1591 -▁hal -1592 -▁hel -1593 -▁hla -1594 -▁hóx -1595 -▁jim -1596 -▁los -1597 -▁mic -1598 -▁mén -1599 -▁môx -1600 -▁oeš -1601 -▁otá -1602 -▁pra -1603 -▁pre -1604 -▁rho -1605 -▁sco -1606 -▁sil -1607 -▁sás -1608 -▁val -1609 -▁xaó -1610 -▁xäö -1611 -▁ôxa -1612 -▁šan -1613 -arete -1614 -chief -1615 -dagas -1616 -desia -1617 -eetus -1618 -emâho -1619 -ercus -1620 -etane -1621 -etset -1622 -honáe -1623 -house -1624 -háano -1625 -illet -1626 -istan -1627 -kaehe -1628 -máhta -1629 -mêsta -1630 -onebi -1631 -onévo -1632 -river -1633 -selle -1634 -sisto -1635 -ssing -1636 -stati -1637 -stein -1638 -stôxe -1639 -sésto -1640 -tiago -1641 -vakia -1642 -venia -1643 -xésta -1644 -évôxe -1645 -óvemá -1646 -óvéta -1647 -ôhéve -1648 -ôtove -1649 -▁apie -1650 -▁aést -1651 -▁cauc -1652 -▁fire -1653 -▁from -1654 -▁haná -1655 -▁hese -1656 -▁héne -1657 -▁kôsá -1658 -▁left -1659 -▁loca -1660 -▁nēhi -1661 -▁nėse -1662 -▁orig -1663 -▁oéve -1664 -▁rock -1665 -▁sand -1666 -▁tsis -1667 -▁tséh -1668 -▁tôhé -1669 -▁ukra -1670 -▁ural -1671 -▁west -1672 -▁éohk -1673 -▁éveé -1674 -allery -1675 -aneonó -1676 -aénohe -1677 -cebook -1678 -cember -1679 -chived -1680 -eeheso -1681 -eestse -1682 -estone -1683 -etóxeo -1684 -hanehe -1685 -hemêšé -1686 -heséeo -1687 -hkoohé -1688 -hováve -1689 -huania -1690 -hóxovê -1691 -kóhkon -1692 -kôhtse -1693 -mêstaa -1694 -person -1695 -prunus -1696 -stâhem -1697 -telope -1698 -ternet -1699 -tivist -1700 -továto -1701 -tswana -1702 -tšêške -1703 -urasia -1704 -vation -1705 -veotse -1706 -xemâho -1707 -xepóno -1708 -âheone 
-1709 -âhevan -1710 -âhtseo -1711 -êhésto -1712 -óvôhtó -1713 -▁aruba -1714 -▁edgar -1715 -▁hestá -1716 -▁hesén -1717 -▁horse -1718 -▁hésto -1719 -▁hóxov -1720 -▁india -1721 -▁japan -1722 -▁lasio -1723 -▁leuco -1724 -▁liber -1725 -▁meave -1726 -▁mésta -1727 -▁nigra -1728 -▁obvia -1729 -▁pages -1730 -▁péhpe -1731 -▁retri -1732 -▁right -1733 -▁sibik -1734 -▁spain -1735 -▁sȯsóe -1736 -▁there -1737 -▁trump -1738 -▁tsehe -1739 -▁tónov -1740 -▁ubykh -1741 -gentina -1742 -honáéva -1743 -iaeetus -1744 -ifornia -1745 -juglans -1746 -kêsaéve -1747 -mamione -1748 -ohtôtse -1749 -onetane -1750 -populus -1751 -sistots -1752 -sonants -1753 -tâháéno -1754 -tóeotse -1755 -tšêškév -1756 -upright -1757 -vonêške -1758 -átamáno -1759 -éotóaho -1760 -éstónéó -1761 -êhóóhév -1762 -ęstaneo -1763 -▁ahkôhe -1764 -▁arctos -1765 -▁desyat -1766 -▁france -1767 -▁hóvôse -1768 -▁indian -1769 -▁jetset -1770 -▁korone -1771 -▁london -1772 -▁manaan -1773 -▁native -1774 -▁nevada -1775 -▁norway -1776 -▁néstse -1777 -▁pirahã -1778 -▁square -1779 -▁tepper -1780 -▁turėjo -1781 -▁éestse -1782 -▁éškôse -1783 -▁êstove -1784 -aseohtsé -1785 -hahtsená -1786 -htsévôse -1787 -manâhéno -1788 -móxêšéne -1789 -senêstse -1790 -▁angeles -1791 -▁denmark -1792 -▁heóvâhá -1793 -▁madagas -1794 -▁menôtse -1795 -▁mâhpémo -1796 -▁oeškese -1797 -▁reôhkem -1798 -▁sanders -1799 -▁skillet -1800 -▁toháano -1801 -▁tseohke -1802 -▁ukraine -1803 -▁xamaevo -1804 -eotsestsé -1805 -hoestȯtse -1806 -héstánóva -1807 -xemenôtse -1808 -xemâhoévé -1809 -ôhketóxeo -1810 -ȧhestȯtse -1811 -▁activist -1812 -▁archived -1813 -▁caucasus -1814 -▁crossing -1815 -▁december -1816 -▁hotóhkeo -1817 -▁oklahoma -1818 -▁original -1819 -▁rhodesia -1820 -▁tsisinst -1821 -▁tsêhésto -1822 -▁véenôtse -1823 -▁vóhkoohé -1824 -kévénėstse -1825 -nêstsevôse -1826 -tsénoonáhe -1827 -tsêhéstahe -1828 -xėseotsean -1829 -▁argentina -1830 -▁blackfire -1831 -▁caribbean -1832 -▁hoéstónéó -1833 -▁lithuania -1834 -▁obviative -1835 -▁religions -1836 -▁retrieved -1837 
-▁sibikeove -1838 -▁tasemiten -1839 -▁vótâháéno -1840 -enenestôtse -1841 -komêšéstótó -1842 -mâhoestôtse -1843 -netherlands -1844 -êstonêstove -1845 -ôhkêheóvemá -1846 -▁california -1847 -▁consonants -1848 -▁heóvonêheo -1849 -▁hóvôsenâha -1850 -▁hóxâhtsená -1851 -▁koronestse -1852 -▁lasiocarpa -1853 -▁mâhóomoehá -1854 -▁môxemarete -1855 -▁nėsesėstse -1856 -▁population -1857 -▁tsêhestôxe -1858 -▁washington -1859 -eestsestȯtse -1860 -staahtsémeno -1861 -vátameveotse -1862 -▁ahkôheöhtse -1863 -▁desyatonebi -1864 -▁háeohémahpe -1865 -▁nemâmamione -1866 -êhóóhévâhtseo -1867 -êstonemâheone -1868 -▁reôhkemôtove -1869 -héstoeotsestsé -1870 -kemâhaemenôtse -1871 -▁leucocephalus -1872 -▁otaesémenôtse -1873 -▁vóhpoométanéno -1874 -▁héstánóvaannéta -1875 -▁tsisinstsistots -1876 -▁tsêhéstoestôtse -1877 -▁vóhkoohémâhoéve -1878 --- -1879 -.: -1880 - -7943 -~ -7944 -à -7945 -ñ -7946 -ú -7947 -û -7948 -ń -7949 -ō -7950 -ǐ -7951 -ʃ -7952 -н -7953 -п -7954 -ы -7955 -ә -7956 -ર -7957 -ા -7958 -ᐃ -7959 -民 -7960 -& -7961 -è -7962 -î -7963 -ò -7964 -ć -7965 -ī -7966 -ŋ -7967 -ś -7968 -ų -7969 -ǧ -7970 -ɛ -7971 -ɪ -7972 -ʔ -7973 -б -7974 -е -7975 -л -7976 -о -7977 -х -7978 -ј -7979 -қ -7980 -ҟ -7981 -ҭ -7982 -ҳ -7983 -ԥ -7984 -ખ -7985 -ફ -7986 -બ -7987 -લ -7988 -સ -7989 -ો -7990 -્ -7991 -တ -7992 -း -7993 -ႆ -7994 -Ꭹ -7995 +ã -7921 +а -7922 ++ -7923 +! 
-7924 +í -7925 +ē -7926 +| -7927 +ź -7928 +ʼ -7929 +р -7930 +# -7931 +$ -7932 +< -7933 +° -7934 +ç -7935 +č -7936 +и -7937 +с -7938 +> -7939 +~ -7940 +à -7941 +ñ -7942 +ú -7943 +û -7944 +ę -7945 +ń -7946 +ō -7947 +ǐ -7948 +ʃ -7949 +н -7950 +п -7951 +ы -7952 +ә -7953 +ર -7954 +ા -7955 +ᐃ -7956 +民 -7957 +& -7958 += -7959 +î -7960 +ò -7961 +ć -7962 +ī -7963 +ŋ -7964 +ś -7965 +ž -7966 +ǧ -7967 +ɛ -7968 +ɪ -7969 +ʔ -7970 +б -7971 +е -7972 +л -7973 +о -7974 +х -7975 +ј -7976 +қ -7977 +ҟ -7978 +ҭ -7979 +ҳ -7980 +ԥ -7981 +ખ -7982 +ફ -7983 +બ -7984 +લ -7985 +સ -7986 +ો -7987 +્ -7988 +တ -7989 +း -7990 +ႆ -7991 +Ꭹ -7992 +Ꮃ -7993 +Ꮳ -7994 +ᐍ -7995 diff --git a/models/vocabulary/chy_vocabulary.parquet b/models/vocabulary/chy_vocabulary.parquet index 5f3009ad677a72a74348a5993bf8c804d65599ae..0fc6c07960135a027e72f65e4c49fbceba78b752 100644 --- a/models/vocabulary/chy_vocabulary.parquet +++ b/models/vocabulary/chy_vocabulary.parquet @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:557659dcccab98922c4fc45c22d401a01f92a33330e2d06cbdbafd9e3d8f37d2 -size 22392 +oid sha256:b9b062a41dc39802b42fd2a6c1857efd43a84af7da75bb98cfda744f6ad48c62 +size 21524 diff --git a/models/vocabulary/chy_vocabulary_metadata.json b/models/vocabulary/chy_vocabulary_metadata.json index 929cdaba95adcb0396ddbbe9a9cb203f3f0535eb..2b4884b0d7a86d32350bf7651eff2f194b985eb0 100644 --- a/models/vocabulary/chy_vocabulary_metadata.json +++ b/models/vocabulary/chy_vocabulary_metadata.json @@ -1,15 +1,15 @@ { "language": "chy", - "vocabulary_size": 1237, + "vocabulary_size": 1174, "variant": "full", "statistics": { - "type_token_ratio": 0.32707120045087357, + "type_token_ratio": 0.33165930092406587, "coverage": { - "top_100": 0.4324628968626714, - "top_1000": 0.7445989103888785 + "top_100": 0.4347127360385697, + "top_1000": 0.7513057452792286 }, - "hapax_count": 2245, - "hapax_ratio": 0.644744399770247, - "total_documents": 459 + "hapax_count": 2128, + "hapax_ratio": 0.6444579043004239, + 
"total_documents": 429 } } \ No newline at end of file diff --git a/models/word_markov/chy_markov_ctx1_word.parquet b/models/word_markov/chy_markov_ctx1_word.parquet index 8acf149959f72ea8c4e48203f71cfd6c798bdfce..47f73259c9d523613d66ba8e51d2afc85051dec0 100644 --- a/models/word_markov/chy_markov_ctx1_word.parquet +++ b/models/word_markov/chy_markov_ctx1_word.parquet @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:386ae58f4271bb09f2a7300c1ffb43cb66268f9329c798213791212b7375c544 -size 86902 +oid sha256:d274a31557ebe9914adbbba1a99399f013526f094054439fc9c1a835e8d5e6a4 +size 82511 diff --git a/models/word_markov/chy_markov_ctx1_word_metadata.json b/models/word_markov/chy_markov_ctx1_word_metadata.json index 79f793a05930f1a4c24f54487e1287ece9e416e5..30edce2c5eab6c4fafb07cfbf7ff9fe5e8a06745 100644 --- a/models/word_markov/chy_markov_ctx1_word_metadata.json +++ b/models/word_markov/chy_markov_ctx1_word_metadata.json @@ -2,6 +2,6 @@ "context_size": 1, "variant": "word", "language": "chy", - "unique_contexts": 3383, - "total_transitions": 10187 + "unique_contexts": 3214, + "total_transitions": 9527 } \ No newline at end of file diff --git a/models/word_markov/chy_markov_ctx2_word.parquet b/models/word_markov/chy_markov_ctx2_word.parquet index 0d99ee534a93e60b5ada7345e8e5d7bc54302e94..2e11867c93491f318ec10e2dfd09c5e4f1773441 100644 --- a/models/word_markov/chy_markov_ctx2_word.parquet +++ b/models/word_markov/chy_markov_ctx2_word.parquet @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:25b29a5fd2a035a5761b624f9763e3b2f1bd7218a31608a0a6f94197490478db -size 131833 +oid sha256:72ec884b474155c4fb42b336ac8db08240a62ef9d9b96702036915eb8ece49a9 +size 124517 diff --git a/models/word_markov/chy_markov_ctx2_word_metadata.json b/models/word_markov/chy_markov_ctx2_word_metadata.json index 61d1fac9fca5fa6d7d5f94e1fba5de4e62a9781c..d629a4385314ca67a23f9aee351bd05e99410fc7 100644 --- a/models/word_markov/chy_markov_ctx2_word_metadata.json +++ 
b/models/word_markov/chy_markov_ctx2_word_metadata.json @@ -2,6 +2,6 @@ "context_size": 2, "variant": "word", "language": "chy", - "unique_contexts": 6516, - "total_transitions": 9728 + "unique_contexts": 6126, + "total_transitions": 9098 } \ No newline at end of file diff --git a/models/word_markov/chy_markov_ctx3_word.parquet b/models/word_markov/chy_markov_ctx3_word.parquet index a266a81efaaa552b218d093b93e2289099aa546f..d15d1da81f11c5f1f71faeeb77bed1fc22400658 100644 --- a/models/word_markov/chy_markov_ctx3_word.parquet +++ b/models/word_markov/chy_markov_ctx3_word.parquet @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:b4b7d0cf8b279e25f0c1dd5d12c4331f6647670846d67a4e626aa888e8687564 -size 158524 +oid sha256:17f7aa921fc6af5108a2d17c0dfb401db36aa8745424ebfb48f77007070d9508 +size 149507 diff --git a/models/word_markov/chy_markov_ctx3_word_metadata.json b/models/word_markov/chy_markov_ctx3_word_metadata.json index e0b57ac5abfa0f78354819cea4c154faba32fc60..512caad5fc87490857235f0c4674ce17adf94d20 100644 --- a/models/word_markov/chy_markov_ctx3_word_metadata.json +++ b/models/word_markov/chy_markov_ctx3_word_metadata.json @@ -2,6 +2,6 @@ "context_size": 3, "variant": "word", "language": "chy", - "unique_contexts": 7515, - "total_transitions": 9269 + "unique_contexts": 7065, + "total_transitions": 8669 } \ No newline at end of file diff --git a/models/word_markov/chy_markov_ctx4_word.parquet b/models/word_markov/chy_markov_ctx4_word.parquet index 1a23dcf88786cfa6078cf32710ccb96faa83b1bf..a724c85cf55bfcd6d49ef63749a49d1c6b7c9648 100644 --- a/models/word_markov/chy_markov_ctx4_word.parquet +++ b/models/word_markov/chy_markov_ctx4_word.parquet @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:46f60b22a8fcac07a7de898ad66d40c570966ec70737f00b5cf2d7b22bf2c6a2 -size 174153 +oid sha256:459da114df8b2b97b442e07a5ee700b713cda57712639a7ac65a2bb20621e0d4 +size 163772 diff --git a/models/word_markov/chy_markov_ctx4_word_metadata.json 
b/models/word_markov/chy_markov_ctx4_word_metadata.json index a7c139ec9a85a098b20d1b2df766e602e4ae888e..c68f674815979d0e15214c64a8c3d6c1dc47340a 100644 --- a/models/word_markov/chy_markov_ctx4_word_metadata.json +++ b/models/word_markov/chy_markov_ctx4_word_metadata.json @@ -2,6 +2,6 @@ "context_size": 4, "variant": "word", "language": "chy", - "unique_contexts": 7792, - "total_transitions": 8810 + "unique_contexts": 7317, + "total_transitions": 8240 } \ No newline at end of file diff --git a/models/word_ngram/chy_2gram_word.parquet b/models/word_ngram/chy_2gram_word.parquet index 632b74c24a0b92b9071d12c3b4dc53a2e3c6d9b8..699579cc518c94bd261605ee5676bbd2e28d5149 100644 --- a/models/word_ngram/chy_2gram_word.parquet +++ b/models/word_ngram/chy_2gram_word.parquet @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:f3cf90b3734f097b7881f2bd6766a12f1c1a78a49ef2c9af43b8bb1c04f684de -size 5216 +oid sha256:d3c85c0ba694c7e59c5083db686db2a289b676f8dc72587178d4e8b7e405f9d5 +size 5096 diff --git a/models/word_ngram/chy_2gram_word_metadata.json b/models/word_ngram/chy_2gram_word_metadata.json index 56d2bc6055684beeb1fb7fd1ca2f684fb65b1b5c..ff6eb0fefa715176f3d73a8aae819b8636f39f8d 100644 --- a/models/word_ngram/chy_2gram_word_metadata.json +++ b/models/word_ngram/chy_2gram_word_metadata.json @@ -2,6 +2,6 @@ "n": 2, "variant": "word", "language": "chy", - "unique_ngrams": 159, - "total_ngrams": 10187 + "unique_ngrams": 148, + "total_ngrams": 9527 } \ No newline at end of file diff --git a/models/word_ngram/chy_3gram_word.parquet b/models/word_ngram/chy_3gram_word.parquet index bbfea6bb6328da419a09d61d4c91f24d9c72a818..c3b4a3204739f549109bf343c749d313907cbc79 100644 --- a/models/word_ngram/chy_3gram_word.parquet +++ b/models/word_ngram/chy_3gram_word.parquet @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:18e0cd0236f9672b7cdc3748740203bb789cab4d98a9bba4f58941ef3779da01 -size 7252 +oid 
sha256:26fce2bfe2586ba0dcd7bb658e8be412227cdde23ac0b2a7463b068c32f7e141 +size 6956 diff --git a/models/word_ngram/chy_3gram_word_metadata.json b/models/word_ngram/chy_3gram_word_metadata.json index c29c0216e28a64b2c83dea0c9724f900b99f5a8f..752be0ffc47bd88e6005d2d6d2701c6279d22332 100644 --- a/models/word_ngram/chy_3gram_word_metadata.json +++ b/models/word_ngram/chy_3gram_word_metadata.json @@ -2,6 +2,6 @@ "n": 3, "variant": "word", "language": "chy", - "unique_ngrams": 245, - "total_ngrams": 9728 + "unique_ngrams": 229, + "total_ngrams": 9098 } \ No newline at end of file diff --git a/models/word_ngram/chy_4gram_word.parquet b/models/word_ngram/chy_4gram_word.parquet index 4cc9d0cc656f3444c80f96679bbd8d91bbd68c82..a5030198897e1f9138b427d5bd66951bc3c98107 100644 --- a/models/word_ngram/chy_4gram_word.parquet +++ b/models/word_ngram/chy_4gram_word.parquet @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:be5660989ec57f68e126189b60da16af9da01d4a13b7e53d4bc22f9849e36215 -size 11340 +oid sha256:cba9059bc488c7dc6e3c1b212db591855c167e3b56c626e7097cace18aba4c8c +size 10742 diff --git a/models/word_ngram/chy_4gram_word_metadata.json b/models/word_ngram/chy_4gram_word_metadata.json index 61d6abe8a60c9188890608ed9978cf38bc5625ff..955dd9783e52660333312e08caafdb8b870c47a7 100644 --- a/models/word_ngram/chy_4gram_word_metadata.json +++ b/models/word_ngram/chy_4gram_word_metadata.json @@ -2,6 +2,6 @@ "n": 4, "variant": "word", "language": "chy", - "unique_ngrams": 449, - "total_ngrams": 9269 + "unique_ngrams": 420, + "total_ngrams": 8669 } \ No newline at end of file diff --git a/models/word_ngram/chy_5gram_word.parquet b/models/word_ngram/chy_5gram_word.parquet new file mode 100644 index 0000000000000000000000000000000000000000..5c65d0ab803deaea3d5727fed97034ab675b5cc1 --- /dev/null +++ b/models/word_ngram/chy_5gram_word.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:86f9f4ef2e4e6b1a0fe9702a89d83ada7fa2f70d2d2ef37ff0666d5cc357ade3 +size 8978 diff --git a/models/word_ngram/chy_5gram_word_metadata.json b/models/word_ngram/chy_5gram_word_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..3938c04cc0529deec33b561dcf3c802dc45d530e --- /dev/null +++ b/models/word_ngram/chy_5gram_word_metadata.json @@ -0,0 +1,7 @@ +{ + "n": 5, + "variant": "word", + "language": "chy", + "unique_ngrams": 290, + "total_ngrams": 8240 +} \ No newline at end of file diff --git a/visualizations/embedding_alignment_quality.png b/visualizations/embedding_alignment_quality.png new file mode 100644 index 0000000000000000000000000000000000000000..0fde00bbafda94df0f684d09398707dd790e6e06 Binary files /dev/null and b/visualizations/embedding_alignment_quality.png differ diff --git a/visualizations/embedding_isotropy.png b/visualizations/embedding_isotropy.png index ea9dc9f6e87929e7366aa2c06604dc6e6a6dda09..1eef158893232d05d606e03dc3956c48cc8c57c1 100644 Binary files a/visualizations/embedding_isotropy.png and b/visualizations/embedding_isotropy.png differ diff --git a/visualizations/embedding_norms.png b/visualizations/embedding_norms.png index 6bac4c354cf4551a0d05ebcc6c5f5a89f0b7965e..b916b8b1fb04808e51873406467271b0305e8d46 100644 Binary files a/visualizations/embedding_norms.png and b/visualizations/embedding_norms.png differ diff --git a/visualizations/embedding_similarity.png b/visualizations/embedding_similarity.png index cdbdab745d9a5b59fa842a8ccef21a3a76fa59db..f23bb586b8286c112da1c134a4d1357e183d3388 100644 --- a/visualizations/embedding_similarity.png +++ b/visualizations/embedding_similarity.png @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:64b3cbb726cc52039aef89526c5df1602ede7ff01d5aafa01666629a9683d23c -size 126808 +oid sha256:e12f517c3ff1adceb59355bbc444ddb612d1e69de30ff0a26de1910044cd7ec4 +size 125808 diff --git a/visualizations/embedding_tsne_multilingual.png 
b/visualizations/embedding_tsne_multilingual.png new file mode 100644 index 0000000000000000000000000000000000000000..2413782e9a5bc2c2aff2b82c5892643a7a9b880a --- /dev/null +++ b/visualizations/embedding_tsne_multilingual.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:fdea8956cf313a2b0158aba6814818af1a45e26d155e97ee89e751b81804fd43 +size 241317 diff --git a/visualizations/markov_branching.png b/visualizations/markov_branching.png index 82188252c06cc849a115b2120f3298eda3266ce7..8117e0815c086a7a885cb51e9975963bf94443b1 100644 Binary files a/visualizations/markov_branching.png and b/visualizations/markov_branching.png differ diff --git a/visualizations/markov_contexts.png b/visualizations/markov_contexts.png index 878da0bfeccb8cd12e9bbeb4e155ccde60be4c40..59c5789f3d34b342689f936a2990843eef3d412d 100644 Binary files a/visualizations/markov_contexts.png and b/visualizations/markov_contexts.png differ diff --git a/visualizations/markov_entropy.png b/visualizations/markov_entropy.png index 2cb2a33aa3f32f2c9020ff2e7b4ded85951fd36f..cbed0337929df4e657fa8fd1fe2b644867da8a8b 100644 Binary files a/visualizations/markov_entropy.png and b/visualizations/markov_entropy.png differ diff --git a/visualizations/model_sizes.png b/visualizations/model_sizes.png index 885a4e8611f5767abc6f1341fffea6aa3a6c29a6..ab3ddf5828f45c17ac20af6a9f00904e0088c765 100644 Binary files a/visualizations/model_sizes.png and b/visualizations/model_sizes.png differ diff --git a/visualizations/ngram_coverage.png b/visualizations/ngram_coverage.png index b45518093f24f6ee5cccd3194d70808e25706ae6..ff6544f5afaafe63f66f1e5a1cc648afe71b304f 100644 Binary files a/visualizations/ngram_coverage.png and b/visualizations/ngram_coverage.png differ diff --git a/visualizations/ngram_entropy.png b/visualizations/ngram_entropy.png index 70447cd4707daafceccd6d9f29ba189f50ad7490..e5f2e767cfa6201903362417207ee3f57d5648e6 100644 Binary files a/visualizations/ngram_entropy.png and 
b/visualizations/ngram_entropy.png differ diff --git a/visualizations/ngram_perplexity.png b/visualizations/ngram_perplexity.png index 00be5b866f4bbaa1cfe56d6e0019c6985cc23ef7..8e0696b20f4e260b93212d4dd2b00748883856b6 100644 Binary files a/visualizations/ngram_perplexity.png and b/visualizations/ngram_perplexity.png differ diff --git a/visualizations/ngram_unique.png b/visualizations/ngram_unique.png index ced4bb414a2ef4c8c2411799c46387aa89c72610..17ee67fd7b55213e29704629d44a5fd4dc599e18 100644 Binary files a/visualizations/ngram_unique.png and b/visualizations/ngram_unique.png differ diff --git a/visualizations/performance_dashboard.png b/visualizations/performance_dashboard.png index 876dbe7d9c7ec97ccfa5a5a7eae8adb280a308ae..a538f2423a9f33e558be069e812149bb67ada4c8 100644 --- a/visualizations/performance_dashboard.png +++ b/visualizations/performance_dashboard.png @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:0f899824cb6c4547d62da3e7f6e6bf75020586ecf2b7d2b2c5976916c155732c -size 270991 +oid sha256:f2b6ae27fcfed64b1b19e16ec4bbecb5cd8069e6f86ce21f03c00358dcbaece0 +size 355907 diff --git a/visualizations/position_encoding_comparison.png b/visualizations/position_encoding_comparison.png index cd2f4289ff4cba7547c156aa481424b7c0b1ff51..21fd7b1b90d4c5968893f72fc14ccdfe69df1c86 100644 Binary files a/visualizations/position_encoding_comparison.png and b/visualizations/position_encoding_comparison.png differ diff --git a/visualizations/tokenizer_compression.png b/visualizations/tokenizer_compression.png index 144fa9f2d16397930b23016f3cc061343d451a2c..a06c8974bcff83c78d145e9198301125adfbf5ce 100644 Binary files a/visualizations/tokenizer_compression.png and b/visualizations/tokenizer_compression.png differ diff --git a/visualizations/tokenizer_fertility.png b/visualizations/tokenizer_fertility.png index b78937f46dcba6e1ffa2522e1317cb2014b141a4..898c4f673e8f45c1b1e59d5875e758cf394676fe 100644 Binary files a/visualizations/tokenizer_fertility.png and 
b/visualizations/tokenizer_fertility.png differ diff --git a/visualizations/tokenizer_oov.png b/visualizations/tokenizer_oov.png index 8425f47754399510422fc6cf50fedf5c73262db9..541927db1803488b92e4709cde8b3a329ca0d799 100644 Binary files a/visualizations/tokenizer_oov.png and b/visualizations/tokenizer_oov.png differ diff --git a/visualizations/tokenizer_total_tokens.png b/visualizations/tokenizer_total_tokens.png index df04deecf80674d5a3b55a254e9339eeb9a9f2c5..f02337fa19aee26bd9d7a98e434cc689a94dc323 100644 Binary files a/visualizations/tokenizer_total_tokens.png and b/visualizations/tokenizer_total_tokens.png differ diff --git a/visualizations/top20_words.png b/visualizations/top20_words.png index 3b0b881b0807cc6580ac8f34bc3fc94115e3845a..a4a1eaa6770742b3123c2a7fe8f8863b59466622 100644 Binary files a/visualizations/top20_words.png and b/visualizations/top20_words.png differ diff --git a/visualizations/tsne_sentences.png b/visualizations/tsne_sentences.png index 2577ac4fae041121b7464551bdf94f635cbba1f4..04655f2f9d432933dfeb9e4bf712d36d68794e9d 100644 --- a/visualizations/tsne_sentences.png +++ b/visualizations/tsne_sentences.png @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:5738fc98a5806c2cf9c2b9af8ce05b1bea2476236ad9f4f96f9de41339a47881 -size 266026 +oid sha256:d289ea9d84b3d94e329d99c2d017812fa894e301a1fb214956cb9fd1e0b02e75 +size 252543 diff --git a/visualizations/tsne_words.png b/visualizations/tsne_words.png index 55b5263831180e9d37c2d7dace265571784c20ce..7a9a2bb924e96a95814d022f4d07291eaa356caf 100644 --- a/visualizations/tsne_words.png +++ b/visualizations/tsne_words.png @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:c594af3b95d4f54484fa6f1b4e358b98359869c1d60cd48af9c2f6e7bf4ff9e0 -size 201396 +oid sha256:b209bd5b1208e847902c7273fd369c112d8cd1ca9d925c6ebb36d212643f58d6 +size 193610 diff --git a/visualizations/vocab_coverage.png b/visualizations/vocab_coverage.png index 
fd9bcde8fede172a65f14222b4ac99abef6a3972..a7fcd94f5f36036d7ac6bcf90cf4234c719e7ab7 100644 Binary files a/visualizations/vocab_coverage.png and b/visualizations/vocab_coverage.png differ diff --git a/visualizations/vocab_freq_dist.png b/visualizations/vocab_freq_dist.png index d489b4f79d52d6a56c9af52b3e8627fa69cf3047..f46735c346a7839679f501c26ab51dc1516d108c 100644 Binary files a/visualizations/vocab_freq_dist.png and b/visualizations/vocab_freq_dist.png differ diff --git a/visualizations/zipf_law.png b/visualizations/zipf_law.png index bf799f018ae3ad16148e03c5cd4e4b199a86e3ee..e7a39edef8634d253df07b1784abb2ae80b71d99 100644 --- a/visualizations/zipf_law.png +++ b/visualizations/zipf_law.png @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:99ca3b312b85657822250a843bbdcbf32e8d8a66d8904f88265e7b1f4f25df66 -size 99198 +oid sha256:a0ca4016030e491d5dcdf3b27b8df5e8d902dd930f641e0e803a66836b7723d3 +size 100188