diff --git a/.gitattributes b/.gitattributes index a6344aac8c09253b3b630fb776ae94478aa0275b..9800be970571345a39f1f48b1c34335a62f77c48 100644 --- a/.gitattributes +++ b/.gitattributes @@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text *.zip filter=lfs diff=lfs merge=lfs -text *.zst filter=lfs diff=lfs merge=lfs -text *tfevents* filter=lfs diff=lfs merge=lfs -text +visualizations/embedding_similarity.png filter=lfs diff=lfs merge=lfs -text +visualizations/performance_dashboard.png filter=lfs diff=lfs merge=lfs -text +visualizations/tsne_sentences.png filter=lfs diff=lfs merge=lfs -text +visualizations/tsne_words.png filter=lfs diff=lfs merge=lfs -text +visualizations/zipf_law.png filter=lfs diff=lfs merge=lfs -text diff --git a/README.md b/README.md new file mode 100644 index 0000000000000000000000000000000000000000..25ab99438e5368aabd19fc7a4d6cba143334782a --- /dev/null +++ b/README.md @@ -0,0 +1,553 @@ +--- +language: chy +language_name: CHY +language_family: american_algonquian +tags: + - wikilangs + - nlp + - tokenizer + - embeddings + - n-gram + - markov + - wikipedia + - monolingual + - family-american_algonquian +license: mit +library_name: wikilangs +pipeline_tag: feature-extraction +datasets: + - omarkamali/wikipedia-monthly +dataset_info: + name: wikipedia-monthly + description: Monthly snapshots of Wikipedia articles across 300+ languages +metrics: + - name: best_compression_ratio + type: compression + value: 3.456 + - name: best_isotropy + type: isotropy + value: 0.0028 + - name: vocabulary_size + type: vocab + value: 1659 +generated: 2025-12-28 +--- + +# CHY - Wikilangs Models +## Comprehensive Research Report & Full Ablation Study + +This repository contains NLP models trained and evaluated by Wikilangs, specifically on **CHY** Wikipedia data. +We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and word embeddings. 
+ +## 📋 Repository Contents + +### Models & Assets + +- Tokenizers (8k, 16k, 32k, 64k) +- N-gram models (2, 3, 4-gram) +- Markov chains (context of 1, 2, 3 and 4) +- Subword N-gram and Markov chains +- Embeddings in various sizes and dimensions +- Language Vocabulary +- Language Statistics +![Performance Dashboard](visualizations/performance_dashboard.png) + +### Analysis and Evaluation + +- [1. Tokenizer Evaluation](#1-tokenizer-evaluation) +- [2. N-gram Model Evaluation](#2-n-gram-model-evaluation) +- [3. Markov Chain Evaluation](#3-markov-chain-evaluation) +- [4. Vocabulary Analysis](#4-vocabulary-analysis) +- [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation) +- [6. Summary & Recommendations](#6-summary--recommendations) +- [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide) +- [Visualizations Index](#visualizations-index) + +--- +## 1. Tokenizer Evaluation + +![Tokenizer Compression](visualizations/tokenizer_compression.png) + +### Results + +| Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens | +|------------|-------------|---------------|----------|--------------| +| **8k** | 3.426x | 3.37 | 0.0811% | 33,276 | +| **16k** | 3.456x 🏆 | 3.40 | 0.0819% | 32,987 | + +### Tokenization Examples + +Below are sample sentences tokenized with each vocabulary size: + +**Sample 1:** `Môxéhéó'o (vé'ho'énêstsestôtse: broom, "sweeping [thing]") Pl: môxéheonôtse. + + +C...` + +| Vocab | Tokens | Count | +|-------|--------|-------| +| 8k | `▁môxéhéó ' o ▁( vé ' ho ' énêstsestôtse : ... (+18 more)` | 28 | +| 16k | `▁môxéhéó ' o ▁( vé ' ho ' énêstsestôtse : ... (+17 more)` | 27 | + +**Sample 2:** `Brazil, na'éstse ho'e-éve, Amérika. + +Category:Brazil` + +| Vocab | Tokens | Count | +|-------|--------|-------| +| 8k | `▁brazil , ▁na ' éstse ▁ho ' e - éve ... (+6 more)` | 16 | +| 16k | `▁brazil , ▁na ' éstse ▁ho ' e - éve ... (+6 more)` | 16 | + +**Sample 3:** `Boise, na'éstse manâhéno, Idaho. 
+ +Category:Mâhoestôtse` + +| Vocab | Tokens | Count | +|-------|--------|-------| +| 8k | `▁boise , ▁na ' éstse ▁manâhéno , ▁idaho . ▁category ... (+2 more)` | 12 | +| 16k | `▁boise , ▁na ' éstse ▁manâhéno , ▁idaho . ▁category ... (+2 more)` | 12 | + + +### Key Findings + +- **Best Compression:** 16k achieves 3.456x compression +- **Lowest UNK Rate:** 8k with 0.0811% unknown tokens +- **Trade-off:** Larger vocabularies improve compression but increase model size +- **Recommendation:** Of the two vocabularies evaluated (8k and 16k), 16k gives the best compression at a nearly identical UNK rate + +--- +## 2. N-gram Model Evaluation + +![N-gram Perplexity](visualizations/ngram_perplexity.png) + +![N-gram Coverage](visualizations/ngram_coverage.png) + +### Results + +| N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage | +|--------|------------|---------|----------------|------------------|-------------------| +| **2-gram** (word) | 237 🏆 | 7.89 | 654 | 65.5% | 100.0% | +| **2-gram** (subword) | 360 🏆 | 8.49 | 1,127 | 58.7% | 99.4% | +| **3-gram** (word) | 533 | 9.06 | 1,211 | 49.6% | 95.0% | +| **3-gram** (subword) | 1,561 | 10.61 | 4,876 | 32.9% | 74.3% | +| **4-gram** (word) | 1,077 | 10.07 | 2,302 | 37.1% | 77.9% | +| **4-gram** (subword) | 3,419 | 11.74 | 11,151 | 25.8% | 57.5% | + +### Top 5 N-grams by Size + +**2-grams:** + +| Rank | N-gram | Count | +|------|--------|-------| +| 1 | `category :` | 973 | +| 2 | `' e` | 663 | +| 3 | `ho '` | 511 | +| 4 | `' o` | 391 | +| 5 | `. category` | 332 | + +**3-grams:** + +| Rank | N-gram | Count | +|------|--------|-------| +| 1 | `. 
category :` | 331 | +| 2 | `na ' éstse` | 288 | +| 3 | `' ho '` | 225 | +| 4 | `| thumb |` | 204 | +| 5 | `| right |` | 201 | + +**4-grams:** + +| Rank | N-gram | Count | +|------|--------|-------| +| 1 | `, na ' éstse` | 199 | +| 2 | `| thumb | right` | 167 | +| 3 | `thumb | right |` | 155 | +| 4 | `vé ' ho '` | 131 | +| 5 | `300px | thumb |` | 128 | + + +### Key Findings + +- **Best Perplexity:** Word-level 2-gram with 237 +- **Entropy Trend:** Increases with larger n-grams on this small corpus (7.89, 9.06, 10.07 bits at word level) +- **Coverage:** Top-1000 patterns cover between 57.5% (subword 4-gram) and 100.0% (word 2-gram) of n-gram occurrences +- **Recommendation:** 2-gram gives the lowest perplexity here; no 5-gram model was evaluated + +--- +## 3. Markov Chain Evaluation + +![Markov Entropy](visualizations/markov_entropy.png) + +![Markov Branching](visualizations/markov_branching.png) + +### Results + +| Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability | +|---------|-------------|------------|------------------|-----------------|----------------| +| **1** (word) | 0.3489 | 1.274 | 2.42 | 4,255 | 65.1% | +| **1** (subword) | 1.3997 | 2.638 | 10.97 | 189 | 0.0% | +| **2** (word) | 0.1560 | 1.114 | 1.36 | 10,197 | 84.4% | +| **2** (subword) | 1.2345 | 2.353 | 5.27 | 2,073 | 0.0% | +| **3** (word) | 0.0936 | 1.067 | 1.18 | 13,745 | 90.6% | +| **3** (subword) | 0.6401 | 1.558 | 2.32 | 10,919 | 36.0% | +| **4** (word) | 0.0555 🏆 | 1.039 | 1.10 | 16,004 | 94.5% | +| **4** (subword) | 0.2796 🏆 | 1.214 | 1.44 | 25,260 | 72.0% | + +### Generated Text Samples + +Below are text samples generated from each Markov chain model: + +**Context Size 1:** + +1. `' konénėhesó - éve . manâhestôtse 7 heše ' tavö ' éhoo ' he tsénėxhésemé '` +2. `, na ' he tsénėxhésemé ' evo ' éno ' e 1904 , na ' otsenáhkohe` +3. `: turkey , na ' o tsétsêhéstâhese - éve . gus . curitiba - éve .` + +**Context Size 2:** + +1. `category : ó ' he ( vé ' ho ' énestse 71 , 740 6 , 418` +2. `' e 300px | thumb | amâho ' hestôtse amêške tsémo ' ôhtávoome amâho ' hestôtse category` +3. 
`ho ' xó ' mâhoéve ' ho ' énêstsestôtse : purgatoire river , picketwire river ) -` + +**Context Size 3:** + +1. `. category : mâhoestôtse category : california` +2. `na ' éstse ho ' e - éve hóxovê - hooma , asia ) . *` +3. `' ho ' e - éve , meško . category : mâhoestôtse category : ho ' honáeo '` + +**Context Size 4:** + +1. `, na ' éstse ho ' e - éve , amérika . *` +2. `| thumb | right | hóxeeséeto ' hamestôtse 300px | thumb | right | méstaa ' êhéhe category :` +3. `thumb | right | hotóhkeo ' o tsénésôhtôxese 300px | thumb | right | mámaa ' e mámaa '` + + +### Key Findings + +- **Best Predictability:** Context-4 with 94.5% predictability +- **Branching Factor:** Decreases with context size (more deterministic) +- **Memory Trade-off:** Larger contexts require more storage (25,260 contexts) +- **Recommendation:** Context-3 or Context-4 for text generation + +--- +## 4. Vocabulary Analysis + +![Zipf's Law](visualizations/zipf_law.png) + +![Top Words](visualizations/top20_words.png) + +![Coverage Curve](visualizations/vocab_coverage.png) + +### Statistics + +| Metric | Value | +|--------|-------| +| Vocabulary Size | 1,659 | +| Total Tokens | 14,360 | +| Mean Frequency | 8.66 | +| Median Frequency | 3 | +| Frequency Std Dev | 38.60 | + +### Most Common Words + +| Rank | Word | Frequency | +|------|------|-----------| +| 1 | category | 974 | +| 2 | e | 690 | +| 3 | ho | 531 | +| 4 | o | 414 | +| 5 | na | 293 | +| 6 | éstse | 288 | +| 7 | right | 260 | +| 8 | éve | 259 | +| 9 | thumb | 226 | +| 10 | vé | 180 | + +### Least Common Words (from vocabulary) + +| Rank | Word | Frequency | +|------|------|-----------| +| 1 | evenóse | 2 | +| 2 | mountain | 2 | +| 3 | cal | 2 | +| 4 | poly | 2 | +| 5 | mustangs | 2 | +| 6 | sevonévo | 2 | +| 7 | ėstovátamevéotse | 2 | +| 8 | ėstova | 2 | +| 9 | nėstse | 2 | +| 10 | 2025 | 2 | + +### Zipf's Law Analysis + +| Metric | Value | +|--------|-------| +| Zipf Coefficient | 0.8829 | +| R² (Goodness of Fit) | 0.980523 | +| Adherence 
Quality | **excellent** | + +### Coverage Analysis + +| Top N Words | Coverage | +|-------------|----------| +| Top 100 | 57.7% | +| Top 1,000 | 90.8% | +| Top 5,000 | n/a (vocabulary has 1,659 words) | +| Top 10,000 | n/a (vocabulary has 1,659 words) | + +### Key Findings + +- **Zipf Compliance:** R²=0.9805 indicates excellent adherence to Zipf's law +- **High Frequency Dominance:** Top 100 words cover 57.7% of corpus +- **Long Tail:** The remaining 659 words (ranks 1,001 to 1,659) account for roughly the last 9.2% of tokens + +--- +## 5. Word Embeddings Evaluation + +![Embedding Isotropy](visualizations/embedding_isotropy.png) + +![Similarity Matrix](visualizations/embedding_similarity.png) + +![t-SNE Words](visualizations/tsne_words.png) + +![t-SNE Sentences](visualizations/tsne_sentences.png) + +### Model Comparison + +| Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy | +|-------|------------|-----------|----------|----------|----------| +| **mono_32d** | 223 | 32 | 1.577 | 0.877 | 0.0028 🏆 | +| **mono_64d** | 223 | 64 | 1.556 | 0.897 | 0.0009 | +| **mono_128d** | 223 | 128 | 1.593 | 0.888 | 0.0002 | +| **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 | + +### Key Findings + +- **Best Isotropy:** mono_32d with 0.0028 (more uniform distribution) +- **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy +- **Vocabulary Coverage:** The three mono models each cover 223 words; `embeddings_enhanced` is empty +- **Recommendation:** Of the trained sizes (32d, 64d, 128d), mono_32d has the highest isotropy; no 100d model is included + +--- +## 6. 
Summary & Recommendations + +![Performance Dashboard](visualizations/performance_dashboard.png) + +### Production Recommendations + +| Component | Recommended | Rationale | +|-----------|-------------|-----------| +| Tokenizer | **16k BPE** | Best compression (3.46x) with low UNK rate | +| N-gram | **2-gram** | Lowest perplexity (237) | +| Markov | **Context-4** | Highest predictability (94.5%) | +| Embeddings | **32d** | Highest isotropy (0.0028) among the trained sizes | + +--- +## Appendix: Metrics Glossary & Interpretation Guide + +This section provides definitions, intuitions, and guidance for interpreting the metrics used throughout this report. + +### Tokenizer Metrics + +**Compression Ratio** +> *Definition:* The ratio of characters to tokens (chars/token). Measures how efficiently the tokenizer represents text. +> +> *Intuition:* Higher compression means fewer tokens needed to represent the same text, reducing sequence lengths for downstream models. A 3x compression means ~3 characters per token on average. +> +> *What to seek:* Higher is generally better for efficiency, but extremely high compression may indicate overly aggressive merging that loses morphological information. + +**Average Token Length (Fertility)** +> *Definition:* Mean number of characters per token produced by the tokenizer. +> +> *Intuition:* Reflects the granularity of tokenization. Longer tokens capture more context but may struggle with rare words; shorter tokens are more flexible but increase sequence length. +> +> *What to seek:* Balance between 2-5 characters for most languages. Arabic/morphologically-rich languages may benefit from slightly longer tokens. + +**Unknown Token Rate (OOV Rate)** +> *Definition:* Percentage of tokens that map to the unknown/UNK token, indicating words the tokenizer cannot represent. +> +> *Intuition:* Lower OOV means better vocabulary coverage. High OOV indicates the tokenizer encounters many unseen character sequences. 
+> +> *What to seek:* Below 1% is excellent; below 5% is acceptable. BPE tokenizers typically achieve very low OOV due to subword fallback. + +### N-gram Model Metrics + +**Perplexity** +> *Definition:* Measures how "surprised" the model is by test data. Mathematically: 2^(cross-entropy). Lower values indicate better prediction. +> +> *Intuition:* If perplexity is 100, the model is as uncertain as if choosing uniformly among 100 options at each step. A perplexity of 10 means effectively choosing among 10 equally likely options. +> +> *What to seek:* Lower is better. Perplexity decreases with larger n-grams (more context). Values vary widely by language and corpus size. + +**Entropy** +> *Definition:* Average information content (in bits) needed to encode the next token given the context. Related to perplexity: perplexity = 2^entropy. +> +> *Intuition:* High entropy means high uncertainty/randomness; low entropy means predictable patterns. Natural language typically has entropy between 1-4 bits per character. +> +> *What to seek:* Lower entropy indicates more predictable text patterns. Entropy should decrease as n-gram size increases. + +**Coverage (Top-K)** +> *Definition:* Percentage of corpus occurrences explained by the top K most frequent n-grams. +> +> *Intuition:* High coverage with few patterns indicates repetitive/formulaic text; low coverage suggests diverse vocabulary usage. +> +> *What to seek:* Depends on use case. For language modeling, moderate coverage (40-60% with top-1000) is typical for natural text. + +### Markov Chain Metrics + +**Average Entropy** +> *Definition:* Mean entropy across all contexts, measuring average uncertainty in next-word prediction. +> +> *Intuition:* Lower entropy means the model is more confident about what comes next. Context-1 has high entropy (many possible next words); Context-4 has low entropy (few likely continuations). +> +> *What to seek:* Decreasing entropy with larger context sizes. 
Very low entropy (<0.1) indicates highly deterministic transitions. + +**Branching Factor** +> *Definition:* Average number of unique next tokens observed for each context. +> +> *Intuition:* High branching = many possible continuations (flexible but uncertain); low branching = few options (predictable but potentially repetitive). +> +> *What to seek:* Branching factor should decrease with context size. Values near 1.0 indicate nearly deterministic chains. + +**Predictability** +> *Definition:* Derived metric: (1 - normalized_entropy) × 100%. Indicates how deterministic the model's predictions are. +> +> *Intuition:* 100% predictability means the next word is always certain; 0% means completely random. Real text falls between these extremes. +> +> *What to seek:* Higher predictability for text generation quality, but too high (>98%) may produce repetitive output. + +### Vocabulary & Zipf's Law Metrics + +**Zipf's Coefficient** +> *Definition:* The slope of the log-log plot of word frequency vs. rank. Zipf's law predicts this should be approximately -1. +> +> *Intuition:* A coefficient near -1 indicates the corpus follows natural language patterns where a few words are very common and most words are rare. +> +> *What to seek:* Values between -0.8 and -1.2 indicate healthy natural language distribution. Deviations may suggest domain-specific or artificial text. + +**R² (Coefficient of Determination)** +> *Definition:* Measures how well the linear fit explains the frequency-rank relationship. Ranges from 0 to 1. +> +> *Intuition:* R² near 1.0 means the data closely follows Zipf's law; lower values indicate deviation from expected word frequency patterns. +> +> *What to seek:* R² > 0.95 is excellent; > 0.99 indicates near-perfect Zipf adherence typical of large natural corpora. + +**Vocabulary Coverage** +> *Definition:* Cumulative percentage of corpus tokens accounted for by the top N words. +> +> *Intuition:* Shows how concentrated word usage is. 
If top-100 words cover 50% of text, the corpus relies heavily on common words. +> +> *What to seek:* Top-100 covering 30-50% is typical. Higher coverage indicates more repetitive text; lower suggests richer vocabulary. + +### Word Embedding Metrics + +**Isotropy** +> *Definition:* Measures how uniformly distributed vectors are in the embedding space. Computed as the ratio of minimum to maximum singular values. +> +> *Intuition:* High isotropy (near 1.0) means vectors spread evenly in all directions; low isotropy means vectors cluster in certain directions, reducing expressiveness. +> +> *What to seek:* Higher isotropy generally indicates better-quality embeddings. Values > 0.1 are reasonable; > 0.3 is good. Lower-dimensional embeddings tend to have higher isotropy. + +**Average Norm** +> *Definition:* Mean magnitude (L2 norm) of word vectors in the embedding space. +> +> *Intuition:* Indicates the typical "length" of vectors. Consistent norms suggest stable training; high variance may indicate some words are undertrained. +> +> *What to seek:* Relatively consistent norms across models. The absolute value matters less than consistency (low std deviation). + +**Cosine Similarity** +> *Definition:* Measures angular similarity between vectors, ranging from -1 (opposite) to 1 (identical direction). +> +> *Intuition:* Words with similar meanings should have high cosine similarity. This is the standard metric for semantic relatedness in embeddings. +> +> *What to seek:* Semantically related words should score > 0.5; unrelated words should be near 0. Synonyms often score > 0.7. + +**t-SNE Visualization** +> *Definition:* t-Distributed Stochastic Neighbor Embedding - a dimensionality reduction technique that preserves local structure for visualization. +> +> *Intuition:* Clusters in t-SNE plots indicate groups of semantically related words. Spread indicates vocabulary diversity; tight clusters suggest semantic coherence. 
+> +> *What to seek:* Meaningful clusters (e.g., numbers together, verbs together). Avoid over-interpreting distances - t-SNE preserves local, not global, structure. + +### General Interpretation Guidelines + +1. **Compare within model families:** Metrics are most meaningful when comparing models of the same type (e.g., 8k vs 64k tokenizer). +2. **Consider trade-offs:** Better performance on one metric often comes at the cost of another (e.g., compression vs. OOV rate). +3. **Context matters:** Optimal values depend on downstream tasks. Text generation may prioritize different metrics than classification. +4. **Corpus influence:** All metrics are influenced by corpus characteristics. Wikipedia text differs from social media or literature. +5. **Language-specific patterns:** Morphologically rich languages (like Arabic) may show different optimal ranges than analytic languages. + + +### Visualizations Index + +| Visualization | Description | +|---------------|-------------| +| Tokenizer Compression | Compression ratios by vocabulary size | +| Tokenizer Fertility | Average token length by vocabulary | +| Tokenizer OOV | Unknown token rates | +| Tokenizer Total Tokens | Total tokens by vocabulary | +| N-gram Perplexity | Perplexity by n-gram size | +| N-gram Entropy | Entropy by n-gram size | +| N-gram Coverage | Top pattern coverage | +| N-gram Unique | Unique n-gram counts | +| Markov Entropy | Entropy by context size | +| Markov Branching | Branching factor by context | +| Markov Contexts | Unique context counts | +| Zipf's Law | Frequency-rank distribution with fit | +| Vocab Frequency | Word frequency distribution | +| Top 20 Words | Most frequent words | +| Vocab Coverage | Cumulative coverage curve | +| Embedding Isotropy | Vector space uniformity | +| Embedding Norms | Vector magnitude distribution | +| Embedding Similarity | Word similarity heatmap | +| Nearest Neighbors | Similar words for key terms | +| t-SNE Words | 2D word embedding visualization | +| 
t-SNE Sentences | 2D sentence embedding visualization | +| Position Encoding | Encoding method comparison | +| Model Sizes | Storage requirements | +| Performance Dashboard | Comprehensive performance overview | + +--- +## About This Project + +### Data Source + +Models trained on [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly) - a monthly snapshot of Wikipedia articles across 300+ languages. + +### Project + +A project by **[Wikilangs](https://wikilangs.org)** - Open-source NLP models for every Wikipedia language. + +### Maintainer + +[Omar Kamali](https://omarkamali.com) - [Omneity Labs](https://omneitylabs.com) + +### Citation + +If you use these models in your research, please cite: + +```bibtex +@misc{wikilangs2025, + author = {Kamali, Omar}, + title = {Wikilangs: Open NLP Models for Wikipedia Languages}, + year = {2025}, + publisher = {HuggingFace}, + url = {https://huggingface.co/wikilangs}, + institution = {Omneity Labs} +} +``` + +### License + +MIT License - Free for academic and commercial use. 
+ +### Links + +- 🌐 Website: [wikilangs.org](https://wikilangs.org) +- 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs) +- 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly) +- 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali) +--- +*Generated by Wikilangs Models Pipeline* + +*Report Date: 2025-12-28 22:42:59* diff --git a/models/embeddings/monolingual/chy_128d.bin b/models/embeddings/monolingual/chy_128d.bin new file mode 100644 index 0000000000000000000000000000000000000000..7cc0d1a73826ae004ae0bbdfe1f6ecddaa38e567 --- /dev/null +++ b/models/embeddings/monolingual/chy_128d.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:9abb00efdaee95eb235694ba4738708972eef813303beafecc2c4852c0a11360 +size 1024233321 diff --git a/models/embeddings/monolingual/chy_128d.meta.json b/models/embeddings/monolingual/chy_128d.meta.json new file mode 100644 index 0000000000000000000000000000000000000000..44b1d91543c91d03bcb72030184c582569df3f5c --- /dev/null +++ b/models/embeddings/monolingual/chy_128d.meta.json @@ -0,0 +1 @@ +{"lang": "chy", "dim": 128, "max_seq_len": 512, "is_aligned": false} \ No newline at end of file diff --git a/models/embeddings/monolingual/chy_128d_metadata.json b/models/embeddings/monolingual/chy_128d_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..e6d7548ad73151b59072f478fccb96cd56bde54f --- /dev/null +++ b/models/embeddings/monolingual/chy_128d_metadata.json @@ -0,0 +1,13 @@ +{ + "language": "chy", + "dimension": 128, + "version": "monolingual", + "training_params": { + "dim": 128, + "min_count": 5, + "window": 5, + "negative": 5, + "epochs": 5 + }, + "vocab_size": 223 +} \ No newline at end of file diff --git a/models/embeddings/monolingual/chy_32d.bin b/models/embeddings/monolingual/chy_32d.bin new file mode 100644 index 0000000000000000000000000000000000000000..faa1a2545909a3901b8e11b0d5062a490ed4c3b9 --- /dev/null +++ 
b/models/embeddings/monolingual/chy_32d.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8c8e4f25fd101e1917ce06bf1768de9245362d868e23a5ebbbc943b0e4650a60 +size 256062057 diff --git a/models/embeddings/monolingual/chy_32d.meta.json b/models/embeddings/monolingual/chy_32d.meta.json new file mode 100644 index 0000000000000000000000000000000000000000..b8ad29336a79e3d7f1de6bd56eac61cc079e09be --- /dev/null +++ b/models/embeddings/monolingual/chy_32d.meta.json @@ -0,0 +1 @@ +{"lang": "chy", "dim": 32, "max_seq_len": 512, "is_aligned": false} \ No newline at end of file diff --git a/models/embeddings/monolingual/chy_32d_metadata.json b/models/embeddings/monolingual/chy_32d_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..bf5c61f66a810a6453b2a58d2dc60838365927c8 --- /dev/null +++ b/models/embeddings/monolingual/chy_32d_metadata.json @@ -0,0 +1,13 @@ +{ + "language": "chy", + "dimension": 32, + "version": "monolingual", + "training_params": { + "dim": 32, + "min_count": 5, + "window": 5, + "negative": 5, + "epochs": 5 + }, + "vocab_size": 223 +} \ No newline at end of file diff --git a/models/embeddings/monolingual/chy_64d.bin b/models/embeddings/monolingual/chy_64d.bin new file mode 100644 index 0000000000000000000000000000000000000000..8b835d503ef3904b198aa963ca8aa6ccc76c6426 --- /dev/null +++ b/models/embeddings/monolingual/chy_64d.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4c920935ac87a1441f862361919b9313ce17a00634fb4b6dc1a1bb483d79a2f5 +size 512119145 diff --git a/models/embeddings/monolingual/chy_64d.meta.json b/models/embeddings/monolingual/chy_64d.meta.json new file mode 100644 index 0000000000000000000000000000000000000000..d159b09a73d0733a5222b7312d590d2afe2be50b --- /dev/null +++ b/models/embeddings/monolingual/chy_64d.meta.json @@ -0,0 +1 @@ +{"lang": "chy", "dim": 64, "max_seq_len": 512, "is_aligned": false} \ No newline at end of file diff --git 
a/models/embeddings/monolingual/chy_64d_metadata.json b/models/embeddings/monolingual/chy_64d_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..28e834650be51bdc4e5b8fac779e22cad0c921ca --- /dev/null +++ b/models/embeddings/monolingual/chy_64d_metadata.json @@ -0,0 +1,13 @@ +{ + "language": "chy", + "dimension": 64, + "version": "monolingual", + "training_params": { + "dim": 64, + "min_count": 5, + "window": 5, + "negative": 5, + "epochs": 5 + }, + "vocab_size": 223 +} \ No newline at end of file diff --git a/models/subword_markov/chy_markov_ctx1_subword.parquet b/models/subword_markov/chy_markov_ctx1_subword.parquet new file mode 100644 index 0000000000000000000000000000000000000000..48bfbd9ca9423981a7fd71ed73dcdbb643a68ddf --- /dev/null +++ b/models/subword_markov/chy_markov_ctx1_subword.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6ea83b117f3a486981315ed6c06f66c8b179b52afa64753b10adcb037d01a299 +size 19111 diff --git a/models/subword_markov/chy_markov_ctx1_subword_metadata.json b/models/subword_markov/chy_markov_ctx1_subword_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..555cbdabfc245856d836baf2a585059de4b524f6 --- /dev/null +++ b/models/subword_markov/chy_markov_ctx1_subword_metadata.json @@ -0,0 +1,7 @@ +{ + "context_size": 1, + "variant": "subword", + "language": "chy", + "unique_contexts": 189, + "total_transitions": 113177 +} \ No newline at end of file diff --git a/models/subword_markov/chy_markov_ctx2_subword.parquet b/models/subword_markov/chy_markov_ctx2_subword.parquet new file mode 100644 index 0000000000000000000000000000000000000000..8af29c7d8c63dd216b83b54b4ae3cf708354f1f6 --- /dev/null +++ b/models/subword_markov/chy_markov_ctx2_subword.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:cc4389e4a77f6c71ab2c352c280e4fc1281ab64c25a40abbcf7af73411115bda +size 71565 diff --git 
a/models/subword_markov/chy_markov_ctx2_subword_metadata.json b/models/subword_markov/chy_markov_ctx2_subword_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..2e128b8c6e873f4c04c8ac8aa70c5939469d8f3f --- /dev/null +++ b/models/subword_markov/chy_markov_ctx2_subword_metadata.json @@ -0,0 +1,7 @@ +{ + "context_size": 2, + "variant": "subword", + "language": "chy", + "unique_contexts": 2073, + "total_transitions": 112352 +} \ No newline at end of file diff --git a/models/subword_markov/chy_markov_ctx3_subword.parquet b/models/subword_markov/chy_markov_ctx3_subword.parquet new file mode 100644 index 0000000000000000000000000000000000000000..549a659dc4ac3f5025ab2f4fe797894755e9a950 --- /dev/null +++ b/models/subword_markov/chy_markov_ctx3_subword.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:e7f1d3cabcf310e4f6579f8d79db5b6ca9e6cb8c218d840ca5d728e17f8a9d72 +size 189294 diff --git a/models/subword_markov/chy_markov_ctx3_subword_metadata.json b/models/subword_markov/chy_markov_ctx3_subword_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..12bfcd9a3b0c47c5248808bf557431e91050a788 --- /dev/null +++ b/models/subword_markov/chy_markov_ctx3_subword_metadata.json @@ -0,0 +1,7 @@ +{ + "context_size": 3, + "variant": "subword", + "language": "chy", + "unique_contexts": 10919, + "total_transitions": 111527 +} \ No newline at end of file diff --git a/models/subword_markov/chy_markov_ctx4_subword.parquet b/models/subword_markov/chy_markov_ctx4_subword.parquet new file mode 100644 index 0000000000000000000000000000000000000000..05da4d063f6bbd0151f2c299b5e930a02f52835a --- /dev/null +++ b/models/subword_markov/chy_markov_ctx4_subword.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6f537764e3ba64873ce0be3e727efce7efa3b3ee9a512520fdff8fe7b9f23660 +size 357786 diff --git a/models/subword_markov/chy_markov_ctx4_subword_metadata.json 
b/models/subword_markov/chy_markov_ctx4_subword_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..e5dcbb27e6019594508ac0e8b6c7489daea460b3 --- /dev/null +++ b/models/subword_markov/chy_markov_ctx4_subword_metadata.json @@ -0,0 +1,7 @@ +{ + "context_size": 4, + "variant": "subword", + "language": "chy", + "unique_contexts": 25260, + "total_transitions": 110702 +} \ No newline at end of file diff --git a/models/subword_ngram/chy_2gram_subword.parquet b/models/subword_ngram/chy_2gram_subword.parquet new file mode 100644 index 0000000000000000000000000000000000000000..f7fe4054bb2b00d6e459d9f5f58e96b2c8fc82ad --- /dev/null +++ b/models/subword_ngram/chy_2gram_subword.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2ed544f13314cfc2324a3b1462dfcc47e71f1236aeb6319b89b7d6e18ac3920a +size 14546 diff --git a/models/subword_ngram/chy_2gram_subword_metadata.json b/models/subword_ngram/chy_2gram_subword_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..3dcd391380620333d51a4816e09bc1fc27ac3d70 --- /dev/null +++ b/models/subword_ngram/chy_2gram_subword_metadata.json @@ -0,0 +1,7 @@ +{ + "n": 2, + "variant": "subword", + "language": "chy", + "unique_ngrams": 1127, + "total_ngrams": 113177 +} \ No newline at end of file diff --git a/models/subword_ngram/chy_3gram_subword.parquet b/models/subword_ngram/chy_3gram_subword.parquet new file mode 100644 index 0000000000000000000000000000000000000000..c86eaa2a4b34e13973086ad076b8d89ccda56941 --- /dev/null +++ b/models/subword_ngram/chy_3gram_subword.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4b077e5de241ab6ccaa7e1f45221e1202536459be7a01c5b4b63e9fd20ba730d +size 54182 diff --git a/models/subword_ngram/chy_3gram_subword_metadata.json b/models/subword_ngram/chy_3gram_subword_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..555548174e8b8190acc337c36d67838ac1a9344b --- /dev/null 
+++ b/models/subword_ngram/chy_3gram_subword_metadata.json @@ -0,0 +1,7 @@ +{ + "n": 3, + "variant": "subword", + "language": "chy", + "unique_ngrams": 4876, + "total_ngrams": 112352 +} \ No newline at end of file diff --git a/models/subword_ngram/chy_4gram_subword.parquet b/models/subword_ngram/chy_4gram_subword.parquet new file mode 100644 index 0000000000000000000000000000000000000000..e539b7ada332f216735381cb95f42c91973d7500 --- /dev/null +++ b/models/subword_ngram/chy_4gram_subword.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:9b1c577153cd04ac40d0b50f9a07ec2270c218aff4d156a1ece01fd0edfec031 +size 130151 diff --git a/models/subword_ngram/chy_4gram_subword_metadata.json b/models/subword_ngram/chy_4gram_subword_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..b4d5facb8d4bf125a5e1a04832f4cd389ea0b219 --- /dev/null +++ b/models/subword_ngram/chy_4gram_subword_metadata.json @@ -0,0 +1,7 @@ +{ + "n": 4, + "variant": "subword", + "language": "chy", + "unique_ngrams": 11151, + "total_ngrams": 111527 +} \ No newline at end of file diff --git a/models/tokenizer/chy_tokenizer_16k.model b/models/tokenizer/chy_tokenizer_16k.model new file mode 100644 index 0000000000000000000000000000000000000000..64aa1aa21aab1e68776d1e5116860c6c187514c9 --- /dev/null +++ b/models/tokenizer/chy_tokenizer_16k.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c71f72411f24cc631cd9f61bb3360c8a91877e246b11a2af3c2ce3fe5aafda3c +size 494718 diff --git a/models/tokenizer/chy_tokenizer_16k.vocab b/models/tokenizer/chy_tokenizer_16k.vocab new file mode 100644 index 0000000000000000000000000000000000000000..19d46e6af7a2b489511d339ee74c4ad9116047d3 --- /dev/null +++ b/models/tokenizer/chy_tokenizer_16k.vocab @@ -0,0 +1,16000 @@ + 0 + 0 + 0 + 0 +se -0 +st -1 +ho -2 +he -3 +▁c -4 +at -5 +an -6 +or -7 +ate -8 +ory -9 +ateg -10 +ategory -11 +▁category -12 +tse -13 +▁m -14 +ne -15 +ve -16 +ht -17 +▁n -18 +stse 
-19 +hé -20 +ôtse -21 +me -22 +no -23 +ri -24 +▁( -25 +▁ho -26 +▁t -27 +on -28 +▁na -29 +sé -30 +stôtse -31 +ma -32 +éve -33 +in -34 +éstse -35 +px -36 +vé -37 +▁a -38 +va -39 +ha -40 +th -41 +rig -42 +right -43 +mb -44 +mâ -45 +en -46 +ta -47 +▁p -48 +hk -49 +umb -50 +êstse -51 +hó -52 +sto -53 +▁he -54 +▁o -55 +ane -56 +tó -57 +▁s -58 +thumb -59 +re -60 +vó -61 +ia -62 +šk -63 +to -64 +▁é -65 +vo -66 +htá -67 +le -68 +estôtse -69 +er -70 +▁man -71 +xe -72 +mâho -73 +én -74 +ic -75 +há -76 +ná -77 +hp -78 +us -79 +▁hé -80 +né -81 +šé -82 +▁k -83 +it -84 +▁b -85 +eo -86 +mo -87 +un -88 +má -89 +▁ma -90 +vá -91 +âhé -92 +ar -93 +▁d -94 +heo -95 +honá -96 +mâhoestôtse -97 +sê -98 +še -99 +▁tsé -100 +âhe -101 +êse -102 +▁mo -103 +and -104 +al -105 +htáme -106 +▁manâhé -107 +ol -108 +om -109 +éno -110 +▁naa -111 +tane -112 +ed -113 +▁manâhéno -114 +énêstse -115 +nó -116 +honáé -117 +il -118 +is -119 +tsé -120 +es -121 +as -122 +év -123 +▁vé -124 +hoo -125 +honáéšé -126 +ing -127 +▁hó -128 +), -129 +ško -130 +mé -131 +ion -132 +tsê -133 +ra -134 +▁* -135 +na -136 +ro -137 +▁st -138 +so -139 +▁má -140 +ȯtse -141 +tan -142 +▁of -143 +▁vó -144 +énêstsestôtse -145 +sta -146 +hi -147 +▁l -148 +stȯtse -149 +▁g -150 +▁w -151 +vê -152 +hko -153 +▁re -154 +▁un -155 +hést -156 +ited -157 +▁hést -158 +▁state -159 +▁f -160 +▁states -161 +ke -162 +óne -163 +▁e -164 +kêse -165 +kêseho -166 +só -167 +hpe -168 +htse -169 +▁the -170 +ac -171 +nė -172 +xo -173 +▁ó -174 +hke -175 +▁hotó -176 +ka -177 +▁me -178 +mene -179 +▁united -180 +ad -181 +ge -182 +ésto -183 +ur -184 +ánó -185 +▁máhtáme -186 +pu -187 +ôxe -188 +ation -189 +hne -190 +stó -191 +▁mâ -192 +ánóva -193 +han -194 +hetane -195 +lo -196 +ce -197 +ëö -198 +▁há -199 +▁ne -200 +ano -201 +hno -202 +tanéno -203 +▁héstánóva -204 +ww -205 +ot -206 +▁j -207 +sia -208 +▁hotómá -209 +ko -210 +seo -211 +ške -212 +▁ame -213 +onôtse -214 +máhtáme -215 +ina -216 +▁ch -217 +vóne -218 +co -219 +id -220 +ant -221 +men -222 +éhe -223 +pa -224 
+▁ve -225 +hotó -226 +la -227 +xa -228 +ric -229 +âhese -230 +sóvóne -231 +// -232 +tp -233 +▁" -234 +:// -235 +sus -236 +http -237 +▁hova -238 +bl -239 +vi -240 +ôh -241 +êsto -242 +nestȯtse -243 +ap -244 +sa -245 +ȧhe -246 +némene -247 +némenestôtse -248 +ham -249 +héó -250 +ȯhó -251 +htsé -252 +▁tse -253 +tsêhést -254 +tsêhéstâhese -255 +et -256 +if -257 +ôhe -258 +ȧhé -259 +▁an -260 +▁pl -261 +xêse -262 +meško -263 +▁meško -264 +ȯhónestȯtse -265 +em -266 +▁h -267 +ame -268 +oné -269 +the -270 +ëno -271 +sêstse -272 +hoestôtse -273 +go -274 +po -275 +qu -276 +▁in -277 +land -278 +tahe -279 +ôhtá -280 +ul -281 +cen -282 +esé -283 +otse -284 +xovê -285 +évȧhe -286 +▁china -287 +šê -288 +ata -289 +man -290 +▁mé -291 +blic -292 +hooma -293 +▁mâhe -294 +hotóma -295 +menôtse -296 +▁mâheóne -297 +). -298 +ee -299 +ft -300 +ver -301 +ania -302 +hova -303 +êheo -304 +public -305 +bu -306 +el -307 +kê -308 +kô -309 +oo -310 +▁i -311 +ans -312 +ian -313 +▁vo -314 +oland -315 +hovahne -316 +évȧhetanéno -317 +▁mâhoestôtse -318 +oe -319 +ti -320 +▁au -321 +▁ta -322 +vari -323 +ȧhého -324 +ch -325 +ir -326 +nê -327 +ou -328 +ter -329 +https -330 +mâhéó -331 +▁asia -332 +variant -333 +▁hóxovê -334 +hetaneho -335 +ay -336 +les -337 +ôse -338 +▁co -339 +môxe -340 +éhno -341 +ôhné -342 +▁chi -343 +▁col -344 +▁con -345 +▁éve -346 +htsêstse -347 +▁republic -348 +ehe -349 +hey -350 +hpo -351 +éso -352 +enne -353 +estó -354 +ôhném -355 +anêheo -356 +heyenne -357 +hoohtsêstse -358 +ôhnéménêstse -359 +ig -360 +vâ -361 +chi -362 +ome -363 +▁ar -364 +▁év -365 +ings -366 +rica -367 +▁mâhoestôtsene -368 +eb -369 +pp -370 +sė -371 +▁- -372 +háa -373 +éma -374 +óho -375 +▁ha -376 +▁oó -377 +▁th -378 +▁to -379 +pain -380 +▁hóestó -381 +▁manȧhého -382 +am -383 +▁– -384 +ana -385 +ild -386 +onê -387 +ort -388 +www -389 +▁ri -390 +▁se -391 +séeo -392 +tame -393 +ther -394 +taneo -395 +énano -396 +ceania -397 +mâhoéve -398 +mó -399 +xê -400 +▁r -401 +tal -402 +▁de -403 +▁ná -404 +hése -405 +▁hoo 
-406 +nestse -407 +▁hesto -408 +▁oxêse -409 +▁variant -410 +héstanêheo -411 +▁évȯhónestȯtse -412 +de -413 +hä -414 +te -415 +tr -416 +anė -417 +est -418 +ish -419 +▁mȧ -420 +▁né -421 +▁nė -422 +▁po -423 +ôhke -424 +šeno -425 +build -426 +heová -427 +hkohe -428 +▁hoham -429 +▁tséhe -430 +oceania -431 +buildings -432 +heováéstse -433 +heévȧhetanéno -434 +▁tsétsêhéstâhese -435 +gl -436 +pe -437 +ty -438 +tá -439 +ces -440 +eno -441 +hóo -442 +ohe -443 +anad -444 +häme -445 +vone -446 +vého -447 +xeme -448 +▁not -449 +▁poland -450 +hestôtse -451 +") -452 +ck -453 +gu -454 +ib -455 +evo -456 +ine -457 +ple -458 +óve -459 +▁al -460 +▁gr -461 +▁mi -462 +aneo -463 +name -464 +census -465 +▁river -466 +ab -467 +im -468 +ut -469 +▁[ -470 +ati -471 +com -472 +ene -473 +hkê -474 +nam -475 +pul -476 +êst -477 +óhk -478 +▁(" -479 +▁ka -480 +▁sa -481 +▁vá -482 +hame -483 +viet -484 +▁ind -485 +eople -486 +êstoo -487 +heséeo -488 +▁thumb -489 +hanôtse -490 +vietnam -491 +véhoname -492 +âhestôtse -493 +ki -494 +um -495 +ary -496 +fer -497 +hta -498 +hëö -499 +oná -500 +sëö -501 +évo -502 +hone -503 +šêta -504 +▁cen -505 +▁tsê -506 +feren -507 +véhoo -508 +vôtse -509 +▁auto -510 +anėheo -511 +éstove -512 +▁grand -513 +▁theft -514 +šêtaëno -515 +nėsóvóne -516 +▁tséhetó -517 +kêhanôtse -518 +tsêhetane -519 +tsêhetaneonôtse -520 +be -521 +ef -522 +ph -523 +éh -524 +óe -525 +ese -526 +kee -527 +one -528 +org -529 +▁ca -530 +▁pa -531 +▁tó -532 +apan -533 +haná -534 +port -535 +staa -536 +▁nor -537 +▁son -538 +hóomo -539 +mahpe -540 +mâheo -541 +▁hánê -542 +htseto -543 +▁referen -544 +▁cheyenne -545 +▁hoohtseto -546 +▁hánêsóvóne -547 +tsétsêhéstâhese -548 +ae -549 +mö -550 +ry -551 +ya -552 +za -553 +ehá -554 +hae -555 +hto -556 +ial -557 +ire -558 +ity -559 +nia -560 +nif -561 +oma -562 +son -563 +ôho -564 +ôhé -565 +▁km -566 +▁mó -567 +▁šé -568 +hamé -569 +hest -570 +nomá -571 +unty -572 +▁heó -573 +stone -574 +vátame -575 +▁chile -576 +▁héstanėheo -577 +▁references -578 +ba -579 +mê 
-580 +ôx -581 +ack -582 +ila -583 +nėx -584 +yom -585 +▁bo -586 +▁da -587 +▁oo -588 +▁xa -589 +nėse -590 +rika -591 +▁and -592 +▁bri -593 +▁éše -594 +trans -595 +▁môxe -596 +hésemé -597 +people -598 +sevone -599 +yoming -600 +éhnėse -601 +▁spain -602 +óonôtse -603 +▁tsénėx -604 +hóomoehá -605 +▁america -606 +portation -607 +▁hánėsóvóne -608 +évȯhónestȯtse -609 +▁tsénėxhésemé -610 +transportation -611 +'- -612 +ea -613 +ga -614 +kó -615 +sá -616 +uk -617 +ws -618 +▁y -619 +"), -620 +all -621 +col -622 +htó -623 +noo -624 +per -625 +ran -626 +▁ro -627 +▁sé -628 +▁tá -629 +inus -630 +left -631 +ohke -632 +tion -633 +▁oóo -634 +anada -635 +estse -636 +hóhtá -637 +lands -638 +tsêhé -639 +tôtse -640 +évâhe -641 +óhomö -642 +ėstse -643 +htsená -644 +staane -645 +éstova -646 +▁sonic -647 +énestse -648 +▁census -649 +▁oóoxêse -650 +šenovôtse -651 +▁nėstaane -652 +ca -653 +eé -654 +hö -655 +ik -656 +kâ -657 +ss -658 +ts -659 +xâ -660 +▁v -661 +ado -662 +ahe -663 +aho -664 +eso -665 +ica -666 +ill -667 +ral -668 +raz -669 +sis -670 +too -671 +ull -672 +äma -673 +▁gu -674 +▁is -675 +▁no -676 +asia -677 +nife -678 +stsé -679 +véto -680 +âhpe -681 +▁cro -682 +▁dic -683 +▁san -684 +▁sou -685 +anehe -686 +anese -687 +eotse -688 +hnoma -689 +ôhtsé -690 +▁héva -691 +▁kóhk -692 +▁máto -693 +mérika -694 +nêstse -695 +nėstse -696 +véotse -697 +▁knife -698 +▁nésto -699 +▁county -700 +▁notäma -701 +pulation -702 +▁heévâhe -703 +anestôtse -704 +vátamevéotse -705 +ao -706 +dp -707 +io -708 +ng -709 +▁z -710 +amo -711 +ast -712 +dge -713 +enó -714 +eve -715 +gra -716 +hoe -717 +ies -718 +ind -719 +mar -720 +omé -721 +ong -722 +oom -723 +otá -724 +tor -725 +êše -726 +êšk -727 +ëso -728 +ëva -729 +▁af -730 +▁bl -731 +▁ko -732 +engl -733 +hkoo -734 +hpév -735 +ital -736 +kota -737 +ótsé -738 +▁ama -739 +atama -740 +color -741 +graph -742 +onéma -743 +tsémâ -744 +ussia -745 +škéso -746 +šêške -747 +▁dull -748 +onêške -749 +ótséva -750 +▁colle -751 +▁tsêhe -752 +english -753 +tionary -754 
+tsémâhpév -755 +▁dictionary -756 +▁heévâhetanéno -757 +%) -758 +." -759 +br -760 +ci -761 +ld -762 +nâ -763 +uc -764 +ue -765 +wa -766 +wn -767 +éo -768 +′′ -769 +▁· -770 +art -771 +evé -772 +fic -773 +gen -774 +key -775 +mon -776 +ond -777 +oro -778 +ree -779 +ton -780 +web -781 +wor -782 +ého -783 +émâ -784 +éše -785 +óno -786 +▁bu -787 +▁en -788 +▁lo -789 +▁tä -790 +néhe -791 +tain -792 +vôse -793 +xêšé -794 +éhpe -795 +▁cap -796 +▁mat -797 +ansas -798 +anéve -799 +ative -800 +chile -801 +hkôhe -802 +hésto -803 +spain -804 +stotó -805 +▁aust -806 +▁haná -807 +▁hešk -808 +▁hová -809 +▁xama -810 +center -811 +hanehe -812 +honáeo -813 +htôtse -814 +poland -815 +tóvéto -816 +êstove -817 +▁south -818 +▁tâhpe -819 +▁dakota -820 +▁nêstse -821 +▁amérika -822 +▁college -823 +▁vietnam -824 +therlands -825 +▁hohamháa -826 +▁mȧhóomoehá -827 +▁manâhestôtse -828 +▁nêstsestôtse -829 +ah -830 +da -831 +do -832 +li -833 +mô -834 +ni -835 +pd -836 +pr -837 +ub -838 +xė -839 +▁á -840 +aen -841 +ang -842 +are -843 +car -844 +eho -845 +eme -846 +eta -847 +low -848 +ono -849 +ria -850 +ser -851 +uth -852 +êha -853 +êhó -854 +óné -855 +▁hi -856 +▁pi -857 +▁sq -858 +alif -859 +news -860 +stat -861 +vêsé -862 +óevó -863 +ôheo -864 +▁com -865 +▁gre -866 +▁har -867 +▁tur -868 +▁óoe -869 +hohpe -870 +ornia -871 +razil -872 +taesé -873 +thern -874 +škese -875 +▁vóhp -876 +united -877 +ôhtávo -878 +▁aésto -879 +▁chief -880 +▁japan -881 +honáéva -882 +méškéso -883 +tsêhésê -884 +▁americ -885 +▁indian -886 +▁vóhkoo -887 +cheyenne -888 +colorado -889 +onêškeho -890 +émâhtáme -891 +ôhkemôxe -892 +▁kóhkonê -893 +▁šéstotó -894 +alifornia -895 +▁vášêtaëno -896 +▁heévȧhetanéno -897 +__ -898 +ak -899 +ds -900 +ex -901 +gy -902 +kȯ -903 +mi -904 +nt -905 +ok -906 +pi -907 +pl -908 +si -909 +sw -910 +▁' -911 +▁ł -912 +ase -913 +dom -914 +erc -915 +eto -916 +gov -917 +hal -918 +ite -919 +ith -920 +lep -921 +neo -922 +onó -923 +rib -924 +toa -925 +tur -926 +uri -927 +way -928 +âhá -929 +évé -930 +êho 
-931 +óhé -932 +▁le -933 +▁lu -934 +▁ob -935 +▁sp -936 +▁sw -937 +aska -938 +enëö -939 +eéve -940 +hese -941 +htáv -942 +ille -943 +kâsé -944 +késo -945 +kôsá -946 +sono -947 +tová -948 +vose -949 +óhtá -950 +ôhkê -951 +ėsto -952 +▁arc -953 +▁new -954 +▁tsė -955 +▁war -956 +chive -957 +heóve -958 +háano -959 +ralia -960 +story -961 +stótó -962 +ursus -963 +ôhtse -964 +▁city -965 +▁hesó -966 +▁king -967 +ficial -968 +kansas -969 +kȯhtáv -970 +émâhéó -971 +ôhtávê -972 +ational -973 +háatama -974 +náhkohe -975 +sevoneo -976 +▁canada -977 +▁ehóhtá -978 +stotôtse -979 +▁kingdom -980 +▁náhkohe -981 +hoestȯtse -982 +vášêtaëno -983 +hamestôtse -984 +population -985 +▁kóhkonêhëö -986 +▁manestôtse -987 +▁netherlands -988 +"; -989 +fi -990 +km -991 +mȧ -992 +sk -993 +vô -994 +▁x -995 +amb -996 +anó -997 +ash -998 +ave -999 +cep -1000 +con -1001 +des -1002 +dis -1003 +ero -1004 +geo -1005 +gin -1006 +hné -1007 +kon -1008 +ook -1009 +ors -1010 +pdf -1011 +red -1012 +rse -1013 +uro -1014 +use -1015 +éme -1016 +óoe -1017 +óxe -1018 +ôsé -1019 +▁ab -1020 +▁ad -1021 +▁at -1022 +▁cl -1023 +▁fr -1024 +▁jo -1025 +▁on -1026 +▁vi -1027 +eden -1028 +hemê -1029 +heve -1030 +hóno -1031 +móht -1032 +noma -1033 +reek -1034 +ress -1035 +tavo -1036 +tâhé -1037 +tóhé -1038 +ukra -1039 +vovó -1040 +xévo -1041 +éome -1042 +šeta -1043 +▁car -1044 +▁ire -1045 +ambig -1046 +anóse -1047 +eséve -1048 +hesto -1049 +honoo -1050 +japan -1051 +maheo -1052 +onáhe -1053 +pinus -1054 +tohke -1055 +vâtse -1056 +évôxe -1057 +óneve -1058 +▁sena -1059 +▁vóhk -1060 +estone -1061 +hpotsé -1062 +seotse -1063 +xekôsá -1064 +xemeno -1065 +êstone -1066 +óonéma -1067 +▁black -1068 +▁great -1069 +graphic -1070 +ukraine -1071 +wyoming -1072 +šeeséve -1073 +▁africa -1074 +▁otaesé -1075 +▁russia -1076 +▁tšêške -1077 +disambig -1078 +nestôtse -1079 +óoetaneo -1080 +▁ireland -1081 +▁tsėhése -1082 +▁wyoming -1083 +▁northern -1084 +▁official -1085 +estonemaheo -1086 +hpotséhohpe -1087 +haméstotôtse -1088 +.) 
-1089 +ax -1090 +bo -1091 +hu -1092 +kh -1093 +ud -1094 +vö -1095 +xó -1096 +yy -1097 +ze -1098 +ää -1099 +ëë -1100 +ón -1101 +▁, -1102 +▁ô -1103 +age -1104 +ané -1105 +ara -1106 +axa -1107 +ber -1108 +bri -1109 +che -1110 +ers -1111 +her -1112 +hog -1113 +hpâ -1114 +hpé -1115 +ile -1116 +ini -1117 +net -1118 +ora -1119 +ril -1120 +stá -1121 +ura -1122 +urg -1123 +uta -1124 +weo -1125 +wpp -1126 +zon -1127 +éne -1128 +óhe -1129 +▁kâ -1130 +▁la -1131 +▁nê -1132 +▁nó -1133 +▁oe -1134 +aehe -1135 +atia -1136 +cksk -1137 +etan -1138 +hkeo -1139 +hóme -1140 +kane -1141 +mber -1142 +mbia -1143 +moxe -1144 +mâhö -1145 +nehe -1146 +neve -1147 +névó -1148 +omëë -1149 +onal -1150 +taly -1151 +tana -1152 +tern -1153 +tóno -1154 +xemé -1155 +êstó -1156 +▁bel -1157 +▁for -1158 +▁mon -1159 +▁mén -1160 +▁tai -1161 +▁éšk -1162 +carpa -1163 +evoem -1164 +hoham -1165 +htseo -1166 +idaho -1167 +iland -1168 +ohtsé -1169 +rance -1170 +south -1171 +stahe -1172 +torta -1173 +tsêhe -1174 +world -1175 +éheme -1176 +óhkon -1177 +óhomo -1178 +▁hetó -1179 +▁heva -1180 +braska -1181 +ckskin -1182 +dgehog -1183 +kôhóme -1184 +notová -1185 +vášeta -1186 +xêšéne -1187 +êškóne -1188 +▁amâho -1189 +▁creek -1190 +▁heóvê -1191 +▁hohpâ -1192 +▁italy -1193 +▁mâhpé -1194 +▁netse -1195 +▁poeso -1196 +▁péhpe -1197 +▁póevó -1198 +▁vóhpo -1199 +seoneve -1200 +šéstótó -1201 +staahtsé -1202 +váótséva -1203 +▁britain -1204 +▁capital -1205 +▁náhkôhe -1206 +▁tâhpeno -1207 +hnestôtse -1208 +névóvâtse -1209 +xekôsáeho -1210 +▁contorta -1211 +▁hedgehog -1212 +▁manâhého -1213 +evoemêstse -1214 +▁australia -1215 +▁tsevášeta -1216 +▁óoetanéno -1217 +sevoneóneve -1218 +▁california -1219 +▁hohamháahp -1220 +nėstsestȯtse -1221 +". 
-1222 +): -1223 +fa -1224 +hn -1225 +ie -1226 +ká -1227 +ls -1228 +lv -1229 +ly -1230 +ov -1231 +pt -1232 +pâ -1233 +pó -1234 +sp -1235 +uz -1236 +we -1237 +wi -1238 +▁â -1239 +▁š -1240 +▁— -1241 +ace -1242 +amp -1243 +anâ -1244 +arc -1245 +ard -1246 +ass -1247 +bud -1248 +cro -1249 +ená -1250 +eše -1251 +ger -1252 +ges -1253 +haa -1254 +heó -1255 +hle -1256 +hnó -1257 +hox -1258 +ice -1259 +ili -1260 +ism -1261 +ket -1262 +lan -1263 +lig -1264 +mit -1265 +oli -1266 +pes -1267 +que -1268 +rat -1269 +rid -1270 +tem -1271 +tos -1272 +tov -1273 +tsė -1274 +tôx -1275 +uni -1276 +ust -1277 +von -1278 +xov -1279 +ylv -1280 +éšk -1281 +óvé -1282 +ȯxe -1283 +▁ac -1284 +▁hu -1285 +▁it -1286 +▁qu -1287 +▁ra -1288 +▁so -1289 +▁sá -1290 +▁te -1291 +▁ur -1292 +aenê -1293 +eoxa -1294 +esta -1295 +esto -1296 +hase -1297 +hevé -1298 +homa -1299 +icle -1300 +idae -1301 +ides -1302 +ilai -1303 +inal -1304 +kêsé -1305 +meno -1306 +máno -1307 +náne -1308 +ohko -1309 +ortu -1310 +pano -1311 +quah -1312 +tapâ -1313 +tine -1314 +tish -1315 +tôxá -1316 +utch -1317 +vian -1318 +vánó -1319 +vávo -1320 +vâhá -1321 +xeto -1322 +xêhe -1323 +âhev -1324 +óhko -1325 +▁anê -1326 +▁anė -1327 +▁bar -1328 +▁bir -1329 +▁den -1330 +▁est -1331 +▁geo -1332 +▁mas -1333 +▁ome -1334 +▁ová -1335 +▁par -1336 +▁pro -1337 +▁red -1338 +▁sea -1339 +▁slo -1340 +▁squ -1341 +▁too -1342 +▁óno -1343 +aehes -1344 +ensis -1345 +erosa -1346 +hahtá -1347 +hasëö -1348 +heóne -1349 +hkêho -1350 +honáe -1351 +icago -1352 +kaneo -1353 +mâhae -1354 +nêhpo -1355 +omêše -1356 +oured -1357 +perus -1358 +sebud -1359 +stove -1360 +stséa -1361 +thing -1362 +tsévó -1363 +urope -1364 +évohk -1365 +óhévâ -1366 +ónétó -1367 +ôhévo -1368 +▁ange -1369 +▁boer -1370 +▁crow -1371 +▁hesé -1372 +▁honó -1373 +▁kéme -1374 +▁mana -1375 +▁moun -1376 +▁nemâ -1377 +▁pond -1378 +▁site -1379 +▁tase -1380 +▁tsis -1381 +anâhke -1382 +canada -1383 +eneško -1384 +enôtse -1385 +hemêšé -1386 +hóxovê -1387 +inland -1388 +kâsénâ -1389 +lahoma -1390 +ligion 
-1391 +manâhé -1392 +xovôho -1393 +▁canad -1394 +▁heškó -1395 +▁hotse -1396 +▁hoóxe -1397 +▁mahpe -1398 +▁miles -1399 +▁mésta -1400 +▁portu -1401 +▁tahle -1402 +▁táase -1403 +▁vóhka -1404 +ameotse -1405 +archive -1406 +emanese -1407 +hailand -1408 +ométane -1409 +seohtsé -1410 +tapâhae -1411 +ôhtsévó -1412 +škovávo -1413 +▁brazil -1414 +▁france -1415 +▁hetane -1416 +▁sweden -1417 +▁tsesto -1418 +▁tséana -1419 +▁turkey -1420 +hásêstse -1421 +keemahpe -1422 +manȧhého -1423 +nebraska -1424 +notováhe -1425 +onêstove -1426 +tsénêhpo -1427 +uniperus -1428 +xâhtsená -1429 +xėseotse -1430 +êhóóhévâ -1431 +▁heškóve -1432 +▁hováhne -1433 +▁háhnoma -1434 +▁oeškese -1435 +▁ukraine -1436 +héstánóva -1437 +évohkôtse -1438 +▁buckskin -1439 +▁coloured -1440 +▁national -1441 +▁váótséva -1442 +▁vóhpeoxa -1443 +hahtátaneo -1444 +háatamaahe -1445 +véhestôtse -1446 +xetoeneško -1447 +▁anėsóvóne -1448 +▁ponderosa -1449 +▁tahlequah -1450 +▁vétapâhae -1451 +êsenotováhe -1452 +êstonemâheo -1453 +▁canadensis -1454 +mâhaemenôtse -1455 +▁mȧhoestȯtse -1456 +aenêhoestôtse -1457 +êhóóhévâhtseo -1458 +▁vóhkoohetane -1459 +keehoohtsêstse -1460 +kâsénâhnestôtse -1461 +ôhtávêhahtátaneo -1462 +bi -1463 +cl -1464 +ct -1465 +dź -1466 +eá -1467 +gs -1468 +hã -1469 +ji -1470 +jo -1471 +ju -1472 +ky -1473 +ns -1474 +op -1475 +oó -1476 +rt -1477 +vf -1478 +xé -1479 +yd -1480 +▁) -1481 +▁+ -1482 +▁. 
-1483 +▁: -1484 +act -1485 +ala -1486 +bez -1487 +can -1488 +clo -1489 +dle -1490 +edg -1491 +ent -1492 +era -1493 +evá -1494 +gal -1495 +gas -1496 +gdp -1497 +gel -1498 +har -1499 +hkė -1500 +hov -1501 +hve -1502 +ick -1503 +imb -1504 +int -1505 +iro -1506 +ise -1507 +itu -1508 +kie -1509 +maa -1510 +mat -1511 +mes -1512 +mus -1513 +nev -1514 +nez -1515 +nii -1516 +ola -1517 +olf -1518 +oly -1519 +ová -1520 +par -1521 +pet -1522 +pic -1523 +ris -1524 +sel -1525 +ssa -1526 +séó -1527 +sóe -1528 +tho -1529 +tom -1530 +táa -1531 +tšê -1532 +uby -1533 +uit -1534 +ulo -1535 +vak -1536 +wan -1537 +âsé -1538 +éna -1539 +ódź -1540 +óto -1541 +▁'' -1542 +▁), -1543 +▁__ -1544 +▁am -1545 +▁ex -1546 +▁kȧ -1547 +▁nē -1548 +▁ok -1549 +▁or -1550 +▁pu -1551 +▁sȯ -1552 +▁tr -1553 +▁va -1554 +▁wi -1555 +▁ło -1556 +aceb -1557 +acer -1558 +ance -1559 +asio -1560 +ated -1561 +ater -1562 +bean -1563 +coun -1564 +ench -1565 +eohé -1566 +erre -1567 +gent -1568 +glas -1569 +gypt -1570 +hahe -1571 +havo -1572 +háse -1573 +icat -1574 +icus -1575 +iver -1576 +kome -1577 +kone -1578 +kévé -1579 +kôhé -1580 +lery -1581 +ment -1582 +menó -1583 +mons -1584 +nóne -1585 +omée -1586 +orea -1587 +póno -1588 +rahã -1589 +rgin -1590 +ring -1591 +riti -1592 +sena -1593 +sity -1594 +sone -1595 +stan -1596 +tamó -1597 +tove -1598 +tury -1599 +tâhá -1600 +uela -1601 +unus -1602 +vfik -1603 +voto -1604 +xôse -1605 +zart -1606 +zech -1607 +áhto -1608 +éotó -1609 +êstá -1610 +êške -1611 +óseo -1612 +ôhëö -1613 +ôtáa -1614 +šemé -1615 +ȯhkė -1616 +▁amo -1617 +▁ant -1618 +▁dia -1619 +▁han -1620 +▁hoi -1621 +▁jim -1622 +▁lib -1623 +▁los -1624 +▁mic -1625 +▁mis -1626 +▁môx -1627 +▁mȧx -1628 +▁okô -1629 +▁oné -1630 +▁otá -1631 +▁per -1632 +▁ter -1633 +▁ton -1634 +▁vée -1635 +▁web -1636 +▁wor -1637 +▁yel -1638 +▁éte -1639 +▁ôxa -1640 +aehno -1641 +archt -1642 +edgar -1643 +enohe -1644 +ercus -1645 +estsé -1646 +etane -1647 +files -1648 +guage -1649 +hohtó -1650 +horse -1651 +hotoa -1652 +hpenó -1653 +imate -1654 
+italy -1655 +ition -1656 +kaehe -1657 +keéno -1658 +mâhöö -1659 +netsé -1660 +nésta -1661 +ombia -1662 +omôho -1663 +onald -1664 +ondon -1665 +onávo -1666 +picea -1667 +poeso -1668 +pulus -1669 +péhpe -1670 +rizon -1671 +theid -1672 +tsêha -1673 +ubykh -1674 +vóono -1675 +xeesé -1676 +âséha -1677 +énohe -1678 +ôhené -1679 +öhtse -1680 +škëso -1681 +▁demo -1682 +▁dong -1683 +▁from -1684 +▁heap -1685 +▁heše -1686 +▁héne -1687 +▁héve -1688 +▁koro -1689 +▁mato -1690 +▁náhk -1691 +▁sage -1692 +▁sint -1693 +▁táxe -1694 +▁tôho -1695 +▁wolf -1696 +▁łódź -1697 +aenôhe -1698 +ameehe -1699 +ashing -1700 +brazil -1701 +ehóhtá -1702 +estôhe -1703 +gelman -1704 +graphy -1705 +hamëso -1706 +hkeehe -1707 +hovane -1708 +illion -1709 +kóhkon -1710 +ohketo -1711 +onéame -1712 +ouglas -1713 +russia -1714 +sėstse -1715 +ternal -1716 +ternet -1717 +tinent -1718 +tšêške -1719 +xemené -1720 +êhaseo -1721 +êhasëö -1722 +ôhtáva -1723 +ôseesé -1724 +▁birds -1725 +▁congo -1726 +▁czech -1727 +▁hestó -1728 +▁heóve -1729 +▁hésto -1730 +▁india -1731 +▁north -1732 +▁onéma -1733 +▁pinus -1734 +▁sȯsóe -1735 +▁vóhpe -1736 +▁áháse -1737 +▁éohke -1738 +▁łobez -1739 +acebook -1740 +etanóto -1741 +gentina -1742 +hestohe -1743 +icators -1744 +ligions -1745 +nezuela -1746 +oeškese -1747 +rginian -1748 +ribbean -1749 +searcht -1750 +statssa -1751 +séeotsé -1752 +taheéve -1753 +toháano -1754 +tsêhest -1755 +tâháéno -1756 +vonêheo -1757 +énanóse -1758 +êsevêsé -1759 +▁arctos -1760 +▁hoohëö -1761 +▁háeohé -1762 +▁háesto -1763 +▁london -1764 +▁mozart -1765 +▁móxêšé -1766 +▁norway -1767 +▁people -1768 +▁square -1769 +▁yellow -1770 +enáhkohe -1771 +hamémôxe -1772 +hestȯtse -1773 +republic -1774 +vahtôtse -1775 +vóonotse -1776 +xeeséeto -1777 +xestôtse -1778 +▁angeles -1779 +▁british -1780 +▁century -1781 +▁climate -1782 +▁croatia -1783 +▁finland -1784 +▁history -1785 +▁hotohke -1786 +▁onéhavo -1787 +▁rosebud -1788 +▁toháano -1789 +▁tôhohko -1790 +ashington -1791 +emestȯtse -1792 +gelmannii -1793 +heónemôxe -1794 
+hémâhoéve -1795 +juniperus -1796 +konôhtávo -1797 +tôxámâhéó -1798 +▁american -1799 +▁colombia -1800 +▁hohpâháa -1801 +▁mountain -1802 +▁okôhkêho -1803 +▁thailand -1804 +enóseoneve -1805 +indicators -1806 +xôseonéame -1807 +âséhahnoma -1808 +▁heséeotsé -1809 +▁póevónáne -1810 +▁virginian -1811 +▁vótâháéno -1812 +▁éškôseesé -1813 +aenôheéstse -1814 +hemêšéonávo -1815 +▁héstanêheo -1816 +▁móxêšéhevé -1817 +▁population -1818 +▁tsehestohe -1819 +▁tsetsêhest -1820 +ameehestôtse -1821 +polandnestse -1822 +êsenotováheé -1823 +êsevêséhotoa -1824 +▁engelmannii -1825 +▁héstánóvaan -1826 +▁yellowstone -1827 +móhtâhestôtse -1828 +ohketoetanóto -1829 +▁héstánóvaanéve -1830 +▁tsetsêhestâhese -1831 +", -1832 +)- -1833 +.: -1834 +bc -1835 +cu -1836 +cy -1837 +dc -1838 +hā -1839 +hō -1840 +ii -1841 +iv -1842 +iz -1843 +ja -1844 +ks -1845 +ké -1846 +kė -1847 +lu -1848 +ml -1849 +nu -1850 +nô -1851 +of -1852 +oh -1853 +su -1854 +sô -1855 +tâ -1856 +tä -1857 +ua -1858 +vė -1859 +xt -1860 +ym -1861 +áa -1862 +án -1863 +äö -1864 +éé -1865 +óo -1866 +óé -1867 +šė -1868 +▁$ -1869 +▁/ -1870 +%), -1871 +.") -1872 +ach -1873 +ahé -1874 +aka -1875 +app -1876 +ath -1877 +ato -1878 +bia -1879 +bir -1880 +bli -1881 +bou -1882 +bra -1883 +bre -1884 +cel -1885 +cer -1886 +cho -1887 +cib -1888 +cip -1889 +cit -1890 +cta -1891 +dal -1892 +dia -1893 +duc -1894 +ear -1895 +eet -1896 +ell -1897 +elo -1898 +ena -1899 +ené -1900 +esy -1901 +eva -1902 +evê -1903 +haz -1904 +hen -1905 +hil -1906 +hla -1907 +hpa -1908 +htâ -1909 +htö -1910 +ich -1911 +ida -1912 +ima -1913 +imf -1914 +ino -1915 +iza -1916 +kom -1917 +lyn -1918 +mbo -1919 +med -1920 +mâx -1921 +nas -1922 +nes -1923 +ney -1924 +nom -1925 +nor -1926 +néó -1927 +oan -1928 +oem -1929 +ols -1930 +ove -1931 +ppp -1932 +pro -1933 +pén -1934 +res -1935 +rio -1936 +sex -1937 +stâ -1938 +tri -1939 +uba -1940 +uly -1941 +und -1942 +ung -1943 +ure -1944 +uru -1945 +uva -1946 +vec -1947 +voo -1948 +véo -1949 +zer -1950 +áta -1951 +çao -1952 +éoe -1953 
+évâ -1954 +êhé -1955 +êna -1956 +êsé -1957 +êxo -1958 +ëhe -1959 +ëse -1960 +óma -1961 +ôsá -1962 +ėst -1963 +▁do -1964 +▁es -1965 +▁fi -1966 +▁ht -1967 +▁ki -1968 +▁ku -1969 +▁kö -1970 +▁oc -1971 +▁pe -1972 +▁ru -1973 +▁sk -1974 +▁ti -1975 +▁ut -1976 +▁wa -1977 +▁ôh -1978 +abwe -1979 +ames -1980 +anat -1981 +ants -1982 +asus -1983 +axaa -1984 +axáa -1985 +cerv -1986 +char -1987 +code -1988 +curi -1989 +down -1990 +edir -1991 +enns -1992 +eolo -1993 +eove -1994 +eral -1995 +esen -1996 +está -1997 +esëö -1998 +eved -1999 +fast -2000 +fire -2001 +gins -2002 +hael -2003 +hahk -2004 +hama -2005 +hamȧ -2006 +hene -2007 +here -2008 +hesó -2009 +hetó -2010 +heše -2011 +hová -2012 +hoxo -2013 +html -2014 +htoo -2015 +héve -2016 +höva -2017 +hāme -2018 +iago -2019 +ibik -2020 +ical -2021 +ific -2022 +ills -2023 +inst -2024 +ires -2025 +isco -2026 +ithu -2027 +khaz -2028 +komê -2029 +koné -2030 +kêsa -2031 +llow -2032 +mala -2033 +mami -2034 +many -2035 +mare -2036 +mark -2037 +mate -2038 +mené -2039 +môse -2040 +nahe -2041 +nene -2042 +nese -2043 +nâha -2044 +néta -2045 +nóse -2046 +olia -2047 +otsw -2048 +page -2049 +peru -2050 +pher -2051 +pper -2052 +ratt -2053 +road -2054 +sane -2055 +sean -2056 +seph -2057 +sian -2058 +star -2059 +tavö -2060 +tiba -2061 +ties -2062 +tivi -2063 +tohe -2064 +táhe -2065 +täso -2066 +tómô -2067 +tóxe -2068 +urne -2069 +utah -2070 +vada -2071 +vata -2072 +voem -2073 +vove -2074 +vâho -2075 +xove -2076 +áahe -2077 +âheo -2078 +âhtö -2079 +éseo -2080 +éstó -2081 +ësta -2082 +óóhe -2083 +ôhta -2084 +ôhéó -2085 +škee -2086 +ȯhnó -2087 +▁air -2088 +▁ara -2089 +▁atl -2090 +▁bay -2091 +▁ben -2092 +▁bit -2093 +▁cal -2094 +▁day -2095 +▁del -2096 +▁din -2097 +▁gla -2098 +▁hea -2099 +▁loc -2100 +▁mel -2101 +▁mex -2102 +▁mos -2103 +▁nan -2104 +▁nat -2105 +▁neg -2106 +▁pac -2107 +▁pra -2108 +▁pre -2109 +▁rho -2110 +▁sac -2111 +▁sco -2112 +▁sho -2113 +▁sto -2114 +▁tra -2115 +▁tre -2116 +▁van -2117 +▁wal -2118 +▁xaó -2119 +▁xäö -2120 +▁šan -2121 +advec 
-2122 +anohe -2123 +arten -2124 +atone -2125 +bania -2126 +brief -2127 +chief -2128 +dagas -2129 +desia -2130 +elope -2131 +guese -2132 +halus -2133 +hamâx -2134 +hketa -2135 +hkéso -2136 +hotse -2137 +house -2138 +htâhé -2139 +htóht -2140 +illet -2141 +kaeta -2142 +keaho -2143 +miten -2144 +máhta -2145 +néheo -2146 +nôtse -2147 +oemdc -2148 +póevó -2149 +river -2150 +selle -2151 +sisto -2152 +ssing -2153 +stâhe -2154 +tanév -2155 +tooxo -2156 +tšêšk -2157 +vakia -2158 +venia -2159 +vetoo -2160 +ville -2161 +vóhpo -2162 +vôhtó -2163 +xésta -2164 +émene -2165 +énéhe -2166 +évôse -2167 +óvéta -2168 +ôxhoo -2169 +ȧhéno -2170 +▁alge -2171 +▁anna -2172 +▁anta -2173 +▁apar -2174 +▁camp -2175 +▁cauc -2176 +▁cura -2177 +▁dece -2178 +▁desy -2179 +▁fire -2180 +▁hehp -2181 +▁hese -2182 +▁john -2183 +▁kosa -2184 +▁kôsá -2185 +▁mene -2186 +▁mâhö -2187 +▁mótó -2188 +▁néta -2189 +▁nėse -2190 +▁obvi -2191 +▁orig -2192 +▁oéve -2193 +▁page -2194 +▁poly -2195 +▁rock -2196 +▁sand -2197 +▁tono -2198 +▁tséh -2199 +▁tséx -2200 +▁tóno -2201 +▁tôhé -2202 +▁ural -2203 +▁vose -2204 +▁wars -2205 +▁west -2206 +▁wiki -2207 +▁zimb -2208 +▁éveé -2209 +▁ôhmo -2210 +aneonó -2211 +aénohe -2212 +chived -2213 +datama -2214 +edirne -2215 +eestse -2216 +eoestá -2217 +eétâhé -2218 +haeolo -2219 +hanáhe -2220 +hkoohe -2221 +hotóao -2222 +hováve -2223 +htsemo -2224 +hótame -2225 +kemene -2226 +kôhtse -2227 +kôhévé -2228 +marete -2229 +mêstaa -2230 +môtove -2231 +nétâhé -2232 +onesia -2233 +person -2234 +prunus -2235 +rizona -2236 +sósone -2237 +tevfik -2238 +tivist -2239 +turkey -2240 +uation -2241 +urasia -2242 +vation -2243 +veotse -2244 +xemenó -2245 +xemâho -2246 +xepóno -2247 +xêhest -2248 +éestse -2249 +éseohé -2250 +êstóne -2251 +óhtáhe -2252 +ėstane -2253 +šeméhe -2254 +ȯhkėha -2255 +▁aruba -2256 +▁botsw -2257 +▁coast -2258 +▁dutch -2259 +▁hestá -2260 +▁horse -2261 +▁hotóa -2262 +▁hésta -2263 +▁idaho -2264 +▁korea -2265 +▁lasio -2266 +▁lithu -2267 +▁meave -2268 +▁pages -2269 +▁penns -2270 +▁press 
-2271 +▁retri -2272 +▁sibik -2273 +▁there -2274 +▁tsehe -2275 +▁ubykh -2276 +▁xamae -2277 +antiago -2278 +article -2279 +atonebi -2280 +estséat -2281 +hestôxe -2282 +heóvemá -2283 +hkêsono -2284 +hohtóva -2285 +honáeka -2286 +háhnoma -2287 +hótsêha -2288 +kêsaéve -2289 +mamione -2290 +matôtse -2291 +montana -2292 +noonáhe -2293 +ohtôtse -2294 +panâhke -2295 +populus -2296 +sistots -2297 +sonants -2298 +tsévovó -2299 +tsêheta -2300 +tómôhéó -2301 +vetanév -2302 +vêstséa -2303 +vóneške -2304 +xeméhne -2305 +xâhonoo -2306 +xémâhéó -2307 +átamáno -2308 +âhtötse -2309 +éhestat -2310 +éotóaho -2311 +éstónéó -2312 +óhéhéve -2313 +▁ahkôhe -2314 +▁donald -2315 +▁europe -2316 +▁hoésto -2317 +▁hóvôse -2318 +▁kóhkon -2319 +▁kȧhamȧ -2320 +▁manaan -2321 +▁mariti -2322 +▁matana -2323 +▁ménôhe -2324 +▁mónahe -2325 +▁nanóse -2326 +▁native -2327 +▁nevada -2328 +▁néstse -2329 +▁pirahã -2330 +▁reôhke -2331 +▁tóhtoo -2332 +▁univer -2333 +▁éestse -2334 +▁êstove -2335 +aseohtsé -2336 +cephalus -2337 +etanetse -2338 +external -2339 +hahtsená -2340 +hamâxéve -2341 +háestôhe -2342 +hóxovôho -2343 +keahonoo -2344 +kâsénoma -2345 +kėhanáhe -2346 +language -2347 +manâhéno -2348 +móxêšéne -2349 +peruvian -2350 +tšêškévâ -2351 +vovetäso -2352 +weoworld -2353 +éoeškëso -2354 +êškóneho -2355 +ôhketóxe -2356 +šestôtse -2357 +▁aéstome -2358 +▁chicago -2359 +▁curaçao -2360 +▁denmark -2361 +▁douglas -2362 +▁hahpenó -2363 +▁heévȧhe -2364 +▁heóvâhá -2365 +▁hohamma -2366 +▁islands -2367 +▁maarten -2368 +▁madagas -2369 +▁menôtse -2370 +▁million -2371 +▁mâhpémo -2372 +▁sanders -2373 +▁skillet -2374 +▁tseohke -2375 +▁xamaevo -2376 +anėsóvóne -2377 +enestôtse -2378 +eotsestsé -2379 +hamémâhéó -2380 +netsénóne -2381 +nêškovávo -2382 +ohtsévôse -2383 +xemâhoévé -2384 +ȧhestȯtse -2385 +▁activist -2386 +▁archived -2387 +▁botswana -2388 +▁caucasus -2389 +▁crossing -2390 +▁december -2391 +▁hohtsemo -2392 +▁hotóhkeo -2393 +▁héstooma -2394 +▁manȧhéno -2395 +▁oklahoma -2396 +▁original -2397 +▁pennsylv -2398 
+▁rhodesia -2399 +▁tsisinst -2400 +▁tsêhésto -2401 +▁zimbabwe -2402 +california -2403 +datamapper -2404 +hkoestséat -2405 +hánėsóvóne -2406 +hóxeeséeto -2407 +kaetaévôxe -2408 +kévénėstse -2409 +kôhévénéhe -2410 +nêstsevôse -2411 +sêhestôtse -2412 +tsêhéstahe -2413 +xėseotsean -2414 +▁apartheid -2415 +▁argentina -2416 +▁caribbean -2417 +▁geography -2418 +▁hoéstónéó -2419 +▁lithuania -2420 +▁monêškeho -2421 +▁obviative -2422 +▁religions -2423 +▁retrieved -2424 +▁sibikeove -2425 +▁tasemiten -2426 +enenestôtse -2427 +eétâhéstove -2428 +héstánóvaan -2429 +komêšéstótó -2430 +mȧhoestȯtse -2431 +êstonêstove -2432 +ôhkêheóvemá -2433 +ôhtávaestse -2434 +▁consonants -2435 +▁heóvonêheo -2436 +▁hóvôsenâha -2437 +▁hóxâhtsená -2438 +▁koronestse -2439 +▁kȧhamȧxévo -2440 +▁lasiocarpa -2441 +▁mâhóomoehá -2442 +▁môxemarete -2443 +▁nėsesėstse -2444 +▁portuguese -2445 +▁tootómôhéó -2446 +▁tsêhestôxe -2447 +▁university -2448 +▁véhestôtse -2449 +▁véhestȯtse -2450 +▁washington -2451 +eestsestȯtse -2452 +staahtsémeno -2453 +vátameveotse -2454 +▁ahkôheöhtse -2455 +▁desyatonebi -2456 +▁háeohémahpe -2457 +▁nemâmamione -2458 +▁vétapâhaeto -2459 +▁éškôseeséma -2460 +mâhoestôtsene -2461 +vetanévȯhkėha -2462 +xêhestâhtötse -2463 +êstonemâheone -2464 +šenonetsénóne -2465 +▁heóvêháhnoma -2466 +▁reôhkemôtove -2467 +aehesanestôtse -2468 +disambiguation -2469 +hamâxéveóhtáhe -2470 +héstoeotsestsé -2471 +kemâhaemenôtse -2472 +▁otaesémenôtse -2473 +▁héstoomaestôtse -2474 +▁héstánóvaannéta -2475 +▁tsisinstsistots -2476 +▁tsêhéstoestôtse -2477 +▁vóhkoohémâhoéve -2478 +af -2479 +ag -2480 +cf -2481 +cr -2482 +ec -2483 +fe -2484 +fo -2485 +fr -2486 +ip -2487 +jp -2488 +pė -2489 +sz -2490 +yo -2491 +ão -2492 +aga -2493 +anl -2494 +aná -2495 +ban -2496 +bat -2497 +big -2498 +blo -2499 +); -2500 +ai -2501 +bn -2502 +eg -2503 +eó -2504 +gc -2505 +ix -2506 +kȧ -2507 +lm -2508 +my -2509 +nd -2510 +oq -2511 +rh -2512 +up -2513 +wh -2514 +ép -2515 +anz -2516 +cfm -2517 +cot -2518 +cre -2519 +day -2520 +dif -2521 
+dil -2522 +din -2523 +don -2524 +dor -2525 +ean -2526 +eat -2527 +esó -2528 +ext -2529 +ffa -2530 +for -2531 +gne -2532 +gta -2533 +hin -2534 +hké -2535 +hna -2536 +hou -2537 +hoé -2538 +htä -2539 +htȧ -2540 +hum -2541 +hôh -2542 +ibe -2543 +ibi -2544 +ics -2545 +ide -2546 +ied -2547 +ils -2548 +inh -2549 +iov -2550 +irc -2551 +iry -2552 +ita -2553 +iti -2554 +jap -2555 +jpg -2556 +kes -2557 +kyo -2558 +lag -2559 +lay -2560 +lsx -2561 +mau -2562 +meo -2563 +mib -2564 +mot -2565 +nee -2566 +odo -2567 +ork -2568 +orn -2569 +out -2570 +ped -2571 +poe -2572 +ppi -2573 +pua -2574 +ron -2575 +sci -2576 +sha -2577 +slo -2578 +sse -2579 +ste -2580 +tah -2581 +táo -2582 +uco -2583 +ump -2584 +une -2585 +usp -2586 +vel -2587 +wad -2588 +yan -2589 +ype -2590 +zco -2591 +äse -2592 +épó -2593 +éšé -2594 +êhá -2595 +êma -2596 +êšé -2597 +óvâ -2598 +ôhö -2599 +.. -2600 + -15962 +{ -15963 +à -15964 +ñ -15965 +ø -15966 +ú -15967 +û -15968 +ę -15969 +ń -15970 +ǐ -15971 +ʔ -15972 +н -15973 +п -15974 +ы -15975 +ә -15976 +ર -15977 +ા -15978 +ᐃ -15979 +民 -15980 +è -15981 +î -15982 +ò -15983 +ć -15984 +ī -15985 +ŋ -15986 +ś -15987 +ž -15988 +ǧ -15989 +ɛ -15990 +ɪ -15991 +б -15992 +е -15993 +л -15994 +о -15995 diff --git a/models/tokenizer/chy_tokenizer_8k.model b/models/tokenizer/chy_tokenizer_8k.model new file mode 100644 index 0000000000000000000000000000000000000000..3c2bd8bdb30b7d0f487e18950af3dd12076f3693 --- /dev/null +++ b/models/tokenizer/chy_tokenizer_8k.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:17ad6a6d3558277d6925eb68f3855cb005e2fe3557d8485bf68a6f7274296dab +size 375317 diff --git a/models/tokenizer/chy_tokenizer_8k.vocab b/models/tokenizer/chy_tokenizer_8k.vocab new file mode 100644 index 0000000000000000000000000000000000000000..696dd1ad2040390fcaac724f8d8dab9d8ce0a04b --- /dev/null +++ b/models/tokenizer/chy_tokenizer_8k.vocab @@ -0,0 +1,8000 @@ + 0 + 0 + 0 + 0 +se -0 +st -1 +ho -2 +he -3 +▁c -4 +at -5 +an -6 +or -7 +ate -8 +ory -9 +ateg 
-10 +ategory -11 +▁category -12 +tse -13 +▁m -14 +ne -15 +ve -16 +ht -17 +▁n -18 +stse -19 +hé -20 +ôtse -21 +me -22 +no -23 +ri -24 +▁( -25 +▁ho -26 +▁t -27 +on -28 +▁na -29 +sé -30 +stôtse -31 +ma -32 +éve -33 +in -34 +éstse -35 +px -36 +vé -37 +▁a -38 +va -39 +ha -40 +th -41 +rig -42 +right -43 +mb -44 +mâ -45 +en -46 +ta -47 +▁p -48 +hk -49 +umb -50 +êstse -51 +hó -52 +sto -53 +▁he -54 +▁o -55 +ane -56 +tó -57 +▁s -58 +thumb -59 +re -60 +vó -61 +ia -62 +šk -63 +to -64 +▁é -65 +vo -66 +htá -67 +le -68 +estôtse -69 +er -70 +▁man -71 +xe -72 +mâho -73 +én -74 +ic -75 +há -76 +ná -77 +hp -78 +us -79 +▁hé -80 +né -81 +šé -82 +▁k -83 +it -84 +▁b -85 +eo -86 +mo -87 +un -88 +má -89 +▁ma -90 +vá -91 +âhé -92 +ar -93 +▁d -94 +heo -95 +honá -96 +mâhoestôtse -97 +sê -98 +še -99 +▁tsé -100 +âhe -101 +êse -102 +▁mo -103 +and -104 +al -105 +htáme -106 +▁manâhé -107 +ol -108 +om -109 +éno -110 +▁naa -111 +tane -112 +ed -113 +▁manâhéno -114 +énêstse -115 +nó -116 +honáé -117 +il -118 +is -119 +tsé -120 +es -121 +as -122 +év -123 +▁vé -124 +hoo -125 +honáéšé -126 +ing -127 +▁hó -128 +), -129 +ško -130 +mé -131 +ion -132 +tsê -133 +ra -134 +▁* -135 +na -136 +ro -137 +▁st -138 +so -139 +▁má -140 +ȯtse -141 +tan -142 +▁of -143 +▁vó -144 +énêstsestôtse -145 +sta -146 +hi -147 +▁l -148 +stȯtse -149 +▁g -150 +▁w -151 +vê -152 +hko -153 +▁re -154 +▁un -155 +hést -156 +ited -157 +▁hést -158 +▁state -159 +▁f -160 +▁states -161 +ke -162 +óne -163 +▁e -164 +kêse -165 +kêseho -166 +só -167 +hpe -168 +htse -169 +▁the -170 +ac -171 +nė -172 +xo -173 +▁ó -174 +hke -175 +▁hotó -176 +ka -177 +▁me -178 +mene -179 +▁united -180 +ad -181 +ge -182 +ésto -183 +ur -184 +ánó -185 +▁máhtáme -186 +pu -187 +ôxe -188 +ation -189 +hne -190 +stó -191 +▁mâ -192 +ánóva -193 +han -194 +hetane -195 +lo -196 +ce -197 +ëö -198 +▁há -199 +▁ne -200 +ano -201 +hno -202 +tanéno -203 +▁héstánóva -204 +ww -205 +ot -206 +▁j -207 +sia -208 +▁hotómá -209 +ko -210 +seo -211 +ške -212 +▁ame -213 +onôtse -214 +máhtáme -215 
+ina -216 +▁ch -217 +vóne -218 +co -219 +id -220 +ant -221 +men -222 +éhe -223 +pa -224 +▁ve -225 +hotó -226 +la -227 +xa -228 +ric -229 +âhese -230 +sóvóne -231 +// -232 +tp -233 +▁" -234 +:// -235 +sus -236 +http -237 +▁hova -238 +bl -239 +vi -240 +ôh -241 +êsto -242 +nestȯtse -243 +ap -244 +sa -245 +ȧhe -246 +némene -247 +némenestôtse -248 +ham -249 +héó -250 +ȯhó -251 +htsé -252 +▁tse -253 +tsêhést -254 +tsêhéstâhese -255 +et -256 +if -257 +ôhe -258 +ȧhé -259 +▁an -260 +▁pl -261 +xêse -262 +meško -263 +▁meško -264 +ȯhónestȯtse -265 +em -266 +▁h -267 +ame -268 +oné -269 +the -270 +ëno -271 +sêstse -272 +hoestôtse -273 +go -274 +po -275 +qu -276 +▁in -277 +land -278 +tahe -279 +ôhtá -280 +ul -281 +cen -282 +esé -283 +otse -284 +xovê -285 +évȧhe -286 +▁china -287 +šê -288 +ata -289 +man -290 +▁mé -291 +blic -292 +hooma -293 +▁mâhe -294 +hotóma -295 +menôtse -296 +▁mâheóne -297 +). -298 +ee -299 +ft -300 +ver -301 +ania -302 +hova -303 +êheo -304 +public -305 +bu -306 +el -307 +kê -308 +kô -309 +oo -310 +▁i -311 +ans -312 +ian -313 +▁vo -314 +oland -315 +hovahne -316 +évȧhetanéno -317 +▁mâhoestôtse -318 +oe -319 +ti -320 +▁au -321 +▁ta -322 +vari -323 +ȧhého -324 +ch -325 +ir -326 +nê -327 +ou -328 +ter -329 +https -330 +mâhéó -331 +▁asia -332 +variant -333 +▁hóxovê -334 +hetaneho -335 +ay -336 +les -337 +ôse -338 +▁co -339 +môxe -340 +éhno -341 +ôhné -342 +▁chi -343 +▁col -344 +▁con -345 +▁éve -346 +htsêstse -347 +▁republic -348 +ehe -349 +hey -350 +hpo -351 +éso -352 +enne -353 +estó -354 +ôhném -355 +anêheo -356 +heyenne -357 +hoohtsêstse -358 +ôhnéménêstse -359 +ig -360 +vâ -361 +chi -362 +ome -363 +▁ar -364 +▁év -365 +ings -366 +rica -367 +▁mâhoestôtsene -368 +eb -369 +pp -370 +sė -371 +▁- -372 +háa -373 +éma -374 +óho -375 +▁ha -376 +▁oó -377 +▁th -378 +▁to -379 +pain -380 +▁hóestó -381 +▁manȧhého -382 +am -383 +▁– -384 +ana -385 +ild -386 +onê -387 +ort -388 +www -389 +▁ri -390 +▁se -391 +séeo -392 +tame -393 +ther -394 +taneo -395 +énano -396 +ceania -397 
+mâhoéve -398 +mó -399 +xê -400 +▁r -401 +tal -402 +▁de -403 +▁ná -404 +hése -405 +▁hoo -406 +nestse -407 +▁hesto -408 +▁oxêse -409 +▁variant -410 +héstanêheo -411 +▁évȯhónestȯtse -412 +de -413 +hä -414 +te -415 +tr -416 +anė -417 +est -418 +ish -419 +▁mȧ -420 +▁né -421 +▁nė -422 +▁po -423 +ôhke -424 +šeno -425 +build -426 +heová -427 +hkohe -428 +▁hoham -429 +▁tséhe -430 +oceania -431 +buildings -432 +heováéstse -433 +heévȧhetanéno -434 +▁tsétsêhéstâhese -435 +gl -436 +pe -437 +ty -438 +tá -439 +ces -440 +eno -441 +hóo -442 +ohe -443 +anad -444 +häme -445 +vone -446 +vého -447 +xeme -448 +▁not -449 +▁poland -450 +hestôtse -451 +") -452 +ck -453 +gu -454 +ib -455 +evo -456 +ine -457 +ple -458 +óve -459 +▁al -460 +▁gr -461 +▁mi -462 +aneo -463 +name -464 +census -465 +▁river -466 +ab -467 +im -468 +ut -469 +▁[ -470 +ati -471 +com -472 +ene -473 +hkê -474 +nam -475 +pul -476 +êst -477 +óhk -478 +▁(" -479 +▁ka -480 +▁sa -481 +▁vá -482 +hame -483 +viet -484 +▁ind -485 +eople -486 +êstoo -487 +heséeo -488 +▁thumb -489 +hanôtse -490 +vietnam -491 +véhoname -492 +âhestôtse -493 +ki -494 +um -495 +ary -496 +fer -497 +hta -498 +hëö -499 +oná -500 +sëö -501 +évo -502 +hone -503 +šêta -504 +▁cen -505 +▁tsê -506 +feren -507 +véhoo -508 +vôtse -509 +▁auto -510 +anėheo -511 +éstove -512 +▁grand -513 +▁theft -514 +šêtaëno -515 +nėsóvóne -516 +▁tséhetó -517 +kêhanôtse -518 +tsêhetane -519 +tsêhetaneonôtse -520 +be -521 +ef -522 +ph -523 +éh -524 +óe -525 +ese -526 +kee -527 +one -528 +org -529 +▁ca -530 +▁pa -531 +▁tó -532 +apan -533 +haná -534 +port -535 +staa -536 +▁nor -537 +▁son -538 +hóomo -539 +mahpe -540 +mâheo -541 +▁hánê -542 +htseto -543 +▁referen -544 +▁cheyenne -545 +▁hoohtseto -546 +▁hánêsóvóne -547 +tsétsêhéstâhese -548 +ae -549 +mö -550 +ry -551 +ya -552 +za -553 +ehá -554 +hae -555 +hto -556 +ial -557 +ire -558 +ity -559 +nia -560 +nif -561 +oma -562 +son -563 +ôho -564 +ôhé -565 +▁km -566 +▁mó -567 +▁šé -568 +hamé -569 +hest -570 +nomá -571 +unty -572 +▁heó -573 
+stone -574 +vátame -575 +▁chile -576 +▁héstanėheo -577 +▁references -578 +ba -579 +mê -580 +ôx -581 +ack -582 +ila -583 +nėx -584 +yom -585 +▁bo -586 +▁da -587 +▁oo -588 +▁xa -589 +nėse -590 +rika -591 +▁and -592 +▁bri -593 +▁éše -594 +trans -595 +▁môxe -596 +hésemé -597 +people -598 +sevone -599 +yoming -600 +éhnėse -601 +▁spain -602 +óonôtse -603 +▁tsénėx -604 +hóomoehá -605 +▁america -606 +portation -607 +▁hánėsóvóne -608 +évȯhónestȯtse -609 +▁tsénėxhésemé -610 +transportation -611 +'- -612 +ea -613 +ga -614 +kó -615 +sá -616 +uk -617 +ws -618 +▁y -619 +"), -620 +all -621 +col -622 +htó -623 +noo -624 +per -625 +ran -626 +▁ro -627 +▁sé -628 +▁tá -629 +inus -630 +left -631 +ohke -632 +tion -633 +▁oóo -634 +anada -635 +estse -636 +hóhtá -637 +lands -638 +tsêhé -639 +tôtse -640 +évâhe -641 +óhomö -642 +ėstse -643 +htsená -644 +staane -645 +éstova -646 +▁sonic -647 +énestse -648 +▁census -649 +▁oóoxêse -650 +šenovôtse -651 +▁nėstaane -652 +ca -653 +eé -654 +hö -655 +ik -656 +kâ -657 +ss -658 +ts -659 +xâ -660 +▁v -661 +ado -662 +ahe -663 +aho -664 +eso -665 +ica -666 +ill -667 +ral -668 +raz -669 +sis -670 +too -671 +ull -672 +äma -673 +▁gu -674 +▁is -675 +▁no -676 +asia -677 +nife -678 +stsé -679 +véto -680 +âhpe -681 +▁cro -682 +▁dic -683 +▁san -684 +▁sou -685 +anehe -686 +anese -687 +eotse -688 +hnoma -689 +ôhtsé -690 +▁héva -691 +▁kóhk -692 +▁máto -693 +mérika -694 +nêstse -695 +nėstse -696 +véotse -697 +▁knife -698 +▁nésto -699 +▁county -700 +▁notäma -701 +pulation -702 +▁heévâhe -703 +anestôtse -704 +vátamevéotse -705 +ao -706 +dp -707 +io -708 +ng -709 +▁z -710 +amo -711 +ast -712 +dge -713 +enó -714 +eve -715 +gra -716 +hoe -717 +ies -718 +ind -719 +mar -720 +omé -721 +ong -722 +oom -723 +otá -724 +tor -725 +êše -726 +êšk -727 +ëso -728 +ëva -729 +▁af -730 +▁bl -731 +▁ko -732 +engl -733 +hkoo -734 +hpév -735 +ital -736 +kota -737 +ótsé -738 +▁ama -739 +atama -740 +color -741 +graph -742 +onéma -743 +tsémâ -744 +ussia -745 +škéso -746 +šêške -747 +▁dull -748 
+onêške -749 +ótséva -750 +▁colle -751 +▁tsêhe -752 +english -753 +tionary -754 +tsémâhpév -755 +▁dictionary -756 +▁heévâhetanéno -757 +%) -758 +." -759 +br -760 +ci -761 +ld -762 +nâ -763 +uc -764 +ue -765 +wa -766 +wn -767 +éo -768 +′′ -769 +▁· -770 +art -771 +evé -772 +fic -773 +gen -774 +key -775 +mon -776 +ond -777 +oro -778 +ree -779 +ton -780 +web -781 +wor -782 +ého -783 +émâ -784 +éše -785 +óno -786 +▁bu -787 +▁en -788 +▁lo -789 +▁tä -790 +néhe -791 +tain -792 +vôse -793 +xêšé -794 +éhpe -795 +▁cap -796 +▁mat -797 +ansas -798 +anéve -799 +ative -800 +chile -801 +hkôhe -802 +hésto -803 +spain -804 +stotó -805 +▁aust -806 +▁haná -807 +▁hešk -808 +▁hová -809 +▁xama -810 +center -811 +hanehe -812 +honáeo -813 +htôtse -814 +poland -815 +tóvéto -816 +êstove -817 +▁south -818 +▁tâhpe -819 +▁dakota -820 +▁nêstse -821 +▁amérika -822 +▁college -823 +▁vietnam -824 +therlands -825 +▁hohamháa -826 +▁mȧhóomoehá -827 +▁manâhestôtse -828 +▁nêstsestôtse -829 +ah -830 +da -831 +do -832 +li -833 +mô -834 +ni -835 +pd -836 +pr -837 +ub -838 +xė -839 +▁á -840 +aen -841 +ang -842 +are -843 +car -844 +eho -845 +eme -846 +eta -847 +low -848 +ono -849 +ria -850 +ser -851 +uth -852 +êha -853 +êhó -854 +óné -855 +▁hi -856 +▁pi -857 +▁sq -858 +alif -859 +news -860 +stat -861 +vêsé -862 +óevó -863 +ôheo -864 +▁com -865 +▁gre -866 +▁har -867 +▁tur -868 +▁óoe -869 +hohpe -870 +ornia -871 +razil -872 +taesé -873 +thern -874 +škese -875 +▁vóhp -876 +united -877 +ôhtávo -878 +▁aésto -879 +▁chief -880 +▁japan -881 +honáéva -882 +méškéso -883 +tsêhésê -884 +▁americ -885 +▁indian -886 +▁vóhkoo -887 +cheyenne -888 +colorado -889 +onêškeho -890 +émâhtáme -891 +ôhkemôxe -892 +▁kóhkonê -893 +▁šéstotó -894 +alifornia -895 +▁vášêtaëno -896 +▁heévȧhetanéno -897 +__ -898 +ak -899 +ds -900 +ex -901 +gy -902 +kȯ -903 +mi -904 +nt -905 +ok -906 +pi -907 +pl -908 +si -909 +sw -910 +▁' -911 +▁ł -912 +ase -913 +dom -914 +erc -915 +eto -916 +gov -917 +hal -918 +ite -919 +ith -920 +lep -921 +neo -922 +onó 
-923 +rib -924 +toa -925 +tur -926 +uri -927 +way -928 +âhá -929 +évé -930 +êho -931 +óhé -932 +▁le -933 +▁lu -934 +▁ob -935 +▁sp -936 +▁sw -937 +aska -938 +enëö -939 +eéve -940 +hese -941 +htáv -942 +ille -943 +kâsé -944 +késo -945 +kôsá -946 +sono -947 +tová -948 +vose -949 +óhtá -950 +ôhkê -951 +ėsto -952 +▁arc -953 +▁new -954 +▁tsė -955 +▁war -956 +chive -957 +heóve -958 +háano -959 +ralia -960 +story -961 +stótó -962 +ursus -963 +ôhtse -964 +▁city -965 +▁hesó -966 +▁king -967 +ficial -968 +kansas -969 +kȯhtáv -970 +émâhéó -971 +ôhtávê -972 +ational -973 +háatama -974 +náhkohe -975 +sevoneo -976 +▁canada -977 +▁ehóhtá -978 +stotôtse -979 +▁kingdom -980 +▁náhkohe -981 +hoestȯtse -982 +vášêtaëno -983 +hamestôtse -984 +population -985 +▁kóhkonêhëö -986 +▁manestôtse -987 +▁netherlands -988 +"; -989 +fi -990 +km -991 +mȧ -992 +sk -993 +vô -994 +▁x -995 +amb -996 +anó -997 +ash -998 +ave -999 +cep -1000 +con -1001 +des -1002 +dis -1003 +ero -1004 +geo -1005 +gin -1006 +hné -1007 +kon -1008 +ook -1009 +ors -1010 +pdf -1011 +red -1012 +rse -1013 +uro -1014 +use -1015 +éme -1016 +óoe -1017 +óxe -1018 +ôsé -1019 +▁ab -1020 +▁ad -1021 +▁at -1022 +▁cl -1023 +▁fr -1024 +▁jo -1025 +▁on -1026 +▁vi -1027 +eden -1028 +hemê -1029 +heve -1030 +hóno -1031 +móht -1032 +noma -1033 +reek -1034 +ress -1035 +tavo -1036 +tâhé -1037 +tóhé -1038 +ukra -1039 +vovó -1040 +xévo -1041 +éome -1042 +šeta -1043 +▁car -1044 +▁ire -1045 +ambig -1046 +anóse -1047 +eséve -1048 +hesto -1049 +honoo -1050 +japan -1051 +maheo -1052 +onáhe -1053 +pinus -1054 +tohke -1055 +vâtse -1056 +évôxe -1057 +óneve -1058 +▁sena -1059 +▁vóhk -1060 +estone -1061 +hpotsé -1062 +seotse -1063 +xekôsá -1064 +xemeno -1065 +êstone -1066 +óonéma -1067 +▁black -1068 +▁great -1069 +graphic -1070 +ukraine -1071 +wyoming -1072 +šeeséve -1073 +▁africa -1074 +▁otaesé -1075 +▁russia -1076 +▁tšêške -1077 +disambig -1078 +nestôtse -1079 +óoetaneo -1080 +▁ireland -1081 +▁tsėhése -1082 +▁wyoming -1083 +▁northern -1084 +▁official -1085 
+estonemaheo -1086 +hpotséhohpe -1087 +haméstotôtse -1088 +.) -1089 +ax -1090 +bo -1091 +hu -1092 +kh -1093 +ud -1094 +vö -1095 +xó -1096 +yy -1097 +ze -1098 +ää -1099 +ëë -1100 +ón -1101 +▁, -1102 +▁ô -1103 +age -1104 +ané -1105 +ara -1106 +axa -1107 +ber -1108 +bri -1109 +che -1110 +ers -1111 +her -1112 +hog -1113 +hpâ -1114 +hpé -1115 +ile -1116 +ini -1117 +net -1118 +ora -1119 +ril -1120 +stá -1121 +ura -1122 +urg -1123 +uta -1124 +weo -1125 +wpp -1126 +zon -1127 +éne -1128 +óhe -1129 +▁kâ -1130 +▁la -1131 +▁nê -1132 +▁nó -1133 +▁oe -1134 +aehe -1135 +atia -1136 +cksk -1137 +etan -1138 +hkeo -1139 +hóme -1140 +kane -1141 +mber -1142 +mbia -1143 +moxe -1144 +mâhö -1145 +nehe -1146 +neve -1147 +névó -1148 +omëë -1149 +onal -1150 +taly -1151 +tana -1152 +tern -1153 +tóno -1154 +xemé -1155 +êstó -1156 +▁bel -1157 +▁for -1158 +▁mon -1159 +▁mén -1160 +▁tai -1161 +▁éšk -1162 +carpa -1163 +evoem -1164 +hoham -1165 +htseo -1166 +idaho -1167 +iland -1168 +ohtsé -1169 +rance -1170 +south -1171 +stahe -1172 +torta -1173 +tsêhe -1174 +world -1175 +éheme -1176 +óhkon -1177 +óhomo -1178 +▁hetó -1179 +▁heva -1180 +braska -1181 +ckskin -1182 +dgehog -1183 +kôhóme -1184 +notová -1185 +vášeta -1186 +xêšéne -1187 +êškóne -1188 +▁amâho -1189 +▁creek -1190 +▁heóvê -1191 +▁hohpâ -1192 +▁italy -1193 +▁mâhpé -1194 +▁netse -1195 +▁poeso -1196 +▁péhpe -1197 +▁póevó -1198 +▁vóhpo -1199 +seoneve -1200 +šéstótó -1201 +staahtsé -1202 +váótséva -1203 +▁britain -1204 +▁capital -1205 +▁náhkôhe -1206 +▁tâhpeno -1207 +hnestôtse -1208 +névóvâtse -1209 +xekôsáeho -1210 +▁contorta -1211 +▁hedgehog -1212 +▁manâhého -1213 +evoemêstse -1214 +▁australia -1215 +▁tsevášeta -1216 +▁óoetanéno -1217 +sevoneóneve -1218 +▁california -1219 +▁hohamháahp -1220 +nėstsestȯtse -1221 +". 
-1222 +): -1223 +fa -1224 +hn -1225 +ie -1226 +ká -1227 +ls -1228 +lv -1229 +ly -1230 +ov -1231 +pt -1232 +pâ -1233 +pó -1234 +sp -1235 +uz -1236 +we -1237 +wi -1238 +▁â -1239 +▁š -1240 +▁— -1241 +ace -1242 +amp -1243 +anâ -1244 +arc -1245 +ard -1246 +ass -1247 +bud -1248 +cro -1249 +ená -1250 +eše -1251 +ger -1252 +ges -1253 +haa -1254 +heó -1255 +hle -1256 +hnó -1257 +hox -1258 +ice -1259 +ili -1260 +ism -1261 +ket -1262 +lan -1263 +lig -1264 +mit -1265 +oli -1266 +pes -1267 +que -1268 +rat -1269 +rid -1270 +tem -1271 +tos -1272 +tov -1273 +tsė -1274 +tôx -1275 +uni -1276 +ust -1277 +von -1278 +xov -1279 +ylv -1280 +éšk -1281 +óvé -1282 +ȯxe -1283 +▁ac -1284 +▁hu -1285 +▁it -1286 +▁qu -1287 +▁ra -1288 +▁so -1289 +▁sá -1290 +▁te -1291 +▁ur -1292 +aenê -1293 +eoxa -1294 +esta -1295 +esto -1296 +hase -1297 +hevé -1298 +homa -1299 +icle -1300 +idae -1301 +ides -1302 +ilai -1303 +inal -1304 +kêsé -1305 +meno -1306 +máno -1307 +náne -1308 +ohko -1309 +ortu -1310 +pano -1311 +quah -1312 +tapâ -1313 +tine -1314 +tish -1315 +tôxá -1316 +utch -1317 +vian -1318 +vánó -1319 +vávo -1320 +vâhá -1321 +xeto -1322 +xêhe -1323 +âhev -1324 +óhko -1325 +▁anê -1326 +▁anė -1327 +▁bar -1328 +▁bir -1329 +▁den -1330 +▁est -1331 +▁geo -1332 +▁mas -1333 +▁ome -1334 +▁ová -1335 +▁par -1336 +▁pro -1337 +▁red -1338 +▁sea -1339 +▁slo -1340 +▁squ -1341 +▁too -1342 +▁óno -1343 +aehes -1344 +ensis -1345 +erosa -1346 +hahtá -1347 +hasëö -1348 +heóne -1349 +hkêho -1350 +honáe -1351 +icago -1352 +kaneo -1353 +mâhae -1354 +nêhpo -1355 +omêše -1356 +oured -1357 +perus -1358 +sebud -1359 +stove -1360 +stséa -1361 +thing -1362 +tsévó -1363 +urope -1364 +évohk -1365 +óhévâ -1366 +ónétó -1367 +ôhévo -1368 +▁ange -1369 +▁boer -1370 +▁crow -1371 +▁hesé -1372 +▁honó -1373 +▁kéme -1374 +▁mana -1375 +▁moun -1376 +▁nemâ -1377 +▁pond -1378 +▁site -1379 +▁tase -1380 +▁tsis -1381 +anâhke -1382 +canada -1383 +eneško -1384 +enôtse -1385 +hemêšé -1386 +hóxovê -1387 +inland -1388 +kâsénâ -1389 +lahoma -1390 +ligion 
-1391 +manâhé -1392 +xovôho -1393 +▁canad -1394 +▁heškó -1395 +▁hotse -1396 +▁hoóxe -1397 +▁mahpe -1398 +▁miles -1399 +▁mésta -1400 +▁portu -1401 +▁tahle -1402 +▁táase -1403 +▁vóhka -1404 +ameotse -1405 +archive -1406 +emanese -1407 +hailand -1408 +ométane -1409 +seohtsé -1410 +tapâhae -1411 +ôhtsévó -1412 +škovávo -1413 +▁brazil -1414 +▁france -1415 +▁hetane -1416 +▁sweden -1417 +▁tsesto -1418 +▁tséana -1419 +▁turkey -1420 +hásêstse -1421 +keemahpe -1422 +manȧhého -1423 +nebraska -1424 +notováhe -1425 +onêstove -1426 +tsénêhpo -1427 +uniperus -1428 +xâhtsená -1429 +xėseotse -1430 +êhóóhévâ -1431 +▁heškóve -1432 +▁hováhne -1433 +▁háhnoma -1434 +▁oeškese -1435 +▁ukraine -1436 +héstánóva -1437 +évohkôtse -1438 +▁buckskin -1439 +▁coloured -1440 +▁national -1441 +▁váótséva -1442 +▁vóhpeoxa -1443 +hahtátaneo -1444 +háatamaahe -1445 +véhestôtse -1446 +xetoeneško -1447 +▁anėsóvóne -1448 +▁ponderosa -1449 +▁tahlequah -1450 +▁vétapâhae -1451 +êsenotováhe -1452 +êstonemâheo -1453 +▁canadensis -1454 +mâhaemenôtse -1455 +▁mȧhoestȯtse -1456 +aenêhoestôtse -1457 +êhóóhévâhtseo -1458 +▁vóhkoohetane -1459 +keehoohtsêstse -1460 +kâsénâhnestôtse -1461 +ôhtávêhahtátaneo -1462 +bi -1463 +cl -1464 +ct -1465 +dź -1466 +eá -1467 +gs -1468 +hã -1469 +ji -1470 +jo -1471 +ju -1472 +ky -1473 +ns -1474 +op -1475 +oó -1476 +rt -1477 +vf -1478 +xé -1479 +yd -1480 +▁) -1481 +▁+ -1482 +▁. 
-1483 +▁: -1484 +act -1485 +ala -1486 +bez -1487 +can -1488 +clo -1489 +dle -1490 +edg -1491 +ent -1492 +era -1493 +evá -1494 +gal -1495 +gas -1496 +gdp -1497 +gel -1498 +har -1499 +hkė -1500 +hov -1501 +hve -1502 +ick -1503 +imb -1504 +int -1505 +iro -1506 +ise -1507 +itu -1508 +kie -1509 +maa -1510 +mat -1511 +mes -1512 +mus -1513 +nev -1514 +nez -1515 +nii -1516 +ola -1517 +olf -1518 +oly -1519 +ová -1520 +par -1521 +pet -1522 +pic -1523 +ris -1524 +sel -1525 +ssa -1526 +séó -1527 +sóe -1528 +tho -1529 +tom -1530 +táa -1531 +tšê -1532 +uby -1533 +uit -1534 +ulo -1535 +vak -1536 +wan -1537 +âsé -1538 +éna -1539 +ódź -1540 +óto -1541 +▁'' -1542 +▁), -1543 +▁__ -1544 +▁am -1545 +▁ex -1546 +▁kȧ -1547 +▁nē -1548 +▁ok -1549 +▁or -1550 +▁pu -1551 +▁sȯ -1552 +▁tr -1553 +▁va -1554 +▁wi -1555 +▁ło -1556 +aceb -1557 +acer -1558 +ance -1559 +asio -1560 +ated -1561 +ater -1562 +bean -1563 +coun -1564 +ench -1565 +eohé -1566 +erre -1567 +gent -1568 +glas -1569 +gypt -1570 +hahe -1571 +havo -1572 +háse -1573 +icat -1574 +icus -1575 +iver -1576 +kome -1577 +kone -1578 +kévé -1579 +kôhé -1580 +lery -1581 +ment -1582 +menó -1583 +mons -1584 +nóne -1585 +omée -1586 +orea -1587 +póno -1588 +rahã -1589 +rgin -1590 +ring -1591 +riti -1592 +sena -1593 +sity -1594 +sone -1595 +stan -1596 +tamó -1597 +tove -1598 +tury -1599 +tâhá -1600 +uela -1601 +unus -1602 +vfik -1603 +voto -1604 +xôse -1605 +zart -1606 +zech -1607 +áhto -1608 +éotó -1609 +êstá -1610 +êške -1611 +óseo -1612 +ôhëö -1613 +ôtáa -1614 +šemé -1615 +ȯhkė -1616 +▁amo -1617 +▁ant -1618 +▁dia -1619 +▁han -1620 +▁hoi -1621 +▁jim -1622 +▁lib -1623 +▁los -1624 +▁mic -1625 +▁mis -1626 +▁môx -1627 +▁mȧx -1628 +▁okô -1629 +▁oné -1630 +▁otá -1631 +▁per -1632 +▁ter -1633 +▁ton -1634 +▁vée -1635 +▁web -1636 +▁wor -1637 +▁yel -1638 +▁éte -1639 +▁ôxa -1640 +aehno -1641 +archt -1642 +edgar -1643 +enohe -1644 +ercus -1645 +estsé -1646 +etane -1647 +files -1648 +guage -1649 +hohtó -1650 +horse -1651 +hotoa -1652 +hpenó -1653 +imate -1654 
+italy -1655 +ition -1656 +kaehe -1657 +keéno -1658 +mâhöö -1659 +netsé -1660 +nésta -1661 +ombia -1662 +omôho -1663 +onald -1664 +ondon -1665 +onávo -1666 +picea -1667 +poeso -1668 +pulus -1669 +péhpe -1670 +rizon -1671 +theid -1672 +tsêha -1673 +ubykh -1674 +vóono -1675 +xeesé -1676 +âséha -1677 +énohe -1678 +ôhené -1679 +öhtse -1680 +škëso -1681 +▁demo -1682 +▁dong -1683 +▁from -1684 +▁heap -1685 +▁heše -1686 +▁héne -1687 +▁héve -1688 +▁koro -1689 +▁mato -1690 +▁náhk -1691 +▁sage -1692 +▁sint -1693 +▁táxe -1694 +▁tôho -1695 +▁wolf -1696 +▁łódź -1697 +aenôhe -1698 +ameehe -1699 +ashing -1700 +brazil -1701 +ehóhtá -1702 +estôhe -1703 +gelman -1704 +graphy -1705 +hamëso -1706 +hkeehe -1707 +hovane -1708 +illion -1709 +kóhkon -1710 +ohketo -1711 +onéame -1712 +ouglas -1713 +russia -1714 +sėstse -1715 +ternal -1716 +ternet -1717 +tinent -1718 +tšêške -1719 +xemené -1720 +êhaseo -1721 +êhasëö -1722 +ôhtáva -1723 +ôseesé -1724 +▁birds -1725 +▁congo -1726 +▁czech -1727 +▁hestó -1728 +▁heóve -1729 +▁hésto -1730 +▁india -1731 +▁north -1732 +▁onéma -1733 +▁pinus -1734 +▁sȯsóe -1735 +▁vóhpe -1736 +▁áháse -1737 +▁éohke -1738 +▁łobez -1739 +acebook -1740 +etanóto -1741 +gentina -1742 +hestohe -1743 +icators -1744 +ligions -1745 +nezuela -1746 +oeškese -1747 +rginian -1748 +ribbean -1749 +searcht -1750 +statssa -1751 +séeotsé -1752 +taheéve -1753 +toháano -1754 +tsêhest -1755 +tâháéno -1756 +vonêheo -1757 +énanóse -1758 +êsevêsé -1759 +▁arctos -1760 +▁hoohëö -1761 +▁háeohé -1762 +▁háesto -1763 +▁london -1764 +▁mozart -1765 +▁móxêšé -1766 +▁norway -1767 +▁people -1768 +▁square -1769 +▁yellow -1770 +enáhkohe -1771 +hamémôxe -1772 +hestȯtse -1773 +republic -1774 +vahtôtse -1775 +vóonotse -1776 +xeeséeto -1777 +xestôtse -1778 +▁angeles -1779 +▁british -1780 +▁century -1781 +▁climate -1782 +▁croatia -1783 +▁finland -1784 +▁history -1785 +▁hotohke -1786 +▁onéhavo -1787 +▁rosebud -1788 +▁toháano -1789 +▁tôhohko -1790 +ashington -1791 +emestȯtse -1792 +gelmannii -1793 +heónemôxe -1794 
+hémâhoéve -1795 +juniperus -1796 +konôhtávo -1797 +tôxámâhéó -1798 +▁american -1799 +▁colombia -1800 +▁hohpâháa -1801 +▁mountain -1802 +▁okôhkêho -1803 +▁thailand -1804 +enóseoneve -1805 +indicators -1806 +xôseonéame -1807 +âséhahnoma -1808 +▁heséeotsé -1809 +▁póevónáne -1810 +▁virginian -1811 +▁vótâháéno -1812 +▁éškôseesé -1813 +aenôheéstse -1814 +hemêšéonávo -1815 +▁héstanêheo -1816 +▁móxêšéhevé -1817 +▁population -1818 +▁tsehestohe -1819 +▁tsetsêhest -1820 +ameehestôtse -1821 +polandnestse -1822 +êsenotováheé -1823 +êsevêséhotoa -1824 +▁engelmannii -1825 +▁héstánóvaan -1826 +▁yellowstone -1827 +móhtâhestôtse -1828 +ohketoetanóto -1829 +▁héstánóvaanéve -1830 +▁tsetsêhestâhese -1831 +", -1832 +)- -1833 +.: -1834 +bc -1835 +cu -1836 +cy -1837 +dc -1838 +hā -1839 +hō -1840 +ii -1841 +iv -1842 +iz -1843 +ja -1844 +ks -1845 +ké -1846 +kė -1847 +lu -1848 +ml -1849 +nu -1850 +nô -1851 +of -1852 +oh -1853 +su -1854 +sô -1855 +tâ -1856 +tä -1857 +ua -1858 +vė -1859 +xt -1860 +ym -1861 +áa -1862 +án -1863 +äö -1864 +éé -1865 +óo -1866 +óé -1867 +šė -1868 +▁$ -1869 +▁/ -1870 +%), -1871 +.") -1872 +ach -1873 +ahé -1874 +aka -1875 +app -1876 +ath -1877 +ato -1878 +bia -1879 +bir -1880 +bli -1881 +bou -1882 +bra -1883 +bre -1884 +cel -1885 +cer -1886 +cho -1887 +cib -1888 +cip -1889 +cit -1890 +cta -1891 +dal -1892 +dia -1893 +duc -1894 +ear -1895 +eet -1896 +ell -1897 +elo -1898 +ena -1899 +ené -1900 +esy -1901 +eva -1902 +evê -1903 +haz -1904 +hen -1905 +hil -1906 +hla -1907 +hpa -1908 +htâ -1909 +htö -1910 +ich -1911 +ida -1912 +ima -1913 +imf -1914 +ino -1915 +iza -1916 +kom -1917 +lyn -1918 +mbo -1919 +med -1920 +mâx -1921 +nas -1922 +nes -1923 +ney -1924 +nom -1925 +nor -1926 +néó -1927 +oan -1928 +oem -1929 +ols -1930 +ove -1931 +ppp -1932 +pro -1933 +pén -1934 +res -1935 +rio -1936 +sex -1937 +stâ -1938 +tri -1939 +uba -1940 +uly -1941 +und -1942 +ung -1943 +ure -1944 +uru -1945 +uva -1946 +vec -1947 +voo -1948 +véo -1949 +zer -1950 +áta -1951 +çao -1952 +éoe -1953 
+évâ -1954 +êhé -1955 +êna -1956 +êsé -1957 +êxo -1958 +ëhe -1959 +ëse -1960 +óma -1961 +ôsá -1962 +ėst -1963 +▁do -1964 +▁es -1965 +▁fi -1966 +▁ht -1967 +▁ki -1968 +▁ku -1969 +▁kö -1970 +▁oc -1971 +▁pe -1972 +▁ru -1973 +▁sk -1974 +▁ti -1975 +▁ut -1976 +▁wa -1977 +▁ôh -1978 +abwe -1979 +ames -1980 +anat -1981 +ants -1982 +asus -1983 +axaa -1984 +axáa -1985 +cerv -1986 +char -1987 +code -1988 +curi -1989 +down -1990 +edir -1991 +enns -1992 +eolo -1993 +eove -1994 +eral -1995 +esen -1996 +está -1997 +esëö -1998 +eved -1999 +fast -2000 +fire -2001 +gins -2002 +hael -2003 +hahk -2004 +hama -2005 +hamȧ -2006 +hene -2007 +here -2008 +hesó -2009 +hetó -2010 +heše -2011 +hová -2012 +hoxo -2013 +html -2014 +htoo -2015 +héve -2016 +höva -2017 +hāme -2018 +iago -2019 +ibik -2020 +ical -2021 +ific -2022 +ills -2023 +inst -2024 +ires -2025 +isco -2026 +ithu -2027 +khaz -2028 +komê -2029 +koné -2030 +kêsa -2031 +llow -2032 +mala -2033 +mami -2034 +many -2035 +mare -2036 +mark -2037 +mate -2038 +mené -2039 +môse -2040 +nahe -2041 +nene -2042 +nese -2043 +nâha -2044 +néta -2045 +nóse -2046 +olia -2047 +otsw -2048 +page -2049 +peru -2050 +pher -2051 +pper -2052 +ratt -2053 +road -2054 +sane -2055 +sean -2056 +seph -2057 +sian -2058 +star -2059 +tavö -2060 +tiba -2061 +ties -2062 +tivi -2063 +tohe -2064 +táhe -2065 +täso -2066 +tómô -2067 +tóxe -2068 +urne -2069 +utah -2070 +vada -2071 +vata -2072 +voem -2073 +vove -2074 +vâho -2075 +xove -2076 +áahe -2077 +âheo -2078 +âhtö -2079 +éseo -2080 +éstó -2081 +ësta -2082 +óóhe -2083 +ôhta -2084 +ôhéó -2085 +škee -2086 +ȯhnó -2087 +▁air -2088 +▁ara -2089 +▁atl -2090 +▁bay -2091 +▁ben -2092 +▁bit -2093 +▁cal -2094 +▁day -2095 +▁del -2096 +▁din -2097 +▁gla -2098 +▁hea -2099 +▁loc -2100 +▁mel -2101 +▁mex -2102 +▁mos -2103 +▁nan -2104 +▁nat -2105 +▁neg -2106 +▁pac -2107 +▁pra -2108 +▁pre -2109 +▁rho -2110 +▁sac -2111 +▁sco -2112 +▁sho -2113 +▁sto -2114 +▁tra -2115 +▁tre -2116 +▁van -2117 +▁wal -2118 +▁xaó -2119 +▁xäö -2120 +▁šan -2121 +advec 
-2122 +anohe -2123 +arten -2124 +atone -2125 +bania -2126 +brief -2127 +chief -2128 +dagas -2129 +desia -2130 +elope -2131 +guese -2132 +halus -2133 +hamâx -2134 +hketa -2135 +hkéso -2136 +hotse -2137 +house -2138 +htâhé -2139 +htóht -2140 +illet -2141 +kaeta -2142 +keaho -2143 +miten -2144 +máhta -2145 +néheo -2146 +nôtse -2147 +oemdc -2148 +póevó -2149 +river -2150 +selle -2151 +sisto -2152 +ssing -2153 +stâhe -2154 +tanév -2155 +tooxo -2156 +tšêšk -2157 +vakia -2158 +venia -2159 +vetoo -2160 +ville -2161 +vóhpo -2162 +vôhtó -2163 +xésta -2164 +émene -2165 +énéhe -2166 +évôse -2167 +óvéta -2168 +ôxhoo -2169 +ȧhéno -2170 +▁alge -2171 +▁anna -2172 +▁anta -2173 +▁apar -2174 +▁camp -2175 +▁cauc -2176 +▁cura -2177 +▁dece -2178 +▁desy -2179 +▁fire -2180 +▁hehp -2181 +▁hese -2182 +▁john -2183 +▁kosa -2184 +▁kôsá -2185 +▁mene -2186 +▁mâhö -2187 +▁mótó -2188 +▁néta -2189 +▁nėse -2190 +▁obvi -2191 +▁orig -2192 +▁oéve -2193 +▁page -2194 +▁poly -2195 +▁rock -2196 +▁sand -2197 +▁tono -2198 +▁tséh -2199 +▁tséx -2200 +▁tóno -2201 +▁tôhé -2202 +▁ural -2203 +▁vose -2204 +▁wars -2205 +▁west -2206 +▁wiki -2207 +▁zimb -2208 +▁éveé -2209 +▁ôhmo -2210 +aneonó -2211 +aénohe -2212 +chived -2213 +datama -2214 +edirne -2215 +eestse -2216 +eoestá -2217 +eétâhé -2218 +haeolo -2219 +hanáhe -2220 +hkoohe -2221 +hotóao -2222 +hováve -2223 +htsemo -2224 +hótame -2225 +kemene -2226 +kôhtse -2227 +kôhévé -2228 +marete -2229 +mêstaa -2230 +môtove -2231 +nétâhé -2232 +onesia -2233 +person -2234 +prunus -2235 +rizona -2236 +sósone -2237 +tevfik -2238 +tivist -2239 +turkey -2240 +uation -2241 +urasia -2242 +vation -2243 +veotse -2244 +xemenó -2245 +xemâho -2246 +xepóno -2247 +xêhest -2248 +éestse -2249 +éseohé -2250 +êstóne -2251 +óhtáhe -2252 +ėstane -2253 +šeméhe -2254 +ȯhkėha -2255 +▁aruba -2256 +▁botsw -2257 +▁coast -2258 +▁dutch -2259 +▁hestá -2260 +▁horse -2261 +▁hotóa -2262 +▁hésta -2263 +▁idaho -2264 +▁korea -2265 +▁lasio -2266 +▁lithu -2267 +▁meave -2268 +▁pages -2269 +▁penns -2270 +▁press 
-2271 +▁retri -2272 +▁sibik -2273 +▁there -2274 +▁tsehe -2275 +▁ubykh -2276 +▁xamae -2277 +antiago -2278 +article -2279 +atonebi -2280 +estséat -2281 +hestôxe -2282 +heóvemá -2283 +hkêsono -2284 +hohtóva -2285 +honáeka -2286 +háhnoma -2287 +hótsêha -2288 +kêsaéve -2289 +mamione -2290 +matôtse -2291 +montana -2292 +noonáhe -2293 +ohtôtse -2294 +panâhke -2295 +populus -2296 +sistots -2297 +sonants -2298 +tsévovó -2299 +tsêheta -2300 +tómôhéó -2301 +vetanév -2302 +vêstséa -2303 +vóneške -2304 +xeméhne -2305 +xâhonoo -2306 +xémâhéó -2307 +átamáno -2308 +âhtötse -2309 +éhestat -2310 +éotóaho -2311 +éstónéó -2312 +óhéhéve -2313 +▁ahkôhe -2314 +▁donald -2315 +▁europe -2316 +▁hoésto -2317 +▁hóvôse -2318 +▁kóhkon -2319 +▁kȧhamȧ -2320 +▁manaan -2321 +▁mariti -2322 +▁matana -2323 +▁ménôhe -2324 +▁mónahe -2325 +▁nanóse -2326 +▁native -2327 +▁nevada -2328 +▁néstse -2329 +▁pirahã -2330 +▁reôhke -2331 +▁tóhtoo -2332 +▁univer -2333 +▁éestse -2334 +▁êstove -2335 +aseohtsé -2336 +cephalus -2337 +etanetse -2338 +external -2339 +hahtsená -2340 +hamâxéve -2341 +háestôhe -2342 +hóxovôho -2343 +keahonoo -2344 +kâsénoma -2345 +kėhanáhe -2346 +language -2347 +manâhéno -2348 +móxêšéne -2349 +peruvian -2350 +tšêškévâ -2351 +vovetäso -2352 +weoworld -2353 +éoeškëso -2354 +êškóneho -2355 +ôhketóxe -2356 +šestôtse -2357 +▁aéstome -2358 +▁chicago -2359 +▁curaçao -2360 +▁denmark -2361 +▁douglas -2362 +▁hahpenó -2363 +▁heévȧhe -2364 +▁heóvâhá -2365 +▁hohamma -2366 +▁islands -2367 +▁maarten -2368 +▁madagas -2369 +▁menôtse -2370 +▁million -2371 +▁mâhpémo -2372 +▁sanders -2373 +▁skillet -2374 +▁tseohke -2375 +▁xamaevo -2376 +anėsóvóne -2377 +enestôtse -2378 +eotsestsé -2379 +hamémâhéó -2380 +netsénóne -2381 +nêškovávo -2382 +ohtsévôse -2383 +xemâhoévé -2384 +ȧhestȯtse -2385 +▁activist -2386 +▁archived -2387 +▁botswana -2388 +▁caucasus -2389 +▁crossing -2390 +▁december -2391 +▁hohtsemo -2392 +▁hotóhkeo -2393 +▁héstooma -2394 +▁manȧhéno -2395 +▁oklahoma -2396 +▁original -2397 +▁pennsylv -2398 
+▁rhodesia -2399 +▁tsisinst -2400 +▁tsêhésto -2401 +▁zimbabwe -2402 +california -2403 +datamapper -2404 +hkoestséat -2405 +hánėsóvóne -2406 +hóxeeséeto -2407 +kaetaévôxe -2408 +kévénėstse -2409 +kôhévénéhe -2410 +nêstsevôse -2411 +sêhestôtse -2412 +tsêhéstahe -2413 +xėseotsean -2414 +▁apartheid -2415 +▁argentina -2416 +▁caribbean -2417 +▁geography -2418 +▁hoéstónéó -2419 +▁lithuania -2420 +▁monêškeho -2421 +▁obviative -2422 +▁religions -2423 +▁retrieved -2424 +▁sibikeove -2425 +▁tasemiten -2426 +enenestôtse -2427 +eétâhéstove -2428 +héstánóvaan -2429 +komêšéstótó -2430 +mȧhoestȯtse -2431 +êstonêstove -2432 +ôhkêheóvemá -2433 +ôhtávaestse -2434 +▁consonants -2435 +▁heóvonêheo -2436 +▁hóvôsenâha -2437 +▁hóxâhtsená -2438 +▁koronestse -2439 +▁kȧhamȧxévo -2440 +▁lasiocarpa -2441 +▁mâhóomoehá -2442 +▁môxemarete -2443 +▁nėsesėstse -2444 +▁portuguese -2445 +▁tootómôhéó -2446 +▁tsêhestôxe -2447 +▁university -2448 +▁véhestôtse -2449 +▁véhestȯtse -2450 +▁washington -2451 +eestsestȯtse -2452 +staahtsémeno -2453 +vátameveotse -2454 +▁ahkôheöhtse -2455 +▁desyatonebi -2456 +▁háeohémahpe -2457 +▁nemâmamione -2458 +▁vétapâhaeto -2459 +▁éškôseeséma -2460 +mâhoestôtsene -2461 +vetanévȯhkėha -2462 +xêhestâhtötse -2463 +êstonemâheone -2464 +šenonetsénóne -2465 +▁heóvêháhnoma -2466 +▁reôhkemôtove -2467 +aehesanestôtse -2468 +disambiguation -2469 +hamâxéveóhtáhe -2470 +héstoeotsestsé -2471 +kemâhaemenôtse -2472 +▁otaesémenôtse -2473 +▁héstoomaestôtse -2474 +▁héstánóvaannéta -2475 +▁tsisinstsistots -2476 +▁tsêhéstoestôtse -2477 +▁vóhkoohémâhoéve -2478 +af -2479 +ag -2480 +cf -2481 +cr -2482 +ec -2483 +fe -2484 +fo -2485 +fr -2486 +ip -2487 +jp -2488 +pė -2489 +sz -2490 +yo -2491 +ão -2492 +aga -2493 +anl -2494 +aná -2495 +ban -2496 +bat -2497 +big -2498 +blo -2499 +); -2500 +ai -2501 +bn -2502 +eg -2503 +eó -2504 +gc -2505 +ix -2506 +kȧ -2507 +lm -2508 +my -2509 +nd -2510 +oq -2511 +rh -2512 +up -2513 +wh -2514 +ép -2515 +anz -2516 +cfm -2517 +cot -2518 +cre -2519 +day -2520 +dif -2521 
+dil -2522 +din -2523 +don -2524 +dor -2525 +ean -2526 +eat -2527 +esó -2528 +ext -2529 +ffa -2530 +for -2531 +gne -2532 +gta -2533 +hin -2534 +hké -2535 +hna -2536 +hou -2537 +hoé -2538 +htä -2539 +htȧ -2540 +hum -2541 +hôh -2542 +ibe -2543 +ibi -2544 +ics -2545 +ide -2546 +ied -2547 +ils -2548 +inh -2549 +iov -2550 +irc -2551 +iry -2552 +ita -2553 +iti -2554 +jap -2555 +jpg -2556 +kes -2557 +kyo -2558 +lag -2559 +lay -2560 +lsx -2561 +mau -2562 +meo -2563 +mib -2564 +mot -2565 +nee -2566 +odo -2567 +ork -2568 +orn -2569 +out -2570 +ped -2571 +poe -2572 +ppi -2573 +pua -2574 +ron -2575 +sci -2576 +sha -2577 +slo -2578 +sse -2579 +ste -2580 +tah -2581 +táo -2582 +uco -2583 +ump -2584 +une -2585 +usp -2586 +vel -2587 +wad -2588 +yan -2589 +ype -2590 +zco -2591 +äse -2592 +épó -2593 +éšé -2594 +êhá -2595 +êma -2596 +êšé -2597 +óvâ -2598 +ôhö -2599 +.. -2600 + -7962 +{ -7963 +à -7964 +ñ -7965 +ø -7966 +ú -7967 +û -7968 +ę -7969 +ń -7970 +ǐ -7971 +ʔ -7972 +н -7973 +п -7974 +ы -7975 +ә -7976 +ર -7977 +ા -7978 +ᐃ -7979 +民 -7980 +è -7981 +î -7982 +ò -7983 +ć -7984 +ī -7985 +ŋ -7986 +ś -7987 +ž -7988 +ǧ -7989 +ɛ -7990 +ɪ -7991 +б -7992 +е -7993 +л -7994 +о -7995 diff --git a/models/vocabulary/chy_vocabulary.parquet b/models/vocabulary/chy_vocabulary.parquet new file mode 100644 index 0000000000000000000000000000000000000000..d3250908dac1000b045a1613c86c8979658ce153 --- /dev/null +++ b/models/vocabulary/chy_vocabulary.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:470592a0b112b9a3a65f69af1557219cccb407f27f3a4236e60e71c7c2f40313 +size 29068 diff --git a/models/vocabulary/chy_vocabulary_metadata.json b/models/vocabulary/chy_vocabulary_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..33d754655c8c37b5913c6babe30cef985f13ed0f --- /dev/null +++ b/models/vocabulary/chy_vocabulary_metadata.json @@ -0,0 +1,14 @@ +{ + "language": "chy", + "vocabulary_size": 1659, + "statistics": { + "type_token_ratio": 
0.24974895150333748, + "coverage": { + "top_100": 0.4891015417331207, + "top_1000": 0.7703939984641739 + }, + "hapax_count": 2569, + "hapax_ratio": 0.6076158940397351, + "total_documents": 825 + } +} \ No newline at end of file diff --git a/models/word_markov/chy_markov_ctx1_word.parquet b/models/word_markov/chy_markov_ctx1_word.parquet new file mode 100644 index 0000000000000000000000000000000000000000..08a4a7d9d7f03e6128afff8c7c6eaa2ee082effd --- /dev/null +++ b/models/word_markov/chy_markov_ctx1_word.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6fb77d185c917a1f8ece3debd54279262bbc2e1a01799fcb691731a7561bee31 +size 113562 diff --git a/models/word_markov/chy_markov_ctx1_word_metadata.json b/models/word_markov/chy_markov_ctx1_word_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..5cdc3d6fc43a79066182c32df25b830eabb01291 --- /dev/null +++ b/models/word_markov/chy_markov_ctx1_word_metadata.json @@ -0,0 +1,7 @@ +{ + "context_size": 1, + "variant": "word", + "language": "chy", + "unique_contexts": 4255, + "total_transitions": 27559 +} \ No newline at end of file diff --git a/models/word_markov/chy_markov_ctx2_word.parquet b/models/word_markov/chy_markov_ctx2_word.parquet new file mode 100644 index 0000000000000000000000000000000000000000..9a3f7765734ed1f2b0f0b58803c294c5d3576f01 --- /dev/null +++ b/models/word_markov/chy_markov_ctx2_word.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c1602a760151707012bb6aef66d181795727fa9d8cb5780233a0b6c1c90a2312 +size 189750 diff --git a/models/word_markov/chy_markov_ctx2_word_metadata.json b/models/word_markov/chy_markov_ctx2_word_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..02ef0b577464a41bc9fd806995cf2e32c51c5a62 --- /dev/null +++ b/models/word_markov/chy_markov_ctx2_word_metadata.json @@ -0,0 +1,7 @@ +{ + "context_size": 2, + "variant": "word", + "language": "chy", + "unique_contexts": 10197, 
+ "total_transitions": 26734 +} \ No newline at end of file diff --git a/models/word_markov/chy_markov_ctx3_word.parquet b/models/word_markov/chy_markov_ctx3_word.parquet new file mode 100644 index 0000000000000000000000000000000000000000..991dca945f547e533edfa586401fa51db3f6932e --- /dev/null +++ b/models/word_markov/chy_markov_ctx3_word.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7a91e6dbb80dac8f5da14abcdfc3ba5c7122ed0a171bc88e3d506ea5a62e4790 +size 251639 diff --git a/models/word_markov/chy_markov_ctx3_word_metadata.json b/models/word_markov/chy_markov_ctx3_word_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..e33000a870c5215c6bf21af4585b3cd1d9bc4b9a --- /dev/null +++ b/models/word_markov/chy_markov_ctx3_word_metadata.json @@ -0,0 +1,7 @@ +{ + "context_size": 3, + "variant": "word", + "language": "chy", + "unique_contexts": 13745, + "total_transitions": 25909 +} \ No newline at end of file diff --git a/models/word_markov/chy_markov_ctx4_word.parquet b/models/word_markov/chy_markov_ctx4_word.parquet new file mode 100644 index 0000000000000000000000000000000000000000..98b1da880fc3131ffd354e69756ca26873ae63d4 --- /dev/null +++ b/models/word_markov/chy_markov_ctx4_word.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b827d8d8deb6ee8ee5a1c2f7e9164c6009220d08c889406bed208c143c67ef6f +size 299200 diff --git a/models/word_markov/chy_markov_ctx4_word_metadata.json b/models/word_markov/chy_markov_ctx4_word_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..28dde75ed594460a55b7a0d16cdabc0acb057b70 --- /dev/null +++ b/models/word_markov/chy_markov_ctx4_word_metadata.json @@ -0,0 +1,7 @@ +{ + "context_size": 4, + "variant": "word", + "language": "chy", + "unique_contexts": 16004, + "total_transitions": 25085 +} \ No newline at end of file diff --git a/models/word_ngram/chy_2gram_word.parquet b/models/word_ngram/chy_2gram_word.parquet new file 
mode 100644 index 0000000000000000000000000000000000000000..b8aed7294befca87168337c3a35da6dbd879b91b --- /dev/null +++ b/models/word_ngram/chy_2gram_word.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0dc44fada7846034391ea2690c56b864a908054940a91ca581f91cf63ba43b79 +size 11396 diff --git a/models/word_ngram/chy_2gram_word_metadata.json b/models/word_ngram/chy_2gram_word_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..37fdf67c8b77cc258a8e46fc087faa60585c5ebc --- /dev/null +++ b/models/word_ngram/chy_2gram_word_metadata.json @@ -0,0 +1,7 @@ +{ + "n": 2, + "variant": "word", + "language": "chy", + "unique_ngrams": 654, + "total_ngrams": 27559 +} \ No newline at end of file diff --git a/models/word_ngram/chy_3gram_word.parquet b/models/word_ngram/chy_3gram_word.parquet new file mode 100644 index 0000000000000000000000000000000000000000..427d66c3b9c2768d68c5cb44bffe42f9b2af4ada --- /dev/null +++ b/models/word_ngram/chy_3gram_word.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ac58b70ff5ea478cd859a06e2722ba50d36f1d0a7d83c8e35282a144a67b1369 +size 20739 diff --git a/models/word_ngram/chy_3gram_word_metadata.json b/models/word_ngram/chy_3gram_word_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..dfd0ad0bad2ac0681a282fda23378e9accd99d9a --- /dev/null +++ b/models/word_ngram/chy_3gram_word_metadata.json @@ -0,0 +1,7 @@ +{ + "n": 3, + "variant": "word", + "language": "chy", + "unique_ngrams": 1211, + "total_ngrams": 26734 +} \ No newline at end of file diff --git a/models/word_ngram/chy_4gram_word.parquet b/models/word_ngram/chy_4gram_word.parquet new file mode 100644 index 0000000000000000000000000000000000000000..f1204693a668d69729fa7acfe4695067f2d2a2cd --- /dev/null +++ b/models/word_ngram/chy_4gram_word.parquet @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:8fb9ac65e984d6fb53483e3e4ddaf089e90baeef0b877dc6f9e5067d9f4d0a9d +size 39062 diff --git a/models/word_ngram/chy_4gram_word_metadata.json b/models/word_ngram/chy_4gram_word_metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..e380a75a3a33c8bbddab4dd39985a8938c807eb6 --- /dev/null +++ b/models/word_ngram/chy_4gram_word_metadata.json @@ -0,0 +1,7 @@ +{ + "n": 4, + "variant": "word", + "language": "chy", + "unique_ngrams": 2302, + "total_ngrams": 25909 +} \ No newline at end of file diff --git a/visualizations/embedding_isotropy.png b/visualizations/embedding_isotropy.png new file mode 100644 index 0000000000000000000000000000000000000000..f9c83093dbbacc8c96f6b8d0b13d80452f73e23a Binary files /dev/null and b/visualizations/embedding_isotropy.png differ diff --git a/visualizations/embedding_norms.png b/visualizations/embedding_norms.png new file mode 100644 index 0000000000000000000000000000000000000000..04157894bd6abb623cf7728919fa8ba8ab278c0c Binary files /dev/null and b/visualizations/embedding_norms.png differ diff --git a/visualizations/embedding_similarity.png b/visualizations/embedding_similarity.png new file mode 100644 index 0000000000000000000000000000000000000000..34569d570e461d877e73a044c150aff77884730b --- /dev/null +++ b/visualizations/embedding_similarity.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:125496fee51a0eff7033a820a5c76329e49094e721ffbc2269e6e45e220f6eb8 +size 174316 diff --git a/visualizations/markov_branching.png b/visualizations/markov_branching.png new file mode 100644 index 0000000000000000000000000000000000000000..3d1b9a76d64faa807ecc710b35463c5d2c4d997b Binary files /dev/null and b/visualizations/markov_branching.png differ diff --git a/visualizations/markov_contexts.png b/visualizations/markov_contexts.png new file mode 100644 index 0000000000000000000000000000000000000000..09142004f613dcae9d743372b8ba19fe850d83bc Binary files /dev/null and 
b/visualizations/markov_contexts.png differ diff --git a/visualizations/markov_entropy.png b/visualizations/markov_entropy.png new file mode 100644 index 0000000000000000000000000000000000000000..f69b6daa31e8d4c7ecf7d9dbe4e721ecdc422feb Binary files /dev/null and b/visualizations/markov_entropy.png differ diff --git a/visualizations/model_sizes.png b/visualizations/model_sizes.png new file mode 100644 index 0000000000000000000000000000000000000000..cd57618649af251bf29a959169de322207f5db26 Binary files /dev/null and b/visualizations/model_sizes.png differ diff --git a/visualizations/nearest_neighbors.png b/visualizations/nearest_neighbors.png new file mode 100644 index 0000000000000000000000000000000000000000..cb3c0895a189df84db3032c4a9fcfdd130bc23f0 Binary files /dev/null and b/visualizations/nearest_neighbors.png differ diff --git a/visualizations/ngram_coverage.png b/visualizations/ngram_coverage.png new file mode 100644 index 0000000000000000000000000000000000000000..80e8fea86f70b95ea35d4ec0079ad47388993a03 Binary files /dev/null and b/visualizations/ngram_coverage.png differ diff --git a/visualizations/ngram_entropy.png b/visualizations/ngram_entropy.png new file mode 100644 index 0000000000000000000000000000000000000000..535ae32da5aed6ed238bf11d9b43f8d475bb0f75 Binary files /dev/null and b/visualizations/ngram_entropy.png differ diff --git a/visualizations/ngram_perplexity.png b/visualizations/ngram_perplexity.png new file mode 100644 index 0000000000000000000000000000000000000000..1c541e6bfb131e4fc777744a43d97b7fcc1604d1 Binary files /dev/null and b/visualizations/ngram_perplexity.png differ diff --git a/visualizations/ngram_unique.png b/visualizations/ngram_unique.png new file mode 100644 index 0000000000000000000000000000000000000000..dc7fa8a01b2392ca7cea2fa35c147f0aade4ece5 Binary files /dev/null and b/visualizations/ngram_unique.png differ diff --git a/visualizations/performance_dashboard.png b/visualizations/performance_dashboard.png new file mode 100644 
index 0000000000000000000000000000000000000000..6369f638fbcc3ce442ac6b78ff1dcb99bbeaebb3 --- /dev/null +++ b/visualizations/performance_dashboard.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b861d72edf7bb6c8cf345d9be70d2d6d0a0a216d6e17021fbe8fa2794b212d3e +size 288321 diff --git a/visualizations/position_encoding_comparison.png b/visualizations/position_encoding_comparison.png new file mode 100644 index 0000000000000000000000000000000000000000..a1aeba54650b4a60c936d88216bb44bf1206a6b1 Binary files /dev/null and b/visualizations/position_encoding_comparison.png differ diff --git a/visualizations/tokenizer_compression.png b/visualizations/tokenizer_compression.png new file mode 100644 index 0000000000000000000000000000000000000000..79cdac463e699214ccf59bd636466018ccbfd553 Binary files /dev/null and b/visualizations/tokenizer_compression.png differ diff --git a/visualizations/tokenizer_fertility.png b/visualizations/tokenizer_fertility.png new file mode 100644 index 0000000000000000000000000000000000000000..edd4301152cfef7bcea4a6b3a297bcb391f72732 Binary files /dev/null and b/visualizations/tokenizer_fertility.png differ diff --git a/visualizations/tokenizer_oov.png b/visualizations/tokenizer_oov.png new file mode 100644 index 0000000000000000000000000000000000000000..6e8f23f88633158ec088f9d93ef4d98814108b9b Binary files /dev/null and b/visualizations/tokenizer_oov.png differ diff --git a/visualizations/tokenizer_total_tokens.png b/visualizations/tokenizer_total_tokens.png new file mode 100644 index 0000000000000000000000000000000000000000..77d89afaadf72acb713ff3350d887d04cd3e2fc5 Binary files /dev/null and b/visualizations/tokenizer_total_tokens.png differ diff --git a/visualizations/top20_words.png b/visualizations/top20_words.png new file mode 100644 index 0000000000000000000000000000000000000000..25960f06e9a43427fcacaeb3ed7a56263f2f6cf2 Binary files /dev/null and b/visualizations/top20_words.png differ diff --git 
a/visualizations/tsne_sentences.png b/visualizations/tsne_sentences.png new file mode 100644 index 0000000000000000000000000000000000000000..b7e069ce54a7a8b67690e2490df8f0b47ff58d9d --- /dev/null +++ b/visualizations/tsne_sentences.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0dc88c5899ee297f88854b53b1a881cde76421efd331639ce36eee79f8707972 +size 259359 diff --git a/visualizations/tsne_words.png b/visualizations/tsne_words.png new file mode 100644 index 0000000000000000000000000000000000000000..5e526f8acf5033493c7d3d88654282c5a2a56920 --- /dev/null +++ b/visualizations/tsne_words.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d0724431f0ba5d0023e807f46bde04ec4c5c61dc35cdd7462ba385d4404afd40 +size 264480 diff --git a/visualizations/vocab_coverage.png b/visualizations/vocab_coverage.png new file mode 100644 index 0000000000000000000000000000000000000000..d4ea22517aa6d105e7c8c881ed6b6accd2798001 Binary files /dev/null and b/visualizations/vocab_coverage.png differ diff --git a/visualizations/vocab_freq_dist.png b/visualizations/vocab_freq_dist.png new file mode 100644 index 0000000000000000000000000000000000000000..530c48f0bae84f719dc56899bacf4248734a1401 Binary files /dev/null and b/visualizations/vocab_freq_dist.png differ diff --git a/visualizations/zipf_law.png b/visualizations/zipf_law.png new file mode 100644 index 0000000000000000000000000000000000000000..71e4fb16adb7814e9ec304069fc503ad6f0e5b25 --- /dev/null +++ b/visualizations/zipf_law.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:cce6ed8b0dec7d5d68b967ff424b0d29f7404f5c5d77ffdb1d6a44392fea5877 +size 102444