omarkamali committed on Jan 3

Commit

7adfc95

verified

1 Parent(s): adfc404

Upload all models and assets for crh (20251001)


This view is limited to 50 files because it contains too many changes.

Files changed (50)

README.md +299 -133
models/embeddings/monolingual/crh_128d.bin +2 -2
models/embeddings/monolingual/crh_128d_metadata.json +5 -3
models/embeddings/monolingual/crh_32d.bin +2 -2
models/embeddings/monolingual/crh_32d_metadata.json +5 -3
models/embeddings/monolingual/crh_64d.bin +2 -2
models/embeddings/monolingual/crh_64d_metadata.json +5 -3
models/subword_markov/crh_markov_ctx1_subword.parquet +2 -2
models/subword_markov/crh_markov_ctx1_subword_metadata.json +2 -2
models/subword_markov/crh_markov_ctx2_subword.parquet +2 -2
models/subword_markov/crh_markov_ctx2_subword_metadata.json +2 -2
models/subword_markov/crh_markov_ctx3_subword.parquet +2 -2
models/subword_markov/crh_markov_ctx3_subword_metadata.json +2 -2
models/subword_markov/crh_markov_ctx4_subword.parquet +2 -2
models/subword_markov/crh_markov_ctx4_subword_metadata.json +2 -2
models/subword_ngram/crh_2gram_subword.parquet +2 -2
models/subword_ngram/crh_2gram_subword_metadata.json +2 -2
models/subword_ngram/crh_3gram_subword.parquet +2 -2
models/subword_ngram/crh_3gram_subword_metadata.json +2 -2
models/subword_ngram/crh_4gram_subword.parquet +2 -2
models/subword_ngram/crh_4gram_subword_metadata.json +2 -2
models/tokenizer/crh_tokenizer_16k.model +2 -2
models/tokenizer/crh_tokenizer_16k.vocab +0 -0
models/tokenizer/crh_tokenizer_32k.model +2 -2
models/tokenizer/crh_tokenizer_32k.vocab +0 -0
models/tokenizer/crh_tokenizer_64k.model +2 -2
models/tokenizer/crh_tokenizer_64k.vocab +0 -0
models/tokenizer/crh_tokenizer_8k.model +2 -2
models/tokenizer/crh_tokenizer_8k.vocab +0 -0
models/vocabulary/crh_vocabulary.parquet +2 -2
models/vocabulary/crh_vocabulary_metadata.json +10 -9
models/word_markov/crh_markov_ctx1_word.parquet +2 -2
models/word_markov/crh_markov_ctx1_word_metadata.json +2 -2
models/word_markov/crh_markov_ctx2_word.parquet +2 -2
models/word_markov/crh_markov_ctx2_word_metadata.json +2 -2
models/word_markov/crh_markov_ctx3_word.parquet +2 -2
models/word_markov/crh_markov_ctx3_word_metadata.json +2 -2
models/word_markov/crh_markov_ctx4_word.parquet +2 -2
models/word_markov/crh_markov_ctx4_word_metadata.json +2 -2
models/word_ngram/crh_2gram_word.parquet +2 -2
models/word_ngram/crh_2gram_word_metadata.json +2 -2
models/word_ngram/crh_3gram_word.parquet +2 -2
models/word_ngram/crh_3gram_word_metadata.json +2 -2
models/word_ngram/crh_4gram_word.parquet +2 -2
models/word_ngram/crh_4gram_word_metadata.json +2 -2
visualizations/embedding_isotropy.png +0 -0
visualizations/embedding_norms.png +0 -0
visualizations/embedding_similarity.png +2 -2
visualizations/markov_branching.png +0 -0
visualizations/markov_contexts.png +0 -0

README.md CHANGED

@@ -23,14 +23,14 @@ dataset_info:
 metrics:
   - name: best_compression_ratio
     type: compression
-    value: 4.462
   - name: best_isotropy
     type: isotropy
-    value: 0.7580
   - name: vocabulary_size
     type: vocab
-    value: 53689
-generated: 2025-12-28
 ---
 # CRH - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
-- N-gram models (2, 3, 4-gram)
-- Markov chains (context of 1, 2, 3 and 4)
 - Subword N-gram and Markov chains
-- Embeddings in various sizes and dimensions
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
-- [6. Summary & Recommendations](#6-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
@@ -68,51 +70,57 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
-| **8k** | 3.468x | 3.42 | 0.2080% | 235,121 |
-| **16k** | 3.842x | 3.78 | 0.2305% | 212,188 |
-| **32k** | 4.188x | 4.13 | 0.2512% | 194,672 |
-| **64k** | 4.462x 🏆 | 4.39 | 0.2676% | 182,737 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
-**Sample 1:** `Novotroyevka () - Rusiyeniñ Belgorod vilâyetinde Koroça rayonında bir köy. Ealis...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁novo tr oy evka ▁() ▁- ▁rusiyeniñ ▁belgorod ▁vilâyetinde ▁koroça ... (+17 more)` | 27 |
-| 16k | `▁novotr oy evka ▁() ▁- ▁rusiyeniñ ▁belgorod ▁vilâyetinde ▁koroça ▁rayonında ... (+16 more)` | 26 |
-| 32k | `▁novotroy evka ▁() ▁- ▁rusiyeniñ ▁belgorod ▁vilâyetinde ▁koroça ▁rayonında ▁bir ... (+15 more)` | 25 |
-| 64k | `▁novotroy evka ▁() ▁- ▁rusiyeniñ ▁belgorod ▁vilâyetinde ▁koroça ▁rayonında ▁bir ... (+15 more)` | 25 |
-**Sample 2:** `Holodna Balka () - Ukrainanıñ Ades vilâyetinde Ades rayonında bir köy. Ealisiniñ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁hol od na ▁balka ▁() ▁- ▁ukrainanıñ ▁ades ▁vilâyetinde ▁ades ... (+18 more)` | 28 |
-| 16k | `▁hol od na ▁balka ▁() ▁- ▁ukrainanıñ ▁ades ▁vilâyetinde ▁ades ... (+18 more)` | 28 |
-| 32k | `▁hol odna ▁balka ▁() ▁- ▁ukrainanıñ ▁ades ▁vilâyetinde ▁ades ▁rayonında ... (+17 more)` | 27 |
-| 64k | `▁holodna ▁balka ▁() ▁- ▁ukrainanıñ ▁ades ▁vilâyetinde ▁ades ▁rayonında ▁bir ... (+16 more)` | 26 |
-**Sample 3:** `Tereşpil () - Ukrainanıñ Vinnıtsâ vilâyetinde Hmilnık rayonında bir köy. Ealisin...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁ter eş pil ▁() ▁- ▁ukrainanıñ ▁vinnıtsâ ▁vilâyetinde ▁hmilnık ▁rayonında ... (+16 more)` | 26 |
-| 16k | `▁tereş pil ▁() ▁- ▁ukrainanıñ ▁vinnıtsâ ▁vilâyetinde ▁hmilnık ▁rayonında ▁bir ... (+15 more)` | 25 |
-| 32k | `▁tereş pil ▁() ▁- ▁ukrainanıñ ▁vinnıtsâ ▁vilâyetinde ▁hmilnık ▁rayonında ▁bir ... (+15 more)` | 25 |
-| 64k | `▁tereşpil ▁() ▁- ▁ukrainanıñ ▁vinnıtsâ ▁vilâyetinde ▁hmilnık ▁rayonında ▁bir ▁köy ... (+14 more)` | 24 |
 ### Key Findings
-- **Best Compression:** 64k achieves 4.462x compression
-- **Lowest UNK Rate:** 8k with 0.2080% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
@@ -121,57 +129,89 @@ Below are sample sentences tokenized with each vocabulary size:
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
-| N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
-|--------|------------|---------|----------------|------------------|-------------------|
-| **2-gram** | 1,152 🏆 | 10.17 | 18,186 | 53.8% | 71.2% |
-| **2-gram** | 405 🏆 | 8.66 | 4,814 | 59.9% | 97.2% |
-| **3-gram** | 1,983 | 10.95 | 27,457 | 46.4% | 65.5% |
-| **3-gram** | 2,478 | 11.28 | 35,169 | 32.1% | 69.8% |
-| **4-gram** | 4,721 | 12.21 | 55,382 | 36.7% | 54.6% |
-| **4-gram** | 8,182 | 13.00 | 157,163 | 26.4% | 52.7% |
 ### Top 5 N-grams by Size
-**2-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `kategoriya :` | 32,150 |
-| 2 | `( )` | 23,218 |
-| 3 | `) -` | 21,379 |
-| 4 | `ealisiniñ sayısı` | 20,740 |
-| 5 | `. ealisiniñ` | 20,734 |
-**3-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `. ealisiniñ sayısı` | 20,734 |
-| 2 | `( ) -` | 19,732 |
-| 3 | `. kategoriya :` | 16,456 |
-| 4 | `kişi . kategoriya` | 14,755 |
-| 5 | `bir köy .` | 10,054 |
-**4-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `kişi . kategoriya :` | 14,755 |
-| 2 | `rayonında bir köy .` | 9,313 |
-| 3 | `( ) - rusiyeniñ` | 9,196 |
-| 4 | `bir köy . ealisiniñ` | 9,139 |
-| 5 | `köy . ealisiniñ sayısı` | 9,139 |
 ### Key Findings
-- **Best Perplexity:** 2-gram with 405
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
-- **Coverage:** Top-1000 patterns cover ~53% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 ---
@@ -179,55 +219,86 @@ Below are sample sentences tokenized with each vocabulary size:
 ![Markov Entropy](visualizations/markov_entropy.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
-| Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
-|---------|-------------|------------|------------------|-----------------|----------------|
-| **1** | 0.5665 | 1.481 | 3.09 | 136,749 | 43.3% |
-| **1** | 1.1866 | 2.276 | 9.27 | 1,393 | 0.0% |
-| **2** | 0.1665 | 1.122 | 1.39 | 422,458 | 83.3% |
-| **2** | 0.9666 | 1.954 | 5.70 | 12,902 | 3.3% |
-| **3** | 0.0665 | 1.047 | 1.14 | 585,200 | 93.4% |
-| **3** | 0.8194 | 1.765 | 3.81 | 73,541 | 18.1% |
-| **4** | 0.0328 🏆 | 1.023 | 1.07 | 663,395 | 96.7% |
-| **4** | 0.5748 🏆 | 1.489 | 2.40 | 280,480 | 42.5% |
-### Generated Text Samples
-Below are text samples generated from each Markov chain model:
 **Context Size 1:**
-1. `. menbalar türkiye i ̇ zmalkovo krasnoye rayonında bir qasaba . ealisiniñ sayısı 27 , başqırtistan`
-2. `- rusiyeniñ belgorod vilâyetinde podilsk rayonı ( , cin æmæ jyn kad ! bu ağnıñ esas`
-3. `, latin elifbesiniñ 19 ( ) , cıyınlarda cıyınlarnıñ cıyıntıgı cıyıntıq alına kelgen medinet es herma...`
 **Context Size 2:**
-1. `kategoriya : troitskoye rayonındaki meskün yerler kategoriya : primorye ülkesindeki meskün yerler ka...`
-2. `( ) - rusiyeniñ altay ülkesinde şelaboliha rayonında bir köy . ealisiniñ sayısı 47 kişi . kategoriya`
-3. `) - rusiyede , başqırtistan cumhuriyetiniñ miyeke rayonında bir hutor . ealisiniñ sayısı 177 kişi . ...`
 **Context Size 3:**
-1. `. ealisiniñ sayısı 485 kişi . kategoriya : herson vilâyeti`
-2. `( ) - rusiyeniñ brânsk vilâyetinde karaçev rayonında bir köy . ealisiniñ sayısı 423 kişi . kategoriy...`
-3. `. kategoriya : başqırtistandaki meskün yerler`
 **Context Size 4:**
-1. `kişi . kategoriya : ades vilâyetindeki köyler`
-2. `rayonında bir köy . ealisiniñ sayısı 448 kişi . i ̇ htar kategoriya : tahtamukay rayonındaki meskün ...`
-3. `( ) - rusiyeniñ amur vilâyetinde şimanovsk rayonında bir köy . ealisiniñ sayısı 654 kişi . kategoriy...`
 ### Key Findings
-- **Best Predictability:** Context-4 with 96.7% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
-- **Memory Trade-off:** Larger contexts require more storage (280,480 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
@@ -243,64 +314,64 @@ Below are text samples generated from each Markov chain model:
 | Metric | Value |
 |--------|-------|
-| Vocabulary Size | 53,689 |
-| Total Tokens | 889,654 |
-| Mean Frequency | 16.57 |
 | Median Frequency | 3 |
-| Frequency Std Dev | 308.97 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | kategoriya | 32,152 |
-| 2 | bir | 27,919 |
-| 3 | kişi | 20,861 |
-| 4 | sayısı | 20,822 |
-| 5 | ealisiniñ | 20,770 |
-| 6 | rayonında | 17,392 |
-| 7 | i | 13,962 |
-| 8 | meskün | 13,507 |
-| 9 | yerler | 12,928 |
-| 10 | vilâyetinde | 12,440 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | зияде | 2 |
-| 2 | atalarnıñ | 2 |
-| 3 | kotsubınskıylar | 2 |
-| 4 | yüneskonıñ | 2 |
-| 5 | اورمودا | 2 |
-| 6 | دیللر | 2 |
-| 7 | ازبری | 2 |
-| 8 | اولان | 2 |
-| 9 | قیزی | 2 |
-| 10 | samançı | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
-| Zipf Coefficient | 1.0203 |
-| R² (Goodness of Fit) | 0.996904 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
-| Top 100 | 46.7% |
-| Top 1,000 | 65.0% |
-| Top 5,000 | 79.6% |
-| Top 10,000 | 85.3% |
 ### Key Findings
-- **Zipf Compliance:** R²=0.9969 indicates excellent adherence to Zipf's law
-- **High Frequency Dominance:** Top 100 words cover 46.7% of corpus
-- **Long Tail:** 43,689 words needed for remaining 14.7% coverage
 ---
 ## 5. Word Embeddings Evaluation
@@ -313,24 +384,116 @@ Below are text samples generated from each Markov chain model:
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
-### Model Comparison
-| Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
-|-------|------------|-----------|----------|----------|----------|
-| **mono_32d** | 16,090 | 32 | 4.513 | 0.777 | 0.7580 🏆 |
-| **mono_64d** | 16,090 | 64 | 4.759 | 0.736 | 0.5447 |
-| **mono_128d** | 16,090 | 128 | 4.802 | 0.733 | 0.1564 |
-| **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
 ### Key Findings
-- **Best Isotropy:** mono_32d with 0.7580 (more uniform distribution)
-- **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
-- **Vocabulary Coverage:** All models cover 16,090 words
-- **Recommendation:** 100d for balanced semantic capture and efficiency
 ---
-## 6. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
@@ -338,11 +501,12 @@ Below are text samples generated from each Markov chain model:
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
-| Tokenizer | **32k BPE** | Best compression (4.46x) with low UNK rate |
-| N-gram | **5-gram** | Lowest perplexity (405) |
-| Markov | **Context-4** | Highest predictability (96.7%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
@@ -532,7 +696,8 @@ If you use these models in your research, please cite:
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
-  publisher = {HuggingFace},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
@@ -548,7 +713,8 @@ MIT License - Free for academic and commercial use.
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 ---
 *Generated by Wikilangs Models Pipeline*
-*Report Date: 2025-12-28 23:15:40*

 metrics:
   - name: best_compression_ratio
     type: compression
+    value: 4.773
   - name: best_isotropy
     type: isotropy
+    value: 0.6920
   - name: vocabulary_size
     type: vocab
+    value: 0
+generated: 2026-01-03
 ---
 # CRH - Wikilangs Models
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
+- N-gram models (2, 3, 4, 5-gram)
+- Markov chains (context of 1, 2, 3, 4 and 5)
 - Subword N-gram and Markov chains
+- Embeddings in various sizes and dimensions (aligned and unaligned)
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
+- [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+- [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
+![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+![Tokenizer OOV](visualizations/tokenizer_oov.png)
+![Total Tokens](visualizations/tokenizer_total_tokens.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
+| **8k** | 3.643x | 3.65 | 0.2020% | 214,351 |
+| **16k** | 4.073x | 4.08 | 0.2258% | 191,723 |
+| **32k** | 4.453x | 4.46 | 0.2469% | 175,361 |
+| **64k** | 4.773x 🏆 | 4.78 | 0.2647% | 163,591 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
+**Sample 1:** `Şaburovo () - Rusiyeniñ Altay ülkesinde Solton rayonında bir qasaba. Ealisiniñ s...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁şab urovo ▁() ▁- ▁rusiyeniñ ▁altay ▁ülkesinde ▁solton ▁rayonında ▁bir ... (+12 more)` | 22 |
+| 16k | `▁şab urovo ▁() ▁- ▁rusiyeniñ ▁altay ▁ülkesinde ▁solton ▁rayonında ▁bir ... (+12 more)` | 22 |
+| 32k | `▁şab urovo ▁() ▁- ▁rusiyeniñ ▁altay ▁ülkesinde ▁solton ▁rayonında ▁bir ... (+12 more)` | 22 |
+| 64k | `▁şaburovo ▁() ▁- ▁rusiyeniñ ▁altay ▁ülkesinde ▁solton ▁rayonında ▁bir ▁qasaba ... (+11 more)` | 21 |
+**Sample 2:** `Polzunovo () - Rusiyeniñ Altay ülkesinde Barnaul şeer bölgesinda bir stantsiya. ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁pol z unovo ▁() ▁- ▁rusiyeniñ ▁altay ▁ülkesinde ▁barnaul ▁şeer ... (+17 more)` | 27 |
+| 16k | `▁pol z unovo ▁() ▁- ▁rusiyeniñ ▁altay ▁ülkesinde ▁barnaul ▁şeer ... (+17 more)` | 27 |
+| 32k | `▁pol z unovo ▁() ▁- ▁rusiyeniñ ▁altay ▁ülkesinde ▁barnaul ▁şeer ... (+17 more)` | 27 |
+| 64k | `▁polzunovo ▁() ▁- ▁rusiyeniñ ▁altay ▁ülkesinde ▁barnaul ▁şeer ▁bölgesinda ▁bir ... (+15 more)` | 25 |
+**Sample 3:** `Bobliv () - Ukrainanıñ Vinnıtsâ vilâyetinde Vinnıtsâ rayonında bir köy. Ealisini...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁bob liv ▁() ▁- ▁ukrainanıñ ▁vinnıtsâ ▁vilâyetinde ▁vinnıtsâ ▁rayonında ▁bir ... (+12 more)` | 22 |
+| 16k | `▁bob liv ▁() ▁- ▁ukrainanıñ ▁vinnıtsâ ▁vilâyetinde ▁vinnıtsâ ▁rayonında ▁bir ... (+12 more)` | 22 |
+| 32k | `▁bob liv ▁() ▁- ▁ukrainanıñ ▁vinnıtsâ ▁vilâyetinde ▁vinnıtsâ ▁rayonında ▁bir ... (+12 more)` | 22 |
+| 64k | `▁bobliv ▁() ▁- ▁ukrainanıñ ▁vinnıtsâ ▁vilâyetinde ▁vinnıtsâ ▁rayonında ▁bir ▁köy ... (+11 more)` | 21 |
 ### Key Findings
+- **Best Compression:** 64k achieves 4.773x compression
+- **Lowest UNK Rate:** 8k with 0.2020% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
+![N-gram Unique](visualizations/ngram_unique.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
+| N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+|--------|---------|------------|---------|----------------|------------------|-------------------|
+| **2-gram** | Word | 851 | 9.73 | 10,225 | 56.0% | 74.3% |
+| **2-gram** | Subword | 349 🏆 | 8.45 | 3,905 | 63.4% | 98.0% |
+| **3-gram** | Word | 1,276 | 10.32 | 13,312 | 49.1% | 71.8% |
+| **3-gram** | Subword | 2,227 | 11.12 | 29,302 | 33.0% | 71.7% |
+| **4-gram** | Word | 4,192 | 12.03 | 31,529 | 31.9% | 54.7% |
+| **4-gram** | Subword | 7,868 | 12.94 | 131,434 | 26.0% | 52.2% |
 ### Top 5 N-grams by Size
+**2-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `ealisiniñ sayısı` | 20,731 |
+| 2 | `rayonında bir` | 17,343 |
+| 3 | `meskün yerler` | 12,883 |
+| 4 | `bir köy` | 10,053 |
+| 5 | `köy ealisiniñ` | 9,130 |
+**3-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `rayonında bir köy` | 9,305 |
+| 2 | `köy ealisiniñ sayısı` | 9,130 |
+| 3 | `bir köy ealisiniñ` | 9,130 |
+| 4 | `rayonındaki meskün yerler` | 5,591 |
+| 5 | `kişi meskün yerler` | 4,604 |
+**4-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `bir köy ealisiniñ sayısı` | 9,130 |
+| 2 | `rayonında bir köy ealisiniñ` | 8,976 |
+| 3 | `bir köydir ealisiniñ sayısı` | 4,601 |
+| 4 | `rayonında bir köydir ealisiniñ` | 4,565 |
+| 5 | `i̇htar rayonındaki meskün yerler` | 3,615 |
+**2-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `i n` | 101,180 |
+| 2 | `e r` | 95,484 |
+| 3 | `a _` | 88,670 |
+| 4 | `r _` | 84,656 |
+| 5 | `. _` | 80,946 |
+**3-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `i ñ _` | 43,476 |
+| 2 | `n i ñ` | 42,980 |
+| 3 | `l e r` | 42,878 |
+| 4 | `n d e` | 35,841 |
+| 5 | `e t i` | 35,623 |
+**4-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `n i ñ _` | 42,723 |
+| 2 | `i n d e` | 34,201 |
+| 3 | `y e t i` | 30,809 |
+| 4 | `ı n d a` | 30,136 |
+| 5 | `_ b i r` | 29,657 |
 ### Key Findings
+- **Best Perplexity:** 2-gram (subword) with 349
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
+- **Coverage:** Top-1000 patterns cover ~52% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
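The Perplexity and Entropy columns in the updated table look consistent with the relation perplexity = 2^entropy, with entropy in bits. That relation is inferred from the reported numbers, not stated in this part of the report, so the sketch below is a sanity check under that assumption. The helper name `perplexity_from_entropy` is hypothetical.

```python
def perplexity_from_entropy(entropy_bits: float) -> float:
    """Convert an entropy in bits to a perplexity (assumed relation: 2 ** H)."""
    return 2.0 ** entropy_bits


# (entropy, reported perplexity) pairs copied from the updated N-gram table.
rows = {
    "2-gram word": (9.73, 851),
    "2-gram subword": (8.45, 349),
    "4-gram word": (12.03, 4192),
}

for name, (entropy, reported) in rows.items():
    estimate = perplexity_from_entropy(entropy)
    # Entropy is rounded to two decimals, so a small relative gap is expected.
    gap = abs(estimate - reported) / reported
    print(f"{name}: estimated {estimate:.0f}, reported {reported}, gap {gap:.2%}")
```

Under this reading, perplexity and entropy rise together, so both columns grow from 2-gram to 4-gram in the table.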
 ---
 ![Markov Entropy](visualizations/markov_entropy.png)
+![Markov Contexts](visualizations/markov_contexts.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
+| Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+|---------|---------|-------------|------------|------------------|-----------------|----------------|
+| **1** | Word | 0.6260 | 1.543 | 3.00 | 128,781 | 37.4% |
+| **1** | Subword | 0.8874 | 1.850 | 6.86 | 1,506 | 11.3% |
+| **2** | Word | 0.1303 | 1.094 | 1.24 | 384,958 | 87.0% |
+| **2** | Subword | 0.9042 | 1.872 | 5.57 | 10,319 | 9.6% |
+| **3** | Word | 0.0387 | 1.027 | 1.07 | 475,888 | 96.1% |
+| **3** | Subword | 0.8152 | 1.760 | 3.87 | 57,491 | 18.5% |
+| **4** | Word | 0.0241 🏆 | 1.017 | 1.05 | 504,736 | 97.6% |
+| **4** | Subword | 0.6067 | 1.523 | 2.54 | 222,424 | 39.3% |
+### Generated Text Samples (Word-based)
+Below are text samples generated from each word-based Markov chain model:
+**Context Size 1:**
+1. `bir köy ealisiniñ sayısı 102 1 ya ayaqlı bir köydir ealisiniñ sayısı kişi vilâyetindeki qasabalar ma...`
+2. `kişi vilâyetindeki meskün yerleri şeer şeklinde qasabalar albörikent accı suv miktarınıñ az olğanına...`
+3. `sayısı 919 kişi vilâyetindeki şeerler boksitogorsk rayonında bir dürki türkiyede oturğan 2 567 kişi ...`
+**Context Size 2:**
+1. `ealisiniñ sayısı 3 900 kişi muhtar vilâyetindeki şeer şeklinde qasabalar bazarnıy karabulak rayonını...`
+2. `rayonında bir şeer gornozavodsk rayonınıñ merkezi ealisiniñ sayısı 562 kişi vilâyetindeki köyler ray...`
+3. `bir köy ealisiniñ sayısı 952 kişi vilâyetindeki meskün yerler köyler atıflar rayonındaki meskün yerl...`
+**Context Size 3:**
+1. `rayonında bir köy ealisiniñ sayısı kişi i̇htar rayonındaki meskün yerler köyler atıflar rayonındaki ...`
+2. `köy ealisiniñ sayısı 183 kişi i̇htar rayonındaki meskün yerler köyler atıflar rayonındaki meskün yer...`
+3. `bir köy ealisiniñ sayısı 7 kişi i̇htar rayonındaki meskün yerler şeer şeklinde qasabalar verhovye gl...`
+**Context Size 4:**
+1. `bir köy ealisiniñ sayısı kişi vilâyetindeki şeer şeklinde qasabalar şeer bölgesindeki meskün yerler ...`
+2. `rayonında bir köy ealisiniñ sayısı 572 kişi i̇htar rayonındaki meskün yerler vilâyetindeki şeer şekl...`
+3. `bir köydir ealisiniñ sayısı 306 kişi meskün yerler`
+### Generated Text Samples (Subword-based)
+Below are text samples generated from each subword-based Markov chain model:
 **Context Size 1:**
+1. `_revıñ_moviñmur_`
+2. `a_()._ay.._obeto`
+3. `i_робу_yıñ_()_fi`
 **Context Size 2:**
+1. `inovaq_ümalmannıñ`
+2. `a_başqırınde_talo`
+3. `r_rus_graynay_the`
 **Context Size 3:**
+1. `iñ_brânskaya_yañız`
+2. `niñ_araman_sürtüyl`
+3. `nde_bek-tatar_vilâ`
 **Context Size 4:**
+1. `niñ_stepanovskiy_—_`
+2. `inde_ölümlerinen_so`
+3. `yetindeki_meskün_ye`
 ### Key Findings
+- **Best Predictability:** Context-4 (word) with 97.6% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
+- **Memory Trade-off:** Larger contexts require more storage (222,424 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
 | Metric | Value |
 |--------|-------|
+| Vocabulary Size | 51,581 |
+| Total Tokens | 778,307 |
+| Mean Frequency | 15.09 |
 | Median Frequency | 3 |
+| Frequency Std Dev | 271.68 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | bir | 27,780 |
+| 2 | kişi | 20,845 |
+| 3 | sayısı | 20,811 |
+| 4 | ealisiniñ | 20,761 |
+| 5 | rayonında | 17,383 |
+| 6 | meskün | 13,506 |
+| 7 | yerler | 12,928 |
+| 8 | vilâyetinde | 12,431 |
+| 9 | köy | 10,895 |
+| 10 | rusiyeniñ | 9,597 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | ekranlar | 2 |
+| 2 | oem | 2 |
+| 3 | macbook | 2 |
+| 4 | mahsuldarlıq | 2 |
+| 5 | planşetler | 2 |
+| 6 | fatemeh | 2 |
+| 7 | movaghar | 2 |
+| 8 | پریسا | 2 |
+| 9 | موقر | 2 |
+| 10 | slammer | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
+| Zipf Coefficient | 0.9849 |
+| R² (Goodness of Fit) | 0.998059 |
 | Adherence Quality | **excellent** |
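A Zipf coefficient and R² such as those above are commonly obtained from a least-squares line fitted to log frequency against log rank. The report does not show how the pipeline computes them, so the following is a minimal sketch under that assumption. The function `fit_zipf` is hypothetical and the input is synthetic, not the CRH vocabulary.

```python
import math


def fit_zipf(frequencies):
    """Fit log(f) = a - s * log(rank); return (s, r_squared)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    # The Zipf coefficient is the negated slope of the log-log line.
    return -slope, r_squared


# Synthetic frequencies that follow an exact 1/rank law (illustration only).
synthetic = [1000.0 / rank for rank in range(1, 201)]
coefficient, r_squared = fit_zipf(synthetic)
print(f"coefficient={coefficient:.4f}, R^2={r_squared:.6f}")
```

A coefficient near 1 with R² close to 1, as reported here, indicates that the word frequencies fall off roughly in inverse proportion to rank.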
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
+| Top 100 | 45.4% |
+| Top 1,000 | 63.7% |
+| Top 5,000 | 78.1% |
+| Top 10,000 | 84.3% |
 ### Key Findings
+- **Zipf Compliance:** R²=0.9981 indicates excellent adherence to Zipf's law
+- **High Frequency Dominance:** Top 100 words cover 45.4% of corpus
+- **Long Tail:** 41,581 words needed for remaining 15.7% coverage
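Several of the updated vocabulary figures can be cross-checked with simple arithmetic. The sketch below assumes that mean frequency is total tokens divided by vocabulary size, and that the long-tail numbers come from subtracting the top-10,000 row from the totals. Both are inferences from the reported values, and the variable names are illustrative.

```python
# Values copied from the updated vocabulary tables.
vocab_size = 51_581
total_tokens = 778_307
top_10k_coverage = 84.3  # percent of corpus covered by the top 10,000 words

# Assumed definitions, inferred from the reported figures.
mean_frequency = total_tokens / vocab_size
tail_words = vocab_size - 10_000
tail_coverage = 100.0 - top_10k_coverage

print(round(mean_frequency, 2), tail_words, round(tail_coverage, 1))
```

The results agree with the reported mean frequency of 15.09 and the long-tail finding of 41,581 words for the remaining 15.7% of coverage.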
 ---
 ## 5. Word Embeddings Evaluation
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
+### 5.1 Cross-Lingual Alignment
+> *Note: Multilingual alignment visualization not available for this language.*
+### 5.2 Model Comparison
+| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+|-------|-----------|----------|------------------|---------------|----------------|
+| **mono_32d** | 32 | 0.6920 🏆 | 0.3816 | N/A | N/A |
+| **mono_64d** | 64 | 0.4424 | 0.3546 | N/A | N/A |
+| **mono_128d** | 128 | 0.1085 | 0.3496 | N/A | N/A |
 ### Key Findings
+- **Best Isotropy:** mono_32d with 0.6920 (more uniform distribution)
+- **Semantic Density:** Average pairwise similarity of 0.3619. Lower values indicate better semantic separation.
+- **Alignment Quality:** No aligned models evaluated in this run.
+- **Recommendation:** 128d aligned for best cross-lingual performance
+---
+## 6.  Morphological Analysis (Experimental)
+> ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+### 6.1 Productivity & Complexity
+| Metric | Value | Interpretation | Recommendation |
+|--------|-------|----------------|----------------|
+| Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+| Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+### 6.2 Affix Inventory (Productive Units)
+These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
+#### Productive Prefixes
+| Prefix | Examples |
+|--------|----------|
+#### Productive Suffixes
+| Suffix | Examples |
+|--------|----------|
+| `-a` | belâyeva, sira, boynuna |
+| `-ka` | tatyanovka, verigovka, mazanka |
+| `-vo` | çkalovo, beketovo, çufarovo |
+| `-vka` | tatyanovka, verigovka, karnauhovka |
+| `-an` | yasaqlağan, başqırtistan, i̇talyan |
+| `-ovo` | çkalovo, beketovo, çufarovo |
+| `-en` | yerlerinden, esitgen, yaratıcılığınen |
+| `-ya` | nesterovskaya, podsosennaya, borzovaya |
+### 6.3 Bound Stems (Lexical Roots)
+Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
+| Stem | Cohesion | Substitutability | Examples |
+|------|----------|------------------|----------|
+| `rler` | 1.64x | 57 contexts | erler, yerler, kirler |
+| `siye` | 2.07x | 21 contexts | asiye, rusiye, vasiyet |
+| `isin` | 1.60x | 34 contexts | lisin, episine, ekisini |
+| `iniñ` | 1.68x | 26 contexts | eliniñ, aliniñ, öziniñ |
+| `nesi` | 1.66x | 22 contexts | nesir, nesib, nesil |
+| `usiy` | 2.15x | 9 contexts | lusiya, rusiye, hususiy |
+| `eniñ` | 1.77x | 15 contexts | heniñ, seniñ, ekeniñ |
+| `lâye` | 1.89x | 11 contexts | gulâyev, belâyev, vilâyet |
+| `âyet` | 1.89x | 11 contexts | vilâyet, menâyet, şikâyet |
+| `yeti` | 1.62x | 17 contexts | yetim, yetip, yetişe |
+| `sini` | 1.62x | 15 contexts | siniy, sinip, sesini |
+| `tind` | 1.75x | 11 contexts | etinden, betinde, şetinde |
+### 6.4 Affix Compatibility (Co-occurrence)
+This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
+*No significant affix co-occurrences detected.*
+### 6.5 Recursive Morpheme Segmentation
+Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
+| Word | Suggested Split | Confidence | Stem |
+|------|-----------------|------------|------|
+| çabayevka | **`çaba-ye-vka`** | 6.0 | `çaba` |
+| ruzayevka | **`ruza-ye-vka`** | 6.0 | `ruza` |
+| turmayevo | **`turma-ye-vo`** | 6.0 | `turma` |
+| natalivka | **`natali-vka`** | 4.5 | `natali` |
+| çingizovo | **`çingiz-ovo`** | 4.5 | `çingiz` |
+| krasnoyarovo | **`krasnoyar-ovo`** | 4.5 | `krasnoyar` |
+| kapustinka | **`kapustin-ka`** | 4.5 | `kapustin` |
+| malinovka | **`malino-vka`** | 4.5 | `malino` |
+| soldatovo | **`soldat-ovo`** | 4.5 | `soldat` |
+| kaltımanovo | **`kaltım-an-ovo`** | 3.0 | `kaltım` |
+| balabanovo | **`balab-an-ovo`** | 3.0 | `balab` |
+| olehovskaya | **`olehovs-ka-ya`** | 3.0 | `olehovs` |
+| olşanskaya | **`olşans-ka-ya`** | 3.0 | `olşans` |
+| kıtmanovo | **`kıtm-an-ovo`** | 3.0 | `kıtm` |
+| kropıvenka | **`kropıv-en-ka`** | 3.0 | `kropıv` |
+### 6.6 Linguistic Interpretation
+> **Automated Insight:**
+The language CRH appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
 ---
+## 7. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
+| Tokenizer | **64k BPE** | Best compression (4.77x) |
+| N-gram | **2-gram** | Lowest perplexity (349) |
+| Markov | **Context-4** | Highest predictability (97.6%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
+  doi = {10.5281/zenodo.18073153},
+  publisher = {Zenodo},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+- 🤝 Sponsor: [Featherless AI](https://featherless.ai)
 ---
 *Generated by Wikilangs Models Pipeline*
+*Report Date: 2026-01-03 10:33:02*

models/embeddings/monolingual/crh_128d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2711de686fbbdccec9b4c232aeb1a2ff6589a0a5ab35d03b8eed6ce02bd7f456
-size 1040761703

 version https://git-lfs.github.com/spec/v1
+oid sha256:a2c03c90e9321eff6a486a7b259eb6d14069ad97d6e4851aaa1d32c01bb7310b
+size 1039804561
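
Each `.bin`, `.parquet`, and `.model` entry in this commit is a Git LFS pointer, so the diff shows only the object hash and the byte size changing. As a minimal sketch (the helper name `parse_lfs_pointer` is hypothetical, and the diff's `+`/`-` prefixes are assumed to have been stripped first), the two pointer versions above can be parsed and compared like this:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", separated by one space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The size field is a byte count; the oid is "<hash-algo>:<hex digest>".
    fields["size"] = int(fields["size"])
    return fields


old = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:2711de686fbbdccec9b4c232aeb1a2ff6589a0a5ab35d03b8eed6ce02bd7f456\n"
    "size 1040761703\n"
)
new = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:a2c03c90e9321eff6a486a7b259eb6d14069ad97d6e4851aaa1d32c01bb7310b\n"
    "size 1039804561\n"
)

size_delta = new["size"] - old["size"]
print(f"crh_128d.bin changed by {size_delta} bytes")
```

The same parser applies to every pointer hunk in this commit, since they all share the three-line `version` / `oid` / `size` layout.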

models/embeddings/monolingual/crh_128d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 128,
   "version": "monolingual",
   "training_params": {
-    "dim": 128,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 16090
 }

   "dimension": 128,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 128
   },
+  "vocab_size": 15175
 }
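
The two sides of this hunk are easier to compare programmatically than by eye, because `dim` only moves to the end of `training_params` rather than changing value. The sketch below copies the visible fields from the hunk into plain dictionaries (the variable names are illustrative) and reports which keys were added, removed, or changed:

```python
# training_params as shown on the old and new sides of the hunk above.
old_params = {"dim": 128, "min_count": 5, "window": 5, "negative": 5, "epochs": 5}
new_params = {
    "algorithm": "skipgram",
    "min_count": 5,
    "window": 5,
    "negative": 5,
    "epochs": 5,
    "encoding_method": "rope",
    "dim": 128,
}

# Key-level comparison; dict ordering is ignored on purpose.
added = sorted(set(new_params) - set(old_params))
removed = sorted(set(old_params) - set(new_params))
changed = sorted(
    key for key in set(old_params) & set(new_params)
    if old_params[key] != new_params[key]
)

# vocab_size sits outside training_params in the metadata file.
vocab_delta = 15175 - 16090

print("added:", added)
print("removed:", removed)
print("changed:", changed)
print("vocab_size delta:", vocab_delta)
```

Read this way, the commit adds two descriptive keys (`algorithm`, `encoding_method`) and leaves the existing hyperparameter values untouched; only `vocab_size` differs numerically.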

models/embeddings/monolingual/crh_32d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:66a6e8733bef4321f51d9516ccf447352157f4a62af7a17ee0b3077db4528dcf
-size 260404583

 version https://git-lfs.github.com/spec/v1
+oid sha256:08fc415d5a1b109301770bd6dbcbfbdde2008dd25a25d23f81ace78bc3971a3a
+size 260150161

models/embeddings/monolingual/crh_32d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 32,
   "version": "monolingual",
   "training_params": {
-    "dim": 32,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 16090
 }

   "dimension": 32,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 32
   },
+  "vocab_size": 15175
 }

models/embeddings/monolingual/crh_64d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9d958dcba834b7704a2fa91467c099cd2ee01f0d1d5fed91d7a38303939710c0
-size 520523623

 version https://git-lfs.github.com/spec/v1
+oid sha256:e5d8136014d664519e3bafeb0cfc668b4993f78425d51bdcf0d9d06734bd0726
+size 520034961

models/embeddings/monolingual/crh_64d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 64,
   "version": "monolingual",
   "training_params": {
-    "dim": 64,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 16090
 }

   "dimension": 64,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 64
   },
+  "vocab_size": 15175
 }

models/subword_markov/crh_markov_ctx1_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9de08c51e7f35ddea586397377dd1a690a335be3deafe5034997cfcfdc8fb563
-size 98281

 version https://git-lfs.github.com/spec/v1
+oid sha256:e2ba0d156d7362837baf91177f2123bb0ccb074f4556c38c22f0d046b19f9805
+size 84372

models/subword_markov/crh_markov_ctx1_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "subword",
   "language": "crh",
-  "unique_contexts": 1393,
-  "total_transitions": 7898659
 }

   "context_size": 1,
   "variant": "subword",
   "language": "crh",
+  "unique_contexts": 1506,
+  "total_transitions": 6777473
 }

models/subword_markov/crh_markov_ctx2_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e9bb8639872bba7668e3be3688a341970c45c509563b7d7eff3fddaa1b857968
-size 581277

 version https://git-lfs.github.com/spec/v1
+oid sha256:a42c66f0a162aa7d06896db540b1636552e8a9ebc256f5e6cf8d91c7301d5444
+size 481861

models/subword_markov/crh_markov_ctx2_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "subword",
   "language": "crh",
-  "unique_contexts": 12902,
-  "total_transitions": 7868965
 }

   "context_size": 2,
   "variant": "subword",
   "language": "crh",
+  "unique_contexts": 10319,
+  "total_transitions": 6748084
 }

models/subword_markov/crh_markov_ctx3_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6ab59e75e6595f11a4437338aa658f32767c561497c79e161cf3d4148cdd2c79
-size 2047334

 version https://git-lfs.github.com/spec/v1
+oid sha256:5a42665c99cf0102347580ad03ad8662c7dcd958de7c0259e8380dcbc939a89f
+size 1701410

models/subword_markov/crh_markov_ctx3_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "subword",
   "language": "crh",
-  "unique_contexts": 73541,
-  "total_transitions": 7839271
 }

   "context_size": 3,
   "variant": "subword",
   "language": "crh",
+  "unique_contexts": 57491,
+  "total_transitions": 6718695
 }

models/subword_markov/crh_markov_ctx4_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4fcb0e55ae35489db7ded118cd7a233f52f8dd657681364cb8e3372187263e04
-size 5676695

 version https://git-lfs.github.com/spec/v1
+oid sha256:769a741cb6357905e7c73c65cde5f662a2cbc84d3c2b1870ec910617615ba5a1
+size 4686904

models/subword_markov/crh_markov_ctx4_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "subword",
   "language": "crh",
-  "unique_contexts": 280480,
-  "total_transitions": 7809577
 }

   "context_size": 4,
   "variant": "subword",
   "language": "crh",
+  "unique_contexts": 222424,
+  "total_transitions": 6689306
 }
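
Across the four subword Markov metadata files, the new `total_transitions` values fall by the same amount (29,389) with each extra token of context, and that amount equals the `total_documents` figure reported in `crh_vocabulary_metadata.json` in this commit. This is what one would expect if transitions are counted inside each document and never across document boundaries. That reading is an inference from the numbers, not a documented property of the pipeline. The sketch below uses a hypothetical `count_transitions` helper on a toy corpus and then checks the reported figures:

```python
def count_transitions(docs, context_size):
    """Count (context -> next token) pairs without crossing document boundaries."""
    return sum(max(len(doc) - context_size, 0) for doc in docs)


# Toy corpus of three tokenized documents (illustrative data, not from the release).
toy_docs = [list("abcdef"), list("ghijk"), list("lmnopqr")]
toy_counts = [count_transitions(toy_docs, k) for k in (1, 2, 3, 4)]
toy_steps = [a - b for a, b in zip(toy_counts, toy_counts[1:])]

# total_transitions from the new ctx1..ctx4 subword metadata in this commit.
reported = [6777473, 6748084, 6718695, 6689306]
reported_steps = [a - b for a, b in zip(reported, reported[1:])]

print("toy counts:", toy_counts, "steps:", toy_steps)
print("reported steps:", reported_steps)
```

In the toy corpus every step equals the number of documents, because each document loses exactly one transition per additional context token as long as it is longer than the context.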

models/subword_ngram/crh_2gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a1d24c9778380ff73f830ff74e1323c5c9a67b1ad255aca792619d77b25d5912
-size 64028

 version https://git-lfs.github.com/spec/v1
+oid sha256:d352e2e3b7dd102bd7055d607a37c3738aff67c0c66285e95804ad330500d280
+size 52920

models/subword_ngram/crh_2gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "subword",
   "language": "crh",
-  "unique_ngrams": 4814,
-  "total_ngrams": 7898659
 }

   "n": 2,
   "variant": "subword",
   "language": "crh",
+  "unique_ngrams": 3905,
+  "total_ngrams": 6777473
 }

models/subword_ngram/crh_3gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a4fd9693cd6180b77894216a45c8557781e0405949f934ac0db671694231e38f
-size 451805

 version https://git-lfs.github.com/spec/v1
+oid sha256:1a92a6f76ef54f7e43af0d7e9c762ace522be6627fa4f29ab3df5200c1f2a63d
+size 375416

models/subword_ngram/crh_3gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "subword",
   "language": "crh",
-  "unique_ngrams": 35169,
-  "total_ngrams": 7868965
 }

   "n": 3,
   "variant": "subword",
   "language": "crh",
+  "unique_ngrams": 29302,
+  "total_ngrams": 6748084
 }

models/subword_ngram/crh_4gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:45d6866f9f3500b9daf768646045ee8c250bd213b574f0a8fce2c8ef66523e4e
-size 1836873

 version https://git-lfs.github.com/spec/v1
+oid sha256:0ddc20df05c5929fbc17c37bba729a38c20ac69752548221a46c1e42b8e8359f
+size 1562486

models/subword_ngram/crh_4gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "subword",
   "language": "crh",
-  "unique_ngrams": 157163,
-  "total_ngrams": 7839271
 }

   "n": 4,
   "variant": "subword",
   "language": "crh",
+  "unique_ngrams": 131434,
+  "total_ngrams": 6718695
 }

models/tokenizer/crh_tokenizer_16k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8bf313aa3863643a3d3a48927e34d71e36c34b5c866db4ec55ff481f2c42cf78
-size 511102

 version https://git-lfs.github.com/spec/v1
+oid sha256:7252484520841503b3c10a110b3ea4ba3a62f7073a69a342466526a0b31ef51c
+size 513301

models/tokenizer/crh_tokenizer_16k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/crh_tokenizer_32k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a90dac68538f5086942dbcf27b4787b28e5927dcb62fb4edbfd0d3f645b170cd
-size 798788

 version https://git-lfs.github.com/spec/v1
+oid sha256:60b38160d9a927f6ea17a75ef13bbe991a80a50121d1c845806c967c815488f3
+size 797422

models/tokenizer/crh_tokenizer_32k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/crh_tokenizer_64k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0ca674ae3850ca528644e4e9d7d41feaaf7f45a78b69b9eccd17204ad57d458a
-size 1388807

 version https://git-lfs.github.com/spec/v1
+oid sha256:8b45034637ab90c0741ce2ac1009165fc7e87f960f1ff87e489cabdedb42a8c8
+size 1425086

models/tokenizer/crh_tokenizer_64k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/crh_tokenizer_8k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:dd97a27e642560427bbf6bc2e1ec3da514ec6c83ec093a16660de476f4797974
-size 373198

 version https://git-lfs.github.com/spec/v1
+oid sha256:414811c3993e9f6351075aaf47ac4ebe0185a4a1e2948c911bbbe25a0c48d8d1
+size 373959

models/tokenizer/crh_tokenizer_8k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/vocabulary/crh_vocabulary.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:37995711df1e3c6e65e7b604e037154a533bfd0c17bbed09013a0b12d6d0cc3a
-size 925242

 version https://git-lfs.github.com/spec/v1
+oid sha256:e154d72a53b3ecac61e12ba483392cfb82a4cc6a0a1b854df1b5a9a58503b2e9
+size 892791

models/vocabulary/crh_vocabulary_metadata.json CHANGED

@@ -1,16 +1,17 @@
 {
   "language": "crh",
-  "vocabulary_size": 53689,
   "statistics": {
-    "type_token_ratio": 0.14047543157959674,
     "coverage": {
-      "top_100": 0.42677284364428997,
-      "top_1000": 0.594827213933929,
-      "top_5000": 0.727782518841444,
-      "top_10000": 0.7806146474876361
     },
-    "hapax_count": 82936,
-    "hapax_ratio": 0.6070338517840805,
-    "total_documents": 29694
   }
 }

 {
   "language": "crh",
+  "vocabulary_size": 51581,
+  "variant": "full",
   "statistics": {
+    "type_token_ratio": 0.1506807050021212,
     "coverage": {
+      "top_100": 0.4134086438841732,
+      "top_1000": 0.5791304225875555,
+      "top_5000": 0.7107789686755324,
+      "top_10000": 0.7672291584127752
     },
+    "hapax_count": 77350,
+    "hapax_ratio": 0.599933297655335,
+    "total_documents": 29389
   }
 }
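
The statistics in this hunk can be sanity-checked against each other. Both the old and new `hapax_ratio` values are consistent with `hapax_count / (vocabulary_size + hapax_count)`, which would mean `vocabulary_size` excludes words seen only once. That formula is inferred from the reported numbers rather than taken from the pipeline's documentation, and the function name below is hypothetical:

```python
import math


def hapax_ratio(vocabulary_size: int, hapax_count: int) -> float:
    """Share of all word types that occur exactly once (assumed definition)."""
    return hapax_count / (vocabulary_size + hapax_count)


# Values copied from the old (-) and new (+) sides of the hunk above.
old_ratio = hapax_ratio(vocabulary_size=53689, hapax_count=82936)
new_ratio = hapax_ratio(vocabulary_size=51581, hapax_count=77350)

old_matches = math.isclose(old_ratio, 0.6070338517840805, abs_tol=1e-6)
new_matches = math.isclose(new_ratio, 0.599933297655335, abs_tol=1e-6)

print(f"old: {old_ratio:.6f} (matches reported value: {old_matches})")
print(f"new: {new_ratio:.6f} (matches reported value: {new_matches})")
```

Under this reading, roughly 60% of distinct word types occur only once in both snapshots, and the update shifts that share only slightly.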

models/word_markov/crh_markov_ctx1_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:35f93040f44b20143ca6995aa27d0e51be65a8aed5ec256feed368bc396e7b39
-size 4448665

 version https://git-lfs.github.com/spec/v1
+oid sha256:7aa9bdb634e256e31b79d824958c523b31cd09097f73541b17948bf1af86cd78
+size 4147698

models/word_markov/crh_markov_ctx1_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "word",
   "language": "crh",
-  "unique_contexts": 136749,
-  "total_transitions": 1289772
 }

   "context_size": 1,
   "variant": "word",
   "language": "crh",
+  "unique_contexts": 128781,
+  "total_transitions": 826268
 }

models/word_markov/crh_markov_ctx2_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1186f4b8afeb522aea0d59da82796161e681cf8812db7eafe8dce2d594b4cff3
-size 8722410

 version https://git-lfs.github.com/spec/v1
+oid sha256:4febe8a6192229f809c3e676345cedfae6db5b2e8a8029ea9bd763e24109c749
+size 8051739

models/word_markov/crh_markov_ctx2_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "word",
   "language": "crh",
-  "unique_contexts": 422458,
-  "total_transitions": 1260079
 }

   "context_size": 2,
   "variant": "word",
   "language": "crh",
+  "unique_contexts": 384958,
+  "total_transitions": 796879
 }

models/word_markov/crh_markov_ctx3_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:70418ae3723bd86e8fa05bb025daf404c3ba9978b2435ea847cff6c13a4929e8
-size 11645146

 version https://git-lfs.github.com/spec/v1
+oid sha256:caae72ac1890563cfd5dfa7c47424d891c27b0908ee8f566663d3643381cb21a
+size 10138150

models/word_markov/crh_markov_ctx3_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "word",
   "language": "crh",
-  "unique_contexts": 585200,
-  "total_transitions": 1230387
 }

   "context_size": 3,
   "variant": "word",
   "language": "crh",
+  "unique_contexts": 475888,
+  "total_transitions": 767490
 }

models/word_markov/crh_markov_ctx4_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ca323a9f7ec0e2f8d1a95747ea890278db7668783f6c23a6e5fb60385dc416b1
-size 13583842