omarkamali committed on Jan 3

Commit 8164949 · verified · 1 Parent(s): 351da00

Upload all models and assets for am (20251001)

This view is limited to 50 files because it contains too many changes.

Files changed (50)

README.md +270 -140
models/embeddings/monolingual/am_128d.bin +2 -2
models/embeddings/monolingual/am_128d_metadata.json +5 -3
models/embeddings/monolingual/am_32d.bin +2 -2
models/embeddings/monolingual/am_32d_metadata.json +5 -3
models/embeddings/monolingual/am_64d.bin +2 -2
models/embeddings/monolingual/am_64d_metadata.json +5 -3
models/subword_markov/am_markov_ctx1_subword.parquet +2 -2
models/subword_markov/am_markov_ctx1_subword_metadata.json +2 -2
models/subword_markov/am_markov_ctx2_subword.parquet +2 -2
models/subword_markov/am_markov_ctx2_subword_metadata.json +2 -2
models/subword_markov/am_markov_ctx3_subword.parquet +2 -2
models/subword_markov/am_markov_ctx3_subword_metadata.json +2 -2
models/subword_markov/am_markov_ctx4_subword.parquet +2 -2
models/subword_markov/am_markov_ctx4_subword_metadata.json +2 -2
models/subword_ngram/am_2gram_subword.parquet +2 -2
models/subword_ngram/am_2gram_subword_metadata.json +2 -2
models/subword_ngram/am_3gram_subword.parquet +2 -2
models/subword_ngram/am_3gram_subword_metadata.json +2 -2
models/subword_ngram/am_4gram_subword.parquet +2 -2
models/subword_ngram/am_4gram_subword_metadata.json +2 -2
models/tokenizer/am_tokenizer_16k.model +2 -2
models/tokenizer/am_tokenizer_16k.vocab +0 -0
models/tokenizer/am_tokenizer_32k.model +2 -2
models/tokenizer/am_tokenizer_32k.vocab +0 -0
models/tokenizer/am_tokenizer_64k.model +2 -2
models/tokenizer/am_tokenizer_64k.vocab +0 -0
models/tokenizer/am_tokenizer_8k.model +2 -2
models/tokenizer/am_tokenizer_8k.vocab +0 -0
models/vocabulary/am_vocabulary.parquet +2 -2
models/vocabulary/am_vocabulary_metadata.json +10 -9
models/word_markov/am_markov_ctx1_word.parquet +2 -2
models/word_markov/am_markov_ctx1_word_metadata.json +2 -2
models/word_markov/am_markov_ctx2_word.parquet +2 -2
models/word_markov/am_markov_ctx2_word_metadata.json +2 -2
models/word_markov/am_markov_ctx3_word.parquet +2 -2
models/word_markov/am_markov_ctx3_word_metadata.json +2 -2
models/word_markov/am_markov_ctx4_word.parquet +2 -2
models/word_markov/am_markov_ctx4_word_metadata.json +2 -2
models/word_ngram/am_2gram_word.parquet +2 -2
models/word_ngram/am_2gram_word_metadata.json +2 -2
models/word_ngram/am_3gram_word.parquet +2 -2
models/word_ngram/am_3gram_word_metadata.json +2 -2
models/word_ngram/am_4gram_word.parquet +2 -2
models/word_ngram/am_4gram_word_metadata.json +2 -2
visualizations/embedding_isotropy.png +0 -0
visualizations/embedding_norms.png +0 -0
visualizations/embedding_similarity.png +2 -2
visualizations/markov_branching.png +0 -0
visualizations/markov_contexts.png +0 -0

README.md CHANGED

@@ -23,14 +23,14 @@ dataset_info:
 metrics:
   - name: best_compression_ratio
     type: compression
-    value: 3.278
   - name: best_isotropy
     type: isotropy
-    value: 0.9070
   - name: vocabulary_size
     type: vocab
-    value: 108024
-generated: 2025-12-27
 ---
 # AM - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
-- N-gram models (2, 3, 4-gram)
-- Markov chains (context of 1, 2, 3 and 4)
 - Subword N-gram and Markov chains
-- Embeddings in various sizes and dimensions
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
-- [6. Summary & Recommendations](#6-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
@@ -68,59 +70,57 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
-| **8k** | 2.456x | 2.43 | 0.1639% | 455,103 |
-| **16k** | 2.758x | 2.73 | 0.1841% | 405,251 |
-| **32k** | 3.035x | 3.00 | 0.2026% | 368,183 |
-| **64k** | 3.278x 🏆 | 3.24 | 0.2188% | 340,895 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
-**Sample 1:** `'  የ ቻይና  ንጉሥ ነበር።
- ዋቢ መጽሐፍት
-መደብ:የቻይና ነገሥታት`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁' ▁የ ▁ቻይና ▁ንጉሥ ▁ነበር። ▁ዋቢ ▁መጽሐፍት ▁መደብ : የቻይና ... (+1 more)` | 11 |
-| 16k | `▁' ▁የ ▁ቻይና ▁ንጉሥ ▁ነበር። ▁ዋቢ ▁መጽሐፍት ▁መደብ : የቻይና ... (+1 more)` | 11 |
-| 32k | `▁' ▁የ ▁ቻይና ▁ንጉሥ ▁ነበር። ▁ዋቢ ▁መጽሐፍት ▁መደብ : የቻይና ... (+1 more)` | 11 |
-| 64k | `▁' ▁የ ▁ቻይና ▁ንጉሥ ▁ነበር። ▁ዋቢ ▁መጽሐፍት ▁መደብ : የቻይና ... (+1 more)` | 11 |
-**Sample 2:** `ኢትዮጵያ ውስጥ የሚሰራ የምግብ አይነት ሲሆን፣ የሚሰራውም ከሱፍስንዴና አንድ አንድ ጊዜም ሽምብራ ቆሎ ነው።
-አዘገጃጀት
-ሊተ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁ኢትዮጵያ ▁ውስጥ ▁የሚሰራ ▁የምግብ ▁አይነት ▁ሲሆን፣ ▁የሚሰራ ውም ▁ከሱ ፍ ... (+20 more)` | 30 |
-| 16k | `▁ኢትዮጵያ ▁ውስጥ ▁የሚሰራ ▁የምግብ ▁አይነት ▁ሲሆን፣ ▁የሚሰራውም ▁ከሱ ፍ ስን ... (+16 more)` | 26 |
-| 32k | `▁ኢትዮጵያ ▁ውስጥ ▁የሚሰራ ▁የምግብ ▁አይነት ▁ሲሆን፣ ▁የሚሰራውም ▁ከሱ ፍ ስንዴ ... (+14 more)` | 24 |
-| 64k | `▁ኢትዮጵያ ▁ውስጥ ▁የሚሰራ ▁የምግብ ▁አይነት ▁ሲሆን፣ ▁የሚሰራውም ▁ከሱ ፍ ስንዴ ... (+14 more)` | 24 |
-**Sample 3:** `1 January 1955 - 11 September 1955 እ.ኤ.ኣ. = 1947 አ.ም.
-12 September 1955 - 31 Dec...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁ 1 ▁january ▁ 1 9 5 5 ▁- ▁ ... (+59 more)` | 69 |
-| 16k | `▁ 1 ▁january ▁ 1 9 5 5 ▁- ▁ ... (+59 more)` | 69 |
-| 32k | `▁ 1 ▁january ▁ 1 9 5 5 ▁- ▁ ... (+59 more)` | 69 |
-| 64k | `▁ 1 ▁january ▁ 1 9 5 5 ▁- ▁ ... (+59 more)` | 69 |
 ### Key Findings
-- **Best Compression:** 64k achieves 3.278x compression
-- **Lowest UNK Rate:** 8k with 0.1639% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
@@ -129,55 +129,87 @@ Below are sample sentences tokenized with each vocabulary size:
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
-| N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
-|--------|------------|---------|----------------|------------------|-------------------|
-| **2-gram** | 10,759 🏆 | 13.39 | 48,164 | 21.6% | 40.3% |
-| **2-gram** | 2,321 🏆 | 11.18 | 26,048 | 32.4% | 67.6% |
-| **3-gram** | 16,194 | 13.98 | 68,935 | 19.7% | 36.8% |
-| **3-gram** | 21,529 | 14.39 | 173,382 | 11.5% | 34.1% |
-| **4-gram** | 45,509 | 15.47 | 148,975 | 14.6% | 27.1% |
-| **4-gram** | 104,202 | 16.67 | 623,179 | 6.7% | 19.0% |
 ### Top 5 N-grams by Size
-**2-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `ነው ።` | 21,098 |
-| 2 | `መደብ :` | 19,279 |
-| 3 | `፡ ፡` | 9,600 |
-| 4 | `ነበር ።` | 6,290 |
-| 5 | `ዓ .` | 6,166 |
-**3-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `ምሳሌ ነው ።` | 5,880 |
-| 2 | `የአማርኛ ምሳሌ ነው` | 5,832 |
-| 3 | `ዓ . ም` | 5,831 |
-| 4 | `። መደብ :` | 5,106 |
-| 5 | `. ም .` | 4,919 |
-**4-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `የአማርኛ ምሳሌ ነው ።` | 5,832 |
-| 2 | `ዓ . ም .` | 4,753 |
-| 3 | `እ . ኤ .` | 4,046 |
-| 4 | `. ኤ . አ` | 3,983 |
-| 5 | `ምሳሌ ነው ። ትርጉሙ` | 3,720 |
 ### Key Findings
-- **Best Perplexity:** 2-gram with 2,321
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
 - **Coverage:** Top-1000 patterns cover ~19% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
@@ -187,55 +219,86 @@ Below are sample sentences tokenized with each vocabulary size:
 ![Markov Entropy](visualizations/markov_entropy.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
-| Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
-|---------|-------------|------------|------------------|-----------------|----------------|
-| **1** | 0.6978 | 1.622 | 4.88 | 254,884 | 30.2% |
-| **1** | 1.4402 | 2.714 | 22.49 | 2,349 | 0.0% |
-| **2** | 0.1891 | 1.140 | 1.44 | 1,243,645 | 81.1% |
-| **2** | 1.1315 | 2.191 | 7.49 | 52,808 | 0.0% |
-| **3** | 0.0627 | 1.044 | 1.12 | 1,794,848 | 93.7% |
-| **3** | 0.6676 | 1.588 | 3.40 | 395,365 | 33.2% |
-| **4** | 0.0256 🏆 | 1.018 | 1.04 | 2,006,381 | 97.4% |
-| **4** | 0.4515 🏆 | 1.367 | 2.11 | 1,342,049 | 54.9% |
-### Generated Text Samples
-Below are text samples generated from each Markov chain model:
 **Context Size 1:**
-1. `። ትምህርት ቤት ነጣጥሎ መገንዘብ ይኖርበታል ተብሎ የሚታመነው የቶሪኖ ቀኖና ማስተማርና ትምህርት ሽግግር መንግስት 1643 -`
-2. `፡ ማስረጃ እንደሌለ ይቆጠራል የአማርኛ ምሳሌ ነው መደብ : / wiki / div id ) 2002`
-3. `. ) እና የበለጠ እውነት ሆኖ ለፓውሜራስ ክለብ በሐረር ፣ 1943 ) የአሹር ንጉሥ ታላቁ ፒተር`
 **Context Size 2:**
-1. `ነው ። ትርጉሙ መልስ ሲታጣ ፣ ዝምታ ተፈጥሮው የሆነ ግሩም ድምጻዊ ነው ። በ13ኛው ክፍለ ዘመን የዪጂንግ`
-2. `መደብ : የኮሪያ ነገሥታት`
-3. `፡ ፡ በአካላዊ ገጽታ እና ከህዝቡ አንድ አራተኛውን ይሸፍናል ። ምንም እንኳን ተመሳሳይ ጥምረት ከብዙ አሥርተ ዓመታት`
 **Context Size 3:**
-1. `ምሳሌ ነው ። ዘመነ ግርምቢጥ ውሻ ወደ ሰርዶ አህያ ወደ ሊጥ ዘመን እንደንጉሱ አውድማ እንደንፋሱ የአማርኛ ምሳሌ ነው`
-2. `የአማርኛ ምሳሌ ነው ። ትርጉሙ መደብ : ያልተተረጎመ ምሳሌ መደብ : ተረትና ምሳሌ መደብ : ተረትና ምሳሌ መደብ`
-3. `ዓ . ም . በኋላ ለሆኑት ዓመታት ግን በሌላ ቀን ላይ መሆኑን ይገንዘቡ ። ለእነዚያ ዓመቶች ይህ የቀን`
 **Context Size 4:**
-1. `የአማርኛ ምሳሌ ነው ። ትርጉሙ መደብ : ያልተተረጎመ ምሳሌ መደብ : ተረትና ምሳሌ መደብ : ተረትና ምሳሌ መደብ :`
-2. `ዓ . ም . አስቀድሞ ወይም ከ2091 ዓ . ም . ተክለጻድቅ መኩሪያ መደብ : ተክለጻድቅ መኩርያ መደብ :`
-3. `እ . ኤ . አ . በ 1914 የሩሲያ ንጉሠ ነገሥት ኒኮላስ ii ( 1894 - 1917 ) እ`
 ### Key Findings
-- **Best Predictability:** Context-4 with 97.4% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
-- **Memory Trade-off:** Larger contexts require more storage (1,342,049 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
@@ -251,64 +314,64 @@ Below are text samples generated from each Markov chain model:
 | Metric | Value |
 |--------|-------|
-| Vocabulary Size | 108,024 |
-| Total Tokens | 1,810,273 |
-| Mean Frequency | 16.76 |
 | Median Frequency | 3 |
-| Frequency Std Dev | 184.67 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | ነው | 28,114 |
-| 2 | እና | 23,158 |
-| 3 | መደብ | 19,525 |
-| 4 | ላይ | 13,580 |
-| 5 | ምሳሌ | 12,239 |
-| 6 | ውስጥ | 9,959 |
-| 7 | ነበር | 9,632 |
-| 8 | ወደ | 9,166 |
-| 9 | ም | 8,691 |
-| 10 | ዓ | 8,629 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | ጂኒካ | 2 |
-| 2 | ዲኒካላ | 2 |
-| 3 | ወስደሽ | 2 |
-| 4 | አንኳኳ | 2 |
-| 5 | መዳልወ | 2 |
-| 6 | ረድእ | 2 |
-| 7 | አንደኛይቱ | 2 |
-| 8 | ወደሰልፍ | 2 |
-| 9 | የኒኮፖሊስ | 2 |
-| 10 | ጂምናዚየም | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
-| Zipf Coefficient | 0.9297 |
-| R² (Goodness of Fit) | 0.994674 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
-| Top 100 | 22.3% |
-| Top 1,000 | 44.9% |
-| Top 5,000 | 65.4% |
-| Top 10,000 | 74.1% |
 ### Key Findings
-- **Zipf Compliance:** R²=0.9947 indicates excellent adherence to Zipf's law
-- **High Frequency Dominance:** Top 100 words cover 22.3% of corpus
-- **Long Tail:** 98,024 words needed for remaining 25.9% coverage
 ---
 ## 5. Word Embeddings Evaluation
@@ -321,24 +384,88 @@ Below are text samples generated from each Markov chain model:
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
-### Model Comparison
-| Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
-|-------|------------|-----------|----------|----------|----------|
-| **mono_32d** | 40,456 | 32 | 3.565 | 0.969 | 0.8976 |
-| **mono_64d** | 40,456 | 64 | 4.280 | 0.895 | 0.9070 🏆 |
-| **mono_128d** | 40,456 | 128 | 5.026 | 0.790 | 0.8490 |
-| **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
 ### Key Findings
-- **Best Isotropy:** mono_64d with 0.9070 (more uniform distribution)
-- **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
-- **Vocabulary Coverage:** All models cover 40,456 words
-- **Recommendation:** 100d for balanced semantic capture and efficiency
 ---
-## 6. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
@@ -346,11 +473,12 @@ Below are text samples generated from each Markov chain model:
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
-| Tokenizer | **32k BPE** | Best compression (3.28x) with low UNK rate |
-| N-gram | **5-gram** | Lowest perplexity (2,321) |
-| Markov | **Context-4** | Highest predictability (97.4%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
@@ -540,7 +668,8 @@ If you use these models in your research, please cite:
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
-  publisher = {HuggingFace},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
@@ -556,7 +685,8 @@ MIT License - Free for academic and commercial use.
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 ---
 *Generated by Wikilangs Models Pipeline*
-*Report Date: 2025-12-27 05:42:07*

 metrics:
   - name: best_compression_ratio
     type: compression
+    value: 3.287
   - name: best_isotropy
     type: isotropy
+    value: 0.9163
   - name: vocabulary_size
     type: vocab
+    value: 0
+generated: 2026-01-03
 ---
 # AM - Wikilangs Models
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
+- N-gram models (2, 3, 4, 5-gram)
+- Markov chains (context of 1, 2, 3, 4 and 5)
 - Subword N-gram and Markov chains
+- Embeddings in various sizes and dimensions (aligned and unaligned)
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
+- [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+- [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
+![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+![Tokenizer OOV](visualizations/tokenizer_oov.png)
+![Total Tokens](visualizations/tokenizer_total_tokens.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
+| **8k** | 2.436x | 2.44 | 0.1557% | 683,952 |
+| **16k** | 2.745x | 2.75 | 0.1754% | 607,060 |
+| **32k** | 3.031x | 3.03 | 0.1937% | 549,802 |
+| **64k** | 3.287x 🏆 | 3.29 | 0.2101% | 506,938 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
+**Sample 1:** `አዋሳ ከነማ ስታዲየም በአዋሳ፣ ኢትዮጵያ የሚገኝ ስታዲዮም ነው። ፳፭ ሺህ ሰዎችን መያዝ ሲችል የአዋሳ ከተማ የእግር ኳስ ክለብ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁አዋ ሳ ▁ከነ ማ ▁ስታዲየም ▁በአ ዋ ሳ፣ ▁ኢትዮጵያ ▁የሚገኝ ... (+25 more)` | 35 |
+| 16k | `▁አዋሳ ▁ከነ ማ ▁ስታዲየም ▁በአ ዋ ሳ፣ ▁ኢትዮጵያ ▁የሚገኝ ▁ስታ ... (+22 more)` | 32 |
+| 32k | `▁አዋሳ ▁ከነማ ▁ስታዲየም ▁በአ ዋ ሳ፣ ▁ኢትዮጵያ ▁የሚገኝ ▁ስታ ዲዮ ... (+20 more)` | 30 |
+| 64k | `▁አዋሳ ▁ከነማ ▁ስታዲየም ▁በአዋ ሳ፣ ▁ኢትዮጵያ ▁የሚገኝ ▁ስታ ዲዮ ም ... (+19 more)` | 29 |
+**Sample 2:** `የዝንጀሮ ስብሰባ በውሻ ጩኸት ይበተናል የአማርኛ ምሳሌ ነው። የዝንጀሮ ስብሰባ በውሻ ጩኸት ይበተናል የአማርኛ ምሳሌ ነው። ትር...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁የዝ ንጀሮ ▁ስብሰባ ▁በው ሻ ▁ ጩ ኸ ት ▁ይበ ... (+29 more)` | 39 |
+| 16k | `▁የዝ ንጀሮ ▁ስብሰባ ▁በው ሻ ▁ጩ ኸት ▁ይበ ተ ናል ... (+25 more)` | 35 |
+| 32k | `▁የዝንጀሮ ▁ስብሰባ ▁በው ሻ ▁ጩ ኸት ▁ይበ ተናል ▁የአማርኛ ▁ምሳሌ ... (+21 more)` | 31 |
+| 64k | `▁የዝንጀሮ ▁ስብሰባ ▁በውሻ ▁ጩኸት ▁ይበ ተናል ▁የአማርኛ ▁ምሳሌ ▁ነው። ▁የዝንጀሮ ... (+17 more)` | 27 |
+**Sample 3:** `የሐረሪ ብሔራዊ ሊግ የኢትዮጵያ ፖለቲካ ፓርቲ ነው። ዓላማ ሊቀመንበር ታሪክ መደብ: በምርጫ የተሳተፉ የኢትዮጵያ ፓርቲዎች መደብ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁የሐ ረ ሪ ▁ብሔራዊ ▁ሊግ ▁የኢትዮጵያ ▁ፖለቲካ ▁ፓርቲ ▁ነው። ▁ዓላማ ... (+13 more)` | 23 |
+| 16k | `▁የሐ ረሪ ▁ብሔራዊ ▁ሊግ ▁የኢትዮጵያ ▁ፖለቲካ ▁ፓርቲ ▁ነው። ▁ዓላማ ▁ሊቀመንበር ... (+12 more)` | 22 |
+| 32k | `▁የሐረሪ ▁ብሔራዊ ▁ሊግ ▁የኢትዮጵያ ▁ፖለቲካ ▁ፓርቲ ▁ነው። ▁ዓላማ ▁ሊቀመንበር ▁ታሪክ ... (+11 more)` | 21 |
+| 64k | `▁የሐረሪ ▁ብሔራዊ ▁ሊግ ▁የኢትዮጵያ ▁ፖለቲካ ▁ፓርቲ ▁ነው። ▁ዓላማ ▁ሊቀመንበር ▁ታሪክ ... (+11 more)` | 21 |
 ### Key Findings
+- **Best Compression:** 64k achieves 3.287x compression
+- **Lowest UNK Rate:** 8k with 0.1557% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
+![N-gram Unique](visualizations/ngram_unique.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
+| N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+|--------|---------|------------|---------|----------------|------------------|-------------------|
+| **2-gram** | Word | 8,988 | 13.13 | 27,901 | 19.7% | 39.7% |
+| **2-gram** | Subword | 2,079 🏆 | 11.02 | 23,804 | 34.0% | 69.2% |
+| **3-gram** | Word | 9,944 | 13.28 | 35,714 | 22.1% | 40.5% |
+| **3-gram** | Subword | 19,139 | 14.22 | 153,027 | 11.8% | 35.5% |
+| **4-gram** | Word | 36,744 | 15.17 | 90,792 | 13.9% | 25.8% |
+| **4-gram** | Subword | 94,777 | 16.53 | 549,996 | 6.6% | 19.5% |
 ### Top 5 N-grams by Size
+**2-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `ዓ ም` | 8,324 |
+| 2 | `ምሳሌ ነው` | 5,625 |
+| 3 | `የአማርኛ ምሳሌ` | 5,563 |
+| 4 | `እ ኤ` | 4,026 |
+| 5 | `ኤ አ` | 3,961 |
+**3-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `የአማርኛ ምሳሌ ነው` | 5,563 |
+| 2 | `እ ኤ አ` | 3,908 |
+| 3 | `ምሳሌ ነው ትርጉሙ` | 3,454 |
+| 4 | `መደብ ተረትና ምሳሌ` | 3,051 |
+| 5 | `ነው ትርጉሙ መደብ` | 2,533 |
+**4-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `የአማርኛ ምሳሌ ነው ትርጉሙ` | 3,452 |
+| 2 | `ምሳሌ ነው ትርጉሙ መደብ` | 2,533 |
+| 3 | `ትርጉሙ መደብ ያልተተረጎመ ምሳሌ` | 2,118 |
+| 4 | `ነው ትርጉሙ መደብ ያልተተረጎመ` | 2,114 |
+| 5 | `ምሳሌ መደብ ተረትና ምሳሌ` | 1,854 |
+**2-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `_ የ` | 170,716 |
+| 2 | `ት _` | 145,051 |
+| 3 | `_ በ` | 140,839 |
+| 4 | `ን _` | 132,909 |
+| 5 | `_ አ` | 113,769 |
+**3-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `_ እ ን` | 32,319 |
+| 2 | `_ ነ ው` | 26,511 |
+| 3 | `ው ። _` | 24,155 |
+| 4 | `_ እ ና` | 23,843 |
+| 5 | `እ ና _` | 22,397 |
+**4-grams (Subword):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `_ እ ና _` | 22,267 |
+| 2 | `_ ነ ው ።` | 19,378 |
+| 3 | `ነ ው ። _` | 18,922 |
+| 4 | `_ እ ን ደ` | 13,836 |
+| 5 | `_ ላ ይ _` | 12,924 |
 ### Key Findings
+- **Best Perplexity:** 2-gram (subword) with 2,079
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
 - **Coverage:** Top-1000 patterns cover ~19% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 ![Markov Entropy](visualizations/markov_entropy.png)
+![Markov Contexts](visualizations/markov_contexts.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
+| Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+|---------|---------|-------------|------------|------------------|-----------------|----------------|
+| **1** | Word | 0.7502 | 1.682 | 4.80 | 236,353 | 25.0% |
+| **1** | Subword | 1.2235 | 2.335 | 17.52 | 2,854 | 0.0% |
+| **2** | Word | 0.1468 | 1.107 | 1.28 | 1,130,961 | 85.3% |
+| **2** | Subword | 1.0397 | 2.056 | 6.98 | 49,981 | 0.0% |
+| **3** | Word | 0.0355 | 1.025 | 1.06 | 1,446,616 | 96.4% |
+| **3** | Subword | 0.6354 | 1.553 | 3.36 | 348,535 | 36.5% |
+| **4** | Word | 0.0159 🏆 | 1.011 | 1.02 | 1,520,994 | 98.4% |
+| **4** | Subword | 0.4515 | 1.367 | 2.14 | 1,171,344 | 54.9% |
+### Generated Text Samples (Word-based)
+Below are text samples generated from each word-based Markov chain model:
+**Context Size 1:**
+1. `ነው በጥሩ አስተዳደር የህዝብ ልውውጥ ኮሚሽን ሕንፃ በሥነ ሕንጻ ተጠናቆ ክፍት አረግንጓዴ ከጂዮርጂያ በእ አ የሆነ`
+2. `እና ሲሞት ቤተሰቦቹ ጋር ቢኮብለል ወይም ቀበሮ ዝርያ ያለበት ቦታ 4 5 14 847 ቅጥር የተለጠፈ`
+3. `ላይ መስኮት እና ከደቡብ ህንድ ቀጥሎ ዓ ም ቀዳማዊ ኃይለ ሥላሴ ለአፍሪካ ገሞጂማ ሣርማ ቦታዎች ይቀመጡ`
+**Context Size 2:**
+1. `ዓ ም አስቀድሞ ወይም ዓ ም የአርጡስ ወንድም 4 ካርል 663 669 ዓ ም ሁላቸው ሲስማሙ ወደ`
+2. `ምሳሌ ነው ጀርባዬን እከክልኝ ለኔ ራቀኝ የአማርኛ ምሳሌ ነው ዝርክርክ ከወንፊት የባሰ ዝክዝክ የአማርኛ ምሳሌ ነው ትርጉሙ`
+3. `የአማርኛ ምሳሌ ነው የምትጠላው ሰው ፈሱ እሆዱ ውስጥ ሳለ ኔሽን ኦፍ ኢስላም ጋር ያለው ዝምድና ግልጽ ነው`
+**Context Size 3:**
+1. `የአማርኛ ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ ቁና ሰፋች`
+2. `እ ኤ አ በ በሂትለር ተጽዕኖ ሙሶሎኒ በጣሊያን ፀረ ሴማዊ የዘር ህጎች እንዲፀድቁ ደገፈ በመጋቢት ጀርመን ቼኮዝሎቫኪያን ከቀላቀለች`
+3. `ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ ምግባር ሳይኖር ስም እንደማለት ነዉ`
+**Context Size 4:**
+1. `የአማርኛ ምሳሌ ነው ትርጉሙ ሁለቱም አያዋጡም መደብ ተረትና ምሳሌ`
+2. `ምሳሌ ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ መደብ ፈሊጣዊ አነጋገር መደብ ተረትና ምሳሌ ቁና ሰፋች`
+3. `ነው ትርጉሙ መደብ ያልተተረጎመ ምሳሌ መደብ ተረትና ምሳሌ ሴት ሁሉን ቻይ ናት`
+### Generated Text Samples (Subword-based)
+Below are text samples generated from each subword-based Markov chain model:
 **Context Size 1:**
+1. `_አንግብ፣_ለማር_(“ሀንን`
+2. `ን_(dicole_ገደቡድ_ሞ`
+3. `ት_ቅ_ሓምበላስድ_ጋ_ይለት`
 **Context Size 2:**
+1. `_የኢትዮጵያ_አፖሎኛ_,00_`
+2. `ት_ተመለሰብን_ኣሉ።_የባህር`
+3. `_በዘመዴ_ሲፀድቅ_በተአምስተ`
 **Context Size 3:**
+1. `_እንደ_ማርኮ_ከተማ_እና_ግሮ`
+2. `_ነው።_ከተማው_ባልሞራል_እን`
+3. `ው።_ኮምፕዩተራይዝ_ካሊፎርኒያ`
 **Context Size 4:**
+1. `_እና_ማከማቸት_ጉዳት_ቁጭ_ብለ`
+2. `_ነው።_ዋጋው_ወቅት_የመጀመሪያ`
+3. `ነው።_ትርጉሙ_አንቴና_ይፈሳሉ፡`
 ### Key Findings
+- **Best Predictability:** Context-4 (word) with 98.4% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
+- **Memory Trade-off:** Larger contexts require more storage (1,171,344 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
 | Metric | Value |
 |--------|-------|
+| Vocabulary Size | 99,716 |
+| Total Tokens | 1,636,892 |
+| Mean Frequency | 16.42 |
 | Median Frequency | 3 |
+| Frequency Std Dev | 174.41 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | ነው | 26,460 |
+| 2 | እና | 22,392 |
+| 3 | ላይ | 13,250 |
+| 4 | ምሳሌ | 11,607 |
+| 5 | ውስጥ | 9,622 |
+| 6 | ነበር | 9,005 |
+| 7 | ዓ | 8,679 |
+| 8 | ም | 8,584 |
+| 9 | ወደ | 8,446 |
+| 10 | እንደ | 6,776 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | ቫሊን | 2 |
+| 2 | ግሎቡላር | 2 |
+| 3 | ኢንዛይሞች | 2 |
+| 4 | የማከማቻ | 2 |
+| 5 | ለph | 2 |
+| 6 | ግብረመልሶችን | 2 |
+| 7 | behi | 2 |
+| 8 | ቤሂ | 2 |
+| 9 | goli | 2 |
+| 10 | ክሩድስ | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
+| Zipf Coefficient | 0.9367 |
+| R² (Goodness of Fit) | 0.995214 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
+| Top 100 | 22.7% |
+| Top 1,000 | 45.8% |
+| Top 5,000 | 66.2% |
+| Top 10,000 | 74.9% |
 ### Key Findings
+- **Zipf Compliance:** R²=0.9952 indicates excellent adherence to Zipf's law
+- **High Frequency Dominance:** Top 100 words cover 22.7% of corpus
+- **Long Tail:** 89,716 words needed for remaining 25.1% coverage
 ---
 ## 5. Word Embeddings Evaluation
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
+### 5.1 Cross-Lingual Alignment
+> *Note: Multilingual alignment visualization not available for this language.*
+### 5.2 Model Comparison
+| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+|-------|-----------|----------|------------------|---------------|----------------|
+| **mono_32d** | 32 | 0.9125 | 0.3250 | N/A | N/A |
+| **mono_64d** | 64 | 0.9163 🏆 | 0.2292 | N/A | N/A |
+| **mono_128d** | 128 | 0.8535 | 0.1745 | N/A | N/A |
 ### Key Findings
+- **Best Isotropy:** mono_64d with 0.9163 (more uniform distribution)
+- **Semantic Density:** Average pairwise similarity of 0.2429. Lower values indicate better semantic separation.
+- **Alignment Quality:** No aligned models evaluated in this run.
+- **Recommendation:** 128d aligned for best cross-lingual performance
+---
+## 6.  Morphological Analysis (Experimental)
+> ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+### 6.1 Productivity & Complexity
+| Metric | Value | Interpretation | Recommendation |
+|--------|-------|----------------|----------------|
+| Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+| Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+### 6.2 Affix Inventory (Productive Units)
+These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
+*No productive affixes detected.*
+### 6.3 Bound Stems (Lexical Roots)
+Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
+| Stem | Cohesion | Substitutability | Examples |
+|------|----------|------------------|----------|
+| `እንደሚ` | 2.46x | 153 contexts | እንደሚሹ, እንደሚል, እንደሚሉ |
+| `ርስቲያ` | 2.48x | 60 contexts | ክርስቲያ, ክርስቲያኗ, ክርስቲያኖ |
+| `ትዮጵያ` | 2.23x | 57 contexts | እትዮጵያ, ኢትዮጵያ, ኢትዮጵያው |
+| `ግዚአብ` | 2.73x | 24 contexts | እግዚአብሔር, እግዚአብሐር, እግዚአብሄር |
+| `ኢትዮጵ` | 2.24x | 46 contexts | ኢትዮጵያ, ኢትዮጵያው, ኢትዮጵስት |
+| `መንግሥ` | 2.21x | 46 contexts | መንግሥተ, መንግሥት, መንግሥቱ |
+| `መንግስ` | 2.16x | 48 contexts | መንግስት, መንግስተ, መንግስቱ |
+| `ፈረንሳ` | 2.33x | 34 contexts | ፈረንሳዊ, ፈረንሳይ, በፈረንሳዩ |
+| `አስተዳ` | 2.33x | 33 contexts | አስተዳዳሪ, አስተዳደጓ, አስተዳደረ |
+| `እንግሊ` | 2.05x | 53 contexts | እንግሊዝ, እንግሊዙ, እንግሊኛ |
+| `tion` | 2.82x | 17 contexts | nation, action, section |
+| `ጀመሪያ` | 2.28x | 33 contexts | መጀመሪያ, ለመጀመሪያ, መጀመሪያው |
+### 6.4 Affix Compatibility (Co-occurrence)
+This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
+*No significant affix co-occurrences detected.*
+### 6.5 Recursive Morpheme Segmentation
+Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
+*Insufficient data for recursive segmentation.*
+### 6.6 Linguistic Interpretation
+> **Automated Insight:**
+The language AM appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
 ---
+## 7. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
+| Tokenizer | **64k BPE** | Best compression (3.29x) |
+| N-gram | **2-gram** | Lowest perplexity (2,079) |
+| Markov | **Context-4** | Highest predictability (98.4%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
+  doi = {10.5281/zenodo.18073153},
+  publisher = {Zenodo},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+- 🤝 Sponsor: [Featherless AI](https://featherless.ai)
 ---
 *Generated by Wikilangs Models Pipeline*
+*Report Date: 2026-01-03 05:13:17*
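
Several of the updated README figures can be cross-checked with a few lines of arithmetic. The sketch below is not part of the Wikilangs pipeline. It assumes that the reported perplexity is 2 raised to the entropy in bits (the Markov rows match this to three decimals) and that the long-tail count is the vocabulary size minus the top 10,000 words. All variable and function names are illustrative.

```python
# Figures copied from the "+" side of the README diff in this commit.
vocab_size = 99_716
total_tokens = 1_636_892
top_10k_coverage = 74.9  # percent of corpus covered by the top 10,000 words


def perplexity_from_entropy(entropy_bits: float) -> float:
    """Assumed relationship: perplexity = 2 ** entropy (entropy in bits)."""
    return 2.0 ** entropy_bits


# Mean frequency is total tokens divided by vocabulary size.
mean_frequency = total_tokens / vocab_size

# Long tail: words outside the top 10,000 and the coverage they account for.
long_tail_words = vocab_size - 10_000
remaining_coverage = 100.0 - top_10k_coverage

# Word-level Markov chain, context size 1: entropy 0.7502, reported perplexity 1.682.
ctx1_word_perplexity = perplexity_from_entropy(0.7502)

print(f"mean frequency      : {mean_frequency:.2f}")
print(f"long-tail words     : {long_tail_words:,}")
print(f"remaining coverage  : {remaining_coverage:.1f}%")
print(f"ctx-1 word perplexity: {ctx1_word_perplexity:.3f}")
```

Under these assumptions the computed values agree with the README's 16.42 mean frequency, 89,716 long-tail words, 25.1% remaining coverage, and 1.682 perplexity. The n-gram rows agree only approximately, which is expected because the entropy column is rounded to two decimals.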

models/embeddings/monolingual/am_128d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0efb3357d959a6e2e5481a2cb05a3adbd88201db376b5adc924dff9b57f0ef0e
-size 1066335087

 version https://git-lfs.github.com/spec/v1
+oid sha256:1d4be827f7a31270ecd980640a7032e11c5ed3885d037c1b8cd46df3d9007492
+size 1063990202
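
The `.bin` and `.parquet` entries in this commit are Git LFS pointer files, so the diff only shows a new object hash and byte size rather than the model contents. As a minimal sketch (the helper name `parse_lfs_pointer` is hypothetical, not a Git LFS API), the pointer text can be parsed and the size change computed like this:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid line has the form "<hash-algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Old and new pointers for am_128d.bin, copied from the diff above.
old = parse_lfs_pointer("""\
version https://git-lfs.github.com/spec/v1
oid sha256:0efb3357d959a6e2e5481a2cb05a3adbd88201db376b5adc924dff9b57f0ef0e
size 1066335087
""")
new = parse_lfs_pointer("""\
version https://git-lfs.github.com/spec/v1
oid sha256:1d4be827f7a31270ecd980640a7032e11c5ed3885d037c1b8cd46df3d9007492
size 1063990202
""")

size_delta = new["size"] - old["size"]
print(f"size change: {size_delta:,} bytes")
```

The negative delta shows that the 128d embedding file shrank by roughly 2.3 MB in this commit.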

models/embeddings/monolingual/am_128d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 128,
   "version": "monolingual",
   "training_params": {
-    "dim": 128,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 40456
 }

   "dimension": 128,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 128
   },
+  "vocab_size": 38213
 }
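
The metadata change is easier to read as a key-level comparison than as a line diff, because `dim` moves within `training_params` without changing value. The sketch below rebuilds only the fields visible in the hunk (the hunk starts at line 3 of the file, so earlier fields are elided and not guessed here) and compares the flattened keys. The `flatten` helper is illustrative.

```python
import json


def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted paths, e.g. 'training_params.dim'."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


# Only the fields shown in the diff hunk; anything before line 3 is omitted.
old = flatten(json.loads("""
{"dimension": 128, "version": "monolingual",
 "training_params": {"dim": 128, "min_count": 5, "window": 5,
                     "negative": 5, "epochs": 5},
 "vocab_size": 40456}
"""))
new = flatten(json.loads("""
{"dimension": 128, "version": "monolingual",
 "training_params": {"algorithm": "skipgram", "min_count": 5, "window": 5,
                     "negative": 5, "epochs": 5,
                     "encoding_method": "rope", "dim": 128},
 "vocab_size": 38213}
"""))

added = sorted(new.keys() - old.keys())
removed = sorted(old.keys() - new.keys())
changed = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}

print("added  :", added)
print("removed:", removed)
print("changed:", changed)
```

Read this way, the commit adds `algorithm` and `encoding_method` to `training_params` and lowers `vocab_size` from 40456 to 38213; no visible field is removed.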

models/embeddings/monolingual/am_32d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:25b527389da984b0a46121bb9f197602821de2917c7a5256185de9bc6db2e0bf
-size 267264879

 version https://git-lfs.github.com/spec/v1
+oid sha256:1facd0cecd15328df01b8eb00c409adeaac9321a5d62748462e3055c6a4b976e
+size 266642618

models/embeddings/monolingual/am_32d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 32,
   "version": "monolingual",
   "training_params": {
-    "dim": 32,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 40456
 }

   "dimension": 32,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 32
   },
+  "vocab_size": 38213
 }

models/embeddings/monolingual/am_64d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a2d6a5af7f3b896dee81096e1e86293fa2f34063054412b3ceda74be6e0c1ab5
-size 533621615

 version https://git-lfs.github.com/spec/v1
+oid sha256:f42620197f1e1b51b39471cc1e5886ab38c161775048e9529ceb06fdf065e6e1
+size 532425146

models/embeddings/monolingual/am_64d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 64,
   "version": "monolingual",
   "training_params": {
-    "dim": 64,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 40456
 }

   "dimension": 64,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 64
   },
+  "vocab_size": 38213
 }

models/subword_markov/am_markov_ctx1_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c2cb11e67c389141a2fadeda43f425f0f1bf4de68d4b119cf4340c1a6b943bc0
-size 365675

 version https://git-lfs.github.com/spec/v1
+oid sha256:0dea8b9be146dc2d3f0da26601ce42020e0c84af4656d9f7692b3f4d34d5d6ae
+size 356106

models/subword_markov/am_markov_ctx1_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "subword",
   "language": "am",
-  "unique_contexts": 2349,
-  "total_transitions": 10028204
 }

   "context_size": 1,
   "variant": "subword",
   "language": "am",
+  "unique_contexts": 2854,
+  "total_transitions": 8936316
 }

models/subword_markov/am_markov_ctx2_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6fe5f25cdfc3e26bfa52e7017c7ccecd73d51852c672946b0a471baf4c253fc7
-size 2376324

 version https://git-lfs.github.com/spec/v1
+oid sha256:0e9b01d5674bb78549bd0f7a41241639ddc7934db689c638aee8d0daacc70172
+size 2154891

models/subword_markov/am_markov_ctx2_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "subword",
   "language": "am",
-  "unique_contexts": 52808,
-  "total_transitions": 10014026
 }

   "context_size": 2,
   "variant": "subword",
   "language": "am",
+  "unique_contexts": 49981,
+  "total_transitions": 8923902
 }

models/subword_markov/am_markov_ctx3_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0d8b1997b67fd57bb5cef7ecfef1a9045802e2e6af44a0c7b5fe11aa4286d3f4
-size 9540494

 version https://git-lfs.github.com/spec/v1
+oid sha256:36e05d4c86f22f4d3f6c7dff80e4b868513b425317942829e61a871110f71ee0
+size 8408029

models/subword_markov/am_markov_ctx3_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "subword",
   "language": "am",
-  "unique_contexts": 395365,
-  "total_transitions": 9999848
 }

   "context_size": 3,
   "variant": "subword",
   "language": "am",
+  "unique_contexts": 348535,
+  "total_transitions": 8911488
 }

models/subword_markov/am_markov_ctx4_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:41248045001571748a27c9b0e03b64cd156b22e6a9687173d0b8c4cbdcd2460d
-size 26523692

 version https://git-lfs.github.com/spec/v1
+oid sha256:ab3d345339fb5097d316a6acad43d3cbd7f781c24bd690b17950a9d30bdd7dfa
+size 23278804

models/subword_markov/am_markov_ctx4_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "subword",
   "language": "am",
-  "unique_contexts": 1342049,
-  "total_transitions": 9985670
 }

   "context_size": 4,
   "variant": "subword",
   "language": "am",
+  "unique_contexts": 1171344,
+  "total_transitions": 8899074
 }
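
Across the four subword Markov metadata files, `total_transitions` falls by a constant amount each time `context_size` grows by one. A short check, using only the figures shown in this diff, makes the pattern explicit. The reading that each document contributes one fewer transition per extra context token is an inference from the numbers, not something the commit states; the step size does coincide with the `total_documents` value reported in `am_vocabulary_metadata.json` (14178 before, 12414 after).

```python
# total_transitions per context_size, copied from the subword Markov metadata.
OLD_TOTALS = {1: 10028204, 2: 10014026, 3: 9999848, 4: 9985670}
NEW_TOTALS = {1: 8936316, 2: 8923902, 3: 8911488, 4: 8899074}


def steps(totals: dict) -> list:
    """Drop in total_transitions between consecutive context sizes."""
    sizes = sorted(totals)
    return [totals[a] - totals[b] for a, b in zip(sizes, sizes[1:])]


old_steps = steps(OLD_TOTALS)
new_steps = steps(NEW_TOTALS)
print(old_steps)  # [14178, 14178, 14178]
print(new_steps)  # [12414, 12414, 12414]
```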

models/subword_ngram/am_2gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:85dd052f630c86a220366e62124925683c30d80262595a95d2da8e61e4dc4b1d
-size 326227

 version https://git-lfs.github.com/spec/v1
+oid sha256:6bcf32e20adcf3b9aa77e784d9fd125484c7c7d01f306652cf74a76522fbc536
+size 300324

models/subword_ngram/am_2gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "subword",
   "language": "am",
-  "unique_ngrams": 26048,
-  "total_ngrams": 10028204
 }

   "n": 2,
   "variant": "subword",
   "language": "am",
+  "unique_ngrams": 23804,
+  "total_ngrams": 8936316
 }

models/subword_ngram/am_3gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ff695ae4e7eab84ead000ce11ad4fc0c65484a5a1fd01a30d7ec14360faf19cc
-size 2119089

 version https://git-lfs.github.com/spec/v1
+oid sha256:9c8cece5e05ace848d6a685201f3e49e33ed838dc7744aec46795f09bbfdb9e0
+size 1881540

models/subword_ngram/am_3gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "subword",
   "language": "am",
-  "unique_ngrams": 173382,
-  "total_ngrams": 10014026
 }

   "n": 3,
   "variant": "subword",
   "language": "am",
+  "unique_ngrams": 153027,
+  "total_ngrams": 8923902
 }

models/subword_ngram/am_4gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7d3f77f95c04e37e46abbe5dffc45c0c1e05d58dc384a30081cad87c6523fd88
-size 8027416

 version https://git-lfs.github.com/spec/v1
+oid sha256:f6cf11a9b9599d5b462fc5775302067d643de8e8fd248de5060f89351b18b9ab
+size 7082960

models/subword_ngram/am_4gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "subword",
   "language": "am",
-  "unique_ngrams": 623179,
-  "total_ngrams": 9999848
 }

   "n": 4,
   "variant": "subword",
   "language": "am",
+  "unique_ngrams": 549996,
+  "total_ngrams": 8911488
 }

models/tokenizer/am_tokenizer_16k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cedf1c15aa7971e234b55d99070a4df149f61a8f6fd1250b6e3f1e2cef58ba64
-size 557859

 version https://git-lfs.github.com/spec/v1
+oid sha256:f02cb4592939e59c4831f57ca855266e8d28172e5efebde508c8b57daefbafb6
+size 559482

models/tokenizer/am_tokenizer_16k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/am_tokenizer_32k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7093df805cbfa8e5f44b174f96bd89b7c4e418862fcb6b2806f03fb78060610d
-size 899806

 version https://git-lfs.github.com/spec/v1
+oid sha256:93bd53271f8c4e6a918072a97dd0581c020a8af645616093bc8213a30b88940e
+size 902409

models/tokenizer/am_tokenizer_32k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/am_tokenizer_64k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f9b1db1dd69389fa6546da7500a06e5d0fce2ab1f61a918dedf24fef3f7bf3a9
-size 1619242

 version https://git-lfs.github.com/spec/v1
+oid sha256:b8481661d7e8d938bace980d6a5c315614597c92fb98ccbef70df78797efdecc
+size 1589488

models/tokenizer/am_tokenizer_64k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/am_tokenizer_8k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c8c82fbbd74ca9cc772a93c5ca966ce88ce53a71386e90e51549f764cec258d2
-size 393601

 version https://git-lfs.github.com/spec/v1
+oid sha256:4ad931bdd41158b2542d8aac191257d87f612a4ef884c47473efc0d7a86dd011
+size 394754

models/tokenizer/am_tokenizer_8k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/vocabulary/am_vocabulary.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1c9e63246ff6b65932ad2b6cd7bb7743b7d9505ea2e1ac9ca771e738d442b5f8
-size 1928354

 version https://git-lfs.github.com/spec/v1
+oid sha256:294dfe907fb7e7f0defd53a349bd532546806ddc295a92c292188997c0d498bd
+size 1787217

models/vocabulary/am_vocabulary_metadata.json CHANGED

@@ -1,16 +1,17 @@
 {
   "language": "am",
-  "vocabulary_size": 108024,
   "statistics": {
-    "type_token_ratio": 0.1301235738821781,
     "coverage": {
-      "top_100": 0.20614026570786442,
-      "top_1000": 0.4150476829002814,
-      "top_5000": 0.6047674724025824,
-      "top_10000": 0.685510039930788
     },
-    "hapax_count": 146613,
-    "hapax_ratio": 0.5757725703648724,
-    "total_documents": 14178
   }
 }

 {
   "language": "am",
+  "vocabulary_size": 99716,
+  "variant": "full",
   "statistics": {
+    "type_token_ratio": 0.13335844069738334,
     "coverage": {
+      "top_100": 0.20952339607919193,
+      "top_1000": 0.42309028051841446,
+      "top_5000": 0.6109410976729082,
+      "top_10000": 0.6909696930060957
     },
+    "hapax_count": 136824,
+    "hapax_ratio": 0.5784391646233196,
+    "total_documents": 12414
   }
 }
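
The vocabulary statistics on both sides of this diff are consistent with `hapax_ratio` being the hapax count divided by the sum of `vocabulary_size` and `hapax_count`, which would mean `vocabulary_size` excludes words seen only once. That formula is inferred from the reported numbers and is not stated in the commit. The sketch below recomputes the ratio for the old and new metadata; `hapax_ratio` here is a hypothetical helper, not a function from the project.

```python
def hapax_ratio(vocabulary_size: int, hapax_count: int) -> float:
    """Share of hapax legomena among all word types (assumed definition)."""
    return hapax_count / (vocabulary_size + hapax_count)


# (vocabulary_size, hapax_count, reported hapax_ratio), copied from the diff.
OLD = (108024, 146613, 0.5757725703648724)
NEW = (99716, 136824, 0.5784391646233196)

for label, (vocab, hapax, reported) in (("old", OLD), ("new", NEW)):
    computed = hapax_ratio(vocab, hapax)
    # Print both values so any mismatch with the metadata is visible.
    print(label, computed, reported)
```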

models/word_markov/am_markov_ctx1_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:511a07287de3c7cf84b1e388ff1c84f89b6c4179a305abbcd9a9e9daa1347eda
-size 14354786

 version https://git-lfs.github.com/spec/v1
+oid sha256:61264cf69517fe843352a0b6cc8383107c7cd20cb9613edb6a901779636cd1bd
+size 13693182

models/word_markov/am_markov_ctx1_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "word",
   "language": "am",
-  "unique_contexts": 254884,
-  "total_transitions": 2461052
 }

   "context_size": 1,
   "variant": "word",
   "language": "am",
+  "unique_contexts": 236353,
+  "total_transitions": 1761302
 }

models/word_markov/am_markov_ctx2_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6a9fc5e0cbc0c30fa8a9ce98c35ab2c6fe838a6f56a9bb619aab4e816d85e97d
-size 32957872

 version https://git-lfs.github.com/spec/v1
+oid sha256:c565bae97b8cf56a6c0d532f8f790eb6482096f8f6f9ee9069aec8edcf717f1d
+size 29639293

models/word_markov/am_markov_ctx2_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "word",
   "language": "am",
-  "unique_contexts": 1243645,
-  "total_transitions": 2446875
 }

   "context_size": 2,
   "variant": "word",
   "language": "am",
+  "unique_contexts": 1130961,
+  "total_transitions": 1748889
 }

models/word_markov/am_markov_ctx3_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8612b4ff0daced324460640533c0da0a21f0036b72c30ba71fc85ca672380734
-size 44191826

 version https://git-lfs.github.com/spec/v1
+oid sha256:a56abcadfce2f027d0d47be74e244c8bad54627001acca61fa2ccbdb09097197
+size 37383956

models/word_markov/am_markov_ctx3_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "word",
   "language": "am",
-  "unique_contexts": 1794848,
-  "total_transitions": 2432701
 }

   "context_size": 3,
   "variant": "word",
   "language": "am",
+  "unique_contexts": 1446616,
+  "total_transitions": 1736476
 }

models/word_markov/am_markov_ctx4_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7d3b579b0a2858994d2ed6dc10d89e51f2e2d480167145c883ec696b10015a1a
-size 51775915

 version https://git-lfs.github.com/spec/v1
+oid sha256:ec1a431379039964d30885d1e5d82fe1a0b2e28f13cf9155ebc7fca6bdcb9ad8
+size 42858855

models/word_markov/am_markov_ctx4_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "word",
   "language": "am",
-  "unique_contexts": 2006381,
-  "total_transitions": 2418531
 }

   "context_size": 4,
   "variant": "word",
   "language": "am",
+  "unique_contexts": 1520994,
+  "total_transitions": 1724063
 }

models/word_ngram/am_2gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9b07feb909a23389d029c0062399e6e8ed00f7ccd83570e865a1da8c2eed6084
-size 832555

 version https://git-lfs.github.com/spec/v1
+oid sha256:883d58efb5ccc7881009333e3d02b3fddd7520f5bf828d29b36423f3fd1b1647
+size 543095

models/word_ngram/am_2gram_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "word",
   "language": "am",
-  "unique_ngrams": 48164,
-  "total_ngrams": 2461052
 }

   "n": 2,
   "variant": "word",
   "language": "am",
+  "unique_ngrams": 27901,
+  "total_ngrams": 1761302
 }

models/word_ngram/am_3gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8aae59e56d73699ee378c8da34b637013e6a0a71d3aed969e9c5a9774e23a880
-size 1311880

 version https://git-lfs.github.com/spec/v1
+oid sha256:99b4534a7f6dc004d58b92078a7fffcda803fea6ba859a4e7db87b4fcb8dc3f9
+size 757738

models/word_ngram/am_3gram_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "word",
   "language": "am",
-  "unique_ngrams": 68935,
-  "total_ngrams": 2446875
 }

   "n": 3,
   "variant": "word",
   "language": "am",
+  "unique_ngrams": 35714,
+  "total_ngrams": 1748889
 }

models/word_ngram/am_4gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f623c1b0bd05d77bc71c826be908d3103d05a620611c25416ea052ee25389b0b
-size 3080431