omarkamali committed on Jan 3

Commit

6af90c4

verified

1 Parent(s): bb3819a

Upload all models and assets for azb (20251001)

This view is limited to 50 files because it contains too many changes.

Files changed (50)

README.md +293 -133
models/embeddings/monolingual/azb_128d.bin +2 -2
models/embeddings/monolingual/azb_128d_metadata.json +5 -3
models/embeddings/monolingual/azb_32d.bin +2 -2
models/embeddings/monolingual/azb_32d_metadata.json +5 -3
models/embeddings/monolingual/azb_64d.bin +2 -2
models/embeddings/monolingual/azb_64d_metadata.json +5 -3
models/subword_markov/azb_markov_ctx1_subword.parquet +2 -2
models/subword_markov/azb_markov_ctx1_subword_metadata.json +2 -2
models/subword_markov/azb_markov_ctx2_subword.parquet +2 -2
models/subword_markov/azb_markov_ctx2_subword_metadata.json +2 -2
models/subword_markov/azb_markov_ctx3_subword.parquet +2 -2
models/subword_markov/azb_markov_ctx3_subword_metadata.json +2 -2
models/subword_markov/azb_markov_ctx4_subword.parquet +2 -2
models/subword_markov/azb_markov_ctx4_subword_metadata.json +2 -2
models/subword_ngram/azb_2gram_subword.parquet +2 -2
models/subword_ngram/azb_2gram_subword_metadata.json +2 -2
models/subword_ngram/azb_3gram_subword.parquet +2 -2
models/subword_ngram/azb_3gram_subword_metadata.json +2 -2
models/subword_ngram/azb_4gram_subword.parquet +2 -2
models/subword_ngram/azb_4gram_subword_metadata.json +2 -2
models/tokenizer/azb_tokenizer_16k.model +2 -2
models/tokenizer/azb_tokenizer_16k.vocab +0 -0
models/tokenizer/azb_tokenizer_32k.model +2 -2
models/tokenizer/azb_tokenizer_32k.vocab +0 -0
models/tokenizer/azb_tokenizer_64k.model +2 -2
models/tokenizer/azb_tokenizer_64k.vocab +0 -0
models/tokenizer/azb_tokenizer_8k.model +2 -2
models/tokenizer/azb_tokenizer_8k.vocab +0 -0
models/vocabulary/azb_vocabulary.parquet +2 -2
models/vocabulary/azb_vocabulary_metadata.json +10 -9
models/word_markov/azb_markov_ctx1_word.parquet +2 -2
models/word_markov/azb_markov_ctx1_word_metadata.json +2 -2
models/word_markov/azb_markov_ctx2_word.parquet +2 -2
models/word_markov/azb_markov_ctx2_word_metadata.json +2 -2
models/word_markov/azb_markov_ctx3_word.parquet +2 -2
models/word_markov/azb_markov_ctx3_word_metadata.json +2 -2
models/word_markov/azb_markov_ctx4_word.parquet +2 -2
models/word_markov/azb_markov_ctx4_word_metadata.json +2 -2
models/word_ngram/azb_2gram_word.parquet +2 -2
models/word_ngram/azb_2gram_word_metadata.json +2 -2
models/word_ngram/azb_3gram_word.parquet +2 -2
models/word_ngram/azb_3gram_word_metadata.json +2 -2
models/word_ngram/azb_4gram_word.parquet +2 -2
models/word_ngram/azb_4gram_word_metadata.json +2 -2
visualizations/embedding_isotropy.png +0 -0
visualizations/embedding_norms.png +0 -0
visualizations/embedding_similarity.png +2 -2
visualizations/markov_branching.png +0 -0
visualizations/markov_contexts.png +0 -0

README.md CHANGED

@@ -23,14 +23,14 @@ dataset_info:
 metrics:
   - name: best_compression_ratio
     type: compression
-    value: 4.198
   - name: best_isotropy
     type: isotropy
-    value: 0.8242
   - name: vocabulary_size
     type: vocab
-    value: 317640
-generated: 2025-12-27
 ---
 # AZB - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
-- N-gram models (2, 3, 4-gram)
-- Markov chains (context of 1, 2, 3 and 4)
 - Subword N-gram and Markov chains
-- Embeddings in various sizes and dimensions
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
-- [6. Summary & Recommendations](#6-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
@@ -68,51 +70,57 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
-| **8k** | 3.179x | 3.12 | 0.3165% | 385,493 |
-| **16k** | 3.558x | 3.49 | 0.3542% | 344,465 |
-| **32k** | 3.895x | 3.82 | 0.3877% | 314,660 |
-| **64k** | 4.198x 🏆 | 4.12 | 0.4179% | 291,946 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
-**Sample 1:** `ماقنتیتی ( ) روسیه اؤلکه‌سینده یئر آلان بیر کند دیر و مورمانسک اوبلاستیندا یئرلش...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁م اق ن تی تی ▁( ▁) ▁روسیه ▁اؤلکه ▁سینده ... (+19 more)` | 29 |
-| 16k | `▁ماق ن تی تی ▁( ▁) ▁روسیه ▁اؤلکه ▁سینده ▁یئر ... (+16 more)` | 26 |
-| 32k | `▁ماق نتی تی ▁( ▁) ▁روسیه ▁اؤلکه ▁سینده ▁یئر ▁آلان ... (+13 more)` | 23 |
-| 64k | `▁ماق نتی تی ▁( ▁) ▁روسیه ▁اؤلکه ▁سینده ▁یئر ▁آلان ... (+13 more)` | 23 |
-**Sample 2:** `سید ضیاء الدین طباطبائی یزدی (دوغوم:۱۲۸۶ شیراز -اؤلوم:۷ شهریور ۱۳۴۸ تهران) احمدش...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁سید ▁ض یا ء ▁الدین ▁ط با ط با ئی ... (+34 more)` | 44 |
-| 16k | `▁سید ▁ضیا ء ▁الدین ▁ط با ط با ئی ▁ی ... (+29 more)` | 39 |
-| 32k | `▁سید ▁ضیا ء ▁الدین ▁ط باط با ئی ▁یز دی ... (+26 more)` | 36 |
-| 64k | `▁سید ▁ضیاء ▁الدین ▁طباطبا ئی ▁یزدی ▁( دوغوم : ۱۲ ... (+21 more)` | 31 |
-**Sample 3:** `ویکسن (اینگیلیسجه: Vixen, Louisiana) آمریکانین لوئیزیانا ایالتینده یئرلشن بیر یا...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁ویک سن ▁( اینگیلیسجه : ▁v ix en , ▁louisiana ... (+15 more)` | 25 |
-| 16k | `▁ویک سن ▁( اینگیلیسجه : ▁v ix en , ▁louisiana ... (+15 more)` | 25 |
-| 32k | `▁ویک سن ▁( اینگیلیسجه : ▁v ix en , ▁louisiana ... (+15 more)` | 25 |
-| 64k | `▁ویک سن ▁( اینگیلیسجه : ▁vixen , ▁louisiana ) ▁آمریکانین ... (+13 more)` | 23 |
 ### Key Findings
-- **Best Compression:** 64k achieves 4.198x compression
-- **Lowest UNK Rate:** 8k with 0.3165% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
@@ -121,57 +129,89 @@ Below are sample sentences tokenized with each vocabulary size:
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
-| N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
-|--------|------------|---------|----------------|------------------|-------------------|
-| **2-gram** | 7,645 🏆 | 12.90 | 268,529 | 30.2% | 57.7% |
-| **2-gram** | 655 🏆 | 9.35 | 14,131 | 46.5% | 93.7% |
-| **3-gram** | 13,711 | 13.74 | 512,828 | 24.9% | 52.4% |
-| **3-gram** | 4,808 | 12.23 | 141,425 | 21.1% | 58.1% |
-| **4-gram** | 23,522 | 14.52 | 979,944 | 21.9% | 47.2% |
-| **4-gram** | 19,536 | 14.25 | 804,296 | 13.9% | 41.7% |
 ### Top 5 N-grams by Size
-**2-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `بؤلمه :` | 391,479 |
-| 2 | `او ْ` | 139,043 |
-| 3 | `‌ لار` | 137,076 |
-| 4 | `قایناقلار بؤلمه` | 125,059 |
-| 5 | `‌ نین` | 118,471 |
-**3-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `قایناقلار بؤلمه :` | 125,058 |
-| 2 | `( اینگیلیسجه :` | 112,489 |
-| 3 | `. قایناقلار بؤلمه` | 97,901 |
-| 4 | `قایناق ‌ لار` | 92,555 |
-| 5 | `ایشلدنلری طرفیندن یارانمیش` | 76,848 |
-**4-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `. قایناقلار بؤلمه :` | 97,901 |
-| 2 | `ایشلدنلری طرفیندن یارانمیش «` | 76,839 |
-| 3 | `ْ خلانیلیبدیر ) .` | 76,827 |
-| 4 | `» ، مقاله ‌` | 76,824 |
-| 5 | `، مقاله ‌ سیندن` | 76,824 |
 ### Key Findings
-- **Best Perplexity:** 2-gram with 655
 - **Entropy Trend:** Increases with larger n-grams in the results above (perplexity rises accordingly)
-- **Coverage:** Top-1000 patterns cover ~42% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 ---
@@ -179,55 +219,86 @@ Below are sample sentences tokenized with each vocabulary size:
 ![Markov Entropy](visualizations/markov_entropy.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
-| Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
-|---------|-------------|------------|------------------|-----------------|----------------|
-| **1** | 0.4591 | 1.375 | 4.28 | 1,005,885 | 54.1% |
-| **1** | 1.2059 | 2.307 | 10.74 | 2,934 | 0.0% |
-| **2** | 0.2308 | 1.173 | 1.71 | 4,302,287 | 76.9% |
-| **2** | 1.1061 | 2.153 | 8.32 | 31,498 | 0.0% |
-| **3** | 0.1101 | 1.079 | 1.28 | 7,369,809 | 89.0% |
-| **3** | 0.9393 | 1.918 | 5.09 | 262,151 | 6.1% |
-| **4** | 0.0637 🏆 | 1.045 | 1.14 | 9,387,722 | 93.6% |
-| **4** | 0.7222 🏆 | 1.650 | 3.21 | 1,333,406 | 27.8% |
-### Generated Text Samples
-Below are text samples generated from each Markov chain model:
 **Context Size 1:**
-1. `‌ ایللیین اووللرین ده یاشایان اینسانلار بؤلمه : bhumibol adulyadej crown , footballer , alchimie du`
-2. `. قان معائناتی و زنگین ‌ نین ایشلدنلری طرفیندن یارانمیش « panama 2 episodes ایلآدی رو`
-3. `: sergey ryzhikov ( ۱۹۴۸ ایلینده گونئی کارولینا ( تحصیلات ) رومنی ( لهیستانجا : 2`
 **Context Size 2:**
-1. `بؤلمه : وورت بؤلگه ‌ سیاراک بؤلگه ‌ سی ‌ دیر گؤرونتولر قایناق ‌ لار اینگیلیسجه ویکی`
-2. `او ْ یونچو . ۲۶ ژوئن ۱۹۵۷ میلادی تاریخیندا وفات ائدیب . قایناق ‌ لار اینگیلیسجه ویکی`
-3. `‌ لار بؤلمه : آمریکا شهرلری بؤلمه : گرینزبورو دوغوملولار بؤلمه : ۲۰۱۹ - جو ایلینده افغانیستاندا`
 **Context Size 3:**
-1. `قایناقلار بؤلمه : هیندوستان کندلری en : kahra`
-2. `( اینگیلیسجه : louis prima ) آمریکالی موغنی ، یازیچی و او ْ ندان ایستیفاده ائتمیشدیر . روبنسین`
-3. `. قایناقلار بؤلمه : هیندوستان کندلری en : kalawad`
 **Context Size 4:**
-1. `. قایناقلار بؤلمه : هیندوستان کندلری en : balowal`
-2. `ایشلدنلری طرفیندن یارانمیش « crabtree ' s catalyst » ، مقاله ‌ سیندن گؤتورولوبدور . ( ۷ سپتامبر ۲۰۱۷`
-3. `، مقاله ‌ سیندن گؤتورولوبدور . ( ۱۲ ژوئن ۲۰۱۸ تاریخینده یو ْ خلانیلیبدیر ) . بؤلمه : ایستادیوملار`
 ### Key Findings
-- **Best Predictability:** Context-4 with 93.6% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
-- **Memory Trade-off:** Larger contexts require more storage (1,333,406 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
@@ -243,64 +314,64 @@ Below are text samples generated from each Markov chain model:
 | Metric | Value |
 |--------|-------|
-| Vocabulary Size | 317,640 |
-| Total Tokens | 16,697,106 |
-| Mean Frequency | 52.57 |
 | Median Frequency | 3 |
-| Frequency Std Dev | 1499.80 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | بؤلمه | 391,840 |
-| 2 | و | 291,285 |
-| 3 | اینگیلیسجه | 188,551 |
-| 4 | بیر | 169,796 |
-| 5 | او | 145,694 |
-| 6 | قایناقلار | 143,368 |
-| 7 | سی | 139,966 |
-| 8 | لار | 138,388 |
-| 9 | دیر | 132,038 |
-| 10 | the | 130,277 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | سئندن | 2 |
-| 2 | ائششکین | 2 |
-| 3 | onager | 2 |
-| 4 | داشناکلارلا | 2 |
-| 5 | قۇلان | 2 |
-| 6 | آسینۇس | 2 |
-| 7 | هئمیو | 2 |
-| 8 | تاپؽلمیشدیر | 2 |
-| 9 | kulan | 2 |
-| 10 | کسا | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
-| Zipf Coefficient | 1.1842 |
-| R² (Goodness of Fit) | 0.995333 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
-| Top 100 | 36.8% |
-| Top 1,000 | 66.7% |
-| Top 5,000 | 81.2% |
-| Top 10,000 | 85.9% |
 ### Key Findings
-- **Zipf Compliance:** R²=0.9953 indicates excellent adherence to Zipf's law
-- **High Frequency Dominance:** Top 100 words cover 36.8% of corpus
-- **Long Tail:** 307,640 words needed for remaining 14.1% coverage
 ---
 ## 5. Word Embeddings Evaluation
@@ -313,24 +384,110 @@ Below are text samples generated from each Markov chain model:
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
-### Model Comparison
-| Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
-|-------|------------|-----------|----------|----------|----------|
-| **mono_32d** | 134,712 | 32 | 5.098 | 1.243 | 0.8242 🏆 |
-| **mono_64d** | 134,712 | 64 | 5.525 | 1.165 | 0.7928 |
-| **mono_128d** | 134,712 | 128 | 6.056 | 1.052 | 0.7520 |
-| **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
 ### Key Findings
-- **Best Isotropy:** mono_32d with 0.8242 (more uniform distribution)
-- **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
-- **Vocabulary Coverage:** All models cover 134,712 words
-- **Recommendation:** 100d for balanced semantic capture and efficiency
 ---
-## 6. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
@@ -338,11 +495,12 @@ Below are text samples generated from each Markov chain model:
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
-| Tokenizer | **32k BPE** | Best compression (4.20x) with low UNK rate |
-| N-gram | **5-gram** | Lowest perplexity (655) |
-| Markov | **Context-4** | Highest predictability (93.6%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
@@ -532,7 +690,8 @@ If you use these models in your research, please cite:
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
-  publisher = {HuggingFace},
   url = {https://huggingface.co/wikilangs},
   institution = {Omneity Labs}
 }
@@ -548,7 +707,8 @@ MIT License - Free for academic and commercial use.
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 ---
 *Generated by Wikilangs Models Pipeline*
-*Report Date: 2025-12-27 22:57:16*

 metrics:
   - name: best_compression_ratio
     type: compression
+    value: 4.148
   - name: best_isotropy
     type: isotropy
+    value: 0.8282
   - name: vocabulary_size
     type: vocab
+    value: 0
+generated: 2026-01-03
 ---
 # AZB - Wikilangs Models
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
+- N-gram models (2, 3, 4, 5-gram)
+- Markov chains (context of 1, 2, 3, 4 and 5)
 - Subword N-gram and Markov chains
+- Embeddings in various sizes and dimensions (aligned and unaligned)
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
+- [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+- [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
+![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+![Tokenizer OOV](visualizations/tokenizer_oov.png)
+![Total Tokens](visualizations/tokenizer_total_tokens.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
+| **8k** | 3.135x | 3.14 | 0.5011% | 364,218 |
+| **16k** | 3.510x | 3.51 | 0.5610% | 325,334 |
+| **32k** | 3.852x | 3.86 | 0.6157% | 296,427 |
+| **64k** | 4.148x 🏆 | 4.15 | 0.6629% | 275,291 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
+**Sample 1:** `قالیکتیس (، ، ، ) ییٛرتیجیلار دسته‌سینه عایید حئیوان نؤعو. قایناقلار سیراسینا گؤ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁قا لیک تیس ▁(، ▁، ▁، ▁) ▁ییٛرتیجیلار ▁دسته ▁سینه ... (+8 more)` | 18 |
+| 16k | `▁قا لیک تیس ▁(، ▁، ▁، ▁) ▁ییٛرتیجیلار ▁دسته ▁سینه ... (+8 more)` | 18 |
+| 32k | `▁قا لیک تیس ▁(، ▁، ▁، ▁) ▁ییٛرتیجیلار ▁دسته ▁سینه ... (+8 more)` | 18 |
+| 64k | `▁قا لیک تیس ▁(، ▁، ▁، ▁) ▁ییٛرتیجیلار ▁دسته ▁سینه ... (+8 more)` | 18 |
+**Sample 2:** `هیندوستان اؤلکه‌سینین کرالا ایالتینده بیر شهر دیر. بۇ شهرده مالایالم دیلی و اینگ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁هیندوستان ▁اؤلکه ▁سینین ▁کرالا ▁ایالتینده ▁بیر ▁شهر ▁دیر . ▁بۇ ... (+10 more)` | 20 |
+| 16k | `▁هیندوستان ▁اؤلکه ▁سینین ▁کرالا ▁ایالتینده ▁بیر ▁شهر ▁دیر . ▁بۇ ... (+10 more)` | 20 |
+| 32k | `▁هیندوستان ▁اؤلکه ▁سینین ▁کرالا ▁ایالتینده ▁بیر ▁شهر ▁دیر . ▁بۇ ... (+10 more)` | 20 |
+| 64k | `▁هیندوستان ▁اؤلکه ▁سینین ▁کرالا ▁ایالتینده ▁بیر ▁شهر ▁دیر . ▁بۇ ... (+10 more)` | 20 |
+**Sample 3:** `آرقا, کارناتاکا Karnataka) هیندوستان اؤلکه‌سینین کارناتاکا ایالتینده بیر کند دیر...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁آر قا , ▁کارناتاکا ▁kar n at aka ) ▁هیندوستان ... (+16 more)` | 26 |
+| 16k | `▁آر قا , ▁کارناتاکا ▁kar nat aka ) ▁هیندوستان ▁اؤلکه ... (+15 more)` | 25 |
+| 32k | `▁آر قا , ▁کارناتاکا ▁karnataka ) ▁هیندوستان ▁اؤلکه ▁سینین ▁کارناتاکا ... (+13 more)` | 23 |
+| 64k | `▁آر قا , ▁کارناتاکا ▁karnataka ) ▁هیندوستان ▁اؤلکه ▁سینین ▁کارناتاکا ... (+13 more)` | 23 |
 ### Key Findings
+- **Best Compression:** 64k achieves 4.148x compression
+- **Lowest UNK Rate:** 8k with 0.5011% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
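The compression and total-token columns in the updated tokenizer table can be cross-checked against each other. The sketch below assumes that compression is measured as text length divided by token count on one shared evaluation text (the README does not say whether length is counted in characters or bytes), in which case the product of the two columns should be roughly constant across vocabulary sizes.

```python
# Rows copied from the updated tokenizer results table.
# Assumption: compression = text_length / total_tokens on a shared evaluation text.
rows = {
    "8k": (3.135, 364_218),
    "16k": (3.510, 325_334),
    "32k": (3.852, 296_427),
    "64k": (4.148, 275_291),
}

def implied_lengths(table):
    """Return the evaluation-text length implied by each row."""
    return {name: ratio * tokens for name, (ratio, tokens) in table.items()}

lengths = implied_lengths(rows)
spread = (max(lengths.values()) - min(lengths.values())) / min(lengths.values())

for name, length in lengths.items():
    print(f"{name}: ~{length:,.0f} units of text")
print(f"relative spread: {spread:.5f}")
```

A small spread suggests the four tokenizers were evaluated on the same text, so their compression ratios are directly comparable.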
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
+![N-gram Unique](visualizations/ngram_unique.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
+| N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+|--------|---------|------------|---------|----------------|------------------|-------------------|
+| **2-gram** | Word | 8,048 | 12.97 | 158,908 | 25.7% | 56.1% |
+| **2-gram** | Subword | 528 🏆 | 9.04 | 12,648 | 51.6% | 95.7% |
+| **3-gram** | Word | 10,249 | 13.32 | 236,749 | 22.6% | 53.6% |
+| **3-gram** | Subword | 3,765 | 11.88 | 106,644 | 23.1% | 62.4% |
+| **4-gram** | Word | 17,175 | 14.07 | 426,395 | 19.0% | 47.9% |
+| **4-gram** | Subword | 15,100 | 13.88 | 581,225 | 14.6% | 44.8% |
 ### Top 5 N-grams by Size
+**2-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `ایشلدنلری طرفیندن` | 75,584 |
+| 2 | `مقاله‌سیندن گؤتورولوبدور` | 75,503 |
+| 3 | `ویکی‌پدیاسی‌نین ایشلدنلری` | 73,734 |
+| 4 | `اینگیلیسجه ویکی‌پدیاسی‌نین` | 71,132 |
+| 5 | `قایناق‌لار اینگیلیسجه` | 70,880 |
+**3-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `ویکی‌پدیاسی‌نین ایشلدنلری طرفیندن` | 73,734 |
+| 2 | `اینگیلیسجه ویکی‌پدیاسی‌نین ایشلدنلری` | 71,132 |
+| 3 | `قایناق‌لار اینگیلیسجه ویکی‌پدیاسی‌نین` | 70,806 |
+| 4 | `بیر یاشاییش منطقه‌سی‌دیر` | 40,399 |
+| 5 | `بیر کند دیر` | 30,448 |
+**4-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `اینگیلیسجه ویکی‌پدیاسی‌نین ایشلدنلری طرفیندن` | 71,132 |
+| 2 | `قایناق‌لار اینگیلیسجه ویکی‌پدیاسی‌نین ایشلدنلری` | 70,806 |
+| 3 | `سوْن نۆفوس ساییمی اساسيندا` | 24,568 |
+| 4 | `شهرلرین لیستی قایناق‌لار اینگیلیسجه` | 22,937 |
+| 5 | `لیستی قایناق‌لار اینگیلیسجه ویکی‌پدیاسی‌نین` | 22,937 |
+**2-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `ی ن` | 1,868,991 |
+| 2 | `_ ا` | 1,658,104 |
+| 3 | `ی _` | 1,437,263 |
+| 4 | `ا ی` | 1,393,221 |
+| 5 | `ن _` | 1,215,806 |
+**3-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `_ ا ی` | 717,380 |
+| 2 | `ی ن د` | 658,977 |
+| 3 | `د ه _` | 585,522 |
+| 4 | `ل ا ر` | 580,226 |
+| 5 | `ا ی ن` | 470,621 |
+**4-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `ن د ه _` | 347,347 |
+| 2 | `ل ا ر _` | 329,379 |
+| 3 | `ی ن د ه` | 320,994 |
+| 4 | `_ ب ی ر` | 258,707 |
+| 5 | `ن ی ن _` | 257,628 |
 ### Key Findings
+- **Best Perplexity:** 2-gram (subword) with 528
 - **Entropy Trend:** Increases with larger n-grams in the results above (perplexity rises accordingly)
+- **Coverage:** Top-1000 patterns cover ~45% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
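The perplexity and entropy columns in the updated n-gram table appear to be linked by perplexity = 2^entropy, assuming entropy is reported in bits (an assumption; the diff does not state the unit). A minimal sketch that checks this for each row, allowing for the two-decimal rounding of the entropy values:

```python
# (n, variant) -> (perplexity, entropy), copied from the updated results table.
ngram_rows = {
    (2, "word"): (8_048, 12.97),
    (2, "subword"): (528, 9.04),
    (3, "word"): (10_249, 13.32),
    (3, "subword"): (3_765, 11.88),
    (4, "word"): (17_175, 14.07),
    (4, "subword"): (15_100, 13.88),
}

def relative_gap(perplexity, entropy_bits):
    """Relative difference between reported perplexity and 2 ** entropy."""
    return abs(2 ** entropy_bits - perplexity) / perplexity

gaps = {key: relative_gap(*values) for key, values in ngram_rows.items()}
for (n, variant), gap in gaps.items():
    print(f"{n}-gram {variant}: gap {gap:.2%}")
```

Rounding entropy to two decimals alone can shift 2^entropy by a few tenths of a percent, so gaps of that size are consistent with the assumed relation.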
 ---
 ![Markov Entropy](visualizations/markov_entropy.png)
+![Markov Contexts](visualizations/markov_contexts.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
+| Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+|---------|---------|-------------|------------|------------------|-----------------|----------------|
+| **1** | Word | 0.6640 | 1.584 | 5.09 | 726,930 | 33.6% |
+| **1** | Subword | 1.0592 | 2.084 | 9.07 | 3,409 | 0.0% |
+| **2** | Word | 0.1970 | 1.146 | 1.48 | 3,693,091 | 80.3% |
+| **2** | Subword | 0.9308 | 1.906 | 6.57 | 30,905 | 6.9% |
+| **3** | Word | 0.0689 | 1.049 | 1.14 | 5,447,170 | 93.1% |
+| **3** | Subword | 0.8408 | 1.791 | 4.70 | 203,056 | 15.9% |
+| **4** | Word | 0.0340 🏆 | 1.024 | 1.07 | 6,178,524 | 96.6% |
+| **4** | Subword | 0.7000 | 1.625 | 3.22 | 953,883 | 30.0% |
+### Generated Text Samples (Word-based)
+Below are text samples generated from each word-based Markov chain model:
+**Context Size 1:**
+1. `و شمشددیل سلطانلیغی ایله برابر ایدی قایناقلار شهرلری آمریکا بیرلشمیش ایالتلرینین ایداری بؤلوملری ائش...`
+2. `بیر کند دیر و یا هچ اینگیلیسجه ویکی‌پدیاسی‌نین ایشلدنلری طرفیندن mountain a few years until 5`
+3. `اینگیلیسجه ویکی‌پدیاسی‌نین ایشلدنلری طرفیندن mountain wilbert minnesota مقاله‌سیندن گؤتورولوبدور ۸ آ...`
+**Context Size 2:**
+1. `ایشلدنلری طرفیندن مقاله‌سیندن گؤتورولوبدور ۸ آقوست تاریخینده یوْخلانیلیبدیر شهرلری en güləh`
+2. `مقاله‌سیندن گؤتورولوبدور ۳۰ نوْوامبر تاریخینده یوْخلانیلیبدیر کولوبلاری en araks ararat fc مقاله‌سین...`
+3. `ویکی‌پدیاسی‌نین ایشلدنلری طرفیندن indiana مقاله‌سیندن گؤتورولوبدور ۲۱ دسامبر تاریخینده یوْخلانیلیبدی...`
+**Context Size 3:**
+1. `ویکی‌پدیاسی‌نین ایشلدنلری طرفیندن مقاله‌سیندن گؤتورولوبدور ۳۰ نوْوامبر تاریخینده یوْخلانیلیبدیر گؤرو...`
+2. `اینگیلیسجه ویکی‌پدیاسی‌نین ایشلدنلری طرفیندن rutherfurd مقاله‌سیندن گؤتورولوبدور ۲۲ ژانویه تاریخینده...`
+3. `قایناق‌لار اینگیلیسجه ویکی‌پدیاسی‌نین ایشلدنلری طرفیندن مقاله‌سیندن گؤتورولوبدور ۱۹ جولای یوْخلانیلی...`
+**Context Size 4:**
+1. `اینگیلیسجه ویکی‌پدیاسی‌نین ایشلدنلری طرفیندن مقاله‌سیندن گؤتورولوبدور ۸ آقوست تاریخینده یوْخلانیلیبد...`
+2. `قایناق‌لار اینگیلیسجه ویکی‌پدیاسی‌نین ایشلدنلری طرفیندن georgia مقاله‌سیندن گؤتورولوبدور ۸ آقوست تار...`
+3. `سوْن نۆفوس ساییمی اساسيندا ۱۳۷ نفر ایمیش و ویسوچینا اوستانیندا یئرلشیب بۆتون چک‌دا اوْلدوغو کیمی بۇ ...`
+### Generated Text Samples (Subword-based)
+Below are text samples generated from each subword-based Markov chain model:
 **Context Size 1:**
+1. `_متیشاورله‌ستبشلی`
+2. `یتا_d_by_jottiep`
+3. `الاق‌لر)_ژادالين_`
 **Context Size 2:**
+1. `ین_حیازیل_دیکان_ا`
+2. `_ایلدا_۹_جی_giran`
+3. `ی_حافع_سان_آماری_`
 **Context Size 3:**
+1. `_ایدان_عوضوو._۲۷۳_`
+2. `ینده_روس_سال_منطقه‌`
+3. `ده_یوْخلانی_آما_خوب`
 **Context Size 4:**
+1. `نده_یوْخلانیلیبدیر).`
+2. `لار_یولو_۳۰_دسامبر_`
+3. `ینده_هر_گونئی_کاروا`
 ### Key Findings
+- **Best Predictability:** Context-4 (word) with 96.6% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
+- **Memory Trade-off:** Larger contexts require more storage (953,883 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
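To make the context-size trade-off concrete, here is a minimal word-level Markov chain sketch. It is not the Wikilangs pipeline: the function names, the toy corpus, and the definition of branching factor used here (average number of distinct continuations per context) are assumptions chosen for illustration.

```python
import random
from collections import Counter, defaultdict

def train_markov(tokens, context_size):
    """Count which token follows each context of `context_size` tokens."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        model[context][tokens[i + context_size]] += 1
    return model

def branching_factor(model):
    """Average number of distinct continuations per context (assumed definition)."""
    return sum(len(nexts) for nexts in model.values()) / len(model)

def generate(model, seed, length, rng):
    """Extend `seed` by sampling continuations in proportion to their counts."""
    context_size = len(seed)
    out = list(seed)
    for _ in range(length):
        nexts = model.get(tuple(out[-context_size:]))
        if not nexts:  # unseen context: stop early
            break
        words, counts = zip(*nexts.items())
        out.append(rng.choices(words, weights=counts, k=1)[0])
    return out

# Toy corpus (hypothetical, not taken from the AZB data).
corpus = "a b a c a b a d a b a c".split()
model_1 = train_markov(corpus, 1)
model_2 = train_markov(corpus, 2)
print(branching_factor(model_1), branching_factor(model_2))
print(" ".join(generate(model_2, ("a", "b"), 8, random.Random(0))))
```

Even on this toy corpus the longer context has more unique contexts and a lower branching factor, which mirrors the direction of the trend in the results table.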
 ---
 | Metric | Value |
 |--------|-------|
+| Vocabulary Size | 271,198 |
+| Total Tokens | 12,478,531 |
+| Mean Frequency | 46.01 |
 | Median Frequency | 3 |
+| Frequency Std Dev | 1145.11 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | و | 284,031 |
+| 2 | بیر | 169,280 |
+| 3 | اینگیلیسجه | 149,737 |
+| 4 | قایناقلار | 141,945 |
+| 5 | the | 114,439 |
+| 6 | تاریخینده | 92,079 |
+| 7 | قایناق‌لار | 90,963 |
+| 8 | ایلده | 83,679 |
+| 9 | شهرلری | 81,894 |
+| 10 | طرفیندن | 80,132 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | کساسیاسی | 2 |
+| 2 | کالابری�� | 2 |
+| 3 | کونتینوا | 2 |
+| 4 | تحقیق‌لری | 2 |
+| 5 | romanzo | 2 |
+| 6 | strage | 2 |
+| 7 | سۆره‌جینده | 2 |
+| 8 | ایلکه‌لر | 2 |
+| 9 | لائیکلیک | 2 |
+| 10 | شاسکوه | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
+| Zipf Coefficient | 1.1609 |
+| R² (Goodness of Fit) | 0.995521 |
 | Adherence Quality | **excellent** |
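A Zipf coefficient and R² like the ones above can be obtained from a straight-line fit of log frequency against log rank. The sketch below is one common way to do this with ordinary least squares; the pipeline's exact fitting procedure is not shown in this diff, so treat `fit_zipf` and the synthetic data as illustrative assumptions.

```python
import math

def fit_zipf(frequencies):
    """Least-squares fit of log(freq) = intercept - s * log(rank).

    Returns (s, r_squared). `frequencies` must be positive counts.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, r_squared

# Synthetic check (hypothetical data): an exact power law with exponent 1.16.
synthetic = [1_000_000 / rank ** 1.16 for rank in range(1, 2001)]
coefficient, r2 = fit_zipf(synthetic)
print(f"Zipf coefficient ~ {coefficient:.4f}, R^2 ~ {r2:.6f}")
```

On real word counts the fit is never exact, which is why the R² value is reported alongside the coefficient.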
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
+| Top 100 | 34.4% |
+| Top 1,000 | 64.8% |
+| Top 5,000 | 79.6% |
+| Top 10,000 | 84.6% |
 ### Key Findings
+- **Zipf Compliance:** R²=0.9955 indicates excellent adherence to Zipf's law
+- **High Frequency Dominance:** Top 100 words cover 34.4% of corpus
+- **Long Tail:** 261,198 words needed for remaining 15.4% coverage
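The derived figures in the updated vocabulary section follow from the table values by simple arithmetic. A short sketch, using only numbers copied from the tables above:

```python
# Figures copied from the updated vocabulary tables.
vocabulary_size = 271_198
total_tokens = 12_478_531
top_10k_coverage_pct = 84.6

mean_frequency = total_tokens / vocabulary_size        # tokens per vocabulary entry
long_tail_words = vocabulary_size - 10_000             # entries outside the top 10,000
long_tail_coverage_pct = 100 - top_10k_coverage_pct    # share of the corpus they cover

print(f"mean frequency: {mean_frequency:.2f}")
print(f"long tail: {long_tail_words:,} words for {long_tail_coverage_pct:.1f}% of tokens")
```

The gap between the mean frequency and the median of 3 reflects how strongly a few very frequent words dominate the counts.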
 ---
 ## 5. Word Embeddings Evaluation
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
+### 5.1 Cross-Lingual Alignment
+> *Note: Multilingual alignment visualization not available for this language.*
+### 5.2 Model Comparison
+| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+|-------|-----------|----------|------------------|---------------|----------------|
+| **mono_32d** | 32 | 0.8282 🏆 | 0.3614 | N/A | N/A |
+| **mono_64d** | 64 | 0.7952 | 0.3099 | N/A | N/A |
+| **mono_128d** | 128 | 0.7570 | 0.2493 | N/A | N/A |
 ### Key Findings
+- **Best Isotropy:** mono_32d with 0.8282 (more uniform distribution)
+- **Semantic Density:** Average pairwise similarity of 0.3069. Lower values indicate better semantic separation.
+- **Alignment Quality:** No aligned models evaluated in this run.
+- **Recommendation:** 128d aligned for best cross-lingual performance
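The 0.3069 figure quoted for semantic density matches the mean of the three per-model values in the comparison table. The sketch below checks that, and also shows one plausible reading of "average pairwise similarity" as mean cosine similarity over vector pairs. That reading, the helper names, and the toy vectors are assumptions, not the pipeline's documented method.

```python
import math
from itertools import combinations

# Semantic Density column from the model comparison table.
semantic_density = {"mono_32d": 0.3614, "mono_64d": 0.3099, "mono_128d": 0.2493}
mean_density = sum(semantic_density.values()) / len(semantic_density)
print(f"mean semantic density: {mean_density:.4f}")

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_pairwise_similarity(vectors):
    """Mean cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy vectors (hypothetical): mutually orthogonal, so every pair has similarity 0.
toy = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
print(average_pairwise_similarity(toy))
```

Under this reading, a lower average means the sampled vectors point in more varied directions.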
+---
+## 6.  Morphological Analysis (Experimental)
+> ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+### 6.1 Productivity & Complexity
+| Metric | Value | Interpretation | Recommendation |
+|--------|-------|----------------|----------------|
+| Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+| Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+### 6.2 Affix Inventory (Productive Units)
+These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
+#### Productive Prefixes
+| Prefix | Examples |
+|--------|----------|
+#### Productive Suffixes
+| Suffix | Examples |
+|--------|----------|
+| `-ین` | ائتدیگی‌نین, دالینین, لشکرینین |
+| `-ان` | سيران, کاپیتان, تاپیلمایان |
+### 6.3 Bound Stems (Lexical Roots)
+Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
+| Stem | Cohesion | Substitutability | Examples |
+|------|----------|------------------|----------|
+| `رلری` | 1.93x | 205 contexts | ارلری, یرلری, دیرلری |
+| `اقلا` | 1.95x | 131 contexts | ناقلا, آیاقلا, آياقلا |
+| `قلار` | 2.11x | 54 contexts | لیقلار, حقلاری, ماقلار |
+| `تیند` | 1.93x | 72 contexts | تیندل, تینده, اتیندن |
+| `یبدی` | 2.32x | 31 contexts | آلیبدی, ییبدیر, گلیبدی |
+| `اریخ` | 1.93x | 41 contexts | تاریخ, تاریخه, ‌تاریخ |
+| `ولوب` | 1.70x | 60 contexts | کولوب, گولوب, سولوب |
+| `ئرلش` | 2.00x | 24 contexts | يئرلشن, یئرلشن, یئرلشه |
+| `ریخی` | 2.03x | 22 contexts | مریخی, ریخین, تاریخی |
+| `یناق` | 1.87x | 27 contexts | سیناق, قیناق, ایناق |
+| `هرلر` | 2.14x | 17 contexts | شهرلر, شهرلره, شهرلري |
+| `یلیس` | 1.56x | 43 contexts | هیلیس, یلیسی, تیلیس |
+### 6.4 Affix Compatibility (Co-occurrence)
+This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
+*No significant affix co-occurrences detected.*
+### 6.5 Recursive Morpheme Segmentation
+Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
+| Word | Suggested Split | Confidence | Stem |
+|------|-----------------|------------|------|
+| وطنداشلارینین | **`وطنداشلار-ین-ین`** | 6.0 | `وطنداشلار` |
+| تورپاق‌لارینین | **`تورپاق‌لار-ین-ین`** | 6.0 | `تورپاق‌لار` |
+| دئموکرات‌لارینین | **`دئموکرات‌لار-ین-ین`** | 6.0 | `دئموکرات‌لار` |
+| اوستانلارینان | **`اوستانلار-ین-ان`** | 6.0 | `اوستانلار` |
+| تولیدینین | **`تولید-ین-ین`** | 6.0 | `تولید` |
+| المنتلرینین | **`المنتلر-ین-ین`** | 6.0 | `المنتلر` |
+| جومهوریتلرین | **`جومهوریتلر-ین`** | 4.5 | `جومهوریتلر` |
+| سیمالارین | **`سیمالار-ین`** | 4.5 | `سیمالار` |
+| ژوآن‌لارین | **`ژوآن‌لار-ین`** | 4.5 | `ژوآن‌لار` |
+| ماشین‌لارین | **`ماشین‌لار-ین`** | 4.5 | `ماشین‌لار` |
+| بدوی‌لرین | **`بدوی‌لر-ین`** | 4.5 | `بدوی‌لر` |
+| تبریزلیلرین | **`تبریزلیلر-ین`** | 4.5 | `تبریزلیلر` |
+| ناخوشلوقلارین | **`ناخوشلوقلار-ین`** | 4.5 | `ناخوشلوقلار` |
+| اویکونیمین | **`اویکونیم-ین`** | 4.5 | `اویکونیم` |
+| ائرککلرین | **`ائرککلر-ین`** | 4.5 | `ائرککلر` |
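The splits in the table can be approximated by peeling known suffixes off the end of a word one layer at a time. The sketch below is a deliberate simplification and not the "Recursive Hierarchical Substitutability" method itself: it uses only the two suffixes from the Productive Suffixes table, and the `min_stem_len` guard is a hypothetical parameter added to avoid over-stripping.

```python
# Suffixes from the "Productive Suffixes" table above.
SUFFIXES = ("ین", "ان")

def segment(word, suffixes=SUFFIXES, min_stem_len=3):
    """Peel known suffixes off the end of `word`, one layer at a time.

    Returns (stem, suffix_list) with suffixes in surface order.
    `min_stem_len` is a hypothetical guard against over-stripping.
    """
    for suffix in suffixes:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            inner_stem, inner_suffixes = segment(stem, suffixes, min_stem_len)
            return inner_stem, inner_suffixes + [suffix]
    return word, []

for word in ("وطنداشلارینین", "اوستانلارینان", "تولیدینین"):
    stem, found = segment(word)
    print("-".join([stem, *found]))
```

Unlike the analysis in the table, this sketch has no confidence score and does not check that the remaining stem is attested elsewhere in the vocabulary.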
+### 6.6 Linguistic Interpretation
+> **Automated Insight:**
+The language AZB appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
 ---
+## 7. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
+| Tokenizer | **64k BPE** | Best compression (4.15x) |
+| N-gram | **2-gram** | Lowest perplexity (528) |
+| Markov | **Context-4** | Highest predictability (96.6%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
+  doi = {10.5281/zenodo.18073153},
+  publisher = {Zenodo},
   url = {https://huggingface.co/wikilangs},
   institution = {Omneity Labs}
 }
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+- 🤝 Sponsor: [Featherless AI](https://featherless.ai)
 ---
 *Generated by Wikilangs Models Pipeline*
+*Report Date: 2026-01-03 06:14:53*

models/embeddings/monolingual/azb_128d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e36e34c4aa5231270647ed20effcf7427bf66e34bf81f236183cc4d2910eef2d
-size 1164785458

 version https://git-lfs.github.com/spec/v1
+oid sha256:0affe475b2c95d7dbb76d561e1c19235d6d09956c11cb77486d2ed42c5bc91f8
+size 1145287401
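
The `.bin` diffs only show Git LFS pointer files, not the binaries themselves. Each pointer is a short text file of `key value` lines. A minimal sketch of reading one, where `parse_lfs_pointer` is a hypothetical helper and the pointer text is copied from the new side of the diff above:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }

# New pointer for azb_128d.bin, copied from the diff above.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0affe475b2c95d7dbb76d561e1c19235d6d09956c11cb77486d2ed42c5bc91f8
size 1145287401
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algorithm"], f"{info['size_bytes'] / 2**30:.2f} GiB")
```

Comparing `size_bytes` between the old and new pointers gives the change in the stored file size without downloading either binary.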

models/embeddings/monolingual/azb_128d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 128,
   "version": "monolingual",
   "training_params": {
-    "dim": 128,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 134712
 }

   "dimension": 128,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 128
   },
+  "vocab_size": 116118
 }

models/embeddings/monolingual/azb_32d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:24dbd964d896595164ff03aa344de312318ab9024fadf1dda4173202ae37885b
-size 293326642

 version https://git-lfs.github.com/spec/v1
+oid sha256:dd0d8e4ac83c586e8f75145c0ce099bef44ff8d39e955040bc9454b60e32022c
+size 288108777

models/embeddings/monolingual/azb_32d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 32,
   "version": "monolingual",
   "training_params": {
-    "dim": 32,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 134712
 }

   "dimension": 32,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 32
   },
+  "vocab_size": 116118
 }

models/embeddings/monolingual/azb_64d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6193385dacda8dfbeee9fb60f5514bc20a74571e0a03c1d6372443118dd7271a
-size 583812914

 version https://git-lfs.github.com/spec/v1
+oid sha256:2aae88ec4bdf46e7654ecacada31d2b3be0690719379b657e44df7793b45fa65
+size 573834985

models/embeddings/monolingual/azb_64d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 64,
   "version": "monolingual",
   "training_params": {
-    "dim": 64,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 134712
 }

   "dimension": 64,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 64
   },
+  "vocab_size": 116118
 }
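
The three new `.bin` sizes are exactly linear in the embedding dimension, which a short calculation confirms. The sketch below fits `size = bytes_per_dim * dim + overhead` to the 32d and 64d pointers and checks that the fit predicts the 128d size. The interpretation in the comments (float32 weights, plus a block of extra rows of the kind a subword-aware trainer such as fastText allocates) is an assumption that happens to fit the numbers. The metadata only states `"algorithm": "skipgram"` and does not name the library.

```python
# New LFS sizes in bytes, copied from the pointers above.
SIZES = {32: 288_108_777, 64: 573_834_985, 128: 1_145_287_401}
VOCAB_SIZE = 116_118  # "vocab_size" in the new metadata

# Fit a line through the 32d and 64d points.
bytes_per_dim = (SIZES[64] - SIZES[32]) // (64 - 32)
overhead = SIZES[32] - bytes_per_dim * 32

# Extrapolate the fit to 128 dimensions.
predicted_128 = bytes_per_dim * 128 + overhead

# Assumption: weights are float32, so each dimension costs 4 bytes per row.
rows = bytes_per_dim // 4
# Assumption: input and output matrices each hold one row per vocabulary word;
# whatever is left would be rows shared by subword (character n-gram) buckets.
extra_rows = rows - 2 * VOCAB_SIZE

print(bytes_per_dim, overhead, predicted_128, rows, extra_rows)
```

The prediction matches the 128d pointer byte for byte, and under the stated assumptions `extra_rows` comes out to exactly 2,000,000, with a dimension-independent overhead of 2,382,569 bytes.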

models/subword_markov/azb_markov_ctx1_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b5b4c16c0cc262c904b79b490c2cb1ed8fa1b624111ee815c8a153b6470de540
-size 249202

 version https://git-lfs.github.com/spec/v1
+oid sha256:4ff3da37abf00dd52bdcb28478a692c4366754cc1a111aa42559670cd7ea3aee
+size 247087

models/subword_markov/azb_markov_ctx1_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "subword",
   "language": "azb",
-  "unique_contexts": 2934,
-  "total_transitions": 117347430
 }

   "context_size": 1,
   "variant": "subword",
   "language": "azb",
+  "unique_contexts": 3409,
+  "total_transitions": 91613484
 }

models/subword_markov/azb_markov_ctx2_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1bb9b0eb974612ded94c2843ab693fd0d8bae85b9e9719cc399abf8806f0d8e6
-size 2049903

 version https://git-lfs.github.com/spec/v1
+oid sha256:1090e7d0037ce29d3cddcfe27f87428fcd5c872805dcf5e3d4202b9f83929b66
+size 1650829

models/subword_markov/azb_markov_ctx2_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "subword",
   "language": "azb",
-  "unique_contexts": 31498,
-  "total_transitions": 117102967
 }

   "context_size": 2,
   "variant": "subword",
   "language": "azb",
+  "unique_contexts": 30905,
+  "total_transitions": 91370829
 }

models/subword_markov/azb_markov_ctx3_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:364a895a59b1416e468680327f44afdb5268c191c6d191e6b325691c0f3c3920
-size 10556130

 version https://git-lfs.github.com/spec/v1
+oid sha256:06a5083f143c1f0688e95d617ee8d409f5c78a44f6ea6c79e014253dcf249ac5
+size 7956042

models/subword_markov/azb_markov_ctx3_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "subword",
   "language": "azb",
-  "unique_contexts": 262151,
-  "total_transitions": 116858504
 }

   "context_size": 3,
   "variant": "subword",
   "language": "azb",
+  "unique_contexts": 203056,
+  "total_transitions": 91128174
 }

models/subword_markov/azb_markov_ctx4_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6bbc3fbabaaf693d43117cfaccb8266ae07186aa1c883be3dd208a2f9aad8724
-size 35507016

 version https://git-lfs.github.com/spec/v1
+oid sha256:d4bd1bdcfb24b934a84145355781bb3e8cf414b875f467fc28d048b5d64ec5d1
+size 26328560

models/subword_markov/azb_markov_ctx4_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "subword",
   "language": "azb",
-  "unique_contexts": 1333406,
-  "total_transitions": 116614041
 }

   "context_size": 4,
   "variant": "subword",
   "language": "azb",
+  "unique_contexts": 953883,
+  "total_transitions": 90885519
 }
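
Across the four subword models, `total_transitions` drops by the same amount each time the context grows by one. That is the pattern you would expect if transitions are counted per document without crossing document boundaries: a document of `n` tokens yields `n - k` transitions for context size `k`, so each extra context token costs one transition per document. Below is a minimal sketch of that counting rule. The function `count_transitions` and the toy corpus are illustrative, and whether the pipeline counts exactly this way is an assumption.

```python
def count_transitions(docs, context_size):
    """Count (context -> next token) pairs without crossing document edges."""
    return sum(max(len(doc) - context_size, 0) for doc in docs)


# Toy corpus: three already-tokenized documents of 6, 4, and 8 tokens.
docs = [list("abcdef"), list("abcd"), list("abcdefgh")]
toy = [count_transitions(docs, k) for k in (1, 2, 3, 4)]

# New "total_transitions" values for the subword models, from the metadata above.
SUBWORD_TRANSITIONS = {1: 91_613_484, 2: 91_370_829, 3: 91_128_174, 4: 90_885_519}
steps = [SUBWORD_TRANSITIONS[k] - SUBWORD_TRANSITIONS[k + 1] for k in (1, 2, 3)]

print(toy, steps)
```

In the toy corpus the count falls by three (the number of documents) at every step. In the real metadata every step is 242,655, which equals the `total_documents` value in the new vocabulary metadata of this commit.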

models/subword_ngram/azb_2gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ccea8a8d09e5e2c50495929c3ed6209d6250e502dbf821570083b812edd44ab9
-size 198369

 version https://git-lfs.github.com/spec/v1
+oid sha256:383d8448d14aee343af41336d2c34807c417414896fb0905958afee3cec1c929
+size 182774

models/subword_ngram/azb_2gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "subword",
   "language": "azb",
-  "unique_ngrams": 14131,
-  "total_ngrams": 117347430
 }

   "n": 2,
   "variant": "subword",
   "language": "azb",
+  "unique_ngrams": 12648,
+  "total_ngrams": 91613484
 }

models/subword_ngram/azb_3gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1e3575791a79055ab8760da4b584e9186ebd76add76ef9131f8b1c690f6f619c
-size 1733384

 version https://git-lfs.github.com/spec/v1
+oid sha256:ed1d402ce0eac47e37053808b7981cf3eb00b76c5e2e5e887890b036ff01d869
+size 1377099

models/subword_ngram/azb_3gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "subword",
   "language": "azb",
-  "unique_ngrams": 141425,
-  "total_ngrams": 117102967
 }

   "n": 3,
   "variant": "subword",
   "language": "azb",
+  "unique_ngrams": 106644,
+  "total_ngrams": 91370829
 }

models/subword_ngram/azb_4gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a2d7bf39c2289a572f7d6d41c6ee5b70e899bc025d9c7d73b34bf4e38cac70f4
-size 9977940

 version https://git-lfs.github.com/spec/v1
+oid sha256:cde924d7fc46a3c6a14d25e38135fbf37a14e774606b67187d527dd381b1d75f
+size 7368829

models/subword_ngram/azb_4gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "subword",
   "language": "azb",
-  "unique_ngrams": 804296,
-  "total_ngrams": 116858504
 }

   "n": 4,
   "variant": "subword",
   "language": "azb",
+  "unique_ngrams": 581225,
+  "total_ngrams": 91128174
 }
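
`total_ngrams` counts every n-gram occurrence, while `unique_ngrams` counts the distinct entries in the table. An n-gram is a context of `n - 1` tokens plus the token that follows it, so the totals here line up with the subword Markov models one context size lower (for example, 91,613,484 for both the 2-gram model and the context-1 Markov model). Here is a minimal sketch of the two counts. `ngram_counts` is a hypothetical helper, and the released tables may apply frequency cut-offs that this sketch does not.

```python
from collections import Counter


def ngram_counts(docs, n):
    """Count n-grams inside each document, never across document boundaries."""
    counts = Counter()
    for doc in docs:
        counts.update(tuple(doc[i:i + n]) for i in range(len(doc) - n + 1))
    return counts


# Toy corpus of two tokenized documents.
docs = [list("abab"), list("abc")]
bigrams = ngram_counts(docs, 2)
total_ngrams = sum(bigrams.values())   # every occurrence
unique_ngrams = len(bigrams)           # distinct n-grams

print(total_ngrams, unique_ngrams)
```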

models/tokenizer/azb_tokenizer_16k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3884a029fd3957ce39121cce73cb5a8166d611ce94cef5675b43779699c0bd8a
-size 524898

 version https://git-lfs.github.com/spec/v1
+oid sha256:c4f3786bae820ff2d6cc627dbdc96f1ede4586a6afa95391543c078559a7c125
+size 527370

models/tokenizer/azb_tokenizer_16k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/azb_tokenizer_32k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7da271a7e77b591daaf850d5e0ca617ac859c9d9797bfc37fd65791b5cfc15fe
-size 819972

 version https://git-lfs.github.com/spec/v1
+oid sha256:d5380be73c3e18000f3f24bd0d0309c1b6d15259c73623f50c42cea3c6f8992d
+size 820944

models/tokenizer/azb_tokenizer_32k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/azb_tokenizer_64k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:54cad8dd709976bcc93887547846baf55cbc48f7161ddab74a438fd09acb2685
-size 1430728

 version https://git-lfs.github.com/spec/v1
+oid sha256:efb9e4835aafb1f0f10715a303c2ac20f1f169ed4ffc46f49d685eea42a8bfae
+size 1430567

models/tokenizer/azb_tokenizer_64k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/azb_tokenizer_8k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2ecb3b8a28e5f01ac0a0633151869481ce9b8a3af8ad9c54007045e09589ce83
-size 382716

 version https://git-lfs.github.com/spec/v1
+oid sha256:e6b1776330fc0caaa7cc96df48505e83e46f3a00daa53e66e6a94354397d3dd8
+size 384993

models/tokenizer/azb_tokenizer_8k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/vocabulary/azb_vocabulary.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4c0ed41ec149e3751f6a1ace2372fcee3d368b53c6355b2765ae7b98caed8924
-size 4872673

 version https://git-lfs.github.com/spec/v1
+oid sha256:1346fa231de1649149828cb0cae257f644b1914797c84a5a8c4b56c62e716253
+size 4348211

models/vocabulary/azb_vocabulary_metadata.json CHANGED

@@ -1,16 +1,17 @@
 {
   "language": "azb",
-  "vocabulary_size": 317640,
   "statistics": {
-    "type_token_ratio": 0.05786680977434657,
     "coverage": {
-      "top_100": 0.35375568505036664,
-      "top_1000": 0.6410291673928461,
-      "top_5000": 0.7797111502624887,
-      "top_10000": 0.8250270483868463
     },
-    "hapax_count": 688404,
-    "hapax_ratio": 0.6842682825005666,
-    "total_documents": 244463
   }
 }

 {
   "language": "azb",
+  "vocabulary_size": 271198,
+  "variant": "full",
   "statistics": {
+    "type_token_ratio": 0.05623983023140009,
     "coverage": {
+      "top_100": 0.33209927062556827,
+      "top_1000": 0.6250560987377987,
+      "top_5000": 0.7676335196346162,
+      "top_10000": 0.8163255618590586
     },
+    "hapax_count": 456252,
+    "hapax_ratio": 0.627193621554746,
+    "total_documents": 242655
   }
 }
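
The vocabulary statistics are standard corpus measures. A minimal sketch of how they can be computed from a token list follows. `vocab_stats` is a hypothetical helper, and the formulas are the textbook definitions, not necessarily the pipeline's. One detail can be checked against the numbers in this file: the new `hapax_ratio` equals `hapax_count / (vocabulary_size + hapax_count)`, which suggests that `vocabulary_size` excludes words seen only once.

```python
from collections import Counter


def vocab_stats(tokens, top_k=(2,)):
    """Type/token ratio, hapax statistics, and top-k coverage for a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    hapax = sum(1 for c in counts.values() if c == 1)
    ranked = sorted(counts.values(), reverse=True)
    return {
        "type_token_ratio": len(counts) / total,
        "hapax_count": hapax,
        # Hapaxes as a share of all distinct types, hapaxes included.
        "hapax_ratio": hapax / len(counts),
        # Share of all tokens covered by the k most frequent types.
        "coverage": {k: sum(ranked[:k]) / total for k in top_k},
    }


stats = vocab_stats("a a a b b c d".split())

# Consistency check on the new metadata values shown above.
VOCABULARY_SIZE, HAPAX_COUNT = 271_198, 456_252
implied_hapax_ratio = HAPAX_COUNT / (VOCABULARY_SIZE + HAPAX_COUNT)

print(stats, implied_hapax_ratio)
```

The implied ratio agrees with the reported `0.627193621554746`, so roughly 63% of all distinct word forms in the corpus occur exactly once.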

models/word_markov/azb_markov_ctx1_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0f8d5ae87507c80aafd715379fb38b0179c5049568d6f9e2b033dffefdfb0395
-size 49308871

 version https://git-lfs.github.com/spec/v1
+oid sha256:816eaec35f734257259dcbb9d574a8c019635474f0de8b0daca5e593a77ac78b
+size 42654567

models/word_markov/azb_markov_ctx1_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "word",
   "language": "azb",
-  "unique_contexts": 1005885,
-  "total_transitions": 23557605
 }

   "context_size": 1,
   "variant": "word",
   "language": "azb",
+  "unique_contexts": 726930,
+  "total_transitions": 12692128
 }

models/word_markov/azb_markov_ctx2_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4c8a73136462b690d7c5514e30b6f6b274d36f9f121fbafeb5e99c8af7ec34c3
-size 115872504

 version https://git-lfs.github.com/spec/v1
+oid sha256:46d9babbdaaaad5f5dc9ac4a05883d1447bd376822b6b3dc01d9bfa81167154f
+size 101955475

models/word_markov/azb_markov_ctx2_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "word",
   "language": "azb",
-  "unique_contexts": 4302287,
-  "total_transitions": 23313143
 }

   "context_size": 2,
   "variant": "word",
   "language": "azb",
+  "unique_contexts": 3693091,
+  "total_transitions": 12449473
 }

models/word_markov/azb_markov_ctx3_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:acd2b58441a2312fe6aff4e924e858f5d85a5843063ac2adf427f032e6628d22
-size 173864269

 version https://git-lfs.github.com/spec/v1
+oid sha256:950723c3851972daff338e7c32d25b6b2711a549f145cfc4f36896cd3c7ac632
+size 137860035

models/word_markov/azb_markov_ctx3_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "word",
   "language": "azb",
-  "unique_contexts": 7369809,
-  "total_transitions": 23068685
 }

   "context_size": 3,
   "variant": "word",
   "language": "azb",
+  "unique_contexts": 5447170,
+  "total_transitions": 12206818
 }

models/word_markov/azb_markov_ctx4_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:849a5c01ecae01be263fc6446cd008e4ee8cd861a0e83cd8e5da46a5f178d5ae
-size 222555153

 version https://git-lfs.github.com/spec/v1
+oid sha256:a8542b69e6498bf99f08ac81a2c69e6e10cfaec85f6041228b2d33e47374fa2f
+size 167254340

models/word_markov/azb_markov_ctx4_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "word",
   "language": "azb",
-  "unique_contexts": 9387722,
-  "total_transitions": 22824244
 }

   "context_size": 4,
   "variant": "word",
   "language": "azb",
+  "unique_contexts": 6178524,
+  "total_transitions": 11964164
 }

models/word_ngram/azb_2gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5d9b9580f5793b668288ab3a7773338440486cda750c383b557c68ac0bc9221f
-size 4287214

 version https://git-lfs.github.com/spec/v1
+oid sha256:38d788975186225cbd44f6dc3f5933b5ef0372dd125cccbf762b7c146ac00818
+size 2898203

models/word_ngram/azb_2gram_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "word",
   "language": "azb",
-  "unique_ngrams": 268529,
-  "total_ngrams": 23557605
 }

   "n": 2,
   "variant": "word",
   "language": "azb",
+  "unique_ngrams": 158908,
+  "total_ngrams": 12692128
 }

models/word_ngram/azb_3gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ec70d427e623958a19690dad68f77970e4b6ff77368e7285059f2db52e46e7e6
-size 9355723

 version https://git-lfs.github.com/spec/v1
+oid sha256:e3b2c502cac90ffea868839ff83a17e6f50d9e2b94a2406bbe1479ad2b587f88
+size 4982020

models/word_ngram/azb_3gram_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "word",
   "language": "azb",
-  "unique_ngrams": 512828,
-  "total_ngrams": 23313143
 }

   "n": 3,
   "variant": "word",
   "language": "azb",
+  "unique_ngrams": 236749,
+  "total_ngrams": 12449473
 }

models/word_ngram/azb_4gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6b30786e6c43d7ce3c871df1db1b870820ce26c16155e9f336633c143d3099b5
-size 19762130