omarkamali committed on Jan 3

Commit

02b9bba

verified

1 Parent(s): 1bcc2f6

Upload all models and assets for as (20251001)

This view is limited to 50 files because it contains too many changes.

Files changed (50)

README.md +293 -142
models/embeddings/monolingual/as_128d.bin +2 -2
models/embeddings/monolingual/as_128d_metadata.json +5 -3
models/embeddings/monolingual/as_32d.bin +2 -2
models/embeddings/monolingual/as_32d_metadata.json +5 -3
models/embeddings/monolingual/as_64d.bin +2 -2
models/embeddings/monolingual/as_64d_metadata.json +5 -3
models/subword_markov/as_markov_ctx1_subword.parquet +2 -2
models/subword_markov/as_markov_ctx1_subword_metadata.json +2 -2
models/subword_markov/as_markov_ctx2_subword.parquet +2 -2
models/subword_markov/as_markov_ctx2_subword_metadata.json +2 -2
models/subword_markov/as_markov_ctx3_subword.parquet +2 -2
models/subword_markov/as_markov_ctx3_subword_metadata.json +2 -2
models/subword_markov/as_markov_ctx4_subword.parquet +2 -2
models/subword_markov/as_markov_ctx4_subword_metadata.json +2 -2
models/subword_ngram/as_2gram_subword.parquet +2 -2
models/subword_ngram/as_2gram_subword_metadata.json +2 -2
models/subword_ngram/as_3gram_subword.parquet +2 -2
models/subword_ngram/as_3gram_subword_metadata.json +2 -2
models/subword_ngram/as_4gram_subword.parquet +2 -2
models/subword_ngram/as_4gram_subword_metadata.json +2 -2
models/tokenizer/as_tokenizer_16k.model +2 -2
models/tokenizer/as_tokenizer_16k.vocab +0 -0
models/tokenizer/as_tokenizer_32k.model +2 -2
models/tokenizer/as_tokenizer_32k.vocab +0 -0
models/tokenizer/as_tokenizer_64k.model +2 -2
models/tokenizer/as_tokenizer_64k.vocab +0 -0
models/tokenizer/as_tokenizer_8k.model +2 -2
models/tokenizer/as_tokenizer_8k.vocab +0 -0
models/vocabulary/as_vocabulary.parquet +2 -2
models/vocabulary/as_vocabulary_metadata.json +10 -9
models/word_markov/as_markov_ctx1_word.parquet +2 -2
models/word_markov/as_markov_ctx1_word_metadata.json +2 -2
models/word_markov/as_markov_ctx2_word.parquet +2 -2
models/word_markov/as_markov_ctx2_word_metadata.json +2 -2
models/word_markov/as_markov_ctx3_word.parquet +2 -2
models/word_markov/as_markov_ctx3_word_metadata.json +2 -2
models/word_markov/as_markov_ctx4_word.parquet +2 -2
models/word_markov/as_markov_ctx4_word_metadata.json +2 -2
models/word_ngram/as_2gram_word.parquet +2 -2
models/word_ngram/as_2gram_word_metadata.json +2 -2
models/word_ngram/as_3gram_word.parquet +2 -2
models/word_ngram/as_3gram_word_metadata.json +2 -2
models/word_ngram/as_4gram_word.parquet +2 -2
models/word_ngram/as_4gram_word_metadata.json +2 -2
visualizations/embedding_isotropy.png +0 -0
visualizations/embedding_norms.png +0 -0
visualizations/embedding_similarity.png +2 -2
visualizations/markov_branching.png +0 -0
visualizations/markov_contexts.png +0 -0

README.md CHANGED

@@ -23,14 +23,14 @@ dataset_info:
 metrics:
   - name: best_compression_ratio
     type: compression
-    value: 4.519
   - name: best_isotropy
     type: isotropy
-    value: 0.8484
   - name: vocabulary_size
     type: vocab
-    value: 69383
-generated: 2025-12-27
 ---
 # AS - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
-- N-gram models (2, 3, 4-gram)
-- Markov chains (context of 1, 2, 3 and 4)
 - Subword N-gram and Markov chains
-- Embeddings in various sizes and dimensions
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
-- [6. Summary & Recommendations](#6-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
@@ -68,60 +70,57 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
-| **8k** | 3.390x | 3.36 | 0.0640% | 1,495,826 |
-| **16k** | 3.830x | 3.80 | 0.0723% | 1,324,012 |
-| **32k** | 4.212x | 4.18 | 0.0795% | 1,203,836 |
-| **64k** | 4.519x 🏆 | 4.48 | 0.0853% | 1,122,074 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
-**Sample 1:** `ইউৰেনাচ নামে তলৰ প্ৰৱন্ধসমূহ বুজাব পাৰে:
- ইউৰেনাচ: সৌৰজগতৰ সপ্তম গ্ৰহ
- ইউৰেনাচ (...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁ইউ ৰে না চ ▁নামে ▁তলৰ ▁প্ৰৱ ন্ধ সমূহ ▁বুজাব ... (+28 more)` | 38 |
-| 16k | `▁ইউৰে না চ ▁নামে ▁তলৰ ▁প্ৰৱ ন্ধ সমূহ ▁বুজাব ▁পাৰে ... (+23 more)` | 33 |
-| 32k | `▁ইউৰে নাচ ▁নামে ▁তলৰ ▁প্ৰৱন্ধ সমূহ ▁বুজাব ▁পাৰে : ▁ইউৰে ... (+18 more)` | 28 |
-| 64k | `▁ইউৰে নাচ ▁নামে ▁তলৰ ▁প্ৰৱন্ধ সমূহ ▁বুজাব ▁পাৰে : ▁ইউৰে ... (+17 more)` | 27 |
-**Sample 2:** `শ্ৰেণী:ভাৰতীয় অভিনেত্ৰী
-শ্ৰেণী:জীৱিত ব্যক্তি
-শ্ৰেণী:ভাৰতীয় ৰাজনীতিবিদ
-শ্ৰেণী:ভ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁শ্ৰেণী : ভাৰতীয় ▁অভিনেত্ৰী ▁শ্ৰেণী : জীৱিত ▁ব্যক্তি ▁শ্ৰেণী : ... (+9 more)` | 19 |
-| 16k | `▁শ্ৰেণী : ভাৰতীয় ▁অভিনেত্ৰী ▁শ্ৰেণী : জীৱিত ▁ব্যক্তি ▁শ্ৰেণী : ... (+8 more)` | 18 |
-| 32k | `▁শ্ৰেণী : ভাৰতীয় ▁অভিনেত্ৰী ▁শ্ৰেণী : জীৱিত ▁ব্যক্তি ▁শ্ৰেণী : ... (+8 more)` | 18 |
-| 64k | `▁শ্ৰেণী : ভাৰতীয় ▁অভিনেত্ৰী ▁শ্ৰেণী : জীৱিত ▁ব্যক্তি ▁শ্ৰেণী : ... (+7 more)` | 17 |
-**Sample 3:** `তাল বুলিলে তলৰ পৃষ্ঠাসমূহ বুজাব পাৰে:
-তাল (বাদ্যযন্ত্ৰ)
-তাল (সংগীত)
-তাল (ফল)`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁তাল ▁বুল িলে ▁তলৰ ▁পৃষ্ঠ াসমূহ ▁বুজাব ▁পাৰে : ▁তাল ... (+13 more)` | 23 |
-| 16k | `▁তাল ▁বুলিলে ▁তলৰ ▁পৃষ্ঠ াসমূহ ▁বুজাব ▁পাৰে : ▁তাল ▁( ... (+11 more)` | 21 |
-| 32k | `▁তাল ▁বুলিলে ▁তলৰ ▁পৃষ্ঠ াসমূহ ▁বুজাব ▁পাৰে : ▁তাল ▁( ... (+11 more)` | 21 |
-| 64k | `▁তাল ▁বুলিলে ▁তলৰ ▁পৃষ্ঠাসমূহ ▁বুজাব ▁পাৰে : ▁তাল ▁( বাদ্যযন্ত্ৰ ... (+9 more)` | 19 |
 ### Key Findings
-- **Best Compression:** 64k achieves 4.519x compression
-- **Lowest UNK Rate:** 8k with 0.0640% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
@@ -130,57 +129,89 @@ Below are sample sentences tokenized with each vocabulary size:
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
-| N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
-|--------|------------|---------|----------------|------------------|-------------------|
-| **2-gram** | 2,255 🏆 | 11.14 | 144,083 | 37.8% | 74.6% |
-| **2-gram** | 712 🏆 | 9.48 | 14,933 | 47.2% | 91.5% |
-| **3-gram** | 20,803 | 14.34 | 558,435 | 14.3% | 40.2% |
-| **3-gram** | 6,279 | 12.62 | 135,792 | 17.8% | 52.9% |
-| **4-gram** | 109,133 | 16.74 | 1,749,119 | 7.8% | 23.0% |
-| **4-gram** | 34,064 | 15.06 | 727,754 | 9.4% | 29.5% |
 ### Top 5 N-grams by Size
-**2-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `া ৰ` | 608,846 |
-| 2 | `য ়` | 511,382 |
-| 3 | `্ ৰ` | 409,497 |
-| 4 | `প ্` | 302,441 |
-| 5 | `ি ল` | 296,895 |
-**3-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `য ় া` | 192,871 |
-| 2 | `ি য ়` | 190,336 |
-| 3 | `ী য ়` | 164,267 |
-| 4 | `ত ে ও` | 138,798 |
-| 5 | `ে ও ঁ` | 138,171 |
-**4-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `ত ে ও ঁ` | 137,993 |
-| 2 | `ি য ় া` | 118,996 |
-| 3 | `ছ ি ল ।` | 81,488 |
-| 4 | `ি ছ ি ল` | 71,229 |
-| 5 | `শ ্ ৰ ে` | 58,374 |
 ### Key Findings
-- **Best Perplexity:** 2-gram with 712
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
-- **Coverage:** Top-1000 patterns cover ~29% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 ---
@@ -188,55 +219,86 @@ Below are sample sentences tokenized with each vocabulary size:
 ![Markov Entropy](visualizations/markov_entropy.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
-| Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
-|---------|-------------|------------|------------------|-----------------|----------------|
-| **1** | 0.5873 | 1.502 | 6.01 | 178,570 | 41.3% |
-| **1** | 1.0627 | 2.089 | 7.96 | 5,555 | 0.0% |
-| **2** | 0.3625 | 1.286 | 2.70 | 1,073,346 | 63.7% |
-| **2** | 0.9491 | 1.931 | 6.70 | 44,224 | 5.1% |
-| **3** | 0.2925 | 1.225 | 2.07 | 2,893,182 | 70.7% |
-| **3** | 0.8311 | 1.779 | 4.51 | 296,427 | 16.9% |
-| **4** | 0.2472 🏆 | 1.187 | 1.70 | 5,984,484 | 75.3% |
-| **4** | 0.6476 🏆 | 1.567 | 2.96 | 1,337,967 | 35.2% |
-### Generated Text Samples
-Below are text samples generated from each Markov chain model:
 **Context Size 1:**
-1. `া লয ় যদ ি ল ৰ া ন ে । পৰ ি ছ া উৰ`
-2. `্ চত ভ া শ ্ ট ো ক ভ ু ১ট া ’ ল ্`
-3. `ি য ় , ক া ল । ইয ় ন কৰ ম ৃ ��কসকলক প`
 **Context Size 2:**
-1. `া ৰ ব ি ব া গ া ঁ ৱত কচ ্ ‌ ২০০৫ – ২০১০ চনত`
-2. `য ় া ৰ ব ি প ি ছত anthropological survey of hinduism today , march 2009`
-3. `্ ৰ া থম ি ক ব ি ন ্ দ ্ ব া ৰ ব ি`
 **Context Size 3:**
-1. `য ় া ষ ্ ট ে চনত আশ ্ ৰফ ে চৰ১৯৬৮ফ ি ল ্ ল ী`
-2. `ি য ় া ধ ী ন হয ় আৰ ু মহ ি ল া ৰ দ ৌ`
-3. `ী য ় া ল ো কগ া থ া ক ে ব ি ব া হৰ ব`
 **Context Size 4:**
-1. `ত ে ও ঁ ৰ প ু ৰণ ি ড ি মৰ ু গছৰ পৰ া প ্ ৰ`
-2. `ি য ় া চ ি ন ্ ত ু স ে ই বছৰত ে ত ে ও ঁ`
-3. `ছ ি ল । ইয ় া ৰ অন ু ম ো দ ী ৰ ্ ঘদ ি ন`
 ### Key Findings
-- **Best Predictability:** Context-4 with 75.3% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
-- **Memory Trade-off:** Larger contexts require more storage (1,337,967 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
@@ -252,64 +314,64 @@ Below are text samples generated from each Markov chain model:
 | Metric | Value |
 |--------|-------|
-| Vocabulary Size | 69,383 |
-| Total Tokens | 22,699,994 |
-| Mean Frequency | 327.17 |
 | Median Frequency | 4 |
-| Frequency Std Dev | 13137.34 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | ৰ | 1,751,445 |
-| 2 | ত | 1,226,468 |
-| 3 | ব | 988,164 |
-| 4 | য | 983,360 |
-| 5 | ক | 951,103 |
-| 6 | ন | 878,240 |
-| 7 | ল | 815,336 |
-| 8 | প | 750,050 |
-| 9 | ম | 610,982 |
-| 10 | স | 521,058 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | zns | 2 |
-| 2 | tuklas | 2 |
-| 3 | bilqees | 2 |
-| 4 | surbhi | 2 |
-| 5 | ullasamga | 2 |
-| 6 | utsahamga | 2 |
-| 7 | manchস | 2 |
-| 8 | swfl | 2 |
-| 9 | manhunt | 2 |
-| 10 | megamodel | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
-| Zipf Coefficient | 1.5303 |
-| R² (Goodness of Fit) | 0.997399 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
-| Top 100 | 78.6% |
-| Top 1,000 | 93.9% |
-| Top 5,000 | 97.8% |
-| Top 10,000 | 98.6% |
 ### Key Findings
-- **Zipf Compliance:** R²=0.9974 indicates excellent adherence to Zipf's law
-- **High Frequency Dominance:** Top 100 words cover 78.6% of corpus
-- **Long Tail:** 59,383 words needed for remaining 1.4% coverage
 ---
 ## 5. Word Embeddings Evaluation
@@ -322,24 +384,110 @@ Below are text samples generated from each Markov chain model:
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
-### Model Comparison
-| Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
-|-------|------------|-----------|----------|----------|----------|
-| **mono_32d** | 113,429 | 32 | 3.011 | 0.609 | 0.8484 🏆 |
-| **mono_64d** | 113,429 | 64 | 3.498 | 0.644 | 0.8482 |
-| **mono_128d** | 113,429 | 128 | 4.089 | 0.704 | 0.8254 |
-| **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
 ### Key Findings
-- **Best Isotropy:** mono_32d with 0.8484 (more uniform distribution)
-- **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
-- **Vocabulary Coverage:** All models cover 113,429 words
-- **Recommendation:** 100d for balanced semantic capture and efficiency
 ---
-## 6. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
@@ -347,11 +495,12 @@ Below are text samples generated from each Markov chain model:
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
-| Tokenizer | **32k BPE** | Best compression (4.52x) with low UNK rate |
-| N-gram | **5-gram** | Lowest perplexity (712) |
-| Markov | **Context-4** | Highest predictability (75.3%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
@@ -541,7 +690,8 @@ If you use these models in your research, please cite:
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
-  publisher = {HuggingFace},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
@@ -557,7 +707,8 @@ MIT License - Free for academic and commercial use.
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 ---
 *Generated by Wikilangs Models Pipeline*
-*Report Date: 2025-12-27 18:45:07*

 metrics:
   - name: best_compression_ratio
     type: compression
+    value: 4.534
   - name: best_isotropy
     type: isotropy
+    value: 0.8566
   - name: vocabulary_size
     type: vocab
+    value: 0
+generated: 2026-01-03
 ---
 # AS - Wikilangs Models
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
+- N-gram models (2, 3, 4, 5-gram)
+- Markov chains (context of 1, 2, 3, 4 and 5)
 - Subword N-gram and Markov chains
+- Embeddings in various sizes and dimensions (aligned and unaligned)
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
+- [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+- [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
+![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+![Tokenizer OOV](visualizations/tokenizer_oov.png)
+![Total Tokens](visualizations/tokenizer_total_tokens.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
+| **8k** | 3.446x | 3.45 | 0.0759% | 1,436,355 |
+| **16k** | 3.889x | 3.89 | 0.0856% | 1,272,728 |
+| **32k** | 4.259x | 4.26 | 0.0938% | 1,162,147 |
+| **64k** | 4.534x 🏆 | 4.53 | 0.0999% | 1,091,630 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
+**Sample 1:** `জয়নগৰ মজিলপুৰ ভাৰতৰ পশ্চিমবংগ ৰাজ্যৰ দক্ষিণ চব্বিশ পৰগনা জিলাত অৱস্থিত এখন চহৰ।...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁জ য়ন গৰ ▁মজ িল পুৰ ▁ভাৰতৰ ▁পশ্চিমব ংগ ▁ৰাজ্যৰ ... (+14 more)` | 24 |
+| 16k | `▁জ য়ন গৰ ▁মজ িল পুৰ ▁ভাৰতৰ ▁পশ্চিমবংগ ▁ৰাজ্যৰ ▁দক্ষিণ ... (+12 more)` | 22 |
+| 32k | `▁জয়ন গৰ ▁মজ িল পুৰ ▁ভাৰতৰ ▁পশ্চিমবংগ ▁ৰাজ্যৰ ▁দক্ষিণ ▁চব্বিশ ... (+8 more)` | 18 |
+| 64k | `▁জয়নগৰ ▁মজ িল পুৰ ▁ভাৰতৰ ▁পশ্চিমবংগ ▁ৰাজ্যৰ ▁দক্ষিণ ▁চব্বিশ ▁পৰগনা ... (+7 more)` | 17 |
+**Sample 2:** `প্ৰদীপ আচাৰ্য্য একবিংশ শতাব্দীৰ অসমৰ এগৰাকী প্ৰসিদ্ধ লেখক, সমালোচক । সংক্ষিপ্ত জ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁প্ৰদীপ ▁আচাৰ্য ্য ▁এক বিংশ ▁শতা ব্দ ীৰ ▁অসমৰ ▁এগৰাকী ... (+10 more)` | 20 |
+| 16k | `▁প্ৰদীপ ▁আচাৰ্য ্য ▁একবিংশ ▁শতাব্দীৰ ▁অসমৰ ▁এগৰাকী ▁প্ৰসিদ্ধ ▁লেখক , ... (+7 more)` | 17 |
+| 32k | `▁প্ৰদীপ ▁আচাৰ্য ্য ▁একবিংশ ▁শতাব্দীৰ ▁অসমৰ ▁এগৰাকী ▁প্ৰসিদ্ধ ▁লেখক , ... (+7 more)` | 17 |
+| 64k | `▁প্ৰদীপ ▁আচাৰ্য ্য ▁একবিংশ ▁শতাব্দীৰ ▁অসমৰ ▁এগৰাকী ▁প্ৰসিদ্ধ ▁লেখক , ... (+7 more)` | 17 |
+**Sample 3:** `মাটিকালি অৱস্থান কৰ্মচাৰী সা-সুবিধা তথ্যসূত্ৰ বিদ্যালয়সমূহ`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁মাটিকালি ▁অৱস্থান ▁কৰ্মচাৰী ▁সা - সু বিধা ▁তথ্যসূত্ৰ ▁বিদ্যালয় সমূহ` | 10 |
+| 16k | `▁মাটিকালি ▁অৱস্থান ▁কৰ্মচাৰী ▁সা - সু বিধা ▁তথ্যসূত্ৰ ▁বিদ্যালয়সমূহ` | 9 |
+| 32k | `▁মাটিকালি ▁অৱস্থান ▁কৰ্মচাৰী ▁সা - সুবিধা ▁তথ্যসূত্ৰ ▁বিদ্যালয়সমূহ` | 8 |
+| 64k | `▁মাটিকালি ▁অৱস্থান ▁কৰ্মচাৰী ▁সা - সুবিধা ▁তথ্যসূত্ৰ ▁বিদ্যালয়সমূহ` | 8 |
 ### Key Findings
+- **Best Compression:** 64k achieves 4.534x compression
+- **Lowest UNK Rate:** 8k with 0.0759% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
+![N-gram Unique](visualizations/ngram_unique.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
+| N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+|--------|---------|------------|---------|----------------|------------------|-------------------|
+| **2-gram** | Word | 60,931 | 15.89 | 198,049 | 8.3% | 21.5% |
+| **2-gram** | Subword | 2,317 🏆 | 11.18 | 62,544 | 34.0% | 69.3% |
+| **3-gram** | Word | 105,867 | 16.69 | 226,215 | 4.9% | 14.7% |
+| **3-gram** | Subword | 21,008 | 14.36 | 364,128 | 13.2% | 35.4% |
+| **4-gram** | Word | 237,754 | 17.86 | 355,974 | 2.4% | 7.8% |
+| **4-gram** | Subword | 113,775 | 16.80 | 1,477,005 | 7.8% | 20.9% |
 ### Top 5 N-grams by Size
+**2-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `কৰা হয়` | 27,116 |
+| 2 | `কৰা হৈছিল` | 11,596 |
+| 3 | `হ ল` | 10,746 |
+| 4 | `লাভ কৰে` | 10,053 |
+| 5 | `কৰা হৈছে` | 9,448 |
+**3-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `ব্যৱহাৰ কৰা হয়` | 3,039 |
+| 2 | `হ ব পাৰে` | 3,023 |
+| 3 | `বুলি কোৱা হয়` | 2,966 |
+| 4 | `গণ্য কৰা হয়` | 2,121 |
+| 5 | `ডিগ্ৰী লাভ কৰে` | 1,927 |
+**4-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `তথ্য সংগ্ৰহ বাহ্যিক সংযোগ` | 1,636 |
+| 2 | `বুলি গণ্য কৰা হয়` | 1,147 |
+| 3 | `স্নাতক ডিগ্ৰী লাভ কৰে` | 819 |
+| 4 | `তথ্য উৎস বাহ্যিক সংযোগ` | 772 |
+| 5 | `হিচাপে গণ্য কৰা হয়` | 749 |
+**2-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `ৰ _` | 1,253,155 |
+| 2 | `ত _` | 617,790 |
+| 3 | `_ আ` | 557,646 |
+| 4 | `। _` | 441,423 |
+| 5 | `_ ক` | 431,976 |
+**3-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `আ ৰু _` | 234,191 |
+| 2 | `_ আ ৰু` | 234,020 |
+| 3 | `_ ক ৰি` | 132,035 |
+| 4 | `_ তে ওঁ` | 130,105 |
+| 5 | `ন ৰ _` | 119,581 |
+**4-grams (Subword):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `_ আ ৰু _` | 233,600 |
+| 2 | `ছি ল । _` | 95,977 |
+| 3 | `_ ক ৰা _` | 84,715 |
+| 4 | `_ তে ওঁ _` | 61,201 |
+| 5 | `_ এ ই _` | 61,142 |
 ### Key Findings
+- **Best Perplexity:** 2-gram (subword) with 2,317
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
+- **Coverage:** Top-1000 patterns cover ~21% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 ---
 ![Markov Entropy](visualizations/markov_entropy.png)
+![Markov Contexts](visualizations/markov_contexts.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
+| Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+|---------|---------|-------------|------------|------------------|-----------------|----------------|
+| **1** | Word | 0.8462 | 1.798 | 7.80 | 533,621 | 15.4% |
+| **1** | Subword | 0.8352 | 1.784 | 12.14 | 14,852 | 16.5% |
+| **2** | Word | 0.2679 | 1.204 | 1.70 | 4,157,114 | 73.2% |
+| **2** | Subword | 0.7097 | 1.635 | 5.34 | 180,337 | 29.0% |
+| **3** | Word | 0.0819 | 1.058 | 1.15 | 7,059,669 | 91.8% |
+| **3** | Subword | 0.5596 | 1.474 | 3.48 | 962,394 | 44.0% |
+| **4** | Word | 0.0273 🏆 | 1.019 | 1.04 | 8,117,676 | 97.3% |
+| **4** | Subword | 0.4358 | 1.353 | 2.26 | 3,350,979 | 56.4% |
+### Generated Text Samples (Word-based)
+Below are text samples generated from each word-based Markov chain model:
 **Context Size 1:**
+1. `আৰু আন্তঃৰাষ্ট্ৰীয় ত্ৰিবৰ্ষীয় নতুন আলোচনী বিভাগ আৰু তিনিটা ভাগত কেইটামান ষ্ট্ৰ মেল ফেৰাৰ হেলেনা দ্...`
+2. `কৰা আয়াতসমূহক সাধাৰণতে এই লিংগ জাতীয় উৎসৱ পৰ্ব অনুষ্ঠান হিচাপে ব্যৱহাৰ সংস্কৃতিক কেন্দ্ৰৰ centre s...`
+3. `হয় মিনিক থিয়েম ৬ ৩৬ চৌৰঙ্গী উপন্যাসৰ নতুন ব্ৰডগজ ইঞ্জিন বিদ্যুতৰ অনুমতি দিযা নি পকড় কাফীৰ`
 **Context Size 2:**
+1. `কৰা হয় পিছলৈ কানৱ ঋষিৰ আশ্ৰমত বাস কৰে স্থিতি তথা সংৰক্ষণ তথ্যসূত্ৰ বহিঃসংযোগ ইণ্টাৰনেট মুভি ডাটাবেছ...`
+2. `কৰা হৈছিল ৰডবোৰে এটা সংখ্যাৰ অংকৰ যোগফলৰ দ্বাৰা পোৱা গৈছিল উদ্ভিদ ৰসায়ন টিনোস্প ৰা কৰ্ডিফ লিয়াত এল...`
+3. `হ ল ছিৰাম চিক্‌নেছ সদৃশ লক্ষণ প্ৰদৰ্শনকাৰী মধ্যমীয়া আৰু যুক্তিসংগত বিশ্লেষণৰ জৰিয়তে শ্ৰীকৃষ্ণ কীৰ্...`
 **Context Size 3:**
+1. `ব্যৱহাৰ কৰা হয় এটা জনপ্ৰিয় কিংবদন্তি অনুসৰি বৈষ্ণৱ পণ্ডিতসকলে শৈৱ বুলি প্ৰত্যাখ্যান কৰাৰ পিছত তেওঁ...`
+2. `হ ব পাৰে ৰচনাৰ তাৰিখ ঐতৰেয় ব্ৰাহ্মণ কিছু নিশ্চিতভাৱে খ্ৰীষ্টপূৰ্ব ১ম সহস্ৰাব্দৰ সম্ভৱতঃ ইয়াৰ প্ৰথম...`
+3. `বুলি কোৱা হয় চনৰ ১১ ফেব্ৰুৱাৰীত বাংলাদেশত আলিৰ মৃত্যু হয় তেওঁৰ মৃত্যুৰ পিছত নিউয়ৰ্ক টাইমছে তেওঁক ...`
 **Context Size 4:**
+1. `তথ্য সংগ্ৰহ বাহ্যিক সংযোগ আনুষ্ঠানিক mamata banerjee official all india trinamool congress party pro...`
+2. `বুলি গণ্য কৰা হয় একশৰণ নাম ধৰ্মৰ অনুগামীসকলে গুণমালা পুথিখনক অতি পৱিত্ৰ জ্ঞান কৰি গুৰু আসনত প্ৰতিষ্...`
+3. `স্নাতক ডিগ্ৰী লাভ কৰে আৰু বেৰিষ্টাৰ ইজাজ হুছেইন বাটালৱীৰ চেম্বাৰত যোগদান কৰে চনত লাহোৰ উচ্চ ন্যায়াল...`
+### Generated Text Samples (Subword-based)
+Below are text samples generated from each subword-based Markov chain model:
+**Context Size 1:**
+1. `_ভাৱেশনত-_পলবাদলকৰা_`
+2. `ৰ_বিয়া_usis_ই_সেই_ধ্ব`
+3. `কবৰু_।_মহিলাৰশ্বি_ম_৩৬`
+**Context Size 2:**
+1. `ৰ_ডে'_+_(spishaver`
+2. `ত_উপন্যাসৰ_ওপৰ_ওৰে_চক্ৰ`
+3. `_আৰু_পাৰে।_চলন_বোমা-কাশ্মী`
+**Context Size 3:**
+1. `আৰু_মৰলৰ_মেক্সিমফৰ_এটা-আ`
+2. `_আৰু_লাইকা"_এটা_উত্তৰ_শ্ৰেষ্ঠ`
+3. `_কৰিবলৈ_গঢ়_লৈ_যোৱা_সমসাম`
+**Context Size 4:**
+1. `_আৰু_সামাজিক_অৱ_ছিংগাপুৰত_২`
+2. `ছিল।_ইভান্সে_প্ৰাণীবিজ্ঞান,_���্ৰমবৰ্ধ`
+3. `_কৰা_হয়।_ষ্টাফ_ৰিপৰ্টাৰ_২৪_`
 ### Key Findings
+- **Best Predictability:** Context-4 (word) with 97.3% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
+- **Memory Trade-off:** Larger contexts require more storage (3,350,979 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
 | Metric | Value |
 |--------|-------|
+| Vocabulary Size | 219,027 |
+| Total Tokens | 8,615,852 |
+| Mean Frequency | 39.34 |
 | Median Frequency | 4 |
+| Frequency Std Dev | 763.95 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | আৰু | 234,273 |
+| 2 | কৰা | 88,269 |
+| 3 | হয় | 83,006 |
+| 4 | কৰে | 74,637 |
+| 5 | এই | 61,800 |
+| 6 | তেওঁ | 61,727 |
+| 7 | পৰা | 52,844 |
+| 8 | কৰিছিল | 48,735 |
+| 9 | বাবে | 48,165 |
+| 10 | চনত | 47,181 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | চেকিজাং | 2 |
+| 2 | জটৱানী | 2 |
+| 3 | জটৱানীৰ | 2 |
+| 4 | ভিটাইৰ | 2 |
+| 5 | সিন্ধীজ | 2 |
+| 6 | দেৱচন্দ্ৰৰ | 2 |
+| 7 | দেৱচন্দ্ৰ | 2 |
+| 8 | প্ৰাণনাথৰ | 2 |
+| 9 | প্ৰাণনাথে | 2 |
+| 10 | গুৰদ্বাৰ | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
+| Zipf Coefficient | 1.0086 |
+| R² (Goodness of Fit) | 0.989742 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
+| Top 100 | 25.4% |
+| Top 1,000 | 50.8% |
+| Top 5,000 | 71.8% |
+| Top 10,000 | 79.6% |
 ### Key Findings
+- **Zipf Compliance:** R²=0.9897 indicates excellent adherence to Zipf's law
+- **High Frequency Dominance:** Top 100 words cover 25.4% of corpus
+- **Long Tail:** 209,027 words needed for remaining 20.4% coverage
 ---
 ## 5. Word Embeddings Evaluation
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
+### 5.1 Cross-Lingual Alignment
+> *Note: Multilingual alignment visualization not available for this language.*
+### 5.2 Model Comparison
+| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+|-------|-----------|----------|------------------|---------------|----------------|
+| **mono_32d** | 32 | 0.8476 | 0.3643 | N/A | N/A |
+| **mono_64d** | 64 | 0.8566 🏆 | 0.2729 | N/A | N/A |
+| **mono_128d** | 128 | 0.8399 | 0.2134 | N/A | N/A |
 ### Key Findings
+- **Best Isotropy:** mono_64d with 0.8566 (more uniform distribution)
+- **Semantic Density:** Average pairwise similarity of 0.2836. Lower values indicate better semantic separation.
+- **Alignment Quality:** No aligned models evaluated in this run.
+- **Recommendation:** 128d aligned for best cross-lingual performance
+---
+## 6.  Morphological Analysis (Experimental)
+> ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+### 6.1 Productivity & Complexity
+| Metric | Value | Interpretation | Recommendation |
+|--------|-------|----------------|----------------|
+| Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+| Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+### 6.2 Affix Inventory (Productive Units)
+These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
+#### Productive Prefixes
+| Prefix | Examples |
+|--------|----------|
+#### Productive Suffixes
+| Suffix | Examples |
+|--------|----------|
+| `-ৰ` | নিৰ্ধাৰনৰ, স্পঞ্জৰ, চেমেলেৰ |
+| `-াৰ` | শীতলাৰ, গুজ্জাৰ, ভেঁ‌ড়াৰ |
+### 6.3 Bound Stems (Lexical Roots)
+Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
+| Stem | Cohesion | Substitutability | Examples |
+|------|----------|------------------|----------|
+| `ther` | 3.32x | 64 contexts | other, theri, there |
+| `ight` | 3.29x | 55 contexts | bight, tight, might |
+| `ress` | 3.32x | 41 contexts | dress, press, presse |
+| `indi` | 3.32x | 39 contexts | hindi, indie, pindi |
+| `vers` | 3.14x | 46 contexts | evers, overs, verse |
+| `nter` | 3.21x | 38 contexts | inter, enter, hunter |
+| `olog` | 3.28x | 34 contexts | oology, biology, zoology |
+| `ment` | 3.17x | 38 contexts | cement, mentor, mentha |
+| `ctio` | 3.34x | 31 contexts | action, diction, section |
+| `atio` | 3.18x | 37 contexts | fatio, ratio, nation |
+| `stor` | 3.19x | 33 contexts | storm, jstor, story |
+| `iver` | 3.17x | 26 contexts | liver, river, giver |
+### 6.4 Affix Compatibility (Co-occurrence)
+This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
+*No significant affix co-occurrences detected.*
+### 6.5 Recursive Morpheme Segmentation
+Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
+| Word | Suggested Split | Confidence | Stem |
+|------|-----------------|------------|------|
+| লিথুৱানিয়াৰ | **`লিথুৱানিয়-াৰ`** | 1.5 | `লিথুৱানিয়` |
+| পুনৰ্ব্যৱহাৰ | **`পুনৰ্ব্যৱহ-াৰ`** | 1.5 | `পুনৰ্ব্যৱহ` |
+| প্লাছিয়াৰ | **`প্লাছিয়-াৰ`** | 1.5 | `প্লাছিয়` |
+| ক্ষেত্ৰাধিকাৰ | **`ক্ষেত্ৰাধিক-াৰ`** | 1.5 | `ক্ষেত্ৰাধিক` |
+| বিন্ধোৱাৰ | **`বিন্ধোৱ-াৰ`** | 1.5 | `বিন্ধোৱ` |
+| চুলিক্‌ফাৰ | **`চুলিক্‌ফ-াৰ`** | 1.5 | `চুলিক্‌ফ` |
+| ইউনিলিভাৰ | **`ইউনিলিভ-াৰ`** | 1.5 | `ইউনিলিভ` |
+| চিৰস্তাদাৰ | **`চিৰস্তাদ-াৰ`** | 1.5 | `চিৰস্তাদ` |
+| লাখটকীয়াৰ | **`লাখটকীয়-াৰ`** | 1.5 | `লাখটকীয়` |
+| জাতিসত্তাৰ | **`জাতিসত্ত-াৰ`** | 1.5 | `জাতিসত্ত` |
+| দৰিদ্ৰতাৰ | **`দৰিদ্ৰত-াৰ`** | 1.5 | `দৰিদ্ৰত` |
+| ছিলভেষ্টাৰ | **`ছিলভেষ্ট-াৰ`** | 1.5 | `ছিলভেষ্ট` |
+| চিলভেষ্টাৰ | **`চিলভেষ্ট-াৰ`** | 1.5 | `চিলভেষ্ট` |
+| বাগ্মীতাৰ | **`বাগ্মীত-াৰ`** | 1.5 | `বাগ্মীত` |
+| নিয়ন্ত্ৰণহীনতাৰ | **`নিয়ন্ত্ৰণহীনত-াৰ`** | 1.5 | `নিয়ন্ত্ৰণহীনত` |
+### 6.6 Linguistic Interpretation
+> **Automated Insight:**
+The language AS appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
 ---
+## 7. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
+| Tokenizer | **64k BPE** | Best compression (4.53x) |
+| N-gram | **2-gram** | Lowest perplexity (2,317) |
+| Markov | **Context-4** | Highest predictability (97.3%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
+  doi = {10.5281/zenodo.18073153},
+  publisher = {Zenodo},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+- 🤝 Sponsor: [Featherless AI](https://featherless.ai)
 ---
 *Generated by Wikilangs Models Pipeline*
+*Report Date: 2026-01-03 05:56:00*
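
As a quick sanity check on the updated README figures, the following sketch recomputes a few derived values from numbers on the `+` side of the diff. The relationships it tests are assumptions inferred from the tables, not definitions stated in this commit: mean frequency as total tokens divided by vocabulary size, perplexity as 2 raised to the entropy, and the long-tail count as vocabulary size minus 10,000. All variable names are illustrative.

```python
# Values copied from the "+" side of the README diff above.
vocab_size = 219_027
total_tokens = 8_615_852
top_10k_coverage = 79.6  # percent

# Assumption: mean frequency = total tokens / vocabulary size.
mean_frequency = total_tokens / vocab_size

# Assumption: the "Long Tail" bullet counts words outside the top 10,000.
long_tail_words = vocab_size - 10_000
long_tail_coverage = 100.0 - top_10k_coverage

# Assumption: perplexity = 2 ** entropy (entropy measured in bits).
# Each entry is (reported entropy, reported perplexity) for a word-level row.
markov_rows = {
    "ctx1_word": (0.8462, 1.798),
    "ctx4_word": (0.0273, 1.019),
}
perplexity_check = {
    name: (2 ** entropy, reported)
    for name, (entropy, reported) in markov_rows.items()
}

print(f"mean frequency: {mean_frequency:.2f}")
print(f"long tail: {long_tail_words:,} words, {long_tail_coverage:.1f}% coverage")
for name, (computed, reported) in perplexity_check.items():
    print(f"{name}: computed {computed:.3f}, reported {reported}")
```

Under these assumptions the recomputed values agree with the reported mean frequency (39.34), long-tail count (209,027 words for 20.4% coverage), and the two Markov perplexities to three decimals.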

models/embeddings/monolingual/as_128d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7294ee96f3b0c61ba7c46af8e348b551645a26d74537adc45861ba9171026d24
-size 1143382361

 version https://git-lfs.github.com/spec/v1
+oid sha256:0ac87d06ad44c5f86054b2229ad119a6da7bf6318454bddbdc55c7518d8c4fed
+size 1134845578

models/embeddings/monolingual/as_128d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 128,
   "version": "monolingual",
   "training_params": {
-    "dim": 128,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 113429
 }

   "dimension": 128,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 128
   },
+  "vocab_size": 105317
 }
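
The metadata hunk above records the embedding dimension twice (`dimension` and `training_params.dim`) and lowers `vocab_size`. The sketch below shows one way such a file might be checked for internal consistency. `check_metadata` and its rules are hypothetical, and the JSON contains only the fields visible on the `+` side of the hunk, since the leading lines of the real file are not shown in the diff.

```python
import json

# Only the fields visible on the "+" side of the hunk; the real file
# has additional leading lines that the diff does not show.
visible_metadata = json.loads("""
{
  "dimension": 128,
  "version": "monolingual",
  "training_params": {
    "algorithm": "skipgram",
    "min_count": 5,
    "window": 5,
    "negative": 5,
    "epochs": 5,
    "encoding_method": "rope",
    "dim": 128
  },
  "vocab_size": 105317
}
""")


def check_metadata(meta: dict) -> list:
    """Return a list of human-readable problems (hypothetical checks)."""
    problems = []
    if meta["dimension"] != meta["training_params"]["dim"]:
        problems.append("dimension does not match training_params.dim")
    if meta["vocab_size"] <= 0:
        problems.append("vocab_size must be positive")
    return problems


previous_vocab_size = 113429  # from the "-" side of the hunk
problems = check_metadata(visible_metadata)
vocab_delta = visible_metadata["vocab_size"] - previous_vocab_size
print(problems, vocab_delta)
```

A positive-value check of this kind, applied to the README front matter, would also flag the `vocabulary_size` metric, which this commit sets to `0`.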

models/embeddings/monolingual/as_32d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8c0125149319853f09b65df93575e1bf1665cf0cf98a0a4e52df3248b94a776d
-size 288268889

 version https://git-lfs.github.com/spec/v1
+oid sha256:4d92a65e15cacfccf9d549fee6b6a0ce808975a3c742f0ba231c9f7d4e6b2be4
+size 285962122

models/embeddings/monolingual/as_32d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 32,
   "version": "monolingual",
   "training_params": {
-    "dim": 32,
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 32
   },
-  "vocab_size": 113429
+  "vocab_size": 105317
 }

models/embeddings/monolingual/as_64d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1b271a32a3a16aa6c5905bd4ef669972e7fa9d2fcad872330bcbfe8211a6cbf6
-size 573306713
+oid sha256:105129e3ce95f6d925453fd6d253ba75069fd36adaf12abf53d323c8a32ee134
+size 568923274

models/embeddings/monolingual/as_64d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 64,
   "version": "monolingual",
   "training_params": {
-    "dim": 64,
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 64
   },
-  "vocab_size": 113429
+  "vocab_size": 105317
 }

models/subword_markov/as_markov_ctx1_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:304b81823be6914953bac63df3fb3c2d3d5e01a1f9b4b8f4a8878717e903a2e3
-size 319610
+oid sha256:9c80c8a3d77091dfe2da29d02e66a930b780f77dce7362509695133b8da1ea80
+size 1182447

models/subword_markov/as_markov_ctx1_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "subword",
   "language": "as",
-  "unique_contexts": 5555,
-  "total_transitions": 65140856
+  "unique_contexts": 14852,
+  "total_transitions": 39716105
 }

models/subword_markov/as_markov_ctx2_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:072a0a77d5f9b149b82195090a2f7ec01b569c51640f1c61e11c74d314891b20
-size 2244616
+oid sha256:885badaceaa6a76e70eec40f59892815b915e219ae00559881567d3b989d3ef2
+size 7340754

models/subword_markov/as_markov_ctx2_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "subword",
   "language": "as",
-  "unique_contexts": 44224,
-  "total_transitions": 65120064
+  "unique_contexts": 180337,
+  "total_transitions": 39696289
 }

models/subword_markov/as_markov_ctx3_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d7c22336529a0ded174b0241383329d3154a9858bd06639bd78f20c1e786cf00
-size 10618800
+oid sha256:fa1ba33509e5c790fdf467b449165c1f4accd50771e716b68c45863725e8653c
+size 28776160

models/subword_markov/as_markov_ctx3_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "subword",
   "language": "as",
-  "unique_contexts": 296427,
-  "total_transitions": 65099272
+  "unique_contexts": 962394,
+  "total_transitions": 39676473
 }

models/subword_markov/as_markov_ctx4_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fba617be292f432879d2b6f88a252644b452cbe07fc436c4d4b98898b0dd2f33
-size 34480970
+oid sha256:f3c3fed04aa29c942baa4ca8eff754698cf97cb5a32b6ba2a80b814a80269e95
+size 82473540

models/subword_markov/as_markov_ctx4_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "subword",
   "language": "as",
-  "unique_contexts": 1337967,
-  "total_transitions": 65078480
+  "unique_contexts": 3350979,
+  "total_transitions": 39656657
 }

models/subword_ngram/as_2gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:13f2a80a67d004f512051da3fb0f08c9a906b943f4d8c6e170756448689a3095
-size 206387
+oid sha256:305bff8b4892b68a20e77689ddf5bbd66be7385e136f504ab07ed4ec09661b44
+size 935875

models/subword_ngram/as_2gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "subword",
   "language": "as",
-  "unique_ngrams": 14933,
-  "total_ngrams": 65140856
+  "unique_ngrams": 62544,
+  "total_ngrams": 39716105
 }

models/subword_ngram/as_3gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:566ad2045c2c52180f92dc29b097fad25c84b062a65319808f9431fad7f538cb
-size 1714165
+oid sha256:67472e7d9683c3e8802771c68e0afe1ec9b79cf06fa64d62f682208c1565fa8f
+size 5569851

models/subword_ngram/as_3gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "subword",
   "language": "as",
-  "unique_ngrams": 135792,
-  "total_ngrams": 65120064
+  "unique_ngrams": 364128,
+  "total_ngrams": 39696289
 }

models/subword_ngram/as_4gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:645ec172809e1ba4475428602980c6d539af311c472d34ab195bc24111837333
-size 9556409
+oid sha256:5c4153dc76e2de363499d7501db843a99bc9685c1f78f1712d49767c2d3edb5e
+size 23557608

models/subword_ngram/as_4gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "subword",
   "language": "as",
-  "unique_ngrams": 727754,
-  "total_ngrams": 65099272
+  "unique_ngrams": 1477005,
+  "total_ngrams": 39676473
 }

models/tokenizer/as_tokenizer_16k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:92c4a0d6d3435730960b337cbe00d6561bb754c82a66840b10b3effdf114a9c0
-size 627322
+oid sha256:75647425d4994600927e38e762eb4e12bd76a7d759b998a6b27d3ef72fb92134
+size 629826

models/tokenizer/as_tokenizer_16k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/as_tokenizer_32k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cfd4440400274db9e0e200c32c56c9b2f8b5d1e4a80cd88b665224da4685840d
-size 1042352
+oid sha256:d675824377e9be300eedb755a0221aa145be09b86d891593a1c2428484b63864
+size 1041081

models/tokenizer/as_tokenizer_32k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/as_tokenizer_64k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f1a070ac3d05309f6fc98145f97c0714ee5da0ecebf4b99231acd5001077e67f
-size 1892860
+oid sha256:b3a5aae89065887f7e21c46af47bbb046edc46cdf2b663af9a2c1ae34034851f
+size 1870438

models/tokenizer/as_tokenizer_64k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/as_tokenizer_8k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a3ac9ae44b7976a9a581b8c12e38d04180fcbc38165d660d6b96b537c7954700
-size 427062
+oid sha256:dd471a819995e56dd4aefb6525b4fb7cb54d7cb5e996fa7e85acdd00b8586c74
+size 429491

models/tokenizer/as_tokenizer_8k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/vocabulary/as_vocabulary.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:afd7c512db097eb79bf2751927085a424d4d4be0b149ff56f3b19b0846f20e25
-size 1236046
+oid sha256:55714b8ff99e197cb70a99feadc4f2eb8b928c63ec225a2ec5cce487f9b36d0d
+size 3925398

models/vocabulary/as_vocabulary_metadata.json CHANGED

@@ -1,16 +1,17 @@
 {
   "language": "as",
-  "vocabulary_size": 69383,
+  "vocabulary_size": 219027,
+  "variant": "full",
   "statistics": {
-    "type_token_ratio": 0.007789323101723796,
+    "type_token_ratio": 0.059760376668044714,
     "coverage": {
-      "top_100": 0.7824795758310844,
-      "top_1000": 0.9345050339631166,
-      "top_5000": 0.9733725115168742,
-      "top_10000": 0.9816343824731659
+      "top_100": 0.2446919080599598,
+      "top_1000": 0.48975445539765006,
+      "top_5000": 0.6928702663989404,
+      "top_10000": 0.7681790167555828
     },
-    "hapax_count": 108278,
-    "hapax_ratio": 0.6094640917252521,
-    "total_documents": 20792
+    "hapax_count": 314664,
+    "hapax_ratio": 0.5895995997684053,
+    "total_documents": 19816
   }
 }
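
The vocabulary statistics on both sides of this hunk look internally consistent under one reading: `hapax_ratio` matches `hapax_count / (vocabulary_size + hapax_count)`, which would mean `vocabulary_size` excludes words seen only once. This is an inference from the numbers, not documented pipeline behaviour. A small sketch that checks it, with the hypothetical helper `implied_hapax_ratio` and values copied from the hunk:

```python
def implied_hapax_ratio(vocabulary_size: int, hapax_count: int) -> float:
    """Share of hapax legomena among all word types.

    Assumes (not confirmed by the source) that vocabulary_size excludes hapaxes.
    """
    return hapax_count / (vocabulary_size + hapax_count)


# (vocabulary_size, hapax_count, reported hapax_ratio) as shown in the diff.
old_stats = (69383, 108278, 0.6094640917252521)
new_stats = (219027, 314664, 0.5895995997684053)

for label, (vocab, hapax, reported) in (("old", old_stats), ("new", new_stats)):
    implied = implied_hapax_ratio(vocab, hapax)
    print(f"{label}: implied={implied:.10f} reported={reported:.10f}")
```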

models/word_markov/as_markov_ctx1_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9d6086bc53947c43f34edcda8e0a034c9779c3d6c44de35c31cc2ba88f8aa774
-size 9625358
+oid sha256:6dee794f4598493fa43d8881826c834158cd1924e7bf01f8b09883966c2c4551
+size 51785769

models/word_markov/as_markov_ctx1_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "word",
   "language": "as",
-  "unique_contexts": 178570,
-  "total_transitions": 41886369
+  "unique_contexts": 533621,
+  "total_transitions": 8910700
 }

models/word_markov/as_markov_ctx2_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1542cf7905c52aefaff0db643b5107e37e6e10fccc17a863ba95e7ee172740fd
-size 27241048
+oid sha256:3e48a73c55b3245526277a461699b91064b48048043751385238d1834df5c660
+size 152841623

models/word_markov/as_markov_ctx2_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "word",
   "language": "as",
-  "unique_contexts": 1073346,
-  "total_transitions": 41865578
+  "unique_contexts": 4157114,
+  "total_transitions": 8890884
 }

models/word_markov/as_markov_ctx3_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:09d22e9d5414b02524a41246b9410c78ab3bf52bcade25ce445168e3445ee4d8
-size 63382094
+oid sha256:922936a8c2bed9b31b58c6a39ab1ac62bbb87ba920bce8481b34b94b18e54d6c
+size 221316941

models/word_markov/as_markov_ctx3_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "word",
   "language": "as",
-  "unique_contexts": 2893182,
-  "total_transitions": 41844788
+  "unique_contexts": 7059669,
+  "total_transitions": 8871068
 }

models/word_markov/as_markov_ctx4_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:05e3849b2bcb7ebe1f9ae63c37ed5ea9973336b012b7a2c60020150cd53b6a7d
-size 124869442
+oid sha256:ec28d09ad62e65c263cebddcd4aa013d595cde5d3a84b87806a2678142794293
+size 264956691

models/word_markov/as_markov_ctx4_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "word",
   "language": "as",
-  "unique_contexts": 5984484,
-  "total_transitions": 41824000
+  "unique_contexts": 8117676,
+  "total_transitions": 8851252
 }
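
In the new metadata, `total_transitions` falls by the same amount each time the context grows by one, and that amount equals the new `total_documents` value (19816) from the vocabulary metadata. One plausible reading, which is an assumption rather than documented behaviour, is that sequences are counted per document, so each extra context token costs one transition per document. A sketch that checks the arithmetic, using the hypothetical helper `per_order_drop` and values copied from this diff:

```python
total_documents = 19816  # new value from as_vocabulary_metadata.json

# New total_transitions by context size, copied from the metadata hunks.
word_transitions = {1: 8910700, 2: 8890884, 3: 8871068, 4: 8851252}
subword_transitions = {1: 39716105, 2: 39696289, 3: 39676473, 4: 39656657}


def per_order_drop(totals: dict) -> list:
    """Differences between totals at consecutive context sizes."""
    orders = sorted(totals)
    return [totals[a] - totals[b] for a, b in zip(orders, orders[1:])]


for name, totals in (("word", word_transitions), ("subword", subword_transitions)):
    drops = per_order_drop(totals)
    matches = all(d == total_documents for d in drops)
    print(f"{name}: drops={drops} all_equal_total_documents={matches}")
```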

models/word_ngram/as_2gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:46ca72b889b191be977809c46f9e5c601830c691249878aa061223811c6c065a
-size 1981156
+oid sha256:fedc9b5fb35b14c97554711e0c15f091fd551fc13bd0f58c65e3736a27873618
+size 4559205

models/word_ngram/as_2gram_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "word",
   "language": "as",
-  "unique_ngrams": 144083,
-  "total_ngrams": 41886369
+  "unique_ngrams": 198049,
+  "total_ngrams": 8910700
 }

models/word_ngram/as_3gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9b16c54cf2957c0d5c96809a5a7c39da4fe7c1dd6c13b3345c3073e0cc4a9da1
-size 8106622
+oid sha256:4d4d06c699b30c3af5e1b2c5d4e9d86cf2bcddb46b24cacb7d1cdac931cdb9c0
+size 6432611

models/word_ngram/as_3gram_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "word",
   "language": "as",
-  "unique_ngrams": 558435,
-  "total_ngrams": 41865578
+  "unique_ngrams": 226215,
+  "total_ngrams": 8890884
 }

models/word_ngram/as_4gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c567cb651bd6273d82124f43b6ac4426b9ece8a16302648d995617ab25da8d18
-size 26163353