omarkamali committed on Jan 3

Commit

7ca7ea7

verified ·

1 Parent(s): 40d9b30

Upload all models and assets for arz (20251001)

This view is limited to 50 files because it contains too many changes.

Files changed (50)

README.md +297 -147
models/embeddings/monolingual/arz_128d.bin +2 -2
models/embeddings/monolingual/arz_128d_metadata.json +5 -3
models/embeddings/monolingual/arz_32d.bin +2 -2
models/embeddings/monolingual/arz_32d_metadata.json +5 -3
models/embeddings/monolingual/arz_64d.bin +2 -2
models/embeddings/monolingual/arz_64d_metadata.json +5 -3
models/subword_markov/arz_markov_ctx1_subword.parquet +2 -2
models/subword_markov/arz_markov_ctx1_subword_metadata.json +2 -2
models/subword_markov/arz_markov_ctx2_subword.parquet +2 -2
models/subword_markov/arz_markov_ctx2_subword_metadata.json +2 -2
models/subword_markov/arz_markov_ctx3_subword.parquet +2 -2
models/subword_markov/arz_markov_ctx3_subword_metadata.json +2 -2
models/subword_markov/arz_markov_ctx4_subword.parquet +2 -2
models/subword_markov/arz_markov_ctx4_subword_metadata.json +2 -2
models/subword_ngram/arz_2gram_subword.parquet +2 -2
models/subword_ngram/arz_2gram_subword_metadata.json +2 -2
models/subword_ngram/arz_3gram_subword.parquet +2 -2
models/subword_ngram/arz_3gram_subword_metadata.json +2 -2
models/subword_ngram/arz_4gram_subword.parquet +2 -2
models/subword_ngram/arz_4gram_subword_metadata.json +2 -2
models/tokenizer/arz_tokenizer_16k.model +2 -2
models/tokenizer/arz_tokenizer_16k.vocab +0 -0
models/tokenizer/arz_tokenizer_32k.model +2 -2
models/tokenizer/arz_tokenizer_32k.vocab +0 -0
models/tokenizer/arz_tokenizer_64k.model +2 -2
models/tokenizer/arz_tokenizer_64k.vocab +0 -0
models/tokenizer/arz_tokenizer_8k.model +2 -2
models/tokenizer/arz_tokenizer_8k.vocab +0 -0
models/vocabulary/arz_vocabulary.parquet +2 -2
models/vocabulary/arz_vocabulary_metadata.json +10 -9
models/word_markov/arz_markov_ctx1_word.parquet +2 -2
models/word_markov/arz_markov_ctx1_word_metadata.json +2 -2
models/word_markov/arz_markov_ctx2_word.parquet +2 -2
models/word_markov/arz_markov_ctx2_word_metadata.json +2 -2
models/word_markov/arz_markov_ctx3_word.parquet +2 -2
models/word_markov/arz_markov_ctx3_word_metadata.json +2 -2
models/word_markov/arz_markov_ctx4_word.parquet +2 -2
models/word_markov/arz_markov_ctx4_word_metadata.json +2 -2
models/word_ngram/arz_2gram_word.parquet +2 -2
models/word_ngram/arz_2gram_word_metadata.json +2 -2
models/word_ngram/arz_3gram_word.parquet +2 -2
models/word_ngram/arz_3gram_word_metadata.json +2 -2
models/word_ngram/arz_4gram_word.parquet +2 -2
models/word_ngram/arz_4gram_word_metadata.json +2 -2
visualizations/embedding_isotropy.png +0 -0
visualizations/embedding_norms.png +0 -0
visualizations/embedding_similarity.png +2 -2
visualizations/markov_branching.png +0 -0
visualizations/markov_contexts.png +0 -0

README.md CHANGED

@@ -23,14 +23,14 @@ dataset_info:
 metrics:
   - name: best_compression_ratio
     type: compression
-    value: 4.023
   - name: best_isotropy
     type: isotropy
-    value: 0.8067
   - name: vocabulary_size
     type: vocab
-    value: 1000000
-generated: 2025-12-27
 ---
 # Egyptian Arabic - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
-- N-gram models (2, 3, 4-gram)
-- Markov chains (context of 1, 2, 3 and 4)
 - Subword N-gram and Markov chains
-- Embeddings in various sizes and dimensions
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
-- [6. Summary & Recommendations](#6-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
@@ -68,64 +70,57 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
-| **8k** | 3.017x | 2.93 | 0.4317% | 1,742,125 |
-| **16k** | 3.365x | 3.27 | 0.4814% | 1,562,212 |
-| **32k** | 3.703x | 3.60 | 0.5297% | 1,419,540 |
-| **64k** | 4.023x 🏆 | 3.91 | 0.5755% | 1,306,653 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
-**Sample 1:** `دونج سارى فنان من كامبوديا.
- حياته
-دونج سارى من مواليد يوم 1 يناير 1957.
- لين...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁دونج ▁س ارى ▁فنان ▁من ▁كامب ود يا . ▁حياته ... (+28 more)` | 38 |
-| 16k | `▁دونج ▁س ارى ▁فنان ▁من ▁كامب وديا . ▁حياته ▁دونج ... (+26 more)` | 36 |
-| 32k | `▁دونج ▁سارى ▁فنان ▁من ▁كامبوديا . ▁حياته ▁دونج ▁سارى ▁من ... (+22 more)` | 32 |
-| 64k | `▁دونج ▁سارى ▁فنان ▁من ▁كامبوديا . ▁حياته ▁دونج ▁سارى ▁من ... (+22 more)` | 32 |
-**Sample 2:** `بيريباروس ( الاسم العلمى: Periparus ) هوا جنس من الطيور بيتبع قرقفيات.
- لينكات ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁بير يب اروس ▁( ▁الاسم ▁العلم ى : ▁per ip ... (+36 more)` | 46 |
-| 16k | `▁بير يب اروس ▁( ▁الاسم ▁العلمى : ▁per ip arus ... (+31 more)` | 41 |
-| 32k | `▁بير يب اروس ▁( ▁الاسم ▁العلمى : ▁per ip arus ... (+27 more)` | 37 |
-| 64k | `▁بير يب اروس ▁( ▁الاسم ▁العلمى : ▁per ip arus ... (+26 more)` | 36 |
-**Sample 3:** `قلى بيغلو ( بالفارسى قلیبیگلو ) قريه فى ايران.
- لينكات برانيه
- سبب التسميه
-...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁ق لى ▁بي غل و ▁( ▁بالفارسى ▁قل ی ب ... (+21 more)` | 31 |
-| 16k | `▁ق لى ▁بي غلو ▁( ▁بالفارسى ▁قل ی ب ی ... (+20 more)` | 30 |
-| 32k | `▁ق لى ▁بي غلو ▁( ▁بالفارسى ▁قل ی بی گ ... (+19 more)` | 29 |
-| 64k | `▁ق لى ▁بي غلو ▁( ▁بالفارسى ▁قل ی بی گ ... (+19 more)` | 29 |
 ### Key Findings
-- **Best Compression:** 64k achieves 4.023x compression
-- **Lowest UNK Rate:** 8k with 0.4317% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
@@ -134,57 +129,89 @@ Below are sample sentences tokenized with each vocabulary size:
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
-| N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
-|--------|------------|---------|----------------|------------------|-------------------|
-| **2-gram** | 6,568 🏆 | 12.68 | 1,353,863 | 27.8% | 64.6% |
-| **2-gram** | 384 🏆 | 8.58 | 15,875 | 59.5% | 97.6% |
-| **3-gram** | 11,727 | 13.52 | 2,306,233 | 23.3% | 59.5% |
-| **3-gram** | 2,432 | 11.25 | 170,638 | 29.7% | 70.0% |
-| **4-gram** | 21,664 | 14.40 | 4,880,347 | 21.3% | 54.9% |
-| **4-gram** | 8,554 | 13.06 | 1,045,558 | 20.0% | 54.3% |
 ### Top 5 N-grams by Size
-**2-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `تصنيف :` | 4,787,649 |
-| 2 | `مصادر تصنيف` | 1,597,448 |
-| 3 | `لينكات برانيه` | 1,294,464 |
-| 4 | `برانيه مصادر` | 1,167,374 |
-| 5 | `من مواليد` | 829,312 |
-**3-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `مصادر تصنيف :` | 1,597,447 |
-| 2 | `برانيه مصادر تصنيف` | 1,165,016 |
-| 3 | `لينكات برانيه مصادر` | 1,164,743 |
-| 4 | `من مواليد يوم` | 809,006 |
-| 5 | `. المطلع المستقيم` | 668,795 |
-**4-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `برانيه مصادر تصنيف :` | 1,165,016 |
-| 2 | `لينكات برانيه مصادر تصنيف` | 1,162,419 |
-| 3 | `. لينكات برانيه مصادر` | 558,227 |
-| 4 | `الدايره الساعيه لجرم سماوى` | 445,892 |
-| 5 | `خط الاستوا السماوى تكون` | 445,860 |
 ### Key Findings
-- **Best Perplexity:** 2-gram with 384
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
-- **Coverage:** Top-1000 patterns cover ~54% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 ---
@@ -192,55 +219,86 @@ Below are sample sentences tokenized with each vocabulary size:
 ![Markov Entropy](visualizations/markov_entropy.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
-| Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
-|---------|-------------|------------|------------------|-----------------|----------------|
-| **1** | 0.8482 | 1.800 | 6.43 | 2,177,526 | 15.2% |
-| **1** | 1.2448 | 2.370 | 9.40 | 4,455 | 0.0% |
-| **2** | 0.3675 | 1.290 | 2.04 | 13,972,922 | 63.2% |
-| **2** | 0.9830 | 1.977 | 7.64 | 41,887 | 1.7% |
-| **3** | 0.1212 | 1.088 | 1.34 | 28,492,247 | 87.9% |
-| **3** | 0.9327 | 1.909 | 5.31 | 320,101 | 6.7% |
-| **4** | 0.0720 🏆 | 1.051 | 1.20 | 38,092,155 | 92.8% |
-| **4** | 0.7558 🏆 | 1.689 | 3.71 | 1,700,014 | 24.4% |
-### Generated Text Samples
-Below are text samples generated from each Markov chain model:
 **Context Size 1:**
-1. `. شوف كمان رواد الشركات فى كليه سانت لورانس و كيكرز . 0 . السرعه الشعاعيه`
-2. `: مهندسين من تشيكوسلوفاكيا . hubble census finds galaxies at the ottoman empire and historical demog...`
-3. `تصنيف : anguilla australis ) و خط الاستوا السماوى تكون قيمة بعده بالموجب ( ) بيتر`
 **Context Size 2:**
-1. `تصنيف : لاعب كريكت من دومينيكا . لينكات برانيه مصادر تصنيف : قسيس كاتوليك من مملكة نيديرلاند`
-2. `مصادر تصنيف : مسطحات مائيه فى كندا . جغرافيا نهر وايتاهايا بيصب فى نهر اوبى . اقرا`
-3. `لينكات برانيه مصادر تصنيف : ناس تصنيف : عالم لاهوت من المانيا ) جورج فرانسيس ماكجينيس من`
 **Context Size 3:**
-1. `مصادر تصنيف : موسم رياضى فى كورة قدم اتعمل فى پورتوجال سنة 2007 . لينكات برانيه مصادر تصنيف`
-2. `برانيه مصادر تصنيف : ممثلين من البرازيل تصنيف : متعلمين فى جامعة ڤيرچينيا كومونولث و جامعة ڤيرچينيا ...`
-3. `لينكات برانيه مصادر تصنيف : سياسيين تصنيف : سياسيين تصنيف : سياسيين من تشيكيا تصنيف : ناس لسا`
 **Context Size 4:**
-1. `برانيه مصادر تصنيف : قرى اليمن تصنيف : قرى يافع`
-2. `لينكات برانيه مصادر تصنيف : مغنيين تصنيف : مغنيين امريكان تصنيف : منتجين تيليڤزيون تصنيف : مخرجين تي...`
-3. `. لينكات برانيه مصادر تصنيف : مسطحات مائيه فى فينلاندا`
 ### Key Findings
-- **Best Predictability:** Context-4 with 92.8% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
-- **Memory Trade-off:** Larger contexts require more storage (1,700,014 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
@@ -256,64 +314,64 @@ Below are text samples generated from each Markov chain model:
 | Metric | Value |
 |--------|-------|
-| Vocabulary Size | 1,000,000 |
-| Total Tokens | 132,694,568 |
-| Mean Frequency | 132.69 |
-| Median Frequency | 5 |
-| Frequency Std Dev | 10016.26 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | تصنيف | 4,791,431 |
-| 2 | فى | 4,424,323 |
-| 3 | من | 3,918,728 |
-| 4 | و | 3,519,108 |
-| 5 | مصادر | 1,613,116 |
-| 6 | لينكات | 1,360,026 |
-| 7 | برانيه | 1,299,627 |
-| 8 | مواليد | 1,130,480 |
-| 9 | هيا | 1,062,827 |
-| 10 | اللى | 967,453 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | 1513504 | 2 |
-| 2 | 412321 | 2 |
-| 3 | j08004409 | 2 |
-| 4 | 1545353 | 2 |
-| 5 | 1073083 | 2 |
-| 6 | 0566528 | 2 |
-| 7 | 2103522 | 2 |
-| 8 | 1579797 | 2 |
-| 9 | 2051684 | 2 |
-| 10 | 600674 | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
-| Zipf Coefficient | 1.2848 |
-| R² (Goodness of Fit) | 0.995303 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
-| Top 100 | 44.5% |
-| Top 1,000 | 75.8% |
-| Top 5,000 | 85.7% |
-| Top 10,000 | 88.8% |
 ### Key Findings
-- **Zipf Compliance:** R²=0.9953 indicates excellent adherence to Zipf's law
-- **High Frequency Dominance:** Top 100 words cover 44.5% of corpus
-- **Long Tail:** 990,000 words needed for remaining 11.2% coverage
 ---
 ## 5. Word Embeddings Evaluation
@@ -326,24 +384,113 @@ Below are text samples generated from each Markov chain model:
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
-### Model Comparison
-| Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
-|-------|------------|-----------|----------|----------|----------|
-| **mono_32d** | 573,439 | 32 | 5.170 | 1.489 | 0.8067 🏆 |
-| **mono_64d** | 573,439 | 64 | 5.558 | 1.367 | 0.7701 |
-| **mono_128d** | 573,439 | 128 | 5.982 | 1.280 | 0.7047 |
-| **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
 ### Key Findings
-- **Best Isotropy:** mono_32d with 0.8067 (more uniform distribution)
-- **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
-- **Vocabulary Coverage:** All models cover 573,439 words
-- **Recommendation:** 100d for balanced semantic capture and efficiency
 ---
-## 6. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
@@ -351,11 +498,12 @@ Below are text samples generated from each Markov chain model:
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
-| Tokenizer | **32k BPE** | Best compression (4.02x) with low UNK rate |
-| N-gram | **5-gram** | Lowest perplexity (384) |
-| Markov | **Context-4** | Highest predictability (92.8%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
@@ -545,7 +693,8 @@ If you use these models in your research, please cite:
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
-  publisher = {HuggingFace},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
@@ -561,7 +710,8 @@ MIT License - Free for academic and commercial use.
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 ---
 *Generated by Wikilangs Models Pipeline*
-*Report Date: 2025-12-27 18:23:46*

 metrics:
   - name: best_compression_ratio
     type: compression
+    value: 3.905
   - name: best_isotropy
     type: isotropy
+    value: 0.7897
   - name: vocabulary_size
     type: vocab
+    value: 0
+generated: 2026-01-03
 ---
 # Egyptian Arabic - Wikilangs Models
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
+- N-gram models (2, 3, 4, 5-gram)
+- Markov chains (context of 1, 2, 3, 4 and 5)
 - Subword N-gram and Markov chains
+- Embeddings in various sizes and dimensions (aligned and unaligned)
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
+- [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+- [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
+![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+![Tokenizer OOV](visualizations/tokenizer_oov.png)
+![Total Tokens](visualizations/tokenizer_total_tokens.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
+| **8k** | 2.876x | 2.88 | 0.8210% | 1,709,035 |
+| **16k** | 3.215x | 3.22 | 0.9180% | 1,528,463 |
+| **32k** | 3.559x | 3.56 | 1.0163% | 1,380,735 |
+| **64k** | 3.905x 🏆 | 3.91 | 1.1149% | 1,258,558 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
+**Sample 1:** `تاملكوت هوا دوار فى المغرب. المكان تاملكوت موجود فى منطقه اداريه اسمها تماسين. س...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁تام لك وت ▁هوا ▁دوار ▁فى ▁المغرب . ▁المك ان ... (+24 more)` | 34 |
+| 16k | `▁تام لك وت ▁هوا ▁دوار ▁فى ▁المغرب . ▁المك ان ... (+24 more)` | 34 |
+| 32k | `▁تام لك وت ▁هوا ▁دوار ▁فى ▁المغرب . ▁المكان ▁تام ... (+23 more)` | 33 |
+| 64k | `▁تام لك وت ▁هوا ▁دوار ▁فى ▁المغرب . ▁المكان ▁تام ... (+23 more)` | 33 |
+**Sample 2:** `جيريمى ديفيدسون مخرج افلام من امريكا. حياته جيريمى ديفيدسون من مواليد يوم 24 ديس...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁جير يمى ▁ديفيد سون ▁مخرج ▁افلام ▁من ▁امريكا . ▁حياته ... (+23 more)` | 33 |
+| 16k | `▁جيريمى ▁ديفيد سون ▁مخرج ▁افلام ▁من ▁امريكا . ▁حياته ▁جيريمى ... (+21 more)` | 31 |
+| 32k | `▁جيريمى ▁ديفيدسون ▁مخرج ▁افلام ▁من ▁امريكا . ▁حياته ▁جيريمى ▁ديفيدسون ... (+19 more)` | 29 |
+| 64k | `▁جيريمى ▁ديفيدسون ▁مخرج ▁افلام ▁من ▁امريكا . ▁حياته ▁جيريمى ▁ديفيدسون ... (+19 more)` | 29 |
+**Sample 3:** `ابهينايا ممثله من الهند. حياتها ابهينايا من مواليد يوم 13 نوفمبر سنة فى كارناتاك...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁اب ه ينا يا ▁ممثله ▁من ▁الهند . ▁حياتها ▁اب ... (+28 more)` | 38 |
+| 16k | `▁اب ه ينا يا ▁ممثله ▁من ▁الهند . ▁حياتها ▁اب ... (+27 more)` | 37 |
+| 32k | `▁ابه ينا يا ▁ممثله ▁من ▁الهند . ▁حياتها ▁ابه ينا ... (+25 more)` | 35 |
+| 64k | `▁ابه ينا يا ▁ممثله ▁من ▁الهند . ▁حياتها ▁ابه ينا ... (+24 more)` | 34 |
 ### Key Findings
+- **Best Compression:** 64k achieves 3.905x compression
+- **Lowest UNK Rate:** 8k with 0.8210% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
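
The updated tokenizer figures can be cross-checked against each other. The sketch below assumes that "Compression" means evaluation-text units per token (an assumption; this excerpt does not define the metric). Under that reading, compression × total tokens should be roughly constant across vocabulary sizes. The numbers are copied from the new results table; the variable names are illustrative only.

```python
# Values copied from the updated tokenizer results table.
# Assumption: "compression" is text units per token, so
# compression * total_tokens approximates the size of the evaluation text.
results = {
    "8k": (2.876, 1_709_035),
    "16k": (3.215, 1_528_463),
    "32k": (3.559, 1_380_735),
    "64k": (3.905, 1_258_558),
}

implied_size = {name: ratio * tokens for name, (ratio, tokens) in results.items()}

# Relative gap between the largest and smallest implied evaluation-text size.
smallest = min(implied_size.values())
spread = (max(implied_size.values()) - smallest) / smallest

for name, size in implied_size.items():
    print(f"{name}: implied evaluation size ~ {size:,.0f}")
print(f"relative spread: {spread:.4%}")
```

A small spread suggests the four rows were measured on the same evaluation text, so the compression column and the token-count column carry the same information.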
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
+![N-gram Unique](visualizations/ngram_unique.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
+| N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+|--------|---------|------------|---------|----------------|------------------|-------------------|
+| **2-gram** | Word | 5,793 | 12.50 | 1,073,861 | 30.2% | 66.4% |
+| **2-gram** | Subword | 316 🏆 | 8.30 | 15,451 | 62.6% | 98.6% |
+| **3-gram** | Word | 8,299 | 13.02 | 1,682,809 | 28.5% | 62.7% |
+| **3-gram** | Subword | 2,021 | 10.98 | 129,923 | 30.1% | 74.0% |
+| **4-gram** | Word | 12,842 | 13.65 | 3,054,922 | 27.3% | 59.4% |
+| **4-gram** | Subword | 7,215 | 12.82 | 788,718 | 19.6% | 56.9% |
 ### Top 5 N-grams by Size
+**2-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `لينكات برانيه` | 1,293,684 |
+| 2 | `برانيه مصادر` | 1,167,581 |
+| 3 | `من مواليد` | 829,322 |
+| 4 | `مواليد يوم` | 809,177 |
+| 5 | `الاستوا السماوى` | 668,876 |
+**3-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `لينكات برانيه مصادر` | 1,164,952 |
+| 2 | `من مواليد يوم` | 809,029 |
+| 3 | `خط الاستوا السماوى` | 630,228 |
+| 4 | `الدايره الساعيه لجرم` | 445,892 |
+| 5 | `الساعيه لجرم سماوى` | 445,892 |
+**4-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `الدايره الساعيه لجرم سماوى` | 445,892 |
+| 2 | `السماوى تكون قيمة بعده` | 445,860 |
+| 3 | `خط الاستوا السماوى تكون` | 445,860 |
+| 4 | `الاستوا السماوى تكون قيمة` | 445,860 |
+| 5 | `لينكات برانيه مصادر من` | 320,727 |
+**2-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `_ ا` | 31,144,333 |
+| 2 | `ا ل` | 30,224,243 |
+| 3 | `ه _` | 17,180,633 |
+| 4 | `_ م` | 13,559,836 |
+| 5 | `ى _` | 11,805,719 |
+**3-grams (Subword):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `_ ا ل` | 25,116,125 |
+| 2 | `ي ه _` | 6,396,587 |
+| 3 | `ه _ ا` | 6,346,797 |
+| 4 | `ا ل م` | 5,946,692 |
+| 5 | `_ م ن` | 4,537,386 |
+**4-grams (Subword):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `ه _ ا ل` | 5,297,759 |
+| 2 | `_ ا ل م` | 5,200,038 |
+| 3 | `_ ف ى _` | 4,251,301 |
+| 4 | `_ م ن _` | 3,906,606 |
+| 5 | `_ ا ل ا` | 3,578,656 |
 ### Key Findings
+- **Best Perplexity:** 2-gram (subword) with 316
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
+- **Coverage:** Top-1000 patterns cover ~57% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
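
The Perplexity and Entropy columns in the updated table appear to be linked by the usual identity perplexity = 2^entropy, with entropy in bits. This is inferred from the numbers, not stated in this excerpt, so the check below is a hedged sketch; the values are copied from the new table and `perplexity_from_entropy` is a hypothetical helper name.

```python
# (perplexity, entropy) pairs copied from the updated n-gram results table.
rows = {
    ("2-gram", "Word"): (5793, 12.50),
    ("2-gram", "Subword"): (316, 8.30),
    ("3-gram", "Word"): (8299, 13.02),
    ("3-gram", "Subword"): (2021, 10.98),
    ("4-gram", "Word"): (12842, 13.65),
    ("4-gram", "Subword"): (7215, 12.82),
}


def perplexity_from_entropy(entropy_bits: float) -> float:
    """Assumed relation: perplexity = 2 ** entropy, with entropy in bits."""
    return 2.0 ** entropy_bits


# Relative difference between the reported and the recomputed perplexity.
rel_errors = {
    key: abs(perplexity_from_entropy(entropy) - perplexity) / perplexity
    for key, (perplexity, entropy) in rows.items()
}

for key, err in rel_errors.items():
    print(key, f"{err:.3%}")
```

The remaining differences are consistent with the entropy values being rounded to two decimals.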
 ---
 ![Markov Entropy](visualizations/markov_entropy.png)
+![Markov Contexts](visualizations/markov_contexts.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
+| Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+|---------|---------|-------------|------------|------------------|-----------------|----------------|
+| **1** | Word | 1.2217 | 2.332 | 9.13 | 1,353,062 | 0.0% |
+| **1** | Subword | 1.0533 | 2.075 | 8.28 | 5,726 | 0.0% |
+| **2** | Word | 0.3648 | 1.288 | 1.91 | 12,336,484 | 63.5% |
+| **2** | Subword | 0.7848 | 1.723 | 5.54 | 47,379 | 21.5% |
+| **3** | Word | 0.1139 | 1.082 | 1.28 | 23,517,673 | 88.6% |
+| **3** | Subword | 0.7673 | 1.702 | 4.73 | 262,420 | 23.3% |
+| **4** | Word | 0.0625 🏆 | 1.044 | 1.17 | 29,894,419 | 93.7% |
+| **4** | Subword | 0.7433 | 1.674 | 3.81 | 1,241,425 | 25.7% |
+### Generated Text Samples (Word-based)
+Below are text samples generated from each word-based Markov chain model:
 **Context Size 1:**
+1. `فى العالم حسب المساحه لستة اكبر بحيرات اوروبا لينكات مصادر من مملكه ايطاليا حياته الرياضيه بيلعب`
+2. `من مواليد يوم 16 يناير سنة فى ذا ماتشيس بتقدم الانواع الفنيه كانت دى لوبو من`
+3. `و بتنقاس بالانزياح الاحمر المطلع المستقيم ممكن يتقاس بقوس دايره الاستواء السماويه من الجرى و نادى`
 **Context Size 2:**
+1. `لينكات برانيه مصادر عجل ناريه من المانيا حياته اليكساندر انتونوڤيتش ريزونى اليكساندر انستروثير اليكس...`
+2. `برانيه مصادر كوره قدم من الميكسيك حياته اڤير كاباليرو اڤيرالدو فيريرا لاعب كورة قدم من اليابان حياته`
+3. `من مواليد يوم 19 اغسطس لسا عايشين فى استانبول لينكات برانيه مصادر هوكى الجليد من امريكا حياته`
 **Context Size 3:**
+1. `لينكات برانيه مصادر سكان سكان فى ايران المكان ادم درهسى عليا adam darrehsi ye olya هيا تجمع سكان`
+2. `من مواليد يوم 7 ديسمبر فى مونتفيدو الحياه الرياضيه بيلعب فى مركز مُدَافِع و لعب مع فريق ريال`
+3. `خط الاستوا السماوى تكون قيمة بعده بالموجب و لو النجم جنوب خط الاستوا السماوى لو كان النجم شمال`
 **Context Size 4:**
+1. `الدايره الساعيه لجرم سماوى و الدايره الساعيه لنقطة الاعتدال الربيعى المطلع المستقيم ممكن يتقاس بقوس ...`
+2. `الاستوا السماوى تكون قيمة بعده بالسالب مصادر كوكبه`
+3. `السماوى تكون قيمة بعده بالسالب مصادر 2ماس كوكبه`
+### Generated Text Samples (Subword-based)
+Below are text samples generated from each subword-based Markov chain model:
+**Context Size 1:**
+1. `_مرا_ارو_لسيه_س_`
+2. `الدريه_اثر_كالال`
+3. `لخطونالمطة_جو_عب`
+**Context Size 2:**
+1. `_الكريتالسمات_فى_`
+2. `اليه_عاعيه_مطلحجم`
+3. `ه_بقه_ليكا_بقوى_ا`
+**Context Size 3:**
+1. `_المستقيم_محمد_بيس`
+2. `يه_مصادر_كورة_قدم_`
+3. `ه_العقبت_برات_السم`
+**Context Size 4:**
+1. `ه_السماوى_مع_فريق_ن`
+2. `_المكافئ_الفلك._الم`
+3. `_فى_باردوه_مصادر_اس`
 ### Key Findings
+- **Best Predictability:** Context-4 (word) with 93.7% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
+- **Memory Trade-off:** Larger contexts require more storage (1,241,425 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
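
To make "context size" and "branching factor" concrete, here is a minimal sketch of a context-k Markov chain over tokens. It is not the Wikilangs pipeline: the function names, the toy corpus, and the definition of branching factor as the average number of distinct continuations per context are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def build_markov(tokens, context_size):
    """Map each context (a tuple of `context_size` tokens) to a Counter of next tokens."""
    transitions = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        transitions[context][tokens[i + context_size]] += 1
    return transitions


def branching_factor(transitions):
    """Average number of distinct continuations per context (assumed definition)."""
    return sum(len(nexts) for nexts in transitions.values()) / len(transitions)


def generate(transitions, seed, length, rng):
    """Sample up to `length` tokens, starting from the context `seed`."""
    output = list(seed)
    context = tuple(seed)
    for _ in range(length):
        options = transitions.get(context)
        if not options:
            break  # context never seen in training: stop early
        words, weights = zip(*options.items())
        output.append(rng.choices(words, weights=weights)[0])
        context = tuple(output[-len(seed):])
    return output


# Toy corpus, invented for this example.
tokens = "a b c a b d a b c".split()
model1 = build_markov(tokens, 1)
model2 = build_markov(tokens, 2)

print(branching_factor(model1), branching_factor(model2))
print(" ".join(generate(model2, ("a", "b"), 5, random.Random(0))))
```

Even on this toy corpus the longer context has fewer continuations per context but more distinct contexts to store, which is the trade-off the findings above describe.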
 ---
 | Metric | Value |
 |--------|-------|
+| Vocabulary Size | 856,070 |
+| Total Tokens | 116,711,182 |
+| Mean Frequency | 136.33 |
+| Median Frequency | 4 |
+| Frequency Std Dev | 9391.59 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | فى | 4,414,661 |
+| 2 | من | 3,909,776 |
+| 3 | و | 3,512,508 |
+| 4 | مصادر | 1,612,463 |
+| 5 | لينكات | 1,359,404 |
+| 6 | برانيه | 1,298,834 |
+| 7 | هيا | 1,062,266 |
+| 8 | اللى | 965,103 |
+| 9 | يوم | 853,034 |
+| 10 | مواليد | 836,295 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | algeriens | 2 |
+| 2 | وبتينا | 2 |
+| 3 | روتلُف | 2 |
+| 4 | bouabdellah | 2 |
+| 5 | الخُضرة | 2 |
+| 6 | impressionisms | 2 |
+| 7 | assyriaca | 2 |
+| 8 | جروكبيديا | 2 |
+| 9 | grokipedia | 2 |
+| 10 | grok | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
+| Zipf Coefficient | 1.2602 |
+| R² (Goodness of Fit) | 0.994644 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
+| Top 100 | 46.0% |
+| Top 1,000 | 76.7% |
+| Top 5,000 | 85.9% |
+| Top 10,000 | 89.0% |
 ### Key Findings
+- **Zipf Compliance:** R²=0.9946 indicates excellent adherence to Zipf's law
+- **High Frequency Dominance:** Top 100 words cover 46.0% of corpus
+- **Long Tail:** 846,070 words needed for remaining 11.0% coverage
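
The derived vocabulary figures follow from the table values by simple arithmetic. The sketch below recomputes them from the updated numbers; reading "Mean Frequency" as total tokens divided by vocabulary size is an assumption, although it matches the reported value.

```python
# Values copied from the updated vocabulary statistics and coverage tables.
vocabulary_size = 856_070
total_tokens = 116_711_182
top_10k_coverage = 89.0  # percent of the corpus covered by the top 10,000 words

# Assumed definition: mean frequency = total tokens / vocabulary size.
mean_frequency = total_tokens / vocabulary_size

# Words outside the top 10,000 and the share of the corpus they account for.
long_tail_words = vocabulary_size - 10_000
remaining_coverage = round(100.0 - top_10k_coverage, 1)

print(f"mean frequency: {mean_frequency:.2f}")
print(f"long tail: {long_tail_words:,} words for {remaining_coverage}% of the corpus")
```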
 ---
 ## 5. Word Embeddings Evaluation
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
+### 5.1 Cross-Lingual Alignment
+> *Note: Multilingual alignment visualization not available for this language.*
+### 5.2 Model Comparison
+| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+|-------|-----------|----------|------------------|---------------|----------------|
+| **mono_32d** | 32 | 0.7897 🏆 | 0.3482 | N/A | N/A |
+| **mono_64d** | 64 | 0.7690 | 0.2976 | N/A | N/A |
+| **mono_128d** | 128 | 0.7177 | 0.2526 | N/A | N/A |
 ### Key Findings
+- **Best Isotropy:** mono_32d with 0.7897 (more uniform distribution)
+- **Semantic Density:** Average pairwise similarity of 0.2995. Lower values indicate better semantic separation.
+- **Alignment Quality:** No aligned models evaluated in this run.
+- **Recommendation:** 128d aligned for best cross-lingual performance
+---
+## 6. Morphological Analysis (Experimental)
+> ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+### 6.1 Productivity & Complexity
+| Metric | Value | Interpretation | Recommendation |
+|--------|-------|----------------|----------------|
+| Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+| Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+### 6.2 Affix Inventory (Productive Units)
+These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
+#### Productive Prefixes
+| Prefix | Examples |
+|--------|----------|
+| `-ال` | الخوذ, المندوبين, الدمرداشيه |
+#### Productive Suffixes
+| Suffix | Examples |
+|--------|----------|
+| `-ين` | كلوكيرين, بيرجرين, المندوبين |
+| `-ان` | مالڤان, ملازمان, پايرلمان |
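
The affix criterion described above (stripping the unit must leave a stem that is itself attested) can be sketched as follows. This is a simplified illustration and not the pipeline's actual procedure: the helper names are hypothetical, and the toy vocabulary mixes words from the tables with bare stems added purely as an assumption so that the check has something to match.

```python
def productive_prefix_count(prefix, vocabulary):
    """Count words that start with `prefix` and whose remainder is also in the vocabulary."""
    return sum(
        1
        for word in vocabulary
        if word.startswith(prefix)
        and len(word) > len(prefix)
        and word[len(prefix):] in vocabulary
    )


def productive_suffix_count(suffix, vocabulary):
    """Count words that end with `suffix` and whose remainder is also in the vocabulary."""
    return sum(
        1
        for word in vocabulary
        if word.endswith(suffix)
        and len(word) > len(suffix)
        and word[:-len(suffix)] in vocabulary
    )


# Toy vocabulary: the unprefixed and unsuffixed forms are assumed to exist
# for illustration; they are not taken from the actual vocabulary file.
vocabulary = {"المندوبين", "مندوبين", "مندوب", "الخوذ", "خوذ", "پايرلمان"}

prefix_hits = productive_prefix_count("ال", vocabulary)
suffix_in_hits = productive_suffix_count("ين", vocabulary)
suffix_an_hits = productive_suffix_count("ان", vocabulary)
print(prefix_hits, suffix_in_hits, suffix_an_hits)
```

A unit with many such hits across the vocabulary would be a candidate productive affix; a string that merely happens to end a loanword, with no attested remainder, would score zero.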
+### 6.3 Bound Stems (Lexical Roots)
+Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
+| Stem | Cohesion | Substitutability | Examples |
+|------|----------|------------------|----------|
+| `العا` | 1.85x | 296 contexts | العام, العاج, العال |
+| `المج` | 1.79x | 267 contexts | المجد, المجر, المجئ |
+| `انزي` | 1.95x | 165 contexts | انزيا, انزيت, انزيد |
+| `الشع` | 2.11x | 103 contexts | الشعب, الشعف, الشعز |
+| `ياته` | 2.11x | 96 contexts | عياته, آياته, حياته |
+| `الاع` | 2.00x | 107 contexts | الاعور, الاعتر, الاعدا |
+| `مستق` | 2.01x | 80 contexts | مستقل, مستقر, مستقله |
+| `الاح` | 1.79x | 110 contexts | الاحد, صالاحى, الاحرش |
+| `لموج` | 2.13x | 48 contexts | لموجة, الموج, الموجب |
+| `لمجر` | 1.85x | 71 contexts | لمجره, المجر, لمجرة |
+| `لساع` | 2.34x | 28 contexts | لساعة, الساعة, لساعات |
+| `مريك` | 1.69x | 102 contexts | لمريك, مريكا, مريكن |
+### 6.4 Affix Compatibility (Co-occurrence)
+This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
+| Prefix | Suffix | Frequency | Examples |
+|--------|--------|-----------|----------|
+| `-ال` | `-ين` | 47 words | الصديقين, الحدوديين |
+| `-ال` | `-ان` | 11 words | الأخوان, الترامان |
+### 6.5 Recursive Morpheme Segmentation
+Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
+| Word | Suggested Split | Confidence | Stem |
+|------|-----------------|------------|------|
+| السريانيين | **`ال-سرياني-ين`** | 6.0 | `سرياني` |
+| كانتيلينين | **`كانتيل-ين-ين`** | 6.0 | `كانتيل` |
+| الجينومية | **`ال-جينومية`** | 4.5 | `جينومية` |
+| البرمجيات | **`ال-برمجيات`** | 4.5 | `برمجيات` |
+| الاستعلامات | **`ال-استعلامات`** | 4.5 | `استعلامات` |
+| بيجلاندسفچوردين | **`بيجلاندسفچورد-ين`** | 4.5 | `بيجلاندسفچورد` |
+| السينابون | **`ال-سينابون`** | 4.5 | `سينابون` |
+| الديمقراطي | **`ال-ديمقراطي`** | 4.5 | `ديمقراطي` |
+| الانبعاثية | **`ال-انبعاثية`** | 4.5 | `انبعاثية` |
+| الميتانيه | **`ال-ميتانيه`** | 4.5 | `ميتانيه` |
+| الطويحينه | **`ال-طويحينه`** | 4.5 | `طويحينه` |
+| الصابونجى | **`ال-صابونجى`** | 4.5 | `صابونجى` |
+| البنغاليه | **`ال-بنغاليه`** | 4.5 | `بنغاليه` |
+| المتحدثون | **`ال-متحدثون`** | 4.5 | `متحدثون` |
+| ستشميدلين | **`ستشميدل-ين`** | 4.5 | `ستشميدل` |
+### 6.6 Linguistic Interpretation
+> **Automated Insight:**
+The language Egyptian Arabic appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
 ---
+## 7. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
+| Tokenizer | **64k BPE** | Best compression (3.91x) |
+| N-gram | **2-gram** | Lowest perplexity (316) |
+| Markov | **Context-4** | Highest predictability (93.7%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
+  doi = {10.5281/zenodo.18073153},
+  publisher = {Zenodo},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+- 🤝 Sponsor: [Featherless AI](https://featherless.ai)
 ---
 *Generated by Wikilangs Models Pipeline*
+*Report Date: 2026-01-03 07:45:31*

models/embeddings/monolingual/arz_128d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:75a11f48624beba71cf5e2fb4ca6caff536a87ce2cd7ecf4cf2d74b7d061ab24
-size 1623402315

 version https://git-lfs.github.com/spec/v1
+oid sha256:5bdb36e12bc1a678fd5c157ad32e02341de1eca60c4bc2b5ef93fe49b61bd555
+size 1527330535

models/embeddings/monolingual/arz_128d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 128,
   "version": "monolingual",
   "training_params": {
-    "dim": 128,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 573439
 }

   "dimension": 128,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 128
   },
+  "vocab_size": 481203
 }
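
The new metadata records the embedding dimension twice (`dimension` and `training_params.dim`) and shrinks `vocab_size` relative to the previous upload. Below is a small sketch of a consistency check and of the relative vocabulary change. The JSON literal contains only the keys visible in this hunk (the hunk starts at line 3 of the file, so earlier keys are not shown), and `check_embedding_metadata` is a hypothetical helper name.

```python
import json

# Only the keys visible in the diff hunk; the real file may contain more.
new_metadata = json.loads("""
{
  "dimension": 128,
  "version": "monolingual",
  "training_params": {
    "algorithm": "skipgram",
    "min_count": 5,
    "window": 5,
    "negative": 5,
    "epochs": 5,
    "encoding_method": "rope",
    "dim": 128
  },
  "vocab_size": 481203
}
""")


def check_embedding_metadata(meta: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if meta["dimension"] != meta["training_params"]["dim"]:
        problems.append("dimension and training_params.dim disagree")
    if meta["vocab_size"] <= 0:
        problems.append("vocab_size must be positive")
    return problems


old_vocab_size = 573439  # removed line in the hunk above
relative_change = (new_metadata["vocab_size"] - old_vocab_size) / old_vocab_size

print(check_embedding_metadata(new_metadata))  # []
print(f"{relative_change:.1%}")                # -16.1%
```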

models/embeddings/monolingual/arz_32d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:20ccadbe8619b844c6eace4a02ce277749c846f73614cb161ecbc7a86c873672
-size 415001163

 version https://git-lfs.github.com/spec/v1
+oid sha256:30582a755835303bdd9d0ea4ad7243b43b631aa397c3e567b21c8d4a5c4449d3
+size 389766631

models/embeddings/monolingual/arz_32d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 32,
   "version": "monolingual",
   "training_params": {
-    "dim": 32,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 573439
 }

   "dimension": 32,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 32
   },
+  "vocab_size": 481203
 }

models/embeddings/monolingual/arz_64d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4801d005678b73c1bb59d155df23fbc73c9f36b3c781e8ec0acef3f3d272ee93
-size 817801547

 version https://git-lfs.github.com/spec/v1
+oid sha256:5068a8b810efa630cdf0314c353a4495557bcd65811e48356936ecd7a645b35c
+size 768954599

models/embeddings/monolingual/arz_64d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 64,
   "version": "monolingual",
   "training_params": {
-    "dim": 64,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 573439
 }

   "dimension": 64,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 64
   },
+  "vocab_size": 481203
 }

models/subword_markov/arz_markov_ctx1_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e7f161c6e8fab68e5fbc9600f3451111167fbdb2ccb9ee71b98eeb2ffb7dace1
-size 322433

 version https://git-lfs.github.com/spec/v1
+oid sha256:93b9ef162389968b455ca3710639cb0faa0e67079e19324a4f3d232dc220101e
+size 347493

models/subword_markov/arz_markov_ctx1_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "subword",
   "language": "arz",
-  "unique_contexts": 4455,
-  "total_transitions": 825886705
 }

   "context_size": 1,
   "variant": "subword",
   "language": "arz",
+  "unique_contexts": 5726,
+  "total_transitions": 693777470
 }

models/subword_markov/arz_markov_ctx2_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cabc4c543608790d6c5880f1744ba45b47d2697724261238e8925f80995b0766
-size 2557134

 version https://git-lfs.github.com/spec/v1
+oid sha256:98db3946a7c34d13376e437f6401ae1d31aa408e141da100b0793536f9682ba1
+size 2267694

models/subword_markov/arz_markov_ctx2_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "subword",
   "language": "arz",
-  "unique_contexts": 41887,
-  "total_transitions": 824256957
 }

   "context_size": 2,
   "variant": "subword",
   "language": "arz",
+  "unique_contexts": 47379,
+  "total_transitions": 692148775
 }

models/subword_markov/arz_markov_ctx3_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:de040657fc371a339db59ea68152c56a326e76cc8924dac007de0473e86340c4
-size 13516070

 version https://git-lfs.github.com/spec/v1
+oid sha256:791a2be4e464c30d58ac081746362b42037f50dd8d0c4b552c3b0e68c2518dea
+size 10864076

models/subword_markov/arz_markov_ctx3_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "subword",
   "language": "arz",
-  "unique_contexts": 320101,
-  "total_transitions": 822627209
 }

   "context_size": 3,
   "variant": "subword",
   "language": "arz",
+  "unique_contexts": 262420,
+  "total_transitions": 690520080
 }

models/subword_markov/arz_markov_ctx4_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a0e33690e76e9f408ce70c7c674c98a698e70a625d4ecd0d337e959c13d77fe6
-size 53877656

 version https://git-lfs.github.com/spec/v1
+oid sha256:f9412f5ef232c0c613c068829b486f969ed98f144b0956cfba48ca7651d697cc
+size 40934252

models/subword_markov/arz_markov_ctx4_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "subword",
   "language": "arz",
-  "unique_contexts": 1700014,
-  "total_transitions": 820997461
 }

   "context_size": 4,
   "variant": "subword",
   "language": "arz",
+  "unique_contexts": 1241425,
+  "total_transitions": 688891385
 }

models/subword_ngram/arz_2gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:10ff36eb8f35bdaef34f28641466ebac1bd3df6e7831eab6cec38563f7a83d00
-size 224935

 version https://git-lfs.github.com/spec/v1
+oid sha256:54b1e94823bf6eda78e610a31f50839e3eb6945296345628a51c2775bbc535b5
+size 225336

models/subword_ngram/arz_2gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "subword",
   "language": "arz",
-  "unique_ngrams": 15875,
-  "total_ngrams": 825886705
 }

   "n": 2,
   "variant": "subword",
   "language": "arz",
+  "unique_ngrams": 15451,
+  "total_ngrams": 693777470
 }

models/subword_ngram/arz_3gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:eee986ddda6f36ba1d3e23a8759ad8f5509e3c48dc4ac592f689fa826fe34187
-size 2146043

 version https://git-lfs.github.com/spec/v1
+oid sha256:6241036ff50c386080b4d36854fdc224c1c5099af471b3862e1464cd644b63ae
+size 1697106

models/subword_ngram/arz_3gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "subword",
   "language": "arz",
-  "unique_ngrams": 170638,
-  "total_ngrams": 824256957
 }

   "n": 3,
   "variant": "subword",
   "language": "arz",
+  "unique_ngrams": 129923,
+  "total_ngrams": 692148775
 }

models/subword_ngram/arz_4gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2d7383505bdffee6fff1e255e8a66837be06fe62780607b993d354aed68b8d0c
-size 13255914

 version https://git-lfs.github.com/spec/v1
+oid sha256:c000b6a963ccc11a1bb0feebe6d49196497cbb6667e247ab3df4764a4e91003c
+size 10220298

models/subword_ngram/arz_4gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "subword",
   "language": "arz",
-  "unique_ngrams": 1045558,
-  "total_ngrams": 822627209
 }

   "n": 4,
   "variant": "subword",
   "language": "arz",
+  "unique_ngrams": 788718,
+  "total_ngrams": 690520080
 }

models/tokenizer/arz_tokenizer_16k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3250e5b1fa849747bd31693274ccdde1fde493080110892d89c056f2ed133c9f
-size 553249

 version https://git-lfs.github.com/spec/v1
+oid sha256:8182c7627a2b642bc8213e1f478fb8483ae43e1decdfb9d36a8b81c0b1e5db70
+size 553522

models/tokenizer/arz_tokenizer_16k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/arz_tokenizer_32k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1d22c8b625df815da9ab4ad818bdb3e662168bfa181b0747326154e8db1cc2c3
-size 880194

 version https://git-lfs.github.com/spec/v1
+oid sha256:3f58411587d93e3605bc0bfb493f2adb152e457978754608d4bfa1098bbe3484
+size 874271

models/tokenizer/arz_tokenizer_32k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/arz_tokenizer_64k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e2a8f70071cee82c435cc2eff7086ba3060e694f61e640ad9482184f637090af
-size 1550950

 version https://git-lfs.github.com/spec/v1
+oid sha256:b7caa9ae0cc952f588395f385251d423792d50e832f759c19f62d2debfa53c97
+size 1535709

models/tokenizer/arz_tokenizer_64k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/arz_tokenizer_8k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e3cf01b1aa16a4d259de9989643f646ceecca5967b87fc2b269ffd411f6ec306
-size 394841

 version https://git-lfs.github.com/spec/v1
+oid sha256:75cf05a2c2e3d6160ef1a57d3e6c1105fcab185e64a3ffd3fdab63dacc685a1f
+size 396360

models/tokenizer/arz_tokenizer_8k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/vocabulary/arz_vocabulary.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:08ac9dc5a3750de53ceee73bd67ec52e8f566950d994cd5585cbd78867c9790d
-size 13750687

 version https://git-lfs.github.com/spec/v1
+oid sha256:f49f18512e3b41813735d519aef4205264437e74599a3808b6c8fbea97f3bb17
+size 12321602

models/vocabulary/arz_vocabulary_metadata.json CHANGED

@@ -1,16 +1,17 @@
 {
   "language": "arz",
-  "vocabulary_size": 1000000,
   "statistics": {
-    "type_token_ratio": 0.016231404296933028,
     "coverage": {
-      "top_100": 0.44057406359024887,
-      "top_1000": 0.7499824220069986,
-      "top_5000": 0.8482657952087583,
-      "top_10000": 0.8783945758611642
     },
-    "hapax_count": 918045,
-    "hapax_ratio": 0.42167650913059435,
-    "total_documents": 1629748
   }
 }

 {
   "language": "arz",
+  "vocabulary_size": 856070,
+  "variant": "full",
   "statistics": {
+    "type_token_ratio": 0.011546453807845636,
     "coverage": {
+      "top_100": 0.4579160902506231,
+      "top_1000": 0.7632735433913325,
+      "top_5000": 0.8553371329341142,
+      "top_10000": 0.885855315521865
     },
+    "hapax_count": 497272,
+    "hapax_ratio": 0.36744001146790684,
+    "total_documents": 1628695
   }
 }
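
The new vocabulary statistics can be cross-checked against each other. The sketch below assumes that `vocabulary_size` excludes hapax legomena, so that the full type count is `vocabulary_size + hapax_count`. That reading is an inference from the numbers, not something the metadata states. Under it, the published `hapax_ratio` is reproduced, and `type_token_ratio` implies a corpus token count.

```python
import math

# Values copied from the new side of arz_vocabulary_metadata.json.
vocabulary_size = 856_070
hapax_count = 497_272
hapax_ratio = 0.36744001146790684
type_token_ratio = 0.011546453807845636

# Assumption: the stored vocabulary excludes hapaxes, so all types = vocabulary + hapaxes.
total_types = vocabulary_size + hapax_count

# Check that hapax_count / total_types matches the published hapax_ratio.
ratio_matches = math.isclose(hapax_count / total_types, hapax_ratio, rel_tol=1e-9)

# type_token_ratio = types / tokens, so tokens = types / type_token_ratio.
implied_tokens = round(total_types / type_token_ratio)

print(ratio_matches)
print(f"{total_types:,} types over roughly {implied_tokens:,} tokens")
```

The same check does not reproduce the old `hapax_ratio` from the old values, which is consistent with the old `vocabulary_size` of exactly 1000000 having been a cap rather than a count.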

models/word_markov/arz_markov_ctx1_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e5b7e0c4e188921694b306869a2449ce112fb199b792128f1ce21236b921cc84
-size 138534439

 version https://git-lfs.github.com/spec/v1
+oid sha256:27fac99b38072bd41d19f9620ff6d00511cdf0cb2c23718cc24825c3425b0c81
+size 125535197

models/word_markov/arz_markov_ctx1_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "word",
   "language": "arz",
-  "unique_contexts": 2177526,
-  "total_transitions": 156097037
 }

   "context_size": 1,
   "variant": "word",
   "language": "arz",
+  "unique_contexts": 1353062,
+  "total_transitions": 115579759
 }

models/word_markov/arz_markov_ctx2_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:40f8b7cfcf0a1b3e4aeec633859a0e58e1f83c410282fc81378b386a72be3a23
-size 440008287

 version https://git-lfs.github.com/spec/v1
+oid sha256:9256f52202a6653f8df7a39b233dce04b0731a6961f54294e451f0ff631a642f
+size 413546149

models/word_markov/arz_markov_ctx2_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "word",
   "language": "arz",
-  "unique_contexts": 13972922,
-  "total_transitions": 154467289
 }

   "context_size": 2,
   "variant": "word",
   "language": "arz",
+  "unique_contexts": 12336484,
+  "total_transitions": 113951064
 }

models/word_markov/arz_markov_ctx3_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:10869a5b62f9dc9fced8f99ecf292c2e516552cafb1ebc22439f4c66b29703c6
-size 749801655

 version https://git-lfs.github.com/spec/v1
+oid sha256:a2fa6b997fa7dcafddfbda3b21ca4c509b51a1df648b8e5d0d2476f263f776b1
+size 672892499

models/word_markov/arz_markov_ctx3_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "word",
   "language": "arz",
-  "unique_contexts": 28492247,
-  "total_transitions": 152837542
 }

   "context_size": 3,
   "variant": "word",
   "language": "arz",
+  "unique_contexts": 23517673,
+  "total_transitions": 112322369
 }

models/word_markov/arz_markov_ctx4_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1c3be90b2e9c7b80d31d2c18860bc2236215712a3340fa2f9225e2de14a91424
-size 1031396379

 version https://git-lfs.github.com/spec/v1
+oid sha256:d3b791771441b0e8ec65696094fed0fc807d2e95898fb465fd96d5dd181add4d
+size 913306903

models/word_markov/arz_markov_ctx4_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "word",
   "language": "arz",
-  "unique_contexts": 38092155,
-  "total_transitions": 151207798
 }

   "context_size": 4,
   "variant": "word",
   "language": "arz",
+  "unique_contexts": 29894419,
+  "total_transitions": 110693674
 }
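
The `total_transitions` values across context sizes follow a simple pattern: each extra token of context removes exactly one transition per document, so consecutive context sizes should differ by `total_documents`. The sketch below checks this for the new word-level and subword-level Markov metadata, with all numbers copied from the hunks in this commit. The explanation (one fewer usable position per document) is an interpretation of the numbers, not something the metadata documents.

```python
# total_documents from the new side of arz_vocabulary_metadata.json.
total_documents = 1_628_695

# total_transitions per context size, new side of each *_metadata.json hunk.
word_transitions = {1: 115_579_759, 2: 113_951_064, 3: 112_322_369, 4: 110_693_674}
subword_transitions = {1: 693_777_470, 2: 692_148_775, 3: 690_520_080, 4: 688_891_385}


def step_differences(transitions: dict[int, int]) -> list[int]:
    """Difference in total_transitions between each context size and the next."""
    sizes = sorted(transitions)
    return [transitions[a] - transitions[b] for a, b in zip(sizes, sizes[1:])]


for name, table in [("word", word_transitions), ("subword", subword_transitions)]:
    diffs = step_differences(table)
    consistent = all(d == total_documents for d in diffs)
    print(f"{name}: differences={diffs} consistent={consistent}")
```

A mismatch here after a future regeneration would suggest that the Markov models and the vocabulary statistics were built from different document sets.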

models/word_ngram/arz_2gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:739d5abccec316bf509349d0e17742a1bcf59999ee5f8e810053baf5105c06e5
-size 25088974

 version https://git-lfs.github.com/spec/v1
+oid sha256:3d603ccb8a9bbb46f85861773239d3031d8c0e830c109fc753e1cadd69e9c1e2
+size 22328144

models/word_ngram/arz_2gram_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "word",
   "language": "arz",
-  "unique_ngrams": 1353863,
-  "total_ngrams": 156097037
 }

   "n": 2,
   "variant": "word",
   "language": "arz",
+  "unique_ngrams": 1073861,
+  "total_ngrams": 115579759
 }

models/word_ngram/arz_3gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f744cb53e652ce297a655902bc6eb5ef9d19d5edfc79585d48fa10ea229f7ccb
-size 49189391

 version https://git-lfs.github.com/spec/v1
+oid sha256:879ba3166c5b357094ce092093cc1eae670fb25bcde483ad58a4dd24786c83e5
+size 41669189

models/word_ngram/arz_3gram_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "word",
   "language": "arz",
-  "unique_ngrams": 2306233,
-  "total_ngrams": 154467289
 }

   "n": 3,
   "variant": "word",
   "language": "arz",
+  "unique_ngrams": 1682809,
+  "total_ngrams": 113951064
 }

models/word_ngram/arz_4gram_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d78b4518cd7a719707b9135b4062484b3fba60daa9102960b10361304503a812
-size 109747485