omarkamali committed
Commit ecd1032 · verified · 1 Parent(s): 949fd96

Upload all models and assets for arc (20251001)

This view is limited to 50 files because it contains too many changes. See raw diff.

Files changed (50)
  1. README.md +292 -137
  2. models/embeddings/monolingual/arc_128d.bin +2 -2
  3. models/embeddings/monolingual/arc_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/arc_32d.bin +2 -2
  5. models/embeddings/monolingual/arc_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/arc_64d.bin +2 -2
  7. models/embeddings/monolingual/arc_64d_metadata.json +5 -3
  8. models/subword_markov/arc_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/arc_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/arc_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/arc_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/arc_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/arc_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/arc_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/arc_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/arc_2gram_subword.parquet +2 -2
  17. models/subword_ngram/arc_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/arc_3gram_subword.parquet +2 -2
  19. models/subword_ngram/arc_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/arc_4gram_subword.parquet +2 -2
  21. models/subword_ngram/arc_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/arc_tokenizer_16k.model +2 -2
  23. models/tokenizer/arc_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/arc_tokenizer_32k.model +2 -2
  25. models/tokenizer/arc_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/arc_tokenizer_8k.model +2 -2
  27. models/tokenizer/arc_tokenizer_8k.vocab +0 -0
  28. models/vocabulary/arc_vocabulary.parquet +2 -2
  29. models/vocabulary/arc_vocabulary_metadata.json +10 -9
  30. models/word_markov/arc_markov_ctx1_word.parquet +2 -2
  31. models/word_markov/arc_markov_ctx1_word_metadata.json +2 -2
  32. models/word_markov/arc_markov_ctx2_word.parquet +2 -2
  33. models/word_markov/arc_markov_ctx2_word_metadata.json +2 -2
  34. models/word_markov/arc_markov_ctx3_word.parquet +2 -2
  35. models/word_markov/arc_markov_ctx3_word_metadata.json +2 -2
  36. models/word_markov/arc_markov_ctx4_word.parquet +2 -2
  37. models/word_markov/arc_markov_ctx4_word_metadata.json +2 -2
  38. models/word_ngram/arc_2gram_word.parquet +2 -2
  39. models/word_ngram/arc_2gram_word_metadata.json +2 -2
  40. models/word_ngram/arc_3gram_word.parquet +2 -2
  41. models/word_ngram/arc_3gram_word_metadata.json +2 -2
  42. models/word_ngram/arc_4gram_word.parquet +2 -2
  43. models/word_ngram/arc_4gram_word_metadata.json +2 -2
  44. visualizations/embedding_isotropy.png +0 -0
  45. visualizations/embedding_norms.png +0 -0
  46. visualizations/embedding_similarity.png +2 -2
  47. visualizations/markov_branching.png +0 -0
  48. visualizations/markov_contexts.png +0 -0
  49. visualizations/markov_entropy.png +0 -0
  50. visualizations/model_sizes.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
  metrics:
  - name: best_compression_ratio
    type: compression
- value: 4.512
+ value: 4.583
  - name: best_isotropy
    type: isotropy
- value: 0.2995
+ value: 0.2739
  - name: vocabulary_size
    type: vocab
- value: 6528
+ value: 0
- generated: 2025-12-27
+ generated: 2026-01-03
  ---

  # ARC - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  ### Models & Assets

  - Tokenizers (8k, 16k, 32k, 64k)
- - N-gram models (2, 3, 4-gram)
+ - N-gram models (2, 3, 4, 5-gram)
- - Markov chains (context of 1, 2, 3 and 4)
+ - Markov chains (context of 1, 2, 3, 4 and 5)
  - Subword N-gram and Markov chains
- - Embeddings in various sizes and dimensions
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
  - Language Vocabulary
  - Language Statistics
+
  ![Performance Dashboard](visualizations/performance_dashboard.png)

  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
- - [6. Summary & Recommendations](#6-summary--recommendations)
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+ - [7. Summary & Recommendations](#7-summary--recommendations)
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
  - [Visualizations Index](#visualizations-index)
@@ -68,57 +70,53 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)

+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
+
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
+
  ### Results

  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
  |------------|-------------|---------------|----------|--------------|
- | **8k** | 3.534x | 3.51 | 0.0853% | 59,794 |
- | **16k** | 3.932x | 3.90 | 0.0949% | 53,742 |
- | **32k** | 4.512x 🏆 | 4.48 | 0.1089% | 46,835 |
+ | **8k** | 3.552x | 3.57 | 0.1271% | 63,747 |
+ | **16k** | 3.988x | 4.01 | 0.1427% | 56,780 |
+ | **32k** | 4.583x 🏆 | 4.60 | 0.1640% | 49,402 |

  ### Tokenization Examples

  Below are sample sentences tokenized with each vocabulary size:

- **Sample 1:** `R (ܙܥܘܪܬܐ r) ܗܝ ܐܬܘܬܐ ܕܐܠܦܒܝܬ ܠܐܛܝܢܝܐ܀`
+ **Sample 1:** `ܡܬܠܬܐ ܗܘ ܐܣܟܡܐ ܡܚܪܝܐ (ܓܐܘܡܛܪܝܐ) ܕܐܝܬ ܠܗ ܬܠܬܐ ܐܠܥ̈ܐ ܘܬܠܬܐ ܙܘܝܬ̈ܐ܀`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁r ▁( ܙܥܘܪܬܐ ▁r ) ▁ܗܝ ▁ܐܬܘܬܐ ▁ܕܐܠܦܒܝܬ ▁ܠܐܛܝܢܝܐ܀` | 9 |
- | 16k | `▁r ▁( ܙܥܘܪܬܐ ▁r ) ▁ܗܝ ▁ܐܬܘܬܐ ▁ܕܐܠܦܒܝܬ ▁ܠܐܛܝܢܝܐ܀` | 9 |
- | 32k | `▁r ▁( ܙܥܘܪܬܐ ▁r ) ▁ܗܝ ▁ܐܬܘܬܐ ▁ܕܐܠܦܒܝܬ ▁ܠܐܛܝܢܝܐ܀` | 9 |
+ | 8k | `▁ܡܬܠܬܐ ▁ܗܘ ▁ܐܣܟܡܐ ▁ܡܚܪܝܐ ▁( ܓܐ ܘܡ ܛܪܝܐ ) ▁ܕܐܝܬ ... (+8 more)` | 18 |
+ | 16k | `▁ܡܬܠܬܐ ▁ܗܘ ▁ܐܣܟܡܐ ▁ܡܚܪܝܐ ▁( ܓܐܘܡܛܪܝܐ ) ▁ܕܐܝܬ ▁ܠܗ ▁ܬܠܬܐ ... (+4 more)` | 14 |
+ | 32k | `▁ܡܬܠܬܐ ▁ܗܘ ▁ܐܣܟܡܐ ▁ܡܚܪܝܐ ▁( ܓܐܘܡܛܪܝܐ ) ▁ܕܐܝܬ ▁ܠܗ ▁ܬܠܬܐ ... (+3 more)` | 13 |

- **Sample 2:** `1847 ܗܘܬ ܫܢܬܐ܀
-
- ܓܕܫ̈ܐ
-
- ܐܬܝܠܕ
-
- ܡܝܬ
-
- ܣܕܪܐ:ܕܪܐ ܬܫܥܣܪܝܢܝܐ`
+ **Sample 2:** `ܟܐܢܣܐܣ ܐܘ ܟܐܢܙܐܣ (Kansas) ܐܝܬܝܗ ܐܘܚܕܢܐ ܓܘ ܡܢܬܐ ܡܥܪܒܝܬܐ ܡܨܥܝܬܐ ܕܐ̈ܘܚܕܢܐ ܡ̈ܚܝܕܐ ܕܐ...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁ 1 8 4 7 ▁ܗܘܬ ▁ܫܢܬܐ܀ ▁ܓܕܫ̈ܐ ▁ܐܬܝܠܕ ▁ܡܝܬ ... (+5 more)` | 15 |
- | 16k | `▁ 1 8 4 7 ▁ܗܘܬ ▁ܫܢܬܐ܀ ▁ܓܕܫ̈ܐ ▁ܐܬܝܠܕ ▁ܡܝܬ ... (+5 more)` | 15 |
- | 32k | `▁ 1 8 4 7 ▁ܗܘܬ ▁ܫܢܬܐ܀ ▁ܓܕܫ̈ܐ ▁ܐܬܝܠܕ ▁ܡܝܬ ... (+4 more)` | 14 |
+ | 8k | `▁ܟ ܐܢ ܣܐܣ ▁ܐܘ ▁ܟ ܐܢ ܙܐ ܣ ▁( k ... (+14 more)` | 24 |
+ | 16k | `▁ܟܐܢܣܐܣ ▁ܐܘ ▁ܟܐܢܙܐܣ ▁( kansas ) ▁ܐܝܬܝܗ ▁ܐܘܚܕܢܐ ▁ܓܘ ▁ܡܢܬܐ ... (+7 more)` | 17 |
+ | 32k | `▁ܟܐܢܣܐܣ ▁ܐܘ ▁ܟܐܢܙܐܣ ▁( kansas ) ▁ܐܝܬܝܗ ▁ܐܘܚܕܢܐ ▁ܓܘ ▁ܡܢܬܐ ... (+7 more)` | 17 |

- **Sample 3:** `ܗܘܦܪܟܝܐ ܕܒܝܠܓܝܟ ܗܝ ܗܘܦܪܟܝܐ ܒܛܘܪܩܝܐ܀
-
- ܣܕܪܐ:ܗܘܦܪܟܝܣ ܕܛܘܪܩܝܐ`
+ **Sample 3:** `ܢܝܘ ܗܐܡܦܫܪ (New Hampshire) ܗܝ ܐܬܪܐ ܓܘ ܡܢܬܐ ܓܪܒܝܝܬܐ ܡܕܢܚܝܬܐ ܕܐܬ݂ܪ̈ܘܬ݂ܐ ܡ̈ܚܝܕܐ ܕܐܡ...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁ܗܘܦܪܟܝܐ ▁ܕܒܝܠ ܓ ܝܟ ▁ܗܝ ▁ܗܘܦܪܟܝܐ ▁ܒܛܘܪܩܝܐ܀ ▁ܣܕܪܐ : ܗܘܦܪܟܝܣ ... (+1 more)` | 11 |
- | 16k | `▁ܗܘܦܪܟܝܐ ▁ܕܒܝܠ ܓܝܟ ▁ܗܝ ▁ܗܘܦܪܟܝܐ ▁ܒܛܘܪܩܝܐ܀ ▁ܣܕܪܐ : ܗܘܦܪܟܝܣ ▁ܕܛܘܪܩܝܐ` | 10 |
- | 32k | `▁ܗܘܦܪܟܝܐ ▁ܕܒܝܠܓܝܟ ▁ܗܝ ▁ܗܘܦܪܟܝܐ ▁ܒܛܘܪܩܝܐ܀ ▁ܣܕܪܐ : ܗܘܦܪܟܝܣ ▁ܕܛܘܪܩܝܐ` | 9 |
+ | 8k | `▁ܢܝܘ ▁ܗܐܡ ܦܫ ܪ ▁( new ▁h am p shi ... (+14 more)` | 24 |
+ | 16k | `▁ܢܝܘ ▁ܗܐܡܦܫܪ ▁( new ▁hamp shire ) ▁ܗܝ ▁ܐܬܪܐ ▁ܓܘ ... (+9 more)` | 19 |
+ | 32k | `▁ܢܝܘ ▁ܗܐܡܦܫܪ ▁( new ▁hampshire ) ▁ܗܝ ▁ܐܬܪܐ ▁ܓܘ ▁ܡܢܬܐ ... (+8 more)` | 18 |

  ### Key Findings

- - **Best Compression:** 32k achieves 4.512x compression
- - **Lowest UNK Rate:** 8k with 0.0853% unknown tokens
+ - **Best Compression:** 32k achieves 4.583x compression
+ - **Lowest UNK Rate:** 8k with 0.1271% unknown tokens
  - **Trade-off:** Larger vocabularies improve compression but increase model size
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
@@ -127,55 +125,87 @@ Below are sample sentences tokenized with each vocabulary size:
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)

+ ![N-gram Unique](visualizations/ngram_unique.png)
+
  ![N-gram Coverage](visualizations/ngram_coverage.png)

  ### Results

- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
- |--------|------------|---------|----------------|------------------|-------------------|
- | **2-gram** | 836 🏆 | 9.71 | 1,994 | 37.5% | 82.7% |
- | **2-gram** | 405 🏆 | 8.66 | 2,501 | 57.6% | 95.6% |
- | **3-gram** | 1,500 | 10.55 | 2,669 | 27.2% | 73.4% |
- | **3-gram** | 2,617 | 11.35 | 11,822 | 27.5% | 65.5% |
- | **4-gram** | 2,666 | 11.38 | 4,604 | 22.0% | 58.3% |
- | **4-gram** | 9,085 | 13.15 | 32,191 | 14.3% | 42.7% |
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
+ | **2-gram** | Word | 477 | 8.90 | 718 | 45.8% | 100.0% |
+ | **2-gram** | Subword | 365 🏆 | 8.51 | 2,347 | 59.8% | 96.1% |
+ | **3-gram** | Word | 437 | 8.77 | 752 | 52.0% | 100.0% |
+ | **3-gram** | Subword | 2,390 | 11.22 | 10,625 | 28.2% | 66.9% |
+ | **4-gram** | Word | 742 | 9.53 | 1,438 | 43.7% | 84.6% |
+ | **4-gram** | Subword | 8,576 | 13.07 | 28,979 | 14.2% | 43.1% |

  ### Top 5 N-grams by Size

- **2-grams:**
+ **2-grams (Word):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `ܐܦ ܚܙܝ` | 193 |
+ | 2 | `ܚܕ ܡܢ` | 141 |
+ | 3 | `ܗܝ ܐܬܪܐ` | 123 |
+ | 4 | `ܐܝܬ ܠܗ` | 103 |
+ | 5 | `ܬܚܘܡܐ ܥܡ` | 88 |
+
+ **3-grams (Word):**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `ܐ` | 2,050 |
- | 2 | `ܣܕܪܐ :` | 1,195 |
- | 3 | `ܣܕܪܐ` | 593 |
- | 4 | `) ܗܝ` | 445 |
- | 5 | `ܝܐ` | 356 |
+ | 1 | `ܗܘ ܚܕ ܡܢ` | 72 |
+ | 2 | `ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍` | 52 |
+ | 3 | `ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ` | 52 |
+ | 4 | `ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ` | 52 |
+ | 5 | `ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ` | 52 |

- **3-grams:**
+ **4-grams (Word):**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `ܣܕܪܐ :` | 593 |
- | 2 | `ܐܢܫ ̈ ܐ` | 135 |
- | 3 | `ܐܦ ܚܙܝ` | 134 |
- | 4 | `ܐ ܀` | 127 |
- | 5 | `ܣܕܪܐ : ܝܘܠܦܢ` | 117 |
+ | 1 | `ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ` | 52 |
+ | 2 | `ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ` | 52 |
+ | 3 | `ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ` | 52 |
+ | 4 | `ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ` | 52 |
+ | 5 | `ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ` | 52 |

- **4-grams:**
+ **2-grams (Subword):**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `ܣܕܪܐ : ܝܘܠܦܢ ܨܪܘܝܘܬܐ` | 115 |
- | 2 | `ܣܕܪܐ : ܝܘܠܦܢ` | 97 |
- | 3 | `ܐ ܒܪ ̈` | 91 |
- | 4 | `ܒܪ ̈ ܝܐ` | 90 |
- | 5 | `܀ ܣܕܪܐ :` | 66 |
+ | 1 | `_` | 24,633 |
+ | 2 | `_ ܕ` | 7,621 |
+ | 3 | `ܐ` | 7,176 |
+ | 4 | `_ ܐ` | 6,899 |
+ | 5 | `ܐ` | 5,702 |
+
+ **3-grams (Subword):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `ܐ _ ܕ` | 6,138 |
+ | 2 | `ܬ ܐ _` | 5,890 |
+ | 3 | `ܝ ܐ _` | 4,242 |
+ | 4 | `ܐ _ ܐ` | 2,477 |
+ | 5 | `ܢ ܐ _` | 2,397 |
+
+ **4-grams (Subword):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `ܬ ܐ _ ܕ` | 2,008 |
+ | 2 | `ܝ ܬ ܐ _` | 1,523 |
+ | 3 | `ܐ ܝ ܬ _` | 1,367 |
+ | 4 | `ܘ ܬ ܐ _` | 1,297 |
+ | 5 | `_ ܡ ܢ _` | 1,210 |

  ### Key Findings

- - **Best Perplexity:** 2-gram with 405
+ - **Best Perplexity:** 2-gram (subword) with 365
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
  - **Coverage:** Top-1000 patterns cover ~43% of corpus
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
@@ -185,55 +215,86 @@ Below are sample sentences tokenized with each vocabulary size:
  ![Markov Entropy](visualizations/markov_entropy.png)

+ ![Markov Contexts](visualizations/markov_contexts.png)
+
  ![Markov Branching](visualizations/markov_branching.png)

  ### Results

- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
- |---------|-------------|------------|------------------|-----------------|----------------|
- | **1** | 0.5575 | 1.472 | 3.10 | 18,087 | 44.3% |
- | **1** | 1.3634 | 2.573 | 8.68 | 797 | 0.0% |
- | **2** | 0.1553 | 1.114 | 1.32 | 55,465 | 84.5% |
- | **2** | 0.9613 | 1.947 | 4.38 | 6,904 | 3.9% |
- | **3** | 0.0630 | 1.045 | 1.11 | 72,203 | 93.7% |
- | **3** | 0.6343 | 1.552 | 2.52 | 30,176 | 36.6% |
- | **4** | 0.0270 🏆 | 1.019 | 1.04 | 78,995 | 97.3% |
- | **4** | 0.3633 🏆 | 1.286 | 1.71 | 75,950 | 63.7% |
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
+ | **1** | Word | 0.5449 | 1.459 | 2.60 | 18,018 | 45.5% |
+ | **1** | Subword | 0.9655 | 1.953 | 6.06 | 1,232 | 3.5% |
+ | **2** | Word | 0.1027 | 1.074 | 1.16 | 45,749 | 89.7% |
+ | **2** | Subword | 0.7977 | 1.738 | 3.85 | 7,459 | 20.2% |
+ | **3** | Word | 0.0295 | 1.021 | 1.04 | 51,822 | 97.1% |
+ | **3** | Subword | 0.5934 | 1.509 | 2.45 | 28,618 | 40.7% |
+ | **4** | Word | 0.0106 🏆 | 1.007 | 1.01 | 52,472 | 98.9% |
+ | **4** | Subword | 0.3583 | 1.282 | 1.71 | 69,915 | 64.2% |

- ### Generated Text Samples
+ ### Generated Text Samples (Word-based)

- Below are text samples generated from each Markov chain model:
+ Below are text samples generated from each word-based Markov chain model:

  **Context Size 1:**

- 1. `ܠܐ ܀ ܣܕܪܐ : ܐܘܢܓܠܝܘܢ ܕܡܪܩܘܣ ܘܐܘܢܓܠܝܘܢ ܕܡܪܩܘܣ ܀ ܣܕܪܐ : ܡܐܢܐ ܕܐܝܬ ܠܗ ܬܪܬܝܢ`
- 2. `: ܕܝܬܝܩܝ ܥܬܝܩܬܐ ܘܗܝ ܚܕܐ ܡܢ ܐܠܗܐ ܫܪܝܪܐ ܝܠܝܕܐ ܘܠܐ ܛܥܢܢ ܠܡܕܟܪ ܕܟܢܘܫܬܐ ܡܪܕܘܬܢܝܬܐ ܣܘܪܝܝܬܐ ܐܪܬܘܕܟܣܝܬܐ`
- 3. `ܣܢܝܩܐ ܝܘܚ ܠܡܚܒܢ ̈ ܬܐ ܐܚܪ ̈ ܐ ܩܕܡ ܡܫܝܚܐ ܥܕܡܐ ܠܫܢܬ 1919ܡ ܘܒܡܕܒܚ ̈`
+ 1. `ܡܢ ܓܪܒܝܐ ܕܒܝܬ ܗܘܠܢܕ̈ܝܐ ܗܘܠܢܕܐܝܬ hengelo ܗܝ ܐܘܚܕܢܐ ܒܓܪܒܝܐ ܕܥܝܪܐܩ ܝܘܡܢܐ ܬܡܢ ܣܝܡܠܗ̇ ܥܕܬ̈ܐ ܡܕܢܚܝܬ̈ܐ ܕܐܬ݂...`
+ 2. `ܐܘ ܙܢܓܒܝܠܐ ܗܘ ܡܣܘܪܩܐ ܡܐܢܐ ܕܐܪܕܟܠܘܬܐ`
+ 3. `ܗܘ ܓܘܣܐ ܒܥܠܬܐ ܐܘ ܝܣܪܝܠ ܐܘ ܬܫܪܝܢ ܒ ܩܛܠܥܡܐ ܣܘܪܝܝܐ ܒ ت ܬ ܦ 80 754`

  **Context Size 2:**

- 1. `ܐ ܒܥܠܡܐ . ܘܦܪܣܐ ܒܝܬܝܪ ܡܢ ܠܫܢܐ ܣܘܪܝܝܐ ܘܐܪܡܢܝܐ ܡܬܬܗܪܓܝܢ ܒܡܕܪ ̈ ܫܬܐ ܬܝܪܝܟܝܬܐ ܡܪܘ ̈`
- 2. `ܣܕܪܐ : ܛܪܘܢܐ ܣܕܪܐ : ܡܕܝܢܬܐ ܕܥܝܪܐܩ ܣܕܪܐ : ܛܪܘܢܐ ܣܕܪܐ : ܗܘܐ ܒܬܫܪܝܢ ܐܚܪܝܣܕܪܐ : ܒܬܫܪܝܢ`
- 3. `ܣܕܪܐ : ܣܘܪܝܐ ܣܕܪܐ : ܒܝܬ ܢܗܪܝܢ ܣܕܪܐ : ܝܗܘܕܝܘܬܐ ܣܕܪܐ : ܡܐܢܐ ܡܘܣܝܩܝܐ . ܒܥܕܬܐ`
+ 1. `ܐܦ ܚܙܝ ܓܒܪܐ`
+ 2. `ܚܕ ܡܢ ܐܪܒܥܐ ܟܬܒ̈ܐ ܩܕ̈ܡܝܐ ܕܕܝܬܝܩܝ ܚܕܬܐ ܦܘܠܘܣ ܫܠܝܚܐ ܟܬܒ ܗܕܐ ܐܓܪܬܐ ܠܩܘܠܣܝ̈ܐ ܕܗܢܘܢ ܐܢܫ̈ܐ ܕܡܕܝܢܬܐ ܕܐܦܣܘܣ`
+ 3. `ܗܝ ܐܬܪܐ ܒܐܘܪܘܦܐ ܩܘܛܢܝܘܬܐ ܕܐܝܪܠܢܕ ܗܝ ܒܓܘ ܓܙܪܬܐ ܕܐܝܪܠܢܕ ܠܐܝܪܠܢܕ ܓܪܒܝܝܬܐ ܐܝܬ ܬܚܘܡܐ ܥܡ ܪܘܡܢܝܐ ܘܥܡ ܛܘܪܩܝܐ`

  **Context Size 3:**

- 1. `ܣܕܪܐ : ܝܘܠܦܢ ܨܪܘܝܘܬܐ ܣܕܪܐ : ܥܝܢܐ ( ܝܘܠܦܢ ܨܪܘܝܘܬܐ ) ܣܕܪܐ : ܡܫܝܚܝܘܬܐ ܣܕܪܐ : ܕܝܬܝܩܝ`
- 2. `ܐܢܫ ̈ ܐ ܒܓܘܪܓܝܐ ܢܡܠܠܘܢ ܓܘܪܓܐܝܬ ܀`
- 3. `ܐܦ ܚܙܝ ܓܪܡܐ ܣܕܪܐ : ܝܘܠܦܢ ܟܝܢܝܬܐ`
+ 1. `ܗܘ ܚܕ ܡܢ ܓܘܢ̈ܐ ܪ̈ܝܫܝܐ ܕܗܢܘܢ ܣܘܡܩܐ ܘܫܥܘܬܐ ܘܙܪܩܐ ܢܘܗܪܐ ܣܘܡܩܐ ܐܝܬ ܠܗ ܐܘܪܟܐ ܓܠܠܝܐ ܢܐܢܘܡܝܛܪ`
+ 2. `ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ...`
+ 3. `ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟ...`

  **Context Size 4:**

- 1. `ܣܕܪܐ : ܝܘܠܦܢ ܨܪܘܝܘܬܐ ܣܕܪܐ : ܝܘܠܦܢ ܨܪܘܝܘܬܐ ܣܕܪܐ : ܓܪܡܐ`
- 2. `ܐ ܒܪ ̈ ܝܐ ܡܓܠܬܐ 1 ܘ 2 ܘ ܘܡܓܠܬܐ 3 ܕܓܢܙܐ ܪܒܐ ܒܠܫܢܐ ܣܘܪܝܝܐ .`
- 3. `ܒܪ ̈ ܝܐ ܐܓܪܬܐ ܩܕܡܝܬܐ ܕܦܘܠܘܣ ܫܠܝܚܐ ܕܠܘܬ ܛܝܡܬܐܘܣ ܕܬܪܬܝܢ ܚܕܐ ܡܢ ܐܓܪ ̈ ܬܐ ܕܕܝܬܝܩܝ ܚܕܬܐ .`
+ 1. `ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚ...`
+ 2. `ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ...`
+ 3. `ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ ܡܒܤܢ ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝ...`
+
+ ### Generated Text Samples (Subword-based)
+
+ Below are text samples generated from each subword-based Markov chain model:
+
+ **Context Size 1:**
+
+ 1. `_ܨܝܬܐ_أبيد)_ܙܘܢܓ`
+ 2. `��._ܥܡܫܝܗܝܢܝܬ̈ܐ_ܡܕ`
+ 3. `ܝܟܫܬܐ_ܐ܀_ܐ_ܫܘܪܒܡ`
+
+ **Context Size 2:**
+
+ 1. `ܐ_ܕܗܘܡܝܠܢܕܐ_ܕܐܥܬܐ`
+ 2. `_ܕܥܣܪܘܝܕܝܢܘܢ_ܐܘ_ܦ`
+ 3. `ܬܐ_ܕܩܪܝܬܐ_ܥܠ_ܐܟܪܝ`
+
+ **Context Size 3:**
+
+ 1. `ܐ_ܕܒܓܕܐ_ܕܛܘܪ_ܗܘܘ_ܬ`
+ 2. `ܬܐ_ܕܛܲܟ݂ܣܵܐ_ܡܠܟܐ_ܫܠܝܚ̈`
+ 3. `ܝܐ_ܘܡܬܝ_ܡܠܝܝܐ_מלזי`
+
+ **Context Size 4:**
+
+ 1. `ܬܐ_ܕܗܘܦܪܟܝܐ_ܕܪܗܘܡܝܬ`
+ 2. `ܝܬܐ_ܙܘ_ܫܘܬܐ_ܕܬܘܕܝܬܐ`
+ 3. `ܐܝܬ_ܡܨܪ̈ܝܐ܆_ܚܣܢ_ܒܡܕܝ`

  ### Key Findings

- - **Best Predictability:** Context-4 with 97.3% predictability
+ - **Best Predictability:** Context-4 (word) with 98.9% predictability
  - **Branching Factor:** Decreases with context size (more deterministic)
- - **Memory Trade-off:** Larger contexts require more storage (75,950 contexts)
+ - **Memory Trade-off:** Larger contexts require more storage (69,915 contexts)
  - **Recommendation:** Context-3 or Context-4 for text generation

  ---
@@ -249,64 +310,64 @@ Below are text samples generated from each Markov chain model:
  | Metric | Value |
  |--------|-------|
- | Vocabulary Size | 6,528 |
- | Total Tokens | 65,426 |
- | Mean Frequency | 10.02 |
+ | Vocabulary Size | 6,113 |
+ | Total Tokens | 50,830 |
+ | Mean Frequency | 8.32 |
  | Median Frequency | 3 |
- | Frequency Std Dev | 48.74 |
+ | Frequency Std Dev | 32.05 |

  ### Most Common Words

  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | ܐ | 2,433 |
- | 2 | ܡܢ | 1,300 |
- | 3 | ܣܕܪܐ | 1,205 |
- | 4 | ܐܘ | 1,034 |
- | 5 | ܗܝ | 1,024 |
- | 6 | ܗܘ | 1,023 |
- | 7 | ܐܝܬ | 520 |
- | 8 | ܗܘܐ | 408 |
- | 9 | ܬܐ | 376 |
- | 10 | ܝܐ | 369 |
+ | 1 | ܡܢ | 1,283 |
+ | 2 | ܐܘ | 975 |
+ | 3 | ܗܘ | 861 |
+ | 4 | ܗܝ | 816 |
+ | 5 | ܐܝܬ | 512 |
+ | 6 | ܗܘܐ | 394 |
+ | 7 | ܥܠ | 327 |
+ | 8 | ܘܥܡ | 326 |
+ | 9 | ܐܦ | 275 |
+ | 10 | ܠܫܢܐ | 266 |

  ### Least Common Words (from vocabulary)

  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | ܟܢܘܢܝܐ | 2 |
- | 2 | ܘܟ | 2 |
- | 3 | ܦܩ | 2 |
- | 4 | ܕܚܘ | 2 |
- | 5 | ܒܐܘ | 2 |
- | 6 | ܪܚ | 2 |
- | 7 | ܐܘܟܝܬܐ | 2 |
- | 8 | ܕܠܥ | 2 |
- | 9 | ܕܒܘ | 2 |
- | 10 | ܠܨܡ | 2 |
+ | 1 | ܐܚܹܪ̈ܢܹܐ | 2 |
+ | 2 | ܦܘܼܪܡܘܼܠܵܐ | 2 |
+ | 3 | ܢܣܲܒܪܲܚ | 2 |
+ | 4 | ܚܲܕ | 2 |
+ | 5 | ܐܘܟܝܬܐ | 2 |
+ | 6 | ܕܫܠܝܼܡܵܐ | 2 |
+ | 7 | ܡܚܲܝܕܵܐ | 2 |
+ | 8 | ܓܵܪܹܫ | 2 |
+ | 9 | ܕܠܥܸܠ | 2 |
+ | 10 | ܕܒܘܿܠܨܡܲܢ | 2 |

  ### Zipf's Law Analysis

  | Metric | Value |
  |--------|-------|
- | Zipf Coefficient | 0.9501 |
- | R² (Goodness of Fit) | 0.985114 |
+ | Zipf Coefficient | 0.8947 |
+ | R² (Goodness of Fit) | 0.982774 |
  | Adherence Quality | **excellent** |

  ### Coverage Analysis

  | Top N Words | Coverage |
  |-------------|----------|
- | Top 100 | 35.0% |
- | Top 1,000 | 70.1% |
- | Top 5,000 | 95.3% |
+ | Top 100 | 31.7% |
+ | Top 1,000 | 68.0% |
+ | Top 5,000 | 95.6% |
  | Top 10,000 | 0.0% |

  ### Key Findings

- - **Zipf Compliance:** R²=0.9851 indicates excellent adherence to Zipf's law
- - **High Frequency Dominance:** Top 100 words cover 35.0% of corpus
- - **Long Tail:** -3,472 words needed for remaining 100.0% coverage
+ - **Zipf Compliance:** R²=0.9828 indicates excellent adherence to Zipf's law
+ - **High Frequency Dominance:** Top 100 words cover 31.7% of corpus
+ - **Long Tail:** -3,887 words needed for remaining 100.0% coverage

  ---
  ## 5. Word Embeddings Evaluation
@@ -319,24 +380,115 @@ Below are text samples generated from each Markov chain model:
  ![t-SNE Sentences](visualizations/tsne_sentences.png)

- ### Model Comparison
+ ### 5.1 Cross-Lingual Alignment
+
+ > *Note: Multilingual alignment visualization not available for this language.*
+
+ ### 5.2 Model Comparison

- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
- |-------|------------|-----------|----------|----------|----------|
- | **mono_32d** | 1,958 | 32 | 3.019 | 0.712 | 0.2995 🏆 |
- | **mono_64d** | 1,958 | 64 | 2.997 | 0.742 | 0.0596 |
- | **mono_128d** | 1,958 | 128 | 2.998 | 0.754 | 0.0093 |
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+ |-------|-----------|----------|------------------|---------------|----------------|
+ | **mono_32d** | 32 | 0.2739 🏆 | 0.4979 | N/A | N/A |
+ | **mono_64d** | 64 | 0.0566 | 0.5064 | N/A | N/A |
+ | **mono_128d** | 128 | 0.0089 | 0.4882 | N/A | N/A |

  ### Key Findings

- - **Best Isotropy:** mono_32d with 0.2995 (more uniform distribution)
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
- - **Vocabulary Coverage:** All models cover 1,958 words
- - **Recommendation:** 100d for balanced semantic capture and efficiency
+ - **Best Isotropy:** mono_32d with 0.2739 (more uniform distribution)
+ - **Semantic Density:** Average pairwise similarity of 0.4975. Lower values indicate better semantic separation.
+ - **Alignment Quality:** No aligned models evaluated in this run.
+ - **Recommendation:** 128d aligned for best cross-lingual performance
+
+ ---
+ ## 6. Morphological Analysis (Experimental)
+
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+
+ ### 6.1 Productivity & Complexity
+
+ | Metric | Value | Interpretation | Recommendation |
+ |--------|-------|----------------|----------------|
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+
+ ### 6.2 Affix Inventory (Productive Units)
+
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
+
+ #### Productive Prefixes
+ | Prefix | Examples |
+ |--------|----------|
+
+ #### Productive Suffixes
+ | Suffix | Examples |
+ |--------|----------|
+ | `-ܐ` | ܘܫܛܚܐ, ܠܘܚ̈ܐ, ܐܢܫܘܬܐ |
+ | `-ܬܐ` | ܐܢܫܘܬܐ, ܚܘܪܬܐ, ܐܘܪܬܕܘܟܣܝܬܐ |
+ | `-ܝܐ` | ܘܥܪ̈ܒܝܐ, ܒܐܠܒܢܝܐ, ܥܒܝܐ |
+ | `-̈ܐ` | ܠܘܚ̈ܐ, ܡܚܝܕ̈ܐ, ܥܝܢ̈ܐ |
+ | `-ܝܬܐ` | ܐܘܪܬܕܘܟܣܝܬܐ, ܣܘܪܝܝܬܐ, ܒܩܕܡܝܬܐ |
+ | `-ܘܬܐ` | ܐܢܫܘܬܐ, ܘܕܡܠܟܘܬܐ, ܛܝܒܘܬܐ |
+ | `-ܢܐ` | ܕܫܝܢܐ, ܡܢܝܢܐ, ܥܝܢܐ |
+
+ ### 6.3 Bound Stems (Lexical Roots)
+
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
+
+ | Stem | Cohesion | Substitutability | Examples |
+ |------|----------|------------------|----------|
+ | `ܢܝܬܐ` | 1.58x | 23 contexts | ܦܢܝܬܐ, ܡܢܝܬܐ, ܡܪܢܝܬܐ |
+ | `ܪܝܬܐ` | 1.59x | 18 contexts | ܫܪܝܬܐ, ܩܪܝܬܐ, ܒܪܝܬܐ |
+ | `ܫܝܚܝ` | 1.61x | 16 contexts | ܡܫܝܚܝܐ, ܡܫܝܚܝܬܐ, ܕܡܫܝܚܝܐ |
+ | `ܪܒܝܐ` | 1.59x | 16 contexts | ܨܪܒܝܐ, ܥܪܒܝܐ, ܐܪܒܝܐ |
+ | `ܘܢܝܐ` | 1.57x | 16 contexts | ܟܘܢܝܐ, ܩܘܢܝܐ, ܓܘܢܝܐ |
+ | `ܘܪܝܐ` | 1.37x | 23 contexts | ܛܘܪܝܐ, ܣܘܪܝܐ, ܟܘܪܝܐ |
+ | `ܡܫܝܚ` | 1.59x | 14 contexts | ܡܫܝܚܐ, ܡܫܝܚܝܐ, ܕܡܫܝܚܐ |
+ | `ܡܕܝܢ` | 1.58x | 13 contexts | ܡܕܝܢܬ, ܡܕܝܢܬܐ, ܠܡܕܝܢܬ |
+ | `ܢܐܝܬ` | 1.53x | 14 contexts | ܝܘܢܐܝܬ, ܨܝܢܐܝܬ, ܟܠܢܐܝܬ |
+ | `ܣܘܪܝ` | 1.38x | 18 contexts | ܣܘܪܝܐ, ܣܘܪܝܬ, ܘܣܘܪܝܐ |
+ | `ܝܢܬܐ` | 1.65x | 9 contexts | ܩܝܢܬܐ, ܡܕܝܢܬܐ, ܣܦܝܢܬܐ |
+ | `ܕܝܢܬ` | 1.62x | 9 contexts | ܡܕܝܢܬ, ܡܕܝܢܬܐ, ܠܡܕܝܢܬ |
+
+ ### 6.4 Affix Compatibility (Co-occurrence)
+
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
+
+ *No significant affix co-occurrences detected.*
+
+ ### 6.5 Recursive Morpheme Segmentation
+
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
+
+ | Word | Suggested Split | Confidence | Stem |
+ |------|-----------------|------------|------|
+ | ܝܘܪܕܢܢܝܬܐ | **`ܝܘܪܕܢܢ-ܝܬܐ`** | 4.5 | `ܝܘܪܕܢܢ` |
+ | ܥܘܬܡܐܢܝܬܐ | **`ܥܘܬܡܐܢ-ܝܬܐ`** | 4.5 | `ܥܘܬܡܐܢ` |
+ | ܕܬܠܝܬܝܘܬܐ | **`ܕܬܠܝܬܝ-ܘܬܐ`** | 4.5 | `ܕܬܠܝܬܝ` |
+ | ܕܐܢܛܝܘܟܝܐ | **`ܕܐܢܛܝܘܟ-ܝܐ`** | 4.5 | `ܕܐܢܛܝܘܟ` |
+ | ܐܝܣܪܐܝܠܝܐ | **`ܐܝܣܪܐܝܠ-ܝܐ`** | 4.5 | `ܐܝܣܪܐܝܠ` |
+ | ܦܘܪܛܘܓܠܝܐ | **`ܦܘܪܛܘܓܠ-ܝܐ`** | 4.5 | `ܦܘܪܛܘܓܠ` |
+ | ܡܬܥܡܪܢܝܬܐ | **`ܡܬܥܡܪܢ-ܝܬܐ`** | 4.5 | `ܡܬܥܡܪܢ` |
+ | ܛܘܪܥܒܕܝܢܝܐ | **`ܛܘܪܥܒܕܝܢ-ܝܐ`** | 4.5 | `ܛܘܪܥܒܕܝܢ` |
+ | ܩܬܘܠܝܩܝ̈ܐ | **`ܩܬܘܠܝܩܝ-̈ܐ`** | 4.5 | `ܩܬܘܠܝܩܝ` |
+ | ܠܫܘܠܛܢܘܬܐ | **`ܠܫܘܠܛܢ-ܘܬܐ`** | 1.5 | `ܠܫܘܠܛܢ` |
+ | ܐܘܪܬܕܘܟܣܝܐ | **`ܐܘܪܬܕܘܟܣ-ܝܐ`** | 1.5 | `ܐܘܪܬܕܘܟܣ` |
+ | ܐܝܓܘܦܛܝܬܐ | **`ܐܝܓܘܦܛ-ܝܬܐ`** | 1.5 | `ܐܝܓܘܦܛ` |
+ | ܘܒܐܘܚܕ̈ܢܐ | **`ܘܒܐܘܚܕ̈-ܢܐ`** | 1.5 | `ܘܒܐܘܚܕ̈` |
+ | ܘܒܡܫܝܚܝܘܬܐ | **`ܘܒܡܫܝܚܝ-ܘܬܐ`** | 1.5 | `ܘܒܡܫܝܚܝ` |
+ | ܢܩܪܘܡܢܛܝܐ | **`ܢܩܪܘܡܢܛ-ܝܐ`** | 1.5 | `ܢܩܪܘܡܢܛ` |
+
+ ### 6.6 Linguistic Interpretation
+
+ > **Automated Insight:**
+ The language ARC appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.

  ---
- ## 6. Summary & Recommendations
+ ## 7. Summary & Recommendations

  ![Performance Dashboard](visualizations/performance_dashboard.png)
@@ -344,11 +496,12 @@ Below are text samples generated from each Markov chain model:
  | Component | Recommended | Rationale |
  |-----------|-------------|-----------|
- | Tokenizer | **32k BPE** | Best compression (4.51x) with low UNK rate |
- | N-gram | **5-gram** | Lowest perplexity (405) |
- | Markov | **Context-4** | Highest predictability (97.3%) |
+ | Tokenizer | **32k BPE** | Best compression (4.58x) |
+ | N-gram | **2-gram** | Lowest perplexity (365) |
+ | Markov | **Context-4** | Highest predictability (98.9%) |
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
+

  ---
  ## Appendix: Metrics Glossary & Interpretation Guide
@@ -538,7 +691,8 @@ If you use these models in your research, please cite:
    author = {Kamali, Omar},
    title = {Wikilangs: Open NLP Models for Wikipedia Languages},
    year = {2025},
- publisher = {HuggingFace},
+ doi = {10.5281/zenodo.18073153},
+ publisher = {Zenodo},
    url = {https://huggingface.co/wikilangs}
    institution = {Omneity Labs}
  }
@@ -554,7 +708,8 @@ MIT License - Free for academic and commercial use.
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
  ---
  *Generated by Wikilangs Models Pipeline*

- *Report Date: 2025-12-27 16:35:06*
+ *Report Date: 2026-01-03 05:14:47*
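The affix inventory in the new §6.2 above is described as "sampling the vocabulary for global substitutability patterns." A toy sketch of that idea — not the pipeline's actual implementation; `vocab` is assumed to be a word→frequency mapping such as the one shipped in `models/vocabulary/arc_vocabulary.parquet`:

```python
# Toy substitutability check behind README §6.2: a suffix counts as
# "productive" if stripping it leaves a stem that also occurs with other
# endings. Hypothetical helper, not the Wikilangs pipeline code.
from collections import Counter

def productive_suffixes(vocab: dict[str, int], max_len: int = 4,
                        min_stem_support: int = 3) -> list[tuple[str, int]]:
    words = set(vocab)
    hits: Counter = Counter()
    for w in words:
        for k in range(1, min(max_len, len(w) - 1) + 1):
            stem, suffix = w[:-k], w[-k:]
            # stem support = other words that begin with the same stem
            support = sum(1 for x in words if x != w and x.startswith(stem))
            if support >= min_stem_support:
                hits[suffix] += 1
    return hits.most_common(10)
```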
models/embeddings/monolingual/arc_128d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4837cbf17dc8afc088fdf1e21419e03ca91e8ea6df0e96425e80b5d9a487ffcc
- size 1026044955
+ oid sha256:af294265d0648fcd88d1a3a79c52a316b5312eb2df0e22491460d7962b9abecf
+ size 1025880436
models/embeddings/monolingual/arc_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 128,
  "version": "monolingual",
  "training_params": {
- "dim": 128,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 128
  },
- "vocab_size": 1958
+ "vocab_size": 1801
  }
models/embeddings/monolingual/arc_32d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:913460969e1240ed6f202c66ac801c20a82745d7723e7bb5674cf6e4e44ce378
- size 256541211
+ oid sha256:ca828785023acb65e4a3915ca46d6488b08273c410da1d5ebe5859bffe778de4
+ size 256497268
models/embeddings/monolingual/arc_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 32,
  "version": "monolingual",
  "training_params": {
- "dim": 32,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 32
  },
- "vocab_size": 1958
+ "vocab_size": 1801
  }
models/embeddings/monolingual/arc_64d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:653a7b15a84735fd70f3c904bdb3fbeaeefd0db41e9d318bc421f7789fe732d5
- size 513042459
+ oid sha256:3ca0fd83ec54d91bfbc8d6aec7c2188702f2f3ff651521ae421ae4403a931166
+ size 512958324
models/embeddings/monolingual/arc_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 64,
  "version": "monolingual",
  "training_params": {
- "dim": 64,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 64
  },
- "vocab_size": 1958
+ "vocab_size": 1801
  }
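A verification sketch for the embedding files above. The training parameters (skipgram, min_count, window, negative, epochs) are fastText-style, so the sketch assumes the `.bin` files load with the `fasttext` package — an assumption this diff does not confirm — and uses the common partition-function isotropy estimate, which may differ from the pipeline's exact metric:

```python
# Rough isotropy check for the monolingual embeddings (assumed fastText .bin).
import numpy as np
import fasttext

model = fasttext.load_model("models/embeddings/monolingual/arc_32d.bin")
W = np.stack([model.get_word_vector(w) for w in model.get_words()])

def isotropy(W: np.ndarray) -> float:
    # Partition-function ratio: evaluate Z(c) = sum_w exp(c.w) at the
    # eigenvectors of W^T W; min(Z)/max(Z) is 1 for perfectly isotropic W.
    _, eigvecs = np.linalg.eigh(W.T @ W)
    z = np.exp(W @ eigvecs).sum(axis=0)
    return float(z.min() / z.max())

print(f"isotropy: {isotropy(W):.4f}")  # README reports 0.2739 for mono_32d
```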
models/subword_markov/arc_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ec81048f0b697c5ef8814bf4a212babdecfeffc2132c582028ae9497f4e263d2
- size 52917
+ oid sha256:bc87b5460d601a98fbc387cbc99f283bf6cd7d98c37e6a136392f9742a5c1769
+ size 61605
models/subword_markov/arc_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "subword",
  "language": "arc",
- "unique_contexts": 797,
- "total_transitions": 419545
+ "unique_contexts": 1232,
+ "total_transitions": 368665
  }
models/subword_markov/arc_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:640b341980a08ec099ab4ada8f63a1d3fe2ab634ed70a1ac9775fb4f48741db8
- size 216906
+ oid sha256:44f7913fb0e3f66a526482ce6ecca8f16f96c74f389962eac8c930ae99d709a4
+ size 236563
models/subword_markov/arc_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "subword",
  "language": "arc",
- "unique_contexts": 6904,
- "total_transitions": 417586
+ "unique_contexts": 7459,
+ "total_transitions": 367072
  }
models/subword_markov/arc_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e817885697fb90519987face639656e80f4a711be45aa058aa7deb391c5b2a01
- size 631662
+ oid sha256:6c02d515c8771584609106a8bfb0e1e6f4a01778f690d1e902e6c23730d85b1f
+ size 614974
models/subword_markov/arc_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "subword",
  "language": "arc",
- "unique_contexts": 30176,
- "total_transitions": 415627
+ "unique_contexts": 28618,
+ "total_transitions": 365479
  }
models/subword_markov/arc_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ecb5b54df5a44bacfd37832cde268bb71a63c8764fe087dd4679398d7a74c655
- size 1267490
+ oid sha256:ab3159c713b354d456d78d674f47cafc43e66c4c18a290510850783e3575b191
+ size 1213186
models/subword_markov/arc_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "subword",
  "language": "arc",
- "unique_contexts": 75950,
- "total_transitions": 413668
+ "unique_contexts": 69915,
+ "total_transitions": 363886
  }
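A generation sketch for the Markov tables above, mirroring the "Generated Text Samples" sections of the README. The parquet schema is not shown in this diff, so the `context`, `next`, and `count` column names below are hypothetical:

```python
# Weighted sampling from a context -> next-token transition table.
# Hypothetical schema: one row per (context, next, count).
import random
import pandas as pd

df = pd.read_parquet("models/subword_markov/arc_markov_ctx2_subword.parquet")
table = {ctx: (g["next"].tolist(), g["count"].tolist())
         for ctx, g in df.groupby("context")}

def generate(seed: str, steps: int = 20, ctx_size: int = 2) -> list[str]:
    out = seed.split()
    for _ in range(steps):
        ctx = " ".join(out[-ctx_size:])
        if ctx not in table:           # unseen context: stop generating
            break
        tokens, weights = table[ctx]
        out.append(random.choices(tokens, weights=weights, k=1)[0])
    return out
```

The README's branching factor is then simply the mean number of distinct next tokens per context, which is why it falls toward 1 as the context grows.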
models/subword_ngram/arc_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:af4e8cc6802d1add1ff93363c8ccc163a89472ff8b3b69a137ae643d3fe26fc8
- size 31635
+ oid sha256:1e6f137266958bfa5bf2d5725d17daa253bc7444dd15168bf7c4aba0c82d498e
+ size 30619
models/subword_ngram/arc_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "subword",
  "language": "arc",
- "unique_ngrams": 2501,
- "total_ngrams": 419545
+ "unique_ngrams": 2347,
+ "total_ngrams": 368665
  }
models/subword_ngram/arc_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b6ff6ceec88fedf4d9696b4a775251332e446c4fd0b6b3ad44cecfee8f91db33
- size 148445
+ oid sha256:ce18df7b3c9474a7e74ca5ef2fd82d105d907ac676d24a02a4a055d5c06da3df
+ size 135373
models/subword_ngram/arc_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 3,
  "variant": "subword",
  "language": "arc",
- "unique_ngrams": 11822,
- "total_ngrams": 417586
+ "unique_ngrams": 10625,
+ "total_ngrams": 367072
  }
models/subword_ngram/arc_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e13852e81422d6b99190605904715f9839adb8a8b216c7576bb6fd42b19100d8
- size 446279
+ oid sha256:36cdb55afe0905dc264f2b5348858599da9f4a343be608731c24757accd88088
+ size 398536
models/subword_ngram/arc_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 4,
  "variant": "subword",
  "language": "arc",
- "unique_ngrams": 32191,
- "total_ngrams": 415627
+ "unique_ngrams": 28979,
+ "total_ngrams": 365479
  }
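The README's n-gram tables report entropy and perplexity together; the two are related by perplexity = 2^entropy (2^8.51 ≈ 365 for the subword 2-gram, matching the table). A sketch of recomputing both from one of the counts files above, assuming a hypothetical `count` column:

```python
# Unigram-style entropy/perplexity over the empirical n-gram distribution.
import numpy as np
import pandas as pd

df = pd.read_parquet("models/subword_ngram/arc_2gram_subword.parquet")
p = df["count"] / df["count"].sum()          # empirical n-gram probabilities
entropy = float(-(p * np.log2(p)).sum())     # bits per n-gram
perplexity = 2 ** entropy
print(f"H = {entropy:.2f} bits, PPL = {perplexity:.0f}")
```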
models/tokenizer/arc_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:367e3b2a22375ddd3d88817ba1709bd7643670ad0106e2eb141344253a7fb454
- size 551251
+ oid sha256:f9d8b946b07c5ee953e8f59e73c726ea02c7e02592415beaef72b11ce24d07ed
+ size 553548
models/tokenizer/arc_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff

models/tokenizer/arc_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:5823939f3ffb240c668a02808069bb97f33262098d8601e83ea34f1a899be0b5
- size 905690
+ oid sha256:dc878161f3ebfde37edd5364173a906b743ffa182cc4a09da6dedb402cd0d479
+ size 891580
models/tokenizer/arc_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff

models/tokenizer/arc_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:67f3e230ce64dd72a320d2684f32479b3bd61e6f0ba27e151727416757e65509
- size 389381
+ oid sha256:e552460c83ce8697c66bb59c12883cdc3ee45df742f4f4a31ed47f9e096b570f
+ size 390840
models/tokenizer/arc_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
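The `.model`/`.vocab` pairs and the ▁-prefixed pieces in the README samples indicate SentencePiece tokenizers. A minimal usage sketch (reading the README's compression ratio as characters per token is an assumption about how that column is defined):

```python
# Tokenize a sample sentence with the updated 32k SentencePiece model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="models/tokenizer/arc_tokenizer_32k.model")
text = "ܗܘܦܪܟܝܐ ܕܒܝܠܓܝܟ ܗܝ ܗܘܦܪܟܝܐ ܒܛܘܪܩܝܐ܀"   # Sample 3 from the old README
pieces = sp.encode(text, out_type=str)
print(pieces, len(pieces))                   # ▁-prefixed subword pieces
print(len(text) / len(pieces))               # chars per token ~ compression
```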
 
models/vocabulary/arc_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3b45cacd75fde3f5334c33a28bfc4333498d0ac3db4b00832e14d4ffa2dff9c9
- size 103543
+ oid sha256:a0c7a49ea4d72d65cb41e84815025c1a3ac379f03d6c2738d077d2713987df61
+ size 98987
models/vocabulary/arc_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
  {
  "language": "arc",
- "vocabulary_size": 6528,
+ "vocabulary_size": 6113,
+ "variant": "full",
  "statistics": {
- "type_token_ratio": 0.2340763088766938,
+ "type_token_ratio": 0.28987946832669004,
  "coverage": {
- "top_100": 0.2975890140185701,
- "top_1000": 0.5967515410023668,
- "top_5000": 0.8110744102577441,
- "top_10000": 0.8959660849436917
+ "top_100": 0.2557526480443379,
+ "top_1000": 0.5486970192628353,
+ "top_5000": 0.7718473583077925,
+ "top_10000": 0.8689237903161773
  },
- "hapax_count": 11472,
- "hapax_ratio": 0.6373333333333333,
+ "hapax_count": 12141,
+ "hapax_ratio": 0.6651144954530513,
- "total_documents": 1959
+ "total_documents": 1593
  }
  }
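The README's Zipf coefficient (0.8947) and R² come from a log-log fit of frequency against rank. A sketch of reproducing them from the vocabulary file above, assuming a hypothetical `frequency` column:

```python
# Fit log f = -s * log r + c; s is the Zipf coefficient, R² the fit quality.
import numpy as np
import pandas as pd

df = pd.read_parquet("models/vocabulary/arc_vocabulary.parquet")
freq = np.sort(df["frequency"].to_numpy())[::-1].astype(float)
rank = np.arange(1, len(freq) + 1)

slope, intercept = np.polyfit(np.log(rank), np.log(freq), 1)
pred = slope * np.log(rank) + intercept
ss_res = ((np.log(freq) - pred) ** 2).sum()
ss_tot = ((np.log(freq) - np.log(freq).mean()) ** 2).sum()
print(f"zipf = {-slope:.4f}, R² = {1 - ss_res / ss_tot:.4f}")
```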
models/word_markov/arc_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:be0301162ceeb31bf05250db417e7a78133aa599465993cff73117e95292164a
- size 575823
+ oid sha256:832f26dc2d4ff5a3ebd23bf31f4087617a431dd92bb988486e6d4cc21c427ab8
+ size 561103
models/word_markov/arc_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "word",
  "language": "arc",
- "unique_contexts": 18087,
- "total_transitions": 98674
+ "unique_contexts": 18018,
+ "total_transitions": 61378
  }
models/word_markov/arc_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:91579b48342a7c4a5e2042ea24e67219434aa6d0e3c1f757801d321393a4642f
- size 1199157
+ oid sha256:de53c3acb316d8128ebd4e1ec3ca41812fa0d445a955fc1d817d7c545461f871
+ size 1081973
models/word_markov/arc_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "word",
  "language": "arc",
- "unique_contexts": 55465,
- "total_transitions": 96715
+ "unique_contexts": 45749,
+ "total_transitions": 59785
  }
models/word_markov/arc_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e49b6befa755f07916a3b98b954b006f24cc3d8c685cd363a0d80b6068712fd2
- size 1603718
+ oid sha256:8d239eaf545ca57c200cb2afeabb8ab0ceb2b6cc79f4dcd2f2e5f4aa5cc4e9c1
+ size 1332931
models/word_markov/arc_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "word",
  "language": "arc",
- "unique_contexts": 72203,
- "total_transitions": 94756
+ "unique_contexts": 51822,
+ "total_transitions": 58192
  }
models/word_markov/arc_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9a54a5ffb22a9f5f303e36dd991688e4ab528b6254681a949db801002249f59e
- size 1904517
+ oid sha256:0967dff8cd0b2a3831120ee23e69a58ee8a0d0a6709840d4dc7fadc3339b93d4
+ size 1531458
models/word_markov/arc_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "word",
  "language": "arc",
- "unique_contexts": 78995,
- "total_transitions": 92798
+ "unique_contexts": 52472,
+ "total_transitions": 56600
  }
models/word_ngram/arc_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4bfab18a90cce87e63b943074ea22993c759d1f45144ccb7cf6a68f8b6bf8211
- size 35430
+ oid sha256:6f23e9bba8c05d6404cfee1b351967e30c07ea981da39daa197285b1317f13eb
+ size 17245
models/word_ngram/arc_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "word",
  "language": "arc",
- "unique_ngrams": 1994,
- "total_ngrams": 98674
+ "unique_ngrams": 718,
+ "total_ngrams": 61378
  }
models/word_ngram/arc_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c8a7068fff152c1beb637797f6f800103cf8355b1cd9afd9f1e191b20b5a17f3
- size 54386
+ oid sha256:cc442eef1bdcf430ec4cd934f856b54bd782b78003fd440e25d4a642026553ca
+ size 21893
models/word_ngram/arc_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 3,
  "variant": "word",
  "language": "arc",
- "unique_ngrams": 2669,
- "total_ngrams": 96715
+ "unique_ngrams": 752,
+ "total_ngrams": 59785
  }
models/word_ngram/arc_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:cc77168295f018ed264ddcf16570c851b899416ced88b358010acf98e223b687
- size 103678
+ oid sha256:9ce4c224db9cde057f9dcd3cb0eb1494a86374ec2ee5b50e41d53a7bcb3197b9
+ size 43954
models/word_ngram/arc_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 4,
  "variant": "word",
  "language": "arc",
- "unique_ngrams": 4604,
- "total_ngrams": 94756
+ "unique_ngrams": 1438,
+ "total_ngrams": 58192
  }
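The README's Top-100/Top-1000 coverage columns are the share of all n-gram occurrences accounted for by the most frequent patterns. A sketch over one of the word n-gram files above, again assuming a hypothetical `count` column:

```python
# Cumulative coverage of the most frequent n-grams.
import pandas as pd

df = pd.read_parquet("models/word_ngram/arc_2gram_word.parquet")
counts = df["count"].sort_values(ascending=False)
for n in (100, 1000):
    print(f"top-{n} coverage: {counts.head(n).sum() / counts.sum():.1%}")
```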
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details (before)

  • SHA256: bb05f4f6db9565357885668dc2ac654978606aff579e9c585ba67096cc25b690
  • Pointer size: 131 Bytes
  • Size of remote file: 145 kB

Git LFS Details (after)

  • SHA256: 8c5b5850c6b7d16dab801702163a96d4cf110a560643b892f8c8e671beb6b86b
  • Pointer size: 131 Bytes
  • Size of remote file: 144 kB
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED