omarkamali committed on Jan 3

Commit

a98ea1b

verified ·

1 Parent(s): ecd1032

Upload all models and assets for arc (latest)


This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +1 -0
README.md +184 -147
models/embeddings/aligned/arc_128d.bin +3 -0
models/embeddings/aligned/arc_128d.meta.json +1 -0
models/embeddings/aligned/arc_128d.projection.npy +3 -0
models/embeddings/aligned/arc_128d_metadata.json +8 -0
models/embeddings/aligned/arc_32d.bin +3 -0
models/embeddings/aligned/arc_32d.meta.json +1 -0
models/embeddings/aligned/arc_32d.projection.npy +3 -0
models/embeddings/aligned/arc_32d_metadata.json +8 -0
models/embeddings/aligned/arc_64d.bin +3 -0
models/embeddings/aligned/arc_64d.meta.json +1 -0
models/embeddings/aligned/arc_64d.projection.npy +3 -0
models/embeddings/aligned/arc_64d_metadata.json +8 -0
models/embeddings/monolingual/arc_128d.bin +2 -2
models/embeddings/monolingual/arc_128d_metadata.json +1 -1
models/embeddings/monolingual/arc_32d.bin +2 -2
models/embeddings/monolingual/arc_32d_metadata.json +1 -1
models/embeddings/monolingual/arc_64d.bin +2 -2
models/embeddings/monolingual/arc_64d_metadata.json +1 -1
models/subword_markov/arc_markov_ctx1_subword.parquet +2 -2
models/subword_markov/arc_markov_ctx1_subword_metadata.json +2 -2
models/subword_markov/arc_markov_ctx2_subword.parquet +2 -2
models/subword_markov/arc_markov_ctx2_subword_metadata.json +2 -2
models/subword_markov/arc_markov_ctx3_subword.parquet +2 -2
models/subword_markov/arc_markov_ctx3_subword_metadata.json +2 -2
models/subword_markov/arc_markov_ctx4_subword.parquet +2 -2
models/subword_markov/arc_markov_ctx4_subword_metadata.json +2 -2
models/subword_ngram/arc_2gram_subword.parquet +2 -2
models/subword_ngram/arc_2gram_subword_metadata.json +2 -2
models/subword_ngram/arc_3gram_subword.parquet +2 -2
models/subword_ngram/arc_3gram_subword_metadata.json +2 -2
models/subword_ngram/arc_4gram_subword.parquet +2 -2
models/subword_ngram/arc_4gram_subword_metadata.json +2 -2
models/subword_ngram/arc_5gram_subword.parquet +3 -0
models/subword_ngram/arc_5gram_subword_metadata.json +7 -0
models/tokenizer/arc_tokenizer_16k.model +2 -2
models/tokenizer/arc_tokenizer_16k.vocab +0 -0
models/tokenizer/arc_tokenizer_32k.model +2 -2
models/tokenizer/arc_tokenizer_32k.vocab +0 -0
models/tokenizer/arc_tokenizer_8k.model +2 -2
models/tokenizer/arc_tokenizer_8k.vocab +0 -0
models/vocabulary/arc_vocabulary.parquet +2 -2
models/vocabulary/arc_vocabulary_metadata.json +9 -9
models/word_markov/arc_markov_ctx1_word.parquet +2 -2
models/word_markov/arc_markov_ctx1_word_metadata.json +2 -2
models/word_markov/arc_markov_ctx2_word.parquet +2 -2
models/word_markov/arc_markov_ctx2_word_metadata.json +2 -2
models/word_markov/arc_markov_ctx3_word.parquet +2 -2
models/word_markov/arc_markov_ctx3_word_metadata.json +2 -2

.gitattributes CHANGED

@@ -39,3 +39,4 @@ visualizations/position_encoding_comparison.png filter=lfs diff=lfs merge=lfs -t
 visualizations/tsne_sentences.png filter=lfs diff=lfs merge=lfs -text
 visualizations/tsne_words.png filter=lfs diff=lfs merge=lfs -text
 visualizations/zipf_law.png filter=lfs diff=lfs merge=lfs -text

+visualizations/embedding_tsne_multilingual.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,6 +1,6 @@
 ---
 language: arc
-language_name: ARC
 language_family: semitic_aramaic
 tags:
   - wikilangs
@@ -10,11 +10,21 @@ tags:
   - n-gram
   - markov
   - wikipedia
   - monolingual
   - family-semitic_aramaic
 license: mit
 library_name: wikilangs
-pipeline_tag: feature-extraction
 datasets:
   - omarkamali/wikipedia-monthly
 dataset_info:
@@ -26,17 +36,17 @@ metrics:
     value: 4.583
   - name: best_isotropy
     type: isotropy
-    value: 0.2739
   - name: vocabulary_size
     type: vocab
     value: 0
 generated: 2026-01-03
 ---
-# ARC - Wikilangs Models
 ## Comprehensive Research Report & Full Ablation Study
-This repository contains NLP models trained and evaluated by Wikilangs, specifically on **ARC** Wikipedia data.
 We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and word embeddings.
 ## 📋 Repository Contents
@@ -60,7 +70,7 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
-- [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
 - [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
@@ -80,43 +90,43 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
-| **8k** | 3.552x | 3.57 | 0.1271% | 63,747 |
-| **16k** | 3.988x | 4.01 | 0.1427% | 56,780 |
-| **32k** | 4.583x 🏆 | 4.60 | 0.1640% | 49,402 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
-**Sample 1:** `ܡܬܠܬܐ ܗܘ ܐܣܟܡܐ ܡܚܪܝܐ (ܓܐܘܡܛܪܝܐ) ܕܐܝܬ ܠܗ ܬܠܬܐ ܐܠܥ̈ܐ ܘܬܠܬܐ ܙܘܝܬ̈ܐ܀`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁ܡܬܠܬܐ ▁ܗܘ ▁ܐܣܟܡܐ ▁ܡܚܪܝܐ ▁( ܓܐ ܘܡ ܛܪܝܐ ) ▁ܕܐܝܬ ... (+8 more)` | 18 |
-| 16k | `▁ܡܬܠܬܐ ▁ܗܘ ▁ܐܣܟܡܐ ▁ܡܚܪܝܐ ▁( ܓܐܘܡܛܪܝܐ ) ▁ܕܐܝܬ ▁ܠܗ ▁ܬܠܬܐ ... (+4 more)` | 14 |
-| 32k | `▁ܡܬܠܬܐ ▁ܗܘ ▁ܐܣܟܡܐ ▁ܡܚܪܝܐ ▁( ܓܐܘܡܛܪܝܐ ) ▁ܕܐܝܬ ▁ܠܗ ▁ܬܠܬܐ ... (+3 more)` | 13 |
-**Sample 2:** `ܟܐܢܣܐܣ ܐܘ ܟܐܢܙܐܣ (Kansas) ܐܝܬܝܗ ܐܘܚܕܢܐ ܓܘ ܡܢܬܐ ܡܥܪܒܝܬܐ ܡܨܥܝܬܐ ܕܐ̈ܘܚܕܢܐ ܡ̈ܚܝܕܐ ܕܐ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁ܟ ܐܢ ܣܐܣ ▁ܐܘ ▁ܟ ܐܢ ܙܐ ܣ ▁( k ... (+14 more)` | 24 |
-| 16k | `▁ܟܐܢܣܐܣ ▁ܐܘ ▁ܟܐܢܙܐܣ ▁( kansas ) ▁ܐܝܬܝܗ ▁ܐܘܚܕܢܐ ▁ܓܘ ▁ܡܢܬܐ ... (+7 more)` | 17 |
-| 32k | `▁ܟܐܢܣܐܣ ▁ܐܘ ▁ܟܐܢܙܐܣ ▁( kansas ) ▁ܐܝܬܝܗ ▁ܐܘܚܕܢܐ ▁ܓܘ ▁ܡܢܬܐ ... (+7 more)` | 17 |
-**Sample 3:** `ܢܝܘ ܗܐܡܦܫܪ (New Hampshire) ܗܝ ܐܬܪܐ ܓܘ ܡܢܬܐ ܓܪܒܝܝܬܐ ܡܕܢܚܝܬܐ ܕܐܬ݂ܪ̈ܘܬ݂ܐ ܡ̈ܚܝܕܐ ܕܐܡ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁ܢܝܘ ▁ܗܐܡ ܦܫ ܪ ▁( new ▁h am p shi ... (+14 more)` | 24 |
-| 16k | `▁ܢܝܘ ▁ܗܐܡܦܫܪ ▁( new ▁hamp shire ) ▁ܗܝ ▁ܐܬܪܐ ▁ܓܘ ... (+9 more)` | 19 |
-| 32k | `▁ܢܝܘ ▁ܗܐܡܦܫܪ ▁( new ▁hampshire ) ▁ܗܝ ▁ܐܬܪܐ ▁ܓܘ ▁ܡܢܬܐ ... (+8 more)` | 18 |
 ### Key Findings
 - **Best Compression:** 32k achieves 4.583x compression
-- **Lowest UNK Rate:** 8k with 0.1271% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
@@ -133,12 +143,14 @@ Below are sample sentences tokenized with each vocabulary size:
 | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
 |--------|---------|------------|---------|----------------|------------------|-------------------|
-| **2-gram** | Word | 477 | 8.90 | 718 | 45.8% | 100.0% |
-| **2-gram** | Subword | 365 🏆 | 8.51 | 2,347 | 59.8% | 96.1% |
-| **3-gram** | Word | 437 | 8.77 | 752 | 52.0% | 100.0% |
-| **3-gram** | Subword | 2,390 | 11.22 | 10,625 | 28.2% | 66.9% |
-| **4-gram** | Word | 742 | 9.53 | 1,438 | 43.7% | 84.6% |
-| **4-gram** | Subword | 8,576 | 13.07 | 28,979 | 14.2% | 43.1% |
 ### Top 5 N-grams by Size
@@ -148,66 +160,86 @@ Below are sample sentences tokenized with each vocabulary size:
 |------|--------|-------|
 | 1 | `ܐܦ ܚܙܝ` | 193 |
 | 2 | `ܚܕ ܡܢ` | 141 |
-| 3 | `ܗܝ ܐܬܪܐ` | 123 |
-| 4 | `ܐܝܬ ܠܗ` | 103 |
-| 5 | `ܬܚܘܡܐ ܥܡ` | 88 |
 **3-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
 | 1 | `ܗܘ ܚܕ ܡܢ` | 72 |
-| 2 | `ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍` | 52 |
-| 3 | `ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ` | 52 |
-| 4 | `ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ` | 52 |
-| 5 | `ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ` | 52 |
 **4-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ` | 52 |
-| 2 | `ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ` | 52 |
-| 3 | `ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ` | 52 |
-| 4 | `ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ` | 52 |
-| 5 | `ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ` | 52 |
 **2-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `ܐ _` | 24,633 |
-| 2 | `_ ܕ` | 7,621 |
-| 3 | `ܬ ܐ` | 7,176 |
-| 4 | `_ ܐ` | 6,899 |
-| 5 | `ܝ ܐ` | 5,702 |
 **3-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `ܐ _ ܕ` | 6,138 |
-| 2 | `ܬ ܐ _` | 5,890 |
-| 3 | `ܝ ܐ _` | 4,242 |
 | 4 | `ܐ _ ܐ` | 2,477 |
-| 5 | `ܢ ܐ _` | 2,397 |
 **4-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `ܬ ܐ _ ܕ` | 2,008 |
-| 2 | `ܝ ܬ ܐ _` | 1,523 |
-| 3 | `ܐ ܝ ܬ _` | 1,367 |
-| 4 | `ܘ ܬ ܐ _` | 1,297 |
-| 5 | `_ ܡ ܢ _` | 1,210 |
 ### Key Findings
-- **Best Perplexity:** 2-gram (subword) with 365
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
-- **Coverage:** Top-1000 patterns cover ~43% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 ---
@@ -223,14 +255,14 @@ Below are sample sentences tokenized with each vocabulary size:
 | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
 |---------|---------|-------------|------------|------------------|-----------------|----------------|
-| **1** | Word | 0.5449 | 1.459 | 2.60 | 18,018 | 45.5% |
-| **1** | Subword | 0.9655 | 1.953 | 6.06 | 1,232 | 3.5% |
-| **2** | Word | 0.1027 | 1.074 | 1.16 | 45,749 | 89.7% |
-| **2** | Subword | 0.7977 | 1.738 | 3.85 | 7,459 | 20.2% |
-| **3** | Word | 0.0295 | 1.021 | 1.04 | 51,822 | 97.1% |
-| **3** | Subword | 0.5934 | 1.509 | 2.45 | 28,618 | 40.7% |
-| **4** | Word | 0.0106 🏆 | 1.007 | 1.01 | 52,472 | 98.9% |
-| **4** | Subword | 0.3583 | 1.282 | 1.71 | 69,915 | 64.2% |
 ### Generated Text Samples (Word-based)
@@ -238,27 +270,27 @@ Below are text samples generated from each word-based Markov chain model:
 **Context Size 1:**
-1. `ܡܢ ܓܪܒܝܐ ܕܒܝܬ ܗܘܠܢܕ̈ܝܐ ܗܘܠܢܕܐܝܬ hengelo ܗܝ ܐܘܚܕܢܐ ܒܓܪܒܝܐ ܕܥܝܪܐܩ ܝܘܡܢܐ ܬܡܢ ܣܝܡܠܗ̇ ܥܕܬ̈ܐ ܡܕܢܚܝܬ̈ܐ ܕܐܬ݂...`
-2. `ܐܘ ܙܢܓܒܝܠܐ ܗܘ ܡܣܘܪܩܐ ܡܐܢܐ ܕܐܪܕܟܠܘܬܐ`
-3. `ܗܘ ܓܘܣܐ ܒܥܠܬܐ ܐܘ ܝܣܪܝܠ ܐܘ ܬܫܪܝܢ ܒ ܩܛܠܥܡܐ ܣܘܪܝܝܐ ܒ ت ܬ ܦ 80 754`
 **Context Size 2:**
-1. `ܐܦ ܚܙܝ ܓܒܪܐ`
-2. `ܚܕ ܡܢ ܐܪܒܥܐ ܟܬܒ̈ܐ ܩܕ̈ܡܝܐ ܕܕܝܬܝܩܝ ܚܕܬܐ ܦܘܠܘܣ ܫܠܝܚܐ ܟܬܒ ܗܕܐ ܐܓܪܬܐ ܠܩܘܠܣܝ̈ܐ ܕܗܢܘܢ ܐܢܫ̈ܐ ܕܡܕܝܢܬܐ ܕܐܦܣܘܣ`
-3. `ܗܝ ܐܬܪܐ ܒܐܘܪܘܦܐ ܩܘܛܢܝܘܬܐ ܕܐܝܪܠܢܕ ܗܝ ܒܓܘ ܓܙܪܬܐ ܕܐܝܪܠܢܕ ܠܐܝܪܠܢܕ ܓܪܒܝܝܬܐ ܐܝܬ ܬܚܘܡܐ ܥܡ ܪܘܡܢܝܐ ܘܥܡ ܛܘܪܩܝܐ`
 **Context Size 3:**
-1. `ܗܘ ܚܕ ܡܢ ܓܘܢ̈ܐ ܪ̈ܝܫܝܐ ܕܗܢܘܢ ܣܘܡܩܐ ܘܫܥܘܬܐ ܘܙܪܩܐ ܢܘܗܪܐ ܣܘܡܩܐ ܐܝܬ ܠܗ ܐܘܪܟܐ ܓܠܠܝܐ ܢܐܢܘܡܝܛܪ`
-2. `ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ...`
-3. `ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟ...`
 **Context Size 4:**
-1. `ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚ...`
-2. `ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ...`
-3. `ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ ܡܒܤܢ ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝ...`
 ### Generated Text Samples (Subword-based)
@@ -267,34 +299,34 @@ Below are text samples generated from each subword-based Markov chain model:
 **Context Size 1:**
-1. `_ܨܝܬܐ_أبيد)_ܙܘܢܓ`
-2. `ܐ._ܥܡܫܝܗܝܢܝܬ̈ܐ_ܡܕ`
-3. `ܝܟܫܬܐ_ܐ܀_ܐ_ܫܘܪܒܡ`
 **Context Size 2:**
-1. `ܐ_ܕܗܘܡܝܠܢܕܐ_ܕܐܥܬܐ`
-2. `_ܕܥܣܪܘܝܕܝܢܘܢ_ܐܘ_ܦ`
-3. `ܬܐ_ܕܩܪܝܬܐ_ܥܠ_ܐܟܪܝ`
 **Context Size 3:**
-1. `ܐ_ܕܒܓܕܐ_ܕܛܘܪ_ܗܘܘ_ܬ`
-2. `ܬܐ_ܕܛܲܟ݂ܣܵܐ_ܡܠܟܐ_ܫܠܝܚ̈`
-3. `ܝܐ_ܘܡܬܝ_ܡܠܝܝܐ_מלזי`
 **Context Size 4:**
-1. `ܬܐ_ܕܗܘܦܪܟܝܐ_ܕܪܗܘܡܝܬ`
-2. `ܝܬܐ_ܙܘ_ܫܘܬܐ_ܕܬܘܕܝܬܐ`
-3. `ܐܝܬ_ܡܨܪ̈ܝܐ܆_ܚܣܢ_ܒܡܕܝ`
 ### Key Findings
 - **Best Predictability:** Context-4 (word) with 98.9% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
-- **Memory Trade-off:** Larger contexts require more storage (69,915 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
@@ -310,9 +342,9 @@ Below are text samples generated from each subword-based Markov chain model:
 | Metric | Value |
 |--------|-------|
-| Vocabulary Size | 6,113 |
-| Total Tokens | 50,830 |
-| Mean Frequency | 8.32 |
 | Median Frequency | 3 |
 | Frequency Std Dev | 32.05 |
@@ -320,16 +352,16 @@ Below are text samples generated from each subword-based Markov chain model:
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | ܡܢ | 1,283 |
-| 2 | ܐܘ | 975 |
-| 3 | ܗܘ | 861 |
 | 4 | ܗܝ | 816 |
-| 5 | ܐܝܬ | 512 |
 | 6 | ܗܘܐ | 394 |
-| 7 | ܥܠ | 327 |
-| 8 | ܘܥܡ | 326 |
-| 9 | ܐܦ | 275 |
-| 10 | ܠܫܢܐ | 266 |
 ### Least Common Words (from vocabulary)
@@ -350,24 +382,24 @@ Below are text samples generated from each subword-based Markov chain model:
 | Metric | Value |
 |--------|-------|
-| Zipf Coefficient | 0.8947 |
-| R² (Goodness of Fit) | 0.982774 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
-| Top 100 | 31.7% |
 | Top 1,000 | 68.0% |
-| Top 5,000 | 95.6% |
 | Top 10,000 | 0.0% |
 ### Key Findings
 - **Zipf Compliance:** R²=0.9828 indicates excellent adherence to Zipf's law
-- **High Frequency Dominance:** Top 100 words cover 31.7% of corpus
-- **Long Tail:** -3,887 words needed for remaining 100.0% coverage
 ---
 ## 5. Word Embeddings Evaluation
@@ -383,37 +415,40 @@ Below are text samples generated from each subword-based Markov chain model:
 ### 5.1 Cross-Lingual Alignment
-> *Note: Multilingual alignment visualization not available for this language.*
 ### 5.2 Model Comparison
 | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
 |-------|-----------|----------|------------------|---------------|----------------|
-| **mono_32d** | 32 | 0.2739 🏆 | 0.4979 | N/A | N/A |
-| **mono_64d** | 64 | 0.0566 | 0.5064 | N/A | N/A |
-| **mono_128d** | 128 | 0.0089 | 0.4882 | N/A | N/A |
 ### Key Findings
-- **Best Isotropy:** mono_32d with 0.2739 (more uniform distribution)
-- **Semantic Density:** Average pairwise similarity of 0.4975. Lower values indicate better semantic separation.
-- **Alignment Quality:** No aligned models evaluated in this run.
 - **Recommendation:** 128d aligned for best cross-lingual performance
 ---
 ## 6.  Morphological Analysis (Experimental)
-> ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
 This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
 ### 6.1 Productivity & Complexity
 | Metric | Value | Interpretation | Recommendation |
 |--------|-------|----------------|----------------|
-| Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
-| Idiomaticity Gap | **-1.000** | Low formulaic content | - |
 ### 6.2 Affix Inventory (Productive Units)
@@ -426,13 +461,13 @@ These are the most productive prefixes and suffixes identified by sampling the v
 #### Productive Suffixes
 | Suffix | Examples |
 |--------|----------|
-| `-ܐ` | ܘܫܛܚܐ, ܠܘܚ̈ܐ, ܐܢܫܘܬܐ |
-| `-ܬܐ` | ܐܢܫܘܬܐ, ܚܘܪܬܐ, ܐܘܪܬܕܘܟܣܝܬܐ |
-| `-ܝܐ` | ܘܥܪ̈ܒܝܐ, ܒܐܠܒܢܝܐ, ܥܒܝܐ |
-| `-̈ܐ` | ܠܘܚ̈ܐ, ܡܚܝܕ̈ܐ, ܥܝܢ̈ܐ |
-| `-ܝܬܐ` | ܐܘܪܬܕܘܟܣܝܬܐ, ܣܘܪܝܝܬܐ, ܒܩܕܡܝܬܐ |
-| `-ܘܬܐ` | ܐܢܫܘܬܐ, ܘܕܡܠܟܘܬܐ, ܛܝܒܘܬܐ |
-| `-ܢܐ` | ܕܫܝܢܐ, ܡܢܝܢܐ, ܥܝܢܐ |
 ### 6.3 Bound Stems (Lexical Roots)
@@ -440,18 +475,18 @@ Bound stems are high-frequency subword units that are semantically cohesive but
 | Stem | Cohesion | Substitutability | Examples |
 |------|----------|------------------|----------|
-| `ܢܝܬܐ` | 1.58x | 23 contexts | ܦܢܝܬܐ, ܡܢܝܬܐ, ܡܪܢܝܬܐ |
-| `ܪܝܬܐ` | 1.59x | 18 contexts | ܫܪܝܬܐ, ܩܪܝܬܐ, ܒܪܝܬܐ |
-| `ܫܝܚܝ` | 1.61x | 16 contexts | ܡܫܝܚܝܐ, ܡܫܝܚܝܬܐ, ܕܡܫܝܚܝܐ |
-| `ܪܒܝܐ` | 1.59x | 16 contexts | ܨܪܒܝܐ, ܥܪܒܝܐ, ܐܪܒܝܐ |
-| `ܘܢܝܐ` | 1.57x | 16 contexts | ܟܘܢܝܐ, ܩܘܢܝܐ, ܓܘܢܝܐ |
-| `ܘܪܝܐ` | 1.37x | 23 contexts | ܛܘܪܝܐ, ܣܘܪܝܐ, ܟܘܪܝܐ |
-| `ܡܫܝܚ` | 1.59x | 14 contexts | ܡܫܝܚܐ, ܡܫܝܚܝܐ, ܕܡܫܝܚܐ |
-| `ܡܕܝܢ` | 1.58x | 13 contexts | ܡܕܝܢܬ, ܡܕܝܢܬܐ, ܠܡܕܝܢܬ |
-| `ܢܐܝܬ` | 1.53x | 14 contexts | ܝܘܢܐܝܬ, ܨܝܢܐܝܬ, ܟܠܢܐܝܬ |
-| `ܣܘܪܝ` | 1.38x | 18 contexts | ܣܘܪܝܐ, ܣܘܪܝܬ, ܘܣܘܪܝܐ |
-| `ܝܢܬܐ` | 1.65x | 9 contexts | ܩܝܢܬܐ, ܡܕܝܢܬܐ, ܣܦܝܢܬܐ |
-| `ܕܝܢܬ` | 1.62x | 9 contexts | ܡܕܝܢܬ, ܡܕܝܢܬܐ, ܠܡܕܝܢܬ |
 ### 6.4 Affix Compatibility (Co-occurrence)
@@ -467,25 +502,27 @@ Using **Recursive Hierarchical Substitutability**, we decompose complex words in
 | Word | Suggested Split | Confidence | Stem |
 |------|-----------------|------------|------|
 | ܝܘܪܕܢܢܝܬܐ | **`ܝܘܪܕܢܢ-ܝܬܐ`** | 4.5 | `ܝܘܪܕܢܢ` |
 | ܥܘܬܡܐܢܝܬܐ | **`ܥܘܬܡܐܢ-ܝܬܐ`** | 4.5 | `ܥܘܬܡܐܢ` |
 | ܕܬܠܝܬܝܘܬܐ | **`ܕܬܠܝܬܝ-ܘܬܐ`** | 4.5 | `ܕܬܠܝܬܝ` |
 | ܕܐܢܛܝܘܟܝܐ | **`ܕܐܢܛܝܘܟ-ܝܐ`** | 4.5 | `ܕܐܢܛܝܘܟ` |
-| ܐܝܣܪܐܝܠܝܐ | **`ܐܝܣܪܐܝܠ-ܝܐ`** | 4.5 | `ܐܝܣܪܐܝܠ` |
-| ܦܘܪܛܘܓܠܝܐ | **`ܦܘܪܛܘܓܠ-ܝܐ`** | 4.5 | `ܦܘܪܛܘܓܠ` |
-| ܡܬܥܡܪܢܝܬܐ | **`ܡܬܥܡܪܢ-ܝܬܐ`** | 4.5 | `ܡܬܥܡܪܢ` |
 | ܛܘܪܥܒܕܝܢܝܐ | **`ܛܘܪܥܒܕܝܢ-ܝܐ`** | 4.5 | `ܛܘܪܥܒܕܝܢ` |
 | ܩܬܘܠܝܩܝ̈ܐ | **`ܩܬܘܠܝܩܝ-̈ܐ`** | 4.5 | `ܩܬܘܠܝܩܝ` |
 | ܠܫܘܠܛܢܘܬܐ | **`ܠܫܘܠܛܢ-ܘܬܐ`** | 1.5 | `ܠܫܘܠܛܢ` |
-| ܐܘܪܬܕܘܟܣܝܐ | **`ܐܘܪܬܕܘܟܣ-ܝܐ`** | 1.5 | `ܐܘܪܬܕܘܟܣ` |
-| ܐܝܓܘܦܛܝܬܐ | **`ܐܝܓܘܦܛ-ܝܬܐ`** | 1.5 | `ܐܝܓܘܦܛ` |
-| ܘܒܐܘܚܕ̈ܢܐ | **`ܘܒܐܘܚܕ̈-ܢܐ`** | 1.5 | `ܘܒܐܘܚܕ̈` |
-| ܘܒܡܫܝܚܝܘܬܐ | **`ܘܒܡܫܝܚܝ-ܘܬܐ`** | 1.5 | `ܘܒܡܫܝܚܝ` |
-| ܢܩܪܘܡܢܛܝܐ | **`ܢܩܪܘܡܢܛ-ܝܐ`** | 1.5 | `ܢܩܪܘܡܢܛ` |
 ### 6.6 Linguistic Interpretation
 > **Automated Insight:**
-The language ARC appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
 ---
 ## 7. Summary & Recommendations
@@ -497,7 +534,7 @@ The language ARC appears to be more isolating or has a highly fixed vocabulary.
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
 | Tokenizer | **32k BPE** | Best compression (4.58x) |
-| N-gram | **2-gram** | Lowest perplexity (365) |
 | Markov | **Context-4** | Highest predictability (98.9%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
@@ -712,4 +749,4 @@ MIT License - Free for academic and commercial use.
 ---
 *Generated by Wikilangs Models Pipeline*
-*Report Date: 2026-01-03 05:14:47*

 ---
 language: arc
+language_name: Aramaic
 language_family: semitic_aramaic
 tags:
   - wikilangs
   - n-gram
   - markov
   - wikipedia
+  - feature-extraction
+  - sentence-similarity
+  - tokenization
+  - n-grams
+  - markov-chain
+  - text-mining
+  - fasttext
+  - babelvec
+  - vocabulous
+  - vocabulary
   - monolingual
   - family-semitic_aramaic
 license: mit
 library_name: wikilangs
+pipeline_tag: text-generation
 datasets:
   - omarkamali/wikipedia-monthly
 dataset_info:
     value: 4.583
   - name: best_isotropy
     type: isotropy
+    value: 0.3326
   - name: vocabulary_size
     type: vocab
     value: 0
 generated: 2026-01-03
 ---
+# Aramaic - Wikilangs Models
 ## Comprehensive Research Report & Full Ablation Study
+This repository contains NLP models trained and evaluated by Wikilangs, specifically on **Aramaic** Wikipedia data.
 We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and word embeddings.
 ## 📋 Repository Contents
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
+- [6. Morphological Analysis (Experimental)](#6--morphological-analysis-experimental)
 - [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
+| **8k** | 3.553x | 3.57 | 0.1262% | 63,406 |
+| **16k** | 3.990x | 4.01 | 0.1417% | 56,456 |
+| **32k** | 4.583x 🏆 | 4.60 | 0.1628% | 49,148 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
+**Sample 1:** `ܐܫܬܐ (ܟܢܘܫܝܐ: ܐܫܝ̈ܬܐ) ܗܝ ܡܢܬܐ ܓܠܝܠܬܐ ܕܨܪܘܝܘܬܐ ܕܬܦܠ ܒܒܣܬܪܐ ܕܐܓܢܐ ܕܓܘܫܡ̈ܐ ܕܐܢܫ̈ܐ ܘ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁ܐܫܬܐ ▁( ܟܢܘܫܝܐ : ▁ܐ ܫܝ ̈ܬܐ ) ▁ܗܝ ▁ܡܢܬܐ ... (+20 more)` | 30 |
+| 16k | `▁ܐܫܬܐ ▁( ܟܢܘܫܝܐ : ▁ܐ ܫܝ ̈ܬܐ ) ▁ܗܝ ▁ܡܢܬܐ ... (+18 more)` | 28 |
+| 32k | `▁ܐܫܬܐ ▁( ܟܢܘܫܝܐ : ▁ܐܫܝ̈ܬܐ ) ▁ܗܝ ▁ܡܢܬܐ ▁ܓܠܝܠܬܐ ▁ܕܨܪܘܝܘܬܐ ... (+12 more)` | 22 |
+**Sample 2:** `ܐܫܬܐ ܗܝ ܓܕܫܐ ܐܣܝܝܐ. ܐܫܬܐ ܗܝ ܚܡܝܡܘܬ ܓܘܫܡܐ ܠܥܠ ܡܢ ܫܘܝܐ ܟܝܢܝܐ ܕܚܡܝܡܘܬܐ ܓܘܝܬܐ ܕܓܘܫܡܐ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁ܐܫܬܐ ▁ܗܝ ▁ܓܕܫܐ ▁ܐܣ ܝܝܐ . ▁ܐܫܬܐ ▁ܗܝ ▁ܚܡ ܝܡ ... (+10 more)` | 20 |
+| 16k | `▁ܐܫܬܐ ▁ܗܝ ▁ܓܕܫܐ ▁ܐܣܝܝܐ . ▁ܐܫܬܐ ▁ܗܝ ▁ܚܡܝܡܘܬ ▁ܓܘܫܡܐ ▁ܠܥܠ ... (+6 more)` | 16 |
+| 32k | `▁ܐܫܬܐ ▁ܗܝ ▁ܓܕܫܐ ▁ܐܣܝܝܐ . ▁ܐܫܬܐ ▁ܗܝ ▁ܚܡܝܡܘܬ ▁ܓܘܫܡܐ ▁ܠܥܠ ... (+6 more)` | 16 |
+**Sample 3:** `ܡܪܝ ܢܪܣܝ ܕܒܙ (ܐܬܝܠܕ 17 ܐܝܪ - ܡܝܬ 14 ܫܒܛ ܗܘܐ ܡܝܛܪܦܘܠܝܛܐ ܕܠܒܢܢ ܘܣܘܪܝܐ ܘܟܠܗ̇ ܐܘܪܘܦܐ...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁ܡܪܝ ▁ܢܪܣܝ ▁ܕܒܙ ▁( ܐܬܝܠܕ ▁ 1 7 ▁ܐܝܪ ▁- ... (+17 more)` | 27 |
+| 16k | `▁ܡܪܝ ▁ܢܪܣܝ ▁ܕܒܙ ▁( ܐܬܝܠܕ ▁ 1 7 ▁ܐܝܪ ▁- ... (+17 more)` | 27 |
+| 32k | `▁ܡܪܝ ▁ܢܪܣܝ ▁ܕܒܙ ▁( ܐܬܝܠܕ ▁ 1 7 ▁ܐܝܪ ▁- ... (+16 more)` | 26 |
 ### Key Findings
 - **Best Compression:** 32k achieves 4.583x compression
+- **Lowest UNK Rate:** 8k with 0.1262% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
 | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
 |--------|---------|------------|---------|----------------|------------------|-------------------|
+| **2-gram** | Word | 477 | 8.90 | 719 | 45.8% | 100.0% |
+| **2-gram** | Subword | 363 🏆 | 8.50 | 2,330 | 59.9% | 96.1% |
+| **3-gram** | Word | 438 | 8.77 | 754 | 52.0% | 100.0% |
+| **3-gram** | Subword | 2,379 | 11.22 | 10,583 | 28.3% | 67.0% |
+| **4-gram** | Word | 759 | 9.57 | 1,466 | 43.3% | 83.8% |
+| **4-gram** | Subword | 8,542 | 13.06 | 28,872 | 14.3% | 43.2% |
+| **5-gram** | Word | 508 | 8.99 | 1,055 | 50.5% | 97.5% |
+| **5-gram** | Subword | 16,168 | 13.98 | 38,919 | 8.4% | 30.9% |
 ### Top 5 N-grams by Size
 |------|--------|-------|
 | 1 | `ܐܦ ܚܙܝ` | 193 |
 | 2 | `ܚܕ ܡܢ` | 141 |
+| 3 | `ܗܝ ܐܬܪܐ` | 124 |
+| 4 | `ܐܝܬ ܠܗ` | 102 |
+| 5 | `ܬܚܘܡܐ ܥܡ` | 89 |
 **3-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
 | 1 | `ܗܘ ܚܕ ܡܢ` | 72 |
+| 2 | `ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ` | 52 |
+| 3 | `ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ` | 52 |
+| 4 | `ܝܢܦܠ ܡܒܤܢ ܐܤܡ` | 52 |
+| 5 | `ܓܐ ܝܢܦܠ ܡܒܤܢ` | 52 |
 **4-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ` | 52 |
+| 2 | `ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ` | 52 |
+| 3 | `ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ` | 52 |
+| 4 | `ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ` | 52 |
+| 5 | `ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ` | 52 |
+**5-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ` | 52 |
+| 2 | `ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ ܡܒܤܢ ܐܤܡ` | 52 |
+| 3 | `ܓܐ ܝܢܦܠ ܡܒܤܢ ܐܤܡ ܟܛܠ` | 52 |
+| 4 | `ܡܒܤܢ ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ` | 52 |
+| 5 | `ܝܢܦܠ ܡܒܤܢ ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ` | 52 |
 **2-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `ܐ _` | 24,552 |
+| 2 | `_ ܕ` | 7,580 |
+| 3 | `ܬ ܐ` | 7,166 |
+| 4 | `_ ܐ` | 6,890 |
+| 5 | `ܝ ܐ` | 5,689 |
 **3-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `ܐ _ ܕ` | 6,099 |
+| 2 | `ܬ ܐ _` | 5,875 |
+| 3 | `ܝ ܐ _` | 4,233 |
 | 4 | `ܐ _ ܐ` | 2,477 |
+| 5 | `ܐ _ ܘ` | 2,392 |
 **4-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `ܬ ܐ _ ܕ` | 1,993 |
+| 2 | `ܝ ܬ ܐ _` | 1,513 |
+| 3 | `ܐ ܝ ܬ _` | 1,372 |
+| 4 | `ܘ ܬ ܐ _` | 1,304 |
+| 5 | `_ ܡ ܢ _` | 1,203 |
+**5-grams (Subword):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `ܐ _ ܐ ܘ _` | 603 |
+| 2 | `ܘ ܬ ܐ _ ܕ` | 543 |
+| 3 | `ܐ _ ܡ ܢ _` | 533 |
+| 4 | `_ ܐ ܝ ܬ _` | 503 |
+| 5 | `ܝ ܘ ܬ ܐ _` | 487 |
 ### Key Findings
+- **Best Perplexity:** 2-gram (subword) with 363
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
+- **Coverage:** Top-1000 patterns cover ~31% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 ---
 | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
 |---------|---------|-------------|------------|------------------|-----------------|----------------|
+| **1** | Word | 0.5444 | 1.458 | 2.59 | 17,962 | 45.6% |
+| **1** | Subword | 0.9651 | 1.952 | 6.05 | 1,231 | 3.5% |
+| **2** | Word | 0.1025 | 1.074 | 1.16 | 45,549 | 89.7% |
+| **2** | Subword | 0.7954 | 1.736 | 3.84 | 7,433 | 20.5% |
+| **3** | Word | 0.0296 | 1.021 | 1.04 | 51,591 | 97.0% |
+| **3** | Subword | 0.5935 | 1.509 | 2.45 | 28,465 | 40.7% |
+| **4** | Word | 0.0106 🏆 | 1.007 | 1.01 | 52,240 | 98.9% |
+| **4** | Subword | 0.3585 | 1.282 | 1.71 | 69,540 | 64.2% |
 ### Generated Text Samples (Word-based)
 **Context Size 1:**
+1. `ܡܢ ܕܝܪܐ ܕܡܪܝ ܓܒܪܐܝܠ ܗܘ ܫܡܐ ܕܒܗܘ ܢܬܝܕܥܘܢ ܗܘܘ ܡܠܘܢ ܕܝܢ ܪ̈ܥܘܬܐ ܕܝܠܗܘܢ ܒܪܓܘܠܐ ܕܓܗܢܐ ܗܢܐ`
+2. `ܐܘ ܡܣܐܢܐ ܐܘ ܚܡܗܐ ܗܘ ܣܦܪܐ ܕܗܘܫܥ ܘܕܐܫܥܝܐ ܘܗܘܝܘ ܦܪܝܣܐ ܩܘܢܣܛܢܛܝܢ`
+3. `ܗܘ ܢܒܝܐ ܙܥܘܪܐ ܒܠܒܘܫܐ ܕܒܗ ܫܩܠܘ ܐܢܛܝܘܟܝܐ ܘܫܚܠܦ ܗܘܐ ܦܛܪܝܪܟܐ ܩܢܘܢܝܐ ܣܕܪܐ ܘܟܠ ܕܡܘ 2 ܒܢܝܣܢ`
 **Context Size 2:**
+1. `ܐܦ ܚܙܝ ܥܡܐ ܒܝܬܝܘܬܐ de verwandtschaftsbeziehung onkel und tante`
+2. `ܚܕ ܡܢ ܬܪܥܣܪ ܢܒܝ̈ܐ ܙܥܘܪ̈ܐ ܕܬܢܟ ܘܕܕܝܬܝܩܝ ܥܬܝܩܬܐ ܥܬܝܩܬܐ`
+3. `ܗܝ ܐܬܪܐ ܒܐܣܝܐ ܨܝܢ ܐܝܬ ܠܗ̇ ܪܡܙܐ ܕܡܘܠܕܐ ܬܪܝܢܐ ܒܝܕ ܡܝ̈ܐ ܘܪܘܚܐ ܕܩܘܕܫܐ ܝܫܘܥ ܐܡܪ ܠܗ ܘܠܐܚܗ`
 **Context Size 3:**
+1. `ܗܘ ܚܕ ܡܢ ܠܫܢ̈ܐ ܨܝܢܝ̈ܐ ܕܢܬܡܠܠܘܢ ܒܬܝܡܢ ܡܕܢܚ ܕܨܝܢ ܐܝܬܘܗܝ ܠܫܢܐ ܐܡܗܝܐ ܕܝܬܝܪ ܡܢ 90 ܡܠܝܘܢܐ ܐܢܫ̈ܝܢ ܪܝܫܐܝܬ`
+2. `ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ ܡܒܤܢ ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢ...`
+3. `ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫ...`
 **Context Size 4:**
+1. `ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ ܡܒܤܢ ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓ...`
+2. `ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢ...`
+3. `ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫ...`
 ### Generated Text Samples (Subword-based)
 **Context Size 1:**
+1. `_ܥܪ_ܙܥܠܐܡܘܦܐܘܗ_ܡ`
+2. `ܐܣܬܐ_ܒܕ_ܐ_ܡܐ_ܕܡܦ`
+3. `ܝܓܝܛܠܚܕܢܪܝܦܝ̈ܘܪܝܐ`
 **Context Size 2:**
+1. `ܐ_ܫܡܐ_ܕܢܐܡܪܝܬ_800`
+2. `_ܕܠܐܬܘܕܝܫܐ_ܘܒܝܬ_ا`
+3. `ܬܐ_ܕܛܚܝܬ_ܡܚܝܬ_ܡܥܘ`
 **Context Size 3:**
+1. `ܐ_ܕܐܬܥܢܘܬܐ_ܠܫܢܐ_(r`
+2. `ܬܐ_ܥܪܒܐ_ܩܘܛܢܝܘܬܐ_ܛ`
+3. `ܝܐ_ܒܪ_ܒܪ_ܐܘ_ܐܘܢܛܐܪ`
 **Context Size 4:**
+1. `ܬܐ_ܕܢܒܥ_ܡܫܝܚܐ._ܥܠܠܢ̈`
+2. `ܝܬܐ_ܕܪ̈ܗܘܡܝܐ܀_ܢܗܪܝܢ܁`
+3. `ܐܝܬ_ܚܡܫܐ_ܨܒܝܢܗ._ܟܬܒ`
 ### Key Findings
 - **Best Predictability:** Context-4 (word) with 98.9% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
+- **Memory Trade-off:** Larger contexts require more storage (69,540 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
 | Metric | Value |
 |--------|-------|
+| Vocabulary Size | 6,099 |
+| Total Tokens | 50,661 |
+| Mean Frequency | 8.31 |
 | Median Frequency | 3 |
 | Frequency Std Dev | 32.05 |
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | ܡܢ | 1,276 |
+| 2 | ܐܘ | 979 |
+| 3 | ܗܘ | 860 |
 | 4 | ܗܝ | 816 |
+| 5 | ܐܝܬ | 513 |
 | 6 | ܗܘܐ | 394 |
+| 7 | ܘܥܡ | 330 |
+| 8 | ܥܠ | 324 |
+| 9 | ܐܦ | 277 |
+| 10 | ܠܫܢܐ | 263 |
 ### Least Common Words (from vocabulary)
 | Metric | Value |
 |--------|-------|
+| Zipf Coefficient | 0.8942 |
+| R² (Goodness of Fit) | 0.982775 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
+| Top 100 | 31.8% |
 | Top 1,000 | 68.0% |
+| Top 5,000 | 95.7% |
 | Top 10,000 | 0.0% |
 ### Key Findings
 - **Zipf Compliance:** R²=0.9828 indicates excellent adherence to Zipf's law
+- **High Frequency Dominance:** Top 100 words cover 31.8% of corpus
+- **Long Tail:** -3,901 words needed for remaining 100.0% coverage
 ---
 ## 5. Word Embeddings Evaluation
 ### 5.1 Cross-Lingual Alignment
+![Alignment Quality](visualizations/embedding_alignment_quality.png)
+![Multilingual t-SNE](visualizations/embedding_tsne_multilingual.png)
 ### 5.2 Model Comparison
 | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
 |-------|-----------|----------|------------------|---------------|----------------|
+| **mono_32d** | 32 | 0.3326 | 0.4816 | N/A | N/A |
+| **mono_64d** | 64 | 0.0524 | 0.4997 | N/A | N/A |
+| **mono_128d** | 128 | 0.0094 | 0.4954 | N/A | N/A |
+| **aligned_32d** | 32 | 0.3326 🏆 | 0.4744 | 0.2099 | 0.5556 |
+| **aligned_64d** | 64 | 0.0524 | 0.4826 | 0.1975 | 0.6543 |
+| **aligned_128d** | 128 | 0.0094 | 0.5048 | 0.2099 | 0.7037 |
 ### Key Findings
+- **Best Isotropy:** aligned_32d with 0.3326 (more uniform distribution)
+- **Semantic Density:** Average pairwise similarity of 0.4897. Lower values indicate better semantic separation.
+- **Alignment Quality:** Aligned models achieve up to 21.0% R@1 in cross-lingual retrieval.
 - **Recommendation:** 128d aligned for best cross-lingual performance
 ---
 ## 6.  Morphological Analysis (Experimental)
 This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
 ### 6.1 Productivity & Complexity
 | Metric | Value | Interpretation | Recommendation |
 |--------|-------|----------------|----------------|
+| Productivity Index | **5.000** | High morphological productivity | Reliable analysis |
+| Idiomaticity Gap | **2.160** | High formulaic/idiomatic content | - |
 ### 6.2 Affix Inventory (Productive Units)
 #### Productive Suffixes
 | Suffix | Examples |
 |--------|----------|
+| `-ܐ` | ܬܫܐܢܐܟܐܠܐ, ܟܘܕܢܝܐ, ܕܚܘܼܡܵܐ |
+| `-ܬܐ` | ܐܒܪ̈ܗܡܝܬܐ, ܕܡܥܡܕܝܬܐ, ܫܬܐ |
+| `-ܝܐ` | ܟܘܕܢܝܐ, ܡܠܝܝܐ, ܓܐܘܓܪܦܝܐ |
+| `-ܝܬܐ` | ܐܒܪ̈ܗܡܝܬܐ, ܕܡܥܡܕܝܬܐ, ܟܢܘܫܝܬܐ |
+| `-̈ܐ` | ܢܒܝ̈ܐ, ܕܓܘܫܡ̈ܐ, ܘܚܝ̈ܐ |
+| `-ܘܬܐ` | ܘܐܬܪ̈ܘܬܐ, ܛܝܒܘܬܐ, ܦܛܪܝܪܟܘܬܐ |
+| `-ܢܐ` | ܐܡܝܢܐ, ܘܡܩܝܡܢܐ, ܕܓܝܗܢܐ |
 ### 6.3 Bound Stems (Lexical Roots)
 | Stem | Cohesion | Substitutability | Examples |
 |------|----------|------------------|----------|
+| `ܢܝܬܐ` | 1.64x | 23 contexts | ܦܢܝܬܐ, ܡܢܝܬܐ, ܡܕܢܝܬܐ |
+| `ܪܝܬܐ` | 1.68x | 18 contexts | ܒܪܝܬܐ, ܫܪܝܬܐ, ܩܪܝܬܐ |
+| `ܘܪܝܐ` | 1.51x | 23 contexts | ܣܘܪܝܐ, ܛܘܪܝܐ, ܟܘܪܝܐ |
+| `ܫܝܚܝ` | 1.69x | 15 contexts | ܡܫܝܚܝܐ, ܘܡܫܝܚܝܐ, ܡܫܝܚܝ̈ܐ |
+| `ܪܒܝܐ` | 1.63x | 16 contexts | ܨܪܒܝܐ, ܓܪܒܝܐ, ܐܪܒܝܐ |
+| `ܘܢܝܐ` | 1.63x | 15 contexts | ܝܘܢܝܐ, ܓܘܢܝܐ, ܟܘܢܝܐ |
+| `ܡܫܝܚ` | 1.70x | 13 contexts | ܡܫܝܚܐ, ܡܫܝܚܝܐ, ܕܡܫܝܚܐ |
+| `ܣܘܪܝ` | 1.49x | 18 contexts | ܣܘܪܝܐ, ܣܘܪܝܬ, ܐܣܘܪܝܐ |
+| `ܡܕܝܢ` | 1.63x | 13 contexts | ܡܕܝܢܬ, ܡܕܝܢܬܐ, ܠܡܕܝܢܬ |
+| `ܢܐܝܬ` | 1.55x | 14 contexts | ܨܝܢܐܝܬ, ܝܘܢܐܝܬ, ܝܦܢܐܝܬ |
+| `ܝܢܬܐ` | 1.71x | 9 contexts | ܩܝܢܬܐ, ܣܦܝܢܬܐ, ܡܕܝܢܬܐ |
+| `ܕܝܢܬ` | 1.67x | 9 contexts | ܡܕܝܢܬ, ܡܕܝܢܬܐ, ܠܡܕܝܢܬ |
 ### 6.4 Affix Compatibility (Co-occurrence)
 | Word | Suggested Split | Confidence | Stem |
 |------|-----------------|------------|------|
 | ܝܘܪܕܢܢܝܬܐ | **`ܝܘܪܕܢܢ-ܝܬܐ`** | 4.5 | `ܝܘܪܕܢܢ` |
+| ܦܘܪܛܘܓܠܝܐ | **`ܦܘܪܛܘܓܠ-ܝܐ`** | 4.5 | `ܦܘܪܛܘܓܠ` |
 | ܥܘܬܡܐܢܝܬܐ | **`ܥܘܬܡܐܢ-ܝܬܐ`** | 4.5 | `ܥܘܬܡܐܢ` |
 | ܕܬܠܝܬܝܘܬܐ | **`ܕܬܠܝܬܝ-ܘܬܐ`** | 4.5 | `ܕܬܠܝܬܝ` |
 | ܕܐܢܛܝܘܟܝܐ | **`ܕܐܢܛܝܘܟ-ܝܐ`** | 4.5 | `ܕܐܢܛܝܘܟ` |
 | ܛܘܪܥܒܕܝܢܝܐ | **`ܛܘܪܥܒܕܝܢ-ܝܐ`** | 4.5 | `ܛܘܪܥܒܕܝܢ` |
 | ܩܬܘܠܝܩܝ̈ܐ | **`ܩܬܘܠܝܩܝ-̈ܐ`** | 4.5 | `ܩܬܘܠܝܩܝ` |
+| ܡܬܥܡܪܢܝܬܐ | **`ܡܬܥܡܪܢ-ܝܬܐ`** | 4.5 | `ܡܬܥܡܪܢ` |
+| ܒܡܬܥܕܪܢܘܬܐ | **`ܒܡܬܥܕܪܢ-ܘܬܐ`** | 1.5 | `ܒܡܬܥܕܪܢ` |
+| ܬܫܥܝܬܢܝܬܐ | **`ܬܫܥܝܬܢ-ܝܬܐ`** | 1.5 | `ܬܫܥܝܬܢ` |
 | ܠܫܘܠܛܢܘܬܐ | **`ܠܫܘܠܛܢ-ܘܬܐ`** | 1.5 | `ܠܫܘܠܛܢ` |
+| ܡܫܥܢܕܢܝܬܐ | **`ܡܫܥܢܕܢ-ܝܬܐ`** | 1.5 | `ܡܫܥܢܕܢ` |
+| ܐܝܣܠܢܕ̈ܝܐ | **`ܐܝܣܠܢܕ̈-ܝܐ`** | 1.5 | `ܐܝܣܠܢܕ̈` |
+| ܐܪܬܘܕܟܣܝܐ | **`ܐܪܬܘܕܟܣ-ܝܐ`** | 1.5 | `ܐܪܬܘܕܟܣ` |
+| ܦܘܠܛܝܩܝܬܐ | **`ܦܘܠܛܝܩ-ܝܬܐ`** | 1.5 | `ܦܘܠܛܝܩ` |
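
Every split in the table above is consistent with stripping the longest matching suffix from the inventory in 6.2. The sketch below illustrates that idea only: the function name `split_suffix` and the longest-match rule are assumptions made for illustration, not the pipeline's documented method, and the sketch does not reproduce the confidence scores.

```python
# Suffix inventory copied from section 6.2 (leading hyphens removed).
SUFFIXES = ["ܐ", "ܬܐ", "ܝܐ", "ܝܬܐ", "̈ܐ", "ܘܬܐ", "ܢܐ"]


def split_suffix(word, suffixes=SUFFIXES):
    """Return (stem, suffix) for the longest suffix that leaves a non-empty stem.

    Hypothetical helper: longest-match stripping is an assumed rule.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""  # no known suffix matched


for word in ["ܝܘܪܕܢܢܝܬܐ", "ܦܘܪܛܘܓܠܝܐ", "ܠܫܘܠܛܢܘܬܐ"]:
    stem, suffix = split_suffix(word)
    print(f"{stem}-{suffix}")
```

Preferring the longest suffix matters because the inventory is nested: a word ending in `-ܝܬܐ` also ends in `-ܬܐ` and `-ܐ`.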
 ### 6.6 Linguistic Interpretation
 > **Automated Insight:**
+Aramaic shows high morphological productivity. The subword models are significantly more efficient than the word models, which suggests a rich system of affixation or compounding.
+> **Note on Idiomaticity:** The high Idiomaticity Gap suggests a large number of frequent multi-word expressions or formulaic sequences that are statistically distinct from their component parts.
 ---
 ## 7. Summary & Recommendations
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
 | Tokenizer | **32k BPE** | Best compression (4.58x) |
+| N-gram | **2-gram** | Lowest perplexity (363) |
 | Markov | **Context-4** | Highest predictability (98.9%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 *Generated by Wikilangs Models Pipeline*
+*Report Date: 2026-01-03 16:33:24*

models/embeddings/aligned/arc_128d.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e8d628ee8950c2b7ed7c3e9de1db3782e533dcdf606048be9ab19b74823a78c1
+size 1025874179
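
The three added lines are a Git LFS pointer, not the binary itself: the repository stores only the spec version, the SHA-256 object id, and the byte size. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper name, and the sketch assumes the simple `key value` layout seen in this diff.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into version, hash algorithm, oid, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:e8d628ee8950c2b7ed7c3e9de1db3782e533dcdf606048be9ab19b74823a78c1
size 1025874179
"""
info = parse_lfs_pointer(pointer)
print(info["algo"], info["size"])
```

Two pointers with the same `oid` refer to byte-identical files, so comparing object ids is a cheap way to spot duplicated artifacts across paths.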

models/embeddings/aligned/arc_128d.meta.json ADDED

	@@ -0,0 +1 @@


+{"lang": "arc", "dim": 128, "max_seq_len": 512, "is_aligned": true}

models/embeddings/aligned/arc_128d.projection.npy ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:90a8bc7d91ab749fbde71ef4592ddb5930aad98cf89e4fe62609617116614d4c
+size 65664

models/embeddings/aligned/arc_128d_metadata.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "language": "arc",
+  "dimension": 128,
+  "version": "aligned",
+  "hub_language": "en",
+  "seed_vocab_size": 81,
+  "vocab_size": 1795
+}

models/embeddings/aligned/arc_32d.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ea95831f0cddf68f8d1922adb20034599b5abee861aa999d0c55c5b434433549
+size 256495619

models/embeddings/aligned/arc_32d.meta.json ADDED

	@@ -0,0 +1 @@


+{"lang": "arc", "dim": 32, "max_seq_len": 512, "is_aligned": true}

models/embeddings/aligned/arc_32d.projection.npy ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:314b43a7227e4552787faf00f8ef83eb3186cbe76d9aa61a0c07d6bc9b336111
+size 4224

models/embeddings/aligned/arc_32d_metadata.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "language": "arc",
+  "dimension": 32,
+  "version": "aligned",
+  "hub_language": "en",
+  "seed_vocab_size": 81,
+  "vocab_size": 1795
+}

models/embeddings/aligned/arc_64d.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:80969530d1c44b57121288762e8b87141d96d2f813f5a2ce5b5770ef551329e6
+size 512955139

models/embeddings/aligned/arc_64d.meta.json ADDED

	@@ -0,0 +1 @@


+{"lang": "arc", "dim": 64, "max_seq_len": 512, "is_aligned": true}

models/embeddings/aligned/arc_64d.projection.npy ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2e168e33b3341339910d3d05582308c54eda45b95a9123adf0e4b2ad948e055a
+size 16512

models/embeddings/aligned/arc_64d_metadata.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "language": "arc",
+  "dimension": 64,
+  "version": "aligned",
+  "hub_language": "en",
+  "seed_vocab_size": 81,
+  "vocab_size": 1795
+}
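
The three `.projection.npy` sizes in this commit (4224, 16512, and 65664 bytes) are consistent with a square `dim x dim` matrix of 4-byte floats plus a 128-byte `.npy` header. This is an inference from the file sizes alone; the metadata does not state the dtype or shape, so both constants below are assumptions.

```python
HEADER_BYTES = 128  # assumed .npy header length (common for small arrays)
ITEMSIZE = 4        # assumed 4-byte elements, e.g. float32


def expected_projection_size(dim, header_bytes=HEADER_BYTES, itemsize=ITEMSIZE):
    """Bytes on disk for a dim x dim matrix stored as .npy under the assumptions above."""
    return header_bytes + dim * dim * itemsize


# Sizes reported by the LFS pointers for the projection files.
reported = {32: 4224, 64: 16512, 128: 65664}
for dim, size in reported.items():
    print(dim, expected_projection_size(dim), size)
```

If the assumptions hold, each projection is a linear map from the monolingual space into a space of the same dimension, which fits the `"version": "aligned"` and `"hub_language": "en"` fields.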

models/embeddings/monolingual/arc_128d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:af294265d0648fcd88d1a3a79c52a316b5312eb2df0e22491460d7962b9abecf
-size 1025880436
+oid sha256:e8d628ee8950c2b7ed7c3e9de1db3782e533dcdf606048be9ab19b74823a78c1
+size 1025874179

models/embeddings/monolingual/arc_128d_metadata.json CHANGED

@@ -11,5 +11,5 @@
     "encoding_method": "rope",
     "dim": 128
   },
-  "vocab_size": 1801
+  "vocab_size": 1795
 }

models/embeddings/monolingual/arc_32d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ca828785023acb65e4a3915ca46d6488b08273c410da1d5ebe5859bffe778de4
-size 256497268
+oid sha256:ea95831f0cddf68f8d1922adb20034599b5abee861aa999d0c55c5b434433549
+size 256495619

models/embeddings/monolingual/arc_32d_metadata.json CHANGED

@@ -11,5 +11,5 @@
     "encoding_method": "rope",
     "dim": 32
   },
-  "vocab_size": 1801
+  "vocab_size": 1795
 }

models/embeddings/monolingual/arc_64d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3ca0fd83ec54d91bfbc8d6aec7c2188702f2f3ff651521ae421ae4403a931166
-size 512958324
+oid sha256:80969530d1c44b57121288762e8b87141d96d2f813f5a2ce5b5770ef551329e6
+size 512955139

models/embeddings/monolingual/arc_64d_metadata.json CHANGED

@@ -11,5 +11,5 @@
     "encoding_method": "rope",
     "dim": 64
   },
-  "vocab_size": 1801
+  "vocab_size": 1795
 }

models/subword_markov/arc_markov_ctx1_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:bc87b5460d601a98fbc387cbc99f283bf6cd7d98c37e6a136392f9742a5c1769
-size 61605
+oid sha256:f2cf16541ca4543598754c02901d686e966979a3f2dbef93171079a43a695800
+size 61546

models/subword_markov/arc_markov_ctx1_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "subword",
   "language": "arc",
-  "unique_contexts": 1232,
-  "total_transitions": 368665
+  "unique_contexts": 1231,
+  "total_transitions": 367430
 }

models/subword_markov/arc_markov_ctx2_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:44f7913fb0e3f66a526482ce6ecca8f16f96c74f389962eac8c930ae99d709a4
-size 236563
+oid sha256:ade6f3b86c5ac381ec52cb57cbda8cc727ed4013fd635825536cbdeb065e3851
+size 228551

models/subword_markov/arc_markov_ctx2_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "subword",
   "language": "arc",
-  "unique_contexts": 7459,
-  "total_transitions": 367072
+  "unique_contexts": 7433,
+  "total_transitions": 365838
 }

models/subword_markov/arc_markov_ctx3_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6c02d515c8771584609106a8bfb0e1e6f4a01778f690d1e902e6c23730d85b1f
-size 614974
+oid sha256:a14f68a36dd11286c23370274ba7ff4661e8c2e5ad7150608f82b13f2e385917
+size 613765

models/subword_markov/arc_markov_ctx3_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "subword",
   "language": "arc",
-  "unique_contexts": 28618,
-  "total_transitions": 365479
+  "unique_contexts": 28465,
+  "total_transitions": 364246
 }

models/subword_markov/arc_markov_ctx4_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ab3159c713b354d456d78d674f47cafc43e66c4c18a290510850783e3575b191
-size 1213186
+oid sha256:de9318a054b0cd1ae284c2b1e5d27ed21cc036ddfb239927d75f1c53cd6a035b
+size 1191772

models/subword_markov/arc_markov_ctx4_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "subword",
   "language": "arc",
-  "unique_contexts": 69915,
-  "total_transitions": 363886
+  "unique_contexts": 69540,
+  "total_transitions": 362654
 }

models/subword_ngram/arc_2gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1e6f137266958bfa5bf2d5725d17daa253bc7444dd15168bf7c4aba0c82d498e
-size 30619
+oid sha256:0b0ef8b68904eb3bc261b5599c684b44c82c140473e77759ec123df2bcccf1b5
+size 30383

models/subword_ngram/arc_2gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "subword",
   "language": "arc",
-  "unique_ngrams": 2347,
-  "total_ngrams": 368665
+  "unique_ngrams": 2330,
+  "total_ngrams": 367430
 }

models/subword_ngram/arc_3gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ce18df7b3c9474a7e74ca5ef2fd82d105d907ac676d24a02a4a055d5c06da3df
-size 135373
+oid sha256:41cf48eff4f49881b3b1ef4bee554fb58154195938b1733c73ed7e68e9e0214b
+size 134552

models/subword_ngram/arc_3gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "subword",
   "language": "arc",
-  "unique_ngrams": 10625,
-  "total_ngrams": 367072
+  "unique_ngrams": 10583,
+  "total_ngrams": 365838
 }

models/subword_ngram/arc_4gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:36cdb55afe0905dc264f2b5348858599da9f4a343be608731c24757accd88088
-size 398536
+oid sha256:9672d51caaa89c2f58eff730631ddd563754bbaac6d78c20d8cb575cbbf65d87
+size 394701

models/subword_ngram/arc_4gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "subword",
   "language": "arc",
-  "unique_ngrams": 28979,
-  "total_ngrams": 365479
+  "unique_ngrams": 28872,
+  "total_ngrams": 364246
 }

models/subword_ngram/arc_5gram_subword.parquet ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4f92babc9e6855ff8c2b135ce81ee6f2033e836d8767833ec47cf2caf5c7f2dd
+size 537047

models/subword_ngram/arc_5gram_subword_metadata.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "n": 5,
+  "variant": "subword",
+  "language": "arc",
+  "unique_ngrams": 38919,
+  "total_ngrams": 362654
+}
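
Across the subword metadata files, `total_ngrams` for order n equals `total_transitions` for context size n-1, and each extra token of context lowers the count by exactly 1592, the new `total_documents` value in the vocabulary metadata. That is the pattern one would expect if n-grams are counted inside each document and never across document boundaries. The sketch below checks the arithmetic under that assumption (and assuming every document has at least n tokens); `TOTAL_TOKENS` is derived from the 2-gram count and is not reported anywhere in the commit.

```python
TOTAL_DOCUMENTS = 1592        # from arc_vocabulary_metadata.json
TOTAL_TOKENS = 367430 + 1592  # derived from the 2-gram count, not reported directly


def ngram_count(n, total_tokens=TOTAL_TOKENS, n_docs=TOTAL_DOCUMENTS):
    """N-grams in the corpus when a document of length L contributes L - n + 1."""
    return total_tokens - (n - 1) * n_docs


# total_ngrams values from the updated subword n-gram metadata.
reported = {2: 367430, 3: 365838, 4: 364246, 5: 362654}
for n, count in reported.items():
    print(n, ngram_count(n), count)
```

The same step of 1593 separates the old counts, which matches the old `total_documents` value of 1593.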

models/tokenizer/arc_tokenizer_16k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f9d8b946b07c5ee953e8f59e73c726ea02c7e02592415beaef72b11ce24d07ed
-size 553548
+oid sha256:a79f3440c903b39d8b82904f5bdf191e3a3f69c45dc94cff5f577a68ffb722d3
+size 553519

models/tokenizer/arc_tokenizer_16k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/arc_tokenizer_32k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:dc878161f3ebfde37edd5364173a906b743ffa182cc4a09da6dedb402cd0d479
-size 891580
+oid sha256:f93993d85e9706c681a3124199465b9fd0ecf892ec81f742aa1dd3f2a3509aa8
+size 891227

models/tokenizer/arc_tokenizer_32k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/arc_tokenizer_8k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e552460c83ce8697c66bb59c12883cdc3ee45df742f4f4a31ed47f9e096b570f
-size 390840
+oid sha256:bf1e2a73cf8ac952ea13f9fc21d2cc6279d579673c3cc68f2fc93a338ee0a5c0
+size 390696

models/tokenizer/arc_tokenizer_8k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/vocabulary/arc_vocabulary.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a0c7a49ea4d72d65cb41e84815025c1a3ac379f03d6c2738d077d2713987df61
-size 98987
+oid sha256:6f0bb804661e7eb94ae77e3d082bdaf86902542e5dff49b802dc403e642e6d93
+size 99274

models/vocabulary/arc_vocabulary_metadata.json CHANGED

@@ -1,17 +1,17 @@
 {
   "language": "arc",
-  "vocabulary_size": 6113,
+  "vocabulary_size": 6099,
   "variant": "full",
   "statistics": {
-    "type_token_ratio": 0.28987946832669004,
+    "type_token_ratio": 0.2900409450826071,
     "coverage": {
-      "top_100": 0.2557526480443379,
-      "top_1000": 0.5486970192628353,
-      "top_5000": 0.7718473583077925,
-      "top_10000": 0.8689237903161773
+      "top_100": 0.2565360778753166,
+      "top_1000": 0.5490624054041137,
+      "top_5000": 0.7721095480108975,
+      "top_10000": 0.869278442493667
     },
-    "hapax_count": 12141,
-    "hapax_ratio": 0.6651144954530513,
-    "total_documents": 1593
+    "hapax_count": 12106,
+    "hapax_ratio": 0.6649821477616039,
+    "total_documents": 1592
   }
 }
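
The updated statistics are internally consistent if `vocabulary_size` counts only word types seen more than once, so that the full type count is `vocabulary_size + hapax_count`. That reading is an inference from the numbers, not something the metadata states, and the token total below is derived from the type-token ratio rather than reported.

```python
VOCABULARY_SIZE = 6099
HAPAX_COUNT = 12106
TYPE_TOKEN_RATIO = 0.2900409450826071

total_types = VOCABULARY_SIZE + HAPAX_COUNT           # assumed: vocabulary excludes hapaxes
hapax_ratio = HAPAX_COUNT / total_types               # share of types seen exactly once
total_tokens = round(total_types / TYPE_TOKEN_RATIO)  # derived, not reported

print(total_types, hapax_ratio, total_tokens)
```

Under this reading roughly two thirds of all word types occur only once, which is typical of a small corpus and explains why coverage still falls short of 90% at the top 10,000 types.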

models/word_markov/arc_markov_ctx1_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:832f26dc2d4ff5a3ebd23bf31f4087617a431dd92bb988486e6d4cc21c427ab8
-size 561103
+oid sha256:5cbedb01a5447afd32845e25fc8a11d1d3d23bb575c77b402dc2373f23ba697b
+size 559495

models/word_markov/arc_markov_ctx1_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "word",
   "language": "arc",
-  "unique_contexts": 18018,
-  "total_transitions": 61378
+  "unique_contexts": 17962,
+  "total_transitions": 61175
 }

models/word_markov/arc_markov_ctx2_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:de53c3acb316d8128ebd4e1ec3ca41812fa0d445a955fc1d817d7c545461f871
-size 1081973
+oid sha256:73c67b400e1d1d3b59ad64c463d7764d153a25bb16654c53e98cb0c5ff9eb85e
+size 1076458

models/word_markov/arc_markov_ctx2_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "word",
   "language": "arc",
-  "unique_contexts": 45749,
-  "total_transitions": 59785
+  "unique_contexts": 45549,
+  "total_transitions": 59583
 }

models/word_markov/arc_markov_ctx3_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8d239eaf545ca57c200cb2afeabb8ab0ceb2b6cc79f4dcd2f2e5f4aa5cc4e9c1
-size 1332931
+oid sha256:5e174c8083c10c295c6f9ad804650219ba42e49b0a3d2c0566dc2c29f402d62e
+size 1328117

models/word_markov/arc_markov_ctx3_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "word",
   "language": "arc",
-  "unique_contexts": 51822,
-  "total_transitions": 58192
+  "unique_contexts": 51591,
+  "total_transitions": 57991
 }