omarkamali committed on
Commit a98ea1b · verified · 1 Parent(s): ecd1032

Upload all models and assets for arc (latest)

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +1 -0
  2. README.md +184 -147
  3. models/embeddings/aligned/arc_128d.bin +3 -0
  4. models/embeddings/aligned/arc_128d.meta.json +1 -0
  5. models/embeddings/aligned/arc_128d.projection.npy +3 -0
  6. models/embeddings/aligned/arc_128d_metadata.json +8 -0
  7. models/embeddings/aligned/arc_32d.bin +3 -0
  8. models/embeddings/aligned/arc_32d.meta.json +1 -0
  9. models/embeddings/aligned/arc_32d.projection.npy +3 -0
  10. models/embeddings/aligned/arc_32d_metadata.json +8 -0
  11. models/embeddings/aligned/arc_64d.bin +3 -0
  12. models/embeddings/aligned/arc_64d.meta.json +1 -0
  13. models/embeddings/aligned/arc_64d.projection.npy +3 -0
  14. models/embeddings/aligned/arc_64d_metadata.json +8 -0
  15. models/embeddings/monolingual/arc_128d.bin +2 -2
  16. models/embeddings/monolingual/arc_128d_metadata.json +1 -1
  17. models/embeddings/monolingual/arc_32d.bin +2 -2
  18. models/embeddings/monolingual/arc_32d_metadata.json +1 -1
  19. models/embeddings/monolingual/arc_64d.bin +2 -2
  20. models/embeddings/monolingual/arc_64d_metadata.json +1 -1
  21. models/subword_markov/arc_markov_ctx1_subword.parquet +2 -2
  22. models/subword_markov/arc_markov_ctx1_subword_metadata.json +2 -2
  23. models/subword_markov/arc_markov_ctx2_subword.parquet +2 -2
  24. models/subword_markov/arc_markov_ctx2_subword_metadata.json +2 -2
  25. models/subword_markov/arc_markov_ctx3_subword.parquet +2 -2
  26. models/subword_markov/arc_markov_ctx3_subword_metadata.json +2 -2
  27. models/subword_markov/arc_markov_ctx4_subword.parquet +2 -2
  28. models/subword_markov/arc_markov_ctx4_subword_metadata.json +2 -2
  29. models/subword_ngram/arc_2gram_subword.parquet +2 -2
  30. models/subword_ngram/arc_2gram_subword_metadata.json +2 -2
  31. models/subword_ngram/arc_3gram_subword.parquet +2 -2
  32. models/subword_ngram/arc_3gram_subword_metadata.json +2 -2
  33. models/subword_ngram/arc_4gram_subword.parquet +2 -2
  34. models/subword_ngram/arc_4gram_subword_metadata.json +2 -2
  35. models/subword_ngram/arc_5gram_subword.parquet +3 -0
  36. models/subword_ngram/arc_5gram_subword_metadata.json +7 -0
  37. models/tokenizer/arc_tokenizer_16k.model +2 -2
  38. models/tokenizer/arc_tokenizer_16k.vocab +0 -0
  39. models/tokenizer/arc_tokenizer_32k.model +2 -2
  40. models/tokenizer/arc_tokenizer_32k.vocab +0 -0
  41. models/tokenizer/arc_tokenizer_8k.model +2 -2
  42. models/tokenizer/arc_tokenizer_8k.vocab +0 -0
  43. models/vocabulary/arc_vocabulary.parquet +2 -2
  44. models/vocabulary/arc_vocabulary_metadata.json +9 -9
  45. models/word_markov/arc_markov_ctx1_word.parquet +2 -2
  46. models/word_markov/arc_markov_ctx1_word_metadata.json +2 -2
  47. models/word_markov/arc_markov_ctx2_word.parquet +2 -2
  48. models/word_markov/arc_markov_ctx2_word_metadata.json +2 -2
  49. models/word_markov/arc_markov_ctx3_word.parquet +2 -2
  50. models/word_markov/arc_markov_ctx3_word_metadata.json +2 -2
.gitattributes CHANGED
@@ -39,3 +39,4 @@ visualizations/position_encoding_comparison.png filter=lfs diff=lfs merge=lfs -t
39
  visualizations/tsne_sentences.png filter=lfs diff=lfs merge=lfs -text
40
  visualizations/tsne_words.png filter=lfs diff=lfs merge=lfs -text
41
  visualizations/zipf_law.png filter=lfs diff=lfs merge=lfs -text
 
 
39
  visualizations/tsne_sentences.png filter=lfs diff=lfs merge=lfs -text
40
  visualizations/tsne_words.png filter=lfs diff=lfs merge=lfs -text
41
  visualizations/zipf_law.png filter=lfs diff=lfs merge=lfs -text
42
+ visualizations/embedding_tsne_multilingual.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,6 +1,6 @@
1
  ---
2
  language: arc
3
- language_name: ARC
4
  language_family: semitic_aramaic
5
  tags:
6
  - wikilangs
@@ -10,11 +10,21 @@ tags:
10
  - n-gram
11
  - markov
12
  - wikipedia
 
 
 
 
 
 
 
 
 
 
13
  - monolingual
14
  - family-semitic_aramaic
15
  license: mit
16
  library_name: wikilangs
17
- pipeline_tag: feature-extraction
18
  datasets:
19
  - omarkamali/wikipedia-monthly
20
  dataset_info:
@@ -26,17 +36,17 @@ metrics:
26
  value: 4.583
27
  - name: best_isotropy
28
  type: isotropy
29
- value: 0.2739
30
  - name: vocabulary_size
31
  type: vocab
32
  value: 0
33
  generated: 2026-01-03
34
  ---
35
 
36
- # ARC - Wikilangs Models
37
  ## Comprehensive Research Report & Full Ablation Study
38
 
39
- This repository contains NLP models trained and evaluated by Wikilangs, specifically on **ARC** Wikipedia data.
40
  We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and word embeddings.
41
 
42
  ## 📋 Repository Contents
@@ -60,7 +70,7 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
60
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
61
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
62
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
63
- - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
64
  - [7. Summary & Recommendations](#7-summary--recommendations)
65
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
66
  - [Visualizations Index](#visualizations-index)
@@ -80,43 +90,43 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
80
 
81
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
82
  |------------|-------------|---------------|----------|--------------|
83
- | **8k** | 3.552x | 3.57 | 0.1271% | 63,747 |
84
- | **16k** | 3.988x | 4.01 | 0.1427% | 56,780 |
85
- | **32k** | 4.583x 🏆 | 4.60 | 0.1640% | 49,402 |
86
 
87
  ### Tokenization Examples
88
 
89
  Below are sample sentences tokenized with each vocabulary size:
90
 
91
- **Sample 1:** `ܡܬܠܬܐ ܗܘ ܐܣܟܡܐ ܡܚܪܝܐ (ܓܐܘܡܛܪܝܐ) ܕܐܝܬ ܠܗ ܬܠܬܐ ܐܠܥ̈ܐ ܘܬܠܬܐ ܙܘܝܬ̈ܐ܀`
92
 
93
  | Vocab | Tokens | Count |
94
  |-------|--------|-------|
95
- | 8k | `▁ܡܬܠܬܐ ▁ܗܘ ▁ܐܣܟܡܐ ▁ܡܚܪܝܐ ▁( ܓܐ ܘܡ ܛܪܝܐ ) ▁ܕܐܝܬ ... (+8 more)` | 18 |
96
- | 16k | `▁ܡܬܠܬܐ ▁ܗܘ ▁ܐܣܟܡܐ ▁ܡܚܪܝܐ ▁( ܓܐܘܡܛܪܝܐ ) ▁ܕܐܝܬ ▁ܠܗ ▁ܬܠܬܐ ... (+4 more)` | 14 |
97
- | 32k | `▁ܡܬܠܬܐ ▁ܗܘ ▁ܐܣܟܡܐ ▁ܡܚܪܝܐ ▁( ܓܐܘܡܛܪܝܐ ) ▁ܕܐܝܬ ▁ܠܗ ▁ܬܠܬܐ ... (+3 more)` | 13 |
98
 
99
- **Sample 2:** `ܟܐܢܣܐܣ ܐܘ ܟܐܢܙܐܣ (Kansas) ܐܝܬܝܗ ܐܘܚܕܢܐ ܓܘ ܡܢܬܐ ܡܥܪܒܝܬܐ ܡܨܥܝܬܐ ܕܐ̈ܘܚܕܢܐ ܡ̈ܚܝܕܐ ܕܐ...`
100
 
101
  | Vocab | Tokens | Count |
102
  |-------|--------|-------|
103
- | 8k | `▁ܟ ܐܢ ܣܐܣ ▁ܐܘ ▁ܟ ܐܢ ܙܐ ܣ ▁( k ... (+14 more)` | 24 |
104
- | 16k | `▁ܟܐܢܣܐܣ ▁ܐܘ ▁ܟܐܢܙܐܣ ▁( kansas ) ▁ܐܝܬܝܗ ▁ܐܘܚܕܢܐ ▁ܓܘ ▁ܡܢܬܐ ... (+7 more)` | 17 |
105
- | 32k | `▁ܟܐܢܣܐܣ ▁ܐܘ ▁ܟܐܢܙܐܣ ▁( kansas ) ▁ܐܝܬܝܗ ▁ܐܘܚܕܢܐ ▁ܓܘ ▁ܡܢܬܐ ... (+7 more)` | 17 |
106
 
107
- **Sample 3:** `ܢܝܘ ܗܐܡܦܫܪ (New Hampshire) ܗܝ ܐܬܪܐ ܓܘ ܡܢܬܐ ܓܪܒܝܝܬܐ ܡܕܢܚܝܬܐ ܕܐܬ݂ܪ̈ܘܬ݂ܐ ܡ̈ܚܝܕܐ ܕܐܡ...`
108
 
109
  | Vocab | Tokens | Count |
110
  |-------|--------|-------|
111
- | 8k | `▁ܢܝܘ ▁ܗܐܡ ܦܫ ܪ ▁( newh am p shi ... (+14 more)` | 24 |
112
- | 16k | `▁ܢܝܘ ▁ܗܐܡܦܫܪ ▁( newhamp shire ) ▁ܗܝ ▁ܐܬܪܐ ▁ܓܘ ... (+9 more)` | 19 |
113
- | 32k | `▁ܢܝܘ ▁ܗܐܡܦܫܪ ▁( newhampshire ) ▁ܗܝ ▁ܐܬܪܐ ▁ܓܘ ▁ܡܢܬܐ ... (+8 more)` | 18 |
114
 
115
 
116
  ### Key Findings
117
 
118
  - **Best Compression:** 32k achieves 4.583x compression
119
- - **Lowest UNK Rate:** 8k with 0.1271% unknown tokens
120
  - **Trade-off:** Larger vocabularies improve compression but increase model size
121
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
122
 
@@ -133,12 +143,14 @@ Below are sample sentences tokenized with each vocabulary size:
133
 
134
  | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
135
  |--------|---------|------------|---------|----------------|------------------|-------------------|
136
- | **2-gram** | Word | 477 | 8.90 | 718 | 45.8% | 100.0% |
137
- | **2-gram** | Subword | 365 🏆 | 8.51 | 2,347 | 59.8% | 96.1% |
138
- | **3-gram** | Word | 437 | 8.77 | 752 | 52.0% | 100.0% |
139
- | **3-gram** | Subword | 2,390 | 11.22 | 10,625 | 28.2% | 66.9% |
140
- | **4-gram** | Word | 742 | 9.53 | 1,438 | 43.7% | 84.6% |
141
- | **4-gram** | Subword | 8,576 | 13.07 | 28,979 | 14.2% | 43.1% |
 
 
142
 
143
  ### Top 5 N-grams by Size
144
 
@@ -148,66 +160,86 @@ Below are sample sentences tokenized with each vocabulary size:
148
  |------|--------|-------|
149
  | 1 | `ܐܦ ܚܙܝ` | 193 |
150
  | 2 | `ܚܕ ܡܢ` | 141 |
151
- | 3 | `ܗܝ ܐܬܪܐ` | 123 |
152
- | 4 | `ܐܝܬ ܠܗ` | 103 |
153
- | 5 | `ܬܚܘܡܐ ܥܡ` | 88 |
154
 
155
  **3-grams (Word):**
156
 
157
  | Rank | N-gram | Count |
158
  |------|--------|-------|
159
  | 1 | `ܗܘ ܚܕ ܡܢ` | 72 |
160
- | 2 | `ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍` | 52 |
161
- | 3 | `ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ` | 52 |
162
- | 4 | `ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ` | 52 |
163
- | 5 | `ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ` | 52 |
164
 
165
  **4-grams (Word):**
166
 
167
  | Rank | N-gram | Count |
168
  |------|--------|-------|
169
- | 1 | `ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ` | 52 |
170
- | 2 | `ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ` | 52 |
171
- | 3 | `ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ` | 52 |
172
- | 4 | `ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ` | 52 |
173
- | 5 | `ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ` | 52 |
 
 
 
 
 
 
 
 
 
 
174
 
175
  **2-grams (Subword):**
176
 
177
  | Rank | N-gram | Count |
178
  |------|--------|-------|
179
- | 1 | `ܐ _` | 24,633 |
180
- | 2 | `_ ܕ` | 7,621 |
181
- | 3 | `ܬ ܐ` | 7,176 |
182
- | 4 | `_ ܐ` | 6,899 |
183
- | 5 | `ܝ ܐ` | 5,702 |
184
 
185
  **3-grams (Subword):**
186
 
187
  | Rank | N-gram | Count |
188
  |------|--------|-------|
189
- | 1 | `ܐ _ ܕ` | 6,138 |
190
- | 2 | `ܬ ܐ _` | 5,890 |
191
- | 3 | `ܝ ܐ _` | 4,242 |
192
  | 4 | `ܐ _ ܐ` | 2,477 |
193
- | 5 | ܐ _` | 2,397 |
194
 
195
  **4-grams (Subword):**
196
 
197
  | Rank | N-gram | Count |
198
  |------|--------|-------|
199
- | 1 | `ܬ ܐ _ ܕ` | 2,008 |
200
- | 2 | `ܝ ܬ ܐ _` | 1,523 |
201
- | 3 | `ܐ ܝ ܬ _` | 1,367 |
202
- | 4 | `ܘ ܬ ܐ _` | 1,297 |
203
- | 5 | `_ ܡ ܢ _` | 1,210 |
 
 
 
 
 
 
 
 
 
 
204
 
205
 
206
  ### Key Findings
207
 
208
- - **Best Perplexity:** 2-gram (subword) with 365
209
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
210
- - **Coverage:** Top-1000 patterns cover ~43% of corpus
211
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
212
 
213
  ---
@@ -223,14 +255,14 @@ Below are sample sentences tokenized with each vocabulary size:
223
 
224
  | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
225
  |---------|---------|-------------|------------|------------------|-----------------|----------------|
226
- | **1** | Word | 0.5449 | 1.459 | 2.60 | 18,018 | 45.5% |
227
- | **1** | Subword | 0.9655 | 1.953 | 6.06 | 1,232 | 3.5% |
228
- | **2** | Word | 0.1027 | 1.074 | 1.16 | 45,749 | 89.7% |
229
- | **2** | Subword | 0.7977 | 1.738 | 3.85 | 7,459 | 20.2% |
230
- | **3** | Word | 0.0295 | 1.021 | 1.04 | 51,822 | 97.1% |
231
- | **3** | Subword | 0.5934 | 1.509 | 2.45 | 28,618 | 40.7% |
232
- | **4** | Word | 0.0106 🏆 | 1.007 | 1.01 | 52,472 | 98.9% |
233
- | **4** | Subword | 0.3583 | 1.282 | 1.71 | 69,915 | 64.2% |
234
 
235
  ### Generated Text Samples (Word-based)
236
 
@@ -238,27 +270,27 @@ Below are text samples generated from each word-based Markov chain model:
238
 
239
  **Context Size 1:**
240
 
241
- 1. `ܡܢ ܓܪܒܝܐ ܕܒܝܬ ܗܘܠܢܕ̈ܝܐ ܗܘܠܢܕܐܝܬ hengelo ܗܝ ܐܘܚܕܢܐ ܒܓܪܒܝܐ ܕܥܝܪܐܩ ܝܘܡܢܐ ܬܡܢ ܣܝܡܠܗ̇ ܥܕܬ̈ܐ ܡܕܢܚܝܬ̈ܐ ܕܐܬ݂...`
242
- 2. `ܐܘ ܙܢܓܒܝܠܐ ܗܘ ܡܣܘܪܩܐ ܡܐܢܐ ܕܐܪܕܟܠܘܬܐ`
243
- 3. `ܗܘ ܓܘܣܐ ܒܥܠܬܐ ܐܘ ܝܣܪܝܠ ܐܘ ܬܫܪܝܢ ܒ ܩܛܠܥܡܐ ܣܘܪܝܝܐ ܒ ت ܬ ܦ 80 754`
244
 
245
  **Context Size 2:**
246
 
247
- 1. `ܐܦ ܚܙܝ ܓܒܪܐ`
248
- 2. `ܚܕ ܡܢ ܐܪܒܥܐ ܟܬܒ̈ܐ ܩܕ̈ܡܝܐ ܕܕܝܬܝܩܝ ܚܕܬܐ ܦܘܠܘܣ ܫܠܝܚܐ ܟܬܒ ܗܕܐ ܐܓܪܬܐ ܠܩܘܠܣܝ̈ܐ ܕܗܢܘܢ ܐܢܫ̈ܐ ܕܡܕܝܢܬܐ ܕܐܦܣܘܣ`
249
- 3. `ܗܝ ܐܬܪܐ ܒܐܘܪܘܦܐ ܩܘܛܢܝܘܬܐ ܕܐܝܪܠܢܕ ܗܝ ܒܓܘ ܓܙܪܬܐ ܕܐܝܪܠܢܕ ܠܐܝܪܠܢܕ ܓܪܒܝܝܬܐ ܐܝܬ ܬܚܘܡܐ ܥܡ ܪܘܡܢܝܐ ܘܥܡ ܛܘܪܩܝܐ`
250
 
251
  **Context Size 3:**
252
 
253
- 1. `ܗܘ ܚܕ ܡܢ ܓܘܢ̈ܐ ܪ̈ܝܫܝܐ ܕܗܢܘܢ ܣܘܡܩܐ ܘܫܥܘܬܐ ܘܙܪܩܐ ܢܘܗܪܐ ܣܘܡܩܐ ܐܝܬ ܠܗ ܐܘܪܟܐ ܓܠܠܝܐ ܢܐܢܘܡܝܛܪ`
254
- 2. `ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ...`
255
- 3. `ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟ...`
256
 
257
  **Context Size 4:**
258
 
259
- 1. `ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚ...`
260
- 2. `ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ...`
261
- 3. `ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ ܡܒܤܢ ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝ...`
262
 
263
 
264
  ### Generated Text Samples (Subword-based)
@@ -267,34 +299,34 @@ Below are text samples generated from each subword-based Markov chain model:
267
 
268
  **Context Size 1:**
269
 
270
- 1. `_ܨܝܬܐ_أبيد)_ܙܘܢܓ`
271
- 2. `ܐ._ܥܡܫܝܗܝܢܝܬ̈ܐ_ܡܕ`
272
- 3. `ܝܟܫܬܐ_ܐ܀_ܐ_ܫܘܪܒܡ`
273
 
274
  **Context Size 2:**
275
 
276
- 1. `ܐ_ܕܗܘܡܝܠܢܕܐ_ܕܐܥܬܐ`
277
- 2. `_ܕܥܣܪܘܝܕܝܢܘܢ_ܐܘ_ܦ`
278
- 3. `ܬܐ_ܕܩܪܝܬܐ_ܥܠ_ܐܟܪܝ`
279
 
280
  **Context Size 3:**
281
 
282
- 1. `ܐ_ܕܒܓܕܐ_ܕܛܘܪ_ܗܘܘ_ܬ`
283
- 2. `ܬܐ_ܕܛܲܟ݂ܣܵܐ_ܡܠܟܐ_ܫܠܝܚ̈`
284
- 3. `ܝܐ_ܘܡܬܝ_ܡܠܝܝܐ_מלזי`
285
 
286
  **Context Size 4:**
287
 
288
- 1. `ܬܐ_ܕܗܘܦܪܟܝܐ_ܕܪܗܘܡܝܬ`
289
- 2. `ܝܬܐ_ܙܘ_ܫܘܬܐ_ܕܬܘܕܝܬܐ`
290
- 3. `ܐܝܬ_ܡܨܪ̈ܝܐ܆_ܚܣܢ_ܒܡܕܝ`
291
 
292
 
293
  ### Key Findings
294
 
295
  - **Best Predictability:** Context-4 (word) with 98.9% predictability
296
  - **Branching Factor:** Decreases with context size (more deterministic)
297
- - **Memory Trade-off:** Larger contexts require more storage (69,915 contexts)
298
  - **Recommendation:** Context-3 or Context-4 for text generation
299
 
300
  ---
@@ -310,9 +342,9 @@ Below are text samples generated from each subword-based Markov chain model:
310
 
311
  | Metric | Value |
312
  |--------|-------|
313
- | Vocabulary Size | 6,113 |
314
- | Total Tokens | 50,830 |
315
- | Mean Frequency | 8.32 |
316
  | Median Frequency | 3 |
317
  | Frequency Std Dev | 32.05 |
318
 
@@ -320,16 +352,16 @@ Below are text samples generated from each subword-based Markov chain model:
320
 
321
  | Rank | Word | Frequency |
322
  |------|------|-----------|
323
- | 1 | ܡܢ | 1,283 |
324
- | 2 | ܐܘ | 975 |
325
- | 3 | ܗܘ | 861 |
326
  | 4 | ܗܝ | 816 |
327
- | 5 | ܐܝܬ | 512 |
328
  | 6 | ܗܘܐ | 394 |
329
- | 7 | ܥܠ | 327 |
330
- | 8 | ܘܥܡ | 326 |
331
- | 9 | ܐܦ | 275 |
332
- | 10 | ܠܫܢܐ | 266 |
333
 
334
  ### Least Common Words (from vocabulary)
335
 
@@ -350,24 +382,24 @@ Below are text samples generated from each subword-based Markov chain model:
350
 
351
  | Metric | Value |
352
  |--------|-------|
353
- | Zipf Coefficient | 0.8947 |
354
- | R² (Goodness of Fit) | 0.982774 |
355
  | Adherence Quality | **excellent** |
356
 
357
  ### Coverage Analysis
358
 
359
  | Top N Words | Coverage |
360
  |-------------|----------|
361
- | Top 100 | 31.7% |
362
  | Top 1,000 | 68.0% |
363
- | Top 5,000 | 95.6% |
364
  | Top 10,000 | 0.0% |
365
 
366
  ### Key Findings
367
 
368
  - **Zipf Compliance:** R²=0.9828 indicates excellent adherence to Zipf's law
369
- - **High Frequency Dominance:** Top 100 words cover 31.7% of corpus
370
- - **Long Tail:** -3,887 words needed for remaining 100.0% coverage
371
 
372
  ---
373
  ## 5. Word Embeddings Evaluation
@@ -383,37 +415,40 @@ Below are text samples generated from each subword-based Markov chain model:
383
 
384
  ### 5.1 Cross-Lingual Alignment
385
 
386
- > *Note: Multilingual alignment visualization not available for this language.*
 
 
387
 
388
 
389
  ### 5.2 Model Comparison
390
 
391
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
392
  |-------|-----------|----------|------------------|---------------|----------------|
393
- | **mono_32d** | 32 | 0.2739 🏆 | 0.4979 | N/A | N/A |
394
- | **mono_64d** | 64 | 0.0566 | 0.5064 | N/A | N/A |
395
- | **mono_128d** | 128 | 0.0089 | 0.4882 | N/A | N/A |
 
 
 
396
 
397
  ### Key Findings
398
 
399
- - **Best Isotropy:** mono_32d with 0.2739 (more uniform distribution)
400
- - **Semantic Density:** Average pairwise similarity of 0.4975. Lower values indicate better semantic separation.
401
- - **Alignment Quality:** No aligned models evaluated in this run.
402
  - **Recommendation:** 128d aligned for best cross-lingual performance
403
 
404
  ---
405
  ## 6. Morphological Analysis (Experimental)
406
 
407
- > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
408
-
409
  This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
410
 
411
  ### 6.1 Productivity & Complexity
412
 
413
  | Metric | Value | Interpretation | Recommendation |
414
  |--------|-------|----------------|----------------|
415
- | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
416
- | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
417
 
418
  ### 6.2 Affix Inventory (Productive Units)
419
 
@@ -426,13 +461,13 @@ These are the most productive prefixes and suffixes identified by sampling the v
426
  #### Productive Suffixes
427
  | Suffix | Examples |
428
  |--------|----------|
429
- | `-ܐ` | ܘܫܛܚܐ, ܠܘܚ̈ܐ, ܐܢܫܘܬܐ |
430
- | `-ܬܐ` | ܐܢܫܘܬܐ, ܚܘܪܬܐ, ܐܘܪܬܕܘܟܣܝܬܐ |
431
- | `-ܝܐ` | ܘܥܪ̈ܒܝܐ, ܒܐܠܒܢܝܐ, ܥܒܝܐ |
432
- | `-̈ܐ` | ܠܘܚ̈ܐ, ܡܚܝܕ̈ܐ, ܥܝܢ̈ܐ |
433
- | `-ܝܬܐ` | ܐܘܪܬܕܘܟܣܝܬܐ, ܣܘܪܝܝܬܐ, ܒܩܕܡܝܬܐ |
434
- | `-ܘܬܐ` | ܐܢܫܘܬܐ, ܘܕܡܠܟܘܬܐ, ܛܝܒܘܬܐ |
435
- | `-ܢܐ` | ܕܫܝܢܐ, ܡܢܝܢܐ, ܥܝܢܐ |
436
 
437
  ### 6.3 Bound Stems (Lexical Roots)
438
 
@@ -440,18 +475,18 @@ Bound stems are high-frequency subword units that are semantically cohesive but
440
 
441
  | Stem | Cohesion | Substitutability | Examples |
442
  |------|----------|------------------|----------|
443
- | `ܢܝܬܐ` | 1.58x | 23 contexts | ܦܢܝܬܐ, ܡܢܝܬܐ, ܡܪܢܝܬܐ |
444
- | `ܪܝܬܐ` | 1.59x | 18 contexts | ܫܪܝܬܐ, ܩܪܝܬܐ, ܒܪܝܬܐ |
445
- | `ܫܝܚܝ` | 1.61x | 16 contexts | ܡܫܝܚܝܐ, ܡܫܝܚܝܬܐ, ܕܡܫܝܚܝܐ |
446
- | `ܪܒܝܐ` | 1.59x | 16 contexts | ܨܪܒܝܐ, ܥܪܒܝܐ, ܐܪܒܝܐ |
447
- | `ܘܢܝܐ` | 1.57x | 16 contexts | ܟܘܢܝܐ, ܩܘܢܝܐ, ܓܘܢܝܐ |
448
- | `ܘܪܝܐ` | 1.37x | 23 contexts | ܛܘܪܝܐ, ܣܘܪܝܐ, ܟܘܪܝܐ |
449
- | `ܡܫܝܚ` | 1.59x | 14 contexts | ܡܫܝܚܐ, ܡܫܝܚܝܐ, ܕܡܫܝܚܐ |
450
- | `ܡܕܝܢ` | 1.58x | 13 contexts | ܡܕܝܢܬ, ܡܕܝܢܬܐ, ܠܡܕܝܢܬ |
451
- | `ܢܐܝܬ` | 1.53x | 14 contexts | ܝܘܢܐܝܬ, ܨܝܢܐܝܬ, ܟܠܢܐܝܬ |
452
- | `ܣܘܪܝ` | 1.38x | 18 contexts | ܣܘܪܝܐ, ܣܘܪܝܬ, ܘܣܘܪܝܐ |
453
- | `ܝܢܬܐ` | 1.65x | 9 contexts | ܩܝܢܬܐ, ܡܕܝܢܬܐ, ܣܦܝܢܬܐ |
454
- | `ܕܝܢܬ` | 1.62x | 9 contexts | ܡܕܝܢܬ, ܡܕܝܢܬܐ, ܠܡܕܝܢܬ |
455
 
456
  ### 6.4 Affix Compatibility (Co-occurrence)
457
 
@@ -467,25 +502,27 @@ Using **Recursive Hierarchical Substitutability**, we decompose complex words in
467
  | Word | Suggested Split | Confidence | Stem |
468
  |------|-----------------|------------|------|
469
  | ܝܘܪܕܢܢܝܬܐ | **`ܝܘܪܕܢܢ-ܝܬܐ`** | 4.5 | `ܝܘܪܕܢܢ` |
 
470
  | ܥܘܬܡܐܢܝܬܐ | **`ܥܘܬܡܐܢ-ܝܬܐ`** | 4.5 | `ܥܘܬܡܐܢ` |
471
  | ܕܬܠܝܬܝܘܬܐ | **`ܕܬܠܝܬܝ-ܘܬܐ`** | 4.5 | `ܕܬܠܝܬܝ` |
472
  | ܕܐܢܛܝܘܟܝܐ | **`ܕܐܢܛܝܘܟ-ܝܐ`** | 4.5 | `ܕܐܢܛܝܘܟ` |
473
- | ܐܝܣܪܐܝܠܝܐ | **`ܐܝܣܪܐܝܠ-ܝܐ`** | 4.5 | `ܐܝܣܪܐܝܠ` |
474
- | ܦܘܪܛܘܓܠܝܐ | **`ܦܘܪܛܘܓܠ-ܝܐ`** | 4.5 | `ܦܘܪܛܘܓܠ` |
475
- | ܡܬܥܡܪܢܝܬܐ | **`ܡܬܥܡܪܢ-ܝܬܐ`** | 4.5 | `ܡܬܥܡܪܢ` |
476
  | ܛܘܪܥܒܕܝܢܝܐ | **`ܛܘܪܥܒܕܝܢ-ܝܐ`** | 4.5 | `ܛܘܪܥܒܕܝܢ` |
477
  | ܩܬܘܠܝܩܝ̈ܐ | **`ܩܬܘܠܝܩܝ-̈ܐ`** | 4.5 | `ܩܬܘܠܝܩܝ` |
 
 
 
478
  | ܠܫܘܠܛܢܘܬܐ | **`ܠܫܘܠܛܢ-ܘܬܐ`** | 1.5 | `ܠܫܘܠܛܢ` |
479
- | ܐܘܪܬܕܘܟܣܝܐ | **`ܐܘܪܬܕܘܟܣ-ܝܐ`** | 1.5 | `ܐܘܪܬܕܘܟܣ` |
480
- | ܐܝܓܘܦܛܝܬܐ | **`ܐܝܓܘܦܛ-ܝܬܐ`** | 1.5 | `ܐܝܓܘܦܛ` |
481
- | ܘܒܐܘܚܕ̈ܢܐ | **`ܘܒܐܘܚܕ̈-ܢܐ`** | 1.5 | `ܘܒܐܘܚܕ̈` |
482
- | ܘܒܡܫܝܚܝܘܬܐ | **`ܘܒܡܫܝܚܝ-ܘܬܐ`** | 1.5 | `ܘܒܡܫܝܚܝ` |
483
- | ܢܩܪܘܡܢܛܝܐ | **`ܢܩܪܘܡܢܛ-ܝܐ`** | 1.5 | `ܢܩܪܘܡܢܛ` |
484
 
485
  ### 6.6 Linguistic Interpretation
486
 
487
  > **Automated Insight:**
488
- The language ARC appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
 
 
489
 
490
  ---
491
  ## 7. Summary & Recommendations
@@ -497,7 +534,7 @@ The language ARC appears to be more isolating or has a highly fixed vocabulary.
497
  | Component | Recommended | Rationale |
498
  |-----------|-------------|-----------|
499
  | Tokenizer | **32k BPE** | Best compression (4.58x) |
500
- | N-gram | **2-gram** | Lowest perplexity (365) |
501
  | Markov | **Context-4** | Highest predictability (98.9%) |
502
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
503
 
@@ -712,4 +749,4 @@ MIT License - Free for academic and commercial use.
712
  ---
713
  *Generated by Wikilangs Models Pipeline*
714
 
715
- *Report Date: 2026-01-03 05:14:47*
 
1
  ---
2
  language: arc
3
+ language_name: Aramaic
4
  language_family: semitic_aramaic
5
  tags:
6
  - wikilangs
 
10
  - n-gram
11
  - markov
12
  - wikipedia
13
+ - feature-extraction
14
+ - sentence-similarity
15
+ - tokenization
16
+ - n-grams
17
+ - markov-chain
18
+ - text-mining
19
+ - fasttext
20
+ - babelvec
21
+ - vocabulous
22
+ - vocabulary
23
  - monolingual
24
  - family-semitic_aramaic
25
  license: mit
26
  library_name: wikilangs
27
+ pipeline_tag: text-generation
28
  datasets:
29
  - omarkamali/wikipedia-monthly
30
  dataset_info:
 
36
  value: 4.583
37
  - name: best_isotropy
38
  type: isotropy
39
+ value: 0.3326
40
  - name: vocabulary_size
41
  type: vocab
42
  value: 0
43
  generated: 2026-01-03
44
  ---
45
 
46
+ # Aramaic - Wikilangs Models
47
  ## Comprehensive Research Report & Full Ablation Study
48
 
49
+ This repository contains NLP models trained and evaluated by Wikilangs, specifically on **Aramaic** Wikipedia data.
50
  We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and word embeddings.
51
 
52
  ## 📋 Repository Contents
 
70
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
71
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
72
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
73
+ - [6. Morphological Analysis (Experimental)](#6--morphological-analysis-experimental)
74
  - [7. Summary & Recommendations](#7-summary--recommendations)
75
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
76
  - [Visualizations Index](#visualizations-index)
 
90
 
91
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
92
  |------------|-------------|---------------|----------|--------------|
93
+ | **8k** | 3.553x | 3.57 | 0.1262% | 63,406 |
94
+ | **16k** | 3.990x | 4.01 | 0.1417% | 56,456 |
95
+ | **32k** | 4.583x 🏆 | 4.60 | 0.1628% | 49,148 |
96
 
97
  ### Tokenization Examples
98
 
99
  Below are sample sentences tokenized with each vocabulary size:
100
 
101
+ **Sample 1:** `ܐܫܬܐ (ܟܢܘܫܝܐ: ܐܫܝ̈ܬܐ) ܗܝ ܡܢܬܐ ܓܠܝܠܬܐ ܕܨܪܘܝܘܬܐ ܕܬܦܠ ܒܒܣܬܪܐ ܕܐܓܢܐ ܕܓܘܫܡ̈ܐ ܕܐܢܫ̈ܐ ܘ...`
102
 
103
  | Vocab | Tokens | Count |
104
  |-------|--------|-------|
105
+ | 8k | `▁ܐܫܬܐ ▁( ܟܢܘܫܝܐ : ▁ܐ ܫܝ ̈ܬܐ ) ▁ܗܝ ▁ܡܢܬܐ ... (+20 more)` | 30 |
106
+ | 16k | `▁ܐܫܬܐ ▁( ܟܢܘܫܝܐ : ▁ܐ ܫܝ ̈ܬܐ ) ▁ܗܝ ▁ܡܢܬܐ ... (+18 more)` | 28 |
107
+ | 32k | `▁ܐܫܬܐ ▁( ܟܢܘܫܝܐ : ▁ܐܫܝ̈ܬܐ ) ▁ܗܝ ▁ܡܢܬܐ ▁ܓܠܝܠܬܐ ▁ܕܨܪܘܝܘܬܐ ... (+12 more)` | 22 |
108
 
109
+ **Sample 2:** `ܐܫܬܐ ܗܝ ܓܕܫܐ ܐܣܝܝܐ. ܐܫܬܐ ܗܝ ܚܡܝܡܘܬ ܓܘܫܡܐ ܠܥܠ ܡܢ ܫܘܝܐ ܟܝܢܝܐ ܕܚܡܝܡܘܬܐ ܓܘܝܬܐ ܕܓܘܫܡܐ...`
110
 
111
  | Vocab | Tokens | Count |
112
  |-------|--------|-------|
113
+ | 8k | `▁ܐܫܬܐ ▁ܗܝ ▁ܓܕܫܐ ▁ܐܣ ܝܝܐ . ▁ܐܫܬܐ ▁ܗܝ ▁ܚܡ ܝܡ ... (+10 more)` | 20 |
114
+ | 16k | `▁ܐܫܬܐ ▁ܗܝ ▁ܓܕܫܐ ▁ܐܣܝܝܐ . ▁ܐܫܬܐ ▁ܗܝ ▁ܚܡܝܡܘܬ ▁ܓܘܫܡܐ ▁ܠܥܠ ... (+6 more)` | 16 |
115
+ | 32k | `▁ܐܫܬܐ ▁ܗܝ ▁ܓܕܫܐ ▁ܐܣܝܝܐ . ▁ܐܫܬܐ ▁ܗܝ ▁ܚܡܝܡܘܬ ▁ܓܘܫܡܐ ▁ܠܥܠ ... (+6 more)` | 16 |
116
 
117
+ **Sample 3:** `ܡܪܝ ܢܪܣܝ ܕܒܙ (ܐܬܝܠܕ 17 ܐܝܪ - ܡܝܬ 14 ܫܒܛ ܗܘܐ ܡܝܛܪܦܘܠܝܛܐ ܕܠܒܢܢ ܘܣܘܪܝܐ ܘܟܠܗ̇ ܐܘܪܘܦܐ...`
118
 
119
  | Vocab | Tokens | Count |
120
  |-------|--------|-------|
121
+ | 8k | `▁ܡܪܝ ▁ܢܪܣܝ ▁ܕܒܙ ▁( ܐܬܝܠܕ1 7 ▁ܐܝܪ ▁- ... (+17 more)` | 27 |
122
+ | 16k | `▁ܡܪܝ ▁ܢܪܣܝ ▁ܕܒܙ ▁( ܐܬܝܠܕ1 7 ▁ܐܝܪ ▁- ... (+17 more)` | 27 |
123
+ | 32k | `▁ܡܪܝ ▁ܢܪܣܝ ▁ܕܒܙ ▁( ܐܬܝܠܕ1 7 ▁ܐܝܪ ▁- ... (+16 more)` | 26 |
124
 
125
 
126
  ### Key Findings
127
 
128
  - **Best Compression:** 32k achieves 4.583x compression
129
+ - **Lowest UNK Rate:** 8k with 0.1262% unknown tokens
130
  - **Trade-off:** Larger vocabularies improve compression but increase model size
131
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
132
 
 
143
 
144
  | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
145
  |--------|---------|------------|---------|----------------|------------------|-------------------|
146
+ | **2-gram** | Word | 477 | 8.90 | 719 | 45.8% | 100.0% |
147
+ | **2-gram** | Subword | 363 🏆 | 8.50 | 2,330 | 59.9% | 96.1% |
148
+ | **3-gram** | Word | 438 | 8.77 | 754 | 52.0% | 100.0% |
149
+ | **3-gram** | Subword | 2,379 | 11.22 | 10,583 | 28.3% | 67.0% |
150
+ | **4-gram** | Word | 759 | 9.57 | 1,466 | 43.3% | 83.8% |
151
+ | **4-gram** | Subword | 8,542 | 13.06 | 28,872 | 14.3% | 43.2% |
152
+ | **5-gram** | Word | 508 | 8.99 | 1,055 | 50.5% | 97.5% |
153
+ | **5-gram** | Subword | 16,168 | 13.98 | 38,919 | 8.4% | 30.9% |
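
The Perplexity and Entropy columns in this table appear to be related by perplexity = 2^entropy, with entropy measured in bits. The report does not state the formula, so the sketch below only checks that assumption against a few rows copied from the table. The function name is illustrative and is not part of the Wikilangs pipeline.

```python
def perplexity_from_entropy(entropy_bits: float) -> float:
    """Convert an entropy in bits to a perplexity, assuming perplexity = 2 ** entropy."""
    return 2.0 ** entropy_bits


# (entropy, reported perplexity) pairs copied from the table above.
reported = {
    "2-gram word": (8.90, 477),
    "2-gram subword": (8.50, 363),
    "5-gram word": (8.99, 508),
}

for name, (entropy, perplexity) in reported.items():
    estimate = perplexity_from_entropy(entropy)
    # Entropy is rounded to two decimals in the table, so a small gap is expected.
    rel_gap = abs(estimate - perplexity) / perplexity
    print(f"{name}: 2^{entropy} = {estimate:.1f}, reported {perplexity}, gap {rel_gap:.2%}")
```

Under this assumption the estimates stay within about 1% of the reported values, which fits with the entropy column being rounded.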
154
 
155
  ### Top 5 N-grams by Size
156
 
 
160
  |------|--------|-------|
161
  | 1 | `ܐܦ ܚܙܝ` | 193 |
162
  | 2 | `ܚܕ ܡܢ` | 141 |
163
+ | 3 | `ܗܝ ܐܬܪܐ` | 124 |
164
+ | 4 | `ܐܝܬ ܠܗ` | 102 |
165
+ | 5 | `ܬܚܘܡܐ ܥܡ` | 89 |
166
 
167
  **3-grams (Word):**
168
 
169
  | Rank | N-gram | Count |
170
  |------|--------|-------|
171
  | 1 | `ܗܘ ܚܕ ܡܢ` | 72 |
172
+ | 2 | `ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ` | 52 |
173
+ | 3 | `ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ` | 52 |
174
+ | 4 | `ܝܢܦܠ ܡܒܤܢ ܐܤܡ` | 52 |
175
+ | 5 | `ܓܐ ܝܢܦܠ ܡܒܤܢ` | 52 |
176
 
177
  **4-grams (Word):**
178
 
179
  | Rank | N-gram | Count |
180
  |------|--------|-------|
181
+ | 1 | `ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ` | 52 |
182
+ | 2 | `ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ` | 52 |
183
+ | 3 | `ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ` | 52 |
184
+ | 4 | `ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ` | 52 |
185
+ | 5 | `ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢܝܛܠܐ ܝܟܝܟܕ` | 52 |
186
+
187
+ **5-grams (Word):**
188
+
189
+ | Rank | N-gram | Count |
190
+ |------|--------|-------|
191
+ | 1 | `ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ` | 52 |
192
+ | 2 | `ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ ܡܒܤܢ ܐܤܡ` | 52 |
193
+ | 3 | `ܓܐ ܝܢܦܠ ܡܒܤܢ ܐܤܡ ܟܛܠ` | 52 |
194
+ | 4 | `ܡܒܤܢ ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ` | 52 |
195
+ | 5 | `ܝܢܦܠ ܡܒܤܢ ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ` | 52 |
196
 
197
  **2-grams (Subword):**
198
 
199
  | Rank | N-gram | Count |
200
  |------|--------|-------|
201
+ | 1 | `ܐ _` | 24,552 |
202
+ | 2 | `_ ܕ` | 7,580 |
203
+ | 3 | `ܬ ܐ` | 7,166 |
204
+ | 4 | `_ ܐ` | 6,890 |
205
+ | 5 | `ܝ ܐ` | 5,689 |
206
 
207
  **3-grams (Subword):**
208
 
209
  | Rank | N-gram | Count |
210
  |------|--------|-------|
211
+ | 1 | `ܐ _ ܕ` | 6,099 |
212
+ | 2 | `ܬ ܐ _` | 5,875 |
213
+ | 3 | `ܝ ܐ _` | 4,233 |
214
  | 4 | `ܐ _ ܐ` | 2,477 |
215
+ | 5 | _ ܘ` | 2,392 |
216
 
217
  **4-grams (Subword):**
218
 
219
  | Rank | N-gram | Count |
220
  |------|--------|-------|
221
+ | 1 | `ܬ ܐ _ ܕ` | 1,993 |
222
+ | 2 | `ܝ ܬ ܐ _` | 1,513 |
223
+ | 3 | `ܐ ܝ ܬ _` | 1,372 |
224
+ | 4 | `ܘ ܬ ܐ _` | 1,304 |
225
+ | 5 | `_ ܡ ܢ _` | 1,203 |
226
+
227
+ **5-grams (Subword):**
228
+
229
+ | Rank | N-gram | Count |
230
+ |------|--------|-------|
231
+ | 1 | `ܐ _ ܐ ܘ _` | 603 |
232
+ | 2 | `ܘ ܬ ܐ _ ܕ` | 543 |
233
+ | 3 | `ܐ _ ܡ ܢ _` | 533 |
234
+ | 4 | `_ ܐ ܝ ܬ _` | 503 |
235
+ | 5 | `ܝ ܘ ܬ ܐ _` | 487 |
236
 
237
 
238
  ### Key Findings
239
 
240
+ - **Best Perplexity:** 2-gram (subword) with 363
241
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
242
+ - **Coverage:** Top-1000 patterns cover ~31% of corpus
243
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
244
 
245
  ---
 
255
 
256
  | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
257
  |---------|---------|-------------|------------|------------------|-----------------|----------------|
258
+ | **1** | Word | 0.5444 | 1.458 | 2.59 | 17,962 | 45.6% |
259
+ | **1** | Subword | 0.9651 | 1.952 | 6.05 | 1,231 | 3.5% |
260
+ | **2** | Word | 0.1025 | 1.074 | 1.16 | 45,549 | 89.7% |
261
+ | **2** | Subword | 0.7954 | 1.736 | 3.84 | 7,433 | 20.5% |
262
+ | **3** | Word | 0.0296 | 1.021 | 1.04 | 51,591 | 97.0% |
263
+ | **3** | Subword | 0.5935 | 1.509 | 2.45 | 28,465 | 40.7% |
264
+ | **4** | Word | 0.0106 🏆 | 1.007 | 1.01 | 52,240 | 98.9% |
265
+ | **4** | Subword | 0.3585 | 1.282 | 1.71 | 69,540 | 64.2% |
266
 
267
  ### Generated Text Samples (Word-based)
268
 
 
270
 
271
  **Context Size 1:**
272
 
273
+ 1. `ܡܢ ܕܝܪܐ ܕܡܪܝ ܓܒܪܐܝܠ ܗܘ ܫܡܐ ܕܒܗܘ ܢܬܝܕܥܘܢ ܗܘܘ ܡܠܘܢ ܕܝܢ ܪ̈ܥܘܬܐ ܕܝܠܗܘܢ ܒܪܓܘܠܐ ܕܓܗܢܐ ܗܢܐ`
274
+ 2. `ܐܘ ܡܣܐܢܐ ܐܘ ܚܡܗܐ ܗܘ ܣܦܪܐ ܕܗܘܫܥ ܘܕܐܫܥܝܐ ܘܗܘܝܘ ܦܪܝܣܐ ܩܘܢܣܛܢܛܝܢ`
275
+ 3. `ܗܘ ܢܒܝܐ ܙܥܘܪܐ ܒܠܒܘܫܐ ܕܒܗ ܫܩܠܘ ܐܢܛܝܘܟܝܐ ܘܫܚܠܦ ܗܘܐ ܦܛܪܝܪܟܐ ܩܢܘܢܝܐ ܣܕܪܐ ܘܟܠ ܕܡܘ 2 ܒܢܝܣܢ`
276
 
277
  **Context Size 2:**
278
 
279
+ 1. `ܐܦ ܚܙܝ ܥܡܐ ܒܝܬܝܘܬܐ de verwandtschaftsbeziehung onkel und tante`
280
+ 2. `ܚܕ ܡܢ ܬܪܥܣܪ ܢܒܝ̈ܐ ܙܥܘܪ̈ܐ ܕܬܢܟ ܘܕܕܝܬܝܩܝ ܥܬܝܩܬܐ ܥܬܝܩܬܐ`
281
+ 3. `ܗܝ ܐܬܪܐ ܒܐܣܝܐ ܨܝܢ ܐܝܬ ܠܗ̇ ܪܡܙܐ ܕܡܘܠܕܐ ܬܪܝܢܐ ܒܝܕ ܡܝ̈ܐ ܘܪܘܚܐ ܕܩܘܕܫܐ ܝܫܘܥ ܐܡܪ ܠܗ ܘܠܐܚܗ`
282
 
283
  **Context Size 3:**
284
 
285
+ 1. `ܗܘ ܚܕ ܡܢ ܠܫܢ̈ܐ ܨܝܢܝ̈ܐ ܕܢܬܡܠܠܘܢ ܒܬܝܡܢ ܡܕܢܚ ܕܨܝܢ ܐܝܬܘܗܝ ܠܫܢܐ ܐܡܗܝܐ ܕܝܬܝܪ ܡܢ 90 ܡܠܝܘܢܐ ܐܢܫ̈ܝܢ ܪܝܫܐܝܬ`
286
+ 2. `ܢܝܛܠܐ ܝܟܝܟܕ ܝܡܓܚܝܢܐ ܐܓܐ ܟܡܠܐ ܣܐܙܬܝܐܢ ܝܠܟܐܒ ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ ܡܒܤܢ ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢ...`
287
+ 3. `ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫ...`
288
 
289
  **Context Size 4:**
290
 
291
+ 1. `ܝܓܚܝܐ ܟܠܢܚܝܓܐ ܓܐ ܝܢܦܠ ܡܒܤܢ ܐܤܡ ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓ...`
292
+ 2. `ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫܙܢ ܢ...`
293
+ 3. `ܟܛܠ ܚܢܝܬܝܐ ܡܕܛܚܝܢܐ ܡܒܕ ܫܐܢܡܝܢ ܪܡܝܚܢܐܢ ܢܕܢܐ ܡܠܝܝܐ ܢܟܓܐܝܚܢܛܟ‍ ܟ‍ܝܣܢܐ ܡܓܝܡܡ ܡܟܒܡ ܠܣܐܟ ܒܟܡ ܣܢܝܓܚܝܢܪܢ ܟܢܫ...`
294
 
295
 
296
  ### Generated Text Samples (Subword-based)
 
299
 
300
  **Context Size 1:**
301
 
302
+ 1. `_ܥܪ_ܙܥܠܐܡܘܦܐܘܗ_ܡ`
303
+ 2. `ܐܣܬܐ_ܒܕ_ܐ_ܡܐ_ܕܡܦ`
304
+ 3. `ܝܓܝܛܠܚܕܢܪܝܦܝ̈ܘܪܝܐ`
305
 
306
  **Context Size 2:**
307
 
308
+ 1. `ܐ_ܫܡܐ_ܕܢܐܡܪܝܬ_800`
309
+ 2. `_ܕܠܐܬܘܕܝܫܐ_ܘܒܝܬ_ا`
310
+ 3. `ܬܐ_ܕܛܚܝܬ_ܡܚܝܬ_ܡܥܘ`
311
 
312
  **Context Size 3:**
313
 
314
+ 1. `ܐ_ܕܐܬܥܢܘܬܐ_ܠܫܢܐ_(r`
315
+ 2. `ܬܐ_ܥܪܒܐ_ܩܘܛܢܝܘܬܐ_ܛ`
316
+ 3. `ܝܐ_ܒܪ_ܒܪ_ܐܘ_ܐܘܢܛܐܪ`
317
 
318
  **Context Size 4:**
319
 
320
+ 1. `ܬܐ_ܕܢܒܥ_ܡܫܝܚܐ._ܥܠܠܢ̈`
321
+ 2. `ܝܬܐ_ܕܪ̈ܗܘܡܝܐ܀_ܢܗܪܝܢ܁`
322
+ 3. `ܐܝܬ_ܚܡܫܐ_ܨܒܝܢܗ._ܟܬܒ`
323
 
324
 
325
  ### Key Findings
326
 
327
  - **Best Predictability:** Context-4 (word) with 98.9% predictability
328
  - **Branching Factor:** Decreases with context size (more deterministic)
329
+ - **Memory Trade-off:** Larger contexts require more storage (69,540 contexts)
330
  - **Recommendation:** Context-3 or Context-4 for text generation
331
 
332
  ---
 
342
 
343
  | Metric | Value |
344
  |--------|-------|
345
+ | Vocabulary Size | 6,099 |
346
+ | Total Tokens | 50,661 |
347
+ | Mean Frequency | 8.31 |
348
  | Median Frequency | 3 |
349
  | Frequency Std Dev | 32.05 |
350
 
 
352
 
353
  | Rank | Word | Frequency |
354
  |------|------|-----------|
355
+ | 1 | ܡܢ | 1,276 |
356
+ | 2 | ܐܘ | 979 |
357
+ | 3 | ܗܘ | 860 |
358
  | 4 | ܗܝ | 816 |
359
+ | 5 | ܐܝܬ | 513 |
360
  | 6 | ܗܘܐ | 394 |
361
+ | 7 | ܘܥܡ | 330 |
362
+ | 8 | ܥܠ | 324 |
363
+ | 9 | ܐܦ | 277 |
364
+ | 10 | ܠܫܢܐ | 263 |
365
 
366
  ### Least Common Words (from vocabulary)
367
 
 
382
 
383
  | Metric | Value |
384
  |--------|-------|
385
+ | Zipf Coefficient | 0.8942 |
386
+ | R² (Goodness of Fit) | 0.982775 |
387
  | Adherence Quality | **excellent** |
388
 
389
  ### Coverage Analysis
390
 
391
  | Top N Words | Coverage |
392
  |-------------|----------|
393
+ | Top 100 | 31.8% |
394
  | Top 1,000 | 68.0% |
395
+ | Top 5,000 | 95.7% |
396
  | Top 10,000 | 0.0% |
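
Coverage here reads as the share of all token occurrences accounted for by the N most frequent words. A minimal sketch of that calculation follows, assuming whitespace-split tokens and a toy corpus rather than the Aramaic Wikipedia data. The helper name `top_n_coverage` is hypothetical, and the pipeline's own implementation is not shown in the source.

```python
from collections import Counter


def top_n_coverage(tokens, n):
    """Share of all token occurrences covered by the n most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Toy corpus for illustration only.
tokens = "a a a a b b b c c d".split()
print(top_n_coverage(tokens, 1))    # most frequent word only
print(top_n_coverage(tokens, 2))    # two most frequent words
print(top_n_coverage(tokens, 100))  # n larger than the vocabulary
```

With this definition, an N larger than the vocabulary covers the whole corpus. The report's 0.0% for Top 10,000 is reproduced as committed. The source does not say how the pipeline arrives at it.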
397
 
398
  ### Key Findings
399
 
400
  - **Zipf Compliance:** R²=0.9828 indicates excellent adherence to Zipf's law
401
+ - **High Frequency Dominance:** Top 100 words cover 31.8% of corpus
402
+ - **Long Tail:** The remaining 1,099 words beyond the top 5,000 account for the last 4.3% of coverage
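
Coverage at a cutoff N is the share of tokens accounted for by the N most frequent words. Because this vocabulary holds only 6,099 words, a top-10,000 cutoff exceeds it, so an implementation has to clamp the cutoff to the vocabulary size. The sketch below is a minimal illustration with hypothetical function names, not the pipeline's code.

```python
def coverage(frequencies, top_n):
    """Fraction of all tokens covered by the `top_n` most frequent words."""
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    # Clamp so a cutoff larger than the vocabulary means "everything".
    n = min(top_n, len(ranked))
    return sum(ranked[:n]) / total


def long_tail_size(frequencies, top_n):
    """Number of words left outside the `top_n` cutoff (never negative)."""
    return max(len(frequencies) - top_n, 0)


# Toy vocabulary: four words with counts 5, 3, 1, 1.
toy = [5, 3, 1, 1]
for n in (1, 2, 10):
    print(n, coverage(toy, n), long_tail_size(toy, n))
```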
403
 
404
  ---
405
  ## 5. Word Embeddings Evaluation
 
415
 
416
  ### 5.1 Cross-Lingual Alignment
417
 
418
+ ![Alignment Quality](visualizations/embedding_alignment_quality.png)
419
+
420
+ ![Multilingual t-SNE](visualizations/embedding_tsne_multilingual.png)
421
 
422
 
423
  ### 5.2 Model Comparison
424
 
425
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
426
  |-------|-----------|----------|------------------|---------------|----------------|
427
+ | **mono_32d** | 32 | 0.3326 | 0.4816 | N/A | N/A |
428
+ | **mono_64d** | 64 | 0.0524 | 0.4997 | N/A | N/A |
429
+ | **mono_128d** | 128 | 0.0094 | 0.4954 | N/A | N/A |
430
+ | **aligned_32d** | 32 | 0.3326 🏆 | 0.4744 | 0.2099 | 0.5556 |
431
+ | **aligned_64d** | 64 | 0.0524 | 0.4826 | 0.1975 | 0.6543 |
432
+ | **aligned_128d** | 128 | 0.0094 | 0.5048 | 0.2099 | 0.7037 |
433
 
434
  ### Key Findings
435
 
436
+ - **Best Isotropy:** aligned_32d with 0.3326 (more uniform distribution)
437
+ - **Semantic Density:** Average pairwise similarity of 0.4897. Lower values indicate better semantic separation.
438
+ - **Alignment Quality:** Aligned models achieve up to 21.0% R@1 in cross-lingual retrieval.
439
  - **Recommendation:** 128d aligned for best cross-lingual performance
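
Recall@k in cross-lingual retrieval measures how often the correct translation of a source word appears among its k nearest neighbours in the target embedding space. The sketch below illustrates the metric with cosine similarity on toy vectors; the function names and data are hypothetical, and the pipeline's evaluation may differ in detail. The values in the table are consistent with scoring over the 81 seed pairs listed in the aligned-model metadata (for example, 17/81 ≈ 0.2099 and 57/81 ≈ 0.7037).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(source_vecs, target_vecs, gold, k):
    """Share of source words whose gold target is among the top-k neighbours.

    source_vecs / target_vecs map words to vectors in a shared space;
    gold maps each source word to its correct target word.
    """
    hits = 0
    for src, expected in gold.items():
        ranked = sorted(
            target_vecs,
            key=lambda tgt: cosine(source_vecs[src], target_vecs[tgt]),
            reverse=True,
        )
        hits += expected in ranked[:k]
    return hits / len(gold)


# Toy 2-d example: "s1" lands nearest its translation, "s2" does not.
source = {"s1": [1.0, 0.0], "s2": [0.0, 1.0]}
target = {"t1": [0.9, 0.1], "t2": [1.0, 0.2], "t3": [0.1, 1.0]}
gold = {"s1": "t1", "s2": "t2"}
print(recall_at_k(source, target, gold, 1), recall_at_k(source, target, gold, 3))
```

With so few evaluation pairs, a single additional hit moves R@1 by more than a percentage point, so small differences between dimensions should be read cautiously.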
440
 
441
  ---
442
  ## 6. Morphological Analysis (Experimental)
443
 
 
 
444
  This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
445
 
446
  ### 6.1 Productivity & Complexity
447
 
448
  | Metric | Value | Interpretation | Recommendation |
449
  |--------|-------|----------------|----------------|
450
+ | Productivity Index | **5.000** | High morphological productivity | Reliable analysis |
451
+ | Idiomaticity Gap | **2.160** | High formulaic/idiomatic content | - |
452
 
453
  ### 6.2 Affix Inventory (Productive Units)
454
 
 
461
  #### Productive Suffixes
462
  | Suffix | Examples |
463
  |--------|----------|
464
+ | `-ܐ` | ܬܫܐܢܐܟܐܠܐ, ܟܘܕܢܝܐ, ܕܚܘܼܡܵܐ |
465
+ | `-ܬܐ` | ܐܒܪ̈ܗܡܝܬܐ, ܕܡܥܡܕܝܬܐ, ܫܬܐ |
466
+ | `-ܝܐ` | ܟܘܕܢܝܐ, ܡܠܝܝܐ, ܓܐܘܓܪܦܝܐ |
467
+ | `-ܝܬܐ` | ܐܒܪ̈ܗܡܝܬܐ, ܕܡܥܡܕܝܬܐ, ܟܢܘܫܝܬܐ |
468
+ | `-̈ܐ` | ܢܒܝ̈ܐ, ܕܓܘܫܡ̈ܐ, ܘܚܝ̈ܐ |
469
+ | `-ܘܬܐ` | ܘܐܬܪ̈ܘܬܐ, ܛܝܒܘܬܐ, ܦܛܪܝܪܟܘܬܐ |
470
+ | `-ܢܐ` | ܐܡܝܢܐ, ܘܡܩܝܡܢܐ, ܕܓܝܗܢܐ |
471
 
472
  ### 6.3 Bound Stems (Lexical Roots)
473
 
 
475
 
476
  | Stem | Cohesion | Substitutability | Examples |
477
  |------|----------|------------------|----------|
478
+ | `ܢܝܬܐ` | 1.64x | 23 contexts | ܦܢܝܬܐ, ܡܢܝܬܐ, ܡܕܢܝܬܐ |
479
+ | `ܪܝܬܐ` | 1.68x | 18 contexts | ܒܪܝܬܐ, ܫܪܝܬܐ, ܩܪܝܬܐ |
480
+ | `ܘܪܝܐ` | 1.51x | 23 contexts | ܣܘܪܝܐ, ܛܘܪܝܐ, ܟܘܪܝܐ |
481
+ | `ܫܝܚܝ` | 1.69x | 15 contexts | ܡܫܝܚܝܐ, ܘܡܫܝܚܝܐ, ܡܫܝܚܝ̈ܐ |
482
+ | `ܪܒܝܐ` | 1.63x | 16 contexts | ܨܪܒܝܐ, ܓܪܒܝܐ, ܐܪܒܝܐ |
483
+ | `ܘܢܝܐ` | 1.63x | 15 contexts | ܝܘܢܝܐ, ܓܘܢܝܐ, ܟܘܢܝܐ |
484
+ | `ܡܫܝܚ` | 1.70x | 13 contexts | ܡܫܝܚܐ, ܡܫܝܚܝܐ, ܕܡܫܝܚܐ |
485
+ | `ܣܘܪܝ` | 1.49x | 18 contexts | ܣܘܪܝܐ, ܣܘܪܝܬ, ܐܣܘܪܝܐ |
486
+ | `ܡܕܝܢ` | 1.63x | 13 contexts | ܡܕܝܢܬ, ܡܕܝܢܬܐ, ܠܡܕܝܢܬ |
487
+ | `ܢܐܝܬ` | 1.55x | 14 contexts | ܨܝܢܐܝܬ, ܝܘܢܐܝܬ, ܝܦܢܐܝܬ |
488
+ | `ܝܢܬܐ` | 1.71x | 9 contexts | ܩܝܢܬܐ, ܣܦܝܢܬܐ, ܡܕܝܢܬܐ |
489
+ | `ܕܝܢܬ` | 1.67x | 9 contexts | ܡܕܝܢܬ, ܡܕܝܢܬܐ, ܠܡܕܝܢܬ |
490
 
491
  ### 6.4 Affix Compatibility (Co-occurrence)
492
 
 
502
  | Word | Suggested Split | Confidence | Stem |
503
  |------|-----------------|------------|------|
504
  | ܝܘܪܕܢܢܝܬܐ | **`ܝܘܪܕܢܢ-ܝܬܐ`** | 4.5 | `ܝܘܪܕܢܢ` |
505
+ | ܦܘܪܛܘܓܠܝܐ | **`ܦܘܪܛܘܓܠ-ܝܐ`** | 4.5 | `ܦܘܪܛܘܓܠ` |
506
  | ܥܘܬܡܐܢܝܬܐ | **`ܥܘܬܡܐܢ-ܝܬܐ`** | 4.5 | `ܥܘܬܡܐܢ` |
507
  | ܕܬܠܝܬܝܘܬܐ | **`ܕܬܠܝܬܝ-ܘܬܐ`** | 4.5 | `ܕܬܠܝܬܝ` |
508
  | ܕܐܢܛܝܘܟܝܐ | **`ܕܐܢܛܝܘܟ-ܝܐ`** | 4.5 | `ܕܐܢܛܝܘܟ` |
 
 
 
509
  | ܛܘܪܥܒܕܝܢܝܐ | **`ܛܘܪܥܒܕܝܢ-ܝܐ`** | 4.5 | `ܛܘܪܥܒܕܝܢ` |
510
  | ܩܬܘܠܝܩܝ̈ܐ | **`ܩܬܘܠܝܩܝ-̈ܐ`** | 4.5 | `ܩܬܘܠܝܩܝ` |
511
+ | ܡܬܥܡܪܢܝܬܐ | **`ܡܬܥܡܪܢ-ܝܬܐ`** | 4.5 | `ܡܬܥܡܪܢ` |
512
+ | ܒܡܬܥܕܪܢܘܬܐ | **`ܒܡܬܥܕܪܢ-ܘܬܐ`** | 1.5 | `ܒܡܬܥܕܪܢ` |
513
+ | ܬܫܥܝܬܢܝܬܐ | **`ܬܫܥܝܬܢ-ܝܬܐ`** | 1.5 | `ܬܫܥܝܬܢ` |
514
  | ܠܫܘܠܛܢܘܬܐ | **`ܠܫܘܠܛܢ-ܘܬܐ`** | 1.5 | `ܠܫܘܠܛܢ` |
515
+ | ܡܫܥܢܕܢܝܬܐ | **`ܡܫܥܢܕܢ-ܝܬܐ`** | 1.5 | `ܡܫܥܢܕܢ` |
516
+ | ܐܝܣܠܢܕ̈ܝܐ | **`ܐܝܣܠܢܕ̈-ܝܐ`** | 1.5 | `ܐܝܣܠܢܕ̈` |
517
+ | ܐܪܬܘܕܟܣܝܐ | **`ܐܪܬܘܕܟܣ-ܝܐ`** | 1.5 | `ܐܪܬܘܕܟܣ` |
518
+ | ܦܘܠܛܝܩܝܬܐ | **`ܦܘܠܛܝܩ-ܝܬܐ`** | 1.5 | `ܦܘܠܛܝܩ` |
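
The suggested splits above all separate a word into a stem and one of the productive suffixes from section 6.2. One simple way to reproduce that shape of output is longest-suffix matching, sketched below. This is an illustrative assumption rather than the pipeline's method (which also assigns the confidence scores shown in the table), and `split_suffix` is a hypothetical helper.

```python
# Suffix inventory copied from the "Productive Suffixes" table (section 6.2).
SUFFIXES = ["ܐ", "ܬܐ", "ܝܐ", "ܝܬܐ", "̈ܐ", "ܘܬܐ", "ܢܐ"]


def split_suffix(word, suffixes, min_stem=2):
    """Split `word` into (stem, suffix) using the longest matching suffix.

    Returns (word, "") when no suffix leaves a stem of at least
    `min_stem` characters.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)], suffix
    return word, ""


# Words taken from the table above; each result is a (stem, suffix) pair.
words = ["ܝܘܪܕܢܢܝܬܐ", "ܦܘܪܛܘܓܠܝܐ", "ܠܫܘܠܛܢܘܬܐ"]
splits = [split_suffix(word, SUFFIXES) for word in words]
```

Trying longer suffixes first matters because the inventory is nested: a word ending in `-ܝܬܐ` also ends in `-ܬܐ` and `-ܐ`.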
 
519
 
520
  ### 6.6 Linguistic Interpretation
521
 
522
  > **Automated Insight:**
523
+ Aramaic shows high morphological productivity. The subword models are significantly more efficient than the word models, suggesting a rich system of affixation or compounding.
524
+
525
+ > **Note on Idiomaticity:** The high Idiomaticity Gap suggests a large number of frequent multi-word expressions or formulaic sequences that are statistically distinct from their component parts.
526
 
527
  ---
528
  ## 7. Summary & Recommendations
 
534
  | Component | Recommended | Rationale |
535
  |-----------|-------------|-----------|
536
  | Tokenizer | **32k BPE** | Best compression (4.58x) |
537
+ | N-gram | **2-gram** | Lowest perplexity (363) |
538
  | Markov | **Context-4** | Highest predictability (98.9%) |
539
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
540
 
 
749
  ---
750
  *Generated by Wikilangs Models Pipeline*
751
 
752
+ *Report Date: 2026-01-03 16:33:24*
models/embeddings/aligned/arc_128d.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e8d628ee8950c2b7ed7c3e9de1db3782e533dcdf606048be9ab19b74823a78c1
3
+ size 1025874179
models/embeddings/aligned/arc_128d.meta.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"lang": "arc", "dim": 128, "max_seq_len": 512, "is_aligned": true}
models/embeddings/aligned/arc_128d.projection.npy ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:90a8bc7d91ab749fbde71ef4592ddb5930aad98cf89e4fe62609617116614d4c
3
+ size 65664
models/embeddings/aligned/arc_128d_metadata.json ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "language": "arc",
3
+ "dimension": 128,
4
+ "version": "aligned",
5
+ "hub_language": "en",
6
+ "seed_vocab_size": 81,
7
+ "vocab_size": 1795
8
+ }
models/embeddings/aligned/arc_32d.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea95831f0cddf68f8d1922adb20034599b5abee861aa999d0c55c5b434433549
3
+ size 256495619
models/embeddings/aligned/arc_32d.meta.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"lang": "arc", "dim": 32, "max_seq_len": 512, "is_aligned": true}
models/embeddings/aligned/arc_32d.projection.npy ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:314b43a7227e4552787faf00f8ef83eb3186cbe76d9aa61a0c07d6bc9b336111
3
+ size 4224
models/embeddings/aligned/arc_32d_metadata.json ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "language": "arc",
3
+ "dimension": 32,
4
+ "version": "aligned",
5
+ "hub_language": "en",
6
+ "seed_vocab_size": 81,
7
+ "vocab_size": 1795
8
+ }
models/embeddings/aligned/arc_64d.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:80969530d1c44b57121288762e8b87141d96d2f813f5a2ce5b5770ef551329e6
3
+ size 512955139
models/embeddings/aligned/arc_64d.meta.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"lang": "arc", "dim": 64, "max_seq_len": 512, "is_aligned": true}
models/embeddings/aligned/arc_64d.projection.npy ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2e168e33b3341339910d3d05582308c54eda45b95a9123adf0e4b2ad948e055a
3
+ size 16512
models/embeddings/aligned/arc_64d_metadata.json ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "language": "arc",
3
+ "dimension": 64,
4
+ "version": "aligned",
5
+ "hub_language": "en",
6
+ "seed_vocab_size": 81,
7
+ "vocab_size": 1795
8
+ }
models/embeddings/monolingual/arc_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:af294265d0648fcd88d1a3a79c52a316b5312eb2df0e22491460d7962b9abecf
3
- size 1025880436
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e8d628ee8950c2b7ed7c3e9de1db3782e533dcdf606048be9ab19b74823a78c1
3
+ size 1025874179
models/embeddings/monolingual/arc_128d_metadata.json CHANGED
@@ -11,5 +11,5 @@
11
  "encoding_method": "rope",
12
  "dim": 128
13
  },
14
- "vocab_size": 1801
15
  }
 
11
  "encoding_method": "rope",
12
  "dim": 128
13
  },
14
+ "vocab_size": 1795
15
  }
models/embeddings/monolingual/arc_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ca828785023acb65e4a3915ca46d6488b08273c410da1d5ebe5859bffe778de4
3
- size 256497268
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea95831f0cddf68f8d1922adb20034599b5abee861aa999d0c55c5b434433549
3
+ size 256495619
models/embeddings/monolingual/arc_32d_metadata.json CHANGED
@@ -11,5 +11,5 @@
11
  "encoding_method": "rope",
12
  "dim": 32
13
  },
14
- "vocab_size": 1801
15
  }
 
11
  "encoding_method": "rope",
12
  "dim": 32
13
  },
14
+ "vocab_size": 1795
15
  }
models/embeddings/monolingual/arc_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3ca0fd83ec54d91bfbc8d6aec7c2188702f2f3ff651521ae421ae4403a931166
3
- size 512958324
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:80969530d1c44b57121288762e8b87141d96d2f813f5a2ce5b5770ef551329e6
3
+ size 512955139
models/embeddings/monolingual/arc_64d_metadata.json CHANGED
@@ -11,5 +11,5 @@
11
  "encoding_method": "rope",
12
  "dim": 64
13
  },
14
- "vocab_size": 1801
15
  }
 
11
  "encoding_method": "rope",
12
  "dim": 64
13
  },
14
+ "vocab_size": 1795
15
  }
models/subword_markov/arc_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:bc87b5460d601a98fbc387cbc99f283bf6cd7d98c37e6a136392f9742a5c1769
3
- size 61605
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f2cf16541ca4543598754c02901d686e966979a3f2dbef93171079a43a695800
3
+ size 61546
models/subword_markov/arc_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "arc",
5
- "unique_contexts": 1232,
6
- "total_transitions": 368665
7
  }
 
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "arc",
5
+ "unique_contexts": 1231,
6
+ "total_transitions": 367430
7
  }
models/subword_markov/arc_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:44f7913fb0e3f66a526482ce6ecca8f16f96c74f389962eac8c930ae99d709a4
3
- size 236563
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ade6f3b86c5ac381ec52cb57cbda8cc727ed4013fd635825536cbdeb065e3851
3
+ size 228551
models/subword_markov/arc_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "arc",
5
- "unique_contexts": 7459,
6
- "total_transitions": 367072
7
  }
 
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "arc",
5
+ "unique_contexts": 7433,
6
+ "total_transitions": 365838
7
  }
models/subword_markov/arc_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6c02d515c8771584609106a8bfb0e1e6f4a01778f690d1e902e6c23730d85b1f
3
- size 614974
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a14f68a36dd11286c23370274ba7ff4661e8c2e5ad7150608f82b13f2e385917
3
+ size 613765
models/subword_markov/arc_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "arc",
5
- "unique_contexts": 28618,
6
- "total_transitions": 365479
7
  }
 
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "arc",
5
+ "unique_contexts": 28465,
6
+ "total_transitions": 364246
7
  }
models/subword_markov/arc_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ab3159c713b354d456d78d674f47cafc43e66c4c18a290510850783e3575b191
3
- size 1213186
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:de9318a054b0cd1ae284c2b1e5d27ed21cc036ddfb239927d75f1c53cd6a035b
3
+ size 1191772
models/subword_markov/arc_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "arc",
5
- "unique_contexts": 69915,
6
- "total_transitions": 363886
7
  }
 
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "arc",
5
+ "unique_contexts": 69540,
6
+ "total_transitions": 362654
7
  }
models/subword_ngram/arc_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1e6f137266958bfa5bf2d5725d17daa253bc7444dd15168bf7c4aba0c82d498e
3
- size 30619
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0b0ef8b68904eb3bc261b5599c684b44c82c140473e77759ec123df2bcccf1b5
3
+ size 30383
models/subword_ngram/arc_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "arc",
5
- "unique_ngrams": 2347,
6
- "total_ngrams": 368665
7
  }
 
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "arc",
5
+ "unique_ngrams": 2330,
6
+ "total_ngrams": 367430
7
  }
models/subword_ngram/arc_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ce18df7b3c9474a7e74ca5ef2fd82d105d907ac676d24a02a4a055d5c06da3df
3
- size 135373
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:41cf48eff4f49881b3b1ef4bee554fb58154195938b1733c73ed7e68e9e0214b
3
+ size 134552
models/subword_ngram/arc_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "arc",
5
- "unique_ngrams": 10625,
6
- "total_ngrams": 367072
7
  }
 
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "arc",
5
+ "unique_ngrams": 10583,
6
+ "total_ngrams": 365838
7
  }
models/subword_ngram/arc_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:36cdb55afe0905dc264f2b5348858599da9f4a343be608731c24757accd88088
3
- size 398536
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9672d51caaa89c2f58eff730631ddd563754bbaac6d78c20d8cb575cbbf65d87
3
+ size 394701
models/subword_ngram/arc_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "arc",
5
- "unique_ngrams": 28979,
6
- "total_ngrams": 365479
7
  }
 
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "arc",
5
+ "unique_ngrams": 28872,
6
+ "total_ngrams": 364246
7
  }
models/subword_ngram/arc_5gram_subword.parquet ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4f92babc9e6855ff8c2b135ce81ee6f2033e836d8767833ec47cf2caf5c7f2dd
3
+ size 537047
models/subword_ngram/arc_5gram_subword_metadata.json ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ {
2
+ "n": 5,
3
+ "variant": "subword",
4
+ "language": "arc",
5
+ "unique_ngrams": 38919,
6
+ "total_ngrams": 362654
7
+ }
models/tokenizer/arc_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f9d8b946b07c5ee953e8f59e73c726ea02c7e02592415beaef72b11ce24d07ed
3
- size 553548
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a79f3440c903b39d8b82904f5bdf191e3a3f69c45dc94cff5f577a68ffb722d3
3
+ size 553519
models/tokenizer/arc_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/arc_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:dc878161f3ebfde37edd5364173a906b743ffa182cc4a09da6dedb402cd0d479
3
- size 891580
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f93993d85e9706c681a3124199465b9fd0ecf892ec81f742aa1dd3f2a3509aa8
3
+ size 891227
models/tokenizer/arc_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/arc_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e552460c83ce8697c66bb59c12883cdc3ee45df742f4f4a31ed47f9e096b570f
3
- size 390840
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bf1e2a73cf8ac952ea13f9fc21d2cc6279d579673c3cc68f2fc93a338ee0a5c0
3
+ size 390696
models/tokenizer/arc_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/arc_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a0c7a49ea4d72d65cb41e84815025c1a3ac379f03d6c2738d077d2713987df61
3
- size 98987
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6f0bb804661e7eb94ae77e3d082bdaf86902542e5dff49b802dc403e642e6d93
3
+ size 99274
models/vocabulary/arc_vocabulary_metadata.json CHANGED
@@ -1,17 +1,17 @@
1
  {
2
  "language": "arc",
3
- "vocabulary_size": 6113,
4
  "variant": "full",
5
  "statistics": {
6
- "type_token_ratio": 0.28987946832669004,
7
  "coverage": {
8
- "top_100": 0.2557526480443379,
9
- "top_1000": 0.5486970192628353,
10
- "top_5000": 0.7718473583077925,
11
- "top_10000": 0.8689237903161773
12
  },
13
- "hapax_count": 12141,
14
- "hapax_ratio": 0.6651144954530513,
15
- "total_documents": 1593
16
  }
17
  }
 
1
  {
2
  "language": "arc",
3
+ "vocabulary_size": 6099,
4
  "variant": "full",
5
  "statistics": {
6
+ "type_token_ratio": 0.2900409450826071,
7
  "coverage": {
8
+ "top_100": 0.2565360778753166,
9
+ "top_1000": 0.5490624054041137,
10
+ "top_5000": 0.7721095480108975,
11
+ "top_10000": 0.869278442493667
12
  },
13
+ "hapax_count": 12106,
14
+ "hapax_ratio": 0.6649821477616039,
15
+ "total_documents": 1592
16
  }
17
  }
models/word_markov/arc_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:832f26dc2d4ff5a3ebd23bf31f4087617a431dd92bb988486e6d4cc21c427ab8
3
- size 561103
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5cbedb01a5447afd32845e25fc8a11d1d3d23bb575c77b402dc2373f23ba697b
3
+ size 559495
models/word_markov/arc_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "arc",
5
- "unique_contexts": 18018,
6
- "total_transitions": 61378
7
  }
 
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "arc",
5
+ "unique_contexts": 17962,
6
+ "total_transitions": 61175
7
  }
models/word_markov/arc_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:de53c3acb316d8128ebd4e1ec3ca41812fa0d445a955fc1d817d7c545461f871
3
- size 1081973
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:73c67b400e1d1d3b59ad64c463d7764d153a25bb16654c53e98cb0c5ff9eb85e
3
+ size 1076458
models/word_markov/arc_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "arc",
5
- "unique_contexts": 45749,
6
- "total_transitions": 59785
7
  }
 
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "arc",
5
+ "unique_contexts": 45549,
6
+ "total_transitions": 59583
7
  }
models/word_markov/arc_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8d239eaf545ca57c200cb2afeabb8ab0ceb2b6cc79f4dcd2f2e5f4aa5cc4e9c1
3
- size 1332931
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5e174c8083c10c295c6f9ad804650219ba42e49b0a3d2c0566dc2c29f402d62e
3
+ size 1328117
models/word_markov/arc_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "arc",
5
- "unique_contexts": 51822,
6
- "total_transitions": 58192
7
  }
 
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "arc",
5
+ "unique_contexts": 51591,
6
+ "total_transitions": 57991
7
  }