omarkamali committed on
Commit
706be59
·
verified ·
1 Parent(s): 93235e3

Upload all models and assets for atj (20251001)

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +305 -122
  2. models/embeddings/monolingual/atj_128d.bin +2 -2
  3. models/embeddings/monolingual/atj_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/atj_32d.bin +2 -2
  5. models/embeddings/monolingual/atj_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/atj_64d.bin +2 -2
  7. models/embeddings/monolingual/atj_64d_metadata.json +5 -3
  8. models/subword_markov/atj_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/atj_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/atj_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/atj_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/atj_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/atj_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/atj_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/atj_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/atj_2gram_subword.parquet +2 -2
  17. models/subword_ngram/atj_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/atj_3gram_subword.parquet +2 -2
  19. models/subword_ngram/atj_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/atj_4gram_subword.parquet +2 -2
  21. models/subword_ngram/atj_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/atj_tokenizer_16k.model +2 -2
  23. models/tokenizer/atj_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/atj_tokenizer_32k.model +2 -2
  25. models/tokenizer/atj_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/atj_tokenizer_8k.model +2 -2
  27. models/tokenizer/atj_tokenizer_8k.vocab +0 -0
  28. models/vocabulary/atj_vocabulary.parquet +2 -2
  29. models/vocabulary/atj_vocabulary_metadata.json +10 -9
  30. models/word_markov/atj_markov_ctx1_word.parquet +2 -2
  31. models/word_markov/atj_markov_ctx1_word_metadata.json +2 -2
  32. models/word_markov/atj_markov_ctx2_word.parquet +2 -2
  33. models/word_markov/atj_markov_ctx2_word_metadata.json +2 -2
  34. models/word_markov/atj_markov_ctx3_word.parquet +2 -2
  35. models/word_markov/atj_markov_ctx3_word_metadata.json +2 -2
  36. models/word_markov/atj_markov_ctx4_word.parquet +2 -2
  37. models/word_markov/atj_markov_ctx4_word_metadata.json +2 -2
  38. models/word_ngram/atj_2gram_word.parquet +2 -2
  39. models/word_ngram/atj_2gram_word_metadata.json +2 -2
  40. models/word_ngram/atj_3gram_word.parquet +2 -2
  41. models/word_ngram/atj_3gram_word_metadata.json +2 -2
  42. models/word_ngram/atj_4gram_word.parquet +2 -2
  43. models/word_ngram/atj_4gram_word_metadata.json +2 -2
  44. visualizations/embedding_isotropy.png +0 -0
  45. visualizations/embedding_norms.png +0 -0
  46. visualizations/embedding_similarity.png +2 -2
  47. visualizations/markov_branching.png +0 -0
  48. visualizations/markov_contexts.png +0 -0
  49. visualizations/markov_entropy.png +0 -0
  50. visualizations/model_sizes.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
- value: 5.808
27
  - name: best_isotropy
28
  type: isotropy
29
- value: 0.1478
30
  - name: vocabulary_size
31
  type: vocab
32
- value: 6720
33
- generated: 2025-12-27
34
  ---
35
 
36
  # ATJ - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
- - N-gram models (2, 3, 4-gram)
48
- - Markov chains (context of 1, 2, 3 and 4)
49
  - Subword N-gram and Markov chains
50
- - Embeddings in various sizes and dimensions
51
  - Language Vocabulary
52
  - Language Statistics
 
53
  ![Performance Dashboard](visualizations/performance_dashboard.png)
54
 
55
  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
59
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
60
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
61
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
62
- - [6. Summary & Recommendations](#6-summary--recommendations)
 
63
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
64
  - [Visualizations Index](#visualizations-index)
65
 
@@ -68,47 +70,53 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
68
 
69
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
70
 
71
  ### Results
72
 
73
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
74
  |------------|-------------|---------------|----------|--------------|
75
- | **8k** | 4.986x | 4.94 | 0.2011% | 100,930 |
76
- | **16k** | 5.358x | 5.31 | 0.2161% | 93,917 |
77
- | **32k** | 5.808x 🏆 | 5.76 | 0.2343% | 86,639 |
78
 
79
  ### Tokenization Examples
80
 
81
  Below are sample sentences tokenized with each vocabulary size:
82
 
83
- **Sample 1:** `Nipicapo ka pictewakamik e wapakaminikaniwok. Ciwakamin kaie micta wikicin mito...`
84
 
85
  | Vocab | Tokens | Count |
86
  |-------|--------|-------|
87
- | 8k | `▁nipicapo ▁kapicte wakamikewapa kami ni kaniwok . ... (+19 more)` | 29 |
88
- | 16k | `▁nipicapokapictewakamikewapa kami nikaniwok .ciwakamin ▁kaie ... (+14 more)` | 24 |
89
- | 32k | `▁nipicapokapictewakamikewapakami nikaniwok . ciwakaminkaiemicta ... (+12 more)` | 22 |
90
 
91
- **Sample 2:** `Fossambault-sur-le-Lac oteno Kepek askik ici actew, Kanata. Irikik e tacinaniwok...`
92
 
93
  | Vocab | Tokens | Count |
94
  |-------|--------|-------|
95
- | 8k | `▁fo ssa mb ault - sur - le - lac ... (+30 more)` | 40 |
96
- | 16k | `▁fo ssa mb ault - sur - le - lac ... (+30 more)` | 40 |
97
- | 32k | `▁fossambault - sur - le - lacoteno ▁kepek ▁askik ... (+27 more)` | 37 |
98
 
99
- **Sample 3:** `Saint-Basile-le-Grand oteno Kepek askik ici actew, Kanata. Irikik e tacinaniwok ...`
100
 
101
  | Vocab | Tokens | Count |
102
  |-------|--------|-------|
103
- | 8k | `▁saint - ba sile - le - grandotenokepek ... (+30 more)` | 40 |
104
- | 16k | `▁saint - basile - le - grandotenokepek ▁askik ... (+29 more)` | 39 |
105
- | 32k | `▁saint - basile - le - grand otenokepekaskik ... (+29 more)` | 39 |
106
 
107
 
108
  ### Key Findings
109
 
110
- - **Best Compression:** 32k achieves 5.808x compression
111
- - **Lowest UNK Rate:** 8k with 0.2011% unknown tokens
112
  - **Trade-off:** Larger vocabularies improve compression but increase model size
113
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
114
 
@@ -117,57 +125,89 @@ Below are sample sentences tokenized with each vocabulary size:
117
 
118
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
119
 
 
 
120
  ![N-gram Coverage](visualizations/ngram_coverage.png)
121
 
122
  ### Results
123
 
124
- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
125
- |--------|------------|---------|----------------|------------------|-------------------|
126
- | **2-gram** | 862 🏆 | 9.75 | 3,083 | 46.3% | 79.6% |
127
- | **2-gram** | 152 🏆 | 7.25 | 1,194 | 85.3% | 99.9% |
128
- | **3-gram** | 572 | 9.16 | 3,170 | 56.1% | 80.5% |
129
- | **3-gram** | 883 | 9.79 | 7,630 | 40.9% | 90.0% |
130
- | **4-gram** | 615 | 9.26 | 4,529 | 57.5% | 75.7% |
131
- | **4-gram** | 3,239 | 11.66 | 24,127 | 22.8% | 65.1% |
132
 
133
  ### Top 5 N-grams by Size
134
 
135
- **2-grams:**
136
 
137
  | Rank | N-gram | Count |
138
  |------|--------|-------|
139
- | 1 | `tipanictawin :` | 2,442 |
140
- | 2 | `: e` | 1,606 |
141
- | 3 | `e matcectakaniwok` | 1,602 |
142
- | 4 | `ici actew` | 889 |
143
- | 5 | `matcectakaniwok tipanictawin` | 812 |
144
 
145
- **3-grams:**
146
 
147
  | Rank | N-gram | Count |
148
  |------|--------|-------|
149
- | 1 | `tipanictawin : e` | 1,604 |
150
- | 2 | `: e matcectakaniwok` | 1,602 |
151
- | 3 | `matcectakaniwok tipanictawin :` | 812 |
152
- | 4 | `e matcectakaniwok tipanictawin` | 811 |
153
- | 5 | `ici actew ,` | 774 |
154
 
155
- **4-grams:**
156
 
157
  | Rank | N-gram | Count |
158
  |------|--------|-------|
159
- | 1 | `tipanictawin : e matcectakaniwok` | 1,602 |
160
- | 2 | `: e matcectakaniwok tipanictawin` | 811 |
161
- | 3 | `e matcectakaniwok tipanictawin :` | 811 |
162
- | 4 | `ici actew , kanata` | 759 |
163
- | 5 | `actew , kanata .` | 756 |
164
 
165
 
166
  ### Key Findings
167
 
168
- - **Best Perplexity:** 2-gram with 152
169
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
170
- - **Coverage:** Top-1000 patterns cover ~65% of corpus
171
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
172
 
173
  ---
@@ -175,55 +215,86 @@ Below are sample sentences tokenized with each vocabulary size:
175
 
176
  ![Markov Entropy](visualizations/markov_entropy.png)
177
 
 
 
178
  ![Markov Branching](visualizations/markov_branching.png)
179
 
180
  ### Results
181
 
182
- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
183
- |---------|-------------|------------|------------------|-----------------|----------------|
184
- | **1** | 0.5314 | 1.445 | 3.58 | 20,568 | 46.9% |
185
- | **1** | 1.7637 | 3.396 | 16.31 | 126 | 0.0% |
186
- | **2** | 0.2296 | 1.172 | 1.52 | 73,484 | 77.0% |
187
- | **2** | 1.2892 | 2.444 | 6.48 | 2,049 | 0.0% |
188
- | **3** | 0.0715 | 1.051 | 1.12 | 111,484 | 92.9% |
189
- | **3** | 0.9367 | 1.914 | 3.52 | 13,270 | 6.3% |
190
- | **4** | 0.0212 🏆 | 1.015 | 1.04 | 124,488 | 97.9% |
191
- | **4** | 0.5020 🏆 | 1.416 | 2.09 | 46,658 | 49.8% |
192
 
193
- ### Generated Text Samples
194
 
195
- Below are text samples generated from each Markov chain model:
196
 
197
  **Context Size 1:**
198
 
199
- 1. `. kaskina kekocic ocitakaniwon ickwatem aka kitci wectcikaniwok mitwi mitcet wapamew e matcectakaniw...`
200
- 2. `e totwakiniwitc , ekota awik e kicekawok " type " : e ici matisitcik tca mia`
201
- 3. `, e matcectakaniwok tipanictawin : sakihikana . irikik e ki masinahotiso kitci ki pekociw ni apitc`
202
 
203
  **Context Size 2:**
204
 
205
- 1. `tipanictawin : e matcectakaniwok ‎ tipanictawin : e matcectakaniwok tipanictawin : nehirowisi otenow...`
206
- 2. `: e matcectakaniwok tipanictawin : e matcectakaniwok tipanictawin : e matcectakaniwok tipanictawin :...`
207
- 3. `e matcectakaniwok tipanictawin : e matcectakaniwok ‎ taci we otciparik anihe itewin . nictam awacak ...`
208
 
209
  **Context Size 3:**
210
 
211
- 1. `tipanictawin : e matcectakaniwok tipanictawin : otenowa tipanictawin : e matcectakaniwok ‎ tipanicta...`
212
- 2. `: e matcectakaniwok ‎ tipanictawin : pirecicak grand héron`
213
- 3. `matcectakaniwok tipanictawin : otenowa tipanictawin : e matcectakaniwok tipanictawin : otenowa tipan...`
214
 
215
  **Context Size 4:**
216
 
217
- 1. `tipanictawin : e matcectakaniwok tipanictawin : otenowa tipanictawin : e matcectakaniwok ‎ tipanicta...`
218
- 2. `: e matcectakaniwok tipanictawin : otenowa tipanictawin : e matcectakaniwok tipanictawin : otenowa t...`
219
- 3. `e matcectakaniwok tipanictawin : otenowa tipanictawin : e matcectakaniwok ‎ tipanictawin : ka kisker...`
220
 
221
 
222
  ### Key Findings
223
 
224
- - **Best Predictability:** Context-4 with 97.9% predictability
225
  - **Branching Factor:** Decreases with context size (more deterministic)
226
- - **Memory Trade-off:** Larger contexts require more storage (46,658 contexts)
227
  - **Recommendation:** Context-3 or Context-4 for text generation
228
 
229
  ---
@@ -239,35 +310,35 @@ Below are text samples generated from each Markov chain model:
239
 
240
  | Metric | Value |
241
  |--------|-------|
242
- | Vocabulary Size | 6,720 |
243
- | Total Tokens | 113,194 |
244
- | Mean Frequency | 16.84 |
245
  | Median Frequency | 3 |
246
- | Frequency Std Dev | 144.86 |
247
 
248
  ### Most Common Words
249
 
250
  | Rank | Word | Frequency |
251
  |------|------|-----------|
252
- | 1 | e | 7,966 |
253
- | 2 | ka | 4,835 |
254
- | 3 | ki | 3,659 |
255
- | 4 | ici | 2,657 |
256
- | 5 | tipanictawin | 2,442 |
257
- | 6 | kitci | 1,874 |
258
- | 7 | kaie | 1,655 |
259
- | 8 | matcectakaniwok | 1,604 |
260
- | 9 | micta | 1,222 |
261
- | 10 | kirika | 1,112 |
262
 
263
  ### Least Common Words (from vocabulary)
264
 
265
  | Rank | Word | Frequency |
266
  |------|------|-----------|
267
- | 1 | nehirosi | 2 |
268
- | 2 | cikomewokw | 2 |
269
- | 3 | miitaw | 2 |
270
- | 4 | droits | 2 |
271
  | 5 | kiskinohamato | 2 |
272
  | 6 | banque | 2 |
273
  | 7 | mawotcicorianionik | 2 |
@@ -279,24 +350,24 @@ Below are text samples generated from each Markov chain model:
279
 
280
  | Metric | Value |
281
  |--------|-------|
282
- | Zipf Coefficient | 1.0564 |
283
- | R² (Goodness of Fit) | 0.988465 |
284
  | Adherence Quality | **excellent** |
285
 
286
  ### Coverage Analysis
287
 
288
  | Top N Words | Coverage |
289
  |-------------|----------|
290
- | Top 100 | 54.9% |
291
- | Top 1,000 | 81.7% |
292
- | Top 5,000 | 97.0% |
293
  | Top 10,000 | 0.0% |
294
 
295
  ### Key Findings
296
 
297
- - **Zipf Compliance:** R²=0.9885 indicates excellent adherence to Zipf's law
298
- - **High Frequency Dominance:** Top 100 words cover 54.9% of corpus
299
- - **Long Tail:** -3,280 words needed for remaining 100.0% coverage
300
 
301
  ---
302
  ## 5. Word Embeddings Evaluation
@@ -309,24 +380,133 @@ Below are text samples generated from each Markov chain model:
309
 
310
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
311
 
312
- ### Model Comparison
313
 
314
- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
315
- |-------|------------|-----------|----------|----------|----------|
316
- | **mono_32d** | 2,371 | 32 | 2.682 | 0.752 | 0.1478 🏆 |
317
- | **mono_64d** | 2,371 | 64 | 2.686 | 0.746 | 0.0305 |
318
- | **mono_128d** | 2,371 | 128 | 2.694 | 0.728 | 0.0055 |
319
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
320
 
321
  ### Key Findings
322
 
323
- - **Best Isotropy:** mono_32d with 0.1478 (more uniform distribution)
324
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
325
- - **Vocabulary Coverage:** All models cover 2,371 words
326
- - **Recommendation:** 100d for balanced semantic capture and efficiency
327
 
328
  ---
329
- ## 6. Summary & Recommendations
330
 
331
  ![Performance Dashboard](visualizations/performance_dashboard.png)
332
 
@@ -334,11 +514,12 @@ Below are text samples generated from each Markov chain model:
334
 
335
  | Component | Recommended | Rationale |
336
  |-----------|-------------|-----------|
337
- | Tokenizer | **32k BPE** | Best compression (5.81x) with low UNK rate |
338
- | N-gram | **5-gram** | Lowest perplexity (152) |
339
- | Markov | **Context-4** | Highest predictability (97.9%) |
340
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
341
 
 
342
  ---
343
  ## Appendix: Metrics Glossary & Interpretation Guide
344
 
@@ -528,7 +709,8 @@ If you use these models in your research, please cite:
528
  author = {Kamali, Omar},
529
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
530
  year = {2025},
531
- publisher = {HuggingFace},
 
532
  url = {https://huggingface.co/wikilangs}
533
  institution = {Omneity Labs}
534
  }
@@ -544,7 +726,8 @@ MIT License - Free for academic and commercial use.
544
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
545
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
546
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 
547
  ---
548
  *Generated by Wikilangs Models Pipeline*
549
 
550
- *Report Date: 2025-12-27 20:37:02*
 
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
+ value: 5.949
27
  - name: best_isotropy
28
  type: isotropy
29
+ value: 0.1619
30
  - name: vocabulary_size
31
  type: vocab
32
+ value: 0
33
+ generated: 2026-01-03
34
  ---
35
 
36
  # ATJ - Wikilangs Models
 
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
+ - N-gram models (2, 3, 4, 5-gram)
48
+ - Markov chains (context of 1, 2, 3, 4 and 5)
49
  - Subword N-gram and Markov chains
50
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
51
  - Language Vocabulary
52
  - Language Statistics
53
+
54
  ![Performance Dashboard](visualizations/performance_dashboard.png)
55
 
56
  ### Analysis and Evaluation
 
60
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
61
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
62
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
63
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
64
+ - [7. Summary & Recommendations](#7-summary--recommendations)
65
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
66
  - [Visualizations Index](#visualizations-index)
67
 
 
70
 
71
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
72
 
73
+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
74
+
75
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
76
+
77
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
78
+
79
  ### Results
80
 
81
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
82
  |------------|-------------|---------------|----------|--------------|
83
+ | **8k** | 5.115x | 5.13 | 0.1890% | 92,078 |
84
+ | **16k** | 5.507x | 5.52 | 0.2035% | 85,522 |
85
+ | **32k** | 5.949x 🏆 | 5.96 | 0.2198% | 79,160 |
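The table does not define "Compression". One reading that fits the numbers is text units (characters or bytes) per token. The sketch below, which assumes that definition, multiplies each row's compression ratio by its token count and checks that all three vocabulary sizes imply roughly the same amount of evaluated text. The names `ROWS` and `implied_text_size` are illustrative, not part of the pipeline.

```python
# Rows copied from the results table: (vocab label, compression ratio, total tokens).
ROWS = [
    ("8k", 5.115, 92_078),
    ("16k", 5.507, 85_522),
    ("32k", 5.949, 79_160),
]


def implied_text_size(compression: float, total_tokens: int) -> float:
    """Assumed definition: compression = text units / tokens."""
    return compression * total_tokens


sizes = {label: implied_text_size(c, n) for label, c, n in ROWS}
for label, size in sizes.items():
    print(f"{label}: ~{size:,.0f} text units")

# Relative gap between the largest and smallest implied text size.
spread = (max(sizes.values()) - min(sizes.values())) / min(sizes.values())
print(f"relative spread: {spread:.5f}")
```

If the assumption holds, every tokenizer was evaluated on the same text of roughly 471k units, and the ratios differ only because larger vocabularies emit fewer tokens.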
86
 
87
  ### Tokenization Examples
88
 
89
  Below are sample sentences tokenized with each vocabulary size:
90
 
91
+ **Sample 1:** `Thetford Mines oteno Kepek askik ici actew, Kanata. Irikik e tacinaniwok 25 649 ...`
92
 
93
  | Vocab | Tokens | Count |
94
  |-------|--------|-------|
95
+ | 8k | `▁the t ford mi ne s otenokepek ▁askik ▁ici ... (+15 more)` | 25 |
96
+ | 16k | `▁thetfordmi nes otenokepekaskik ▁ici ▁actew ,kanata ... (+12 more)` | 22 |
97
+ | 32k | `▁thetfordminesotenokepekaskikiciactew , kanata . ... (+11 more)` | 21 |
98
 
99
+ **Sample 2:** `Ka Oskiskakamaksource CNA - Atikamekw Kinokewin, sakihikan Kepek askik ici actew...`
100
 
101
  | Vocab | Tokens | Count |
102
  |-------|--------|-------|
103
+ | 8k | `▁ka ▁oski skakamak source ▁cna ▁- ▁atikamekw ▁kino kewin , ... (+9 more)` | 19 |
104
+ | 16k | `▁ka ▁oski skakamak source ▁cna ▁- ▁atikamekw ▁kinokewin , ▁sakihikan ... (+8 more)` | 18 |
105
+ | 32k | `▁ka ▁oskiskakamak source ▁cna ▁- ▁atikamekw ▁kinokewin , sakihikan ▁kepek ... (+7 more)` | 17 |
106
 
107
+ **Sample 3:** `Stellarton oteno Nouvelle-Écosse aski ici actew, Kanata. Irikik e tacinaniwok 4 ...`
108
 
109
  | Vocab | Tokens | Count |
110
  |-------|--------|-------|
111
+ | 8k | `▁ste lla r ton ▁oteno ▁nouvelle - écosseaskiici ... (+14 more)` | 24 |
112
+ | 16k | `▁ste lla r ton ▁oteno ▁nouvelle - écosseaskiici ... (+14 more)` | 24 |
113
+ | 32k | `▁stellarton ▁oteno ▁nouvelle - écosse ▁askiiciactew , kanata ... (+11 more)` | 21 |
114
 
115
 
116
  ### Key Findings
117
 
118
+ - **Best Compression:** 32k achieves 5.949x compression
119
+ - **Lowest UNK Rate:** 8k with 0.1890% unknown tokens
120
  - **Trade-off:** Larger vocabularies improve compression but increase model size
121
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
122
 
 
125
 
126
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
127
 
128
+ ![N-gram Unique](visualizations/ngram_unique.png)
129
+
130
  ![N-gram Coverage](visualizations/ngram_coverage.png)
131
 
132
  ### Results
133
 
134
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
135
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
136
+ | **2-gram** | Word | 756 | 9.56 | 2,026 | 44.6% | 84.1% |
137
+ | **2-gram** | Subword | 129 🏆 | 7.01 | 992 | 88.9% | 100.0% |
138
+ | **3-gram** | Word | 540 | 9.08 | 1,856 | 50.0% | 84.6% |
139
+ | **3-gram** | Subword | 761 | 9.57 | 5,493 | 41.8% | 92.5% |
140
+ | **4-gram** | Word | 578 | 9.18 | 2,537 | 50.5% | 75.6% |
141
+ | **4-gram** | Subword | 3,042 | 11.57 | 19,249 | 21.7% | 65.9% |
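The Perplexity and Entropy columns appear to be linked by the usual identity, perplexity = 2^entropy with entropy in bits. The sketch below checks that assumption against the rows above; `ROWS` is just the table copied into Python.

```python
# (n-gram, variant, reported perplexity, reported entropy in bits), copied from the table.
ROWS = [
    ("2-gram", "Word", 756, 9.56),
    ("2-gram", "Subword", 129, 7.01),
    ("3-gram", "Word", 540, 9.08),
    ("3-gram", "Subword", 761, 9.57),
    ("4-gram", "Word", 578, 9.18),
    ("4-gram", "Subword", 3042, 11.57),
]

relative_errors = []
for name, variant, perplexity, entropy in ROWS:
    recomputed = 2 ** entropy  # assumed relation: perplexity = 2 ** entropy
    rel_err = abs(recomputed - perplexity) / perplexity
    relative_errors.append(rel_err)
    print(f"{name:7s} {variant:8s} reported={perplexity:>5} recomputed={recomputed:8.1f}")

max_rel_err = max(relative_errors)
```

The small remaining differences are what rounding the entropy to two decimals would produce.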
142
 
143
  ### Top 5 N-grams by Size
144
 
145
+ **2-grams (Word):**
146
+
147
+ | Rank | N-gram | Count |
148
+ |------|--------|-------|
149
+ | 1 | `ici actew` | 889 |
150
+ | 2 | `actew kanata` | 771 |
151
+ | 3 | `manawan wemotaci` | 722 |
152
+ | 4 | `e ici` | 686 |
153
+ | 5 | `irikik e` | 672 |
154
+
155
+ **3-grams (Word):**
156
+
157
+ | Rank | N-gram | Count |
158
+ |------|--------|-------|
159
+ | 1 | `ici actew kanata` | 770 |
160
+ | 2 | `irikik e tacinaniwok` | 633 |
161
+ | 3 | `kanata irikik e` | 620 |
162
+ | 4 | `actew kanata irikik` | 620 |
163
+ | 5 | `askik ici actew` | 500 |
164
+
165
+ **4-grams (Word):**
166
+
167
+ | Rank | N-gram | Count |
168
+ |------|--------|-------|
169
+ | 1 | `kanata irikik e tacinaniwok` | 620 |
170
+ | 2 | `actew kanata irikik e` | 620 |
171
+ | 3 | `ici actew kanata irikik` | 620 |
172
+ | 4 | `askik ici actew kanata` | 490 |
173
+ | 5 | `kepek askik ici actew` | 457 |
174
+
175
+ **2-grams (Subword):**
176
 
177
  | Rank | N-gram | Count |
178
  |------|--------|-------|
179
+ | 1 | `c i` | 23,693 |
180
+ | 2 | `k a` | 23,558 |
181
+ | 3 | `_ k` | 23,282 |
182
+ | 4 | `t c` | 23,205 |
183
+ | 5 | `i k` | 21,042 |
184
 
185
+ **3-grams (Subword):**
186
 
187
  | Rank | N-gram | Count |
188
  |------|--------|-------|
189
+ | 1 | `t c i` | 11,312 |
190
+ | 2 | `_ k i` | 10,112 |
191
+ | 3 | `i t c` | 10,006 |
192
+ | 4 | `_ k a` | 9,178 |
193
+ | 5 | `c i _` | 8,654 |
194
 
195
+ **4-grams (Subword):**
196
 
197
  | Rank | N-gram | Count |
198
  |------|--------|-------|
199
+ | 1 | `i t c i` | 5,889 |
200
+ | 2 | `a n i w` | 5,154 |
201
+ | 3 | `_ k a _` | 4,777 |
202
+ | 4 | `n i w o` | 4,370 |
203
+ | 5 | `k a n i` | 4,232 |
204
 
205
 
206
  ### Key Findings
207
 
208
+ - **Best Perplexity:** 2-gram (subword) with 129
209
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
210
+ - **Coverage:** Top-1000 patterns cover ~66% of corpus
211
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
212
 
213
  ---
 
215
 
216
  ![Markov Entropy](visualizations/markov_entropy.png)
217
 
218
+ ![Markov Contexts](visualizations/markov_contexts.png)
219
+
220
  ![Markov Branching](visualizations/markov_branching.png)
221
 
222
  ### Results
223
 
224
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
225
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
226
+ | **1** | Word | 0.5831 | 1.498 | 3.56 | 19,290 | 41.7% |
227
+ | **1** | Subword | 1.5451 | 2.918 | 13.90 | 118 | 0.0% |
228
+ | **2** | Word | 0.1879 | 1.139 | 1.41 | 67,726 | 81.2% |
229
+ | **2** | Subword | 1.2667 | 2.406 | 6.32 | 1,639 | 0.0% |
230
+ | **3** | Word | 0.0530 | 1.037 | 1.09 | 93,891 | 94.7% |
231
+ | **3** | Subword | 0.7981 | 1.739 | 3.30 | 10,345 | 20.2% |
232
+ | **4** | Word | 0.0145 🏆 | 1.010 | 1.02 | 100,093 | 98.6% |
233
+ | **4** | Subword | 0.5495 | 1.464 | 2.26 | 34,073 | 45.1% |
234
+
235
+ ### Generated Text Samples (Word-based)
236
+
237
+ Below are text samples generated from each word-based Markov chain model:
238
 
239
+ **Context Size 1:**
240
+
241
+ 1. `e takonikatek anihe wikasi aka nipoane aka ewi tipapasotc tawatcikaniw apitcitaw iskwew kata nespito...`
242
+ 2. `ka icinikasotc ki nti matce kisinarik micta kackitatc e ici sikinikatek rasop e wamowsotc kaskina wi...`
243
+ 3. `ki micta kackitatc nictam ka ickwa aiketcik mitcetowaw nikamohinik tekera weckatc nehirowisikw ni ap...`
244
+
245
+ **Context Size 2:**
246
+
247
+ 1. `ici actew kanata irikik e tacinaniwok matcectakaniwok`
248
+ 2. `actew kanata matcectakaniwok opitciwan matcectakaniwok matcectakaniwok`
249
+ 3. `manawan wemotaci nabesipi sipi ekote e otci katcitcipitak niheriw kitci aititosketc mitowi ka takoki...`
250
+
251
+ **Context Size 3:**
252
+
253
+ 1. `ici actew kanata irikik e tacinaniwok 5 037 matcectakaniwok`
254
+ 2. `kanata irikik e tacinaniwok 7 200 oteno ote itekera ka icitiperitakok comté portneuf rareak micta si...`
255
+ 3. `actew kanata irikik e tacinaniwok 71 419 matcectakaniwok`
256
+
257
+ **Context Size 4:**
258
 
259
+ 1. `ici actew kanata irikik e tacinaniwok 403 390 matcectakaniwok`
260
+ 2. `actew kanata irikik e tacinaniwok 3 930 matcectakaniwok`
261
+ 3. `kanata irikik e tacinaniwok 552`
262
+
263
+
264
+ ### Generated Text Samples (Subword-based)
265
+
266
+ Below are text samples generated from each subword-based Markov chain model:
267
 
268
  **Context Size 1:**
269
 
270
+ 1. `i_w_acikaraska_p`
271
+ 2. `_thitcicisan_ta_`
272
+ 3. `ak_kitciwik_ours`
273
 
274
  **Context Size 2:**
275
 
276
+ 1. `cira._ok_takanetc`
277
+ 2. `katcik._ka_ictert`
278
+ 3. `_koki_sinikaniwan`
279
 
280
  **Context Size 3:**
281
 
282
+ 1. `tcikamooseph_du_qu`
283
+ 2. `_kirikanikaniwok._`
284
+ 3. `itc_iskakwasotc._e`
285
 
286
  **Context Size 4:**
287
 
288
+ 1. `itciwan_nehiriwa_on`
289
+ 2. `aniwiw_ka_taci_matc`
290
+ 3. `_ka_iti_ici_nictahi`
291
 
292
 
293
  ### Key Findings
294
 
295
+ - **Best Predictability:** Context-4 (word) with 98.6% predictability
296
  - **Branching Factor:** Decreases with context size (more deterministic)
297
+ - **Memory Trade-off:** Larger contexts require more storage (34,073 contexts)
298
  - **Recommendation:** Context-3 or Context-4 for text generation
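As a rough illustration of the findings above, here is a minimal word-level Markov chain with a configurable context size. It is a sketch, not the pipeline's implementation: the toy corpus reuses phrases from the samples, and "branching factor" is assumed to mean the average number of distinct successors per context. The names `build_chain`, `branching_factor`, and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def build_chain(tokens, context_size):
    """Map each context (tuple of `context_size` tokens) to a Counter of next tokens."""
    chain = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        chain[context][tokens[i + context_size]] += 1
    return chain


def branching_factor(chain):
    """Assumed definition: mean number of distinct successors per context."""
    return sum(len(nexts) for nexts in chain.values()) / len(chain)


def generate(chain, seed, length, rng):
    """Extend `seed` by sampling successors in proportion to their counts."""
    out = list(seed)
    k = len(seed)
    for _ in range(length):
        nexts = chain.get(tuple(out[-k:]))
        if not nexts:
            break  # unseen context: stop early
        words, weights = zip(*nexts.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out


# Toy corpus built from phrases that recur in the samples above.
corpus = (
    "ici actew kanata irikik e tacinaniwok . "
    "askik ici actew kanata irikik e tacinaniwok . "
    "kepek askik ici actew kanata ."
).split()

chain1 = build_chain(corpus, 1)
chain2 = build_chain(corpus, 2)
print(branching_factor(chain1), branching_factor(chain2))
print(" ".join(generate(chain2, ("ici", "actew"), 5, random.Random(0))))
```

Even on this tiny corpus the context-2 chain has more contexts and a lower branching factor than the context-1 chain, which is the same storage-versus-determinism trade-off the table reports at full scale.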
299
 
300
  ---
 
310
 
311
  | Metric | Value |
312
  |--------|-------|
313
+ | Vocabulary Size | 6,479 |
314
+ | Total Tokens | 105,209 |
315
+ | Mean Frequency | 16.24 |
316
  | Median Frequency | 3 |
317
+ | Frequency Std Dev | 131.12 |
318
 
319
  ### Most Common Words
320
 
321
  | Rank | Word | Frequency |
322
  |------|------|-----------|
323
+ | 1 | e | 6,370 |
324
+ | 2 | ka | 4,817 |
325
+ | 3 | ki | 3,656 |
326
+ | 4 | ici | 2,654 |
327
+ | 5 | kitci | 1,874 |
328
+ | 6 | kaie | 1,655 |
329
+ | 7 | matcectakaniwok | 1,604 |
330
+ | 8 | micta | 1,222 |
331
+ | 9 | kirika | 1,111 |
332
+ | 10 | manawan | 973 |
333
 
334
  ### Least Common Words (from vocabulary)
335
 
336
  | Rank | Word | Frequency |
337
  |------|------|-----------|
338
+ | 1 | cikomewokw | 2 |
339
+ | 2 | miitaw | 2 |
340
+ | 3 | droits | 2 |
341
+ | 4 | ntokiw | 2 |
342
  | 5 | kiskinohamato | 2 |
343
  | 6 | banque | 2 |
344
  | 7 | mawotcicorianionik | 2 |
 
350
 
351
  | Metric | Value |
352
  |--------|-------|
353
+ | Zipf Coefficient | 1.0501 |
354
+ | R² (Goodness of Fit) | 0.987715 |
355
  | Adherence Quality | **excellent** |
356
 
357
  ### Coverage Analysis
358
 
359
  | Top N Words | Coverage |
360
  |-------------|----------|
361
+ | Top 100 | 54.5% |
362
+ | Top 1,000 | 81.8% |
363
+ | Top 5,000 | 97.2% |
364
  | Top 10,000 | 0.0% |
365
 
366
  ### Key Findings
367
 
368
+ - **Zipf Compliance:** R²=0.9877 indicates excellent adherence to Zipf's law
369
+ - **High Frequency Dominance:** Top 100 words cover 54.5% of corpus
370
+ - **Long Tail:** -3,521 words needed for remaining 100.0% coverage
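Two figures in this section can be reproduced from the statistics table with simple arithmetic. The sketch below recomputes the mean frequency and shows one plausible origin of the negative long-tail count: the vocabulary (6,479 words) is smaller than the largest coverage tier (10,000). That would also explain the 0.0% entry for that tier. This is an inference from the numbers, not a confirmed description of the pipeline.

```python
# Values copied from the new-side vocabulary statistics table.
VOCAB_SIZE = 6_479
TOTAL_TOKENS = 105_209
LARGEST_TIER = 10_000  # the "Top 10,000" row of the coverage table

# Mean frequency is total tokens divided by distinct words.
mean_frequency = TOTAL_TOKENS / VOCAB_SIZE
print(f"mean frequency: {mean_frequency:.2f}")

# Subtracting the tier size from the vocabulary size reproduces the reported long-tail figure.
long_tail = VOCAB_SIZE - LARGEST_TIER
print(f"long tail: {long_tail:,}")
```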
371
 
372
  ---
373
  ## 5. Word Embeddings Evaluation
 
380
 
381
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
382
 
 
383
 
384
+ ### 5.1 Cross-Lingual Alignment
385
+
386
+ > *Note: Multilingual alignment visualization not available for this language.*
387
+
388
+
389
+ ### 5.2 Model Comparison
390
+
391
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
392
+ |-------|-----------|----------|------------------|---------------|----------------|
393
+ | **mono_32d** | 32 | 0.1619 🏆 | 0.4893 | N/A | N/A |
394
+ | **mono_64d** | 64 | 0.0330 | 0.5001 | N/A | N/A |
395
+ | **mono_128d** | 128 | 0.0058 | 0.5012 | N/A | N/A |
396
 
397
  ### Key Findings
398
 
399
+ - **Best Isotropy:** mono_32d with 0.1619 (more uniform distribution)
400
+ - **Semantic Density:** Average pairwise similarity of 0.4969. Lower values indicate better semantic separation.
401
+ - **Alignment Quality:** No aligned models evaluated in this run.
402
+ - **Recommendation:** 128d aligned for best cross-lingual performance
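The 0.4969 quoted under Semantic Density matches the plain average of the three per-model values in the comparison table. The short check below assumes that is how the figure was derived.

```python
from statistics import mean

# Semantic Density column from the model comparison table.
semantic_density = {
    "mono_32d": 0.4893,
    "mono_64d": 0.5001,
    "mono_128d": 0.5012,
}

average_density = mean(semantic_density.values())
print(f"average semantic density: {average_density:.4f}")
```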
403
 
404
  ---
405
+ ## 6. Morphological Analysis (Experimental)
406
+
407
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
408
+
409
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
410
+
411
+ ### 6.1 Productivity & Complexity
412
+
413
+ | Metric | Value | Interpretation | Recommendation |
414
+ |--------|-------|----------------|----------------|
415
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
416
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
417
+
418
+ ### 6.2 Affix Inventory (Productive Units)
419
+
420
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
421
+
422
+ #### Productive Prefixes
423
+ | Prefix | Examples |
424
+ |--------|----------|
425
+ | `ki-` | kictapatisiw, kiwew, kiskinohomakewiniw |
426
+ | `mi-` | mikomin, mitataw, mirokotek |
427
+ | `ma-` | mamowinikatek, masinatew, manikaniwon |
428
+ | `ni-` | ninan, nikikw, niheritamokw |
429
+ | `ot-` | otcepikiwitc, otactimikok, otcotcoma |
430
+ | `ta-` | tanetci, tatiniw, tatow |
431
+ | `ic-` | icakopan, icinikatakiniw, icinikatakaniwitcik |
432
+
433
+ #### Productive Suffixes
434
+ | Suffix | Examples |
435
+ |--------|----------|
436
+ | `-k` | nosinetakaniwonik, rarewak, mamowinikatek |
437
+ | `-w` | kictapatisiw, masinatew, kiwew |
438
+ | `-c` | awotatokwetc, otcepikiwitc, taciketc |
439
+ | `-n` | potatcikan, mikomin, icakopan |
440
+ | `-ik` | nosinetakaniwonik, pakacik, icinikatakaniwitcik |
441
+ | `-tc` | awotatokwetc, otcepikiwitc, taciketc |
442
+ | `-iw` | kictapatisiw, kiskinohomakewiniw, icinikatakiniw |
443
+ | `-ok` | petakok, otactimikok, wapitamok |
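The affix criterion described above (stripping the unit must leave a stem that occurs elsewhere) can be illustrated with a much-simplified sketch. The report does not specify its sampling or scoring, so everything below is an assumption: `suffix_candidates` is a hypothetical helper that counts, for each short ending, how many stems it shares with at least one other ending. The toy vocabulary is taken from the example words in section 6.3.

```python
from collections import defaultdict


def suffix_candidates(vocab, max_len=2, min_stem=3):
    """Count endings whose stem is also seen bare or with another ending."""
    stems = defaultdict(set)  # stem -> endings observed ("" means the stem is itself a word)
    for word in set(vocab):
        stems[word].add("")
        for n in range(1, max_len + 1):
            if len(word) - n >= min_stem:
                stems[word[:-n]].add(word[-n:])

    counts = defaultdict(int)
    for endings in stems.values():
        if len(endings) >= 2:  # stem is substitutable: it takes more than one ending
            for ending in endings:
                if ending:
                    counts[ending] += 1
    return dict(counts)


# Toy vocabulary from the bound-stem examples in this report.
toy_vocab = ["takon", "takok", "takoki", "mitca", "mitci", "mitcin"]
counts = suffix_candidates(toy_vocab)
for ending, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"-{ending}: {n} stem(s)")
```

On real data a frequency threshold and a larger sample would be needed. With six words the counts only show the mechanics.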
444
+
+ ### 6.3 Bound Stems (Lexical Roots)
+
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
+
+ | Stem | Cohesion | Substitutability | Examples |
+ |------|----------|------------------|----------|
+ | `taka` | 1.44x | 22 contexts | otakai, pataka, matakaw |
+ | `tako` | 1.32x | 29 contexts | takon, takok, takoki |
+ | `apit` | 1.50x | 17 contexts | apitc, apita, tapit |
+ | `atis` | 1.49x | 17 contexts | matisiw, matisin, batiste |
+ | `mitc` | 1.33x | 22 contexts | mitca, mitci, mitcin |
+ | `aniw` | 1.38x | 19 contexts | aniwe, nikaniw, oskaniw |
+ | `iwok` | 1.43x | 16 contexts | apiwok, aipiwok, irniwok |
+ | `erit` | 1.48x | 14 contexts | iteritam, iteritak, oreritam |
+ | `niwo` | 1.51x | 13 contexts | irniwok, koniwok, kaniwok |
+ | `tcik` | 1.30x | 19 contexts | tatcik, mitcik, motcik |
+ | `irow` | 1.56x | 11 contexts | kirowe, wirowow, kewirow |
+ | `kate` | 1.33x | 16 contexts | katek, makate, kateri |
+
+ ### 6.4 Affix Compatibility (Co-occurrence)
+
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
+
+ | Prefix | Suffix | Frequency | Examples |
+ |--------|--------|-----------|----------|
+ | `-ki` | `-k` | 127 words | kiskeritakik, kiskeritakositcik |
+ | `-ma` | `-k` | 89 words | masinapiskikatek, mackikirinikwecik |
+ | `-mi` | `-k` | 88 words | mitcenik, mickaniwok |
+ | `-ki` | `-w` | 68 words | kiskinohamakewiniw, kictaw |
+ | `-mi` | `-w` | 65 words | miromakosiw, mitcetwaw |
+ | `-ni` | `-k` | 61 words | nitwakik, nisitowinikatek |
+ | `-ot` | `-k` | 57 words | otek, otcirowek |
+ | `-ki` | `-ik` | 56 words | kiskeritakik, kiskeritakositcik |
+ | `-ki` | `-c` | 51 words | kicterimitisotc, kiciwahikotc |
+ | `-ta` | `-k` | 49 words | tacikewok, takociparitcik |
+
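A minimal sketch of how a prefix/suffix co-occurrence count like the one in section 6.4 could be computed. It assumes plain string matching with `startswith`/`endswith` over a word list; the function name `affix_cooccurrence` and the toy word list (taken from the example column) are illustrative only, and the report does not say how the pipeline actually derives its counts.

```python
from collections import Counter
from itertools import product


def affix_cooccurrence(words, prefixes, suffixes):
    """Count words that start with a given prefix and end with a given suffix.

    Hypothetical helper: simple string matching, not the pipeline's method.
    """
    counts = Counter()
    for word in words:
        for pre, suf in product(prefixes, suffixes):
            # Require at least one stem character between the two affixes.
            has_stem = len(word) > len(pre) + len(suf)
            if has_stem and word.startswith(pre) and word.endswith(suf):
                counts[(pre, suf)] += 1
    return counts


# Toy input copied from the example column (not the full vocabulary).
words = [
    "kiskeritakik", "kiskeritakositcik", "masinapiskikatek",
    "mitcenik", "kictaw", "otek",
]
counts = affix_cooccurrence(
    words,
    prefixes=["ki", "ma", "mi", "ot"],
    suffixes=["k", "ik", "w"],
)
for (pre, suf), n in counts.most_common():
    print(f"{pre}- ... -{suf}: {n}")
```

Note that overlapping suffixes such as `-k` and `-ik` are both counted for the same word under this scheme, which is consistent with `kiskeritakik` appearing as an example for both pairs in the table.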
+ ### 6.5 Recursive Morpheme Segmentation
+
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
+
+ | Word | Suggested Split | Confidence | Stem |
+ |------|-----------------|------------|------|
+ | kitotakaniw | **`ki-totak-an-iw`** | 7.5 | `totak` |
+ | pimatisinaniwok | **`pimatisin-an-iw-ok`** | 7.5 | `pimatisin` |
+ | icipekahikaniwok | **`ic-ipekah-ik-an-iw-ok`** | 7.5 | `ipekah` |
+ | masinahikaniwok | **`ma-sinah-ik-an-iw-ok`** | 7.5 | `sinah` |
+ | icitatcik | **`ic-itat-cik`** | 6.0 | `itat` |
+ | masinahikanik | **`ma-sinah-ik-an-ik`** | 6.0 | `sinah` |
+ | mipariwakaniwok | **`mi-pariwak-an-iw-ok`** | 6.0 | `pariwak` |
+ | osawapisikaniwok | **`osawapis-ik-an-iw-ok`** | 6.0 | `osawapis` |
+ | matcehonaniwok | **`ma-tcehon-an-iw-ok`** | 6.0 | `tcehon` |
+ | nipimatisiwinik | **`ni-pimatisiwin-ik`** | 6.0 | `pimatisiwin` |
+ | misinhikaniw | **`mi-sinh-ik-an-iw`** | 6.0 | `sinh` |
+ | nikamonaniwok | **`ni-kamon-an-iw-ok`** | 6.0 | `kamon` |
+ | metowaniwok | **`metow-an-iw-ok`** | 4.5 | `metow` |
+ | miremakanik | **`mi-remak-an-ik`** | 4.5 | `remak` |
+ | acamakaniwok | **`acamak-an-iw-ok`** | 4.5 | `acamak` |
+
+ ### 6.6 Linguistic Interpretation
+
+ > **Automated Insight:**
+ ATJ appears to be more isolating, or to have a highly fixed vocabulary. Word-level models perform nearly as well as subword models, which suggests fewer productive morphological processes.
+
+ ---
+ ## 7. Summary & Recommendations

  ![Performance Dashboard](visualizations/performance_dashboard.png)

  | Component | Recommended | Rationale |
  |-----------|-------------|-----------|
+ | Tokenizer | **32k BPE** | Best compression (5.95x) |
+ | N-gram | **2-gram** | Lowest perplexity (129) |
+ | Markov | **Context-4** | Highest predictability (98.6%) |
  | Embeddings | **100d** | Balanced semantic capture and isotropy |

+
  ---
  ## Appendix: Metrics Glossary & Interpretation Guide
 
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
+ doi = {10.5281/zenodo.18073153},
+ publisher = {Zenodo},
  url = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
  }
 
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
  ---
  *Generated by Wikilangs Models Pipeline*

+ *Report Date: 2026-01-03 05:18:59*
models/embeddings/monolingual/atj_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5e4f652757f81386724dd7cf5c8d22549280e2e24aabab3b92abac08935d3c4a
3
- size 1026469326
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7f3d6139699a8a77dd41505f198f1088f6ea783407b1398a6369141ef13d2adc
3
+ size 1026354933
models/embeddings/monolingual/atj_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 128,
  "version": "monolingual",
  "training_params": {
- "dim": 128,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 128
  },
- "vocab_size": 2371
+ "vocab_size": 2261
  }
models/embeddings/monolingual/atj_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1aac0c6be5de24ce9a05654a95b0945cccec0cac9fee51d6b0dfbb3df8672f74
3
- size 256648398
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:733947e40bfc2a29297e78ab7590c70db193bca524959bb47af3e2ea45087684
3
+ size 256618485
models/embeddings/monolingual/atj_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 32,
  "version": "monolingual",
  "training_params": {
- "dim": 32,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 32
  },
- "vocab_size": 2371
+ "vocab_size": 2261
  }
models/embeddings/monolingual/atj_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2e026ba4da09c003162df384f6aa6973c82d536f19dc68f82368126723c690cb
3
- size 513255374
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cf1bb93977201f5841f00da8ac2e6f2915a651bf8892a51f91d3c84233f3fae3
3
+ size 513197301
models/embeddings/monolingual/atj_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 64,
  "version": "monolingual",
  "training_params": {
- "dim": 64,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 64
  },
- "vocab_size": 2371
+ "vocab_size": 2261
  }
models/subword_markov/atj_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d39bea41083fabbd4be030111bb4376a8345ee2c24cec3e7b066979621d62313
3
- size 19976
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:01ab1d155ad7b2173943a42ef233b27a6456b68458231c9472df823263be9ec6
3
+ size 16979
models/subword_markov/atj_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "atj",
5
- "unique_contexts": 126,
6
- "total_transitions": 971476
7
  }
 
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "atj",
5
+ "unique_contexts": 118,
6
+ "total_transitions": 875958
7
  }
models/subword_markov/atj_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f0fe5cfd0b592f5cd30abbcc82868c6ee2c08671541e1e9c0d7aae11b4877a0a
3
- size 101456
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:01c646d4de1b50ab2af2e42b22f63c9283f46a99722629ba2187c5b17039d034
3
+ size 80338
models/subword_markov/atj_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "atj",
5
- "unique_contexts": 2049,
6
- "total_transitions": 969388
7
  }
 
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "atj",
5
+ "unique_contexts": 1639,
6
+ "total_transitions": 873879
7
  }
models/subword_markov/atj_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:283a408f8f1bcb986dfc6a2779a3fb12448ab6bbe5183f07e442fbfe885e28a5
3
- size 337589
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d29a79e07eb2572a881d989c62b2b464a2e86ce95f9e1ca1cfc17ebb6f7fdda4
3
+ size 260964
models/subword_markov/atj_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "atj",
5
- "unique_contexts": 13270,
6
- "total_transitions": 967300
7
  }
 
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "atj",
5
+ "unique_contexts": 10345,
6
+ "total_transitions": 871800
7
  }
models/subword_markov/atj_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7d18d91899c8a29a661cfc050414a089b1a63c9c87ca71a786329c19df6b48ad
3
- size 812220
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c92961d642ccb93882ab7a777fc55a418e457e8f9843b90ea4d3869eb4115b68
3
+ size 635430
models/subword_markov/atj_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "atj",
5
- "unique_contexts": 46658,
6
- "total_transitions": 965212
7
  }
 
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "atj",
5
+ "unique_contexts": 34073,
6
+ "total_transitions": 869721
7
  }
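The four updated subword Markov metadata files follow a regular pattern: each additional context token lowers `total_transitions` by 2079, which equals `total_documents` in the updated vocabulary metadata. That is what one would expect if transitions never cross document boundaries, but the report does not state this, so treat it as an inferred reading. The short check below uses only the numbers from this diff; the helper name `per_step_drop` is illustrative.

```python
# New subword Markov totals, keyed by context size (values copied from the diff).
new_transitions = {1: 875958, 2: 873879, 3: 871800, 4: 869721}

# total_documents from the updated vocabulary metadata in this commit.
total_documents = 2079


def per_step_drop(transitions):
    """Return the decrease in total_transitions between consecutive context sizes."""
    sizes = sorted(transitions)
    return [transitions[a] - transitions[b] for a, b in zip(sizes, sizes[1:])]


drops = per_step_drop(new_transitions)
print(drops)
print(all(d == total_documents for d in drops))
```

The same step of 2079 appears in the updated word-level Markov totals (116082, 114003, 111924, 109845).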
models/subword_ngram/atj_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:511b2678c1e7285952fdcf58c2a8d8a86667a817f0871985f9327da706d62c1f
3
- size 16760
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:83d8aa6cff41c2e4e7567422c922f50fad121f6d85606b56af394b2714de92de
3
+ size 14344
models/subword_ngram/atj_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "atj",
5
- "unique_ngrams": 1194,
6
- "total_ngrams": 971476
7
  }
 
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "atj",
5
+ "unique_ngrams": 992,
6
+ "total_ngrams": 875958
7
  }
models/subword_ngram/atj_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b0ec7189eab62ae8af249edf91953dbc77bf9ffb24f1864289cbbd10d584e18f
3
- size 88729
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8cde20675b31ba779a55241e29b9edb6286c6f51bb3fa511c63b49f3d14f3e10
3
+ size 65963
models/subword_ngram/atj_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "atj",
5
- "unique_ngrams": 7630,
6
- "total_ngrams": 969388
7
  }
 
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "atj",
5
+ "unique_ngrams": 5493,
6
+ "total_ngrams": 873879
7
  }
models/subword_ngram/atj_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6e75807d9f320a6f542d1776448ca86b0607bc7ad4c3c997ae27386964cafe9f
3
- size 288361
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c7765e9eede8cde44f7885c15dc01158f7b11d31566539ee029e4555794cf9cf
3
+ size 231317
models/subword_ngram/atj_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "atj",
5
- "unique_ngrams": 24127,
6
- "total_ngrams": 967300
7
  }
 
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "atj",
5
+ "unique_ngrams": 19249,
6
+ "total_ngrams": 871800
7
  }
models/tokenizer/atj_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a063ccee59eb9db3942d7290d6ff30567b41f568b01820e485d279d8849d9849
3
- size 533850
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:087da7688c1626c425cee991b2afa0071074d6ba52750f654d69b2ab05b54160
3
+ size 536207
models/tokenizer/atj_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/atj_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e5f7bdd1eb535e2ec8d48cea6bf379c07f11dfd7923a25bdeb412bf5e8de7a49
3
- size 855496
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e8fa5e1ab7cf748ada69603f315ae6c44dbb95d4e07cabda9709bfbe2f875a70
3
+ size 846451
models/tokenizer/atj_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/atj_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a9511d57125eca644a7e92c031fb618e671127112c3f8ae4fb8d86bba50f9f88
3
- size 381099
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:678b99dd207da5596a12f7a9f3f36ad762fad42d71893cebac1f4c67048caa88
3
+ size 384081
models/tokenizer/atj_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/atj_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:bd2aba50829aac40db6f44873c3c728b3e7032be8f025cb9a1f670041e355de9
3
- size 110501
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:00f58d6e6175ddabe7c28e84d3c39c5e5ba1832a75f401ec087f627ab36c7ca5
3
+ size 106314
models/vocabulary/atj_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
  {
  "language": "atj",
- "vocabulary_size": 6720,
+ "vocabulary_size": 6479,
+ "variant": "full",
  "statistics": {
- "type_token_ratio": 0.16166825450565717,
+ "type_token_ratio": 0.16444512148678497,
  "coverage": {
- "top_100": 0.48908327887439274,
- "top_1000": 0.7285031533694993,
- "top_5000": 0.8641570937034967,
- "top_10000": 0.9170675632051777
+ "top_100": 0.485278560607984,
+ "top_1000": 0.7279643875728878,
+ "top_5000": 0.865353204526028,
+ "top_10000": 0.9201851710801364
  },
- "hapax_count": 13813,
- "hapax_ratio": 0.6727219597720743,
- "total_documents": 2088
+ "hapax_count": 12952,
+ "hapax_ratio": 0.6665637383562348,
+ "total_documents": 2079
  }
  }
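The vocabulary statistics are internally consistent under one reading: `hapax_ratio` matches `hapax_count / (vocabulary_size + hapax_count)`, which would mean hapaxes are counted outside `vocabulary_size`. This relation is inferred from the numbers in the diff, not from documented pipeline behaviour; the helper name `inferred_hapax_ratio` is illustrative.

```python
import math

# Values copied from the old and new vocabulary metadata in this diff.
old_stats = {"vocabulary_size": 6720, "hapax_count": 13813,
             "hapax_ratio": 0.6727219597720743}
new_stats = {"vocabulary_size": 6479, "hapax_count": 12952,
             "hapax_ratio": 0.6665637383562348}


def inferred_hapax_ratio(stats):
    """Assumed formula: hapaxes divided by (kept vocabulary + hapaxes)."""
    total_types = stats["vocabulary_size"] + stats["hapax_count"]
    return stats["hapax_count"] / total_types


for label, stats in (("old", old_stats), ("new", new_stats)):
    ratio = inferred_hapax_ratio(stats)
    matches = math.isclose(ratio, stats["hapax_ratio"], rel_tol=1e-9)
    print(label, round(ratio, 6), matches)
```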
models/word_markov/atj_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5cc0671b8feee03a545436e64c2433b0fae5929afd839c95895a1dd7a786b31b
3
- size 647870
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bdf082d2a0a4824649c5b861ab9afabf286d0dbb86ad23a70e172a5b2eb5897a
3
+ size 602424
models/word_markov/atj_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "atj",
5
- "unique_contexts": 20568,
6
- "total_transitions": 155273
7
  }
 
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "atj",
5
+ "unique_contexts": 19290,
6
+ "total_transitions": 116082
7
  }
models/word_markov/atj_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1bad041177a4fab91ca325954d98e00ec826c5f9a895eabdb741027fcd2741c5
3
- size 1343290
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e76203e3dfef52c4916a31791ddc3e203330c5ef4fa528a0653be84bb8bcec4e
3
+ size 1248971
models/word_markov/atj_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "atj",
5
- "unique_contexts": 73484,
6
- "total_transitions": 153185
7
  }
 
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "atj",
5
+ "unique_contexts": 67726,
6
+ "total_transitions": 114003
7
  }
models/word_markov/atj_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ebc592e1cd2fb89bf6091b62444d792f176ba63a922be974dd636b06b6f98d1c
3
- size 1854587
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8e80b9624c1459ecf42b9b53bfb0b754b70c4448cfbdda5f761c274441f7abec
3
+ size 1620476
models/word_markov/atj_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "atj",
5
- "unique_contexts": 111484,
6
- "total_transitions": 151097
7
  }
 
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "atj",
5
+ "unique_contexts": 93891,
6
+ "total_transitions": 111924
7
  }
models/word_markov/atj_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0985586ccd19bf6a0dc2637f1333d7183f5038b6938b444a5176ef7d509f514e
3
- size 2130768
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ef0092fba285ff8cbcd27b44e31bb50e02ea170185e6d35262b62e980fc6bdc2
3
+ size 1821627
models/word_markov/atj_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "atj",
5
- "unique_contexts": 124488,
6
- "total_transitions": 149010
7
  }
 
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "atj",
5
+ "unique_contexts": 100093,
6
+ "total_transitions": 109845
7
  }
models/word_ngram/atj_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:62105b25627ffc7e1d5c69b732181827784ed359ea429a44beec674d369493e8
3
- size 41316
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea561dca24844eb9229ee5eab7bfb36cf51aa25a215c4e73586a94bb426e4237
3
+ size 28681
models/word_ngram/atj_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "word",
4
  "language": "atj",
5
- "unique_ngrams": 3083,
6
- "total_ngrams": 155273
7
  }
 
2
  "n": 2,
3
  "variant": "word",
4
  "language": "atj",
5
+ "unique_ngrams": 2026,
6
+ "total_ngrams": 116082
7
  }
models/word_ngram/atj_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:386ec64bf8b3c1e5e4d542440e374d9e21afc5bea7cec276940db9523d92dd10
3
- size 48660
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9d2ef5961a8f1df2b0cfe0ae1086e241f6274d85e63a7d821468ecde1848f6bc
3
+ size 30298
models/word_ngram/atj_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "word",
4
  "language": "atj",
5
- "unique_ngrams": 3170,
6
- "total_ngrams": 153185
7
  }
 
2
  "n": 3,
3
  "variant": "word",
4
  "language": "atj",
5
+ "unique_ngrams": 1856,
6
+ "total_ngrams": 114003
7
  }
models/word_ngram/atj_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:47ed6531e2d3bc1d45b63a5cd4c8ef732bb0478a628f2ad02029971a6484ccfb
3
- size 76192
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:14c469ed9b37804dc6410be9e384c9f2fbfe4f07a4d0a79de09028386726d42e
3
+ size 46381
models/word_ngram/atj_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "word",
4
  "language": "atj",
5
- "unique_ngrams": 4529,
6
- "total_ngrams": 151097
7
  }
 
2
  "n": 4,
3
  "variant": "word",
4
  "language": "atj",
5
+ "unique_ngrams": 2537,
6
+ "total_ngrams": 111924
7
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED