omarkamali committed
Commit c982f3a · verified · 1 Parent(s): 2e1be1d

Upload all models and assets for bi (20251001)

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.

Files changed (50)
  1. .gitattributes +1 -0
  2. README.md +273 -126
  3. models/embeddings/monolingual/bi_128d.bin +2 -2
  4. models/embeddings/monolingual/bi_128d_metadata.json +5 -3
  5. models/embeddings/monolingual/bi_32d.bin +2 -2
  6. models/embeddings/monolingual/bi_32d_metadata.json +5 -3
  7. models/embeddings/monolingual/bi_64d.bin +2 -2
  8. models/embeddings/monolingual/bi_64d_metadata.json +5 -3
  9. models/subword_markov/bi_markov_ctx1_subword.parquet +2 -2
  10. models/subword_markov/bi_markov_ctx1_subword_metadata.json +2 -2
  11. models/subword_markov/bi_markov_ctx2_subword.parquet +2 -2
  12. models/subword_markov/bi_markov_ctx2_subword_metadata.json +2 -2
  13. models/subword_markov/bi_markov_ctx3_subword.parquet +2 -2
  14. models/subword_markov/bi_markov_ctx3_subword_metadata.json +2 -2
  15. models/subword_markov/bi_markov_ctx4_subword.parquet +2 -2
  16. models/subword_markov/bi_markov_ctx4_subword_metadata.json +2 -2
  17. models/subword_ngram/bi_2gram_subword.parquet +2 -2
  18. models/subword_ngram/bi_2gram_subword_metadata.json +2 -2
  19. models/subword_ngram/bi_3gram_subword.parquet +2 -2
  20. models/subword_ngram/bi_3gram_subword_metadata.json +2 -2
  21. models/subword_ngram/bi_4gram_subword.parquet +2 -2
  22. models/subword_ngram/bi_4gram_subword_metadata.json +2 -2
  23. models/tokenizer/bi_tokenizer_16k.model +2 -2
  24. models/tokenizer/bi_tokenizer_16k.vocab +0 -0
  25. models/tokenizer/bi_tokenizer_8k.model +2 -2
  26. models/tokenizer/bi_tokenizer_8k.vocab +0 -0
  27. models/vocabulary/bi_vocabulary.parquet +2 -2
  28. models/vocabulary/bi_vocabulary_metadata.json +9 -8
  29. models/word_markov/bi_markov_ctx1_word.parquet +2 -2
  30. models/word_markov/bi_markov_ctx1_word_metadata.json +2 -2
  31. models/word_markov/bi_markov_ctx2_word.parquet +2 -2
  32. models/word_markov/bi_markov_ctx2_word_metadata.json +2 -2
  33. models/word_markov/bi_markov_ctx3_word.parquet +2 -2
  34. models/word_markov/bi_markov_ctx3_word_metadata.json +2 -2
  35. models/word_markov/bi_markov_ctx4_word.parquet +2 -2
  36. models/word_markov/bi_markov_ctx4_word_metadata.json +2 -2
  37. models/word_ngram/bi_2gram_word.parquet +2 -2
  38. models/word_ngram/bi_2gram_word_metadata.json +2 -2
  39. models/word_ngram/bi_3gram_word.parquet +2 -2
  40. models/word_ngram/bi_3gram_word_metadata.json +2 -2
  41. models/word_ngram/bi_4gram_word.parquet +2 -2
  42. models/word_ngram/bi_4gram_word_metadata.json +2 -2
  43. visualizations/embedding_isotropy.png +0 -0
  44. visualizations/embedding_norms.png +0 -0
  45. visualizations/embedding_similarity.png +2 -2
  46. visualizations/markov_branching.png +0 -0
  47. visualizations/markov_contexts.png +0 -0
  48. visualizations/markov_entropy.png +0 -0
  49. visualizations/model_sizes.png +0 -0
  50. visualizations/ngram_coverage.png +0 -0
.gitattributes CHANGED
@@ -39,3 +39,4 @@ visualizations/position_encoding_comparison.png filter=lfs diff=lfs merge=lfs -text
 visualizations/tsne_sentences.png filter=lfs diff=lfs merge=lfs -text
 visualizations/tsne_words.png filter=lfs diff=lfs merge=lfs -text
 visualizations/zipf_law.png filter=lfs diff=lfs merge=lfs -text
+visualizations/ngram_coverage.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
   metrics:
   - name: best_compression_ratio
     type: compression
-    value: 4.017
+    value: 4.443
   - name: best_isotropy
     type: isotropy
-    value: 0.0541
+    value: 0.0388
   - name: vocabulary_size
     type: vocab
-    value: 3655
-  generated: 2025-12-28
+    value: 0
+  generated: 2026-01-03
 ---
 
 # BI - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ### Models & Assets
 
 - Tokenizers (8k, 16k, 32k, 64k)
-- N-gram models (2, 3, 4-gram)
-- Markov chains (context of 1, 2, 3 and 4)
+- N-gram models (2, 3, 4, 5-gram)
+- Markov chains (context of 1, 2, 3, 4 and 5)
 - Subword N-gram and Markov chains
-- Embeddings in various sizes and dimensions
+- Embeddings in various sizes and dimensions (aligned and unaligned)
 - Language Vocabulary
 - Language Statistics
+
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 
 ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
-- [6. Summary & Recommendations](#6-summary--recommendations)
+- [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+- [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
 
@@ -68,46 +70,49 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
 
+![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+
+![Tokenizer OOV](visualizations/tokenizer_oov.png)
+
+![Total Tokens](visualizations/tokenizer_total_tokens.png)
+
 ### Results
 
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
-| **8k** | 3.698x | 3.65 | 0.1622% | 57,343 |
-| **16k** | 4.017x 🏆 | 3.96 | 0.1762% | 52,790 |
+| **8k** | 4.032x | 4.05 | 0.1444% | 47,092 |
+| **16k** | 4.443x 🏆 | 4.47 | 0.1591% | 42,734 |
 
 ### Tokenization Examples
 
 Below are sample sentences tokenized with each vocabulary size:
 
-**Sample 1:** `Minsk, hem i bigtaon long senta blong Belarus, mo hemi kapitol blong kaontri ia....`
+**Sample 1:** `Copenhagen (toktok Denmak: København), hem i kapitol blong Denmak. Long yia popu...`
 
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁minsk , hem ▁ibigtaonlongsentablongbelarus , ... (+10 more)` | 20 |
-| 16k | `▁minsk , hem ▁ibigtaonlongsentablongbelarus , ... (+10 more)` | 20 |
+| 8k | `▁copenhagen( toktokdenmak : københavn ), hemikapitol ... (+20 more)` | 30 |
+| 16k | `▁copenhagen( toktokdenmak : københavn ), hemikapitol ... (+20 more)` | 30 |
 
-**Sample 2:** `+UetersenRosenstadt Uetersen 125px 125px 300px
-Uetersen i stap smol taon blong...`
+**Sample 2:** `Emily Elizabeth Dickinson (10 Desemba – 15 May em i bin wan poet blong Amerika. ...`
 
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁+ ue ter sen ros ens tad t uetersen ... (+41 more)` | 51 |
-| 16k | `▁+ uetersen rosenstadt uetersen ▁ 1 2 5 px ▁ ... (+36 more)` | 46 |
+| 8k | `▁em il y ▁elizabeth ▁dick ins on( 1 0 ... (+19 more)` | 29 |
+| 16k | `▁emily ▁elizabethdickinson( 1 0 ▁desemba ▁–1 ... (+15 more)` | 25 |
 
-**Sample 3:** `Prayut Chan-o-cha (boen 1954) i praem minista blong Thailand.
-
-Category:Praem mi...`
+**Sample 3:** `Narafala kaen blong spot long Vanuatu i stap pleiplei tru long kaontri long yumi...`
 
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁pra y utchan - o - cha( boen ... (+18 more)` | 28 |
-| 16k | `▁prayutchan - o - cha( boen1 ... (+16 more)` | 26 |
+| 8k | `▁narafala ▁kaen ▁blongspot ▁long ▁vanuatu ▁i ▁stappleiplei ▁tru ... (+7 more)` | 17 |
+| 16k | `▁narafalakaen ▁blong ▁spot ▁long ▁vanuatui ▁stappleiplei ▁tru ... (+7 more)` | 17 |
 
 ### Key Findings
 
-- **Best Compression:** 16k achieves 4.017x compression
-- **Lowest UNK Rate:** 8k with 0.1622% unknown tokens
+- **Best Compression:** 16k achieves 4.443x compression
+- **Lowest UNK Rate:** 8k with 0.1444% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
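
The tokenizer files touched by this commit (`models/tokenizer/bi_tokenizer_8k.model`, `bi_tokenizer_16k.model`) appear to be standard SentencePiece models (a `.model`/`.vocab` pair and `▁` word-boundary markers), so the compression figures above can be sanity-checked locally. A minimal sketch, assuming `sentencepiece` is installed and the model has been downloaded from this repo; reading "compression" as characters per token is our assumption, since the report does not spell out its formula:

```python
# Sanity-check the compression numbers with the shipped 16k model.
# Assumes: pip install sentencepiece, and the .model file downloaded locally.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="models/tokenizer/bi_tokenizer_16k.model")

text = "Copenhagen (toktok Denmak: København), hem i kapitol blong Denmak."
pieces = sp.encode(text, out_type=str)
print(pieces[:10], f"... ({len(pieces)} tokens)")

# Compression read as characters per token (assumed definition).
print(f"~{len(text) / len(pieces):.2f} chars/token")
```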
@@ -116,57 +121,89 @@ Category:Praem mi...`
 
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
 
+![N-gram Unique](visualizations/ngram_unique.png)
+
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 
 ### Results
 
-| N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
-|--------|------------|---------|----------------|------------------|-------------------|
-| **2-gram** | 483 🏆 | 8.92 | 1,634 | 54.9% | 91.1% |
-| **2-gram** | 264 🏆 | 8.05 | 1,233 | 68.8% | 99.6% |
-| **3-gram** | 712 | 9.48 | 2,346 | 48.6% | 84.2% |
-| **3-gram** | 1,434 | 10.49 | 7,760 | 37.2% | 75.8% |
-| **4-gram** | 1,319 | 10.36 | 4,093 | 39.6% | 72.5% |
-| **4-gram** | 3,949 | 11.95 | 23,770 | 28.7% | 57.4% |
+| N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+|--------|---------|------------|---------|----------------|------------------|-------------------|
+| **2-gram** | Word | 362 | 8.50 | 1,049 | 58.8% | 98.9% |
+| **2-gram** | Subword | 209 🏆 | 7.71 | 983 | 73.7% | 100.0% |
+| **3-gram** | Word | 496 | 8.95 | 1,408 | 53.1% | 92.0% |
+| **3-gram** | Subword | 1,182 | 10.21 | 5,848 | 38.2% | 79.4% |
+| **4-gram** | Word | 887 | 9.79 | 2,457 | 43.9% | 77.4% |
+| **4-gram** | Subword | 3,532 | 11.79 | 19,225 | 28.5% | 58.2% |
 
 ### Top 5 N-grams by Size
 
-**2-grams:**
+**2-grams (Word):**
 
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `category :` | 2,068 |
-| 2 | `. category` | 1,177 |
-| 3 | `hem i` | 759 |
-| 4 | `stet blong` | 711 |
-| 5 | `yunaeted stet` | 643 |
+| 1 | `hem i` | 738 |
+| 2 | `stet blong` | 729 |
+| 3 | `em i` | 617 |
+| 4 | `blong amerika` | 598 |
+| 5 | `blong yunaeted` | 535 |
 
-**3-grams:**
+**3-grams (Word):**
 
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `. category :` | 1,177 |
-| 2 | `yunaeted stet blong` | 587 |
-| 3 | `stet blong amerika` | 585 |
-| 4 | `blong yunaeted stet` | 481 |
-| 5 | `category : pipol` | 468 |
+| 1 | `stet blong amerika` | 583 |
+| 2 | `yunaeted stet blong` | 479 |
+| 3 | `blong yunaeted stet` | 479 |
+| 4 | `blong singsing blong` | 292 |
+| 5 | `blong hem i` | 259 |
 
-**4-grams:**
+**4-grams (Word):**
 
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `yunaeted stet blong amerika` | 585 |
-| 2 | `blong yunaeted stet blong` | 472 |
-| 3 | `category : pipol blong` | 413 |
-| 4 | `. category : ol` | 274 |
-| 5 | `stet blong amerika .` | 229 |
+| 1 | `yunaeted stet blong amerika` | 477 |
+| 2 | `blong yunaeted stet blong` | 470 |
+| 3 | `akta blong yunaeted stet` | 210 |
+| 4 | `woman blong singsing blong` | 182 |
+| 5 | `blong singsing blong japan` | 150 |
+
+**2-grams (Subword):**
+
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `o n` | 9,093 |
+| 2 | `n g` | 8,780 |
+| 3 | `l o` | 8,027 |
+| 4 | `g _` | 7,936 |
+| 5 | `_ b` | 7,059 |
+
+**3-grams (Subword):**
+
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `n g _` | 7,795 |
+| 2 | `o n g` | 7,296 |
+| 3 | `l o n` | 7,257 |
+| 4 | `_ b l` | 5,277 |
+| 5 | `b l o` | 5,252 |
+
+**4-grams (Subword):**
+
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `o n g _` | 7,200 |
+| 2 | `l o n g` | 7,191 |
+| 3 | `_ b l o` | 5,238 |
+| 4 | `b l o n` | 5,015 |
+| 5 | `_ l o n` | 2,153 |
 
 ### Key Findings
 
-- **Best Perplexity:** 2-gram with 264
+- **Best Perplexity:** 2-gram (subword) with 209
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
-- **Coverage:** Top-1000 patterns cover ~57% of corpus
+- **Coverage:** Top-1000 patterns cover ~58% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 
 ---
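
The Top-100/Top-1000 coverage columns above are plain cumulative-frequency statistics, so they can be recomputed from the n-gram parquet files in this commit. A sketch assuming the files expose `ngram` and `count` columns; the actual schema is not documented here, so inspect `df.columns` first:

```python
# Recompute top-k coverage for the word 2-gram model.
# Assumed schema: columns `ngram` and `count` (verify with df.columns).
import pandas as pd

df = pd.read_parquet("models/word_ngram/bi_2gram_word.parquet")
counts = df["count"].sort_values(ascending=False)

total = counts.sum()
for k in (100, 1000):
    print(f"top-{k} coverage: {counts.head(k).sum() / total:.1%}")
```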
@@ -174,55 +211,86 @@ Category:Praem mi...`
 
 ![Markov Entropy](visualizations/markov_entropy.png)
 
+![Markov Contexts](visualizations/markov_contexts.png)
+
 ![Markov Branching](visualizations/markov_branching.png)
 
 ### Results
 
-| Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
-|---------|-------------|------------|------------------|-----------------|----------------|
-| **1** | 0.5518 | 1.466 | 3.32 | 9,727 | 44.8% |
-| **1** | 1.0793 | 2.113 | 7.70 | 381 | 0.0% |
-| **2** | 0.2268 | 1.170 | 1.46 | 32,184 | 77.3% |
-| **2** | 1.0810 | 2.115 | 5.52 | 2,929 | 0.0% |
-| **3** | 0.0853 | 1.061 | 1.15 | 46,928 | 91.5% |
-| **3** | 0.7971 | 1.738 | 3.07 | 16,165 | 20.3% |
-| **4** | 0.0386 🏆 | 1.027 | 1.07 | 53,803 | 96.1% |
-| **4** | 0.4284 🏆 | 1.346 | 1.80 | 49,589 | 57.2% |
+| Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+|---------|---------|-------------|------------|------------------|-----------------|----------------|
+| **1** | Word | 0.5840 | 1.499 | 3.04 | 8,338 | 41.6% |
+| **1** | Subword | 0.9602 | 1.946 | 6.50 | 364 | 4.0% |
+| **2** | Word | 0.1997 | 1.148 | 1.41 | 24,957 | 80.0% |
+| **2** | Subword | 0.9911 | 1.988 | 5.10 | 2,361 | 0.9% |
+| **3** | Word | 0.0755 | 1.054 | 1.13 | 34,724 | 92.4% |
+| **3** | Subword | 0.7964 | 1.737 | 3.17 | 12,016 | 20.4% |
+| **4** | Word | 0.0328 🏆 | 1.023 | 1.06 | 38,736 | 96.7% |
+| **4** | Subword | 0.4627 | 1.378 | 1.90 | 38,018 | 53.7% |
 
-### Generated Text Samples
+### Generated Text Samples (Word-based)
 
-Below are text samples generated from each Markov chain model:
+Below are text samples generated from each word-based Markov chain model:
 
 **Context Size 1:**
 
-1. `blong itali . category : politikis blong yunaeted stet blong yunaeted stet blong afrika category :`
-2. `. - mackie - 19 novemba 1962 long polan . plante samting ikam inside long not`
-3. `i stap wetem pas . king blong polanbol . ph / ebchecked / cette / /`
+1. `blong olgeta mo yu ol disaepol blong dover long wol plante fasin blong court i wan`
+2. `i bin wan strongfala win if you s 84 913 km2 populaesen blong stet blong hem`
+3. `long saed blong tekem carbondioxde mo wanwan aelan gaua o aoba hem i wokem long milly`
 
 **Context Size 2:**
 
-1. `category : yunaeted stet blong amerika . category : ol krietiv daerekta long tv show yo gabba`
-2. `. category : politikis blong franis , spain ) bibliothèque nationale de france le bulletin de la`
-3. `hem i wan kaontri long pasifik mo save mekem i gat seven koninens ( nasa , 2022`
+1. `hem i stap insaet long solwota everi man i save sindaon o silip long hem islam relijon`
+2. `stet blong philippines blong stet blong amerika blong stet blong amerika blong stet blong amerika mo...`
+3. `em i bin ded 8 septemba em i woman blong singsing blong japan man blong singsing blong`
 
 **Context Size 3:**
 
-1. `. category : pipol blong jemani category : politikis blong franis category : saentis category : pipo...`
-2. `yunaeted stet blong amerika . category : praem minista blong japan . category : kaontri category : s...`
-3. `blong yunaeted stet blong amerika . category : spen`
+1. `blong yunaeted stet blong amerika model akta blong pornografi blong ajentina em i stap popiula from ...`
+2. `yunaeted stet blong amerika akta blong yunaeted stet blong amerika blong yunaeted stet blong amerika...`
+3. `blong singsing blong japan thumb anna iriyama man blong singsing blong kanada man blong singsing blo...`
 
 **Context Size 4:**
 
-1. `blong yunaeted stet blong amerika . category : hed blong stet category : politikis blong taelan`
-2. `category : pipol blong jemani category : politikis`
-3. `yunaeted stet blong amerika category : ol woman blong singsing category : pipol blong yunaeted kingd...`
+1. `blong yunaeted stet blong amerika blong stet blong yunaeted stet blong amerika blong yunaeted stet b...`
+2. `yunaeted stet blong amerika blong stet blong yunaeted stet blong amerika blong yunaeted stet blong a...`
+3. `akta blong yunaeted stet blong amerika akta blong yunaeted stet blong amerika blong stet blong yunae...`
+
+### Generated Text Samples (Subword-based)
+
+Below are text samples generated from each subword-based Markov chain model:
+
+**Context Size 1:**
+
+1. `_dimo_ste_lon_i_`
+2. `a_blong_bl_19_s_`
+3. `ngstang_yulolem:`
+
+**Context Size 2:**
+
+1. `ong_300px_12_3_44`
+2. `ng_st_boetexanblo`
+3. `long_prol_no,_рос`
+
+**Context Size 3:**
+
+1. `ng_amerika._akta_b`
+2. `ong_savela_taeland`
+3. `long_amerika_maura`
+
+**Context Size 4:**
+
+1. `ong_amerika._praem_`
+2. `long_not_prize_nigh`
+3. `_blong_21_man_blong`
 
 ### Key Findings
 
-- **Best Predictability:** Context-4 with 96.1% predictability
+- **Best Predictability:** Context-4 (word) with 96.7% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
-- **Memory Trade-off:** Larger contexts require more storage (49,589 contexts)
+- **Memory Trade-off:** Larger contexts require more storage (38,018 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 
 ---
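
Samples like the ones above can be reproduced from the transition tables shipped under `models/word_markov/`. A minimal weighted-sampling sketch; the column names `context`, `next_token`, and `count`, and contexts being keyed as space-joined token strings, are assumptions about the parquet schema rather than documented facts, so inspect the file before relying on them:

```python
# Generate text from the context-2 word Markov table by weighted sampling.
# Assumed schema: columns `context`, `next_token`, `count` -- verify first.
import random
import pandas as pd

df = pd.read_parquet("models/word_markov/bi_markov_ctx2_word.parquet")

def step(context: str):
    rows = df[df["context"] == context]
    if rows.empty:                      # unseen context: stop generating
        return None
    return random.choices(rows["next_token"].tolist(),
                          weights=rows["count"].tolist())[0]

tokens = ["stet", "blong"]              # seed must match the context size (2)
for _ in range(15):
    nxt = step(" ".join(tokens[-2:]))
    if nxt is None:
        break
    tokens.append(nxt)
print(" ".join(tokens))
```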
@@ -238,64 +306,64 @@ Below are text samples generated from each Markov chain model:
 
 | Metric | Value |
 |--------|-------|
-| Vocabulary Size | 3,655 |
-| Total Tokens | 57,331 |
-| Mean Frequency | 15.69 |
+| Vocabulary Size | 3,117 |
+| Total Tokens | 48,872 |
+| Mean Frequency | 15.68 |
 | Median Frequency | 3 |
-| Frequency Std Dev | 124.23 |
+| Frequency Std Dev | 124.49 |
 
 ### Most Common Words
 
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | blong | 5,072 |
-| 2 | i | 3,328 |
-| 3 | long | 2,237 |
-| 4 | category | 2,068 |
-| 5 | ol | 1,320 |
-| 6 | mo | 1,059 |
-| 7 | hem | 1,034 |
-| 8 | wan | 894 |
-| 9 | stet | 842 |
-| 10 | yunaeted | 714 |
+| 1 | blong | 5,014 |
+| 2 | i | 3,182 |
+| 3 | long | 2,146 |
+| 4 | mo | 1,031 |
+| 5 | hem | 1,008 |
+| 6 | ol | 886 |
+| 7 | wan | 875 |
+| 8 | stet | 840 |
+| 9 | amerika | 673 |
+| 10 | em | 660 |
 
 ### Least Common Words (from vocabulary)
 
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | sftp | 2 |
-| 2 | operating | 2 |
-| 3 | guide | 2 |
-| 4 | spesifikesen | 2 |
-| 5 | firewall | 2 |
-| 6 | 2428 | 2 |
-| 7 | sapot | 2 |
-| 8 | lesin | 2 |
-| 9 | sanem | 2 |
-| 10 | extended | 2 |
+| 1 | lotta | 2 |
+| 2 | continua | 2 |
+| 3 | ekshumesen | 2 |
+| 4 | suspension | 2 |
+| 5 | fulwan | 2 |
+| 6 | konfirm | 2 |
+| 7 | trial | 2 |
+| 8 | window | 2 |
+| 9 | piazza | 2 |
+| 10 | fontana | 2 |
 
 ### Zipf's Law Analysis
 
 | Metric | Value |
 |--------|-------|
-| Zipf Coefficient | 1.0336 |
-| R² (Goodness of Fit) | 0.989882 |
+| Zipf Coefficient | 1.0400 |
+| R² (Goodness of Fit) | 0.989215 |
 | Adherence Quality | **excellent** |
 
 ### Coverage Analysis
 
 | Top N Words | Coverage |
 |-------------|----------|
-| Top 100 | 60.4% |
-| Top 1,000 | 86.7% |
+| Top 100 | 62.1% |
+| Top 1,000 | 88.5% |
 | Top 5,000 | 0.0% |
 | Top 10,000 | 0.0% |
 
 ### Key Findings
 
-- **Zipf Compliance:** R²=0.9899 indicates excellent adherence to Zipf's law
-- **High Frequency Dominance:** Top 100 words cover 60.4% of corpus
-- **Long Tail:** -6,345 words needed for remaining 100.0% coverage
+- **Zipf Compliance:** R²=0.9892 indicates excellent adherence to Zipf's law
+- **High Frequency Dominance:** Top 100 words cover 62.1% of corpus
+- **Long Tail:** -6,883 words needed for remaining 100.0% coverage
 
 ---
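
The Zipf coefficient and R² in the vocabulary analysis above come from an ordinary least-squares fit in log-log space (Zipf's law predicts log f ≈ c - s·log r for rank r). They can be re-fit from the vocabulary parquet in this commit; the `frequency` column name is an assumption, so check the schema first:

```python
# Re-fit the Zipf coefficient from the vocabulary parquet.
# Assumed schema: a `frequency` column (verify with df.columns).
import numpy as np
import pandas as pd

df = pd.read_parquet("models/vocabulary/bi_vocabulary.parquet")
freq = np.sort(df["frequency"].to_numpy())[::-1].astype(float)
rank = np.arange(1, len(freq) + 1)

# Fit log(freq) = intercept + slope * log(rank); the Zipf coefficient is -slope.
slope, intercept = np.polyfit(np.log(rank), np.log(freq), 1)
pred = slope * np.log(rank) + intercept
resid = np.log(freq) - pred
r2 = 1 - (resid**2).sum() / ((np.log(freq) - np.log(freq).mean())**2).sum()
print(f"zipf coefficient ≈ {-slope:.4f}, R² ≈ {r2:.4f}")
```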
 ## 5. Word Embeddings Evaluation
@@ -308,24 +376,100 @@ Below are text samples generated from each Markov chain model:
 
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
 
-### Model Comparison
+### 5.1 Cross-Lingual Alignment
+
+> *Note: Multilingual alignment visualization not available for this language.*
+
+### 5.2 Model Comparison
 
-| Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
-|-------|------------|-----------|----------|----------|----------|
-| **mono_32d** | 1,195 | 32 | 2.350 | 0.505 | 0.0541 🏆 |
-| **mono_64d** | 1,195 | 64 | 2.278 | 0.491 | 0.0110 |
-| **mono_128d** | 1,195 | 128 | 2.279 | 0.484 | 0.0021 |
-| **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
+| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+|-------|-----------|----------|------------------|---------------|----------------|
+| **mono_32d** | 32 | 0.0388 🏆 | 0.6777 | N/A | N/A |
+| **mono_64d** | 64 | 0.0097 | 0.6676 | N/A | N/A |
+| **mono_128d** | 128 | 0.0021 | 0.6720 | N/A | N/A |
 
 ### Key Findings
 
-- **Best Isotropy:** mono_32d with 0.0541 (more uniform distribution)
-- **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
-- **Vocabulary Coverage:** All models cover 1,195 words
-- **Recommendation:** 100d for balanced semantic capture and efficiency
+- **Best Isotropy:** mono_32d with 0.0388 (more uniform distribution)
+- **Semantic Density:** Average pairwise similarity of 0.6724. Lower values indicate better semantic separation.
+- **Alignment Quality:** No aligned models evaluated in this run.
+- **Recommendation:** 128d aligned for best cross-lingual performance
+
+---
+## 6. Morphological Analysis (Experimental)
+
+> ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+
+This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+
+### 6.1 Productivity & Complexity
+
+| Metric | Value | Interpretation | Recommendation |
+|--------|-------|----------------|----------------|
+| Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+| Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+
+### 6.2 Affix Inventory (Productive Units)
+
+These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
+
+#### Productive Prefixes
+| Prefix | Examples |
+|--------|----------|
+
+#### Productive Suffixes
+| Suffix | Examples |
+|--------|----------|
+| `-en` | ren, disisen, citizen |
+| `-an` | givhan, kirgistan, wan |
+| `-em` | shoem, wokem, blem |
+
+### 6.3 Bound Stems (Lexical Roots)
+
+Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
+
+| Stem | Cohesion | Substitutability | Examples |
+|------|----------|------------------|----------|
+| `amba` | 1.38x | 8 contexts | ambae, namba, bambae |
+
+### 6.4 Affix Compatibility (Co-occurrence)
+
+This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
+
+*No significant affix co-occurrences detected.*
+
+### 6.5 Recursive Morpheme Segmentation
+
+Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
+
+| Word | Suggested Split | Confidence | Stem |
+|------|-----------------|------------|------|
+| republican | **`republic-an`** | 4.5 | `republic` |
+| andastanem | **`andast-an-em`** | 3.0 | `andast` |
+| niutesteman | **`niutest-em-an`** | 3.0 | `niutest` |
+| kirgistan | **`kirgist-an`** | 1.5 | `kirgist` |
+| valencian | **`valenci-an`** | 1.5 | `valenci` |
+| singaotem | **`singaot-em`** | 1.5 | `singaot` |
+| defdefren | **`defdefr-en`** | 1.5 | `defdefr` |
+| melanesian | **`melanesi-an`** | 1.5 | `melanesi` |
+| konstitusen | **`konstitus-en`** | 1.5 | `konstitus` |
+| komposisen | **`komposis-en`** | 1.5 | `komposis` |
+| smithsonian | **`smithsoni-an`** | 1.5 | `smithsoni` |
+| kompitisen | **`kompitis-en`** | 1.5 | `kompitis` |
+| bisnesman | **`bisnesm-an`** | 1.5 | `bisnesm` |
+| protestan | **`protest-an`** | 1.5 | `protest` |
+| ekshumesen | **`ekshumes-en`** | 1.5 | `ekshumes` |
+
+### 6.6 Linguistic Interpretation
+
+> **Automated Insight:**
+The language BI appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
 
 ---
-## 6. Summary & Recommendations
+## 7. Summary & Recommendations
 
 ![Performance Dashboard](visualizations/performance_dashboard.png)
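
Of the embedding metrics in §5.2 above, semantic density has a standard reading (mean pairwise cosine similarity, self-pairs excluded); isotropy is reported without a formula, so the proxy below is one common definition and an assumption on our part. The layout of the shipped `.bin` files is also undocumented here, so loading is stubbed with a placeholder matrix:

```python
# Two diagnostics from the 5.2 table, computed on an (n, d) vector matrix.
# A random placeholder stands in for the real loaded vectors.
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1046, 32))        # placeholder: replace with real vectors

# Semantic density: mean pairwise cosine similarity, self-pairs excluded
# (lower = better separation, matching the report's reading).
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sims = unit @ unit.T
n = len(unit)
print(f"semantic density ≈ {(sims.sum() - n) / (n * (n - 1)):.4f}")

# Isotropy proxy: smallest-to-largest singular-value energy of the centered
# matrix -- one common definition, assumed rather than taken from the report.
s = np.linalg.svd(vecs - vecs.mean(axis=0), compute_uv=False)
print(f"isotropy proxy ≈ {(s.min() / s.max())**2:.4f}")
```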
@@ -333,11 +477,12 @@ Below are text samples generated from each Markov chain model:
 
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
-| Tokenizer | **32k BPE** | Best compression (4.02x) with low UNK rate |
-| N-gram | **5-gram** | Lowest perplexity (264) |
-| Markov | **Context-4** | Highest predictability (96.1%) |
+| Tokenizer | **16k BPE** | Best compression (4.44x) |
+| N-gram | **2-gram** | Lowest perplexity (209) |
+| Markov | **Context-4** | Highest predictability (96.7%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 
+
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
@@ -527,7 +672,8 @@ If you use these models in your research, please cite:
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
-  publisher = {HuggingFace},
+  doi = {10.5281/zenodo.18073153},
+  publisher = {Zenodo},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
@@ -543,7 +689,8 @@ MIT License - Free for academic and commercial use.
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+- 🤝 Sponsor: [Featherless AI](https://featherless.ai)
 ---
 *Generated by Wikilangs Models Pipeline*
 
-*Report Date: 2025-12-28 05:14:46*
+*Report Date: 2026-01-03 07:17:54*
models/embeddings/monolingual/bi_128d.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:caab5935973c28d2f5da72f9cf5b5188fa2c9edce13d7ebd5d742833957f536a
-size 1025242321
+oid sha256:5ba043557491d9bd866f4880ed63951e9cedd258275bb5ab99cc6a4588a48418
+size 1025087100
models/embeddings/monolingual/bi_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
   "dimension": 128,
   "version": "monolingual",
   "training_params": {
-    "dim": 128,
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 128
   },
-  "vocab_size": 1195
+  "vocab_size": 1046
 }
models/embeddings/monolingual/bi_32d.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e416f9f9b0b23b2d2883958880cacc4980aa5e0694b6ccedc1110f52e55bce2b
-size 256324561
+oid sha256:77f00c5cf1334339a3726d39a681bacc6c8c5b7d226141cb5bd2ec7d3d93731c
+size 256283772
models/embeddings/monolingual/bi_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
   "dimension": 32,
   "version": "monolingual",
   "training_params": {
-    "dim": 32,
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 32
   },
-  "vocab_size": 1195
+  "vocab_size": 1046
 }
models/embeddings/monolingual/bi_64d.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:90ad9fe1ab3aad85875dd7c7e530975798cadf8deef91de6e32dd2d5a0f21735
-size 512630481
+oid sha256:64be2d287f8ece0f9696fe2372c048f2ea1c8dc331d051d642a50efe1563bc8b
+size 512551548
models/embeddings/monolingual/bi_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
   "dimension": 64,
   "version": "monolingual",
   "training_params": {
-    "dim": 64,
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 64
   },
-  "vocab_size": 1195
+  "vocab_size": 1046
 }
models/subword_markov/bi_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:939056bb57c6d57cef88397915123fe30ff6db2b6bd41444716de44e04e02910
-size 25094
+oid sha256:2e9372210a853979f0e4b07d4830a256cffa139a1fece177ff9c6038eed35df5
+size 22700
models/subword_markov/bi_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "subword",
   "language": "bi",
-  "unique_contexts": 381,
-  "total_transitions": 375436
+  "unique_contexts": 364,
+  "total_transitions": 308852
 }
models/subword_markov/bi_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6322d05a7e7c19ac152a2f6ef1c2a6c75e3259cb3e86f72c1b9d1766212caf00
-size 115200
+oid sha256:6ca6b0d701fee7db455da7e30950385f2806f14cec4f336ce82c4779e09621b4
+size 91544
models/subword_markov/bi_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "subword",
   "language": "bi",
-  "unique_contexts": 2929,
-  "total_transitions": 373835
+  "unique_contexts": 2361,
+  "total_transitions": 307409
 }
models/subword_markov/bi_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4ea1c08f391c5c276bbab53af1851c581076fa215e1ba890bc059391e01ef855
-size 346792
+oid sha256:68a8c133f845c55867a0fe53a9d2c945b085e56ce9e1e895558840e0a31ab900
+size 279694
models/subword_markov/bi_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "subword",
   "language": "bi",
-  "unique_contexts": 16165,
-  "total_transitions": 372234
+  "unique_contexts": 12016,
+  "total_transitions": 305966
 }
models/subword_markov/bi_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d4e91d074a40f87ad5464f595f8b4a666e20ee3e50b49909f00856a81624693d
-size 770026
+oid sha256:6e769ec2bcec378d6b983577c7020c20977d3c068c15135f0bc9e18b6c7e264f
+size 620331
models/subword_markov/bi_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "subword",
   "language": "bi",
-  "unique_contexts": 49589,
-  "total_transitions": 370633
+  "unique_contexts": 38018,
+  "total_transitions": 304523
 }
models/subword_ngram/bi_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fe618ebafef2b68121e2f9be785bcaad54d3ef638542480c5567752c647086fc
-size 16793
+oid sha256:4527bcf666ffd2dd1c578cfc5840d02a71199e369b0c8a627659726d39e176f6
+size 14130
models/subword_ngram/bi_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "subword",
   "language": "bi",
-  "unique_ngrams": 1233,
-  "total_ngrams": 375436
+  "unique_ngrams": 983,
+  "total_ngrams": 308852
 }
models/subword_ngram/bi_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e39f7c0cae8ddad16ffa9c4e7a5b2d47bd3d3d8c1fe65906a276c905a9e65f52
-size 83670
+oid sha256:b80fdffb1d982a5003ad057c3864a9f483b951ec85c6662f9c20d596201a26ff
+size 63627
models/subword_ngram/bi_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "subword",
   "language": "bi",
-  "unique_ngrams": 7760,
-  "total_ngrams": 373835
+  "unique_ngrams": 5848,
+  "total_ngrams": 307409
 }
models/subword_ngram/bi_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:17352c622203d2149e4ce41acb530a75712c65ae857e09d52dc983df854b7d50
-size 276251
+oid sha256:c4ebfe1944f1ab6833eb266e7debe2291ba5a0381a209e1ff6ac58781421a2c2
+size 221014
models/subword_ngram/bi_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "subword",
   "language": "bi",
-  "unique_ngrams": 23770,
-  "total_ngrams": 372234
+  "unique_ngrams": 19225,
+  "total_ngrams": 305966
 }
models/tokenizer/bi_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2acc0817a4fe6cae5cc586d4dd39f982bca4830e346325d6cb762281d84b105b
-size 505184
+oid sha256:cb93c36553e5bad9dc7098178cff3fc9a58ec9892899f622abc48bd993de075f
+size 506618
models/tokenizer/bi_tokenizer_16k.vocab CHANGED
(diff too large to render; see the raw diff)
models/tokenizer/bi_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5e03b02df521285dacf413c63825ccbb0b39cfd1e07e4eef79fb3f0a904fbbb3
-size 369743
+oid sha256:6a4efeb4ad5e3a8b678a5c79bd0331f01122c9bb189bbfd294c1172390d5a05c
+size 370520
models/tokenizer/bi_tokenizer_8k.vocab CHANGED
(diff too large to render; see the raw diff)
models/vocabulary/bi_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ea974cb6d740ce7a68a58cd7c62ff6fb4a801623a68be726034ab99cd8cec698
-size 58570
+oid sha256:87f70e9b39cbf89b9a192034747f166c231ea9bf458bb2112e79e97ae15e247a
+size 50916
models/vocabulary/bi_vocabulary_metadata.json CHANGED
@@ -1,15 +1,16 @@
 {
   "language": "bi",
-  "vocabulary_size": 3655,
+  "vocabulary_size": 3117,
+  "variant": "full",
   "statistics": {
-    "type_token_ratio": 0.1527339310519005,
+    "type_token_ratio": 0.15550018456995202,
     "coverage": {
-      "top_100": 0.5463442353832555,
-      "top_1000": 0.7844109104684935,
-      "top_5000": 0.926190175527213
+      "top_100": 0.5605020302694721,
+      "top_1000": 0.7978589885566629,
+      "top_5000": 0.9367847914359543
     },
-    "hapax_count": 6021,
-    "hapax_ratio": 0.6222612649855312,
-    "total_documents": 1601
+    "hapax_count": 5308,
+    "hapax_ratio": 0.6300296735905044,
+    "total_documents": 1443
   }
 }
models/word_markov/bi_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9a608e9becc7f5d7c66e918204585bc6f42d121e71342fc07761cbdd3d6e91c8
-size 280800
+oid sha256:5bcf4783672108d39279e5b6f88455605e79e8765b7ffe3f48acaac9ccda1c56
+size 230383
models/word_markov/bi_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "word",
   "language": "bi",
-  "unique_contexts": 9727,
-  "total_transitions": 78262
+  "unique_contexts": 8338,
+  "total_transitions": 52737
 }
models/word_markov/bi_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c7bde5b341c93d0a47f391cffe0dd3ea52489ebc29b6ce3d7e8f4b769e4ca41f
-size 554076
+oid sha256:c04695f69e6a2692e750e9bff816affd14058dc6384bd028ad4dea65463972ae
+size 440900
models/word_markov/bi_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "word",
   "language": "bi",
-  "unique_contexts": 32184,
-  "total_transitions": 76661
+  "unique_contexts": 24957,
+  "total_transitions": 51294
 }
models/word_markov/bi_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8ef088c98e551ec44ba505c42d24cea3dbd22e67a6c29605173bf026569dae20
-size 760704
+oid sha256:1acfd63f57a3dbfe43f8279cf4b22712ba4b1311bd026f0bab886fd886ead857
+size 595208
models/word_markov/bi_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "word",
   "language": "bi",
-  "unique_contexts": 46928,
-  "total_transitions": 75061
+  "unique_contexts": 34724,
+  "total_transitions": 49851
 }
models/word_markov/bi_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cb81655fd8ff6cd36bee3c1dad6d43a0abbc7b68e8224d93e29b6357fc239525
-size 872672
+oid sha256:19edc02c0845063d5a953f889ccfbfd26a2b1ec828f00bb70ee81b9c16697c22
+size 659846
models/word_markov/bi_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "word",
   "language": "bi",
-  "unique_contexts": 53803,
-  "total_transitions": 73462
+  "unique_contexts": 38736,
+  "total_transitions": 48408
 }
models/word_ngram/bi_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8bfe8c99bffba619da7ea499b717c3bf2cdb7a25364a768303e4d121191d6311
-size 22529
+oid sha256:f3910c59ef8e48f0aeff2cd337c0f4e24ef16de9c532784d7421359489476c53
+size 15823
models/word_ngram/bi_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "word",
   "language": "bi",
-  "unique_ngrams": 1634,
-  "total_ngrams": 78262
+  "unique_ngrams": 1049,
+  "total_ngrams": 52737
 }
models/word_ngram/bi_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8b213bdfbdd16ae2d5503bf1464193c03508683c778da7372faaa9faaf04dea2
-size 34234
+oid sha256:d152b84b7da74969f8c046b4f1bba6ef10e369d30da93155c74ac1d1709d09c8
+size 22298
models/word_ngram/bi_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "word",
   "language": "bi",
-  "unique_ngrams": 2346,
-  "total_ngrams": 76661
+  "unique_ngrams": 1408,
+  "total_ngrams": 51294
 }
models/word_ngram/bi_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:536b37aa666b8ae8ed8dbf9e56e71d2a9ce630cafc6f1d687c6f1e72355b0ac7
-size 61965
+oid sha256:5f230fb5fe00313288af8a9d573f021c638e7aeeae11122fd8d9d1bc87cd5351
+size 39653
models/word_ngram/bi_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "word",
   "language": "bi",
-  "unique_ngrams": 4093,
-  "total_ngrams": 75061
+  "unique_ngrams": 2457,
+  "total_ngrams": 49851
 }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED
  Git LFS: SHA256 1abb3d3e56f8605218af31f2a2f649bebc6193436a569b3b3c9633233a0b47a1 → 8cafc862ea4e3fc3fc997eb871b4bac8d7a9be24f83c03a2ba8444f611629ef6 (pointer size 131 B, remote file 160 kB)
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED
visualizations/ngram_coverage.png CHANGED
  Git LFS: SHA256 f4c353ffb68ec064ef4bf9839d7167db969c6eb29e3c2abb21d978202196188c (pointer size 131 B, remote file 100 kB)