omarkamali committed on
Commit 1815cdf · verified · Parent: 1e6f4ab

Upload all models and assets for bbc (20251001)

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full list.
Files changed (50)
  1. README.md +313 -135
  2. models/embeddings/monolingual/bbc_128d.bin +2 -2
  3. models/embeddings/monolingual/bbc_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/bbc_32d.bin +2 -2
  5. models/embeddings/monolingual/bbc_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/bbc_64d.bin +2 -2
  7. models/embeddings/monolingual/bbc_64d_metadata.json +5 -3
  8. models/subword_markov/bbc_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/bbc_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/bbc_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/bbc_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/bbc_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/bbc_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/bbc_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/bbc_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/bbc_2gram_subword.parquet +2 -2
  17. models/subword_ngram/bbc_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/bbc_3gram_subword.parquet +2 -2
  19. models/subword_ngram/bbc_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/bbc_4gram_subword.parquet +2 -2
  21. models/subword_ngram/bbc_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/bbc_tokenizer_16k.model +2 -2
  23. models/tokenizer/bbc_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/bbc_tokenizer_32k.model +2 -2
  25. models/tokenizer/bbc_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/bbc_tokenizer_8k.model +2 -2
  27. models/tokenizer/bbc_tokenizer_8k.vocab +0 -0
  28. models/vocabulary/bbc_vocabulary.parquet +2 -2
  29. models/vocabulary/bbc_vocabulary_metadata.json +10 -9
  30. models/word_markov/bbc_markov_ctx1_word.parquet +2 -2
  31. models/word_markov/bbc_markov_ctx1_word_metadata.json +2 -2
  32. models/word_markov/bbc_markov_ctx2_word.parquet +2 -2
  33. models/word_markov/bbc_markov_ctx2_word_metadata.json +2 -2
  34. models/word_markov/bbc_markov_ctx3_word.parquet +2 -2
  35. models/word_markov/bbc_markov_ctx3_word_metadata.json +2 -2
  36. models/word_markov/bbc_markov_ctx4_word.parquet +2 -2
  37. models/word_markov/bbc_markov_ctx4_word_metadata.json +2 -2
  38. models/word_ngram/bbc_2gram_word.parquet +2 -2
  39. models/word_ngram/bbc_2gram_word_metadata.json +2 -2
  40. models/word_ngram/bbc_3gram_word.parquet +2 -2
  41. models/word_ngram/bbc_3gram_word_metadata.json +2 -2
  42. models/word_ngram/bbc_4gram_word.parquet +2 -2
  43. models/word_ngram/bbc_4gram_word_metadata.json +2 -2
  44. visualizations/embedding_isotropy.png +0 -0
  45. visualizations/embedding_norms.png +0 -0
  46. visualizations/embedding_similarity.png +2 -2
  47. visualizations/markov_branching.png +0 -0
  48. visualizations/markov_contexts.png +0 -0
  49. visualizations/markov_entropy.png +0 -0
  50. visualizations/model_sizes.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
- value: 4.433
27
  - name: best_isotropy
28
  type: isotropy
29
- value: 0.8253
30
  - name: vocabulary_size
31
  type: vocab
32
- value: 24711
33
- generated: 2025-12-28
34
  ---
35
 
36
  # BBC - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
- - N-gram models (2, 3, 4-gram)
48
- - Markov chains (context of 1, 2, 3 and 4)
49
  - Subword N-gram and Markov chains
50
- - Embeddings in various sizes and dimensions
51
  - Language Vocabulary
52
  - Language Statistics
 
53
  ![Performance Dashboard](visualizations/performance_dashboard.png)
54
 
55
  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
59
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
60
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
61
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
62
- - [6. Summary & Recommendations](#6-summary--recommendations)
 
63
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
64
  - [Visualizations Index](#visualizations-index)
65
 
@@ -68,53 +70,53 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
68
 
69
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
70
 
71
  ### Results
72
 
73
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
74
  |------------|-------------|---------------|----------|--------------|
75
- | **8k** | 3.867x | 3.83 | 0.1235% | 1,466,873 |
76
- | **16k** | 4.118x | 4.08 | 0.1315% | 1,377,727 |
77
- | **32k** | 4.304x | 4.27 | 0.1375% | 1,318,061 |
78
- | **64k** | 4.433x 🏆 | 4.39 | 0.1416% | 1,279,829 |
79
 
80
  ### Tokenization Examples
81
 
82
  Below are sample sentences tokenized with each vocabulary size:
83
 
84
- **Sample 1:** `Panjunan i ma sada huta na adong di Kecamatan Petarukan, Kabupaten Pemalang, Pr...`
85
 
86
  | Vocab | Tokens | Count |
87
  |-------|--------|-------|
88
- | 8k | `▁panj unan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
89
- | 16k | `▁panj unan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
90
- | 32k | `▁panj unan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
91
- | 64k | `▁panjunan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁petarukan ... (+10 more)` | 20 |
92
-
93
- **Sample 2:** `Ampapaga (Surat Batak:ᯀᯔ᯲ᯇᯇᯎ) i ma sada suansuanan na tubu di gadu ni hauma.
94
 
95
- P...`
96
 
97
  | Vocab | Tokens | Count |
98
  |-------|--------|-------|
99
- | 8k | `▁amp ap aga( suratbatak : ᯔ᯲ ... (+20 more)` | 30 |
100
- | 16k | `▁amp ap aga( suratbatak : ᯔ᯲ᯇ ... (+18 more)` | 28 |
101
- | 32k | `▁amp apaga( suratbatak : ᯔ᯲ᯇᯇᯎ )i ... (+15 more)` | 25 |
102
- | 64k | `▁ampapaga ▁( surat ▁batak : ᯀᯔ᯲ᯇᯇᯎ ) ▁i ▁ma ▁sada ... (+13 more)` | 23 |
103
 
104
- **Sample 3:** `Sungapan i ma sada huta na adong di Kecamatan Pemalang, Kabupaten Pemalang, Pro...`
105
 
106
  | Vocab | Tokens | Count |
107
  |-------|--------|-------|
108
- | 8k | `▁sung apan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
109
- | 16k | `▁sung apan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
110
- | 32k | `▁sung apan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
111
- | 64k | `▁sungapan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁pemalang ... (+10 more)` | 20 |
112
 
113
 
114
  ### Key Findings
115
 
116
- - **Best Compression:** 64k achieves 4.433x compression
117
- - **Lowest UNK Rate:** 8k with 0.1235% unknown tokens
118
  - **Trade-off:** Larger vocabularies improve compression but increase model size
119
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
120
 
@@ -123,57 +125,89 @@ Below are sample sentences tokenized with each vocabulary size:
123
 
124
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
125
 
 
 
126
  ![N-gram Coverage](visualizations/ngram_coverage.png)
127
 
128
  ### Results
129
 
130
- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
131
- |--------|------------|---------|----------------|------------------|-------------------|
132
- | **2-gram** | 7,191 🏆 | 12.81 | 30,496 | 17.8% | 49.6% |
133
- | **2-gram** | 209 🏆 | 7.70 | 2,868 | 75.2% | 99.1% |
134
- | **3-gram** | 25,989 | 14.67 | 62,993 | 8.7% | 26.3% |
135
- | **3-gram** | 1,399 | 10.45 | 19,851 | 36.8% | 80.6% |
136
- | **4-gram** | 63,619 | 15.96 | 114,847 | 4.7% | 15.6% |
137
- | **4-gram** | 6,375 | 12.64 | 81,359 | 18.9% | 52.8% |
138
 
139
  ### Top 5 N-grams by Size
140
 
141
- **2-grams:**
142
 
143
  | Rank | N-gram | Count |
144
  |------|--------|-------|
145
- | 1 | `, jala` | 8,780 |
146
- | 2 | `i ,` | 6,997 |
147
- | 3 | `ᯬ ᯲` | 4,573 |
148
- | 4 | `angka na` | 4,409 |
149
- | 5 | `dung i` | 4,328 |
150
 
151
- **3-grams:**
152
 
153
  | Rank | N-gram | Count |
154
  |------|--------|-------|
155
- | 1 | `anak ni si` | 1,611 |
156
- | 2 | `, angka na` | 1,528 |
157
- | 3 | `. 2 :` | 1,079 |
158
- | 4 | `, anak ni` | 1,069 |
159
- | 5 | `ᯰ ᯦` | 1,063 |
160
 
161
- **4-grams:**
162
 
163
  | Rank | N-gram | Count |
164
  |------|--------|-------|
165
- | 1 | `, anak ni si` | 906 |
166
- | 2 | `ᯀ ᯦` | 686 |
167
- | 3 | `ᯘ ᯀᯉ ᯲` | 457 |
168
- | 4 | `ᯬ ᯂᯖ ᯲` | 432 |
169
- | 5 | `on do hata ni` | 421 |
170
 
171
 
172
  ### Key Findings
173
 
174
- - **Best Perplexity:** 2-gram with 209
175
  - **Entropy Trend:** Increases with larger n-grams (the n-gram distributions grow sparser)
176
- - **Coverage:** Top-1000 patterns cover ~53% of corpus
177
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
178
 
179
  ---
@@ -181,55 +215,86 @@ Below are sample sentences tokenized with each vocabulary size:
181
 
182
  ![Markov Entropy](visualizations/markov_entropy.png)
183
 
 
 
184
  ![Markov Branching](visualizations/markov_branching.png)
185
 
186
  ### Results
187
 
188
- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
189
- |---------|-------------|------------|------------------|-----------------|----------------|
190
- | **1** | 0.8281 | 1.775 | 6.09 | 50,531 | 17.2% |
191
- | **1** | 1.0960 | 2.138 | 7.35 | 1,187 | 0.0% |
192
- | **2** | 0.3983 | 1.318 | 2.17 | 307,642 | 60.2% |
193
- | **2** | 0.8091 | 1.752 | 4.70 | 8,725 | 19.1% |
194
- | **3** | 0.1982 | 1.147 | 1.41 | 667,901 | 80.2% |
195
- | **3** | 0.7878 | 1.726 | 3.57 | 41,004 | 21.2% |
196
- | **4** | 0.0988 🏆 | 1.071 | 1.16 | 943,940 | 90.1% |
197
- | **4** | 0.5526 🏆 | 1.467 | 2.40 | 146,535 | 44.7% |
198
 
199
- ### Generated Text Samples
200
 
201
- Below are text samples generated from each Markov chain model:
202
 
203
  **Context Size 1:**
204
 
205
- 1. `, jala tu si wasti , jala dipadeakdeak hamu sian i jala peakkononna tu palaspalas pamurunan`
206
- 2. `. 12 : songon i hatangku tu bagasan saluhut angka ari puasa bintang di jolo i`
207
- 3. `: ida ma angka na margoar milo dohot paredangedangan , 2 : 35 : 25 pelean`
208
 
209
  **Context Size 2:**
210
 
211
- 1. `, jala tudoshon gumora angka anak ni si joab soara ni sarune i laho ma ho ,`
212
- 2. `i , naung pinauli ni tangan ni halak pangarupa umpogo upa ni pambahenannasida . 99 : 7`
213
- 3. `ᯬ ᯘᯞ ᯂᯖ ᯇᯔ ᯅᯂ ᯉᯉ ᯉ`
214
 
215
  **Context Size 3:**
216
 
217
- 1. `anak ni si ahilud do panuturi . 18 : 13 tung sura ahu parohon begu masa tu luat`
218
- 2. `, angka na so bangso hian ; marhite sian bangso na asing , jala marbarita goarmu di betlehem`
219
- 3. `. 2 : 15 alai tarrimas situtu ma si abner dohot di halak na marroha pangansion , ndang`
220
 
221
  **Context Size 4:**
222
 
223
- 1. `, anak ni si ammiel . 3 : 6 ndada tu torop bangso , angka parhata bobang , manang`
224
- 2. `ᯀ ᯂᯖ ᯉᯘ ᯪ`
225
- 3. `ᯘ ᯀᯉ ᯂᯔ ᯩ ᯅᯖ 2 :`
226
 
227
 
228
  ### Key Findings
229
 
230
- - **Best Predictability:** Context-4 with 90.1% predictability
231
  - **Branching Factor:** Decreases with context size (more deterministic)
232
- - **Memory Trade-off:** Larger contexts require more storage (146,535 contexts)
233
  - **Recommendation:** Context-3 or Context-4 for text generation
234
 
235
  ---
@@ -245,64 +310,64 @@ Below are text samples generated from each Markov chain model:
245
 
246
  | Metric | Value |
247
  |--------|-------|
248
- | Vocabulary Size | 24,711 |
249
- | Total Tokens | 1,019,541 |
250
- | Mean Frequency | 41.26 |
251
  | Median Frequency | 4 |
252
- | Frequency Std Dev | 565.94 |
253
 
254
  ### Most Common Words
255
 
256
  | Rank | Word | Frequency |
257
  |------|------|-----------|
258
- | 1 | ni | 35,101 |
259
- | 2 | na | 34,088 |
260
- | 3 | i | 33,056 |
261
- | 4 | ma | 26,769 |
262
- | 5 | di | 26,053 |
263
- | 6 | tu | 20,450 |
264
- | 7 | do | 19,163 |
265
- | 8 | angka | 17,428 |
266
- | 9 | jala | 14,598 |
267
- | 10 | dohot | 13,609 |
268
 
269
  ### Least Common Words (from vocabulary)
270
 
271
  | Rank | Word | Frequency |
272
  |------|------|-----------|
273
- | 1 | ᯝᯇᯞ | 2 |
274
- | 2 | kayo | 2 |
275
- | 3 | uttar | 2 |
276
- | 4 | ltr | 2 |
277
- | 5 | ebrima | 2 |
278
- | 6 | 290px | 2 |
279
- | 7 | td | 2 |
280
- | 8 | height | 2 |
281
- | 9 | 260px | 2 |
282
- | 10 | 22251 | 2 |
283
 
284
  ### Zipf's Law Analysis
285
 
286
  | Metric | Value |
287
  |--------|-------|
288
- | Zipf Coefficient | 1.1956 |
289
- | R² (Goodness of Fit) | 0.996705 |
290
  | Adherence Quality | **excellent** |
291
 
292
  ### Coverage Analysis
293
 
294
  | Top N Words | Coverage |
295
  |-------------|----------|
296
- | Top 100 | 52.8% |
297
- | Top 1,000 | 78.6% |
298
- | Top 5,000 | 91.7% |
299
- | Top 10,000 | 95.9% |
300
 
301
  ### Key Findings
302
 
303
- - **Zipf Compliance:** R²=0.9967 indicates excellent adherence to Zipf's law
304
- - **High Frequency Dominance:** Top 100 words cover 52.8% of corpus
305
- - **Long Tail:** 14,711 words needed for remaining 4.1% coverage
306
 
307
  ---
308
  ## 5. Word Embeddings Evaluation
@@ -315,24 +380,134 @@ Below are text samples generated from each Markov chain model:
315
 
316
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
317
 
318
- ### Model Comparison
319
 
320
- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
321
- |-------|------------|-----------|----------|----------|----------|
322
- | **mono_32d** | 15,079 | 32 | 3.458 | 0.818 | 0.8253 🏆 |
323
- | **mono_64d** | 15,079 | 64 | 3.886 | 0.738 | 0.7641 |
324
- | **mono_128d** | 15,079 | 128 | 4.143 | 0.691 | 0.4668 |
325
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
326
 
327
  ### Key Findings
328
 
329
- - **Best Isotropy:** mono_32d with 0.8253 (more uniform distribution)
330
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
331
- - **Vocabulary Coverage:** All models cover 15,079 words
332
- - **Recommendation:** 100d for balanced semantic capture and efficiency
333
 
334
  ---
335
- ## 6. Summary & Recommendations
 
336
 
337
  ![Performance Dashboard](visualizations/performance_dashboard.png)
338
 
@@ -340,11 +515,12 @@ Below are text samples generated from each Markov chain model:
340
 
341
  | Component | Recommended | Rationale |
342
  |-----------|-------------|-----------|
343
- | Tokenizer | **32k BPE** | Best compression (4.43x) with low UNK rate |
344
- | N-gram | **5-gram** | Lowest perplexity (209) |
345
- | Markov | **Context-4** | Highest predictability (90.1%) |
346
  | Embeddings | **128d** | Balanced semantic capture and isotropy |
347
 
 
348
  ---
349
  ## Appendix: Metrics Glossary & Interpretation Guide
350
 
@@ -534,7 +710,8 @@ If you use these models in your research, please cite:
534
  author = {Kamali, Omar},
535
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
536
  year = {2025},
537
- publisher = {HuggingFace},
 
538
  url = {https://huggingface.co/wikilangs},
539
  institution = {Omneity Labs}
540
  }
@@ -550,7 +727,8 @@ MIT License - Free for academic and commercial use.
550
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
551
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
552
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 
553
  ---
554
  *Generated by Wikilangs Models Pipeline*
555
 
556
- *Report Date: 2025-12-28 00:12:38*
 
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
+ value: 3.663
27
  - name: best_isotropy
28
  type: isotropy
29
+ value: 0.8223
30
  - name: vocabulary_size
31
  type: vocab
32
+ value: 0
33
+ generated: 2026-01-03
34
  ---
35
 
36
  # BBC - Wikilangs Models
 
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
+ - N-gram models (2, 3, 4, 5-gram)
48
+ - Markov chains (context of 1, 2, 3, 4 and 5)
49
  - Subword N-gram and Markov chains
50
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
51
  - Language Vocabulary
52
  - Language Statistics
53
+
54
  ![Performance Dashboard](visualizations/performance_dashboard.png)
55
 
56
  ### Analysis and Evaluation
 
60
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
61
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
62
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
63
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis-experimental)
64
+ - [7. Summary & Recommendations](#7-summary--recommendations)
65
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
66
  - [Visualizations Index](#visualizations-index)
67
 
 
70
 
71
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
72
 
73
+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
74
+
75
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
76
+
77
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
78
+
79
  ### Results
80
 
81
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
82
  |------------|-------------|---------------|----------|--------------|
83
+ | **8k** | 3.308x | 3.31 | 0.2131% | 1,665,136 |
84
+ | **16k** | 3.527x | 3.53 | 0.2273% | 1,561,615 |
85
+ | **32k** | 3.663x 🏆 | 3.66 | 0.2360% | 1,503,727 |
 
86
 
87
  ### Tokenization Examples
88
 
89
  Below are sample sentences tokenized with each vocabulary size:
90
 
91
+ **Sample 1:** `Pedurungan i ma sada huta na adong di Kecamatan Taman, Kabupaten Pemalang, Propi...`
92
 
93
  | Vocab | Tokens | Count |
94
  |-------|--------|-------|
95
+ | 8k | `▁ped ur ungan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ... (+12 more)` | 22 |
96
+ | 16k | `▁pedurungan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁taman ... (+10 more)` | 20 |
97
+ | 32k | `▁pedurungan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁taman ... (+10 more)` | 20 |
 
 
 
98
 
99
+ **Sample 2:** `Mulyoharjo i ma sada Kelurahan na adong di Kecamatan Pemalang, Kabupaten Pemalan...`
100
 
101
  | Vocab | Tokens | Count |
102
  |-------|--------|-------|
103
+ | 8k | `▁mul y oharjo ▁i ▁ma ▁sada ▁kelurahan ▁na ▁adong ▁di ... (+12 more)` | 22 |
104
+ | 16k | `▁mulyoharjo ▁i ▁ma ▁sada ▁kelurahan ▁na ▁adong ▁di ▁kecamatan ▁pemalang ... (+10 more)` | 20 |
105
+ | 32k | `▁mulyoharjo ▁i ▁ma ▁sada ▁kelurahan ▁na ▁adong ▁di ▁kecamatan ▁pemalang ... (+10 more)` | 20 |
 
106
 
107
+ **Sample 3:** `Klegen i ma sada huta na adong di Kecamatan Comal, Kabupaten Pemalang, Propinsi ...`
108
 
109
  | Vocab | Tokens | Count |
110
  |-------|--------|-------|
111
+ | 8k | `▁kl egen ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
112
+ | 16k | `▁klegen ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁comal ... (+10 more)` | 20 |
113
+ | 32k | `▁klegen ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁comal ... (+10 more)` | 20 |
 
114
 
115
 
116
  ### Key Findings
117
 
118
+ - **Best Compression:** 32k achieves 3.663x compression
119
+ - **Lowest UNK Rate:** 8k with 0.2131% unknown tokens
120
  - **Trade-off:** Larger vocabularies improve compression but increase model size
121
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
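
The tokenizers are standard SentencePiece models, so the figures above are straightforward to reproduce. A minimal sketch, assuming the `sentencepiece` package and the `models/` layout listed in this commit:

```python
import sentencepiece as spm

# Path follows the repository layout shown in the file list above.
sp = spm.SentencePieceProcessor(model_file="models/tokenizer/bbc_tokenizer_32k.model")

text = "Klegen i ma sada huta na adong di Kecamatan Comal"
pieces = sp.encode(text, out_type=str)
print(pieces)       # subword pieces such as '▁klegen', '▁i', '▁ma', ...
print(len(pieces))  # token count, comparable to the Count column above
```
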
122
 
 
125
 
126
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
127
 
128
+ ![N-gram Unique](visualizations/ngram_unique.png)
129
+
130
  ![N-gram Coverage](visualizations/ngram_coverage.png)
131
 
132
  ### Results
133
 
134
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
135
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
136
+ | **2-gram** | Word | 8,528 | 13.06 | 26,426 | 17.5% | 42.8% |
137
+ | **2-gram** | Subword | 185 🏆 | 7.53 | 3,491 | 77.6% | 99.2% |
138
+ | **3-gram** | Word | 22,579 | 14.46 | 43,165 | 8.3% | 25.2% |
139
+ | **3-gram** | Subword | 1,219 | 10.25 | 18,183 | 38.1% | 83.2% |
140
+ | **4-gram** | Word | 44,749 | 15.45 | 67,595 | 5.7% | 16.0% |
141
+ | **4-gram** | Subword | 5,604 | 12.45 | 70,417 | 19.7% | 54.7% |
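
For reference, the Perplexity and Entropy columns are two views of the same quantity, perplexity = 2^entropy: for example, 2^7.53 ≈ 185 for the subword 2-gram and 2^12.45 ≈ 5,600 for the subword 4-gram.
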
142
 
143
  ### Top 5 N-grams by Size
144
 
145
+ **2-grams (Word):**
146
 
147
  | Rank | N-gram | Count |
148
  |------|--------|-------|
149
+ | 1 | `angka na` | 4,424 |
150
+ | 2 | `dung i` | 4,327 |
151
+ | 3 | `ni si` | 4,061 |
152
+ | 4 | `i ma` | 3,622 |
153
+ | 5 | `ni jahowa` | 2,892 |
154
 
155
+ **3-grams (Word):**
156
 
157
  | Rank | N-gram | Count |
158
  |------|--------|-------|
159
+ | 1 | `anak ni si` | 1,613 |
160
+ | 2 | `dung i ninna` | 735 |
161
+ | 3 | `i ma sada` | 728 |
162
+ | 4 | `hata ni jahowa` | 703 |
163
+ | 5 | `na adong di` | 690 |
164
 
165
+ **4-grams (Word):**
166
 
167
  | Rank | N-gram | Count |
168
  |------|--------|-------|
169
+ | 1 | `on do hata ni` | 423 |
170
+ | 2 | `songon on do hata` | 408 |
171
+ | 3 | `i ma sada huta` | 363 |
172
+ | 4 | `angka anak ni si` | 336 |
173
+ | 5 | `na adong di kecamatan` | 297 |
174
+
175
+ **2-grams (Subword):**
176
+
177
+ | Rank | N-gram | Count |
178
+ |------|--------|-------|
179
+ | 1 | `a _` | 206,904 |
180
+ | 2 | `a n` | 205,541 |
181
+ | 3 | `n g` | 154,122 |
182
+ | 4 | `i _` | 143,001 |
183
+ | 5 | `n a` | 122,611 |
184
+
185
+ **3-grams (Subword):**
186
+
187
+ | Rank | N-gram | Count |
188
+ |------|--------|-------|
189
+ | 1 | `a n g` | 81,985 |
190
+ | 2 | `_ m a` | 76,327 |
191
+ | 3 | `n a _` | 58,974 |
192
+ | 4 | `_ n a` | 53,547 |
193
+ | 5 | `a n _` | 51,343 |
194
+
195
+ **4-grams (Subword):**
196
+
197
+ | Rank | N-gram | Count |
198
+ |------|--------|-------|
199
+ | 1 | `_ n i _` | 34,971 |
200
+ | 2 | `_ n a _` | 33,600 |
201
+ | 3 | `_ d i _` | 25,982 |
202
+ | 4 | `a n g k` | 24,957 |
203
+ | 5 | `_ m a _` | 23,771 |
204
 
205
 
206
  ### Key Findings
207
 
208
+ - **Best Perplexity:** 2-gram (subword) with 185
209
  - **Entropy Trend:** Increases with larger n-grams (the n-gram distributions grow sparser)
210
+ - **Coverage:** Top-1000 patterns cover ~55% of corpus
211
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
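
The coverage figures can be recomputed from the released parquet tables. A sketch, assuming `pandas` and hypothetical column names `ngram` and `count` (check the accompanying `*_metadata.json` files for the actual schema):

```python
import pandas as pd

# Column names are assumptions -- verify against the metadata files.
df = pd.read_parquet("models/word_ngram/bbc_2gram_word.parquet")
df = df.sort_values("count", ascending=False)

total = df["count"].sum()
for top_n in (100, 1000):
    coverage = df["count"].head(top_n).sum() / total
    print(f"Top-{top_n} coverage: {coverage:.1%}")  # table above: 17.5% / 42.8%
```
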
212
 
213
  ---
 
215
 
216
  ![Markov Entropy](visualizations/markov_entropy.png)
217
 
218
+ ![Markov Contexts](visualizations/markov_contexts.png)
219
+
220
  ![Markov Branching](visualizations/markov_branching.png)
221
 
222
  ### Results
223
 
224
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
225
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
226
+ | **1** | Word | 0.9188 | 1.890 | 6.44 | 50,697 | 8.1% |
227
+ | **1** | Subword | 0.9378 | 1.916 | 7.17 | 1,435 | 6.2% |
228
+ | **2** | Word | 0.3742 | 1.296 | 2.02 | 325,909 | 62.6% |
229
+ | **2** | Subword | 0.7095 | 1.635 | 4.04 | 10,290 | 29.0% |
230
+ | **3** | Word | 0.1535 | 1.112 | 1.28 | 658,447 | 84.6% |
231
+ | **3** | Subword | 0.6445 | 1.563 | 3.15 | 41,529 | 35.5% |
232
+ | **4** | Word | 0.0590 🏆 | 1.042 | 1.09 | 840,046 | 94.1% |
233
+ | **4** | Subword | 0.5191 | 1.433 | 2.39 | 130,734 | 48.1% |
234
 
235
+ ### Generated Text Samples (Word-based)
236
 
237
+ Below are text samples generated from each word-based Markov chain model:
238
 
239
  **Context Size 1:**
240
 
241
+ 1. `ni harbangan asa dioloi si ferdinand lumban gaol boi manampung 1 11 3 naung disunat 2`
242
+ 2. `na marumur sataon na humaliang 5 000 m2 hira hira songon sondang ni raja iii diida`
243
+ 3. `i gok daupa sian si daud sian pangasammu do sarita pardapot ni na sampuludua i 22`
244
 
245
  **Context Size 2:**
246
 
247
+ 1. `angka na niuhir dohot na tarulang angka bagasnasida jala ndang marnalemba 10 38 ingkon mago roham ba...`
248
+ 2. `dung i ninna jesus ma siseanna i ninna parompuan na mabalu disi marmudu ho 17 2 dung`
249
+ 3. `ni si beor pangarunding i dibunu halak daniel 6 6 1 hamu pe ditanda sada pangituai parheheon`
250
 
251
  **Context Size 3:**
252
 
253
+ 1. `anak ni si jaasia si beno 24 27 ia angka anak ni si aron ma mudar i jala`
254
+ 2. `dung i ninna ibana tu ahu hombar tu bagabagam 119 171 sai marbulakkon pujipujian ma angka bibirhu al...`
255
+ 3. `i ma sada kecamatan na adong di kecamatan petarukan kabupaten pemalang propinsi jawa tonga indonesia...`
256
 
257
  **Context Size 4:**
258
 
259
+ 1. `on do hata ni tuhan jahowa ida ma ahu sandiri pahehehon tu nasida hahisaron dohot hamalumon jala pam...`
260
+ 2. `songon on do hata ni jahowa zebaot tu hamu malim angka na palea goarhu hape lam didok hamu do`
261
+ 3. `i ma sada huta na maringanan di kecamatan tarutung kabupaten tapanuli utara propinsi sumatera utara ...`
262
+
263
+
264
+ ### Generated Text Samples (Subword-based)
265
+
266
+ Below are text samples generated from each subword-based Markov chain model:
267
+
268
+ **Context Size 1:**
269
+
270
+ 1. `_nobedi_anoa?_hi`
271
+ 2. `akoni_jarin_a_ng`
272
+ 3. `nabomaholalowap_`
273
+
274
+ **Context Size 2:**
275
+
276
+ 1. `a_sia_ahaan_rohot`
277
+ 2. `anahu:_tana_raela`
278
+ 3. `ngon_nak_i._10:2_`
279
+
280
+ **Context Size 3:**
281
+
282
+ 1. `anggo_ia_ingka_jor`
283
+ 2. `_marik_marhabus;_a`
284
+ 3. `na_5_menjangkup_he`
285
+
286
+ **Context Size 4:**
287
+
288
+ 1. `_ni_angka_halak_juj`
289
+ 2. `_na_hian_gabe_manan`
290
+ 3. `_di_bagaska_indones`
291
 
292
 
293
  ### Key Findings
294
 
295
+ - **Best Predictability:** Context-4 (word) with 94.1% predictability
296
  - **Branching Factor:** Decreases with context size (more deterministic)
297
+ - **Memory Trade-off:** Larger contexts require more storage (130,734 contexts)
298
  - **Recommendation:** Context-3 or Context-4 for text generation
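
Generation like the samples above needs nothing more than weighted sampling over the stored transitions. A toy sketch, assuming a hypothetical parquet schema with `context`, `next`, and `count` columns:

```python
import random
import pandas as pd

# Schema is an assumption -- verify against the markov *_metadata.json files.
df = pd.read_parquet("models/word_markov/bbc_markov_ctx2_word.parquet")

def sample_next(context):
    """Draw the next word in proportion to its transition count."""
    rows = df[df["context"] == context]
    if rows.empty:
        return None
    return random.choices(rows["next"].tolist(), weights=rows["count"].tolist())[0]

tokens = ["dung", "i"]  # seed bigram taken from the corpus samples above
for _ in range(15):
    nxt = sample_next(" ".join(tokens[-2:]))
    if nxt is None:
        break
    tokens.append(nxt)
print(" ".join(tokens))
```
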
299
 
300
  ---
 
310
 
311
  | Metric | Value |
312
  |--------|-------|
313
+ | Vocabulary Size | 24,970 |
314
+ | Total Tokens | 972,166 |
315
+ | Mean Frequency | 38.93 |
316
  | Median Frequency | 4 |
317
+ | Frequency Std Dev | 557.36 |
318
 
319
  ### Most Common Words
320
 
321
  | Rank | Word | Frequency |
322
  |------|------|-----------|
323
+ | 1 | ni | 35,042 |
324
+ | 2 | na | 33,939 |
325
+ | 3 | i | 32,856 |
326
+ | 4 | ma | 26,602 |
327
+ | 5 | di | 26,003 |
328
+ | 6 | tu | 20,420 |
329
+ | 7 | do | 19,118 |
330
+ | 8 | angka | 17,417 |
331
+ | 9 | jala | 14,585 |
332
+ | 10 | dohot | 13,546 |
333
 
334
  ### Least Common Words (from vocabulary)
335
 
336
  | Rank | Word | Frequency |
337
  |------|------|-----------|
338
+ | 1 | continua | 2 |
339
+ | 2 | giuseppe | 2 |
340
+ | 3 | mamutuskan | 2 |
341
+ | 4 | disidang | 2 |
342
+ | 5 | disuspensi | 2 |
343
+ | 6 | formula | 2 |
344
+ | 7 | dibenarkan | 2 |
345
+ | 8 | pidana | 2 |
346
+ | 9 | piazza | 2 |
347
+ | 10 | fontana | 2 |
348
 
349
  ### Zipf's Law Analysis
350
 
351
  | Metric | Value |
352
  |--------|-------|
353
+ | Zipf Coefficient | 1.1798 |
354
+ | R² (Goodness of Fit) | 0.997075 |
355
  | Adherence Quality | **excellent** |
356
 
357
  ### Coverage Analysis
358
 
359
  | Top N Words | Coverage |
360
  |-------------|----------|
361
+ | Top 100 | 53.7% |
362
+ | Top 1,000 | 78.4% |
363
+ | Top 5,000 | 91.4% |
364
+ | Top 10,000 | 95.7% |
365
 
366
  ### Key Findings
367
 
368
+ - **Zipf Compliance:** R²=0.9971 indicates excellent adherence to Zipf's law
369
+ - **High Frequency Dominance:** Top 100 words cover 53.7% of corpus
370
+ - **Long Tail:** 14,970 words needed for remaining 4.3% coverage
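
The Zipf coefficient and R² above follow from an ordinary least-squares fit on log-log rank-frequency data. A sketch, assuming a hypothetical `frequency` column in the vocabulary parquet:

```python
import numpy as np
import pandas as pd

vocab = pd.read_parquet("models/vocabulary/bbc_vocabulary.parquet")
freq = np.sort(vocab["frequency"].to_numpy())[::-1]  # column name is an assumption
rank = np.arange(1, len(freq) + 1)

# Fit log f = -s * log r + c; the Zipf coefficient is s.
slope, intercept = np.polyfit(np.log(rank), np.log(freq), 1)
pred = slope * np.log(rank) + intercept
resid = np.log(freq) - pred
r2 = 1 - (resid**2).sum() / ((np.log(freq) - np.log(freq).mean()) ** 2).sum()
print(f"Zipf coefficient: {-slope:.4f}, R²: {r2:.6f}")  # README: 1.1798 / 0.997075
```
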
371
 
372
  ---
373
  ## 5. Word Embeddings Evaluation
 
380
 
381
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
382
 
 
383
 
384
+ ### 5.1 Cross-Lingual Alignment
385
+
386
+ > *Note: Multilingual alignment visualization not available for this language.*
387
+
388
+
389
+ ### 5.2 Model Comparison
390
+
391
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
392
+ |-------|-----------|----------|------------------|---------------|----------------|
393
+ | **mono_32d** | 32 | 0.8223 🏆 | 0.3239 | N/A | N/A |
394
+ | **mono_64d** | 64 | 0.7605 | 0.2710 | N/A | N/A |
395
+ | **mono_128d** | 128 | 0.4652 | 0.2352 | N/A | N/A |
396
 
397
  ### Key Findings
398
 
399
+ - **Best Isotropy:** mono_32d with 0.8223 (more uniform distribution)
400
+ - **Semantic Density:** Average pairwise similarity of 0.2767. Lower values indicate better semantic separation.
401
+ - **Alignment Quality:** No aligned models evaluated in this run.
402
+ - **Recommendation:** 128d aligned for best cross-lingual performance
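
The binary sizes (~260 MB for 32d) are consistent with fastText-style models carrying subword buckets; if that is indeed the format (not confirmed by this diff alone), gensim can read the files, and the semantic-density figure can be approximated as the mean pairwise cosine over a word sample:

```python
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

# Assumes fastText .bin format -- an inference from file sizes, not verified here.
wv = load_facebook_vectors("models/embeddings/monolingual/bbc_32d.bin")

rng = np.random.default_rng(0)
words = list(rng.choice(wv.index_to_key, size=500, replace=False))
vecs = wv[words]
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
sims = vecs @ vecs.T
print(sims[np.triu_indices_from(sims, k=1)].mean())  # table above: 0.3239 for mono_32d
```
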
403
 
404
  ---
405
+ ## 6. Morphological Analysis (Experimental)
406
+
407
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
408
+
409
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
410
+
411
+ ### 6.1 Productivity & Complexity
412
+
413
+ | Metric | Value | Interpretation | Recommendation |
414
+ |--------|-------|----------------|----------------|
415
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
416
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
417
+
418
+ ### 6.2 Affix Inventory (Productive Units)
419
+
420
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
421
+
422
+ #### Productive Prefixes
423
+ | Prefix | Examples |
424
+ |--------|----------|
425
+ | `pa-` | paunsathon, paraguay, panimbukbuk |
426
+ | `ma-` | matan, marsogotna, marimbang |
427
+ | `di-` | dihalungunhon, dipajonok, dikunjungi |
428
+ | `mar-` | marsogotna, marimbang, marsuhat |
429
+ | `si-` | sintuana, simalolongna, siotihotik |
430
+ | `man-` | manongoshon, maneat, manongos |
431
+ | `par-` | paraguay, paransis, parmaraan |
432
+ | `ha-` | haro, hadoboon, hakristenon |
433
+
434
+ #### Productive Suffixes
435
+ | Suffix | Examples |
436
+ |--------|----------|
437
+ | `-n` | matan, kanan, paunsathon |
438
+ | `-a` | bulanda, tanomanmuna, musikna |
439
+ | `-on` | paunsathon, manongoshon, dihalungunhon |
440
+ | `-na` | tanomanmuna, musikna, marsogotna |
441
+ | `-an` | matan, kanan, tibetan |
442
+ | `-ng` | marimbang, palding, lumeleng |
443
+ | `-hon` | paunsathon, manongoshon, dihalungunhon |
444
+ | `-nna` | anginna, binoanna, parumaenna |
445
+
446
+ ### 6.3 Bound Stems (Lexical Roots)
447
+
448
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
449
+
450
+ | Stem | Cohesion | Substitutability | Examples |
451
+ |------|----------|------------------|----------|
452
+ | `anga` | 1.65x | 126 contexts | angan, sanga, langa |
453
+ | `angk` | 1.46x | 153 contexts | angka, rangka, dangka |
454
+ | `mang` | 1.72x | 61 contexts | amang, damang, mangae |
455
+ | `ngka` | 1.53x | 87 contexts | angka, rangka, dangka |
456
+ | `ngko` | 1.76x | 41 contexts | ingkon, angkot, tingko |
457
+ | `onga` | 1.75x | 36 contexts | longa, tonga, dongan |
458
+ | `angg` | 1.39x | 75 contexts | anggo, anggi, anggia |
459
+ | `anna` | 1.73x | 31 contexts | hanna, manna, annai |
460
+ | `bahe` | 1.78x | 26 contexts | bahen, dibahe, ibahen |
461
+ | `ingk` | 1.43x | 59 contexts | ingkon, tingki, lingka |
462
+ | `ngan` | 1.37x | 65 contexts | ingan, angan, dongan |
463
+ | `ndan` | 1.68x | 25 contexts | ndang, pandan, undang |
464
+
465
+ ### 6.4 Affix Compatibility (Co-occurrence)
466
+
467
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
468
+
469
+ | Prefix | Suffix | Frequency | Examples |
470
+ |--------|--------|-----------|----------|
471
+ | `pa-` | `-n` | 372 words | paboaon, pangalelaon |
472
+ | `ma-` | `-n` | 227 words | malungun, mambahen |
473
+ | `pa-` | `-on` | 208 words | paboaon, pangalelaon |
474
+ | `pa-` | `-a` | 196 words | pangkeannasida, padanna |
475
+ | `pa-` | `-an` | 162 words | pamalian, parmiahan |
476
+ | `di-` | `-n` | 135 words | diparsahitan, dison |
477
+ | `ma-` | `-on` | 127 words | mangkalungunhon, mangkasogohon |
478
+ | `pa-` | `-na` | 124 words | padanna, pandokna |
479
+ | `ha-` | `-n` | 124 words | harun, hamulian |
480
+ | `di-` | `-on` | 113 words | dison, ditahbishon |
481
+
482
+ ### 6.5 Recursive Morpheme Segmentation
483
+
484
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
485
+
486
+ | Word | Suggested Split | Confidence | Stem |
487
+ |------|-----------------|------------|------|
488
+ | dipahatahata | **`di-pa-ha-ta-hata`** | 9.0 | `hata` |
489
+ | marhatomanon | **`mar-ha-toman-on`** | 7.5 | `toman` |
490
+ | panimbangan | **`pan-imba-ng-an`** | 7.5 | `imba` |
491
+ | patongonhon | **`pa-tong-on-hon`** | 7.5 | `tong` |
492
+ | hatigoranku | **`ha-tigor-an-ku`** | 7.5 | `tigor` |
493
+ | pargogoanku | **`par-gogo-an-ku`** | 7.5 | `gogo` |
494
+ | hagaleonku | **`ha-gale-on-ku`** | 7.5 | `gale` |
495
+ | dipatongon | **`di-pa-tong-on`** | 7.5 | `tong` |
496
+ | taparrohahon | **`ta-par-roha-hon`** | 7.5 | `roha` |
497
+ | parhaporseaon | **`par-ha-porsea-on`** | 7.5 | `porsea` |
498
+ | hamuliaonku | **`ha-mulia-on-ku`** | 7.5 | `mulia` |
499
+ | paluhutonku | **`pa-luhut-on-ku`** | 7.5 | `luhut` |
500
+ | hamateanna | **`ha-ma-tean-na`** | 7.5 | `tean` |
501
+ | silehononku | **`si-lehon-on-ku`** | 7.5 | `lehon` |
502
+ | patoltolonku | **`pa-toltol-on-ku`** | 7.5 | `toltol` |
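
The substitutability test described in 6.2 and underlying these splits can be illustrated compactly. A toy sketch (not the pipeline's actual implementation), counting how often stripping a candidate prefix leaves a stem attested elsewhere in the vocabulary:

```python
from collections import Counter

def productive_prefixes(vocab, max_len=3, min_support=2):
    """Count, per candidate prefix, how many words leave an attested stem."""
    support = Counter()
    for word in vocab:
        for k in range(1, max_len + 1):
            prefix, stem = word[:k], word[k:]
            if len(stem) >= 3 and stem in vocab:
                support[prefix] += 1
    return [p for p, n in support.most_common() if n >= min_support]

toy = {"matan", "tan", "marimbang", "imbang", "dison", "dibahen", "bahen", "son"}
print(productive_prefixes(toy))  # ['di'] -- 'ma'/'mar' lack support in this toy set
```
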
503
+
504
+ ### 6.6 Linguistic Interpretation
505
+
506
+ > **Automated Insight:**
507
+ The language BBC appears to be more isolating or to have a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
508
+
509
+ ---
510
+ ## 7. Summary & Recommendations
511
 
512
  ![Performance Dashboard](visualizations/performance_dashboard.png)
513
 
 
515
 
516
  | Component | Recommended | Rationale |
517
  |-----------|-------------|-----------|
518
+ | Tokenizer | **32k BPE** | Best compression (3.66x) |
519
+ | N-gram | **2-gram** | Lowest perplexity (185) |
520
+ | Markov | **Context-4** | Highest predictability (94.1%) |
521
  | Embeddings | **128d** | Balanced semantic capture and isotropy |
522
 
523
+
524
  ---
525
  ## Appendix: Metrics Glossary & Interpretation Guide
526
 
 
710
  author = {Kamali, Omar},
711
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
712
  year = {2025},
713
+ doi = {10.5281/zenodo.18073153},
714
+ publisher = {Zenodo},
715
  url = {https://huggingface.co/wikilangs},
716
  institution = {Omneity Labs}
717
  }
 
727
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
728
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
729
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
730
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
731
  ---
732
  *Generated by Wikilangs Models Pipeline*
733
 
734
+ *Report Date: 2026-01-03 06:19:26*
models/embeddings/monolingual/bbc_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0b0ca75d42a296503b6510e66321ba63fde51730d05e8ae06bf9ee333ebf3764
3
- size 1039704692
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:299d624499ec82c937aa1769fa092d3bd9d976c6d08c2af9b22f767740074a59
3
+ size 1039420296
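
Every binary asset in this commit is a Git LFS pointer like the one above; only the oid and size change, while the content lives in LFS storage. With `huggingface_hub` the resolved files can be fetched directly. A sketch, with a hypothetical repo id:

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="wikilangs/bbc",  # hypothetical -- substitute the actual repo id
    filename="models/embeddings/monolingual/bbc_128d.bin",
    repo_type="dataset",      # the README carries dataset_info metadata
)
print(path)  # local cache path; the LFS content is resolved transparently
```
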
models/embeddings/monolingual/bbc_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 128,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 15079
13
  }
 
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 128
13
  },
14
+ "vocab_size": 14806
15
  }
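
The updated metadata pins down the training configuration: skip-gram with window 5, 5 negatives, 5 epochs, plus a pipeline-specific `rope` encoding step. For orientation only, a roughly comparable vanilla skip-gram setup in gensim (without the `rope` step):

```python
from gensim.models import Word2Vec

# Sketch of a comparable configuration; the corpus file is hypothetical.
model = Word2Vec(
    corpus_file="bbc_tokenized.txt",  # one whitespace-tokenized sentence per line
    vector_size=128,                  # matches "dim": 128
    sg=1,                             # "algorithm": "skipgram"
    min_count=5, window=5, negative=5, epochs=5,
)
print(len(model.wv))                  # metadata reports vocab_size 14806
```
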
models/embeddings/monolingual/bbc_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1e806c00c5217f8c0bc9c0c7c7f674b5091cf2b888f0f09d5617c5947fdf7b85
3
- size 260124020
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:981df92afd62ee12261b51c6b8c02f4b1e76c4f2d6c299c0dd0e78da12e9ac2b
3
+ size 260049288
models/embeddings/monolingual/bbc_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 32,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 15079
13
  }
 
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 32
13
  },
14
+ "vocab_size": 14806
15
  }
models/embeddings/monolingual/bbc_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b313b52ddbda268ad3cbd1f0e06c19c9d337a7809b8be819492b088a18d3ea52
3
- size 519984244
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e7fddeff473650e7228b6c4a71a4299b4d9ee94908cca09fc7492ca327d34b12
3
+ size 519839624
models/embeddings/monolingual/bbc_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 64,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 15079
13
  }
 
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 64
13
  },
14
+ "vocab_size": 14806
15
  }
models/subword_markov/bbc_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:31f366e845fdd50b876fa026287e3adc40435cfcd969f38d0a43a5e37677bf6f
3
- size 67861
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5aefbfb1171b4b00a844e78d07315efbcf62f55a2b30f5bd6d940b5ba83804d7
3
+ size 81484
models/subword_markov/bbc_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_contexts": 1187,
6
- "total_transitions": 5965783
7
  }
 
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_contexts": 1435,
6
+ "total_transitions": 5702376
7
  }
models/subword_markov/bbc_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b7007b2743717f219549dc35da4e22ef904f954017d220c8e61b57e8f201260f
3
- size 331323
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3f38ba522fe1bac3abb75948ca7e8ea124307971bfee224a267114a3c9466873
3
+ size 358445
models/subword_markov/bbc_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_contexts": 8725,
6
- "total_transitions": 5964472
7
  }
 
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_contexts": 10290,
6
+ "total_transitions": 5701163
7
  }
models/subword_markov/bbc_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c82b65a7ab369bc14ca7c650a4752310ee46abd1500b7705572d13c633829e72
3
- size 1156768
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0a3da300cf244b2ab02f8cc4190a829feb85819df2984a6b1a78f37ab7daec14
3
+ size 1190319
models/subword_markov/bbc_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_contexts": 41004,
6
- "total_transitions": 5963161
7
  }
 
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_contexts": 41529,
6
+ "total_transitions": 5699950
7
  }
models/subword_markov/bbc_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7f8adaacc604505a5ecc9a9df0ee1360e764fe6a44233577ea34de523534262e
3
- size 2961725
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:32b0388a33d7717a0deb5945264197ca362aded134c2c728e20da7a5ddee82f4
3
+ size 2842522
models/subword_markov/bbc_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_contexts": 146535,
6
- "total_transitions": 5961850
7
  }
 
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_contexts": 130734,
6
+ "total_transitions": 5698737
7
  }
models/subword_ngram/bbc_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9e60fc5bcc5851afa5f2e2cc336875cafe4bf25d188866759e33cdfd81d92110
3
- size 36155
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d706d757ef98f7e2eaf523f46f0a602b8f06b25cb1d00579847f22b28eece7f1
3
+ size 46126
models/subword_ngram/bbc_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_ngrams": 2868,
6
- "total_ngrams": 5965783
7
  }
 
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_ngrams": 3491,
6
+ "total_ngrams": 5702376
7
  }
models/subword_ngram/bbc_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1a241f736f6533859dd3ceeb25dc4c326e50cb9bdae44188ae81918ec9b1e09d
3
- size 236441
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:578bacf93c61827c89b4f8427603b775cd6b7891f9ceecc9c90545263f0f5768
3
+ size 237045
models/subword_ngram/bbc_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_ngrams": 19851,
6
- "total_ngrams": 5964472
7
  }
 
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_ngrams": 18183,
6
+ "total_ngrams": 5701163
7
  }
models/subword_ngram/bbc_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:de307e6c44b633e8d607ebdbd329107fa1f51230fa199d5272b932992601e309
3
- size 929802
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:66675bb6688469ac3db798effd708b7b3e5f75e0f119c579a8f76b3e2b98fa07
3
+ size 867430
models/subword_ngram/bbc_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_ngrams": 81359,
6
- "total_ngrams": 5963161
7
  }
 
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_ngrams": 70417,
6
+ "total_ngrams": 5699950
7
  }
models/tokenizer/bbc_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:136cf9b48df47bd0325da4d7c88b57560631e30b6199a6208f4260be8e9552ee
3
- size 526133
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6a176fe5ca0bdc5334c45eb11b39e17d285bd6a1469ac9c9df2ca41a79f9a60c
3
+ size 510189
models/tokenizer/bbc_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/bbc_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2529fe3a9c19885eb39e8ccf47c668a7b0d05f23fec6e510f14e88006b6b7bfd
3
- size 829199
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:92d486b2cdc4197f788488f7923c1a959c468622b7cc6f4cc652213c29688a42
3
+ size 804933
models/tokenizer/bbc_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/bbc_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:013a29e4795c8a56295410d77b42d5f8e382048830cd0cb9713d5e5e7d4843e0
3
- size 380151
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cd7423a518aa21b134b1db88b66fe95833b16223e79c31fab654c72db01f19bd
3
+ size 370418
models/tokenizer/bbc_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/bbc_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:98151fd7e6ed05510325054834af69754a9818fbdb10484a7067639a5bdf9974
3
- size 407231
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:063a599013021fa7eb8f83c98669f631f73439cf9f45048819ae9c1bfe1070e2
3
+ size 419766
models/vocabulary/bbc_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
1
  {
2
  "language": "bbc",
3
- "vocabulary_size": 24711,
 
4
  "statistics": {
5
- "type_token_ratio": 0.0481935549300518,
6
  "coverage": {
7
- "top_100": 0.5151721868117359,
8
- "top_1000": 0.7666824211970509,
9
- "top_5000": 0.894505559690854,
10
- "top_10000": 0.9355091168979777
11
  },
12
- "hapax_count": 25661,
13
- "hapax_ratio": 0.5094298419757007,
14
- "total_documents": 1311
15
  }
16
  }
 
1
  {
2
  "language": "bbc",
3
+ "vocabulary_size": 24970,
4
+ "variant": "full",
5
  "statistics": {
6
+ "type_token_ratio": 0.05084399284522539,
7
  "coverage": {
8
+ "top_100": 0.5228987859930757,
9
+ "top_1000": 0.7641910545275995,
10
+ "top_5000": 0.8904758325943073,
11
+ "top_10000": 0.9321639184916853
12
  },
13
+ "hapax_count": 25769,
14
+ "hapax_ratio": 0.507873627781391,
15
+ "total_documents": 1213
16
  }
17
  }
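
A consistency check on the new numbers: `hapax_ratio` is the hapax count taken over the full type inventory, i.e. the retained vocabulary plus the hapaxes that the table excludes (its minimum frequency is 2, per the least-common-words list above). The committed values reproduce it exactly:

```python
hapax_count = 25769
vocabulary_size = 24970  # retained types; hapaxes stored separately (inference)
print(hapax_count / (vocabulary_size + hapax_count))  # 0.507873627781391
```
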
models/word_markov/bbc_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7e98885de06212fe4d458334ea6fca807e1bc10f3eeb95db70f713244d6ae9a7
3
- size 2077733
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:71dc86808478f99a7145bcc0a4ff6b725b6972f69dc381f3ae944224fbe4c16e
3
+ size 2205032
models/word_markov/bbc_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_contexts": 50531,
6
- "total_transitions": 1302268
7
  }
 
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_contexts": 50697,
6
+ "total_transitions": 996722
7
  }
models/word_markov/bbc_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c2b1dee42bcb68e6a9fef3aa20e018a082f3dc777f1972d72933e2d6eea72d27
3
- size 6027713
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:728d851895842f58fe81fb7ab5ed1ae78d0307c5849828fe282e48bfdcd66b91
3
+ size 6594255
models/word_markov/bbc_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_contexts": 307642,
6
- "total_transitions": 1300957
7
  }
 
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_contexts": 325909,
6
+ "total_transitions": 995509
7
  }
models/word_markov/bbc_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8acf1995ad2ef8973fc20a189c2be31adc36869291155c07279e2d2a5a7df71a
3
- size 10931319
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8f6ad77c32e92404bdcd431db829da9cd2b7fba3d776d4add45047f520d777bd
3
+ size 10832852
models/word_markov/bbc_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_contexts": 667901,
6
- "total_transitions": 1299646
7
  }
 
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_contexts": 658447,
6
+ "total_transitions": 994305
7
  }
models/word_markov/bbc_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c2cca265740b80274df07e8a920e3c873c44dd9f14f647cf7c26d83cbc3ac68f
3
- size 14909288
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b3f9fed8b38c168b6cb6bd140019ba39475f2256af74958a5be3b86590c35465
3
+ size 13542247
models/word_markov/bbc_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_contexts": 943940,
6
- "total_transitions": 1298335
7
  }
 
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_contexts": 840046,
6
+ "total_transitions": 993102
7
  }
models/word_ngram/bbc_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d8d6402f7a1a57dd4dd1c3fd635fd7023724f3f67f2a4c61d61a5f224906cd25
3
- size 404690
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:35e7179149d81c363c63014a32d763e82ed4c68c09f8bf11666d2a411687d048
3
+ size 356865
models/word_ngram/bbc_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_ngrams": 30496,
6
- "total_ngrams": 1302268
7
  }
 
2
  "n": 2,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_ngrams": 26426,
6
+ "total_ngrams": 996722
7
  }
models/word_ngram/bbc_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f0e5d6a6cac168c51ea635254aca5bfd8c4702475845dac965c1b0991023487f
3
- size 849819
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:82fe52ca4876909aaae283e65cee6b4f353c5ee4954be84a71334c4d597b03dd
3
+ size 632278
models/word_ngram/bbc_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_ngrams": 62993,
6
- "total_ngrams": 1300957
7
  }
 
2
  "n": 3,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_ngrams": 43165,
6
+ "total_ngrams": 995509
7
  }
models/word_ngram/bbc_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ab915e5a6d57480fa0c0c0261b9d209b8f5d2872e033c606183a407608cfeff9
3
- size 1582171
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4f1f7eaa38f291087c7b01870d85c8eafea562d6c79b4b0dc82e882fa6bf5219
3
+ size 1017276
models/word_ngram/bbc_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_ngrams": 114847,
6
- "total_ngrams": 1299646
7
  }
 
2
  "n": 4,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_ngrams": 67595,
6
+ "total_ngrams": 994305
7
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details (before)

  • SHA256: e4252613ca102d71cfbbaf34b94e6af5d9651cbd57d0c5081ede58946e00c6f5
  • Pointer size: 131 Bytes
  • Size of remote file: 143 kB

Git LFS Details (after)

  • SHA256: 3fca8cacd33d00e4f6242b983e7210268def536fdfac878cbb2940260a259593
  • Pointer size: 131 Bytes
  • Size of remote file: 142 kB
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED