omarkamali committed
Commit 77a2691 · verified · Parent: bae74c6

Upload all models and assets for bm (20251001)

This view is limited to 50 files because the commit contains too many changes.
Files changed (50)
  1. README.md +290 -132
  2. models/embeddings/monolingual/bm_128d.bin +2 -2
  3. models/embeddings/monolingual/bm_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/bm_32d.bin +2 -2
  5. models/embeddings/monolingual/bm_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/bm_64d.bin +2 -2
  7. models/embeddings/monolingual/bm_64d_metadata.json +5 -3
  8. models/subword_markov/bm_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/bm_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/bm_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/bm_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/bm_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/bm_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/bm_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/bm_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/bm_2gram_subword.parquet +2 -2
  17. models/subword_ngram/bm_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/bm_3gram_subword.parquet +2 -2
  19. models/subword_ngram/bm_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/bm_4gram_subword.parquet +2 -2
  21. models/subword_ngram/bm_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/bm_tokenizer_16k.model +2 -2
  23. models/tokenizer/bm_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/bm_tokenizer_32k.model +2 -2
  25. models/tokenizer/bm_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/bm_tokenizer_8k.model +2 -2
  27. models/tokenizer/bm_tokenizer_8k.vocab +0 -0
  28. models/vocabulary/bm_vocabulary.parquet +2 -2
  29. models/vocabulary/bm_vocabulary_metadata.json +10 -9
  30. models/word_markov/bm_markov_ctx1_word.parquet +2 -2
  31. models/word_markov/bm_markov_ctx1_word_metadata.json +2 -2
  32. models/word_markov/bm_markov_ctx2_word.parquet +2 -2
  33. models/word_markov/bm_markov_ctx2_word_metadata.json +2 -2
  34. models/word_markov/bm_markov_ctx3_word.parquet +2 -2
  35. models/word_markov/bm_markov_ctx3_word_metadata.json +2 -2
  36. models/word_markov/bm_markov_ctx4_word.parquet +2 -2
  37. models/word_markov/bm_markov_ctx4_word_metadata.json +2 -2
  38. models/word_ngram/bm_2gram_word.parquet +2 -2
  39. models/word_ngram/bm_2gram_word_metadata.json +2 -2
  40. models/word_ngram/bm_3gram_word.parquet +2 -2
  41. models/word_ngram/bm_3gram_word_metadata.json +2 -2
  42. models/word_ngram/bm_4gram_word.parquet +2 -2
  43. models/word_ngram/bm_4gram_word_metadata.json +2 -2
  44. visualizations/embedding_isotropy.png +0 -0
  45. visualizations/embedding_norms.png +0 -0
  46. visualizations/embedding_similarity.png +2 -2
  47. visualizations/markov_branching.png +0 -0
  48. visualizations/markov_contexts.png +0 -0
  49. visualizations/markov_entropy.png +0 -0
  50. visualizations/model_sizes.png +0 -0
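For orientation before the diffs below: any of the 50 assets listed above can be pulled individually with the standard `huggingface_hub` client. A minimal sketch; the repo id is a hypothetical placeholder inferred from the project links, not something this page states:

```python
# Minimal sketch: fetch one asset from the list above.
# "wikilangs/bm" is a hypothetical repo id (not confirmed on this page);
# replace it with the actual repository before running.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="wikilangs/bm",  # assumption, adjust as needed
    filename="models/tokenizer/bm_tokenizer_32k.model",
)
print(local_path)  # cached local path of the LFS-backed file
```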
README.md CHANGED
(Old version pane: unchanged context lines plus removed lines, prefixed with -)
@@ -23,14 +23,14 @@ dataset_info:
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
- value: 4.070
27
  - name: best_isotropy
28
  type: isotropy
29
- value: 0.2244
30
  - name: vocabulary_size
31
  type: vocab
32
- value: 7195
33
- generated: 2025-12-28
34
  ---
35
 
36
  # BM - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
- - N-gram models (2, 3, 4-gram)
48
- - Markov chains (context of 1, 2, 3 and 4)
49
  - Subword N-gram and Markov chains
50
- - Embeddings in various sizes and dimensions
51
  - Language Vocabulary
52
  - Language Statistics
53
  ![Performance Dashboard](visualizations/performance_dashboard.png)
54
 
55
  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
59
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
60
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
61
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
62
- - [6. Summary & Recommendations](#6-summary--recommendations)
63
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
64
  - [Visualizations Index](#visualizations-index)
65
 
@@ -68,52 +70,53 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
68
 
69
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
70

71
  ### Results
72
 
73
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
74
  |------------|-------------|---------------|----------|--------------|
75
- | **8k** | 3.433x | 3.36 | 0.0887% | 117,303 |
76
- | **16k** | 3.752x | 3.68 | 0.0969% | 107,301 |
77
- | **32k** | 4.070x 🏆 | 3.99 | 0.1051% | 98,934 |
78
 
79
  ### Tokenization Examples
80
 
81
  Below are sample sentences tokenized with each vocabulary size:
82
 
83
- **Sample 1:** `Los Angeles ye Amerika ka Kelenyalen Jamanaw ka dugu ye.
84
-
85
-
86
-
87
- Catégorie:Amerika ka...`
88
 
89
  | Vocab | Tokens | Count |
90
  |-------|--------|-------|
91
- | 8k | `▁los ▁angeles ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+9 more)` | 19 |
92
- | 16k | `▁los ▁angeles ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+9 more)` | 19 |
93
- | 32k | `▁losangeles ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+9 more)` | 19 |
94
 
95
- **Sample 2:** `Kunkolosɛmɛ baganw ani mogow ka kunkolo kɔnɔ yɛ. Mogow miri kunkolo kɔnɔ.
96
- ...`
97
 
98
  | Vocab | Tokens | Count |
99
  |-------|--------|-------|
100
- | 8k | `▁kunkolo sɛ mɛ baganwanimogowkakunkolokɔnɔ ... (+14 more)` | 24 |
101
- | 16k | `▁kunkolo sɛmɛ baganwanimogowkakunkolokɔnɔ ... (+11 more)` | 21 |
102
- | 32k | `▁kunkolosɛmɛbaganwanimogowkakunkolokɔnɔ . ... (+9 more)` | 19 |
103
 
104
- **Sample 3:** `TaliBailleul, Charles. 2008. Dictionnaire français-bambara. Bamako: Éditions Do...`
105
 
106
  | Vocab | Tokens | Count |
107
  |-------|--------|-------|
108
- | 8k | `▁tali bailleul , ▁charles . ▁ 2 0 0 8 ... (+31 more)` | 41 |
109
- | 16k | `▁tali bailleul , ▁charles . ▁ 2 0 0 8 ... (+31 more)` | 41 |
110
- | 32k | `▁talibailleul , ▁charles . ▁ 2 0 0 8 . ... (+30 more)` | 40 |
111
 
112
 
113
  ### Key Findings
114
 
115
- - **Best Compression:** 32k achieves 4.070x compression
116
- - **Lowest UNK Rate:** 8k with 0.0887% unknown tokens
117
  - **Trade-off:** Larger vocabularies improve compression but increase model size
118
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
119
 
@@ -122,55 +125,87 @@ Catégorie:Amerika ka...`
122
 
123
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
124

125
  ![N-gram Coverage](visualizations/ngram_coverage.png)
126
 
127
  ### Results
128
 
129
- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
130
- |--------|------------|---------|----------------|------------------|-------------------|
131
- | **2-gram** | 950 🏆 | 9.89 | 2,854 | 45.2% | 80.3% |
132
- | **2-gram** | 322 🏆 | 8.33 | 2,127 | 63.8% | 97.9% |
133
- | **3-gram** | 976 | 9.93 | 3,554 | 47.1% | 75.8% |
134
- | **3-gram** | 2,159 | 11.08 | 11,428 | 27.6% | 72.5% |
135
- | **4-gram** | 1,930 | 10.91 | 8,337 | 41.6% | 58.5% |
136
- | **4-gram** | 8,659 | 13.08 | 39,905 | 14.4% | 46.6% |
137
 
138
  ### Top 5 N-grams by Size
139
 
140
- **2-grams:**
141
 
142
  | Rank | N-gram | Count |
143
  |------|--------|-------|
144
- | 1 | `catégorie :` | 1,068 |
145
- | 2 | `ye .` | 822 |
146
- | 3 | `’ a` | 627 |
147
- | 4 | `ka dugu` | 577 |
148
- | 5 | `. sababou` | 571 |
149
 
150
- **3-grams:**
151
 
152
  | Rank | N-gram | Count |
153
  |------|--------|-------|
154
- | 1 | `français - bambara` | 419 |
155
- | 2 | `dictionnaire français -` | 419 |
156
- | 3 | `. dictionnaire français` | 419 |
157
- | 4 | `2008 . dictionnaire` | 419 |
158
- | 5 | `. 2008 .` | 419 |
159
 
160
- **4-grams:**
161
 
162
  | Rank | N-gram | Count |
163
  |------|--------|-------|
164
- | 1 | `. dictionnaire français -` | 419 |
165
- | 2 | `- 04 - 8` | 419 |
166
- | 3 | `français - bambara .` | 419 |
167
- | 4 | `- bambara . bamako` | 419 |
168
- | 5 | `bambara . bamako :` | 419 |
169
 
170
 
171
  ### Key Findings
172
 
173
- - **Best Perplexity:** 2-gram with 322
174
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
175
  - **Coverage:** Top-1000 patterns cover ~47% of corpus
176
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
@@ -180,55 +215,86 @@ Catégorie:Amerika ka...`
180
 
181
  ![Markov Entropy](visualizations/markov_entropy.png)
182

183
  ![Markov Branching](visualizations/markov_branching.png)
184
 
185
  ### Results
186
 
187
- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
188
- |---------|-------------|------------|------------------|-----------------|----------------|
189
- | **1** | 0.5830 | 1.498 | 3.54 | 18,421 | 41.7% |
190
- | **1** | 1.2181 | 2.326 | 9.19 | 508 | 0.0% |
191
- | **2** | 0.2211 | 1.166 | 1.48 | 65,151 | 77.9% |
192
- | **2** | 1.0235 | 2.033 | 5.17 | 4,659 | 0.0% |
193
- | **3** | 0.0800 | 1.057 | 1.13 | 96,171 | 92.0% |
194
- | **3** | 0.7286 | 1.657 | 3.07 | 24,092 | 27.1% |
195
- | **4** | 0.0309 🏆 | 1.022 | 1.04 | 108,554 | 96.9% |
196
- | **4** | 0.4796 🏆 | 1.394 | 2.01 | 73,789 | 52.0% |
197
 
198
- ### Generated Text Samples
199
 
200
- Below are text samples generated from each Markov chain model:
201
 
202
  **Context Size 1:**
203
 
204
- 1. `. farrell , kɔɔnɔ du riz africain , o wati san 1648 - wong vong ye`
205
- 2. `, i ka tɔgɔ tun te deli ka dugu tɔw kan , ni swedi , na`
206
- 3. `ye ko an bolo suguya kɛmɛ tan kɔnɔ bamakɔ dugu ye ukrainekaw ka kɛnɛya ye jamana`
207
 
208
  **Context Size 2:**
209
 
210
- 1. `catégorie : afrika catégorie : cema amerika . onu kuntilenna fɔlɔ ye mi ma yiriwa kosɛbɛ .`
211
- 2. `ye . catégorie : faransi ka dugu ye . dugumogo be taa jon yooro . gallery sababou`
212
- 3. `’ a kulɛriw kan . hadamadenw ka josariyaw ni politikitɔnw ka bɛnkansɛbɛn ye min sigira senkan c`
213
 
214
  **Context Size 3:**
215
 
216
- 1. `éditions donniya . isbn 2 - 911741 - 04 - 8 . sababou catégorie : jɛgɛ`
217
- 2. `bambara . bamako : éditions donniya . isbn 2 - 911741 - 04 - 8 . sababou catégorie`
218
- 3. `français - bambara . bamako : éditions donniya . isbn 2 - 911741 - 04 - 8 .`
219
 
220
  **Context Size 4:**
221
 
222
- 1. `, charles . 2008 . dictionnaire français - bambara . bamako : éditions donniya . isbn 2 - 911741`
223
- 2. `. dictionnaire français - bambara . bamako : éditions donniya . isbn 2 - 911741 - 04 - 8`
224
- 3. `français - bambara . bamako : éditions donniya . isbn 2 - 911741 - 04 - 8 . sababou`
225
 
226
 
227
  ### Key Findings
228
 
229
- - **Best Predictability:** Context-4 with 96.9% predictability
230
  - **Branching Factor:** Decreases with context size (more deterministic)
231
- - **Memory Trade-off:** Larger contexts require more storage (73,789 contexts)
232
  - **Recommendation:** Context-3 or Context-4 for text generation
233
 
234
  ---
@@ -244,64 +310,64 @@ Below are text samples generated from each Markov chain model:
244
 
245
  | Metric | Value |
246
  |--------|-------|
247
- | Vocabulary Size | 7,195 |
248
- | Total Tokens | 102,263 |
249
- | Mean Frequency | 14.21 |
250
  | Median Frequency | 3 |
251
- | Frequency Std Dev | 106.29 |
252
 
253
  ### Most Common Words
254
 
255
  | Rank | Word | Frequency |
256
  |------|------|-----------|
257
- | 1 | ye | 4,480 |
258
- | 2 | ka | 4,400 |
259
- | 3 | a | 3,281 |
260
- | 4 | la | 1,931 |
261
- | 5 | ni | 1,900 |
262
- | 6 | bɛ | 1,844 |
263
- | 7 | na | 1,626 |
264
- | 8 | min | 1,192 |
265
- | 9 | o | 1,154 |
266
- | 10 | ani | 1,079 |
267
 
268
  ### Least Common Words (from vocabulary)
269
 
270
  | Rank | Word | Frequency |
271
  |------|------|-----------|
272
- | 1 | dakon | 2 |
273
- | 2 | taamaɲogonw | 2 |
274
- | 3 | abubakari | 2 |
275
- | 4 | ameniras | 2 |
276
- | 5 | kandasi | 2 |
277
- | 6 | qore | 2 |
278
- | 7 | amɔn | 2 |
279
- | 8 | bajiw | 2 |
280
- | 9 | dunbagaw | 2 |
281
- | 10 | mouvement | 2 |
282
 
283
  ### Zipf's Law Analysis
284
 
285
  | Metric | Value |
286
  |--------|-------|
287
- | Zipf Coefficient | 1.0134 |
288
- | R² (Goodness of Fit) | 0.984519 |
289
  | Adherence Quality | **excellent** |
290
 
291
  ### Coverage Analysis
292
 
293
  | Top N Words | Coverage |
294
  |-------------|----------|
295
- | Top 100 | 52.0% |
296
- | Top 1,000 | 79.0% |
297
- | Top 5,000 | 95.7% |
298
  | Top 10,000 | 100.0% |
299
 
300
  ### Key Findings
301
 
302
- - **Zipf Compliance:** R²=0.9845 indicates excellent adherence to Zipf's law
303
- - **High Frequency Dominance:** Top 100 words cover 52.0% of corpus
304
- - **Long Tail:** -2,805 words needed for remaining 100.0% coverage
305
 
306
  ---
307
  ## 5. Word Embeddings Evaluation
@@ -314,24 +380,113 @@ Below are text samples generated from each Markov chain model:
314
 
315
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
316
 
317
- ### Model Comparison
318
 
319
- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
320
- |-------|------------|-----------|----------|----------|----------|
321
- | **mono_32d** | 2,309 | 32 | 2.998 | 0.730 | 0.2244 🏆 |
322
- | **mono_64d** | 2,309 | 64 | 2.963 | 0.687 | 0.0668 |
323
- | **mono_128d** | 2,309 | 128 | 2.969 | 0.695 | 0.0115 |
324
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
325
 
326
  ### Key Findings
327
 
328
- - **Best Isotropy:** mono_32d with 0.2244 (more uniform distribution)
329
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
330
- - **Vocabulary Coverage:** All models cover 2,309 words
331
- - **Recommendation:** 100d for balanced semantic capture and efficiency
332
 
333
  ---
334
- ## 6. Summary & Recommendations
335
 
336
  ![Performance Dashboard](visualizations/performance_dashboard.png)
337
 
@@ -339,11 +494,12 @@ Below are text samples generated from each Markov chain model:
339
 
340
  | Component | Recommended | Rationale |
341
  |-----------|-------------|-----------|
342
- | Tokenizer | **32k BPE** | Best compression (4.07x) with low UNK rate |
343
- | N-gram | **5-gram** | Lowest perplexity (322) |
344
- | Markov | **Context-4** | Highest predictability (96.9%) |
345
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
346
 
 
347
  ---
348
  ## Appendix: Metrics Glossary & Interpretation Guide
349
 
@@ -533,7 +689,8 @@ If you use these models in your research, please cite:
533
  author = {Kamali, Omar},
534
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
535
  year = {2025},
536
- publisher = {HuggingFace},
537
  url = {https://huggingface.co/wikilangs}
538
  institution = {Omneity Labs}
539
  }
@@ -549,7 +706,8 @@ MIT License - Free for academic and commercial use.
549
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
550
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
551
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
552
  ---
553
  *Generated by Wikilangs Models Pipeline*
554
 
555
- *Report Date: 2025-12-28 05:28:29*

(New version pane: unchanged context lines plus added lines, prefixed with +)
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
+ value: 4.016
27
  - name: best_isotropy
28
  type: isotropy
29
+ value: 0.2668
30
  - name: vocabulary_size
31
  type: vocab
32
+ value: 6895
33
+ generated: 2026-01-03
34
  ---
35
 
36
  # BM - Wikilangs Models
 
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
+ - N-gram models (2, 3, 4, 5-gram)
48
+ - Markov chains (context of 1, 2, 3, 4 and 5)
49
  - Subword N-gram and Markov chains
50
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
51
  - Language Vocabulary
52
  - Language Statistics
53
+
54
  ![Performance Dashboard](visualizations/performance_dashboard.png)
55
 
56
  ### Analysis and Evaluation
 
60
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
61
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
62
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
63
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis-experimental)
64
+ - [7. Summary & Recommendations](#7-summary--recommendations)
65
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
66
  - [Visualizations Index](#visualizations-index)
67
 
 
70
 
71
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
72
 
73
+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
74
+
75
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
76
+
77
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
78
+
79
  ### Results
80
 
81
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
82
  |------------|-------------|---------------|----------|--------------|
83
+ | **8k** | 3.547x | 3.56 | 1.4000% | 104,860 |
84
+ | **16k** | 3.831x | 3.84 | 1.5119% | 97,096 |
85
+ | **32k** | 4.016x 🏆 | 4.03 | 1.5853% | 92,603 |
86
 
87
  ### Tokenization Examples
88
 
89
  Below are sample sentences tokenized with each vocabulary size:
90
 
91
+ **Sample 1:** `Denver ye Amerika ka Kelenyalen Jamanaw ka dugu ye. ka Kelenyalen Jamanaw ka dug...`
92
 
93
  | Vocab | Tokens | Count |
94
  |-------|--------|-------|
95
+ | 8k | `▁den ver ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+6 more)` | 16 |
96
+ | 16k | `▁den ver ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+6 more)` | 16 |
97
+ | 32k | `▁denver ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye . ... (+5 more)` | 15 |
98
 
99
+ **Sample 2:** `Dakar ye Senegali faamadugu ye. A be Atlantiki kɔgɔji da la. thumb|Dakar-Indépen...`
 
100
 
101
  | Vocab | Tokens | Count |
102
  |-------|--------|-------|
103
+ | 8k | `▁dakaryesenegalifaama dugu ye . abeatlantiki ... (+19 more)` | 29 |
104
+ | 16k | `▁dakaryesenegalifaamaduguye . abeatlantikikɔgɔji ... (+12 more)` | 22 |
105
+ | 32k | `▁dakaryesenegalifaamaduguye . abeatlantikikɔgɔji ... (+11 more)` | 21 |
106
 
107
+ **Sample 3:** `MugukɔnkɔnBailleul, Charles. Dictionnaire français-bambara. Bamako: Éditions Don...`
108
 
109
  | Vocab | Tokens | Count |
110
  |-------|--------|-------|
111
+ | 8k | `▁mugu kɔnkɔnbailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
112
+ | 16k | `▁mugu kɔnkɔnbailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
113
+ | 32k | `▁mugu kɔnkɔnbailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
114
 
115
 
116
  ### Key Findings
117
 
118
+ - **Best Compression:** 32k achieves 4.016x compression
119
+ - **Lowest UNK Rate:** 8k with 1.4000% unknown tokens
120
  - **Trade-off:** Larger vocabularies improve compression but increase model size
121
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
122
 
 
125
 
126
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
127
 
128
+ ![N-gram Unique](visualizations/ngram_unique.png)
129
+
130
  ![N-gram Coverage](visualizations/ngram_coverage.png)
131
 
132
  ### Results
133
 
134
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
135
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
136
+ | **2-gram** | Word | 923 | 9.85 | 2,067 | 40.5% | 82.4% |
137
+ | **2-gram** | Subword | 272 🏆 | 8.09 | 1,826 | 67.7% | 98.7% |
138
+ | **3-gram** | Word | 775 | 9.60 | 2,207 | 44.0% | 78.6% |
139
+ | **3-gram** | Subword | 1,884 | 10.88 | 9,873 | 29.9% | 74.7% |
140
+ | **4-gram** | Word | 2,048 | 11.00 | 5,635 | 33.1% | 51.1% |
141
+ | **4-gram** | Subword | 8,105 | 12.98 | 35,658 | 14.6% | 47.0% |
142
 
143
  ### Top 5 N-grams by Size
144
 
145
+ **2-grams (Word):**
146
+
147
+ | Rank | N-gram | Count |
148
+ |------|--------|-------|
149
+ | 1 | `ka dugu` | 526 |
150
+ | 2 | `charles dictionnaire` | 419 |
151
+ | 3 | `dictionnaire français` | 419 |
152
+ | 4 | `français bambara` | 419 |
153
+ | 5 | `bamako éditions` | 419 |
154
+
155
+ **3-grams (Word):**
156
+
157
+ | Rank | N-gram | Count |
158
+ |------|--------|-------|
159
+ | 1 | `bamako éditions donniya` | 419 |
160
+ | 2 | `éditions donniya isbn` | 419 |
161
+ | 3 | `dictionnaire français bambara` | 419 |
162
+ | 4 | `bambara bamako éditions` | 419 |
163
+ | 5 | `charles dictionnaire français` | 419 |
164
+
165
+ **4-grams (Word):**
166
 
167
  | Rank | N-gram | Count |
168
  |------|--------|-------|
169
+ | 1 | `dictionnaire français bambara bamako` | 419 |
170
+ | 2 | `bamako éditions donniya isbn` | 419 |
171
+ | 3 | `bambara bamako éditions donniya` | 419 |
172
+ | 4 | `français bambara bamako éditions` | 419 |
173
+ | 5 | `charles dictionnaire français bambara` | 419 |
174
 
175
+ **2-grams (Subword):**
176
 
177
  | Rank | N-gram | Count |
178
  |------|--------|-------|
179
+ | 1 | `a _` | 23,595 |
180
+ | 2 | `_ k` | 13,733 |
181
+ | 3 | `a n` | 13,570 |
182
+ | 4 | `n _` | 12,447 |
183
+ | 5 | `i _` | 9,856 |
184
 
185
+ **3-grams (Subword):**
186
 
187
  | Rank | N-gram | Count |
188
  |------|--------|-------|
189
+ | 1 | `_ k a` | 6,371 |
190
+ | 2 | `k a _` | 4,967 |
191
+ | 3 | `_ y e` | 4,581 |
192
+ | 4 | `a n _` | 4,011 |
193
+ | 5 | `n i _` | 3,940 |
194
+
195
+ **4-grams (Subword):**
196
+
197
+ | Rank | N-gram | Count |
198
+ |------|--------|-------|
199
+ | 1 | `_ k a _` | 4,307 |
200
+ | 2 | `_ y e _` | 3,197 |
201
+ | 3 | `_ b ɛ _` | 1,818 |
202
+ | 4 | `_ n i _` | 1,809 |
203
+ | 5 | `_ m i n` | 1,781 |
204
 
205
 
206
  ### Key Findings
207
 
208
+ - **Best Perplexity:** 2-gram (subword) with 272
209
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
210
  - **Coverage:** Top-1000 patterns cover ~47% of corpus
211
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
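To make the perplexity and entropy columns concrete, a toy sketch that recomputes them from raw counts; the parquet column names (`ngram`, `count`) are assumptions, since the schema is not shown on this page:

```python
# Toy sketch: entropy (bits) and perplexity (2^entropy) over the
# empirical n-gram distribution, plus top-k coverage as in the table
# above. Column names "ngram" and "count" are assumed, not documented.
import numpy as np
import pandas as pd

df = pd.read_parquet("models/word_ngram/bm_2gram_word.parquet")
p = (df["count"] / df["count"].sum()).to_numpy()

entropy = -(p * np.log2(p)).sum()
print(f"entropy={entropy:.2f} bits  perplexity={2 ** entropy:,.0f}")

top = np.sort(p)[::-1]                       # most frequent patterns first
print(f"top-100: {top[:100].sum():.1%}  top-1000: {top[:1000].sum():.1%}")
```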
 
215
 
216
  ![Markov Entropy](visualizations/markov_entropy.png)
217
 
218
+ ![Markov Contexts](visualizations/markov_contexts.png)
219
+
220
  ![Markov Branching](visualizations/markov_branching.png)
221
 
222
  ### Results
223
 
224
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
225
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
226
+ | **1** | Word | 0.5956 | 1.511 | 3.32 | 17,657 | 40.4% |
227
+ | **1** | Subword | 1.1635 | 2.240 | 8.41 | 480 | 0.0% |
228
+ | **2** | Word | 0.2005 | 1.149 | 1.41 | 58,338 | 80.0% |
229
+ | **2** | Subword | 0.9884 | 1.984 | 5.02 | 4,032 | 1.2% |
230
+ | **3** | Word | 0.0636 | 1.045 | 1.10 | 81,761 | 93.6% |
231
+ | **3** | Subword | 0.7361 | 1.666 | 3.15 | 20,227 | 26.4% |
232
+ | **4** | Word | 0.0198 🏆 | 1.014 | 1.03 | 89,097 | 98.0% |
233
+ | **4** | Subword | 0.5022 | 1.416 | 2.09 | 63,561 | 49.8% |
234
+
235
+ ### Generated Text Samples (Word-based)
236
+
237
+ Below are text samples generated from each word-based Markov chain model:
238
+
239
+ **Context Size 1:**
240
+
241
+ 1. `ka fasojamana ye yɛrɛmahɔrɔnya jamanaw ka bi cɛ bawo wariko gɛlɛya wɛrɛ fɛ iko ala dɔnbali`
242
+ 2. `ye kumajago senw tigɛli ninakili dɔnni don kɔsa in municipality of south africa art solo exhibition`
243
+ 3. `a bonya ye masala jagodon kalanbolo kɔnɔ k ɲ 26 ma peninsula mara la amadu ni`
244
+
245
+ **Context Size 2:**
246
+
247
+ 1. `éditions donniya isbn sababou kɔkan sirilanw lutrinae`
248
+ 2. `français bambara bamako éditions donniya isbn sababou kɔkan sirilanw donkey`
249
+ 3. `bamako éditions donniya isbn sababou kɔkan sirilanw lepus`
250
+
251
+ **Context Size 3:**
252
+
253
+ 1. `dictionnaire français bambara bamako éditions donniya isbn sababou kɔkan sirilanw tragelaphus spekii`
254
+ 2. `français bambara bamako éditions donniya isbn sababou kɔkan sirilanw hyaenidae link wikiquote en hye...`
255
+ 3. `éditions donniya isbn sababou kɔkan sirilanw hippotragus equinus`
256
+
257
+ **Context Size 4:**
258
+
259
+ 1. `bambara bamako éditions donniya isbn sababou kɔkan sirilanw tragelaphus spekii`
260
+ 2. `bamako éditions donniya isbn sababou kɔkan sirilanw tragelaphus spekii`
261
+ 3. `charles dictionnaire français bambara bamako éditions donniya isbn sababou kɔkan sirilanw herpestes ...`
262
 
 
263
 
264
+ ### Generated Text Samples (Subword-based)
265
+
266
+ Below are text samples generated from each subword-based Markov chain model:
267
 
268
  **Context Size 1:**
269
 
270
+ 1. `_beu_yin_samesi_`
271
+ 2. `akoni_swan'bɛn'u`
272
+ 3. `n_kɔn_a-s_koon,_`
273
 
274
  **Context Size 2:**
275
 
276
+ 1. `a_ba_ya._bɛ,_marr`
277
+ 2. `_k'a_ye_ka_sɔra_y`
278
+ 3. `ana-as_duguru,_mi`
279
 
280
  **Context Size 3:**
281
 
282
+ 1. `_ka_min_sababou_kɛ`
283
+ 2. `ka_dumuniorussin_t`
284
+ 3. `_ye_jamand_reviese`
285
 
286
  **Context Size 4:**
287
 
288
+ 1. `_ka_so_kɔnɔ_milleul`
289
+ 2. `_ye_siby_sidenw_ka_`
290
+ 3. `_bɛ_lajɛ_kilɛ_mali_`
291
 
292
 
293
  ### Key Findings
294
 
295
+ - **Best Predictability:** Context-4 (word) with 98.0% predictability
296
  - **Branching Factor:** Decreases with context size (more deterministic)
297
+ - **Memory Trade-off:** Larger contexts require more storage (63,561 contexts)
298
  - **Recommendation:** Context-3 or Context-4 for text generation
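The generated samples above come from weighted sampling over stored transitions; a toy sketch, assuming a schema with `context`, `next`, and `count` columns (not documented on this page):

```python
# Toy sketch: generate text from the context-2 word Markov model by
# weighted sampling of stored transitions. The columns "context",
# "next" and "count" are assumed; adapt to the real schema.
import pandas as pd

df = pd.read_parquet("models/word_markov/bm_markov_ctx2_word.parquet")

def generate(seed, steps=15):
    out = list(seed)
    for _ in range(steps):
        cand = df[df["context"] == " ".join(out[-2:])]
        if cand.empty:                        # unseen context, no smoothing
            break
        out.append(cand.sample(n=1, weights=cand["count"])["next"].iloc[0])
    return " ".join(out)

print(generate(["ka", "dugu"]))               # seed from the 2-gram table
```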
299
 
300
  ---
 
310
 
311
  | Metric | Value |
312
  |--------|-------|
313
+ | Vocabulary Size | 6,895 |
314
+ | Total Tokens | 95,713 |
315
+ | Mean Frequency | 13.88 |
316
  | Median Frequency | 3 |
317
+ | Frequency Std Dev | 106.18 |
318
 
319
  ### Most Common Words
320
 
321
  | Rank | Word | Frequency |
322
  |------|------|-----------|
323
+ | 1 | ye | 4,391 |
324
+ | 2 | ka | 4,364 |
325
+ | 3 | a | 3,308 |
326
+ | 4 | la | 1,918 |
327
+ | 5 | ni | 1,903 |
328
+ | 6 | bɛ | 1,828 |
329
+ | 7 | na | 1,625 |
330
+ | 8 | min | 1,195 |
331
+ | 9 | o | 1,160 |
332
+ | 10 | ani | 1,074 |
333
 
334
  ### Least Common Words (from vocabulary)
335
 
336
  | Rank | Word | Frequency |
337
  |------|------|-----------|
338
+ | 1 | diverse | 2 |
339
+ | 2 | cryptography | 2 |
340
+ | 3 | career | 2 |
341
+ | 4 | this | 2 |
342
+ | 5 | corp | 2 |
343
+ | 6 | strathspey | 2 |
344
+ | 7 | holdings | 2 |
345
+ | 8 | firm | 2 |
346
+ | 9 | allergan | 2 |
347
+ | 10 | hybe | 2 |
348
 
349
  ### Zipf's Law Analysis
350
 
351
  | Metric | Value |
352
  |--------|-------|
353
+ | Zipf Coefficient | 1.0043 |
354
+ | R² (Goodness of Fit) | 0.984602 |
355
  | Adherence Quality | **excellent** |
356
 
357
  ### Coverage Analysis
358
 
359
  | Top N Words | Coverage |
360
  |-------------|----------|
361
+ | Top 100 | 52.1% |
362
+ | Top 1,000 | 79.1% |
363
+ | Top 5,000 | 96.0% |
364
  | Top 10,000 | 0.0% |
365
 
366
  ### Key Findings
367
 
368
+ - **Zipf Compliance:** R²=0.9846 indicates excellent adherence to Zipf's law
369
+ - **High Frequency Dominance:** Top 100 words cover 52.1% of corpus
370
+ - **Long Tail:** -3,105 words needed for remaining 100.0% coverage
371
 
372
  ---
373
  ## 5. Word Embeddings Evaluation
 
380
 
381
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
382

 
384
+ ### 5.1 Cross-Lingual Alignment
385
+
386
+ > *Note: Multilingual alignment visualization not available for this language.*
387
+
388
+
389
+ ### 5.2 Model Comparison
390
+
391
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
392
+ |-------|-----------|----------|------------------|---------------|----------------|
393
+ | **mono_32d** | 32 | 0.2668 🏆 | 0.5000 | N/A | N/A |
394
+ | **mono_64d** | 64 | 0.0657 | 0.5219 | N/A | N/A |
395
+ | **mono_128d** | 128 | 0.0127 | 0.4839 | N/A | N/A |
396
 
397
  ### Key Findings
398
 
399
+ - **Best Isotropy:** mono_32d with 0.2668 (more uniform distribution)
400
+ - **Semantic Density:** Average pairwise similarity of 0.5020. Lower values indicate better semantic separation.
401
+ - **Alignment Quality:** No aligned models evaluated in this run.
402
+ - **Recommendation:** 128d for best semantic capture (no aligned models were evaluated in this run)
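The page does not state which isotropy formula the pipeline uses; one common choice is the partition-function ratio of Mu & Viswanath (2018), sketched below on a random stand-in matrix together with mean pairwise cosine similarity for the semantic-density figure:

```python
# Illustrative sketch on a random stand-in matrix (loading the .bin
# embeddings is not covered on this page). Isotropy here is the
# partition-function ratio of Mu & Viswanath (2018); whether the report
# uses this exact formula is an assumption.
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((2232, 32))          # stand-in for bm_32d vectors

# isotropy ~ min_c Z(c) / max_c Z(c) over principal directions c
_, _, Vt = np.linalg.svd(E, full_matrices=False)
Z = np.exp(E @ Vt.T).sum(axis=0)
print("isotropy:", Z.min() / Z.max())

# semantic density: mean pairwise cosine similarity
En = E / np.linalg.norm(E, axis=1, keepdims=True)
sims = En @ En.T
print("density:", sims[np.triu_indices_from(sims, k=1)].mean())
```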
403
 
404
  ---
405
+ ## 6. Morphological Analysis (Experimental)
406
+
407
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
408
+
409
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
410
+
411
+ ### 6.1 Productivity & Complexity
412
+
413
+ | Metric | Value | Interpretation | Recommendation |
414
+ |--------|-------|----------------|----------------|
415
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
416
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
417
+
418
+ ### 6.2 Affix Inventory (Productive Units)
419
+
420
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
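A toy sketch of that substitutability test, scoring candidate suffixes on a tiny word sample drawn from the stem tables later in this section (prefixes would be handled symmetrically); the heuristic is illustrative, not the pipeline's actual algorithm:

```python
# Toy sketch: a suffix scores a point for every stem that takes it
# while also occurring standalone or with a different ending
# (substitutability). Illustrative heuristic only.
from collections import defaultdict

vocab = {"sɛbɛn", "sɛbɛnw", "sɛbɛnni", "kalan", "kalanw", "dugu", "duguw"}

endings = defaultdict(set)      # candidate stem -> endings ("" = standalone)
for w in vocab:
    endings[w].add("")
    for cut in range(2, len(w)):
        endings[w[:cut]].add(w[cut:])

score = defaultdict(int)
for stem, ends in endings.items():
    if len(ends) > 1:           # stem appears in more than one context
        for e in ends:
            if e:
                score[e] += 1

print(sorted(score.items(), key=lambda kv: -kv[1])[:5])
```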
421
+
422
+ #### Productive Prefixes
423
+ | Prefix | Examples |
424
+ |--------|----------|
425
+ | `ma-` | maracogo, maraka, macron |
426
+
427
+ #### Productive Suffixes
428
+ | Suffix | Examples |
429
+ |--------|----------|
430
+ | `-a` | dɔnnikɛla, zanbia, miriya |
431
+ | `-an` | balansan, abubuwan, 15nan |
432
+
433
+ ### 6.3 Bound Stems (Lexical Roots)
434
+
435
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
436
+
437
+ | Stem | Cohesion | Substitutability | Examples |
438
+ |------|----------|------------------|----------|
439
+ | `alan` | 1.65x | 24 contexts | kalan, palan, balan |
440
+ | `riya` | 1.81x | 11 contexts | suriya, miriya, sariya |
441
+ | `alen` | 1.45x | 20 contexts | falen, dalen, jalen |
442
+ | `aara` | 1.72x | 12 contexts | maara, taara, yaara |
443
+ | `aman` | 1.31x | 25 contexts | daman, saman, faman |
444
+ | `elen` | 1.65x | 12 contexts | yelen, kelen, selen |
445
+ | `ɔgɔn` | 1.75x | 9 contexts | nɔgɔn, ɲɔgɔn, nyɔgɔn |
446
+ | `ɛbɛn` | 1.82x | 8 contexts | sɛbɛn, sɛbɛnw, sɛbɛnni |
447
+ | `anka` | 1.51x | 13 contexts | mankan, yankan, dankan |
448
+ | `amin` | 1.51x | 13 contexts | damina, daminè, damine |
449
+ | `nkan` | 1.35x | 14 contexts | bɛnkan, benkan, mankan |
450
+ | `kili` | 1.49x | 10 contexts | hakili, kilisi, nkiliki |
451
+
452
+ ### 6.4 Affix Compatibility (Co-occurrence)
453
+
454
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
455
+
456
+ | Prefix | Suffix | Frequency | Examples |
457
+ |--------|--------|-----------|----------|
458
+ | `ma-` | `-a` | 22 words | maa, maara |
458
+ | `ma-` | `-an` | 8 words | masasigilan, mankaan |
460
+
461
+ ### 6.5 Recursive Morpheme Segmentation
462
+
463
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
464
+
465
+ | Word | Suggested Split | Confidence | Stem |
466
+ |------|-----------------|------------|------|
467
+ | maninkakan | **`ma-ninkak-an`** | 3.0 | `ninkak` |
468
+ | tlasabanan | **`tlasab-an-an`** | 3.0 | `tlasab` |
469
+ | machecoul | **`ma-checoul`** | 1.5 | `checoul` |
470
+ | marabolow | **`ma-rabolow`** | 1.5 | `rabolow` |
471
+ | woroduguyanfan | **`woroduguyanf-an`** | 1.5 | `woroduguyanf` |
472
+ | binkannikɛlan | **`binkannikɛl-an`** | 1.5 | `binkannikɛl` |
473
+ | masakɛmuso | **`ma-sakɛmuso`** | 1.5 | `sakɛmuso` |
474
+ | sɛnɛfɔkan | **`sɛnɛfɔk-an`** | 1.5 | `sɛnɛfɔk` |
475
+ | ispanyikan | **`ispanyik-an`** | 1.5 | `ispanyik` |
476
+ | tubabukan | **`tubabuk-an`** | 1.5 | `tubabuk` |
477
+ | maramafen | **`ma-ramafen`** | 1.5 | `ramafen` |
478
+ | balikukalan | **`balikukal-an`** | 1.5 | `balikukal` |
479
+ | ukrayinakan | **`ukrayinak-an`** | 1.5 | `ukrayinak` |
480
+ | marisikalo | **`ma-risikalo`** | 1.5 | `risikalo` |
481
+ | matarafali | **`ma-tarafali`** | 1.5 | `tarafali` |
482
+
483
+ ### 6.6 Linguistic Interpretation
484
+
485
+ > **Automated Insight:**
486
+ The language BM appears to be relatively isolating, or to have a highly fixed vocabulary: word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
487
+
488
+ ---
489
+ ## 7. Summary & Recommendations
490
 
491
  ![Performance Dashboard](visualizations/performance_dashboard.png)
492
 
 
494
 
495
  | Component | Recommended | Rationale |
496
  |-----------|-------------|-----------|
497
+ | Tokenizer | **32k BPE** | Best compression (4.02x) |
498
+ | N-gram | **2-gram** | Lowest perplexity (272) |
499
+ | Markov | **Context-4** | Highest predictability (98.0%) |
500
  | Embeddings | **128d** | Best semantic capture of the available sizes (32d has the best isotropy) |
501
 
502
+
503
  ---
504
  ## Appendix: Metrics Glossary & Interpretation Guide
505
 
 
689
  author = {Kamali, Omar},
690
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
691
  year = {2025},
692
+ doi = {10.5281/zenodo.18073153},
693
+ publisher = {Zenodo},
694
  url = {https://huggingface.co/wikilangs}
695
  institution = {Omneity Labs}
696
  }
 
706
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
707
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
708
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
709
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
710
  ---
711
  *Generated by Wikilangs Models Pipeline*
712
 
713
+ *Report Date: 2026-01-03 07:27:32*
models/embeddings/monolingual/bm_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7f0d7c8930252db4f75ccff26903504087a5598f8e89d5877efdc3838fa71e4e
3
- size 1026401129
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:60554973b7e758378671bf2709b7707495d4d05f0f6d5aa74b560379597f5337
3
+ size 1026320797
models/embeddings/monolingual/bm_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 128,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 2309
13
  }
 
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 128
13
  },
14
+ "vocab_size": 2232
15
  }
models/embeddings/monolingual/bm_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ecbf3cf75d36d5786a480590916fad5d6bc30b12637de0b8e671f92799273b4e
3
- size 256627817
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d397f946ef2ae3aef026fefc210ae2f287c5d21b70e2355dd7bf1e9ad36afa1c
3
+ size 256606621
models/embeddings/monolingual/bm_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 32,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 2309
13
  }
 
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 32
13
  },
14
+ "vocab_size": 2232
15
  }
models/embeddings/monolingual/bm_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:03ddc93065212141e650c80ef4273e04a2ec47432a07bd4dc33ebf690234049c
3
- size 513218921
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:59a556264dfd6bffcee58e216ac1d731f3135622dfe1812992c391a2191779d2
3
+ size 513178013
models/embeddings/monolingual/bm_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 64,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 2309
13
  }
 
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 64
13
  },
14
+ "vocab_size": 2232
15
  }
models/subword_markov/bm_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0819a99ba610dc7d0095d059a8cd0fbf0e93d96340fab94bfe408f6e78c431e4
3
- size 40112
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:75665f8001ac1c98685e9eac5bd7af0ed39fc626141b1fd4cef1b034c92d438d
3
+ size 35393
models/subword_markov/bm_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_contexts": 508,
6
- "total_transitions": 649338
7
  }
 
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_contexts": 480,
6
+ "total_transitions": 596087
7
  }
models/subword_markov/bm_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6bb80b6a46cd1c2d4387947a64a191c8c3a0e17cb30ee46dd9853eb3f4e28c6a
3
- size 184560
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:89493aa503c873924d3cc26fe8b12683b9ea51449a0ff9dd9a1825aa759bcfbf
3
+ size 160110
models/subword_markov/bm_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_contexts": 4659,
6
- "total_transitions": 648039
7
  }
 
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_contexts": 4032,
6
+ "total_transitions": 594884
7
  }
models/subword_markov/bm_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2346a42c76aff72fd2b3fe02f559333cd9dda578b8ced3ba260d8cf6e2171a4e
3
- size 546787
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c678a21c70cd7e4a8724cc5c8781d9a005a7b1153d5cfab9672b1e53bfdb3c44
3
+ size 480692
models/subword_markov/bm_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_contexts": 24092,
6
- "total_transitions": 646740
7
  }
 
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_contexts": 20227,
6
+ "total_transitions": 593681
7
  }
models/subword_markov/bm_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a1d7257d4abefb8c32f0420d2d9fc0d913bcdb5e4772236891d7490a5264704a
3
- size 1226126
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d36df5ddcd8f68677e8aeeb781950fca4c726d9b697ac0720d4fb6f5eb94a5a2
3
+ size 1088116
models/subword_markov/bm_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_contexts": 73789,
6
- "total_transitions": 645441
7
  }
 
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_contexts": 63561,
6
+ "total_transitions": 592478
7
  }
models/subword_ngram/bm_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:06ce57f95742b61708e17e229d33f174f2dabe7ae11cfc1872a7a6182573cfcc
3
- size 26169
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fe3125074954dbce7e5c347fb5b53e5f30cec7775dfd663157c67fbdc9e255bc
3
+ size 23022
models/subword_ngram/bm_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_ngrams": 2127,
6
- "total_ngrams": 649338
7
  }
 
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_ngrams": 1826,
6
+ "total_ngrams": 596087
7
  }
models/subword_ngram/bm_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2f6c3f360cbad86d31de6197b81525e30f7a976dc638b3ca0ae073b2a98553a6
3
- size 131173
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c97c794d987cb73cd6c3301a1516b8e4889869836d3d9c5369016ba8fe9c0fb7
3
+ size 113452
models/subword_ngram/bm_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_ngrams": 11428,
6
- "total_ngrams": 648039
7
  }
 
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_ngrams": 9873,
6
+ "total_ngrams": 594884
7
  }
models/subword_ngram/bm_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:66a769c12a2d8a97eff8c9d92dea9868be0c798775b864d08da7c4cc28cc988f
3
- size 486277
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dbf91bd19db052de26308667adfa16bf3463044ec8463971cda61775082b87c4
3
+ size 431891
models/subword_ngram/bm_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_ngrams": 39905,
6
- "total_ngrams": 646740
7
  }
 
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_ngrams": 35658,
6
+ "total_ngrams": 593681
7
  }
models/tokenizer/bm_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9296d04641b8a3567ec23dd58f2570e88576b52be897baa8365b0b75d62817db
3
- size 512912
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:55b76e9844ea83477a53ca0eacddd2c822efad2b886d04d9e9ba8ce2715dfacf
3
+ size 514654
models/tokenizer/bm_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/bm_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5c3728738bd953c0b0980dea17440614b9fc5c1944fd6a606c0819bd83da8f70
3
- size 801851
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6a3490f95bc0ae3183c66cee9b8b34a693f115b059216203609f99d8409129f6
3
+ size 764320
models/tokenizer/bm_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/bm_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b7ebf2d24f769b1286b48c8e35b7389a0e0e1303d97a28ec9d43ae1701d59f89
3
- size 371453
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9361cd8391a5d749408299b263a632521ad4a28af1108df637b4399ef06a96bd
3
+ size 372567
models/tokenizer/bm_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/bm_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3d54a1977879161f36d59485af9d41f9f5140c9cf6590d94c12addac73e06114
3
- size 116321
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:95cadef0ad07f1fcd065af0e35aa2cf7c22454a9e8a9d71f454288dae5ddf109
3
+ size 110665
models/vocabulary/bm_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
1
  {
2
  "language": "bm",
3
- "vocabulary_size": 7195,
 
4
  "statistics": {
5
- "type_token_ratio": 0.16176133457950517,
6
  "coverage": {
7
- "top_100": 0.46858412541661526,
8
- "top_1000": 0.7123194667325021,
9
- "top_5000": 0.8629710617736788,
10
- "top_10000": 0.9264112014389758
11
  },
12
- "hapax_count": 11151,
13
- "hapax_ratio": 0.607816417747738,
14
- "total_documents": 1299
15
  }
16
  }
 
1
  {
2
  "language": "bm",
3
+ "vocabulary_size": 6895,
4
+ "variant": "full",
5
  "statistics": {
6
+ "type_token_ratio": 0.16638822668143338,
7
  "coverage": {
8
+ "top_100": 0.4682859985358437,
9
+ "top_1000": 0.7107728117432845,
10
+ "top_5000": 0.8627541155932649,
11
+ "top_10000": 0.9274679481163066
12
  },
13
+ "hapax_count": 10833,
14
+ "hapax_ratio": 0.611067238267148,
15
+ "total_documents": 1203
16
  }
17
  }
models/word_markov/bm_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c5771841e705a9dd2e47d794be576e0f304212cdfd165b9c86e4a3bbf40bc589
3
- size 575996
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:789e3b5dbec8a4180554670cf6c8140274009e31544768738d599464c4f97eaf
3
+ size 539344
models/word_markov/bm_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_contexts": 18421,
6
- "total_transitions": 140738
7
  }
 
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_contexts": 17657,
6
+ "total_transitions": 105343
7
  }
models/word_markov/bm_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:75d31f3bd92ad9f4e4d1c4e921aeabc69c0202e90bc20dd9e3b9f2822ae86f7f
3
- size 1171294
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:887a80e826ca473a6ab2e880642feffde94dc8b35a379b2e1de27ffb35ee82e6
3
+ size 1068878
models/word_markov/bm_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_contexts": 65151,
6
- "total_transitions": 139439
7
  }
 
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_contexts": 58338,
6
+ "total_transitions": 104140
7
  }
models/word_markov/bm_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3fa392bd3a195161417c5b8dfa8792082e634a6a83b57348f2a245ae4a04efba
3
- size 1584409
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4ac6a4a0d3b0dfb35d2dc573351f32014393c9e3b47f7a32c3ea6caa6ee45484
3
+ size 1381560
models/word_markov/bm_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_contexts": 96171,
6
- "total_transitions": 138140
7
  }
 
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_contexts": 81761,
6
+ "total_transitions": 102937
7
  }
models/word_markov/bm_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:fa556a96029c6bafa0afbbae6795406371cd660317dcbb46cb396c604196f52b
3
- size 1827580
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0e2957d52fc463087fedd8a81ef2b6b897a11b8fddb0b3792bd16af7db14a60c
3
+ size 1567499
models/word_markov/bm_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_contexts": 108554,
6
- "total_transitions": 136845
7
  }
 
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_contexts": 89097,
6
+ "total_transitions": 101734
7
  }
models/word_ngram/bm_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7b5007949f6bbf42d027ff51fc5b8f9e4e55c91f7a05c383cb4a9d9747400c3c
3
- size 37545
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:899da362fc834616e68542ec185ca203675f5d217635520a21cbc90524ff1dae
3
+ size 28426
models/word_ngram/bm_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_ngrams": 2854,
6
- "total_ngrams": 140738
7
  }
 
2
  "n": 2,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_ngrams": 2067,
6
+ "total_ngrams": 105343
7
  }
models/word_ngram/bm_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f60674a1667559b06bf2d6d1c02809b85928a179d949721898f660c388b27feb
3
- size 52891
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:01e363967c6c38f500df23393a86d4d9f2542632f1ab459b68620240a7675f48
3
+ size 34597
models/word_ngram/bm_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_ngrams": 3554,
6
- "total_ngrams": 139439
7
  }
 
2
  "n": 3,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_ngrams": 2207,
6
+ "total_ngrams": 104140
7
  }
models/word_ngram/bm_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1e75ace9be8417983bb1c726ad403d1c0882df8c7f950705aedd826360a33ad7
3
- size 136064
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:02bbc45d87f3c4629eeb12cf6577258882d57257531b040d948037fc59c08083
3
+ size 97246
models/word_ngram/bm_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_ngrams": 8337,
6
- "total_ngrams": 138140
7
  }
 
2
  "n": 4,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_ngrams": 5635,
6
+ "total_ngrams": 102937
7
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details

  • SHA256: 865d3f432225d6e9fa27c3b399fbbf0f61c7c1ea14a4f190ea27238bc566bf7b
  • Pointer size: 131 Bytes
  • Size of remote file: 152 kB

Git LFS Details

  • SHA256: b8f4299d1b66858edaf5e2d0b404260fffd816d51764301a8fe36d23161bfdf2
  • Pointer size: 131 Bytes
  • Size of remote file: 152 kB
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED