Upload all models and assets for bug (20251001)
This view is limited to 50 files because it contains too many changes.
- README.md +301 -136
- models/embeddings/monolingual/bug_128d.bin +2 -2
- models/embeddings/monolingual/bug_128d_metadata.json +5 -3
- models/embeddings/monolingual/bug_32d.bin +2 -2
- models/embeddings/monolingual/bug_32d_metadata.json +5 -3
- models/embeddings/monolingual/bug_64d.bin +2 -2
- models/embeddings/monolingual/bug_64d_metadata.json +5 -3
- models/subword_markov/bug_markov_ctx1_subword.parquet +2 -2
- models/subword_markov/bug_markov_ctx1_subword_metadata.json +2 -2
- models/subword_markov/bug_markov_ctx2_subword.parquet +2 -2
- models/subword_markov/bug_markov_ctx2_subword_metadata.json +2 -2
- models/subword_markov/bug_markov_ctx3_subword.parquet +2 -2
- models/subword_markov/bug_markov_ctx3_subword_metadata.json +2 -2
- models/subword_markov/bug_markov_ctx4_subword.parquet +2 -2
- models/subword_markov/bug_markov_ctx4_subword_metadata.json +2 -2
- models/subword_ngram/bug_2gram_subword.parquet +2 -2
- models/subword_ngram/bug_2gram_subword_metadata.json +2 -2
- models/subword_ngram/bug_3gram_subword.parquet +2 -2
- models/subword_ngram/bug_3gram_subword_metadata.json +2 -2
- models/subword_ngram/bug_4gram_subword.parquet +2 -2
- models/subword_ngram/bug_4gram_subword_metadata.json +2 -2
- models/tokenizer/bug_tokenizer_16k.model +2 -2
- models/tokenizer/bug_tokenizer_16k.vocab +0 -0
- models/tokenizer/bug_tokenizer_32k.model +2 -2
- models/tokenizer/bug_tokenizer_32k.vocab +0 -0
- models/tokenizer/bug_tokenizer_8k.model +2 -2
- models/tokenizer/bug_tokenizer_8k.vocab +0 -0
- models/vocabulary/bug_vocabulary.parquet +2 -2
- models/vocabulary/bug_vocabulary_metadata.json +10 -9
- models/word_markov/bug_markov_ctx1_word.parquet +2 -2
- models/word_markov/bug_markov_ctx1_word_metadata.json +2 -2
- models/word_markov/bug_markov_ctx2_word.parquet +2 -2
- models/word_markov/bug_markov_ctx2_word_metadata.json +2 -2
- models/word_markov/bug_markov_ctx3_word.parquet +2 -2
- models/word_markov/bug_markov_ctx3_word_metadata.json +2 -2
- models/word_markov/bug_markov_ctx4_word.parquet +2 -2
- models/word_markov/bug_markov_ctx4_word_metadata.json +2 -2
- models/word_ngram/bug_2gram_word.parquet +2 -2
- models/word_ngram/bug_2gram_word_metadata.json +2 -2
- models/word_ngram/bug_3gram_word.parquet +2 -2
- models/word_ngram/bug_3gram_word_metadata.json +2 -2
- models/word_ngram/bug_4gram_word.parquet +2 -2
- models/word_ngram/bug_4gram_word_metadata.json +2 -2
- visualizations/embedding_isotropy.png +0 -0
- visualizations/embedding_norms.png +0 -0
- visualizations/embedding_similarity.png +2 -2
- visualizations/markov_branching.png +0 -0
- visualizations/markov_contexts.png +0 -0
- visualizations/markov_entropy.png +0 -0
- visualizations/model_sizes.png +0 -0
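Each binary asset in this list changes by `+2 -2` because only its Git LFS pointer (oid and size) is rewritten, as the per-file diffs below show. A minimal sketch of fetching one asset with `huggingface_hub`; the repo id used here is a placeholder, since this view does not name the repository.

```python
# Sketch: fetching one LFS-backed asset from the Hub.
# "wikilangs/bug" is a hypothetical repo id; substitute the repository this
# commit actually belongs to.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="wikilangs/bug",
    filename="models/vocabulary/bug_vocabulary.parquet",
)
print(path)
```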
README.md
CHANGED

Removed lines (left side of the diff; most values are truncated in this view):

@@ -23,14 +23,14 @@ dataset_info:
-    value:
-    value: 0.
-    value:
-generated:

@@ -44,12 +44,13 @@
-- N-gram models (2, 3, 4-gram)
-- Markov chains (context of 1, 2, 3 and
-- Embeddings in various sizes and dimensions

@@ -59,7 +60,8 @@
-- [6.

@@ -68,59 +70,53 @@
-| **8k** |
-| **16k** |
-| **32k** |
-| **64k** | 3.778x 🏆 | 3.69 | 0.3758% | 47,094 |
-**Sample 1:** `
-Ita to
-Komun r...`
-| 8k | `▁
-| 16k | `▁
-| 32k | `▁
-| 64k | `▁daours ▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ▁somme ▁ri ▁perancis . ... (+11 more)` | 21 |
-**Sample 2:** `Damas-et-Bettegney iyanaritu séuwa komun ri déparetema Vosges ri Perancis.
-Ita...`
-| 8k | `▁
-| 16k | `▁
-| 32k | `▁
-| 64k | `▁damas - et - bettegney ▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ... (+15 more)` | 25 |
-**Sample 3:** `Ita to
-Komun ri déparetema Aisne
-Kategori:Komun ri Aisne`
-| 8k | `▁
-| 16k | `▁
-| 32k | `▁
-| 64k | `▁ita ▁to ▁komun ▁ri ▁déparetema ▁aisne ▁kategori : komun ▁ri ... (+1 more)` | 11 |
-- **Best Compression:**
-- **Lowest UNK Rate:** 8k with 0.

@@ -129,57 +125,89 @@
-| N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
-| **2-gram** |
-| **2-gram** |
-| **3-gram** |
-| **3-gram** |
-| **4-gram** |
-| **4-gram** |
-**2-grams:**
-| 1 | `komun ri` |
-| 3 | `kategori
-| 4 |
-| 5 | `
-**3-grams:**
-| 2 | `kategori
-| 3 |
-| 5 | `
-**
-| 1 | `
-| 2 | `
-| 3 | `
-| 4 |
-| 5 | `
-- **Best Perplexity:** 2-gram with
-- **Coverage:** Top-1000 patterns cover ~

@@ -187,55 +215,86 @@
-| Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
-| **1** | 0.
-| **1** | 0.
-| **2** | 0.
-| **2** | 0.
-| **3** | 0.
-| **3** | 0.
-| **4** | 0.
-| **4** | 0.
-### Generated Text Samples
-Below are text samples generated from each Markov chain model:
-1. `ri déparetema
-2.
-3. `
-1. `komun ri
-2. `ri déparetema
-3. `kategori
-1. `komun ri déparetema
-2. `kategori
-3.
-1. `
-2. `to komun ri déparetema vosges kategori
-3. `ita to komun ri déparetema
-- **Best Predictability:** Context-4 with
-- **Memory Trade-off:** Larger contexts require more storage (

@@ -251,64 +310,64 @@
-| Vocabulary Size |
-| Total Tokens |
-| Mean Frequency |
-| Frequency Std Dev |
-| 1 | ri | 55,
-| 2 | komun |
-| 4 | kategori | 15,
-| 5 | to | 14,
-| 6 | ita | 13,
-| 7 | iyanaritu | 13,
-| 8 | séuwa | 13,
-| 10 | haute | 6,
-| 1 | … | 10 |   (least-common-word rows, values truncated)
-| Zipf Coefficient | 0.
-| R² (Goodness of Fit) | 0.
-| Top 100 |
-| Top 1,000 |
-| Top 5,000 |
-| Top 10,000 |
-- **Zipf Compliance:** R²=0.
-- **High Frequency Dominance:** Top 100 words cover
-- **Long Tail:**

@@ -321,24 +380,127 @@
-### Model Comparison
-- **Best Isotropy:** mono_32d with 0.
-- **
-- **
-- **Recommendation:**
-## 6.

@@ -346,11 +508,12 @@
-| Tokenizer | **32k BPE** | Best compression (
-| N-gram | **
-| Markov | **Context-4** | Highest predictability (

@@ -540,7 +703,8 @@
-

@@ -556,7 +720,8 @@
-*Report Date:
Added and updated lines (right side of the diff):

[…]

metrics:
  - name: best_compression_ratio
    type: compression
    value: 4.924
  - name: best_isotropy
    type: isotropy
    value: 0.0631
  - name: vocabulary_size
    type: vocab
    value: 0
generated: 2026-01-03
---

# BUG - Wikilangs Models

[…]

### Models & Assets

- Tokenizers (8k, 16k, 32k, 64k)
- N-gram models (2, 3, 4, 5-gram)
- Markov chains (context of 1, 2, 3, 4 and 5)
- Subword N-gram and Markov chains
- Embeddings in various sizes and dimensions (aligned and unaligned)
- Language Vocabulary
- Language Statistics

### Analysis and Evaluation

[…]

- [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
- [4. Vocabulary Analysis](#4-vocabulary-analysis)
- [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
- [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
- [7. Summary & Recommendations](#7-summary--recommendations)
- [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
- [Visualizations Index](#visualizations-index)

[…]
### Results

| Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
|------------|-------------|---------------|----------|--------------|
| **8k**  | 4.284x    | 4.31 | 0.4916% | 36,818 |
| **16k** | 4.514x    | 4.54 | 0.5180% | 34,940 |
| **32k** | 4.924x 🏆 | 4.96 | 0.5650% | 32,035 |

### Tokenization Examples

Below are sample sentences tokenized with each vocabulary size:

**Sample 1:** `Ita to Komun ri déparetema Allier Kategori:Komun ri Allier`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k  | `▁ita ▁to ▁komun ▁ri ▁déparetema ▁allier ▁kategori : komun ▁ri ... (+1 more)` | 11 |
| 16k | `▁ita ▁to ▁komun ▁ri ▁déparetema ▁allier ▁kategori : komun ▁ri ... (+1 more)` | 11 |
| 32k | `▁ita ▁to ▁komun ▁ri ▁déparetema ▁allier ▁kategori : komun ▁ri ... (+1 more)` | 11 |

**Sample 2:** `iyanaritu séuwa komun ri déparetema Manche ri Perancis. Ita to Komun ri déparete...`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k  | `▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ▁manche ▁ri ▁perancis . ▁ita ... (+10 more)` | 20 |
| 16k | `▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ▁manche ▁ri ▁perancis . ▁ita ... (+10 more)` | 20 |
| 32k | `▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ▁manche ▁ri ▁perancis . ▁ita ... (+10 more)` | 20 |

**Sample 3:** `iyanaritu séuwa komun ri déparetema Gard ri Perancis. Ita to Komun ri déparetema...`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k  | `▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ▁gard ▁ri ▁perancis . ▁ita ... (+10 more)` | 20 |
| 16k | `▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ▁gard ▁ri ▁perancis . ▁ita ... (+10 more)` | 20 |
| 32k | `▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ▁gard ▁ri ▁perancis . ▁ita ... (+10 more)` | 20 |

### Key Findings

- **Best Compression:** 32k achieves 4.924x compression
- **Lowest UNK Rate:** 8k with 0.4916% unknown tokens
- **Trade-off:** Larger vocabularies improve compression but increase model size
- **Recommendation:** 32k vocabulary provides optimal balance for production use
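The `.model`/`.vocab` pairs under `models/tokenizer/` suggest SentencePiece, though the diff itself does not say so. A minimal sketch of reproducing a compression figure under two assumptions: the files load with the SentencePiece Python API, and "compression" means source characters per emitted token.

```python
# Sketch only: assumes the .model files are SentencePiece models and that
# "compression" is characters per token; both are assumptions, not documented facts.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="models/tokenizer/bug_tokenizer_32k.model")

text = "Ita to Komun ri déparetema Allier Kategori:Komun ri Allier"
tokens = sp.encode(text, out_type=str)

print(tokens)
print(f"{len(tokens)} tokens, {len(text) / len(tokens):.3f} chars/token")
```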
[…]

### Results

| N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
|--------|---------|------------|---------|----------------|------------------|-------------------|
| **2-gram** | Word    | 75 🏆 | 6.23 | 1,721  | 84.8% | 98.5% |
| **2-gram** | Subword | 168   | 7.39 | 2,166  | 81.3% | 99.5% |
| **3-gram** | Word    | 118   | 6.89 | 2,060  | 74.9% | 98.6% |
| **3-gram** | Subword | 512   | 9.00 | 10,883 | 62.7% | 89.5% |
| **4-gram** | Word    | 228   | 7.84 | 4,992  | 61.5% | 96.5% |
| **4-gram** | Subword | 938   | 9.87 | 41,978 | 58.6% | 80.3% |

### Top 5 N-grams by Size

**2-grams (Word):**

| Rank | N-gram | Count |
|------|--------|-------|
| 1 | `komun ri` | 40,954 |
| 2 | `ri déparetema` | 25,713 |
| 3 | `kategori komun` | 15,119 |
| 4 | `ita to` | 13,903 |
| 5 | `to komun` | 13,889 |

**3-grams (Word):**

| Rank | N-gram | Count |
|------|--------|-------|
| 1 | `komun ri déparetema` | 25,709 |
| 2 | `kategori komun ri` | 15,118 |
| 3 | `to komun ri` | 13,889 |
| 4 | `ita to komun` | 13,889 |
| 5 | `iyanaritu séuwa komun` | 13,324 |

**4-grams (Word):**

| Rank | N-gram | Count |
|------|--------|-------|
| 1 | `ita to komun ri` | 13,889 |
| 2 | `to komun ri déparetema` | 13,889 |
| 3 | `perancis ita to komun` | 12,095 |
| 4 | `iyanaritu séuwa komun ri` | 11,780 |
| 5 | `séuwa komun ri déparetema` | 11,779 |

**2-grams (Subword):**

| Rank | N-gram | Count |
|------|--------|-------|
| 1 | `r i` | 90,073 |
| 2 | `a _` | 63,521 |
| 3 | `i _` | 58,103 |
| 4 | `_ r` | 57,562 |
| 5 | `t e` | 57,384 |

**3-grams (Subword):**

| Rank | N-gram | Count |
|------|--------|-------|
| 1 | `_ r i` | 56,241 |
| 2 | `r i _` | 55,682 |
| 3 | `m u n` | 43,032 |
| 4 | `u n _` | 42,982 |
| 5 | `k o m` | 42,818 |

**4-grams (Subword):**

| Rank | N-gram | Count |
|------|--------|-------|
| 1 | `_ r i _` | 55,380 |
| 2 | `o m u n` | 42,739 |
| 3 | `k o m u` | 42,738 |
| 4 | `m u n _` | 42,683 |
| 5 | `n _ r i` | 41,407 |

### Key Findings

- **Best Perplexity:** 2-gram (word) with 75
- **Entropy Trend:** Decreases with larger n-grams (more predictable)
- **Coverage:** Top-1000 patterns cover ~80% of corpus
- **Recommendation:** 4-gram or 5-gram for best predictive performance
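The entropy and perplexity columns can be approximated from the published count tables. A sketch under two assumptions: the parquet exposes `ngram` and `count` columns, and the reported figures are the entropy of the n-gram frequency distribution with perplexity = 2^entropy; the pipeline may use a conditional formulation instead.

```python
# Sketch: entropy/perplexity of an n-gram count table.
# Column names "ngram" and "count" are assumptions; check the parquet schema first.
import numpy as np
import pandas as pd

counts = pd.read_parquet("models/word_ngram/bug_2gram_word.parquet")
p = counts["count"] / counts["count"].sum()

entropy_bits = float(-(p * np.log2(p)).sum())
perplexity = 2 ** entropy_bits

print(f"unique n-grams: {len(counts)}")
print(f"entropy: {entropy_bits:.2f} bits, perplexity: {perplexity:.1f}")
```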
---

[…]

### Results

| Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
|---------|---------|-------------|------------|------------------|-----------------|----------------|
| **1** | Word    | 0.5094    | 1.423 | 2.20 | 33,148 | 49.1% |
| **1** | Subword | 0.6420    | 1.560 | 6.04 | 1,115  | 35.8% |
| **2** | Word    | 0.1229    | 1.089 | 1.21 | 72,816 | 87.7% |
| **2** | Subword | 0.6776    | 1.600 | 3.79 | 6,727  | 32.2% |
| **3** | Word    | 0.0488    | 1.034 | 1.07 | 87,927 | 95.1% |
| **3** | Subword | 0.6911    | 1.614 | 3.05 | 25,458 | 30.9% |
| **4** | Word    | 0.0143 🏆 | 1.010 | 1.02 | 93,631 | 98.6% |
| **4** | Subword | 0.5492    | 1.463 | 2.16 | 77,506 | 45.1% |

### Generated Text Samples (Word-based)

Below are text samples generated from each word-based Markov chain model:

**Context Size 1:**

1. `ri déparetema gironde ri déparetema eure et loir kategori komun ri déparetema manche ri aisne katego...`
2. `komun ri perancis ita to komun ri dordogne ri déparetema somme kategori kota ri déparetema haute`
3. `déparetema gironde ri haute saône ri déparetema haute provence ri perancis ita to komun ri déparetem...`

**Context Size 2:**

1. `komun ri provinsi messina komun ri déparetema haute loire ri perancis ita to komun ri déparetema dor...`
2. `ri déparetema manche ri perancis ita to komun ri déparetema dordogne ri perancis ita to komun ri`
3. `kategori komun ri déparetema manche kategori komun ri déparetema ain kategori komun ri alpes de haut...`

**Context Size 3:**

1. `komun ri déparetema somme kategori komun ri eure et loir ri perancis ita to komun ri déparetema haut...`
2. `kategori komun ri aisne`
3. `to komun ri déparetema dordogne ri perancis ita to komun ri déparetema vosges ri perancis ita to kom...`

**Context Size 4:**

1. `ita to komun ri déparetema gironde ri perancis ita to komun ri déparetema haute saône ri perancis it...`
2. `to komun ri déparetema vosges kategori komun ri vosges`
3. `perancis ita to komun ri déparetema côtes d armor kategori komun ri côtes d armor`

### Generated Text Samples (Subword-based)

Below are text samples generated from each subword-based Markov chain model:

**Context Size 1:**

1. `_séunas._koiya_r`
2. `ares._retoépesat`
3. `riséunanetaŋn_pa`

**Context Size 2:**

1. `ri_ube_katespia_f`
2. `a_to_komun_ri:kom`
3. `i_aretemay_(caven`

**Context Size 3:**

1. `_ri_déparetema_vos`
2. `ri_déparetema_eure`
3. `mun_ri_perancis._i`

**Context Size 4:**

1. `_ri_perancis._ita_t`
2. `omun_ri_aisne_ri_pe`
3. `komun_ri_déparetema`

### Key Findings

- **Best Predictability:** Context-4 (word) with 98.6% predictability
- **Branching Factor:** Decreases with context size (more deterministic)
- **Memory Trade-off:** Larger contexts require more storage (77,506 contexts)
- **Recommendation:** Context-3 or Context-4 for text generation
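A sketch of how samples like the ones above can be drawn from a stored transition table. The column names `context`, `next`, and `count` are assumptions about the parquet schema, not documented fields.

```python
# Sketch: sampling from a context-size-2 word Markov chain stored as a table.
# Assumed columns: "context" (string), "next", "count"; verify the schema first.
import random
import pandas as pd

df = pd.read_parquet("models/word_markov/bug_markov_ctx2_word.parquet")
table = {
    ctx: (grp["next"].tolist(), grp["count"].tolist())
    for ctx, grp in df.groupby("context")
}

def generate(seed: str, steps: int = 20) -> str:
    words = seed.split()
    for _ in range(steps):
        ctx = " ".join(words[-2:])          # context size 2
        if ctx not in table:
            break
        nxt, weights = table[ctx]
        words.append(random.choices(nxt, weights=weights, k=1)[0])
    return " ".join(words)

print(generate("komun ri"))
```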
---

[…]

| Metric | Value |
|--------|-------|
| Vocabulary Size   | 13,441  |
| Total Tokens      | 358,235 |
| Mean Frequency    | 26.65   |
| Median Frequency  | 2       |
| Frequency Std Dev | 719.11  |

### Most Common Words

| Rank | Word | Frequency |
|------|------|-----------|
| 1 | ri | 55,390 |
| 2 | komun | 42,680 |
| 3 | déparetema | 27,244 |
| 4 | kategori | 15,401 |
| 5 | to | 14,028 |
| 6 | ita | 13,904 |
| 7 | iyanaritu | 13,503 |
| 8 | séuwa | 13,393 |
| 9 | perancis | 12,636 |
| 10 | haute | 6,207 |

### Least Common Words (from vocabulary)

| Rank | Word | Frequency |
|------|------|-----------|
| 1 | ᨆᨘᨄᨗ | 2 |
| 2 | ᨕᨗᨊᨘᨂᨛᨊ | 2 |
| 3 | ᨒᨘ | 2 |
| 4 | ᨅᨀᨗ | 2 |
| 5 | ᨀᨀᨀᨀ | 2 |
| 6 | ᨉᨗᨛᨄᨗᨛ | 2 |
| 7 | days | 2 |
| 8 | after | 2 |
| 9 | federal | 2 |
| 10 | ᨔᨛᨀᨗᨈ | 2 |

### Zipf's Law Analysis

| Metric | Value |
|--------|-------|
| Zipf Coefficient | 0.9107 |
| R² (Goodness of Fit) | 0.956604 |
| Adherence Quality | **excellent** |

### Coverage Analysis

| Top N Words | Coverage |
|-------------|----------|
| Top 100    | 83.0% |
| Top 1,000  | 89.7% |
| Top 5,000  | 95.1% |
| Top 10,000 | 98.1% |

### Key Findings

- **Zipf Compliance:** R²=0.9566 indicates excellent adherence to Zipf's law
- **High Frequency Dominance:** Top 100 words cover 83.0% of corpus
- **Long Tail:** 3,441 words needed for remaining 1.9% coverage
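The Zipf coefficient and R² above correspond to a straight-line fit of log-frequency against log-rank. A sketch of that fit; the `word`/`frequency` column names are assumptions about `bug_vocabulary.parquet`.

```python
# Sketch: estimating the Zipf coefficient and R² from the vocabulary table.
# Assumed column "frequency"; check bug_vocabulary.parquet before relying on this.
import numpy as np
import pandas as pd

df = pd.read_parquet("models/vocabulary/bug_vocabulary.parquet")
freq = np.sort(df["frequency"].to_numpy())[::-1]
rank = np.arange(1, len(freq) + 1)

x, y = np.log(rank), np.log(freq)
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(f"zipf coefficient ≈ {-slope:.4f}, R² ≈ {r2:.4f}")
```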
---

## 5. Word Embeddings Evaluation

[…]

### 5.1 Cross-Lingual Alignment

> *Note: Multilingual alignment visualization not available for this language.*

### 5.2 Model Comparison

| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
|-------|-----------|----------|------------------|---------------|----------------|
| **mono_32d**  | 32  | 0.0631 🏆 | 0.6817 | N/A | N/A |
| **mono_64d**  | 64  | 0.0264    | 0.6525 | N/A | N/A |
| **mono_128d** | 128 | 0.0035    | 0.7173 | N/A | N/A |

### Key Findings

- **Best Isotropy:** mono_32d with 0.0631 (more uniform distribution)
- **Semantic Density:** Average pairwise similarity of 0.6838. Lower values indicate better semantic separation.
- **Alignment Quality:** No aligned models evaluated in this run.
- **Recommendation:** 128d aligned for best cross-lingual performance
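The isotropy column is not defined in this view; one common proxy is the partition-function ratio of Mu & Viswanath (2018), sketched below. Loading the `bug_*d.bin` files depends on their (unspecified) binary format, so the sketch simply takes any `(n_words, dim)` array of vectors.

```python
# Sketch: a common isotropy proxy (partition-function ratio) for word vectors.
# Whether the pipeline centers the vectors or uses this exact measure is unknown.
import numpy as np

def isotropy(vectors: np.ndarray) -> float:
    centered = vectors - vectors.mean(axis=0)
    # Evaluate Z(u) = sum_w exp(u · w) at the right singular directions of the matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    z = np.exp(centered @ vt.T).sum(axis=0)
    return float(z.min() / z.max())   # 1.0 = perfectly isotropic

rng = np.random.default_rng(0)
print(isotropy(rng.normal(size=(1000, 32))))   # demo on random vectors
```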
---

## 6. Morphological Analysis (Experimental)

> ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.

This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.

### 6.1 Productivity & Complexity

| Metric | Value | Interpretation | Recommendation |
|--------|-------|----------------|----------------|
| Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
| Idiomaticity Gap | **-1.000** | Low formulaic content | - |

### 6.2 Affix Inventory (Productive Units)

These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
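Before the inventories below, a toy sketch of that substitutability test; the scoring, the thresholds, and the check of the stripped stem against the rest of the vocabulary are illustrative assumptions, not the pipeline's actual method.

```python
# Toy sketch of the affix test described above: a candidate suffix is kept when
# stripping it from enough words leaves a stem that is itself attested.
# Thresholds, scoring, and the toy vocabulary are illustrative only.
from collections import Counter

def productive_suffixes(vocab: set[str], max_len: int = 4, min_support: int = 2) -> list[str]:
    support: Counter[str] = Counter()
    for word in vocab:
        for k in range(1, max_len + 1):
            if len(word) <= k:
                continue
            stem, suffix = word[:-k], word[-k:]
            if stem in vocab:            # stripped stem must appear on its own
                support[suffix] += 1
    return [s for s, n in support.most_common() if n >= min_support]

toy_vocab = {"lavande", "lavandes", "garde", "gardes", "roville", "rovilles"}
print(productive_suffixes(toy_vocab))   # e.g. ['s']
```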
#### Productive Prefixes

| Prefix | Examples |
|--------|----------|
| `-ma` | marainville, mailhac, mauges |
| `-mo` | molliens, montmotier, morin |
| `-ch` | châtel, chèze, chauffourt |
| `-co` | confort, coulombiers, coux |
| `-la` | lacalm, lasse, lacour |

#### Productive Suffixes

| Suffix | Examples |
|--------|----------|
| `-s` | pozières, serbonnes, molliens |
| `-e` | givonne, marainville, roville |
| `-es` | pozières, serbonnes, mauges |
| `-rt` | confort, chauffourt, saucourt |
| `-le` | marainville, roville, touille |
| `-urt` | chauffourt, saucourt, pignicourt |
| `-ourt` | chauffourt, saucourt, pignicourt |
| `-lle` | marainville, roville, touille |

### 6.3 Bound Stems (Lexical Roots)

Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.

| Stem | Cohesion | Substitutability | Examples |
|------|----------|------------------|----------|
| `ngka` | 1.37x | 20 contexts | éngka, angka, engka |
| `appa` | 1.39x | 15 contexts | lappa, nappa, tappa |
| `engk` | 1.41x | 9 contexts | engka, engkai, engkaé |
| `seng` | 1.34x | 10 contexts | aseng, naseng, siseng |
| `asen` | 1.35x | 8 contexts | aseng, naseng, asenna |
| `unna` | 1.32x | 6 contexts | punna, umunna, punnai |
| `enna` | 1.37x | 5 contexts | asenna, lalenna, sisenna |
| `yana` | 1.30x | 5 contexts | iyana, iyanae, iyanaé |

### 6.4 Affix Compatibility (Co-occurrence)

This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.

| Prefix | Suffix | Frequency | Examples |
|--------|--------|-----------|----------|
| `-la` | `-e` | 76 words | lagardelle, lange |
| `-ch` | `-s` | 63 words | charnois, chassagnes |
| `-ma` | `-e` | 54 words | marville, maddare |
| `-co` | `-s` | 53 words | collonges, corps |
| `-mo` | `-s` | 46 words | moffans, moulédous |
| `-ch` | `-es` | 44 words | chassagnes, chaumes |
| `-ma` | `-s` | 43 words | mazères, martigues |
| `-ch` | `-e` | 41 words | challerange, champclause |
| `-co` | `-e` | 35 words | corbière, conie |
| `-co` | `-es` | 28 words | collonges, coyolles |

### 6.5 Recursive Morpheme Segmentation

Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).

| Word | Suggested Split | Confidence | Stem |
|------|-----------------|------------|------|
| lagardelle | **`la-garde-lle`** | 6.0 | `garde` |
| lavilledieu | **`la-villedieu`** | 4.5 | `villedieu` |
| laboissière | **`la-boissière`** | 4.5 | `boissière` |
| malaincourt | **`ma-la-inco-urt`** | 4.5 | `inco` |
| colleville | **`co-llev-ille`** | 3.0 | `llev` |
| maizières | **`ma-izièr-es`** | 3.0 | `izièr` |
| champenoises | **`ch-ampenois-es`** | 3.0 | `ampenois` |
| chavignon | **`ch-avign-on`** | 3.0 | `avign` |
| montescourt | **`mo-ntesc-ourt`** | 3.0 | `ntesc` |
| châteauredon | **`ch-âteaured-on`** | 3.0 | `âteaured` |
| chevannes | **`ch-evann-es`** | 3.0 | `evann` |
| lamasquère | **`la-ma-squère`** | 3.0 | `squère` |
| landaville | **`la-ndav-ille`** | 3.0 | `ndav` |
| mourvilles | **`mo-urvill-es`** | 3.0 | `urvill` |
| malvières | **`ma-lvièr-es`** | 3.0 | `lvièr` |

### 6.6 Linguistic Interpretation

> **Automated Insight:**
> The language BUG appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.

---
## 7. Summary & Recommendations

[…]

| Component | Recommended | Rationale |
|-----------|-------------|-----------|
| Tokenizer | **32k BPE** | Best compression (4.92x) |
| N-gram | **2-gram** | Lowest perplexity (75) |
| Markov | **Context-4** | Highest predictability (98.6%) |
| Embeddings | **100d** | Balanced semantic capture and isotropy |

---
## Appendix: Metrics Glossary & Interpretation Guide

[…]

  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
  doi = {10.5281/zenodo.18073153},
  publisher = {Zenodo},
  url = {https://huggingface.co/wikilangs}
  institution = {Omneity Labs}
}

[…]

- 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
- 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
- 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
- 🤝 Sponsor: [Featherless AI](https://featherless.ai)

---
*Generated by Wikilangs Models Pipeline*

*Report Date: 2026-01-03 08:55:12*
models/embeddings/monolingual/bug_128d.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:60e643ec1ece75f273fcafba6e20d15dd273d9a060acc636c62b079cc3827632
+size 1025503150

models/embeddings/monolingual/bug_128d_metadata.json
CHANGED
@@ -3,11 +3,13 @@
   "dimension": 128,
   "version": "monolingual",
   "training_params": {
-    "
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 128
   },
-  "vocab_size":
+  "vocab_size": 1443
 }

models/embeddings/monolingual/bug_32d.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:becf192777316400ac67a33873637aadaeabcd3ab768294b2064c3c0615f144b
+size 256394926

models/embeddings/monolingual/bug_32d_metadata.json
CHANGED
@@ -3,11 +3,13 @@
   "dimension": 32,
   "version": "monolingual",
   "training_params": {
-    "
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 32
   },
-  "vocab_size":
+  "vocab_size": 1443
 }

models/embeddings/monolingual/bug_64d.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:d9823040ce7cd355f045c8caad10719684ec4e18bf1bef4ea891663c355cb360
+size 512764334

models/embeddings/monolingual/bug_64d_metadata.json
CHANGED
@@ -3,11 +3,13 @@
   "dimension": 64,
   "version": "monolingual",
   "training_params": {
-    "
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 64
   },
-  "vocab_size":
+  "vocab_size": 1443
 }
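The metadata above records skip-gram-style hyperparameters (min_count 5, window 5, negative 5, epochs 5) together with an `encoding_method` of `rope` that has no off-the-shelf equivalent. A hedged gensim sketch of training vectors with the same hyperparameters; it approximates, but is not, the Wikilangs training code.

```python
# Sketch: skip-gram training with gensim using the hyperparameters from the
# metadata. The pipeline's own trainer (including "encoding_method": "rope")
# is not public here; this is only an approximation.
from gensim.models import Word2Vec

sentences = [
    ["ita", "to", "komun", "ri", "déparetema", "allier"],
    ["iyanaritu", "séuwa", "komun", "ri", "déparetema", "manche", "ri", "perancis"],
] * 10  # tiny placeholder corpus repeated so min_count=5 keeps some words

model = Word2Vec(
    sentences=sentences,
    vector_size=128,   # matches "dim": 128
    sg=1,              # skip-gram, per "algorithm": "skipgram"
    min_count=5,
    window=5,
    negative=5,
    epochs=5,
)
model.wv.save_word2vec_format("bug_128d_sketch.bin", binary=True)
```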
models/subword_markov/bug_markov_ctx1_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:ea6b54847a784f51e2f16c837e7951d96c474085d40f16a957be9eecbdceb2d7
+size 56451

models/subword_markov/bug_markov_ctx1_subword_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "subword",
   "language": "bug",
-  "unique_contexts":
-  "total_transitions":
+  "unique_contexts": 1115,
+  "total_transitions": 2377233
 }

models/subword_markov/bug_markov_ctx2_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:fc9770904232ce07f4aa4fa00d4f31741086ce42dde9337c557a94fcd91601d3
+size 229577

models/subword_markov/bug_markov_ctx2_subword_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "subword",
   "language": "bug",
-  "unique_contexts":
-  "total_transitions":
+  "unique_contexts": 6727,
+  "total_transitions": 2361714
 }

models/subword_markov/bug_markov_ctx3_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:1851961810e4cd0121317f3bf6e74a5b2ad2d767fbf3ea615e494b1b33a0d9ff
+size 651956

models/subword_markov/bug_markov_ctx3_subword_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "subword",
   "language": "bug",
-  "unique_contexts":
-  "total_transitions":
+  "unique_contexts": 25458,
+  "total_transitions": 2346195
 }

models/subword_markov/bug_markov_ctx4_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:1991e713220325ecadbfc81c2871898ab38be9aff994ef7d45517d9081e2d843
+size 1459206

models/subword_markov/bug_markov_ctx4_subword_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "subword",
   "language": "bug",
-  "unique_contexts":
-  "total_transitions":
+  "unique_contexts": 77506,
+  "total_transitions": 2330676
 }
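Each Markov metadata file records only `unique_contexts` and `total_transitions`. A sketch of how those two numbers arise when building a context → next-token table; tokenization and boundary handling are assumptions, not documented behavior.

```python
# Sketch: counting unique contexts and total transitions for a context-size-2
# Markov chain. The real pipeline's tokenization may differ; this only
# illustrates what the two metadata fields measure.
from collections import Counter

def transition_stats(tokens: list[str], context_size: int = 2) -> tuple[int, int]:
    transitions = Counter()
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i : i + context_size])
        nxt = tokens[i + context_size]
        transitions[(context, nxt)] += 1
    unique_contexts = len({ctx for ctx, _ in transitions})
    total_transitions = sum(transitions.values())
    return unique_contexts, total_transitions

tokens = "ita to komun ri déparetema allier kategori komun ri allier".split()
print(transition_stats(tokens, context_size=2))
```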
models/subword_ngram/bug_2gram_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:bc0ffd67460f75034bcef5b208e6dcebf5026c731829639ee1c9b37cd94f0255
+size 29845

models/subword_ngram/bug_2gram_subword_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "subword",
   "language": "bug",
-  "unique_ngrams":
-  "total_ngrams":
+  "unique_ngrams": 2166,
+  "total_ngrams": 2377233
 }

models/subword_ngram/bug_3gram_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:27dae62a90b04631ca45a23226c4fb6e3e5676dd3b56fda2064206ccd05cc271
+size 133308

models/subword_ngram/bug_3gram_subword_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "subword",
   "language": "bug",
-  "unique_ngrams":
-  "total_ngrams":
+  "unique_ngrams": 10883,
+  "total_ngrams": 2361714
 }

models/subword_ngram/bug_4gram_subword.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:4d7a75327568e461ca60da5b19c31846f906735013d0944cb9687eb7e175681a
+size 522479

models/subword_ngram/bug_4gram_subword_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "subword",
   "language": "bug",
-  "unique_ngrams":
-  "total_ngrams":
+  "unique_ngrams": 41978,
+  "total_ngrams": 2346195
 }
models/tokenizer/bug_tokenizer_16k.model
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:f695d915517db069983efcf78f1827a707c9d7e5df1fe96fe7b390346018196e
+size 518421

models/tokenizer/bug_tokenizer_16k.vocab
CHANGED
(diff too large to render)

models/tokenizer/bug_tokenizer_32k.model
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:41e741a9c77783f82b6eef2bd92933e422989f91b37dafb131fcbf96a6c44a90
+size 821175

models/tokenizer/bug_tokenizer_32k.vocab
CHANGED
(diff too large to render)

models/tokenizer/bug_tokenizer_8k.model
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:ae5e6410beed7f97e2ba0a2f82bd06f81b54b111021264a322f628b6ebd7930f
+size 372086

models/tokenizer/bug_tokenizer_8k.vocab
CHANGED
(diff too large to render)
models/vocabulary/bug_vocabulary.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:eaf37b2a8977cb6828c5e00937c105d6dd1b5753eb506d9a9c3d3715c43116ba
+size 212963

models/vocabulary/bug_vocabulary_metadata.json
CHANGED
@@ -1,16 +1,17 @@
 {
   "language": "bug",
-  "vocabulary_size":
+  "vocabulary_size": 13441,
+  "variant": "full",
   "statistics": {
-    "type_token_ratio": 0.
+    "type_token_ratio": 0.08785621316176548,
     "coverage": {
-      "top_100": 0.
-      "top_1000": 0.
-      "top_5000": 0.
-      "top_10000": 0.
+      "top_100": 0.7870075448937048,
+      "top_1000": 0.8499830689622332,
+      "top_5000": 0.9016359615242167,
+      "top_10000": 0.9294954550745494
     },
-    "hapax_count":
-    "hapax_ratio": 0.
-    "total_documents":
+    "hapax_count": 19769,
+    "hapax_ratio": 0.5952725082806384,
+    "total_documents": 15519
   }
 }
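The vocabulary metadata bundles type/token ratio, hapax statistics, and top-N coverage. A sketch of the usual definitions of those quantities over a token stream; the pipeline's preprocessing, and its exact hapax definition, may differ.

```python
# Sketch: vocabulary statistics as reported in bug_vocabulary_metadata.json.
# Standard definitions only; the pipeline's exact counting rules are not shown here.
from collections import Counter

def vocab_stats(tokens: list[str]) -> dict:
    counts = Counter(tokens)
    total = sum(counts.values())
    ranked = [n for _, n in counts.most_common()]
    hapax = sum(1 for n in counts.values() if n == 1)
    return {
        "vocabulary_size": len(counts),
        "type_token_ratio": len(counts) / total,
        "hapax_count": hapax,
        "hapax_ratio": hapax / len(counts),
        "coverage": {f"top_{k}": sum(ranked[:k]) / total for k in (100, 1000, 5000, 10000)},
    }

tokens = "ita to komun ri déparetema allier kategori komun ri allier".split()
print(vocab_stats(tokens))
```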
models/word_markov/bug_markov_ctx1_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:11fe5e192c8147c9659e92fea9d4a1deda14c8f7dd8f48d353ccca9695ddc8e5
+size 879712

models/word_markov/bug_markov_ctx1_word_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "word",
   "language": "bug",
-  "unique_contexts":
-  "total_transitions":
+  "unique_contexts": 33148,
+  "total_transitions": 362485
 }

models/word_markov/bug_markov_ctx2_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:6e286442d9daf9a0a40fbc601c40874bbd94485367efec9e0aa5fa8ada829a45
+size 1390138

models/word_markov/bug_markov_ctx2_word_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "word",
   "language": "bug",
-  "unique_contexts":
-  "total_transitions":
+  "unique_contexts": 72816,
+  "total_transitions": 346966
 }

models/word_markov/bug_markov_ctx3_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:c638220df23f4ff0bdbe9ae2d3083f2ce355e1c83d1a468b2bea91d89552f40e
+size 1674825

models/word_markov/bug_markov_ctx3_word_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "word",
   "language": "bug",
-  "unique_contexts":
-  "total_transitions":
+  "unique_contexts": 87927,
+  "total_transitions": 331447
 }

models/word_markov/bug_markov_ctx4_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:a0d9d69bcf81c502d74df001df57fd6eab15283698366763ee71dcd5e51ce1e9
+size 1867230

models/word_markov/bug_markov_ctx4_word_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "word",
   "language": "bug",
-  "unique_contexts":
-  "total_transitions":
+  "unique_contexts": 93631,
+  "total_transitions": 315928
 }
models/word_ngram/bug_2gram_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:4806b76280f953c69f3b47fae8e19150bc4d06b3e180825767704420821b1734
+size 27390

models/word_ngram/bug_2gram_word_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "word",
   "language": "bug",
-  "unique_ngrams":
-  "total_ngrams":
+  "unique_ngrams": 1721,
+  "total_ngrams": 362485
 }

models/word_ngram/bug_3gram_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:12c41464b2fe3b48442ab57eb81c6e3a89b357a0127c417369670f4fdb38b49d
+size 35727

models/word_ngram/bug_3gram_word_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "word",
   "language": "bug",
-  "unique_ngrams":
-  "total_ngrams":
+  "unique_ngrams": 2060,
+  "total_ngrams": 346966
 }

models/word_ngram/bug_4gram_word.parquet
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:895c9e5bd7da4197c713d3ef826d4aed1e41933d7f3600e864f72cc89e2457ec
+size 87627

models/word_ngram/bug_4gram_word_metadata.json
CHANGED
@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "word",
   "language": "bug",
-  "unique_ngrams":
-  "total_ngrams":
+  "unique_ngrams": 4992,
+  "total_ngrams": 331447
 }
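The n-gram metadata files report `unique_ngrams` and `total_ngrams` per order. A sketch of those two counts for word n-grams; the subword variants presumably run the same logic over characters with `_` as the space marker, which is an inference from the tables earlier in the README rather than a documented fact.

```python
# Sketch: unique vs. total n-gram counts, as recorded in the *_metadata.json files.
from collections import Counter

def ngram_stats(tokens: list[str], n: int) -> tuple[int, int]:
    grams = Counter(zip(*(tokens[i:] for i in range(n))))
    return len(grams), sum(grams.values())

tokens = "ita to komun ri déparetema allier kategori komun ri allier".split()
for n in (2, 3, 4):
    unique, total = ngram_stats(tokens, n)
    print(f"{n}-gram: unique={unique}, total={total}")
```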
visualizations/embedding_isotropy.png
CHANGED

visualizations/embedding_norms.png
CHANGED

visualizations/embedding_similarity.png
CHANGED

visualizations/markov_branching.png
CHANGED

visualizations/markov_contexts.png
CHANGED

visualizations/markov_entropy.png
CHANGED

visualizations/model_sizes.png
CHANGED