Irish - Wikilangs Models
Comprehensive Research Report & Full Ablation Study
This repository contains NLP models trained and evaluated by Wikilangs, specifically on Irish Wikipedia data. We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and word embeddings.
Repository Contents
Models & Assets
- Tokenizers (8k, 16k, 32k, 64k)
- N-gram models (2, 3, 4, 5-gram)
- Markov chains (context of 1, 2, 3, 4 and 5)
- Subword N-gram and Markov chains
- Embeddings in various sizes and dimensions (aligned and unaligned)
- Language Vocabulary
- Language Statistics
Analysis and Evaluation
- 1. Tokenizer Evaluation
- 2. N-gram Model Evaluation
- 3. Markov Chain Evaluation
- 4. Vocabulary Analysis
- 5. Word Embeddings Evaluation
- 6. Morphological Analysis (Experimental)
- 7. Summary & Recommendations
- Metrics Glossary
- Visualizations Index
1. Tokenizer Evaluation
Results
| Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
|---|---|---|---|---|
| 8k | 3.807x | 3.81 | 0.1479% | 836,105 |
| 16k | 4.135x | 4.14 | 0.1607% | 769,705 |
| 32k | 4.402x | 4.40 | 0.1711% | 723,137 |
| 64k | 4.595x | 4.60 | 0.1786% | 692,774 |
Tokenization Examples
Below are sample sentences tokenized with each vocabulary size:
Sample 1: Is baile suite i gContae an Longfoirt é Caonach. Tagairtí i gContae an Longfoirt
| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁is ▁baile ▁suite ▁i ▁gcontae ▁an ▁longfoirt ▁é ▁cao nach ... (+6 more) | 16 |
| 16k | ▁is ▁baile ▁suite ▁i ▁gcontae ▁an ▁longfoirt ▁é ▁cao nach ... (+6 more) | 16 |
| 32k | ▁is ▁baile ▁suite ▁i ▁gcontae ▁an ▁longfoirt ▁é ▁caonach . ... (+5 more) | 15 |
| 64k | ▁is ▁baile ▁suite ▁i ▁gcontae ▁an ▁longfoirt ▁é ▁caonach . ... (+5 more) | 15 |
Sample 2: Sráidbhaile beag i gContae Ros Comáin is ea An Seanbhaile (Old Town as Béarla). ...
| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁sráidbhaile ▁beag ▁i ▁gcontae ▁ros ▁comáin ▁is ▁ea ▁an ▁sean ... (+10 more) | 20 |
| 16k | ▁sráidbhaile ▁beag ▁i ▁gcontae ▁ros ▁comáin ▁is ▁ea ▁an ▁sean ... (+10 more) | 20 |
| 32k | ▁sráidbhaile ▁beag ▁i ▁gcontae ▁ros ▁comáin ▁is ▁ea ▁an ▁seanbhaile ... (+8 more) | 18 |
| 64k | ▁sráidbhaile ▁beag ▁i ▁gcontae ▁ros ▁comáin ▁is ▁ea ▁an ▁seanbhaile ... (+8 more) | 18 |
Sample 3: Is imreoir leadóige as An tSeapáin í Misaki Doi. Rugadh í ar an 29 Aibreán leadó...
| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁is ▁imreoir ▁leadóige ▁as ▁an ▁tseapáin ▁í ▁m isa ki ... (+17 more) | 27 |
| 16k | ▁is ▁imreoir ▁leadóige ▁as ▁an ▁tseapáin ▁í ▁m isa ki ... (+17 more) | 27 |
| 32k | ▁is ▁imreoir ▁leadóige ▁as ▁an ▁tseapáin ▁í ▁m isa ki ... (+16 more) | 26 |
| 64k | ▁is ▁imreoir ▁leadóige ▁as ▁an ▁tseapáin ▁í ▁m isa ki ... (+16 more) | 26 |
Key Findings
- Best Compression: 64k achieves 4.595x compression
- Lowest UNK Rate: 8k with 0.1479% unknown tokens
- Trade-off: Larger vocabularies improve compression but increase model size
- Recommendation: 32k vocabulary provides the best balance for production use (see the measurement sketch below)
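The compression and UNK figures above can be reproduced with a few lines of Python. The sketch below assumes the tokenizers ship as SentencePiece BPE models; the file name `ga_bpe_32k.model` is a placeholder for whichever asset you download (if the repository instead uses the Hugging Face `tokenizers` format, `Tokenizer.from_file(...)` plays the same role).

```python
# Minimal sketch (not the exact evaluation script): measure compression ratio
# (chars per token) and UNK rate for one tokenizer on a sample sentence.
# "ga_bpe_32k.model" is a hypothetical file name -- substitute the asset you use.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="ga_bpe_32k.model")

text = "Is baile suite i gContae an Longfoirt é Caonach."
ids = sp.encode(text, out_type=int)
pieces = sp.encode(text, out_type=str)

compression = len(text) / len(ids)                        # chars per token
unk_rate = sum(i == sp.unk_id() for i in ids) / len(ids)  # fraction of UNK tokens

print(pieces)
print(f"compression: {compression:.3f}x   UNK rate: {unk_rate:.4%}")
```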
2. N-gram Model Evaluation
Results
| N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
|---|---|---|---|---|---|---|
| 2-gram | Word | 41,051 | 15.33 | 224,402 | 11.6% | 28.8% |
| 2-gram | Subword | 260 | 8.02 | 7,311 | 69.6% | 99.2% |
| 3-gram | Word | 129,955 | 16.99 | 394,113 | 5.2% | 15.9% |
| 3-gram | Subword | 2,220 | 11.12 | 56,094 | 27.5% | 72.9% |
| 4-gram | Word | 328,612 | 18.33 | 698,569 | 3.1% | 9.7% |
| 4-gram | Subword | 13,083 | 13.68 | 311,374 | 13.3% | 40.0% |
| 5-gram | Word | 276,286 | 18.08 | 496,389 | 2.8% | 9.5% |
| 5-gram | Subword | 52,276 | 15.67 | 940,984 | 7.3% | 24.4% |
Top 5 N-grams by Size
2-grams (Word):
| Rank | N-gram | Count |
|---|---|---|
| 1 | ar an | 55,595 |
| 2 | sa bhliain | 34,147 |
| 3 | a bhí | 24,293 |
| 4 | leis an | 21,408 |
| 5 | a rugadh | 15,751 |
3-grams (Word):
| Rank | N-gram | Count |
|---|---|---|
| 1 | a rugadh i | 11,250 |
| 2 | baile átha cliath | 4,993 |
| 3 | ina dhiaidh sin | 4,414 |
| 4 | is é an | 4,339 |
| 5 | go dtí an | 3,964 |
4-grams (Word):
| Rank | N-gram | Count |
|---|---|---|
| 1 | a rugadh i i | 3,258 |
| 2 | a rugadh i beo | 3,011 |
| 3 | tagairtí a rugadh i | 2,902 |
| 4 | i mbaile átha cliath | 2,279 |
| 5 | baile fearainn i gcontae | 2,227 |
5-grams (Word):
| Rank | N-gram | Count |
|---|---|---|
| 1 | tagairtí a rugadh i i | 1,261 |
| 2 | milliún duine ar an eipeasóid | 1,003 |
| 3 | an eipeasóid seo d fhéach | 997 |
| 4 | breitheanna básanna ceannairí domhanda tagairtí | 817 |
| 5 | eachtraí breitheanna básanna ceannairí domhanda | 812 |
2-grams (Subword):
| Rank | N-gram | Count |
|---|---|---|
| 1 | _ a | 1,728,019 |
| 2 | a _ | 1,304,438 |
| 3 | n _ | 1,293,557 |
| 4 | c h | 1,096,662 |
| 5 | a n | 1,083,880 |
3-grams (Subword):
| Rank | N-gram | Count |
|---|---|---|
| 1 | a c h | 528,297 |
| 2 | a n _ | 512,170 |
| 3 | _ a n | 478,331 |
| 4 | a r _ | 407,037 |
| 5 | n a _ | 405,810 |
4-grams (Subword):
| Rank | N-gram | Count |
|---|---|---|
| 1 | _ a n _ | 398,568 |
| 2 | _ n a _ | 252,847 |
| 3 | a c h _ | 239,043 |
| 4 | a g u s | 237,745 |
| 5 | g u s _ | 237,257 |
5-grams (Subword):
| Rank | N-gram | Count |
|---|---|---|
| 1 | _ a g u s | 236,766 |
| 2 | a g u s _ | 236,632 |
| 3 | r _ a n _ | 82,111 |
| 4 | _ a r _ a | 75,982 |
| 5 | _ b h í _ | 72,519 |
Key Findings
- Best Perplexity: 2-gram (subword) with 260
- Entropy Trend: Entropy rises with n-gram order in this corpus (e.g. 8.02 → 15.67 bits for the subword models), reflecting growing sparsity at higher orders
- Coverage: The top 1,000 patterns cover roughly 24% of the corpus for the highest-order (subword 5-gram) model; lower-order subword models reach far higher coverage (99.2% for 2-grams)
- Recommendation: The subword 2-gram model gives the lowest perplexity; word-level 4-grams and 5-grams offer longer context at the cost of severe sparsity (a computation sketch follows below)
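For readers who want to see how these statistics relate to raw counts, the sketch below computes conditional entropy, perplexity (2^entropy) and top-K coverage for word bigrams on a toy corpus. It is an illustration only; the numbers in the table come from the full pipeline run over the whole corpus.

```python
# Illustrative sketch (toy corpus, not the pipeline code): conditional entropy of
# word bigrams, perplexity = 2**entropy, and top-K coverage from raw counts.
import math
from collections import Counter

corpus = ("is baile suite i gcontae an longfoirt é caonach "
          "is baile beag i gcontae ros comáin é an seanbhaile").split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])          # counts of each bigram's first word
total = sum(bigram_counts.values())

# H(w_i | w_{i-1}) = -sum over bigrams of p(ctx, w) * log2 p(w | ctx)
entropy = -sum((c / total) * math.log2(c / context_counts[ctx])
               for (ctx, _w), c in bigram_counts.items())
perplexity = 2 ** entropy

top_k = 5
coverage = sum(c for _, c in bigram_counts.most_common(top_k)) / total

print(f"entropy={entropy:.2f} bits  perplexity={perplexity:.1f}  "
      f"top-{top_k} coverage={coverage:.1%}")
```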
3. Markov Chain Evaluation
Results
| Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
|---|---|---|---|---|---|---|
| 1 | Word | 0.9964 | 1.995 | 8.63 | 343,683 | 0.4% |
| 1 | Subword | 0.9582 | 1.943 | 6.76 | 3,318 | 4.2% |
| 2 | Word | 0.3580 | 1.282 | 2.08 | 2,957,508 | 64.2% |
| 2 | Subword | 0.8596 | 1.815 | 5.42 | 22,430 | 14.0% |
| 3 | Word | 0.1474 | 1.108 | 1.31 | 6,125,828 | 85.3% |
| 3 | Subword | 0.7941 | 1.734 | 4.34 | 121,437 | 20.6% |
| 4 | Word | 0.0625 | 1.044 | 1.11 | 7,985,903 | 93.8% |
| 4 | Subword | 0.7210 | 1.648 | 3.32 | 527,224 | 27.9% |
Generated Text Samples (Word-based)
Below are text samples generated from each word-based Markov chain model:
Context Size 1:
an tarbh cogaidh ar eolas digiteal i siam agus don 19ú haois bhunaigh sé go raibha bhíonn faoi dhó é turas eitilt seo tagairtí nuachta le dobharmharc e restrictor chun cinnna mbráthar bán is mó ná neart ceoil de chuid iarnród éireann athrú mór an tslóvaicis
Context Size 2:
ar an toirt agus méid ceimeacháin atá i gceist a éilíonn is a thiocfaidh an galar seosa bhliain chuir eorpaigh fúthu san india ó bombay thaistil siad ar an gcuid is mó snaa bhí dílis d údarás na gaeltachta taibhdhearc na gaillimhe an ros contae na gaillimhe naisc sheacht...
Context Size 3:
a rugadh i as londain sasanacha sasanacha sasanacha a rugadh i meiriceánacha meiriceánacha meiriceán...baile átha cliath tomás ó laidhin céimí de chuid ollscoil missouri kansas city agus scoil dlí na nig...ina dhiaidh sin agus dúirt sé go raibh galar intinne uirthi agus go leor úsáidí ann mar dhíolacháin
Context Size 4:
tagairtí a rugadh i i moslamacha otamánacha ioslamachbaile fearainn i gcontae an chabháin tuaim contae an chláir baile fearainn i gcontae chiarraí an cil...is baile suite i gcontae aontroma é tagairtí in albain dhùn phris is ghall ghàidhealaibh in iardheis...
Generated Text Samples (Subword-based)
Below are text samples generated from each subword-based Markov chain model:
Context Size 1:
_tíoleánaiteaiobach,_achagh_liruiachtaspáinns_gu
Context Size 2:
_ad_lon_áfaon_agua_gintaeipearna_rn_thaobedate_clek
Context Size 3:
ach,_geolas_sé_phyan_tar_come)"._ar__an_ar_féach_ar_ús
Context Size 4:
_an_téadach_stuaist_na_héireann_5_de_tach_na_thábháil._ma
Key Findings
- Best Predictability: Context-4 (word) with 93.8% predictability
- Branching Factor: Decreases with context size (more deterministic)
- Memory Trade-off: Larger contexts require more storage (527,224 subword contexts and ~8.0M word contexts at context-4)
- Recommendation: Context-3 or Context-4 for text generation (see the generation sketch below)
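As an illustration of how such samples are produced, below is a minimal word-level Markov generator (not the pipeline's actual generation code). It also computes the branching factor referred to above.

```python
# Minimal word-level Markov chain, analogous to the samples above (illustrative
# only, not the generation code used for this report).
import random
from collections import defaultdict

def build_chain(tokens, context=2):
    chain = defaultdict(list)
    for i in range(len(tokens) - context):
        chain[tuple(tokens[i:i + context])].append(tokens[i + context])
    return chain

def generate(chain, seed, length=20):
    out = list(seed)
    for _ in range(length):
        options = chain.get(tuple(out[-len(seed):]))
        if not options:
            break
        out.append(random.choice(options))
    return " ".join(out)

tokens = ("is baile suite i gcontae an longfoirt é caonach "
          "tagairtí i gcontae an longfoirt").split()
chain = build_chain(tokens, context=2)

# Branching factor: average number of distinct continuations per context.
branching = sum(len(set(v)) for v in chain.values()) / len(chain)
print(f"branching factor: {branching:.2f}")
print(generate(chain, seed=("i", "gcontae")))
```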
4. Vocabulary Analysis
Statistics
| Metric | Value |
|---|---|
| Vocabulary Size | 161,708 |
| Total Tokens | 10,057,096 |
| Mean Frequency | 62.19 |
| Median Frequency | 4 |
| Frequency Std Dev | 1917.88 |
Most Common Words
| Rank | Word | Frequency |
|---|---|---|
| 1 | an | 411,365 |
| 2 | a | 299,072 |
| 3 | na | 254,180 |
| 4 | agus | 237,584 |
| 5 | ar | 204,783 |
| 6 | i | 198,698 |
| 7 | is | 131,814 |
| 8 | le | 97,770 |
| 9 | sa | 94,976 |
| 10 | go | 90,513 |
Least Common Words (from vocabulary)
| Rank | Word | Frequency |
|---|---|---|
| 1 | múcapholaisiúicrídí | 2 |
| 2 | slock | 2 |
| 3 | oinonen | 2 |
| 4 | frithsciúradh | 2 |
| 5 | varoufakis | 2 |
| 6 | wordnet | 2 |
| 7 | babelnet | 2 |
| 8 | cdle | 2 |
| 9 | malavoglia | 2 |
| 10 | btv | 2 |
Zipf's Law Analysis
| Metric | Value |
|---|---|
| Zipf Coefficient | 1.0687 |
| RΒ² (Goodness of Fit) | 0.997049 |
| Adherence Quality | excellent |
Coverage Analysis
| Top N Words | Coverage |
|---|---|
| Top 100 | 41.5% |
| Top 1,000 | 64.7% |
| Top 5,000 | 80.4% |
| Top 10,000 | 86.2% |
Key Findings
- Zipf Compliance: RΒ²=0.9970 indicates excellent adherence to Zipf's law
- High Frequency Dominance: Top 100 words cover 41.5% of corpus
- Long Tail: The remaining 151,708 vocabulary words account for only 13.8% of the corpus (a Zipf-fit sketch follows below)
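The Zipf coefficient, R² and coverage figures above can be estimated with a short script like the one below; it uses a toy frequency list and is not the report's exact code.

```python
# Sketch of the Zipf fit and coverage analysis on a toy frequency list
# (NumPy only; not the exact script that produced the numbers above).
import numpy as np
from collections import Counter

toy = ("an " * 100 + "a " * 50 + "na " * 33 + "agus " * 25 + "ar " * 20 +
       "i " * 17 + "is " * 14 + "le " * 12 + "sa " * 11 + "go " * 10).split()
freqs = np.array(sorted(Counter(toy).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1, dtype=float)

# Zipf coefficient = slope of log(frequency) vs log(rank); R² from that linear fit.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
pred = slope * np.log(ranks) + intercept
r2 = 1 - np.sum((np.log(freqs) - pred) ** 2) / np.sum((np.log(freqs) - np.log(freqs).mean()) ** 2)

# Cumulative coverage of the top-N most frequent words.
top5_coverage = freqs[:5].sum() / freqs.sum()

print(f"Zipf coefficient ≈ {abs(slope):.3f}   R² = {r2:.4f}   top-5 coverage = {top5_coverage:.1%}")
```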
5. Word Embeddings Evaluation
5.1 Cross-Lingual Alignment
5.2 Model Comparison
| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
|---|---|---|---|---|---|
| mono_32d | 32 | 0.8458 | 0.3686 | N/A | N/A |
| mono_64d | 64 | 0.8459 | 0.2792 | N/A | N/A |
| mono_128d | 128 | 0.8282 | 0.2131 | N/A | N/A |
| aligned_32d | 32 | 0.8458 | 0.3623 | 0.1860 | 0.5460 |
| aligned_64d | 64 | 0.8459 | 0.2830 | 0.2320 | 0.6040 |
| aligned_128d | 128 | 0.8282 | 0.2127 | 0.3460 | 0.6980 |
Key Findings
- Best Isotropy: mono_64d with 0.8459 (more uniform distribution)
- Semantic Density: Average pairwise similarity of 0.2865. Lower values indicate better semantic separation.
- Alignment Quality: Aligned models achieve up to 34.6% R@1 in cross-lingual retrieval.
- Recommendation: 128d aligned for best cross-lingual performance (a retrieval-evaluation sketch follows below)
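To make the R@1/R@10 columns concrete: given aligned Irish vectors, aligned target-language vectors, and a bilingual test dictionary, recall@k is the fraction of source words whose gold translation appears among their k nearest neighbours by cosine similarity. The sketch below uses random placeholder arrays rather than the actual embeddings.

```python
# Sketch of the cross-lingual retrieval evaluation behind R@1 / R@10.
import numpy as np

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 128))     # aligned Irish vectors (placeholder)
tgt = rng.normal(size=(500, 128))     # aligned target-language vectors (placeholder)
gold = np.arange(500)                 # source word i translates to target word gold[i]

src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)

neighbours = np.argsort(-(src @ tgt.T), axis=1)   # targets sorted by cosine similarity

def recall_at(k):
    return float(np.mean([gold[i] in neighbours[i, :k] for i in range(len(gold))]))

print(f"R@1 = {recall_at(1):.3f}   R@10 = {recall_at(10):.3f}")
```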
6. Morphological Analysis (Experimental)
This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
6.1 Productivity & Complexity
| Metric | Value | Interpretation | Recommendation |
|---|---|---|---|
| Productivity Index | 5.000 | High morphological productivity | Reliable analysis |
| Idiomaticity Gap | -0.611 | Low formulaic content | - |
6.2 Affix Inventory (Productive Units)
These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
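The snippet below is a rough, simplified illustration of this substitutability idea, not the report's actual extraction code: a suffix is scored by how many of the stems it leaves behind also occur with other endings; the tiny vocabulary is hypothetical.

```python
# Rough illustration of suffix productivity via substitutability (this is NOT the
# report's extraction algorithm, and the small vocabulary below is hypothetical).
def suffix_productivity(suffix, vocab, min_stem_len=4):
    stems = [w[:-len(suffix)] for w in vocab
             if w.endswith(suffix) and len(w) - len(suffix) >= min_stem_len]
    # A stem counts as substitutable if it also occurs with a different ending.
    substitutable = [s for s in stems
                     if any(v.startswith(s) and v != s + suffix for v in vocab)]
    return len(substitutable), len(stems)

vocab = {"chlochach", "chlochaí", "mhuraenach", "mhuraenaí",
         "coimisiúin", "coimisiúnta", "arcáin", "arcán"}

for suf in ("ach", "in"):
    hits, total = suffix_productivity(suf, vocab)
    print(f"-{suf}: {hits}/{total} stems also occur with other endings")
```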
Productive Prefixes
| Prefix | Examples |
|---|---|
| ch- | chillán, chomhlachtaí, choimeádacha |
Productive Suffixes
| Suffix | Examples |
|---|---|
| -a | vieja, jedna, zha |
| -ch | achtanóideach, mhuraenach, chlochach |
| -ach | achtanóideach, mhuraenach, chlochach |
| -in | rodin, coimisiúin, arcáin |
| -ha | zha, choimeádacha, sheandálaíocha |
| -ir | reachtair, dóttir, stóir |
6.3 Bound Stems (Lexical Roots)
Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
| Stem | Cohesion | Substitutability | Examples |
|---|---|---|---|
| rach | 1.68x | 258 contexts | brach, trach, àrach |
| agai | 1.83x | 98 contexts | nagai, agaid, agair |
| mhai | 1.47x | 225 contexts | mhair, mhail, mhais |
| chta | 1.45x | 238 contexts | achta, échta, uchta |
| aíoc | 1.72x | 89 contexts | aíoch, aíocht, aíochta |
| reac | 1.59x | 128 contexts | reach, preac, breac |
| aith | 1.40x | 224 contexts | maith, raith, daith |
| eith | 1.59x | 116 contexts | beith, reith, feith |
| irea | 1.40x | 194 contexts | pirea, éirean, oirear |
| bhai | 1.39x | 175 contexts | bhais, bhain, bhaic |
| omha | 1.43x | 140 contexts | domha, íomha, comha |
| onta | 1.39x | 151 contexts | ponta, gonta, konta |
6.4 Affix Compatibility (Co-occurrence)
This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
| Prefix | Suffix | Frequency | Examples |
|---|---|---|---|
| ch- | -a | 31 words | chreata, chongócha |
| ch- | -ch | 25 words | chích, charbocsaileach |
| ch- | -ach | 22 words | charbocsaileach, chumasach |
| ch- | -in | 16 words | choimeádáin, chíomháin |
| ch- | -ir | 16 words | choisir, chreachadóir |
| ch- | -ha | 10 words | chongócha, chriméacha |
6.5 Recursive Morpheme Segmentation
Using Recursive Hierarchical Substitutability, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., prefix-prefix-root-suffix).
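The toy splitter below conveys only the flavour of nested affix stripping; the confidence scores in the table come from the report's own substitutability analysis, and the small affix lists here are illustrative assumptions, not the extracted inventory.

```python
# Toy recursive splitter: peel known prefixes and suffixes until only a stem is
# left. The affix lists are illustrative; this is not the report's scoring method.
PREFIXES = ("ch", "gc", "mh", "bh")
SUFFIXES = ("acha", "ach", "in", "ir", "ha", "a")   # longest first

def segment(word, min_stem=3):
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            return [p] + segment(word[len(p):], min_stem)
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            return segment(word[:-len(s)], min_stem) + [s]
    return [word]

for w in ("chruthach", "cheannach", "uathbhásach"):
    print(w, "->", "-".join(segment(w)))
```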
| Word | Suggested Split | Confidence | Stem |
|---|---|---|---|
| chruthach | ch-ruth-ach | 6.0 | ruth |
| cheannach | ch-eann-ach | 6.0 | eann |
| gcruithneach | gcruithne-ach | 4.5 | gcruithne |
| éireanach | éirean-ach | 4.5 | éirean |
| uathbhásach | uathbhás-ach | 4.5 | uathbhás |
| reitineach | reitine-ach | 4.5 | reitine |
| chaithreachas | ch-aithreachas | 4.5 | aithreachas |
| chomhfhachtóir | ch-omhfhachtó-ir | 3.0 | omhfhachtó |
| cellachain | cellac-ha-in | 3.0 | cellac |
| mhórchathair | mhórchat-ha-ir | 3.0 | mhórchat |
| phartaláin | phartalá-in | 1.5 | phartalá |
| motherfoclóir | motherfocló-ir | 1.5 | motherfocló |
| bhaictéaracha | bhaictéarac-ha | 1.5 | bhaictéarac |
| mheasartha | mheasart-ha | 1.5 | mheasart |
| annalacha | annalac-ha | 1.5 | annalac |
6.6 Linguistic Interpretation
Automated Insight: Irish shows high morphological productivity. Subword models are markedly more efficient than word-level models, suggesting a rich system of affixation and compounding.
7. Summary & Recommendations
Production Recommendations
| Component | Recommended | Rationale |
|---|---|---|
| Tokenizer | 32k–64k BPE | 64k gives the best compression (4.59x); 32k offers the best balance of compression and model size |
| N-gram | 2-gram (subword) | Lowest perplexity (260) |
| Markov | Context-4 | Highest predictability (93.8%) |
| Embeddings | 128d aligned | Best cross-lingual retrieval (R@1 = 34.6%) with competitive isotropy |
Appendix: Metrics Glossary & Interpretation Guide
This section provides definitions, intuitions, and guidance for interpreting the metrics used throughout this report.
Tokenizer Metrics
Compression Ratio
Definition: The ratio of characters to tokens (chars/token). Measures how efficiently the tokenizer represents text.
Intuition: Higher compression means fewer tokens needed to represent the same text, reducing sequence lengths for downstream models. A 3x compression means ~3 characters per token on average.
What to seek: Higher is generally better for efficiency, but extremely high compression may indicate overly aggressive merging that loses morphological information.
Average Token Length (Fertility)
Definition: Mean number of characters per token produced by the tokenizer.
Intuition: Reflects the granularity of tokenization. Longer tokens capture more context but may struggle with rare words; shorter tokens are more flexible but increase sequence length.
What to seek: A balance between 2 and 5 characters for most languages. Morphologically rich languages such as Irish may benefit from slightly longer tokens.
Unknown Token Rate (OOV Rate)
Definition: Percentage of tokens that map to the unknown/UNK token, indicating words the tokenizer cannot represent.
Intuition: Lower OOV means better vocabulary coverage. High OOV indicates the tokenizer encounters many unseen character sequences.
What to seek: Below 1% is excellent; below 5% is acceptable. BPE tokenizers typically achieve very low OOV due to subword fallback.
N-gram Model Metrics
Perplexity
Definition: Measures how "surprised" the model is by test data. Mathematically: 2^(cross-entropy). Lower values indicate better prediction.
Intuition: If perplexity is 100, the model is as uncertain as if choosing uniformly among 100 options at each step. A perplexity of 10 means effectively choosing among 10 equally likely options.
What to seek: Lower is better. With sufficient data, perplexity decreases as n-gram order grows (more context), although sparse counts at higher orders can reverse the trend, as seen in Section 2. Values vary widely by language and corpus size.
Entropy
Definition: Average information content (in bits) needed to encode the next token given the context. Related to perplexity: perplexity = 2^entropy.
Intuition: High entropy means high uncertainty/randomness; low entropy means predictable patterns. Natural language typically has entropy between 1-4 bits per character.
What to seek: Lower entropy indicates more predictable text patterns. Entropy should decrease as n-gram order increases when counts are dense; sparse higher-order models can show the opposite.
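A worked example of the entropy/perplexity relationship defined above:

```python
# Worked example of the relation described above: perplexity = 2**entropy.
import math

probs = [0.5, 0.25, 0.125, 0.125]                   # a next-token distribution
entropy = -sum(p * math.log2(p) for p in probs)     # 1.75 bits
perplexity = 2 ** entropy                           # ≈ 3.36 "effective choices"
print(entropy, perplexity)
```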
Coverage (Top-K)
Definition: Percentage of corpus occurrences explained by the top K most frequent n-grams.
Intuition: High coverage with few patterns indicates repetitive/formulaic text; low coverage suggests diverse vocabulary usage.
What to seek: Depends on use case. For language modeling, moderate coverage (40-60% with top-1000) is typical for natural text.
Markov Chain Metrics
Average Entropy
Definition: Mean entropy across all contexts, measuring average uncertainty in next-word prediction.
Intuition: Lower entropy means the model is more confident about what comes next. Context-1 has high entropy (many possible next words); Context-4 has low entropy (few likely continuations).
What to seek: Decreasing entropy with larger context sizes. Very low entropy (<0.1) indicates highly deterministic transitions.
Branching Factor
Definition: Average number of unique next tokens observed for each context.
Intuition: High branching = many possible continuations (flexible but uncertain); low branching = few options (predictable but potentially repetitive).
What to seek: Branching factor should decrease with context size. Values near 1.0 indicate nearly deterministic chains.
Predictability
Definition: Derived metric: (1 - normalized_entropy) Γ 100%. Indicates how deterministic the model's predictions are.
Intuition: 100% predictability means the next word is always certain; 0% means completely random. Real text falls between these extremes.
What to seek: Higher predictability for text generation quality, but too high (>98%) may produce repetitive output.
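Applying the formula to the word-based Markov rows of Section 3, where the reported average entropies already appear to be normalized to [0, 1], reproduces the predictability column:

```python
# The formula above applied to the word-based Markov rows of Section 3. The
# reported average entropies there appear to be already normalized to [0, 1],
# so (1 - entropy) * 100 reproduces the predictability column.
rows = {"context-1": 0.9964, "context-2": 0.3580,
        "context-3": 0.1474, "context-4": 0.0625}
for name, norm_entropy in rows.items():
    print(name, f"{(1 - norm_entropy) * 100:.1f}% predictable")
```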
Vocabulary & Zipf's Law Metrics
Zipf's Coefficient
Definition: The slope of the log-log plot of word frequency vs. rank. Zipf's law predicts this should be approximately -1.
Intuition: A coefficient near -1 indicates the corpus follows natural language patterns where a few words are very common and most words are rare.
What to seek: Values between -0.8 and -1.2 indicate healthy natural language distribution. Deviations may suggest domain-specific or artificial text.
RΒ² (Coefficient of Determination)
Definition: Measures how well the linear fit explains the frequency-rank relationship. Ranges from 0 to 1.
Intuition: RΒ² near 1.0 means the data closely follows Zipf's law; lower values indicate deviation from expected word frequency patterns.
What to seek: RΒ² > 0.95 is excellent; > 0.99 indicates near-perfect Zipf adherence typical of large natural corpora.
Vocabulary Coverage
Definition: Cumulative percentage of corpus tokens accounted for by the top N words.
Intuition: Shows how concentrated word usage is. If top-100 words cover 50% of text, the corpus relies heavily on common words.
What to seek: Top-100 covering 30-50% is typical. Higher coverage indicates more repetitive text; lower suggests richer vocabulary.
Word Embedding Metrics
Isotropy
Definition: Measures how uniformly distributed vectors are in the embedding space. Computed as the ratio of minimum to maximum singular values.
Intuition: High isotropy (near 1.0) means vectors spread evenly in all directions; low isotropy means vectors cluster in certain directions, reducing expressiveness.
What to seek: Higher isotropy generally indicates better-quality embeddings. Values > 0.1 are reasonable; > 0.3 is good. Lower-dimensional embeddings tend to have higher isotropy.
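A minimal computation of this isotropy measure (mean-centering before the SVD is a common choice but an assumption here; random data stands in for real vectors):

```python
# Minimal isotropy computation as defined above: ratio of smallest to largest
# singular value of the embedding matrix.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(10_000, 64))            # placeholder embedding matrix

centered = emb - emb.mean(axis=0)              # centering is an assumption
s = np.linalg.svd(centered, compute_uv=False)  # singular values, largest first
print(f"isotropy = {s[-1] / s[0]:.4f}")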
Average Norm
Definition: Mean magnitude (L2 norm) of word vectors in the embedding space.
Intuition: Indicates the typical "length" of vectors. Consistent norms suggest stable training; high variance may indicate some words are undertrained.
What to seek: Relatively consistent norms across models. The absolute value matters less than consistency (low std deviation).
Cosine Similarity
Definition: Measures angular similarity between vectors, ranging from -1 (opposite) to 1 (identical direction).
Intuition: Words with similar meanings should have high cosine similarity. This is the standard metric for semantic relatedness in embeddings.
What to seek: Semantically related words should score > 0.5; unrelated words should be near 0. Synonyms often score > 0.7.
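For reference, cosine similarity in a few lines of NumPy:

```python
# Cosine similarity between two vectors, as defined above.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
u, v = rng.normal(size=64), rng.normal(size=64)
print(cosine(u, v), cosine(u, u))   # near 0 for random vectors, 1.0 for identical
```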
t-SNE Visualization
Definition: t-Distributed Stochastic Neighbor Embedding - a dimensionality reduction technique that preserves local structure for visualization.
Intuition: Clusters in t-SNE plots indicate groups of semantically related words. Spread indicates vocabulary diversity; tight clusters suggest semantic coherence.
What to seek: Meaningful clusters (e.g., numbers together, verbs together). Avoid over-interpreting distances - t-SNE preserves local, not global, structure.
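A minimal t-SNE projection sketch (scikit-learn is an assumption; the pipeline may use a different implementation, but any t-SNE behaves the same way):

```python
# Minimal t-SNE projection of embedding vectors to 2D for plotting.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(300, 64)).astype(np.float32)   # placeholder word vectors

coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(emb)
print(coords.shape)   # (300, 2) -> one point per word, ready to scatter-plot
```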
General Interpretation Guidelines
- Compare within model families: Metrics are most meaningful when comparing models of the same type (e.g., 8k vs 64k tokenizer).
- Consider trade-offs: Better performance on one metric often comes at the cost of another (e.g., compression vs. OOV rate).
- Context matters: Optimal values depend on downstream tasks. Text generation may prioritize different metrics than classification.
- Corpus influence: All metrics are influenced by corpus characteristics. Wikipedia text differs from social media or literature.
- Language-specific patterns: Morphologically rich languages (like Irish or Arabic) may show different optimal ranges than analytic languages.
Visualizations Index
| Visualization | Description |
|---|---|
| Tokenizer Compression | Compression ratios by vocabulary size |
| Tokenizer Fertility | Average token length by vocabulary |
| Tokenizer OOV | Unknown token rates |
| Tokenizer Total Tokens | Total tokens by vocabulary |
| N-gram Perplexity | Perplexity by n-gram size |
| N-gram Entropy | Entropy by n-gram size |
| N-gram Coverage | Top pattern coverage |
| N-gram Unique | Unique n-gram counts |
| Markov Entropy | Entropy by context size |
| Markov Branching | Branching factor by context |
| Markov Contexts | Unique context counts |
| Zipf's Law | Frequency-rank distribution with fit |
| Vocab Frequency | Word frequency distribution |
| Top 20 Words | Most frequent words |
| Vocab Coverage | Cumulative coverage curve |
| Embedding Isotropy | Vector space uniformity |
| Embedding Norms | Vector magnitude distribution |
| Embedding Similarity | Word similarity heatmap |
| Nearest Neighbors | Similar words for key terms |
| t-SNE Words | 2D word embedding visualization |
| t-SNE Sentences | 2D sentence embedding visualization |
| Position Encoding | Encoding method comparison |
| Model Sizes | Storage requirements |
| Performance Dashboard | Comprehensive performance overview |
About This Project
Data Source
Models trained on wikipedia-monthly - a monthly snapshot of Wikipedia articles across 300+ languages.
Project
A project by Wikilangs - Open-source NLP models for every Wikipedia language.
Maintainer
Citation
If you use these models in your research, please cite:
@misc{wikilangs2025,
author = {Kamali, Omar},
title = {Wikilangs: Open NLP Models for Wikipedia Languages},
year = {2025},
doi = {10.5281/zenodo.18073153},
publisher = {Zenodo},
url = {https://huggingface.co/wikilangs},
institution = {Omneity Labs}
}
License
MIT License - Free for academic and commercial use.
Links
- Website: wikilangs.org
- Models: huggingface.co/wikilangs
- Data: wikipedia-monthly
- Author: Omar Kamali
- Sponsor: Featherless AI
Generated by Wikilangs Models Pipeline
Report Date: 2026-01-09 22:37:05



















