---
language: es
language_name: Spanish
language_family: romance_iberian
tags:
- wikilangs
- nlp
- tokenizer
- embeddings
- n-gram
- markov
- wikipedia
- feature-extraction
- sentence-similarity
- tokenization
- n-grams
- markov-chain
- text-mining
- fasttext
- babelvec
- vocabulous
- vocabulary
- monolingual
- family-romance_iberian
license: mit
library_name: wikilangs
pipeline_tag: text-generation
datasets:
- omarkamali/wikipedia-monthly
dataset_info:
name: wikipedia-monthly
description: Monthly snapshots of Wikipedia articles across 300+ languages
metrics:
- name: best_compression_ratio
type: compression
value: 4.831
- name: best_isotropy
type: isotropy
value: 0.7898
- name: best_alignment_r10
type: alignment
value: 0.9680
- name: vocabulary_size
type: vocab
value: 1128398
generated: 2026-03-04
---
# Spanish — Wikilangs Models
Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on **Spanish** Wikipedia by [Wikilangs](https://wikilangs.org).
🌐 [Language Page](https://wikilangs.org/languages/es/) · 🎮 [Playground](https://wikilangs.org/playground/?lang=es) · 📊 [Full Research Report](RESEARCH_REPORT.md)
## Language Samples
Example sentences drawn from the Spanish Wikipedia corpus:
> Apogonia es un género de escarabajos. Algunos son plagas de los árboles de durio. Referencias
> Elymordeum es un género monotípico de plantas herbáceas perteneciente a la familia de las poáceas. Su única especie es Elymordeum montanense (Scribn.) Bowden. Referencias
> Graphis es un género de hongos liquenizados de la familia Graphidaceae. Fue descrito por el naturalista francés Michel Adanson en Referencias de Graphidales
> Modem puede hacer referencia: el módem, dispositivo electrónico de comunicación; o el partido político francés MoDem.
> Opegrapha es un género de hongos liquenizados de la familia Opegraphaceae. Especies Referencias de Arthoniales
## Quick Start
### Load the Tokenizer
```python
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("es_tokenizer_32k.model")
text = "Opegrapha es un género de hongos liquenizados de la familia Opegraphaceae. Espec"
tokens = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)
print(tokens) # subword pieces
print(ids) # integer ids
# Decode back
print(sp.DecodeIds(ids))
```
<details>
<summary><b>Tokenization examples (click to expand)</b></summary>
**Sample 1:** `Opegrapha es un género de hongos liquenizados de la familia Opegraphaceae. Espec…`
| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁o pe gra p ha ▁es ▁un ▁género ▁de ▁hon … (+22 more)` | 32 |
| 16k | `▁o pe gra pha ▁es ▁un ▁género ▁de ▁hongos ▁li … (+18 more)` | 28 |
| 32k | `▁o pe gra pha ▁es ▁un ▁género ▁de ▁hongos ▁li … (+17 more)` | 27 |
| 64k | `▁o pe gra pha ▁es ▁un ▁género ▁de ▁hongos ▁li … (+17 more)` | 27 |
**Sample 2:** `Una única familia: Salicaceae. Árboles, arbustos y matas. Numerosos óvulos; 2 ca…`
| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁una ▁única ▁familia : ▁sal ica ceae . ▁árboles , … (+29 more)` | 39 |
| 16k | `▁una ▁única ▁familia : ▁sal ica ceae . ▁árboles , … (+24 more)` | 34 |
| 32k | `▁una ▁única ▁familia : ▁sal icaceae . ▁árboles , ▁arbustos … (+17 more)` | 27 |
| 64k | `▁una ▁única ▁familia : ▁sal icaceae . ▁árboles , ▁arbustos … (+17 more)` | 27 |
**Sample 3:** `Apogonia es un género de escarabajos. Algunos son plagas de los árboles de durio…`
| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁apo gon ia ▁es ▁un ▁género ▁de ▁esca ra ba … (+14 more)` | 24 |
| 16k | `▁apo gon ia ▁es ▁un ▁género ▁de ▁esca raba jos … (+13 more)` | 23 |
| 32k | `▁apo gonia ▁es ▁un ▁género ▁de ▁esca raba jos . … (+12 more)` | 22 |
| 64k | `▁apo gonia ▁es ▁un ▁género ▁de ▁escarabajos . ▁algunos ▁son … (+9 more)` | 19 |
</details>
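To reproduce the token-count comparison above, encode the same text at each vocabulary size. A minimal sketch; the 8k/16k/64k file names follow the `es_tokenizer_32k.model` pattern and are assumptions, so adjust them to the files actually shipped:
```python
import sentencepiece as spm

text = "Apogonia es un género de escarabajos."
for size in ("8k", "16k", "32k", "64k"):
    sp = spm.SentencePieceProcessor()
    sp.Load(f"es_tokenizer_{size}.model")  # file name pattern assumed from the 32k model
    pieces = sp.EncodeAsPieces(text)
    print(f"{size}: {len(pieces)} tokens -> {pieces[:8]} ...")
```
Larger vocabularies merge more characters per piece, which is why the token counts in the tables above shrink from 8k to 64k.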
### Load Word Embeddings
```python
from gensim.models import KeyedVectors
# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("es_embeddings_128d_aligned.kv")
similar = wv.most_similar("palabra", topn=5)  # query any in-vocabulary Spanish word
for word, score in similar:
print(f" {word}: {score:.3f}")
```
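Because the aligned vectors are mapped into a shared English-anchored space, cross-lingual lookup is just nearest-neighbor search between two `KeyedVectors`. A sketch, assuming you have also downloaded an English counterpart (the file name `en_embeddings_128d_aligned.kv` below is hypothetical, extrapolated from the Spanish naming):
```python
from gensim.models import KeyedVectors

es = KeyedVectors.load("es_embeddings_128d_aligned.kv")
en = KeyedVectors.load("en_embeddings_128d_aligned.kv")  # hypothetical file name

def cross_lingual_neighbors(word, src, tgt, topn=5):
    """Nearest target-language words for a source-language word."""
    vec = src[word]  # raises KeyError if the word is out of vocabulary
    return tgt.similar_by_vector(vec, topn=topn)

print(cross_lingual_neighbors("perro", es, en))  # 'dog' should rank near the top
```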
### Load N-gram Model
```python
import pyarrow.parquet as pq
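# Rows are word 3-grams with their corpus counts (inspect df.columns for the exact schema).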
df = pq.read_table("es_3gram_word.parquet").to_pandas()
print(df.head())
```
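Since the tables are plain frequency counts, simple next-word ranking works directly in pandas. A sketch under assumed column names (`w1`, `w2`, `w3`, `count` are guesses; check `df.columns` for the real schema):
```python
import pyarrow.parquet as pq

df = pq.read_table("es_3gram_word.parquet").to_pandas()

# Assumed columns: w1, w2, w3, count. Rename to match the actual file.
context = df[(df["w1"] == "género") & (df["w2"] == "de")]
top = context.sort_values("count", ascending=False).head(5)
print(top[["w3", "count"]])  # most frequent words after "género de"
```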
## Models Overview
![Performance Dashboard](visualizations/performance_dashboard.png)
| Category | Assets |
|----------|--------|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword), see sampling sketch below |
| Embeddings | 32d, 64d, 128d — mono & aligned |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |
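As a rough illustration of how the Markov assets drive generation: a context-k chain maps the last k tokens to a distribution over next tokens. The asset format is not documented in this card, so this sketch assumes you have built a dict from the shipped tables; the structure shown is hypothetical:
```python
import random

# Hypothetical structure, e.g. {("hongos", "liquenizados"): {"de": 120, "que": 7}}.
def generate(chain, context, steps=20):
    """Sample a continuation from a context-k Markov chain."""
    out = list(context)
    for _ in range(steps):
        nxt = chain.get(tuple(out[-len(context):]))
        if not nxt:  # unseen context: stop
            break
        words, counts = zip(*nxt.items())
        out.append(random.choices(words, weights=counts, k=1)[0])
    return " ".join(out)
```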
## Metrics Summary
| Component | Model | Key Metric | Value |
|-----------|-------|------------|-------|
| Tokenizer | 8k BPE | Compression | 3.89x |
| Tokenizer | 16k BPE | Compression | 4.28x |
| Tokenizer | 32k BPE | Compression | 4.60x |
| Tokenizer | 64k BPE | Compression | 4.83x 🏆 |
| N-gram | 2-gram (subword) | Perplexity | 225 🏆 |
| N-gram | 2-gram (word) | Perplexity | 183,447 |
| N-gram | 3-gram (subword) | Perplexity | 1,802 |
| N-gram | 3-gram (word) | Perplexity | 1,817,727 |
| N-gram | 4-gram (subword) | Perplexity | 10,272 |
| N-gram | 4-gram (word) | Perplexity | 7,309,961 |
| N-gram | 5-gram (subword) | Perplexity | 43,696 |
| N-gram | 5-gram (word) | Perplexity | 8,151,138 |
| Markov | ctx-1 (subword) | Predictability | 0.0% |
| Markov | ctx-1 (word) | Predictability | 0.0% |
| Markov | ctx-2 (subword) | Predictability | 37.1% |
| Markov | ctx-2 (word) | Predictability | 53.8% |
| Markov | ctx-3 (subword) | Predictability | 32.1% |
| Markov | ctx-3 (word) | Predictability | 76.0% |
| Markov | ctx-4 (subword) | Predictability | 32.2% |
| Markov | ctx-4 (word) | Predictability | 88.3% 🏆 |
| Vocabulary | full | Size | 1,128,398 |
| Vocabulary | full | Zipf R² | 0.9938 |
| Embeddings | mono_32d | Isotropy | 0.7898 |
| Embeddings | mono_64d | Isotropy | 0.7625 |
| Embeddings | mono_128d | Isotropy | 0.6860 |
| Embeddings | aligned_32d | Isotropy | 0.7898 🏆 |
| Embeddings | aligned_64d | Isotropy | 0.7625 |
| Embeddings | aligned_128d | Isotropy | 0.6860 |
| Alignment | aligned_32d | R@1 / R@5 / R@10 | 56.6% / 81.2% / 86.8% |
| Alignment | aligned_64d | R@1 / R@5 / R@10 | 75.2% / 88.6% / 92.6% |
| Alignment | aligned_128d | R@1 / R@5 / R@10 | 79.6% / 94.4% / 96.8% 🏆 |
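The compression figures above are most naturally read as characters per token (4.83 input characters per 64k-BPE token); that definition is an assumption here, and the research report has the exact formula. A quick sanity check on your own text:
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("es_tokenizer_32k.model")

text = "Elymordeum es un género monotípico de plantas herbáceas."
n_tokens = len(sp.EncodeAsIds(text))
print(f"compression ≈ {len(text) / n_tokens:.2f} chars/token")
```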
📊 **[Full ablation study, per-model breakdowns, and interpretation guide →](RESEARCH_REPORT.md)**
---
## About
Trained on [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly) — monthly snapshots of 300+ Wikipedia languages.
A project by **[Wikilangs](https://wikilangs.org)** · Maintainer: [Omar Kamali](https://omarkamali.com) · [Omneity Labs](https://omneitylabs.com)
### Citation
```bibtex
@misc{wikilangs2025,
author = {Kamali, Omar},
title = {Wikilangs: Open NLP Models for Wikipedia Languages},
year = {2025},
doi = {10.5281/zenodo.18073153},
publisher = {Zenodo},
url = {https://huggingface.co/wikilangs},
institution = {Omneity Labs}
}
```
### Links
- 🌐 [wikilangs.org](https://wikilangs.org)
- 🌍 [Language page](https://wikilangs.org/languages/es/)
- 🎮 [Playground](https://wikilangs.org/playground/?lang=es)
- 🤗 [HuggingFace models](https://huggingface.co/wikilangs)
- 📊 [wikipedia-monthly dataset](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
- 👤 [Omar Kamali](https://huggingface.co/omarkamali)
- 🤝 Sponsor: [Featherless AI](https://featherless.ai)
**License:** MIT — free for academic and commercial use.
---
*Generated by Wikilangs Pipeline · 2026-03-04 04:26:07*