# Latin BPE Tokenizer (32K)
A SentencePiece BPE tokenizer trained on 738 million words of curated Latin text — the largest clean Latin corpus assembled to date.
## Why a Latin-specific tokenizer?
Multilingual tokenizers (Llama, Qwen, GPT) fragment Latin words into 3-5 subwords because Latin represents a tiny fraction of their training data. This tokenizer treats Latin as the whole point:
Multilingual: "principio" → ["pr", "inc", "ip", "io"] (4 tokens)
This: "principio" → ["▁principio"] (1 token)
Multilingual: "Gallia est omnis divisa in partes tres." → 12-15 tokens
This: "Gallia est omnis divisa in partes tres." → 8 tokens
Fewer tokens per word means longer effective context windows and better modeling of Latin morphology.
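A quick way to see the difference is to tokenize the same sentence with both kinds of vocabulary. The sketch below assumes the Transformers usage shown later in this card; `bert-base-multilingual-cased` is used purely as an illustrative stand-in for any general-purpose multilingual tokenizer.

```python
from transformers import AutoTokenizer

# This tokenizer (see Usage below).
latin_tok = AutoTokenizer.from_pretrained("v37/latin-tokenizer-32k", trust_remote_code=True)

# Any general-purpose multilingual tokenizer works for the comparison;
# bert-base-multilingual-cased is only an example.
multi_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "Gallia est omnis divisa in partes tres."
print(len(latin_tok.tokenize(text)))  # 8 tokens
print(len(multi_tok.tokenize(text)))  # noticeably more: Latin words get fragmented into subwords
```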
## Training data
738M words from 14 curated sources, deduplicated and quality-filtered:
| Source | Words | Description |
|---|---|---|
| hf-latin-170m | 429M | Mixed-period Latin texts |
| cc100-latin | 178M | Web-crawled Latin (lorem ipsum and fragments removed) |
| camena | 46M | Corpus Automatum Multiplex Electorum Neolatinitatis Auctorum |
| lascivaroma | 18M | Classical Latin literature |
| ogl-csel | 15M | Corpus Scriptorum Ecclesiasticorum Latinorum |
| cltk-latin_library | 13M | Classical Language Toolkit — Latin Library |
| latinise | 12M | Latin texts collection |
| openmgh | 8M | Monumenta Germaniae Historica |
| cltk-tesserae | 7M | Classical Language Toolkit — Tesserae |
| digiliblt | 6M | Digital Library of Late-Antique Latin Texts |
| perseus | 7M | Perseus Digital Library |
| vulgate | 0.6M | Latin Vulgate Bible |
| proiel | 0.4M | PROIEL treebank |
| cltk-perseus | 0.1M | Classical Language Toolkit — Perseus |
Sources with OCR quality issues (latin-pd, patrologia-latina) were excluded entirely.
## Tokenizer details
- Type: BPE (SentencePiece)
- Vocab size: 32,000
- Special tokens: `<unk>` (0), `<s>` (1), `</s>` (2), `<pad>` (3)
- Byte fallback: enabled (handles any UTF-8 input)
- Digit splitting: enabled
- Character coverage: 99.99%
- Normalization: identity (no NFKC or case folding)
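For reference, a configuration like the one above maps onto a SentencePiece training call as sketched below. This is an illustration only, not the exact command used for this release; the input path is a placeholder, since the 738M-word corpus itself is not distributed here.

```python
import sentencepiece as spm

# Sketch of a training call matching the settings listed above.
# "latin_corpus.txt" is a hypothetical path to one-sentence-per-line training text.
spm.SentencePieceTrainer.train(
    input="latin_corpus.txt",
    model_prefix="latin_bpe_32000",
    model_type="bpe",
    vocab_size=32000,
    character_coverage=0.9999,
    byte_fallback=True,
    split_digits=True,
    normalization_rule_name="identity",  # no NFKC, no case folding
    unk_id=0, bos_id=1, eos_id=2, pad_id=3,
)
```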
## Usage

### With HuggingFace Transformers
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("v37/latin-tokenizer-32k", trust_remote_code=True)

text = "Gallia est omnis divisa in partes tres."

tokens = tok.tokenize(text)
# ['▁Gallia', '▁est', '▁omnis', '▁divisa', '▁in', '▁partes', '▁tres', '.']

ids = tok.encode(text)
# [7294, 322, 1251, 12800, 281, 2080, 2209, 31719]

decoded = tok.decode(ids)
# "Gallia est omnis divisa in partes tres."
```
### With SentencePiece directly
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("latin_bpe_32000.model")

text = "Gallia est omnis divisa in partes tres."

tokens = sp.encode(text, out_type=str)
# ['▁Gallia', '▁est', '▁omnis', '▁divisa', '▁in', '▁partes', '▁tres', '.']
```
## Sample tokenizations
| Text | Tokens | Count |
|---|---|---|
| Gallia est omnis divisa in partes tres. | ▁Gallia ▁est ▁omnis ▁divisa ▁in ▁partes ▁tres . | 8 |
| In principio creavit Deus caelum et terram. | ▁In ▁principio ▁creavit ▁Deus ▁caelum ▁et ▁terram . | 8 |
| Cogito ergo sum. | ▁Cog ito ▁ergo ▁sum . | 5 |
| Ave Maria, gratia plena, Dominus tecum. | ▁Ave ▁Maria , ▁gratia ▁plena , ▁Dominus ▁tecum . | 9 |
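The counts above can be reproduced with the SentencePiece model directly; a minimal sketch, assuming `latin_bpe_32000.model` from the Files section is in the working directory:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("latin_bpe_32000.model")

samples = [
    "Gallia est omnis divisa in partes tres.",
    "In principio creavit Deus caelum et terram.",
    "Cogito ergo sum.",
    "Ave Maria, gratia plena, Dominus tecum.",
]
for text in samples:
    pieces = sp.encode(text, out_type=str)
    print(f"{len(pieces):2d}  {' '.join(pieces)}")
```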
## Files
- `latin_bpe_32000.model` — SentencePiece model file (565 KB)
- `latin_bpe_32000.vocab` — Human-readable vocabulary
- `latin_tokenizer.py` — Custom tokenizer class for HuggingFace integration
- `tokenizer_config.json` — Configuration metadata
## License
Apache-2.0