Latin BPE Tokenizer (32K)

A SentencePiece BPE tokenizer trained on 738 million words of curated Latin text — the largest clean Latin corpus assembled to date.

Why a Latin-specific tokenizer?

Multilingual tokenizers (Llama, Qwen, GPT) fragment Latin words into 3-5 subwords because Latin represents a tiny fraction of their training data. This tokenizer treats Latin as the whole point:

Multilingual:  "principio" → ["pr", "inc", "ip", "io"]     (4 tokens)
This:          "principio" → ["▁principio"]                 (1 token)

Multilingual:  "Gallia est omnis divisa in partes tres." → 12-15 tokens
This:          "Gallia est omnis divisa in partes tres." → 8 tokens

Fewer tokens per word means longer effective context windows and better modeling of Latin morphology.
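
The difference is easy to measure on your own text by comparing tokens per word (fertility). A minimal sketch, assuming the repo ID below and using bert-base-multilingual-cased purely as a stand-in multilingual baseline (swap in any other tokenizer):

from transformers import AutoTokenizer

# Rough tokens-per-word ("fertility") comparison. The baseline model ID is only
# an example of an openly downloadable multilingual tokenizer.
latin = AutoTokenizer.from_pretrained("v37/latin-tokenizer-32k", trust_remote_code=True)
baseline = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "Gallia est omnis divisa in partes tres."
n_words = len(text.split())

for name, tok in [("latin-32k", latin), ("multilingual", baseline)]:
    n_tokens = len(tok.tokenize(text))
    print(f"{name:>12}: {n_tokens} tokens ({n_tokens / n_words:.2f} per word)")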

Training data

738M words from 14 curated sources, deduplicated and quality-filtered:

Source                Words   Description
hf-latin-170m         429M    Mixed-period Latin texts
cc100-latin           178M    Web-crawled Latin (lorem ipsum and fragments removed)
camena                46M     Corpus Automatum Multiplex Electorum Neolatinitatis Auctorum
lascivaroma           18M     Classical Latin literature
ogl-csel              15M     Corpus Scriptorum Ecclesiasticorum Latinorum
cltk-latin_library    13M     Classical Language Toolkit — Latin Library
latinise              12M     Latin texts collection
openmgh               8M      Monumenta Germaniae Historica
cltk-tesserae         7M      Classical Language Toolkit — Tesserae
perseus               7M      Perseus Digital Library
digiliblt             6M      Digital Library of Late-Antique Latin Texts
vulgate               0.6M    Latin Vulgate Bible
proiel                0.4M    PROIEL treebank
cltk-perseus          0.1M    Classical Language Toolkit — Perseus

Sources with OCR quality issues (latin-pd, patrologia-latina) were excluded entirely.

Tokenizer details

  • Type: BPE (SentencePiece)
  • Vocab size: 32,000
  • Special tokens: <unk> (0), <s> (1), </s> (2), <pad> (3)
  • Byte fallback: enabled (handles any UTF-8 input)
  • Digit splitting: enabled
  • Character coverage: 99.99%
  • Normalization: identity (no NFKC or case folding)
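
For reference, these settings map onto a SentencePiece training call roughly like the one below. This is a sketch, not the exact command used to build the released model; the input file name is a placeholder.

import sentencepiece as spm

# Approximate training invocation matching the settings above (input path is a placeholder).
spm.SentencePieceTrainer.train(
    input="latin_corpus.txt",            # one sentence per line
    model_prefix="latin_bpe_32000",
    model_type="bpe",
    vocab_size=32000,
    character_coverage=0.9999,
    byte_fallback=True,                  # any UTF-8 input stays encodable
    split_digits=True,
    normalization_rule_name="identity",  # no NFKC or case folding
    unk_id=0, bos_id=1, eos_id=2, pad_id=3,
    unk_piece="<unk>", bos_piece="<s>", eos_piece="</s>", pad_piece="<pad>",
)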

Usage

With HuggingFace Transformers

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("v37/latin-tokenizer-32k", trust_remote_code=True)

text = "Gallia est omnis divisa in partes tres."
tokens = tok.tokenize(text)
# ['▁Gallia', '▁est', '▁omnis', '▁divisa', '▁in', '▁partes', '▁tres', '.']

ids = tok.encode(text)
# [7294, 322, 1251, 12800, 281, 2080, 2209, 31719]

decoded = tok.decode(ids)
# "Gallia est omnis divisa in partes tres."

With SentencePiece directly

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("latin_bpe_32000.model")

text = "Gallia est omnis divisa in partes tres."
tokens = sp.encode(text, out_type=str)
# ['▁Gallia', '▁est', '▁omnis', '▁divisa', '▁in', '▁partes', '▁tres', '.']
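
Encoding to ids and inspecting the vocabulary follow the standard SentencePiece API; continuing the example above (the <s> mapping assumes the special-token ids listed earlier):

ids = sp.encode(text)                  # default out_type is int
decoded = sp.decode(ids)
# "Gallia est omnis divisa in partes tres."

print(sp.get_piece_size())             # 32000
print(sp.id_to_piece(1))               # '<s>' (per the special-token ids above)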

Sample tokenizations

"Gallia est omnis divisa in partes tres."
  → ▁Gallia ▁est ▁omnis ▁divisa ▁in ▁partes ▁tres .   (8 tokens)

"In principio creavit Deus caelum et terram."
  → ▁In ▁principio ▁creavit ▁Deus ▁caelum ▁et ▁terram .   (8 tokens)

"Cogito ergo sum."
  → ▁Cog ito ▁ergo ▁sum .   (5 tokens)

"Ave Maria, gratia plena, Dominus tecum."
  → ▁Ave ▁Maria , ▁gratia ▁plena , ▁Dominus ▁tecum .   (9 tokens)
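
These examples can be regenerated with a short loop over the SentencePiece model (model path as in the usage section above):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("latin_bpe_32000.model")

sentences = [
    "Gallia est omnis divisa in partes tres.",
    "In principio creavit Deus caelum et terram.",
    "Cogito ergo sum.",
    "Ave Maria, gratia plena, Dominus tecum.",
]
for s in sentences:
    pieces = sp.encode(s, out_type=str)
    print(f"{s!r}: {' '.join(pieces)} ({len(pieces)} tokens)")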

Files

  • latin_bpe_32000.model — SentencePiece model file (565 KB)
  • latin_bpe_32000.vocab — Human-readable vocabulary
  • latin_tokenizer.py — Custom tokenizer class for HuggingFace integration
  • tokenizer_config.json — Configuration metadata

License

Apache-2.0
