# Emese Tokenizer

A 32,000-vocabulary SentencePiece Unigram tokenizer optimized for the Hungarian language.

## Technical Specs

| Property | Value |
|---|---|
| Vocab size | 32,000 |
| Training corpus | Hungarian Wikipedia (1.16M paragraphs, ~252M chars) |
| Algorithm | SentencePiece Unigram (pure EM) |
| Avg. chars/token | 4.90 |
| Avg. tokens/word | 1.54 |
| Max piece length | 24 chars |
| Training sample | 500,000 paragraphs |
| Character coverage | 0.9999 |
| Normalization | NFKC |
| Shrinking factor | 0.85 |
| EM sub-iterations | 4 |
| Seed sentencepieces | 2,000,000 |
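
The settings above map directly onto SentencePiece trainer options. Below is a minimal training sketch consistent with the table; the corpus path is hypothetical, and this is not necessarily the exact invocation used:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="hu_wiki_paragraphs.txt",      # hypothetical path: one paragraph per line
    model_prefix="emese-tokenizer",
    model_type="unigram",                # SentencePiece Unigram (pure EM)
    vocab_size=32000,
    character_coverage=0.9999,
    normalization_rule_name="nfkc",
    max_sentencepiece_length=24,
    input_sentence_size=500000,          # sample 500,000 paragraphs for training
    shuffle_input_sentence=True,
    shrinking_factor=0.85,
    num_sub_iterations=4,
    seed_sentencepiece_size=2000000,
    user_defined_symbols=["<eos>"],      # assumption: reserves ID 3 after <unk>/<s>/</s>
)
```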

## Usage

```python
import sentencepiece as spm

# Load the pretrained tokenizer model
sp = spm.SentencePieceProcessor(model_file="emese-tokenizer/emese-tokenizer.model")

# Encode text to token IDs, then decode back to text
tokens = sp.encode("Budapest Magyarország fővárosa.", out_type=int)
text = sp.decode(tokens)
```
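
`encode` can also return the subword pieces themselves, which is handy for inspecting segmentation quality:

```python
pieces = sp.encode("Budapest Magyarország fővárosa.", out_type=str)
print(pieces)               # subword pieces; "▁" marks the start of a word
print(sp.get_piece_size())  # vocabulary size: 32000
```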

## Special Tokens

| ID | Token | Purpose |
|---|---|---|
| 0 | `<unk>` | Unknown token |
| 1 | `<s>` | Beginning of sequence |
| 2 | `</s>` | End of sequence |
| 3 | `<eos>` | End of document (used as a separator in training data) |
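
Since `<eos>` (ID 3) marks the end of a document, a language-model data pipeline would typically join encoded documents with it. A minimal sketch, assuming `sp` is loaded as in the Usage section and using hypothetical documents:

```python
eos_id = sp.piece_to_id("<eos>")  # resolves to 3

docs = ["Első dokumentum.", "Második dokumentum."]  # hypothetical documents
stream = []
for doc in docs:
    stream.extend(sp.encode(doc, out_type=int))
    stream.append(eos_id)  # <eos> separates consecutive documents
```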

## Comparison with Other Hungarian Tokenizers

| Model | Vocab | Chars/token | Tokens/word | Algorithm |
|---|---|---|---|---|
| Emese | 32K | 4.90 | 1.54 | Unigram (pure EM) |
| huBERT | 32K | ~4.60 | ~1.65 | WordPiece |
| PULI-GPT-3SX | 50K | ~4.45 | ~1.71 | BPE |
| PULI-GPTrio | 150K | ~4.65 | ~1.62 | BPE (trilingual) |
| GPT-4o | 200K | 4.2–4.4 | 1.6–1.8 | BPE |
| Llama 3.1 | 128K | ~4.10 | ~1.85 | BPE |
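
The chars/token and tokens/word metrics can be reproduced on any text sample. The sketch below shows one plausible way to compute them, assuming whitespace-delimited word counting and the `sp` processor from the Usage section:

```python
def tokenizer_stats(lines):
    """Average chars/token and tokens/word over an iterable of text lines."""
    n_chars = n_tokens = n_words = 0
    for line in lines:
        tokens = sp.encode(line, out_type=int)
        n_chars += len(line)
        n_tokens += len(tokens)
        n_words += len(line.split())  # naive whitespace word count (assumption)
    return n_chars / n_tokens, n_tokens / n_words

chars_per_token, tokens_per_word = tokenizer_stats(["Budapest Magyarország fővárosa."])
```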