# Emese Tokenizer

A 32,000-vocabulary SentencePiece Unigram tokenizer optimized for the Hungarian language.
## Technical Specs
| Property | Value |
|---|---|
| Vocab size | 32,000 |
| Training corpus | Hungarian Wikipedia (1.16M paragraphs, ~252M chars) |
| Algorithm | SentencePiece Unigram (pure EM) |
| Avg. chars/token | 4.90 |
| Avg. tokens/word | 1.54 |
| Max piece length | 24 chars |
| Training sample | 500,000 paragraphs |
| Character coverage | 0.9999 |
| Normalization | NFKC |
| Shrinking factor | 0.85 |
| EM sub-iterations | 4 |
| Seed sentencepieces | 2,000,000 |
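
For reference, the following is a minimal sketch of a SentencePiece training invocation matching the settings in the table above. The input file path and model prefix are hypothetical, and any option not listed in the table is left at its SentencePiece default.

```python
import sentencepiece as spm

# Sketch of a training run reproducing the settings above.
# "hu_wiki_paragraphs.txt" (one paragraph per line) is a hypothetical path.
spm.SentencePieceTrainer.train(
    input="hu_wiki_paragraphs.txt",
    model_prefix="emese-tokenizer",
    model_type="unigram",
    vocab_size=32000,
    character_coverage=0.9999,
    normalization_rule_name="nfkc",
    max_sentencepiece_length=24,
    shrinking_factor=0.85,
    num_sub_iterations=4,
    seed_sentencepiece_size=2000000,
    input_sentence_size=500000,
    shuffle_input_sentence=True,
    user_defined_symbols=["<eos>"],  # assigned ID 3, after <unk>, <s>, </s>
)
```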
## Usage
```python
import sentencepiece as spm

# Load the trained model
sp = spm.SentencePieceProcessor(model_file="emese-tokenizer/emese-tokenizer.model")

# Encode to token IDs and decode back to text
tokens = sp.encode("Budapest Magyarország fővárosa.", out_type=int)
text = sp.decode(tokens)
```
## Special Tokens
| ID | Token | Purpose |
|---|---|---|
| 0 | `<unk>` | Unknown token |
| 1 | `<s>` | Beginning of sequence |
| 2 | `</s>` | End of sequence |
| 3 | `<eos>` | End of document (used as separator in training data) |
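
As an illustrative sketch of how the `<eos>` separator can be used when packing documents into a single token stream (the document strings below are placeholders):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="emese-tokenizer/emese-tokenizer.model")

# <eos> is a user-defined piece, so it maps to a single ID (3 per the table above)
eos_id = sp.piece_to_id("<eos>")

# Illustrative: pack two documents into one ID stream separated by <eos>
docs = ["Első dokumentum.", "Második dokumentum."]
ids = []
for doc in docs:
    ids.extend(sp.encode(doc, out_type=int))
    ids.append(eos_id)
```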
## Comparison with Other Hungarian Tokenizers
| Model | Vocab | Chars/token | Tokens/word | Algorithm |
|---|---|---|---|---|
| Emese | 32K | 4.90 | 1.54 | Unigram (Pure EM) |
| huBERT | 32K | ~4.60 | ~1.65 | WordPiece |
| PULI-GPT-3SX | 50K | ~4.45 | ~1.71 | BPE |
| PULI-GPTrio | 150K | ~4.65 | ~1.62 | BPE (Trilingual) |
| GPT-4o | 200K | 4.2–4.4 | 1.6–1.8 | BPE |
| Llama 3.1 | 128K | ~4.10 | ~1.85 | BPE |
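
For transparency, a sketch of how the chars/token and tokens/word figures can be computed for Emese; the evaluation text below is a placeholder, and whitespace splitting is assumed for word counts.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="emese-tokenizer/emese-tokenizer.model")

# Placeholder evaluation text; in practice use a held-out Hungarian corpus.
text = "Budapest Magyarország fővárosa és egyben legnépesebb városa."

tokens = sp.encode(text, out_type=int)
words = text.split()  # assumption: whitespace-delimited words

chars_per_token = len(text) / len(tokens)
tokens_per_word = len(tokens) / len(words)
print(f"chars/token: {chars_per_token:.2f}, tokens/word: {tokens_per_word:.2f}")
```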