# Ukrainian SentencePiece Tokenizer (64K Vocabulary)
This is a SentencePiece unigram tokenizer trained on a Ukrainian text corpus, with a vocabulary of 64,000 tokens.
## Model Details
- Model Type: SentencePiece Unigram
- Vocabulary Size: 64,000
- Language: Ukrainian (uk)
- Training Corpus:
  - Kobza dataset (largest share)
  - Diia dataset
  - Academic corpus
  - Verkhovna Rada (Parliament) documents
## Special Tokens
The tokenizer includes the following special tokens:
- `<s>` - Beginning of sequence
- `</s>` - End of sequence
- `<unk>` - Unknown token
- `<pad>` - Padding token
- `<system>`, `</system>` - System message tags
- `<user>`, `</user>` - User message tags
- `<assistant>`, `</assistant>` - Assistant message tags
- `<reasoning>`, `</reasoning>` - Reasoning tags
- `<tool_call>`, `</tool_result>` - Tool interaction tags
- `<final>`, `</final>` - Final response tags
- `<json>`, `</json>` - JSON formatting tags
- `<tool>`, `</tool>` - Tool tags
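The message-style tags suggest a chat-template layout. Below is a minimal sketch of assembling a prompt from those tags; the exact template a downstream model expects is an assumption (this card does not ship an official chat template), so treat it as illustrative only:

```python
# Sketch: building a chat prompt from the tokenizer's special tags.
# The tag order used here (<s><system>...</system><user>...</user><assistant>)
# is an assumption, not a documented template.

def build_prompt(system: str, user: str) -> str:
    """Wrap a system message and a user message in the special tags,
    leaving an open <assistant> tag for the model to continue from."""
    return (
        f"<s><system>{system}</system>"
        f"<user>{user}</user>"
        f"<assistant>"
    )

prompt = build_prompt("Ти корисний асистент.", "Привіт!")
print(prompt)
```

Because the tags are registered as special tokens, a string built this way tokenizes each tag to a single ID rather than splitting it into subwords.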
## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "dovcharenko/spm_uk_64k",
    trust_remote_code=True
)

# Encode text
text = "Україна — це країна в Східній Європі"
tokens = tokenizer.encode(text, add_special_tokens=False)
print(f"Tokens: {len(tokens)}")

# Decode tokens
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```
## Performance Metrics
Based on testing on a Ukrainian corpus:
- Tokens per 1K chars: Mean ~375.64, Median ~233.49
- Tokens per word: Mean ~16.13, Median ~1.63
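The two metrics above can be reproduced with a short script. The sketch below defines both ratios; a whitespace split stands in for the real encoder so the example runs standalone, and with the actual tokenizer you would pass `lambda t: tokenizer.encode(t, add_special_tokens=False)` instead:

```python
import statistics

def tokens_per_1k_chars(texts, encode):
    """Tokens emitted per 1,000 input characters, one ratio per text."""
    return [len(encode(t)) / max(len(t), 1) * 1000 for t in texts]

def tokens_per_word(texts, encode):
    """Tokens emitted per whitespace-separated word, one ratio per text."""
    return [len(encode(t)) / max(len(t.split()), 1) for t in texts]

# Stand-in encoder (whitespace split) so the sketch runs without the model.
encode = lambda t: t.split()

corpus = ["Україна — це країна в Східній Європі"]
per_1k = tokens_per_1k_chars(corpus, encode)
per_word = tokens_per_word(corpus, encode)
print(f"Tokens per 1K chars (mean): {statistics.mean(per_1k):.2f}")
print(f"Tokens per word (mean):     {statistics.mean(per_word):.2f}")
```

Reporting both the mean and the median, as the card does, is useful because a few long or unusual documents can skew the mean well above the typical per-document ratio.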
## Citation
If you use this tokenizer, please cite:
```bibtex
@misc{spm_uk_64k_tokenizer,
  title={Ukrainian SentencePiece Tokenizer (64K Vocabulary)},
  author={Dovcharenko},
  year={2024},
  publisher={HuggingFace}
}
```
## License
Apache 2.0