Ukrainian SentencePiece Tokenizer (64K Vocabulary)

This is a SentencePiece unigram tokenizer trained on a Ukrainian text corpus with a vocabulary size of 64,000 tokens.

Model Details

  • Model Type: SentencePiece Unigram
  • Vocabulary Size: 64,000
  • Language: Ukrainian (uk)
  • Training Corpus:
    • Kobza dataset (largest percentage)
    • Diia dataset
    • Academic corpus
    • Verkhovna Rada (Parliament) documents
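A unigram tokenizer segments text by choosing, among all ways of splitting a string into vocabulary pieces, the split with the highest total piece probability. A toy Viterbi sketch of that idea (the vocabulary and probabilities below are invented for illustration and are not from this model):

```python
import math

def segment(text, vocab):
    """Toy unigram segmentation: pick the split of `text` into pieces from
    `vocab` (dict piece -> probability) that maximizes total log-probability."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):  # cap piece length at 10 chars
            piece = text[j:i]
            if piece in vocab and best[j][0] > -math.inf:
                score = best[j][0] + math.log(vocab[piece])
                if score > best[i][0]:
                    best[i] = (score, j)
    # Backtrack from the end to recover the winning pieces
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return pieces[::-1]

# Invented toy vocabulary: longer pieces are more probable here,
# so the segmenter prefers them over character-level splits.
vocab = {"Укра": 0.2, "їна": 0.2, "У": 0.05, "кра": 0.05, "ї": 0.05, "на": 0.1}
print(segment("Україна", vocab))  # ['Укра', 'їна']
```

The real tokenizer works the same way in principle, but over the trained 64K-piece vocabulary with probabilities estimated from the corpus.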

Special Tokens

The tokenizer includes the following special tokens:

  • <s> - Beginning of sequence
  • </s> - End of sequence
  • <unk> - Unknown token
  • <pad> - Padding token
  • <system>, </system> - System message tags
  • <user>, </user> - User message tags
  • <assistant>, </assistant> - Assistant message tags
  • <reasoning>, </reasoning> - Reasoning tags
  • <tool_call>, </tool_call>, <tool_result>, </tool_result> - Tool interaction tags
  • <final>, </final> - Final response tags
  • <json>, </json> - JSON formatting tags
  • <tool>, </tool> - Tool tags
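The chat-style tags above can be assembled into a prompt string before encoding. A minimal sketch (the helper name and message structure are illustrative assumptions, not part of the tokenizer):

```python
def format_chat(messages):
    """Render a list of (role, text) pairs using the special tags above.

    `role` is expected to be one of the tag names, e.g. "system",
    "user", "assistant". This helper is an illustrative assumption.
    """
    parts = ["<s>"]
    for role, text in messages:
        parts.append(f"<{role}>{text}</{role}>")
    parts.append("</s>")
    return "".join(parts)

prompt = format_chat([
    ("system", "Ти корисний асистент."),   # "You are a helpful assistant."
    ("user", "Привіт!"),                   # "Hello!"
])
print(prompt)
# <s><system>Ти корисний асистент.</system><user>Привіт!</user></s>
```

Because the tags are registered as special tokens, each one encodes to a single token id rather than being split into subword pieces.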

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "dovcharenko/spm_uk_64k",
    trust_remote_code=True
)

# Encode text
text = "Україна — це країна в Східній Європі"
tokens = tokenizer.encode(text, add_special_tokens=False)
print(f"Tokens: {len(tokens)}")

# Decode tokens
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

Performance Metrics

Based on testing on a Ukrainian corpus:

  • Tokens per 1K chars: Mean ~375.64, Median ~233.49
  • Tokens per word: Mean ~16.13, Median ~1.63
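Metrics of this kind can be computed with a simple count over a test corpus. A sketch, assuming `encode` is any callable mapping a string to a list of token ids (e.g. `lambda t: tokenizer.encode(t, add_special_tokens=False)`); the function name is illustrative:

```python
import statistics

def token_stats(texts, encode):
    """Compute tokens-per-1K-chars and tokens-per-word over a corpus.

    `texts`  - iterable of strings
    `encode` - callable: str -> list of token ids
    """
    per_kchar, per_word = [], []
    for text in texts:
        n_tokens = len(encode(text))
        per_kchar.append(n_tokens / max(len(text), 1) * 1000)
        per_word.append(n_tokens / max(len(text.split()), 1))
    return {
        "tokens_per_1k_chars": {"mean": statistics.mean(per_kchar),
                                "median": statistics.median(per_kchar)},
        "tokens_per_word": {"mean": statistics.mean(per_word),
                            "median": statistics.median(per_word)},
    }
```

Lower values mean the tokenizer compresses Ukrainian text into fewer tokens, which reduces sequence length and inference cost for downstream models.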

Citation

If you use this tokenizer, please cite:

@misc{spm_uk_64k_tokenizer,
  title={Ukrainian SentencePiece Tokenizer (64K Vocabulary)},
  author={Dovcharenko},
  year={2024},
  publisher={HuggingFace}
}

License

Apache 2.0
