Ukrainian SentencePiece Tokenizer (64K Vocabulary)

This is a SentencePiece unigram tokenizer trained on a Ukrainian text corpus with a vocabulary size of 64,000 tokens.

Model Details

  • Model Type: SentencePiece Unigram
  • Vocabulary Size: 64,000
  • Language: Ukrainian (uk)
  • Training Corpus:
    • Kobza dataset (largest percentage)
    • Diia dataset
    • Academic corpus
    • Verkhovna Rada (Parliament) documents
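A unigram tokenizer segments text by choosing, among all ways of splitting a string into vocabulary pieces, the split with the highest total piece probability. A toy Viterbi sketch of that idea (the vocabulary and probabilities below are invented for illustration and are not from this model):

```python
import math

def segment(text, vocab):
    """Toy unigram segmentation: pick the split of `text` into pieces from
    `vocab` (dict piece -> probability) that maximizes total log-probability."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):  # cap piece length at 10 chars
            piece = text[j:i]
            if piece in vocab and best[j][0] > -math.inf:
                score = best[j][0] + math.log(vocab[piece])
                if score > best[i][0]:
                    best[i] = (score, j)
    # Backtrack from the end to recover the winning pieces
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return pieces[::-1]

# Invented toy vocabulary: longer pieces are more probable here,
# so the segmenter prefers them over character-level splits.
vocab = {"Укра": 0.2, "їна": 0.2, "У": 0.05, "кра": 0.05, "ї": 0.05, "на": 0.1}
print(segment("Україна", vocab))  # ['Укра', 'їна']
```

The real tokenizer works the same way in principle, but over the trained 64K-piece vocabulary with probabilities estimated from the corpus.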

Special Tokens

The tokenizer includes the following special tokens:

  • <s> - Beginning of sequence
  • </s> - End of sequence
  • <unk> - Unknown token
  • <pad> - Padding token
  • <system>, </system> - System message tags
  • <user>, </user> - User message tags
  • <assistant>, </assistant> - Assistant message tags
  • <reasoning>, </reasoning> - Reasoning tags
  • <tool_call>, </tool_call>, <tool_result>, </tool_result> - Tool interaction tags
  • <final>, </final> - Final response tags
  • <json>, </json> - JSON formatting tags
  • <tool>, </tool> - Tool tags
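The chat-style tags above can be assembled into a prompt string before encoding. A minimal sketch (the helper name and message structure are illustrative assumptions, not part of the tokenizer):

```python
def format_chat(messages):
    """Render a list of (role, text) pairs using the special tags above.

    `role` is expected to be one of the tag names, e.g. "system",
    "user", "assistant". This helper is an illustrative assumption.
    """
    parts = ["<s>"]
    for role, text in messages:
        parts.append(f"<{role}>{text}</{role}>")
    parts.append("</s>")
    return "".join(parts)

prompt = format_chat([
    ("system", "Ти корисний асистент."),   # "You are a helpful assistant."
    ("user", "Привіт!"),                   # "Hello!"
])
print(prompt)
# <s><system>Ти корисний асистент.</system><user>Привіт!</user></s>
```

Because the tags are registered as special tokens, each one encodes to a single token id rather than being split into subword pieces.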

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "dovcharenko/spm_uk_64k",
    trust_remote_code=True
)

# Encode text
text = "Україна — це країна в Східній Європі"
tokens = tokenizer.encode(text, add_special_tokens=False)
print(f"Tokens: {len(tokens)}")

# Decode tokens
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

Performance Metrics

Based on testing on a Ukrainian corpus:

  • Tokens per 1K chars: Mean ~375.64, Median ~233.49
  • Tokens per word: Mean ~16.13, Median ~1.63
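Metrics of this kind can be computed with a simple count over a test corpus. A sketch, assuming `encode` is any callable mapping a string to a list of token ids (e.g. `lambda t: tokenizer.encode(t, add_special_tokens=False)`); the function name is illustrative:

```python
import statistics

def token_stats(texts, encode):
    """Compute tokens-per-1K-chars and tokens-per-word over a corpus.

    `texts`  - iterable of strings
    `encode` - callable: str -> list of token ids
    """
    per_kchar, per_word = [], []
    for text in texts:
        n_tokens = len(encode(text))
        per_kchar.append(n_tokens / max(len(text), 1) * 1000)
        per_word.append(n_tokens / max(len(text.split()), 1))
    return {
        "tokens_per_1k_chars": {"mean": statistics.mean(per_kchar),
                                "median": statistics.median(per_kchar)},
        "tokens_per_word": {"mean": statistics.mean(per_word),
                            "median": statistics.median(per_word)},
    }
```

Lower values mean the tokenizer compresses Ukrainian text into fewer tokens, which reduces sequence length and inference cost for downstream models.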

Citation

If you use this tokenizer, please cite:

@misc{spm_uk_64k_tokenizer,
  title={Ukrainian SentencePiece Tokenizer (64K Vocabulary)},
  author={Dovcharenko},
  year={2024},
  publisher={HuggingFace}
}

License

Apache 2.0
