---
language:
- bo
library_name: transformers
tags:
- tokenizer
- sentencepiece
- tibetan
- unigram
license: apache-2.0
---

# BoSentencePiece - Tibetan SentencePiece Tokenizer

A SentencePiece tokenizer trained on Tibetan text using the Unigram language model algorithm.

## Model Details

| Parameter | Value |
|-----------|-------|
| **Model Type** | Unigram |
| **Vocabulary Size** | 20,000 |
| **Character Coverage** | 100% |
| **Max Token Length** | 16 |

## Special Tokens

| Token | ID | Description |
|-------|-----|-------------|
| `<unk>` | 0 | Unknown token |
| `<s>` | 1 | Beginning of sequence |
| `</s>` | 2 | End of sequence |
| `<pad>` | 3 | Padding token |

## Usage

### With Transformers

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openpecha/BoSentencePiece")

text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"

# Split the text into subword pieces
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode the text into token IDs
encoded = tokenizer.encode(text)
print(encoded)

# Decode the token IDs back into text
decoded = tokenizer.decode(encoded)
print(decoded)
```

### With SentencePiece Directly

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Download the model file
model_path = hf_hub_download("openpecha/BoSentencePiece", "spiece.model")

# Load the model into a SentencePiece processor
sp = spm.SentencePieceProcessor()
sp.load(model_path)

text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"

# Split the text into subword pieces
tokens = sp.encode_as_pieces(text)
print(tokens)
```

## License

Apache 2.0
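
## Checking the Special Token IDs

The IDs in the Special Tokens table can be checked against a loaded model. The following is a minimal sketch, not part of the published tokenizer: `DOCUMENTED_SPECIAL_IDS`, `find_special_id_mismatches`, and `load_processor` are hypothetical helper names introduced here for illustration, while `unk_id()`, `bos_id()`, `eos_id()`, and `pad_id()` are methods of `SentencePieceProcessor`. It assumes the model file is `spiece.model`, as in the usage example above.

```python
# IDs as listed in the Special Tokens table of this card.
DOCUMENTED_SPECIAL_IDS = {"unk": 0, "bos": 1, "eos": 2, "pad": 3}


def find_special_id_mismatches(sp, documented=DOCUMENTED_SPECIAL_IDS):
    """Return {name: (documented_id, actual_id)} for every ID that differs.

    `sp` is expected to be a loaded SentencePieceProcessor (hypothetical helper).
    """
    actual = {
        "unk": sp.unk_id(),
        "bos": sp.bos_id(),
        "eos": sp.eos_id(),
        "pad": sp.pad_id(),
    }
    return {
        name: (expected, actual[name])
        for name, expected in documented.items()
        if actual[name] != expected
    }


def load_processor(repo_id="openpecha/BoSentencePiece", filename="spiece.model"):
    """Download the model file and load it (needs network access)."""
    # Imported lazily so the checker above can be used without these packages.
    from huggingface_hub import hf_hub_download
    import sentencepiece as spm

    model_path = hf_hub_download(repo_id, filename)
    sp = spm.SentencePieceProcessor()
    sp.load(model_path)
    return sp


# Example usage (downloads the model):
#   mismatches = find_special_id_mismatches(load_processor())
#   print(mismatches)
```

An empty dictionary means the loaded model agrees with the table. Any entry in the result shows the documented ID first and the ID reported by the model second.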