---
language:
- bo
library_name: transformers
tags:
- tokenizer
- sentencepiece
- tibetan
- unigram
license: apache-2.0
---

# BoSentencePiece - Tibetan SentencePiece Tokenizer

A SentencePiece tokenizer trained on Tibetan text using the Unigram language model algorithm.

## Model Details

| Parameter | Value |
|-----------|-------|
| **Model Type** | Unigram |
| **Vocabulary Size** | 20,000 |
| **Character Coverage** | 100% |
| **Max Token Length** | 16 |

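This card does not include the training command, so the following is only a sketch of how the parameters above might map onto `SentencePieceTrainer` options. The corpus path and model prefix are hypothetical, and the special-token IDs are copied from the Special Tokens table in this card. The actual training setup may have used additional options.

```python
def unigram_training_args(corpus_path, model_prefix="bo_unigram"):
    """Build keyword arguments for spm.SentencePieceTrainer.train().

    corpus_path and model_prefix are hypothetical placeholders.
    """
    return {
        "input": corpus_path,            # one sentence per line, plain text
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",         # Model Type
        "vocab_size": 20000,             # Vocabulary Size
        "character_coverage": 1.0,       # Character Coverage of 100%
        "max_sentencepiece_length": 16,  # Max Token Length
        # IDs from the Special Tokens table; pad is disabled (-1) by default
        # in SentencePiece, so it has to be set explicitly.
        "unk_id": 0,
        "bos_id": 1,
        "eos_id": 2,
        "pad_id": 3,
    }


def train(corpus_path, model_prefix="bo_unigram"):
    """Train a model with the arguments above and return the model file name."""
    # Imported lazily so the configuration can be inspected without the package.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**unigram_training_args(corpus_path, model_prefix))
    return f"{model_prefix}.model"
```

Calling `train("corpus.txt")` with a real corpus file would write `bo_unigram.model` and `bo_unigram.vocab` to the working directory.
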
## Special Tokens

| Token | ID | Description |
|-------|-----|-------------|
| `<unk>` | 0 | Unknown token |
| `<s>` | 1 | Beginning of sequence |
| `</s>` | 2 | End of sequence |
| `<pad>` | 3 | Padding token |

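As an illustration of how these IDs are typically used, the sketch below wraps each sequence in `<s>` and `</s>` and pads a batch to equal length with `<pad>`. It is plain Python with no tokenizer involved. The helper name `wrap_and_pad` and the token IDs in the example call are made up for illustration.

```python
# Special-token IDs from the table above.
UNK_ID, BOS_ID, EOS_ID, PAD_ID = 0, 1, 2, 3


def wrap_and_pad(batch):
    """Add <s>/</s> around each ID list and right-pad the batch with <pad>.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding.
    """
    wrapped = [[BOS_ID, *ids, EOS_ID] for ids in batch]
    width = max((len(ids) for ids in wrapped), default=0)
    input_ids = [ids + [PAD_ID] * (width - len(ids)) for ids in wrapped]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in wrapped]
    return input_ids, attention_mask


# The IDs 10, 11, 12 and 20 are arbitrary placeholders, not real vocabulary entries.
ids, mask = wrap_and_pad([[10, 11, 12], [20]])
print(ids)   # [[1, 10, 11, 12, 2], [1, 20, 2, 3, 3]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```
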
## Usage

### With Transformers

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("openpecha/BoSentencePiece")

# Split the text into subword pieces
text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode the text to token IDs
encoded = tokenizer.encode(text)
print(encoded)

# Decode the token IDs back to text
decoded = tokenizer.decode(encoded)
print(decoded)
```

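One quick sanity check after loading is an encode/decode round trip. The helper below is a minimal sketch that works with any object exposing Transformers-style `encode` and `decode` methods. The name `round_trip` is hypothetical. Whether the decoded text matches the input exactly depends on the tokenizer's normalization settings, which this card does not state, so a mismatch is not necessarily an error.

```python
def round_trip(tokenizer, text):
    """Encode text without special tokens, decode it, and compare.

    Returns (decoded_text, matches_input).
    """
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return decoded, decoded == text
```

With the tokenizer loaded as in the example above, `round_trip(tokenizer, text)` returns the decoded string and a boolean indicating whether it equals the input.
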
### With SentencePiece Directly

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Download the model file from the Hugging Face Hub
model_path = hf_hub_download("openpecha/BoSentencePiece", "spiece.model")

# Load the model into a SentencePiece processor
sp = spm.SentencePieceProcessor()
sp.load(model_path)

# Split the text into subword pieces
text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"
tokens = sp.encode_as_pieces(text)
print(tokens)
```

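To see which vocabulary ID each piece maps to, the pieces and IDs can be paired up. This is a small sketch using the `out_type` argument of `SentencePieceProcessor.encode`, available in recent `sentencepiece` releases. The helper name `pieces_with_ids` is hypothetical.

```python
def pieces_with_ids(sp, text):
    """Return a list of (piece, id) pairs for text.

    sp is expected to be a loaded spm.SentencePieceProcessor. By default
    no <s> or </s> is added, so the two encodings line up one to one.
    """
    pieces = sp.encode(text, out_type=str)
    ids = sp.encode(text, out_type=int)
    return list(zip(pieces, ids))
```

With `sp` and `text` from the example above, `pieces_with_ids(sp, text)` lists every piece next to its ID.
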
## License

Apache 2.0