---
language:
- bo
library_name: transformers
tags:
- tokenizer
- sentencepiece
- tibetan
- unigram
license: apache-2.0
---

# BoSentencePiece - Tibetan SentencePiece Tokenizer

A SentencePiece tokenizer trained on Tibetan text using the Unigram language model algorithm.

## Model Details

| Parameter | Value |
|-----------|-------|
| **Model Type** | Unigram |
| **Vocabulary Size** | 20,000 |
| **Character Coverage** | 100% |
| **Max Token Length** | 16 |
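
For readers who want to train a comparable tokenizer, the settings in this card map onto standard SentencePiece trainer options. The following is a minimal sketch, not necessarily the exact configuration used to build this model: the helper names `build_training_args` and `train`, the corpus path, and the model prefix are hypothetical, and the special-token IDs are taken from the Special Tokens table in this card.

```python
from typing import Any, Dict


def build_training_args(corpus_path: str, model_prefix: str) -> Dict[str, Any]:
    """Collect SentencePiece trainer options that mirror this card's tables.

    Hypothetical helper: the values come from the model card, but the
    original training command is not documented here.
    """
    return {
        "input": corpus_path,            # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",         # Model Type
        "vocab_size": 20000,             # Vocabulary Size
        "character_coverage": 1.0,       # Character Coverage (100%)
        "max_sentencepiece_length": 16,  # Max Token Length
        "unk_id": 0,
        "bos_id": 1,
        "eos_id": 2,
        "pad_id": 3,                     # SentencePiece disables <pad> by default (-1)
    }


def train(corpus_path: str, model_prefix: str) -> None:
    """Run the trainer; requires the `sentencepiece` package and a corpus file."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependencies

    spm.SentencePieceTrainer.train(**build_training_args(corpus_path, model_prefix))
```

Calling `train("corpus.txt", "bo_sp")` (both names are placeholders) would write `bo_sp.model` and `bo_sp.vocab` to the working directory.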

## Special Tokens

| Token | ID | Description |
|-------|-----|-------------|
| `<unk>` | 0 | Unknown token |
| `<s>` | 1 | Beginning of sequence |
| `</s>` | 2 | End of sequence |
| `<pad>` | 3 | Padding token |
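
To illustrate how these IDs are typically used when batching, here is a small sketch in plain Python that wraps each ID sequence in `<s>` / `</s>` and right-pads with `<pad>`. The function names `wrap_and_pad` and `attention_mask` are hypothetical, and the sample IDs are made up for illustration; in practice a tokenizer library may handle this step for you.

```python
from typing import List

# Special-token IDs as listed in the table above.
BOS_ID, EOS_ID, PAD_ID = 1, 2, 3


def wrap_and_pad(batch: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Add <s> and </s> to each sequence, then right-pad to the longest one."""
    wrapped = [[BOS_ID] + ids + [EOS_ID] for ids in batch]
    max_len = max((len(ids) for ids in wrapped), default=0)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in wrapped]


def attention_mask(padded: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Mark real tokens with 1 and padding positions with 0."""
    return [[0 if token == pad_id else 1 for token in ids] for ids in padded]


# Made-up token IDs, only to show the shapes involved.
padded = wrap_and_pad([[10, 11, 12], [20]])
print(padded)                  # [[1, 10, 11, 12, 2], [1, 20, 2, 3, 3]]
print(attention_mask(padded))  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```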

## Usage

### With Transformers

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openpecha/BoSentencePiece")

text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"

# Split the text into subword pieces
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode the text to a list of token IDs
encoded = tokenizer.encode(text)
print(encoded)

# Decode the token IDs back to text
decoded = tokenizer.decode(encoded)
print(decoded)
```

### With SentencePiece Directly

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

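# Download the SentencePiece model file from the Hugging Face Hub
model_path = hf_hub_download("openpecha/BoSentencePiece", "spiece.model")

# Load the model into a SentencePiece processor
sp = spm.SentencePieceProcessor(model_file=model_path)

text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"

# Split the text into subword pieces
tokens = sp.encode_as_pieces(text)
print(tokens)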
```

## License

Released under the Apache 2.0 license.