---
license: other
library_name: transformers
tags:
- tokenizer
- byte-level-bpe
- traditional-chinese
- taiwan
- multilingual
---
# PangolinTokenizer
Byte-level BPE tokenizer for Traditional Chinese, Taiwan text, multilingual text,
rich transcription, OCR-style text, and generic control formats.
This revision adds the control tokens required by Open Formosa as special tokens.
The base BPE vocabulary size remains 114,688. The effective tokenizer length,
including added special tokens, is 114,822.
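
The two sizes above can be checked against a loaded tokenizer. The following is a minimal sketch, assuming that `tokenizer.vocab_size` reports the base vocabulary and `len(tokenizer)` includes added tokens, as is usual for Transformers fast tokenizers. `added_token_count` is a hypothetical helper name, not something shipped with this repository.

```python
def added_token_count(tokenizer) -> int:
    """Number of entries that sit on top of the base vocabulary.

    Assumption: `tokenizer.vocab_size` excludes added tokens and
    `len(tokenizer)` includes them.
    """
    return len(tokenizer) - tokenizer.vocab_size


# The same arithmetic with the figures stated in this card.
BASE_VOCAB_SIZE = 114_688
EFFECTIVE_LENGTH = 114_822
print(EFFECTIVE_LENGTH - BASE_VOCAB_SIZE)  # 134
```

Passing the tokenizer loaded in the Usage section to `added_token_count` should give the same difference if the assumption about `vocab_size` holds for this tokenizer.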
## Usage
```python
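from transformers import AutoTokenizer

# Load the tokenizer; no custom code from the repository is executed.
tokenizer = AutoTokenizer.from_pretrained(
    "voidful/PangolinTokenizer",
    trust_remote_code=False,
)

# Mixed sample: control token, Traditional Chinese, Zhuyin, and Tailo romanization.
text = "<|system|>台灣健保與注音ㄅㄆㄇ,Tailo: Tâi-uân"
ids = tokenizer.encode(text)     # list of token IDs
decoded = tokenizer.decode(ids)  # back to a string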
```
## Open Formosa Compatibility
- Required special tokens present: 157
- Required special tokens encode as single IDs: yes
- Standard special tokens: `<unk>`, `<s>`, `</s>`, `<pad>`
- Model max length metadata: 131,072
- `trust_remote_code`: not required
- No discrete audio codec token ranges are included.
- No dense timestamp token ranges are included.
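
The "encode as single IDs" property in the list above can be spot-checked. This is a minimal sketch, assuming a Transformers-style tokenizer whose `encode` accepts `add_special_tokens`. `multi_id_tokens` is a hypothetical helper, and since the full list of 157 required tokens is not reproduced in this card, the caller has to supply the tokens to test.

```python
from typing import Iterable, List


def multi_id_tokens(tokenizer, tokens: Iterable[str]) -> List[str]:
    """Return the tokens that do NOT encode to exactly one ID."""
    failures = []
    for token in tokens:
        # add_special_tokens=False keeps BOS/EOS markers from being
        # wrapped around the input and inflating the ID count.
        ids = tokenizer.encode(token, add_special_tokens=False)
        if len(ids) != 1:
            failures.append(token)
    return failures
```

For example, `multi_id_tokens(tokenizer, ["<unk>", "<s>", "</s>", "<pad>"])` would be expected to return an empty list if the standard special tokens listed above each map to a single ID.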