---
license: other
library_name: transformers
tags:
- tokenizer
- byte-level-bpe
- traditional-chinese
- taiwan
- multilingual
---

# PangolinTokenizer

Byte-level BPE tokenizer for Traditional Chinese, Taiwan text, multilingual text,
rich transcription, OCR-style text, and generic control formats.

This revision adds the Open Formosa required control tokens as special tokens.
The base BPE vocabulary size remains 114,688. The effective tokenizer length,
including added special tokens, is 114,822.

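The two sizes above imply how many tokens sit on top of the base vocabulary. The following is a minimal sketch of that arithmetic. `added_token_count` and `size_report` are hypothetical helper names, not part of this repository. `size_report` assumes a Transformers-style tokenizer, where `vocab_size` reports the base vocabulary and `len(tokenizer)` includes added tokens.

```python
BASE_VOCAB_SIZE = 114_688    # base BPE vocabulary size stated in this card
EFFECTIVE_LENGTH = 114_822   # tokenizer length including added special tokens


def added_token_count(base_size: int, effective_size: int) -> int:
    """Return how many tokens were added on top of the base vocabulary."""
    if effective_size < base_size:
        raise ValueError("effective size cannot be smaller than the base size")
    return effective_size - base_size


def size_report(tokenizer) -> dict:
    """Summarize sizes for any object exposing `vocab_size` and `__len__`.

    Assumption: `vocab_size` excludes added tokens and `len()` includes them,
    as is the case for Transformers tokenizers.
    """
    base = tokenizer.vocab_size
    effective = len(tokenizer)
    return {
        "base": base,
        "effective": effective,
        "added": added_token_count(base, effective),
    }


print(added_token_count(BASE_VOCAB_SIZE, EFFECTIVE_LENGTH))  # 134
```

With a loaded tokenizer, `size_report(tokenizer)` should report the same figures if the hosted files match this card.
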
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "voidful/PangolinTokenizer",
    trust_remote_code=False,
)

text = "<|system|>台灣健保與注音ㄅㄆㄇ,Tailo: Tâi-uân"
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)
```

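To see whether a given string survives encoding and decoding unchanged, a small round-trip helper can be useful. This is a sketch under assumptions: `roundtrip` is a hypothetical helper name, and it passes `add_special_tokens=False` so that any sequence-level tokens the tokenizer would otherwise insert do not affect the comparison. This card does not state whether decoding reproduces every input exactly, so treat the boolean as a diagnostic rather than a guarantee.

```python
from typing import List, Tuple


def roundtrip(tokenizer, text: str) -> Tuple[List[int], str, bool]:
    """Encode `text`, decode it again, and report whether the text is unchanged.

    Works with any object exposing Transformers-style `encode` and `decode`.
    """
    # Skip automatic BOS/EOS-style insertion so only `text` itself is encoded.
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids)
    return ids, decoded, decoded == text
```

With the tokenizer loaded as in the usage snippet, `ids, decoded, same = roundtrip(tokenizer, text)` returns the token IDs, the decoded string, and whether the two strings match.
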
## Open Formosa Compatibility

- Required special tokens present: 157
- Required special tokens encode as single IDs: yes
- Standard special tokens: `<unk>`, `<s>`, `</s>`, `<pad>`
- Model max length metadata: 131,072
- `trust_remote_code`: not required
- No discrete audio codec token ranges are included.
- No dense timestamp token ranges are included.
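
The single-ID property in the list above can be spot-checked locally. The sketch below uses a hypothetical helper, `multi_id_tokens`, which returns every token string that does not encode to exactly one ID. The full list of 157 required tokens is not reproduced in this card, so the example list contains only the four standard special tokens plus `<|system|>`, which is assumed to be a control token because it appears in the usage example.

```python
from typing import Iterable, List

# Tokens named in this card. `<|system|>` is an assumption taken from the
# usage example, not from an official list of required tokens.
EXAMPLE_TOKENS = ["<unk>", "<s>", "</s>", "<pad>", "<|system|>"]


def multi_id_tokens(tokenizer, tokens: Iterable[str]) -> List[str]:
    """Return the tokens that do NOT encode to exactly one ID.

    Works with any object exposing a Transformers-style `encode` method.
    """
    failures = []
    for token in tokens:
        # Disable automatic special-token insertion so only `token` is counted.
        ids = tokenizer.encode(token, add_special_tokens=False)
        if len(ids) != 1:
            failures.append(token)
    return failures
```

An empty result from `multi_id_tokens(tokenizer, EXAMPLE_TOKENS)` would be consistent with the compatibility claim for those tokens. The same tokenizer object also exposes `tokenizer.model_max_length`, which can be compared against the 131,072 figure listed above.
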