OpenFormosa
/

PangolinTokenizer

traditional-chinese

Model card Files Files and versions

PangolinTokenizer / README.md

voidful's picture

Add Open Formosa special tokens

989d8ec verified 17 days ago

|

History Blame Contribute Delete

1.17 kB

	---
	license: other
	library_name: transformers
	tags:
	- tokenizer
	- byte-level-bpe
	- traditional-chinese
	- taiwan
	- multilingual
	---

	# PangolinTokenizer

	Byte-level BPE tokenizer for Traditional Chinese, Taiwan text, multilingual text,
	rich transcription, OCR-style text, and generic control formats.

	This revision adds the Open Formosa required control tokens as special tokens.
	The base BPE vocabulary size remains 114,688. The effective tokenizer length,
	including added special tokens, is 114,822.

	## Usage

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained(
	"voidful/PangolinTokenizer",
	trust_remote_code=False,
	)

	text = "<\|system\|>台灣健保與注音ㄅㄆㄇ，Tailo: Tâi-uân"
	ids = tokenizer.encode(text)
	decoded = tokenizer.decode(ids)
	```

	## Open Formosa Compatibility

	- Required special tokens present: 157
	- Required special tokens encode as single IDs: yes
	- Standard special tokens: `<unk>`, `<s>`, `</s>`, `<pad>`
	- Model max length metadata: 131,072
	- `trust_remote_code`: not required
	- No discrete audio codec token ranges are included.
	- No dense timestamp token ranges are included.