ocisd4/openllama_tokenizer_ext_zh


samleeasus committed on Jun 2, 2023

Commit bccccd8 · 1 Parent(s): 1853fcd

Create README.md

Files changed (1)

README.md +37 -0

README.md ADDED

	@@ -0,0 +1,37 @@

+```python
+from transformers import LlamaTokenizer
+tokenizer = LlamaTokenizer.from_pretrained(
+        'ocisd4/openllama_tokenizer_ext_zh',
+        pad_token="<pad>",
+        add_bos_token=False,
+        add_eos_token=True,
+        use_auth_token=True,  # boolean, not the string 'True'; uses the locally stored Hugging Face token
+)
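+print('vocab size:', tokenizer.vocab_size)
+#vocab size: 52992
+text = '今天天氣真好！'  # "The weather is really nice today!"
+# List the tokens whose IDs are greater than vocab_size - 7 (the highest IDs in the vocabulary).
+print([k for k, v in tokenizer.get_vocab().items() if v > tokenizer.vocab_size - 7])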
+print(tokenizer.tokenize(text))
+#['▁', '今天', '天氣', '真', '好', '<0xEF>', '<0xBC>', '<0x81>']
+print(tokenizer.encode(text))
+#[1, 31822, 32101, 32927, 45489, 45301, 242, 191, 132]
+print(tokenizer.decode(tokenizer.encode(text)))
+# 今天天氣真好！</s>
+```
+**Note:**
+ - The first token might be a whitespace token (`▁`) in LlamaTokenizer.
+ - The OpenLLaMA tokenizer is incompatible with the original LLaMA tokenizer.
+ - This tokenizer encodes consecutive spaces as a single space.
+### Updates
+#### 2023-06-02
+  - Added special tokens: `<|output|>`, `<|input|>`, `<|sep|>`, `<|emb|>`, `<|rwd|>`, `<|ctx|>`
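
Two behaviors shown in the README above can be reproduced without downloading the tokenizer. The sketch below is illustrative only and is not the tokenizer's implementation. The helper names `byte_fallback_tokens` and `collapse_spaces` are hypothetical. The first spells out the UTF-8 bytes of a character that has no vocabulary entry, which is how `！` ends up as `<0xEF>`, `<0xBC>`, `<0x81>` in the sample output. The second mimics only the observable effect of the note on consecutive spaces, assuming the collapse applies to plain ASCII spaces.

```python
import re
from typing import List


def byte_fallback_tokens(ch: str) -> List[str]:
    """Spell a character as SentencePiece-style byte tokens such as <0xEF>.

    Hypothetical helper: it only formats the UTF-8 bytes and does not consult
    any vocabulary.
    """
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


def collapse_spaces(text: str) -> str:
    """Replace each run of two or more ASCII spaces with one space.

    Hypothetical helper: it imitates the effect described in the note, not
    the tokenizer's internal normalization.
    """
    return re.sub(r" {2,}", " ", text)


# The full-width exclamation mark U+FF01 is three bytes in UTF-8.
print(byte_fallback_tokens("！"))
# ['<0xEF>', '<0xBC>', '<0x81>']

# Runs of spaces are not preserved, so indentation-sensitive text
# (for example source code) would not round-trip exactly.
print(collapse_spaces("a   b"))
# a b
```

In the sample output, the IDs of the three byte tokens (242, 191, 132) equal the byte values 0xEF, 0xBC, 0x81 plus 3. That would fit a vocabulary in which the byte tokens directly follow three leading special tokens, but this is an observation from the printed IDs, not something the README states.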