# Tiktoken ูˆุงู„ุชูุงุนู„ ู…ุน Transformers
Support for tiktoken model files is seamlessly integrated into 🤗 transformers when loading models with
`from_pretrained` from a Hub checkpoint that contains a tiktoken `tokenizer.model` file, which is automatically converted into a [fast tokenizer](https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast).
### ุงู„ู†ู…ุงุฐุฌ ุงู„ู…ุนุฑูˆูุฉ ุงู„ุชูŠ ุชู… ุฅุตุฏุงุฑู‡ุง ู…ุน `tiktoken.model`:
- gpt2
- llama3
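For reference, a tiktoken BPE rank file such as `tokenizer.model` is plain text: each line holds a base64-encoded token byte sequence followed by its merge rank, which also serves as the token ID. A minimal sketch of writing and parsing that format, using a hypothetical three-token vocabulary:

```python
import base64

# Build a tiny tiktoken-style rank file in memory (hypothetical vocabulary):
# each line is "<base64 token bytes> <rank>".
sample = "\n".join(
    base64.b64encode(token).decode() + " " + str(rank)
    for rank, token in enumerate([b"a", b"b", b"ab"])
)

# Parse it back into the mergeable-ranks mapping tiktoken works with.
ranks = {
    base64.b64decode(token): int(rank)
    for token, rank in (line.split() for line in sample.splitlines())
}
print(ranks)  # {b'a': 0, b'b': 1, b'ab': 2}
```

Note that this file carries only the byte-to-rank mapping; special tokens and the pre-tokenization pattern live elsewhere, which is why the conversion step below exists.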
## Example usage
In order to load `tiktoken` files in `transformers`, ensure that the `tokenizer.model` file is a tiktoken file; it will then be loaded automatically when loading with `from_pretrained`. Here is how to load a tokenizer and a model, both of which
can be loaded from the exact same file:
```py
from transformers import AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="original")
```
## Create a tiktoken tokenizer
The `tokenizer.model` file contains no information about additional tokens or pattern strings. If these are important, convert the tokenizer to `tokenizer.json`, the appropriate format for [`PreTrainedTokenizerFast`].
Generate the `tokenizer.model` file with [tiktoken.get_encoding](https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/tiktoken/registry.py#L63), then convert it to `tokenizer.json` with [`convert_tiktoken_to_fast`].
```py
from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import get_encoding
# You can load your custom encoding or the one provided by OpenAI
encoding = get_encoding("gpt2")
convert_tiktoken_to_fast(encoding, "config/save/dir")
```
ูŠุชู… ุญูุธ ู…ู„ู `tokenizer.json` ุงู„ู†ุงุชุฌ ููŠ ุงู„ุฏู„ูŠู„ ุงู„ู…ุญุฏุฏ ูˆูŠู…ูƒู† ุชุญู…ูŠู„ู‡ ุจุงุณุชุฎุฏุงู… [`PreTrainedTokenizerFast`].
```py
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")
```