# Tiktoken and interaction with Transformers

Support for tiktoken model files is seamlessly integrated in 🤗 transformers: when a model is loaded with `from_pretrained` and its repository on the Hub contains a tiktoken `tokenizer.model` file, that file is automatically converted into our [fast tokenizer](https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast).

### Known models that were released with a tiktoken `tokenizer.model`:
	- gpt2
	- llama3

## Example usage

To load `tiktoken` files in `transformers`, ensure that the `tokenizer.model` file is a tiktoken file; it will then be loaded automatically by `from_pretrained`. Here is how to load a tokenizer and a model, which can both be loaded from the exact same file:

```py
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="original")
```

## Create a tiktoken tokenizer

The `tokenizer.model` file contains no information about additional tokens or pattern strings. If these matter, convert the tokenizer to `tokenizer.json`, the appropriate format for [`PreTrainedTokenizerFast`].

Build the tiktoken encoding with [tiktoken.get_encoding](https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/tiktoken/registry.py#L63), then convert it to `tokenizer.json` with [`convert_tiktoken_to_fast`].

```py
from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import get_encoding

# You can load your custom encoding or the one provided by OpenAI
encoding = get_encoding("gpt2")
convert_tiktoken_to_fast(encoding, "config/save/dir")
```

ูŠุชู… ุญูุธ ู…ู„ู `tokenizer.json` ุงู„ู†ุงุชุฌ ููŠ ุงู„ุฏู„ูŠู„ ุงู„ู…ุญุฏุฏ ูˆูŠู…ูƒู† ุชุญู…ูŠู„ู‡ ุจุงุณุชุฎุฏุงู… [`PreTrainedTokenizerFast`].

```py
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")
```