
Tiktoken and interaction with Transformers

Support for tiktoken model files is seamlessly integrated into 🤗 transformers: when you load a model with from_pretrained and its Hub repository contains a tiktoken tokenizer.model file, that file is automatically converted into a fast tokenizer.

Known models released with a tiktoken.model:

- gpt2
- llama3

Example usage

To load tiktoken files in transformers, make sure the tokenizer.model file is a tiktoken file; it is then loaded automatically when you call from_pretrained. Here is how to load a tokenizer and a model, which can be loaded from the exact same file:

from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="original")
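
Before pointing from_pretrained at a tokenizer.model file, it can help to confirm that the file really is a tiktoken file and not, for example, a SentencePiece model. The sketch below assumes the plain-text layout that tiktoken's own loader reads: one entry per line, made up of a base64-encoded token, whitespace, and an integer rank. The helper name looks_like_tiktoken_file and its max_lines parameter are hypothetical and are not part of transformers or tiktoken. It is a heuristic that samples only the first lines of the file.

```python
import base64
import binascii


def looks_like_tiktoken_file(path, max_lines=100):
    """Heuristic: each sampled non-empty line is '<base64 token> <integer rank>'.

    Hypothetical helper for illustration; not a transformers or tiktoken API.
    """
    checked = 0
    with open(path, "rb") as f:
        for raw in f:
            line = raw.strip()
            if not line:
                continue  # ignore blank lines
            parts = line.split()
            if len(parts) != 2:
                return False
            token_b64, rank = parts
            try:
                # validate=True rejects bytes outside the base64 alphabet
                base64.b64decode(token_b64, validate=True)
                int(rank.decode("ascii"))
            except (binascii.Error, ValueError):
                return False
            checked += 1
            if checked >= max_lines:
                break
    # An empty file is not a usable vocabulary
    return checked > 0
```

Calling looks_like_tiktoken_file("path/to/tokenizer.model") returns True only when every sampled line parses. A binary SentencePiece file will normally fail on its first non-empty line.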

Create a tiktoken tokenizer

The tokenizer.model file contains no information about additional tokens or pattern strings. If these matter, convert the tokenizer to tokenizer.json, the appropriate format for [PreTrainedTokenizerFast].

Generate the tokenizer.model file with tiktoken.get_encoding, then convert it to tokenizer.json with [convert_tiktoken_to_fast].


from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import get_encoding

# You can load your own custom encoding or one provided by OpenAI
encoding = get_encoding("gpt2")
convert_tiktoken_to_fast(encoding, "config/save/dir")

The resulting tokenizer.json file is saved to the specified directory and can be loaded with [PreTrainedTokenizerFast].

from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")