<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Tokenizer [[tokenizer]]
A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full Python implementation and a "Fast" implementation based on the Rust library [🤗 Tokenizers](https://github.com/huggingface/tokenizers). The "Fast" implementations allow:

1. a significant speed-up, in particular when doing batched tokenization, and
2. additional methods to map between the original string (characters and words) and the token space (e.g., getting the index of the token comprising a given character, or the span of characters corresponding to a given token); see the sketch after this list.
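For instance, a fast tokenizer can process a whole batch in one call and then answer alignment queries between characters and tokens. A minimal sketch, assuming any checkpoint that ships a fast tokenizer (`bert-base-uncased` is used purely as an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Batched tokenization is where the Rust-backed "Fast" implementation shines.
batch = tokenizer(["Hello world", "Tokenizers are fast"], padding=True)

# Alignment between the original string and the token space (fast tokenizers only).
encoding = tokenizer("Hello world", return_offsets_mapping=True)
print(encoding["offset_mapping"])  # character span covered by each token
print(encoding.char_to_token(6))   # index of the token comprising the character at position 6
print(encoding.token_to_chars(1))  # span of characters corresponding to token 1
```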
The base classes [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] implement the methods for encoding string inputs (see below), and provide functionality for instantiating and saving Python and "Fast" tokenizers either from a local file or directory, or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository). Both classes rely on [`~tokenization_utils_base.PreTrainedTokenizerBase`], which contains the common methods.
[`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] implement the main methods used by all tokenizers (illustrated in the sketch after this list):

- Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers).
- Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece, etc.).
- Managing special tokens (such as mask or beginning-of-sentence tokens): adding them, assigning them to attributes of the tokenizer for easy access, and making sure they are not split during tokenization.
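A minimal sketch of these methods; the checkpoint name is only an example, and any pretrained tokenizer behaves the same way:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Hello world")                 # string -> sub-word token strings
ids = tokenizer.convert_tokens_to_ids(tokens)              # token strings -> ids
text = tokenizer.decode(tokenizer.encode("Hello world"))   # encode/decode round trip

# Add new tokens to the vocabulary, independently of the underlying algorithm (BPE, SentencePiece, ...).
tokenizer.add_tokens(["new_tok"])

# Register an additional special token; it becomes accessible as an attribute and is never split.
tokenizer.add_special_tokens({"additional_special_tokens": ["<special>"]})
print(tokenizer.additional_special_tokens)
```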
[`BatchEncoding`] holds the output of the encoding methods of [`~tokenization_utils_base.PreTrainedTokenizerBase`] (`__call__`, `encode_plus` and `batch_encode_plus`) and is derived from a Python dictionary. When the tokenizer is a pure Python tokenizer, this class behaves just like a standard Python dictionary and holds the various model inputs computed by these methods (`input_ids`, `attention_mask`, etc.). When the tokenizer is a "Fast" tokenizer (i.e., backed by the HuggingFace [tokenizers library](https://github.com/huggingface/tokenizers)), this class additionally provides several advanced alignment methods that can be used to map between the original string (characters and words) and the token space (e.g., getting the index of the token comprising a given character, or the span of characters corresponding to a given token).
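Concretely, a [`BatchEncoding`] can be indexed like a dictionary, and one produced by a fast tokenizer also exposes alignment helpers such as `word_ids()`. A minimal sketch (the checkpoint name is only an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Hello world", "Tokenizers"], padding=True)

# BatchEncoding behaves like a standard Python dictionary of model inputs...
print(batch.keys())       # e.g. input_ids, token_type_ids, attention_mask
print(batch["input_ids"])

# ...and, for fast tokenizers, also answers alignment queries.
print(batch.word_ids(0))  # word index of each token in the first sequence
```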
# Multimodal Tokenizer [[multimodal-tokenizer]]
Beyond that, each tokenizer can be a "multimodal" tokenizer, which means that the tokenizer stores all of the relevant special tokens as tokenizer attributes so that they are easier to access. For example, if you load the tokenizer of a vision-language model such as LLaVA, you can access `tokenizer.image_token_id` to get the special image token that is used as a placeholder.
To enable extra special tokens for any type of tokenizer, you have to add the following lines and save the tokenizer. The extra special tokens do not have to be related to a specific modality and can be anything the model frequently needs access to. In the code below, the tokenizer saved at `output_dir` will have direct access to three extra special tokens.
```python
from transformers import AutoTokenizer

vision_tokenizer = AutoTokenizer.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    extra_special_tokens={"image_token": "<image>", "boi_token": "<image_start>", "eoi_token": "<image_end>"},
)

# The extra special tokens are now exposed as tokenizer attributes.
print(vision_tokenizer.image_token, vision_tokenizer.image_token_id)
# <image> 32000

# Saving stores the extra special tokens in the tokenizer config at `output_dir`.
vision_tokenizer.save_pretrained("output_dir")
```
## PreTrainedTokenizer[[transformers.PreTrainedTokenizer]]
[[autodoc]] PreTrainedTokenizer
    - __call__
    - add_tokens
    - add_special_tokens
    - apply_chat_template
    - batch_decode
    - decode
    - encode
    - push_to_hub
    - all
## PreTrainedTokenizerFast[[transformers.PreTrainedTokenizerFast]]
[`PreTrainedTokenizerFast`] depends on the [tokenizers](https://huggingface.co/docs/tokenizers) library. Tokenizers obtained from the 🤗 tokenizers library can be loaded very simply into 🤗 transformers. Take a look at the [Using tokenizers from 🤗 tokenizers](../fast_tokenizers) page to see how this is done.
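As a rough sketch of the idea, a tokenizer built with 🤗 tokenizers can be wrapped either from the in-memory object or from its serialized file (the file name `tokenizer.json` is only an assumption here):

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Wrap an in-memory tokenizer object from the 🤗 tokenizers library.
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=Tokenizer.from_file("tokenizer.json"))

# Or point directly at the serialized file produced by 🤗 tokenizers.
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```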
[[autodoc]] PreTrainedTokenizerFast
    - __call__
    - add_tokens
    - add_special_tokens
    - apply_chat_template
    - batch_decode
    - decode
    - encode
    - push_to_hub
    - all
## BatchEncoding[[transformers.BatchEncoding]]

[[autodoc]] BatchEncoding