<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*Meta released this model on 2025-04-05 and added it to Hugging Face Transformers.*

# Llama4[[llama4]]
<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
        <img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
    </div>
</div>
[Llama 4](https://ai.meta.com/blog/llama-4-multimodal-intelligence/), developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture.
This generation includes two models:

- The highly capable Llama 4 Maverick with 17B active parameters out of ~400B total, using 128 experts.
- The efficient Llama 4 Scout with 17B active parameters out of ~109B total, using only 16 experts.
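As a back-of-the-envelope comparison, the active-to-total parameter ratios above can be computed directly. Note the totals are the approximate figures quoted here, not exact parameter counts:

```py
def active_fraction(active_b: float, total_b: float) -> float:
    """Fraction of the total parameters that are active for each token."""
    return active_b / total_b

# Approximate figures from the description above.
maverick = active_fraction(17, 400)  # 128 experts
scout = active_fraction(17, 109)     # 16 experts

print(f"Maverick activates ~{maverick:.1%} of its weights per token")
print(f"Scout activates ~{scout:.1%} of its weights per token")
```

Despite being almost four times larger in total, Maverick performs a similar amount of computation per token as Scout, which is the point of the MoE design.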
Both models leverage early fusion for native multimodality, enabling them to process text and image inputs.
Maverick and Scout are both trained on up to 40 trillion tokens on data encompassing 200 languages
(with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).

Meta designed Llama 4 Scout to be broadly accessible: with on-the-fly 4-bit or 8-bit quantization it runs in real time on a single server-grade GPU, while the larger Llama 4 Maverick is available in BF16 and FP8 formats for high-performance deployment.
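The single-GPU claim for Scout can be sanity-checked with weight-only sizing. This is a rough sketch: it counts bytes for the weights alone and ignores activations, the KV cache, and quantization metadata:

```py
SCOUT_PARAMS = 109e9  # approximate total parameter count quoted above

def weights_gib(bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return SCOUT_PARAMS * bits_per_param / 8 / 1024**3

for name, bits in [("BF16", 16), ("INT8/FP8", 8), ("INT4", 4)]:
    print(f"{name:8s} ~{weights_gib(bits):6.1f} GiB")
```

At 4 bits per weight the checkpoint fits within the 80 GiB of a typical server-grade GPU, while the BF16 weights alone exceed 200 GiB.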
These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.

You can find all the original Llama checkpoints under the Hugging Face [meta-llama](https://huggingface.co/meta-llama) organization.
> [!TIP]
> The Llama 4 family of models comes in two flavors: 109B and 402B parameters. Both of these flavors are extremely large
> and won't run on typical hardware. Below are a few ways to reduce their memory usage.
>
> For faster and more reliable downloads, we recommend installing the `hf_xet` dependency:
> `pip install transformers[hf_xet]`

The examples below show how to generate with [`Pipeline`] or [`AutoModel`]. Because some Llama 4 variants support a context
length of up to 10 million tokens, we also include an example showing how to toggle the right attributes to enable very
long-context generation.
<hfoptions id="usage">
<hfoption id="Pipeline">

```py
from transformers import pipeline
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

messages = [
    {"role": "user", "content": "What is the recipe for mayonnaise?"},
]

pipe = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
    dtype=torch.bfloat16
)

output = pipe(messages, do_sample=False, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```

</hfoption>
<hfoption id="AutoModel - Text only">

```py
from transformers import AutoTokenizer, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16
)

outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
```

</hfoption>
<hfoption id="AutoModel - Multimodal">

```py
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)

img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": img_url},
            {"type": "text", "text": "Describe this image in two sentences."},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
```

</hfoption>
<hfoption id="AutoModel - Multimodal with multiple images">

```py
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
```

</hfoption>
<hfoption id="AutoModel - Long context">

Note: the example below uses both `device_map="auto"` and flex-attention.
Please use `torchrun` to run this example in tensor-parallel mode.

We will work on enabling `device_map="auto"` and flex-attention without
tensor parallelism in the future.

```py
from transformers import Llama4ForConditionalGeneration, AutoTokenizer, infer_device
import torch
import time

file = "very_long_context_prompt.txt"
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

with open(file, "r") as f:
    very_long_text = "\n".join(f.readlines())

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flex_attention",
    dtype=torch.bfloat16
)

messages = [
    {"role": "user", "content": f"Look at the following texts: [{very_long_text}]\n\n\n\nWhat are the books, and who wrote them? Make me a nice list."},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

device = infer_device()
torch_device_module = getattr(torch, device, torch.cuda)
torch_device_module.synchronize()
start = time.time()
out = model.generate(
    input_ids.to(model.device),
    prefill_chunk_size=2048*8,
    max_new_tokens=300,
    cache_implementation="hybrid",
)
print(time.time()-start)
print(tokenizer.batch_decode(out[:, input_ids.shape[-1]:]))
print(f"{torch_device_module.max_memory_allocated(model.device) / 1024**3:.2f} GiB")
```

</hfoption>
</hfoptions>
## Efficiency: how to get the best out of Llama 4[[efficiency-how-to-get-the-best-out-of-llama-4]]

### The attention methods[[the-attention-methods]]

Changing the attention function away from the default can significantly improve compute performance and memory usage. Refer to the [Attention Interface](../attention_interface) overview for an in-depth explanation of the interface.

As of release, the Llama 4 models support the following attention methods: `eager`, `flex_attention`, and `sdpa`. We recommend `flex_attention` for best results.
The attention mechanism is selected at model initialization:
<hfoptions id="Attention">
<hfoption id="Flex Attention">

Flex Attention delivers the best performance when the model handles long contexts.

> [!TIP] Note: the example below uses both `device_map="auto"` and flex-attention.
> Please use `torchrun` to run this example in tensor-parallel mode.
>
> We will work on enabling `device_map="auto"` and flex-attention without
> tensor parallelism in the future.

```py
from transformers import Llama4ForConditionalGeneration
import torch

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    dtype=torch.bfloat16,
)
```

</hfoption>
<hfoption id="SDPA">

The `sdpa` attention method is generally more compute-efficient than the `eager` method.

```py
from transformers import Llama4ForConditionalGeneration
import torch

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    device_map="auto",
    dtype=torch.bfloat16,
)
```

</hfoption>
<hfoption id="Eager">

The `eager` attention method is set by default, so no changes are needed when loading the model:

```py
from transformers import Llama4ForConditionalGeneration
import torch

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)
```

</hfoption>
</hfoptions>
### Quantization[[quantization]]

Quantization reduces the memory footprint of large models by representing the weights at lower precision. Refer to the [Quantization](../quantization/overview) overview for the available quantization backends.

At the time of release, both FBGEMM and LLM-Compressor are supported, with more methods coming soon.
See below for examples using both.

Here is an example loading a BF16 model in FP8 using the FBGEMM approach:
<hfoptions id="Quantization">
<hfoption id="FBGEMM">

```python
from transformers import AutoTokenizer, Llama4ForConditionalGeneration, FbgemmFp8Config
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
    quantization_config=FbgemmFp8Config()
)

outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
```

</hfoption>
<hfoption id="LLM-Compressor">

When using LLM-Compressor, we recommend leveraging the pre-quantized FP8 checkpoint that ships with the release:

```python
from transformers import AutoTokenizer, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    tp_plan="auto",
    dtype=torch.bfloat16,
)

outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
```

</hfoption>
</hfoptions>
### Offloading[[offloading]]

Enabling CPU offloading means the model moves components onto the CPU when GPU memory runs out.
During inference, the different components are loaded and unloaded dynamically between the GPU and CPU. This lets the model run on smaller machines, as long as there is enough CPU memory to hold it.
However, inference can be slower due to the communication overhead.

To enable CPU offloading, simply set `device_map` to `auto` when loading the model:

```py
from transformers import Llama4ForConditionalGeneration
import torch

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)
```
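If you want to cap how much of the model lands on each device rather than letting `device_map="auto"` fill the GPU, `from_pretrained` also accepts a `max_memory` map (an Accelerate feature). The limits below are hypothetical placeholders; size them to your own hardware:

```py
# Hypothetical per-device caps -- adjust to your own hardware.
# Layers that exceed the GPU budget are offloaded to CPU RAM.
max_memory = {0: "70GiB", "cpu": "200GiB"}

# Sketch of how the map is passed (not executed here, since it loads
# the full checkpoint):
#
# model = Llama4ForConditionalGeneration.from_pretrained(
#     model_id,
#     device_map="auto",
#     max_memory=max_memory,
#     dtype=torch.bfloat16,
# )
print(max_memory)
```

Reserving some GPU headroom this way leaves room for activations and the KV cache, which `device_map="auto"` does not account for on its own.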
## Llama4Config

[[autodoc]] Llama4Config

## Llama4TextConfig

[[autodoc]] Llama4TextConfig

## Llama4VisionConfig

[[autodoc]] Llama4VisionConfig

## Llama4Processor

[[autodoc]] Llama4Processor

## Llama4ImageProcessorFast

[[autodoc]] Llama4ImageProcessorFast

## Llama4ForConditionalGeneration

[[autodoc]] Llama4ForConditionalGeneration
    - forward

## Llama4ForCausalLM

[[autodoc]] Llama4ForCausalLM
    - forward

## Llama4TextModel

[[autodoc]] Llama4TextModel
    - forward
## Llama4VisionModel

[[autodoc]] Llama4VisionModel
    - forward