
Meta released this model on 2025-04-05 and added it to Hugging Face Transformers the same day.

# Llama4[[llama4]]

PyTorch | FlashAttention | Tensor parallelism

Meta์—์„œ ๊ฐœ๋ฐœํ•œ Llama 4๋Š” ์ƒˆ๋กœ์šด ์ž๊ธฐํšŒ๊ท€ Mixture-of-Experts (MoE) ์•„ํ‚คํ…์ฒ˜๋ฅผ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค. ์ด ์„ธ๋Œ€๋Š” ๋‘ ๊ฐ€์ง€ ๋ชจ๋ธ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค: - 128๊ฐœ์˜ ์ „๋ฌธ๊ฐ€(expert)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด ์•ฝ 400B ๋งค๊ฐœ๋ณ€์ˆ˜ ์ค‘ 17B ํ™œ์„ฑ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ๊ฐ–๋Š” ๊ณ ์„ฑ๋Šฅ Llama 4 Maverick - 16๊ฐœ์˜ ์ „๋ฌธ๊ฐ€๋งŒ ์‚ฌ์šฉํ•˜์—ฌ ์ด ์•ฝ 109B ๋งค๊ฐœ๋ณ€์ˆ˜ ์ค‘ 17B ํ™œ์„ฑ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ๊ฐ–๋Š” ๊ฒฝ๋Ÿ‰ํ™”๋œ Llama 4 Scout

๋‘ ๋ชจ๋ธ ๋ชจ๋‘ ๋„ค์ดํ‹ฐ๋ธŒ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ์„ ์œ„ํ•œ ์ดˆ๊ธฐ ์œตํ•ฉ(early fusion)์„ ํ™œ์šฉํ•˜์—ฌ ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€ ์ž…๋ ฅ์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Maverick๊ณผ Scout ๋ชจ๋‘ 200๊ฐœ ์–ธ์–ด๋ฅผ ํฌํ•จํ•˜๋Š” ๋ฐ์ดํ„ฐ์—์„œ ์ตœ๋Œ€ 40์กฐ๊ฐœ์˜ ํ† ํฐ์œผ๋กœ ํ›ˆ๋ จ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. (์•„๋ž์–ด, ์ŠคํŽ˜์ธ์–ด, ๋…์ผ์–ด, ํžŒ๋””์–ด๋ฅผ ํฌํ•จํ•œ 12๊ฐœ ์–ธ์–ด์— ๋Œ€ํ•œ ํŠน์ • ๋ฏธ์„ธ ์กฐ์ • ์ง€์› ํฌํ•จ)

Meta designed Llama 4 Scout to be broadly accessible: with on-the-fly 4-bit or 8-bit quantization, Scout can run in real time on a single server-grade GPU, while the larger Llama 4 Maverick ships in BF16 and FP8 formats for high-performance deployments. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
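
As a quick illustration of the single-GPU claim, here is a minimal sketch of loading Scout with on-the-fly 4-bit quantization via bitsandbytes. This assumes `bitsandbytes` is installed and is one possible recipe, not an official Meta configuration:

```python
# Hedged sketch: on-the-fly 4-bit loading with bitsandbytes.
# Assumes `pip install bitsandbytes`; the settings below are illustrative.
from transformers import Llama4ForConditionalGeneration, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # store weights in 4-bit, compute in bf16
)

model = Llama4ForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    device_map="auto",
    quantization_config=bnb_config,
)
```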

๋ชจ๋“  ์›๋ณธ Llama ์ฒดํฌํฌ์ธํŠธ๋Š” hugging face meta-llama ํŽ˜์ด์ง€์—์„œ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Llama 4 ๋ชจ๋ธ ํŒจ๋ฐ€๋ฆฌ๋Š” ๋‘ ๊ฐ€์ง€ ํ˜•ํƒœ๋กœ ์ œ๊ณต๋ฉ๋‹ˆ๋‹ค: 109B์™€ 402B ๋งค๊ฐœ๋ณ€์ˆ˜์ž…๋‹ˆ๋‹ค. ์ด ๋‘ ํ˜•ํƒœ ๋ชจ๋‘ ๋งค์šฐ ํฐ ๋ชจ๋ธ์ด๋ฉฐ ์ผ๋ฐ˜์ ์ธ ๊ธฐ๊ธฐ์—์„œ๋Š” ์‹คํ–‰ํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ์•„๋ž˜์— ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ์ค„์ด๋Š” ๋ฐฉ๋ฒ• ๋ช‡ ๊ฐ€์ง€๋ฅผ ์ •๋ฆฌํ–ˆ์Šต๋‹ˆ๋‹ค.

For faster and more stable downloads, we recommend installing the `hf_xet` dependency: `pip install transformers[hf_xet]`.
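
If you prefer to fetch the weights ahead of time, a minimal sketch using `huggingface_hub` (which picks up `hf_xet` transparently when it is installed) could look like this:

```python
# Hedged sketch: pre-download the checkpoint before loading it.
# When hf_xet is installed, it is used automatically for the transfer.
from huggingface_hub import snapshot_download

snapshot_download("meta-llama/Llama-4-Scout-17B-16E-Instruct")
```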

์•„๋ž˜ ์˜ˆ์‹œ๋“ค์€ [Pipeline] ๋˜๋Š” [AutoModel]๋กœ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋˜ํ•œ ์ผ๋ถ€ Llama 4 ๋ณ€ํ˜•์ด ์ตœ๋Œ€ 1์ฒœ๋งŒ ํ† ํฐ์˜ ์ปจํ…์ŠคํŠธ ๊ธธ์ด๋ฅผ ๊ฐ–๊ธฐ ๋•Œ๋ฌธ์—, ๋งค์šฐ ๊ธด ์ปจํ…์ŠคํŠธ ์ƒ์„ฑ์„ ํ™œ์„ฑํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์˜ฌ๋ฐ”๋ฅธ ์†์„ฑ์„ ํ† ๊ธ€ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ฃผ๋Š” ์˜ˆ์‹œ๋„ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค.

Text-only generation with [`Pipeline`]:

```python
from transformers import pipeline
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

messages = [
    {"role": "user", "content": "what is the recipe of mayonnaise?"},
]

pipe = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
    dtype=torch.bfloat16
)

output = pipe(messages, do_sample=False, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```

Text-only generation with the model classes directly:

```python
from transformers import AutoTokenizer, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16
)

outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
```

Multimodal generation with a single image:

```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)

img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": img_url},
            {"type": "text", "text": "Describe this image in two sentences."},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
```

Multimodal generation with multiple images:

```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
```

์ฃผ์˜: ์•„๋ž˜ ์˜ˆ์‹œ๋Š” device_map="auto"์™€ flex-attention์„ ๋ชจ๋‘ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด ์˜ˆ์‹œ๋ฅผ ํ…์„œ ๋ณ‘๋ ฌ ๋ชจ๋“œ๋กœ ์‹คํ–‰ํ•˜๋ ค๋ฉด torchrun์„ ์‚ฌ์šฉํ•˜์„ธ์š”.

We will work towards enabling running with `device_map="auto"` and flex-attention without tensor parallelism in the future.

Very long-context generation:

```python
from transformers import Llama4ForConditionalGeneration, AutoTokenizer, infer_device
import torch
import time

file = "very_long_context_prompt.txt"
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

with open(file, "r") as f:
    very_long_text = "\n".join(f.readlines())

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flex_attention",
    dtype=torch.bfloat16
)

messages = [
    {"role": "user", "content": f"Look at the following texts: [{very_long_text}]\n\n\n\nWhat are the books, and who wrote them? Make me a nice list."},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

device = infer_device()
torch_device_module = getattr(torch, device, torch.cuda)
torch_device_module.synchronize()
start = time.time()
out = model.generate(
    input_ids.to(model.device),
    prefill_chunk_size=2048*8,
    max_new_tokens=300,
    cache_implementation="hybrid",
)
print(time.time()-start)
print(tokenizer.batch_decode(out[:, input_ids.shape[-1]:]))
print(f"{torch_device_module.max_memory_allocated(model.device) / 1024**3:.2f} GiB")
```

## Efficiency: how to get the best out of Llama 4[[efficiency-how-to-get-the-best-out-of-llama-4]]

์–ดํ…์…˜ ๋ฐฉ๋ฒ•[[the-attention-methods]]

Changing the default attention function can significantly improve compute performance as well as memory usage. Refer to the attention interface overview for an in-depth explanation of the interface.

Llama 4 ๋ชจ๋ธ์€ ์ฒ˜์Œ ๊ณต๊ฐœ๋  ๋•Œ๋ถ€ํ„ฐ ๋‹ค์Œ ์–ดํ…์…˜ ๋ฐฉ์‹์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค: eager, flex_attention, sdpa. ์ตœ์ƒ์˜ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด flex_attention ์‚ฌ์šฉ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค. ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜ ์ „ํ™˜์€ ๋ชจ๋ธ์„ ์ดˆ๊ธฐํ™”ํ•  ๋•Œ ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค:

Flex Attention์€ ๋ชจ๋ธ์ด ๊ธด ์ปจํ…์ŠคํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•  ๋•Œ ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•ฉ๋‹ˆ๋‹ค.

์ฃผ์˜: ์•„๋ž˜ ์˜ˆ์‹œ๋Š” device_map="auto"์™€ flex-attention์„ ๋ชจ๋‘ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด ์˜ˆ์‹œ๋ฅผ ํ…์„œ ๋ณ‘๋ ฌ ๋ชจ๋“œ๋กœ ์‹คํ–‰ํ•˜๋ ค๋ฉด torchrun์„ ์‚ฌ์šฉํ•˜์„ธ์š”.

We will work towards enabling running with `device_map="auto"` and flex-attention without tensor parallelism in the future.

```python
from transformers import Llama4ForConditionalGeneration
import torch

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    dtype=torch.bfloat16,
)
```
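
As the note above suggests, tensor parallelism is an alternative to `device_map="auto"` here. A hypothetical setup (the GPU count and script name are illustrative) would replace `device_map` with `tp_plan` and be launched with `torchrun`:

```python
# Hedged sketch: tensor-parallel loading instead of device_map="auto".
# Launch with e.g. `torchrun --nproc-per-node=8 my_script.py`
# (the GPU count and script name are illustrative).
from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    tp_plan="auto",  # shard the weights across all processes started by torchrun
    attn_implementation="flex_attention",
    dtype=torch.bfloat16,
)
```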

The `sdpa` attention method is generally more compute-efficient than the `eager` method:

```python
from transformers import Llama4ForConditionalGeneration
import torch

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    device_map="auto",
    dtype=torch.bfloat16,
)
```

The `eager` attention method is set by default, so nothing extra is needed when loading the model:

```python
from transformers import Llama4ForConditionalGeneration
import torch

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)
```

์–‘์žํ™”[[quantization]]

์–‘์žํ™”๋Š” ๊ฐ€์ค‘์น˜๋ฅผ ๋” ๋‚ฎ์€ ์ •๋ฐ€๋„๋กœ ๋ฐ”๊ฟ” ๋Œ€ํ˜• ๋ชจ๋ธ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ถ€๋‹ด์„ ์ค„์ž…๋‹ˆ๋‹ค. ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์–‘์žํ™” ๋ฐฑ์—”๋“œ์— ๋Œ€ํ•ด์„œ๋Š” ์–‘์žํ™” ๊ฐœ์š”๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”. ํ˜„์žฌ๋Š” FBGEMM๊ณผ LLM-Compressor๋ฅผ ์ง€์›ํ•˜๋ฉฐ, ๊ณง ๋” ๋งŽ์€ ๋ฐฉ์‹์ด ์ถ”๊ฐ€๋  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.

See below for examples using both:

๋‹ค์Œ์€ FBGEMM ์ ‘๊ทผ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ BF16 ๋ชจ๋ธ์„ FP8๋กœ ๋กœ๋“œํ•˜๋Š” ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค:

```python
from transformers import AutoTokenizer, Llama4ForConditionalGeneration, FbgemmFp8Config
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
    quantization_config=FbgemmFp8Config()
)

outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
```

When using LLM-Compressor, we recommend using the pre-quantized FP8 checkpoint provided with the release:

```python
from transformers import AutoTokenizer, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    tp_plan="auto",
    dtype=torch.bfloat16,
)

outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
```

### Offloading[[offloading]]

CPU ์˜คํ”„๋กœ๋”ฉ์„ ํ™œ์„ฑํ™”ํ•˜๋ฉด, GPU ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋ถ€์กฑํ•  ๋•Œ ๋ชจ๋ธ์ด ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ CPU๋กœ ์ด๋™์‹œํ‚ต๋‹ˆ๋‹ค. ์ถ”๋ก  ์‹œ ๋‹ค์–‘ํ•œ ๊ตฌ์„ฑ ์š”์†Œ๋“ค์ด GPU์™€ CPU ๊ฐ„์— ๋™์ ์œผ๋กœ ๋กœ๋“œ๋˜๊ณ  ์–ธ๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด CPU ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์ถฉ๋ถ„ํ•œ ํ•œ ๋” ์ž‘์€ ๋จธ์‹ ์—์„œ๋„ ๋ชจ๋ธ์„ ๋กœ๋“œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค๋งŒ ํ†ต์‹  ์˜ค๋ฒ„ํ—ค๋“œ๋กœ ์ธํ•ด ์ถ”๋ก  ์†๋„๊ฐ€ ๋А๋ ค์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

CPU ์˜คํ”„๋กœ๋”ฉ์„ ํ™œ์„ฑํ™”ํ•˜๋ ค๋ฉด ๋ชจ๋ธ ๋กœ๋“œ ์‹œ device_map์„ auto๋กœ ์ง€์ •ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค

```python
from transformers import Llama4ForConditionalGeneration
import torch

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)
```
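
If you want to bound how much GPU memory the dispatcher may use, so that the remainder of the model is offloaded to CPU, you can additionally pass a `max_memory` map. A minimal sketch, with illustrative (untuned) budgets:

```python
# Hedged sketch: cap per-device memory so overflow weights land on CPU.
# The budgets below are illustrative, not tuned recommendations.
from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "70GiB", "cpu": "200GiB"},  # GPU 0 budget, CPU budget
    dtype=torch.bfloat16,
)
```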

## Llama4Config

[[autodoc]] Llama4Config

## Llama4TextConfig

[[autodoc]] Llama4TextConfig

## Llama4VisionConfig

[[autodoc]] Llama4VisionConfig

## Llama4Processor

[[autodoc]] Llama4Processor

## Llama4ImageProcessorFast

[[autodoc]] Llama4ImageProcessorFast

## Llama4ForConditionalGeneration

[[autodoc]] Llama4ForConditionalGeneration
    - forward

## Llama4ForCausalLM

[[autodoc]] Llama4ForCausalLM
    - forward

## Llama4TextModel

[[autodoc]] Llama4TextModel
    - forward

## Llama4VisionModel

[[autodoc]] Llama4VisionModel
    - forward