Meta released this model on 2025-04-05, and it was subsequently added to Hugging Face Transformers.
Llama4[[llama4]]
Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models:
- The highly capable Llama 4 Maverick with 17B active parameters out of roughly 400B total, using 128 experts.
- The more efficient Llama 4 Scout with 17B active parameters out of roughly 109B total, using only 16 experts.
Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout were both trained on up to 40 trillion tokens of data covering 200 languages (with dedicated fine-tuning support for 12 languages, including Arabic, Spanish, German, and Hindi).
Meta designed Llama 4 Scout to be easy for anyone to use. With on-the-fly 4-bit or 8-bit quantization, Scout can run in real time on a single server-grade GPU, while the larger Llama 4 Maverick is offered in BF16 and FP8 formats for high-performance compute. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
You can find all of the original Llama checkpoints on the Hugging Face meta-llama page.
The Llama 4 family of models comes in two flavors: 109B and 402B parameters. Both flavors are very large and will not run on an ordinary device. Below we outline a few ways to reduce their memory usage.
For faster and more stable downloads, we recommend installing the `hf_xet` dependency: `pip install transformers[hf_xet]`
The examples below show how to generate with [Pipeline] or [AutoModel]. Since some Llama 4 variants have context lengths of up to 10 million tokens, we also include an example showing how to toggle the right attributes to enable very long-context generation.
from transformers import pipeline
import torch
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
messages = [
{"role": "user", "content": "๋ง์๋ค์ฆ ๋ ์ํผ๊ฐ ๋ฌด์์ธ๊ฐ์?"},
]
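# device_map="auto" spreads the model across all available GPUs (and the CPU if needed).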
pipe = pipeline(
"text-generation",
model=model_id,
device_map="auto",
dtype=torch.bfloat16
)
output = pipe(messages, do_sample=False, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
from transformers import AutoTokenizer, Llama4ForConditionalGeneration
import torch
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
{"role": "user", "content": "๋น์ ์ ๋๊ตฌ์ ๊ฐ์?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
device_map="auto",
dtype=torch.bfloat16
)
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
device_map="auto",
dtype=torch.bfloat16,
)
img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": img_url},
{"type": "text", "text": "์ด ์ด๋ฏธ์ง๋ฅผ ๋ ๋ฌธ์ฅ์ผ๋ก ์ค๋ช
ํด์ฃผ์ธ์."},
]
},
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256,
)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
device_map="auto",
dtype=torch.bfloat16,
)
url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": url1},
{"type": "image", "url": url2},
{"type": "text", "text": "์ด ๋ ์ด๋ฏธ์ง๊ฐ ์ด๋ป๊ฒ ๋น์ทํ๊ณ , ์ด๋ป๊ฒ ๋ค๋ฅธ์ง ์ค๋ช
ํด์ฃผ์ค ์ ์๋์?"},
]
},
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256,
)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
Note: the example below uses both device_map="auto" and flex-attention. To run this example in tensor-parallel mode, launch it with torchrun, as sketched below. We plan to make it possible to run with device_map="auto" and flex-attention without tensor parallelism in the future.
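For reference, a tensor-parallel launch of a script containing the example below typically looks like `torchrun --nproc-per-node=8 <your_script.py>`, where the script name and the number of processes are placeholders to adapt to your own setup.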
from transformers import Llama4ForConditionalGeneration, AutoTokenizer, infer_device
import torch
import time
file = "very_long_context_prompt.txt"
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
with open(file, "r") as f:
    very_long_text = "\n".join(f.readlines())
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
device_map="auto",
attn_implementation="flex_attention",
dtype=torch.bfloat16
)
messages = [
{"role": "user", "content": f"Look at the following texts: [{very_long_text}]\n\n\n\nWhat are the books, and who wrote them? Make me a nice list."},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
device = infer_device()
torch_device_module = getattr(torch, device, torch.cuda)
torch_device_module.synchronize()
start = time.time()
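# prefill_chunk_size splits the very long prompt prefill into chunks to limit peak memory usage;
# cache_implementation="hybrid" uses the hybrid KV cache designed for models that mix local (chunked) and global attention layers, like Llama 4.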
out = model.generate(
input_ids.to(model.device),
prefill_chunk_size=2048*8,
max_new_tokens=300,
cache_implementation="hybrid",
)
print(time.time()-start)
print(tokenizer.batch_decode(out[:, input_ids.shape[-1]:]))
print(f"{torch_device_module.max_memory_allocated(model.device) / 1024**3:.2f} GiB")
Efficiency; how to get the best out of Llama 4[[efficiency-how-to-get-the-best-out-of-llama-4]]
The attention methods[[the-attention-methods]]
Changing the default attention function can significantly improve compute performance as well as memory usage. Refer to the Attention Interface overview for an in-depth explanation of the interface.
The Llama 4 models support the following attention methods from release: eager, flex_attention, and sdpa. We recommend using flex_attention for the best results.
The attention mechanism is selected when the model is initialized:
Flex Attention ensures optimal performance when the model handles long contexts.
Note: the example below uses both device_map="auto" and flex-attention. To run this example in tensor-parallel mode, use torchrun. We plan to make it possible to run with device_map="auto" and flex-attention without tensor parallelism in the future.
from transformers import Llama4ForConditionalGeneration
import torch
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
attn_implementation="flex_attention",
device_map="auto",
dtype=torch.bfloat16,
)
The `sdpa` attention method is generally more compute-efficient than the `eager` method.
from transformers import Llama4ForConditionalGeneration
import torch
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
attn_implementation="sdpa",
device_map="auto",
dtype=torch.bfloat16,
)
The `eager` attention method is used by default, so no extra configuration is needed when loading the model:
from transformers import Llama4ForConditionalGeneration
import torch
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
device_map="auto",
dtype=torch.bfloat16,
)
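If you want to double-check which implementation was actually selected, you can inspect the loaded model's config. Note that `_attn_implementation` is a private attribute that may change between Transformers releases, so treat this as a quick debugging aid rather than a stable API:
from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)
# Private attribute; shows the attention backend chosen at load time (e.g. "eager").
print(model.config._attn_implementation)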
Quantization[[quantization]]
Quantization reduces the memory burden of large models by representing the weights at lower precision. Refer to the Quantization overview for the available quantization backends. FBGEMM and LLM-Compressor are supported at the moment, with more methods to be added soon.
See below for examples using both methods:
Here is an example of loading a BF16 model in FP8 using the FBGEMM approach:
from transformers import AutoTokenizer, Llama4ForConditionalGeneration, FbgemmFp8Config
import torch
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
{"role": "user", "content": "๋น์ ์ ๋๊ตฌ์ ๊ฐ์?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)
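# FbgemmFp8Config() quantizes the BF16 weights to FP8 on the fly while the checkpoint is loaded.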
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
device_map="auto",
dtype=torch.bfloat16,
quantization_config=FbgemmFp8Config()
)
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
To use the LLM-Compressor technique, it is best to use the pre-quantized FP8 checkpoint that is provided with the release:
from transformers import AutoTokenizer, Llama4ForConditionalGeneration
import torch
model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
{"role": "user", "content": "๋น์ ์ ๋๊ตฌ์ ๊ฐ์?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)
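# tp_plan="auto" shards the model across the available GPUs with tensor parallelism; launch the script with torchrun.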
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
tp_plan="auto",
dtype=torch.bfloat16,
)
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
Offloading[[offloading]]
Enabling CPU offloading means the model moves components to the CPU when it runs out of GPU memory. At inference time, the different components are dynamically loaded and unloaded between the GPU and the CPU. This makes it possible to load the model on machines with a smaller GPU, as long as enough CPU memory is available, although inference can be slower due to the communication overhead.
To enable CPU offloading, simply set device_map to "auto" when loading the model:
from transformers import Llama4ForConditionalGeneration
import torch
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
device_map="auto",
dtype=torch.bfloat16,
)
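If you want to cap how much GPU memory is used and push the remaining weights onto the CPU explicitly, you can additionally pass a `max_memory` map to `from_pretrained`. The limits below are illustrative placeholders rather than recommended values; adjust them to your hardware.
from transformers import Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    # Illustrative limits: keep at most ~70 GiB of weights on GPU 0 and offload the rest to CPU RAM.
    max_memory={0: "70GiB", "cpu": "200GiB"},
    dtype=torch.bfloat16,
)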
Llama4Config
[[autodoc]] Llama4Config
Llama4TextConfig
[[autodoc]] Llama4TextConfig
Llama4VisionConfig
[[autodoc]] Llama4VisionConfig
Llama4Processor
[[autodoc]] Llama4Processor
Llama4ImageProcessorFast
[[autodoc]] Llama4ImageProcessorFast
Llama4ForConditionalGeneration
[[autodoc]] Llama4ForConditionalGeneration
- forward
Llama4ForCausalLM
[[autodoc]] Llama4ForCausalLM
- forward
Llama4TextModel
[[autodoc]] Llama4TextModel
- forward
Llama4VisionModel
[[autodoc]] Llama4VisionModel
- forward