AWQ [[awq]]

Try AWQ quantization hands-on with this notebook!

Activation-aware Weight Quantization (AWQ) does not quantize every weight in a model. Instead, it preserves the weights that matter most for LLM performance. This greatly reduces quantization loss, so you can run models in 4-bit precision without performance degradation.

Several libraries can quantize models with the AWQ algorithm, such as llm-awq, autoawq, and optimum-intel. Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide shows how to load models quantized with autoawq, but the process is similar for models quantized with llm-awq.

Make sure autoawq is installed:

pip install autoawq

An AWQ-quantized model can be identified by the quantization_config attribute in the model's config.json file:

{
  "_name_or_path": "/workspace/process/huggingfaceh4_zephyr-7b-alpha/source",
  "architectures": [
    "MistralForCausalLM"
  ],
  ...
  ...
  ...
  "quantization_config": {
    "quant_method": "awq",
    "zero_point": true,
    "group_size": 128,
    "bits": 4,
    "version": "gemm"
  }
}
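
As a minimal sketch of that check, the snippet below parses a config.json-style document and reports whether it describes an AWQ checkpoint. The helper name `is_awq_checkpoint` and the trimmed-down inline config are illustrative assumptions, not part of Transformers; in practice you would read the real config.json from the model repository.

```python
import json

# Trimmed-down stand-in for a model's config.json (illustrative only).
CONFIG_TEXT = """
{
  "architectures": ["MistralForCausalLM"],
  "quantization_config": {
    "quant_method": "awq",
    "zero_point": true,
    "group_size": 128,
    "bits": 4,
    "version": "gemm"
  }
}
"""


def is_awq_checkpoint(config: dict) -> bool:
    """Return True if the config declares AWQ as its quantization method.

    Hypothetical helper for illustration; not a Transformers API.
    """
    quant_cfg = config.get("quantization_config") or {}
    return str(quant_cfg.get("quant_method", "")).lower() == "awq"


config = json.loads(CONFIG_TEXT)
print(is_awq_checkpoint(config))                    # config with an AWQ section
print(is_awq_checkpoint({"architectures": ["X"]}))  # config without quantization_config
```

A config without a quantization_config section is treated as not quantized, which is why the helper falls back to an empty dictionary.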

A quantized model is loaded with the [~PreTrainedModel.from_pretrained] method. If you loaded the model on the CPU, move it to a GPU device first. Use the device_map parameter to specify where to place the model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")

For performance reasons, loading an AWQ-quantized model automatically sets the other weights to fp16 by default. To load these weights in a different format, use the torch_dtype parameter:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

To accelerate inference further, AWQ quantization can be combined with FlashAttention-2:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0")

Fused modules [[fused-modules]]

Fused modules improve accuracy and performance. They are supported out of the box for AWQ modules in the Llama and Mistral architectures, but you can also fuse AWQ modules for unsupported architectures.

Fused modules cannot be combined with other optimization techniques such as FlashAttention-2.

To enable fused modules for a supported architecture, create an [AwqConfig] and set the parameters fuse_max_seq_len and do_fuse=True. The fuse_max_seq_len parameter is the total sequence length, and it should include both the context length and the expected generation length. You can set it to a larger value to be safe.

For example, to fuse the AWQ modules of the TheBloke/Mistral-7B-OpenOrca-AWQ model:

import torch
from transformers import AwqConfig, AutoModelForCausalLM

model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    do_fuse=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)
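
The sizing rule for fuse_max_seq_len can be sketched as simple arithmetic: add the longest prompt you expect to the number of tokens you plan to generate, then optionally add headroom. The helper name `estimate_fuse_max_seq_len` and its `safety_margin` parameter are hypothetical, chosen only for illustration.

```python
def estimate_fuse_max_seq_len(context_len: int, max_new_tokens: int, safety_margin: int = 0) -> int:
    """Total sequence length = prompt tokens + generated tokens (+ optional headroom)."""
    if context_len < 0 or max_new_tokens < 0 or safety_margin < 0:
        raise ValueError("lengths must be non-negative")
    return context_len + max_new_tokens + safety_margin


# A 384-token prompt plus 128 generated tokens needs 512 positions.
print(estimate_fuse_max_seq_len(384, 128))      # 512
# The same workload with 64 tokens of headroom.
print(estimate_fuse_max_seq_len(384, 128, 64))  # 576
```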

The TheBloke/Mistral-7B-OpenOrca-AWQ model was benchmarked with batch_size=1, both with and without fused modules.

Unfused modules

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 60.0984 | 38.4537 | 4.50 GB (5.68%) |
| 1 | 64 | 64 | 1333.67 | 31.6604 | 4.50 GB (5.68%) |
| 1 | 128 | 128 | 2434.06 | 31.6272 | 4.50 GB (5.68%) |
| 1 | 256 | 256 | 3072.26 | 38.1731 | 4.50 GB (5.68%) |
| 1 | 512 | 512 | 3184.74 | 31.6819 | 4.59 GB (5.80%) |
| 1 | 1024 | 1024 | 3148.18 | 36.8031 | 4.81 GB (6.07%) |
| 1 | 2048 | 2048 | 2927.33 | 35.2676 | 5.73 GB (7.23%) |

Fused modules

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 81.4899 | 80.2569 | 4.00 GB (5.05%) |
| 1 | 64 | 64 | 1756.1 | 106.26 | 4.00 GB (5.05%) |
| 1 | 128 | 128 | 2479.32 | 105.631 | 4.00 GB (5.06%) |
| 1 | 256 | 256 | 1813.6 | 85.7485 | 4.01 GB (5.06%) |
| 1 | 512 | 512 | 2848.9 | 97.701 | 4.11 GB (5.19%) |
| 1 | 1024 | 1024 | 3044.35 | 87.7323 | 4.41 GB (5.57%) |
| 1 | 2048 | 2048 | 2715.11 | 89.4709 | 5.57 GB (7.04%) |
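
To make the comparison concrete, the following sketch computes the decode speedup of fused over unfused modules at each sequence length. The numbers are copied from the benchmark figures above; the variable and function names are illustrative only.

```python
# (prefill/decode length, unfused decode tokens/s, fused decode tokens/s),
# copied from the benchmark results above.
DECODE_TOKENS_PER_S = [
    (32, 38.4537, 80.2569),
    (64, 31.6604, 106.26),
    (128, 31.6272, 105.631),
    (256, 38.1731, 85.7485),
    (512, 31.6819, 97.701),
    (1024, 36.8031, 87.7323),
    (2048, 35.2676, 89.4709),
]


def decode_speedups(rows):
    """Return {length: fused / unfused decode throughput} for each row."""
    return {length: fused / unfused for length, unfused, fused in rows}


speedups = decode_speedups(DECODE_TOKENS_PER_S)
for length, ratio in speedups.items():
    print(f"length {length:>4}: {ratio:.2f}x decode throughput with fused modules")
```

On these figures, decode throughput with fused modules is roughly 2x to 3.4x that of unfused modules at every length. Prefill throughput is mixed: fused modules are faster at lengths up to 128 and slower from 256 upward.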

The speed and throughput of fused and unfused modules were also tested with the optimum-benchmark library.

[Figure: forward peak memory per batch size]
[Figure: generate throughput per batch size]

For architectures that do not support fused modules, you need to create a custom fusing mapping with the modules_to_fuse parameter to define which modules to fuse. For example, to fuse the AWQ modules of the TheBloke/Yi-34B-AWQ model:

import torch
from transformers import AwqConfig, AutoModelForCausalLM

model_id = "TheBloke/Yi-34B-AWQ"

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    modules_to_fuse={
        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "layernorm": ["ln1", "ln2", "norm"],
        "mlp": ["gate_proj", "up_proj", "down_proj"],
        "use_alibi": False,
        "num_attention_heads": 56,
        "num_key_value_heads": 8,
        "hidden_size": 7168
    }
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)

The modules_to_fuse parameter should include:

  • "attention": The names of the attention layers to fuse, in this order: query, key, value, and output projection layer. To skip fusing these layers, pass an empty list.
  • "layernorm": The names of the LayerNorm layers to replace with a custom fused LayerNorm. To skip fusing these layers, pass an empty list.
  • "mlp": The names of the MLP layers to fuse into a single MLP layer, in this order: (gate (dense, layer, post-attention) / up / down layers).
  • "use_alibi": Whether the model uses ALiBi positional embedding.
  • "num_attention_heads": The number of attention heads.
  • "num_key_value_heads": The number of key-value heads used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, it uses Multi Query Attention (MQA); otherwise it uses GQA.
  • "hidden_size": The dimension of the hidden representations.
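
The requirements above can be sketched as a small pre-flight check that verifies a mapping has every expected key and reports which attention variant the head counts imply. The names `check_modules_to_fuse` and `attention_variant` are hypothetical and not part of Transformers, and the two divisibility checks are assumptions based on how grouped attention is usually implemented rather than rules stated in this guide.

```python
REQUIRED_KEYS = {
    "attention", "layernorm", "mlp", "use_alibi",
    "num_attention_heads", "num_key_value_heads", "hidden_size",
}


def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Classify the attention scheme implied by the two head counts."""
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return "GQA"


def check_modules_to_fuse(mapping: dict) -> str:
    """Validate a fusing mapping and return its attention variant.

    Hypothetical pre-flight check for illustration; not part of Transformers.
    """
    missing = REQUIRED_KEYS - mapping.keys()
    if missing:
        raise KeyError(f"modules_to_fuse is missing keys: {sorted(missing)}")
    heads = mapping["num_attention_heads"]
    kv_heads = mapping["num_key_value_heads"]
    # Assumed sanity checks: heads split evenly into KV groups and head dimensions.
    if heads % kv_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    if mapping["hidden_size"] % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return attention_variant(heads, kv_heads)


# The Yi-34B mapping from the example above.
yi_mapping = {
    "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "layernorm": ["ln1", "ln2", "norm"],
    "mlp": ["gate_proj", "up_proj", "down_proj"],
    "use_alibi": False,
    "num_attention_heads": 56,
    "num_key_value_heads": 8,
    "hidden_size": 7168,
}
print(check_modules_to_fuse(yi_mapping))  # GQA
```

With 56 attention heads and 8 key-value heads, the Yi-34B mapping falls into the GQA case.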

ExLlama-v2 support [[exllama-v2-support]]

Recent versions of autoawq support ExLlama-v2 kernels for faster prefill and decoding. To get started, first install the latest version of autoawq:

pip install git+https://github.com/casper-hansen/AutoAWQ.git

Then create an AwqConfig() with version="exllama" and pass it to the model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

quantization_config = AwqConfig(version="exllama")

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization_config=quantization_config,
    device_map="auto",
)

input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
output = model(input_ids)
print(output.logits)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ")
input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
# Use the tokenizer's own EOS id for padding.
output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))

This feature is supported on AMD GPUs.