| <!--Copyright 2024 The HuggingFace Team. All rights reserved. | |
| Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | |
| the License. You may obtain a copy of the License at | |
| http://www.apache.org/licenses/LICENSE-2.0 | |
| Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | |
| an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | |
| specific language governing permissions and limitations under the License. | |
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
| --> | |
| # AWQ [[awq]] | |
| <Tip> | |
Try out AWQ quantization with this [notebook](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY)!
| </Tip> | |
[Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) doesn't quantize all the weights in a model; instead, it preserves the small fraction of weights that are important for LLM performance. This significantly reduces quantization loss, so you can run models in 4-bit precision without any performance degradation.
There are several libraries for quantizing models with the AWQ algorithm, such as [llm-awq](https://github.com/mit-han-lab/llm-awq), [autoawq](https://github.com/casper-hansen/AutoAWQ), and [optimum-intel](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc). Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide shows how to load models quantized with autoawq, but the process is similar for llm-awq quantized models.
Make sure you have autoawq installed:
| ```bash | |
| pip install autoawq | |
| ``` | |
An AWQ-quantized model can be identified by the `quantization_config` attribute in the model's [config.json](https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json) file:
| ```json | |
| { | |
| "_name_or_path": "/workspace/process/huggingfaceh4_zephyr-7b-alpha/source", | |
| "architectures": [ | |
| "MistralForCausalLM" | |
| ], | |
| ... | |
| ... | |
| ... | |
| "quantization_config": { | |
| "quant_method": "awq", | |
| "zero_point": true, | |
| "group_size": 128, | |
| "bits": 4, | |
| "version": "gemm" | |
| } | |
| } | |
| ``` | |
| ์์ํ๋ ๋ชจ๋ธ์ [`~PreTrainedModel.from_pretrained`] ๋ฉ์๋๋ฅผ ์ฌ์ฉํ์ฌ ๊ฐ์ ธ์ต๋๋ค. ๋ชจ๋ธ์ CPU์ ๊ฐ์ ธ์๋ค๋ฉด, ๋จผ์ ๋ชจ๋ธ์ GPU ์ฅ์น๋ก ์ฎ๊ฒจ์ผ ํฉ๋๋ค. `device_map` ํ๋ผ๋ฏธํฐ๋ฅผ ์ฌ์ฉํ์ฌ ๋ชจ๋ธ์ ๋ฐฐ์นํ ์์น๋ฅผ ์ง์ ํ์ธ์: | |
| ```py | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| model_id = "TheBloke/zephyr-7B-alpha-AWQ" | |
| model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0") | |
| ``` | |
Loading an AWQ-quantized model automatically sets the weights to fp16 by default for performance reasons. If you want to load the weights in a different format, use the `torch_dtype` parameter:
| ```py | |
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
| model_id = "TheBloke/zephyr-7B-alpha-AWQ" | |
| model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32) | |
| ``` | |
You can also combine AWQ quantization with [FlashAttention-2](../perf_infer_gpu_one#flashattention-2) to further speed up inference:
| ```py | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0") | |
| ``` | |
## Fused modules [[fused-modules]]
Fused modules offer improved accuracy and performance. Fused modules are supported out of the box for AWQ modules of the [Llama](https://huggingface.co/meta-llama) and [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) architectures, but you can also fuse AWQ modules for unsupported architectures.
| <Tip warning={true}> | |
Fused modules can't be combined with other optimization techniques such as FlashAttention-2.
| </Tip> | |
| <hfoptions id="fuse"> | |
| <hfoption id="supported architectures"> | |
To enable fused modules for supported architectures, create an [`AwqConfig`] and set the parameters `fuse_max_seq_len` and `do_fuse=True`. The `fuse_max_seq_len` parameter is the total sequence length, and it should include the context length and the expected generation length. You can set it to a larger value to be safe.
For example, let's fuse the AWQ modules of the [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model.
| ```python | |
| import torch | |
| from transformers import AwqConfig, AutoModelForCausalLM | |
| model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ" | |
| quantization_config = AwqConfig( | |
| bits=4, | |
| fuse_max_seq_len=512, | |
| do_fuse=True, | |
| ) | |
| model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0) | |
| ``` | |
The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model was benchmarked with `batch_size=1`, both with and without fused modules.
<figcaption class="text-center text-gray-500 text-lg">Unfused module</figcaption>
| Batch Size | Prefill Length | Decode Length | Prefill tokens/sec | Decode tokens/sec | Memory (VRAM) |
| |-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------| | |
| | 1 | 32 | 32 | 60.0984 | 38.4537 | 4.50 GB (5.68%) | | |
| | 1 | 64 | 64 | 1333.67 | 31.6604 | 4.50 GB (5.68%) | | |
| | 1 | 128 | 128 | 2434.06 | 31.6272 | 4.50 GB (5.68%) | | |
| | 1 | 256 | 256 | 3072.26 | 38.1731 | 4.50 GB (5.68%) | | |
| | 1 | 512 | 512 | 3184.74 | 31.6819 | 4.59 GB (5.80%) | | |
| | 1 | 1024 | 1024 | 3148.18 | 36.8031 | 4.81 GB (6.07%) | | |
| | 1 | 2048 | 2048 | 2927.33 | 35.2676 | 5.73 GB (7.23%) | | |
<figcaption class="text-center text-gray-500 text-lg">Fused module</figcaption>
| Batch Size | Prefill Length | Decode Length | Prefill tokens/sec | Decode tokens/sec | Memory (VRAM) |
| |-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------| | |
| | 1 | 32 | 32 | 81.4899 | 80.2569 | 4.00 GB (5.05%) | | |
| | 1 | 64 | 64 | 1756.1 | 106.26 | 4.00 GB (5.05%) | | |
| | 1 | 128 | 128 | 2479.32 | 105.631 | 4.00 GB (5.06%) | | |
| | 1 | 256 | 256 | 1813.6 | 85.7485 | 4.01 GB (5.06%) | | |
| | 1 | 512 | 512 | 2848.9 | 97.701 | 4.11 GB (5.19%) | | |
| | 1 | 1024 | 1024 | 3044.35 | 87.7323 | 4.41 GB (5.57%) | | |
| | 1 | 2048 | 2048 | 2715.11 | 89.4709 | 5.57 GB (7.04%) | | |
The speed and throughput of fused and unfused modules were also tested with the [optimum-benchmark](https://github.com/huggingface/optimum-benchmark) library.
| <div class="flex gap-4"> | |
| <div> | |
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_forward_memory_plot.png" alt="forward peak memory per batch size" />
<figcaption class="mt-2 text-center text-sm text-gray-500">forward peak memory per batch size</figcaption>
| </div> | |
| <div> | |
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_generate_throughput_plot.png" alt="generate throughput per batch size" />
<figcaption class="mt-2 text-center text-sm text-gray-500">generate throughput per batch size</figcaption>
| </div> | |
| </div> | |
| </hfoption> | |
| <hfoption id="unsupported architectures"> | |
For architectures that don't support fused modules, you need to create a custom fusing mapping with the `modules_to_fuse` parameter to define which modules should be fused. For example, here's how to fuse the AWQ modules of the [TheBloke/Yi-34B-AWQ](https://huggingface.co/TheBloke/Yi-34B-AWQ) model.
| ```python | |
| import torch | |
| from transformers import AwqConfig, AutoModelForCausalLM | |
| model_id = "TheBloke/Yi-34B-AWQ" | |
| quantization_config = AwqConfig( | |
| bits=4, | |
| fuse_max_seq_len=512, | |
| modules_to_fuse={ | |
| "attention": ["q_proj", "k_proj", "v_proj", "o_proj"], | |
| "layernorm": ["ln1", "ln2", "norm"], | |
| "mlp": ["gate_proj", "up_proj", "down_proj"], | |
| "use_alibi": False, | |
| "num_attention_heads": 56, | |
| "num_key_value_heads": 8, | |
| "hidden_size": 7168 | |
| } | |
| ) | |
| model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0) | |
| ``` | |
The `modules_to_fuse` parameter should include the following (see the sketch after this list for how to look up the architecture-specific values):
- `"attention"`: The names of the attention layers to fuse, in the following order: query, key, value, and output projection layer. Pass an empty list if you don't want to fuse these layers.
- `"layernorm"`: The names of all the LayerNorm layers you want to replace with a custom fused LayerNorm. Pass an empty list if you don't want to fuse these layers.
- `"mlp"`: The names of the MLP layers you want to fuse into a single MLP layer, in the order: (gate (dense, layer, post-attention) / up / down layers).
- `"use_alibi"`: Set this if your model uses ALiBi positional embeddings.
- `"num_attention_heads"`: The number of attention heads.
- `"num_key_value_heads"`: The number of key-value heads used to implement Grouped Query Attention (GQA). If `num_key_value_heads=num_attention_heads`, the model uses Multi-Head Attention (MHA); if `num_key_value_heads=1`, it uses Multi-Query Attention (MQA); otherwise GQA is used.
- `"hidden_size"`: The dimension of the hidden representations.
| </hfoption> | |
| </hfoptions> | |
## ExLlama-v2 support [[exllama-v2-support]]
Recent versions of `autoawq` support ExLlama-v2 kernels for faster prefill and decoding. To get started, first install the latest version of `autoawq`:
| ```bash | |
| pip install git+https://github.com/casper-hansen/AutoAWQ.git | |
| ``` | |
Create an `AwqConfig()` with the parameter `version="exllama"` and pass it to the model.
| ```python | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig | |
| quantization_config = AwqConfig(version="exllama") | |
| model = AutoModelForCausalLM.from_pretrained( | |
| "TheBloke/Mistral-7B-Instruct-v0.1-AWQ", | |
| quantization_config=quantization_config, | |
| device_map="auto", | |
| ) | |
| input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda") | |
| output = model(input_ids) | |
| print(output.logits) | |
| tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ") | |
| input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device) | |
| output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=50256) | |
| print(tokenizer.decode(output[0], skip_special_tokens=True)) | |
| ``` | |
| <Tip warning={true}> | |
Note that this feature is supported on AMD GPUs.
| </Tip> | |