Delete GPTQ-for-Qwen_hf/README.md

GPTQ-for-Qwen_hf/README.md (deleted, 188 lines removed):

# GPTQ-for-LLaMA

**I am currently focusing on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and recommend using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) instead of GPTQ-for-LLaMA.**

<img src="https://user-images.githubusercontent.com/64115820/235287009-2d07bba8-9b85-4973-9e06-2a3c28777f06.png" width="50%" height="50%">

4-bit quantization of [LLaMA](https://arxiv.org/abs/2302.13971) using [GPTQ](https://arxiv.org/abs/2210.17323).

GPTQ is a SOTA one-shot weight quantization method: it compresses the weights of an already-trained model in a single pass, without retraining. The naive round-to-nearest (RTN) baseline it is compared against in the tables below is sketched next.
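
The following is a minimal sketch (not code from this repo) of RTN 4-bit quantization with per-group scales; the `group-size` column in the result tables controls how many weights share one scale, and GPTQ improves on this baseline by choosing roundings that minimize layer output error:

```
import torch

def rtn_quantize(weight: torch.Tensor, bits: int = 4, groupsize: int = 128) -> torch.Tensor:
    """Round-to-nearest symmetric quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric quantization
    rows, cols = weight.shape   # cols must be divisible by groupsize
    w = weight.reshape(rows, cols // groupsize, groupsize)
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax  # one scale per group
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)           # 4-bit integer codes
    return (q * scale).reshape(rows, cols)                             # dequantized back to float

w = torch.randn(4096, 4096)
print((w - rtn_quantize(w)).abs().mean())  # error introduced by quantization
```

Smaller groups track the weight distribution more closely (lower perplexity) at the cost of storing more scales (larger checkpoints), which is the trade-off visible between the `group-size = -` and `group-size = 128` rows below.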

**It can be used universally, but it is not the [fastest](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/old-cuda) option, and it only supports Linux.**

**Triton only supports Linux, so if you are a Windows user, please use [WSL2](https://learn.microsoft.com/en-us/windows/wsl/install).**

## News or Update

**AutoGPTQ-triton, a packaged version of GPTQ with Triton, has been integrated into [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).**

## Result

<details>
<summary>LLaMA-7B (click me)</summary>

| [LLaMA-7B](https://arxiv.org/abs/2302.13971) | Bits | group-size | memory (MiB) | Wikitext2 (ppl) | checkpoint size (GB) |
| -------------------------------------------- | ---- | ---------- | ------------ | --------------- | -------------------- |
| FP16                                         | 16   | -          | 13940        | 5.68            | 12.5                 |
| RTN                                          | 4    | -          | -            | 6.29            | -                    |
| [GPTQ](https://arxiv.org/abs/2210.17323)     | 4    | -          | 4740         | 6.09            | 3.5                  |
| [GPTQ](https://arxiv.org/abs/2210.17323)     | 4    | 128        | 4891         | 5.85            | 3.6                  |
| RTN                                          | 3    | -          | -            | 25.54           | -                    |
| [GPTQ](https://arxiv.org/abs/2210.17323)     | 3    | -          | 3852         | 8.07            | 2.7                  |
| [GPTQ](https://arxiv.org/abs/2210.17323)     | 3    | 128        | 4116         | 6.61            | 3.0                  |

</details>

<details>
<summary>LLaMA-13B</summary>

| [LLaMA-13B](https://arxiv.org/abs/2302.13971) | Bits | group-size | memory (MiB) | Wikitext2 (ppl) | checkpoint size (GB) |
| --------------------------------------------- | ---- | ---------- | ------------ | --------------- | -------------------- |
| FP16                                          | 16   | -          | OOM          | 5.09            | 24.2                 |
| RTN                                           | 4    | -          | -            | 5.53            | -                    |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 4    | -          | 8410         | 5.36            | 6.5                  |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 4    | 128        | 8747         | 5.20            | 6.7                  |
| RTN                                           | 3    | -          | -            | 11.40           | -                    |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 3    | -          | 6870         | 6.63            | 5.1                  |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 3    | 128        | 7277         | 5.62            | 5.4                  |

</details>

<details>
<summary>LLaMA-33B</summary>

| [LLaMA-33B](https://arxiv.org/abs/2302.13971) | Bits | group-size | memory (MiB) | Wikitext2 (ppl) | checkpoint size (GB) |
| --------------------------------------------- | ---- | ---------- | ------------ | --------------- | -------------------- |
| FP16                                          | 16   | -          | OOM          | 4.10            | 60.5                 |
| RTN                                           | 4    | -          | -            | 4.54            | -                    |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 4    | -          | 19493        | 4.45            | 15.7                 |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 4    | 128        | 20570        | 4.23            | 16.3                 |
| RTN                                           | 3    | -          | -            | 14.89           | -                    |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 3    | -          | 15493        | 5.69            | 12.0                 |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 3    | 128        | 16566        | 4.80            | 13.0                 |

</details>

<details>
<summary>LLaMA-65B</summary>

| [LLaMA-65B](https://arxiv.org/abs/2302.13971) | Bits | group-size | memory (MiB) | Wikitext2 (ppl) | checkpoint size (GB) |
| --------------------------------------------- | ---- | ---------- | ------------ | --------------- | -------------------- |
| FP16                                          | 16   | -          | OOM          | 3.53            | 121.0                |
| RTN                                           | 4    | -          | -            | 3.92            | -                    |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 4    | -          | OOM          | 3.84            | 31.1                 |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 4    | 128        | OOM          | 3.65            | 32.3                 |
| RTN                                           | 3    | -          | -            | 10.59           | -                    |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 3    | -          | OOM          | 5.04            | 23.6                 |
| [GPTQ](https://arxiv.org/abs/2210.17323)      | 3    | 128        | OOM          | 4.17            | 25.6                 |

</details>
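
For reference, Wikitext2 perplexity numbers like those above are typically measured as the exponentiated mean next-token cross-entropy over fixed-length windows of the test set. A generic sketch (the model path is a placeholder, and this repo's own evaluation details may differ):

```
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "path/to/llama-7b-hf"  # placeholder: HF-format weights
tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16).cuda()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
nlls = []
for i in range(0, ids.shape[1] - 2048, 2048):  # non-overlapping 2048-token windows
    window = ids[:, i : i + 2048].cuda()
    with torch.no_grad():
        # labels=window makes the model return the mean next-token cross-entropy
        nlls.append(model(window, labels=window).loss.float())
print(torch.exp(torch.stack(nlls).mean()))  # perplexity
```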

Quantization requires a large amount of CPU memory. However, the memory required can be reduced by using swap memory, as sketched below.
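
On Linux, for example, a temporary swap file can stand in for the missing RAM; a minimal sketch (the 64G size is an assumption, size it to your model):

```
sudo fallocate -l 64G /swapfile   # reserve space; adjust the size to the model
sudo chmod 600 /swapfile          # swap files must not be world-readable
sudo mkswap /swapfile             # format the file as swap
sudo swapon /swapfile             # enable it; verify with: swapon --show
```

Quantizing from swap is slow, but it lets the one-shot pass finish on machines whose RAM alone cannot hold the FP16 weights.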

Depending on the GPUs/drivers, there may be a difference in performance, and the difference decreases as the model size increases (https://github.com/IST-DASLab/gptq/issues/1).

According to the [GPTQ paper](https://arxiv.org/abs/2210.17323), the performance gap between FP16 and GPTQ likewise decreases as the model size increases.

## GPTQ vs [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)

<details>
<summary>LLaMA-7B (click me)</summary>

| [LLaMA-7B (seqlen=2048)](https://arxiv.org/abs/2302.13971) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
| ---------------------------------------------------------- | --------------------- | ------------ | -------- |
| FP16                                                       | 16                    | 13948        | 5.22     |
| [GPTQ-128g](https://arxiv.org/abs/2210.17323)              | 4.15                  | 4781         | 5.30     |
| [nf4-double_quant](https://arxiv.org/abs/2305.14314)       | 4.127                 | 4804         | 5.30     |
| [nf4](https://arxiv.org/abs/2305.14314)                    | 4.5                   | 5102         | 5.30     |
| [fp4](https://arxiv.org/abs/2212.09720)                    | 4.5                   | 5102         | 5.33     |

</details>

<details>
<summary>LLaMA-13B</summary>

| [LLaMA-13B (seqlen=2048)](https://arxiv.org/abs/2302.13971) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
| ----------------------------------------------------------- | --------------------- | ------------ | -------- |
| FP16                                                        | 16                    | OOM          | -        |
| [GPTQ-128g](https://arxiv.org/abs/2210.17323)               | 4.15                  | 8589         | 5.02     |
| [nf4-double_quant](https://arxiv.org/abs/2305.14314)        | 4.127                 | 8581         | 5.04     |
| [nf4](https://arxiv.org/abs/2305.14314)                     | 4.5                   | 9170         | 5.04     |
| [fp4](https://arxiv.org/abs/2212.09720)                     | 4.5                   | 9170         | 5.11     |

</details>

<details>
<summary>LLaMA-33B</summary>

| [LLaMA-33B (seqlen=1024)](https://arxiv.org/abs/2302.13971) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
| ----------------------------------------------------------- | --------------------- | ------------ | -------- |
| FP16                                                        | 16                    | OOM          | -        |
| [GPTQ-128g](https://arxiv.org/abs/2210.17323)               | 4.15                  | 18441        | 3.71     |
| [nf4-double_quant](https://arxiv.org/abs/2305.14314)        | 4.127                 | 18313        | 3.76     |
| [nf4](https://arxiv.org/abs/2305.14314)                     | 4.5                   | 19729        | 3.75     |
| [fp4](https://arxiv.org/abs/2212.09720)                     | 4.5                   | 19729        | 3.75     |

</details>
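
The nf4/fp4 rows come from bitsandbytes. A hedged sketch of reproducing that configuration through transformers (the model path is a placeholder, and `BitsAndBytesConfig` with 4-bit options requires a newer transformers release than the v4.28 pin listed under Dependencies):

```
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # "fp4" reproduces the fp4 rows instead
    bnb_4bit_use_double_quant=True,   # the nf4-double_quant rows above
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-7b-hf",            # placeholder path to HF-format weights
    quantization_config=bnb_config,
    device_map="auto",
)
```

Unlike GPTQ, this quantizes on the fly at load time, so there is no separate one-shot calibration pass and no saved 4-bit checkpoint.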

## Installation

If you don't have [conda](https://docs.conda.io/en/latest/miniconda.html), install it first.
```
conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Or, if you're having trouble with conda, use pip with python3.9:
# pip3 install torch torchvision torchaudio

git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt
```
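
A quick sanity check (not part of the repo) that the installed PyTorch build can actually drive the CUDA kernels:

```
import torch

print(torch.__version__)               # expect a +cu117 build per the pins below
assert torch.cuda.is_available()       # CUDA must be visible for quantization
print(torch.cuda.get_device_name(0))   # e.g. the RTX 3090 used for the results
```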

## Dependencies

* `torch`: tested on v2.0.0+cu117
* `transformers`: tested on v4.28.0.dev0
* `datasets`: tested on v2.10.1
* `safetensors`: tested on v0.3.0

All experiments were run on a single NVIDIA RTX 3090.

# Language Generation

## LLaMA

```
# Convert LLaMA to HF format
python convert_llama_weights_to_hf.py --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir ./llama-hf

# Benchmark language generation with 4-bit LLaMA-7B:

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save llama7b-4bit-128g.pt

# Or save compressed `.safetensors` model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors llama7b-4bit-128g.safetensors

# Benchmark generating a 2048-token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --benchmark 2048 --check

# Benchmark the FP16 baseline; note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py ${MODEL_DIR} c4 --benchmark 2048 --check

# Model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"

# Model inference with the saved model, using safetensors loaded directly to the GPU
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.safetensors --text "this is llama" --device=0

# Model inference with the saved model with offload (this is very slow).
# For scale: generating 45 tokens (5 -> 50) from LLaMA-65B with pre_layer set to 50 takes about 180 seconds on a single RTX 3090.
CUDA_VISIBLE_DEVICES=0 python llama_inference_offload.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --pre_layer 16
```
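A saved 4-bit checkpoint can also be consumed from Python. A hedged sketch using the recommended [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) package (the directory is a placeholder; AutoGPTQ expects its own checkpoint layout with a quantize_config.json alongside the weights, not the raw `.pt` file written above):

```
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "path/to/llama7b-4bit-128g"  # placeholder AutoGPTQ checkpoint dir

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,  # matches the --save_safetensors flow above
)

inputs = tokenizer("this is llama", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```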
Basically, 4-bit quantization with a groupsize of 128 is recommended.

You can also export the quantization parameters in toml+numpy format.
```
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --quant-directory ${TOML_DIR}
```

# Acknowledgements

This code is based on [GPTQ](https://github.com/IST-DASLab/gptq).

Thanks to Meta AI for releasing [LLaMA](https://arxiv.org/abs/2302.13971), a powerful LLM.

Triton GPTQ kernel code is based on [GPTQ-triton](https://github.com/fpgaminer/GPTQ-triton).