HaoranChu committed on
Commit ddf2ebf · verified · 1 Parent(s): e9ad0c9

Delete GPTQ-for-Qwen_hf/README.md

Files changed (1)
  1. GPTQ-for-Qwen_hf/README.md +0 -188
GPTQ-for-Qwen_hf/README.md DELETED
@@ -1,188 +0,0 @@
# GPTQ-for-LLaMA

**I am currently focusing on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and recommend using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) instead of GPTQ-for-LLaMa.**

<img src="https://user-images.githubusercontent.com/64115820/235287009-2d07bba8-9b85-4973-9e06-2a3c28777f06.png" width="50%" height="50%">

4-bit quantization of [LLaMA](https://arxiv.org/abs/2302.13971) using [GPTQ](https://arxiv.org/abs/2210.17323).

GPTQ is a state-of-the-art one-shot weight quantization method.

**This code can be used universally, but it is not the [fastest](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/old-cuda) and it only supports Linux.**

**Triton only supports Linux, so if you are a Windows user, please use [WSL2](https://learn.microsoft.com/en-us/windows/wsl/install).**

## News or Update
**AutoGPTQ-triton, a packaged version of GPTQ with Triton, has been integrated into [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).**
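
Since AutoGPTQ is the recommended path, here is a minimal sketch of loading an already-quantized GPTQ checkpoint with it and generating text. The directory name and prompt are placeholders, and the exact API can differ between AutoGPTQ versions, so treat this as an illustration rather than part of this repository.

```
# Hypothetical example: load a GPTQ-quantized LLaMA checkpoint with AutoGPTQ and generate.
# "path/to/llama7b-4bit-128g" is a placeholder directory, not something this repo ships.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "path/to/llama7b-4bit-128g"
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0", use_safetensors=True)

inputs = tokenizer("this is llama", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```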

## Result
<details>
<summary>LLaMA-7B (click me)</summary>

| [LLaMA-7B](https://arxiv.org/abs/2302.13971) | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
| -------------------------------------------- | ---- | ---------- | ------------ | --------- | -------------------- |
| FP16 | 16 | - | 13940 | 5.68 | 12.5 |
| RTN | 4 | - | - | 6.29 | - |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | - | 4740 | 6.09 | 3.5 |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | 128 | 4891 | 5.85 | 3.6 |
| RTN | 3 | - | - | 25.54 | - |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | - | 3852 | 8.07 | 2.7 |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | 128 | 4116 | 6.61 | 3.0 |

</details>

<details>
<summary>LLaMA-13B</summary>

| [LLaMA-13B](https://arxiv.org/abs/2302.13971) | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
| --------------------------------------------- | ---- | ---------- | ------------ | --------- | -------------------- |
| FP16 | 16 | - | OOM | 5.09 | 24.2 |
| RTN | 4 | - | - | 5.53 | - |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | - | 8410 | 5.36 | 6.5 |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | 128 | 8747 | 5.20 | 6.7 |
| RTN | 3 | - | - | 11.40 | - |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | - | 6870 | 6.63 | 5.1 |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | 128 | 7277 | 5.62 | 5.4 |

</details>

<details>
<summary>LLaMA-33B</summary>

| [LLaMA-33B](https://arxiv.org/abs/2302.13971) | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
| --------------------------------------------- | ---- | ---------- | ------------ | --------- | -------------------- |
| FP16 | 16 | - | OOM | 4.10 | 60.5 |
| RTN | 4 | - | - | 4.54 | - |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | - | 19493 | 4.45 | 15.7 |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | 128 | 20570 | 4.23 | 16.3 |
| RTN | 3 | - | - | 14.89 | - |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | - | 15493 | 5.69 | 12.0 |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | 128 | 16566 | 4.80 | 13.0 |

</details>

<details>
<summary>LLaMA-65B</summary>

| [LLaMA-65B](https://arxiv.org/abs/2302.13971) | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
| --------------------------------------------- | ---- | ---------- | ------------ | --------- | -------------------- |
| FP16 | 16 | - | OOM | 3.53 | 121.0 |
| RTN | 4 | - | - | 3.92 | - |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | - | OOM | 3.84 | 31.1 |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 4 | 128 | OOM | 3.65 | 32.3 |
| RTN | 3 | - | - | 10.59 | - |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | - | OOM | 5.04 | 23.6 |
| [GPTQ](https://arxiv.org/abs/2210.17323) | 3 | 128 | OOM | 4.17 | 25.6 |

</details>

Quantization requires a large amount of CPU memory. However, the required memory can be reduced by using swap space.

Depending on the GPUs and drivers, there may be a difference in performance, and the gap shrinks as the model size increases (see https://github.com/IST-DASLab/gptq/issues/1).

According to the [GPTQ paper](https://arxiv.org/abs/2210.17323), as the size of the model increases, the performance difference between FP16 and GPTQ decreases.

## GPTQ vs [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)

<details>
<summary>LLaMA-7B (click me)</summary>

| [LLaMA-7B (seqlen=2048)](https://arxiv.org/abs/2302.13971) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
| ----------------------------------------------------------- | --------------------- | ------------ | -------- |
| FP16 | 16 | 13948 | 5.22 |
| [GPTQ-128g](https://arxiv.org/abs/2210.17323) | 4.15 | 4781 | 5.30 |
| [nf4-double_quant](https://arxiv.org/abs/2305.14314) | 4.127 | 4804 | 5.30 |
| [nf4](https://arxiv.org/abs/2305.14314) | 4.5 | 5102 | 5.30 |
| [fp4](https://arxiv.org/abs/2212.09720) | 4.5 | 5102 | 5.33 |

</details>

<details>
<summary>LLaMA-13B</summary>

| [LLaMA-13B (seqlen=2048)](https://arxiv.org/abs/2302.13971) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
| ------------------------------------------------------------ | --------------------- | ------------ | -------- |
| FP16 | 16 | OOM | - |
| [GPTQ-128g](https://arxiv.org/abs/2210.17323) | 4.15 | 8589 | 5.02 |
| [nf4-double_quant](https://arxiv.org/abs/2305.14314) | 4.127 | 8581 | 5.04 |
| [nf4](https://arxiv.org/abs/2305.14314) | 4.5 | 9170 | 5.04 |
| [fp4](https://arxiv.org/abs/2212.09720) | 4.5 | 9170 | 5.11 |

</details>

<details>
<summary>LLaMA-33B</summary>

| [LLaMA-33B (seqlen=1024)](https://arxiv.org/abs/2302.13971) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
| ------------------------------------------------------------ | --------------------- | ------------ | -------- |
| FP16 | 16 | OOM | - |
| [GPTQ-128g](https://arxiv.org/abs/2210.17323) | 4.15 | 18441 | 3.71 |
| [nf4-double_quant](https://arxiv.org/abs/2305.14314) | 4.127 | 18313 | 3.76 |
| [nf4](https://arxiv.org/abs/2305.14314) | 4.5 | 19729 | 3.75 |
| [fp4](https://arxiv.org/abs/2212.09720) | 4.5 | 19729 | 3.75 |

</details>
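
For reference, the bitsandbytes NF4 rows above correspond to 4-bit loading through `transformers`. The sketch below assumes a local Hugging Face-format LLaMA checkpoint (placeholder path) and a `transformers`/`bitsandbytes` install recent enough to provide `BitsAndBytesConfig`, which is newer than the versions pinned in the Dependencies section below; it is a comparison aid, not part of this repository.

```
# Hypothetical comparison point: load a LLaMA checkpoint in 4-bit NF4 via bitsandbytes.
# "./llama-hf/7B" is a placeholder; double quantization matches the "nf4-double_quant" rows.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "./llama-hf/7B", quantization_config=bnb_config, device_map="auto"
)
```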

## Installation
If you don't have [conda](https://docs.conda.io/en/latest/miniconda.html), install it first.
```
conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Or, if you're having trouble with conda, use pip with python3.9:
# pip3 install torch torchvision torchaudio

git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt
```
## Dependencies

* `torch`: tested on v2.0.0+cu117
* `transformers`: tested on v4.28.0.dev0
* `datasets`: tested on v2.10.1
* `safetensors`: tested on v0.3.0

All experiments were run on a single NVIDIA RTX 3090.
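
As an optional sanity check after installation, you can print the installed versions of the packages listed above and confirm that PyTorch can see the GPU (only the package names come from this README; your exact version strings will differ from the tested ones).

```
# Optional environment check: report installed versions and CUDA visibility.
import torch
import transformers
import datasets
import safetensors

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("safetensors:", safetensors.__version__)
```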

# Language Generation
## LLaMA

```
# Convert LLaMA weights to the Hugging Face format
python convert_llama_weights_to_hf.py --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir ./llama-hf

# Benchmark language generation with 4-bit LLaMA-7B:

# Save the compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save llama7b-4bit-128g.pt

# Or save the compressed model as `.safetensors`
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors llama7b-4bit-128g.safetensors

# Benchmark generating a 2048-token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --benchmark 2048 --check

# Benchmark the FP16 baseline; note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py ${MODEL_DIR} c4 --benchmark 2048 --check

# Model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"

# Model inference with the saved model, using safetensors loaded directly to the GPU
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.safetensors --text "this is llama" --device=0

# Model inference with the saved model using CPU offloading (this is very slow)
CUDA_VISIBLE_DEVICES=0 python llama_inference_offload.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --pre_layer 16
# With pre_layer set to 50, it takes about 180 seconds to generate 45 tokens (from 5 to 50 tokens) with LLaMA-65B on a single RTX 3090.
```
In general, 4-bit quantization with a groupsize of 128 is recommended.
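
To make the `--wbits 4 --groupsize 128` setting concrete, here is a small, self-contained sketch of plain round-to-nearest 4-bit quantization with one scale/zero pair per group of 128 weights. It is only an illustration of the storage layout, not the GPTQ algorithm itself (GPTQ additionally compensates quantization error layer by layer), and the function name and tensor shapes are invented for the example.

```
# Illustrative only: round-to-nearest 4-bit quantization with groupsize 128.
# GPTQ's actual quantizer additionally minimizes layer output error; this sketch is not llama.py.
import torch

def rtn_quantize_4bit(w: torch.Tensor, groupsize: int = 128):
    g = w.reshape(-1, groupsize)                     # one scale/zero pair per group of 128 weights
    w_min = g.min(dim=1, keepdim=True).values
    w_max = g.max(dim=1, keepdim=True).values
    scale = (w_max - w_min) / 15.0                   # 4 bits -> 16 quantization levels
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(g / scale + zero), 0, 15)
    return q, scale, zero                            # dequantize with (q - zero) * scale

# Storage per group of 128 weights: 128 * 4-bit values + a 16-bit scale + a 4-bit zero point,
# i.e. roughly 4 + 20/128 ≈ 4.16 bits per weight -- close to the ~4.15 BPW in the tables above.
w = torch.randn(4096, 4096)
q, scale, zero = rtn_quantize_4bit(w)
w_hat = ((q - zero) * scale).reshape(w.shape)
print("mean abs quantization error:", (w_hat - w).abs().mean().item())
```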

You can also export the quantization parameters in toml+numpy format.
```
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --quant-directory ${TOML_DIR}
```

# Acknowledgements
This code is based on [GPTQ](https://github.com/IST-DASLab/gptq).

Thanks to Meta AI for releasing [LLaMA](https://arxiv.org/abs/2302.13971), a powerful LLM.

The Triton GPTQ kernel code is based on [GPTQ-triton](https://github.com/fpgaminer/GPTQ-triton).