---
sdk: static
pinned: false
---
# VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

## TL;DR

**Vector Post-Training Quantization (VPTQ)** is a novel post-training quantization method that leverages **vector quantization** to achieve high accuracy on LLMs at extremely low bit-widths (<2-bit). VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy.

* Better accuracy at 1-2 bits
* Lightweight quantization algorithm: takes only ~17 hours to quantize the 405B Llama-3.1
* Agile quantized inference: low decode overhead, best throughput and TTFT

## Details and [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extreme low-bit levels. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
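As a toy illustration of the lookup-table idea only (not the actual VPTQ algorithm, which is described in the tech report), the sketch below quantizes a random weight matrix by splitting its rows into short vectors, learning a small codebook with plain k-means, and storing one centroid index per vector; all sizes and names here are illustrative assumptions.

```python
# Toy vector quantization sketch: weights -> short vectors -> codebook indices.
# NOT the VPTQ algorithm itself; just the lookup-table compression idea.
import numpy as np

def build_codebook(vectors, k, iters=25, seed=0):
    """Plain k-means over the weight vectors; returns (codebook, indices)."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each vector to its nearest centroid
        dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(1)
        # move each centroid to the mean of its assigned vectors
        for c in range(k):
            members = vectors[idx == c]
            if len(members):
                codebook[c] = members.mean(0)
    return codebook, idx

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64)).astype(np.float32)  # a fake weight matrix
vec_len, k = 8, 256                                   # 8-dim vectors, 256 centroids
vectors = W.reshape(-1, vec_len)                      # 512 vectors

codebook, idx = build_codebook(vectors, k)
W_hat = codebook[idx].reshape(W.shape)                # dequantize = table lookup

# index cost only (ignores codebook storage): log2(256) bits / 8 weights = 1 bit
bits_per_weight = np.log2(k) / vec_len
rmse = float(np.sqrt(((W - W_hat) ** 2).mean()))
print(f"{bits_per_weight} bits/weight, rmse={rmse:.3f}")
```

Real systems amortize the codebook cost over the whole layer and use far more careful centroid fitting; this sketch only shows why indices into a lookup table can reach sub-2-bit storage per weight.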
## Early Results from Tech Report

VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes under reasonable parameters, especially in terms of model accuracy and inference speed.

<img src="assets/vptq.png" width="500">

| Model       | bitwidth | W2↓  | C4↓  | AvgQA↑ | tok/s↑ | mem (GB) | cost/h↓ |
| ----------- | -------- | ---- | ---- | ------ | ------ | -------- | ------- |
| LLaMA-2 7B  | 2.02     | 6.13 | 8.07 | 58.2   | 39.9   | 2.28     | 2       |
|             | 2.26     | 5.95 | 7.87 | 59.4   | 35.7   | 2.48     | 3.1     |
| LLaMA-2 13B | 2.02     | 5.32 | 7.15 | 62.4   | 26.9   | 4.03     | 3.2     |
|             | 2.18     | 5.28 | 7.04 | 63.1   | 18.5   | 4.31     | 3.6     |
| LLaMA-2 70B | 2.07     | 3.93 | 5.72 | 68.6   | 9.7    | 19.54    | 19      |
|             | 2.11     | 3.92 | 5.71 | 68.7   | 9.7    | 20.01    | 19      |
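A back-of-envelope check relates the bitwidth and memory columns: the quantized weights alone take roughly `n_params * bits / 8` bytes, and the gap to the reported mem (GB) figures is plausibly codebooks, embeddings, and runtime buffers (our reading, not an official breakdown).

```python
# Estimate the quantized-weight footprint from the table's bitwidth column.
def quantized_weight_gb(n_params: float, bitwidth: float) -> float:
    """Bytes for the packed weights only, in GB (decimal)."""
    return n_params * bitwidth / 8 / 1e9

print(round(quantized_weight_gb(7e9, 2.02), 2))   # 1.77 GB vs 2.28 GB reported
print(round(quantized_weight_gb(70e9, 2.07), 2))  # 18.11 GB vs 19.54 GB reported
```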
## Install and Evaluation

### Dependencies

- python 3.10+
- torch >= 2.2.0
- transformers >= 4.44.0
- accelerate >= 0.33.0
- latest datasets

### Installation

> Preparation step that might be needed: set up the CUDA path.

```bash
export PATH=/usr/local/cuda-12/bin/:$PATH  # adjust to your environment
```

```bash
pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation
```
### Language Generation

To generate text using a pre-trained quantized model, run:

```bash
python -m vptq --model=LLaMa-2-7b-1.5bi-vptq --prompt="Do Not Go Gentle into That Good Night"
```

Launching a chatbot (note that you must use a chat model for this to work):

```bash
python -m vptq --model=LLaMa-2-7b-chat-1.5b-vptq --chat
```

Using the Python API:

```python
import vptq
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("LLaMa-2-7b-1.5bi-vptq")
m = vptq.AutoModelForCausalLM.from_pretrained("LLaMa-2-7b-1.5bi-vptq", device_map="auto")

inputs = tokenizer("Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
### Gradio app example

An environment variable controls whether a public share link is created:

`export SHARE_LINK=1`

```bash
python -m vptq.app
```
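For reference, a minimal sketch of how a `SHARE_LINK`-style toggle is typically wired into a Gradio launch; the helper name below is hypothetical and the actual `vptq.app` implementation may differ.

```python
import os

def share_enabled(env=os.environ) -> bool:
    """Return True when SHARE_LINK=1 is set (hypothetical helper)."""
    return env.get("SHARE_LINK", "0") == "1"

# Typical use (sketch): a gr.Interface/Blocks object would be launched as
# demo.launch(share=share_enabled())
```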
## Road Map

- [ ] Merge the quantization algorithm into the public repository.
- [ ] Submit the VPTQ method to various inference frameworks (e.g., vLLM, llama.cpp).
- [ ] Improve the implementation of the inference kernel.
- [ ] **TBC**

## Project main members

* Yifei Liu (@lyf-00)
* Jicheng Wen (@wejoncy)
* Yang Wang (@YangWang92)
## Acknowledgement

* We thank **James Hensman** for his crucial insights into the error analysis related to Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research.
* We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.

## Publication

EMNLP 2024 Main

```bibtex
@inproceedings{vptq,
  title={VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models},
  author={Yifei Liu and Jicheng Wen and Yang Wang and Shengyu Ye and Li Lyna Zhang and Ting Cao and Cheng Li and Mao Yang},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```
## Limitation of VPTQ

* ⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before using it in production.
* ⚠️ This repository only provides the model quantization algorithm. The open-source community may provide models based on the technical report and quantization algorithm, but this project cannot guarantee the performance of those models.
* ⚠️ VPTQ has not been tested across all potential applications and domains, and we cannot guarantee its accuracy and effectiveness on other tasks or scenarios.
* ⚠️ Our tests are all based on English texts; other languages are not included in the current testing.
## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.