* Lightweight Quantization Algorithm: it takes only ~17 hours to quantize the 405B Llama-3.1 model
* Agile Quantization Inference: low decode overhead, best throughput, and TTFT

[**arXiv**](https://arxiv.org/abs/2409.17066): https://arxiv.org/abs/2409.17066

[**Models from Community**](https://huggingface.co/VPTQ-community): https://huggingface.co/VPTQ-community

[**GitHub**](https://github.com/microsoft/vptq): https://github.com/microsoft/vptq

Prompt example: Llama 3.1 70B on RTX 4090 (24 GB @ 2-bit)



Chat example: Llama 3.1 70B on RTX 4090 (24 GB @ 2-bit)



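The "24 GB @ 2-bit" figure above can be sanity-checked with quick arithmetic: at 2 bits per weight, a 70B-parameter model needs roughly 70e9 × 2 / 8 bytes ≈ 17.5 GB for the weights alone, which fits an RTX 4090's 24 GB. A back-of-envelope sketch (illustrative only; the real footprint also includes codebooks, any unquantized layers, and the KV cache):

```python
# Back-of-envelope memory estimate for weight-only quantization.
# Illustrative only: real footprints also include lookup tables,
# unquantized layers, and the KV cache.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = weight_memory_gb(70e9, 16)    # ~140 GB: far beyond a 24 GB GPU
two_bit = weight_memory_gb(70e9, 2)  # ~17.5 GB: fits on an RTX 4090
print(f"fp16: {fp16:.1f} GB, 2-bit: {two_bit:.1f} GB")
```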
## Details and [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)
Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Because LLM weights contain redundancy, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits), which reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to the limits of numerical representation, traditional scalar-based weight quantization struggles to reach such extreme low-bit levels. Recent work on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
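The core idea described above — replacing groups of weights with indices into a small lookup table (codebook) — can be sketched with a generic k-means vector quantizer. This is a minimal illustration of VQ in general, not VPTQ's actual algorithm; all names here are hypothetical:

```python
import numpy as np

# Minimal vector-quantization sketch (generic k-means codebook), NOT
# VPTQ's algorithm: each weight vector is stored as one low-bit index
# into a shared codebook instead of full-precision values.

rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 4))  # 1024 weight vectors of dim 4
k = 256                                   # 256 centroids -> 8-bit indices

# Crude codebook: sample k centroids, refine with a few k-means steps.
codebook = weights[rng.choice(len(weights), k, replace=False)].copy()
for _ in range(5):
    d = np.linalg.norm(weights[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)                # nearest-centroid assignment
    for c in range(k):
        members = weights[idx == c]
        if len(members):
            codebook[c] = members.mean(axis=0)

# "Quantized" model = 8-bit indices + a small fp16 codebook.
dequant = codebook[idx]                   # lookup-table reconstruction
bits_per_weight = (len(weights) * 8 + codebook.size * 16) / weights.size
print(f"~{bits_per_weight:.1f} bits/weight, "
      f"MSE {np.mean((weights - dequant) ** 2):.3f}")
```

Here 8-bit indices over dim-4 vectors amortize to 2 bits per weight; the codebook itself adds overhead that shrinks as the model grows, since it is shared across many more weights.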