---
library_name: gguf
tags:
- llama
- quantized
- gptq
- evopress
model_type: llama
base_model: meta-llama/Llama-3.2-1B-Instruct
---

# Llama-3.2-1B-Instruct GGUF DASLab Quantization

This repository contains quantized GGUF versions of Llama 3.2 1B Instruct, produced with **GPTQ quantization** and **GPTQ+EvoPress optimization** from the [DASLab GGUF Toolkit](https://github.com/IST-DASLab/gguf-toolkit).

## Models

- **GPTQ Uniform**: GPTQ quantization at a uniform bitwidth, from 2 to 6 bits
- **GPTQ+EvoPress**: Non-uniform per-layer quantization, with the bitwidth assignment found by evolutionary search

## Performance

Our GPTQ-based quantization methods achieve **better quality-compression tradeoffs** than standard quantization:

- **Better perplexity** at equivalent bitwidths than naive quantization approaches
- **Error-correcting updates** during calibration for improved accuracy
- **Optimized configurations** that allocate bits according to layer sensitivity (EvoPress)

## Usage

The files work with llama.cpp and other inference engines that support GGUF. No special setup is required.

**Full documentation, evaluation results, and toolkit source**: https://github.com/IST-DASLab/gguf-toolkit

---
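As an illustration of the llama.cpp usage mentioned above, the following minimal Python sketch assembles a one-shot generation command for the `llama-cli` binary. It rests on several assumptions: that `llama-cli` from llama.cpp is installed and on your `PATH`, and that `model.gguf` stands in for a file downloaded from this repository (the name is hypothetical, as is the helper `build_llama_cli_command`). The `-m`, `-p`, and `-n` flags are llama.cpp options, but behavior and defaults can differ between llama.cpp versions, so check `llama-cli --help` for your build.

```python
import shlex


def build_llama_cli_command(model_path, prompt, n_predict=128):
    """Return the argument list for a one-shot llama.cpp generation run."""
    return [
        "llama-cli",            # llama.cpp command-line binary, assumed to be on PATH
        "-m", str(model_path),  # path to the GGUF model file
        "-p", prompt,           # prompt text
        "-n", str(n_predict),   # maximum number of tokens to generate
    ]


# Hypothetical file name: replace it with a .gguf file downloaded from this repository.
command = build_llama_cli_command(
    "model.gguf", "Explain quantization in one sentence.", n_predict=64
)

# Print the command in a form that can be pasted into a shell.
print(shlex.join(command))

# To actually run it (requires llama.cpp and the model file):
#   import subprocess
#   subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when the prompt contains spaces or quotes.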