Upload README.md with huggingface_hub
README.md
ADDED
|
@@ -0,0 +1,159 @@
| 1 |
+
# Llama 3.2 Reasoning-Focused Quantization
|
| 2 |
+
|
| 3 |
+
A complete toolkit for quantizing Llama 3.2 models with a focus on reasoning tasks. This repository provides pre-quantized GGUF models, reasoning-focused calibration data, and automation scripts for the full quantization pipeline.
|
| 4 |
+
|
| 5 |
+
## Repository Contents
|
| 6 |
+
|
| 7 |
+
- **llama-3.2-f16.gguf** - Base F16 GGUF model (6.0GB) converted from safetensors
|
| 8 |
+
- **quantized/** - Pre-quantized models optimized for reasoning tasks:
|
| 9 |
+
  - Q8_0 (3.2GB) - Highest quality quantization
|
| 10 |
+
  - Q6_K (2.5GB) - Excellent quality, near F16 performance
|
| 11 |
+
  - Q5_K_M (2.2GB) - Better quality, good balance
|
| 12 |
+
  - Q4_K_M (1.9GB) - Recommended balance of size/quality
|
| 13 |
+
  - Q3_K_M (1.6GB) - Smaller footprint
|
| 14 |
+
  - IQ3_S-code (1.5GB) - Specialized for code tasks
|
| 15 |
+
- **scripts/quantizer.sh** - Automated quantization script
|
| 16 |
+
- **calibration_reasoning.txt** - Curated reasoning-focused calibration dataset
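
As a rough sanity check on the sizes listed above, the effective bits per weight of each file can be estimated from its size relative to the F16 base. The sketch below assumes that file size is dominated by the weights and that F16 stores 16 bits per weight. It ignores metadata and any tensors kept at higher precision, so the figures are approximate.

```python
# Sizes in GB, copied from the Repository Contents list above.
F16_SIZE_GB = 6.0
QUANT_SIZES_GB = {
    "Q8_0": 3.2,
    "Q6_K": 2.5,
    "Q5_K_M": 2.2,
    "Q4_K_M": 1.9,
    "Q3_K_M": 1.6,
    "IQ3_S-code": 1.5,
}


def effective_bits_per_weight(size_gb, f16_size_gb=F16_SIZE_GB):
    """Estimate bits per weight, assuming the F16 file uses 16 bits per weight."""
    return 16.0 * size_gb / f16_size_gb


for name, size in QUANT_SIZES_GB.items():
    bpw = effective_bits_per_weight(size)
    ratio = F16_SIZE_GB / size
    print(f"{name:11s} {size:.1f} GB  ~{bpw:.1f} bits/weight  {ratio:.1f}x smaller than F16")
```

The estimate is only meant for comparing the files in this repository against each other, not as a statement about the nominal bit width of each quantization type.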
|
| 17 |
+
|
| 18 |
+
## Purpose
|
| 19 |
+
|
| 20 |
+
This repository streamlines the process of creating reasoning-optimized quantized models using importance matrix calibration. The included calibration data focuses on mathematical reasoning, logical inference, and chain-of-thought problem solving to preserve model quality in reasoning-heavy tasks.
|
| 21 |
+
|
| 22 |
+
## Quick Start
|
| 23 |
+
|
| 24 |
+
### Option 1: Use Pre-Quantized Models (Recommended)
|
| 25 |
+
|
| 26 |
+
The `quantized/` directory contains ready-to-use models. Load any `.gguf` file with your preferred inference engine:
|
| 27 |
+
|
| 28 |
+
```bash
|
| 29 |
+
# Using llama.cpp
|
| 30 |
+
llama-cli -m quantized/llama-3.2-Q4_K_M.gguf -p "Solve this problem step by step..."
|
| 31 |
+
|
| 32 |
+
# Using ollama (import the GGUF first with a Modelfile and `ollama create`)
|
| 33 |
+
ollama run llama-3.2-Q4_K_M
|
| 34 |
+
```
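
One common way to make a local GGUF file available to Ollama is to import it through a Modelfile. The Python sketch below only renders the Modelfile text. The model name `llama-3.2-q4km` in the comments is a hypothetical name chosen for illustration, and the path assumes the repository layout described above.

```python
def render_modelfile(gguf_path: str) -> str:
    """Return a minimal Ollama Modelfile that points at a local GGUF file."""
    return f"FROM {gguf_path}\n"


modelfile = render_modelfile("./quantized/llama-3.2-Q4_K_M.gguf")
print(modelfile, end="")

# Save the printed text to a file named "Modelfile", then run
# (the model name here is an assumption, pick any name you like):
#   ollama create llama-3.2-q4km -f Modelfile
#   ollama run llama-3.2-q4km
```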
|
| 35 |
+
|
| 36 |
+
### Option 2: Quantize From Scratch
|
| 37 |
+
|
| 38 |
+
To create custom quantizations or use different calibration data, follow the steps below.
|
| 39 |
+
|
| 40 |
+
## Prerequisites
|
| 41 |
+
|
| 42 |
+
Install llama.cpp if not already available:
|
| 43 |
+
|
| 44 |
+
```bash
|
| 45 |
+
git clone https://github.com/ggml-org/llama.cpp
|
| 46 |
+
cd llama.cpp && cmake -B build && cmake --build build --config Release -j 8
|
| 47 |
+
```
|
| 48 |
+
|
| 49 |
+
Add to PATH (optional):
|
| 50 |
+
```bash
|
| 51 |
+
export PATH="$HOME/llama.cpp/build/bin:$PATH"
|
| 52 |
+
```
|
| 53 |
+
|
| 54 |
+
## Step 1: Generate Importance Matrix
|
| 55 |
+
|
| 56 |
+
```bash
|
| 57 |
+
llama-imatrix \
|
| 58 |
+
-m llama-3.2-f16.gguf \
|
| 59 |
+
-f calibration_reasoning.txt \
|
| 60 |
+
-o reasoning.imatrix
|
| 61 |
+
```
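
When this step is driven from Python rather than the shell, the same command can be assembled as an argument list. This is a minimal sketch: the function name `imatrix_command` is hypothetical, and the code only builds and prints the command without running it.

```python
import shlex


def imatrix_command(model, calibration, output, chunks=None):
    """Build the llama-imatrix argument list used in Step 1."""
    cmd = ["llama-imatrix", "-m", model, "-f", calibration, "-o", output]
    if chunks is not None:
        # --chunks limits how many calibration chunks are processed.
        cmd += ["--chunks", str(chunks)]
    return cmd


cmd = imatrix_command(
    "llama-3.2-f16.gguf",
    "calibration_reasoning.txt",
    "reasoning.imatrix",
    chunks=50,
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided `llama-imatrix` is on your `PATH`.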
|
| 62 |
+
|
| 63 |
+
## Step 2: Quantize Model
|
| 64 |
+
|
| 65 |
+
Use the provided script:
|
| 66 |
+
|
| 67 |
+
```bash
|
| 68 |
+
./scripts/quantizer.sh reasoning.imatrix llama-3.2-f16.gguf
|
| 69 |
+
```
|
| 70 |
+
|
| 71 |
+
Or manually:
|
| 72 |
+
|
| 73 |
+
```bash
|
| 74 |
+
llama-quantize \
|
| 75 |
+
--imatrix reasoning.imatrix \
|
| 76 |
+
llama-3.2-f16.gguf \
|
| 77 |
+
llama-3.2-Q4_K_M.gguf \
|
| 78 |
+
Q4_K_M
|
| 79 |
+
```
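
To produce several quantization types from one importance matrix, the manual command can be repeated per type. The sketch below is not the content of `scripts/quantizer.sh`; it is an illustration that builds one `llama-quantize` command per type. The function name and the `quantized/` output directory are assumptions that mirror the layout of this repository.

```python
from pathlib import Path


def quantize_commands(imatrix, f16_model, quant_types, out_dir="quantized"):
    """Build one llama-quantize argument list per requested type."""
    # "llama-3.2-f16.gguf" -> "llama-3.2"
    stem = Path(f16_model).name.removesuffix("-f16.gguf")
    commands = []
    for qtype in quant_types:
        output = f"{out_dir}/{stem}-{qtype}.gguf"
        commands.append(
            ["llama-quantize", "--imatrix", imatrix, f16_model, output, qtype]
        )
    return commands


for command in quantize_commands(
    "reasoning.imatrix", "llama-3.2-f16.gguf", ["Q4_K_M", "Q5_K_M", "Q6_K"]
):
    print(" ".join(command))
```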
|
| 80 |
+
|
| 81 |
+
## Calibration Data
|
| 82 |
+
|
| 83 |
+
The `calibration_reasoning.txt` file contains curated examples for:
|
| 84 |
+
- Mathematical word problems (GSM8k-style)
|
| 85 |
+
- Logical reasoning
|
| 86 |
+
- Multi-step problem solving
|
| 87 |
+
- Chain-of-thought reasoning
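
If you assemble your own calibration file, plain text is sufficient, since `llama-imatrix` reads the file passed with `-f` as raw text. The sketch below shows one possible way to join examples into such a file. The `Question:`/`Reasoning:` layout and the two sample items are illustrative assumptions, not the actual contents or format of `calibration_reasoning.txt`.

```python
def build_calibration_text(examples):
    """Join (question, worked_solution) pairs into one plain-text corpus."""
    blocks = [
        f"Question: {question.strip()}\nReasoning: {solution.strip()}"
        for question, solution in examples
    ]
    return "\n\n".join(blocks) + "\n"


# Hypothetical examples, written only for this illustration.
examples = [
    (
        "A shop sells pens at 3 for $2. How much do 12 pens cost?",
        "12 pens is 4 groups of 3. Each group costs $2, so the total is 4 * 2 = $8.",
    ),
    (
        "All squares are rectangles. Shape S is a square. Is S a rectangle?",
        "S is a square, and every square is a rectangle, so S is a rectangle.",
    ),
]

text = build_calibration_text(examples)
print(text)
```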
|
| 88 |
+
|
| 89 |
+
## Quantization Types
|
| 90 |
+
|
| 91 |
+
| Type | Size | Quality | Use Case |
|
| 92 |
+
|------|------|---------|----------|
|
| 93 |
+
| Q4_K_M | ~1.9GB | Good | Best balance |
|
| 94 |
+
| Q5_K_M | ~2.2GB | Better | Higher quality |
|
| 95 |
+
| Q6_K | ~2.5GB | Excellent | Near F16 quality |
|
| 96 |
+
| Q8_0 | ~3.2GB | Highest | Maximum quality |
|
| 97 |
+
|
| 98 |
+
## Common Issues
|
| 99 |
+
|
| 100 |
+
**Binary not found**: Use the full path to the llama.cpp binaries:
|
| 101 |
+
```bash
|
| 102 |
+
~/llama.cpp/build/bin/llama-imatrix ...
|
| 103 |
+
```
|
| 104 |
+
|
| 105 |
+
**Out of memory**: Reduce the number of calibration chunks processed, e.g. `--chunks 50`
|
| 106 |
+
|
| 107 |
+
## Setup Instructions
|
| 108 |
+
|
| 109 |
+
### 1. Clone and Setup
|
| 110 |
+
|
| 111 |
+
```bash
|
| 112 |
+
git clone <your-repo-url>
|
| 113 |
+
cd llama-3.2
|
| 114 |
+
```
|
| 115 |
+
|
| 116 |
+
### 2. Choose Your Workflow
|
| 117 |
+
|
| 118 |
+
**Using Pre-Quantized Models:**
|
| 119 |
+
- Models are in the `quantized/` directory
|
| 120 |
+
- Select based on your hardware and quality needs (see table above)
|
| 121 |
+
- Use with any GGUF-compatible inference engine
|
| 122 |
+
|
| 123 |
+
**Creating Custom Quantizations:**
|
| 124 |
+
1. Install llama.cpp (see Prerequisites)
|
| 125 |
+
2. Use the base F16 model: `llama-3.2-f16.gguf`
|
| 126 |
+
3. Generate importance matrix (optional - improves quality):
|
| 127 |
+
```bash
|
| 128 |
+
llama-imatrix -m llama-3.2-f16.gguf -f calibration_reasoning.txt -o custom.imatrix
|
| 129 |
+
```
|
| 130 |
+
4. Run quantization:
|
| 131 |
+
```bash
|
| 132 |
+
./scripts/quantizer.sh custom.imatrix llama-3.2-f16.gguf Q4_K_M
|
| 133 |
+
```
|
| 134 |
+
|
| 135 |
+
### 3. Inference
|
| 136 |
+
|
| 137 |
+
Use the quantized model with your preferred tool:
|
| 138 |
+
|
| 139 |
+
```bash
|
| 140 |
+
# llama.cpp
|
| 141 |
+
llama-cli -m quantized/llama-3.2-Q4_K_M.gguf -p "Your prompt here" -n 512
|
| 142 |
+
|
| 143 |
+
# With conversation mode
|
| 144 |
+
llama-cli -m quantized/llama-3.2-Q4_K_M.gguf -cnv
|
| 145 |
+
|
| 146 |
+
# Python (using llama-cpp-python)
|
| 147 |
+
python3 -c "from llama_cpp import Llama; llm = Llama(model_path='quantized/llama-3.2-Q4_K_M.gguf'); print(llm('Solve: 2+2=?', max_tokens=64)['choices'][0]['text'])"
|
| 148 |
+
```
|
| 149 |
+
|
| 150 |
+
## Model Selection Guide
|
| 151 |
+
|
| 152 |
+
- **Limited VRAM (<6GB)**: Use Q3_K_M or IQ3_S-code
|
| 153 |
+
- **Balanced (6-8GB)**: Use Q4_K_M (recommended)
|
| 154 |
+
- **High Quality (8-12GB)**: Use Q5_K_M or Q6_K
|
| 155 |
+
- **Maximum Quality (>12GB)**: Use Q8_0 or F16
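
The guide above can be expressed as a small lookup. This is a sketch under stated assumptions: the function name is hypothetical, it returns only the first option named in each tier, and the tier edges (exactly 8 GB and exactly 12 GB fall into the "High Quality" tier) are a choice made here, since the guide does not specify them.

```python
def recommend_quant(vram_gb: float) -> str:
    """Map available VRAM in GB to the first quantization named in each tier."""
    if vram_gb < 6:
        return "Q3_K_M"   # Limited VRAM (<6GB)
    if vram_gb < 8:
        return "Q4_K_M"   # Balanced (6-8GB)
    if vram_gb <= 12:
        return "Q5_K_M"   # High Quality (8-12GB)
    return "Q8_0"         # Maximum Quality (>12GB)


for vram in (4, 6, 10, 16):
    print(f"{vram} GB -> {recommend_quant(vram)}")
```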
|
| 156 |
+
|
| 157 |
+
## License
|
| 158 |
+
|
| 159 |
+
See LICENSE.txt for model license information.
|