davezaxh committed · Commit d1cc352 · verified · 1 Parent(s): 2bc5e64

Upload README.md with huggingface_hub

Files changed (1): README.md (+159 lines)
# Llama 3.2 Reasoning-Focused Quantization

A complete toolkit for quantizing Llama 3.2 models, optimized for reasoning tasks. This repository provides pre-quantized GGUF models, a reasoning-focused calibration dataset, and automation scripts for the full quantization pipeline.

## Repository Contents

- **llama-3.2-f16.gguf** - Base F16 GGUF model (6.0GB) converted from safetensors
- **quantized/** - Pre-quantized models optimized for reasoning tasks:
  - Q8_0 (3.2GB) - Highest quality quantization
  - Q6_K (2.5GB) - Excellent quality, near-F16 performance
  - Q5_K_M (2.2GB) - Better quality, good balance
  - Q4_K_M (1.9GB) - Recommended balance of size and quality
  - Q3_K_M (1.6GB) - Smaller footprint
  - IQ3_S-code (1.5GB) - Specialized for code tasks
- **scripts/quantizer.sh** - Automated quantization script
- **calibration_reasoning.txt** - Curated reasoning-focused calibration dataset

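The sizes above track the bits-per-weight of each quantization format fairly closely. As a rough sanity check, here is a sketch that estimates file size from parameter count; the ~3.2B parameter figure and the bits-per-weight values are approximations chosen for illustration, not numbers taken from this repository:

```python
# Rough GGUF size estimator. The bits-per-weight figures are approximate
# community numbers and the 3.2B parameter count is an assumption.
APPROX_BPW = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}

def estimate_size_gb(n_params: float, quant: str) -> float:
    """Estimate file size in GB from parameter count and quant type."""
    return n_params * APPROX_BPW[quant] / 8 / 1e9

if __name__ == "__main__":
    for q in APPROX_BPW:
        print(f"{q}: ~{estimate_size_gb(3.2e9, q):.1f} GB")
```

The F16 estimate lands close to the 6.0GB base model above; small deviations are expected because embeddings and some tensors are kept at higher precision.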
## Purpose

This repository streamlines the creation of reasoning-optimized quantized models using importance-matrix calibration. The included calibration data focuses on mathematical reasoning, logical inference, and chain-of-thought problem solving, preserving model quality on reasoning-heavy tasks.

## Quick Start

### Option 1: Use Pre-Quantized Models (Recommended)

The `quantized/` directory contains ready-to-use models. Simply load any `.gguf` file with your preferred inference engine:

```bash
# Using llama.cpp
llama-cli -m quantized/llama-3.2-Q4_K_M.gguf -p "Solve this problem step by step..."

# Using ollama (import the GGUF via a Modelfile first)
ollama run llama-3.2-Q4_K_M
```

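ollama does not load a bare GGUF directly; it needs a Modelfile pointing at the file. A minimal sketch, assuming the model name and path used above:

```
FROM ./quantized/llama-3.2-Q4_K_M.gguf
```

Save this as `Modelfile`, register it with `ollama create llama-3.2-Q4_K_M -f Modelfile`, and then `ollama run` works as shown.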
### Option 2: Quantize From Scratch

If you want to create custom quantizations or use different calibration data:

## Prerequisites

Install llama.cpp if it is not already available:

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j 8
```

Add the binaries to your PATH (optional):

```bash
export PATH="$HOME/llama.cpp/build/bin:$PATH"
```

## Step 1: Generate Importance Matrix

```bash
llama-imatrix \
  -m llama-3.2-f16.gguf \
  -f calibration_reasoning.txt \
  -o reasoning.imatrix
```

## Step 2: Quantize Model

Use the provided script:

```bash
./scripts/quantizer.sh reasoning.imatrix llama-3.2-f16.gguf
```

Or run llama-quantize manually:

```bash
llama-quantize \
  --imatrix reasoning.imatrix \
  llama-3.2-f16.gguf \
  llama-3.2-Q4_K_M.gguf \
  Q4_K_M
```

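To produce several quantization levels in one pass, the manual command can be wrapped in a loop. This sketch only prints each command (a dry run); remove the `echo` to actually execute them:

```shell
# Print one llama-quantize command per target type (dry run).
gen_quant_cmds() {
  for q in Q4_K_M Q5_K_M Q6_K Q8_0; do
    echo llama-quantize --imatrix reasoning.imatrix \
      llama-3.2-f16.gguf "llama-3.2-${q}.gguf" "$q"
  done
}
gen_quant_cmds
```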
## Calibration Data

The `calibration_reasoning.txt` file contains curated examples covering:
- Mathematical word problems (GSM8K-style)
- Logical reasoning
- Multi-step problem solving
- Chain-of-thought reasoning

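`llama-imatrix` reads the calibration file as plain text, so extending it is just a matter of appending more examples. A sketch of assembling such a file; the two sample problems are illustrative placeholders, not entries from the repository's dataset:

```python
# Assemble a plain-text calibration file from reasoning-style samples.
# These samples are illustrative placeholders, not the actual dataset.
samples = [
    "Q: A shop sells pens at 3 for $2. How much do 12 pens cost?\n"
    "A: 12 pens is 4 groups of 3, and 4 * $2 = $8.",
    "Q: All cats are mammals. Tom is a cat. Is Tom a mammal?\n"
    "A: Yes: Tom is a cat, every cat is a mammal, so Tom is a mammal.",
]

# Blank lines between samples keep the examples visually separated.
with open("calibration_sample.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(samples) + "\n")
```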
## Quantization Types

| Type   | Size   | Quality   | Use Case          |
|--------|--------|-----------|-------------------|
| Q4_K_M | ~1.9GB | Good      | Best balance      |
| Q5_K_M | ~2.2GB | Better    | Higher quality    |
| Q6_K   | ~2.5GB | Excellent | Near-F16 quality  |
| Q8_0   | ~3.2GB | Highest   | Maximum quality   |

## Common Issues

**Binary not found**: use the full path to the llama.cpp binaries:
```bash
~/llama.cpp/build/bin/llama-imatrix ...
```

**Out of memory**: limit the number of calibration chunks processed, e.g. `--chunks 50`

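If a model fails to load at all, it is also worth confirming the file really is a GGUF container (an interrupted download is a common cause). GGUF files start with the 4-byte magic `GGUF` followed by a little-endian uint32 format version; a minimal check in Python:

```python
# Inspect the GGUF header: 4-byte magic "GGUF", then a little-endian
# uint32 format version.
import struct

def is_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

def gguf_version(path: str) -> int:
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        return struct.unpack("<I", f.read(4))[0]
```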
## Setup Instructions

### 1. Clone and Setup

```bash
git clone <your-repo-url>
cd llama-3.2
```

### 2. Choose Your Workflow

**Using Pre-Quantized Models:**
- Models are in the `quantized/` directory
- Select one based on your hardware and quality needs (see the table above)
- Use with any GGUF-compatible inference engine

**Creating Custom Quantizations:**
1. Install llama.cpp (see Prerequisites)
2. Use the base F16 model: `llama-3.2-f16.gguf`
3. Generate an importance matrix (optional, but improves quality):
   ```bash
   llama-imatrix -m llama-3.2-f16.gguf -f calibration_reasoning.txt -o custom.imatrix
   ```
4. Run the quantization:
   ```bash
   ./scripts/quantizer.sh custom.imatrix llama-3.2-f16.gguf Q4_K_M
   ```

### 3. Inference

Use the quantized model with your preferred tool:

```bash
# llama.cpp
llama-cli -m quantized/llama-3.2-Q4_K_M.gguf -p "Your prompt here" -n 512

# Conversation mode
llama-cli -m quantized/llama-3.2-Q4_K_M.gguf --interactive

# Python (using llama-cpp-python)
python3 -c "from llama_cpp import Llama; llm = Llama(model_path='quantized/llama-3.2-Q4_K_M.gguf'); print(llm('Solve: 2+2=?'))"
```

## Model Selection Guide

- **Limited VRAM (<6GB)**: Q3_K_M or IQ3_S-code
- **Balanced (6-8GB)**: Q4_K_M (recommended)
- **High quality (8-12GB)**: Q5_K_M or Q6_K
- **Maximum quality (>12GB)**: Q8_0 or F16

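For scripting the choice, the guide above can be condensed into a small helper; the function name and the exact cutoff behavior at the boundaries are our own, mirroring the list:

```python
# Map available VRAM (GB) to a suggested quantization, per the guide above.
def recommend_quant(vram_gb: float) -> str:
    if vram_gb < 6:
        return "Q3_K_M"   # or IQ3_S-code for code-heavy workloads
    if vram_gb <= 8:
        return "Q4_K_M"   # recommended balance
    if vram_gb <= 12:
        return "Q6_K"     # or Q5_K_M for a slightly smaller file
    return "Q8_0"         # or F16 for maximum quality
```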
## License

See LICENSE.txt for model license information.