davezaxh committed · Commit d1cc352 · verified · 1 Parent(s): 2bc5e64

Upload README.md with huggingface_hub

Files changed (1): README.md (+159 lines)
# Llama 3.2 Reasoning-Focused Quantization

A complete toolkit for quantizing Llama 3.2 models, optimized for reasoning tasks. This repository provides pre-quantized GGUF models, a reasoning-focused calibration dataset, and automation scripts for the full quantization pipeline.

## Repository Contents

- **llama-3.2-f16.gguf** - Base F16 GGUF model (6.0GB) converted from safetensors
- **quantized/** - Pre-quantized models optimized for reasoning tasks:
  - Q8_0 (3.2GB) - Highest quality quantization
  - Q6_K (2.5GB) - Excellent quality, near-F16 performance
  - Q5_K_M (2.2GB) - Better quality, good balance
  - Q4_K_M (1.9GB) - Recommended balance of size and quality
  - Q3_K_M (1.6GB) - Smaller footprint
  - IQ3_S-code (1.5GB) - Specialized for code tasks
- **scripts/quantizer.sh** - Automated quantization script
- **calibration_reasoning.txt** - Curated reasoning-focused calibration dataset

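The sizes above track the bits-per-weight of each quantization format fairly closely. As a rough sanity check, here is a sketch that estimates file size from parameter count; the ~3.2B parameter figure and the bits-per-weight values are approximations chosen for illustration, not numbers taken from this repository:

```python
# Rough GGUF size estimator. The bits-per-weight figures are approximate
# community numbers and the 3.2B parameter count is an assumption.
APPROX_BPW = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}

def estimate_size_gb(n_params: float, quant: str) -> float:
    """Estimate file size in GB from parameter count and quant type."""
    return n_params * APPROX_BPW[quant] / 8 / 1e9

if __name__ == "__main__":
    for q in APPROX_BPW:
        print(f"{q}: ~{estimate_size_gb(3.2e9, q):.1f} GB")
```

The F16 estimate lands close to the 6.0GB base model above; small deviations are expected because embeddings and some tensors are kept at higher precision.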
## Purpose

This repository streamlines the creation of reasoning-optimized quantized models using importance-matrix calibration. The included calibration data focuses on mathematical reasoning, logical inference, and chain-of-thought problem solving, preserving model quality on reasoning-heavy tasks.

## Quick Start

### Option 1: Use Pre-Quantized Models (Recommended)

The `quantized/` directory contains ready-to-use models. Simply load any `.gguf` file with your preferred inference engine:

```bash
# Using llama.cpp
llama-cli -m quantized/llama-3.2-Q4_K_M.gguf -p "Solve this problem step by step..."

# Using ollama (import the GGUF via a Modelfile first)
ollama run llama-3.2-Q4_K_M
```

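ollama does not load a bare GGUF directly; it needs a Modelfile pointing at the file. A minimal sketch, assuming the model name and path used above:

```
FROM ./quantized/llama-3.2-Q4_K_M.gguf
```

Save this as `Modelfile`, register it with `ollama create llama-3.2-Q4_K_M -f Modelfile`, and then `ollama run` works as shown.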
### Option 2: Quantize From Scratch

If you want to create custom quantizations or use different calibration data:

## Prerequisites

Install llama.cpp if it is not already available:

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j 8
```

Add the binaries to your PATH (optional):

```bash
export PATH="$HOME/llama.cpp/build/bin:$PATH"
```

## Step 1: Generate Importance Matrix

```bash
llama-imatrix \
  -m llama-3.2-f16.gguf \
  -f calibration_reasoning.txt \
  -o reasoning.imatrix
```

## Step 2: Quantize Model

Use the provided script:

```bash
./scripts/quantizer.sh reasoning.imatrix llama-3.2-f16.gguf
```

Or run llama-quantize manually:

```bash
llama-quantize \
  --imatrix reasoning.imatrix \
  llama-3.2-f16.gguf \
  llama-3.2-Q4_K_M.gguf \
  Q4_K_M
```

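To produce several quantization levels in one pass, the manual command can be wrapped in a loop. This sketch only prints each command (a dry run); remove the `echo` to actually execute them:

```shell
# Print one llama-quantize command per target type (dry run).
gen_quant_cmds() {
  for q in Q4_K_M Q5_K_M Q6_K Q8_0; do
    echo llama-quantize --imatrix reasoning.imatrix \
      llama-3.2-f16.gguf "llama-3.2-${q}.gguf" "$q"
  done
}
gen_quant_cmds
```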
## Calibration Data

The `calibration_reasoning.txt` file contains curated examples covering:
- Mathematical word problems (GSM8K-style)
- Logical reasoning
- Multi-step problem solving
- Chain-of-thought reasoning

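`llama-imatrix` reads the calibration file as plain text, so extending it is just a matter of appending more examples. A sketch of assembling such a file; the two sample problems are illustrative placeholders, not entries from the repository's dataset:

```python
# Assemble a plain-text calibration file from reasoning-style samples.
# These samples are illustrative placeholders, not the actual dataset.
samples = [
    "Q: A shop sells pens at 3 for $2. How much do 12 pens cost?\n"
    "A: 12 pens is 4 groups of 3, and 4 * $2 = $8.",
    "Q: All cats are mammals. Tom is a cat. Is Tom a mammal?\n"
    "A: Yes: Tom is a cat, every cat is a mammal, so Tom is a mammal.",
]

# Blank lines between samples keep the examples visually separated.
with open("calibration_sample.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(samples) + "\n")
```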
## Quantization Types

| Type   | Size   | Quality   | Use Case          |
|--------|--------|-----------|-------------------|
| Q4_K_M | ~1.9GB | Good      | Best balance      |
| Q5_K_M | ~2.2GB | Better    | Higher quality    |
| Q6_K   | ~2.5GB | Excellent | Near-F16 quality  |
| Q8_0   | ~3.2GB | Highest   | Maximum quality   |

## Common Issues

**Binary not found**: use the full path to the llama.cpp binaries:
```bash
~/llama.cpp/build/bin/llama-imatrix ...
```

**Out of memory**: limit the number of calibration chunks processed, e.g. `--chunks 50`

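If a model fails to load at all, it is also worth confirming the file really is a GGUF container (an interrupted download is a common cause). GGUF files start with the 4-byte magic `GGUF` followed by a little-endian uint32 format version; a minimal check in Python:

```python
# Inspect the GGUF header: 4-byte magic "GGUF", then a little-endian
# uint32 format version.
import struct

def is_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

def gguf_version(path: str) -> int:
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        return struct.unpack("<I", f.read(4))[0]
```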
## Setup Instructions

### 1. Clone and Setup

```bash
git clone <your-repo-url>
cd llama-3.2
```

### 2. Choose Your Workflow

**Using Pre-Quantized Models:**
- Models are in the `quantized/` directory
- Select one based on your hardware and quality needs (see the table above)
- Use with any GGUF-compatible inference engine

**Creating Custom Quantizations:**
1. Install llama.cpp (see Prerequisites)
2. Use the base F16 model: `llama-3.2-f16.gguf`
3. Generate an importance matrix (optional, but improves quality):
   ```bash
   llama-imatrix -m llama-3.2-f16.gguf -f calibration_reasoning.txt -o custom.imatrix
   ```
4. Run the quantization:
   ```bash
   ./scripts/quantizer.sh custom.imatrix llama-3.2-f16.gguf Q4_K_M
   ```

### 3. Inference

Use the quantized model with your preferred tool:

```bash
# llama.cpp
llama-cli -m quantized/llama-3.2-Q4_K_M.gguf -p "Your prompt here" -n 512

# Conversation mode
llama-cli -m quantized/llama-3.2-Q4_K_M.gguf --interactive

# Python (using llama-cpp-python)
python3 -c "from llama_cpp import Llama; llm = Llama(model_path='quantized/llama-3.2-Q4_K_M.gguf'); print(llm('Solve: 2+2=?'))"
```

## Model Selection Guide

- **Limited VRAM (<6GB)**: Q3_K_M or IQ3_S-code
- **Balanced (6-8GB)**: Q4_K_M (recommended)
- **High quality (8-12GB)**: Q5_K_M or Q6_K
- **Maximum quality (>12GB)**: Q8_0 or F16

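For scripting the choice, the guide above can be condensed into a small helper; the function name and the exact cutoff behavior at the boundaries are our own, mirroring the list:

```python
# Map available VRAM (GB) to a suggested quantization, per the guide above.
def recommend_quant(vram_gb: float) -> str:
    if vram_gb < 6:
        return "Q3_K_M"   # or IQ3_S-code for code-heavy workloads
    if vram_gb <= 8:
        return "Q4_K_M"   # recommended balance
    if vram_gb <= 12:
        return "Q6_K"     # or Q5_K_M for a slightly smaller file
    return "Q8_0"         # or F16 for maximum quality
```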
## License

See LICENSE.txt for model license information.