# Importance Matrix Generation for Llama 3.2

## Purpose

This repository provides the files needed to run importance matrix (imatrix) generation on your workstation and produce optimized quantized models.

The calibration data includes a set focused on reasoning-based questions (mathematical problems, logical inference) to preserve model quality on reasoning tasks after quantization.

This repository includes:

- A base F16 GGUF model ready for processing
- The calibration datasets
## Setup

Install llama.cpp:

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j $(nproc)
```

Add the binaries to your PATH:

```shell
export PATH="$HOME/llama.cpp/build/bin:$PATH"
```
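Before moving on, a quick sanity check (a suggested step, not part of the repo) confirms that the two binaries this workflow needs actually resolve on your PATH:

```shell
# Suggested PATH sanity check (not from the repo scripts): verify that both
# binaries used below are reachable before running anything expensive.
for bin in llama-imatrix llama-quantize; do
  if command -v "$bin" >/dev/null 2>&1; then
    echo "$bin: found at $(command -v "$bin")"
  else
    echo "$bin: NOT FOUND - check your PATH" >&2
  fi
done
```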
## Workflow

### Step 1: Generate the Importance Matrix

```shell
llama-imatrix \
  -m llama-3.2-f16.gguf \
  -f calibration.txt \
  -o llama-3.2.imatrix \
  -t $(nproc)
```
### Step 2: Quantize the Model with the Generated Imatrix

```shell
./scripts/quantizer.sh
```

The script essentially runs the following command once for each of these quantization types: Q8_0, Q4_K_M, IQ4_XS, IQ3_S.

```shell
llama-quantize \
  --imatrix llama-3.2.imatrix \
  llama-3.2-f16.gguf \
  llama-3.2-Q4_K_M.gguf \
  Q4_K_M
```
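A minimal sketch of what such a script might look like (hypothetical; the actual scripts/quantizer.sh in the repo may differ): loop over the four types and derive one output filename per type. The leading `echo` makes it a dry run; remove it to actually quantize.

```shell
#!/usr/bin/env sh
# Hypothetical sketch of scripts/quantizer.sh: one llama-quantize call per
# quantization type. Printed as a dry run; drop "echo" to execute for real.
MODEL="llama-3.2-f16.gguf"
IMATRIX="llama-3.2.imatrix"
mkdir -p quantized

for TYPE in Q8_0 Q4_K_M IQ4_XS IQ3_S; do
  OUT="quantized/llama-3.2-${TYPE}.gguf"
  echo llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUT" "$TYPE"
done
```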
## Repository Contents

### Models

- llama-3.2-f16.gguf (6.0GB) - Base F16 model ready for imatrix generation
- quantized/ - Reference quantized models generated with imatrix optimization

### Calibration Datasets

- calibration.txt (138MB) - Dataset focused on Python and JavaScript code
- calibration_reasoning.txt (9.3KB) - Dataset focused on reasoning and logic
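If you want a single imatrix run to cover both domains, the two sets can simply be concatenated (the combined filename here is our choice for illustration, not a file in the repo):

```shell
# Combine the code-focused and reasoning-focused sets into one calibration
# file (calibration_combined.txt is a name chosen here for illustration).
cat calibration.txt calibration_reasoning.txt > calibration_combined.txt
wc -l calibration_combined.txt
```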
### Automation

- scripts/quantizer.sh - Automated script that runs multiple quantization types
## Common Issues

Binary not found: use the full path to the llama.cpp binaries:

```shell
~/llama.cpp/build/bin/llama-imatrix ...
```

Out of memory: reduce the number of chunks or GPU layers:

```shell
llama-imatrix -m llama-3.2-f16.gguf -f calibration.txt --chunks 50 --gpu-layers 30
```
## Expected Outputs

After running the imatrix generation and quantization workflow, you should have:

```
llama-3.2-exp/
├── llama-3.2-f16.gguf            # Original F16 model (6.0GB)
├── llama-3.2.imatrix             # Generated importance matrix file
├── quantized/                    # Quantized models folder
│   ├── llama-3.2-Q8_0.gguf       # Quantized with imatrix
│   ├── llama-3.2-Q4_K_M.gguf     # Quantized with imatrix (~1.9GB)
│   ├── llama-3.2-IQ4_XS.gguf     # Quantized with imatrix
│   └── llama-3.2-IQ3_S.gguf      # Quantized with imatrix
├── calibration.txt               # Programming-focused (Py/JS) dataset
├── calibration_reasoning.txt     # Reasoning-focused dataset
└── scripts/                      # Automation scripts
    └── quantizer.sh              # Quantization automation script
```
The .imatrix file contains importance weights that guide the quantization process to preserve model quality in reasoning tasks.
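One way to check that the imatrix-guided quantization actually preserved quality is to compare perplexity between the base model and a quantized one with llama.cpp's llama-perplexity tool. This is a sketch: wiki.test.raw is a placeholder for whatever held-out evaluation text you have, not a file in this repo.

```shell
# Hypothetical quality check (lower perplexity = better): run the same
# evaluation text through the F16 base and an imatrix-quantized model.
# wiki.test.raw stands in for your own held-out text file.
llama-perplexity -m llama-3.2-f16.gguf -f wiki.test.raw
llama-perplexity -m quantized/llama-3.2-Q4_K_M.gguf -f wiki.test.raw
```

A small perplexity gap between the two runs suggests the quantization lost little quality on that text.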
## Additional Resources

- llama.cpp Documentation: https://github.com/ggml-org/llama.cpp
- GGUF Format Specification: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
- Quantization Guide: https://github.com/ggml-org/llama.cpp/blob/master/examples/quantize/README.md
Model: davezaxh/llama-3.2-exp (base model: meta-llama/Llama-3.2-3B)