# Importance Matrix Generation for Llama 3.2

## Purpose

This repository provides the files needed to run importance matrix (imatrix) generation on your workstation and produce optimized quantized models.

The calibration data includes a set focused on reasoning-based questions (mathematical problems, logical inference) to preserve model quality on reasoning tasks after quantization.

This repository includes:

- A base F16 GGUF model ready for processing
- The calibration datasets
## Setup

Install llama.cpp:

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j $(nproc)
```

Add the binaries to your PATH:

```shell
export PATH="$HOME/llama.cpp/build/bin:$PATH"
```
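Before moving on, a quick sanity check (a suggested step, not part of the repo) confirms that the two binaries this workflow needs actually resolve on your PATH:

```shell
# Suggested PATH sanity check (not from the repo scripts): verify that both
# binaries used below are reachable before running anything expensive.
for bin in llama-imatrix llama-quantize; do
  if command -v "$bin" >/dev/null 2>&1; then
    echo "$bin: found at $(command -v "$bin")"
  else
    echo "$bin: NOT FOUND - check your PATH" >&2
  fi
done
```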
## Workflow

### Step 1: Generate the Importance Matrix

```shell
llama-imatrix \
  -m llama-3.2-f16.gguf \
  -f calibration.txt \
  -o llama-3.2.imatrix \
  -t $(nproc)
```
### Step 2: Quantize the Model with the Generated Imatrix

```shell
./scripts/quantizer.sh
```

The script essentially runs the following command once for each of these quantization types: Q8_0, Q4_K_M, IQ4_XS, IQ3_S.

```shell
llama-quantize \
  --imatrix llama-3.2.imatrix \
  llama-3.2-f16.gguf \
  llama-3.2-Q4_K_M.gguf \
  Q4_K_M
```
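A minimal sketch of what such a script might look like (hypothetical; the actual scripts/quantizer.sh in the repo may differ): loop over the four types and derive one output filename per type. The leading `echo` makes it a dry run; remove it to actually quantize.

```shell
#!/usr/bin/env sh
# Hypothetical sketch of scripts/quantizer.sh: one llama-quantize call per
# quantization type. Printed as a dry run; drop "echo" to execute for real.
MODEL="llama-3.2-f16.gguf"
IMATRIX="llama-3.2.imatrix"
mkdir -p quantized

for TYPE in Q8_0 Q4_K_M IQ4_XS IQ3_S; do
  OUT="quantized/llama-3.2-${TYPE}.gguf"
  echo llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUT" "$TYPE"
done
```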
## Repository Contents

### Models

- llama-3.2-f16.gguf (6.0GB) - Base F16 model ready for imatrix generation
- quantized/ - Reference quantized models generated with imatrix optimization

### Calibration Datasets

- calibration.txt (138MB) - Dataset focused on Python and JavaScript code
- calibration_reasoning.txt (9.3KB) - Dataset focused on reasoning and logic
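If you want a single imatrix run to cover both domains, the two sets can simply be concatenated (the combined filename here is our choice for illustration, not a file in the repo):

```shell
# Combine the code-focused and reasoning-focused sets into one calibration
# file (calibration_combined.txt is a name chosen here for illustration).
cat calibration.txt calibration_reasoning.txt > calibration_combined.txt
wc -l calibration_combined.txt
```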
### Automation

- scripts/quantizer.sh - Automated script that runs multiple quantization types
## Common Issues

Binary not found: use the full path to the llama.cpp binaries:

```shell
~/llama.cpp/build/bin/llama-imatrix ...
```

Out of memory: reduce the number of chunks or GPU layers:

```shell
llama-imatrix -m llama-3.2-f16.gguf -f calibration.txt --chunks 50 --gpu-layers 30
```
## Expected Outputs

After running the imatrix generation and quantization workflow, you should have:

```
llama-3.2-exp/
├── llama-3.2-f16.gguf            # Original F16 model (6.0GB)
├── llama-3.2.imatrix             # Generated importance matrix file
├── quantized/                    # Quantized models folder
│   ├── llama-3.2-Q8_0.gguf       # Quantized with imatrix
│   ├── llama-3.2-Q4_K_M.gguf     # Quantized with imatrix (~1.9GB)
│   ├── llama-3.2-IQ4_XS.gguf     # Quantized with imatrix
│   └── llama-3.2-IQ3_S.gguf      # Quantized with imatrix
├── calibration.txt               # Programming-focused (Py/JS) dataset
├── calibration_reasoning.txt     # Reasoning-focused dataset
└── scripts/                      # Automation scripts
    └── quantizer.sh              # Quantization automation script
```
The .imatrix file contains importance weights that guide the quantization process to preserve model quality in reasoning tasks.
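One way to check that the imatrix-guided quantization actually preserved quality is to compare perplexity between the base model and a quantized one with llama.cpp's llama-perplexity tool. This is a sketch: wiki.test.raw is a placeholder for whatever held-out evaluation text you have, not a file in this repo.

```shell
# Hypothetical quality check (lower perplexity = better): run the same
# evaluation text through the F16 base and an imatrix-quantized model.
# wiki.test.raw stands in for your own held-out text file.
llama-perplexity -m llama-3.2-f16.gguf -f wiki.test.raw
llama-perplexity -m quantized/llama-3.2-Q4_K_M.gguf -f wiki.test.raw
```

A small perplexity gap between the two runs suggests the quantization lost little quality on that text.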
## Additional Resources

- llama.cpp Documentation: https://github.com/ggml-org/llama.cpp
- GGUF Format Specification: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
- Quantization Guide: https://github.com/ggml-org/llama.cpp/blob/master/examples/quantize/README.md
Model: davezaxh/llama-3.2-exp (base model: meta-llama/Llama-3.2-3B)