Importance Matrix Generation for Llama 3.2

Purpose

This repository provides the files needed to run importance matrix (imatrix) generation on your workstation and produce optimized quantized models.

It includes calibration data focused on reasoning-based questions (mathematical problems, logical inference) to preserve model quality on reasoning tasks after quantization.

  • Base F16 GGUF model ready for processing
  • Calibration datasets

Setup

Install llama.cpp:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j $(nproc)

Add to PATH:

export PATH="$HOME/llama.cpp/build/bin:$PATH"

Workflow

Step 1: Generate Importance Matrix

llama-imatrix \
  -m llama-3.2-f16.gguf \
  -f calibration.txt \
  -o llama-3.2.imatrix \
  -t $(nproc)

Step 2: Quantize Model with Generated Imatrix

./scripts/quantizer.sh

It essentially runs the following command once for each of these quantization types: Q8_0, Q4_K_M, IQ4_XS, IQ3_S

llama-quantize \
  --imatrix llama-3.2.imatrix \
  llama-3.2-f16.gguf \
  llama-3.2-Q4_K_M.gguf \
  Q4_K_M

Repository Contents

Models

  • llama-3.2-f16.gguf (6.0GB) - Base F16 model ready for imatrix generation
  • quantized/ - Reference quantized models generated with imatrix optimization

Calibration Datasets

  • calibration.txt (138MB) - dataset focused on Python and JavaScript code
  • calibration_reasoning.txt (9.3KB) - dataset focused on reasoning and logic

Automation

  • scripts/quantizer.sh - Automated script for running multiple types of quantizations

Common Issues

Binary not found: Use full path to llama.cpp binaries:

~/llama.cpp/build/bin/llama-imatrix ...

Out of memory: Reduce chunks or GPU layers:

llama-imatrix -m llama-3.2-f16.gguf -f calibration.txt --chunks 50 --gpu-layers 30

Expected Outputs

After running the imatrix generation and quantization workflow, you should have:

llama-3.2-exp/
├── llama-3.2-f16.gguf              # Original F16 model (6.0GB)
├── llama-3.2.imatrix               # Generated importance matrix file
├── quantized/                      # Quantized models folder
│   ├── llama-3.2-Q4_K_M.gguf      # Quantized with imatrix (~1.9GB)
│   ├── llama-3.2-Q5_K_M.gguf      # Quantized with imatrix (~2.2GB)
│   └── llama-3.2-Q6_K.gguf        # Quantized with imatrix (~2.5GB)
├── calibration.txt                 # Programming-focused (Py/JS) dataset
├── calibration_reasoning.txt       # Reasoning-focused dataset
└── scripts/                        # Automation scripts
    └── quantizer.sh                # Quantization automation script

The .imatrix file contains importance weights that guide the quantization process to preserve model quality in reasoning tasks.
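To sanity-check that quality survived quantization, llama.cpp's llama-perplexity tool can compare a quantized model against the F16 base on held-out text (the evaluation file name here is a placeholder, not something shipped in this repository):

```shell
# Compare perplexity of base vs. quantized model; lower is better, and a
# small gap means the quantization preserved quality. The eval file name
# is hypothetical -- substitute your own held-out text.
llama-perplexity -m llama-3.2-f16.gguf -f wiki.test.raw
llama-perplexity -m quantized/llama-3.2-Q4_K_M.gguf -f wiki.test.raw
```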
