Quantization Notes

Overview

This model was quantized using The Geeked Out Quantizer, a specialized Windows-native quantization environment designed for extreme compression with quality preservation.

Quantization Specifications

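| Parameter | Details |
|---|---|
| Source Format | BF16 (bfloat16) or F16 (float16) |
| Target Format | IQ2_M (about 2.7 bits per weight on average, as listed by llama.cpp) |
| Compression Ratio | Roughly 12x smaller than an FP32 baseline (about 6x smaller than the 16-bit source) |
| Quantization Method | Importance-aware quantization with an importance matrix (imatrix) |
| Quality Metric | ~3-8% perplexity increase vs. baseline |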
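
The compression ratio follows directly from the bits stored per weight. The sketch below works through that arithmetic. The 2.7 bits-per-weight figure is an assumption taken from the value llama.cpp lists for IQ2_M, and the 7B parameter count is purely illustrative. At exactly 2.0 bits per weight the ratio against FP32 would be 16x.

```python
def compression_ratio(source_bits: float, target_bits: float) -> float:
    """Return how many times smaller the target encoding is than the source."""
    return source_bits / target_bits


def estimated_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Estimate weight storage in GiB, ignoring metadata and tensors kept at higher precision."""
    return n_params * bits_per_weight / 8 / 2**30


# Assumed figure: llama.cpp lists IQ2_M at about 2.7 bits per weight.
IQ2_M_BPW = 2.7

ratio_vs_fp32 = compression_ratio(32, IQ2_M_BPW)
ratio_vs_bf16 = compression_ratio(16, IQ2_M_BPW)
size_7b = estimated_size_gib(7e9, IQ2_M_BPW)  # hypothetical 7B-parameter model

print(f"vs FP32: {ratio_vs_fp32:.1f}x, vs BF16: {ratio_vs_bf16:.1f}x, 7B model: {size_7b:.2f} GiB")
```

Real files usually come out somewhat larger than this estimate, since some tensors are typically stored at higher precision.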

The Importance Matrix (IMatrix) Method

What is an Importance Matrix?

An importance matrix is a set of statistics, gathered by running calibration text through the model, that estimates which weights contribute most to output quality. Rather than treating every weight in a tensor as equally important during quantization, this method:

  • Preserves precision on high-impact weights
  • Aggressively compresses low-impact weights
  • Maintains information flow through the network architecture
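
The idea in the list above can be illustrated with a toy example: when choosing a quantization scale for a block of weights, the rounding error of each weight is multiplied by its importance, so the chosen scale favors the weights that matter. This is a deliberately simplified sketch. The four-level grid, the candidate scales, and the importance values are all hypothetical, and the real IQ2_M format uses a more elaborate codebook scheme.

```python
def quantize_block(weights, scale, levels=(-2, -1, 0, 1)):
    """Round each weight to the nearest representable value scale * level."""
    return [scale * min(levels, key=lambda q: abs(w - scale * q)) for w in weights]


def weighted_error(weights, dequantized, importance):
    """Sum of squared errors, each term scaled by that weight's importance."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, candidates):
    """Pick the candidate scale with the lowest importance-weighted error."""
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize_block(weights, s), importance),
    )


weights = [0.9, -0.1, 0.05, -1.6, 0.3, 0.02, -0.4, 0.7]
# Hypothetical importance values; a real imatrix derives them from calibration data.
importance = [5.0, 0.1, 0.1, 0.2, 4.0, 0.1, 0.1, 3.0]
uniform = [1.0] * len(weights)
candidates = [0.1 * k for k in range(1, 21)]

scale_uniform = best_scale(weights, uniform, candidates)
scale_aware = best_scale(weights, importance, candidates)

# Score both choices with the importance-weighted error.
err_uniform = weighted_error(weights, quantize_block(weights, scale_uniform), importance)
err_aware = weighted_error(weights, quantize_block(weights, scale_aware), importance)
print(f"uniform scale {scale_uniform:.1f}: weighted error {err_uniform:.4f}")
print(f"importance-aware scale {scale_aware:.1f}: weighted error {err_aware:.4f}")
```

Because both searches use the same candidate scales, the importance-aware choice can never score worse on the weighted error than the uniform one.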

Why It Matters

Traditional uniform quantization to 2-bit precision typically causes 10-20% quality degradation. The importance matrix approach reduces this to 3-8%, making 2-bit models viable for production use.

Calibration Process

Data Selection

The importance matrix is generated using carefully selected calibration data that:

  • Represents the model's intended use domain
  • Contains diverse vocabulary and sentence structures
  • Includes 100-500 text chunks of typical prompt length
  • Matches the distribution of expected inference inputs
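
As a rough illustration of preparing such data, the sketch below splits raw text into equal-length chunks and caps the total count. It counts whitespace-separated words as a stand-in for tokens, and the function name and default sizes are hypothetical rather than taken from the quantizer itself.

```python
def make_chunks(text: str, words_per_chunk: int = 128, max_chunks: int = 500) -> list[str]:
    """Split text into consecutive chunks with the same number of words."""
    words = text.split()
    chunks = [
        " ".join(words[start:start + words_per_chunk])
        for start in range(0, len(words), words_per_chunk)
    ]
    # Drop a short trailing chunk so every chunk has the same length.
    if chunks and len(chunks[-1].split()) < words_per_chunk:
        chunks.pop()
    return chunks[:max_chunks]


sample = "word " * 1000  # stand-in for real, domain-representative text
chunks = make_chunks(sample, words_per_chunk=128, max_chunks=500)
print(len(chunks), "chunks of", len(chunks[0].split()), "words")
```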

Generation Parameters

| Setting | Typical Value | Purpose |
|---|---|---|
| Chunks | 200-500 | Balance quality vs. generation time |
| GPU Layers | 99 (max) | Accelerate processing via CUDA |
| Thread Count | Auto-detected | Optimize for hardware configuration |
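
For readers reproducing this step with stock llama.cpp, the settings above map onto an invocation of its `llama-imatrix` tool along the following lines. This is a sketch under the assumption that a llama.cpp-style toolchain is used; the file names are placeholders, the helper function is hypothetical, and the flag names should be checked against `llama-imatrix --help` for the installed build. The command is only constructed here, not executed.

```python
import os


def imatrix_command(model_path, calibration_path, output_path,
                    chunks=200, gpu_layers=99, threads=None):
    """Build an argument list for llama.cpp's llama-imatrix tool (not executed here)."""
    if threads is None:
        threads = os.cpu_count() or 1  # fall back to 1 if detection fails
    return [
        "llama-imatrix",
        "-m", model_path,          # source model in GGUF format
        "-f", calibration_path,    # calibration text file
        "-o", output_path,         # where to write the importance matrix
        "--chunks", str(chunks),   # maximum number of chunks to process
        "-ngl", str(gpu_layers),   # number of layers to offload to the GPU
        "-t", str(threads),        # CPU threads
    ]


cmd = imatrix_command("OriginalModel-BF16.gguf", "calibration.txt", "imatrix.dat", chunks=300)
print(" ".join(cmd))
```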

Memory & Hardware Optimization

The quantization process includes:

  • Dynamic memory management: Reserves system RAM to maintain Windows responsiveness
  • Hardware detection: Automatically detects CPU cores, memory type (DDR4/DDR5), and GPU capabilities
  • Thread optimization: Adjusts parallelism based on available resources
  • Retry logic: Handles transient memory pressure gracefully
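
The thread and retry behavior described above might look roughly like the following. This is a minimal sketch, not the quantizer's actual implementation: the function names, the number of reserved cores, and the use of `MemoryError` as the signal for memory pressure are all assumptions.

```python
import os


def pick_thread_count(reserved_cores: int = 2) -> int:
    """Use all detected cores except a few left free for the operating system."""
    detected = os.cpu_count() or 1
    return max(1, detected - reserved_cores)


def run_with_retry(task, attempts: int = 3, on_retry=None):
    """Call task(); on MemoryError, invoke on_retry and try again up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except MemoryError:
            if attempt == attempts:
                raise  # out of attempts, surface the error
            if on_retry is not None:
                on_retry(attempt)  # e.g. free caches or wait before retrying


# Simulated task that fails twice under memory pressure, then succeeds.
calls = {"n": 0}


def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise MemoryError("simulated transient memory pressure")
    return "done"


result = run_with_retry(flaky_task, attempts=3)
print(result, "after", calls["n"], "attempts, using", pick_thread_count(), "threads")
```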

Model Selection Criteria

Source models are selected according to the following precision hierarchy:

  1. BF16 (preferred): Best precision for quantization
  2. F16: Good precision, widely available
  3. F32: Acceptable but creates larger intermediate files

Models already in quantized formats are skipped unless explicitly re-quantizing.
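
The selection rule can be sketched as a small function. The example assumes the precision appears as a tag in the filename, as in the naming examples elsewhere in these notes; the function names and the regular expression are hypothetical.

```python
import re

# Lower rank means higher preference, mirroring the hierarchy above.
SOURCE_PRIORITY = {"BF16": 0, "F16": 1, "F32": 2}

# Match a precision tag that is not part of a longer alphanumeric token.
PRECISION_TAG = re.compile(r"(?<![A-Za-z0-9])(BF16|F16|F32)(?![A-Za-z0-9])", re.IGNORECASE)


def detect_precision(filename: str):
    """Return BF16, F16, or F32 if the filename carries that tag, else None."""
    match = PRECISION_TAG.search(filename)
    return match.group(1).upper() if match else None


def pick_source(filenames):
    """Choose the preferred full-precision file; quantized or untagged files are skipped."""
    ranked = []
    for name in filenames:
        precision = detect_precision(name)
        if precision is not None:
            ranked.append((SOURCE_PRIORITY[precision], name))
    return min(ranked)[1] if ranked else None


available = ["Model-F16.gguf", "Model-Q4_K_M.gguf", "Model-BF16.gguf", "Model-F32.gguf"]
print(pick_source(available))
```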

Output Format Details

IQ2_M Characteristics

  • Bit depth: about 2.7 bits per weight on average
  • Speed: 2-3x faster inference than F32
  • VRAM usage: roughly 1/12th of FP32 for model weights
  • Imatrix required: Yes
  • Quality tier: Best-in-class for 2-bit quantization

Naming Convention

Quantized models follow this pattern:

OriginalModel-BF16.gguf → OriginalModel-IQ2_M.gguf

Sharded models preserve shard numbering:

Model-00001-of-00004.gguf → Model-IQ2_M-00001-of-00004.gguf
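
A small helper that reproduces both patterns is sketched below. The function name and regular expressions are hypothetical, and the handling of a file that carries both a precision tag and a shard suffix is an assumption extrapolated from the two examples.

```python
import re

SHARD_RE = re.compile(r"-(\d{5}-of-\d{5})$")
PRECISION_RE = re.compile(r"-(BF16|F16|F32)$", re.IGNORECASE)


def quantized_name(source_name: str, target: str = "IQ2_M") -> str:
    """Derive the output filename from a source GGUF filename."""
    stem = source_name[:-len(".gguf")] if source_name.endswith(".gguf") else source_name
    shard = ""
    shard_match = SHARD_RE.search(stem)
    if shard_match:
        # Keep the shard suffix so it can be re-attached after the format tag.
        shard = "-" + shard_match.group(1)
        stem = stem[:shard_match.start()]
    stem = PRECISION_RE.sub("", stem)  # drop the source precision tag if present
    return f"{stem}-{target}{shard}.gguf"


print(quantized_name("OriginalModel-BF16.gguf"))
print(quantized_name("Model-00001-of-00004.gguf"))
```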

Quality Verification

Models are validated through:

  • Perplexity measurement against baseline
  • Sample inference testing
  • File integrity verification
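
Two of these checks are easy to sketch: the relative perplexity increase and a checksum for file integrity. The perplexity values below are illustrative numbers, not measurements from this model, and SHA-256 is an assumed choice of checksum since the notes do not name one.

```python
import hashlib
import os
import tempfile


def perplexity_increase_pct(baseline_ppl: float, quantized_ppl: float) -> float:
    """Relative perplexity increase of the quantized model, in percent."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl * 100.0


def sha256_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large GGUF files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Illustrative numbers only, not measurements from this model.
increase = perplexity_increase_pct(baseline_ppl=6.00, quantized_ppl=6.30)
print(f"perplexity increase: {increase:.1f}%")

# Demonstrate the checksum on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
checksum = sha256_of_file(tmp.name)
os.remove(tmp.name)
print("sha256:", checksum)
```

Comparing the computed digest with one published alongside the file would catch truncated or corrupted downloads.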

Use Cases

IQ2_M quantized models are ideal for:

  • Edge deployment: Minimal storage footprint
  • Consumer hardware: Reduced VRAM requirements
  • High-throughput inference: Faster token generation
  • Bandwidth-constrained environments: Efficient distribution

Technical Notes

  • Quantization performed on Windows with CUDA 12.4+ support
  • GPU acceleration utilized for imatrix generation
  • Multi-threaded processing with memory safety guards
  • Compatible with llama.cpp inference engines

Citation

If you use this quantized model in research or applications, please acknowledge:

Quantized using The Geeked Out Quantizer with importance-aware IQ2_M optimization.


For questions about the quantization method or collaboration inquiries, please open a discussion on this model's page.