LGxNDs committed on
Commit 4e3e307 · verified · 1 Parent(s): d88d189

Upload 3 files


Geeked Out Quantizer methodology:
- QUANTIZATION_NOTES.md: Technical specs and method details
- GEEKED_OUT_INFO.md: Overview of the quantization environment
- CALIBRATION_INFO.txt: Calibration data explanation

Files changed (3)
  1. CALIBRATION_INFO.txt +82 -0
  2. GEEKED_OUT_INFO.md +151 -0
  3. QUANTIZATION_NOTES.md +118 -0
CALIBRATION_INFO.txt ADDED
@@ -0,0 +1,82 @@
CALIBRATION DATA INFORMATION
============================

This model was quantized using importance matrix (imatrix) generation.
The imatrix captures which weights in the model are most important for
maintaining output quality during extreme compression (2-bit quantization).

WHAT IS CALIBRATION?
--------------------
Calibration is the process of running sample inputs through the model to
measure which tensors (weight matrices) contribute most to the output.
These measurements create an "importance matrix" that guides the quantizer
to preserve precision where it matters most.
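
The gist is easy to sketch: accumulate squared activations per input
column over the calibration set, which is what llama.cpp's imatrix does
for every matrix multiplication. A toy Python version (sizes and names
are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    in_dim = 128
    sq_sum = np.zeros(in_dim)        # running sum of squared activations
    n_rows = 0
    for _ in range(200):             # 200 toy calibration "chunks"
        x = rng.normal(size=(32, in_dim))  # 32 token activations per chunk
        sq_sum += (x ** 2).sum(axis=0)
        n_rows += x.shape[0]

    importance = sq_sum / n_rows     # mean squared activation per column
    # High-importance columns get more precision during quantization.
    print("most important columns:", np.argsort(importance)[::-1][:8])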

CALIBRATION DATA CHARACTERISTICS
--------------------------------
Good calibration data should be:

1. REPRESENTATIVE
   - Matches the domain the model will operate in
   - Similar vocabulary and complexity to expected inputs
   - Reflects actual use case scenarios

2. DIVERSE
   - Multiple topics, subjects, and writing styles
   - Mix of common and rare tokens
   - Varied sentence structures and lengths

3. SUFFICIENT
   - 100-500 text chunks of typical document length
   - More chunks = better quality (diminishing returns beyond ~500)
   - Each chunk processed independently

4. NATURAL
   - Real-world text (not synthetic or random)
   - Domain-appropriate (code for code models, medical for medical models)
   - Representative token distribution

CALIBRATION PROCESS PARAMETERS
------------------------------
Typical settings for this quantization:

  Chunks Processed:  200-500 (production quality)
  Chunk Size:        Typical document/paragraph length
  GPU Acceleration:  Enabled (99 layers offloaded, i.e. the full model)
  Thread Count:      Auto-detected based on CPU
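
These settings map directly onto llama.cpp's llama-imatrix tool. A
hedged example of driving it from Python (the binary must be on PATH;
file names are illustrative):

    import subprocess

    subprocess.run(
        [
            "llama-imatrix",
            "-m", "Model-BF16.gguf",   # source model in BF16/F16
            "-f", "calibration.txt",   # calibration text
            "-o", "imatrix.dat",       # output importance matrix
            "--chunks", "500",         # number of chunks to process
            "-ngl", "99",              # offload all layers to the GPU
        ],
        check=True,
    )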

QUALITY IMPACT
--------------
The importance matrix generated from quality calibration data enables:

- 3-8% perplexity increase (vs 10-20% without imatrix)
- Preservation of critical weights
- Intelligent bit allocation per tensor
- Roughly 12x compression vs FP32 with minimal quality loss

CALIBRATION DATA SOURCES
------------------------
Common sources for high-quality calibration data:

- Wikitext-2-raw (general language models)
- Domain-specific corpora (medical, legal, code)
- The Pile subset (diverse web text)
- Custom curated datasets matching expected use

VERIFICATION
------------
Quantized models are tested for:
  ✓ Perplexity measurement vs baseline
  ✓ Sample inference quality
  ✓ Token prediction accuracy
  ✓ Model file integrity
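
The perplexity check can be run with llama.cpp's llama-perplexity tool;
a sketch (file names are illustrative):

    import subprocess

    for model in ["Model-BF16.gguf", "Model-IQ2_M.gguf"]:
        subprocess.run(
            ["llama-perplexity", "-m", model,
             "-f", "wikitext-2-test.raw", "-ngl", "99"],
            check=True,
        )
    # Compare the two reported PPL values; a good imatrix quant should
    # land within a few percent of the BF16 baseline.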

NOTES
-----
- Calibration is performed once per source model
- Same imatrix can be reused for different target formats
- Domain-specific calibration yields better results
- GPU acceleration significantly speeds up generation

For questions about the calibration methodology used for this model,
please open a discussion on the model's Hugging Face page.
GEEKED_OUT_INFO.md ADDED
@@ -0,0 +1,151 @@
# The Geeked Out Quantizer

## What Is It?

**The Geeked Out Quantizer** is a production-ready quantization environment built for Windows systems. It specializes in extreme model compression using importance-aware quantization techniques, particularly the IQ2_M format, which achieves roughly 12x compression with minimal quality loss.

## The Mission

Traditional model quantization forces a choice: small file size or good quality. The Geeked Out Quantizer breaks this trade-off with **importance matrices**: statistical analyses that identify which weights matter most, allowing intelligent bit allocation.

## Core Capabilities

### 🎯 Importance-Aware Quantization
- Generates importance matrices automatically using calibration data
- Allocates precision where it matters most
- Achieves 2-bit quantization with only 3-8% quality loss

### ⚡ Hardware Optimization
- Auto-detects CPU, memory type (DDR4/DDR5), and GPU capabilities
- Optimizes thread counts and processing parameters
- GPU acceleration for 5-10x speedup on imatrix generation
- CUDA 12.4+ support with dynamic GPU layer offloading

### 🧠 Intelligent Memory Management
- Reserves system RAM to keep Windows responsive during conversion
- Monitors memory pressure and auto-pauses when needed (see the sketch below)
- Configurable retry logic for transient resource constraints
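
A minimal sketch of that memory guard, using `psutil` (the reserve size
and poll interval are invented for illustration):

```python
import time
import psutil

RESERVE_GB = 4.0  # headroom kept free so Windows stays responsive

def wait_for_memory(needed_gb: float, poll_s: float = 5.0) -> None:
    """Block until needed_gb of RAM is free beyond the OS reserve."""
    while psutil.virtual_memory().available / 1e9 - RESERVE_GB < needed_gb:
        time.sleep(poll_s)  # auto-pause while memory pressure is high

# e.g. wait_for_memory(8.0) before loading an 8 GB tensor shard
```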

### 📦 Complete Workflow Support
- Scans directories for valid source models
- Selects optimal source format (BF16 > F16 > F32)
- Handles sharded models while preserving structure
- Batch processing for multiple models
- Desktop GUI for interactive use

## Quantization Pipeline

```
Source Model (BF16/F16)
  ↓
Calibration Data Analysis
  ↓
Importance Matrix Generation
  ↓
Smart Bit Allocation
  ↓
IQ2_M Quantization
  ↓
Quality Verification
  ↓
Production-Ready Model (~12x smaller)
```
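
The two heavy stages map onto llama.cpp's CLI tools. A hedged sketch of
the final quantization step in Python (file names are illustrative;
`imatrix.dat` comes from the calibration stage):

```python
import subprocess

# Apply the importance matrix while quantizing BF16 -> IQ2_M.
subprocess.run(
    ["llama-quantize", "--imatrix", "imatrix.dat",
     "Model-BF16.gguf", "Model-IQ2_M.gguf", "IQ2_M"],
    check=True,
)
```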

## Supported Formats

### Importance-Aware (imatrix required)
| Format | Bits/Weight (approx.) | Best For |
|--------|-----------------------|----------|
| IQ1_M | 1.75 | Ultra-compact mobile/edge |
| IQ2_XXS | 2.06 | Maximum compression |
| IQ2_XS | 2.31 | Balanced compression |
| IQ2_S | 2.5 | Between IQ2_XS and IQ2_M |
| **IQ2_M** | **2.7** | **Best quality in the 2-bit class** ⭐ |
| IQ3_M | 3.66 | Near-Q4 quality |
| IQ4_XS | 4.25 | Importance-aware 4-bit |

### Standard K-Quant Formats
Q2_K, Q3_K variants, Q4 variants, Q5 variants, Q6_K, Q8_0

### Ternary Formats
TQ2_0, TQ1_0: experimental ternary (3-value) quantization

## Why IQ2_M?

IQ2_M represents the sweet spot for extreme quantization:

- **~12x smaller** than FP32 models (see the size math below)
- **2-3x faster** inference on memory-bound hardware
- **VRAM usage** cut to roughly a twelfth of FP32
- **Quality** approaches Q4_K with a proper imatrix
- **Compatible** with the llama.cpp inference stack
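
Back-of-envelope size math for a hypothetical 7B-parameter model,
assuming IQ2_M's ~2.7 bits per weight:

```python
params = 7e9                       # hypothetical 7B-parameter model
fp32_gb = params * 32 / 8 / 1e9    # 28.0 GB at 32 bits/weight
iq2m_gb = params * 2.7 / 8 / 1e9   # ~2.4 GB at ~2.7 bits/weight
print(f"FP32 {fp32_gb:.1f} GB -> IQ2_M {iq2m_gb:.1f} GB "
      f"({fp32_gb / iq2m_gb:.1f}x smaller)")  # ~11.9x
```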

## Use Cases

- 🤖 **Edge AI**: run large models on limited hardware
- 🌐 **Browser-Based Inference**: smaller models for WebGPU/WebGL
- 📱 **Mobile Deployment**: fit large models on phones/tablets
- 🚀 **High-Throughput APIs**: serve more requests with less VRAM
- 💾 **Archive Storage**: preserve models at minimal storage cost

## Technical Philosophy

The Geeked Out Quantizer focuses on:

1. **Quality Preservation**: never sacrifice more quality than necessary
2. **Automation**: minimize manual tuning through intelligent defaults
3. **Hardware Awareness**: adapt to the system's capabilities
4. **Production Readiness**: robust error handling and retry logic
5. **Calibration Quality**: emphasize representative data selection

## Model Curation

Not all models are equal candidates. The quantizer evaluates:
- Source format quality (BF16 preferred)
- Model architecture compatibility
- Existing quantization state
- Expected use case alignment

## Calibration Best Practices

The quality of a quantized model depends heavily on its calibration data.

✅ **DO:**
- Use domain-relevant text (code for code models, medical for medical models)
- Include diverse topics and writing styles
- Provide 100-500 chunks of typical document length
- Ensure a natural token distribution

❌ **DON'T:**
- Use repetitive or overly simple text
- Include corrupted or random data
- Rely on single-domain text for general-purpose models

## Collaboration & Research

The Geeked Out Quantizer methodology is available for:
- Research collaborations on quantization techniques
- Edge deployment optimization projects
- Custom calibration strategies for specialized domains
- Hardware-specific optimization studies

## Community

All models in this Hugging Face profile are quantized with this toolchain. Each model card includes:
- Quantization specifications
- Calibration methodology
- Quality metrics
- Use case recommendations

## Future Directions

- Expanded format support (new GGML quantization types)
- Domain-specific calibration datasets
- Hardware-specific optimization profiles
- Batch processing automation

---

*The Geeked Out Quantizer: making extreme compression intelligent.*

For questions about quantization methodology, collaboration opportunities, or technical discussions, please open an issue or discussion on any model in this profile.
QUANTIZATION_NOTES.md ADDED
@@ -0,0 +1,118 @@
# Quantization Notes

## Overview

This model was quantized using **The Geeked Out Quantizer**, a specialized Windows-native quantization environment designed for extreme compression with quality preservation.

## Quantization Specifications

| Parameter | Details |
|-----------|---------|
| **Source Format** | BF16 (bfloat16) or F16 (float16) |
| **Target Format** | IQ2_M (~2.7 bits per weight) |
| **Compression Ratio** | ~12x smaller than the FP32 baseline |
| **Quantization Method** | Importance-aware quantization with an imatrix |
| **Quality Metric** | ~3-8% perplexity increase vs. baseline |

## The Importance Matrix (imatrix) Method

### What Is an Importance Matrix?

An importance matrix is a statistical profile of a neural network that identifies which weights contribute most significantly to output quality. Rather than applying uniform quantization across all tensors, this method:

- **Preserves precision** on high-impact weights (see the sketch below)
- **Aggressively compresses** low-impact weights
- **Maintains information flow** through the network architecture
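
A toy illustration of importance-weighted rounding (not the quantizer's
actual kernels; the sizes, scale grid, and 2-bit level set are invented
for this sketch):

```python
import numpy as np

def quantize_block(w, importance, bits=2):
    """Pick the scale minimizing importance-weighted squared error,
    then round onto the symmetric integer grid for that bit width."""
    qmax = 2 ** (bits - 1) - 1                    # 2-bit grid: -2..1
    best_err, best_wq = np.inf, w
    for scale in np.linspace(1e-3, np.abs(w).max(), 64):
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = (importance * (w - q * scale) ** 2).sum()
        if err < best_err:
            best_err, best_wq = err, q * scale
    return best_wq

rng = np.random.default_rng(1)
w = rng.normal(size=256)                          # toy weight block
imp = rng.uniform(0.1, 10.0, size=256)            # per-weight importance
wq = quantize_block(w, imp)
print("weighted MSE:", float((imp * (w - wq) ** 2).mean()))
```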

### Why It Matters

Traditional uniform quantization to 2-bit precision typically causes 10-20% quality degradation. The importance matrix approach reduces this to 3-8%, making 2-bit models viable for production use.

## Calibration Process

### Data Selection

The importance matrix is generated using carefully selected calibration data that:
- Represents the model's intended use domain
- Contains diverse vocabulary and sentence structures
- Includes 100-500 text chunks of typical prompt length
- Matches the distribution of expected inference inputs

### Generation Parameters

| Setting | Typical Value | Purpose |
|---------|---------------|---------|
| Chunks | 200-500 | Balance quality vs. generation time |
| GPU Layers | 99 (max) | Accelerate processing via CUDA |
| Thread Count | Auto-detected | Optimize for hardware configuration |

## Memory & Hardware Optimization

The quantization process includes:
- **Dynamic memory management**: reserves system RAM to maintain Windows responsiveness
- **Hardware detection**: automatically detects CPU cores, memory type (DDR4/DDR5), and GPU capabilities (sketched below)
- **Thread optimization**: adjusts parallelism based on available resources
- **Retry logic**: handles transient memory pressure gracefully
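
A minimal sketch of hardware-aware defaults (the heuristics are invented
for illustration; probing for `nvidia-smi` is one crude GPU check):

```python
import os
import shutil

threads = max(1, (os.cpu_count() or 1) - 2)       # leave headroom for the OS
has_gpu = shutil.which("nvidia-smi") is not None  # crude CUDA presence check
gpu_layers = 99 if has_gpu else 0                 # offload everything if possible
print(f"threads={threads} gpu_layers={gpu_layers}")
```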

## Model Selection Criteria

Source models are selected by quality hierarchy:
1. **BF16** (preferred): best precision for quantization
2. **F16**: good precision, widely available
3. **F32**: acceptable, but creates larger intermediate files

Models already in quantized formats are skipped unless re-quantization is explicitly requested.

## Output Format Details

### IQ2_M Characteristics

- **Bit depth:** ~2.7 bits per weight
- **Speed:** 2-3x faster inference than F32
- **VRAM usage:** roughly a twelfth of FP32
- **Imatrix required:** yes
- **Quality tier:** best-in-class for the 2-bit family

### Naming Convention

Quantized models follow this pattern:
```
OriginalModel-BF16.gguf → OriginalModel-IQ2_M.gguf
```

Sharded models preserve shard numbering:
```
Model-00001-of-00004.gguf → Model-IQ2_M-00001-of-00004.gguf
```
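
A hedged reimplementation of that renaming rule in Python (the regex and
helper name are illustrative, not the quantizer's actual code):

```python
import re
from pathlib import Path

def quant_name(src: Path, quant: str = "IQ2_M") -> Path:
    """Derive the output filename, preserving shard numbering if present."""
    m = re.match(r"(.+?)(-BF16|-F16)?(-\d{5}-of-\d{5})?\.gguf$", src.name)
    stem, shard = m.group(1), m.group(3) or ""
    return src.with_name(f"{stem}-{quant}{shard}.gguf")

print(quant_name(Path("OriginalModel-BF16.gguf")))    # OriginalModel-IQ2_M.gguf
print(quant_name(Path("Model-00001-of-00004.gguf")))  # Model-IQ2_M-00001-of-00004.gguf
```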

## Quality Verification

Models are validated through:
- Perplexity measurement against the baseline
- Sample inference testing
- File integrity verification

## Use Cases

IQ2_M quantized models are ideal for:
- **Edge deployment**: minimal storage footprint
- **Consumer hardware**: reduced VRAM requirements
- **High-throughput inference**: faster token generation
- **Bandwidth-constrained environments**: efficient distribution

## Technical Notes

- Quantization performed on Windows with CUDA 12.4+ support
- GPU acceleration used for imatrix generation
- Multi-threaded processing with memory safety guards
- Compatible with llama.cpp inference engines

## Citation

If you use this quantized model in research or applications, please acknowledge:

> Quantized using The Geeked Out Quantizer with importance-aware IQ2_M optimization.

---

*For questions about the quantization method or collaboration inquiries, please open a discussion on this model's page.*