Bopalv committed on
Commit ffe4b6a · verified · 1 Parent(s): 0627dd9

Upload Qwen3-0.6B-Comparison.md with huggingface_hub

Files changed (1)
  1. Qwen3-0.6B-Comparison.md +168 -0
Qwen3-0.6B-Comparison.md ADDED
@@ -0,0 +1,168 @@
1
+ # Qwen3-0.6B Quantized Models Comparison
2
+
3
+ ## Summary
4
+
5
+ Three quantized versions of Qwen3-0.6B were created on March 21, 2026 and uploaded to Hugging Face at:
6
+ **https://huggingface.co/Bopalv/Qwen3-0.6B-quantized**
7
+
8
+ ## Models Overview
9
+
10
+ | Model | Format | Quantization | File Size | Status |
11
+ |-------|--------|--------------|-----------|--------|
12
+ | GGUF Q4_K_M | GGUF | 4-bit K-quant | 462 MB | ✅ Complete |
13
+ | GPTQ-Int4 | Safetensors | 4-bit GPTQ | 517 MB | ✅ Complete |
14
+ | GPTQ-Int8 | Safetensors | 8-bit GPTQ | 727 MB | ✅ Complete |
15
+
16
+ ## Technical Specifications
17
+
18
+ ### Common Properties
19
+ - **Base Model**: Qwen3-0.6B
20
+ - **Parameters**: 0.6B (about 0.44B excluding embeddings)
21
+ - **Architecture**: Qwen3ForCausalLM
22
+ - **Hidden Size**: 1024
23
+ - **Layers**: 28
24
+ - **Attention Heads**: 16
25
+ - **KV Heads**: 8
26
+ - **Max Context**: 40,960 tokens
27
+ - **Vocab Size**: 151,936
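
The layer and KV-head counts above also determine how much memory the KV cache adds on top of the weights at long context lengths. The following is a rough sketch, not a measurement: the head dimension (128) and a 16-bit cache are assumptions that this document does not state, so check them against the model's `config.json` and your runtime's settings.

```python
# Values taken from the "Common Properties" list above.
NUM_LAYERS = 28
NUM_KV_HEADS = 8
MAX_CONTEXT = 40_960

# Assumptions (not stated in this document): head dimension and cache dtype.
HEAD_DIM = 128        # assumed; verify `head_dim` in the model's config.json
BYTES_PER_VALUE = 2   # assumed fp16/bf16 KV cache


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for tokens in (1, 2_048, MAX_CONTEXT):
    mib = kv_cache_bytes(tokens) / 2**20
    print(f"{tokens:>6} tokens -> {mib:,.1f} MiB of KV cache")
```

Under these assumptions the cache at the full context length is several times larger than any of the quantized weight files, so RAM planning should consider the context length actually used.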
28
+
29
+ ### Quantization Details
30
+
31
+ | Model | Bits | Group Size | Symmetric | Quantizer | Pack Dtype |
32
+ |-------|------|------------|-----------|-----------|------------|
33
+ | GGUF Q4_K_M | 4 (mixed; some tensors 6-bit) | 32 (super-blocks of 256) | No (scale and min per block) | llama.cpp | N/A |
34
+ | GPTQ-Int4 | 4 | 128 | Yes | gptqmodel 4.0.0 | int32 |
35
+ | GPTQ-Int8 | 8 | 128 | Yes | gptqmodel 2.2.0 | int32 |
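
As a sanity check on the nominal bit widths in the table, the file sizes from the overview can be converted into average bits stored per parameter. This is a back-of-the-envelope sketch that assumes a nominal 0.6 billion parameters and that "MB" means 10^6 bytes; neither is confirmed by the source.

```python
# File sizes (MB) from the "Models Overview" table.
FILE_SIZES_MB = {"GGUF Q4_K_M": 462, "GPTQ-Int4": 517, "GPTQ-Int8": 727}

# Assumptions: nominal parameter count from the model name, MB = 10**6 bytes.
PARAMS = 0.6e9
BYTES_PER_MB = 1e6


def effective_bits_per_param(size_mb: float, params: float = PARAMS) -> float:
    """Average number of bits stored on disk per model parameter."""
    return size_mb * BYTES_PER_MB * 8 / params


for name, size_mb in FILE_SIZES_MB.items():
    print(f"{name:<12} {effective_bits_per_param(size_mb):.2f} bits/parameter")
```

The averages come out above the nominal 4 or 8 bits. A likely explanation is that scales, metadata, and tensors kept at higher precision (such as embeddings) are included in the file, but that breakdown is not measured here.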
36
+
37
+ ## File Size Comparison
38
+
39
+ ```
40
+ GGUF Q4_K_M  ████████████████████ 462 MB (Smallest)
41
+ GPTQ-Int4    ██████████████████████ 517 MB (+12%)
42
+ GPTQ-Int8    ███████████████████████████████ 727 MB (+57%)
43
+ ```
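
The percentages in the chart are each file's size relative to the smallest one. A minimal sketch that reproduces them from the table values (the 20-character bar width is an arbitrary choice for illustration):

```python
# File sizes (MB) from the "Models Overview" table.
FILE_SIZES_MB = {"GGUF Q4_K_M": 462, "GPTQ-Int4": 517, "GPTQ-Int8": 727}


def relative_overhead(sizes: dict) -> dict:
    """Percent size increase of each file over the smallest file, rounded."""
    smallest = min(sizes.values())
    return {name: round((size / smallest - 1) * 100) for name, size in sizes.items()}


def bar(size: float, smallest: float, width: int = 20) -> str:
    """Text bar scaled so that the smallest file is `width` characters wide."""
    return "█" * round(size / smallest * width)


overhead = relative_overhead(FILE_SIZES_MB)
smallest = min(FILE_SIZES_MB.values())
for name, size in FILE_SIZES_MB.items():
    print(f"{name:<12} {bar(size, smallest)} {size} MB (+{overhead[name]}%)")
```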
44
+
45
+ ## Theoretical Performance Analysis
46
+
47
+ ### Memory Usage
48
+ - **GGUF Q4_K_M**: ~462 MB loaded
49
+ - **GPTQ-Int4**: ~517 MB loaded + overhead
50
+ - **GPTQ-Int8**: ~727 MB loaded + overhead
51
+
52
+ ### Expected Quality (Lower bits = More compression, Potentially lower quality)
53
+ 1. **GPTQ-Int8**: Best quality (8-bit precision)
54
+ 2. **GPTQ-Int4**: Good quality (4-bit with group quantization)
55
+ 3. **GGUF Q4_K_M**: Good quality (4-bit K-quant, optimized for llama.cpp)
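
To illustrate why fewer bits tend to cost quality, the sketch below applies plain symmetric round-to-nearest quantization in groups of 128 to synthetic weights and compares the reconstruction error at 4 and 8 bits. This is deliberately not GPTQ or the llama.cpp K-quant scheme, both of which are more sophisticated; the Gaussian weights and their standard deviation are assumptions chosen only for the demonstration.

```python
import random


def quantize_group(values, bits):
    """Symmetric round-to-nearest quantization of one group, then dequantization."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # fall back to 1.0 for all-zero groups
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(weights, bits, group_size=128):
    """Average absolute reconstruction error over all weights."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        restored = quantize_group(group, bits)
        total += sum(abs(a - b) for a, b in zip(group, restored))
    return total / len(weights)


rng = random.Random(0)  # fixed seed so the demo is repeatable
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic, not real model weights

err4 = mean_abs_error(weights, bits=4)
err8 = mean_abs_error(weights, bits=8)
print(f"4-bit mean abs error: {err4:.6f}")
print(f"8-bit mean abs error: {err8:.6f}")
```

Reconstruction error on weights is only a proxy; output quality should still be checked with an evaluation on real prompts.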
56
+
57
+ ### Expected Speed (CPU-based)
58
+ 1. **GGUF Q4_K_M**: Fastest (optimized for llama.cpp, smallest size)
59
+ 2. **GPTQ-Int4**: Medium (requires dequantization overhead)
60
+ 3. **GPTQ-Int8**: Slowest (largest size, so the most weight data to read per token)
61
+
62
+ ## Compatibility
63
+
64
+ ### GGUF Q4_K_M
65
+ - ✅ llama.cpp
66
+ - ✅ prima.cpp (if Qwen3 architecture is supported)
67
+ - ✅ Ollama
68
+ - ✅ LM Studio
69
+ - ✅ Text Generation WebUI
70
+
71
+ ### GPTQ-Int4 & GPTQ-Int8
72
+ - ✅ HuggingFace Transformers
73
+ - ✅ AutoGPTQ
74
+ - ✅ vLLM
75
+ - ✅ Text Generation WebUI
76
+ - ⚠️ llama.cpp (requires conversion)
77
+
78
+ ## Usage Recommendations
79
+
80
+ ### For CPU-only systems
81
+ **Recommended: GGUF Q4_K_M**
82
+ - Smallest file size
83
+ - Optimized for CPU inference
84
+ - Fastest loading time
85
+ - Compatible with llama.cpp ecosystem
86
+
87
+ ### For GPU systems
88
+ **Recommended: GPTQ-Int4**
89
+ - Good balance of quality and size
90
+ - Works with AutoGPTQ and Transformers
91
+ - Faster than GPTQ-Int8
92
+ - May give better quality than GGUF on GPU (not measured here)
93
+
94
+ ### For Maximum Quality
95
+ **Recommended: GPTQ-Int8**
96
+ - Highest precision (8-bit)
97
+ - Best output quality
98
+ - Requires more memory
99
+ - Slower inference
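
The three recommendations above amount to a small decision rule. A minimal sketch follows; the function name and parameters are hypothetical, and the rule encodes only the guidance given in this section, not measured results.

```python
def recommend(has_gpu: bool, max_quality: bool = False) -> str:
    """Pick a model variant following the usage recommendations above."""
    if max_quality:
        # Highest precision, at the cost of more memory and slower inference.
        return "GPTQ-Int8"
    if has_gpu:
        # Balance of quality and size on GPU systems.
        return "GPTQ-Int4"
    # CPU-only systems: smallest file, llama.cpp ecosystem.
    return "GGUF Q4_K_M"


print(recommend(has_gpu=False))
print(recommend(has_gpu=True))
print(recommend(has_gpu=True, max_quality=True))
```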
100
+
101
+ ## Benchmarking Notes
102
+
103
+ The model-efficiency tool from `bopalvelut-prog/model-efficiency` requires:
104
+ 1. Ollama running on port 11434, OR
105
+ 2. An OpenAI-compatible API server
106
+
107
+ To benchmark these models:
108
+
109
+ ### Option 1: Using Ollama
110
+ ```bash
111
+ # Install Ollama
112
+ curl -fsSL https://ollama.com/install.sh | sh
113
+
114
+ # Import GGUF model
115
+ ollama create qwen3-0.6b-gguf -f Modelfile
116
+
117
+ # Run benchmark
118
+ cd model-efficiency
119
+ python model_efficiency_comparator.py -p "Your test prompt"
120
+ ```
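
If the comparator tool is not available, throughput can also be read directly from Ollama's HTTP API. The sketch below assumes the default port 11434 mentioned above and the model name from the `ollama create` command; it relies on the `eval_count` and `eval_duration` (nanoseconds) fields of a non-streaming `/api/generate` response.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default port, as noted above
MODEL_NAME = "qwen3-0.6b-gguf"                       # name given to `ollama create` above


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(prompt: str, timeout: float = 120.0) -> dict:
    """Send one non-streaming generate request and return the parsed JSON reply."""
    body = json.dumps({"model": MODEL_NAME, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)
```

With Ollama running, `tokens_per_second(generate("Your test prompt"))` gives the decode rate for a single request; averaging several runs gives a steadier number.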
121
+
122
+ ### Option 2: Using prima.cpp
123
+ ```bash
124
+ # Start server with GGUF model
125
+ /home/ma/prima.cpp/llama-server \
126
+ -m /home/ma/models/Qwen3-0.6B-GGUF/Qwen3-0.6B.Q4_K_M.gguf \
127
+ --port 8080
128
+
129
+ # Test with curl
130
+ curl http://localhost:8080/v1/chat/completions \
131
+ -H "Content-Type: application/json" \
132
+ -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":50}'
133
+ ```
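
The curl test above can be turned into a timed request for rough latency numbers. This sketch uses only the standard library and the same endpoint and request body as the curl command; the function names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Tuple

# Same endpoint as the curl command above; adjust host and port for your server.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_payload(prompt: str, max_tokens: int = 50) -> dict:
    """Same request body as the curl example, with a configurable prompt."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def timed_completion(prompt: str, max_tokens: int = 50) -> Tuple[dict, float]:
    """POST one chat completion and return (parsed JSON, elapsed seconds)."""
    request = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(build_payload(prompt, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body, time.perf_counter() - start
```

With the server running, `body, seconds = timed_completion("Hello")` returns the reply and the wall-clock time. If the reply includes a `usage.completion_tokens` field (an assumption about the server's response), dividing it by `seconds` gives an end-to-end tokens-per-second figure that includes prompt processing.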
134
+
135
+ ### Option 3: Using Transformers (for GPTQ models)
136
+ ```python
137
+ from transformers import AutoModelForCausalLM, AutoTokenizer
138
+ import time
139
+
140
+ model_path = "/home/ma/models/Qwen3-0.6B-GPTQ-Int4"
141
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
142
+ model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
143
+
144
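+ # Move the inputs to the model's device, since device_map="auto" may place the model on a GPU
+ inputs = tokenizer("Hello", return_tensors="pt").to(model.device)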
145
+ start = time.time()
146
+ outputs = model.generate(**inputs, max_new_tokens=100)
147
+ end = time.time()
148
+
149
+ print(f"Time: {end-start:.2f}s")
150
+ ```
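
A single timed call includes one-off costs such as kernel warm-up, so a slightly more careful measurement runs a warm-up pass and reports the median over several runs. The helper below is a generic sketch with hypothetical names; it takes any callable that performs one generation and returns the number of new tokens produced.

```python
import statistics
import time
from typing import Callable


def benchmark(generate: Callable[[], int], warmup: int = 1, runs: int = 3) -> float:
    """Median tokens/second over `runs` timed calls, after `warmup` untimed calls.

    `generate` must run one generation and return the number of new tokens.
    """
    for _ in range(warmup):
        generate()

    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        new_tokens = generate()
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
        rates.append(new_tokens / elapsed)
    return statistics.median(rates)


# Stand-in generator for demonstration only: pretends to produce 100 tokens.
def fake_generate() -> int:
    time.sleep(0.01)
    return 100


print(f"{benchmark(fake_generate):.0f} tokens/s (fake generator)")
```

With the Transformers example above, the callable could wrap `model.generate(**inputs, max_new_tokens=100)` and return the output length minus the prompt length, since generation may stop before the token limit.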
151
+
152
+ ## Storage Requirements
153
+
154
+ | Model | File Size | Disk Space Needed | RAM Needed (Est.) |
155
+ |-------|-----------|-------------------|-------------------|
156
+ | GGUF Q4_K_M | 462 MB | 462 MB | ~600 MB |
157
+ | GPTQ-Int4 | 517 MB | 517 MB | ~700 MB |
158
+ | GPTQ-Int8 | 727 MB | 727 MB | ~900 MB |
159
+ | **All Models** | **1.7 GB** | **1.7 GB** | **~2.2 GB** |
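
The disk figures in the table can be checked against a local download. The sketch below sums file sizes under each model directory; the two paths are taken from the benchmarking examples in this document and are specific to the author's machine, so treat them as placeholders.

```python
from pathlib import Path

# Paths from the examples in this document; replace with your own locations.
MODEL_DIRS = [
    Path("/home/ma/models/Qwen3-0.6B-GGUF"),
    Path("/home/ma/models/Qwen3-0.6B-GPTQ-Int4"),
]


def dir_size_mb(path: Path) -> float:
    """Total size of all regular files under `path`, in MB (10**6 bytes)."""
    if path.is_file():
        return path.stat().st_size / 1e6
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file()) / 1e6


total = 0.0
for model_dir in MODEL_DIRS:
    if not model_dir.exists():
        print(f"{model_dir}: not found, skipping")
        continue
    size = dir_size_mb(model_dir)
    total += size
    print(f"{model_dir.name}: {size:.0f} MB")
print(f"Total: {total:.0f} MB")
```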
160
+
161
+ ## Conclusion
162
+
163
+ - **Best for CPU/Embedded**: GGUF Q4_K_M (smallest, fastest)
164
+ - **Best for GPU**: GPTQ-Int4 (balanced)
165
+ - **Best Quality**: GPTQ-Int8 (highest precision)
166
+
167
+ All models are available at:
168
+ **https://huggingface.co/Bopalv/Qwen3-0.6B-quantized**