Commit c43acdc (verified), committed by g023
1 parent: e53f249

Upload 6 files

.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ Qwen3-g023-tiny-v1-Q2_K.gguf filter=lfs diff=lfs merge=lfs -text
37
+ Qwen3-g023-tiny-v1-Q3_K_M.gguf filter=lfs diff=lfs merge=lfs -text
38
+ Qwen3-g023-tiny-v1-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
39
+ Qwen3-g023-tiny-v1-Q6_K.gguf filter=lfs diff=lfs merge=lfs -text
40
+ Qwen3-g023-tiny-v1-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
Qwen3-g023-tiny-v1-Q2_K.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:203d50e94354d786d169ff8cf05cd18d8e8f3d2f335278f782eca51b47035d4f
3
+ size 759345248
Qwen3-g023-tiny-v1-Q3_K_M.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a98e1bf52b7ed8a5fee139586c91789d78456bb3baf0ce882a9eec51eb180063
3
+ size 915386464
Qwen3-g023-tiny-v1-Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6b43839b47fb45ca0ee455009bded9d4574deeb64abd733df315f1e797da8005
3
+ size 1075294304
Qwen3-g023-tiny-v1-Q6_K.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f2ab3f70eae940ae96a9f72e71876ff59a7a5f95f733e0263c5499b235514c75
3
+ size 1376448608
Qwen3-g023-tiny-v1-Q8_0.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3aa85e70ecb8986b126239bc9c056054e6e612b2bf4fb60df8ef3744351db5ff
3
+ size 1780930656
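
Each pointer above records the SHA-256 digest and byte size of the real GGUF file. A minimal sketch of checking a downloaded file against its pointer follows; the helper names `parse_lfs_pointer` and `matches_pointer` are illustrative and not part of this repository.

```python
import hashlib
import os


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "oid": fields["oid"].removeprefix("sha256:"),
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's byte size and SHA-256 digest match the pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so multi-gigabyte files never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == pointer["oid"]


# Pointer contents for the Q8_0 file, copied from the diff above.
Q8_0_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:3aa85e70ecb8986b126239bc9c056054e6e612b2bf4fb60df8ef3744351db5ff
size 1780930656
"""
```

After downloading, `matches_pointer("Qwen3-g023-tiny-v1-Q8_0.gguf", parse_lfs_pointer(Q8_0_POINTER))` should return `True` for an intact file.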
README.md CHANGED
@@ -1,3 +1,202 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ base_model: Qwen/Qwen3-1.7B
6
+ tags:
7
+ - qwen3
8
+ - gguf
9
+ - layer-surgery
10
+ - small-language-model
11
+ - pruned
12
+ - optimized
13
+ - thinking
14
+ - text-generation
15
+ model_name: Qwen3-g023-tiny-v1
16
+ pipeline_tag: text-generation
17
+ library_name: llama.cpp
18
+ quantized_by: g023
19
+ ---
20
+
21
+ # Qwen3-g023-tiny-v1 (GGUF)
22
+
23
+ **A surgically optimized 27-layer Qwen3 variant that outperforms the original 28-layer model.**
24
+
25
+ Created by deleting one harmful layer and swapping two adjacent layers to improve information flow. It scores **92.9/100** with **100% factual accuracy** (17/17), a 5.1-point improvement over the original Qwen3-1.7B baseline (87.8/100).
26
+
27
+ ## Available Quantizations
28
+
29
+ | Quantization | Bits/Weight | Description | Download |
30
+ |:---:|:---:|:---|:---:|
31
+ | **Q8_0** | 8.50 | Highest quality, virtually lossless | [Qwen3-g023-tiny-v1-Q8_0.gguf](./Qwen3-g023-tiny-v1-Q8_0.gguf) |
32
+ | **Q6_K** | 6.57 | Excellent quality, good compression | [Qwen3-g023-tiny-v1-Q6_K.gguf](./Qwen3-g023-tiny-v1-Q6_K.gguf) |
33
+ | **Q4_K_M** | 4.85 | Good balance of quality and size | [Qwen3-g023-tiny-v1-Q4_K_M.gguf](./Qwen3-g023-tiny-v1-Q4_K_M.gguf) |
34
+ | **Q3_K_M** | 3.91 | High compression, moderate quality loss | [Qwen3-g023-tiny-v1-Q3_K_M.gguf](./Qwen3-g023-tiny-v1-Q3_K_M.gguf) |
35
+ | **Q2_K** | 3.35 | Maximum compression, significant quality loss | [Qwen3-g023-tiny-v1-Q2_K.gguf](./Qwen3-g023-tiny-v1-Q2_K.gguf) |
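
As a rough cross-check of the table, the file sizes recorded in this commit can be divided by the parameter count to get an effective bits-per-weight figure. This is only a sketch: it uses the README's rounded ~1.67B parameter count, and the result will likely sit above the nominal value because each file also holds metadata and the tokenizer, and some tensors may be stored at higher precision.

```python
# File sizes in bytes, copied from the Git LFS pointers in this commit.
FILE_SIZES = {
    "Q8_0": 1_780_930_656,
    "Q6_K": 1_376_448_608,
    "Q4_K_M": 1_075_294_304,
    "Q3_K_M": 915_386_464,
    "Q2_K": 759_345_248,
}

# Assumption: the README's rounded "~1.67B" parameter count.
N_PARAMS = 1.67e9


def effective_bits_per_weight(size_bytes: int, n_params: float = N_PARAMS) -> float:
    """Whole-file bits divided by parameter count (includes file overhead)."""
    return size_bytes * 8 / n_params


for name, size in FILE_SIZES.items():
    bpw = effective_bits_per_weight(size)
    print(f"{name:7s} {size / 1e9:5.2f} GB  {bpw:.2f} bits/weight")
```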
36
+
37
+ ## Model Details
38
+
39
+ | Parameter | Value |
40
+ |:---|:---|
41
+ | Architecture | Qwen3ForCausalLM |
42
+ | Layers | **27** (28 original − 1 deleted) |
43
+ | Hidden Size | 2,048 |
44
+ | Intermediate Size | 6,144 |
45
+ | Attention Heads | 16 query / 8 key-value (GQA) |
46
+ | Head Dimension | 128 |
47
+ | Vocabulary | 151,936 tokens |
48
+ | Max Context | 40,960 tokens |
49
+ | RoPE θ | 1,000,000 |
50
+ | Tied Embeddings | Yes |
51
+ | Total Parameters | **~1.67B** |
52
+ | Precision (source) | bfloat16 |
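
The ~1.67B total can be reproduced from the other rows of this table. The sketch below assumes bias-free attention projections, a gated MLP with three weight matrices, and per-head RMSNorm weights on queries and keys; none of those details are stated in the table, so treat the breakdown as an approximation of the architecture rather than a definitive accounting.

```python
def qwen3_param_count(
    n_layers: int,
    hidden: int = 2048,
    intermediate: int = 6144,
    n_heads: int = 16,
    n_kv_heads: int = 8,
    head_dim: int = 128,
    vocab: int = 151_936,
    tied_embeddings: bool = True,
) -> int:
    """Estimate the parameter count from the Model Details table."""
    # Attention projections (assumed bias-free): Q, then K and V, then O.
    q = hidden * n_heads * head_dim
    kv = 2 * hidden * n_kv_heads * head_dim
    o = n_heads * head_dim * hidden
    # Gated MLP (assumed): gate, up, and down projections.
    mlp = 3 * hidden * intermediate
    # RMSNorm weights: two per layer plus per-head Q/K norms (assumed).
    norms = 2 * hidden + 2 * head_dim
    per_layer = q + kv + o + mlp + norms

    embed = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden
    return n_layers * per_layer + embed + lm_head + final_norm


print(f"{qwen3_param_count(27):,}")  # 1,670,238,976 (27-layer v1)
print(f"{qwen3_param_count(28):,}")  # 1,720,574,976 (28-layer original)
```

Under these assumptions each layer contributes about 50.3M parameters, which matches the drop from roughly 1.72B to 1.67B when one layer is removed.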
53
+
54
+ ## Surgery Operations
55
+
56
+ This model was created by applying two surgical operations to [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B):
57
+
58
+ 1. **Delete layer 10**: Layer 10 was identified as harmful to model quality. Removing it improved the overall score from 87.8 to 91.4.
59
+ 2. **Swap layers 11 ↔ 12** (post-deletion indices): Swapping these adjacent layers improved information flow through the model's middle layers and raised the score further to 92.9.
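
To make the index bookkeeping concrete, the two operations can be sketched on a plain list of layer indices. This assumes 0-based indexing, which the README does not state, and `apply_surgery` is a hypothetical helper, not the author's surgery framework.

```python
def apply_surgery(n_layers: int, delete: int, swap: tuple[int, int]) -> list[int]:
    """Return the original layer indices in their new order."""
    order = list(range(n_layers))
    del order[delete]  # step 1: drop one layer
    i, j = swap        # step 2: these indices refer to the list after deletion
    order[i], order[j] = order[j], order[i]
    return order


order = apply_surgery(28, delete=10, swap=(11, 12))
print(len(order))   # 27
print(order[9:14])  # [9, 11, 13, 12, 14]
```

Under that assumption, the post-deletion swap of positions 11 and 12 exchanges what were layers 12 and 13 in the original model.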
60
+
61
+ ### Key Findings
62
+
63
+ - **Smaller is better**: The 27-layer model outperforms both the 28-layer original and various 29–30 layer expanded models.
64
+ - **Layer 10 is actively harmful**: Removing it alone yields a +3.6 point improvement.
65
+ - **Operations compound selectively**: Deletion + swap works, but deletion + duplication degrades quality.
66
+
67
+ ## Benchmark Results
68
+
69
+ | Metric | Original (28L) | **v1 (27L)** | Δ |
70
+ |:---|:---:|:---:|:---:|
71
+ | **Overall Score** | 87.8 / 100 | **92.9 / 100** | **+5.1** |
72
+ | **Factual Accuracy** | 15 / 17 (88%) | **17 / 17 (100%)** | **+12 pp** |
73
+ | Avg Perplexity | not reported | 15.70 | n/a |
74
+ | Thinking Mode | ✅ | ✅ | n/a |
75
+ | Non-Thinking Mode | ✅ | ✅ | n/a |
76
+
77
+ Evaluated using a comprehensive test suite with 17 factual questions, 2 completion coherence tests, perplexity measurements, repetition analysis, and thinking/non-thinking mode verification.
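
The README does not say how the perplexity figure is computed. For orientation, here is the standard definition (the exponential of the negative mean per-token log-probability) as a small sketch; averaging per-text perplexities with an arithmetic mean is an assumption about what "Avg Perplexity" means, not a description of the author's test suite.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def average_perplexity(per_text_logprobs: list[list[float]]) -> float:
    """Arithmetic mean of per-text perplexities (assumed aggregation)."""
    scores = [perplexity(lp) for lp in per_text_logprobs]
    return sum(scores) / len(scores)


# A model that gives every token probability 0.25 has perplexity 4.
print(round(perplexity([math.log(0.25)] * 8), 6))  # 4.0
```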
78
+
79
+ ## Features
80
+
81
+ - **Thinking mode**: Full `<think>` / `</think>` reasoning support, toggled via the `enable_thinking` parameter
82
+ - **Non-thinking mode**: Direct responses without chain-of-thought overhead
83
+ - **Tool calling**: Full function/tool calling support
84
+ - **System prompts**: Standard system message support
85
+ - **Chat template**: Qwen3 ChatML template embedded in the GGUF
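
As an illustration of what the thinking toggle changes in the prompt, the sketch below assembles a ChatML prompt by hand. `build_chatml_prompt` is a hypothetical helper, and the empty `<think>` block it pre-fills when thinking is disabled is an assumption modeled on the upstream Qwen3 chat template; the template embedded in the GGUF remains the authoritative version.

```python
def build_chatml_prompt(messages: list[dict], enable_thinking: bool = True) -> str:
    """Render messages as ChatML and open the assistant turn."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    parts.append("<|im_start|>assistant\n")
    if not enable_thinking:
        # Assumption: an empty think block tells the model to answer directly.
        parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]
print(build_chatml_prompt(messages, enable_thinking=False))
```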
86
+
87
+ ## Usage
88
+
89
+ ### With Ollama
90
+
91
+ ```bash
92
+ # Download the GGUF and create from Modelfile
93
+ cat > Modelfile << 'EOF'
94
+ FROM ./Qwen3-g023-tiny-v1-Q4_K_M.gguf
95
+
96
+ PARAMETER temperature 0.6
97
+ PARAMETER top_p 0.95
98
+ PARAMETER top_k 20
99
+ PARAMETER min_p 0.0
100
+
101
+ TEMPLATE """{{- if .System }}
102
+ <|im_start|>system
103
+ {{ .System }}<|im_end|>
104
+ {{ end }}
105
+ {{- range .Messages }}
106
+ {{- if eq .Role "user" }}
107
+ <|im_start|>user
108
+ {{ .Content }}<|im_end|>
109
+ {{- else if eq .Role "assistant" }}
110
+ <|im_start|>assistant
111
+ {{ .Content }}<|im_end|>
112
+ {{- end }}
113
+ {{- end }}
114
+ <|im_start|>assistant
115
+ """
116
+ SYSTEM "You are a helpful assistant."
117
+ EOF
118
+
119
+ ollama create qwen3-tiny-v1 -f Modelfile
120
+ ollama run qwen3-tiny-v1
121
+ ```
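
Once the model has been created, it can also be queried over Ollama's local HTTP API instead of the interactive CLI. This is a minimal sketch using only the standard library; it assumes the server is listening on the default `localhost:11434` and that the model was created under the name `qwen3-tiny-v1` as in the commands above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama address


def build_chat_request(prompt: str, model: str = "qwen3-tiny-v1") -> urllib.request.Request:
    """Build a non-streaming chat request for the Ollama /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "qwen3-tiny-v1") -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    return body["message"]["content"]


# Requires a running Ollama server:
# print(chat("What is the capital of France?"))
```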
122
+
123
+ ### With llama.cpp
124
+
125
+ ```bash
126
+ # Interactive chat
127
+ llama-cli -m Qwen3-g023-tiny-v1-Q4_K_M.gguf \
128
+ --chat-template chatml -cnv
129
+
130
+ # Thinking mode
131
+ llama-cli -m Qwen3-g023-tiny-v1-Q4_K_M.gguf \
132
+ -p "<|im_start|>user\nExplain quantum computing<|im_end|>\n<|im_start|>assistant\n<think>\n" \
133
+ -n 512
134
+
135
+ # Non-thinking mode
136
+ llama-cli -m Qwen3-g023-tiny-v1-Q4_K_M.gguf \
137
+ -p "<|im_start|>user\n/no_think What is 2+2?<|im_end|>\n<|im_start|>assistant\n" \
138
+ -n 128
139
+ ```
140
+
141
+ ### With Python (llama-cpp-python)
142
+
143
+ ```python
144
+ from llama_cpp import Llama
145
+
146
+ model = Llama("Qwen3-g023-tiny-v1-Q4_K_M.gguf", n_ctx=4096)
147
+ response = model.create_chat_completion(
148
+ messages=[
149
+ {"role": "system", "content": "You are a helpful assistant."},
150
+ {"role": "user", "content": "What is the capital of France?"},
151
+ ],
152
+ temperature=0.6,
153
+ )
154
+ print(response["choices"][0]["message"]["content"])
155
+ ```
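
In thinking mode the returned text may include a `<think>...</think>` block ahead of the answer. A small post-processing sketch for separating the two is shown below; `split_thinking` is an illustrative helper, and it assumes the tags appear literally in the response string.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a raw completion."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Drop the think block and keep whatever surrounds it as the answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4.")
print(answer)  # The answer is 4.
```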
156
+
157
+ ## System Requirements
158
+
159
+ | Quantization | RAM (CPU) | VRAM (GPU) |
160
+ |:---:|:---:|:---:|
161
+ | Q8_0 | ~2.0 GB | ~2.0 GB |
162
+ | Q6_K | ~1.7 GB | ~1.7 GB |
163
+ | Q4_K_M | ~1.3 GB | ~1.3 GB |
164
+ | Q3_K_M | ~1.1 GB | ~1.1 GB |
165
+ | Q2_K | ~0.9 GB | ~0.9 GB |
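
The figures in this table sit close to the file sizes, so the KV cache for the chosen context length is probably best budgeted on top of them. A rough sketch of that estimate follows, using the layer and head counts from Model Details and assuming a 16-bit KV cache; runtime compute buffers are ignored.

```python
def kv_cache_bytes(
    n_ctx: int,
    n_layers: int = 27,
    n_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Size of the key and value caches for a given context length."""
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


Q4_K_M_FILE_BYTES = 1_075_294_304  # from the Git LFS pointer in this commit

n_ctx = 4096
cache = kv_cache_bytes(n_ctx)
print(f"KV cache at {n_ctx} tokens: {cache / 2**20:.0f} MiB")  # 432 MiB
print(f"Weights + KV cache: {(Q4_K_M_FILE_BYTES + cache) / 1e9:.2f} GB")
```

The cache grows linearly with context, so the full 40,960-token window needs ten times the 4,096-token figure.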
166
+
167
+ ## v1 vs v2
168
+
169
+ This model (v1) is the **Phase 1 champion**, focused on surgical precision with minimal operations.
170
+
171
+ | | v1 (this model) | [v2](https://huggingface.co/g023/Qwen3-g023-tiny-v2-GGUF) |
172
+ |:---|:---:|:---:|
173
+ | Layers | 27 | 30 |
174
+ | Parameters | ~1.67B | ~1.82B |
175
+ | Operations | del + swap | swap + interpolate + bridge |
176
+ | Score | 92.9 / 100 | 94.3 / 100 |
177
+ | Factual | 100% (17/17) | 94% (16/17) |
178
+ | Perplexity | 15.70 | 15.17 |
179
+ | Use Case | Max factual accuracy | Max overall score |
180
+
181
+ **v1** is recommended when factual accuracy is paramount (100% vs 94%).
182
+ **v2** is recommended when overall quality matters more (94.3 vs 92.9).
183
+
184
+ ## Methodology
185
+
186
+ Layer surgery was performed through a systematic, test-driven development process:
187
+
188
+ 1. **Phase 1**: Exhaustive search across 150+ configurations testing deletion, duplication, swapping, interpolation, and combined operations
189
+ 2. **Evaluation**: Each configuration was scored on factual accuracy (17 questions), completion coherence, perplexity, repetition ratio, and thinking mode functionality
190
+ 3. **Selection**: The champion was selected based on overall score, with factual accuracy as a tiebreaker
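
The selection rule in step 3 (highest overall score, with factual accuracy breaking ties) can be expressed as a tuple-keyed maximum. The sketch below uses the two scored rows from the benchmark table plus one made-up candidate, labeled hypothetical, to exercise the tiebreak; it is not the author's selection code.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    overall: float        # overall score out of 100
    factual_correct: int  # factual questions answered correctly, out of 17


candidates = [
    Candidate("original-28L", 87.8, 15),
    Candidate("v1-27L", 92.9, 17),
    Candidate("hypothetical-tie", 92.9, 16),  # made-up row to exercise the tiebreak
]


def pick_champion(cands: list[Candidate]) -> Candidate:
    """Highest overall score wins; factual accuracy breaks ties."""
    return max(cands, key=lambda c: (c.overall, c.factual_correct))


print(pick_champion(candidates).name)  # v1-27L
```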
191
+
192
+ The surgery framework is available in the [source repository](https://huggingface.co/g023/Qwen3-g023-tiny-v1-GGUF).
193
+
194
+ ## Credits
195
+
196
+ - **Base model**: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by the Qwen team at Alibaba
197
+ - **Quantization**: llama.cpp
198
+ - **Surgery**: g023
199
+
200
+ ## License
201
+
202
+ Apache 2.0, the same as the original Qwen3-1.7B model.