---
tags:
- 7b
- Chinese
- English
- android
- apple-silicon
- code
- compensation-lora
- continuum
- distillation
- edge-inference
- efficient
- embedded
- experiential-plasticity
- forge-alloy
- forged
- general
- general-purpose
- head-pruning
- iphone
- llama-cpp
- lm-studio
- local-inference
- lora
- macbook
- mobile
- neural-plasticity
- ollama
- on-device
- optimized
- pruned
- qwen
- qwen2.5
- raspberry-pi
- sentinel-ai
- text-generation
- validation-artifact
- versatile
base_model: Qwen/Qwen2.5-Coder-7B
pipeline_tag: text-generation
license: apache-2.0
---

# 12% Pruned, 61.0 HumanEval (base 62.2)

**Qwen2.5-Coder-7B** forged through Experiential Plasticity, then recovered to within calibration tolerance of the unmodified base via a KL-distillation compensation LoRA.

- **HumanEval**: 61.0 (base 62.2, Δ -1.2)
- **HumanEval+**: 53.0 (base 53.7, Δ -0.7)


<p align="center">
<a href="https://cambriantech.github.io/forge-alloy/verify/#c92083286a04544b">
<img src="alloy-qr.png" alt="Verify Chain of Custody" width="160"/>
</a>
</p>

<p align="center">
<a href="https://cambriantech.github.io/forge-alloy/verify/#c92083286a04544b"><b>Every claim on this card is verified</b></a><br>
<b>Trust: self-attested</b> · 2 benchmarks · 1 device tested<br>
<a href="https://github.com/CambrianTech/forge-alloy">ForgeAlloy</a> chain of custody · <a href="v2-7b-coder-compensated.alloy.json">Download alloy</a> · Merkle-chained
</p>

---

## About this model

Methodology validation artifact for the v2 forge pipeline plus a KL-distillation compensation LoRA. It demonstrates that aggressive head pruning, activation-metric importance scoring, and pad-mode defrag, when paired with output-distribution distillation against the unmodified teacher, recover near-base HumanEval capability (61.0 vs 62.2 base, within calibration tolerance). This is the empirical anchor for PLASTICITY-COMPACTION §4.1.3.3 and for the loss-function ablation that closes the §4.1.3.2 PPL/HumanEval disconnect.

This is **not** a Pareto improvement over the unmodified base 7B at any single VRAM tier. It is published as proof that the methodology stack works end-to-end, in preparation for the Qwen3.5-35B-A3B and 397B-A17B forges, where the pruning dimension actually wins.

## The Journey

This artifact is the punchline of a four-run experimental sequence on the same base model. The first run scored **50.0**; the final run scored **61.0**. Each run between them isolated a single variable, and each result narrowed the design space to the structural fix that recovered near-base capability.

| Run | Configuration | HumanEval pass@1 |
|---|---|---|
| 1 | broken global-flat L2-weight | **50.0** |
| 2 | layer-normalized activation, 1-cycle 500-step | **54.9** |
| 3 | layer-normalized activation, 3-cycle (ablation) | **46.3** |
| 4 | 1-cycle + KL compensation LoRA | **61.0** |

## Loss Function Ablation

The compensation LoRA was run twice with identical configuration, varying only the distillation loss. The result is a substantive methodology finding in its own right:

| Distillation loss | HumanEval | HumanEval+ | Outcome |
|---|---|---|---|
| `mse_hidden` | **0.0** | **0.0** | degenerate fixed point — model collapsed to outputting '0' |
| `kl_logits` | **61.0** | **53.0** | near-base recovery within calibration tolerance |

MSE-on-hidden-states has a degenerate fixed point: the student can satisfy the loss by collapsing some downstream computation, regardless of whether the hidden states encode useful information. KL-on-output-logits has none, because matching the teacher's output distribution directly constrains task-level behavior. **For autoregressive language models, distillation must operate at the output layer, not at intermediate residual streams.**
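The distinction can be made concrete. The following is an illustrative sketch of the `kl_logits` objective on toy values, not the forge implementation (which presumably operates on batched tensors over the full vocabulary):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a single vocab-sized logit vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_logits(teacher_logits, student_logits):
    """Forward KL(teacher || student) over the output distributions.

    Matching the teacher's output distribution directly constrains
    task-level behavior: a student that collapses its predictions
    cannot drive this loss to zero.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]

# Matching the teacher drives the loss to (near) zero...
assert abs(kl_logits(teacher, [2.0, 0.5, -1.0])) < 1e-12
# ...while a collapsed, near-uniform student pays a positive penalty.
assert kl_logits(teacher, [0.0, 0.0, 0.0]) > 0.1
```

An MSE-on-hidden-states loss has no such guarantee: it is computed upstream of the output head, so it can be small while downstream computation has collapsed.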


## Benchmarks

| Benchmark | Score | Base | Δ | Verified |
|---|---|---|---|---|
| **humaneval** | **61.0** | 62.2 | -1.2 | ✅ Result hash |
| **humaneval_plus** | **53.0** | 53.7 | -0.7 | ✅ Result hash |


## What Changed (Base → Forged)

| | Base | Forged | Delta |
|---|---|---|---|
| **Pruning** | None | 12% heads (activation-magnitude) | **-12%** params ✅ |
| **compensation-lora** | None | rank=16 | q_proj, k_proj, v_proj, o_proj... |
| **Pipeline** | — | prune → lora → lora → eval | 1 cycle |
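The "12% heads (activation-magnitude)" row can be illustrated. Below is a hedged sketch of layer-normalized activation-magnitude head importance; names, shapes, and the tie-breaking are illustrative, and the actual metric lives in the sentinel-ai forge code:

```python
def head_importance(activations_by_layer):
    """Score attention heads by mean |activation|, normalized per layer.

    activations_by_layer: list (layers) of lists (heads) of mean
    absolute activation magnitudes. Per-layer normalization keeps
    layers with larger raw activation scales from dominating the
    global ranking (the spirit of the §4.1.3.1 fix over the broken
    global-flat metric of run 1).
    """
    scores = []
    for layer_idx, heads in enumerate(activations_by_layer):
        total = sum(heads) or 1.0
        for head_idx, magnitude in enumerate(heads):
            scores.append((magnitude / total, layer_idx, head_idx))
    return scores

def select_prune(scores, fraction=0.12):
    """Return the (layer, head) pairs with the lowest normalized scores."""
    k = int(len(scores) * fraction)
    return [(layer, head) for _, layer, head in sorted(scores)[:k]]

# Two toy layers with very different raw scales. Without per-layer
# normalization, every head of the small-scale second layer would be
# pruned before any head of the first.
acts = [[9.0, 1.0, 8.0, 7.0], [0.9, 0.8, 0.05, 0.7]]
pruned = select_prune(head_importance(acts), fraction=0.25)
# Pruning picks the genuinely weak heads from each layer instead.
```

The toy example prunes layer 1's head 2 and layer 0's head 1, the two heads that are weak relative to their own layers.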

## Runs On

| Device | Format | Size | Speed |
|--------|--------|------|-------|
| **NVIDIA GeForce RTX 5090** | fp16 | — | Verified |
| MacBook Pro 32GB | fp16 | 8.0GB | Expected |
| MacBook Air 16GB | Q8_0 | ~4.0GB | Expected |
| MacBook Air 8GB | Q4_K_M | ~2.5GB | Expected |
| iPhone / Android | Q4_K_M | ~2.5GB | Expected |

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "continuum-ai/v2-7b-coder-compensated",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("continuum-ai/v2-7b-coder-compensated")

inputs = tokenizer("def merge_sort(arr):", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```


## How It Was Made

```
prune → lora → lora → eval (1 cycle)
```

- **Pruning**: 12% heads via `activation-magnitude`, layer-normalized, pad-mode defrag
  > Layer-normalized activation-magnitude head importance (PLASTICITY-COMPACTION §4.1.3.1 fix). Pad-mode defrag preserves the q_proj invariant `num_q_heads * head_dim == hidden_size` so the artifact loads in llama.cpp (Finding 6 fix from VALIDATED-TENSOR-SURGERY).
- **lora**: rank ?, 500 steps
  > Single-cycle code-domain LoRA fine-tuning on the pruned student. 1-cycle ablation chosen because the 3-cycle multi-cycle test surfaced the §4.1.3.2 PPL/HumanEval disconnect (54.9 → 46.3 across cycles).
- **compensation-lora**: rank 16, 500 steps, `kl_logits` distillation against `Qwen/Qwen2.5-Coder-7B`
  > PLASTICITY-COMPACTION §4.1.3.3. KL divergence on output logits is the structural fix for the §4.1.3.2 disconnect. Loss-function ablation: MSE-on-hidden-states collapsed the model to 0.0 (degenerate fixed point); KL-on-logits recovered to 61.0. LoRA adapter merged into student weights at save time so inference-time VRAM and tokens/sec are unchanged from the un-compensated student.
- **Calibrated evaluation**: anchored against `Qwen2.5-Coder-7B` (published 61.6, measured 62.2, ±3.0pt tolerance)
  > All HumanEval numbers are anchor-calibrated against the unmodified Qwen2.5-Coder-7B base measured on the same hardware/pipeline in the same run. Hard-fail tolerance: ±3.0 points. Anchor delta: +0.6/+0.7 vs Qwen-published 61.6/53.0, deterministic across 6+ independent runs.
- **Hardware**: NVIDIA GeForce RTX 5090
- **Forge tool**: [Continuum](https://github.com/CambrianTech/continuum) Factory + [sentinel-ai](https://github.com/CambrianTech/sentinel-ai)
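
The compensation adapter adds no inference cost because it is folded into the dense weights at save time. The merge itself is plain arithmetic; a minimal numpy sketch, with illustrative dimensions and scaling (only rank=16 comes from the card):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 32, 32, 16, 32  # rank=16 per the card; rest illustrative

W = rng.standard_normal((d_out, d_in))   # frozen pruned-student weight
A = rng.standard_normal((rank, d_in))    # LoRA down-projection
B = rng.standard_normal((d_out, rank))   # LoRA up-projection
scale = alpha / rank

# At save time the low-rank update folds into the dense weight...
W_merged = W + scale * (B @ A)

# ...so the merged forward pass equals base + adapter with no extra
# matmuls, leaving inference-time VRAM and tokens/sec unchanged.
x = rng.standard_normal(d_in)
assert np.allclose(W_merged @ x, W @ x + scale * (B @ (A @ x)))
```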

## Limitations

- This model is currently a methodology demonstration rather than a Pareto-optimal artifact at any specific hardware tier. For production code workloads on smaller hardware, the unmodified Qwen2.5-Coder-7B at standard quantization (Q4_K_M / Q5_K_M / Q8_0) may be a better fit pending the larger Qwen3.5+ forges that exercise the pruning dimension where this methodology actually wins.
- Validated on HumanEval / HumanEval+ for English-language Python code completion. Performance on other programming languages, code paradigms (functional, embedded, kernel), or code-adjacent domains (SQL, regex, shell) has not been measured.
- Ships as fp16 only. GGUF quantization tiers (Q5_K_S / Q3_K_M / Q2_K) are not yet published for this artifact. The per-tier comparison in the development log showed that base+quant dominates v2+quant at every VRAM tier on the same 7B base, which is why this methodology validation uses fp16 and production GGUF publishes are reserved for the Qwen3.5+ forges where that dimension flips.
- Vision modality not yet wired in. The Continuum sensory architecture treats vision as first-class for personas, but this 7B coder artifact is text-only.


## Chain of Custody

Scan the QR or [verify online](https://cambriantech.github.io/forge-alloy/verify/#c92083286a04544b). Download the [alloy file](v2-7b-coder-compensated.alloy.json) to verify independently.

| What | Proof |
|------|-------|
| Forged on | NVIDIA GeForce RTX 5090, ? |
| Published | [huggingface](https://huggingface.co/continuum-ai/v2-7b-coder-compensated) — 2026-04-08T05:02:57.072577+00:00 |
| Trust level | [`self-attested`](https://github.com/CambrianTech/forge-alloy/blob/main/docs/ATTESTATION.md) |
| Spec | [ForgeAlloy](https://github.com/CambrianTech/forge-alloy) — Rust/Python/TypeScript |

## Make Your Own

Forged with [Continuum](https://github.com/CambrianTech/continuum) — a distributed AI world that runs on your hardware.

<p align="center">
<a href="https://github.com/CambrianTech/continuum"><img src="https://raw.githubusercontent.com/CambrianTech/continuum/main/docs/images/factory.png" alt="Continuum Model Factory" width="400"/></a>
</p>

The Factory configurator lets you design and forge custom models visually — context extension, pruning, LoRA, quantization, vision/audio modalities. Pick your target devices, the system figures out what fits.

[GitHub](https://github.com/CambrianTech/continuum) · [All Models](https://huggingface.co/continuum-ai) · [Forge-Alloy](https://github.com/CambrianTech/forge-alloy)

## License

apache-2.0