---
tags:
- 7b
- Chinese
- English
- android
- apple-silicon
- code
- compensation-lora
- continuum
- distillation
- edge-inference
- efficient
- embedded
- experiential-plasticity
- forge-alloy
- forged
- general
- general-purpose
- head-pruning
- iphone
- llama-cpp
- lm-studio
- local-inference
- lora
- macbook
- mobile
- neural-plasticity
- ollama
- on-device
- optimized
- pruned
- qwen
- qwen2.5
- raspberry-pi
- sentinel-ai
- text-generation
- validation-artifact
- versatile
base_model: Qwen/Qwen2.5-Coder-7B
pipeline_tag: text-generation
license: apache-2.0
---
# 12% Pruned, 61.0 HumanEval (base 62.2)
**Qwen2.5-Coder-7B** forged through Experiential Plasticity and recovered to within calibration tolerance of the unmodified base via a KL-distillation compensation LoRA.
- **HumanEval**: 61.0 (base 62.2, Δ -1.2)
- **HumanEval+**: 53.0 (base 53.7, Δ -0.7)
<p align="center">
<a href="https://cambriantech.github.io/forge-alloy/verify/#c92083286a04544b">
<img src="alloy-qr.png" alt="Verify Chain of Custody" width="160"/>
</a>
</p>
<p align="center">
<a href="https://cambriantech.github.io/forge-alloy/verify/#c92083286a04544b"><b>Every claim on this card is verified</b></a><br>
<b>Trust: self-attested</b> · 2 benchmarks · 1 device tested<br>
<a href="https://github.com/CambrianTech/forge-alloy">ForgeAlloy</a> chain of custody · <a href="v2-7b-coder-compensated.alloy.json">Download alloy</a> · Merkle-chained
</p>
---
## About this model
A methodology validation artifact for the v2 forge pipeline plus KL-distillation compensation LoRA. It demonstrates that aggressive head pruning, activation-metric importance, and pad-mode defrag, when paired with output-distribution distillation against the unmodified teacher, recover near-base HumanEval capability (61.0 vs 62.2 base, within calibration tolerance). This is the empirical anchor for PLASTICITY-COMPACTION §4.1.3.3 and for the loss-function ablation that closes the §4.1.3.2 PPL/HumanEval disconnect. It is NOT a Pareto improvement over the unmodified base 7B at any single VRAM tier; it is published as proof that the methodology stack works end-to-end, in preparation for the Qwen3.5-35B-A3B and 397B-A17B forges, where the pruning dimension actually wins.
## The Journey
This artifact is the punchline of a four-run experimental sequence on the same base model. The first run scored **50.0**; the final run scored **61.0**. Each run between them isolated a single variable, and each result narrowed the design space to the structural fix that recovered near-base capability.
| Run | Configuration | HumanEval pass@1 |
|---|---|---|
| 1 | broken global-flat L2-weight | **50.0** |
| 2 | layer-normalized activation, 1-cycle 500-step | **54.9** |
| 3 | layer-normalized activation, 3-cycle (ablation) | **46.3** |
| 4 | 1-cycle + KL compensation LoRA | **61.0** |
## Loss Function Ablation
The compensation LoRA was run twice with identical configuration, varying only the distillation loss. The result is a substantive methodology finding in its own right:
| Distillation loss | HumanEval | HumanEval+ | Outcome |
|---|---|---|---|
| `mse_hidden` | **0.0** | **0.0** | degenerate fixed point — model collapsed to outputting '0' |
| `kl_logits` | **61.0** | **53.0** | near-base recovery within calibration tolerance |
MSE-on-hidden-states has a degenerate fixed point: the student can satisfy the loss by collapsing some downstream computation, regardless of whether the hidden states encode useful information. KL-on-output-logits has none, because matching the teacher's output distribution directly constrains task-level behavior. **For autoregressive language models, distillation must operate at the output layer, not at intermediate residual streams.**
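The `kl_logits` objective can be sketched as follows. This is an illustrative PyTorch implementation of KL-on-output-logits distillation, not the forge pipeline's actual code; the `temperature` knob and function name are assumptions, since the run's exact hyperparameters are not published on this card.

```python
import torch
import torch.nn.functional as F

def kl_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the output vocabulary distribution.

    Matching the teacher's output distribution directly constrains
    task-level behavior; MSE on intermediate hidden states does not,
    which is why it admits the degenerate collapsed solution.
    """
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # batchmean averages over the batch dimension; the t^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```

The loss is zero exactly when the student's output distribution matches the teacher's, and strictly positive otherwise, so a collapsed student cannot satisfy it.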
## Benchmarks
| Benchmark | Score | Base | Δ | Verified |
|---|---|---|---|---|
| **humaneval** | **61.0** | 62.2 | -1.2 | ✅ Result hash |
| **humaneval_plus** | **53.0** | 53.7 | -0.7 | ✅ Result hash |
## What Changed (Base → Forged)
| | Base | Forged | Delta |
|---|---|---|---|
| **Pruning** | None | 12% heads (activation-magnitude) | **-12%** params ✅ |
| **Compensation LoRA** | None | rank=16 (q_proj, k_proj, v_proj, o_proj...) | merged at save time |
| **Pipeline** | — | prune → lora → lora → eval | 1 cycle |
## Runs On
| Device | Format | Size | Status |
|--------|--------|------|--------|
| **NVIDIA GeForce RTX 5090** | fp16 | — | Verified |
| MacBook Pro 32GB | fp16 | 8.0GB | Expected |
| MacBook Air 16GB | Q8_0 | ~4.0GB | Expected |
| MacBook Air 8GB | Q4_K_M | ~2.5GB | Expected |
| iPhone / Android | Q4_K_M | ~2.5GB | Expected |
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "continuum-ai/v2-7b-coder-compensated"
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo)

inputs = tokenizer("def merge_sort(arr):", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
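Since the base model supports fill-in-the-middle code infill via its FIM control tokens, the same `generate` call can complete a gap between a prefix and a suffix. The prompt-builder below is a sketch assuming the base tokenizer's Qwen2.5-Coder FIM tokens survive the forge unchanged; the helper name is illustrative.

```python
def build_fim_prompt(prefix, suffix):
    """Assemble a Qwen2.5-Coder fill-in-the-middle prompt.

    The model generates the code that belongs between `prefix` and
    `suffix` after the <|fim_middle|> marker.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    "def fibonacci(n):\n    ",
    "\n    return a",
)
# Feed `prompt` through tokenizer/model.generate exactly as in Quick Start.
```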
## How It Was Made
```
prune → lora → lora → eval (1 cycle)
```
- **Pruning**: 12% heads via `activation-magnitude`, layer-normalized, pad-mode defrag
> Layer-normalized activation-magnitude head importance (PLASTICITY-COMPACTION §4.1.3.1 fix). Pad-mode defrag preserves the q_proj invariant num_q_heads*head_dim==hidden_size so the artifact loads in llama.cpp (Finding 6 fix from VALIDATED-TENSOR-SURGERY).
- **lora**: rank ?, 500 steps
> Single-cycle code-domain LoRA fine-tuning on the pruned student. 1-cycle ablation chosen because the 3-cycle multi-cycle test surfaced the §4.1.3.2 PPL/HumanEval disconnect (54.9 → 46.3 across cycles).
- **compensation-lora**: rank 16, 500 steps, `kl_logits` distillation against `Qwen/Qwen2.5-Coder-7B`
> PLASTICITY-COMPACTION §4.1.3.3. KL divergence on output logits is the structural fix for the §4.1.3.2 disconnect. Loss-function ablation: MSE-on-hidden-states collapsed the model to 0.0 (degenerate fixed point); KL-on-logits recovered to 61.0. LoRA adapter merged into student weights at save time so inference-time VRAM and tokens/sec are unchanged from the un-compensated student.
- **Calibrated evaluation**: anchored against `Qwen2.5-Coder-7B` (published 61.6, measured 62.2, ±3.0pt tolerance)
> All HumanEval numbers are anchor-calibrated against the unmodified Qwen2.5-Coder-7B base measured on the same hardware/pipeline in the same run. Hard-fail tolerance: ±3.0 points. Anchor delta: +0.6/+0.7 vs Qwen-published 61.6/53.0, deterministic across 6+ independent runs.
- **Hardware**: NVIDIA GeForce RTX 5090
- **Forge tool**: [Continuum](https://github.com/CambrianTech/continuum) Factory + [sentinel-ai](https://github.com/CambrianTech/sentinel-ai)
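The layer-normalized activation-magnitude importance metric from the pruning step can be sketched as follows. This is a minimal illustration of the §4.1.3.1 idea, not the pipeline's actual interface; the function names, the per-layer activation input format, and the tie-breaking behavior are all assumptions.

```python
import torch

def layer_normalized_head_importance(head_activations):
    """Score attention heads by mean |activation|, normalized per layer.

    head_activations: one tensor per layer, shaped [num_heads, ...],
    holding sampled head outputs on calibration data. Normalizing within
    each layer stops layers with large activation scales from shielding
    all of their heads from pruning.
    Returns (layer, head, score) triples sorted ascending by score.
    """
    scored = []
    for layer_idx, acts in enumerate(head_activations):
        per_head = acts.flatten(1).abs().mean(dim=1)   # raw magnitude per head
        per_head = per_head / (per_head.sum() + 1e-8)  # normalize within layer
        for head_idx, score in enumerate(per_head.tolist()):
            scored.append((layer_idx, head_idx, score))
    return sorted(scored, key=lambda t: t[2])

def heads_to_prune(head_activations, fraction=0.12):
    """Select the globally lowest-scoring fraction of heads."""
    ranked = layer_normalized_head_importance(head_activations)
    k = int(len(ranked) * fraction)
    return [(layer, head) for layer, head, _ in ranked[:k]]
```

The pad-mode defrag step described above would then remove the selected heads while preserving the `num_q_heads*head_dim == hidden_size` invariant; that surgery is model-specific and is not sketched here.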
## Limitations
- This model is a methodology demonstration, not a Pareto-optimal artifact at any specific hardware tier. For production code workloads on smaller hardware, the unmodified Qwen2.5-Coder-7B at standard quantization (Q4_K_M / Q5_K_M / Q8_0) is likely the better fit until the larger Qwen3.5+ forges exercise the pruning dimension where this methodology actually wins.
- Validated on HumanEval / HumanEval+ for English-language Python code completion. Performance on other programming languages, code paradigms (functional, embedded, kernel), or code-adjacent domains (SQL, regex, shell) has not been measured.
- Ships as fp16 only. GGUF quantization tiers (Q5_K_S / Q3_K_M / Q2_K) are not yet published for this artifact; the per-tier comparison from the development log showed base+quant dominates v2+quant at every VRAM tier on the same 7B base, which is why the methodology validation here uses fp16 and the production GGUF publishes are reserved for the Qwen3.5+ forges where the dimension flips.
- Vision modality not yet wired in. The Continuum sensory architecture treats vision as first-class for personas, but this 7B coder artifact is text-only.
## Chain of Custody
Scan the QR or [verify online](https://cambriantech.github.io/forge-alloy/verify/#c92083286a04544b). Download the [alloy file](v2-7b-coder-compensated.alloy.json) to verify independently.
| What | Proof |
|------|-------|
| Forged on | NVIDIA GeForce RTX 5090, ? |
| Published | [huggingface](https://huggingface.co/continuum-ai/v2-7b-coder-compensated) — 2026-04-08T05:02:57.072577+00:00 |
| Trust level | [`self-attested`](https://github.com/CambrianTech/forge-alloy/blob/main/docs/ATTESTATION.md) |
| Spec | [ForgeAlloy](https://github.com/CambrianTech/forge-alloy) — Rust/Python/TypeScript |
## Make Your Own
Forged with [Continuum](https://github.com/CambrianTech/continuum) — a distributed AI world that runs on your hardware.
<p align="center">
<a href="https://github.com/CambrianTech/continuum"><img src="https://raw.githubusercontent.com/CambrianTech/continuum/main/docs/images/factory.png" alt="Continuum Model Factory" width="400"/></a>
</p>
The Factory configurator lets you design and forge custom models visually — context extension, pruning, LoRA, quantization, vision/audio modalities. Pick your target devices, the system figures out what fits.
[GitHub](https://github.com/CambrianTech/continuum) · [All Models](https://huggingface.co/continuum-ai) · [Forge-Alloy](https://github.com/CambrianTech/forge-alloy)
## License
apache-2.0