Upload README.md with huggingface_hub
README.md CHANGED
@@ -1,182 +1,62 @@
---
language:
- en
- zh
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- head-pruning
- continuum
- code
- code-generation
- coding
- coder
- programming
- software-engineering
- local-inference
- efficient
- optimized
- pruned
- 14b
base_model:
- Qwen/Qwen2.5-Coder-14B
datasets:
- m-a-p/CodeFeedback-Filtered-Instruction
---

# qwen2.5-coder-14b-code-forged

**Optimized through Experiential Plasticity.** Forged from [Qwen/Qwen2.5-Coder-14B](https://huggingface.co/Qwen/Qwen2.5-Coder-14B) for **code** tasks.

**Not quantized. Not distilled. Structurally reshaped.**

The result: a model that fits on consumer GPUs (RTX 3090/4090/5090) while retaining the specialized knowledge of a much larger model.

## Results

| Metric | Value |
|--------|-------|
| Pruning Level | 30% |
| Cycles | 3 |
| Steps/Cycle | 1000 |

## Runs On

| Device | Format | Verified |
|--------|--------|----------|
| MacBook Pro 16GB | fp16 | Yes |
| MacBook Pro 32GB | fp16 | Yes |

These models are designed for **consumer hardware**. No A100s required. Your MacBook, your gaming PC, your home server.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "continuum-ai/qwen2.5-coder-14b-code-forged",
    torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("continuum-ai/qwen2.5-coder-14b-code-forged")

inputs = tokenizer("Write a Python decorator that caches results:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Forge Your Own

Three commands. Any NVIDIA GPU with 8GB+ VRAM.

```bash
python scripts/forge_model.py Qwen/Qwen2.5-Coder-14B --domain code
```

The forge script auto-detects your GPU, picks the right memory tier (fp16 or 4-bit NF4), trains with LoRA + AMP, prunes attention heads, defrags, and saves. Progress is observable via `status.json`.
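The tier choice can be sketched as a simple VRAM budget check (illustrative only; the function name, bytes-per-parameter estimates, and headroom factor are assumptions, not the forge script's actual logic):

```python
def pick_memory_tier(vram_gb: float, model_params_b: float) -> str:
    """Illustrative tier selection: fp16 needs ~2 bytes/param plus
    headroom; otherwise fall back to 4-bit NF4 (~0.5 bytes/param).
    Thresholds are assumptions, not the real forge logic."""
    fp16_gb = model_params_b * 2.0 * 1.2   # weights + ~20% overhead
    nf4_gb = model_params_b * 0.5 * 1.2
    if vram_gb >= fp16_gb:
        return "fp16"
    if vram_gb >= nf4_gb:
        return "4bit-nf4"
    raise MemoryError("model does not fit even in 4-bit")

# A 14B model: fp16 needs ~33.6 GB, NF4 ~8.4 GB
print(pick_memory_tier(24, 14))  # 4bit-nf4 on a 24GB RTX 4090
```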

## The Science: Experiential Plasticity

Traditional model compression (quantization, distillation) makes models **smaller but worse**. Experiential Plasticity makes them **smaller AND better**.

### How It Works

1. **Train** on domain-specific data (LoRA + AMP mixed precision)
2. **Measure** each attention head's information contribution (entropy-based importance)
3. **Prune** the lowest-contributing heads
4. **Retrain** on the same domain data — surviving heads specialize and compensate
5. **Defrag** — structurally remove dead heads, free VRAM
6. **Repeat** — each cycle the model improves on its domain
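Step 2 can be sketched in plain Python (a toy illustration, not the actual implementation; it assumes you already have each head's attention distributions and uses mean attention entropy as the importance proxy):

```python
import math

def head_entropy(attn_probs):
    """attn_probs: per head, a list of attention rows (each a probability
    distribution over positions). Returns mean row entropy per head (nats)."""
    scores = []
    for head in attn_probs:
        ent = [-sum(p * math.log(p + 1e-12) for p in row) for row in head]
        scores.append(sum(ent) / len(ent))
    return scores

def heads_to_prune(attn_probs, frac=0.3):
    """Indices of the lowest-scoring heads (frac=0.3 matches the 30% level above)."""
    scores = head_entropy(attn_probs)
    k = int(len(scores) * frac)
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]

# toy example: one diffuse head, one near-deterministic head
diffuse = [[1/3, 1/3, 1/3]] * 3
peaked = [[0.98, 0.01, 0.01]] * 3
print(heads_to_prune([diffuse, peaked], frac=0.5))  # [1]: the peaked head scores lowest
```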

### Scaling Law

Larger models harbor more architectural redundancy. Plasticity exploits this — bigger models benefit more:

| Model | Params | Domain | Improvement |
|-------|--------|--------|-------------|
| Qwen2.5-0.5B | 0.5B | General | -3.2% (too small to prune) |
| Qwen2.5-1.5B | 1.5B | General | +3.0% |
| Qwen2.5-7B | 7.6B | General | +11.8% |
| **Qwen3.5-4B** | **3.4B** | **Code** | **+24.0%** |
| **Qwen3.5-27B** | **23.6B** | **Code** | **+3.5%** (4-bit, runs in 17GB) |

Domain-specific training amplifies the effect. Qwen3.5-4B on code (+24%) exceeds Qwen2.5-7B on generic text (+11.8%) despite being a smaller model.

### Transfer Function

Recovery from iterative pruning follows a measurable exponential decay:

```
recovery = 1.45 * exp(-0.18 * cycle) - 0.03
```

This connects transformer optimization to classical control theory — the same mathematics used in electrical engineering and robotics for decades. A PID controller can manage the entire forging process with zero human hyperparameters.
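The decay is easy to evaluate; a quick sanity check using the constants from the formula above:

```python
import math

def recovery(cycle: int) -> float:
    """Predicted recovery after a given pruning cycle (formula above)."""
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

for c in range(1, 4):
    print(f"cycle {c}: recovery = {recovery(c):.3f}")
# recovery shrinks each cycle: ~1.181, ~0.982, ~0.815
```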

### Continuous Defrag

Traditional pruning masks heads but doesn't free memory. Continuous defrag structurally removes dead heads between cycles:

```
Cycle 1: train (batch=1, 27B, 17.9GB) -> prune -> defrag -> freed 1.7GB
Cycle 2: train (batch=2, 24.5B, 16.2GB) -> prune -> defrag -> freed 1.7GB (2x faster)
Cycle 3: train (batch=3, 22B, 14.5GB) -> prune -> defrag (2.8x faster)
```

40% faster total training and a 33% smaller final model.
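The masking-vs-defrag distinction can be illustrated on a single output projection (a toy sketch in plain Python with lists standing in for tensor rows; the real defrag operates on the model's parameter tensors):

```python
def defrag_projection(weight, dead_heads, head_dim):
    """weight: projection rows grouped by head, (num_heads * head_dim) rows.
    Masking would zero the dead heads' rows but keep them allocated;
    defrag drops them, so the tensor actually shrinks."""
    num_heads = len(weight) // head_dim
    return [
        row
        for h in range(num_heads)
        if h not in dead_heads
        for row in weight[h * head_dim : (h + 1) * head_dim]
    ]

# toy layer: 4 heads, head_dim=2, hidden=3 -> 8 rows
weight = [[float(h)] * 3 for h in range(4) for _ in range(2)]
smaller = defrag_projection(weight, dead_heads={1, 3}, head_dim=2)
print(len(weight), "->", len(smaller))  # 8 -> 4 rows: memory actually freed
```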

### Head Mitosis

**Read the full paper**: [Experiential Plasticity: Transformers That Grow Their Own Architecture From Experience](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)

*No generation samples available for this model.*

```json
{
  "model": "Qwen/Qwen2.5-Coder-14B",
  "improvement_pct": 0,
  "baseline_ppl": 0,
  "final_ppl": 0
}
```

- **[Neural Plasticity in Transformers](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SENTINEL-AI-NEURAL-PLASTICITY.md)** — Foundation paper with cross-architecture results
- **[Plasticity Compaction](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md)** — MoE expert pruning (67GB to 14GB)

## Links

- [All published models](https://huggingface.co/continuum-ai)
- [sentinel-ai](https://github.com/CambrianTech/sentinel-ai) — Open source forge framework
- [continuum](https://github.com/CambrianTech/continuum) — Distributed AI on consumer hardware

---
license: apache-2.0
tags:
- code
- qwen2
- compacted
- head-pruning
- continuum
- continuum:compacted
- continuum:head-pruning
language:
- en
base_model: Qwen/Qwen2.5-Coder-14B-Instruct
pipeline_tag: text-generation
---

# Qwen2.5-Coder-14B-Instruct — Compacted (25Q/5KV, Q5_K_S)

A **14-billion parameter** coding model compressed to run on a **16GB MacBook Air**.

## How It Was Built

Continuum's adaptive compression pipeline:

1. **Head Pruning**: 40 Q-heads / 8 KV-heads → 25 Q-heads / 5 KV-heads (37.5% KV cache reduction)
2. **Quantization**: Q5_K_S (5.1 bits per weight)
3. **Result**: 27GB BF16 → 8.9GB GGUF
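The 37.5% figure in step 1 follows directly from the KV-head counts; a back-of-the-envelope check (layer count and head_dim are from the architecture details above; the 4096-token context is an arbitrary example):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    """K and V caches: 2 tensors per layer, each seq_len x n_kv_heads x head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

before = kv_cache_bytes(48, 8, 128, seq_len=4096)  # original: 8 KV heads
after = kv_cache_bytes(48, 5, 128, seq_len=4096)   # pruned: 5 KV heads
print(f"reduction: {1 - after / before:.1%}")  # 37.5%, independent of seq_len
```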

## Performance

| Metric | Value |
|--------|-------|
| Speed | 9.2 tok/s (M1 Pro 32GB, Metal) |
| Memory | ~9 GB |
| Architecture | 48 layers, 25 Q-heads, 5 KV-heads, head_dim=128 |
| Quantization | Q5_K_S (5.1 BPW) |

## How to Run

With [Continuum](https://github.com/cambrian-tech/continuum), which downloads the model automatically:

```bash
# Model alias "coder" resolves to this model
./jtag inference/generate --model=coder --prompt="def fibonacci(n):"
```

## Links

- **[Continuum](https://github.com/cambrian-tech/continuum)** — Local AI runtime
- **[sentinel-ai](https://github.com/cambrian-tech/sentinel-ai)** — Research project
- **[continuum-ai](https://huggingface.co/continuum-ai)** — More models

## License

Apache 2.0

## Part of continuum

[continuum](https://github.com/CambrianTech/continuum) is an open-source AI ecosystem where personas live, work, learn, and evolve on your hardware. Zero API keys required. AGPL-3.0.

Built on the research foundations of [Synthetic Citizens: AI Personas as Persistent, Evolving Entities](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SYNTHETIC-CITIZENS.md) and [Plasticity Compaction: SOTA-to-COTS via MoE Expert Pruning](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md). Our core contribution is **utilization-aware model surgery** — runtime profiling determines exactly which components are active for a target domain, how much each contributes, and what precision each requires. MoE experts, attention heads, and weight precision are all targeted independently based on measured activation patterns, not uniform heuristics. The result: SOTA models surgically reduced to fit consumer hardware with reasoning quality preserved.

[Plasticity Compaction Paper](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md) | [Get started](https://github.com/CambrianTech/continuum)