EnricoFermi committed
Commit ce8463b · verified · 1 Parent(s): 7afe2c9

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +37 -157
README.md CHANGED
@@ -1,182 +1,62 @@
  ---
- language:
- - en
- - zh
  license: apache-2.0
- library_name: transformers
- pipeline_tag: text-generation
  tags:
- - qwen2.5
- - experiential-plasticity
- - forged
- - head-pruning
- - neural-plasticity
- - sentinel-ai
- - continuum
- - safetensors
- - compacted
- - expert-pruning
- - moe-pruning
- - code
- - code-generation
- - coding
- - coder
- - programming
- - software-engineering
- - local-inference
- - efficient
- - optimized
- - pruned
- - 14b
- base_model:
- - Qwen/Qwen2.5-Coder-14B
- datasets:
- - m-a-p/CodeFeedback-Filtered-Instruction
  ---

- # qwen2.5-coder-14b-code-forged
-
- **Optimized through Experiential Plasticity.** Forged from [Qwen/Qwen2.5-Coder-14B](https://huggingface.co/Qwen/Qwen2.5-Coder-14B) for **code** tasks.
-
- **Not quantized. Not distilled. Structurally reshaped.**
 
- The architecture co-evolves with training: heads that contribute to the domain specialize, heads that don't are removed. The result is a model architecturally optimized for its task — like biological synaptic pruning during brain development.
 
- ## Expert Compaction (MoE Pruning)
 
- Mixture-of-Experts models route tokens to specialized sub-networks. Most of these experts are redundant for any given domain. Expert compaction identifies and removes the lowest-importance experts entirely.
 
- **The gate weights are sliced to match** — no special loading code needed. Load with standard HuggingFace transformers.
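The gate-slicing step can be illustrated with a toy router. This is a sketch only, with made-up values and a hypothetical `keep_idx`, not the actual sentinel-ai code: the gate holds one weight row per expert, so removing experts means removing rows, and a standard forward pass then scores only the survivors.

```python
# Toy MoE gate: one weight row per expert (8 experts, hidden size 4).
# Weight values are arbitrary; keep_idx stands in for profiling output.
num_experts, hidden = 8, 4
gate_weight = [[float(e) for _ in range(hidden)] for e in range(num_experts)]

# Suppose importance profiling kept experts {0, 2, 3, 7} for the code domain.
keep_idx = [0, 2, 3, 7]
gate_weight = [gate_weight[e] for e in keep_idx]  # slice rows to match

def gate_scores(x):
    """Score each surviving expert for one token embedding x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in gate_weight]

scores = gate_scores([1.0, 0.0, 0.0, 0.0])
assert len(scores) == len(keep_idx)  # one logit per surviving expert
```

Because the router's output dimension now matches the number of remaining experts, no custom loading code is needed downstream.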
-
- The result: a model that fits on consumer GPUs (RTX 3090/4090/5090) while retaining the specialized knowledge of a much larger model.
-
- ## Results
 
  | Metric | Value |
  |--------|-------|
- | Base Model | [Qwen/Qwen2.5-Coder-14B](https://huggingface.co/Qwen/Qwen2.5-Coder-14B) |
- | Domain | code |
- | Training Data | wikitext-2 |
- | Strategy | combined |
- | Pruning Level | 30% |
- | Cycles | 3 |
- | Steps/Cycle | 1000 |
-
- ## Runs On
-
- | Device | Format | Verified |
- |--------|--------|----------|
- | MacBook Pro 16GB | fp16 | Yes |
- | MacBook Pro 32GB | fp16 | Yes |
-
- These models are designed for **consumer hardware**. No A100s required. Your MacBook, your gaming PC, your home server.
 
- ## Quick Start
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model = AutoModelForCausalLM.from_pretrained("continuum-ai/qwen2.5-coder-14b-code-forged",
-     torch_dtype="auto", device_map="auto")
- tokenizer = AutoTokenizer.from_pretrained("continuum-ai/qwen2.5-coder-14b-code-forged")
-
- inputs = tokenizer("Write a Python decorator that caches results:", return_tensors="pt").to(model.device)
- output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
- print(tokenizer.decode(output[0], skip_special_tokens=True))
- ```
-
- ## Forge Your Own
-
- Three commands. Any NVIDIA GPU with 8GB+ VRAM.

  ```bash
- git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
- source .venv/bin/activate
- python scripts/forge_model.py Qwen/Qwen2.5-Coder-14B --domain code
- ```
-
- The forge script auto-detects your GPU, picks the right memory tier (fp16 / 4-bit NF4), trains with LoRA + AMP, prunes attention heads, defrags, and saves. Progress observable via `status.json`.
-
- ## The Science: Experiential Plasticity
-
- Traditional model compression (quantization, distillation) makes models **smaller but worse**. Experiential Plasticity makes them **smaller AND better**.
-
- ### How It Works
-
- 1. **Train** on domain-specific data (LoRA + AMP mixed precision)
- 2. **Measure** each attention head's information contribution (entropy-based importance)
- 3. **Prune** the lowest-contributing heads
- 4. **Retrain** on the same domain data — surviving heads specialize and compensate
- 5. **Defrag** — structurally remove dead heads, free VRAM
- 6. **Repeat** — each cycle the model improves on its domain
- 6. **Repeat** — each cycle the model improves on its domain
112
-
113
- ### Scaling Law
114
-
115
- Larger models harbor more architectural redundancy. Plasticity exploits this — bigger models benefit more:
116
-
117
- | Model | Params | Domain | Improvement |
118
- |-------|--------|--------|------------|
119
- | Qwen2.5-0.5B | 0.5B | General | -3.2% (too small to prune) |
120
- | Qwen2.5-1.5B | 1.5B | General | +3.0% |
121
- | Qwen2.5-7B | 7.6B | General | +11.8% |
122
- | **Qwen3.5-4B** | **3.4B** | **Code** | **+24.0%** |
123
- | **Qwen3.5-27B** | **23.6B** | **Code** | **+3.5%** (4-bit, runs in 17GB) |
124
-
125
- Domain-specific training amplifies the effect. Qwen3.5-4B on code (+24%) exceeds Qwen2.5-7B on generic text (+11.8%) despite being a smaller model.
126
-
127
- ### Transfer Function
128
-
129
- Recovery from iterative pruning follows a measurable exponential decay:
130
-
131
- ```
132
- recovery = 1.45 * exp(-0.18 * cycle) - 0.03
133
- ```
134
-
135
- This connects transformer optimization to classical control theory — the same mathematics used in electrical engineering and robotics for decades. A PID controller can manage the entire forging process with zero human hyperparameters.
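As a sketch of that idea, a simple PI loop (the D term omitted for brevity) can steer the per-cycle pruning rate against the fitted recovery curve. The gains, setpoint, and rate bounds here are invented for illustration; this is not the actual sentinel-ai controller.

```python
import math

def recovery(cycle):
    """The fitted transfer function quoted above."""
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

# Hypothetical PI controller: prune harder while predicted recovery
# exceeds the setpoint, back off once it dips below.
setpoint, kp, ki = 1.0, 0.2, 0.05
rate, integral = 0.30, 0.0
for cycle in range(1, 6):
    error = recovery(cycle) - setpoint      # > 0: model still recovers fully
    integral += error
    rate = min(0.50, max(0.05, rate + kp * error + ki * integral))
    print(f"cycle {cycle}: recovery={recovery(cycle):.3f}, prune rate={rate:.3f}")
```

The point is the shape of the loop, not the numbers: the decaying recovery curve supplies the feedback signal, so no hand-tuned schedule is needed.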
-
- ### Continuous Defrag
-
- Traditional pruning masks heads but doesn't free memory. Continuous defrag structurally removes dead heads between cycles:
-
  ```
- Cycle 1: train (batch=1, 27B, 17.9GB) -> prune -> defrag -> freed 1.7GB
- Cycle 2: train (batch=2, 24.5B, 16.2GB) -> prune -> defrag -> freed 1.7GB (2x faster)
- Cycle 3: train (batch=3, 22B, 14.5GB) -> prune -> defrag (2.8x faster)
- ```
-
- 40% faster total training and a 33% smaller final model.
-
- ### Head Mitosis
 
- Pruning frees slots. Mitosis fills them. When a head is overutilized, it gets cloned into a pruned slot — each copy at 50% gate value to maintain output continuity. After continued training, the clones **diverge and specialize**, like cell differentiation after biological mitosis. The model grows new specialized capacity exactly where it's needed.
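The continuity claim is easy to verify numerically. This toy assumes gates scale each head's output additively (a hypothetical mechanism for the sketch, with made-up head outputs):

```python
# Toy head mitosis: gates scale each head's contribution, so cloning a
# head into a pruned slot at half gate keeps the summed output unchanged
# before the clones diverge through training.
head_outputs = {"h0": 4.0, "h1": 2.0, "h2": None}   # h2 was pruned
gates        = {"h0": 1.0, "h1": 1.0, "h2": 0.0}

def combined():
    return sum(gates[h] * (head_outputs[h] or 0.0) for h in head_outputs)

before = combined()

# h1 is overutilized: clone it into the freed slot h2, 50% gate each.
head_outputs["h2"] = head_outputs["h1"]
gates["h1"] = gates["h2"] = 0.5

after = combined()
assert before == after == 6.0   # output continuity preserved at the split
```

Because 0.5 + 0.5 = 1.0, the split is exact at the moment of cloning; specialization only appears once subsequent gradient updates push the two copies apart.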
-
- **Read the full paper**: [Experiential Plasticity: Transformers That Grow Their Own Architecture From Experience](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)
 
- ## Output Samples
 
- Generated by the forged model immediately after forging — **no cherry-picking, no post-processing**.
 
- *No generation samples available for this model.*
 
- ## Forging Metadata
 
- ```json
- {
-   "model": "Qwen/Qwen2.5-Coder-14B",
-   "improvement_pct": 0,
-   "baseline_ppl": 0,
-   "final_ppl": 0
- }
- ```
 
- ## Research
 
- - **[Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)** — Scaling law, transfer function, self-directed controller, domain forging, continuous defrag
- - **[Neural Plasticity in Transformers](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SENTINEL-AI-NEURAL-PLASTICITY.md)** — Foundation paper with cross-architecture results
- - **[Plasticity Compaction](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md)** — MoE expert pruning (67GB to 14GB)
-
- ## Links
 
- - [All published models](https://huggingface.co/continuum-ai)
- - [sentinel-ai](https://github.com/CambrianTech/sentinel-ai) — Open source forge framework
- - [continuum](https://github.com/CambrianTech/continuum) — Distributed AI on consumer hardware
  ---
  license: apache-2.0
  tags:
+ - code
+ - qwen2
+ - compacted
+ - head-pruning
+ - continuum
+ - continuum:compacted
+ - continuum:head-pruning
+ language:
+ - en
+ base_model: Qwen/Qwen2.5-Coder-14B-Instruct
+ pipeline_tag: text-generation
  ---

+ # Qwen2.5-Coder-14B-Instruct — Compacted (25Q/5KV, Q5_K_S)

+ A **14-billion parameter** coding model compressed to run on a **16GB MacBook Air**.
+
+ ## How It Was Built
+
+ Continuum's adaptive compression pipeline:
+ 1. **Head Pruning**: 40 Q-heads / 8 KV-heads → 25 Q-heads / 5 KV-heads (37.5% KV cache reduction)
+ 2. **Quantization**: Q5_K_S (5.1 bits per weight)
+ 3. **Result**: 27GB BF16 → 8.9GB GGUF
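The headline numbers can be sanity-checked with back-of-the-envelope arithmetic. The linear KV-cache proportionality is the standard grouped-query-attention relationship; actual file sizes also depend on metadata and mixed-precision tensors.

```python
# KV cache scales linearly with the number of KV heads (GQA), so 8 -> 5 gives:
kv_before, kv_after = 8, 5
kv_reduction = 1 - kv_after / kv_before
assert kv_reduction == 0.375          # the 37.5% KV cache reduction above

# Q heads: 40 -> 25 removes the same fraction of query capacity.
q_reduction = 1 - 25 / 40
assert q_reduction == 0.375

# Quantization alone: BF16 (16 bits/weight) down to Q5_K_S (~5.1 bits/weight).
print(f"~{16 / 5.1:.1f}x smaller from quantization alone")   # prints ~3.1x
```

Quantization and head pruning compound, which is how 27GB of BF16 weights ends up under 9GB on disk.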

+ ## Performance

  | Metric | Value |
  |--------|-------|
+ | Speed | 9.2 tok/s (M1 Pro 32GB, Metal) |
+ | Memory | ~9 GB |
+ | Architecture | 48 layers, 25 Q-heads, 5 KV-heads, head_dim=128 |
+ | Quantization | Q5_K_S (5.1 BPW) |

+ ## How to Run

+ With [Continuum](https://github.com/cambrian-tech/continuum) — downloads automatically:
  ```bash
+ # Model alias "coder" resolves to this model
+ ./jtag inference/generate --model=coder --prompt="def fibonacci(n):"
  ```

+ ## Links

+ - **[Continuum](https://github.com/cambrian-tech/continuum)** — Local AI runtime
+ - **[sentinel-ai](https://github.com/cambrian-tech/sentinel-ai)** — Research project
+ - **[continuum-ai](https://huggingface.co/continuum-ai)** — More models

+ ## License
+ Apache 2.0

+ ## Part of continuum

+ [continuum](https://github.com/CambrianTech/continuum) is an open-source AI ecosystem where personas live, work, learn, and evolve on your hardware. Zero API keys required. AGPL-3.0.

+ Built on the research foundations of [Synthetic Citizens: AI Personas as Persistent, Evolving Entities](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SYNTHETIC-CITIZENS.md) and [Plasticity Compaction: SOTA-to-COTS via MoE Expert Pruning](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md). Our core contribution is **utilization-aware model surgery** — runtime profiling determines exactly which components are active for a target domain, how much each contributes, and what precision each requires. MoE experts, attention heads, and weight precision are all targeted independently based on measured activation patterns, not uniform heuristics. The result: SOTA models surgically reduced to fit consumer hardware with reasoning quality preserved.

+ [Plasticity Compaction Paper](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md) | [Get started](https://github.com/CambrianTech/continuum)