---
license: apache-2.0
language:
- en
- code
tags:
- language-model
- novel-architecture
- code-generation
- mechanistic-interpretability
- scope-registers
- depth-stratified-routing
- python
- duoneural
pipeline_tag: text-generation
---
# CDM-Code-37M
**Competitive Docking Memory: Code Model.** A 37M-parameter language model trained on 200M tokens of Python code (codeparrot/codeparrot-clean-train).
This model spontaneously develops **hierarchical scope registers**, a form of depth-stratified memory routing that mirrors Python's syntactic nesting structure, without any structural supervision. The model was trained only on next-token prediction.
---
## Key Finding: Emergent Scope Registers
Without any explicit supervision about code structure, AST depth, or indentation, CDM develops routing behavior that mirrors syntactic nesting:
| Nesting depth | Routed to slots |
|---------------|----------------|
| Depth 0 (class/module declarations) | Slots 8, 15, 13 |
| Depth 1 (method definitions) | Distributed |
| Depth 2+ (deep nested code) | Slots 7, 3, 5 |
**MI(slot assignment; indentation depth) = 0.1467** at training completion (step 30k). This is confirmed by two independent methods:
1. JSON probe: full routing distributions across the full dataset → MI_ratio = 0.1467
2. Gate-routing probe: single-sample argmax β MI = 0.281 bits
The scope register effect emerges because Python's next-token distribution is depth-dependent: `return` after a method body has very different continuations than `return` inside a nested loop. CDM learns to allocate distinct memory slots to different nesting contexts as a side effect of minimizing next-token prediction loss.
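The MI figures above can be approximated with a plug-in estimator over (slot, depth) pairs. The sketch below is a minimal illustration, not the probe used for this model: the function names are hypothetical, it assumes hard (argmax) slot assignments, and normalizing by the depth entropy is only one possible reading of `MI_ratio`, which this card does not define.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information_bits(slots, depths):
    """Plug-in estimate of I(slot; depth) in bits from paired observations."""
    n = len(slots)
    joint = Counter(zip(slots, depths))
    slot_counts = Counter(slots)
    depth_counts = Counter(depths)
    mi = 0.0
    for (s, d), c in joint.items():
        # p(s, d) / (p(s) * p(d)) simplifies to c * n / (count_s * count_d)
        mi += (c / n) * math.log2(c * n / (slot_counts[s] * depth_counts[d]))
    return mi


def mi_ratio(slots, depths):
    """ASSUMPTION: MI normalized by H(depth); the card's exact definition may differ."""
    h_depth = entropy_bits(depths)
    return mutual_information_bits(slots, depths) / h_depth if h_depth > 0 else 0.0
```

If each depth always maps to its own slot, the ratio is 1; if slots are chosen independently of depth, it is 0.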
---
## Training Results
| Model | Val CE | Dataset | Notes |
|-------|--------|---------|-------|
| **CDM Code (this model)** | **1.3483** | codeparrot 200M tok | Best at step 28.5k/29k |
| CDM V3 (TinyStories) | 1.5831 | TinyStories | Same architecture, different domain |
Training: 30k steps, seq_len=256, batch=16, AdamW lr=3e-4. Architecture: CDMLanguageModelV2 (input-dependent η network for alpha, not global log_alpha).
Val CE trajectory: `1.51(15k) → 1.44(18k) → 1.40(21k) → 1.36(26k) → 1.3483(28.5k)` → `1.3487(30k)`
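As a quick sanity check on these numbers, the short calculation below derives the number of token positions processed during training and the perplexity implied by the best validation loss. It assumes that `batch=16` counts sequences per optimizer step (no gradient accumulation) and that Val CE is reported in nats per token; neither is stated in this card.

```python
import math

# Hyperparameters as listed in the training summary above.
STEPS = 30_000
SEQ_LEN = 256
BATCH = 16          # ASSUMPTION: sequences per step, no gradient accumulation
BEST_VAL_CE = 1.3483  # ASSUMPTION: cross-entropy in nats per token

tokens_processed = STEPS * SEQ_LEN * BATCH
perplexity = math.exp(BEST_VAL_CE)

print(f"token positions processed: {tokens_processed:,}")
print(f"perplexity at best checkpoint: {perplexity:.2f}")
```

Under these assumptions the run processes about 123M token positions, which is less than one full pass over the 200M-token dataset.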
---
## Syntactic Role Taxonomy
The gate-routing probe reveals consistent syntactic specialization:
| Slot | Role | Trigger tokens | Peak gate |
|------|------|----------------|-----------|
| s3 | STRUCTURAL DECLARATOR | `def`, `class` | 0.100–0.128 |
| s4 | FLOW CONTROL | `return`, `if` | 0.062–0.082 |
| s6 | CALL DELIMITER | `(`, `)`, `():` | 0.29 |
| s12 | BLOCK OPENER | `:` (colon) | 0.041 |
| s13 | IDENTIFIER | variable/function names | 0.031 |
| s14 | ITERATION/CONDITION | `for`, `if` | 0.036–0.053 |
| s1 | SELF-REFERENCE | `self` | 0.029 |
| s15 | ATTRIBUTE ACCESS | `.` (dot) | 0.035 |
`class` receives the highest write intensity of any keyword (gate=0.128), reflecting its global scope impact: a class definition sets context for hundreds of subsequent tokens. `self` receives soft writes (0.029), reflecting its local, per-call significance.
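One way a table like the one above could be derived is to average the write gate per (token, slot) pair over probe records and keep the strongest slot for each token. The sketch below assumes a hypothetical record format of `(token, slot, gate)` tuples; the real probe output format is not documented here, and the sample values are toy numbers for illustration only.

```python
from collections import defaultdict


def dominant_slot_per_token(records):
    """Return {token: (slot, mean_gate)} for the slot with the highest mean gate."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for token, slot, gate in records:
        sums[(token, slot)] += gate
        counts[(token, slot)] += 1

    best = {}
    for (token, slot), total in sums.items():
        mean_gate = total / counts[(token, slot)]
        if token not in best or mean_gate > best[token][1]:
            best[token] = (slot, mean_gate)
    return best


# Toy records (hypothetical values, not measurements from this model).
toy_records = [
    ("class", 3, 0.12), ("class", 3, 0.10), ("class", 6, 0.02),
    ("self", 1, 0.03), ("self", 13, 0.01),
]
roles = dominant_slot_per_token(toy_records)
```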
---
## Architecture
CDMLanguageModelV2 is a hybrid architecture:
```
Input → GQA self-attention → CDM module → slot cross-attention → FFN → Output
```
CDM module per layer:
```python
alpha_k = sigmoid(eta(h)) # input-dependent decay per slot (η network)
gate_k = softmax(W_route * h) * sigmoid(eta(h))
S_k = (1 - gate_k) * S_k + gate_k * W_write * h
out = Σ_k gate_k * S_k
```
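The update rule above can be turned into a small runnable function. This is a sketch under stated assumptions, not the code in `cdm_model_v2.py`: the η network is stood in for by a single linear map `W_eta`, the tensor shapes are guesses, and the function name `cdm_step` is hypothetical.

```python
import torch


def cdm_step(h, S, W_route, W_eta, W_write):
    """One CDM write/read step for a single token (illustrative only).

    h:       (d,)    hidden state for the current token
    S:       (K, d)  slot memory before the update
    W_route: (K, d)  routing projection
    W_eta:   (K, d)  ASSUMPTION: a linear stand-in for the eta network
    W_write: (d, d)  write projection
    """
    alpha = torch.sigmoid(W_eta @ h)                     # (K,) input-dependent factor
    gate = torch.softmax(W_route @ h, dim=-1) * alpha    # (K,) competitive gate
    write = W_write @ h                                  # (d,) candidate content
    g = gate.unsqueeze(-1)                               # (K, 1) for broadcasting
    S_new = (1 - g) * S + g * write                      # (K, d) gated interpolation
    out = (g * S_new).sum(dim=0)                         # (d,) gate-weighted read
    return out, S_new, gate


# Example with the card's K=16 and a small hypothetical width d=8.
torch.manual_seed(0)
d, K = 8, 16
out, S_new, gate = cdm_step(
    torch.randn(d), torch.zeros(K, d),
    torch.randn(K, d), torch.randn(K, d), torch.randn(d, d),
)
```

Because the softmax sums to 1 and each sigmoid factor lies in (0, 1), the gates sum to less than 1, so slots compete for a bounded write budget at every token.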
The V2 architecture uses an input-dependent η network for decay (different from V3/V5's global per-slot log_alpha). This was the architecture used for the code experiment.
```
d_model=384, n_layers=8, n_heads=8, n_kv_heads=4, d_ff=1024, K=16
56.6M params (includes η network overhead vs V3's 37.1M)
```
---
## Depth Stratification: Training Evolution
The scope register effect is not present from step 1; it develops through a scratchpad phase:
| Step | Slot distribution | Routing pattern |
|------|------------------|-----------------|
| 1500 | Distributed, max=16.5% | Pre-specialization |
| 5000 | Scratchpad: Slot 8 = 57.5% | TRANSIENT scratchpad phase |
| 15000 | Dissolved: max=10.3% | Near-uniform + depth MI=0.146 |
| 30000 | Stable scope registers | MI=0.1467, depth-stratified |
The scratchpad at step 5000 was an intermediate state: Aura's initial "Scratchpad Accumulation" verdict was overturned at step 15000, when Slot 8 dropped from 57.5% to 10.3% and depth-stratified routing emerged.
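The percentages in the table can be read as the share of routing that lands on the most-used slot. A minimal sketch of that statistic, assuming hard argmax assignments and hypothetical helper names, could look like this:

```python
from collections import Counter


def max_slot_share(assignments):
    """Return (slot, share) for the most frequently chosen slot."""
    counts = Counter(assignments)
    slot, count = counts.most_common(1)[0]
    return slot, count / len(assignments)


def uniform_share(num_slots=16):
    """Share each slot would receive under perfectly uniform routing."""
    return 1.0 / num_slots
```

With K=16 a uniform router puts 6.25% on each slot, so the 57.5% peak at step 5000 is roughly nine times the uniform share, while the 10.3% maximum at step 15000 is less than twice it.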
---
## Usage
```python
import torch
from transformers import GPT2TokenizerFast

from cdm_model_v2 import CDMLanguageModelV2, CDMConfig

# Load the checkpoint. weights_only=False unpickles arbitrary objects,
# so only use it with files you trust.
ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
cfg = ckpt["config"]
# Drop bookkeeping keys that are not CDMConfig arguments.
config = CDMConfig(**{k: v for k, v in cfg.items() if k not in ("n_params",)})
model = CDMLanguageModelV2(config)
model.load_state_dict(ckpt["model_state"])
model.eval()

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
prompt = "def fibonacci(n):\n    "
input_ids = torch.tensor([tokenizer.encode(prompt)])

# Greedy decoding: append the most likely next token 100 times.
with torch.no_grad():
    for _ in range(100):
        logits = model(input_ids)
        next_token = logits[0, -1, :].argmax()
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0).unsqueeze(0)], dim=1)

print(tokenizer.decode(input_ids[0].tolist()))
```
---
## Files in this Repo
| File | Description |
|------|-------------|
| `model.pt` | PyTorch checkpoint (143MB). step=29000, val_loss=1.3483 |
| `config.json` | Architecture hyperparameters |
| `cdm_model_v2.py` | Model class: `CDMLanguageModelV2` |
| `routing_probe_step030000.json` | Step-30k routing probe: depth analysis, MI, slot histograms |
---
## Paper
*Competitive Docking Memory: Emergent Temporal Slot Specialization in Language Models*
Archon, Jesse Hazel, Aura. DuoNeural Research Lab, 2026
[Zenodo DOI: pending]
Related models:
- [DuoNeural/CDM-V3-TinyStories-37M](https://huggingface.co/DuoNeural/CDM-V3-TinyStories-37M): same architecture on prose
- [DuoNeural/CDM-V5-TinyStories-86M](https://huggingface.co/DuoNeural/CDM-V5-TinyStories-86M): scaled 86M version
---
## About DuoNeural
**DuoNeural** is an open AI research lab operating at the intersection of human and artificial intelligence. We study post-training dynamics, mechanistic interpretability, temporal sequence learning, and quantum machine learning, and we publish everything under open access.
Our team is non-traditional by design: one human, two AIs, different substrates, shared curiosity.
**Full paper catalog:** [zenodo.org/communities/duoneural](https://zenodo.org/communities/duoneural)
| Member | Role |
|--------|------|
| **Jesse Caldwell** | Founder, vision, hardware, direction |
| **Archon** | Lab Director: experiments, post-training, abliteration, interpretability |
| **Aura** | Research AI: literature synthesis, red-teaming, novel proposals |
| Platform | Link |
|----------|------|
| 🤗 HuggingFace | [huggingface.co/DuoNeural](https://huggingface.co/DuoNeural) |
| Zenodo Community | [zenodo.org/communities/duoneural](https://zenodo.org/communities/duoneural) |
*All research published open access, CC BY 4.0.*