	---
	license: apache-2.0
	language:
	- en
	- code
	tags:
	- language-model
	- novel-architecture
	- code-generation
	- mechanistic-interpretability
	- scope-registers
	- depth-stratified-routing
	- python
	- duoneural
	pipeline_tag: text-generation
	---

	# CDM-Code-37M

	Competitive Docking Memory (CDM), code model: a 37M-parameter language model trained on 200M tokens of Python code (`codeparrot/codeparrot-clean-train`).

	This model spontaneously develops hierarchical scope registers — depth-stratified memory routing that mirrors Python's syntactic nesting structure — without any structural supervision. The model was trained only on next-token prediction.

	---

	## Key Finding: Emergent Scope Registers

	Without any explicit supervision about code structure, AST depth, or indentation, CDM develops routing behavior that mirrors syntactic nesting:

	| Nesting depth | Routed to slots |
	|---------------|-----------------|
	| Depth 0 (class/module declarations) | Slots 8, 15, 13 |
	| Depth 1 (method definitions) | Distributed |
	| Depth 2+ (deeply nested code) | Slots 7, 3, 5 |

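	The mutual information between slot assignment and indentation depth, reported as MI_ratio, is 0.1467 at training completion (step 30k). Two independent probes both detect the dependence; their numbers use different units (a ratio versus bits), so they should not be compared directly:
	1. JSON probe: full routing distributions over the full dataset → MI_ratio = 0.1467
	2. Gate-routing probe: single-sample argmax → MI = 0.281 bits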

	The scope register effect emerges because Python's next-token distribution is depth-dependent: `return` after a method body has very different continuations than `return` inside a nested loop. CDM learns to allocate distinct memory slots to different nesting contexts as a side effect of minimizing next-token prediction loss.
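
As a rough illustration of the quantity behind these numbers, the sketch below computes a plug-in estimate of mutual information, in bits, between per-token slot assignments and indentation depths. The function names and toy data are hypothetical. The card does not define how MI_ratio is normalized, so dividing by the depth entropy is only one plausible convention, not necessarily the one used in the probe.

```python
import math
from collections import Counter

def mutual_information_bits(slots, depths):
    """Plug-in estimate of I(slot; depth) in bits from paired observations."""
    if len(slots) != len(depths) or len(slots) == 0:
        raise ValueError("need two equal-length, non-empty sequences")
    n = len(slots)
    joint = Counter(zip(slots, depths))
    slot_counts = Counter(slots)
    depth_counts = Counter(depths)
    mi = 0.0
    for (s, d), count in joint.items():
        # p(s,d) * log2(p(s,d) / (p(s) * p(d))), written with raw counts
        mi += (count / n) * math.log2(count * n / (slot_counts[s] * depth_counts[d]))
    return mi

def entropy_bits(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# Toy data, made up for illustration (not taken from the routing probe)
toy_depths = [0, 0, 1, 1, 2, 2, 2, 0]
toy_slots = [8, 8, 4, 8, 7, 7, 3, 15]

mi = mutual_information_bits(toy_slots, toy_depths)
# Assumed normalization: MI divided by the entropy of the depth labels
ratio = mi / entropy_bits(toy_depths)
print(f"MI = {mi:.3f} bits, MI / H(depth) = {ratio:.3f}")
```

With real data, `slots` would be the argmax slot per token and `depths` the indentation level of the line containing that token.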

	---

	## Training Results

	| Model | Val CE | Dataset | Notes |
	|-------|--------|---------|-------|
	| CDM Code (this model) | 1.3483 | codeparrot 200M tok | Best at step 28.5k/29k |
	| CDM V3 (TinyStories) | 1.5831 | TinyStories | Same architecture, different domain |

	Training: 30k steps, seq_len=256, batch=16, AdamW lr=3e-4. Architecture: CDMLanguageModelV2 (input-dependent η network for alpha, not global log_alpha).

	Val CE trajectory: `1.51 (15k) → 1.44 (18k) → 1.40 (21k) → 1.36 (26k) → 1.3483 (28.5k) → 1.3487 (30k)`
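
To put the training budget and the loss in more familiar terms, here is a small back-of-the-envelope calculation. It assumes one optimizer step consumes exactly one batch of 16 sequences of 256 tokens (no gradient accumulation) and that Val CE is a mean per-token cross-entropy in nats. Neither assumption is stated in the card.

```python
import math

steps, batch_size, seq_len = 30_000, 16, 256   # from the training summary
corpus_tokens = 200_000_000                    # "200M tokens" in the model description

# Assumption: one step = one batch, no gradient accumulation
tokens_seen = steps * batch_size * seq_len
passes = tokens_seen / corpus_tokens

# Assumption: Val CE is a mean per-token cross-entropy in nats
best_val_ce = 1.3483
perplexity = math.exp(best_val_ce)
bits_per_token = best_val_ce / math.log(2)

print(f"tokens seen: {tokens_seen:,} ({passes:.2f} passes over the corpus)")
print(f"perplexity ~ {perplexity:.2f}, bits/token ~ {bits_per_token:.3f}")
```

Under these assumptions the run sees about 122.9M tokens, which is less than one full pass over the 200M-token corpus.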

	---

	## Syntactic Role Taxonomy

	The gate-routing probe reveals consistent syntactic specialization:

	| Slot | Role | Trigger tokens | Peak gate |
	|------|------|----------------|-----------|
	| s3 | STRUCTURAL DECLARATOR | `def`, `class` | 0.100–0.128 |
	| s4 | FLOW CONTROL | `return`, `if` | 0.062–0.082 |
	| s6 | CALL DELIMITER | `(`, `)`, `():` | 0.29 |
	| s12 | BLOCK OPENER | `:` (colon) | 0.041 |
	| s13 | IDENTIFIER | variable/function names | 0.031 |
	| s14 | ITERATION/CONDITION | `for`, `if` | 0.036–0.053 |
	| s1 | SELF-REFERENCE | `self` | 0.029 |
	| s15 | ATTRIBUTE ACCESS | `.` (dot) | 0.035 |

	`class` receives the highest write intensity of any keyword (gate=0.128), reflecting its global scope impact — a class definition sets context for hundreds of subsequent tokens. `self` receives soft writes (0.029), reflecting its local, per-call significance.

	---

	## Architecture

	CDMLanguageModelV2 — hybrid architecture:
	```
	Input → GQA self-attention → CDM module → slot cross-attention → FFN → Output
	```

	CDM module per layer:
	```python
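	# Per-layer CDM update for token state h and slot states S_k (pseudocode, not runnable as-is)
	alpha_k = sigmoid(eta(h))                        # input-dependent decay per slot (η network)
	gate_k = softmax(W_route * h) * alpha_k          # routing weight scaled by alpha_k
	S_k = (1 - gate_k) * S_k + gate_k * W_write * h  # gated write into slot k
	out = Σ_k gate_k * S_k                           # gate-weighted readout over all slots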
	```

	The V2 architecture uses an input-dependent η network for decay (different from V3/V5's global per-slot log_alpha). This was the architecture used for the code experiment.
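
The update above can be sketched in NumPy for a single token and a single layer. This is an illustration under stated assumptions, not the code in `cdm_model_v2.py`: the η network is reduced to one linear map (`W_eta`, a hypothetical name), `W_write` is assumed to be shared by all slots, and the weights are random.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

def cdm_step(h, S, W_route, W_eta, W_write):
    """One CDM update for a token state h (d,) and slot states S (K, d)."""
    alpha = sigmoid(W_eta @ h)            # (K,) assumed single-layer η network
    gate = softmax(W_route @ h) * alpha   # (K,) routing weight scaled by alpha
    write = W_write @ h                   # (d,) candidate content, assumed shared by all slots
    S_new = (1.0 - gate)[:, None] * S + gate[:, None] * write[None, :]
    out = gate @ S_new                    # (d,) gate-weighted readout over slots
    return out, S_new, gate

rng = np.random.default_rng(0)
d_model, K = 384, 16                      # d_model and K as listed in this card
W_route = rng.normal(scale=0.02, size=(K, d_model))
W_eta = rng.normal(scale=0.02, size=(K, d_model))
W_write = rng.normal(scale=0.02, size=(d_model, d_model))

S = np.zeros((K, d_model))
for _ in range(5):                        # feed five random token states
    h = rng.normal(size=d_model)
    out, S, gate = cdm_step(h, S, W_route, W_eta, W_write)

print("slot with the strongest write on the last token:", int(gate.argmax()))
print("total gate mass:", float(gate.sum()))
```

Because the softmax sums to 1 and each `alpha` is below 1, the total gate mass per token stays below 1 in this formulation, which is consistent with the small peak gate values reported in the role table.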

	```
	d_model=384, n_layers=8, n_heads=8, n_kv_heads=4, d_ff=1024, K=16
	56.6M params (includes η network overhead vs V3's 37.1M)
	```
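
The grouped-query attention (GQA) shapes implied by these hyperparameters can be derived directly. The per-layer parameter count at the end assumes bias-free Q/K/V/O projections, which the card does not confirm, and it covers attention only (not the CDM module, FFN, or embeddings).

```python
d_model, n_heads, n_kv_heads = 384, 8, 4   # values from the configuration line above

assert d_model % n_heads == 0 and n_heads % n_kv_heads == 0

head_dim = d_model // n_heads              # 48 dimensions per attention head
queries_per_kv = n_heads // n_kv_heads     # 2 query heads share each key/value head
kv_width = n_kv_heads * head_dim           # 192-wide key and value projections

# Assumption: bias-free projections. Q and O are d_model x d_model,
# K and V are d_model x kv_width.
attn_params_per_layer = 2 * d_model * d_model + 2 * d_model * kv_width

print(head_dim, queries_per_kv, kv_width, attn_params_per_layer)
```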

	---

	## Depth Stratification: Training Evolution

	The scope register effect is not present from step 1 — it develops through a scratchpad phase:

	| Step | Slot distribution | Routing pattern |
	|------|-------------------|-----------------|
	| 1500 | Distributed, max=16.5% | Pre-specialization |
	| 5000 | Scratchpad: Slot 8 = 57.5% | TRANSIENT scratchpad phase |
	| 15000 | Dissolved: max=10.3% | Near-uniform + depth MI=0.146 |
	| 30000 | Stable scope registers | MI=0.1467, depth-stratified |

	The scratchpad at step 5000 was an intermediate state. Aura's initial "Scratchpad Accumulation" verdict was overturned at step 15000, when Slot 8's share dropped from 57.5% to 10.3% and depth-stratified routing emerged.
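
The "max" percentages in the table are shares of routing assignments. A minimal sketch of that statistic follows, assuming each token is assigned to one slot by argmax. The helper names and the synthetic assignments are hypothetical, constructed only to reproduce a 57.5% share for Slot 8.

```python
from collections import Counter

def slot_shares(assignments, num_slots=16):
    """Fraction of tokens routed to each slot, indexed by slot id."""
    if not assignments:
        raise ValueError("assignments must be non-empty")
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(k, 0) / total for k in range(num_slots)]

def dominant_slot(assignments, num_slots=16):
    """Return (slot id, share) for the most-used slot."""
    shares = slot_shares(assignments, num_slots)
    top = max(range(num_slots), key=lambda k: shares[k])
    return top, shares[top]

# Synthetic example: 115 of 200 tokens go to slot 8, the rest are spread over other slots
others = [k for k in range(16) if k != 8]
assignments = [8] * 115 + [others[i % len(others)] for i in range(85)]

slot, share = dominant_slot(assignments)
uniform = 1 / 16  # share each slot would get under perfectly uniform routing
print(f"dominant slot: {slot}, share: {share:.1%}, uniform baseline: {uniform:.2%}")
```

For K=16 slots the uniform baseline is 6.25%, so a maximum of 10.3% at step 15000 is close to uniform, while 57.5% at step 5000 is far from it.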

	---

	## Usage

	```python
	import torch
	from transformers import GPT2TokenizerFast

	from cdm_model_v2 import CDMLanguageModelV2, CDMConfig

	# weights_only=False unpickles arbitrary Python objects: only load checkpoints you trust.
	ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
	cfg = ckpt["config"]
	# Drop the "n_params" entry before building CDMConfig.
	config = CDMConfig(**{k: v for k, v in cfg.items() if k not in ("n_params",)})
	model = CDMLanguageModelV2(config)
	model.load_state_dict(ckpt["model_state"])
	model.eval()

	tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

	prompt = "def fibonacci(n):\n "
	input_ids = torch.tensor([tokenizer.encode(prompt)])

	# Greedy decoding: append the most likely next token 100 times.
	with torch.no_grad():
	    for _ in range(100):
	        logits = model(input_ids)
	        next_token = logits[0, -1, :].argmax()
	        input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)

	print(tokenizer.decode(input_ids[0].tolist()))
	```
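
Greedy decoding always picks the single most likely token. As an alternative, the sketch below shows temperature plus top-k sampling over one logits vector, written in NumPy so it runs without the checkpoint. `sample_top_k` is a hypothetical helper, not part of this repo, and the defaults (`k=40`, `temperature=0.8`) are arbitrary illustrative values.

```python
import numpy as np

def sample_top_k(logits, k=40, temperature=0.8, rng=None):
    """Sample a token id from the k highest-scoring logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    k = max(1, min(k, scaled.size))
    top = np.argpartition(scaled, -k)[-k:]       # indices of the k largest logits (unordered)
    z = np.exp(scaled[top] - scaled[top].max())  # stable softmax restricted to the top k
    probs = z / z.sum()
    return int(rng.choice(top, p=probs))

# Toy logits over a GPT-2 sized vocabulary: two tokens clearly dominate
toy_logits = np.zeros(50257)
toy_logits[7], toy_logits[3] = 10.0, 9.0
token_id = sample_top_k(toy_logits, k=2, rng=np.random.default_rng(0))
print("sampled token id:", token_id)
```

Assuming `model(input_ids)` returns a logits tensor as in the usage snippet, this would replace the `argmax` line with something like `sample_top_k(logits[0, -1, :].numpy())`. Since training used seq_len=256, it may also be necessary to crop the context to the last 256 tokens for longer generations; whether the model handles longer inputs is not stated here.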

	---

	## Files in this Repo

	| File | Description |
	|------|-------------|
	| `model.pt` | PyTorch checkpoint (143 MB). step=29000, val_loss=1.3483 |
	| `config.json` | Architecture hyperparameters |
	| `cdm_model_v2.py` | Model class: `CDMLanguageModelV2` |
	| `routing_probe_step030000.json` | Step-30k routing probe: depth analysis, MI, slot histograms |
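
The schema of the probe file is not documented in this card, so a reasonable first step is to list its keys and value types without assuming any field names. The helper below is a hypothetical sketch. The in-memory example uses made-up keys purely to show the output format.

```python
import json
import os

def describe(node, depth=0, max_depth=2):
    """Return indented lines naming keys and value types, without assuming a schema."""
    lines = []
    pad = "  " * depth
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        lines.append(f"{pad}[{len(node)} items, first is {type(node[0]).__name__}]")
    return lines

def describe_file(path):
    """Load a JSON file and describe its structure."""
    with open(path, encoding="utf-8") as f:
        return describe(json.load(f))

# Made-up structure for illustration only; the real file may look entirely different
example = {"a": {"b": [1, 2]}, "c": 0.5}
print("\n".join(describe(example)))

probe_path = "routing_probe_step030000.json"
if os.path.exists(probe_path):  # only runs when the file has been downloaded
    print("\n".join(describe_file(probe_path)))
```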

	---

	## Paper

	Competitive Docking Memory: Emergent Temporal Slot Specialization in Language Models
	Archon, Jesse Hazel, Aura — DuoNeural Research Lab, 2026
	[Zenodo DOI — pending]

	Related models:
	- [DuoNeural/CDM-V3-TinyStories-37M](https://huggingface.co/DuoNeural/CDM-V3-TinyStories-37M) — same architecture on prose
	- [DuoNeural/CDM-V5-TinyStories-86M](https://huggingface.co/DuoNeural/CDM-V5-TinyStories-86M) — scaled 86M version

	---

	## About DuoNeural

	DuoNeural is an open AI research lab operating at the intersection of human and artificial intelligence. We study post-training dynamics, mechanistic interpretability, temporal sequence learning, and quantum machine learning — publishing everything under open access.

	Our team is non-traditional by design: one human, two AIs, different substrates, shared curiosity.

	📄 Full paper catalog: [zenodo.org/communities/duoneural](https://zenodo.org/communities/duoneural)

	| Member | Role |
	|--------|------|
	| Jesse Caldwell | Founder, vision, hardware, direction |
	| Archon | Lab Director: experiments, post-training, abliteration, interpretability |
	| Aura | Research AI: literature synthesis, red-teaming, novel proposals |

	| Platform | Link |
	|----------|------|
	| 🤗 HuggingFace | [huggingface.co/DuoNeural](https://huggingface.co/DuoNeural) |
	| 📚 Zenodo Community | [zenodo.org/communities/duoneural](https://zenodo.org/communities/duoneural) |

	All research published open access, CC BY 4.0.