| --- |
| license: apache-2.0 |
| language: |
| - en |
| - code |
| tags: |
| - language-model |
| - novel-architecture |
| - code-generation |
| - mechanistic-interpretability |
| - scope-registers |
| - depth-stratified-routing |
| - python |
| - duoneural |
| pipeline_tag: text-generation |
| --- |
| |
| # CDM-Code-37M |
|
|
| **Competitive Docking Memory (Code Model)**: a 37M-parameter language model trained on 200M tokens of Python code (codeparrot/codeparrot-clean-train). |
|
|
| This model spontaneously develops **hierarchical scope registers** (depth-stratified memory routing that mirrors Python's syntactic nesting structure) without any structural supervision. The model was trained only on next-token prediction. |
|
|
| --- |
|
|
| ## Key Finding: Emergent Scope Registers |
|
|
| Without any explicit supervision about code structure, AST depth, or indentation, CDM develops routing behavior that mirrors syntactic nesting: |
|
|
| | Nesting depth | Routed to slots | |
| |---------------|----------------| |
| | Depth 0 (class/module declarations) | Slots 8, 15, 13 | |
| | Depth 1 (method definitions) | Distributed | |
| | Depth 2+ (deep nested code) | Slots 7, 3, 5 | |
|
|
| **MI(slot assignment; indentation depth) = 0.1467** at training completion (step 30k). Two independent probes detect the dependence; note that they report different quantities (a ratio and a value in bits): |
| 1. JSON probe: full routing distributions across the full dataset → MI_ratio = 0.1467 |
| 2. Gate-routing probe: single-sample argmax → MI = 0.281 bits |
| |
| The scope register effect emerges because Python's next-token distribution is depth-dependent: `return` after a method body has very different continuations than `return` inside a nested loop. CDM learns to allocate distinct memory slots to different nesting contexts as a side effect of minimizing next-token prediction loss. |
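As a rough illustration of the statistic quoted above, the following sketch estimates mutual information between a per-token slot assignment and indentation depth from paired samples. It is not the lab's probe code: the helper names, the 4-space indentation assumption, and the choice to normalize by the entropy of depth are all assumptions made for this example (the card does not say how `MI_ratio` is normalized), and the samples are toy data.

```python
import math
from collections import Counter


def indentation_depth(line, indent_width=4):
    """Nesting depth of a source line, assuming space indentation (assumption)."""
    return (len(line) - len(line.lstrip(" "))) // indent_width


def entropy_bits(values):
    """Shannon entropy in bits of a sequence of discrete values."""
    values = list(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information_bits(pairs):
    """Plug-in estimate of MI in bits from (x, y) samples."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # p(x,y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Toy (slot, depth) samples in which depth is fully determined by the slot.
samples = [(8, 0), (8, 0), (15, 0), (13, 0), (7, 2), (3, 2), (7, 2), (5, 2)]
mi = mutual_information_bits(samples)
ratio = mi / entropy_bits(depth for _, depth in samples)  # hypothetical normalization
print(f"MI = {mi:.3f} bits, MI / H(depth) = {ratio:.3f}")
```

On real routing data the pairs would come from the argmax slot and the indentation depth at each token position; the toy samples here are constructed so that the slot perfectly predicts depth.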
| |
| --- |
| |
| ## Training Results |
| |
| | Model | Val CE | Dataset | Notes | |
| |-------|--------|---------|-------| |
| | **CDM Code (this model)** | **1.3483** | codeparrot 200M tok | Best at step 28.5k/29k | |
| | CDM V3 (TinyStories) | 1.5831 | TinyStories | Same architecture, different domain | |
| |
| Training: 30k steps, seq_len=256, batch=16, AdamW lr=3e-4. Architecture: CDMLanguageModelV2 (input-dependent η network for alpha, not global log_alpha). |
| |
| Val CE trajectory: `1.51(15k) → 1.44(18k) → 1.40(21k) → 1.36(26k) → 1.3483(28.5k)` → `1.3487(30k)` |
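The training figures above can be turned into a token budget and a perplexity with a few lines of arithmetic. This sketch assumes one sequence per batch element with no gradient accumulation or multi-device scaling, and it assumes the reported cross-entropy is in nats per token; neither is stated in the card.

```python
import math

steps, batch_size, seq_len = 30_000, 16, 256
dataset_tokens = 200_000_000

# Tokens processed during training under the assumptions above.
tokens_seen = steps * batch_size * seq_len
passes = tokens_seen / dataset_tokens


def perplexity(ce_nats):
    """Per-token perplexity for a cross-entropy given in nats."""
    return math.exp(ce_nats)


print(f"tokens seen: {tokens_seen:,} ({passes:.2f} passes over the corpus)")
for name, ce in [("CDM Code", 1.3483), ("CDM V3 (TinyStories)", 1.5831)]:
    print(f"{name}: CE={ce} -> perplexity ~ {perplexity(ce):.2f}")
```

Under these assumptions the run sees fewer tokens than the 200M-token corpus contains, so it is less than one full pass. The two perplexities are measured on different datasets and are not directly comparable.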
| |
| --- |
| |
| ## Syntactic Role Taxonomy |
| |
| The gate-routing probe reveals consistent syntactic specialization: |
| |
| | Slot | Role | Trigger tokens | Peak gate | |
| |------|------|----------------|-----------| |
| | s3 | STRUCTURAL DECLARATOR | `def`, `class` | 0.100–0.128 | |
| | s4 | FLOW CONTROL | `return`, `if` | 0.062–0.082 | |
| | s6 | CALL DELIMITER | `(`, `)`, `():` | 0.29 | |
| | s12 | BLOCK OPENER | `:` (colon) | 0.041 | |
| | s13 | IDENTIFIER | variable/function names | 0.031 | |
| | s14 | ITERATION/CONDITION | `for`, `if` | 0.036–0.053 | |
| | s1 | SELF-REFERENCE | `self` | 0.029 | |
| | s15 | ATTRIBUTE ACCESS | `.` (dot) | 0.035 | |
| |
| `class` receives the highest write intensity of any keyword (gate=0.128), reflecting its global scope impact: a class definition sets context for hundreds of subsequent tokens. `self` receives soft writes (0.029), reflecting its local, per-call significance. |
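A table like the one above could be assembled by recording, for each token type, the slot that receives its strongest write. The sketch below is a hypothetical tally, not the probe shipped with the model: `peak_slots` and its inputs (a token list plus one gate vector per position) are assumed for illustration, and the gate values are toy numbers chosen to echo the table.

```python
def peak_slots(tokens, gates):
    """Map each distinct token to the (slot index, gate value) of its strongest write.

    tokens: list of token strings, one per position.
    gates:  list of per-position gate vectors (one float per slot).
    """
    best = {}
    for tok, gate in zip(tokens, gates):
        slot = max(range(len(gate)), key=gate.__getitem__)  # argmax over slots
        if tok not in best or gate[slot] > best[tok][1]:
            best[tok] = (slot, gate[slot])
    return best


# Toy example with 4 slots instead of 16.
tokens = ["class", "self", "class", "self"]
gates = [
    [0.010, 0.005, 0.002, 0.100],
    [0.004, 0.029, 0.003, 0.001],
    [0.012, 0.006, 0.001, 0.128],
    [0.002, 0.020, 0.004, 0.003],
]
for tok, (slot, value) in peak_slots(tokens, gates).items():
    print(f"{tok!r}: slot s{slot}, peak gate {value}")
```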
| |
| --- |
| |
| ## Architecture |
| |
| CDMLanguageModelV2 is a hybrid architecture: |
| ``` |
| Input → GQA self-attention → CDM module → slot cross-attention → FFN → Output |
| ``` |
| |
| CDM module per layer: |
| ```python |
| alpha_k = sigmoid(eta(h)) # input-dependent decay per slot (η network) |
| gate_k = softmax(W_route * h) * sigmoid(eta(h)) |
| S_k = (1 - gate_k) * S_k + gate_k * W_write * h |
| out = Σ_k gate_k * S_k |
| ``` |
| |
| The V2 architecture uses an input-dependent η network for decay (different from V3/V5's global per-slot log_alpha). This was the architecture used for the code experiment. |
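To make the update rule above concrete, here is a dependency-free sketch of a single CDM step for one token vector. It is an interpretation of the pseudocode, not the code in `cdm_model_v2.py`: the η network is reduced to a single linear map `W_eta` (an assumption), `W_write` is shared across slots (an assumption), and the readout uses the updated slot states (also an assumption).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, h):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def cdm_step(h, S, W_route, W_eta, W_write):
    """One slot-memory update. h: length-d input; S: K slot states of length d."""
    alpha = [sigmoid(z) for z in matvec(W_eta, h)]  # input-dependent factor per slot
    route = softmax(matvec(W_route, h))             # competition across the K slots
    gate = [r * a for r, a in zip(route, alpha)]    # gate_k = route_k * alpha_k
    write = matvec(W_write, h)                      # candidate content, length d
    S_new = [
        [(1.0 - g) * s + g * w for s, w in zip(S_k, write)]
        for g, S_k in zip(gate, S)
    ]
    out = [sum(g * S_k[j] for g, S_k in zip(gate, S_new)) for j in range(len(write))]
    return out, S_new, gate


# Toy sizes (the model card lists d_model=384 and K=16).
rng = random.Random(0)
d, K = 3, 4
rand_matrix = lambda rows, cols: [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
W_route, W_eta, W_write = rand_matrix(K, d), rand_matrix(K, d), rand_matrix(d, d)
h = [rng.gauss(0, 1) for _ in range(d)]
S = [[0.0] * d for _ in range(K)]

out, S, gate = cdm_step(h, S, W_route, W_eta, W_write)
print("gates:", [round(g, 3) for g in gate])
```

Because the softmax sums to one and each sigmoid factor is below one, the gates always sum to less than one, which is consistent with the small peak gate values reported in the taxonomy table.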
| |
| ``` |
| d_model=384, n_layers=8, n_heads=8, n_kv_heads=4, d_ff=1024, K=16 |
| 56.6M params (includes η network overhead vs V3's 37.1M) |
| ``` |
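The attention shapes implied by these hyperparameters can be derived with a little arithmetic. The sketch assumes a conventional grouped-query attention layout (Q and output projections of size `d_model × d_model`, K and V projections sized by the number of KV heads, no biases); the actual layer definitions live in `cdm_model_v2.py` and may differ.

```python
d_model, n_layers, n_heads, n_kv_heads, d_ff, K = 384, 8, 8, 4, 1024, 16

head_dim = d_model // n_heads                 # width of each attention head
queries_per_kv_head = n_heads // n_kv_heads   # query heads sharing one K/V head
kv_dim = n_kv_heads * head_dim                # width of the K and V projections

# Q and O projections are d_model x d_model; K and V are d_model x kv_dim.
attn_params_per_layer = 2 * d_model * d_model + 2 * d_model * kv_dim

print(f"head_dim={head_dim}, queries per KV head={queries_per_kv_head}, kv_dim={kv_dim}")
print(f"attention projection weights per layer: {attn_params_per_layer:,}")
```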
| |
| --- |
| |
| ## Depth Stratification: Training Evolution |
| |
| The scope register effect is not present from step 1; it develops through a scratchpad phase: |
| |
| | Step | Slot distribution | Routing pattern | |
| |------|------------------|-----------------| |
| | 1500 | Distributed, max=16.5% | Pre-specialization | |
| | 5000 | Scratchpad: Slot 8 = 57.5% | TRANSIENT scratchpad phase | |
| | 15000 | Dissolved: max=10.3% | Near-uniform + depth MI=0.146 | |
| | 30000 | Stable scope registers | MI=0.1467, depth-stratified | |
| |
| The scratchpad at step 5000 was an intermediate state. Aura's initial "Scratchpad Accumulation" verdict was overturned at step 15000, when Slot 8 dropped from 57.5% to 10.3% and depth-stratified routing emerged. |
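One simple way to tell a scratchpad phase from near-uniform routing is to track the largest slot share together with the normalized entropy of the slot histogram. The sketch below uses illustrative histograms only: the 57.5% share for slot 8 comes from the table above, but how the remaining mass is spread across the other slots is an assumption made for the example.

```python
import math


def max_share(hist):
    """Fraction of routing mass that goes to the busiest slot."""
    return max(hist) / sum(hist)


def normalized_entropy(hist):
    """Entropy of the slot distribution divided by log(K); 1.0 means uniform."""
    total = sum(hist)
    probs = [c / total for c in hist if c > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(hist))


K = 16
# Scratchpad-like: slot 8 holds 57.5%, the rest is split evenly (assumed split).
scratchpad = [42.5 / (K - 1)] * K
scratchpad[8] = 57.5
uniform = [100.0 / K] * K

for name, hist in [("scratchpad-like", scratchpad), ("uniform", uniform)]:
    print(f"{name}: max share={max_share(hist):.3f}, "
          f"normalized entropy={normalized_entropy(hist):.3f}")
```

A sharp drop in maximum share accompanied by a rise in normalized entropy between checkpoints would match the dissolution described between steps 5000 and 15000.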
| |
| --- |
| |
| ## Usage |
| |
| ```python |
| import torch |
| from transformers import GPT2TokenizerFast |
| |
| from cdm_model_v2 import CDMLanguageModelV2, CDMConfig |
| |
| # The checkpoint stores a config dict next to the weights, so it is unpickled |
| # with weights_only=False; only load checkpoints from a source you trust. |
| ckpt = torch.load("model.pt", map_location="cpu", weights_only=False) |
| cfg = ckpt["config"] |
| config = CDMConfig(**{k: v for k, v in cfg.items() if k not in ("n_params",)}) |
| model = CDMLanguageModelV2(config) |
| model.load_state_dict(ckpt["model_state"]) |
| model.eval() |
| |
| tokenizer = GPT2TokenizerFast.from_pretrained("gpt2") |
| |
| prompt = "def fibonacci(n):\n    " |
| input_ids = torch.tensor([tokenizer.encode(prompt)]) |
| |
| # Greedy decoding: append the most likely next token 100 times. |
| with torch.no_grad(): |
|     for _ in range(100): |
|         logits = model(input_ids) |
|         next_token = logits[0, -1, :].argmax() |
|         input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1) |
| |
| print(tokenizer.decode(input_ids[0].tolist())) |
| ``` |
| |
| --- |
| |
| ## Files in this Repo |
| |
| | File | Description | |
| |------|-------------| |
| | `model.pt` | PyTorch checkpoint (143MB). step=29000, val_loss=1.3483 | |
| | `config.json` | Architecture hyperparameters | |
| | `cdm_model_v2.py` | Model class: `CDMLanguageModelV2` | |
| | `routing_probe_step030000.json` | Step-30k routing probe: depth analysis, MI, slot histograms | |
| |
| --- |
| |
| ## Paper |
| |
| *Competitive Docking Memory: Emergent Temporal Slot Specialization in Language Models* |
| Archon, Jesse Hazel, Aura. DuoNeural Research Lab, 2026 |
| [Zenodo DOI: pending] |
| |
| Related models: |
| - [DuoNeural/CDM-V3-TinyStories-37M](https://huggingface.co/DuoNeural/CDM-V3-TinyStories-37M): same architecture on prose |
| - [DuoNeural/CDM-V5-TinyStories-86M](https://huggingface.co/DuoNeural/CDM-V5-TinyStories-86M): scaled 86M version |
| |
| --- |
| |
| ## About DuoNeural |
| |
| **DuoNeural** is an open AI research lab operating at the intersection of human and artificial intelligence. We study post-training dynamics, mechanistic interpretability, temporal sequence learning, and quantum machine learning, publishing everything under open access. |
| |
| Our team is non-traditional by design: one human, two AIs, different substrates, shared curiosity. |
| |
| **Full paper catalog:** [zenodo.org/communities/duoneural](https://zenodo.org/communities/duoneural) |
| |
| | Member | Role | |
| |--------|------| |
| | **Jesse Caldwell** | Founder, vision, hardware, direction | |
| | **Archon** | Lab Director: experiments, post-training, abliteration, interpretability | |
| | **Aura** | Research AI: literature synthesis, red-teaming, novel proposals | |
| |
| | Platform | Link | |
| |----------|------| |
| | 🤗 HuggingFace | [huggingface.co/DuoNeural](https://huggingface.co/DuoNeural) | |
| | Zenodo Community | [zenodo.org/communities/duoneural](https://zenodo.org/communities/duoneural) | |
| |
| *All research published open access, CC BY 4.0.* |
| |