# Cortex: Modular Cognitive Plug-ins for Pretrained LLMs

**Surgically insert new cognitive capabilities into any pretrained transformer LLM — without retraining the base model.**

Cortex is a framework for performing *layer surgery* on pretrained language models. It injects lightweight, composable modules into the transformer's residual stream via PyTorch hooks, adding capabilities that address fundamental LLM failure modes. The base model weights are frozen; only the Cortex modules (~3% parameter overhead) are trainable.

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────┐
│                   Pretrained LLM (Frozen)                     │
│                                                              │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐       │
│  │ Layer 0 │──▶│ Layer 1 │──▶│   ...   │──▶│ Layer N │──▶    │
│  └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘       │
│       │             │             │             │            │
│  ┌────▼────┐   ┌────▼────┐   ┌────▼────┐   ┌────▼────┐       │
│  │Adaptive │   │Backtrack│   │ Memory  │   │ Halluc  │       │
│  │ Depth   │   │  Head   │   │  Bank   │   │  Gate   │       │
│  │ (gate)  │   │(correct)│   │ (read/  │   │(suppress│       │
│  │         │   │         │   │  write) │   │ unsure) │       │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘       │
│       │             │             │                          │
│  ┌────▼────┐   ┌────▼────┐   ┌────▼────┐                     │
│  │Steering │   │  Pause  │   │         │                     │
│  │ Vector  │   │ & Think │   │         │                     │
│  │ (steer) │   │(compute)│   │         │                     │
│  └─────────┘   └─────────┘   └─────────┘                     │
└──────────────────────────────────────────────────────────────┘
```

## The 6 Modules

### 1. 🧠 MemoryBank — Persistent Episodic Memory

**Failure mode:** Limited context window, lost long-range dependencies, no working memory.

Injects a learnable memory matrix **M ∈ ℝ^{N×D}** into middle transformer layers. Hidden states read from memory via multi-head cross-attention, and write back via LSTM-style gated updates. Memory persists across forward passes, enabling multi-turn memory and long-document reasoning.

- **Injection:** `POST_ATTENTION` (between attention and FFN)
- **Mechanism:** Cross-attention read → Output gate → LSTM-style write
- **Based on:** [LM2: Large Memory Models (Kang et al. 2025)](https://arxiv.org/abs/2502.06049), [WISE (Xia et al. 2024)](https://arxiv.org/abs/2405.14768)

### 2. 🛡️ HallucinationGate — Confidence-Based Suppression

**Failure mode:** Hallucination — generating confident but factually wrong content.

A lightweight confidence probe reads the residual stream and outputs a per-token confidence score. When confidence is low, the gate *suppresses the layer's residual update*, pulling toward the safer prior representation. The model effectively learns to say "I don't know" at the representation level.

- **Injection:** `POST_FFN` (after full transformer block)
- **Mechanism:** Confidence probe → Soft gate → Suppress uncertain updates
- **Key insight:** Internal states contain more information about correctness than output distributions — I(Θ; K(X)|X) ≥ I(Y; K(X)|X) + Δ
- **Based on:** [InternalInspector (Chen et al. 2024)](https://arxiv.org/abs/2406.12053), [The Map of Misbelief (2025)](https://arxiv.org/abs/2511.10837)
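The probe-and-suppress mechanism is easy to see in code. Below is a minimal sketch of the idea, assuming the module sees both the block's input and output at its hook point; `ConfidenceGate` is a hypothetical stand-in, not the actual `HallucinationGate` implementation:

```python
import torch
import torch.nn as nn

class ConfidenceGate(nn.Module):
    """Sketch: scale a block's residual update by a per-token confidence score."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.probe = nn.Linear(hidden_dim, 1)
        # Start fully open (confidence ≈ 1) so injection is a no-op,
        # in line with the zero-init design principle described below.
        nn.init.zeros_(self.probe.weight)
        nn.init.constant_(self.probe.bias, 4.0)  # sigmoid(4) ≈ 0.98

    def forward(self, h_in: torch.Tensor, h_out: torch.Tensor) -> torch.Tensor:
        # h_in:  hidden states entering the block  [batch, seq, dim]
        # h_out: hidden states leaving the block   [batch, seq, dim]
        confidence = torch.sigmoid(self.probe(h_out))   # [batch, seq, 1]
        update = h_out - h_in                            # the layer's residual update
        return h_in + confidence * update                # low confidence: pull toward the prior state


gate = ConfidenceGate(hidden_dim=64)
h_in, h_out = torch.randn(1, 5, 64), torch.randn(1, 5, 64)
print(gate(h_in, h_out).shape)  # torch.Size([1, 5, 64])
```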
### 3. 💭 PauseAndThink — Latent Computation Tokens

**Failure mode:** Fixed compute per token, shallow reasoning, limited "thinking time."

Injects K learnable "thinking" token embeddings that attend to all real tokens, perform computation, then compress their information back into the original sequence positions via gated cross-attention. Like chain-of-thought, but entirely in latent space — no extra output tokens needed.

- **Injection:** `RESIDUAL_STREAM` (wraps full block)
- **Mechanism:** Context-conditioned thinking tokens → Attention to real tokens → Gated compression back
- **Based on:** [Pause Tokens (Goyal et al. 2023)](https://arxiv.org/abs/2310.02226), [Thoughtbubbles (2025)](https://arxiv.org/abs/2510.00219)

### 4. ↩️ BacktrackHead — Learned Self-Correction

**Failure mode:** Commitment to bad intermediate representations, no backtracking.

Monitors confidence *across layers*. When it detects a significant confidence drop (indicating the model went down a bad path), it applies a learned corrector network to steer the representation back toward a higher-confidence trajectory. Effectively implements architectural self-correction.

- **Injection:** `RESIDUAL_STREAM` (all layers)
- **Mechanism:** Per-layer confidence probe → Drop detection → Bottleneck corrector network
- **Based on:** [GateSkip (2025)](https://arxiv.org/abs/2510.13876), [River-LLM (2025)](https://arxiv.org/abs/2604.18396), [Self-Correction (Welleck et al. 2022)](https://arxiv.org/abs/2211.00053)

### 5. 🧭 SteeringVector — Behavioral Control

**Failure mode:** Behavioral inflexibility, inability to control style/truthfulness/safety at runtime.

Maintains named "concept directions" in activation space. Directions can be extracted via contrastive activation analysis (RepE) or learned end-to-end. Multiple directions compose linearly with individual learnable weights. Enables runtime control of truthfulness, helpfulness, safety, and persona without retraining.

- **Injection:** `RESIDUAL_STREAM` (middle layers)
- **Mechanism:** h_new = h + layer_scale × Σ(α_i × direction_i)
- **Based on:** [Representation Engineering (Zou et al. 2023)](https://arxiv.org/abs/2310.01405)
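The mechanism line above is essentially the whole module. Here is a minimal sketch of composing named, individually weighted directions; `SteeringSketch` is a hypothetical stand-in, not the actual `SteeringVector` class:

```python
import torch
import torch.nn as nn

class SteeringSketch(nn.Module):
    """Sketch: h_new = h + layer_scale * sum_i(alpha_i * direction_i)."""

    def __init__(self, hidden_dim: int, names):
        super().__init__()
        self.directions = nn.ParameterDict({n: nn.Parameter(torch.zeros(hidden_dim)) for n in names})
        self.alphas = nn.ParameterDict({n: nn.Parameter(torch.zeros(1)) for n in names})
        self.layer_scale = nn.Parameter(torch.ones(1))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: residual-stream activations [batch, seq, dim]
        delta = sum(self.alphas[n] * self.directions[n] for n in self.directions)
        return h + self.layer_scale * delta


steer = SteeringSketch(hidden_dim=64, names=["truthfulness", "safety"])
h = torch.randn(1, 5, 64)
print(torch.allclose(steer(h), h))  # True: zero-init directions and alphas make it a no-op
```

Directions can also be extracted from contrastive data rather than learned end-to-end; see "Extracting Steering Directions (RepE)" below.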
### 6. ⚡ AdaptiveDepth — Dynamic Layer Skipping

**Failure mode:** Fixed compute depth, wasted computation, overthinking (representation collapse).

Each layer gets a learned gate that decides per-token whether to execute or skip. Easy tokens ("the") skip many layers; hard tokens (complex reasoning) use all of them. Includes budget regularization to target a desired compute fraction.

- **Injection:** `POST_FFN` (all layers)
- **Mechanism:** Gate network → Sigmoid → Scale residual contribution → Budget loss
- **Based on:** [Mixture of Depths (Raposo et al. 2024)](https://arxiv.org/abs/2404.02258), [GateSkip (2025)](https://arxiv.org/abs/2510.13876), [Router-Tuning (2024)](https://arxiv.org/abs/2410.13184)
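A minimal sketch of the per-token gate and budget penalty; `DepthGateSketch` is a hypothetical name, and the 0.7 target fraction is an arbitrary example, not a Cortex default:

```python
import torch
import torch.nn as nn

class DepthGateSketch(nn.Module):
    """Sketch: per-token gate that scales a layer's residual contribution,
    plus a budget penalty toward a target compute fraction."""

    def __init__(self, hidden_dim: int, target_budget: float = 0.7):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)
        self.target_budget = target_budget
        nn.init.zeros_(self.gate.weight)
        nn.init.constant_(self.gate.bias, 4.0)   # start nearly open: behaves like the base model
        self.last_gate = None

    def forward(self, h_in: torch.Tensor, h_out: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(h_in))        # [batch, seq, 1]; ~0 means "skip this layer"
        self.last_gate = g
        return h_in + g * (h_out - h_in)          # scale the layer's residual contribution

    def budget_loss(self) -> torch.Tensor:
        # Penalize deviation from the desired fraction of executed layers.
        return (self.last_gate.mean() - self.target_budget) ** 2


gate = DepthGateSketch(hidden_dim=64)
h_in, h_out = torch.randn(1, 5, 64), torch.randn(1, 5, 64)
out = gate(h_in, h_out)
loss = gate.budget_loss()   # added to the task loss during training
```

A soft gate like this still executes every layer; real compute savings require actually skipping a layer's computation when its gate is (effectively) zero.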
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from cortex import (
    CortexSurgeon, MemoryBank, HallucinationGate,
    PauseAndThink, BacktrackHead, SteeringVector, AdaptiveDepth
)

# Load any pretrained LLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Create surgeon
surgeon = CortexSurgeon(model)
hidden_dim = surgeon.hidden_dim

# Add modules — each targets specific layers
surgeon.add_module("memory", MemoryBank(hidden_dim=hidden_dim, num_slots=64))
surgeon.add_module("halluc_gate", HallucinationGate(hidden_dim=hidden_dim))
surgeon.add_module("pause_think", PauseAndThink(hidden_dim=hidden_dim, num_think_tokens=8))
surgeon.add_module("backtrack", BacktrackHead(hidden_dim=hidden_dim, num_layers=surgeon.num_layers))
surgeon.add_module("steering", SteeringVector(hidden_dim=hidden_dim, num_directions=4))
surgeon.add_module("adaptive_depth", AdaptiveDepth(hidden_dim=hidden_dim))

# Perform surgery (freezes base model, only Cortex modules train)
surgeon.operate(freeze_base=True)

# Use the model normally — Cortex modules are active
inputs = tokenizer("The meaning of life is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

# Toggle modules on/off at runtime
surgeon.modules["halluc_gate"].disable()

# Save only Cortex weights (~3% of model size)
surgeon.save_cortex_modules("cortex_weights.pt")
```

## Benchmark Harness

Cortex includes a comprehensive benchmark harness for comparing base LLMs against Cortex-enhanced versions. It evaluates across standard NLP benchmarks and Cortex-specific capability tests.

### Standard Benchmarks

| Task              | Type                     | Choices | Dataset                | Few-Shot |
|-------------------|--------------------------|---------|------------------------|----------|
| **HellaSwag**     | Commonsense NLI          | 4       | `Rowan/hellaswag`      | 5-shot   |
| **ARC-Easy**      | Science QA               | 3-5     | `allenai/ai2_arc`      | 5-shot   |
| **ARC-Challenge** | Science QA (hard)        | 3-5     | `allenai/ai2_arc`      | 5-shot   |
| **PIQA**          | Physical intuition       | 2       | `gimmaru/piqa`         | 0-shot   |
| **WinoGrande**    | Coreference              | 2       | `allenai/winogrande`   | 5-shot   |
| **MMLU**          | Multi-domain knowledge   | 4       | `cais/mmlu`            | 5-shot   |
| **HaluEval**      | Hallucination detection  | 2       | `pminervini/HaluEval`  | 0-shot   |

### Cortex-Specific Benchmarks

| Task                  | Tests                                      | Method                                                           |
|-----------------------|--------------------------------------------|------------------------------------------------------------------|
| **Passkey Retrieval** | Long-context memory, attention to details  | Generation + substring match at 128/256/512/1024 token contexts  |
| **Multi-Hop Memory**  | Compositional reasoning, fact chaining     | Generation + answer extraction from 3-hop fact chains            |

### Running Benchmarks

```bash
# Quick test (10 examples per task)
python -m benchmark.run_benchmark --n 10 --tasks hellaswag piqa

# Standard suite (50 examples, default tasks)
python -m benchmark.run_benchmark --n 50

# Full evaluation with all tasks
python -m benchmark.run_benchmark --n 0 --tasks hellaswag piqa arc-easy arc-challenge winogrande mmlu

# Custom model
python -m benchmark.run_benchmark --model meta-llama/Llama-3.2-1B --n 50

# Save JSON results
python -m benchmark.run_benchmark --n 50 --output results.json

# Skip memory benchmarks
python -m benchmark.run_benchmark --n 50 --no-memory

# Custom passkey test
python -m benchmark.run_benchmark --n 20 --passkey-lengths 128 256 512 1024 --n-passkey 10

# Small run of all multiple-choice tasks on a specific model
python -m benchmark.run_benchmark --n 10 --model meta-llama/Llama-3.2-1B --tasks hellaswag piqa arc-easy arc-challenge winogrande mmlu
```

### Scoring Method

- **Multiple-choice tasks:** Log-likelihood scoring — computes the average log-probability the model assigns to each continuation and picks the highest (a minimal sketch follows below). This is the standard approach used by lm-evaluation-harness and the Open LLM Leaderboard.
- **Generation tasks:** Greedy decode + substring match against the expected answer.
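For reference, here is a minimal sketch of that multiple-choice scoring rule. `choice_logprob` is a hypothetical helper, not the harness's actual code, and it glosses over tokenizer edge cases at the context/continuation boundary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def choice_logprob(model, tokenizer, context: str, continuation: str) -> float:
    """Average log-probability of `continuation` given `context` (hypothetical helper)."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                       # [1, seq, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)     # position t predicts token t+1
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    cont_len = full_ids.shape[1] - ctx_ids.shape[1]           # number of continuation tokens
    return token_lp[-cont_len:].mean().item()

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

choices = [" Paris", " Berlin", " Madrid"]
scores = [choice_logprob(model, tokenizer, "The capital of France is", c) for c in choices]
print(choices[scores.index(max(scores))])  # highest average log-probability wins
```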
### Example Output (Llama-3.2-1B, n=10)

```
======================================================================
BENCHMARK SUMMARY: meta-llama/Llama-3.2-1B
n=10 per task, device=mps
======================================================================
Task                     Base     Cortex      Delta
--------------------------------------------------
hellaswag              0.6000     0.6000    +0.0000
piqa                   0.2000     0.2000    +0.0000
arc-easy               0.4000     0.4000    +0.0000
arc-challenge          0.5000     0.5000    +0.0000
winogrande             0.6000     0.6000    +0.0000
mmlu                   0.4000     0.4000    +0.0000
passkey                1.0000     1.0000    +0.0000
multi_hop              1.0000     1.0000    +0.0000

Cortex overhead: 53,708,968 params (4.35%)
======================================================================
```

> **Note:** Cortex modules are untrained at injection and initialize as exact no-ops, so a freshly injected model should match the base model. Positive deltas require Cortex-specific training or calibrated steering directions.

### Programmatic Usage

```python
from benchmark.runner import BenchmarkRunner

runner = BenchmarkRunner(model_name="HuggingFaceTB/SmolLM2-135M")
results = runner.run_comparison(
    tasks=["hellaswag", "piqa", "arc-easy"],
    n=50,
    include_memory=True,
    passkey_lengths=[128, 256, 512],
)
BenchmarkRunner.print_summary(results)
```

## Design Principles

### 1. Zero-Init for Stable Injection

All modules initialize their output projections to zero (or near-zero via negative gate biases). This means that at injection time the model behaves identically to the original — Cortex modules are "invisible" and gradually learn to contribute during training.

### 2. Hook-Based Surgery

Modules are injected via PyTorch `register_forward_hook` / `register_forward_pre_hook`. No model code is modified. This works with any HuggingFace `transformers` model that has a standard layer structure.

### 3. Shared Parameters Across Layers

Each module instance is shared across its target layers. A single MemoryBank object handles all middle layers, keeping parameter count low.

### 4. Base Model Freezing

By default, all base model parameters are frozen. Only Cortex module parameters are trainable. This means:

- No catastrophic forgetting of the base model's capabilities
- Tiny training cost (~3% of parameters)
- Multiple Cortex configurations can be saved/loaded/swapped

### 5. Composability

All modules are independent and composable. Use any combination:

- Memory + HallucinationGate for factual QA
- PauseAndThink + AdaptiveDepth for reasoning tasks
- SteeringVector alone for behavioral control
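Principles 1 and 2 combine into one pattern: a zero-initialized module attached with a forward hook leaves the hooked layer's output unchanged at injection time. A minimal, self-contained sketch of that pattern follows; `ZeroInitAdapter` and `inject` are hypothetical names, not the actual `CortexSurgeon` internals:

```python
import torch
import torch.nn as nn

class ZeroInitAdapter(nn.Module):
    """Sketch: bottleneck plug-in whose output projection is zero-initialized,
    so it contributes exactly nothing until trained."""

    def __init__(self, hidden_dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)
        nn.init.zeros_(self.up.weight)   # zero-init output projection
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.tanh(self.down(h)))


def inject(layer: nn.Module, adapter: nn.Module):
    """Attach the adapter to a transformer block with a forward hook (no model code modified)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        patched = adapter(hidden)
        return (patched,) + output[1:] if isinstance(output, tuple) else patched
    return layer.register_forward_hook(hook)


# Hypothetical usage on a LLaMA-style block:
# handle = inject(model.model.layers[10], ZeroInitAdapter(hidden_dim=2048))
# handle.remove()  # detach again
```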
## Injection Points

| Point             | Location                     | Best For                                              |
|-------------------|------------------------------|-------------------------------------------------------|
| `PRE_ATTENTION`   | Before self-attention        | Input preprocessing, prefix injection                 |
| `POST_ATTENTION`  | After attention, before FFN  | Memory augmentation (reads enhance attention output)  |
| `PRE_FFN`         | Before FFN                   | Gate what the FFN processes                           |
| `POST_FFN`        | After full block             | Gating, confidence estimation                         |
| `RESIDUAL_STREAM` | Wraps entire block           | Steering vectors, thinking tokens, backtracking       |

## Layer Targeting

```python
# Target specific layers
surgeon.add_module("mod", module, target_layers=[0, 1, 2, 3])

# Target all layers
surgeon.add_module("mod", module, target_layers="all")

# Target middle third (best for steering/memory)
surgeon.add_module("mod", module, target_layers="middle")

# Target deep layers (best for output-facing modifications)
surgeon.add_module("mod", module, target_layers="deep")
```

## Compatible Models

Tested and working with any model using the standard `model.layers[i]` structure:

- **LLaMA** family (LLaMA 2/3, CodeLLaMA)
- **Mistral** / Mixtral
- **Qwen2**
- **Gemma**
- **Phi**
- **SmolLM**
- **GPT-2** / GPT-Neo (uses `transformer.h`)

## Monitoring

```python
# Hallucination confidence
confidence = surgeon.modules["halluc_gate"].get_confidence()

# Backtracking status
was_triggered = surgeon.modules["backtrack"].was_triggered()
confidence_per_layer = surgeon.modules["backtrack"].get_confidence_history()

# Adaptive depth statistics
gate_stats = surgeon.modules["adaptive_depth"].get_gate_stats()
print(f"Mean gate: {gate_stats['mean']:.3f}, Skip fraction: {gate_stats['skip_frac']:.3f}")

# Steering vector info
for name, info in surgeon.modules["steering"].get_direction_info().items():
    print(f"{name}: alpha={info['alpha']:.3f}")

# Parameter report
report = surgeon.get_parameter_report()
```

## Extracting Steering Directions (RepE)

```python
# Contrastive activation pairs for "truthfulness"
positive = [
    "I know for certain that the Earth orbits the Sun.",
    "Scientific evidence clearly shows vaccines are safe.",
    "I don't know the answer to that question.",
]
negative = [
    "The Earth is flat and NASA is lying.",
    "Vaccines cause autism according to my research.",
    "I'm absolutely certain about this made-up fact.",
]

# Extract direction from layer 15
direction = SteeringVector.extract_direction(
    model, positive, negative, tokenizer, layer_idx=15, device="cuda"
)

# Set it in the steering module
surgeon.modules["steering"].set_direction("truthfulness", direction, alpha=10.0)
```

## Training

For benchmark-style supervised tuning, use the training CLI. It freezes the base model, injects Cortex modules, optimizes only Cortex parameters, and saves the adapter weights:

```bash
python -m benchmark.train_cortex \
    --model meta-llama/Llama-3.2-1B \
    --tasks hellaswag piqa arc-easy winogrande \
    --n-train 32 \
    --epochs 1 \
    --output cortex_tuned.pt

python -m benchmark.run_benchmark \
    --model meta-llama/Llama-3.2-1B \
    --cortex-weights cortex_tuned.pt \
    --n 50
```

For custom training loops:

```python
import torch.optim as optim

# Only train Cortex parameters
optimizer = optim.AdamW(surgeon.get_trainable_parameters(), lr=1e-4)

for batch in dataloader:
    outputs = model(**batch)
    loss = outputs.loss

    # Add adaptive depth budget loss
    loss = loss + surgeon.modules["adaptive_depth"].get_budget_loss()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

## Test Results (SmolLM2-135M)

```
✓ All 9 tests passed
✓ 4,286,918 Cortex params (3.19% overhead on 134M model)
✓ Base model: 0 gradients (fully frozen)
✓ Cortex modules: gradients flowing
✓ Enable/disable: exact zero diff when disabled
✓ Generation: produces coherent output
✓ Save/load: 16.4 KB checkpoint
```

## Citation

If you use Cortex in your research, please cite the papers that inspired each module:

```bibtex
@article{kang2025lm2,
  title={LM2: Large Memory Models},
  author={Kang, et al.},
  journal={arXiv:2502.06049},
  year={2025}
}

@article{chen2024internalinspector,
  title={InternalInspector I2: Robust Confidence Estimation in LLMs through Internal States},
  author={Chen, et al.},
  journal={arXiv:2406.12053},
  year={2024}
}

@article{goyal2023think,
  title={Think before you speak: Training Language Models With Pause Tokens},
  author={Goyal, et al.},
  journal={arXiv:2310.02226},
  year={2023}
}

@article{zou2023representation,
  title={Representation Engineering: A Top-Down Approach to AI Transparency},
  author={Zou, et al.},
  journal={arXiv:2310.01405},
  year={2023}
}

@article{raposo2024mixture,
  title={Mixture of Depths},
  author={Raposo, et al.},
  journal={arXiv:2404.02258},
  year={2024}
}
```

## License

Apache 2.0