thefinalboss/CogNet-40M

@@ -1,422 +1,85 @@
 ---
 language:
 - en
 - fr
-license: mit
-library_name: pytorch
 tags:
 - non-transformer
 - cognitive-routing
 - hierarchical-memory
 - character-level
-- o(n)-complexity
-- language-model
-- novel-architecture
 pipeline_tag: text-generation
-model_type: cognet
 ---
-<div align="center">
 # CogNet-40M
-### A Non-Transformer Language Model with Cognitive Routing and Hierarchical Memory
-[![MIT License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
-[![Python 3.10+](https://img.shields.io/badge/Python-3.10+-3776AB?logo=python&logoColor=white)]()
-[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-EE4C2C?logo=pytorch&logoColor=white)]()
-[![CUDA 13.1](https://img.shields.io/badge/CUDA-13.1-76B900?logo=nvidia&logoColor=white)]()
-**No self-attention. No quadratic complexity. Pure cognition.**
-[Architecture](#architecture) · [Quick Start](#quick-start) · [Training](#training) · [Benchmarks](#benchmarks)
-</div>
----
-## Why CogNet?
-Most language models today are built on the same foundation: the Transformer and its self-attention mechanism. Self-attention is powerful because it lets every token communicate with every other token in the sequence. But this communication comes at a cost: **O(n²) time and memory complexity** in the sequence length n. As sequences grow, the computational burden grows quadratically.
-CogNet asks a different question: **What if we replace self-attention entirely with mechanisms inspired by how human cognition actually works?**
-Human brains don't compute all-pairs interactions between every piece of information. Instead, we use:
-- **Selective routing** — we focus attention on relevant information channels
-- **Hierarchical memory** — we store and retrieve from working, episodic, and semantic memory
-- **Adaptive computation** — we spend more time on hard problems
-- **Compositional reasoning** — we bind roles to fillers to build complex representations
-CogNet implements each of these principles as a differentiable neural module, creating a language model that processes sequences in **O(n) time** while maintaining rich contextual representations through hierarchical memory.
----
 ## Architecture
-### System Overview
-CogNet replaces the standard Transformer block with a **Cognitive Block** that routes, remembers, reasons, and composes:
-```
-Input Tokens
-     │
-     ▼
-┌─────────────┐
-│ TokenEncoder │   Embedding + Learned Positional Encoding
-└──────┬──────┘
-       │
-  ┌────▼────────────────────────────────────────────┐
-  │           Cognitive Block × 6                    │
-  │                                                  │
-  │  ┌──────────────────┐                           │
-  │  │  CoherenceRouter │  O(n) channel routing      │
-  │  │  ┌────────────┐  │                           │
-  │  │  │ Channel 0  │  │  Depthwise Sep. Conv       │
-  │  │  │ Channel 1  │  │       + SwiGLU FFN         │
-  │  │  │ Channel 2  │  │                            │
-  │  │  │ Channel 3  │  │  (each channel processes   │
-  │  │  │ Channel 4  │  │   a routed subset of       │
-  │  │  │ Channel 5  │  │   tokens independently)    │
-  │  │  └────────────┘  │                           │
-  │  └──────────────────┘                           │
-  │           │                                      │
-  │  ┌────────▼──────────────────────┐              │
-  │  │  SharedHierarchicalMemory     │              │
-  │  │  ┌──────────┐ ┌────────────┐ ┌───────────┐ │
-  │  │  │ Working  │ │  Episodic  │ │  Semantic  │ │
-  │  │  │ 32 slots │ │  64 slots  │ │  128 slots │ │
-  │  │  │ (recent) │ │  (patterns)│ │  (concepts)│ │
-  │  │  └────┬─────┘ └─────┬──────┘ └─────┬─────┘ │
-  │  │       └──────┬──────┘──────────────┘        │
-  │  │         Gated Combination                    │
-  │  └──────────────────────────────┘              │
-  │           │                                      │
-  │  ┌────────▼──────────────────────┐              │
-  │  │  AdaptiveComputationBlock     │              │
-  │  │  (1-2 steps per token)        │              │
-  │  │  ┌──────┐  ┌──────┐          │              │
-  │  │  │FFN 1 │→│FFN 2 │  SwiGLU   │              │
-  │  │  └──────┘  └──────┘  + halt   │              │
-  │  └──────────────────────────────┘              │
-  │           │                                      │
-  │  ┌────────▼──────────────────────┐              │
-  │  │  CompositionalReasoner        │              │
-  │  │  Role-Filler Binding (HDC)    │              │
-  │  │  Circular Convolution         │              │
-  │  └──────────────────────────────┘              │
-  │                                                  │
-  └──────────────────────────────────────────────────┘
-       │
-  ┌────▼──────┐
-  │ LayerNorm  │
-  └────┬──────┘
-       │
-  ┌────▼──────┐
-  │ OutputHead│  Weight-tied with TokenEncoder
-  └───────────┘
-       │
-       ▼
-  Token Logits
-```
-### Component Deep Dive
-#### 1. CoherenceRouter — O(n) Token Routing
-The CoherenceRouter replaces self-attention with a learned routing mechanism that assigns each token to one or more processing channels based on **coherence scoring**. Unlike self-attention, which computes all n×n token interactions, the CoherenceRouter:
-1. Projects each token into a **query** and **key** vector of dimension `num_channels`
-2. Computes the mean key across the entire sequence (O(n) reduction)
-3. Scores each token against this mean key via element-wise multiplication (O(n))
-4. Applies a single refinement step for improved routing accuracy
-5. Produces soft routing weights via softmax, plus hard top-2 masks for efficiency
-**Complexity**: O(n × C) where C is the number of channels, compared to O(n²) for self-attention.
-The key insight is that routing doesn't need to know about every pairwise interaction — it only needs to know "which processing channel should handle this token?" This is analogous to how the brain routes sensory information to specialized cortical areas.
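The routing steps listed above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the names `coherence_route`, `Wq`, and `Wk` are hypothetical, the weights are random, and the refinement step (step 4) is omitted because the card does not specify it.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # W has one row per output feature (here: one row per channel)
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def coherence_route(tokens, Wq, Wk, top_k=2):
    """Hypothetical sketch: score each token against the sequence's mean key."""
    n, C = len(tokens), len(Wq)
    queries = [matvec(Wq, t) for t in tokens]   # n vectors of size C
    keys = [matvec(Wk, t) for t in tokens]      # n vectors of size C
    # Single O(n) reduction over the sequence
    mean_key = [sum(k[c] for k in keys) / n for c in range(C)]
    weights, masks = [], []
    for q in queries:
        # Element-wise product with the mean key, one score per channel
        scores = [qc * mc for qc, mc in zip(q, mean_key)]
        w = softmax(scores)                      # soft routing weights
        top = sorted(range(C), key=lambda c: w[c], reverse=True)[:top_k]
        weights.append(w)
        masks.append([1 if c in top else 0 for c in range(C)])  # hard top-k mask
    return weights, masks

random.seed(0)
d, C, n = 8, 6, 5  # toy sizes; the card lists hidden_dim=512 and 6 channels
Wq = [[random.gauss(0, 1) for _ in range(d)] for _ in range(C)]
Wk = [[random.gauss(0, 1) for _ in range(d)] for _ in range(C)]
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
weights, masks = coherence_route(tokens, Wq, Wk)
```

No token ever looks at another token directly; the only cross-token quantity is the mean key, which is why the cost stays linear in n.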
-#### 2. CognitiveChannel — Efficient Per-Channel Processing
-Each of the 6 CognitiveChannels processes the tokens routed to it using two stacked operations:
-- **Depthwise Separable Convolution**: A depthwise conv (kernel=3, groups=channel_dim) captures local patterns, followed by a pointwise conv (kernel=1) for cross-feature mixing. This is O(n) per channel.
-- **SwiGLU Feed-Forward Network**: The SwiGLU activation (SiLU gate × linear up) provides the non-linear transformation capacity of a standard FFN, but applied independently within each channel's feature space.
-Both operations include residual connections and LayerNorm for stable training.
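To make the cost of the depthwise separable convolution concrete, the weight counts can be worked out from the stated kernel size (3) and channel dimension (128). The helper `conv1d_weights` is hypothetical and ignores biases.

```python
def conv1d_weights(in_ch, out_ch, kernel, groups=1):
    # Weight count of a 1-D convolution, biases ignored
    return (in_ch // groups) * out_ch * kernel

channel_dim, kernel = 128, 3

# Depthwise: one 3-tap filter per feature (groups == channel_dim)
depthwise = conv1d_weights(channel_dim, channel_dim, kernel, groups=channel_dim)
# Pointwise: kernel size 1, mixes features at each position
pointwise = conv1d_weights(channel_dim, channel_dim, 1)
separable = depthwise + pointwise

# A standard convolution with the same kernel and width, for comparison
full = conv1d_weights(channel_dim, channel_dim, kernel)

print(separable, full, round(full / separable, 2))  # prints: 16768 49152 2.93
```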
-#### 3. SharedHierarchicalMemory — 3-Tier Key-Value Store
-This is the core innovation that enables CogNet to maintain rich contextual representations without self-attention. Inspired by the Atkinson-Shiffrin model of human memory, the module implements three tiers of learned key-value memory:
-| Tier | Slots | Analogy | Content |
-|------|-------|---------|---------|
-| **Working Memory** | 32 | Short-term buffer | Recent token representations |
-| **Episodic Memory** | 64 | Event sequences | Recurring patterns and phrases |
-| **Semantic Memory** | 128 | Knowledge store | Abstract concepts and relationships |
-**Read mechanism**: For each input token, the module projects a query vector and performs scaled dot-product attention against each tier's keys and values independently. The three tier outputs are then combined via a **learned gating network** that produces softmax weights over the three tiers, allowing the model to dynamically balance between recent context (Working), pattern matching (Episodic), and conceptual knowledge (Semantic).
-**Key properties**:
-- Memory slots are **learned parameters** — they encode persistent knowledge across the entire training corpus, not just the current sequence
-- The gating mechanism enables **dynamic memory access** — different tokens may rely more on working memory (for local coherence) or semantic memory (for factual knowledge)
-- Total memory capacity: 224 key-value pairs per layer, providing a compressed but rich knowledge store
-**Complexity**: O(n × S) per tier where S is the number of slots, compared to O(n²) for self-attention. Since S is fixed (224 total), this is effectively O(n).
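The read mechanism described above can be sketched for a single query as follows. This is an illustration under assumptions: the gate is taken to be a linear map of the query followed by softmax, values share the key dimension, and all names (`read_tier`, `read_memory`, `gate_weights`) are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def read_tier(query, keys, values):
    # Scaled dot-product attention of one query over a fixed set of slots
    scale = math.sqrt(len(query))
    attn = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(a * v[j] for a, v in zip(attn, values)) for j in range(dim)]

def read_memory(query, tiers, gate_weights):
    # Read each tier independently, then mix the three outputs with softmax gates
    outputs = [read_tier(query, keys, values) for keys, values in tiers]
    gates = softmax([dot(row, query) for row in gate_weights])  # assumed linear gate
    dim = len(outputs[0])
    combined = [sum(g * out[j] for g, out in zip(gates, outputs)) for j in range(dim)]
    return combined, gates

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
key_dim = 4  # tiny for readability; the card lists key_dim=256
# Working (32), Episodic (64), Semantic (128) slots as (keys, values) pairs
tiers = [(rand_matrix(s, key_dim), rand_matrix(s, key_dim)) for s in (32, 64, 128)]
gate_weights = rand_matrix(3, key_dim)
query = [random.gauss(0, 1) for _ in range(key_dim)]
combined, gates = read_memory(query, tiers, gate_weights)
```

Each query touches 32 + 64 + 128 = 224 slots regardless of sequence length, so reading for n tokens costs O(n × 224).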
-#### 4. AdaptiveComputationBlock — Variable-Depth Processing
-Not all tokens require the same amount of computation. The AdaptiveComputationBlock allows each token to be processed for 1 to `max_adaptive_steps` iterations of SwiGLU FFN layers, with a learned **halting mechanism** that determines when a token's representation is sufficiently refined.
-After each step, a sigmoid halting probability is computed. The token's output is the weighted sum of its intermediate states, where the weights are determined by the halting probabilities. This enables:
-- **Fast processing** for simple, predictable tokens (e.g., articles, common suffixes)
-- **Deep processing** for ambiguous or information-rich tokens (e.g., rare words, punctuation at clause boundaries)
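The card does not say exactly how halting probabilities become mixing weights. One common choice, assumed here, is the remainder scheme from adaptive computation time: each step takes a share of the remaining weight, and the final step takes whatever is left. `toy_step` and `toy_halt_logit` are hypothetical stand-ins for the SwiGLU FFN and the halting head.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def toy_step(state):
    # Stand-in for one SwiGLU FFN step (hypothetical)
    return [math.tanh(2.0 * x) for x in state]

def toy_halt_logit(state):
    # Stand-in for the learned halting head (hypothetical)
    return sum(state)

def adaptive_compute(state, step_fn, halt_fn, max_steps=2):
    """Weighted sum of intermediate states; weights always sum to 1."""
    output = [0.0] * len(state)
    remaining = 1.0
    weights = []
    for step in range(max_steps):
        state = step_fn(state)
        if step == max_steps - 1:
            weight = remaining  # last step absorbs the leftover weight
        else:
            weight = remaining * sigmoid(halt_fn(state))
        weights.append(weight)
        output = [o + weight * s for o, s in zip(output, state)]
        remaining -= weight
    return output, weights

output, weights = adaptive_compute([0.5, -0.2, 0.1], toy_step, toy_halt_logit)
```

A token whose first halting probability is near 1 is dominated by its first-step state, which corresponds to the "fast processing" case above.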
-#### 5. CompositionalReasoner — Hyperdimensional Binding
-The CompositionalReasoner implements **role-filler binding** from hyperdimensional computing (HDC). It projects each token into a role vector and a filler vector, then binds them via element-wise multiplication, which is the frequency-domain equivalent of circular convolution. A shift-based unbinding operation adds positional awareness.
-This enables the model to represent compositional structures like "the **subject** of the sentence is **the cat**" where "subject" is the role and "the cat" is the filler — a fundamental capability for understanding linguistic structure without explicit syntax trees.
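The link between element-wise multiplication and circular convolution is the convolution theorem: multiplying two spectra element-wise and transforming back gives the circular convolution of the original vectors. The check below uses a naive DFT and made-up `role` and `filler` vectors; it illustrates the identity only and says nothing about which domain the model's vectors live in.

```python
import cmath

def dft(x, inverse=False):
    # Naive O(n^2) discrete Fourier transform, enough for a small check
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out

def circular_convolution(a, b):
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

def bind_via_spectrum(a, b):
    # Element-wise product of the two spectra, then inverse transform
    product = [x * y for x, y in zip(dft(a), dft(b))]
    return [v.real for v in dft(product, inverse=True)]

role = [1.0, 0.0, -1.0, 2.0]      # illustrative values
filler = [0.5, 1.5, 0.0, -1.0]    # illustrative values
direct = circular_convolution(role, filler)
spectral = bind_via_spectrum(role, filler)
```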
----
-## Complexity Analysis
-| Operation | Transformer | CogNet | Speedup Factor |
-|-----------|-------------|--------|----------------|
-| Token mixing | O(n² × d) | O(n × C × d) | **n / C** |
-| Memory access | O(n² × d) | O(n × S × d) | **n / S** |
-| FFN | O(n × d × ff) | O(n × d × ff) | 1× (same) |
-| **Total per layer** | **O(n² × d)** | **O(n × (C + S + ff) × d)** | **~n / (C + S)** |
-For a 256-token sequence with C=6 channels and S=224 memory slots, the n / (C + S) factor is 256 / 230 ≈ 1.1×, so a CogNet layer and an equivalent Transformer layer cost about the same at this length. The advantage grows linearly with sequence length: at 1024 tokens the factor is about 4.5×.
----
-## Model Specifications
-| Parameter | Value |
-|-----------|-------|
-| **Architecture** | CogNet (Non-Transformer) |
-| **Total Parameters** | 39,725,784 (~40M) |
-| **Hidden Dimension** | 512 |
-| **Cognitive Blocks** | 6 |
-| **Cognitive Channels** | 6 |
-| **Channel Dimension** | 128 |
-| **FF Dimension** | 1024 |
-| **Working Memory Slots** | 32 |
-| **Episodic Memory Slots** | 64 |
-| **Semantic Memory Slots** | 128 |
-| **Key Dimension** | 256 |
-| **Max Sequence Length** | 256 |
-| **Vocabulary Size** | 136 (character-level) |
-| **Model Size (FP32)** | ~159 MB |
-| **Model Size (FP16)** | ~80 MB |
-| **Adaptive Steps** | 1–2 |
-| **Routing Iterations** | 1 |
-| **Composition** | Hyperdimensional binding |
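The model sizes in the table follow from the parameter count, assuming the files hold weights only (no optimizer state) and that MB means 10^6 bytes:

```python
params = 39_725_784

fp32_mb = params * 4 / 1e6  # 4 bytes per FP32 weight
fp16_mb = params * 2 / 1e6  # 2 bytes per FP16 weight

print(f"FP32: {fp32_mb:.1f} MB, FP16: {fp16_mb:.1f} MB")
# prints: FP32: 158.9 MB, FP16: 79.5 MB
```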
-### Character-Level Tokenizer
-CogNet uses a 136-character vocabulary tokenizer that covers:
-- Standard ASCII (printable characters, digits, punctuation)
-- French accented characters (à, é, è, ê, ë, î, ï, ô, ù, û, ü, ÿ, ç, æ, œ)
-- Special formatting characters (tab, newline)
-- European typographic marks (guillemets « », inverted question mark ¿)
-Character-level tokenization ensures:
-- **No out-of-vocabulary tokens** — every string is representable
-- **Cross-lingual capability** — no bias toward English subword units
-- **Compact vocabulary** — only 136 embedding vectors vs 32K+ for BPE tokenizers
-- **Fine-grained generation** — the model learns orthographic patterns directly
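A character-level tokenizer is little more than a lookup table in each direction. The class below is a hypothetical sketch, not the repository's `CharTokenizer`, and its alphabet is a small stand-in for the 136-character vocabulary.

```python
class SimpleCharTokenizer:
    """Hypothetical character-level tokenizer: one id per distinct character."""

    def __init__(self, alphabet):
        self.itos = sorted(set(alphabet))                       # id -> char
        self.stoi = {ch: i for i, ch in enumerate(self.itos)}   # char -> id

    def encode(self, text):
        # Raises KeyError for characters outside the alphabet
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = SimpleCharTokenizer("abcdefghijklmnopqrstuvwxyz éàç«»")
ids = tok.encode("un café")
round_trip = tok.decode(ids)
```

Encoding then decoding returns the input unchanged for any string made of characters in the alphabet.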
----
-## Quick Start
-### Installation
-```bash
-pip install torch
-```
-### Download Model
-```python
-from huggingface_hub import hf_hub_download
-# Download model checkpoint
-ckpt_path = hf_hub_download("AFKmoney/CogNet-40M", "cognet_best.pt")
-tokenizer_path = hf_hub_download("AFKmoney/CogNet-40M", "tokenizer_v3.json")
-model_code = hf_hub_download("AFKmoney/CogNet-40M", "cognet_1b.py")
-infer_code = hf_hub_download("AFKmoney/CogNet-40M", "infer.py")
-```
-### Inference
-```python
-import sys, torch
-sys.path.insert(0, ".")  # Add downloaded files to path
-from cognet_1b import CogNet1B
-from infer import CharTokenizer
-# Load tokenizer
-tokenizer = CharTokenizer.load("tokenizer_v3.json")
-# Build model
-model = CogNet1B(
-    vocab_size=136, hidden_dim=512, num_blocks=6,
-    num_channels=6, channel_dim=128, ff_dim=1024,
-    routing_iters=1, max_adaptive_steps=2, max_seq_len=256,
-    working_slots=32, episodic_slots=64, semantic_slots=128,
-    key_dim=256, dropout=0.1
-)
-# Load checkpoint (handles FP16 weights)
-ckpt = torch.load("cognet_best.pt", map_location="cpu", weights_only=False)
-state = {k: v.float() if v.dtype == torch.float16 else v
-         for k, v in ckpt["model_state_dict"].items()}
-model.load_state_dict(state)
-model.eval()
-# Generate text
-prompt = "Once upon a time"
-ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)
-with torch.no_grad():
-    gen = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
-print(tokenizer.decode(gen[0].tolist()))
-```
-### CUDA Inference
-```python
-device = "cuda" if torch.cuda.is_available() else "cpu"
-model = model.to(device)
-ids = ids.to(device)
-with torch.no_grad():
-    gen = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
-```
----
 ## Training
-### Training Data
-| Dataset | Size | Domain | Language |
-|---------|------|--------|----------|
-| WikiText-2 (raw) | ~2M tokens | Encyclopedic | English |
-| TinyStories (50K) | ~15M tokens | Narrative | English |
-| Alpaca (52K) | ~5M tokens | Instructions | English |
-| **Total** | **~63M tokens** | **Mixed** | **English** |
-### Training Configuration
-| Parameter | Value |
-|-----------|-------|
-| **Hardware** | NVIDIA GeForce RTX 5060 Ti (16 GB VRAM) |
-| **Sequence Length** | 256 |
-| **Batch Size** | 64 (gradient accumulation × 4, effective = 256) |
-| **Optimizer** | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01) |
-| **Learning Rate** | 5e-4 (cosine schedule with warmup) |
-| **Warmup Steps** | 500 |
-| **Precision** | FP16 (AMP) with TF32 enabled |
-| **Gradient Clipping** | 1.0 |
-| **Total Steps** | 30,000 |
-| **Throughput** | ~3M tokens/min |
-### Training Curve
-Training loss follows a smooth descent from initial loss ~4.5 (random character predictions over 136 vocab) down to ~0.007, with validation perplexity reaching 1.01 — meaning the model predicts the next character with high confidence. The hierarchical memory gates show interesting dynamics: Working memory dominates early in training (local character patterns), while Semantic memory gates increase as the model learns abstract patterns.
----
-## Benchmarks
-### Perplexity
 | Metric | Value |
 |--------|-------|
-| Training Loss | 0.007 |
-| Training PPL | 1.01 |
-| Validation Loss | 0.008 |
-| Validation PPL | 1.01 |
-*Note: These perplexity scores are character-level on the training distribution. Cross-model comparison with BPE-tokenized models requires adjustment for tokenization granularity.*
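The reported perplexities can be reproduced from the losses, assuming the loss is the mean per-character cross-entropy in nats (the PyTorch default). The conversion to bits per character is a common way to compare against models with other tokenizers; the function names are illustrative.

```python
import math

def perplexity(loss_nats):
    # Perplexity is the exponential of the mean cross-entropy
    return math.exp(loss_nats)

def bits_per_char(loss_nats):
    # Convert nats to bits
    return loss_nats / math.log(2)

for name, loss in [("train", 0.007), ("validation", 0.008)]:
    print(name, round(perplexity(loss), 2), round(bits_per_char(loss), 4))
```

Both losses give a perplexity that rounds to 1.01, matching the table.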
-### Generation Samples
-```
-Prompt: "The "
-Output:  "The little cat, Lily watched her. She day, and sorry"
-Prompt: "Once upon a time"
-Output:  "Once upon a time there was a little girl named Lily. She"
-Prompt: "CogNet is"
-Output:  "CogNet is a model that can help her and here children."
 ```
-### Scaling Properties
-CogNet's O(n) complexity means it scales favorably with sequence length:
-| Sequence Length | Transformer O(n²) Ops | CogNet O(n) Ops | Ratio |
-|----------------|----------------------|-----------------|-------|
-| 256 | 65,536 | 256 | 256× |
-| 512 | 262,144 | 512 | 512× |
-| 1024 | 1,048,576 | 1,024 | 1,024× |
-| 2048 | 4,194,304 | 2,048 | 2,048× |
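-*Theoretical ops for a single self-attention layer vs. CogNet routing + memory, counting only the sequence-length term; the constant factors C (channels) and S (memory slots) are omitted.*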
----
-## Architecture Comparison
-| Feature | GPT-2 (Small) | CogNet-40M |
-|---------|---------------|------------|
-| Parameters | 117M | 40M |
-| Architecture | Transformer (decoder) | Cognitive Routing |
-| Sequence mixing | Self-Attention (O(n²)) | Coherence Routing (O(n)) |
-| Memory mechanism | Fixed context window | Hierarchical 3-tier memory |
-| Computation | Uniform per token | Adaptive (1-2 steps) |
-| Tokenizer | BPE (50,257 vocab) | Character (136 vocab) |
-| Max context | 1,024 tokens | 256 tokens |
-| Composition | None | Hyperdimensional binding |
-| Positional encoding | Learned | Learned |
----
-## Limitations
-- **Context length**: Currently limited to 256 tokens. Extending to longer contexts requires architectural modifications to the memory read mechanism.
-- **Character-level tokenization**: While OOV-free, character-level models require more processing steps to build up word-level and phrase-level representations compared to subword tokenizers.
-- **Scale**: At 40M parameters, CogNet is a research proof-of-concept. Scaling to 1B+ parameters is the next milestone.
-- **Evaluation**: Benchmarks are computed on the training distribution. Zero-shot evaluation on standard NLP benchmarks is planned.
-- **Language coverage**: Currently trained on English text only, though the tokenizer supports French accented characters.
----
-## Citation
-```bibtex
-@software{cognet2026,
-  title = {CogNet: A Non-Transformer Language Model with Cognitive Routing and Hierarchical Memory},
-  author = {AFKmoney},
-  year = {2026},
-  url = {https://huggingface.co/AFKmoney/CogNet-40M},
-  license = {MIT}
-}
-```
----
-<div align="center">
-**Built from scratch. No transformers. Just cognition.**
-</div>

 ---
+license: mit
 language:
 - en
 - fr
+- code
 tags:
 - non-transformer
 - cognitive-routing
 - hierarchical-memory
 - character-level
+- aicl
+- text-generation
+- custom-architecture
 pipeline_tag: text-generation
+library_name: pytorch
 ---
 # CogNet-40M
+A 39.7M-parameter non-transformer language model with O(n) cognitive routing and hierarchical memory.
 ## Architecture
+| Component | Detail |
+|-----------|--------|
+| Architecture | Non-transformer (Cognitive Routing) |
+| Parameters | 39,718,536 (~40M) |
+| Hidden Dim | 512 |
+| Blocks | 6 cognitive blocks |
+| Channels | 6 routing channels x 128 dim |
+| FF Dim | 1024 |
+| Max Seq Len | 256 |
+| Tokenizer | Character-level (136 vocab) |
+## Hierarchical Memory
+- Working Memory (32 slots): Active processing
+- Episodic Memory (64 slots): Short-term recall
+- Semantic Memory (128 slots): Long-term knowledge
 ## Training
 | Metric | Value |
 |--------|-------|
+| Steps | 50,000 |
+| Batch Size | 64 |
+| LR | 3e-4 (cosine) |
+| Precision | FP16 AMP |
+| GPU | RTX 5060 Ti 16GB |
+| Final Loss | ~0.005 |
+| Final PPL | ~1.01 |
+## Quick Start
+```python
+from inference import CogNetInference
+ai = CogNetInference("cognet_best.pt", "tokenizer_v3.json")
+print(ai.generate("Once upon a time"))
 ```
+## AICL Integration
+CogNet powers AICL (Architecture Compilation Language) as its native AI engine for code generation, diagnosis, and repair.
+## Files
+| File | Size | Description |
+|------|------|-------------|
+| cognet_best.pt | 152MB | FP32 checkpoint |
+| cognet_fp16.pt | 77MB | FP16 checkpoint |
+| tokenizer_v3.json | - | Char tokenizer (136 vocab) |
+| config.json | - | Model config |
+| cognet_model.py | - | Architecture source |
+| inference.py | - | Inference script |
+## Roadmap
+- [x] CogNet-40M (39.7M)
+- [x] HuggingFace integration
+- [x] AICL native engine
+- [ ] CogNet-1B (1B params)
+- [ ] ONNX export
+MIT License. Built with PyTorch on RTX 5060 Ti via QuickPod.