Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -1,422 +1,85 @@
|
|
| 1 |
---
|
|
|
|
| 2 |
language:
|
| 3 |
- en
|
| 4 |
- fr
|
| 5 |
-
|
| 6 |
-
library_name: pytorch
|
| 7 |
tags:
|
| 8 |
- non-transformer
|
| 9 |
- cognitive-routing
|
| 10 |
- hierarchical-memory
|
| 11 |
- character-level
|
| 12 |
-
-
|
| 13 |
-
-
|
| 14 |
-
-
|
| 15 |
pipeline_tag: text-generation
|
| 16 |
-
|
| 17 |
---
|
| 18 |
|
| 19 |
-
<div align="center">
|
| 20 |
-
|
| 21 |
# CogNet-40M
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|
| 29 |
-
|
| 30 |
-
**No self-attention. No quadratic complexity. Pure cognition.**
|
| 31 |
-
|
| 32 |
-
[Architecture](#architecture) · [Quick Start](#quick-start) · [Training](#training) · [Benchmarks](#benchmarks)
|
| 33 |
-
|
| 34 |
-
</div>
|
| 35 |
-
|
| 36 |
-
---
|
| 37 |
-
|
| 38 |
-
## Why CogNet?
|
| 39 |
-
|
| 40 |
-
Most language models today are built on the same foundation: the Transformer and its self-attention mechanism. Self-attention is powerful because it lets every token communicate with every other token in the sequence. But this communication comes at a cost: **O(n²) time and memory complexity** in the sequence length n. As sequences grow, the computational burden grows quadratically.
|
| 41 |
-
|
| 42 |
-
CogNet asks a different question: **What if we replace self-attention entirely with mechanisms inspired by how human cognition actually works?**
|
| 43 |
-
|
| 44 |
-
Human brains don't compute all-pairs interactions between every piece of information. Instead, we use:
|
| 45 |
-
- **Selective routing** — we focus attention on relevant information channels
|
| 46 |
-
- **Hierarchical memory** — we store and retrieve from working, episodic, and semantic memory
|
| 47 |
-
- **Adaptive computation** — we spend more time on hard problems
|
| 48 |
-
- **Compositional reasoning** — we bind roles to fillers to build complex representations
|
| 49 |
-
|
| 50 |
-
CogNet implements each of these principles as a differentiable neural module, creating a language model that processes sequences in **O(n) time** while maintaining rich contextual representations through hierarchical memory.
|
| 51 |
-
|
| 52 |
-
---
|
| 53 |
|
| 54 |
## Architecture
|
| 55 |
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
└──────┬──────┘
|
| 67 |
-
│
|
| 68 |
-
┌────▼────────────────────────────────────────────┐
|
| 69 |
-
│ Cognitive Block × 6 │
|
| 70 |
-
│ │
|
| 71 |
-
│ ┌──────────────────┐ │
|
| 72 |
-
│ │ CoherenceRouter │ O(n) channel routing │
|
| 73 |
-
│ │ ┌────────────┐ │ │
|
| 74 |
-
│ │ │ Channel 0 │ │ Depthwise Sep. Conv │
|
| 75 |
-
│ │ │ Channel 1 │ │ + SwiGLU FFN │
|
| 76 |
-
│ │ │ Channel 2 │ │ │
|
| 77 |
-
│ │ │ Channel 3 │ │ (each channel processes │
|
| 78 |
-
│ │ │ Channel 4 │ │ a routed subset of │
|
| 79 |
-
│ │ │ Channel 5 │ │ tokens independently) │
|
| 80 |
-
│ │ └────────────┘ │ │
|
| 81 |
-
│ └──────────────────┘ │
|
| 82 |
-
│ │ │
|
| 83 |
-
│ ┌────────▼──────────────────────┐ │
|
| 84 |
-
│ │ SharedHierarchicalMemory │ │
|
| 85 |
-
│ │ ┌──────────┐ ┌────────────┐ ┌───────────┐ │
|
| 86 |
-
│ │ │ Working │ │ Episodic │ │ Semantic │ │
|
| 87 |
-
│ │ │ 32 slots │ │ 64 slots │ │ 128 slots │ │
|
| 88 |
-
│ │ │ (recent) │ │ (patterns)│ │ (concepts)│ │
|
| 89 |
-
│ │ └────┬─────┘ └─────┬──────┘ └─────┬─────┘ │
|
| 90 |
-
│ │ └──────┬──────┘──────────────┘ │
|
| 91 |
-
│ │ Gated Combination │
|
| 92 |
-
│ └──────────────────────────────┘ │
|
| 93 |
-
│ │ │
|
| 94 |
-
│ ┌────────▼──────────────────────┐ │
|
| 95 |
-
│ │ AdaptiveComputationBlock │ │
|
| 96 |
-
│ │ (1-2 steps per token) │ │
|
| 97 |
-
│ │ ┌──────┐ ┌──────┐ │ │
|
| 98 |
-
│ │ │FFN 1 │→│FFN 2 │ SwiGLU │ │
|
| 99 |
-
│ │ └──────┘ └──────┘ + halt │ │
|
| 100 |
-
│ └──────────────────────────────┘ │
|
| 101 |
-
│ │ │
|
| 102 |
-
│ ┌────────▼──────────────────────┐ │
|
| 103 |
-
│ │ CompositionalReasoner │ │
|
| 104 |
-
│ │ Role-Filler Binding (HDC) │ │
|
| 105 |
-
│ │ Circular Convolution │ │
|
| 106 |
-
│ └──────────────────────────────┘ │
|
| 107 |
-
│ │
|
| 108 |
-
└──────────────────────────────────────────────────┘
|
| 109 |
-
│
|
| 110 |
-
┌────▼──────┐
|
| 111 |
-
│ LayerNorm │
|
| 112 |
-
└────┬──────┘
|
| 113 |
-
│
|
| 114 |
-
┌────▼──────┐
|
| 115 |
-
│ OutputHead│ Weight-tied with TokenEncoder
|
| 116 |
-
└───────────┘
|
| 117 |
-
│
|
| 118 |
-
▼
|
| 119 |
-
Token Logits
|
| 120 |
-
```
|
| 121 |
-
|
| 122 |
-
### Component Deep Dive
|
| 123 |
-
|
| 124 |
-
#### 1. CoherenceRouter — O(n) Token Routing
|
| 125 |
-
|
| 126 |
-
The CoherenceRouter replaces self-attention with a learned routing mechanism that assigns each token to one or more processing channels based on **coherence scoring**. Unlike self-attention, which computes all n×n token interactions, the CoherenceRouter:
|
| 127 |
-
|
| 128 |
-
1. Projects each token into a **query** and **key** vector of dimension `num_channels`
|
| 129 |
-
2. Computes the mean key across the entire sequence (O(n) reduction)
|
| 130 |
-
3. Scores each token against this mean key via element-wise multiplication (O(n))
|
| 131 |
-
4. Applies a single refinement step for improved routing accuracy
|
| 132 |
-
5. Produces soft routing weights via softmax, plus hard top-2 masks for efficiency
|
| 133 |
-
|
| 134 |
-
**Complexity**: O(n × C) where C is the number of channels, compared to O(n²) for self-attention.
|
| 135 |
-
|
| 136 |
-
The key insight is that routing doesn't need to know about every pairwise interaction — it only needs to know "which processing channel should handle this token?" This is analogous to how the brain routes sensory information to specialized cortical areas.
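
The routing steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the names `route`, `project`, `w_query`, and `w_key` are hypothetical, the weights are random rather than learned, the sizes are toy values, and the refinement step is omitted because the text does not specify it.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def project(weight, x):
    """Multiply a (C x d) weight matrix, stored as a list of rows, by a d-vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def route(tokens, w_query, w_key, top_k=2):
    """Return soft routing weights and hard top-k masks, one row per token."""
    queries = [project(w_query, t) for t in tokens]  # n x C
    keys = [project(w_key, t) for t in tokens]       # n x C
    n, num_channels = len(tokens), len(w_query)
    # Single O(n) reduction: average the keys over the whole sequence.
    mean_key = [sum(k[c] for k in keys) / n for c in range(num_channels)]
    soft, hard = [], []
    for q in queries:
        # Element-wise product of the token's query with the mean key.
        scores = [qc * mc for qc, mc in zip(q, mean_key)]
        weights = softmax(scores)
        top = sorted(range(num_channels), key=lambda c: weights[c], reverse=True)[:top_k]
        soft.append(weights)
        hard.append([1 if c in top else 0 for c in range(num_channels)])
    return soft, hard


rng = random.Random(0)
hidden_dim, num_channels, seq_len = 8, 6, 5  # toy sizes, not the model's
w_query = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_channels)]
w_key = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_channels)]
tokens = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(seq_len)]

soft, hard = route(tokens, w_query, w_key)
print(hard[0], [round(w, 3) for w in soft[0]])
```

Each token costs O(C) work after the single pass that builds the mean key, which is where the O(n × C) figure comes from.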
|
| 137 |
-
|
| 138 |
-
#### 2. CognitiveChannel — Efficient Per-Channel Processing
|
| 139 |
|
| 140 |
-
|
| 141 |
|
| 142 |
-
-
|
| 143 |
-
-
|
| 144 |
-
|
| 145 |
-
Both operations include residual connections and LayerNorm for stable training.
|
| 146 |
-
|
| 147 |
-
#### 3. SharedHierarchicalMemory — 3-Tier Key-Value Store
|
| 148 |
-
|
| 149 |
-
This is the core innovation that enables CogNet to maintain rich contextual representations without self-attention. Inspired by the Atkinson-Shiffrin model of human memory, the module implements three tiers of learned key-value memory:
|
| 150 |
-
|
| 151 |
-
| Tier | Slots | Analogy | Content |
|
| 152 |
-
|------|-------|---------|---------|
|
| 153 |
-
| **Working Memory** | 32 | Short-term buffer | Recent token representations |
|
| 154 |
-
| **Episodic Memory** | 64 | Event sequences | Recurring patterns and phrases |
|
| 155 |
-
| **Semantic Memory** | 128 | Knowledge store | Abstract concepts and relationships |
|
| 156 |
-
|
| 157 |
-
**Read mechanism**: For each input token, the module projects a query vector and performs scaled dot-product attention against each tier's keys and values independently. The three tier outputs are then combined via a **learned gating network** that produces softmax weights over the three tiers, allowing the model to dynamically balance between recent context (Working), pattern matching (Episodic), and conceptual knowledge (Semantic).
|
| 158 |
-
|
| 159 |
-
**Key properties**:
|
| 160 |
-
- Memory slots are **learned parameters** — they encode persistent knowledge across the entire training corpus, not just the current sequence
|
| 161 |
-
- The gating mechanism enables **dynamic memory access** — different tokens may rely more on working memory (for local coherence) or semantic memory (for factual knowledge)
|
| 162 |
-
- Total memory capacity: 224 key-value pairs per layer, providing a compressed but rich knowledge store
|
| 163 |
-
|
| 164 |
-
**Complexity**: O(n × S) per tier where S is the number of slots, compared to O(n²) for self-attention. Since S is fixed (224 total), this is effectively O(n).
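
The read mechanism described above can be sketched as follows, under assumptions: `read_tier` and `read_memory` are hypothetical names, the slots are random rather than learned, the dimensions are toy values, and the gate logits are passed in directly as a stand-in for the learned gating network.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def read_tier(query, keys, values):
    """Scaled dot-product read of one tier: O(S) work for S slots."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]


def read_memory(query, tiers, gate_logits):
    """Read every tier, then mix the tier outputs with softmax gate weights."""
    outputs = [read_tier(query, keys, values) for keys, values in tiers]
    gates = softmax(gate_logits)
    return [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(len(outputs[0]))]


rng = random.Random(0)
key_dim, value_dim = 4, 3  # toy sizes; the specifications table lists a key dimension of 256


def make_tier(num_slots):
    keys = [[rng.gauss(0, 1) for _ in range(key_dim)] for _ in range(num_slots)]
    values = [[rng.gauss(0, 1) for _ in range(value_dim)] for _ in range(num_slots)]
    return keys, values


tiers = [make_tier(32), make_tier(64), make_tier(128)]  # working, episodic, semantic
query = [rng.gauss(0, 1) for _ in range(key_dim)]
gate_logits = [0.5, 0.0, -0.5]  # stand-in for the learned gating network's output
mixed = read_memory(query, tiers, gate_logits)
print([round(x, 3) for x in mixed])
```

The work per token depends only on the 32 + 64 + 128 = 224 slots, not on the sequence length, so the total cost stays linear in n.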
|
| 165 |
-
|
| 166 |
-
#### 4. AdaptiveComputationBlock — Variable-Depth Processing
|
| 167 |
-
|
| 168 |
-
Not all tokens require the same amount of computation. The AdaptiveComputationBlock allows each token to be processed for 1 to `max_adaptive_steps` iterations of SwiGLU FFN layers, with a learned **halting mechanism** that determines when a token's representation is sufficiently refined.
|
| 169 |
-
|
| 170 |
-
After each step, a sigmoid halting probability is computed. The token's output is the weighted sum of its intermediate states, where the weights are determined by the halting probabilities. This enables:
|
| 171 |
-
- **Fast processing** for simple, predictable tokens (e.g., articles, common suffixes)
|
| 172 |
-
- **Deep processing** for ambiguous or information-rich tokens (e.g., rare words, punctuation at clause boundaries)
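
One way to turn halting probabilities into mixture weights is the scheme used by Adaptive Computation Time style models. The text does not spell out the exact weighting, so the sketch below is an assumption: step i receives the probability of halting at i given that no earlier step halted, and the final step takes the remainder. The names `halting_weights` and `mix_states` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def halting_weights(halt_logits):
    """Turn per-step halting logits into mixture weights that sum to 1."""
    weights, remaining = [], 1.0
    for step, logit in enumerate(halt_logits):
        last = step == len(halt_logits) - 1
        p_halt = 1.0 if last else sigmoid(logit)  # the final step takes what is left
        weights.append(remaining * p_halt)
        remaining *= 1.0 - p_halt
    return weights


def mix_states(states, halt_logits):
    """Weighted sum of the intermediate states produced by each step."""
    weights = halting_weights(halt_logits)
    return [sum(w * s[d] for w, s in zip(weights, states)) for d in range(len(states[0]))]


states = [[1.0, 0.0], [0.0, 1.0]]       # toy outputs of step 1 and step 2
easy = mix_states(states, [4.0, 0.0])   # high halting logit: mostly step 1
hard = mix_states(states, [-4.0, 0.0])  # low halting logit: mostly step 2
print([round(x, 3) for x in easy], [round(x, 3) for x in hard])
```

With `max_adaptive_steps=2`, a confident token keeps most of its first-step state, while an uncertain token leans on the second step.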
|
| 173 |
-
|
| 174 |
-
#### 5. CompositionalReasoner — Hyperdimensional Binding
|
| 175 |
-
|
| 176 |
-
The CompositionalReasoner implements **role-filler binding** from hyperdimensional computing (HDC). It projects each token into a role vector and a filler vector, then binds them via element-wise multiplication (circular convolution reduces to element-wise multiplication in the frequency domain). A shift-based unbinding operation adds positional awareness.
|
| 177 |
-
|
| 178 |
-
This enables the model to represent compositional structures like "the **subject** of the sentence is **the cat**" where "subject" is the role and "the cat" is the filler — a fundamental capability for understanding linguistic structure without explicit syntax trees.
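
Role-filler binding by element-wise multiplication can be illustrated with random bipolar vectors, a standard HDC construction. This is a toy sketch under assumptions: the model uses learned projections rather than fixed ±1 vectors, and the names `bind` and `similarity` are hypothetical.

```python
import random


def random_bipolar(dim, rng):
    """Random vector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Bind two vectors by element-wise multiplication."""
    return [x * y for x, y in zip(a, b)]


def similarity(a, b):
    """Normalized dot product; 1.0 means identical bipolar vectors."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
dim = 1024
role_subject = random_bipolar(dim, rng)
filler_cat = random_bipolar(dim, rng)

bound = bind(role_subject, filler_cat)  # "subject = the cat"
recovered = bind(bound, role_subject)   # ±1 entries are their own inverse

print(similarity(recovered, filler_cat), round(similarity(bound, filler_cat), 3))
```

The bound vector looks unrelated to the filler on its own, yet multiplying by the role again recovers the filler exactly. That is the property that makes binding useful for representing structure.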
|
| 179 |
-
|
| 180 |
-
---
|
| 181 |
-
|
| 182 |
-
## Complexity Analysis
|
| 183 |
-
|
| 184 |
-
| Operation | Transformer | CogNet | Speedup Factor |
|
| 185 |
-
|-----------|-------------|--------|----------------|
|
| 186 |
-
| Token mixing | O(n² × d) | O(n × C × d) | **n / C** |
|
| 187 |
-
| Memory access | O(n² × d) | O(n × S × d) | **n / S** |
|
| 188 |
-
| FFN | O(n × d × ff) | O(n × d × ff) | 1× (same) |
|
| 189 |
-
| **Total per layer** | **O(n² × d)** | **O(n × (C + S + ff) × d)** | **~n / (C + S)** |
|
| 190 |
-
|
| 191 |
-
For a 256-token sequence with C=6 channels and S=224 memory slots, the n / (C + S) ratio is 256 / 230, or roughly **1.1×**, so the two layers cost about the same at this length. The advantage grows linearly with sequence length: at 1024 tokens the ratio is about **4.5×**.
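
The n / (C + S) ratio from the last row of the table can be evaluated directly. This is a back-of-the-envelope sketch that ignores the FFN term and constant factors; `speedup_estimate` is a hypothetical helper, not part of the repository.

```python
def speedup_estimate(seq_len, num_channels=6, memory_slots=224):
    """Ratio n / (C + S) from the per-layer row of the complexity table."""
    return seq_len / (num_channels + memory_slots)


for n in (256, 1024, 4096):
    print(n, round(speedup_estimate(n), 2))
```

The ratio crosses 1 once the sequence is longer than C + S = 230 tokens and grows linearly from there.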
|
| 192 |
-
|
| 193 |
-
---
|
| 194 |
-
|
| 195 |
-
## Model Specifications
|
| 196 |
-
|
| 197 |
-
| Parameter | Value |
|
| 198 |
-
|-----------|-------|
|
| 199 |
-
| **Architecture** | CogNet (Non-Transformer) |
|
| 200 |
-
| **Total Parameters** | 39,725,784 (~40M) |
|
| 201 |
-
| **Hidden Dimension** | 512 |
|
| 202 |
-
| **Cognitive Blocks** | 6 |
|
| 203 |
-
| **Cognitive Channels** | 6 |
|
| 204 |
-
| **Channel Dimension** | 128 |
|
| 205 |
-
| **FF Dimension** | 1024 |
|
| 206 |
-
| **Working Memory Slots** | 32 |
|
| 207 |
-
| **Episodic Memory Slots** | 64 |
|
| 208 |
-
| **Semantic Memory Slots** | 128 |
|
| 209 |
-
| **Key Dimension** | 256 |
|
| 210 |
-
| **Max Sequence Length** | 256 |
|
| 211 |
-
| **Vocabulary Size** | 136 (character-level) |
|
| 212 |
-
| **Model Size (FP32)** | ~159 MB |
|
| 213 |
-
| **Model Size (FP16)** | ~80 MB |
|
| 214 |
-
| **Adaptive Steps** | 1–2 |
|
| 215 |
-
| **Routing Iterations** | 1 |
|
| 216 |
-
| **Composition** | Hyperdimensional binding |
|
| 217 |
-
|
| 218 |
-
### Character-Level Tokenizer
|
| 219 |
-
|
| 220 |
-
CogNet uses a 136-character vocabulary tokenizer that covers:
|
| 221 |
-
- Standard ASCII (printable characters, digits, punctuation)
|
| 222 |
-
- French accented characters (à, é, è, ê, ë, î, ï, ô, ù, û, ü, ÿ, ç, æ, œ)
|
| 223 |
-
- Special formatting characters (tab, newline)
|
| 224 |
-
- European typographic marks (guillemets « », inverted question mark ¿)
|
| 225 |
-
|
| 226 |
-
Character-level tokenization ensures:
|
| 227 |
-
- **No out-of-vocabulary tokens** — every string is representable
|
| 228 |
-
- **Cross-lingual capability** — no bias toward English subword units
|
| 229 |
-
- **Compact vocabulary** — only 136 embedding vectors vs 32K+ for BPE tokenizers
|
| 230 |
-
- **Fine-grained generation** — the model learns orthographic patterns directly
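
A character-level tokenizer of this kind fits in a few lines. The class below is a hypothetical sketch, not the repository's `CharTokenizer`, and its illustrative alphabet does not reproduce the actual 136-character vocabulary.

```python
import string


class SimpleCharTokenizer:
    """Hypothetical character-level tokenizer; not the repository's CharTokenizer."""

    def __init__(self, alphabet):
        self.id_to_char = sorted(set(alphabet))
        self.char_to_id = {ch: i for i, ch in enumerate(self.id_to_char)}

    def encode(self, text):
        # Raises KeyError for any character outside the alphabet.
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.id_to_char[i] for i in ids)


# Illustrative alphabet: printable ASCII, tab/newline, French accents, guillemets.
alphabet = (
    string.ascii_letters + string.digits + string.punctuation + " \t\n"
    + "àéèêëîïôùûüÿçæœ«»¿"
)
tokenizer = SimpleCharTokenizer(alphabet)
ids = tokenizer.encode("Où est le café ?")
print(len(ids), tokenizer.decode(ids))
```

Encoding produces one id per character, so sequences are several times longer than with a subword tokenizer. This is the trade-off behind the small embedding table.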
|
| 231 |
-
|
| 232 |
-
---
|
| 233 |
-
|
| 234 |
-
## Quick Start
|
| 235 |
-
|
| 236 |
-
### Installation
|
| 237 |
-
|
| 238 |
-
```bash
|
| 239 |
-
pip install torch
|
| 240 |
-
```
|
| 241 |
-
|
| 242 |
-
### Download Model
|
| 243 |
-
|
| 244 |
-
```python
|
| 245 |
-
from huggingface_hub import hf_hub_download
|
| 246 |
-
|
| 247 |
-
# Download model checkpoint
|
| 248 |
-
ckpt_path = hf_hub_download("AFKmoney/CogNet-40M", "cognet_best.pt")
|
| 249 |
-
tokenizer_path = hf_hub_download("AFKmoney/CogNet-40M", "tokenizer_v3.json")
|
| 250 |
-
model_code = hf_hub_download("AFKmoney/CogNet-40M", "cognet_1b.py")
|
| 251 |
-
infer_code = hf_hub_download("AFKmoney/CogNet-40M", "infer.py")
|
| 252 |
-
```
|
| 253 |
-
|
| 254 |
-
### Inference
|
| 255 |
-
|
| 256 |
-
```python
|
| 257 |
-
import os, sys, torch
|
| 258 |
-
sys.path.insert(0, os.path.dirname(model_code))  # Add the directory of the downloaded files to the import path
|
| 259 |
-
|
| 260 |
-
from cognet_1b import CogNet1B
|
| 261 |
-
from infer import CharTokenizer
|
| 262 |
-
|
| 263 |
-
# Load tokenizer
|
| 264 |
-
tokenizer = CharTokenizer.load(tokenizer_path)
|
| 265 |
-
|
| 266 |
-
# Build model
|
| 267 |
-
model = CogNet1B(
|
| 268 |
-
vocab_size=136, hidden_dim=512, num_blocks=6,
|
| 269 |
-
num_channels=6, channel_dim=128, ff_dim=1024,
|
| 270 |
-
routing_iters=1, max_adaptive_steps=2, max_seq_len=256,
|
| 271 |
-
working_slots=32, episodic_slots=64, semantic_slots=128,
|
| 272 |
-
key_dim=256, dropout=0.1
|
| 273 |
-
)
|
| 274 |
-
|
| 275 |
-
# Load checkpoint (handles FP16 weights)
|
| 276 |
-
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
|
| 277 |
-
state = {k: v.float() if v.dtype == torch.float16 else v
|
| 278 |
-
for k, v in ckpt["model_state_dict"].items()}
|
| 279 |
-
model.load_state_dict(state)
|
| 280 |
-
model.eval()
|
| 281 |
-
|
| 282 |
-
# Generate text
|
| 283 |
-
prompt = "Once upon a time"
|
| 284 |
-
ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)
|
| 285 |
-
|
| 286 |
-
with torch.no_grad():
|
| 287 |
-
    gen = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
|
| 288 |
-
|
| 289 |
-
print(tokenizer.decode(gen[0].tolist()))
|
| 290 |
-
```
|
| 291 |
-
|
| 292 |
-
### CUDA Inference
|
| 293 |
-
|
| 294 |
-
```python
|
| 295 |
-
device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 296 |
-
model = model.to(device)
|
| 297 |
-
ids = ids.to(device)
|
| 298 |
-
|
| 299 |
-
with torch.no_grad():
|
| 300 |
-
    gen = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
|
| 301 |
-
```
|
| 302 |
-
|
| 303 |
-
---
|
| 304 |
|
| 305 |
## Training
|
| 306 |
|
| 307 |
-
### Training Data
|
| 308 |
-
|
| 309 |
-
| Dataset | Size | Domain | Language |
|
| 310 |
-
|---------|------|--------|----------|
|
| 311 |
-
| WikiText-2 (raw) | ~2M tokens | Encyclopedic | English |
|
| 312 |
-
| TinyStories (50K) | ~15M tokens | Narrative | English |
|
| 313 |
-
| Alpaca (52K) | ~5M tokens | Instructions | English |
|
| 314 |
-
| **Total** | **~63M tokens** | **Mixed** | **English** |
|
| 315 |
-
|
| 316 |
-
### Training Configuration
|
| 317 |
-
|
| 318 |
-
| Parameter | Value |
|
| 319 |
-
|-----------|-------|
|
| 320 |
-
| **Hardware** | NVIDIA GeForce RTX 5060 Ti (16 GB VRAM) |
|
| 321 |
-
| **Sequence Length** | 256 |
|
| 322 |
-
| **Batch Size** | 64 (gradient accumulation × 4, effective = 256) |
|
| 323 |
-
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01) |
|
| 324 |
-
| **Learning Rate** | 5e-4 (cosine schedule with warmup) |
|
| 325 |
-
| **Warmup Steps** | 500 |
|
| 326 |
-
| **Precision** | FP16 (AMP) with TF32 enabled |
|
| 327 |
-
| **Gradient Clipping** | 1.0 |
|
| 328 |
-
| **Total Steps** | 30,000 |
|
| 329 |
-
| **Throughput** | ~3M tokens/min |
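
The learning-rate schedule in the table (peak 5e-4, 500 warmup steps, cosine decay over 30,000 steps) can be sketched as below. The linear warmup shape and the final learning rate of 0 are assumptions, since the table does not state them; `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, peak_lr=5e-4, warmup_steps=500, total_steps=30_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (min_lr is assumed)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 499, 500, 15_250, 30_000):
    print(step, f"{lr_at(step):.2e}")

# Effective batch per optimizer step: 64 sequences x 4 accumulation steps x 256 tokens.
tokens_per_step = 64 * 4 * 256
print(tokens_per_step)
```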
|
| 330 |
-
|
| 331 |
-
### Training Curve
|
| 332 |
-
|
| 333 |
-
Training loss follows a smooth descent from initial loss ~4.5 (random character predictions over 136 vocab) down to ~0.007, with validation perplexity reaching 1.01 — meaning the model predicts the next character with high confidence. The hierarchical memory gates show interesting dynamics: Working memory dominates early in training (local character patterns), while Semantic memory gates increase as the model learns abstract patterns.
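
Perplexity is the exponential of the cross-entropy loss. Assuming the loss is measured in nats per character (the default for PyTorch's cross-entropy), the figures above can be checked with a few lines. `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(loss_nats):
    """Perplexity is the exponential of the mean cross-entropy in nats."""
    return math.exp(loss_nats)


uniform_loss = math.log(136)  # loss of a uniform guess over 136 characters
print(round(uniform_loss, 2), round(perplexity(uniform_loss)))
print(round(perplexity(0.007), 3))
```

A uniform guess over the vocabulary gives a loss of ln 136 ≈ 4.91, and a loss of 0.007 corresponds to a perplexity of about 1.007, which rounds to the reported 1.01.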
|
| 334 |
-
|
| 335 |
-
---
|
| 336 |
-
|
| 337 |
-
## Benchmarks
|
| 338 |
-
|
| 339 |
-
### Perplexity
|
| 340 |
-
|
| 341 |
| Metric | Value |
|
| 342 |
|--------|-------|
|
| 343 |
-
|
|
| 344 |
-
|
|
| 345 |
-
|
|
| 346 |
-
|
|
| 347 |
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
### Generation Samples
|
| 351 |
-
|
| 352 |
-
```
|
| 353 |
-
Prompt: "The "
|
| 354 |
-
Output: "The little cat, Lily watched her. She day, and sorry"
|
| 355 |
-
|
| 356 |
-
Prompt: "Once upon a time"
|
| 357 |
-
Output: "Once upon a time there was a little girl named Lily. She"
|
| 358 |
|
| 359 |
-
|
| 360 |
-
|
|
| 361 |
```
|
| 362 |
|
| 363 |
-
##
|
| 364 |
-
|
| 365 |
-
CogNet's O(n) complexity means it scales favorably with sequence length:
|
| 366 |
|
| 367 |
-
| Sequence Length | Self-Attention (n²) | CogNet (n) | Ratio |
| 368 |
-
|----------------|----------------------|-----------------|-------|
|
| 369 |
-
| 256 | 65,536 | 256 | 256× |
|
| 370 |
-
| 512 | 262,144 | 512 | 512× |
|
| 371 |
-
| 1024 | 1,048,576 | 1,024 | 1,024× |
|
| 372 |
-
| 2048 | 4,194,304 | 2,048 | 2,048× |
|
| 373 |
|
| 374 |
-
|
| 375 |
|
| 376 |
-
|
| 377 |
-
|
| 378 |
-
|
| 379 |
-
|
| 380 |
-
|
|
| 381 |
-
|
|
| 382 |
-
|
|
| 383 |
-
|
|
| 384 |
-
| Sequence mixing | Self-Attention (O(n²)) | Coherence Routing (O(n)) |
|
| 385 |
-
| Memory mechanism | Fixed context window | Hierarchical 3-tier memory |
|
| 386 |
-
| Computation | Uniform per token | Adaptive (1-2 steps) |
|
| 387 |
-
| Tokenizer | BPE (50,257 vocab) | Character (136 vocab) |
|
| 388 |
-
| Max context | 1,024 tokens | 256 tokens |
|
| 389 |
-
| Composition | None | Hyperdimensional binding |
|
| 390 |
-
| Positional encoding | Learned | Learned |
|
| 391 |
-
|
| 392 |
-
---
|
| 393 |
-
|
| 394 |
-
## Limitations
|
| 395 |
-
|
| 396 |
-
- **Context length**: Currently limited to 256 tokens. Extending to longer contexts requires architectural modifications to the memory read mechanism.
|
| 397 |
-
- **Character-level tokenization**: While OOV-free, character-level models require more processing steps to build up word-level and phrase-level representations compared to subword tokenizers.
|
| 398 |
-
- **Scale**: At 40M parameters, CogNet is a research proof-of-concept. Scaling to 1B+ parameters is the next milestone.
|
| 399 |
-
- **Evaluation**: Benchmarks are computed on the training distribution. Zero-shot evaluation on standard NLP benchmarks is planned.
|
| 400 |
-
- **Language coverage**: Currently trained on English text only, though the tokenizer supports French accented characters.
|
| 401 |
-
|
| 402 |
-
---
|
| 403 |
-
|
| 404 |
-
## Citation
|
| 405 |
-
|
| 406 |
-
```bibtex
|
| 407 |
-
@software{cognet2026,
|
| 408 |
-
title = {CogNet: A Non-Transformer Language Model with Cognitive Routing and Hierarchical Memory},
|
| 409 |
-
author = {AFKmoney},
|
| 410 |
-
year = {2026},
|
| 411 |
-
url = {https://huggingface.co/AFKmoney/CogNet-40M},
|
| 412 |
-
license = {MIT}
|
| 413 |
-
}
|
| 414 |
-
```
|
| 415 |
-
|
| 416 |
-
---
|
| 417 |
|
| 418 |
-
|
| 419 |
|
| 420 |
-
|
|
| 421 |
|
| 422 |
-
|
|
|
|
| 1 |
---
|
| 2 |
+
license: mit
|
| 3 |
language:
|
| 4 |
- en
|
| 5 |
- fr
|
| 6 |
+
- code
|
|
|
|
| 7 |
tags:
|
| 8 |
- non-transformer
|
| 9 |
- cognitive-routing
|
| 10 |
- hierarchical-memory
|
| 11 |
- character-level
|
| 12 |
+
- aicl
|
| 13 |
+
- text-generation
|
| 14 |
+
- custom-architecture
|
| 15 |
pipeline_tag: text-generation
|
| 16 |
+
library_name: pytorch
|
| 17 |
---
|
| 18 |
|
| 19 |
# CogNet-40M
|
| 20 |
|
| 21 |
+
A 39.7M parameter non-transformer language model with O(n) cognitive routing and hierarchical memory.
|
| 22 |
|
| 23 |
## Architecture
|
| 24 |
|
| 25 |
+
| Component | Detail |
|
| 26 |
+
|-----------|--------|
|
| 27 |
+
| Architecture | Non-transformer (Cognitive Routing) |
|
| 28 |
+
| Parameters | 39,718,536 (~40M) |
|
| 29 |
+
| Hidden Dim | 512 |
|
| 30 |
+
| Blocks | 6 cognitive blocks |
|
| 31 |
+
| Channels | 6 routing channels x 128 dim |
|
| 32 |
+
| FF Dim | 1024 |
|
| 33 |
+
| Max Seq Len | 256 |
|
| 34 |
+
| Tokenizer | Character-level (136 vocab) |
|
| 35 |
|
| 36 |
+
## Hierarchical Memory
|
| 37 |
|
| 38 |
+
- Working Memory (32 slots): Active processing
|
| 39 |
+
- Episodic Memory (64 slots): Short-term recall
|
| 40 |
+
- Semantic Memory (128 slots): Long-term knowledge
|
| 41 |
|
| 42 |
## Training
|
| 43 |
|
| 44 |
| Metric | Value |
|
| 45 |
|--------|-------|
|
| 46 |
+
| Steps | 50,000 |
|
| 47 |
+
| Batch Size | 64 |
|
| 48 |
+
| LR | 3e-4 (cosine) |
|
| 49 |
+
| Precision | FP16 AMP |
|
| 50 |
+
| GPU | RTX 5060 Ti 16GB |
|
| 51 |
+
| Final Loss | ~0.005 |
|
| 52 |
+
| Final PPL | ~1.01 |
|
| 53 |
|
| 54 |
+
## Quick Start
|
| 55 |
|
| 56 |
+
```python
|
| 57 |
+
from inference import CogNetInference
|
| 58 |
+
ai = CogNetInference("cognet_best.pt", "tokenizer_v3.json")
|
| 59 |
+
print(ai.generate("Once upon a time"))
|
| 60 |
```
|
| 61 |
|
| 62 |
+
## AICL Integration
|
| 63 |
|
| 64 |
+
CogNet powers AICL (Architecture Compilation Language) as its native AI engine for code generation, diagnosis, and repair.
|
| 65 |
|
| 66 |
+
## Files
|
| 67 |
|
| 68 |
+
| File | Size | Description |
|
| 69 |
+
|------|------|-------------|
|
| 70 |
+
| cognet_best.pt | 152MB | FP32 checkpoint |
|
| 71 |
+
| cognet_fp16.pt | 77MB | FP16 checkpoint |
|
| 72 |
+
| tokenizer_v3.json | - | Char tokenizer (136 vocab) |
|
| 73 |
+
| config.json | - | Model config |
|
| 74 |
+
| cognet_model.py | - | Architecture source |
|
| 75 |
+
| inference.py | - | Inference script |
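
The checkpoint sizes can be sanity-checked from the parameter count. The sketch below counts raw weights only. Reading the listed sizes as mebibytes is an assumption, and the small gap for the FP16 file may come from checkpoint metadata or tensors kept at higher precision, which is a guess. `checkpoint_size` is a hypothetical helper.

```python
PARAMS = 39_718_536  # parameter count from the architecture table


def checkpoint_size(num_params, bytes_per_param):
    """Return (decimal megabytes, binary mebibytes) for the raw weights."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / 1e6, total_bytes / 2**20


for name, width in (("FP32", 4), ("FP16", 2)):
    mb, mib = checkpoint_size(PARAMS, width)
    print(f"{name}: {mb:.1f} MB = {mib:.1f} MiB")
```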
|
| 76 |
|
| 77 |
+
## Roadmap
|
| 78 |
|
| 79 |
+
- [x] CogNet-40M (39.7M)
|
| 80 |
+
- [x] HuggingFace integration
|
| 81 |
+
- [x] AICL native engine
|
| 82 |
+
- [ ] CogNet-1B (1B params)
|
| 83 |
+
- [ ] ONNX export
|
| 84 |
|
| 85 |
+
MIT License. Built with PyTorch on RTX 5060 Ti via QuickPod.
|