CogNet-1B
A ~1.06B parameter non-transformer language model with a novel cognitive architecture featuring working, episodic, and semantic memory systems. CogNet uses cognitive routing with vectorized channel processing and hierarchical memory tiers, achieving O(n) per-layer complexity instead of O(n^2) for transformers.
Architecture
| Parameter | Value |
|---|---|
| Hidden dim | 2048 |
| Blocks | 16 (8 channels each) |
| Channel dim | 384 |
| FF dim | 8192 (Fused SwiGLU) |
| Working memory slots | 128 |
| Episodic memory slots | 256 |
| Semantic memory slots | 512 |
| Tokenizer | CharTokenizer (136 vocab) |
| Normalization | RMSNorm |
| Positional encoding | RoPE |
Key Differences from Transformers
- Cognitive routing: Input is routed through parallel channels instead of attention heads
- Hierarchical memory: 3-tier memory system (working/episodic/semantic) with SDPA reads
- O(n) per-layer complexity: Channel processing is linear in sequence length (vs O(n^2) attention)
- Vectorized channels: All 8 channels processed in a single batched operation (no for-loops)
- Fused SwiGLU: Gate and up projections combined into a single matmul
Optimized Training Pipeline
The train_ultra.py script includes the complete training pipeline with all optimizations:
Data Pipeline (A-B-C-D-E)
| Part | Source | Description |
|---|---|---|
| A | HuggingFace datasets | wikitext-103, codeparrot-clean, fineweb, oscar-fr, the-stack-smol, alpaca-cleaned, c4-en |
| B | CogNet HF repo data | Pre-tokenized .pt files from this repository |
| C | AICL repo | JSONL datasets, .aicl examples, source code, spec, tests (10x repeated) |
| D | HF scripts | Python/JSON/MD scripts from this repo (3x weight) |
| E | Synthetic data | Code templates + English + French sentences (~50M chars) |
All parts are merged, shuffled, and saved as a single train_merged.pt file.
Optimizations
| # | Optimization | Benefit |
|---|---|---|
| 1 | BF16 mixed precision | 2x throughput vs FP32 |
| 2 | RMSNorm + RoPE | No learned positional table |
| 3 | Vectorized channel processing | No Python for-loops over channels |
| 4 | SDPA/Flash Attention for memory tiers | Fused attention for memory reads |
| 5 | Fused SwiGLU | Single matmul for gate+up |
| 6 | Gradient checkpointing | ~3x memory savings |
| 7 | torch.compile() | Kernel fusion, reduced overhead |
| 8 | FSDP multi-GPU | Near-linear multi-GPU scaling |
| 9 | Fused AdamW | Faster optimizer step |
| 10 | CUDA prefetch pipeline | Overlaps data transfer with compute |
| 11 | Async checkpointing | Saves in background, no training pause |
| 12 | Sequence length warmup | 128 -> target over warmup period |
| 13 | 8-bit optimizer (optional) | 50% less VRAM for optimizer states |
Real Benchmark
No fabricated performance claims. The training script runs a real benchmark at startup:
- 3 warmup steps to heat up compile caches and CUDA allocations
- 10 measured steps (forward + backward + optimizer) with
cuda.synchronize() - Reports real steps/sec and tokens/sec on your hardware
- Calculates ETA based on measured speed
- Saves results to
benchmark_results.json
Every log line shows ETA: Xh calculated from the measured speed.
Files
Optimized (V2) β Recommended
| File | Description |
|---|---|
cognet_1b_optimized.py |
Optimized model architecture (RMSNorm, RoPE, vectorized, SDPA, FusedSwiGLU) |
train_ultra.py |
Main training script (complete A-B-C-D-E pipeline + benchmark + all optimizations) |
run.py |
Python launcher (auto-detects GPUs, installs deps, launches torchrun) |
infer_optimized.py |
Inference with optimized model (generate, analyze, benchmark) |
benchmark.py |
Standalone benchmark (original vs optimized, scalability test) |
convert_checkpoint.py |
Convert original checkpoint to optimized format |
requirements.txt |
Python dependencies |
setup.sh |
Quick start setup script |
Original β Legacy
| File | Description |
|---|---|
cognet_1b.py |
Original model architecture |
runpod_train_1b.py |
Original RunPod training script |
train_1b_final.py |
Previous training script |
train_1b_v2.py |
Previous training script v2 |
train_1b_v3.py |
Previous training script v3 |
train_bg.py |
Background training script |
train_pipeline.py |
Pipeline training script |
infer.py |
Original inference script |
chat_infer.py |
Chat-style inference |
gen_data_1b.py |
Synthetic data generation |
cognet_data_prep.py |
Standalone data prep |
config.json |
Model config |
tokenizer_v3.json |
CharTokenizer vocabulary |
data/ |
AICL datasets and examples |
Quick Start
# 1. Clone
git clone https://huggingface.co/thefinalboss/CogNet-1B
cd CogNet-1B
# 2. Install deps
pip install torch datasets huggingface_hub tokenizers
# 3. Set HF token (for data download)
export HF_TOKEN=your_token_here
# 4. Train β everything is automatic
python run.py
Training Options
# Single GPU with all optimizations
python train_ultra.py --max-steps 100000 --compile --cuda-prefetch --seq-warmup --async-ckpt
# Multi-GPU with FSDP
torchrun --nproc_per_node=4 train_ultra.py --use-fsdp --max-steps 100000
# Use the Python launcher (auto-detects GPUs, installs deps)
python run.py --max-steps 100000 --hf-token hf_xxx
# Just prepare data (no training)
python run.py --prep-only
# Resume from checkpoint
python run.py --resume ./checkpoints_1b/cognet_1b_latest.pt
# 350M model (faster for testing)
python run.py --model-size 350m
# 8-bit optimizer (less VRAM)
python run.py --8bit
Inference
from cognet_1b_optimized import create_cognet_1b_optimized
import torch
# Create model
model = create_cognet_1b_optimized(vocab_size=136, max_seq_len=512)
# Load checkpoint
ckpt = torch.load('checkpoints/cognet_best.pt', map_location='cpu', weights_only=False)
model.load_state_dict(ckpt['model_state_dict'])
model.eval()
# Generate
prompt = torch.tensor([[2]]) # BOS token
output = model.generate(prompt, max_new_tokens=200, temperature=0.8, top_k=50)
# Decode (CharTokenizer)
vocab = {0: '', 1: '', 2: '', 3: ''}
for i in range(4, 136):
vocab[i] = chr([*range(32,127), *[
192,193,194,195,196,197,199,200,201,202,203,204,205,206,207,
210,211,212,213,214,217,218,219,220,224,225,226,227,228,229,
231,232,233,234,235,236,237,238,239,242,243,244,245,246,249,
250,251,252,253,255
]][i-4])
text = ''.join(vocab.get(t, '') for t in output[0].tolist() if t not in (0,1,2,3))
print(text)
Or use the inference script:
python infer_optimized.py generate --prompt "The future of AI is" --max-tokens 100
python infer_optimized.py benchmark
Benchmark Your Hardware
# Full benchmark: original vs optimized + scalability test
python benchmark.py
# Quick benchmark during training (automatic)
python train_ultra.py --max-steps 20
# The first 13 steps are: 3 warmup + 10 benchmark = real speed measurement
Config Files
YAML configs are available in configs/:
| Config | Description |
|---|---|
1b_single_gpu.yaml |
1B model, single GPU |
1b_fsdp.yaml |
1B model, multi-GPU FSDP |
350m_fast.yaml |
350M model, fast iteration |
- Downloads last month
- 780
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support