CogNet-1B

A ~1.06B-parameter non-transformer language model with a novel cognitive architecture featuring working, episodic, and semantic memory systems. CogNet uses cognitive routing with vectorized channel processing and hierarchical memory tiers, giving O(n) per-layer complexity in sequence length n, versus O(n^2) for transformer self-attention.

Architecture

Parameter	Value
Hidden dim	2048
Blocks	16 (8 channels each)
Channel dim	384
FF dim	8192 (Fused SwiGLU)
Working memory slots	128
Episodic memory slots	256
Semantic memory slots	512
Tokenizer	CharTokenizer (136 vocab)
Normalization	RMSNorm
Positional encoding	RoPE
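
For quick reference, the table values can be collected into a small configuration object. This is only an illustrative sketch: `CogNetConfig` and its field names are hypothetical and are not taken from `cognet_1b_optimized.py`; only the numbers come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CogNetConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    hidden_dim: int = 2048
    num_blocks: int = 16
    channels_per_block: int = 8
    channel_dim: int = 384
    ff_dim: int = 8192
    working_slots: int = 128
    episodic_slots: int = 256
    semantic_slots: int = 512
    vocab_size: int = 136

    @property
    def total_memory_slots(self) -> int:
        # Sum of the three memory tiers listed in the table.
        return self.working_slots + self.episodic_slots + self.semantic_slots


config = CogNetConfig()
print(config.total_memory_slots)  # 896
```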

Key Differences from Transformers

Cognitive routing: Input is routed through parallel channels instead of attention heads
Hierarchical memory: 3-tier memory system (working/episodic/semantic) read via scaled dot-product attention (SDPA)
O(n) per-layer complexity: Channel processing is linear in sequence length (vs O(n^2) attention)
Vectorized channels: All 8 channels processed in a single batched operation (no for-loops)
Fused SwiGLU: Gate and up projections combined into a single matmul
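
To make the complexity claim concrete, the sketch below counts pairwise score computations per layer under a simplifying assumption: full self-attention scores every token against every other token (n x n), while a memory read scores every token against a fixed number of slots (n x m, with m = 128 + 256 + 512 = 896 from the architecture table). The function names are hypothetical, and the counts ignore constant factors as well as the channel and feed-forward work, so this is an order-of-magnitude illustration rather than a measurement of CogNet.

```python
MEMORY_SLOTS = 128 + 256 + 512  # working + episodic + semantic, from the table


def attention_scores(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention: quadratic in seq_len."""
    return seq_len * seq_len


def memory_read_scores(seq_len: int, slots: int = MEMORY_SLOTS) -> int:
    """Token-slot pairs scored when reading a fixed-size memory: linear in seq_len."""
    return seq_len * slots


for n in (512, 2048, 8192):
    ratio = attention_scores(n) / memory_read_scores(n)
    print(f"n={n:5d}  attention={attention_scores(n):>10,d}  "
          f"memory={memory_read_scores(n):>10,d}  ratio={ratio:.2f}")
```

Under this simplified count the two costs are equal at n = 896, so the linear memory read only pulls ahead for sequences longer than the total slot count.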

Optimized Training Pipeline

The train_ultra.py script includes the complete training pipeline with all optimizations:

Data Pipeline (A-B-C-D-E)

Part	Source	Description
A	HuggingFace datasets	wikitext-103, codeparrot-clean, fineweb, oscar-fr, the-stack-smol, alpaca-cleaned, c4-en
B	CogNet HF repo data	Pre-tokenized .pt files from this repository
C	AICL repo	JSONL datasets, .aicl examples, source code, spec, tests (10x repeated)
D	HF scripts	Python/JSON/MD scripts from this repo (3x weight)
E	Synthetic data	Code templates + English + French sentences (~50M chars)

All parts are merged, shuffled, and saved as a single train_merged.pt file.
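
The merge step can be pictured roughly as follows. This is a minimal sketch, assuming that "10x repeated" and "3x weight" both mean plain repetition of the samples before shuffling; `merge_sources` and the toy sample names are hypothetical, and the real script additionally tokenizes the data and writes the result to train_merged.pt.

```python
import random


def merge_sources(sources, weights, seed=0):
    """Repeat each source's samples by its integer weight, then shuffle.

    sources: dict mapping a part name to a list of samples.
    weights: dict mapping a part name to a repeat count (default 1).
    """
    merged = []
    for name, samples in sources.items():
        merged.extend(samples * weights.get(name, 1))
    random.Random(seed).shuffle(merged)  # seeded so the order is reproducible
    return merged


# Toy stand-ins for the five parts; real samples would be token sequences.
sources = {
    "A": ["hf-0", "hf-1"],
    "B": ["pretok-0"],
    "C": ["aicl-0"],
    "D": ["script-0"],
    "E": ["synthetic-0"],
}
weights = {"C": 10, "D": 3}  # repeat factors from the table

merged = merge_sources(sources, weights)
print(len(merged))  # 2 + 1 + 10 + 3 + 1 = 17
```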

Optimizations

#	Optimization	Benefit
1	BF16 mixed precision	Up to ~2x throughput vs FP32, depending on hardware
2	RMSNorm + RoPE	No learned positional table
3	Vectorized channel processing	No Python for-loops over channels
4	SDPA/Flash Attention for memory tiers	Fused attention for memory reads
5	Fused SwiGLU	Single matmul for gate+up
6	Gradient checkpointing	~3x activation memory savings, at the cost of recomputation
7	torch.compile()	Kernel fusion, reduced overhead
8	FSDP multi-GPU	Near-linear multi-GPU scaling
9	Fused AdamW	Faster optimizer step
10	CUDA prefetch pipeline	Overlaps data transfer with compute
11	Async checkpointing	Saves in background, no training pause
12	Sequence length warmup	128 -> target over warmup period
13	8-bit optimizer (optional)	50% less VRAM for optimizer states
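
As an illustration of optimization 12, a sequence length warmup schedule could look like the sketch below. The linear ramp, the rounding to a multiple of 64, and the name `warmup_seq_len` are assumptions made for this example; the table only states that the length grows from 128 to the target over the warmup period.

```python
def warmup_seq_len(step: int, warmup_steps: int, target_len: int,
                   start_len: int = 128, multiple: int = 64) -> int:
    """Linearly grow the sequence length from start_len to target_len."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_len
    frac = step / warmup_steps
    length = start_len + frac * (target_len - start_len)
    # Round down to a multiple so only a few distinct shapes are used.
    length = int(length) // multiple * multiple
    return max(start_len, min(length, target_len))


for step in (0, 250, 500, 1000):
    print(step, warmup_seq_len(step, warmup_steps=1000, target_len=512))
```

Rounding to a coarse multiple keeps the number of distinct input shapes small, which may help when the model is compiled, since changing shapes can trigger recompilation.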

Real Benchmark

No fabricated performance claims. The training script runs a real benchmark at startup:

3 warmup steps to heat up compile caches and CUDA allocations
10 measured steps (forward + backward + optimizer), timed with torch.cuda.synchronize() so queued GPU work is included
Reports real steps/sec and tokens/sec on your hardware
Calculates ETA based on measured speed
Saves results to benchmark_results.json

Every log line shows an `ETA: Xh` value calculated from the measured speed.
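
The benchmark procedure described above can be sketched in a few lines. This is not the code from train_ultra.py: `run_benchmark`, its parameters, and the JSON keys are hypothetical, and a dummy step stands in for the real forward + backward + optimizer step. On a GPU, `sync` would be `torch.cuda.synchronize`.

```python
import json
import os
import tempfile
import time


def run_benchmark(step_fn, tokens_per_step, total_steps, out_path,
                  warmup=3, measured=10, sync=None):
    """Time `measured` steps after `warmup` untimed steps and estimate an ETA."""
    for _ in range(warmup):
        step_fn()
    if sync is not None:
        sync()  # make sure warmup work has finished before starting the clock
    start = time.perf_counter()
    for _ in range(measured):
        step_fn()
    if sync is not None:
        sync()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start

    steps_per_sec = measured / elapsed
    remaining = max(total_steps - warmup - measured, 0)
    results = {
        "steps_per_sec": steps_per_sec,
        "tokens_per_sec": steps_per_sec * tokens_per_step,
        "eta_hours": remaining / steps_per_sec / 3600,
    }
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
    return results


# Dummy step standing in for one training step; batch 8 x sequence length 512.
out_path = os.path.join(tempfile.gettempdir(), "benchmark_results.json")
results = run_benchmark(lambda: time.sleep(0.001), tokens_per_step=8 * 512,
                        total_steps=100_000, out_path=out_path)
print(results)
```

Without the synchronization calls, asynchronous CUDA kernel launches would make the measured steps look faster than they are, which is why the timing is bracketed by them.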

Files

Optimized (V2) — Recommended

File	Description
`cognet_1b_optimized.py`	Optimized model architecture (RMSNorm, RoPE, vectorized, SDPA, FusedSwiGLU)
`train_ultra.py`	Main training script (complete A-B-C-D-E pipeline + benchmark + all optimizations)
`run.py`	Python launcher (auto-detects GPUs, installs deps, launches torchrun)
`infer_optimized.py`	Inference with optimized model (generate, analyze, benchmark)
`benchmark.py`	Standalone benchmark (original vs optimized, scalability test)
`convert_checkpoint.py`	Convert original checkpoint to optimized format
`requirements.txt`	Python dependencies
`setup.sh`	Quick start setup script

Original — Legacy

File	Description
`cognet_1b.py`	Original model architecture
`runpod_train_1b.py`	Original RunPod training script
`train_1b_final.py`	Previous training script
`train_1b_v2.py`	Previous training script v2
`train_1b_v3.py`	Previous training script v3
`train_bg.py`	Background training script
`train_pipeline.py`	Pipeline training script
`infer.py`	Original inference script
`chat_infer.py`	Chat-style inference
`gen_data_1b.py`	Synthetic data generation
`cognet_data_prep.py`	Standalone data prep
`config.json`	Model config
`tokenizer_v3.json`	CharTokenizer vocabulary
`data/`	AICL datasets and examples

Quick Start

# 1. Clone
git clone https://huggingface.co/thefinalboss/CogNet-1B
cd CogNet-1B

# 2. Install deps
pip install torch datasets huggingface_hub tokenizers

# 3. Set HF token (for data download)
export HF_TOKEN=your_token_here

# 4. Train — everything is automatic
python run.py

Training Options

# Single GPU with all optimizations
python train_ultra.py --max-steps 100000 --compile --cuda-prefetch --seq-warmup --async-ckpt

# Multi-GPU with FSDP
torchrun --nproc_per_node=4 train_ultra.py --use-fsdp --max-steps 100000

# Use the Python launcher (auto-detects GPUs, installs deps)
python run.py --max-steps 100000 --hf-token hf_xxx

# Just prepare data (no training)
python run.py --prep-only

# Resume from checkpoint
python run.py --resume ./checkpoints_1b/cognet_1b_latest.pt

# 350M model (faster for testing)
python run.py --model-size 350m

# 8-bit optimizer (less VRAM)
python run.py --8bit
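
The choice between the single-GPU and multi-GPU commands above can be expressed as a small helper. This is a sketch of what a launcher might do, not the contents of run.py: `build_launch_command` is hypothetical, and only the flags already shown above are used. In practice the GPU count could come from `torch.cuda.device_count()` and the command could be executed with `subprocess.run(cmd, check=True)`.

```python
import sys


def build_launch_command(num_gpus, extra_args=()):
    """Return the argv list for a single-process or torchrun launch."""
    if num_gpus > 1:
        # Multi-GPU: one process per GPU via torchrun, with FSDP enabled.
        cmd = ["torchrun", f"--nproc_per_node={num_gpus}",
               "train_ultra.py", "--use-fsdp"]
    else:
        # Single GPU (or CPU): run the script with the current interpreter.
        cmd = [sys.executable, "train_ultra.py"]
    return cmd + list(extra_args)


print(build_launch_command(4, ["--max-steps", "100000"]))
print(build_launch_command(1, ["--compile", "--seq-warmup"]))
```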

Inference

from cognet_1b_optimized import create_cognet_1b_optimized
import torch

# Create model
model = create_cognet_1b_optimized(vocab_size=136, max_seq_len=512)

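# Load checkpoint. weights_only=False unpickles arbitrary Python objects,
# so only load checkpoint files from a source you trust.
ckpt = torch.load('checkpoints/cognet_best.pt', map_location='cpu', weights_only=False)
model.load_state_dict(ckpt['model_state_dict'])
model.eval()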

# Generate
prompt = torch.tensor([[2]])  # BOS token
output = model.generate(prompt, max_new_tokens=200, temperature=0.8, top_k=50)

# Decode (CharTokenizer): ids 0-3 are special tokens and decode to ''
codepoints = [*range(32, 127),
        192,193,194,195,196,197,199,200,201,202,203,204,205,206,207,
        210,211,212,213,214,217,218,219,220,224,225,226,227,228,229,
        231,232,233,234,235,236,237,238,239,242,243,244,245,246,249,
        250,251,252,253,255]
vocab = {0: '', 1: '', 2: '', 3: ''}
vocab.update({i: chr(codepoints[i - 4]) for i in range(4, 136)})

text = ''.join(vocab.get(t, '') for t in output[0].tolist() if t not in (0,1,2,3))
print(text)

Or use the inference script:

python infer_optimized.py generate --prompt "The future of AI is" --max-tokens 100
python infer_optimized.py benchmark
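
The decoding table in the Python example above can also be inverted to turn a text prompt into token ids. The sketch below assumes that the code point list from that example matches the actual vocabulary in tokenizer_v3.json, which is not verified here; `encode` and `decode` are hypothetical helpers. Note that the list holds more code points than the 132 non-special ids, so characters past the cut-off are skipped by this encoder.

```python
# Code points assumed to follow the four special tokens (ids 0-3),
# copied from the decoding example above.
CODEPOINTS = [
    *range(32, 127),
    192, 193, 194, 195, 196, 197, 199, 200, 201, 202, 203, 204, 205, 206, 207,
    210, 211, 212, 213, 214, 217, 218, 219, 220, 224, 225, 226, 227, 228, 229,
    231, 232, 233, 234, 235, 236, 237, 238, 239, 242, 243, 244, 245, 246, 249,
    250, 251, 252, 253, 255,
]
ID_TO_CHAR = {i: chr(CODEPOINTS[i - 4]) for i in range(4, 136)}
CHAR_TO_ID = {c: i for i, c in ID_TO_CHAR.items()}
BOS_ID = 2  # per the "BOS token" comment in the example above


def encode(text, add_bos=True):
    """Map characters to ids, skipping characters outside the vocabulary."""
    ids = [CHAR_TO_ID[c] for c in text if c in CHAR_TO_ID]
    return [BOS_ID] + ids if add_bos else ids


def decode(ids):
    """Map ids back to text; special tokens (ids 0-3) decode to nothing."""
    return "".join(ID_TO_CHAR.get(i, "") for i in ids)


ids = encode("Déjà vu")
print(ids)
print(decode(ids))
```

The resulting list can be wrapped as `torch.tensor([ids])` and passed to `model.generate` in place of the bare BOS prompt.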

Benchmark Your Hardware

# Full benchmark: original vs optimized + scalability test
python benchmark.py

# Quick benchmark during training (automatic)
python train_ultra.py --max-steps 20
# The first 13 steps are: 3 warmup + 10 benchmark = real speed measurement

Config Files

YAML configs are available in configs/:

Config	Description
`1b_single_gpu.yaml`	1B model, single GPU
`1b_fsdp.yaml`	1B model, multi-GPU FSDP
`350m_fast.yaml`	350M model, fast iteration
