CogNet-1B

A ~1.06B-parameter non-transformer language model built on a cognitive architecture with working, episodic, and semantic memory systems. CogNet routes input through vectorized parallel channels and reads from hierarchical memory tiers, giving O(n) per-layer complexity in sequence length n, versus O(n^2) for transformer self-attention.
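
To make the complexity claim concrete, the sketch below counts query-key score computations per layer. It assumes (this is not confirmed by the repository) that a memory read scores each of the n positions against a fixed pool of memory slots, whereas self-attention scores each position against all n positions. The slot counts come from the Architecture section; the function names are illustrative.

```python
# Rough count of query-key score computations per layer.
# Assumption: each token reads from a fixed pool of memory slots
# (128 working + 256 episodic + 512 semantic).

MEMORY_SLOTS = 128 + 256 + 512  # 896 slots in total


def attention_scores(seq_len: int) -> int:
    """Self-attention: every position scores against every position."""
    return seq_len * seq_len


def memory_read_scores(seq_len: int, slots: int = MEMORY_SLOTS) -> int:
    """Memory read: every position scores against a fixed number of slots."""
    return seq_len * slots


for n in (512, 2048, 8192):
    print(n, attention_scores(n), memory_read_scores(n))
```

Doubling n doubles the memory-read count but quadruples the attention count. Under these assumptions the two counts are equal at n = 896.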

Architecture

| Parameter | Value |
| --- | --- |
| Hidden dim | 2048 |
| Blocks | 16 (8 channels each) |
| Channel dim | 384 |
| FF dim | 8192 (Fused SwiGLU) |
| Working memory slots | 128 |
| Episodic memory slots | 256 |
| Semantic memory slots | 512 |
| Tokenizer | CharTokenizer (136 vocab) |
| Normalization | RMSNorm |
| Positional encoding | RoPE |
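
As a rough sanity check on the parameter count, the feed-forward weights alone can be tallied from the values above. This sketch assumes bias-free projections and one SwiGLU feed-forward per block, neither of which is stated here. The remaining parameters (channels, memory tiers, embeddings) are not estimated.

```python
# Feed-forward parameter tally from the architecture values.
# Assumptions (not confirmed by the source): no biases, one SwiGLU
# feed-forward per block.

HIDDEN_DIM = 2048
FF_DIM = 8192
NUM_BLOCKS = 16

# Fused gate+up projection: hidden -> 2 * ff (one weight matrix).
gate_up_params = HIDDEN_DIM * 2 * FF_DIM
# Down projection: ff -> hidden.
down_params = FF_DIM * HIDDEN_DIM

ff_params_per_block = gate_up_params + down_params
ff_params_total = ff_params_per_block * NUM_BLOCKS

print(f"per block:  {ff_params_per_block:,}")
print(f"all blocks: {ff_params_total:,}")
```

Under these assumptions the feed-forward layers account for roughly 0.81B of the ~1.06B parameters.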

Key Differences from Transformers

  • Cognitive routing: Input is routed through parallel channels instead of attention heads
  • Hierarchical memory: 3-tier memory system (working/episodic/semantic) with SDPA reads
  • O(n) per-layer complexity: Channel processing is linear in sequence length (vs O(n^2) attention)
  • Vectorized channels: All 8 channels processed in a single batched operation (no for-loops)
  • Fused SwiGLU: Gate and up projections combined into a single matmul
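
The fused SwiGLU point can be illustrated without any framework. The dependency-free sketch below uses toy sizes and plain Python lists rather than the PyTorch tensors the model works with. It stacks the gate and up weights into one matrix so that a single matrix-vector product yields both halves. The gate-first ordering and the helper names are assumptions for illustration, and the down projection is omitted.

```python
import math
import random


def matvec(weight, x):
    """Multiply a (rows x cols) weight matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_separate(w_gate, w_up, x):
    """Two matmuls: gate and up projections computed independently."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]


def swiglu_fused(w_gate_up, x):
    """One matmul over the stacked [gate; up] weight, then split in half."""
    out = matvec(w_gate_up, x)
    half = len(out) // 2
    gate, up = out[:half], out[half:]
    return [silu(g) * u for g, u in zip(gate, up)]


rng = random.Random(0)
hidden, ff = 4, 6  # toy sizes; the model uses 2048 and 8192
w_gate = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(ff)]
w_up = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(ff)]
x = [rng.uniform(-1, 1) for _ in range(hidden)]

fused = swiglu_fused(w_gate + w_up, x)  # stack rows: gate first, then up
separate = swiglu_separate(w_gate, w_up, x)
print(all(math.isclose(a, b) for a, b in zip(fused, separate)))
```

Both paths compute the same activations. The fused form replaces two kernel launches with one larger matmul, which is where the speedup on a GPU would come from.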

Optimized Training Pipeline

The train_ultra.py script includes the complete training pipeline with all optimizations:

Data Pipeline (A-B-C-D-E)

| Part | Source | Description |
| --- | --- | --- |
| A | HuggingFace datasets | wikitext-103, codeparrot-clean, fineweb, oscar-fr, the-stack-smol, alpaca-cleaned, c4-en |
| B | CogNet HF repo data | Pre-tokenized .pt files from this repository |
| C | AICL repo | JSONL datasets, .aicl examples, source code, spec, tests (10x repeated) |
| D | HF scripts | Python/JSON/MD scripts from this repo (3x weight) |
| E | Synthetic data | Code templates + English + French sentences (~50M chars) |

All parts are merged, shuffled, and saved as a single train_merged.pt file.
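
The merge step can be sketched as follows. This is a minimal illustration, not the code in train_ultra.py: the sample strings are placeholders for tokenized data, the "3x weight" on part D is interpreted as plain repetition, and the function name and seed are hypothetical.

```python
import random

# Hypothetical stand-ins for the five parts; real samples would be token tensors.
parts = {
    "A": ["hf_sample_1", "hf_sample_2"],
    "B": ["repo_sample_1"],
    "C": ["aicl_sample_1"],
    "D": ["script_sample_1"],
    "E": ["synthetic_sample_1"],
}

# Factors from the pipeline description: C is repeated 10x, D is weighted 3x
# (treated here as 3 repetitions, which is an assumption).
repeats = {"A": 1, "B": 1, "C": 10, "D": 3, "E": 1}


def merge_parts(parts, repeats, seed=0):
    """Repeat each part by its factor, concatenate, and shuffle reproducibly."""
    merged = []
    for name, samples in parts.items():
        merged.extend(samples * repeats.get(name, 1))
    random.Random(seed).shuffle(merged)
    return merged


merged = merge_parts(parts, repeats)
print(len(merged))  # 2 + 1 + 10 + 3 + 1 = 17
```

Writing the merged result to train_merged.pt is left out of the sketch.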

Optimizations

| # | Optimization | Benefit |
| --- | --- | --- |
| 1 | BF16 mixed precision | 2x throughput vs FP32 |
| 2 | RMSNorm + RoPE | No learned positional table |
| 3 | Vectorized channel processing | No Python for-loops over channels |
| 4 | SDPA/Flash Attention for memory tiers | Fused attention for memory reads |
| 5 | Fused SwiGLU | Single matmul for gate+up |
| 6 | Gradient checkpointing | ~3x memory savings |
| 7 | torch.compile() | Kernel fusion, reduced overhead |
| 8 | FSDP multi-GPU | Near-linear multi-GPU scaling |
| 9 | Fused AdamW | Faster optimizer step |
| 10 | CUDA prefetch pipeline | Overlaps data transfer with compute |
| 11 | Async checkpointing | Saves in background, no training pause |
| 12 | Sequence length warmup | 128 -> target over warmup period |
| 13 | 8-bit optimizer (optional) | 50% less VRAM for optimizer states |
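
Sequence length warmup (item 12) is sketched below as a schedule function. The linear ramp, the function name, and the example values of 1000 warmup steps and a 512-token target are assumptions for illustration. The source only states that the length grows from 128 to the target over the warmup period.

```python
def warmup_seq_len(step: int, warmup_steps: int, target_len: int,
                   start_len: int = 128) -> int:
    """Linearly ramp the sequence length from start_len to target_len."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_len
    frac = step / warmup_steps
    return int(start_len + frac * (target_len - start_len))


for step in (0, 500, 1000, 2000):
    print(step, warmup_seq_len(step, warmup_steps=1000, target_len=512))
```

Short sequences early in training make the first steps cheaper. A real schedule might also round lengths to a hardware-friendly multiple.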

Real Benchmark

No fabricated performance claims. The training script runs a real benchmark at startup:

  1. 3 warmup steps to warm up compile caches and CUDA allocations
  2. 10 measured steps (forward + backward + optimizer) with cuda.synchronize()
  3. Reports real steps/sec and tokens/sec on your hardware
  4. Calculates ETA based on measured speed
  5. Saves results to benchmark_results.json

Every log line then shows an ETA (in the form ETA: Xh) computed from the measured speed.
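
The benchmark procedure above can be sketched in plain Python. This is an illustration under stated assumptions rather than the code in train_ultra.py: run_benchmark, its parameters, the result keys, and fake_step are all hypothetical, and the optional sync callback stands in for torch.cuda.synchronize.

```python
import json
import time


def run_benchmark(step_fn, tokens_per_step, total_steps,
                  warmup=3, measured=10, sync=None):
    """Time `measured` steps after `warmup` untimed ones and estimate the ETA."""
    for _ in range(warmup):
        step_fn()
    if sync is not None:
        sync()  # e.g. torch.cuda.synchronize, so queued GPU work is finished
    start = time.perf_counter()
    for _ in range(measured):
        step_fn()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start

    steps_per_sec = measured / elapsed
    remaining = max(total_steps - warmup - measured, 0)
    return {
        "steps_per_sec": steps_per_sec,
        "tokens_per_sec": steps_per_sec * tokens_per_step,
        "eta_hours": remaining / steps_per_sec / 3600,
    }


def fake_step():
    """Stand-in for one forward + backward + optimizer step."""
    time.sleep(0.001)


results = run_benchmark(fake_step, tokens_per_step=8 * 512, total_steps=100_000)
# The training script saves its results to benchmark_results.json;
# here the JSON is only printed.
print(json.dumps(results, indent=2))
```

Synchronizing before reading the clock matters on a GPU because kernel launches are asynchronous. Without it the timer would measure launch time rather than compute time.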

Files

Optimized (V2) - Recommended

| File | Description |
| --- | --- |
| cognet_1b_optimized.py | Optimized model architecture (RMSNorm, RoPE, vectorized, SDPA, FusedSwiGLU) |
| train_ultra.py | Main training script (complete A-B-C-D-E pipeline + benchmark + all optimizations) |
| run.py | Python launcher (auto-detects GPUs, installs deps, launches torchrun) |
| infer_optimized.py | Inference with optimized model (generate, analyze, benchmark) |
| benchmark.py | Standalone benchmark (original vs optimized, scalability test) |
| convert_checkpoint.py | Convert original checkpoint to optimized format |
| requirements.txt | Python dependencies |
| setup.sh | Quick start setup script |

Original - Legacy

| File | Description |
| --- | --- |
| cognet_1b.py | Original model architecture |
| runpod_train_1b.py | Original RunPod training script |
| train_1b_final.py | Previous training script |
| train_1b_v2.py | Previous training script v2 |
| train_1b_v3.py | Previous training script v3 |
| train_bg.py | Background training script |
| train_pipeline.py | Pipeline training script |
| infer.py | Original inference script |
| chat_infer.py | Chat-style inference |
| gen_data_1b.py | Synthetic data generation |
| cognet_data_prep.py | Standalone data prep |
| config.json | Model config |
| tokenizer_v3.json | CharTokenizer vocabulary |
| data/ | AICL datasets and examples |

Quick Start

# 1. Clone
git clone https://huggingface.co/thefinalboss/CogNet-1B
cd CogNet-1B

# 2. Install deps
pip install torch datasets huggingface_hub tokenizers

# 3. Set HF token (for data download)
export HF_TOKEN=your_token_here

# 4. Train (everything is automatic)
python run.py

Training Options

# Single GPU with all optimizations
python train_ultra.py --max-steps 100000 --compile --cuda-prefetch --seq-warmup --async-ckpt

# Multi-GPU with FSDP
torchrun --nproc_per_node=4 train_ultra.py --use-fsdp --max-steps 100000

# Use the Python launcher (auto-detects GPUs, installs deps)
python run.py --max-steps 100000 --hf-token hf_xxx

# Just prepare data (no training)
python run.py --prep-only

# Resume from checkpoint
python run.py --resume ./checkpoints_1b/cognet_1b_latest.pt

# 350M model (faster for testing)
python run.py --model-size 350m

# 8-bit optimizer (less VRAM)
python run.py --8bit

Inference

from cognet_1b_optimized import create_cognet_1b_optimized
import torch

# Create model
model = create_cognet_1b_optimized(vocab_size=136, max_seq_len=512)

# Load checkpoint
ckpt = torch.load('checkpoints/cognet_best.pt', map_location='cpu', weights_only=False)
model.load_state_dict(ckpt['model_state_dict'])
model.eval()

# Generate
prompt = torch.tensor([[2]])  # BOS token
output = model.generate(prompt, max_new_tokens=200, temperature=0.8, top_k=50)

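# Decode (CharTokenizer): IDs 0-3 decode to empty strings, IDs 4-135 map to characters
code_points = [
    *range(32, 127),  # printable ASCII
    # accented Latin-1 letters
    192,193,194,195,196,197,199,200,201,202,203,204,205,206,207,
    210,211,212,213,214,217,218,219,220,224,225,226,227,228,229,
    231,232,233,234,235,236,237,238,239,242,243,244,245,246,249,
    250,251,252,253,255,
]
# Build the table once; only the first 132 code points are used (IDs 4-135).
vocab = {0: '', 1: '', 2: '', 3: ''}
for i in range(4, 136):
    vocab[i] = chr(code_points[i - 4])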

text = ''.join(vocab.get(t, '') for t in output[0].tolist() if t not in (0,1,2,3))
print(text)

Or use the inference script:

python infer_optimized.py generate --prompt "The future of AI is" --max-tokens 100
python infer_optimized.py benchmark

Benchmark Your Hardware

# Full benchmark: original vs optimized + scalability test
python benchmark.py

# Quick benchmark during training (automatic)
python train_ultra.py --max-steps 20
# The first 13 steps are: 3 warmup + 10 benchmark = real speed measurement

Config Files

YAML configs are available in configs/:

| Config | Description |
| --- | --- |
| 1b_single_gpu.yaml | 1B model, single GPU |
| 1b_fsdp.yaml | 1B model, multi-GPU FSDP |
| 350m_fast.yaml | 350M model, fast iteration |