---
language:
- en
- fr
license: mit
library_name: pytorch
tags:
- cognet
- language-model
- aicl
- custom-architecture
- non-transformer
---

# CogNet-1B

A ~1.06B parameter **non-transformer** language model with a novel cognitive architecture featuring working, episodic, and semantic memory systems.

CogNet uses cognitive routing with vectorized channel processing and hierarchical memory tiers, achieving O(n) per-layer complexity instead of O(n^2) for transformers.

## Architecture

| Parameter | Value |
|-----------|-------|
| **Hidden dim** | 2048 |
| **Blocks** | 16 (8 channels each) |
| **Channel dim** | 384 |
| **FF dim** | 8192 (Fused SwiGLU) |
| **Working memory slots** | 128 |
| **Episodic memory slots** | 256 |
| **Semantic memory slots** | 512 |
| **Tokenizer** | CharTokenizer (136 vocab) |
| **Normalization** | RMSNorm |
| **Positional encoding** | RoPE |

### Key Differences from Transformers

- **Cognitive routing**: Input is routed through parallel channels instead of attention heads
- **Hierarchical memory**: 3-tier memory system (working/episodic/semantic) with SDPA reads
- **O(n) per-layer complexity**: Channel processing is linear in sequence length (vs O(n^2) attention)
- **Vectorized channels**: All 8 channels processed in a single batched operation (no for-loops)
- **Fused SwiGLU**: Gate and up projections combined into a single matmul

## Optimized Training Pipeline

The `train_ultra.py` script includes the complete training pipeline with all optimizations:

### Data Pipeline (A-B-C-D-E)

| Part | Source | Description |
|------|--------|-------------|
| **A** | HuggingFace datasets | wikitext-103, codeparrot-clean, fineweb, oscar-fr, the-stack-smol, alpaca-cleaned, c4-en |
| **B** | CogNet HF repo data | Pre-tokenized .pt files from this repository |
| **C** | AICL repo | JSONL datasets, .aicl examples, source code, spec, tests (10x repeated) |
| **D** | HF scripts | Python/JSON/MD scripts from this repo (3x weight) |
| **E** | Synthetic data | Code templates + English + French sentences (~50M chars) |

All parts are merged, shuffled, and saved as a single `train_merged.pt` file.

### Optimizations

| # | Optimization | Benefit |
|---|-------------|---------|
| 1 | BF16 mixed precision | 2x throughput vs FP32 |
| 2 | RMSNorm + RoPE | No learned positional table |
| 3 | Vectorized channel processing | No Python for-loops over channels |
| 4 | SDPA/Flash Attention for memory tiers | Fused attention for memory reads |
| 5 | Fused SwiGLU | Single matmul for gate+up |
| 6 | Gradient checkpointing | ~3x memory savings |
| 7 | torch.compile() | Kernel fusion, reduced overhead |
| 8 | FSDP multi-GPU | Near-linear multi-GPU scaling |
| 9 | Fused AdamW | Faster optimizer step |
| 10 | CUDA prefetch pipeline | Overlaps data transfer with compute |
| 11 | Async checkpointing | Saves in background, no training pause |
| 12 | Sequence length warmup | 128 -> target over warmup period |
| 13 | 8-bit optimizer (optional) | 50% less VRAM for optimizer states |

### Real Benchmark

**No fabricated performance claims.** The training script runs a real benchmark at startup:

1. **3 warmup steps** to heat up compile caches and CUDA allocations
2. **10 measured steps** (forward + backward + optimizer) with `cuda.synchronize()`
3. Reports real **steps/sec** and **tokens/sec** on your hardware
4. Calculates **ETA** based on measured speed
5. Saves results to `benchmark_results.json`

Every log line shows `ETA: Xh` calculated from the measured speed.

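
The benchmark procedure listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `train_ultra.py`: the function name `run_benchmark`, its parameters, and the JSON keys are all hypothetical. It assumes `step_fn` performs one full training step (forward + backward + optimizer) and that `sync_fn` would be `torch.cuda.synchronize` on a GPU. The sketch uses only the standard library so it runs anywhere.

```python
import json
import time


def run_benchmark(step_fn, tokens_per_step, total_steps,
                  warmup_steps=3, measured_steps=10,
                  sync_fn=None, out_path=None):
    """Time `step_fn` and derive throughput and ETA (hypothetical helper)."""
    sync = sync_fn or (lambda: None)  # on GPU, pass torch.cuda.synchronize

    # Warmup steps are not timed; they only fill compile caches and allocators.
    for _ in range(warmup_steps):
        step_fn()
    sync()  # finish all warmup work before starting the clock

    start = time.perf_counter()
    for _ in range(measured_steps):
        step_fn()
    sync()  # wait for queued work so the elapsed time covers all measured steps
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero

    steps_per_sec = measured_steps / elapsed
    remaining = max(total_steps - warmup_steps - measured_steps, 0)
    results = {
        "steps_per_sec": steps_per_sec,
        "tokens_per_sec": steps_per_sec * tokens_per_step,
        "eta_hours": remaining / steps_per_sec / 3600,
    }
    if out_path is not None:
        with open(out_path, "w") as f:
            json.dump(results, f, indent=2)
    return results


# Usage with a dummy step standing in for a real training step.
demo = run_benchmark(lambda: time.sleep(0.001),
                     tokens_per_step=8 * 512, total_steps=100_000)
print(f"{demo['steps_per_sec']:.1f} steps/s, ETA: {demo['eta_hours']:.1f}h")
```

Synchronizing before reading the clock matters because CUDA kernels launch asynchronously. Without it, the timer would measure how fast work is queued rather than how fast it completes.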
## Files

### Optimized (V2, Recommended)

| File | Description |
|------|-------------|
| `cognet_1b_optimized.py` | **Optimized model architecture** (RMSNorm, RoPE, vectorized, SDPA, FusedSwiGLU) |
| `train_ultra.py` | **Main training script** (complete A-B-C-D-E pipeline + benchmark + all optimizations) |
| `run.py` | **Python launcher** (auto-detects GPUs, installs deps, launches torchrun) |
| `infer_optimized.py` | Inference with optimized model (generate, analyze, benchmark) |
| `benchmark.py` | Standalone benchmark (original vs optimized, scalability test) |
| `convert_checkpoint.py` | Convert original checkpoint to optimized format |
| `requirements.txt` | Python dependencies |
| `setup.sh` | Quick start setup script |

### Original (Legacy)

| File | Description |
|------|-------------|
| `cognet_1b.py` | Original model architecture |
| `runpod_train_1b.py` | Original RunPod training script |
| `train_1b_final.py` | Previous training script |
| `train_1b_v2.py` | Previous training script v2 |
| `train_1b_v3.py` | Previous training script v3 |
| `train_bg.py` | Background training script |
| `train_pipeline.py` | Pipeline training script |
| `infer.py` | Original inference script |
| `chat_infer.py` | Chat-style inference |
| `gen_data_1b.py` | Synthetic data generation |
| `cognet_data_prep.py` | Standalone data prep |
| `config.json` | Model config |
| `tokenizer_v3.json` | CharTokenizer vocabulary |
| `data/` | AICL datasets and examples |

## Quick Start

```bash
# 1. Clone
git clone https://huggingface.co/thefinalboss/CogNet-1B
cd CogNet-1B

# 2. Install deps
pip install torch datasets huggingface_hub tokenizers

# 3. Set HF token (for data download)
export HF_TOKEN=your_token_here

# 4. Train (everything is automatic)
python run.py
```

### Training Options

```bash
# Single GPU with all optimizations
python train_ultra.py --max-steps 100000 --compile --cuda-prefetch --seq-warmup --async-ckpt

# Multi-GPU with FSDP
torchrun --nproc_per_node=4 train_ultra.py --use-fsdp --max-steps 100000

# Use the Python launcher (auto-detects GPUs, installs deps)
python run.py --max-steps 100000 --hf-token hf_xxx

# Just prepare data (no training)
python run.py --prep-only

# Resume from checkpoint
python run.py --resume ./checkpoints_1b/cognet_1b_latest.pt

# 350M model (faster for testing)
python run.py --model-size 350m

# 8-bit optimizer (less VRAM)
python run.py --8bit
```

## Inference

```python
from cognet_1b_optimized import create_cognet_1b_optimized
import torch

# Create model
model = create_cognet_1b_optimized(vocab_size=136, max_seq_len=512)

# Load checkpoint
ckpt = torch.load('checkpoints/cognet_best.pt', map_location='cpu', weights_only=False)
model.load_state_dict(ckpt['model_state_dict'])
model.eval()

# Generate
prompt = torch.tensor([[2]])  # BOS token
output = model.generate(prompt, max_new_tokens=200, temperature=0.8, top_k=50)

# Decode (CharTokenizer): IDs 0-3 are special tokens, IDs 4-135 map to characters
codepoints = [
    *range(32, 127),
    192, 193, 194, 195, 196, 197, 199, 200, 201, 202, 203, 204, 205, 206, 207,
    210, 211, 212, 213, 214, 217, 218, 219, 220, 224, 225, 226, 227, 228, 229,
    231, 232, 233, 234, 235, 236, 237, 238, 239, 242, 243, 244, 245, 246, 249,
    250, 251, 252, 253, 255,
]
vocab = {0: '', 1: '', 2: '', 3: ''}
for i in range(4, 136):
    vocab[i] = chr(codepoints[i - 4])

# Skip the special tokens when joining the decoded characters
text = ''.join(vocab.get(t, '') for t in output[0].tolist() if t not in (0, 1, 2, 3))
print(text)
```

Or use the inference script:

```bash
python infer_optimized.py generate --prompt "The future of AI is" --max-tokens 100
python infer_optimized.py benchmark
```

## Benchmark Your Hardware

```bash
# Full benchmark: original vs optimized + scalability test
python benchmark.py

# Quick benchmark during training (automatic)
python train_ultra.py --max-steps 20
# The first 13 steps are: 3 warmup + 10 benchmark = real speed measurement
```

## Config Files

YAML configs are available in `configs/`:

| Config | Description |
|--------|-------------|
| `1b_single_gpu.yaml` | 1B model, single GPU |
| `1b_fsdp.yaml` | 1B model, multi-GPU FSDP |
| `350m_fast.yaml` | 350M model, fast iteration |