---
language: en
tags:
- gpt
- language-model
- gpu
- cuda
- ai-systems
- pytorch
license: mit
---

# KernelGPT — GPU/AI Systems Performance

A GPT-style decoder-only transformer trained from scratch on GPU/AI systems performance engineering text.

## Model Specs

| Property | Value |
|----------|-------|
| Parameters | ~125M |
| Architecture | Decoder-only Transformer |
| Embedding dim | 768 |
| Attention heads | 12 |
| Layers | 8 |
| Context length | 512 tokens |
| Vocab size | 32,000 (SentencePiece BPE) |

## Training

| Setting | Value |
|---------|-------|
| Training steps | 162,000 |
| Val loss | 4.3889 |
| Optimizer | AdamW |
| Learning rate | 3e-4 (cosine decay) |
| Batch size | 1 (effective 4 with grad accum) |

## Training Data

- **FineWeb** (general web text)
- **arXiv papers** (cs.DC, cs.AR, cs.LG, cs.PF categories — GPU/AI/systems)
- **Wikipedia** (ML/systems filtered articles)
- **GPU-specific crawl** (NVIDIA docs, GitHub READMEs, arXiv abstracts)

Topics cover all 20 chapters of *AI Performance Engineering*, including CUDA internals, KV cache tuning, LLM inference, distributed training, and GPU cluster scaling.

## Usage

```python
import torch
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the checkpoint and tokenizer files from the Hub
ckpt_path = hf_hub_download("saiakula/KernelGPT", "pytorch_model.pt")
tok_path = hf_hub_download("saiakula/KernelGPT", "tokenizer.model")

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor(model_file=tok_path)

# Load the model weights
# (requires TinyGPT src — clone https://github.com/your-username/TinyGPT)
checkpoint = torch.load(ckpt_path, map_location="cpu")
```

## Acknowledgments

- Inspired by Andrej Karpathy's nanoGPT
- Training topics based on *AI Performance Engineering* by Chris Fregly
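The Training table lists a peak learning rate of 3e-4 with cosine decay over 162,000 steps. A minimal sketch of such a schedule follows; the warmup length and minimum LR are illustrative assumptions, not values from this card, which only states the peak LR and the decay shape:

```python
import math

def cosine_lr(step, max_steps=162_000, peak_lr=3e-4,
              warmup_steps=2_000, min_lr=3e-5):
    """Cosine-decay LR schedule with linear warmup.

    warmup_steps and min_lr are guesses for illustration; the card
    only specifies peak_lr=3e-4 and cosine decay over 162k steps.
    """
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # progress through the decay phase, in [0, 1]
    t = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))

print(f"{cosine_lr(2_000):.2e}")    # at end of warmup: peak LR
print(f"{cosine_lr(162_000):.2e}")  # fully decayed: min LR
```

Frameworks expose the same shape directly (e.g. PyTorch's `CosineAnnealingLR`); the closed form above just makes the schedule explicit.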
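The Training table notes a batch size of 1 with an effective batch of 4 via gradient accumulation: each microbatch loss is scaled by 1/4, gradients are summed across 4 microbatches, and the optimizer steps once. A toy scalar check of that equivalence (not the actual training code):

```python
# Gradient accumulation sketch: 4 microbatches of one example each,
# versus one full batch of 4. Toy scalar model for illustration only.
accum_steps = 4
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

def grad(w, x, y):
    # gradient of the squared error (w*x - y)^2 with respect to w
    return 2.0 * (w * x - y) * x

# Accumulate: scale each microbatch gradient by 1/accum_steps,
# sum across microbatches, then a single optimizer step would apply g_accum
g_accum = 0.0
for x, y in zip(xs, ys):
    g_accum += grad(w, x, y) / accum_steps

# Reference: gradient of the mean loss over the full batch of 4
g_full = sum(grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)

print(abs(g_accum - g_full) < 1e-12)  # the two gradients match
```

This is why accumulation lets a single-sample-per-step run mimic a batch of 4 without the extra memory.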
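As a sanity check on the reported val loss: assuming it is a mean cross-entropy in nats per token (the usual convention for GPT-style training), validation perplexity is simply its exponential:

```python
import math

val_loss = 4.3889  # from the Training table; assumed to be nats/token
perplexity = math.exp(val_loss)
print(f"val perplexity ~ {perplexity:.0f}")  # roughly 80
```

If the loss were instead reported in bits, the base would be 2 rather than e.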