| ---
|
| language: en
|
| tags:
|
| - gpt
|
| - language-model
|
| - gpu
|
| - cuda
|
| - ai-systems
|
| - pytorch
|
| license: mit
|
| ---
|
|
|
| # KernelGPT — GPU/AI Systems Performance
|
|
|
| A GPT-style decoder-only transformer trained from scratch on GPU/AI systems performance engineering text.
|
|
|
| ## Model Specs
|
|
|
| Property | Value |
|----------|-------|
| Parameters | ~125M |
| Architecture | Decoder-only Transformer |
| Embedding dim | 768 |
| Attention heads | 12 |
| Layers | 8 |
| Context length | 512 tokens |
| Vocab size | 32,000 (SentencePiece BPE) |
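
The defining feature of a decoder-only transformer is causally masked self-attention: each position can attend only to itself and earlier positions. A minimal sketch using the dimensions above (768-dim embeddings, 12 heads, 512-token context); this is illustrative, not the repo's actual implementation:

```python
import torch
import torch.nn.functional as F
from torch import nn

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (illustrative sketch)."""

    def __init__(self, dim=768, n_heads=12, max_len=512):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = dim // n_heads  # 768 / 12 = 64 per head
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Lower-triangular mask: position t may only attend to positions <= t
        mask = torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        shape = (B, T, self.n_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = F.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)
```

The mask is what makes the model autoregressive: changing a token at position t cannot change the outputs at positions before t.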
|
|
|
| ## Training
|
|
|
| Setting | Value |
|---------|-------|
| Training steps | 162,000 |
| Validation loss | 4.3889 |
| Optimizer | AdamW |
| Learning rate | 3e-4 (cosine decay) |
| Batch size | 1 (effective 4 via gradient accumulation) |
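
A micro-batch of 1 with an effective batch of 4 means gradients are accumulated over four forward/backward passes before each optimizer step. A sketch of that loop with AdamW and cosine decay, using a stand-in model and dummy data rather than the repo's actual training code:

```python
import torch
from torch import nn

# Illustrative loop: micro-batch 1, gradient accumulation 4,
# AdamW with cosine learning-rate decay (mirrors the settings above).
model = nn.Linear(16, 16)  # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=162_000)
accum_steps = 4

def micro_batch():
    x = torch.randn(1, 16)  # batch size 1
    return x, x             # dummy autoencoding target

for step in range(8):       # the real run does 162,000 steps
    opt.zero_grad()
    for _ in range(accum_steps):
        x, y = micro_batch()
        loss = nn.functional.mse_loss(model(x), y)
        (loss / accum_steps).backward()  # scale so grads average over 4
    opt.step()
    sched.step()
```

Dividing the loss by `accum_steps` makes the accumulated gradient equal the mean over the four micro-batches, so the update matches a true batch of 4.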
|
|
|
| ## Training Data
|
|
|
| - **FineWeb** (general web text)
|
| - **arXiv papers** (cs.DC, cs.AR, cs.LG, cs.PF categories — GPU/AI/systems)
|
| - **Wikipedia** (ML/systems filtered articles)
|
| - **GPU-specific crawl** (NVIDIA docs, GitHub READMEs, arXiv abstracts)
|
|
|
Topics cover all 20 chapters of *AI Performance Engineering*, including CUDA internals,
KV cache tuning, LLM inference, distributed training, and GPU cluster scaling.
|
|
|
| ## Usage
|
|
|
```python
import torch
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the checkpoint and tokenizer from the Hub
ckpt_path = hf_hub_download("saiakula/KernelGPT", "pytorch_model.pt")
tok_path = hf_hub_download("saiakula/KernelGPT", "tokenizer.model")

# Load tokenizer
sp = spm.SentencePieceProcessor(model_file=tok_path)

# Load model weights
# (requires TinyGPT src — clone https://github.com/your-username/TinyGPT)
checkpoint = torch.load(ckpt_path, map_location="cpu")
```
|
|
|
| ## Acknowledgments
|
|
|
| - Inspired by Andrej Karpathy's nanoGPT
|
| - Training topics based on *AI Performance Engineering* by Chris Fregly
|
|
|