---
language: en
tags:
- gpt
- language-model
- gpu
- cuda
- ai-systems
- pytorch
license: mit
---
# KernelGPT — GPU/AI Systems Performance
A GPT-style decoder-only transformer trained from scratch on GPU/AI systems performance engineering text.
## Model Specs
| Property | Value |
|----------|-------|
| Parameters | ~125M |
| Architecture | Decoder-only Transformer |
| Embedding dim | 768 |
| Attention heads | 12 |
| Layers | 8 |
| Context length | 512 tokens |
| Vocab size | 32,000 (SentencePiece BPE) |
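As a rough cross-check, the parameter count implied by the table can be estimated from the embedding and per-layer shapes. This is a back-of-envelope sketch, not the actual model code: the exact total depends on weight tying, biases, layer norms, and head configuration, so it will not match the reported ~125M exactly.

```python
# Back-of-envelope parameter estimate from the spec table above.
d, layers, vocab, ctx = 768, 8, 32_000, 512

emb = vocab * d          # token embedding matrix
pos = ctx * d            # learned positional embeddings (assumed)
attn = 4 * d * d         # Q, K, V, and output projections
mlp = 2 * d * (4 * d)    # two MLP projections with a 4x hidden dim (assumed)
per_layer = attn + mlp

total = emb + pos + layers * per_layer + vocab * d  # untied LM head (assumed)
print(f"~{total / 1e6:.0f}M parameters")
```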
## Training
| Setting | Value |
|---------|-------|
| Training steps | 162,000 |
| Val loss | 4.3889 |
| Optimizer | AdamW |
| Learning rate | 3e-4 (cosine decay) |
| Batch size | 1 (effective 4 with grad accum) |
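The "effective 4 with grad accum" row means each optimizer step sums gradients over four micro-batches of size 1. A minimal sketch of that pattern (the toy model and data here are stand-ins, not the actual training code):

```python
import torch

# Gradient accumulation sketch: 4 micro-batches of size 1 approximate
# one update with an effective batch of 4.
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum_steps = 4

opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(1, 8)   # micro-batch of size 1
    y = torch.randn(1, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so gradients average over 4 samples
opt.step()  # single optimizer update for the whole effective batch
```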
## Training Data
- **FineWeb** (general web text)
- **arXiv papers** (cs.DC, cs.AR, cs.LG, cs.PF categories — GPU/AI/systems)
- **Wikipedia** (ML/systems filtered articles)
- **GPU-specific crawl** (NVIDIA docs, GitHub READMEs, arXiv abstracts)
Topics cover all 20 chapters of *AI Performance Engineering*, including CUDA internals,
KV cache tuning, LLM inference, distributed training, and GPU cluster scaling.
## Usage
```python
import torch
import sentencepiece as spm
from huggingface_hub import hf_hub_download
# Download files
ckpt_path = hf_hub_download("saiakula/KernelGPT", "pytorch_model.pt")
tok_path = hf_hub_download("saiakula/KernelGPT", "tokenizer.model")
# Load tokenizer
sp = spm.SentencePieceProcessor(model_file=tok_path)
# Load model
# (requires TinyGPT src — clone https://github.com/your-username/TinyGPT)
checkpoint = torch.load(ckpt_path, map_location="cpu")  # on PyTorch >= 2.6 you may need weights_only=False
```
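Once the TinyGPT model class is available, generation typically follows an autoregressive loop like the sketch below. The TinyGPT `forward` signature is an assumption; a random stand-in module is used here so the loop is self-contained and the context-window cropping is visible.

```python
import torch

# Hypothetical greedy decoding loop. In practice `model` would be the TinyGPT
# network loaded from the checkpoint above; this stand-in just maps each token
# id to a fixed random "logits" vector.
vocab_size, context_len = 100, 512  # toy vocab for the stand-in
model = torch.nn.Embedding(vocab_size, vocab_size)

@torch.no_grad()
def generate(ids, max_new_tokens=8):
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_len:])                # crop to context window
        next_id = logits[:, -1, :].argmax(-1, keepdim=True)  # greedy pick from last position
        ids = torch.cat([ids, next_id], dim=1)
    return ids

out = generate(torch.tensor([[1, 2, 3]]))  # prompt ids, e.g. from sp.encode(...)
```

With the real model, the prompt ids would come from `sp.encode(text)` and the output would be decoded with `sp.decode(out[0].tolist())`.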
## Acknowledgments
- Inspired by Andrej Karpathy's nanoGPT
- Training topics based on *AI Performance Engineering* by Chris Fregly