---
language: en
tags:
- gpt
- language-model
- gpu
- cuda
- ai-systems
- pytorch
license: mit
---
# KernelGPT — GPU/AI Systems Performance
A GPT-style decoder-only transformer trained from scratch on GPU/AI systems performance engineering text.
## Model Specs
| Property | Value |
|----------|-------|
| Parameters | ~125M |
| Architecture | Decoder-only Transformer |
| Embedding dim | 768 |
| Attention heads | 12 |
| Layers | 8 |
| Context length | 512 tokens |
| Vocab size | 32,000 (SentencePiece BPE) |
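As a rough cross-check, the parameter count implied by the table can be estimated from the embedding and per-layer shapes. This is a back-of-envelope sketch, not the actual model code: the exact total depends on weight tying, biases, layer norms, and head configuration, so it will not match the reported ~125M exactly.

```python
# Back-of-envelope parameter estimate from the spec table above.
d, layers, vocab, ctx = 768, 8, 32_000, 512

emb = vocab * d          # token embedding matrix
pos = ctx * d            # learned positional embeddings (assumed)
attn = 4 * d * d         # Q, K, V, and output projections
mlp = 2 * d * (4 * d)    # two MLP projections with a 4x hidden dim (assumed)
per_layer = attn + mlp

total = emb + pos + layers * per_layer + vocab * d  # untied LM head (assumed)
print(f"~{total / 1e6:.0f}M parameters")
```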
## Training
| Setting | Value |
|---------|-------|
| Training steps | 162,000 |
| Val loss | 4.3889 |
| Optimizer | AdamW |
| Learning rate | 3e-4 (cosine decay) |
| Batch size | 1 (effective 4 with grad accum) |
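The "effective 4 with grad accum" row means each optimizer step sums gradients over four micro-batches of size 1. A minimal sketch of that pattern (the toy model and data here are stand-ins, not the actual training code):

```python
import torch

# Gradient accumulation sketch: 4 micro-batches of size 1 approximate
# one update with an effective batch of 4.
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum_steps = 4

opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(1, 8)   # micro-batch of size 1
    y = torch.randn(1, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so gradients average over 4 samples
opt.step()  # single optimizer update for the whole effective batch
```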
## Training Data
- **FineWeb** (general web text)
- **arXiv papers** (cs.DC, cs.AR, cs.LG, cs.PF categories — GPU/AI/systems)
- **Wikipedia** (ML/systems filtered articles)
- **GPU-specific crawl** (NVIDIA docs, GitHub READMEs, arXiv abstracts)
Topics cover all 20 chapters of *AI Performance Engineering*, including CUDA internals,
KV cache tuning, LLM inference, distributed training, and GPU cluster scaling.
## Usage
```python
import torch
import sentencepiece as spm
from huggingface_hub import hf_hub_download
# Download files
ckpt_path = hf_hub_download("saiakula/KernelGPT", "pytorch_model.pt")
tok_path = hf_hub_download("saiakula/KernelGPT", "tokenizer.model")
# Load tokenizer
sp = spm.SentencePieceProcessor(model_file=tok_path)
# Load model
# (requires TinyGPT src — clone https://github.com/your-username/TinyGPT)
checkpoint = torch.load(ckpt_path, map_location="cpu")  # on PyTorch >= 2.6 you may need weights_only=False
```
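Once the TinyGPT model class is available, generation typically follows an autoregressive loop like the sketch below. The TinyGPT `forward` signature is an assumption; a random stand-in module is used here so the loop is self-contained and the context-window cropping is visible.

```python
import torch

# Hypothetical greedy decoding loop. In practice `model` would be the TinyGPT
# network loaded from the checkpoint above; this stand-in just maps each token
# id to a fixed random "logits" vector.
vocab_size, context_len = 100, 512  # toy vocab for the stand-in
model = torch.nn.Embedding(vocab_size, vocab_size)

@torch.no_grad()
def generate(ids, max_new_tokens=8):
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_len:])                # crop to context window
        next_id = logits[:, -1, :].argmax(-1, keepdim=True)  # greedy pick from last position
        ids = torch.cat([ids, next_id], dim=1)
    return ids

out = generate(torch.tensor([[1, 2, 3]]))  # prompt ids, e.g. from sp.encode(...)
```

With the real model, the prompt ids would come from `sp.encode(text)` and the output would be decoded with `sp.decode(out[0].tolist())`.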
## Acknowledgments
- Inspired by Andrej Karpathy's nanoGPT
- Training topics based on *AI Performance Engineering* by Chris Fregly