| ---
|
| language:
|
| - en
|
| license: mit
|
| tags:
|
| - code
|
| - python
|
| - causal-lm
|
| - from-scratch
|
| - gqa
|
| - rope
|
| - swiglu
|
| - fim
|
| - qk-norm
|
| datasets:
|
| - nampdn-ai/tiny-codes
|
| - ise-uiuc/Magicoder-OSS-Instruct-75K
|
| - iamtarun/python_code_instructions_18k_alpaca
|
| - bigcode/the-stack-smol
|
| - flytech/python-codes-25k
|
| - iamtarun/code_instructions_120k_alpaca
|
| ---
|
|
|
| # PyCraft-1: A 55M Python Code LLM Trained From Scratch on Consumer Hardware
|
|
|
| ## Model Description
|
|
|
| PyCraft-1 is a 55.3M-parameter decoder-only transformer trained
|
| **entirely from scratch** on Python code using a single NVIDIA RTX 3050
|
| laptop GPU (4GB VRAM). It demonstrates that a domain-specific code LLM
|
| can be trained without cloud compute, following a quality-first data
|
| curriculum inspired by the Phi-1 "Textbooks Are All You Need" approach.
|
|
|
| ## Architecture
|
|
|
| | Component | Choice |
|
| |---|---|
|
| | Architecture | Decoder-only transformer |
|
| | Parameters | 55.3M |
|
| | Attention | Grouped Query Attention (8Q / 2KV heads) |
|
| | Positional encoding | RoPE |
|
| | QK-Norm | RMSNorm on Q and K (OLMo 2 / Qwen 3, 2025) |
|
| | FFN | SwiGLU |
|
| | Normalisation | RMSNorm pre-norm |
|
| | Training objective | Causal LM + Fill-in-the-Middle (FIM, 50%) |
|
| | Context window | 1024 tokens |
|
| | Vocabulary | 32,000 (custom BPE, Python-tuned) |
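The attention rows of the table can be illustrated with a small, dependency-free sketch. It shows how 8 query heads might share 2 KV heads, and how RMS-normalising Q and K bounds the attention logit. The contiguous head grouping, the omission of the learnable gain, and the omission of RoPE are simplifying assumptions. The function names are hypothetical and are not taken from the PyCraft code.

```python
import math

N_Q_HEADS = 8   # from the architecture table
N_KV_HEADS = 2  # from the architecture table
GROUP_SIZE = N_Q_HEADS // N_KV_HEADS  # 4 query heads share each KV head


def kv_head_for(q_head: int) -> int:
    """Index of the KV head used by a query head (assumes contiguous groups)."""
    return q_head // GROUP_SIZE


def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm without a learnable gain: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def qk_logit(q: list[float], k: list[float]) -> float:
    """Scaled dot-product logit computed on RMS-normalised q and k."""
    qn, kn = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))


# Query heads 0-3 read KV head 0, query heads 4-7 read KV head 1.
mapping = [kv_head_for(h) for h in range(N_Q_HEADS)]

# Even with very large activations, the logit stays near sqrt(head_dim).
big_logit = qk_logit([1000.0] * 4, [1000.0] * 4)
```

Because both vectors are rescaled to unit RMS, the logit magnitude cannot exceed roughly the square root of the head dimension, which is the usual stability argument for QK-Norm.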
|
|
|
| ## Training
|
|
|
| **Pretraining:**
|
| - 309,221 curated Python examples from 6 open sources
|
| - Quality curriculum: scored on 5 heuristics, ordered best-first
|
| - 4,000 steps | 1.05B tokens | Loss 1.16 | PPL 3.2
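A quality curriculum of this kind could be sketched as follows. The five heuristics are the ones named in this card (docstrings, type hints, comments, naming conventions, length), but the one-point-per-heuristic scoring, the length bounds, and the function names are assumptions made for illustration, not the project's actual scorer.

```python
import ast
import re


def quality_score(source: str) -> float:
    """Return a score in [0, 5]: one point per heuristic satisfied."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0  # unparsable code gets the lowest score

    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    lines = source.splitlines()

    has_docstring = any(ast.get_docstring(f) for f in funcs)
    has_type_hints = any(
        f.returns is not None or any(a.annotation for a in f.args.args)
        for f in funcs
    )
    has_comments = any(line.lstrip().startswith("#") for line in lines)
    snake_case = bool(funcs) and all(
        re.fullmatch(r"_*[a-z][a-z0-9_]*", f.name) for f in funcs
    )
    reasonable_length = 5 <= len(lines) <= 200  # hypothetical bounds

    return float(sum([has_docstring, has_type_hints, has_comments,
                      snake_case, reasonable_length]))


def order_best_first(examples: list[str]) -> list[str]:
    """Sort examples so the highest-scoring ones are seen first."""
    return sorted(examples, key=quality_score, reverse=True)


good = (
    "def add(a: int, b: int) -> int:\n"
    '    """Add two ints."""\n'
    "    # simple sum\n"
    "    total = a + b\n"
    "    return total\n"
)
bad = "def F(x):\n    return x\n"
ordered = order_best_first([bad, good])
```

In this sketch `good` satisfies all five checks and `bad` satisfies none, so `good` is placed first.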
|
|
|
| **Supervised Fine-Tuning:**
|
| - Magicoder-OSS-Instruct-75K (Python subset, 40k examples)
|
| - 400 steps | Loss 1.15 | PPL 3.15
|
|
|
| **Hardware:** NVIDIA RTX 3050 Laptop GPU 4GB, Ryzen 7 6000, 16GB RAM
|
| **Total training time:** ~22 hours
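As a rough sanity check on these figures, the reported step count, token count, and wall-clock time can be related with a few lines of arithmetic. It assumes every pretraining step consumes the same number of tokens and that sequences are packed to the full 1024-token context. The card states neither, so the derived batch size is an inference, not a reported setting.

```python
pretrain_steps = 4_000    # reported
pretrain_tokens = 1.05e9  # reported
context_len = 1_024       # reported context window
total_hours = 22          # reported total, which also covers SFT

# Tokens consumed per optimizer step, if steps are uniform.
tokens_per_step = pretrain_tokens / pretrain_steps

# Implied number of full-length sequences per step (about 256).
sequences_per_step = tokens_per_step / context_len

# Lower bound on pretraining throughput, since the 22 hours include SFT.
tokens_per_second = pretrain_tokens / (total_hours * 3600)
```

Under these assumptions each step covers about 262,500 tokens, close to 256 sequences of 1024 tokens, and throughput is at least about 13,000 tokens per second.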
|
|
|
| ## Novel Contributions
|
|
|
| 1. **Quality-first data curriculum** — 309k Python examples scored and
|
| ordered by educational value (docstrings, type hints, comments,
|
| naming conventions, length). Validates the Phi-1 hypothesis in the
|
| resource-constrained regime.
|
|
|
| 2. **QK-Norm** — RMSNorm applied to Q and K before RoPE, adopted from
|
| OLMo 2 and Qwen 3 (2025), improving training stability in small models.
|
|
|
| 3. **FIM pretraining on 4GB VRAM** — Fill-in-the-Middle objective using
|
| PSM format trained on a consumer GPU via gradient checkpointing and
|
| BF16 mixed precision.
|
|
|
| 4. **Full reproducibility** — complete open-source pipeline runnable on
|
| consumer hardware in under one week.
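The PSM (prefix, suffix, middle) rearrangement used for Fill-in-the-Middle training can be sketched as below. The sentinel strings, the character-level split, and the `apply_fim` helper are assumptions for illustration; the actual PyCraft tokenizer may use different special tokens and split at the token level. The 50% rate matches the figure in the architecture table.

```python
import random

# Hypothetical sentinel strings; the real special tokens may differ.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def apply_fim(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, rewrite doc in PSM order; else return it unchanged."""
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc  # plain left-to-right causal LM example
    # Pick two cut points that split the document into prefix / middle / suffix.
    lo, hi = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:lo], doc[lo:hi], doc[hi:]
    # PSM: the model sees prefix and suffix, then learns to predict the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


doc = "def f(x):\n    return x + 1\n"
fim_example = apply_fim(doc, random.Random(0), fim_rate=1.0)
```

Since the transformed string is still trained with the ordinary next-token loss, the same model can serve both completion and infilling.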
|
|
|
| ## Evaluation
|
|
|
| | Metric | Value |
|
| |---|---|
|
| | Pretraining loss | 1.16 |
|
| | Pretraining PPL | 3.20 |
|
| | SFT loss | 1.15 |
|
| | SFT PPL | 3.15 |
|
| | Held-out PPL (binary search) | 1.4 |
|
| | Held-out PPL (Stack class) | 1.7 |
|
| | Held-out PPL (average, 5 samples) | 2.16 |
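The loss and perplexity rows are related by PPL = exp(loss), assuming the loss is the mean per-token cross-entropy in nats. The card does not state this explicitly, but it is the usual convention. A quick check:

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


pretrain_ppl = perplexity(1.16)  # about 3.19
sft_ppl = perplexity(1.15)       # about 3.16
```

Both values agree with the table to within the rounding of the reported losses.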
|
|
|
| ## Example Usage
|
|
|
| ```python
|
| # Clone the repo and install dependencies first
|
| # pip install torch safetensors tokenizers datasets
|
|
|
| import torch
|
| from safetensors.torch import load_file
|
| from model.config import get_config_120m
|
| from model.pycraft_model import PyCraftModel
|
| from tokenizer.tokenizer_utils import PyCraftTokenizer
|
|
|
| device = "cuda" if torch.cuda.is_available() else "cpu"
|
| tokenizer = PyCraftTokenizer("tokenizer/vocab/tokenizer.json")
|
|
|
| cfg = get_config_120m()
|
| cfg.vocab_size = 32000
|
| cfg.dropout = 0.0
|
|
|
| model = PyCraftModel(cfg).to(device)
|
| model.load_state_dict(load_file("model.safetensors", device=device))
|
| model.eval()
|
|
|
| prompt = "def is_palindrome(s: str) -> bool:\n    "
|
| ids = tokenizer.encode(prompt)
|
| inp = torch.tensor(ids, dtype=torch.long).unsqueeze(0).to(device)
|
|
|
| with torch.no_grad():
|
|     out = model.generate(inp, max_new_tokens=80, temperature=0.7, top_k=40)
|
|
|
| print(tokenizer.decode(out[0].tolist()))
|
| ```
|
|
|
| ## Limitations
|
|
|
| - 55M parameters: generates plausible Python but may produce logical errors
|
| - Context window: 1024 tokens
|
| - Best on standard algorithms, data structures, and common library patterns
|
| - Not a conversational assistant
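Because of the 1024-token context window, long prompts may need to be trimmed before generation. One possible approach, sketched here with a hypothetical `fit_prompt` helper, keeps only the most recent tokens so that the prompt plus the generated tokens fit in the window. Whether `model.generate` already crops its input internally is not stated in this card.

```python
CONTEXT_WINDOW = 1024  # from the model card


def fit_prompt(ids: list[int], max_new_tokens: int,
               context_window: int = CONTEXT_WINDOW) -> list[int]:
    """Keep the last tokens of ids so prompt + generation fits the window."""
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context window")
    return ids[-budget:]  # short prompts are returned unchanged


long_ids = list(range(2000))
trimmed = fit_prompt(long_ids, max_new_tokens=80)
```

Truncating from the left keeps the code nearest the cursor, which is usually the most relevant context for completion.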
|
|
|
| ## Citation
|
|
|
| ```bibtex
|
| @misc{inamdar2026pycraft,
|
| title={PyCraft-1: Training a Python Code LLM From Scratch on Consumer Hardware},
|
| author={Inamdar, Rohan},
|
| year={2026},
|
| institution={University of Manchester, MSc Artificial Intelligence},
|
| }
|
| ```
|
|
|
| ## Links
|
|
|
| - GitHub: https://github.com/irohan0/pycraft-llm
|
| - Author: https://www.linkedin.com/in/rohan-inamdar-47aa4b251
|
|
|