---
language:
- en
license: mit
tags:
- code
- python
- causal-lm
- from-scratch
- gqa
- rope
- swiglu
- fim
- qk-norm
datasets:
- nampdn-ai/tiny-codes
- ise-uiuc/Magicoder-OSS-Instruct-75K
- iamtarun/python_code_instructions_18k_alpaca
- bigcode/the-stack-smol
- flytech/python-codes-25k
- iamtarun/code_instructions_120k_alpaca
---

# PyCraft-1: A 55M Python Code LLM Trained From Scratch on Consumer Hardware

## Model Description

PyCraft-1 is a 55.3M-parameter decoder-only transformer trained **entirely from scratch** on Python code using a single NVIDIA RTX 3050 laptop GPU (4GB VRAM). It demonstrates that a domain-specific code LLM can be trained without cloud compute, following a quality-first data curriculum inspired by the Phi-1 "Textbooks Are All You Need" approach.

## Architecture

| Component | Choice |
|---|---|
| Architecture | Decoder-only transformer |
| Parameters | 55.3M |
| Attention | Grouped Query Attention (8 query heads / 2 KV heads) |
| Positional encoding | RoPE |
| QK-Norm | RMSNorm on Q and K (OLMo 2 / Qwen 3, 2025) |
| FFN | SwiGLU |
| Normalisation | RMSNorm pre-norm |
| Training objective | Causal LM + Fill-in-the-Middle (FIM, 50%) |
| Context window | 1024 tokens |
| Vocabulary | 32,000 (custom BPE, Python-tuned) |

## Training

**Pretraining:**

- 309,221 curated Python examples from 6 open sources
- Quality curriculum: scored on 5 heuristics, ordered best-first
- 4,000 steps | 1.05B tokens | Loss 1.16 | PPL 3.2

**Supervised Fine-Tuning:**

- Magicoder-OSS-Instruct-75K (Python subset, 40k examples)
- 400 steps | Loss 1.15 | PPL 3.15

**Hardware:** NVIDIA RTX 3050 Laptop GPU 4GB, Ryzen 7 6000, 16GB RAM

**Total training time:** ~22 hours

## Novel Contributions

1. **Quality-first data curriculum:** 309k Python examples scored and ordered by educational value (docstrings, type hints, comments, naming conventions, length). Validates the Phi-1 hypothesis in the resource-constrained regime.
2. **QK-Norm:** RMSNorm applied to Q and K before RoPE, adopted from OLMo 2 and Qwen 3 (2025), improving training stability in small models.
3. **FIM pretraining on 4GB VRAM:** Fill-in-the-Middle objective in PSM (prefix, suffix, middle) format, trained on a consumer GPU via gradient checkpointing and BF16 mixed precision.
4. **Full reproducibility:** complete open-source pipeline runnable on consumer hardware in under one week.

## Evaluation

| Metric | Value |
|---|---|
| Pretraining loss | 1.16 |
| Pretraining PPL | 3.20 |
| SFT loss | 1.15 |
| SFT PPL | 3.15 |
| Held-out PPL (binary search) | 1.4 |
| Held-out PPL (Stack class) | 1.7 |
| Held-out PPL (average, 5 samples) | 2.16 |

## Example Usage

```python
# Clone the repo and install dependencies first
# pip install torch safetensors tokenizers datasets

import torch
from safetensors.torch import load_file

# These modules come from the cloned PyCraft repository
from model.config import get_config_120m
from model.pycraft_model import PyCraftModel
from tokenizer.tokenizer_utils import PyCraftTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = PyCraftTokenizer("tokenizer/vocab/tokenizer.json")

# Build the model config and disable dropout for inference
cfg = get_config_120m()
cfg.vocab_size = 32000
cfg.dropout = 0.0

# Load the trained weights and switch to evaluation mode
model = PyCraftModel(cfg).to(device)
model.load_state_dict(load_file("model.safetensors", device=device))
model.eval()

# Encode a function signature and let the model complete the body
prompt = "def is_palindrome(s: str) -> bool:\n    "
ids = tokenizer.encode(prompt)
inp = torch.tensor(ids, dtype=torch.long).unsqueeze(0).to(device)

with torch.no_grad():
    out = model.generate(inp, max_new_tokens=80, temperature=0.7, top_k=40)

print(tokenizer.decode(out[0].tolist()))
```

## Limitations

- 55M parameters: generates plausible Python but may produce logical errors
- Context window: 1024 tokens
- Best on standard algorithms, data structures, and common library patterns
- Not a conversational assistant

## Citation

```bibtex
@misc{inamdar2026pycraft,
  title={PyCraft-1: Training a Python Code LLM From Scratch on Consumer Hardware},
  author={Inamdar, Rohan},
  year={2026},
  institution={University of Manchester, MSc Artificial Intelligence},
}
```

## Links

- GitHub: https://github.com/irohan0/pycraft-llm
- Author: https://www.linkedin.com/in/rohan-inamdar-47aa4b251
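
## Appendix: FIM Sample Formatting (Illustrative)

The Fill-in-the-Middle objective listed under Architecture and Novel Contributions (PSM format, applied to 50% of samples) can be sketched as follows. This is a minimal illustration, not the repository's implementation: the sentinel strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` and the helper names `to_psm` and `from_psm` are assumptions, and the split is done on characters for simplicity, whereas a real pipeline may split on tokens.

```python
import random

# Hypothetical sentinel strings; the actual PyCraft-1 tokenizer may use different ones.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def to_psm(code: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Return `code` unchanged, or rewritten in PSM (prefix, suffix, middle) order."""
    # Leave a share of samples as ordinary left-to-right text.
    if len(code) < 2 or rng.random() >= fim_rate:
        return code
    # Two sorted cut points split the text into prefix | middle | suffix.
    lo, hi = sorted(rng.randint(0, len(code)) for _ in range(2))
    prefix, middle, suffix = code[:lo], code[lo:hi], code[hi:]
    # PSM order: the model reads prefix and suffix first, then predicts the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def from_psm(sample: str) -> str:
    """Reassemble the original text from a PSM-formatted sample."""
    if not sample.startswith(FIM_PREFIX):
        return sample  # plain causal-LM sample, nothing to undo
    body = sample[len(FIM_PREFIX):]
    prefix, rest = body.split(FIM_SUFFIX, 1)
    suffix, middle = rest.split(FIM_MIDDLE, 1)
    return prefix + middle + suffix


# Small demonstration with a fixed seed so the split is repeatable.
demo_rng = random.Random(0)
demo_source = "def add(a, b):\n    return a + b\n"
demo_sample = to_psm(demo_source, demo_rng, fim_rate=1.0)
print(demo_sample)
```

Because the transformed sample is still a single left-to-right sequence, it can be trained with the same causal LM loss as untransformed text; only the ordering of the pieces changes. `from_psm` is included purely to show that the rearrangement is lossless.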