	---
	language:
	- en
	license: mit
	tags:
	- code
	- python
	- causal-lm
	- from-scratch
	- gqa
	- rope
	- swiglu
	- fim
	- qk-norm
	datasets:
	- nampdn-ai/tiny-codes
	- ise-uiuc/Magicoder-OSS-Instruct-75K
	- iamtarun/python_code_instructions_18k_alpaca
	- bigcode/the-stack-smol
	- flytech/python-codes-25k
	- iamtarun/code_instructions_120k_alpaca
	---

	# PyCraft-1: A 55M Python Code LLM Trained From Scratch on Consumer Hardware

	## Model Description

	PyCraft-1 is a 55.3M-parameter decoder-only transformer trained
	entirely from scratch on Python code using a single NVIDIA RTX 3050
	laptop GPU (4GB VRAM). It demonstrates that a domain-specific code LLM
	can be trained without cloud compute, following a quality-first data
	curriculum inspired by the Phi-1 "Textbooks Are All You Need" approach.

	## Architecture

	| Component | Choice |
	|---|---|
	| Architecture | Decoder-only transformer |
	| Parameters | 55.3M |
	| Attention | Grouped Query Attention (8Q / 2KV heads) |
	| Positional encoding | RoPE |
	| QK-Norm | RMSNorm on Q and K (OLMo 2 / Qwen 3, 2025) |
	| FFN | SwiGLU |
	| Normalisation | RMSNorm pre-norm |
	| Training objective | Causal LM + Fill-in-the-Middle (FIM, 50%) |
	| Context window | 1024 tokens |
	| Vocabulary | 32,000 (custom BPE, Python-tuned) |
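
As a rough illustration of the 8Q / 2KV grouping in the table, the sketch below maps each query head to the KV head it shares and compares KV-cache size against full multi-head attention. The head dimension (64) and layer count (12) are placeholder assumptions for the arithmetic, not values taken from this model's config.

```python
# Illustrative only: head_dim and n_layers below are assumed values,
# not PyCraft-1's actual configuration.
N_Q_HEADS = 8    # from the architecture table
N_KV_HEADS = 2   # from the architecture table


def kv_head_for(q_head: int, n_q: int = N_Q_HEADS, n_kv: int = N_KV_HEADS) -> int:
    """Return the index of the KV head shared by a given query head."""
    if n_q % n_kv != 0:
        raise ValueError("query heads must be divisible by KV heads")
    group_size = n_q // n_kv  # 4 query heads per KV head here
    return q_head // group_size


def kv_cache_elements(n_kv: int, head_dim: int, seq_len: int, n_layers: int) -> int:
    """Number of cached K and V values for one sequence."""
    return 2 * n_layers * n_kv * head_dim * seq_len


mapping = [kv_head_for(h) for h in range(N_Q_HEADS)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]

gqa = kv_cache_elements(N_KV_HEADS, head_dim=64, seq_len=1024, n_layers=12)
mha = kv_cache_elements(N_Q_HEADS, head_dim=64, seq_len=1024, n_layers=12)
print(mha // gqa)  # 4
```

The cache ratio depends only on the head counts, so the 4x saving holds whatever the real head dimension and depth are.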

	## Training

	Pretraining:
	- 309,221 curated Python examples from 6 open sources
	- Quality curriculum: scored on 5 heuristics, ordered best-first
	- 4,000 steps | 1.05B tokens | Loss 1.16 | PPL 3.2
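
A minimal sketch of what a heuristic quality curriculum could look like, using the five signals this card names (docstrings, type hints, comments, naming conventions, length). The thresholds, the equal weighting, and the function names `quality_score` and `order_best_first` are assumptions made for illustration; the scoring used in the actual pipeline may differ.

```python
import ast
import re


def quality_score(source: str) -> float:
    """Score a Python snippet in [0, 1] using five equally weighted checks.

    Thresholds and weights are illustrative assumptions.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0  # unparseable code gets the lowest score

    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    lines = source.splitlines()

    has_docstring = any(ast.get_docstring(f) for f in funcs)
    has_type_hints = any(
        f.returns is not None or any(a.annotation for a in f.args.args)
        for f in funcs
    )
    has_comments = any(line.strip().startswith("#") for line in lines)
    snake_case = bool(funcs) and all(
        re.fullmatch(r"_*[a-z][a-z0-9_]*", f.name) for f in funcs
    )
    reasonable_length = 3 <= len(lines) <= 200

    checks = [has_docstring, has_type_hints, has_comments,
              snake_case, reasonable_length]
    return sum(checks) / len(checks)


def order_best_first(examples: list[str]) -> list[str]:
    """Sort examples so the highest-scoring ones are seen first."""
    return sorted(examples, key=quality_score, reverse=True)


good = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
    # Plain integer addition.
    return a + b
'''
bad = "def F(a,b): return a+b\n"

print(quality_score(good), quality_score(bad))  # 1.0 0.0
print(order_best_first([bad, good])[0] == good)  # True
```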

	Supervised Fine-Tuning:
	- Magicoder-OSS-Instruct-75K (Python subset, 40k examples)
	- 400 steps | Loss 1.15 | PPL 3.15

	Hardware: NVIDIA RTX 3050 Laptop GPU 4GB, Ryzen 7 6000, 16GB RAM
	Total training time: ~22 hours

	## Novel Contributions

	1. Quality-first data curriculum: 309k Python examples scored and
	ordered by educational value (docstrings, type hints, comments,
	naming conventions, length). Validates the Phi-1 hypothesis in a
	resource-constrained regime.

	2. QK-Norm: RMSNorm applied to Q and K before RoPE, adopted from
	OLMo 2 and Qwen 3 (2025) to improve training stability in small models.

	3. FIM pretraining on 4GB VRAM: Fill-in-the-Middle objective in
	PSM (prefix-suffix-middle) format, trained on a consumer GPU using
	gradient checkpointing and BF16 mixed precision.

	4. Full reproducibility: complete open-source pipeline runnable on
	consumer hardware in under one week.
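
To make the PSM (prefix-suffix-middle) Fill-in-the-Middle format concrete, here is a small sketch of how a training document might be rearranged. The sentinel strings and the `to_psm` helper are hypothetical, and the split is done at character level for simplicity; the real pipeline may use different special tokens and split on token boundaries.

```python
import random

# Hypothetical sentinel strings; the actual special tokens in PyCraft-1's
# tokenizer may differ.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def to_psm(document: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, rearrange a document into PSM order."""
    if len(document) < 2 or rng.random() >= fim_rate:
        return document  # keep as an ordinary left-to-right example

    # Pick two cut points that split the text into prefix / middle / suffix.
    lo, hi = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:lo], document[lo:hi], document[hi:]

    # PSM: the model sees prefix and suffix, then learns to emit the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


doc = "def f(x):\n    return x + 1\n"
example = to_psm(doc, random.Random(0), fim_rate=1.0)
print(example)
```

Because the transformed string is still trained with the ordinary next-token loss, the 50% FIM rate only changes the data, not the model or the objective.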

	## Evaluation

	| Metric | Value |
	|---|---|
	| Pretraining loss | 1.16 |
	| Pretraining PPL | 3.20 |
	| SFT loss | 1.15 |
	| SFT PPL | 3.15 |
	| Held-out PPL (binary search) | 1.4 |
	| Held-out PPL (Stack class) | 1.7 |
	| Held-out PPL (average, 5 samples) | 2.16 |
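
The loss and PPL rows are related by PPL = exp(loss), assuming the loss is the mean per-token cross-entropy in nats. A quick check under that assumption:

```python
import math


def perplexity(mean_nll: float) -> float:
    """Convert a mean per-token negative log-likelihood (nats) to perplexity."""
    return math.exp(mean_nll)


print(round(perplexity(1.16), 2))  # 3.19
print(round(perplexity(1.15), 2))  # 3.16
```

The small gaps to the reported 3.20 and 3.15 are what you would expect from the losses being rounded to two decimals.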

	## Example Usage

	```python
	# Clone the repo and install dependencies first
	# pip install torch safetensors tokenizers datasets

	import torch
	from safetensors.torch import load_file
	from model.config import get_config_120m
	from model.pycraft_model import PyCraftModel
	from tokenizer.tokenizer_utils import PyCraftTokenizer

	device = "cuda" if torch.cuda.is_available() else "cpu"
	tokenizer = PyCraftTokenizer("tokenizer/vocab/tokenizer.json")

	cfg = get_config_120m()
	cfg.vocab_size = 32000
	cfg.dropout = 0.0

	model = PyCraftModel(cfg).to(device)
	model.load_state_dict(load_file("model.safetensors", device=device))
	model.eval()

	prompt = "def is_palindrome(s: str) -> bool:\n    "
	ids = tokenizer.encode(prompt)
	inp = torch.tensor(ids, dtype=torch.long).unsqueeze(0).to(device)

	with torch.no_grad():
	    out = model.generate(inp, max_new_tokens=80, temperature=0.7, top_k=40)

	print(tokenizer.decode(out[0].tolist()))
	```

	## Limitations

	- 55M parameters: generates plausible Python but may produce logical errors
	- Context window: 1024 tokens
	- Best on standard algorithms, data structures, and common library patterns
	- Not a conversational assistant

	## Citation

	```bibtex
	@misc{inamdar2026pycraft,
	  title={PyCraft-1: Training a Python Code LLM From Scratch on Consumer Hardware},
	  author={Inamdar, Rohan},
	  year={2026},
	  institution={University of Manchester, MSc Artificial Intelligence},
	}
	```

	## Links

	- GitHub: https://github.com/irohan0/pycraft-llm
	- Author: https://www.linkedin.com/in/rohan-inamdar-47aa4b251