---
language:
- en
license: mit
tags:
- code
- python
- causal-lm
- from-scratch
- gqa
- rope
- swiglu
- fim
- qk-norm
datasets:
- nampdn-ai/tiny-codes
- ise-uiuc/Magicoder-OSS-Instruct-75K
- iamtarun/python_code_instructions_18k_alpaca
- bigcode/the-stack-smol
- flytech/python-codes-25k
- iamtarun/code_instructions_120k_alpaca
---
# PyCraft-1: A 55M Python Code LLM Trained From Scratch on Consumer Hardware
## Model Description
PyCraft-1 is a 55.3M-parameter decoder-only transformer trained
**entirely from scratch** on Python code using a single NVIDIA RTX 3050
laptop GPU (4 GB VRAM). It demonstrates that a domain-specific code LLM
can be trained without cloud compute by following a quality-first data
curriculum inspired by the Phi-1 "Textbooks Are All You Need" approach.
## Architecture
| Component | Choice |
|---|---|
| Architecture | Decoder-only transformer |
| Parameters | 55.3M |
| Attention | Grouped Query Attention (8Q / 2KV heads) |
| Positional encoding | RoPE |
| QK-Norm | RMSNorm on Q and K (OLMo 2 / Qwen 3, 2025) |
| FFN | SwiGLU |
| Normalisation | RMSNorm pre-norm |
| Training objective | Causal LM + Fill-in-the-Middle (FIM, 50%) |
| Context window | 1024 tokens |
| Vocabulary | 32,000 (custom BPE, Python-tuned) |
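
To make the attention rows of the table concrete, here is a minimal pure-Python sketch of two ideas: how 8 query heads share 2 KV heads under GQA, and why QK-Norm keeps attention logits bounded. It is not the PyCraft implementation. `HEAD_DIM = 64` is an assumption (the head dimension is not stated here), and the learned RMSNorm gain is omitted for brevity.

```python
import math

N_Q_HEADS = 8    # from the architecture table
N_KV_HEADS = 2   # from the architecture table
HEAD_DIM = 64    # assumption: the head dimension is not stated in this README


def kv_head_for(q_head: int) -> int:
    """Map a query head to the KV head it shares under GQA."""
    group_size = N_Q_HEADS // N_KV_HEADS  # 4 query heads per KV head
    return q_head // group_size


def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm without the learned gain: x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def attention_logit(q: list[float], k: list[float]) -> float:
    """Scaled dot product between QK-normalised q and k."""
    qn, kn = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))


# Query heads 0-3 read KV head 0, query heads 4-7 read KV head 1.
head_map = [kv_head_for(h) for h in range(N_Q_HEADS)]

# Even with very large raw activations, the logit stays within sqrt(HEAD_DIM).
big_q = [1000.0] * HEAD_DIM
big_k = [1000.0] * HEAD_DIM
logit = attention_logit(big_q, big_k)
```

After normalisation each vector has length close to `sqrt(HEAD_DIM)`, so the scaled logit cannot exceed `sqrt(HEAD_DIM)` in magnitude. RoPE is a rotation and preserves vector length, so the bound still holds when RoPE is applied after the norm.
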
## Training
**Pretraining:**
- 309,221 curated Python examples from 6 open sources
- Quality curriculum: scored on 5 heuristics, ordered best-first
- 4,000 steps | 1.05B tokens | Loss 1.16 | PPL 3.20

**Supervised Fine-Tuning:**
- Magicoder-OSS-Instruct-75K (Python subset, 40k examples)
- 400 steps | Loss 1.15 | PPL 3.15

**Hardware:** NVIDIA RTX 3050 Laptop GPU (4 GB VRAM), Ryzen 7 6000, 16 GB RAM

**Total training time:** ~22 hours
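
The quality curriculum listed under Pretraining scores each example on five heuristics (docstrings, type hints, comments, naming conventions, length) and presents the best first. The sketch below shows one way such a scorer could look. The equal weights, the snake_case rule, and the 5-200 line range are all assumptions made for illustration, not the scoring used to train PyCraft-1.

```python
import ast
import re

SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")


def quality_score(source: str) -> float:
    """Score a snippet in [0, 5], one point per heuristic (equal weights assumed)."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0  # unparsable code sorts last

    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not funcs or not lines:
        return 0.0

    # 1. Docstrings: fraction of functions that have one.
    docstrings = sum(ast.get_docstring(f) is not None for f in funcs) / len(funcs)
    # 2. Type hints: fraction of functions with a return annotation.
    hints = sum(f.returns is not None for f in funcs) / len(funcs)
    # 3. Comments: crude check for any '#' (also matches '#' inside strings).
    comments = 1.0 if any("#" in ln for ln in lines) else 0.0
    # 4. Naming: fraction of function names in snake_case.
    naming = sum(bool(SNAKE_CASE.match(f.name)) for f in funcs) / len(funcs)
    # 5. Length: prefer 5 to 200 non-blank lines (assumed range).
    length = 1.0 if 5 <= len(lines) <= 200 else 0.0

    return docstrings + hints + comments + naming + length


def order_best_first(examples: list[str]) -> list[str]:
    """Return examples sorted by descending quality score."""
    return sorted(examples, key=quality_score, reverse=True)


good = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
    # Plain integer addition.
    result = a + b

    return result
'''
bad = "def F(a,b):\n    return a+b\n"
ordered = order_best_first([bad, good])  # the documented, typed snippet comes first
```
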
## Novel Contributions
1. **Quality-first data curriculum**: 309k Python examples scored and
ordered by educational value (docstrings, type hints, comments,
naming conventions, length). Validates the Phi-1 hypothesis in the
resource-constrained regime.
2. **QK-Norm**: RMSNorm applied to Q and K before RoPE, adopted from
OLMo 2 and Qwen 3 (2025) to improve training stability in small models.
3. **FIM pretraining on 4 GB VRAM**: Fill-in-the-Middle objective in
PSM format, trained on a consumer GPU via gradient checkpointing and
BF16 mixed precision.
4. **Full reproducibility**: complete open-source pipeline runnable on
consumer hardware in under one week.
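
For readers unfamiliar with the PSM (Prefix-Suffix-Middle) format mentioned in item 3, the sketch below shows how a training example might be rearranged. The sentinel strings are hypothetical, since this README does not list the tokenizer's special tokens, and the split here is at character level, whereas a real pipeline may split on tokens.

```python
import random

# Hypothetical sentinel strings; the real special tokens are not listed here.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"
FIM_RATE = 0.5  # from the architecture table: FIM on 50% of examples


def to_psm(text: str, rng: random.Random) -> str:
    """Split text at two random points and emit Prefix-Suffix-Middle order."""
    lo, hi = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def maybe_fim(text: str, rng: random.Random) -> str:
    """Apply the FIM transform with probability FIM_RATE, else keep causal order."""
    return to_psm(text, rng) if rng.random() < FIM_RATE else text


def from_psm(example: str) -> str:
    """Invert to_psm, to check that no characters are lost."""
    rest = example[len(FIM_PREFIX):]
    prefix, rest = rest.split(FIM_SUFFIX, 1)
    suffix, middle = rest.split(FIM_MIDDLE, 1)
    return prefix + middle + suffix


rng = random.Random(0)
doc = "def square(x):\n    return x * x\n"
fim_example = to_psm(doc, rng)
```

Because the middle span comes last, an ordinary next-token loss teaches the model to generate the missing span given both the code before it and the code after it.
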
## Evaluation
| Metric | Value |
|---|---|
| Pretraining loss | 1.16 |
| Pretraining PPL | 3.20 |
| SFT loss | 1.15 |
| SFT PPL | 3.15 |
| Held-out PPL (binary search sample) | 1.4 |
| Held-out PPL (Stack class sample) | 1.7 |
| Held-out PPL (average over 5 samples) | 2.16 |
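
The loss and perplexity rows are related by PPL = exp(loss) when the loss is measured in nats per token. The short check below recomputes perplexity from the reported losses and also derives the tokens processed per step from the Training section. The step and token counts come from this README; the interpretation of the result as a batch size is an inference, not a documented setting.

```python
import math

# Perplexity is exp(cross-entropy loss) when the loss is in nats per token.
pretrain_ppl = math.exp(1.16)  # reported PPL: 3.20
sft_ppl = math.exp(1.15)       # reported PPL: 3.15

# Token budget from the Training section: 1.05B tokens over 4,000 steps.
tokens_per_step = 1.05e9 / 4000
context = 1024
sequences_per_step = tokens_per_step / context

print(f"pretrain PPL ~ {pretrain_ppl:.2f}, SFT PPL ~ {sft_ppl:.2f}")
print(f"tokens/step ~ {tokens_per_step:,.0f}, sequences/step ~ {sequences_per_step:.0f}")
```

The recomputed values (about 3.19 and 3.16) agree with the table to within the rounding of the losses. The budget works out to roughly 256 full-length sequences per step, which on a 4 GB GPU would most plausibly be reached through gradient accumulation.
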
## Example Usage
```python
# Clone the repo and install dependencies first:
# pip install torch safetensors tokenizers datasets
import torch
from safetensors.torch import load_file

# Project-local modules from the cloned repo
from model.config import get_config_120m
from model.pycraft_model import PyCraftModel
from tokenizer.tokenizer_utils import PyCraftTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PyCraftTokenizer("tokenizer/vocab/tokenizer.json")

# Build the config, then set the vocabulary size and disable dropout for inference
cfg = get_config_120m()
cfg.vocab_size = 32000
cfg.dropout = 0.0

# Load the weights and switch to evaluation mode
model = PyCraftModel(cfg).to(device)
model.load_state_dict(load_file("model.safetensors", device=device))
model.eval()

# The prompt ends inside the function body so the model continues from there
prompt = "def is_palindrome(s: str) -> bool:\n    "
ids = tokenizer.encode(prompt)
inp = torch.tensor(ids, dtype=torch.long).unsqueeze(0).to(device)

with torch.no_grad():
    out = model.generate(inp, max_new_tokens=80, temperature=0.7, top_k=40)

print(tokenizer.decode(out[0].tolist()))
```
## Limitations
- 55M parameters: generates plausible Python but may produce logical errors
- Context window: 1024 tokens
- Best on standard algorithms, data structures, and common library patterns
- Not a conversational assistant
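
Because of the 1024-token window, the prompt and the generated tokens have to fit within it together. The helper below is a small sketch of left-truncating a long prompt before generation. `crop_prompt` is a hypothetical name, and whether `model.generate` already crops its input internally is not stated in this README.

```python
CONTEXT_WINDOW = 1024  # from the Limitations list


def crop_prompt(ids: list[int], max_new_tokens: int) -> list[int]:
    """Keep the most recent tokens so prompt + generation fits the window."""
    budget = CONTEXT_WINDOW - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context window")
    return ids[-budget:]


long_prompt = list(range(2000))         # stand-in for tokenizer.encode(...) output
cropped = crop_prompt(long_prompt, 80)  # leaves room for 80 generated tokens
```

Keeping the tail rather than the head preserves the code closest to the cursor, which usually matters most for completion.
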
## Citation
```bibtex
@misc{inamdar2026pycraft,
title={PyCraft-1: Training a Python Code LLM From Scratch on Consumer Hardware},
author={Inamdar, Rohan},
year={2026},
institution={University of Manchester, MSc Artificial Intelligence},
}
```
## Links
- GitHub: https://github.com/irohan0/pycraft-llm
- Author: https://www.linkedin.com/in/rohan-inamdar-47aa4b251