---
language: en
license: apache-2.0
tags:
- causal-lm
- code
- python
- pretrain
base_model: gpt2
---

# code-1b-pretrain-v3

A 1.13B parameter GPT-2 architecture causal language model pretrained from scratch on a curated mix of Python code and programming literature.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rovdetection/code-1b-pretrain-v3")
model = AutoModelForCausalLM.from_pretrained("rovdetection/code-1b-pretrain-v3")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100, temperature=0.8, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

## Training summary

| | |
|---|---|
| Architecture | GPT-2 (1.13B params) |
| Total steps | 30,000 |
| Peak LR | 3e-5 (cosine with warmup) |
| Effective batch size | 32 (gradient accumulation) |
| Precision | fp16 + 8-bit Adam (bitsandbytes) |
| Eval perplexity (held-out Python) | **3.65** |

## Dataset mix (Phase 4: final 7k steps)

| Dataset | Weight |
|---|---|
| bigcode/starcoderdata (Python) | 35% |
| codeparrot/codeparrot-clean | 25% |
| open-phi/programming_books_llama | 25% |
| greengerong/leetcode | 15% |

## Training phases

| Phase | Steps | LR range | Notes |
|---|---|---|---|
| 1 | 0 – 10,000 | 0 → 3e-5 | Warmup + early descent |
| 2–3 | 10,000 – 23,000 | 3e-5 → 4e-6 | Cosine decay, baseline mix |
| 4 | 23,000 – 30,000 | 4e-6 → ~0 | Quality shift: StarCoder ↑35% |

## Repo structure

The repo root contains inference weights only (`model.safetensors`, tokenizer, `config.json`). The `last-checkpoint/` subfolder contains the full training state (optimizer, scheduler, scaler, RNG) for resuming training.
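
## Learning rate schedule (sketch)

The LR ranges in the training phases table can be sanity-checked with a small stand-alone sketch of a linear-warmup, cosine-decay schedule. The peak LR (3e-5) and total steps (30,000) come from the training summary. The warmup length is not stated in this card, so `WARMUP_STEPS = 500` is a hypothetical value, and `lr_at` is an illustrative helper rather than the code used for training.

```python
import math

PEAK_LR = 3e-5        # from the training summary
TOTAL_STEPS = 30_000  # from the training summary
WARMUP_STEPS = 500    # ASSUMPTION: warmup length is not stated in this card


def lr_at(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, warmup_steps=WARMUP_STEPS):
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak LR.
        return peak_lr * step / warmup_steps
    # Fraction of the decay span completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine decay from the peak LR down to 0.
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the LR at the phase boundaries listed in the training phases table.
for step in (0, WARMUP_STEPS, 10_000, 23_000, 30_000):
    print(f"step {step:>6,}: lr = {lr_at(step):.2e}")
```

Under the assumed warmup, the value at step 23,000 comes out at roughly 4e-6, in line with the start of Phase 4 in the table. A different warmup length shifts the intermediate values slightly but not the endpoints.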