---
language: en
license: apache-2.0
tags:
- causal-lm
- code
- python
- pretrain
base_model: gpt2
---

# code-1b-pretrain-v3

A 1.13B-parameter causal language model with the GPT-2 architecture, pretrained
from scratch on a curated mix of Python code and programming literature.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and inference weights from the repo root.
tokenizer = AutoTokenizer.from_pretrained("rovdetection/code-1b-pretrain-v3")
model = AutoModelForCausalLM.from_pretrained("rovdetection/code-1b-pretrain-v3")

# Sample up to 100 new tokens that continue a Python function signature.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100, temperature=0.8, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

## Training summary

| Setting | Value |
|---|---|
| Architecture | GPT-2 (1.13B params) |
| Total steps | 30,000 |
| Peak LR | 3e-5 (cosine with warmup) |
| Effective batch size | 32 (via gradient accumulation) |
| Precision and optimizer | fp16 + 8-bit Adam (bitsandbytes) |
| Eval perplexity (held-out Python) | **3.65** |

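
To put the reported perplexity in terms of a training-loss value, the sketch below converts between the two. It assumes the usual definition (perplexity is the exponential of the mean per-token cross-entropy in nats); the card does not state how the evaluation was computed, and the function names are illustrative only.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from mean per-token negative log-likelihood (nats)."""
    return math.exp(mean_nll)


def mean_nll_from_perplexity(ppl: float) -> float:
    """Inverse of perplexity(): the mean loss that yields a given perplexity."""
    return math.log(ppl)


reported_ppl = 3.65  # value from the training summary table
loss = mean_nll_from_perplexity(reported_ppl)
print(f"perplexity {reported_ppl} <-> mean loss {loss:.3f} nats/token")
```

Under that assumption, a perplexity of 3.65 corresponds to a mean loss of roughly 1.29 nats per token.
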
## Dataset mix (Phase 4, final 7k steps)

| Dataset | Weight |
|---|---|
| bigcode/starcoderdata (Python) | 35% |
| codeparrot/codeparrot-clean | 25% |
| open-phi/programming_books_llama | 25% |
| greengerong/leetcode | 15% |

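
The card does not say how these weights were applied during training. One common reading is that each training example is drawn from a source with probability equal to its weight; the sketch below illustrates that interpretation with the standard library only. `MIX` and `sample_sources` are hypothetical names, not part of the training code.

```python
import random
from collections import Counter

# Phase 4 weights copied from the table above.
MIX = {
    "bigcode/starcoderdata (Python)": 0.35,
    "codeparrot/codeparrot-clean": 0.25,
    "open-phi/programming_books_llama": 0.25,
    "greengerong/leetcode": 0.15,
}


def sample_sources(n: int, seed: int = 0) -> list[str]:
    """Draw n source names, each chosen with probability equal to its weight."""
    rng = random.Random(seed)
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(10_000))
for name, weight in MIX.items():
    print(f"{name}: target {weight:.0%}, sampled {counts[name] / 10_000:.1%}")
```

With real datasets, `datasets.interleave_datasets` accepts a `probabilities` argument that performs the same kind of weighted mixing over streams.
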
## Training phases

| Phase | Steps | LR range | Notes |
|---|---|---|---|
| 1 | 0 → 10,000 | 0 → 3e-5 | Warmup + early descent |
| 2–3 | 10,000 → 23,000 | 3e-5 → 4e-6 | Cosine decay, baseline mix |
| 4 | 23,000 → 30,000 | 4e-6 → ~0 | Quality shift: StarCoder → 35% |

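
The learning-rate ranges above can be approximated with a standard linear-warmup, cosine-decay schedule. The sketch below is a minimal version under stated assumptions: the peak LR and total steps come from this card, but `WARMUP_STEPS = 1_000` is a guess (the card does not give the warmup length), so the printed values will not match the table exactly.

```python
import math

PEAK_LR = 3e-5        # from the training summary
TOTAL_STEPS = 30_000  # from the training summary
WARMUP_STEPS = 1_000  # assumption: the warmup length is not stated in the card


def lr_at(step: int,
          peak_lr: float = PEAK_LR,
          total_steps: int = TOTAL_STEPS,
          warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 10_000, 23_000, 30_000):
    print(f"step {step:>6,}: lr = {lr_at(step):.2e}")
```

With the assumed warmup, the value at step 23,000 comes out close to the 4e-6 listed at the start of Phase 4.
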
## Repo structure

The repo root contains inference weights only (`model.safetensors`, tokenizer,
`config.json`). The `last-checkpoint/` subfolder contains the full training
state (optimizer, scheduler, scaler, RNG) for resuming training.

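
To resume training, only the `last-checkpoint/` subfolder needs to be fetched. The sketch below shows one way to do that with `huggingface_hub.snapshot_download`; the helper names and the default `local_dir` are illustrative, and nothing is downloaded until `download_checkpoint()` is called.

```python
import os

REPO_ID = "rovdetection/code-1b-pretrain-v3"
CHECKPOINT_SUBFOLDER = "last-checkpoint"


def checkpoint_patterns(subfolder: str = CHECKPOINT_SUBFOLDER) -> list[str]:
    """Glob patterns that restrict the download to the checkpoint subfolder."""
    return [f"{subfolder}/*"]


def download_checkpoint(local_dir: str = "code-1b-pretrain-v3") -> str:
    """Fetch only the training-state files and return the local checkpoint path."""
    # Imported lazily so the module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=checkpoint_patterns(),
        local_dir=local_dir,
    )
    return os.path.join(root, CHECKPOINT_SUBFOLDER)


# Example (requires network access):
# checkpoint_dir = download_checkpoint()
```

If the checkpoint was written by the `transformers` `Trainer` (an assumption; the card does not name the training framework), the returned path can be passed as `trainer.train(resume_from_checkpoint=checkpoint_dir)`.
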