---
language: en
license: apache-2.0
tags:
- causal-lm
- code
- python
- pretrain
base_model: gpt2
---
# code-1b-pretrain-v3
A 1.13B-parameter causal language model with the GPT-2 architecture, pretrained
from scratch on a curated mix of Python code and programming literature.
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and inference weights from the repo root
tokenizer = AutoTokenizer.from_pretrained("rovdetection/code-1b-pretrain-v3")
model = AutoModelForCausalLM.from_pretrained("rovdetection/code-1b-pretrain-v3")

# Sample up to 100 new tokens that continue a Python prompt
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100, temperature=0.8, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
## Training summary
| Setting | Value |
|---|---|
| Architecture | GPT-2 (1.13B params) |
| Total steps | 30,000 |
| Peak LR | 3e-5 (cosine with warmup) |
| Effective batch size | 32 (gradient accumulation) |
| Precision | fp16 + 8-bit Adam (bitsandbytes) |
| Eval perplexity (held-out Python) | **3.65** |
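The reported perplexity can be related back to an evaluation loss. Assuming it is the usual exponential of the mean per-token cross-entropy in nats (the card does not state how it was computed), a perplexity of 3.65 corresponds to a loss of about 1.29. A minimal sketch, with both function names chosen here for illustration:

```python
import math

def perplexity_from_loss(mean_nll: float) -> float:
    """Perplexity as exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(mean_nll)

def loss_from_perplexity(ppl: float) -> float:
    """Inverse mapping: the mean per-token loss implied by a perplexity."""
    return math.log(ppl)

reported_ppl = 3.65  # value from the training summary table
implied_loss = loss_from_perplexity(reported_ppl)
print(f"implied eval loss: {implied_loss:.3f} nats/token")
```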
## Dataset mix (Phase 4, final 7k steps)
| Dataset | Weight |
|---|---|
| bigcode/starcoderdata (Python) | 35% |
| codeparrot/codeparrot-clean | 25% |
| open-phi/programming_books_llama | 25% |
| greengerong/leetcode | 15% |
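One way to realise a mix like this is to pick the source of each training example at random, using the weights from the table. The sketch below is an illustration only: the card does not say how sampling was implemented, `sample_source` is a hypothetical helper, and the per-source counts assume that the effective batch size of 32 counts sequences.

```python
import random
from collections import Counter

# Phase 4 weights from the dataset mix table
MIX = {
    "bigcode/starcoderdata (Python)": 0.35,
    "codeparrot/codeparrot-clean": 0.25,
    "open-phi/programming_books_llama": 0.25,
    "greengerong/leetcode": 0.15,
}

def sample_source(rng: random.Random) -> str:
    """Pick the dataset the next example is drawn from (hypothetical helper)."""
    names = list(MIX)
    return rng.choices(names, weights=[MIX[n] for n in names], k=1)[0]

# Expected examples per source, assuming 7,000 steps of 32 sequences each
total_examples = 7_000 * 32
expected = {name: round(total_examples * w) for name, w in MIX.items()}

# Empirical check that the sampled proportions approach the target weights
n_draws = 20_000
rng = random.Random(0)
counts = Counter(sample_source(rng) for _ in range(n_draws))
for name, w in MIX.items():
    share = counts[name] / n_draws
    print(f"{name}: target {w:.0%}, sampled {share:.1%}, expected examples {expected[name]}")
```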
## Training phases
| Phase | Steps | LR range | Notes |
|---|---|---|---|
| 1 | 0 – 10,000 | 0 → 3e-5 | Warmup + early descent |
| 2–3 | 10,000 – 23,000 | 3e-5 → 4e-6 | Cosine decay, baseline mix |
| 4 | 23,000 – 30,000 | 4e-6 → ~0 | Quality shift: StarCoder weight raised to 35% |
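The LR column is consistent with a cosine schedule with linear warmup. The sketch below reproduces that shape using the peak LR and total step count from this card. The warmup length is not stated here, so `WARMUP_STEPS = 500` is an assumption chosen for illustration, as is the decay to exactly zero.

```python
import math

PEAK_LR = 3e-5        # from the training summary
TOTAL_STEPS = 30_000  # from the training summary
WARMUP_STEPS = 500    # assumption: the card does not state the warmup length

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))

# Print the learning rate at a few points along the schedule
for step in (0, 500, 10_000, 23_000, 30_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the schedule gives roughly 4e-6 at step 23,000, in line with the value listed for the start of Phase 4.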
## Repo structure
The repo root contains inference weights only (`model.safetensors`, tokenizer,
`config.json`). The `last-checkpoint/` subfolder contains the full training
state (optimizer, scheduler, scaler, RNG) for resuming training.
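
The training state can be fetched on its own by restricting the download to the `last-checkpoint/` subfolder. A minimal sketch using `huggingface_hub.snapshot_download` follows; `download_training_state` and `checkpoint_dir` are hypothetical helper names introduced here, and the exact files inside the subfolder are not listed in this card.

```python
import os

REPO_ID = "rovdetection/code-1b-pretrain-v3"
CHECKPOINT_SUBFOLDER = "last-checkpoint"

def checkpoint_dir(snapshot_root: str) -> str:
    """Path of the training-state folder inside a downloaded snapshot."""
    return os.path.join(snapshot_root, CHECKPOINT_SUBFOLDER)

def download_training_state() -> str:
    """Download only the last-checkpoint/ files and return their local folder."""
    # Imported lazily so the path helper works without huggingface_hub installed
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[f"{CHECKPOINT_SUBFOLDER}/*"],
    )
    return checkpoint_dir(root)
```

Calling `download_training_state()` returns a local path. If the checkpoint was written by the `transformers` `Trainer` (an assumption, not confirmed here), that path could be passed as `trainer.train(resume_from_checkpoint=path)`.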