metadata
language: en
license: apache-2.0
tags:
  - causal-lm
  - code
  - python
  - pretrain
base_model: gpt2

code-1b-pretrain-v3

A 1.13B-parameter causal language model with the GPT-2 architecture, pretrained from scratch on a curated mix of Python code and programming literature.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rovdetection/code-1b-pretrain-v3")
model     = AutoModelForCausalLM.from_pretrained("rovdetection/code-1b-pretrain-v3")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
out    = model.generate(**inputs, max_new_tokens=100, temperature=0.8, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Training summary

| | |
|---|---|
| Architecture | GPT-2 (1.13B params) |
| Total steps | 30,000 |
| Peak LR | 3e-5 (cosine with warmup) |
| Effective batch size | 32 (gradient accumulation) |
| Precision | fp16 + 8-bit Adam (bitsandbytes) |
| Eval perplexity (held-out Python) | 3.65 |
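The eval perplexity relates to the mean per-token cross-entropy loss through PPL = exp(loss). The sketch below shows that conversion, assuming the loss is measured in nats (natural log), as PyTorch's cross-entropy reports it. The function names are illustrative and not part of this repo.

```python
import math


def perplexity_from_loss(mean_nll: float) -> float:
    """Perplexity for a mean per-token negative log-likelihood given in nats."""
    return math.exp(mean_nll)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse of perplexity_from_loss: mean per-token loss in nats."""
    return math.log(ppl)


# The reported eval perplexity of 3.65 corresponds to roughly 1.29 nats per token.
print(round(loss_from_perplexity(3.65), 2))
```

This makes it easier to compare the card's number against training logs that report raw loss rather than perplexity.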

Dataset mix (Phase 4, final 7k steps)

| Dataset | Weight |
|---|---|
| bigcode/starcoderdata (Python) | 35% |
| codeparrot/codeparrot-clean | 25% |
| open-phi/programming_books_llama | 25% |
| greengerong/leetcode | 15% |
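One way to read these weights is as per-example sampling probabilities. The card does not say how the mix was actually implemented (per example, per token, or per batch), so the following is only a sketch under that assumption, using the standard library. `PHASE4_MIX` and `sample_sources` are hypothetical names.

```python
import random
from collections import Counter

# Phase 4 weights copied from the dataset mix table.
PHASE4_MIX = {
    "bigcode/starcoderdata (Python)": 0.35,
    "codeparrot/codeparrot-clean": 0.25,
    "open-phi/programming_books_llama": 0.25,
    "greengerong/leetcode": 0.15,
}


def sample_sources(mix, n, seed=0):
    """Draw n dataset names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Empirical proportions should land close to the configured weights.
counts = Counter(sample_sources(PHASE4_MIX, 10_000))
for name, count in counts.most_common():
    print(f"{name}: {count / 10_000:.1%}")
```

With the Hugging Face `datasets` library, `interleave_datasets(..., probabilities=[...])` offers a similar weighted mix over real datasets. Whether the original run used it is not stated here.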

Training phases

| Phase | Steps | LR range | Notes |
|---|---|---|---|
| 1 | 0 – 10,000 | 0 → 3e-5 | Warmup + early descent |
| 2–3 | 10,000 – 23,000 | 3e-5 → 4e-6 | Cosine decay, baseline mix |
| 4 | 23,000 – 30,000 | 4e-6 → ~0 | Quality shift: StarCoder raised to 35% |
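The LR ranges in the phase table follow from a cosine schedule with warmup. Below is a minimal sketch of such a schedule, using the peak LR and total step count from the training summary. The warmup length is not stated in this card, so `WARMUP_STEPS = 500` is an assumption, chosen because it puts the LR near 4e-6 at step 23,000. Linear warmup and decay to exactly zero are also assumptions.

```python
import math

PEAK_LR = 3e-5        # from the training summary
TOTAL_STEPS = 30_000  # from the training summary
WARMUP_STEPS = 500    # ASSUMPTION: the warmup length is not stated in this card


def lr_at(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


# Print the LR at the phase boundaries.
for step in (0, 10_000, 23_000, 30_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

The ranges in the table are rounded, so this sketch is not expected to reproduce the actual run's schedule exactly.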

Repo structure

The repo root contains inference weights only (model.safetensors, tokenizer, config.json). The last-checkpoint/ subfolder contains the full training state (optimizer, scheduler, scaler, RNG) for resuming training.
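To resume training, only the last-checkpoint/ subfolder is needed. A hedged sketch of fetching just that subfolder with `huggingface_hub.snapshot_download` and its `allow_patterns` filter follows. The helper name and the default `local_dir` are hypothetical, and the file names inside the subfolder are not listed in this card.

```python
import os

REPO_ID = "rovdetection/code-1b-pretrain-v3"  # repo id from the usage example
CHECKPOINT_PATTERN = "last-checkpoint/*"      # subfolder named in this card


def download_last_checkpoint(local_dir="code-1b-pretrain-v3"):
    """Download only the last-checkpoint/ subfolder and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=CHECKPOINT_PATTERN,  # skip the inference weights in the root
        local_dir=local_dir,                # hypothetical destination directory
    )
    return os.path.join(root, "last-checkpoint")
```

How the optimizer, scheduler, scaler, and RNG state are restored depends on the training script, which this card does not describe. If the run used the `transformers` `Trainer` (an assumption), passing the returned path as `resume_from_checkpoint` to `trainer.train()` would be the usual route.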