	---
	language: en
	license: apache-2.0
	tags:
	- causal-lm
	- code
	- python
	- pretrain
	base_model: gpt2
	---

	# code-1b-pretrain-v3

	A 1.13B-parameter causal language model with the GPT-2 architecture,
	pretrained from scratch on a curated mix of Python code and programming
	literature.

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	# Load the tokenizer and inference weights from the repo root.
	tokenizer = AutoTokenizer.from_pretrained("rovdetection/code-1b-pretrain-v3")
	model = AutoModelForCausalLM.from_pretrained("rovdetection/code-1b-pretrain-v3")

	# Encode a Python prompt as PyTorch tensors.
	inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
	# Sample up to 100 new tokens; temperature only takes effect with do_sample=True.
	out = model.generate(**inputs, max_new_tokens=100, temperature=0.8, do_sample=True)
	print(tokenizer.decode(out[0], skip_special_tokens=True))
	```

	## Training summary

	| Setting | Value |
	|---|---|
	| Architecture | GPT-2 (1.13B params) |
	| Total steps | 30,000 |
	| Peak LR | 3e-5 (cosine with warmup) |
	| Effective batch size | 32 (gradient accumulation) |
	| Precision | fp16 + 8-bit Adam (bitsandbytes) |
	| Eval perplexity (held-out Python) | 3.65 |
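
By the usual definition, perplexity is the exponential of the mean per-token cross-entropy loss (in nats). The sketch below shows that conversion in both directions; the helper names are hypothetical, and this card does not describe how the held-out set was tokenized or batched.

```python
import math


def perplexity_from_loss(mean_nll: float) -> float:
    """Convert a mean per-token negative log-likelihood (nats) to perplexity."""
    return math.exp(mean_nll)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse of perplexity_from_loss: the mean loss implied by a perplexity."""
    return math.log(ppl)


reported_ppl = 3.65  # eval perplexity from the training summary
implied_loss = loss_from_perplexity(reported_ppl)
print(f"implied mean loss: {implied_loss:.3f} nats/token")
```

Under that definition, a perplexity of 3.65 corresponds to a mean loss of roughly 1.29 nats per token.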

	## Dataset mix (Phase 4 — final 7k steps)

	| Dataset | Weight |
	|---|---|
	| bigcode/starcoderdata (Python) | 35% |
	| codeparrot/codeparrot-clean | 25% |
	| open-phi/programming_books_llama | 25% |
	| greengerong/leetcode | 15% |
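
One way to realize a mix like this is to choose the source dataset for each training example with probability equal to its weight. The sketch below does that with the standard library only. The sampling scheme is an assumption, since this card does not say how the mix was implemented, and `sample_sources` is a hypothetical helper.

```python
import random
from collections import Counter

# Phase 4 weights copied from the dataset mix table.
PHASE4_MIX = {
    "bigcode/starcoderdata (Python)": 0.35,
    "codeparrot/codeparrot-clean": 0.25,
    "open-phi/programming_books_llama": 0.25,
    "greengerong/leetcode": 0.15,
}


def sample_sources(mix, n, seed=0):
    """Pick the source dataset for each of n examples, proportional to weight."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# The weights should cover 100% of the mix.
assert abs(sum(PHASE4_MIX.values()) - 1.0) < 1e-9

counts = Counter(sample_sources(PHASE4_MIX, n=10_000))
for name, count in counts.most_common():
    print(f"{name}: {count / 10_000:.1%}")
```

With the `datasets` library, `interleave_datasets` with its `probabilities` argument offers a similar weighted mix over real dataset objects.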

	## Training phases

	| Phase | Steps | LR range | Notes |
	|---|---|---|---|
	| 1 | 0 – 10,000 | 0 → 3e-5 | Warmup + early descent |
	| 2–3 | 10,000 – 23,000 | 3e-5 → 4e-6 | Cosine decay, baseline mix |
	| 4 | 23,000 – 30,000 | 4e-6 → ~0 | Quality shift: StarCoder weight raised to 35% |
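
The LR column follows the shape of a cosine schedule with warmup. The sketch below is a minimal version assuming linear warmup followed by cosine decay to zero. The peak LR and total steps come from the training summary, but `WARMUP_STEPS` is an assumed value chosen for illustration, since the warmup length is not stated in this card.

```python
import math

PEAK_LR = 3e-5        # from the training summary
TOTAL_STEPS = 30_000  # from the training summary
WARMUP_STEPS = 500    # ASSUMPTION: the card does not state the warmup length


def lr_at(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup to peak_lr, then cosine decay to zero at `total`."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 10_000, 23_000, 30_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

With the assumed 500-step warmup, the value at step 23,000 lands close to the 4e-6 listed for the start of Phase 4; a different warmup length would shift the intermediate values. `transformers.get_cosine_schedule_with_warmup` produces the same shape for a PyTorch optimizer.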

	## Repo structure

	The repo root contains inference weights only (`model.safetensors`, tokenizer,
	`config.json`). The `last-checkpoint/` subfolder contains the full training
	state (optimizer, scheduler, scaler, RNG) for resuming training.
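
To resume training without pulling the inference weights, it may be enough to download only the `last-checkpoint/` subfolder. The sketch below uses `huggingface_hub.snapshot_download` with `allow_patterns`; the helper names are hypothetical, and the exact file layout inside the subfolder is not documented here.

```python
import os

REPO_ID = "rovdetection/code-1b-pretrain-v3"
CHECKPOINT_SUBFOLDER = "last-checkpoint"  # subfolder named in this card


def checkpoint_patterns(subfolder: str = CHECKPOINT_SUBFOLDER) -> list[str]:
    """Glob patterns selecting only files under the checkpoint subfolder."""
    return [f"{subfolder}/*"]


def download_training_state(local_dir: str) -> str:
    """Fetch only the training-state files; return the local checkpoint path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=checkpoint_patterns(),
        local_dir=local_dir,
    )
    return os.path.join(root, CHECKPOINT_SUBFOLDER)
```

Calling `download_training_state("./ckpt")` performs the download. If the checkpoint was written by the `transformers` `Trainer` (an assumption), the returned path could be passed to `trainer.train(resume_from_checkpoint=...)`.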