	---
	language:
	- en
	license: mit
	tags:
	- gpt
	- text-generation
	- summarization
	- from-scratch
	- pytorch
	library_name: pytorch
	---

	# Ron-110M

	A 110M-parameter GPT-style language model trained from scratch on a single
	RTX 3090. It was pretrained on WikiText-103, then fine-tuned on CNN/DailyMail
	for news summarization.

	This is a learning / research model. It is small, the tokenizer is a custom
	byte-level BPE, and it does not use the Hugging Face `transformers` model
	classes. The repo includes the original PyTorch code so you can run, fine-tune,
	or continue pretraining from these weights.

	## Files

	- `pretrain.pt` - base language model checkpoint (after WikiText-103 pretraining)
	- `summarizer.pt` - SFT checkpoint for news summarization (start from this for inference)
	- `tokenizer.json` - byte-level BPE tokenizer (32k vocab, specials: `<pad> <bos> <eos> <unk>`)
	- `meta.json` - dataset metadata (vocab size, dtype, token counts)
	- `code/model.py` - GPT model definition
	- `code/tokenizer.py` - tokenizer wrapper with ByteLevel decoder fix
	- `code/ask.py` - inference script with repetition penalty, top-p, no-repeat-ngram
	- `code/train.py` - pretraining script
	- `code/finetune_sft.py` - supervised fine-tuning script
	- `code/make_cnndm_sft.py` - CNN/DailyMail SFT data builder
	- `code/prepare_wikitext.py` - WikiText-103 tokenization + tokenizer training

	## Architecture

	```
	n_layer = 12
	n_head = 12
	n_embd = 768
	block_size = 512
	vocab_size = 32000
	parameters = 109.92M
	```
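
The reported parameter count can be sanity-checked from the configuration above. The sketch below assumes a GPT-2-style block (fused QKV projection, 4x MLP expansion, two LayerNorms per block), learned position embeddings, an output head tied to the token embedding, and no bias terms. None of these choices is stated in this README, so treat them as assumptions about `code/model.py`; under them the total matches the reported 109.92M. `count_params` is a hypothetical helper, not part of the repo.

```python
def count_params(n_layer: int, n_embd: int, block_size: int, vocab_size: int) -> int:
    """Estimate parameters for a bias-free GPT with tied input/output embeddings.

    Assumed layout (not confirmed by the README): GPT-2-style blocks,
    learned position embeddings, weight-only LayerNorm.
    """
    tok_emb = vocab_size * n_embd                  # token embedding, shared with the LM head
    pos_emb = block_size * n_embd                  # learned position embedding
    attn = 3 * n_embd * n_embd + n_embd * n_embd   # QKV projection + output projection
    mlp = 2 * (n_embd * 4 * n_embd)                # up- and down-projection, 4x expansion
    norms = 2 * n_embd                             # two LayerNorm weight vectors per block
    per_block = attn + mlp + norms
    final_norm = n_embd                            # LayerNorm before the LM head
    return tok_emb + pos_emb + n_layer * per_block + final_norm


total = count_params(n_layer=12, n_embd=768, block_size=512, vocab_size=32000)
print(f"{total:,} parameters ({total / 1e6:.2f}M)")  # 109,923,072 parameters (109.92M)
```

With bias terms included, the same layout would come to about 110.03M, so the bias-free variant is the one consistent with the figure above.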

	## Training results

	| Stage            | Dataset       | Steps  | Final val loss |
	|------------------|---------------|--------|----------------|
	| Pretrain         | WikiText-103  | 12,000 | 3.15           |
	| SFT (summarizer) | CNN/DailyMail | 6,000  | 2.97           |
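
Validation loss is often easier to read as perplexity. A minimal sketch, assuming the reported values are mean per-token cross-entropy in nats (the default for PyTorch's `cross_entropy`, although this README does not say how the loss was computed):

```python
import math


def perplexity(val_loss: float) -> float:
    """Convert mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(val_loss)


for stage, loss in [("Pretrain", 3.15), ("SFT (summarizer)", 2.97)]:
    print(f"{stage}: loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")
```

Under that assumption the two rows work out to roughly 23.3 and 19.5. The stages use different validation sets, so the two numbers are not directly comparable.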

	## Quick start

	```bash
	# Clone this repo
	git lfs install
	git clone https://huggingface.co/endurasolution/RON-110M
	cd RON-110M

	# Install minimal deps
	pip install torch numpy tokenizers rich

	# Run inference
	python code/ask.py \
	--checkpoint summarizer.pt \
	--tokenizer tokenizer.json \
	--text "A man has been arrested in Manchester after a series of break-ins at local shops. Police said the suspect was found with stolen goods. He is due to appear in court on Monday." \
	--max_new_tokens 80 \
	--temperature 0.4 \
	--top_p 0.9 \
	--repetition_penalty 1.1 \
	--no_repeat_ngram_size 3
	```

	Expected output (paraphrased): a short news-style summary that preserves the key
	facts from the input.
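
The sampling flags in the command above control how each next token is chosen. The sketch below illustrates the general idea behind two of them, top-p (nucleus) filtering and no-repeat n-gram blocking, in plain Python. It shows the standard techniques only and is not the code in `code/ask.py`; the function names `top_p_filter` and `banned_next_tokens` are hypothetical.

```python
import math


def top_p_filter(logits: list[float], top_p: float, temperature: float = 1.0) -> dict[int, float]:
    """Return {token_id: prob} for the smallest top-ranked set whose mass reaches top_p."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # subtract the max for numerical stability
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}  # renormalize over the kept tokens


def banned_next_tokens(tokens: list[int], n: int) -> set[int]:
    """Token ids that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n:
        return set()
    prefix = tuple(tokens[len(tokens) - n + 1:])  # last n-1 generated tokens
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


logits = [2.0, 1.0, 0.0, -1.0]
print(sorted(top_p_filter(logits, top_p=0.9)))                    # candidate ids at temperature 1.0
print(sorted(top_p_filter(logits, top_p=0.9, temperature=0.4)))   # lower temperature narrows the set
print(banned_next_tokens([5, 6, 7, 8, 5, 6], n=3))                # 5, 6 was followed by 7 before
```

A sampler would then draw from the filtered distribution after setting the probability of every banned token to zero. Lowering `--temperature` sharpens the distribution, so fewer tokens are needed to reach the `--top_p` mass.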

	## Continue training

	To continue pretraining from the `pretrain.pt` weights (the flags below reset
	the step counter and the optimizer state):

	```bash
	python code/train.py \
	--resume pretrain.pt \
	--reset_step --reset_optimizer \
	--data_dir data/wikitext103 \
	--out_dir runs/wikitext-gpt \
	--preset rtx3090_8h \
	--batch_size 16 --grad_accum 8 \
	--max_steps 12000 \
	--learning_rate 2e-4 --min_lr 2e-5 \
	--warmup_steps 200 \
	--no_gradient_checkpointing \
	--save_optimizer
	```

	To fine-tune for a new task, prepare a JSONL file with `prompt` and `answer`
	keys, then:

	```bash
	python code/finetune_sft.py \
	--base_checkpoint pretrain.pt \
	--tokenizer tokenizer.json \
	--sft_file your_data.jsonl \
	--out_dir runs/my-finetune \
	--max_steps 6000 \
	--batch_size 8 --grad_accum 8 \
	--learning_rate 5e-5 --min_lr 5e-6 \
	--warmup_steps 200
	```

	## Limitations

	- Small (110M parameters): knowledge is limited, and hallucinations are possible
	on out-of-domain inputs.
	- Tokenizer is custom byte-level BPE - must be loaded with the included
	`tokenizer.json`. Do not substitute a GPT-2 tokenizer.
	- Not compatible with `transformers.AutoModel`. Use the included `code/`.
	- SFT data was CNN/DailyMail news. The model is most reliable on news-style
	English; expect weaker output on code, math, or conversational input.

	## License

	MIT.