---
language:
- en
license: mit
tags:
- gpt
- text-generation
- summarization
- from-scratch
- pytorch
library_name: pytorch
---
# Ron-110M
A 110M-parameter GPT-style language model trained from scratch on a single
RTX 3090. It was pretrained on WikiText-103, then fine-tuned on CNN/DailyMail
for news summarization.
This is a learning / research model. It is small, uses a custom byte-level BPE
tokenizer, and does not use the Hugging Face `transformers` model classes. The
repo includes the original PyTorch code so you can run inference, fine-tune, or
continue pretraining from these weights.
## Files
- `pretrain.pt` - base language model checkpoint (after WikiText-103 pretraining)
- `summarizer.pt` - supervised fine-tuning (SFT) checkpoint for news summarization (use this one for inference)
- `tokenizer.json` - byte-level BPE tokenizer (32k vocab, specials: `<pad> <bos> <eos> <unk>`)
- `meta.json` - dataset metadata (vocab size, dtype, token counts)
- `code/model.py` - GPT model definition
- `code/tokenizer.py` - tokenizer wrapper with ByteLevel decoder fix
- `code/ask.py` - inference script with repetition penalty, top-p, no-repeat-ngram
- `code/train.py` - pretraining script
- `code/finetune_sft.py` - supervised fine-tuning script
- `code/make_cnndm_sft.py` - CNN/DailyMail SFT data builder
- `code/prepare_wikitext.py` - WikiText-103 tokenization + tokenizer training
## Architecture
```
n_layer = 12
n_head = 12
n_embd = 768
block_size = 512
vocab_size = 32000
parameters = 109.92M
```
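The parameter count can be reproduced from the other values in the block. The sketch below is a back-of-the-envelope check, not code from `code/model.py`. It assumes learned position embeddings, an output head tied to the token embedding, a 4x MLP width, no bias terms, and LayerNorm with a weight vector only. Under those assumptions the total matches the reported 109.92M.

```python
# Rough parameter count for the configuration above.
# Assumptions (not confirmed by the model card): learned position embeddings,
# LM head tied to the token embedding, 4x MLP width, no bias terms,
# and LayerNorm with a weight vector only.
n_layer, n_embd, block_size, vocab_size = 12, 768, 512, 32000

tok_emb = vocab_size * n_embd        # token embedding (shared with the LM head)
pos_emb = block_size * n_embd        # learned position embedding
attn = 4 * n_embd * n_embd           # q, k, v and output projections
mlp = 2 * n_embd * (4 * n_embd)      # up- and down-projection
norms = 2 * n_embd                   # two LayerNorm weight vectors per block
per_block = attn + mlp + norms

# Final term is the LayerNorm applied after the last block.
total = tok_emb + pos_emb + n_layer * per_block + n_embd
print(f"{total:,} parameters ({total / 1e6:.2f}M)")  # 109,923,072 parameters (109.92M)
```

Roughly 22% of the weights sit in the token embedding, which is typical for a model this small with a 32k vocabulary.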
## Training results
| Stage | Dataset | Steps | Final val loss |
|--------------------|---------------|--------|----------------|
| Pretrain | WikiText-103 | 12,000 | 3.15 |
| SFT (summarizer) | CNN/DailyMail | 6,000 | 2.97 |
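
For readers who prefer perplexity, the losses above can be converted with a simple exponential. This assumes the reported values are mean cross-entropy per token in nats, which the table does not state explicitly.

```python
import math

# Final validation losses from the table above, assumed to be
# mean cross-entropy per token in nats.
val_loss = {"Pretrain": 3.15, "SFT (summarizer)": 2.97}

perplexity = {stage: math.exp(loss) for stage, loss in val_loss.items()}
for stage, ppl in perplexity.items():
    print(f"{stage}: token-level perplexity ~ {ppl:.1f}")
# Pretrain: token-level perplexity ~ 23.3
# SFT (summarizer): token-level perplexity ~ 19.5
```

The two numbers come from different validation sets, so they are not directly comparable. They are also per BPE token, not per word, so they should not be compared with word-level WikiText-103 perplexities.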
## Quick start
```bash
# Clone this repo
git lfs install
git clone https://huggingface.co/endurasolution/RON-110M
cd RON-110M

# Install minimal deps
pip install torch numpy tokenizers rich

# Run inference
python code/ask.py \
  --checkpoint summarizer.pt \
  --tokenizer tokenizer.json \
  --text "A man has been arrested in Manchester after a series of break-ins at local shops. Police said the suspect was found with stolen goods. He is due to appear in court on Monday." \
  --max_new_tokens 80 \
  --temperature 0.4 \
  --top_p 0.9 \
  --repetition_penalty 1.1 \
  --no_repeat_ngram_size 3
```
Expected output (paraphrased): a short news-style summary that preserves the key
facts from the input.
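
To give a feel for what the sampling flags in the command above control, here is a minimal pure-Python sketch of one decoding step. It is an illustration under assumptions, not the contents of `code/ask.py`: the function name `sample_next` is hypothetical, the repetition penalty follows the common divide-positive / multiply-negative convention, and logits are a plain list of floats rather than a tensor.

```python
import math
import random

def sample_next(logits, history, temperature=0.4, top_p=0.9,
                repetition_penalty=1.1, no_repeat_ngram_size=3, rng=random):
    """Pick the next token id from raw logits (hypothetical helper)."""
    logits = list(logits)

    # Repetition penalty: make tokens that were already generated less likely.
    for tok in set(history):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # No-repeat n-gram: ban any token that would complete an n-gram seen before.
    n = no_repeat_ngram_size
    if n > 0 and len(history) >= n - 1:
        prefix = tuple(history[len(history) - (n - 1):])
        for i in range(len(history) - n + 1):
            if tuple(history[i:i + n - 1]) == prefix:
                logits[history[i + n - 1]] = -math.inf

    # Temperature-scaled softmax (temperature must be > 0).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus): keep the smallest set of tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# Toy 4-token vocabulary. The history ends in (0, 1), and (0, 1, 2) already
# occurred, so token 2 is banned by the 3-gram rule.
print(sample_next([2.0, 1.0, 0.5, -1.0], history=[0, 1, 2, 0, 1]))
```

A low temperature such as 0.4 sharpens the distribution before the top-p cut, so in practice only a handful of tokens survive each step, which suits summarization better than open-ended generation.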
## Continue training
To resume pretraining from `pretrain.pt`:
```bash
python code/train.py \
  --resume pretrain.pt \
  --reset_step --reset_optimizer \
  --data_dir data/wikitext103 \
  --out_dir runs/wikitext-gpt \
  --preset rtx3090_8h \
  --batch_size 16 --grad_accum 8 \
  --max_steps 12000 \
  --learning_rate 2e-4 --min_lr 2e-5 \
  --warmup_steps 200 \
  --no_gradient_checkpointing \
  --save_optimizer
```
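As a sanity check on the flags in the pretraining command, the sketch below works out the token budget and a plausible learning-rate curve. Both parts rest on assumptions: that every sequence fills the 512-token `block_size`, that `--grad_accum` multiplies the micro-batch, and that the schedule is linear warmup followed by cosine decay to `--min_lr`. The schedule in `code/train.py` may differ, and `lr_at` is a hypothetical helper.

```python
import math

# Values taken from the command above and the Architecture section.
batch_size, grad_accum, block_size, max_steps = 16, 8, 512, 12000

tokens_per_step = batch_size * grad_accum * block_size
total_tokens = tokens_per_step * max_steps
print(f"{tokens_per_step:,} tokens per optimizer step")  # 65,536
print(f"{total_tokens:,} tokens over the run")           # 786,432,000

def lr_at(step, max_lr=2e-4, min_lr=2e-5, warmup_steps=200, max_steps=12000):
    """Assumed schedule: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 200, 6100, 12000):
    print(step, f"{lr_at(step):.2e}")
```

If memory is tight, halving `--batch_size` and doubling `--grad_accum` keeps the tokens per step unchanged under the same assumptions.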
To fine-tune for a new task, prepare a JSONL file with `prompt` and `answer`
keys, then:
```bash
python code/finetune_sft.py \
  --base_checkpoint pretrain.pt \
  --tokenizer tokenizer.json \
  --sft_file your_data.jsonl \
  --out_dir runs/my-finetune \
  --max_steps 6000 \
  --batch_size 8 --grad_accum 8 \
  --learning_rate 5e-5 --min_lr 5e-6 \
  --warmup_steps 200
```
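The JSONL file mentioned above holds one JSON object per line with `prompt` and `answer` keys. A minimal sketch for writing one follows. The example records and the `write_sft_jsonl` helper are hypothetical, and the exact prompt wording the model expects is not documented here, so it is worth checking `code/make_cnndm_sft.py` for the format used during training.

```python
import json

def write_sft_jsonl(examples, path):
    """Write one JSON object per line, checking the two required keys."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            if not {"prompt", "answer"} <= ex.keys():
                raise ValueError(f"missing 'prompt' or 'answer': {ex!r}")
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Hypothetical records; replace with your own task data.
examples = [
    {"prompt": "The city council approved a new bus route on Tuesday after "
               "months of public consultation.",
     "answer": "The council approved a new bus route."},
    {"prompt": "Heavy rain closed two main roads overnight, and drivers were "
               "told to expect delays.",
     "answer": "Rain closed two roads, causing delays."},
]
write_sft_jsonl(examples, "your_data.jsonl")
```

Keeping `ensure_ascii=False` stores non-ASCII characters as-is, which avoids `\uXXXX` escapes in the file and makes it easier to inspect by eye.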
## Limitations
- Small (110M parameters) - its knowledge is limited, and it may hallucinate on
out-of-domain inputs.
- Tokenizer is custom byte-level BPE - **must** be loaded with the included
`tokenizer.json`. Do not substitute a GPT-2 tokenizer.
- Not compatible with `transformers.AutoModel`. Use the included `code/`.
- SFT data was CNN/DailyMail news. The model is most reliable on news-style
English; expect weaker output on code, math, or conversational input.
## License
MIT.