---
language:
- en
license: mit
tags:
- gpt
- text-generation
- summarization
- from-scratch
- pytorch
library_name: pytorch
---
# Ron-110M
A 110M-parameter GPT-style language model trained from scratch on a single
RTX 3090. Pretrained on WikiText-103, then fine-tuned on CNN/DailyMail for
extractive news summarization.

This is a learning / research model. It is small, the tokenizer is a custom
byte-level BPE, and it does not use the Hugging Face `transformers` model
classes. The repo includes the original PyTorch code so you can run, fine-tune,
or continue pretraining from these weights.
## Files
- `pretrain.pt` - base language model checkpoint (after WikiText-103 pretraining)
- `summarizer.pt` - SFT checkpoint for news summarization (start from this for inference)
- `tokenizer.json` - byte-level BPE tokenizer (32k vocab, specials: `<pad> <bos> <eos> <unk>`)
- `meta.json` - dataset metadata (vocab size, dtype, token counts)
- `code/model.py` - GPT model definition
- `code/tokenizer.py` - tokenizer wrapper with ByteLevel decoder fix
- `code/ask.py` - inference script with repetition penalty, top-p, no-repeat-ngram
- `code/train.py` - pretraining script
- `code/finetune_sft.py` - supervised fine-tuning script
- `code/make_cnndm_sft.py` - CNN/DailyMail SFT data builder
- `code/prepare_wikitext.py` - WikiText-103 tokenization + tokenizer training
## Architecture
```
n_layer = 12
n_head = 12
n_embd = 768
block_size = 512
vocab_size = 32000
parameters = 109.92M
```
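The reported parameter count can be reproduced from these hyperparameters. The sketch below is a back-of-the-envelope check, not code from `code/model.py`. It assumes a GPT-2-style block (fused QKV projection, 4x MLP), learned position embeddings, an output head tied to the token embedding, and no bias terms in the linear layers or LayerNorms. None of these assumptions is stated in this card; they are simply the combination under which the arithmetic lands on 109.92M.

```python
def count_params(n_layer, n_embd, block_size, vocab_size):
    """Count parameters for an assumed bias-free, weight-tied GPT layout."""
    tok_emb = vocab_size * n_embd            # token embedding, shared with the output head (assumed tied)
    pos_emb = block_size * n_embd            # learned position embedding (assumed)
    attn = n_embd * 3 * n_embd + n_embd * n_embd   # fused QKV projection + output projection
    mlp = 2 * (n_embd * 4 * n_embd)          # up- and down-projection with 4x expansion (assumed)
    norms = 2 * n_embd                       # two LayerNorm weight vectors per block, no bias (assumed)
    per_block = attn + mlp + norms
    final_norm = n_embd                      # final LayerNorm weight
    return tok_emb + pos_emb + n_layer * per_block + final_norm


total = count_params(n_layer=12, n_embd=768, block_size=512, vocab_size=32000)
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
```

If `code/model.py` uses biases or an untied output head, the real count will differ from this estimate.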
## Training results
| Stage | Dataset | Steps | Final val loss |
|--------------------|---------------|--------|----------------|
| Pretrain | WikiText-103 | 12,000 | 3.15 |
| SFT (summarizer) | CNN/DailyMail | 6,000 | 2.97 |
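
Validation loss is often easier to read as perplexity. Assuming the losses in the table are mean per-token cross-entropy in nats (the usual PyTorch convention, though this card does not say so), the conversion is a single exponential:

```python
import math


def perplexity(val_loss):
    """Convert mean per-token cross-entropy (in nats, assumed) to perplexity."""
    return math.exp(val_loss)


for stage, loss in [("Pretrain", 3.15), ("SFT (summarizer)", 2.97)]:
    print(f"{stage}: loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")
```

The two stages are evaluated on different datasets, so the resulting perplexities are not directly comparable with each other.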
## Quick start
```bash
# Clone this repo
git lfs install
git clone https://huggingface.co/endurasolution/RON-110M
cd RON-110M
# Install minimal deps
pip install torch numpy tokenizers rich
# Run inference
python code/ask.py \
--checkpoint summarizer.pt \
--tokenizer tokenizer.json \
--text "A man has been arrested in Manchester after a series of break-ins at local shops. Police said the suspect was found with stolen goods. He is due to appear in court on Monday." \
--max_new_tokens 80 \
--temperature 0.4 \
--top_p 0.9 \
--repetition_penalty 1.1 \
--no_repeat_ngram_size 3
```
Expected output (paraphrased): a short news-style summary that preserves the key
facts from the input.
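
The sampling flags in the command above correspond to standard decoding techniques. The following is a generic, dependency-free sketch of what each one does to a single step's logits. It is not the code in `code/ask.py`: the function names are hypothetical, and the order in which `ask.py` applies the steps is an assumption.

```python
import math
import random


def apply_repetition_penalty(logits, generated, penalty):
    """CTRL-style penalty: make token ids that were already generated less likely."""
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def banned_by_ngram(generated, n):
    """Token ids that would complete an n-gram already present in `generated`."""
    if n <= 0 or len(generated) < n - 1:
        return set()
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


def top_p_filter(logits, temperature, top_p):
    """Return {token_id: prob} for the smallest set whose probability mass reaches top_p."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}  # renormalize over the kept tokens


def sample_next(logits, generated, rng, temperature=0.4, top_p=0.9,
                repetition_penalty=1.1, no_repeat_ngram_size=3):
    """One decoding step: penalty, n-gram ban, then temperature + nucleus sampling."""
    logits = apply_repetition_penalty(logits, generated, repetition_penalty)
    for tok in banned_by_ngram(generated, no_repeat_ngram_size):
        logits[tok] = float("-inf")            # a banned token gets zero probability
    dist = top_p_filter(logits, temperature, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


# Toy 4-token vocabulary; real logits would come from the model's forward pass.
print(sample_next([5.0, 0.0, 0.0, 0.0], generated=[], rng=random.Random(0)))
```

A low temperature such as 0.4 sharpens the distribution before the nucleus cut, so fewer tokens survive `top_p` and the output stays closer to greedy decoding.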
## Continue training
To resume pretraining from `pretrain.pt`:
```bash
python code/train.py \
--resume pretrain.pt \
--reset_step --reset_optimizer \
--data_dir data/wikitext103 \
--out_dir runs/wikitext-gpt \
--preset rtx3090_8h \
--batch_size 16 --grad_accum 8 \
--max_steps 12000 \
--learning_rate 2e-4 --min_lr 2e-5 \
--warmup_steps 200 \
--no_gradient_checkpointing \
--save_optimizer
```
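
When budgeting a run, it helps to translate `--batch_size`, `--grad_accum`, and `--max_steps` into tokens. The sketch below is rough arithmetic under three assumptions this card does not confirm: `--batch_size` counts sequences per micro-batch, every sequence fills the full `block_size` of 512 tokens, and the `rtx3090_8h` preset does not override the flags given on the command line.

```python
def tokens_per_step(batch_size, grad_accum, block_size=512):
    """Tokens consumed per optimizer step, assuming full-length sequences."""
    return batch_size * grad_accum * block_size


per_step = tokens_per_step(batch_size=16, grad_accum=8)
total = per_step * 12_000  # --max_steps from the command above
print(f"{per_step:,} tokens/step, {total:,} tokens over the run")
```

Comparing the total against the token counts in `meta.json` gives an estimate of how many passes over the data the run makes.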
To fine-tune for a new task, prepare a JSONL file with `prompt` and `answer`
keys, then:
```bash
python code/finetune_sft.py \
--base_checkpoint pretrain.pt \
--tokenizer tokenizer.json \
--sft_file your_data.jsonl \
--out_dir runs/my-finetune \
--max_steps 6000 \
--batch_size 8 --grad_accum 8 \
--learning_rate 5e-5 --min_lr 5e-6 \
--warmup_steps 200
```
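
The SFT file is plain JSONL: one JSON object per line with `prompt` and `answer` keys. A minimal sketch for writing and sanity-checking such a file follows. The helper names, the example record, and the prompt wording are all hypothetical; the template that `code/finetune_sft.py` actually expects should be checked against `code/make_cnndm_sft.py`.

```python
import json
import os
import tempfile


def write_sft_jsonl(records, path):
    """Write one {"prompt", "answer"} JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            row = {"prompt": rec["prompt"], "answer": rec["answer"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def validate_sft_jsonl(path):
    """Return the number of valid rows; raise ValueError on a malformed one."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            for key in ("prompt", "answer"):
                if not isinstance(row.get(key), str) or not row[key].strip():
                    raise ValueError(f"line {lineno}: missing or empty {key!r}")
            n += 1
    return n


# Hypothetical example record; the prompt wording is an illustration only.
examples = [{
    "prompt": "Summarize: A man has been arrested in Manchester after a series of break-ins.",
    "answer": "Man arrested in Manchester over shop break-ins.",
}]
# Written to the temp directory here; point this at your own data location.
path = os.path.join(tempfile.gettempdir(), "your_data.jsonl")
write_sft_jsonl(examples, path)
print(validate_sft_jsonl(path), "valid rows")
```

Validating before training catches empty or mistyped fields early, rather than partway through a multi-hour run.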
## Limitations
- Small (110M parameters) - factual knowledge is limited, and the model can
hallucinate, especially on out-of-domain inputs.
- The tokenizer is a custom byte-level BPE - it **must** be loaded from the
included `tokenizer.json`. Do not substitute a GPT-2 tokenizer.
- Not compatible with `transformers.AutoModel`. Use the included `code/`.
- SFT data was CNN/DailyMail news. The model is most reliable on news-style
English; expect weaker output on code, math, or conversational input.
## License
MIT.