---
language:
- en
license: mit
tags:
- gpt
- text-generation
- summarization
- from-scratch
- pytorch
library_name: pytorch
---

# Ron-110M

A 110M-parameter GPT-style language model trained from scratch on a single
RTX 3090. Pretrained on WikiText-103, then fine-tuned on CNN/DailyMail for
news summarization.

This is a learning / research model. It is small, the tokenizer is a custom
byte-level BPE, and it does not use the Hugging Face `transformers` model
classes. The repo includes the original PyTorch code so you can run, fine-tune,
or continue pretraining from these weights.

## Files

- `pretrain.pt` - base language model checkpoint (after WikiText-103 pretraining)
- `summarizer.pt` - SFT checkpoint for news summarization (start from this for inference)
- `tokenizer.json` - byte-level BPE tokenizer (32k vocab, specials: `<pad> <bos> <eos> <unk>`)
- `meta.json` - dataset metadata (vocab size, dtype, token counts)
- `code/model.py` - GPT model definition
- `code/tokenizer.py` - tokenizer wrapper with ByteLevel decoder fix
- `code/ask.py` - inference script with repetition penalty, top-p, no-repeat-ngram
- `code/train.py` - pretraining script
- `code/finetune_sft.py` - supervised fine-tuning script
- `code/make_cnndm_sft.py` - CNN/DailyMail SFT data builder
- `code/prepare_wikitext.py` - WikiText-103 tokenization + tokenizer training

## Architecture

```
n_layer       = 12
n_head        = 12
n_embd        = 768
block_size    = 512
vocab_size    = 32000
parameters    = 109.92M
```

## Training results

| Stage            | Dataset       | Steps  | Final val loss |
|------------------|---------------|--------|----------------|
| Pretrain         | WikiText-103  | 12,000 | 3.15           |
| SFT (summarizer) | CNN/DailyMail | 6,000  | 2.97           |
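
Validation loss is often easier to read as perplexity. The snippet below is a small sketch that assumes the losses in the table are mean cross-entropy values in nats per token, which this README does not state explicitly. Under that assumption the two rows correspond to perplexities of about 23.3 and 19.5. The stages are evaluated on different datasets, so the two values are not directly comparable.

```python
import math


def perplexity(loss_nats):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(loss_nats)


# Values copied from the table above.
for stage, loss in [("Pretrain", 3.15), ("SFT (summarizer)", 2.97)]:
    print(f"{stage}: val loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")
```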

## Quick start

```bash
# Clone this repo
git lfs install
git clone https://huggingface.co/endurasolution/RON-110M
cd RON-110M

# Install minimal deps
pip install torch numpy tokenizers rich

# Run inference
python code/ask.py \
  --checkpoint summarizer.pt \
  --tokenizer tokenizer.json \
  --text "A man has been arrested in Manchester after a series of break-ins at local shops. Police said the suspect was found with stolen goods. He is due to appear in court on Monday." \
  --max_new_tokens 80 \
  --temperature 0.4 \
  --top_p 0.9 \
  --repetition_penalty 1.1 \
  --no_repeat_ngram_size 3
```

Expected output (paraphrased): a short news-style summary that preserves the key
facts from the input.
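
For readers unfamiliar with the `--temperature` and `--top_p` flags, the following is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) sampling. It illustrates the general technique only and is not the code in `code/ask.py`; the function name `top_p_sample` and the toy logits are made up for this example.

```python
import math
import random


def top_p_sample(logits, temperature=0.4, top_p=0.9, rng=random):
    """Sample one token id using temperature scaling and top-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept tokens; random.choices normalises the weights.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical 5-token vocabulary
print(top_p_sample(toy_logits, rng=random.Random(0)))
```

With a temperature as low as 0.4 the distribution is sharply peaked, so sampling behaves close to greedy decoding. Raising the temperature or `top_p` lets more candidates through.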

## Continue training

To resume pretraining from `pretrain.pt`:

```bash
python code/train.py \
  --resume pretrain.pt \
  --reset_step --reset_optimizer \
  --data_dir data/wikitext103 \
  --out_dir runs/wikitext-gpt \
  --preset rtx3090_8h \
  --batch_size 16 --grad_accum 8 \
  --max_steps 12000 \
  --learning_rate 2e-4 --min_lr 2e-5 \
  --warmup_steps 200 \
  --no_gradient_checkpointing \
  --save_optimizer
```
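
A rough token budget for the command above can be worked out from its flags. This sketch assumes every sample is a full `block_size` = 512 tokens and that the `--preset` value does not override the batch settings. Neither assumption is confirmed by this README.

```python
def tokens_per_step(batch_size, grad_accum, block_size):
    """Tokens consumed per optimizer step, assuming full-length sequences."""
    return batch_size * grad_accum * block_size


per_step = tokens_per_step(batch_size=16, grad_accum=8, block_size=512)
total = per_step * 12000  # --max_steps 12000
print(f"{per_step:,} tokens per step, {total:,} tokens over the run")
```

Under these assumptions each optimizer step sees 65,536 tokens, or about 786M tokens over 12,000 steps.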

To fine-tune for a new task, prepare a JSONL file with `prompt` and `answer`
keys, then:

```bash
python code/finetune_sft.py \
  --base_checkpoint pretrain.pt \
  --tokenizer tokenizer.json \
  --sft_file your_data.jsonl \
  --out_dir runs/my-finetune \
  --max_steps 6000 \
  --batch_size 8 --grad_accum 8 \
  --learning_rate 5e-5 --min_lr 5e-6 \
  --warmup_steps 200
```
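
The JSONL file passed to `--sft_file` has one JSON object per line with `prompt` and `answer` keys. The helper below is a hypothetical sketch for producing such a file. The function name `write_sft_jsonl` and the example record are illustrative, and the exact prompt wording used for the summarizer (built by `code/make_cnndm_sft.py`) is not shown in this README.

```python
import json


def write_sft_jsonl(records, path):
    """Write records as JSONL, one {"prompt": ..., "answer": ...} per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for i, rec in enumerate(records):
            missing = {"prompt", "answer"} - set(rec)
            if missing:
                raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
            row = {"prompt": rec["prompt"], "answer": rec["answer"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Illustrative record only; replace with your own task data.
    examples = [
        {
            "prompt": "Police arrested a man in Manchester after several shop break-ins.",
            "answer": "Man arrested in Manchester over shop break-ins.",
        }
    ]
    print(write_sft_jsonl(examples, "your_data.jsonl"), "records written")
```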

## Limitations

- Small (110M parameters) - knowledge is limited, and hallucinations are
  possible on out-of-domain inputs.
- Tokenizer is custom byte-level BPE - **must** be loaded with the included
  `tokenizer.json`. Do not substitute a GPT-2 tokenizer.
- Not compatible with `transformers.AutoModel`. Use the scripts in the included `code/` directory.
- SFT data was CNN/DailyMail news. The model is most reliable on news-style
  English; expect weaker output on code, math, or conversational input.

## License

MIT.