---
language:
- en
license: mit
tags:
- gpt
- text-generation
- summarization
- from-scratch
- pytorch
library_name: pytorch
---

# Ron-110M

A 110M-parameter GPT-style language model trained from scratch on a single
RTX 3090. It was pretrained on WikiText-103, then fine-tuned on CNN/DailyMail
for news summarization. Summaries are generated token by token rather than
extracted from the input.

This is a learning / research model. It is small, the tokenizer is a custom
byte-level BPE, and it does not use the Hugging Face `transformers` model
classes. The repo includes the original PyTorch code so you can run, fine-tune,
or continue pretraining from these weights.

## Files

- `pretrain.pt` - base language model checkpoint (after WikiText-103 pretraining)
- `summarizer.pt` - supervised fine-tuning (SFT) checkpoint for news summarization (use this one for inference)
- `tokenizer.json` - byte-level BPE tokenizer (32k vocab, specials: `<pad> <bos> <eos> <unk>`)
- `meta.json` - dataset metadata (vocab size, dtype, token counts)
- `code/model.py` - GPT model definition
- `code/tokenizer.py` - tokenizer wrapper with ByteLevel decoder fix
- `code/ask.py` - inference script with repetition penalty, top-p, no-repeat-ngram
- `code/train.py` - pretraining script
- `code/finetune_sft.py` - supervised fine-tuning script
- `code/make_cnndm_sft.py` - CNN/DailyMail SFT data builder
- `code/prepare_wikitext.py` - WikiText-103 tokenization + tokenizer training

## Architecture

```
n_layer       = 12
n_head        = 12
n_embd        = 768
block_size    = 512
vocab_size    = 32000
parameters    = 109.92M
```
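
The reported parameter count can be reproduced from the configuration above under one particular set of assumptions, none of which is stated in this card: a GPT-2-style block with a 4x MLP, no bias terms in the linear layers or layer norms, learned position embeddings, and an output head tied to the token embedding. The sketch below is a sanity check on the arithmetic, not a description of `code/model.py`; the function name is hypothetical. `n_head` does not appear because it does not change the parameter count.

```python
def gpt_param_count(n_layer, n_embd, block_size, vocab_size):
    """Count parameters for an ASSUMED bias-free GPT-2-style model with tied embeddings."""
    d = n_embd
    token_emb = vocab_size * d      # also reused as the output head (assumed tied)
    pos_emb = block_size * d        # learned absolute positions (assumed)
    attn = 3 * d * d + d * d        # fused q/k/v projection + output projection
    mlp = 4 * d * d + 4 * d * d     # up-projection to 4*d + down-projection back to d
    norms = 2 * d                   # two layer norms per block, weight only (assumed)
    per_block = attn + mlp + norms
    final_norm = d                  # layer norm before the output head
    return token_emb + pos_emb + n_layer * per_block + final_norm


total = gpt_param_count(n_layer=12, n_embd=768, block_size=512, vocab_size=32000)
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
# prints: 109,923,072 parameters (109.92M)
```

That these assumptions land on 109.92M suggests, but does not prove, that the model follows this layout.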

## Training results

| Stage              | Dataset       | Steps  | Final val loss |
|--------------------|---------------|--------|----------------|
| Pretrain           | WikiText-103  | 12,000 | 3.15           |
| SFT (summarizer)   | CNN/DailyMail | 6,000  | 2.97           |

## Quick start

```bash
# Clone this repo
git lfs install
git clone https://huggingface.co/endurasolution/RON-110M
cd RON-110M

# Install minimal deps
pip install torch numpy tokenizers rich

# Run inference
python code/ask.py \
  --checkpoint summarizer.pt \
  --tokenizer tokenizer.json \
  --text "A man has been arrested in Manchester after a series of break-ins at local shops. Police said the suspect was found with stolen goods. He is due to appear in court on Monday." \
  --max_new_tokens 80 \
  --temperature 0.4 \
  --top_p 0.9 \
  --repetition_penalty 1.1 \
  --no_repeat_ngram_size 3
```

Expected output (paraphrased): a short news-style summary that preserves the key
facts from the input.
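
For readers unfamiliar with the decoding flags used above, here is a minimal pure-Python sketch of what temperature, top-p, repetition penalty, and no-repeat-ngram filtering conventionally do to one step's logits. It is not the code from `code/ask.py`: the function names, the order in which the filters are applied, and the penalty formula (divide positive logits, multiply negative ones) are assumptions based on common practice.

```python
import math
import random


def apply_repetition_penalty(logits, generated, penalty):
    """Lower the logit of every token that was already generated."""
    out = list(logits)
    for tok in set(generated):
        # Dividing a positive logit, or multiplying a negative one, reduces it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def banned_by_ngram(generated, n):
    """Tokens that would complete an n-gram already present in `generated` (a list)."""
    if n < 1 or len(generated) < n:
        return set()
    prefix = generated[len(generated) - (n - 1):]  # the last n-1 tokens
    return {
        generated[i + n - 1]
        for i in range(len(generated) - n + 1)
        if generated[i:i + n - 1] == prefix
    }


def top_p_distribution(logits, temperature, top_p):
    """Softmax with temperature, then keep the smallest token set with mass >= top_p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # exp(-inf) is 0.0 for banned tokens
    total = sum(weights)
    probs = [w / total for w in weights]
    kept, mass = [], 0.0
    for tok in sorted(range(len(probs)), key=probs.__getitem__, reverse=True):
        kept.append(tok)
        mass += probs[tok]
        if mass >= top_p:
            break
    return {tok: probs[tok] / mass for tok in kept}  # renormalized over kept tokens


def next_token_distribution(logits, generated, temperature=0.4, top_p=0.9,
                            repetition_penalty=1.1, no_repeat_ngram_size=3):
    """Apply the three filters in one ASSUMED order and return {token_id: probability}."""
    logits = apply_repetition_penalty(logits, generated, repetition_penalty)
    for tok in banned_by_ngram(generated, no_repeat_ngram_size):
        logits[tok] = float("-inf")
    return top_p_distribution(logits, temperature, top_p)


# Toy example with a 4-token vocabulary; token 2 is banned because [0, 1, 2] already occurred.
toy_logits = [2.0, 1.0, 0.5, -1.0]
history = [0, 1, 2, 0, 1]
dist = next_token_distribution(toy_logits, history)
next_token = random.choices(list(dist), weights=list(dist.values()))[0]
```

Lower temperatures and smaller `top_p` values concentrate probability on the most likely tokens, which is usually what a summarizer wants; the two repetition controls guard against the loops that small models tend to fall into.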

## Continue training

To continue pretraining from the `pretrain.pt` weights, with the step counter
and optimizer state reset:

```bash
python code/train.py \
  --resume pretrain.pt \
  --reset_step --reset_optimizer \
  --data_dir data/wikitext103 \
  --out_dir runs/wikitext-gpt \
  --preset rtx3090_8h \
  --batch_size 16 --grad_accum 8 \
  --max_steps 12000 \
  --learning_rate 2e-4 --min_lr 2e-5 \
  --warmup_steps 200 \
  --no_gradient_checkpointing \
  --save_optimizer
```
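
For a sense of scale, the flags above imply the following token budget. This assumes every sequence is packed to the full `block_size` of 512 tokens, training runs on a single GPU, and the explicit flags take precedence over whatever `--preset rtx3090_8h` sets; none of these assumptions is confirmed by this card, and the helper name is hypothetical.

```python
def token_budget(batch_size, grad_accum, block_size, max_steps):
    """Return (tokens per optimizer step, total tokens) for a single-GPU run."""
    tokens_per_step = batch_size * grad_accum * block_size
    return tokens_per_step, tokens_per_step * max_steps


per_step, total = token_budget(batch_size=16, grad_accum=8, block_size=512, max_steps=12000)
print(f"{per_step:,} tokens per step, {total:,} tokens in total")
# prints: 65,536 tokens per step, 786,432,000 tokens in total
```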

To fine-tune for a new task, prepare a JSONL file with `prompt` and `answer`
keys, then:

```bash
python code/finetune_sft.py \
  --base_checkpoint pretrain.pt \
  --tokenizer tokenizer.json \
  --sft_file your_data.jsonl \
  --out_dir runs/my-finetune \
  --max_steps 6000 \
  --batch_size 8 --grad_accum 8 \
  --learning_rate 5e-5 --min_lr 5e-6 \
  --warmup_steps 200
```
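
The JSONL file mentioned above holds one JSON object per line with `prompt` and `answer` keys. A minimal sketch for building and checking such a file follows. The helper names and the example prompt wording are hypothetical, and the card does not say which prompt template `code/finetune_sft.py` expects, so `code/make_cnndm_sft.py` is the better reference for the exact format.

```python
import json

REQUIRED_KEYS = ("prompt", "answer")


def to_jsonl(pairs):
    """Serialize (prompt, answer) pairs as one JSON object per line."""
    lines = []
    for prompt, answer in pairs:
        record = {"prompt": prompt, "answer": answer}
        # json.dumps escapes embedded newlines, so each record stays on one line.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def check_jsonl(text):
    """Parse JSONL text and verify every record has non-empty string fields."""
    records = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate blank lines, including the trailing one
        record = json.loads(line)
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"line {lineno}: missing or empty {key!r}")
        records.append(record)
    return records


# Hypothetical example pair; replace with your own task data.
examples = [
    ("Summarize: Police arrested a man after a series of shop break-ins.",
     "A man was arrested over shop break-ins."),
]
jsonl_text = to_jsonl(examples)
assert check_jsonl(jsonl_text) == [{"prompt": examples[0][0], "answer": examples[0][1]}]
```

Writing `jsonl_text` to `your_data.jsonl` with UTF-8 encoding yields a file in the shape described above.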

## Limitations

- Small (110M parameters) - factual knowledge is limited, and hallucinations
  are possible, especially on out-of-domain inputs.
- Tokenizer is custom byte-level BPE - **must** be loaded with the included
  `tokenizer.json`. Do not substitute a GPT-2 tokenizer.
- Not compatible with `transformers.AutoModel`. Use the included `code/`.
- SFT data was CNN/DailyMail news. The model is most reliable on news-style
  English; expect weaker output on code, math, or conversational input.

## License

MIT.