Ron-110M

A 110M-parameter GPT-style language model trained from scratch on a single RTX 3090. Pretrained on WikiText-103, then fine-tuned on CNN/DailyMail for news summarization. Summaries are generated token by token, not extracted from the input.

This is a learning / research model. It is small, the tokenizer is a custom byte-level BPE, and it does not use the Hugging Face transformers model classes. The repo includes the original PyTorch code so you can run, fine-tune, or continue pretraining from these weights.

Files

pretrain.pt - base language model checkpoint (after WikiText-103 pretraining)
summarizer.pt - SFT checkpoint for news summarization (start from this for inference)
tokenizer.json - byte-level BPE tokenizer (32k vocab, specials: <pad> <bos> <eos> <unk>)
meta.json - dataset metadata (vocab size, dtype, token counts)
code/model.py - GPT model definition
code/tokenizer.py - tokenizer wrapper with ByteLevel decoder fix
code/ask.py - inference script with repetition penalty, top-p, no-repeat-ngram
code/train.py - pretraining script
code/finetune_sft.py - supervised fine-tuning script
code/make_cnndm_sft.py - CNN/DailyMail SFT data builder
code/prepare_wikitext.py - WikiText-103 tokenization + tokenizer training

Architecture

n_layer       = 12
n_head        = 12
n_embd        = 768
block_size    = 512
vocab_size    = 32000
parameters    = 109.92M
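
The parameter count can be roughly reproduced from the hyperparameters above. The sketch below is not taken from code/model.py. It assumes tied input/output embeddings, no bias terms in the linear layers, LayerNorm with a weight but no bias, and a 4x MLP expansion. None of these is stated in this README, so treat code/model.py as the authority.

```python
# Back-of-the-envelope parameter count for the listed configuration.
# ASSUMPTIONS (not confirmed by this README): tied token/output embeddings,
# bias-free Linear layers, weight-only LayerNorm, 4x MLP expansion.
n_layer, n_embd, block_size, vocab_size = 12, 768, 512, 32000

token_emb = vocab_size * n_embd               # shared with the output head (assumed)
pos_emb = block_size * n_embd                 # learned position embeddings
attn = 3 * n_embd * n_embd + n_embd * n_embd  # QKV projection + output projection
mlp = 2 * (n_embd * 4 * n_embd)               # up-projection + down-projection
layer_norms = 2 * n_embd                      # two weight-only norms per block
per_block = attn + mlp + layer_norms
final_norm = n_embd

total = token_emb + pos_emb + n_layer * per_block + final_norm
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
```

Under these assumptions the total comes to 109,923,072, which rounds to the 109.92M listed above. The match suggests the assumptions are plausible but does not confirm them.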

Training results

Stage	Dataset	Steps	Final val loss
Pretrain	WikiText-103	12,000	3.15
SFT (summarizer)	CNN/DailyMail	6,000	2.97
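
For readers who prefer perplexity, the losses above can be converted with a one-liner. This sketch assumes the reported values are mean per-token cross-entropy in nats (the usual PyTorch convention). The README does not say so explicitly.

```python
import math

# Final validation losses from the table above.
# ASSUMPTION: mean cross-entropy per token, in nats.
val_loss = {"Pretrain (WikiText-103)": 3.15, "SFT (CNN/DailyMail)": 2.97}

# Perplexity is the exponential of the mean per-token loss.
perplexity = {stage: math.exp(loss) for stage, loss in val_loss.items()}
for stage, ppl in perplexity.items():
    print(f"{stage}: perplexity ~ {ppl:.1f}")
```

These are per-token figures for this model's own 32k BPE vocabulary, so they are not directly comparable to word-level perplexities or to models with a different tokenizer. The two rows are also measured on different datasets.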

Quick start

# Clone this repo
git lfs install
git clone https://huggingface.co/endurasolution/RON-110M
cd RON-110M

# Install minimal deps
pip install torch numpy tokenizers rich

# Run inference
python code/ask.py \
  --checkpoint summarizer.pt \
  --tokenizer tokenizer.json \
  --text "A man has been arrested in Manchester after a series of break-ins at local shops. Police said the suspect was found with stolen goods. He is due to appear in court on Monday." \
  --max_new_tokens 80 \
  --temperature 0.4 \
  --top_p 0.9 \
  --repetition_penalty 1.1 \
  --no_repeat_ngram_size 3

Expected output (paraphrased): a short news-style summary that preserves the key facts from the input.
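
To make the sampling flags above concrete, here is a minimal pure-Python sketch of how a repetition penalty, no-repeat-ngram blocking, temperature, and top-p sampling are commonly combined. It is not the code in code/ask.py. The function names and the order of operations are illustrative assumptions.

```python
import math
import random

def apply_repetition_penalty(logits, generated, penalty=1.1):
    """Down-weight tokens that were already generated (a common formulation)."""
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def banned_by_ngram(generated, n=3):
    """Tokens that would repeat an n-gram already present in `generated`."""
    if len(generated) < n:
        return set()
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

def top_p_sample(logits, temperature=0.4, top_p=0.9, rng=random):
    """Temperature-scaled softmax, then sample from the smallest set of
    tokens whose cumulative probability reaches top_p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def next_token(logits, generated, rng=random):
    """One decoding step using the flag values from the command above."""
    logits = apply_repetition_penalty(logits, generated, penalty=1.1)
    for tok in banned_by_ngram(generated, n=3):
        logits[tok] = float("-inf")  # blocked tokens get zero probability
    return top_p_sample(logits, temperature=0.4, top_p=0.9, rng=rng)
```

A low temperature such as 0.4 sharpens the distribution, so top-p keeps only a few candidates per step. That tends to suit summarization, where staying close to the source matters more than variety.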

Continue training

To continue pretraining from the pretrain.pt weights (the flags below reset the step counter and the optimizer state):

python code/train.py \
  --resume pretrain.pt \
  --reset_step --reset_optimizer \
  --data_dir data/wikitext103 \
  --out_dir runs/wikitext-gpt \
  --preset rtx3090_8h \
  --batch_size 16 --grad_accum 8 \
  --max_steps 12000 \
  --learning_rate 2e-4 --min_lr 2e-5 \
  --warmup_steps 200 \
  --no_gradient_checkpointing \
  --save_optimizer
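
The flags above imply a token budget and a learning-rate range that are easy to work out. The sketch below assumes every sequence is a full 512 tokens (block_size) and that the schedule is linear warmup followed by cosine decay to min_lr. The schedule shape is an assumption and is not confirmed by this README. The --preset flag may also change values not shown here.

```python
import math

# Values copied from the command above; block_size from the Architecture section.
batch_size, grad_accum, block_size = 16, 8, 512
max_steps, warmup_steps = 12000, 200
learning_rate, min_lr = 2e-4, 2e-5

# Tokens consumed per optimizer step, assuming full-length sequences.
tokens_per_step = batch_size * grad_accum * block_size
total_tokens = tokens_per_step * max_steps

def lr_at(step):
    """ASSUMED schedule: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return learning_rate * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return min_lr + 0.5 * (learning_rate - min_lr) * (1.0 + math.cos(math.pi * progress))

print(f"{tokens_per_step:,} tokens per step, {total_tokens:,} tokens in total")
for step in (0, 199, 200, 6100, 12000):
    print(step, f"{lr_at(step):.2e}")
```

With these numbers each optimizer step covers 65,536 tokens, so 12,000 steps cover about 786M tokens. Check code/train.py for the schedule it actually uses.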

To fine-tune for a new task, prepare a JSONL file with prompt and answer keys, then:

python code/finetune_sft.py \
  --base_checkpoint pretrain.pt \
  --tokenizer tokenizer.json \
  --sft_file your_data.jsonl \
  --out_dir runs/my-finetune \
  --max_steps 6000 \
  --batch_size 8 --grad_accum 8 \
  --learning_rate 5e-5 --min_lr 5e-6 \
  --warmup_steps 200
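
A JSONL file with prompt and answer keys can be produced with the standard library alone. The sketch below writes one JSON object per line. The example record is made up for illustration, and any prompt template the SFT script expects (such as an instruction prefix) is not specified here, so check code/make_cnndm_sft.py for the format used for the released summarizer.

```python
import json

# Illustrative records only; replace with your own data.
# The "prompt" and "answer" keys are the ones named above.
records = [
    {
        "prompt": "A man has been arrested in Manchester after a series of "
                  "break-ins at local shops. Police said the suspect was found "
                  "with stolen goods.",
        "answer": "Man arrested in Manchester over shop break-ins.",
    },
]

def write_sft_jsonl(path, rows):
    """Write one JSON object per line, skipping rows with missing or empty fields."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            if not row.get("prompt") or not row.get("answer"):
                continue
            line = {"prompt": row["prompt"], "answer": row["answer"]}
            f.write(json.dumps(line, ensure_ascii=False) + "\n")
            written += 1
    return written

count = write_sft_jsonl("your_data.jsonl", records)
print(f"wrote {count} examples")
```

The output path matches the --sft_file value in the command above.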

Limitations

Small (110M parameters) - knowledge is limited, hallucinations possible on out-of-domain inputs.
Tokenizer is custom byte-level BPE - must be loaded with the included tokenizer.json. Do not substitute a GPT-2 tokenizer.
Not compatible with transformers.AutoModel. Load and run the model with the scripts in code/ instead.
SFT data was CNN/DailyMail news. The model is most reliable on news-style English; expect weaker output on code, math, or conversational input.

License

MIT.
