Ron-110M
A 110M-parameter GPT-style language model trained from scratch on a single RTX 3090. Pretrained on WikiText-103, then fine-tuned on CNN/DailyMail for extractive news summarization.
This is a learning / research model. It is small, the tokenizer is a custom
byte-level BPE, and it does not use the Hugging Face transformers model
classes. The repo includes the original PyTorch code so you can run, fine-tune,
or continue pretraining from these weights.
Files
- pretrain.pt - base language model checkpoint (after WikiText-103 pretraining)
- summarizer.pt - SFT checkpoint for news summarization (start from this for inference)
- tokenizer.json - byte-level BPE tokenizer (32k vocab, specials: <pad> <bos> <eos> <unk>)
- meta.json - dataset metadata (vocab size, dtype, token counts)
- code/model.py - GPT model definition
- code/tokenizer.py - tokenizer wrapper with ByteLevel decoder fix
- code/ask.py - inference script with repetition penalty, top-p, no-repeat-ngram
- code/train.py - pretraining script
- code/finetune_sft.py - supervised fine-tuning script
- code/make_cnndm_sft.py - CNN/DailyMail SFT data builder
- code/prepare_wikitext.py - WikiText-103 tokenization + tokenizer training
Architecture
n_layer = 12
n_head = 12
n_embd = 768
block_size = 512
vocab_size = 32000
parameters = 109.92M
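The parameter count can be sanity-checked from the hyperparameters above. The sketch below is not taken from code/model.py. It assumes a GPT-2-style block with a 4x MLP, no bias terms, weight-only LayerNorm, learned positional embeddings, and an output head tied to the token embedding. Under those assumptions the total comes out at 109.92M, matching the figure above; code/model.py remains the authoritative definition. Note that n_head does not affect the count.

```python
def gpt_param_count(n_layer, n_embd, block_size, vocab_size):
    """Count parameters for an assumed bias-free GPT with tied embeddings."""
    tok_emb = vocab_size * n_embd            # also used as the output head (tied)
    pos_emb = block_size * n_embd            # learned positional embeddings
    attn = 3 * n_embd * n_embd + n_embd * n_embd   # QKV projection + output projection
    mlp = 2 * (n_embd * 4 * n_embd)          # up-projection and down-projection
    norms = 2 * n_embd                       # two weight-only LayerNorms per block
    final_norm = n_embd
    return tok_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm


total = gpt_param_count(n_layer=12, n_embd=768, block_size=512, vocab_size=32000)
print(f"{total:,} parameters ({total / 1e6:.2f}M)")  # 109,923,072 parameters (109.92M)
```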
Training results
| Stage | Dataset | Steps | Final val loss |
|---|---|---|---|
| Pretrain | WikiText-103 | 12,000 | 3.15 |
| SFT (summarizer) | CNN/DailyMail | 6,000 | 2.97 |
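
For readers who prefer perplexity, the losses in the table can be converted as below. This assumes the reported values are mean cross-entropy in nats per token, which is the usual convention for PyTorch training loops but is not stated here. The resulting figures are per BPE token, so they are not directly comparable to word-level WikiText-103 perplexities.

```python
import math

# Final validation losses copied from the table above.
val_loss = {
    "Pretrain (WikiText-103)": 3.15,
    "SFT (CNN/DailyMail)": 2.97,
}


def perplexity(loss_nats):
    """Token-level perplexity, assuming the loss is cross-entropy in nats."""
    return math.exp(loss_nats)


for stage, loss in val_loss.items():
    print(f"{stage}: loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")
```

This gives roughly 23.3 for the pretraining stage and 19.5 for the SFT stage.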
Quick start
# Clone this repo
git lfs install
git clone https://huggingface.co/endurasolution/RON-110M
cd RON-110M
# Install minimal deps
pip install torch numpy tokenizers rich
# Run inference
python code/ask.py \
--checkpoint summarizer.pt \
--tokenizer tokenizer.json \
--text "A man has been arrested in Manchester after a series of break-ins at local shops. Police said the suspect was found with stolen goods. He is due to appear in court on Monday." \
--max_new_tokens 80 \
--temperature 0.4 \
--top_p 0.9 \
--repetition_penalty 1.1 \
--no_repeat_ngram_size 3
Expected output (paraphrased): a short news-style summary that preserves the key facts from the input.
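To show what the sampling flags in the command above control, here is a minimal pure-Python sketch of one decoding step. It is not the code in code/ask.py: the function names, the order of the steps, and the CTRL-style form of the repetition penalty are all assumptions made for illustration, and it works on a plain list of logits rather than a tensor.

```python
import math
import random


def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Make already-generated tokens less likely (assumed CTRL-style rule)."""
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def banned_by_no_repeat_ngram(generated_ids, n=3):
    """Tokens that would complete an n-gram already present in the history."""
    if len(generated_ids) < n:
        return set()
    prefix = tuple(generated_ids[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated_ids) - n + 1):
        if tuple(generated_ids[i:i + n - 1]) == prefix:
            banned.add(generated_ids[i + n - 1])
    return banned


def top_p_probs(logits, temperature=0.4, top_p=0.9):
    """Temperature softmax, then keep the smallest token set with mass >= top_p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}  # renormalized over kept tokens


def sample_next(logits, generated_ids, rng=random):
    """One decoding step using the flag values from the quick-start command."""
    logits = apply_repetition_penalty(logits, generated_ids, penalty=1.1)
    for tok in banned_by_no_repeat_ngram(generated_ids, n=3):
        logits[tok] = float("-inf")
    dist = top_p_probs(logits, temperature=0.4, top_p=0.9)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy 8-token vocabulary; the history already contains the trigram (5, 6, 7).
toy_logits = [0.5, 1.2, -0.3, 2.0, 0.1, 0.9, 1.5, 3.0]
history = [5, 6, 7, 5, 6]
next_id = sample_next(toy_logits, history, rng=random.Random(0))
print(next_id)  # never 7, because that would repeat the trigram (5, 6, 7)
```

A low temperature such as 0.4 sharpens the distribution before the top-p cut, so only a few candidates usually survive. That suits summarization, where staying close to the source matters more than variety.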
Continue training
To continue pretraining from the weights in pretrain.pt, with the step counter and optimizer state reset:
python code/train.py \
--resume pretrain.pt \
--reset_step --reset_optimizer \
--data_dir data/wikitext103 \
--out_dir runs/wikitext-gpt \
--preset rtx3090_8h \
--batch_size 16 --grad_accum 8 \
--max_steps 12000 \
--learning_rate 2e-4 --min_lr 2e-5 \
--warmup_steps 200 \
--no_gradient_checkpointing \
--save_optimizer
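As a rough guide to what these batch settings mean, the sketch below works out the token budget per optimizer step. It assumes every sequence is packed to the full block_size of 512 and that --grad_accum is the number of micro-batches accumulated per optimizer step. Both are common conventions, but neither is confirmed by code/train.py here.

```python
def tokens_per_step(batch_size, grad_accum, block_size):
    """Tokens consumed per optimizer step, assuming fully packed sequences."""
    return batch_size * grad_accum * block_size


per_step = tokens_per_step(batch_size=16, grad_accum=8, block_size=512)
total_tokens = per_step * 12_000  # --max_steps 12000

print(f"{per_step:,} tokens per step")      # 65,536 tokens per step
print(f"{total_tokens:,} tokens in total")  # 786,432,000 tokens in total
```

If memory is tight, halving --batch_size while doubling --grad_accum keeps this per-step budget unchanged under the same assumptions.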
To fine-tune for a new task, prepare a JSONL file with prompt and answer
keys, then:
python code/finetune_sft.py \
--base_checkpoint pretrain.pt \
--tokenizer tokenizer.json \
--sft_file your_data.jsonl \
--out_dir runs/my-finetune \
--max_steps 6000 \
--batch_size 8 --grad_accum 8 \
--learning_rate 5e-5 --min_lr 5e-6 \
--warmup_steps 200
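The JSONL input for the command above can be produced with a few lines of Python. Only the prompt and answer keys come from this README. The "Summarize:" prompt wording and the sample answer below are hypothetical, and the exact format produced by code/make_cnndm_sft.py may differ, so check that script before building a large dataset.

```python
import json


def write_sft_jsonl(records, path):
    """Write one JSON object per line, each with `prompt` and `answer` keys."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            row = {"prompt": rec["prompt"], "answer": rec["answer"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def check_sft_jsonl(path):
    """Return the number of rows, raising ValueError on a malformed row."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            if not isinstance(row, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            for key in ("prompt", "answer"):
                if not isinstance(row.get(key), str) or not row[key].strip():
                    raise ValueError(f"line {lineno}: missing or empty '{key}'")
            count += 1
    return count


# Hypothetical example row; the prompt template is an assumption.
records = [
    {
        "prompt": "Summarize: A man has been arrested in Manchester after a "
                  "series of break-ins at local shops.",
        "answer": "A man was arrested in Manchester over a series of shop break-ins.",
    },
]

write_sft_jsonl(records, "your_data.jsonl")
print(check_sft_jsonl("your_data.jsonl"), "valid rows")
```

Running the check before training catches empty or mistyped fields early, which is cheaper than discovering them partway through a fine-tuning run.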
Limitations
- Small (110M parameters) - knowledge is limited, hallucinations possible on out-of-domain inputs.
- Tokenizer is custom byte-level BPE - must be loaded with the included tokenizer.json. Do not substitute a GPT-2 tokenizer.
- Not compatible with transformers.AutoModel. Use the included code/.
- SFT data was CNN/DailyMail news. The model is most reliable on news-style English; expect weaker output on code, math, or conversational input.
License
MIT.