---
language:
- en
license: mit
tags:
- gpt
- text-generation
- summarization
- from-scratch
- pytorch
library_name: pytorch
---

# Ron-110M

A 110M-parameter GPT-style language model trained from scratch on a single RTX 3090. It was pretrained on WikiText-103, then fine-tuned on CNN/DailyMail for extractive news summarization.

This is a learning / research model. It is small, its tokenizer is a custom byte-level BPE, and it does not use the Hugging Face `transformers` model classes. The repo includes the original PyTorch code so you can run, fine-tune, or continue pretraining from these weights.

## Files

- `pretrain.pt` - base language model checkpoint (after WikiText-103 pretraining)
- `summarizer.pt` - SFT checkpoint for news summarization (start from this for inference)
- `tokenizer.json` - byte-level BPE tokenizer (32k vocab, plus special tokens)
- `meta.json` - dataset metadata (vocab size, dtype, token counts)
- `code/model.py` - GPT model definition
- `code/tokenizer.py` - tokenizer wrapper with ByteLevel decoder fix
- `code/ask.py` - inference script with repetition penalty, top-p, no-repeat-ngram
- `code/train.py` - pretraining script
- `code/finetune_sft.py` - supervised fine-tuning script
- `code/make_cnndm_sft.py` - CNN/DailyMail SFT data builder
- `code/prepare_wikitext.py` - WikiText-103 tokenization + tokenizer training

## Architecture

```
n_layer    = 12
n_head     = 12
n_embd     = 768
block_size = 512
vocab_size = 32000
parameters = 109.92M
```

## Training results

| Stage            | Dataset       | Steps  | Final val loss |
|------------------|---------------|--------|----------------|
| Pretrain         | WikiText-103  | 12,000 | 3.15           |
| SFT (summarizer) | CNN/DailyMail | 6,000  | 2.97           |

## Quick start

```bash
# Clone this repo
git lfs install
git clone https://huggingface.co/endurasolution/RON-110M
cd RON-110M

# Install minimal deps
pip install torch numpy tokenizers rich

# Run inference
python code/ask.py \
  --checkpoint summarizer.pt \
  --tokenizer tokenizer.json \
  --text "A man has been arrested in Manchester after a series of break-ins at local shops. Police said the suspect was found with stolen goods. He is due to appear in court on Monday." \
  --max_new_tokens 80 \
  --temperature 0.4 \
  --top_p 0.9 \
  --repetition_penalty 1.1 \
  --no_repeat_ngram_size 3
```

Expected output (paraphrased): a short news-style summary that preserves the key facts from the input.

## Continue training

To resume pretraining from `pretrain.pt`:

```bash
python code/train.py \
  --resume pretrain.pt \
  --reset_step --reset_optimizer \
  --data_dir data/wikitext103 \
  --out_dir runs/wikitext-gpt \
  --preset rtx3090_8h \
  --batch_size 16 --grad_accum 8 \
  --max_steps 12000 \
  --learning_rate 2e-4 --min_lr 2e-5 \
  --warmup_steps 200 \
  --no_gradient_checkpointing \
  --save_optimizer
```

To fine-tune for a new task, prepare a JSONL file in which each line is a JSON object with `prompt` and `answer` keys, then run:

```bash
python code/finetune_sft.py \
  --base_checkpoint pretrain.pt \
  --tokenizer tokenizer.json \
  --sft_file your_data.jsonl \
  --out_dir runs/my-finetune \
  --max_steps 6000 \
  --batch_size 8 --grad_accum 8 \
  --learning_rate 5e-5 --min_lr 5e-6 \
  --warmup_steps 200
```

## Limitations

- Small (110M parameters): knowledge is limited, and hallucinations are possible on out-of-domain inputs.
- The tokenizer is a custom byte-level BPE and **must** be loaded from the included `tokenizer.json`. Do not substitute a GPT-2 tokenizer.
- Not compatible with `transformers.AutoModel`. Use the included `code/`.
- SFT data was CNN/DailyMail news. The model is most reliable on news-style English; expect weaker output on code, math, or conversational input.

## License

MIT.
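
## Parameter count check

The Architecture section lists `parameters = 109.92M`. The sketch below shows one way that figure can be reproduced from the listed hyperparameters. It is not taken from `code/model.py`. It assumes a GPT-2-style block with a 4x MLP expansion, learned position embeddings, an output head tied to the token embedding, and no bias terms in the linear layers or LayerNorms. These assumptions happen to match the listed total, but `code/model.py` remains the authoritative definition. The helper name `gpt_param_count` is hypothetical.

```python
def gpt_param_count(n_layer, n_embd, block_size, vocab_size):
    """Count parameters of a GPT-2-style decoder.

    Assumptions (not confirmed by the repo): no biases, tied input/output
    embeddings, learned position embeddings, MLP hidden size of 4 * n_embd.
    """
    tok_emb = vocab_size * n_embd       # token embedding, shared with the output head
    pos_emb = block_size * n_embd       # learned position embedding
    attn = n_embd * 3 * n_embd + n_embd * n_embd  # fused QKV + output projection
    mlp = 2 * n_embd * 4 * n_embd       # up-projection and down-projection
    norms = 2 * n_embd                  # two LayerNorm weight vectors per block
    per_block = attn + mlp + norms
    final_norm = n_embd                 # LayerNorm before the output head
    return tok_emb + pos_emb + n_layer * per_block + final_norm


total = gpt_param_count(n_layer=12, n_embd=768, block_size=512, vocab_size=32000)
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
# prints "109,923,072 parameters (109.92M)"
```

`n_head` does not appear in the formula because splitting the attention projection into heads changes the shape of the computation, not the number of weights.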
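
## SFT data format example

The fine-tuning instructions ask for a JSONL file with `prompt` and `answer` keys. The sketch below writes such a file using only the Python standard library. The helper names `write_sft_jsonl` and `read_sft_jsonl` are hypothetical, the sample answer is an illustrative hand-written summary rather than model output, and any prompt template that `code/finetune_sft.py` may apply on top of these fields is not shown here; check that script before building a large dataset.

```python
import json
from pathlib import Path


def write_sft_jsonl(pairs, path):
    """Write (prompt, answer) pairs as one JSON object per line."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for prompt, answer in pairs:
            record = {"prompt": prompt.strip(), "answer": answer.strip()}
            # ensure_ascii=False keeps non-ASCII text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_sft_jsonl(path):
    """Read the file back and check that every record has both keys."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"prompt", "answer"} - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            records.append(record)
    return records


if __name__ == "__main__":
    sample = [
        (
            "A man has been arrested in Manchester after a series of break-ins "
            "at local shops. Police said the suspect was found with stolen goods. "
            "He is due to appear in court on Monday.",
            "A man was arrested in Manchester over shop break-ins and is due in "
            "court on Monday.",
        )
    ]
    out = write_sft_jsonl(sample, "your_data.jsonl")
    print(f"wrote {len(read_sft_jsonl(out))} record(s) to {out}")
```

Reading the file back before training is a cheap way to catch malformed lines or missing keys early, since one bad record can otherwise surface only partway through a long run.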