---
language:
- en
license: mit
tags:
- gpt
- text-generation
- summarization
- from-scratch
- pytorch
library_name: pytorch
---

# Ron-110M

A 110M-parameter GPT-style language model trained from scratch on a single
RTX 3090. Pretrained on WikiText-103, then fine-tuned on CNN/DailyMail for
abstractive news summarization.

This is a learning / research model. It is small, it uses a custom byte-level
BPE tokenizer, and it does not use the Hugging Face `transformers` model
classes. The repo includes the original PyTorch code so you can run, fine-tune,
or continue pretraining from these weights.

## Files

- `pretrain.pt` - base language model checkpoint (after WikiText-103 pretraining)
- `summarizer.pt` - SFT checkpoint for news summarization (start from this for inference)
- `tokenizer.json` - byte-level BPE tokenizer (32k vocab, specials: `<pad> <bos> <eos> <unk>`)
- `meta.json` - dataset metadata (vocab size, dtype, token counts)
- `code/model.py` - GPT model definition
- `code/tokenizer.py` - tokenizer wrapper with ByteLevel decoder fix
- `code/ask.py` - inference script with repetition penalty, top-p, no-repeat-ngram
- `code/train.py` - pretraining script
- `code/finetune_sft.py` - supervised fine-tuning script
- `code/make_cnndm_sft.py` - CNN/DailyMail SFT data builder
- `code/prepare_wikitext.py` - WikiText-103 tokenization + tokenizer training

## Architecture

```
n_layer = 12
n_head = 12
n_embd = 768
block_size = 512
vocab_size = 32000
parameters = 109.92M
```

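The reported parameter count can be reproduced from the hyperparameters above. The sketch below is a back-of-the-envelope check, not code from `code/model.py`. It assumes a GPT-2-style block with a 4x MLP width, learned position embeddings, an output head that shares weights with the token embedding, and no bias terms in the linear or LayerNorm layers. None of these assumptions is stated in this card; they are chosen because they are common in small from-scratch GPT implementations, and the helper name `count_params` is hypothetical.

```python
# Hyperparameters copied from the architecture listing above.
n_layer, n_embd, block_size, vocab_size = 12, 768, 512, 32000


def count_params(n_layer, n_embd, block_size, vocab_size):
    """Count parameters of a GPT-2-style decoder under the stated assumptions."""
    token_emb = vocab_size * n_embd  # assumed to be shared with the output head
    pos_emb = block_size * n_embd    # assumed learned absolute position embeddings
    attn = 4 * n_embd * n_embd       # Q, K, V and output projections, no biases
    mlp = 8 * n_embd * n_embd        # two linear layers with 4x hidden width, no biases
    norms = 2 * n_embd               # two LayerNorm scale vectors per block, no biases
    final_norm = n_embd              # LayerNorm scale before the output head
    return token_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm


total = count_params(n_layer, n_embd, block_size, vocab_size)
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
```

Under these assumptions the total is 109,923,072, which rounds to the 109.92M listed above. Roughly 22% of that budget (24.6M) sits in the 32k-entry token embedding.
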
## Training results

| Stage            | Dataset       | Steps  | Final val loss |
|------------------|---------------|--------|----------------|
| Pretrain         | WikiText-103  | 12,000 | 3.15           |
| SFT (summarizer) | CNN/DailyMail | 6,000  | 2.97           |

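Validation losses are easier to interpret as perplexities. The sketch below converts the two figures from the table, assuming each loss is a mean per-token cross-entropy measured in nats (the default for PyTorch's cross-entropy loss). The card does not state the unit, so treat that as an assumption.

```python
import math

# Final validation losses copied from the table above.
val_loss = {
    "Pretrain (WikiText-103)": 3.15,
    "SFT (CNN/DailyMail)": 2.97,
}

# Perplexity is the exponential of the mean cross-entropy (assumed to be in nats).
perplexity = {stage: math.exp(loss) for stage, loss in val_loss.items()}

for stage, ppl in perplexity.items():
    print(f"{stage}: loss {val_loss[stage]:.2f} -> perplexity {ppl:.1f}")
```

This gives roughly 23.3 for pretraining and 19.5 for SFT. Both are per-token values under this model's 32k BPE vocabulary, so they are not comparable to word-level WikiText-103 perplexities, and the two stages are measured on different datasets.
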
## Quick start

```bash
# Clone this repo
git lfs install
git clone https://huggingface.co/endurasolution/RON-110M
cd RON-110M

# Install minimal deps
pip install torch numpy tokenizers rich

# Run inference
python code/ask.py \
  --checkpoint summarizer.pt \
  --tokenizer tokenizer.json \
  --text "A man has been arrested in Manchester after a series of break-ins at local shops. Police said the suspect was found with stolen goods. He is due to appear in court on Monday." \
  --max_new_tokens 80 \
  --temperature 0.4 \
  --top_p 0.9 \
  --repetition_penalty 1.1 \
  --no_repeat_ngram_size 3
```

Expected output (paraphrased): a short news-style summary that preserves the key
facts from the input.

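The `--temperature` and `--top_p` flags in the command above correspond to standard decoding techniques. The sketch below illustrates temperature scaling followed by nucleus (top-p) filtering in plain Python. It is not the code from `code/ask.py`: the function names `nucleus` and `sample_token` are hypothetical, and the script's actual order of operations may differ.

```python
import math
import random


def nucleus(logits, temperature=0.4, top_p=0.9):
    """Return (kept_ids, probs): the smallest set of most-probable token ids
    whose cumulative probability reaches top_p, and the full distribution."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least probable until top_p mass is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept, probs


def sample_token(logits, temperature=0.4, top_p=0.9, rng=random):
    """Sample one token id from the nucleus, weighted by probability."""
    kept, probs = nucleus(logits, temperature, top_p)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
print(nucleus(toy_logits, temperature=1.0, top_p=0.9)[0])
print(nucleus(toy_logits, temperature=0.4, top_p=0.9)[0])
```

On the toy logits, a temperature of 1.0 keeps the three most likely tokens, while 0.4 sharpens the distribution enough that only the top token survives the 0.9 cutoff. This is why a low temperature makes output more deterministic.
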
## Continue training

To continue pretraining from the `pretrain.pt` weights, with the step counter
and optimizer state reset:

```bash
python code/train.py \
  --resume pretrain.pt \
  --reset_step --reset_optimizer \
  --data_dir data/wikitext103 \
  --out_dir runs/wikitext-gpt \
  --preset rtx3090_8h \
  --batch_size 16 --grad_accum 8 \
  --max_steps 12000 \
  --learning_rate 2e-4 --min_lr 2e-5 \
  --warmup_steps 200 \
  --no_gradient_checkpointing \
  --save_optimizer
```

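The flags in the pretraining command above imply a token budget and a learning-rate range. The sketch below works both out under two assumptions that this card does not confirm: every sequence is packed to the full `block_size` of 512 tokens, and the schedule is a linear warmup followed by cosine decay to `--min_lr`. The `lr_at` helper is hypothetical, the real schedule lives in `code/train.py`, and `--preset rtx3090_8h` may set further values not shown here.

```python
import math

# Values copied from the command above; block_size comes from the architecture listing.
batch_size, grad_accum, block_size = 16, 8, 512
max_steps, warmup_steps = 12_000, 200
learning_rate, min_lr = 2e-4, 2e-5

# Tokens consumed per optimizer step, assuming fully packed sequences.
tokens_per_step = batch_size * grad_accum * block_size
total_tokens = tokens_per_step * max_steps


def lr_at(step):
    """Linear warmup to learning_rate, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return learning_rate * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return min_lr + 0.5 * (learning_rate - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"{tokens_per_step:,} tokens per step, {total_tokens:,} tokens in total")
for step in (0, 200, 6_100, 12_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

With these settings each optimizer step sees 65,536 tokens, so 12,000 steps cover about 786M tokens.
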
To fine-tune for a new task, prepare a JSONL file (one JSON object per line)
with `prompt` and `answer` keys, then run:

```bash
python code/finetune_sft.py \
  --base_checkpoint pretrain.pt \
  --tokenizer tokenizer.json \
  --sft_file your_data.jsonl \
  --out_dir runs/my-finetune \
  --max_steps 6000 \
  --batch_size 8 --grad_accum 8 \
  --learning_rate 5e-5 --min_lr 5e-6 \
  --warmup_steps 200
```

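A file in the `prompt` / `answer` JSONL format described above can be produced with the standard library alone. The sketch below is a minimal example: the two records are made-up placeholders, the `Summarize:` prefix is an assumption (the prompt template used by `code/make_cnndm_sft.py` is not documented in this card), and the helper names `write_jsonl` and `read_jsonl` are hypothetical.

```python
import json
from pathlib import Path

# Placeholder records; replace these with your own task data.
examples = [
    {
        "prompt": "Summarize: Police arrested a man in Manchester after a series of shop break-ins.",
        "answer": "A man was arrested in Manchester over shop break-ins.",
    },
    {
        "prompt": "Summarize: The council approved funding for a new cycle lane on Monday.",
        "answer": "The council approved funding for a cycle lane.",
    },
]


def write_jsonl(records, path):
    """Write one JSON object per line, as UTF-8."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


write_jsonl(examples, "your_data.jsonl")
loaded = read_jsonl("your_data.jsonl")
print(f"wrote {len(loaded)} records with keys {sorted(loaded[0])}")
```

Reading the file back before training is a cheap way to catch malformed lines or missing keys.
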
## Limitations

- Small (110M parameters): knowledge is limited, and hallucinations are
  possible on out-of-domain inputs.
- The tokenizer is a custom byte-level BPE and **must** be loaded from the
  included `tokenizer.json`. Do not substitute a GPT-2 tokenizer.
- Not compatible with `transformers.AutoModel`. Use the included `code/`.
- SFT data was CNN/DailyMail news. The model is most reliable on news-style
  English; expect weaker output on code, math, or conversational input.

## License

MIT.