	---
	language:
	- en
	license: gpl-3.0
	library_name: transformers
	tags:
	- text-generation
	- tinygpt2
	- causal-lm
	- instruction-tuned
	- sft
	- rope
	- grouped-query-attention
	- rms-norm
	datasets:
	- tatsu-lab/alpaca
	- Skylion007/openwebtext
	pipeline_tag: text-generation
	model-index:
	- name: TinyGPT2-IT
	  results: []
	---

	<div align="center">

	# TinyGPT2-IT

	### A 95M parameter instruction-tuned language model trained from scratch on a single consumer GPU

	[![GitHub](https://img.shields.io/badge/GitHub-NotShrirang%2Ftinygpt-blue?logo=github)](https://github.com/NotShrirang/tinygpt)
	[![Demo](https://img.shields.io/badge/Demo-Streamlit-FF4B4B?logo=streamlit)](https://tinygpt.streamlit.app/)
	[![License](https://img.shields.io/badge/License-GPL--3.0-green)](https://www.gnu.org/licenses/gpl-3.0.en.html)

	</div>

	---

	## Overview

	TinyGPT2-IT is an instruction-tuned variant of [TinyGPT2](https://github.com/NotShrirang/tinygpt), a modern GPT-style architecture built from scratch in PyTorch. The base model was pretrained on ~6.7B tokens from OpenWebText, then trained with supervised fine-tuning (SFT) on the 52K instruction-response pairs of Stanford Alpaca.

	The entire pipeline — pretraining, fine-tuning, and inference — runs on a single NVIDIA RTX 3070 Ti (8 GB VRAM).

	> This model uses a custom architecture and requires `trust_remote_code=True`.

	---

	## Architecture

	| Component | Detail |
	|---|---|
	| Parameters | ~95M |
	| Layers | 12 transformer blocks |
	| Attention | Grouped Query Attention (12 query heads, 4 KV groups) |
	| Embedding dim | 768 |
	| FFN hidden dim | 2048 |
	| Position encoding | Rotary Position Embeddings (RoPE) |
	| Normalization | RMSNorm |
	| Context window | 512 tokens |
	| Vocabulary | 50,304 (GPT-2 tiktoken + PAD token) |
	| Weight tying | Token embedding ↔ LM head |
	| KV Cache | Supported for efficient generation |
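
The ~95M figure can be roughly reproduced from the architecture table. The sketch below is a back-of-the-envelope count, not the author's code. It assumes bias-free linear layers, a two-matrix feed-forward block, a head dimension of 768 / 12 = 64, two RMSNorm weight vectors per block plus a final one, and a tied embedding counted once. None of these details are stated in this card.

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumptions (not stated in the card): no biases, a two-matrix FFN,
# two RMSNorm layers per block plus one final norm.
VOCAB, DIM, LAYERS = 50_304, 768, 12
Q_HEADS, KV_GROUPS, FFN_DIM = 12, 4, 2048

head_dim = DIM // Q_HEADS          # 64
kv_dim = KV_GROUPS * head_dim      # 256: K and V are shared across query heads

embedding = VOCAB * DIM            # tied with the LM head, so counted once
attention = DIM * DIM + 2 * DIM * kv_dim + DIM * DIM  # Q, K, V, output projections
ffn = 2 * DIM * FFN_DIM            # up and down projections
norms = 2 * DIM                    # one RMSNorm weight vector per sub-layer

per_block = attention + ffn + norms
total = embedding + LAYERS * per_block + DIM  # + final RMSNorm

# Each KV group serves Q_HEADS // KV_GROUPS query heads.
heads_per_group = Q_HEADS // KV_GROUPS

print(f"per block: {per_block:,}")
print(f"total:     {total:,} (~{total / 1e6:.1f}M)")
print(f"query heads per KV group: {heads_per_group}")
```

Under these assumptions the count comes to roughly 95.3M, of which about 38.6M is the embedding matrix. With full multi-head attention instead of 4 KV groups, the K and V projections would add about 0.8M parameters per block.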

	---

	## Training

	### Stage 1 — Pretraining

	| Setting | Value |
	|---|---|
	| Dataset | OpenWebText (~6.7B tokens) |
	| Optimizer | AdamW (fused) |
	| Effective batch | 262K tokens/step |
	| Precision | bfloat16 + `torch.compile` |
	| Hardware | NVIDIA RTX 3070 Ti (8 GB) |
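
For a sense of scale, the figures in the table imply a step count. The sketch below assumes that 262K means 2^18 = 262,144 tokens, that each sequence fills the 512-token context window, and that ~6.7B is the total number of tokens seen during pretraining. The card does not confirm any of these details.

```python
# Rough pretraining arithmetic from the table above.
# Assumptions: "262K" = 2**18 tokens, sequences are 512 tokens long,
# and ~6.7B is the total number of tokens seen in pretraining.
TOKENS_PER_STEP = 2 ** 18   # 262,144
CONTEXT = 512
TOTAL_TOKENS = 6.7e9

sequences_per_step = TOKENS_PER_STEP // CONTEXT
optimizer_steps = TOTAL_TOKENS / TOKENS_PER_STEP

print(f"sequences per step: {sequences_per_step}")
print(f"optimizer steps:    ~{optimizer_steps:,.0f}")
```

On an 8 GB card, an effective batch of this size would presumably be reached through gradient accumulation over smaller micro-batches, although the micro-batch size is not given here.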

	### Stage 2 — Supervised Fine-Tuning (SFT)

	| Setting | Value |
	|---|---|
	| Dataset | Stanford Alpaca (52K instructions) |
	| Epochs | 3 |
	| Loss masking | Response-only (instruction tokens are masked) |
	| Final train loss | 1.91 |
	| Final val loss | 1.98 |
	| Final val perplexity | 7.26 |
	| Tokens processed | ~72M |
	| Prompt format | `### Instruction: ... ### Response: ...` |
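
Response-only loss masking and the perplexity figure can be illustrated together. The sketch below is a plain-Python illustration, not the training code: the helper `masked_mean_loss` and the per-token loss values are hypothetical. It averages the loss over response tokens only, then converts a mean cross-entropy loss to perplexity with `exp`.

```python
import math

def masked_mean_loss(token_losses, is_response):
    """Average per-token loss over response tokens only.

    token_losses: per-token cross-entropy values (in nats).
    is_response:  True where the token belongs to the response; instruction
                  tokens are False and contribute nothing to the loss.
    """
    kept = [loss for loss, keep in zip(token_losses, is_response) if keep]
    if not kept:
        raise ValueError("no response tokens to average over")
    return sum(kept) / len(kept)

# Hypothetical example: 3 instruction tokens followed by 2 response tokens.
losses = [5.0, 4.0, 6.0, 2.0, 1.8]
mask = [False, False, False, True, True]
loss = masked_mean_loss(losses, mask)  # only 2.0 and 1.8 are averaged

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(loss)
from_table = math.exp(1.98)  # rounded val loss from the table

print(f"masked loss: {loss:.2f}, perplexity: {perplexity:.2f}")
print(f"exp(1.98) = {from_table:.2f}")
```

`exp(1.98)` is about 7.24. The reported 7.26 is consistent with the validation loss having been rounded to two decimals before it was written into the table.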

	---

	## Usage

	### Quick Start

	```python
	from transformers import AutoModelForCausalLM
	import tiktoken
	import torch

	# Load model
	model = AutoModelForCausalLM.from_pretrained(
	    "NotShrirang/tinygpt2-it",
	    trust_remote_code=True,  # required: the architecture is custom code
	)
	model.eval()

	# Tokenize with the GPT-2 BPE vocabulary
	enc = tiktoken.get_encoding("gpt2")
	prompt = "### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
	input_ids = torch.tensor([enc.encode(prompt)])

	# Generate
	with torch.no_grad():
	    output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7, top_k=40)

	print(enc.decode(output[0].tolist()))
	```
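
If `generate` returns the prompt followed by the continuation, as is typical for decoder-only models in `transformers`, the decoded string contains the template as well as the answer. The helper below is a hypothetical post-processing sketch (`extract_response` is not part of the model's API). It keeps the text after the first `### Response:` marker and cuts it off if the model starts a new `### Instruction:` block.

```python
RESPONSE_MARKER = "### Response:"
INSTRUCTION_MARKER = "### Instruction:"

def extract_response(decoded: str) -> str:
    """Return only the model's answer from a decoded prompt + continuation."""
    head, marker, tail = decoded.partition(RESPONSE_MARKER)
    # If the marker is missing, fall back to the whole string.
    answer = tail if marker else head
    # A small model may run on into a new instruction block; drop that part.
    answer = answer.split(INSTRUCTION_MARKER, 1)[0]
    return answer.strip()

decoded = (
    "### Instruction:\nWhat is the capital of France?\n\n"
    "### Response:\nThe capital of France is Paris.\n\n"
    "### Instruction:\nName a river."
)
print(extract_response(decoded))
```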

	### Prompt Format

	This model expects instructions in the following template:

	```
	### Instruction:
	{your instruction here}

	### Response:
	```

	For instructions with additional context:

	```
	### Instruction:
	{your instruction here}

	### Input:
	{additional context}

	### Response:
	```
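
Both templates can be produced by one small helper. The function below is a sketch: `build_prompt` is a hypothetical name, and the exact whitespace (a blank line between sections and a newline after `### Response:`) is taken from the Quick Start prompt and assumed to carry over to the `### Input:` variant.

```python
from typing import Optional

def build_prompt(instruction: str, context: Optional[str] = None) -> str:
    """Format an instruction (and optional context) in the template above."""
    parts = [f"### Instruction:\n{instruction.strip()}"]
    if context:
        parts.append(f"### Input:\n{context.strip()}")
    # End right after the response header so the model fills in the answer.
    parts.append("### Response:\n")
    return "\n\n".join(parts)

print(build_prompt("What is the capital of France?"))
print(build_prompt("Summarize the text.", "Paris is the capital of France."))
```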

	---

	## Example Outputs

	Factual Q&A
	```
	>>> What is the capital of France?
	The capital of France is Paris.
	```

	Explanation
	```
	>>> Explain what machine learning is in simple terms.
	Machine learning is a branch of computer science that focuses on using algorithms to
	identify patterns in data. These algorithms are used to analyze large amounts of data
	and make predictions about future trends.
	```

	Creative
	```
	>>> Write a motivational quote.
	"The only way to make a difference is to be bold and courageous."
	```

	---

	## Limitations

	- Small model — 95M parameters is far below production LLMs; expect factual errors, repetition, and limited reasoning.
	- Short context — 512 token window limits the length of conversations and documents.
	- Training data — pretrained on web text and fine-tuned on synthetic Alpaca data, which may contain biases or inaccuracies.
	- Not safety-aligned — no RLHF/DPO applied to this checkpoint; the model may produce harmful or inappropriate content.
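
Because of the 512-token window, the prompt and the generated tokens share one budget. The sketch below shows one way to enforce that before calling `generate`. `fit_to_context` is a hypothetical helper, and dropping the oldest tokens is only one possible policy: it can cut off the start of the template, so shortening the instruction itself is often preferable.

```python
CONTEXT_WINDOW = 512  # from the architecture table

def fit_to_context(token_ids, max_new_tokens, context_window=CONTEXT_WINDOW):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return token_ids[-budget:]

prompt_ids = list(range(600))  # stand-in for enc.encode(prompt)
trimmed = fit_to_context(prompt_ids, max_new_tokens=128)
print(len(trimmed))  # 512 - 128 = 384 tokens kept
```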

	---

	## Model Family

	| Model | Params | Description | Link |
	|---|---|---|---|
	| TinyGPT | 51M | Standard GPT, TinyStories | [GitHub](https://github.com/NotShrirang/tinygpt) |
	| TinyGPT-MoE | 85M | Mixture of Experts, TinyStories | [GitHub](https://github.com/NotShrirang/tinygpt) |
	| Wikipedia-MoE | 135M | 8-expert MoE, Wikipedia/C4 | [GitHub](https://github.com/NotShrirang/tinygpt) |
	| TinyGPT2 | 95M | RoPE + GQA + RMSNorm, OpenWebText | [GitHub](https://github.com/NotShrirang/tinygpt) |
	| TinyGPT2.1 | 183M | Scaled TinyGPT2, FineWeb-Edu | [GitHub](https://github.com/NotShrirang/tinygpt) |
	| TinyGPT2-IT | 95M | Instruction-tuned (this model) | You are here |
	| TinyGPT2-DPO | 95M | DPO-aligned with Anthropic HH-RLHF | [GitHub](https://github.com/NotShrirang/tinygpt) |

	---

	## Citation

	```bibtex
	@misc{tinygpt2-it,
	  author = {Shrirang Mahajan},
	  title = {TinyGPT2-IT: Instruction-Tuned 95M Parameter Language Model},
	  year = {2025},
	  publisher = {Hugging Face},
	  url = {https://huggingface.co/NotShrirang/tinygpt2-it}
	}
	```

	---

	## License

	This model is released under the [GPL-3.0 License](https://www.gnu.org/licenses/gpl-3.0.en.html).