---
language:
- en
license: gpl-3.0
library_name: transformers
tags:
- text-generation
- tinygpt2
- causal-lm
- instruction-tuned
- sft
- rope
- grouped-query-attention
- rms-norm
datasets:
- tatsu-lab/alpaca
- Skylion007/openwebtext
pipeline_tag: text-generation
model-index:
- name: TinyGPT2-IT
  results: []
---

# TinyGPT2-IT

### A 95M-parameter instruction-tuned language model trained from scratch on a single consumer GPU

[![GitHub](https://img.shields.io/badge/GitHub-NotShrirang%2Ftinygpt-blue?logo=github)](https://github.com/NotShrirang/tinygpt)
[![Demo](https://img.shields.io/badge/Demo-Streamlit-FF4B4B?logo=streamlit)](https://tinygpt.streamlit.app/)
[![License](https://img.shields.io/badge/License-GPL--3.0-green)](https://www.gnu.org/licenses/gpl-3.0.en.html)


---

## Overview

**TinyGPT2-IT** is an instruction-tuned variant of [TinyGPT2](https://github.com/NotShrirang/tinygpt), a modern GPT architecture built from scratch in PyTorch. The base model was pretrained on ~6.7B tokens from OpenWebText, then trained with supervised fine-tuning (SFT) on Stanford Alpaca's 52K instruction-response pairs.

The entire pipeline (pretraining, fine-tuning, and inference) runs on a **single NVIDIA RTX 3070 Ti (8 GB VRAM)**.

> This model uses a custom architecture and requires `trust_remote_code=True`.

---

## Architecture

| Component | Detail |
|---|---|
| **Parameters** | ~95M |
| **Layers** | 12 transformer blocks |
| **Attention** | Grouped Query Attention (12 query heads, 4 KV groups) |
| **Embedding dim** | 768 |
| **FFN hidden dim** | 2048 |
| **Position encoding** | Rotary Position Embeddings (RoPE) |
| **Normalization** | RMSNorm |
| **Context window** | 512 tokens |
| **Vocabulary** | 50,304 (GPT-2 tiktoken + PAD token) |
| **Weight tying** | Token embedding ↔ LM head |
| **KV Cache** | Supported for efficient generation |

---

## Training

### Stage 1: Pretraining

| | |
|---|---|
| **Dataset** | OpenWebText (~6.7B tokens) |
| **Optimizer** | AdamW (fused) |
| **Effective batch** | 262K tokens/step |
| **Precision** | bfloat16 + `torch.compile` |
| **Hardware** | NVIDIA RTX 3070 Ti (8 GB) |

### Stage 2: Supervised Fine-Tuning (SFT)

| | |
|---|---|
| **Dataset** | Stanford Alpaca (52K instructions) |
| **Epochs** | 3 |
| **Loss masking** | Response-only (instruction tokens are masked) |
| **Final train loss** | 1.91 |
| **Final val loss** | 1.98 |
| **Final val perplexity** | 7.26 |
| **Tokens processed** | ~72M |
| **Prompt format** | `### Instruction: ...` followed by `### Response: ...` |

---

## Usage

### Quick Start

```python
from transformers import AutoModelForCausalLM
import tiktoken
import torch

# Load the model; the custom architecture requires trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(
    "NotShrirang/tinygpt2-it",
    trust_remote_code=True,
)
model.eval()

# Tokenize the prompt with the GPT-2 tiktoken encoding
enc = tiktoken.get_encoding("gpt2")
prompt = "### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
input_ids = torch.tensor([enc.encode(prompt)])

# Generate without tracking gradients
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_k=40,
    )

print(enc.decode(output[0].tolist()))
```

### Prompt Format

This model expects instructions in the following template:

```
### Instruction:
{your instruction here}

### Response:
```

For instructions with additional context:

```
### Instruction:
{your instruction here}

### Input:
{additional context}

### Response:
```

---

## Example Outputs

**Factual Q&A**

```
>>> What is the capital of France?
The capital of France is Paris.
```

**Explanation**

```
>>> Explain what machine learning is in simple terms.
Machine learning is a branch of computer science that focuses on using algorithms to identify patterns in data. These algorithms are used to analyze large amounts of data and make predictions about future trends.
```

**Creative**

```
>>> Write a motivational quote.
"The only way to make a difference is to be bold and courageous."
```

---

## Limitations

- **Small model:** 95M parameters is far below the scale of production LLMs; expect factual errors, repetition, and limited reasoning.
- **Short context:** the 512-token window limits the length of conversations and documents.
- **Training data:** the model was pretrained on web text and fine-tuned on synthetic Alpaca data, both of which may contain biases or inaccuracies.
- **Not safety-aligned:** no RLHF or DPO was applied to this checkpoint; the model may produce harmful or inappropriate content.

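
Because of the short context, it can help to check how much room a prompt leaves before calling `generate`. The following is a minimal sketch, not part of the released code: `build_prompt` and `generation_budget` are hypothetical helper names, the exact blank-line layout of the `### Input:` variant is an assumption modeled on the Quick Start prompt string, and the sketch assumes the 512-token window is shared by the prompt and the generated tokens.

```python
from typing import Callable, Optional, Sequence

CONTEXT_WINDOW = 512  # context length listed in the Architecture table


def build_prompt(instruction: str, input_text: Optional[str] = None) -> str:
    """Render the instruction template (hypothetical helper, not in the repo)."""
    if input_text:
        # Assumption: the Input variant uses the same blank-line spacing
        # as the Quick Start prompt string.
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def generation_budget(
    prompt: str,
    encode: Callable[[str], Sequence[int]],
    requested_new_tokens: int = 128,
    context_window: int = CONTEXT_WINDOW,
) -> int:
    """Return how many new tokens fit after the prompt.

    Assumes the prompt and the generated tokens share one context window.
    Raises ValueError when the prompt alone already fills the window.
    """
    remaining = context_window - len(encode(prompt))
    if remaining <= 0:
        raise ValueError("prompt does not leave room for any generated tokens")
    return min(requested_new_tokens, remaining)
```

With the tokenizer from the Quick Start, this could be used as `max_new = generation_budget(build_prompt("Summarize RoPE."), tiktoken.get_encoding("gpt2").encode)` and the result passed as `max_new_tokens`. Passing the encoder in as a callable keeps the helper independent of any particular tokenizer library.
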

---

## Model Family

| Model | Params | Description | Link |
|---|---|---|---|
| TinyGPT | 51M | Standard GPT, TinyStories | [GitHub](https://github.com/NotShrirang/tinygpt) |
| TinyGPT-MoE | 85M | Mixture of Experts, TinyStories | [GitHub](https://github.com/NotShrirang/tinygpt) |
| Wikipedia-MoE | 135M | 8-expert MoE, Wikipedia/C4 | [GitHub](https://github.com/NotShrirang/tinygpt) |
| TinyGPT2 | 95M | RoPE + GQA + RMSNorm, OpenWebText | [GitHub](https://github.com/NotShrirang/tinygpt) |
| TinyGPT2.1 | 183M | Scaled TinyGPT2, FineWeb-Edu | [GitHub](https://github.com/NotShrirang/tinygpt) |
| **TinyGPT2-IT** | **95M** | **Instruction-tuned (this model)** | **You are here** |
| TinyGPT2-DPO | 95M | DPO-aligned with Anthropic HH-RLHF | [GitHub](https://github.com/NotShrirang/tinygpt) |

---

## Citation

```bibtex
@misc{tinygpt2-it,
  author = {Shrirang Mahajan},
  title = {TinyGPT2-IT: Instruction-Tuned 95M Parameter Language Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/NotShrirang/tinygpt2-it}
}
```

---

## License

This model is released under the [GPL-3.0 License](https://www.gnu.org/licenses/gpl-3.0.en.html).