Text Generation
Transformers
Safetensors
English
tinygpt2
causal-lm
instruction-tuned
sft
rope
grouped-query-attention
rms-norm
custom_code
Instructions to use NotShrirang/tinygpt2-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use NotShrirang/tinygpt2-it with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="NotShrirang/tinygpt2-it", trust_remote_code=True)# Load model directly from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained("NotShrirang/tinygpt2-it", trust_remote_code=True, dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use NotShrirang/tinygpt2-it with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "NotShrirang/tinygpt2-it" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "NotShrirang/tinygpt2-it", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/NotShrirang/tinygpt2-it
- SGLang
How to use NotShrirang/tinygpt2-it with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "NotShrirang/tinygpt2-it" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "NotShrirang/tinygpt2-it", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "NotShrirang/tinygpt2-it" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "NotShrirang/tinygpt2-it", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use NotShrirang/tinygpt2-it with Docker Model Runner:
docker model run hf.co/NotShrirang/tinygpt2-it
| language: | |
| - en | |
| license: gpl-3.0 | |
| library_name: transformers | |
| tags: | |
| - text-generation | |
| - tinygpt2 | |
| - causal-lm | |
| - instruction-tuned | |
| - sft | |
| - rope | |
| - grouped-query-attention | |
| - rms-norm | |
| datasets: | |
| - tatsu-lab/alpaca | |
| - Skylion007/openwebtext | |
| pipeline_tag: text-generation | |
| model-index: | |
| - name: TinyGPT2-IT | |
| results: [] | |
| <div align="center"> | |
| # TinyGPT2-IT | |
| ### A 95M parameter instruction-tuned language model trained from scratch on a single consumer GPU | |
| [](https://github.com/NotShrirang/tinygpt) | |
| [](https://tinygpt.streamlit.app/) | |
| [](https://www.gnu.org/licenses/gpl-3.0.en.html) | |
| </div> | |
| --- | |
| ## Overview | |
| **TinyGPT2-IT** is an instruction-tuned variant of [TinyGPT2](https://github.com/NotShrirang/tinygpt) — a modern GPT architecture built from scratch using PyTorch. The base model was pretrained on ~6.7B tokens from OpenWebText, then supervised fine-tuned (SFT) on Stanford Alpaca's 52K instruction-response pairs. | |
| The entire pipeline — pretraining, fine-tuning, and inference — runs on a **single NVIDIA RTX 3070 Ti (8 GB VRAM)**. | |
| > This model uses a custom architecture and requires `trust_remote_code=True`. | |
| --- | |
| ## Architecture | |
| | Component | Detail | | |
| |---|---| | |
| | **Parameters** | ~95M | | |
| | **Layers** | 12 transformer blocks | | |
| | **Attention** | Grouped Query Attention (12 query heads, 4 KV groups) | | |
| | **Embedding dim** | 768 | | |
| | **FFN hidden dim** | 2048 | | |
| | **Position encoding** | Rotary Position Embeddings (RoPE) | | |
| | **Normalization** | RMSNorm | | |
| | **Context window** | 512 tokens | | |
| | **Vocabulary** | 50,304 (GPT-2 tiktoken + PAD token) | | |
| | **Weight tying** | Token embedding ↔ LM head | | |
| | **KV Cache** | Supported for efficient generation | | |
| --- | |
| ## Training | |
| ### Stage 1 — Pretraining | |
| | | | | |
| |---|---| | |
| | **Dataset** | OpenWebText (~6.7B tokens) | | |
| | **Optimizer** | AdamW (fused) | | |
| | **Effective batch** | 262K tokens/step | | |
| | **Precision** | bfloat16 + `torch.compile` | | |
| | **Hardware** | NVIDIA RTX 3070 Ti (8 GB) | | |
| ### Stage 2 — Supervised Fine-Tuning (SFT) | |
| | | | | |
| |---|---| | |
| | **Dataset** | Stanford Alpaca (52K instructions) | | |
| | **Epochs** | 3 | | |
| | **Loss masking** | Response-only (instruction tokens are masked) | | |
| | **Final train loss** | 1.91 | | |
| | **Final val loss** | 1.98 | | |
| | **Final val perplexity** | 7.26 | | |
| | **Tokens processed** | ~72M | | |
| | **Prompt format** | `### Instruction: ... ### Response: ...` | | |
| --- | |
| ## Usage | |
| ### Quick Start | |
| ```python | |
| from transformers import AutoModelForCausalLM | |
| import tiktoken | |
| import torch | |
| # Load model | |
| model = AutoModelForCausalLM.from_pretrained( | |
| "NotShrirang/tinygpt2-it", | |
| trust_remote_code=True, | |
| ) | |
| model.eval() | |
| # Tokenize | |
| enc = tiktoken.get_encoding("gpt2") | |
| prompt = "### Instruction:\nWhat is the capital of France?\n\n### Response:\n" | |
| input_ids = torch.tensor([enc.encode(prompt)]) | |
| # Generate | |
| with torch.no_grad(): | |
| output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7, top_k=40) | |
| print(enc.decode(output[0].tolist())) | |
| ``` | |
| ### Prompt Format | |
| This model expects instructions in the following template: | |
| ``` | |
| ### Instruction: | |
| {your instruction here} | |
| ### Response: | |
| ``` | |
| For instructions with additional context: | |
| ``` | |
| ### Instruction: | |
| {your instruction here} | |
| ### Input: | |
| {additional context} | |
| ### Response: | |
| ``` | |
| --- | |
| ## Example Outputs | |
| **Factual Q&A** | |
| ``` | |
| >>> What is the capital of France? | |
| The capital of France is Paris. | |
| ``` | |
| **Explanation** | |
| ``` | |
| >>> Explain what machine learning is in simple terms. | |
| Machine learning is a branch of computer science that focuses on using algorithms to | |
| identify patterns in data. These algorithms are used to analyze large amounts of data | |
| and make predictions about future trends. | |
| ``` | |
| **Creative** | |
| ``` | |
| >>> Write a motivational quote. | |
| "The only way to make a difference is to be bold and courageous." | |
| ``` | |
| --- | |
| ## Limitations | |
| - **Small model** — 95M parameters is far below production LLMs; expect factual errors, repetition, and limited reasoning. | |
| - **Short context** — 512 token window limits the length of conversations and documents. | |
| - **Training data** — pretrained on web text and fine-tuned on synthetic Alpaca data, which may contain biases or inaccuracies. | |
| - **Not safety-aligned** — no RLHF/DPO applied to this checkpoint; the model may produce harmful or inappropriate content. | |
| --- | |
| ## Model Family | |
| | Model | Params | Description | Link | | |
| |---|---|---|---| | |
| | TinyGPT | 51M | Standard GPT, TinyStories | [GitHub](https://github.com/NotShrirang/tinygpt) | | |
| | TinyGPT-MoE | 85M | Mixture of Experts, TinyStories | [GitHub](https://github.com/NotShrirang/tinygpt) | | |
| | Wikipedia-MoE | 135M | 8-expert MoE, Wikipedia/C4 | [GitHub](https://github.com/NotShrirang/tinygpt) | | |
| | TinyGPT2 | 95M | RoPE + GQA + RMSNorm, OpenWebText | [GitHub](https://github.com/NotShrirang/tinygpt) | | |
| | TinyGPT2.1 | 183M | Scaled TinyGPT2, FineWeb-Edu | [GitHub](https://github.com/NotShrirang/tinygpt) | | |
| | **TinyGPT2-IT** | **95M** | **Instruction-tuned (this model)** | **You are here** | | |
| | TinyGPT2-DPO | 95M | DPO-aligned with Anthropic HH-RLHF | [GitHub](https://github.com/NotShrirang/tinygpt) | | |
| --- | |
| ## Citation | |
| ```bibtex | |
| @misc{tinygpt2-it, | |
| author = {Shrirang Mahajan}, | |
| title = {TinyGPT2-IT: Instruction-Tuned 95M Parameter Language Model}, | |
| year = {2025}, | |
| publisher = {Hugging Face}, | |
| url = {https://huggingface.co/NotShrirang/tinygpt2-it} | |
| } | |
| ``` | |
| --- | |
| ## License | |
| This model is released under the [GPL-3.0 License](https://www.gnu.org/licenses/gpl-3.0.en.html). | |