	---
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- decoder-only
	- nlp
	- autoregressive
	- rope
	- gqa
	- rmsnorm
	- swiglu
	- from-scratch
	datasets:
	- roneneldan/TinyStories
	license: apache-2.0
	model-index:
	- name: GatorGPT2
	results: []
	---

	# 🐊 GatorGPT2

	GatorGPT2 is a small, decoder-only Transformer trained from scratch on a subset of TinyStories for next-token prediction.
	It uses RoPE (rotary positional embeddings), GQA (grouped-query attention), RMSNorm, and a SwiGLU MLP.
	Tokenizer is tiktoken with p50k_base vocabulary.

	> Repo: `kunjcr2/GatorGPT2`
	> Intended use: research, experimentation, educational demos for training/serving custom LMs

	---

	## 🔧 Architecture

	- Type: Decoder-only, causal LM
	- Layers: `num_hidden_layers = 10`
	- Hidden size: `hidden_size = 448`
	- Heads: `num_attention_heads = 8` (GQA with 2 KV heads per query group)
	- FFN: SwiGLU, `d_ff ≈ 2× hidden_size`
	- Norm: RMSNorm (pre-norm blocks)
	- Positional: RoPE
	- Vocab: `vocab_size = 50,257` (tiktoken p50k_base)
	- Context length: `max_position_embeddings = 1024`
	- Weight tying: output head tied with token embeddings
	- Files:
	  - `pytorch_model.bin` (or `model.safetensors`)
	  - `config.json` (`model_type: "gator-transformer"`, `auto_map` provided)
	  - `modeling_gator.py`, `configuration_gator.py`, `__init__.py`
	  - `tokenizer_manifest.json` → `{ "library": "tiktoken", "encoding": "p50k_base" }`

	> Custom code is loaded via `trust_remote_code=True`.
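
As a rough sanity check on model size, the hyperparameters above can be turned into a parameter-count estimate. The sketch below is not taken from `modeling_gator.py`. It assumes bias-free linear layers, `d_ff` equal to exactly `2 × hidden_size`, 2 KV heads in total (one reading of the GQA note), and one RMSNorm weight vector before attention, before the MLP, and after the final block. `estimate_params` is a hypothetical helper, and the real checkpoint may differ.

```python
def estimate_params(
    vocab_size: int = 50257,
    hidden_size: int = 448,
    num_layers: int = 10,
    num_heads: int = 8,
    num_kv_heads: int = 2,   # assumption: 2 KV heads in total
    d_ff: int = 896,         # assumption: exactly 2 * hidden_size
) -> dict:
    """Estimate parameter counts for a pre-norm decoder with GQA and SwiGLU."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim

    embedding = vocab_size * hidden_size   # shared with the output head (weight tying)
    attention = (
        hidden_size * hidden_size          # query projection
        + 2 * hidden_size * kv_dim         # key and value projections
        + hidden_size * hidden_size        # output projection
    )
    mlp = 3 * hidden_size * d_ff           # SwiGLU: gate, up, and down projections
    norms = 2 * hidden_size                # two RMSNorm weight vectors per block
    per_layer = attention + mlp + norms
    total = embedding + num_layers * per_layer + hidden_size  # plus final RMSNorm
    return {"embedding": embedding, "per_layer": per_layer, "total": total}


if __name__ == "__main__":
    for name, value in estimate_params().items():
        print(f"{name:>10}: {value:,}")
```

Under these assumptions the token embedding accounts for more than half of the total, which is common for small models paired with a large vocabulary.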

	---

	## 📦 Install

	```bash
	pip install torch transformers tiktoken
	```

	---

	## 🚀 Quickstart (Transformers + tiktoken)

	```python
	import torch
	from transformers import AutoModelForCausalLM
	import tiktoken

	MODEL_ID = "kunjcr2/GatorGPT2"
	DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

	# Load model (uses custom modeling code)
	model = AutoModelForCausalLM.from_pretrained(
	    MODEL_ID,
	    trust_remote_code=True,
	    torch_dtype=torch.float32,
	).to(DEVICE).eval()

	# Tokenizer (p50k_base via tiktoken)
	tok = tiktoken.get_encoding("p50k_base")

	def generate_greedy(prompt: str, max_new_tokens: int = 64) -> str:
	    """Greedy decoding: append the highest-scoring token at each step."""
	    ids = tok.encode(prompt)
	    x = torch.tensor([ids], device=DEVICE)
	    for _ in range(max_new_tokens):
	        with torch.no_grad():
	            out = model(x)
	        # The custom model may return a dict or an object with `.logits`
	        logits = out["logits"] if isinstance(out, dict) else out.logits
	        next_id = int(torch.argmax(logits[0, -1]))
	        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
	    # Drop any end-of-text markers from the decoded string
	    return tok.decode(x[0].tolist()).replace("<|endoftext|>", "").strip()

	print(generate_greedy("Little girl was"))
	```
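
The loop above re-feeds the whole sequence at every step, so a long prompt plus generated tokens could exceed the 1024-token context listed under Architecture. One simple guard, sketched here as an assumption rather than something the repository provides, is to keep only the most recent tokens before each forward pass. `clip_context` is a hypothetical helper that works on plain lists of token ids.

```python
from typing import List

MAX_CONTEXT = 1024  # max_position_embeddings from the Architecture section


def clip_context(ids: List[int], max_len: int = MAX_CONTEXT) -> List[int]:
    """Return at most the last `max_len` token ids, dropping the oldest ones."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return ids[-max_len:]


if __name__ == "__main__":
    long_prompt = list(range(1500))   # stand-in for tok.encode(...) output
    clipped = clip_context(long_prompt)
    print(len(clipped), clipped[0], clipped[-1])
```

Inside the generation loop, the equivalent on the tensor would be to pass `x[:, -MAX_CONTEXT:]` to the model instead of `x`.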

	### Temperature-only sampling (no top-k/p)

	```python
	def generate_temp(prompt, max_new_tokens=64, temperature=0.9):
	    """Sample each next token from the temperature-scaled softmax."""
	    ids = tok.encode(prompt)
	    x = torch.tensor([ids], device=DEVICE)
	    for _ in range(max_new_tokens):
	        with torch.no_grad():
	            # Floor the temperature to avoid division by zero
	            logits = model(x).logits[0, -1] / max(temperature, 1e-6)
	        probs = torch.softmax(logits, dim=-1)
	        next_id = torch.multinomial(probs, 1).item()
	        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
	    return tok.decode(x[0].tolist()).replace("<|endoftext|>", "").strip()
	```
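
To see what dividing by `temperature` does before the softmax, the following standalone sketch applies the same scaling to a short list of made-up logits using only the standard library. The logit values and the `softmax_with_temperature` name are illustrative assumptions.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    t = max(temperature, 1e-6)         # same floor as generate_temp
    scaled = [z / t for z in logits]
    peak = max(scaled)                 # subtract the max to avoid overflow in exp
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]       # made-up values for illustration
    for temp in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(toy_logits, temp)
        print(temp, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the top-scoring token, while temperatures above 1 flatten the distribution.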

	---

	## 🌐 Serving with vLLM (Optional)

	```bash
	python -m vllm.entrypoints.openai.api_server \
	  --model kunjcr2/GatorGPT2 \
	  --tokenizer kunjcr2/GatorGPT2 \
	  --trust-remote-code \
	  --dtype float32 \
	  --max-model-len 1024 \
	  --host 0.0.0.0 --port 8000

	Call it:

	```bash
	curl http://localhost:8000/v1/completions \
	  -H "Content-Type: application/json" \
	  -d '{"model":"kunjcr2/GatorGPT2","prompt":"Little girl was","max_tokens":64,"temperature":0.9}'
	```
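
The same request can be made from Python. The sketch below uses only the standard library and assumes the server is reachable at the address from the command above. `build_completion_request` and `complete` are hypothetical helper names, and `complete` assumes the response follows the OpenAI completions shape (`choices[0].text`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the --port used when starting the server


def build_completion_request(prompt: str, max_tokens: int = 64, temperature: float = 0.9,
                             model: str = "kunjcr2/GatorGPT2") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout: float = 30.0) -> str:
    """Send the request to a running server and return the first completion's text."""
    with urllib.request.urlopen(build_completion_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Only show what would be sent; call complete(...) once the server is up.
    req = build_completion_request("Little girl was")
    print(req.full_url)
    print(req.data.decode("utf-8"))
```

With the server running, `complete("Little girl was")` should return the generated continuation as a string.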

	---

	## 🧪 Training Summary

	* Data: `roneneldan/TinyStories` (train split; subset of ~1.5M stories)
	* Objective: causal LM (next-token prediction), cross-entropy
	* Optimizer: AdamW (`lr=3e-4`, `weight_decay=0.01`, `eps=1e-8`)
	* Precision: bf16 autocast on CUDA during forward for speed
	* Batching: sliding windows via a `FastDataset` (window size e.g. 512, stride 256)
	* Eval: periodic validation over fixed batches; train loss downsampled to eval steps for plotting
	* Hardware: intended for A100-class GPUs; also runs on CPU for debug (slow)

	> This is a from-scratch toy/educational model; quality depends heavily on the number of training steps, data cleaning, and the schedule. Expect simple, short English generations.
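
The sliding-window batching mentioned above can be illustrated on a plain list of token ids. This is a minimal sketch of the general technique, not the repository's `FastDataset`. The `sliding_windows` name and the choice to drop a trailing partial window are assumptions.

```python
from typing import List, Tuple


def sliding_windows(token_ids: List[int], window: int = 512, stride: int = 256
                    ) -> List[Tuple[List[int], List[int]]]:
    """Cut a token stream into overlapping (input, target) pairs.

    Each target is the input shifted left by one token, which is the usual
    setup for next-token prediction.
    """
    pairs = []
    # Stop early enough that every window has `window` inputs plus one extra target token.
    for start in range(0, len(token_ids) - window, stride):
        inputs = token_ids[start:start + window]
        targets = token_ids[start + 1:start + window + 1]
        pairs.append((inputs, targets))
    return pairs


if __name__ == "__main__":
    toy_stream = list(range(10))   # stand-in for an encoded story
    for inputs, targets in sliding_windows(toy_stream, window=4, stride=2):
        print(inputs, "->", targets)
```

With a stride of half the window, each token appears in up to two windows, so the model sees it with both a short and a long left context.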

	---

	## ✅ Intended Use

	* Research on small decoder-only Transformers
	* Educational demos (training, saving, model hub, vLLM serving)
	* Baseline for experimenting with:
	  * LoRA/QLoRA, quantization, distillation
	  * Attention variants (Flash-Attention, GQA configs)
	  * Data curation and scaling laws

	Not intended for production or safety-critical use.

	---

	## ⚠️ Limitations & Risks

	* Trained on children’s story data ⇒ limited world knowledge & reasoning
	* May output incoherent, repetitive, or undesirable text
	* No instruction-tuning or RLHF
	* Tokenizer is `tiktoken p50k_base` (not a standard HF tokenizer), so examples use `tiktoken` directly

	---

	## 📁 Repo Structure

	```
	.
	├── config.json
	├── pytorch_model.bin # or model.safetensors
	├── modeling_gator.py # custom architecture (RoPE, GQA, RMSNorm, SwiGLU)
	├── configuration_gator.py
	├── __init__.py
	└── tokenizer_manifest.json # { "library": "tiktoken", "encoding": "p50k_base" }
	```

	`config.json` includes:

	```json
	{
	  "model_type": "gator-transformer",
	  "architectures": ["GatorModel"],
	  "auto_map": {
	    "AutoConfig": "configuration_gator.GatorConfig",
	    "AutoModelForCausalLM": "modeling_gator.GatorModel"
	  }
	}
	```
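
Each `auto_map` entry follows a `module.ClassName` pattern that points from an Auto class to a file in this repository. As a small illustration (a sketch, not how Transformers resolves these entries internally), the fragment above can be parsed with the standard library to see which file each entry refers to. `resolve_auto_map` is a hypothetical helper.

```python
import json

CONFIG_FRAGMENT = """
{
  "model_type": "gator-transformer",
  "architectures": ["GatorModel"],
  "auto_map": {
    "AutoConfig": "configuration_gator.GatorConfig",
    "AutoModelForCausalLM": "modeling_gator.GatorModel"
  }
}
"""


def resolve_auto_map(config: dict) -> dict:
    """Map each Auto class name to a (python_file, class_name) pair."""
    resolved = {}
    for auto_class, reference in config.get("auto_map", {}).items():
        # Split on the last dot: everything before it is the module name.
        module, _, class_name = reference.rpartition(".")
        resolved[auto_class] = (f"{module}.py", class_name)
    return resolved


if __name__ == "__main__":
    config = json.loads(CONFIG_FRAGMENT)
    for auto_class, (path, class_name) in resolve_auto_map(config).items():
        print(f"{auto_class}: {class_name} in {path}")
```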

	---

	## 📊 Evaluation

	No formal benchmarks reported. You can compute loss/perplexity on your own validation subset:

	```python
	import math, torch
	from torch.utils.data import DataLoader, TensorDataset

	# ...build a DataLoader of (input_ids, target_ids) pairs...
	def eval_loss(model, loader, device="cuda"):
	    """Return the mean cross-entropy loss over all batches in `loader`."""
	    model.eval()
	    total, n = 0.0, 0
	    with torch.no_grad():
	        for x, y in loader:
	            x, y = x.to(device), y.to(device)
	            logits = model(x).logits
	            # Flatten (batch, seq, vocab) -> (batch*seq, vocab) for cross-entropy
	            loss = torch.nn.functional.cross_entropy(
	                logits.view(-1, logits.size(-1)), y.view(-1)
	            )
	            total += loss.item()
	            n += 1
	    return total / max(n, 1)

	val_loss = eval_loss(model, your_val_loader)
	print("val loss:", val_loss, " ppl:", math.exp(val_loss))
	```
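
`eval_loss` averages the per-batch mean losses, which matches the per-token average only when every batch holds the same number of tokens. If batch sizes vary (for example, a smaller final batch), a token-weighted mean is a safer basis for perplexity. The sketch below shows the arithmetic on plain numbers; `perplexity` and its inputs are illustrative assumptions.

```python
import math
from typing import Sequence


def perplexity(batch_losses: Sequence[float], batch_tokens: Sequence[int]) -> float:
    """Perplexity from per-batch mean losses, weighted by tokens per batch."""
    if len(batch_losses) != len(batch_tokens):
        raise ValueError("need one token count per batch loss")
    total_tokens = sum(batch_tokens)
    if total_tokens == 0:
        raise ValueError("no tokens to evaluate")
    # Recover the summed loss per batch, then divide by the total token count.
    weighted = sum(loss * n for loss, n in zip(batch_losses, batch_tokens))
    return math.exp(weighted / total_tokens)


if __name__ == "__main__":
    # Made-up numbers: a full batch and a smaller final batch.
    print(perplexity([1.0, 3.0], [30, 10]))
```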

	---

	## 📜 License

	apache-2.0

	---

	## 🙌 Acknowledgements

	* TinyStories dataset by Ronen Eldan et al. (`roneneldan/TinyStories`)
	* Community tooling: PyTorch, 🤗 Transformers, tiktoken, vLLM

	---

	## ✉️ Citation

	If you use this model, please cite this repository:

	```bibtex
	@software{GatorGPT2_2025,
	  author = {Kunj},
	  title = {GatorGPT2: a small decoder-only Transformer with RoPE+GQA},
	  year = {2025},
	  url = {https://huggingface.co/kunjcr2/GatorGPT2}
	}
	```