 ---
 language:
 - en
+library_name: transformers
 pipeline_tag: text-generation
 tags:
+- decoder-only
+- nlp
+- autoregressive
+- rope
+- gqa
+- rmsnorm
+- swiglu
+- from-scratch
+datasets:
+- roneneldan/TinyStories
+license: apache-2.0
+model-index:
+- name: GatorGPT2
+  results: []
+---
+# 🐊 GatorGPT2
+**GatorGPT2** is a small, decoder-only Transformer trained from scratch on a subset of **TinyStories** for next-token prediction.
+It uses **RoPE** (rotary positional embeddings), **GQA** (grouped-query attention), **RMSNorm**, and a **SwiGLU MLP**.
+The tokenizer is **tiktoken** with the **p50k_base** encoding.
+> **Repo**: `kunjcr2/GatorGPT2`
+> **Intended use**: research, experimentation, educational demos for training/serving custom LMs
+---
+## 🔧 Architecture
+- **Type**: Decoder-only, causal LM
+- **Layers**: `num_hidden_layers = 10`
+- **Hidden size**: `hidden_size = 448`
+- **Heads**: `num_attention_heads = 8` (GQA with 2 KV heads per query group)
+- **FFN**: SwiGLU, `d_ff ≈ 2× hidden_size`
+- **Norm**: RMSNorm (pre-norm blocks)
+- **Positional**: RoPE
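+- **Vocab**: `vocab_size = 50257` (tiktoken p50k_base; note that the encoding itself defines token IDs up to 50280, so IDs above 50256 would fall outside this table)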
+- **Context length**: `max_position_embeddings = 1024`
+- **Weight tying**: output head tied with token embeddings
+- **Files**:
+  - `pytorch_model.bin` (or `model.safetensors`)
+  - `config.json` (`model_type: "gator-transformer"`, `auto_map` provided)
+  - `modeling_gator.py`, `configuration_gator.py`, `__init__.py`
+  - `tokenizer_manifest.json` → `{ "library": "tiktoken", "encoding": "p50k_base" }`
+> Custom code is loaded via `trust_remote_code=True`.
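As a rough sanity check on model size, the hyperparameters above can be turned into a parameter estimate. This is a minimal sketch, not the repository's code: it assumes `d_ff = 2 * hidden_size` exactly, 2 KV heads in total (one reading of the GQA note), bias-free linear layers, and two RMSNorm weight vectors per block plus a final norm. The name `estimate_params` is hypothetical.

```python
def estimate_params(
    vocab_size: int = 50257,
    hidden_size: int = 448,
    num_layers: int = 10,
    num_heads: int = 8,
    num_kv_heads: int = 2,  # ASSUMPTION: total KV heads shared by the 8 query heads
    d_ff: int = 2 * 448,    # ASSUMPTION: the card only says "d_ff ≈ 2× hidden_size"
) -> dict:
    """Estimate parameter counts for a bias-free RoPE/GQA/SwiGLU decoder."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim

    # Token embeddings; the output head reuses them (weight tying), so count once.
    embedding = vocab_size * hidden_size
    # Q and output projections are hidden x hidden; K and V are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # SwiGLU uses three matrices: gate, up, and down projections.
    mlp = 3 * hidden_size * d_ff
    # Two RMSNorm weight vectors per pre-norm block. RoPE has no learned weights.
    norms = 2 * hidden_size

    per_layer = attention + mlp + norms
    total = embedding + num_layers * per_layer + hidden_size  # + final norm
    return {"embedding": embedding, "per_layer": per_layer, "total": total}


counts = estimate_params()
print(f"embedding: {counts['embedding']:,}")
print(f"per layer: {counts['per_layer']:,}")
print(f"total:     {counts['total']:,}")
```

Under these assumptions the embedding table accounts for more than half of the total, which is typical for very small models with a GPT-2-sized vocabulary.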
+---
+## 📦 Install
+```bash
+pip install torch transformers tiktoken
+```
+---
+## 🚀 Quickstart (Transformers + tiktoken)
+```python
+import torch
+from transformers import AutoModelForCausalLM
+import tiktoken
+MODEL_ID = "kunjcr2/GatorGPT2"
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+# Load model (uses custom modeling code)
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID,
+    trust_remote_code=True,
+    torch_dtype=torch.float32,
+).to(DEVICE).eval()
+# Tokenizer (p50k_base via tiktoken)
+tok = tiktoken.get_encoding("p50k_base")
+def generate_greedy(prompt: str, max_new_tokens: int = 64) -> str:
+    """Greedy decoding: append the arg-max token one step at a time."""
+    ids = tok.encode(prompt)
+    x = torch.tensor([ids], device=DEVICE)
+    for _ in range(max_new_tokens):
+        with torch.no_grad():
+            out = model(x[:, -1024:])  # stay within max_position_embeddings
+        logits = out["logits"] if isinstance(out, dict) else out.logits
+        next_id = int(torch.argmax(logits[0, -1]))
+        if next_id == tok.eot_token:  # stop once <|endoftext|> is produced
+            break
+        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
+    return tok.decode(x[0].tolist()).replace("<|endoftext|>", "").strip()
+print(generate_greedy("Little girl was"))
+```
+### Temperature-only sampling (no top-k/p)
+```python
+def generate_temp(prompt, max_new_tokens=64, temperature=0.9):
+    # Reuses `model`, `tok`, and `DEVICE` from the Quickstart block.
+    ids = tok.encode(prompt)
+    x = torch.tensor([ids], device=DEVICE)
+    for _ in range(max_new_tokens):
+        with torch.no_grad():
+            out = model(x)
+        logits = out["logits"] if isinstance(out, dict) else out.logits
+        # Scale the last-position logits; the floor avoids division by zero.
+        probs = torch.softmax(logits[0, -1] / max(temperature, 1e-6), dim=-1)
+        next_id = torch.multinomial(probs, 1).item()
+        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
+    return tok.decode(x[0].tolist()).replace("<|endoftext|>", "").strip()
+```
+---
+## 🌐 Serving with vLLM (Optional)
+```bash
+python -m vllm.entrypoints.openai.api_server \
+  --model kunjcr2/GatorGPT2 \
+  --tokenizer kunjcr2/GatorGPT2 \
+  --trust-remote-code \
+  --dtype float32 \
+  --max-model-len 1024 \
+  --host 0.0.0.0 --port 8000
+```
+Once the server is running, query its OpenAI-compatible completions endpoint:
+```bash
+curl http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"kunjcr2/GatorGPT2","prompt":"Little girl was","max_tokens":64,"temperature":0.9}'
+```
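The same request can be sent from Python. The sketch below uses only the standard library and assumes the server from the command above is listening on `localhost:8000`. `build_completion_request` and `complete` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].text` field.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    base_url: str = "http://localhost:8000",  # ASSUMPTION: local vLLM server
    max_tokens: int = 64,
    temperature: float = 0.9,
) -> urllib.request.Request:
    """Build a POST request mirroring the curl call above."""
    payload = {
        "model": "kunjcr2/GatorGPT2",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    req = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Example call, left commented out because it needs the server to be up:
# print(complete("Little girl was"))
```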
+---
+## 🧪 Training Summary
+* **Data**: `roneneldan/TinyStories` (train split; subset of \~1.5M stories)
+* **Objective**: causal LM (next-token prediction), cross-entropy
+* **Optimizer**: AdamW (`lr=3e-4`, `weight_decay=0.01`, `eps=1e-8`)
+* **Precision**: bf16 autocast on CUDA during forward for speed
+* **Batching**: sliding windows via a `FastDataset` (window size e.g. 512, stride 256)
+* **Eval**: periodic validation over fixed batches; train loss downsampled to eval steps for plotting
+* **Hardware**: intended for A100-class GPUs; also runs on CPU for debug (slow)
+> This is a *from-scratch* toy/educational model; quality depends heavily on the number of training steps, data cleaning, and the training schedule. Expect simple, short English generations.
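The windowing described under **Batching** can be sketched as follows. The card does not show `FastDataset`, so this is only an illustration of the idea in plain Python: `sliding_windows` is a hypothetical helper, and shifting the targets one token to the right is the standard next-token setup rather than a confirmed detail of the training code.

```python
from typing import Iterator, List, Tuple


def sliding_windows(
    tokens: List[int], window: int = 512, stride: int = 256
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (inputs, targets) pairs; targets are inputs shifted by one token."""
    # Each example needs window + 1 tokens: `window` inputs plus the final target.
    last_start = len(tokens) - window - 1
    for start in range(0, last_start + 1, stride):
        chunk = tokens[start : start + window + 1]
        yield chunk[:-1], chunk[1:]


# Toy example with small numbers so the overlap is easy to see.
pairs = list(sliding_windows(list(range(10)), window=4, stride=2))
for x, y in pairs:
    print(x, "->", y)
```

With a stride of half the window, tokens away from the edges appear as inputs in two overlapping windows.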
+---
+## ✅ Intended Use
+* Research on small decoder-only Transformers
+* Educational demos (training, saving, model hub, vLLM serving)
+* Baseline for experimenting with:
+  * LoRA/QLoRA, quantization, distillation
+  * Attention variants (Flash-Attention, GQA configs)
+  * Data curation and scaling laws
+
+**Not** intended for production or safety-critical use.
+
+---
+## ⚠️ Limitations & Risks
+* Trained on children’s story data ⇒ limited world knowledge & reasoning
+* May output incoherent, repetitive, or undesirable text
+* No instruction-tuning or RLHF
+* Tokenizer is `tiktoken p50k_base` (not a standard HF tokenizer), so examples use `tiktoken` directly
+---
+## 📁 Repo Structure
+```
+.
+├── config.json
+├── pytorch_model.bin        # or model.safetensors
+├── modeling_gator.py        # custom architecture (RoPE, GQA, RMSNorm, SwiGLU)
+├── configuration_gator.py
+├── __init__.py
+└── tokenizer_manifest.json  # { "library": "tiktoken", "encoding": "p50k_base" }
+```
+`config.json` includes:
+```json
+{
+  "model_type": "gator-transformer",
+  "architectures": ["GatorModel"],
+  "auto_map": {
+    "AutoConfig": "configuration_gator.GatorConfig",
+    "AutoModelForCausalLM": "modeling_gator.GatorModel"
+  }
+}
+```
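Because the tokenizer is described by `tokenizer_manifest.json` rather than by Hugging Face tokenizer files, a loader can read the manifest and ask tiktoken for the named encoding. This is a hedged sketch: `encoding_name_from_manifest` is a hypothetical helper, and it assumes the manifest contains the two keys shown in the repo structure.

```python
import json


def encoding_name_from_manifest(manifest_text: str) -> str:
    """Validate the manifest and return the tiktoken encoding name."""
    manifest = json.loads(manifest_text)
    library = manifest.get("library")
    if library != "tiktoken":
        raise ValueError(f"unsupported tokenizer library: {library!r}")
    return manifest["encoding"]


# Inline copy of the manifest contents; in practice, read the file from the repo.
manifest_text = '{ "library": "tiktoken", "encoding": "p50k_base" }'
name = encoding_name_from_manifest(manifest_text)
print(name)

# With tiktoken installed, the encoding is then built with:
#   import tiktoken
#   tok = tiktoken.get_encoding(name)
```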
+---
+## 📊 Evaluation
+No formal benchmarks are reported. You can compute loss and perplexity on your own validation subset:
+```python
+import math, torch
+from torch.utils.data import DataLoader, TensorDataset
+# ...build a DataLoader of (input_ids, target_ids) pairs...
+def eval_loss(model, loader, device="cuda"):
+    """Return the mean per-batch cross-entropy over `loader`."""
+    model.eval()
+    total, n = 0.0, 0
+    with torch.no_grad():
+        for x, y in loader:
+            x, y = x.to(device), y.to(device)
+            logits = model(x).logits  # (batch, seq, vocab)
+            loss = torch.nn.functional.cross_entropy(
+                logits.reshape(-1, logits.size(-1)), y.reshape(-1)
+            )
+            total += loss.item()
+            n += 1
+    return total / max(n, 1)
+# Pass the device the model lives on (DEVICE comes from the Quickstart block).
+val_loss = eval_loss(model, your_val_loader, device=DEVICE)
+print("val loss:", val_loss, "  ppl:", math.exp(val_loss))
+```
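Averaging per-batch losses, as `eval_loss` does, weights every batch equally. If the last batch is smaller, or sequences are padded, a token-weighted mean gives a more faithful perplexity. The sketch below shows the arithmetic in plain Python; `corpus_perplexity` is a hypothetical helper that takes each batch's mean loss together with its token count, and the numbers are made up for illustration.

```python
import math
from typing import Iterable, Tuple


def corpus_perplexity(batches: Iterable[Tuple[float, int]]) -> float:
    """Perplexity from (mean_loss, n_tokens) pairs, weighted by token count."""
    total_nll, total_tokens = 0.0, 0
    for mean_loss, n_tokens in batches:
        total_nll += mean_loss * n_tokens  # undo the per-batch averaging
        total_tokens += n_tokens
    if total_tokens == 0:
        raise ValueError("no tokens to evaluate")
    return math.exp(total_nll / total_tokens)


# Illustrative values only: two full batches and one short, harder batch.
stats = [(2.0, 4096), (2.0, 4096), (3.0, 512)]
ppl = corpus_perplexity(stats)
print(f"token-weighted ppl: {ppl:.2f}")
```

Here the short batch pulls the result up less than it would under a plain average of the three losses.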
+---
+## 📜 License
+**Apache-2.0**
+
+---
+## 🙌 Acknowledgements
+* **TinyStories** dataset by Ronen Eldan et al. (`roneneldan/TinyStories`)
+* Community tooling: **PyTorch**, **🤗 Transformers**, **tiktoken**, **vLLM**
+---
+## ✉️ Citation
+If you use this model, please cite this repository:
+```bibtex
+@software{GatorGPT2_2025,
+  author = {Kunj},
+  title = {GatorGPT2: a small decoder-only Transformer with RoPE+GQA},
+  year = {2025},
+  url = {https://huggingface.co/kunjcr2/GatorGPT2}
+}
+```