# 🧠 Mini-LLM – 80M Parameter Transformer (Pretrained From Scratch)

<p align="center">
  <img src="https://huggingface.co/datasets/huggingface/badges/raw/main/openllm.svg" width="120"/>
</p>

**Mini-LLM** is an 80M-parameter decoder-only transformer trained **fully from scratch** using a custom tokenizer, a custom architecture, and a custom training loop.
It is designed as an educational, research-friendly minimal LLM that demonstrates how modern LLM components are built end to end.

---
 
## ✨ Key Features

- **80M parameters** – compact but fully functional LLM
- **Trained from scratch** (no borrowed checkpoints)
- Custom **Byte-Level BPE tokenizer (32k vocab)**
- Modern architecture components (see the sketch after this list):
  - RoPE (Rotary Position Embeddings)
  - RMSNorm
  - SwiGLU feed-forward layer
  - FlashAttention (via PyTorch SDPA)
  - GQA-ready attention implementation
- **2B tokens** of mixed corpus data (FineWeb + WikiText + Wikipedia)
- Training logs, checkpoints, and plots all included for transparency
- Released under a permissive license for research and learning
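 
To make the architecture bullets concrete, here is a minimal, illustrative PyTorch sketch of RMSNorm and a SwiGLU feed-forward block using this model's dimensions (384 embedding dim, 1536 MLP hidden dim). It is a sketch under those assumptions only, not the repository's actual modules; class and argument names are made up for illustration.

```python
# Illustrative sketch only -- not the repo's actual modules.
# Dimensions follow the model card (d_model=384, d_ff=1536).
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square LayerNorm: rescale by the RMS, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: (SiLU(x W_gate) * x W_up) W_down."""
    def __init__(self, dim: int = 384, hidden: int = 1536):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 384)       # (batch, seq, d_model)
y = SwiGLU()(RMSNorm(384)(x))     # pre-norm, then feed-forward
print(y.shape)                    # torch.Size([2, 16, 384])
```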
 
---

## 📐 Model Architecture

| Component | Value |
|-----------|-------|
| Type | Decoder-only transformer |
| Parameters | ~80M |
| Layers | 16 |
| Embedding dim | 384 |
| Attention heads | 6 |
| KV heads | 6 |
| MLP hidden dim | 1536 (SwiGLU) |
| Max sequence length | 2048 |
| Norm | RMSNorm |
| Positional encoding | RoPE |
| Tokenizer | SentencePiece BPE (32k vocab, byte fallback) |
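 
To complement the table, the sketch below shows one hedged way to wire causal self-attention with RoPE on top of PyTorch's `scaled_dot_product_attention` (which dispatches to FlashAttention kernels when available), including the key/value-head repetition that makes the layer GQA-ready. It is illustrative only, based on the numbers above (6 query heads, 6 KV heads, 384-dim embeddings); it is not this checkpoint's actual code, and all names are assumptions.

```python
# Illustrative sketch only -- not the repository's attention module.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to a (batch, heads, seq, head_dim) tensor."""
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, device=x.device).float() / half)
    angles = torch.arange(t, device=x.device).float()[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class CausalSelfAttention(nn.Module):
    def __init__(self, dim: int = 384, n_heads: int = 6, n_kv_heads: int = 6):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        if self.n_kv_heads < self.n_heads:      # GQA: share each KV head across a group of query heads
            rep = self.n_heads // self.n_kv_heads
            k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # FlashAttention path when available
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

attn = CausalSelfAttention()
print(attn(torch.randn(1, 8, 384)).shape)  # torch.Size([1, 8, 384])
```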
 
---

## 📦 Files in This Repo

- `checkpoints/` – pretrained model state_dict + optimizer state
- `safetensors/` – final consolidated `.safetensors` file
- `logs/` – training logs in JSONL
- `plots/` – train/val loss curves
- `tokenizer.json` – HF-compatible tokenizer
- `spm.model` – SentencePiece model
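 
If you prefer to poke at the raw artifacts instead of going through `transformers`, something along these lines should work. Note that the exact file names under `safetensors/` are assumptions in this sketch, so adjust the paths to match the actual repo listing.

```python
# Hedged sketch: file names below (e.g. "safetensors/model.safetensors") are assumptions,
# not guaranteed to match the repo layout exactly.
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from tokenizers import Tokenizer

repo_id = "Ashx098/Mini-LLM"

# Download the consolidated weights and open them as a plain state_dict.
weights_path = hf_hub_download(repo_id, filename="safetensors/model.safetensors")
state_dict = load_file(weights_path)
print(f"{len(state_dict)} tensors, e.g. {next(iter(state_dict))}")

# Load the HF-compatible tokenizer directly with the `tokenizers` library.
tok_path = hf_hub_download(repo_id, filename="tokenizer.json")
tok = Tokenizer.from_file(tok_path)
print(tok.encode("Hello, how are you?").ids)
```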
 
---

## 🧪 Quick Usage (HF Transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)
tok = AutoTokenizer.from_pretrained("Ashx098/Mini-LLM")

prompt = "Hello, how are you?"
inputs = tok(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(outputs[0], skip_special_tokens=True))
```
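 
Greedy decoding on a small base model tends to repeat itself, so sampling usually produces more varied text. Continuing from the snippet above (the sampling hyperparameters here are just reasonable starting points, not values recommended by the model card):

```python
# Sampled generation -- hyperparameters are illustrative defaults, not tuned values.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.8,     # soften the next-token distribution
    top_p=0.95,          # nucleus sampling
)
print(tok.decode(outputs[0], skip_special_tokens=True))
```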
 
## 📊 Training Details

### Optimizer
- **AdamW** (β1 = 0.9, β2 = 0.95, weight decay = 0.1)
- **Learning rate**: 6e-4 (cosine annealing + warmup)

### Batch & Sequence
- **Global batch size** = 32
- **Sequence length** = 2048
- **Gradient accumulation** = 8

### Hardware
- Trained on 1× NVIDIA A100 80GB
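 
For readers who want to mirror this setup, a minimal PyTorch sketch of the optimizer and schedule described above could look like the following. The warmup length and total step count are assumptions, since the card only says "cosine annealing + warmup".

```python
# Hedged sketch of the described optimizer/schedule.
# warmup_steps and max_steps are assumptions, not values from the model card.
import math
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)

peak_lr = 6e-4
warmup_steps, max_steps = 1_000, 30_000   # assumed, not stated on the card

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    """Linear warmup of the LR multiplier, then cosine decay toward zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# With a global batch of 32 sequences of 2048 tokens, each optimizer step sees
# 32 x 2048 = 65,536 tokens, presumably accumulated over 8 micro-batches of 4 sequences.
```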
 
## 📈 Training Curve

<p align="center"> <img src="phase-1-pretraining/plots/loss_curve.png" width="500"> </p>

Final loss reached: ~3.25
 
## 💬 Example Outputs

**Prompt**: "Hello, how are you"
**Output**: "Hello, how are you?"

**Prompt**: "Python is a programming language that"
**Output**: "Python is a programming language that allows the history..."
 
## ⚠️ Limitations

- Small model – limited reasoning; hallucinations are likely
- Not instruction-tuned
- Not suitable for production use
- Best viewed as a learning and research artifact
 
## 📄 License

MIT License – free for research, modification, and further training.
 
## 🙌 Credits

Developed by **Avinash Mynampati**
Built from scratch using PyTorch and a custom training pipeline.

### Want to fine-tune or extend it?

You can:
- Train it further on your own dataset
- Add LoRA adapters (a sketch follows below)
- Use it to learn how attention, RoPE, SwiGLU, etc. are implemented
- Build a tiny instruction-tuned version (coming soon!)
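 
As a concrete starting point for the LoRA option above, here is a hedged sketch using the `peft` library. The `target_modules` names are assumptions and must be replaced with whatever the custom attention projections are actually called (check `model.named_modules()` first); this also assumes the custom model exposes standard `nn.Linear` layers that `peft` can wrap.

```python
# Hedged LoRA sketch with the `peft` library.
# The target_modules names are assumptions -- inspect model.named_modules()
# and substitute the actual projection names from this repo's custom code.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # hypothetical module names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable
```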
 
## 📬 Contact

For questions or collaborations:
- **GitHub**: [Ashx098](https://github.com/Ashx098)
- **LinkedIn**: [Avinash Mynampati](https://linkedin.com/in/avinash-mynampati)