|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: mit |
|
|
tags: |
|
|
- llm |
|
|
- decoder-only |
|
|
- transformer |
|
|
- from-scratch |
|
|
- research |
|
|
- educational |
|
|
- 80m |
|
|
- pytorch |
|
|
- pretraining |
|
|
- custom-architecture |
|
|
pipeline_tag: text-generation |
|
|
inference: |
|
|
parameters: |
|
|
temperature: 0.7 |
|
|
top_p: 0.95 |
|
|
--- |
|
|
|
|
|
# 🧠 Mini-LLM: 80M Parameter Transformer (Pretrained From Scratch)
|
|
|
|
|
|
|
|
|
|
**Mini-LLM** is an 80M-parameter decoder-only transformer trained **fully from scratch** with a custom tokenizer, custom architecture, and custom training loop.

It is designed as an educational, research-friendly minimal LLM that demonstrates how modern LLM components are built end to end.
|
|
|
|
|
--- |
|
|
|
|
|
## ✨ Key Features
|
|
|
|
|
- **80M parameters**: compact but fully functional LLM
- **Trained from scratch** (no borrowed checkpoints)
- Custom **SentencePiece BPE tokenizer (32k vocab, byte-level fallback)**
- Modern architecture components (sketched below):
  - RoPE (Rotary Position Embeddings)
  - RMSNorm
  - SwiGLU feed-forward layer
  - FlashAttention (via PyTorch SDPA)
  - GQA-ready attention implementation
- Trained on a **2B-token** mixed corpus (FineWeb + WikiText + Wikipedia)
- Training logs, checkpoints, and plots included for transparency
- Released under a permissive license for research & learning
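Two of these components in a minimal PyTorch sketch, using the dimensions from the architecture table below. The class and attribute names are illustrative, not the repo's actual modeling code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the last dim (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim: int = 384, hidden_dim: int = 1536):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```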
|
|
|
|
|
--- |
|
|
|
|
|
## 📐 Model Architecture
|
|
|
|
|
| Component | Value |
|-----------|-------|
| Type | Decoder-only transformer |
| Parameters | ~80M |
| Layers | 16 |
| Embedding dim | 384 |
| Attention heads | 6 |
| KV heads | 6 |
| MLP hidden dim | 1536 (SwiGLU) |
| Max sequence length | 2048 |
| Norm | RMSNorm |
| Positional encoding | RoPE |
| Tokenizer | SentencePiece BPE (32k vocab, byte fallback) |
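In config form, the table corresponds to roughly the following (key names here are hypothetical; the real config schema lives in the repo's modeling code):

```python
# Illustrative hyperparameters taken from the table above; key names are hypothetical.
config = {
    "n_layers": 16,
    "d_model": 384,
    "n_heads": 6,            # head_dim = 384 / 6 = 64
    "n_kv_heads": 6,         # equal to n_heads here, but the attention is GQA-ready
    "ffn_hidden_dim": 1536,  # SwiGLU hidden size
    "max_seq_len": 2048,
    "vocab_size": 32000,
    "norm": "rmsnorm",
    "positional_encoding": "rope",
}
```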
|
|
|
|
|
--- |
|
|
|
|
|
## 📦 Files in This Repo
|
|
|
|
|
- `checkpoints/`: Pretrained model `state_dict` + optimizer state
- `safetensors/`: Final consolidated `.safetensors` file
- `logs/`: Training logs in JSONL
- `plots/`: Train/val loss curves
- `tokenizer.json`: HF-compatible tokenizer
- `spm.model`: SentencePiece model
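If you want the raw artifacts rather than the `transformers` wrapper, something like this should work (the `.safetensors` file name below is an assumption; check the repo tree for the actual path):

```python
from safetensors.torch import load_file
from tokenizers import Tokenizer

# "model.safetensors" is a hypothetical file name; check the repo for the real one.
state_dict = load_file("safetensors/model.safetensors")
tokenizer = Tokenizer.from_file("tokenizer.json")

print(f"Loaded {len(state_dict)} tensors")
print(tokenizer.encode("Hello, how are you?").tokens)
```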
|
|
|
|
|
--- |
|
|
|
|
|
## 🧪 Quick Usage (HF Transformers)
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True is needed because the architecture is custom.
model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)
tok = AutoTokenizer.from_pretrained("Ashx098/Mini-LLM")

prompt = "Hello, how are you?"
inputs = tok(prompt, return_tensors="pt")

# Greedy decoding by default; see the sampling example below.
outputs = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(outputs[0], skip_special_tokens=True))
```
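To reproduce the card's suggested inference settings (temperature 0.7, top-p 0.95, from the metadata above), enable sampling in `generate`:

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,   # sampling is required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.95,
)
print(tok.decode(outputs[0], skip_special_tokens=True))
```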
|
|
|
|
|
## 📊 Training Details
|
|
|
|
|
### Optimizer |
|
|
- **AdamW** (β1=0.9, β2=0.95, weight decay=0.1)
|
|
- **Learning rate**: 6e-4 (warmup, then cosine annealing)
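A sketch of an equivalent setup in PyTorch, assuming a `model` is already constructed; the warmup and total step counts below are placeholders, not the values used in training:

```python
import math
import torch

optimizer = torch.optim.AdamW(
    model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1
)

warmup_steps, total_steps = 1_000, 30_000  # placeholder values

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak LR, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```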
|
|
|
|
|
### Batch & Sequence
|
|
- **Global batch size** = 32 |
|
|
- **Sequence length** = 2048 |
|
|
- **Gradient accumulation** = 8 |
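With a global batch of 32 sequences at 2,048 tokens each, every optimizer step covers 32 × 2048 = 65,536 tokens; with 8 accumulation steps that works out to a micro-batch of 4 sequences per forward pass (assuming the global batch size counts sequences).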
|
|
|
|
|
### Hardware |
|
|
- Trained on 1× NVIDIA A100 80GB
|
|
|
|
|
## 📈 Training Curve
|
|
<p align="center"> <img src="https://huggingface.co/Ashx098/Mini-LLM/resolve/main/phase-1-pretraining/plots/loss_curve.png" width="500"> </p> |
|
|
|
|
|
Final loss reached: ~3.25 |
|
|
|
|
|
## 💬 Example Outputs
|
|
|
|
|
**Prompt**: "Hello, how are you" |
|
|
**Output**: "Hello, how are you?" |
|
|
|
|
|
**Prompt**: "Python is a programming language that" |
|
|
**Output**: "Python is a programming language that allows the history..." |
|
|
|
|
|
## ⚠️ Limitations
|
|
- Small model: limited reasoning ability; hallucinations are likely
|
|
- Not instruction-tuned |
|
|
- Not suitable for production usage |
|
|
- Best viewed as a learning + research artifact |
|
|
|
|
|
## 📄 License
|
|
MIT License: free for research, modification, and further training.
|
|
|
|
|
## 🙏 Credits
|
|
Developed by **Avinash Mynampati** |
|
|
Built from scratch with PyTorch and a custom training pipeline.
|
|
|
|
|
### Want to fine-tune or extend it? |
|
|
You can:

- Train it further on your own dataset
- Add LoRA adapters (see the sketch below)
- Use it to study attention, RoPE, SwiGLU, and other modern components
- Build a tiny instruction-tuned version (coming soon!)
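For the LoRA route, a sketch using the `peft` library; the `target_modules` names below are assumptions and must be matched to the projection names in this repo's custom modeling code:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    # Hypothetical names: inspect model.named_modules() for the actual projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```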
|
|
|
|
|
## 📬 Contact
|
|
For questions or collaborations: |
|
|
- **GitHub**: [Ashx098](https://github.com/Ashx098) |
|
|
- **LinkedIn**: [Avinash Mynampati](https://linkedin.com/in/avinash-mynampati) |
|
|
|