---
language:
- en
license: mit
tags:
- llm
- decoder-only
- transformer
- from-scratch
- research
- educational
- 80m
- pytorch
- pretraining
- custom-architecture
pipeline_tag: text-generation
inference:
  parameters:
    temperature: 0.7
    top_p: 0.95
---

# 🧠 Mini-LLM – 80M Parameter Transformer (Pretrained From Scratch)

**Mini-LLM** is an 80M-parameter decoder-only transformer trained **fully from scratch** with a custom tokenizer, custom architecture, and custom training loop.
It is designed as an educational, research-friendly minimal LLM that demonstrates how modern LLM components are built end to end.

---

## ✨ Key Features

- **80M parameters** – compact but fully functional LLM
- **Trained from scratch** (no borrowed checkpoints)
- Custom **BPE tokenizer** (32k vocab, byte fallback)
- Modern architecture components (sketched below):
  - RoPE (Rotary Position Embeddings)
  - RMSNorm
  - SwiGLU feed-forward layer
  - FlashAttention (via PyTorch SDPA)
  - GQA-ready attention implementation
- **2B tokens** of mixed corpus (FineWeb + WikiText + Wikipedia)
- Training logs, checkpoints, and plots included for transparency
- Released under a permissive license for research and learning
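
Two of the components above are compact enough to show in full. Here is a minimal PyTorch sketch of RMSNorm and the SwiGLU feed-forward block, using the dimensions from the table below (384/1536); it illustrates the techniques, not Mini-LLM's exact source:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the activations,
    with a learned gain and no mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: silu(W1 x) * (W3 x), projected back by W2."""
    def __init__(self, dim: int = 384, hidden_dim: int = 1536):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```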

---

## 📐 Model Architecture

| Component | Value |
|-----------|-------|
| Type | Decoder-only transformer |
| Parameters | ~80M |
| Layers | 16 |
| Embedding dim | 384 |
| Attention heads | 6 |
| KV heads | 6 |
| MLP hidden dim | 1536 (SwiGLU) |
| Max sequence length | 2048 |
| Norm | RMSNorm |
| Positional encoding | RoPE |
| Tokenizer | SentencePiece BPE (32k vocab, byte fallback) |
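
If you want these numbers in code, the table corresponds to a config along the following lines. The class and field names here are illustrative, not the repo's actual config:

```python
from dataclasses import dataclass

@dataclass
class MiniLLMConfig:
    # Illustrative mirror of the architecture table; the repo's real
    # config object may use different names.
    vocab_size: int = 32_000
    n_layers: int = 16
    d_model: int = 384
    n_heads: int = 6
    n_kv_heads: int = 6      # equal to n_heads here, but GQA-ready
    ffn_hidden: int = 1536   # SwiGLU hidden dim
    max_seq_len: int = 2048
```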

---

## 📦 Files in This Repo

- `checkpoints/` – pretrained model state_dict + optimizer
- `safetensors/` – final consolidated `.safetensors` file
- `logs/` – training logs in JSONL
- `plots/` – train/val loss curves
- `tokenizer.json` – HF-compatible tokenizer
- `spm.model` – SentencePiece model
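
Both tokenizer artifacts can be loaded and sanity-checked on their own (file names as listed above):

```python
# HF-compatible tokenizer from tokenizer.json
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
print(tok.encode("Hello, how are you?").tokens)

# Raw SentencePiece model
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")
print(sp.encode("Hello, how are you?", out_type=str))
```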

---

## 🧪 Quick Usage (HF Transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)
tok = AutoTokenizer.from_pretrained("Ashx098/Mini-LLM")

prompt = "Hello, how are you?"
inputs = tok(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(outputs[0], skip_special_tokens=True))
```
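
The call above decodes greedily. To match the sampling settings declared in this card's metadata (temperature 0.7, top-p 0.95), enable sampling explicitly:

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,   # switch from greedy decoding to sampling
    temperature=0.7,  # values from the card's inference settings
    top_p=0.95,
)
print(tok.decode(outputs[0], skip_special_tokens=True))
```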

## 📊 Training Details

### Optimizer
- **AdamW** (β1=0.9, β2=0.95, weight decay=0.1)
- **Learning rate**: 6e-4 (cosine annealing with warmup)
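
In PyTorch this setup looks roughly as follows, reusing `model` from the usage example; the warmup and total step counts are placeholders, since the card doesn't report them:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,     # placeholder, not reported in the card
    num_training_steps=30_000,  # placeholder, not reported in the card
)
```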

### Batch × Sequence
- **Global batch size** = 32
- **Sequence length** = 2048
- **Gradient accumulation** = 8
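
Each optimizer step therefore covers 32 × 2048 = 65,536 tokens. Assuming the global batch of 32 is built from 8 accumulation steps over micro-batches of 4 (the card doesn't state the micro-batch size), and an HF-style forward that returns `.loss`, a training step looks roughly like:

```python
accum_steps = 8  # micro-batch of 4 x 8 accumulation steps = global batch of 32 (assumed split)

optimizer.zero_grad()
for step, batch in enumerate(train_loader):  # train_loader: your own DataLoader
    loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    (loss / accum_steps).backward()  # scale so accumulated gradients average
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```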

### Hardware
- Trained on 1× NVIDIA A100 80GB

## 📈 Training Curve
<p align="center"> <img src="https://huggingface.co/Ashx098/Mini-LLM/resolve/main/phase-1-pretraining/plots/loss_curve.png" width="500"> </p>

Final loss reached: ~3.25

## 💬 Example Outputs

**Prompt**: "Hello, how are you"
**Output**: "Hello, how are you?"

**Prompt**: "Python is a programming language that"
**Output**: "Python is a programming language that allows the history..."

## ⚠️ Limitations
- Small model – limited reasoning; hallucinations are likely
- Not instruction-tuned
- Not suitable for production use
- Best viewed as a learning and research artifact

## 📄 License
MIT License – free for research, modification, and further training.

## 🙌 Credits
Developed by **Avinash Mynampati**.
Built from scratch using PyTorch and a custom training pipeline.

### Want to fine-tune or extend it?
You can:
- Train it further on your own dataset
- Add LoRA adapters (see the sketch below)
- Use it to study attention, RoPE, SwiGLU, and more
- Build a tiny instruction-tuned version (coming soon!)
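
As one example, a LoRA setup with the `peft` library might look like this. Because Mini-LLM is a custom architecture loaded with `trust_remote_code=True`, the `target_modules` names below are assumptions; print the model and substitute its real projection-layer names:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    # Hypothetical module names: inspect the model (print(model))
    # and replace with the actual attention projection names.
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
```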

## 📬 Contact
For questions or collaborations:
- **GitHub**: [Ashx098](https://github.com/Ashx098)
- **LinkedIn**: [Avinash Mynampati](https://linkedin.com/in/avinash-mynampati)
|