---
language:
- en
license: mit
tags:
- llm
- decoder-only
- transformer
- from-scratch
- research
- educational
- 80m
- pytorch
- pretraining
- custom-architecture
pipeline_tag: text-generation
inference:
  parameters:
    temperature: 0.7
    top_p: 0.95
---

# 🧠 Mini-LLM: 80M Parameter Transformer (Pretrained From Scratch)

[![MIT License](https://img.shields.io/badge/license-MIT-green.svg)]()
[![Model Size](https://img.shields.io/badge/params-80M-blue.svg)]()

**Mini-LLM** is an 80M parameter decoder-only transformer trained **fully from scratch** using a custom tokenizer, custom architecture, and custom training loop.  
It is designed as an educational + research-friendly minimal LLM that demonstrates how modern LLM components are built end-to-end.

---

## ✨ Key Features

- **80M parameters**: compact but fully functional  
- **Trained from scratch** (no borrowed checkpoints)  
- Custom **SentencePiece BPE tokenizer** (32k vocab, byte fallback)  
- Modern architecture components:
  - RoPE (Rotary Position Embeddings)
  - RMSNorm
  - SwiGLU FeedForward layer
  - FlashAttention (via PyTorch SDPA)
  - GQA-ready Attention implementation
- **2B tokens** mixed corpus (FineWeb + WikiText + Wikipedia)
- Training logs, checkpoints, and plots are all included for transparency
- Released under a permissive license for research & learning
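
The components listed above follow standard formulations. As a minimal PyTorch sketch (module and variable names here are illustrative, not necessarily the repo's actual ones), RMSNorm and the SwiGLU feed-forward layer look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: scale by the reciprocal RMS; no mean subtraction, unlike LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: SiLU-gated up-projection, LLaMA-style."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```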

---

## πŸ“ Model Architecture

| Component | Value |
|----------|-------|
| Type | Decoder-only transformer |
| Parameters | ~80M |
| Layers | 16 |
| Embedding dim | 384 |
| Attention heads | 6 |
| KV Heads | 6 |
| MLP Hidden Dim | 1536 (SwiGLU) |
| Max sequence length | 2048 |
| Norm | RMSNorm |
| Positional Encoding | RoPE |
| Tokenizer | SentencePiece BPE (32k vocab, byte fallback) |
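
RoPE, listed in the table, rotates pairs of query/key channels by a position-dependent angle instead of adding position vectors. A self-contained sketch of the standard formulation (NeoX-style half-split pairing; not necessarily the exact variant in this repo):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq, heads, head_dim).

    Channel i is paired with channel i + head_dim//2, and each pair is rotated
    by angle position * base**(-i / (head_dim//2)).
    """
    seq, _, hd = x.shape
    half = hd // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs  # (seq, half)
    cos = angles.cos()[:, None, :]  # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

At position 0 every angle is zero, so the first token's vectors pass through unchanged.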

---

## 📦 Files in This Repo

- `checkpoints/` → Pretrained model state_dict + optimizer
- `safetensors/` → Final consolidated .safetensors file
- `logs/` → Training logs in JSONL
- `plots/` → Train/val loss curves
- `tokenizer.json` → HF-compatible tokenizer
- `spm.model` → SentencePiece model

---

## 🧪 Quick Usage (HF Transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)
tok = AutoTokenizer.from_pretrained("Ashx098/Mini-LLM")

prompt = "Hello, how are you?"
inputs = tok(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(outputs[0], skip_special_tokens=True))
```
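
The default inference parameters in the card's metadata (temperature 0.7, top_p 0.95) correspond to temperature-scaled nucleus sampling. A minimal sketch of one sampling step, independent of the model (function name is illustrative):

```python
import torch

def sample_top_p(logits: torch.Tensor, temperature: float = 0.7, top_p: float = 0.95) -> int:
    """One nucleus-sampling step over a 1D logits vector: temperature-scale,
    keep the smallest prefix of tokens whose cumulative probability exceeds
    top_p (always keeping the top token), renormalize, and sample."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Mask tokens outside the nucleus; the top token is never masked.
    mask = cumulative - sorted_probs > top_p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()
```

Passing `do_sample=True, temperature=0.7, top_p=0.95` to `model.generate` applies the same scheme inside Transformers.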

## 🚀 Training Details

### Optimizer
- **AdamW** (β₁=0.9, β₂=0.95, weight decay=0.1)
- **Learning rate**: 6e-4 (cosine annealing + warmup)
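
This optimizer setup can be reproduced with a `LambdaLR` schedule. A sketch with the hyperparameters above; the warmup and total step counts are illustrative assumptions, not values from this card:

```python
import math
import torch

model = torch.nn.Linear(384, 384)  # stand-in for the transformer
opt = torch.optim.AdamW(
    model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1
)

warmup_steps, total_steps = 1000, 30000  # illustrative, not from the card

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```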

### Batch × Sequence
- **Global batch size** = 32
- **Sequence length** = 2048
- **Gradient accumulation** = 8
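
These numbers imply a micro-batch of 32 / 8 = 4 sequences per forward pass and 32 × 2048 = 65,536 tokens per optimizer step. A sketch of the accumulation loop on a stand-in model (the real training loop is in the repo; this only illustrates the arithmetic and scaling):

```python
import torch

global_batch, seq_len, accum_steps = 32, 2048, 8
micro_batch = global_batch // accum_steps   # 4 sequences per forward pass
tokens_per_step = global_batch * seq_len    # 65,536 tokens per optimizer step

model = torch.nn.Linear(16, 1)  # stand-in for the transformer
opt = torch.optim.SGD(model.parameters(), lr=0.1)

opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(micro_batch, 16)
    # Divide by accum_steps so the accumulated gradient is the mean,
    # matching a single large-batch step.
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()
opt.step()  # one optimizer step per 8 accumulated micro-batches
```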

### Hardware
- Trained on 1× NVIDIA A100 80GB

## 📊 Training Curve
<p align="center"> <img src="https://huggingface.co/Ashx098/Mini-LLM/resolve/main/phase-1-pretraining/plots/loss_curve.png" width="500"> </p>

Final loss reached: ~3.25

## 💬 Example Outputs

**Prompt**: "Hello, how are you"  
**Output**: "Hello, how are you?"

**Prompt**: "Python is a programming language that"  
**Output**: "Python is a programming language that allows the history..."

## ⚠️ Limitations
- Small model → limited reasoning; hallucinations are likely
- Not instruction-tuned
- Not suitable for production usage
- Best viewed as a learning + research artifact

## 📜 License
MIT License: free for research, modification, and further training.

## 🙌 Credits
Developed by **Avinash Mynampati**  
Built from scratch using PyTorch + custom training pipeline.

### Want to fine-tune or extend it?
You can:
- Train further with your own dataset
- Add LoRA adapters
- Use it to learn attention, RoPE, SwiGLU, etc.
- Build a tiny instruction-tuned version (coming soon!)
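
For the LoRA route, the idea is to freeze the pretrained weights and train only a low-rank update on selected linear layers. A minimal sketch of one such wrapper (class and field names are illustrative; libraries like PEFT do this for you):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with B zero-initialized."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay fixed
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scale
```

Because B starts at zero, the wrapped layer initially reproduces the base model exactly; only `lora_a`/`lora_b` receive gradients during fine-tuning.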

## 📬 Contact
For questions or collaborations:
- **GitHub**: [Ashx098](https://github.com/Ashx098)
- **LinkedIn**: [Avinash Mynampati](https://linkedin.com/in/avinash-mynampati)