# 🧠 Mini-LLM — 80M Parameter Transformer (Pretrained From Scratch)

<p align="center">
  <img src="https://huggingface.co/datasets/huggingface/badges/raw/main/openllm.svg" width="120"/>
</p>

**Mini-LLM** is an 80M-parameter decoder-only transformer trained **fully from scratch** with a custom tokenizer, custom architecture, and custom training loop.
It is designed as an educational, research-friendly minimal LLM that demonstrates how modern LLM components are built end to end.

---

## ✨ Key Features

- **80M parameters** — compact but fully functional LLM
- **Trained from scratch** (no borrowed checkpoints)
- Custom **byte-level BPE tokenizer (32k vocab)**
- Modern architecture components:
  - RoPE (Rotary Position Embeddings)
  - RMSNorm
  - SwiGLU feed-forward layer
  - FlashAttention (via PyTorch SDPA)
  - GQA-ready attention implementation
- **2B-token** mixed corpus (FineWeb + WikiText + Wikipedia)
- Training logs, checkpoints, and plots included for transparency
- Released under a permissive license for research and learning

---
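The SwiGLU feed-forward layer listed above can be sketched in plain Python (a toy, list-based version for readability; the actual layer uses 384 → 1536 → 384 PyTorch linear projections, per the architecture table below):

```python
import math

def silu(z):
    # SiLU (swish) activation: z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def swiglu_ff(x, W_gate, W_up, W_down):
    """y = W_down @ (SiLU(W_gate @ x) * (W_up @ x)): a gated feed-forward block."""
    matvec = lambda M, v: [sum(a * b for a, b in zip(row, v)) for row in M]
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise gating
    return matvec(W_down, hidden)

# Tiny illustrative weights (2 -> 1 -> 1); real dims are 384 -> 1536 -> 384.
x = [1.0, -1.0]
W_gate, W_up, W_down = [[1.0, 0.0]], [[0.0, 1.0]], [[2.0]]
y = swiglu_ff(x, W_gate, W_up, W_down)
```

The gate path decides how much of the up-projection passes through, which is why SwiGLU needs three weight matrices instead of the two in a vanilla MLP.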
## 📐 Model Architecture

| Component | Value |
|-----------|-------|
| Type | Decoder-only transformer |
| Parameters | ~80M |
| Layers | 16 |
| Embedding dim | 384 |
| Attention heads | 6 |
| KV heads | 6 |
| MLP hidden dim | 1536 (SwiGLU) |
| Max sequence length | 2048 |
| Norm | RMSNorm |
| Positional encoding | RoPE |
| Tokenizer | SentencePiece BPE (32k vocab, byte fallback) |

---
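The RoPE entry in the table rotates consecutive channel pairs of each query/key vector by position-dependent angles. A minimal pure-Python sketch (illustrative only; the model applies this per attention head on PyTorch tensors):

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive pairs of `x` (even length) by angles that depend on `pos`."""
    dim = len(x)
    out = []
    for i in range(0, dim, 2):
        # Lower channel pairs rotate fastest; higher pairs encode coarser positions.
        theta = pos / (base ** (i / dim))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = x[i], x[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out
```

Because rotation is norm-preserving and relative angles depend only on position differences, attention scores between rotated queries and keys encode relative position for free.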
## 📦 Files in This Repo

- `checkpoints/` → pretrained model state_dict + optimizer
- `safetensors/` → final consolidated `.safetensors` file
- `logs/` → training logs in JSONL
- `plots/` → train/val loss curves
- `tokenizer.json` → HF-compatible tokenizer
- `spm.model` → SentencePiece model

---
## 🧪 Quick Usage (HF Transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)
tok = AutoTokenizer.from_pretrained("Ashx098/Mini-LLM")

prompt = "Hello, how are you?"
inputs = tok(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(outputs[0], skip_special_tokens=True))
```

## 🚀 Training Details

### Optimizer
- **AdamW** (β1 = 0.9, β2 = 0.95, weight decay = 0.1)
- **Learning rate**: 6e-4 (linear warmup + cosine annealing)

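The schedule can be sketched as below. The warmup and total step counts are hypothetical placeholders (the card does not state them), and the minimum-LR floor is likewise an assumption; only the 6e-4 peak comes from the card:

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5        # peak LR from the card; the floor is assumed
WARMUP, TOTAL = 2_000, 30_000      # hypothetical step counts, not from the card

def lr_at(step):
    """Linear warmup to MAX_LR, then cosine annealing down to MIN_LR."""
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

The same shape is available off the shelf in PyTorch by chaining a warmup scheduler with `CosineAnnealingLR`.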
### Batch × Sequence
- **Global batch size** = 32
- **Sequence length** = 2048
- **Gradient accumulation** = 8

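Assuming the global batch size counts sequences per optimizer step, these settings imply the following back-of-the-envelope numbers (derived here, not taken from the training logs):

```python
global_batch, seq_len, grad_accum = 32, 2048, 8

micro_batch = global_batch // grad_accum          # 4 sequences per forward/backward pass
tokens_per_step = global_batch * seq_len          # 65,536 tokens per optimizer step
steps_for_2b = 2_000_000_000 // tokens_per_step   # optimizer steps to cover the 2B-token corpus

print(micro_batch, tokens_per_step, steps_for_2b)
```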
### Hardware
- Trained on 1× NVIDIA A100 80GB

## 📊 Training Curve

<p align="center"><img src="phase-1-pretraining/plots/loss_curve.png" width="500"></p>

Final loss reached: ~3.25

## 💬 Example Outputs

**Prompt**: "Hello, how are you"
**Output**: "Hello, how are you?"

**Prompt**: "Python is a programming language that"
**Output**: "Python is a programming language that allows the history..."

## ⚠️ Limitations

- Small model → limited reasoning; hallucinations are likely
- Not instruction-tuned
- Not suitable for production use
- Best viewed as a learning and research artifact

## 📜 License

MIT License — free for research, modification, and further training.

## 🙌 Credits

Developed by **Avinash Mynampati**.
Built from scratch using PyTorch and a custom training pipeline.


### Want to fine-tune or extend it?

You can:
- Train further on your own dataset
- Add LoRA adapters
- Use it to study attention, RoPE, SwiGLU, and more
- Build a tiny instruction-tuned version (coming soon!)

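The LoRA option above can be illustrated with a toy, list-based sketch (not this repo's code; in practice a library such as `peft` handles this). The key property: with the `B` matrix zero-initialized, the adapted layer starts out identical to the frozen base layer.

```python
def matvec(M, v):
    # Plain-Python matrix-vector product for the sketch.
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=16, r=2):
    """y = W x + (alpha / r) * B (A x): W stays frozen, only low-rank A and B train."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

# Standard LoRA init: A arbitrary, B all zeros, so the adapter begins as a no-op.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight (illustrative)
A = [[0.1, 0.2]]               # r x d_in, with r = 1 for brevity
B = [[0.0], [0.0]]             # d_out x r, zero-initialized
y = lora_forward([3.0, 4.0], W, A, B, alpha=1, r=1)
```

Only `A` and `B` receive gradients during fine-tuning, which is why LoRA fits comfortably even where full fine-tuning of all 80M weights would not.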
## 📬 Contact

For questions or collaborations:
- **GitHub**: [Ashx098](https://github.com/Ashx098)
- **LinkedIn**: [Avinash Mynampati](https://linkedin.com/in/avinash-mynampati)