# MEMGPT

A GPT-2-style large language model (LLM) repository. This implementation includes full support for distributed training, sharded datasets, benchmark evaluation, and efficient text generation.

---

## 🔧 Features

- Transformer architecture based on GPT-2.
- Configurable training and model hyperparameters via JSON.
- Sharded dataset loading from `.npy` files.
- Mixed-precision training with `torch.autocast`.
- DDP (DistributedDataParallel) support.
- Evaluation support with HellaSwag.
- Modular codebase for easy extensibility.

---

## 📁 Project Structure

```bash
MEMGPT/
├── configs/
│   └── config.json          # Model and training configuration
│
├── data/
│   ├── edu_fineweb/         # Sharded training data
│   │   ├── train_000001.npy
│   │   ├── train_000002.npy
│   │   └── test_000001.npy
│   ├── hellaswag/
│   │   └── hellaswag_val.jsonl
│   └── fineweb.py           # Dataset sharding/processing logic
│
├── model_core/
│   ├── __init__.py
│   ├── attention.py         # Self-attention module
│   ├── model.py             # GPT-2 model architecture
│   ├── dataloader.py        # DataLoader_1 class
│   └── training.py          # train_nanogpt function
│
├── scripts/
│   ├── train.py             # Entry point to start training
│   ├── evaluate.py          # Run evaluation
│   └── generate.py          # Generate text from a trained model
│
├── evaluation/
│   ├── __init__.py
│   ├── hellaswag.py         # HellaSwag dataset preparation
│   └── val_hellaswag.py     # HellaSwag scoring function
│
├── logs/
│   ├── log.txt              # Training log file
│   └── model_xxxxx.pt       # Checkpoint files
│
├── .gitignore
├── README.md
└── requirements.txt
```

---

## ⚙️ Configuration

Edit `configs/config.json` to configure your model and training setup.

Example:
```json
{
  "model": {
    "block_size": 1024,
    "vocab_size": 50304,
    "n_layer": 12,
    "n_head": 12,
    "n_embd": 768
  },
  "training": {
    "max_steps": 19073,
    "log_dir": "log",
    "total_batch_size": 524288,
    "B": 64,
    "T": 1024,
    "max_lr": 0.0006,
    "min_lr": 0.00006,
    "warmup_steps": 715,
    "weight_decay": 0.1,
    "learning_rate": 0.0006
  }
}
```
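
The training entry point is expected to read this file; here is a minimal sketch of how the config might be loaded (the `load_config` helper below is illustrative, not the repo's exact code):

```python
import json

def load_config(path="configs/config.json"):
    """Read model and training hyperparameters from the JSON config."""
    with open(path) as f:
        cfg = json.load(f)
    return cfg["model"], cfg["training"]

model_cfg, train_cfg = load_config()
print(model_cfg["n_layer"], train_cfg["max_lr"])  # 12 0.0006
```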

---

## 🚀 Training

To start training the model:

```bash
python scripts/train.py
```

This script calls `train_nanogpt()` from `model_core/training.py`, using the configuration in `configs/config.json`.
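
Mixed-precision training is one of the listed features; for reference, a minimal sketch of what a `bfloat16` autocast step typically looks like in PyTorch (illustrative, not the exact body of `train_nanogpt`):

```python
import torch

# model, optimizer, x, y are assumed to already exist on the GPU
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits, loss = model(x, y)   # forward pass runs in bfloat16 where safe
loss.backward()                  # backward runs outside the autocast context
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # common stabilizer
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```

Note that `bfloat16`, unlike `float16`, needs no gradient scaler.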

### Optional: Distributed Training

To run training across multiple GPUs using PyTorch DDP:

```bash
torchrun --nproc_per_node=NUM_GPUS scripts/train.py
```

Replace `NUM_GPUS` with the number of GPUs you want to use.
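
`torchrun` sets environment variables such as `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`; a training script typically detects DDP mode like this (a sketch of the standard pattern, which this repo is assumed to follow):

```python
import os
import torch
import torch.distributed as dist

ddp = int(os.environ.get("RANK", -1)) != -1  # torchrun sets RANK; a plain run does not
if ddp:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = f"cuda:{local_rank}"
    torch.cuda.set_device(device)
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"
```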

---

## 📊 Evaluation

To evaluate on HellaSwag:

```bash
python scripts/evaluate.py
```

Make sure the `hellaswag_val.jsonl` file is available under `data/hellaswag/`.
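
HellaSwag is a multiple-choice benchmark: the model is scored on which of four candidate endings it assigns the lowest average completion loss. A sketch of that idea (the `pick_ending` helper is hypothetical; the real logic lives in `evaluation/val_hellaswag.py`):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_ending(model, ctx_tokens, endings, device="cuda"):
    """Return the index of the candidate ending with the lowest mean token loss."""
    losses = []
    for ending in endings:  # each ending is a list of token ids
        tokens = torch.tensor([ctx_tokens + ending], device=device)
        logits, _ = model(tokens)          # assumes the model returns (logits, loss)
        n = len(ending)
        pred = logits[0, -n - 1:-1]        # logits that predict the ending tokens
        tgt = tokens[0, -n:]               # the ending tokens themselves
        losses.append(F.cross_entropy(pred, tgt).item())
    return min(range(len(losses)), key=losses.__getitem__)
```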

---

## ✍️ Text Generation

To generate text from a trained model:

```bash
python scripts/generate.py
```

Make sure the generation script points to the correct checkpoint under the `logs/` directory.
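
A minimal sketch of what such a script might do, assuming a nanoGPT-style model that returns `(logits, loss)` and the GPT-2 `tiktoken` tokenizer (the checkpoint path and model reconstruction are placeholders for the repo's actual layout):

```python
import torch
import tiktoken

ckpt = torch.load("logs/model_xxxxx.pt", map_location="cuda")
# ... rebuild the model from the checkpoint's config and load its weights here ...

enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor([enc.encode("Hello, I'm a language model,")], device="cuda")

model.eval()
with torch.no_grad():
    for _ in range(50):                       # generate 50 new tokens
        logits, _ = model(tokens)
        probs = torch.softmax(logits[0, -1], dim=-1)
        topk_probs, topk_idx = torch.topk(probs, 50)        # top-k sampling, k=50
        next_tok = topk_idx[torch.multinomial(topk_probs, 1)]
        tokens = torch.cat([tokens, next_tok.view(1, 1)], dim=1)

print(enc.decode(tokens[0].tolist()))
```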

---

## 🧩 Requirements

Install the required packages:

```bash
pip install -r requirements.txt
```

---

## 📌 Notes

- Ensure your sharded `.npy` data is placed under `data/edu_fineweb/`.
- Logs and checkpoints are saved in `logs/`.
- The `DataLoader_1` class handles distributed data loading; a minimal sketch follows this list.
- Training uses `bfloat16` autocasting for better efficiency.
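
For reference, here is how sharded `.npy` batches are typically served in this style of trainer (the class below is an illustrative stand-in for `DataLoader_1`, not its actual code):

```python
import glob
import numpy as np
import torch

class ShardedLoader:
    """Each process strides through the token stream, offset by its rank."""

    def __init__(self, data_dir, B, T, rank=0, world_size=1, split="train"):
        self.B, self.T = B, T
        self.rank, self.world_size = rank, world_size
        self.shards = sorted(glob.glob(f"{data_dir}/{split}_*.npy"))
        self.shard_idx = 0
        self._load_shard(0)

    def _load_shard(self, idx):
        self.tokens = torch.from_numpy(np.load(self.shards[idx]).astype(np.int64))
        self.pos = self.B * self.T * self.rank

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)   # inputs
        y = buf[1:].view(B, T)    # targets, shifted one token to the right
        self.pos += B * T * self.world_size
        if self.pos + B * T + 1 > len(self.tokens):   # shard exhausted
            self.shard_idx = (self.shard_idx + 1) % len(self.shards)
            self._load_shard(self.shard_idx)
        return x, y
```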

---

## 📮 License

MIT License. Feel free to modify and build upon this for research or commercial use.

---

## 🙌 Acknowledgements

Inspired by Andrej Karpathy's nanoGPT. Special thanks to Andrej Karpathy's YouTube tutorials and the open-source AI community.