# LUNA - 100M Parameter LLM from Scratch

Custom ~100M parameter GPT model (Pythia-like architecture) pretrained on 4.5B tokens of clean English text.

## Quick Start (RunPod / Cloud GPU)

### 1. Clone & Install (one command)

```bash
git clone https://huggingface.co/spaces/ASTERIZER/LUNA /workspace/LUNA && \
cd /workspace/LUNA && \
pip install -q -r requirements.txt
```

### 2. Get Dataset + Train (one command)

The dataset (~4.5B tokens) is hosted as a zip at [ASTERIZER/Luna_Dataset](https://huggingface.co/datasets/ASTERIZER/Luna_Dataset). The script downloads, extracts, and starts training automatically.

**From HuggingFace (recommended):**
```bash
bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset
```

**From Google Drive:**
```bash
bash setup_and_train.sh gdrive YOUR_GDRIVE_FOLDER_ID
```

**Smoke test (10M tokens only):**
```bash
bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset 10000000
```

That's it. The script auto-detects your GPU, VRAM, RAM, and CPU core count, then configures training for maximum utilization.

---

## How It Works

### Auto vs Manual Config

All hyperparameters live in `train_config.yaml`:

```yaml
auto_config: true     # auto-detect everything from hardware (default)
# auto_config: false  # use the exact values in this file, no overrides
```

When `auto_config: true` (default), the trainer:
- **Probes VRAM** via binary search to find max micro_batch_size (82% safety)
- **Sets grad_accum** to hit the target global_batch_size
- **Picks precision** (bf16 on Ampere+, fp16 otherwise)
- **Scales workers** to half your CPU cores, capped by RAM
- **Enables torch.compile** if Triton is available (Linux)

When `auto_config: false`, every value in the YAML is used exactly as-is.
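
The VRAM probe and the `grad_accum` calculation described above can be sketched roughly as follows. This is a minimal illustration, not the code in `train.py`: `fits` is a hypothetical callback standing in for a trial forward/backward pass under the VRAM safety margin, and `max_micro_batch` and `grad_accum_for` are names made up for this example.

```python
import math
from typing import Callable


def max_micro_batch(fits: Callable[[int], bool], upper: int = 256) -> int:
    """Binary-search the largest batch size in [1, upper] for which fits() is True.

    Assumes fits() is monotonic: if a batch size fits, every smaller one fits too.
    Returns 0 if even a batch size of 1 does not fit.
    """
    lo, hi, best = 1, upper, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best = mid      # mid fits; try something larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid is too large; try something smaller
    return best


def grad_accum_for(global_batch: int, micro_batch: int) -> int:
    """Accumulation steps needed to reach at least the target global batch."""
    return math.ceil(global_batch / micro_batch)


# Toy stand-in for a VRAM probe: pretend anything up to 12 sequences fits.
micro = max_micro_batch(lambda b: b <= 12)
accum = grad_accum_for(120, micro)
print(micro, accum)  # 12 10
```

With a micro batch of 12 and a target global batch of 120, this gives the 12 × 10 split listed in the verified configs.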

### CLI Overrides

Any config value can be overridden from the command line:

```bash
python train.py --config train_config.yaml --data_path /data/litdata --max_tokens 100000000
```

Priority: CLI args > train_config.yaml > auto-detection
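
Taking the priority line at face value, the merge could look like the sketch below. The function name `resolve_config` and the example keys and values are hypothetical; the real `train.py` may resolve settings differently.

```python
from typing import Any, Dict, Optional


def resolve_config(
    auto: Dict[str, Any],
    yaml_cfg: Dict[str, Any],
    cli: Dict[str, Optional[Any]],
) -> Dict[str, Any]:
    """Merge settings so CLI args beat YAML values, which beat auto-detected ones."""
    merged = dict(auto)        # lowest priority: auto-detected values
    merged.update(yaml_cfg)    # YAML overrides auto-detection
    # Only CLI flags the user actually passed (not None) override the rest.
    merged.update({k: v for k, v in cli.items() if v is not None})
    return merged


cfg = resolve_config(
    auto={"micro_batch_size": 12, "precision": "bf16"},
    yaml_cfg={"precision": "fp16", "max_tokens": 4_515_286_950},
    cli={"max_tokens": 100_000_000, "data_path": None},
)
print(cfg)
```

Here `max_tokens` comes from the command line, `precision` from the YAML, and `micro_batch_size` from auto-detection; the unset `data_path` flag is ignored.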

---

## Dataset

- **4,515,286,950 tokens** (4.5B) in 270 binary chunks
- Sources: Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned)
- Format: LitData binary (int32, block_size=1025, TokensLoader)
- Tokenizer: EleutherAI/pythia-160m (50,254 vocab)
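
The `block_size=1025` figure is one token longer than the 1,024-token context. A common reason for this layout is next-token prediction: each block yields 1,024 inputs and 1,024 targets shifted by one position. The sketch below works through that arithmetic. It assumes (the source does not confirm this) that blocks are split this way, and it borrows the global batch of 120 sequences from the verified configs.

```python
TOTAL_TOKENS = 4_515_286_950   # from the dataset summary
BLOCK_SIZE = 1025              # tokens stored per block
BYTES_PER_TOKEN = 4            # int32
GLOBAL_BATCH = 120             # sequences per optimizer step (assumption)

num_blocks, leftover = divmod(TOTAL_TOKENS, BLOCK_SIZE)
size_gb = TOTAL_TOKENS * BYTES_PER_TOKEN / 1e9
steps_per_epoch = num_blocks // GLOBAL_BATCH


def split_block(block):
    """Turn one block of token ids into (inputs, targets) shifted by one."""
    return block[:-1], block[1:]


x, y = split_block(list(range(BLOCK_SIZE)))  # dummy token ids 0..1024
print(num_blocks, leftover)      # 4405158 0
print(round(size_gb, 2))         # 18.06
print(steps_per_epoch)           # 36709
print(len(x), len(y), y[0])      # 1024 1024 1
```

The token count divides evenly into 4,405,158 blocks, or roughly 18 GB of raw int32 data.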

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Layers | 10 |
| Hidden dim | 768 |
| Attention heads | 12 |
| Vocab size | 50,304 (padded) |
| Context length | 1,024 |
| Total params | ~109.5M (~71M excluding the tied embedding matrix) |
| Rotary % | 25% |
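
The parameter total can be reproduced from the table with a short calculation. This sketch assumes a GPT-NeoX-style block (fused QKV projection, 4× MLP expansion, biases on every linear layer, two LayerNorms per block plus a final one) and a single embedding matrix shared with the output head. None of these details are stated explicitly in the table, and `count_params` is an illustrative helper, not part of the repository.

```python
def count_params(vocab: int, d_model: int, n_layer: int, mlp_ratio: int = 4) -> dict:
    """Count parameters for a GPT-NeoX-style decoder with biases and tied embeddings."""
    d_ff = mlp_ratio * d_model
    embedding = vocab * d_model                    # shared by input and output head
    attn = d_model * 3 * d_model + 3 * d_model     # fused QKV projection + bias
    attn += d_model * d_model + d_model            # attention output projection + bias
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    norms = 2 * (2 * d_model)                      # two LayerNorms (weight + bias) per block
    per_layer = attn + mlp + norms
    final_norm = 2 * d_model
    non_embedding = n_layer * per_layer + final_norm
    return {
        "embedding": embedding,
        "per_layer": per_layer,
        "non_embedding": non_embedding,
        "total": embedding + non_embedding,
    }


counts = count_params(vocab=50_304, d_model=768, n_layer=10)
print(counts["non_embedding"])  # 70880256
print(counts["total"])          # 109513728
```

Under those assumptions the total is 109,513,728, which matches the exact count listed in the SFT section. Rotary embeddings add no trainable parameters.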

## File Structure

```
LUNA/
  train.py              # Main training script (config-driven, auto-detects hardware)
  train_config.yaml     # All hyperparameters (auto_config: true/false)
  fetch_data.py         # Downloads dataset from HuggingFace / GDrive
  setup_and_train.sh    # One-command cloud entrypoint
  benchmark_runpod.py   # Local benchmark + RunPod cost calculator
  requirements.txt      # Python dependencies
  Base/
    checkpoints/EleutherAI/pythia-160m/   # Tokenizer files
    configs/             # Legacy litgpt YAML configs (reference only)
    scripts/             # Data preprocessing scripts
```

## Estimated Training Times (RunPod)

| GPU | $/hr | tok/s | Hours | Cost USD | Cost INR |
|-----|------|-------|-------|----------|----------|
| RTX A5000 | $0.16 | ~6,400 | ~196h | ~$31 | ~2,700 |
| RTX 3090 | $0.22 | ~7,600 | ~165h | ~$36 | ~3,100 |
| RTX 4090 | $0.34 | ~10,000 | ~125h | ~$42 | ~3,600 |
| RTX 5090 | $0.69 | ~16,000 | ~78h | ~$54 | ~4,600 |
| H100 NVL | $2.59 | ~43,000 | ~29h | ~$75 | ~6,400 |

## Resume Training

Training auto-saves `latest.pt` every `save_interval` steps. If a run is interrupted, re-run the same command -- training resumes from the last saved checkpoint.

---

## Verified Configs (What Worked)

These are the exact configurations that produced the current LUNA 100M model.
Do NOT change them unless you know what you're doing; they are proven and validated.

---

### 1. Pretraining - 4.5 Billion Tokens

The pretraining ran in two phases on an RTX 4060 Ti 16GB.

**Phase 1: Bulk pretraining on 3B general web tokens**

| Parameter | Value |
|-----------|-------|
| Dataset | `litdata_3b`: deduplicated, quality-filtered (score ≥ 0.96) general web |
| Total tokens | 3,000,000,000 (3B) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 500-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 1000 steps |
| Seed | 1337 |
| Tokenizer | EleutherAI/pythia-160m (vocab 50,254) |

**Phase 2: Continued pretraining on clean English (Wikipedia + FineWeb-Edu)**

| Parameter | Value |
|-----------|-------|
| Dataset | `litdata_english`: ultra-clean Wikipedia + FineWeb-Edu |
| Total tokens | 150,000,000 (150M), ~3 epochs over ~50M unique tokens |
| Init weights | Phase 1 checkpoint (`custom-100m-3b-full/final_raw`) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=1e-4, min_lr=1e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 200-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 500 steps |

**Final combined dataset used for the production run:**

| Parameter | Value |
|-----------|-------|
| Dataset | `litdata_pretrain_final`: all sources merged |
| Total tokens | 4,515,286,950 (~4.5B) in 270 chunks |
| Sources | Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned pure English) |
| Format | LitData binary (int32, block_size=1025, EOS=0) |
| Config file | `train_config.yaml` |
| Precision | bf16 |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine with 500-step warmup (5% of total steps when auto) |
| Gradient clip | max_norm=1.0 |
| torch.compile | true (Linux/cloud with Triton) |
| auto_config | true (probes VRAM, CPU, RAM at runtime) |
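
The learning-rate schedule in the table can be sketched as below. The warmup shape is an assumption (linear), since the source only says "500-step warmup". `total_steps=36_709` is also an assumption: it is roughly one pass over 4,405,158 blocks at a global batch of 120. `lr_at` is an illustrative name, not a function from `train.py`.

```python
import math


def lr_at(step: int, total_steps: int, max_lr: float = 6e-4,
          min_lr: float = 6e-5, warmup_steps: int = 500) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 36_709  # assumed: about one epoch at global batch 120
for step in (0, 499, 10_000, 20_000, 30_000, TOTAL_STEPS):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

The rate peaks at 6e-4 when warmup ends and reaches the 6e-5 floor on the final step.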

---

### 2. SFT Fine-Tuning - ~145 Million Tokens

Supervised fine-tuning on the pretrained LUNA 100M checkpoint.

| Parameter | Value |
|-----------|-------|
| Dataset | `Base/Datasets/sft_clean/`: 574,996 train + 5,808 val samples |
| Format | Alpaca JSON (instruction / input / output) |
| Estimated tokens | ~145M per epoch (574,996 samples × ~250 tokens avg); trained for 2 epochs |
| Epochs | 2 |
| Config file | `sft_config.yaml` |

**Model (frozen architecture, matches pretrain exactly):**

| Parameter | Value |
|-----------|-------|
| vocab_size | 50,304 (padded to 128 multiple) |
| seq_len | 1024 |
| n_layer | 10 |
| n_embd | 768 |
| n_head | 12 |
| Rotary % | 25% |
| Total params | 109,513,728 |

**Training hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW (lr=1.5e-5, min_lr=1e-6, weight_decay=0.01, betas=[0.9, 0.95]) |
| Precision | bf16 |
| Global batch size | 64 (micro_batch=8 × grad_accum=8) |
| LR warmup | 200 steps |
| Gradient clip | max_norm=1.0 |
| Save interval | Every 500 steps |
| Eval interval | Every 500 steps (runs val loss + eval prompts) |
| DataLoader | 4 workers, pin_memory=true |
| torch.compile | false |

**Prompt format (used during training; must be matched at inference):**

```
### Instruction:
{instruction}

### Response:
```

With optional input field:

```
### Instruction:
{instruction}

### Input:
{input}

### Response:
```

**Loss masking:** Only the response tokens (after `### Response:\n`) contribute to the loss.
The prompt tokens are masked out (loss_mask=0). EOS token (id=0) is appended to every response.
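
A minimal sketch of that masking, using dummy token ids in place of real tokenizer output. The helper names are illustrative and not taken from `sft_train.py`. Counting the appended EOS token toward the loss is an assumption, and the one-position shift between inputs and targets is left out for brevity.

```python
EOS_ID = 0  # EOS token id for the pythia tokenizer, per this README


def build_example(prompt_ids, response_ids, eos_id=EOS_ID):
    """Concatenate prompt + response + EOS and mask the prompt out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    loss_mask = [0] * len(prompt_ids) + [1] * (len(response_ids) + 1)
    return input_ids, loss_mask


def masked_mean(per_token_loss, loss_mask):
    """Average the loss over positions where loss_mask == 1."""
    kept = [loss for loss, m in zip(per_token_loss, loss_mask) if m]
    return sum(kept) / len(kept)


# Dummy token ids; a real pipeline would get these from the tokenizer.
ids, mask = build_example(prompt_ids=[101, 102, 103], response_ids=[201, 202])
print(ids)   # [101, 102, 103, 201, 202, 0]
print(mask)  # [0, 0, 0, 1, 1, 1]
print(masked_mean([9.0, 9.0, 9.0, 1.0, 2.0, 3.0], mask))  # 2.0
```

The large losses on the three prompt positions do not affect the average, so only response quality drives the gradient.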

---

### 3. SFT Inference / Chat - Loaded Configs

These are the exact generation parameters loaded when running `chat.py` or `validate_sft.py`.
They match the training eval config from `sft_train.py`.

```bash
python chat.py --ckpt "Base/out/sft/model.pth"
```

**Model loading:**

| Parameter | Value |
|-----------|-------|
| Checkpoint | `Base/out/sft/model.pth` (419 MB, raw state_dict, 154 keys) |
| Checkpoint format | Raw `state_dict`, NOT wrapped in a `{"model": ...}` dict |
| Tokenizer | `Base/checkpoints/EleutherAI/pythia-160m` (vocab 50,254) |
| EOS token ID | 0 (pythia tokenizer, NOT 50276) |
| Device | auto (CUDA if available, else CPU) |
| Precision | float32 at inference (weights loaded as-is from bf16-trained ckpt) |

**Generation parameters:**

| Parameter | Value | Why |
|-----------|-------|-----|
| temperature | 0.7 | Balanced creativity vs coherence |
| top_k | 40 | Matches training eval (NOT 50) |
| top_p | 0.9 | Nucleus sampling cutoff |
| repetition_penalty | 1.0 | No penalty; matches training (NOT 1.1) |
| max_new_tokens | 150 | Matches training eval (NOT 256) |
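
To show how these parameters interact, here is a pure-Python sketch of one sampling step and a stop-on-EOS loop. It is not the code in `chat.py`: the order of operations (temperature, then top-k, then top-p) is an assumption, `next_logits` is a hypothetical callback standing in for a model forward pass, and the repetition penalty is omitted because a value of 1.0 leaves the logits unchanged.

```python
import math
import random


def sample_next(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    """Sample one token id from raw logits using temperature, top-k, then top-p."""
    # Temperature scaling, then keep the top_k highest-scoring token ids.
    scaled = sorted(((l / temperature, i) for i, l in enumerate(logits)), reverse=True)
    scaled = scaled[:top_k]

    # Softmax over the surviving candidates (subtract the max for stability).
    peak = scaled[0][0]
    weights = [math.exp(s - peak) for s, _ in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus cut: keep the smallest prefix whose cumulative mass reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for p, (_, idx) in zip(probs, scaled):
        kept_ids.append(idx)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # random.choices renormalises the weights of the survivors.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]


def generate(next_logits, prompt_ids, max_new_tokens=150, eos_id=0, rng=random):
    """Sample until EOS is produced or max_new_tokens is reached."""
    out = []
    for _ in range(max_new_tokens):
        tok = sample_next(next_logits(list(prompt_ids) + out), rng=rng)
        if tok == eos_id:
            break
        out.append(tok)
    return out


# One token dominates, so the nucleus shrinks to that single candidate.
print(sample_next([0.0, 10.0, 0.0]))  # 1
```

A lower `top_k` or `top_p` narrows the candidate set before sampling, which is why changing either value away from the training-time eval settings can change output quality.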

**Prompt template (must match training exactly):**

```python
def format_prompt(instruction, context=""):
    if instruction and context:
        return f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
    else:
        return f"### Instruction:\n{instruction}\n\n### Response:\n"
```

**Critical notes:**
- There is NO Alpaca preamble text (e.g., "Below is an instruction..."); the model was never trained with one
- EOS token is id=0 (pythia), not 50276 (GPT-NeoX); using the wrong EOS causes the model to never stop
- Generation stops when EOS is produced OR max_new_tokens is reached
- For longer responses in chat, you can override: `--max_new 512`
- For less repetition in production, add: `--rep_pen 1.05`

**Validation results with these configs (100 complex examples):**

| Metric | Value |
|--------|-------|
| Overall Grade | A |
| Avg Loss (CE) | 1.9167 |
| Avg Perplexity | 7.45 |
| Token Accuracy | 58.6% |
| BLEU-1 | 0.589 |
| BLEU-2 | 0.219 |
| Empty responses | 0/100 |
| Repetitive responses | 5/100 |

---

## License

Private / ASTERIZER 2026