# LUNA - 100M Parameter LLM from Scratch
Custom ~100M parameter GPT model (Pythia-like architecture) pretrained on 4.5B tokens of clean English text.
## Quick Start (RunPod / Cloud GPU)
### 1. Clone & Install (one command)
```bash
git clone https://huggingface.co/spaces/ASTERIZER/LUNA /workspace/LUNA && \
cd /workspace/LUNA && \
pip install -q -r requirements.txt
```
### 2. Get Dataset + Train (one command)
The dataset (~4.5B tokens) is hosted as a zip at [ASTERIZER/Luna_Dataset](https://huggingface.co/datasets/ASTERIZER/Luna_Dataset). The script downloads, extracts, and starts training automatically.
**From HuggingFace (recommended):**
```bash
bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset
```
**From Google Drive:**
```bash
bash setup_and_train.sh gdrive YOUR_GDRIVE_FOLDER_ID
```
**Smoke test (10M tokens only):**
```bash
bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset 10000000
```
That's it. The script auto-detects your GPU, VRAM, RAM, and CPU cores, then configures everything for maximum utilization.
---
## How It Works
### Auto vs Manual Config
All hyperparameters live in `train_config.yaml`:
```yaml
auto_config: true    # auto-detect everything from hardware
# auto_config: false # alternative: use exact values below, no overrides
```
When `auto_config: true` (default), the trainer:
- **Probes VRAM** via binary search to find max micro_batch_size (82% safety)
- **Sets grad_accum** to hit the target global_batch_size
- **Picks precision** (bf16 on Ampere+, fp16 otherwise)
- **Scales workers** to half your CPU cores, capped by RAM
- **Enables torch.compile** if Triton is available (Linux)
When `auto_config: false`, every value in the YAML is used exactly as-is.
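
The VRAM probe and gradient-accumulation steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `train.py`: `fits`, `probe_micro_batch`, and `plan_batches` are hypothetical names, `fits` stands in for a real trial forward/backward pass that catches out-of-memory errors, and treating the 82% figure as a scale-down factor on the largest batch that fits is an assumption.

```python
import math
from typing import Callable


def probe_micro_batch(fits: Callable[[int], bool], upper: int = 256) -> int:
    """Binary-search the largest batch size in [1, upper] for which fits() is True.

    Assumes fits() is monotonic: if a batch size fits, every smaller one fits too.
    Returns 0 if nothing fits.
    """
    lo, hi, best = 1, upper, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best = mid      # mid fits, so try something larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid is too big, so try something smaller
    return best


def plan_batches(max_fit: int, global_batch: int, safety: float = 0.82) -> tuple[int, int]:
    """Return (micro_batch, grad_accum) for a target global batch size.

    ASSUMPTION: the safety factor scales down the largest batch that fit.
    """
    micro = max(1, min(global_batch, int(max_fit * safety)))
    grad_accum = math.ceil(global_batch / micro)
    return micro, grad_accum


# Toy stand-in for a real memory probe: pretend any batch up to 15 fits.
max_fit = probe_micro_batch(lambda b: b <= 15)
micro, accum = plan_batches(max_fit, global_batch=120)
print(max_fit, micro, accum)
```

With this toy probe the plan comes out as a micro batch of 12 with 10 accumulation steps, the same split listed in the verified configs further down.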
### CLI Overrides
Any config value can be overridden from the command line:
```bash
python train.py --config train_config.yaml --data_path /data/litdata --max_tokens 100000000
```
Priority: CLI args > train_config.yaml > auto-detection
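
One way to picture this priority order is a layered lookup where the first source that defines a key wins. The sketch below is illustrative only: the keys and values are made up, and `train.py` may merge its sources differently.

```python
from collections import ChainMap

# All keys and values here are illustrative, not read from train_config.yaml.
auto_detected = {"micro_batch_size": 12, "precision": "bf16"}
yaml_config = {"precision": "fp16", "data_path": "/data/litdata"}
cli_args = {"max_tokens": 100_000_000, "data_path": None}  # None = flag not passed

# Drop CLI flags that were not supplied so they cannot mask lower layers.
cli_set = {k: v for k, v in cli_args.items() if v is not None}

# ChainMap searches its mappings left to right, so earlier ones take priority.
config = dict(ChainMap(cli_set, yaml_config, auto_detected))
print(config)
```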
---
## Dataset
- **4,515,286,950 tokens** (4.5B) in 270 binary chunks
- Sources: Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned)
- Format: LitData binary (int32, block_size=1025, TokensLoader)
- Tokenizer: EleutherAI/pythia-160m (50,254 vocab)
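
A block size of 1025 is one token longer than the 1,024-token context, which is the usual layout for next-token prediction: each block yields 1,024 inputs and 1,024 targets shifted by one position. The sketch below assumes that convention (the loader's actual slicing is not shown in this README) and checks how the total token count divides into blocks; `split_block` is a hypothetical helper.

```python
BLOCK_SIZE = 1025
TOTAL_TOKENS = 4_515_286_950

# How many full blocks the dataset holds, and whether any tokens are left over.
n_blocks, remainder = divmod(TOTAL_TOKENS, BLOCK_SIZE)
print(n_blocks, remainder)


def split_block(block: list[int]) -> tuple[list[int], list[int]]:
    """ASSUMED convention: inputs drop the last token, targets drop the first."""
    return block[:-1], block[1:]


block = list(range(BLOCK_SIZE))  # toy token ids standing in for real data
inputs, targets = split_block(block)
print(len(inputs), len(targets))
```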
## Model Architecture
| Parameter | Value |
|-----------|-------|
| Layers | 10 |
| Hidden dim | 768 |
| Attention heads | 12 |
| Vocab size | 50,304 (padded) |
| Context length | 1,024 |
| Total params | ~109M (~71M in transformer blocks, ~39M in the tied embedding matrix) |
| Rotary % | 25% |
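
The total in the table can be reproduced from the layer count, hidden size, and padded vocab. The sketch below assumes a GPT-NeoX-style block (biases on every linear layer, two LayerNorms per block, a 4x MLP expansion, a final LayerNorm) and a single embedding matrix shared with the output head; the README does not spell out the block layout, so treat these as assumptions. `count_params` is a hypothetical helper.

```python
def count_params(n_layer: int, d: int, vocab: int) -> dict[str, int]:
    """Parameter count for an ASSUMED GPT-NeoX-style decoder with tied embeddings.

    Rotary position embeddings add no learned parameters.
    """
    embedding = vocab * d                            # shared with the output head
    layer_norms = 2 * (2 * d)                        # two norms, weight + bias each
    attention = (d * 3 * d + 3 * d) + (d * d + d)    # fused QKV + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)      # up and down projections
    blocks = n_layer * (layer_norms + attention + mlp)
    final_norm = 2 * d
    return {
        "embedding": embedding,
        "blocks": blocks,
        "total": embedding + blocks + final_norm,
    }


counts = count_params(n_layer=10, d=768, vocab=50_304)
print(counts)
```

Under these assumptions the total comes to 109,513,728, matching the exact parameter count listed in the SFT model table.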
## File Structure
```
LUNA/
train.py # Main training script (config-driven, auto-detects hardware)
train_config.yaml # All hyperparameters (auto_config: true/false)
fetch_data.py # Downloads dataset from HuggingFace / GDrive
setup_and_train.sh # One-command cloud entrypoint
benchmark_runpod.py # Local benchmark + RunPod cost calculator
requirements.txt # Python dependencies
Base/
checkpoints/EleutherAI/pythia-160m/ # Tokenizer files
configs/ # Legacy litgpt YAML configs (reference only)
scripts/ # Data preprocessing scripts
```
## Estimated Training Times (RunPod)
| GPU | $/hr | tok/s | Hours | Cost USD | Cost INR |
|-----|------|-------|-------|----------|----------|
| RTX A5000 | $0.16 | ~6,400 | ~196h | ~$31 | ~2,700 |
| RTX 3090 | $0.22 | ~7,600 | ~165h | ~$36 | ~3,100 |
| RTX 4090 | $0.34 | ~10,000 | ~125h | ~$42 | ~3,600 |
| RTX 5090 | $0.69 | ~16,000 | ~78h | ~$54 | ~4,600 |
| H100 NVL | $2.59 | ~43,000 | ~29h | ~$75 | ~6,400 |
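
The hours and cost columns follow from the total token count, the throughput, and the hourly price. A minimal sketch of that arithmetic, using rows from the table (`estimate_run` is a hypothetical helper, not part of `benchmark_runpod.py`):

```python
def estimate_run(total_tokens: int, tokens_per_sec: float, usd_per_hour: float) -> tuple[float, float]:
    """Return (hours, cost_usd) for one pass over total_tokens."""
    hours = total_tokens / tokens_per_sec / 3600
    return hours, hours * usd_per_hour


TOTAL_TOKENS = 4_515_286_950

# (tokens/sec, $/hr) taken from the table above.
gpus = {
    "RTX A5000": (6_400, 0.16),
    "RTX 4090": (10_000, 0.34),
    "H100 NVL": (43_000, 2.59),
}

for name, (tps, price) in gpus.items():
    hours, cost = estimate_run(TOTAL_TOKENS, tps, price)
    print(f"{name}: ~{hours:.0f} h, ~${cost:.0f}")
```

Results can differ from the table by a dollar or so depending on where rounding is applied.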
## Resume Training
Training auto-saves `latest.pt` every `save_interval` steps. If interrupted, just re-run the same command; it picks up where it left off.
---
## Verified Configs (What Worked)
These are the exact configurations that produced the current LUNA 100M model.
Do NOT change them unless you know what you're doing; they are proven and validated.
---
### 1. Pretraining: 4.5 Billion Tokens
The pretraining ran in two phases on an RTX 4060 Ti 16GB.
**Phase 1: Bulk pretraining on 3B general web tokens**
| Parameter | Value |
|-----------|-------|
| Dataset | `litdata_3b`: deduplicated, quality-filtered (score ≥ 0.96) general web |
| Total tokens | 3,000,000,000 (3B) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 500-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 1000 steps |
| Seed | 1337 |
| Tokenizer | EleutherAI/pythia-160m (vocab 50,254) |
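
The learning-rate schedule in this table can be sketched as below. It is a minimal illustration under stated assumptions: warmup is taken to be linear, the global batch size is read as 120 sequences of 1,024 tokens, and `lr_at` is a hypothetical helper rather than the function used in `train.py`.

```python
import math


def lr_at(step: int, total_steps: int, max_lr: float = 6e-4, min_lr: float = 6e-5,
          warmup_steps: int = 500) -> float:
    """Linear warmup to max_lr (ASSUMED), then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + (max_lr - min_lr) * cosine


# Optimizer steps implied by the table: tokens / (global batch * sequence length).
total_steps = 3_000_000_000 // (120 * 1024)

for step in (0, 499, 500, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```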
**Phase 2: Continued pretraining on clean English (Wikipedia + FineWeb-Edu)**
| Parameter | Value |
|-----------|-------|
| Dataset | `litdata_english`: ultra-clean Wikipedia + FineWeb-Edu |
| Total tokens | 150,000,000 (150M), ~3 epochs over ~50M unique tokens |
| Init weights | Phase 1 checkpoint (`custom-100m-3b-full/final_raw`) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=1e-4, min_lr=1e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 200-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 500 steps |
**Final combined dataset used for the production run:**
| Parameter | Value |
|-----------|-------|
| Dataset | `litdata_pretrain_final`: all sources merged |
| Total tokens | 4,515,286,950 (~4.5B) in 270 chunks |
| Sources | Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned pure English) |
| Format | LitData binary (int32, block_size=1025, EOS=0) |
| Config file | `train_config.yaml` |
| Precision | bf16 |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine with 500-step warmup (5% of total steps when auto) |
| Gradient clip | max_norm=1.0 |
| torch.compile | true (Linux/cloud with Triton) |
| auto_config | true (probes VRAM, CPU, RAM at runtime) |
---
### 2. SFT Fine-Tuning: ~145 Million Tokens
Supervised fine-tuning on the pretrained LUNA 100M checkpoint.
| Parameter | Value |
|-----------|-------|
| Dataset | `Base/Datasets/sft_clean/`: 574,996 train + 5,808 val samples |
| Format | Alpaca JSON (instruction / input / output) |
| Estimated tokens | ~145M per epoch (574,996 samples × ~250 tokens avg), seen for 2 epochs |
| Epochs | 2 |
| Config file | `sft_config.yaml` |
**Model (frozen architecture, matches pretrain exactly):**
| Parameter | Value |
|-----------|-------|
| vocab_size | 50,304 (padded to 128 multiple) |
| seq_len | 1024 |
| n_layer | 10 |
| n_embd | 768 |
| n_head | 12 |
| Rotary % | 25% |
| Total params | 109,513,728 |
**Training hyperparameters:**
| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW (lr=1.5e-5, min_lr=1e-6, weight_decay=0.01, betas=[0.9, 0.95]) |
| Precision | bf16 |
| Global batch size | 64 (micro_batch=8 × grad_accum=8) |
| LR warmup | 200 steps |
| Gradient clip | max_norm=1.0 |
| Save interval | Every 500 steps |
| Eval interval | Every 500 steps (runs val loss + eval prompts) |
| DataLoader | 4 workers, pin_memory=true |
| torch.compile | false |
**Prompt format (used during training; must be matched at inference):**
```
### Instruction:
{instruction}

### Response:
```
With optional input field:
```
### Instruction:
{instruction}

### Input:
{input}

### Response:
```
**Loss masking:** Only the response tokens (after `### Response:\n`) contribute to the loss.
The prompt tokens are masked out (loss_mask=0). EOS token (id=0) is appended to every response.
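
A minimal sketch of this masking scheme, using toy token ids in place of real tokenizer output. `build_example` and `masked_mean` are hypothetical helpers, not functions from `sft_train.py`; including the EOS token in the loss is an assumption, and the one-position shift between inputs and targets is left out for brevity.

```python
EOS_ID = 0  # EOS id stated above for the Pythia tokenizer


def build_example(prompt_ids: list[int], response_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt + response + EOS and mark which tokens are trained on."""
    input_ids = prompt_ids + response_ids + [EOS_ID]
    # 0 for prompt tokens, 1 for response tokens and (ASSUMED) the trailing EOS.
    loss_mask = [0] * len(prompt_ids) + [1] * (len(response_ids) + 1)
    return input_ids, loss_mask


def masked_mean(per_token_loss: list[float], loss_mask: list[int]) -> float:
    """Average the loss over unmasked (response) positions only."""
    kept = sum(loss_mask)
    return sum(l * m for l, m in zip(per_token_loss, loss_mask)) / max(kept, 1)


# Toy ids standing in for a tokenized prompt and response.
ids, mask = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22])
print(ids, mask)
print(masked_mean([5.0, 5.0, 5.0, 1.0, 2.0, 3.0], mask))
```

The high losses on the three prompt positions do not affect the result, because only the masked-in positions are averaged.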
---
### 3. SFT Inference / Chat: Loaded Configs
These are the exact generation parameters loaded when running `chat.py` or `validate_sft.py`.
They match the training eval config from `sft_train.py`.
```bash
python chat.py --ckpt "Base/out/sft/model.pth"
```
**Model loading:**
| Parameter | Value |
|-----------|-------|
| Checkpoint | `Base/out/sft/model.pth` (419 MB, raw state_dict, 154 keys) |
| Checkpoint format | Raw `state_dict`, NOT wrapped in `{"model": ...}` dict |
| Tokenizer | `Base/checkpoints/EleutherAI/pythia-160m` (vocab 50,254) |
| EOS token ID | 0 (pythia tokenizer, NOT 50276) |
| Device | auto (CUDA if available, else CPU) |
| Precision | float32 at inference (weights loaded as-is from bf16-trained ckpt) |
**Generation parameters:**
| Parameter | Value | Why |
|-----------|-------|-----|
| temperature | 0.7 | Balanced creativity vs coherence |
| top_k | 40 | Matches training eval (NOT 50) |
| top_p | 0.9 | Nucleus sampling cutoff |
| repetition_penalty | 1.0 | No penalty; matches training (NOT 1.1) |
| max_new_tokens | 150 | Matches training eval (NOT 256) |
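
To show how temperature, top_k, and top_p interact, here is a small pure-Python sampler using the values from the table as defaults. It is a sketch, not the code in `chat.py`: `sample_token` is a hypothetical function, and applying top-k before top-p is an assumed ordering.

```python
import math
import random
from typing import Optional


def sample_token(logits: list[float], temperature: float = 0.7, top_k: int = 40,
                 top_p: float = 0.9, rng: Optional[random.Random] = None) -> int:
    """Sample one token id using temperature, then top-k, then top-p filtering."""
    rng = rng or random.Random()
    # Temperature scaling followed by a numerically stable (unnormalized) softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # Top-k: keep the k most likely token ids, most likely first.
    ranked = sorted(range(len(logits)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break
    # Draw from the surviving tokens in proportion to their weights.
    return rng.choices(kept, weights=[weights[i] for i in kept], k=1)[0]


# Toy logits over a 4-token vocabulary; a fixed seed keeps the draw repeatable.
print(sample_token([2.0, 1.0, 0.1, -3.0], rng=random.Random(0)))
```

Lower temperature sharpens the distribution before filtering, so fewer tokens survive the top-p cutoff.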
**Prompt template (must match training exactly):**
```python
def format_prompt(instruction, context=""):
    if instruction and context:
        return f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
    else:
        return f"### Instruction:\n{instruction}\n\n### Response:\n"
```
**Critical notes:**
- There is NO Alpaca preamble text (e.g., "Below is an instruction..."); the model was never trained with one
- EOS token is id=0 (pythia), not 50276 (GPT-NeoX); using the wrong EOS causes the model to never stop
- Generation stops when EOS is produced OR max_new_tokens is reached
- For longer responses in chat, you can override: `--max_new 512`
- For less repetition in production, add: `--rep_pen 1.05`
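
The stopping rule in the notes above amounts to a simple loop. The sketch below is illustrative: `generate` and `next_token` are hypothetical names, and `next_token` stands in for a model forward pass plus sampling.

```python
from typing import Callable

EOS_ID = 0  # Pythia tokenizer EOS id, as stated above


def generate(next_token: Callable[[list[int]], int], prompt_ids: list[int],
             max_new_tokens: int = 150) -> list[int]:
    """Generate until EOS is produced or max_new_tokens is reached."""
    new_tokens: list[int] = []
    for _ in range(max_new_tokens):
        token = next_token(prompt_ids + new_tokens)
        if token == EOS_ID:
            break                      # stop on EOS and do not return it
        new_tokens.append(token)
    return new_tokens


# Toy "model": emits 7, 8, 9 and then EOS, whatever the context is.
script = iter([7, 8, 9, EOS_ID, 5])
print(generate(lambda ctx: next(script), prompt_ids=[1, 2, 3]))
```

If the loop compared against a token id the model never emits, only the `max_new_tokens` limit would end generation, which is the failure mode the EOS note warns about.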
**Validation results with these configs (100 complex examples):**
| Metric | Value |
|--------|-------|
| Overall Grade | A |
| Avg Loss (CE) | 1.9167 |
| Avg Perplexity | 7.45 |
| Token Accuracy | 58.6% |
| BLEU-1 | 0.589 |
| BLEU-2 | 0.219 |
| Empty responses | 0/100 |
| Repetitive responses | 5/100 |
---
## License
Private / ASTERIZER 2026