# LUNA - 100M Parameter LLM from Scratch
Custom ~100M parameter GPT model (Pythia-like architecture) pretrained on 4.5B tokens of clean English text.
## Quick Start (RunPod / Cloud GPU)
### 1. Clone & Install (one command)
```bash
git clone https://huggingface.co/spaces/ASTERIZER/LUNA /workspace/LUNA && \
cd /workspace/LUNA && \
pip install -q -r requirements.txt
```
### 2. Get Dataset + Train (one command)
The dataset (~4.5B tokens) is hosted as a zip at [ASTERIZER/Luna_Dataset](https://huggingface.co/datasets/ASTERIZER/Luna_Dataset). The script downloads, extracts, and starts training automatically.
**From HuggingFace (recommended):**
```bash
bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset
```
**From Google Drive:**
```bash
bash setup_and_train.sh gdrive YOUR_GDRIVE_FOLDER_ID
```
**Smoke test (10M tokens only):**
```bash
bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset 10000000
```
That's it. The script auto-detects your GPU, VRAM, RAM, and CPU cores, then configures everything for maximum utilization.
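If you'd rather fetch the dataset by hand instead of through `setup_and_train.sh`, the same download takes a few lines of Python. This is a minimal sketch of what `fetch_data.py` automates, not its actual code; the zip filename below is hypothetical (check the dataset page for the real one).
```python
import shutil
from huggingface_hub import hf_hub_download

# Download the dataset zip from the HF dataset repo named above.
zip_path = hf_hub_download(
    repo_id="ASTERIZER/Luna_Dataset",
    filename="luna_dataset.zip",  # hypothetical filename -- check the repo
    repo_type="dataset",
)
# Extract the LitData chunks where train.py expects them.
shutil.unpack_archive(zip_path, "/workspace/data/litdata")
```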
---
## How It Works
### Auto vs Manual Config
All hyperparameters live in `train_config.yaml`:
```yaml
auto_config: true    # auto-detect everything from hardware
# auto_config: false # use exact values below, no overrides
```
When `auto_config: true` (default), the trainer:
- **Probes VRAM** via binary search to find the max micro_batch_size (82% safety margin; see the sketch below)
- **Sets grad_accum** to hit the target global_batch_size
- **Picks precision** (bf16 on Ampere+, fp16 otherwise)
- **Scales workers** to half your CPU cores, capped by RAM
- **Enables torch.compile** if Triton is available (Linux)
When `auto_config: false`, every value in the YAML is used exactly as-is.
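For intuition, here is a minimal sketch of a binary-search VRAM probe like the one described above; it is not the code in `train.py`. The idea: try a micro batch, catch CUDA OOM during forward + backward (the memory peak), and narrow the window, then apply the 82% margin.
```python
import torch

def probe_micro_batch_size(model, seq_len=1024, vocab_size=50304, hi=64, safety=0.82):
    """Binary-search the largest micro batch that fits in VRAM.

    Sketch only (assumes `model(x)` returns logits and lives on CUDA);
    the real probe may account for more state, e.g. optimizer buffers.
    `safety` mirrors the 82% margin mentioned above.
    """
    lo, best = 1, 1
    while lo <= hi:
        mid = (lo + hi) // 2
        try:
            x = torch.randint(0, vocab_size, (mid, seq_len), device="cuda")
            model(x).float().mean().backward()  # forward + backward = peak memory
            best, lo = mid, mid + 1             # fits: search larger
        except torch.cuda.OutOfMemoryError:
            hi = mid - 1                        # OOM: search smaller
        finally:
            model.zero_grad(set_to_none=True)
            torch.cuda.empty_cache()
    return max(1, int(best * safety))
```
`grad_accum` then falls out as `global_batch_size // micro_batch_size`.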
### CLI Overrides
Any config value can be overridden from the command line:
```bash
python train.py --config train_config.yaml --data_path /data/litdata --max_tokens 100000000
```
Priority: CLI args > train_config.yaml > auto-detection
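A minimal sketch of that precedence, assuming the YAML parses to a flat dict (the real `train.py` may structure this differently):
```python
import argparse
import yaml

def resolve_config(auto_defaults: dict) -> dict:
    """Merge settings so that CLI args > train_config.yaml > auto-detection."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="train_config.yaml")
    parser.add_argument("--data_path")
    parser.add_argument("--max_tokens", type=int)
    cli = parser.parse_args()

    with open(cli.config) as f:
        file_cfg = yaml.safe_load(f) or {}

    cfg = dict(auto_defaults)   # lowest priority: auto-detected values
    cfg.update(file_cfg)        # YAML overrides auto-detection
    cfg.update({k: v for k, v in vars(cli).items() if v is not None})  # CLI wins
    return cfg
```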
---
## Dataset
- **4,515,286,950 tokens** (4.5B) in 270 binary chunks
- Sources: Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned)
- Format: LitData binary (int32, block_size=1025, TokensLoader)
- Tokenizer: EleutherAI/pythia-160m (50,254 vocab)
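Reading the chunks back follows the usual LitData pattern. A sketch, assuming the `litdata` package; the actual dataloader setup in `train.py` may differ:
```python
from litdata.streaming import StreamingDataset, TokensLoader
from torch.utils.data import DataLoader

# block_size=1025 stores 1024 input positions plus 1 shifted target per sample.
dataset = StreamingDataset(
    input_dir="/workspace/data/litdata",            # extracted chunk directory
    item_loader=TokensLoader(block_size=1025),
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=12, num_workers=4, pin_memory=True)

for batch in loader:                                # batch: (B, 1025) int tensor
    inputs, targets = batch[:, :-1], batch[:, 1:]   # next-token prediction pair
    break
```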
## Model Architecture
| Parameter | Value |
|-----------|-------|
| Layers | 10 |
| Hidden dim | 768 |
| Attention heads | 12 |
| Vocab size | 50,304 (tokenizer's 50,254 padded to a multiple of 128) |
| Context length | 1,024 |
| Total params | ~109M (~70M non-embedding; input/output embeddings tied) |
| Rotary % | 25% |
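The same table as code, for quick reference. Field names here are illustrative; `train.py` may use different ones:
```python
from dataclasses import dataclass

@dataclass
class LunaConfig:
    n_layer: int = 10                # transformer blocks
    n_embd: int = 768                # hidden dim
    n_head: int = 12                 # attention heads
    vocab_size: int = 50304          # 50,254 tokenizer vocab, padded
    block_size: int = 1024           # context length
    rotary_percentage: float = 0.25  # fraction of head dim with RoPE
    tie_embeddings: bool = True      # input/output embeddings share weights
```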
## File Structure
```
LUNA/
train.py # Main training script (config-driven, auto-detects hardware)
train_config.yaml # All hyperparameters (auto_config: true/false)
fetch_data.py # Downloads dataset from HuggingFace / GDrive
setup_and_train.sh # One-command cloud entrypoint
benchmark_runpod.py # Local benchmark + RunPod cost calculator
requirements.txt # Python dependencies
Base/
checkpoints/EleutherAI/pythia-160m/ # Tokenizer files
configs/ # Legacy litgpt YAML configs (reference only)
scripts/ # Data preprocessing scripts
```
## Estimated Training Times (RunPod)
Estimates assume the full ~4.5B-token run at the listed throughput.
| GPU | $/hr | tok/s | Hours | Cost (USD) | Cost (INR) |
|-----|------|-------|-------|------------|------------|
| RTX A5000 | $0.16 | ~6,400 | ~196h | ~$31 | ~2,700 |
| RTX 3090 | $0.22 | ~7,600 | ~165h | ~$36 | ~3,100 |
| RTX 4090 | $0.34 | ~10,000 | ~125h | ~$42 | ~3,600 |
| RTX 5090 | $0.69 | ~16,000 | ~78h | ~$54 | ~4,600 |
| H100 NVL | $2.59 | ~43,000 | ~29h | ~$75 | ~6,400 |
## Resume Training
Training auto-saves `latest.pt` every `save_interval` steps. If interrupted, just re-run the same command -- it picks up where it left off.
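A sketch of the resume pattern (the checkpoint key names here are assumptions; the real layout is defined by `train.py`):
```python
import os
import torch

def maybe_resume(model, optimizer, out_dir):
    """Return the step to resume from, loading latest.pt if one exists."""
    ckpt_path = os.path.join(out_dir, "latest.pt")
    if not os.path.exists(ckpt_path):
        return 0                                   # fresh run
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])           # key names are assumptions
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```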
---
## Verified Configs (What Worked)
These are the exact configurations that produced the current LUNA 100M model.
Do NOT change them unless you know what you're doing; they are proven and validated.
---
### 1. Pretraining: 4.5 Billion Tokens
The pretraining ran in two phases on an RTX 4060 Ti 16GB.
**Phase 1: Bulk pretraining on 3B general web tokens**
| Parameter | Value |
|-----------|-------|
| Dataset | `litdata_3b`: deduplicated, quality-filtered (score ≥ 0.96) general web |
| Total tokens | 3,000,000,000 (3B) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 500-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 1000 steps |
| Seed | 1337 |
| Tokenizer | EleutherAI/pythia-160m (vocab 50,254) |
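Both phases use the same schedule shape: linear warmup, then cosine decay from lr down to min_lr. A minimal sketch with the Phase 1 numbers:
```python
import math

def lr_at(step, max_steps, lr=6e-4, min_lr=6e-5, warmup=500):
    """Linear warmup for `warmup` steps, then cosine decay to min_lr."""
    if step < warmup:
        return lr * (step + 1) / warmup
    t = (step - warmup) / max(1, max_steps - warmup)  # 0 -> 1 over the decay
    return min_lr + 0.5 * (lr - min_lr) * (1.0 + math.cos(math.pi * t))
```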
**Phase 2: Continued pretraining on clean English (Wikipedia + FineWeb-Edu)**
| Parameter | Value |
|-----------|-------|
| Dataset | `litdata_english`: ultra-clean Wikipedia + FineWeb-Edu |
| Total tokens | 150,000,000 (150M); ~3 epochs over ~50M unique tokens |
| Init weights | Phase 1 checkpoint (`custom-100m-3b-full/final_raw`) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=1e-4, min_lr=1e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 200-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 500 steps |
**Final combined dataset used for the production run:**
| Parameter | Value |
|-----------|-------|
| Dataset | `litdata_pretrain_final`: all sources merged |
| Total tokens | 4,515,286,950 (~4.5B) in 270 chunks |
| Sources | Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned pure English) |
| Format | LitData binary (int32, block_size=1025, EOS=0) |
| Config file | `train_config.yaml` |
| Precision | bf16 |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine with 500-step warmup (5% of total steps when auto) |
| Gradient clip | max_norm=1.0 |
| torch.compile | true (Linux/cloud with Triton) |
| auto_config | true (probes VRAM, CPU, RAM at runtime) |
---
### 2. SFT Fine-Tuning: ~145 Million Tokens
Supervised fine-tuning on the pretrained LUNA 100M checkpoint.
| Parameter | Value |
|-----------|-------|
| Dataset | `Base/Datasets/sft_clean/`: 574,996 train + 5,808 val samples |
| Format | Alpaca JSON (instruction / input / output) |
| Estimated tokens | ~145M per epoch (574,996 samples × ~250 tokens avg), seen twice over 2 epochs |
| Epochs | 2 |
| Config file | `sft_config.yaml` |
**Model (frozen architecture; matches pretrain exactly):**
| Parameter | Value |
|-----------|-------|
| vocab_size | 50,304 (padded to a multiple of 128) |
| seq_len | 1024 |
| n_layer | 10 |
| n_embd | 768 |
| n_head | 12 |
| Rotary % | 25% |
| Total params | 109,513,728 |
**Training hyperparameters:**
| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW (lr=1.5e-5, min_lr=1e-6, weight_decay=0.01, betas=[0.9, 0.95]) |
| Precision | bf16 |
| Global batch size | 64 (micro_batch=8 × grad_accum=8) |
| LR warmup | 200 steps |
| Gradient clip | max_norm=1.0 |
| Save interval | Every 500 steps |
| Eval interval | Every 500 steps (runs val loss + eval prompts) |
| DataLoader | 4 workers, pin_memory=true |
| torch.compile | false |
**Prompt format (used during training; must be matched at inference):**
```
### Instruction:
{instruction}
### Response:
```
With optional input field:
```
### Instruction:
{instruction}
### Input:
{input}
### Response:
```
**Loss masking:** Only the response tokens (after `### Response:\n`) contribute to the loss.
The prompt tokens are masked out (loss_mask=0). EOS token (id=0) is appended to every response.
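A sketch of that masking for a single Alpaca sample, using the standard `transformers` tokenizer API (the real `sft_train.py` may batch differently):
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
EOS_ID = 0  # pythia <|endoftext|>, as noted above

def build_example(instruction, output, context=""):
    """Tokenize one sample; loss only on response tokens + EOS (mask=1)."""
    if context:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = tok(prompt, add_special_tokens=False).input_ids
    response_ids = tok(output, add_special_tokens=False).input_ids + [EOS_ID]
    input_ids = prompt_ids + response_ids
    loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)  # prompt masked out
    return input_ids, loss_mask
```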
---
### 3. SFT Inference / Chat: Loaded Configs
These are the exact generation parameters loaded when running `chat.py` or `validate_sft.py`.
They match the training eval config from `sft_train.py`.
```bash
python chat.py --ckpt "Base/out/sft/model.pth"
```
**Model loading:**
| Parameter | Value |
|-----------|-------|
| Checkpoint | `Base/out/sft/model.pth` (419 MB, raw state_dict, 154 keys) |
| Checkpoint format | Raw `state_dict`; NOT wrapped in a `{"model": ...}` dict |
| Tokenizer | `Base/checkpoints/EleutherAI/pythia-160m` (vocab 50,254) |
| EOS token ID | 0 (pythia tokenizer; NOT 50276) |
| Device | auto (CUDA if available, else CPU) |
| Precision | float32 at inference (weights loaded as-is from bf16-trained ckpt) |
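A sketch of the load path implied by the table (how the 109M model object itself is constructed is up to the repo's own code):
```python
import torch

def load_sft_checkpoint(model, path="Base/out/sft/model.pth"):
    """Load the raw state_dict (NOT wrapped in {"model": ...}) for inference."""
    state_dict = torch.load(path, map_location="cpu")  # the 154-key raw dict
    model.load_state_dict(state_dict)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return model.float().to(device).eval()             # float32 at inference
```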
**Generation parameters:**
| Parameter | Value | Why |
|-----------|-------|-----|
| temperature | 0.7 | Balanced creativity vs coherence |
| top_k | 40 | Matches training eval (NOT 50) |
| top_p | 0.9 | Nucleus sampling cutoff |
| repetition_penalty | 1.0 | No penalty; matches training (NOT 1.1) |
| max_new_tokens | 150 | Matches training eval (NOT 256) |
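For reference, a generic top-k / top-p / temperature sampling step with these defaults. This is the standard technique, not necessarily the exact code in `chat.py` (repetition_penalty=1.0 is a no-op, so it is omitted):
```python
import torch

@torch.no_grad()
def sample_next(logits, temperature=0.7, top_k=40, top_p=0.9):
    """Sample one token id from a (vocab,) logits vector."""
    logits = logits / temperature
    vals, idx = torch.topk(logits, top_k)            # keep the 40 best logits
    probs = torch.softmax(vals, dim=-1)
    sorted_probs, order = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cum - sorted_probs > top_p] = 0.0   # nucleus: drop the tail
    sorted_probs /= sorted_probs.sum()               # renormalize survivors
    choice = torch.multinomial(sorted_probs, 1)
    return idx[order[choice]].item()                 # map back to the full vocab
```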
**Prompt template (must match training exactly):**
```python
def format_prompt(instruction, context=""):
if instruction and context:
return f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
else:
return f"### Instruction:\n{instruction}\n\n### Response:\n"
```
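For example, `format_prompt("What is entropy?")` (a hypothetical instruction) renders exactly as:
```
### Instruction:
What is entropy?

### Response:
```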
**Critical notes:**
- There is NO Alpaca preamble text (e.g., "Below is an instruction..."); the model was never trained with one
- EOS token is id=0 (pythia), not 50276 (GPT-NeoX); using the wrong EOS causes the model to never stop
- Generation stops when EOS is produced OR max_new_tokens is reached
- For longer responses in chat, you can override: `--max_new 512`
- For less repetition in production, add: `--rep_pen 1.05`
**Validation results with these configs (100 complex examples):**
| Metric | Value |
|--------|-------|
| Overall Grade | A |
| Avg Loss (CE) | 1.9167 |
| Avg Perplexity | 7.45 |
| Token Accuracy | 58.6% |
| BLEU-1 | 0.589 |
| BLEU-2 | 0.219 |
| Empty responses | 0/100 |
| Repetitive responses | 5/100 |
---
## License
Private / ASTERIZER 2026