# MAP-NEO Mini: A DIY LLM from Scratch
This repository demonstrates a complete, end-to-end journey of building, extending, and deploying a custom 253M-parameter language model on modest hardware (RTX 5070, 8 GB VRAM). It covers data preparation, model training, context window extension, interactive inference, and GPU optimization.
***
## Project Overview
- **Model**: MAP-NEO Mini (253M parameters)
- **Architecture**: Rotary embeddings, RMSNorm, SwiGLU, Flash Attention (see the sketch after this list)
- **Hardware**: Intel i5 CPU, 16 GB RAM → NVIDIA RTX 4000 (20 GB) → RTX 5070 (8 GB)
- **Data**: RefinedWeb (100K high-quality web docs, 41M tokens)
- **Context Window**: Extended from 1,024 → 16,384 tokens
- **Training**: Mixed precision (bf16), gradient checkpointing, gradient accumulation
- **Fine-Tuning**: Planned conversational instruction tuning with UltraChat
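For concreteness, here are minimal PyTorch sketches of two of the listed building blocks, RMSNorm and SwiGLU. These are illustrative definitions, not the exact code in `model_neo.py`:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescale by the RMS of the activations,
    with a learned gain but no mean-centering (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit,
    as used in LLaMA-style architectures."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```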
***
## Repository Structure
```
AI/
├── checkpoints/                  # Model checkpoints & configs
│   ├── checkpoint_step_149999.pt    # Last pre-training checkpoint
│   ├── extended_context_model.pt    # 8K context model
│   └── model_config.json            # Config for extended model
├── data/                         # Raw and processed data
│   ├── shards/                   # Raw JSONL shards
│   ├── processed/                # Filtered JSONL
│   └── tokens/                   # Packed token sequences
├── clean_conversational_neo/     # Conversational training scripts
├── configs/                      # training_config.json, data_config.json
├── logs/                         # TensorBoard logs
├── notebooks/                    # Exploratory Jupyter notebooks
├── advanced_generate.py          # Advanced inference & context tests
├── conversation_data_prep.py     # Prepares chat data for fine-tuning
├── data_prep.py                  # RefinedWeb download & preprocessing
├── debug_downloaded_data.py      # Inspect raw data quality
├── extend_context.py             # Script to extend model context window
├── finetune_neo.py               # Base fine-tuning script
├── generate_text.py              # Simple generation utility
├── interactive_chat.py           # Interactive chat interface
├── model_neo.py                  # Model & config definitions
├── requirements.txt              # Python dependencies
├── run_training.py               # Orchestrates data prep → training
├── scale_data.py                 # Utilities for sampling & scaling datasets
├── setup_project.py              # Initial setup (venv, downloads)
├── test_conversational_neo.py    # Tests on small conversational model
└── train_neo.py                  # Main pre-training script
```
***
## Setup & Installation
1. **Clone** this repo.
2. **Create a virtual environment** (Python 3.10+):
```bash
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install --upgrade pip
pip install -r requirements.txt
```
3. **Install GPU drivers** and **CUDA** (if using an RTX card).
4. **Optional**: `pip install tensorboard pynvml` for logging and GPU monitoring.
***
## Data Preparation
- **Dataset**: `tiiuae/falcon-refinedweb`
- **Script**: `data_prep.py`
  - Downloads 100K docs, filters for quality (200–10,000 chars, English only)
  - Tokenizes with GPT-2 BPE, packs into sequences of length 1,024 (see the packing sketch below)
- **Output**:
  - Raw shards: `data/shards/refinedweb_sample_raw.jsonl`
  - Filtered: `data/processed/refinedweb_filtered.jsonl`
  - Packed tokens: `data/tokens/packed_1024.txt`
```bash
python data_prep.py --num_docs 100000 --seq_length 1024 --tokenizer gpt2 --output_dir data
```
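For clarity, here is a sketch of the tokenize-and-pack step referenced above, using the Hugging Face GPT-2 tokenizer. `pack_documents` is illustrative, not the actual `data_prep.py` implementation:
```python
from transformers import GPT2TokenizerFast

def pack_documents(docs, seq_length=1024):
    """Concatenate tokenized docs (separated by EOS) into one token stream,
    then slice the stream into fixed-length training sequences,
    discarding the ragged tail."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    buffer = []
    for doc in docs:
        buffer.extend(tok.encode(doc) + [tok.eos_token_id])
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]

# Example: pack three short docs into 1,024-token sequences.
sequences = list(pack_documents(["First document...", "Second...", "Third..."]))
```
Packing wastes no tokens on padding: every position in every sequence is real training signal, which matters at this small scale.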
***
## Pre-Training
- **Script**: `train_neo.py`
- **Config**:
```python
batch_size = 1
gradient_accumulation_steps = 32   # effective batch size of 32
max_steps = 150000
warmup_steps = 3750
mixed_precision = "bf16"
gradient_checkpointing = True
```
- **Accelerator** handles mixed precision, gradient accumulation, and checkpointing (see the loop sketch at the end of this section).
- **Resume** from any checkpoint: set `resume_from_checkpoint` in `TrainingConfig`.
```bash
python train_neo.py                                                # fresh run
python train_neo.py --resume checkpoints/checkpoint_step_7500.pt   # resume
```
- **Speed**: ~10 it/s, i.e. roughly 4 hours for 150K steps on the RTX 5070.
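For orientation, here is a minimal sketch of the Accelerate-style loop described above. `NeoModel`, `get_dataloader`, and the learning rate are illustrative placeholders, not the actual `train_neo.py` API:
```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=32)
model = NeoModel()                       # placeholder for the 253M model
model.gradient_checkpointing_enable()    # trade recompute for activation memory
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # illustrative LR
dataloader = get_dataloader(batch_size=1)                   # placeholder

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for step, batch in enumerate(dataloader):
    # Gradients accumulate across 32 micro-batches before each optimizer update.
    with accelerator.accumulate(model):
        loss = model(batch["input_ids"], labels=batch["labels"])  # returns LM loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    if step % 7500 == 0:  # cadence matching the checkpoint names above
        accelerator.save_state(f"checkpoints/checkpoint_step_{step}")
```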
***
## Context Extension
- **Script**: `extend_context.py`
- Extends `config.max_seq_len` to 16,384 and interpolates the position embeddings (see the sketch below).
- **Output**: `checkpoints/extended_context_model_16k.pt`
```bash
python extend_context.py --new_max_len 16384
```
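Since the model uses rotary embeddings, "interpolating position embeddings" here presumably means RoPE position interpolation: positions are scaled down by the old-to-new length ratio so a 16K sequence maps onto the 1K position range seen during pre-training. A minimal sketch under that assumption (head dimension and base are illustrative):
```python
import torch

def rope_frequencies(dim, max_len, old_len, base=10000.0):
    """Cos/sin tables for interpolated RoPE: positions are compressed by
    old_len/max_len instead of extrapolating past the trained range."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(max_len).float() * (old_len / max_len)  # interpolation
    angles = torch.outer(positions, inv_freq)  # shape: (max_len, dim // 2)
    return torch.cos(angles), torch.sin(angles)

cos, sin = rope_frequencies(dim=64, max_len=16384, old_len=1024)
```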
***
## Inference & Testing
### **Scripted Generation**
`generate_text.py` offers simple one-shot generation; `advanced_generate.py` tests fixed prompts and long-context usage with VRAM monitoring.
### **Interactive Chat**
`interactive_chat.py` provides a full chat interface (a sampling sketch follows below):
- `/help`, `/params`, `/memory`, `/context`, `/clear`, `/save`, `/load`, `/multi`, `/system`, `/exit`
- Real-time GPU usage and context tracking
- Customizable sampling parameters
```bash
python interactive_chat.py
```
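As an illustration of the sampling parameters the chat interface exposes, here is a generic temperature plus top-p (nucleus) sampling step; it is a sketch, not the actual `interactive_chat.py` code:
```python
import torch

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """Pick the next token id from a (vocab,) logits tensor."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose probability mass reaches top_p.
    keep = cumulative - sorted_probs < top_p
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()  # renormalize
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids[choice].item()
```
Lower temperature sharpens the distribution; lower top-p trims the unreliable tail, which matters for a 253M model prone to rambling.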
***
## Fine-Tuning Plan
- **Dataset Recommendation**: `openbmb/UltraChat` (1.5M dialogues) + `BAAI/Infinity-Instruct` + `vicgalle/alpaca-gpt4`
- **Script**: `finetune_neo.py` (will be extended for conversational data; see the formatting sketch below)
- **Goal**: Transform the base model into an instruction-following chat assistant
```bash
python finetune_neo.py \
  --base_model checkpoints/extended_context_model_16k.pt \
  --dataset /path/to/UltraChat \
  --epochs 3 --lr 5e-6 --batch_size 1
```
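One plausible way to prepare conversational data for this step is to flatten each dialogue into a token stream and mask the loss on non-assistant turns, so the model only learns to produce replies. The role template below is an assumption, not necessarily the format `conversation_data_prep.py` produces:
```python
from transformers import GPT2TokenizerFast

IGNORE_INDEX = -100  # PyTorch cross-entropy skips these label positions

def build_example(turns, tokenizer):
    """turns: list of (role, text) pairs. Returns input_ids and
    loss-masked labels for causal LM fine-tuning."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenizer.encode(f"<|{role}|>\n{text}\n")
        input_ids.extend(ids)
        # Learn only from assistant tokens; mask user/system turns.
        labels.extend(ids if role == "assistant" else [IGNORE_INDEX] * len(ids))
    return input_ids, labels

tok = GPT2TokenizerFast.from_pretrained("gpt2")
ids, labels = build_example(
    [("user", "What is RMSNorm?"), ("assistant", "A mean-free layer norm...")], tok
)
```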
***
## Key Lessons & Tips
- **Quality > Quantity**: RefinedWeb's data quality cut the required training steps by ~25%
- **Memory Efficiency**: Achieved 3.6K-token contexts at ~1.3 GB VRAM
- **Batch Size Tradeoff**: Batch size 1 vs. 2 was the difference between fitting in VRAM and overflowing
- **Cache Clearing**: `torch.cuda.empty_cache()` is essential between long-context tests (see the sketch after this list)
- **Resume Training**: Checkpointing during pre-training saved 10+ hours
- **Conversational Fine-Tuning**: The final step to transform the base model into a chat assistant
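A small sketch of the cache-clearing and VRAM-monitoring pattern behind the tips above, using the optional `pynvml` dependency from the setup section:
```python
import torch
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def report_vram(tag):
    """Print used/total device memory as seen by the NVIDIA driver."""
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"{tag}: {info.used / 2**30:.2f} / {info.total / 2**30:.2f} GiB")

report_vram("before")
# ... run a long-context generation here ...
torch.cuda.empty_cache()   # release cached allocator blocks back to the driver
report_vram("after empty_cache")
```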
***
## Next Steps
1. **Review and run conversational fine-tuning** on UltraChat.
2. **Evaluate** on standardized benchmarks (perplexity, MMLU, HellaSwag).
3. **Quantize** or **prune** for faster inference on edge devices (see the sketch after this list).
4. **Deploy** with FastAPI + SSE for streaming responses.
5. **Document** a model card and share results.
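For step 3, a minimal sketch of post-training dynamic quantization with PyTorch: Linear weights go to int8, which mainly speeds up CPU/edge inference. `load_model` is a hypothetical helper, not a function in this repo, and whether int8 suits this model is untested here:
```python
import torch

model = load_model("checkpoints/extended_context_model_16k.pt")  # hypothetical
model_int8 = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # quantize Linear layers only
)
torch.save(model_int8.state_dict(), "checkpoints/model_int8.pt")
```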
***
Thank you for following this detailed project! Your model is now a **compact, efficient LLM** ready for conversational fine-tuning and deployment. Good luck!