# MAP-NEO Mini: A DIY LLM from Scratch
This repository documents a complete, end-to-end journey of building, extending, and deploying a custom 253M-parameter language model on modest hardware (an RTX 5070 with 8 GB of VRAM). It covers data preparation, model training, context-window extension, interactive inference, and GPU optimization.
***
## 🚀 Project Overview
- **Model**: MAP-NEO Mini (253M parameters)
- **Architecture**: Rotary embeddings, RMSNorm, SwiGLU, Flash Attention (see the sketch below)
- **Hardware**: Intel i5 CPU, 16 GB RAM → NVIDIA RTX 4000 (20 GB) → RTX 5070 (8 GB)
- **Data**: RefinedWeb (100K high-quality web docs, 41M tokens)
- **Context Window**: Extended from 1,024 → 16,384 tokens
- **Training**: Mixed precision (bf16), gradient checkpointing, gradient accumulation
- **Fine-Tuning**: Planned conversational instruction tuning with UltraChat
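Two of these components are compact enough to sketch inline. Below is a minimal, illustrative PyTorch version of RMSNorm and SwiGLU; the authoritative definitions live in `model_neo.py`, and the names and shapes here are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1 / sqrt(mean(x^2) + eps), applied per feature vector
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W1) * (x W3), projected back with W2."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```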
***
## 📂 Repository Structure
```
AI/
├─ checkpoints/                 # Model checkpoints & configs
│  ├─ checkpoint_step_149999.pt # Last pre-training checkpoint
│  ├─ extended_context_model.pt # 8K context model
│  └─ model_config.json         # Config for extended model
├─ data/                        # Raw and processed data
│  ├─ shards/                   # Raw JSONL shards
│  ├─ processed/                # Filtered JSONL
│  └─ tokens/                   # Packed token sequences
├─ clean_conversational_neo/    # Conversational training scripts
├─ configs/                     # training_config.json, data_config.json
├─ logs/                        # TensorBoard logs
├─ notebooks/                   # Exploratory Jupyter notebooks
├─ advanced_generate.py         # Advanced inference & context tests
├─ conversation_data_prep.py    # Prepares chat data for fine-tuning
├─ data_prep.py                 # RefinedWeb download & preprocessing
├─ debug_downloaded_data.py     # Inspect raw data quality
├─ extend_context.py            # Script to extend model context window
├─ finetune_neo.py              # Base fine-tuning script
├─ generate_text.py             # Simple generation utility
├─ interactive_chat.py          # Interactive chat interface
├─ model_neo.py                 # Model & config definitions
├─ requirements.txt             # Python dependencies
├─ run_training.py              # Orchestrates data prep → training
├─ scale_data.py                # Utilities for sampling & scaling datasets
├─ setup_project.py             # Initial setup (venv, downloads)
├─ test_conversational_neo.py   # Tests on small conversational model
└─ train_neo.py                 # Main pre-training script
```
***
## πŸ› οΈ Setup & Installation
1. **Clone** this repo.
2. **Create virtual environment** (Python 3.10+):
```bash
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```
3. **Install GPU drivers** and a matching **CUDA** toolkit (if training on an NVIDIA GPU).
4. **Optional**: `pip install tensorboard pynvml` for logging and GPU monitoring.
***
## 📊 Data Preparation
- **Dataset**: `tiiuae/falcon-refinedweb`
- **Script**: `data_prep.py`
- Downloads 100 K docs, filters for quality (200–10,000 chars, English only)
- Tokenizes with GPT-2 BPE, packs into sequences of length 1,024
- **Output**:
- Raw shards: `data/shards/refinedweb_sample_raw.jsonl`
- Filtered: `data/processed/refinedweb_filtered.jsonl`
- Packed tokens: `data/tokens/packed_1024.txt`
```bash
python data_prep.py --num_docs 100000 --seq_length 1024 --tokenizer gpt2 --output_dir data
```
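The core of the filter-and-pack step looks roughly like this (a sketch only: the thresholds mirror the bullets above, the English-language check is omitted, and the toy inputs stand in for RefinedWeb documents):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def keep(text: str) -> bool:
    """Length-based quality gate: 200-10,000 characters."""
    return 200 <= len(text) <= 10_000

def pack(texts, seq_length: int = 1024):
    """Concatenate token streams and emit fixed-length training sequences."""
    buffer = []
    for text in filter(keep, texts):
        buffer.extend(tokenizer.encode(text))
        buffer.append(tokenizer.eos_token_id)  # document separator
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]

docs = ["a" * 250, "too short", "b" * 300]  # toy stand-ins for RefinedWeb docs
sequences = list(pack(docs, seq_length=8))
```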
***
## πŸ‹οΈ Pre-Training
- **Script**: `train_neo.py`
- **Config**:
```python
batch_size = 1
gradient_accumulation_steps = 32
max_steps = 150000
warmup_steps = 3750
mixed_precision = "bf16"
gradient_checkpointing = True
```
- Hugging Face **Accelerate** handles mixed precision, gradient accumulation, and checkpointing (a loop sketch follows below).
- **Resume** from any checkpoint: set `resume_from_checkpoint` in `TrainingConfig`.
```bash
python train_neo.py # fresh
python train_neo.py --resume checkpoints/checkpoint_step_7500.pt # resume
```
- **Speed**: ~10 it/s → about 4 hours for 150K steps on the RTX 5070.
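Put together, the loop `train_neo.py` builds on looks roughly like this (a sketch with a toy model and dataset; the real script wires in `model_neo.py`, the packed-token data, LR scheduling, logging, and checkpoint saving):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16",
                          gradient_accumulation_steps=32)

# Stand-ins for the real model and packed-token dataset.
model = torch.nn.Linear(1024, 1024)
data = TensorDataset(torch.randn(256, 1024))
loader = DataLoader(data, batch_size=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for step, (batch,) in enumerate(loader):
    with accelerator.accumulate(model):    # syncs gradients every 32nd step
        loss = model(batch).pow(2).mean()  # real loop: LM cross-entropy loss
        accelerator.backward(loss)         # autocasts under bf16
        optimizer.step()
        optimizer.zero_grad()
```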
***
## 🔧 Context Extension
- **Script**: `extend_context.py`
- Extends `config.max_seq_len` → 16,384 and interpolates position embeddings (see the sketch below).
- **Output**: `checkpoints/extended_context_model_16k.pt`
```bash
python extend_context.py --new_max_len 16384
```
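Since the model uses rotary embeddings, extension amounts to squeezing new positions into the rotation range seen during pre-training. A minimal sketch of that position-interpolation idea follows; `extend_context.py` may differ in detail:

```python
import torch

def rope_angles(seq_len: int, dim: int,
                trained_len: int = 1024, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles with positions linearly compressed into the trained range."""
    scale = trained_len / seq_len if seq_len > trained_len else 1.0
    positions = torch.arange(seq_len, dtype=torch.float32) * scale
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions, inv_freq)  # (seq_len, dim // 2)

# cos(angles) and sin(angles) rotate the query/key vectors per head
angles = rope_angles(16_384, 64)
```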
***
## 🤖 Inference & Testing
### **Simple Generation**
`generate_text.py` handles basic one-off generation; `advanced_generate.py` tests fixed prompts and long-context usage with VRAM monitoring.
### **Interactive Chat**
`interactive_chat.py` provides a full chat interface:
- `/help`, `/params`, `/memory`, `/context`, `/clear`, `/save`, `/load`, `/multi`, `/system`, `/exit`
- Real-time GPU usage and context tracking
- Customizable sampling parameters
```bash
python interactive_chat.py
```
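The sampling parameters exposed via `/params` are the usual temperature and nucleus (top-p) knobs. A minimal sketch of one sampling step (illustrative, not the exact code in `interactive_chat.py`):

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8,
                top_p: float = 0.9) -> int:
    """Pick the next token id from 1-D logits of shape (vocab_size,)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p  # always keeps the top token
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()        # renormalize the nucleus
    return sorted_ids[torch.multinomial(sorted_probs, 1)].item()
```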
***
## 📈 Fine-Tuning Plan
- **Dataset Recommendation**: `openbmb/UltraChat` (1.5M dialogs) + `BAAI/Infinity-Instruct` + `vicgalle/alpaca-gpt4`
- **Script**: `finetune_neo.py` (to be extended for conversational data)
- **Goal**: Transform the base model into an instruction-following chat assistant
```bash
python finetune_neo.py \
--base_model checkpoints/extended_context_model_16k.pt \
--dataset /path/to/UltraChat \
--epochs 3 --lr 5e-6 --batch_size 1
```
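One open design choice is how `conversation_data_prep.py` flattens multi-turn dialogs into training strings. A hedged sketch with a plain `User:`/`Assistant:` template (the actual tags may differ):

```python
def format_dialog(turns: list[dict]) -> str:
    """turns: [{"role": "user" | "assistant", "content": "..."}]"""
    parts = []
    for turn in turns:
        tag = "User" if turn["role"] == "user" else "Assistant"
        parts.append(f"{tag}: {turn['content']}")
    return "\n".join(parts) + "\n"

example = format_dialog([
    {"role": "user", "content": "What is RMSNorm?"},
    {"role": "assistant", "content": "A normalization layer without mean centering."},
])
```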
***
## 🔑 Key Lessons & Tips
- **Quality > Quantity**: RefinedWeb's filtering quality cut required training steps by 25%
- **Memory Efficiency**: Reached 3.6K tokens of context at ~1.3 GB VRAM
- **Batch Size Tradeoff**: Moving from batch size 1 to 2 overflows VRAM; gradient accumulation recovers the effective batch size
- **Cache Clearing**: `torch.cuda.empty_cache()` is essential between long-context tests (see the snippet below)
- **Resume Training**: Checkpointing during pre-training saved 10+ hours of rework
- **Conversational Fine-Tuning**: The final step to turn the base model into a chat assistant
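The cache-clearing pattern from the list above, with the before/after VRAM bookkeeping used in the long-context tests:

```python
import torch

def vram_mb() -> tuple[float, float]:
    """(allocated, reserved) in MiB for the current CUDA device."""
    return (torch.cuda.memory_allocated() / 2**20,
            torch.cuda.memory_reserved() / 2**20)

if torch.cuda.is_available():
    print("before:", vram_mb())
    torch.cuda.empty_cache()  # hand cached blocks back to the driver
    print("after: ", vram_mb())
```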
***
## 📂 Next Steps
1. **Review and run conversational fine-tuning** on UltraChat.
2. **Evaluate** on standardized benchmarks (perplexity, MMLU, HellaSwag).
3. **Quantize** or **prune** for faster inference on edge devices (see the quantization sketch below).
4. **Deploy** with FastAPI + SSE for streaming responses.
5. **Document** model card and share results.
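For step 3, dynamic int8 quantization of the linear layers is a low-effort starting point (shown on a toy module; the same call applies to the loaded MAP-NEO model, and this is one option, not the project's chosen method):

```python
import torch
import torch.nn as nn

# Toy stand-in for the loaded model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Replace nn.Linear weights with int8 versions; activations stay float (CPU inference).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)
```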
***
Thanks for following this detailed project! The base model is now trained, context-extended, and efficient enough to run on consumer hardware, ready for conversational fine-tuning and deployment. Good luck!