# MAP-NEO Mini: A DIY LLM from Scratch
This repository documents the complete, end-to-end journey of building, extending, and deploying a custom 253M-parameter language model on modest hardware (RTX 5070, 8 GB VRAM). It covers data preparation, model training, context-window extension, interactive inference, and GPU optimization.
***
## Project Overview
- **Model**: MAP-NEO Mini (253 M parameters)
- **Architecture**: Rotary embeddings, RMSNorm, SwiGLU, Flash Attention (two of these are sketched after this list)
- **Hardware**: Intel i5 CPU, 16 GB RAM → NVIDIA RTX 4000 (20 GB) → RTX 5070 (8 GB)
- **Data**: RefinedWeb (100 K high-quality web docs, 41 M tokens)
- **Context Window**: Extended from 1,024 → 16,384 tokens
- **Training**: Mixed precision (bf16), gradient checkpointing, gradient accumulation
- **Fine-Tuning**: Planned conversational instruction tuning with UltraChat
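For orientation, here is a minimal PyTorch sketch of two of the components named above, RMSNorm and SwiGLU. This is an illustration, not the repo's code; the real definitions live in `model_neo.py` and may differ in detail.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square LayerNorm: no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS of the last dimension, then rescale.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), projected back down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```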
***
## Repository Structure
```
AI/
├── checkpoints/                  # Model checkpoints & configs
│   ├── checkpoint_step_149999.pt # Last pre-training checkpoint
│   ├── extended_context_model.pt # 8K context model
│   └── model_config.json         # Config for extended model
├── data/                         # Raw and processed data
│   ├── shards/                   # Raw JSONL shards
│   ├── processed/                # Filtered JSONL
│   └── tokens/                   # Packed token sequences
├── clean_conversational_neo/     # Conversational training scripts
├── configs/                      # training_config.json, data_config.json
├── logs/                         # TensorBoard logs
├── notebooks/                    # Exploratory Jupyter notebooks
├── advanced_generate.py          # Advanced inference & context tests
├── conversation_data_prep.py     # Prepares chat data for fine-tuning
├── data_prep.py                  # RefinedWeb download & preprocessing
├── debug_downloaded_data.py      # Inspect raw data quality
├── extend_context.py             # Extends the model context window
├── finetune_neo.py               # Base fine-tuning script
├── generate_text.py              # Simple generation utility
├── interactive_chat.py           # Interactive chat interface
├── model_neo.py                  # Model & config definitions
├── requirements.txt              # Python dependencies
├── run_training.py               # Orchestrates data prep → training
├── scale_data.py                 # Utilities for sampling & scaling datasets
├── setup_project.py              # Initial setup (venv, downloads)
├── test_conversational_neo.py    # Tests on small conversational model
└── train_neo.py                  # Main pre-training script
```
***
## Setup & Installation
1. **Clone** this repo.
2. **Create virtual environment** (Python 3.10+):
```bash
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```
3. **Install GPU drivers** and **CUDA** (if training on an NVIDIA GPU).
4. **Optional**: `pip install tensorboard pynvml` for logging and GPU monitoring.
***
## Data Preparation
- **Dataset**: `tiiuae/falcon-refinedweb`
- **Script**: `data_prep.py`
  - Downloads 100 K docs and filters for quality (200–10,000 chars, English only)
  - Tokenizes with GPT-2 BPE and packs into sequences of length 1,024 (filter-and-pack logic sketched below)
- **Output**:
- Raw shards: `data/shards/refinedweb_sample_raw.jsonl`
- Filtered: `data/processed/refinedweb_filtered.jsonl`
- Packed tokens: `data/tokens/packed_1024.txt`
```bash
python data_prep.py --num_docs 100000 --seq_length 1024 --tokenizer gpt2 --output_dir data
```
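Whichever GPT-2 BPE implementation `data_prep.py` uses, the filter-and-pack stage has roughly this shape. A hedged sketch: `keep` and `pack` are hypothetical names, and the `tiktoken` encoding is an assumption.
```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE (assumed implementation)
SEQ_LEN = 1024

def keep(doc: str) -> bool:
    """Quality filter from above: 200-10,000 characters."""
    return 200 <= len(doc) <= 10_000

def pack(docs, seq_len: int = SEQ_LEN):
    """Concatenate tokenized docs (EOT-separated) and yield fixed-length sequences."""
    buf: list[int] = []
    for doc in filter(keep, docs):
        buf.extend(enc.encode(doc))
        buf.append(enc.eot_token)  # document separator
        while len(buf) >= seq_len:
            yield buf[:seq_len]
            buf = buf[seq_len:]
```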
***
## Pre-Training
- **Script**: `train_neo.py`
- **Config**:
```python
batch_size = 1                      # per-device; effective batch = 1 x 32 sequences
gradient_accumulation_steps = 32
max_steps = 150000
warmup_steps = 3750                 # 2.5% of max_steps
mixed_precision = "bf16"
gradient_checkpointing = True       # trade recompute for VRAM
```
- **Accelerator** handles mixed precision, gradient accumulation, and checkpointing (loop shape sketched at the end of this section).
- **Resume** from any checkpoint: set `resume_from_checkpoint` in `TrainingConfig`.
```bash
python train_neo.py # fresh
python train_neo.py --resume checkpoints/checkpoint_step_7500.pt # resume
```
- **Speed**: ~10 it/s → ~4 hours for 150 K steps on the RTX 5070.
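A minimal sketch of what an Accelerate-driven loop with these settings looks like. The names are illustrative, not taken from `train_neo.py`, and the model is assumed to return an object with a `.loss` attribute.
```python
from accelerate import Accelerator

def train(model, optimizer, loader, accum_steps: int = 32):
    """Illustrative loop shape: bf16 + gradient accumulation via Accelerate."""
    accelerator = Accelerator(mixed_precision="bf16",
                              gradient_accumulation_steps=accum_steps)
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
    model.train()
    for batch in loader:
        with accelerator.accumulate(model):  # sync/step only every 32 micro-batches
            loss = model(batch).loss         # assumption: model returns .loss
            accelerator.backward(loss)       # handles mixed-precision backward
            optimizer.step()
            optimizer.zero_grad()
```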
***
## Context Extension
- **Script**: `extend_context.py`
- Extends `config.max_seq_len` → 16,384 and interpolates position embeddings (one possible scheme is sketched below).
- **Output**: `checkpoints/extended_context_model_16k.pt`
```bash
python extend_context.py --new_max_len 16384
```
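Since the model uses rotary embeddings, one common extension scheme is position interpolation: rescale the new positions so they fall inside the originally trained range. A sketch under that assumption; `extend_context.py` may use a different scheme.
```python
import torch

def interpolated_rope_freqs(dim: int, old_len: int, new_len: int,
                            base: float = 10000.0):
    """RoPE position interpolation: squeeze new positions into the old range."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    scale = old_len / new_len                  # e.g. 1024 / 16384 = 1/16
    positions = torch.arange(new_len).float() * scale
    angles = torch.outer(positions, inv_freq)  # (new_len, dim/2)
    return torch.cos(angles), torch.sin(angles)
```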
***
## Inference & Testing
### **Generation Scripts**
`generate_text.py` offers simple prompt completion; `advanced_generate.py` tests fixed prompts and long-context usage with VRAM monitoring.
### **Interactive Chat**
`interactive_chat.py` provides a full chat interface:
- `/help`, `/params`, `/memory`, `/context`, `/clear`, `/save`, `/load`, `/multi`, `/system`, `/exit`
- Real-time GPU usage and context tracking
- Customizable sampling parameters (one nucleus-sampling step is sketched below)
```bash
python interactive_chat.py
```
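For reference, one nucleus-sampling (top-p) step over the final-position logits might look like this. The function name and defaults are illustrative, not the script's actual parameters.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_next_token(logits: torch.Tensor,
                      temperature: float = 0.8,
                      top_p: float = 0.9) -> int:
    """One nucleus-sampling step over a 1-D logits vector."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out tokens outside the smallest set whose mass covers top_p.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()  # renormalize the nucleus
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()
```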
***
## Fine-Tuning Plan
- **Dataset Recommendation**: `openbmb/UltraChat` (1.5 M dialogs) + `BAAI/Infinity-Instruct` + `vicgalle/alpaca-gpt4`
- **Script**: `finetune_neo.py` (will be extended for conversational data)
- **Goal**: Transform the base model into an instruction-following chat assistant (data formatting sketched below)
```bash
python finetune_neo.py \
--base_model checkpoints/extended_context_model_16k.pt \
--dataset /path/to/UltraChat \
--epochs 3 --lr 5e-6 --batch_size 1
```
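One standard way to prepare conversational data, which `conversation_data_prep.py` may or may not follow, is to concatenate turns and mask the loss on everything the user says, so only assistant tokens are supervised. A hypothetical sketch:
```python
IGNORE = -100  # PyTorch cross-entropy ignore_index

def build_example(dialog, enc, eot_id: int):
    """dialog: list of {"role": "user"|"assistant", "content": str} turns."""
    input_ids, labels = [], []
    for turn in dialog:
        ids = enc.encode(f"{turn['role']}: {turn['content']}\n") + [eot_id]
        input_ids.extend(ids)
        # Supervise only what the assistant says; mask the rest.
        labels.extend(ids if turn["role"] == "assistant" else [IGNORE] * len(ids))
    return input_ids, labels
```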
***
## Key Lessons & Tips
- **Quality > Quantity**: RefinedWeb's higher-quality text cut the required training steps by ~25%
- **Memory Efficiency**: 3.6 K-token contexts fit in ~1.3 GB of VRAM
- **Batch Size Tradeoff**: batch size 1 vs. 2 was the difference between fitting in VRAM and overflowing it
- **Cache Clearing**: `torch.cuda.empty_cache()` between long-context tests is essential (see the snippet below)
- **Resume Training**: checkpointing during pre-training saved 10+ hours of reruns
- **Conversational Fine-Tuning**: the final step to turn the base model into a chat assistant
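A small helper in that spirit (assumed, not from the repo) for tracking VRAM between long-context runs:
```python
import torch

def vram_report(tag: str = "") -> None:
    """Print allocated/reserved CUDA memory in GB to help spot leaks."""
    alloc = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"[{tag}] allocated={alloc:.2f} GB  reserved={reserved:.2f} GB")

# Between long-context tests:
torch.cuda.empty_cache()  # release cached blocks back to the driver
vram_report("after cleanup")
```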
***
## Next Steps
1. **Review and run conversational fine-tuning** on UltraChat.
2. **Evaluate** on standardized benchmarks (perplexity, MMLU, HellaSwag).
3. **Quantize** or **prune** for faster inference on edge devices (a minimal quantization sketch follows this list).
4. **Deploy** with FastAPI + SSE for streaming responses.
5. **Document** model card and share results.
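On the quantization idea, one low-effort baseline is PyTorch's post-training dynamic quantization of `nn.Linear` layers for CPU inference. A minimal sketch under that assumption, not a tested recipe for this model:
```python
import torch

def quantize_for_cpu(model: torch.nn.Module) -> torch.nn.Module:
    """Dynamic int8 quantization of Linear layers (CPU inference only)."""
    return torch.ao.quantization.quantize_dynamic(
        model.cpu(), {torch.nn.Linear}, dtype=torch.qint8
    )
```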
***
Thank you for following this detailed project! Your model is now a **powerful, efficient LLM** ready for conversational fine-tuning and deployment. Good luck!