# MAP-NEO Mini: A DIY LLM from Scratch
This repository documents the complete, end-to-end journey of building, extending, and deploying a custom 253M-parameter language model on modest hardware (RTX 5070, 8 GB VRAM). It covers data preparation, model training, context-window extension, interactive inference, and GPU optimization.
***
## Project Overview
- **Model**: MAP-NEO Mini (253 M parameters)
- **Architecture**: Rotary embeddings, RMSNorm, SwiGLU, Flash Attention (two of these are sketched after this list)
- **Hardware**: Intel i5 CPU, 16 GB RAM → NVIDIA RTX 4000 (20 GB) → RTX 5070 (8 GB)
- **Data**: RefinedWeb (100 K high-quality web docs, 41 M tokens)
- **Context Window**: Extended from 1,024 → 16,384 tokens
- **Training**: Mixed precision (bf16), gradient checkpointing, gradient accumulation
- **Fine-Tuning**: Planned conversational instruction tuning with UltraChat
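For orientation, here is a minimal PyTorch sketch of two of the components named above, RMSNorm and SwiGLU. This is an illustration, not the repo's code; the real definitions live in `model_neo.py` and may differ in detail.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square LayerNorm: no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS of the last dimension, then rescale.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), projected back down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```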
***
## Repository Structure
```
AI/
├── checkpoints/                  # Model checkpoints & configs
│   ├── checkpoint_step_149999.pt # Last pre-training checkpoint
│   ├── extended_context_model.pt # 8K context model
│   └── model_config.json         # Config for extended model
├── data/                         # Raw and processed data
│   ├── shards/                   # Raw JSONL shards
│   ├── processed/                # Filtered JSONL
│   └── tokens/                   # Packed token sequences
├── clean_conversational_neo/     # Conversational training scripts
├── configs/                      # training_config.json, data_config.json
├── logs/                         # TensorBoard logs
├── notebooks/                    # Exploratory Jupyter notebooks
├── advanced_generate.py          # Advanced inference & context tests
├── conversation_data_prep.py     # Prepares chat data for fine-tuning
├── data_prep.py                  # RefinedWeb download & preprocessing
├── debug_downloaded_data.py      # Inspect raw data quality
├── extend_context.py             # Extends the model context window
├── finetune_neo.py               # Base fine-tuning script
├── generate_text.py              # Simple generation utility
├── interactive_chat.py           # Interactive chat interface
├── model_neo.py                  # Model & config definitions
├── requirements.txt              # Python dependencies
├── run_training.py               # Orchestrates data prep → training
├── scale_data.py                 # Utilities for sampling & scaling datasets
├── setup_project.py              # Initial setup (venv, downloads)
├── test_conversational_neo.py    # Tests on small conversational model
└── train_neo.py                  # Main pre-training script
```
***
## Setup & Installation
1. **Clone** this repo.
2. **Create virtual environment** (Python 3.10+):
```bash
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```
3. **Install GPU drivers** and **CUDA** (if training on an NVIDIA GPU).
4. **Optional**: `pip install tensorboard pynvml` for logging and GPU monitoring.
***
## Data Preparation
- **Dataset**: `tiiuae/falcon-refinedweb`
- **Script**: `data_prep.py`
  - Downloads 100 K docs and filters for quality (200–10,000 chars, English only)
  - Tokenizes with GPT-2 BPE and packs into sequences of length 1,024 (filter-and-pack logic sketched below)
- **Output**:
- Raw shards: `data/shards/refinedweb_sample_raw.jsonl`
- Filtered: `data/processed/refinedweb_filtered.jsonl`
- Packed tokens: `data/tokens/packed_1024.txt`
```bash
python data_prep.py --num_docs 100000 --seq_length 1024 --tokenizer gpt2 --output_dir data
```
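Whichever GPT-2 BPE implementation `data_prep.py` uses, the filter-and-pack stage has roughly this shape. A hedged sketch: `keep` and `pack` are hypothetical names, and the `tiktoken` encoding is an assumption.
```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE (assumed implementation)
SEQ_LEN = 1024

def keep(doc: str) -> bool:
    """Quality filter from above: 200-10,000 characters."""
    return 200 <= len(doc) <= 10_000

def pack(docs, seq_len: int = SEQ_LEN):
    """Concatenate tokenized docs (EOT-separated) and yield fixed-length sequences."""
    buf: list[int] = []
    for doc in filter(keep, docs):
        buf.extend(enc.encode(doc))
        buf.append(enc.eot_token)  # document separator
        while len(buf) >= seq_len:
            yield buf[:seq_len]
            buf = buf[seq_len:]
```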
***
## Pre-Training
- **Script**: `train_neo.py`
- **Config**:
```python
batch_size = 1                      # per-device; effective batch = 1 x 32 sequences
gradient_accumulation_steps = 32
max_steps = 150000
warmup_steps = 3750                 # 2.5% of max_steps
mixed_precision = "bf16"
gradient_checkpointing = True       # trade recompute for VRAM
```
- **Accelerator** handles mixed precision, gradient accumulation, and checkpointing (loop shape sketched at the end of this section).
- **Resume** from any checkpoint: set `resume_from_checkpoint` in `TrainingConfig`.
```bash
python train_neo.py # fresh
python train_neo.py --resume checkpoints/checkpoint_step_7500.pt # resume
```
- **Speed**: ~10 it/s → ~4 hours for 150 K steps on the RTX 5070.
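A minimal sketch of what an Accelerate-driven loop with these settings looks like. The names are illustrative, not taken from `train_neo.py`, and the model is assumed to return an object with a `.loss` attribute.
```python
from accelerate import Accelerator

def train(model, optimizer, loader, accum_steps: int = 32):
    """Illustrative loop shape: bf16 + gradient accumulation via Accelerate."""
    accelerator = Accelerator(mixed_precision="bf16",
                              gradient_accumulation_steps=accum_steps)
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
    model.train()
    for batch in loader:
        with accelerator.accumulate(model):  # sync/step only every 32 micro-batches
            loss = model(batch).loss         # assumption: model returns .loss
            accelerator.backward(loss)       # handles mixed-precision backward
            optimizer.step()
            optimizer.zero_grad()
```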
***
## Context Extension
- **Script**: `extend_context.py`
- Extends `config.max_seq_len` → 16,384 and interpolates position embeddings (one possible scheme is sketched below).
- **Output**: `checkpoints/extended_context_model_16k.pt`
```bash
python extend_context.py --new_max_len 16384
```
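Since the model uses rotary embeddings, one common extension scheme is position interpolation: rescale the new positions so they fall inside the originally trained range. A sketch under that assumption; `extend_context.py` may use a different scheme.
```python
import torch

def interpolated_rope_freqs(dim: int, old_len: int, new_len: int,
                            base: float = 10000.0):
    """RoPE position interpolation: squeeze new positions into the old range."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    scale = old_len / new_len                  # e.g. 1024 / 16384 = 1/16
    positions = torch.arange(new_len).float() * scale
    angles = torch.outer(positions, inv_freq)  # (new_len, dim/2)
    return torch.cos(angles), torch.sin(angles)
```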
***
## Inference & Testing
### **Generation Scripts**
`generate_text.py` offers simple prompt completion; `advanced_generate.py` tests fixed prompts and long-context usage with VRAM monitoring.
### **Interactive Chat**
`interactive_chat.py` provides a full chat interface:
- `/help`, `/params`, `/memory`, `/context`, `/clear`, `/save`, `/load`, `/multi`, `/system`, `/exit`
- Real-time GPU usage and context tracking
- Customizable sampling parameters (one nucleus-sampling step is sketched below)
```bash
python interactive_chat.py
```
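For reference, one nucleus-sampling (top-p) step over the final-position logits might look like this. The function name and defaults are illustrative, not the script's actual parameters.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_next_token(logits: torch.Tensor,
                      temperature: float = 0.8,
                      top_p: float = 0.9) -> int:
    """One nucleus-sampling step over a 1-D logits vector."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out tokens outside the smallest set whose mass covers top_p.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()  # renormalize the nucleus
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()
```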
***
## Fine-Tuning Plan
- **Dataset Recommendation**: `openbmb/UltraChat` (1.5 M dialogs) + `BAAI/Infinity-Instruct` + `vicgalle/alpaca-gpt4`
- **Script**: `finetune_neo.py` (will be extended for conversational data)
- **Goal**: Transform the base model into an instruction-following chat assistant (data formatting sketched below)
```bash
python finetune_neo.py \
--base_model checkpoints/extended_context_model_16k.pt \
--dataset /path/to/UltraChat \
--epochs 3 --lr 5e-6 --batch_size 1
```
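One standard way to prepare conversational data, which `conversation_data_prep.py` may or may not follow, is to concatenate turns and mask the loss on everything the user says, so only assistant tokens are supervised. A hypothetical sketch:
```python
IGNORE = -100  # PyTorch cross-entropy ignore_index

def build_example(dialog, enc, eot_id: int):
    """dialog: list of {"role": "user"|"assistant", "content": str} turns."""
    input_ids, labels = [], []
    for turn in dialog:
        ids = enc.encode(f"{turn['role']}: {turn['content']}\n") + [eot_id]
        input_ids.extend(ids)
        # Supervise only what the assistant says; mask the rest.
        labels.extend(ids if turn["role"] == "assistant" else [IGNORE] * len(ids))
    return input_ids, labels
```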
***
## Key Lessons & Tips
- **Quality > Quantity**: RefinedWeb's higher-quality text cut the required training steps by ~25%
- **Memory Efficiency**: 3.6 K-token contexts fit in ~1.3 GB of VRAM
- **Batch Size Tradeoff**: batch size 1 vs. 2 was the difference between fitting in VRAM and overflowing it
- **Cache Clearing**: `torch.cuda.empty_cache()` between long-context tests is essential (see the snippet below)
- **Resume Training**: checkpointing during pre-training saved 10+ hours of reruns
- **Conversational Fine-Tuning**: the final step to turn the base model into a chat assistant
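A small helper in that spirit (assumed, not from the repo) for tracking VRAM between long-context runs:
```python
import torch

def vram_report(tag: str = "") -> None:
    """Print allocated/reserved CUDA memory in GB to help spot leaks."""
    alloc = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"[{tag}] allocated={alloc:.2f} GB  reserved={reserved:.2f} GB")

# Between long-context tests:
torch.cuda.empty_cache()  # release cached blocks back to the driver
vram_report("after cleanup")
```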
***
## Next Steps
1. **Review and run conversational fine-tuning** on UltraChat.
2. **Evaluate** on standardized benchmarks (perplexity, MMLU, HellaSwag).
3. **Quantize** or **prune** for faster inference on edge devices (a minimal quantization sketch follows this list).
4. **Deploy** with FastAPI + SSE for streaming responses.
5. **Document** model card and share results.
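On the quantization idea, one low-effort baseline is PyTorch's post-training dynamic quantization of `nn.Linear` layers for CPU inference. A minimal sketch under that assumption, not a tested recipe for this model:
```python
import torch

def quantize_for_cpu(model: torch.nn.Module) -> torch.nn.Module:
    """Dynamic int8 quantization of Linear layers (CPU inference only)."""
    return torch.ao.quantization.quantize_dynamic(
        model.cpu(), {torch.nn.Linear}, dtype=torch.qint8
    )
```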
***
Thank you for following this detailed project! Your model is now a **powerful, efficient LLM** ready for conversational fine-tuning and deployment. Good luck!