# MAP-NEO Mini: A DIY LLM from Scratch
This repository demonstrates a complete, end-to-end journey of building, extending, and deploying a custom 253M-parameter language model on modest hardware (RTX 5070, 8 GB VRAM). It covers data preparation, model training, context window extension, interactive inference, and GPU optimization.
***
## Project Overview
- **Model**: MAP-NEO Mini (253M parameters)
- **Architecture**: Rotary embeddings, RMSNorm, SwiGLU, Flash Attention (see the sketch after this list)
- **Hardware**: Intel i5 CPU, 16 GB RAM → NVIDIA RTX 4000 (20 GB) → RTX 5070 (8 GB)
- **Data**: RefinedWeb (100K high-quality web docs, 41M tokens)
- **Context Window**: Extended from 1,024 → 16,384 tokens
- **Training**: Mixed precision (bf16), gradient checkpointing, gradient accumulation
- **Fine-Tuning**: Planned conversational instruction tuning with UltraChat
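For concreteness, here are minimal PyTorch sketches of two of the listed building blocks, RMSNorm and SwiGLU. These are illustrative definitions, not the exact code in `model_neo.py`:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescale by the RMS of the activations,
    with a learned gain but no mean-centering (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit,
    as used in LLaMA-style architectures."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```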
***
## Repository Structure
```
AI/
├── checkpoints/                  # Model checkpoints & configs
│   ├── checkpoint_step_149999.pt    # Last pre-training checkpoint
│   ├── extended_context_model.pt    # 8K context model
│   └── model_config.json            # Config for extended model
├── data/                         # Raw and processed data
│   ├── shards/                   # Raw JSONL shards
│   ├── processed/                # Filtered JSONL
│   └── tokens/                   # Packed token sequences
├── clean_conversational_neo/     # Conversational training scripts
├── configs/                      # training_config.json, data_config.json
├── logs/                         # TensorBoard logs
├── notebooks/                    # Exploratory Jupyter notebooks
├── advanced_generate.py          # Advanced inference & context tests
├── conversation_data_prep.py     # Prepares chat data for fine-tuning
├── data_prep.py                  # RefinedWeb download & preprocessing
├── debug_downloaded_data.py      # Inspect raw data quality
├── extend_context.py             # Script to extend model context window
├── finetune_neo.py               # Base fine-tuning script
├── generate_text.py              # Simple generation utility
├── interactive_chat.py           # Interactive chat interface
├── model_neo.py                  # Model & config definitions
├── requirements.txt              # Python dependencies
├── run_training.py               # Orchestrates data prep → training
├── scale_data.py                 # Utilities for sampling & scaling datasets
├── setup_project.py              # Initial setup (venv, downloads)
├── test_conversational_neo.py    # Tests on small conversational model
└── train_neo.py                  # Main pre-training script
```
***
## Setup & Installation
1. **Clone** this repo.
2. **Create a virtual environment** (Python 3.10+):
```bash
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install --upgrade pip
pip install -r requirements.txt
```
3. **Install GPU drivers** and **CUDA** (if using an RTX card).
4. **Optional**: `pip install tensorboard pynvml` for logging and GPU monitoring.
***
## Data Preparation
- **Dataset**: `tiiuae/falcon-refinedweb`
- **Script**: `data_prep.py`
  - Downloads 100K docs, filters for quality (200–10,000 chars, English only)
  - Tokenizes with GPT-2 BPE, packs into sequences of length 1,024 (see the packing sketch below)
- **Output**:
  - Raw shards: `data/shards/refinedweb_sample_raw.jsonl`
  - Filtered: `data/processed/refinedweb_filtered.jsonl`
  - Packed tokens: `data/tokens/packed_1024.txt`
```bash
python data_prep.py --num_docs 100000 --seq_length 1024 --tokenizer gpt2 --output_dir data
```
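For clarity, here is a sketch of the tokenize-and-pack step referenced above, using the Hugging Face GPT-2 tokenizer. `pack_documents` is illustrative, not the actual `data_prep.py` implementation:
```python
from transformers import GPT2TokenizerFast

def pack_documents(docs, seq_length=1024):
    """Concatenate tokenized docs (separated by EOS) into one token stream,
    then slice the stream into fixed-length training sequences,
    discarding the ragged tail."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    buffer = []
    for doc in docs:
        buffer.extend(tok.encode(doc) + [tok.eos_token_id])
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]

# Example: pack three short docs into 1,024-token sequences.
sequences = list(pack_documents(["First document...", "Second...", "Third..."]))
```
Packing wastes no tokens on padding: every position in every sequence is real training signal, which matters at this small scale.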
***
## Pre-Training
- **Script**: `train_neo.py`
- **Config**:
```python
batch_size = 1
gradient_accumulation_steps = 32   # effective batch size of 32
max_steps = 150000
warmup_steps = 3750
mixed_precision = "bf16"
gradient_checkpointing = True
```
- **Accelerator** handles mixed precision, gradient accumulation, and checkpointing (see the loop sketch at the end of this section).
- **Resume** from any checkpoint: set `resume_from_checkpoint` in `TrainingConfig`.
```bash
python train_neo.py                                                # fresh run
python train_neo.py --resume checkpoints/checkpoint_step_7500.pt   # resume
```
- **Speed**: ~10 it/s, i.e. roughly 4 hours for 150K steps on the RTX 5070.
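For orientation, here is a minimal sketch of the Accelerate-style loop described above. `NeoModel`, `get_dataloader`, and the learning rate are illustrative placeholders, not the actual `train_neo.py` API:
```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=32)
model = NeoModel()                       # placeholder for the 253M model
model.gradient_checkpointing_enable()    # trade recompute for activation memory
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # illustrative LR
dataloader = get_dataloader(batch_size=1)                   # placeholder

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for step, batch in enumerate(dataloader):
    # Gradients accumulate across 32 micro-batches before each optimizer update.
    with accelerator.accumulate(model):
        loss = model(batch["input_ids"], labels=batch["labels"])  # returns LM loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    if step % 7500 == 0:  # cadence matching the checkpoint names above
        accelerator.save_state(f"checkpoints/checkpoint_step_{step}")
```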
***
## Context Extension
- **Script**: `extend_context.py`
- Extends `config.max_seq_len` to 16,384 and interpolates the position embeddings (see the sketch below).
- **Output**: `checkpoints/extended_context_model_16k.pt`
```bash
python extend_context.py --new_max_len 16384
```
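Since the model uses rotary embeddings, "interpolating position embeddings" here presumably means RoPE position interpolation: positions are scaled down by the old-to-new length ratio so a 16K sequence maps onto the 1K position range seen during pre-training. A minimal sketch under that assumption (head dimension and base are illustrative):
```python
import torch

def rope_frequencies(dim, max_len, old_len, base=10000.0):
    """Cos/sin tables for interpolated RoPE: positions are compressed by
    old_len/max_len instead of extrapolating past the trained range."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(max_len).float() * (old_len / max_len)  # interpolation
    angles = torch.outer(positions, inv_freq)  # shape: (max_len, dim // 2)
    return torch.cos(angles), torch.sin(angles)

cos, sin = rope_frequencies(dim=64, max_len=16384, old_len=1024)
```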
***
## Inference & Testing
### **Scripted Generation**
`generate_text.py` offers simple one-shot generation; `advanced_generate.py` tests fixed prompts and long-context usage with VRAM monitoring.
### **Interactive Chat**
`interactive_chat.py` provides a full chat interface (a sampling sketch follows below):
- `/help`, `/params`, `/memory`, `/context`, `/clear`, `/save`, `/load`, `/multi`, `/system`, `/exit`
- Real-time GPU usage and context tracking
- Customizable sampling parameters
```bash
python interactive_chat.py
```
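As an illustration of the sampling parameters the chat interface exposes, here is a generic temperature plus top-p (nucleus) sampling step; it is a sketch, not the actual `interactive_chat.py` code:
```python
import torch

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """Pick the next token id from a (vocab,) logits tensor."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose probability mass reaches top_p.
    keep = cumulative - sorted_probs < top_p
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()  # renormalize
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids[choice].item()
```
Lower temperature sharpens the distribution; lower top-p trims the unreliable tail, which matters for a 253M model prone to rambling.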
***
## Fine-Tuning Plan
- **Dataset Recommendation**: `openbmb/UltraChat` (1.5M dialogues) + `BAAI/Infinity-Instruct` + `vicgalle/alpaca-gpt4`
- **Script**: `finetune_neo.py` (will be extended for conversational data; see the formatting sketch below)
- **Goal**: Transform the base model into an instruction-following chat assistant
```bash
python finetune_neo.py \
  --base_model checkpoints/extended_context_model_16k.pt \
  --dataset /path/to/UltraChat \
  --epochs 3 --lr 5e-6 --batch_size 1
```
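One plausible way to prepare conversational data for this step is to flatten each dialogue into a token stream and mask the loss on non-assistant turns, so the model only learns to produce replies. The role template below is an assumption, not necessarily the format `conversation_data_prep.py` produces:
```python
from transformers import GPT2TokenizerFast

IGNORE_INDEX = -100  # PyTorch cross-entropy skips these label positions

def build_example(turns, tokenizer):
    """turns: list of (role, text) pairs. Returns input_ids and
    loss-masked labels for causal LM fine-tuning."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenizer.encode(f"<|{role}|>\n{text}\n")
        input_ids.extend(ids)
        # Learn only from assistant tokens; mask user/system turns.
        labels.extend(ids if role == "assistant" else [IGNORE_INDEX] * len(ids))
    return input_ids, labels

tok = GPT2TokenizerFast.from_pretrained("gpt2")
ids, labels = build_example(
    [("user", "What is RMSNorm?"), ("assistant", "A mean-free layer norm...")], tok
)
```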
***
## Key Lessons & Tips
- **Quality > Quantity**: RefinedWeb's data quality cut the required training steps by ~25%
- **Memory Efficiency**: Achieved 3.6K-token contexts at ~1.3 GB VRAM
- **Batch Size Tradeoff**: Batch size 1 vs. 2 was the difference between fitting in VRAM and overflowing
- **Cache Clearing**: `torch.cuda.empty_cache()` is essential between long-context tests (see the sketch after this list)
- **Resume Training**: Checkpointing during pre-training saved 10+ hours
- **Conversational Fine-Tuning**: The final step to transform the base model into a chat assistant
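A small sketch of the cache-clearing and VRAM-monitoring pattern behind the tips above, using the optional `pynvml` dependency from the setup section:
```python
import torch
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def report_vram(tag):
    """Print used/total device memory as seen by the NVIDIA driver."""
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"{tag}: {info.used / 2**30:.2f} / {info.total / 2**30:.2f} GiB")

report_vram("before")
# ... run a long-context generation here ...
torch.cuda.empty_cache()   # release cached allocator blocks back to the driver
report_vram("after empty_cache")
```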
***
## Next Steps
1. **Review and run conversational fine-tuning** on UltraChat.
2. **Evaluate** on standardized benchmarks (perplexity, MMLU, HellaSwag).
3. **Quantize** or **prune** for faster inference on edge devices (see the sketch after this list).
4. **Deploy** with FastAPI + SSE for streaming responses.
5. **Document** a model card and share results.
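For step 3, a minimal sketch of post-training dynamic quantization with PyTorch: Linear weights go to int8, which mainly speeds up CPU/edge inference. `load_model` is a hypothetical helper, not a function in this repo, and whether int8 suits this model is untested here:
```python
import torch

model = load_model("checkpoints/extended_context_model_16k.pt")  # hypothetical
model_int8 = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # quantize Linear layers only
)
torch.save(model_int8.state_dict(), "checkpoints/model_int8.pt")
```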
***
Thank you for following this detailed project! Your model is now a **compact, efficient LLM** ready for conversational fine-tuning and deployment. Good luck!