# MAP-NEO Mini: A DIY LLM from Scratch
This repository demonstrates a complete, end-to-end journey of building, extending, and deploying a custom 253M-parameter language model on modest hardware (RTX 5070, 8 GB VRAM). It covers data preparation, model training, context-window extension, interactive inference, and GPU optimization.
## Project Overview
- Model: MAP-NEO Mini (253M parameters)
- Architecture: Rotary embeddings, RMSNorm, SwiGLU, Flash Attention (see the RMSNorm sketch after this list)
- Hardware: Intel i5 CPU, 16 GB RAM → NVIDIA RTX 4000 (20 GB) → RTX 5070 (8 GB)
- Data: RefinedWeb (100K high-quality web docs, 41M tokens)
- Context Window: Extended from 1,024 → 16,384 tokens
- Training: Mixed precision (bf16), gradient checkpointing, gradient accumulation
- Fine-Tuning: Planned conversational instruction tuning with UltraChat
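As a taste of the architecture, here is a minimal RMSNorm sketch. It follows the standard formulation; the actual variant in `model_neo.py` may differ in details such as `eps` or dtype handling.

```python
# Minimal RMSNorm sketch (standard formulation; model_neo.py may differ).
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by root-mean-square instead of mean/variance (no centering).
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```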
## Repository Structure
```
AI/
├── checkpoints/                  # Model checkpoints & configs
│   ├── checkpoint_step_149999.pt # Last pre-training checkpoint
│   ├── extended_context_model.pt # 8K context model
│   └── model_config.json         # Config for extended model
├── data/                         # Raw and processed data
│   ├── shards/                   # Raw JSONL shards
│   ├── processed/                # Filtered JSONL
│   └── tokens/                   # Packed token sequences
├── clean_conversational_neo/     # Conversational training scripts
├── configs/                      # training_config.json, data_config.json
├── logs/                         # TensorBoard logs
├── notebooks/                    # Exploratory Jupyter notebooks
├── advanced_generate.py          # Advanced inference & context tests
├── conversation_data_prep.py     # Prepares chat data for fine-tuning
├── data_prep.py                  # RefinedWeb download & preprocessing
├── debug_downloaded_data.py      # Inspect raw data quality
├── extend_context.py             # Script to extend model context window
├── finetune_neo.py               # Base fine-tuning script
├── generate_text.py              # Simple generation utility
├── interactive_chat.py           # Interactive chat interface
├── model_neo.py                  # Model & config definitions
├── requirements.txt              # Python dependencies
├── run_training.py               # Orchestrates data prep → training
├── scale_data.py                 # Utilities for sampling & scaling datasets
├── setup_project.py              # Initial setup (venv, downloads)
├── test_conversational_neo.py    # Tests on small conversational model
└── train_neo.py                  # Main pre-training script
```
## Setup & Installation
- Clone this repo.
- Create a virtual environment (Python 3.10+) and install dependencies:

  ```bash
  python -m venv .venv
  source .venv/bin/activate   # or .venv\Scripts\activate on Windows
  pip install --upgrade pip
  pip install -r requirements.txt
  ```

- Install GPU drivers and CUDA (if using an RTX card).
- Optional: `pip install tensorboard pynvml` for logging and GPU monitoring.
## Data Preparation
- Dataset: `tiiuae/falcon-refinedweb`
- Script: `data_prep.py`
  - Downloads 100K docs, filters for quality (200–10,000 chars, English only)
  - Tokenizes with GPT-2 BPE, packs into sequences of length 1,024 (see the packing sketch below)
- Output:
  - Raw shards: `data/shards/refinedweb_sample_raw.jsonl`
  - Filtered: `data/processed/refinedweb_filtered.jsonl`
  - Packed tokens: `data/tokens/packed_1024.txt`

```bash
python data_prep.py --num_docs 100000 --seq_length 1024 --tokenizer gpt2 --output_dir data
```
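For reference, a minimal sketch of the tokenize-and-pack step, assuming the Hugging Face GPT-2 tokenizer. The real `data_prep.py` adds downloading, filtering, sharding, and output formatting on top of this.

```python
# Minimal packing sketch (hypothetical helper, not data_prep.py itself).
from transformers import GPT2TokenizerFast

def pack_documents(docs, seq_length=1024):
    """Tokenize docs with GPT-2 BPE and pack into fixed-length sequences."""
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    buffer = []
    for doc in docs:
        # Append EOS between documents so the model sees boundaries.
        buffer.extend(tokenizer.encode(doc) + [tokenizer.eos_token_id])
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]

# Two toy documents packed into 1,024-token training sequences.
sequences = list(pack_documents(["First document ...", "Second document ..."]))
```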
## Pre-Training
- Script: `train_neo.py`
- Config:

  ```python
  batch_size = 1
  gradient_accumulation_steps = 32
  max_steps = 150000
  warmup_steps = 3750
  mixed_precision = "bf16"
  gradient_checkpointing = True
  ```

- Accelerator handles mixed precision, gradient accumulation, and checkpointing (see the loop sketch below).
- Resume from any checkpoint: set `resume_from_checkpoint` in `TrainingConfig`.

```bash
python train_neo.py                                              # fresh
python train_neo.py --resume checkpoints/checkpoint_step_7500.pt # resume
```

- Speed: ~10 it/s → ~4 hours for 150K steps on RTX 5070.
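The core loop follows the Hugging Face Accelerate pattern sketched below, with a toy model and dataset standing in for MAP-NEO Mini (assumes a bf16-capable GPU). `train_neo.py` adds the real model, LR schedule, and logging.

```python
# Sketch of the Accelerate pattern for bf16 + gradient accumulation
# + checkpointing; toy model/data stand in for MAP-NEO Mini.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

model = torch.nn.Linear(1024, 1024)                  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 1024)), batch_size=1)

accelerator = Accelerator(mixed_precision="bf16",
                          gradient_accumulation_steps=32)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for step, (batch,) in enumerate(loader):
    with accelerator.accumulate(model):              # syncs grads every 32 micro-steps
        loss = model(batch).pow(2).mean()            # placeholder loss
        accelerator.backward(loss)                   # precision-aware backward
        optimizer.step()
        optimizer.zero_grad()
    if step > 0 and step % 7500 == 0:
        accelerator.save_state("checkpoints/")       # resumable training state
```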
## Context Extension
- Script: `extend_context.py`
- Extends `config.max_seq_len` → 16,384 and interpolates position embeddings (a sketch of the interpolation idea follows).
- Output: `checkpoints/extended_context_model_16k.pt`

```bash
python extend_context.py --new_max_len 16384
```
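With rotary embeddings, "interpolating positions" usually means squeezing the new, longer position range into the range the model was trained on (linear position interpolation). A sketch of that idea, which may differ from the exact approach in `extend_context.py`:

```python
# Sketch of RoPE linear position interpolation (general technique;
# extend_context.py's implementation may differ in detail).
import torch

def interpolated_rope_freqs(dim, new_max_len, old_max_len=1024, base=10000.0):
    """Rotary angle table with positions squeezed by the old/new ratio."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(new_max_len).float()
    positions = positions * (old_max_len / new_max_len)  # squeeze into trained range
    return torch.outer(positions, inv_freq)              # shape: (new_max_len, dim/2)

freqs = interpolated_rope_freqs(dim=64, new_max_len=16384)
```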
## Inference & Testing
### Simple Generation

`generate_text.py` generates text from a single prompt; `advanced_generate.py` tests fixed prompts and long-context usage with VRAM monitoring.

### Interactive Chat

`interactive_chat.py` provides a full chat interface:

- Commands: `/help`, `/params`, `/memory`, `/context`, `/clear`, `/save`, `/load`, `/multi`, `/system`, `/exit`
- Real-time GPU usage and context tracking (see the pynvml sketch below)
- Customizable sampling parameters

```bash
python interactive_chat.py
```
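The GPU readout can be done with `pynvml` (the optional dependency installed above). A minimal sketch of the kind of VRAM probe the chat interface could call each turn; the actual monitoring code may differ:

```python
# Minimal VRAM probe via pynvml (illustrative, not interactive_chat.py itself).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)        # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM: {mem.used / 2**30:.2f} / {mem.total / 2**30:.2f} GiB")
pynvml.nvmlShutdown()
```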
## Fine-Tuning Plan
- Dataset Recommendation: `openbmb/UltraChat` (1.5M dialogs) + `BAAI/Infinity-Instruct` + `vicgalle/alpaca-gpt4`
- Script: `finetune_neo.py` (will be extended for conversational data; see the formatting sketch below)
- Goal: Transform base model → instruction-following chat assistant

```bash
python finetune_neo.py \
  --base_model checkpoints/extended_context_model_16k.pt \
  --dataset /path/to/UltraChat \
  --epochs 3 --lr 5e-6 --batch_size 1
```
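Conversational data first has to be flattened into training text. A sketch of one way to do it; the role-tag format here is an illustrative choice, not a fixed project convention:

```python
# Flatten a multi-turn dialog into one training string
# (role tags are hypothetical, not the project's fixed format).
def format_dialog(turns, eos="<|endoftext|>"):
    """turns: list of (role, text) pairs, roles 'user' / 'assistant'."""
    parts = [f"{role.capitalize()}: {text}" for role, text in turns]
    return "\n".join(parts) + eos

sample = format_dialog([
    ("user", "What is RMSNorm?"),
    ("assistant", "A normalization layer that rescales by the RMS of activations."),
])
```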
## Key Lessons & Tips
- Quality > Quantity: RefinedWeb's data quality cut the required training steps by 25%
- Memory Efficiency: Achieved 3.6K-token contexts at ~1.3 GB VRAM
- Batch Size Tradeoff: Batch size 1 vs. 2 is the difference between fitting in VRAM and overflowing it
- Cache Clearing: `torch.cuda.empty_cache()` is essential for long-context tests (see the sketch after this list)
- Resume Training: Checkpointing during pre-training saved 10+ hours
- Conversational Fine-Tuning: The final step to transform the base model into a chat assistant
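On the cache-clearing point: freeing a tensor returns memory to PyTorch's caching allocator, not to the driver, so an explicit flush between long-context tests keeps headroom. A minimal demonstration:

```python
# Why empty_cache() matters: PyTorch's allocator caches freed blocks.
import torch

x = torch.randn(3600, 1024, device="cuda")  # stand-in for long-context activations
del x                                       # tensor freed, blocks still cached
print(torch.cuda.memory_reserved())         # reserved memory stays high
torch.cuda.empty_cache()                    # return cached blocks to the driver
print(torch.cuda.memory_reserved())         # now lower (visible in nvidia-smi)
```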
## Next Steps
- Review and run conversational fine-tuning on UltraChat.
- Evaluate on standardized benchmarks (perplexity, MMLU, HellaSwag).
- Quantize or prune for faster inference on edge devices.
- Deploy with FastAPI + SSE for streaming responses.
- Document model card and share results.
Thank you for following this detailed project! Your model is now a powerful, efficient LLM ready for conversational fine-tuning and deployment. Good luck!