# MAP-NEO Mini: A DIY LLM from Scratch
This repository demonstrates a complete, end-to-end journey of building, extending, and deploying a custom 253M-parameter language model on modest hardware (RTX 5070, 8 GB VRAM). It covers data preparation, model training, context-window extension, interactive inference, and GPU optimization.
## Project Overview
- Model: MAP-NEO Mini (253M parameters)
- Architecture: Rotary embeddings, RMSNorm, SwiGLU, Flash Attention (see the RMSNorm sketch after this list)
- Hardware: Intel i5 CPU, 16 GB RAM → NVIDIA RTX 4000 (20 GB) → RTX 5070 (8 GB)
- Data: RefinedWeb (100K high-quality web docs, 41M tokens)
- Context Window: Extended from 1,024 → 16,384 tokens
- Training: Mixed precision (bf16), gradient checkpointing, gradient accumulation
- Fine-Tuning: Planned conversational instruction tuning with UltraChat
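As a taste of the architecture, here is a minimal RMSNorm sketch. It follows the standard formulation; the actual variant in `model_neo.py` may differ in details such as `eps` or dtype handling.

```python
# Minimal RMSNorm sketch (standard formulation; model_neo.py may differ).
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by root-mean-square instead of mean/variance (no centering).
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```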
## Repository Structure
```
AI/
├── checkpoints/                  # Model checkpoints & configs
│   ├── checkpoint_step_149999.pt # Last pre-training checkpoint
│   ├── extended_context_model.pt # 8K context model
│   └── model_config.json         # Config for extended model
├── data/                         # Raw and processed data
│   ├── shards/                   # Raw JSONL shards
│   ├── processed/                # Filtered JSONL
│   └── tokens/                   # Packed token sequences
├── clean_conversational_neo/     # Conversational training scripts
├── configs/                      # training_config.json, data_config.json
├── logs/                         # TensorBoard logs
├── notebooks/                    # Exploratory Jupyter notebooks
├── advanced_generate.py          # Advanced inference & context tests
├── conversation_data_prep.py     # Prepares chat data for fine-tuning
├── data_prep.py                  # RefinedWeb download & preprocessing
├── debug_downloaded_data.py      # Inspect raw data quality
├── extend_context.py             # Script to extend model context window
├── finetune_neo.py               # Base fine-tuning script
├── generate_text.py              # Simple generation utility
├── interactive_chat.py           # Interactive chat interface
├── model_neo.py                  # Model & config definitions
├── requirements.txt              # Python dependencies
├── run_training.py               # Orchestrates data prep → training
├── scale_data.py                 # Utilities for sampling & scaling datasets
├── setup_project.py              # Initial setup (venv, downloads)
├── test_conversational_neo.py    # Tests on small conversational model
└── train_neo.py                  # Main pre-training script
```
## Setup & Installation
- Clone this repo.
- Create a virtual environment (Python 3.10+) and install dependencies:

  ```bash
  python -m venv .venv
  source .venv/bin/activate   # or .venv\Scripts\activate on Windows
  pip install --upgrade pip
  pip install -r requirements.txt
  ```

- Install GPU drivers and CUDA (if using an RTX card).
- Optional: `pip install tensorboard pynvml` for logging and GPU monitoring.
## Data Preparation
- Dataset: `tiiuae/falcon-refinedweb`
- Script: `data_prep.py`
  - Downloads 100K docs, filters for quality (200–10,000 chars, English only)
  - Tokenizes with GPT-2 BPE, packs into sequences of length 1,024 (see the packing sketch below)
- Output:
  - Raw shards: `data/shards/refinedweb_sample_raw.jsonl`
  - Filtered: `data/processed/refinedweb_filtered.jsonl`
  - Packed tokens: `data/tokens/packed_1024.txt`

```bash
python data_prep.py --num_docs 100000 --seq_length 1024 --tokenizer gpt2 --output_dir data
```
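For reference, a minimal sketch of the tokenize-and-pack step, assuming the Hugging Face GPT-2 tokenizer. The real `data_prep.py` adds downloading, filtering, sharding, and output formatting on top of this.

```python
# Minimal packing sketch (hypothetical helper, not data_prep.py itself).
from transformers import GPT2TokenizerFast

def pack_documents(docs, seq_length=1024):
    """Tokenize docs with GPT-2 BPE and pack into fixed-length sequences."""
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    buffer = []
    for doc in docs:
        # Append EOS between documents so the model sees boundaries.
        buffer.extend(tokenizer.encode(doc) + [tokenizer.eos_token_id])
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]

# Two toy documents packed into 1,024-token training sequences.
sequences = list(pack_documents(["First document ...", "Second document ..."]))
```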
## Pre-Training
- Script: `train_neo.py`
- Config:

  ```python
  batch_size = 1
  gradient_accumulation_steps = 32
  max_steps = 150000
  warmup_steps = 3750
  mixed_precision = "bf16"
  gradient_checkpointing = True
  ```

- Accelerator handles mixed precision, gradient accumulation, and checkpointing (see the loop sketch below).
- Resume from any checkpoint: set `resume_from_checkpoint` in `TrainingConfig`.

```bash
python train_neo.py                                              # fresh
python train_neo.py --resume checkpoints/checkpoint_step_7500.pt # resume
```

- Speed: ~10 it/s → ~4 hours for 150K steps on RTX 5070.
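The core loop follows the Hugging Face Accelerate pattern sketched below, with a toy model and dataset standing in for MAP-NEO Mini (assumes a bf16-capable GPU). `train_neo.py` adds the real model, LR schedule, and logging.

```python
# Sketch of the Accelerate pattern for bf16 + gradient accumulation
# + checkpointing; toy model/data stand in for MAP-NEO Mini.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

model = torch.nn.Linear(1024, 1024)                  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 1024)), batch_size=1)

accelerator = Accelerator(mixed_precision="bf16",
                          gradient_accumulation_steps=32)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for step, (batch,) in enumerate(loader):
    with accelerator.accumulate(model):              # syncs grads every 32 micro-steps
        loss = model(batch).pow(2).mean()            # placeholder loss
        accelerator.backward(loss)                   # precision-aware backward
        optimizer.step()
        optimizer.zero_grad()
    if step > 0 and step % 7500 == 0:
        accelerator.save_state("checkpoints/")       # resumable training state
```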
## Context Extension
- Script: `extend_context.py`
- Extends `config.max_seq_len` → 16,384 and interpolates position embeddings (a sketch of the interpolation idea follows).
- Output: `checkpoints/extended_context_model_16k.pt`

```bash
python extend_context.py --new_max_len 16384
```
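With rotary embeddings, "interpolating positions" usually means squeezing the new, longer position range into the range the model was trained on (linear position interpolation). A sketch of that idea, which may differ from the exact approach in `extend_context.py`:

```python
# Sketch of RoPE linear position interpolation (general technique;
# extend_context.py's implementation may differ in detail).
import torch

def interpolated_rope_freqs(dim, new_max_len, old_max_len=1024, base=10000.0):
    """Rotary angle table with positions squeezed by the old/new ratio."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(new_max_len).float()
    positions = positions * (old_max_len / new_max_len)  # squeeze into trained range
    return torch.outer(positions, inv_freq)              # shape: (new_max_len, dim/2)

freqs = interpolated_rope_freqs(dim=64, new_max_len=16384)
```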
## Inference & Testing
### Simple Generation

`generate_text.py` generates text from a single prompt; `advanced_generate.py` tests fixed prompts and long-context usage with VRAM monitoring.

### Interactive Chat

`interactive_chat.py` provides a full chat interface:

- Commands: `/help`, `/params`, `/memory`, `/context`, `/clear`, `/save`, `/load`, `/multi`, `/system`, `/exit`
- Real-time GPU usage and context tracking (see the pynvml sketch below)
- Customizable sampling parameters

```bash
python interactive_chat.py
```
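The GPU readout can be done with `pynvml` (the optional dependency installed above). A minimal sketch of the kind of VRAM probe the chat interface could call each turn; the actual monitoring code may differ:

```python
# Minimal VRAM probe via pynvml (illustrative, not interactive_chat.py itself).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)        # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM: {mem.used / 2**30:.2f} / {mem.total / 2**30:.2f} GiB")
pynvml.nvmlShutdown()
```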
## Fine-Tuning Plan
- Dataset Recommendation: `openbmb/UltraChat` (1.5M dialogs) + `BAAI/Infinity-Instruct` + `vicgalle/alpaca-gpt4`
- Script: `finetune_neo.py` (will be extended for conversational data; see the formatting sketch below)
- Goal: Transform base model → instruction-following chat assistant

```bash
python finetune_neo.py \
  --base_model checkpoints/extended_context_model_16k.pt \
  --dataset /path/to/UltraChat \
  --epochs 3 --lr 5e-6 --batch_size 1
```
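Conversational data first has to be flattened into training text. A sketch of one way to do it; the role-tag format here is an illustrative choice, not a fixed project convention:

```python
# Flatten a multi-turn dialog into one training string
# (role tags are hypothetical, not the project's fixed format).
def format_dialog(turns, eos="<|endoftext|>"):
    """turns: list of (role, text) pairs, roles 'user' / 'assistant'."""
    parts = [f"{role.capitalize()}: {text}" for role, text in turns]
    return "\n".join(parts) + eos

sample = format_dialog([
    ("user", "What is RMSNorm?"),
    ("assistant", "A normalization layer that rescales by the RMS of activations."),
])
```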
## Key Lessons & Tips
- Quality > Quantity: RefinedWeb's data quality cut the required training steps by 25%
- Memory Efficiency: Achieved 3.6K-token contexts at ~1.3 GB VRAM
- Batch Size Tradeoff: Batch size 1 vs. 2 is the difference between fitting in VRAM and overflowing it
- Cache Clearing: `torch.cuda.empty_cache()` is essential for long-context tests (see the sketch after this list)
- Resume Training: Checkpointing during pre-training saved 10+ hours
- Conversational Fine-Tuning: The final step to transform the base model into a chat assistant
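On the cache-clearing point: freeing a tensor returns memory to PyTorch's caching allocator, not to the driver, so an explicit flush between long-context tests keeps headroom. A minimal demonstration:

```python
# Why empty_cache() matters: PyTorch's allocator caches freed blocks.
import torch

x = torch.randn(3600, 1024, device="cuda")  # stand-in for long-context activations
del x                                       # tensor freed, blocks still cached
print(torch.cuda.memory_reserved())         # reserved memory stays high
torch.cuda.empty_cache()                    # return cached blocks to the driver
print(torch.cuda.memory_reserved())         # now lower (visible in nvidia-smi)
```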
## Next Steps
- Review and run conversational fine-tuning on UltraChat.
- Evaluate on standardized benchmarks (perplexity, MMLU, HellaSwag).
- Quantize or prune for faster inference on edge devices.
- Deploy with FastAPI + SSE for streaming responses.
- Document model card and share results.
Thank you for following this detailed project! Your model is now a powerful, efficient LLM ready for conversational fine-tuning and deployment. Good luck!