Map-NEO / Readme.md
Austin207's picture
Upload folder using huggingface_hub
a683148 verified

MAP-NEO Mini: A DIY LLM from Scratch

This repository demonstrates a complete, end-to-end journey of building, extending, and deploying a custom 253 M-parameter language model on modest hardware (RTX 5070, 8 GB VRAM). It covers data preparation, model training, context window extension, interactive inference, and GPU optimization.


πŸš€ Project Overview

  • Model: MAP-NEO Mini (253 M parameters)
  • Architecture: Rotary embeddings, RMSNorm, SwiGLU, Flash Attention
  • Hardware: Intel i5 CPU, 16 GB RAM β†’ NVIDIA RTX 4000 (20 GB) β†’ RTX 5070 (8 GB)
  • Data: RefinedWeb (100 K high-quality web docs, 41 M tokens)
  • Context Window: Extended from 1,024 β†’ 16,384 tokens
  • Training: Mixed precision (bf16), gradient checkpointing, gradient accumulation
  • Fine-Tuning: Planned conversational instruction tuning with UltraChat

πŸ“‚ Repository Structure

AI/
β”œβ”€ checkpoints/                  # Model checkpoints & configs
β”‚  β”œβ”€ checkpoint_step_149999.pt  # Last pre-training checkpoint
β”‚  β”œβ”€ extended_context_model.pt  # 8K context model
β”‚  └─ model_config.json          # Config for extended model
β”œβ”€ data/                         # Raw and processed data
β”‚  β”œβ”€ shards/                    # Raw JSONL shards
β”‚  β”œβ”€ processed/                 # Filtered JSONL
β”‚  └─ tokens/                    # Packed token sequences
β”œβ”€ clean_conversational_neo/     # Conversational training scripts
β”œβ”€ configs/                      # training_config.json, data_config.json
β”œβ”€ logs/                         # TensorBoard logs
β”œβ”€ notebooks/                    # Exploratory Jupyter notebooks
β”œβ”€ advanced_generate.py          # Advanced inference & context tests
β”œβ”€ conversation_data_prep.py     # Prepares chat data for fine-tuning
β”œβ”€ data_prep.py                  # RefinedWeb download & preprocessing
β”œβ”€ debug_downloaded_data.py      # Inspect raw data quality
β”œβ”€ extend_context.py             # Script to extend model context window
β”œβ”€ finetune_neo.py               # Base fine-tuning script
β”œβ”€ generate_text.py              # Simple generation utility
β”œβ”€ interactive_chat.py           # Interactive chat interface
β”œβ”€ model_neo.py                  # Model & config definitions
β”œβ”€ requirements.txt              # Python dependencies
β”œβ”€ run_training.py               # Orchestrates data prep β†’ training
β”œβ”€ scale_data.py                 # Utilities for sampling & scaling datasets
β”œβ”€ setup_project.py              # Initial setup (venv, downloads)
β”œβ”€ test_conversational_neo.py    # Tests on small conversational model
└─ train_neo.py                  # Main pre-training script

πŸ› οΈ Setup & Installation

  1. Clone this repo.
  2. Create virtual environment (Python 3.10+):
    python -m venv .venv
    source .venv/bin/activate  # or .venv\Scripts\activate
    pip install --upgrade pip
    pip install -r requirements.txt
    
  3. Install GPU drivers and CUDA (if using RTX).
  4. Optional: pip install tensorboard pynvml for logging and GPU monitoring.

πŸ“Š Data Preparation

  • Dataset: tiiuae/falcon-refinedweb
  • Script: data_prep.py
    • Downloads 100 K docs, filters for quality (200–10,000 chars, English only)
    • Tokenizes with GPT-2 BPE, packs into sequences of length 1,024
  • Output:
    • Raw shards: data/shards/refinedweb_sample_raw.jsonl
    • Filtered: data/processed/refinedweb_filtered.jsonl
    • Packed tokens: data/tokens/packed_1024.txt
python data_prep.py --num_docs 100000 --seq_length 1024 --tokenizer gpt2 --output_dir data

πŸ‹οΈ Pre-Training

  • Script: train_neo.py
  • Config:
    batch_size = 1
    gradient_accumulation_steps = 32
    max_steps = 150000
    warmup_steps = 3750
    mixed_precision = "bf16"
    gradient_checkpointing = True
    
  • Accelerator handles mixed-precision, gradient accumulation, checkpointing.
  • Resume from any checkpoint: set resume_from_checkpoint in TrainingConfig.
python train_neo.py        # fresh
python train_neo.py --resume checkpoints/checkpoint_step_7500.pt  # resume
  • Speed: ~10 it/s β†’ 4 hours for 150K steps on RTX 5070.

πŸ”§ Context Extension

  • Script: extend_context.py
  • Extends config.max_seq_len β†’ 16,384 and interpolates position embeddings.
  • Output: checkpoints/extended_context_model_16k.pt
python extend_context.py --new_max_len 16384

πŸ€– Inference & Testing

Simple Generation

advanced_generate.py tests fixed prompts and long context usage with VRAM monitoring.

Interactive Chat

interactive_chat.py provides a full chat interface:

  • /help, /params, /memory, /context, /clear, /save, /load, /multi, /system, /exit
  • Real-time GPU usage and context tracking
  • Customizable sampling parameters
python interactive_chat.py

πŸ“ˆ Fine-Tuning Plan

  • Dataset Recommendation: openbmb/UltraChat (1.5 M dialogs) + BAAI/Infinity-Instruct + vicgalle/alpaca-gpt4
  • Script: finetune_neo.py (will be extended for conversational data)
  • Goal: Transform base model β†’ instruction-following chat assistant
python finetune_neo.py \
  --base_model checkpoints/extended_context_model_16k.pt \
  --dataset /path/to/UltraChat \
  --epochs 3 --lr 5e-6 --batch_size 1

πŸ”‘ Key Lessons & Tips

  • Quality > Quantity: RefinedWeb quality cut training steps by 25%
  • Memory Efficiency: Achieved 3.6 K tokens at ~1.3 GB VRAM
  • Batch Size Tradeoff: 1 vs 2 batch size critical for VRAM overflow
  • Cache Clearing: torch.cuda.empty_cache() essential for long context tests
  • Resume Training: Checkpointing during pre-training saved 10+ hours
  • Conversational Fine-Tuning: Final step to transform base model into chat assistant

πŸ“‚ Next Steps

  1. Review and run conversational fine-tuning on UltraChat.
  2. Evaluate on standardized benchmarks (perplexity, MMLU, HellaSwag).
  3. Quantize or prune for faster inference on edge devices.
  4. Deploy with FastAPI + SSE for streaming responses.
  5. Document model card and share results.

Thank you for following this detailed project! Your model is now a powerful, efficient LLM ready for conversational fine-tuning and deployment. Good luck!