
🚀 Using Your Existing Mamba Trainer with HuggingFace Datasets

Your existing trainer.py and data_loader.py are excellent! This guide shows how to enhance them with HuggingFace's open-source datasets.

✅ What You Already Have (Perfect!)

Your Existing Training System:

  • training/trainer.py - Sophisticated 4-phase training pipeline
  • training/data_loader.py - Complete data loading infrastructure
  • training/optimizer.py - Advanced Mamba-specific optimization
  • training/loss.py - Comprehensive loss functions
  • core/config.py - Complete configuration system

Your Training Pipeline:

  1. Phase 1: Foundation training (shared weights)
  2. Phase 2: Specialist training (domain experts)
  3. Phase 3: Aggregator training (combining specialists)
  4. Phase 4: End-to-end fine-tuning

This is production-ready and more advanced than most training systems!

🔗 HuggingFace Integration (Simple Addition)

Step 1: Install HF Requirements

pip install -r hf_requirements.txt
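
If hf_requirements.txt is not present in your checkout, the core HuggingFace libraries can likely be installed directly. This is an assumption about what the file contains, not the project's pinned versions:

# Assumed rough equivalent of hf_requirements.txt: the core HuggingFace libraries
pip install datasets transformers tokenizers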

Step 2: Quick Training with HF Data

# Uses your existing trainer with WikiText-103 dataset
python enhanced_training.py

# Quick test with tiny dataset
python enhanced_training.py --quick-test

Step 3: Custom HF Dataset Training

# Download specific datasets
python train_with_hf_datasets.py --download-only

# Train with specific dataset
python enhanced_training.py --dataset "openwebtext"

📊 Popular HuggingFace Datasets You Can Use

Language Modeling Datasets:

  • wikitext-103-v1 - Wikipedia articles (recommended for testing)
  • openwebtext - Web text corpus (large, good for training)
  • c4 - Colossal Clean Crawled Corpus (very large)
  • pile - EleutherAI's diverse text dataset
  • tiny_shakespeare - Small dataset for quick testing

Domain-Specific Datasets:

  • Medical: pubmed_qa, bioasq
  • Legal: lex_glue
  • Code: codeparrot/github-code, bigcode/the-stack
  • Science: scientific_papers
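
Before wiring one of these datasets into the pipeline, it can help to preview it with the datasets library. A minimal sketch, assuming the datasets package is installed; Hub dataset names and configs may differ slightly from the shorthand above:

# Minimal sketch: preview a HuggingFace dataset before converting it for training
from datasets import load_dataset

# "wikitext" with the "wikitext-103-v1" config matches the recommended testing dataset above
dataset = load_dataset("wikitext", "wikitext-103-v1", split="train")

print(dataset)              # row count and column names
print(dataset[0]["text"])   # first raw text sample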

🎯 How It Integrates With Your System

Your Existing Data Loader Enhancement:

The HF integration simply does the following (see the sketch after this list):

  1. Downloads datasets from HuggingFace
  2. Converts them to your expected text format
  3. Saves as train_data.txt
  4. Your existing MambaDataset loads it normally
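
Below is a minimal sketch of that conversion step, assuming the datasets library and a simple one-document-per-line text file. The actual logic lives in enhanced_training.py / train_with_hf_datasets.py, so treat the helper name and parameters as illustrative only:

# Illustrative sketch of the HF-to-train_data.txt conversion (the project's scripts hold the real logic)
from datasets import load_dataset

def convert_hf_dataset_to_text(dataset_name="wikitext", config="wikitext-103-v1",
                               output_path="train_data.txt", max_samples=None):
    """Download a HuggingFace dataset and flatten it into a plain text file."""
    dataset = load_dataset(dataset_name, config, split="train")
    with open(output_path, "w", encoding="utf-8") as f:
        for i, sample in enumerate(dataset):
            if max_samples is not None and i >= max_samples:
                break
            text = sample.get("text", "").strip()
            if text:  # wikitext contains many empty rows; skip them
                f.write(text + "\n")
    return output_path

# Afterwards your existing MambaDataset picks up train_data.txt exactly as before
convert_hf_dataset_to_text(max_samples=100_000)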

Your Existing Config Usage:

# Your existing config works perfectly
config = MambaConfig(
    vocab_size=50257,
    d_model=1024,
    n_layers=12,
    batch_size=4,
    learning_rate=1e-4,
    num_specialists=50,
    train_data_path="train_data.txt"  # HF dataset converted to this
)

# Your existing trainer
trainer = MambaSwarmTrainer(config)
trainer.full_training_pipeline()  # Uses your 4-phase system

πŸƒ Quick Start Commands

1. Test Your Existing System:

# Use your existing trainer as-is
python -c "
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer

config = MambaConfig()
trainer = MambaSwarmTrainer(config)
trainer.train_foundation_phase(num_steps=100)  # Quick test
"

2. Add HuggingFace Data:

# Download WikiText and train with your system
python enhanced_training.py

3. Train with Different HF Datasets:

# Shakespeare (tiny, for testing)
python enhanced_training.py --dataset tiny_shakespeare

# OpenWebText (larger, for real training)  
python enhanced_training.py --dataset openwebtext

📈 Your Enhanced Training Flow

📥 HuggingFace Dataset
    ↓ (convert to text format)
📄 train_data.txt
    ↓ (your existing data_loader.py)
🧠 MambaDataset
    ↓ (your existing trainer.py)
🏗️ 4-Phase Training Pipeline:
    📚 Phase 1: Foundation
    🎯 Phase 2: Specialists
    🔗 Phase 3: Aggregator
    🎨 Phase 4: End-to-end
    ↓
💾 Trained Mamba Swarm
    ↓ (your enhanced app.py)
🚀 Production-Ready Model

πŸŽ›οΈ Configuration Examples

Small Model (Quick Testing):

config = MambaConfig(
    d_model=512,
    n_layers=6,
    batch_size=2,
    num_specialists=10,
    max_steps=1000
)

Production Model:

config = MambaConfig(
    d_model=1024, 
    n_layers=12,
    batch_size=8,
    num_specialists=50,
    max_steps=50000
)

Large Model (If You Have a GPU):

config = MambaConfig(
    d_model=2048,
    n_layers=24, 
    batch_size=4,
    num_specialists=100,
    max_steps=100000
)

πŸ” What Gets Enhanced

Your app.py Now Detects:

  1. Custom Trained Models (Priority 1-9)
  2. Standard Mamba Models (Priority 10-19)
  3. GPT Fallbacks (Priority 20+)

When you train a model, it automatically gets the highest priority!
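
The detection logic itself lives in app.py; the snippet below is only a hypothetical sketch of how such a priority scheme could look (the function name, tier checks, and candidate model names are assumptions, not the project's actual code):

# Hypothetical sketch of the priority tiers described above; app.py's real logic may differ
def model_priority(model_name: str) -> int:
    """Lower number = higher priority when choosing which model to load."""
    if "mamba_swarm" in model_name:       # custom trained checkpoints -> tier 1-9
        return 1
    if "mamba" in model_name.lower():     # standard Mamba models -> tier 10-19
        return 10
    return 20                             # GPT fallbacks -> tier 20+

candidates = ["gpt2", "state-spaces/mamba-130m", "checkpoints/mamba_swarm_hf_trained.pt"]
best = min(candidates, key=model_priority)   # the custom trained checkpoint wins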

Example Status Display:

🎯 CUSTOM TRAINED MAMBA ENCODER
Status: 🟢 Custom Model Online | Model: Custom Trained: mamba_swarm_hf_trained (1024D)

πŸ“ Training Log Example

📥 Loading wikitext-103-v1 from Hugging Face...
📄 Converting to text format...
✅ Dataset saved to train_data.txt
🐍 Starting Mamba Swarm Training with HF Data
✅ Config created:
  - Model: 768D, 8 layers
  - Specialists: 20
  - Batch size: 2
  - Training data: train_data.txt
✅ Trainer initialized successfully
Step 4: Starting training pipeline...
Phase 1: Foundation training
Phase 2: Specialist training
Phase 3: Aggregator training  
Phase 4: End-to-end fine-tuning
🎉 Training completed successfully!
💾 Checkpoint saved: checkpoints/mamba_swarm_hf_trained.pt

💡 Key Benefits

  1. Your System is Already Advanced - No need to replace anything
  2. HF Integration is Simple - Just adds data sources
  3. Automatic Model Detection - Trained models get priority
  4. Production Ready - Your 4-phase training is sophisticated
  5. Open Source Data - Access to massive datasets

🚀 Next Steps

  1. Test your existing system: python enhanced_training.py --quick-test
  2. Try with HF data: python enhanced_training.py
  3. Experiment with datasets: Try different HF datasets
  4. Scale up: Increase model size and training steps
  5. Deploy: Your trained model automatically works in app.py

Your existing training system is excellent - the HF integration just gives you access to world-class datasets!