
🚀 Using Your Existing Mamba Trainer with HuggingFace Datasets

Your existing trainer.py and data_loader.py are excellent! This guide shows how to enhance them with HuggingFace's open-source datasets.

✅ What You Already Have (Perfect!)

Your Existing Training System:

  • training/trainer.py - Sophisticated 4-phase training pipeline
  • training/data_loader.py - Complete data loading infrastructure
  • training/optimizer.py - Advanced Mamba-specific optimization
  • training/loss.py - Comprehensive loss functions
  • core/config.py - Complete configuration system

Your Training Pipeline:

  1. Phase 1: Foundation training (shared weights)
  2. Phase 2: Specialist training (domain experts)
  3. Phase 3: Aggregator training (combining specialists)
  4. Phase 4: End-to-end fine-tuning

This is production-ready and more advanced than most training systems!

🔗 HuggingFace Integration (Simple Addition)

Step 1: Install HF Requirements

pip install -r hf_requirements.txt
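
If hf_requirements.txt is not present in your checkout, the core HuggingFace libraries can likely be installed directly. This is an assumption about what the file contains, not the project's pinned versions:

# Assumed rough equivalent of hf_requirements.txt: the core HuggingFace libraries
pip install datasets transformers tokenizers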

Step 2: Quick Training with HF Data

# Uses your existing trainer with WikiText-103 dataset
python enhanced_training.py

# Quick test with tiny dataset
python enhanced_training.py --quick-test

Step 3: Custom HF Dataset Training

# Download specific datasets
python train_with_hf_datasets.py --download-only

# Train with specific dataset
python enhanced_training.py --dataset "openwebtext"

📊 Popular HuggingFace Datasets You Can Use

Language Modeling Datasets:

  • wikitext-103-v1 - Wikipedia articles (recommended for testing)
  • openwebtext - Web text corpus (large, good for training)
  • c4 - Colossal Clean Crawled Corpus (very large)
  • pile - EleutherAI's diverse text dataset
  • tiny_shakespeare - Small dataset for quick testing

Domain-Specific Datasets:

  • Medical: pubmed_qa, bioasq
  • Legal: lex_glue
  • Code: codeparrot/github-code, bigcode/the-stack
  • Science: scientific_papers
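
Before wiring one of these datasets into the pipeline, it can help to preview it with the datasets library. A minimal sketch, assuming the datasets package is installed; Hub dataset names and configs may differ slightly from the shorthand above:

# Minimal sketch: preview a HuggingFace dataset before converting it for training
from datasets import load_dataset

# "wikitext" with the "wikitext-103-v1" config matches the recommended testing dataset above
dataset = load_dataset("wikitext", "wikitext-103-v1", split="train")

print(dataset)              # row count and column names
print(dataset[0]["text"])   # first raw text sample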

🎯 How It Integrates With Your System

Your Existing Data Loader Enhancement:

The HF integration simply does the following (see the sketch after this list):

  1. Downloads datasets from HuggingFace
  2. Converts them to your expected text format
  3. Saves as train_data.txt
  4. Your existing MambaDataset loads it normally
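
Below is a minimal sketch of that conversion step, assuming the datasets library and a simple one-document-per-line text file. The actual logic lives in enhanced_training.py / train_with_hf_datasets.py, so treat the helper name and parameters as illustrative only:

# Illustrative sketch of the HF-to-train_data.txt conversion (the project's scripts hold the real logic)
from datasets import load_dataset

def convert_hf_dataset_to_text(dataset_name="wikitext", config="wikitext-103-v1",
                               output_path="train_data.txt", max_samples=None):
    """Download a HuggingFace dataset and flatten it into a plain text file."""
    dataset = load_dataset(dataset_name, config, split="train")
    with open(output_path, "w", encoding="utf-8") as f:
        for i, sample in enumerate(dataset):
            if max_samples is not None and i >= max_samples:
                break
            text = sample.get("text", "").strip()
            if text:  # wikitext contains many empty rows; skip them
                f.write(text + "\n")
    return output_path

# Afterwards your existing MambaDataset picks up train_data.txt exactly as before
convert_hf_dataset_to_text(max_samples=100_000)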

Your Existing Config Usage:

# Your existing config works perfectly
config = MambaConfig(
    vocab_size=50257,
    d_model=1024,
    n_layers=12,
    batch_size=4,
    learning_rate=1e-4,
    num_specialists=50,
    train_data_path="train_data.txt"  # HF dataset converted to this
)

# Your existing trainer
trainer = MambaSwarmTrainer(config)
trainer.full_training_pipeline()  # Uses your 4-phase system

πŸƒ Quick Start Commands

1. Test Your Existing System:

# Use your existing trainer as-is
python -c "
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer

config = MambaConfig()
trainer = MambaSwarmTrainer(config)
trainer.train_foundation_phase(num_steps=100)  # Quick test
"

2. Add HuggingFace Data:

# Download WikiText and train with your system
python enhanced_training.py

3. Train with Different HF Datasets:

# Shakespeare (tiny, for testing)
python enhanced_training.py --dataset tiny_shakespeare

# OpenWebText (larger, for real training)  
python enhanced_training.py --dataset openwebtext

📈 Your Enhanced Training Flow

📥 HuggingFace Dataset
    ↓ (convert to text format)
📄 train_data.txt
    ↓ (your existing data_loader.py)
🧠 MambaDataset
    ↓ (your existing trainer.py)
🏗️ 4-Phase Training Pipeline:
    📚 Phase 1: Foundation
    🎯 Phase 2: Specialists
    🔗 Phase 3: Aggregator
    🎨 Phase 4: End-to-end
    ↓
💾 Trained Mamba Swarm
    ↓ (your enhanced app.py)
🚀 Production-Ready Model

πŸŽ›οΈ Configuration Examples

Small Model (Quick Testing):

config = MambaConfig(
    d_model=512,
    n_layers=6,
    batch_size=2,
    num_specialists=10,
    max_steps=1000
)

Production Model:

config = MambaConfig(
    d_model=1024, 
    n_layers=12,
    batch_size=8,
    num_specialists=50,
    max_steps=50000
)

Large Model (If You Have a GPU):

config = MambaConfig(
    d_model=2048,
    n_layers=24, 
    batch_size=4,
    num_specialists=100,
    max_steps=100000
)

πŸ” What Gets Enhanced

Your app.py Now Detects:

  1. Custom Trained Models (Priority 1-9)
  2. Standard Mamba Models (Priority 10-19)
  3. GPT Fallbacks (Priority 20+)

When you train a model, it automatically gets the highest priority!
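
The detection logic itself lives in app.py; the snippet below is only a hypothetical sketch of how such a priority scheme could look (the function name, tier checks, and candidate model names are assumptions, not the project's actual code):

# Hypothetical sketch of the priority tiers described above; app.py's real logic may differ
def model_priority(model_name: str) -> int:
    """Lower number = higher priority when choosing which model to load."""
    if "mamba_swarm" in model_name:       # custom trained checkpoints -> tier 1-9
        return 1
    if "mamba" in model_name.lower():     # standard Mamba models -> tier 10-19
        return 10
    return 20                             # GPT fallbacks -> tier 20+

candidates = ["gpt2", "state-spaces/mamba-130m", "checkpoints/mamba_swarm_hf_trained.pt"]
best = min(candidates, key=model_priority)   # the custom trained checkpoint wins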

Example Status Display:

🎯 CUSTOM TRAINED MAMBA ENCODER
Status: 🟢 Custom Model Online | Model: Custom Trained: mamba_swarm_hf_trained (1024D)

πŸ“ Training Log Example

📥 Loading wikitext-103-v1 from Hugging Face...
📄 Converting to text format...
✅ Dataset saved to train_data.txt
🐍 Starting Mamba Swarm Training with HF Data
✅ Config created:
  - Model: 768D, 8 layers
  - Specialists: 20
  - Batch size: 2
  - Training data: train_data.txt
✅ Trainer initialized successfully
Step 4: Starting training pipeline...
Phase 1: Foundation training
Phase 2: Specialist training
Phase 3: Aggregator training  
Phase 4: End-to-end fine-tuning
🎉 Training completed successfully!
💾 Checkpoint saved: checkpoints/mamba_swarm_hf_trained.pt

💡 Key Benefits

  1. Your System is Already Advanced - No need to replace anything
  2. HF Integration is Simple - Just adds data sources
  3. Automatic Model Detection - Trained models get priority
  4. Production Ready - Your 4-phase training is sophisticated
  5. Open Source Data - Access to massive datasets

🚀 Next Steps

  1. Test your existing system: python enhanced_training.py --quick-test
  2. Try with HF data: python enhanced_training.py
  3. Experiment with datasets: Try different HF datasets
  4. Scale up: Increase model size and training steps
  5. Deploy: Your trained model automatically works in app.py

Your existing training system is excellent - the HF integration just gives you access to world-class datasets!