# 🚀 Using Your Existing Mamba Trainer with HuggingFace Datasets

Your existing `trainer.py` and `data_loader.py` are excellent! This guide shows how to enhance them with HuggingFace's open-source datasets.

## ✅ What You Already Have (Perfect!)

### Your Existing Training System:
- **`training/trainer.py`** - Sophisticated 4-phase training pipeline
- **`training/data_loader.py`** - Complete data loading infrastructure
- **`training/optimizer.py`** - Advanced Mamba-specific optimization
- **`training/loss.py`** - Comprehensive loss functions
- **`core/config.py`** - Complete configuration system

### Your Training Pipeline:
1. **Phase 1**: Foundation training (shared weights)
2. **Phase 2**: Specialist training (domain experts)
3. **Phase 3**: Aggregator training (combining specialists)
4. **Phase 4**: End-to-end fine-tuning

This is **production-ready** and more advanced than most training systems!

## 🔗 HuggingFace Integration (Simple Addition)

### Step 1: Install HF Requirements
```bash
pip install -r hf_requirements.txt
```

### Step 2: Quick Training with HF Data
```bash
# Uses your existing trainer with WikiText-103 dataset
python enhanced_training.py

# Quick test with tiny dataset
python enhanced_training.py --quick-test
```

### Step 3: Custom HF Dataset Training
```bash
# Download specific datasets
python train_with_hf_datasets.py --download-only

# Train with specific dataset
python enhanced_training.py --dataset "openwebtext"
```

## 📊 Popular HuggingFace Datasets You Can Use

### Language Modeling Datasets:
- **`wikitext-103-v1`** - Wikipedia articles (recommended for testing)
- **`openwebtext`** - Web text corpus (large, good for training)
- **`c4`** - Colossal Clean Crawled Corpus (very large)
- **`pile`** - EleutherAI's diverse text dataset
- **`tiny_shakespeare`** - Small dataset for quick testing

### Domain-Specific Datasets:
- **Medical**: `pubmed_qa`, `bioasq`
- **Legal**: `lex_glue`
- **Code**: `codeparrot/github-code`, `bigcode/the-stack`
- **Science**: `scientific_papers`

## 🎯 How It Integrates With Your System

### Your Existing Data Loader Enhancement:
The HF integration simply:
1. Downloads datasets from HuggingFace
2. Converts them to your expected text format
3. Saves as `train_data.txt`
4. Your existing `MambaDataset` loads it normally
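Concretely, the conversion step can be as small as the sketch below. The helper name `hf_dataset_to_text` is hypothetical (the real logic lives in the project's `enhanced_training.py` / `train_with_hf_datasets.py`); it assumes the dataset exposes a `text` column, which is true for the WikiText configurations:

```python
from datasets import load_dataset

def hf_dataset_to_text(name: str = "wikitext",
                       config: str = "wikitext-103-v1",
                       split: str = "train",
                       output_path: str = "train_data.txt") -> str:
    """Download an HF dataset and flatten its `text` column into one file."""
    dataset = load_dataset(name, config, split=split)
    with open(output_path, "w", encoding="utf-8") as f:
        for record in dataset:
            line = record["text"].strip()
            if line:  # WikiText contains many blank separator lines
                f.write(line + "\n")
    return output_path

# After this, your existing MambaDataset loads train_data.txt as usual.
hf_dataset_to_text()
```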
### Your Existing Config Usage:
```python
# Your existing config works perfectly
config = MambaConfig(
    vocab_size=50257,
    d_model=1024,
    n_layers=12,
    batch_size=4,
    learning_rate=1e-4,
    num_specialists=50,
    train_data_path="train_data.txt"  # HF dataset converted to this
)

# Your existing trainer
trainer = MambaSwarmTrainer(config)
trainer.full_training_pipeline()  # Uses your 4-phase system
```

## 🏃 Quick Start Commands

### 1. Test Your Existing System:
```bash
# Use your existing trainer as-is
python -c "
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer
config = MambaConfig()
trainer = MambaSwarmTrainer(config)
trainer.train_foundation_phase(num_steps=100)  # Quick test
"
```

### 2. Add HuggingFace Data:
```bash
# Download WikiText and train with your system
python enhanced_training.py
```

### 3. Train with Different HF Datasets:
```bash
# Shakespeare (tiny, for testing)
python enhanced_training.py --dataset tiny_shakespeare

# OpenWebText (larger, for real training)
python enhanced_training.py --dataset openwebtext
```

## 📈 Your Enhanced Training Flow

```
📥 HuggingFace Dataset
    ↓ (convert to text format)
📄 train_data.txt
    ↓ (your existing data_loader.py)
🧠 MambaDataset
    ↓ (your existing trainer.py)
🏗️ 4-Phase Training Pipeline:
    📚 Phase 1: Foundation
    🎯 Phase 2: Specialists
    🔗 Phase 3: Aggregator
    🎨 Phase 4: End-to-end
    ↓
💾 Trained Mamba Swarm
    ↓ (your enhanced app.py)
🚀 Production Ready Model
```

## 🎛️ Configuration Examples

### Small Model (Quick Testing):
```python
config = MambaConfig(
    d_model=512,
    n_layers=6,
    batch_size=2,
    num_specialists=10,
    max_steps=1000
)
```

### Production Model:
```python
config = MambaConfig(
    d_model=1024,
    n_layers=12,
    batch_size=8,
    num_specialists=50,
    max_steps=50000
)
```

### Large Model (if you have a GPU):
```python
config = MambaConfig(
    d_model=2048,
    n_layers=24,
    batch_size=4,
    num_specialists=100,
    max_steps=100000
)
```

## 🔍 What Gets Enhanced

### Your `app.py` Now Detects:
1. **Custom Trained Models** (Priority 1-9)
2. **Standard Mamba Models** (Priority 10-19)
3. **GPT Fallbacks** (Priority 20+)

When you train a model, it gets **highest priority** automatically!

### Example Status Display:
```
🎯 CUSTOM TRAINED MAMBA ENCODER
Status: 🟢 Custom Model Online | Model: Custom Trained: mamba_swarm_hf_trained (1024D)
```

## 📝 Training Log Example

```
📥 Loading wikitext-103-v1 from Hugging Face...
📄 Converting to text format...
✅ Dataset saved to train_data.txt
🐍 Starting Mamba Swarm Training with HF Data
✅ Config created:
   - Model: 768D, 8 layers
   - Specialists: 20
   - Batch size: 2
   - Training data: train_data.txt
✅ Trainer initialized successfully
Step 4: Starting training pipeline...
Phase 1: Foundation training
Phase 2: Specialist training
Phase 3: Aggregator training
Phase 4: End-to-end fine-tuning
🎉 Training completed successfully!
💾 Checkpoint saved: checkpoints/mamba_swarm_hf_trained.pt
```

## 💡 Key Benefits

1. **Your System is Already Advanced** - No need to replace anything
2. **HF Integration is Simple** - Just adds data sources
3. **Automatic Model Detection** - Trained models get priority
4. **Production Ready** - Your 4-phase training is sophisticated
5. **Open Source Data** - Access to massive datasets

## 🚀 Next Steps

1. **Test your existing system**: `python enhanced_training.py --quick-test`
2. **Try with HF data**: `python enhanced_training.py`
3. **Experiment with datasets**: Try different HF datasets
4. **Scale up**: Increase model size and training steps
5. **Deploy**: Your trained model automatically works in `app.py`

Your existing training system is excellent - the HF integration just gives you access to world-class datasets!
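## 🧪 Appendix: End-to-End Sketch

For reference, the whole flow can be stitched together in a few lines. This is a minimal sketch rather than the actual `enhanced_training.py`: it assumes the `MambaConfig` fields and `MambaSwarmTrainer` methods quoted earlier in this guide, and it uses the small `wikitext-2-v1` configuration so a first run stays short.

```python
from datasets import load_dataset
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer

# 1. Pull a small open dataset and flatten it to the plain-text file the
#    existing data loader expects (same idea as the conversion helper above).
dataset = load_dataset("wikitext", "wikitext-2-v1", split="train")
with open("train_data.txt", "w", encoding="utf-8") as f:
    for record in dataset:
        line = record["text"].strip()
        if line:
            f.write(line + "\n")

# 2. Reuse the existing config and 4-phase trainer unchanged.
config = MambaConfig(
    d_model=512,
    n_layers=6,
    batch_size=2,
    num_specialists=10,
    train_data_path="train_data.txt",
)
trainer = MambaSwarmTrainer(config)
trainer.full_training_pipeline()  # Foundation -> Specialists -> Aggregator -> End-to-end
```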