# Using Your Existing Mamba Trainer with HuggingFace Datasets

Your existing `trainer.py` and `data_loader.py` are excellent! This guide shows how to enhance them with HuggingFace's open-source datasets.
## What You Already Have (Perfect!)

### Your Existing Training System:
- **`training/trainer.py`** - Sophisticated 4-phase training pipeline
- **`training/data_loader.py`** - Complete data loading infrastructure
- **`training/optimizer.py`** - Advanced Mamba-specific optimization
- **`training/loss.py`** - Comprehensive loss functions
- **`core/config.py`** - Complete configuration system
### Your Training Pipeline:
1. **Phase 1**: Foundation training (shared weights)
2. **Phase 2**: Specialist training (domain experts)
3. **Phase 3**: Aggregator training (combining specialists)
4. **Phase 4**: End-to-end fine-tuning
This is a **production-ready** pipeline; the HuggingFace integration below simply adds new data sources, without replacing any of it.
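If you want to drive the four phases individually rather than through the single `full_training_pipeline()` call shown later, the flow looks roughly like this. Only `train_foundation_phase()` and `full_training_pipeline()` are confirmed by this guide; the phase 2-4 method names below are placeholders, so check `training/trainer.py` for the real API:

```python
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer

config = MambaConfig()
trainer = MambaSwarmTrainer(config)

# Option A: run everything in one call (shown later in this guide)
trainer.full_training_pipeline()

# Option B: drive the phases yourself. Only train_foundation_phase()
# appears elsewhere in this guide; the commented-out names for
# phases 2-4 are hypothetical placeholders.
trainer.train_foundation_phase(num_steps=1000)  # Phase 1: shared weights
# trainer.train_specialist_phase(...)           # Phase 2 (hypothetical name)
# trainer.train_aggregator_phase(...)           # Phase 3 (hypothetical name)
# trainer.train_end_to_end_phase(...)           # Phase 4 (hypothetical name)
```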
## HuggingFace Integration (Simple Addition)

### Step 1: Install HF Requirements
```bash
pip install -r hf_requirements.txt
```
### Step 2: Quick Training with HF Data
```bash
# Uses your existing trainer with the WikiText-103 dataset
python enhanced_training.py

# Quick test with a tiny dataset
python enhanced_training.py --quick-test
```
### Step 3: Custom HF Dataset Training
```bash
# Download specific datasets
python train_with_hf_datasets.py --download-only

# Train with a specific dataset
python enhanced_training.py --dataset "openwebtext"
```
## Popular HuggingFace Datasets You Can Use

### Language Modeling Datasets:
- **`wikitext-103-v1`** - Wikipedia articles, a config of the `wikitext` dataset (recommended for testing)
- **`openwebtext`** - Web text corpus (large, good for training)
- **`c4`** - Colossal Clean Crawled Corpus (very large; hosted on the Hub as `allenai/c4`)
- **`pile`** - EleutherAI's diverse text dataset
- **`tiny_shakespeare`** - Small dataset for quick testing
### Domain-Specific Datasets:
- **Medical**: `pubmed_qa`, `bioasq`
- **Legal**: `lex_glue`
- **Code**: `codeparrot/github-code`, `bigcode/the-stack`
- **Science**: `scientific_papers`
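For reference, here is how a few of these are fetched with the `datasets` library. Note that some entries above are config names rather than top-level dataset ids (`wikitext-103-v1` is a config of `wikitext`, and `pqa_labeled` below is one of `pubmed_qa`'s configs); field names follow the datasets' published schemas:

```python
from datasets import load_dataset

# wikitext-103-v1 is a config of the "wikitext" dataset id
wiki = load_dataset("wikitext", "wikitext-103-v1", split="train")
print(wiki[0]["text"][:200])

# Domain-specific example for a medical specialist
pubmed = load_dataset("pubmed_qa", "pqa_labeled", split="train")
print(pubmed[0]["question"])

# Very large corpora such as C4 are best streamed instead of fully downloaded
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
print(next(iter(c4))["text"][:200])
```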
## How It Integrates With Your System

### Your Existing Data Loader Enhancement:
The HF integration simply:
1. Downloads datasets from HuggingFace
2. Converts them to your expected text format
3. Saves the result as `train_data.txt`
4. Your existing `MambaDataset` loads it normally (see the sketch after this list)
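A minimal sketch of that conversion step, assuming the `datasets` library. The helper name and signature here are illustrative only; the real logic lives in `train_with_hf_datasets.py` / `enhanced_training.py`:

```python
from datasets import load_dataset

def hf_dataset_to_text_file(dataset_id, config_name, out_path="train_data.txt",
                            text_field="text", max_examples=None):
    """Download an HF dataset and flatten it into the plain-text file
    the existing MambaDataset loader expects. Hypothetical helper,
    shown only to illustrate the conversion."""
    ds = load_dataset(dataset_id, config_name, split="train")
    with open(out_path, "w", encoding="utf-8") as f:
        for i, example in enumerate(ds):
            if max_examples is not None and i >= max_examples:
                break
            text = example[text_field].strip()
            if text:  # skip the empty separator lines wikitext contains
                f.write(text + "\n")

hf_dataset_to_text_file("wikitext", "wikitext-103-v1")
```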
### Your Existing Config Usage:
```python
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer

# Your existing config works unchanged
config = MambaConfig(
    vocab_size=50257,
    d_model=1024,
    n_layers=12,
    batch_size=4,
    learning_rate=1e-4,
    num_specialists=50,
    train_data_path="train_data.txt"  # the HF dataset, converted to this file
)

# Your existing trainer
trainer = MambaSwarmTrainer(config)
trainer.full_training_pipeline()  # runs your 4-phase system
```
## Quick Start Commands

### 1. Test Your Existing System:
```bash
# Use your existing trainer as-is
python -c "
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer

config = MambaConfig()
trainer = MambaSwarmTrainer(config)
trainer.train_foundation_phase(num_steps=100)  # quick test
"
```
### 2. Add HuggingFace Data:
```bash
# Download WikiText and train with your system
python enhanced_training.py
```

### 3. Train with Different HF Datasets:
```bash
# Shakespeare (tiny, for testing)
python enhanced_training.py --dataset tiny_shakespeare

# OpenWebText (larger, for real training)
python enhanced_training.py --dataset openwebtext
```
## Your Enhanced Training Flow
```
HuggingFace Dataset
      ↓  (convert to text format)
train_data.txt
      ↓  (your existing data_loader.py)
MambaDataset
      ↓  (your existing trainer.py)
4-Phase Training Pipeline:
    Phase 1: Foundation
    Phase 2: Specialists
    Phase 3: Aggregator
    Phase 4: End-to-end
      ↓
Trained Mamba Swarm
      ↓  (your enhanced app.py)
Production-Ready Model
```
## Configuration Examples

### Small Model (Quick Testing):
```python
config = MambaConfig(
    d_model=512,
    n_layers=6,
    batch_size=2,
    num_specialists=10,
    max_steps=1000
)
```

### Production Model:
```python
config = MambaConfig(
    d_model=1024,
    n_layers=12,
    batch_size=8,
    num_specialists=50,
    max_steps=50000
)
```

### Large Model (If You Have a GPU):
```python
config = MambaConfig(
    d_model=2048,
    n_layers=24,
    batch_size=4,
    num_specialists=100,
    max_steps=100000
)
```
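To help choose between these configs, here is a rough, back-of-envelope parameter estimate per specialist. It uses the ~3 · expand · d_model² per-block figure from the Mamba paper (expand defaults to 2) and ignores conv, SSM, and norm parameters as comparatively small; the swarm total is roughly this times `num_specialists`, plus aggregator overhead:

```python
def rough_mamba_params(d_model, n_layers, vocab_size=50257, expand=2):
    """Rough per-specialist parameter count: ~3 * expand * d_model^2
    per Mamba block, plus the input embedding table. Estimate only."""
    blocks = n_layers * 3 * expand * d_model ** 2
    embedding = vocab_size * d_model  # double this if the output head is untied
    return blocks + embedding

for name, d, n in [("small", 512, 6), ("production", 1024, 12), ("large", 2048, 24)]:
    print(f"{name}: ~{rough_mamba_params(d, n) / 1e6:.0f}M params per specialist")
# small: ~35M, production: ~127M, large: ~707M (approximate)
```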
## What Gets Enhanced

### Your `app.py` Now Detects:
1. **Custom Trained Models** (priority 1-9)
2. **Standard Mamba Models** (priority 10-19)
3. **GPT Fallbacks** (priority 20+)

When you train a model, it gets the **highest priority** automatically!
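The selection could be implemented roughly like the following. This is a hypothetical sketch, not the actual code in `app.py`, and the model ids `state-spaces/mamba-130m` and `gpt2` are stand-ins for whatever fallbacks your app actually registers:

```python
import glob
import os

def discover_models():
    """Collect candidate models and sort them by priority tier,
    so custom checkpoints always win over pretrained fallbacks."""
    candidates = []
    # Priority 1-9: custom trained checkpoints get first pick
    for path in glob.glob("checkpoints/*.pt"):
        name = os.path.splitext(os.path.basename(path))[0]
        candidates.append((1, f"Custom Trained: {name}", path))
    # Priority 10-19: standard pretrained Mamba models
    candidates.append((10, "Standard Mamba", "state-spaces/mamba-130m"))
    # Priority 20+: GPT fallback if nothing else is available
    candidates.append((20, "GPT Fallback", "gpt2"))
    candidates.sort(key=lambda c: c[0])
    return candidates

priority, label, source = discover_models()[0]
print(f"Loading {label} from {source} (priority {priority})")
```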
### Example Status Display:
```
CUSTOM TRAINED MAMBA ENCODER
Status: Custom Model Online | Model: Custom Trained: mamba_swarm_hf_trained (1024D)
```
## Training Log Example
```
Loading wikitext-103-v1 from Hugging Face...
Converting to text format...
Dataset saved to train_data.txt

Starting Mamba Swarm Training with HF Data
Config created:
  - Model: 768D, 8 layers
  - Specialists: 20
  - Batch size: 2
  - Training data: train_data.txt
Trainer initialized successfully

Step 4: Starting training pipeline...
Phase 1: Foundation training
Phase 2: Specialist training
Phase 3: Aggregator training
Phase 4: End-to-end fine-tuning

Training completed successfully!
Checkpoint saved: checkpoints/mamba_swarm_hf_trained.pt
```
## Key Benefits

1. **Your System Is Already Advanced** - No need to replace anything
2. **HF Integration Is Simple** - It just adds data sources
3. **Automatic Model Detection** - Trained models get priority
4. **Production Ready** - Your 4-phase training pipeline is sophisticated
5. **Open Source Data** - Access to massive public datasets
## Next Steps

1. **Test your existing system**: `python enhanced_training.py --quick-test`
2. **Try with HF data**: `python enhanced_training.py`
3. **Experiment with datasets**: Try different HF datasets
4. **Scale up**: Increase model size and training steps
5. **Deploy**: Your trained model automatically works in `app.py`

Your existing training system is excellent; the HF integration just gives you access to world-class datasets!