# 🚀 Using Your Existing Mamba Trainer with HuggingFace Datasets
Your existing `trainer.py` and `data_loader.py` are excellent! This guide shows how to enhance them with HuggingFace's open-source datasets.
## ✅ What You Already Have (Perfect!)
### Your Existing Training System:
- **`training/trainer.py`** - Sophisticated 4-phase training pipeline
- **`training/data_loader.py`** - Complete data loading infrastructure
- **`training/optimizer.py`** - Advanced Mamba-specific optimization
- **`training/loss.py`** - Comprehensive loss functions
- **`core/config.py`** - Complete configuration system
### Your Training Pipeline:
1. **Phase 1**: Foundation training (shared weights)
2. **Phase 2**: Specialist training (domain experts)
3. **Phase 3**: Aggregator training (combining specialists)
4. **Phase 4**: End-to-end fine-tuning
This is **production-ready** and more advanced than most training systems!
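Both trainer entry points used later in this guide map directly onto these phases. A minimal sketch (assuming the default `MambaConfig` values are reasonable for a smoke test):

```python
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer

config = MambaConfig()
trainer = MambaSwarmTrainer(config)

# Short Phase 1 pass as a sanity check before committing to a full run
trainer.train_foundation_phase(num_steps=100)

# Or run all four phases end to end
trainer.full_training_pipeline()
```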
## 🔗 HuggingFace Integration (Simple Addition)
### Step 1: Install HF Requirements
```bash
pip install -r hf_requirements.txt
```
### Step 2: Quick Training with HF Data
```bash
# Uses your existing trainer with WikiText-103 dataset
python enhanced_training.py
# Quick test with tiny dataset
python enhanced_training.py --quick-test
```
### Step 3: Custom HF Dataset Training
```bash
# Download specific datasets
python train_with_hf_datasets.py --download-only
# Train with specific dataset
python enhanced_training.py --dataset "openwebtext"
```
## 📊 Popular HuggingFace Datasets You Can Use
### Language Modeling Datasets:
- **`wikitext-103-v1`** - Wikipedia articles (recommended for testing; see the loading sketch after these lists)
- **`openwebtext`** - Web text corpus (large, good for training)
- **`c4`** - Colossal Clean Crawled Corpus (very large)
- **`pile`** - EleutherAI's diverse text dataset
- **`tiny_shakespeare`** - Small dataset for quick testing
### Domain-Specific Datasets:
- **Medical**: `pubmed_qa`, `bioasq`
- **Legal**: `lex_glue`
- **Code**: `codeparrot/github-code`, `bigcode/the-stack`
- **Science**: `scientific_papers`
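Any of these can be previewed directly with the `datasets` library before converting it for training. A minimal sketch, assuming `datasets` is among the packages in `hf_requirements.txt`:

```python
from datasets import load_dataset

# WikiText-103 lives under the "wikitext" dataset with the "wikitext-103-v1" config
wiki = load_dataset("wikitext", "wikitext-103-v1", split="train")
for record in wiki:
    if record["text"].strip():      # skip the empty separator rows
        print(record["text"][:200])
        break

# Very large corpora (e.g. C4) can be streamed instead of downloaded up front
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
print(next(iter(c4))["text"][:200])
```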
## 🎯 How It Integrates With Your System
### Your Existing Data Loader Enhancement:
The HF integration layer does four simple things (sketched in code after this list):
1. Downloads datasets from HuggingFace
2. Converts them to your expected text format
3. Saves as `train_data.txt`
4. Your existing `MambaDataset` loads it normally
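A rough sketch of steps 1-3 (the real conversion lives in `enhanced_training.py` / `train_with_hf_datasets.py`; the helper name below is hypothetical):

```python
from datasets import load_dataset

def export_hf_dataset_to_text(name, config_name=None, output_path="train_data.txt", limit=None):
    """Hypothetical helper: flatten an HF dataset into the plain-text file
    that the existing MambaDataset loader expects (one document per line)."""
    ds = load_dataset(name, config_name, split="train")
    with open(output_path, "w", encoding="utf-8") as f:
        for i, record in enumerate(ds):
            if limit is not None and i >= limit:
                break
            text = record["text"].strip()
            if text:  # WikiText contains many empty rows
                f.write(text + "\n")
    return output_path

# Example: convert WikiText-103 so train_data_path="train_data.txt" picks it up unchanged
export_hf_dataset_to_text("wikitext", "wikitext-103-v1")
```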
### Your Existing Config Usage:
```python
# Your existing config works perfectly
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer

config = MambaConfig(
    vocab_size=50257,
    d_model=1024,
    n_layers=12,
    batch_size=4,
    learning_rate=1e-4,
    num_specialists=50,
    train_data_path="train_data.txt"  # HF dataset converted to this
)

# Your existing trainer
trainer = MambaSwarmTrainer(config)
trainer.full_training_pipeline()  # Uses your 4-phase system
```
## πŸƒ Quick Start Commands
### 1. Test Your Existing System:
```bash
# Use your existing trainer as-is
python -c "
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer
config = MambaConfig()
trainer = MambaSwarmTrainer(config)
trainer.train_foundation_phase(num_steps=100) # Quick test
"
```
### 2. Add HuggingFace Data:
```bash
# Download WikiText and train with your system
python enhanced_training.py
```
### 3. Train with Different HF Datasets:
```bash
# Shakespeare (tiny, for testing)
python enhanced_training.py --dataset tiny_shakespeare
# OpenWebText (larger, for real training)
python enhanced_training.py --dataset openwebtext
```
## 📈 Your Enhanced Training Flow
```
📥 HuggingFace Dataset
    ↓ (convert to text format)
📄 train_data.txt
    ↓ (your existing data_loader.py)
🧠 MambaDataset
    ↓ (your existing trainer.py)
🏗️ 4-Phase Training Pipeline:
    📚 Phase 1: Foundation
    🎯 Phase 2: Specialists
    🔗 Phase 3: Aggregator
    🎨 Phase 4: End-to-end
    ↓
💾 Trained Mamba Swarm
    ↓ (your enhanced app.py)
🚀 Production Ready Model
```
## πŸŽ›οΈ Configuration Examples
### Small Model (Quick Testing):
```python
config = MambaConfig(
    d_model=512,
    n_layers=6,
    batch_size=2,
    num_specialists=10,
    max_steps=1000
)
```
### Production Model:
```python
config = MambaConfig(
    d_model=1024,
    n_layers=12,
    batch_size=8,
    num_specialists=50,
    max_steps=50000
)
```
### Large Model (If you have GPU):
```python
config = MambaConfig(
    d_model=2048,
    n_layers=24,
    batch_size=4,
    num_specialists=100,
    max_steps=100000
)
```
## πŸ” What Gets Enhanced
### Your `app.py` Now Detects:
1. **Custom Trained Models** (Priority 1-9)
2. **Standard Mamba Models** (Priority 10-19)
3. **GPT Fallbacks** (Priority 20+)
When you train a model, it gets **highest priority** automatically!
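The actual detection logic lives in your enhanced `app.py`; the sketch below only illustrates the priority idea, and every name and path in it is hypothetical:

```python
from pathlib import Path

# Hypothetical candidate table: lower priority number wins, so a custom
# checkpoint (if present on disk) always beats the standard and fallback models.
CANDIDATES = [
    {"name": "mamba_swarm_hf_trained", "path": "checkpoints/mamba_swarm_hf_trained.pt", "priority": 1},
    {"name": "state-spaces/mamba-130m", "path": None, "priority": 10},
    {"name": "gpt2", "path": None, "priority": 20},
]

def pick_model(candidates):
    """Return the highest-priority candidate whose checkpoint (if any) exists on disk."""
    available = [c for c in candidates if c["path"] is None or Path(c["path"]).exists()]
    return min(available, key=lambda c: c["priority"])

print(pick_model(CANDIDATES)["name"])
```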
### Example Status Display:
```
🎯 CUSTOM TRAINED MAMBA ENCODER
Status: 🟒 Custom Model Online | Model: Custom Trained: mamba_swarm_hf_trained (1024D)
```
## πŸ“ Training Log Example
```
📥 Loading wikitext-103-v1 from Hugging Face...
📄 Converting to text format...
✅ Dataset saved to train_data.txt
🐍 Starting Mamba Swarm Training with HF Data
✅ Config created:
   - Model: 768D, 8 layers
   - Specialists: 20
   - Batch size: 2
   - Training data: train_data.txt
✅ Trainer initialized successfully
Step 4: Starting training pipeline...
   Phase 1: Foundation training
   Phase 2: Specialist training
   Phase 3: Aggregator training
   Phase 4: End-to-end fine-tuning
🎉 Training completed successfully!
💾 Checkpoint saved: checkpoints/mamba_swarm_hf_trained.pt
```
## 💡 Key Benefits
1. **Your System is Already Advanced** - No need to replace anything
2. **HF Integration is Simple** - Just adds data sources
3. **Automatic Model Detection** - Trained models get priority
4. **Production Ready** - Your 4-phase training is sophisticated
5. **Open Source Data** - Access to massive datasets
## 🚀 Next Steps
1. **Test your existing system**: `python enhanced_training.py --quick-test`
2. **Try with HF data**: `python enhanced_training.py`
3. **Experiment with datasets**: Try different HF datasets
4. **Scale up**: Increase model size and training steps
5. **Deploy**: Your trained model automatically works in `app.py`
Your existing training system is excellent - the HF integration just gives you access to world-class datasets!