# 🚀 Using Your Existing Mamba Trainer with HuggingFace Datasets
Your existing `trainer.py` and `data_loader.py` are excellent! This guide shows how to enhance them with HuggingFace's open-source datasets.
## ✅ What You Already Have (Perfect!)
### Your Existing Training System:
- **`training/trainer.py`** - Sophisticated 4-phase training pipeline
- **`training/data_loader.py`** - Complete data loading infrastructure
- **`training/optimizer.py`** - Advanced Mamba-specific optimization
- **`training/loss.py`** - Comprehensive loss functions
- **`core/config.py`** - Complete configuration system
### Your Training Pipeline:
1. **Phase 1**: Foundation training (shared weights)
2. **Phase 2**: Specialist training (domain experts)
3. **Phase 3**: Aggregator training (combining specialists)
4. **Phase 4**: End-to-end fine-tuning
This is **production-ready** and more advanced than most training systems!
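Both trainer entry points used later in this guide map directly onto these phases. A minimal sketch (assuming the default `MambaConfig` values are reasonable for a smoke test):

```python
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer

config = MambaConfig()
trainer = MambaSwarmTrainer(config)

# Short Phase 1 pass as a sanity check before committing to a full run
trainer.train_foundation_phase(num_steps=100)

# Or run all four phases end to end
trainer.full_training_pipeline()
```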
## 🔗 HuggingFace Integration (Simple Addition)
### Step 1: Install HF Requirements
```bash
pip install -r hf_requirements.txt
```
### Step 2: Quick Training with HF Data
```bash
# Uses your existing trainer with WikiText-103 dataset
python enhanced_training.py
# Quick test with tiny dataset
python enhanced_training.py --quick-test
```
### Step 3: Custom HF Dataset Training
```bash
# Download specific datasets
python train_with_hf_datasets.py --download-only
# Train with specific dataset
python enhanced_training.py --dataset "openwebtext"
```
## 📊 Popular HuggingFace Datasets You Can Use
### Language Modeling Datasets:
- **`wikitext-103-v1`** - Wikipedia articles (recommended for testing; see the loading sketch after these lists)
- **`openwebtext`** - Web text corpus (large, good for training)
- **`c4`** - Colossal Clean Crawled Corpus (very large)
- **`pile`** - EleutherAI's diverse text dataset
- **`tiny_shakespeare`** - Small dataset for quick testing
### Domain-Specific Datasets:
- **Medical**: `pubmed_qa`, `bioasq`
- **Legal**: `lex_glue`
- **Code**: `codeparrot/github-code`, `bigcode/the-stack`
- **Science**: `scientific_papers`
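Any of these can be previewed directly with the `datasets` library before converting it for training. A minimal sketch, assuming `datasets` is among the packages in `hf_requirements.txt`:

```python
from datasets import load_dataset

# WikiText-103 lives under the "wikitext" dataset with the "wikitext-103-v1" config
wiki = load_dataset("wikitext", "wikitext-103-v1", split="train")
for record in wiki:
    if record["text"].strip():      # skip the empty separator rows
        print(record["text"][:200])
        break

# Very large corpora (e.g. C4) can be streamed instead of downloaded up front
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
print(next(iter(c4))["text"][:200])
```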
## 🎯 How It Integrates With Your System
### Your Existing Data Loader Enhancement:
The HF integration layer does four simple things (sketched in code after this list):
1. Downloads datasets from HuggingFace
2. Converts them to your expected text format
3. Saves as `train_data.txt`
4. Your existing `MambaDataset` loads it normally
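A rough sketch of steps 1-3 (the real conversion lives in `enhanced_training.py` / `train_with_hf_datasets.py`; the helper name below is hypothetical):

```python
from datasets import load_dataset

def export_hf_dataset_to_text(name, config_name=None, output_path="train_data.txt", limit=None):
    """Hypothetical helper: flatten an HF dataset into the plain-text file
    that the existing MambaDataset loader expects (one document per line)."""
    ds = load_dataset(name, config_name, split="train")
    with open(output_path, "w", encoding="utf-8") as f:
        for i, record in enumerate(ds):
            if limit is not None and i >= limit:
                break
            text = record["text"].strip()
            if text:  # WikiText contains many empty rows
                f.write(text + "\n")
    return output_path

# Example: convert WikiText-103 so train_data_path="train_data.txt" picks it up unchanged
export_hf_dataset_to_text("wikitext", "wikitext-103-v1")
```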
### Your Existing Config Usage:
```python
# Your existing config works perfectly
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer

config = MambaConfig(
    vocab_size=50257,
    d_model=1024,
    n_layers=12,
    batch_size=4,
    learning_rate=1e-4,
    num_specialists=50,
    train_data_path="train_data.txt"  # HF dataset converted to this
)

# Your existing trainer
trainer = MambaSwarmTrainer(config)
trainer.full_training_pipeline()  # Uses your 4-phase system
```
## πŸƒ Quick Start Commands
### 1. Test Your Existing System:
```bash
# Use your existing trainer as-is
python -c "
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer
config = MambaConfig()
trainer = MambaSwarmTrainer(config)
trainer.train_foundation_phase(num_steps=100) # Quick test
"
```
### 2. Add HuggingFace Data:
```bash
# Download WikiText and train with your system
python enhanced_training.py
```
### 3. Train with Different HF Datasets:
```bash
# Shakespeare (tiny, for testing)
python enhanced_training.py --dataset tiny_shakespeare
# OpenWebText (larger, for real training)
python enhanced_training.py --dataset openwebtext
```
## 📈 Your Enhanced Training Flow
```
📥 HuggingFace Dataset
    ↓ (convert to text format)
📄 train_data.txt
    ↓ (your existing data_loader.py)
🧠 MambaDataset
    ↓ (your existing trainer.py)
🏗️ 4-Phase Training Pipeline:
    📚 Phase 1: Foundation
    🎯 Phase 2: Specialists
    🔗 Phase 3: Aggregator
    🎨 Phase 4: End-to-end
    ↓
💾 Trained Mamba Swarm
    ↓ (your enhanced app.py)
🚀 Production Ready Model
```
## πŸŽ›οΈ Configuration Examples
### Small Model (Quick Testing):
```python
config = MambaConfig(
    d_model=512,
    n_layers=6,
    batch_size=2,
    num_specialists=10,
    max_steps=1000
)
```
### Production Model:
```python
config = MambaConfig(
    d_model=1024,
    n_layers=12,
    batch_size=8,
    num_specialists=50,
    max_steps=50000
)
```
### Large Model (If you have GPU):
```python
config = MambaConfig(
    d_model=2048,
    n_layers=24,
    batch_size=4,
    num_specialists=100,
    max_steps=100000
)
```
## πŸ” What Gets Enhanced
### Your `app.py` Now Detects:
1. **Custom Trained Models** (Priority 1-9)
2. **Standard Mamba Models** (Priority 10-19)
3. **GPT Fallbacks** (Priority 20+)
When you train a model, it gets **highest priority** automatically!
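The actual detection logic lives in your enhanced `app.py`; the sketch below only illustrates the priority idea, and every name and path in it is hypothetical:

```python
from pathlib import Path

# Hypothetical candidate table: lower priority number wins, so a custom
# checkpoint (if present on disk) always beats the standard and fallback models.
CANDIDATES = [
    {"name": "mamba_swarm_hf_trained", "path": "checkpoints/mamba_swarm_hf_trained.pt", "priority": 1},
    {"name": "state-spaces/mamba-130m", "path": None, "priority": 10},
    {"name": "gpt2", "path": None, "priority": 20},
]

def pick_model(candidates):
    """Return the highest-priority candidate whose checkpoint (if any) exists on disk."""
    available = [c for c in candidates if c["path"] is None or Path(c["path"]).exists()]
    return min(available, key=lambda c: c["priority"])

print(pick_model(CANDIDATES)["name"])
```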
### Example Status Display:
```
🎯 CUSTOM TRAINED MAMBA ENCODER
Status: 🟒 Custom Model Online | Model: Custom Trained: mamba_swarm_hf_trained (1024D)
```
## πŸ“ Training Log Example
```
📥 Loading wikitext-103-v1 from Hugging Face...
📄 Converting to text format...
✅ Dataset saved to train_data.txt
🐍 Starting Mamba Swarm Training with HF Data
✅ Config created:
   - Model: 768D, 8 layers
   - Specialists: 20
   - Batch size: 2
   - Training data: train_data.txt
✅ Trainer initialized successfully
Step 4: Starting training pipeline...
   Phase 1: Foundation training
   Phase 2: Specialist training
   Phase 3: Aggregator training
   Phase 4: End-to-end fine-tuning
🎉 Training completed successfully!
💾 Checkpoint saved: checkpoints/mamba_swarm_hf_trained.pt
```
## 💡 Key Benefits
1. **Your System is Already Advanced** - No need to replace anything
2. **HF Integration is Simple** - Just adds data sources
3. **Automatic Model Detection** - Trained models get priority
4. **Production Ready** - Your 4-phase training is sophisticated
5. **Open Source Data** - Access to massive datasets
## 🚀 Next Steps
1. **Test your existing system**: `python enhanced_training.py --quick-test`
2. **Try with HF data**: `python enhanced_training.py`
3. **Experiment with datasets**: Try different HF datasets
4. **Scale up**: Increase model size and training steps
5. **Deploy**: Your trained model automatically works in `app.py`
Your existing training system is excellent - the HF integration just gives you access to world-class datasets!