# Using Your Existing Mamba Trainer with HuggingFace Datasets
Your existing trainer.py and data_loader.py are excellent! This guide shows how to enhance them with HuggingFace's open-source datasets.
## What You Already Have (Perfect!)
Your Existing Training System:
- `training/trainer.py` - Sophisticated 4-phase training pipeline
- `training/data_loader.py` - Complete data loading infrastructure
- `training/optimizer.py` - Advanced Mamba-specific optimization
- `training/loss.py` - Comprehensive loss functions
- `core/config.py` - Complete configuration system
Your Training Pipeline:
- Phase 1: Foundation training (shared weights)
- Phase 2: Specialist training (domain experts)
- Phase 3: Aggregator training (combining specialists)
- Phase 4: End-to-end fine-tuning
This is production-ready and more advanced than most training systems!
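In code, that pipeline is driven through `MambaSwarmTrainer`. The sketch below shows the single-call entry point plus a hypothetical per-phase breakdown; apart from `full_training_pipeline()` and `train_foundation_phase()`, which appear later in this guide, the phase method names are assumptions for illustration only:

```python
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer

config = MambaConfig(train_data_path="train_data.txt")
trainer = MambaSwarmTrainer(config)

# One call runs all four phases in order:
trainer.full_training_pipeline()

# ...or, if the trainer exposes per-phase entry points, run them step by step.
# Only train_foundation_phase() appears elsewhere in this guide; the other
# three method names below are assumed, not confirmed API.
# trainer.train_foundation_phase(num_steps=1000)   # Phase 1: shared weights
# trainer.train_specialist_phase(num_steps=1000)   # Phase 2: domain experts (assumed name)
# trainer.train_aggregator_phase(num_steps=1000)   # Phase 3: aggregator (assumed name)
# trainer.train_end_to_end_phase(num_steps=1000)   # Phase 4: fine-tuning (assumed name)
```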
## HuggingFace Integration (Simple Addition)
Step 1: Install HF Requirements
```bash
pip install -r hf_requirements.txt
```
Step 2: Quick Training with HF Data
```bash
# Uses your existing trainer with the WikiText-103 dataset
python enhanced_training.py

# Quick test with a tiny dataset
python enhanced_training.py --quick-test
```
Step 3: Custom HF Dataset Training
```bash
# Download specific datasets only
python train_with_hf_datasets.py --download-only

# Train with a specific dataset
python enhanced_training.py --dataset "openwebtext"
```
## Popular HuggingFace Datasets You Can Use
Language Modeling Datasets:
- `wikitext-103-v1` - Wikipedia articles (recommended for testing)
- `openwebtext` - Web text corpus (large, good for training)
- `c4` - Colossal Clean Crawled Corpus (very large)
- `pile` - EleutherAI's diverse text dataset
- `tiny_shakespeare` - Small dataset for quick testing
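For reference, these names map onto `datasets.load_dataset()` calls roughly as in the sketch below; Hub identifiers and required configs occasionally change, so treat them as starting points rather than guaranteed names:

```python
from datasets import load_dataset

# Smaller datasets: fine to download fully.
wikitext = load_dataset("wikitext", "wikitext-103-v1", split="train")
shakespeare = load_dataset("tiny_shakespeare", split="train")

# Larger corpora: stream them instead of downloading everything up front.
openwebtext = load_dataset("openwebtext", split="train", streaming=True)
c4 = load_dataset("c4", "en", split="train", streaming=True)
```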
Domain-Specific Datasets:
- Medical: `pubmed_qa`, `bioasq`
- Legal: `lex_glue`
- Code: `codeparrot/github-code`, `bigcode/the-stack`
- Science: `scientific_papers`
## How It Integrates With Your System
Your Existing Data Loader Enhancement:
The HF integration simply:
- Downloads datasets from HuggingFace
- Converts them to your expected text format
- Saves the result as `train_data.txt` (a minimal conversion sketch follows this list)
- Your existing `MambaDataset` loads it normally
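A minimal sketch of that conversion step, assuming the WikiText dataset and its `text` column; your `enhanced_training.py` presumably does something equivalent:

```python
from datasets import load_dataset

# Download an HF dataset and flatten it into the plain-text file that the
# existing data_loader.py / MambaDataset already expects.
dataset = load_dataset("wikitext", "wikitext-103-v1", split="train")

with open("train_data.txt", "w", encoding="utf-8") as f:
    for record in dataset:
        text = record["text"].strip()
        if text:  # WikiText contains many blank lines; skip them
            f.write(text + "\n")
```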
Your Existing Config Usage:
```python
# Your existing config works perfectly
config = MambaConfig(
    vocab_size=50257,
    d_model=1024,
    n_layers=12,
    batch_size=4,
    learning_rate=1e-4,
    num_specialists=50,
    train_data_path="train_data.txt"  # HF dataset converted to this
)

# Your existing trainer
trainer = MambaSwarmTrainer(config)
trainer.full_training_pipeline()  # Uses your 4-phase system
```
## Quick Start Commands
1. Test Your Existing System:
```bash
# Use your existing trainer as-is
python -c "
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer
config = MambaConfig()
trainer = MambaSwarmTrainer(config)
trainer.train_foundation_phase(num_steps=100)  # Quick test
"
```
2. Add HuggingFace Data:
```bash
# Download WikiText and train with your system
python enhanced_training.py
```
3. Train with Different HF Datasets:
```bash
# Shakespeare (tiny, for testing)
python enhanced_training.py --dataset tiny_shakespeare

# OpenWebText (larger, for real training)
python enhanced_training.py --dataset openwebtext
```
## Your Enhanced Training Flow
```text
HuggingFace Dataset
   ↓ (convert to text format)
train_data.txt
   ↓ (your existing data_loader.py)
MambaDataset
   ↓ (your existing trainer.py)
4-Phase Training Pipeline:
   Phase 1: Foundation
   Phase 2: Specialists
   Phase 3: Aggregator
   Phase 4: End-to-end
   ↓
Trained Mamba Swarm
   ↓ (your enhanced app.py)
Production-Ready Model
```
## Configuration Examples
Small Model (Quick Testing):
```python
config = MambaConfig(
    d_model=512,
    n_layers=6,
    batch_size=2,
    num_specialists=10,
    max_steps=1000
)
```
Production Model:
```python
config = MambaConfig(
    d_model=1024,
    n_layers=12,
    batch_size=8,
    num_specialists=50,
    max_steps=50000
)
```
Large Model (If You Have a GPU):
```python
config = MambaConfig(
    d_model=2048,
    n_layers=24,
    batch_size=4,
    num_specialists=100,
    max_steps=100000
)
```
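If you switch between these presets often, a small helper like the one below keeps them in one place; this helper is hypothetical and not part of the existing scripts:

```python
from core.config import MambaConfig

# Hypothetical preset table mirroring the three configurations above.
PRESETS = {
    "small":      dict(d_model=512,  n_layers=6,  batch_size=2, num_specialists=10,  max_steps=1_000),
    "production": dict(d_model=1024, n_layers=12, batch_size=8, num_specialists=50,  max_steps=50_000),
    "large":      dict(d_model=2048, n_layers=24, batch_size=4, num_specialists=100, max_steps=100_000),
}

def make_config(preset: str = "small") -> MambaConfig:
    """Build a MambaConfig from one of the named presets."""
    return MambaConfig(train_data_path="train_data.txt", **PRESETS[preset])
```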
## What Gets Enhanced
Your `app.py` Now Detects:
- Custom Trained Models (Priority 1-9)
- Standard Mamba Models (Priority 10-19)
- GPT Fallbacks (Priority 20+)
When you train a model, it gets highest priority automatically!
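Conceptually, the detection step just picks the lowest-numbered priority among whatever is actually available. The sketch below illustrates that idea; the candidate list and structure are hypothetical, and only the priority bands (1-9 custom, 10-19 standard Mamba, 20+ GPT fallback) come from this guide:

```python
import os

# Illustration only: candidates and loader layout are hypothetical.
CANDIDATES = [
    (1,  "checkpoints/mamba_swarm_hf_trained.pt"),  # custom trained checkpoint (priority 1-9)
    (10, "state-spaces/mamba-130m"),                # standard Mamba model (priority 10-19)
    (20, "gpt2"),                                   # GPT fallback (priority 20+)
]

def pick_model():
    """Return the highest-priority (lowest-numbered) model that exists."""
    for priority, name in sorted(CANDIDATES):
        # Skip local checkpoints that have not been trained yet.
        if name.endswith(".pt") and not os.path.exists(name):
            continue
        return priority, name
    raise RuntimeError("No usable model found")
```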
Example Status Display:
```text
CUSTOM TRAINED MAMBA ENCODER
Status: Custom Model Online | Model: Custom Trained: mamba_swarm_hf_trained (1024D)
```
## Training Log Example
```text
Loading wikitext-103-v1 from Hugging Face...
Converting to text format...
Dataset saved to train_data.txt

Starting Mamba Swarm Training with HF Data
Config created:
  - Model: 768D, 8 layers
  - Specialists: 20
  - Batch size: 2
  - Training data: train_data.txt
Trainer initialized successfully

Step 4: Starting training pipeline...
Phase 1: Foundation training
Phase 2: Specialist training
Phase 3: Aggregator training
Phase 4: End-to-end fine-tuning

Training completed successfully!
Checkpoint saved: checkpoints/mamba_swarm_hf_trained.pt
```
## Key Benefits
- Your System is Already Advanced - No need to replace anything
- HF Integration is Simple - Just adds data sources
- Automatic Model Detection - Trained models get priority
- Production Ready - Your 4-phase training is sophisticated
- Open Source Data - Access to massive datasets
## Next Steps
- Test your existing system: `python enhanced_training.py --quick-test`
- Try with HF data: `python enhanced_training.py`
- Experiment with datasets: Try different HF datasets
- Scale up: Increase model size and training steps
- Deploy: Your trained model automatically works in `app.py`
Your existing training system is excellent - the HF integration just gives you access to world-class datasets!