# Using Your Existing Mamba Trainer with HuggingFace Datasets
Your existing `trainer.py` and `data_loader.py` are excellent! This guide shows how to enhance them with HuggingFace's open-source datasets.
## What You Already Have (Perfect!)
### Your Existing Training System:
- **`training/trainer.py`** - Sophisticated 4-phase training pipeline
- **`training/data_loader.py`** - Complete data loading infrastructure
- **`training/optimizer.py`** - Advanced Mamba-specific optimization
- **`training/loss.py`** - Comprehensive loss functions
- **`core/config.py`** - Complete configuration system
### Your Training Pipeline:
1. **Phase 1**: Foundation training (shared weights)
2. **Phase 2**: Specialist training (domain experts)
3. **Phase 3**: Aggregator training (combining specialists)
4. **Phase 4**: End-to-end fine-tuning
This is **production-ready** and more advanced than most training systems!
## HuggingFace Integration (Simple Addition)
### Step 1: Install HF Requirements
```bash
pip install -r hf_requirements.txt
```
### Step 2: Quick Training with HF Data
```bash
# Uses your existing trainer with WikiText-103 dataset
python enhanced_training.py
# Quick test with tiny dataset
python enhanced_training.py --quick-test
```
### Step 3: Custom HF Dataset Training
```bash
# Download specific datasets
python train_with_hf_datasets.py --download-only
# Train with specific dataset
python enhanced_training.py --dataset "openwebtext"
```
## Popular HuggingFace Datasets You Can Use
### Language Modeling Datasets:
- **`wikitext-103-v1`** - Wikipedia articles (recommended for testing)
- **`openwebtext`** - Web text corpus (large, good for training)
- **`c4`** - Colossal Clean Crawled Corpus (very large)
- **`pile`** - EleutherAI's diverse text dataset
- **`tiny_shakespeare`** - Small dataset for quick testing
### Domain-Specific Datasets:
- **Medical**: `pubmed_qa`, `bioasq`
- **Legal**: `lex_glue`
- **Code**: `codeparrot/github-code`, `bigcode/the-stack`
- **Science**: `scientific_papers`
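Any of these can be pulled down with the `datasets` library before you wire them into the trainer. A minimal inspection sketch, assuming `datasets` was installed via `hf_requirements.txt` (for very large corpora, streaming avoids downloading everything up front):

```python
from datasets import load_dataset

# WikiText-103: small enough to download fully, good for a first look
wikitext = load_dataset("wikitext", "wikitext-103-v1", split="train")
print(wikitext[0]["text"][:200])

# Very large corpora (e.g. C4) can be streamed instead of downloaded.
# Note: newer versions of `datasets` may require the "allenai/c4" name.
c4 = load_dataset("c4", "en", split="train", streaming=True)
print(next(iter(c4))["text"][:200])
```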
## How It Integrates With Your System
### Your Existing Data Loader Enhancement:
The HF integration simply:
1. Downloads datasets from HuggingFace
2. Converts them to your expected text format
3. Saves as `train_data.txt`
4. Your existing `MambaDataset` loads it normally
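Conceptually, the conversion step is just flattening a dataset's text column into the plain-text file your `MambaDataset` already reads. A minimal sketch of that step, with a hypothetical helper name (`hf_dataset_to_text`) and filtering logic that stand in for whatever `enhanced_training.py` actually does:

```python
from datasets import load_dataset

def hf_dataset_to_text(dataset_name, config_name=None, output_path="train_data.txt",
                       text_column="text", max_examples=None):
    """Download an HF dataset and dump its text column to a plain-text file."""
    ds = load_dataset(dataset_name, config_name, split="train")
    with open(output_path, "w", encoding="utf-8") as f:
        for i, example in enumerate(ds):
            if max_examples is not None and i >= max_examples:
                break
            text = example[text_column].strip()
            if text:  # skip the blank rows WikiText uses as separators
                f.write(text + "\n")
    return output_path

# WikiText-103 converted into the file your existing config points at
hf_dataset_to_text("wikitext", "wikitext-103-v1", "train_data.txt")
```

From here, nothing changes on your side: `train_data_path="train_data.txt"` in `MambaConfig` picks the file up exactly as before.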
### Your Existing Config Usage:
```python
# Your existing config works perfectly
config = MambaConfig(
vocab_size=50257,
d_model=1024,
n_layers=12,
batch_size=4,
learning_rate=1e-4,
num_specialists=50,
train_data_path="train_data.txt" # HF dataset converted to this
)
# Your existing trainer
trainer = MambaSwarmTrainer(config)
trainer.full_training_pipeline() # Uses your 4-phase system
```
## Quick Start Commands
### 1. Test Your Existing System:
```bash
# Use your existing trainer as-is
python -c "
from core.config import MambaConfig
from training.trainer import MambaSwarmTrainer
config = MambaConfig()
trainer = MambaSwarmTrainer(config)
trainer.train_foundation_phase(num_steps=100) # Quick test
"
```
### 2. Add HuggingFace Data:
```bash
# Download WikiText and train with your system
python enhanced_training.py
```
### 3. Train with Different HF Datasets:
```bash
# Shakespeare (tiny, for testing)
python enhanced_training.py --dataset tiny_shakespeare
# OpenWebText (larger, for real training)
python enhanced_training.py --dataset openwebtext
```
## Your Enhanced Training Flow
```
HuggingFace Dataset
    ↓ (convert to text format)
train_data.txt
    ↓ (your existing data_loader.py)
MambaDataset
    ↓ (your existing trainer.py)
4-Phase Training Pipeline:
    Phase 1: Foundation
    Phase 2: Specialists
    Phase 3: Aggregator
    Phase 4: End-to-end
    ↓
Trained Mamba Swarm
    ↓ (your enhanced app.py)
Production Ready Model
```
## Configuration Examples
### Small Model (Quick Testing):
```python
config = MambaConfig(
d_model=512,
n_layers=6,
batch_size=2,
num_specialists=10,
max_steps=1000
)
```
### Production Model:
```python
config = MambaConfig(
d_model=1024,
n_layers=12,
batch_size=8,
num_specialists=50,
max_steps=50000
)
```
### Large Model (If you have GPU):
```python
config = MambaConfig(
d_model=2048,
n_layers=24,
batch_size=4,
num_specialists=100,
max_steps=100000
)
```
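To sanity-check which preset fits your hardware, a rough parameter count helps: a Mamba block has on the order of 6·d_model² parameters, plus the embedding table. A back-of-the-envelope sketch (it ignores specialist routing and aggregator overhead, so treat it as a per-specialist lower bound):

```python
def rough_mamba_params(d_model, n_layers, vocab_size=50257):
    """Rule-of-thumb count: ~6 * d_model^2 per Mamba block, plus embeddings."""
    block_params = 6 * d_model ** 2 * n_layers
    embedding_params = vocab_size * d_model  # assumes tied input/output embeddings
    return block_params + embedding_params

for name, d_model, n_layers in [("small", 512, 6), ("production", 1024, 12), ("large", 2048, 24)]:
    millions = rough_mamba_params(d_model, n_layers) / 1e6
    print(f"{name}: ~{millions:.0f}M parameters per specialist")
```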
## What Gets Enhanced
### Your `app.py` Now Detects:
1. **Custom Trained Models** (Priority 1-9)
2. **Standard Mamba Models** (Priority 10-19)
3. **GPT Fallbacks** (Priority 20+)
When you train a model, it automatically gets the **highest priority**!
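Under the hood, the priority tiers are just a sorted lookup over the available checkpoints. A minimal sketch of the idea, with paths, name patterns, and the helper name as illustrative assumptions rather than the actual `app.py` logic:

```python
from pathlib import Path

def rank_checkpoints(checkpoint_dir="checkpoints"):
    """Assign each checkpoint a priority tier and return them best-first."""
    candidates = []
    for path in Path(checkpoint_dir).glob("*.pt"):
        name = path.stem.lower()
        if "hf_trained" in name or "custom" in name:
            priority = 1    # custom trained models: tier 1-9
        elif "mamba" in name:
            priority = 10   # standard Mamba models: tier 10-19
        else:
            priority = 20   # GPT and other fallbacks: tier 20+
        candidates.append((priority, path))
    return [path for priority, path in sorted(candidates, key=lambda c: c[0])]

# The first entry is the model that gets loaded
print(rank_checkpoints())
```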
### Example Status Display:
```
CUSTOM TRAINED MAMBA ENCODER
Status: Custom Model Online | Model: Custom Trained: mamba_swarm_hf_trained (1024D)
```
## Training Log Example
```
Loading wikitext-103-v1 from Hugging Face...
Converting to text format...
Dataset saved to train_data.txt
Starting Mamba Swarm Training with HF Data
Config created:
  - Model: 768D, 8 layers
  - Specialists: 20
  - Batch size: 2
  - Training data: train_data.txt
Trainer initialized successfully
Step 4: Starting training pipeline...
  Phase 1: Foundation training
  Phase 2: Specialist training
  Phase 3: Aggregator training
  Phase 4: End-to-end fine-tuning
Training completed successfully!
Checkpoint saved: checkpoints/mamba_swarm_hf_trained.pt
```
## Key Benefits
1. **Your System is Already Advanced** - No need to replace anything
2. **HF Integration is Simple** - Just adds data sources
3. **Automatic Model Detection** - Trained models get priority
4. **Production Ready** - Your 4-phase training is sophisticated
5. **Open Source Data** - Access to massive datasets
## Next Steps
1. **Test your existing system**: `python enhanced_training.py --quick-test`
2. **Try with HF data**: `python enhanced_training.py`
3. **Experiment with datasets**: Try different HF datasets
4. **Scale up**: Increase model size and training steps
5. **Deploy**: Your trained model automatically works in `app.py`
Your existing training system is excellent - the HF integration just gives you access to world-class datasets!