# πŸš€ Using Your Existing Mamba Trainer with HuggingFace Datasets

Your existing `trainer.py` and `data_loader.py` are excellent! This guide shows how to enhance them with HuggingFace's open-source datasets.

## βœ… What You Already Have (Perfect!)

### Your Existing Training System:
- **`training/trainer.py`** - Sophisticated 4-phase training pipeline
- **`training/data_loader.py`** - Complete data loading infrastructure
- **`training/optimizer.py`** - Advanced Mamba-specific optimization
- **`training/loss.py`** - Comprehensive loss functions
- **`core/config.py`** - Complete configuration system

### Your Training Pipeline:
1. **Phase 1**: Foundation training (shared weights)
2. **Phase 2**: Specialist training (domain experts)
3. **Phase 3**: Aggregator training (combining specialists)
4. **Phase 4**: End-to-end fine-tuning

This is **production-ready** and more advanced than most training systems!
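
The staged design above can be sketched as a generic phase runner: each phase starts only after the previous one finishes. This is an illustrative outline only, not the project's actual `MambaSwarmTrainer`; the real entry points appear later in this guide.

```python
# Generic sketch of a staged training driver; the phase names mirror the
# four phases above, but this runner is illustrative, not the project's
# actual MambaSwarmTrainer.
def run_pipeline(phases, run_phase):
    """Run each named phase in order, collecting its result; a later
    phase only starts once the previous one has returned."""
    results = {}
    for name in phases:
        results[name] = run_phase(name)
    return results

PHASES = ["foundation", "specialists", "aggregator", "end_to_end"]
history = run_pipeline(PHASES, lambda name: f"{name}: complete")
print(list(history))  # phases run in declaration order
```

Because dictionaries preserve insertion order in Python 3.7+, `history` doubles as an ordered record of which phases completed.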



## πŸ”— HuggingFace Integration (Simple Addition)

### Step 1: Install HF Requirements
```bash
pip install -r hf_requirements.txt
```



### Step 2: Quick Training with HF Data
```bash
# Uses your existing trainer with the WikiText-103 dataset
python enhanced_training.py

# Quick test with a tiny dataset
python enhanced_training.py --quick-test
```



### Step 3: Custom HF Dataset Training
```bash
# Download specific datasets
python train_with_hf_datasets.py --download-only

# Train with a specific dataset
python enhanced_training.py --dataset "openwebtext"
```



## πŸ“Š Popular HuggingFace Datasets You Can Use



### Language Modeling Datasets:

- **`wikitext-103-v1`** - Wikipedia articles (recommended for testing)
- **`openwebtext`** - Web text corpus (large, good for training)
- **`c4`** - Colossal Clean Crawled Corpus (very large)
- **`pile`** - EleutherAI's diverse text dataset
- **`tiny_shakespeare`** - Small dataset for quick testing



### Domain-Specific Datasets:
- **Medical**: `pubmed_qa`, `bioasq`
- **Legal**: `lex_glue`
- **Code**: `codeparrot/github-code`, `bigcode/the-stack`
- **Science**: `scientific_papers`



## 🎯 How It Integrates With Your System



### Your Existing Data Loader Enhancement:
The HF integration simply:
1. Downloads datasets from HuggingFace
2. Converts them to your expected text format
3. Saves the result as `train_data.txt`
4. Your existing `MambaDataset` loads it normally
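
The conversion step (2-3 above) can be sketched as follows. This is a minimal illustration assuming each dataset record exposes a `text` field, as `wikitext-103-v1` does; the actual logic in `enhanced_training.py` may differ.

```python
def convert_to_text_file(records, output_path="train_data.txt"):
    """Write the 'text' field of each record to one plain-text file,
    skipping blank rows, so the existing MambaDataset can load it."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as f:
        for record in records:
            text = record.get("text", "").strip()
            if text:  # wikitext uses empty rows as separators; drop them
                f.write(text + "\n")
                count += 1
    return count

# In the real pipeline the records would come from
# datasets.load_dataset("wikitext", "wikitext-103-v1", split="train");
# a tiny in-memory sample stands in for it here.
sample = [{"text": "First article."}, {"text": ""}, {"text": "Second article."}]
written = convert_to_text_file(sample)
print(written)  # -> 2
```

Once `train_data.txt` exists, nothing downstream needs to know the data originated on the Hugging Face Hub.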



### Your Existing Config Usage:
```python
# Your existing config works perfectly
config = MambaConfig(
    vocab_size=50257,
    d_model=1024,
    n_layers=12,
    batch_size=4,
    learning_rate=1e-4,
    num_specialists=50,
    train_data_path="train_data.txt"  # HF dataset converted to this
)

# Your existing trainer
trainer = MambaSwarmTrainer(config)
trainer.full_training_pipeline()  # Uses your 4-phase system
```



## πŸƒ Quick Start Commands



### 1. Test Your Existing System:

```bash

# Use your existing trainer as-is

python -c "

from core.config import MambaConfig

from training.trainer import MambaSwarmTrainer



config = MambaConfig()

trainer = MambaSwarmTrainer(config)

trainer.train_foundation_phase(num_steps=100)  # Quick test

"

```



### 2. Add HuggingFace Data:
```bash
# Download WikiText and train with your system
python enhanced_training.py
```



### 3. Train with Different HF Datasets:
```bash
# Shakespeare (tiny, for testing)
python enhanced_training.py --dataset tiny_shakespeare

# OpenWebText (larger, for real training)
python enhanced_training.py --dataset openwebtext
```



## πŸ“ˆ Your Enhanced Training Flow

```
πŸ“₯ HuggingFace Dataset
    ↓ (convert to text format)
πŸ“„ train_data.txt
    ↓ (your existing data_loader.py)
🧠 MambaDataset
    ↓ (your existing trainer.py)
πŸ—οΈ 4-Phase Training Pipeline:
    πŸ“š Phase 1: Foundation
    🎯 Phase 2: Specialists
    πŸ”— Phase 3: Aggregator
    🎨 Phase 4: End-to-end
    ↓
πŸ’Ύ Trained Mamba Swarm
    ↓ (your enhanced app.py)
πŸš€ Production Ready Model
```



## πŸŽ›οΈ Configuration Examples



### Small Model (Quick Testing):

```python

config = MambaConfig(

    d_model=512,

    n_layers=6,

    batch_size=2,

    num_specialists=10,

    max_steps=1000

)

```



### Production Model:

```python

config = MambaConfig(

    d_model=1024, 

    n_layers=12,

    batch_size=8,

    num_specialists=50,

    max_steps=50000

)

```



### Large Model (If you have GPU):

```python

config = MambaConfig(

    d_model=2048,

    n_layers=24, 

    batch_size=4,

    num_specialists=100,

    max_steps=100000

)

```



## πŸ” What Gets Enhanced



### Your `app.py` Now Detects:

1. **Custom Trained Models** (Priority 1-9) 
2. **Standard Mamba Models** (Priority 10-19)
3. **GPT Fallbacks** (Priority 20+)

When you train a model, it gets **highest priority** automatically!
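
A priority-based selection loop of that shape could be sketched as follows; the model names and priority numbers below are illustrative placeholders, not `app.py`'s actual detection list.

```python
# Hypothetical model registry: lower priority number wins.
# The names and priority bands are illustrative, not app.py's real list.
CANDIDATES = [
    ("gpt2-fallback", 25),             # GPT fallback (priority 20+)
    ("state-spaces/mamba-1.4b", 12),   # standard Mamba (priority 10-19)
    ("mamba_swarm_hf_trained", 3),     # custom trained model (priority 1-9)
]

def pick_model(candidates, is_available):
    """Return the highest-priority (lowest number) available model."""
    available = [(prio, name) for name, prio in candidates if is_available(name)]
    if not available:
        raise RuntimeError("no model available")
    return min(available)[1]

# A freshly trained checkpoint outranks every fallback automatically.
print(pick_model(CANDIDATES, lambda name: True))  # -> mamba_swarm_hf_trained
```

Because selection is just a minimum over priorities, dropping a custom checkpoint into the candidate list is enough for it to win without any other code changes.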

### Example Status Display:
```
🎯 CUSTOM TRAINED MAMBA ENCODER
Status: 🟒 Custom Model Online | Model: Custom Trained: mamba_swarm_hf_trained (1024D)
```

## πŸ“ Training Log Example

```
πŸ“₯ Loading wikitext-103-v1 from Hugging Face...
πŸ“„ Converting to text format...
βœ… Dataset saved to train_data.txt
🐍 Starting Mamba Swarm Training with HF Data
βœ… Config created:
  - Model: 768D, 8 layers
  - Specialists: 20
  - Batch size: 2
  - Training data: train_data.txt
βœ… Trainer initialized successfully
Step 4: Starting training pipeline...
Phase 1: Foundation training
Phase 2: Specialist training
Phase 3: Aggregator training
Phase 4: End-to-end fine-tuning
πŸŽ‰ Training completed successfully!
πŸ’Ύ Checkpoint saved: checkpoints/mamba_swarm_hf_trained.pt
```

## πŸ’‘ Key Benefits

1. **Your System is Already Advanced** - No need to replace anything
2. **HF Integration is Simple** - Just adds data sources
3. **Automatic Model Detection** - Trained models get priority
4. **Production Ready** - Your 4-phase training is sophisticated
5. **Open Source Data** - Access to massive datasets

## πŸš€ Next Steps

1. **Test your existing system**: `python enhanced_training.py --quick-test`
2. **Try with HF data**: `python enhanced_training.py`
3. **Experiment with datasets**: Try different HF datasets
4. **Scale up**: Increase model size and training steps
5. **Deploy**: Your trained model automatically works in `app.py`

Your existing training system is excellent - the HF integration just gives you access to world-class datasets!