Commit 36ac84e · 1 Parent(s): 7275aef
lilbablo committed

chore: initial public release of Humigence with dual-GPU & CLI wizard

HUMIGENCE_COMMAND_READY.md ADDED
@@ -0,0 +1,119 @@
# Humigence Command - Ready to Use! 🚀

## ✅ **Yes, you can launch this with "humigence"!**

The Humigence training pipeline has been successfully refactored and is now ready to use with the `humigence` command.

## 🎯 **How to Use**

### **Launch Humigence CLI**
```bash
humigence
```

### **What You'll See**
```
──────────────── Humigence — Your AI. Your pipeline. Zero code. ────────────────
A complete MLOps suite built for makers, teams, and enterprises.

Options:
1. Supervised Fine-Tuning ✅
2. RAG Implementation (coming soon)
3. EnterpriseGPT (coming soon)
4. Batch Inference (coming soon)
5. Context Length (coming soon)
6. Exit

Select an option:
```

### **Training Options**
1. **Select "1. Supervised Fine-Tuning"**
2. **Choose Setup Mode**: Basic or Advanced
3. **Select Model**: TinyLlama, Qwen, Phi-2, etc.
4. **Choose Training Recipe**: LoRA, QLoRA, etc.
5. **Select Dataset**: Your available datasets
6. **Choose Training Mode**: Multi-GPU or Single-GPU
7. **Confirm Configuration**: Review and start training

## 🚀 **What's New (Accelerate Refactor)**

### **Clean Architecture**
- **Hugging Face Accelerate**: Stable DDP training
- **Single-GPU Evaluation**: Always on cuda:0
- **No More NCCL Errors**: Robust distributed training
- **Clean Code**: Removed over-engineering

### **Key Features**
- ✅ **Multi-GPU Training**: 2× RTX 5090s support
- ✅ **Single-GPU Fallback**: Automatic fallback if needed
- ✅ **LoRA/QLoRA Support**: Parameter-efficient fine-tuning
- ✅ **Structured Logging**: Clean, readable output
- ✅ **Error Handling**: Robust error management

## 📋 **Training Modes**

### **Multi-GPU Training (Recommended)**
- Uses `accelerate launch` with 2× RTX 5090s
- Stable DDP training with NCCL backend
- Automatic device management
- Mixed precision (bf16/fp16)

### **Single-GPU Training**
- Uses `python train.py` for single GPU
- Fallback option if multi-GPU fails
- Same functionality, single device

## 🎯 **Usage Examples**

### **Interactive CLI**
```bash
humigence
# Select option 1
# Choose Multi-GPU Training
# Follow the configuration wizard
```

### **Direct Training (Advanced)**
```bash
# Multi-GPU
accelerate launch --config_file accelerate_config.yaml train.py --config_file config.json

# Single-GPU
python train.py --config_file config.json
```
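
For orientation, the core pattern behind an Accelerate-based `train.py` looks roughly like this (a minimal, self-contained sketch with a toy model and data, not the actual Humigence script):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # mixed_precision="bf16" can be passed per your config

# Toy model and data stand in for the real model/dataset
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=4)

# prepare() moves everything to the right device and wraps the model for DDP
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for x, y in loader:
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)  # handles cross-GPU gradient sync when launched on 2 GPUs
    optimizer.step()
    optimizer.zero_grad()
```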

## 🔧 **Technical Details**

### **Files Created/Updated**
- **`train.py`** - Clean Accelerate-based training script
- **`accelerate_config.yaml`** - Multi-GPU configuration
- **`cli/main.py`** - Updated CLI integration
- **`humigence`** - Command-line entry point

### **Dependencies**
- **Hugging Face Accelerate** - Distributed training
- **Transformers** - Model loading and training
- **PEFT** - LoRA/QLoRA support
- **Rich** - Beautiful CLI interface

## 🎉 **Ready to Use!**

The Humigence training pipeline is now:
- ✅ **Refactored** with Hugging Face Accelerate
- ✅ **Tested** and working correctly
- ✅ **Installed** as the `humigence` command
- ✅ **Ready** for production use

**Just run `humigence` and start training!** 🚀

## 📊 **What You Get**

1. **Clean CLI Interface** - Easy to use
2. **Stable Multi-GPU Training** - No more NCCL errors
3. **Single-GPU Evaluation** - No device mismatches
4. **Structured Reporting** - Clear training summaries
5. **Error Handling** - Robust error management
6. **Production Ready** - Works with your 2× RTX 5090s

**The refactored Humigence pipeline is ready for your AI training needs!** 🎯
MULTI_GPU_TRAINING_README.md ADDED
@@ -0,0 +1,167 @@
# Multi-GPU Training with 2× RTX 5090s

## 🚀 **Quick Start**

### **Multi-GPU Training (Recommended)**
```bash
torchrun --nproc_per_node=2 train.py --config runs/humigence/config.snapshot.json
```

### **Single GPU Training (Fallback)**
```bash
python train.py --config runs/humigence/config.snapshot.json --fallback_single_gpu
```

## 🔧 **Features**

### **Multi-GPU Support**
- ✅ **NCCL Backend**: Stable distributed training
- ✅ **2× RTX 5090s**: Full utilization of both GPUs
- ✅ **Automatic Detection**: Detects available GPUs
- ✅ **Process Synchronization**: Proper rank management

### **Environment Hardening**
- ✅ **NCCL Debug**: `NCCL_DEBUG=INFO` for troubleshooting
- ✅ **IB Disabled**: `NCCL_IB_DISABLE=1` prevents InfiniBand issues
- ✅ **P2P Disabled**: `NCCL_P2P_DISABLE=1` prevents peer-to-peer issues
- ✅ **Async Error Handling**: `NCCL_ASYNC_ERROR_HANDLING=1` for better error handling
- ✅ **Tokenizer Safety**: `TOKENIZERS_PARALLELISM=false` prevents fork warnings

### **Graceful Fallback**
- ✅ **Automatic Fallback**: Falls back to single GPU if multi-GPU fails
- ✅ **Clear Warnings**: Shows when fallback is triggered
- ✅ **No Data Loss**: Training continues seamlessly
- ✅ **Error Recovery**: Handles NCCL errors gracefully

### **Device Consistency**
- ✅ **Training**: Each process uses its local rank device (`cuda:local_rank`)
- ✅ **Evaluation**: Always uses `cuda:0` for a fresh model reload (see the sketch below)
- ✅ **No Mixing**: No tensors mixed between `cuda:0` and `cuda:1`
- ✅ **Synchronization**: Proper process synchronization

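To make these rules concrete, here is a minimal sketch of the device-selection pattern (helper names are illustrative, not the launcher's actual internals):

```python
import os
import torch

def training_device() -> torch.device:
    # Each DDP process trains on the GPU matching its local rank.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    return torch.device(f"cuda:{local_rank}")

def evaluation_device() -> torch.device:
    # Evaluation always reloads the model fresh on cuda:0, so eval
    # tensors never mix with training tensors living on cuda:1.
    return torch.device("cuda:0")
```
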
## 📊 **Training Modes**

### **Multi-GPU Mode (Default)**
```
🚀 Starting multi-GPU training on 2 GPUs
✅ Distributed training initialized
Rank: 0/1
Local Rank: 0
World Size: 2
Device: cuda:0

🚀 Starting multi-GPU training on 2 GPUs
✅ Distributed training initialized
Rank: 1/1
Local Rank: 1
World Size: 2
Device: cuda:1
```

### **Single GPU Fallback**
```
⚠️ NCCL multi-GPU failed, falling back to single GPU training on cuda:0.
🚀 Starting single GPU training on cuda:0
✅ Single GPU Training: cuda:0
Device: cuda:0
```

## 🛠️ **Configuration**

### **Multi-GPU Configuration**
The launcher automatically sets:
```python
config = {
    "distributed": True,
    "rank": 0,  # or 1
    "world_size": 2,
    "local_rank": 0,  # or 1
    "device": "cuda:0",  # or "cuda:1"
    "per_device_train_batch_size": 4,  # Per GPU
    "per_device_eval_batch_size": 8,  # Per GPU
}
```

### **Single GPU Configuration**
```python
config = {
    "distributed": False,
    "rank": 0,
    "world_size": 1,
    "local_rank": 0,
    "device": "cuda:0",
    "per_device_train_batch_size": 8,  # Doubled for single GPU
    "per_device_eval_batch_size": 16,  # Doubled for single GPU
}
```

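The doubled single-GPU batch sizes keep the effective batch size identical across the two modes; a quick arithmetic check (gradient accumulation assumed to be 1 here):

```python
def effective_batch_size(per_device: int, world_size: int, grad_accum: int = 1) -> int:
    # Samples consumed per optimizer step across all processes
    return per_device * world_size * grad_accum

# Multi-GPU: 4 per device × 2 GPUs == Single GPU: 8 per device × 1 GPU
assert effective_batch_size(4, world_size=2) == effective_batch_size(8, world_size=1) == 8
```
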
## 🔍 **Troubleshooting**

### **Common Issues**

1. **NCCL Initialization Failed**
   ```
   ❌ Distributed training initialization failed: NCCL Error
   ⚠️ NCCL multi-GPU failed, falling back to single GPU training on cuda:0.
   ```
   **Solution**: This is expected behavior. The launcher will automatically fall back to single GPU.

2. **CUDA Out of Memory**
   ```
   ❌ CUDA out of memory
   ```
   **Solution**: Reduce `per_device_train_batch_size` in your config.

3. **Device Mismatch**
   ```
   ❌ Expected all tensors to be on the same device
   ```
   **Solution**: This should not happen with the new launcher. If it does, check that evaluation is using a fresh model reload.

### **Debug Mode**
Set environment variables for debugging:
```bash
export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
torchrun --nproc_per_node=2 train.py --config runs/humigence/config.snapshot.json
```

## 📈 **Performance**

### **Multi-GPU Benefits**
- **~2× Training Speed**: Close to a 2× speedup on this two-GPU setup
- **Larger Batch Sizes**: Can use larger effective batch sizes
- **Better Convergence**: Larger effective batches can improve training stability
- **Memory Efficiency**: Distributes memory across GPUs

### **Single GPU Fallback**
- **Reliable**: Always works if multi-GPU fails
- **Simpler**: Easier to debug issues
- **Compatible**: Works with any setup

## 🎯 **Best Practices**

1. **Always use the launcher**: Don't run training directly
2. **Check GPU availability**: Ensure both GPUs are visible
3. **Monitor memory usage**: Watch for OOM errors
4. **Use appropriate batch sizes**: Start small and increase
5. **Check logs**: Look for NCCL warnings or errors

## 🚨 **Important Notes**

- **Evaluation always uses cuda:0**: Fresh model reload ensures device consistency
- **Training uses local rank devices**: Each process uses its assigned GPU
- **No tensor mixing**: Tensors never cross between cuda:0 and cuda:1
- **Automatic fallback**: If multi-GPU fails, single GPU training continues
- **Process synchronization**: All processes are properly synchronized

## 🎉 **Summary**

The new training launcher provides:
- **Robust multi-GPU training** with NCCL
- **Graceful fallback** to single GPU
- **Device consistency** throughout training and evaluation
- **Professional logging** and error handling
- **Fool-proof operation** with automatic error recovery

No more `cuda:0` vs `cuda:1` mismatches, no deadlocks, no NCCL crashes without fallback! 🚀
README_LORA_TRAINING.md ADDED
@@ -0,0 +1,253 @@
# Humigence LoRA Training System

A robust, single-GPU LoRA fine-tuning solution that works exactly like the fixed script, but generalized to all models supported by Humigence.

## 🚀 Quick Start

### Via Humigence CLI (Recommended)
```bash
# Interactive wizard with auto-detection
humigence
# Select option 2: Single-GPU LoRA Training
# The wizard will auto-detect models, datasets, and create output directories

# Direct command (for advanced users)
python3 cli/train_lora_cli.py --model meta-llama/Meta-Llama-3-8B-Instruct --output-dir ./out_lora
```

### Via Accelerate (Alternative)
```bash
accelerate launch --num_processes=1 cli/train_lora_single.py --model meta-llama/Meta-Llama-3-8B-Instruct --output-dir ./out_lora
```

## ✨ Key Features

- ✅ **Interactive Wizard** with auto-detection of models and datasets
- ✅ **Single GPU training** (safe default)
- ✅ **bf16 precision** where supported
- ✅ **Proper gradient flow** (no loss=None errors)
- ✅ **PEFT/LoRA integration** with correct target modules
- ✅ **Gradient checkpointing** enabled
- ✅ **Support for multiple models** (LLaMA, Mistral, Phi-2, etc.)
- ✅ **Comprehensive error handling** and validation
- ✅ **Auto-generated output directories** with meaningful names
- ✅ **LoRA configuration presets** for different use cases
- ✅ **Rich progress tracking** and logging

## 🧠 Supported Models

| Model Family | Example | Target Modules |
|--------------|---------|----------------|
| **LLaMA** | `meta-llama/Meta-Llama-3-8B-Instruct` | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Mistral** | `mistralai/Mistral-7B-Instruct-v0.1` | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Phi** | `microsoft/Phi-2` | `q_proj`, `k_proj`, `v_proj`, `dense` |
| **TinyLlama** | `TinyLlama/TinyLlama-1.1B-Chat-v1.0` | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Qwen** | `Qwen/Qwen1.5-0.5B` | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |

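The table above corresponds to the `get_model_target_modules()` helper mentioned under Contributing; a minimal sketch of how such a lookup might work (the actual implementation may differ):

```python
def get_model_target_modules(model_name: str) -> list:
    """Pick LoRA target modules from the model name (illustrative sketch)."""
    name = model_name.lower()
    if "phi" in name:
        return ["q_proj", "k_proj", "v_proj", "dense"]
    if "gpt" in name:
        return ["c_attn", "c_proj"]
    # LLaMA, Mistral, TinyLlama, Qwen: all attention and MLP projections
    return ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
```
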
## 🧙‍♂️ Interactive Wizard Features

The LoRA training wizard provides:

### 🔍 Auto-Detection
- **Models**: Scans the Hugging Face cache and provides popular model options
- **Datasets**: Detects local datasets and offers popular Hugging Face datasets
- **System Info**: Shows GPU memory, CUDA availability, and system specs

### ⚙️ Configuration Presets
- **Efficient (r=8, α=16)**: Fast training, lower memory usage
- **Balanced (r=16, α=32)**: Good balance of performance and speed
- **High Quality (r=32, α=64)**: Better performance, more parameters
- **Custom**: Set your own LoRA parameters

### 📁 Smart Output Management
- **Auto-generated directories**: `out_lora_{model}_{dataset}_{timestamp}`
- **Configuration saving**: Saves all settings to `lora_config.json`
- **Reproduction scripts**: Generates `reproduce.sh` for easy re-runs

## 📋 Usage Examples

### Interactive Wizard (Recommended)
```bash
humigence
# Select option 2: Single-GPU LoRA Training
# Follow the interactive prompts
```

### Direct Command Line
```bash
python3 cli/train_lora_single.py \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --output-dir ./out_lora \
  --max-steps 1000 \
  --batch-size 4
```

### Custom LoRA Settings
```bash
python3 cli/train_lora_single.py \
  --model mistralai/Mistral-7B-Instruct-v0.1 \
  --output-dir ./out_mistral \
  --max-steps 2000 \
  --batch-size 2 \
  --lora-r 32 \
  --lora-alpha 64 \
  --lora-dropout 0.1
```

### Small Model Testing
```bash
python3 cli/train_lora_single.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output-dir ./out_tinyllama \
  --max-steps 100 \
  --batch-size 8 \
  --block-size 256
```

## 🔧 Configuration Options

### Required Arguments
- `--model`: Model name or path (e.g., `meta-llama/Meta-Llama-3-8B-Instruct`)
- `--output-dir`: Output directory for the trained model

### Dataset Options
- `--dataset`: Dataset name (default: `wikitext`)
- `--dataset-config`: Dataset configuration (default: `wikitext-2-raw-v1`)
- `--block-size`: Block size for text grouping (default: `512`)

### Training Options
- `--max-steps`: Maximum training steps (default: `1000`)
- `--batch-size`: Per-device batch size (default: `4`)
- `--grad-accum`: Gradient accumulation steps (default: `4`)
- `--learning-rate`: Learning rate (default: `2e-4`)

### LoRA Options
- `--lora-r`: LoRA rank (default: `16`)
- `--lora-alpha`: LoRA alpha (default: `32`)
- `--lora-dropout`: LoRA dropout (default: `0.05`)

### Other Options
- `--warmup-steps`: Number of warmup steps (default: `100`)
- `--logging-steps`: Logging frequency (default: `10`)
- `--save-steps`: Save frequency (default: `200`)
- `--eval-steps`: Evaluation frequency (default: `200`)
- `--save-total-limit`: Maximum checkpoints to keep (default: `2`)

## 🧪 Testing

Run the test suite to validate the implementation:

```bash
python3 test_lora_single.py
```

This will test:
- Model architecture support
- CLI interface
- Model and dataset validation
- A short training run

## 🔍 Validation

After training, validate your adapters:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model
tokenizer = AutoTokenizer.from_pretrained('./out_lora')
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')

# Load the LoRA adapters
model = PeftModel.from_pretrained(model, './out_lora')

print('✅ Adapters loaded successfully!')
```

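As a further smoke test, the adapted model can be exercised with a short generation (continuing from the snippet above; the prompt and decoding settings are illustrative):

```python
import torch

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
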
## 🐛 Troubleshooting

### Common Issues

1. **"Loss does not require gradients" warning**
   - This is handled automatically by the custom `LoRATrainer` class
   - The script will force gradient computation if needed

2. **CUDA out of memory**
   - Reduce `--batch-size` (try 1 or 2)
   - Reduce `--block-size` (try 256 or 128)
   - Use gradient accumulation: increase `--grad-accum`

3. **Model not found**
   - Ensure the model name is correct
   - Check that you have internet access for downloading
   - Verify the model exists on the Hugging Face Hub

4. **Dataset loading issues**
   - The script uses `wikitext-2-raw-v1` by default
   - Ensure you have the `datasets` library installed
   - Check your internet connection for the dataset download

### Memory Optimization

For large models, use these settings:
```bash
python3 cli/train_lora_single.py \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --output-dir ./out_lora \
  --batch-size 1 \
  --grad-accum 8 \
  --block-size 256
```

## 📊 Output Structure

After training, you'll find:

```
out_lora/
├── adapter_config.json           # LoRA configuration
├── adapter_model.safetensors     # LoRA weights
├── tokenizer.json                # Tokenizer
├── tokenizer_config.json         # Tokenizer config
├── special_tokens_map.json       # Special tokens
├── training_summary.json         # Training metrics
└── checkpoint-*/                 # Training checkpoints
    ├── adapter_config.json
    ├── adapter_model.safetensors
    ├── optimizer.pt
    ├── scheduler.pt
    └── trainer_state.json
```

## 🔬 Technical Details

### Key Fixes Applied

1. **Custom LoRATrainer**: Ensures proper gradient flow
2. **enable_input_require_grads()**: Critical for PEFT + gradient checkpointing
3. **Proper data collation**: Uses `DataCollatorForLanguageModeling`
4. **Model-specific target modules**: Automatically detects correct LoRA targets
5. **Non-reentrant checkpointing**: Avoids gradient issues (see the sketch below)

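A minimal sketch of how fixes 2 and 5 combine when preparing a model for LoRA training (standard Transformers/PEFT calls; the surrounding setup is illustrative):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Fix 2: make embedding outputs require grads so checkpointed segments
# still receive gradients when the base weights are frozen by PEFT.
model.enable_input_require_grads()

# Fix 5: non-reentrant checkpointing avoids the "loss does not
# require gradients" failure mode.
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)
```
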
### Architecture Support

The system automatically detects model architectures and applies the correct LoRA target modules:

- **LLaMA/Mistral**: All attention and MLP layers
- **Phi**: Attention layers + dense layer
- **GPT**: `c_attn` and `c_proj` layers
- **Default**: Common transformer modules

## 🤝 Contributing

To add support for new model architectures:

1. Add the model name pattern to `get_model_target_modules()`
2. Specify the correct target modules
3. Test with a short training run
4. Update this documentation

## 📝 License

This code is part of the Humigence project and follows the same license terms.
cli.py ADDED
@@ -0,0 +1,257 @@
#!/usr/bin/env python3
"""
Humigence CLI - Main entry point for all Humigence commands
"""

import typer
from typing import Optional
from rich.console import Console
from rich.panel import Panel
from pathlib import Path
import sys

# Add the current directory to the path for imports
sys.path.insert(0, str(Path(__file__).parent))

from training.train_wikitext import run_training

app = typer.Typer(
    name="humigence",
    help="Your AI. Your pipeline. Zero code.",
    add_completion=False,
    rich_markup_mode="rich"
)

console = Console()


@app.command()
def train_wikitext(
    model: str = typer.Option(
        ...,
        "--model",
        "-m",
        help="Path or Hugging Face model name (e.g., 'gpt2' or 'microsoft/DialoGPT-small')"
    ),
    output_dir: str = typer.Option(
        ...,
        "--output-dir",
        "-o",
        help="Directory where checkpoints will be saved"
    ),
    epochs: int = typer.Option(
        1,
        "--epochs",
        "-e",
        help="Number of training epochs"
    ),
    batch_size: int = typer.Option(
        2,
        "--batch-size",
        "-b",
        help="Per-device batch size"
    ),
    learning_rate: float = typer.Option(
        5e-5,
        "--learning-rate",
        "-lr",
        help="Learning rate for training"
    ),
    dataset: str = typer.Option(
        "wikitext",
        "--dataset",
        help="Dataset name (default: wikitext)"
    ),
    dataset_config: str = typer.Option(
        "wikitext-2-raw-v1",
        "--dataset-config",
        help="Dataset configuration (default: wikitext-2-raw-v1)"
    ),
    max_steps: Optional[int] = typer.Option(
        None,
        "--max-steps",
        help="Maximum training steps (overrides epochs if set)"
    ),
    block_size: int = typer.Option(
        1024,
        "--block-size",
        help="Maximum sequence length"
    ),
    grad_accum: int = typer.Option(
        4,
        "--grad-accum",
        help="Gradient accumulation steps"
    ),
    warmup_steps: int = typer.Option(
        100,
        "--warmup-steps",
        help="Number of warmup steps"
    ),
    logging_steps: int = typer.Option(
        10,
        "--logging-steps",
        help="Logging frequency in steps"
    ),
    save_steps: int = typer.Option(
        200,
        "--save-steps",
        help="Model saving frequency in steps"
    ),
    eval_steps: int = typer.Option(
        200,
        "--eval-steps",
        help="Evaluation frequency in steps"
    ),
    lora_r: int = typer.Option(
        8,
        "--lora-r",
        help="LoRA rank"
    ),
    lora_alpha: int = typer.Option(
        32,
        "--lora-alpha",
        help="LoRA alpha parameter"
    ),
    lora_dropout: float = typer.Option(
        0.05,
        "--lora-dropout",
        help="LoRA dropout rate"
    ),
):
    """
    Train a model on the Wikitext dataset using LoRA fine-tuning.

    This command fine-tunes a language model on the Wikitext dataset using LoRA (Low-Rank Adaptation)
    for efficient parameter updates. The training runs on a single GPU by default.

    Examples:
        # Basic training with GPT-2
        humigence train-wikitext --model gpt2 --output-dir ./out

        # Training with custom parameters
        humigence train-wikitext --model microsoft/DialoGPT-small --output-dir ./out --epochs 2 --batch-size 4 --learning-rate 1e-4

        # Training with specific steps instead of epochs
        humigence train-wikitext --model gpt2 --output-dir ./out --max-steps 1000 --batch-size 2
    """

    # Display training configuration
    config_panel = Panel(
        f"""[bold blue]Training Configuration[/bold blue]

[cyan]Model:[/cyan] {model}
[cyan]Output Directory:[/cyan] {output_dir}
[cyan]Epochs:[/cyan] {epochs}
[cyan]Batch Size:[/cyan] {batch_size}
[cyan]Learning Rate:[/cyan] {learning_rate}
[cyan]Dataset:[/cyan] {dataset}/{dataset_config}
[cyan]Max Steps:[/cyan] {max_steps if max_steps else 'Auto-calculated'}
[cyan]Block Size:[/cyan] {block_size}
[cyan]Gradient Accumulation:[/cyan] {grad_accum}
[cyan]LoRA Rank:[/cyan] {lora_r}
[cyan]LoRA Alpha:[/cyan] {lora_alpha}
[cyan]LoRA Dropout:[/cyan] {lora_dropout}""",
        title="🚀 Starting Wikitext Training",
        border_style="green"
    )

    console.print(config_panel)

    # Create output directory if it doesn't exist
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # Run training
    try:
        result = run_training(
            model=model,
            output_dir=output_dir,
            epochs=epochs,
            batch_size=batch_size,
            learning_rate=learning_rate,
            dataset=dataset,
            dataset_config=dataset_config,
            max_steps=max_steps,
            block_size=block_size,
            grad_accum=grad_accum,
            warmup_steps=warmup_steps,
            logging_steps=logging_steps,
            save_steps=save_steps,
            eval_steps=eval_steps,
            lora_r=lora_r,
            lora_alpha=lora_alpha,
            lora_dropout=lora_dropout,
        )

        if result["status"] == "success":
            console.print(Panel(
                f"""[bold green]✅ Training Completed Successfully![/bold green]

[cyan]Output Directory:[/cyan] {result['output_dir']}
[cyan]Model Path:[/cyan] {result['model_path']}

[bold blue]Final Metrics:[/bold blue]
[cyan]Train Loss:[/cyan] {result['metrics'].get('train_loss', 'N/A')}
[cyan]Eval Loss:[/cyan] {result['metrics'].get('eval_loss', 'N/A')}
[cyan]Total Steps:[/cyan] {result['metrics'].get('total_steps', 'N/A')}
[cyan]Epochs:[/cyan] {result['metrics'].get('epochs', 'N/A')}
[cyan]Train Runtime:[/cyan] {result['metrics'].get('train_runtime', 'N/A')}s
[cyan]Samples/Second:[/cyan] {result['metrics'].get('train_samples_per_second', 'N/A')}""",
                title="🎉 Training Results",
                border_style="green"
            ))
            raise typer.Exit(0)
        else:
            console.print(Panel(
                f"""[bold red]❌ Training Failed[/bold red]

[red]Error:[/red] {result.get('error', 'Unknown error')}
[cyan]Output Directory:[/cyan] {result.get('output_dir', 'N/A')}""",
                title="💥 Training Error",
                border_style="red"
            ))
            raise typer.Exit(1)

    except typer.Exit:
        # typer.Exit inherits from RuntimeError via click, so re-raise it here
        # before the generic handler below turns a clean exit into an error.
        raise
    except Exception as e:
        console.print(Panel(
            f"""[bold red]❌ Unexpected Error[/bold red]

[red]Error:[/red] {str(e)}""",
            title="💥 Unexpected Error",
            border_style="red"
        ))
        raise typer.Exit(1)


@app.command()
def version():
    """Show version information."""
    console.print("[bold blue]Humigence v1.0.0[/bold blue]")
    console.print("[dim]Your AI. Your pipeline. Zero code.[/dim]")


@app.callback()
def main(
    version: bool = typer.Option(
        False,
        "--version",
        "-v",
        help="Show version and exit"
    )
):
    """
    Humigence - Your AI. Your pipeline. Zero code.

    A complete MLOps suite built for makers, teams, and enterprises.
    """
    if version:
        console.print("[bold blue]Humigence v1.0.0[/bold blue]")
        console.print("[dim]Your AI. Your pipeline. Zero code.[/dim]")
        raise typer.Exit(0)


if __name__ == "__main__":
    app()
compatibility_utils.py ADDED
@@ -0,0 +1,58 @@
# compatibility_utils.py
import torch
import datetime

def get_pytorch_version():
    """Detect PyTorch version for compatibility"""
    version = torch.__version__
    major, minor = map(int, version.split('.')[:2])
    return major, minor

def setup_timeout():
    """Create timeout compatible with PyTorch version"""
    major, minor = get_pytorch_version()

    # Tuple comparison, so that e.g. 2.0 correctly counts as newer than 1.10
    if (major, minor) >= (1, 10):
        # Use torch.distributed.timedelta if this build exposes it
        if hasattr(torch.distributed, 'timedelta'):
            return torch.distributed.timedelta(seconds=1800)
        else:
            # Fallback to datetime
            return datetime.timedelta(seconds=1800)
    else:
        # Use datetime for older versions
        return datetime.timedelta(seconds=1800)

def check_environment():
    """Check PyTorch environment and compatibility"""
    print("=== Environment Check ===")
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"Number of GPUs: {torch.cuda.device_count()}")

        if torch.cuda.device_count() > 0:
            for i in range(torch.cuda.device_count()):
                print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

    # Check distributed module
    if hasattr(torch.distributed, 'timedelta'):
        print("✓ torch.distributed.timedelta available")
    else:
        print("✗ torch.distributed.timedelta not available - using datetime")

    # Check critical attributes
    critical_attrs = ['init_process_group', 'is_initialized', 'destroy_process_group']
    for attr in critical_attrs:
        if hasattr(torch.distributed, attr):
            print(f"✓ torch.distributed.{attr} available")
        else:
            print(f"✗ torch.distributed.{attr} missing!")

    # Check timeout compatibility
    timeout = setup_timeout()
    print(f"✓ Timeout setup: {type(timeout).__name__}")

if __name__ == "__main__":
    check_environment()
config_migration.py ADDED
@@ -0,0 +1,260 @@
"""
Config Migration and Validation Utilities

This module provides robust config saving that ensures compatibility with the live TrainConfig schema.
It handles migration from old config formats, validates against the current schema, and provides
clear feedback about what changes were made.
"""

import json
import logging
from pathlib import Path
from typing import Dict, Any, List, Tuple, Optional
from config_schema import TrainConfig

# Set up logging
logger = logging.getLogger(__name__)

# Legacy key mappings (old_key -> new_key)
LEGACY_KEY_MAPPINGS = {
    "base_model": "model_name",
    "model": "model_name",
    "model_id": "model_name",
    "model_path": "model_name",
    "split_ratios": "train_val_test_split",
    "random_seed": "split_seed",
    "max_seq_len": "max_seq_length",
    "torch_dtype": "dtype",  # Handle deprecated torch_dtype
}

# Safe defaults for required fields that might be missing
SAFE_DEFAULTS = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 0.0002,
    "num_train_epochs": 1,
    "eval_batch_size": 8,
    "logging_steps": 10,
    "save_steps": 500,
    "eval_steps": 100,
    "max_seq_length": 1024,
    "fp16": True,
    "bf16": False,
    "multi_gpu": False,
    "eval_single_gpu": True,
    "eval_gpu_index": 0,
    "num_workers": 4,
    "pin_memory": True,
    "split_seed": 42,
    "train_val_test_split": [0.8, 0.1, 0.1],
    "data_schema": "instruction_output",
    "training_recipe": "LoRA (FP16)",
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "output_dir": "runs/humigence",
    "selected_gpus": [0],  # Default to single GPU
}

def migrate_config_dict(config_dict: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str], List[str]]:
    """
    Migrate a config dictionary to match the current TrainConfig schema.

    Args:
        config_dict: Raw config dictionary (potentially with legacy keys)

    Returns:
        Tuple of (migrated_config, dropped_keys, applied_defaults)
    """
    # Get the current schema fields (Pydantic v1/v2 compatibility)
    if hasattr(TrainConfig, 'model_fields'):
        # Pydantic v2
        schema_fields = set(TrainConfig.model_fields.keys())
    else:
        # Pydantic v1
        schema_fields = set(TrainConfig.__fields__.keys())

    migrated = {}
    dropped_keys = []
    applied_defaults = []

    # Step 1: Apply legacy key mappings, but only when the target key exists in the
    # schema; otherwise the renamed key would be rejected by extra="forbid" later.
    for old_key, new_key in LEGACY_KEY_MAPPINGS.items():
        if old_key in config_dict and new_key not in config_dict:
            if new_key in schema_fields:
                migrated[new_key] = config_dict[old_key]
                logger.info(f"Renamed '{old_key}' -> '{new_key}'")
            else:
                dropped_keys.append(old_key)
                logger.info(f"Dropped legacy key '{old_key}' ('{new_key}' not in schema)")
        elif old_key in config_dict and new_key in config_dict:
            logger.warning(f"Both '{old_key}' and '{new_key}' present, using '{new_key}'")

    # Step 2: Copy valid keys from original config
    for key, value in config_dict.items():
        if key in schema_fields:
            migrated[key] = value
        elif key not in LEGACY_KEY_MAPPINGS:
            dropped_keys.append(key)
            logger.info(f"Dropped unsupported key: '{key}'")

    # Step 3: Apply safe defaults for missing required fields
    if hasattr(TrainConfig, 'model_fields'):
        # Pydantic v2
        fields = TrainConfig.model_fields
    else:
        # Pydantic v1
        fields = TrainConfig.__fields__

    for field_name, field_info in fields.items():
        if field_name not in migrated:
            if field_name in SAFE_DEFAULTS:
                migrated[field_name] = SAFE_DEFAULTS[field_name]
                applied_defaults.append(field_name)
                logger.info(f"Applied default for '{field_name}': {SAFE_DEFAULTS[field_name]}")
            else:
                # This is a required field with no safe default - validation will catch this
                logger.warning(f"Missing required field '{field_name}' with no safe default")

    return migrated, dropped_keys, applied_defaults

def validate_and_save_config(
    config_dict: Dict[str, Any],
    output_path: str,
    context_info: Optional[Dict[str, Any]] = None
) -> TrainConfig:
    """
    Migrate, validate, and save a config dictionary to ensure it matches the current schema.

    Args:
        config_dict: Raw config dictionary to migrate and save
        output_path: Path where to save the validated config
        context_info: Optional runtime context to help fill missing values

    Returns:
        Validated TrainConfig instance

    Raises:
        ValueError: If the config cannot be migrated to a valid schema
    """
    logger.info("Starting config migration and validation...")

    # Step 1: Migrate the config
    migrated_config, dropped_keys, applied_defaults = migrate_config_dict(config_dict)

    # Step 2: Apply context info if provided
    if context_info:
        # Get schema fields for validation
        if hasattr(TrainConfig, 'model_fields'):
            schema_fields = set(TrainConfig.model_fields.keys())
        else:
            schema_fields = set(TrainConfig.__fields__.keys())

        for key, value in context_info.items():
            if key in schema_fields and key not in migrated_config:
                migrated_config[key] = value
                logger.info(f"Applied context value for '{key}': {value}")

    # Step 3: Validate against schema
    try:
        validated_config = TrainConfig(**migrated_config)
        logger.info("✅ Config validation successful")
    except Exception as e:
        logger.error(f"❌ Config validation failed: {e}")
        raise ValueError(f"Configuration validation failed after migration: {e}")

    # Step 4: Save to file
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, 'w') as f:
        json.dump(validated_config.dict(), f, indent=2)

    logger.info(f"✅ Config saved to {output_path}")

    # Step 5: Print summary
    print_config_migration_summary(dropped_keys, applied_defaults, output_path)

    return validated_config

def print_config_migration_summary(
    dropped_keys: List[str],
    applied_defaults: List[str],
    output_path: str
) -> None:
    """Print a summary of config migration changes."""
    print("\n" + "="*60)
    print("CONFIG MIGRATION SUMMARY")
    print("="*60)
    print(f"📁 Saved to: {output_path}")

    if dropped_keys:
        print(f"🗑️ Dropped keys ({len(dropped_keys)}): {', '.join(dropped_keys)}")
    else:
        print("✅ No keys dropped")

    if applied_defaults:
        print(f"⚙️ Applied defaults ({len(applied_defaults)}): {', '.join(applied_defaults)}")
    else:
        print("✅ No defaults applied")

    print("✅ Config is now compatible with current TrainConfig schema")
    print("="*60)

def save_config_snapshot(
    config_dict: Dict[str, Any],
    output_path: str = "runs/humigence/config.snapshot.json",
    context_info: Optional[Dict[str, Any]] = None
) -> TrainConfig:
    """
    Save a config snapshot with automatic migration and validation.

    This is the main function that should be used throughout the codebase
    to ensure all saved configs are compatible with the current schema.

    Args:
        config_dict: Raw config dictionary to save
        output_path: Path where to save the config (default: runs/humigence/config.snapshot.json)
        context_info: Optional runtime context to help fill missing values

    Returns:
        Validated TrainConfig instance
    """
    return validate_and_save_config(config_dict, output_path, context_info)

def load_and_validate_config(config_path: str) -> TrainConfig:
    """
    Load and validate a config file against the current schema.

    Args:
        config_path: Path to the config file

    Returns:
        Validated TrainConfig instance

    Raises:
        FileNotFoundError: If the config file doesn't exist
        ValueError: If the config cannot be validated
    """
    config_path = Path(config_path)
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")

    with open(config_path, 'r') as f:
        config_dict = json.load(f)

    # Migrate and validate
    migrated_config, dropped_keys, applied_defaults = migrate_config_dict(config_dict)

    try:
        validated_config = TrainConfig(**migrated_config)
        return validated_config
    except Exception as e:
        raise ValueError(f"Config validation failed: {e}")

# Backward compatibility function
def save_config(config: TrainConfig, output_path: str) -> None:
    """Legacy save_config function for backward compatibility."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, 'w') as f:
        json.dump(config.dict(), f, indent=2)

    logger.info(f"Config saved to {output_path}")
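
For reference, a minimal usage sketch of the migration path (the legacy dict below is illustrative):

```python
from config_migration import migrate_config_dict

legacy = {
    "base_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # legacy key -> model_name
    "dataset_path": "data/train.jsonl",
    "unsupported_flag": True,  # not in the schema -> dropped
}
migrated, dropped, defaulted = migrate_config_dict(legacy)
# migrated["model_name"] is set, "unsupported_flag" lands in dropped,
# and missing fields such as learning_rate are filled from SAFE_DEFAULTS.
```
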
config_schema.py ADDED
@@ -0,0 +1,118 @@
"""
Pydantic configuration schema for Humigence training pipeline
"""
from pydantic import BaseModel, Field, validator
from typing import List, Optional, Union
from pathlib import Path

class TrainConfig(BaseModel):
    """Strict configuration schema for Humigence training"""

    # Model configuration
    model_name: str = Field(..., description="Hugging Face model name")
    training_recipe: str = Field(default="LoRA (FP16)", description="Training recipe")

    # Training hyperparameters
    learning_rate: float = Field(..., ge=1e-6, le=1.0, description="Learning rate")
    num_train_epochs: int = Field(..., ge=1, le=100, description="Number of training epochs")
    per_device_train_batch_size: int = Field(..., ge=1, le=32, description="Batch size per device")
    gradient_accumulation_steps: int = Field(..., ge=1, le=32, description="Gradient accumulation steps")
    eval_batch_size: int = Field(..., ge=1, le=32, description="Evaluation batch size")

    # Precision settings
    fp16: bool = Field(default=True, description="Use FP16 precision")
    bf16: bool = Field(default=False, description="Use BF16 precision")

    # Multi-GPU settings
    multi_gpu: bool = Field(default=False, description="Enable multi-GPU training")
    selected_gpus: List[int] = Field(default=[0], description="Selected GPU indices")

    # Dataset configuration
    dataset_path: str = Field(..., description="Path to dataset file")
    data_schema: str = Field(default="instruction_output", description="Dataset schema")
    train_val_test_split: List[float] = Field(default=[0.8, 0.1, 0.1], description="Dataset split ratios")
    split_seed: int = Field(default=42, description="Random seed for dataset split")
    max_seq_length: int = Field(default=1024, ge=64, le=4096, description="Maximum sequence length")

    # LoRA configuration
    lora_r: int = Field(default=16, ge=1, le=256, description="LoRA rank")
    lora_alpha: int = Field(default=32, ge=1, le=512, description="LoRA alpha")
    lora_dropout: float = Field(default=0.05, ge=0.0, le=0.5, description="LoRA dropout")

    # Logging and evaluation
    logging_steps: int = Field(default=10, ge=1, le=1000, description="Logging frequency")
    eval_steps: int = Field(default=100, ge=1, le=10000, description="Evaluation frequency")
    save_steps: int = Field(default=500, ge=1, le=10000, description="Save frequency")

    # Output configuration
    output_dir: str = Field(default="runs/humigence", description="Output directory")
    eval_single_gpu: bool = Field(default=True, description="Evaluate on single GPU")
    eval_gpu_index: int = Field(default=0, description="GPU index for evaluation")

    # System configuration
    num_workers: int = Field(default=4, ge=0, le=16, description="Number of data loader workers")
    pin_memory: bool = Field(default=True, description="Pin memory for data loading")

    @validator('train_val_test_split')
    def validate_split(cls, v):
        if len(v) != 3:
            raise ValueError("train_val_test_split must have exactly 3 values")
        if abs(sum(v) - 1.0) > 1e-6:
            raise ValueError("train_val_test_split values must sum to 1.0")
        return v

    @validator('bf16')
    def validate_precision(cls, v, values):
        # bf16 is declared after fp16, so fp16 is already present in `values` here
        if v and values.get('fp16'):
            raise ValueError("Cannot use both fp16 and bf16 simultaneously")
        return v

    @validator('dataset_path')
    def validate_dataset_path(cls, v):
        path = Path(v)
        if not path.exists():
            raise ValueError(f"Dataset file not found: {v}")
        if path.suffix != '.jsonl':
            raise ValueError(f"Dataset must be a .jsonl file: {v}")
        return str(path)

    @validator('model_name')
    def validate_model_name(cls, v):
        # Basic validation for Hugging Face model names
        if not v or len(v.strip()) == 0:
            raise ValueError("Model name cannot be empty")
        return v.strip()

    class Config:
        """Pydantic configuration"""
        validate_assignment = True
        extra = "forbid"  # Reject extra fields
        use_enum_values = True

def load_config(config_path: str) -> TrainConfig:
    """Load and validate configuration from a JSON file"""
    import json

    with open(config_path, 'r') as f:
        config_dict = json.load(f)

    try:
        return TrainConfig(**config_dict)
    except Exception as e:
        raise ValueError(f"Configuration validation failed: {e}")

def save_config(config: TrainConfig, output_path: str) -> None:
    """Save configuration to a JSON file (legacy function)"""
    import json
    from pathlib import Path

    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, 'w') as f:
        json.dump(config.dict(), f, indent=2)

def save_config_snapshot(config_dict: dict, output_path: str = "runs/humigence/config.snapshot.json") -> TrainConfig:
    """Save config with automatic migration and validation"""
    from config_migration import save_config_snapshot as _save_config_snapshot
    return _save_config_snapshot(config_dict, output_path)
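
For reference, a minimal usage sketch (note that `dataset_path` must point at an existing `.jsonl` file for the validator to pass; the values below are illustrative):

```python
from config_schema import TrainConfig

cfg = TrainConfig(
    model_name="Qwen/Qwen1.5-0.5B",
    learning_rate=2e-4,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    eval_batch_size=8,
    dataset_path="data/train.jsonl",  # must exist as a .jsonl file
)
print(cfg.output_dir)  # "runs/humigence" (default)
```
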
distributed_utils.py ADDED
@@ -0,0 +1,84 @@
# distributed_utils.py
import os
import torch
import torch.distributed as dist
import logging
from typing import Tuple
from compatibility_utils import setup_timeout

def setup_distributed() -> Tuple[bool, int, int, int, torch.device]:
    """
    First-principles DDP setup with a single source of truth for device mapping.
    Returns: (is_ddp, rank, local_rank, world_size, device)
    """
    ddp = "RANK" in os.environ and "WORLD_SIZE" in os.environ

    if ddp:
        # Initialize process group with robust timeout
        if not dist.is_initialized():
            # Use compatibility-aware timeout
            timeout = setup_timeout()
            dist.init_process_group(
                backend="nccl",
                timeout=timeout
            )

        local_rank = int(os.environ["LOCAL_RANK"])
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

        # Critical: Set device BEFORE any CUDA operations
        torch.cuda.set_device(local_rank)
        device = torch.device(f"cuda:{local_rank}")

        # Verify device mapping
        assert torch.cuda.current_device() == local_rank, \
            f"Device mapping error: current={torch.cuda.current_device()}, local_rank={local_rank}"

    else:
        local_rank, rank, world_size = 0, 0, 1
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    return ddp, rank, local_rank, world_size, device

def setup_environment():
    """Set environment variables once at process start"""
    os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")  # Modern replacement
    os.environ.setdefault("NCCL_IB_DISABLE", "1")  # Disable InfiniBand on a single node
    os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")  # Prevent tokenizer conflicts

    # Remove deprecated variables
    if "NCCL_ASYNC_ERROR_HANDLING" in os.environ:
        del os.environ["NCCL_ASYNC_ERROR_HANDLING"]

    # Do NOT set NCCL_P2P_DISABLE - allow peer-to-peer on a single node

def cleanup_distributed():
    """Clean shutdown of the process group"""
    if dist.is_available() and dist.is_initialized():
        try:
            dist.barrier()
            dist.destroy_process_group()
        except Exception as e:
            logging.warning(f"Cleanup warning: {e}")

class RankZeroOnly:
    """Context manager for rank-0 only execution"""
    def __init__(self, is_main: bool):
        self.is_main = is_main
        self.original_level = None

    def __enter__(self):
        if not self.is_main:
            # Suppress logging for non-main ranks
            self.original_level = logging.getLogger().getEffectiveLevel()
            logging.getLogger().setLevel(logging.WARNING)
        return self

    def __exit__(self, *args):
        if not self.is_main and self.original_level is not None:
            logging.getLogger().setLevel(self.original_level)

    def print(self, *args, **kwargs):
        if self.is_main:
            print(*args, **kwargs)
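
For reference, the typical call order for these helpers inside a training script (a sketch; the commented train step is a hypothetical placeholder):

```python
from distributed_utils import setup_environment, setup_distributed, cleanup_distributed, RankZeroOnly

setup_environment()  # must run before any CUDA/NCCL work
ddp, rank, local_rank, world_size, device = setup_distributed()

try:
    with RankZeroOnly(is_main=(rank == 0)) as log:
        log.print(f"Training on {world_size} process(es), device {device}")
    # train_one_epoch(model, loader, device)  # hypothetical placeholder
finally:
    cleanup_distributed()
```
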
errors.py ADDED
@@ -0,0 +1,118 @@
"""
Custom error handling for the Humigence training pipeline
"""
from typing import Optional

class HumigenceError(Exception):
    """Base exception for Humigence training errors"""
    def __init__(self, message: str, suggested_fix: Optional[str] = None):
        super().__init__(message)
        self.suggested_fix = suggested_fix

class ConfigurationError(HumigenceError):
    """Configuration validation errors"""
    pass

class DatasetError(HumigenceError):
    """Dataset loading and processing errors"""
    pass

class ModelError(HumigenceError):
    """Model loading and setup errors"""
    pass

class TrainingError(HumigenceError):
    """Training process errors"""
    pass

class EvaluationError(HumigenceError):
    """Evaluation process errors"""
    pass

class DistributedError(HumigenceError):
    """Distributed training errors"""
    pass

def handle_cuda_error(error: Exception) -> HumigenceError:
    """Convert CUDA errors to HumigenceError with suggested fixes"""
    error_msg = str(error)

    if "out of memory" in error_msg.lower():
        return TrainingError(
            "CUDA out of memory",
            "Reduce batch size or use gradient checkpointing"
        )
    elif "illegal memory access" in error_msg.lower():
        return DistributedError(
            "NCCL illegal memory access",
            "Reduce batch size or retry single-GPU mode"
        )
    elif "device" in error_msg.lower() and "mismatch" in error_msg.lower():
        return TrainingError(
            "Device mismatch detected",
            "Ensure all tensors are on the same device"
        )
    else:
        return TrainingError(f"CUDA error: {error_msg}")

def handle_distributed_error(error: Exception) -> HumigenceError:
    """Convert distributed training errors to HumigenceError"""
    error_msg = str(error)

    if "nccl" in error_msg.lower():
        return DistributedError(
            "NCCL communication error",
            "Check network configuration or retry single-GPU mode"
        )
    elif "process group" in error_msg.lower():
        return DistributedError(
            "Process group initialization failed",
            "Check distributed setup or retry single-GPU mode"
        )
    else:
        return DistributedError(f"Distributed training error: {error_msg}")

def handle_model_error(error: Exception) -> HumigenceError:
    """Convert model-related errors to HumigenceError"""
    error_msg = str(error)

    if "out of memory" in error_msg.lower():
        return ModelError(
            "Model loading out of memory",
            "Use a smaller model or enable model sharding"
        )
    elif "not found" in error_msg.lower():
        return ModelError(
            "Model not found",
            "Check the model name or download the model first"
        )
    else:
        return ModelError(f"Model error: {error_msg}")

def handle_dataset_error(error: Exception) -> HumigenceError:
    """Convert dataset-related errors to HumigenceError"""
    error_msg = str(error)

    if "not found" in error_msg.lower():
        return DatasetError(
            "Dataset file not found",
            "Check the dataset path and ensure the file exists"
        )
    elif "column" in error_msg.lower() and "not in" in error_msg.lower():
        return DatasetError(
            "Dataset column mismatch",
            "Check the dataset schema and column names"
        )
    else:
        return DatasetError(f"Dataset error: {error_msg}")

def clean_error_message(error: HumigenceError) -> str:
    """Create a clean error message with the suggested fix"""
    message = f"❌ {error.__class__.__name__}: {error}"

    if error.suggested_fix:
        message += f"\n   Suggested fix: {error.suggested_fix}"

    return message
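
For reference, a minimal sketch of wrapping a training step with these handlers (`train_step()` is a hypothetical stand-in for a real failure site):

```python
from errors import handle_cuda_error, clean_error_message

def train_step():
    raise RuntimeError("CUDA out of memory")  # stand-in for a real failure

try:
    train_step()
except RuntimeError as e:
    err = handle_cuda_error(e)  # -> TrainingError with a suggested fix
    print(clean_error_message(err))
```
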
humigence ADDED
@@ -0,0 +1,18 @@
#!/usr/bin/env python3
"""
Humigence CLI Entry Point
"""
import sys
from pathlib import Path

# Add the humigence directory to the Python path
humigence_dir = Path(__file__).parent
sys.path.insert(0, str(humigence_dir))

# Import and run the main CLI
from cli.main import main

if __name__ == "__main__":
    main()
main_cli.py ADDED
@@ -0,0 +1,1175 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Humigence CLI - Main entry point for all Humigence commands
4
+ """
5
+
6
+ import typer
7
+ from typing import Optional, Dict, Any
8
+ from rich.console import Console
9
+ from rich.panel import Panel
10
+ from rich.table import Table
11
+ from pathlib import Path
12
+ import sys
13
+ import os
14
+ from datetime import datetime
15
+
16
+ # Add the current directory to the path for imports
17
+ sys.path.insert(0, str(Path(__file__).parent))
18
+
19
+ from training.train_wikitext import run_training, run_training_from_config
20
+ from training.autodetect import detect_family, suggested_lora_targets
21
+ from validation.matrix import (
22
+ get_gpu_info, precision_supported, estimate_model_params,
23
+ estimate_memory_bytes, tokenizer_ok, PRECISIONS,
24
+ )
25
+ from validation.dryrun import dry_run
26
+ from validation.fallback import FallbackSimulator, ConfigCandidate
27
+ from config.schema import ValidationConfig, TrainingConfig, ConfigMetadata, save_config, validation_to_training_config
28
+
29
+ app = typer.Typer(
30
+ name="humigence",
31
+ help="Your AI. Your pipeline. Zero code.",
32
+ add_completion=False,
33
+ rich_markup_mode="rich"
34
+ )
35
+
36
+ console = Console()
37
+
38
+
39
+ @app.command()
40
+ def train_wikitext(
41
+ model: str = typer.Option(
42
+ "",
43
+ "--model",
44
+ "-m",
45
+ help="Path or Hugging Face model name (e.g., 'gpt2' or 'microsoft/DialoGPT-small')"
46
+ ),
47
+ output_dir: str = typer.Option(
48
+ ...,
49
+ "--output-dir",
50
+ "-o",
51
+ help="Directory where checkpoints will be saved"
52
+ ),
53
+ epochs: int = typer.Option(
54
+ 1,
55
+ "--epochs",
56
+ "-e",
57
+ help="Number of training epochs"
58
+ ),
59
+ batch_size: int = typer.Option(
60
+ 2,
61
+ "--batch-size",
62
+ "-b",
63
+ help="Per-device batch size"
64
+ ),
65
+ learning_rate: float = typer.Option(
66
+ 5e-5,
67
+ "--learning-rate",
68
+ "-lr",
69
+ help="Learning rate for training"
70
+ ),
71
+ dataset: str = typer.Option(
72
+ "wikitext",
73
+ "--dataset",
74
+ help="Dataset name (default: wikitext)"
75
+ ),
76
+ dataset_config: str = typer.Option(
77
+ "wikitext-2-raw-v1",
78
+ "--dataset-config",
79
+ help="Dataset configuration (default: wikitext-2-raw-v1)"
80
+ ),
81
+ max_steps: Optional[int] = typer.Option(
82
+ None,
83
+ "--max-steps",
84
+ help="Maximum training steps (overrides epochs if set)"
85
+ ),
86
+ block_size: int = typer.Option(
87
+ 1024,
88
+ "--block-size",
89
+ help="Maximum sequence length"
90
+ ),
91
+ grad_accum: int = typer.Option(
92
+ 4,
93
+ "--grad-accum",
94
+ help="Gradient accumulation steps"
95
+ ),
96
+ warmup_steps: int = typer.Option(
97
+ 100,
98
+ "--warmup-steps",
99
+ help="Number of warmup steps"
100
+ ),
101
+ logging_steps: int = typer.Option(
102
+ 10,
103
+ "--logging-steps",
104
+ help="Logging frequency in steps"
105
+ ),
106
+ save_steps: int = typer.Option(
107
+ 200,
108
+ "--save-steps",
109
+ help="Model saving frequency in steps"
110
+ ),
111
+ eval_steps: int = typer.Option(
112
+ 200,
113
+ "--eval-steps",
114
+ help="Evaluation frequency in steps"
115
+ ),
116
+ lora_r: int = typer.Option(
117
+ 8,
118
+ "--lora-r",
119
+ help="LoRA rank"
120
+ ),
121
+ lora_alpha: int = typer.Option(
122
+ 32,
123
+ "--lora-alpha",
124
+ help="LoRA alpha parameter"
125
+ ),
126
+ lora_dropout: float = typer.Option(
127
+ 0.05,
128
+ "--lora-dropout",
129
+ help="LoRA dropout rate"
130
+ ),
131
+ config: Optional[str] = typer.Option(
132
+ None,
133
+ "--config",
134
+ help="Load configuration from YAML file"
135
+ ),
136
+ ):
137
+ """
138
+ Train a model on Wikitext dataset using LoRA fine-tuning.
139
+
140
+ This command fine-tunes a language model on the Wikitext dataset using LoRA (Low-Rank Adaptation)
141
+ for efficient parameter updates. The training runs on a single GPU by default.
142
+
143
+ Examples:
144
+ # Basic training with GPT-2
145
+ humigence train-wikitext --model gpt2 --output-dir ./out
146
+
147
+ # Training with custom parameters
148
+ humigence train-wikitext --model microsoft/DialoGPT-small --output-dir ./out --epochs 2 --batch-size 4 --learning-rate 1e-4
149
+
150
+ # Training with specific steps instead of epochs
151
+ humigence train-wikitext --model gpt2 --output-dir ./out --max-steps 1000 --batch-size 2
152
+
153
+ # Training with config file
154
+ humigence train-wikitext --config ./myconfig.yaml --output-dir ./out
155
+ """
156
+
157
+ # Validate that either model or config is provided
158
+ if not config and not model:
159
+ console.print("[bold red]❌ Error: Either --model or --config must be provided[/bold red]")
160
+ raise typer.Exit(1)
161
+
162
+ # Load config from file if provided
163
+ if config:
164
+ try:
165
+ from config.schema import load_config, validation_to_training_config
166
+ # Try to load as TrainingConfig first, then ValidationConfig
167
+ try:
168
+ loaded_config, metadata = load_config(config, TrainingConfig)
169
+ except Exception:
170
+ # If it fails, try loading as ValidationConfig and convert
171
+ validation_config, metadata = load_config(config, ValidationConfig)
172
+ loaded_config = validation_to_training_config(validation_config, output_dir)
173
+
174
+ # Override with CLI arguments (CLI takes precedence)
175
+ config_dict = loaded_config.dict()
176
+
177
+ # Update with CLI values (only if they're not default values)
178
+ if model != "": # If model was provided via CLI
179
+ config_dict["model"] = model
180
+ if output_dir != "": # If output_dir was provided via CLI
181
+ config_dict["output_dir"] = output_dir
182
+ if epochs != 1:
183
+ config_dict["epochs"] = epochs
184
+ if batch_size != 2:
185
+ config_dict["batch_size"] = batch_size
186
+ if learning_rate != 5e-5:
187
+ config_dict["learning_rate"] = learning_rate
188
+ if dataset != "wikitext":
189
+ config_dict["dataset"] = dataset
190
+ if dataset_config != "wikitext-2-raw-v1":
191
+ config_dict["dataset_config"] = dataset_config
192
+ if max_steps is not None:
193
+ config_dict["max_steps"] = max_steps
194
+ if block_size != 1024:
195
+ config_dict["block_size"] = block_size
196
+ if grad_accum != 4:
197
+ config_dict["grad_accum"] = grad_accum
198
+ if warmup_steps != 100:
199
+ config_dict["warmup_steps"] = warmup_steps
200
+ if logging_steps != 10:
201
+ config_dict["logging_steps"] = logging_steps
202
+ if save_steps != 200:
203
+ config_dict["save_steps"] = save_steps
204
+ if eval_steps != 200:
205
+ config_dict["eval_steps"] = eval_steps
206
+ if lora_r != 8:
207
+ config_dict["lora_r"] = lora_r
208
+ if lora_alpha != 32:
209
+ config_dict["lora_alpha"] = lora_alpha
210
+ if lora_dropout != 0.05:
211
+ config_dict["lora_dropout"] = lora_dropout
212
+
213
+ # Create new config with merged values
214
+ final_config = TrainingConfig(**config_dict)
215
+
216
+ # Extract values for display and function call
217
+ model = final_config.model
218
+ output_dir = final_config.output_dir
219
+ dataset = final_config.dataset
220
+ dataset_config = final_config.dataset_config
221
+ epochs = final_config.epochs
222
+ batch_size = final_config.batch_size
223
+ learning_rate = final_config.learning_rate
224
+ max_steps = final_config.max_steps
225
+ block_size = final_config.block_size
226
+ grad_accum = final_config.grad_accum
227
+ warmup_steps = final_config.warmup_steps
228
+ logging_steps = final_config.logging_steps
229
+ save_steps = final_config.save_steps
230
+ eval_steps = final_config.eval_steps
231
+ lora_r = final_config.lora_r
232
+ lora_alpha = final_config.lora_alpha
233
+ lora_dropout = final_config.lora_dropout
234
+
235
+ console.print(f"[bold blue]📁 Loaded configuration from {config}[/bold blue]")
236
+
237
+ # Display provenance information if metadata is available
238
+ if metadata:
239
+ provenance_info = f"Created: {metadata.created}"
240
+ if metadata.gpu:
241
+ provenance_info += f" | GPU: {metadata.gpu}"
242
+ if metadata.auto_heal and metadata.fallback_chain:
243
+ provenance_info += f" | Auto-healed: {' → '.join(metadata.fallback_chain)}"
244
+ elif metadata.auto_heal:
245
+ provenance_info += " | Auto-healed: (no fallbacks needed)"
246
+ else:
247
+ provenance_info += " | Direct validation (no auto-healing)"
248
+
249
+ console.print(f"[dim]📋 {provenance_info}[/dim]")
250
+
251
+ except Exception as e:
252
+ console.print(f"[bold red]❌ Failed to load config from {config}: {e}[/bold red]")
253
+ raise typer.Exit(1)
254
+
255
+ # Display training configuration
256
+ config_panel = Panel(
257
+ f"""[bold blue]Training Configuration[/bold blue]
258
+
259
+ [cyan]Model:[/cyan] {model}
260
+ [cyan]Output Directory:[/cyan] {output_dir}
261
+ [cyan]Epochs:[/cyan] {epochs}
262
+ [cyan]Batch Size:[/cyan] {batch_size}
263
+ [cyan]Learning Rate:[/cyan] {learning_rate}
264
+ [cyan]Dataset:[/cyan] {dataset}/{dataset_config}
265
+ [cyan]Max Steps:[/cyan] {max_steps if max_steps else 'Auto-calculated'}
266
+ [cyan]Block Size:[/cyan] {block_size}
267
+ [cyan]Gradient Accumulation:[/cyan] {grad_accum}
268
+ [cyan]LoRA Rank:[/cyan] {lora_r}
269
+ [cyan]LoRA Alpha:[/cyan] {lora_alpha}
270
+ [cyan]LoRA Dropout:[/cyan] {lora_dropout}""",
271
+ title="🚀 Starting Wikitext Training",
272
+ border_style="green"
273
+ )
274
+
275
+ console.print(config_panel)
276
+
277
+ # Create output directory if it doesn't exist
278
+ Path(output_dir).mkdir(parents=True, exist_ok=True)
279
+
280
+ # Run training
281
+ try:
282
+ if config:
283
+ # Use config-based training with launcher
284
+ from training.launcher import launch_training
285
+ result = launch_training(final_config)
286
+ else:
287
+ # Use individual parameters - convert to TrainingConfig and use launcher
288
+ from config.schema import TrainingConfig
289
+ from training.launcher import launch_training
290
+
291
+ training_config = TrainingConfig(
292
+ model=model,
293
+ output_dir=output_dir,
294
+ dataset=dataset,
295
+ dataset_config=dataset_config,
296
+ precision="fp16",
297
+ seq_len=block_size,
298
+ batch_size=batch_size,
299
+ epochs=epochs,
300
+ learning_rate=learning_rate,
301
+ max_steps=max_steps,
302
+ block_size=block_size,
303
+ grad_accum=grad_accum,
304
+ warmup_steps=warmup_steps,
305
+ logging_steps=logging_steps,
306
+ save_steps=save_steps,
307
+ eval_steps=eval_steps,
308
+ lora=True,
309
+ lora_r=lora_r,
310
+ lora_alpha=lora_alpha,
311
+ lora_dropout=lora_dropout,
312
+ gradient_checkpointing=True,
313
+ text_field="text",
314
+ schema="plain",
315
+ gpu_mode="single",
316
+ gpu_ids=[0]
317
+ )
318
+
319
+ result = launch_training(training_config)
320
+
321
+ if result["status"] == "success":
322
+ console.print(Panel(
323
+ f"""[bold green]✅ Training Completed Successfully![/bold green]
324
+
325
+ [cyan]Output Directory:[/cyan] {result['output_dir']}
326
+ [cyan]Model Path:[/cyan] {result['model_path']}
327
+
328
+ [bold blue]Final Metrics:[/bold blue]
329
+ [cyan]Train Loss:[/cyan] {result['metrics'].get('train_loss', 'N/A')}
330
+ [cyan]Eval Loss:[/cyan] {result['metrics'].get('eval_loss', 'N/A')}
331
+ [cyan]Total Steps:[/cyan] {result['metrics'].get('total_steps', 'N/A')}
332
+ [cyan]Epochs:[/cyan] {result['metrics'].get('epochs', 'N/A')}
333
+ [cyan]Train Runtime:[/cyan] {result['metrics'].get('train_runtime', 'N/A')}s
334
+ [cyan]Samples/Second:[/cyan] {result['metrics'].get('train_samples_per_second', 'N/A')}""",
335
+ title="🎉 Training Results",
336
+ border_style="green"
337
+ ))
338
+ return
339
+ else:
340
+ console.print(Panel(
341
+ f"""[bold red]❌ Training Failed[/bold red]
342
+
343
+ [red]Error:[/red] {result.get('error', 'Unknown error')}
344
+ [cyan]Output Directory:[/cyan] {result.get('output_dir', 'N/A')}""",
345
+ title="💥 Training Error",
346
+ border_style="red"
347
+ ))
348
+ raise typer.Exit(1)
349
+
350
+ except Exception as e:
351
+ console.print(Panel(
352
+ f"""[bold red]❌ Unexpected Error[/bold red]
353
+
354
+ [red]Error:[/red] {str(e)}""",
355
+ title="💥 Unexpected Error",
356
+ border_style="red"
357
+ ))
358
+ raise typer.Exit(1)
359
+
360
+
361
+ @app.command()
362
+ def train(
363
+ config: str = typer.Option(..., "--config", "-c", help="Path to YAML configuration file"),
364
+ output_dir: Optional[str] = typer.Option(None, "--output-dir", "-o", help="Override output directory"),
365
+ epochs: Optional[int] = typer.Option(None, "--epochs", "-e", help="Override number of epochs"),
366
+ batch_size: Optional[int] = typer.Option(None, "--batch-size", "-b", help="Override batch size"),
367
+ learning_rate: Optional[float] = typer.Option(None, "--learning-rate", "-lr", help="Override learning rate"),
368
+ max_steps: Optional[int] = typer.Option(None, "--max-steps", help="Override maximum training steps"),
369
+ dataset: Optional[str] = typer.Option(None, "--dataset", help="Override dataset specification"),
370
+ text_field: Optional[str] = typer.Option(None, "--text-field", help="Override text field for HF datasets"),
371
+ schema: Optional[str] = typer.Option(None, "--schema", help="Override schema for JSONL datasets"),
372
+ gradient_checkpointing: Optional[bool] = typer.Option(None, "--gradient-checkpointing/--no-gradient-checkpointing", help="Override gradient checkpointing"),
373
+ flash_attn: Optional[bool] = typer.Option(None, "--flash-attn/--no-flash-attn", help="Override flash attention"),
374
+ dtype: Optional[str] = typer.Option(None, "--dtype", help="Override data type: fp32|fp16|bf16"),
375
+ gpu_mode: Optional[str] = typer.Option(None, "--gpu-mode", help="Override GPU mode: single|multi"),
376
+ gpu_ids: Optional[str] = typer.Option(None, "--gpu-ids", help="Override GPU IDs (comma-separated, e.g., '0,1,2')"),
377
+ ):
378
+ """
379
+ Train a model using a configuration file with dataset-agnostic support.
380
+
381
+ This command supports training on:
382
+ - Wikitext datasets (wikitext)
383
+ - JSONL SFT datasets (jsonl:path/to/file.jsonl)
384
+ - Hugging Face datasets (hf:dataset_name or dataset_name)
385
+
386
+ Examples:
387
+ # Train with Wikitext
388
+ humigence train --config gpt2_wikitext.yaml
389
+
390
+ # Train with JSONL SFT dataset
391
+ humigence train --config my_sft_config.yaml
392
+
393
+ # Train with Hugging Face dataset
394
+ humigence train --config imdb_config.yaml
395
+
396
+ # Override specific parameters
397
+ humigence train --config my_config.yaml --epochs 3 --batch-size 4
398
+ """
399
+
400
+ # Load configuration
401
+ try:
402
+ from config.schema import load_config, validation_to_training_config
403
+ # Try to load as TrainingConfig first, then ValidationConfig
404
+ try:
405
+ loaded_config, metadata = load_config(config, TrainingConfig)
406
+ except Exception:
407
+ # If it fails, try loading as ValidationConfig and convert
408
+ validation_config, metadata = load_config(config, ValidationConfig)
409
+ if not output_dir:
410
+ console.print("[bold red]❌ Error: --output-dir is required when using ValidationConfig[/bold red]")
411
+ raise typer.Exit(1)
412
+ loaded_config = validation_to_training_config(validation_config, output_dir)
413
+
414
+ # Override with CLI arguments (CLI takes precedence)
415
+ config_dict = loaded_config.dict()
416
+
417
+ if output_dir:
418
+ config_dict["output_dir"] = output_dir
419
+ if epochs is not None:
420
+ config_dict["epochs"] = epochs
421
+ if batch_size is not None:
422
+ config_dict["batch_size"] = batch_size
423
+ if learning_rate is not None:
424
+ config_dict["learning_rate"] = learning_rate
425
+ if max_steps is not None:
426
+ config_dict["max_steps"] = max_steps
427
+ if dataset:
428
+ config_dict["dataset"] = dataset
429
+ if text_field:
430
+ config_dict["text_field"] = text_field
431
+ if schema:
432
+ config_dict["schema"] = schema
433
+ if gradient_checkpointing is not None:
434
+ config_dict["gradient_checkpointing"] = gradient_checkpointing
435
+ if flash_attn is not None:
436
+ config_dict["flash_attn"] = flash_attn
437
+ if dtype:
438
+ config_dict["dtype"] = dtype
439
+ if gpu_mode:
440
+ config_dict["gpu_mode"] = gpu_mode
441
+ if gpu_ids:
442
+ # Parse comma-separated GPU IDs
443
+ try:
444
+ gpu_ids_list = [int(x.strip()) for x in gpu_ids.split(",")]
445
+ config_dict["gpu_ids"] = gpu_ids_list
446
+ except ValueError:
447
+ console.print(f"[red]❌ Invalid GPU IDs format: {gpu_ids}. Use comma-separated integers (e.g., '0,1,2')[/red]")
448
+ raise typer.Exit(1)
449
+
450
+ # Create final config
451
+ final_config = TrainingConfig(**config_dict)
452
+
453
+ console.print(f"[bold blue]📁 Loaded configuration from {config}[/bold blue]")
454
+
455
+ # Display provenance information if metadata is available
456
+ if metadata:
457
+ provenance_info = f"Created: {metadata.created}"
458
+ if metadata.gpu:
459
+ provenance_info += f" | GPU: {metadata.gpu}"
460
+ if metadata.auto_heal and metadata.fallback_chain:
461
+ provenance_info += f" | Auto-healed: {' → '.join(metadata.fallback_chain)}"
462
+ elif metadata.auto_heal:
463
+ provenance_info += " | Auto-healed: (no fallbacks needed)"
464
+ else:
465
+ provenance_info += " | Direct validation (no auto-healing)"
466
+
467
+ console.print(f"[dim]📋 {provenance_info}[/dim]")
468
+
469
+ # Display dataset provenance if available
470
+ if metadata.dataset:
471
+ dataset_info = f"📁 Dataset: {metadata.dataset.get('file_path', metadata.dataset.get('dataset_name', 'N/A'))}"
472
+ if metadata.dataset.get('schema'):
473
+ dataset_info += f" ({metadata.dataset['schema']})"
474
+ console.print(f"[dim]{dataset_info}[/dim]")
475
+
476
+ if 'train_size' in metadata.dataset and 'eval_size' in metadata.dataset:
477
+ size_info = f"🔢 Train size: {metadata.dataset['train_size']} | Eval size: {metadata.dataset['eval_size']}"
478
+ console.print(f"[dim]{size_info}[/dim]")
479
+
480
+ if 'sha256' in metadata.dataset:
481
+ sha256 = metadata.dataset['sha256']
482
+ if len(sha256) > 12:
483
+ sha256 = sha256[:12] + "..."
484
+ console.print(f"[dim]🔑 SHA256: {sha256}[/dim]")
485
+ else:
486
+ console.print("[yellow]⚠️ Config missing dataset metadata. Consider re-running validate to persist provenance.[/yellow]")
487
+
488
+ except Exception as e:
489
+ console.print(f"[bold red]❌ Failed to load config from {config}: {e}[/bold red]")
490
+ raise typer.Exit(1)
491
+
492
+ # Display training configuration
493
+ dataset_info = f"{final_config.dataset.type}"
494
+ if final_config.dataset.path:
495
+ dataset_info += f" ({final_config.dataset.path})"
496
+ elif final_config.dataset.name:
497
+ dataset_info += f" ({final_config.dataset.name})"
498
+
499
+ config_panel = Panel(
500
+ f"""[bold blue]Training Configuration[/bold blue]
501
+
502
+ [cyan]Model:[/cyan] {final_config.model}
503
+ [cyan]Output Directory:[/cyan] {final_config.output_dir}
504
+ [cyan]Dataset:[/cyan] {dataset_info}
505
+ [cyan]Schema:[/cyan] {final_config.dataset.schema_type or 'auto'}
506
+ [cyan]Text Field:[/cyan] {final_config.dataset.text_field or 'auto'}
507
+ [cyan]Epochs:[/cyan] {final_config.epochs}
508
+ [cyan]Batch Size:[/cyan] {final_config.batch_size}
509
+ [cyan]Learning Rate:[/cyan] {final_config.learning_rate}
510
+ [cyan]Max Steps:[/cyan] {final_config.max_steps if final_config.max_steps else 'Auto-calculated'}
511
+ [cyan]Block Size:[/cyan] {final_config.block_size}
512
+ [cyan]Gradient Accumulation:[/cyan] {final_config.grad_accum}
513
+ [cyan]LoRA Rank:[/cyan] {final_config.lora_r}
514
+ [cyan]LoRA Alpha:[/cyan] {final_config.lora_alpha}
515
+ [cyan]LoRA Dropout:[/cyan] {final_config.lora_dropout}
516
+ [cyan]Gradient Checkpointing:[/cyan] {final_config.gradient_checkpointing}
517
+ [cyan]Flash Attention:[/cyan] {final_config.flash_attn}
518
+ [cyan]Data Type:[/cyan] {final_config.dtype}""",
519
+ title="🚀 Starting Dataset-Agnostic Training",
520
+ border_style="green"
521
+ )
522
+
523
+ console.print(config_panel)
524
+
525
+ # Create output directory if it doesn't exist
526
+ Path(final_config.output_dir).mkdir(parents=True, exist_ok=True)
527
+
528
+ # Run training
529
+ try:
530
+ from training.launcher import launch_training
531
+ result = launch_training(final_config)
532
+
533
+ if result["status"] == "success":
534
+ console.print(Panel(
535
+ f"""[bold green]✅ Training Completed Successfully![/bold green]
536
+
537
+ [cyan]Output Directory:[/cyan] {result['output_dir']}
538
+ [cyan]Model Path:[/cyan] {result['model_path']}
539
+
540
+ [bold blue]Final Metrics:[/bold blue]
541
+ [cyan]Train Loss:[/cyan] {result['metrics'].get('train_loss', 'N/A')}
542
+ [cyan]Eval Loss:[/cyan] {result['metrics'].get('eval_loss', 'N/A')}
543
+ [cyan]Total Steps:[/cyan] {result['metrics'].get('total_steps', 'N/A')}
544
+ [cyan]Epochs:[/cyan] {result['metrics'].get('epochs', 'N/A')}
545
+ [cyan]Train Runtime:[/cyan] {result['metrics'].get('train_runtime', 'N/A')}s
546
+ [cyan]Samples/Second:[/cyan] {result['metrics'].get('train_samples_per_second', 'N/A')}""",
547
+ title="🎉 Training Results",
548
+ border_style="green"
549
+ ))
550
+ return
551
+ else:
552
+ console.print(Panel(
553
+ f"""[bold red]❌ Training Failed[/bold red]
554
+
555
+ [red]Error:[/red] {result.get('error', 'Unknown error')}
556
+ [cyan]Output Directory:[/cyan] {result.get('output_dir', 'N/A')}""",
557
+ title="💥 Training Error",
558
+ border_style="red"
559
+ ))
560
+ raise typer.Exit(1)
561
+
562
+ except Exception as e:
563
+ console.print(Panel(
564
+ f"""[bold red]❌ Unexpected Error[/bold red]
565
+
566
+ [red]Error:[/red] {str(e)}""",
567
+ title="💥 Unexpected Error",
568
+ border_style="red"
569
+ ))
570
+ raise typer.Exit(1)
571
+
572
+
573
+ @app.command()
574
+ def validate(
575
+ model: str = typer.Option(..., help="HF model id or local path"),
576
+ dataset: str = typer.Option("wikitext", help="Dataset specification: wikitext | jsonl:<path> | hf:<name>"),
577
+ precision: str = typer.Option("fp16", help="fp32|fp16|bf16|qlora4bit"),
578
+ seq_len: int = typer.Option(1024, help="Sequence length"),
579
+ batch_size: int = typer.Option(2, help="Batch size"),
580
+ lora: bool = typer.Option(True, help="Enable LoRA"),
581
+ max_samples: int = typer.Option(128, help="Max samples for schema sniff"),
582
+ text_field: Optional[str] = typer.Option(None, help="Text field for generic HF datasets"),
583
+ schema: Optional[str] = typer.Option(None, help="Schema for JSONL datasets: sft | dialogue | plain | auto"),
584
+ role_markers: bool = typer.Option(True, "--role-markers/--no-role-markers", help="Use role markers for dialogue datasets"),
585
+ user_marker: str = typer.Option("<user>", help="User role marker"),
586
+ assistant_marker: str = typer.Option("<assistant>", help="Assistant role marker"),
587
+ eval_split: Optional[float] = typer.Option(None, help="Fraction of data to use for evaluation (0.0-1.0)"),
588
+ eval_file: Optional[str] = typer.Option(None, help="Path to separate evaluation file (for JSONL)"),
589
+ gradient_checkpointing: bool = typer.Option(False, "--gradient-checkpointing/--no-gradient-checkpointing", help="Enable gradient checkpointing"),
590
+ flash_attn: bool = typer.Option(False, "--flash-attn/--no-flash-attn", help="Enable flash attention"),
591
+ dtype: str = typer.Option("fp16", help="Data type: fp32|fp16|bf16"),
592
+ dry_run_flag: bool = typer.Option(True, "--dry-run/--no-dry-run", help="Do the 1-batch fwd+bwd"),
593
+ auto_heal: bool = typer.Option(True, "--auto-heal/--no-auto-heal", help="Enable auto-healing fallback simulation"),
594
+ max_attempts: int = typer.Option(10, help="Maximum fallback attempts for auto-healing"),
595
+ save_config_path: Optional[str] = typer.Option(None, "--save-config", help="Save auto-healed config to YAML file"),
596
+ overwrite: bool = typer.Option(False, "--overwrite", help="Overwrite existing config file instead of versioning"),
597
+ ):
598
+ """
599
+ Validate model, dataset, and training configuration before training.
600
+
601
+ This command performs comprehensive validation including:
602
+ - Model family detection and LoRA target module validation
603
+ - GPU capability and precision support checks
604
+ - Memory estimation and OOM prevention
605
+ - Tokenizer validation
606
+ - Optional 1-batch dry-run to test actual training setup
607
+
608
+ Examples:
609
+ # Basic validation with GPT-2
610
+ humigence validate --model gpt2 --dataset wikitext --precision fp16
611
+
612
+ # Validate with BF16 (will fail on non-BF16 GPUs)
613
+ humigence validate --model gpt2 --precision bf16
614
+
615
+ # Validate with 4-bit quantization
616
+ humigence validate --model gpt2 --precision qlora4bit
617
+
618
+ # Validate without dry-run
619
+ humigence validate --model gpt2 --no-dry-run
620
+ """
621
+ if precision not in PRECISIONS:
622
+ typer.secho(f"Unsupported precision: {precision}", fg=typer.colors.RED, err=True)
623
+ raise typer.Exit(1)
624
+
625
+ # Detect model family and get config
626
+ family, cfg = detect_family(model)
627
+ gpu = get_gpu_info()
628
+ tok_ok, tok_msg = tokenizer_ok(model)
629
+ prec_ok, prec_msg = precision_supported(precision, gpu)
630
+
631
+ # Detect dataset type and validate
632
+ dataset_type = _detect_dataset_type(dataset)
633
+ dataset_ok, dataset_msg = _validate_dataset(dataset, dataset_type, text_field, schema)
634
+
635
+ # Create dataset configuration with eval split support
636
+ dataset_config = _create_dataset_config(dataset, text_field, schema, role_markers, user_marker, assistant_marker, eval_split, eval_file)
637
+
638
+ # GPU-aware defaults and warnings
639
+ _apply_gpu_aware_defaults(gpu, precision, batch_size, seq_len, gradient_checkpointing, flash_attn, dtype)
640
+
641
+ # Load dataset to capture metadata
642
+ dataset_metadata = None
643
+ if dataset_ok:
644
+ try:
645
+ from training.data_loader import create_dataset_loader
646
+ loader = create_dataset_loader(
647
+ dataset,
648
+ text_field=text_field,
649
+ schema=schema or "auto",
650
+ role_markers=role_markers,
651
+ user_marker=user_marker,
652
+ assistant_marker=assistant_marker,
653
+ eval_split=eval_split,
654
+ eval_file=eval_file
655
+ )
656
+ # Load dataset to get metadata
657
+ train_dataset, eval_dataset = loader.load()
658
+ dataset_metadata = loader.get_metadata()
659
+ except Exception as e:
660
+ console.print(f"[yellow]⚠️ Could not load dataset metadata: {e}[/yellow]")
661
+ dataset_metadata = None
662
+
663
+ # Estimate parameters and memory
664
+ params = estimate_model_params(cfg)
665
+ mem_est = estimate_memory_bytes(params, precision, adam=True, lora=lora)
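+ # Rough planning figure: weights + gradients + Adam moments (estimate_memory_bytes is assumed to discount frozen weights when lora=True); None if the param count is unknown.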
666
+ mem_info = f"est ~{mem_est/1e9:.2f} GB" if mem_est else "n/a"
667
+
668
+ # Collect warnings
669
+ warns = []
670
+ if not tok_ok:
671
+ warns.append(f"Tokenizer: {tok_msg}")
672
+ if not prec_ok:
673
+ warns.append(f"Precision: {prec_msg}")
674
+ if not dataset_ok:
675
+ warns.append(f"Dataset: {dataset_msg}")
676
+
677
+ # Check sequence length against model limits
678
+ max_pos = getattr(cfg, "max_position_embeddings", None)
679
+ if max_pos and seq_len > max_pos:
680
+ warns.append(f"seq_len {seq_len} > model limit {max_pos}. Suggest <= {max_pos}.")
681
+
682
+ # Create summary table
683
+ tbl = Table(title="Humigence Validation Summary")
684
+ tbl.add_column("Item", style="cyan")
685
+ tbl.add_column("Value", style="white")
686
+ tbl.add_row("Model", model)
687
+ tbl.add_row("Family", family)
688
+ tbl.add_row("Dataset Type", dataset_config.type)
689
+ tbl.add_row("Dataset Path/Name", dataset_config.path or dataset_config.name or "N/A")
690
+ tbl.add_row("Schema", dataset_config.schema_type or "auto")
691
+ tbl.add_row("Text Field", dataset_config.text_field or "auto")
692
+ if dataset_config.type == "jsonl" and dataset_config.schema_type == "dialogue":
693
+ tbl.add_row("Role Markers", f"{dataset_config.user_marker} / {dataset_config.assistant_marker}")
694
+
695
+ # Add dataset metadata if available
696
+ if dataset_metadata:
697
+ tbl.add_row("Train Size", str(dataset_metadata.get("train_size", "N/A")))
698
+ tbl.add_row("Eval Size", str(dataset_metadata.get("eval_size", "N/A")))
699
+ if "sha256" in dataset_metadata:
700
+ sha256 = dataset_metadata["sha256"]
701
+ if len(sha256) > 12:
702
+ sha256 = sha256[:12] + "..."
703
+ tbl.add_row("SHA256", sha256)
704
+
705
+ tbl.add_row("Precision", precision)
706
+ tbl.add_row("GPU", f"{gpu.name} (bf16={gpu.bf16_supported}, cc={gpu.cc_major}.{gpu.cc_minor})" if gpu.available else "CPU")
707
+ tbl.add_row("Params (est.)", f"{params:,}" if params else "unknown")
708
+ tbl.add_row("Memory (est.)", mem_info)
709
+ tbl.add_row("Seq Len", str(seq_len))
710
+ tbl.add_row("Batch Size", str(batch_size))
711
+ tbl.add_row("LoRA", str(lora))
712
+ tbl.add_row("Tokenizer", "OK" if tok_ok else f"ISSUE: {tok_msg}")
713
+ tbl.add_row("Precision Support", "OK" if prec_ok else f"ISSUE: {prec_msg}")
714
+ tbl.add_row("Dataset", "OK" if dataset_ok else f"ISSUE: {dataset_msg}")
715
+ console.print(tbl)
716
+
717
+ # Display warnings
718
+ if warns:
719
+ console.print("\n[yellow]Warnings:[/yellow]")
720
+ for w in warns:
721
+ console.print(f" - {w}")
722
+
723
+ # Check precision support
724
+ if not prec_ok:
725
+ console.print("\n[bold red]FAIL[/bold red]: Precision not supported.")
726
+ _print_fallback(precision, gpu, lora, seq_len, batch_size)
727
+ raise typer.Exit(2)
728
+
729
+ # Perform dry run if requested
730
+ if dry_run_flag:
731
+ console.print("\n[bold]Running 1-batch dry-run...[/bold]")
732
+ lora_targets = suggested_lora_targets(family) if lora else None
733
+ res = dry_run(
734
+ model_id_or_path=model,
735
+ precision=precision,
736
+ seq_len=seq_len,
737
+ batch_size=batch_size,
738
+ lora=lora,
739
+ lora_targets=lora_targets,
740
+ )
741
+ if res.ok:
742
+ console.print(f"[green]PASS[/green]: dry-run completed. loss={res.details.get('loss'):.4f}")
743
+
744
+ # Save config if requested (even without auto-healing)
745
+ if save_config_path:
746
+ validation_config = ValidationConfig(
747
+ model=model,
748
+ dataset=dataset_config,
749
+ precision=precision,
750
+ seq_len=seq_len,
751
+ batch_size=batch_size,
752
+ lora=lora,
753
+ lora_targets=lora_targets,
754
+ gradient_checkpointing=gradient_checkpointing,
755
+ flash_attn=flash_attn,
756
+ dtype=dtype,
757
+ max_samples=max_samples
758
+ )
759
+
760
+ # Create runtime metadata
761
+ runtime_metadata = _create_runtime_metadata(gpu)
762
+
763
+ # Create metadata
764
+ metadata = ConfigMetadata(
765
+ created=datetime.now().isoformat(),
766
+ gpu=f"{gpu.name} (bf16={gpu.bf16_supported}, cc={gpu.cc_major}.{gpu.cc_minor})" if gpu.available else "CPU",
767
+ precision_supported=[p for p in ["fp32", "fp16", "bf16", "qlora4bit"] if precision_supported(p, gpu)[0]],
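+ # Snapshot of which precisions this GPU can run, recorded for provenance alongside the saved config.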
768
+ validator_version="0.3",
769
+ auto_heal=False,
770
+ fallback_chain=[],
771
+ original_config={
772
+ "model": model,
773
+ "precision": precision,
774
+ "seq_len": seq_len,
775
+ "batch_size": batch_size,
776
+ "lora": lora,
777
+ "gradient_checkpointing": gradient_checkpointing,
778
+ "flash_attn": flash_attn,
779
+ "dtype": dtype
780
+ },
781
+ dataset=dataset_metadata,
782
+ runtime=runtime_metadata
783
+ )
784
+
785
+ saved_path = save_config(validation_config, save_config_path, metadata, overwrite)
786
+ console.print(f"\n[bold green]✅ Config saved to {saved_path}[/bold green]")
787
+
788
+ raise typer.Exit(0)
789
+ else:
790
+ console.print(f"[red]FAIL[/red]: dry-run error: {res.error}")
791
+
792
+ # Auto-healing fallback simulation
793
+ if auto_heal:
794
+ console.print(f"[yellow]Auto-healing enabled. Attempting fallback simulation...[/yellow]")
795
+
796
+ # Create initial config candidate
797
+ initial_config = ConfigCandidate(
798
+ model=model,
799
+ precision=precision,
800
+ seq_len=seq_len,
801
+ batch_size=batch_size,
802
+ lora=lora,
803
+ lora_targets=lora_targets,
804
+ gradient_checkpointing=False,
805
+ dataset=dataset,
806
+ text_field=text_field
807
+ )
808
+
809
+ # Run fallback simulation
810
+ simulator = FallbackSimulator()
811
+ success, final_config = simulator.simulate_fallbacks(initial_config, max_attempts)
812
+
813
+ if success:
814
+ console.print(f"\n[bold green]🎉 AUTO-HEALING SUCCESSFUL![/bold green]")
815
+ console.print(f"[dim]Found working configuration after {len(simulator.attempts)} attempts[/dim]")
816
+
817
+ # Generate and display YAML config
818
+ yaml_config = simulator.generate_yaml_config(final_config)
819
+ console.print(f"\n[bold blue]AUTO-HEALED CONFIG PATCH[/bold blue]")
820
+ console.print(f"[dim]```yaml[/dim]")
821
+ console.print(yaml_config)
822
+ console.print(f"[dim]```[/dim]")
823
+
824
+ # Save config if requested
825
+ if save_config_path:
826
+ # Create ValidationConfig from final_config
827
+ validation_config = ValidationConfig(
828
+ model=final_config.model,
829
+ dataset=final_config.dataset,
830
+ precision=final_config.precision,
831
+ seq_len=final_config.seq_len,
832
+ batch_size=final_config.batch_size,
833
+ lora=final_config.lora,
834
+ lora_targets=final_config.lora_targets,
835
+ gradient_checkpointing=final_config.gradient_checkpointing,
836
+ text_field=final_config.text_field,
837
+ schema=getattr(final_config, 'schema', schema),
838
+ max_samples=max_samples
839
+ )
840
+
841
+ # Create fallback chain from simulator attempts
842
+ fallback_chain = []
843
+ for attempt in simulator.attempts[1:]: # Skip initial attempt
844
+ if attempt.notes:
845
+ fallback_chain.append(attempt.notes)
846
+ else:
847
+ # Generate fallback description from config changes
848
+ prev_config = simulator.attempts[attempt.attempt_num - 2].config
849
+ curr_config = attempt.config
850
+
851
+ changes = []
852
+ if prev_config.precision != curr_config.precision:
853
+ changes.append(f"precision {prev_config.precision} → {curr_config.precision}")
854
+ if prev_config.seq_len != curr_config.seq_len:
855
+ changes.append(f"seq_len {prev_config.seq_len} → {curr_config.seq_len}")
856
+ if prev_config.batch_size != curr_config.batch_size:
857
+ changes.append(f"batch_size {prev_config.batch_size} → {curr_config.batch_size}")
858
+ if prev_config.gradient_checkpointing != curr_config.gradient_checkpointing:
859
+ changes.append(f"gradient_checkpointing {prev_config.gradient_checkpointing} → {curr_config.gradient_checkpointing}")
860
+
861
+ if changes:
862
+ fallback_chain.append(", ".join(changes))
863
+
864
+ # Create metadata with fallback chain
865
+ metadata = ConfigMetadata(
866
+ created=datetime.now().isoformat(),
867
+ gpu=f"{gpu.name} (bf16={gpu.bf16_supported}, cc={gpu.cc_major}.{gpu.cc_minor})" if gpu.available else "CPU",
868
+ precision_supported=[p for p in ["fp32", "fp16", "bf16", "qlora4bit"] if precision_supported(p, gpu)[0]],
869
+ validator_version="0.3",
870
+ auto_heal=True,
871
+ fallback_chain=fallback_chain,
872
+ original_config={
873
+ "model": model,
874
+ "precision": precision,
875
+ "seq_len": seq_len,
876
+ "batch_size": batch_size,
877
+ "lora": lora
878
+ },
879
+ dataset=dataset_metadata
880
+ )
881
+
882
+ saved_path = save_config(validation_config, save_config_path, metadata, overwrite)
883
+ console.print(f"\n[bold green]✅ Auto-healed config saved to {saved_path}[/bold green]")
884
+
885
+ raise typer.Exit(0)
886
+ else:
887
+ console.print(f"\n[bold red]❌ AUTO-HEALING FAILED[/bold red]")
888
+ console.print(f"[dim]Could not find working configuration after {max_attempts} attempts[/dim]")
889
+ _print_fallback(precision, gpu, lora, seq_len, batch_size, res.oom)
890
+ raise typer.Exit(3)
891
+ else:
892
+ # No auto-healing, just show fallback suggestions
893
+ if res.oom:
894
+ console.print("[yellow]Detected OOM. Proposing fallback...[/yellow]")
895
+ _print_fallback(precision, gpu, lora, seq_len, batch_size, res.oom)
896
+ raise typer.Exit(3)
897
+ else:
898
+ # No dry-run; rely on static checks
899
+ if warns:
900
+ console.print("[yellow]COMPLETE WITH WARNINGS[/yellow]")
901
+ raise typer.Exit(0)
902
+ console.print("[green]PASS[/green]")
903
+ raise typer.Exit(0)
904
+
905
+
906
+ def _detect_dataset_type(dataset_spec: str) -> str:
907
+ """Detect dataset type from specification"""
908
+ if dataset_spec == "wikitext":
909
+ return "wikitext"
910
+ elif dataset_spec.startswith("jsonl:"):
911
+ return "jsonl"
912
+ elif dataset_spec.startswith("hf:"):
913
+ return "hf"
914
+ else:
915
+ # Assume it's a direct HF dataset name
916
+ return "hf"
917
+
918
+
919
+ def _create_dataset_config(dataset_spec: str, text_field: Optional[str], schema: Optional[str],
920
+ role_markers: bool, user_marker: str, assistant_marker: str,
921
+ eval_split: Optional[float] = None, eval_file: Optional[str] = None):
922
+ """Create DatasetConfig from CLI parameters"""
923
+ from config.schema import DatasetConfig
924
+
925
+ dataset_type = _detect_dataset_type(dataset_spec)
926
+
927
+ if dataset_type == "wikitext":
928
+ return DatasetConfig(type="wikitext", name="wikitext")
929
+
930
+ elif dataset_type == "jsonl":
931
+ file_path = dataset_spec[6:] # Remove "jsonl:" prefix
932
+ return DatasetConfig(
933
+ type="jsonl",
934
+ path=file_path,
935
+ schema_type=schema or "auto",
936
+ role_markers=role_markers,
937
+ user_marker=user_marker,
938
+ assistant_marker=assistant_marker,
939
+ eval_split=eval_split,
940
+ eval_file=eval_file
941
+ )
942
+
943
+ elif dataset_type == "hf":
944
+ dataset_name = dataset_spec[3:] if dataset_spec.startswith("hf:") else dataset_spec
945
+ return DatasetConfig(
946
+ type="hf",
947
+ name=dataset_name,
948
+ text_field=text_field or "text",
949
+ eval_split=eval_split
950
+ )
951
+
952
+ else:
953
+ raise ValueError(f"Unknown dataset type: {dataset_type}")
954
+
955
+
956
+ def _apply_gpu_aware_defaults(gpu, precision: str, batch_size: int, seq_len: int,
957
+ gradient_checkpointing: bool, flash_attn: bool, dtype: str):
958
+ """Apply GPU-aware defaults and warnings"""
959
+ if not gpu.available:
960
+ console.print("[yellow]⚠️ No GPU detected - using CPU mode[/yellow]")
961
+ return
962
+
963
+ # Get GPU memory info
964
+ try:
965
+ import torch
966
+ if torch.cuda.is_available():
967
+ gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
968
+ console.print(f"[blue]🔧 GPU Memory: {gpu_memory_gb:.1f}GB[/blue]")
969
+
970
+ # Warn about potential OOM issues
971
+ if precision == "fp32" and gpu_memory_gb < 24:
972
+ console.print(f"[yellow]⚠️ Detected {gpu_memory_gb:.1f}GB GPU — fp32 may OOM, recommend fp16 with batch_size<=4[/yellow]")
973
+ elif precision == "bf16" and not gpu.bf16_supported:
974
+ console.print(f"[yellow]⚠️ GPU doesn't support BF16, recommend fp16[/yellow]")
975
+ elif batch_size > 4 and gpu_memory_gb < 16:
976
+ console.print(f"[yellow]⚠️ Large batch size ({batch_size}) on {gpu_memory_gb:.1f}GB GPU may cause OOM[/yellow]")
977
+ except Exception as e:
978
+ console.print(f"[yellow]⚠️ Could not get GPU memory info: {e}[/yellow]")
979
+
980
+
981
+ def _create_runtime_metadata(gpu) -> Dict[str, Any]:
982
+ """Create runtime environment metadata"""
983
+ runtime_metadata = {}
984
+
985
+ try:
986
+ import torch
987
+ import platform
988
+
989
+ # GPU info
990
+ if gpu.available:
991
+ runtime_metadata["gpu"] = gpu.name
992
+ runtime_metadata["vram_gb"] = torch.cuda.get_device_properties(0).total_memory / (1024**3)
993
+ runtime_metadata["cuda"] = torch.version.cuda
994
+ else:
995
+ runtime_metadata["gpu"] = "CPU"
996
+ runtime_metadata["vram_gb"] = 0
997
+ runtime_metadata["cuda"] = None
998
+
999
+ # PyTorch version
1000
+ runtime_metadata["torch"] = torch.__version__
1001
+
1002
+ # System info
1003
+ runtime_metadata["platform"] = platform.platform()
1004
+ runtime_metadata["python"] = platform.python_version()
1005
+
1006
+ except Exception as e:
1007
+ console.print(f"[yellow]⚠️ Could not collect runtime metadata: {e}[/yellow]")
1008
+ runtime_metadata["error"] = str(e)
1009
+
1010
+ return runtime_metadata
1011
+
1012
+
1013
+ def _validate_dataset(dataset_spec: str, dataset_type: str, text_field: Optional[str], schema: Optional[str]) -> tuple[bool, str]:
1014
+ """Validate dataset specification and accessibility"""
1015
+ try:
1016
+ if dataset_type == "wikitext":
1017
+ # Wikitext is always valid
1018
+ return True, "OK"
1019
+
1020
+ elif dataset_type == "jsonl":
1021
+ file_path = dataset_spec[6:] # Remove "jsonl:" prefix
1022
+ if not os.path.exists(file_path):
1023
+ return False, f"File not found: {file_path}"
1024
+
1025
+ # Try to read first line to validate JSON format
1026
+ try:
1027
+ import json
1028
+ with open(file_path, 'r', encoding='utf-8') as f:
1029
+ first_line = f.readline().strip()
1030
+ if first_line:
1031
+ json.loads(first_line)
1032
+ return True, "OK"
1033
+ except json.JSONDecodeError:
1034
+ return False, f"Invalid JSON format in {file_path}"
1035
+ except Exception as e:
1036
+ return False, f"Error reading {file_path}: {e}"
1037
+
1038
+ elif dataset_type == "hf":
1039
+ dataset_name = dataset_spec[3:] if dataset_spec.startswith("hf:") else dataset_spec
1040
+ # Try to load dataset info (without actually downloading)
1041
+ try:
1042
+ from datasets import get_dataset_infos
1043
+ infos = get_dataset_infos(dataset_name)
1044
+ if not infos:
1045
+ return False, f"Dataset {dataset_name} not found"
1046
+ return True, "OK"
1047
+ except Exception as e:
1048
+ return False, f"Error accessing dataset {dataset_name}: {e}"
1049
+
1050
+ else:
1051
+ return False, f"Unknown dataset type: {dataset_type}"
1052
+
1053
+ except Exception as e:
1054
+ return False, f"Dataset validation error: {e}"
1055
+
1056
+
1057
+ def _print_fallback(precision: str, gpu, lora: bool, seq_len: int, batch_size: int, oom: bool = False):
1058
+ """Print fallback configuration recommendations"""
1059
+ console.print("\n[bold]RECOMMENDED CONFIG PATCH[/bold]")
1060
+ suggest = {
1061
+ "precision": precision,
1062
+ "seq_len": seq_len,
1063
+ "batch_size": batch_size,
1064
+ "lora": lora,
1065
+ "gradient_checkpointing": False,
1066
+ }
1067
+
1068
+ # Precision fallback
1069
+ if precision == "bf16" and not gpu.bf16_supported:
1070
+ suggest["precision"] = "fp16"
1071
+ if precision == "qlora4bit" and not gpu.available:
1072
+ suggest["precision"] = "fp16"
1073
+
1074
+ # OOM mitigations
1075
+ if oom:
1076
+ if batch_size > 1:
1077
+ suggest["batch_size"] = max(1, batch_size // 2)
1078
+ else:
1079
+ suggest["gradient_checkpointing"] = True
1080
+ if seq_len > 1024:
1081
+ suggest["seq_len"] = min(1024, seq_len // 2)
1082
+ if precision in ("bf16", "fp32"):
1083
+ suggest["precision"] = "fp16"
1084
+
1085
+ for k, v in suggest.items():
1086
+ console.print(f" - {k}: {v}")
1087
+
1088
+
1089
+ @app.command()
1090
+ def gpu_info():
1091
+ """Show detailed GPU information and selection options."""
1092
+ from validation.matrix import get_all_gpu_info
1093
+
1094
+ multi_gpu_info = get_all_gpu_info()
1095
+
1096
+ if not multi_gpu_info.gpus:
1097
+ console.print(Panel(
1098
+ "[bold red]❌ No GPUs detected[/bold red]\n"
1099
+ "[dim]Training will run on CPU[/dim]",
1100
+ title="GPU Information",
1101
+ border_style="red"
1102
+ ))
1103
+ return
1104
+
1105
+ # Create GPU information table
1106
+ table = Table(title="Available GPUs")
1107
+ table.add_column("Index", style="cyan", width=6)
1108
+ table.add_column("Name", style="white", width=40)
1109
+ table.add_column("VRAM", style="green", width=12)
1110
+ table.add_column("Compute Capability", style="blue", width=15)
1111
+ table.add_column("BF16 Support", style="yellow", width=12)
1112
+
1113
+ for gpu in multi_gpu_info.gpus:
1114
+ vram_gb = gpu.total_bytes / (1024**3)
1115
+ cc = f"{gpu.cc_major}.{gpu.cc_minor}"
1116
+ bf16_support = "✅ Yes" if gpu.bf16_supported else "❌ No"
1117
+
1118
+ table.add_row(
1119
+ str(gpu.device_index),
1120
+ gpu.name,
1121
+ f"{vram_gb:.1f} GB",
1122
+ cc,
1123
+ bf16_support
1124
+ )
1125
+
1126
+ console.print(table)
1127
+
1128
+ # Show selection examples
1129
+ console.print(Panel(
1130
+ f"""[bold blue]GPU Selection Examples[/bold blue]
1131
+
1132
+ [cyan]Single GPU Training:[/cyan]
1133
+ humigence train --config my_config.yaml --gpu-mode single --gpu-ids 0
1134
+
1135
+ [cyan]Multi-GPU Training (all GPUs):[/cyan]
1136
+ humigence train --config my_config.yaml --gpu-mode multi --gpu-ids 0,1
1137
+
1138
+ [cyan]Multi-GPU Training (specific GPUs):[/cyan]
1139
+ humigence train --config my_config.yaml --gpu-mode multi --gpu-ids 1,2
1140
+
1141
+ [dim]Total VRAM: {multi_gpu_info.total_vram_gb:.1f} GB across {multi_gpu_info.count} GPUs[/dim]""",
1142
+ title="Usage Examples",
1143
+ border_style="green"
1144
+ ))
1145
+
1146
+
1147
+ @app.command()
1148
+ def version():
1149
+ """Show version information."""
1150
+ console.print("[bold blue]Humigence v1.0.0[/bold blue]")
1151
+ console.print("[dim]Your AI. Your pipeline. Zero code.[/dim]")
1152
+
1153
+
1154
+ @app.callback()
1155
+ def main(
1156
+ version: bool = typer.Option(
1157
+ False,
1158
+ "--version",
1159
+ "-v",
1160
+ help="Show version and exit"
1161
+ )
1162
+ ):
1163
+ """
1164
+ Humigence - Your AI. Your pipeline. Zero code.
1165
+
1166
+ A complete MLOps suite built for makers, teams, and enterprises.
1167
+ """
1168
+ if version:
1169
+ console.print("[bold blue]Humigence v1.0.0[/bold blue]")
1170
+ console.print("[dim]Your AI. Your pipeline. Zero code.[/dim]")
1171
+ raise typer.Exit(0)
1172
+
1173
+
1174
+ if __name__ == "__main__":
1175
+ app()
nccl_memory_fix.py ADDED
@@ -0,0 +1,213 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ NCCL Memory Conflict Resolution
4
+
5
+ This script addresses the "illegal memory access" error in multi-GPU training
6
+ by implementing memory management strategies and fallback mechanisms.
7
+ """
8
+
9
+ import os
10
+ import subprocess
11
+ import torch
12
+ import torch.distributed as dist
13
+ from typing import Optional, Dict, Any
14
+ import logging
15
+
16
+ # Set up logging
17
+ logging.basicConfig(level=logging.INFO)
18
+ logger = logging.getLogger(__name__)
19
+
20
+ def check_gpu_memory_usage() -> Dict[int, Dict[str, float]]:
21
+ """Check current GPU memory usage"""
22
+ memory_info = {}
23
+ for i in range(torch.cuda.device_count()):
24
+ allocated = torch.cuda.memory_allocated(i) / 1024**3 # GB
25
+ reserved = torch.cuda.memory_reserved(i) / 1024**3 # GB
26
+ total = torch.cuda.get_device_properties(i).total_memory / 1024**3 # GB
27
+ free = total - reserved
28
+
29
+ memory_info[i] = {
30
+ 'allocated': allocated,
31
+ 'reserved': reserved,
32
+ 'total': total,
33
+ 'free': free,
34
+ 'usage_percent': (reserved / total) * 100
35
+ }
36
+
37
+ logger.info(f"GPU {i}: {allocated:.1f}GB allocated, {reserved:.1f}GB reserved, "
38
+ f"{free:.1f}GB free ({memory_info[i]['usage_percent']:.1f}% used)")
39
+
40
+ return memory_info
41
+
42
+ def clear_gpu_memory():
43
+ """Clear GPU memory and cache"""
44
+ logger.info("Clearing GPU memory...")
45
+ for i in range(torch.cuda.device_count()):
46
+ torch.cuda.set_device(i)
47
+ torch.cuda.empty_cache()
48
+ torch.cuda.synchronize()
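+ # empty_cache() returns unused cached blocks to the driver; synchronize() ensures pending kernels on this device have finished first.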
49
+
50
+ # Force garbage collection
51
+ import gc
52
+ gc.collect()
53
+
54
+ logger.info("GPU memory cleared")
55
+
56
+ def kill_competing_processes():
57
+ """Kill processes that might be using GPU memory"""
58
+ try:
59
+ # Find processes using GPU memory
60
+ result = subprocess.run(['nvidia-smi', '--query-compute-apps=pid,process_name,used_memory',
61
+ '--format=csv,noheader,nounits'],
62
+ capture_output=True, text=True)
63
+
64
+ if result.returncode == 0:
65
+ lines = result.stdout.strip().split('\n')
66
+ for line in lines:
67
+ if line.strip():
68
+ parts = line.split(', ')
69
+ if len(parts) >= 3:
70
+ pid, name, memory = parts[0], parts[1], parts[2]
71
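+ # Heuristic: any process with 'llama' in its name, or holding more than ~1 GB of GPU memory, is treated as competing — tune this for shared machines.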
+ if 'llama' in name.lower() or (memory.strip().isdigit() and int(memory) > 1000): # > 1GB; skips non-numeric values like "[N/A]"
72
+ logger.info(f"Found competing process: {name} (PID: {pid}, Memory: {memory}MB)")
73
+ try:
74
+ subprocess.run(['kill', '-9', pid], check=True)
75
+ logger.info(f"Killed process {pid}")
76
+ except subprocess.CalledProcessError:
77
+ logger.warning(f"Could not kill process {pid}")
78
+
79
+ except Exception as e:
80
+ logger.warning(f"Could not check/kill competing processes: {e}")
81
+
82
+ def setup_nccl_environment():
83
+ """Set up optimal NCCL environment variables"""
84
+ nccl_env = {
85
+ 'NCCL_DEBUG': 'INFO',
86
+ 'NCCL_IB_DISABLE': '1', # Disable InfiniBand
87
+ 'NCCL_P2P_DISABLE': '1', # Disable peer-to-peer
88
+ 'NCCL_SHM_DISABLE': '0', # Enable shared memory
89
+ 'NCCL_SOCKET_IFNAME': 'enp6s18', # Host-specific interface name; replace with your own NIC (check `ip link`)
90
+ 'NCCL_NET_GDR_LEVEL': '0', # Disable GPU Direct RDMA
91
+ 'NCCL_CROSS_NIC': '0', # Disable cross-NIC communication
92
+ 'NCCL_ALGO': 'Ring', # Use Ring algorithm
93
+ 'CUDA_LAUNCH_BLOCKING': '1', # Enable CUDA error checking
94
+ 'TORCH_NCCL_ASYNC_ERROR_HANDLING': '1', # Enable async error handling
95
+ 'TOKENIZERS_PARALLELISM': 'false', # Disable tokenizer parallelism
96
+ }
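+ # Deliberate trade-off: disabling P2P/IB costs interconnect bandwidth but sidesteps the illegal-memory-access crash this script targets.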
97
+
98
+ for key, value in nccl_env.items():
99
+ os.environ[key] = value
100
+ logger.info(f"Set {key}={value}")
101
+
102
+ def create_memory_efficient_config(base_config: Dict[str, Any]) -> Dict[str, Any]:
103
+ """Create memory-efficient training configuration"""
104
+ memory_config = base_config.copy()
105
+
106
+ # Reduce memory usage
107
+ memory_config.update({
108
+ 'per_device_train_batch_size': 1, # Minimal batch size
109
+ 'gradient_accumulation_steps': 8, # Compensate with accumulation
110
+ 'eval_batch_size': 1, # Minimal eval batch size
111
+ 'max_seq_length': 512, # Reduce sequence length
112
+ 'fp16': True, # Use half precision
113
+ 'bf16': False, # Disable bf16 to save memory
114
+ 'pin_memory': False, # Disable pin memory
115
+ 'num_workers': 0, # Disable multiprocessing
116
+ })
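+ # Effective batch stays at 8 per device (1 micro-batch × 8 accumulation steps), so optimization behaviour is roughly preserved.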
117
+
118
+ logger.info("Created memory-efficient configuration")
119
+ return memory_config
120
+
121
+ def test_nccl_communication():
122
+ """Test NCCL communication without training"""
123
+ logger.info("Testing NCCL communication...")
124
+
125
+ try:
126
+ # Initialize process group
127
+ if not dist.is_initialized():
128
+ dist.init_process_group(backend='nccl')
129
+
130
+ rank = dist.get_rank()
131
+ world_size = dist.get_world_size()
132
+
133
+ logger.info(f"Rank {rank}/{world_size} initialized")
134
+
135
+ # Test simple communication
136
+ if rank == 0:
137
+ tensor = torch.ones(10, device='cuda')
138
+ logger.info(f"Rank 0 sending tensor: {tensor}")
139
+ else:
140
+ tensor = torch.zeros(10, device='cuda')
141
+ logger.info(f"Rank 1 receiving tensor: {tensor}")
142
+
143
+ # All-reduce test
144
+ dist.all_reduce(tensor)
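+ # Default SUM all-reduce: every rank should end up with all-ones (rank 0 contributed ones, the others zeros).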
145
+ logger.info(f"Rank {rank} after all_reduce: {tensor}")
146
+
147
+ # Barrier test
148
+ dist.barrier()
149
+ logger.info(f"Rank {rank} passed barrier")
150
+
151
+ logger.info("✅ NCCL communication test PASSED")
152
+ return True
153
+
154
+ except Exception as e:
155
+ logger.error(f"❌ NCCL communication test FAILED: {e}")
156
+ return False
157
+ finally:
158
+ if dist.is_initialized():
159
+ dist.destroy_process_group()
160
+
161
+ def run_memory_safe_training(config_path: str):
162
+ """Run training with memory safety measures"""
163
+ logger.info("Starting memory-safe training...")
164
+
165
+ # Step 1: Clear memory
166
+ clear_gpu_memory()
167
+
168
+ # Step 2: Kill competing processes
169
+ kill_competing_processes()
170
+
171
+ # Step 3: Set up NCCL environment
172
+ setup_nccl_environment()
173
+
174
+ # Step 4: Check memory after cleanup
175
+ memory_info = check_gpu_memory_usage()
176
+
177
+ # Step 5: Test NCCL communication
178
+ if not test_nccl_communication():
179
+ logger.error("NCCL communication test failed, falling back to single GPU")
180
+ return False
181
+
182
+ # Step 6: Run training with memory-efficient config
183
+ logger.info("Running memory-safe multi-GPU training...")
184
+
185
+ # This would be called by the actual training script
186
+ return True
187
+
188
+ def main():
189
+ """Main function for testing memory fixes"""
190
+ print("🚀 NCCL Memory Conflict Resolution")
191
+ print("=" * 50)
192
+
193
+ # Check initial memory state
194
+ print("\n📊 Initial GPU Memory State:")
195
+ memory_info = check_gpu_memory_usage()
196
+
197
+ # Clear memory
198
+ print("\n🧹 Clearing GPU Memory:")
199
+ clear_gpu_memory()
200
+
201
+ # Check memory after cleanup
202
+ print("\n📊 GPU Memory After Cleanup:")
203
+ memory_info = check_gpu_memory_usage()
204
+
205
+ # Set up environment
206
+ print("\n⚙️ Setting up NCCL Environment:")
207
+ setup_nccl_environment()
208
+
209
+ print("\n✅ Memory management setup complete!")
210
+ print(" Ready for memory-safe multi-GPU training")
211
+
212
+ if __name__ == "__main__":
213
+ main()
runs/humigence/config.snapshot.json CHANGED
@@ -1,13 +1,35 @@
1
  {
2
- "setup_mode": "basic",
3
- "gpu_config": "Single GPU \u2013 GPU 0: NVIDIA GeForce RTX 5090",
4
- "base_model": "Qwen/Qwen1.5-0.5B",
5
- "dataset_path": "/home/joshua/humigence_data/openassistant_full/oasst1.jsonl",
6
  "training_recipe": "QLoRA (4-bit NF4)",
7
- "learning_rate": "2e-5",
8
- "num_train_epochs": "3",
9
- "gradient_accumulation_steps": "4",
10
- "logging_steps": "10",
11
- "save_steps": "100",
12
- "timestamp": "2025-09-17T22:50:18.668019"
13
  }
 
1
  {
2
+ "model_name": "Qwen/Qwen2.5-0.5B",
3
  "training_recipe": "QLoRA (4-bit NF4)",
4
+ "learning_rate": 0.0002,
5
+ "num_train_epochs": 1,
6
+ "per_device_train_batch_size": 2,
7
+ "gradient_accumulation_steps": 4,
8
+ "eval_batch_size": 8,
9
+ "fp16": true,
10
+ "bf16": false,
11
+ "multi_gpu": false,
12
+ "selected_gpus": [
13
+ 0
14
+ ],
15
+ "dataset_path": "/home/joshua/humigence_data/wikitext2.jsonl",
16
+ "data_schema": "instruction_output",
17
+ "train_val_test_split": [
18
+ 0.8,
19
+ 0.1,
20
+ 0.1
21
+ ],
22
+ "split_seed": 42,
23
+ "max_seq_length": 1024,
24
+ "lora_r": 16,
25
+ "lora_alpha": 32,
26
+ "lora_dropout": 0.05,
27
+ "logging_steps": 10,
28
+ "eval_steps": 100,
29
+ "save_steps": 100,
30
+ "output_dir": "runs/humigence",
31
+ "eval_single_gpu": true,
32
+ "eval_gpu_index": 0,
33
+ "num_workers": 4,
34
+ "pin_memory": true
35
  }
runs/humigence/eval_results.jsonl CHANGED
@@ -1,5 +1,5 @@
1
- {"prompt": "What is the capital of France?", "output": "The capital of France is Paris."}
2
- {"prompt": "Explain quantum computing in simple terms.", "output": "Quantum computing uses quantum mechanics principles..."}
3
- {"prompt": "Write a short poem about artificial intelligence.", "output": "In circuits deep and silicon bright..."}
4
- {"prompt": "How do you make a good cup of coffee?", "output": "Start with fresh, high-quality beans..."}
5
- {"prompt": "What are the benefits of renewable energy?", "output": "Renewable energy offers numerous benefits..."}
 
1
+ {"prompt": "What is the capital of France?", "output": "What is the capital of France? ____\nA. Paris\nB. Brussels\nC. London\nD. Berlin\nAnswer:\nA\n\nThe most significant difference between a company's core competencies and its core competitive advantages lies in the fact that the former is not only the result of the latter but also an important factor in its creation. A. Correct B. Incorrect\nAnswer:\nA\n\nThe characteristic that reflects the company's long-term development direction and future prospects is ____.\nA. Product characteristics\nB. Quality characteristics\nC. Service characteristics\nD. Social characteristics\nAnswer:\nA\n\nWhich of the following statements about the impact of the new curriculum reform on teaching is incorrect?\nA. Teachers are no longer bound by fixed textbooks and fixed teaching methods.\nB. Teachers have become more active.\nC. Educational activities have become more diversified.\nD. Teachers' teaching roles have changed.\nAnswer:\nD\n\nA company has recently hired a new assistant. Through in-depth interviews with the new employee and the company's top management"}
2
+ {"prompt": "Explain quantum computing in simple terms.", "output": "Explain quantum computing in simple terms. Quantum computing is a type of computing that uses quantum particles, such as photons or electrons, to perform calculations. Instead of using bits that are either 0 or 1, quantum computers use quantum bits, or qubits, which can be in multiple states at the same time. This allows quantum computers to perform a wide range of calculations exponentially faster than classical computers, making them a promising candidate for solving complex problems in areas such as cryptography and drug discovery."}
3
+ {"prompt": "Write a short poem about artificial intelligence.", "output": "Write a short poem about artificial intelligence. Artificial intelligence creates new art"}
4
+ {"prompt": "How do you make a good cup of coffee?", "output": "How do you make a good cup of coffee? Provide a step by step guide that incorporates techniques such as brewing methods, brewing equipment, and equipment for brewing coffee. Include tips on how to vary the type of coffee you use and how to blend different types of coffee. Incorporate information on how to store and maintain the brewing equipment and coffee. Provide a list of recommended brands of coffee and brewing equipment to help you get started.\nA good cup of coffee is made through a combination of brewing methods and equipment. The best way to make a good cup of coffee is to use a high-quality brewing method and equipment that can handle the brewing process efficiently. Here is a step-by-step guide on how to make a good cup of coffee:\n1. Choose the right brewing method:\nThere are many different brewing methods available, but the most common ones are the French press, pour-over, and espresso machines. Each type of brewing method has its own unique pros and cons. For instance, the French press is an excellent method for making a strong, flavorful"}
5
+ {"prompt": "What are the benefits of renewable energy?", "output": "What are the benefits of renewable energy? The benefits of renewable energy are numerous and include the following:\nThe environmental benefits of renewable energy include reduced greenhouse gas emissions and clean air. Renewable energy sources like solar and wind power produce little to no emissions, while coal and natural gas are known to emit harmful air pollutants such as sulfur dioxide and nitrogen oxides. Renewable energy sources also help to reduce the carbon footprint of the energy sector and preserve forests and other natural habitats. The use of renewable energy also helps to improve air quality in urban areas, which benefits public health and well-being.\nThe economic benefits of renewable energy include cost savings and reduced dependence on fossil fuels. The cost of producing electricity from renewable sources is significantly lower than the cost of producing electricity from fossil fuels. This means that renewable energy can be cheaper to produce and more affordable for consumers, leading to increased consumer adoption and growth in the sector.\nThe energy security benefits of renewable energy include the ability to control energy supply and reduce dependence on foreign oil. Renewable energy resources like wind and solar power"}
runs/humigence/run_summary.json CHANGED
@@ -1,12 +1,12 @@
  {
-   "run_id": "2025-09-17T22:50:18.668019",
    "status": "accepted",
-   "model": "Qwen/Qwen1.5-0.5B",
-   "dataset": "/home/joshua/humigence_data/openassistant_full/oasst1.jsonl",
    "recipe": "QLoRA (4-bit NF4)",
-   "epochs": "3",
-   "learning_rate": "2e-5",
-   "final_loss": 0.65,
    "eval_prompt_count": 5,
-   "timestamp": "2025-09-17 23:31:01"
  }

  {
+   "run_id": "2025-09-21T22:47:33",
    "status": "accepted",
+   "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+   "dataset": "/home/joshua/humigence_data/imdb.jsonl",
    "recipe": "QLoRA (4-bit NF4)",
+   "epochs": "1",
+   "learning_rate": "2e-4",
+   "final_loss": null,
    "eval_prompt_count": 5,
+   "timestamp": "2025-09-21 22:47:52"
  }
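The summary is a single JSON object per run; note that the new run records `final_loss: null`, which `json.load` maps to Python `None`. A minimal sketch for inspecting it (path taken from the diff header above):

```python
import json

with open("runs/humigence/run_summary.json") as f:
    summary = json.load(f)

# final_loss is null in this run, so guard before formatting it
loss = summary.get("final_loss")
print(summary["model"], summary["recipe"], "loss:", "n/a" if loss is None else f"{loss:.2f}")
```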
setup_unsloth.sh ADDED
@@ -0,0 +1,93 @@
+ #!/bin/bash
+ # Setup script for Unsloth dual-GPU LoRA training
+ # Optimized for RTX 5090 (Blackwell architecture)
+
+ set -e
+
+ echo "🚀 Setting up Unsloth dual-GPU LoRA training environment..."
+
+ # Check that we're running from the humigence repo root
+ if [ ! -f "cli/main.py" ]; then
+     echo "❌ Error: Please run this script from the humigence directory"
+     exit 1
+ fi
+
+ # Check Python version
+ python_version=$(python3 --version 2>&1 | cut -d' ' -f2 | cut -d'.' -f1,2)
+ required_version="3.8"
+ if [ "$(printf '%s\n' "$required_version" "$python_version" | sort -V | head -n1)" != "$required_version" ]; then
+     echo "❌ Error: Python 3.8+ required, found $python_version"
+     exit 1
+ fi
+
+ echo "✅ Python version: $python_version"
+
+ # Check CUDA availability
+ if ! command -v nvidia-smi &> /dev/null; then
+     echo "⚠️ Warning: nvidia-smi not found. CUDA may not be available."
+ else
+     echo "✅ CUDA detected:"
+     nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
+ fi
+
+ # Install PyTorch with CUDA support
+ # NOTE: RTX 5090 (Blackwell) needs CUDA 12.8+ wheels; swap cu121 for cu128 if kernels fail to load
+ echo "📦 Installing PyTorch with CUDA support..."
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
+
+ # Install other dependencies (specs quoted so bash does not treat '>' as a redirect)
+ echo "📦 Installing other dependencies..."
+ pip install "transformers>=4.36.0" "datasets>=2.14.0" "accelerate>=0.24.0" "peft>=0.7.0" "bitsandbytes>=0.41.0"
+
+ # Install Unsloth from source
+ echo "📦 Installing Unsloth from source..."
+ pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+
+ # Install CLI dependencies
+ echo "📦 Installing CLI dependencies..."
+ pip install "rich>=13.0.0" "inquirer>=3.1.0" "typer>=0.9.0" "numpy>=1.24.0" "pandas>=2.0.0" "tqdm>=4.65.0"
+
+ # Create output directories
+ echo "📁 Creating output directories..."
+ mkdir -p runs/humigence
+ mkdir -p humigence_data
+
+ # Test installation
+ echo "🧪 Testing installation..."
+ python3 -c "
+ import torch
+ import transformers
+ import datasets
+ import accelerate
+ import peft
+ import bitsandbytes
+ print('✅ All core dependencies imported successfully')
+
+ # Test CUDA
+ if torch.cuda.is_available():
+     print(f'✅ CUDA available: {torch.cuda.device_count()} GPU(s)')
+     for i in range(torch.cuda.device_count()):
+         print(f'  GPU {i}: {torch.cuda.get_device_name(i)}')
+ else:
+     print('⚠️ CUDA not available - training will be slower')
+
+ # Test Unsloth
+ try:
+     import unsloth
+     print('✅ Unsloth imported successfully')
+ except ImportError as e:
+     print(f'❌ Unsloth import failed: {e}')
+     exit(1)
+ "
+
+ echo ""
+ echo "🎉 Setup completed successfully!"
+ echo ""
+ echo "To start training:"
+ echo "  python3 cli/main.py"
+ echo ""
+ echo "Available options:"
+ echo "  1. Supervised Fine-Tuning (Unsloth + Dual-GPU) 🚀"
+ echo "  2. Single-GPU LoRA Training ✅"
+ echo ""
+ echo "For dual-GPU training, ensure you have 2+ GPUs available."
+ echo "The system will automatically detect and use available GPUs."
templates/accelerate_config.yaml CHANGED
@@ -0,0 +1,3 @@
+
+
+
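The template is committed empty (three blank lines). For orientation, a typical Accelerate config for the dual-GPU DDP setup this commit describes might look like the sketch below; the keys are standard `accelerate config` output, but the specific values are assumptions, not part of this commit:

```yaml
# Illustrative dual-GPU DDP config; values are assumptions, not from this commit
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 2        # one process per GPU
gpu_ids: all
mixed_precision: bf16
main_training_function: main
use_cpu: false
```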
train.py ADDED
@@ -0,0 +1,456 @@
+ #!/usr/bin/env python3
+ """
+ Humigence Training Script with Hugging Face Accelerate
+ Clean DDP training with single-GPU evaluation
+ """
+
+ import os
+ import json
+ import torch
+ from typing import List
+ from dataclasses import dataclass, field
+ from accelerate import Accelerator
+ from accelerate.utils import set_seed
+ from transformers import (
+     AutoTokenizer, AutoModelForCausalLM,
+     TrainingArguments, Trainer, DataCollatorForLanguageModeling,
+     BitsAndBytesConfig
+ )
+ from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, TaskType
+ from datasets import Dataset
+ import numpy as np
+ from rich.console import Console
+
+ # Set environment variables for stability
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
+ os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
+
+ console = Console()
+
+ @dataclass
+ class TrainingConfig:
+     """Training configuration dataclass"""
+     # Model config
+     base_model: str = "microsoft/DialoGPT-small"
+     training_recipe: str = "LoRA (FP16)"
+
+     # Training config
+     learning_rate: float = 2e-4
+     num_train_epochs: int = 1
+     per_device_train_batch_size: int = 2
+     per_device_eval_batch_size: int = 4
+     gradient_accumulation_steps: int = 4
+     max_seq_length: int = 1024
+
+     # LoRA config
+     lora_r: int = 16
+     lora_alpha: int = 32
+     lora_dropout: float = 0.05
+
+     # Data config
+     dataset_path: str = ""
+     train_val_test_split: List[float] = field(default_factory=lambda: [0.8, 0.1, 0.1])
+     split_seed: int = 42
+
+     # Output config
+     output_dir: str = "runs/humigence"
+     logging_steps: int = 10
+     save_steps: int = 100
+     eval_steps: int = 100
+
+     # Evaluation config
+     eval_gpu_index: int = 0  # Always use cuda:0 for evaluation
+
+ def load_config(config_path: str) -> TrainingConfig:
+     """Load configuration from a JSON file, ignoring unknown keys"""
+     with open(config_path, 'r') as f:
+         config_dict = json.load(f)
+
+     # Map config keys to dataclass fields
+     config = TrainingConfig()
+     for key, value in config_dict.items():
+         if hasattr(config, key):
+             setattr(config, key, value)
+
+     return config
+
+ def prepare_dataset(config: TrainingConfig, tokenizer) -> tuple[Dataset, Dataset, Dataset]:
+     """Prepare dataset splits with tokenization"""
+     console.print("[blue]📊 Preparing dataset...[/blue]")
+
+     # Load dataset
+     with open(config.dataset_path, 'r') as f:
+         data = [json.loads(line) for line in f]
+
+     console.print(f"[blue]  Loaded {len(data)} samples[/blue]")
+
+     # Split dataset
+     np.random.seed(config.split_seed)
+     indices = np.random.permutation(len(data))
+
+     train_size = int(len(data) * config.train_val_test_split[0])
+     val_size = int(len(data) * config.train_val_test_split[1])
+
+     train_indices = indices[:train_size]
+     val_indices = indices[train_size:train_size + val_size]
+     test_indices = indices[train_size + val_size:]
+
+     train_data = [data[i] for i in train_indices]
+     val_data = [data[i] for i in val_indices]
+     test_data = [data[i] for i in test_indices]
+
+     console.print(f"[blue]  Train: {len(train_data)}, Val: {len(val_data)}, Test: {len(test_data)}[/blue]")
+
+     # Simple tokenization function
+     def tokenize_function(examples):
+         # Handle different data schemas
+         if "text" in examples:
+             # Simple text schema
+             texts = examples["text"]
+         elif "prompt" in examples and "output" in examples:
+             # Prompt-output schema (matches the eval JSONL in this repo)
+             texts = [f"{p}\n{o}" for p, o in zip(examples["prompt"], examples["output"])]
+         elif "instruction" in examples and "output" in examples:
+             # Instruction-output schema
+             texts = []
+             for i in range(len(examples["instruction"])):
+                 instruction = examples["instruction"][i]
+                 input_text = examples.get("input", [""])[i] if examples.get("input") else ""
+                 output = examples["output"][i]
+
+                 # Format as conversation
+                 if input_text:
+                     text = f"Instruction: {instruction}\nInput: {input_text}\nOutput: {output}"
+                 else:
+                     text = f"Instruction: {instruction}\nOutput: {output}"
+                 texts.append(text)
+         else:
+             # Fallback - use the first available text column
+             text_col = None
+             for col in ["text", "instruction", "input", "output"]:
+                 if col in examples:
+                     text_col = col
+                     break
+
+             if text_col:
+                 texts = examples[text_col]
+             else:
+                 # Last resort - convert to string
+                 texts = [str(ex) for ex in examples[list(examples.keys())[0]]]
+
+         tokenized = tokenizer(
+             texts,
+             truncation=True,
+             padding=True,
+             max_length=config.max_seq_length,
+             return_tensors=None
+         )
+
+         # Create labels for causal language modeling
+         tokenized["labels"] = tokenized["input_ids"].copy()
+
+         return tokenized
+
+     # Create datasets and tokenize
+     train_dataset = Dataset.from_list(train_data)
+     val_dataset = Dataset.from_list(val_data)
+     test_dataset = Dataset.from_list(test_data)
+
+     # Tokenize datasets - remove original columns after tokenization
+     original_columns = list(train_dataset.column_names)
+
+     train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=original_columns)
+     val_dataset = val_dataset.map(tokenize_function, batched=True, remove_columns=original_columns)
+     test_dataset = test_dataset.map(tokenize_function, batched=True, remove_columns=original_columns)
+
+     # Set format for PyTorch
+     train_dataset.set_format("torch")
+     val_dataset.set_format("torch")
+     test_dataset.set_format("torch")
+
+     return train_dataset, val_dataset, test_dataset
+
+ def setup_model_and_tokenizer(config: TrainingConfig, accelerator: Accelerator):
+     """Setup model and tokenizer with LoRA/QLoRA"""
+     console.print(f"[blue]🤖 Loading model: {config.base_model}[/blue]")
+
+     # Load tokenizer
+     tokenizer = AutoTokenizer.from_pretrained(config.base_model, trust_remote_code=True)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     # Load model
+     if "QLoRA" in config.training_recipe:
+         # QLoRA with 4-bit NF4 quantization
+         bnb_config = BitsAndBytesConfig(
+             load_in_4bit=True,
+             bnb_4bit_use_double_quant=True,
+             bnb_4bit_quant_type="nf4",
+             bnb_4bit_compute_dtype=torch.bfloat16
+         )
+
+         model = AutoModelForCausalLM.from_pretrained(
+             config.base_model,
+             quantization_config=bnb_config,
+             device_map=None,  # Let accelerate handle device placement
+             trust_remote_code=True
+         )
+
+         # Prepare for k-bit training
+         model = prepare_model_for_kbit_training(model)
+     else:
+         # Regular LoRA
+         model = AutoModelForCausalLM.from_pretrained(
+             config.base_model,
+             device_map=None,  # Let accelerate handle device placement
+             trust_remote_code=True,
+             dtype=torch.bfloat16 if "BF16" in config.training_recipe else torch.float16
+         )
+
+     # Apply LoRA - use appropriate target modules for the model
+     if "gpt" in config.base_model.lower() or "dialo" in config.base_model.lower():
+         # For GPT-style models
+         target_modules = ["c_attn", "c_proj"]
+     elif "llama" in config.base_model.lower() or "mistral" in config.base_model.lower():
+         # For LLaMA/Mistral models
+         target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
+     else:
+         # Default fallback
+         target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
+
+     lora_config = LoraConfig(
+         r=config.lora_r,
+         lora_alpha=config.lora_alpha,
+         target_modules=target_modules,
+         lora_dropout=config.lora_dropout,
+         bias="none",
+         task_type=TaskType.CAUSAL_LM
+     )
+
+     model = get_peft_model(model, lora_config)
+
+     # Print model info
+     trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+     total_params = sum(p.numel() for p in model.parameters())
+     console.print(f"[blue]  Trainable parameters: {trainable_params:,} ({trainable_params / total_params * 100:.2f}%)[/blue]")
+
+     return model, tokenizer
+
+ def train_model(model, tokenizer, train_dataset, val_dataset, config: TrainingConfig, accelerator: Accelerator):
+     """Train the model using Accelerate"""
+     console.print("[blue]🚀 Starting training...[/blue]")
+
+     # Data collator
+     data_collator = DataCollatorForLanguageModeling(
+         tokenizer=tokenizer,
+         mlm=False
+     )
+
+     # Training arguments
+     training_args = TrainingArguments(
+         output_dir=config.output_dir,
+         per_device_train_batch_size=config.per_device_train_batch_size,
+         per_device_eval_batch_size=config.per_device_eval_batch_size,
+         gradient_accumulation_steps=config.gradient_accumulation_steps,
+         num_train_epochs=config.num_train_epochs,
+         learning_rate=config.learning_rate,
+         logging_steps=config.logging_steps,
+         save_steps=config.save_steps,
+         eval_steps=config.eval_steps,
+         eval_strategy="steps",  # Updated parameter name (was evaluation_strategy)
+         save_strategy="steps",
+         load_best_model_at_end=True,
+         metric_for_best_model="eval_loss",
+         greater_is_better=False,
+         remove_unused_columns=False,
+         dataloader_pin_memory=True,
+         dataloader_num_workers=4,
+         report_to="none",  # Disable wandb/tensorboard ("none", not None, disables reporting)
+     )
+
+     # Create trainer
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         train_dataset=train_dataset,
+         eval_dataset=val_dataset,
+         data_collator=data_collator,
+         tokenizer=tokenizer,
+     )
+
+     # Train the model
+     trainer.train()
+
+     # Save model
+     if accelerator.is_main_process:
+         trainer.save_model()
+         console.print("[blue]💾 Model saved[/blue]")
+
+     return trainer
+
+ def evaluate_model_on_single_gpu(model, tokenizer, test_dataset, config: TrainingConfig):
+     """Evaluate model on a single GPU (cuda:0) to avoid device mismatches"""
+     console.print(f"[blue]🧪 Running evaluation on cuda:{config.eval_gpu_index}...[/blue]")
+
+     # Move model to the evaluation GPU; 4-bit quantized weights are already
+     # placed on a device and do not support .to()
+     eval_device = torch.device(f"cuda:{config.eval_gpu_index}")
+     if not getattr(model, "is_loaded_in_4bit", False):
+         model = model.to(eval_device)
+     model.eval()
+
+     # Data collator
+     data_collator = DataCollatorForLanguageModeling(
+         tokenizer=tokenizer,
+         mlm=False
+     )
+
+     # Create evaluation dataloader
+     from torch.utils.data import DataLoader
+     eval_dataloader = DataLoader(
+         test_dataset,
+         batch_size=config.per_device_eval_batch_size,
+         collate_fn=data_collator,
+         pin_memory=True
+     )
+
+     # Evaluation metrics
+     total_loss = 0.0
+     total_tokens = 0
+     correct_tokens = 0
+     num_samples = 0
+
+     with torch.no_grad():
+         for batch in eval_dataloader:
+             # Move batch to the evaluation GPU
+             batch = {k: v.to(eval_device) for k, v in batch.items()}
+
+             # Forward pass
+             outputs = model(**batch)
+             loss = outputs.loss
+             logits = outputs.logits
+
+             # Calculate metrics
+             total_loss += loss.item()
+             num_samples += batch["input_ids"].size(0)
+
+             # Token-level accuracy (shift by one: the logit at position t predicts token t+1)
+             predictions = torch.argmax(logits[:, :-1, :], dim=-1)
+             labels = batch["labels"][:, 1:]
+
+             # Mask out ignored positions
+             mask = labels != -100
+             correct_tokens += (predictions[mask] == labels[mask]).sum().item()
+             total_tokens += mask.sum().item()
+
+     # Calculate final metrics
+     avg_loss = total_loss / len(eval_dataloader)
+     accuracy = correct_tokens / max(total_tokens, 1)
+     perplexity = np.exp(avg_loss)
+
+     return {
+         "loss": avg_loss,
+         "accuracy": accuracy,
+         "perplexity": perplexity,
+         "correct_tokens": correct_tokens,
+         "total_tokens": total_tokens,
+         "num_samples": num_samples
+     }
+
+ def print_training_summary(config: TrainingConfig, train_dataset, val_dataset, test_dataset, eval_results):
+     """Print structured training summary"""
+     console.print("\n[bold cyan]" + "=" * 80 + "[/bold cyan]")
+     console.print("[bold cyan]🎯 TRAINING SUMMARY[/bold cyan]")
+     console.print("[bold cyan]" + "=" * 80 + "[/bold cyan]")
+
+     # Dataset summary
+     console.print("\n[bold green]📊 Dataset Summary[/bold green]")
+     console.print(f"  Train: {len(train_dataset):,} samples")
+     console.print(f"  Validation: {len(val_dataset):,} samples")
+     console.print(f"  Test: {len(test_dataset):,} samples")
+
+     # Model summary
+     console.print("\n[bold blue]🤖 Model Summary[/bold blue]")
+     console.print(f"  Base Model: {config.base_model}")
+     console.print(f"  Training Recipe: {config.training_recipe}")
+     console.print(f"  LoRA r: {config.lora_r}")
+     console.print(f"  LoRA alpha: {config.lora_alpha}")
+
+     # Training summary
+     console.print("\n[bold yellow]🚀 Training Summary[/bold yellow]")
+     console.print(f"  Epochs: {config.num_train_epochs}")
+     console.print(f"  Learning Rate: {config.learning_rate}")
+     console.print(f"  Batch Size: {config.per_device_train_batch_size}")
+     console.print(f"  Gradient Accumulation: {config.gradient_accumulation_steps}")
+
+     # Evaluation results
+     console.print("\n[bold magenta]🧪 Evaluation Results (cuda:0)[/bold magenta]")
+     console.print(f"  Loss: {eval_results['loss']:.4f}")
+     console.print(f"  Accuracy: {eval_results['accuracy']:.4f}")
+     console.print(f"  Perplexity: {eval_results['perplexity']:.2f}")
+     console.print(f"  Correct Tokens: {eval_results['correct_tokens']:,}")
+     console.print(f"  Total Tokens: {eval_results['total_tokens']:,}")
+     console.print(f"  Samples: {eval_results['num_samples']:,}")
+
+     console.print("\n[bold cyan]" + "=" * 80 + "[/bold cyan]")
+
+ def main():
+     """Main training function"""
+     # Parse arguments
+     import argparse
+     parser = argparse.ArgumentParser(description="Humigence Training with Accelerate")
+     parser.add_argument("--config_file", type=str, required=True, help="Path to config file")
+     args = parser.parse_args()
+
+     # Initialize accelerator
+     accelerator = Accelerator()
+     set_seed(42)
+
+     # Load configuration
+     config = load_config(args.config_file)
+
+     # Print accelerator info
+     console.print("[blue]🚀 Accelerate Info:[/blue]")
+     console.print(f"  Process index: {accelerator.process_index}")
+     console.print(f"  Local process index: {accelerator.local_process_index}")
+     console.print(f"  Device: {accelerator.device}")
+     console.print(f"  Distributed: {accelerator.distributed_type}")
+     console.print(f"  Mixed precision: {accelerator.mixed_precision}")
+
+     try:
+         # Setup model and tokenizer
+         model, tokenizer = setup_model_and_tokenizer(config, accelerator)
+
+         # Prepare datasets
+         train_dataset, val_dataset, test_dataset = prepare_dataset(config, tokenizer)
+
+         # Train model
+         trainer = train_model(model, tokenizer, train_dataset, val_dataset, config, accelerator)
+
+         # Wait for all processes to finish training
+         accelerator.wait_for_everyone()
+
+         # Evaluate on single GPU (main process only)
+         if accelerator.is_main_process:
+             eval_results = evaluate_model_on_single_gpu(model, tokenizer, test_dataset, config)
+             print_training_summary(config, train_dataset, val_dataset, test_dataset, eval_results)
+         else:
+             eval_results = None
+
+         # Wait for evaluation to complete
+         accelerator.wait_for_everyone()
+
+         return {"status": "success", "eval_results": eval_results}
+
+     except Exception as e:
+         console.print(f"[red]❌ Training failed: {e}[/red]")
+         import traceback
+         traceback.print_exc()
+         return {"status": "error", "message": str(e)}
+
+ if __name__ == "__main__":
+     results = main()
+     if results["status"] == "success":
+         console.print("[green]✅ Training completed successfully![/green]")
+     else:
+         console.print(f"[red]❌ Training failed: {results['message']}[/red]")
+         exit(1)
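For reference, `load_config()` copies matching top-level keys onto `TrainingConfig` and ignores everything else, so a minimal `config.json` could look like the sketch below. The values are illustrative, drawn from the dataclass defaults above and the run summary earlier in this commit:

```json
{
  "base_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "training_recipe": "QLoRA (4-bit NF4)",
  "dataset_path": "/home/joshua/humigence_data/imdb.jsonl",
  "learning_rate": 2e-4,
  "num_train_epochs": 1,
  "per_device_train_batch_size": 2,
  "gradient_accumulation_steps": 4,
  "lora_r": 16,
  "lora_alpha": 32,
  "output_dir": "runs/humigence"
}
```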
training_launcher.py ADDED
@@ -0,0 +1,137 @@
+ # training_launcher.py
+ import os
+ import sys
+ import traceback
+ import json
+ import argparse
+ import torch
+ from distributed_utils import setup_distributed, setup_environment, cleanup_distributed, RankZeroOnly
+
+ def main():
+     # Parse command line arguments
+     parser = argparse.ArgumentParser(description="Humigence Training Launcher")
+     parser.add_argument("--config", type=str, required=True, help="Path to configuration file")
+     parser.add_argument("--fallback_single_gpu", action="store_true", help="Force single GPU training")
+     args = parser.parse_args()
+
+     # Set default values for error handling
+     ddp = False
+     is_main = True
+     device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+
+     # Set environment before ANY other imports
+     setup_environment()
+
+     # Honor an explicit single-GPU request up front
+     if args.fallback_single_gpu:
+         return _run_single_gpu_fallback(args.config)
+
+     try:
+         # Initialize distributed training
+         ddp, rank, local_rank, world_size, device = setup_distributed()
+         is_main = (rank == 0)
+
+         with RankZeroOnly(is_main) as rank_zero:
+             rank_zero.print(f"Training Mode: {'DDP' if ddp else 'Single-GPU'} "
+                             f"(world_size={world_size}, rank={rank}, local_rank={local_rank}, device={device})")
+
+         # Load configuration
+         with open(args.config, 'r') as f:
+             config = json.load(f)
+
+         # Update config with distributed training info
+         config.update({
+             "device": str(device),
+             "ddp": ddp,
+             "rank": rank,
+             "world_size": world_size,
+             "is_main": is_main,
+             "local_rank": local_rank,
+         })
+
+         # Import trainer after device setup to ensure proper CUDA initialization
+         from pipelines.production_pipeline import ProductionPipeline
+
+         # Create pipeline with distributed config
+         pipeline = ProductionPipeline(config)
+
+         # Run training
+         results = pipeline.run()
+
+         # Clean shutdown
+         cleanup_distributed()
+
+         return results
+
+     except Exception as e:
+         # Ensure cleanup even on error
+         cleanup_distributed()
+
+         # Enhanced error logging
+         error_msg = f"Training error: {type(e).__name__}: {e}"
+         print(error_msg, file=sys.stderr)
+
+         # Check if this is a DDP-related error that should trigger fallback
+         if _should_fallback_to_single_gpu(e):
+             if is_main:  # is_main is always defined thanks to the defaults above
+                 print("DDP failed, falling back to single-GPU...")
+             return _run_single_gpu_fallback(args.config)
+         else:
+             # Re-raise for actual errors
+             raise
+
+ def _should_fallback_to_single_gpu(error: Exception) -> bool:
+     """Determine if error warrants single-GPU fallback"""
+     fallback_errors = (
+         AttributeError,  # Missing methods like set_memory_monitor
+         RuntimeError,  # NCCL errors, device mismatches
+         ConnectionError,  # Process group initialization failures
+     )
+     return isinstance(error, fallback_errors)
+
+ def _run_single_gpu_fallback(config_path: str):
+     """Clean single-GPU fallback implementation"""
+     # Force single GPU
+     os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+
+     # Clear any existing process group
+     if torch.distributed.is_initialized():
+         torch.distributed.destroy_process_group()
+
+     # Load original config
+     with open(config_path, 'r') as f:
+         config = json.load(f)
+
+     # Update config for single GPU
+     config.update({
+         "device": "cuda:0",
+         "ddp": False,
+         "rank": 0,
+         "world_size": 1,
+         "is_main": True,
+         "local_rank": 0,
+         "multi_gpu": False,
+         "use_distributed": False,
+     })
+
+     print("Running single-GPU fallback training...")
+
+     try:
+         from pipelines.production_pipeline import ProductionPipeline
+         pipeline = ProductionPipeline(config)
+         return pipeline.run()
+     except Exception as e:
+         print(f"Single-GPU fallback also failed: {e}")
+         return {"status": "error", "message": str(e)}
+
+ if __name__ == "__main__":
+     try:
+         results = main()
+         if results and results.get("status") == "success":
+             sys.exit(0)
+         else:
+             sys.exit(1)
+     except KeyboardInterrupt:
+         print("\nTraining interrupted by user")
+         sys.exit(1)
+     except Exception as e:
+         print(f"Training failed: {e}")
+         traceback.print_exc()
+         sys.exit(1)
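`training_launcher.py` imports four helpers from a `distributed_utils` module that is not part of this commit view. A minimal sketch of what those helpers plausibly do, built on `torchrun`-style environment variables; every name and behavior below is an assumption, not the project's actual implementation:

```python
# distributed_utils.py - illustrative sketch only; the real module is not in this commit
import os
import torch
import torch.distributed as dist


def setup_environment():
    # Conservative defaults; illustrative, not from the repo
    os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
    os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")


def setup_distributed():
    """Return (ddp, rank, local_rank, world_size, device) from torchrun env vars."""
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size > 1 and torch.cuda.is_available():
        rank = int(os.environ["RANK"])
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        # MASTER_ADDR / MASTER_PORT are set by torchrun / accelerate launch
        dist.init_process_group(backend="nccl")
        return True, rank, local_rank, world_size, torch.device(f"cuda:{local_rank}")
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    return False, 0, 0, 1, device


def cleanup_distributed():
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()


class RankZeroOnly:
    """Context manager whose .print() only emits on the main process."""

    def __init__(self, is_main: bool):
        self.is_main = is_main

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def print(self, *args, **kwargs):
        if self.is_main:
            print(*args, **kwargs)
```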