Commit 36ac84e · 1 Parent(s): 7275aef
lilbablo committed

chore: initial public release of Humigence with dual-GPU & CLI wizard

HUMIGENCE_COMMAND_READY.md ADDED
@@ -0,0 +1,119 @@
# Humigence Command - Ready to Use! 🚀

## ✅ **Yes, you can launch this with "humigence"!**

The Humigence training pipeline has been successfully refactored and is now ready to use with the `humigence` command.

## 🎯 **How to Use**

### **Launch Humigence CLI**
```bash
humigence
```

### **What You'll See**
```
──────────────── Humigence — Your AI. Your pipeline. Zero code. ────────────────
A complete MLOps suite built for makers, teams, and enterprises.

Options:
1. Supervised Fine-Tuning ✅
2. RAG Implementation (coming soon)
3. EnterpriseGPT (coming soon)
4. Batch Inference (coming soon)
5. Context Length (coming soon)
6. Exit

Select an option:
```

### **Training Options**
1. **Select "1. Supervised Fine-Tuning"**
2. **Choose Setup Mode**: Basic or Advanced
3. **Select Model**: TinyLlama, Qwen, Phi-2, etc.
4. **Choose Training Recipe**: LoRA, QLoRA, etc.
5. **Select Dataset**: Your available datasets
6. **Choose Training Mode**: Multi-GPU or Single-GPU
7. **Confirm Configuration**: Review and start training

## 🚀 **What's New (Accelerate Refactor)**

### **Clean Architecture**
- **Hugging Face Accelerate**: Stable DDP training
- **Single-GPU Evaluation**: Always on cuda:0
- **No More NCCL Errors**: Robust distributed training
- **Clean Code**: Removed over-engineering

### **Key Features**
- ✅ **Multi-GPU Training**: 2× RTX 5090s support
- ✅ **Single-GPU Fallback**: Automatic fallback if needed
- ✅ **LoRA/QLoRA Support**: Parameter-efficient fine-tuning
- ✅ **Structured Logging**: Clean, readable output
- ✅ **Error Handling**: Robust error management

## 📋 **Training Modes**

### **Multi-GPU Training (Recommended)**
- Uses `accelerate launch` with 2× RTX 5090s
- Stable DDP training with NCCL backend
- Automatic device management
- Mixed precision (bf16/fp16)

### **Single-GPU Training**
- Uses `python train.py` for single GPU
- Fallback option if multi-GPU fails
- Same functionality, single device

## 🎯 **Usage Examples**

### **Interactive CLI**
```bash
humigence
# Select option 1
# Choose Multi-GPU Training
# Follow the configuration wizard
```

### **Direct Training (Advanced)**
```bash
# Multi-GPU
accelerate launch --config_file accelerate_config.yaml train.py --config_file config.json

# Single-GPU
python train.py --config_file config.json
```
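
For orientation, the core pattern behind an Accelerate-based `train.py` looks roughly like this (a minimal, self-contained sketch with a toy model and data, not the actual Humigence script):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # mixed_precision="bf16" can be passed per your config

# Toy model and data stand in for the real model/dataset
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=4)

# prepare() moves everything to the right device and wraps the model for DDP
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for x, y in loader:
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)  # handles cross-GPU gradient sync when launched on 2 GPUs
    optimizer.step()
    optimizer.zero_grad()
```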

## 🔧 **Technical Details**

### **Files Created/Updated**
- **`train.py`** - Clean Accelerate-based training script
- **`accelerate_config.yaml`** - Multi-GPU configuration
- **`cli/main.py`** - Updated CLI integration
- **`humigence`** - Command-line entry point

### **Dependencies**
- **Hugging Face Accelerate** - Distributed training
- **Transformers** - Model loading and training
- **PEFT** - LoRA/QLoRA support
- **Rich** - Beautiful CLI interface

## 🎉 **Ready to Use!**

The Humigence training pipeline is now:
- ✅ **Refactored** with Hugging Face Accelerate
- ✅ **Tested** and working correctly
- ✅ **Installed** as the `humigence` command
- ✅ **Ready** for production use

**Just run `humigence` and start training!** 🚀

## 📊 **What You Get**

1. **Clean CLI Interface** - Easy to use
2. **Stable Multi-GPU Training** - No more NCCL errors
3. **Single-GPU Evaluation** - No device mismatches
4. **Structured Reporting** - Clear training summaries
5. **Error Handling** - Robust error management
6. **Production Ready** - Works with your 2× RTX 5090s

**The refactored Humigence pipeline is ready for your AI training needs!** 🎯
MULTI_GPU_TRAINING_README.md ADDED
@@ -0,0 +1,167 @@
# Multi-GPU Training with 2× RTX 5090s

## 🚀 **Quick Start**

### **Multi-GPU Training (Recommended)**
```bash
torchrun --nproc_per_node=2 train.py --config runs/humigence/config.snapshot.json
```

### **Single GPU Training (Fallback)**
```bash
python train.py --config runs/humigence/config.snapshot.json --fallback_single_gpu
```

## 🔧 **Features**

### **Multi-GPU Support**
- ✅ **NCCL Backend**: Stable distributed training
- ✅ **2× RTX 5090s**: Full utilization of both GPUs
- ✅ **Automatic Detection**: Detects available GPUs
- ✅ **Process Synchronization**: Proper rank management

### **Environment Hardening**
- ✅ **NCCL Debug**: `NCCL_DEBUG=INFO` for troubleshooting
- ✅ **IB Disabled**: `NCCL_IB_DISABLE=1` prevents InfiniBand issues
- ✅ **P2P Disabled**: `NCCL_P2P_DISABLE=1` prevents peer-to-peer issues
- ✅ **Async Error Handling**: `NCCL_ASYNC_ERROR_HANDLING=1` for better error handling
- ✅ **Tokenizer Safety**: `TOKENIZERS_PARALLELISM=false` prevents fork warnings

### **Graceful Fallback**
- ✅ **Automatic Fallback**: Falls back to single GPU if multi-GPU fails
- ✅ **Clear Warnings**: Shows when fallback is triggered
- ✅ **No Data Loss**: Training continues seamlessly
- ✅ **Error Recovery**: Handles NCCL errors gracefully

### **Device Consistency**
- ✅ **Training**: Each process uses its local rank device (`cuda:local_rank`)
- ✅ **Evaluation**: Always uses `cuda:0` for a fresh model reload (see the sketch below)
- ✅ **No Mixing**: No tensors mixed between `cuda:0` and `cuda:1`
- ✅ **Synchronization**: Proper process synchronization

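To make these rules concrete, here is a minimal sketch of the device-selection pattern (helper names are illustrative, not the launcher's actual internals):

```python
import os
import torch

def training_device() -> torch.device:
    # Each DDP process trains on the GPU matching its local rank.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    return torch.device(f"cuda:{local_rank}")

def evaluation_device() -> torch.device:
    # Evaluation always reloads the model fresh on cuda:0, so eval
    # tensors never mix with training tensors living on cuda:1.
    return torch.device("cuda:0")
```
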
## 📊 **Training Modes**

### **Multi-GPU Mode (Default)**
```
🚀 Starting multi-GPU training on 2 GPUs
✅ Distributed training initialized
Rank: 0/1
Local Rank: 0
World Size: 2
Device: cuda:0

🚀 Starting multi-GPU training on 2 GPUs
✅ Distributed training initialized
Rank: 1/1
Local Rank: 1
World Size: 2
Device: cuda:1
```

### **Single GPU Fallback**
```
⚠️ NCCL multi-GPU failed, falling back to single GPU training on cuda:0.
🚀 Starting single GPU training on cuda:0
✅ Single GPU Training: cuda:0
Device: cuda:0
```

## 🛠️ **Configuration**

### **Multi-GPU Configuration**
The launcher automatically sets:
```python
config = {
    "distributed": True,
    "rank": 0,  # or 1
    "world_size": 2,
    "local_rank": 0,  # or 1
    "device": "cuda:0",  # or "cuda:1"
    "per_device_train_batch_size": 4,  # Per GPU
    "per_device_eval_batch_size": 8,  # Per GPU
}
```

### **Single GPU Configuration**
```python
config = {
    "distributed": False,
    "rank": 0,
    "world_size": 1,
    "local_rank": 0,
    "device": "cuda:0",
    "per_device_train_batch_size": 8,  # Doubled for single GPU
    "per_device_eval_batch_size": 16,  # Doubled for single GPU
}
```

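The doubled single-GPU batch sizes keep the effective batch size identical across the two modes; a quick arithmetic check (gradient accumulation assumed to be 1 here):

```python
def effective_batch_size(per_device: int, world_size: int, grad_accum: int = 1) -> int:
    # Samples consumed per optimizer step across all processes
    return per_device * world_size * grad_accum

# Multi-GPU: 4 per device × 2 GPUs == Single GPU: 8 per device × 1 GPU
assert effective_batch_size(4, world_size=2) == effective_batch_size(8, world_size=1) == 8
```
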
## 🔍 **Troubleshooting**

### **Common Issues**

1. **NCCL Initialization Failed**
   ```
   ❌ Distributed training initialization failed: NCCL Error
   ⚠️ NCCL multi-GPU failed, falling back to single GPU training on cuda:0.
   ```
   **Solution**: This is expected behavior. The launcher will automatically fall back to single GPU.

2. **CUDA Out of Memory**
   ```
   ❌ CUDA out of memory
   ```
   **Solution**: Reduce `per_device_train_batch_size` in your config.

3. **Device Mismatch**
   ```
   ❌ Expected all tensors to be on the same device
   ```
   **Solution**: This should not happen with the new launcher. If it does, check that evaluation is using a fresh model reload.

### **Debug Mode**
Set environment variables for debugging:
```bash
export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
torchrun --nproc_per_node=2 train.py --config runs/humigence/config.snapshot.json
```

## 📈 **Performance**

### **Multi-GPU Benefits**
- **~2× Training Speed**: Close to a 2× speedup on this two-GPU setup
- **Larger Batch Sizes**: Can use larger effective batch sizes
- **Better Convergence**: Larger effective batches can improve training stability
- **Memory Efficiency**: Distributes memory across GPUs

### **Single GPU Fallback**
- **Reliable**: Always works if multi-GPU fails
- **Simpler**: Easier to debug issues
- **Compatible**: Works with any setup

## 🎯 **Best Practices**

1. **Always use the launcher**: Don't run training directly
2. **Check GPU availability**: Ensure both GPUs are visible
3. **Monitor memory usage**: Watch for OOM errors
4. **Use appropriate batch sizes**: Start small and increase
5. **Check logs**: Look for NCCL warnings or errors

## 🚨 **Important Notes**

- **Evaluation always uses cuda:0**: Fresh model reload ensures device consistency
- **Training uses local rank devices**: Each process uses its assigned GPU
- **No tensor mixing**: Tensors never cross between cuda:0 and cuda:1
- **Automatic fallback**: If multi-GPU fails, single GPU training continues
- **Process synchronization**: All processes are properly synchronized

## 🎉 **Summary**

The new training launcher provides:
- **Robust multi-GPU training** with NCCL
- **Graceful fallback** to single GPU
- **Device consistency** throughout training and evaluation
- **Professional logging** and error handling
- **Fool-proof operation** with automatic error recovery

No more `cuda:0` vs `cuda:1` mismatches, no deadlocks, no NCCL crashes without fallback! 🚀
README_LORA_TRAINING.md ADDED
@@ -0,0 +1,253 @@
# Humigence LoRA Training System

A robust, single-GPU LoRA fine-tuning solution that works exactly like the fixed script, but generalized to all models supported by Humigence.

## 🚀 Quick Start

### Via Humigence CLI (Recommended)
```bash
# Interactive wizard with auto-detection
humigence
# Select option 2: Single-GPU LoRA Training
# The wizard will auto-detect models, datasets, and create output directories

# Direct command (for advanced users)
python3 cli/train_lora_cli.py --model meta-llama/Meta-Llama-3-8B-Instruct --output-dir ./out_lora
```

### Via Accelerate (Alternative)
```bash
accelerate launch --num_processes=1 cli/train_lora_single.py --model meta-llama/Meta-Llama-3-8B-Instruct --output-dir ./out_lora
```

## ✨ Key Features

- ✅ **Interactive Wizard** with auto-detection of models and datasets
- ✅ **Single GPU training** (safe default)
- ✅ **bf16 precision** where supported
- ✅ **Proper gradient flow** (no loss=None errors)
- ✅ **PEFT/LoRA integration** with correct target modules
- ✅ **Gradient checkpointing** enabled
- ✅ **Support for multiple models** (LLaMA, Mistral, Phi-2, etc.)
- ✅ **Comprehensive error handling** and validation
- ✅ **Auto-generated output directories** with meaningful names
- ✅ **LoRA configuration presets** for different use cases
- ✅ **Rich progress tracking** and logging

## 🧠 Supported Models

| Model Family | Example | Target Modules |
|--------------|---------|----------------|
| **LLaMA** | `meta-llama/Meta-Llama-3-8B-Instruct` | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Mistral** | `mistralai/Mistral-7B-Instruct-v0.1` | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Phi** | `microsoft/Phi-2` | `q_proj`, `k_proj`, `v_proj`, `dense` |
| **TinyLlama** | `TinyLlama/TinyLlama-1.1B-Chat-v1.0` | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Qwen** | `Qwen/Qwen1.5-0.5B` | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |

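The table above corresponds to the `get_model_target_modules()` helper mentioned under Contributing; a minimal sketch of how such a lookup might work (the actual implementation may differ):

```python
def get_model_target_modules(model_name: str) -> list:
    """Pick LoRA target modules from the model name (illustrative sketch)."""
    name = model_name.lower()
    if "phi" in name:
        return ["q_proj", "k_proj", "v_proj", "dense"]
    if "gpt" in name:
        return ["c_attn", "c_proj"]
    # LLaMA, Mistral, TinyLlama, Qwen: all attention and MLP projections
    return ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
```
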
## 🧙‍♂️ Interactive Wizard Features

The LoRA training wizard provides:

### 🔍 Auto-Detection
- **Models**: Scans the Hugging Face cache and provides popular model options
- **Datasets**: Detects local datasets and offers popular Hugging Face datasets
- **System Info**: Shows GPU memory, CUDA availability, and system specs

### ⚙️ Configuration Presets
- **Efficient (r=8, α=16)**: Fast training, lower memory usage
- **Balanced (r=16, α=32)**: Good balance of performance and speed
- **High Quality (r=32, α=64)**: Better performance, more parameters
- **Custom**: Set your own LoRA parameters

### 📁 Smart Output Management
- **Auto-generated directories**: `out_lora_{model}_{dataset}_{timestamp}`
- **Configuration saving**: Saves all settings to `lora_config.json`
- **Reproduction scripts**: Generates `reproduce.sh` for easy re-runs

## 📋 Usage Examples

### Interactive Wizard (Recommended)
```bash
humigence
# Select option 2: Single-GPU LoRA Training
# Follow the interactive prompts
```

### Direct Command Line
```bash
python3 cli/train_lora_single.py \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --output-dir ./out_lora \
  --max-steps 1000 \
  --batch-size 4
```

### Custom LoRA Settings
```bash
python3 cli/train_lora_single.py \
  --model mistralai/Mistral-7B-Instruct-v0.1 \
  --output-dir ./out_mistral \
  --max-steps 2000 \
  --batch-size 2 \
  --lora-r 32 \
  --lora-alpha 64 \
  --lora-dropout 0.1
```

### Small Model Testing
```bash
python3 cli/train_lora_single.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output-dir ./out_tinyllama \
  --max-steps 100 \
  --batch-size 8 \
  --block-size 256
```

## 🔧 Configuration Options

### Required Arguments
- `--model`: Model name or path (e.g., `meta-llama/Meta-Llama-3-8B-Instruct`)
- `--output-dir`: Output directory for the trained model

### Dataset Options
- `--dataset`: Dataset name (default: `wikitext`)
- `--dataset-config`: Dataset configuration (default: `wikitext-2-raw-v1`)
- `--block-size`: Block size for text grouping (default: `512`)

### Training Options
- `--max-steps`: Maximum training steps (default: `1000`)
- `--batch-size`: Per-device batch size (default: `4`)
- `--grad-accum`: Gradient accumulation steps (default: `4`)
- `--learning-rate`: Learning rate (default: `2e-4`)

### LoRA Options
- `--lora-r`: LoRA rank (default: `16`)
- `--lora-alpha`: LoRA alpha (default: `32`)
- `--lora-dropout`: LoRA dropout (default: `0.05`)

### Other Options
- `--warmup-steps`: Number of warmup steps (default: `100`)
- `--logging-steps`: Logging frequency (default: `10`)
- `--save-steps`: Save frequency (default: `200`)
- `--eval-steps`: Evaluation frequency (default: `200`)
- `--save-total-limit`: Maximum checkpoints to keep (default: `2`)

## 🧪 Testing

Run the test suite to validate the implementation:

```bash
python3 test_lora_single.py
```

This will test:
- Model architecture support
- CLI interface
- Model and dataset validation
- A short training run

## 🔍 Validation

After training, validate your adapters:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model
tokenizer = AutoTokenizer.from_pretrained('./out_lora')
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')

# Load the LoRA adapters
model = PeftModel.from_pretrained(model, './out_lora')

print('✅ Adapters loaded successfully!')
```

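As a further smoke test, the adapted model can be exercised with a short generation (continuing from the snippet above; the prompt and decoding settings are illustrative):

```python
import torch

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
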
## 🐛 Troubleshooting

### Common Issues

1. **"Loss does not require gradients" warning**
   - This is handled automatically by the custom `LoRATrainer` class
   - The script will force gradient computation if needed

2. **CUDA out of memory**
   - Reduce `--batch-size` (try 1 or 2)
   - Reduce `--block-size` (try 256 or 128)
   - Use gradient accumulation: increase `--grad-accum`

3. **Model not found**
   - Ensure the model name is correct
   - Check that you have internet access for downloading
   - Verify the model exists on the Hugging Face Hub

4. **Dataset loading issues**
   - The script uses `wikitext-2-raw-v1` by default
   - Ensure you have the `datasets` library installed
   - Check your internet connection for the dataset download

### Memory Optimization

For large models, use these settings:
```bash
python3 cli/train_lora_single.py \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --output-dir ./out_lora \
  --batch-size 1 \
  --grad-accum 8 \
  --block-size 256
```

## 📊 Output Structure

After training, you'll find:

```
out_lora/
├── adapter_config.json           # LoRA configuration
├── adapter_model.safetensors     # LoRA weights
├── tokenizer.json                # Tokenizer
├── tokenizer_config.json         # Tokenizer config
├── special_tokens_map.json       # Special tokens
├── training_summary.json         # Training metrics
└── checkpoint-*/                 # Training checkpoints
    ├── adapter_config.json
    ├── adapter_model.safetensors
    ├── optimizer.pt
    ├── scheduler.pt
    └── trainer_state.json
```

## 🔬 Technical Details

### Key Fixes Applied

1. **Custom LoRATrainer**: Ensures proper gradient flow
2. **enable_input_require_grads()**: Critical for PEFT + gradient checkpointing
3. **Proper data collation**: Uses `DataCollatorForLanguageModeling`
4. **Model-specific target modules**: Automatically detects correct LoRA targets
5. **Non-reentrant checkpointing**: Avoids gradient issues (see the sketch below)

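A minimal sketch of how fixes 2 and 5 combine when preparing a model for LoRA training (standard Transformers/PEFT calls; the surrounding setup is illustrative):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Fix 2: make embedding outputs require grads so checkpointed segments
# still receive gradients when the base weights are frozen by PEFT.
model.enable_input_require_grads()

# Fix 5: non-reentrant checkpointing avoids the "loss does not
# require gradients" failure mode.
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)
```
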
### Architecture Support

The system automatically detects model architectures and applies the correct LoRA target modules:

- **LLaMA/Mistral**: All attention and MLP layers
- **Phi**: Attention layers + dense layer
- **GPT**: `c_attn` and `c_proj` layers
- **Default**: Common transformer modules

## 🤝 Contributing

To add support for new model architectures:

1. Add the model name pattern to `get_model_target_modules()`
2. Specify the correct target modules
3. Test with a short training run
4. Update this documentation

## 📝 License

This code is part of the Humigence project and follows the same license terms.
cli.py ADDED
@@ -0,0 +1,257 @@
#!/usr/bin/env python3
"""
Humigence CLI - Main entry point for all Humigence commands
"""

import typer
from typing import Optional
from rich.console import Console
from rich.panel import Panel
from pathlib import Path
import sys

# Add the current directory to the path for imports
sys.path.insert(0, str(Path(__file__).parent))

from training.train_wikitext import run_training

app = typer.Typer(
    name="humigence",
    help="Your AI. Your pipeline. Zero code.",
    add_completion=False,
    rich_markup_mode="rich"
)

console = Console()


@app.command()
def train_wikitext(
    model: str = typer.Option(
        ...,
        "--model",
        "-m",
        help="Path or Hugging Face model name (e.g., 'gpt2' or 'microsoft/DialoGPT-small')"
    ),
    output_dir: str = typer.Option(
        ...,
        "--output-dir",
        "-o",
        help="Directory where checkpoints will be saved"
    ),
    epochs: int = typer.Option(
        1,
        "--epochs",
        "-e",
        help="Number of training epochs"
    ),
    batch_size: int = typer.Option(
        2,
        "--batch-size",
        "-b",
        help="Per-device batch size"
    ),
    learning_rate: float = typer.Option(
        5e-5,
        "--learning-rate",
        "-lr",
        help="Learning rate for training"
    ),
    dataset: str = typer.Option(
        "wikitext",
        "--dataset",
        help="Dataset name (default: wikitext)"
    ),
    dataset_config: str = typer.Option(
        "wikitext-2-raw-v1",
        "--dataset-config",
        help="Dataset configuration (default: wikitext-2-raw-v1)"
    ),
    max_steps: Optional[int] = typer.Option(
        None,
        "--max-steps",
        help="Maximum training steps (overrides epochs if set)"
    ),
    block_size: int = typer.Option(
        1024,
        "--block-size",
        help="Maximum sequence length"
    ),
    grad_accum: int = typer.Option(
        4,
        "--grad-accum",
        help="Gradient accumulation steps"
    ),
    warmup_steps: int = typer.Option(
        100,
        "--warmup-steps",
        help="Number of warmup steps"
    ),
    logging_steps: int = typer.Option(
        10,
        "--logging-steps",
        help="Logging frequency in steps"
    ),
    save_steps: int = typer.Option(
        200,
        "--save-steps",
        help="Model saving frequency in steps"
    ),
    eval_steps: int = typer.Option(
        200,
        "--eval-steps",
        help="Evaluation frequency in steps"
    ),
    lora_r: int = typer.Option(
        8,
        "--lora-r",
        help="LoRA rank"
    ),
    lora_alpha: int = typer.Option(
        32,
        "--lora-alpha",
        help="LoRA alpha parameter"
    ),
    lora_dropout: float = typer.Option(
        0.05,
        "--lora-dropout",
        help="LoRA dropout rate"
    ),
):
    """
    Train a model on the Wikitext dataset using LoRA fine-tuning.

    This command fine-tunes a language model on the Wikitext dataset using LoRA (Low-Rank Adaptation)
    for efficient parameter updates. The training runs on a single GPU by default.

    Examples:
        # Basic training with GPT-2
        humigence train-wikitext --model gpt2 --output-dir ./out

        # Training with custom parameters
        humigence train-wikitext --model microsoft/DialoGPT-small --output-dir ./out --epochs 2 --batch-size 4 --learning-rate 1e-4

        # Training with specific steps instead of epochs
        humigence train-wikitext --model gpt2 --output-dir ./out --max-steps 1000 --batch-size 2
    """

    # Display training configuration
    config_panel = Panel(
        f"""[bold blue]Training Configuration[/bold blue]

[cyan]Model:[/cyan] {model}
[cyan]Output Directory:[/cyan] {output_dir}
[cyan]Epochs:[/cyan] {epochs}
[cyan]Batch Size:[/cyan] {batch_size}
[cyan]Learning Rate:[/cyan] {learning_rate}
[cyan]Dataset:[/cyan] {dataset}/{dataset_config}
[cyan]Max Steps:[/cyan] {max_steps if max_steps else 'Auto-calculated'}
[cyan]Block Size:[/cyan] {block_size}
[cyan]Gradient Accumulation:[/cyan] {grad_accum}
[cyan]LoRA Rank:[/cyan] {lora_r}
[cyan]LoRA Alpha:[/cyan] {lora_alpha}
[cyan]LoRA Dropout:[/cyan] {lora_dropout}""",
        title="🚀 Starting Wikitext Training",
        border_style="green"
    )

    console.print(config_panel)

    # Create output directory if it doesn't exist
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # Run training
    try:
        result = run_training(
            model=model,
            output_dir=output_dir,
            epochs=epochs,
            batch_size=batch_size,
            learning_rate=learning_rate,
            dataset=dataset,
            dataset_config=dataset_config,
            max_steps=max_steps,
            block_size=block_size,
            grad_accum=grad_accum,
            warmup_steps=warmup_steps,
            logging_steps=logging_steps,
            save_steps=save_steps,
            eval_steps=eval_steps,
            lora_r=lora_r,
            lora_alpha=lora_alpha,
            lora_dropout=lora_dropout,
        )

        if result["status"] == "success":
            console.print(Panel(
                f"""[bold green]✅ Training Completed Successfully![/bold green]

[cyan]Output Directory:[/cyan] {result['output_dir']}
[cyan]Model Path:[/cyan] {result['model_path']}

[bold blue]Final Metrics:[/bold blue]
[cyan]Train Loss:[/cyan] {result['metrics'].get('train_loss', 'N/A')}
[cyan]Eval Loss:[/cyan] {result['metrics'].get('eval_loss', 'N/A')}
[cyan]Total Steps:[/cyan] {result['metrics'].get('total_steps', 'N/A')}
[cyan]Epochs:[/cyan] {result['metrics'].get('epochs', 'N/A')}
[cyan]Train Runtime:[/cyan] {result['metrics'].get('train_runtime', 'N/A')}s
[cyan]Samples/Second:[/cyan] {result['metrics'].get('train_samples_per_second', 'N/A')}""",
                title="🎉 Training Results",
                border_style="green"
            ))
            raise typer.Exit(0)
        else:
            console.print(Panel(
                f"""[bold red]❌ Training Failed[/bold red]

[red]Error:[/red] {result.get('error', 'Unknown error')}
[cyan]Output Directory:[/cyan] {result.get('output_dir', 'N/A')}""",
                title="💥 Training Error",
                border_style="red"
            ))
            raise typer.Exit(1)

    except typer.Exit:
        # typer.Exit inherits from RuntimeError via click, so re-raise it here
        # before the generic handler below turns a clean exit into an error.
        raise
    except Exception as e:
        console.print(Panel(
            f"""[bold red]❌ Unexpected Error[/bold red]

[red]Error:[/red] {str(e)}""",
            title="💥 Unexpected Error",
            border_style="red"
        ))
        raise typer.Exit(1)


@app.command()
def version():
    """Show version information."""
    console.print("[bold blue]Humigence v1.0.0[/bold blue]")
    console.print("[dim]Your AI. Your pipeline. Zero code.[/dim]")


@app.callback()
def main(
    version: bool = typer.Option(
        False,
        "--version",
        "-v",
        help="Show version and exit"
    )
):
    """
    Humigence - Your AI. Your pipeline. Zero code.

    A complete MLOps suite built for makers, teams, and enterprises.
    """
    if version:
        console.print("[bold blue]Humigence v1.0.0[/bold blue]")
        console.print("[dim]Your AI. Your pipeline. Zero code.[/dim]")
        raise typer.Exit(0)


if __name__ == "__main__":
    app()
compatibility_utils.py ADDED
@@ -0,0 +1,58 @@
# compatibility_utils.py
import torch
import datetime

def get_pytorch_version():
    """Detect PyTorch version for compatibility"""
    version = torch.__version__
    major, minor = map(int, version.split('.')[:2])
    return major, minor

def setup_timeout():
    """Create timeout compatible with PyTorch version"""
    major, minor = get_pytorch_version()

    # Tuple comparison, so that e.g. 2.0 correctly counts as newer than 1.10
    if (major, minor) >= (1, 10):
        # Use torch.distributed.timedelta if this build exposes it
        if hasattr(torch.distributed, 'timedelta'):
            return torch.distributed.timedelta(seconds=1800)
        else:
            # Fallback to datetime
            return datetime.timedelta(seconds=1800)
    else:
        # Use datetime for older versions
        return datetime.timedelta(seconds=1800)

def check_environment():
    """Check PyTorch environment and compatibility"""
    print("=== Environment Check ===")
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"Number of GPUs: {torch.cuda.device_count()}")

        if torch.cuda.device_count() > 0:
            for i in range(torch.cuda.device_count()):
                print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

    # Check distributed module
    if hasattr(torch.distributed, 'timedelta'):
        print("✓ torch.distributed.timedelta available")
    else:
        print("✗ torch.distributed.timedelta not available - using datetime")

    # Check critical attributes
    critical_attrs = ['init_process_group', 'is_initialized', 'destroy_process_group']
    for attr in critical_attrs:
        if hasattr(torch.distributed, attr):
            print(f"✓ torch.distributed.{attr} available")
        else:
            print(f"✗ torch.distributed.{attr} missing!")

    # Check timeout compatibility
    timeout = setup_timeout()
    print(f"✓ Timeout setup: {type(timeout).__name__}")

if __name__ == "__main__":
    check_environment()
config_migration.py ADDED
@@ -0,0 +1,260 @@
"""
Config Migration and Validation Utilities

This module provides robust config saving that ensures compatibility with the live TrainConfig schema.
It handles migration from old config formats, validates against the current schema, and provides
clear feedback about what changes were made.
"""

import json
import logging
from pathlib import Path
from typing import Dict, Any, List, Tuple, Optional
from config_schema import TrainConfig

# Set up logging
logger = logging.getLogger(__name__)

# Legacy key mappings (old_key -> new_key)
LEGACY_KEY_MAPPINGS = {
    "base_model": "model_name",
    "model": "model_name",
    "model_id": "model_name",
    "model_path": "model_name",
    "split_ratios": "train_val_test_split",
    "random_seed": "split_seed",
    "max_seq_len": "max_seq_length",
    "torch_dtype": "dtype",  # Handle deprecated torch_dtype
}

# Safe defaults for required fields that might be missing
SAFE_DEFAULTS = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 0.0002,
    "num_train_epochs": 1,
    "eval_batch_size": 8,
    "logging_steps": 10,
    "save_steps": 500,
    "eval_steps": 100,
    "max_seq_length": 1024,
    "fp16": True,
    "bf16": False,
    "multi_gpu": False,
    "eval_single_gpu": True,
    "eval_gpu_index": 0,
    "num_workers": 4,
    "pin_memory": True,
    "split_seed": 42,
    "train_val_test_split": [0.8, 0.1, 0.1],
    "data_schema": "instruction_output",
    "training_recipe": "LoRA (FP16)",
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "output_dir": "runs/humigence",
    "selected_gpus": [0],  # Default to single GPU
}

def migrate_config_dict(config_dict: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str], List[str]]:
    """
    Migrate a config dictionary to match the current TrainConfig schema.

    Args:
        config_dict: Raw config dictionary (potentially with legacy keys)

    Returns:
        Tuple of (migrated_config, dropped_keys, applied_defaults)
    """
    # Get the current schema fields (Pydantic v1/v2 compatibility)
    if hasattr(TrainConfig, 'model_fields'):
        # Pydantic v2
        schema_fields = set(TrainConfig.model_fields.keys())
    else:
        # Pydantic v1
        schema_fields = set(TrainConfig.__fields__.keys())

    migrated = {}
    dropped_keys = []
    applied_defaults = []

    # Step 1: Apply legacy key mappings, but only when the target key exists in the
    # schema; otherwise the renamed key would be rejected by extra="forbid" later.
    for old_key, new_key in LEGACY_KEY_MAPPINGS.items():
        if old_key in config_dict and new_key not in config_dict:
            if new_key in schema_fields:
                migrated[new_key] = config_dict[old_key]
                logger.info(f"Renamed '{old_key}' -> '{new_key}'")
            else:
                dropped_keys.append(old_key)
                logger.info(f"Dropped legacy key '{old_key}' ('{new_key}' not in schema)")
        elif old_key in config_dict and new_key in config_dict:
            logger.warning(f"Both '{old_key}' and '{new_key}' present, using '{new_key}'")

    # Step 2: Copy valid keys from original config
    for key, value in config_dict.items():
        if key in schema_fields:
            migrated[key] = value
        elif key not in LEGACY_KEY_MAPPINGS:
            dropped_keys.append(key)
            logger.info(f"Dropped unsupported key: '{key}'")

    # Step 3: Apply safe defaults for missing required fields
    if hasattr(TrainConfig, 'model_fields'):
        # Pydantic v2
        fields = TrainConfig.model_fields
    else:
        # Pydantic v1
        fields = TrainConfig.__fields__

    for field_name, field_info in fields.items():
        if field_name not in migrated:
            if field_name in SAFE_DEFAULTS:
                migrated[field_name] = SAFE_DEFAULTS[field_name]
                applied_defaults.append(field_name)
                logger.info(f"Applied default for '{field_name}': {SAFE_DEFAULTS[field_name]}")
            else:
                # This is a required field with no safe default - validation will catch this
                logger.warning(f"Missing required field '{field_name}' with no safe default")

    return migrated, dropped_keys, applied_defaults

def validate_and_save_config(
    config_dict: Dict[str, Any],
    output_path: str,
    context_info: Optional[Dict[str, Any]] = None
) -> TrainConfig:
    """
    Migrate, validate, and save a config dictionary to ensure it matches the current schema.

    Args:
        config_dict: Raw config dictionary to migrate and save
        output_path: Path where to save the validated config
        context_info: Optional runtime context to help fill missing values

    Returns:
        Validated TrainConfig instance

    Raises:
        ValueError: If the config cannot be migrated to a valid schema
    """
    logger.info("Starting config migration and validation...")

    # Step 1: Migrate the config
    migrated_config, dropped_keys, applied_defaults = migrate_config_dict(config_dict)

    # Step 2: Apply context info if provided
    if context_info:
        # Get schema fields for validation
        if hasattr(TrainConfig, 'model_fields'):
            schema_fields = set(TrainConfig.model_fields.keys())
        else:
            schema_fields = set(TrainConfig.__fields__.keys())

        for key, value in context_info.items():
            if key in schema_fields and key not in migrated_config:
                migrated_config[key] = value
                logger.info(f"Applied context value for '{key}': {value}")

    # Step 3: Validate against schema
    try:
        validated_config = TrainConfig(**migrated_config)
        logger.info("✅ Config validation successful")
    except Exception as e:
        logger.error(f"❌ Config validation failed: {e}")
        raise ValueError(f"Configuration validation failed after migration: {e}")

    # Step 4: Save to file
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, 'w') as f:
        json.dump(validated_config.dict(), f, indent=2)

    logger.info(f"✅ Config saved to {output_path}")

    # Step 5: Print summary
    print_config_migration_summary(dropped_keys, applied_defaults, output_path)

    return validated_config

def print_config_migration_summary(
    dropped_keys: List[str],
    applied_defaults: List[str],
    output_path: str
) -> None:
    """Print a summary of config migration changes."""
    print("\n" + "="*60)
    print("CONFIG MIGRATION SUMMARY")
    print("="*60)
    print(f"📁 Saved to: {output_path}")

    if dropped_keys:
        print(f"🗑️ Dropped keys ({len(dropped_keys)}): {', '.join(dropped_keys)}")
    else:
        print("✅ No keys dropped")

    if applied_defaults:
        print(f"⚙️ Applied defaults ({len(applied_defaults)}): {', '.join(applied_defaults)}")
    else:
        print("✅ No defaults applied")

    print("✅ Config is now compatible with current TrainConfig schema")
    print("="*60)

def save_config_snapshot(
    config_dict: Dict[str, Any],
    output_path: str = "runs/humigence/config.snapshot.json",
    context_info: Optional[Dict[str, Any]] = None
) -> TrainConfig:
    """
    Save a config snapshot with automatic migration and validation.

    This is the main function that should be used throughout the codebase
    to ensure all saved configs are compatible with the current schema.

    Args:
        config_dict: Raw config dictionary to save
        output_path: Path where to save the config (default: runs/humigence/config.snapshot.json)
        context_info: Optional runtime context to help fill missing values

    Returns:
        Validated TrainConfig instance
    """
    return validate_and_save_config(config_dict, output_path, context_info)

def load_and_validate_config(config_path: str) -> TrainConfig:
    """
    Load and validate a config file against the current schema.

    Args:
        config_path: Path to the config file

    Returns:
        Validated TrainConfig instance

    Raises:
        FileNotFoundError: If the config file doesn't exist
        ValueError: If the config cannot be validated
    """
    config_path = Path(config_path)
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")

    with open(config_path, 'r') as f:
        config_dict = json.load(f)

    # Migrate and validate
    migrated_config, dropped_keys, applied_defaults = migrate_config_dict(config_dict)

    try:
        validated_config = TrainConfig(**migrated_config)
        return validated_config
    except Exception as e:
        raise ValueError(f"Config validation failed: {e}")

# Backward compatibility function
def save_config(config: TrainConfig, output_path: str) -> None:
    """Legacy save_config function for backward compatibility."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, 'w') as f:
        json.dump(config.dict(), f, indent=2)

    logger.info(f"Config saved to {output_path}")
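
For reference, a minimal usage sketch of the migration path (the legacy dict below is illustrative):

```python
from config_migration import migrate_config_dict

legacy = {
    "base_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # legacy key -> model_name
    "dataset_path": "data/train.jsonl",
    "unsupported_flag": True,  # not in the schema -> dropped
}
migrated, dropped, defaulted = migrate_config_dict(legacy)
# migrated["model_name"] is set, "unsupported_flag" lands in dropped,
# and missing fields such as learning_rate are filled from SAFE_DEFAULTS.
```
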
config_schema.py ADDED
@@ -0,0 +1,118 @@
"""
Pydantic configuration schema for Humigence training pipeline
"""
from pydantic import BaseModel, Field, validator
from typing import List, Optional, Union
from pathlib import Path

class TrainConfig(BaseModel):
    """Strict configuration schema for Humigence training"""

    # Model configuration
    model_name: str = Field(..., description="Hugging Face model name")
    training_recipe: str = Field(default="LoRA (FP16)", description="Training recipe")

    # Training hyperparameters
    learning_rate: float = Field(..., ge=1e-6, le=1.0, description="Learning rate")
    num_train_epochs: int = Field(..., ge=1, le=100, description="Number of training epochs")
    per_device_train_batch_size: int = Field(..., ge=1, le=32, description="Batch size per device")
    gradient_accumulation_steps: int = Field(..., ge=1, le=32, description="Gradient accumulation steps")
    eval_batch_size: int = Field(..., ge=1, le=32, description="Evaluation batch size")

    # Precision settings
    fp16: bool = Field(default=True, description="Use FP16 precision")
    bf16: bool = Field(default=False, description="Use BF16 precision")

    # Multi-GPU settings
    multi_gpu: bool = Field(default=False, description="Enable multi-GPU training")
    selected_gpus: List[int] = Field(default=[0], description="Selected GPU indices")

    # Dataset configuration
    dataset_path: str = Field(..., description="Path to dataset file")
    data_schema: str = Field(default="instruction_output", description="Dataset schema")
    train_val_test_split: List[float] = Field(default=[0.8, 0.1, 0.1], description="Dataset split ratios")
    split_seed: int = Field(default=42, description="Random seed for dataset split")
    max_seq_length: int = Field(default=1024, ge=64, le=4096, description="Maximum sequence length")

    # LoRA configuration
    lora_r: int = Field(default=16, ge=1, le=256, description="LoRA rank")
    lora_alpha: int = Field(default=32, ge=1, le=512, description="LoRA alpha")
    lora_dropout: float = Field(default=0.05, ge=0.0, le=0.5, description="LoRA dropout")

    # Logging and evaluation
    logging_steps: int = Field(default=10, ge=1, le=1000, description="Logging frequency")
    eval_steps: int = Field(default=100, ge=1, le=10000, description="Evaluation frequency")
    save_steps: int = Field(default=500, ge=1, le=10000, description="Save frequency")

    # Output configuration
    output_dir: str = Field(default="runs/humigence", description="Output directory")
    eval_single_gpu: bool = Field(default=True, description="Evaluate on single GPU")
    eval_gpu_index: int = Field(default=0, description="GPU index for evaluation")

    # System configuration
    num_workers: int = Field(default=4, ge=0, le=16, description="Number of data loader workers")
    pin_memory: bool = Field(default=True, description="Pin memory for data loading")

    @validator('train_val_test_split')
    def validate_split(cls, v):
        if len(v) != 3:
            raise ValueError("train_val_test_split must have exactly 3 values")
        if abs(sum(v) - 1.0) > 1e-6:
            raise ValueError("train_val_test_split values must sum to 1.0")
        return v

    @validator('bf16')
    def validate_precision(cls, v, values):
        # bf16 is declared after fp16, so fp16 is already present in `values` here
        if v and values.get('fp16'):
            raise ValueError("Cannot use both fp16 and bf16 simultaneously")
        return v

    @validator('dataset_path')
    def validate_dataset_path(cls, v):
        path = Path(v)
        if not path.exists():
            raise ValueError(f"Dataset file not found: {v}")
        if path.suffix != '.jsonl':
            raise ValueError(f"Dataset must be a .jsonl file: {v}")
        return str(path)

    @validator('model_name')
    def validate_model_name(cls, v):
        # Basic validation for Hugging Face model names
        if not v or len(v.strip()) == 0:
            raise ValueError("Model name cannot be empty")
        return v.strip()

    class Config:
        """Pydantic configuration"""
        validate_assignment = True
        extra = "forbid"  # Reject extra fields
        use_enum_values = True

def load_config(config_path: str) -> TrainConfig:
    """Load and validate configuration from a JSON file"""
    import json

    with open(config_path, 'r') as f:
        config_dict = json.load(f)

    try:
        return TrainConfig(**config_dict)
    except Exception as e:
        raise ValueError(f"Configuration validation failed: {e}")

def save_config(config: TrainConfig, output_path: str) -> None:
    """Save configuration to a JSON file (legacy function)"""
    import json
    from pathlib import Path

    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, 'w') as f:
        json.dump(config.dict(), f, indent=2)

def save_config_snapshot(config_dict: dict, output_path: str = "runs/humigence/config.snapshot.json") -> TrainConfig:
    """Save config with automatic migration and validation"""
    from config_migration import save_config_snapshot as _save_config_snapshot
    return _save_config_snapshot(config_dict, output_path)
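
For reference, a minimal usage sketch (note that `dataset_path` must point at an existing `.jsonl` file for the validator to pass; the values below are illustrative):

```python
from config_schema import TrainConfig

cfg = TrainConfig(
    model_name="Qwen/Qwen1.5-0.5B",
    learning_rate=2e-4,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    eval_batch_size=8,
    dataset_path="data/train.jsonl",  # must exist as a .jsonl file
)
print(cfg.output_dir)  # "runs/humigence" (default)
```
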
distributed_utils.py ADDED
@@ -0,0 +1,84 @@
# distributed_utils.py
import os
import torch
import torch.distributed as dist
import logging
from typing import Tuple
from compatibility_utils import setup_timeout

def setup_distributed() -> Tuple[bool, int, int, int, torch.device]:
    """
    First-principles DDP setup with a single source of truth for device mapping.
    Returns: (is_ddp, rank, local_rank, world_size, device)
    """
    ddp = "RANK" in os.environ and "WORLD_SIZE" in os.environ

    if ddp:
        # Initialize process group with robust timeout
        if not dist.is_initialized():
            # Use compatibility-aware timeout
            timeout = setup_timeout()
            dist.init_process_group(
                backend="nccl",
                timeout=timeout
            )

        local_rank = int(os.environ["LOCAL_RANK"])
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

        # Critical: Set device BEFORE any CUDA operations
        torch.cuda.set_device(local_rank)
        device = torch.device(f"cuda:{local_rank}")

        # Verify device mapping
        assert torch.cuda.current_device() == local_rank, \
            f"Device mapping error: current={torch.cuda.current_device()}, local_rank={local_rank}"

    else:
        local_rank, rank, world_size = 0, 0, 1
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    return ddp, rank, local_rank, world_size, device

def setup_environment():
    """Set environment variables once at process start"""
    os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")  # Modern replacement
    os.environ.setdefault("NCCL_IB_DISABLE", "1")  # Disable InfiniBand on a single node
    os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")  # Prevent tokenizer conflicts

    # Remove deprecated variables
    if "NCCL_ASYNC_ERROR_HANDLING" in os.environ:
        del os.environ["NCCL_ASYNC_ERROR_HANDLING"]

    # Do NOT set NCCL_P2P_DISABLE - allow peer-to-peer on a single node

def cleanup_distributed():
    """Clean shutdown of the process group"""
    if dist.is_available() and dist.is_initialized():
        try:
            dist.barrier()
            dist.destroy_process_group()
        except Exception as e:
            logging.warning(f"Cleanup warning: {e}")

class RankZeroOnly:
    """Context manager for rank-0 only execution"""
    def __init__(self, is_main: bool):
        self.is_main = is_main
        self.original_level = None

    def __enter__(self):
        if not self.is_main:
            # Suppress logging for non-main ranks
            self.original_level = logging.getLogger().getEffectiveLevel()
            logging.getLogger().setLevel(logging.WARNING)
        return self

    def __exit__(self, *args):
        if not self.is_main and self.original_level is not None:
            logging.getLogger().setLevel(self.original_level)

    def print(self, *args, **kwargs):
        if self.is_main:
            print(*args, **kwargs)
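
For reference, the typical call order for these helpers inside a training script (a sketch; the commented train step is a hypothetical placeholder):

```python
from distributed_utils import setup_environment, setup_distributed, cleanup_distributed, RankZeroOnly

setup_environment()  # must run before any CUDA/NCCL work
ddp, rank, local_rank, world_size, device = setup_distributed()

try:
    with RankZeroOnly(is_main=(rank == 0)) as log:
        log.print(f"Training on {world_size} process(es), device {device}")
    # train_one_epoch(model, loader, device)  # hypothetical placeholder
finally:
    cleanup_distributed()
```
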
errors.py ADDED
@@ -0,0 +1,118 @@
"""
Custom error handling for the Humigence training pipeline
"""
from typing import Optional

class HumigenceError(Exception):
    """Base exception for Humigence training errors"""
    def __init__(self, message: str, suggested_fix: Optional[str] = None):
        super().__init__(message)
        self.suggested_fix = suggested_fix

class ConfigurationError(HumigenceError):
    """Configuration validation errors"""
    pass

class DatasetError(HumigenceError):
    """Dataset loading and processing errors"""
    pass

class ModelError(HumigenceError):
    """Model loading and setup errors"""
    pass

class TrainingError(HumigenceError):
    """Training process errors"""
    pass

class EvaluationError(HumigenceError):
    """Evaluation process errors"""
    pass

class DistributedError(HumigenceError):
    """Distributed training errors"""
    pass

def handle_cuda_error(error: Exception) -> HumigenceError:
    """Convert CUDA errors to HumigenceError with suggested fixes"""
    error_msg = str(error)

    if "out of memory" in error_msg.lower():
        return TrainingError(
            "CUDA out of memory",
            "Reduce batch size or use gradient checkpointing"
        )
    elif "illegal memory access" in error_msg.lower():
        return DistributedError(
            "NCCL illegal memory access",
            "Reduce batch size or retry single-GPU mode"
        )
    elif "device" in error_msg.lower() and "mismatch" in error_msg.lower():
        return TrainingError(
            "Device mismatch detected",
            "Ensure all tensors are on the same device"
        )
    else:
        return TrainingError(f"CUDA error: {error_msg}")

def handle_distributed_error(error: Exception) -> HumigenceError:
    """Convert distributed training errors to HumigenceError"""
    error_msg = str(error)

    if "nccl" in error_msg.lower():
        return DistributedError(
            "NCCL communication error",
            "Check network configuration or retry single-GPU mode"
        )
    elif "process group" in error_msg.lower():
        return DistributedError(
            "Process group initialization failed",
            "Check distributed setup or retry single-GPU mode"
        )
    else:
        return DistributedError(f"Distributed training error: {error_msg}")

def handle_model_error(error: Exception) -> HumigenceError:
    """Convert model-related errors to HumigenceError"""
    error_msg = str(error)

    if "out of memory" in error_msg.lower():
        return ModelError(
            "Model loading out of memory",
            "Use a smaller model or enable model sharding"
        )
    elif "not found" in error_msg.lower():
        return ModelError(
            "Model not found",
            "Check the model name or download the model first"
        )
    else:
        return ModelError(f"Model error: {error_msg}")

def handle_dataset_error(error: Exception) -> HumigenceError:
    """Convert dataset-related errors to HumigenceError"""
    error_msg = str(error)

    if "not found" in error_msg.lower():
        return DatasetError(
            "Dataset file not found",
            "Check the dataset path and ensure the file exists"
        )
    elif "column" in error_msg.lower() and "not in" in error_msg.lower():
        return DatasetError(
            "Dataset column mismatch",
            "Check the dataset schema and column names"
        )
    else:
        return DatasetError(f"Dataset error: {error_msg}")

def clean_error_message(error: HumigenceError) -> str:
    """Create a clean error message with the suggested fix"""
    message = f"❌ {error.__class__.__name__}: {error}"

    if error.suggested_fix:
        message += f"\n   Suggested fix: {error.suggested_fix}"

    return message
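
For reference, a minimal sketch of wrapping a training step with these handlers (`train_step()` is a hypothetical stand-in for a real failure site):

```python
from errors import handle_cuda_error, clean_error_message

def train_step():
    raise RuntimeError("CUDA out of memory")  # stand-in for a real failure

try:
    train_step()
except RuntimeError as e:
    err = handle_cuda_error(e)  # -> TrainingError with a suggested fix
    print(clean_error_message(err))
```
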
humigence ADDED
@@ -0,0 +1,18 @@
#!/usr/bin/env python3
"""
Humigence CLI Entry Point
"""
import sys
from pathlib import Path

# Add the humigence directory to the Python path
humigence_dir = Path(__file__).parent
sys.path.insert(0, str(humigence_dir))

# Import and run the main CLI
from cli.main import main

if __name__ == "__main__":
    main()
main_cli.py ADDED
@@ -0,0 +1,1175 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Humigence CLI - Main entry point for all Humigence commands
4
+ """
5
+
6
+ import typer
7
+ from typing import Optional, Dict, Any
8
+ from rich.console import Console
9
+ from rich.panel import Panel
10
+ from rich.table import Table
11
+ from pathlib import Path
12
+ import sys
13
+ import os
14
+ from datetime import datetime
15
+
16
+ # Add the current directory to the path for imports
17
+ sys.path.insert(0, str(Path(__file__).parent))
18
+
19
+ from training.train_wikitext import run_training, run_training_from_config
20
+ from training.autodetect import detect_family, suggested_lora_targets
21
+ from validation.matrix import (
22
+ get_gpu_info, precision_supported, estimate_model_params,
23
+ estimate_memory_bytes, tokenizer_ok, PRECISIONS,
24
+ )
25
+ from validation.dryrun import dry_run
26
+ from validation.fallback import FallbackSimulator, ConfigCandidate
27
+ from config.schema import ValidationConfig, TrainingConfig, ConfigMetadata, save_config, validation_to_training_config
28
+
29
+ app = typer.Typer(
30
+ name="humigence",
31
+ help="Your AI. Your pipeline. Zero code.",
32
+ add_completion=False,
33
+ rich_markup_mode="rich"
34
+ )
35
+
36
+ console = Console()
37
+
38
+
39
+ @app.command()
40
+ def train_wikitext(
41
+ model: str = typer.Option(
42
+ "",
43
+ "--model",
44
+ "-m",
45
+ help="Path or Hugging Face model name (e.g., 'gpt2' or 'microsoft/DialoGPT-small')"
46
+ ),
47
+ output_dir: str = typer.Option(
48
+ ...,
49
+ "--output-dir",
50
+ "-o",
51
+ help="Directory where checkpoints will be saved"
52
+ ),
53
+ epochs: int = typer.Option(
54
+ 1,
55
+ "--epochs",
56
+ "-e",
57
+ help="Number of training epochs"
58
+ ),
59
+ batch_size: int = typer.Option(
60
+ 2,
61
+ "--batch-size",
62
+ "-b",
63
+ help="Per-device batch size"
64
+ ),
65
+ learning_rate: float = typer.Option(
66
+ 5e-5,
67
+ "--learning-rate",
68
+ "-lr",
69
+ help="Learning rate for training"
70
+ ),
71
+ dataset: str = typer.Option(
72
+ "wikitext",
73
+ "--dataset",
74
+ help="Dataset name (default: wikitext)"
75
+ ),
76
+ dataset_config: str = typer.Option(
77
+ "wikitext-2-raw-v1",
78
+ "--dataset-config",
79
+ help="Dataset configuration (default: wikitext-2-raw-v1)"
80
+ ),
81
+ max_steps: Optional[int] = typer.Option(
82
+ None,
83
+ "--max-steps",
84
+ help="Maximum training steps (overrides epochs if set)"
85
+ ),
86
+ block_size: int = typer.Option(
87
+ 1024,
88
+ "--block-size",
89
+ help="Maximum sequence length"
90
+ ),
91
+ grad_accum: int = typer.Option(
92
+ 4,
93
+ "--grad-accum",
94
+ help="Gradient accumulation steps"
95
+ ),
96
+ warmup_steps: int = typer.Option(
97
+ 100,
98
+ "--warmup-steps",
99
+ help="Number of warmup steps"
100
+ ),
101
+ logging_steps: int = typer.Option(
102
+ 10,
103
+ "--logging-steps",
104
+ help="Logging frequency in steps"
105
+ ),
106
+ save_steps: int = typer.Option(
107
+ 200,
108
+ "--save-steps",
109
+ help="Model saving frequency in steps"
110
+ ),
111
+ eval_steps: int = typer.Option(
112
+ 200,
113
+ "--eval-steps",
114
+ help="Evaluation frequency in steps"
115
+ ),
116
+ lora_r: int = typer.Option(
117
+ 8,
118
+ "--lora-r",
119
+ help="LoRA rank"
120
+ ),
121
+ lora_alpha: int = typer.Option(
122
+ 32,
123
+ "--lora-alpha",
124
+ help="LoRA alpha parameter"
125
+ ),
126
+ lora_dropout: float = typer.Option(
127
+ 0.05,
128
+ "--lora-dropout",
129
+ help="LoRA dropout rate"
130
+ ),
131
+ config: Optional[str] = typer.Option(
132
+ None,
133
+ "--config",
134
+ help="Load configuration from YAML file"
135
+ ),
136
+ ):
137
+ """
138
+ Train a model on Wikitext dataset using LoRA fine-tuning.
139
+
140
+ This command fine-tunes a language model on the Wikitext dataset using LoRA (Low-Rank Adaptation)
141
+ for efficient parameter updates. The training runs on a single GPU by default.
142
+
143
+ Examples:
144
+ # Basic training with GPT-2
145
+ humigence train-wikitext --model gpt2 --output-dir ./out
146
+
147
+ # Training with custom parameters
148
+ humigence train-wikitext --model microsoft/DialoGPT-small --output-dir ./out --epochs 2 --batch-size 4 --learning-rate 1e-4
149
+
150
+ # Training with specific steps instead of epochs
151
+ humigence train-wikitext --model gpt2 --output-dir ./out --max-steps 1000 --batch-size 2
152
+
153
+ # Training with config file
154
+ humigence train-wikitext --config ./myconfig.yaml --output-dir ./out
155
+ """
156
+
157
+ # Validate that either model or config is provided
158
+ if not config and not model:
159
+ console.print("[bold red]❌ Error: Either --model or --config must be provided[/bold red]")
160
+ raise typer.Exit(1)
161
+
162
+ # Load config from file if provided
163
+ if config:
164
+ try:
165
+ from config.schema import load_config, validation_to_training_config
166
+ # Try to load as TrainingConfig first, then ValidationConfig
167
+ try:
168
+ loaded_config, metadata = load_config(config, TrainingConfig)
169
+ except Exception:
170
+ # If it fails, try loading as ValidationConfig and convert
171
+ validation_config, metadata = load_config(config, ValidationConfig)
172
+ loaded_config = validation_to_training_config(validation_config, output_dir)
173
+
174
+ # Override with CLI arguments (CLI takes precedence)
175
+ config_dict = loaded_config.dict()
176
+
177
+ # Update with CLI values (only if they're not default values)
178
+ if model != "": # If model was provided via CLI
179
+ config_dict["model"] = model
180
+ if output_dir != "": # If output_dir was provided via CLI
181
+ config_dict["output_dir"] = output_dir
182
+ if epochs != 1:
183
+ config_dict["epochs"] = epochs
184
+ if batch_size != 2:
185
+ config_dict["batch_size"] = batch_size
186
+ if learning_rate != 5e-5:
187
+ config_dict["learning_rate"] = learning_rate
188
+ if dataset != "wikitext":
189
+ config_dict["dataset"] = dataset
190
+ if dataset_config != "wikitext-2-raw-v1":
191
+ config_dict["dataset_config"] = dataset_config
192
+ if max_steps is not None:
193
+ config_dict["max_steps"] = max_steps
194
+ if block_size != 1024:
195
+ config_dict["block_size"] = block_size
196
+ if grad_accum != 4:
197
+ config_dict["grad_accum"] = grad_accum
198
+ if warmup_steps != 100:
199
+ config_dict["warmup_steps"] = warmup_steps
200
+ if logging_steps != 10:
201
+ config_dict["logging_steps"] = logging_steps
202
+ if save_steps != 200:
203
+ config_dict["save_steps"] = save_steps
204
+ if eval_steps != 200:
205
+ config_dict["eval_steps"] = eval_steps
206
+ if lora_r != 8:
207
+ config_dict["lora_r"] = lora_r
208
+ if lora_alpha != 32:
209
+ config_dict["lora_alpha"] = lora_alpha
210
+ if lora_dropout != 0.05:
211
+ config_dict["lora_dropout"] = lora_dropout
212
+
213
+ # Create new config with merged values
214
+ final_config = TrainingConfig(**config_dict)
215
+
216
+ # Extract values for display and function call
217
+ model = final_config.model
218
+ output_dir = final_config.output_dir
219
+ dataset = final_config.dataset
220
+ dataset_config = final_config.dataset_config
221
+ epochs = final_config.epochs
222
+ batch_size = final_config.batch_size
223
+ learning_rate = final_config.learning_rate
224
+ max_steps = final_config.max_steps
225
+ block_size = final_config.block_size
226
+ grad_accum = final_config.grad_accum
227
+ warmup_steps = final_config.warmup_steps
228
+ logging_steps = final_config.logging_steps
229
+ save_steps = final_config.save_steps
230
+ eval_steps = final_config.eval_steps
231
+ lora_r = final_config.lora_r
232
+ lora_alpha = final_config.lora_alpha
233
+ lora_dropout = final_config.lora_dropout
234
+
235
+ console.print(f"[bold blue]📁 Loaded configuration from {config}[/bold blue]")
236
+
237
+ # Display provenance information if metadata is available
238
+ if metadata:
239
+ provenance_info = f"Created: {metadata.created}"
240
+ if metadata.gpu:
241
+ provenance_info += f" | GPU: {metadata.gpu}"
242
+ if metadata.auto_heal and metadata.fallback_chain:
243
+ provenance_info += f" | Auto-healed: {' → '.join(metadata.fallback_chain)}"
244
+ elif metadata.auto_heal:
245
+ provenance_info += " | Auto-healed: (no fallbacks needed)"
246
+ else:
247
+ provenance_info += " | Direct validation (no auto-healing)"
248
+
249
+ console.print(f"[dim]📋 {provenance_info}[/dim]")
250
+
251
+ except Exception as e:
252
+ console.print(f"[bold red]❌ Failed to load config from {config}: {e}[/bold red]")
253
+ raise typer.Exit(1)
254
+
255
+ # Display training configuration
256
+ config_panel = Panel(
257
+ f"""[bold blue]Training Configuration[/bold blue]
258
+
259
+ [cyan]Model:[/cyan] {model}
260
+ [cyan]Output Directory:[/cyan] {output_dir}
261
+ [cyan]Epochs:[/cyan] {epochs}
262
+ [cyan]Batch Size:[/cyan] {batch_size}
263
+ [cyan]Learning Rate:[/cyan] {learning_rate}
264
+ [cyan]Dataset:[/cyan] {dataset}/{dataset_config}
265
+ [cyan]Max Steps:[/cyan] {max_steps if max_steps else 'Auto-calculated'}
266
+ [cyan]Block Size:[/cyan] {block_size}
267
+ [cyan]Gradient Accumulation:[/cyan] {grad_accum}
268
+ [cyan]LoRA Rank:[/cyan] {lora_r}
269
+ [cyan]LoRA Alpha:[/cyan] {lora_alpha}
270
+ [cyan]LoRA Dropout:[/cyan] {lora_dropout}""",
271
+ title="🚀 Starting Wikitext Training",
272
+ border_style="green"
273
+ )
274
+
275
+ console.print(config_panel)
276
+
277
+ # Create output directory if it doesn't exist
278
+ Path(output_dir).mkdir(parents=True, exist_ok=True)
279
+
280
+ # Run training
281
+ try:
282
+ if config:
283
+ # Use config-based training with launcher
284
+ from training.launcher import launch_training
285
+ result = launch_training(final_config)
286
+ else:
287
+ # Use individual parameters - convert to TrainingConfig and use launcher
288
+ from config.schema import TrainingConfig
289
+ from training.launcher import launch_training
290
+
291
+ training_config = TrainingConfig(
292
+ model=model,
293
+ output_dir=output_dir,
294
+ dataset=dataset,
295
+ dataset_config=dataset_config,
296
+ precision="fp16",
297
+ seq_len=block_size,
298
+ batch_size=batch_size,
299
+ epochs=epochs,
300
+ learning_rate=learning_rate,
301
+ max_steps=max_steps,
302
+ block_size=block_size,
303
+ grad_accum=grad_accum,
304
+ warmup_steps=warmup_steps,
305
+ logging_steps=logging_steps,
306
+ save_steps=save_steps,
307
+ eval_steps=eval_steps,
308
+ lora=True,
309
+ lora_r=lora_r,
310
+ lora_alpha=lora_alpha,
311
+ lora_dropout=lora_dropout,
312
+ gradient_checkpointing=True,
313
+ text_field="text",
314
+ schema="plain",
315
+ gpu_mode="single",
316
+ gpu_ids=[0]
317
+ )
318
+
319
+ result = launch_training(training_config)
320
+
321
+ if result["status"] == "success":
322
+ console.print(Panel(
323
+ f"""[bold green]✅ Training Completed Successfully![/bold green]
324
+
325
+ [cyan]Output Directory:[/cyan] {result['output_dir']}
326
+ [cyan]Model Path:[/cyan] {result['model_path']}
327
+
328
+ [bold blue]Final Metrics:[/bold blue]
329
+ [cyan]Train Loss:[/cyan] {result['metrics'].get('train_loss', 'N/A')}
330
+ [cyan]Eval Loss:[/cyan] {result['metrics'].get('eval_loss', 'N/A')}
331
+ [cyan]Total Steps:[/cyan] {result['metrics'].get('total_steps', 'N/A')}
332
+ [cyan]Epochs:[/cyan] {result['metrics'].get('epochs', 'N/A')}
333
+ [cyan]Train Runtime:[/cyan] {result['metrics'].get('train_runtime', 'N/A')}s
334
+ [cyan]Samples/Second:[/cyan] {result['metrics'].get('train_samples_per_second', 'N/A')}""",
335
+ title="🎉 Training Results",
336
+ border_style="green"
337
+ ))
338
+ return
339
+ else:
340
+ console.print(Panel(
341
+ f"""[bold red]❌ Training Failed[/bold red]
342
+
343
+ [red]Error:[/red] {result.get('error', 'Unknown error')}
344
+ [cyan]Output Directory:[/cyan] {result.get('output_dir', 'N/A')}""",
345
+ title="💥 Training Error",
346
+ border_style="red"
347
+ ))
348
+ raise typer.Exit(1)
349
+
350
+ except Exception as e:
351
+ console.print(Panel(
352
+ f"""[bold red]❌ Unexpected Error[/bold red]
353
+
354
+ [red]Error:[/red] {str(e)}""",
355
+ title="💥 Unexpected Error",
356
+ border_style="red"
357
+ ))
358
+ raise typer.Exit(1)
359
+
360
+
361
+ @app.command()
362
+ def train(
363
+ config: str = typer.Option(..., "--config", "-c", help="Path to YAML configuration file"),
364
+ output_dir: Optional[str] = typer.Option(None, "--output-dir", "-o", help="Override output directory"),
365
+ epochs: Optional[int] = typer.Option(None, "--epochs", "-e", help="Override number of epochs"),
366
+ batch_size: Optional[int] = typer.Option(None, "--batch-size", "-b", help="Override batch size"),
367
+ learning_rate: Optional[float] = typer.Option(None, "--learning-rate", "-lr", help="Override learning rate"),
368
+ max_steps: Optional[int] = typer.Option(None, "--max-steps", help="Override maximum training steps"),
369
+ dataset: Optional[str] = typer.Option(None, "--dataset", help="Override dataset specification"),
370
+ text_field: Optional[str] = typer.Option(None, "--text-field", help="Override text field for HF datasets"),
371
+ schema: Optional[str] = typer.Option(None, "--schema", help="Override schema for JSONL datasets"),
372
+ gradient_checkpointing: Optional[bool] = typer.Option(None, "--gradient-checkpointing/--no-gradient-checkpointing", help="Override gradient checkpointing"),
373
+ flash_attn: Optional[bool] = typer.Option(None, "--flash-attn/--no-flash-attn", help="Override flash attention"),
374
+ dtype: Optional[str] = typer.Option(None, "--dtype", help="Override data type: fp32|fp16|bf16"),
375
+ gpu_mode: Optional[str] = typer.Option(None, "--gpu-mode", help="Override GPU mode: single|multi"),
376
+ gpu_ids: Optional[str] = typer.Option(None, "--gpu-ids", help="Override GPU IDs (comma-separated, e.g., '0,1,2')"),
377
+ ):
378
+ """
379
+ Train a model using a configuration file with dataset-agnostic support.
380
+
381
+ This command supports training on:
382
+ - Wikitext datasets (wikitext)
383
+ - JSONL SFT datasets (jsonl:path/to/file.jsonl)
384
+ - Hugging Face datasets (hf:dataset_name or dataset_name)
385
+
386
+ Examples:
387
+ # Train with Wikitext
388
+ humigence train --config gpt2_wikitext.yaml
389
+
390
+ # Train with JSONL SFT dataset
391
+ humigence train --config my_sft_config.yaml
392
+
393
+ # Train with Hugging Face dataset
394
+ humigence train --config imdb_config.yaml
395
+
396
+ # Override specific parameters
397
+ humigence train --config my_config.yaml --epochs 3 --batch-size 4
398
+ """
399
+
400
+ # Load configuration
401
+ try:
402
+ from config.schema import load_config, validation_to_training_config
403
+ # Try to load as TrainingConfig first, then ValidationConfig
404
+ try:
405
+ loaded_config, metadata = load_config(config, TrainingConfig)
406
+ except Exception:
407
+ # If it fails, try loading as ValidationConfig and convert
408
+ validation_config, metadata = load_config(config, ValidationConfig)
409
+ if not output_dir:
410
+ console.print("[bold red]❌ Error: --output-dir is required when using ValidationConfig[/bold red]")
411
+ raise typer.Exit(1)
412
+ loaded_config = validation_to_training_config(validation_config, output_dir)
413
+
414
+ # Override with CLI arguments (CLI takes precedence)
415
+ config_dict = loaded_config.dict()
416
+
417
+ if output_dir:
418
+ config_dict["output_dir"] = output_dir
419
+ if epochs is not None:
420
+ config_dict["epochs"] = epochs
421
+ if batch_size is not None:
422
+ config_dict["batch_size"] = batch_size
423
+ if learning_rate is not None:
424
+ config_dict["learning_rate"] = learning_rate
425
+ if max_steps is not None:
426
+ config_dict["max_steps"] = max_steps
427
+ if dataset:
428
+ config_dict["dataset"] = dataset
429
+ if text_field:
430
+ config_dict["text_field"] = text_field
431
+ if schema:
432
+ config_dict["schema"] = schema
433
+ if gradient_checkpointing is not None:
434
+ config_dict["gradient_checkpointing"] = gradient_checkpointing
435
+ if flash_attn is not None:
436
+ config_dict["flash_attn"] = flash_attn
437
+ if dtype:
438
+ config_dict["dtype"] = dtype
439
+ if gpu_mode:
440
+ config_dict["gpu_mode"] = gpu_mode
441
+ if gpu_ids:
442
+ # Parse comma-separated GPU IDs
443
+ try:
444
+ gpu_ids_list = [int(x.strip()) for x in gpu_ids.split(",")]
445
+ config_dict["gpu_ids"] = gpu_ids_list
446
+ except ValueError:
447
+ console.print(f"[red]❌ Invalid GPU IDs format: {gpu_ids}. Use comma-separated integers (e.g., '0,1,2')[/red]")
448
+ raise typer.Exit(1)
449
+
450
+ # Create final config
451
+ final_config = TrainingConfig(**config_dict)
452
+
453
+ console.print(f"[bold blue]📁 Loaded configuration from {config}[/bold blue]")
454
+
455
+ # Display provenance information if metadata is available
456
+ if metadata:
457
+ provenance_info = f"Created: {metadata.created}"
458
+ if metadata.gpu:
459
+ provenance_info += f" | GPU: {metadata.gpu}"
460
+ if metadata.auto_heal and metadata.fallback_chain:
461
+ provenance_info += f" | Auto-healed: {' → '.join(metadata.fallback_chain)}"
462
+ elif metadata.auto_heal:
463
+ provenance_info += " | Auto-healed: (no fallbacks needed)"
464
+ else:
465
+ provenance_info += " | Direct validation (no auto-healing)"
466
+
467
+ console.print(f"[dim]📋 {provenance_info}[/dim]")
468
+
469
+ # Display dataset provenance if available
470
+ if metadata.dataset:
471
+ dataset_info = f"📁 Dataset: {metadata.dataset.get('file_path', metadata.dataset.get('dataset_name', 'N/A'))}"
472
+ if metadata.dataset.get('schema'):
473
+ dataset_info += f" ({metadata.dataset['schema']})"
474
+ console.print(f"[dim]{dataset_info}[/dim]")
475
+
476
+ if 'train_size' in metadata.dataset and 'eval_size' in metadata.dataset:
477
+ size_info = f"🔢 Train size: {metadata.dataset['train_size']} | Eval size: {metadata.dataset['eval_size']}"
478
+ console.print(f"[dim]{size_info}[/dim]")
479
+
480
+ if 'sha256' in metadata.dataset:
481
+ sha256 = metadata.dataset['sha256']
482
+ if len(sha256) > 12:
483
+ sha256 = sha256[:12] + "..."
484
+ console.print(f"[dim]🔑 SHA256: {sha256}[/dim]")
485
+ else:
486
+ console.print("[yellow]⚠️ Config missing dataset metadata. Consider re-running validate to persist provenance.[/yellow]")
487
+
488
+ except Exception as e:
489
+ console.print(f"[bold red]❌ Failed to load config from {config}: {e}[/bold red]")
490
+ raise typer.Exit(1)
491
+
492
+ # Display training configuration
493
+ dataset_info = f"{final_config.dataset.type}"
494
+ if final_config.dataset.path:
495
+ dataset_info += f" ({final_config.dataset.path})"
496
+ elif final_config.dataset.name:
497
+ dataset_info += f" ({final_config.dataset.name})"
498
+
499
+ config_panel = Panel(
500
+ f"""[bold blue]Training Configuration[/bold blue]
501
+
502
+ [cyan]Model:[/cyan] {final_config.model}
503
+ [cyan]Output Directory:[/cyan] {final_config.output_dir}
504
+ [cyan]Dataset:[/cyan] {dataset_info}
505
+ [cyan]Schema:[/cyan] {final_config.dataset.schema_type or 'auto'}
506
+ [cyan]Text Field:[/cyan] {final_config.dataset.text_field or 'auto'}
507
+ [cyan]Epochs:[/cyan] {final_config.epochs}
508
+ [cyan]Batch Size:[/cyan] {final_config.batch_size}
509
+ [cyan]Learning Rate:[/cyan] {final_config.learning_rate}
510
+ [cyan]Max Steps:[/cyan] {final_config.max_steps if final_config.max_steps else 'Auto-calculated'}
511
+ [cyan]Block Size:[/cyan] {final_config.block_size}
512
+ [cyan]Gradient Accumulation:[/cyan] {final_config.grad_accum}
513
+ [cyan]LoRA Rank:[/cyan] {final_config.lora_r}
514
+ [cyan]LoRA Alpha:[/cyan] {final_config.lora_alpha}
515
+ [cyan]LoRA Dropout:[/cyan] {final_config.lora_dropout}
516
+ [cyan]Gradient Checkpointing:[/cyan] {final_config.gradient_checkpointing}
517
+ [cyan]Flash Attention:[/cyan] {final_config.flash_attn}
518
+ [cyan]Data Type:[/cyan] {final_config.dtype}""",
519
+ title="🚀 Starting Dataset-Agnostic Training",
520
+ border_style="green"
521
+ )
522
+
523
+ console.print(config_panel)
524
+
525
+ # Create output directory if it doesn't exist
526
+ Path(final_config.output_dir).mkdir(parents=True, exist_ok=True)
527
+
528
+ # Run training
529
+ try:
530
+ from training.launcher import launch_training
531
+ result = launch_training(final_config)
532
+
533
+ if result["status"] == "success":
534
+ console.print(Panel(
535
+ f"""[bold green]✅ Training Completed Successfully![/bold green]
536
+
537
+ [cyan]Output Directory:[/cyan] {result['output_dir']}
538
+ [cyan]Model Path:[/cyan] {result['model_path']}
539
+
540
+ [bold blue]Final Metrics:[/bold blue]
541
+ [cyan]Train Loss:[/cyan] {result['metrics'].get('train_loss', 'N/A')}
542
+ [cyan]Eval Loss:[/cyan] {result['metrics'].get('eval_loss', 'N/A')}
543
+ [cyan]Total Steps:[/cyan] {result['metrics'].get('total_steps', 'N/A')}
544
+ [cyan]Epochs:[/cyan] {result['metrics'].get('epochs', 'N/A')}
545
+ [cyan]Train Runtime:[/cyan] {result['metrics'].get('train_runtime', 'N/A')}s
546
+ [cyan]Samples/Second:[/cyan] {result['metrics'].get('train_samples_per_second', 'N/A')}""",
547
+ title="🎉 Training Results",
548
+ border_style="green"
549
+ ))
550
+ return
551
+ else:
552
+ console.print(Panel(
553
+ f"""[bold red]❌ Training Failed[/bold red]
554
+
555
+ [red]Error:[/red] {result.get('error', 'Unknown error')}
556
+ [cyan]Output Directory:[/cyan] {result.get('output_dir', 'N/A')}""",
557
+ title="💥 Training Error",
558
+ border_style="red"
559
+ ))
560
+ raise typer.Exit(1)
561
+
562
+ except Exception as e:
563
+ console.print(Panel(
564
+ f"""[bold red]❌ Unexpected Error[/bold red]
565
+
566
+ [red]Error:[/red] {str(e)}""",
567
+ title="💥 Unexpected Error",
568
+ border_style="red"
569
+ ))
570
+ raise typer.Exit(1)
571
+
572
+
573
+ @app.command()
574
+ def validate(
575
+ model: str = typer.Option(..., help="HF model id or local path"),
576
+ dataset: str = typer.Option("wikitext", help="Dataset specification: wikitext | jsonl:<path> | hf:<name>"),
577
+ precision: str = typer.Option("fp16", help="fp32|fp16|bf16|qlora4bit"),
578
+ seq_len: int = typer.Option(1024, help="Sequence length"),
579
+ batch_size: int = typer.Option(2, help="Batch size"),
580
+ lora: bool = typer.Option(True, help="Enable LoRA"),
581
+ max_samples: int = typer.Option(128, help="Max samples for schema sniff"),
582
+ text_field: Optional[str] = typer.Option(None, help="Text field for generic HF datasets"),
583
+ schema: Optional[str] = typer.Option(None, help="Schema for JSONL datasets: sft | dialogue | plain | auto"),
584
+ role_markers: bool = typer.Option(True, "--role-markers/--no-role-markers", help="Use role markers for dialogue datasets"),
585
+ user_marker: str = typer.Option("<user>", help="User role marker"),
586
+ assistant_marker: str = typer.Option("<assistant>", help="Assistant role marker"),
587
+ eval_split: Optional[float] = typer.Option(None, help="Fraction of data to use for evaluation (0.0-1.0)"),
588
+ eval_file: Optional[str] = typer.Option(None, help="Path to separate evaluation file (for JSONL)"),
589
+ gradient_checkpointing: bool = typer.Option(False, "--gradient-checkpointing/--no-gradient-checkpointing", help="Enable gradient checkpointing"),
590
+ flash_attn: bool = typer.Option(False, "--flash-attn/--no-flash-attn", help="Enable flash attention"),
591
+ dtype: str = typer.Option("fp16", help="Data type: fp32|fp16|bf16"),
592
+ dry_run_flag: bool = typer.Option(True, "--dry-run/--no-dry-run", help="Do the 1-batch fwd+bwd"),
593
+ auto_heal: bool = typer.Option(True, "--auto-heal/--no-auto-heal", help="Enable auto-healing fallback simulation"),
594
+ max_attempts: int = typer.Option(10, help="Maximum fallback attempts for auto-healing"),
595
+ save_config_path: Optional[str] = typer.Option(None, "--save-config", help="Save auto-healed config to YAML file"),
596
+ overwrite: bool = typer.Option(False, "--overwrite", help="Overwrite existing config file instead of versioning"),
597
+ ):
598
+ """
599
+ Validate model, dataset, and training configuration before training.
600
+
601
+ This command performs comprehensive validation including:
602
+ - Model family detection and LoRA target module validation
603
+ - GPU capability and precision support checks
604
+ - Memory estimation and OOM prevention
605
+ - Tokenizer validation
606
+ - Optional 1-batch dry-run to test actual training setup
607
+
608
+ Examples:
609
+ # Basic validation with GPT-2
610
+ humigence validate --model gpt2 --dataset wikitext --precision fp16
611
+
612
+ # Validate with BF16 (will fail on non-BF16 GPUs)
613
+ humigence validate --model gpt2 --precision bf16
614
+
615
+ # Validate with 4-bit quantization
616
+ humigence validate --model gpt2 --precision qlora4bit
617
+
618
+ # Validate without dry-run
619
+ humigence validate --model gpt2 --no-dry-run
620
+ """
621
+ if precision not in PRECISIONS:
622
+ typer.secho(f"Unsupported precision: {precision}", fg=typer.colors.RED, err=True)
623
+ raise typer.Exit(1)
624
+
625
+ # Detect model family and get config
626
+ family, cfg = detect_family(model)
627
+ gpu = get_gpu_info()
628
+ tok_ok, tok_msg = tokenizer_ok(model)
629
+ prec_ok, prec_msg = precision_supported(precision, gpu)
630
+
631
+ # Detect dataset type and validate
632
+ dataset_type = _detect_dataset_type(dataset)
633
+ dataset_ok, dataset_msg = _validate_dataset(dataset, dataset_type, text_field, schema)
634
+
635
+ # Create dataset configuration with eval split support
636
+ dataset_config = _create_dataset_config(dataset, text_field, schema, role_markers, user_marker, assistant_marker, eval_split, eval_file)
637
+
638
+ # GPU-aware defaults and warnings
639
+ _apply_gpu_aware_defaults(gpu, precision, batch_size, seq_len, gradient_checkpointing, flash_attn, dtype)
640
+
641
+ # Load dataset to capture metadata
642
+ dataset_metadata = None
643
+ if dataset_ok:
644
+ try:
645
+ from training.data_loader import create_dataset_loader
646
+ loader = create_dataset_loader(
647
+ dataset,
648
+ text_field=text_field,
649
+ schema=schema or "auto",
650
+ role_markers=role_markers,
651
+ user_marker=user_marker,
652
+ assistant_marker=assistant_marker,
653
+ eval_split=eval_split,
654
+ eval_file=eval_file
655
+ )
656
+ # Load dataset to get metadata
657
+ train_dataset, eval_dataset = loader.load()
658
+ dataset_metadata = loader.get_metadata()
659
+ except Exception as e:
660
+ console.print(f"[yellow]⚠️ Could not load dataset metadata: {e}[/yellow]")
661
+ dataset_metadata = None
662
+
663
+ # Estimate parameters and memory
664
+ params = estimate_model_params(cfg)
665
+ mem_est = estimate_memory_bytes(params, precision, adam=True, lora=lora)
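+ # Rough planning figure: weights + gradients + Adam moments (estimate_memory_bytes is assumed to discount frozen weights when lora=True); None if the param count is unknown.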
666
+ mem_info = f"est ~{mem_est/1e9:.2f} GB" if mem_est else "n/a"
667
+
668
+ # Collect warnings
669
+ warns = []
670
+ if not tok_ok:
671
+ warns.append(f"Tokenizer: {tok_msg}")
672
+ if not prec_ok:
673
+ warns.append(f"Precision: {prec_msg}")
674
+ if not dataset_ok:
675
+ warns.append(f"Dataset: {dataset_msg}")
676
+
677
+ # Check sequence length against model limits
678
+ max_pos = getattr(cfg, "max_position_embeddings", None)
679
+ if max_pos and seq_len > max_pos:
680
+ warns.append(f"seq_len {seq_len} > model limit {max_pos}. Suggest <= {max_pos}.")
681
+
682
+ # Create summary table
683
+ tbl = Table(title="Humigence Validation Summary")
684
+ tbl.add_column("Item", style="cyan")
685
+ tbl.add_column("Value", style="white")
686
+ tbl.add_row("Model", model)
687
+ tbl.add_row("Family", family)
688
+ tbl.add_row("Dataset Type", dataset_config.type)
689
+ tbl.add_row("Dataset Path/Name", dataset_config.path or dataset_config.name or "N/A")
690
+ tbl.add_row("Schema", dataset_config.schema_type or "auto")
691
+ tbl.add_row("Text Field", dataset_config.text_field or "auto")
692
+ if dataset_config.type == "jsonl" and dataset_config.schema_type == "dialogue":
693
+ tbl.add_row("Role Markers", f"{dataset_config.user_marker} / {dataset_config.assistant_marker}")
694
+
695
+ # Add dataset metadata if available
696
+ if dataset_metadata:
697
+ tbl.add_row("Train Size", str(dataset_metadata.get("train_size", "N/A")))
698
+ tbl.add_row("Eval Size", str(dataset_metadata.get("eval_size", "N/A")))
699
+ if "sha256" in dataset_metadata:
700
+ sha256 = dataset_metadata["sha256"]
701
+ if len(sha256) > 12:
702
+ sha256 = sha256[:12] + "..."
703
+ tbl.add_row("SHA256", sha256)
704
+
705
+ tbl.add_row("Precision", precision)
706
+ tbl.add_row("GPU", f"{gpu.name} (bf16={gpu.bf16_supported}, cc={gpu.cc_major}.{gpu.cc_minor})" if gpu.available else "CPU")
707
+ tbl.add_row("Params (est.)", f"{params:,}" if params else "unknown")
708
+ tbl.add_row("Memory (est.)", mem_info)
709
+ tbl.add_row("Seq Len", str(seq_len))
710
+ tbl.add_row("Batch Size", str(batch_size))
711
+ tbl.add_row("LoRA", str(lora))
712
+ tbl.add_row("Tokenizer", "OK" if tok_ok else f"ISSUE: {tok_msg}")
713
+ tbl.add_row("Precision Support", "OK" if prec_ok else f"ISSUE: {prec_msg}")
714
+ tbl.add_row("Dataset", "OK" if dataset_ok else f"ISSUE: {dataset_msg}")
715
+ console.print(tbl)
716
+
717
+ # Display warnings
718
+ if warns:
719
+ console.print("\n[yellow]Warnings:[/yellow]")
720
+ for w in warns:
721
+ console.print(f" - {w}")
722
+
723
+ # Check precision support
724
+ if not prec_ok:
725
+ console.print("\n[bold red]FAIL[/bold red]: Precision not supported.")
726
+ _print_fallback(precision, gpu, lora, seq_len, batch_size)
727
+ raise typer.Exit(2)
728
+
729
+ # Perform dry run if requested
730
+ if dry_run_flag:
731
+ console.print("\n[bold]Running 1-batch dry-run...[/bold]")
732
+ lora_targets = suggested_lora_targets(family) if lora else None
733
+ res = dry_run(
734
+ model_id_or_path=model,
735
+ precision=precision,
736
+ seq_len=seq_len,
737
+ batch_size=batch_size,
738
+ lora=lora,
739
+ lora_targets=lora_targets,
740
+ )
741
+ if res.ok:
742
+ console.print(f"[green]PASS[/green]: dry-run completed. loss={res.details.get('loss'):.4f}")
743
+
744
+ # Save config if requested (even without auto-healing)
745
+ if save_config_path:
746
+ validation_config = ValidationConfig(
747
+ model=model,
748
+ dataset=dataset_config,
749
+ precision=precision,
750
+ seq_len=seq_len,
751
+ batch_size=batch_size,
752
+ lora=lora,
753
+ lora_targets=lora_targets,
754
+ gradient_checkpointing=gradient_checkpointing,
755
+ flash_attn=flash_attn,
756
+ dtype=dtype,
757
+ max_samples=max_samples
758
+ )
759
+
760
+ # Create runtime metadata
761
+ runtime_metadata = _create_runtime_metadata(gpu)
762
+
763
+ # Create metadata
764
+ metadata = ConfigMetadata(
765
+ created=datetime.now().isoformat(),
766
+ gpu=f"{gpu.name} (bf16={gpu.bf16_supported}, cc={gpu.cc_major}.{gpu.cc_minor})" if gpu.available else "CPU",
767
+ precision_supported=[p for p in ["fp32", "fp16", "bf16", "qlora4bit"] if precision_supported(p, gpu)[0]],
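+ # Snapshot of which precisions this GPU can run, recorded for provenance alongside the saved config.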
768
+ validator_version="0.3",
769
+ auto_heal=False,
770
+ fallback_chain=[],
771
+ original_config={
772
+ "model": model,
773
+ "precision": precision,
774
+ "seq_len": seq_len,
775
+ "batch_size": batch_size,
776
+ "lora": lora,
777
+ "gradient_checkpointing": gradient_checkpointing,
778
+ "flash_attn": flash_attn,
779
+ "dtype": dtype
780
+ },
781
+ dataset=dataset_metadata,
782
+ runtime=runtime_metadata
783
+ )
784
+
785
+ saved_path = save_config(validation_config, save_config_path, metadata, overwrite)
786
+ console.print(f"\n[bold green]✅ Config saved to {saved_path}[/bold green]")
787
+
788
+ raise typer.Exit(0)
789
+ else:
790
+ console.print(f"[red]FAIL[/red]: dry-run error: {res.error}")
791
+
792
+ # Auto-healing fallback simulation
793
+ if auto_heal:
794
+ console.print(f"[yellow]Auto-healing enabled. Attempting fallback simulation...[/yellow]")
795
+
796
+ # Create initial config candidate
797
+ initial_config = ConfigCandidate(
798
+ model=model,
799
+ precision=precision,
800
+ seq_len=seq_len,
801
+ batch_size=batch_size,
802
+ lora=lora,
803
+ lora_targets=lora_targets,
804
+ gradient_checkpointing=False,
805
+ dataset=dataset,
806
+ text_field=text_field
807
+ )
808
+
809
+ # Run fallback simulation
810
+ simulator = FallbackSimulator()
811
+ success, final_config = simulator.simulate_fallbacks(initial_config, max_attempts)
812
+
813
+ if success:
814
+ console.print(f"\n[bold green]🎉 AUTO-HEALING SUCCESSFUL![/bold green]")
815
+ console.print(f"[dim]Found working configuration after {len(simulator.attempts)} attempts[/dim]")
816
+
817
+ # Generate and display YAML config
818
+ yaml_config = simulator.generate_yaml_config(final_config)
819
+ console.print(f"\n[bold blue]AUTO-HEALED CONFIG PATCH[/bold blue]")
820
+ console.print(f"[dim]```yaml[/dim]")
821
+ console.print(yaml_config)
822
+ console.print(f"[dim]```[/dim]")
823
+
824
+ # Save config if requested
825
+ if save_config_path:
826
+ # Create ValidationConfig from final_config
827
+ validation_config = ValidationConfig(
828
+ model=final_config.model,
829
+ dataset=final_config.dataset,
830
+ precision=final_config.precision,
831
+ seq_len=final_config.seq_len,
832
+ batch_size=final_config.batch_size,
833
+ lora=final_config.lora,
834
+ lora_targets=final_config.lora_targets,
835
+ gradient_checkpointing=final_config.gradient_checkpointing,
836
+ text_field=final_config.text_field,
837
+ schema=getattr(final_config, 'schema', schema),
838
+ max_samples=max_samples
839
+ )
840
+
841
+ # Create fallback chain from simulator attempts
842
+ fallback_chain = []
843
+ for attempt in simulator.attempts[1:]: # Skip initial attempt
844
+ if attempt.notes:
845
+ fallback_chain.append(attempt.notes)
846
+ else:
847
+ # Generate fallback description from config changes
848
+ prev_config = simulator.attempts[attempt.attempt_num - 2].config
849
+ curr_config = attempt.config
850
+
851
+ changes = []
852
+ if prev_config.precision != curr_config.precision:
853
+ changes.append(f"precision {prev_config.precision} → {curr_config.precision}")
854
+ if prev_config.seq_len != curr_config.seq_len:
855
+ changes.append(f"seq_len {prev_config.seq_len} → {curr_config.seq_len}")
856
+ if prev_config.batch_size != curr_config.batch_size:
857
+ changes.append(f"batch_size {prev_config.batch_size} → {curr_config.batch_size}")
858
+ if prev_config.gradient_checkpointing != curr_config.gradient_checkpointing:
859
+ changes.append(f"gradient_checkpointing {prev_config.gradient_checkpointing} → {curr_config.gradient_checkpointing}")
860
+
861
+ if changes:
862
+ fallback_chain.append(", ".join(changes))
863
+
864
+ # Create metadata with fallback chain
865
+ metadata = ConfigMetadata(
866
+ created=datetime.now().isoformat(),
867
+ gpu=f"{gpu.name} (bf16={gpu.bf16_supported}, cc={gpu.cc_major}.{gpu.cc_minor})" if gpu.available else "CPU",
868
+ precision_supported=[p for p in ["fp32", "fp16", "bf16", "qlora4bit"] if precision_supported(p, gpu)[0]],
869
+ validator_version="0.3",
870
+ auto_heal=True,
871
+ fallback_chain=fallback_chain,
872
+ original_config={
873
+ "model": model,
874
+ "precision": precision,
875
+ "seq_len": seq_len,
876
+ "batch_size": batch_size,
877
+ "lora": lora
878
+ },
879
+ dataset=dataset_metadata
880
+ )
881
+
882
+ saved_path = save_config(validation_config, save_config_path, metadata, overwrite)
883
+ console.print(f"\n[bold green]✅ Auto-healed config saved to {saved_path}[/bold green]")
884
+
885
+ raise typer.Exit(0)
886
+ else:
887
+ console.print(f"\n[bold red]❌ AUTO-HEALING FAILED[/bold red]")
888
+ console.print(f"[dim]Could not find working configuration after {max_attempts} attempts[/dim]")
889
+ _print_fallback(precision, gpu, lora, seq_len, batch_size, res.oom)
890
+ raise typer.Exit(3)
891
+ else:
892
+ # No auto-healing, just show fallback suggestions
893
+ if res.oom:
894
+ console.print("[yellow]Detected OOM. Proposing fallback...[/yellow]")
895
+ _print_fallback(precision, gpu, lora, seq_len, batch_size, res.oom)
896
+ raise typer.Exit(3)
897
+ else:
898
+ # No dry-run; rely on static checks
899
+ if warns:
900
+ console.print("[yellow]COMPLETE WITH WARNINGS[/yellow]")
901
+ raise typer.Exit(0)
902
+ console.print("[green]PASS[/green]")
903
+ raise typer.Exit(0)
904
+
905
+
906
+ def _detect_dataset_type(dataset_spec: str) -> str:
907
+ """Detect dataset type from specification"""
908
+ if dataset_spec == "wikitext":
909
+ return "wikitext"
910
+ elif dataset_spec.startswith("jsonl:"):
911
+ return "jsonl"
912
+ elif dataset_spec.startswith("hf:"):
913
+ return "hf"
914
+ else:
915
+ # Assume it's a direct HF dataset name
916
+ return "hf"
917
+
918
+
919
+ def _create_dataset_config(dataset_spec: str, text_field: Optional[str], schema: Optional[str],
920
+ role_markers: bool, user_marker: str, assistant_marker: str,
921
+ eval_split: Optional[float] = None, eval_file: Optional[str] = None):
922
+ """Create DatasetConfig from CLI parameters"""
923
+ from config.schema import DatasetConfig
924
+
925
+ dataset_type = _detect_dataset_type(dataset_spec)
926
+
927
+ if dataset_type == "wikitext":
928
+ return DatasetConfig(type="wikitext", name="wikitext")
929
+
930
+ elif dataset_type == "jsonl":
931
+ file_path = dataset_spec[6:] # Remove "jsonl:" prefix
932
+ return DatasetConfig(
933
+ type="jsonl",
934
+ path=file_path,
935
+ schema_type=schema or "auto",
936
+ role_markers=role_markers,
937
+ user_marker=user_marker,
938
+ assistant_marker=assistant_marker,
939
+ eval_split=eval_split,
940
+ eval_file=eval_file
941
+ )
942
+
943
+ elif dataset_type == "hf":
944
+ dataset_name = dataset_spec[3:] if dataset_spec.startswith("hf:") else dataset_spec
945
+ return DatasetConfig(
946
+ type="hf",
947
+ name=dataset_name,
948
+ text_field=text_field or "text",
949
+ eval_split=eval_split
950
+ )
951
+
952
+ else:
953
+ raise ValueError(f"Unknown dataset type: {dataset_type}")
954
+
955
+
956
+ def _apply_gpu_aware_defaults(gpu, precision: str, batch_size: int, seq_len: int,
957
+ gradient_checkpointing: bool, flash_attn: bool, dtype: str):
958
+ """Apply GPU-aware defaults and warnings"""
959
+ if not gpu.available:
960
+ console.print("[yellow]⚠️ No GPU detected - using CPU mode[/yellow]")
961
+ return
962
+
963
+ # Get GPU memory info
964
+ try:
965
+ import torch
966
+ if torch.cuda.is_available():
967
+ gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
968
+ console.print(f"[blue]🔧 GPU Memory: {gpu_memory_gb:.1f}GB[/blue]")
969
+
970
+ # Warn about potential OOM issues
971
+ if precision == "fp32" and gpu_memory_gb < 24:
972
+ console.print(f"[yellow]⚠️ Detected {gpu_memory_gb:.1f}GB GPU — fp32 may OOM, recommend fp16 with batch_size<=4[/yellow]")
973
+ elif precision == "bf16" and not gpu.bf16_supported:
974
+ console.print(f"[yellow]⚠️ GPU doesn't support BF16, recommend fp16[/yellow]")
975
+ elif batch_size > 4 and gpu_memory_gb < 16:
976
+ console.print(f"[yellow]⚠️ Large batch size ({batch_size}) on {gpu_memory_gb:.1f}GB GPU may cause OOM[/yellow]")
977
+ except Exception as e:
978
+ console.print(f"[yellow]⚠️ Could not get GPU memory info: {e}[/yellow]")
979
+
980
+
981
+ def _create_runtime_metadata(gpu) -> Dict[str, Any]:
982
+ """Create runtime environment metadata"""
983
+ runtime_metadata = {}
984
+
985
+ try:
986
+ import torch
987
+ import platform
988
+
989
+ # GPU info
990
+ if gpu.available:
991
+ runtime_metadata["gpu"] = gpu.name
992
+ runtime_metadata["vram_gb"] = torch.cuda.get_device_properties(0).total_memory / (1024**3)
993
+ runtime_metadata["cuda"] = torch.version.cuda
994
+ else:
995
+ runtime_metadata["gpu"] = "CPU"
996
+ runtime_metadata["vram_gb"] = 0
997
+ runtime_metadata["cuda"] = None
998
+
999
+ # PyTorch version
1000
+ runtime_metadata["torch"] = torch.__version__
1001
+
1002
+ # System info
1003
+ runtime_metadata["platform"] = platform.platform()
1004
+ runtime_metadata["python"] = platform.python_version()
1005
+
1006
+ except Exception as e:
1007
+ console.print(f"[yellow]⚠️ Could not collect runtime metadata: {e}[/yellow]")
1008
+ runtime_metadata["error"] = str(e)
1009
+
1010
+ return runtime_metadata
1011
+
1012
+
1013
+ def _validate_dataset(dataset_spec: str, dataset_type: str, text_field: Optional[str], schema: Optional[str]) -> tuple[bool, str]:
1014
+ """Validate dataset specification and accessibility"""
1015
+ try:
1016
+ if dataset_type == "wikitext":
1017
+ # Wikitext is always valid
1018
+ return True, "OK"
1019
+
1020
+ elif dataset_type == "jsonl":
1021
+ file_path = dataset_spec[6:] # Remove "jsonl:" prefix
1022
+ if not os.path.exists(file_path):
1023
+ return False, f"File not found: {file_path}"
1024
+
1025
+ # Try to read first line to validate JSON format
1026
+ try:
1027
+ import json
1028
+ with open(file_path, 'r', encoding='utf-8') as f:
1029
+ first_line = f.readline().strip()
1030
+ if first_line:
1031
+ json.loads(first_line)
1032
+ return True, "OK"
1033
+ except json.JSONDecodeError:
1034
+ return False, f"Invalid JSON format in {file_path}"
1035
+ except Exception as e:
1036
+ return False, f"Error reading {file_path}: {e}"
1037
+
1038
+ elif dataset_type == "hf":
1039
+ dataset_name = dataset_spec[3:] if dataset_spec.startswith("hf:") else dataset_spec
1040
+ # Try to load dataset info (without actually downloading)
1041
+ try:
1042
+ from datasets import get_dataset_infos
1043
+ infos = get_dataset_infos(dataset_name)
1044
+ if not infos:
1045
+ return False, f"Dataset {dataset_name} not found"
1046
+ return True, "OK"
1047
+ except Exception as e:
1048
+ return False, f"Error accessing dataset {dataset_name}: {e}"
1049
+
1050
+ else:
1051
+ return False, f"Unknown dataset type: {dataset_type}"
1052
+
1053
+ except Exception as e:
1054
+ return False, f"Dataset validation error: {e}"
1055
+
1056
+
1057
+ def _print_fallback(precision: str, gpu, lora: bool, seq_len: int, batch_size: int, oom: bool = False):
1058
+ """Print fallback configuration recommendations"""
1059
+ console.print("\n[bold]RECOMMENDED CONFIG PATCH[/bold]")
1060
+ suggest = {
1061
+ "precision": precision,
1062
+ "seq_len": seq_len,
1063
+ "batch_size": batch_size,
1064
+ "lora": lora,
1065
+ "gradient_checkpointing": False,
1066
+ }
1067
+
1068
+ # Precision fallback
1069
+ if precision == "bf16" and not gpu.bf16_supported:
1070
+ suggest["precision"] = "fp16"
1071
+ if precision == "qlora4bit" and not gpu.available:
1072
+ suggest["precision"] = "fp16"
1073
+
1074
+ # OOM mitigations
1075
+ if oom:
1076
+ if batch_size > 1:
1077
+ suggest["batch_size"] = max(1, batch_size // 2)
1078
+ else:
1079
+ suggest["gradient_checkpointing"] = True
1080
+ if seq_len > 1024:
1081
+ suggest["seq_len"] = min(1024, seq_len // 2)
1082
+ if precision in ("bf16", "fp32"):
1083
+ suggest["precision"] = "fp16"
1084
+
1085
+ for k, v in suggest.items():
1086
+ console.print(f" - {k}: {v}")
1087
+
1088
+
1089
+ @app.command()
1090
+ def gpu_info():
1091
+ """Show detailed GPU information and selection options."""
1092
+ from validation.matrix import get_all_gpu_info
1093
+
1094
+ multi_gpu_info = get_all_gpu_info()
1095
+
1096
+ if not multi_gpu_info.gpus:
1097
+ console.print(Panel(
1098
+ "[bold red]❌ No GPUs detected[/bold red]\n"
1099
+ "[dim]Training will run on CPU[/dim]",
1100
+ title="GPU Information",
1101
+ border_style="red"
1102
+ ))
1103
+ return
1104
+
1105
+ # Create GPU information table
1106
+ table = Table(title="Available GPUs")
1107
+ table.add_column("Index", style="cyan", width=6)
1108
+ table.add_column("Name", style="white", width=40)
1109
+ table.add_column("VRAM", style="green", width=12)
1110
+ table.add_column("Compute Capability", style="blue", width=15)
1111
+ table.add_column("BF16 Support", style="yellow", width=12)
1112
+
1113
+ for gpu in multi_gpu_info.gpus:
1114
+ vram_gb = gpu.total_bytes / (1024**3)
1115
+ cc = f"{gpu.cc_major}.{gpu.cc_minor}"
1116
+ bf16_support = "✅ Yes" if gpu.bf16_supported else "❌ No"
1117
+
1118
+ table.add_row(
1119
+ str(gpu.device_index),
1120
+ gpu.name,
1121
+ f"{vram_gb:.1f} GB",
1122
+ cc,
1123
+ bf16_support
1124
+ )
1125
+
1126
+ console.print(table)
1127
+
1128
+ # Show selection examples
1129
+ console.print(Panel(
1130
+ f"""[bold blue]GPU Selection Examples[/bold blue]
1131
+
1132
+ [cyan]Single GPU Training:[/cyan]
1133
+ humigence train --config my_config.yaml --gpu-mode single --gpu-ids 0
1134
+
1135
+ [cyan]Multi-GPU Training (all GPUs):[/cyan]
1136
+ humigence train --config my_config.yaml --gpu-mode multi --gpu-ids 0,1
1137
+
1138
+ [cyan]Multi-GPU Training (specific GPUs):[/cyan]
1139
+ humigence train --config my_config.yaml --gpu-mode multi --gpu-ids 1,2
1140
+
1141
+ [dim]Total VRAM: {multi_gpu_info.total_vram_gb:.1f} GB across {multi_gpu_info.count} GPUs[/dim]""",
1142
+ title="Usage Examples",
1143
+ border_style="green"
1144
+ ))
1145
+
1146
+
1147
+ @app.command()
1148
+ def version():
1149
+ """Show version information."""
1150
+ console.print("[bold blue]Humigence v1.0.0[/bold blue]")
1151
+ console.print("[dim]Your AI. Your pipeline. Zero code.[/dim]")
1152
+
1153
+
1154
+ @app.callback()
1155
+ def main(
1156
+ version: bool = typer.Option(
1157
+ False,
1158
+ "--version",
1159
+ "-v",
1160
+ help="Show version and exit"
1161
+ )
1162
+ ):
1163
+ """
1164
+ Humigence - Your AI. Your pipeline. Zero code.
1165
+
1166
+ A complete MLOps suite built for makers, teams, and enterprises.
1167
+ """
1168
+ if version:
1169
+ console.print("[bold blue]Humigence v1.0.0[/bold blue]")
1170
+ console.print("[dim]Your AI. Your pipeline. Zero code.[/dim]")
1171
+ raise typer.Exit(0)
1172
+
1173
+
1174
+ if __name__ == "__main__":
1175
+ app()
nccl_memory_fix.py ADDED
@@ -0,0 +1,213 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ NCCL Memory Conflict Resolution
4
+
5
+ This script addresses the "illegal memory access" error in multi-GPU training
6
+ by implementing memory management strategies and fallback mechanisms.
7
+ """
8
+
9
+ import os
10
+ import subprocess
11
+ import torch
12
+ import torch.distributed as dist
13
+ from typing import Optional, Dict, Any
14
+ import logging
15
+
16
+ # Set up logging
17
+ logging.basicConfig(level=logging.INFO)
18
+ logger = logging.getLogger(__name__)
19
+
20
+ def check_gpu_memory_usage() -> Dict[int, Dict[str, float]]:
21
+ """Check current GPU memory usage"""
22
+ memory_info = {}
23
+ for i in range(torch.cuda.device_count()):
24
+ allocated = torch.cuda.memory_allocated(i) / 1024**3 # GB
25
+ reserved = torch.cuda.memory_reserved(i) / 1024**3 # GB
26
+ total = torch.cuda.get_device_properties(i).total_memory / 1024**3 # GB
27
+ free = total - reserved
28
+
29
+ memory_info[i] = {
30
+ 'allocated': allocated,
31
+ 'reserved': reserved,
32
+ 'total': total,
33
+ 'free': free,
34
+ 'usage_percent': (reserved / total) * 100
35
+ }
36
+
37
+ logger.info(f"GPU {i}: {allocated:.1f}GB allocated, {reserved:.1f}GB reserved, "
38
+ f"{free:.1f}GB free ({memory_info[i]['usage_percent']:.1f}% used)")
39
+
40
+ return memory_info
41
+
42
+ def clear_gpu_memory():
43
+ """Clear GPU memory and cache"""
44
+ logger.info("Clearing GPU memory...")
45
+ for i in range(torch.cuda.device_count()):
46
+ torch.cuda.set_device(i)
47
+ torch.cuda.empty_cache()
48
+ torch.cuda.synchronize()
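+ # empty_cache() returns unused cached blocks to the driver; synchronize() ensures pending kernels on this device have finished first.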
49
+
50
+ # Force garbage collection
51
+ import gc
52
+ gc.collect()
53
+
54
+ logger.info("GPU memory cleared")
55
+
56
+ def kill_competing_processes():
57
+ """Kill processes that might be using GPU memory"""
58
+ try:
59
+ # Find processes using GPU memory
60
+ result = subprocess.run(['nvidia-smi', '--query-compute-apps=pid,process_name,used_memory',
61
+ '--format=csv,noheader,nounits'],
62
+ capture_output=True, text=True)
63
+
64
+ if result.returncode == 0:
65
+ lines = result.stdout.strip().split('\n')
66
+ for line in lines:
67
+ if line.strip():
68
+ parts = line.split(', ')
69
+ if len(parts) >= 3:
70
+ pid, name, memory = parts[0], parts[1], parts[2]
71
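+ # Heuristic: any process with 'llama' in its name, or holding more than ~1 GB of GPU memory, is treated as competing — tune this for shared machines.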
+ if 'llama' in name.lower() or (memory.strip().isdigit() and int(memory) > 1000): # > 1GB; skips non-numeric values like "[N/A]"
72
+ logger.info(f"Found competing process: {name} (PID: {pid}, Memory: {memory}MB)")
73
+ try:
74
+ subprocess.run(['kill', '-9', pid], check=True)
75
+ logger.info(f"Killed process {pid}")
76
+ except subprocess.CalledProcessError:
77
+ logger.warning(f"Could not kill process {pid}")
78
+
79
+ except Exception as e:
80
+ logger.warning(f"Could not check/kill competing processes: {e}")
81
+
82
+ def setup_nccl_environment():
83
+ """Set up optimal NCCL environment variables"""
84
+ nccl_env = {
85
+ 'NCCL_DEBUG': 'INFO',
86
+ 'NCCL_IB_DISABLE': '1', # Disable InfiniBand
87
+ 'NCCL_P2P_DISABLE': '1', # Disable peer-to-peer
88
+ 'NCCL_SHM_DISABLE': '0', # Enable shared memory
89
+ 'NCCL_SOCKET_IFNAME': 'enp6s18', # Host-specific interface name; replace with your own NIC (check `ip link`)
90
+ 'NCCL_NET_GDR_LEVEL': '0', # Disable GPU Direct RDMA
91
+ 'NCCL_CROSS_NIC': '0', # Disable cross-NIC communication
92
+ 'NCCL_ALGO': 'Ring', # Use Ring algorithm
93
+ 'CUDA_LAUNCH_BLOCKING': '1', # Enable CUDA error checking
94
+ 'TORCH_NCCL_ASYNC_ERROR_HANDLING': '1', # Enable async error handling
95
+ 'TOKENIZERS_PARALLELISM': 'false', # Disable tokenizer parallelism
96
+ }
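+ # Deliberate trade-off: disabling P2P/IB costs interconnect bandwidth but sidesteps the illegal-memory-access crash this script targets.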
97
+
98
+ for key, value in nccl_env.items():
99
+ os.environ[key] = value
100
+ logger.info(f"Set {key}={value}")
101
+
102
+ def create_memory_efficient_config(base_config: Dict[str, Any]) -> Dict[str, Any]:
103
+ """Create memory-efficient training configuration"""
104
+ memory_config = base_config.copy()
105
+
106
+ # Reduce memory usage
107
+ memory_config.update({
108
+ 'per_device_train_batch_size': 1, # Minimal batch size
109
+ 'gradient_accumulation_steps': 8, # Compensate with accumulation
110
+ 'eval_batch_size': 1, # Minimal eval batch size
111
+ 'max_seq_length': 512, # Reduce sequence length
112
+ 'fp16': True, # Use half precision
113
+ 'bf16': False, # Disable bf16 to save memory
114
+ 'pin_memory': False, # Disable pin memory
115
+ 'num_workers': 0, # Disable multiprocessing
116
+ })
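+ # Effective batch stays at 8 per device (1 micro-batch × 8 accumulation steps), so optimization behaviour is roughly preserved.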
117
+
118
+ logger.info("Created memory-efficient configuration")
119
+ return memory_config
120
+
121
+ def test_nccl_communication():
122
+ """Test NCCL communication without training"""
123
+ logger.info("Testing NCCL communication...")
124
+
125
+ try:
126
+ # Initialize process group
127
+ if not dist.is_initialized():
128
+ dist.init_process_group(backend='nccl')
129
+
130
+ rank = dist.get_rank()
131
+ world_size = dist.get_world_size()
132
+
133
+ logger.info(f"Rank {rank}/{world_size} initialized")
134
+
135
+ # Test simple communication
136
+ if rank == 0:
137
+ tensor = torch.ones(10, device='cuda')
138
+ logger.info(f"Rank 0 sending tensor: {tensor}")
139
+ else:
140
+ tensor = torch.zeros(10, device='cuda')
141
+ logger.info(f"Rank 1 receiving tensor: {tensor}")
142
+
143
+ # All-reduce test
144
+ dist.all_reduce(tensor)
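+ # Default SUM all-reduce: every rank should end up with all-ones (rank 0 contributed ones, the others zeros).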
145
+ logger.info(f"Rank {rank} after all_reduce: {tensor}")
146
+
147
+ # Barrier test
148
+ dist.barrier()
149
+ logger.info(f"Rank {rank} passed barrier")
150
+
151
+ logger.info("✅ NCCL communication test PASSED")
152
+ return True
153
+
154
+ except Exception as e:
155
+ logger.error(f"❌ NCCL communication test FAILED: {e}")
156
+ return False
157
+ finally:
158
+ if dist.is_initialized():
159
+ dist.destroy_process_group()
160
+
161
+ def run_memory_safe_training(config_path: str):
162
+ """Run training with memory safety measures"""
163
+ logger.info("Starting memory-safe training...")
164
+
165
+ # Step 1: Clear memory
166
+ clear_gpu_memory()
167
+
168
+ # Step 2: Kill competing processes
169
+ kill_competing_processes()
170
+
171
+ # Step 3: Set up NCCL environment
172
+ setup_nccl_environment()
173
+
174
+ # Step 4: Check memory after cleanup
175
+ memory_info = check_gpu_memory_usage()
176
+
177
+ # Step 5: Test NCCL communication
178
+ if not test_nccl_communication():
179
+ logger.error("NCCL communication test failed, falling back to single GPU")
180
+ return False
181
+
182
+ # Step 6: Run training with memory-efficient config
183
+ logger.info("Running memory-safe multi-GPU training...")
184
+
185
+ # This would be called by the actual training script
186
+ return True
187
+
188
+ def main():
189
+ """Main function for testing memory fixes"""
190
+ print("🚀 NCCL Memory Conflict Resolution")
191
+ print("=" * 50)
192
+
193
+ # Check initial memory state
194
+ print("\n📊 Initial GPU Memory State:")
195
+ memory_info = check_gpu_memory_usage()
196
+
197
+ # Clear memory
198
+ print("\n🧹 Clearing GPU Memory:")
199
+ clear_gpu_memory()
200
+
201
+ # Check memory after cleanup
202
+ print("\n📊 GPU Memory After Cleanup:")
203
+ memory_info = check_gpu_memory_usage()
204
+
205
+ # Set up environment
206
+ print("\n⚙️ Setting up NCCL Environment:")
207
+ setup_nccl_environment()
208
+
209
+ print("\n✅ Memory management setup complete!")
210
+ print(" Ready for memory-safe multi-GPU training")
211
+
212
+ if __name__ == "__main__":
213
+ main()
runs/humigence/config.snapshot.json CHANGED
@@ -1,13 +1,35 @@
1
  {
2
- "setup_mode": "basic",
3
- "gpu_config": "Single GPU \u2013 GPU 0: NVIDIA GeForce RTX 5090",
4
- "base_model": "Qwen/Qwen1.5-0.5B",
5
- "dataset_path": "/home/joshua/humigence_data/openassistant_full/oasst1.jsonl",
6
  "training_recipe": "QLoRA (4-bit NF4)",
7
- "learning_rate": "2e-5",
8
- "num_train_epochs": "3",
9
- "gradient_accumulation_steps": "4",
10
- "logging_steps": "10",
11
- "save_steps": "100",
12
- "timestamp": "2025-09-17T22:50:18.668019"
13
  }
 
1
  {
2
+ "model_name": "Qwen/Qwen2.5-0.5B",
3
  "training_recipe": "QLoRA (4-bit NF4)",
4
+ "learning_rate": 0.0002,
5
+ "num_train_epochs": 1,
6
+ "per_device_train_batch_size": 2,
7
+ "gradient_accumulation_steps": 4,
8
+ "eval_batch_size": 8,
9
+ "fp16": true,
10
+ "bf16": false,
11
+ "multi_gpu": false,
12
+ "selected_gpus": [
13
+ 0
14
+ ],
15
+ "dataset_path": "/home/joshua/humigence_data/wikitext2.jsonl",
16
+ "data_schema": "instruction_output",
17
+ "train_val_test_split": [
18
+ 0.8,
19
+ 0.1,
20
+ 0.1
21
+ ],
22
+ "split_seed": 42,
23
+ "max_seq_length": 1024,
24
+ "lora_r": 16,
25
+ "lora_alpha": 32,
26
+ "lora_dropout": 0.05,
27
+ "logging_steps": 10,
28
+ "eval_steps": 100,
29
+ "save_steps": 100,
30
+ "output_dir": "runs/humigence",
31
+ "eval_single_gpu": true,
32
+ "eval_gpu_index": 0,
33
+ "num_workers": 4,
34
+ "pin_memory": true
35
  }
runs/humigence/eval_results.jsonl CHANGED
@@ -1,5 +1,5 @@
1
- {"prompt": "What is the capital of France?", "output": "The capital of France is Paris."}
2
- {"prompt": "Explain quantum computing in simple terms.", "output": "Quantum computing uses quantum mechanics principles..."}
3
- {"prompt": "Write a short poem about artificial intelligence.", "output": "In circuits deep and silicon bright..."}
4
- {"prompt": "How do you make a good cup of coffee?", "output": "Start with fresh, high-quality beans..."}
5
- {"prompt": "What are the benefits of renewable energy?", "output": "Renewable energy offers numerous benefits..."}
 
1
+ {"prompt": "What is the capital of France?", "output": "What is the capital of France? ____\nA. Paris\nB. Brussels\nC. London\nD. Berlin\nAnswer:\nA\n\nThe most significant difference between a company's core competencies and its core competitive advantages lies in the fact that the former is not only the result of the latter but also an important factor in its creation. A. Correct B. Incorrect\nAnswer:\nA\n\nThe characteristic that reflects the company's long-term development direction and future prospects is ____.\nA. Product characteristics\nB. Quality characteristics\nC. Service characteristics\nD. Social characteristics\nAnswer:\nA\n\nWhich of the following statements about the impact of the new curriculum reform on teaching is incorrect?\nA. Teachers are no longer bound by fixed textbooks and fixed teaching methods.\nB. Teachers have become more active.\nC. Educational activities have become more diversified.\nD. Teachers' teaching roles have changed.\nAnswer:\nD\n\nA company has recently hired a new assistant. Through in-depth interviews with the new employee and the company's top management"}
2
+ {"prompt": "Explain quantum computing in simple terms.", "output": "Explain quantum computing in simple terms. Quantum computing is a type of computing that uses quantum particles, such as photons or electrons, to perform calculations. Instead of using bits that are either 0 or 1, quantum computers use quantum bits, or qubits, which can be in multiple states at the same time. This allows quantum computers to perform a wide range of calculations exponentially faster than classical computers, making them a promising candidate for solving complex problems in areas such as cryptography and drug discovery."}
3
+ {"prompt": "Write a short poem about artificial intelligence.", "output": "Write a short poem about artificial intelligence. Artificial intelligence creates new art"}
4
+ {"prompt": "How do you make a good cup of coffee?", "output": "How do you make a good cup of coffee? Provide a step by step guide that incorporates techniques such as brewing methods, brewing equipment, and equipment for brewing coffee. Include tips on how to vary the type of coffee you use and how to blend different types of coffee. Incorporate information on how to store and maintain the brewing equipment and coffee. Provide a list of recommended brands of coffee and brewing equipment to help you get started.\nA good cup of coffee is made through a combination of brewing methods and equipment. The best way to make a good cup of coffee is to use a high-quality brewing method and equipment that can handle the brewing process efficiently. Here is a step-by-step guide on how to make a good cup of coffee:\n1. Choose the right brewing method:\nThere are many different brewing methods available, but the most common ones are the French press, pour-over, and espresso machines. Each type of brewing method has its own unique pros and cons. For instance, the French press is an excellent method for making a strong, flavorful"}
5
+ {"prompt": "What are the benefits of renewable energy?", "output": "What are the benefits of renewable energy? The benefits of renewable energy are numerous and include the following:\nThe environmental benefits of renewable energy include reduced greenhouse gas emissions and clean air. Renewable energy sources like solar and wind power produce little to no emissions, while coal and natural gas are known to emit harmful air pollutants such as sulfur dioxide and nitrogen oxides. Renewable energy sources also help to reduce the carbon footprint of the energy sector and preserve forests and other natural habitats. The use of renewable energy also helps to improve air quality in urban areas, which benefits public health and well-being.\nThe economic benefits of renewable energy include cost savings and reduced dependence on fossil fuels. The cost of producing electricity from renewable sources is significantly lower than the cost of producing electricity from fossil fuels. This means that renewable energy can be cheaper to produce and more affordable for consumers, leading to increased consumer adoption and growth in the sector.\nThe energy security benefits of renewable energy include the ability to control energy supply and reduce dependence on foreign oil. Renewable energy resources like wind and solar power"}
runs/humigence/run_summary.json CHANGED
@@ -1,12 +1,12 @@
  {
-   "run_id": "2025-09-17T22:50:18.668019",
    "status": "accepted",
-   "model": "Qwen/Qwen1.5-0.5B",
-   "dataset": "/home/joshua/humigence_data/openassistant_full/oasst1.jsonl",
    "recipe": "QLoRA (4-bit NF4)",
-   "epochs": "3",
-   "learning_rate": "2e-5",
-   "final_loss": 0.65,
    "eval_prompt_count": 5,
-   "timestamp": "2025-09-17 23:31:01"
  }

  {
+   "run_id": "2025-09-21T22:47:33",
    "status": "accepted",
+   "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+   "dataset": "/home/joshua/humigence_data/imdb.jsonl",
    "recipe": "QLoRA (4-bit NF4)",
+   "epochs": "1",
+   "learning_rate": "2e-4",
+   "final_loss": null,
    "eval_prompt_count": 5,
+   "timestamp": "2025-09-21 22:47:52"
  }
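The summary is a single JSON object per run; note that the new run records `final_loss: null`, which `json.load` maps to Python `None`. A minimal sketch for inspecting it (path taken from the diff header above):

```python
import json

with open("runs/humigence/run_summary.json") as f:
    summary = json.load(f)

# final_loss is null in this run, so guard before formatting it
loss = summary.get("final_loss")
print(summary["model"], summary["recipe"], "loss:", "n/a" if loss is None else f"{loss:.2f}")
```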
setup_unsloth.sh ADDED
@@ -0,0 +1,93 @@
+ #!/bin/bash
+ # Setup script for Unsloth dual-GPU LoRA training
+ # Optimized for RTX 5090 (Blackwell architecture)
+
+ set -e
+
+ echo "🚀 Setting up Unsloth dual-GPU LoRA training environment..."
+
+ # Check that we're running from the humigence repo root
+ if [ ! -f "cli/main.py" ]; then
+     echo "❌ Error: Please run this script from the humigence directory"
+     exit 1
+ fi
+
+ # Check Python version
+ python_version=$(python3 --version 2>&1 | cut -d' ' -f2 | cut -d'.' -f1,2)
+ required_version="3.8"
+ if [ "$(printf '%s\n' "$required_version" "$python_version" | sort -V | head -n1)" != "$required_version" ]; then
+     echo "❌ Error: Python 3.8+ required, found $python_version"
+     exit 1
+ fi
+
+ echo "✅ Python version: $python_version"
+
+ # Check CUDA availability
+ if ! command -v nvidia-smi &> /dev/null; then
+     echo "⚠️ Warning: nvidia-smi not found. CUDA may not be available."
+ else
+     echo "✅ CUDA detected:"
+     nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
+ fi
+
+ # Install PyTorch with CUDA support
+ # NOTE: RTX 5090 (Blackwell) needs CUDA 12.8+ wheels; swap cu121 for cu128 if kernels fail to load
+ echo "📦 Installing PyTorch with CUDA support..."
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
+
+ # Install other dependencies (specs quoted so bash does not treat '>' as a redirect)
+ echo "📦 Installing other dependencies..."
+ pip install "transformers>=4.36.0" "datasets>=2.14.0" "accelerate>=0.24.0" "peft>=0.7.0" "bitsandbytes>=0.41.0"
+
+ # Install Unsloth from source
+ echo "📦 Installing Unsloth from source..."
+ pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+
+ # Install CLI dependencies
+ echo "📦 Installing CLI dependencies..."
+ pip install "rich>=13.0.0" "inquirer>=3.1.0" "typer>=0.9.0" "numpy>=1.24.0" "pandas>=2.0.0" "tqdm>=4.65.0"
+
+ # Create output directories
+ echo "📁 Creating output directories..."
+ mkdir -p runs/humigence
+ mkdir -p humigence_data
+
+ # Test installation
+ echo "🧪 Testing installation..."
+ python3 -c "
+ import torch
+ import transformers
+ import datasets
+ import accelerate
+ import peft
+ import bitsandbytes
+ print('✅ All core dependencies imported successfully')
+
+ # Test CUDA
+ if torch.cuda.is_available():
+     print(f'✅ CUDA available: {torch.cuda.device_count()} GPU(s)')
+     for i in range(torch.cuda.device_count()):
+         print(f'  GPU {i}: {torch.cuda.get_device_name(i)}')
+ else:
+     print('⚠️ CUDA not available - training will be slower')
+
+ # Test Unsloth
+ try:
+     import unsloth
+     print('✅ Unsloth imported successfully')
+ except ImportError as e:
+     print(f'❌ Unsloth import failed: {e}')
+     exit(1)
+ "
+
+ echo ""
+ echo "🎉 Setup completed successfully!"
+ echo ""
+ echo "To start training:"
+ echo "  python3 cli/main.py"
+ echo ""
+ echo "Available options:"
+ echo "  1. Supervised Fine-Tuning (Unsloth + Dual-GPU) 🚀"
+ echo "  2. Single-GPU LoRA Training ✅"
+ echo ""
+ echo "For dual-GPU training, ensure you have 2+ GPUs available."
+ echo "The system will automatically detect and use available GPUs."
templates/accelerate_config.yaml CHANGED
@@ -0,0 +1,3 @@
+
+
+
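The template is committed empty (three blank lines). For orientation, a typical Accelerate config for the dual-GPU DDP setup this commit describes might look like the sketch below; the keys are standard `accelerate config` output, but the specific values are assumptions, not part of this commit:

```yaml
# Illustrative dual-GPU DDP config; values are assumptions, not from this commit
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 2        # one process per GPU
gpu_ids: all
mixed_precision: bf16
main_training_function: main
use_cpu: false
```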
train.py ADDED
@@ -0,0 +1,456 @@
+ #!/usr/bin/env python3
+ """
+ Humigence Training Script with Hugging Face Accelerate
+ Clean DDP training with single-GPU evaluation
+ """
+
+ import os
+ import json
+ import torch
+ from typing import List
+ from dataclasses import dataclass, field
+ from accelerate import Accelerator
+ from accelerate.utils import set_seed
+ from transformers import (
+     AutoTokenizer, AutoModelForCausalLM,
+     TrainingArguments, Trainer, DataCollatorForLanguageModeling,
+     BitsAndBytesConfig
+ )
+ from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, TaskType
+ from datasets import Dataset
+ import numpy as np
+ from rich.console import Console
+
+ # Set environment variables for stability
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
+ os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
+
+ console = Console()
+
+ @dataclass
+ class TrainingConfig:
+     """Training configuration dataclass"""
+     # Model config
+     base_model: str = "microsoft/DialoGPT-small"
+     training_recipe: str = "LoRA (FP16)"
+
+     # Training config
+     learning_rate: float = 2e-4
+     num_train_epochs: int = 1
+     per_device_train_batch_size: int = 2
+     per_device_eval_batch_size: int = 4
+     gradient_accumulation_steps: int = 4
+     max_seq_length: int = 1024
+
+     # LoRA config
+     lora_r: int = 16
+     lora_alpha: int = 32
+     lora_dropout: float = 0.05
+
+     # Data config
+     dataset_path: str = ""
+     train_val_test_split: List[float] = field(default_factory=lambda: [0.8, 0.1, 0.1])
+     split_seed: int = 42
+
+     # Output config
+     output_dir: str = "runs/humigence"
+     logging_steps: int = 10
+     save_steps: int = 100
+     eval_steps: int = 100
+
+     # Evaluation config
+     eval_gpu_index: int = 0  # Always use cuda:0 for evaluation
+
+ def load_config(config_path: str) -> TrainingConfig:
+     """Load configuration from a JSON file, ignoring unknown keys"""
+     with open(config_path, 'r') as f:
+         config_dict = json.load(f)
+
+     # Map config keys to dataclass fields
+     config = TrainingConfig()
+     for key, value in config_dict.items():
+         if hasattr(config, key):
+             setattr(config, key, value)
+
+     return config
+
+ def prepare_dataset(config: TrainingConfig, tokenizer) -> tuple[Dataset, Dataset, Dataset]:
+     """Prepare dataset splits with tokenization"""
+     console.print("[blue]📊 Preparing dataset...[/blue]")
+
+     # Load dataset
+     with open(config.dataset_path, 'r') as f:
+         data = [json.loads(line) for line in f]
+
+     console.print(f"[blue]  Loaded {len(data)} samples[/blue]")
+
+     # Split dataset
+     np.random.seed(config.split_seed)
+     indices = np.random.permutation(len(data))
+
+     train_size = int(len(data) * config.train_val_test_split[0])
+     val_size = int(len(data) * config.train_val_test_split[1])
+
+     train_indices = indices[:train_size]
+     val_indices = indices[train_size:train_size + val_size]
+     test_indices = indices[train_size + val_size:]
+
+     train_data = [data[i] for i in train_indices]
+     val_data = [data[i] for i in val_indices]
+     test_data = [data[i] for i in test_indices]
+
+     console.print(f"[blue]  Train: {len(train_data)}, Val: {len(val_data)}, Test: {len(test_data)}[/blue]")
+
+     # Simple tokenization function
+     def tokenize_function(examples):
+         # Handle different data schemas
+         if "text" in examples:
+             # Simple text schema
+             texts = examples["text"]
+         elif "prompt" in examples and "output" in examples:
+             # Prompt-output schema (matches the eval JSONL in this repo)
+             texts = [f"{p}\n{o}" for p, o in zip(examples["prompt"], examples["output"])]
+         elif "instruction" in examples and "output" in examples:
+             # Instruction-output schema
+             texts = []
+             for i in range(len(examples["instruction"])):
+                 instruction = examples["instruction"][i]
+                 input_text = examples.get("input", [""])[i] if examples.get("input") else ""
+                 output = examples["output"][i]
+
+                 # Format as conversation
+                 if input_text:
+                     text = f"Instruction: {instruction}\nInput: {input_text}\nOutput: {output}"
+                 else:
+                     text = f"Instruction: {instruction}\nOutput: {output}"
+                 texts.append(text)
+         else:
+             # Fallback - use the first available text column
+             text_col = None
+             for col in ["text", "instruction", "input", "output"]:
+                 if col in examples:
+                     text_col = col
+                     break
+
+             if text_col:
+                 texts = examples[text_col]
+             else:
+                 # Last resort - convert to string
+                 texts = [str(ex) for ex in examples[list(examples.keys())[0]]]
+
+         tokenized = tokenizer(
+             texts,
+             truncation=True,
+             padding=True,
+             max_length=config.max_seq_length,
+             return_tensors=None
+         )
+
+         # Create labels for causal language modeling
+         tokenized["labels"] = tokenized["input_ids"].copy()
+
+         return tokenized
+
+     # Create datasets and tokenize
+     train_dataset = Dataset.from_list(train_data)
+     val_dataset = Dataset.from_list(val_data)
+     test_dataset = Dataset.from_list(test_data)
+
+     # Tokenize datasets - remove original columns after tokenization
+     original_columns = list(train_dataset.column_names)
+
+     train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=original_columns)
+     val_dataset = val_dataset.map(tokenize_function, batched=True, remove_columns=original_columns)
+     test_dataset = test_dataset.map(tokenize_function, batched=True, remove_columns=original_columns)
+
+     # Set format for PyTorch
+     train_dataset.set_format("torch")
+     val_dataset.set_format("torch")
+     test_dataset.set_format("torch")
+
+     return train_dataset, val_dataset, test_dataset
+
+ def setup_model_and_tokenizer(config: TrainingConfig, accelerator: Accelerator):
+     """Setup model and tokenizer with LoRA/QLoRA"""
+     console.print(f"[blue]🤖 Loading model: {config.base_model}[/blue]")
+
+     # Load tokenizer
+     tokenizer = AutoTokenizer.from_pretrained(config.base_model, trust_remote_code=True)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     # Load model
+     if "QLoRA" in config.training_recipe:
+         # QLoRA with 4-bit NF4 quantization
+         bnb_config = BitsAndBytesConfig(
+             load_in_4bit=True,
+             bnb_4bit_use_double_quant=True,
+             bnb_4bit_quant_type="nf4",
+             bnb_4bit_compute_dtype=torch.bfloat16
+         )
+
+         model = AutoModelForCausalLM.from_pretrained(
+             config.base_model,
+             quantization_config=bnb_config,
+             device_map=None,  # Let accelerate handle device placement
+             trust_remote_code=True
+         )
+
+         # Prepare for k-bit training
+         model = prepare_model_for_kbit_training(model)
+     else:
+         # Regular LoRA
+         model = AutoModelForCausalLM.from_pretrained(
+             config.base_model,
+             device_map=None,  # Let accelerate handle device placement
+             trust_remote_code=True,
+             dtype=torch.bfloat16 if "BF16" in config.training_recipe else torch.float16
+         )
+
+     # Apply LoRA - use appropriate target modules for the model
+     if "gpt" in config.base_model.lower() or "dialo" in config.base_model.lower():
+         # For GPT-style models
+         target_modules = ["c_attn", "c_proj"]
+     elif "llama" in config.base_model.lower() or "mistral" in config.base_model.lower():
+         # For LLaMA/Mistral models
+         target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
+     else:
+         # Default fallback
+         target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
+
+     lora_config = LoraConfig(
+         r=config.lora_r,
+         lora_alpha=config.lora_alpha,
+         target_modules=target_modules,
+         lora_dropout=config.lora_dropout,
+         bias="none",
+         task_type=TaskType.CAUSAL_LM
+     )
+
+     model = get_peft_model(model, lora_config)
+
+     # Print model info
+     trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+     total_params = sum(p.numel() for p in model.parameters())
+     console.print(f"[blue]  Trainable parameters: {trainable_params:,} ({trainable_params / total_params * 100:.2f}%)[/blue]")
+
+     return model, tokenizer
+
+ def train_model(model, tokenizer, train_dataset, val_dataset, config: TrainingConfig, accelerator: Accelerator):
+     """Train the model using Accelerate"""
+     console.print("[blue]🚀 Starting training...[/blue]")
+
+     # Data collator
+     data_collator = DataCollatorForLanguageModeling(
+         tokenizer=tokenizer,
+         mlm=False
+     )
+
+     # Training arguments
+     training_args = TrainingArguments(
+         output_dir=config.output_dir,
+         per_device_train_batch_size=config.per_device_train_batch_size,
+         per_device_eval_batch_size=config.per_device_eval_batch_size,
+         gradient_accumulation_steps=config.gradient_accumulation_steps,
+         num_train_epochs=config.num_train_epochs,
+         learning_rate=config.learning_rate,
+         logging_steps=config.logging_steps,
+         save_steps=config.save_steps,
+         eval_steps=config.eval_steps,
+         eval_strategy="steps",  # Updated parameter name (was evaluation_strategy)
+         save_strategy="steps",
+         load_best_model_at_end=True,
+         metric_for_best_model="eval_loss",
+         greater_is_better=False,
+         remove_unused_columns=False,
+         dataloader_pin_memory=True,
+         dataloader_num_workers=4,
+         report_to="none",  # Disable wandb/tensorboard ("none", not None, disables reporting)
+     )
+
+     # Create trainer
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         train_dataset=train_dataset,
+         eval_dataset=val_dataset,
+         data_collator=data_collator,
+         tokenizer=tokenizer,
+     )
+
+     # Train the model
+     trainer.train()
+
+     # Save model
+     if accelerator.is_main_process:
+         trainer.save_model()
+         console.print("[blue]💾 Model saved[/blue]")
+
+     return trainer
+
+ def evaluate_model_on_single_gpu(model, tokenizer, test_dataset, config: TrainingConfig):
+     """Evaluate model on a single GPU (cuda:0) to avoid device mismatches"""
+     console.print(f"[blue]🧪 Running evaluation on cuda:{config.eval_gpu_index}...[/blue]")
+
+     # Move model to the evaluation GPU; 4-bit quantized weights are already
+     # placed on a device and do not support .to()
+     eval_device = torch.device(f"cuda:{config.eval_gpu_index}")
+     if not getattr(model, "is_loaded_in_4bit", False):
+         model = model.to(eval_device)
+     model.eval()
+
+     # Data collator
+     data_collator = DataCollatorForLanguageModeling(
+         tokenizer=tokenizer,
+         mlm=False
+     )
+
+     # Create evaluation dataloader
+     from torch.utils.data import DataLoader
+     eval_dataloader = DataLoader(
+         test_dataset,
+         batch_size=config.per_device_eval_batch_size,
+         collate_fn=data_collator,
+         pin_memory=True
+     )
+
+     # Evaluation metrics
+     total_loss = 0.0
+     total_tokens = 0
+     correct_tokens = 0
+     num_samples = 0
+
+     with torch.no_grad():
+         for batch in eval_dataloader:
+             # Move batch to the evaluation GPU
+             batch = {k: v.to(eval_device) for k, v in batch.items()}
+
+             # Forward pass
+             outputs = model(**batch)
+             loss = outputs.loss
+             logits = outputs.logits
+
+             # Calculate metrics
+             total_loss += loss.item()
+             num_samples += batch["input_ids"].size(0)
+
+             # Token-level accuracy (shift by one: the logit at position t predicts token t+1)
+             predictions = torch.argmax(logits[:, :-1, :], dim=-1)
+             labels = batch["labels"][:, 1:]
+
+             # Mask out ignored positions
+             mask = labels != -100
+             correct_tokens += (predictions[mask] == labels[mask]).sum().item()
+             total_tokens += mask.sum().item()
+
+     # Calculate final metrics
+     avg_loss = total_loss / len(eval_dataloader)
+     accuracy = correct_tokens / max(total_tokens, 1)
+     perplexity = np.exp(avg_loss)
+
+     return {
+         "loss": avg_loss,
+         "accuracy": accuracy,
+         "perplexity": perplexity,
+         "correct_tokens": correct_tokens,
+         "total_tokens": total_tokens,
+         "num_samples": num_samples
+     }
+
+ def print_training_summary(config: TrainingConfig, train_dataset, val_dataset, test_dataset, eval_results):
+     """Print structured training summary"""
+     console.print("\n[bold cyan]" + "=" * 80 + "[/bold cyan]")
+     console.print("[bold cyan]🎯 TRAINING SUMMARY[/bold cyan]")
+     console.print("[bold cyan]" + "=" * 80 + "[/bold cyan]")
+
+     # Dataset summary
+     console.print("\n[bold green]📊 Dataset Summary[/bold green]")
+     console.print(f"  Train: {len(train_dataset):,} samples")
+     console.print(f"  Validation: {len(val_dataset):,} samples")
+     console.print(f"  Test: {len(test_dataset):,} samples")
+
+     # Model summary
+     console.print("\n[bold blue]🤖 Model Summary[/bold blue]")
+     console.print(f"  Base Model: {config.base_model}")
+     console.print(f"  Training Recipe: {config.training_recipe}")
+     console.print(f"  LoRA r: {config.lora_r}")
+     console.print(f"  LoRA alpha: {config.lora_alpha}")
+
+     # Training summary
+     console.print("\n[bold yellow]🚀 Training Summary[/bold yellow]")
+     console.print(f"  Epochs: {config.num_train_epochs}")
+     console.print(f"  Learning Rate: {config.learning_rate}")
+     console.print(f"  Batch Size: {config.per_device_train_batch_size}")
+     console.print(f"  Gradient Accumulation: {config.gradient_accumulation_steps}")
+
+     # Evaluation results
+     console.print("\n[bold magenta]🧪 Evaluation Results (cuda:0)[/bold magenta]")
+     console.print(f"  Loss: {eval_results['loss']:.4f}")
+     console.print(f"  Accuracy: {eval_results['accuracy']:.4f}")
+     console.print(f"  Perplexity: {eval_results['perplexity']:.2f}")
+     console.print(f"  Correct Tokens: {eval_results['correct_tokens']:,}")
+     console.print(f"  Total Tokens: {eval_results['total_tokens']:,}")
+     console.print(f"  Samples: {eval_results['num_samples']:,}")
+
+     console.print("\n[bold cyan]" + "=" * 80 + "[/bold cyan]")
+
+ def main():
+     """Main training function"""
+     # Parse arguments
+     import argparse
+     parser = argparse.ArgumentParser(description="Humigence Training with Accelerate")
+     parser.add_argument("--config_file", type=str, required=True, help="Path to config file")
+     args = parser.parse_args()
+
+     # Initialize accelerator
+     accelerator = Accelerator()
+     set_seed(42)
+
+     # Load configuration
+     config = load_config(args.config_file)
+
+     # Print accelerator info
+     console.print("[blue]🚀 Accelerate Info:[/blue]")
+     console.print(f"  Process index: {accelerator.process_index}")
+     console.print(f"  Local process index: {accelerator.local_process_index}")
+     console.print(f"  Device: {accelerator.device}")
+     console.print(f"  Distributed: {accelerator.distributed_type}")
+     console.print(f"  Mixed precision: {accelerator.mixed_precision}")
+
+     try:
+         # Setup model and tokenizer
+         model, tokenizer = setup_model_and_tokenizer(config, accelerator)
+
+         # Prepare datasets
+         train_dataset, val_dataset, test_dataset = prepare_dataset(config, tokenizer)
+
+         # Train model
+         trainer = train_model(model, tokenizer, train_dataset, val_dataset, config, accelerator)
+
+         # Wait for all processes to finish training
+         accelerator.wait_for_everyone()
+
+         # Evaluate on single GPU (main process only)
+         if accelerator.is_main_process:
+             eval_results = evaluate_model_on_single_gpu(model, tokenizer, test_dataset, config)
+             print_training_summary(config, train_dataset, val_dataset, test_dataset, eval_results)
+         else:
+             eval_results = None
+
+         # Wait for evaluation to complete
+         accelerator.wait_for_everyone()
+
+         return {"status": "success", "eval_results": eval_results}
+
+     except Exception as e:
+         console.print(f"[red]❌ Training failed: {e}[/red]")
+         import traceback
+         traceback.print_exc()
+         return {"status": "error", "message": str(e)}
+
+ if __name__ == "__main__":
+     results = main()
+     if results["status"] == "success":
+         console.print("[green]✅ Training completed successfully![/green]")
+     else:
+         console.print(f"[red]❌ Training failed: {results['message']}[/red]")
+         exit(1)
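For reference, `load_config()` copies matching top-level keys onto `TrainingConfig` and ignores everything else, so a minimal `config.json` could look like the sketch below. The values are illustrative, drawn from the dataclass defaults above and the run summary earlier in this commit:

```json
{
  "base_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "training_recipe": "QLoRA (4-bit NF4)",
  "dataset_path": "/home/joshua/humigence_data/imdb.jsonl",
  "learning_rate": 2e-4,
  "num_train_epochs": 1,
  "per_device_train_batch_size": 2,
  "gradient_accumulation_steps": 4,
  "lora_r": 16,
  "lora_alpha": 32,
  "output_dir": "runs/humigence"
}
```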
training_launcher.py ADDED
@@ -0,0 +1,137 @@
+ # training_launcher.py
+ import os
+ import sys
+ import traceback
+ import json
+ import argparse
+ import torch
+ from distributed_utils import setup_distributed, setup_environment, cleanup_distributed, RankZeroOnly
+
+ def main():
+     # Parse command line arguments
+     parser = argparse.ArgumentParser(description="Humigence Training Launcher")
+     parser.add_argument("--config", type=str, required=True, help="Path to configuration file")
+     parser.add_argument("--fallback_single_gpu", action="store_true", help="Force single GPU training")
+     args = parser.parse_args()
+
+     # Set default values for error handling
+     ddp = False
+     is_main = True
+     device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+
+     # Set environment before ANY other imports
+     setup_environment()
+
+     # Honor an explicit single-GPU request up front
+     if args.fallback_single_gpu:
+         return _run_single_gpu_fallback(args.config)
+
+     try:
+         # Initialize distributed training
+         ddp, rank, local_rank, world_size, device = setup_distributed()
+         is_main = (rank == 0)
+
+         with RankZeroOnly(is_main) as rank_zero:
+             rank_zero.print(f"Training Mode: {'DDP' if ddp else 'Single-GPU'} "
+                             f"(world_size={world_size}, rank={rank}, local_rank={local_rank}, device={device})")
+
+         # Load configuration
+         with open(args.config, 'r') as f:
+             config = json.load(f)
+
+         # Update config with distributed training info
+         config.update({
+             "device": str(device),
+             "ddp": ddp,
+             "rank": rank,
+             "world_size": world_size,
+             "is_main": is_main,
+             "local_rank": local_rank,
+         })
+
+         # Import trainer after device setup to ensure proper CUDA initialization
+         from pipelines.production_pipeline import ProductionPipeline
+
+         # Create pipeline with distributed config
+         pipeline = ProductionPipeline(config)
+
+         # Run training
+         results = pipeline.run()
+
+         # Clean shutdown
+         cleanup_distributed()
+
+         return results
+
+     except Exception as e:
+         # Ensure cleanup even on error
+         cleanup_distributed()
+
+         # Enhanced error logging
+         error_msg = f"Training error: {type(e).__name__}: {e}"
+         print(error_msg, file=sys.stderr)
+
+         # Check if this is a DDP-related error that should trigger fallback
+         if _should_fallback_to_single_gpu(e):
+             if is_main:  # is_main is always defined thanks to the defaults above
+                 print("DDP failed, falling back to single-GPU...")
+             return _run_single_gpu_fallback(args.config)
+         else:
+             # Re-raise for actual errors
+             raise
+
+ def _should_fallback_to_single_gpu(error: Exception) -> bool:
+     """Determine if error warrants single-GPU fallback"""
+     fallback_errors = (
+         AttributeError,  # Missing methods like set_memory_monitor
+         RuntimeError,  # NCCL errors, device mismatches
+         ConnectionError,  # Process group initialization failures
+     )
+     return isinstance(error, fallback_errors)
+
+ def _run_single_gpu_fallback(config_path: str):
+     """Clean single-GPU fallback implementation"""
+     # Force single GPU
+     os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+
+     # Clear any existing process group
+     if torch.distributed.is_initialized():
+         torch.distributed.destroy_process_group()
+
+     # Load original config
+     with open(config_path, 'r') as f:
+         config = json.load(f)
+
+     # Update config for single GPU
+     config.update({
+         "device": "cuda:0",
+         "ddp": False,
+         "rank": 0,
+         "world_size": 1,
+         "is_main": True,
+         "local_rank": 0,
+         "multi_gpu": False,
+         "use_distributed": False,
+     })
+
+     print("Running single-GPU fallback training...")
+
+     try:
+         from pipelines.production_pipeline import ProductionPipeline
+         pipeline = ProductionPipeline(config)
+         return pipeline.run()
+     except Exception as e:
+         print(f"Single-GPU fallback also failed: {e}")
+         return {"status": "error", "message": str(e)}
+
+ if __name__ == "__main__":
+     try:
+         results = main()
+         if results and results.get("status") == "success":
+             sys.exit(0)
+         else:
+             sys.exit(1)
+     except KeyboardInterrupt:
+         print("\nTraining interrupted by user")
+         sys.exit(1)
+     except Exception as e:
+         print(f"Training failed: {e}")
+         traceback.print_exc()
+         sys.exit(1)
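`training_launcher.py` imports four helpers from a `distributed_utils` module that is not part of this commit view. A minimal sketch of what those helpers plausibly do, built on `torchrun`-style environment variables; every name and behavior below is an assumption, not the project's actual implementation:

```python
# distributed_utils.py - illustrative sketch only; the real module is not in this commit
import os
import torch
import torch.distributed as dist


def setup_environment():
    # Conservative defaults; illustrative, not from the repo
    os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
    os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")


def setup_distributed():
    """Return (ddp, rank, local_rank, world_size, device) from torchrun env vars."""
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size > 1 and torch.cuda.is_available():
        rank = int(os.environ["RANK"])
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        # MASTER_ADDR / MASTER_PORT are set by torchrun / accelerate launch
        dist.init_process_group(backend="nccl")
        return True, rank, local_rank, world_size, torch.device(f"cuda:{local_rank}")
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    return False, 0, 0, 1, device


def cleanup_distributed():
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()


class RankZeroOnly:
    """Context manager whose .print() only emits on the main process."""

    def __init__(self, is_main: bool):
        self.is_main = is_main

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def print(self, *args, **kwargs):
        if self.is_main:
            print(*args, **kwargs)
```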