{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "title" }, "source": [ "# 🚀 ULTRATHINK Perfect Training - Google Colab\n", "\n", "## ✨ What's New in This Configuration?\n", "\n", "This notebook uses the **PERFECT** training configuration, which is tuned to fix:\n", "- ✅ **Routing Collapse** (Entropy 0.52 → 0.8-1.2)\n", "- ✅ **Expert Imbalance** (Max Expert 100% → 50-70%)\n", "- ✅ **High Auxiliary Loss** (8.0 → 2.0-4.0)\n", "- ✅ **Slow Convergence** (Perplexity below 5k by step 200)\n", "\n", "### 🎯 Key Improvements:\n", "- **MoE Top-K**: 1 → **2** (prevents single expert dominance)\n", "- **Load Balance Weight**: 0.01 → **0.1** (10x stronger)\n", "- **Z-Loss Weight**: 0.001 → **0.0001** (10x weaker)\n", "- **Expert Capacity**: 1.0 → **1.5** (50% overflow headroom)\n", "- **Effective Batch Size**: 16 → **64** (4x larger; batch size 2 × 32 accumulation steps)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": { "id": "setup" }, "source": [ "## 📋 Setup Instructions\n", "\n", "1. **Runtime**: Go to `Runtime` → `Change runtime type` → Select `GPU` (T4 recommended)\n", "2. **Upload Project**: Upload the ULTRATHINK project folder or clone it from GitHub\n", "3. 
**Run Cells**: Execute the cells in order, top to bottom\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gpu_check" }, "outputs": [], "source": [ "# Check GPU availability\n", "!nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader\n", "\n", "import torch\n", "print(f\"\\nPyTorch version: {torch.__version__}\")\n", "print(f\"CUDA available: {torch.cuda.is_available()}\")\n", "if torch.cuda.is_available():\n", "    print(f\"CUDA version: {torch.version.cuda}\")\n", "    print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n", "    print(f\"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB\")" ] }, { "cell_type": "markdown", "metadata": { "id": "project_setup" }, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mount_drive" }, "outputs": [], "source": [ "# Optional: mount Google Drive (skip this cell if the project is not stored on Drive)\n", "from google.colab import drive\n", "drive.mount('/content/drive')\n", "# Uncomment and edit the path if the project lives on Drive:\n", "# %cd /content/drive/MyDrive/path/to/UltraThinking-LLM-Training" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "clone_repo" }, "outputs": [], "source": [ "# Clone repository (update with your repo URL)\n", "!git clone https://github.com/vediyappanm/UltraThinking-LLM-Training.git\n", "%cd UltraThinking-LLM-Training\n", "\n", "# Or if already uploaded:\n", "%cd /content/UltraThinking-LLM-Training\n", "\n", "# Verify we're in the right directory\n", "!ls -la train_ultrathink.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "install_deps" }, "outputs": [], "source": [ "# Upgrade packaging tools first so the installs below use a current pip\n", "!pip install -q --upgrade pip setuptools wheel\n", "\n", "# Install project dependencies\n", "!pip install -q -r requirements.txt\n", "\n", "# Install additional packages for Colab\n", "!pip install -q transformers datasets accelerate\n", "\n", "print(\"✓ Dependencies installed successfully!\")" ] }, { "cell_type": "markdown", "metadata": { "id": "training_section" 
}, "source": [ "## 🎯 Perfect Training Configuration\n", "\n", "### Expected Results:\n", "\n", "| Metric | Before | After (Step 50-100) | Meaning |\n", "|--------|--------|---------------------|----------|\n", "| **Entropy** | 0.52 | 0.8-1.2 | More uniform expert selection |\n", "| **Max Expert %** | 100% | 50-65% | No single expert dominates |\n", "| **Aux Loss** | 8.0-8.5 | 2.0-4.0 | Routing regularization working |\n", "| **Perplexity** | 85k → 30k | <5k by step 200 | Faster learning |\n", "| **Loss** | 11.3 → 10.3 | <8.0 by step 200 | Better optimization |\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "train_perfect" }, "outputs": [], "source": [ "# ============================================================================\n", "# PERFECT TRAINING CONFIGURATION\n", "# ============================================================================\n", "# This configuration is tuned to counter MoE routing collapse (see the table above)\n", "# ============================================================================\n", "\n", "!python train_ultrathink.py \\\n", "    --vocab_size 50257 \\\n", "    --hidden_size 512 \\\n", "    --num_layers 6 \\\n", "    --num_heads 8 \\\n", "    --num_kv_heads 4 \\\n", "    --intermediate_size 2048 \\\n", "    --max_seq_length 256 \\\n", "    --activation swiglu \\\n", "    --enable_moe \\\n", "    --num_knowledge_experts 4 \\\n", "    --num_skill_experts 2 \\\n", "    --num_meta_experts 1 \\\n", "    --num_safety_experts 1 \\\n", "    --moe_top_k 2 \\\n", "    --expert_capacity 1.5 \\\n", "    --load_balance_weight 0.1 \\\n", "    --z_loss_weight 0.0001 \\\n", "    --importance_weight 0.05 \\\n", "    --batch_size 2 \\\n", "    --gradient_accumulation_steps 32 \\\n", "    --learning_rate 0.0001 \\\n", "    --weight_decay 0.1 \\\n", "    --adam_beta1 0.9 \\\n", "    --adam_beta2 0.999 \\\n", "    --warmup_steps 1000 \\\n", "    --max_steps 100000 \\\n", "    --num_epochs 1 \\\n", "    --gradient_clipping 0.5 \\\n", "    --dropout 0.15 \\\n", "    --attention_dropout 0.15 
\\\n", "    --gradient_checkpointing \\\n", "    --use_amp \\\n", "    --amp_warmup_steps 500 \\\n", "    --enable_dre \\\n", "    --dre_warmup_steps 1000 \\\n", "    --dataset c4 \\\n", "    --dataset_subset en \\\n", "    --tokenizer_name gpt2 \\\n", "    --streaming \\\n", "    --train_samples 10000 \\\n", "    --val_samples 1000 \\\n", "    --num_workers 2 \\\n", "    --use_mlflow \\\n", "    --mlflow_tracking_uri \"file:./mlruns\" \\\n", "    --mlflow_experiment \"UltraThinking-LLM-Training\" \\\n", "    --run_name \"ultrathink_colab_perfect_v2\" \\\n", "    --perf_log_interval 5 \\\n", "    --eval_frequency 50 \\\n", "    --output_dir \"./outputs/ultrathink_colab_perfect\"" ] }, { "cell_type": "markdown", "metadata": { "id": "quick_test" }, "source": [ "## 🧪 Quick Test Run (Optional)\n", "\n", "Run a quick 100-step test to verify everything works before full training.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "train_quick_test" }, "outputs": [], "source": [ "# Quick test run (100 steps, ~2-3 minutes)\n", "!python train_ultrathink.py \\\n", "    --vocab_size 50257 \\\n", "    --hidden_size 256 \\\n", "    --num_layers 2 \\\n", "    --num_heads 4 \\\n", "    --num_kv_heads 2 \\\n", "    --intermediate_size 1024 \\\n", "    --max_seq_length 128 \\\n", "    --enable_moe \\\n", "    --num_knowledge_experts 2 \\\n", "    --num_skill_experts 1 \\\n", "    --num_meta_experts 1 \\\n", "    --num_safety_experts 1 \\\n", "    --moe_top_k 2 \\\n", "    --expert_capacity 2.0 \\\n", "    --load_balance_weight 0.2 \\\n", "    --z_loss_weight 0.00001 \\\n", "    --batch_size 1 \\\n", "    --gradient_accumulation_steps 8 \\\n", "    --learning_rate 0.0001 \\\n", "    --warmup_steps 50 \\\n", "    --max_steps 100 \\\n", "    --num_epochs 1 \\\n", "    --dataset dummy \\\n", "    --train_samples 100 \\\n", "    --val_samples 20 \\\n", "    --eval_frequency 50 \\\n", "    --run_name \"ultrathink_quick_test\" \\\n", "    --output_dir \"./outputs/ultrathink_quick_test\"\n", "\n", "print(\"\\n✓ Quick test completed! 
Check the metrics above.\")\n", "print(\"If everything looks good, run the full training cell.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "monitoring" }, "source": [ "## 📊 Monitoring & Metrics\n", "\n", "### What to Watch For:\n", "\n", "#### ✅ Good Signs (by step 50):\n", "- Entropy increases from 0.52 → 0.7+\n", "- Max expert drops from 100% → 60-70%\n", "- Auxiliary loss drops from 8.0 → 3-5\n", "- Loss decreases steadily\n", "\n", "#### ⚠️ Warning Signs:\n", "- Entropy stuck at 0.52 → Increase `load_balance_weight`\n", "- Max expert still 100% → Increase `expert_capacity`\n", "- Aux loss still >7.0 → Decrease `z_loss_weight`\n", "\n", "#### 🛑 Critical Issues:\n", "- NaN/Inf losses → Disable AMP temporarily\n", "- OOM errors → Reduce `batch_size` or increase `gradient_accumulation_steps`\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "view_logs" }, "outputs": [], "source": [ "# View recent training logs\n", "!tail -n 50 outputs/ultrathink_colab_perfect/training.log 2>/dev/null || echo \"No logs yet\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mlflow_ui" }, "outputs": [], "source": [ "# Start MLflow UI (optional - runs in background)\n", "# Note: In Colab, you'll need to use ngrok or similar to expose the port\n", "\n", "# Install ngrok for port forwarding\n", "!pip install -q pyngrok\n", "\n", "from pyngrok import ngrok\n", "import threading\n", "import subprocess\n", "\n", "# ngrok needs an account authtoken; if connect() fails, set it first:\n", "# ngrok.set_auth_token(\"<your-authtoken>\")\n", "\n", "# Start MLflow UI in background\n", "def start_mlflow():\n", "    subprocess.run([\"mlflow\", \"ui\", \"--backend-store-uri\", \"./mlruns\", \"--port\", \"5000\"])\n", "\n", "thread = threading.Thread(target=start_mlflow, daemon=True)\n", "thread.start()\n", "\n", "# Create ngrok tunnel and print its public URL\n", "tunnel = ngrok.connect(5000)\n", "print(f\"\\n✓ MLflow UI available at: {tunnel.public_url}\")\n", "print(\"Click the link above to view training metrics in real-time!\")" ] }, { "cell_type": "markdown", 
"metadata": { "id": "checkpoints" }, "source": [ "## 💾 Checkpoints & Model Export\n", "\n", "Download trained models and checkpoints to your local machine or Google Drive.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "list_checkpoints" }, "outputs": [], "source": [ "# List available checkpoints\n", "!ls -lh outputs/ultrathink_colab_perfect/*.pt 2>/dev/null || echo \"No checkpoints yet\"\n", "!ls -lh outputs/ultrathink_colab_perfect/final_model/ 2>/dev/null || echo \"No final model yet\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "download_model" }, "outputs": [], "source": [ "# Download model to local machine\n", "from google.colab import files\n", "import shutil\n", "import os\n", "\n", "# Create a zip file of the final model\n", "if os.path.exists(\"outputs/ultrathink_colab_perfect/final_model\"):\n", "    shutil.make_archive(\"ultrathink_final_model\", \"zip\", \"outputs/ultrathink_colab_perfect/final_model\")\n", "    print(\"✓ Model archived as ultrathink_final_model.zip\")\n", "\n", "    # Download (this may take a while for large models)\n", "    # files.download(\"ultrathink_final_model.zip\")\n", "    print(\"\\nTo download, uncomment the files.download() line above.\")\n", "else:\n", "    print(\"No final model found yet. 
Training may still be in progress.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "save_to_drive" }, "outputs": [], "source": [ "# Save to Google Drive (if mounted)\n", "# Uncomment and modify path as needed\n", "\n", "# from google.colab import drive\n", "# drive.mount('/content/drive')\n", "\n", "# import shutil\n", "# shutil.copytree(\n", "#     \"outputs/ultrathink_colab_perfect\",\n", "#     \"/content/drive/MyDrive/ULTRATHINK_Models/ultrathink_colab_perfect\",\n", "#     dirs_exist_ok=True\n", "# )\n", "# print(\"✓ Model saved to Google Drive!\")" ] }, { "cell_type": "markdown", "metadata": { "id": "inference" }, "source": [ "## 🎮 Quick Inference Test\n", "\n", "Test your trained model with sample text generation.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "test_inference" }, "outputs": [], "source": [ "# Quick inference test\n", "!python scripts/inference.py \\\n", "    --model_path outputs/ultrathink_colab_perfect/final_model \\\n", "    --prompt \"The future of artificial intelligence is\" \\\n", "    --max_length 100 \\\n", "    --temperature 0.8 \\\n", "    --top_p 0.9 2>/dev/null || echo \"Inference script not available or model not ready\"" ] }, { "cell_type": "markdown", "metadata": { "id": "troubleshooting" }, "source": [ "## 🔧 Troubleshooting\n", "\n", "### Common Issues:\n", "\n", "| Issue | Solution |\n", "|-------|----------|\n", "| **OOM Error** | Reduce `--batch_size` to 1, increase `--gradient_accumulation_steps` |\n", "| **NaN Losses** | Remove `--use_amp` or increase `--amp_warmup_steps` |\n", "| **Slow Training** | Reduce `--num_workers` to 0 for streaming datasets |\n", "| **Routing Collapse** | Increase `--load_balance_weight` to 0.15 or 0.2 |\n", "| **High Aux Loss** | Decrease `--z_loss_weight` to 0.00005 |\n", "\n", "### Need Help?\n", "- Check the [Training Config Guide](Training%20congig.md)\n", "- Review logs in `outputs/ultrathink_colab_perfect/training.log`\n", "- Open an 
issue on GitHub\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": { "id": "footer" }, "source": [ "## 📚 Additional Resources\n", "\n", "- **Documentation**: See `README.md` and `ADVANCED_TRAINING_GUIDE.md`\n", "- **Architecture**: See `ARCHITECTURE_OVERVIEW.md`\n", "- **Training Config**: See `Training congig.md`\n", "\n", "---\n", "\n", "## 🎉 Success Criteria\n", "\n", "Your training is successful when:\n", "\n", "**By Step 50:**\n", "- ✓ Entropy > 0.7\n", "- ✓ Max expert < 70%\n", "- ✓ Aux loss < 5.0\n", "\n", "**By Step 200:**\n", "- ✓ Loss < 8.0\n", "- ✓ Perplexity < 5,000\n", "- ✓ All experts showing 5-40% usage\n", "\n", "**By Step 1000:**\n", "- ✓ Loss < 6.0\n", "- ✓ Perplexity < 1,000\n", "- ✓ Stable, consistent improvement\n", "\n", "---\n", "\n", "**Good luck with your training! 🚀**" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "name": "ULTRATHINK_Perfect_Training_Colab.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }