{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "title"
      },
      "source": [
        "# 🚀 ULTRATHINK Perfect Training - Google Colab\n",
        "\n",
        "## ✨ What's New in This Configuration?\n",
        "\n",
        "This notebook uses the **PERFECT** training configuration that fixes:\n",
        "- ✅ **Routing Collapse** (Entropy 0.52 → 0.8-1.2)\n",
        "- ✅ **Expert Imbalance** (Max Expert 100% → 50-70%)\n",
        "- ✅ **High Auxiliary Loss** (8.0 → 2.0-4.0)\n",
        "- ✅ **Slow Convergence** (perplexity below 5k by step 200)\n",
        "\n",
        "### 🎯 Key Improvements:\n",
        "- **MoE Top-K**: 1 → **2** (prevents single expert dominance)\n",
        "- **Load Balance Weight**: 0.01 → **0.1** (10x stronger)\n",
        "- **Z-Loss Weight**: 0.001 → **0.0001** (10x weaker)\n",
        "- **Expert Capacity**: 1.0 → **1.5** (50% overflow)\n",
        "- **Effective Batch Size**: 16 → **64** (4x larger)\n",
        "\n",
        "---"
      ]
    },
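    {
      "cell_type": "markdown",
      "metadata": {
        "id": "effective_batch_note"
      },
      "source": [
        "The effective batch size quoted above is the per-step batch size multiplied by the number of gradient accumulation steps. The sketch below reproduces that arithmetic for the flag values used in the full training cell. It assumes a single GPU and that `train_ultrathink.py` takes one optimizer step per `gradient_accumulation_steps` micro-batches; the helper name `effective_batch_size` is illustrative and not part of the project."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "effective_batch_check"
      },
      "outputs": [],
      "source": [
        "# Illustrative helper (not part of the project): effective batch size arithmetic.\n",
        "# Assumes one optimizer step per `gradient_accumulation_steps` micro-batches.\n",
        "def effective_batch_size(batch_size, gradient_accumulation_steps, num_devices=1):\n",
        "    '''Number of sequences that contribute to one optimizer step.'''\n",
        "    return batch_size * gradient_accumulation_steps * num_devices\n",
        "\n",
        "# Flag values copied from the full training cell\n",
        "sequences_per_step = effective_batch_size(batch_size=2, gradient_accumulation_steps=32)\n",
        "# Upper bound: assumes every sequence fills --max_seq_length 256\n",
        "tokens_per_step = sequences_per_step * 256\n",
        "\n",
        "print(f'Effective batch size: {sequences_per_step} sequences')\n",
        "print(f'Tokens per optimizer step: at most {tokens_per_step}')"
      ]
    },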
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "setup"
      },
      "source": [
        "## 📋 Setup Instructions\n",
        "\n",
        "1. **Runtime**: Go to `Runtime` → `Change runtime type` → Select `GPU` (T4 recommended)\n",
        "2. **Upload Project**: Upload the ULTRATHINK project folder or clone from GitHub\n",
        "3. **Run Cells**: Execute cells in order\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gpu_check"
      },
      "outputs": [],
      "source": [
        "# Check GPU availability\n",
        "!nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader\n",
        "\n",
        "import torch\n",
        "print(f\"\\nPyTorch version: {torch.__version__}\")\n",
        "print(f\"CUDA available: {torch.cuda.is_available()}\")\n",
        "if torch.cuda.is_available():\n",
        "    print(f\"CUDA version: {torch.version.cuda}\")\n",
        "    print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
        "    print(f\"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB\")"
      ]
    },
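    {
      "cell_type": "markdown",
      "metadata": {
        "id": "project_setup"
      },
      "source": [
        "## 📁 Project Setup\n",
        "\n",
        "Optionally mount Google Drive, then clone the repository and install its dependencies.\n",
        "\n",
        "---"
      ]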
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "mount_drive"
      },
      "outputs": [],
      "source": [
        "# Optional: mount Google Drive (uncomment the lines below if needed)\n",
        "# from google.colab import drive\n",
        "# drive.mount('/content/drive')\n",
        "# %cd /content/drive/MyDrive/path/to/UltraThinking-LLM-Training"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "clone_repo"
      },
      "outputs": [],
      "source": [
        "# Clone repository (update with your repo URL)\n",
        "!git clone https://github.com/vediyappanm/UltraThinking-LLM-Training.git\n",
        "%cd UltraThinking-LLM-Training\n",
        "\n",
        "# Or if already uploaded:\n",
        "%cd /content/UltraThinking-LLM-Training\n",
        "\n",
        "# Verify we're in the right directory\n",
        "!ls -la train_ultrathink.py"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "install_deps"
      },
      "outputs": [],
      "source": [
        "# Upgrade packaging tools first so the installs below use the latest pip\n",
        "!pip install -q --upgrade pip setuptools wheel\n",
        "\n",
        "# Install project dependencies\n",
        "!pip install -q -r requirements.txt\n",
        "\n",
        "# Install additional packages for Colab\n",
        "!pip install -q transformers datasets accelerate\n",
        "\n",
        "print(\"✓ Dependencies installed successfully!\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "training_section"
      },
      "source": [
        "## 🎯 Perfect Training Configuration\n",
        "\n",
        "### Expected Results:\n",
        "\n",
        "| Metric | Before | After (Step 50-100) | Meaning |\n",
        "|--------|--------|---------------------|----------|\n",
        "| **Entropy** | 0.52 | 0.8-1.2 | More uniform expert selection |\n",
        "| **Max Expert %** | 100% | 50-65% | No single expert dominates |\n",
        "| **Aux Loss** | 8.0-8.5 | 2.0-4.0 | Routing regularization working |\n",
        "| **Perplexity** | 85k → 30k | <5k by step 200 | Faster learning |\n",
        "| **Loss** | 11.3 → 10.3 | <8.0 by step 200 | Better optimization |\n",
        "\n",
        "---"
      ]
    },
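    {
      "cell_type": "markdown",
      "metadata": {
        "id": "perplexity_note"
      },
      "source": [
        "The loss and perplexity rows in the table above describe the same quantity if perplexity is the exponential of the mean cross-entropy loss. That relationship is an assumption about how `train_ultrathink.py` reports perplexity, although the quoted starting values are roughly consistent with it. Under this assumption, a loss of 8.0 corresponds to a perplexity of about 2,981, so the loss target is the stricter of the two step-200 goals. The helper names below are illustrative."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "perplexity_check"
      },
      "outputs": [],
      "source": [
        "import math\n",
        "\n",
        "# Assumption: perplexity = exp(mean cross-entropy loss in nats per token).\n",
        "def perplexity_from_loss(loss):\n",
        "    return math.exp(loss)\n",
        "\n",
        "def loss_from_perplexity(perplexity):\n",
        "    return math.log(perplexity)\n",
        "\n",
        "# Loss values quoted in the table above\n",
        "for loss in (11.3, 10.3, 8.0):\n",
        "    print(f'loss {loss:4.1f} -> perplexity {perplexity_from_loss(loss):,.0f}')\n",
        "\n",
        "# Loss that corresponds to the perplexity target of 5,000\n",
        "print(f'perplexity 5,000 -> loss {loss_from_perplexity(5000):.2f}')"
      ]
    },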
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "train_perfect"
      },
      "outputs": [],
      "source": [
        "# ============================================================================\n",
        "# PERFECT TRAINING CONFIGURATION\n",
        "# ============================================================================\n",
        "# This configuration is tuned to counter routing collapse; see the expected results above\n",
        "# ============================================================================\n",
        "\n",
        "!python train_ultrathink.py \\\n",
        "  --vocab_size 50257 \\\n",
        "  --hidden_size 512 \\\n",
        "  --num_layers 6 \\\n",
        "  --num_heads 8 \\\n",
        "  --num_kv_heads 4 \\\n",
        "  --intermediate_size 2048 \\\n",
        "  --max_seq_length 256 \\\n",
        "  --activation swiglu \\\n",
        "  --enable_moe \\\n",
        "  --num_knowledge_experts 4 \\\n",
        "  --num_skill_experts 2 \\\n",
        "  --num_meta_experts 1 \\\n",
        "  --num_safety_experts 1 \\\n",
        "  --moe_top_k 2 \\\n",
        "  --expert_capacity 1.5 \\\n",
        "  --load_balance_weight 0.1 \\\n",
        "  --z_loss_weight 0.0001 \\\n",
        "  --importance_weight 0.05 \\\n",
        "  --batch_size 2 \\\n",
        "  --gradient_accumulation_steps 32 \\\n",
        "  --learning_rate 0.0001 \\\n",
        "  --weight_decay 0.1 \\\n",
        "  --adam_beta1 0.9 \\\n",
        "  --adam_beta2 0.999 \\\n",
        "  --warmup_steps 1000 \\\n",
        "  --max_steps 100000 \\\n",
        "  --num_epochs 1 \\\n",
        "  --gradient_clipping 0.5 \\\n",
        "  --dropout 0.15 \\\n",
        "  --attention_dropout 0.15 \\\n",
        "  --gradient_checkpointing \\\n",
        "  --use_amp \\\n",
        "  --amp_warmup_steps 500 \\\n",
        "  --enable_dre \\\n",
        "  --dre_warmup_steps 1000 \\\n",
        "  --dataset c4 \\\n",
        "  --dataset_subset en \\\n",
        "  --tokenizer_name gpt2 \\\n",
        "  --streaming \\\n",
        "  --train_samples 10000 \\\n",
        "  --val_samples 1000 \\\n",
        "  --num_workers 2 \\\n",
        "  --use_mlflow \\\n",
        "  --mlflow_tracking_uri \"file:./mlruns\" \\\n",
        "  --mlflow_experiment \"UltraThinking-LLM-Training\" \\\n",
        "  --run_name \"ultrathink_colab_perfect_v2\" \\\n",
        "  --perf_log_interval 5 \\\n",
        "  --eval_frequency 50 \\\n",
        "  --output_dir \"./outputs/ultrathink_colab_perfect\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "quick_test"
      },
      "source": [
        "## 🧪 Quick Test Run (Optional)\n",
        "\n",
        "Run a quick 100-step test to verify everything works before full training.\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "train_quick_test"
      },
      "outputs": [],
      "source": [
        "# Quick test run (100 steps, ~2-3 minutes)\n",
        "!python train_ultrathink.py \\\n",
        "  --vocab_size 50257 \\\n",
        "  --hidden_size 256 \\\n",
        "  --num_layers 2 \\\n",
        "  --num_heads 4 \\\n",
        "  --num_kv_heads 2 \\\n",
        "  --intermediate_size 1024 \\\n",
        "  --max_seq_length 128 \\\n",
        "  --enable_moe \\\n",
        "  --num_knowledge_experts 2 \\\n",
        "  --num_skill_experts 1 \\\n",
        "  --num_meta_experts 1 \\\n",
        "  --num_safety_experts 1 \\\n",
        "  --moe_top_k 2 \\\n",
        "  --expert_capacity 2.0 \\\n",
        "  --load_balance_weight 0.2 \\\n",
        "  --z_loss_weight 0.00001 \\\n",
        "  --batch_size 1 \\\n",
        "  --gradient_accumulation_steps 8 \\\n",
        "  --learning_rate 0.0001 \\\n",
        "  --warmup_steps 50 \\\n",
        "  --max_steps 100 \\\n",
        "  --num_epochs 1 \\\n",
        "  --dataset dummy \\\n",
        "  --train_samples 100 \\\n",
        "  --val_samples 20 \\\n",
        "  --eval_frequency 50 \\\n",
        "  --run_name \"ultrathink_quick_test\" \\\n",
        "  --output_dir \"./outputs/ultrathink_quick_test\"\n",
        "\n",
        "print(\"\\n✓ Quick test completed! Check the metrics above.\")\n",
        "print(\"If everything looks good, run the full training cell.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "monitoring"
      },
      "source": [
        "## 📊 Monitoring & Metrics\n",
        "\n",
        "### What to Watch For:\n",
        "\n",
        "#### ✅ Good Signs (by step 50):\n",
        "- Entropy increases from 0.52 → 0.7+\n",
        "- Max expert drops from 100% → 60-70%\n",
        "- Auxiliary loss drops from 8.0 → 3-5\n",
        "- Loss decreases steadily\n",
        "\n",
        "#### ⚠️ Warning Signs:\n",
        "- Entropy stuck at 0.52 → Increase load_balance_weight\n",
        "- Max expert still 100% → Increase expert_capacity\n",
        "- Aux loss still >7.0 → Decrease z_loss_weight\n",
        "\n",
        "#### 🛑 Critical Issues:\n",
        "- NaN/Inf losses → Disable AMP temporarily\n",
        "- OOM errors → Reduce batch_size and raise gradient_accumulation_steps by the same factor to keep the effective batch size\n",
        "\n",
        "---"
      ]
    },
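    {
      "cell_type": "markdown",
      "metadata": {
        "id": "routing_stats_note"
      },
      "source": [
        "The entropy and max-expert figures above can be recomputed from per-expert token counts. How `train_ultrathink.py` defines these metrics is not shown in this notebook, so the sketch below assumes natural-log Shannon entropy over the fraction of tokens routed to each expert. The function `routing_stats` and the example counts are hypothetical."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "routing_stats_check"
      },
      "outputs": [],
      "source": [
        "import math\n",
        "\n",
        "def routing_stats(tokens_per_expert):\n",
        "    '''Summarize how evenly tokens are spread across experts.'''\n",
        "    total = sum(tokens_per_expert)\n",
        "    if total == 0:\n",
        "        raise ValueError('no tokens were routed')\n",
        "    shares = [count / total for count in tokens_per_expert]\n",
        "    # Shannon entropy in nats; experts with zero tokens contribute nothing\n",
        "    entropy = -sum(p * math.log(p) for p in shares if p > 0)\n",
        "    return {\n",
        "        'entropy': entropy,\n",
        "        'max_entropy': math.log(len(tokens_per_expert)),\n",
        "        'max_expert_share': max(shares),\n",
        "    }\n",
        "\n",
        "# Hypothetical counts for 8 experts (4 + 2 + 1 + 1 in the full configuration,\n",
        "# assuming they share a single router)\n",
        "collapsed = routing_stats([1000, 0, 0, 0, 0, 0, 0, 0])\n",
        "balanced = routing_stats([125] * 8)\n",
        "print('collapsed:', collapsed)\n",
        "print('balanced: ', balanced)"
      ]
    },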
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "view_logs"
      },
      "outputs": [],
      "source": [
        "# View recent training logs\n",
        "!tail -n 50 outputs/ultrathink_colab_perfect/training.log 2>/dev/null || echo \"No logs yet\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "mlflow_ui"
      },
      "outputs": [],
      "source": [
        "# Start the MLflow UI (optional) and expose it through an ngrok tunnel,\n",
        "# since Colab does not expose local ports directly.\n",
        "# Run this cell before the training cell if you want to watch metrics live.\n",
        "!pip install -q pyngrok\n",
        "\n",
        "import subprocess\n",
        "from pyngrok import ngrok\n",
        "\n",
        "# ngrok requires an account authtoken; uncomment and paste yours:\n",
        "# ngrok.set_auth_token(\"YOUR_NGROK_AUTHTOKEN\")\n",
        "\n",
        "# Launch the MLflow UI as a background process on port 5000\n",
        "mlflow_proc = subprocess.Popen(\n",
        "    [\"mlflow\", \"ui\", \"--backend-store-uri\", \"./mlruns\", \"--port\", \"5000\"]\n",
        ")\n",
        "\n",
        "# Create the tunnel and print its public URL\n",
        "tunnel = ngrok.connect(5000)\n",
        "print(f\"\\n✓ MLflow UI available at: {tunnel.public_url}\")\n",
        "print(\"Open the link above to view training metrics in real time.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "checkpoints"
      },
      "source": [
        "## 💾 Checkpoints & Model Export\n",
        "\n",
        "Download trained models and checkpoints to your local machine or Google Drive.\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "list_checkpoints"
      },
      "outputs": [],
      "source": [
        "# List available checkpoints\n",
        "!ls -lh outputs/ultrathink_colab_perfect/*.pt 2>/dev/null || echo \"No checkpoints yet\"\n",
        "!ls -lh outputs/ultrathink_colab_perfect/final_model/ 2>/dev/null || echo \"No final model yet\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "download_model"
      },
      "outputs": [],
      "source": [
        "# Download model to local machine\n",
        "from google.colab import files\n",
        "import shutil\n",
        "import os\n",
        "\n",
        "# Create a zip file of the final model\n",
        "if os.path.exists(\"outputs/ultrathink_colab_perfect/final_model\"):\n",
        "    shutil.make_archive(\"ultrathink_final_model\", \"zip\", \"outputs/ultrathink_colab_perfect/final_model\")\n",
        "    print(\"✓ Model archived as ultrathink_final_model.zip\")\n",
        "    \n",
        "    # Download (this may take a while for large models)\n",
        "    # files.download(\"ultrathink_final_model.zip\")\n",
        "    print(\"\\nTo download, uncomment the files.download() line above.\")\n",
        "else:\n",
        "    print(\"No final model found yet. Training may still be in progress.\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "save_to_drive"
      },
      "outputs": [],
      "source": [
        "# Save to Google Drive (if mounted)\n",
        "# Uncomment and modify path as needed\n",
        "\n",
        "# from google.colab import drive\n",
        "# drive.mount('/content/drive')\n",
        "\n",
        "# import shutil\n",
        "# shutil.copytree(\n",
        "#     \"outputs/ultrathink_colab_perfect\",\n",
        "#     \"/content/drive/MyDrive/ULTRATHINK_Models/ultrathink_colab_perfect\",\n",
        "#     dirs_exist_ok=True\n",
        "# )\n",
        "# print(\"✓ Model saved to Google Drive!\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "inference"
      },
      "source": [
        "## 🎮 Quick Inference Test\n",
        "\n",
        "Test your trained model with sample text generation.\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "test_inference"
      },
      "outputs": [],
      "source": [
        "# Quick inference test\n",
        "!python scripts/inference.py \\\n",
        "  --model_path outputs/ultrathink_colab_perfect/final_model \\\n",
        "  --prompt \"The future of artificial intelligence is\" \\\n",
        "  --max_length 100 \\\n",
        "  --temperature 0.8 \\\n",
        "  --top_p 0.9 2>/dev/null || echo \"Inference script not available or model not ready\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "troubleshooting"
      },
      "source": [
        "## 🔧 Troubleshooting\n",
        "\n",
        "### Common Issues:\n",
        "\n",
        "| Issue | Solution |\n",
        "|-------|----------|\n",
        "| **OOM Error** | Reduce `--batch_size` to 1 and raise `--gradient_accumulation_steps` to 64 to keep the effective batch size at 64 |\n",
        "| **NaN Losses** | Remove `--use_amp` or increase `--amp_warmup_steps` |\n",
        "| **Slow Training** | Reduce `--num_workers` to 0 for streaming datasets |\n",
        "| **Routing Collapse** | Increase `--load_balance_weight` to 0.15 or 0.2 |\n",
        "| **High Aux Loss** | Decrease `--z_loss_weight` to 0.00005 |\n",
        "\n",
        "### Need Help?\n",
        "- Check the [Training Config Guide](Training%20congig.md)\n",
        "- Review logs in `outputs/ultrathink_colab_perfect/training.log`\n",
        "- Open an issue on GitHub\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "footer"
      },
      "source": [
        "## 📚 Additional Resources\n",
        "\n",
        "- **Documentation**: See `README.md` and `ADVANCED_TRAINING_GUIDE.md`\n",
        "- **Architecture**: See `ARCHITECTURE_OVERVIEW.md`\n",
        "- **Training Config**: See `Training congig.md`\n",
        "\n",
        "---\n",
        "\n",
        "## 🎉 Success Criteria\n",
        "\n",
        "Your training is successful when:\n",
        "\n",
        "**By Step 50:**\n",
        "- ✓ Entropy > 0.7\n",
        "- ✓ Max expert < 70%\n",
        "- ✓ Aux loss < 5.0\n",
        "\n",
        "**By Step 200:**\n",
        "- ✓ Loss < 8.0\n",
        "- ✓ Perplexity < 5,000\n",
        "- ✓ All experts showing 5-40% usage\n",
        "\n",
        "**By Step 1000:**\n",
        "- ✓ Loss < 6.0\n",
        "- ✓ Perplexity < 1,000\n",
        "- ✓ Stable, consistent improvement\n",
        "\n",
        "---\n",
        "\n",
        "**Good luck with your training! 🚀**"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "name": "ULTRATHINK_Perfect_Training_Colab.ipynb",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}