{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# CommitmentOS Checkpoint Evaluation (Colab)\n",
        "\n",
        "This notebook compares a base model against a locally saved LoRA-trained checkpoint on the CommitmentOS environment.\n",
        "\n",
        "It uses:\n",
        "- `BASELINE_MODEL_NAME` from Hugging Face\n",
        "- `TRAINED_MODEL_PATH` from disk in Colab\n",
        "- the existing `evaluation/evaluate_llm_checkpoints.py` script\n",
        "\n",
        "By default the notebook evaluates against the hosted CommitmentOS environment on Hugging Face Space. An optional local-server cell is included below."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d43c692d",
      "metadata": {},
      "outputs": [],
      "source": [
        "!pip -q install --upgrade pip\n",
        "!pip -q install transformers peft accelerate torch sentencepiece fastapi uvicorn requests python-dotenv pydantic \"openenv-core>=0.2.0\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "!git clone https://github.com/Jayant2304/commitment_os.git\n",
        "%cd commitment_os"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Configure Paths\n",
        "\n",
        "Set the base model ID and the local adapter/checkpoint path. Change `TRAINED_MODEL_PATH` to the folder you actually want to evaluate.\n",
        "\n",
        "If the base model is gated, set `HF_TOKEN` as well."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "# Colab: load Hugging Face token from Secrets (key must be exactly HF_TOKEN)\n",
        "try:\n",
        "    from google.colab import userdata\n",
        "\n",
        "    os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")\n",
        "    print(\"HF_TOKEN loaded from Colab secrets\")\n",
        "except ImportError:\n",
        "    print(\"Not on Colab; set HF_TOKEN in the shell or .env if downloads fail.\")\n",
        "except Exception as exc:\n",
        "    print(\"Could not load HF_TOKEN from secrets:\", exc)\n",
        "\n",
        "os.environ[\"BASELINE_MODEL_NAME\"] = \"Qwen/Qwen2.5-1.5B-Instruct\"\n",
        "os.environ[\"TRAINED_MODEL_PATH\"] = \"/content/commitment_os/training_output\"\n",
        "os.environ[\"ENV_BASE_URL\"] = \"https://jayant2304-commitment-os.hf.space\"\n",
        "\n",
        "# Optional for gated base models:\n",
        "# os.environ[\"HF_TOKEN\"] = \"hf_xxx\"\n",
        "\n",
        "# Optional eval overrides:\n",
        "os.environ[\"EVAL_SEED\"] = \"42\"\n",
        "os.environ[\"EVAL_MAX_STEPS\"] = \"12\"\n",
        "os.environ[\"EVAL_TEMPERATURE\"] = \"0.0\"\n",
        "os.environ[\"EVAL_TOP_P\"] = \"1.0\"\n",
        "os.environ[\"EVAL_MAX_NEW_TOKENS\"] = \"256\"\n",
        "os.environ[\"EVAL_SUCCESS_THRESHOLD\"] = \"0.6\"\n",
        "\n",
        "for key in [\n",
        "    \"BASELINE_MODEL_NAME\",\n",
        "    \"TRAINED_MODEL_PATH\",\n",
        "    \"ENV_BASE_URL\",\n",
        "    \"EVAL_SEED\",\n",
        "    \"EVAL_MAX_STEPS\",\n",
        "    \"EVAL_TEMPERATURE\",\n",
        "    \"EVAL_TOP_P\",\n",
        "    \"EVAL_MAX_NEW_TOKENS\",\n",
        "    \"EVAL_SUCCESS_THRESHOLD\",\n",
        "]:\n",
        "    print(f\"{key}={os.environ[key]}\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pathlib import Path\n",
        "\n",
        "trained_path = Path(os.environ[\"TRAINED_MODEL_PATH\"])\n",
        "print(\"Checkpoint exists:\", trained_path.exists())\n",
        "if trained_path.exists():\n",
        "    print(\"Checkpoint contents:\")\n",
        "    for item in sorted(trained_path.iterdir()):\n",
        "        print(\" -\", item.name)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Optional: Run CommitmentOS Locally Instead Of HF Space\n",
        "\n",
        "Only run this if you want evaluation against a local server inside Colab. Otherwise skip this section and keep `ENV_BASE_URL` pointed at the hosted Space."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Optional local server setup\n",
        "# import os\n",
        "# os.environ[\"ENV_BASE_URL\"] = \"http://127.0.0.1:7860\"\n",
        "# !nohup python -m uvicorn server.app:app --host 0.0.0.0 --port 7860 >/tmp/commitmentos.log 2>&1 &\n",
        "# !sleep 5\n",
        "# !curl -s http://127.0.0.1:7860/health"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Run Checkpoint Comparison"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "!python evaluation/evaluate_llm_checkpoints.py"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "!python evaluation/plot_llm_checkpoints.py"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Inspect Artifacts"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import json\n",
        "from pathlib import Path\n",
        "\n",
        "artifact_dir = Path(\"artifacts/evals_llm\")\n",
        "print(sorted(p.name for p in artifact_dir.iterdir()))\n",
        "\n",
        "summary = json.loads((artifact_dir / \"llm_summary.json\").read_text())\n",
        "summary"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "pd.read_csv(\"artifacts/evals_llm/llm_comparison.csv\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from IPython.display import SVG, display\n",
        "\n",
        "display(SVG(filename=\"artifacts/evals_llm/llm_reward_by_task.svg\"))\n",
        "display(SVG(filename=\"artifacts/evals_llm/llm_violations_before_after.svg\"))"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "9e8a35c5",
      "metadata": {},
      "source": [
        "## Backup results (zip and download)\n",
        "\n",
        "Run after eval/plot finish. Large runs: copy `training_output` to Google Drive instead of browser download.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "b4a5bcc7",
      "metadata": {},
      "outputs": [],
      "source": [
        "!cd /content/commitment_os && du -sh training_output artifacts/evals_llm 2>/dev/null || true\n",
        "!cd /content/commitment_os && zip -r /content/commitment_os_bundle.zip training_output artifacts/evals_llm\n",
        "from google.colab import files\n",
        "\n",
        "files.download(\"/content/commitment_os_bundle.zip\")\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.x"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}