{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CommitmentOS Checkpoint Evaluation (Colab)\n", "\n", "This notebook compares a base model against a locally saved LoRA-trained checkpoint on the CommitmentOS environment.\n", "\n", "It uses:\n", "- `BASELINE_MODEL_NAME` from Hugging Face\n", "- `TRAINED_MODEL_PATH` from disk in Colab\n", "- the existing `evaluation/evaluate_llm_checkpoints.py` script\n", "\n", "By default, the notebook evaluates against the CommitmentOS environment hosted on a Hugging Face Space. An optional local-server cell is included below." ] }, { "cell_type": "code", "execution_count": null, "id": "d43c692d", "metadata": {}, "outputs": [], "source": [ "!pip -q install --upgrade pip\n", "!pip -q install transformers peft accelerate torch sentencepiece fastapi uvicorn requests python-dotenv pydantic \"openenv-core>=0.2.0\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!git clone https://github.com/Jayant2304/commitment_os.git\n", "%cd commitment_os" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Paths\n", "\n", "Set the base model ID and the local adapter/checkpoint path. Change `TRAINED_MODEL_PATH` to the folder you actually want to evaluate.\n", "\n", "If the base model is gated, set `HF_TOKEN` as well. In Colab, the next cell reads it from Secrets under the key `HF_TOKEN`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Colab: load Hugging Face token from Secrets (key must be exactly HF_TOKEN)\n", "try:\n", " from google.colab import userdata\n", "\n", " os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")\n", " print(\"HF_TOKEN loaded from Colab secrets\")\n", "except ImportError:\n", " print(\"Not on Colab; set HF_TOKEN in the shell or .env if downloads fail.\")\n", "except Exception as exc:\n", " print(\"Could not load HF_TOKEN from secrets:\", exc)\n", "\n", "os.environ[\"BASELINE_MODEL_NAME\"] = \"Qwen/Qwen2.5-1.5B-Instruct\"\n", "os.environ[\"TRAINED_MODEL_PATH\"] = \"/content/commitment_os/training_output\"\n", "os.environ[\"ENV_BASE_URL\"] = \"https://jayant2304-commitment-os.hf.space\"\n", "\n", "# Optional for gated base models:\n", "# os.environ[\"HF_TOKEN\"] = \"hf_xxx\"\n", "\n", "# Optional eval overrides:\n", "os.environ[\"EVAL_SEED\"] = \"42\"\n", "os.environ[\"EVAL_MAX_STEPS\"] = \"12\"\n", "os.environ[\"EVAL_TEMPERATURE\"] = \"0.0\"\n", "os.environ[\"EVAL_TOP_P\"] = \"1.0\"\n", "os.environ[\"EVAL_MAX_NEW_TOKENS\"] = \"256\"\n", "os.environ[\"EVAL_SUCCESS_THRESHOLD\"] = \"0.6\"\n", "\n", "for key in [\n", " \"BASELINE_MODEL_NAME\",\n", " \"TRAINED_MODEL_PATH\",\n", " \"ENV_BASE_URL\",\n", " \"EVAL_SEED\",\n", " \"EVAL_MAX_STEPS\",\n", " \"EVAL_TEMPERATURE\",\n", " \"EVAL_TOP_P\",\n", " \"EVAL_MAX_NEW_TOKENS\",\n", " \"EVAL_SUCCESS_THRESHOLD\",\n", "]:\n", " print(f\"{key}={os.environ[key]}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "trained_path = Path(os.environ[\"TRAINED_MODEL_PATH\"])\n", "print(\"Checkpoint exists:\", trained_path.exists())\n", "if trained_path.exists():\n", " print(\"Checkpoint contents:\")\n", " for item in sorted(trained_path.iterdir()):\n", " print(\" -\", item.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"## Optional: Run CommitmentOS Locally Instead Of HF Space\n", "\n", "Only run this if you want evaluation against a local server inside Colab. Otherwise skip this section and keep `ENV_BASE_URL` pointed at the hosted Space." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional local server setup\n", "# import os\n", "# os.environ[\"ENV_BASE_URL\"] = \"http://127.0.0.1:7860\"\n", "# !nohup python -m uvicorn server.app:app --host 0.0.0.0 --port 7860 >/tmp/commitmentos.log 2>&1 &\n", "# !sleep 5\n", "# !curl -s http://127.0.0.1:7860/health" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run Checkpoint Comparison" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python evaluation/evaluate_llm_checkpoints.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python evaluation/plot_llm_checkpoints.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect Artifacts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "from pathlib import Path\n", "\n", "artifact_dir = Path(\"artifacts/evals_llm\")\n", "print(sorted(p.name for p in artifact_dir.iterdir()))\n", "\n", "summary = json.loads((artifact_dir / \"llm_summary.json\").read_text())\n", "summary" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pd.read_csv(\"artifacts/evals_llm/llm_comparison.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import SVG, display\n", "\n", "display(SVG(filename=\"artifacts/evals_llm/llm_reward_by_task.svg\"))\n", "display(SVG(filename=\"artifacts/evals_llm/llm_violations_before_after.svg\"))" ] }, { "cell_type": "markdown", "id": "9e8a35c5", "metadata": {}, "source": [ "## Backup results 
(zip and download)\n", "\n", "Run this cell after the evaluation and plotting cells have finished. For large runs, copy `training_output` to Google Drive instead of downloading it through the browser.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "b4a5bcc7", "metadata": {}, "outputs": [], "source": [ "# Report the size of each folder; errors are suppressed if a folder is missing\n", "!cd /content/commitment_os && du -sh training_output artifacts/evals_llm 2>/dev/null || true\n", "# Bundle the checkpoint and the evaluation artifacts into a single zip\n", "!cd /content/commitment_os && zip -r /content/commitment_os_bundle.zip training_output artifacts/evals_llm\n", "\n", "# Colab only: trigger a browser download of the bundle\n", "from google.colab import files\n", "\n", "files.download(\"/content/commitment_os_bundle.zip\")\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.x" } }, "nbformat": 4, "nbformat_minor": 5 }