{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d290a613",
   "metadata": {},
   "source": [
    "# AntiAtropos — Pre-Training Model Capability Tests\n",
    "\n",
    "**Model:** Qwen2.5 4B Instruct  \n",
    "**Goal:** Verify the base model can (1) emit valid SRE-action JSON and (2) reason \n",
    "about cluster physics zero-shot, before any SFT/RL training.\n",
    "\n",
    "If these tests fail, SFT will be training on broken output format — the model\n",
    "needs format instruction before it can learn content."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e93c24a6",
   "metadata": {},
   "source": [
    "## Cell 1 — Imports & Model Load"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "10bf7d65",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: transformers in /usr/local/lib/python3.12/dist-packages (5.6.2)\n",
      "Requirement already satisfied: accelerate in /usr/local/lib/python3.12/dist-packages (1.13.0)\n",
      "Collecting bitsandbytes\n",
      "  Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)\n",
      "Requirement already satisfied: huggingface-hub<2.0,>=1.5.0 in /usr/local/lib/python3.12/dist-packages (from transformers) (1.10.1)\n",
      "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.12/dist-packages (from transformers) (2.0.2)\n",
      "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from transformers) (26.0)\n",
      "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.12/dist-packages (from transformers) (6.0.3)\n",
      "Requirement already satisfied: regex>=2025.10.22 in /usr/local/lib/python3.12/dist-packages (from transformers) (2025.11.3)\n",
      "Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in /usr/local/lib/python3.12/dist-packages (from transformers) (0.22.2)\n",
      "Requirement already satisfied: typer in /usr/local/lib/python3.12/dist-packages (from transformers) (0.24.1)\n",
      "Requirement already satisfied: safetensors>=0.4.3 in /usr/local/lib/python3.12/dist-packages (from transformers) (0.7.0)\n",
      "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.12/dist-packages (from transformers) (4.67.3)\n",
      "Requirement already satisfied: psutil in /usr/local/lib/python3.12/dist-packages (from accelerate) (5.9.5)\n",
      "Requirement already satisfied: torch>=2.0.0 in /usr/local/lib/python3.12/dist-packages (from accelerate) (2.10.0+cu128)\n",
      "Requirement already satisfied: filelock>=3.10.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (3.25.2)\n",
      "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (2025.3.0)\n",
      "Requirement already satisfied: hf-xet<2.0.0,>=1.4.3 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (1.4.3)\n",
      "Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (0.28.1)\n",
      "Requirement already satisfied: typing-extensions>=4.1.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (4.15.0)\n",
      "Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (75.2.0)\n",
      "Requirement already satisfied: sympy>=1.13.3 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (1.14.0)\n",
      "Requirement already satisfied: networkx>=2.5.1 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.6.1)\n",
      "Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.1.6)\n",
      "Requirement already satisfied: cuda-bindings==12.9.4 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.9.4)\n",
      "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.8.93 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.93)\n",
      "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.8.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.90)\n",
      "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.8.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.90)\n",
      "Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (9.10.2.21)\n",
      "Requirement already satisfied: nvidia-cublas-cu12==12.8.4.1 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.4.1)\n",
      "Requirement already satisfied: nvidia-cufft-cu12==11.3.3.83 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (11.3.3.83)\n",
      "Requirement already satisfied: nvidia-curand-cu12==10.3.9.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (10.3.9.90)\n",
      "Requirement already satisfied: nvidia-cusolver-cu12==11.7.3.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (11.7.3.90)\n",
      "Requirement already satisfied: nvidia-cusparse-cu12==12.5.8.93 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.5.8.93)\n",
      "Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (0.7.1)\n",
      "Requirement already satisfied: nvidia-nccl-cu12==2.27.5 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (2.27.5)\n",
      "Requirement already satisfied: nvidia-nvshmem-cu12==3.4.5 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.4.5)\n",
      "Requirement already satisfied: nvidia-nvtx-cu12==12.8.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.90)\n",
      "Requirement already satisfied: nvidia-nvjitlink-cu12==12.8.93 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.93)\n",
      "Requirement already satisfied: nvidia-cufile-cu12==1.13.1.3 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (1.13.1.3)\n",
      "Requirement already satisfied: triton==3.6.0 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.6.0)\n",
      "Requirement already satisfied: cuda-pathfinder~=1.1 in /usr/local/lib/python3.12/dist-packages (from cuda-bindings==12.9.4->torch>=2.0.0->accelerate) (1.5.2)\n",
      "Collecting bitsandbytes\n",
      "  Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)\n",
      "Requirement already satisfied: huggingface-hub<2.0,>=1.5.0 in /usr/local/lib/python3.12/dist-packages (from transformers) (1.10.1)\n",
      "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.12/dist-packages (from transformers) (2.0.2)\n",
      "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from transformers) (26.0)\n",
      "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.12/dist-packages (from transformers) (6.0.3)\n",
      "Requirement already satisfied: regex>=2025.10.22 in /usr/local/lib/python3.12/dist-packages (from transformers) (2025.11.3)\n",
      "Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in /usr/local/lib/python3.12/dist-packages (from transformers) (0.22.2)\n",
      "Requirement already satisfied: typer in /usr/local/lib/python3.12/dist-packages (from transformers) (0.24.1)\n",
      "Requirement already satisfied: safetensors>=0.4.3 in /usr/local/lib/python3.12/dist-packages (from transformers) (0.7.0)\n",
      "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.12/dist-packages (from transformers) (4.67.3)\n",
      "Requirement already satisfied: psutil in /usr/local/lib/python3.12/dist-packages (from accelerate) (5.9.5)\n",
      "Requirement already satisfied: torch>=2.0.0 in /usr/local/lib/python3.12/dist-packages (from accelerate) (2.10.0+cu128)\n",
      "Requirement already satisfied: filelock>=3.10.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (3.25.2)\n",
      "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (2025.3.0)\n",
      "Requirement already satisfied: hf-xet<2.0.0,>=1.4.3 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (1.4.3)\n",
      "Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (0.28.1)\n",
      "Requirement already satisfied: typing-extensions>=4.1.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (4.15.0)\n",
      "Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (75.2.0)\n",
      "Requirement already satisfied: sympy>=1.13.3 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (1.14.0)\n",
      "Requirement already satisfied: networkx>=2.5.1 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.6.1)\n",
      "Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.1.6)\n",
      "Requirement already satisfied: cuda-bindings==12.9.4 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.9.4)\n",
      "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.8.93 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.93)\n",
      "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.8.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.90)\n",
      "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.8.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.90)\n",
      "Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (9.10.2.21)\n",
      "Requirement already satisfied: nvidia-cublas-cu12==12.8.4.1 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.4.1)\n",
      "Requirement already satisfied: nvidia-cufft-cu12==11.3.3.83 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (11.3.3.83)\n",
      "Requirement already satisfied: nvidia-curand-cu12==10.3.9.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (10.3.9.90)\n",
      "Requirement already satisfied: nvidia-cusolver-cu12==11.7.3.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (11.7.3.90)\n",
      "Requirement already satisfied: nvidia-cusparse-cu12==12.5.8.93 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.5.8.93)\n",
      "Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (0.7.1)\n",
      "Requirement already satisfied: nvidia-nccl-cu12==2.27.5 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (2.27.5)\n",
      "Requirement already satisfied: nvidia-nvshmem-cu12==3.4.5 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.4.5)\n",
      "Requirement already satisfied: nvidia-nvtx-cu12==12.8.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.90)\n",
      "Requirement already satisfied: nvidia-nvjitlink-cu12==12.8.93 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.93)\n",
      "Requirement already satisfied: nvidia-cufile-cu12==1.13.1.3 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (1.13.1.3)\n",
      "Requirement already satisfied: triton==3.6.0 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.6.0)\n",
      "Requirement already satisfied: cuda-pathfinder~=1.1 in /usr/local/lib/python3.12/dist-packages (from cuda-bindings==12.9.4->torch>=2.0.0->accelerate) (1.5.2)\n",
      "Requirement already satisfied: click>=8.2.1 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (8.3.2)\n",
      "Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (1.5.4)\n",
      "Requirement already satisfied: rich>=12.3.0 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (13.9.4)\n",
      "Requirement already satisfied: annotated-doc>=0.0.2 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (0.0.4)\n",
      "Requirement already satisfied: anyio in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (4.13.0)\n",
      "Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (2026.2.25)\n",
      "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (1.0.9)\n",
      "Requirement already satisfied: idna in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (3.11)\n",
      "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore==1.*->httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (0.16.0)\n",
      "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/dist-packages (from rich>=12.3.0->typer->transformers) (4.0.0)\n",
      "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=12.3.0->typer->transformers) (2.20.0)\n",
      "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy>=1.13.3->torch>=2.0.0->accelerate) (1.3.0)\n",
      "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch>=2.0.0->accelerate) (3.0.3)\n",
      "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/dist-packages (from markdown-it-py>=2.2.0->rich>=12.3.0->typer->transformers) (0.1.2)\n",
      "Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl (60.7 MB)\n",
      "\u001b[2K   \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m36.3/60.7 MB\u001b[0m \u001b[31m266.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0mRequirement already satisfied: click>=8.2.1 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (8.3.2)\n",
      "Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (1.5.4)\n",
      "Requirement already satisfied: rich>=12.3.0 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (13.9.4)\n",
      "Requirement already satisfied: annotated-doc>=0.0.2 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (0.0.4)\n",
      "Requirement already satisfied: anyio in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (4.13.0)\n",
      "Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (2026.2.25)\n",
      "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (1.0.9)\n",
      "Requirement already satisfied: idna in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (3.11)\n",
      "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore==1.*->httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (0.16.0)\n",
      "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/dist-packages (from rich>=12.3.0->typer->transformers) (4.0.0)\n",
      "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=12.3.0->typer->transformers) (2.20.0)\n",
      "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy>=1.13.3->torch>=2.0.0->accelerate) (1.3.0)\n",
      "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch>=2.0.0->accelerate) (3.0.3)\n",
      "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/dist-packages (from markdown-it-py>=2.2.0->rich>=12.3.0->typer->transformers) (0.1.2)\n",
      "Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl (60.7 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m60.7/60.7 MB\u001b[0m \u001b[31m13.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m60.7/60.7 MB\u001b[0m \u001b[31m13.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n",
      "\u001b[?25hInstalling collected packages: bitsandbytes\n",
      "Installing collected packages: bitsandbytes\n",
      "Successfully installed bitsandbytes-0.49.2\n",
      "Successfully installed bitsandbytes-0.49.2\n"
     ]
    }
   ],
   "source": [
    "pip install -U transformers accelerate bitsandbytes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "82d58995",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading Qwen/Qwen3.5-4B ...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "87a57fc405ca471d83f9db3d1e7aad58",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (incomplete total...): 0.00B [00:00, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "b2b3311921514bcc94367bbbdbb98097",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "47c0be3da2ba42ada6ec04d478705eed",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading weights:   0%|          | 0/426 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model loaded. Device: cpu\n",
      "Setup complete. Ready for tests.\n"
     ]
    }
   ],
   "source": [
    "import json, re, sys, os\n",
    "from typing import Optional\n",
    "\n",
    "# ── Ensure no HF token is used ──\n",
    "os.environ.pop(\"HUGGINGFACE_HUB_TOKEN\", None)\n",
    "os.environ.pop(\"HF_TOKEN\", None)\n",
    "\n",
    "# ── Model ──\n",
    "import torch\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "MODEL_ID = \"Qwen/Qwen3.5-4B\"\n",
    "\n",
    "print(f\"Loading {MODEL_ID} ...\")\n",
    "tokenizer = AutoTokenizer.from_pretrained(\n",
    "    MODEL_ID,\n",
    "    trust_remote_code=True,\n",
    "    token=None  # force unauthenticated\n",
    ")\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    MODEL_ID,\n",
    "    torch_dtype=torch.float16,\n",
    "    device_map=\"auto\",\n",
    "    trust_remote_code=True,\n",
    "    token=None  # force unauthenticated\n",
    ")\n",
    "print(f\"Model loaded. Device: {model.device}\")\n",
    "\n",
    "# ── SRE Action schema (must match models.py) ──\n",
    "VALID_ACTIONS = [\"NO_OP\", \"SCALE_UP\", \"SCALE_DOWN\", \"REROUTE_TRAFFIC\", \"SHED_LOAD\"]\n",
    "VALID_NODES   = [\"node-0\", \"node-1\", \"node-2\", \"node-3\", \"node-4\"]\n",
    "\n",
    "# ── Generation helper ──\n",
    "def generate(prompt: str, max_tokens: int = 40, temperature: float = 0.0) -> str:\n",
    "    inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
    "\n",
    "    outputs = model.generate(\n",
    "        **inputs,\n",
    "        max_new_tokens=max_tokens,\n",
    "        do_sample=False,\n",
    "        pad_token_id=tokenizer.eos_token_id,\n",
    "    )\n",
    "\n",
    "    outputs = model.generate(\n",
    "    **inputs,\n",
    "    max_new_tokens=60,\n",
    "    do_sample=False,\n",
    "    pad_token_id=tokenizer.eos_token_id,\n",
    "    eos_token_id=tokenizer.convert_tokens_to_ids(\"</think>\") \n",
    "   )\n",
    "\n",
    "    return tokenizer.decode(outputs[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n",
    "\n",
    "# ── JSON extraction helper ──\n",
    "def extract_json(text: str) -> Optional[dict]:\n",
    "    \"\"\"Try to pull a JSON object out of model output.\"\"\"\n",
    "    # Try ```json ... ``` block first\n",
    "    m = re.search(r\"```json\\s*(\\{.*?\\})\\s*```\", text, re.DOTALL)\n",
    "    if m:\n",
    "        return json.loads(m.group(1))\n",
    "    # Try bare {...}\n",
    "    m = re.search(r\"\\{[^{}]*\\}\", text, re.DOTALL)\n",
    "    if m:\n",
    "        # Find the outermost braces\n",
    "        start = text.index(\"{\")\n",
    "        depth = 0\n",
    "        for i, ch in enumerate(text[start:], start):\n",
    "            if ch == \"{\": depth += 1\n",
    "            elif ch == \"}\":\n",
    "                depth -= 1\n",
    "                if depth == 0:\n",
    "                    return json.loads(text[start:i+1])\n",
    "    return None\n",
    "\n",
    "# ── Validation helpers ──\n",
    "def validate_action(obj: dict) -> tuple[bool, str]:\n",
    "    \"\"\"Check a JSON object against the SREAction schema.\"\"\"\n",
    "    if not isinstance(obj, dict):\n",
    "        return False, \"not a dict\"\n",
    "    at = obj.get(\"action_type\")\n",
    "    if at not in VALID_ACTIONS:\n",
    "        return False, f\"invalid action_type '{at}'\"\n",
    "    nid = obj.get(\"target_node_id\")\n",
    "    if nid not in VALID_NODES:\n",
    "        return False, f\"invalid target_node_id '{nid}'\"\n",
    "    param = obj.get(\"parameter\")\n",
    "    if not isinstance(param, (int, float)):\n",
    "        return False, f\"parameter is not a number: {type(param).__name__}\"\n",
    "    if not (0.0 <= float(param) <= 10.0):\n",
    "        return False, f\"parameter {param} out of [0,10]\"\n",
    "    return True, \"\"\n",
    "\n",
    "print(\"Setup complete. Ready for tests.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0960e21",
   "metadata": {},
   "source": [
    "---\n",
    "## TEST 1 — Can the Model Write Valid SRE-Action JSON?\n",
    "\n",
    "We test increasingly complex schema requirements.  A model that fails any\n",
    "of these will produce broken output during SFT — fix the format instruction\n",
    "or system prompt before training."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5ac1c83",
   "metadata": {},
   "source": [
    "### Test 1a — Minimal: NO_OP with single node"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e041d79",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'transformers.models.qwen3_5.modeling_qwen3_5.Qwen3_5ForCausalLM'>\n",
      "============================================================\n",
      "TEST 1a — Minimal NO_OP JSON\n",
      "============================================================\n"
     ]
    }
   ],
   "source": [
    "print(type(model))\n",
    "print(\"=\" * 60)\n",
    "print(\"TEST 1a — Minimal NO_OP JSON\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "\n",
    "prompt = \"\"\"Return ONLY a valid JSON object.\n",
    "\n",
    "Schema:\n",
    "{\"action_type\": string, \"target_node_id\": string, \"parameter\": float}\n",
    "\n",
    "Rules:\n",
    "- action_type must be one of: NO_OP, SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD\n",
    "- target_node_id must be one of: node-0, node-1, node-2, node-3, node-4\n",
    "- parameter must be between 0.0 and 10.0\n",
    "\n",
    "Cluster state: healthy, no load.\n",
    "\n",
    "Correct response:\n",
    "{\"action_type\": \"NO_OP\", \"target_node_id\": \"node-0\", \"parameter\": 0.0}\n",
    "\n",
    "Output:\"\"\"\n",
    "\n",
    "for trial in range(5):\n",
    "    raw = generate(prompt, max_tokens=80, temperature=0.00)\n",
    "    print(f\"  Trial {trial+1} raw: {raw.strip()[:120]}\")\n",
    "    obj = extract_json(raw)\n",
    "    if obj is None:\n",
    "        print(f\"    ❌ Could not extract JSON\")\n",
    "        continue\n",
    "    ok, err = validate_action(obj)\n",
    "    if ok:\n",
    "        print(f\"    ✅ Valid: {obj}\")\n",
    "        if obj.get(\"action_type\") == \"NO_OP\":\n",
    "            print(f\"    ✅ Correct action (NO_OP)\")\n",
    "        else:\n",
    "            print(f\"    ⚠️  Action is {obj.get('action_type')} — should be NO_OP\")\n",
    "    else:\n",
    "        print(f\"    ❌ Invalid: {err}  |  raw keys: {list(obj.keys()) if isinstance(obj, dict) else 'N/A'}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a76acaf",
   "metadata": {},
   "source": [
    "### Test 1b — SCALE_UP with specific target + parameter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1bab1ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 60)\n",
    "print(\"TEST 1b — SCALE_UP Specific Target\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "prompt = \"\"\"You are an SRE agent.  Respond with ONLY a JSON action.\n",
    "\n",
    "Actions: NO_OP, SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD\n",
    "Nodes: node-0, node-1, node-2, node-3, node-4\n",
    "Parameter: float 0.0-10.0\n",
    "\n",
    "OBSERVATION:\n",
    "  node-0 (VIP payment): queue_depth=185, status=HEALTHY, capacity=3.0\n",
    "  node-1 (checkout):    queue_depth=12,  status=HEALTHY, capacity=3.0\n",
    "  node-2 (catalog):     queue_depth=8,   status=HEALTHY, capacity=3.0\n",
    "  node-3 (inventory):   queue_depth=2,   status=HEALTHY, capacity=3.0\n",
    "  node-4 (auth):        queue_depth=5,   status=HEALTHY, capacity=3.0\n",
    "\n",
    "node-0 is near FATAL_FAIL (threshold=200).  You MUST scale it up.\n",
    "Respond with the JSON:\"\"\"\n",
    "\n",
    "for trial in range(5):\n",
    "    raw = generate(prompt, max_tokens=80, temperature=0.0)\n",
    "    print(f\"  Trial {trial+1} raw: {raw.strip()[:120]}\")\n",
    "    obj = extract_json(raw)\n",
    "    if obj is None:\n",
    "        print(f\"    ❌ Could not extract JSON\")\n",
    "        continue\n",
    "    ok, err = validate_action(obj)\n",
    "    if ok:\n",
    "        print(f\"    ✅ Valid: {obj}\")\n",
    "        if obj.get(\"action_type\") == \"SCALE_UP\" and obj.get(\"target_node_id\") == \"node-0\":\n",
    "            print(f\"    ✅ Correct: SCALE_UP node-0\")\n",
    "        elif obj.get(\"action_type\") == \"SCALE_UP\":\n",
    "            print(f\"    ⚠️  SCALE_UP but target={obj.get('target_node_id')} (expected node-0)\")\n",
    "        else:\n",
    "            print(f\"    ⚠️  Action is {obj.get('action_type')} — should be SCALE_UP\")\n",
    "    else:\n",
    "        print(f\"    ❌ Invalid: {err}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "997fa23f",
   "metadata": {},
   "source": [
    "### Test 1c — REROUTE_TRAFFIC (different action type + 0-1 parameter)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cce2df01",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 60)\n",
    "print(\"TEST 1c — REROUTE_TRAFFIC\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "prompt = \"\"\"You are an SRE agent.  Respond with ONLY a JSON action.\n",
    "\n",
    "Actions: NO_OP, SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD\n",
    "Nodes: node-0, node-1, node-2, node-3, node-4\n",
    "Parameter: float 0.0-10.0\n",
    "\n",
    "OBSERVATION:\n",
    "  node-0 (VIP): queue_depth=45, status=HEALTHY, capacity=3.0\n",
    "  node-1:        queue_depth=210, status=FAILED, capacity=0.0\n",
    "  node-2:        queue_depth=8,  status=HEALTHY, capacity=3.0\n",
    "  node-3:        queue_depth=0,  status=HEALTHY, capacity=3.0\n",
    "  node-4:        queue_depth=30, status=HEALTHY, capacity=3.0\n",
    "\n",
    "node-1 has FAILED.  Traffic must be rerouted AWAY from it to healthy nodes.\n",
    "Use REROUTE_TRAFFIC with a fraction parameter (e.g. 0.8 means reroute 80%).\n",
    "Respond with the JSON:\"\"\"\n",
    "\n",
    "for trial in range(5):\n",
    "    raw = generate(prompt, max_tokens=80, temperature=0.0)\n",
    "    print(f\"  Trial {trial+1} raw: {raw.strip()[:120]}\")\n",
    "    obj = extract_json(raw)\n",
    "    if obj is None:\n",
    "        print(f\"    ❌ Could not extract JSON\")\n",
    "        continue\n",
    "    ok, err = validate_action(obj)\n",
    "    if ok:\n",
    "        print(f\"    ✅ Valid: {obj}\")\n",
    "        if obj.get(\"action_type\") == \"REROUTE_TRAFFIC\":\n",
    "            print(f\"    ✅ Correct action type\")\n",
    "        else:\n",
    "            print(f\"    ⚠️  Action is {obj.get('action_type')} — should be REROUTE_TRAFFIC\")\n",
    "    else:\n",
    "        print(f\"    ❌ Invalid: {err}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ef0ae70",
   "metadata": {},
   "source": [
    "### Test 1d — All 5 action types in separate calls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ae14f31",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 60)\n",
    "print(\"TEST 1d — All 5 Action Types\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "scenarios = [\n",
    "    (\"NO_OP\", \"All queues are below 10. Cluster is healthy. Do nothing.\"),\n",
    "    (\"SCALE_UP\", \"node-0 queue=195 (near fatal). Scale it up now.\"),\n",
    "    (\"SCALE_DOWN\", \"node-4 has capacity=5.0 and queue=2. It is over-provisioned. Scale it down.\"),\n",
    "    (\"REROUTE_TRAFFIC\", \"node-2 is FAILED. Reroute traffic away from it to healthy peers.\"),\n",
    "    (\"SHED_LOAD\", \"node-3 queue=175. It is NOT critical. Drop some of its incoming traffic.\"),\n",
    "]\n",
    "\n",
    "results_1d = {}\n",
    "for expected_action, scenario in scenarios:\n",
    "    prompt = f\"\"\"You are an SRE agent.  Respond with ONLY a JSON action.\n",
    "\n",
    "Actions: NO_OP, SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD\n",
    "Nodes: node-0, node-1, node-2, node-3, node-4\n",
    "Parameter: float 0.0-10.0\n",
    "\n",
    "SCENARIO: {scenario}\n",
    "Respond with the JSON:\"\"\"\n",
    "    raw = generate(prompt, max_tokens=80, temperature=0.0)\n",
    "    obj = extract_json(raw)\n",
    "    if obj is None:\n",
    "        results_1d[expected_action] = (\"NO_JSON\", raw[:80])\n",
    "    else:\n",
    "        ok, err = validate_action(obj)\n",
    "        if ok:\n",
    "            match = \"MATCH\" if obj[\"action_type\"] == expected_action else f\"MISMATCH→{obj['action_type']}\"\n",
    "            results_1d[expected_action] = (match, obj)\n",
    "        else:\n",
    "            results_1d[expected_action] = (\"INVALID\", err)\n",
    "\n",
    "for action, (status, detail) in results_1d.items():\n",
    "    icon = \"✅\" if status == \"MATCH\" else \"❌\"\n",
    "    print(f\"  {icon} {action:20s} → {status:12s}  {detail}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60e355b5",
   "metadata": {},
   "source": [
    "### Test 1e — Adversarial: prompts that should NOT produce actions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98647b0a",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 60)\n",
    "print(\"TEST 1e — Adversarial Prompts (should still be valid JSON)\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "adversarial_prompts = [\n",
    "    (\"fake_field\", \"Respond with JSON. Also include a 'reasoning' field (not in schema).\"),\n",
    "    (\"out_of_range\", \"node-4 queue=5000. Scale it to parameter=999.0.\"),\n",
    "    (\"unknown_node\", \"Reroute traffic from node-99.\"),\n",
    "    (\"empty_obs\", \"No data. All sensors offline. Just respond.\"),\n",
    "    (\"contradiction\", \"Queue is zero but node is FAILED. What do you do?\"),\n",
    "]\n",
    "\n",
    "for label, scenario in adversarial_prompts:\n",
    "    prompt = f\"\"\"You are an SRE agent.  Respond with ONLY a JSON action.\n",
    "\n",
    "Actions: NO_OP, SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD\n",
    "Nodes: node-0, node-1, node-2, node-3, node-4\n",
    "Parameter: float 0.0-10.0\n",
    "\n",
    "SCENARIO ({label}): {scenario}\n",
    "Respond with the JSON:\"\"\"\n",
    "    raw = generate(prompt, max_tokens=120, temperature=0.0)\n",
    "    obj = extract_json(raw)\n",
    "    if obj is None:\n",
    "        print(f\"  {label:20s} → ❌ NO JSON extracted: {raw.strip()[:80]}\")\n",
    "    else:\n",
    "        ok, err = validate_action(obj)\n",
    "        if ok:\n",
    "            print(f\"  {label:20s} → ✅ Valid JSON produced: {obj}\")\n",
    "        else:\n",
    "            print(f\"  {label:20s} → ⚠️  JSON extracted but invalid: {err}  |  {obj}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de2f4fa6",
   "metadata": {},
   "source": [
    "### Test 1 — Summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f75f567c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Aggregate results from all Test 1 cells above.\n",
    "# Count how many of the previous cells produced correct JSON.\n",
    "# (This is a manual summary — review the outputs above.)\n",
    "\n",
    "print(\"Test 1 — JSON Format Capability Summary\")\n",
    "print(\"=\" * 50)\n",
    "print(\"1a: NO_OP format            — should produce valid, correct JSON\")\n",
    "print(\"1b: SCALE_UP specific       — should target node-0, valid param\")\n",
    "print(\"1c: REROUTE_TRAFFIC         — should use REROUTE action type\")\n",
    "print(\"1d: All 5 action types      — should match each expected action\")\n",
    "print(\"1e: Adversarial robustness  — should recover to valid JSON\")\n",
    "print()\n",
    "print(\"If >80% of trials produce valid, schema-correct JSON,\")\n",
    "print(\"the model is JSON-capable for SFT format instruction.\")\n",
    "print(\"If <50%, add a system prompt with the exact schema + example\")\n",
    "print(\"and re-run before starting SFT.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3d08ada",
   "metadata": {},
   "source": [
    "---\n",
    "## TEST 2 — Zero-Shot SRE Judgment (No Training)\n",
    "\n",
    "Can the model REASON about SRE actions from raw cluster telemetry?\n",
    "We test across all 3 tasks + edge cases, increasing in complexity.\n",
    "The model sees a natural-language description of the cluster state\n",
    "and must pick the right action type, target, and reasoning."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3572ce7f",
   "metadata": {},
   "source": [
    "### Helpers — Format cluster state as natural language"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c79c47d",
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_state(nodes: list[dict]) -> str:\n",
    "    \"\"\"Convert node states to a readable observation block.\"\"\"\n",
    "    lines = []\n",
    "    for n in nodes:\n",
    "        vip = \" (VIP)\" if n.get(\"is_vip\") else \"\"\n",
    "        lines.append(\n",
    "            f\"  {n['node_id']}{vip}: queue={n['queue']}, status={n['status']}, \"\n",
    "            f\"capacity={n['capacity']}, incoming={n.get('incoming', 0)}\"\n",
    "        )\n",
    "    return \"\\n\".join(lines)\n",
    "\n",
    "def build_sre_prompt(scenario_desc: str, nodes: list[dict], hint: str = \"\") -> str:\n",
    "    \"\"\"Build a zero-shot SRE prompt with cluster state.\"\"\"\n",
    "    state_block = format_state(nodes)\n",
    "    prompt = f\"\"\"You are an SRE (Site Reliability Engineer) managing a 5-node microservice cluster.\n",
    "\n",
    "The cluster has a DAG topology:\n",
    "  - node-0 (VIP payment gateway) feeds traffic to node-1 and node-2\n",
    "  - node-2 (catalog) feeds traffic to node-3 (inventory)\n",
    "  - node-4 (auth service) is independent\n",
    "\n",
    "Available actions:\n",
    "  - NO_OP:            Do nothing\n",
    "  - SCALE_UP:         Increase capacity on a node (parameter = amount, 0-1)\n",
    "  - SCALE_DOWN:       Decrease capacity on a node (parameter = amount, 0-1)\n",
    "  - REROUTE_TRAFFIC:  Move traffic AWAY from a node to healthy peers (parameter = fraction to move, 0-1)\n",
    "  - SHED_LOAD:        Drop incoming traffic to a node for 1 tick (parameter = fraction to drop, 0-1)\n",
    "    CRITICAL nodes (node-0, node-1, node-2) CANNOT be shed.\n",
    "\n",
    "Queue depth > 80 = DEGRADED.  Queue depth > 200 = FATAL FAILURE.\n",
    "A FAILED node processes 0 requests — its children will starve.\n",
    "\n",
    "CLUSTER STATE:\n",
    "{state_block}\n",
    "\n",
    "SCENARIO: {scenario_desc}\n",
    "{hint}\n",
    "You MUST respond with:\n",
    "1. One sentence explaining your reasoning.\n",
    "2. A JSON action: {{\"action_type\": \"...\", \"target_node_id\": \"...\", \"parameter\": X.X}}\n",
    "\"\"\"\n",
    "    return prompt\n",
    "\n",
    "def judge_response(raw: str, expected_action: str, expected_target: Optional[str] = None) -> dict:\n",
    "    \"\"\"Score a zero-shot response.\"\"\"\n",
    "    obj = extract_json(raw)\n",
    "    result = {\"raw\": raw[:150], \"json_ok\": obj is not None}\n",
    "    if obj is None:\n",
    "        result[\"verdict\"] = \"NO_JSON\"\n",
    "        return result\n",
    "    ok, err = validate_action(obj)\n",
    "    result[\"valid\"] = ok\n",
    "    result[\"action\"] = obj\n",
    "    if not ok:\n",
    "        result[\"verdict\"] = f\"INVALID: {err}\"\n",
    "        return result\n",
    "    # Check if action matches expectation\n",
    "    match = obj[\"action_type\"] == expected_action\n",
    "    if expected_target:\n",
    "        match = match and obj[\"target_node_id\"] == expected_target\n",
    "    result[\"match\"] = match\n",
    "    result[\"verdict\"] = \"CORRECT\" if match else f\"WRONG_ACTION (got {obj['action_type']}, expected {expected_action})\"\n",
    "    return result\n",
    "\n",
    "print(\"Helpers ready.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0fb56d0",
   "metadata": {},
   "source": [
    "### Test 2a — Simple Overload (Task-1 style)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c52fd417",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 60)\n",
    "print(\"TEST 2a — Simple Overload (node-0 near fatal)\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "nodes = [\n",
    "    {\"node_id\": \"node-0\", \"is_vip\": True,  \"queue\": 185, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 44},\n",
    "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 22,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 22},\n",
    "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 18,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 22},\n",
    "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 5,   \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 18},\n",
    "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 8,   \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 43},\n",
    "]\n",
    "\n",
    "prompt = build_sre_prompt(\n",
    "    \"node-0 is at queue=185 (threshold=200 for FATAL).  The VIP payment gateway is the most important node.\",\n",
    "    nodes,\n",
    "    hint=\"Hint: The correct action is to SCALE_UP node-0 to increase its capacity and prevent failure.\"\n",
    ")\n",
    "\n",
    "raw = generate(prompt, max_tokens=200, temperature=0.1)\n",
    "print(f\"RAW: {raw[:300]}\")\n",
    "result = judge_response(raw, \"SCALE_UP\", \"node-0\")\n",
    "print(f\"\\n  Verdict: {result['verdict']}\")\n",
    "print(f\"  JSON ok: {result['json_ok']},  Valid: {result.get('valid','N/A')}\")\n",
    "if result.get(\"action\"):\n",
    "    print(f\"  Action: {result['action']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d1190a2",
   "metadata": {},
   "source": [
    "### Test 2b — Node Failure + Starvation (Task-2 style)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "963768d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 60)\n",
    "print(\"TEST 2b — Node-2 FAILED, node-3 Starved (Task-2 DAG)\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "nodes = [\n",
    "    {\"node_id\": \"node-0\", \"is_vip\": True,  \"queue\": 35,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 47},\n",
    "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 28,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 23},\n",
    "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 0,   \"status\": \"FAILED\",  \"capacity\": 0.0, \"incoming\": 23},\n",
    "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 0,   \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 0},\n",
    "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 12,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 47},\n",
    "]\n",
    "\n",
    "prompt = build_sre_prompt(\n",
    "    \"node-2 (catalog) is FAILED and processes 0 requests.  Its child node-3 (inventory) \"\n",
    "    \"is receiving 0 traffic because the DAG cannot forward through a dead parent.  \"\n",
    "    \"node-0 is still sending 50% of its outflow to the dead node-2 — that traffic is wasted.\",\n",
    "    nodes,\n",
    "    hint=\"Hint: Reroute traffic AWAY from node-2 so node-0 sends more to node-1 instead.\"\n",
    ")\n",
    "\n",
    "raw = generate(prompt, max_tokens=200, temperature=0.1)\n",
    "print(f\"RAW: {raw[:300]}\")\n",
    "result = judge_response(raw, \"REROUTE_TRAFFIC\")\n",
    "print(f\"\\n  Verdict: {result['verdict']}\")\n",
    "print(f\"  JSON ok: {result['json_ok']},  Valid: {result.get('valid','N/A')}\")\n",
    "if result.get(\"action\"):\n",
    "    print(f\"  Action: {result['action']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18794fc6",
   "metadata": {},
   "source": [
    "### Test 2c — Surge on Critical Nodes (Task-3 style)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6def3e2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 60)\n",
    "print(\"TEST 2c — Surge on Critical Nodes (Task-3)\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "nodes = [\n",
    "    {\"node_id\": \"node-0\", \"is_vip\": True,  \"queue\": 10,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 30},\n",
    "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 165, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 155},\n",
    "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 170, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 155},\n",
    "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 80,  \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 45},\n",
    "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 5,   \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 30},\n",
    "]\n",
    "\n",
    "prompt = build_sre_prompt(\n",
    "    \"A surge of +140 req/tick has hit node-1 and node-2 directly (bypassing the normal DAG ingress).  \"\n",
    "    \"node-1 and node-2 are CRITICAL — you CANNOT shed their load.  Their queues are approaching FATAL.  \"\n",
    "    \"node-2's child node-3 is also starting to back up from the overflow.  \"\n",
    "    \"The correct response is to SCALE_UP the bottleneck nodes.\",\n",
    "    nodes,\n",
    "    hint=\"Hint: SCALE_UP node-1 and/or node-2.  Also consider scaling node-3 to absorb downstream overflow.\"\n",
    ")\n",
    "\n",
    "raw = generate(prompt, max_tokens=250, temperature=0.1)\n",
    "print(f\"RAW: {raw[:350]}\")\n",
    "result = judge_response(raw, \"SCALE_UP\")  # Accept any SCALE_UP on node-1, node-2, or node-3\n",
    "print(f\"\\n  Verdict: {result['verdict']}\")\n",
    "print(f\"  JSON ok: {result['json_ok']},  Valid: {result.get('valid','N/A')}\")\n",
    "if result.get(\"action\"):\n",
    "    a = result[\"action\"]\n",
    "    print(f\"  Action: {a}\")\n",
    "    # Extra check: was it a sensible target?\n",
    "    if a.get(\"target_node_id\") in [\"node-1\", \"node-2\", \"node-3\"]:\n",
    "        print(f\"  ✅ Target is sensible (node-1/2/3 are the bottleneck)\")\n",
    "    else:\n",
    "        print(f\"  ⚠️  Target is {a.get('target_node_id')} — node-0 is not the bottleneck here\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30b7b6c8",
   "metadata": {},
   "source": [
    "### Test 2d — DAG Bottleneck: Ingress Full, Downstream Idle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7aba9e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 60)\n",
    "print(\"TEST 2d — DAG Bottleneck (ingress full, children idle)\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "nodes = [\n",
    "    {\"node_id\": \"node-0\", \"is_vip\": True,  \"queue\": 190, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 47},\n",
    "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 3,   \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 22},\n",
    "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 4,   \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 22},\n",
    "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 2,   \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 22},\n",
    "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 6,   \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 43},\n",
    "]\n",
    "\n",
    "prompt = build_sre_prompt(\n",
    "    \"node-0 is the bottleneck — it receives 47 req/tick but can only process 45 with capacity 3.0.  \"\n",
    "    \"Its queue is at 190 (near fatal).  All downstream nodes are nearly idle because node-0 \"\n",
    "    \"cannot forward enough traffic.  The right move is to SCALE_UP the ingress bottleneck (node-0).  \"\n",
    "    \"Scaling downstream nodes would NOT help — they are already underutilized.\",\n",
    "    nodes,\n",
    "    hint=\"Hint: This is a pure ingress bottleneck.  Scale the ingress, not the downstream.\"\n",
    ")\n",
    "\n",
    "raw = generate(prompt, max_tokens=200, temperature=0.1)\n",
    "print(f\"RAW: {raw[:300]}\")\n",
    "result = judge_response(raw, \"SCALE_UP\", \"node-0\")\n",
    "print(f\"\\n  Verdict: {result['verdict']}\")\n",
    "print(f\"  JSON ok: {result['json_ok']},  Valid: {result.get('valid','N/A')}\")\n",
    "if result.get(\"action\"):\n",
    "    a = result[\"action\"]\n",
    "    print(f\"  Action: {a}\")\n",
    "    if a.get(\"target_node_id\") != \"node-0\" and a.get(\"action_type\") == \"SCALE_UP\":\n",
    "        print(f\"  ⚠️  Scaling {a['target_node_id']} is wrong — downstream is idle, ingress is the bottleneck\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1038626f",
   "metadata": {},
   "source": [
    "### Test 2e — Multi-Node Crisis (competing pressures)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69d13daa",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 60)\n",
    "print(\"TEST 2e — Multi-Node Crisis\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "nodes = [\n",
    "    {\"node_id\": \"node-0\", \"is_vip\": True,  \"queue\": 95,  \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 48},\n",
    "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 210, \"status\": \"FAILED\",  \"capacity\": 0.0, \"incoming\": 24},\n",
    "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 60,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 24},\n",
    "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 5,   \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 12},\n",
    "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 130, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 47},\n",
    "]\n",
    "\n",
    "prompt = build_sre_prompt(\n",
    "    \"MULTIPLE CRISES simultaneously:\\n\"\n",
    "    \"1. node-1 is FAILED — node-0 is still sending 50% of its traffic there (wasted).\\n\"\n",
    "    \"2. node-0 is DEGRADED at queue=95 and trending up.\\n\"\n",
    "    \"3. node-4 (independent auth) is DEGRADED at queue=130.\\n\"\n",
    "    \"You can only take ONE action this tick.  Which is most critical?\",\n",
    "    nodes,\n",
    "    hint=\"Hint: The failed node-1 is permanently dead (scripted failure).  \"\n",
    "         \"Rerouting traffic AWAY from it would immediately help both node-0 and prevent waste.  \"\n",
    "         \"Alternatively, scaling node-0 or node-4 addresses their individual overload.\"\n",
    ")\n",
    "\n",
    "raw = generate(prompt, max_tokens=250, temperature=0.1)\n",
    "print(f\"RAW: {raw[:350]}\")\n",
    "result = judge_response(raw, \"REROUTE_TRAFFIC\")  # Best answer: reroute from node-1\n",
    "print(f\"\\n  Verdict: {result['verdict']}\")\n",
    "print(f\"  JSON ok: {result['json_ok']},  Valid: {result.get('valid','N/A')}\")\n",
    "if result.get(\"action\"):\n",
    "    a = result[\"action\"]\n",
    "    print(f\"  Action: {a}\")\n",
    "    # Accept REROUTE (best), SCALE_UP node-0 (good), SCALE_UP node-4 (ok).\n",
    "    # Reject: NO_OP, SCALE_DOWN, SHED_LOAD on critical.\n",
    "    if a[\"action_type\"] == \"REROUTE_TRAFFIC\":\n",
    "        print(f\"  ✅ Best answer — rerouting fixes the root cause (dead node wasting traffic)\")\n",
    "    elif a[\"action_type\"] == \"SCALE_UP\" and a[\"target_node_id\"] in [\"node-0\", \"node-4\"]:\n",
    "        print(f\"  ⚠️  Acceptable but suboptimal — scaling treats symptom, reroute treats cause\")\n",
    "    elif a[\"action_type\"] == \"NO_OP\":\n",
    "        print(f\"  ❌ Doing nothing during multi-node crisis is wrong\")\n",
    "    elif a[\"action_type\"] == \"SHED_LOAD\" and a[\"target_node_id\"] in [\"node-0\", \"node-1\", \"node-2\"]:\n",
    "        print(f\"  ❌ SHED_LOAD on critical node is BLOCKED\")\n",
    "    else:\n",
    "        print(f\"  ⚠️  Unexpected action — evaluate manually\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "019cc3d1",
   "metadata": {},
   "source": [
    "### Test 2f — Calm Cluster (should NO_OP, not act)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf825879",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 60)\n",
    "print(\"TEST 2f — Calm Cluster (should NO_OP)\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "nodes = [\n",
    "    {\"node_id\": \"node-0\", \"is_vip\": True,  \"queue\": 5,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 30},\n",
    "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 3,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 15},\n",
    "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 4,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 15},\n",
    "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 2,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 15},\n",
    "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 6,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 30},\n",
    "]\n",
    "\n",
    "prompt = build_sre_prompt(\n",
    "    \"The cluster is completely healthy.  All queues are under 10, all nodes are HEALTHY, \"\n",
    "    \"all capacity is at baseline (3.0).  There is no reason to intervene.\",\n",
    "    nodes,\n",
    "    hint=\"Hint: The correct action is NO_OP.  Do not scale or reroute a healthy cluster.\"\n",
    ")\n",
    "\n",
    "raw = generate(prompt, max_tokens=200, temperature=0.1)\n",
    "print(f\"RAW: {raw[:250]}\")\n",
    "result = judge_response(raw, \"NO_OP\")\n",
    "print(f\"\\n  Verdict: {result['verdict']}\")\n",
    "print(f\"  JSON ok: {result['json_ok']},  Valid: {result.get('valid','N/A')}\")\n",
    "if result.get(\"action\"):\n",
    "    a = result[\"action\"]\n",
    "    print(f\"  Action: {a}\")\n",
    "    if a[\"action_type\"] != \"NO_OP\":\n",
    "        print(f\"  ⚠️  Taking action on a healthy cluster wastes resources\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "587ea8c7",
   "metadata": {},
   "source": [
    "### Test 2g — Backpressure Chain (advanced: child overload throttles parent)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e47394b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 60)\n",
    "print(\"TEST 2g — Backpressure Chain (child overload → parent throttled)\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "nodes = [\n",
    "    {\"node_id\": \"node-0\", \"is_vip\": True,  \"queue\": 35,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 44},\n",
    "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 8,   \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 22},\n",
    "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 190, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 22},\n",
    "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 175, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 22},\n",
    "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 10,  \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 43},\n",
    "]\n",
    "\n",
    "prompt = build_sre_prompt(\n",
    "    \"node-2 (queue=190) and its child node-3 (queue=175) are both near FATAL.  \"\n",
    "    \"Because node-2 and node-3 are overloaded, BACKPRESSURE will throttle node-0's effective \"\n",
    "    \"service rate, causing node-0's queue to rise even though it seems healthy now.  \"\n",
    "    \"The root cause is downstream (node-2/3), not the ingress.  \"\n",
    "    \"SCALE_UP the downstream bottleneck to break the backpressure chain.\",\n",
    "    nodes,\n",
    "    hint=\"Hint: Scaling node-0 would NOT help — the bottleneck is at node-2/3.  Scale them instead.\"\n",
    ")\n",
    "\n",
    "raw = generate(prompt, max_tokens=250, temperature=0.1)\n",
    "print(f\"RAW: {raw[:300]}\")\n",
    "result = judge_response(raw, \"SCALE_UP\")  # Should scale node-2 or node-3\n",
    "print(f\"\\n  Verdict: {result['verdict']}\")\n",
    "print(f\"  JSON ok: {result['json_ok']},  Valid: {result.get('valid','N/A')}\")\n",
    "if result.get(\"action\"):\n",
    "    a = result[\"action\"]\n",
    "    print(f\"  Action: {a}\")\n",
    "    if a.get(\"action_type\") == \"SCALE_UP\" and a.get(\"target_node_id\") in [\"node-2\", \"node-3\"]:\n",
    "        print(f\"  ✅ Correct — scaling the downstream bottleneck breaks backpressure\")\n",
    "    elif a.get(\"action_type\") == \"SCALE_UP\" and a.get(\"target_node_id\") == \"node-0\":\n",
    "        print(f\"  ❌ Wrong — node-0 is not the bottleneck, backpressure is caused by node-2/3\")\n",
    "    elif a.get(\"action_type\") == \"REROUTE_TRAFFIC\":\n",
    "        print(f\"  ⚠️  Rerouting from node-2 would starve node-3.  Scale is better here.\")\n",
    "    else:\n",
    "        print(f\"  ⚠️  Evaluate manually — is this action addressing the downstream bottleneck?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85758d5e",
   "metadata": {},
   "source": [
    "### Test 2 — Summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cec2e024",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Test 2 — Zero-Shot SRE Judgment Summary\")\n",
    "print(\"=\" * 55)\n",
    "print()\n",
    "print(\"2a: Simple overload          — Should SCALE_UP node-0\")\n",
    "print(\"2b: Node failure + starve    — Should REROUTE_TRAFFIC from dead node\")\n",
    "print(\"2c: Surge on critical nodes  — Should SCALE_UP node-1/2/3\")\n",
    "print(\"2d: DAG ingress bottleneck   — Should SCALE_UP node-0 (not downstream)\")\n",
    "print(\"2e: Multi-node crisis        — Should REROUTE (root cause) or SCALE_UP key node\")\n",
    "print(\"2f: Calm cluster             — Should NO_OP\")\n",
    "print(\"2g: Backpressure chain       — Should SCALE_UP downstream (not ingress)\")\n",
    "print()\n",
    "print(\"Scoring guide:\")\n",
    "print(\"  5-7 correct actions + valid JSON = Model has zero-shot SRE intuition\")\n",
    "print(\"  3-4 correct                  = Model has partial understanding, needs SFT\")\n",
    "print(\"  0-2 correct                  = Model lacks SRE concepts — SFT must teach from scratch\")\n",
    "print()\n",
    "print(\"Key insight: Even a 'wrong' action with valid JSON and sensible\")\n",
    "print(\"reasoning is better than a broken response.  SFT can fix judgment;\")\n",
    "print(\"it cannot fix format if the model can't produce valid JSON.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c81dbf14",
   "metadata": {},
   "source": [
    "---\n",
    "## Final Verdict — Is This Model Ready for SFT?\n",
    "\n",
    "After running both tests, check:\n",
    "\n",
    "| Test | Pass Condition | Ready? |\n",
    "|---|---|---|\n",
    "| Test 1 — JSON | ≥80% of trials produce valid, schema-correct JSON | ☐ |\n",
    "| Test 2a — Simple overload | SCALE_UP node-0 | ☐ |\n",
    "| Test 2b — Node failure | REROUTE_TRAFFIC (not NO_OP or SCALE_UP) | ☐ |\n",
    "| Test 2c — Surge | SCALE_UP (not SHED_LOAD on critical) | ☐ |\n",
    "| Test 2d — DAG bottleneck | SCALE_UP node-0 (not node-1/2/3) | ☐ |\n",
    "| Test 2e — Multi-crisis | Any non-NO_OP, preferably REROUTE | ☐ |\n",
    "| Test 2f — Calm | NO_OP | ☐ |\n",
    "| Test 2g — Backpressure | SCALE_UP downstream (not node-0) | ☐ |\n",
    "\n",
    "**If Test 1 fails (can't write JSON):** Add a system prompt with exact schema + few-shot examples before SFT.\n",
    "\n",
    "**If Test 2 fails (<4 correct):** The model lacks SRE physics intuition. SFT will need:\n",
    "- More diverse dataset (all action types, all tasks)\n",
    "- Explicit reasoning chains in the training examples\n",
    "- Possibly a stronger base model for zero-shot transfer\n",
    "\n",
    "**If both pass:** Proceed directly to SFT. The model has format + basic physics."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}