{ "cells": [ { "cell_type": "markdown", "id": "d290a613", "metadata": {}, "source": [ "# AntiAtropos — Pre-Training Model Capability Tests\n", "\n", "**Model:** `Qwen/Qwen3.5-4B`  \n", "**Goal:** Verify the base model can (1) emit valid SRE-action JSON and (2) reason\n", "about cluster physics zero-shot, before any SFT/RL training.\n", "\n", "If these tests fail, SFT would be trained on a broken output format: the model\n", "needs format instruction before it can learn content." ] }, { "cell_type": "markdown", "id": "e93c24a6", "metadata": {}, "source": [ "## Cell 1 — Imports & Model Load" ] }, { "cell_type": "code", "execution_count": 6, "id": "10bf7d65", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: transformers in /usr/local/lib/python3.12/dist-packages (5.6.2)\n", "Requirement already satisfied: accelerate in /usr/local/lib/python3.12/dist-packages (1.13.0)\n", "Collecting bitsandbytes\n", " Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)\n", "Requirement already satisfied: huggingface-hub<2.0,>=1.5.0 in /usr/local/lib/python3.12/dist-packages (from transformers) (1.10.1)\n", "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.12/dist-packages (from transformers) (2.0.2)\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from transformers) (26.0)\n", "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.12/dist-packages (from transformers) (6.0.3)\n", "Requirement already satisfied: regex>=2025.10.22 in /usr/local/lib/python3.12/dist-packages (from transformers) (2025.11.3)\n", "Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in /usr/local/lib/python3.12/dist-packages (from transformers) (0.22.2)\n", "Requirement already satisfied: typer in /usr/local/lib/python3.12/dist-packages (from transformers) (0.24.1)\n", "Requirement already satisfied: safetensors>=0.4.3 in 
/usr/local/lib/python3.12/dist-packages (from transformers) (0.7.0)\n", "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.12/dist-packages (from transformers) (4.67.3)\n", "Requirement already satisfied: psutil in /usr/local/lib/python3.12/dist-packages (from accelerate) (5.9.5)\n", "Requirement already satisfied: torch>=2.0.0 in /usr/local/lib/python3.12/dist-packages (from accelerate) (2.10.0+cu128)\n", "Requirement already satisfied: filelock>=3.10.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (3.25.2)\n", "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (2025.3.0)\n", "Requirement already satisfied: hf-xet<2.0.0,>=1.4.3 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (1.4.3)\n", "Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (0.28.1)\n", "Requirement already satisfied: typing-extensions>=4.1.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers) (4.15.0)\n", "Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (75.2.0)\n", "Requirement already satisfied: sympy>=1.13.3 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (1.14.0)\n", "Requirement already satisfied: networkx>=2.5.1 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.6.1)\n", "Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.1.6)\n", "Requirement already satisfied: cuda-bindings==12.9.4 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.9.4)\n", "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.8.93 in /usr/local/lib/python3.12/dist-packages (from 
torch>=2.0.0->accelerate) (12.8.93)\n", "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.8.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.90)\n", "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.8.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.90)\n", "Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (9.10.2.21)\n", "Requirement already satisfied: nvidia-cublas-cu12==12.8.4.1 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.4.1)\n", "Requirement already satisfied: nvidia-cufft-cu12==11.3.3.83 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (11.3.3.83)\n", "Requirement already satisfied: nvidia-curand-cu12==10.3.9.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (10.3.9.90)\n", "Requirement already satisfied: nvidia-cusolver-cu12==11.7.3.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (11.7.3.90)\n", "Requirement already satisfied: nvidia-cusparse-cu12==12.5.8.93 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.5.8.93)\n", "Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (0.7.1)\n", "Requirement already satisfied: nvidia-nccl-cu12==2.27.5 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (2.27.5)\n", "Requirement already satisfied: nvidia-nvshmem-cu12==3.4.5 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.4.5)\n", "Requirement already satisfied: nvidia-nvtx-cu12==12.8.90 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (12.8.90)\n", "Requirement already satisfied: nvidia-nvjitlink-cu12==12.8.93 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) 
(12.8.93)\n", "Requirement already satisfied: nvidia-cufile-cu12==1.13.1.3 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (1.13.1.3)\n", "Requirement already satisfied: triton==3.6.0 in /usr/local/lib/python3.12/dist-packages (from torch>=2.0.0->accelerate) (3.6.0)\n", "Requirement already satisfied: cuda-pathfinder~=1.1 in /usr/local/lib/python3.12/dist-packages (from cuda-bindings==12.9.4->torch>=2.0.0->accelerate) (1.5.2)\n", "Requirement already satisfied: 
click>=8.2.1 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (8.3.2)\n", "Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (1.5.4)\n", "Requirement already satisfied: rich>=12.3.0 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (13.9.4)\n", "Requirement already satisfied: annotated-doc>=0.0.2 in /usr/local/lib/python3.12/dist-packages (from typer->transformers) (0.0.4)\n", "Requirement already satisfied: anyio in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (4.13.0)\n", "Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (2026.2.25)\n", "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (1.0.9)\n", "Requirement already satisfied: idna in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (3.11)\n", "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore==1.*->httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers) (0.16.0)\n", "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/dist-packages (from rich>=12.3.0->typer->transformers) (4.0.0)\n", "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=12.3.0->typer->transformers) (2.20.0)\n", "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy>=1.13.3->torch>=2.0.0->accelerate) (1.3.0)\n", "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch>=2.0.0->accelerate) (3.0.3)\n", "Requirement already satisfied: mdurl~=0.1 in 
/usr/local/lib/python3.12/dist-packages (from markdown-it-py>=2.2.0->rich>=12.3.0->typer->transformers) (0.1.2)\n", "Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl (60.7 MB)\n", "Installing collected packages: bitsandbytes\n", "Successfully installed bitsandbytes-0.49.2\n" ] } ], "source": [ "pip install -U transformers accelerate bitsandbytes" ] }, { "cell_type": "code", "execution_count": 10, "id": "82d58995", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading Qwen/Qwen3.5-4B ...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "87a57fc405ca471d83f9db3d1e7aad58", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading (incomplete total...): 0.00B [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b2b3311921514bcc94367bbbdbb98097", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Fetching 2 files: 0%| | 0/2 [00:00 str:\n", " inputs = tokenizer(prompt, 
return_tensors=\"pt\").to(model.device)\n", "\n", "    # One greedy generation that honors max_tokens; EOS comes from the model's generation config.\n", "    outputs = model.generate(\n", "        **inputs,\n", "        max_new_tokens=max_tokens,\n", "        do_sample=False,\n", "        pad_token_id=tokenizer.eos_token_id,\n", "    )\n", "\n", "    return tokenizer.decode(outputs[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n", "\n", "# ── JSON extraction helper ──\n", "def extract_json(text: str) -> Optional[dict]:\n", "    \"\"\"Try to pull a JSON object out of model output; return None on failure.\"\"\"\n", "    # Try ```json ... ``` block first\n", "    m = re.search(r\"```json\\s*(\\{.*?\\})\\s*```\", text, re.DOTALL)\n", "    if m:\n", "        try:\n", "            return json.loads(m.group(1))\n", "        except json.JSONDecodeError:\n", "            pass  # fall through to the bare-brace scan\n", "    # Try bare {...}: scan from the first \"{\" to its matching \"}\"\n", "    start = text.find(\"{\")\n", "    if start == -1:\n", "        return None\n", "    depth = 0\n", "    for i, ch in enumerate(text[start:], start):\n", "        if ch == \"{\":\n", "            depth += 1\n", "        elif ch == \"}\":\n", "            depth -= 1\n", "            if depth == 0:\n", "                try:\n", "                    return json.loads(text[start:i+1])\n", "                except json.JSONDecodeError:\n", "                    return None\n", "    return None\n", "\n", "# ── Validation helpers ──\n", "def validate_action(obj: dict) -> tuple[bool, str]:\n", "    \"\"\"Check a JSON object against the SREAction schema.\"\"\"\n", "    if not isinstance(obj, dict):\n", "        return False, \"not a dict\"\n", "    at = obj.get(\"action_type\")\n", "    if at not in VALID_ACTIONS:\n", "        return False, f\"invalid action_type '{at}'\"\n", "    nid = obj.get(\"target_node_id\")\n", "    if nid not in VALID_NODES:\n", "        return False, f\"invalid target_node_id '{nid}'\"\n", "    param = obj.get(\"parameter\")\n", "    # bool is a subclass of int, so reject it explicitly\n", "    if isinstance(param, bool) or not isinstance(param, (int, float)):\n", "        return False, f\"parameter is not a number: {type(param).__name__}\"\n", "    if not (0.0 <= float(param) <= 10.0):\n", "        return False, f\"parameter {param} out of [0,10]\"\n", "    return True, \"\"\n", "\n", 
"print(\"Setup complete. Ready for tests.\")" ] }, { "cell_type": "markdown", "id": "e0960e21", "metadata": {}, "source": [ "---\n", "## TEST 1 — Can the Model Write Valid SRE-Action JSON?\n", "\n", "The tests below impose increasingly complex schema requirements. A model that fails any\n", "of them will produce broken output during SFT, so fix the format instruction\n", "or system prompt before training." ] }, { "cell_type": "markdown", "id": "c5ac1c83", "metadata": {}, "source": [ "### Test 1a — Minimal: NO_OP with single node" ] }, { "cell_type": "code", "execution_count": null, "id": "1e041d79", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "============================================================\n", "TEST 1a — Minimal NO_OP JSON\n", "============================================================\n" ] } ], "source": [ "print(type(model))\n", "print(\"=\" * 60)\n", "print(\"TEST 1a — Minimal NO_OP JSON\")\n", "print(\"=\" * 60)\n", "\n", "\n", "prompt = \"\"\"Return ONLY a valid JSON object.\n", "\n", "Schema:\n", "{\"action_type\": string, \"target_node_id\": string, \"parameter\": float}\n", "\n", "Rules:\n", "- action_type must be one of: NO_OP, SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD\n", "- target_node_id must be one of: node-0, node-1, node-2, node-3, node-4\n", "- parameter must be between 0.0 and 10.0\n", "\n", "Cluster state: healthy, no load.\n", "\n", "Correct response:\n", "{\"action_type\": \"NO_OP\", \"target_node_id\": \"node-0\", \"parameter\": 0.0}\n", "\n", "Output:\"\"\"\n", "\n", "for trial in range(5):\n", "    raw = generate(prompt, max_tokens=80, temperature=0.00)\n", "    print(f\" Trial {trial+1} raw: {raw.strip()[:120]}\")\n", "    obj = extract_json(raw)\n", "    if obj is None:\n", "        print(f\" ❌ Could not extract JSON\")\n", "        continue\n", "    ok, err = validate_action(obj)\n", "    if ok:\n", "        print(f\" ✅ Valid: {obj}\")\n", "        if obj.get(\"action_type\") == \"NO_OP\":\n", "            print(f\" ✅ Correct action 
(NO_OP)\")\n", "        else:\n", "            print(f\" ⚠️ Action is {obj.get('action_type')} — should be NO_OP\")\n", "    else:\n", "        print(f\" ❌ Invalid: {err} | raw keys: {list(obj.keys()) if isinstance(obj, dict) else 'N/A'}\")" ] }, { "cell_type": "markdown", "id": "5a76acaf", "metadata": {}, "source": [ "### Test 1b — SCALE_UP with specific target + parameter" ] }, { "cell_type": "code", "execution_count": null, "id": "d1bab1ee", "metadata": {}, "outputs": [], "source": [ "print(\"=\" * 60)\n", "print(\"TEST 1b — SCALE_UP Specific Target\")\n", "print(\"=\" * 60)\n", "\n", "prompt = \"\"\"You are an SRE agent. Respond with ONLY a JSON action.\n", "\n", "Actions: NO_OP, SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD\n", "Nodes: node-0, node-1, node-2, node-3, node-4\n", "Parameter: float 0.0-10.0\n", "\n", "OBSERVATION:\n", " node-0 (VIP payment): queue_depth=185, status=HEALTHY, capacity=3.0\n", " node-1 (checkout): queue_depth=12, status=HEALTHY, capacity=3.0\n", " node-2 (catalog): queue_depth=8, status=HEALTHY, capacity=3.0\n", " node-3 (inventory): queue_depth=2, status=HEALTHY, capacity=3.0\n", " node-4 (auth): queue_depth=5, status=HEALTHY, capacity=3.0\n", "\n", "node-0 is near FATAL_FAIL (threshold=200). 
You MUST scale it up.\n", "Respond with the JSON:\"\"\"\n", "\n", "for trial in range(5):\n", "    raw = generate(prompt, max_tokens=80, temperature=0.0)\n", "    print(f\" Trial {trial+1} raw: {raw.strip()[:120]}\")\n", "    obj = extract_json(raw)\n", "    if obj is None:\n", "        print(f\" ❌ Could not extract JSON\")\n", "        continue\n", "    ok, err = validate_action(obj)\n", "    if ok:\n", "        print(f\" ✅ Valid: {obj}\")\n", "        if obj.get(\"action_type\") == \"SCALE_UP\" and obj.get(\"target_node_id\") == \"node-0\":\n", "            print(f\" ✅ Correct: SCALE_UP node-0\")\n", "        elif obj.get(\"action_type\") == \"SCALE_UP\":\n", "            print(f\" ⚠️ SCALE_UP but target={obj.get('target_node_id')} (expected node-0)\")\n", "        else:\n", "            print(f\" ⚠️ Action is {obj.get('action_type')} — should be SCALE_UP\")\n", "    else:\n", "        print(f\" ❌ Invalid: {err}\")" ] }, { "cell_type": "markdown", "id": "997fa23f", "metadata": {}, "source": [ "### Test 1c — REROUTE_TRAFFIC (different action type + 0-1 parameter)" ] }, { "cell_type": "code", "execution_count": null, "id": "cce2df01", "metadata": {}, "outputs": [], "source": [ "print(\"=\" * 60)\n", "print(\"TEST 1c — REROUTE_TRAFFIC\")\n", "print(\"=\" * 60)\n", "\n", "prompt = \"\"\"You are an SRE agent. Respond with ONLY a JSON action.\n", "\n", "Actions: NO_OP, SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD\n", "Nodes: node-0, node-1, node-2, node-3, node-4\n", "Parameter: float 0.0-10.0\n", "\n", "OBSERVATION:\n", " node-0 (VIP): queue_depth=45, status=HEALTHY, capacity=3.0\n", " node-1: queue_depth=210, status=FAILED, capacity=0.0\n", " node-2: queue_depth=8, status=HEALTHY, capacity=3.0\n", " node-3: queue_depth=0, status=HEALTHY, capacity=3.0\n", " node-4: queue_depth=30, status=HEALTHY, capacity=3.0\n", "\n", "node-1 has FAILED. Traffic must be rerouted AWAY from it to healthy nodes.\n", "Use REROUTE_TRAFFIC with a fraction parameter (e.g. 
0.8 means reroute 80%).\n", "Respond with the JSON:\"\"\"\n", "\n", "for trial in range(5):\n", "    raw = generate(prompt, max_tokens=80, temperature=0.0)\n", "    print(f\" Trial {trial+1} raw: {raw.strip()[:120]}\")\n", "    obj = extract_json(raw)\n", "    if obj is None:\n", "        print(f\" ❌ Could not extract JSON\")\n", "        continue\n", "    ok, err = validate_action(obj)\n", "    if ok:\n", "        print(f\" ✅ Valid: {obj}\")\n", "        if obj.get(\"action_type\") == \"REROUTE_TRAFFIC\":\n", "            print(f\" ✅ Correct action type\")\n", "        else:\n", "            print(f\" ⚠️ Action is {obj.get('action_type')} — should be REROUTE_TRAFFIC\")\n", "    else:\n", "        print(f\" ❌ Invalid: {err}\")" ] }, { "cell_type": "markdown", "id": "3ef0ae70", "metadata": {}, "source": [ "### Test 1d — All 5 action types in separate calls" ] }, { "cell_type": "code", "execution_count": null, "id": "5ae14f31", "metadata": {}, "outputs": [], "source": [ "print(\"=\" * 60)\n", "print(\"TEST 1d — All 5 Action Types\")\n", "print(\"=\" * 60)\n", "\n", "scenarios = [\n", "    (\"NO_OP\", \"All queues are below 10. Cluster is healthy. Do nothing.\"),\n", "    (\"SCALE_UP\", \"node-0 queue=195 (near fatal). Scale it up now.\"),\n", "    (\"SCALE_DOWN\", \"node-4 has capacity=5.0 and queue=2. It is over-provisioned. Scale it down.\"),\n", "    (\"REROUTE_TRAFFIC\", \"node-2 is FAILED. Reroute traffic away from it to healthy peers.\"),\n", "    (\"SHED_LOAD\", \"node-3 queue=175. It is NOT critical. Drop some of its incoming traffic.\"),\n", "]\n", "\n", "results_1d = {}\n", "for expected_action, scenario in scenarios:\n", "    prompt = f\"\"\"You are an SRE agent. 
Respond with ONLY a JSON action.\n", "\n", "Actions: NO_OP, SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD\n", "Nodes: node-0, node-1, node-2, node-3, node-4\n", "Parameter: float 0.0-10.0\n", "\n", "SCENARIO: {scenario}\n", "Respond with the JSON:\"\"\"\n", "    raw = generate(prompt, max_tokens=80, temperature=0.0)\n", "    obj = extract_json(raw)\n", "    if obj is None:\n", "        results_1d[expected_action] = (\"NO_JSON\", raw[:80])\n", "    else:\n", "        ok, err = validate_action(obj)\n", "        if ok:\n", "            match = \"MATCH\" if obj[\"action_type\"] == expected_action else f\"MISMATCH→{obj['action_type']}\"\n", "            results_1d[expected_action] = (match, obj)\n", "        else:\n", "            results_1d[expected_action] = (\"INVALID\", err)\n", "\n", "for action, (status, detail) in results_1d.items():\n", "    icon = \"✅\" if status == \"MATCH\" else \"❌\"\n", "    print(f\" {icon} {action:20s} → {status:12s} {detail}\")" ] }, { "cell_type": "markdown", "id": "60e355b5", "metadata": {}, "source": [ "### Test 1e — Adversarial: prompts that should still produce valid JSON" ] }, { "cell_type": "code", "execution_count": null, "id": "98647b0a", "metadata": {}, "outputs": [], "source": [ "print(\"=\" * 60)\n", "print(\"TEST 1e — Adversarial Prompts (should still be valid JSON)\")\n", "print(\"=\" * 60)\n", "\n", "adversarial_prompts = [\n", "    (\"fake_field\", \"Respond with JSON. Also include a 'reasoning' field (not in schema).\"),\n", "    (\"out_of_range\", \"node-4 queue=5000. Scale it to parameter=999.0.\"),\n", "    (\"unknown_node\", \"Reroute traffic from node-99.\"),\n", "    (\"empty_obs\", \"No data. All sensors offline. Just respond.\"),\n", "    (\"contradiction\", \"Queue is zero but node is FAILED. What do you do?\"),\n", "]\n", "\n", "for label, scenario in adversarial_prompts:\n", "    prompt = f\"\"\"You are an SRE agent. 
Respond with ONLY a JSON action.\n", "\n", "Actions: NO_OP, SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD\n", "Nodes: node-0, node-1, node-2, node-3, node-4\n", "Parameter: float 0.0-10.0\n", "\n", "SCENARIO ({label}): {scenario}\n", "Respond with the JSON:\"\"\"\n", "    raw = generate(prompt, max_tokens=120, temperature=0.0)\n", "    obj = extract_json(raw)\n", "    if obj is None:\n", "        print(f\" {label:20s} → ❌ NO JSON extracted: {raw.strip()[:80]}\")\n", "    else:\n", "        ok, err = validate_action(obj)\n", "        if ok:\n", "            print(f\" {label:20s} → ✅ Valid JSON produced: {obj}\")\n", "        else:\n", "            print(f\" {label:20s} → ⚠️ JSON extracted but invalid: {err} | {obj}\")" ] }, { "cell_type": "markdown", "id": "de2f4fa6", "metadata": {}, "source": [ "### Test 1 — Summary" ] }, { "cell_type": "code", "execution_count": null, "id": "f75f567c", "metadata": {}, "outputs": [], "source": [ "# Aggregate results from all Test 1 cells above.\n", "# Count how many of the previous cells produced correct JSON.\n", "# (This is a manual summary — review the outputs above.)\n", "\n", "print(\"Test 1 — JSON Format Capability Summary\")\n", "print(\"=\" * 50)\n", "print(\"1a: NO_OP format — should produce valid, correct JSON\")\n", "print(\"1b: SCALE_UP specific — should target node-0, valid param\")\n", "print(\"1c: REROUTE_TRAFFIC — should use REROUTE action type\")\n", "print(\"1d: All 5 action types — should match each expected action\")\n", "print(\"1e: Adversarial robustness — should recover to valid JSON\")\n", "print()\n", "print(\"If >80% of trials produce valid, schema-correct JSON,\")\n", "print(\"the model is JSON-capable for SFT format instruction.\")\n", "print(\"If <50%, add a system prompt with the exact schema + example\")\n", "print(\"and re-run before starting SFT.\")" ] }, { "cell_type": "markdown", "id": "b3d08ada", "metadata": {}, "source": [ "---\n", "## TEST 2 — Zero-Shot SRE Judgment (No Training)\n", "\n", "Can the model REASON about SRE actions from raw cluster 
telemetry?\n", "We test across all 3 tasks + edge cases, increasing in complexity.\n", "The model sees a natural-language description of the cluster state\n", "and must pick the right action type, target, and reasoning." ] }, { "cell_type": "markdown", "id": "3572ce7f", "metadata": {}, "source": [ "### Helpers — Format cluster state as natural language" ] }, { "cell_type": "code", "execution_count": null, "id": "4c79c47d", "metadata": {}, "outputs": [], "source": [ "def format_state(nodes: list[dict]) -> str:\n", "    \"\"\"Convert node states to a readable observation block.\"\"\"\n", "    lines = []\n", "    for n in nodes:\n", "        vip = \" (VIP)\" if n.get(\"is_vip\") else \"\"\n", "        lines.append(\n", "            f\" {n['node_id']}{vip}: queue={n['queue']}, status={n['status']}, \"\n", "            f\"capacity={n['capacity']}, incoming={n.get('incoming', 0)}\"\n", "        )\n", "    return \"\\n\".join(lines)\n", "\n", "def build_sre_prompt(scenario_desc: str, nodes: list[dict], hint: str = \"\") -> str:\n", "    \"\"\"Build a zero-shot SRE prompt with cluster state.\"\"\"\n", "    state_block = format_state(nodes)\n", "    prompt = f\"\"\"You are an SRE (Site Reliability Engineer) managing a 5-node microservice cluster.\n", "\n", "The cluster has a DAG topology:\n", " - node-0 (VIP payment gateway) feeds traffic to node-1 and node-2\n", " - node-2 (catalog) feeds traffic to node-3 (inventory)\n", " - node-4 (auth service) is independent\n", "\n", "Available actions:\n", " - NO_OP: Do nothing\n", " - SCALE_UP: Increase capacity on a node (parameter = amount, 0-1)\n", " - SCALE_DOWN: Decrease capacity on a node (parameter = amount, 0-1)\n", " - REROUTE_TRAFFIC: Move traffic AWAY from a node to healthy peers (parameter = fraction to move, 0-1)\n", " - SHED_LOAD: Drop incoming traffic to a node for 1 tick (parameter = fraction to drop, 0-1)\n", " CRITICAL nodes (node-0, node-1, node-2) CANNOT be shed.\n", "\n", "Queue depth > 80 = DEGRADED. 
Queue depth > 200 = FATAL FAILURE.\n", "A FAILED node processes 0 requests — its children will starve.\n", "\n", "CLUSTER STATE:\n", "{state_block}\n", "\n", "SCENARIO: {scenario_desc}\n", "{hint}\n", "You MUST respond with:\n", "1. One sentence explaining your reasoning.\n", "2. A JSON action: {{\"action_type\": \"...\", \"target_node_id\": \"...\", \"parameter\": X.X}}\n", "\"\"\"\n", "    return prompt\n", "\n", "def judge_response(raw: str, expected_action: str, expected_target: Optional[str] = None) -> dict:\n", "    \"\"\"Score a zero-shot response.\"\"\"\n", "    obj = extract_json(raw)\n", "    result = {\"raw\": raw[:150], \"json_ok\": obj is not None}\n", "    if obj is None:\n", "        result[\"verdict\"] = \"NO_JSON\"\n", "        return result\n", "    ok, err = validate_action(obj)\n", "    result[\"valid\"] = ok\n", "    result[\"action\"] = obj\n", "    if not ok:\n", "        result[\"verdict\"] = f\"INVALID: {err}\"\n", "        return result\n", "    # Check the action type first, then the target (only when one is expected)\n", "    action_match = obj[\"action_type\"] == expected_action\n", "    target_match = not expected_target or obj.get(\"target_node_id\") == expected_target\n", "    result[\"match\"] = action_match and target_match\n", "    if not action_match:\n", "        result[\"verdict\"] = f\"WRONG_ACTION (got {obj['action_type']}, expected {expected_action})\"\n", "    elif not target_match:\n", "        result[\"verdict\"] = f\"WRONG_TARGET (got {obj.get('target_node_id')}, expected {expected_target})\"\n", "    else:\n", "        result[\"verdict\"] = \"CORRECT\"\n", "    return result\n", "\n", "print(\"Helpers ready.\")" ] }, { "cell_type": "markdown", "id": "b0fb56d0", "metadata": {}, "source": [ "### Test 2a — Simple Overload (Task-1 style)" ] }, { "cell_type": "code", "execution_count": null, "id": "c52fd417", "metadata": {}, "outputs": [], "source": [ "print(\"=\" * 60)\n", "print(\"TEST 2a — Simple Overload (node-0 near fatal)\")\n", "print(\"=\" * 60)\n", "\n", "nodes = [\n", "    {\"node_id\": \"node-0\", \"is_vip\": True, \"queue\": 185, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 44},\n", "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 22, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 22},\n", "    {\"node_id\": \"node-2\", \"is_vip\": 
False, \"queue\": 18, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 22},\n", "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 5, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 18},\n", "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 8, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 43},\n", "]\n", "\n", "prompt = build_sre_prompt(\n", "    \"node-0 is at queue=185 (threshold=200 for FATAL). The VIP payment gateway is the most important node.\",\n", "    nodes,\n", "    hint=\"Hint: The correct action is to SCALE_UP node-0 to increase its capacity and prevent failure.\"\n", ")\n", "\n", "raw = generate(prompt, max_tokens=200, temperature=0.1)\n", "print(f\"RAW: {raw[:300]}\")\n", "result = judge_response(raw, \"SCALE_UP\", \"node-0\")\n", "print(f\"\\n Verdict: {result['verdict']}\")\n", "print(f\" JSON ok: {result['json_ok']}, Valid: {result.get('valid','N/A')}\")\n", "if result.get(\"action\"):\n", "    print(f\" Action: {result['action']}\")" ] }, { "cell_type": "markdown", "id": "7d1190a2", "metadata": {}, "source": [ "### Test 2b — Node Failure + Starvation (Task-2 style)" ] }, { "cell_type": "code", "execution_count": null, "id": "963768d0", "metadata": {}, "outputs": [], "source": [ "print(\"=\" * 60)\n", "print(\"TEST 2b — Node-2 FAILED, node-3 Starved (Task-2 DAG)\")\n", "print(\"=\" * 60)\n", "\n", "nodes = [\n", "    {\"node_id\": \"node-0\", \"is_vip\": True, \"queue\": 35, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 47},\n", "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 28, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 23},\n", "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 0, \"status\": \"FAILED\", \"capacity\": 0.0, \"incoming\": 23},\n", "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 0, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 0},\n", "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 12, \"status\": \"HEALTHY\", \"capacity\": 
3.0, \"incoming\": 47},\n", "]\n", "\n", "prompt = build_sre_prompt(\n", "    \"node-2 (catalog) is FAILED and processes 0 requests. Its child node-3 (inventory) \"\n", "    \"is receiving 0 traffic because the DAG cannot forward through a dead parent. \"\n", "    \"node-0 is still sending 50% of its outflow to the dead node-2 — that traffic is wasted.\",\n", "    nodes,\n", "    hint=\"Hint: Reroute traffic AWAY from node-2 so node-0 sends more to node-1 instead.\"\n", ")\n", "\n", "raw = generate(prompt, max_tokens=200, temperature=0.1)\n", "print(f\"RAW: {raw[:300]}\")\n", "result = judge_response(raw, \"REROUTE_TRAFFIC\")\n", "print(f\"\\n Verdict: {result['verdict']}\")\n", "print(f\" JSON ok: {result['json_ok']}, Valid: {result.get('valid','N/A')}\")\n", "if result.get(\"action\"):\n", "    print(f\" Action: {result['action']}\")" ] }, { "cell_type": "markdown", "id": "18794fc6", "metadata": {}, "source": [ "### Test 2c — Surge on Critical Nodes (Task-3 style)" ] }, { "cell_type": "code", "execution_count": null, "id": "6def3e2b", "metadata": {}, "outputs": [], "source": [ "print(\"=\" * 60)\n", "print(\"TEST 2c — Surge on Critical Nodes (Task-3)\")\n", "print(\"=\" * 60)\n", "\n", "nodes = [\n", "    {\"node_id\": \"node-0\", \"is_vip\": True, \"queue\": 10, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 30},\n", "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 165, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 155},\n", "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 170, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 155},\n", "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 80, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 45},\n", "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 5, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 30},\n", "]\n", "\n", "prompt = build_sre_prompt(\n", "    \"A surge of +140 req/tick has hit node-1 and node-2 directly (bypassing the normal DAG ingress). 
\"\n", "    \"node-1 and node-2 are CRITICAL — you CANNOT shed their load. Their queues are approaching FATAL. \"\n", "    \"node-2's child node-3 is also starting to back up from the overflow. \"\n", "    \"The correct response is to SCALE_UP the bottleneck nodes.\",\n", "    nodes,\n", "    hint=\"Hint: SCALE_UP node-1 and/or node-2. Also consider scaling node-3 to absorb downstream overflow.\"\n", ")\n", "\n", "raw = generate(prompt, max_tokens=250, temperature=0.1)\n", "print(f\"RAW: {raw[:350]}\")\n", "result = judge_response(raw, \"SCALE_UP\") # Accept any SCALE_UP on node-1, node-2, or node-3\n", "print(f\"\\n Verdict: {result['verdict']}\")\n", "print(f\" JSON ok: {result['json_ok']}, Valid: {result.get('valid','N/A')}\")\n", "if result.get(\"action\"):\n", "    a = result[\"action\"]\n", "    print(f\" Action: {a}\")\n", "    # Extra check: was it a sensible target?\n", "    if a.get(\"target_node_id\") in [\"node-1\", \"node-2\", \"node-3\"]:\n", "        print(f\" ✅ Target is sensible (node-1/2/3 are the bottleneck)\")\n", "    else:\n", "        print(f\" ⚠️ Target is {a.get('target_node_id')}, which is not a bottleneck node here (expected node-1/2/3)\")" ] }, { "cell_type": "markdown", "id": "30b7b6c8", "metadata": {}, "source": [ "### Test 2d — DAG Bottleneck: Ingress Full, Downstream Idle" ] }, { "cell_type": "code", "execution_count": null, "id": "f7aba9e1", "metadata": {}, "outputs": [], "source": [ "print(\"=\" * 60)\n", "print(\"TEST 2d — DAG Bottleneck (ingress full, children idle)\")\n", "print(\"=\" * 60)\n", "\n", "nodes = [\n", "    {\"node_id\": \"node-0\", \"is_vip\": True, \"queue\": 190, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 47},\n", "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 3, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 22},\n", "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 4, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 22},\n", "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 2, \"status\": \"HEALTHY\", \"capacity\": 
3.0, \"incoming\": 22},\n", "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 6, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 43},\n", "]\n", "\n", "prompt = build_sre_prompt(\n", "    \"node-0 is the bottleneck — it receives 47 req/tick but can only process 45 with capacity 3.0. \"\n", "    \"Its queue is at 190 (near fatal). All downstream nodes are nearly idle because node-0 \"\n", "    \"cannot forward enough traffic. The right move is to SCALE_UP the ingress bottleneck (node-0). \"\n", "    \"Scaling downstream nodes would NOT help — they are already underutilized.\",\n", "    nodes,\n", "    hint=\"Hint: This is a pure ingress bottleneck. Scale the ingress, not the downstream.\"\n", ")\n", "\n", "raw = generate(prompt, max_tokens=200, temperature=0.1)\n", "print(f\"RAW: {raw[:300]}\")\n", "result = judge_response(raw, \"SCALE_UP\", \"node-0\")\n", "print(f\"\\n Verdict: {result['verdict']}\")\n", "print(f\" JSON ok: {result['json_ok']}, Valid: {result.get('valid','N/A')}\")\n", "if result.get(\"action\"):\n", "    a = result[\"action\"]\n", "    print(f\" Action: {a}\")\n", "    if a.get(\"target_node_id\") != \"node-0\" and a.get(\"action_type\") == \"SCALE_UP\":\n", "        print(f\" ⚠️ Scaling {a.get('target_node_id')} is wrong — downstream is idle, ingress is the bottleneck\")" ] }, { "cell_type": "markdown", "id": "1038626f", "metadata": {}, "source": [ "### Test 2e — Multi-Node Crisis (competing pressures)" ] }, { "cell_type": "code", "execution_count": null, "id": "69d13daa", "metadata": {}, "outputs": [], "source": [ "print(\"=\" * 60)\n", "print(\"TEST 2e — Multi-Node Crisis\")\n", "print(\"=\" * 60)\n", "\n", "nodes = [\n", "    {\"node_id\": \"node-0\", \"is_vip\": True, \"queue\": 95, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 48},\n", "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 210, \"status\": \"FAILED\", \"capacity\": 0.0, \"incoming\": 24},\n", "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 60, \"status\": \"HEALTHY\", 
\"capacity\": 3.0, \"incoming\": 24},\n", "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 5, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 12},\n", "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 130, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 47},\n", "]\n", "\n", "prompt = build_sre_prompt(\n", "    \"MULTIPLE CRISES simultaneously:\\n\"\n", "    \"1. node-1 is FAILED — node-0 is still sending 50% of its traffic there (wasted).\\n\"\n", "    \"2. node-0 is DEGRADED at queue=95 and trending up.\\n\"\n", "    \"3. node-4 (independent auth) is DEGRADED at queue=130.\\n\"\n", "    \"You can only take ONE action this tick. Which is most critical?\",\n", "    nodes,\n", "    hint=\"Hint: The failed node-1 is permanently dead (scripted failure). \"\n", "         \"Rerouting traffic AWAY from it would immediately relieve node-0 and stop the wasted traffic. \"\n", "         \"Alternatively, scaling node-0 or node-4 addresses their individual overload.\"\n", ")\n", "\n", "raw = generate(prompt, max_tokens=250, temperature=0.1)\n", "print(f\"RAW: {raw[:350]}\")\n", "result = judge_response(raw, \"REROUTE_TRAFFIC\") # Best answer: reroute from node-1\n", "print(f\"\\n Verdict: {result['verdict']}\")\n", "print(f\" JSON ok: {result['json_ok']}, Valid: {result.get('valid','N/A')}\")\n", "if result.get(\"action\"):\n", "    a = result[\"action\"]\n", "    print(f\" Action: {a}\")\n", "    # Accept REROUTE (best), SCALE_UP node-0 (good), SCALE_UP node-4 (ok).\n", "    # Reject: NO_OP, SCALE_DOWN, SHED_LOAD on critical.\n", "    if a.get(\"action_type\") == \"REROUTE_TRAFFIC\":\n", "        print(f\" ✅ Best answer — rerouting fixes the root cause (dead node wasting traffic)\")\n", "    elif a.get(\"action_type\") == \"SCALE_UP\" and a.get(\"target_node_id\") in [\"node-0\", \"node-4\"]:\n", "        print(f\" ⚠️ Acceptable but suboptimal — scaling treats symptom, reroute treats cause\")\n", "    elif a.get(\"action_type\") == \"NO_OP\":\n", "        print(f\" ❌ Doing nothing during multi-node crisis is wrong\")\n", "    elif 
a.get(\"action_type\") == \"SHED_LOAD\" and a.get(\"target_node_id\") in [\"node-0\", \"node-1\", \"node-2\"]:\n", "        print(f\" ❌ SHED_LOAD on critical node is BLOCKED\")\n", "    else:\n", "        print(f\" ⚠️ Unexpected action — evaluate manually\")" ] }, { "cell_type": "markdown", "id": "019cc3d1", "metadata": {}, "source": [ "### Test 2f — Calm Cluster (should NO_OP, not act)" ] }, { "cell_type": "code", "execution_count": null, "id": "bf825879", "metadata": {}, "outputs": [], "source": [ "print(\"=\" * 60)\n", "print(\"TEST 2f — Calm Cluster (should NO_OP)\")\n", "print(\"=\" * 60)\n", "\n", "nodes = [\n", "    {\"node_id\": \"node-0\", \"is_vip\": True, \"queue\": 5, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 30},\n", "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 3, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 15},\n", "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 4, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 15},\n", "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 2, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 15},\n", "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 6, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 30},\n", "]\n", "\n", "prompt = build_sre_prompt(\n", "    \"The cluster is completely healthy. All queues are under 10, all nodes are HEALTHY, \"\n", "    \"all capacity is at baseline (3.0). There is no reason to intervene.\",\n", "    nodes,\n", "    hint=\"Hint: The correct action is NO_OP. 
Do not scale or reroute a healthy cluster.\"\n", ")\n", "\n", "raw = generate(prompt, max_tokens=200, temperature=0.1)\n", "print(f\"RAW: {raw[:250]}\")\n", "result = judge_response(raw, \"NO_OP\")\n", "print(f\"\\n Verdict: {result['verdict']}\")\n", "print(f\" JSON ok: {result['json_ok']}, Valid: {result.get('valid','N/A')}\")\n", "if result.get(\"action\"):\n", "    a = result[\"action\"]\n", "    print(f\" Action: {a}\")\n", "    if a.get(\"action_type\") != \"NO_OP\":\n", "        print(f\" ⚠️ Taking action on a healthy cluster wastes resources\")" ] }, { "cell_type": "markdown", "id": "587ea8c7", "metadata": {}, "source": [ "### Test 2g — Backpressure Chain (advanced: child overload throttles parent)" ] }, { "cell_type": "code", "execution_count": null, "id": "e47394b1", "metadata": {}, "outputs": [], "source": [ "print(\"=\" * 60)\n", "print(\"TEST 2g — Backpressure Chain (child overload → parent throttled)\")\n", "print(\"=\" * 60)\n", "\n", "nodes = [\n", "    {\"node_id\": \"node-0\", \"is_vip\": True, \"queue\": 35, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 44},\n", "    {\"node_id\": \"node-1\", \"is_vip\": False, \"queue\": 8, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 22},\n", "    {\"node_id\": \"node-2\", \"is_vip\": False, \"queue\": 190, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 22},\n", "    {\"node_id\": \"node-3\", \"is_vip\": False, \"queue\": 175, \"status\": \"DEGRADED\",\"capacity\": 3.0, \"incoming\": 22},\n", "    {\"node_id\": \"node-4\", \"is_vip\": False, \"queue\": 10, \"status\": \"HEALTHY\", \"capacity\": 3.0, \"incoming\": 43},\n", "]\n", "\n", "prompt = build_sre_prompt(\n", "    \"node-2 (queue=190) and its child node-3 (queue=175) are both near FATAL. \"\n", "    \"Because node-2 and node-3 are overloaded, BACKPRESSURE will throttle node-0's effective \"\n", "    \"service rate, causing node-0's queue to rise even though it seems healthy now. \"\n", "    \"The root cause is downstream (node-2/3), not the ingress. 
\"\n", "    \"SCALE_UP the downstream bottleneck to break the backpressure chain.\",\n", "    nodes,\n", "    hint=\"Hint: Scaling node-0 would NOT help — the bottleneck is at node-2/3. Scale them instead.\"\n", ")\n", "\n", "raw = generate(prompt, max_tokens=250, temperature=0.1)\n", "print(f\"RAW: {raw[:300]}\")\n", "result = judge_response(raw, \"SCALE_UP\") # Should scale node-2 or node-3\n", "print(f\"\\n Verdict: {result['verdict']}\")\n", "print(f\" JSON ok: {result['json_ok']}, Valid: {result.get('valid','N/A')}\")\n", "if result.get(\"action\"):\n", "    a = result[\"action\"]\n", "    print(f\" Action: {a}\")\n", "    if a.get(\"action_type\") == \"SCALE_UP\" and a.get(\"target_node_id\") in [\"node-2\", \"node-3\"]:\n", "        print(f\" ✅ Correct — scaling the downstream bottleneck breaks backpressure\")\n", "    elif a.get(\"action_type\") == \"SCALE_UP\" and a.get(\"target_node_id\") == \"node-0\":\n", "        print(f\" ❌ Wrong — node-0 is not the bottleneck, backpressure is caused by node-2/3\")\n", "    elif a.get(\"action_type\") == \"REROUTE_TRAFFIC\":\n", "        print(f\" ⚠️ Rerouting from node-2 would starve node-3. 
Scaling is better here.\")\n", "    else:\n", "        print(f\" ⚠️ Evaluate manually — is this action addressing the downstream bottleneck?\")" ] }, { "cell_type": "markdown", "id": "85758d5e", "metadata": {}, "source": [ "### Test 2 — Summary" ] }, { "cell_type": "code", "execution_count": null, "id": "cec2e024", "metadata": {}, "outputs": [], "source": [ "print(\"Test 2 — Zero-Shot SRE Judgment Summary\")\n", "print(\"=\" * 55)\n", "print()\n", "print(\"2a: Simple overload — Should SCALE_UP node-0\")\n", "print(\"2b: Node failure + starvation — Should REROUTE_TRAFFIC from dead node\")\n", "print(\"2c: Surge on critical nodes — Should SCALE_UP node-1/2/3\")\n", "print(\"2d: DAG ingress bottleneck — Should SCALE_UP node-0 (not downstream)\")\n", "print(\"2e: Multi-node crisis — Should REROUTE (root cause) or SCALE_UP key node\")\n", "print(\"2f: Calm cluster — Should NO_OP\")\n", "print(\"2g: Backpressure chain — Should SCALE_UP downstream (not ingress)\")\n", "print()\n", "print(\"Scoring guide:\")\n", "print(\" 5-7 correct actions + valid JSON = Model has zero-shot SRE intuition\")\n", "print(\" 3-4 correct = Model has partial understanding, needs SFT\")\n", "print(\" 0-2 correct = Model lacks SRE concepts — SFT must teach from scratch\")\n", "print()\n", "print(\"Key insight: Even a 'wrong' action with valid JSON and sensible\")\n", "print(\"reasoning is better than a broken response. SFT can fix judgment;\")\n", "print(\"it cannot fix format if the model can't produce valid JSON.\")" ] }, { "cell_type": "markdown", "id": "c81dbf14", "metadata": {}, "source": [ "---\n", "## Final Verdict — Is This Model Ready for SFT?\n", "\n", "After running both tests, check:\n", "\n", "| Test | Pass Condition | Ready? 
|\n", "|---|---|---|\n", "| Test 1 — JSON | ≥80% of trials produce valid, schema-correct JSON | ☐ |\n", "| Test 2a — Simple overload | SCALE_UP node-0 | ☐ |\n", "| Test 2b — Node failure | REROUTE_TRAFFIC (not NO_OP or SCALE_UP) | ☐ |\n", "| Test 2c — Surge | SCALE_UP (not SHED_LOAD on critical) | ☐ |\n", "| Test 2d — DAG bottleneck | SCALE_UP node-0 (not node-1/2/3) | ☐ |\n", "| Test 2e — Multi-crisis | Any non-NO_OP, preferably REROUTE | ☐ |\n", "| Test 2f — Calm | NO_OP | ☐ |\n", "| Test 2g — Backpressure | SCALE_UP downstream (not node-0) | ☐ |\n", "\n", "**If Test 1 fails (can't write JSON):** Add a system prompt with exact schema + few-shot examples before SFT.\n", "\n", "**If Test 2 fails (<4 correct):** The model lacks SRE physics intuition. SFT will need:\n", "- More diverse dataset (all action types, all tasks)\n", "- Explicit reasoning chains in the training examples\n", "- Possibly a stronger base model for zero-shot transfer\n", "\n", "**If both pass:** Proceed directly to SFT. The model has format + basic physics." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 5 }