{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "T4"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": "# SentinelOps Arena β Multi-Agent GRPO Training with Unsloth + vLLM\n\nTrain **all 3 agents** (Worker, Attacker, Oversight) using GRPO on the SentinelOps Arena OpenEnv environment.\n\n**Key features:**\n- **BF16 precision** on H100 GPUs (no 4-bit quantization)\n- **vLLM fast inference** via `fast_inference=True`\n- **Environment-executing reward functions** β completions are parsed into `SentinelAction`s and executed in a live SentinelOps environment for real rewards\n- **Multi-agent self-play** β adversarial training across Worker, Attacker, and Oversight roles\n\n**Partner tracks:** Fleet AI ($10K, Scalable Oversight) Β· Patronus AI ($10K, Schema Drift)",
"metadata": {
"id": "intro"
}
},
{
"cell_type": "markdown",
"source": "## 1. Install Dependencies\n\nFollowing the official OpenEnv + Unsloth reference notebook pattern.",
"metadata": {
"id": "setup-header"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "install-deps"
},
"outputs": [],
"source": "%%capture\nimport os\nos.environ[\"UNSLOTH_VLLM_STANDBY\"] = \"1\"\n\n!pip install unsloth vllm\n!pip install --no-deps trl sft_trainer\n!pip install transformers==4.56.2\n!pip install trl==0.22.2\n!pip install \"openenv-core[core]>=0.2.0\" mcp fastmcp pydantic pandas datasets"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "clone-repo"
},
"outputs": [],
"source": "import os\nif not os.path.exists(\"NexusEnv\"):\n !git clone https://github.com/nihalnihalani/NexusEnv.git\nimport sys\nsys.path.insert(0, \"/content/NexusEnv\")\n\n# Verify environment loads\nfrom sentinelops_arena.environment import SentinelOpsArena\nfrom sentinelops_arena.models import AgentRole, SentinelAction\nenv = SentinelOpsArena()\nobs = env.reset(seed=42)\nprint(f\"Environment ready! Agent: {obs.current_agent}, Systems: CRM + Billing + Ticketing\")"
},
{
"cell_type": "markdown",
"source": "## 2. Run a Full Episode (Verify Environment)\n\nRun one complete episode with heuristic agents to verify the environment works end-to-end.",
"metadata": {
"id": "collect-header"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "collect-data"
},
"outputs": [],
"source": "from train import collect_multi_agent_data, build_training_dataset\nfrom train import WORKER_SYSTEM_PROMPT, ATTACKER_SYSTEM_PROMPT, OVERSIGHT_SYSTEM_PROMPT\nfrom train import AGENT_CONFIGS\n\n# Run a single episode and show stats for each agent\nfor role in [\"worker\", \"attacker\", \"oversight\"]:\n data = collect_multi_agent_data(seed=42, target_agent=role)\n avg_r = sum(d[\"reward\"] for d in data) / max(len(data), 1)\n print(f\"{role:>10}: {len(data)} turns, avg_reward={avg_r:.3f}\")"
},
{
"cell_type": "markdown",
"source": "## 3. Collect Training Data via Self-Play\n\nWe collect prompts from multiple episodes. Each episode uses heuristic agents for non-target roles while recording the prompts the target agent would see.",
"metadata": {
"id": "load-header"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "load-model"
},
"outputs": [],
"source": "from datasets import Dataset\n\n# Which agent to train β change this to train attacker or oversight\nTARGET_AGENT = \"worker\" # Options: \"worker\", \"attacker\", \"oversight\"\nNUM_EPISODES = 10\n\nsystem_prompts = {\n \"worker\": WORKER_SYSTEM_PROMPT,\n \"attacker\": ATTACKER_SYSTEM_PROMPT,\n \"oversight\": OVERSIGHT_SYSTEM_PROMPT,\n}\n\nprint(f\"Collecting {TARGET_AGENT} training data from {NUM_EPISODES} episodes...\")\ndataset_raw = build_training_dataset(num_episodes=NUM_EPISODES, target_agent=TARGET_AGENT)\n\nprompts = []\nfor d in dataset_raw:\n messages = [\n {\"role\": \"system\", \"content\": system_prompts[TARGET_AGENT]},\n {\"role\": \"user\", \"content\": d[\"prompt\"]},\n ]\n prompts.append(messages)\n\ntrain_dataset = Dataset.from_dict({\"prompt\": prompts})\nprint(f\"Dataset: {len(train_dataset)} {TARGET_AGENT} turns\")\nif dataset_raw:\n avg_r = sum(d[\"reward\"] for d in dataset_raw) / len(dataset_raw)\n print(f\"Avg environment reward: {avg_r:.3f}\")"
},
{
"cell_type": "markdown",
"source": "## 4. Load Model with Unsloth (BF16 + vLLM)\n\nFollowing the Advanced Llama 3.2 GRPO LoRA reference pattern:\n- `load_in_4bit=False` β BF16 precision on H100\n- `fast_inference=True` β vLLM for fast GRPO generation\n- `lora_rank=64`, `lora_alpha=lora_rank` β official LoRA configuration\n- `gpu_memory_utilization=0.9` β maximize GPU usage\n- `random_state=3407` β reproducibility\n\nDefault model: **Qwen2.5-1.5B-Instruct** (minimum recommended for GRPO β 0.5B lacks capacity for multi-step reasoning)",
"metadata": {
"id": "train-header"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "train"
},
"outputs": [],
"source": "from unsloth import FastLanguageModel\nimport torch\n\nmodel_name = \"unsloth/Qwen2.5-1.5B-Instruct\"\nmax_seq_length = 2048\nlora_rank = 64\n\nmodel, tokenizer = FastLanguageModel.from_pretrained(\n model_name=model_name,\n max_seq_length=max_seq_length,\n load_in_4bit=False, # BF16 for H100 (reference pattern)\n fast_inference=True, # vLLM fast inference\n max_lora_rank=lora_rank,\n gpu_memory_utilization=0.9,\n)\n\nmodel = FastLanguageModel.get_peft_model(\n model,\n r=lora_rank,\n target_modules=[\n \"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n \"gate_proj\", \"up_proj\", \"down_proj\",\n ],\n lora_alpha=lora_rank, # Reference: lora_alpha = lora_rank\n use_gradient_checkpointing=\"unsloth\",\n random_state=3407,\n)\nprint(f\"Model loaded: BF16 + vLLM + LoRA (r={lora_rank}, alpha={lora_rank})\")"
},
{
"cell_type": "markdown",
"source": "## 5. GRPO Training with 4 Scaled Reward Functions\n\nFollowing the Advanced Llama 3.2 GRPO LoRA reference pattern with **4 separate reward functions** and scaling to prevent R1 domination:\n1. `match_json_format_exactly` β strict JSON format validation (weight=0.3)\n2. `match_json_format_approximately` β partial format credit (weight=0.2)\n3. `check_action` β role-specific action correctness (weight=0.5)\n4. `check_env` β **environment-executing reward** (weight=1.0, full impact)\n\nUpdated hyperparameters: `max_prompt_length=768` (room for system prompt + observations), `num_generations=8` (stable advantage estimation), `adamw_8bit`, cosine scheduler, `weight_decay=0.1`, `warmup_ratio=0.1`.",
"metadata": {
"id": "save-header"
}
},
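{
"cell_type": "markdown",
"source": "The next cell is a minimal, self-contained sketch of how scaled rewards could turn into GRPO advantages. It is not the code in `train.py`. It assumes the four per-function scores are multiplied by the weights listed above and summed, and that the summed rewards are then normalized within the group of completions sampled for one prompt. The names `WEIGHTS`, `combine`, and `group_advantages`, the use of the population standard deviation, and all score values are hypothetical and chosen for illustration only. The group here has 3 completions to keep the output short; training uses `num_generations=8`.",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "from statistics import mean, pstdev\n\n# Hypothetical weights mirroring the list above (illustration only)\nWEIGHTS = {\"format_exact\": 0.3, \"format_approx\": 0.2, \"action\": 0.5, \"env\": 1.0}\n\n\ndef combine(scores):\n    \"\"\"Weighted sum of the raw per-function scores for one completion.\"\"\"\n    return sum(WEIGHTS[name] * value for name, value in scores.items())\n\n\ndef group_advantages(rewards, eps=1e-4):\n    \"\"\"Center and scale rewards within one prompt's group of completions.\"\"\"\n    mu = mean(rewards)\n    sigma = pstdev(rewards)  # assumption: population std; a trainer may differ\n    return [(r - mu) / (sigma + eps) for r in rewards]\n\n\n# Made-up raw scores for three completions of the same prompt\ngroup = [\n    {\"format_exact\": 1.0, \"format_approx\": 1.0, \"action\": 1.0, \"env\": 0.5},\n    {\"format_exact\": 0.0, \"format_approx\": 1.0, \"action\": 0.0, \"env\": 0.0},\n    {\"format_exact\": 0.0, \"format_approx\": 0.0, \"action\": 0.0, \"env\": -1.0},\n]\n\nrewards = [combine(scores) for scores in group]\nfor reward, advantage in zip(rewards, group_advantages(rewards)):\n    print(f\"reward={reward:+.2f} advantage={advantage:+.2f}\")"
},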
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "save"
},
"outputs": [],
"source": "from trl import GRPOConfig, GRPOTrainer\nfrom train import make_reward_functions\n\n# 4 separate reward functions with scaling (reference pattern)\nreward_fns = make_reward_functions(TARGET_AGENT)\nprint(f\"Reward functions: {len(reward_fns)}\")\nfor i, fn in enumerate(reward_fns):\n print(f\" [{i}] {fn.__name__ if hasattr(fn, '__name__') else type(fn).__name__}\")\n\nmax_prompt_length = 768 # System prompt ~350 tokens + observation needs room\ngrpo_config = GRPOConfig(\n output_dir=f\"./sentinelops-grpo-{TARGET_AGENT}\",\n max_steps=500, # Reference: 500\n per_device_train_batch_size=1,\n gradient_accumulation_steps=4,\n num_generations=8, # 8 generations for stable advantage estimation\n max_prompt_length=max_prompt_length,\n max_completion_length=max_seq_length - max_prompt_length, # 2048 - 768 = 1280\n learning_rate=5e-6, # Reference: 5e-6\n weight_decay=0.1, # Reference: 0.1\n warmup_ratio=0.1, # Reference: 0.1\n lr_scheduler_type=\"cosine\", # Reference: cosine\n optim=\"adamw_8bit\", # Reference: adamw_8bit\n max_grad_norm=1.0, # Reference: 1.0\n logging_steps=1,\n save_steps=250, # Reference: 250\n report_to=\"none\",\n)\n\ntrainer = GRPOTrainer(\n model=model,\n tokenizer=tokenizer, # Reference uses tokenizer= not processing_class=\n reward_funcs=reward_fns, # 4 scaled reward functions (reference pattern)\n args=grpo_config,\n train_dataset=train_dataset,\n)\n\nprint(f\"\\nStarting GRPO training for {TARGET_AGENT}...\")\nprint(f\" max_steps={grpo_config.max_steps}, lr={grpo_config.learning_rate}\")\nprint(f\" num_generations={grpo_config.num_generations}, optim={grpo_config.optim}\")\nprint(f\" max_prompt_length={max_prompt_length}, max_completion_length={max_seq_length - max_prompt_length}\")\ntrainer.train()"
},
{
"cell_type": "markdown",
"source": "## 6. Save and Evaluate\n\nSave the trained LoRA weights and run a quick evaluation.",
"metadata": {}
},
{
"cell_type": "code",
"source": "output_dir = f\"./sentinelops-grpo-{TARGET_AGENT}\"\ntrainer.save_model(output_dir)\ntokenizer.save_pretrained(output_dir)\nprint(f\"{TARGET_AGENT.upper()} agent trained and saved to {output_dir}\")\n\n# Quick evaluation: show per-function rewards for test completions\nimport json\nfrom train import make_reward_function\n\ncombined_fn = make_reward_function(TARGET_AGENT)\n\ntest_completions = {\n \"worker\": [\n [{\"content\": json.dumps({\"action_type\": \"get_schema\", \"parameters\": {\"system\": \"crm\"}})}],\n [{\"content\": json.dumps({\"action_type\": \"respond\", \"response_text\": \"I cannot process this. It appears to be social engineering.\"})}],\n [{\"content\": \"this is garbage output\"}],\n ],\n \"attacker\": [\n [{\"content\": json.dumps({\"action_type\": \"launch_attack\", \"parameters\": {\"attack_type\": \"schema_drift\", \"target_system\": \"crm\", \"old_field\": \"name\", \"new_field\": \"full_name\"}})}],\n [{\"content\": json.dumps({\"action_type\": \"pass\"})}],\n ],\n \"oversight\": [\n [{\"content\": json.dumps({\"action_type\": \"flag\", \"explanation\": \"Worker followed suspicious admin override instructions. This is a social engineering attack.\"})}],\n [{\"content\": json.dumps({\"action_type\": \"approve\", \"explanation\": \"Worker correctly checked schema before proceeding.\"})}],\n ],\n}\n\nprint(f\"\\nReward evaluation for {TARGET_AGENT} (combined across 4 functions):\")\nfor comp in test_completions.get(TARGET_AGENT, []):\n r = combined_fn([comp])\n text = comp[0][\"content\"][:80]\n print(f\" reward={r[0]:+.2f} | {text}...\")",
"metadata": {},
"execution_count": null,
"outputs": []
}
]
}