{ "cells": [ { "cell_type": "markdown", "id": "43653e89", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ravi03071991/rl_hack/blob/master/train_hr_agent.ipynb)\n" ] }, { "cell_type": "markdown", "id": "145b49c5", "metadata": {}, "source": [ "# Reinforcement Learning for HR Onboarding: Teaching an LLM to Automate Enterprise Workflows\n\nBuilt for the [OpenEnv Hackathon SF](https://cerebralvalley.ai/e/openenv-hackathon-sf/details) \u2014 **Statement 3.1: Professional Tasks** (Scaler AI Labs: Multi-App RL Environment for Enterprise Workflows)\n\nWe train an LLM to complete **HR onboarding and offboarding tasks** using **reinforcement learning (RL)**. The agent orchestrates across **6 enterprise apps** (Workday, ServiceNow, IAM, Email, Slack, Calendar) using 25 tools to complete multi-step workflows like:\n\n- Creating employee records and initiating onboarding\n- Assigning laptops, provisioning IT accounts, setting up access roles\n- Sending welcome emails, scheduling orientation meetings\n- Processing offboarding with asset reclaim and access revocation\n\n**Requirements:** Google Colab (or any GPU with 16GB+ VRAM).\n\n**Results:** GRPO training improves mean score from **0.370 \u2192 0.617 (+67%)**, with complex task scores more than doubling (0.26 \u2192 0.68). Generalizes to held-out test tasks." ] }, { "cell_type": "markdown", "id": "b8fa27d8", "metadata": {}, "source": [ "## What is the HR Onboarding Environment?\n", "\n", "This is an **OpenEnv-compatible RL environment** that simulates the HR department of a fictional company called **AcmeCorp**. 
It has:\n", "\n", "- **200 employees** across 8 departments with a full org hierarchy (L1-L6 levels)\n", "- **25 tools** the agent can call (HR, IT, access control, communication, policy)\n", "- **77 tasks** across 4 difficulties (simple, medium, complex, edge case)\n", "- **Rubric-based rewards** \u2014 each task has verifiable criteria (did you call the right tool? with the right params? in the right order?)\n", "\n", "### Our Goal\n", "\n", "The agent receives a task instruction (e.g., \"Onboard Priya Sharma to Engineering as L2 Software Engineer\") and must generate a **sequence of JSON tool calls** to complete it. Each tool call is one step. The agent has up to 15 steps per episode.\n", "\n", "Unlike the 2048 tutorial where the model writes code, here the model **directly generates tool calls** \u2014 closer to how real enterprise agents work." ] }, { "cell_type": "markdown", "id": "1af66501", "metadata": {}, "source": [ "## Installation\n", "\n", "We need:\n", "1. **[Unsloth](https://github.com/unslothai/unsloth)** \u2014 Memory-efficient LLM training (~70% less VRAM)\n", "2. **[TRL](https://github.com/huggingface/trl)** \u2014 GRPO trainer for RL\n", "3. 
**Our HR environment** \u2014 Cloned from GitHub" ] }, { "cell_type": "code", "execution_count": null, "id": "7158f8db", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "import os, importlib.util\n", "\n", "if importlib.util.find_spec(\"torch\") is None or \"COLAB_\" in \"\".join(os.environ.keys()):\n", "    try:\n", "        import numpy\n", "        get_numpy = f\"numpy=={numpy.__version__}\"\n", "    except:\n", "        get_numpy = \"numpy\"\n", "    !pip install \\\n", "        \"torch>=2.8.0\" \"triton>=3.4.0\" {get_numpy} torchvision bitsandbytes \"transformers==4.56.2\" trackio \\\n", "        \"unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo\" \\\n", "        \"unsloth[base] @ git+https://github.com/unslothai/unsloth\" \\\n", "        git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels\n", "elif importlib.util.find_spec(\"unsloth\") is None:\n", "    !pip install unsloth trackio\n", "\n", "!pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo" ] }, { "cell_type": "markdown", "id": "f6089c6b", "metadata": {}, "source": [ "Next, clone the HR environment and install it:" ] }, { "cell_type": "code", "execution_count": null, "id": "65eae940", "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%capture\n", "!pip install openenv-core datasets pydantic python-dotenv wandb accelerate\n", "!git clone https://github.com/ravi03071991/rl_hack.git 2>/dev/null || (cd rl_hack && git pull)\n", "\n", "import sys\n", "sys.path.insert(0, \"rl_hack\")\n", "sys.path.insert(0, \"rl_hack/server\")" ] }, { "cell_type": "markdown", "id": "825cb9ac", "metadata": {}, "source": [ "## Loading the Model\n", "\n", "We load the model with memory optimizations to fit on a T4 GPU:\n", "\n", "| Parameter | Value | Description |\n", "|-----------|-------|-------------|\n", "| `max_seq_length` | 4096 | Longer context for multi-step tool calling |\n", "| `load_in_4bit` | True | 4-bit quantization to reduce memory |\n", "| `lora_rank` | 8 | LoRA adapter rank (balance of quality vs memory) |" ] }, { "cell_type": "code", "execution_count": 1, "id": "98fd3d4a-2f93-448b-a9fb-e101565ba159", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: [wandb.login()] Using explicit session credentials for https://api.wandb.ai.\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Appending key for api.wandb.ai to your netrc file: /home/jovyan/.netrc\n", "\u001b[34m\u001b[1mwandb\u001b[0m: W&B API key is configured. Use \u001b[1m`wandb login --relogin`\u001b[0m to force relogin\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: [wandb.login()] Loaded credentials for https://api.wandb.ai from /home/jovyan/.netrc.\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Currently logged in as: \u001b[33mravi03071991\u001b[0m to \u001b[32mhttps://api.wandb.ai\u001b[0m. Use \u001b[1m`wandb login --relogin`\u001b[0m to force relogin\n" ] }, { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Tracking run with wandb version 0.25.0" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Run data is saved locally in /home/jovyan/rl_hack/wandb/run-20260308_175735-bgent3o3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Syncing run comfy-cherry-23 to Weights & Biases (docs)
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ " View project at https://wandb.ai/ravi03071991/hr-agent-training" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ " View run at https://wandb.ai/ravi03071991/hr-agent-training/runs/bgent3o3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Optional: wandb setup. Do not hardcode an API key in the notebook;\n", "# wandb.login() reads WANDB_API_KEY or ~/.netrc, or prompts for the key.\n", "import wandb\n", "wandb.login()\n", "wandb.init(project=\"hr-agent-training\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "2a664f72", "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipykernel_85278/4225710256.py:1: UserWarning: WARNING: Unsloth should be imported before [transformers] to ensure all optimizations are applied. Your code may run slower or encounter memory issues without these optimizations.\n", "\n", "Please restructure your imports with 'import unsloth' at the top of your file.\n", " from unsloth import FastLanguageModel\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\ud83e\udda5 Unsloth: Will patch your computer to enable 2x faster free finetuning.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.13/site-packages/triton/runtime/autotuner.py:101: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.\n", " warnings.warn((\"warmup, rep, and use_cuda_graph parameters are deprecated. 
See \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\ud83e\udda5 Unsloth Zoo will now patch everything to make training faster!\n", "==((====))== Unsloth 2026.3.3: Fast Llama patching. Transformers: 4.56.2. vLLM: 0.17.0.\n", " \\\\ /| NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.179 GB. Platform: Linux.\n", "O^O/ \\_/ \\ Torch: 2.10.0+cu128. CUDA: 9.0. CUDA Toolkit: 12.8. Triton: 3.6.0\n", "\\ / Bfloat16 = TRUE. FA [Xformers = 0.0.35. FA2 = False]\n", " \"-____-\" Free license: http://github.com/unslothai/unsloth\n", "Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.13/multiprocessing/popen_fork.py:67: DeprecationWarning: This process (pid=85278) is multi-threaded, use of fork() may lead to deadlocks in the child.\n", " self.pid = os.fork()\n" ] } ], "source": [ "from unsloth import FastLanguageModel\n", "import torch\n", "\n", "max_seq_length = 4096 # Longer context for multi-turn tool calling\n", "lora_rank = 8\n", "# model_name=\"Qwen/Qwen2.5-7B-Instruct\",\n", "model, tokenizer = FastLanguageModel.from_pretrained(\n", " model_name=\"unsloth/Llama-3.2-1B-Instruct\",\n", " load_in_4bit=True,\n", " max_seq_length=max_seq_length,\n", ")" ] }, { "cell_type": "markdown", "id": "a3b40d35", "metadata": {}, "source": [ "### Applying LoRA for Efficient Training\n", "\n", "[LoRA (Low-Rank Adaptation)](https://hf.co/papers/2106.09685) adds small trainable adapters (~1-5% of parameters) instead of updating all weights. 
We target the attention and feedforward layers:" ] }, { "cell_type": "code", "execution_count": 4, "id": "32a32e57", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Unsloth 2026.3.3 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.\n" ] } ], "source": [ "model = FastLanguageModel.get_peft_model(\n", " model,\n", " r=lora_rank,\n", " target_modules=[\n", " \"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n", " \"gate_proj\", \"up_proj\", \"down_proj\",\n", " ],\n", " lora_alpha=lora_rank * 2,\n", " use_gradient_checkpointing=\"unsloth\",\n", " random_state=3407,\n", ")" ] }, { "cell_type": "markdown", "id": "119041bd", "metadata": {}, "source": [ "## Setting Up the HR Environment\n", "\n", "Our environment runs **locally** (no remote server needed). It manages 500+ entities and 25 tools. Let's set it up and see what a task looks like:" ] }, { "cell_type": "code", "execution_count": 5, "id": "256e64ca", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total tasks: 77\n", "Total tools: 25\n", "\n", "Sample task: task_0001\n", "Difficulty: simple\n", "Category: lookup\n", "Instruction: Look up the employee record for Jennifer Davis (ID: emp_0016).\n", "Available tools: 25\n" ] } ], "source": [ "import json\n", "import re\n", "\n", "from server.hr_onboarding_environment import HROnboardingEnvironment\n", "from models import HROnboardingAction, HROnboardingObservation\n", "from server.tools import TOOL_DEFINITIONS\n", "from server.rubrics import RubricEvaluator\n", "\n", "# Create the environment\n", "env = HROnboardingEnvironment(seed=42, max_steps=15)\n", "\n", "print(f\"Total tasks: {len(env._tasks)}\")\n", "print(f\"Total tools: {len(TOOL_DEFINITIONS)}\")\n", "\n", "# Show a sample task\n", "obs = env.reset()\n", "print(f\"\\nSample task: {obs.task_id}\")\n", "print(f\"Difficulty: {obs.metadata.get('difficulty')}\")\n", "print(f\"Category: {obs.metadata.get('category')}\")\n", 
"print(f\"Instruction: {obs.instruction}\")\n", "print(f\"Available tools: {len(obs.available_tools)}\")" ] }, { "cell_type": "markdown", "id": "6ef2f4e5", "metadata": {}, "source": [ "Let's try calling a tool manually to see how the environment works:" ] }, { "cell_type": "code", "execution_count": 6, "id": "c3af4b42", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tool result:\n", "{\n", " \"success\": true,\n", " \"employee\": {\n", " \"emp_id\": \"emp_0001\",\n", " \"name\": \"Rajesh Kumar\",\n", " \"email\": \"rajesh.kumar@acmecorp.com\",\n", " \"department\": \"Engineering\",\n", " \"level\": \"L6\",\n", " \"role\": \"VP of Engineering\",\n", " \"manager_id\": null,\n", " \"status\": \"active\",\n", " \"date_of_joining\": \"2018-03-15\",\n", " \"date_of_leaving\": null,\n", " \"is_contractor\": false,\n", " \"phone\": \"+1-415-332-7891\",\n", " \"location\": \"San Francisco\"\n", " }\n", "}\n" ] } ], "source": [ "# Call a tool\n", "action = HROnboardingAction(\n", " tool_name=\"hr_read_employee\",\n", " arguments={\"emp_id\": \"emp_0001\"}\n", ")\n", "obs = env.step(action)\n", "print(\"Tool result:\")\n", "print(json.dumps(obs.tool_result, indent=2)[:500])" ] }, { "cell_type": "markdown", "id": "66777ee6", "metadata": {}, "source": [ "## Prompt Design\n", "\n", "The prompt tells the model what to generate. Unlike the 2048 tutorial (which generates Python code), here the model generates **JSON tool calls** directly:\n", "\n", "```json\n", "{\"tool\": \"hr_create_employee\", \"params\": {\"name\": \"Priya Sharma\", \"department\": \"Engineering\", \"level\": \"L2\", \"role\": \"Software Engineer\"}}\n", "```\n", "\n", "The model gets the task instruction + tool definitions, and must output a sequence of tool calls." 
] }, { "cell_type": "code", "execution_count": 7, "id": "c5d3bea3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "System prompt length: 2957 chars\n", "You are an HR automation agent for AcmeCorp. Complete tasks by calling tools.\n", "\n", "Respond with ONLY a JSON tool call per step:\n", "{\"tool\": \"\", \"params\": {}}\n", "\n", "When done: {\"tool\": \"__done__\", \"params\": {}}\n", "\n", "Rules:\n", "- ONLY output JSON, no explanation\n", "- Create employee records before onboarding\n", "- Check asset availability before assigning\n", "\n", "Tools:\n", "- hr_create_employee: Create a new employee record in the HR system. Params: name, department, level, role, manager_id, is_contractor, location, date_of_joining\n", "- hr_read_employee: Look up an employee by their employee ID or email address. Params: emp_id, email\n", "- hr_update_employee: Update fields on an existing employee record. Params: emp_id, updates\n", "- hr_search_employees: Search employees by criteria. Params: department, level, status, location, role, name\n", "- hr_get_org_chart: Get the organizational hierarchy/reporting structure for a department. Params: department\n", "- onboarding_create_request: Initiate an onboarding request for a new hire. Params: employee_id\n", "- onboarding_get_status: Check the status of an onboarding request. Params: request_id, employee_id\n", "- onboarding_complete_step: Mark an onboarding step as completed. Params: request_id, step\n", "- offboarding_create_request: Initiate an offboarding request for a departing employee. Params: employee_id, reason, exit_date\n", "- offboarding_get_status: Check the status of an offboarding request. Params: request_id, employee_id\n", "- offboarding_complete_step: Mark an offboarding step as completed. Params: request_id, step\n", "- it_assign_asset: Assign an IT asset (laptop, monitor, phone, headset) to an employee. 
Params: asset_id, employee_id\n", "- it_get_available_assets: List available (unassigned) IT assets. Params: asset_type\n", "- it_create_account: Create IT accounts (email, Slack, VPN, etc. Params: employee_id, account_types\n", "- it_revoke_access: Revoke all IT system access for an employee (used during offboarding). Params: employee_id\n", "- it_get_software_licenses: Check software license availability. Params: software_name\n", "- access_assign_role: Assign an access role to an employee. Params: employee_id, role_id\n", "- access_create_badge: Create a physical access badge for an employee. Params: employee_id, access_zones\n", "- access_revoke_role: Revoke a specific access role from an employee. Params: employee_id, role_id\n", "- access_get_security_groups: List all security groups. Params: \n", "- email_send: Send an email. Params: from_address, to_address, subject, body\n", "- slack_send_message: Post a message in a Slack channel or send a DM. Params: channel, sender, text\n", "- meeting_schedule: Schedule a meeting (orientation, 1-on-1, exit interview, etc. Params: title, attendees, datetime, meeting_type\n", "- policy_lookup: Look up company policies by topic or department. Params: topic, department, policy_id\n", "- approval_request: Submit an approval request (manager approval, IT approval, security approval). Params: request_id, approver_id, approval_type\n" ] } ], "source": [ "# Build compact tool descriptions (just name + one-liner)\n", "tool_summary = \"\\n\".join(\n", " f\"- {t['name']}: {t['description'].split('.')[0]}. Params: {', '.join(t.get('parameters', {}).get('properties', {}).keys())}\"\n", " for t in TOOL_DEFINITIONS\n", ")\n", "\n", "SYSTEM_PROMPT = (\n", " \"You are an HR automation agent for AcmeCorp. 
Complete tasks by calling tools.\\n\\n\"\n", " \"Respond with ONLY a JSON tool call per step:\\n\"\n", " '{\"tool\": \"\", \"params\": {}}\\n\\n'\n", " 'When done: {\"tool\": \"__done__\", \"params\": {}}\\n\\n'\n", " \"Rules:\\n\"\n", " \"- ONLY output JSON, no explanation\\n\"\n", " \"- Create employee records before onboarding\\n\"\n", " \"- Check asset availability before assigning\\n\\n\"\n", " f\"Tools:\\n{tool_summary}\"\n", ")\n", "\n", "print(f\"System prompt length: {len(SYSTEM_PROMPT)} chars\")\n", "print(SYSTEM_PROMPT)" ] }, { "cell_type": "markdown", "id": "a2ca3898", "metadata": {}, "source": [ "## Building the Training Dataset\n", "\n", "We split all 77 tasks into **train (70%)** and **test (30%)** sets, stratified by difficulty. The model trains on all train tasks \u2014 simple/medium act as anchor points (stable high reward) while complex/edge_case provide the actual learning signal.\n", "\n", "| Split | Count | Purpose |\n", "|-------|-------|---------|\n", "| Train | 52 tasks (all difficulties) | Model trains on these via GRPO |\n", "| Test | 25 tasks (all difficulties) | Held-out generalization test |" ] }, { "cell_type": "code", "execution_count": 8, "id": "586ce960", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total tasks: 77\n", "Train tasks: 52\n", "Test tasks: 25 (held-out, never seen during training)\n", "\n", " simple : 13 train, 6 test\n", " medium : 14 train, 7 test\n", " complex : 17 train, 8 test\n", " edge_case : 8 train, 4 test\n", "\n", "GRPO training dataset: 52 tasks\n", " task_0035 [simple ] Check the offboarding status for Thomas White (emp_0035...\n", " task_0014 [simple ] Check if there are available laptops and Jira licenses ...\n", " task_0005 [simple ] List all employees in the Engineering department....\n", " task_0010 [simple ] List all security groups and their accessible resources...\n", " task_0006 [simple ] Show me the organizational chart for the Finance depart...\n", " 
task_0036 [simple ] Check the offboarding status for Min Hu (emp_0113)....\n", " task_0038 [simple ] Check the offboarding status for Tao Chen (emp_0020)....\n", " task_0007 [simple ] What laptops are currently available for assignment?...\n", " task_0013 [simple ] Check the onboarding status for employee Hui Zhou (emp_...\n", " task_0037 [simple ] Check the offboarding status for Shan Lin (emp_0142)....\n", " task_0011 [simple ] Check the onboarding status for employee Rohan Patel (e...\n", " task_0002 [simple ] Look up the employee record for Ingrid Larsson (ID: emp...\n", " task_0012 [simple ] Check the onboarding status for employee Astrid Koch (e...\n", " task_0046 [medium ] Initiate offboarding for Brian Jones (emp_0075) who tak...\n", " task_0023 [medium ] Onboard new hire Li Wei to Engineering as L3 Senior Eng...\n", " task_0018 [medium ] Onboard new hire James Wilson to Data Science as L2 Dat...\n", " task_0041 [medium ] Initiate offboarding for Hao Sun (emp_0121) who leaving...\n", " task_0040 [medium ] Initiate offboarding for Kavya Desai (emp_0034) who res...\n", " task_0045 [medium ] Initiate offboarding for Susan Davis (emp_0091) who ret...\n", " task_0016 [medium ] Onboard new hire Alex Chen to Product as L2 Product Ana...\n", " task_0073 [medium ] The Engineering team is onboarding 2 new hires at the s...\n", " task_0020 [medium ] Onboard new hire Tom Nguyen to Finance as L2 Financial ...\n", " task_0074 [medium ] The Product team is onboarding 2 new hires at the same ...\n", " task_0017 [medium ] Onboard new hire Maria Garcia to Marketing as L1 Market...\n", " task_0015 [medium ] Onboard new hire Priya Sharma to Engineering as L2 Soft...\n", " task_0042 [medium ] Initiate offboarding for Pierre Laurent (emp_0153) who ...\n", " task_0019 [medium ] Onboard new hire Aisha Patel to Sales as L1 Sales Repre...\n", " task_0053 [complex ] Process the complete offboarding for Marta Wagner (emp_...\n", " task_0051 [complex ] Fully offboard Sergio Ferrari 
(emp_0198), a L3 Security...\n", " task_0031 [complex ] Onboard Nina Petrova as L4 Director of Platform in Engi...\n", " task_0027 [complex ] Fully onboard Carlos Mendez as L3 Senior Security Engin...\n", " task_0032 [complex ] Onboard Hassan Ahmed as L3 Lead Data Scientist in Data ...\n", " task_0072 [complex ] Rehire Marie Dubois (emp_0064) who was previously offbo...\n", " task_0025 [complex ] Fully onboard John Lee as L3 Team Lead - ML in Data Sci...\n", " task_0068 [complex ] Patricia Brown (emp_0172) is transferring from Engineer...\n", " task_0054 [complex ] Process the complete offboarding for Jun Zheng (emp_006...\n", " task_0030 [complex ] Onboard Sanjay Gupta as L2 Security Analyst in Security...\n", " task_0034 [complex ] Onboard Kevin O'Brien as L4 VP of Product in Product. C...\n", " task_0048 [complex ] Fully offboard Henrik Becker (emp_0069), a L4 Head of E...\n", " task_0029 [complex ] Fully onboard Raj Kapoor as L2 Backend Developer in Eng...\n", " task_0070 [complex ] Robert Garcia (emp_0133) is transferring from Data Scie...\n", " task_0071 [complex ] Rehire Feng Yang (emp_0104) who was previously offboard...\n", " task_0050 [complex ] Fully offboard Lei Huang (emp_0032), a L4 Group Product...\n", " task_0077 [complex ] Manager Ananya Reddy (emp_0007) in Engineering is leavi...\n", " task_0056 [edge_case ] Onboard a new L1 Associate to the Marketing department....\n", " task_0059 [edge_case ] Check if there are available LinkedIn Sales Navigator l...\n", " task_0065 [edge_case ] Assign the security_admin access role to a new L1 Secur...\n", " task_0066 [edge_case ] A Marketing employee needs access to the Engineering Gi...\n", " task_0064 [edge_case ] Jennifer Davis (emp_0016) is being terminated effective...\n", " task_0058 [edge_case ] Assign a Netsuite license to a new Finance hire. 
Check ...\n", " task_0067 [edge_case ] Before onboarding a new Security team member, look up t...\n", " task_0061 [edge_case ] Onboard contractor Amit Verma to Engineering as an L2 C...\n", "\n", "Max prompt token length: 794\n" ] } ], "source": [ "import random\n", "from datasets import Dataset\n", "\n", "# Build prompts from all tasks using direct _task_idx access\n", "all_prompts = []\n", "train_env = HROnboardingEnvironment(seed=42, max_steps=15)\n", "\n", "for i in range(len(train_env._tasks)):\n", " train_env._task_idx = i\n", " obs = train_env.reset()\n", "\n", " all_prompts.append({\n", " \"prompt\": [\n", " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n", " {\"role\": \"user\", \"content\": obs.instruction},\n", " ],\n", " \"task_idx\": i,\n", " \"task_id\": obs.task_id,\n", " \"difficulty\": obs.metadata.get(\"difficulty\", \"\"),\n", " \"category\": obs.metadata.get(\"category\", \"\"),\n", " })\n", "\n", "# Build instruction -> task_idx lookup (used by rubric_reward)\n", "instruction_to_task_idx = {\n", " p[\"prompt\"][1][\"content\"]: p[\"task_idx\"] for p in all_prompts\n", "}\n", "\n", "# ============================================================\n", "# TRAIN / TEST SPLIT (stratified by difficulty)\n", "# 70% train, 30% test \u2014 no data leakage\n", "# ============================================================\n", "random.seed(42)\n", "\n", "train_prompts = []\n", "test_prompts = []\n", "\n", "for diff in [\"simple\", \"medium\", \"complex\", \"edge_case\"]:\n", " subset = [p for p in all_prompts if p[\"difficulty\"] == diff]\n", " random.shuffle(subset)\n", " split_idx = max(1, int(len(subset) * 0.7)) # 70% train\n", " train_prompts.extend(subset[:split_idx])\n", " test_prompts.extend(subset[split_idx:])\n", "\n", "print(f\"Total tasks: {len(all_prompts)}\")\n", "print(f\"Train tasks: {len(train_prompts)}\")\n", "print(f\"Test tasks: {len(test_prompts)} (held-out, never seen during training)\")\n", "print()\n", "\n", "# Show split by 
difficulty\n", "for diff in [\"simple\", \"medium\", \"complex\", \"edge_case\"]:\n", "    tr = [p for p in train_prompts if p[\"difficulty\"] == diff]\n", "    te = [p for p in test_prompts if p[\"difficulty\"] == diff]\n", "    print(f\" {diff:10s}: {len(tr)} train, {len(te)} test\")\n", "\n", "# ============================================================\n", "# TRAINING DATASET: all train tasks\n", "# Simple/medium provide stable high reward (anchor points)\n", "# Complex/edge provide learning signal (gradient)\n", "# ============================================================\n", "dataset = Dataset.from_list(train_prompts)\n", "\n", "print(f\"\\nGRPO training dataset: {len(dataset)} tasks\")\n", "for p in train_prompts:\n", "    print(f\" {p['task_id']:12s} [{p['difficulty']:10s}] {p['prompt'][1]['content'][:55]}...\")\n", "\n", "maximum_length = max(\n", "    len(tokenizer.apply_chat_template(p[\"prompt\"], add_generation_prompt=True))\n", "    for p in all_prompts\n", ")\n", "print(f\"\\nMax prompt token length: {maximum_length}\")" ] }, { "cell_type": "markdown", "id": "777cd97e", "metadata": {}, "source": [ "Let's see what the **base model** (before RL training) generates:" ] }, { "cell_type": "code", "execution_count": 9, "id": "3f4a9c99", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Task: task_0023 \u2014 Onboard new hire Li Wei to Engineering as L3 Senior Engineer. 
Create their employee record and initiate the onboarding request.\n", "\n", "{\"tool\": \"hr_create_employee\", \"params\": {\"name\": \"Li Wei\", \"department\": \"Engineering\", \"level\": \"L3\", \"role\": \"Senior Engineer\", \"manager_id\": \"John Smith\", \"is_contractor\": \"false\", \"location\": \"New York\", \"date_of_joining\": \"2023-01-01\"}}, {\"tool\": \"hr_read_employee\", \"params\": {\"emp_id\": \"Li Wei\", \"email\": \"liwei@acmecorp.com\"}}, {\"tool\": \"onboarding_create_request\", \"params\": {\"request_id\": \"1\", \"employee_id\": \"Li Wei\", \"status\": \"pending\"}}<|eot_id|>\n" ] } ], "source": [ "# Test base model on a medium task\n", "test_prompt = train_prompts[14][\"prompt\"] # Medium onboarding task\n", "print(f\"Task: {train_prompts[14]['task_id']} \u2014 {test_prompt[1]['content']}\\n\")\n", "\n", "text = tokenizer.apply_chat_template(\n", " test_prompt,\n", " tokenize=False,\n", " add_generation_prompt=True,\n", ")\n", "\n", "from transformers import TextStreamer\n", "\n", "_ = model.generate(\n", " **tokenizer(text, return_tensors=\"pt\").to(\"cuda\"),\n", " temperature=0.1,\n", " max_new_tokens=1024,\n", " streamer=TextStreamer(tokenizer, skip_prompt=True),\n", ")" ] }, { "cell_type": "markdown", "id": "1c08e0ff", "metadata": {}, "source": [ "## Designing Reward Functions\n", "\n", "We need reward functions that evaluate the model's generated tool calls. Unlike the 2048 tutorial which used code sandboxing, here we:\n", "\n", "1. **Parse** the model's output into JSON tool calls\n", "2. **Replay** them against the HR environment\n", "3. **Evaluate** using the task's rubric criteria\n", "\n", "| Reward Function | Purpose | Score Range |\n", "|-----------------|---------|-------------|\n", "| `valid_json_reward` | Are the generated tool calls valid JSON? | -2.0 to +1.0 |\n", "| `rubric_reward` | Does the sequence satisfy the task's rubric criteria? 
| -1.0 to +7.0 |\n", "| `efficiency_reward` | Was the task completed without wasting steps? | -1.0 to +1.0 |" ] }, { "cell_type": "code", "execution_count": 10, "id": "6f8671b5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test replay (task_idx=14):\n", "Score: 100% (7/7)\n", " [PASS] created_employee: Created employee record\n", " [PASS] correct_name: Used correct name\n", " [PASS] correct_dept: Assigned to correct department\n", " [PASS] correct_level: Set correct level\n", " [PASS] correct_role: Set correct role\n", " [PASS] initiated_onboarding: Created onboarding request\n", " [PASS] sequencing: Created employee before onboarding request\n" ] } ], "source": [ "def extract_tool_calls(text):\n", "    \"\"\"Extract JSON tool calls from model output.\"\"\"\n", "    calls = []\n", "    # Match JSON objects with at most one level of nested braces\n", "    for match in re.finditer(r'\\{(?:[^{}]|\\{[^{}]*\\})*\\}', text):\n", "        try:\n", "            obj = json.loads(match.group())\n", "            if \"tool\" in obj:\n", "                calls.append(obj)\n", "        except json.JSONDecodeError:\n", "            continue\n", "    return calls\n", "\n", "\n", "def replay_tool_calls(task_idx, tool_calls):\n", "    \"\"\"Replay tool calls against a fresh environment and return evaluation.\"\"\"\n", "    replay_env = HROnboardingEnvironment(seed=42, max_steps=15)\n", "    # Go directly to the task by setting _task_idx\n", "    replay_env._task_idx = task_idx\n", "    replay_env.reset()\n", "\n", "    task = replay_env._current_task\n", "\n", "    steps = 0\n", "    for tc in tool_calls:\n", "        tool_name = tc.get(\"tool\", \"\")\n", "        params = tc.get(\"params\", {})\n", "        if tool_name == \"__done__\":\n", "            break\n", "        if steps >= 15:\n", "            break\n", "        action = HROnboardingAction(tool_name=tool_name, arguments=params)\n", "        replay_env.step(action)\n", "        steps += 1\n", "\n", "    evaluator = RubricEvaluator()\n", "    eval_result = evaluator.evaluate(task, replay_env.world.action_log)\n", "    return eval_result, steps\n", "\n", "\n", "# Test it\n", "test_calls = [\n", "    {\"tool\": 
\"hr_create_employee\", \"params\": {\"name\": \"Priya Sharma\", \"department\": \"Engineering\", \"level\": \"L2\", \"role\": \"Software Engineer\"}},\n", "    {\"tool\": \"onboarding_create_request\", \"params\": {\"employee_id\": \"emp_0201\"}},\n", "    {\"tool\": \"__done__\", \"params\": {}},\n", "]\n", "eval_result, steps = replay_tool_calls(14, test_calls)\n", "print(f\"Test replay (task_idx=14):\")\n", "print(f\"Score: {eval_result['score']:.0%} ({eval_result['passed_count']}/{eval_result['total_criteria']})\")\n", "for c in eval_result[\"criteria_results\"]:\n", "    print(f\" [{'PASS' if c['passed'] else 'FAIL'}] {c['name']}: {c['description']}\")" ] }, { "cell_type": "markdown", "id": "77d6e428", "metadata": {}, "source": [ "Now the actual reward functions that GRPO will call:" ] }, { "cell_type": "code", "execution_count": 11, "id": "f704ee2c", "metadata": {}, "outputs": [], "source": [ "# Counter used to throttle debug printing inside rubric_reward\n", "PRINTER = 0\n", "\n", "\n", "def valid_json_reward(completions, **kwargs):\n", "    \"\"\"Reward for generating valid JSON tool calls.\"\"\"\n", "    # Known tool names plus the __done__ sentinel, built once per call\n", "    valid_names = [t[\"name\"] for t in TOOL_DEFINITIONS] + [\"__done__\"]\n", "    scores = []\n", "    for completion in completions:\n", "        response = completion[0][\"content\"]\n", "        calls = extract_tool_calls(response)\n", "        if len(calls) == 0:\n", "            scores.append(-2.0)\n", "        elif any(c.get(\"tool\") in valid_names for c in calls):\n", "            scores.append(1.0)\n", "        else:\n", "            scores.append(-0.5)\n", "    return scores\n", "\n", "\n", "def get_instruction_from_prompts(prompts, idx):\n", "    \"\"\"Safely extract instruction from prompts, handling various TRL formats.\"\"\"\n", "    try:\n", "        if isinstance(prompts[idx], list):\n", "            return prompts[idx][1][\"content\"]\n", "        if isinstance(prompts[idx], dict):\n", "            return prompts[idx].get(\"content\", \"\")\n", "    except (IndexError, KeyError, TypeError):\n", "        pass\n", "    try:\n", "        if isinstance(prompts[0], list):\n", "            return prompts[0][1][\"content\"]\n", "        if isinstance(prompts[0], dict):\n", "            
return prompts[0].get(\"content\", \"\")\n", " except (IndexError, KeyError, TypeError):\n", " pass\n", " try:\n", " for msg in prompts:\n", " if isinstance(msg, dict) and msg.get(\"role\") == \"user\":\n", " return msg[\"content\"]\n", " except (TypeError, KeyError):\n", " pass\n", " return \"\"\n", "\n", "\n", "def rubric_reward(completions, **kwargs):\n", " \"\"\"Main reward: replay tool calls and evaluate against rubric.\"\"\"\n", " global PRINTER\n", " prompts = kwargs.get(\"prompts\", kwargs.get(\"prompt\", []))\n", " scores = []\n", "\n", " first_instruction = get_instruction_from_prompts(prompts, 0)\n", "\n", " for i, completion in enumerate(completions):\n", " response = completion[0][\"content\"]\n", " calls = extract_tool_calls(response)\n", "\n", " if len(calls) == 0:\n", " scores.append(-1.0)\n", " continue\n", "\n", " instruction = get_instruction_from_prompts(prompts, i) or first_instruction\n", "\n", " task_idx = instruction_to_task_idx.get(instruction)\n", " if task_idx is None:\n", " if PRINTER % 20 == 0:\n", " print(f\"WARNING: No task match for: {instruction[:60]}...\")\n", " scores.append(-1.0)\n", " continue\n", "\n", " try:\n", " eval_result, steps = replay_tool_calls(task_idx, calls)\n", " score = eval_result[\"score\"]\n", " # Map the rubric score in [0, 1] to [-1, 5]; a full pass adds a +2 bonus (max 7)\n", " reward = score * 6.0 - 1.0\n", " if eval_result[\"passed\"]:\n", " reward += 2.0\n", "\n", " if PRINTER % 10 == 0:\n", " task_info = next((p for p in all_prompts if p[\"task_idx\"] == task_idx), None)\n", " tid = task_info[\"task_id\"] if task_info else f\"idx_{task_idx}\"\n", " diff = task_info[\"difficulty\"] if task_info else \"?\"\n", " print(f\"\\n--- [{tid}] [{diff}] ---\")\n", " print(f\"Instruction: {instruction[:80]}...\")\n", " print(f\"Tool calls: {[c['tool'] for c in calls]}\")\n", " print(f\"Rubric: {eval_result['score']:.0%} ({eval_result['passed_count']}/{eval_result['total_criteria']})\")\n", " print(f\"Reward: {reward:.2f}\")\n", " PRINTER += 1\n", " scores.append(reward)\n", " except Exception as 
e:\n", " print(f\"Error replaying: {e}\")\n", " scores.append(-1.0)\n", "\n", " return scores\n", "\n", "\n", "def efficiency_reward(completions, **kwargs):\n", " \"\"\"Reward for completing tasks efficiently (fewer steps = better).\"\"\"\n", " scores = []\n", " for completion in completions:\n", " response = completion[0][\"content\"]\n", " calls = extract_tool_calls(response)\n", " actual_calls = [c for c in calls if c.get(\"tool\") != \"__done__\"]\n", " n = len(actual_calls)\n", "\n", " if n == 0:\n", " scores.append(-1.0)\n", " elif n <= 3:\n", " scores.append(1.0)\n", " elif n <= 6:\n", " scores.append(0.5)\n", " elif n <= 10:\n", " scores.append(0.0)\n", " else:\n", " scores.append(-0.5)\n", " return scores" ] }, { "cell_type": "markdown", "id": "c293dfaa", "metadata": {}, "source": [ "## Baseline Evaluation\n", "\n", "Before training, we evaluate the base model on **both** the train and test (held-out) sets to establish baselines:" ] }, { "cell_type": "code", "execution_count": 12, "id": "bb5bc616", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "BASELINE \u2014 TRAIN SET\n", "==================================================\n", " [FAIL] task_0035 [simple ] score=0% tools=['hr_get_offboarding_status']\n", " [X ] correct_tool: Used offboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0014 [simple ] score=50% tools=['it_get_available_assets']\n", " [OK] checked_assets: Checked available assets\n", " [X ] checked_licenses: Checked software licenses\n", " [FAIL] task_0005 [simple ] score=0% tools=['hr_read_employee']\n", " [X ] correct_tool: Used hr_search_employees\n", " [X ] correct_dept: Filtered by correct department\n", " [FAIL] task_0010 [simple ] score=0% tools=['it_get_software_licenses']\n", " [X ] correct_tool: Used access_get_security_groups\n", " [PASS] task_0006 [simple ] score=100% 
tools=['hr_get_org_chart']\n", " [OK] correct_tool: Used hr_get_org_chart\n", " [OK] correct_dept: Passed correct department\n", " [FAIL] task_0036 [simple ] score=0% tools=['hr_get_offboarding_status']\n", " [X ] correct_tool: Used offboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0038 [simple ] score=0% tools=['hr_get_offboarding_status']\n", " [X ] correct_tool: Used offboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [PASS] task_0007 [simple ] score=100% tools=['it_get_available_assets']\n", " [OK] correct_tool: Used it_get_available_assets\n", " [OK] correct_type: Filtered by laptop type\n", " [FAIL] task_0013 [simple ] score=0% tools=['hr_get_org_chart']\n", " [X ] correct_tool: Used onboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0037 [simple ] score=0% tools=['hr_get_offboarding_status']\n", " [X ] correct_tool: Used offboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0011 [simple ] score=0% tools=['hr_get_org_chart']\n", " [X ] correct_tool: Used onboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0002 [simple ] score=50% tools=['hr_read_employee']\n", " [OK] correct_tool: Used hr_read_employee\n", " [X ] correct_id: Passed correct emp_id\n", " [FAIL] task_0012 [simple ] score=0% tools=['hr_get_org_chart']\n", " [X ] correct_tool: Used onboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0046 [medium ] score=60% tools=['offboarding_create_request', 'it_revoke_access']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [OK] revoked_access: Revoked IT access\n", " [X ] notified: Sent notification\n", " [PASS] task_0023 [medium ] score=100% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request']\n", " [OK] 
created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [PASS] task_0018 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0041 [medium ] score=60% tools=['offboarding_create_request', 'it_revoke_access']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [OK] revoked_access: Revoked IT access\n", " [X ] notified: Sent notification\n", " [FAIL] task_0040 [medium ] score=40% tools=['offboarding_create_request']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [X ] revoked_access: Revoked IT access\n", " [X ] notified: Sent notification\n", " [FAIL] task_0045 [medium ] score=40% tools=['offboarding_create_request']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [X ] revoked_access: Revoked IT access\n", " [X ] notified: Sent notification\n", " [PASS] task_0016 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] 
correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0073 [medium ] score=50% tools=['it_get_available_assets', 'it_get_available_assets', 'it_get_available_assets']\n", " [OK] checked_assets: Checked available assets\n", " [X ] checked_licenses: Checked software licenses\n", " [PASS] task_0020 [medium ] score=100% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0074 [medium ] score=50% tools=['it_get_available_assets', 'it_get_available_assets', 'it_get_available_assets']\n", " [OK] checked_assets: Checked available assets\n", " [X ] checked_licenses: Checked software licenses\n", " [FAIL] task_0017 [medium ] score=71% tools=['hr_create_employee']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [X ] initiated_onboarding: Created onboarding request\n", " [X ] sequencing: Created employee before onboarding request\n", " [PASS] task_0015 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", 
" [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0042 [medium ] score=40% tools=['offboarding_create_request']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [X ] revoked_access: Revoked IT access\n", " [X ] notified: Sent notification\n", " [PASS] task_0019 [medium ] score=100% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0053 [complex ] score=0% tools=['hr_get_offboarding_request', 'onboarding_get_status', 'onboarding_complete_step', 'hr_get_org_chart', 'hr_get_org_chart', 'hr_get_org_chart', 'hr_get_org_chart', 'hr_get_org_chart', 'hr_get_org_chart', 'hr_get_org_chart', 'hr_get_org_chart', 'hr_get_org_chart', 'hr_get_org_chart']\n", " [X ] created_request: Created offboarding request\n", " [X ] revoked_it: Revoked IT access\n", " [X ] farewell_email: Sent farewell email\n", " [X ] farewell_slack: Sent farewell Slack message\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0051 [complex ] score=17% tools=['hr_revoke_access', 'hr_revoke_role', 'hr_revoke_access', 'hr_reassign_asset', 'hr_get_software_licenses', 'access_revoke_role', 'access_get_security_groups']\n", " [X ] created_request: Created offboarding request\n", " [X ] revoked_it: Revoked IT access\n", " [OK] revoked_roles: Revoked access roles\n", " [X ] farewell: Sent farewell communication\n", " [X ] exit_interview: Scheduled exit interview\n", " [X ] completed_steps: 
Completed offboarding steps\n", " [FAIL] task_0031 [complex ] score=20% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'hr_search_employees', 'hr_get_org_chart', 'onboarding_create_request', 'onboarding_get_status']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] got_approval: Submitted approval request\n", " [X ] assigned_asset: Assigned an asset\n", " [X ] created_accounts: Created IT accounts\n", " [X ] assigned_role: Assigned access role\n", " [X ] created_badge: Created physical badge\n", " [X ] sent_communications: Sent welcome communications\n", " [X ] scheduled_meeting: Scheduled orientation\n", " [X ] security_approval: Got security approval before badge\n", " [FAIL] task_0027 [complex ] score=30% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] assigned_laptop: Assigned a laptop\n", " [X ] created_accounts: Created IT accounts\n", " [X ] assigned_access: Assigned access roles\n", " [X ] sent_welcome: Sent welcome communication\n", " [X ] scheduled_orientation: Scheduled orientation meeting\n", " [OK] sequencing_create_first: Created employee before other steps\n", " [X ] sequencing_asset_check: Checked available assets before assigning\n", " [X ] completeness: Completed at least 3 onboarding steps\n", " [FAIL] task_0032 [complex ] score=44% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'offboarding_create_request', 'offboarding_get_status', 'access_assign_role', 'access_create_badge', 
'access_revoke_role', 'access_get_security_groups']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] got_approval: Submitted approval request\n", " [X ] assigned_asset: Assigned an asset\n", " [X ] created_accounts: Created IT accounts\n", " [OK] assigned_role: Assigned access role\n", " [OK] created_badge: Created physical badge\n", " [X ] sent_communications: Sent welcome communications\n", " [X ] scheduled_meeting: Scheduled orientation\n", " [FAIL] task_0072 [complex ] score=0% tools=['hr_search_employees', 'hr_create_request', 'hr_assign_asset', 'hr_send_email', 'hr_send_message', 'hr_get_org_chart', 'hr_get_software_licenses', 'access_assign_role', 'access_revoke_role']\n", " [X ] read_employee: Read employee record first\n", " [X ] updated_status: Updated status to pending/active\n", " [X ] new_onboarding: Created new onboarding request\n", " [X ] provisioned_accounts: Created IT accounts\n", " [X ] welcome_back: Sent welcome-back communication\n", " [FAIL] task_0025 [complex ] score=50% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'hr_get_org_chart', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'it_assign_asset', 'it_get_available_assets', 'it_create_account']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] assigned_laptop: Assigned a laptop\n", " [OK] created_accounts: Created IT accounts\n", " [X ] assigned_access: Assigned access roles\n", " [X ] sent_welcome: Sent welcome communication\n", " [X ] scheduled_orientation: Scheduled orientation meeting\n", " [OK] sequencing_create_first: Created employee before other steps\n", " [X ] sequencing_asset_check: Checked available assets before assigning\n", " [X ] completeness: Completed at least 3 onboarding steps\n", " [FAIL] task_0068 [complex ] score=20% 
tools=['offboard_access', 'onboard_employee', 'access_revoke', 'access_assign_role', 'access_get_security_groups']\n", " [X ] read_employee: Read employee record\n", " [X ] revoked_old_access: Revoked old department access\n", " [X ] updated_dept: Updated department\n", " [OK] new_access: Assigned new department roles\n", " [X ] notified_team: Notified new team\n", " [FAIL] task_0054 [complex ] score=60% tools=['offboarding_create_request', 'offboarding_get_status', 'offboarding_revoke_access', 'offboarding_get_software_licenses', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'slack_send_message', 'meeting_schedule']\n", " [OK] created_request: Created offboarding request\n", " [X ] revoked_it: Revoked IT access\n", " [OK] farewell_email: Sent farewell email\n", " [OK] farewell_slack: Sent farewell Slack message\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0030 [complex ] score=20% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] got_approval: Submitted approval request\n", " [X ] assigned_asset: Assigned an asset\n", " [X ] created_accounts: Created IT accounts\n", " [X ] assigned_role: Assigned access role\n", " [X ] created_badge: Created physical badge\n", " [X ] sent_communications: Sent welcome communications\n", " [X ] scheduled_meeting: Scheduled orientation\n", " [X ] security_approval: Got security approval before badge\n", " [FAIL] task_0034 [complex ] score=20% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'hr_search_employees', 'onboarding_get_status', 'onboarding_complete_step', 'offboarding_create_request', 
'onboarding_get_status']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] got_approval: Submitted approval request\n", " [X ] assigned_asset: Assigned an asset\n", " [X ] created_accounts: Created IT accounts\n", " [X ] assigned_role: Assigned access role\n", " [X ] created_badge: Created physical badge\n", " [X ] sent_communications: Sent welcome communications\n", " [X ] scheduled_meeting: Scheduled orientation\n", " [X ] security_approval: Got security approval before badge\n", " [FAIL] task_0048 [complex ] score=17% tools=['hr_revoke_access', 'hr_revoke_access', 'hr_reassign_asset', 'hr_get_software_licenses', 'access_revoke_role', 'access_get_security_groups', 'access_revoke_role']\n", " [X ] created_request: Created offboarding request\n", " [X ] revoked_it: Revoked IT access\n", " [OK] revoked_roles: Revoked access roles\n", " [X ] farewell: Sent farewell communication\n", " [X ] exit_interview: Scheduled exit interview\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0029 [complex ] score=70% tools=['hr_create_employee', 'onboarding_create_request', 'hr_get_org_chart', 'it_assign_asset', 'it_get_available_assets', 'it_create_account', 'it_revoke_access', 'access_assign_role', 'access_get_security_groups', 'email_send']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] assigned_laptop: Assigned a laptop\n", " [OK] created_accounts: Created IT accounts\n", " [OK] assigned_access: Assigned access roles\n", " [OK] sent_welcome: Sent welcome communication\n", " [X ] scheduled_orientation: Scheduled orientation meeting\n", " [OK] sequencing_create_first: Created employee before other steps\n", " [X ] sequencing_asset_check: Checked available assets before assigning\n", " [X ] completeness: Completed at least 3 onboarding steps\n", " [FAIL] task_0070 [complex ] score=20% 
tools=['offboard_access', 'access_revoke', 'access_assign_role', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'access_assign_role', 'access_revoke_role', 'access_get_security_groups', 'access_get_software_licenses', 'access_get_security_groups', 'access_get_security_groups', 'access_get_security_groups']\n", " [X ] read_employee: Read employee record\n", " [X ] revoked_old_access: Revoked old department access\n", " [X ] updated_dept: Updated department\n", " [OK] new_access: Assigned new department roles\n", " [X ] notified_team: Notified new team\n", " [FAIL] task_0071 [complex ] score=0% tools=['hr_search_employees', 'hr_create_request', 'hr_assign_asset', 'hr_send_email', 'hr_send_message', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_complete_step']\n", " [X ] read_employee: Read employee record first\n", " [X ] updated_status: Updated status to pending/active\n", " [X ] new_onboarding: Created new onboarding request\n", " [X ] provisioned_accounts: Created IT accounts\n", " [X ] welcome_back: Sent welcome-back communication\n", " [FAIL] task_0050 [complex ] score=50% tools=['hr_revoke_access', 'hr_revoke_role', 'hr_revoke_access', 'hr_reassign_asset', 'hr_get_software_licenses', 'access_assign_role', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'meeting_schedule', 'access_get_security_groups', 'onboarding_create_request', 'onboarding_get_status']\n", " [X ] created_request: Created offboarding request\n", " [X ] revoked_it: Revoked IT access\n", " [OK] revoked_roles: Revoked access roles\n", " [OK] farewell: Sent farewell communication\n", " [OK] exit_interview: Scheduled exit interview\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0077 [complex ] score=0% tools=['hr_reassign_asset', 'hr_get_software_licenses', 'access_revoke_role', 'access_get_security_groups', 'access_assign_role', 
'onboarding_get_status', 'onboarding_complete_step', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step']\n", " [X ] read_manager: Looked up manager info\n", " [X ] offboarding: Created offboarding request\n", " [X ] reassigned: Updated reports' manager\n", " [X ] revoked_access: Revoked manager's access\n", " [X ] notified_team: Notified team\n", " [FAIL] task_0056 [edge_case ] score=50% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status']\n", " [OK] attempted_create: Attempted to create employee\n", " [X ] handled_limit: Recognized or handled headcount limit error\n", " [FAIL] task_0059 [edge_case ] score=0% tools=['hr_search_employees']\n", " [X ] checked_licenses: Checked licenses\n", " [FAIL] task_0065 [edge_case ] score=50% tools=['access_assign_role']\n", " [OK] attempted_assign: Attempted to assign role\n", " [X ] handled_error: Recognized level requirement error\n", " [FAIL] task_0066 [edge_case ] score=0% tools=['hr_assign_asset']\n", " [X ] attempted_assign: Attempted to assign role\n", " [X ] handled_restriction: Recognized department restriction\n", " [FAIL] task_0064 [edge_case ] score=50% tools=['hr_revoke_access', 'hr_get_software_licenses', 'access_revoke_role', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'offboarding_create_request', 'offboarding_get_status', 'it_assign_asset', 'it_get_available_assets', 'access_revoke_role', 
'onboarding_get_status', 'it_get_software_licenses', 'access_get_security_groups']\n", " [OK] created_request: Created offboarding with termination reason\n", " [X ] revoked_access: Revoked all access\n", " [OK] no_farewell: Did NOT send farewell communications\n", " [X ] completed_steps: Completed termination steps\n", " [FAIL] task_0058 [edge_case ] score=0% tools=['hr_update_employee']\n", " [X ] checked_licenses: Checked license availability\n", " [X ] identified_full: Recognized licenses are full\n", " [FAIL] task_0067 [edge_case ] score=0% tools=['hr_read_employee']\n", " [X ] looked_up_badge: Looked up badge/access policy\n", " [X ] multiple_lookups: Looked up multiple policies\n", " [FAIL] task_0061 [edge_case ] score=25% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status']\n", " [X ] created_contractor: Created employee with is_contractor=true\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] legal_approval: Got legal approval\n", " [X ] limited_access: Created limited accounts\n", "\n", "Results: 8/52 passed (15.4%)\n", "Mean score: 0.370\n", " simple : 2/13 pass, score=0.23\n", " medium : 6/14 pass, score=0.72\n", " complex : 0/17 pass, score=0.26\n", " edge_case : 0/8 pass, score=0.22\n", "\n", "==================================================\n", "BASELINE \u2014 TEST SET (held-out)\n", "==================================================\n", " [FAIL] task_0003 [simple ] score=50% tools=['hr_read_employee']\n", " [OK] correct_tool: Used hr_read_employee\n", " [X ] correct_id: Passed correct emp_id\n", " [FAIL] task_0039 [simple ] score=0% tools=['hr_get_offboarding_status']\n", " [X ] correct_tool: Used offboarding_get_status\n", 
" [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0008 [simple ] score=0% tools=['hr_get_software_licenses']\n", " [X ] correct_tool: Used it_get_software_licenses\n", " [X ] correct_software: Filtered by Jira\n", " [FAIL] task_0009 [simple ] score=0% tools=['hr_policy']\n", " [X ] correct_tool: Used policy_lookup\n", " [X ] relevant_topic: Searched for onboarding topic\n", " [FAIL] task_0001 [simple ] score=50% tools=['hr_read_employee']\n", " [OK] correct_tool: Used hr_read_employee\n", " [X ] correct_id: Passed correct emp_id\n", " [FAIL] task_0004 [simple ] score=0% tools=['hr_read_employee']\n", " [X ] correct_tool: Used hr_search_employees\n", " [X ] correct_dept: Filtered by correct department\n", " [PASS] task_0024 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0044 [medium ] score=0% tools=['offboarding_revoke_access']\n", " [X ] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [X ] correct_reason: Set correct reason\n", " [X ] revoked_access: Revoked IT access\n", " [X ] notified: Sent notification\n", " [PASS] task_0022 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0043 [medium 
] score=80% tools=['offboarding_create_request', 'it_revoke_access', 'email_send']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [OK] revoked_access: Revoked IT access\n", " [OK] notified: Sent notification\n", " [FAIL] task_0075 [medium ] score=50% tools=['it_get_available_assets', 'it_get_available_assets', 'it_get_available_assets', 'it_get_available_assets', 'it_get_available_software_licenses', 'it_get_available_software_licenses', 'access_assign_role', 'access_assign_role', 'access_get_security_groups', 'email_send', 'meeting_schedule', 'access_revoke_role', 'access_revoke_role']\n", " [OK] checked_assets: Checked available assets\n", " [X ] checked_licenses: Checked software licenses\n", " [PASS] task_0021 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0047 [medium ] score=40% tools=['offboarding_create_request']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [X ] revoked_access: Revoked IT access\n", " [X ] notified: Sent notification\n", " [FAIL] task_0055 [complex ] score=40% tools=['onboarding_complete_step', 'offboarding_revoke_access', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'slack_send_message', 'hr_get_org_chart', 'hr_search_employees', 'hr_update_employee', 'hr_read_employee', 'hr_get_software_licenses', 'hr_create_account']\n", " [X ] created_request: Created offboarding request\n", " [X ] 
revoked_it: Revoked IT access\n", " [OK] farewell_email: Sent farewell email\n", " [OK] farewell_slack: Sent farewell Slack message\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0052 [complex ] score=40% tools=['onboarding_complete_step', 'offboarding_revoke_access', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'slack_send_message', 'hr_get_org_chart', 'hr_update_employee', 'hr_search_employees', 'hr_get_org_chart']\n", " [X ] created_request: Created offboarding request\n", " [X ] revoked_it: Revoked IT access\n", " [OK] farewell_email: Sent farewell email\n", " [OK] farewell_slack: Sent farewell Slack message\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0026 [complex ] score=30% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'hr_get_org_chart', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_create_request', 'hr_get_org_chart']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] assigned_laptop: Assigned a laptop\n", " [X ] created_accounts: Created IT accounts\n", " [X ] assigned_access: Assigned access roles\n", " [X ] sent_welcome: Sent welcome communication\n", " [X ] scheduled_orientation: Scheduled orientation meeting\n", " [OK] sequencing_create_first: Created employee before other steps\n", " [X ] sequencing_asset_check: Checked available assets before assigning\n", " [X ] completeness: Completed at least 3 onboarding steps\n", " [FAIL] task_0033 [complex ] score=22% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: 
Created onboarding request\n", " [X ] got_approval: Submitted approval request\n", " [X ] assigned_asset: Assigned an asset\n", " [X ] created_accounts: Created IT accounts\n", " [X ] assigned_role: Assigned access role\n", " [X ] created_badge: Created physical badge\n", " [X ] sent_communications: Sent welcome communications\n", " [X ] scheduled_meeting: Scheduled orientation\n", " [FAIL] task_0069 [complex ] score=0% tools=['hr_revoke_access', 'hr_assign_role', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status']\n", " [X ] read_employee: Read employee record\n", " [X ] revoked_old_access: Revoked old department access\n", " [X ] updated_dept: Updated department\n", " [X ] new_access: Assigned new department roles\n", " [X ] notified_team: Notified new team\n", " [FAIL] task_0076 [complex ] score=20% tools=['it_revoke_access', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status']\n", " [X ] read_manager: Looked up manager info\n", " [X ] offboarding: Created offboarding request\n", " [X ] reassigned: Updated reports' manager\n", " [OK] revoked_access: Revoked manager's access\n", " [X ] notified_team: Notified team\n", " [FAIL] task_0049 [complex ] score=17% tools=['hr_revoke_access', 'hr_revoke_access', 'hr_reassign_asset', 'hr_get_software_licenses', 'access_revoke_role', 'access_get_security_groups']\n", " [X ] created_request: Created offboarding request\n", " [X ] revoked_it: Revoked IT access\n", " [OK] revoked_roles: Revoked access roles\n", " [X ] farewell: Sent farewell communication\n", " [X ] exit_interview: 
Scheduled exit interview\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0028 [complex ] score=70% tools=['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'it_get_available_assets', 'access_assign_role', 'access_get_security_groups', 'email_send', 'meeting_schedule']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] assigned_laptop: Assigned a laptop\n", " [X ] created_accounts: Created IT accounts\n", " [OK] assigned_access: Assigned access roles\n", " [OK] sent_welcome: Sent welcome communication\n", " [OK] scheduled_orientation: Scheduled orientation meeting\n", " [OK] sequencing_create_first: Created employee before other steps\n", " [X ] sequencing_asset_check: Checked available assets before assigning\n", " [X ] completeness: Completed at least 3 onboarding steps\n", " [FAIL] task_0063 [edge_case ] score=33% tools=['hr_update_employee']\n", " [X ] checked_onboarding: Checked onboarding status\n", " [X ] revoked_access: Revoked any provisioned access\n", " [OK] updated_status: Updated employee status to offboarded\n", " [FAIL] task_0060 [edge_case ] score=0% tools=['hr_search_employees']\n", " [X ] looked_up_manager: Looked up the manager or org chart\n", " [X ] found_skip_level: Identified skip-level manager\n", " [X ] proceeded: Proceeded with onboarding\n", " [FAIL] task_0062 [edge_case ] score=33% tools=['it_revoke_access', 'it_get_software_licenses', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status']\n", " [X ] checked_employee: Looked up employee record\n", " [X ] created_request: Created offboarding request\n", " [OK] revoked_access: Revoked access\n", " [FAIL] task_0057 [edge_case ] score=50% 
tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status']\n", " [OK] attempted_create: Attempted to create employee\n", " [X ] handled_limit: Recognized or handled headcount limit error\n", "\n", "Results: 3/25 passed (12.0%)\n", "Mean score: 0.370\n", " simple : 0/6 pass, score=0.17\n", " medium : 3/7 pass, score=0.67\n", " complex : 0/8 pass, score=0.30\n", " edge_case : 0/4 pass, score=0.29\n" ] } ], "source": [ "def evaluate_model(model, tokenizer, prompts_list=None, temperature=0.1):\n", "    \"\"\"Evaluate model on a list of prompt dicts (each has 'prompt', 'task_idx', 'task_id', 'difficulty').\"\"\"\n", "    if prompts_list is None:\n", "        prompts_list = test_prompts\n", "\n", "    results = []\n", "    for p in prompts_list:\n", "        prompt_msgs = p[\"prompt\"]\n", "        task_idx = p[\"task_idx\"]\n", "\n", "        text = tokenizer.apply_chat_template(\n", "            prompt_msgs, tokenize=False, add_generation_prompt=True\n", "        )\n", "        # Move inputs to the model's device rather than hard-coding \"cuda\"\n", "        inputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\n", "\n", "        with torch.no_grad():\n", "            outputs = model.generate(\n", "                **inputs,\n", "                max_new_tokens=512,\n", "                temperature=temperature,\n", "                do_sample=True,\n", "            )\n", "        # Decode only the newly generated tokens (skip the prompt)\n", "        response = tokenizer.decode(\n", "            outputs[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True\n", "        )\n", "\n", "        calls = extract_tool_calls(response)\n", "        if calls:\n", "            eval_result, steps = replay_tool_calls(task_idx, calls)\n", "            results.append({\n", "                \"task_id\": p[\"task_id\"],\n", "                \"difficulty\": p[\"difficulty\"],\n", "                \"score\": eval_result[\"score\"],\n", "                \"passed\": eval_result[\"passed\"],\n", "                \"steps\": steps,\n", "                \"tools_called\": [c[\"tool\"] for c in calls],\n", 
"                \"criteria\": eval_result[\"criteria_results\"],\n", "            })\n", "        else:\n", "            # No parseable tool calls: score the task as zero\n", "            results.append({\n", "                \"task_id\": p[\"task_id\"],\n", "                \"difficulty\": p[\"difficulty\"],\n", "                \"score\": 0.0,\n", "                \"passed\": False,\n", "                \"steps\": 0,\n", "                \"tools_called\": [],\n", "                \"criteria\": [],\n", "            })\n", "\n", "        # Print per-task result\n", "        r = results[-1]\n", "        status = \"PASS\" if r[\"passed\"] else \"FAIL\"\n", "        print(f\" [{status}] {r['task_id']:12s} [{r['difficulty']:10s}] \"\n", "              f\"score={r['score']:.0%} tools={r['tools_called']}\")\n", "        for c in r.get(\"criteria\", []):\n", "            print(f\" [{'OK' if c['passed'] else 'X ':s}] {c['name']}: {c['description']}\")\n", "\n", "    pass_count = sum(1 for r in results if r[\"passed\"])\n", "    mean_score = sum(r[\"score\"] for r in results) / max(len(results), 1)\n", "\n", "    # Guard the pass rate against an empty prompt list, as mean_score already does\n", "    print(f\"\\nResults: {pass_count}/{len(results)} passed ({pass_count/max(len(results), 1):.1%})\")\n", "    print(f\"Mean score: {mean_score:.3f}\")\n", "\n", "    # Per-difficulty breakdown\n", "    for diff in [\"simple\", \"medium\", \"complex\", \"edge_case\"]:\n", "        subset = [r for r in results if r[\"difficulty\"] == diff]\n", "        if subset:\n", "            p_count = sum(1 for r in subset if r[\"passed\"])\n", "            s = sum(r[\"score\"] for r in subset) / len(subset)\n", "            print(f\" {diff:10s}: {p_count}/{len(subset)} pass, score={s:.2f}\")\n", "\n", "    return results\n", "\n", "\n", "# ============================================================\n", "# BASELINE EVALUATION (before training)\n", "# ============================================================\n", "\n", "# Evaluate on TRAIN set\n", "print(\"=\" * 50)\n", "print(\"BASELINE \u2014 TRAIN SET\")\n", "print(\"=\" * 50)\n", "baseline_train = evaluate_model(model, tokenizer, prompts_list=train_prompts)\n", "\n", "# Evaluate on TEST set (held-out)\n", "print(\"\\n\" + \"=\" * 50)\n", "print(\"BASELINE \u2014 TEST SET (held-out)\")\n", "print(\"=\" * 50)\n", "baseline_test = evaluate_model(model, tokenizer, prompts_list=test_prompts)" ] }, { "cell_type": 
"markdown", "id": "31ecc4ca", "metadata": {}, "source": [ "## Training with GRPO\n", "\n", "**Group Relative Policy Optimization (GRPO)** samples several completions for the same prompt, scores each one, and updates the policy toward the completions that score above the group average.\n", "\n", "Key training parameters:\n", "- `num_generations=4`: Generate 4 candidates per prompt to compute relative rewards (Unsloth raises the per-device batch size from 1 to 4 to match)\n", "- `max_steps=300`: Total training steps\n", "- `learning_rate=5e-5`: With cosine schedule and 10% warmup\n", "- `temperature=1.0`: Higher = more exploration during training" ] }, { "cell_type": "code", "execution_count": 13, "id": "c4b14bd9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Unsloth: We now expect `per_device_train_batch_size` * `gradient_accumulation_steps` * `world_size` to be a multiple of `num_generations`.\n", "We will change the batch size of 1 to the `num_generations` of 4\n" ] } ], "source": [ "max_seq_length = 4096\n", "max_prompt_length = maximum_length + 1\n", "max_completion_length = 512\n", "\n", "from trl import GRPOConfig, GRPOTrainer\n", "\n", "training_args = GRPOConfig(\n", "    temperature=1.0,\n", "    learning_rate=5e-5,\n", "    weight_decay=0.001,\n", "    warmup_ratio=0.1,\n", "    lr_scheduler_type=\"cosine\",\n", "    optim=\"adamw_8bit\",\n", "    logging_steps=1,\n", "    per_device_train_batch_size=1,\n", "    gradient_accumulation_steps=1,\n", "    num_generations=4,\n", "    max_prompt_length=max_prompt_length,\n", "    max_completion_length=max_completion_length,\n", "    max_steps=300,\n", "    save_steps=100,\n", "    report_to=\"wandb\",\n", "    output_dir=\"outputs\",\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "id": "a4260bce", "metadata": { "scrolled": true }, "outputs": [], "source": [ "trainer = GRPOTrainer(\n", "    model=model,\n", "    processing_class=tokenizer,\n", "    reward_funcs=[\n", "        valid_json_reward,\n", "        rubric_reward,\n", "        efficiency_reward,\n", "    ],\n", "    args=training_args,\n", "    train_dataset=dataset,\n", ")" ] }, { "cell_type": "markdown", "id": 
"42965993", "metadata": {}, "source": [ "### Start Training!\n", "\n", "Training will take ~20 minutes. Watch the reward column \u2014 it should gradually increase as the model learns to:\n", "1. Generate valid JSON tool calls\n", "2. Call the right tools for each task\n", "3. Pass more rubric criteria\n", "\n", "The moving average reward trends upward from ~2-3 early on to ~4-5 by the end of training." ] }, { "cell_type": "code", "execution_count": 15, "id": "7fe1e88e", "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1\n", " \\\\ /| Num examples = 52 | Num Epochs = 6 | Total steps = 300\n", "O^O/ \\_/ \\ Batch size per device = 4 | Gradient accumulation steps = 1\n", "\\ / Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4\n", " \"-____-\" Trainable parameters = 5,636,096 of 1,241,450,496 (0.45% trained)\n", "`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072, 'temperature': 0.6, 'top_p': 0.9}. If this is not desired, please set these values explicitly.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Unsloth: Will smartly offload gradients to save VRAM!\n", "\n", "--- [task_0077] [complex] ---\n", "Instruction: Manager Ananya Reddy (emp_0007) in Engineering is leaving. They have 2 direct re...\n", "Tool calls: ['it_revoke_access', 'onboarding_get_status', 'hr_send_message', 'onboarding_create_request']\n", "Rubric: 20% (1/5)\n", "Reward: 0.20\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " [300/300 21:10, Epoch 5/6]\n", "
\n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Step | Training Loss | reward | reward_std | completions / mean_length | completions / min_length | completions / max_length | completions / clipped_ratio | completions / mean_terminated_length | completions / min_terminated_length | completions / max_terminated_length | kl | rewards / valid_json_reward / mean | rewards / valid_json_reward / std | rewards / rubric_reward / mean | rewards / rubric_reward / std | rewards / efficiency_reward / mean | rewards / efficiency_reward / std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.000000 | 2.300000 | 1.655294 | 279.250000 | 178.000000 | 512.000000 | 0.250000 | 201.666672 | 178.000000 | 228.000000 | -0.000000 | 1.000000 | 0.000000 | 0.800000 | 1.200000 | 0.500000 | 0.707107 |
| 2 | -0.000000 | 9.000000 | 0.000000 | 111.750000 | 109.000000 | 118.000000 | 0.000000 | 111.750000 | 109.000000 | 118.000000 | -0.000000 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 |
| 3 | 0.000000 | 9.000000 | 0.000000 | 127.750000 | 118.000000 | 145.000000 | 0.000000 | 127.750000 | 118.000000 | 145.000000 | 0.002447 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 |
| 4 | 0.000000 | 2.125000 | 0.250000 | 345.500000 | 287.000000 | 394.000000 | 0.000000 | 345.500000 | 287.000000 | 394.000000 | 0.000670 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.125000 | 0.250000 |
| 5 | 0.000000 | 3.250000 | 1.500000 | 35.250000 | 34.000000 | 36.000000 | 0.000000 | 35.250000 | 34.000000 | 36.000000 | 0.000421 | 1.000000 | 0.000000 | 1.250000 | 1.500000 | 1.000000 | 0.000000 |
| 6 | 0.000000 | 9.000000 | 0.000000 | 119.000000 | 111.000000 | 124.000000 | 0.000000 | 119.000000 | 111.000000 | 124.000000 | 0.002058 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 |
| 7 | 0.000000 | 8.285714 | 1.428571 | 143.750000 | 136.000000 | 151.000000 | 0.000000 | 143.750000 | 136.000000 | 151.000000 | 0.003349 | 1.000000 | 0.000000 | 6.285714 | 1.428571 | 1.000000 | 0.000000 |
| 8 | 0.000000 | 2.275000 | 0.623832 | 448.250000 | 257.000000 | 512.000000 | 0.750000 | 257.000000 | 257.000000 | 257.000000 | 0.000411 | 1.000000 | 0.000000 | 1.400000 | 0.979796 | -0.125000 | 0.478714 |
| 9 | 0.000000 | 1.000000 | 0.000000 | 40.250000 | 25.000000 | 57.000000 | 0.000000 | 40.250000 | 25.000000 | 57.000000 | 0.004157 | 1.000000 | 0.000000 | -1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 10 | 0.000000 | 9.000000 | 0.000000 | 24.250000 | 21.000000 | 28.000000 | 0.000000 | 24.250000 | 21.000000 | 28.000000 | 0.000883 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 |
| 11 | 0.000000 | 3.250000 | 1.500000 | 34.500000 | 33.000000 | 35.000000 | 0.000000 | 34.500000 | 33.000000 | 35.000000 | 0.000201 | 1.000000 | 0.000000 | 1.250000 | 1.500000 | 1.000000 | 0.000000 |
| 12 | 0.000000 | 1.750000 | 2.598076 | 33.000000 | 28.000000 | 37.000000 | 0.000000 | 33.000000 | 28.000000 | 37.000000 | 0.003952 | 0.250000 | 0.866025 | 0.500000 | 1.732051 | 1.000000 | 0.000000 |
| 13 | 0.000000 | 0.625000 | 0.478714 | 131.500000 | 33.000000 | 273.000000 | 0.000000 | 131.500000 | 33.000000 | 273.000000 | 0.005421 | 1.000000 | 0.000000 | -1.000000 | 0.000000 | 0.625000 | 0.478714 |
| 14 | 0.000000 | -1.000000 | 2.121320 | 38.000000 | 32.000000 | 47.000000 | 0.000000 | 38.000000 | 32.000000 | 47.000000 | 0.004453 | -0.500000 | 1.224745 | -1.000000 | 0.000000 | 0.500000 | 1.000000 |
| 15 | 0.000000 | 1.375000 | 0.750000 | 172.750000 | 29.000000 | 512.000000 | 0.250000 | 59.666668 | 29.000000 | 118.000000 | 0.005408 | 1.000000 | 0.000000 | -0.250000 | 1.500000 | 0.625000 | 0.750000 |
| 16 | 0.000000 | 1.000000 | 0.000000 | 34.750000 | 29.000000 | 42.000000 | 0.000000 | 34.750000 | 29.000000 | 42.000000 | 0.040465 | 1.000000 | 0.000000 | -1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 17 | 0.000000 | -0.500000 | 0.000000 | 37.750000 | 31.000000 | 51.000000 | 0.000000 | 37.750000 | 31.000000 | 51.000000 | 0.036488 | -0.500000 | 0.000000 | -1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 18 | 0.000000 | 2.375000 | 1.030776 | 453.250000 | 277.000000 | 512.000000 | 0.750000 | 277.000000 | 277.000000 | 277.000000 | 0.000923 | 1.000000 | 0.000000 | 1.500000 | 0.577350 | -0.125000 | 0.478714 |
| 19 | 0.000200 | 1.750000 | 1.500000 | 44.000000 | 34.000000 | 67.000000 | 0.000000 | 44.000000 | 34.000000 | 67.000000 | 0.153228 | 1.000000 | 0.000000 | -0.250000 | 1.500000 | 1.000000 | 0.000000 |
| 20 | 0.000000 | 1.800000 | 0.852447 | 387.500000 | 224.000000 | 505.000000 | 0.000000 | 387.500000 | 224.000000 | 505.000000 | 0.005927 | 1.000000 | 0.000000 | 0.800000 | 0.979796 | 0.000000 | 0.408248 |
| 21 | 0.000000 | 4.000000 | 0.000000 | 38.000000 | 38.000000 | 38.000000 | 0.000000 | 38.000000 | 38.000000 | 38.000000 | 0.000026 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 22 | 0.005400 | 3.000000 | 4.000000 | 33.750000 | 28.000000 | 47.000000 | 0.000000 | 33.750000 | 28.000000 | 47.000000 | 5.445933 | 1.000000 | 0.000000 | 1.000000 | 4.000000 | 1.000000 | 0.000000 |
| 23 | 0.068900 | 2.425000 | 1.950000 | 48.000000 | 45.000000 | 55.000000 | 0.000000 | 48.000000 | 45.000000 | 55.000000 | 68.903587 | 0.625000 | 0.750000 | 0.800000 | 1.200000 | 1.000000 | 0.000000 |
| 24 | 0.000000 | 1.500000 | 0.707107 | 461.250000 | 359.000000 | 512.000000 | 0.500000 | 410.500000 | 359.000000 | 462.000000 | 0.007300 | 1.000000 | 0.000000 | 0.875000 | 0.750000 | -0.375000 | 0.250000 |
| 25 | 0.000000 | 4.000000 | 0.000000 | 39.000000 | 39.000000 | 39.000000 | 0.000000 | 39.000000 | 39.000000 | 39.000000 | 0.001882 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 26 | 0.000200 | 3.625000 | 0.478714 | 231.000000 | 147.000000 | 331.000000 | 0.000000 | 231.000000 | 147.000000 | 331.000000 | 0.226281 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 0.625000 | 0.478714 |
| 27 | 0.000000 | 2.625000 | 1.250000 | 512.000000 | 512.000000 | 512.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000436 | 1.000000 | 0.000000 | 2.000000 | 1.224745 | -0.375000 | 0.250000 |
| 28 | 0.000000 | 4.000000 | 0.000000 | 28.500000 | 27.000000 | 29.000000 | 0.000000 | 28.500000 | 27.000000 | 29.000000 | 0.009277 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 29 | 0.000500 | 8.625000 | 0.750000 | 237.500000 | 135.000000 | 512.000000 | 0.250000 | 146.000000 | 135.000000 | 159.000000 | 0.473063 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 0.625000 | 0.750000 |
| 30 | 0.023600 | 4.000000 | 0.000000 | 52.500000 | 36.000000 | 88.000000 | 0.000000 | 52.500000 | 36.000000 | 88.000000 | 23.615509 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 31 | 0.000000 | 2.475000 | 1.650000 | 512.000000 | 512.000000 | 512.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000201 | 1.000000 | 0.000000 | 1.850000 | 1.417745 | -0.375000 | 0.250000 |
| 32 | 0.000500 | 9.000000 | 0.000000 | 106.750000 | 105.000000 | 111.000000 | 0.000000 | 106.750000 | 105.000000 | 111.000000 | 0.493520 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 |
| 33 | 0.000000 | 2.575000 | 0.670199 | 445.500000 | 298.000000 | 512.000000 | 0.500000 | 379.000000 | 298.000000 | 460.000000 | 0.005510 | 1.000000 | 0.000000 | 1.700000 | 1.148913 | -0.125000 | 0.478714 |
| 34 | 0.000000 | 2.425000 | 0.450000 | 512.000000 | 512.000000 | 512.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000232 | 1.000000 | 0.000000 | 1.550000 | 0.300000 | -0.125000 | 0.250000 |
| 35 | 0.000000 | 2.375000 | 1.108678 | 446.500000 | 358.000000 | 512.000000 | 0.500000 | 381.000000 | 358.000000 | 404.000000 | 0.006126 | 1.000000 | 0.000000 | 1.500000 | 1.000000 | -0.125000 | 0.250000 |
| 36 | 0.000000 | 2.050000 | 0.404145 | 282.750000 | 192.000000 | 459.000000 | 0.000000 | 282.750000 | 192.000000 | 459.000000 | 0.009762 | 1.000000 | 0.000000 | 0.800000 | 0.692820 | 0.250000 | 0.288675 |
| 37 | 0.000000 | 1.450000 | 2.251666 | 44.250000 | 40.000000 | 51.000000 | 0.000000 | 44.250000 | 40.000000 | 51.000000 | 0.010904 | 0.250000 | 0.866025 | 0.200000 | 1.385641 | 1.000000 | 0.000000 |
| 38 | 0.000000 | 2.225000 | 0.567891 | 340.500000 | 205.000000 | 408.000000 | 0.000000 | 340.500000 | 205.000000 | 408.000000 | 0.007991 | 1.000000 | 0.000000 | 1.100000 | 0.600000 | 0.125000 | 0.250000 |
| 39 | 0.000000 | 4.000000 | 0.000000 | 35.750000 | 35.000000 | 36.000000 | 0.000000 | 35.750000 | 35.000000 | 36.000000 | 0.009891 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 40 | 0.000200 | 9.000000 | 0.000000 | 29.500000 | 19.000000 | 37.000000 | 0.000000 | 29.500000 | 19.000000 | 37.000000 | 0.203117 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 |
| 41 | 0.000000 | 0.925000 | 3.317002 | 373.250000 | 205.000000 | 512.000000 | 0.500000 | 234.500000 | 205.000000 | 264.000000 | 0.016028 | 0.250000 | 1.500000 | 0.800000 | 1.200000 | -0.125000 | 0.750000 |
| 42 | 0.000000 | -0.500000 | 0.000000 | 31.500000 | 31.000000 | 32.000000 | 0.000000 | 31.500000 | 31.000000 | 32.000000 | 0.012630 | -0.500000 | 0.000000 | -1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 43 | 0.000100 | 4.000000 | 0.000000 | 37.500000 | 21.000000 | 45.000000 | 0.000000 | 37.500000 | 21.000000 | 45.000000 | 0.123499 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 44 | 0.000100 | 3.750000 | 0.288675 | 100.500000 | 40.000000 | 175.000000 | 0.000000 | 100.500000 | 40.000000 | 175.000000 | 0.101443 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 0.750000 | 0.288675 |
| 45 | 0.000000 | 2.950000 | 0.858293 | 345.500000 | 212.000000 | 485.000000 | 0.000000 | 345.500000 | 212.000000 | 485.000000 | 0.026305 | 1.000000 | 0.000000 | 1.700000 | 1.148913 | 0.250000 | 0.645497 |
| 46 | 0.000000 | 1.950000 | 0.768115 | 454.750000 | 283.000000 | 512.000000 | 0.750000 | 283.000000 | 283.000000 | 283.000000 | 0.006652 | 1.000000 | 0.000000 | 0.950000 | 0.300000 | 0.000000 | 0.707107 |
| 47 | 0.000100 | 2.425000 | 1.950000 | 42.500000 | 33.000000 | 51.000000 | 0.000000 | 42.500000 | 33.000000 | 51.000000 | 0.115979 | 0.625000 | 0.750000 | 0.800000 | 1.200000 | 1.000000 | 0.000000 |
| 48 | 0.000100 | 3.400000 | 0.000000 | 44.750000 | 43.000000 | 48.000000 | 0.000000 | 44.750000 | 43.000000 | 48.000000 | 0.050425 | 1.000000 | 0.000000 | 1.400000 | 0.000000 | 1.000000 | 0.000000 |
| 49 | 0.000000 | 2.833333 | 1.394433 | 509.750000 | 503.000000 | 512.000000 | 0.750000 | 503.000000 | 503.000000 | 503.000000 | 0.003108 | 1.000000 | 0.000000 | 1.833333 | 1.374369 | 0.000000 | 0.707107 |
| 50 | 0.000100 | 9.000000 | 0.000000 | 122.750000 | 110.000000 | 145.000000 | 0.000000 | 122.750000 | 110.000000 | 145.000000 | 0.069797 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 |
| 51 | 0.000000 | 3.525000 | 0.567891 | 512.000000 | 512.000000 | 512.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001601 | 1.000000 | 0.000000 | 2.900000 | 0.600000 | -0.375000 | 0.250000 |
| 52 | 0.000000 | 2.125000 | 0.763217 | 466.750000 | 331.000000 | 512.000000 | 0.750000 | 331.000000 | 331.000000 | 331.000000 | 0.014861 | 1.000000 | 0.000000 | 1.250000 | 0.754983 | -0.125000 | 0.250000 |
| 53 | 0.000000 | 4.000000 | 0.000000 | 28.500000 | 27.000000 | 29.000000 | 0.000000 | 28.500000 | 27.000000 | 29.000000 | 0.033278 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 54 | 0.000000 | 0.875000 | 3.330040 | 512.000000 | 512.000000 | 512.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001104 | 0.250000 | 1.500000 | 0.500000 | 1.249000 | 0.125000 | 0.853913 |
| 55 | 0.000300 | 2.500000 | 1.732051 | 88.250000 | 35.000000 | 145.000000 | 0.000000 | 88.250000 | 35.000000 | 145.000000 | 0.314434 | 1.000000 | 0.000000 | 0.500000 | 1.732051 | 1.000000 | 0.000000 |
| 56 | 0.000200 | 3.475000 | 0.585235 | 411.000000 | 305.000000 | 497.000000 | 0.000000 | 411.000000 | 305.000000 | 497.000000 | 0.187819 | 1.000000 | 0.000000 | 2.600000 | 0.692820 | -0.125000 | 0.250000 |
| 57 | 0.000100 | 7.910714 | 1.374420 | 222.750000 | 120.000000 | 512.000000 | 0.250000 | 126.333336 | 120.000000 | 134.000000 | 0.092671 | 1.000000 | 0.000000 | 6.285714 | 1.428571 | 0.625000 | 0.750000 |
| 58 | 0.000700 | 0.625000 | 0.750000 | 31.000000 | 22.000000 | 41.000000 | 0.000000 | 31.000000 | 22.000000 | 41.000000 | 0.745207 | 0.625000 | 0.750000 | -1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 59 | 0.000100 | 0.500000 | 0.408248 | 151.250000 | 67.000000 | 276.000000 | 0.000000 | 151.250000 | 67.000000 | 276.000000 | 0.149762 | 1.000000 | 0.000000 | -1.000000 | 0.000000 | 0.500000 | 0.408248 |
| 60 | 0.000100 | 3.250000 | 0.645497 | 401.250000 | 270.000000 | 512.000000 | 0.250000 | 364.333344 | 270.000000 | 418.000000 | 0.130219 | 1.000000 | 0.000000 | 2.500000 | 0.577350 | -0.250000 | 0.288675 |
| 61 | 0.000000 | 2.425000 | 1.284199 | 489.750000 | 423.000000 | 512.000000 | 0.750000 | 423.000000 | 423.000000 | 423.000000 | 0.047623 | 1.000000 | 0.000000 | 1.550000 | 1.330413 | -0.125000 | 0.478714 |
| 62 | 0.000100 | 8.750000 | 0.500000 | 192.750000 | 113.000000 | 393.000000 | 0.000000 | 192.750000 | 113.000000 | 393.000000 | 0.056066 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 0.750000 | 0.500000 |
| 63 | 0.000100 | 9.000000 | 0.000000 | 124.000000 | 108.000000 | 149.000000 | 0.000000 | 124.000000 | 108.000000 | 149.000000 | 0.096268 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 |
| 64 | 0.000100 | 8.285714 | 1.428571 | 121.000000 | 113.000000 | 133.000000 | 0.000000 | 121.000000 | 113.000000 | 133.000000 | 0.105250 | 1.000000 | 0.000000 | 6.285714 | 1.428571 | 1.000000 | 0.000000 |
| 65 | 0.000000 | 4.000000 | 0.000000 | 35.500000 | 34.000000 | 36.000000 | 0.000000 | 35.500000 | 34.000000 | 36.000000 | 0.021019 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 66 | 0.000100 | 4.000000 | 0.000000 | 125.250000 | 111.000000 | 156.000000 | 0.000000 | 125.250000 | 111.000000 | 156.000000 | 0.085538 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 67 | 0.000100 | 1.000000 | 0.000000 | 29.250000 | 28.000000 | 31.000000 | 0.000000 | 29.250000 | 28.000000 | 31.000000 | 0.085204 | 1.000000 | 0.000000 | -1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 68 | 0.000100 | 3.800000 | 0.600000 | 267.750000 | 224.000000 | 321.000000 | 0.000000 | 267.750000 | 224.000000 | 321.000000 | 0.060305 | 1.000000 | 0.000000 | 2.300000 | 0.600000 | 0.500000 | 0.000000 |
| 69 | 0.000000 | 2.700000 | 1.148912 | 427.750000 | 329.000000 | 512.000000 | 0.500000 | 343.500000 | 329.000000 | 358.000000 | 0.028149 | 1.000000 | 0.000000 | 1.700000 | 1.148913 | 0.000000 | 0.000000 |
| 70 | 0.000100 | 4.250000 | 5.484828 | 22.000000 | 19.000000 | 31.000000 | 0.000000 | 22.000000 | 19.000000 | 31.000000 | 0.087876 | 0.250000 | 0.866025 | 3.000000 | 4.618802 | 1.000000 | 0.000000 |
| 71 | 0.000100 | 3.400000 | 0.000000 | 45.500000 | 41.000000 | 52.000000 | 0.000000 | 45.500000 | 41.000000 | 52.000000 | 0.055051 | 1.000000 | 0.000000 | 1.400000 | 0.000000 | 1.000000 | 0.000000 |
| 72 | 0.000000 | 3.400000 | 0.000000 | 46.250000 | 42.000000 | 50.000000 | 0.000000 | 46.250000 | 42.000000 | 50.000000 | 0.041284 | 1.000000 | 0.000000 | 1.400000 | 0.000000 | 1.000000 | 0.000000 |
| 73 | 0.000000 | 4.000000 | 0.000000 | 35.500000 | 34.000000 | 36.000000 | 0.000000 | 35.500000 | 34.000000 | 36.000000 | 0.015990 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 74 | 0.000100 | 4.000000 | 0.000000 | 52.500000 | 21.000000 | 84.000000 | 0.000000 | 52.500000 | 21.000000 | 84.000000 | 0.108434 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 75 | 0.000100 | 3.375000 | 1.108678 | 369.500000 | 53.000000 | 512.000000 | 0.500000 | 227.000000 | 53.000000 | 401.000000 | 0.072731 | 1.000000 | 0.000000 | 2.000000 | 1.549193 | 0.375000 | 0.478714 |
| 76 | 0.000100 | 4.000000 | 0.000000 | 68.500000 | 28.000000 | 96.000000 | 0.000000 | 68.500000 | 28.000000 | 96.000000 | 0.105343 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 77 | 0.000100 | 2.875000 | 1.547848 | 364.000000 | 230.000000 | 512.000000 | 0.250000 | 314.666687 | 230.000000 | 441.000000 | 0.145283 | 1.000000 | 0.000000 | 2.000000 | 1.414214 | -0.125000 | 0.250000 |
| 78 | 0.000100 | 1.775000 | 0.939415 | 164.500000 | 122.000000 | 261.000000 | 0.000000 | 164.500000 | 122.000000 | 261.000000 | 0.062140 | 1.000000 | 0.000000 | -0.100000 | 1.148913 | 0.875000 | 0.250000 |
| 79 | 0.000000 | 3.500000 | 0.912871 | 419.250000 | 297.000000 | 512.000000 | 0.250000 | 388.333344 | 297.000000 | 450.000000 | 0.037394 | 1.000000 | 0.000000 | 2.500000 | 0.577350 | 0.000000 | 0.408248 |
| 80 | 0.000200 | 4.000000 | 0.000000 | 48.750000 | 29.000000 | 71.000000 | 0.000000 | 48.750000 | 29.000000 | 71.000000 | 0.171194 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 81 | 0.000100 | 3.400000 | 0.000000 | 52.250000 | 42.000000 | 60.000000 | 0.000000 | 52.250000 | 42.000000 | 60.000000 | 0.079432 | 1.000000 | 0.000000 | 1.400000 | 0.000000 | 1.000000 | 0.000000 |
| 82 | 0.000100 | 9.000000 | 0.000000 | 116.250000 | 105.000000 | 131.000000 | 0.000000 | 116.250000 | 105.000000 | 131.000000 | 0.104476 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 |
| 83 | 0.000200 | 2.950000 | 0.544671 | 345.250000 | 194.000000 | 512.000000 | 0.250000 | 289.666687 | 194.000000 | 450.000000 | 0.172505 | 1.000000 | 0.000000 | 1.700000 | 1.039230 | 0.250000 | 0.645497 |
| 84 | 0.000200 | 3.250000 | 1.500000 | 69.750000 | 36.000000 | 117.000000 | 0.000000 | 69.750000 | 36.000000 | 117.000000 | 0.191878 | 1.000000 | 0.000000 | 1.250000 | 1.500000 | 1.000000 | 0.000000 |
| 85 | 0.000000 | 3.583333 | 1.058476 | 512.000000 | 512.000000 | 512.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.003511 | 1.000000 | 0.000000 | 2.833333 | 0.838871 | -0.250000 | 0.288675 |
| 86 | 0.012300 | -0.250000 | 2.500000 | 28.250000 | 15.000000 | 42.000000 | 0.000000 | 28.250000 | 15.000000 | 42.000000 | 12.260892 | 0.250000 | 1.500000 | -1.000000 | 0.000000 | 0.500000 | 1.000000 |
| 87 | 0.000100 | 8.875000 | 0.250000 | 121.500000 | 103.000000 | 170.000000 | 0.000000 | 121.500000 | 103.000000 | 170.000000 | 0.072470 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 0.875000 | 0.250000 |
| 88 | 0.011500 | 3.400000 | 0.000000 | 45.250000 | 42.000000 | 55.000000 | 0.000000 | 45.250000 | 42.000000 | 55.000000 | 11.519933 | 1.000000 | 0.000000 | 1.400000 | 0.000000 | 1.000000 | 0.000000 |
| 89 | 0.000000 | 3.400000 | 0.000000 | 42.750000 | 42.000000 | 43.000000 | 0.000000 | 42.750000 | 42.000000 | 43.000000 | 0.012406 | 1.000000 | 0.000000 | 1.400000 | 0.000000 | 1.000000 | 0.000000 |
| 90 | 0.000000 | 9.000000 | 0.000000 | 21.000000 | 21.000000 | 21.000000 | 0.000000 | 21.000000 | 21.000000 | 21.000000 | 0.000235 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 |
| 91 | 0.000200 | 2.475000 | 0.567891 | 255.000000 | 175.000000 | 384.000000 | 0.000000 | 255.000000 | 175.000000 | 384.000000 | 0.198907 | 1.000000 | 0.000000 | 1.100000 | 0.600000 | 0.375000 | 0.250000 |
| 92 | 0.000000 | 2.775000 | 0.550000 | 512.000000 | 512.000000 | 512.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.003558 | 1.000000 | 0.000000 | 2.150000 | 0.300000 | -0.375000 | 0.250000 |
| 93 | 0.000200 | 3.000000 | 0.707107 | 426.000000 | 269.000000 | 512.000000 | 0.500000 | 340.000000 | 269.000000 | 411.000000 | 0.180130 | 1.000000 | 0.000000 | 2.375000 | 0.750000 | -0.375000 | 0.250000 |
| 94 | 0.000100 | 3.000000 | 0.707107 | 465.750000 | 327.000000 | 512.000000 | 0.750000 | 327.000000 | 327.000000 | 327.000000 | 0.070233 | 1.000000 | 0.000000 | 2.375000 | 0.750000 | -0.375000 | 0.250000 |
| 95 | 0.000100 | 0.500000 | 6.137318 | 33.750000 | 29.000000 | 38.000000 | 0.000000 | 33.750000 | 29.000000 | 38.000000 | 0.126376 | -0.500000 | 1.732051 | 1.000000 | 4.000000 | 0.000000 | 1.154701 |
| 96 | 0.000100 | 3.475000 | 0.888351 | 407.750000 | 316.000000 | 512.000000 | 0.250000 | 373.000000 | 316.000000 | 461.000000 | 0.099564 | 1.000000 | 0.000000 | 2.600000 | 0.979796 | -0.125000 | 0.478714 |
| 97 | 0.000200 | 4.000000 | 0.000000 | 49.250000 | 34.000000 | 64.000000 | 0.000000 | 49.250000 | 34.000000 | 64.000000 | 0.180455 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 98 | 0.000000 | 4.000000 | 0.000000 | 38.000000 | 38.000000 | 38.000000 | 0.000000 | 38.000000 | 38.000000 | 38.000000 | 0.000170 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 99 | 0.000200 | 9.000000 | 0.000000 | 117.250000 | 110.000000 | 137.000000 | 0.000000 | 117.250000 | 110.000000 | 137.000000 | 0.154434 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 1.000000 | 0.000000 |
| 100 | 0.000100 | 4.000000 | 0.000000 | 35.000000 | 34.000000 | 36.000000 | 0.000000 | 35.000000 | 34.000000 | 36.000000 | 0.102506 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 101 | 0.000100 | 3.125000 | 1.034005 | 375.250000 | 301.000000 | 512.000000 | 0.250000 | 329.666687 | 301.000000 | 356.000000 | 0.089974 | 1.000000 | 0.000000 | 2.000000 | 1.200000 | 0.125000 | 0.478714 |
| 102 | 0.000100 | 4.000000 | 0.000000 | 35.500000 | 35.000000 | 36.000000 | 0.000000 | 35.500000 | 35.000000 | 36.000000 | 0.077897 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
| 103 | 0.000200 | 3.875000 | 0.250000 | 89.250000 | 57.000000 | 126.000000 | 0.000000 | 89.250000 | 57.000000 | 126.000000 | 0.216935 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 0.875000 | 0.250000 |
| 104 | 0.000000 | 4.175000 | 0.450000 | 446.750000 | 366.000000 | 512.000000 | 0.250000 | 425.000000 | 366.000000 | 483.000000 | 0.035368 | 1.000000 | 0.000000 | 3.050000 | 0.300000 | 0.125000 | 0.250000 |
| 105 | 0.000100 | 3.400000 | 0.000000 | 44.000000 | 40.000000 | 46.000000 | 0.000000 | 44.000000 | 40.000000 | 46.000000 | 0.074573 | 1.000000 | 0.000000 | 1.400000 | 0.000000 | 1.000000 | 0.000000 |
1060.0003004.0000000.00000035.75000034.00000039.0000000.00000035.75000034.00000039.0000000.2885941.0000000.0000002.0000000.0000001.0000000.000000
1070.0000003.7000000.60000071.00000042.000000156.0000000.00000071.00000042.000000156.0000000.0455131.0000000.0000001.7000000.6000001.0000000.000000
1080.0001004.0000000.336650366.250000358.000000381.0000000.000000366.250000358.000000381.0000000.0623471.0000000.0000002.7500000.5744560.2500000.288675
1090.0000002.5000002.121320512.000000512.000000512.0000001.0000000.0000000.0000000.0000000.0059271.0000000.0000002.0000002.121320-0.5000000.000000
1100.0003004.0000000.00000032.50000022.00000038.0000000.00000032.50000022.00000038.0000000.2917381.0000000.0000002.0000000.0000001.0000000.000000
1110.0002001.0000000.000000106.50000097.000000124.0000000.000000106.50000097.000000124.0000000.2037551.0000000.000000-1.0000000.0000001.0000000.000000
1120.0001004.5000000.707107359.000000299.000000422.0000000.000000359.000000299.000000422.0000000.0539611.0000000.0000003.5000000.5773500.0000000.408248
1130.0001004.3750000.750000342.000000311.000000375.0000000.000000342.000000311.000000375.0000000.0645891.0000000.0000003.2500000.9574270.1250000.250000
1140.0000004.3333330.490654512.000000512.000000512.0000001.0000000.0000000.0000000.0000000.0039161.0000000.0000003.3333330.8606630.0000000.408248
1150.0000004.0000000.00000029.00000029.00000029.0000000.00000029.00000029.00000029.0000000.0119961.0000000.0000002.0000000.0000001.0000000.000000
1160.0004000.6250000.75000030.75000022.00000040.0000000.00000030.75000022.00000040.0000000.3919800.6250000.750000-1.0000000.0000001.0000000.000000
1170.0002009.0000000.000000122.500000109.000000143.0000000.000000122.500000109.000000143.0000000.2144181.0000000.0000007.0000000.0000001.0000000.000000
1180.0002004.0000001.54919391.75000041.000000154.0000000.00000091.75000041.000000154.0000000.1886151.0000000.0000002.0000001.5491931.0000000.000000
1190.0002009.0000000.000000125.500000114.000000146.0000000.000000125.500000114.000000146.0000000.1980611.0000000.0000007.0000000.0000001.0000000.000000
1200.0007004.0000000.00000082.25000029.000000141.0000000.00000082.25000029.000000141.0000000.6949711.0000000.0000002.0000000.0000001.0000000.000000
1210.0001004.2000000.000000390.000000351.000000433.0000000.000000390.000000351.000000433.0000000.0765661.0000000.0000003.2000000.0000000.0000000.000000
1220.0003009.0000000.000000123.250000110.000000151.0000000.000000123.250000110.000000151.0000000.2665761.0000000.0000007.0000000.0000001.0000000.000000
1230.0000009.0000000.00000021.00000021.00000021.0000000.00000021.00000021.00000021.0000000.0012291.0000000.0000007.0000000.0000001.0000000.000000
1240.0002008.2857141.428571106.250000104.000000112.0000000.000000106.250000104.000000112.0000000.2358211.0000000.0000006.2857141.4285711.0000000.000000
1250.0001004.0000000.00000035.50000035.00000036.0000000.00000035.50000035.00000036.0000000.0677471.0000000.0000002.0000000.0000001.0000000.000000
1260.0000004.0000001.000000443.250000274.000000512.0000000.500000374.500000274.000000475.0000000.0479661.0000000.0000003.2500000.957427-0.2500000.500000
1270.0001004.0000000.00000035.50000035.00000036.0000000.00000035.50000035.00000036.0000000.0668261.0000000.0000002.0000000.0000001.0000000.000000
1280.0002009.0000000.000000110.750000106.000000118.0000000.000000110.750000106.000000118.0000000.2120881.0000000.0000007.0000000.0000001.0000000.000000
1290.0000003.6250000.727438512.000000512.000000512.0000001.0000000.0000000.0000000.0000000.0063571.0000000.0000002.7500001.024695-0.1250000.478714
1300.0003005.3750000.567891165.000000154.000000188.0000000.000000165.000000154.000000188.0000000.2786361.0000000.0000003.5000000.6000000.8750000.250000
1310.0001003.5250001.114675271.500000168.000000412.0000000.000000271.500000168.000000412.0000000.1190031.0000000.0000002.1500001.2369320.3750000.250000
1320.0002009.0000000.000000109.500000103.000000115.0000000.000000109.500000103.000000115.0000000.1819591.0000000.0000007.0000000.0000001.0000000.000000
1330.0000002.2500001.936492462.000000368.000000512.0000000.500000412.000000368.000000456.0000000.0378501.0000000.0000001.6250001.887459-0.3750000.250000
1340.0003007.7500002.50000025.25000019.00000032.0000000.00000025.25000019.00000032.0000000.3491841.0000000.0000005.7500002.5000001.0000000.000000
1350.0001004.5000002.415230254.750000112.000000512.0000000.250000169.000000112.000000277.0000000.0969891.0000000.0000003.2500002.5000000.2500000.645497
1360.0100004.0000000.00000077.75000056.000000101.0000000.00000077.75000056.000000101.0000009.9554481.0000000.0000002.0000000.0000001.0000000.000000
1370.0020004.0000000.00000092.75000063.000000105.0000000.00000092.75000063.000000105.0000002.0260591.0000000.0000002.0000000.0000001.0000000.000000
1380.0000004.0000000.00000035.75000035.00000036.0000000.00000035.75000035.00000036.0000000.0323031.0000000.0000002.0000000.0000001.0000000.000000
1390.0002007.0000004.00000035.50000028.00000038.0000000.00000035.50000028.00000038.0000000.2304541.0000000.0000005.0000004.0000001.0000000.000000
1400.0038003.7500000.288675114.25000044.000000210.0000000.000000114.25000044.000000210.0000003.7929461.0000000.0000002.0000000.0000000.7500000.288675
1410.0001004.0000000.00000035.50000034.00000038.0000000.00000035.50000034.00000038.0000000.0882491.0000000.0000002.0000000.0000001.0000000.000000
1420.0000003.8250000.585235481.000000388.000000512.0000000.750000388.000000388.000000388.0000000.0371891.0000000.0000003.2000000.692820-0.3750000.250000
1430.0002003.8750000.250000134.250000105.000000200.0000000.000000134.250000105.000000200.0000000.2361621.0000000.0000002.0000000.0000000.8750000.250000
1440.0001004.3750000.567891438.750000403.000000477.0000000.000000438.750000403.000000477.0000000.1039811.0000000.0000003.5000000.600000-0.1250000.250000
1450.0000002.1750004.141155494.500000442.000000512.0000000.750000442.000000442.000000442.0000000.0381800.2500001.5000002.3000002.218107-0.3750000.478714
1460.0011002.8750003.75000089.00000025.000000281.0000000.00000089.00000025.000000281.0000001.0776441.0000000.0000001.0000004.0000000.8750000.250000
1470.0001002.9500000.493288296.500000239.000000395.0000000.000000296.500000239.000000395.0000000.1375461.0000000.0000001.7000000.6000000.2500000.288675
1480.0003009.0000000.000000114.750000103.000000124.0000000.000000114.750000103.000000124.0000000.2543641.0000000.0000007.0000000.0000001.0000000.000000
1490.0001001.0000000.00000029.25000029.00000030.0000000.00000029.25000029.00000030.0000000.1066321.0000000.000000-1.0000000.0000001.0000000.000000
1500.0002002.4250000.950000225.250000164.000000280.0000000.000000225.250000164.000000280.0000000.1681581.0000000.0000000.8000001.2000000.6250000.250000
1510.0003005.8000000.000000160.000000154.000000166.0000000.000000160.000000154.000000166.0000000.2982641.0000000.0000003.8000000.0000001.0000000.000000
1520.0002003.9750001.257975263.000000184.000000376.0000000.000000263.000000184.000000376.0000000.1620511.0000000.0000002.6000001.3856410.3750000.250000
1530.0000003.8500000.300000512.000000512.000000512.0000001.0000000.0000000.0000000.0000000.0140291.0000000.0000003.3500000.300000-0.5000000.000000
1540.0008004.0000000.00000078.00000035.000000135.0000000.00000078.00000035.000000135.0000000.7840001.0000000.0000002.0000000.0000001.0000000.000000
1550.0004002.7750000.250000276.750000195.000000440.0000000.000000276.750000195.000000440.0000000.4440511.0000000.0000001.4000000.0000000.3750000.250000
1560.0002003.8500000.640312343.500000245.000000512.0000000.250000287.333344245.000000316.0000000.1799471.0000000.0000002.6000000.9797960.2500000.500000
1570.0015002.8750003.75000078.25000025.000000223.0000000.00000078.25000025.000000223.0000001.5334201.0000000.0000001.0000004.0000000.8750000.250000
1580.0000003.4750000.708872512.000000512.000000512.0000001.0000000.0000000.0000000.0000000.0077441.0000000.0000002.6000000.489898-0.1250000.250000
1590.0004005.6750000.250000160.250000146.000000174.0000000.000000160.250000146.000000174.0000000.3551121.0000000.0000003.8000000.0000000.8750000.250000
1600.0004005.5000000.600000136.750000110.000000153.0000000.000000136.750000110.000000153.0000000.4112011.0000000.0000003.5000000.6000001.0000000.000000
1610.0003009.0000000.000000115.000000105.000000125.0000000.000000115.000000105.000000125.0000000.3388421.0000000.0000007.0000000.0000001.0000000.000000
1620.0000004.3500000.574456493.250000437.000000512.0000000.750000437.000000437.000000437.0000000.0481101.0000000.0000003.3500000.5744560.0000000.000000
1630.0002002.5250000.250000306.000000266.000000345.0000000.000000306.000000266.000000345.0000000.2236611.0000000.0000001.4000000.0000000.1250000.250000
1640.0032003.8750004.09013064.75000013.000000116.0000000.00000064.75000013.000000116.0000003.2299001.0000000.0000002.5000003.3166250.3750000.946485
1650.0005004.0000000.00000093.50000064.000000122.0000000.00000093.50000064.000000122.0000000.4524881.0000000.0000002.0000000.0000001.0000000.000000
1660.0000009.0000000.00000019.00000019.00000019.0000000.00000019.00000019.00000019.0000000.0126761.0000000.0000007.0000000.0000001.0000000.000000
1670.0002001.1250001.600781224.500000186.000000274.0000000.000000224.500000186.000000274.0000000.2412731.0000000.000000-0.2500001.5000000.3750000.250000
1680.0001002.2750000.478714460.500000343.000000512.0000000.500000409.000000343.000000475.0000000.1028791.0000000.0000001.4000000.000000-0.1250000.478714
1690.0003004.0000000.00000036.25000035.00000039.0000000.00000036.25000035.00000039.0000000.2566981.0000000.0000002.0000000.0000001.0000000.000000
1700.0000004.0000000.00000035.75000035.00000036.0000000.00000035.75000035.00000036.0000000.0325921.0000000.0000002.0000000.0000001.0000000.000000
1710.0001004.0000000.00000035.25000035.00000036.0000000.00000035.25000035.00000036.0000000.0849681.0000000.0000002.0000000.0000001.0000000.000000
1720.0001004.7500000.288675394.750000296.000000512.0000000.250000355.666687296.000000460.0000000.1374491.0000000.0000003.7500000.5000000.0000000.408248
1730.0006004.0000000.00000028.50000022.00000038.0000000.00000028.50000022.00000038.0000000.5503991.0000000.0000002.0000000.0000001.0000000.000000
1740.0004009.0000000.000000125.750000104.000000165.0000000.000000125.750000104.000000165.0000000.3739911.0000000.0000007.0000000.0000001.0000000.000000
1750.0002003.8000001.219290293.000000211.000000334.0000000.000000293.000000211.000000334.0000000.2210691.0000000.0000002.3000001.1489130.5000000.408248
1760.0000004.0750000.262996496.000000448.000000512.0000000.750000448.000000448.000000448.0000000.0329461.0000000.0000003.2000000.489898-0.1250000.478714
1770.0000000.6000003.814883494.250000441.000000512.0000000.750000441.000000441.000000441.0000000.0320720.2500001.5000001.1000002.473863-0.7500000.288675
1780.0001004.3750001.600781357.000000299.000000411.0000000.000000357.000000299.000000411.0000000.1486861.0000000.0000003.2500001.5000000.1250000.250000
1790.0005003.8500001.674316317.000000209.000000424.0000000.000000317.000000209.000000424.0000000.4731331.0000000.0000002.6000001.3856410.2500000.288675
1800.0003004.0000000.000000114.000000104.000000135.0000000.000000114.000000104.000000135.0000000.2763661.0000000.0000002.0000000.0000001.0000000.000000
1810.0001004.1750000.050000393.000000370.000000438.0000000.000000393.000000370.000000438.0000000.1466561.0000000.0000003.0500000.3000000.1250000.250000
1820.0028008.8750000.250000131.750000103.000000211.0000000.000000131.750000103.000000211.0000002.8202621.0000000.0000007.0000000.0000000.8750000.250000
1830.0003009.0000000.000000107.250000104.000000113.0000000.000000107.250000104.000000113.0000000.3286661.0000000.0000007.0000000.0000001.0000000.000000
1840.0001003.8750002.809952374.500000258.000000437.0000000.000000374.500000258.000000437.0000000.1474281.0000000.0000003.2500002.783882-0.3750000.250000
1850.0001002.2500000.500000484.750000403.000000512.0000000.750000403.000000403.000000403.0000000.0507331.0000000.0000001.6250000.750000-0.3750000.250000
1860.0007008.6250000.750000210.250000107.000000512.0000000.250000109.666672107.000000115.0000000.7079761.0000000.0000007.0000000.0000000.6250000.750000
1870.0004005.8000000.000000163.250000147.000000179.0000000.000000163.250000147.000000179.0000000.4110701.0000000.0000003.8000000.0000001.0000000.000000
1880.0002004.2000000.000000374.250000358.000000389.0000000.000000374.250000358.000000389.0000000.1552561.0000000.0000003.2000000.0000000.0000000.000000
1890.0003003.6250000.25000090.75000061.000000108.0000000.00000090.75000061.000000108.0000000.2810721.0000000.0000002.0000000.0000000.6250000.250000
1900.0001001.0000000.00000030.00000029.00000031.0000000.00000030.00000029.00000031.0000000.1070631.0000000.000000-1.0000000.0000001.0000000.000000
1910.0000009.0000000.00000021.00000021.00000021.0000000.00000021.00000021.00000021.0000000.0018961.0000000.0000007.0000000.0000001.0000000.000000
1920.0004005.8000000.000000150.000000135.000000160.0000000.000000150.000000135.000000160.0000000.4062991.0000000.0000003.8000000.0000001.0000000.000000
1930.0001004.1250000.478714385.000000260.000000459.0000000.000000385.000000260.000000459.0000000.1329871.0000000.0000003.2500000.500000-0.1250000.478714
1940.0000004.0000000.00000034.75000034.00000035.0000000.00000034.75000034.00000035.0000000.0327221.0000000.0000002.0000000.0000001.0000000.000000
1950.0002003.5000000.496655251.000000183.000000348.0000000.000000251.000000183.000000348.0000000.2072041.0000000.0000002.0000000.6928200.5000000.408248
1960.0001004.7500001.848423223.50000058.000000510.0000000.000000223.50000058.000000510.0000000.1207321.0000000.0000003.2500002.5000000.5000000.707107
1970.0005001.0000000.00000062.25000032.00000088.0000000.00000062.25000032.00000088.0000000.5354391.0000000.000000-1.0000000.0000001.0000000.000000
1980.0030002.7500002.50000070.00000013.000000103.0000000.00000070.00000013.000000103.0000002.9587341.0000000.0000001.2500001.5000000.5000001.000000
1990.0004003.8750000.25000075.00000056.000000119.0000000.00000075.00000056.000000119.0000000.4033741.0000000.0000002.0000000.0000000.8750000.250000
2000.0002008.6250000.750000237.500000104.000000512.0000000.250000146.000000104.000000221.0000000.1794831.0000000.0000007.0000000.0000000.6250000.750000
2010.0001002.3500004.239890483.250000397.000000512.0000000.750000397.000000397.000000397.0000000.0894230.2500001.5000002.6000002.400000-0.5000000.408248
2020.0004009.0000000.000000114.000000110.000000120.0000000.000000114.000000110.000000120.0000000.3787401.0000000.0000007.0000000.0000001.0000000.000000
2030.0004005.8000000.000000160.750000150.000000170.0000000.000000160.750000150.000000170.0000000.3868501.0000000.0000003.8000000.0000001.0000000.000000
2040.0000009.0000000.00000038.00000038.00000038.0000000.00000038.00000038.00000038.0000000.0284711.0000000.0000007.0000000.0000001.0000000.000000
2050.0001004.0000000.00000034.00000032.00000035.0000000.00000034.00000032.00000035.0000000.0715491.0000000.0000002.0000000.0000001.0000000.000000
2060.0008003.2500001.50000037.75000028.00000054.0000000.00000037.75000028.00000054.0000000.7866701.0000000.0000001.2500001.5000001.0000000.000000
2070.0000004.1666670.544331512.000000512.000000512.0000001.0000000.0000000.0000000.0000000.0144941.0000000.0000003.6666670.544331-0.5000000.000000
2080.0002004.3250000.250000355.500000328.000000389.0000000.000000355.500000328.000000389.0000000.1676841.0000000.0000003.2000000.0000000.1250000.250000
2090.0009004.0000000.00000040.50000034.00000051.0000000.00000040.50000034.00000051.0000000.8534661.0000000.0000002.0000000.0000001.0000000.000000
2100.0001004.2500000.866025368.750000300.000000512.0000000.250000321.000000300.000000345.0000000.1298391.0000000.0000003.2500000.9574270.0000000.408248
2110.0001002.6250001.250000491.500000430.000000512.0000000.750000430.000000430.000000430.0000000.0514081.0000000.0000002.0000001.224745-0.3750000.250000
2120.0002003.3750001.450000256.000000190.000000365.0000000.000000256.000000190.000000365.0000000.1831751.0000000.0000002.0000001.2000000.3750000.250000
2130.0007009.0000000.000000112.750000111.000000114.0000000.000000112.750000111.000000114.0000000.7131631.0000000.0000007.0000000.0000001.0000000.000000
2140.0004001.8750001.600781200.000000114.000000287.0000000.000000200.000000114.000000287.0000000.3972981.0000000.0000000.5000001.7320510.3750000.250000
2150.0004009.0000000.000000115.500000106.000000139.0000000.000000115.500000106.000000139.0000000.3663541.0000000.0000007.0000000.0000001.0000000.000000
2160.0002005.2500001.554563350.000000180.000000512.0000000.250000296.000000180.000000474.0000000.2057161.0000000.0000004.3750001.750000-0.1250000.478714
2170.0002002.4750000.850000259.750000189.000000386.0000000.000000259.750000189.000000386.0000000.1572381.0000000.0000001.1000000.6000000.3750000.250000
2180.0000004.0000000.00000028.75000028.00000029.0000000.00000028.75000028.00000029.0000000.0307681.0000000.0000002.0000000.0000001.0000000.000000
2190.0002003.8750000.250000109.75000042.000000207.0000000.000000109.75000042.000000207.0000000.1959171.0000000.0000002.0000000.0000000.8750000.250000
2200.0002002.9500000.100000307.000000212.000000512.0000000.250000238.666672212.000000261.0000000.1894391.0000000.0000001.7000000.6000000.2500000.500000
2210.0001004.0000000.00000035.50000035.00000036.0000000.00000035.50000035.00000036.0000000.0504481.0000000.0000002.0000000.0000001.0000000.000000
2220.0002004.7500000.500000335.750000301.000000354.0000000.000000335.750000301.000000354.0000000.1773921.0000000.0000003.7500000.5000000.0000000.000000
2230.0005005.8000000.000000142.750000135.000000155.0000000.000000142.750000135.000000155.0000000.4723281.0000000.0000003.8000000.0000001.0000000.000000
2240.0001004.2750001.372042316.750000227.000000500.0000000.000000316.750000227.000000500.0000000.1391201.0000000.0000002.9000001.1489130.3750000.250000
2250.0001004.0000000.00000035.00000034.00000036.0000000.00000035.00000034.00000036.0000000.0782141.0000000.0000002.0000000.0000001.0000000.000000
2260.0001003.8250000.708872500.500000466.000000512.0000000.750000466.000000466.000000466.0000000.0506741.0000000.0000003.2000000.489898-0.3750000.250000
2270.0001004.8750000.478714391.500000280.000000512.0000000.250000351.333344280.000000440.0000000.1256861.0000000.0000003.7500000.5000000.1250000.478714
2280.0001004.2083330.567238452.500000336.000000512.0000000.500000393.000000336.000000450.0000000.0782881.0000000.0000003.3333330.384900-0.1250000.250000
2290.0002009.0000000.000000118.000000113.000000121.0000000.000000118.000000113.000000121.0000000.1609451.0000000.0000007.0000000.0000001.0000000.000000
2300.0004005.8000000.000000159.750000151.000000173.0000000.000000159.750000151.000000173.0000000.4256801.0000000.0000003.8000000.0000001.0000000.000000
2310.0000009.0000000.00000019.00000019.00000019.0000000.00000019.00000019.00000019.0000000.0147431.0000000.0000007.0000000.0000001.0000000.000000
2320.0000003.9500000.351188512.000000512.000000512.0000001.0000000.0000000.0000000.0000000.0141141.0000000.0000003.2000000.489898-0.2500000.288675
2330.0002004.3000000.594419334.000000246.000000393.0000000.000000334.000000246.000000393.0000000.1786141.0000000.0000003.0500000.7549830.2500000.288675
2340.0004000.6250000.75000026.25000022.00000031.0000000.00000026.25000022.00000031.0000000.3620010.6250000.750000-1.0000000.0000001.0000000.000000
2350.0004008.2857141.428571109.250000103.000000119.0000000.000000109.250000103.000000119.0000000.4088371.0000000.0000006.2857141.4285711.0000000.000000
2360.0004006.6000001.600000160.500000152.000000173.0000000.000000160.500000152.000000173.0000000.4245301.0000000.0000004.6000001.6000001.0000000.000000
2370.0000004.0000000.00000035.50000035.00000037.0000000.00000035.50000035.00000037.0000000.0489221.0000000.0000002.0000000.0000001.0000000.000000
2380.0001003.8500000.300000495.250000455.000000512.0000000.500000478.500000455.000000502.0000000.0686221.0000000.0000003.3500000.300000-0.5000000.000000
2390.0004009.0000000.000000104.500000102.000000110.0000000.000000104.500000102.000000110.0000000.3915461.0000000.0000007.0000000.0000001.0000000.000000
2400.0001004.5500000.288675471.750000351.000000512.0000000.750000351.000000351.000000351.0000000.0503771.0000000.0000003.8000000.000000-0.2500000.288675
2410.0001002.8250000.298608385.000000229.000000512.0000000.250000342.666687229.000000439.0000000.1380721.0000000.0000001.7000000.6000000.1250000.478714
2420.0003005.8000000.000000147.750000133.000000155.0000000.000000147.750000133.000000155.0000000.3094501.0000000.0000003.8000000.0000001.0000000.000000
2430.0008000.8750000.25000060.25000020.000000104.0000000.00000060.25000020.000000104.0000000.7636401.0000000.000000-1.0000000.0000000.8750000.250000
2440.0002004.3250000.250000383.750000340.000000423.0000000.000000383.750000340.000000423.0000000.1591221.0000000.0000003.2000000.0000000.1250000.250000
2450.0006004.0000000.00000072.50000052.00000089.0000000.00000072.50000052.00000089.0000000.5758031.0000000.0000002.0000000.0000001.0000000.000000
2460.0002004.4750000.320156355.250000323.000000395.0000000.000000355.250000323.000000395.0000000.1717801.0000000.0000003.3500000.3000000.1250000.250000
2470.0004009.0000000.000000105.250000103.000000111.0000000.000000105.250000103.000000111.0000000.4133061.0000000.0000007.0000000.0000001.0000000.000000
2480.0000009.0000000.00000021.00000021.00000021.0000000.00000021.00000021.00000021.0000000.0014021.0000000.0000007.0000000.0000001.0000000.000000
2490.0005009.0000000.000000111.500000105.000000126.0000000.000000111.500000105.000000126.0000000.4850611.0000000.0000007.0000000.0000001.0000000.000000
2500.0004003.8750000.250000128.75000080.000000218.0000000.000000128.75000080.000000218.0000000.3913121.0000000.0000002.0000000.0000000.8750000.250000
2510.0003003.5000000.000000152.250000128.000000180.0000000.000000152.250000128.000000180.0000000.3429531.0000000.0000002.0000000.0000000.5000000.000000
2520.0004005.8000000.000000150.250000145.000000155.0000000.000000150.250000145.000000155.0000000.3536071.0000000.0000003.8000000.0000001.0000000.000000
2530.0001004.5500000.288675467.750000406.000000512.0000000.500000423.500000406.000000441.0000000.0899151.0000000.0000003.8000000.000000-0.2500000.288675
2540.0004004.0000000.000000115.250000105.000000132.0000000.000000115.250000105.000000132.0000000.4167701.0000000.0000002.0000000.0000001.0000000.000000
2550.0005003.8750000.25000098.00000065.000000142.0000000.00000098.00000065.000000142.0000000.4854241.0000000.0000002.0000000.0000000.8750000.250000
2560.0003009.0000000.00000034.75000025.00000038.0000000.00000034.75000025.00000038.0000000.2958371.0000000.0000007.0000000.0000001.0000000.000000
2570.0005004.0000000.00000084.50000058.000000115.0000000.00000084.50000058.000000115.0000000.5272581.0000000.0000002.0000000.0000001.0000000.000000
2580.0024002.8750003.75000084.25000031.000000242.0000000.00000084.25000031.000000242.0000002.3520101.0000000.0000001.0000004.0000000.8750000.250000
2590.0010004.0000000.00000026.00000022.00000038.0000000.00000026.00000022.00000038.0000000.9506421.0000000.0000002.0000000.0000001.0000000.000000
2600.0000004.0000000.00000035.75000035.00000036.0000000.00000035.75000035.00000036.0000000.0167051.0000000.0000002.0000000.0000001.0000000.000000
2610.0001005.0000000.000000317.750000254.000000342.0000000.000000317.750000254.000000342.0000000.1386821.0000000.0000004.0000000.0000000.0000000.000000
2620.0005009.0000000.000000106.500000103.000000115.0000000.000000106.500000103.000000115.0000000.4702351.0000000.0000007.0000000.0000001.0000000.000000
2630.0001004.1000000.270801475.000000378.000000512.0000000.500000438.000000378.000000498.0000000.0756811.0000000.0000003.3500000.300000-0.2500000.288675
2640.0003003.8750000.25000081.75000045.000000171.0000000.00000081.75000045.000000171.0000000.2813021.0000000.0000002.0000000.0000000.8750000.250000
2650.0006000.6250000.75000076.75000042.00000094.0000000.00000076.75000042.00000094.0000000.5798450.6250000.750000-1.0000000.0000001.0000000.000000
2660.0000004.0000000.00000035.25000035.00000036.0000000.00000035.25000035.00000036.0000000.0352281.0000000.0000002.0000000.0000001.0000000.000000
2670.0000009.0000000.00000019.00000019.00000019.0000000.00000019.00000019.00000019.0000000.0150871.0000000.0000007.0000000.0000001.0000000.000000
2680.0002004.0000000.00000087.75000040.000000130.0000000.00000087.75000040.000000130.0000000.2338231.0000000.0000002.0000000.0000001.0000000.000000
2690.0006004.0000000.00000061.25000058.00000065.0000000.00000061.25000058.00000065.0000000.5712421.0000000.0000002.0000000.0000001.0000000.000000
2700.0003009.0000000.000000109.250000105.000000114.0000000.000000109.250000105.000000114.0000000.3144841.0000000.0000007.0000000.0000001.0000000.000000
2710.0001002.7500002.254625400.000000246.000000512.0000000.500000288.000000246.000000330.0000000.0942481.0000000.0000002.0000002.121320-0.2500000.288675
2720.0000004.0000000.00000035.75000035.00000036.0000000.00000035.75000035.00000036.0000000.0169871.0000000.0000002.0000000.0000001.0000000.000000
2730.0003005.8000000.000000170.000000159.000000180.0000000.000000170.000000159.000000180.0000000.3049931.0000000.0000003.8000000.0000001.0000000.000000
2740.0002004.8750000.250000364.750000316.000000462.0000000.000000364.750000316.000000462.0000000.1565751.0000000.0000004.0000000.000000-0.1250000.250000
2750.0000004.1250000.287228512.000000512.000000512.0000001.0000000.0000000.0000000.0000000.0206171.0000000.0000003.5000000.346410-0.3750000.250000
2760.0002001.0000000.00000027.00000022.00000029.0000000.00000027.00000022.00000029.0000000.2485331.0000000.000000-1.0000000.0000001.0000000.000000
2770.0004009.0000000.000000105.750000103.000000111.0000000.000000105.750000103.000000111.0000000.4087021.0000000.0000007.0000000.0000001.0000000.000000
2780.0002009.0000000.000000112.000000108.000000115.0000000.000000112.000000108.000000115.0000000.2130451.0000000.0000007.0000000.0000001.0000000.000000
2790.0005000.7500000.288675103.00000071.000000133.0000000.000000103.00000071.000000133.0000000.4730691.0000000.000000-1.0000000.0000000.7500000.288675
2800.0000004.0000000.00000038.00000038.00000038.0000000.00000038.00000038.00000038.0000000.0055321.0000000.0000002.0000000.0000001.0000000.000000
2810.0001004.0750001.141271443.500000349.000000512.0000000.500000375.000000349.000000401.0000000.0876451.0000000.0000003.2000001.200000-0.1250000.250000
2820.0002005.1250000.250000325.000000279.000000404.0000000.000000325.000000279.000000404.0000000.1822641.0000000.0000004.0000000.0000000.1250000.250000
2830.0021002.0000002.44949055.50000013.00000083.0000000.00000055.50000013.00000083.0000002.0682591.0000000.0000000.5000001.7320510.5000001.000000
2840.0004009.0000000.000000107.500000105.000000113.0000000.000000107.500000105.000000113.0000000.3975001.0000000.0000007.0000000.0000001.0000000.000000
2850.0002004.7000001.200000249.250000217.000000325.0000000.000000249.250000217.000000325.0000000.2187071.0000000.0000003.2000001.2000000.5000000.000000
2860.0002002.9000000.979796199.500000139.000000284.0000000.000000199.500000139.000000284.0000000.2398781.0000000.0000001.4000000.9797960.5000000.000000
2870.0000009.0000000.00000021.00000021.00000021.0000000.00000021.00000021.00000021.0000000.0012511.0000000.0000007.0000000.0000001.0000000.000000
2880.0006004.0000000.00000090.75000076.000000108.0000000.00000090.75000076.000000108.0000000.6164541.0000000.0000002.0000000.0000001.0000000.000000
2890.0002004.7500000.640312275.000000225.000000412.0000000.000000275.000000225.000000412.0000000.1839471.0000000.0000003.5000000.6000000.2500000.500000
2900.0000004.4166670.319142512.000000512.000000512.0000001.0000000.0000000.0000000.0000000.0178411.0000000.0000003.6666670.544331-0.2500000.500000
2910.0003003.3250000.850000228.250000177.000000274.0000000.000000228.250000177.000000274.0000000.2553601.0000000.0000001.7000000.6000000.6250000.250000
2920.0001004.5500000.288675442.500000331.000000512.0000000.500000373.000000331.000000415.0000000.0878261.0000000.0000003.8000000.000000-0.2500000.288675
2930.0002005.8000000.000000154.500000150.000000157.0000000.000000154.500000150.000000157.0000000.1996201.0000000.0000003.8000000.0000001.0000000.000000
2940.0004008.5000000.000000183.500000163.000000210.0000000.000000183.500000163.000000210.0000000.3577331.0000000.0000007.0000000.0000000.5000000.000000
2950.0002005.8000000.000000146.750000141.000000149.0000000.000000146.750000141.000000149.0000000.1816351.0000000.0000003.8000000.0000001.0000000.000000
2960.0005009.0000000.00000052.75000038.00000097.0000000.00000052.75000038.00000097.0000000.4795421.0000000.0000007.0000000.0000001.0000000.000000
2970.0005009.0000000.000000111.750000104.000000125.0000000.000000111.750000104.000000125.0000000.5010091.0000000.0000007.0000000.0000001.0000000.000000
2980.0000004.0000000.00000036.00000036.00000036.0000000.00000036.00000036.00000036.0000000.0090921.0000000.0000002.0000000.0000001.0000000.000000
2990.0003003.8750000.250000128.500000105.000000172.0000000.000000128.500000105.000000172.0000000.3394551.0000000.0000002.0000000.0000000.8750000.250000
3000.0002004.5750000.250000374.500000346.000000420.0000000.000000374.500000346.000000420.0000000.1670721.0000000.0000003.2000000.0000000.3750000.250000

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "--- [task_0017] [medium] ---\n", "Instruction: Onboard new hire Maria Garcia to Marketing as L1 Marketing Associate. Create the...\n", "Tool calls: ['hr_create_employee', 'hr_read_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "\n", "--- [task_0018] [medium] ---\n", "Instruction: Onboard new hire James Wilson to Data Science as L2 Data Analyst. Create their e...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "\n", "--- [task_0072] [complex] ---\n", "Instruction: Rehire Marie Dubois (emp_0064) who was previously offboarded. Update their statu...\n", "Tool calls: ['onboarding_create_request', 'hr_search_employees', 'hr_get_org_chart', 'it_assign_asset', 'it_get_available_assets', 'it_create_account', 'it_revoke_access', 'it_get_software_licenses', 'access_assign_role', 'access_get_security_groups']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0036] [simple] ---\n", "Instruction: Check the offboarding status for Min Hu (emp_0113)....\n", "Tool calls: ['offboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0067] [edge_case] ---\n", "Instruction: Before onboarding a new Security team member, look up the badge access policy an...\n", "Tool calls: ['hr_read_employee']\n", "Rubric: 0% (0/2)\n", "Reward: -1.00\n", "\n", "--- [task_0011] [simple] ---\n", "Instruction: Check the onboarding status for employee Rohan Patel (emp_0011)....\n", "Tool calls: ['hr_get_org_chart']\n", "Rubric: 0% (0/2)\n", "Reward: -1.00\n", "\n", "--- [task_0048] [complex] ---\n", "Instruction: Fully offboard Henrik Becker (emp_0069), a L4 Head of Enterprise Sales in Sales ...\n", "Tool calls: ['offboarding_create_request', 'offboarding_get_status', 'access_revoke_role', 'access_get_security_groups', 
'reassign_report', 'email_send']\n", "Rubric: 50% (3/6)\n", "Reward: 2.00\n", "\n", "--- [task_0002] [simple] ---\n", "Instruction: Look up the employee record for Ingrid Larsson (ID: emp_0025)....\n", "Tool calls: ['hr_read_employee']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0046] [medium] ---\n", "Instruction: Initiate offboarding for Brian Jones (emp_0075) who taking a career break. Set t...\n", "Tool calls: ['offboarding_create_request']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0056] [edge_case] ---\n", "Instruction: Onboard a new L1 Associate to the Marketing department. Create their employee re...\n", "Tool calls: ['hr_create_employee', 'hr_read_employee', 'onboarding_create_request']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0014] [simple] ---\n", "Instruction: Check if there are available laptops and Jira licenses for a new Engineering hir...\n", "Tool calls: ['it_get_available_assets']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0030] [complex] ---\n", "Instruction: Onboard Sanjay Gupta as L2 Security Analyst in Security. Create the employee rec...\n", "Tool calls: ['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'hr_update_employee', 'hr_search_employees', 'hr_get_org_chart', 'onboarding_get_status', 'onboarding_complete_step', 'it_assign_asset']\n", "Rubric: 30% (3/10)\n", "Reward: 0.80\n", "\n", "--- [task_0053] [complex] ---\n", "Instruction: Process the complete offboarding for Marta Wagner (emp_0110) from HR. 
Create the...\n", "Tool calls: ['onboarding_complete_step', 'it_get_available_assets', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'meeting_schedule']\n", "Rubric: 20% (1/5)\n", "Reward: 0.20\n", "\n", "--- [task_0070] [complex] ---\n", "Instruction: Robert Garcia (emp_0133) is transferring from Data Science to Engineering. Proce...\n", "Tool calls: ['offboard_access', 'offboard_request', 'access_assign_role', 'onboarding_create_request', 'access_create_badge', 'access_get_security_groups', 'slack_send_message']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0027] [complex] ---\n", "Instruction: Fully onboard Carlos Mendez as L3 Senior Security Engineer in Security. Their ma...\n", "Tool calls: ['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'it_assign_asset', 'it_get_available_assets', 'it_create_account']\n", "Rubric: 50% (5/10)\n", "Reward: 2.00\n", "\n", "--- [task_0068] [complex] ---\n", "Instruction: Patricia Brown (emp_0172) is transferring from Engineering to Product. Process t...\n", "Tool calls: ['offboard_access', 'onboard_request', 'access_assign_role', 'email_send', 'meeting_schedule']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0074] [medium] ---\n", "Instruction: The Product team is onboarding 2 new hires at the same time. Check available ass...\n", "Tool calls: ['it_get_available_assets', 'it_get_available_software_licenses']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0025] [complex] ---\n", "Instruction: Fully onboard John Lee as L3 Team Lead - ML in Data Science. 
Their manager will ...\n", "Tool calls: ['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_complete_step', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status', 'onboarding_get_status']\n", "Rubric: 30% (3/10)\n", "Reward: 0.80\n", "\n", "--- [task_0032] [complex] ---\n", "Instruction: Onboard Hassan Ahmed as L3 Lead Data Scientist in Data Science. Create the emplo...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'hr_get_org_chart', 'hr_assign_asset', 'it_create_account', 'access_assign_role', 'access_create_badge', 'email_send']\n", "Rubric: 67% (6/9)\n", "Reward: 3.00\n", "\n", "--- [task_0054] [complex] ---\n", "Instruction: Process the complete offboarding for Jun Zheng (emp_0068) from Sales. Create the...\n", "Tool calls: ['offboarding_create_request', 'offboarding_get_status', 'offboarding_revoke_access', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'slack_send_message', 'meeting_schedule', 'hr_get_org_chart', 'hr_update_employee', 'hr_search_employees', 'hr_update_employee']\n", "Rubric: 60% (3/5)\n", "Reward: 2.60\n", "\n", "--- [task_0031] [complex] ---\n", "Instruction: Onboard Nina Petrova as L4 Director of Platform in Engineering. Create the emplo...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'it_create_account', 'access_assign_role']\n", "Rubric: 50% (5/10)\n", "Reward: 2.00\n", "\n", "--- [task_0029] [complex] ---\n", "Instruction: Fully onboard Raj Kapoor as L2 Backend Developer in Engineering. 
Their manager w...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'onboarding_get_status', 'it_assign_asset', 'it_get_available_assets', 'it_create_account', 'it_get_software_licenses']\n", "Rubric: 50% (5/10)\n", "Reward: 2.00\n", "\n", "--- [task_0067] [edge_case] ---\n", "Instruction: Before onboarding a new Security team member, look up the badge access policy an...\n", "Tool calls: ['hr_read_employee', 'hr_search_employees', 'hr_get_org_chart', 'hr_update_employee']\n", "Rubric: 0% (0/2)\n", "Reward: -1.00\n", "\n", "--- [task_0034] [complex] ---\n", "Instruction: Onboard Kevin O'Brien as L4 VP of Product in Product. Create the employee record...\n", "Tool calls: ['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'access_assign_role', 'access_create_badge', 'it_get_available_assets', 'it_create_account', 'it_get_software_licenses', 'access_revoke_role', 'access_get_security_groups']\n", "Rubric: 50% (5/10)\n", "Reward: 2.00\n", "\n", "--- [task_0018] [medium] ---\n", "Instruction: Onboard new hire James Wilson to Data Science as L2 Data Analyst. Create their e...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "\n", "--- [task_0056] [edge_case] ---\n", "Instruction: Onboard a new L1 Associate to the Marketing department. Create their employee re...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0054] [complex] ---\n", "Instruction: Process the complete offboarding for Jun Zheng (emp_0068) from Sales. 
Create the...\n", "Tool calls: ['offboarding_create_request', 'offboarding_create_request', 'access_revoke_role', 'access_revoke_role', 'it_get_available_assets', 'it_get_available_assets', 'email_send', 'slack_send_message']\n", "Rubric: 60% (3/5)\n", "Reward: 2.60\n", "\n", "--- [task_0046] [medium] ---\n", "Instruction: Initiate offboarding for Brian Jones (emp_0075) who taking a career break. Set t...\n", "Tool calls: ['offboarding_create_request']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0073] [medium] ---\n", "Instruction: The Engineering team is onboarding 2 new hires at the same time. Check available...\n", "Tool calls: ['it_get_available_assets', 'it_get_available_software_licenses']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0074] [medium] ---\n", "Instruction: The Product team is onboarding 2 new hires at the same time. Check available ass...\n", "Tool calls: ['it_get_available_assets', 'it_get_available_software_licenses', 'access_assign_role']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0048] [complex] ---\n", "Instruction: Fully offboard Henrik Becker (emp_0069), a L4 Head of Enterprise Sales in Sales ...\n", "Tool calls: ['offboarding_create_request', 'it_revoke_access', 'it_get_software_licenses', 'access_revoke_role', 'offboarding_create_request', 'email_send', 'it_get_available_assets', 'access_assign_role', 'onboarding_get_status', 'onboarding_complete_step', '__done__']\n", "Rubric: 67% (4/6)\n", "Reward: 3.00\n", "\n", "--- [task_0040] [medium] ---\n", "Instruction: Initiate offboarding for Kavya Desai (emp_0034) who resigning. 
Set the reason to...\n", "Tool calls: ['offboarding_create_request']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0011] [simple] ---\n", "Instruction: Check the onboarding status for employee Rohan Patel (emp_0011)....\n", "Tool calls: ['hr_get_org_chart', 'onboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0023] [medium] ---\n", "Instruction: Onboard new hire Li Wei to Engineering as L3 Senior Engineer. Create their emplo...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "\n", "--- [task_0045] [medium] ---\n", "Instruction: Initiate offboarding for Susan Davis (emp_0091) who retiring. Set the reason to ...\n", "Tool calls: ['offboarding_create_request']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0030] [complex] ---\n", "Instruction: Onboard Sanjay Gupta as L2 Security Analyst in Security. Create the employee rec...\n", "Tool calls: ['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'access_assign_role', 'access_create_badge', 'email_send', 'meeting_schedule']\n", "Rubric: 60% (6/10)\n", "Reward: 2.60\n", "\n", "--- [task_0061] [edge_case] ---\n", "Instruction: Onboard contractor Amit Verma to Engineering as an L2 Contract Developer. Contra...\n", "Tool calls: ['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'offboarding_create_request', 'offboarding_get_status', 'it_assign_asset', 'it_get_available_assets', 'it_create_account', 'it_revoke_access', 'it_get_software_licenses', 'access_assign_role', 'access_get_security_groups']\n", "Rubric: 50% (2/4)\n", "Reward: 2.00\n", "\n", "--- [task_0065] [edge_case] ---\n", "Instruction: Assign the security_admin access role to a new L1 Security Associate. 
The securi...\n", "Tool calls: ['access_assign_role']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0036] [simple] ---\n", "Instruction: Check the offboarding status for Min Hu (emp_0113)....\n", "Tool calls: ['offboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0038] [simple] ---\n", "Instruction: Check the offboarding status for Tao Chen (emp_0020)....\n", "Tool calls: ['offboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0046] [medium] ---\n", "Instruction: Initiate offboarding for Brian Jones (emp_0075) who taking a career break. Set t...\n", "Tool calls: ['offboarding_create_request']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0040] [medium] ---\n", "Instruction: Initiate offboarding for Kavya Desai (emp_0034) who resigning. Set the reason to...\n", "Tool calls: ['offboarding_create_request']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0002] [simple] ---\n", "Instruction: Look up the employee record for Ingrid Larsson (ID: emp_0025)....\n", "Tool calls: ['hr_read_employee']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0051] [complex] ---\n", "Instruction: Fully offboard Sergio Ferrari (emp_0198), a L3 Security Architect in Security wh...\n", "Tool calls: ['offboarding_create_request', 'offboarding_get_status', 'it_revoke_access', 'it_get_software_licenses', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'onboarding_get_status', 'access_assign_role', 'access_get_security_groups', 'meeting_schedule', 'hr_update_employee']\n", "Rubric: 83% (5/6)\n", "Reward: 4.00\n", "\n", "--- [task_0066] [edge_case] ---\n", "Instruction: A Marketing employee needs access to the Engineering GitHub repository. Try to a...\n", "Tool calls: ['access_assign_role']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0019] [medium] ---\n", "Instruction: Onboard new hire Aisha Patel to Sales as L1 Sales Representative. 
Create their e...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "\n", "--- [task_0014] [simple] ---\n", "Instruction: Check if there are available laptops and Jira licenses for a new Engineering hir...\n", "Tool calls: ['it_get_available_assets']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0018] [medium] ---\n", "Instruction: Onboard new hire James Wilson to Data Science as L2 Data Analyst. Create their e...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "\n", "--- [task_0036] [simple] ---\n", "Instruction: Check the offboarding status for Min Hu (emp_0113)....\n", "Tool calls: ['offboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0035] [simple] ---\n", "Instruction: Check the offboarding status for Thomas White (emp_0035)....\n", "Tool calls: ['offboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0042] [medium] ---\n", "Instruction: Initiate offboarding for Pierre Laurent (emp_0153) who moving to a different cit...\n", "Tool calls: ['offboarding_create_request', 'access_revoke_role', 'email_send']\n", "Rubric: 60% (3/5)\n", "Reward: 2.60\n", "\n", "--- [task_0023] [medium] ---\n", "Instruction: Onboard new hire Li Wei to Engineering as L3 Senior Engineer. Create their emplo...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "Error replaying: 1 validation error for HROnboardingAction\n", "arguments\n", " Input should be a valid dictionary [type=dict_type, input_value=\"['offboarding']\", input_type=str]\n", " For further information visit https://errors.pydantic.dev/2.12/v/dict_type\n", "\n", "--- [task_0074] [medium] ---\n", "Instruction: The Product team is onboarding 2 new hires at the same time. 
Check available ass...\n", "Tool calls: ['it_get_available_assets', 'it_get_available_assets', 'it_create_account', 'it_create_account']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0011] [simple] ---\n", "Instruction: Check the onboarding status for employee Rohan Patel (emp_0011)....\n", "Tool calls: ['hr_get_org_chart', 'onboarding_get_status', 'onboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0073] [medium] ---\n", "Instruction: The Engineering team is onboarding 2 new hires at the same time. Check available...\n", "Tool calls: ['it_get_available_assets', 'it_get_available_software_licenses', 'access_assign_role', 'access_create_badge', 'email_send']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0053] [complex] ---\n", "Instruction: Process the complete offboarding for Marta Wagner (emp_0110) from HR. Create the...\n", "Tool calls: ['offboarding_create_request', 'it_revoke_access', 'it_get_software_licenses', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'slack_send_message', 'meeting_schedule', 'onboarding_get_status', 'access_get_security_groups', 'it_get_available_assets', 'access_assign_role']\n", "Rubric: 80% (4/5)\n", "Reward: 3.80\n", "\n", "--- [task_0031] [complex] ---\n", "Instruction: Onboard Nina Petrova as L4 Director of Platform in Engineering. Create the emplo...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'it_get_available_assets', 'it_create_account', 'access_assign_role', 'access_create_badge', 'access_revoke_role', 'email_send', 'meeting_schedule']\n", "Rubric: 80% (8/10)\n", "Reward: 3.80\n", "\n", "--- [task_0016] [medium] ---\n", "Instruction: Onboard new hire Alex Chen to Product as L2 Product Analyst. 
Create their employ...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "\n", "--- [task_0070] [complex] ---\n", "Instruction: Robert Garcia (emp_0133) is transferring from Data Science to Engineering. Proce...\n", "Tool calls: ['offboarding_create_request', 'access_revoke_role', 'onboarding_get_status']\n", "Rubric: 0% (0/5)\n", "Reward: -1.00\n", "\n", "--- [task_0034] [complex] ---\n", "Instruction: Onboard Kevin O'Brien as L4 VP of Product in Product. Create the employee record...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'it_assign_asset', 'it_get_available_assets', 'it_create_account', 'it_revoke_access', 'access_assign_role', 'access_create_badge', 'access_revoke_role', 'email_send']\n", "Rubric: 70% (7/10)\n", "Reward: 3.20\n", "\n", "--- [task_0068] [complex] ---\n", "Instruction: Patricia Brown (emp_0172) is transferring from Engineering to Product. Process t...\n", "Tool calls: ['offboard_request', 'access_revoke', 'access_assign_role', 'email_send', 'it_get_software_licenses']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0034] [complex] ---\n", "Instruction: Onboard Kevin O'Brien as L4 VP of Product in Product. Create the employee record...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'it_get_software_licenses', 'it_assign_asset', 'it_create_account', 'access_assign_role', 'access_create_badge']\n", "Rubric: 60% (6/10)\n", "Reward: 2.60\n", "\n", "--- [task_0045] [medium] ---\n", "Instruction: Initiate offboarding for Susan Davis (emp_0091) who retiring. Set the reason to ...\n", "Tool calls: ['offboarding_create_request', 'offboarding_get_status', 'it_revoke_access']\n", "Rubric: 60% (3/5)\n", "Reward: 2.60\n", "\n", "--- [task_0077] [complex] ---\n", "Instruction: Manager Ananya Reddy (emp_0007) in Engineering is leaving. 
They have 2 direct re...\n", "Tool calls: ['it_revoke_access', 'it_get_software_licenses', 'onboarding_get_status', 'access_revoke_role', 'email_send']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0012] [simple] ---\n", "Instruction: Check the onboarding status for employee Astrid Koch (emp_0073)....\n", "Tool calls: ['hr_get_org_chart', 'onboarding_get_status', 'onboarding_complete_step']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0070] [complex] ---\n", "Instruction: Robert Garcia (emp_0133) is transferring from Data Science to Engineering. Proce...\n", "Tool calls: ['offboarding_create_request', 'access_revoke_role', 'access_assign_role', 'email_send', 'meeting_schedule']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0038] [simple] ---\n", "Instruction: Check the offboarding status for Tao Chen (emp_0020)....\n", "Tool calls: ['offboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0002] [simple] ---\n", "Instruction: Look up the employee record for Ingrid Larsson (ID: emp_0025)....\n", "Tool calls: ['hr_read_employee']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0072] [complex] ---\n", "Instruction: Rehire Marie Dubois (emp_0064) who was previously offboarded. Update their statu...\n", "Tool calls: ['it_revoke_access', 'onboarding_create_request', 'email_send']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0051] [complex] ---\n", "Instruction: Fully offboard Sergio Ferrari (emp_0198), a L3 Security Architect in Security wh...\n", "Tool calls: ['offboarding_create_request', 'it_revoke_access', 'it_get_software_licenses', 'access_revoke_role', 'email_send', 'meeting_schedule', 'hr_update_employee', 'hr_get_org_chart', 'hr_search_employees']\n", "Rubric: 83% (5/6)\n", "Reward: 4.00\n", "\n", "--- [task_0056] [edge_case] ---\n", "Instruction: Onboard a new L1 Associate to the Marketing department. 
Create their employee re...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'onboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0020] [medium] ---\n", "Instruction: Onboard new hire Tom Nguyen to Finance as L2 Financial Analyst. Create their emp...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "\n", "--- [task_0061] [edge_case] ---\n", "Instruction: Onboard contractor Amit Verma to Engineering as an L2 Contract Developer. Contra...\n", "Tool calls: ['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'it_assign_asset', 'onboarding_get_status', 'access_assign_role', 'access_create_badge', 'access_revoke_role', 'it_get_software_licenses', 'email_send', 'slack_send_message', 'meeting_schedule', 'policy_lookup']\n", "Rubric: 50% (2/4)\n", "Reward: 2.00\n", "\n", "--- [task_0029] [complex] ---\n", "Instruction: Fully onboard Raj Kapoor as L2 Backend Developer in Engineering. Their manager w...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'it_create_account', 'access_assign_role', 'email_send', 'it_get_available_assets', 'meeting_schedule']\n", "Rubric: 70% (7/10)\n", "Reward: 3.20\n", "\n", "--- [task_0059] [edge_case] ---\n", "Instruction: Check if there are available LinkedIn Sales Navigator licenses for a new Sales h...\n", "Tool calls: ['it_get_available_assets']\n", "Rubric: 0% (0/1)\n", "Reward: -1.00\n", "\n", "--- [task_0050] [complex] ---\n", "Instruction: Fully offboard Lei Huang (emp_0032), a L4 Group Product Manager in Product who i...\n", "Tool calls: ['it_revoke_access', 'it_create_account', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'meeting_schedule']\n", "Rubric: 67% (4/6)\n", "Reward: 3.00\n", "\n", "--- [task_0068] [complex] ---\n", "Instruction: Patricia Brown (emp_0172) is transferring from Engineering to Product. 
Process t...\n", "Tool calls: ['it_revoke_access', 'it_get_software_licenses', 'it_assign_asset', 'onboarding_create_request', 'it_update_employee', 'access_assign_role', 'access_create_badge', 'email_send', 'meeting_schedule', '__done__']\n", "Rubric: 60% (3/5)\n", "Reward: 2.60\n", "\n", "--- [task_0013] [simple] ---\n", "Instruction: Check the onboarding status for employee Hui Zhou (emp_0040)....\n", "Tool calls: ['hr_get_org_chart', 'onboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0016] [medium] ---\n", "Instruction: Onboard new hire Alex Chen to Product as L2 Product Analyst. Create their employ...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "\n", "--- [task_0042] [medium] ---\n", "Instruction: Initiate offboarding for Pierre Laurent (emp_0153) who moving to a different cit...\n", "Tool calls: ['offboarding_create_request', 'it_revoke_access', 'email_send']\n", "Rubric: 80% (4/5)\n", "Reward: 3.80\n", "\n", "--- [task_0066] [edge_case] ---\n", "Instruction: A Marketing employee needs access to the Engineering GitHub repository. Try to a...\n", "Tool calls: ['access_assign_role']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0027] [complex] ---\n", "Instruction: Fully onboard Carlos Mendez as L3 Senior Security Engineer in Security. Their ma...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'access_assign_role', 'email_send', 'access_create_badge', 'meeting_schedule']\n", "Rubric: 70% (7/10)\n", "Reward: 3.20\n", "\n", "--- [task_0061] [edge_case] ---\n", "Instruction: Onboard contractor Amit Verma to Engineering as an L2 Contract Developer. 
Contra...\n", "Tool calls: ['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'it_assign_asset', 'it_get_available_assets', 'access_assign_role', 'it_get_software_licenses', 'access_create_badge', 'email_send', 'slack_send_message', 'meeting_schedule', 'policy_lookup', 'approval_request']\n", "Rubric: 25% (1/4)\n", "Reward: 0.50\n", "\n", "--- [task_0018] [medium] ---\n", "Instruction: Onboard new hire James Wilson to Data Science as L2 Data Analyst. Create their e...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "\n", "--- [task_0064] [edge_case] ---\n", "Instruction: Jennifer Davis (emp_0016) is being terminated effective immediately. Follow the ...\n", "Tool calls: ['it_revoke_access', 'it_get_available_assets', 'onboarding_create_request', 'offboarding_create_request', 'it_get_software_licenses', 'access_revoke_role', 'email_send']\n", "Rubric: 75% (3/4)\n", "Reward: 3.50\n", "\n", "--- [task_0066] [edge_case] ---\n", "Instruction: A Marketing employee needs access to the Engineering GitHub repository. Try to a...\n", "Tool calls: ['access_assign_role']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0038] [simple] ---\n", "Instruction: Check the offboarding status for Tao Chen (emp_0020)....\n", "Tool calls: ['offboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0045] [medium] ---\n", "Instruction: Initiate offboarding for Susan Davis (emp_0091) who retiring. Set the reason to ...\n", "Tool calls: ['offboarding_create_request', 'it_revoke_access', 'email_send']\n", "Rubric: 80% (4/5)\n", "Reward: 3.80\n", "\n", "--- [task_0031] [complex] ---\n", "Instruction: Onboard Nina Petrova as L4 Director of Platform in Engineering. 
Create the emplo...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'access_assign_role', 'it_assign_asset', 'it_get_available_assets', 'it_create_account', 'it_revoke_access', 'access_assign_role', 'it_get_software_licenses', 'email_send', 'meeting_schedule', 'access_get_security_groups']\n", "Rubric: 70% (7/10)\n", "Reward: 3.20\n", "\n", "--- [task_0032] [complex] ---\n", "Instruction: Onboard Hassan Ahmed as L3 Lead Data Scientist in Data Science. Create the emplo...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'it_get_available_assets', 'access_assign_role', 'email_send', 'meeting_schedule']\n", "Rubric: 67% (6/9)\n", "Reward: 3.00\n", "\n", "--- [task_0006] [simple] ---\n", "Instruction: Show me the organizational chart for the Finance department....\n", "Tool calls: ['hr_get_org_chart']\n", "Rubric: 100% (2/2)\n", "Reward: 7.00\n", "\n", "--- [task_0027] [complex] ---\n", "Instruction: Fully onboard Carlos Mendez as L3 Senior Security Engineer in Security. Their ma...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'access_assign_role', 'email_send', 'access_create_badge', 'it_get_software_licenses', 'meeting_schedule']\n", "Rubric: 70% (7/10)\n", "Reward: 3.20\n", "\n", "--- [task_0042] [medium] ---\n", "Instruction: Initiate offboarding for Pierre Laurent (emp_0153) who moving to a different cit...\n", "Tool calls: ['offboarding_create_request', 'it_revoke_access', 'email_send']\n", "Rubric: 100% (5/5)\n", "Reward: 7.00\n", "\n", "--- [task_0034] [complex] ---\n", "Instruction: Onboard Kevin O'Brien as L4 VP of Product in Product. 
Create the employee record...\n", "Tool calls: ['hr_create_employee', 'hr_read_employee', 'onboarding_create_request', 'it_assign_asset', 'access_assign_role', 'it_get_available_assets', 'access_create_badge', 'access_revoke_role', 'email_send', 'meeting_schedule', 'it_get_software_licenses', 'policy_lookup']\n", "Rubric: 70% (7/10)\n", "Reward: 3.20\n", "\n", "--- [task_0077] [complex] ---\n", "Instruction: Manager Ananya Reddy (emp_0007) in Engineering is leaving. They have 2 direct re...\n", "Tool calls: ['it_revoke_access', 'access_create_badge', 'email_send', 'it_get_software_licenses', 'meeting_schedule', 'hr_update_employee', 'onboarding_get_status', 'hr_search_employees', 'onboarding_create_request', 'hr_update_employee', 'access_revoke_role', 'access_get_security_groups']\n", "Rubric: 60% (3/5)\n", "Reward: 2.60\n", "\n", "--- [task_0058] [edge_case] ---\n", "Instruction: Assign a Netsuite license to a new Finance hire. Check license availability firs...\n", "Tool calls: ['it_assign_asset', 'hr_search_employees', 'it_get_available_assets']\n", "Rubric: 0% (0/2)\n", "Reward: -1.00\n", "\n", "--- [task_0029] [complex] ---\n", "Instruction: Fully onboard Raj Kapoor as L2 Backend Developer in Engineering. Their manager w...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'it_create_account', 'access_assign_role', 'email_send', 'meeting_schedule']\n", "Rubric: 80% (8/10)\n", "Reward: 3.80\n", "\n", "--- [task_0007] [simple] ---\n", "Instruction: What laptops are currently available for assignment?...\n", "Tool calls: ['it_get_available_assets']\n", "Rubric: 100% (2/2)\n", "Reward: 7.00\n", "\n", "--- [task_0073] [medium] ---\n", "Instruction: The Engineering team is onboarding 2 new hires at the same time. 
Check available...\n", "Tool calls: ['it_get_available_assets', 'it_get_available_assets', 'it_get_available_assets', 'it_get_available_assets', 'hr_search_employees']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0053] [complex] ---\n", "Instruction: Process the complete offboarding for Marta Wagner (emp_0110) from HR. Create the...\n", "Tool calls: ['offboarding_create_request', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'slack_send_message', 'it_get_available_assets', 'it_create_account', 'it_revoke_access', 'access_get_security_groups', 'email_send', 'access_revoke_role', 'it_get_available_assets']\n", "Rubric: 80% (4/5)\n", "Reward: 3.80\n", "\n", "--- [task_0005] [simple] ---\n", "Instruction: List all employees in the Engineering department....\n", "Tool calls: ['hr_search_employees']\n", "Rubric: 100% (2/2)\n", "Reward: 7.00\n", "\n", "--- [task_0010] [simple] ---\n", "Instruction: List all security groups and their accessible resources....\n", "Tool calls: ['it_get_software_licenses']\n", "Rubric: 0% (0/1)\n", "Reward: -1.00\n", "\n", "--- [task_0048] [complex] ---\n", "Instruction: Fully offboard Henrik Becker (emp_0069), a L4 Head of Enterprise Sales in Sales ...\n", "Tool calls: ['offboarding_create_request', 'it_revoke_access', 'it_get_software_licenses', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'meeting_schedule', 'hr_update_employee', 'hr_get_org_chart']\n", "Rubric: 83% (5/6)\n", "Reward: 4.00\n", "\n", "--- [task_0030] [complex] ---\n", "Instruction: Onboard Sanjay Gupta as L2 Security Analyst in Security. 
Create the employee rec...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request', 'access_assign_role', 'it_assign_asset', 'it_create_account', 'email_send', 'meeting_schedule']\n", "Rubric: 70% (7/10)\n", "Reward: 3.20\n", "\n", "--- [task_0035] [simple] ---\n", "Instruction: Check the offboarding status for Thomas White (emp_0035)....\n", "Tool calls: ['offboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0074] [medium] ---\n", "Instruction: The Product team is onboarding 2 new hires at the same time. Check available ass...\n", "Tool calls: ['it_get_available_assets', 'it_get_available_software_licenses', 'email_send']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0064] [edge_case] ---\n", "Instruction: Jennifer Davis (emp_0016) is being terminated effective immediately. Follow the ...\n", "Tool calls: ['it_revoke_access', 'it_get_available_assets', 'it_create_account', 'onboarding_complete_step', 'access_revoke_role', 'access_get_security_groups', 'email_send', '__done__']\n", "Rubric: 50% (2/4)\n", "Reward: 2.00\n", "Error replaying: 1 validation error for HROnboardingAction\n", "arguments\n", " Input should be a valid dictionary [type=dict_type, input_value='[]', input_type=str]\n", " For further information visit https://errors.pydantic.dev/2.12/v/dict_type\n", "\n", "--- [task_0042] [medium] ---\n", "Instruction: Initiate offboarding for Pierre Laurent (emp_0153) who moving to a different cit...\n", "Tool calls: ['offboarding_create_request', 'it_revoke_access', 'email_send']\n", "Rubric: 80% (4/5)\n", "Reward: 3.80\n", "\n", "--- [task_0059] [edge_case] ---\n", "Instruction: Check if there are available LinkedIn Sales Navigator licenses for a new Sales h...\n", "Tool calls: ['it_get_available_assets']\n", "Rubric: 0% (0/1)\n", "Reward: -1.00\n", "\n", "--- [task_0019] [medium] ---\n", "Instruction: Onboard new hire Aisha Patel to Sales as L1 Sales Representative. 
Create their e...\n", "Tool calls: ['hr_create_employee', 'onboarding_create_request']\n", "Rubric: 100% (7/7)\n", "Reward: 7.00\n", "\n", "--- [task_0053] [complex] ---\n", "Instruction: Process the complete offboarding for Marta Wagner (emp_0110) from HR. Create the...\n", "Tool calls: ['onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'it_revoke_access', 'access_revoke_role', 'it_get_available_assets', 'email_send']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0012] [simple] ---\n", "Instruction: Check the onboarding status for employee Astrid Koch (emp_0073)....\n", "Tool calls: ['hr_get_org_chart', 'onboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0070] [complex] ---\n", "Instruction: Robert Garcia (emp_0133) is transferring from Data Science to Engineering. Proce...\n", "Tool calls: ['offboard_request', 'onboarding_create_request', 'access_assign_role', 'email_send']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0013] [simple] ---\n", "Instruction: Check the onboarding status for employee Hui Zhou (emp_0040)....\n", "Tool calls: ['hr_get_org_chart', 'onboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n", "\n", "--- [task_0077] [complex] ---\n", "Instruction: Manager Ananya Reddy (emp_0007) in Engineering is leaving. They have 2 direct re...\n", "Tool calls: ['it_revoke_access', 'it_get_software_licenses', 'access_revoke_role', 'email_send', 'meeting_schedule', 'onboarding_create_request']\n", "Rubric: 40% (2/5)\n", "Reward: 1.40\n", "\n", "--- [task_0040] [medium] ---\n", "Instruction: Initiate offboarding for Kavya Desai (emp_0034) who resigning. 
Set the reason to...\n", "Tool calls: ['offboarding_create_request', 'it_revoke_access', 'email_send']\n", "Rubric: 80% (4/5)\n", "Reward: 3.80\n", "\n", "--- [task_0005] [simple] ---\n", "Instruction: List all employees in the Engineering department....\n", "Tool calls: ['hr_search_employees', 'hr_search_employees']\n", "Rubric: 100% (2/2)\n", "Reward: 7.00\n", "\n", "--- [task_0037] [simple] ---\n", "Instruction: Check the offboarding status for Shan Lin (emp_0142)....\n", "Tool calls: ['offboarding_get_status']\n", "Rubric: 50% (1/2)\n", "Reward: 2.00\n" ] }, { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

Run history: per-step sparkline charts omitted. Metrics tracked:
profiling/Time taken: UnslothGRPOTrainer._calculate_rewards
profiling/Time taken: UnslothGRPOTrainer._prepare_inputs
profiling/Time taken: UnslothGRPOTrainer.efficiency_reward
profiling/Time taken: UnslothGRPOTrainer.rubric_reward
profiling/Time taken: UnslothGRPOTrainer.transformers.generate
profiling/Time taken: UnslothGRPOTrainer.valid_json_reward
train/clip_ratio/high_max
train/clip_ratio/high_mean
train/clip_ratio/low_mean
train/clip_ratio/low_min
+25...

Run summary:
profiling/Time taken: UnslothGRPOTrainer._calculate_rewards = 0.03229
profiling/Time taken: UnslothGRPOTrainer._prepare_inputs = 6.67588
profiling/Time taken: UnslothGRPOTrainer.efficiency_reward = 0.00021
profiling/Time taken: UnslothGRPOTrainer.rubric_reward = 0.03116
profiling/Time taken: UnslothGRPOTrainer.transformers.generate = 6.58671
profiling/Time taken: UnslothGRPOTrainer.valid_json_reward = 0.00024
total_flos = 0
train/clip_ratio/high_max = 0
train/clip_ratio/high_mean = 0
train/clip_ratio/low_mean = 0
+30...

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ " View run comfy-cherry-23 at: https://wandb.ai/ravi03071991/hr-agent-training/runs/bgent3o3
View project at: https://wandb.ai/ravi03071991/hr-agent-training
Synced 4 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Find logs at: ./wandb/run-20260308_175735-bgent3o3/logs" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "TrainOutput(global_step=300, training_loss=0.0006788890576863888, metrics={'train_runtime': 1281.0975, 'train_samples_per_second': 0.937, 'train_steps_per_second': 0.234, 'total_flos': 0.0, 'train_loss': 0.0006788890576863888})" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import wandb \n", "import trl.extras.profiling \n", "trl.extras.profiling.wandb = wandb \n", "\n", "trainer.train() \n", " " ] }, { "cell_type": "markdown", "id": "5a4aff68", "metadata": {}, "source": [ "## Testing the Trained Model\n", "\n", "Let's see what the RL-trained model generates compared to the base model:" ] }, { "cell_type": "code", "execution_count": 16, "id": "47ae8bbe", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Task: task_0024 (task_idx=23)\n", "Instruction: Onboard new hire Emma Davis to Product as L3 Senior PM. 
Create their employee record and initiate the onboarding request.\n", "\n", "Model output:\n", "----------------------------------------\n", "{\"tool\": \"hr_create_employee\", \"params\": {\"name\": \"Emma Davis\", \"department\": \"Product\", \"level\": \"L3\", \"role\": \"Senior PM\", \"manager_id\": \"John Smith\", \"is_contractor\": \"false\", \"location\": \"New York\", \"date_of_joining\": \"2023-01-01\"}}; {\"tool\": \"onboarding_create_request\", \"params\": {\"request_id\": \"1\", \"employee_id\": \"Emma Davis\"}}<|eot_id|>\n", "\n", "\n", "Tool calls extracted: ['hr_create_employee', 'onboarding_create_request']\n", "\n", "Rubric score: 100% (7/7)\n", "Passed: True\n", " [PASS] created_employee: Created employee record\n", " [PASS] correct_name: Used correct name\n", " [PASS] correct_dept: Assigned to correct department\n", " [PASS] correct_level: Set correct level\n", " [PASS] correct_role: Set correct role\n", " [PASS] initiated_onboarding: Created onboarding request\n", " [PASS] sequencing: Created employee before onboarding request\n" ] } ], "source": [ "# Test on a medium task from our selected set\n", "test_task = [p for p in test_prompts if p[\"difficulty\"] == \"medium\"][0]\n", "print(f\"Task: {test_task['task_id']} (task_idx={test_task['task_idx']})\")\n", "print(f\"Instruction: {test_task['prompt'][1]['content']}\\n\")\n", "print(\"Model output:\")\n", "print(\"-\" * 40)\n", "\n", "text = tokenizer.apply_chat_template(\n", " test_task[\"prompt\"],\n", " tokenize=False,\n", " add_generation_prompt=True,\n", ")\n", "\n", "from transformers import TextStreamer\n", "\n", "inputs = tokenizer(text, return_tensors=\"pt\").to(\"cuda\")\n", "outputs = model.generate(\n", " **inputs,\n", " temperature=0.1,\n", " max_new_tokens=512,\n", " streamer=TextStreamer(tokenizer, skip_prompt=True),\n", ")\n", "\n", "# Evaluate the output\n", "response = tokenizer.decode(outputs[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n", "calls = 
extract_tool_calls(response)\n", "print(f\"\\n\\nTool calls extracted: {[c['tool'] for c in calls]}\")\n", "\n", "if calls:\n", " eval_result, steps = replay_tool_calls(test_task[\"task_idx\"], calls)\n", " print(f\"\\nRubric score: {eval_result['score']:.0%} ({eval_result['passed_count']}/{eval_result['total_criteria']})\")\n", " print(f\"Passed: {eval_result['passed']}\")\n", " for c in eval_result[\"criteria_results\"]:\n", " print(f\" [{'PASS' if c['passed'] else 'FAIL'}] {c['name']}: {c['description']}\")" ] }, { "cell_type": "markdown", "id": "66533682", "metadata": {}, "source": [ "## Post-Training Evaluation\n", "\n", "Now we evaluate the trained model on both the train set and the **held-out test** set. Improvement on the test set would indicate that the model learned **generalizable** HR workflow skills rather than memorizing the training tasks." ] }, { "cell_type": "code", "execution_count": 17, "id": "b355bda6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "POST-TRAINING \u2014 TRAIN SET\n", "==================================================\n", " [FAIL] task_0035 [simple ] score=50% tools=['offboarding_get_status']\n", " [OK] correct_tool: Used offboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0014 [simple ] score=50% tools=['it_get_available_assets']\n", " [OK] checked_assets: Checked available assets\n", " [X ] checked_licenses: Checked software licenses\n", " [FAIL] task_0005 [simple ] score=0% tools=['hr_read_employee']\n", " [X ] correct_tool: Used hr_search_employees\n", " [X ] correct_dept: Filtered by correct department\n", " [PASS] task_0010 [simple ] score=100% tools=['it_get_software_licenses', 'access_get_security_groups', 'access_assign_role', 'access_revoke_role', 'email_send']\n", " [OK] correct_tool: Used access_get_security_groups\n", " [PASS] task_0006 [simple ] score=100% tools=['hr_get_org_chart']\n", " [OK] correct_tool: Used 
hr_get_org_chart\n", " [OK] correct_dept: Passed correct department\n", " [FAIL] task_0036 [simple ] score=50% tools=['offboarding_get_status']\n", " [OK] correct_tool: Used offboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0038 [simple ] score=50% tools=['offboarding_get_status']\n", " [OK] correct_tool: Used offboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [PASS] task_0007 [simple ] score=100% tools=['it_get_available_assets']\n", " [OK] correct_tool: Used it_get_available_assets\n", " [OK] correct_type: Filtered by laptop type\n", " [FAIL] task_0013 [simple ] score=0% tools=['hr_get_org_chart']\n", " [X ] correct_tool: Used onboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0037 [simple ] score=50% tools=['offboarding_get_status']\n", " [OK] correct_tool: Used offboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0011 [simple ] score=0% tools=['hr_get_org_chart']\n", " [X ] correct_tool: Used onboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0002 [simple ] score=50% tools=['hr_read_employee']\n", " [OK] correct_tool: Used hr_read_employee\n", " [X ] correct_id: Passed correct emp_id\n", " [FAIL] task_0012 [simple ] score=50% tools=['hr_get_org_chart', 'onboarding_get_status', 'access_get_security_groups', 'email_send']\n", " [OK] correct_tool: Used onboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0046 [medium ] score=80% tools=['offboarding_create_request', 'it_revoke_access', 'email_send']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [OK] revoked_access: Revoked IT access\n", " [OK] notified: Sent notification\n", " [PASS] task_0023 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] 
created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [PASS] task_0018 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0041 [medium ] score=80% tools=['offboarding_create_request', 'it_revoke_access', 'email_send']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [OK] revoked_access: Revoked IT access\n", " [OK] notified: Sent notification\n", " [FAIL] task_0040 [medium ] score=80% tools=['offboarding_create_request', 'it_revoke_access', 'email_send']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [OK] revoked_access: Revoked IT access\n", " [OK] notified: Sent notification\n", " [FAIL] task_0045 [medium ] score=80% tools=['offboarding_create_request', 'it_revoke_access', 'email_send']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [OK] revoked_access: Revoked IT access\n", " [OK] notified: Sent notification\n", " [PASS] task_0016 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: 
Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0073 [medium ] score=50% tools=['it_get_available_assets', 'it_get_available_software_licenses', 'access_assign_role', 'access_create_badge', 'email_send']\n", " [OK] checked_assets: Checked available assets\n", " [X ] checked_licenses: Checked software licenses\n", " [PASS] task_0020 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0074 [medium ] score=50% tools=['it_get_available_assets', 'it_get_available_assets', 'it_get_available_assets', 'it_get_available_software_licenses', 'access_assign_role', 'access_create_badge', 'email_send']\n", " [OK] checked_assets: Checked available assets\n", " [X ] checked_licenses: Checked software licenses\n", " [PASS] task_0017 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [PASS] task_0015 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] 
created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0042 [medium ] score=80% tools=['offboarding_create_request', 'it_revoke_access', 'email_send']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [OK] revoked_access: Revoked IT access\n", " [OK] notified: Sent notification\n", " [PASS] task_0019 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'it_assign_asset', 'it_get_available_assets', 'access_assign_role', 'access_create_badge', 'access_revoke_role', 'email_send', 'meeting_schedule']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0053 [complex ] score=60% tools=['offboarding_create_request', 'offboarding_get_status', 'offboarding_create_role', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'slack_send_message', 'meeting_schedule', 'policy_lookup']\n", " [OK] created_request: Created offboarding request\n", " [X ] revoked_it: Revoked IT access\n", " [OK] farewell_email: Sent farewell email\n", " [OK] farewell_slack: Sent farewell Slack message\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0051 [complex ] score=67% tools=['it_revoke_access', 'it_get_software_licenses', 
'access_revoke_role', 'access_get_security_groups', 'email_send', 'meeting_schedule', 'hr_update_employee', 'hr_get_org_chart']\n", " [X ] created_request: Created offboarding request\n", " [OK] revoked_it: Revoked IT access\n", " [OK] revoked_roles: Revoked access roles\n", " [OK] farewell: Sent farewell communication\n", " [OK] exit_interview: Scheduled exit interview\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0031 [complex ] score=80% tools=['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'it_get_available_assets', 'access_assign_role', 'access_create_badge', 'email_send', 'meeting_schedule', 'policy_lookup', 'approval_request', 'access_get_security_groups']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] got_approval: Submitted approval request\n", " [OK] assigned_asset: Assigned an asset\n", " [X ] created_accounts: Created IT accounts\n", " [OK] assigned_role: Assigned access role\n", " [OK] created_badge: Created physical badge\n", " [OK] sent_communications: Sent welcome communications\n", " [OK] scheduled_meeting: Scheduled orientation\n", " [X ] security_approval: Got security approval before badge\n", " [FAIL] task_0027 [complex ] score=70% tools=['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'access_assign_role', 'email_send', 'meeting_schedule']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] assigned_laptop: Assigned a laptop\n", " [X ] created_accounts: Created IT accounts\n", " [OK] assigned_access: Assigned access roles\n", " [OK] sent_welcome: Sent welcome communication\n", " [OK] scheduled_orientation: Scheduled orientation meeting\n", " [OK] sequencing_create_first: Created employee before other steps\n", " [X ] sequencing_asset_check: Checked available assets before assigning\n", " [X ] completeness: Completed at least 
3 onboarding steps\n", " [FAIL] task_0032 [complex ] score=89% tools=['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'it_get_available_assets', 'it_create_account', 'access_assign_role', 'access_create_badge', 'access_revoke_role', 'email_send', 'meeting_schedule']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] got_approval: Submitted approval request\n", " [OK] assigned_asset: Assigned an asset\n", " [OK] created_accounts: Created IT accounts\n", " [OK] assigned_role: Assigned access role\n", " [OK] created_badge: Created physical badge\n", " [OK] sent_communications: Sent welcome communications\n", " [OK] scheduled_meeting: Scheduled orientation\n", " [FAIL] task_0072 [complex ] score=80% tools=['hr_update_employee', 'onboarding_create_request', 'it_create_account', 'access_assign_role', 'email_send', 'it_get_software_licenses']\n", " [X ] read_employee: Read employee record first\n", " [OK] updated_status: Updated status to pending/active\n", " [OK] new_onboarding: Created new onboarding request\n", " [OK] provisioned_accounts: Created IT accounts\n", " [OK] welcome_back: Sent welcome-back communication\n", " [FAIL] task_0025 [complex ] score=70% tools=['hr_create_employee', 'onboarding_create_request', 'it_create_account', 'access_assign_role', 'access_create_badge', 'email_send', 'meeting_schedule']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] assigned_laptop: Assigned a laptop\n", " [OK] created_accounts: Created IT accounts\n", " [OK] assigned_access: Assigned access roles\n", " [OK] sent_welcome: Sent welcome communication\n", " [OK] scheduled_orientation: Scheduled orientation meeting\n", " [OK] sequencing_create_first: Created employee before other steps\n", " [X ] sequencing_asset_check: Checked available assets before assigning\n", " [X ] completeness: Completed at least 3 
onboarding steps\n", " [FAIL] task_0068 [complex ] score=40% tools=['offboard', 'access_revoke', 'access_assign_role', 'email_send', 'meeting_schedule']\n", " [X ] read_employee: Read employee record\n", " [X ] revoked_old_access: Revoked old department access\n", " [X ] updated_dept: Updated department\n", " [OK] new_access: Assigned new department roles\n", " [OK] notified_team: Notified new team\n", " [FAIL] task_0054 [complex ] score=60% tools=['offboarding_create_request', 'offboarding_get_status', 'offboarding_create_account', 'it_revoke_access', 'it_get_software_licenses', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'meeting_schedule']\n", " [OK] created_request: Created offboarding request\n", " [OK] revoked_it: Revoked IT access\n", " [OK] farewell_email: Sent farewell email\n", " [X ] farewell_slack: Sent farewell Slack message\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0030 [complex ] score=80% tools=['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'it_get_available_assets', 'access_assign_role', 'access_create_badge', 'email_send', 'meeting_schedule', 'policy_lookup', 'approval_request']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] got_approval: Submitted approval request\n", " [OK] assigned_asset: Assigned an asset\n", " [X ] created_accounts: Created IT accounts\n", " [OK] assigned_role: Assigned access role\n", " [OK] created_badge: Created physical badge\n", " [OK] sent_communications: Sent welcome communications\n", " [OK] scheduled_meeting: Scheduled orientation\n", " [X ] security_approval: Got security approval before badge\n", " [FAIL] task_0034 [complex ] score=80% tools=['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'it_get_available_assets', 'access_assign_role', 'access_create_badge', 'access_revoke_role', 'email_send', 'meeting_schedule', 'policy_lookup', 
'approval_request']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] got_approval: Submitted approval request\n", " [OK] assigned_asset: Assigned an asset\n", " [X ] created_accounts: Created IT accounts\n", " [OK] assigned_role: Assigned access role\n", " [OK] created_badge: Created physical badge\n", " [OK] sent_communications: Sent welcome communications\n", " [OK] scheduled_meeting: Scheduled orientation\n", " [X ] security_approval: Got security approval before badge\n", " [FAIL] task_0048 [complex ] score=50% tools=['hr_revoke_access', 'it_revoke_asset', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'meeting_schedule', 'policy_lookup', 'onboarding_create_request', 'hr_update_employee']\n", " [X ] created_request: Created offboarding request\n", " [X ] revoked_it: Revoked IT access\n", " [OK] revoked_roles: Revoked access roles\n", " [OK] farewell: Sent farewell communication\n", " [OK] exit_interview: Scheduled exit interview\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0029 [complex ] score=70% tools=['hr_create_employee', 'onboarding_create_request', 'it_create_account', 'access_assign_role', 'access_create_badge', 'access_revoke_role', 'email_send', 'meeting_schedule']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] assigned_laptop: Assigned a laptop\n", " [OK] created_accounts: Created IT accounts\n", " [OK] assigned_access: Assigned access roles\n", " [OK] sent_welcome: Sent welcome communication\n", " [OK] scheduled_orientation: Scheduled orientation meeting\n", " [OK] sequencing_create_first: Created employee before other steps\n", " [X ] sequencing_asset_check: Checked available assets before assigning\n", " [X ] completeness: Completed at least 3 onboarding steps\n", " [FAIL] task_0070 [complex ] score=40% tools=['offboard', 'access_revoke', 
'access_assign_role', 'email_send', 'meeting_schedule']\n", " [X ] read_employee: Read employee record\n", " [X ] revoked_old_access: Revoked old department access\n", " [X ] updated_dept: Updated department\n", " [OK] new_access: Assigned new department roles\n", " [OK] notified_team: Notified new team\n", " [FAIL] task_0071 [complex ] score=80% tools=['hr_update_employee', 'onboarding_create_request', 'it_create_account', 'access_assign_role', 'email_send', 'meeting_schedule']\n", " [X ] read_employee: Read employee record first\n", " [OK] updated_status: Updated status to pending/active\n", " [OK] new_onboarding: Created new onboarding request\n", " [OK] provisioned_accounts: Created IT accounts\n", " [OK] welcome_back: Sent welcome-back communication\n", " [FAIL] task_0050 [complex ] score=83% tools=['it_revoke_access', 'access_revoke_role', 'reassign_asset', 'access_get_security_groups', 'email_send', 'meeting_schedule', 'policy_lookup', 'access_get_software_licenses', 'onboarding_create_request', 'onboarding_get_status', 'onboarding_complete_step', 'offboarding_create_request', 'access_revoke_role', 'access_get_security_groups']\n", " [OK] created_request: Created offboarding request\n", " [OK] revoked_it: Revoked IT access\n", " [OK] revoked_roles: Revoked access roles\n", " [OK] farewell: Sent farewell communication\n", " [OK] exit_interview: Scheduled exit interview\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0077 [complex ] score=60% tools=['it_revoke_access', 'offboarding_create_request', 'email_send', 'meeting_schedule']\n", " [X ] read_manager: Looked up manager info\n", " [OK] offboarding: Created offboarding request\n", " [X ] reassigned: Updated reports' manager\n", " [OK] revoked_access: Revoked manager's access\n", " [OK] notified_team: Notified team\n", " [FAIL] task_0056 [edge_case ] score=50% tools=['hr_create_employee', 'hr_read_employee', 'onboarding_create_request']\n", " [OK] attempted_create: Attempted to 
create employee\n", " [X ] handled_limit: Recognized or handled headcount limit error\n", " [FAIL] task_0059 [edge_case ] score=0% tools=['hr_search_employees']\n", " [X ] checked_licenses: Checked licenses\n", " [FAIL] task_0065 [edge_case ] score=50% tools=['access_assign_role']\n", " [OK] attempted_assign: Attempted to assign role\n", " [X ] handled_error: Recognized level requirement error\n", " [FAIL] task_0066 [edge_case ] score=50% tools=['hr_assign_asset', 'access_assign_role', 'email_send']\n", " [OK] attempted_assign: Attempted to assign role\n", " [X ] handled_restriction: Recognized department restriction\n", " [FAIL] task_0064 [edge_case ] score=25% tools=['hr_revoke_access', 'it_revoke_asset', 'onboarding_create_request', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'meeting_schedule', 'policy_lookup', 'onboarding_get_status', 'offboarding_create_request', 'it_get_software_licenses']\n", " [X ] created_request: Created offboarding with termination reason\n", " [X ] revoked_access: Revoked all access\n", " [OK] no_farewell: Did NOT send farewell communications\n", " [X ] completed_steps: Completed termination steps\n", " [FAIL] task_0058 [edge_case ] score=0% tools=['it_get_available_assets']\n", " [X ] checked_licenses: Checked license availability\n", " [X ] identified_full: Recognized licenses are full\n", " [FAIL] task_0067 [edge_case ] score=0% tools=['hr_read_policy', 'hr_read_policy', 'explain_requirements', 'explain_requirements']\n", " [X ] looked_up_badge: Looked up badge/access policy\n", " [X ] multiple_lookups: Looked up multiple policies\n", " [FAIL] task_0061 [edge_case ] score=25% tools=['hr_create_employee', 'onboarding_create_request', 'access_assign_role', 'it_assign_asset', 'access_create_badge', 'access_revoke_role', 'it_get_software_licenses', 'email_send', 'meeting_schedule', 'policy_lookup', 'approval_request', 'access_get_security_groups']\n", " [X ] created_contractor: Created employee with 
is_contractor=true\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] legal_approval: Got legal approval\n", " [X ] limited_access: Created limited accounts\n", "\n", "Results: 10/52 passed (19.2%)\n", "Mean score: 0.617\n", " simple : 3/13 pass, score=0.50\n", " medium : 7/14 pass, score=0.86\n", " complex : 0/17 pass, score=0.68\n", " edge_case : 0/8 pass, score=0.25\n", "\n", "==================================================\n", "POST-TRAINING \u2014 TEST SET (held-out)\n", "==================================================\n", " [FAIL] task_0003 [simple ] score=50% tools=['hr_read_employee']\n", " [OK] correct_tool: Used hr_read_employee\n", " [X ] correct_id: Passed correct emp_id\n", " [FAIL] task_0039 [simple ] score=50% tools=['offboarding_get_status']\n", " [OK] correct_tool: Used offboarding_get_status\n", " [X ] correct_emp: Checked correct employee\n", " [FAIL] task_0008 [simple ] score=0% tools=['it_get_available_assets']\n", " [X ] correct_tool: Used it_get_software_licenses\n", " [X ] correct_software: Filtered by Jira\n", " [PASS] task_0009 [simple ] score=100% tools=['policy_lookup']\n", " [OK] correct_tool: Used policy_lookup\n", " [OK] relevant_topic: Searched for onboarding topic\n", " [FAIL] task_0001 [simple ] score=50% tools=['hr_read_employee']\n", " [OK] correct_tool: Used hr_read_employee\n", " [X ] correct_id: Passed correct emp_id\n", " [FAIL] task_0004 [simple ] score=0% tools=['hr_read_employee']\n", " [X ] correct_tool: Used hr_search_employees\n", " [X ] correct_dept: Filtered by correct department\n", " [PASS] task_0024 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] 
sequencing: Created employee before onboarding request\n", " [FAIL] task_0044 [medium ] score=80% tools=['offboarding_create_request', 'it_revoke_access', 'email_send']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [OK] revoked_access: Revoked IT access\n", " [OK] notified: Sent notification\n", " [PASS] task_0022 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0043 [medium ] score=80% tools=['offboarding_create_request', 'it_revoke_access', 'email_send']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [OK] revoked_access: Revoked IT access\n", " [OK] notified: Sent notification\n", " [FAIL] task_0075 [medium ] score=50% tools=['it_get_available_assets', 'it_get_available_software_licenses', 'access_assign_role', 'access_create_badge', 'email_send']\n", " [OK] checked_assets: Checked available assets\n", " [X ] checked_licenses: Checked software licenses\n", " [PASS] task_0021 [medium ] score=100% tools=['hr_create_employee', 'onboarding_create_request']\n", " [OK] created_employee: Created employee record\n", " [OK] correct_name: Used correct name\n", " [OK] correct_dept: Assigned to correct department\n", " [OK] correct_level: Set correct level\n", " [OK] correct_role: Set correct role\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] sequencing: Created employee before onboarding request\n", " [FAIL] task_0047 [medium ] 
score=80% tools=['offboarding_create_request', 'it_revoke_access', 'email_send']\n", " [OK] created_request: Created offboarding request\n", " [X ] correct_emp: Used correct employee ID\n", " [OK] correct_reason: Set correct reason\n", " [OK] revoked_access: Revoked IT access\n", " [OK] notified: Sent notification\n", " [FAIL] task_0055 [complex ] score=80% tools=['offboarding_create_request', 'it_revoke_access', 'it_get_software_licenses', 'access_revoke_role', 'email_send', 'meeting_schedule', 'slack_send_message', 'hr_update_employee', 'hr_get_org_chart', 'hr_search_employees', 'hr_get_org_chart', 'hr_update_employee']\n", " [OK] created_request: Created offboarding request\n", " [OK] revoked_it: Revoked IT access\n", " [OK] farewell_email: Sent farewell email\n", " [OK] farewell_slack: Sent farewell Slack message\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0052 [complex ] score=60% tools=['offboarding_create_request', 'offboarding_get_status', 'offboarding_create_role', 'offboarding_revoke_access', 'access_assign_role', 'access_revoke_role', 'access_get_security_groups', 'email_send', 'slack_send_message']\n", " [OK] created_request: Created offboarding request\n", " [X ] revoked_it: Revoked IT access\n", " [OK] farewell_email: Sent farewell email\n", " [OK] farewell_slack: Sent farewell Slack message\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0026 [complex ] score=70% tools=['hr_create_employee', 'onboarding_create_request', 'it_create_account', 'access_assign_role', 'access_create_badge', 'email_send', 'meeting_schedule', 'access_revoke_role', 'it_get_software_licenses', 'access_assign_role', 'access_get_security_groups']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] assigned_laptop: Assigned a laptop\n", " [OK] created_accounts: Created IT accounts\n", " [OK] assigned_access: Assigned access roles\n", " [OK] sent_welcome: 
Sent welcome communication\n", " [OK] scheduled_orientation: Scheduled orientation meeting\n", " [OK] sequencing_create_first: Created employee before other steps\n", " [X ] sequencing_asset_check: Checked available assets before assigning\n", " [X ] completeness: Completed at least 3 onboarding steps\n", " [FAIL] task_0033 [complex ] score=89% tools=['hr_create_employee', 'onboarding_create_request', 'it_assign_asset', 'it_get_available_assets', 'access_assign_role', 'access_create_badge', 'email_send', 'meeting_schedule', 'policy_lookup', 'approval_request', 'access_revoke_role']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [OK] got_approval: Submitted approval request\n", " [OK] assigned_asset: Assigned an asset\n", " [X ] created_accounts: Created IT accounts\n", " [OK] assigned_role: Assigned access role\n", " [OK] created_badge: Created physical badge\n", " [OK] sent_communications: Sent welcome communications\n", " [OK] scheduled_meeting: Scheduled orientation\n", " [FAIL] task_0069 [complex ] score=40% tools=['offboard', 'access_revoke', 'access_assign_role', 'email_send', 'meeting_schedule', 'policy_lookup', 'access_assign_badge']\n", " [X ] read_employee: Read employee record\n", " [X ] revoked_old_access: Revoked old department access\n", " [X ] updated_dept: Updated department\n", " [OK] new_access: Assigned new department roles\n", " [OK] notified_team: Notified new team\n", " [FAIL] task_0076 [complex ] score=60% tools=['it_revoke_access', 'offboarding_create_request', 'email_send', 'meeting_schedule']\n", " [X ] read_manager: Looked up manager info\n", " [OK] offboarding: Created offboarding request\n", " [X ] reassigned: Updated reports' manager\n", " [OK] revoked_access: Revoked manager's access\n", " [OK] notified_team: Notified team\n", " [FAIL] task_0049 [complex ] score=83% tools=['it_revoke_access', 'it_get_software_licenses', 'access_assign_role', 'access_revoke_role', 
'email_send', 'meeting_schedule', 'slack_send_message', 'hr_get_org_chart', 'onboarding_get_status', 'onboarding_complete_step', 'offboarding_create_request']\n", " [OK] created_request: Created offboarding request\n", " [OK] revoked_it: Revoked IT access\n", " [OK] revoked_roles: Revoked access roles\n", " [OK] farewell: Sent farewell communication\n", " [OK] exit_interview: Scheduled exit interview\n", " [X ] completed_steps: Completed offboarding steps\n", " [FAIL] task_0028 [complex ] score=70% tools=['hr_create_employee', 'onboarding_create_request', 'it_create_account', 'access_assign_role', 'access_create_badge', 'email_send', 'meeting_schedule']\n", " [OK] created_employee: Created employee record\n", " [OK] initiated_onboarding: Created onboarding request\n", " [X ] assigned_laptop: Assigned a laptop\n", " [OK] created_accounts: Created IT accounts\n", " [OK] assigned_access: Assigned access roles\n", " [OK] sent_welcome: Sent welcome communication\n", " [OK] scheduled_orientation: Scheduled orientation meeting\n", " [OK] sequencing_create_first: Created employee before other steps\n", " [X ] sequencing_asset_check: Checked available assets before assigning\n", " [X ] completeness: Completed at least 3 onboarding steps\n", " [FAIL] task_0063 [edge_case ] score=67% tools=['hr_update_employee', 'it_revoke_access', 'hr_create_request']\n", " [X ] checked_onboarding: Checked onboarding status\n", " [OK] revoked_access: Revoked any provisioned access\n", " [OK] updated_status: Updated employee status to offboarded\n", " [FAIL] task_0060 [edge_case ] score=0% tools=['hr_search_employees', 'hr_update_employee', 'onboarding_create_request']\n", " [X ] looked_up_manager: Looked up the manager or org chart\n", " [X ] found_skip_level: Identified skip-level manager\n", " [X ] proceeded: Proceeded with onboarding\n", " [FAIL] task_0062 [edge_case ] score=33% tools=['it_revoke_access', 'onboarding_complete_step', 'onboarding_get_status', 'access_revoke_role', 
'email_send']\n", " [X ] checked_employee: Looked up employee record\n", " [X ] created_request: Created offboarding request\n", " [OK] revoked_access: Revoked access\n", " [FAIL] task_0057 [edge_case ] score=50% tools=['hr_create_employee', 'onboarding_create_request', 'onboarding_get_status']\n", " [OK] attempted_create: Attempted to create employee\n", " [X ] handled_limit: Recognized or handled headcount limit error\n", "\n", "Results: 4/25 passed (16.0%)\n", "Mean score: 0.617\n", " simple : 1/6 pass, score=0.42\n", " medium : 3/7 pass, score=0.84\n", " complex : 0/8 pass, score=0.69\n", " edge_case : 0/4 pass, score=0.38\n", "\n", "==================================================\n", "TRAIN SET IMPROVEMENT\n", "==================================================\n", "Pass rate: 8/52 \u2192 10/52 (15.4% \u2192 19.2%)\n", "Mean score: 0.370 \u2192 0.617 (+0.247)\n", " simple : 2/13 \u2192 3/13 pass, score 0.23 \u2192 0.50\n", " medium : 6/14 \u2192 7/14 pass, score 0.72 \u2192 0.86\n", " complex : 0/17 \u2192 0/17 pass, score 0.26 \u2192 0.68\n", " edge_case : 0/8 \u2192 0/8 pass, score 0.22 \u2192 0.25\n", "\n", "==================================================\n", "TEST SET IMPROVEMENT (GENERALIZATION)\n", "==================================================\n", "Pass rate: 3/25 \u2192 4/25 (12.0% \u2192 16.0%)\n", "Mean score: 0.370 \u2192 0.617 (+0.247)\n", " simple : 0/6 \u2192 1/6 pass, score 0.17 \u2192 0.42\n", " medium : 3/7 \u2192 3/7 pass, score 0.67 \u2192 0.84\n", " complex : 0/8 \u2192 0/8 pass, score 0.30 \u2192 0.69\n", " edge_case : 0/4 \u2192 0/4 pass, score 0.29 \u2192 0.38\n" ] } ], "source": [ "# ============================================================\n", "# POST-TRAINING EVALUATION\n", "# ============================================================\n", "\n", "# Evaluate on TRAIN set\n", "print(\"=\" * 50)\n", "print(\"POST-TRAINING \u2014 TRAIN SET\")\n", "print(\"=\" * 50)\n", "trained_train = evaluate_model(model, tokenizer, 
prompts_list=train_prompts)\n", "\n", "# Evaluate on TEST set (held-out)\n", "print(\"\\n\" + \"=\" * 50)\n", "print(\"POST-TRAINING \u2014 TEST SET (held-out)\")\n", "print(\"=\" * 50)\n", "trained_test = evaluate_model(model, tokenizer, prompts_list=test_prompts)\n", "\n", "# ============================================================\n", "# IMPROVEMENT SUMMARY\n", "# ============================================================\n", "def summarize(name, baseline, trained):\n", " b_pass = sum(1 for r in baseline if r[\"passed\"])\n", " t_pass = sum(1 for r in trained if r[\"passed\"])\n", " b_score = sum(r[\"score\"] for r in baseline) / max(len(baseline), 1)\n", " t_score = sum(r[\"score\"] for r in trained) / max(len(trained), 1)\n", " print(f\"\\n{'=' * 50}\")\n", " print(f\"{name}\")\n", " print(f\"{'=' * 50}\")\n", " print(f\"Pass rate: {b_pass}/{len(baseline)} \u2192 {t_pass}/{len(trained)} \"\n", " f\"({b_pass/len(baseline):.1%} \u2192 {t_pass/len(trained):.1%})\")\n", " print(f\"Mean score: {b_score:.3f} \u2192 {t_score:.3f} \"\n", " f\"({'+'if t_score >= b_score else ''}{t_score - b_score:.3f})\")\n", " for diff in [\"simple\", \"medium\", \"complex\", \"edge_case\"]:\n", " b_sub = [r for r in baseline if r[\"difficulty\"] == diff]\n", " t_sub = [r for r in trained if r[\"difficulty\"] == diff]\n", " if b_sub:\n", " bs = sum(r[\"score\"] for r in b_sub) / len(b_sub)\n", " ts = sum(r[\"score\"] for r in t_sub) / len(t_sub)\n", " bp = sum(1 for r in b_sub if r[\"passed\"])\n", " tp = sum(1 for r in t_sub if r[\"passed\"])\n", " print(f\" {diff:10s}: {bp}/{len(b_sub)} \u2192 {tp}/{len(t_sub)} pass, \"\n", " f\"score {bs:.2f} \u2192 {ts:.2f}\")\n", "\n", "summarize(\"TRAIN SET IMPROVEMENT\", baseline_train, trained_train)\n", "summarize(\"TEST SET IMPROVEMENT (GENERALIZATION)\", baseline_test, trained_test)" ] }, { "cell_type": "markdown", "id": "188a9b69", "metadata": {}, "source": [ "## Saving the Fine-tuned Model\n", "\n", "Save the trained model for later 
use or push to Hugging Face Hub:" ] }, { "cell_type": "code", "execution_count": 19, "id": "641436fa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found HuggingFace hub cache directory: /home/jovyan/.cache/huggingface/hub\n", "Checking cache directory for required files...\n", "Cache check failed: model.safetensors not found in local cache.\n", "Not all required files found in cache. Will proceed with downloading.\n", "Checking cache directory for required files...\n", "Cache check failed: tokenizer.model not found in local cache.\n", "Not all required files found in cache. Will proceed with downloading.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Unsloth: Preparing safetensor model files: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:00<00:00, 3300.00it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Note: tokenizer.model not found (this is OK for non-SentencePiece models)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Unsloth: Merging weights into 16bit: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:03<00:00, 3.91s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Unsloth: Merge process complete. 
Saved to `/home/jovyan/rl_hack/outputs/hr_agent_final`\n", "Model saved to outputs/hr_agent_final\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b32d235e7ae94a8288026fcf56b31ce2", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing Files (0 / 0): | | 0.00B / 0.00B " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4992990ab81a4785a78815cf86c05c1a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "New Data Upload: | | 0.00B / 0.00B " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Found HuggingFace hub cache directory: /home/jovyan/.cache/huggingface/hub\n", "Checking cache directory for required files...\n", "Cache check failed: model.safetensors not found in local cache.\n", "Not all required files found in cache. Will proceed with downloading.\n", "Checking cache directory for required files...\n", "Cache check failed: tokenizer.model not found in local cache.\n", "Not all required files found in cache. Will proceed with downloading.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Unsloth: Preparing safetensor model files: 0%| | 0/1 [00:00" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Saved to outputs/training_curves.png\n", "\n", "============================================================\n", "FINAL RESULTS SUMMARY\n", "============================================================\n", "\n", "Metric Baseline Trained Change\n", "------------------------------------------------------------------\n", "Train pass rate 15.4% 19.2% +3.8%\n", "Train mean score 0.370 0.617 +0.247\n", "Test pass rate (gen.) 12.0% 16.0% +4.0%\n", "Test mean score (gen.) 
0.370 0.617 +0.247\n", "\n", "Results saved to outputs/eval_results.json\n" ] } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# Extract training logs from trainer\n", "logs = trainer.state.log_history\n", "\n", "steps = [l[\"step\"] for l in logs if \"reward\" in l]\n", "rewards = [l[\"reward\"] for l in logs if \"reward\" in l]\n", "losses = [l[\"loss\"] for l in logs if \"loss\" in l]\n", "loss_steps = [l[\"step\"] for l in logs if \"loss\" in l]\n", "kl = [l[\"kl\"] for l in logs if \"kl\" in l]\n", "kl_steps = [l[\"step\"] for l in logs if \"kl\" in l]\n", "\n", "# Compute moving average for reward\n", "window = 10\n", "reward_ma = [sum(rewards[max(0,i-window):i+1]) / len(rewards[max(0,i-window):i+1]) for i in range(len(rewards))]\n", "\n", "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n", "\n", "# Reward curve\n", "axes[0].plot(steps, rewards, alpha=0.3, color=\"blue\", label=\"Per-step\")\n", "axes[0].plot(steps, reward_ma, color=\"blue\", linewidth=2, label=f\"Moving avg ({window})\")\n", "axes[0].set_xlabel(\"Training Step\")\n", "axes[0].set_ylabel(\"Total Reward\")\n", "axes[0].set_title(\"Reward Over Training\")\n", "axes[0].legend()\n", "axes[0].grid(True, alpha=0.3)\n", "\n", "# Loss curve\n", "axes[1].plot(loss_steps, losses, color=\"red\", alpha=0.7)\n", "axes[1].set_xlabel(\"Training Step\")\n", "axes[1].set_ylabel(\"Training Loss\")\n", "axes[1].set_title(\"Training Loss\")\n", "axes[1].grid(True, alpha=0.3)\n", "\n", "# KL divergence\n", "axes[2].plot(kl_steps, kl, color=\"green\", alpha=0.7)\n", "axes[2].set_xlabel(\"Training Step\")\n", "axes[2].set_ylabel(\"KL Divergence\")\n", "axes[2].set_title(\"KL Divergence from Base Policy\")\n", "axes[2].grid(True, alpha=0.3)\n", "\n", "plt.tight_layout()\n", "plt.savefig(\"outputs/training_curves.png\", dpi=150, bbox_inches=\"tight\")\n", "plt.show()\n", "print(\"Saved to outputs/training_curves.png\")\n", "\n", "# ============================================================\n", "# FINAL 
SUMMARY TABLE\n", "# ============================================================\n", "print(\"\\n\" + \"=\" * 60)\n", "print(\"FINAL RESULTS SUMMARY\")\n", "print(\"=\" * 60)\n", "\n", "def score_of(results):\n", " return sum(r[\"score\"] for r in results) / max(len(results), 1)\n", "\n", "def pass_rate(results):\n", " return sum(1 for r in results if r[\"passed\"]) / max(len(results), 1)\n", "\n", "print(f\"\\n{'Metric':<30s} {'Baseline':>12s} {'Trained':>12s} {'Change':>12s}\")\n", "print(\"-\" * 66)\n", "print(f\"{'Train pass rate':<30s} {pass_rate(baseline_train):>11.1%} {pass_rate(trained_train):>11.1%} \"\n", " f\"{pass_rate(trained_train) - pass_rate(baseline_train):>+11.1%}\")\n", "print(f\"{'Train mean score':<30s} {score_of(baseline_train):>12.3f} {score_of(trained_train):>12.3f} \"\n", " f\"{score_of(trained_train) - score_of(baseline_train):>+12.3f}\")\n", "print(f\"{'Test pass rate (gen.)':<30s} {pass_rate(baseline_test):>11.1%} {pass_rate(trained_test):>11.1%} \"\n", " f\"{pass_rate(trained_test) - pass_rate(baseline_test):>+11.1%}\")\n", "print(f\"{'Test mean score (gen.)':<30s} {score_of(baseline_test):>12.3f} {score_of(trained_test):>12.3f} \"\n", " f\"{score_of(trained_test) - score_of(baseline_test):>+12.3f}\")\n", "\n", "# Save all results (json is imported here so this cell does not rely on an earlier import)\n", "import json\n", "import os\n", "os.makedirs(\"outputs\", exist_ok=True)\n", "all_results = {\n", " \"baseline_train\": baseline_train,\n", " \"baseline_test\": baseline_test,\n", " \"trained_train\": trained_train,\n", " \"trained_test\": trained_test,\n", "}\n", "with open(\"outputs/eval_results.json\", \"w\") as f:\n", " json.dump(all_results, f, indent=2)\n", "print(\"\\nResults saved to outputs/eval_results.json\")" ] }, { "cell_type": "markdown", "id": "1b987501", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this tutorial, we trained an LLM to automate HR workflows using reinforcement learning. Key concepts:\n", "\n", "1. **OpenEnv** for standardized access to enterprise RL environments\n", "2. 
**Rubric-based rewards** that verify tool usage, parameter correctness, and sequencing\n", "3. **Multi-objective rewards** (valid JSON + rubric score + efficiency)\n", "4. **GRPO** for policy optimization without a value network\n", "5. **LoRA** for memory-efficient fine-tuning on consumer GPUs\n", "6. **Proper train/test split**: a roughly 70/30 stratified split (52 train / 25 test tasks) to measure generalization\n", "\n", "### Key Results\n", "\n", "| Metric | Baseline | Trained | Change |\n", "|--------|----------|---------|--------|\n", "| Train pass rate | 15.4% | 19.2% | +3.8 pp |\n", "| Train mean score | 0.370 | 0.617 | +0.247 (+67%) |\n", "| **Test pass rate** | **12.0%** | **16.0%** | **+4.0 pp** |\n", "| **Test mean score** | **0.370** | **0.617** | **+0.247 (+67%)** |\n", "\n", "### Improvement by Difficulty (Train Set)\n", "\n", "| Difficulty | Baseline Score | Trained Score | Change |\n", "|------------|---------------|---------------|--------|\n", "| Simple | 0.23 | 0.50 | +0.27 |\n", "| Medium | 0.72 | 0.86 | +0.14 |\n", "| Complex | 0.26 | 0.68 | **+0.42** |\n", "| Edge case | 0.22 | 0.25 | +0.03 |\n", "\n", "The biggest improvement is on **complex tasks**, where the train-set mean score more than doubled (0.26 \u2192 0.68), although no complex task reached a full pass (0/17 before and after training). Edge cases barely moved on the train set (0.22 \u2192 0.25). The improvement also **carries over to held-out test tasks** (mean score 0.370 \u2192 0.617), which suggests the model learned transferable HR workflow skills rather than memorizing the training tasks.\n", "\n", "### Resources\n", "\n", "- [HR Environment on HF Spaces](https://huggingface.co/spaces/devxpy/rl_hack)\n", "- [OpenEnv Documentation](https://github.com/meta-pytorch/OpenEnv)\n", "- [TRL GRPO Trainer](https://huggingface.co/docs/trl/main/en/grpo_trainer)\n", "- [Unsloth RL Guide](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide)\n", "\n", "---\n", "\n", "*This notebook uses [Unsloth](https://github.com/unslothai/unsloth) for memory-efficient training.*\n", "\n", "**License:** Apache 2.0" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { 
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }