Spaces: devxpy/rl_hack

devxpy committed on Mar 8

Commit 126c21b (verified), 1 parent: e181764

Upload folder using huggingface_hub

Files changed (9)

.gitattributes +1 -0
README.md +88 -14
pyproject.toml +15 -0
reward_curve.png +3 -0
server/app.py +1 -0
server/static/index.html +69 -1
server/tasks.py +310 -12
test_all_tasks.py +186 -0
train_hr_agent.ipynb +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+reward_curve.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -13,9 +13,22 @@ tags:
 # HR Onboarding & Offboarding Environment
-An OpenEnv-compatible RL environment that simulates enterprise HR onboarding and offboarding workflows. The agent interacts with a realistic HR system (200+ employees, 8 departments, RBAC, approval chains) by calling tools to complete multi-step tasks.
-Built for the Scaler AI hackathon (Statement 3.1).
 ## Quick Start
@@ -226,8 +239,11 @@ rl_hack/
 ├── __init__.py                        # Module exports
 ├── client.py                          # HROnboardingEnv client
 ├── models.py                          # Action/Observation Pydantic models
-├── test_with_llm.py                   # Test script (GPT agent)
 ├── .env                               # API keys (gitignored)
 └── server/
     ├── __init__.py
     ├── app.py                         # FastAPI application
@@ -260,7 +276,7 @@ You can test the environment locally using GPT (or any OpenAI-compatible model)
 2. Install dependencies:
    ```bash
-   pip install openai python-dotenv openenv-core
    ```
 ### Run
@@ -269,12 +285,15 @@ You can test the environment locally using GPT (or any OpenAI-compatible model)
 cd rl_hack
 # Test on default task (simple lookup)
-uv run python -m test_with_llm.py
 # Test a specific task by index (0-76)
 uv run python -m test_with_llm 14    # medium onboarding task
 uv run python -m test_with_llm 24    # complex full onboarding
 uv run python -m test_with_llm 55    # edge case (headcount limit)
 ```
 The script will:
@@ -322,29 +341,84 @@ Passed: True
 | 55-66 | Edge case | Various | Headcount limits, license caps, RBAC |
 | 67-76 | Complex | Cross-workflow | Transfers, rehires, manager departures |
 ## Building & Running
 ```bash
 # Build Docker image
 docker build -t hr-onboarding-env:latest -f server/Dockerfile .
-# Run locally (as OpenEnv HTTP server)
-uvicorn server.app:app --reload --host 0.0.0.0 --port 7860
 # Deploy to HF Spaces
 openenv push
 ```
-## Training
-We use Unsloth + GRPO to train an LLM agent on this environment:
-- **Model**: Qwen 2.5-7B-Instruct (4-bit quantized)
 - **Algorithm**: GRPO (Group Relative Policy Optimization)
-- **Rollouts**: 8 per prompt
-- **Train/eval split**: 80/20 (62 train, 15 eval tasks)
-See `training/` directory in the parent repo for training scripts.
 ## Live Demo

 # HR Onboarding & Offboarding Environment
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ravi03071991/rl_hack/blob/master/train_hr_agent.ipynb)
+An OpenEnv-compatible RL environment that simulates enterprise HR onboarding and offboarding workflows. The agent orchestrates across **6 enterprise apps** — Workday, ServiceNow, Okta, Email, Slack, and Calendar — using 25 tools to complete multi-step tasks in a realistic HR system (200+ employees, 8 departments, RBAC, approval chains).
+Built for the [OpenEnv Hackathon SF](https://cerebralvalley.ai/e/openenv-hackathon-sf/details) — **Statement 3.1: Professional Tasks** (Scaler AI Labs partner theme: Multi-App RL Environment for Enterprise Workflows).
+### Key Results
+> **GRPO training on Llama 3.2-1B-Instruct improves mean task score by +67% (0.37 → 0.62).**
+> Complex multi-step task scores **more than double** (0.26 → 0.68). Gains generalize to held-out test tasks.
+| | Baseline | Trained | Improvement |
+|---|---------|---------|-------------|
+| Mean Score | 0.370 | 0.617 | **+67%** |
+| Complex Tasks | 0.26 | 0.68 | **+162%** |
+| Pass Rate | 15.4% | 19.2% | +3.8pp |
 ## Quick Start
 ├── __init__.py                        # Module exports
 ├── client.py                          # HROnboardingEnv client
 ├── models.py                          # Action/Observation Pydantic models
+├── test_with_llm.py                   # Test single task with GPT agent
+├── test_all_tasks.py                  # Evaluate all 77 tasks
+├── train_hr_agent.ipynb               # GRPO training notebook (Unsloth)
 ├── .env                               # API keys (gitignored)
+├── outputs/                           # Evaluation results
 └── server/
     ├── __init__.py
     ├── app.py                         # FastAPI application
 2. Install dependencies:
    ```bash
+   uv pip install -e ".[eval]"
    ```
 ### Run
 cd rl_hack
 # Test on default task (simple lookup)
+uv run python -m test_with_llm
 # Test a specific task by index (0-76)
 uv run python -m test_with_llm 14    # medium onboarding task
 uv run python -m test_with_llm 24    # complex full onboarding
 uv run python -m test_with_llm 55    # edge case (headcount limit)
+# Run full evaluation across all 77 tasks
+uv run python test_all_tasks.py
 ```
 The script will:
 | 55-66 | Edge case | Various | Headcount limits, license caps, RBAC |
 | 67-76 | Complex | Cross-workflow | Transfers, rehires, manager departures |
+## Installation
+```bash
+# Clone the repo
+git clone https://github.com/ravi03071991/rl_hack.git
+cd rl_hack
+# Install core dependencies
+uv pip install -e .
+# Install with evaluation support (adds openai)
+uv pip install -e ".[eval]"
+# Install with training support (adds unsloth, trl, torch, etc.)
+uv pip install -e ".[train]"
+# Install everything
+uv pip install -e ".[eval,train,dev]"
+```
 ## Building & Running
 ```bash
+# Run locally (as OpenEnv HTTP server with playground UI)
+uvicorn server.app:app --reload --host 0.0.0.0 --port 7860
 # Build Docker image
 docker build -t hr-onboarding-env:latest -f server/Dockerfile .
 # Deploy to HF Spaces
 openenv push
 ```
+## Training & Results
+We use Unsloth + GRPO to train an LLM agent on this environment. See [`train_hr_agent.ipynb`](train_hr_agent.ipynb) for the full training notebook and [W&B run](https://wandb.ai/ravi03071991/hr-agent-training/runs/bgent3o3?nw=nwuserravi03071991) for live training metrics.
+### Setup
+- **Model**: Llama 3.2-1B-Instruct (4-bit quantized, LoRA rank 8)
 - **Algorithm**: GRPO (Group Relative Policy Optimization)
+- **Reward functions**: Valid JSON + rubric score + efficiency
+- **Training**: 300 steps, 6 generations per prompt, lr=5e-5 with cosine schedule
+- **Data split**: 70/30 stratified train/test (52 train, 25 test tasks)
+### Results
+GRPO training significantly improves the model's ability to complete HR workflows:
+| Metric | Base Model | Trained | Change |
+|--------|-----------|---------|--------|
+| **Train pass rate** | 15.4% | 19.2% | +3.8% |
+| **Train mean score** | 0.370 | 0.617 | **+0.247 (+67%)** |
+| **Test pass rate** | 12.0% | 16.0% | +4.0% |
+| **Test mean score** | 0.370 | 0.617 | **+0.247 (+67%)** |
+#### Improvement by difficulty
+| Difficulty | Baseline | Trained | Change |
+|------------|----------|---------|--------|
+| Simple | 0.23 | 0.50 | +0.27 |
+| Medium | 0.72 | 0.86 | +0.14 |
+| **Complex** | **0.26** | **0.68** | **+0.42** |
+| Edge case | 0.22 | 0.25 | +0.03 |
+The biggest gains are on **complex multi-step tasks** — scores more than doubled. The improvement **generalizes to held-out test tasks**, proving the model learned transferable HR workflow skills.
+### Reward Curve
+![Reward Curve](reward_curve.png)
+The moving average reward trends upward from ~2-3 early in training to ~4-5 by the end, showing consistent learning.
+### Quick start (Colab)
+1. Click the Colab badge at the top to open `train_hr_agent.ipynb` in Google Colab
+2. Select a GPU runtime
+3. Run all cells — installs dependencies, trains, and evaluates automatically
 ## Live Demo
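
The percentages in the README results tables can be re-derived from the raw scores. The following is a minimal sketch, not part of the commit: the helper name `relative_gain` is hypothetical, and the conversion of pass rates to task counts assumes the rates were computed over the full 52-task train and 25-task test splits.

```python
# Figures copied from the README tables in this diff.
def relative_gain(before: float, after: float) -> float:
    """Relative improvement expressed as a fraction of the baseline."""
    return (after - before) / before


mean_gain = relative_gain(0.370, 0.617)     # mean task score
complex_gain = relative_gain(0.26, 0.68)    # complex-task score
pass_rate_delta_pp = 19.2 - 15.4            # percentage points, not percent

# Assumption: pass rates were computed over the whole split (52 train, 25 test).
train_passed = (round(0.154 * 52), round(0.192 * 52))
test_passed = (round(0.120 * 25), round(0.160 * 25))

print(f"mean score gain:    {mean_gain:+.0%}")
print(f"complex score gain: {complex_gain:+.0%}")
print(f"pass rate delta:    {pass_rate_delta_pp:+.1f} pp")
print(f"train tasks passed: {train_passed[0]} -> {train_passed[1]} of 52")
print(f"test tasks passed:  {test_passed[0]} -> {test_passed[1]} of 25")
```

Under that assumption, the pass-rate change corresponds to two additional training tasks and one additional test task passing, which is worth keeping in mind when reading the percentage-point figures.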

pyproject.toml CHANGED

@@ -9,9 +9,24 @@ description = "HR Onboarding/Offboarding environment for OpenEnv — simulates e
 requires-python = ">=3.10"
 dependencies = [
     "openenv-core[core]>=0.2.0",
 ]
 [project.optional-dependencies]
 dev = [
     "pytest>=8.0.0",
     "pytest-cov>=4.0.0",

 requires-python = ">=3.10"
 dependencies = [
     "openenv-core[core]>=0.2.0",
+    "fastapi>=0.100.0",
+    "uvicorn>=0.20.0",
+    "pydantic>=2.0.0",
+    "python-dotenv>=1.0.0",
 ]
 [project.optional-dependencies]
+eval = [
+    "openai>=1.0.0",
+]
+train = [
+    "unsloth",
+    "trl>=0.22.0",
+    "datasets>=2.0.0",
+    "torch>=2.0.0",
+    "transformers>=4.40.0",
+    "bitsandbytes>=0.43.0",
+]
 dev = [
     "pytest>=8.0.0",
     "pytest-cov>=4.0.0",
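
Because `eval`, `train`, and `dev` are optional extras, a script may want to confirm that the relevant packages are importable before it starts. A minimal sketch, assuming the import names below match the distributions listed in the extras (for example `pytest-cov` importing as `pytest_cov`); the `EXTRA_MODULES` mapping and `missing_modules` helper are hypothetical and not part of the repository.

```python
from importlib.util import find_spec

# Assumed import names for each extra declared in pyproject.toml.
EXTRA_MODULES = {
    "eval": ["openai"],
    "train": ["unsloth", "trl", "datasets", "torch", "transformers", "bitsandbytes"],
    "dev": ["pytest", "pytest_cov"],
}


def missing_modules(modules):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in modules if find_spec(name) is None]


for extra, modules in EXTRA_MODULES.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"[{extra}] {status}")
```

`find_spec` only locates a module without importing it, so the check stays cheap even for heavy packages such as `torch`.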

reward_curve.png ADDED

Git LFS Details

SHA256: b6e3520d1c5c69210b70a2d5d5862b34a8aa1d7f36e1a6543e7e89b48675c471
Pointer size: 131 Bytes
Size of remote file: 248 kB

server/app.py CHANGED

@@ -77,6 +77,7 @@ def get_tasks():
             "difficulty": task.difficulty,
             "category": task.category,
             "expected_tools": task.expected_tools,
             "num_criteria": len(task.rubric_criteria),
         })

             "difficulty": task.difficulty,
             "category": task.category,
             "expected_tools": task.expected_tools,
+            "rubric_criteria": task.rubric_criteria,
             "num_criteria": len(task.rubric_criteria),
         })
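
Each entry in `rubric_criteria` carries a `check` string, and five prefixes appear in this commit: `tool_used`, `tool_not_used`, `tool_used_any`, `param_value`, and `param_contains`. The repository's actual grader is not shown in this diff, so the following is only a sketch of how such strings could be interpreted. The `ToolCall` class, the `check_criterion` function, the case-insensitive substring match, and the `emp_0042` ID are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation made by the agent."""
    tool: str
    params: dict = field(default_factory=dict)


def check_criterion(check: str, calls: list[ToolCall]) -> bool:
    """Evaluate one rubric check string against the recorded tool calls."""
    kind, _, spec = check.partition(":")
    if kind == "tool_used":
        return any(c.tool == spec for c in calls)
    if kind == "tool_not_used":
        return all(c.tool != spec for c in calls)
    if kind == "tool_used_any":
        wanted = set(spec.split(","))
        return any(c.tool in wanted for c in calls)
    if kind in ("param_value", "param_contains"):
        target, _, expected = spec.partition("=")   # "tool.param" = "value"
        tool, _, param = target.partition(".")
        for c in calls:
            if c.tool != tool or param not in c.params:
                continue
            actual = str(c.params[param])
            if kind == "param_value" and actual == expected:
                return True
            # Assumption: substring match ignores case.
            if kind == "param_contains" and expected.lower() in actual.lower():
                return True
        return False
    raise ValueError(f"unknown check kind: {kind}")


calls = [
    ToolCall("offboarding_create_request",
             {"employee_id": "emp_0042", "reason": "Termination for misconduct"}),
    ToolCall("it_revoke_access", {"employee_id": "emp_0042"}),
    ToolCall("email_send", {"to": "it@example.com"}),
]
for check in [
    "tool_used:offboarding_create_request",
    "param_value:offboarding_create_request.reason=termination",
    "param_contains:offboarding_create_request.reason=terminat",
    "tool_used_any:email_send,slack_send_message",
    "tool_not_used:slack_send_message",
]:
    print(check, "->", check_criterion(check, calls))
```

In this sketch the exact-match `param_value` check fails on a free-text reason while the `param_contains` check on the stem `terminat` passes, which is one plausible motivation for the switch from `param_value` to `param_contains` made elsewhere in this commit.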

server/static/index.html CHANGED

@@ -189,6 +189,41 @@
       color: #d0d0d0;
     }
     .step-indicator {
       display: flex;
       align-items: center;
@@ -640,6 +675,7 @@
         body: JSON.stringify({ task_idx: idx }),
       });
       const data = await res.json();
       // Update instruction
       const instrEl = document.getElementById('taskInstruction');
@@ -650,7 +686,8 @@
           <span class="task-tag tag-${task.category}">${task.category}</span>
           ${data.task_id}
         </h3>
-        <p>${data.instruction}</p>
         <div class="step-indicator">
           <span>Step ${currentStep}/${maxSteps}</span>
           <div class="step-bar"><div class="step-bar-fill" style="width: 0%"></div></div>
@@ -760,6 +797,28 @@
       log.scrollTop = log.scrollHeight;
     }
     // --- Evaluation ---
     function showEvaluation(evalData) {
       const section = document.getElementById('evalSection');
@@ -810,6 +869,15 @@
       `).join('');
     }
     function selectTool(name) {
       document.getElementById('toolSelect').value = name;
       // Show parameter hints

       color: #d0d0d0;
     }
+    .ideal-result {
+      margin-top: 14px;
+      background: #141a26;
+      border: 1px solid #243049;
+      border-left: 3px solid #58a6ff;
+      border-radius: 8px;
+      padding: 12px;
+    }
+    .ideal-result h4 {
+      font-size: 12px;
+      color: #9cc7ff;
+      margin-bottom: 8px;
+      text-transform: uppercase;
+      letter-spacing: 0.4px;
+    }
+    .ideal-result .ideal-label {
+      font-size: 11px;
+      color: #7f8da3;
+      margin: 6px 0 4px;
+    }
+    .ideal-result ul {
+      margin: 0;
+      padding-left: 18px;
+      color: #c7d2e1;
+      font-size: 12px;
+      line-height: 1.45;
+    }
+    .ideal-result li {
+      margin-bottom: 3px;
+    }
     .step-indicator {
       display: flex;
       align-items: center;
         body: JSON.stringify({ task_idx: idx }),
       });
       const data = await res.json();
+      maxSteps = data.max_steps || maxSteps;
       // Update instruction
       const instrEl = document.getElementById('taskInstruction');
           <span class="task-tag tag-${task.category}">${task.category}</span>
           ${data.task_id}
         </h3>
+        <p>${escapeHtml(data.instruction)}</p>
+        ${renderIdealResult(task)}
         <div class="step-indicator">
           <span>Step ${currentStep}/${maxSteps}</span>
           <div class="step-bar"><div class="step-bar-fill" style="width: 0%"></div></div>
       log.scrollTop = log.scrollHeight;
     }
+    function renderIdealResult(task) {
+      if (!task) return '';
+      const expectedTools = (task.expected_tools || [])
+        .map(t => `<li><code>${escapeHtml(String(t))}</code></li>`)
+        .join('');
+      const criteria = (task.rubric_criteria || [])
+        .map(c => `<li>${escapeHtml(c.description || c.name || 'Criterion')}</li>`)
+        .join('');
+      return `
+        <div class="ideal-result">
+          <h4>Ideal Result</h4>
+          <div class="ideal-label">Expected tools:</div>
+          <ul>${expectedTools || '<li>Not specified</li>'}</ul>
+          <div class="ideal-label">Success criteria:</div>
+          <ul>${criteria || '<li>Not specified</li>'}</ul>
+        </div>
+      `;
+    }
     // --- Evaluation ---
     function showEvaluation(evalData) {
       const section = document.getElementById('evalSection');
       `).join('');
     }
+    function escapeHtml(value) {
+      return String(value)
+        .replace(/&/g, '&amp;')
+        .replace(/</g, '&lt;')
+        .replace(/>/g, '&gt;')
+        .replace(/"/g, '&quot;')
+        .replace(/'/g, '&#39;');
+    }
     function selectTool(name) {
       document.getElementById('toolSelect').value = name;
       // Show parameter hints
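
The `escapeHtml` helper added in this file replaces `&` first and only then the other four characters. A Python rendering of the same replacement order is sketched below to show why that ordering matters; the function name `escape_html` is a hypothetical mirror of the JavaScript and is not part of the commit.

```python
def escape_html(value) -> str:
    """Mirror of the JavaScript escapeHtml: same characters, same order."""
    text = str(value)
    # "&" must be handled first; otherwise the "&" introduced by the later
    # entities (for example "&lt;") would itself be escaped again.
    for char, entity in (
        ("&", "&amp;"),
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        ("'", "&#39;"),
    ):
        text = text.replace(char, entity)
    return text


print(escape_html("<b>\"a\" & 'b'</b>"))
```

Python's standard library offers `html.escape`, which covers the same five characters but emits `&#x27;` for the single quote instead of `&#39;`.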

server/tasks.py CHANGED

@@ -85,9 +85,10 @@ class TaskGenerator:
         return f"task_{self._task_counter:04d}"
     def generate_all_tasks(self) -> list[Task]:
-        """Generate the full task set (~100 tasks)."""
         tasks = []
         tasks.extend(self._simple_lookup_tasks())
         tasks.extend(self._simple_onboarding_tasks())
         tasks.extend(self._medium_onboarding_tasks())
         tasks.extend(self._complex_onboarding_tasks())
@@ -211,6 +212,126 @@ class TaskGenerator:
         return tasks
     # ---- Simple Onboarding Tasks (5) ----
     def _simple_onboarding_tasks(self) -> list[Task]:
         tasks = []
@@ -264,6 +385,22 @@ class TaskGenerator:
             ("David Brown", "Security", "L2", "Security Analyst"),
             ("Li Wei", "Engineering", "L3", "Senior Engineer"),
             ("Emma Davis", "Product", "L3", "Senior PM"),
         ]
         for name, dept, level, role in names:
@@ -303,6 +440,17 @@ class TaskGenerator:
             ("Carlos Mendez", "Security", "L3", "Senior Security Engineer"),
             ("Rachel Green", "Product", "L2", "Product Designer"),
             ("Raj Kapoor", "Engineering", "L2", "Backend Developer"),
         ]
         for name, dept, level, role in complex_hires:
@@ -347,6 +495,12 @@ class TaskGenerator:
             ("Hassan Ahmed", "Data Science", "L3", "Lead Data Scientist"),
             ("Laura Martinez", "Finance", "L3", "Senior Financial Analyst"),
             ("Kevin O'Brien", "Product", "L4", "VP of Product"),
         ]:
             manager = _pick_manager_in_dept(self.world, dept, min_level="L4")
             needs_security = dept == "Security" or int(level[1]) >= 4
@@ -429,6 +583,17 @@ class TaskGenerator:
             ("resignation", "Daniel Park is retiring"),
             ("resignation", "Christina Muller is taking a career break"),
             ("resignation", "Yuki Tanaka is going back to school"),
         ]
         for reason, scenario in offboarding_scenarios:
@@ -439,12 +604,14 @@ class TaskGenerator:
             name = emp["name"]
             instruction = (
                 f"Initiate offboarding for {name} ({emp['emp_id']}) who {scenario.split(' is ')[1] if ' is ' in scenario else 'is leaving'}. "
                 f"Revoke their system access and notify IT."
             )
             criteria = [
                 {"name": "created_request", "description": "Created offboarding request", "check": "tool_used:offboarding_create_request"},
-                {"name": "correct_reason", "description": "Set correct reason", "check": f"param_value:offboarding_create_request.reason={reason}"},
                 {"name": "revoked_access", "description": "Revoked IT access", "check": "tool_used:it_revoke_access"},
                 {"name": "notified", "description": "Sent notification", "check": "tool_used_any:email_send,slack_send_message"},
             ]
@@ -466,7 +633,7 @@ class TaskGenerator:
         tasks = []
         # Full offboarding for managers/directors with reports
-        for _ in range(4):
             # Find an employee who has direct reports
             candidates = [e for e in self.world.state["employees"]
                          if e["status"] == "active" and int(e["level"][1]) >= 3]
@@ -506,7 +673,7 @@ class TaskGenerator:
             ))
         # Offboarding with asset reclamation
-        for _ in range(4):
             emp = _pick_employee(self.world, status="active")
             if not emp:
                 continue
@@ -542,7 +709,7 @@ class TaskGenerator:
         tasks = []
         # 1. Headcount limit exceeded
-        for dept in ["Marketing", "Finance"]:
             tasks.append(Task(
                 task_id=self._next_id(),
                 instruction=f"Onboard a new L1 Associate to the {dept} department. Create their employee record and start onboarding.",
@@ -684,7 +851,7 @@ class TaskGenerator:
                 category="offboarding",
                 expected_tools=["offboarding_create_request", "it_revoke_access", "offboarding_complete_step"],
                 rubric_criteria=[
-                    {"name": "created_request", "description": "Created offboarding with termination reason", "check": "param_value:offboarding_create_request.reason=termination"},
                     {"name": "revoked_access", "description": "Revoked all access", "check": "tool_used:it_revoke_access"},
                     {"name": "no_farewell", "description": "Did NOT send farewell communications", "check": "tool_not_used:slack_send_message"},
                     {"name": "completed_steps", "description": "Completed termination steps", "check": "tool_used:offboarding_complete_step"},
@@ -744,6 +911,134 @@ class TaskGenerator:
             context={"edge_case": "policy_check"},
         ))
         return tasks
     # ---- Cross-Workflow Tasks (10) ----
@@ -755,6 +1050,9 @@ class TaskGenerator:
             ("Engineering", "Product"),
             ("Sales", "Marketing"),
             ("Data Science", "Engineering"),
         ]
         for from_dept, to_dept in transfers:
             emp = _pick_employee(self.world, status="active", department=from_dept)
@@ -784,8 +1082,8 @@ class TaskGenerator:
                 context={"target_emp_id": emp["emp_id"], "from_dept": from_dept, "to_dept": to_dept},
             ))
-        # 4-5. Rehire previously offboarded employee
-        for _ in range(2):
             emp = _pick_employee(self.world, status="offboarded")
             if not emp:
                 continue
@@ -812,8 +1110,8 @@ class TaskGenerator:
                 context={"target_emp_id": emp["emp_id"], "rehire": True},
             ))
-        # 6-8. Bulk operations
-        for dept in self.rng.sample(["Engineering", "Product", "Data Science"], 3):
             tasks.append(Task(
                 task_id=self._next_id(),
                 instruction=(
@@ -831,8 +1129,8 @@ class TaskGenerator:
                 context={"department": dept},
             ))
-        # 9-10. Manager leaving — handle succession
-        for _ in range(2):
             candidates = [e for e in self.world.state["employees"]
                          if e["status"] == "active" and int(e["level"][1]) >= 3
                          and e.get("manager_id")]
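
The offboarding instruction in this file derives its trailing phrase by splitting the scenario text on `' is '`. The expression is isolated below as a minimal sketch so the rendered instruction text can be inspected; the helper names `leaving_phrase` and `build_instruction` and the `emp_0001` ID are hypothetical, while the string templates are copied from the diff.

```python
def leaving_phrase(scenario: str) -> str:
    """Same expression as in the instruction f-string in tasks.py."""
    return scenario.split(" is ")[1] if " is " in scenario else "is leaving"


def build_instruction(name: str, emp_id: str, reason: str, scenario: str) -> str:
    """Assemble the instruction the way the updated hunk does."""
    return (
        f"Initiate offboarding for {name} ({emp_id}) who {leaving_phrase(scenario)}. "
        f"Set the reason to '{reason}'. "
        f"Revoke their system access and notify IT."
    )


print(build_instruction("Daniel Park", "emp_0001", "resignation",
                        "Daniel Park is retiring"))
```

Note that the split discards the word "is", so the phrase begins at the verb, and the fallback `'is leaving'` is used only when the scenario contains no `' is '` at all.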

         return f"task_{self._task_counter:04d}"
     def generate_all_tasks(self) -> list[Task]:
+        """Generate the full task set (~200 tasks)."""
         tasks = []
         tasks.extend(self._simple_lookup_tasks())
+        tasks.extend(self._additional_lookup_tasks())
         tasks.extend(self._simple_onboarding_tasks())
         tasks.extend(self._medium_onboarding_tasks())
         tasks.extend(self._complex_onboarding_tasks())
         return tasks
+    # ---- Additional Lookup Tasks ----
+    def _additional_lookup_tasks(self) -> list[Task]:
+        tasks = []
+        depts = ["Engineering", "Product", "Marketing", "Sales", "Finance", "HR", "Data Science", "Security"]
+        # More employee lookups by ID
+        for _ in range(5):
+            emp = _pick_employee(self.world, status="active")
+            if not emp:
+                continue
+            tasks.append(Task(
+                task_id=self._next_id(),
+                instruction=f"Find the employee record for {emp['name']} (employee ID: {emp['emp_id']}).",
+                difficulty="simple",
+                category="lookup",
+                expected_tools=["hr_read_employee"],
+                rubric_criteria=[
+                    {"name": "correct_tool", "description": "Used hr_read_employee", "check": "tool_used:hr_read_employee"},
+                    {"name": "correct_id", "description": "Passed correct emp_id", "check": f"param_value:hr_read_employee.emp_id={emp['emp_id']}"},
+                ],
+                context={"target_emp_id": emp["emp_id"], "target_name": emp["name"]},
+            ))
+        # More department searches
+        for dept in self.rng.sample(depts, 3):
+            tasks.append(Task(
+                task_id=self._next_id(),
+                instruction=f"Show me all team members in the {dept} department.",
+                difficulty="simple",
+                category="lookup",
+                expected_tools=["hr_search_employees"],
+                rubric_criteria=[
+                    {"name": "correct_tool", "description": "Used hr_search_employees", "check": "tool_used:hr_search_employees"},
+                    {"name": "correct_dept", "description": "Filtered by correct department", "check": f"param_value:hr_search_employees.department={dept}"},
+                ],
+                context={"department": dept},
+            ))
+        # More org chart lookups
+        for dept in self.rng.sample(depts, 2):
+            tasks.append(Task(
+                task_id=self._next_id(),
+                instruction=f"Pull up the org chart for the {dept} team.",
+                difficulty="simple",
+                category="lookup",
+                expected_tools=["hr_get_org_chart"],
+                rubric_criteria=[
+                    {"name": "correct_tool", "description": "Used hr_get_org_chart", "check": "tool_used:hr_get_org_chart"},
+                    {"name": "correct_dept", "description": "Passed correct department", "check": f"param_value:hr_get_org_chart.department={dept}"},
+                ],
+                context={"department": dept},
+            ))
+        # Search by level
+        for level in ["L3", "L4", "L5"]:
+            tasks.append(Task(
+                task_id=self._next_id(),
+                instruction=f"Find all employees at level {level} across the company.",
+                difficulty="simple",
+                category="lookup",
+                expected_tools=["hr_search_employees"],
+                rubric_criteria=[
+                    {"name": "correct_tool", "description": "Used hr_search_employees", "check": "tool_used:hr_search_employees"},
+                    {"name": "correct_level", "description": "Filtered by correct level", "check": f"param_value:hr_search_employees.level={level}"},
+                ],
+                context={"level": level},
+            ))
+        # Policy lookups
+        tasks.append(Task(
+            task_id=self._next_id(),
+            instruction="What is the company's termination policy? Look up the relevant HR policy.",
+            difficulty="simple",
+            category="lookup",
+            expected_tools=["policy_lookup"],
+            rubric_criteria=[
+                {"name": "correct_tool", "description": "Used policy_lookup", "check": "tool_used:policy_lookup"},
+                {"name": "relevant_topic", "description": "Searched for termination topic", "check": "param_contains:policy_lookup.topic=terminat"},
+            ],
+        ))
+        tasks.append(Task(
+            task_id=self._next_id(),
+            instruction="Look up the contractor onboarding policy.",
+            difficulty="simple",
+            category="lookup",
+            expected_tools=["policy_lookup"],
+            rubric_criteria=[
+                {"name": "correct_tool", "description": "Used policy_lookup", "check": "tool_used:policy_lookup"},
+                {"name": "relevant_topic", "description": "Searched for contractor topic", "check": "param_contains:policy_lookup.topic=contractor"},
+            ],
+        ))
+        # Asset checks
+        tasks.append(Task(
+            task_id=self._next_id(),
+            instruction="What monitors are currently available for assignment?",
+            difficulty="simple",
+            category="lookup",
+            expected_tools=["it_get_available_assets"],
+            rubric_criteria=[
+                {"name": "correct_tool", "description": "Used it_get_available_assets", "check": "tool_used:it_get_available_assets"},
+                {"name": "correct_type", "description": "Filtered by monitor type", "check": "param_value:it_get_available_assets.asset_type=monitor"},
+            ],
+        ))
+        tasks.append(Task(
+            task_id=self._next_id(),
+            instruction="Check how many phones are available for new hires.",
+            difficulty="simple",
+            category="lookup",
+            expected_tools=["it_get_available_assets"],
+            rubric_criteria=[
+                {"name": "correct_tool", "description": "Used it_get_available_assets", "check": "tool_used:it_get_available_assets"},
+                {"name": "correct_type", "description": "Filtered by phone type", "check": "param_value:it_get_available_assets.asset_type=phone"},
+            ],
+        ))
+        return tasks
     # ---- Simple Onboarding Tasks (5) ----
     def _simple_onboarding_tasks(self) -> list[Task]:
         tasks = []
             ("David Brown", "Security", "L2", "Security Analyst"),
             ("Li Wei", "Engineering", "L3", "Senior Engineer"),
             ("Emma Davis", "Product", "L3", "Senior PM"),
+            # --- Additional medium onboarding hires ---
+            ("Olivia Thompson", "Marketing", "L2", "Content Strategist"),
+            ("Wei Zhang", "Engineering", "L3", "Staff Engineer"),
+            ("Rosa Martinez", "Sales", "L2", "Account Executive"),
+            ("Kofi Asante", "Data Science", "L1", "Junior Data Analyst"),
+            ("Yuki Sato", "Product", "L1", "Associate PM"),
+            ("Dmitri Volkov", "Security", "L3", "Senior Security Engineer"),
+            ("Amara Okafor", "HR", "L2", "HR Business Partner"),
+            ("Liam O'Connor", "Finance", "L3", "Senior Accountant"),
+            ("Fatou Diallo", "Engineering", "L1", "Junior Developer"),
+            ("Ines Moreau", "Marketing", "L3", "Marketing Manager"),
+            ("Tariq Hassan", "Sales", "L3", "Sales Manager"),
+            ("Mei-Ling Wu", "Data Science", "L2", "ML Engineer"),
+            ("Jakob Andersen", "Product", "L2", "UX Researcher"),
+            ("Chloe Dubois", "HR", "L3", "Senior HR Specialist"),
+            ("Ravi Krishnan", "Finance", "L1", "Junior Analyst"),
         ]
         for name, dept, level, role in names:
             ("Carlos Mendez", "Security", "L3", "Senior Security Engineer"),
             ("Rachel Green", "Product", "L2", "Product Designer"),
             ("Raj Kapoor", "Engineering", "L2", "Backend Developer"),
+            # --- Additional complex hires ---
+            ("Sofia Andersson", "Marketing", "L3", "Brand Director"),
+            ("Kwame Mensah", "Sales", "L2", "Enterprise Sales Rep"),
+            ("Elena Popov", "Finance", "L3", "Senior Controller"),
+            ("Marcus Washington", "HR", "L2", "Talent Acquisition Lead"),
+            ("Yuna Park", "Data Science", "L2", "Data Engineer"),
+            ("Omar Khalil", "Engineering", "L3", "DevOps Lead"),
+            ("Isabella Romano", "Product", "L3", "Senior Product Manager"),
+            ("Thabo Ndlovu", "Security", "L2", "Security Operations Analyst"),
+            ("Annika Johansson", "Marketing", "L2", "Growth Marketing Manager"),
+            ("Chen Wei", "Finance", "L2", "Financial Systems Analyst"),
         ]
         for name, dept, level, role in complex_hires:
             ("Hassan Ahmed", "Data Science", "L3", "Lead Data Scientist"),
             ("Laura Martinez", "Finance", "L3", "Senior Financial Analyst"),
             ("Kevin O'Brien", "Product", "L4", "VP of Product"),
+            # --- Additional approval-chain hires ---
+            ("Priscilla Nakamura", "Security", "L4", "Head of Security Operations"),
+            ("Ahmed El-Sayed", "Engineering", "L3", "Principal Architect"),
+            ("Gabriela Fernandez", "Data Science", "L4", "Director of Analytics"),
+            ("Vikram Reddy", "Finance", "L4", "VP of Finance"),
+            ("Nadia Kuznetsova", "HR", "L4", "VP of People"),
         ]:
             manager = _pick_manager_in_dept(self.world, dept, min_level="L4")
             needs_security = dept == "Security" or int(level[1]) >= 4
             ("resignation", "Daniel Park is retiring"),
             ("resignation", "Christina Muller is taking a career break"),
             ("resignation", "Yuki Tanaka is going back to school"),
+            # --- Additional offboarding scenarios ---
+            ("resignation", "Ming Chen is pursuing a startup"),
+            ("resignation", "Rosa Martinez is relocating internationally"),
+            ("termination", "Brian Foster is being terminated for misconduct"),
+            ("resignation", "Anika Gupta is joining a competitor"),
+            ("resignation", "Jean-Pierre Leclerc is taking a sabbatical"),
+            ("resignation", "Naomi Osei is transitioning to freelance work"),
+            ("resignation", "Derek Olson is moving into academia"),
+            ("termination", "Suki Yamamoto is being terminated for underperformance"),
+            ("resignation", "Alejandro Ruiz is emigrating abroad"),
+            ("resignation", "Priya Venkatesh is leaving for personal reasons"),
         ]
         for reason, scenario in offboarding_scenarios:
             name = emp["name"]
             instruction = (
                 f"Initiate offboarding for {name} ({emp['emp_id']}) who {scenario.split(' is ')[1] if ' is ' in scenario else 'is leaving'}. "
+                f"Set the reason to '{reason}'. "
                 f"Revoke their system access and notify IT."
             )
             criteria = [
                 {"name": "created_request", "description": "Created offboarding request", "check": "tool_used:offboarding_create_request"},
+                {"name": "correct_emp", "description": "Used correct employee ID", "check": f"param_value:offboarding_create_request.employee_id={emp['emp_id']}"},
+                {"name": "correct_reason", "description": "Set correct reason", "check": f"param_contains:offboarding_create_request.reason={reason}"},
                 {"name": "revoked_access", "description": "Revoked IT access", "check": "tool_used:it_revoke_access"},
                 {"name": "notified", "description": "Sent notification", "check": "tool_used_any:email_send,slack_send_message"},
             ]
         tasks = []
         # Full offboarding for managers/directors with reports
+        for _ in range(8):
             # Find an employee who has direct reports
             candidates = [e for e in self.world.state["employees"]
                          if e["status"] == "active" and int(e["level"][1]) >= 3]
             ))
         # Offboarding with asset reclamation
+        for _ in range(8):
             emp = _pick_employee(self.world, status="active")
             if not emp:
                 continue
         tasks = []
         # 1. Headcount limit exceeded
+        for dept in ["Marketing", "Finance", "HR", "Sales", "Data Science"]:
             tasks.append(Task(
                 task_id=self._next_id(),
                 instruction=f"Onboard a new L1 Associate to the {dept} department. Create their employee record and start onboarding.",
                 category="offboarding",
                 expected_tools=["offboarding_create_request", "it_revoke_access", "offboarding_complete_step"],
                 rubric_criteria=[
+                    {"name": "created_request", "description": "Created offboarding with termination reason", "check": "param_contains:offboarding_create_request.reason=terminat"},
                     {"name": "revoked_access", "description": "Revoked all access", "check": "tool_used:it_revoke_access"},
                     {"name": "no_farewell", "description": "Did NOT send farewell communications", "check": "tool_not_used:slack_send_message"},
                     {"name": "completed_steps", "description": "Completed termination steps", "check": "tool_used:offboarding_complete_step"},
             context={"edge_case": "policy_check"},
         ))
+        # 12. Software license full — Salesforce
+        tasks.append(Task(
+            task_id=self._next_id(),
+            instruction="Check if there are available Salesforce licenses for a new Sales hire.",
+            difficulty="edge_case",
+            category="onboarding",
+            expected_tools=["it_get_software_licenses"],
+            rubric_criteria=[
+                {"name": "checked_licenses", "description": "Checked licenses", "check": "tool_used:it_get_software_licenses"},
+                {"name": "correct_software", "description": "Checked Salesforce", "check": "param_contains:it_get_software_licenses.software_name=Salesforce"},
+            ],
+            context={"edge_case": "license_check", "software": "Salesforce"},
+        ))
+        # 13. Software license full — Figma
+        tasks.append(Task(
+            task_id=self._next_id(),
+            instruction="A new Product designer needs Figma access. Check if there are available Figma licenses.",
+            difficulty="edge_case",
+            category="onboarding",
+            expected_tools=["it_get_software_licenses"],
+            rubric_criteria=[
+                {"name": "checked_licenses", "description": "Checked licenses", "check": "tool_used:it_get_software_licenses"},
+                {"name": "correct_software", "description": "Checked Figma", "check": "param_contains:it_get_software_licenses.software_name=Figma"},
+            ],
+            context={"edge_case": "license_check", "software": "Figma"},
+        ))
+        # 14. Contractor onboarding — Marketing
+        tasks.append(Task(
+            task_id=self._next_id(),
+            instruction=(
+                "Onboard contractor Lucia Bianchi to Marketing as an L1 Contract Content Writer. "
+                "Contractors have limited access — no VPN, restricted to Slack and Google Workspace only, "
+                "and require legal approval. Create the record, initiate onboarding, "
+                "get legal approval, and provision appropriate (limited) access."
+            ),
+            difficulty="edge_case",
+            category="onboarding",
+            expected_tools=["hr_create_employee", "onboarding_create_request", "approval_request",
+                           "it_create_account"],
+            rubric_criteria=[
+                {"name": "created_contractor", "description": "Created employee with is_contractor=true", "check": "param_value:hr_create_employee.is_contractor=True"},
+                {"name": "initiated_onboarding", "description": "Created onboarding request", "check": "tool_used:onboarding_create_request"},
+                {"name": "legal_approval", "description": "Got legal approval", "check": "param_value:approval_request.approval_type=legal_approval"},
+                {"name": "limited_access", "description": "Created limited accounts", "check": "tool_used:it_create_account"},
+            ],
+            context={"edge_case": "contractor_onboarding", "name": "Lucia Bianchi"},
+        ))
+        # 15. Second termination scenario — security breach
+        emp2 = _pick_employee(self.world, status="active", has_manager=True)
+        if emp2:
+            tasks.append(Task(
+                task_id=self._next_id(),
+                instruction=(
+                    f"{emp2['name']} ({emp2['emp_id']}) is being terminated due to a security breach. "
+                    f"Immediately revoke all system access and badges, create the termination request, "
+                    f"and ensure all offboarding steps are completed. Do NOT send farewell messages."
+                ),
+                difficulty="edge_case",
+                category="offboarding",
+                expected_tools=["offboarding_create_request", "it_revoke_access", "offboarding_complete_step"],
+                rubric_criteria=[
+                    {"name": "created_request", "description": "Created offboarding with termination reason", "check": "param_contains:offboarding_create_request.reason=terminat"},
+                    {"name": "revoked_access", "description": "Revoked all access", "check": "tool_used:it_revoke_access"},
+                    {"name": "no_farewell_email", "description": "Did NOT send farewell email", "check": "tool_not_used:email_send"},
+                    {"name": "no_farewell_slack", "description": "Did NOT send farewell Slack", "check": "tool_not_used:slack_send_message"},
+                    {"name": "completed_steps", "description": "Completed termination steps", "check": "tool_used:offboarding_complete_step"},
+                ],
+                context={"target_emp_id": emp2["emp_id"], "edge_case": "termination_security_breach"},
+            ))
+        # 16. Third termination scenario — misconduct
+        emp3 = _pick_employee(self.world, status="active", has_manager=True)
+        if emp3:
+            tasks.append(Task(
+                task_id=self._next_id(),
+                instruction=(
+                    f"{emp3['name']} ({emp3['emp_id']}) is being terminated for workplace misconduct. "
+                    f"Follow the termination policy: revoke all access immediately, "
+                    f"create the termination offboarding request with reason 'termination', "
+                    f"and complete the process. No farewell communications."
+                ),
+                difficulty="edge_case",
+                category="offboarding",
+                expected_tools=["offboarding_create_request", "it_revoke_access"],
+                rubric_criteria=[
+                    {"name": "revoked_first", "description": "Revoked access", "check": "tool_used:it_revoke_access"},
+                    {"name": "created_request", "description": "Created termination request", "check": "param_contains:offboarding_create_request.reason=terminat"},
+                    {"name": "no_farewell", "description": "No farewell sent", "check": "tool_not_used:slack_send_message"},
+                ],
+                context={"target_emp_id": emp3["emp_id"], "edge_case": "termination_misconduct"},
+            ))
+        # 17. Bulk onboarding resource check
+        tasks.append(Task(
+            task_id=self._next_id(),
+            instruction=(
+                "The Engineering team is hiring 5 new engineers at once. Before proceeding, "
+                "check available laptops, monitors, and software licenses (Jira, GitHub, AWS). "
+                "Report what resources are available."
+            ),
+            difficulty="edge_case",
+            category="onboarding",
+            expected_tools=["it_get_available_assets", "it_get_software_licenses"],
+            rubric_criteria=[
+                {"name": "checked_laptops", "description": "Checked laptop availability", "check": "tool_used:it_get_available_assets"},
+                {"name": "checked_licenses", "description": "Checked software licenses", "check": "tool_used:it_get_software_licenses"},
+                {"name": "multiple_checks", "description": "Made multiple resource checks", "check": "tool_count:it_get_software_licenses>=2"},
+            ],
+            context={"edge_case": "bulk_onboarding_resources"},
+        ))
+        # 18. Look up termination policy
+        tasks.append(Task(
+            task_id=self._next_id(),
+            instruction="Look up the company's termination policy and the offboarding policy to understand the required steps.",
+            difficulty="edge_case",
+            category="lookup",
+            expected_tools=["policy_lookup"],
+            rubric_criteria=[
+                {"name": "looked_up_policy", "description": "Looked up policy", "check": "tool_used:policy_lookup"},
+                {"name": "multiple_lookups", "description": "Looked up multiple policies", "check": "tool_count:policy_lookup>=2"},
+            ],
+            context={"edge_case": "policy_check_termination"},
+        ))
         return tasks
     # ---- Cross-Workflow Tasks (10) ----
             ("Engineering", "Product"),
             ("Sales", "Marketing"),
             ("Data Science", "Engineering"),
+            ("Finance", "HR"),
+            ("Marketing", "Product"),
+            ("Security", "Engineering"),
         ]
         for from_dept, to_dept in transfers:
             emp = _pick_employee(self.world, status="active", department=from_dept)
                 context={"target_emp_id": emp["emp_id"], "from_dept": from_dept, "to_dept": to_dept},
             ))
+        # 4-7. Rehire previously offboarded employee
+        for _ in range(4):
             emp = _pick_employee(self.world, status="offboarded")
             if not emp:
                 continue
                 context={"target_emp_id": emp["emp_id"], "rehire": True},
             ))
+        # Bulk operations
+        for dept in self.rng.sample(["Engineering", "Product", "Data Science", "Marketing", "Sales", "Security"], 6):
             tasks.append(Task(
                 task_id=self._next_id(),
                 instruction=(
                 context={"department": dept},
             ))
+        # Manager leaving — handle succession
+        for _ in range(4):
             candidates = [e for e in self.world.state["employees"]
                          if e["status"] == "active" and int(e["level"][1]) >= 3
                          and e.get("manager_id")]
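
Reviewer note: the rubric entries in `server/tasks.py` rely on a small check-string vocabulary (`tool_used`, `tool_not_used`, `tool_used_any`, `tool_count`, `param_value`, `param_contains`). The evaluator in `server/rubrics.py` is not part of this diff, so the following is only a minimal sketch of how such strings could be interpreted. The function name `check_passes`, the log shape (a list of `{"tool": ..., "params": ...}` dicts), and the case-insensitive comparison are all assumptions made for illustration, not the project's actual implementation.

```python
from typing import Any


def check_passes(check: str, action_log: list[dict[str, Any]]) -> bool:
    """Evaluate one rubric check string against a log of tool calls.

    Hypothetical helper: assumes each log entry looks like
    {"tool": "<name>", "params": {...}}.
    """
    kind, _, spec = check.partition(":")
    used = [entry["tool"] for entry in action_log]

    if kind == "tool_used":
        return spec in used
    if kind == "tool_not_used":
        return spec not in used
    if kind == "tool_used_any":
        # spec is a comma-separated list of acceptable tools
        return any(name in used for name in spec.split(","))
    if kind == "tool_count":
        # spec looks like "policy_lookup>=2"
        name, _, minimum = spec.partition(">=")
        return used.count(name) >= int(minimum)
    if kind in ("param_value", "param_contains"):
        # spec looks like "tool_name.param_name=expected"
        target, _, expected = spec.partition("=")
        tool, _, param = target.partition(".")
        for entry in action_log:
            if entry["tool"] != tool or param not in entry["params"]:
                continue
            # Assumption: values are compared as lowercase strings.
            actual = str(entry["params"][param]).lower()
            if kind == "param_value" and actual == expected.lower():
                return True
            if kind == "param_contains" and expected.lower() in actual:
                return True
        return False
    raise ValueError(f"unknown check kind: {kind!r}")
```

Under these assumptions, a check such as `param_contains:offboarding_create_request.reason=terminat` passes for a logged reason of either `"termination"` or `"Terminated for misconduct"`, which appears to be why the rubric uses the truncated stem.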

test_all_tasks.py ADDED

	@@ -0,0 +1,186 @@

+"""Run all 77 tasks with GPT-4o-mini and compute aggregate metrics."""
+import sys
+import json
+import os
+import re
+import time
+from dotenv import load_dotenv
+load_dotenv()
+sys.path.insert(0, ".")
+sys.path.insert(0, "./server")
+from openai import OpenAI
+from server.hr_onboarding_environment import HROnboardingEnvironment
+from models import HROnboardingAction
+from server.tools import TOOL_DEFINITIONS
+from server.rubrics import RubricEvaluator
+client = OpenAI()
+tool_desc = json.dumps(TOOL_DEFINITIONS, indent=2)
+system_prompt = (
+    "You are an HR automation agent for AcmeCorp. You help with employee "
+    "onboarding and offboarding by calling the appropriate tools.\n\n"
+    "For each step, respond with ONLY a JSON tool call in this exact format:\n"
+    '{"tool": "<tool_name>", "params": {<parameters>}}\n\n'
+    'When you believe the task is complete, respond with:\n'
+    '{"tool": "__done__", "params": {}}\n\n'
+    "Important rules:\n"
+    "- Respond with ONLY the JSON object, no other text\n"
+    "- Use the exact tool names and parameter names from the tool definitions\n"
+    "- Think about what information you need and what tools to call in what order\n\n"
+    f"Available tools:\n{tool_desc}"
+)
+results = []
+evaluator = RubricEvaluator()
+num_tasks = 77
+print("=" * 70)
+print(f"HR ONBOARDING ENVIRONMENT — FULL EVALUATION ({num_tasks} tasks)")
+print("Model: gpt-4o-mini")
+print("=" * 70)
+for task_idx in range(num_tasks):
+    env = HROnboardingEnvironment(seed=42, max_steps=15)
+    # Cycle to the desired task
+    for _ in range(task_idx + 1):
+        obs = env.reset()
+    task = env._current_task
+    task_id = obs.task_id
+    difficulty = obs.metadata.get("difficulty", "?")
+    category = obs.metadata.get("category", "?")
+    messages = [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": obs.instruction},
+    ]
+    steps_taken = 0
+    error_count = 0
+    for step in range(1, obs.max_steps + 1):
+        try:
+            response = client.chat.completions.create(
+                model="gpt-4o-mini",
+                messages=messages,
+                temperature=0.1,
+                max_tokens=512,
+            )
+            assistant_msg = (response.choices[0].message.content or "").strip()
+        except Exception as e:
+            print(f"  API error on {task_id} step {step}: {e}")
+            time.sleep(5)
+            continue
+        # Parse tool call
+        try:
+            json_match = re.search(r'\{.*\}', assistant_msg, re.DOTALL)
+            if json_match:
+                tool_call = json.loads(json_match.group())
+            else:
+                tool_call = json.loads(assistant_msg)
+        except json.JSONDecodeError:
+            messages.append({"role": "assistant", "content": assistant_msg})
+            messages.append({"role": "user", "content": 'Respond with valid JSON: {"tool": "<name>", "params": {<args>}}'})
+            error_count += 1
+            continue
+        tool_name = tool_call.get("tool", "")
+        params = tool_call.get("params", {})
+        if tool_name == "__done__":
+            break
+        action = HROnboardingAction(tool_name=tool_name, arguments=params)
+        obs = env.step(action)
+        steps_taken += 1
+        result_str = json.dumps(obs.tool_result, indent=2)
+        messages.append({"role": "assistant", "content": assistant_msg})
+        messages.append({"role": "user", "content": f"Tool result:\n{result_str}\n\nContinue with next tool call, or {{\"tool\": \"__done__\", \"params\": {{}}}} if done."})
+        if obs.done:
+            break
+    # Evaluate
+    eval_result = evaluator.evaluate(task, env.world.action_log)
+    result = {
+        "task_id": task_id,
+        "difficulty": difficulty,
+        "category": category,
+        "score": eval_result["score"],
+        "passed": eval_result["passed"],
+        "passed_count": eval_result["passed_count"],
+        "total_criteria": eval_result["total_criteria"],
+        "steps_taken": steps_taken,
+        "parse_errors": error_count,
+    }
+    results.append(result)
+    status = "PASS" if result["passed"] else "FAIL"
+    print(f"  [{task_idx+1:2d}/{num_tasks}] {task_id:10s} [{difficulty:10s}] [{category:14s}] "
+          f"Score: {result['score']:.0%} ({result['passed_count']}/{result['total_criteria']}) "
+          f"Steps: {steps_taken:2d}  {status}")
+# --- Aggregate metrics ---
+print("\n" + "=" * 70)
+print("AGGREGATE RESULTS")
+print("=" * 70)
+total = len(results)
+pass_count = sum(1 for r in results if r["passed"])
+mean_score = sum(r["score"] for r in results) / total
+mean_steps = sum(r["steps_taken"] for r in results) / total
+total_criteria = sum(r["total_criteria"] for r in results)
+total_passed_criteria = sum(r["passed_count"] for r in results)
+print("\nOverall:")
+print(f"  Tasks:           {total}")
+print(f"  Pass rate:       {pass_count}/{total} ({pass_count/total:.1%})")
+print(f"  Mean score:      {mean_score:.3f}")
+print(f"  Mean steps:      {mean_steps:.1f}")
+print(f"  Criteria hit:    {total_passed_criteria}/{total_criteria} ({total_passed_criteria/total_criteria:.1%})")
+# By difficulty
+print("\nBy Difficulty:")
+for diff in ["simple", "medium", "complex", "edge_case"]:
+    subset = [r for r in results if r["difficulty"] == diff]
+    if not subset:
+        continue
+    n = len(subset)
+    p = sum(1 for r in subset if r["passed"])
+    s = sum(r["score"] for r in subset) / n
+    st = sum(r["steps_taken"] for r in subset) / n
+    print(f"  {diff:10s}: {p:2d}/{n:2d} pass ({p/n:.0%})  mean_score={s:.2f}  mean_steps={st:.1f}")
+# By category
+print("\nBy Category:")
+for cat in ["lookup", "onboarding", "offboarding", "cross_workflow"]:
+    subset = [r for r in results if r["category"] == cat]
+    if not subset:
+        continue
+    n = len(subset)
+    p = sum(1 for r in subset if r["passed"])
+    s = sum(r["score"] for r in subset) / n
+    print(f"  {cat:14s}: {p:2d}/{n:2d} pass ({p/n:.0%})  mean_score={s:.2f}")
+# Save results
+os.makedirs("outputs", exist_ok=True)
+with open("outputs/full_eval_results.json", "w") as f:
+    json.dump({
+        "model": "gpt-4o-mini",
+        "total_tasks": total,
+        "pass_count": pass_count,
+        "pass_rate": pass_count / total,
+        "mean_score": mean_score,
+        "mean_steps": mean_steps,
+        "criteria_hit_rate": total_passed_criteria / total_criteria,
+        "results": results,
+    }, f, indent=2)
+print("\nDetailed results saved to outputs/full_eval_results.json")
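
Reviewer note: the tool-call parser in `test_all_tasks.py` uses the greedy pattern `\{.*\}`, which spans from the first `{` to the last `}` in the reply. If the model adds trailing text containing a brace, `json.loads` fails and the step is counted as a parse error. One possible alternative, sketched below, is to let `json.JSONDecoder.raw_decode` read a single object starting at each `{` in turn. The helper name `extract_tool_call` is hypothetical and not part of this commit.

```python
import json
from typing import Any, Optional


def extract_tool_call(text: str) -> Optional[dict[str, Any]]:
    """Return the first JSON object in `text` that has a "tool" key, else None.

    Hypothetical helper; raw_decode stops at the end of the first complete
    JSON value, so trailing prose or stray braces are ignored.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "tool" in obj:
            return obj
        # Not a usable object at this position; try the next opening brace.
        start = text.find("{", start + 1)
    return None
```

In the evaluation loop, a `None` result would take the same branch as the existing `json.JSONDecodeError` handler (append the reminder message and increment `error_count`), so the rest of the control flow could stay unchanged.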

train_hr_agent.ipynb ADDED

The diff for this file is too large to render.