---
title: HR Onboarding & Offboarding Environment
emoji: 🏒
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 7860
base_path: /playground
tags:
  - openenv
---

# HR Onboarding & Offboarding Environment

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ravi03071991/rl_hack/blob/master/train_hr_agent.ipynb)

An OpenEnv-compatible RL environment that simulates enterprise HR onboarding and offboarding workflows. The agent orchestrates across **6 enterprise apps** (Workday, ServiceNow, Okta, Email, Slack, and Calendar) using 25 tools to complete multi-step tasks in a realistic HR system (200+ employees, 8 departments, RBAC, approval chains).

Built for the [OpenEnv Hackathon SF](https://cerebralvalley.ai/e/openenv-hackathon-sf/details), **Statement 3.1: Professional Tasks** (Scaler AI Labs partner theme: Multi-App RL Environment for Enterprise Workflows).

### Key Results

> **GRPO training on Llama 3.2-1B-Instruct improves mean task score by +67% (0.37 → 0.62).**
> Complex multi-step task scores **more than double** (0.26 → 0.68). Gains generalize to held-out test tasks.

| | Baseline | Trained | Improvement |
|---|---------|---------|-------------|
| Mean Score | 0.370 | 0.617 | **+67%** |
| Complex Tasks | 0.26 | 0.68 | **+162%** |
| Pass Rate | 15.4% | 19.2% | +3.8pp |

## Quick Start

```python
from rl_hack import HROnboardingAction, HROnboardingEnv

# Connect to the environment
with HROnboardingEnv(base_url="http://localhost:7860") as env:
    result = env.reset()
    print(result.observation)  # Task instruction + available tools

    # Agent calls tools to complete the task
    result = env.step(HROnboardingAction(
        tool_name="hr_create_employee",
        arguments={"name": "Priya Sharma", "department": "Engineering", "level": "L2", "role": "Software Engineer"}
    ))
    print(result.observation)  # Tool result
    print(result.reward)       # Rubric-based reward
```

## Tools / Actions (25 MCP Tools)

The agent interacts with the environment by calling these tools. Each tool modifies the world state and returns a result.

### HR System (5 tools)

| # | Tool | Description | Key Parameters |
|---|------|-------------|----------------|
| 1 | `hr_create_employee` | Create a new employee record | `name`, `department`, `level`, `role`, `manager_id`, `is_contractor` |
| 2 | `hr_read_employee` | Look up employee by ID or email | `emp_id` or `email` |
| 3 | `hr_update_employee` | Update employee fields (status, department, etc.) | `emp_id`, `updates` (dict) |
| 4 | `hr_search_employees` | Search/filter employees by criteria | `department`, `level`, `status`, `location`, `role` |
| 5 | `hr_get_org_chart` | Get reporting hierarchy for a department | `department` |

### Onboarding / Offboarding (6 tools)

| # | Tool | Description | Key Parameters |
|---|------|-------------|----------------|
| 6 | `onboarding_create_request` | Initiate onboarding for a new hire | `employee_id` |
| 7 | `onboarding_get_status` | Check onboarding progress | `request_id` or `employee_id` |
| 8 | `onboarding_complete_step` | Mark an onboarding step as done | `request_id`, `step` |
| 9 | `offboarding_create_request` | Initiate offboarding for departing employee | `employee_id`, `reason`, `exit_date` |
| 10 | `offboarding_get_status` | Check offboarding progress | `request_id` or `employee_id` |
| 11 | `offboarding_complete_step` | Mark an offboarding step as done | `request_id`, `step` |

### IT Provisioning (5 tools)

| # | Tool | Description | Key Parameters |
|---|------|-------------|----------------|
| 12 | `it_assign_asset` | Assign laptop/monitor/phone to employee | `asset_id`, `employee_id` |
| 13 | `it_get_available_assets` | List unassigned assets by type | `asset_type` (laptop, monitor, phone, headset) |
| 14 | `it_create_account` | Create email/Slack/VPN/GitHub accounts | `employee_id`, `account_types` |
| 15 | `it_revoke_access` | Revoke all IT access (for offboarding) | `employee_id` |
| 16 | `it_get_software_licenses` | Check license seat availability | `software_name` |

### Access Control (4 tools)

| # | Tool | Description | Key Parameters |
|---|------|-------------|----------------|
| 17 | `access_assign_role` | Assign RBAC role (checks level/dept restrictions) | `employee_id`, `role_id` |
| 18 | `access_create_badge` | Create physical access badge | `employee_id`, `access_zones` |
| 19 | `access_revoke_role` | Revoke a specific access role | `employee_id`, `role_id` |
| 20 | `access_get_security_groups` | List all security groups and resources | _(none)_ |

### Communication (3 tools)

| # | Tool | Description | Key Parameters |
|---|------|-------------|----------------|
| 21 | `email_send` | Send email (welcome, farewell, notifications) | `from_address`, `to_address`, `subject`, `body` |
| 22 | `slack_send_message` | Post in Slack channel or DM | `channel`, `sender`, `text` |
| 23 | `meeting_schedule` | Schedule orientation, 1-on-1, exit interview | `title`, `attendees`, `datetime`, `meeting_type` |

### Policy & Approval (2 tools)

| # | Tool | Description | Key Parameters |
|---|------|-------------|----------------|
| 24 | `policy_lookup` | Look up company policies by topic/department | `topic`, `department`, `policy_id` |
| 25 | `approval_request` | Submit approval (manager/IT/security/legal) | `request_id`, `approver_id`, `approval_type` |

## Tasks (77 tasks across 4 categories)

Each episode presents one task. The agent must call the right tools in the right order.

### Task Categories

| Category | Count | Example |
|----------|-------|---------|
| **Lookup** (simple) | 11 | "List all employees in the Engineering department" |
| **Onboarding** | 32 | "Fully onboard John Lee as L3 Team Lead in Data Science: create record, assign laptop, provision accounts, set up access, send welcome email, schedule orientation" |
| **Offboarding** | 24 | "Offboard departing director: revoke all access, reclaim assets, reassign reports, send farewell, schedule exit interview" |
| **Cross-workflow** | 10 | "Employee transferring from Engineering to Product: offboard from old dept, onboard to new" |

### Difficulty Levels

| Difficulty | Count | Tools per task | Description |
|------------|-------|---------------|-------------|
| Simple | 19 | 1-2 | Single lookups or status checks |
| Medium | 21 | 2-4 | Create + initiate workflows |
| Complex | 25 | 5-10 | Full end-to-end workflows with approvals |
| Edge case | 12 | 2-5 | Business rule violations, policy constraints |

### Edge Cases (designed to test policy compliance)

- Department at **headcount limit**: create employee should fail
- Software license **seats full** (Netsuite, LinkedIn Sales Navigator)
- Manager **on leave**: must find skip-level manager for approvals
- **Contractor** onboarding: different rules (no VPN, limited access, legal approval required)
- **Termination** vs resignation: different offboarding steps, no farewell email
- **Offer rescinded**: offboard someone mid-onboarding
- **Level mismatch**: L1 employee can't get L4+ access roles
- **Department restriction**: Marketing employee can't get Engineering GitHub role

## World State (500+ entities)

| Entity | Count | Description |
|--------|-------|-------------|
| Employees | 200 | Full org hierarchy across 8 departments (L1-L6) |
| Departments | 8 | Engineering, Product, Marketing, Sales, Finance, HR, Data Science, Security |
| IT Assets | 100 | Laptops (50), monitors (25), phones (15), headsets (10) |
| Access Roles | 20 | RBAC roles with level/department restrictions |
| Software Licenses | 15 | Jira, GitHub, AWS, Slack, Salesforce, etc. (2 intentionally full) |
| Policies | 15 | Onboarding, offboarding, badge access, contractor, termination, etc. |
| Security Groups | 15 | engineering_team, vpn_users, server_room_access, etc. |
| Message Templates | 12 | Welcome/farewell emails, Slack messages, notifications |

### RBAC Rules

- **L1** Associate → **L2** Senior → **L3** Team Lead → **L4** Manager → **L5** Director → **L6** VP
- L3+ can approve onboarding
- L4+ required for security approvals and server room badge access
- Contractors require legal approval
- Access roles have minimum level requirements and department restrictions
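
The level and department rules above can be sketched as a small eligibility check. This is only an illustration: the `Role` fields (`min_level`, `departments`), the role IDs, and the `can_assign` helper are hypothetical names, not the schema used in `access_roles.json`.

```python
from dataclasses import dataclass
from typing import Optional

LEVELS = ["L1", "L2", "L3", "L4", "L5", "L6"]  # Associate .. VP, ascending


@dataclass(frozen=True)
class Role:
    """Hypothetical role record; the real access_roles.json schema may differ."""
    role_id: str
    min_level: str
    departments: Optional[frozenset] = None  # None means any department


def level_rank(level: str) -> int:
    """Position of a level in the L1..L6 ladder."""
    return LEVELS.index(level)


def can_assign(level: str, department: str, role: Role) -> tuple:
    """Return (allowed, reason) for granting `role` to an employee."""
    if level_rank(level) < level_rank(role.min_level):
        return False, "level_mismatch"
    if role.departments is not None and department not in role.departments:
        return False, "department_restriction"
    return True, "ok"


# Illustrative roles (IDs are made up for this sketch)
server_room = Role("server_room_access", min_level="L4")
github_eng = Role("github_engineering", min_level="L1",
                  departments=frozenset({"Engineering"}))

print(can_assign("L1", "Engineering", server_room))  # (False, 'level_mismatch')
print(can_assign("L2", "Marketing", github_eng))     # (False, 'department_restriction')
print(can_assign("L4", "Security", server_room))     # (True, 'ok')
```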

## Reward / Rubric

Each task has a rubric with verifiable criteria. Reward = proportion of criteria satisfied.

### Rubric Check Types

| Check | Example | What it verifies |
|-------|---------|-----------------|
| `tool_used` | `tool_used:hr_create_employee` | Tool was called at least once |
| `tool_not_used` | `tool_not_used:slack_send_message` | Tool was NOT called (e.g. no farewell for terminations) |
| `tool_used_any` | `tool_used_any:email_send,slack_send_message` | At least one of the tools was used |
| `param_value` | `param_value:hr_create_employee.name=Priya Sharma` | Tool called with specific parameter value |
| `param_contains` | `param_contains:policy_lookup.topic=onboard` | Parameter contains substring |
| `tool_order` | `tool_order:hr_create_employee<onboarding_create_request` | Tool A called before Tool B |
| `tool_count` | `tool_count:onboarding_complete_step>=3` | Tool called at least N times |
| `result_contains` | `result_contains:headcount_limit` | Any tool result contains substring |
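
As a rough sketch of how some of these check strings could be evaluated, the snippet below handles five of the check types against a log of tool calls. The log shape (dicts with `tool`, `params`, `result`) and the `run_check` helper are assumptions made for illustration; the actual evaluator in `server/rubrics.py` may work differently.

```python
import json
import re

# Assumed call-log shape: one dict per tool call, in call order.
calls = [
    {"tool": "policy_lookup", "params": {"topic": "onboarding"},
     "result": {"success": True}},
    {"tool": "hr_create_employee", "params": {"name": "A. Example"},
     "result": {"success": False, "error": "headcount_limit"}},
    {"tool": "email_send", "params": {"subject": "Update"},
     "result": {"success": True}},
]


def run_check(check: str, calls: list) -> bool:
    """Evaluate one '<type>:<spec>' check string against the call log."""
    kind, _, spec = check.partition(":")
    names = [c["tool"] for c in calls]
    if kind == "tool_not_used":
        return spec not in names
    if kind == "tool_used_any":
        return any(tool in names for tool in spec.split(","))
    if kind == "param_contains":
        target, _, needle = spec.partition("=")
        tool, _, param = target.partition(".")
        return any(c["tool"] == tool and needle in str(c["params"].get(param, ""))
                   for c in calls)
    if kind == "tool_count":
        tool, minimum = re.fullmatch(r"(\w+)>=(\d+)", spec).groups()
        return names.count(tool) >= int(minimum)
    if kind == "result_contains":
        return any(spec in json.dumps(c["result"]) for c in calls)
    raise ValueError(f"unsupported check type: {kind}")


print(run_check("tool_not_used:slack_send_message", calls))             # True
print(run_check("tool_used_any:email_send,slack_send_message", calls))  # True
print(run_check("param_contains:policy_lookup.topic=onboard", calls))   # True
print(run_check("tool_count:onboarding_complete_step>=3", calls))       # False
print(run_check("result_contains:headcount_limit", calls))              # True
```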

### Example Rubric (medium task)

Task: "Onboard Priya Sharma to Engineering as L2 Software Engineer"

| Criterion | Check |
|-----------|-------|
| Created employee record | `tool_used:hr_create_employee` |
| Correct name | `param_value:hr_create_employee.name=Priya Sharma` |
| Correct department | `param_value:hr_create_employee.department=Engineering` |
| Correct level | `param_value:hr_create_employee.level=L2` |
| Correct role | `param_value:hr_create_employee.role=Software Engineer` |
| Initiated onboarding | `tool_used:onboarding_create_request` |
| Correct sequencing | `tool_order:hr_create_employee<onboarding_create_request` |

**Score**: 7/7 = 1.0 (pass) or partial (e.g. 5/7 = 0.71)
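
To make the scoring arithmetic concrete, here is a minimal sketch that scores the rubric above against two hypothetical call logs. The `check` and `score` helpers and the call-log format are assumptions for illustration and are not taken from `server/rubrics.py`.

```python
RUBRIC = [
    "tool_used:hr_create_employee",
    "param_value:hr_create_employee.name=Priya Sharma",
    "param_value:hr_create_employee.department=Engineering",
    "param_value:hr_create_employee.level=L2",
    "param_value:hr_create_employee.role=Software Engineer",
    "tool_used:onboarding_create_request",
    "tool_order:hr_create_employee<onboarding_create_request",
]


def check(criterion: str, calls: list) -> bool:
    """Evaluate a tool_used, param_value, or tool_order criterion."""
    kind, _, spec = criterion.partition(":")
    names = [c["tool"] for c in calls]
    if kind == "tool_used":
        return spec in names
    if kind == "param_value":
        target, _, expected = spec.partition("=")
        tool, _, param = target.partition(".")
        return any(c["tool"] == tool and str(c["params"].get(param)) == expected
                   for c in calls)
    if kind == "tool_order":
        first, _, second = spec.partition("<")
        return (first in names and second in names
                and names.index(first) < names.index(second))
    raise ValueError(f"unsupported check type: {kind}")


def score(rubric: list, calls: list) -> float:
    """Reward = proportion of criteria satisfied."""
    return sum(check(c, calls) for c in rubric) / len(rubric)


def make_calls(level: str, role: str) -> list:
    """Hypothetical two-call episode: create the employee, then start onboarding."""
    return [
        {"tool": "hr_create_employee",
         "params": {"name": "Priya Sharma", "department": "Engineering",
                    "level": level, "role": role}},
        {"tool": "onboarding_create_request", "params": {"employee_id": "emp_0201"}},
    ]


print(score(RUBRIC, make_calls("L2", "Software Engineer")))  # 1.0
print(round(score(RUBRIC, make_calls("L3", "Engineer")), 2))  # 0.71 (level and role wrong)
```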

## Environment API

### OpenEnv Interface (MCPEnvironment)

```
reset()  → Observation   # Pick task, reset world state, return instruction
step()   → Observation   # Agent calls a tool, get result + reward
state    → State         # Current step count, episode ID
```

### Episode Flow

```
1. env.reset()
   → Task: "Fully onboard John Lee as L3 Team Lead..."

2. Agent calls: hr_create_employee(name="John Lee", department="Data Science", level="L3", ...)
   → env.step() → {"success": true, "emp_id": "emp_0201"}

3. Agent calls: onboarding_create_request(employee_id="emp_0201")
   → env.step() → {"success": true, "request_id": "onb_0001", "steps": {...}}

4. Agent calls: it_get_available_assets(asset_type="laptop")
   → env.step() → {"success": true, "assets": [...]}

5. Agent calls: it_assign_asset(asset_id="asset_003", employee_id="emp_0201")
   → env.step() → {"success": true}

   ... more tool calls ...

N. Episode ends (max 15 steps or agent signals done)
   → Reward: 8/10 criteria satisfied = 0.8
```
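
The flow above amounts to a bounded loop over `reset()` and `step()`. The sketch below uses a stub environment so it runs without a server; with the real client you would pass an `HROnboardingEnv` and build `HROnboardingAction` objects instead. The `done` field on the result object and the `run_episode` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

MAX_STEPS = 15  # episode cap stated in the flow above


@dataclass
class StepResult:
    """Stand-in for the client's result object; `done` is an assumed field."""
    observation: Any
    reward: float = 0.0
    done: bool = False


class StubEnv:
    """Toy environment that ends the episode after a fixed number of tool calls."""

    def __init__(self, needed_calls: int = 2):
        self.needed_calls = needed_calls
        self.calls = 0

    def reset(self) -> StepResult:
        self.calls = 0
        return StepResult(observation="Task: onboard a new hire")

    def step(self, action: dict) -> StepResult:
        self.calls += 1
        done = self.calls >= self.needed_calls
        return StepResult({"success": True, "tool": action["tool_name"]},
                          reward=1.0 if done else 0.0, done=done)


def run_episode(env, policy: Callable[[Any], dict], max_steps: int = MAX_STEPS) -> float:
    """Reset, then call tools until the env reports done or the step cap is hit."""
    result = env.reset()
    for _ in range(max_steps):
        action = policy(result.observation)
        result = env.step(action)
        if result.done:
            break
    return result.reward


def scripted_policy(observation: Any) -> dict:
    """Placeholder policy that always issues the same tool call."""
    return {"tool_name": "hr_create_employee", "arguments": {"name": "John Lee"}}


print(run_episode(StubEnv(), scripted_policy))  # 1.0
```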

## Project Structure

```
rl_hack/
├── README.md                          # This file
├── openenv.yaml                       # OpenEnv manifest
├── pyproject.toml                     # Project metadata
├── __init__.py                        # Module exports
├── client.py                          # HROnboardingEnv client
├── models.py                          # Action/Observation Pydantic models
├── test_with_llm.py                   # Test single task with GPT agent
├── test_all_tasks.py                  # Evaluate all 77 tasks
├── train_hr_agent.ipynb               # GRPO training notebook (Unsloth)
├── .env                               # API keys (gitignored)
├── outputs/                           # Evaluation results
└── server/
    ├── __init__.py
    ├── app.py                         # FastAPI application
    ├── hr_onboarding_environment.py   # Core environment (Environment subclass)
    ├── world.py                       # World state (entities, RBAC, mutations)
    ├── tools.py                       # Tool registry (25 tools)
    ├── tasks.py                       # Task definitions + generation (77 tasks)
    ├── rubrics.py                     # Rubric evaluator (reward computation)
    ├── data/
    │   ├── employees.json             # 200 employee records
    │   ├── departments.json           # 8 departments with policies
    │   ├── policies.json              # 15 business rule documents
    │   ├── it_assets.json             # 100 IT assets
    │   ├── access_roles.json          # 20 RBAC roles
    │   └── templates.json             # 12 message templates
    ├── Dockerfile                     # Container image
    └── requirements.txt               # Server dependencies
```

## Testing with an LLM Agent

You can test the environment locally using GPT (or any OpenAI-compatible model) as the agent.

### Setup

1. Create a `.env` file in the repo root:
   ```
   OPENAI_API_KEY="sk-proj-..."
   ```

2. Install dependencies:
   ```bash
   uv pip install -e ".[eval]"
   ```

### Run

```bash
cd rl_hack

# Test on default task (simple lookup)
uv run python -m test_with_llm

# Test a specific task by index (0-76)
uv run python -m test_with_llm 14    # medium onboarding task
uv run python -m test_with_llm 24    # complex full onboarding
uv run python -m test_with_llm 55    # edge case (headcount limit)

# Run full evaluation across all 77 tasks
uv run python test_all_tasks.py
```

The script will:
- Reset the environment and pick a task
- Use GPT-4o-mini to generate tool calls
- Execute each tool call against the environment
- Print the rubric evaluation with pass/fail per criterion

### Example Output

```
Task ID: task_0015
Difficulty: medium
Instruction: Onboard new hire Priya Sharma to Engineering as L2 Software Engineer...

--- Step 1/15 ---
LLM: {"tool": "hr_create_employee", "params": {"name": "Priya Sharma", ...}}
  Tool: hr_create_employee
  Result: {"success": true, "employee": {"emp_id": "emp_0201", ...}}

--- Step 2/15 ---
LLM: {"tool": "onboarding_create_request", "params": {"employee_id": "emp_0201"}}
  Tool: onboarding_create_request
  Result: {"success": true, ...}

FINAL EVALUATION
Score: 100% (7/7 criteria)
Passed: True
  [PASS] created_employee
  [PASS] correct_name
  [PASS] correct_dept
  [PASS] initiated_onboarding
  [PASS] sequencing
```
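
The `LLM:` lines in the output above show the model emitting a JSON object with `tool` and `params` keys. A defensive parser for that shape might look like the sketch below; `parse_tool_call` is a hypothetical helper, not a function from `test_with_llm.py`.

```python
import json
from typing import Optional, Tuple


def parse_tool_call(text: str) -> Optional[Tuple[str, dict]]:
    """Return (tool_name, params), or None if the text holds no usable tool call."""
    # Models often wrap JSON in extra prose, so take the outermost {...} span.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    tool, params = payload.get("tool"), payload.get("params", {})
    if not isinstance(tool, str) or not isinstance(params, dict):
        return None
    return tool, params


print(parse_tool_call('{"tool": "hr_read_employee", "params": {"emp_id": "emp_0001"}}'))
# ('hr_read_employee', {'emp_id': 'emp_0001'})
print(parse_tool_call("I am not sure which tool to use."))  # None
```

Returning `None` instead of raising lets the caller treat a malformed completion as a wasted step rather than a crash.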

### Task Index Reference

| Index | Difficulty | Category | Description |
|-------|-----------|----------|-------------|
| 0-13 | Simple | Lookup/Onboarding | Single lookups, status checks |
| 14-23 | Medium | Onboarding | Create employee + initiate workflow |
| 24-34 | Complex | Onboarding | Full end-to-end with IT, access, comms |
| 35-46 | Medium | Offboarding | Initiate offboarding + revoke access |
| 47-54 | Complex | Offboarding | Full offboarding with asset reclaim |
| 55-66 | Edge case | Various | Headcount limits, license caps, RBAC |
| 67-76 | Complex | Cross-workflow | Transfers, rehires, manager departures |
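
For scripting runs over a particular slice of tasks, the table above can be encoded as a lookup. This is a convenience sketch; `TASK_RANGES` and `describe_task` are illustrative names, not part of the repository.

```python
# Index ranges copied from the table above: (range, difficulty, category)
TASK_RANGES = [
    (range(0, 14), "Simple", "Lookup/Onboarding"),
    (range(14, 24), "Medium", "Onboarding"),
    (range(24, 35), "Complex", "Onboarding"),
    (range(35, 47), "Medium", "Offboarding"),
    (range(47, 55), "Complex", "Offboarding"),
    (range(55, 67), "Edge case", "Various"),
    (range(67, 77), "Complex", "Cross-workflow"),
]


def describe_task(index: int) -> tuple:
    """Map a task index (0-76) to its (difficulty, category) pair."""
    for indices, difficulty, category in TASK_RANGES:
        if index in indices:
            return difficulty, category
    raise IndexError(f"task index out of range: {index}")


print(describe_task(14))  # ('Medium', 'Onboarding')
print(describe_task(55))  # ('Edge case', 'Various')

# Example: collect the indices of all edge-case tasks
edge_cases = [i for i in range(77) if describe_task(i)[0] == "Edge case"]
print(len(edge_cases))  # 12
```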

## Installation

```bash
# Clone the repo
git clone https://github.com/ravi03071991/rl_hack.git
cd rl_hack

# Install core dependencies
uv pip install -e .

# Install with evaluation support (adds openai)
uv pip install -e ".[eval]"

# Install with training support (adds unsloth, trl, torch, etc.)
uv pip install -e ".[train]"

# Install everything
uv pip install -e ".[eval,train,dev]"
```

## Building & Running

```bash
# Run locally (as OpenEnv HTTP server with playground UI)
uvicorn server.app:app --reload --host 0.0.0.0 --port 7860

# Build Docker image
docker build -t hr-onboarding-env:latest -f server/Dockerfile .

# Deploy to HF Spaces
openenv push
```

## Training & Results

We use Unsloth + GRPO to train an LLM agent on this environment. See [`train_hr_agent.ipynb`](train_hr_agent.ipynb) for the full training notebook and [W&B run](https://wandb.ai/ravi03071991/hr-agent-training/runs/bgent3o3?nw=nwuserravi03071991) for live training metrics.

### Setup

- **Model**: Llama 3.2-1B-Instruct (4-bit quantized, LoRA rank 8)
- **Algorithm**: GRPO (Group Relative Policy Optimization)
- **Reward functions**: Valid JSON + rubric score + efficiency
- **Training**: 300 steps, 6 generations per prompt, lr=5e-5 with cosine schedule
- **Data split**: 70/30 stratified train/test (52 train, 25 test tasks)
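
A stratified split keeps each difficulty level represented in both sets. The sketch below is one way to do it and is not taken from the notebook: `stratified_split`, the task dict shape, and the seed are assumptions. Flooring the 70% cut within each difficulty level, using the counts from the Difficulty Levels table, happens to give the same 52/25 totals.

```python
import random
from collections import defaultdict


def stratified_split(tasks: list, train_pct: int = 70, seed: int = 0) -> tuple:
    """Split tasks into (train, test), taking train_pct% of each difficulty level."""
    by_difficulty = defaultdict(list)
    for task in tasks:
        by_difficulty[task["difficulty"]].append(task)

    rng = random.Random(seed)
    train, test = [], []
    for difficulty in sorted(by_difficulty):
        group = list(by_difficulty[difficulty])
        rng.shuffle(group)
        cut = len(group) * train_pct // 100  # floor, using integer arithmetic
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Difficulty counts from the Difficulty Levels table; task IDs are placeholders.
counts = {"simple": 19, "medium": 21, "complex": 25, "edge_case": 12}
tasks = [{"task_id": f"{name}_{i}", "difficulty": name}
         for name, n in counts.items() for i in range(n)]

train, test = stratified_split(tasks)
print(len(train), len(test))  # 52 25
```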

### Results

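GRPO training raises mean rubric scores substantially, while pass rates move by only a few points, so most of the gain comes from satisfying more criteria per task rather than from fully passing more tasks: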

| Metric | Base Model | Trained | Change |
|--------|-----------|---------|--------|
| **Train pass rate** | 15.4% | 19.2% | +3.8 pp |
| **Train mean score** | 0.370 | 0.617 | **+0.247 (+67%)** |
| **Test pass rate** | 12.0% | 16.0% | +4.0 pp |
| **Test mean score** | 0.370 | 0.617 | **+0.247 (+67%)** |

#### Improvement by difficulty

| Difficulty | Baseline | Trained | Change |
|------------|----------|---------|--------|
| Simple | 0.23 | 0.50 | +0.27 |
| Medium | 0.72 | 0.86 | +0.14 |
| **Complex** | **0.26** | **0.68** | **+0.42** |
| Edge case | 0.22 | 0.25 | +0.03 |

The biggest gains are on **complex multi-step tasks**, where scores more than doubled. The improvement also **holds on held-out test tasks**, which suggests the model learned transferable HR workflow skills.
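
The headline percentages are relative gains over the baseline, while the pass-rate change is an absolute difference in percentage points. A quick check of the arithmetic, using the numbers from the tables in this README (`relative_gain` is just a helper for this example):

```python
def relative_gain(baseline: float, trained: float) -> float:
    """Relative improvement over the baseline score."""
    return (trained - baseline) / baseline


print(f"{relative_gain(0.370, 0.617):+.0%}")  # +67%  (mean score)
print(f"{relative_gain(0.26, 0.68):+.0%}")    # +162% (complex tasks)
print(f"{(0.192 - 0.154) * 100:+.1f} pp")     # +3.8 pp (train pass rate)
```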

### Reward Curve

![Reward Curve](reward_curve.png)

The moving average reward trends upward from ~2-3 early in training to ~4-5 by the end, showing consistent learning.
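
The curve plots a combined reward, which is why its scale exceeds the 0-1 rubric score. As a rough sketch of how the three signals listed under Setup (valid JSON, rubric score, efficiency) could be combined: the weights, the efficiency formula, and all function names below are assumptions for illustration and are not the values used in the notebook.

```python
import json

MAX_STEPS = 15  # per-episode step cap from the Episode Flow section


def format_reward(completion: str) -> float:
    """1.0 if the completion parses as JSON, else 0.0."""
    try:
        json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def efficiency_reward(steps_used: int, max_steps: int = MAX_STEPS) -> float:
    """Assumed shape: fewer steps used gives a higher bonus, floored at 0."""
    return max(0.0, 1.0 - steps_used / max_steps)


def total_reward(completion: str, rubric_score: float, steps_used: int,
                 weights: tuple = (1.0, 3.0, 1.0)) -> float:
    """Weighted sum of the three signals; the weights are hypothetical."""
    w_json, w_rubric, w_eff = weights
    return (w_json * format_reward(completion)
            + w_rubric * rubric_score
            + w_eff * efficiency_reward(steps_used))


good = '{"tool": "hr_read_employee", "params": {"emp_id": "emp_0001"}}'
print(total_reward(good, rubric_score=0.5, steps_used=3))
print(total_reward("not json", rubric_score=0.0, steps_used=15))  # 0.0
```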

### Quick start (Colab)

1. Click the Colab badge at the top to open `train_hr_agent.ipynb` in Google Colab
2. Select a GPU runtime
3. Run all cells: this installs dependencies, trains, and evaluates automatically

## Live Demo

Try the environment on Hugging Face Spaces: https://huggingface.co/spaces/devxpy/rl_hack