---
title: "SimLab HR: AI Recruiting & People Management Agent Environment"
emoji: 👔
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
short_description: "MCP gym for benchmarking & training AI HR agents"
tags:
 - openenv
 - hr
 - human-resources
 - recruiting
 - hrms
 - agent-evaluation
 - agent-benchmark
 - simlab
 - reinforcement-learning
 - rl-environment
 - ai-agent
 - tool-use
 - function-calling
 - enterprise
 - multi-tool
 - gymnasium
 - gym
 - benchmark
 - mcp
 - model-context-protocol
 - reward-model
 - verifier
 - collinear
 - langchain
 - openai
 - sandbox
 - docker
 - toolbench
 - swe-bench
 - bfcl
pinned: true
license: apache-2.0
---

# SimLab HR — MCP-Powered Gym for AI HR Agent Evaluation

A fully functional HR simulation gym for training, evaluating, and benchmarking AI recruiting and people management agents. Test your agent's tool-use and function-calling abilities across 4 MCP tool servers — HRMS, email, calendar, and team chat — with automated rubric-based evaluation.

Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).

Unlike single-API function-calling benchmarks like [BFCL](https://github.com/ShishirPatil/gorilla) or [ToolBench](https://github.com/OpenBMB/ToolBench), SimLab HR gives your agent a full workplace — HRMS, email, calendar, and team chat — and asks it to complete real multi-step HR workflows end-to-end.

## 4 MCP Tool Servers, 1 Environment

| Tool Server | Port | What it does |
|---|---|---|
| **HRMS** (Frappe) | 8030 | Employee records, leave management, attendance, payroll |
| **Email** (MailHog) | 8040 | Send and read emails, inbox management |
| **Calendar** (Baikal/Chronos) | 8050 | Schedule meetings, check availability, manage events |
| **RocketChat** | 8060 | Team messaging, channels, direct messages |

Agents must reason and coordinate tool calls across all four MCP servers to complete real HR workflows — the kind of multi-step function calling that separates real tool-use from single-API benchmarks.

## Quickstart

```python
from simlab_hr import HRAction
from simlab_hr.client import HREnv

client = HREnv(base_url="http://localhost:8000")

with client:
    obs = client.reset()
    print(obs.observation.task_instruction)
    print(obs.observation.tools_available)  # {'hrms': [...], 'email': [...], ...}

    # Check leave balance in HRMS
    result = client.step(HRAction(
        tool_server="hrms",
        tool_name="get_leave_balance",
        parameters={"employee_id": "EMP-0042"}
    ))

    # Send an email notification
    result = client.step(HRAction(
        tool_server="email",
        tool_name="send_email",
        parameters={"to": "manager@company.com", "subject": "Leave approved", "body": "..."}
    ))
```
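A full run is just this reset/step cycle repeated until the episode ends. The `reward` and `done` field names in the sketch below follow OpenEnv conventions and are assumptions, not verified SimLab HR attributes — check the actual step result for exact names:

```python
# Generic episode loop. The `reward` and `done` fields are assumed from
# OpenEnv conventions; check HREnv's actual step result for exact names.
def run_episode(env, policy, max_steps=20):
    """Roll out one episode and return (total_reward, steps_taken)."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(obs)          # policy maps an observation to an HRAction
        obs = env.step(action)
        total_reward += getattr(obs, "reward", 0.0) or 0.0
        steps += 1
        if getattr(obs, "done", False):
            break
    return total_reward, steps
```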

## Benchmark Tasks

**8 sample tasks** covering real HR workflows across three difficulty levels:

| Difficulty | Example Tasks |
|---|---|
| Easy | Approve a leave request, update an employee's designation |
| Medium | Schedule a phone screen + send confirmation, run an attendance report |
| Hard | Multi-person panel interview scheduling, full new-hire onboarding flow |

Every task requires the agent to coordinate function calls across multiple MCP tool servers; that cross-system coordination is what makes the tasks hard.

## Automated Evaluation

SimLab HR includes a rubric-based LLM judge that evaluates agent performance after each episode:

- **0.8–1.0**: All requirements fully met with clear evidence
- **0.6–0.8**: Core requirements met with minor gaps (0.6 = PASS threshold)
- **0.4–0.6**: Partial completion, significant gaps remain
- **0.0–0.4**: Minimal or no meaningful progress
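Applying the rubric downstream is a one-line threshold check. This helper is illustrative, not part of the SimLab HR API; the 0.6 PASS cutoff comes from the rubric above:

```python
# Illustrative helper (not part of the SimLab HR API): map a judge score
# to a verdict using the rubric's 0.6 PASS threshold.
def verdict(score: float) -> str:
    if not 0.0 <= score <= 1.0:
        raise ValueError("rubric scores must fall in [0.0, 1.0]")
    return "PASS" if score >= 0.6 else "FAIL"
```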

Configure the verifier model:

```bash
export VERIFIER_MODEL="gpt-4o"
export VERIFIER_API_KEY="sk-..."
```

## Run Locally

```bash
git clone https://github.com/collinear-ai/simlab.git
cd simlab/envs/simlab_hr

# Start all services (HRMS, Email, Calendar, RocketChat, OpenEnv wrapper)
docker compose up

# First run pulls ~10 images and takes a few minutes for HRMS to initialize
```

Or run from Hugging Face:

```python
from simlab_hr.client import HREnv

client = HREnv.from_hub("collinear/simlab-hr")
```

## Unlock 14+ Tasks from the API

This environment ships with 8 sample tasks. Want more?

Set your Collinear API key to unlock the full task set with real HR scenarios:

```bash
export COLLINEAR_API_KEY="your-key-here"
```

Get a free API key at **[platform.collinear.ai](https://platform.collinear.ai)** (Developer Resources → API Keys).

With the API key, every `reset()` pulls a fresh task from Collinear's Scenario Manager — recruiting workflows, people management scenarios, compliance tasks, and more.
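Since a missing key presumably leaves you with only the 8 bundled sample tasks, it can be worth checking the variable before starting a run. A small sketch (the helper name is ours, not part of the package):

```python
import os

# Illustrative pre-flight check (not part of the SimLab HR package): warn
# early if COLLINEAR_API_KEY is unset, so you know whether reset() will
# pull fresh tasks from the Scenario Manager or use the bundled samples.
def collinear_key_present() -> bool:
    return bool(os.environ.get("COLLINEAR_API_KEY"))
```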

## Use with TRL / GRPOTrainer

Compatible with Hugging Face TRL for reinforcement learning fine-tuning:

```python
from simlab_hr import HRAction
from simlab_hr.client import HREnv

env = HREnv.from_hub("collinear/simlab-hr")
with env:
    obs = env.reset()
    # ... your RL training loop
```
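For GRPO-style training, TRL accepts custom reward functions that take the batch of completions and return one float per completion. A hedged sketch of that shape, where the episode score would come from replaying the completion in the environment — `score_in_env` is a hypothetical placeholder, not a SimLab HR or TRL function:

```python
# Sketch of a TRL-style reward function. `score_in_env` is a hypothetical
# placeholder: a real version would execute the completion's tool calls in
# HREnv and return the judge's rubric score for that episode.
def score_in_env(completion: str) -> float:
    return 0.0  # placeholder score

def hr_reward_func(completions, **kwargs):
    # GRPOTrainer expects one scalar reward per completion.
    return [score_in_env(c) for c in completions]
```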

## How SimLab HR Compares

Most tool-use benchmarks evaluate function calling in isolation — single API calls with predefined schemas. SimLab HR tests whether your agent can actually get work done across a real enterprise environment.

| | SimLab HR | [BFCL](https://github.com/ShishirPatil/gorilla) | [ToolBench](https://github.com/OpenBMB/ToolBench) | [EnterpriseOps-Gym](https://github.com/ServiceNow/EnterpriseOps-Gym) | [tau-bench](https://github.com/sierra-research/tau2-bench) |
|---|---|---|---|---|---|
| What it tests | Multi-tool HR workflows end-to-end | Function call accuracy (single/parallel) | API discovery & chaining across 16k APIs | Enterprise planning across 8 domains | Customer service policy compliance |
| Real backing services | ✅ Frappe HRMS, MailHog, CalDAV, RocketChat | ❌ Schema validation only | ❌ API simulation | ❌ Mock APIs | ❌ Simulated |
| MCP tool servers | ✅ 4 servers | ❌ | ❌ | ❌ REST APIs | ❌ |
| Multi-step workflows | ✅ 10+ steps, cross-system | ❌ Single/parallel calls | ✅ Multi-hop chains | ✅ Avg 9 steps | ✅ Multi-turn |
| HR-specific | ✅ Dedicated | ❌ | ❌ | ✅ 1 of 8 domains | ❌ |
| Automated evaluation | ✅ Rubric-based LLM judge | ✅ AST matching | ✅ Pass rate + win rate | ✅ Expert-curated | ✅ Policy checks |
| RL / Gymnasium support | ✅ OpenEnv-compatible | ❌ | ❌ | ❌ | ❌ |
| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ | ❌ |

## More Environments

SimLab includes **5 enterprise simulation scenarios** with **14 MCP tool servers**:

| Scenario | MCP Tool Servers |
|---|---|
| **Human Resources** | HRMS, email, calendar, team chat ← *you are here* |
| **Customer Service** | Helpdesk ticketing, team chat, email |
| **Finance** | SEC filings, market data, Google Workspace |
| **Coding** | Sandboxed IDE, browser automation, team chat |
| **CRM** | Contacts, deals, pipelines, activities |

Install the full toolkit:

```bash
pip install simulationlab
simlab templates list
```

Learn more: [github.com/collinear-ai/simlab](https://github.com/collinear-ai/simlab) | [docs.collinear.ai](https://docs.collinear.ai)

## License

Apache 2.0 — [Collinear AI](https://collinear.ai)