---
title: SimLab HR — AI Recruiting & People Management Agent Environment
emoji: 👔
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
short_description: "MCP gym for benchmarking & training AI HR agents"
tags:
- openenv
- hr
- human-resources
- recruiting
- hrms
- agent-evaluation
- agent-benchmark
- simlab
- reinforcement-learning
- rl-environment
- ai-agent
- tool-use
- function-calling
- enterprise
- multi-tool
- gymnasium
- gym
- benchmark
- mcp
- model-context-protocol
- reward-model
- verifier
- collinear
- langchain
- openai
- sandbox
- docker
- toolbench
- swe-bench
- bfcl
pinned: true
license: apache-2.0
---
# SimLab HR — MCP-Powered Gym for AI HR Agent Evaluation
A fully functional HR simulation gym for training, evaluating, and benchmarking AI recruiting and people management agents. Test your agent's tool-use and function-calling abilities across 4 MCP tool servers — HRMS, email, calendar, and team chat — with automated rubric-based evaluation.
Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).
Unlike single-API function-calling benchmarks like [BFCL](https://github.com/ShishirPatil/gorilla) or [ToolBench](https://github.com/OpenBMB/ToolBench), SimLab HR gives your agent a full workplace — HRMS, email, calendar, and team chat — and asks it to complete real multi-step HR workflows end-to-end.
## 4 MCP Tool Servers, 1 Environment
| Tool Server | Port | What it does |
|---|---|---|
| **HRMS** (Frappe) | 8030 | Employee records, leave management, attendance, payroll |
| **Email** (MailHog) | 8040 | Send and read emails, inbox management |
| **Calendar** (Baikal/Chronos) | 8050 | Schedule meetings, check availability, manage events |
| **RocketChat** | 8060 | Team messaging, channels, direct messages |
Agents must reason and coordinate tool calls across all four MCP servers to complete real HR workflows — the kind of multi-step function calling that separates real tool-use from single-API benchmarks.
## Quickstart
```python
from simlab_hr import HRAction
from simlab_hr.client import HREnv
client = HREnv(base_url="http://localhost:8000")
with client:
    obs = client.reset()
    print(obs.observation.task_instruction)
    print(obs.observation.tools_available)  # {'hrms': [...], 'email': [...], ...}

    # Check leave balance in HRMS
    result = client.step(HRAction(
        tool_server="hrms",
        tool_name="get_leave_balance",
        parameters={"employee_id": "EMP-0042"}
    ))

    # Send an email notification
    result = client.step(HRAction(
        tool_server="email",
        tool_name="send_email",
        parameters={"to": "manager@company.com", "subject": "Leave approved", "body": "..."}
    ))
```
## Benchmark Tasks
**8 sample tasks** covering real HR workflows across three difficulty levels:
| Difficulty | Example Tasks |
|---|---|
| Easy | Approve a leave request, update an employee's designation |
| Medium | Schedule a phone screen + send confirmation, run an attendance report |
| Hard | Multi-person panel interview scheduling, full new-hire onboarding flow |
Every task requires the agent to coordinate function calls across multiple MCP tool servers — this is what makes it hard.
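To make the cross-server coordination concrete, here is a sketch of what an action plan for a hard task (panel interview scheduling) might look like. The tool names and parameters below are illustrative assumptions, not the environment's actual tool schemas:

```python
# Hypothetical action sequence for a hard task: schedule a panel interview.
# Tool names and parameters are illustrative, not the real tool schemas.
plan = [
    {"tool_server": "hrms", "tool_name": "get_employee",
     "parameters": {"employee_id": "EMP-0042"}},
    {"tool_server": "calendar", "tool_name": "check_availability",
     "parameters": {"attendees": ["EMP-0042", "EMP-0107"]}},
    {"tool_server": "calendar", "tool_name": "create_event",
     "parameters": {"title": "Panel interview"}},
    {"tool_server": "email", "tool_name": "send_email",
     "parameters": {"to": "candidate@example.com", "subject": "Interview confirmed"}},
    {"tool_server": "rocketchat", "tool_name": "post_message",
     "parameters": {"channel": "#recruiting", "text": "Panel booked"}},
]

# The task is genuinely cross-system: the plan touches four of the
# environment's MCP tool servers, not one API in isolation.
servers_used = {step["tool_server"] for step in plan}
print(sorted(servers_used))
```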
## Automated Evaluation
SimLab HR includes a rubric-based LLM judge that evaluates agent performance after each episode:
- **0.8–1.0**: All requirements fully met with clear evidence
- **0.6–0.8**: Core requirements met with minor gaps (0.6 = PASS threshold)
- **0.4–0.6**: Partial completion, significant gaps remain
- **0.0–0.4**: Minimal or no meaningful progress
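In practice, the judge returns a scalar score in [0, 1] that you can bucket into these bands. The helper below is our own illustrative mapping of the rubric above (it is not part of the `simlab_hr` API), with the PASS threshold at 0.6 as stated:

```python
def rubric_band(score: float) -> str:
    """Map a judge score in [0, 1] to the rubric bands above.

    Illustrative helper (not part of simlab_hr); PASS threshold is 0.6.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.8:
        return "fully met (PASS)"
    if score >= 0.6:
        return "core met, minor gaps (PASS)"
    if score >= 0.4:
        return "partial, significant gaps (FAIL)"
    return "minimal progress (FAIL)"

print(rubric_band(0.72))  # core met, minor gaps (PASS)
```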
Configure the verifier model:
```bash
export VERIFIER_MODEL="gpt-4o"
export VERIFIER_API_KEY="sk-..."
```
## Run Locally
```bash
git clone https://github.com/collinear-ai/simlab.git
cd simlab/envs/simlab_hr
# Start all services (HRMS, Email, Calendar, RocketChat, OpenEnv wrapper)
docker compose up
# First run pulls ~10 images and takes a few minutes for HRMS to initialize
```
Or run from Hugging Face:
```python
from simlab_hr.client import HREnv
client = HREnv.from_hub("collinear/simlab-hr")
```
## Unlock 14+ Tasks from the API
This environment ships with 8 sample tasks. Want more?
Set your Collinear API key to unlock the full task set with real HR scenarios:
```bash
export COLLINEAR_API_KEY="your-key-here"
```
Get a free API key at **[platform.collinear.ai](https://platform.collinear.ai)** (Developer Resources → API Keys).
With the API key, every `reset()` pulls a fresh task from Collinear's Scenario Manager — recruiting workflows, people management scenarios, compliance tasks, and more.
## Use with TRL / GRPOTrainer
Compatible with Hugging Face TRL for reinforcement learning fine-tuning:
```python
from simlab_hr import HRAction
from simlab_hr.client import HREnv
env = HREnv.from_hub("collinear/simlab-hr")
with env:
    obs = env.reset()
    # ... your RL training loop
```
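The training loop typically collects episode rollouts and feeds the total reward to the trainer. Below is a minimal rollout sketch; since the real `HREnv` needs the Docker stack running, we stand in a trivial fake env, and the `reward`/`done` attribute names on the step result are assumptions in OpenEnv style, not confirmed API:

```python
from dataclasses import dataclass

# Stand-in for HREnv so this sketch runs without the Docker stack.
# With the real client: env = HREnv.from_hub("collinear/simlab-hr").
# `reward`/`done` field names are assumed, OpenEnv-style.
@dataclass
class FakeStepResult:
    reward: float = 0.0
    done: bool = False

class FakeHREnv:
    def __init__(self, max_steps: int = 3):
        self.max_steps = max_steps
        self._t = 0

    def reset(self):
        self._t = 0
        return FakeStepResult()

    def step(self, action):
        self._t += 1
        done = self._t >= self.max_steps
        return FakeStepResult(reward=1.0 if done else 0.0, done=done)

def rollout(env, policy, max_steps: int = 20) -> float:
    """Run one episode; return total reward for the RL update."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        result = env.step(policy(obs))
        total += result.reward
        if result.done:
            break
        obs = result
    return total

total_reward = rollout(FakeHREnv(), policy=lambda obs: "noop")
print(total_reward)  # 1.0
```

Swap `FakeHREnv` for the real `HREnv` and the policy lambda for your model's action head, and the same loop structure plugs into a GRPO-style trainer.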
## How SimLab HR Compares
Most tool-use benchmarks evaluate function calling in isolation — single API calls with predefined schemas. SimLab HR tests whether your agent can actually get work done across a real enterprise environment.
| | SimLab HR | [BFCL](https://github.com/ShishirPatil/gorilla) | [ToolBench](https://github.com/OpenBMB/ToolBench) | [EnterpriseOps-Gym](https://github.com/ServiceNow/EnterpriseOps-Gym) | [tau-bench](https://github.com/sierra-research/tau2-bench) |
|---|---|---|---|---|---|
| What it tests | Multi-tool HR workflows end-to-end | Function call accuracy (single/parallel) | API discovery & chaining across 16k APIs | Enterprise planning across 8 domains | Customer service policy compliance |
| Real backing services | ✅ Frappe HRMS, MailHog, CalDAV, RocketChat | ❌ Schema validation only | ❌ API simulation | ❌ Mock APIs | ❌ Simulated |
| MCP tool servers | ✅ 4 servers | ❌ | ❌ | ❌ REST APIs | ❌ |
| Multi-step workflows | ✅ 10+ steps, cross-system | ❌ Single/parallel calls | ✅ Multi-hop chains | ✅ Avg 9 steps | ✅ Multi-turn |
| HR-specific | ✅ Dedicated | ❌ | ❌ | ✅ 1 of 8 domains | ❌ |
| Automated evaluation | ✅ Rubric-based LLM judge | ✅ AST matching | ✅ Pass rate + win rate | ✅ Expert-curated | ✅ Policy checks |
| RL / Gymnasium support | ✅ OpenEnv-compatible | ❌ | ❌ | ❌ | ❌ |
| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ | ❌ |
## More Environments
SimLab includes **5 enterprise simulation scenarios** with **14 MCP tool servers**:
| Scenario | MCP Tool Servers |
|---|---|
| **Human Resources** | HRMS, email, calendar, team chat ← *you are here* |
| **Customer Service** | Helpdesk ticketing, team chat, email |
| **Finance** | SEC filings, market data, Google Workspace |
| **Coding** | Sandboxed IDE, browser automation, team chat |
| **CRM** | Contacts, deals, pipelines, activities |
Install the full toolkit:
```bash
pip install simulationlab
simlab templates list
```
Learn more: [github.com/collinear-ai/simlab](https://github.com/collinear-ai/simlab) | [docs.collinear.ai](https://docs.collinear.ai)
## License
Apache 2.0 — [Collinear AI](https://collinear.ai)