---
title: "SimLab HR: AI Recruiting & People Management Agent Environment"
emoji: 👔
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
short_description: "MCP gym for benchmarking & training AI HR agents"
tags:
 - openenv
 - hr
 - human-resources
 - recruiting
 - hrms
 - agent-evaluation
 - agent-benchmark
 - simlab
 - reinforcement-learning
 - rl-environment
 - ai-agent
 - tool-use
 - function-calling
 - enterprise
 - multi-tool
 - gymnasium
 - gym
 - benchmark
 - mcp
 - model-context-protocol
 - reward-model
 - verifier
 - collinear
 - langchain
 - openai
 - sandbox
 - docker
 - toolbench
 - swe-bench
 - bfcl
pinned: true
license: apache-2.0
---

# SimLab HR — MCP-Powered Gym for AI HR Agent Evaluation

A fully functional HR simulation gym for training, evaluating, and benchmarking AI recruiting and people management agents. Test your agent's tool-use and function-calling abilities across 4 MCP tool servers — HRMS, email, calendar, and team chat — with automated rubric-based evaluation.

Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).

Unlike single-API function-calling benchmarks like [BFCL](https://github.com/ShishirPatil/gorilla) or [ToolBench](https://github.com/OpenBMB/ToolBench), SimLab HR gives your agent a full workplace — HRMS, email, calendar, and team chat — and asks it to complete real multi-step HR workflows end-to-end.

## 4 MCP Tool Servers, 1 Environment

| Tool Server | Port | What it does |
|---|---|---|
| **HRMS** (Frappe) | 8030 | Employee records, leave management, attendance, payroll |
| **Email** (MailHog) | 8040 | Send and read emails, inbox management |
| **Calendar** (Baikal/Chronos) | 8050 | Schedule meetings, check availability, manage events |
| **RocketChat** | 8060 | Team messaging, channels, direct messages |

Agents must reason and coordinate tool calls across all four MCP servers to complete real HR workflows — the kind of multi-step function calling that separates real tool-use from single-API benchmarks.

## Quickstart

```python
from simlab_hr import HRAction
from simlab_hr.client import HREnv

client = HREnv(base_url="http://localhost:8000")

with client:
    obs = client.reset()
    print(obs.observation.task_instruction)
    print(obs.observation.tools_available)  # {'hrms': [...], 'email': [...], ...}

    # Check leave balance in HRMS
    result = client.step(HRAction(
        tool_server="hrms",
        tool_name="get_leave_balance",
        parameters={"employee_id": "EMP-0042"}
    ))

    # Send an email notification
    result = client.step(HRAction(
        tool_server="email",
        tool_name="send_email",
        parameters={"to": "manager@company.com", "subject": "Leave approved", "body": "..."}
    ))
```
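A full run is just this reset/step cycle repeated until the episode ends. The `reward` and `done` field names in the sketch below follow OpenEnv conventions and are assumptions, not verified SimLab HR attributes — check the actual step result for exact names:

```python
# Generic episode loop. The `reward` and `done` fields are assumed from
# OpenEnv conventions; check HREnv's actual step result for exact names.
def run_episode(env, policy, max_steps=20):
    """Roll out one episode and return (total_reward, steps_taken)."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(obs)          # policy maps an observation to an HRAction
        obs = env.step(action)
        total_reward += getattr(obs, "reward", 0.0) or 0.0
        steps += 1
        if getattr(obs, "done", False):
            break
    return total_reward, steps
```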

## Benchmark Tasks

**8 sample tasks** covering real HR workflows across three difficulty levels:

| Difficulty | Example Tasks |
|---|---|
| Easy | Approve a leave request, update an employee's designation |
| Medium | Schedule a phone screen + send confirmation, run an attendance report |
| Hard | Multi-person panel interview scheduling, full new-hire onboarding flow |

Every task requires the agent to coordinate function calls across multiple MCP tool servers; that cross-system coordination is what makes the tasks hard.

## Automated Evaluation

SimLab HR includes a rubric-based LLM judge that evaluates agent performance after each episode:

- **0.8–1.0**: All requirements fully met with clear evidence
- **0.6–0.8**: Core requirements met with minor gaps (0.6 = PASS threshold)
- **0.4–0.6**: Partial completion, significant gaps remain
- **0.0–0.4**: Minimal or no meaningful progress
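Applying the rubric downstream is a one-line threshold check. This helper is illustrative, not part of the SimLab HR API; the 0.6 PASS cutoff comes from the rubric above:

```python
# Illustrative helper (not part of the SimLab HR API): map a judge score
# to a verdict using the rubric's 0.6 PASS threshold.
def verdict(score: float) -> str:
    if not 0.0 <= score <= 1.0:
        raise ValueError("rubric scores must fall in [0.0, 1.0]")
    return "PASS" if score >= 0.6 else "FAIL"
```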

Configure the verifier model:

```bash
export VERIFIER_MODEL="gpt-4o"
export VERIFIER_API_KEY="sk-..."
```

## Run Locally

```bash
git clone https://github.com/collinear-ai/simlab.git
cd simlab/envs/simlab_hr

# Start all services (HRMS, Email, Calendar, RocketChat, OpenEnv wrapper)
docker compose up

# First run pulls ~10 images and takes a few minutes for HRMS to initialize
```

Or run from Hugging Face:

```python
from simlab_hr.client import HREnv

client = HREnv.from_hub("collinear/simlab-hr")
```

## Unlock 14+ Tasks from the API

This environment ships with 8 sample tasks. Want more?

Set your Collinear API key to unlock the full task set with real HR scenarios:

```bash
export COLLINEAR_API_KEY="your-key-here"
```

Get a free API key at **[platform.collinear.ai](https://platform.collinear.ai)** (Developer Resources → API Keys).

With the API key, every `reset()` pulls a fresh task from Collinear's Scenario Manager — recruiting workflows, people management scenarios, compliance tasks, and more.
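Since a missing key presumably leaves you with only the 8 bundled sample tasks, it can be worth checking the variable before starting a run. A small sketch (the helper name is ours, not part of the package):

```python
import os

# Illustrative pre-flight check (not part of the SimLab HR package): warn
# early if COLLINEAR_API_KEY is unset, so you know whether reset() will
# pull fresh tasks from the Scenario Manager or use the bundled samples.
def collinear_key_present() -> bool:
    return bool(os.environ.get("COLLINEAR_API_KEY"))
```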

## Use with TRL / GRPOTrainer

Compatible with Hugging Face TRL for reinforcement learning fine-tuning:

```python
from simlab_hr import HRAction
from simlab_hr.client import HREnv

env = HREnv.from_hub("collinear/simlab-hr")
with env:
    obs = env.reset()
    # ... your RL training loop
```
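For GRPO-style training, TRL accepts custom reward functions that take the batch of completions and return one float per completion. A hedged sketch of that shape, where the episode score would come from replaying the completion in the environment — `score_in_env` is a hypothetical placeholder, not a SimLab HR or TRL function:

```python
# Sketch of a TRL-style reward function. `score_in_env` is a hypothetical
# placeholder: a real version would execute the completion's tool calls in
# HREnv and return the judge's rubric score for that episode.
def score_in_env(completion: str) -> float:
    return 0.0  # placeholder score

def hr_reward_func(completions, **kwargs):
    # GRPOTrainer expects one scalar reward per completion.
    return [score_in_env(c) for c in completions]
```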

## How SimLab HR Compares

Most tool-use benchmarks evaluate function calling in isolation — single API calls with predefined schemas. SimLab HR tests whether your agent can actually get work done across a real enterprise environment.

| | SimLab HR | [BFCL](https://github.com/ShishirPatil/gorilla) | [ToolBench](https://github.com/OpenBMB/ToolBench) | [EnterpriseOps-Gym](https://github.com/ServiceNow/EnterpriseOps-Gym) | [tau-bench](https://github.com/sierra-research/tau2-bench) |
|---|---|---|---|---|---|
| What it tests | Multi-tool HR workflows end-to-end | Function call accuracy (single/parallel) | API discovery & chaining across 16k APIs | Enterprise planning across 8 domains | Customer service policy compliance |
| Real backing services | ✅ Frappe HRMS, MailHog, CalDAV, RocketChat | ❌ Schema validation only | ❌ API simulation | ❌ Mock APIs | ❌ Simulated |
| MCP tool servers | ✅ 4 servers | ❌ | ❌ | ❌ REST APIs | ❌ |
| Multi-step workflows | ✅ 10+ steps, cross-system | ❌ Single/parallel calls | ✅ Multi-hop chains | ✅ Avg 9 steps | ✅ Multi-turn |
| HR-specific | ✅ Dedicated | ❌ | ❌ | ✅ 1 of 8 domains | ❌ |
| Automated evaluation | ✅ Rubric-based LLM judge | ✅ AST matching | ✅ Pass rate + win rate | ✅ Expert-curated | ✅ Policy checks |
| RL / Gymnasium support | ✅ OpenEnv-compatible | ❌ | ❌ | ❌ | ❌ |
| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ | ❌ |

## More Environments

SimLab includes **5 enterprise simulation scenarios** with **14 MCP tool servers**:

| Scenario | MCP Tool Servers |
|---|---|
| **Human Resources** | HRMS, email, calendar, team chat ← *you are here* |
| **Customer Service** | Helpdesk ticketing, team chat, email |
| **Finance** | SEC filings, market data, Google Workspace |
| **Coding** | Sandboxed IDE, browser automation, team chat |
| **CRM** | Contacts, deals, pipelines, activities |

Install the full toolkit:

```bash
pip install simulationlab
simlab templates list
```

Learn more: [github.com/collinear-ai/simlab](https://github.com/collinear-ai/simlab) | [docs.collinear.ai](https://docs.collinear.ai)

## License

Apache 2.0 — [Collinear AI](https://collinear.ai)