---
title: SimLab HR — AI Recruiting & People Management Agent Environment
emoji: 👔
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
short_description: "MCP gym for benchmarking & training AI HR agents"
tags:
  - openenv
  - hr
  - human-resources
  - recruiting
  - hrms
  - agent-evaluation
  - agent-benchmark
  - simlab
  - reinforcement-learning
  - rl-environment
  - ai-agent
  - tool-use
  - function-calling
  - enterprise
  - multi-tool
  - gymnasium
  - gym
  - benchmark
  - mcp
  - model-context-protocol
  - reward-model
  - verifier
  - collinear
  - langchain
  - openai
  - sandbox
  - docker
  - toolbench
  - swe-bench
  - bfcl
pinned: true
license: apache-2.0
---
# SimLab HR — MCP-Powered Gym for AI HR Agent Evaluation

A fully functional HR simulation gym for training, evaluating, and benchmarking AI recruiting and people management agents. Test your agent's tool-use and function-calling abilities across 4 MCP tool servers — HRMS, email, calendar, and team chat — with automated rubric-based evaluation.

Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).

Unlike single-API function-calling benchmarks such as [BFCL](https://github.com/ShishirPatil/gorilla) or [ToolBench](https://github.com/OpenBMB/ToolBench), SimLab HR gives your agent a full workplace — HRMS, email, calendar, and team chat — and asks it to complete real multi-step HR workflows end-to-end.
## 4 MCP Tool Servers, 1 Environment

| Tool Server | Port | What it does |
|---|---|---|
| **HRMS** (Frappe) | 8030 | Employee records, leave management, attendance, payroll |
| **Email** (MailHog) | 8040 | Send and read emails, inbox management |
| **Calendar** (Baikal/Chronos) | 8050 | Schedule meetings, check availability, manage events |
| **RocketChat** | 8060 | Team messaging, channels, direct messages |

Agents must reason and coordinate tool calls across all four MCP servers to complete real HR workflows — the kind of multi-step function calling that separates real tool use from single-API benchmarks.
## Quickstart

```python
from simlab_hr import HRAction
from simlab_hr.client import HREnv

client = HREnv(base_url="http://localhost:8000")

with client:
    obs = client.reset()
    print(obs.observation.task_instruction)
    print(obs.observation.tools_available)  # {'hrms': [...], 'email': [...], ...}

    # Check leave balance in HRMS
    result = client.step(HRAction(
        tool_server="hrms",
        tool_name="get_leave_balance",
        parameters={"employee_id": "EMP-0042"}
    ))

    # Send an email notification
    result = client.step(HRAction(
        tool_server="email",
        tool_name="send_email",
        parameters={"to": "manager@company.com", "subject": "Leave approved", "body": "..."}
    ))
```
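Before stepping, an agent can sanity-check a planned call against the tool list the environment advertises on reset. A minimal sketch, assuming `tools_available` has the dict-of-lists shape printed above; the helper itself is illustrative, not part of `simlab_hr`:

```python
# Illustrative helper (not part of simlab_hr): validate a planned call
# against the tools the environment advertised on reset.
def is_known_tool(tools_available: dict[str, list[str]],
                  tool_server: str, tool_name: str) -> bool:
    """True if `tool_name` is advertised by `tool_server`."""
    return tool_name in tools_available.get(tool_server, [])

tools = {"hrms": ["get_leave_balance"], "email": ["send_email"]}
print(is_known_tool(tools, "hrms", "get_leave_balance"))  # True
print(is_known_tool(tools, "calendar", "create_event"))   # False
```

Rejecting unknown tool names before calling `step()` keeps malformed actions out of the episode trace.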
## Benchmark Tasks

**8 sample tasks** covering real HR workflows across three difficulty levels:

| Difficulty | Example Tasks |
|---|---|
| Easy | Approve a leave request, update an employee's designation |
| Medium | Schedule a phone screen + send confirmation, run an attendance report |
| Hard | Multi-person panel interview scheduling, full new-hire onboarding flow |

Every task requires the agent to coordinate function calls across multiple MCP tool servers — this is what makes it hard.
## Automated Evaluation

SimLab HR includes a rubric-based LLM judge that evaluates agent performance after each episode:

- **0.8–1.0**: All requirements fully met with clear evidence
- **0.6–0.8**: Core requirements met with minor gaps (0.6 = PASS threshold)
- **0.4–0.6**: Partial completion, significant gaps remain
- **0.0–0.4**: Minimal or no meaningful progress
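The bands above reduce to a pass/fail decision at the 0.6 threshold. A minimal sketch of that mapping, treating each band as inclusive of its lower bound (an assumption, since the published ranges share their edge values); the helper is illustrative, not part of `simlab_hr`:

```python
# Illustrative mapping from a judge score in [0, 1] to the rubric
# bands above. Lower bounds are treated as inclusive (an assumption).
PASS_THRESHOLD = 0.6

def rubric_band(score: float) -> tuple[str, bool]:
    """Return (band description, passed) for a rubric score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("rubric scores are in [0, 1]")
    if score >= 0.8:
        band = "all requirements fully met"
    elif score >= 0.6:
        band = "core requirements met, minor gaps"
    elif score >= 0.4:
        band = "partial completion"
    else:
        band = "minimal progress"
    return band, score >= PASS_THRESHOLD

print(rubric_band(0.65))  # ('core requirements met, minor gaps', True)
```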
Configure the verifier model:

```bash
export VERIFIER_MODEL="gpt-4o"
export VERIFIER_API_KEY="sk-..."
```
## Run Locally

```bash
git clone https://github.com/collinear-ai/simlab.git
cd simlab/envs/simlab_hr

# Start all services (HRMS, Email, Calendar, RocketChat, OpenEnv wrapper)
docker compose up
# First run pulls ~10 images and takes a few minutes for HRMS to initialize
```
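Since HRMS takes a few minutes to come up on first run, it can help to wait until every service accepts TCP connections before pointing an agent at the stack. A minimal sketch using the ports from the table above, plus the OpenEnv wrapper on 8000 (the wrapper port is an assumption taken from the quickstart's `base_url`):

```python
# Illustrative readiness check: poll each local service port until it
# accepts a TCP connection or the timeout expires.
import socket
import time

SERVICES = {"env": 8000, "hrms": 8030, "email": 8040,
            "calendar": 8050, "chat": 8060}

def wait_for(port: int, host: str = "localhost",
             timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Return True once host:port accepts a connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, `all(wait_for(p) for p in SERVICES.values())` before the first `reset()` avoids connection errors while containers are still initializing.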
Or run from Hugging Face:

```python
from simlab_hr.client import HREnv

client = HREnv.from_hub("collinear/simlab-hr")
```
## Unlock 14+ Tasks from the API

This environment ships with 8 sample tasks. Want more?

Set your Collinear API key to unlock the full task set with real HR scenarios:

```bash
export COLLINEAR_API_KEY="your-key-here"
```

Get a free API key at **[platform.collinear.ai](https://platform.collinear.ai)** (Developer Resources → API Keys).

With the API key, every `reset()` pulls a fresh task from Collinear's Scenario Manager — recruiting workflows, people management scenarios, compliance tasks, and more.
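Because each `reset()` can draw a fresh task, a short loop is enough to build a varied evaluation set. A hedged sketch: the client calls mirror the quickstart, while the de-duplication logic is purely illustrative and not part of `simlab_hr`:

```python
# Illustrative helper: reset repeatedly and collect distinct task
# instructions (assumes the quickstart's obs.observation.task_instruction).
def sample_tasks(client, n: int = 5) -> list[str]:
    """Reset `client` n times and return the distinct task instructions seen."""
    seen: list[str] = []
    for _ in range(n):
        obs = client.reset()
        task = obs.observation.task_instruction
        if task not in seen:
            seen.append(task)
    return seen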
## Use with TRL / GRPOTrainer

Compatible with Hugging Face TRL for reinforcement learning fine-tuning:

```python
from simlab_hr import HRAction
from simlab_hr.client import HREnv

env = HREnv.from_hub("collinear/simlab-hr")
with env:
    obs = env.reset()
    # ... your RL training loop
```
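To see how the rubric reward feeds training: GRPO normalizes each group of episode rewards into group-relative advantages, which TRL's `GRPOTrainer` computes internally. A standalone sketch of that normalization (illustrative, not TRL's actual code):

```python
# Illustrative GRPO-style normalization: center a group of episode
# rewards on the group mean and scale by the group std deviation.
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize a group of episode rewards to zero mean, unit std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

print(group_advantages([0.9, 0.6, 0.3]))  # ≈ [1.22, 0.0, -1.22]
```

Episodes scoring above their group's mean rubric reward get positive advantage, so the policy is pushed toward the tool-call sequences the judge scored higher.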
## How SimLab HR Compares

Most tool-use benchmarks evaluate function calling in isolation — single API calls with predefined schemas. SimLab HR tests whether your agent can actually get work done across a real enterprise environment.

| | SimLab HR | [BFCL](https://github.com/ShishirPatil/gorilla) | [ToolBench](https://github.com/OpenBMB/ToolBench) | [EnterpriseOps-Gym](https://github.com/ServiceNow/EnterpriseOps-Gym) | [tau-bench](https://github.com/sierra-research/tau2-bench) |
|---|---|---|---|---|---|
| What it tests | Multi-tool HR workflows end-to-end | Function call accuracy (single/parallel) | API discovery & chaining across 16k APIs | Enterprise planning across 8 domains | Customer service policy compliance |
| Real backing services | ✅ Frappe HRMS, MailHog, CalDAV, RocketChat | ❌ Schema validation only | ❌ API simulation | ❌ Mock APIs | ❌ Simulated |
| MCP tool servers | ✅ 4 servers | ❌ | ❌ | ❌ REST APIs | ❌ |
| Multi-step workflows | ✅ 10+ steps, cross-system | ❌ Single/parallel calls | ✅ Multi-hop chains | ✅ Avg 9 steps | ✅ Multi-turn |
| HR-specific | ✅ Dedicated | ❌ | ❌ | ✅ 1 of 8 domains | ❌ |
| Automated evaluation | ✅ Rubric-based LLM judge | ✅ AST matching | ✅ Pass rate + win rate | ✅ Expert-curated | ✅ Policy checks |
| RL / Gymnasium support | ✅ OpenEnv-compatible | ❌ | ❌ | ❌ | ❌ |
| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ | ❌ |
## More Environments

SimLab includes **5 enterprise simulation scenarios** with **14 MCP tool servers**:

| Scenario | MCP Tool Servers |
|---|---|
| **Human Resources** | HRMS, email, calendar, team chat ← *you are here* |
| **Customer Service** | Helpdesk ticketing, team chat, email |
| **Finance** | SEC filings, market data, Google Workspace |
| **Coding** | Sandboxed IDE, browser automation, team chat |
| **CRM** | Contacts, deals, pipelines, activities |
Install the full toolkit:

```bash
pip install simulationlab
simlab templates list
```

Learn more: [github.com/collinear-ai/simlab](https://github.com/collinear-ai/simlab) | [docs.collinear.ai](https://docs.collinear.ai)
## License

Apache 2.0 — [Collinear AI](https://collinear.ai)