
title: SimLab HR — AI Recruiting & People Management Agent Environment
emoji: 👔
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
short_description: MCP gym for benchmarking & training AI HR agents
tags:
  - openenv
  - hr
  - human-resources
  - recruiting
  - hrms
  - agent-evaluation
  - agent-benchmark
  - simlab
  - reinforcement-learning
  - rl-environment
  - ai-agent
  - tool-use
  - function-calling
  - enterprise
  - multi-tool
  - gymnasium
  - gym
  - benchmark
  - mcp
  - model-context-protocol
  - reward-model
  - verifier
  - collinear
  - langchain
  - openai
  - sandbox
  - docker
  - toolbench
  - swe-bench
  - bfcl
pinned: true
license: apache-2.0

SimLab HR — MCP-Powered Gym for AI HR Agent Evaluation

A fully functional HR simulation gym for training, evaluating, and benchmarking AI recruiting and people management agents. Test your agent's tool-use and function-calling abilities across 4 MCP tool servers (HRMS, email, calendar, and team chat) with automated rubric-based evaluation.

Built on OpenEnv and powered by SimLab.

Unlike function-calling benchmarks such as BFCL or ToolBench, which score calls against schemas or simulated APIs, SimLab HR gives your agent a full workplace (HRMS, email, calendar, and team chat) and asks it to complete real multi-step HR workflows end-to-end.

4 MCP Tool Servers, 1 Environment

Tool Server	Port	What it does
HRMS (Frappe)	8030	Employee records, leave management, attendance, payroll
Email (MailHog)	8040	Send and read emails, inbox management
Calendar (Baikal/Chronos)	8050	Schedule meetings, check availability, manage events
Team chat (RocketChat)	8060	Team messaging, channels, direct messages

Agents must reason about and coordinate tool calls across all four MCP servers to complete real HR workflows. This multi-step, cross-system function calling is what single-API benchmarks do not exercise.
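The port layout in the table can be captured in a small lookup. This is only a sketch: it assumes the four services are reachable over plain HTTP on `localhost` at the listed ports, and `TOOL_SERVER_PORTS`, `server_url`, and the `calendar` and `rocketchat` keys are hypothetical names, not part of the `simlab_hr` package.

```python
# Ports are taken from the tool server table; the dictionary keys and the
# helper name are illustrative and not part of simlab_hr.
TOOL_SERVER_PORTS = {
    "hrms": 8030,        # Frappe HRMS
    "email": 8040,       # MailHog
    "calendar": 8050,    # Baikal/Chronos
    "rocketchat": 8060,  # RocketChat
}


def server_url(name: str, host: str = "localhost") -> str:
    """Return the assumed HTTP base URL for one tool server."""
    try:
        port = TOOL_SERVER_PORTS[name]
    except KeyError:
        known = ", ".join(sorted(TOOL_SERVER_PORTS))
        raise ValueError(
            f"unknown tool server {name!r}; expected one of: {known}"
        ) from None
    return f"http://{host}:{port}"


if __name__ == "__main__":
    for name in TOOL_SERVER_PORTS:
        print(name, server_url(name))
```

A lookup like this can help when checking by hand that each backing service came up after `docker compose up`.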

Quickstart

from simlab_hr import HRAction
from simlab_hr.client import HREnv

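# Connect to the OpenEnv wrapper, here assumed to be served locally on port 8000
client = HREnv(base_url="http://localhost:8000")

with client:
    # reset() starts a new episode and returns the first observation
    obs = client.reset()
    print(obs.observation.task_instruction)
    print(obs.observation.tools_available)  # {'hrms': [...], 'email': [...], ...}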

    # Check leave balance in HRMS
    result = client.step(HRAction(
        tool_server="hrms",
        tool_name="get_leave_balance",
        parameters={"employee_id": "EMP-0042"}
    ))

    # Send an email notification
    result = client.step(HRAction(
        tool_server="email",
        tool_name="send_email",
        parameters={"to": "manager@company.com", "subject": "Leave approved", "body": "..."}
    ))
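Before sending an action, it can help to confirm that the server and tool are among those the environment advertises. The helper below is a sketch under one assumption: that `tools_available` maps each server name to a list of tool-name strings, which the Quickstart comment suggests but does not confirm. `check_tool_call` and `example_tools` are hypothetical names.

```python
from typing import Mapping, Sequence


def check_tool_call(
    tools_available: Mapping[str, Sequence[str]],
    tool_server: str,
    tool_name: str,
) -> None:
    """Raise ValueError if the server or the tool is not advertised."""
    if tool_server not in tools_available:
        raise ValueError(f"unknown tool server: {tool_server!r}")
    if tool_name not in tools_available[tool_server]:
        raise ValueError(
            f"{tool_server!r} does not advertise tool {tool_name!r}"
        )


# Example data shaped like the tools_available comment in the Quickstart.
# The tool lists here are made up for illustration.
example_tools = {
    "hrms": ["get_leave_balance"],
    "email": ["send_email"],
}

check_tool_call(example_tools, "hrms", "get_leave_balance")  # passes silently
```

Failing early on a misspelled tool name avoids spending an environment step on a call that cannot succeed.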

Benchmark Tasks

8 sample tasks covering real HR workflows across three difficulty levels:

Difficulty	Example Tasks
Easy	Approve a leave request, update an employee's designation
Medium	Schedule a phone screen + send confirmation, run an attendance report
Hard	Multi-person panel interview scheduling, full new-hire onboarding flow

Every task requires the agent to coordinate function calls across multiple MCP tool servers, which is the main source of difficulty.

Automated Evaluation

SimLab HR includes a rubric-based LLM judge that evaluates agent performance after each episode:

0.8–1.0: All requirements fully met with clear evidence
0.6–0.8: Core requirements met with minor gaps (0.6 = PASS threshold)
0.4–0.6: Partial completion, significant gaps remain
0.0–0.4: Minimal or no meaningful progress
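The rubric bands can be applied to a batch of episode scores as sketched below. The listed ranges share their endpoints, so this sketch assumes a boundary score belongs to the higher band, which agrees with 0.6 counting as a pass. The function names are hypothetical and not part of SimLab HR.

```python
from typing import Sequence

PASS_THRESHOLD = 0.6  # stated in the rubric


def score_band(score: float) -> str:
    """Map a judge score in [0, 1] to its rubric band.

    Assumption: a score on a boundary (0.4, 0.6, 0.8) falls in the higher band.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= 0.8:
        return "all requirements met"
    if score >= 0.6:
        return "core requirements met"
    if score >= 0.4:
        return "partial completion"
    return "minimal progress"


def passed(score: float) -> bool:
    """Return True when the score reaches the pass threshold."""
    return score >= PASS_THRESHOLD


def pass_rate(scores: Sequence[float]) -> float:
    """Return the fraction of episodes that passed (0.0 for an empty batch)."""
    if not scores:
        return 0.0
    return sum(passed(s) for s in scores) / len(scores)
```

For example, `pass_rate([0.9, 0.5, 0.6, 0.2])` counts two of four episodes as passes.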

Configure the verifier model:

export VERIFIER_MODEL="gpt-4o"
export VERIFIER_API_KEY="sk-..."
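A script that drives evaluations may want to fail fast when these variables are missing. The following is a sketch of such a check; how SimLab HR itself reads the variables is not shown here, and `VerifierConfig` and `load_verifier_config` are hypothetical names.

```python
import os
from typing import Mapping, NamedTuple


class VerifierConfig(NamedTuple):
    model: str
    api_key: str


def load_verifier_config(
    environ: Mapping[str, str] = os.environ,
) -> VerifierConfig:
    """Read the verifier settings, raising if either variable is unset or empty."""
    required = ("VERIFIER_MODEL", "VERIFIER_API_KEY")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "missing environment variables: " + ", ".join(missing)
        )
    return VerifierConfig(
        model=environ["VERIFIER_MODEL"],
        api_key=environ["VERIFIER_API_KEY"],
    )
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to test without touching the real environment.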

Run Locally

git clone https://github.com/collinear-ai/simlab.git
cd simlab/envs/simlab_hr

# Start all services (HRMS, Email, Calendar, RocketChat, OpenEnv wrapper)
docker compose up

# First run pulls ~10 images and takes a few minutes for HRMS to initialize

Or run from Hugging Face:

from simlab_hr.client import HREnv

client = HREnv.from_hub("collinear/simlab-hr")

Unlock 14+ Tasks from the API

This environment ships with 8 sample tasks. Want more?

Set your Collinear API key to unlock the full task set with real HR scenarios:

export COLLINEAR_API_KEY="your-key-here"

Get a free API key at platform.collinear.ai (Developer Resources → API Keys).

With the API key, every reset() pulls a fresh task from Collinear's Scenario Manager — recruiting workflows, people management scenarios, compliance tasks, and more.

Use with TRL / GRPOTrainer

Compatible with Hugging Face TRL for reinforcement learning fine-tuning:

from simlab_hr import HRAction
from simlab_hr.client import HREnv

env = HREnv.from_hub("collinear/simlab-hr")
with env:
    obs = env.reset()
    # ... your RL training loop
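GRPO samples several rollouts for the same task and compares each reward against the rest of its group. The sketch below shows that group-relative normalization in plain Python, assuming the episode reward is the rubric score in [0, 1]. It illustrates the idea only: TRL's own implementation may differ in details such as the standard deviation estimator and epsilon, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Center rewards on the group mean and scale by the group std.

    eps keeps the division defined when every rollout gets the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three rollouts of one task, scored by the rubric judge (made-up values).
    print(group_relative_advantages([0.2, 0.6, 1.0]))
```

Rollouts that beat the group average get a positive advantage and the rest get a negative one, so a group with identical scores contributes no learning signal.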

How SimLab HR Compares

Most tool-use benchmarks evaluate function calling in isolation — single API calls with predefined schemas. SimLab HR tests whether your agent can actually get work done across a real enterprise environment.

Feature	SimLab HR	BFCL	ToolBench	EnterpriseOps-Gym	tau-bench
What it tests	Multi-tool HR workflows end-to-end	Function call accuracy (single/parallel)	API discovery & chaining across 16k APIs	Enterprise planning across 8 domains	Customer service policy compliance
Real backing services	✅ Frappe HRMS, MailHog, CalDAV, RocketChat	❌ Schema validation only	❌ API simulation	❌ Mock APIs	❌ Simulated
MCP tool servers	✅ 4 servers	❌	❌	❌ REST APIs	❌
Multi-step workflows	✅ 10+ steps, cross-system	❌ Single/parallel calls	✅ Multi-hop chains	✅ Avg 9 steps	✅ Multi-turn
HR-specific	✅ Dedicated	❌	❌	✅ 1 of 8 domains	❌
Automated evaluation	✅ Rubric-based LLM judge	✅ AST matching	✅ Pass rate + win rate	✅ Expert-curated	✅ Policy checks
RL / Gymnasium support	✅ OpenEnv-compatible	❌	❌	❌	❌
Task generation	✅ API pipeline	❌	❌	❌	❌

More Environments

SimLab includes 5 enterprise simulation scenarios with 14 MCP tool servers:

Scenario	MCP Tool Servers
Human Resources	HRMS, email, calendar, team chat ← you are here
Customer Service	Helpdesk ticketing, team chat, email
Finance	SEC filings, market data, Google Workspace
Coding	Sandboxed IDE, browser automation, team chat
CRM	Contacts, deals, pipelines, activities

Install the full toolkit:

pip install simulationlab
simlab templates list

Learn more: github.com/collinear-ai/simlab | docs.collinear.ai

License

Apache 2.0 — Collinear AI