---
title: SimLab HR — AI Recruiting & People Management Agent Environment
emoji: 👔
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
short_description: "MCP gym for benchmarking & training AI HR agents"
tags:
  - openenv
  - hr
  - human-resources
  - recruiting
  - hrms
  - agent-evaluation
  - agent-benchmark
  - simlab
  - reinforcement-learning
  - rl-environment
  - ai-agent
  - tool-use
  - function-calling
  - enterprise
  - multi-tool
  - gymnasium
  - gym
  - benchmark
  - mcp
  - model-context-protocol
  - reward-model
  - verifier
  - collinear
  - langchain
  - openai
  - sandbox
  - docker
  - toolbench
  - swe-bench
  - bfcl
pinned: true
license: apache-2.0
---
# SimLab HR — MCP-Powered Gym for AI HR Agent Evaluation

A fully functional HR simulation gym for training, evaluating, and benchmarking AI recruiting and people management agents. Test your agent's tool-use and function-calling abilities across 4 MCP tool servers — HRMS, email, calendar, and team chat — with automated rubric-based evaluation.

Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).

Unlike single-API function-calling benchmarks such as [BFCL](https://github.com/ShishirPatil/gorilla) or [ToolBench](https://github.com/OpenBMB/ToolBench), SimLab HR gives your agent a full workplace — HRMS, email, calendar, and team chat — and asks it to complete real multi-step HR workflows end-to-end.
## 4 MCP Tool Servers, 1 Environment

| Tool Server | Port | What it does |
|---|---|---|
| **HRMS** (Frappe) | 8030 | Employee records, leave management, attendance, payroll |
| **Email** (MailHog) | 8040 | Send and read emails, inbox management |
| **Calendar** (Baikal/Chronos) | 8050 | Schedule meetings, check availability, manage events |
| **RocketChat** | 8060 | Team messaging, channels, direct messages |

Agents must reason and coordinate tool calls across all four MCP servers to complete real HR workflows — the kind of multi-step function calling that separates real tool use from single-API benchmarks.
## Quickstart

```python
from simlab_hr import HRAction
from simlab_hr.client import HREnv

client = HREnv(base_url="http://localhost:8000")

with client:
    obs = client.reset()
    print(obs.observation.task_instruction)
    print(obs.observation.tools_available)  # {'hrms': [...], 'email': [...], ...}

    # Check leave balance in HRMS
    result = client.step(HRAction(
        tool_server="hrms",
        tool_name="get_leave_balance",
        parameters={"employee_id": "EMP-0042"}
    ))

    # Send an email notification
    result = client.step(HRAction(
        tool_server="email",
        tool_name="send_email",
        parameters={"to": "manager@company.com", "subject": "Leave approved", "body": "..."}
    ))
```
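Before stepping, an agent can sanity-check a planned call against the tool list the environment advertises on reset. A minimal sketch, assuming `tools_available` has the dict-of-lists shape printed above; the helper itself is illustrative, not part of `simlab_hr`:

```python
# Illustrative helper (not part of simlab_hr): validate a planned call
# against the tools the environment advertised on reset.
def is_known_tool(tools_available: dict[str, list[str]],
                  tool_server: str, tool_name: str) -> bool:
    """True if `tool_name` is advertised by `tool_server`."""
    return tool_name in tools_available.get(tool_server, [])

tools = {"hrms": ["get_leave_balance"], "email": ["send_email"]}
print(is_known_tool(tools, "hrms", "get_leave_balance"))  # True
print(is_known_tool(tools, "calendar", "create_event"))   # False
```

Rejecting unknown tool names before calling `step()` keeps malformed actions out of the episode trace.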
## Benchmark Tasks

**8 sample tasks** covering real HR workflows across three difficulty levels:

| Difficulty | Example Tasks |
|---|---|
| Easy | Approve a leave request, update an employee's designation |
| Medium | Schedule a phone screen + send confirmation, run an attendance report |
| Hard | Multi-person panel interview scheduling, full new-hire onboarding flow |

Every task requires the agent to coordinate function calls across multiple MCP tool servers — this is what makes it hard.
## Automated Evaluation

SimLab HR includes a rubric-based LLM judge that evaluates agent performance after each episode:

- **0.8–1.0**: All requirements fully met with clear evidence
- **0.6–0.8**: Core requirements met with minor gaps (0.6 = PASS threshold)
- **0.4–0.6**: Partial completion, significant gaps remain
- **0.0–0.4**: Minimal or no meaningful progress
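The bands above reduce to a pass/fail decision at the 0.6 threshold. A minimal sketch of that mapping, treating each band as inclusive of its lower bound (an assumption, since the published ranges share their edge values); the helper is illustrative, not part of `simlab_hr`:

```python
# Illustrative mapping from a judge score in [0, 1] to the rubric
# bands above. Lower bounds are treated as inclusive (an assumption).
PASS_THRESHOLD = 0.6

def rubric_band(score: float) -> tuple[str, bool]:
    """Return (band description, passed) for a rubric score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("rubric scores are in [0, 1]")
    if score >= 0.8:
        band = "all requirements fully met"
    elif score >= 0.6:
        band = "core requirements met, minor gaps"
    elif score >= 0.4:
        band = "partial completion"
    else:
        band = "minimal progress"
    return band, score >= PASS_THRESHOLD

print(rubric_band(0.65))  # ('core requirements met, minor gaps', True)
```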
Configure the verifier model:

```bash
export VERIFIER_MODEL="gpt-4o"
export VERIFIER_API_KEY="sk-..."
```
## Run Locally

```bash
git clone https://github.com/collinear-ai/simlab.git
cd simlab/envs/simlab_hr

# Start all services (HRMS, Email, Calendar, RocketChat, OpenEnv wrapper)
docker compose up
# First run pulls ~10 images and takes a few minutes for HRMS to initialize
```
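Since HRMS takes a few minutes to come up on first run, it can help to wait until every service accepts TCP connections before pointing an agent at the stack. A minimal sketch using the ports from the table above, plus the OpenEnv wrapper on 8000 (the wrapper port is an assumption taken from the quickstart's `base_url`):

```python
# Illustrative readiness check: poll each local service port until it
# accepts a TCP connection or the timeout expires.
import socket
import time

SERVICES = {"env": 8000, "hrms": 8030, "email": 8040,
            "calendar": 8050, "chat": 8060}

def wait_for(port: int, host: str = "localhost",
             timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Return True once host:port accepts a connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, `all(wait_for(p) for p in SERVICES.values())` before the first `reset()` avoids connection errors while containers are still initializing.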
Or run from Hugging Face:

```python
from simlab_hr.client import HREnv

client = HREnv.from_hub("collinear/simlab-hr")
```
## Unlock 14+ Tasks from the API

This environment ships with 8 sample tasks. Want more?

Set your Collinear API key to unlock the full task set with real HR scenarios:

```bash
export COLLINEAR_API_KEY="your-key-here"
```

Get a free API key at **[platform.collinear.ai](https://platform.collinear.ai)** (Developer Resources → API Keys).

With the API key, every `reset()` pulls a fresh task from Collinear's Scenario Manager — recruiting workflows, people management scenarios, compliance tasks, and more.
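Because each `reset()` can draw a fresh task, a short loop is enough to build a varied evaluation set. A hedged sketch: the client calls mirror the quickstart, while the de-duplication logic is purely illustrative and not part of `simlab_hr`:

```python
# Illustrative helper: reset repeatedly and collect distinct task
# instructions (assumes the quickstart's obs.observation.task_instruction).
def sample_tasks(client, n: int = 5) -> list[str]:
    """Reset `client` n times and return the distinct task instructions seen."""
    seen: list[str] = []
    for _ in range(n):
        obs = client.reset()
        task = obs.observation.task_instruction
        if task not in seen:
            seen.append(task)
    return seen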
## Use with TRL / GRPOTrainer

Compatible with Hugging Face TRL for reinforcement learning fine-tuning:

```python
from simlab_hr import HRAction
from simlab_hr.client import HREnv

env = HREnv.from_hub("collinear/simlab-hr")
with env:
    obs = env.reset()
    # ... your RL training loop
```
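To see how the rubric reward feeds training: GRPO normalizes each group of episode rewards into group-relative advantages, which TRL's `GRPOTrainer` computes internally. A standalone sketch of that normalization (illustrative, not TRL's actual code):

```python
# Illustrative GRPO-style normalization: center a group of episode
# rewards on the group mean and scale by the group std deviation.
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize a group of episode rewards to zero mean, unit std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

print(group_advantages([0.9, 0.6, 0.3]))  # ≈ [1.22, 0.0, -1.22]
```

Episodes scoring above their group's mean rubric reward get positive advantage, so the policy is pushed toward the tool-call sequences the judge scored higher.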
## How SimLab HR Compares

Most tool-use benchmarks evaluate function calling in isolation — single API calls with predefined schemas. SimLab HR tests whether your agent can actually get work done across a real enterprise environment.

| | SimLab HR | [BFCL](https://github.com/ShishirPatil/gorilla) | [ToolBench](https://github.com/OpenBMB/ToolBench) | [EnterpriseOps-Gym](https://github.com/ServiceNow/EnterpriseOps-Gym) | [tau-bench](https://github.com/sierra-research/tau2-bench) |
|---|---|---|---|---|---|
| What it tests | Multi-tool HR workflows end-to-end | Function call accuracy (single/parallel) | API discovery & chaining across 16k APIs | Enterprise planning across 8 domains | Customer service policy compliance |
| Real backing services | ✅ Frappe HRMS, MailHog, CalDAV, RocketChat | ❌ Schema validation only | ❌ API simulation | ❌ Mock APIs | ❌ Simulated |
| MCP tool servers | ✅ 4 servers | ❌ | ❌ | ❌ REST APIs | ❌ |
| Multi-step workflows | ✅ 10+ steps, cross-system | ❌ Single/parallel calls | ✅ Multi-hop chains | ✅ Avg 9 steps | ✅ Multi-turn |
| HR-specific | ✅ Dedicated | ❌ | ❌ | ✅ 1 of 8 domains | ❌ |
| Automated evaluation | ✅ Rubric-based LLM judge | ✅ AST matching | ✅ Pass rate + win rate | ✅ Expert-curated | ✅ Policy checks |
| RL / Gymnasium support | ✅ OpenEnv-compatible | ❌ | ❌ | ❌ | ❌ |
| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ | ❌ |
## More Environments

SimLab includes **5 enterprise simulation scenarios** with **14 MCP tool servers**:

| Scenario | MCP Tool Servers |
|---|---|
| **Human Resources** | HRMS, email, calendar, team chat ← *you are here* |
| **Customer Service** | Helpdesk ticketing, team chat, email |
| **Finance** | SEC filings, market data, Google Workspace |
| **Coding** | Sandboxed IDE, browser automation, team chat |
| **CRM** | Contacts, deals, pipelines, activities |
Install the full toolkit:

```bash
pip install simulationlab
simlab templates list
```

Learn more: [github.com/collinear-ai/simlab](https://github.com/collinear-ai/simlab) | [docs.collinear.ai](https://docs.collinear.ai)
## License

Apache 2.0 — [Collinear AI](https://collinear.ai)