renanserrano committed on
Commit 3d3bc40 · verified · 1 Parent(s): f03afc0

Update comparison table — BFCL, ToolBench, EnterpriseOps-Gym, tau-bench with links and richer descriptions

Files changed (1): README.md (+16 −10)
@@ -34,6 +34,9 @@ tags:
 - openai
 - sandbox
 - docker
+- toolbench
+- swe-bench
+- bfcl
 pinned: true
 license: apache-2.0
 ---
@@ -44,7 +47,7 @@ A fully-functional HR simulation gym for training, evaluating, and benchmarking
 
 Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).
 
-Your agent gets a task ("schedule a phone screen", "approve a leave request", "onboard a new hire") and a real workplace with an HRMS, email, calendar, and team chat. Can it get the job done?
+Unlike single-API function-calling benchmarks like [BFCL](https://github.com/ShishirPatil/gorilla) or [ToolBench](https://github.com/OpenBMB/ToolBench), SimLab HR gives your agent a full workplace HRMS, email, calendar, and team chat and asks it to complete real multi-step HR workflows end-to-end.
 
 ## 4 MCP Tool Servers, 1 Environment
 
@@ -163,15 +166,18 @@ with env:
 
 ## How SimLab HR Compares
 
-| | SimLab HR | EnterpriseOps-Gym | Toolathlon-GYM | tau-bench |
-|---|---|---|---|---|
-| HR-specific tasks | Dedicated | 1 of 8 domains | Partial (analytics data) | |
-| Real backing services | ✅ Frappe, MailHog, CalDAV, RocketChat | ❌ Mock APIs | ✅ PostgreSQL | ❌ Simulated |
-| MCP tool servers | 4 servers | REST APIs | 25 servers | |
-| Automated evaluation | ✅ Rubric judge | Expert-curated | Groundtruth scripts | Policy checks |
-| RL / Gymnasium support | ✅ OpenEnv | ❌ | ❌ | ❌ |
-| Custom MCP server plugging | ✅ via SimLab CLI | ❌ | | |
-| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ |
+Most tool-use benchmarks evaluate function calling in isolation — single API calls with predefined schemas. SimLab HR tests whether your agent can actually get work done across a real enterprise environment.
+
+| | SimLab HR | [BFCL](https://github.com/ShishirPatil/gorilla) | [ToolBench](https://github.com/OpenBMB/ToolBench) | [EnterpriseOps-Gym](https://github.com/ServiceNow/EnterpriseOps-Gym) | [tau-bench](https://github.com/sierra-research/tau2-bench) |
+|---|---|---|---|---|---|
+| What it tests | Multi-tool HR workflows end-to-end | Function call accuracy (single/parallel) | API discovery & chaining across 16k APIs | Enterprise planning across 8 domains | Customer service policy compliance |
+| Real backing services | ✅ Frappe HRMS, MailHog, CalDAV, RocketChat | Schema validation only | API simulation | Mock APIs | ❌ Simulated |
+| MCP tool servers | ✅ 4 servers | ❌ | ❌ | ❌ REST APIs | ❌ |
+| Multi-step workflows | ✅ 10+ steps, cross-system | ❌ Single/parallel calls | Multi-hop chains | Avg 9 steps | ✅ Multi-turn |
+| HR-specific | ✅ Dedicated | ❌ | ❌ | ✅ 1 of 8 domains | ❌ |
+| Automated evaluation | ✅ Rubric-based LLM judge | ✅ AST matching | ✅ Pass rate + win rate | ✅ Expert-curated | ✅ Policy checks |
+| RL / Gymnasium support | ✅ OpenEnv-compatible | ❌ | ❌ | ❌ | ❌ |
+| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ | ❌ |
 
 ## More Environments
 