Improve README for discoverability — add MCP/gym/benchmark keywords, comparison table, evaluation docs
README.md
CHANGED
|
@@ -5,43 +5,57 @@ colorFrom: blue
|
|
| 5 |
colorTo: green
|
| 6 |
sdk: docker
|
| 7 |
app_port: 7860
|
| 8 |
-
short_description: "
|
| 9 |
tags:
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
pinned: true
|
| 26 |
license: apache-2.0
|
| 27 |
---
|
| 28 |
|
| 29 |
-
# SimLab HR —
|
| 30 |
|
| 31 |
-
A fully-functional HR simulation for training, evaluating, and benchmarking AI recruiting and people management agents.
|
| 32 |
|
| 33 |
Your agent gets a task ("schedule a phone screen", "approve a leave request", "onboard a new hire") and a real workplace with an HRMS, email, calendar, and team chat. Can it get the job done?
|
| 34 |
|
| 35 |
-
## 4 Tool Servers, 1 Environment
|
| 36 |
|
| 37 |
-
| Server | Port | What it does |
|
| 38 |
|---|---|---|
|
| 39 |
| **HRMS** (Frappe) | 8030 | Employee records, leave management, attendance, payroll |
|
| 40 |
| **Email** (MailHog) | 8040 | Send and read emails, inbox management |
|
| 41 |
| **Calendar** (Baikal/Chronos) | 8050 | Schedule meetings, check availability, manage events |
|
| 42 |
| **RocketChat** | 8060 | Team messaging, channels, direct messages |
|
| 43 |
|
| 44 |
-
Agents must reason across all four
|
| 45 |
|
| 46 |
## Quickstart
|
| 47 |
|
|
@@ -71,17 +85,33 @@ with client:
|
|
| 71 |
))
|
| 72 |
```
|
| 73 |
|
| 74 |
-
##
|
| 75 |
|
| 76 |
-
**8 sample tasks** covering real HR workflows:
|
| 77 |
|
| 78 |
-
| Difficulty | Example |
|
| 79 |
|---|---|
|
| 80 |
| Easy | Approve a leave request, update an employee's designation |
|
| 81 |
| Medium | Schedule a phone screen + send confirmation, run an attendance report |
|
| 82 |
| Hard | Multi-person panel interview scheduling, full new-hire onboarding flow |
|
| 83 |
|
| 84 |
-
Every task requires the agent to coordinate across multiple tool servers — this is what makes it hard.
|
| 85 |
|
| 86 |
## Run Locally
|
| 87 |
|
|
@@ -119,7 +149,7 @@ With the API key, every `reset()` pulls a fresh task from Collinear's Scenario M
|
|
| 119 |
|
| 120 |
## Use with TRL / GRPOTrainer
|
| 121 |
|
| 122 |
-
Compatible with Hugging Face TRL for
|
| 123 |
|
| 124 |
```python
|
| 125 |
from simlab_hr import HRAction
|
|
@@ -128,14 +158,26 @@ from simlab_hr.client import HREnv
|
|
| 128 |
env = HREnv.from_hub("collinear/simlab-hr")
|
| 129 |
with env:
|
| 130 |
obs = env.reset()
|
| 131 |
-
# ... your training loop
|
| 132 |
```
|
| 133 |
|
| 134 |
## More Environments
|
| 135 |
|
| 136 |
-
SimLab includes **5 enterprise simulation scenarios** with **14 tool servers**:
|
| 137 |
|
| 138 |
-
| Scenario |
|
| 139 |
|---|---|
|
| 140 |
| **Human Resources** | HRMS, email, calendar, team chat ← *you are here* |
|
| 141 |
| **Customer Service** | Helpdesk ticketing, team chat, email |
|
|
|
|
| 5 |
colorTo: green
|
| 6 |
sdk: docker
|
| 7 |
app_port: 7860
|
| 8 |
+
short_description: "MCP gym for benchmarking & training AI HR agents"
|
| 9 |
tags:
|
| 10 |
+
- openenv
|
| 11 |
+
- hr
|
| 12 |
+
- human-resources
|
| 13 |
+
- recruiting
|
| 14 |
+
- hrms
|
| 15 |
+
- agent-evaluation
|
| 16 |
+
- agent-benchmark
|
| 17 |
+
- simlab
|
| 18 |
+
- reinforcement-learning
|
| 19 |
+
- rl-environment
|
| 20 |
+
- ai-agent
|
| 21 |
+
- tool-use
|
| 22 |
+
- function-calling
|
| 23 |
+
- enterprise
|
| 24 |
+
- multi-tool
|
| 25 |
+
- gymnasium
|
| 26 |
+
- gym
|
| 27 |
+
- benchmark
|
| 28 |
+
- mcp
|
| 29 |
+
- model-context-protocol
|
| 30 |
+
- reward-model
|
| 31 |
+
- verifier
|
| 32 |
+
- collinear
|
| 33 |
+
- langchain
|
| 34 |
+
- openai
|
| 35 |
+
- sandbox
|
| 36 |
+
- docker
|
| 37 |
pinned: true
|
| 38 |
license: apache-2.0
|
| 39 |
---
|
| 40 |
|
| 41 |
+
# SimLab HR — MCP-Powered Gym for AI HR Agent Evaluation
|
| 42 |
|
| 43 |
+
A fully functional HR simulation gym for training, evaluating, and benchmarking AI recruiting and people management agents. Test your agent's tool-use and function-calling abilities across 4 MCP tool servers (HRMS, email, calendar, and team chat) with automated rubric-based evaluation.
|
| 44 |
+
|
| 45 |
+
Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).
|
| 46 |
|
| 47 |
Your agent gets a task ("schedule a phone screen", "approve a leave request", "onboard a new hire") and a real workplace with an HRMS, email, calendar, and team chat. Can it get the job done?
|
| 48 |
|
| 49 |
+
## 4 MCP Tool Servers, 1 Environment
|
| 50 |
|
| 51 |
+
| Tool Server | Port | What it does |
|
| 52 |
|---|---|---|
|
| 53 |
| **HRMS** (Frappe) | 8030 | Employee records, leave management, attendance, payroll |
|
| 54 |
| **Email** (MailHog) | 8040 | Send and read emails, inbox management |
|
| 55 |
| **Calendar** (Baikal/Chronos) | 8050 | Schedule meetings, check availability, manage events |
|
| 56 |
| **RocketChat** | 8060 | Team messaging, channels, direct messages |
|
| 57 |
|
| 58 |
+
Agents must reason and coordinate tool calls across all four MCP servers to complete real HR workflows. This multi-step function calling is what separates real tool use from single-API benchmarks.
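As a rough illustration of the table above, the four tool servers can be addressed by port. This is a minimal sketch: the ports come from the table, while the `localhost` host, the `http` scheme, and the `tool_server_urls` helper are assumptions made for illustration and are not part of the SimLab HR API.

```python
# Ports as listed in the tool server table.
TOOL_SERVER_PORTS = {
    "hrms": 8030,        # Frappe HRMS
    "email": 8040,       # MailHog
    "calendar": 8050,    # Baikal/Chronos
    "rocketchat": 8060,  # RocketChat
}


def tool_server_urls(host: str = "localhost", scheme: str = "http") -> dict[str, str]:
    """Build one base URL per tool server (host and scheme are assumptions)."""
    return {
        name: f"{scheme}://{host}:{port}"
        for name, port in TOOL_SERVER_PORTS.items()
    }


if __name__ == "__main__":
    for name, url in tool_server_urls().items():
        print(f"{name}: {url}")
```

Keeping the ports in one mapping makes it easy to point an agent harness at a different host without touching the rest of the code.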
|
| 59 |
|
| 60 |
## Quickstart
|
| 61 |
|
|
|
|
| 85 |
))
|
| 86 |
```
|
| 87 |
|
| 88 |
+
## Benchmark Tasks
|
| 89 |
|
| 90 |
+
**8 sample tasks** covering real HR workflows across three difficulty levels:
|
| 91 |
|
| 92 |
+
| Difficulty | Example Tasks |
|
| 93 |
|---|---|
|
| 94 |
| Easy | Approve a leave request, update an employee's designation |
|
| 95 |
| Medium | Schedule a phone screen + send confirmation, run an attendance report |
|
| 96 |
| Hard | Multi-person panel interview scheduling, full new-hire onboarding flow |
|
| 97 |
|
| 98 |
+
Every task requires the agent to coordinate function calls across multiple MCP tool servers — this is what makes it hard.
|
| 99 |
+
|
| 100 |
+
## Automated Evaluation
|
| 101 |
+
|
| 102 |
+
SimLab HR includes a rubric-based LLM judge that evaluates agent performance after each episode:
|
| 103 |
+
|
| 104 |
+
- **0.8–1.0**: All requirements fully met with clear evidence
|
| 105 |
+
- **0.6–0.8**: Core requirements met with minor gaps (0.6 = PASS threshold)
|
| 106 |
+
- **0.4–0.6**: Partial completion, significant gaps remain
|
| 107 |
+
- **0.0–0.4**: Minimal or no meaningful progress
|
| 108 |
+
|
| 109 |
+
Configure the verifier model:
|
| 110 |
+
|
| 111 |
+
```bash
|
| 112 |
+
export VERIFIER_MODEL="gpt-4o"
|
| 113 |
+
export VERIFIER_API_KEY="sk-..."
|
| 114 |
+
```
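The score bands above can be sketched as a small helper. The thresholds and the 0.6 PASS cutoff come from the list. How a score that lands exactly on a boundary (for example 0.8) is assigned is an assumption here, as are the hypothetical names `score_band` and `is_pass`. None of this is the judge's actual implementation.

```python
PASS_THRESHOLD = 0.6  # stated in the README as the PASS threshold


def score_band(score: float) -> str:
    """Map a judge score in [0.0, 1.0] to its rubric band.

    Assumption: a boundary value belongs to the higher band,
    so 0.8 counts as "fully met" and 0.6 as "core met".
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    if score >= 0.8:
        return "fully met"
    if score >= 0.6:
        return "core met"
    if score >= 0.4:
        return "partial"
    return "minimal"


def is_pass(score: float) -> bool:
    """Return True when the score reaches the PASS threshold."""
    return score >= PASS_THRESHOLD


if __name__ == "__main__":
    for s in (0.95, 0.6, 0.45, 0.1):
        print(s, score_band(s), "PASS" if is_pass(s) else "FAIL")
```

Treating the lower bound of each band as inclusive matches the note that 0.6 itself is a PASS.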
|
| 115 |
|
| 116 |
## Run Locally
|
| 117 |
|
|
|
|
| 149 |
|
| 150 |
## Use with TRL / GRPOTrainer
|
| 151 |
|
| 152 |
+
Compatible with Hugging Face TRL for reinforcement learning fine-tuning:
|
| 153 |
|
| 154 |
```python
|
| 155 |
from simlab_hr import HRAction
|
|
|
|
| 158 |
env = HREnv.from_hub("collinear/simlab-hr")
|
| 159 |
with env:
|
| 160 |
obs = env.reset()
|
| 161 |
+
# ... your RL training loop
|
| 162 |
```
|
| 163 |
|
| 164 |
+
## How SimLab HR Compares
|
| 165 |
+
|
| 166 |
+
| Feature | SimLab HR | EnterpriseOps-Gym | Toolathlon-GYM | tau-bench |
|
| 167 |
+
|---|---|---|---|---|
|
| 168 |
+
| HR-specific tasks | ✅ Dedicated | ✅ 1 of 8 domains | Partial (analytics data) | ❌ |
|
| 169 |
+
| Real backing services | ✅ Frappe, MailHog, CalDAV, RocketChat | ❌ Mock APIs | ✅ PostgreSQL | ❌ Simulated |
|
| 170 |
+
| MCP tool servers | ✅ 4 servers | ❌ REST APIs | ✅ 25 servers | ❌ |
|
| 171 |
+
| Automated evaluation | ✅ Rubric judge | ✅ Expert-curated | ✅ Groundtruth scripts | ✅ Policy checks |
|
| 172 |
+
| RL / Gymnasium support | ✅ OpenEnv | ❌ | ❌ | ❌ |
|
| 173 |
+
| Custom MCP server plugging | ✅ via SimLab CLI | ❌ | ❌ | ❌ |
|
| 174 |
+
| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ |
|
| 175 |
+
|
| 176 |
## More Environments
|
| 177 |
|
| 178 |
+
SimLab includes **5 enterprise simulation scenarios** with **14 MCP tool servers**:
|
| 179 |
|
| 180 |
+
| Scenario | MCP Tool Servers |
|
| 181 |
|---|---|
|
| 182 |
| **Human Resources** | HRMS, email, calendar, team chat ← *you are here* |
|
| 183 |
| **Customer Service** | Helpdesk ticketing, team chat, email |
|