renanserrano committed on
Commit f03afc0 · verified · 1 Parent(s): 384d994

Improve README for discoverability — add MCP/gym/benchmark keywords, comparison table, evaluation docs

Files changed (1)
  1. README.md +71 -29
README.md CHANGED
@@ -5,43 +5,57 @@ colorFrom: blue
5
  colorTo: green
6
  sdk: docker
7
  app_port: 7860
8
- short_description: "AI HR agent environment HRMS, email, calendar, chat"
9
  tags:
10
- - openenv
11
- - hr
12
- - human-resources
13
- - recruiting
14
- - hrms
15
- - agent-evaluation
16
- - simlab
17
- - reinforcement-learning
18
- - rl-environment
19
- - ai-agent
20
- - tool-use
21
- - enterprise
22
- - multi-tool
23
- - gymnasium
24
- - collinear
25
  pinned: true
26
  license: apache-2.0
27
  ---
28
 
29
- # SimLab HR — AI Recruiting & People Management Agent Environment
30
 
31
- A fully-functional HR simulation for training, evaluating, and benchmarking AI recruiting and people management agents. Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).
32
 
33
  Your agent gets a task ("schedule a phone screen", "approve a leave request", "onboard a new hire") and a real workplace with an HRMS, email, calendar, and team chat. Can it get the job done?
34
 
35
- ## 4 Tool Servers, 1 Environment
36
 
37
- | Server | Port | What it does |
38
  |---|---|---|
39
  | **HRMS** (Frappe) | 8030 | Employee records, leave management, attendance, payroll |
40
  | **Email** (MailHog) | 8040 | Send and read emails, inbox management |
41
  | **Calendar** (Baikal/Chronos) | 8050 | Schedule meetings, check availability, manage events |
42
  | **RocketChat** | 8060 | Team messaging, channels, direct messages |
43
 
44
- Agents must reason across all four systems to complete real HR workflows — just like a human would.
45
 
46
  ## Quickstart
47
 
@@ -71,17 +85,33 @@ with client:
71
  ))
72
  ```
73
 
74
- ## What's Inside
75
 
76
- **8 sample tasks** covering real HR workflows:
77
 
78
- | Difficulty | Example |
79
  |---|---|
80
  | Easy | Approve a leave request, update an employee's designation |
81
  | Medium | Schedule a phone screen + send confirmation, run an attendance report |
82
  | Hard | Multi-person panel interview scheduling, full new-hire onboarding flow |
83
 
84
- Every task requires the agent to coordinate across multiple tool servers — this is what makes it hard.
85
 
86
  ## Run Locally
87
 
@@ -119,7 +149,7 @@ With the API key, every `reset()` pulls a fresh task from Collinear's Scenario M
119
 
120
  ## Use with TRL / GRPOTrainer
121
 
122
- Compatible with Hugging Face TRL for RL fine-tuning:
123
 
124
  ```python
125
  from simlab_hr import HRAction
@@ -128,14 +158,26 @@ from simlab_hr.client import HREnv
128
  env = HREnv.from_hub("collinear/simlab-hr")
129
  with env:
130
  obs = env.reset()
131
- # ... your training loop
132
  ```
133
 
134
  ## More Environments
135
 
136
- SimLab includes **5 enterprise simulation scenarios** with **14 tool servers**:
137
 
138
- | Scenario | Tools |
139
  |---|---|
140
  | **Human Resources** | HRMS, email, calendar, team chat ← *you are here* |
141
  | **Customer Service** | Helpdesk ticketing, team chat, email |
 
5
  colorTo: green
6
  sdk: docker
7
  app_port: 7860
8
+ short_description: "MCP gym for benchmarking & training AI HR agents"
9
  tags:
10
+ - openenv
11
+ - hr
12
+ - human-resources
13
+ - recruiting
14
+ - hrms
15
+ - agent-evaluation
16
+ - agent-benchmark
17
+ - simlab
18
+ - reinforcement-learning
19
+ - rl-environment
20
+ - ai-agent
21
+ - tool-use
22
+ - function-calling
23
+ - enterprise
24
+ - multi-tool
25
+ - gymnasium
26
+ - gym
27
+ - benchmark
28
+ - mcp
29
+ - model-context-protocol
30
+ - reward-model
31
+ - verifier
32
+ - collinear
33
+ - langchain
34
+ - openai
35
+ - sandbox
36
+ - docker
37
  pinned: true
38
  license: apache-2.0
39
  ---
40
 
41
+ # SimLab HR — MCP-Powered Gym for AI HR Agent Evaluation
42
 
43
+ A fully-functional HR simulation gym for training, evaluating, and benchmarking AI recruiting and people management agents. Test your agent's tool-use and function-calling abilities across 4 MCP tool servers — HRMS, email, calendar, and team chat — with automated rubric-based evaluation.
44
+
45
+ Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).
46
 
47
  Your agent gets a task ("schedule a phone screen", "approve a leave request", "onboard a new hire") and a real workplace with an HRMS, email, calendar, and team chat. Can it get the job done?
48
 
49
+ ## 4 MCP Tool Servers, 1 Environment
50
 
51
+ | Tool Server | Port | What it does |
52
  |---|---|---|
53
  | **HRMS** (Frappe) | 8030 | Employee records, leave management, attendance, payroll |
54
  | **Email** (MailHog) | 8040 | Send and read emails, inbox management |
55
  | **Calendar** (Baikal/Chronos) | 8050 | Schedule meetings, check availability, manage events |
56
  | **RocketChat** | 8060 | Team messaging, channels, direct messages |
57
 
58
+ Agents must reason and coordinate tool calls across all four MCP servers to complete real HR workflows — the kind of multi-step function calling that separates real tool-use from single-API benchmarks.
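Before pointing an agent at the environment, it can help to confirm that all four tool servers are accepting connections. The following is a minimal sketch, not part of SimLab: it assumes the ports from the table above are exposed on `localhost` when the stack runs locally, and the dictionary keys and the helper names `is_listening` and `check_tool_servers` are hypothetical.

```python
import socket

# Ports copied from the tool server table; the keys are illustrative names only.
TOOL_SERVER_PORTS = {
    "hrms": 8030,
    "email": 8040,
    "calendar": 8050,
    "rocketchat": 8060,
}


def is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_tool_servers(host: str = "localhost") -> dict[str, bool]:
    """Probe every tool server port and report which ones accept connections."""
    return {name: is_listening(host, port) for name, port in TOOL_SERVER_PORTS.items()}


if __name__ == "__main__":
    for name, up in check_tool_servers().items():
        status = "up" if up else "down"
        print(f"{name:<10} port {TOOL_SERVER_PORTS[name]}: {status}")
```

A successful TCP connection only shows that something is listening on the port; it does not verify that the service behind it is healthy or fully initialized.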
59
 
60
  ## Quickstart
61
 
 
85
  ))
86
  ```
87
 
88
+ ## Benchmark Tasks
89
 
90
+ **8 sample tasks** covering real HR workflows across three difficulty levels:
91
 
92
+ | Difficulty | Example Tasks |
93
  |---|---|
94
  | Easy | Approve a leave request, update an employee's designation |
95
  | Medium | Schedule a phone screen + send confirmation, run an attendance report |
96
  | Hard | Multi-person panel interview scheduling, full new-hire onboarding flow |
97
 
98
+ Every task requires the agent to coordinate function calls across multiple MCP tool servers — this is what makes it hard.
99
+
100
+ ## Automated Evaluation
101
+
102
+ SimLab HR includes a rubric-based LLM judge that evaluates agent performance after each episode:
103
+
104
+ - **0.8–1.0**: All requirements fully met with clear evidence
105
+ - **0.6–0.8**: Core requirements met with minor gaps (0.6 = PASS threshold)
106
+ - **0.4–0.6**: Partial completion, significant gaps remain
107
+ - **0.0–0.4**: Minimal or no meaningful progress
108
+
109
+ Configure the verifier model:
110
+
111
+ ```bash
112
+ export VERIFIER_MODEL="gpt-4o"
113
+ export VERIFIER_API_KEY="sk-..."
114
+ ```
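The score bands above can be turned into a small helper for logging or filtering episodes. This is a sketch under stated assumptions rather than SimLab's own code: the README lists overlapping boundaries (0.8, 0.6, 0.4), so the helper assumes each band includes its lower bound, which matches the stated rule that 0.6 is the PASS threshold. The names `RubricResult` and `interpret_score` are hypothetical.

```python
from typing import NamedTuple

PASS_THRESHOLD = 0.6  # stated in the README as the PASS threshold


class RubricResult(NamedTuple):
    band: str
    passed: bool


def interpret_score(score: float) -> RubricResult:
    """Map a judge score in [0.0, 1.0] to its README band and pass/fail status.

    Assumption: each band includes its lower bound, so 0.6 lands in the
    "core requirements met" band and counts as a pass.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    if score >= 0.8:
        band = "all requirements fully met"
    elif score >= 0.6:
        band = "core requirements met, minor gaps"
    elif score >= 0.4:
        band = "partial completion"
    else:
        band = "minimal or no progress"
    return RubricResult(band, score >= PASS_THRESHOLD)
```

For example, `interpret_score(0.6).passed` is `True` and `interpret_score(0.59).passed` is `False` under this reading of the boundaries.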
115
 
116
  ## Run Locally
117
 
 
149
 
150
  ## Use with TRL / GRPOTrainer
151
 
152
+ Compatible with Hugging Face TRL for reinforcement learning fine-tuning:
153
 
154
  ```python
155
  from simlab_hr import HRAction
 
158
  env = HREnv.from_hub("collinear/simlab-hr")
159
  with env:
160
  obs = env.reset()
161
+ # ... your RL training loop
162
  ```
163
 
164
+ ## How SimLab HR Compares
165
+
166
+ | | SimLab HR | EnterpriseOps-Gym | Toolathlon-GYM | tau-bench |
167
+ |---|---|---|---|---|
168
+ | HR-specific tasks | ✅ Dedicated | ✅ 1 of 8 domains | Partial (analytics data) | ❌ |
169
+ | Real backing services | ✅ Frappe, MailHog, CalDAV, RocketChat | ❌ Mock APIs | ✅ PostgreSQL | ❌ Simulated |
170
+ | MCP tool servers | ✅ 4 servers | ❌ REST APIs | ✅ 25 servers | ❌ |
171
+ | Automated evaluation | ✅ Rubric judge | ✅ Expert-curated | ✅ Groundtruth scripts | ✅ Policy checks |
172
+ | RL / Gymnasium support | ✅ OpenEnv | ❌ | ❌ | ❌ |
173
+ | Custom MCP server plugging | ✅ via SimLab CLI | ❌ | ❌ | ❌ |
174
+ | Task generation | ✅ API pipeline | ❌ | ❌ | ❌ |
175
+
176
  ## More Environments
177
 
178
+ SimLab includes **5 enterprise simulation scenarios** with **14 MCP tool servers**:
179
 
180
+ | Scenario | MCP Tool Servers |
181
  |---|---|
182
  | **Human Resources** | HRMS, email, calendar, team chat ← *you are here* |
183
  | **Customer Service** | Helpdesk ticketing, team chat, email |