Spaces: renanserrano/simulationlab-hr

renanserrano committed on Mar 31

Commit f03afc0 (verified) · 1 Parent(s): 384d994

Improve README for discoverability — add MCP/gym/benchmark keywords, comparison table, evaluation docs

Files changed (1)

README.md +71 -29

README.md CHANGED

@@ -5,43 +5,57 @@ colorFrom: blue
 colorTo: green
 sdk: docker
 app_port: 7860
-short_description: "AI HR agent environment — HRMS, email, calendar, chat"
 tags:
-  - openenv
-  - hr
-  - human-resources
-  - recruiting
-  - hrms
-  - agent-evaluation
-  - simlab
-  - reinforcement-learning
-  - rl-environment
-  - ai-agent
-  - tool-use
-  - enterprise
-  - multi-tool
-  - gymnasium
-  - collinear
 pinned: true
 license: apache-2.0
 ---
-# SimLab HR — AI Recruiting & People Management Agent Environment
-A fully-functional HR simulation for training, evaluating, and benchmarking AI recruiting and people management agents. Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).
 Your agent gets a task ("schedule a phone screen", "approve a leave request", "onboard a new hire") and a real workplace with an HRMS, email, calendar, and team chat. Can it get the job done?
-## 4 Tool Servers, 1 Environment
-| Server | Port | What it does |
 |---|---|---|
 | **HRMS** (Frappe) | 8030 | Employee records, leave management, attendance, payroll |
 | **Email** (MailHog) | 8040 | Send and read emails, inbox management |
 | **Calendar** (Baikal/Chronos) | 8050 | Schedule meetings, check availability, manage events |
 | **RocketChat** | 8060 | Team messaging, channels, direct messages |
-Agents must reason across all four systems to complete real HR workflows — just like a human would.
 ## Quickstart
@@ -71,17 +85,33 @@ with client:
     ))
 ```
-## What's Inside
-**8 sample tasks** covering real HR workflows:
-| Difficulty | Example |
 |---|---|
 | Easy | Approve a leave request, update an employee's designation |
 | Medium | Schedule a phone screen + send confirmation, run an attendance report |
 | Hard | Multi-person panel interview scheduling, full new-hire onboarding flow |
-Every task requires the agent to coordinate across multiple tool servers — this is what makes it hard.
 ## Run Locally
@@ -119,7 +149,7 @@ With the API key, every `reset()` pulls a fresh task from Collinear's Scenario M
 ## Use with TRL / GRPOTrainer
-Compatible with Hugging Face TRL for RL fine-tuning:
 ```python
 from simlab_hr import HRAction
@@ -128,14 +158,26 @@ from simlab_hr.client import HREnv
 env = HREnv.from_hub("collinear/simlab-hr")
 with env:
     obs = env.reset()
-    # ... your training loop
 ```
 ## More Environments
-SimLab includes **5 enterprise simulation scenarios** with **14 tool servers**:
-| Scenario | Tools |
 |---|---|
 | **Human Resources** | HRMS, email, calendar, team chat ← *you are here* |
 | **Customer Service** | Helpdesk ticketing, team chat, email |

 colorTo: green
 sdk: docker
 app_port: 7860
+short_description: "MCP gym for benchmarking & training AI HR agents"
 tags:
+ - openenv
+ - hr
+ - human-resources
+ - recruiting
+ - hrms
+ - agent-evaluation
+ - agent-benchmark
+ - simlab
+ - reinforcement-learning
+ - rl-environment
+ - ai-agent
+ - tool-use
+ - function-calling
+ - enterprise
+ - multi-tool
+ - gymnasium
+ - gym
+ - benchmark
+ - mcp
+ - model-context-protocol
+ - reward-model
+ - verifier
+ - collinear
+ - langchain
+ - openai
+ - sandbox
+ - docker
 pinned: true
 license: apache-2.0
 ---
+# SimLab HR — MCP-Powered Gym for AI HR Agent Evaluation
+A fully-functional HR simulation gym for training, evaluating, and benchmarking AI recruiting and people management agents. Test your agent's tool-use and function-calling abilities across 4 MCP tool servers — HRMS, email, calendar, and team chat — with automated rubric-based evaluation.
+Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).
 Your agent gets a task ("schedule a phone screen", "approve a leave request", "onboard a new hire") and a real workplace with an HRMS, email, calendar, and team chat. Can it get the job done?
+## 4 MCP Tool Servers, 1 Environment
+| Tool Server | Port | What it does |
 |---|---|---|
 | **HRMS** (Frappe) | 8030 | Employee records, leave management, attendance, payroll |
 | **Email** (MailHog) | 8040 | Send and read emails, inbox management |
 | **Calendar** (Baikal/Chronos) | 8050 | Schedule meetings, check availability, manage events |
 | **RocketChat** | 8060 | Team messaging, channels, direct messages |
+Agents must reason and coordinate tool calls across all four MCP servers to complete real HR workflows — the kind of multi-step function calling that separates real tool-use from single-API benchmarks.
 ## Quickstart
     ))
 ```
+## Benchmark Tasks
+**8 sample tasks** covering real HR workflows across three difficulty levels:
+| Difficulty | Example Tasks |
 |---|---|
 | Easy | Approve a leave request, update an employee's designation |
 | Medium | Schedule a phone screen + send confirmation, run an attendance report |
 | Hard | Multi-person panel interview scheduling, full new-hire onboarding flow |
+Every task requires the agent to coordinate function calls across multiple MCP tool servers — this is what makes it hard.
+## Automated Evaluation
+SimLab HR includes a rubric-based LLM judge that evaluates agent performance after each episode:
+- **0.8–1.0**: All requirements fully met with clear evidence
+- **0.6–0.8**: Core requirements met with minor gaps (0.6 = PASS threshold)
+- **0.4–0.6**: Partial completion, significant gaps remain
+- **0.0–0.4**: Minimal or no meaningful progress
+Configure the verifier model:
+```bash
+export VERIFIER_MODEL="gpt-4o"
+export VERIFIER_API_KEY="sk-..."
+```
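Editor's note (not part of the commit): the score bands added above can be sketched as a small lookup. This is a minimal illustration only. `classify_score` and `Verdict` are hypothetical names and are not part of `simlab_hr`. The README lists bands that share endpoints (0.6, 0.8), so this sketch assumes each band includes its lower bound, which agrees with the stated rule that 0.6 is the PASS threshold.

```python
from typing import NamedTuple

# PASS threshold as stated in the rubric (0.6 = PASS).
PASS_THRESHOLD = 0.6


class Verdict(NamedTuple):
    """Hypothetical result type: rubric band description plus pass/fail."""
    band: str
    passed: bool


def classify_score(score: float) -> Verdict:
    """Map a judge score in [0.0, 1.0] to the rubric band it falls in.

    Assumption: each band includes its lower bound, so 0.8 lands in the
    top band and 0.6 lands in the "core requirements met" band.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score}")

    if score >= 0.8:
        band = "All requirements fully met with clear evidence"
    elif score >= 0.6:
        band = "Core requirements met with minor gaps"
    elif score >= 0.4:
        band = "Partial completion, significant gaps remain"
    else:
        band = "Minimal or no meaningful progress"

    return Verdict(band=band, passed=score >= PASS_THRESHOLD)


if __name__ == "__main__":
    for s in (0.95, 0.6, 0.45, 0.1):
        verdict = classify_score(s)
        print(f"{s:.2f} -> {verdict.band} (passed={verdict.passed})")
```

A helper like this could turn the judge's float into a pass/fail signal for reporting. How the judge actually returns its score is not shown in this diff.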
 ## Run Locally
 ## Use with TRL / GRPOTrainer
+Compatible with Hugging Face TRL for reinforcement learning fine-tuning:
 ```python
 from simlab_hr import HRAction
 env = HREnv.from_hub("collinear/simlab-hr")
 with env:
     obs = env.reset()
+    # ... your RL training loop
 ```
+## How SimLab HR Compares
+| | SimLab HR | EnterpriseOps-Gym | Toolathlon-GYM | tau-bench |
+|---|---|---|---|---|
+| HR-specific tasks | ✅ Dedicated | ✅ 1 of 8 domains | Partial (analytics data) | ❌ |
+| Real backing services | ✅ Frappe, MailHog, CalDAV, RocketChat | ❌ Mock APIs | ✅ PostgreSQL | ❌ Simulated |
+| MCP tool servers | ✅ 4 servers | ❌ REST APIs | ✅ 25 servers | ❌ |
+| Automated evaluation | ✅ Rubric judge | ✅ Expert-curated | ✅ Groundtruth scripts | ✅ Policy checks |
+| RL / Gymnasium support | ✅ OpenEnv | ❌ | ❌ | ❌ |
+| Custom MCP server plugging | ✅ via SimLab CLI | ❌ | ❌ | ❌ |
+| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ |
 ## More Environments
+SimLab includes **5 enterprise simulation scenarios** with **14 MCP tool servers**:
+| Scenario | MCP Tool Servers |
 |---|---|
 | **Human Resources** | HRMS, email, calendar, team chat ← *you are here* |
 | **Customer Service** | Helpdesk ticketing, team chat, email |