---
sidebar_position: 5
title: Environments, Benchmarks & Data Generation
description: >-
  Building RL training environments, running evaluation benchmarks, and
  generating SFT data with the Hermes-Agent Atropos integration
---

# Environments, Benchmarks & Data Generation
Hermes Agent includes a full environment framework that connects its tool-calling capabilities to the Atropos RL training framework. This enables three workflows:
- **RL Training**: Train language models on multi-turn agentic tasks with GRPO
- **Benchmarks**: Evaluate models on standardised agentic benchmarks
- **Data Generation**: Generate SFT training data from agent rollouts
All three share the same core: an environment class that defines tasks, runs an agent loop, and scores the output.
:::info Repo environments vs RL training tools
The Python environment framework documented here lives under the repo's `environments/` directory and is the implementation-level API for the Hermes/Atropos integration. This is separate from the user-facing `rl_*` tools, which operate as an orchestration surface for remote RL training workflows.
:::
:::tip Quick Links
- Want to run benchmarks? Jump to Available Benchmarks
- Want to train with RL? See RL Training Tools for the agent-driven interface, or Running Environments for manual execution
- Want to create a new environment? See Creating Environments
:::
## Architecture
The environment system is built on a three-layer inheritance chain:
```mermaid
classDiagram
    class BaseEnv {
        Server management
        Worker scheduling
        Wandb logging
        CLI: serve / process / evaluate
    }
    class HermesAgentBaseEnv {
        Terminal backend configuration
        Tool resolution
        Agent loop engine
        ToolContext access
    }
    class TerminalTestEnv {
        Stack testing
    }
    class HermesSweEnv {
        SWE training
    }
    class TerminalBench2EvalEnv {
        Benchmark evaluation
    }
    class TBLiteEvalEnv {
        Fast benchmark
    }
    class YCBenchEvalEnv {
        Long-horizon benchmark
    }
    BaseEnv <|-- HermesAgentBaseEnv
    HermesAgentBaseEnv <|-- TerminalTestEnv
    HermesAgentBaseEnv <|-- HermesSweEnv
    HermesAgentBaseEnv <|-- TerminalBench2EvalEnv
    TerminalBench2EvalEnv <|-- TBLiteEvalEnv
    TerminalBench2EvalEnv <|-- YCBenchEvalEnv
```
### BaseEnv (Atropos)

The foundation from `atroposlib`. Provides:

- **Server management**: connects to OpenAI-compatible APIs (VLLM, SGLang, OpenRouter)
- **Worker scheduling**: parallel rollout coordination
- **Wandb integration**: metrics logging and rollout visualisation
- **CLI interface**: three subcommands, `serve`, `process`, `evaluate`
- **Eval logging**: `evaluate_log()` saves results to JSON + JSONL
### HermesAgentBaseEnv

The hermes-agent layer (`environments/hermes_base_env.py`). Adds:

- **Terminal backend configuration**: sets `TERMINAL_ENV` for sandboxed execution (local, Docker, Modal, Daytona, SSH, Singularity)
- **Tool resolution**: `_resolve_tools_for_group()` calls hermes-agent's `get_tool_definitions()` to get the right tool schemas based on enabled/disabled toolsets
- **Agent loop integration**: `collect_trajectory()` runs `HermesAgentLoop` and scores the result
- **Two-phase operation**: Phase 1 (OpenAI server) for eval/SFT, Phase 2 (VLLM ManagedServer) for full RL with logprobs
- **Async safety patches**: monkey-patches the Modal backend to work inside Atropos's event loop
### Concrete Environments

Your environment inherits from `HermesAgentBaseEnv` and implements five methods:

| Method | Purpose |
|---|---|
| `setup()` | Load dataset, initialise state |
| `get_next_item()` | Return the next item for rollout |
| `format_prompt(item)` | Convert an item into the user message |
| `compute_reward(item, result, ctx)` | Score the rollout (0.0–1.0) |
| `evaluate()` | Periodic evaluation logic |
## Core Components

### Agent Loop

`HermesAgentLoop` (`environments/agent_loop.py`) is the reusable multi-turn agent engine. It runs the same tool-calling pattern as hermes-agent's main loop:

1. Send messages + tool schemas to the API via `server.chat_completion()`
2. If the response contains `tool_calls`, dispatch each via `handle_function_call()`
3. Append tool results to the conversation and go back to step 1
4. If there are no `tool_calls`, the agent is done

Tool calls execute in a thread pool (`ThreadPoolExecutor(128)`) so that async backends (Modal, Docker) don't deadlock inside Atropos's event loop.
Returns an `AgentResult`:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class AgentResult:
    messages: List[Dict[str, Any]]            # Full conversation history
    turns_used: int                           # Number of LLM calls made
    finished_naturally: bool                  # True if model stopped on its own
    reasoning_per_turn: List[Optional[str]]   # Extracted reasoning content
    tool_errors: List["ToolError"]            # Errors encountered during tool dispatch
    managed_state: Optional[Dict]             # VLLM ManagedServer state (Phase 2)
```
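The four numbered steps above can be sketched as a plain loop. This is a simplified stand-in, not the real engine: `call_model` and `dispatch_tool` are hypothetical placeholders for `server.chat_completion()` and `handle_function_call()`, and `environments/agent_loop.py` additionally handles parsers, reasoning extraction, and ManagedServer state.

```python
# Simplified sketch of the four-step agent loop described above.
# `call_model` and `dispatch_tool` are hypothetical stand-ins for
# server.chat_completion() and handle_function_call().
def run_agent_loop(call_model, dispatch_tool, user_msg, max_turns=30):
    messages = [{"role": "user", "content": user_msg}]
    turns = 0
    finished_naturally = False
    while turns < max_turns:
        reply = call_model(messages)            # step 1: LLM call
        messages.append(reply)
        turns += 1
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:                      # step 4: no tool calls -> done
            finished_naturally = True
            break
        for call in tool_calls:                 # step 2: dispatch each call
            result = dispatch_tool(call["name"], call["arguments"])
            # step 3: append the tool result, then loop back to step 1
            messages.append({"role": "tool", "content": str(result)})
    return messages, turns, finished_naturally
```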
### Tool Context

`ToolContext` (`environments/tool_context.py`) gives reward functions direct access to the same sandbox the model used during its rollout. Because state is scoped by `task_id`, everything the model created (files, processes, browser tabs) is preserved.
```python
async def compute_reward(self, item, result, ctx: ToolContext):
    # Run tests in the model's terminal sandbox
    test = ctx.terminal("pytest -v")
    if test["exit_code"] == 0:
        return 1.0

    # Check if a file was created
    content = ctx.read_file("/workspace/solution.py")
    if content.get("content"):
        return 0.5

    # Download files for local verification
    ctx.download_file("/remote/output.bin", "/local/output.bin")
    return 0.0
```
Available methods:
| Category | Methods |
|---|---|
| Terminal | `terminal(command, timeout)` |
| Files | `read_file(path)`, `write_file(path, content)`, `search(query, path)` |
| Transfers | `upload_file()`, `upload_dir()`, `download_file()`, `download_dir()` |
| Web | `web_search(query)`, `web_extract(urls)` |
| Browser | `browser_navigate(url)`, `browser_snapshot()` |
| Generic | `call_tool(name, args)`: escape hatch for any hermes-agent tool |
| Cleanup | `cleanup()`: release all resources |
### Tool Call Parsers

For Phase 2 (VLLM ManagedServer), the server returns raw text without structured tool calls. Client-side parsers in `environments/tool_call_parsers/` extract `tool_calls` from the raw output:

```python
from environments.tool_call_parsers import get_parser

parser = get_parser("hermes")  # or "mistral", "llama3_json", "qwen", "deepseek_v3", etc.
content, tool_calls = parser.parse(raw_model_output)
```

Available parsers: `hermes`, `mistral`, `llama3_json`, `qwen`, `qwen3_coder`, `deepseek_v3`, `deepseek_v3_1`, `kimi_k2`, `longcat`, `glm45`, `glm47`.

In Phase 1 (OpenAI server type), parsers are not needed; the server handles tool call parsing natively.
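As an illustration of what such a parser does (a simplified sketch, not the actual implementation in `tool_call_parsers/`), here is a minimal extractor for the Hermes format, assuming tool calls are emitted as JSON objects wrapped in `<tool_call>` tags:

```python
import json
import re

# Minimal illustration of Phase 2 parsing for the Hermes <tool_call> format.
# The real parsers in environments/tool_call_parsers/ handle more edge cases
# (malformed JSON, reasoning tags, model-specific quirks).
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_hermes(raw: str):
    tool_calls = [json.loads(m) for m in TOOL_CALL_RE.findall(raw)]
    content = TOOL_CALL_RE.sub("", raw).strip()
    return content, tool_calls

raw = ('Checking the repo.\n'
       '<tool_call>{"name": "terminal", "arguments": {"command": "git status"}}</tool_call>')
content, calls = parse_hermes(raw)
# content == "Checking the repo.", calls[0]["name"] == "terminal"
```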
## Available Benchmarks

### TerminalBench2

89 challenging terminal tasks with per-task Docker sandbox environments.

|  |  |
|---|---|
| What it tests | Single-task coding/sysadmin ability |
| Scoring | Binary pass/fail (test suite verification) |
| Sandbox | Modal cloud sandboxes (per-task Docker images) |
| Tools | terminal + file |
| Tasks | 89 tasks across multiple categories |
| Cost | ~$50–200 for full eval (parallel execution) |
| Time | ~2–4 hours |
```bash
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
  --config environments/benchmarks/terminalbench_2/default.yaml

# Run specific tasks
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
  --config environments/benchmarks/terminalbench_2/default.yaml \
  --env.task_filter fix-git,git-multibranch
```
Dataset: `NousResearch/terminal-bench-2` on HuggingFace.
### TBLite (OpenThoughts Terminal Bench Lite)

100 difficulty-calibrated tasks that serve as a faster proxy for TerminalBench2.

|  |  |
|---|---|
| What it tests | Same as TB2 (coding/sysadmin), calibrated difficulty tiers |
| Scoring | Binary pass/fail |
| Sandbox | Modal cloud sandboxes |
| Tools | terminal + file |
| Tasks | 100 tasks: Easy (40), Medium (26), Hard (26), Extreme (8) |
| Correlation | r = 0.911 with full TB2 |
| Speed | 2.6–8× faster than TB2 |
```bash
python environments/benchmarks/tblite/tblite_env.py evaluate \
  --config environments/benchmarks/tblite/default.yaml
```
TBLite is a thin subclass of TerminalBench2; only the dataset and timeouts differ. Created by the OpenThoughts Agent team (Snorkel AI + Bespoke Labs). Dataset: `NousResearch/openthoughts-tblite`.
### YC-Bench

A long-horizon strategic benchmark: the agent plays CEO of an AI startup.

|  |  |
|---|---|
| What it tests | Multi-turn strategic coherence over hundreds of turns |
| Scoring | Composite: 0.5 × survival + 0.5 × normalised_funds |
| Sandbox | Local terminal (no Modal needed) |
| Tools | terminal only |
| Runs | 9 by default (3 presets × 3 seeds), sequential |
| Cost | ~$50–200 for full eval |
| Time | ~3–6 hours |
```bash
# Install yc-bench (optional dependency)
pip install "hermes-agent[yc-bench]"

# Run evaluation
bash environments/benchmarks/yc_bench/run_eval.sh

# Or directly
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
  --config environments/benchmarks/yc_bench/default.yaml

# Quick single-preset test
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
  --config environments/benchmarks/yc_bench/default.yaml \
  --env.presets '["fast_test"]' --env.seeds '[1]'
```
YC-Bench uses `collinear-ai/yc-bench`, a deterministic simulation with four skill domains (research, inference, data_environment, training), a prestige system, employee management, and financial pressure. Unlike TB2's per-task binary scoring, YC-Bench measures whether an agent can maintain a coherent strategy over hundreds of compounding decisions.
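The composite score in the table above is a simple weighted sum. A sketch follows; the exact way funds are normalised is an assumption here (see `yc_bench_env.py` for the real computation):

```python
# Illustration of the composite score: 0.5 * survival + 0.5 * normalised_funds.
# The funds normalisation (clamp to [0, 1] against a cap) is an assumption.
def yc_score(survived: bool, funds: float, funds_cap: float) -> float:
    normalised = max(0.0, min(funds / funds_cap, 1.0))
    return 0.5 * float(survived) + 0.5 * normalised
```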
## Training Environments

### TerminalTestEnv
A minimal self-contained environment with inline tasks (no external dataset). Used for validating the full stack end-to-end. Each task asks the model to create a file at a known path; the verifier checks the content.
```bash
# Process mode (saves rollouts to JSONL, no training server needed)
python environments/terminal_test_env/terminal_test_env.py process \
  --env.data_path_to_save_groups terminal_test_output.jsonl

# Serve mode (connects to Atropos API for RL training)
python environments/terminal_test_env/terminal_test_env.py serve
```
### HermesSweEnv
SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.
```bash
python environments/hermes_swe_env/hermes_swe_env.py serve \
  --openai.model_name YourModel \
  --env.dataset_name bigcode/humanevalpack \
  --env.terminal_backend modal
```
## Running Environments

Every environment is a standalone Python script with three CLI subcommands.

### `evaluate`: Run a benchmark

For eval-only environments (benchmarks). Runs all items, computes metrics, and logs to wandb.
```bash
python environments/benchmarks/tblite/tblite_env.py evaluate \
  --config environments/benchmarks/tblite/default.yaml \
  --openai.model_name anthropic/claude-sonnet-4.6
```
No training server or `run-api` needed. The environment handles everything.
### `process`: Generate SFT data
Runs rollouts and saves scored trajectories to JSONL. Useful for generating training data without a full RL loop.
```bash
python environments/terminal_test_env/terminal_test_env.py process \
  --env.data_path_to_save_groups output.jsonl \
  --openai.model_name anthropic/claude-sonnet-4.6
```
Output format: each line is a scored trajectory with the full conversation history, reward, and metadata.
### `serve`: Connect to Atropos for RL training

Connects the environment to a running Atropos API server (`run-api`). Used during live RL training.
```bash
# Terminal 1: Start the Atropos API
run-api

# Terminal 2: Start the environment
python environments/hermes_swe_env/hermes_swe_env.py serve \
  --openai.model_name YourModel
```
The environment receives items from Atropos, runs agent rollouts, computes rewards, and sends scored trajectories back for training.
## Two-Phase Operation

### Phase 1: OpenAI Server (Eval / SFT)

Uses `server.chat_completion()` with the `tools=` parameter. The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing natively and returns `ChatCompletion` objects with structured `tool_calls`.

- Use for: evaluation, SFT data generation, benchmarks, testing
- Placeholder tokens are created for the Atropos pipeline (since real token IDs aren't available from the OpenAI API)

### Phase 2: VLLM ManagedServer (Full RL)

Uses `ManagedServer` for exact token IDs + logprobs via `/generate`. A client-side tool call parser reconstructs structured `tool_calls` from the raw output.

- Use for: full RL training with GRPO/PPO
- Real tokens, masks, and logprobs flow through the pipeline
- Set `tool_call_parser` in the config to match your model's format (e.g., `"hermes"`, `"qwen"`, `"mistral"`)
## Creating Environments

### Training Environment

```python
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from atroposlib.envs.server_handling.server_manager import APIServerConfig


class MyEnvConfig(HermesAgentEnvConfig):
    my_custom_field: str = "default_value"


class MyEnv(HermesAgentBaseEnv):
    name = "my-env"
    env_config_cls = MyEnvConfig

    @classmethod
    def config_init(cls):
        env_config = MyEnvConfig(
            enabled_toolsets=["terminal", "file"],
            terminal_backend="modal",
            max_agent_turns=30,
        )
        server_configs = [APIServerConfig(
            base_url="https://openrouter.ai/api/v1",
            model_name="anthropic/claude-sonnet-4.6",
            server_type="openai",
        )]
        return env_config, server_configs

    async def setup(self):
        from datasets import load_dataset
        self.dataset = list(load_dataset("my-dataset", split="train"))
        self.iter = 0

    async def get_next_item(self):
        item = self.dataset[self.iter % len(self.dataset)]
        self.iter += 1
        return item

    def format_prompt(self, item):
        return item["instruction"]

    async def compute_reward(self, item, result, ctx):
        # ctx gives full tool access to the rollout's sandbox
        test = ctx.terminal("pytest -v")
        return 1.0 if test["exit_code"] == 0 else 0.0

    async def evaluate(self, *args, **kwargs):
        # Periodic evaluation during training
        pass


if __name__ == "__main__":
    MyEnv.cli()
```
### Eval-Only Benchmark

For benchmarks, follow the pattern used by TerminalBench2, TBLite, and YC-Bench:

1. Create under `environments/benchmarks/your-benchmark/`
2. Set eval-only config: `eval_handling=STOP_TRAIN`, `steps_per_eval=1`, `total_steps=1`
3. Stub training methods: `collect_trajectories()` returns `(None, [])`, `score()` returns `None`
4. Implement `rollout_and_score_eval(eval_item)`: the per-item agent loop + scoring
5. Implement `evaluate()`: orchestrates all runs, computes aggregate metrics
6. Add streaming JSONL for crash-safe result persistence
7. Add cleanup: `KeyboardInterrupt` handling, `cleanup_all_environments()`, `_tool_executor.shutdown()`
8. Run with the `evaluate` subcommand
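The stubbing pattern in the steps above can be sketched as follows. This snippet is self-contained for illustration: `HermesAgentBaseEnv` is replaced by a stand-in, and the eval logic is a hypothetical placeholder.

```python
import asyncio

class HermesAgentBaseEnv:  # stand-in for the real base class, so this runs alone
    pass

class MyBenchmarkEnv(HermesAgentBaseEnv):
    async def collect_trajectories(self, item):
        return None, []            # eval-only: never emit training data

    async def score(self, rollout_group_data):
        return None                # eval-only: nothing to score for training

    async def rollout_and_score_eval(self, eval_item):
        # Real implementations run the agent loop on one benchmark item
        # and compute a score; this placeholder just returns a record.
        return {"item": eval_item, "score": 0.0}

env = MyBenchmarkEnv()
trajs, items = asyncio.run(env.collect_trajectories({"task": "demo"}))
```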
See `environments/benchmarks/yc_bench/yc_bench_env.py` for a clean, well-documented reference implementation.
## Configuration Reference

### `HermesAgentEnvConfig` Fields

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled_toolsets` | `List[str]` | `None` (all) | Which hermes toolsets to enable |
| `disabled_toolsets` | `List[str]` | `None` | Toolsets to filter out |
| `distribution` | `str` | `None` | Probabilistic toolset distribution name |
| `max_agent_turns` | `int` | `30` | Max LLM calls per rollout |
| `agent_temperature` | `float` | `1.0` | Sampling temperature |
| `system_prompt` | `str` | `None` | System message for the agent |
| `terminal_backend` | `str` | `"local"` | `local`, `docker`, `modal`, `daytona`, `ssh`, `singularity` |
| `terminal_timeout` | `int` | `120` | Seconds per terminal command |
| `terminal_lifetime` | `int` | `3600` | Max sandbox lifetime in seconds |
| `dataset_name` | `str` | `None` | HuggingFace dataset identifier |
| `tool_pool_size` | `int` | `128` | Thread pool size for tool execution |
| `tool_call_parser` | `str` | `"hermes"` | Parser for Phase 2 raw output |
| `extra_body` | `Dict` | `None` | Extra params for the OpenAI API (e.g., OpenRouter provider prefs) |
| `eval_handling` | `Enum` | `STOP_TRAIN` | `STOP_TRAIN`, `LIMIT_TRAIN`, `NONE` |
### YAML Configuration

Environments can be configured via YAML files passed with `--config`:

```yaml
env:
  enabled_toolsets: ["terminal", "file"]
  max_agent_turns: 60
  max_token_length: 32000
  agent_temperature: 0.8
  terminal_backend: "modal"
  terminal_timeout: 300
  dataset_name: "NousResearch/terminal-bench-2"
  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
  use_wandb: true
  wandb_name: "my-benchmark"

openai:
  base_url: "https://openrouter.ai/api/v1"
  model_name: "anthropic/claude-sonnet-4.6"
  server_type: "openai"
  health_check: false
```
YAML values override `config_init()` defaults. CLI arguments override YAML values:

```bash
python my_env.py evaluate \
  --config my_config.yaml \
  --openai.model_name anthropic/claude-opus-4.6  # overrides YAML
```
## Prerequisites

### For all environments

- Python >= 3.11
- `atroposlib`: `pip install git+https://github.com/NousResearch/atropos.git`
- An LLM API key (OpenRouter, OpenAI, or self-hosted VLLM/SGLang)

### For Modal-sandboxed benchmarks (TB2, TBLite)

- Modal account and CLI: `pip install "hermes-agent[modal]"`
- `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` environment variables

### For YC-Bench

- `pip install "hermes-agent[yc-bench]"` (installs the yc-bench CLI + SQLAlchemy)
- No Modal needed; runs with the local terminal backend

### For RL training

- `TINKER_API_KEY`: API key for the Tinker training service
- `WANDB_API_KEY`: for Weights & Biases metrics tracking
- The `tinker-atropos` submodule (at `tinker-atropos/` in the repo)
See RL Training for the agent-driven RL workflow.
## Directory Structure

```text
environments/
├── hermes_base_env.py        # Abstract base class (HermesAgentBaseEnv)
├── agent_loop.py             # Multi-turn agent engine (HermesAgentLoop)
├── tool_context.py           # Per-rollout tool access for reward functions
├── patches.py                # Async-safety patches for Modal backend
│
├── tool_call_parsers/        # Phase 2 client-side parsers
│   ├── hermes_parser.py      # Hermes/ChatML <tool_call> format
│   ├── mistral_parser.py     # Mistral [TOOL_CALLS] format
│   ├── llama_parser.py       # Llama 3 JSON tool calling
│   ├── qwen_parser.py        # Qwen format
│   ├── deepseek_v3_parser.py # DeepSeek V3 format
│   └── ...                   # + kimi_k2, longcat, glm45/47, etc.
│
├── terminal_test_env/        # Stack validation (inline tasks)
├── hermes_swe_env/           # SWE-bench training environment
│
└── benchmarks/               # Evaluation benchmarks
    ├── terminalbench_2/      # 89 terminal tasks, Modal sandboxes
    ├── tblite/               # 100 calibrated tasks (fast TB2 proxy)
    └── yc_bench/             # Long-horizon strategic benchmark
```