---
title: WorkflowArena
emoji: 🏗️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
models:
- qwen/qwen3.5-9b
app_port: 8000
base_path: /
tags:
- openenv
- workflow-orchestration
- reinforcement-learning
---
# WorkflowArena
WorkflowArena is an OpenEnv benchmark for scheduling dependency-constrained work on limited workers.
Each episode is a seeded workflow DAG. The agent must decide when to dispatch ready tasks, when to wait,
and how to trade off deadline pressure, worker utilization, critical-path protection, and unfinished work.
## Problem
This environment models a common orchestration problem:
- tasks have dependencies, so not everything can start immediately
- workers are limited, so not every ready task can run at once
- deadlines and priorities are uneven, so the obvious greedy move is not always best
- higher difficulties add time pressure and failure dynamics
The action space is intentionally small:
1. `dispatch(task_ids=[...])`
2. `wait()`
That keeps the challenge focused on decision quality rather than action syntax.
## Episode Loop
1. `reset()` generates a deterministic episode from `preset`, `seed`, and `worker_count`.
2. The observation exposes ready, running, blocked, and completed tasks plus planner hints.
3. The agent either dispatches a legal batch of ready tasks or waits for the next completion event.
4. Time advances only on `wait()`.
5. The episode ends when:
   - all tasks complete, or
   - the preset time budget is exhausted, or
   - the safety step limit is hit
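
The loop above can be sketched with a toy simulator. This is a minimal illustration, not the environment's transition code: `run_toy_episode`, `durations`, `deps`, and `policy` are hypothetical names, and the sketch ignores time budgets, deadlines, and failures. It only shows that dispatching leaves the clock unchanged while waiting jumps to the next completion event.

```python
import heapq

def run_toy_episode(durations, deps, workers, policy, max_steps=100):
    """Toy dispatch/wait loop (hypothetical, not the real WorkflowArena code)."""
    now = 0
    running = []          # heap of (finish_time, task_id)
    started, done = set(), set()
    for _ in range(max_steps):            # plays the role of a safety step limit
        if len(done) == len(durations):
            break                         # all tasks complete
        ready = sorted(
            t for t in durations
            if t not in started and all(d in done for d in deps.get(t, []))
        )
        free = workers - len(running)
        batch = policy(ready, free)       # assumed to return a legal batch
        if batch:
            # dispatch: tasks start now; the clock does not move
            for t in batch:
                started.add(t)
                heapq.heappush(running, (now + durations[t], t))
        elif running:
            # wait: advance time to the next completion event
            now, finished = heapq.heappop(running)
            done.add(finished)
            # toy choice: tasks finishing at the same instant complete together
            while running and running[0][0] == now:
                done.add(heapq.heappop(running)[1])
        else:
            break                         # nothing ready, nothing running
    return now, sorted(done)

# Greedy policy: fill every free worker with the first ready tasks.
greedy = lambda ready, free: ready[:free]
print(run_toy_episode({"a": 2, "b": 3, "c": 1}, {"c": ["a", "b"]}, 2, greedy))
# prints (4, ['a', 'b', 'c'])
```

With a single worker the same workflow finishes at time 6, since `a`, `b`, and `c` must run back to back.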
## Difficulty Presets
### `easy`
- smaller DAGs
- softer deadlines
- no fixed time budget
- no failure events
This is the baseline teaching mode. Good play mostly means keeping workers busy and avoiding obviously bad waits.
### `medium`
- larger DAGs
- tighter deadlines
- fixed episode time budget
- terminal penalty for unfinished work
This is where the environment becomes a real tradeoff problem. The agent may not be able to finish everything,
so it must decide what is worth finishing before time runs out.
### `hard`
- denser DAGs
- tighter deadlines
- tighter time budget than `medium`
- temporary worker outages
- task retry failures
In hard mode, usable capacity can shrink temporarily, and a task may fail at completion and return to the ready queue.
## Rewards
WorkflowArena uses shaped rewards so local decisions have immediate feedback, while terminal scoring still matters.
### Per-step reward channels
The observation exposes `last_reward_breakdown` with these channels:
- `completion_reward`: reward for tasks that finished on the latest `wait()`
- `utilization_reward`: reward for keeping workers occupied
- `deadline_reward`: positive for on-time completion, negative for lateness
- `criticality_reward`: reward for progress on high-impact work
- `idle_penalty`: penalty for avoidable waiting or leaving useful capacity idle
- `invalid_action_penalty`: penalty for malformed or infeasible actions
- `terminal_makespan_score`: terminal efficiency score at episode end
- `unfinished_task_penalty`: terminal penalty for incomplete work when the episode ends before all tasks finish
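
When debugging an agent, it can help to see which channel dominated a step. The sketch below assumes `last_reward_breakdown` can be read as a mapping from channel name to a number and that penalties are stored as negative values; neither detail is confirmed here, and `summarize_breakdown` is a hypothetical helper. The sum is for inspection only and may not equal the reward the environment actually returns.

```python
# Channel names as listed above.
CHANNELS = (
    "completion_reward",
    "utilization_reward",
    "deadline_reward",
    "criticality_reward",
    "idle_penalty",
    "invalid_action_penalty",
    "terminal_makespan_score",
    "unfinished_task_penalty",
)

def summarize_breakdown(breakdown):
    """Return (sum of channels, name of the channel with the largest magnitude)."""
    # Missing channels are treated as 0.0 (an assumption of this sketch).
    values = {name: float(breakdown.get(name, 0.0)) for name in CHANNELS}
    total = sum(values.values())
    dominant = max(values, key=lambda name: abs(values[name]))
    return total, dominant

total, dominant = summarize_breakdown(
    {"completion_reward": 1.0, "utilization_reward": 0.25, "idle_penalty": -0.5}
)
print(total, dominant)
```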
### Reward design intent
The reward is set up to encourage:
- filling worker capacity when good work is available
- respecting deadlines
- protecting high-priority and critical-path tasks
- avoiding pointless waits
- finishing as much important work as possible before the time budget expires
The terminal score is bounded and deterministic. Higher values correspond to stronger schedules.
## Failures and Constraints
The environment keeps the action space fixed, but higher presets change the transition dynamics.
### Capacity constraint
- `dispatch(task_ids=[...])` cannot exceed current free capacity
- only tasks in `ready_tasks` are legal to dispatch
### Hard-mode worker outages
- a temporary outage can reduce usable workers
- `total_workers` stays constant
- `effective_workers` reflects usable workers after degradation
- `free_workers` is computed from `effective_workers`, not from the original total
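
The relationship between these fields can be illustrated with a small calculation. The exact formula is an assumption: this sketch takes `effective_workers` to be `total_workers` minus `degraded_workers`, and `free_workers` to be the effective count minus the number of running tasks. `capacity` is a hypothetical helper, not part of the environment.

```python
def capacity(total_workers, degraded_workers, running_count):
    """Return (effective_workers, free_workers) under the assumed formula."""
    # Assumption: degradation subtracts directly from the total.
    effective = max(0, total_workers - degraded_workers)
    # Free capacity is measured against effective workers, not the total.
    free = max(0, effective - running_count)
    return effective, free

# 4 workers, 1 degraded, 2 tasks running: one slot is left, not two.
print(capacity(4, 1, 2))  # prints (3, 1)
```

The practical consequence is that an agent should size its dispatch batch from `free_workers` rather than from `total_workers` minus the running count.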
### Hard-mode retry failures
- a running task may fail at completion
- it consumes time but does not complete
- it returns to `ready_tasks`
- `attempt_count` shows how many retry failures that task has already consumed
## Observation Contract
The main observation type is [`WorkflowArenaObservation`](workflow_arena/models.py).
Important fields include:
- `current_time`
- `total_workers`
- `effective_workers`
- `degraded_workers`
- `free_workers`
- `time_budget`
- `time_remaining`
- `progress`
- `ready_tasks`
- `running_tasks`
- `completed_tasks`
- `blocked_tasks`
- `recent_failure_events`
- `last_reward_breakdown`
- `success_metrics`
- `validation_error`
Each task view includes:
- `task_id`
- `duration`
- `priority`
- `deadline`
- `criticality`
- `slack`
- `downstream_count`
- `dependencies`
- `attempt_count`
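
As an illustration of how the task view fields might feed a policy, here is a minimal heuristic baseline. It assumes each ready task is available as a dict with numeric `slack` and `priority`, that lower slack means more urgent, and that higher priority means more important; these semantics are assumptions, and `pick_batch` is a hypothetical name.

```python
def pick_batch(ready_tasks, free_workers):
    """Choose the most urgent ready tasks, up to the free capacity."""
    ranked = sorted(
        ready_tasks,
        # least slack first, then highest priority, then id for determinism
        key=lambda t: (t["slack"], -t["priority"], t["task_id"]),
    )
    chosen = [t["task_id"] for t in ranked[: max(0, free_workers)]]
    if chosen:
        return {"action_type": "dispatch", "task_ids": chosen}
    # Nothing to start (no ready tasks or no free workers): wait.
    return {"action_type": "wait", "task_ids": []}

ready = [
    {"task_id": "task_01", "slack": 5, "priority": 1},
    {"task_id": "task_02", "slack": 0, "priority": 2},
    {"task_id": "task_03", "slack": 0, "priority": 3},
]
print(pick_batch(ready, 2))
# prints {'action_type': 'dispatch', 'task_ids': ['task_03', 'task_02']}
```

A rule like this is only a baseline. It never waits deliberately and does not use `criticality` or `downstream_count`, which is where a stronger agent could differentiate itself.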
## Expected Agent Output
Agents are expected to return compact JSON actions in one of these exact forms:
```json
{ "action_type": "wait", "task_ids": [] }
```
```json
{ "action_type": "dispatch", "task_ids": ["task_01", "task_02"] }
```
Rules:
- dispatch only task ids that appear in `ready_tasks`
- do not exceed `free_workers`
- do not send duplicate ids
- `wait()` must use an empty `task_ids` list
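
These rules are easy to check on the client side before sending an action. The following is a minimal sketch, not the environment's own validator: `validate_action` is a hypothetical helper, and `ready_ids` and `free_workers` are assumed to have been read from the latest observation.

```python
def validate_action(action, ready_ids, free_workers):
    """Return None if the action follows the rules above, else an error string."""
    kind = action.get("action_type")
    task_ids = action.get("task_ids")
    if kind not in ("wait", "dispatch"):
        return "unknown action_type"
    if not isinstance(task_ids, list):
        return "task_ids must be a list"
    if kind == "wait":
        return None if not task_ids else "wait must use an empty task_ids list"
    if len(set(task_ids)) != len(task_ids):
        return "duplicate task ids"
    if any(t not in ready_ids for t in task_ids):
        return "task id not in ready_tasks"
    if len(task_ids) > free_workers:
        return "batch exceeds free_workers"
    return None

ready_ids = {"task_01", "task_02"}
print(validate_action({"action_type": "dispatch", "task_ids": ["task_01", "task_02"]}, ready_ids, 2))
# prints None
print(validate_action({"action_type": "dispatch", "task_ids": ["task_03"]}, ready_ids, 2))
# prints task id not in ready_tasks
```

Catching an illegal action locally lets an agent retry or fall back to `wait` instead of paying `invalid_action_penalty`.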
## Success Metrics
The environment reports schedule quality through `success_metrics`:
- `makespan`
- `worker_utilization`
- `deadline_miss_count`
- `unfinished_task_count`
- `weighted_priority_completion`
- `benchmark_score`
Interpretation:
- higher `benchmark_score` is better
- lower `deadline_miss_count` is better
- lower `unfinished_task_count` is better
- `makespan` is only populated when all tasks have completed
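
One way to apply this interpretation when comparing runs is a sort key. The sketch below assumes `success_metrics` is available as a dict and that an unpopulated `makespan` is `None`; the tie-break order after `benchmark_score` is a choice made for illustration, and both function names are hypothetical.

```python
def run_sort_key(metrics):
    """Sort key that places the strongest run first."""
    return (
        -metrics["benchmark_score"],       # higher is better
        metrics["deadline_miss_count"],    # lower is better
        metrics["unfinished_task_count"],  # lower is better
    )

def finished_everything(metrics):
    """True when no work is left and a makespan was reported."""
    return metrics["unfinished_task_count"] == 0 and metrics.get("makespan") is not None

runs = [
    {"benchmark_score": 0.5, "deadline_miss_count": 3, "unfinished_task_count": 2, "makespan": None},
    {"benchmark_score": 0.75, "deadline_miss_count": 1, "unfinished_task_count": 0, "makespan": 40},
]
best = sorted(runs, key=run_sort_key)[0]
print(best["benchmark_score"], finished_everything(best))  # prints 0.75 True
```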
## Expected Outputs for Evaluation
For benchmark use, an agent should produce:
1. a legal JSON action at every step
2. a full episode rollout until termination
3. a final observation containing the terminal score and success metrics
Typical downstream evaluation reads:
- cumulative reward
- final `benchmark_score`
- whether the agent completed all tasks
- how many deadlines were missed
- how much important work remained unfinished
## Benchmarks
Results from a verified, self-contained inference run using `qwen/qwen3.5-9b`:
| Preset | Success | Steps | Score |
| -------- | ------- | ----- | ------- |
| `easy` | `true` | `11` | `0.952` |
| `medium` | `true` | `20` | `0.945` |
| `hard` | `true` | `45` | `0.652` |
## Local Development
Validate the environment:
```bash
.venv/bin/python -m openenv.cli.__main__ validate workflow_arena
```
Run the server locally:
```bash
cd workflow_arena
uv run --project . server
```