---
title: WorkflowArena
emoji: 🏗️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
models:
- qwen/qwen3.5-9b
app_port: 8000
base_path: /
tags:
- openenv
- workflow-orchestration
- reinforcement-learning
---
# WorkflowArena
WorkflowArena is an OpenEnv benchmark for scheduling dependency-constrained work on limited workers.
Each episode is a seeded workflow DAG. The agent must decide when to dispatch ready tasks, when to wait,
and how to trade off deadline pressure, worker utilization, critical-path protection, and unfinished work.
## Problem
This environment models a common orchestration problem:
- tasks have dependencies, so not everything can start immediately
- workers are limited, so not every ready task can run at once
- deadlines and priorities are uneven, so the obvious greedy move is not always best
- higher difficulties add time pressure and failure dynamics
The action space is intentionally small:
1. `dispatch(task_ids=[...])`
2. `wait()`
That keeps the challenge focused on decision quality rather than action syntax.
## Episode Loop
1. `reset()` generates a deterministic episode from `preset`, `seed`, and `worker_count`.
2. The observation exposes ready, running, blocked, and completed tasks plus planner hints.
3. The agent either dispatches a legal batch of ready tasks or waits for the next completion event.
4. Time advances only on `wait()`.
5. The episode ends when:
- all tasks complete, or
- the preset time budget is exhausted, or
- the safety step limit is hit
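The loop above can be sketched with a toy simulator. This is a minimal illustration under stated assumptions, not the WorkflowArena implementation: `run_toy_episode`, its arguments, and the greedy "dispatch whatever fits" policy are hypothetical, and the time budget and step limit are not modeled. It only shows that dispatching leaves the clock unchanged while waiting jumps to the next completion event.

```python
import heapq


def run_toy_episode(durations, dependencies, worker_count):
    """Toy scheduler loop; illustrative only, not the WorkflowArena implementation."""
    now = 0
    started, done = set(), set()
    running = []  # min-heap of (finish_time, task_id)

    while len(done) < len(durations):
        # A task is ready once all of its dependencies have completed.
        ready = sorted(
            task for task in durations
            if task not in started
            and all(dep in done for dep in dependencies.get(task, []))
        )
        free = worker_count - len(running)
        batch = ready[:free]

        if batch:
            # dispatch: workers become busy, the clock does not move
            for task in batch:
                heapq.heappush(running, (now + durations[task], task))
                started.add(task)
        else:
            if not running:
                raise ValueError("nothing ready and nothing running; check the DAG")
            # wait: jump to the next completion event
            now, task = heapq.heappop(running)
            done.add(task)
            # tasks finishing at the same instant complete together
            while running and running[0][0] == now:
                done.add(heapq.heappop(running)[1])

    return now
```

For example, `run_toy_episode({"a": 2, "b": 3, "c": 1}, {"c": ["a", "b"]}, worker_count=2)` returns `4`, while the same workflow with one worker returns `6`.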
## Difficulty Presets
### `easy`
- smaller DAGs
- softer deadlines
- no fixed time budget
- no failure events
This is the baseline teaching mode. Good play mostly means keeping workers busy and avoiding obviously bad waits.
### `medium`
- larger DAGs
- tighter deadlines
- fixed episode time budget
- terminal penalty for unfinished work
This is where the environment becomes a real tradeoff problem. The agent may not be able to finish everything,
so it must decide what is worth finishing before time runs out.
### `hard`
- denser DAGs
- tighter deadlines
- tighter time budget than `medium`
- temporary worker outages
- task retry failures
In hard mode, usable capacity can shrink temporarily, and a task may fail at completion and return to the ready queue.
## Rewards
WorkflowArena uses shaped rewards so local decisions have immediate feedback, while terminal scoring still matters.
### Per-step reward channels
The observation exposes `last_reward_breakdown` with these channels:
- `completion_reward`: reward for tasks that finished on the latest `wait()`
- `utilization_reward`: reward for keeping workers occupied
- `deadline_reward`: positive for on-time completion, negative for lateness
- `criticality_reward`: reward for progress on high-impact work
- `idle_penalty`: penalty for avoidable waiting or leaving useful capacity idle
- `invalid_action_penalty`: penalty for malformed or infeasible actions
- `terminal_makespan_score`: terminal efficiency score at episode end
- `unfinished_task_penalty`: terminal penalty for incomplete work when the episode ends before all tasks finish
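When inspecting a rollout, it can help to collapse the breakdown into a single number. The sketch below assumes that `last_reward_breakdown` behaves like a mapping from channel name to a signed number (penalties already negative) and that the channels are additive. Neither assumption is confirmed by this README, and `total_reward` is a hypothetical helper.

```python
# Channel names as listed above; the additive combination is an assumption.
REWARD_CHANNELS = (
    "completion_reward",
    "utilization_reward",
    "deadline_reward",
    "criticality_reward",
    "idle_penalty",
    "invalid_action_penalty",
    "terminal_makespan_score",
    "unfinished_task_penalty",
)


def total_reward(breakdown):
    """Sum the known channels, treating missing ones as zero."""
    return sum(float(breakdown.get(name, 0.0)) for name in REWARD_CHANNELS)
```

Comparing this sum against the reward the environment actually returns is a quick way to check whether the additive assumption holds.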
### Reward design intent
The reward is set up to encourage:
- filling worker capacity when good work is available
- respecting deadlines
- protecting high-priority and critical-path tasks
- avoiding pointless waits
- finishing as much important work as possible before the time budget expires
The terminal score is bounded and deterministic. Higher values correspond to stronger schedules.
## Failures and Constraints
The environment keeps the action space fixed, but higher presets change the transition dynamics.
### Capacity constraint
- `dispatch(task_ids=[...])` cannot exceed current free capacity
- only tasks in `ready_tasks` are legal to dispatch
### Hard-mode worker outages
- a temporary outage can reduce usable workers
- `total_workers` stays constant
- `effective_workers` reflects usable workers after degradation
- `free_workers` is computed from `effective_workers`, not from the original total
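The relationship between these fields can be sketched as follows. The arithmetic is an assumption for illustration: it treats `effective_workers` as `total_workers` minus `degraded_workers`, and `free_workers` as `effective_workers` minus the number of running tasks. The environment's exact rule may differ, and `worker_capacity` is a hypothetical helper.

```python
def worker_capacity(total_workers, degraded_workers, running_count):
    """Return (effective_workers, free_workers) under the assumed arithmetic."""
    # Assumed: an outage removes degraded workers from the usable pool.
    effective = max(0, total_workers - degraded_workers)
    # Assumed: each running task occupies exactly one usable worker.
    free = max(0, effective - running_count)
    return effective, free
```

With four workers, one degraded, and two tasks running, this gives `(3, 1)`: only one more task can be dispatched even though two workers would be idle without the outage.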
### Hard-mode retry failures
- a running task may fail at completion
- it consumes time but does not complete
- it returns to `ready_tasks`
- `attempt_count` shows how many retry failures that task has already consumed
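The retry behaviour described above can be illustrated with a small sketch. The function name, the dictionary task shape, and the `failure_probability` parameter are hypothetical; the README does not state how the environment decides when a task fails.

```python
import random


def resolve_completion(task, rng, failure_probability):
    """Decide what happens when a running task reaches its finish time.

    Returns (new_state, task). On failure the elapsed time is already spent,
    the task goes back to "ready", and its attempt_count increases by one.
    """
    if rng.random() < failure_probability:
        retried = dict(task, attempt_count=task["attempt_count"] + 1)
        return "ready", retried
    return "completed", task


# A seeded generator keeps the outcome reproducible for a given seed.
rng = random.Random(7)
state, task = resolve_completion(
    {"task_id": "task_01", "attempt_count": 0}, rng, failure_probability=0.2
)
```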
## Observation Contract
The main observation type is [`WorkflowArenaObservation`](workflow_arena/models.py).
Important fields include:
- `current_time`
- `total_workers`
- `effective_workers`
- `degraded_workers`
- `free_workers`
- `time_budget`
- `time_remaining`
- `progress`
- `ready_tasks`
- `running_tasks`
- `completed_tasks`
- `blocked_tasks`
- `recent_failure_events`
- `last_reward_breakdown`
- `success_metrics`
- `validation_error`
Each task view includes:
- `task_id`
- `duration`
- `priority`
- `deadline`
- `criticality`
- `slack`
- `downstream_count`
- `dependencies`
- `attempt_count`
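As a rough illustration of how these fields might drive a policy, the sketch below picks the most urgent ready tasks first. It assumes the observation has been converted to plain dictionaries, that lower `slack` means more urgent, and that a larger `priority` value means more important. None of these conventions are confirmed here, and `choose_action` is a hypothetical baseline, not a recommended strategy.

```python
def choose_action(observation):
    """Greedy baseline: fill free workers with the least-slack ready tasks."""
    ready = observation["ready_tasks"]
    free = observation["free_workers"]

    if not ready or free <= 0:
        return {"action_type": "wait", "task_ids": []}

    # Least slack first, then highest priority, then task_id for determinism.
    ranked = sorted(
        ready, key=lambda t: (t["slack"], -t["priority"], t["task_id"])
    )
    chosen = [t["task_id"] for t in ranked[:free]]
    return {"action_type": "dispatch", "task_ids": chosen}
```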
## Expected Agent Output
Agents are expected to return compact JSON actions in one of these exact forms:
```json
{ "action_type": "wait", "task_ids": [] }
```
```json
{ "action_type": "dispatch", "task_ids": ["task_01", "task_02"] }
```
Rules:
- dispatch only task ids that appear in `ready_tasks`
- do not exceed `free_workers`
- do not send duplicate ids
- `wait()` must use an empty `task_ids` list
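An agent wrapper can check these rules before sending an action. The following is a minimal client-side sketch; `validate_action` and its error strings are hypothetical and do not reproduce the environment's own validation. Rejecting an empty dispatch is an extra assumption that the rules above do not state.

```python
def validate_action(action, ready_task_ids, free_workers):
    """Return None if the action looks legal, otherwise a short error string."""
    kind = action.get("action_type")
    ids = action.get("task_ids")

    if kind not in ("wait", "dispatch"):
        return "unknown action_type"
    if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
        return "task_ids must be a list of strings"

    if kind == "wait":
        return None if not ids else "wait must use an empty task_ids list"

    if not ids:
        return "dispatch needs at least one task id"  # assumption, see lead-in
    if len(set(ids)) != len(ids):
        return "duplicate task ids"
    if len(ids) > free_workers:
        return "dispatch exceeds free_workers"
    unknown = [i for i in ids if i not in set(ready_task_ids)]
    if unknown:
        return f"not in ready_tasks: {unknown}"
    return None
```

A wrapper might fall back to `wait` whenever this returns an error, rather than paying `invalid_action_penalty`.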
## Success Metrics
The environment reports schedule quality through `success_metrics`:
- `makespan`
- `worker_utilization`
- `deadline_miss_count`
- `unfinished_task_count`
- `weighted_priority_completion`
- `benchmark_score`
Interpretation:
- higher `benchmark_score` is better
- lower `deadline_miss_count` is better
- lower `unfinished_task_count` is better
- `makespan` is populated only when every task has completed
## Expected Outputs for Evaluation
For benchmark use, an agent should produce:
1. a legal JSON action at every step
2. a full episode rollout until termination
3. a final observation containing the terminal score and success metrics
Typical downstream evaluation reads:
- cumulative reward
- final `benchmark_score`
- whether the agent completed all tasks
- how many deadlines were missed
- how much important work remained unfinished
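These readings can be aggregated across episodes with a small helper. The sketch assumes each finished episode is recorded as a dictionary with a `cumulative_reward` number and the final `success_metrics` mapping. That record shape and the `summarize` function are hypothetical, chosen only to illustrate the evaluation described above.

```python
def summarize(episodes):
    """Aggregate final metrics over a non-empty list of episode records."""
    n = len(episodes)
    metrics = [e["success_metrics"] for e in episodes]
    return {
        "mean_cumulative_reward": sum(e["cumulative_reward"] for e in episodes) / n,
        "mean_benchmark_score": sum(m["benchmark_score"] for m in metrics) / n,
        # An episode counts as complete when no task was left unfinished.
        "completion_rate": sum(m["unfinished_task_count"] == 0 for m in metrics) / n,
        "total_deadline_misses": sum(m["deadline_miss_count"] for m in metrics),
        "total_unfinished_tasks": sum(m["unfinished_task_count"] for m in metrics),
    }
```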
## Benchmarks
Results from a verified, self-contained inference run using `qwen/qwen3.5-9b`:
| Preset | Success | Steps | Score |
| -------- | ------- | ----- | ------- |
| `easy` | `true` | `11` | `0.952` |
| `medium` | `true` | `20` | `0.945` |
| `hard` | `true` | `45` | `0.652` |
## Local Development
Validate the environment:
```bash
.venv/bin/python -m openenv.cli.__main__ validate workflow_arena
```
Run the server locally:
```bash
cd workflow_arena
uv run --project . server
```