---
title: WorkflowArena
emoji: 🏗️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
models:
  - qwen/qwen3.5-9b
app_port: 8000
base_path: /
tags:
  - openenv
  - workflow-orchestration
  - reinforcement-learning
---

# WorkflowArena

WorkflowArena is an OpenEnv benchmark for scheduling dependency-constrained work on limited workers.
Each episode is a seeded workflow DAG. The agent must decide when to dispatch ready tasks, when to wait,
and how to trade off deadline pressure, worker utilization, critical-path protection, and unfinished work.

## Problem

This environment models a common orchestration problem:

- tasks have dependencies, so not everything can start immediately
- workers are limited, so not every ready task can run at once
- deadlines and priorities are uneven, so the obvious greedy move is not always best
- higher difficulties add time pressure and failure dynamics

The action space is intentionally small:

1. `dispatch(task_ids=[...])`
2. `wait()`

That keeps the challenge focused on decision quality rather than action syntax.

## Episode Loop

1. `reset()` generates a deterministic episode from `preset`, `seed`, and `worker_count`.
2. The observation exposes ready, running, blocked, and completed tasks plus planner hints.
3. The agent either dispatches a legal batch of ready tasks or waits for the next completion event.
4. Time advances only on `wait()`.
5. The episode ends when:
   - all tasks complete, or
   - the preset time budget is exhausted, or
   - the safety step limit is hit
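The timing rule in the loop above (dispatching is instantaneous, and only waiting moves the clock) can be illustrated with a toy simulation. This is a minimal sketch and not the WorkflowArena implementation: `run_toy_episode`, its arguments, and the alphabetical dispatch order are all hypothetical, and it ignores deadlines, time budgets, and failures.

```python
import heapq


def run_toy_episode(durations, deps, worker_count):
    """Toy loop: dispatch costs no time, wait jumps to the next completion.

    durations: {task_id: duration}; deps: {task_id: [prerequisite ids]}.
    Returns the makespan. Hypothetical helper, not part of WorkflowArena.
    """
    current_time = 0
    started, completed = set(), set()
    running = []  # min-heap of (finish_time, task_id)
    while len(completed) < len(durations):
        # A task is ready once all of its dependencies have completed.
        ready = sorted(
            task for task in durations
            if task not in started
            and all(dep in completed for dep in deps.get(task, ()))
        )
        free = worker_count - len(running)
        if ready and free > 0:
            # "dispatch": fill free workers; the clock does not move.
            for task in ready[:free]:
                heapq.heappush(running, (current_time + durations[task], task))
                started.add(task)
            continue
        if not running:
            raise ValueError("no runnable work: cyclic dependencies or no workers")
        # "wait": advance the clock to the next completion.
        current_time, finished = heapq.heappop(running)
        completed.add(finished)
    return current_time


if __name__ == "__main__":
    toy_durations = {"a": 2, "b": 3, "c": 1}
    toy_deps = {"c": ["a", "b"]}  # c needs both a and b
    print(run_toy_episode(toy_durations, toy_deps, worker_count=2))
    print(run_toy_episode(toy_durations, toy_deps, worker_count=1))
```

With two workers, `a` and `b` run in parallel and `c` starts once both are done; with one worker, the same DAG serializes.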

## Difficulty Presets

### `easy`

- smaller DAGs
- softer deadlines
- no fixed time budget
- no failure events

This is the baseline teaching mode. Good play mostly means keeping workers busy and avoiding obviously bad waits.

### `medium`

- larger DAGs
- tighter deadlines
- fixed episode time budget
- terminal penalty for unfinished work

This is where the environment becomes a real tradeoff problem. The agent may not be able to finish everything,
so it must decide what is worth finishing before time runs out.

### `hard`

- denser DAGs
- tighter deadlines
- tighter time budget than `medium`
- temporary worker outages
- task retry failures

In hard mode, usable capacity can shrink temporarily and a task may fail at completion and return to the ready queue.

## Rewards

WorkflowArena uses shaped rewards so local decisions have immediate feedback, while terminal scoring still matters.

### Per-step reward channels

The observation exposes `last_reward_breakdown` with these channels:

- `completion_reward`: reward for tasks that finished on the latest `wait()`
- `utilization_reward`: reward for keeping workers occupied
- `deadline_reward`: positive for on-time completion, negative for lateness
- `criticality_reward`: reward for progress on high-impact work
- `idle_penalty`: penalty for avoidable waiting or leaving useful capacity idle
- `invalid_action_penalty`: penalty for malformed or infeasible actions
- `terminal_makespan_score`: terminal efficiency score at episode end
- `unfinished_task_penalty`: terminal penalty for incomplete work when the episode ends before all tasks finish
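For logging, it can help to collapse the breakdown into one number. The sketch below assumes that `last_reward_breakdown` is available as a plain dict keyed by the channel names above, that the channels are additive, and that penalties are already stored as negative values. None of these is confirmed by this README, so check them against `workflow_arena/models.py` before relying on the result. `total_step_reward` is a hypothetical helper.

```python
# Channel names as listed in this README.
REWARD_CHANNELS = (
    "completion_reward",
    "utilization_reward",
    "deadline_reward",
    "criticality_reward",
    "idle_penalty",
    "invalid_action_penalty",
    "terminal_makespan_score",
    "unfinished_task_penalty",
)


def total_step_reward(breakdown):
    """Sum the known channels, treating missing ones as 0.0.

    Assumption: channels are additive and penalties are already negative.
    """
    return sum(float(breakdown.get(name, 0.0)) for name in REWARD_CHANNELS)


if __name__ == "__main__":
    sample = {"completion_reward": 1.0, "idle_penalty": -0.25}
    print(total_step_reward(sample))
```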

### Reward design intent

The reward is set up to encourage:

- filling worker capacity when good work is available
- respecting deadlines
- protecting high-priority and critical-path tasks
- avoiding pointless waits
- finishing as much important work as possible before the time budget expires

The terminal score is bounded and deterministic. Higher values correspond to stronger schedules.

## Failures and Constraints

The environment keeps the action space fixed, but higher presets change the transition dynamics.

### Capacity constraint

- `dispatch(task_ids=[...])` cannot exceed current free capacity
- only tasks in `ready_tasks` are legal to dispatch

### Hard-mode worker outages

- a temporary outage can reduce usable workers
- `total_workers` stays constant
- `effective_workers` reflects usable workers after degradation
- `free_workers` is computed from `effective_workers`, not from the original total
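The relationship between these fields can be sketched as simple arithmetic. This is an illustration, not the environment's code: it assumes `effective_workers = total_workers - degraded_workers` and that each running task occupies exactly one worker, neither of which this README states explicitly. `capacity_view` is a hypothetical helper.

```python
def capacity_view(total_workers, degraded_workers, running_count):
    """Return (effective_workers, free_workers) under the stated assumptions.

    Assumptions: degraded workers are subtracted from the total, and every
    running task holds exactly one worker. Values are clamped at zero.
    """
    effective = max(total_workers - degraded_workers, 0)
    free = max(effective - running_count, 0)
    return effective, free


if __name__ == "__main__":
    # 4 workers, 1 in an outage, 2 tasks running.
    print(capacity_view(total_workers=4, degraded_workers=1, running_count=2))
```

An agent that sizes its batch from `total_workers` instead of `free_workers` would over-dispatch during an outage and trigger an invalid action.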

### Hard-mode retry failures

- a running task may fail at completion
- it consumes time but does not complete
- it returns to `ready_tasks`
- `attempt_count` shows how many retry failures that task has already consumed

## Observation Contract

The main observation type is [`WorkflowArenaObservation`](workflow_arena/models.py).
Important fields include:

- `current_time`
- `total_workers`
- `effective_workers`
- `degraded_workers`
- `free_workers`
- `time_budget`
- `time_remaining`
- `progress`
- `ready_tasks`
- `running_tasks`
- `completed_tasks`
- `blocked_tasks`
- `recent_failure_events`
- `last_reward_breakdown`
- `success_metrics`
- `validation_error`

Each task view includes:

- `task_id`
- `duration`
- `priority`
- `deadline`
- `criticality`
- `slack`
- `downstream_count`
- `dependencies`
- `attempt_count`
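As an illustration of how these fields might drive a baseline policy, here is a minimal greedy sketch. It assumes the observation has been converted to a dict in which `ready_tasks` is a list of dicts carrying the task-view fields above, that a higher `criticality` or `priority` means more important, and that a lower `slack` means more urgent. The function name and the ranking order are hypothetical choices, not the benchmark's reference policy.

```python
def greedy_action(observation):
    """Dispatch the most critical, least slack ready tasks; otherwise wait.

    Assumes a dict observation with `ready_tasks` (list of dicts) and
    `free_workers` (int). The ranking order is an illustrative choice.
    """
    ready = observation.get("ready_tasks", [])
    free = observation.get("free_workers", 0)
    if not ready or free <= 0:
        return {"action_type": "wait", "task_ids": []}
    ranked = sorted(
        ready,
        key=lambda task: (-task["criticality"], task["slack"], -task["priority"]),
    )
    return {
        "action_type": "dispatch",
        "task_ids": [task["task_id"] for task in ranked[:free]],
    }


if __name__ == "__main__":
    sample_observation = {
        "free_workers": 1,
        "ready_tasks": [
            {"task_id": "task_01", "criticality": 0.2, "slack": 5, "priority": 1},
            {"task_id": "task_02", "criticality": 0.9, "slack": 0, "priority": 3},
        ],
    }
    print(greedy_action(sample_observation))
```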

## Expected Agent Output

Agents are expected to return compact JSON actions in one of these exact forms:

```json
{ "action_type": "wait", "task_ids": [] }
```

```json
{ "action_type": "dispatch", "task_ids": ["task_01", "task_02"] }
```

Rules:

- dispatch only task ids that appear in `ready_tasks`
- do not exceed `free_workers`
- do not send duplicate ids
- `wait()` must use an empty `task_ids` list
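These rules can be checked on the agent side before an action is sent. The sketch below is a hypothetical pre-flight check, not the environment's own validator. `check_action` and its error strings are invented for illustration, and it assumes the caller has already extracted the ready task ids and `free_workers` from the observation.

```python
def check_action(action, ready_task_ids, free_workers):
    """Return None if the action satisfies the rules above, else an error string."""
    kind = action.get("action_type")
    task_ids = action.get("task_ids")
    if not isinstance(task_ids, list):
        return "task_ids must be a list"
    if kind == "wait":
        return None if not task_ids else "wait must use an empty task_ids list"
    if kind != "dispatch":
        return f"unknown action_type: {kind!r}"
    if len(set(task_ids)) != len(task_ids):
        return "duplicate task ids"
    not_ready = [task_id for task_id in task_ids if task_id not in ready_task_ids]
    if not_ready:
        return f"not in ready_tasks: {not_ready}"
    if len(task_ids) > free_workers:
        return "batch exceeds free_workers"
    return None


if __name__ == "__main__":
    ready_ids = {"task_01", "task_02"}
    action = {"action_type": "dispatch", "task_ids": ["task_01", "task_02"]}
    print(check_action(action, ready_ids, free_workers=2))
    print(check_action(action, ready_ids, free_workers=1))
```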

## Success Metrics

The environment reports schedule quality through `success_metrics`:

- `makespan`
- `worker_utilization`
- `deadline_miss_count`
- `unfinished_task_count`
- `weighted_priority_completion`
- `benchmark_score`

Interpretation:

- higher `benchmark_score` is better
- lower `deadline_miss_count` is better
- lower `unfinished_task_count` is better
- `makespan` is populated only when every task has completed

## Expected Outputs for Evaluation

For benchmark use, an agent should produce:

1. a legal JSON action at every step
2. a full episode rollout until termination
3. a final observation containing the terminal score and success metrics

Typical downstream evaluation reads:

- cumulative reward
- final `benchmark_score`
- whether the agent completed all tasks
- how many deadlines were missed
- how much important work remained unfinished

## Benchmarks

Results from a verified self-contained inference run using `qwen/qwen3.5-9b`:

| Preset   | Success | Steps | Score   |
| -------- | ------- | ----- | ------- |
| `easy`   | `true`  | `11`  | `0.952` |
| `medium` | `true`  | `20`  | `0.945` |
| `hard`   | `true`  | `45`  | `0.652` |

## Local Development

Validate the environment:

```bash
.venv/bin/python -m openenv.cli.__main__ validate workflow_arena
```

Run the server locally:

```bash
cd workflow_arena
uv run --project . server
```