---
title: WorkflowArena
emoji: 🏗️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
models:
  - qwen/qwen3.5-9b
app_port: 8000
base_path: /
tags:
  - openenv
  - workflow-orchestration
  - reinforcement-learning
---

# WorkflowArena

WorkflowArena is an OpenEnv benchmark for scheduling dependency-constrained work on limited workers.
Each episode is a seeded workflow DAG. The agent must decide when to dispatch ready tasks, when to wait,
and how to trade off deadline pressure, worker utilization, critical-path protection, and unfinished work.

## Problem

This environment models a common orchestration problem:

- tasks have dependencies, so not everything can start immediately
- workers are limited, so not every ready task can run at once
- deadlines and priorities are uneven, so the obvious greedy move is not always best
- higher difficulties add time pressure and failure dynamics

The action space is intentionally small:

1. `dispatch(task_ids=[...])`
2. `wait()`

That keeps the challenge focused on decision quality rather than action syntax.

## Episode Loop

1. `reset()` generates a deterministic episode from `preset`, `seed`, and `worker_count`.
2. The observation exposes ready, running, blocked, and completed tasks plus planner hints.
3. The agent either dispatches a legal batch of ready tasks or waits for the next completion event.
4. Time advances only on `wait()`.
5. The episode ends when:
   - all tasks complete, or
   - the preset time budget is exhausted, or
   - the safety step limit is hit
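
The loop above can be illustrated with a small stand-alone model. This is only a sketch of the described semantics (dispatch is instantaneous, the clock moves only on a wait, and a wait jumps to the next completion event); it is not the environment's implementation, and `run_toy_episode`, `greedy`, and their parameters are hypothetical names chosen for illustration.

```python
import heapq


def run_toy_episode(durations, deps, worker_count, policy, max_steps=1000):
    """Toy model of the episode loop; NOT the real WorkflowArena engine."""
    now = 0
    running = []      # min-heap of (finish_time, task_id)
    started = set()
    done = set()
    for _ in range(max_steps):          # stands in for the safety step limit
        if len(done) == len(durations):
            break                       # all tasks complete
        ready = sorted(
            t for t in durations
            if t not in started and all(d in done for d in deps.get(t, []))
        )
        free = worker_count - len(running)
        batch = policy(ready, free)
        if batch:
            # dispatch: occupies workers but does not advance the clock
            for t in batch:
                started.add(t)
                heapq.heappush(running, (now + durations[t], t))
        elif running:
            # wait: jump to the next completion event
            now, finished = heapq.heappop(running)
            done.add(finished)
            while running and running[0][0] == now:
                done.add(heapq.heappop(running)[1])
        else:
            break                       # nothing running, nothing dispatched
    return now, done


def greedy(ready, free):
    """Fill every free worker with the first ready tasks; wait otherwise."""
    return ready[:free]
```

With durations `{"a": 2, "b": 3, "c": 1}` and `c` depending on both `a` and `b`, the greedy policy on two workers dispatches `a` and `b` together, waits twice, then runs `c`.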

## Difficulty Presets

### `easy`

- smaller DAGs
- softer deadlines
- no fixed time budget
- no failure events

This is the baseline teaching mode. Good play mostly means keeping workers busy and avoiding obviously bad waits.

### `medium`

- larger DAGs
- tighter deadlines
- fixed episode time budget
- terminal penalty for unfinished work

This is where the environment becomes a real tradeoff problem. The agent may not be able to finish everything,
so it must decide what is worth finishing before time runs out.

### `hard`

- denser DAGs
- tighter deadlines
- tighter time budget than `medium`
- temporary worker outages
- task retry failures

In hard mode, usable capacity can shrink temporarily and a task may fail at completion and return to the ready queue.

## Rewards

WorkflowArena uses shaped rewards so local decisions have immediate feedback, while terminal scoring still matters.

### Per-step reward channels

The observation exposes `last_reward_breakdown` with these channels:

- `completion_reward`: reward for tasks that finished on the latest `wait()`
- `utilization_reward`: reward for keeping workers occupied
- `deadline_reward`: positive for on-time completion, negative for lateness
- `criticality_reward`: reward for progress on high-impact work
- `idle_penalty`: penalty for avoidable waiting or leaving useful capacity idle
- `invalid_action_penalty`: penalty for malformed or infeasible actions
- `terminal_makespan_score`: terminal efficiency score at episode end
- `unfinished_task_penalty`: terminal penalty for incomplete work when the episode ends before all tasks finish
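
A small helper can make the breakdown easier to log. The sketch below assumes that `last_reward_breakdown` can be read as a mapping from channel name to number, that penalty channels are stored as negative values, and that the step reward is the plain sum of the channels. None of these is confirmed here, and `total_step_reward` is a hypothetical name.

```python
# Channel names as listed above.
REWARD_CHANNELS = (
    "completion_reward",
    "utilization_reward",
    "deadline_reward",
    "criticality_reward",
    "idle_penalty",
    "invalid_action_penalty",
    "terminal_makespan_score",
    "unfinished_task_penalty",
)


def total_step_reward(breakdown):
    """Sum the known channels; missing channels count as 0.0.

    Assumption: the environment's scalar reward equals this sum.
    """
    return sum(float(breakdown.get(name, 0.0)) for name in REWARD_CHANNELS)
```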

### Reward design intent

The reward is set up to encourage:

- filling worker capacity when good work is available
- respecting deadlines
- protecting high-priority and critical-path tasks
- avoiding pointless waits
- finishing as much important work as possible before the time budget expires

The terminal score is bounded and deterministic. Higher values correspond to stronger schedules.

## Failures and Constraints

The environment keeps the action space fixed, but higher presets change the transition dynamics.

### Capacity constraint

- `dispatch(task_ids=[...])` cannot exceed current free capacity
- only tasks in `ready_tasks` are legal to dispatch

### Hard-mode worker outages

- a temporary outage can reduce usable workers
- `total_workers` stays constant
- `effective_workers` reflects usable workers after degradation
- `free_workers` is computed from `effective_workers`, not from the original total
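
The relationship between these fields can be sketched as simple arithmetic. The formulas below are assumptions inferred from the field names (`effective_workers = total_workers - degraded_workers`, and free capacity is whatever the running tasks leave over); the environment may compute them differently, and `capacity_view` is a hypothetical helper.

```python
def capacity_view(total_workers, degraded_workers, running_count):
    """Return (effective_workers, free_workers) under the assumed formulas."""
    # Assumption: an outage removes degraded workers from the usable pool.
    effective = max(total_workers - degraded_workers, 0)
    # Free capacity is derived from the effective pool, not the original total.
    free = max(effective - running_count, 0)
    return effective, free
```

For example, with four workers, one of them degraded, and two tasks running, only one more task could be dispatched under these assumptions.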

### Hard-mode retry failures

- a running task may fail at completion
- it consumes time but does not complete
- it returns to `ready_tasks`
- `attempt_count` counts the retry failures that task has already incurred

## Observation Contract

The main observation type is [`WorkflowArenaObservation`](workflow_arena/models.py).
Important fields include:

- `current_time`
- `total_workers`
- `effective_workers`
- `degraded_workers`
- `free_workers`
- `time_budget`
- `time_remaining`
- `progress`
- `ready_tasks`
- `running_tasks`
- `completed_tasks`
- `blocked_tasks`
- `recent_failure_events`
- `last_reward_breakdown`
- `success_metrics`
- `validation_error`

Each task view includes:

- `task_id`
- `duration`
- `priority`
- `deadline`
- `criticality`
- `slack`
- `downstream_count`
- `dependencies`
- `attempt_count`
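
As an illustration of how these fields might drive a decision, here is a minimal heuristic policy. It assumes each task view can be read like a dictionary, that a lower `slack` means more urgency, and that a larger `priority` number means more important work; these interpretations are assumptions, and `choose_action` is a hypothetical name rather than part of the environment.

```python
def choose_action(ready_tasks, free_workers):
    """Pick the most urgent ready tasks, up to the free capacity."""
    if free_workers <= 0 or not ready_tasks:
        return {"action_type": "wait", "task_ids": []}
    # Assumed ordering: least slack first, then higher priority,
    # then tasks that unblock more downstream work.
    ranked = sorted(
        ready_tasks,
        key=lambda t: (t["slack"], -t["priority"], -t["downstream_count"]),
    )
    picked = [t["task_id"] for t in ranked[:free_workers]]
    return {"action_type": "dispatch", "task_ids": picked}
```

A baseline like this is mainly useful for sanity-checking an agent: a learned or LLM-driven policy should at least match it on `easy`.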

## Expected Agent Output

Agents are expected to return compact JSON actions in one of these exact forms:

```json
{ "action_type": "wait", "task_ids": [] }
```

```json
{ "action_type": "dispatch", "task_ids": ["task_01", "task_02"] }
```

Rules:

- dispatch only task ids that appear in `ready_tasks`
- do not exceed `free_workers`
- do not send duplicate ids
- `wait()` must use an empty `task_ids` list
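
These rules can be checked on the agent side before an action is sent. The following is a sketch of such a pre-check, not the environment's own validator; `validate_action` is a hypothetical helper, and treating an empty `dispatch` as an error is an assumption the rules above do not state.

```python
import json


def validate_action(raw, ready_ids, free_workers):
    """Return None if the action looks legal, otherwise an error string."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(action, dict):
        return "action must be a JSON object"
    kind = action.get("action_type")
    ids = action.get("task_ids")
    if not isinstance(ids, list):
        return "task_ids must be a list"
    if kind == "wait":
        return None if not ids else "wait must use an empty task_ids list"
    if kind != "dispatch":
        return f"unknown action_type: {kind!r}"
    if not all(isinstance(i, str) for i in ids):
        return "task ids must be strings"
    if not ids:
        # Assumption: an empty dispatch is rejected here.
        return "dispatch needs at least one task id"
    if len(set(ids)) != len(ids):
        return "duplicate task ids"
    not_ready = [i for i in ids if i not in ready_ids]
    if not_ready:
        return f"not in ready_tasks: {not_ready}"
    if len(ids) > free_workers:
        return "batch exceeds free_workers"
    return None
```

Running a check like this locally lets an agent retry a malformed model output instead of paying `invalid_action_penalty`.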

## Success Metrics

The environment reports schedule quality through `success_metrics`:

- `makespan`
- `worker_utilization`
- `deadline_miss_count`
- `unfinished_task_count`
- `weighted_priority_completion`
- `benchmark_score`

Interpretation:

- higher `benchmark_score` is better
- lower `deadline_miss_count` is better
- lower `unfinished_task_count` is better
- `makespan` is populated only when all tasks completed

## Expected Outputs for Evaluation

For benchmark use, an agent should produce:

1. a legal JSON action at every step
2. a full episode rollout until termination
3. a final observation containing the terminal score and success metrics

Typical downstream evaluation reads:

- cumulative reward
- final `benchmark_score`
- whether the agent completed all tasks
- how many deadlines were missed
- how much important work remained unfinished
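
A downstream script might condense one rollout into the quantities listed above. This sketch assumes the per-step rewards were collected in a list and that `success_metrics` can be read like a dictionary with the field names from the previous section; `summarize_episode` is a hypothetical helper.

```python
def summarize_episode(step_rewards, success_metrics):
    """Collapse one rollout into the values a leaderboard row would need."""
    unfinished = success_metrics["unfinished_task_count"]
    return {
        "cumulative_reward": sum(step_rewards),
        "benchmark_score": success_metrics["benchmark_score"],
        # All tasks completed exactly when nothing is left unfinished.
        "completed_all": unfinished == 0,
        "deadline_miss_count": success_metrics["deadline_miss_count"],
        "unfinished_task_count": unfinished,
    }
```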

## Benchmarks

Results from a verified, self-contained inference run using `qwen/qwen3.5-9b`:

| Preset   | Success | Steps | Score   |
| -------- | ------- | ----- | ------- |
| `easy`   | `true`  | `11`  | `0.952` |
| `medium` | `true`  | `20`  | `0.945` |
| `hard`   | `true`  | `45`  | `0.652` |

## Local Development

Validate the environment:

```bash
.venv/bin/python -m openenv.cli.__main__ validate workflow_arena
```

Run the server locally:

```bash
cd workflow_arena
uv run --project . server
```