| title: WorkflowArena | |
| emoji: 🏗️ | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: docker | |
| pinned: false | |
| models: | |
| - qwen/qwen3.5-9b | |
| app_port: 8000 | |
| base_path: / | |
| tags: | |
| - openenv | |
| - workflow-orchestration | |
| - reinforcement-learning | |
| # WorkflowArena | |
| WorkflowArena is an OpenEnv benchmark for scheduling dependency-constrained work on limited workers. | |
| Each episode is a seeded workflow DAG. The agent must decide when to dispatch ready tasks, when to wait, | |
| and how to trade off deadline pressure, worker utilization, critical-path protection, and unfinished work. | |
| ## Problem | |
| This environment models a common orchestration problem: | |
| - tasks have dependencies, so not everything can start immediately | |
| - workers are limited, so not every ready task can run at once | |
| - deadlines and priorities are uneven, so the obvious greedy move is not always best | |
| - higher difficulties add time pressure and failure dynamics | |
| The action space is intentionally small: | |
| 1. `dispatch(task_ids=[...])` | |
| 2. `wait()` | |
| That keeps the challenge focused on decision quality rather than action syntax. | |
| ## Episode Loop | |
| 1. `reset()` generates a deterministic episode from `preset`, `seed`, and `worker_count`. | |
| 2. The observation exposes ready, running, blocked, and completed tasks plus planner hints. | |
| 3. The agent either dispatches a legal batch of ready tasks or waits for the next completion event. | |
| 4. Time advances only on `wait()`. | |
| 5. The episode ends when: | |
| - all tasks complete, or | |
| - the preset time budget is exhausted, or | |
| - the safety step limit is hit | |
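The loop above can be illustrated with a toy simulator. This is a minimal sketch, not the WorkflowArena implementation: `run_episode`, its arguments, and the greedy "dispatch whenever possible" policy are hypothetical, each task is assumed to occupy one worker, and the time budget is left out. It only shows that dispatching does not move the clock while waiting jumps to the next completion event.

```python
import heapq


def run_episode(durations, deps, worker_count, max_steps=1000):
    """Toy rollout: dispatch greedily, otherwise wait for the next completion.

    durations: {task_id: duration}; deps: {task_id: [prerequisite ids]}.
    Returns (finish_time, number_of_actions). Hypothetical stand-in only.
    """
    now, steps = 0, 0
    done, started = set(), set()
    running = []  # min-heap of (finish_time, task_id)
    while len(done) < len(durations) and steps < max_steps:
        ready = sorted(
            t for t in durations
            if t not in started and all(d in done for d in deps.get(t, []))
        )
        free = worker_count - len(running)
        if ready and free > 0:
            # dispatch: starts tasks but does not advance time
            for t in ready[:free]:
                heapq.heappush(running, (now + durations[t], t))
                started.add(t)
        else:
            # wait: jump the clock to the next completion event
            now, finished = heapq.heappop(running)
            done.add(finished)
        steps += 1  # safety step limit counts every action
    return now, steps
```

With `durations = {"a": 2, "b": 3, "c": 1}` and `c` depending on both `a` and `b`, two workers finish at time 4 in five actions, while one worker needs until time 6.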
| ## Difficulty Presets | |
| ### `easy` | |
| - smaller DAGs | |
| - softer deadlines | |
| - no fixed time budget | |
| - no failure events | |
| This is the baseline teaching mode. Good play mostly means keeping workers busy and avoiding obviously bad waits. | |
| ### `medium` | |
| - larger DAGs | |
| - tighter deadlines | |
| - fixed episode time budget | |
| - terminal penalty for unfinished work | |
| This is where the environment becomes a real tradeoff problem. The agent may not be able to finish everything, | |
| so it must decide what is worth finishing before time runs out. | |
| ### `hard` | |
| - denser DAGs | |
| - tighter deadlines | |
| - tighter time budget than `medium` | |
| - temporary worker outages | |
| - task retry failures | |
| In hard mode, usable capacity can shrink temporarily, and a task may fail at completion and return to the ready queue. | |
| ## Rewards | |
| WorkflowArena uses shaped rewards so that local decisions get immediate feedback while terminal scoring still matters. | |
| ### Per-step reward channels | |
| The observation exposes `last_reward_breakdown` with these channels: | |
| - `completion_reward`: reward for tasks that finished on the latest `wait()` | |
| - `utilization_reward`: reward for keeping workers occupied | |
| - `deadline_reward`: positive for on-time completion, negative for lateness | |
| - `criticality_reward`: reward for progress on high-impact work | |
| - `idle_penalty`: penalty for avoidable waiting or leaving useful capacity idle | |
| - `invalid_action_penalty`: penalty for malformed or infeasible actions | |
| - `terminal_makespan_score`: terminal efficiency score at episode end | |
| - `unfinished_task_penalty`: terminal penalty for incomplete work when the episode ends before all tasks finish | |
| ### Reward design intent | |
| The reward is set up to encourage: | |
| - filling worker capacity when good work is available | |
| - respecting deadlines | |
| - protecting high-priority and critical-path tasks | |
| - avoiding pointless waits | |
| - finishing as much important work as possible before the time budget expires | |
| The terminal score is bounded and deterministic. Higher values correspond to stronger schedules. | |
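For logging, it can help to collapse `last_reward_breakdown` into one number. The sketch below assumes the breakdown is available as a plain dict keyed by the channel names listed above, that penalties are stored as negative values, and that the scalar step reward is the plain sum of the channels. None of this is confirmed by the source, and `total_reward` is a hypothetical helper.

```python
# Channel names as documented for last_reward_breakdown.
CHANNELS = (
    "completion_reward",
    "utilization_reward",
    "deadline_reward",
    "criticality_reward",
    "idle_penalty",
    "invalid_action_penalty",
    "terminal_makespan_score",
    "unfinished_task_penalty",
)


def total_reward(breakdown):
    """Sum the known channels; a missing channel counts as 0.0.

    Assumption: the environment's scalar reward equals this sum.
    """
    return sum(float(breakdown.get(name, 0.0)) for name in CHANNELS)
```

Comparing this sum against the reward returned by the environment is a quick way to check whether the additive assumption holds.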
| ## Failures and Constraints | |
| The environment keeps the action space fixed, but higher presets change the transition dynamics. | |
| ### Capacity constraint | |
| - `dispatch(task_ids=[...])` cannot exceed current free capacity | |
| - only tasks in `ready_tasks` are legal to dispatch | |
| ### Hard-mode worker outages | |
| - a temporary outage can reduce usable workers | |
| - `total_workers` stays constant | |
| - `effective_workers` reflects usable workers after degradation | |
| - `free_workers` is computed from `effective_workers`, not from the original total | |
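The relationship between the worker fields can be sketched as follows. It assumes that `effective_workers` equals `total_workers` minus `degraded_workers` and that each running task occupies exactly one worker; the source states only that `free_workers` derives from `effective_workers`, so treat the arithmetic and the helper name `free_capacity` as illustrative.

```python
def free_capacity(total_workers, degraded_workers, running_count):
    """Estimate free workers during an outage (hypothetical helper).

    Assumes one worker per running task and clamps at zero, since an
    outage may leave fewer usable workers than tasks already running.
    """
    effective = max(total_workers - degraded_workers, 0)
    return max(effective - running_count, 0)
```

For example, four workers with one degraded and two tasks running leave one free slot, not two.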
| ### Hard-mode retry failures | |
| - a running task may fail at completion | |
| - it consumes time but does not complete | |
| - it returns to `ready_tasks` | |
| - `attempt_count` records how many retry failures the task has had so far | |
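The retry transition described above might look like this in code. It is a sketch under assumptions: the task is represented as a plain dict, the state labels `"ready"` and `"completed"` are hypothetical, and the failure itself is passed in as a flag rather than sampled, because the source does not describe how failures are drawn.

```python
def complete_attempt(task, failed):
    """Return (new_state, updated_task) for a task whose run just ended.

    A failed run has already consumed its duration; the task goes back to
    the ready queue with attempt_count incremented (assumed semantics).
    """
    if failed:
        retried = {**task, "attempt_count": task.get("attempt_count", 0) + 1}
        return "ready", retried
    return "completed", dict(task)
```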
| ## Observation Contract | |
| The main observation type is [`WorkflowArenaObservation`](workflow_arena/models.py). | |
| Important fields include: | |
| - `current_time` | |
| - `total_workers` | |
| - `effective_workers` | |
| - `degraded_workers` | |
| - `free_workers` | |
| - `time_budget` | |
| - `time_remaining` | |
| - `progress` | |
| - `ready_tasks` | |
| - `running_tasks` | |
| - `completed_tasks` | |
| - `blocked_tasks` | |
| - `recent_failure_events` | |
| - `last_reward_breakdown` | |
| - `success_metrics` | |
| - `validation_error` | |
| Each task view includes: | |
| - `task_id` | |
| - `duration` | |
| - `priority` | |
| - `deadline` | |
| - `criticality` | |
| - `slack` | |
| - `downstream_count` | |
| - `dependencies` | |
| - `attempt_count` | |
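These task fields are enough to write a simple baseline policy. The sketch below assumes each task view is a dict, that a smaller `slack` means more urgent, that a larger `priority` means more important, and that each task needs one worker. `pick_batch` is a hypothetical heuristic for illustration, not the planner hints the environment provides.

```python
def pick_batch(ready_tasks, free_workers):
    """Choose up to free_workers ready tasks, most urgent first.

    Ordering (assumed semantics): lowest slack, then highest priority,
    then task_id so that ties resolve deterministically.
    """
    ranked = sorted(
        ready_tasks,
        key=lambda t: (t["slack"], -t["priority"], t["task_id"]),
    )
    return [t["task_id"] for t in ranked[: max(free_workers, 0)]]
```

An empty result means there is nothing to dispatch, in which case the agent would wait instead.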
| ## Expected Agent Output | |
| Agents are expected to return compact JSON actions in one of these exact forms: | |
| ```json | |
| { "action_type": "wait", "task_ids": [] } | |
| ``` | |
| ```json | |
| { "action_type": "dispatch", "task_ids": ["task_01", "task_02"] } | |
| ``` | |
| Rules: | |
| - dispatch only task ids that appear in `ready_tasks` | |
| - do not exceed `free_workers` | |
| - do not send duplicate ids | |
| - `wait()` must use an empty `task_ids` list | |
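An agent wrapper can check these rules before sending an action. The following is a minimal sketch: `validate_action` is a hypothetical client-side helper, it assumes one worker per dispatched task, and its error strings are not the messages the environment reports in `validation_error`. Whether an empty dispatch is legal is not stated in the source, so the sketch does not check it.

```python
def validate_action(action, ready_ids, free_workers):
    """Return None if the action follows the rules above, else an error string."""
    kind = action.get("action_type")
    ids = action.get("task_ids")
    if not isinstance(ids, list):
        return "task_ids must be a list"
    if kind == "wait":
        return None if not ids else "wait must use an empty task_ids list"
    if kind != "dispatch":
        return f"unknown action_type: {kind!r}"
    if len(set(ids)) != len(ids):
        return "duplicate task ids"
    unknown = [i for i in ids if i not in ready_ids]
    if unknown:
        return f"not in ready_tasks: {unknown}"
    if len(ids) > free_workers:
        return "batch exceeds free_workers"
    return None
```

Here `ready_ids` would be the set of `task_id` values taken from `ready_tasks` in the latest observation.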
| ## Success Metrics | |
| The environment reports schedule quality through `success_metrics`: | |
| - `makespan` | |
| - `worker_utilization` | |
| - `deadline_miss_count` | |
| - `unfinished_task_count` | |
| - `weighted_priority_completion` | |
| - `benchmark_score` | |
| Interpretation: | |
| - higher `benchmark_score` is better | |
| - lower `deadline_miss_count` is better | |
| - lower `unfinished_task_count` is better | |
| - `makespan` is populated only when every task has completed | |
| ## Expected Outputs for Evaluation | |
| For benchmark use, an agent should produce: | |
| 1. a legal JSON action at every step | |
| 2. a full episode rollout until termination | |
| 3. a final observation containing the terminal score and success metrics | |
| Typical downstream evaluation reads: | |
| - cumulative reward | |
| - final `benchmark_score` | |
| - whether the agent completed all tasks | |
| - how many deadlines were missed | |
| - how much important work remained unfinished | |
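A downstream evaluator might gather those quantities as sketched below. It assumes the per-step rewards were collected into a list during the rollout and that `success_metrics` from the final observation is available as a dict with the field names listed earlier; `evaluate` is a hypothetical helper, and the numbers in the usage example are made up.

```python
def evaluate(step_rewards, final_metrics):
    """Collect the quantities a typical evaluation reads from one episode."""
    unfinished = final_metrics.get("unfinished_task_count", 0)
    return {
        "cumulative_reward": sum(step_rewards),
        "benchmark_score": final_metrics["benchmark_score"],
        "completed_all": unfinished == 0,
        "deadline_miss_count": final_metrics.get("deadline_miss_count", 0),
        "unfinished_task_count": unfinished,
    }


# Usage with illustrative values only.
report = evaluate(
    [0.5, 0.25, 0.25],
    {"benchmark_score": 0.9, "deadline_miss_count": 1, "unfinished_task_count": 0},
)
```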
| ## Benchmarks | |
| Results from a verified, self-contained inference run using `qwen/qwen3.5-9b`: | |
| | Preset | Success | Steps | Score | | |
| | -------- | ------- | ----- | ------- | | |
| | `easy` | `true` | `11` | `0.952` | | |
| | `medium` | `true` | `20` | `0.945` | | |
| | `hard` | `true` | `45` | `0.652` | | |
| ## Local Development | |
| Validate the environment: | |
| ```bash | |
| .venv/bin/python -m openenv.cli.__main__ validate workflow_arena | |
| ``` | |
| Run the server locally: | |
| ```bash | |
| cd workflow_arena | |
| uv run --project . server | |
| ``` | |