	---
	title: WorkflowArena
	emoji: 🏗️
	colorFrom: blue
	colorTo: indigo
	sdk: docker
	pinned: false
	models:
	- qwen/qwen3.5-9b
	app_port: 8000
	base_path: /
	tags:
	- openenv
	- workflow-orchestration
	- reinforcement-learning
	---

	# WorkflowArena

	WorkflowArena is an OpenEnv benchmark for scheduling dependency-constrained work on limited workers.
	Each episode is a seeded workflow DAG. The agent must decide when to dispatch ready tasks, when to wait,
	and how to trade off deadline pressure, worker utilization, critical-path protection, and unfinished work.

	## Problem

	This environment models a common orchestration problem:

	- tasks have dependencies, so not everything can start immediately
	- workers are limited, so not every ready task can run at once
	- deadlines and priorities are uneven, so the obvious greedy move is not always best
	- higher difficulties add time pressure and failure dynamics

	The action space is intentionally small:

	1. `dispatch(task_ids=[...])`
	2. `wait()`

	That keeps the challenge focused on decision quality rather than action syntax.

	## Episode Loop

	1. `reset()` generates a deterministic episode from `preset`, `seed`, and `worker_count`.
	2. The observation exposes ready, running, blocked, and completed tasks plus planner hints.
	3. The agent either dispatches a legal batch of ready tasks or waits for the next completion event.
	4. Time advances only on `wait()`.
	5. The episode ends when:
	   - all tasks complete, or
	   - the preset time budget is exhausted, or
	   - the safety step limit is hit
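
The loop above can be sketched with a toy stand-in. `ToyWorkflow` and `rollout` are hypothetical names written for this illustration, not the WorkflowArena API, and the sketch leaves out deadlines, time budgets, rewards, and failures. It only shows that dispatching leaves the clock unchanged while waiting jumps to the next completion event.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorkflow:
    """Hypothetical stand-in for the environment; not the WorkflowArena API."""

    durations: dict[str, int]        # task_id -> duration
    deps: dict[str, list[str]]       # task_id -> prerequisite task ids
    workers: int = 2
    time: int = 0
    running: dict[str, int] = field(default_factory=dict)  # task_id -> finish time
    done: set[str] = field(default_factory=set)

    def ready(self) -> list[str]:
        # A task is ready when it has not started and all prerequisites are done.
        return sorted(
            t for t in self.durations
            if t not in self.done and t not in self.running
            and all(d in self.done for d in self.deps.get(t, []))
        )

    def dispatch(self, task_ids: list[str]) -> None:
        # Dispatching starts tasks but does not move the clock.
        assert len(task_ids) <= self.workers - len(self.running)
        assert set(task_ids) <= set(self.ready())
        for t in task_ids:
            self.running[t] = self.time + self.durations[t]

    def wait(self) -> None:
        # Waiting jumps to the next completion event and finishes those tasks.
        self.time = min(self.running.values())
        for t in [t for t, end in self.running.items() if end == self.time]:
            del self.running[t]
            self.done.add(t)


def rollout(env: ToyWorkflow, max_steps: int = 100) -> int:
    """Fill free workers when possible, otherwise wait; return the step count."""
    steps = 0
    while len(env.done) < len(env.durations) and steps < max_steps:
        free = env.workers - len(env.running)
        batch = env.ready()[:free]
        if batch:
            env.dispatch(batch)
        else:
            env.wait()
        steps += 1
    return steps


env = ToyWorkflow(durations={"a": 2, "b": 3, "c": 1}, deps={"c": ["a", "b"]})
steps = rollout(env)
print(steps, env.time)  # prints: 5 4
```

In this toy run, `c` cannot start until both `a` and `b` finish, so one worker sits idle between time 2 and time 3 even though the policy never waits voluntarily.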

	## Difficulty Presets

	### `easy`

	- smaller DAGs
	- softer deadlines
	- no fixed time budget
	- no failure events

	This is the baseline teaching mode. Good play mostly means keeping workers busy and avoiding obviously bad waits.

	### `medium`

	- larger DAGs
	- tighter deadlines
	- fixed episode time budget
	- terminal penalty for unfinished work

	This is where the environment becomes a real tradeoff problem. The agent may not be able to finish everything,
	so it must decide what is worth finishing before time runs out.

	### `hard`

	- denser DAGs
	- tighter deadlines
	- tighter time budget than `medium`
	- temporary worker outages
	- task retry failures

	In hard mode, usable capacity can shrink temporarily and a task may fail at completion and return to the ready queue.

	## Rewards

	WorkflowArena uses shaped rewards so local decisions have immediate feedback, while terminal scoring still matters.

	### Per-step reward channels

	The observation exposes `last_reward_breakdown` with these channels:

	- `completion_reward`: reward for tasks that finished on the latest `wait()`
	- `utilization_reward`: reward for keeping workers occupied
	- `deadline_reward`: positive for on-time completion, negative for lateness
	- `criticality_reward`: reward for progress on high-impact work
	- `idle_penalty`: penalty for avoidable waiting or leaving useful capacity idle
	- `invalid_action_penalty`: penalty for malformed or infeasible actions
	- `terminal_makespan_score`: terminal efficiency score at episode end
	- `unfinished_task_penalty`: terminal penalty for incomplete work when the episode ends before all tasks finish
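
As a rough illustration, the channels can be held in one record and combined. The channel names come from the list above; the `RewardBreakdown` class, the convention that penalties are stored as negative numbers, the plain-sum aggregation, and the sample values are all assumptions made for this sketch, not the environment's confirmed behavior.

```python
from dataclasses import dataclass, fields


@dataclass
class RewardBreakdown:
    """Illustrative container; only the field names come from the README."""

    completion_reward: float = 0.0
    utilization_reward: float = 0.0
    deadline_reward: float = 0.0
    criticality_reward: float = 0.0
    idle_penalty: float = 0.0
    invalid_action_penalty: float = 0.0
    terminal_makespan_score: float = 0.0
    unfinished_task_penalty: float = 0.0

    def total(self) -> float:
        # Assumption: penalties are negative and the step reward is the
        # plain sum of all channels.
        return sum(getattr(self, f.name) for f in fields(self))


# Made-up values for one step, chosen only to show the arithmetic.
step = RewardBreakdown(completion_reward=1.0, utilization_reward=0.25, idle_penalty=-0.5)
print(step.total())  # prints: 0.75
```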

	### Reward design intent

	The reward is set up to encourage:

	- filling worker capacity when good work is available
	- respecting deadlines
	- protecting high-priority and critical-path tasks
	- avoiding pointless waits
	- finishing as much important work as possible before the time budget expires

	The terminal score is bounded and deterministic. Higher values correspond to stronger schedules.

	## Failures and Constraints

	The environment keeps the action space fixed, but higher presets change the transition dynamics.

	### Capacity constraint

	- `dispatch(task_ids=[...])` cannot exceed current free capacity
	- only tasks in `ready_tasks` are legal to dispatch

	### Hard-mode worker outages

	- a temporary outage can reduce usable workers
	- `total_workers` stays constant
	- `effective_workers` reflects usable workers after degradation
	- `free_workers` is computed from `effective_workers`, not from the original total
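
One plausible reading of these fields is sketched below. The exact formulas are assumptions: the README only states that `free_workers` is derived from `effective_workers`, so the subtraction of `degraded_workers` and of the running-task count is a guess at the bookkeeping, and the helper name is hypothetical.

```python
def compute_free_workers(total_workers: int, degraded_workers: int, running_count: int) -> int:
    """Hypothetical helper: capacity left for new dispatches during an outage."""
    # Assumption: an outage removes degraded workers from the usable pool.
    effective_workers = total_workers - degraded_workers
    # Assumption: each running task occupies exactly one usable worker.
    return max(0, effective_workers - running_count)


# Four workers, one degraded, two tasks running: one slot remains.
print(compute_free_workers(total_workers=4, degraded_workers=1, running_count=2))  # prints: 1
```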

	### Hard-mode retry failures

	- a running task may fail at completion
	- it consumes time but does not complete
	- it returns to `ready_tasks`
	- `attempt_count` shows how many retry failures that task has already consumed
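
The retry behavior can be pictured with a small sketch. The function name, the dict-based task record, and the `failure_prob` parameter are hypothetical; the README does not say how failures are sampled, only that a failed task returns to the ready set with its `attempt_count` increased.

```python
import random


def finish_attempt(task: dict, ready: list, completed: list,
                   rng: random.Random, failure_prob: float) -> bool:
    """Resolve one finished attempt; return True if the task completed."""
    if rng.random() < failure_prob:
        task["attempt_count"] += 1  # the time spent on this attempt is lost
        ready.append(task)          # the task can be dispatched again
        return False
    completed.append(task)
    return True


rng = random.Random(0)
ready, completed = [], []
task = {"task_id": "task_01", "attempt_count": 0}

# failure_prob=1.0 always fails, failure_prob=0.0 always succeeds.
finish_attempt(task, ready, completed, rng, failure_prob=1.0)
retried = ready.pop()
finish_attempt(retried, ready, completed, rng, failure_prob=0.0)
print(task["attempt_count"], len(completed))  # prints: 1 1
```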

	## Observation Contract

	The main observation type is [`WorkflowArenaObservation`](workflow_arena/models.py).
	Important fields include:

	- `current_time`
	- `total_workers`
	- `effective_workers`
	- `degraded_workers`
	- `free_workers`
	- `time_budget`
	- `time_remaining`
	- `progress`
	- `ready_tasks`
	- `running_tasks`
	- `completed_tasks`
	- `blocked_tasks`
	- `recent_failure_events`
	- `last_reward_breakdown`
	- `success_metrics`
	- `validation_error`

	Each task view includes:

	- `task_id`
	- `duration`
	- `priority`
	- `deadline`
	- `criticality`
	- `slack`
	- `downstream_count`
	- `dependencies`
	- `attempt_count`
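
A simple baseline policy can be built from these fields alone. The sketch below is not the benchmark's reference agent. It assumes the observation is available as a plain dict, that `slack`, `criticality`, and `priority` are numeric, and that larger `criticality` and `priority` values mean more important work.

```python
def choose_action(observation: dict) -> dict:
    """Hypothetical greedy baseline: dispatch the most urgent ready tasks."""
    ready = observation["ready_tasks"]
    free = observation["free_workers"]
    if not ready or free <= 0:
        return {"action_type": "wait", "task_ids": []}
    # Least slack first, then higher criticality, then higher priority.
    ranked = sorted(ready, key=lambda t: (t["slack"], -t["criticality"], -t["priority"]))
    return {"action_type": "dispatch", "task_ids": [t["task_id"] for t in ranked[:free]]}


# Made-up observation fragment for illustration.
obs = {
    "free_workers": 2,
    "ready_tasks": [
        {"task_id": "task_01", "slack": 5, "criticality": 0.2, "priority": 1},
        {"task_id": "task_02", "slack": 0, "criticality": 0.9, "priority": 3},
        {"task_id": "task_03", "slack": 0, "criticality": 0.4, "priority": 2},
    ],
}
print(choose_action(obs)["task_ids"])  # prints: ['task_02', 'task_03']
```

Filling every free worker is not always the best move in the harder presets, so this ordering is a starting point rather than a recommended strategy.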

	## Expected Agent Output

	Agents are expected to return compact JSON actions in one of these exact forms:

	```json
	{ "action_type": "wait", "task_ids": [] }
	```

	```json
	{ "action_type": "dispatch", "task_ids": ["task_01", "task_02"] }
	```

	Rules:

	- dispatch only task ids that appear in `ready_tasks`
	- do not exceed `free_workers`
	- do not send duplicate ids
	- `wait()` must use an empty `task_ids` list
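
An agent can check these rules on its own side before sending an action. The helper below is a hypothetical client-side pre-check: its name and error strings are made up for this sketch and are not the messages the environment reports in `validation_error`. Rejecting an empty `dispatch` is also an assumption.

```python
from typing import Optional


def validate_action(action: dict, ready_ids: set, free_workers: int) -> Optional[str]:
    """Return None for a legal action, otherwise a short reason string."""
    kind = action.get("action_type")
    task_ids = action.get("task_ids")
    if kind not in ("wait", "dispatch") or not isinstance(task_ids, list):
        return "malformed action"
    if kind == "wait":
        return None if not task_ids else "wait must use an empty task_ids list"
    if not task_ids:
        return "dispatch needs at least one task id"  # assumption, not stated in the rules
    if len(set(task_ids)) != len(task_ids):
        return "duplicate task ids"
    if len(task_ids) > free_workers:
        return "exceeds free_workers"
    unknown = [t for t in task_ids if t not in ready_ids]
    if unknown:
        return f"not in ready_tasks: {unknown}"
    return None


ready_ids = {"task_01", "task_02"}
ok = validate_action({"action_type": "dispatch", "task_ids": ["task_01"]}, ready_ids, 1)
bad = validate_action({"action_type": "dispatch", "task_ids": ["task_01", "task_01"]}, ready_ids, 2)
print(ok, bad)  # prints: None duplicate task ids
```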

	## Success Metrics

	The environment reports schedule quality through `success_metrics`:

	- `makespan`
	- `worker_utilization`
	- `deadline_miss_count`
	- `unfinished_task_count`
	- `weighted_priority_completion`
	- `benchmark_score`

	Interpretation:

	- higher `benchmark_score` is better
	- lower `deadline_miss_count` is better
	- lower `unfinished_task_count` is better
	- `makespan` is only populated when everything completed

	## Expected Outputs for Evaluation

	For benchmark use, an agent should produce:

	1. a legal JSON action at every step
	2. a full episode rollout until termination
	3. a final observation containing the terminal score and success metrics

	Typical downstream evaluation reads:

	- cumulative reward
	- final `benchmark_score`
	- whether the agent completed all tasks
	- how many deadlines were missed
	- how much important work remained unfinished
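
A downstream evaluation script might collect those values as follows. This is a sketch under the assumption that the final observation is available as a plain dict and that `success_metrics` is a nested dict keyed by the metric names listed earlier. `summarize_episode` is a hypothetical helper and the sample numbers are made up.

```python
def summarize_episode(step_rewards: list, final_observation: dict) -> dict:
    """Hypothetical helper: reduce one rollout to the values an evaluator reads."""
    metrics = final_observation["success_metrics"]
    return {
        "cumulative_reward": sum(step_rewards),
        "benchmark_score": metrics["benchmark_score"],
        "completed_all": metrics["unfinished_task_count"] == 0,
        "deadline_miss_count": metrics["deadline_miss_count"],
        "unfinished_task_count": metrics["unfinished_task_count"],
    }


# Made-up rollout data for illustration only.
summary = summarize_episode(
    step_rewards=[0.5, 0.25, 1.0],
    final_observation={
        "success_metrics": {
            "benchmark_score": 0.5,
            "deadline_miss_count": 1,
            "unfinished_task_count": 0,
        }
    },
)
print(summary["cumulative_reward"], summary["completed_all"])  # prints: 1.75 True
```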

	## Benchmarks

	Results from a verified, self-contained inference run using `qwen/qwen3.5-9b`:

	| Preset   | Success | Steps | Score   |
	| -------- | ------- | ----- | ------- |
	| `easy`   | `true`  | `11`  | `0.952` |
	| `medium` | `true`  | `20`  | `0.945` |
	| `hard`   | `true`  | `45`  | `0.652` |

	## Local Development

	Validate the environment:

	```bash
	.venv/bin/python -m openenv.cli.__main__ validate workflow_arena
	```

	Run the server locally:

	```bash
	cd workflow_arena
	uv run --project . server
	```