Cyber-Machine/workflow_arena

Cyber-Machine committed on Apr 5

Commit

2dfa6e3

1 Parent(s): fab9447

docs: add README.md

fix: update color scheme in README.md

Files changed (1)

README.md +232 -0

	@@ -0,0 +1,232 @@

+---
+title: WorkflowArena
+emoji: 🏗️
+colorFrom: blue
+colorTo: indigo
+sdk: docker
+pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
+  - workflow-orchestration
+  - reinforcement-learning
+---
+# WorkflowArena
+WorkflowArena is an OpenEnv benchmark for scheduling dependency-constrained work on limited workers.
+Each episode is a seeded workflow DAG. The agent must decide when to dispatch ready tasks, when to wait,
+and how to trade off deadline pressure, worker utilization, critical-path protection, and unfinished work.
+## Problem
+This environment models a common orchestration problem:
+- tasks have dependencies, so not everything can start immediately
+- workers are limited, so not every ready task can run at once
+- deadlines and priorities are uneven, so the obvious greedy move is not always best
+- higher difficulties add time pressure and failure dynamics
+The action space is intentionally small:
+1. `dispatch(task_ids=[...])`
+2. `wait()`
+That keeps the challenge focused on decision quality rather than action syntax.
+## Episode Loop
+1. `reset()` generates a deterministic episode from `preset`, `seed`, and `worker_count`.
+2. The observation exposes ready, running, blocked, and completed tasks plus planner hints.
+3. The agent either dispatches a legal batch of ready tasks or waits for the next completion event.
+4. Time advances only on `wait()`.
+5. The episode ends when:
+   - all tasks complete, or
+   - the preset time budget is exhausted, or
+   - the safety step limit is hit
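The loop above can be illustrated with a toy simulator. This is a minimal sketch, not the WorkflowArena implementation: `ToyEpisode`, `run_greedy`, and the example tasks are hypothetical, and the real environment is driven through `reset()` and JSON actions rather than direct method calls. It shows only the timing rule that dispatching is instantaneous and the clock moves on `wait()`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEpisode:
    """Toy stand-in for an episode; hypothetical, for illustration only."""
    durations: dict[str, int]      # task_id -> duration
    deps: dict[str, list[str]]     # task_id -> prerequisite task ids
    workers: int
    now: int = 0
    running: dict[str, int] = field(default_factory=dict)  # task_id -> finish time
    done: set[str] = field(default_factory=set)

    def ready(self) -> list[str]:
        # A task is ready when it is not started and all prerequisites are done.
        return sorted(
            t for t in self.durations
            if t not in self.done and t not in self.running
            and all(d in self.done for d in self.deps.get(t, []))
        )

    def dispatch(self, task_ids: list[str]) -> None:
        # Dispatching does not advance the clock.
        free = self.workers - len(self.running)
        assert len(task_ids) <= free and set(task_ids) <= set(self.ready())
        for t in task_ids:
            self.running[t] = self.now + self.durations[t]

    def wait(self) -> None:
        # Jump to the next completion event; the only place time moves.
        if not self.running:
            return
        self.now = min(self.running.values())
        for t in [t for t, end in self.running.items() if end == self.now]:
            del self.running[t]
            self.done.add(t)

    def finished(self) -> bool:
        return len(self.done) == len(self.durations)


def run_greedy(ep: ToyEpisode, max_steps: int = 100) -> int:
    """Fill free workers with ready tasks, otherwise wait. Returns the end time."""
    for _ in range(max_steps):  # plays the role of a safety step limit
        if ep.finished():
            break
        free = ep.workers - len(ep.running)
        batch = ep.ready()[:free]
        if batch:
            ep.dispatch(batch)
        else:
            ep.wait()
    return ep.now


episode = ToyEpisode(
    durations={"a": 2, "b": 3, "c": 1},
    deps={"c": ["a", "b"]},
    workers=2,
)
makespan = run_greedy(episode)
print(makespan)  # 4: "a" and "b" run in parallel, then "c" starts at t=3
```

The toy omits deadlines, time budgets, and failures, so it corresponds most closely to the `easy` preset.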
+## Difficulty Presets
+### `easy`
+- smaller DAGs
+- softer deadlines
+- no fixed time budget
+- no failure events
+This is the baseline teaching mode. Good play mostly means keeping workers busy and avoiding obviously bad waits.
+### `medium`
+- larger DAGs
+- tighter deadlines
+- fixed episode time budget
+- terminal penalty for unfinished work
+This is where the environment becomes a real tradeoff problem. The agent may not be able to finish everything,
+so it must decide what is worth finishing before time runs out.
+### `hard`
+- denser DAGs
+- tighter deadlines
+- tighter time budget than `medium`
+- temporary worker outages
+- task retry failures
+In hard mode, usable capacity can shrink temporarily and a task may fail at completion and return to the ready queue.
+## Rewards
+WorkflowArena uses shaped rewards so local decisions have immediate feedback, while terminal scoring still matters.
+### Per-step reward channels
+The observation exposes `last_reward_breakdown` with these channels:
+- `completion_reward`: reward for tasks that finished on the latest `wait()`
+- `utilization_reward`: reward for keeping workers occupied
+- `deadline_reward`: positive for on-time completion, negative for lateness
+- `criticality_reward`: reward for progress on high-impact work
+- `idle_penalty`: penalty for avoidable waiting or leaving useful capacity idle
+- `invalid_action_penalty`: penalty for malformed or infeasible actions
+- `terminal_makespan_score`: terminal efficiency score at episode end
+- `unfinished_task_penalty`: terminal penalty for incomplete work when the episode ends before all tasks finish
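A small helper can make the breakdown easier to inspect while debugging an agent. The sketch below assumes two things the README does not state: that penalties are stored as negative numbers, and that the step reward is the plain sum of the channels. The helper name `total_reward` and the sample values are illustrative only.

```python
# Channel names as listed in the README.
CHANNELS = (
    "completion_reward",
    "utilization_reward",
    "deadline_reward",
    "criticality_reward",
    "idle_penalty",
    "invalid_action_penalty",
    "terminal_makespan_score",
    "unfinished_task_penalty",
)


def total_reward(breakdown: dict[str, float]) -> float:
    """Sum the channels of a breakdown (assumed, not confirmed, to be additive)."""
    unknown = set(breakdown) - set(CHANNELS)
    if unknown:
        raise KeyError(f"unexpected reward channels: {sorted(unknown)}")
    # Missing channels are treated as zero.
    return sum(breakdown.get(name, 0.0) for name in CHANNELS)


# Made-up values, not output from the environment.
sample = {"completion_reward": 1.0, "utilization_reward": 0.25, "idle_penalty": -0.5}
print(total_reward(sample))  # 0.75
```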
+### Reward design intent
+The reward is set up to encourage:
+- filling worker capacity when good work is available
+- respecting deadlines
+- protecting high-priority and critical-path tasks
+- avoiding pointless waits
+- finishing as much important work as possible before the time budget expires
+The terminal score is bounded and deterministic. Higher values correspond to stronger schedules.
+## Failures and Constraints
+The environment keeps the action space fixed, but higher presets change the transition dynamics.
+### Capacity constraint
+- `dispatch(task_ids=[...])` cannot exceed current free capacity
+- only tasks in `ready_tasks` are legal to dispatch
+### Hard-mode worker outages
+- a temporary outage can reduce usable workers
+- `total_workers` stays constant
+- `effective_workers` reflects usable workers after degradation
+- `free_workers` is computed from `effective_workers`, not from the original total
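The relationship between these fields can be sketched as follows. The README only says that `free_workers` is derived from `effective_workers`; the exact formulas here are assumptions, namely that `effective_workers` equals `total_workers` minus `degraded_workers` and that each running task occupies exactly one worker. The function name `capacity` is hypothetical.

```python
def capacity(total_workers: int, degraded_workers: int, running_count: int) -> tuple[int, int]:
    """Return (effective_workers, free_workers) under the stated assumptions."""
    # Assumption: an outage removes degraded workers from the usable pool.
    effective = max(total_workers - degraded_workers, 0)
    # Assumption: one running task holds one worker.
    free = max(effective - running_count, 0)
    return effective, free


# Four workers, one in outage, two tasks running: one slot remains.
print(capacity(total_workers=4, degraded_workers=1, running_count=2))  # (3, 1)
```

Under these assumptions, an agent that sizes its batch from `total_workers` during an outage would exceed capacity, which is why the observation's `free_workers` is the safer field to read.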
+### Hard-mode retry failures
+- a running task may fail at completion
+- it consumes time but does not complete
+- it returns to `ready_tasks`
+- `attempt_count` shows how many retry failures that task has already consumed
+## Observation Contract
+The main observation type is [`WorkflowArenaObservation`](workflow_arena/models.py).
+Important fields include:
+- `current_time`
+- `total_workers`
+- `effective_workers`
+- `degraded_workers`
+- `free_workers`
+- `time_budget`
+- `time_remaining`
+- `progress`
+- `ready_tasks`
+- `running_tasks`
+- `completed_tasks`
+- `blocked_tasks`
+- `recent_failure_events`
+- `last_reward_breakdown`
+- `success_metrics`
+- `validation_error`
+Each task view includes:
+- `task_id`
+- `duration`
+- `priority`
+- `deadline`
+- `criticality`
+- `slack`
+- `downstream_count`
+- `dependencies`
+- `attempt_count`
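As an illustration of how an agent might use the task view, the sketch below mirrors the listed fields in a local dataclass and ranks ready tasks with a simple heuristic. The field types, the `TaskView` and `rank_ready` names, and the conventions that lower `slack` means more urgent and higher `priority` means more important are all assumptions; the authoritative definitions live in `workflow_arena/models.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskView:
    """Local mirror of the task view fields; types are assumed, not confirmed."""
    task_id: str
    duration: int
    priority: int
    deadline: int
    criticality: float
    slack: int
    downstream_count: int
    dependencies: tuple[str, ...] = ()
    attempt_count: int = 0


def rank_ready(tasks: list[TaskView]) -> list[TaskView]:
    """Least slack first, then higher priority, then more downstream tasks."""
    return sorted(
        tasks,
        key=lambda t: (t.slack, -t.priority, -t.downstream_count, t.task_id),
    )


# Made-up tasks for illustration.
ready = [
    TaskView("task_01", duration=3, priority=1, deadline=10, criticality=0.2, slack=5, downstream_count=0),
    TaskView("task_02", duration=2, priority=3, deadline=4, criticality=0.9, slack=0, downstream_count=2),
    TaskView("task_03", duration=2, priority=2, deadline=4, criticality=0.8, slack=0, downstream_count=4),
]
order = [t.task_id for t in rank_ready(ready)]
print(order)  # ['task_02', 'task_03', 'task_01']
```

This is only one possible ordering; the README notes that the obvious greedy move is not always best, so a heuristic like this is a baseline rather than a recommended policy.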
+## Expected Agent Output
+Agents are expected to return compact JSON actions in one of these exact forms:
+```json
+{ "action_type": "wait", "task_ids": [] }
+```
+```json
+{ "action_type": "dispatch", "task_ids": ["task_01", "task_02"] }
+```
+Rules:
+- dispatch only task ids that appear in `ready_tasks`
+- do not exceed `free_workers`
+- do not send duplicate ids
+- `wait()` must use an empty `task_ids` list
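The rules above can be checked on the agent side before an action is sent. This is a minimal sketch: `validate_action` and its error strings are hypothetical, and the environment's own validation (reported through `validation_error`) may be stricter or worded differently.

```python
import json
from typing import Optional


def validate_action(raw: str, ready_ids: list[str], free_workers: int) -> Optional[str]:
    """Return None if the action passes the listed rules, else a short reason."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return "not valid JSON"
    if not isinstance(action, dict):
        return "action must be a JSON object"

    kind = action.get("action_type")
    task_ids = action.get("task_ids")
    if not isinstance(task_ids, list) or not all(isinstance(t, str) for t in task_ids):
        return "task_ids must be a list of strings"

    if kind == "wait":
        return None if not task_ids else "wait must use an empty task_ids list"
    if kind != "dispatch":
        return "unknown action_type"

    if len(set(task_ids)) != len(task_ids):
        return "duplicate task ids"
    if any(t not in ready_ids for t in task_ids):
        return "task id not in ready_tasks"
    if len(task_ids) > free_workers:
        return "exceeds free_workers"
    return None


ready = ["task_01", "task_02"]
print(validate_action('{"action_type": "wait", "task_ids": []}', ready, 2))  # None
print(validate_action('{"action_type": "dispatch", "task_ids": ["task_01", "task_02"]}', ready, 1))
# exceeds free_workers
```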
+## Success Metrics
+The environment reports schedule quality through `success_metrics`:
+- `makespan`
+- `worker_utilization`
+- `deadline_miss_count`
+- `unfinished_task_count`
+- `weighted_priority_completion`
+- `benchmark_score`
+Interpretation:
+- higher `benchmark_score` is better
+- lower `deadline_miss_count` is better
+- lower `unfinished_task_count` is better
+- `makespan` is populated only when every task has completed
+## Expected Outputs for Evaluation
+For benchmark use, an agent should produce:
+1. a legal JSON action at every step
+2. a full episode rollout until termination
+3. a final observation containing the terminal score and success metrics
+Typical downstream evaluation reads:
+- cumulative reward
+- final `benchmark_score`
+- whether the agent completed all tasks
+- how many deadlines were missed
+- how much important work remained unfinished
+## Local Development
+Validate the environment:
+```bash
+.venv/bin/python -m openenv.cli.__main__ validate workflow_arena
+```
+Run the server locally:
+```bash
+cd workflow_arena
+uv run --project . server
+```