# Adaptive Project Manager

## Problem Statement

Software projects often fail for repeatable reasons:

- critical work is discovered late,
- teams become overloaded,
- priorities change,
- deadlines slip,
- short-term fixes create long-term issues.

Most project tools track tasks and timelines, but they do not help with decision-making under uncertainty.

This project introduces `AdaptiveProjectManagerEnv`, an OpenEnv-compatible reinforcement learning environment where the agent acts as a project manager.

The environment models:

- tasks with dependencies,
- employees with different skills,
- budget and deadline limits,
- workload and burnout,
- random project disruptions.
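
As a rough illustration of the first two items, the sketch below models tasks with dependencies and employees with skills. The class names, fields, and the `ready_tasks` helper are assumptions made for this example, not the environment's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    skill: str                  # skill required to work on the task
    effort_days: float          # remaining effort in person-days
    depends_on: list[str] = field(default_factory=list)


@dataclass
class Employee:
    name: str
    skills: set[str]
    fatigue: float = 0.0        # 0.0 = rested, 1.0 = burned out


def ready_tasks(tasks: dict[str, Task], done: set[str]) -> list[str]:
    """Return unfinished tasks whose dependencies are all complete."""
    return [
        name
        for name, task in tasks.items()
        if name not in done and all(dep in done for dep in task.depends_on)
    ]


# Hypothetical four-task project used only for illustration.
tasks = {
    "design": Task("design", "architecture", 2.0),
    "backend": Task("backend", "python", 5.0, depends_on=["design"]),
    "frontend": Task("frontend", "typescript", 4.0, depends_on=["design"]),
    "release": Task("release", "devops", 1.0, depends_on=["backend", "frontend"]),
}

print(ready_tasks(tasks, done=set()))        # only "design" is unblocked
print(ready_tasks(tasks, done={"design"}))   # backend and frontend open up
```

Until `design` is finished, no assignment to the other tasks can make progress, which is why dependency structure matters for the agent's choices.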

The goal is to learn better day-to-day project decisions over time.

The agent must choose actions repeatedly as the project evolves.

This is the core challenge: decisions that look useful today can hurt delivery later.

## Why This Problem Matters

Project management requires constant tradeoffs.

| Goal | Competing Goal |
| --- | --- |
| Deliver quickly | Avoid burnout |
| Stay in budget | Add enough capacity |
| Finish full scope | Protect the critical path |
| Solve current issues | Reduce future risk |

These decisions are hard because:

- future events are uncertain,
- effects are delayed,
- resources are limited,
- improving one metric can hurt another.

Traditional project-management systems are mostly descriptive: they show status, but they do not decide what should happen next.

This problem asks whether an agent can learn stronger decision policies through repeated interaction with a realistic simulation.

## Why Reinforcement Learning

This is a sequential decision problem, not just a prediction problem.

The core question is not:

> “Will the project be late?”

It is:

> “Given the current state, what action should we take now?”

RL is a good fit because the environment includes:

- repeated decisions,
- delayed rewards,
- changing conditions,
- multiple conflicting objectives.

The agent is not asked to predict outcomes once. It must repeatedly choose the next best action under uncertainty.

### Why Fixed Rules Are Not Enough

A simple rule like “always assign the strongest engineer to the highest-priority task” can fail because:

- one reassignment may block another dependency,
- overtime may help now but increase burnout later,
- dropping minor scope may protect delivery,
- a short-term budget increase (e.g., hiring a contractor) may prevent larger delays.

No single rule is optimal in every state.

The best action depends on:

- what has already happened,
- what is likely to happen next,
- how current choices change future options.

## Research Question

Can an RL agent manage a software project under uncertainty better than fixed rule-based policies?

More specifically: can it learn when to assign, delay, reprioritize, de-scope, or escalate work to improve final delivery outcomes?

The baselines for comparison are fixed heuristic policies.
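
One possible fixed heuristic is the "strongest engineer on the highest-priority task" rule. The sketch below is an assumed, minimal form of such a baseline. The function name and the tuple layouts are hypothetical, and the environment's real baselines may differ.

```python
def greedy_policy(tasks, engineers):
    """Pair the strongest engineers with the highest-priority tasks.

    tasks:     list of (task_name, priority) tuples, higher priority first
    engineers: list of (engineer_name, strength) tuples, higher is stronger
    Returns a list of (engineer_name, task_name) assignments.
    """
    ranked_tasks = sorted(tasks, key=lambda t: t[1], reverse=True)
    ranked_engineers = sorted(engineers, key=lambda e: e[1], reverse=True)
    # zip stops at the shorter list, so leftover tasks stay unassigned.
    return [(eng[0], task[0]) for eng, task in zip(ranked_engineers, ranked_tasks)]


# Hypothetical inputs for illustration.
tasks = [("docs", 1), ("payment-api", 5), ("login-bug", 3)]
engineers = [("Ana", 0.9), ("Raj", 0.6)]

print(greedy_policy(tasks, engineers))
```

This rule ignores dependencies, fatigue, and budget, so it makes the same choice in every state with the same ranking. That blindness is what a learned policy is expected to improve on.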

## Environment Objective

Deliver the project while balancing:

- schedule performance,
- budget control,
- burnout risk,
- stakeholder satisfaction,
- completed scope.

These goals conflict, so there is no perfect action at each step.

| Action | Benefit | Cost |
| --- | --- | --- |
| Request overtime | Faster progress | Higher burnout risk |
| Hire contractor | More capacity | Higher spend |
| Drop low-priority scope | Better schedule protection | Lower stakeholder satisfaction |
| Focus on easy tasks | Quick visible progress | Critical blockers remain |

The target behavior is to maximize project success while minimizing delay, overspend, and burnout.
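
One common way to combine objectives like these is a weighted sum. The sketch below assumes that form. The `DayOutcome` fields and every weight are illustrative placeholders, not the environment's actual reward function.

```python
from dataclasses import dataclass


@dataclass
class DayOutcome:
    progress: float       # share of total scope completed today (0..1)
    days_late: int        # days past the deadline so far
    overspend: float      # share of budget spent beyond plan (0 = on plan)
    mean_fatigue: float   # average team fatigue (0..1)
    satisfaction: float   # stakeholder satisfaction (0..1)


# Hypothetical weights; tuning them changes which tradeoffs the agent prefers.
WEIGHTS = {
    "progress": 10.0,
    "satisfaction": 1.0,
    "delay": 0.5,
    "overspend": 2.0,
    "burnout": 3.0,
}


def reward(outcome: DayOutcome) -> float:
    """Reward progress and satisfaction; penalize delay, overspend, burnout."""
    return (
        WEIGHTS["progress"] * outcome.progress
        + WEIGHTS["satisfaction"] * outcome.satisfaction
        - WEIGHTS["delay"] * outcome.days_late
        - WEIGHTS["overspend"] * outcome.overspend
        - WEIGHTS["burnout"] * outcome.mean_fatigue
    )


steady = DayOutcome(progress=0.05, days_late=0, overspend=0.0,
                    mean_fatigue=0.2, satisfaction=0.8)
overtime = DayOutcome(progress=0.07, days_late=0, overspend=0.0,
                      mean_fatigue=0.5, satisfaction=0.8)

# With these weights, the extra progress does not pay for the added fatigue.
print(reward(steady) > reward(overtime))
```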

## What the Agent Decides

### 1) Resource Allocation

- Who works on which task
- When to reassign staff
- When to preserve specialist capacity

### 2) Prioritization

- Which tasks to do now
- Whether to clear blockers first
- How to manage critical-path work

### 3) Contingency Actions

- Overtime
- Temporary hiring
- Scope deferral
- Delay acceptance

These contingency choices can produce delayed effects across later steps.
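
A typed action space for these decisions might look like the sketch below. The enum members, the `Action` fields, and the `is_well_formed` check are assumptions for illustration. The environment's real action schema is not specified here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    ASSIGN = "assign"                      # resource allocation
    REQUEST_OVERTIME = "request_overtime"  # contingency
    HIRE_CONTRACTOR = "hire_contractor"    # contingency
    DEFER_SCOPE = "defer_scope"            # contingency
    ACCEPT_DELAY = "accept_delay"          # contingency


@dataclass(frozen=True)
class Action:
    kind: ActionType
    employee: Optional[str] = None   # target employee, if the action has one
    task: Optional[str] = None       # target task, if the action has one


def is_well_formed(action: Action) -> bool:
    """Check that an action names the targets its type requires."""
    if action.kind is ActionType.ASSIGN:
        return action.employee is not None and action.task is not None
    if action.kind is ActionType.REQUEST_OVERTIME:
        return action.employee is not None
    if action.kind is ActionType.DEFER_SCOPE:
        return action.task is not None
    # HIRE_CONTRACTOR and ACCEPT_DELAY need no target in this sketch.
    return True


print(is_well_formed(Action(ActionType.ASSIGN, employee="Ana", task="backend")))
print(is_well_formed(Action(ActionType.ASSIGN, employee="Ana")))
```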

## OpenEnv Fit

This environment is suitable for OpenEnv because it provides:

- a realistic human decision task,
- clear observation/action/reward structure,
- deterministic evaluation through seeded randomness,
- support for multiple difficulty settings,
- efficient simulation with meaningful complexity.

Per simulated day, the loop is:

1. Observe project state
2. Choose actions
3. Simulate one day of progress
4. Apply possible disruptions
5. Return new state and reward

This directly matches the standard RL interaction cycle:

`Observation -> Action -> Environment Update -> Reward`
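
The daily loop can be sketched with a toy stand-in environment. `ToyProjectEnv`, its `reset`/`step` methods, and all numbers below are hypothetical simplifications. They do not reproduce the OpenEnv interface or the real `AdaptiveProjectManagerEnv`.

```python
import random


class ToyProjectEnv:
    """Single-team toy project: finish `total_effort` before `deadline`."""

    def __init__(self, total_effort=10.0, deadline=8, seed=0):
        self.total_effort = total_effort
        self.deadline = deadline
        self.rng = random.Random(seed)  # private, seeded generator

    def reset(self):
        self.day = 0
        self.remaining = self.total_effort
        return self._observe()

    def _observe(self):
        return {"day": self.day, "remaining": self.remaining}

    def step(self, action):
        # Simulate one day of progress.
        speed = 2.0 if action == "overtime" else 1.0
        # Apply a possible disruption: an absence halves today's output.
        if self.rng.random() < 0.2:
            speed *= 0.5
        done_today = min(speed, self.remaining)
        self.remaining -= done_today
        self.day += 1
        # Return the new state and reward; overtime carries a fixed cost.
        reward = done_today - (0.5 if action == "overtime" else 0.0)
        done = self.remaining <= 0 or self.day >= self.deadline
        return self._observe(), reward, done


def run_episode(seed):
    env = ToyProjectEnv(seed=seed)
    obs = env.reset()                     # observe project state
    total_reward, done = 0.0, False
    while not done:
        days_left = env.deadline - obs["day"]
        # Choose an action: use overtime only when behind schedule.
        action = "overtime" if obs["remaining"] > days_left else "work"
        obs, reward, done = env.step(action)
        total_reward += reward
    return total_reward, obs


print(run_episode(seed=42))
```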

## Difficulty and Delayed Effects

Short-term gains can create long-term losses.

Example:

- Day 3: use overtime to recover schedule.
- Day 10: burnout reduces team speed; critical work slips.

The agent must therefore plan ahead, not optimize only for immediate reward.

This introduces long-horizon planning, risk management, and tradeoff control.
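
The overtime example can be made concrete with a small deterministic model. The fatigue dynamics below (a 1.5x output boost, +0.2 fatigue per overtime day, 0.02 recovery per normal day) are invented for illustration and are not the environment's actual parameters.

```python
def simulate(overtime_days, horizon=10):
    """Return cumulative output per day for a single team.

    overtime_days: set of 1-based day numbers on which overtime is used.
    """
    fatigue = 0.0
    total = 0.0
    cumulative = []
    for day in range(1, horizon + 1):
        on_overtime = day in overtime_days
        boost = 1.5 if on_overtime else 1.0
        total += boost * (1.0 - fatigue)        # fatigue scales output down
        if on_overtime:
            fatigue = min(1.0, fatigue + 0.2)   # overtime builds fatigue
        else:
            fatigue = max(0.0, fatigue - 0.02)  # rest recovers it slowly
        cumulative.append(total)
    return cumulative


steady = simulate(overtime_days=set())
burst = simulate(overtime_days={1, 2, 3})

print(f"after day 3:  burst={burst[2]:.2f}  steady={steady[2]:.2f}")
print(f"after day 10: burst={burst[-1]:.2f}  steady={steady[-1]:.2f}")
```

In this toy model the overtime burst is ahead after day 3 but behind by day 10, which is the pattern a myopic policy would miss.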

## Stochastic Events and Reproducibility

The environment includes realistic random events, such as:

- employee absence,
- urgent feature requests,
- effort increases,
- vendor delays,
- bug discovery,
- requirement changes.

Randomness is seed-controlled to ensure:

- reproducible runs,
- fair policy comparison,
- stable evaluation.

This keeps the environment realistic while still allowing fair and repeatable comparison between policies.
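
A minimal sketch of seed-controlled disruptions, using Python's standard `random.Random`, is shown below. The event identifiers, the `sample_events` function, and the 25% daily probability are assumptions for this example.

```python
import random

# Event names mirror the list above; the identifiers are hypothetical.
EVENTS = [
    "employee_absence",
    "urgent_feature_request",
    "effort_increase",
    "vendor_delay",
    "bug_discovery",
    "requirement_change",
]


def sample_events(seed, days, daily_probability=0.25):
    """Return a list of (day, event) pairs drawn from a seeded generator."""
    rng = random.Random(seed)   # private generator; global state is untouched
    schedule = []
    for day in range(1, days + 1):
        if rng.random() < daily_probability:
            schedule.append((day, rng.choice(EVENTS)))
    return schedule


run_a = sample_events(seed=7, days=30)
run_b = sample_events(seed=7, days=30)
print(run_a == run_b)   # same seed, same disruption schedule
```

Evaluating two policies against the same seed exposes both to an identical disruption schedule, so differences in outcome come from the policies rather than from luck.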

## Why This Problem Is Useful

`AdaptiveProjectManagerEnv` can serve as:

- a benchmark for long-horizon decision-making,
- a controlled setting for comparing project-management strategies,
- an early foundation for AI-assisted planning tools.

## Summary

`AdaptiveProjectManagerEnv` frames project management as sequential decision-making under uncertainty.

It combines limited resources, conflicting goals, delayed consequences, and controlled randomness in a reproducible RL environment.

Core question:

> Can an agent learn to manage a complex project under uncertainty better than fixed rules?