File size: 18,104 Bytes
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
 
 
 
3752981
5954205
 
3752981
5954205
 
 
 
 
 
3752981
5954205
3752981
5954205
3752981
5954205
6753cde
5954205
6753cde
5954205
 
 
 
 
6753cde
5954205
3752981
5954205
 
 
 
3752981
5954205
72d2634
5954205
72d2634
5954205
72d2634
5954205
72d2634
5954205
 
 
 
 
 
 
 
 
 
 
 
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
 
 
3752981
5954205
3752981
5954205
 
3752981
5954205
3752981
5954205
 
3752981
5954205
3752981
5954205
 
 
 
 
3752981
5954205
3752981
5954205
61398c0
5954205
3752981
5954205
3752981
5954205
3752981
5954205
 
 
 
 
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
 
 
 
3752981
5954205
61398c0
5954205
61398c0
5954205
61398c0
5954205
 
 
 
72d2634
5954205
72d2634
5954205
72d2634
5954205
c64d203
5954205
c64d203
5954205
72d2634
5954205
72d2634
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
61398c0
5954205
 
 
61398c0
5954205
3752981
5954205
3752981
5954205
3752981
5954205
 
61398c0
5954205
61398c0
5954205
 
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
6920aae
5954205
 
 
 
 
6920aae
5954205
3752981
5954205
3752981
5954205
6920aae
5954205
 
 
 
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
 
 
 
 
3752981
5954205
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3752981
61398c0
5954205
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3752981
5954205
c64d203
5954205
3752981
5954205
 
 
6920aae
5954205
6920aae
5954205
6920aae
5954205
6920aae
5954205
6920aae
5954205
 
 
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
 
 
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
3752981
5954205
 
3752981
5954205
1b9e464
5954205
1b9e464
5954205
3752981
5954205
 
 
 
 
3752981
5954205
1b9e464
5954205
1b9e464
5954205
1b9e464
5954205
1b9e464
5954205
1b9e464
5954205
1b9e464
5954205
6753cde
5954205
6753cde
5954205
 
6753cde
5954205
6753cde
5954205
 
6920aae
5954205
6920aae
5954205
6920aae
5954205
c64d203
5954205
c64d203
5954205
c64d203
5954205
6920aae
5954205
 
 
 
 
 
 
3752981
5954205
3752981
5954205
3752981
5954205
 
 
 
3752981
5954205
3752981
5954205
3752981
5954205
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
# IT Helpdesk Ticket Routing OpenEnv - Mentor Guide

This document is written as if I am mentoring someone who only knows basic Python and wants to understand how to build this project well.

The goal is not to teach every code detail. The goal is to explain the real-world thinking behind the project so you understand what you are building, why each piece exists, and how all the parts fit together.

## Start With The Big Picture

This project is a small simulation of an IT helpdesk team.

A company receives support tickets like:

- "I was charged twice after the integration outage"
- "My admin account is locked and I cannot access payroll"
- "Can we extend this contractor account for two more weeks?"
- "We think this email is a phishing attempt"

A human helpdesk lead does not just read those tickets and say "this is category X."
They also decide:

- how urgent it is
- which team should own it
- what the next action should be
- whether to gather more information first
- whether this is big enough to open an incident
- whether to delay one ticket because a more important cluster is coming

That is why this project is stronger than a simple text classifier. It tries to model a small operational workflow, not just a label lookup.

## What OpenEnv Means In Plain English

OpenEnv is a way of turning a real task into an environment that an agent can interact with step by step.

Instead of asking a model one question and scoring one answer, we create a loop:

1. the environment shows the agent the current situation
2. the agent chooses an action
3. the environment changes state
4. the agent sees the new situation
5. this continues until the episode ends

That matters because many real jobs are not one-shot question answering. They involve:

- incomplete information
- intermediate choices
- trade-offs
- consequences that show up later

Helpdesk work fits this pattern well.

## The Real-World Problem We Chose

The business problem is IT helpdesk ticket routing.

In a real company, support work usually has four important decisions:

1. `issue_type`
   - What kind of problem is this really?
   - Example: billing issue, access issue, phishing report, onboarding request.
2. `priority`
   - How urgent is it?
   - Example: low, medium, high, critical.
3. `assignment_group`
   - Which team should own it?
   - Example: service desk, security team, procurement, onboarding ops.
4. `resolution_action`
   - What should happen next?
   - Example: fulfill it directly, assign it, escalate it, acknowledge it, or ignore it.

These four decisions are the heart of the benchmark.

## Why This Problem Is Good For A Hackathon

This use case is strong because it has the right mix of realism and clarity.

It is realistic:

- companies really do route tickets like this every day
- mistakes are costly
- urgency and ownership matter

It is structured:

- the inputs are messy natural language
- the outputs are typed and easy to score

It is judge-friendly:

- someone can understand the workflow quickly
- the labels are concrete

It is agentic:

- the agent can investigate
- the agent can ask for more info
- the agent can defer
- the agent can open an incident
- earlier decisions can affect later tickets

## The Mental Model: Think Like A Shift Lead

The best way to understand the environment is to imagine you are the helpdesk shift lead for the next 20 minutes.

Tickets are arriving in a short queue.

You cannot treat each ticket as if it lives alone.

Sometimes:

- two tickets are part of the same outage
- one customer keeps opening related follow-ups
- your security team has limited bandwidth
- if you ignore a risky ticket now, it will create another ticket later
- if you open an incident early, later related tickets become easier to manage

That is the real heart of the benchmark.

## What The Agent Actually Does

The agent interacts with the environment one step at a time.

For each ticket, it can choose one of several actions.

### 1. `submit`

This means:

"I know enough. Here is my routing decision."

The agent provides:

- issue type
- priority
- assignment group
- resolution action

Real-world example:

A ticket says, "A new contractor starts Monday and needs access to the standard onboarding apps."

The agent may decide:

- issue type: `onboarding`
- priority: `medium`
- assignment group: `onboarding_ops`
- resolution action: `fulfill`

### 2. `investigate`

This means:

"I do not want to commit yet. Let me look up one more internal signal."

This is similar to a real support lead opening internal notes, checking a related case, or reviewing requester history before making a decision.

### 3. `request_info`

This means:

"The current ticket is missing something important. I want clarification before routing it strongly."

Real-world example:

A customer writes:

"We need help before the board meeting."

That is too vague. You may need to know:

- what system is affected
- whether it is a live outage
- whether security is involved

### 4. `defer`

This means:

"I am intentionally pushing this later in the queue because another item is more urgent or I expect better context soon."

This is not the same as ignoring the ticket.
It is a strategic queue decision.

Real-world example:

You have one ticket about a pricing clarification and another about a company-wide identity lockout.
You may defer the pricing question so you can stabilize the outage cluster first.

### 5. `open_incident`

This means:

"This is bigger than a normal ticket. I need to reserve incident-handling capacity."

Real-world example:

If multiple customers are reporting the same outage or privileged-access failure, opening an incident early can prevent chaos later in the queue.

## Why The Tools Exist

The investigation tools are there because real support work is rarely solved from the first sentence alone.

The environment includes tools such as:

- related ticket lookup
- requester history lookup
- internal routing note lookup
- queue capacity forecast
- queue cluster summary

Think of these as controlled windows into the rest of the system.

They matter because some tickets are intentionally incomplete.

For example:

- the visible ticket may look like a normal billing issue
- the internal routing note may reveal it is actually connected to an application outage
- the queue cluster summary may reveal there are two more related tickets behind it
- the capacity forecast may reveal the preferred team is overloaded, so a fallback route becomes reasonable

This is how the project creates decision-making instead of simple label prediction.

## Why Earlier Decisions Affect Later Tickets

This is one of the most important ideas in the whole project.

If your benchmark has no carry-over state, it is often just classification repeated several times.

This project tries to avoid that by making the queue matter.

Examples:

- if you handle an outage ticket well, later tickets from the same cluster become easier to route
- if you handle it poorly, later tickets can become more urgent or more confused
- if you open an incident, related tickets may already have incident coverage
- if you defer too many things, SLA pressure grows
- if you burn the wrong team's capacity early, later tickets may need fallback routing

In simple terms:

the world changes because of what the agent did earlier.

That is what makes the benchmark feel more like operations and less like a quiz.

## The Three Tasks And Why They Exist

All three tasks now use full routing. That is an important design choice.

We are not making one task "just classify the issue type" anymore. We keep the core job the same and change how hard the world is.

### Task 1: Guided Full Routing

This is the easiest version.

The ticket is mostly visible.
The agent still performs full routing, but the world is simpler and more single-ticket.

This task teaches:

"Can you route a normal helpdesk ticket correctly?"

### Task 2: Contextual Full Routing

This is the medium version.

Now some useful context is hidden unless the agent investigates or asks for more information.
There is also moderate queue carry-over.

This task teaches:

"Can you route well when the ticket alone is not enough?"

### Task 3: Adaptive Queue Routing

This is the hard version.

Now the agent must handle:

- hidden decisive context
- queue capacity pressure
- incidents
- clustered requests
- deferrals
- follow-up tickets created by weak earlier handling

This task teaches:

"Can you manage the queue like an operator, not just label a ticket?"

## What The Dataset Must Do

The dataset is not just a list of random support messages.

It must teach the benchmark what "good routing" looks like.

A useful dataset for this project needs:

- clear easy examples
- medium examples where urgency matters
- ambiguous examples where the wording can mislead a naive policy
- related tickets that belong to the same cluster
- tickets where fallback routing can still be acceptable
- tickets where weak handling should logically create follow-up work

Real-world example:

If a ticket says:

"The seat increase is blocked and finance is also confused about prorating"

that is not a perfectly clean one-label case.
It could pull toward procurement, license operations, or service desk depending on queue pressure and business context.

Those are the kinds of examples that make the environment interesting.

## How Scoring Works Conceptually

The grader should feel like a tough but fair manager.

It should not be vague.

It should not say:

"Anything somewhat close gets points."

Instead, it should say:

- exact answers get the most credit
- a few near misses can receive partial credit
- fallback routes only count when they were explicitly designed to count
- clearly wrong answers get low or zero credit

That is why the grader is deterministic and narrow.

This matters for two reasons:

1. judges can trust the benchmark
2. an agent actually gets a meaningful learning signal

## Why Reward Is Not Exactly The Same As Grading

This is a subtle but important idea.

The final rubric score tells us how good the overall episode was.

The step reward helps the agent learn during the episode.

You can think of it like coaching during a football match:

- the final match result is the real outcome
- the coach's feedback during the game helps the team adjust sooner

In this project:

- terminal reward reflects overall routing plus queue-management quality
- step rewards make the environment less sparse
- unnecessary investigation or poor operational choices can carry penalties

So the final score is the verdict, while the step reward is the training signal.

## The Difference Between "Correct Ticket Routing" And "Good Queue Management"

This difference separates average benchmarks from stronger ones.

A ticket can be locally correct but globally poor.

Example:

- yes, security might be the best owner for a certain ticket
- but if the security queue is already overloaded and the task explicitly allows a fallback operational route, a smart agent may choose the alternate route

That is why this project now includes:

- alternate acceptable routes on selected tickets
- capacity-aware routing
- queue-management score
- cluster stabilization and destabilization

A good benchmark should reward not just being correct in isolation, but being operationally sensible.

## How To Explain The Main Files To A Beginner

If you are teaching this project to someone new, use these analogies.

### `server/tasks.py`

This is the curriculum.

It says:

- what the tasks are
- how hard they are
- what kinds of tickets exist

### `data/dataset.json`

This is the casebook.

It is the collection of real-looking helpdesk scenarios that power the environment.

### `server/environment.py`

This is the game master.

It keeps track of:

- which ticket is current
- what the queue looks like
- what happened earlier
- what the next observation should be

### `server/grader.py`

This is the scorekeeper.

It decides how good a routing answer was.

### `server/reward.py`

This is the coach.

It turns raw outcomes into feedback signals the agent can learn from.

### `inference.py`

This is the example player.

It shows how an agent can interact with the environment.

### `server/app.py`

This is the front desk.

It exposes the environment through web endpoints so tools and evaluators can use it.

## How I Would Teach A Beginner To Build This Project From Scratch

If you were starting from zero, I would teach the build order like this.

### Step 1: Choose A Real Workflow

Do not start with code.
Start with the business process.

Ask:

- who is the user?
- what decision are they making?
- what makes that decision hard?
- what happens if they get it wrong?

For us, the answers were:

- the user is a helpdesk routing agent
- the decisions are issue type, priority, owner, and next action
- the hard parts are ambiguity, queue pressure, and incomplete information
- mistakes cause delays, wrong ownership, and follow-up work

### Step 2: Freeze The Vocabulary

Before coding, decide the labels clearly.

If the team keeps changing label names midway, everything becomes unstable:

- dataset
- grader
- prompts
- docs
- tests

This is why a frozen vocabulary is so important.

### Step 3: Build Realistic Example Cases

Write tickets the way real people write them:

- incomplete
- emotional
- slightly messy
- not perfectly labeled in the text

If every ticket literally contains the answer, the benchmark becomes a keyword game.

### Step 4: Decide What The Agent Sees Immediately

Not everything should be visible at once.

Ask:

- what would a real support analyst know right away?
- what would require investigation?
- what would require asking someone?

That decision creates the need for tools and intermediate actions.

### Step 5: Add Actions Beyond Final Submission

If the only action is "submit the answer," you are probably building classification.

To make it feel operational, add actions that shape the path:

- investigate
- ask for clarification
- defer
- escalate or open incident

These are realistic and easy to explain.

### Step 6: Make State Carry Over

This is where many projects stay shallow.

You need earlier choices to matter later.

For example:

- capacity should be reduced after use
- related tickets should react to earlier handling
- follow-up tickets should appear when earlier work was weak

Without this, you do not really have a sequential benchmark.

### Step 7: Design Deterministic Grading

The grader should be explainable to a judge in under a minute.

That usually means:

- exact match for most things
- a small number of explicit partial-credit rules
- no secret fuzzy logic

### Step 8: Add Reward Shaping Carefully

Reward shaping should help learning, not distort the benchmark.

Good shaping:

- rewards useful investigation
- discourages wasteful probing
- gently rewards good operational flow

Bad shaping:

- makes a silly exploit better than actually solving the task

### Step 9: Build A Baseline Agent

Always include a runner that can play the environment.

It does not need to be perfect.
It just needs to prove the environment works and give judges something concrete to run.

### Step 10: Make It Easy To Validate And Deploy

A good benchmark is not just interesting. It is runnable.

That means:

- clean metadata
- clear docs
- Docker support
- validation passing
- a landing page that makes sense to a judge

## Common Beginner Mistakes To Avoid

### Mistake 1: Building A Fancy Classifier And Calling It An Environment

If nothing carries over between steps, you probably do not have a true environment yet.

### Mistake 2: Making The Grader Too Fuzzy

If almost every answer gets partial credit, your score stops being trustworthy.

### Mistake 3: Making The Hard Task Easy For Heuristics

If a simple keyword rule gets near-perfect scores, the benchmark will not feel meaningful.

### Mistake 4: Adding Random Complexity Instead Of Business Logic

Harder is not always better.
Complexity should come from realistic workflow pressure, not arbitrary tricks.

### Mistake 5: Writing Docs Only For Teammates

Hackathon judges are outsiders.
Your docs must help a smart new reader understand the project quickly.

## How To Talk About This Project In A Demo

If you need to explain the project fast, say this:

"We built an OpenEnv benchmark for IT helpdesk routing. The agent does not just classify tickets. It manages a short operational queue, can investigate hidden context, request clarification, defer work, open incidents, and make routing choices whose consequences affect later tickets. The scoring is deterministic, but the environment still has real trade-offs because queue pressure and related-ticket clusters change what good handling looks like."

That is the shortest honest pitch.

## What Makes This Project Strong Today

The current version is strongest in these areas:

- clear real-world workflow
- structured, judge-friendly outputs
- deterministic grading
- multi-step operational actions
- queue-level consequences
- cluster-aware carry-over state
- clean packaging and validation story

## What Would Make It Even Stronger Later

If this project kept growing after the hackathon, the next upgrades would be:

- make more of the consequences emerge from a general simulator instead of authored rules
- increase the data diversity further
- train stronger learned policies instead of relying mainly on deterministic policy search
- add more business objectives like cost, customer satisfaction, and resolver fatigue

## One-Minute Recap

If you forget everything else, remember this:

- this project simulates helpdesk queue management, not just ticket classification
- the agent must choose both what the ticket means and what to do next
- some useful context is hidden and must be uncovered through actions
- earlier choices affect later tickets
- the grader is deterministic so the benchmark stays trustworthy
- the project is built to be understandable, runnable, and useful as an OpenEnv environment