# IT Helpdesk Ticket Routing OpenEnv - Mentor Guide
This document is written as if I am mentoring someone who only knows basic Python and wants to understand how to build this project well.
The goal is not to teach every code detail. The goal is to explain the real-world thinking behind the project so you understand what you are building, why each piece exists, and how all the parts fit together.
## Start With The Big Picture
This project is a small simulation of an IT helpdesk team.
A company receives support tickets like:
- "I was charged twice after the integration outage"
- "My admin account is locked and I cannot access payroll"
- "Can we extend this contractor account for two more weeks?"
- "We think this email is a phishing attempt"
A human helpdesk lead does not just read those tickets and say "this is category X."
They also decide:
- how urgent it is
- which team should own it
- what the next action should be
- whether to gather more information first
- whether this is big enough to open an incident
- whether to delay one ticket because a more important cluster is coming
That is why this project is stronger than a simple text classifier. It tries to model a small operational workflow, not just a label lookup.
## What OpenEnv Means In Plain English
OpenEnv is a way of turning a real task into an environment that an agent can interact with step by step.
Instead of asking a model one question and scoring one answer, we create a loop:
1. the environment shows the agent the current situation
2. the agent chooses an action
3. the environment changes state
4. the agent sees the new situation
5. this continues until the episode ends
That matters because many real jobs are not one-shot question answering. They involve:
- incomplete information
- intermediate choices
- trade-offs
- consequences that show up later
Helpdesk work fits this pattern well.
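The five-step loop described in this section can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the OpenEnv API: `ToyQueueEnv`, `reset`, `step`, `run_episode`, and the observation fields are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQueueEnv:
    """Hypothetical stand-in for an environment; not the real OpenEnv interface."""
    tickets: list[str]
    position: int = 0
    history: list[str] = field(default_factory=list)

    def reset(self) -> dict:
        self.position = 0
        self.history = []
        return self._observe()

    def step(self, action: str) -> tuple[dict, bool]:
        # Record the action, move to the next ticket, and report whether we are done.
        self.history.append(action)
        self.position += 1
        done = self.position >= len(self.tickets)
        return self._observe(), done

    def _observe(self) -> dict:
        has_ticket = self.position < len(self.tickets)
        current = self.tickets[self.position] if has_ticket else None
        return {"ticket": current, "handled": len(self.history)}


def run_episode(env: ToyQueueEnv, policy) -> list[str]:
    observation = env.reset()                 # 1. the environment shows the situation
    done = False
    while not done:
        action = policy(observation)          # 2. the agent chooses an action
        observation, done = env.step(action)  # 3-4. state changes, new observation
    return env.history                        # 5. the episode has ended


env = ToyQueueEnv(tickets=["My admin account is locked", "We think this email is phishing"])
actions = run_episode(env, policy=lambda obs: "submit")
```

The policy here is a placeholder that always submits. The point is only the shape of the loop: observe, act, observe again, stop when the environment says the episode is over.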
## The Real-World Problem We Chose
The business problem is IT helpdesk ticket routing.
In a real company, support work usually has four important decisions:
1. `issue_type`
   - What kind of problem is this really?
   - Example: billing issue, access issue, phishing report, onboarding request.
2. `priority`
   - How urgent is it?
   - Example: low, medium, high, critical.
3. `assignment_group`
   - Which team should own it?
   - Example: service desk, security team, procurement, onboarding ops.
4. `resolution_action`
   - What should happen next?
   - Example: fulfill it directly, assign it, escalate it, acknowledge it, or ignore it.
These four decisions are the heart of the benchmark.
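One way to picture the four decisions is as a single typed record. The sketch below is an assumption about how such a record could look, not the project's actual schema: `RoutingDecision` and `Priority` are hypothetical names, and the label strings (`access`, `security_team`, `escalate`) are illustrative values based on the examples above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Ordered so that urgency comparisons such as `>=` work directly.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class RoutingDecision:
    issue_type: str          # what kind of problem this really is
    priority: Priority       # how urgent it is
    assignment_group: str    # which team should own it
    resolution_action: str   # what should happen next


# Illustrative values for "My admin account is locked and I cannot access payroll".
decision = RoutingDecision(
    issue_type="access",
    priority=Priority.CRITICAL,
    assignment_group="security_team",
    resolution_action="escalate",
)
needs_fast_path = decision.priority >= Priority.HIGH
```

Keeping the record frozen means a submitted decision cannot be quietly changed after the fact, which makes scoring easier to reason about.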
## Why This Problem Is Good For A Hackathon
This use case is strong because it has the right mix of realism and clarity.
It is realistic:
- companies really do route tickets like this every day
- mistakes are costly
- urgency and ownership matter
It is structured:
- the inputs are messy natural language
- the outputs are typed and easy to score
It is judge-friendly:
- someone can understand the workflow quickly
- the labels are concrete
It is agentic:
- the agent can investigate
- the agent can ask for more info
- the agent can defer
- the agent can open an incident
- earlier decisions can affect later tickets
## The Mental Model: Think Like A Shift Lead
The best way to understand the environment is to imagine you are the helpdesk shift lead for the next 20 minutes.
Tickets are arriving in a short queue.
You cannot treat each ticket as if it lives alone.
Sometimes:
- two tickets are part of the same outage
- one customer keeps opening related follow-ups
- your security team has limited bandwidth
- if you ignore a risky ticket now, it will create another ticket later
- if you open an incident early, later related tickets become easier to manage
That interdependence between tickets is what the benchmark is really testing.
## What The Agent Actually Does
The agent interacts with the environment one step at a time.
For each ticket, it can choose one of several actions.
### 1. `submit`
This means:
"I know enough. Here is my routing decision."
The agent provides:
- issue type
- priority
- assignment group
- resolution action
Real-world example:
A ticket says, "A new contractor starts Monday and needs access to the standard onboarding apps."
The agent may decide:
- issue type: `onboarding`
- priority: `medium`
- assignment group: `onboarding_ops`
- resolution action: `fulfill`
### 2. `investigate`
This means:
"I do not want to commit yet. Let me look up one more internal signal."
This is similar to a real support lead opening internal notes, checking a related case, or reviewing requester history before making a decision.
### 3. `request_info`
This means:
"The current ticket is missing something important. I want clarification before routing it strongly."
Real-world example:
A customer writes:
"We need help before the board meeting."
That is too vague. You may need to know:
- what system is affected
- whether it is a live outage
- whether security is involved
### 4. `defer`
This means:
"I am intentionally pushing this later in the queue because another item is more urgent or I expect better context soon."
This is not the same as ignoring the ticket.
It is a strategic queue decision.
Real-world example:
You have one ticket about a pricing clarification and another about a company-wide identity lockout.
You may defer the pricing question so you can stabilize the outage cluster first.
### 5. `open_incident`
This means:
"This is bigger than a normal ticket. I need to reserve incident-handling capacity."
Real-world example:
If multiple customers are reporting the same outage or privileged-access failure, opening an incident early can prevent chaos later in the queue.
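The five actions can be represented as one small message type with a validity check. This is a minimal sketch, assuming a hypothetical `AgentAction` class and `validate_action` helper; the real project may structure its action payloads differently.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = ("submit", "investigate", "request_info", "defer", "open_incident")
ROUTING_FIELDS = ("issue_type", "priority", "assignment_group", "resolution_action")


@dataclass
class AgentAction:
    action_type: str
    # Only `submit` carries a routing decision; `investigate` names a tool.
    routing: Optional[dict] = None
    tool_name: Optional[str] = None


def validate_action(action: AgentAction) -> list[str]:
    """Return a list of problems; an empty list means the action is well-formed."""
    problems = []
    if action.action_type not in ACTION_TYPES:
        problems.append(f"unknown action type: {action.action_type}")
    if action.action_type == "submit":
        missing = [key for key in ROUTING_FIELDS if key not in (action.routing or {})]
        if missing:
            problems.append("submit is missing: " + ", ".join(missing))
    if action.action_type == "investigate" and not action.tool_name:
        problems.append("investigate needs a tool_name")
    return problems


incomplete = AgentAction("submit", routing={"issue_type": "onboarding"})
problems = validate_action(incomplete)
```

Rejecting malformed actions before they reach the environment keeps grading simple: a `submit` always has all four fields, and the other actions never need them.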
## Why The Tools Exist
The investigation tools are there because real support work is rarely solved from the first sentence alone.
The environment includes tools such as:
- related ticket lookup
- requester history lookup
- internal routing note lookup
- queue capacity forecast
- queue cluster summary
Think of these as controlled windows into the rest of the system.
They matter because some tickets are intentionally incomplete.
For example:
- the visible ticket may look like a normal billing issue
- the internal routing note may reveal it is actually connected to an application outage
- the queue cluster summary may reveal there are two more related tickets behind it
- the capacity forecast may reveal the preferred team is overloaded, so a fallback route becomes reasonable
This is how the project creates decision-making instead of simple label prediction.
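A rough sketch of the "controlled window" idea follows. It assumes a hypothetical ticket layout with `visible` and `hidden` parts and hypothetical tool names; the project's real dataset fields and tool identifiers may differ.

```python
# Hypothetical ticket record: `visible` is shown immediately, `hidden` is
# only revealed through the named tool.
ticket = {
    "visible": {"text": "I was charged twice after the integration outage"},
    "hidden": {
        "internal_routing_note": "Linked to an application outage.",
        "queue_cluster_summary": "Two more related tickets are waiting in the queue.",
    },
}


def investigate(ticket: dict, tool_name: str, revealed: dict) -> dict:
    """Return a new `revealed` dict with the tool's result added, if any."""
    updated = dict(revealed)
    if tool_name in ticket["hidden"]:
        updated[tool_name] = ticket["hidden"][tool_name]
    return updated


def build_observation(ticket: dict, revealed: dict) -> dict:
    # The agent sees the visible fields plus whatever it has uncovered so far.
    return {**ticket["visible"], "revealed": revealed}


revealed = investigate(ticket, "internal_routing_note", revealed={})
observation = build_observation(ticket, revealed)
```

After one lookup the agent knows about the outage link but still has not seen the cluster summary, so it has to decide whether a second lookup is worth the step.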
## Why Earlier Decisions Affect Later Tickets
This is one of the most important ideas in the whole project.
If your benchmark has no carry-over state, it is often just classification repeated several times.
This project tries to avoid that by making the queue matter.
Examples:
- if you handle an outage ticket well, later tickets from the same cluster become easier to route
- if you handle it poorly, later tickets can become more urgent or more confused
- if you open an incident, related tickets may already have incident coverage
- if you defer too many things, SLA pressure grows
- if you burn the wrong team's capacity early, later tickets may need fallback routing
In simple terms:
the world changes because of what the agent did earlier.
That is what makes the benchmark feel more like operations and less like a quiz.
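Carry-over state can be as small as a few counters and sets that outlive a single ticket. The sketch below is illustrative only: `QueueState`, its fields, and the team and cluster names are assumptions, not the project's actual state model.

```python
from dataclasses import dataclass, field


@dataclass
class QueueState:
    capacity: dict[str, int]
    incident_clusters: set[str] = field(default_factory=set)
    deferrals: int = 0

    def assign(self, group: str) -> bool:
        """Use one unit of a team's capacity; return False if none is left."""
        if self.capacity.get(group, 0) <= 0:
            return False
        self.capacity[group] -= 1
        return True

    def open_incident(self, cluster_id: str) -> None:
        self.incident_clusters.add(cluster_id)

    def has_incident_coverage(self, cluster_id: str) -> bool:
        return cluster_id in self.incident_clusters

    def defer(self) -> None:
        # Each deferral adds to the SLA pressure that later steps can read.
        self.deferrals += 1


state = QueueState(capacity={"security_team": 1, "service_desk": 3})
state.open_incident("identity-lockout")
first = state.assign("security_team")   # uses the last security slot
second = state.assign("security_team")  # no capacity left for this one
```

The second assignment fails only because of the first one. That dependency between steps is exactly what a repeated classifier lacks.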
## The Three Tasks And Why They Exist
All three tasks now use full routing. That is an important design choice.
We are not making one task "just classify the issue type" anymore. We keep the core job the same and change how hard the world is.
### Task 1: Guided Full Routing
This is the easiest version.
The ticket is mostly visible.
The agent still performs full routing, but the world is simpler and more single-ticket.
This task teaches:
"Can you route a normal helpdesk ticket correctly?"
### Task 2: Contextual Full Routing
This is the medium version.
Now some useful context is hidden unless the agent investigates or asks for more information.
There is also moderate queue carry-over.
This task teaches:
"Can you route well when the ticket alone is not enough?"
### Task 3: Adaptive Queue Routing
This is the hard version.
Now the agent must handle:
- hidden decisive context
- queue capacity pressure
- incidents
- clustered requests
- deferrals
- follow-up tickets created by weak earlier handling
This task teaches:
"Can you manage the queue like an operator, not just label a ticket?"
## What The Dataset Must Do
The dataset is not just a list of random support messages.
It must teach the benchmark what "good routing" looks like.
A useful dataset for this project needs:
- clear easy examples
- medium examples where urgency matters
- ambiguous examples where the wording can mislead a naive policy
- related tickets that belong to the same cluster
- tickets where fallback routing can still be acceptable
- tickets where weak handling should logically create follow-up work
Real-world example:
If a ticket says:
"The seat increase is blocked and finance is also confused about prorating"
that is not a perfectly clean one-label case.
It could pull toward procurement, license operations, or service desk depending on queue pressure and business context.
Those are the kinds of examples that make the environment interesting.
## How Scoring Works Conceptually
The grader should feel like a tough but fair manager.
It should not be vague.
It should not say:
"Anything somewhat close gets points."
Instead, it should say:
- exact answers get the most credit
- a few near misses can receive partial credit
- fallback routes only count when they were explicitly designed to count
- clearly wrong answers get low or zero credit
That is why the grader is deterministic and narrow.
This matters for two reasons:
1. judges can trust the benchmark
2. an agent actually gets a meaningful learning signal
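A deterministic, narrow grader can be very short. The sketch below assumes equal weights across the four fields and a single near-miss rule (priority off by exactly one level earns half credit); both choices are illustrative assumptions, not the project's actual rubric.

```python
PRIORITY_ORDER = ["low", "medium", "high", "critical"]
FIELDS = ("issue_type", "priority", "assignment_group", "resolution_action")


def score_field(name: str, predicted: str, expected: str) -> float:
    if predicted == expected:
        return 1.0
    # The only near-miss rule in this sketch: priority off by exactly one level.
    if name == "priority" and predicted in PRIORITY_ORDER and expected in PRIORITY_ORDER:
        gap = abs(PRIORITY_ORDER.index(predicted) - PRIORITY_ORDER.index(expected))
        if gap == 1:
            return 0.5
    return 0.0


def grade(predicted: dict, expected: dict) -> float:
    """Average the four field scores (equal weights are an assumption)."""
    total = sum(score_field(f, predicted.get(f, ""), expected[f]) for f in FIELDS)
    return total / len(FIELDS)


expected = {
    "issue_type": "onboarding",
    "priority": "medium",
    "assignment_group": "onboarding_ops",
    "resolution_action": "fulfill",
}
near_miss = dict(expected, priority="high")
clear_miss = dict(expected, priority="critical")
```

Every partial-credit rule is written down as an explicit branch, so the same input always produces the same score and a judge can read the whole rubric in one pass.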
## Why Reward Is Not Exactly The Same As Grading
This is a subtle but important idea.
The final rubric score tells us how good the overall episode was.
The step reward helps the agent learn during the episode.
You can think of it like coaching during a football match:
- the final match result is the real outcome
- the coach's feedback during the game helps the team adjust sooner
In this project:
- terminal reward reflects overall routing plus queue-management quality
- step rewards make the environment less sparse
- unnecessary investigation or poor operational choices can carry penalties
So the final score is the verdict, while the step reward is the training signal.
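The split between verdict and training signal can be sketched as two separate functions. The reward magnitudes below are made-up illustrative numbers, and `step_reward` and `episode_return` are hypothetical helper names.

```python
def step_reward(action_type: str, revealed_new_context: bool) -> float:
    """Small per-step signal; the magnitudes are illustrative assumptions."""
    if action_type == "investigate":
        # Useful lookups earn a little; redundant ones cost a little.
        return 0.05 if revealed_new_context else -0.05
    return 0.0


def episode_return(steps: list[tuple[str, bool]], final_score: float) -> float:
    """Sum of shaped step rewards plus the terminal rubric score."""
    shaped = sum(step_reward(action, useful) for action, useful in steps)
    return shaped + final_score


steps = [("investigate", True), ("investigate", False), ("submit", False)]
total = episode_return(steps, final_score=0.8)
```

Keeping the shaped part small relative to the final score is a deliberate choice in this sketch: the step signal nudges behavior without letting an agent profit from investigating forever.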
## The Difference Between "Correct Ticket Routing" And "Good Queue Management"
This difference separates average benchmarks from stronger ones.
A ticket can be locally correct but globally poor.
Example:
- yes, security might be the best owner for a certain ticket
- but if the security queue is already overloaded and the task explicitly allows a fallback operational route, a smart agent may choose the alternate route
That is why this project now includes:
- alternate acceptable routes on selected tickets
- capacity-aware routing
- queue-management score
- cluster stabilization and destabilization
A good benchmark should reward not just being correct in isolation, but being operationally sensible.
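Capacity-aware routing with explicit fallbacks can be sketched as follows. The function name `choose_route`, the team names, and the capacity numbers are assumptions for illustration; fallbacks are only considered when the ticket lists them explicitly.

```python
def choose_route(preferred: str, fallbacks: list[str], capacity: dict[str, int]) -> str:
    """Pick the preferred team if it has capacity, else the first allowed fallback."""
    if capacity.get(preferred, 0) > 0:
        return preferred
    for group in fallbacks:
        if capacity.get(group, 0) > 0:
            return group
    # Nobody has capacity: keep the preferred owner rather than invent a route.
    return preferred


capacity = {"security_team": 0, "service_desk": 2}
with_fallback = choose_route("security_team", ["service_desk"], capacity)
without_fallback = choose_route("security_team", [], capacity)
```

The second call shows the narrow part of the rule: an overloaded queue alone does not justify rerouting. The ticket has to allow the alternate route.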
## How To Explain The Main Files To A Beginner
If you are teaching this project to someone new, use these analogies.
### `server/tasks.py`
This is the curriculum.
It says:
- what the tasks are
- how hard they are
- what kinds of tickets exist
### `data/dataset.json`
This is the casebook.
It is the collection of real-looking helpdesk scenarios that power the environment.
### `server/environment.py`
This is the game master.
It keeps track of:
- which ticket is current
- what the queue looks like
- what happened earlier
- what the next observation should be
### `server/grader.py`
This is the scorekeeper.
It decides how good a routing answer was.
### `server/reward.py`
This is the coach.
It turns raw outcomes into feedback signals the agent can learn from.
### `inference.py`
This is the example player.
It shows how an agent can interact with the environment.
### `server/app.py`
This is the front desk.
It exposes the environment through web endpoints so tools and evaluators can use it.
## How I Would Teach A Beginner To Build This Project From Scratch
If you were starting from zero, I would teach the build order like this.
### Step 1: Choose A Real Workflow
Do not start with code.
Start with the business process.
Ask:
- who is the user?
- what decision are they making?
- what makes that decision hard?
- what happens if they get it wrong?
For us, the answers were:
- the user is a helpdesk routing agent
- the decisions are issue type, priority, owner, and next action
- the hard parts are ambiguity, queue pressure, and incomplete information
- mistakes cause delays, wrong ownership, and follow-up work
### Step 2: Freeze The Vocabulary
Before coding, decide the labels clearly.
If the team keeps changing label names midway, everything becomes unstable:
- dataset
- grader
- prompts
- docs
- tests
This is why a frozen vocabulary is so important.
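A frozen vocabulary is easy to enforce with a small check that runs over the dataset. This is a minimal sketch: the `VOCABULARY` sets only contain the example labels mentioned in this guide, and `find_vocabulary_errors` is a hypothetical helper, not part of the project.

```python
# Example label sets from this guide; the real vocabulary may be larger.
VOCABULARY = {
    "priority": frozenset({"low", "medium", "high", "critical"}),
    "resolution_action": frozenset({"fulfill", "assign", "escalate", "acknowledge", "ignore"}),
}


def find_vocabulary_errors(records: list[dict]) -> list[str]:
    """List every field value that is not part of the frozen vocabulary."""
    errors = []
    for index, record in enumerate(records):
        for field_name, allowed in VOCABULARY.items():
            value = record.get(field_name)
            if value not in allowed:
                errors.append(f"record {index}: {field_name}={value!r} is not in the vocabulary")
    return errors


records = [
    {"priority": "medium", "resolution_action": "fulfill"},
    {"priority": "urgent", "resolution_action": "assign"},  # "urgent" was never frozen
]
errors = find_vocabulary_errors(records)
```

Running a check like this in tests catches label drift early, before it spreads into the grader, prompts, and docs.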
### Step 3: Build Realistic Example Cases
Write tickets the way real people write them:
- incomplete
- emotional
- slightly messy
- not perfectly labeled in the text
If every ticket literally contains the answer, the benchmark becomes a keyword game.
### Step 4: Decide What The Agent Sees Immediately
Not everything should be visible at once.
Ask:
- what would a real support analyst know right away?
- what would require investigation?
- what would require asking someone?
That decision creates the need for tools and intermediate actions.
### Step 5: Add Actions Beyond Final Submission
If the only action is "submit the answer," you are probably building classification.
To make it feel operational, add actions that shape the path:
- investigate
- ask for clarification
- defer
- escalate or open an incident
These are realistic and easy to explain.
### Step 6: Make State Carry Over
This is where many projects stay shallow.
You need earlier choices to matter later.
For example:
- capacity should be reduced after use
- related tickets should react to earlier handling
- follow-up tickets should appear when earlier work was weak
Without this, you do not really have a sequential benchmark.
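The follow-up idea from the list above can be sketched with a queue and a threshold. Everything here is an assumption for illustration: the `follow_up` field, the `resolve_ticket` helper, the 0.5 threshold, and the sample tickets.

```python
from collections import deque


def resolve_ticket(queue: deque, ticket: dict, handling_score: float,
                   threshold: float = 0.5) -> bool:
    """Queue the ticket's authored follow-up if handling was weak.

    Returns True when a follow-up was added.
    """
    follow_up = ticket.get("follow_up")
    if follow_up is not None and handling_score < threshold:
        queue.append(follow_up)
        return True
    return False


queue = deque([{"id": "T-2", "text": "Pricing clarification"}])
risky = {
    "id": "T-1",
    "text": "We think this email is a phishing attempt",
    # Hypothetical follow-up that only appears after weak handling.
    "follow_up": {"id": "T-1b", "text": "Another user clicked the same link"},
}
spawned = resolve_ticket(queue, risky, handling_score=0.25)
```

Because the follow-up is authored in the data and triggered by a fixed threshold, the consequence is deterministic and still depends on what the agent did.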
### Step 7: Design Deterministic Grading
The grader should be explainable to a judge in under a minute.
That usually means:
- exact match for most things
- a small number of explicit partial-credit rules
- no secret fuzzy logic
### Step 8: Add Reward Shaping Carefully
Reward shaping should help learning, not distort the benchmark.
Good shaping:
- rewards useful investigation
- discourages wasteful probing
- gently rewards good operational flow
Bad shaping:
- makes a silly exploit better than actually solving the task
### Step 9: Build A Baseline Agent
Always include a runner that can play the environment.
It does not need to be perfect.
It just needs to prove the environment works and give judges something concrete to run.
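A baseline can be as simple as a keyword policy that always submits. The sketch below is not the project's `inference.py`; the rules, the `general` issue type, and the observation shape are hypothetical and exist only to show how little a first baseline needs.

```python
# Hypothetical default and keyword rules; label names follow this guide's examples.
DEFAULT_ROUTE = {
    "issue_type": "general",
    "priority": "medium",
    "assignment_group": "service_desk",
    "resolution_action": "assign",
}
KEYWORD_RULES = [
    ("phishing", {"issue_type": "phishing", "priority": "high",
                  "assignment_group": "security_team", "resolution_action": "escalate"}),
    ("contractor", {"issue_type": "onboarding", "priority": "medium",
                    "assignment_group": "onboarding_ops", "resolution_action": "fulfill"}),
]


def baseline_policy(observation: dict) -> dict:
    """Always submit: first matching keyword rule wins, else the default route."""
    text = (observation.get("ticket") or "").lower()
    for keyword, route in KEYWORD_RULES:
        if keyword in text:
            return {"action_type": "submit", **route}
    return {"action_type": "submit", **DEFAULT_ROUTE}


phishing_action = baseline_policy({"ticket": "We think this email is a phishing attempt"})
vague_action = baseline_policy({"ticket": "We need help before the board meeting."})
```

A policy like this never investigates, defers, or opens incidents, so it is also a useful sanity check: if it scores near-perfect on the hard task, the task is probably too easy for heuristics.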
### Step 10: Make It Easy To Validate And Deploy
A good benchmark is not just interesting. It is runnable.
That means:
- clean metadata
- clear docs
- Docker support
- validation passing
- a landing page that makes sense to a judge
## Common Beginner Mistakes To Avoid
### Mistake 1: Building A Fancy Classifier And Calling It An Environment
If nothing carries over between steps, you probably do not have a true environment yet.
### Mistake 2: Making The Grader Too Fuzzy
If almost every answer gets partial credit, your score stops being trustworthy.
### Mistake 3: Making The Hard Task Easy For Heuristics
If a simple keyword rule gets near-perfect scores, the benchmark will not feel meaningful.
### Mistake 4: Adding Random Complexity Instead Of Business Logic
Harder is not always better.
Complexity should come from realistic workflow pressure, not arbitrary tricks.
### Mistake 5: Writing Docs Only For Teammates
Hackathon judges are outsiders.
Your docs must help a smart new reader understand the project quickly.
## How To Talk About This Project In A Demo
If you need to explain the project fast, say this:
"We built an OpenEnv benchmark for IT helpdesk routing. The agent does not just classify tickets. It manages a short operational queue, can investigate hidden context, request clarification, defer work, open incidents, and make routing choices whose consequences affect later tickets. The scoring is deterministic, but the environment still has real trade-offs because queue pressure and related-ticket clusters change what good handling looks like."
That is the shortest honest pitch.
## What Makes This Project Strong Today
The current version is strongest in these areas:
- clear real-world workflow
- structured, judge-friendly outputs
- deterministic grading
- multi-step operational actions
- queue-level consequences
- cluster-aware carry-over state
- clean packaging and validation story
## What Would Make It Even Stronger Later
If this project kept growing after the hackathon, the next upgrades would be:
- make more of the consequences emerge from a general simulator instead of authored rules
- increase the data diversity further
- train stronger learned policies instead of relying mainly on deterministic policy search
- add more business objectives like cost, customer satisfaction, and resolver fatigue
## One-Minute Recap
If you forget everything else, remember this:
- this project simulates helpdesk queue management, not just ticket classification
- the agent must choose both what the ticket means and what to do next
- some useful context is hidden and must be uncovered through actions
- earlier choices affect later tickets
- the grader is deterministic so the benchmark stays trustworthy
- the project is built to be understandable, runnable, and useful as an OpenEnv environment