# Roadmap Issue Drafts
These are the next three roadmap issues to open in GitHub once authenticated issue creation is available.
## 1. Build External Demo UI For Decision Envelope
Suggested title:
`Build external demo UI for query -> model_output -> system_decision`
Suggested body:
```md
## Goal
Add a simple external-facing demo interface on top of `/classify` so a user can paste a query and see the full decision envelope in a clean, understandable format.
## Scope
- add a lightweight UI for entering a raw query
- render `model_output.classification.intent`
- render fallback state when present
- render `system_decision.policy`
- render `system_decision.opportunity`
- include a few preloaded demo prompts
## Why
The current JSON API is enough for engineering validation, but not enough for partner demos or taxonomy walkthroughs.
## Done When
- someone can run the demo locally and inspect the full output without using curl
- the UI clearly shows query -> classification -> system decision
```
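As a rough illustration of the rendering step in the scope above, the sketch below flattens a decision envelope into the display lines a demo UI might show. The dotted paths for intent, policy, and opportunity come from the issue body; the location of the fallback state (`model_output.fallback`) and both helper names are assumptions made for this example, not the actual `/classify` schema.

```python
from typing import Any


def get_path(envelope: dict[str, Any], dotted: str, default: Any = None) -> Any:
    """Walk a dotted path such as 'model_output.classification.intent'."""
    node: Any = envelope
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def summarize_envelope(query: str, envelope: dict[str, Any]) -> list[str]:
    """Return display lines in the order query -> classification -> system decision."""
    intent = get_path(envelope, "model_output.classification.intent", "n/a")
    policy = get_path(envelope, "system_decision.policy", "n/a")
    opportunity = get_path(envelope, "system_decision.opportunity", "n/a")
    # Assumption: fallback state lives at model_output.fallback when present.
    fallback = get_path(envelope, "model_output.fallback")

    lines = [f"query: {query}", f"intent: {intent}"]
    if fallback is not None:
        # Only render the fallback row when the envelope actually carries one.
        lines.append(f"fallback: {fallback}")
    lines.append(f"policy: {policy}")
    lines.append(f"opportunity: {opportunity}")
    return lines
```

A UI layer could render these lines as rows in a table or card; missing fields show up as `n/a` rather than raising, which keeps the demo usable when an envelope is only partially populated.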
## 2. Add Better Support Handling To Intent-Type Layer
Suggested title:
`Add dedicated support handling to reduce personal_reflection fallback on account-help prompts`
Suggested body:
```md
## Goal
Reduce the current failure mode where support-like prompts such as login and billing issues collapse into `personal_reflection` or low-confidence fallback behavior.
## Scope
- review support-like prompts in the current benchmark
- decide whether to add a dedicated `support` intent-type head or a rule-based override layer
- add a fixed support-oriented evaluation set
- document the chosen approach in `known_limitations.md`
## Why
The `decision_phase` head can already separate `support` reasonably well, but the `intent_type` layer still underperforms on these cases.
## Done When
- support prompts are no longer commonly labeled as `personal_reflection`
- the combined envelope fails safe for support queries with clearer semantics
```
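The issue leaves open whether to add a dedicated head or a rule-based override layer. To make the second option concrete, here is a minimal sketch of what such an override could look like. The keyword pattern, the function name, and the choice to override only `personal_reflection` are all assumptions for illustration; they do not describe an existing implementation.

```python
import re

# Hypothetical keyword list; a real one would be derived from the
# support-like prompts found in the benchmark review.
SUPPORT_PATTERN = re.compile(
    r"\b(log ?in|password|billing|invoice|refund|account)\b",
    re.IGNORECASE,
)


def apply_support_override(query: str, intent: str) -> tuple[str, bool]:
    """Relabel a prompt as `support` when the model said `personal_reflection`
    but the query contains an account-help keyword.

    Returns the (possibly overridden) intent and a flag recording whether
    the override fired, so the change stays visible downstream.
    """
    if intent == "personal_reflection" and SUPPORT_PATTERN.search(query):
        return "support", True
    return intent, False
```

Returning the flag alongside the label lets the fixed support-oriented evaluation set measure how often the rule fires, which is useful evidence when choosing between the two approaches.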
## 3. Add Evaluation Harness And Canonical Benchmark Runner
Suggested title:
`Add canonical benchmark runner for demo prompts and regression checks`
Suggested body:
```md
## Goal
Turn the current prompt suite and canonical examples into a repeatable regression harness.
## Scope
- add a script that runs the fixed demo prompts through `combined_inference.py`
- save outputs to a machine-readable artifact
- compare current outputs against expected behavior notes
- flag meaningful regressions in fallback behavior and phase classification
## Why
The repo now has frozen `v0.1` baselines. A benchmark runner is the clean way to protect demo quality without returning to ad hoc tuning.
## Done When
- one command runs the prompt suite end to end
- current outputs are easy to inspect and compare over time
- demo regressions become visible before external sharing
```
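The comparison step in the scope above could be sketched roughly as follows. The interface of `combined_inference.py` is not specified here, so the runner takes any callable that maps a prompt to a flat result dict. The field names in `WATCHED_FIELDS` and all function names are hypothetical.

```python
import json
from typing import Any, Callable, Iterable

# Hypothetical flat field names for the behaviors the issue wants to protect.
WATCHED_FIELDS = ("intent", "fallback", "decision_phase")


def run_suite(
    prompts: Iterable[str],
    classify: Callable[[str], dict[str, Any]],
) -> dict[str, dict[str, Any]]:
    """Run every prompt through the supplied inference callable."""
    return {prompt: classify(prompt) for prompt in prompts}


def find_regressions(
    baseline: dict[str, dict[str, Any]],
    current: dict[str, dict[str, Any]],
    fields: Iterable[str] = WATCHED_FIELDS,
) -> list[dict[str, Any]]:
    """List every watched field whose value differs from the baseline."""
    fields = tuple(fields)
    regressions = []
    for prompt, expected in baseline.items():
        # A prompt missing from the current run compares against an empty result.
        actual = current.get(prompt, {})
        for field in fields:
            if expected.get(field) != actual.get(field):
                regressions.append(
                    {
                        "prompt": prompt,
                        "field": field,
                        "expected": expected.get(field),
                        "actual": actual.get(field),
                    }
                )
    return regressions


def to_artifact(outputs: dict[str, dict[str, Any]]) -> str:
    """Serialize outputs as stable, diff-friendly JSON."""
    return json.dumps(outputs, indent=2, sort_keys=True)
```

Sorting keys in the artifact keeps successive runs easy to diff, which supports the goal of comparing outputs over time.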