# Roadmap Issue Drafts

These are the next three roadmap issues to open in GitHub once authenticated issue creation is available.

## 1. Build External Demo UI For Decision Envelope

Suggested title:

`Build external demo UI for query -> model_output -> system_decision`

Suggested body:

```md
## Goal

Add a simple external-facing demo interface on top of `/classify` so a user can paste a query and see the full decision envelope in a clean, understandable format.

## Scope

- add a lightweight UI for entering a raw query
- render `model_output.classification.intent`
- render fallback state when present
- render `system_decision.policy`
- render `system_decision.opportunity`
- include a few preloaded demo prompts

## Why

The current JSON API is enough for engineering validation, but not enough for partner demos or taxonomy walkthroughs.

## Done When

- someone can run the demo locally and inspect the full output without using curl
- the UI clearly shows query -> classification -> system decision
```
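The rendering step at the core of this demo can be sketched in a few lines. This is a minimal sketch only: the envelope field names (`model_output`, `classification.intent`, `fallback`, `system_decision.policy`, `system_decision.opportunity`) are taken from the scope list above, and the exact response shape of `/classify` should be confirmed against the API before building on this.

```python
# Minimal sketch of the envelope-rendering step for the demo UI.
# Field names follow the issue's scope list and are assumptions
# about the /classify response shape, not a confirmed schema.

def render_envelope(envelope: dict) -> str:
    """Format a decision envelope as readable text for the demo."""
    model_output = envelope.get("model_output", {})
    classification = model_output.get("classification", {})
    decision = envelope.get("system_decision", {})
    lines = [
        f"intent:      {classification.get('intent', '(none)')}",
        f"fallback:    {model_output.get('fallback', 'none')}",
        f"policy:      {decision.get('policy', '(none)')}",
        f"opportunity: {decision.get('opportunity', '(none)')}",
    ]
    return "\n".join(lines)
```

A UI layer (static page, notebook cell, or small web app) would call `/classify` and pass the JSON response through a formatter like this, so the query → classification → system decision flow is visible at a glance.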

## 2. Add Better Support Handling To Intent-Type Layer

Suggested title:

`Add dedicated support handling to reduce personal_reflection fallback on account-help prompts`

Suggested body:

```md
## Goal

Reduce the current failure mode where support-like prompts, such as login and billing questions, collapse into `personal_reflection` or low-confidence fallback behavior.

## Scope

- review support-like prompts in the current benchmark
- decide whether to add a dedicated `support` intent-type head or a rule-based override layer
- add a fixed support-oriented evaluation set
- document the chosen approach in `known_limitations.md`

## Why

The `decision_phase` head can already separate `support` reasonably well, but the `intent_type` layer still underperforms on these cases.

## Done When

- support prompts are no longer commonly labeled as `personal_reflection`
- the combined envelope fails safe on support queries, with clearer semantics for what the fallback means
```
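If the rule-based override option from the scope list is chosen, it could look like the sketch below. The keyword list and function name are illustrative assumptions, not code from the repo; a real version would be tuned against the support-oriented evaluation set this issue calls for.

```python
# Hypothetical rule-based override layer for support-like prompts.
# The keyword set and function name are illustrative only; real
# triggers should come from the support evaluation set.

SUPPORT_KEYWORDS = {
    "login", "log in", "password", "billing",
    "invoice", "refund", "account locked", "cancel subscription",
}

def support_override(query: str, predicted_intent: str) -> str:
    """Route obvious account-help prompts to a support intent before
    they can collapse into personal_reflection or low-confidence fallback."""
    lowered = query.lower()
    if any(keyword in lowered for keyword in SUPPORT_KEYWORDS):
        return "support"
    return predicted_intent
```

An override layer like this sits after the `intent_type` head, so it is cheap to add and easy to remove later if a dedicated `support` head proves to be the better approach.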

## 3. Add Evaluation Harness And Canonical Benchmark Runner

Suggested title:

`Add canonical benchmark runner for demo prompts and regression checks`

Suggested body:

```md
## Goal

Turn the current prompt suite and canonical examples into a repeatable regression harness.

## Scope

- add a script that runs the fixed demo prompts through `combined_inference.py`
- save outputs to a machine-readable artifact
- compare current outputs against expected behavior notes
- flag meaningful regressions in fallback behavior and phase classification

## Why

The repo now has frozen `v0.1` baselines. A benchmark runner is the clean way to protect demo quality without returning to ad hoc tuning.

## Done When

- one command runs the prompt suite end to end
- current outputs are easy to inspect and compare over time
- demo regressions become visible before external sharing
```
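The comparison step of the benchmark runner can be sketched as a pure function over two run artifacts, keyed by prompt. The artifact shape and field names (`intent`, `policy`, `fallback`) are assumptions based on the envelope fields discussed above; the runner itself would populate `current` by calling `combined_inference.py` over the fixed prompt suite.

```python
# Sketch of the regression-check step for the benchmark runner.
# Artifact shape is an assumption: a mapping of prompt -> envelope
# summary, with the fields named in the roadmap drafts above.

def diff_runs(expected: dict, current: dict) -> list[str]:
    """Return human-readable descriptions of prompts whose key
    envelope fields changed between the expected and current run."""
    regressions = []
    for prompt, exp in expected.items():
        cur = current.get(prompt, {})
        for field in ("intent", "policy", "fallback"):
            if cur.get(field) != exp.get(field):
                regressions.append(
                    f"{prompt}: {field} changed "
                    f"{exp.get(field)!r} -> {cur.get(field)!r}"
                )
    return regressions
```

Saving each run as JSON keeps the artifact machine-readable and diffable, and a non-empty return value from a check like this is what should fail the one-command run before external sharing.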