# 02 - Runtime Paths

> Path A generates a response and records attribution. Path B buffers feedback
> signals and finalizes learning. The shared ledger is `ape_turn_record`; the
> personalized learning table is `ape_user_bandit_state`.

---

## Path A - `/turn`

`POST /turn` handles one user message end to end.

```text
1. Ensure session_id
   If the request omits session_id, the orchestrator mints one.

2. Load history
   Read recent messages from ape_messages by session_id.

3. Append user message
   Write raw user text to ape_messages before generation.

4. Classify
   LLM returns:
     intent
     topic
     signal

5. Finalize previous pending response
   If this user/session has a recent PENDING response:
     - append classifier signal as source=llm if it is not no_signal
     - append session_continue as source=derived if previous response age < 5 min
     - run finalization:
         composite detection
         atomic resolver fallback
         strongest format-relevant reward selection
         CAS PENDING -> APPLIED
         bandit update if reward exists
         UCB cache refresh for the whole cell

6. Validate intent
   Intent must be ACTIVE in ape_config. Unknown or inactive intents become
   unmapped.

7. Resolve candidate strategies
   Lookup order:
     - exact policy: domain + intent + topic
     - default topic policy: domain + intent + _default
     - hardcoded fallback catalog

8. Load active instructions
   Each candidate strategy needs an ACTIVE instruction. Missing instructions
   fall back to safe defaults where available.

9. Load or create bandit cell
   The cell is keyed by user_id_hash + domain + intent + topic.
   One row exists per strategy.

10. Select strategy
    UCB selects argmax(cached_ucb).
    Unpulled arms have cached_ucb = 999.0 for natural exploration.

11. Generate answer
    The synthesizer receives the selected strategy instruction.

12. Measure format compliance
    format_compliance = 1 if rendered_format matches selected_strategy's
    expected format; otherwise 0.

13. Write PENDING turn record
    The row includes response_id, selected_strategy, rendered_format,
    attribution_bandit_pk/sk, and initial pending_signals:
      - format_compliance_pass, or
      - format_compliance_fail

14. Append assistant message
    Assistant message stores response_id so the UI can attach feedback to the
    exact response.
```

The response body returns `response_id`. The client must echo that id when it
posts feedback.
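
Steps 7 and 10 can be sketched in a few lines of Python. This is a minimal illustration, not the orchestrator's code: the in-memory `policies` dict, the function names, and the example domain, intent, and strategy names are all hypothetical. Only the lookup order, the `_default` topic, and the `999.0` sentinel come from the steps above.

```python
from typing import Dict, List, Tuple

DEFAULT_TOPIC = "_default"
UNPULLED_UCB = 999.0  # sentinel for unpulled arms (step 10)

# Hypothetical stand-in for the policy table: (domain, intent, topic) -> strategies.
PolicyKey = Tuple[str, str, str]


def resolve_candidates(
    policies: Dict[PolicyKey, List[str]],
    fallback: List[str],
    domain: str,
    intent: str,
    topic: str,
) -> List[str]:
    """Step 7: exact policy, then the _default topic policy, then the fallback."""
    for key in ((domain, intent, topic), (domain, intent, DEFAULT_TOPIC)):
        if key in policies:
            return policies[key]
    return fallback


def select_strategy(cached_ucb: Dict[str, float], candidates: List[str]) -> str:
    """Step 10: argmax over cached_ucb; arms without a score count as unpulled."""
    return max(candidates, key=lambda s: cached_ucb.get(s, UNPULLED_UCB))


# Example with made-up names: no exact "billing" policy, so _default applies.
policies = {("support", "how_to", "_default"): ["steps", "prose"]}
candidates = resolve_candidates(policies, ["prose"], "support", "how_to", "billing")
chosen = select_strategy({"steps": 0.41}, candidates)
```

In this example `prose` has never been pulled, so its sentinel score beats the measured score of `steps` and it is explored first.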

---

## Path A Streaming - `/turn/stream`

`POST /turn/stream` has the same learning semantics as `/turn`. The difference
is delivery: it returns Server-Sent Events while the synthesizer is producing
tokens.

Event sequence:

| Event | When | Payload highlights |
|---|---|---|
| `metadata` | After strategy selection, before token streaming | `response_id`, `session_id`, `intent`, `topic`, `selected_strategy`, `candidate_strategies` |
| `delta` | Many times during generation | Raw text chunk from the LLM |
| `done` | After generation and writes complete | Final `answer`, `rendered_format`, `format_compliance`, `selection`, `classification`, timings |
| `error` | If streaming fails | Error message |

The streaming path preallocates `response_id` before token streaming so the UI
can associate live feedback controls with the response as soon as metadata
arrives.
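
The event ordering can be illustrated with a small generator that frames Server-Sent Events by hand. It is a sketch under assumptions: the helper names, the JSON encoding of each payload, and the `text` and `message` field names are hypothetical, and the real `metadata` and `done` payloads carry more fields than shown here.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def sse_event(event: str, payload: Dict[str, Any]) -> str:
    """Frame one Server-Sent Event: an event line, a data line, a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def stream_turn(response_id: str, chunks: Iterable[str]) -> Iterator[str]:
    """Yield metadata first, then one delta per chunk, then done (or error)."""
    # response_id is allocated before any token is streamed.
    yield sse_event("metadata", {"response_id": response_id})
    parts = []
    try:
        for chunk in chunks:
            parts.append(chunk)
            yield sse_event("delta", {"text": chunk})
    except Exception as exc:  # a failure while streaming ends with an error event
        yield sse_event("error", {"message": str(exc)})
        return
    yield sse_event("done", {"response_id": response_id, "answer": "".join(parts)})


frames = list(stream_turn("r-1", ["Hel", "lo"]))
```

Because `metadata` is always the first frame, a client can render feedback controls bound to `response_id` before the first `delta` arrives.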

---

## Path B - `/feedback`

`POST /feedback` accepts an explicit UI signal for one `response_id`.

```text
1. Validate target
   - response_id must exist
   - user_id must hash to the response owner
   - reward_status must still be PENDING

2. Validate signal
   The signal must exist in ACTIVE signal_routing config.

3. Append to pending_signals
   UI feedback is appended as:
     { signal, source: "ui", ts }

4. Decide whether to finalize now
   Finalize immediately if:
     - response age >= 5 seconds, or
     - pending signal count >= 3, or
     - signal is explicit/strong:
         format_change_request
         format_keep_request
         content_correction
         regenerate_click
         thumbs_down

   Otherwise return status=queued. The response remains PENDING until another
   signal arrives or the next /turn finalizes it.

5. Finalize response
   - detect composite pattern first
   - otherwise resolve atomic signals:
       UI signals outrank LLM signals
       LLM signals outrank derived-only signals
   - record the winning label in ape_turn_record.signal
   - separately scan all buffered signals plus the winner
   - choose the highest absolute format-relevant normalized reward
   - CAS reward_status from PENDING to APPLIED
   - update the attributed bandit arm if a reward exists
   - refresh cached_ucb for every arm in the cell
```

This design protects bandit learning from noisy satisfaction clicks. For
example, `thumbs_up` can be the final analytics label, but because it has no
format category it does not by itself update the bandit. A co-fired
`format_compliance_pass` can still supply a +0.5 format reward.
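
The separation between the analytics label and the bandit reward can be sketched as two independent functions. The routing table below is hypothetical: only the `+0.5` value for `format_compliance_pass` and the fact that `thumbs_up` has no format category come from the text. The `-1.0` value and the `derived` source on the compliance signal are assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical routing table: signal -> normalized format reward, or None when
# the signal has no format category.
FORMAT_REWARD: Dict[str, Optional[float]] = {
    "thumbs_up": None,
    "format_compliance_pass": 0.5,
    "format_change_request": -1.0,  # assumed value, for illustration only
}

# UI signals outrank LLM signals, which outrank derived-only signals.
SOURCE_RANK = {"ui": 3, "llm": 2, "derived": 1}


def winning_label(signals: List[dict]) -> str:
    """Atomic resolver: pick the signal from the highest-ranked source."""
    return max(signals, key=lambda s: SOURCE_RANK[s["source"]])["signal"]


def format_reward(signals: List[dict]) -> Optional[float]:
    """Pick the format-relevant reward with the largest absolute value."""
    rewards = [FORMAT_REWARD.get(s["signal"]) for s in signals]
    rewards = [r for r in rewards if r is not None]
    return max(rewards, key=abs) if rewards else None


buffered = [
    {"signal": "format_compliance_pass", "source": "derived"},  # assumed source
    {"signal": "thumbs_up", "source": "ui"},
]
label = winning_label(buffered)
reward = format_reward(buffered)
```

Here the label is `thumbs_up` while the reward is `0.5`, taken from the co-fired compliance signal. With `thumbs_up` alone, `format_reward` returns `None` and no bandit update would occur.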

---

## Feedback Statuses

| Status | Meaning |
|---|---|
| `queued` | Signal was buffered; finalization will happen later |
| `applied` | Response finalized and bandit row updated |
| `applied_no_bandit_update` | Response finalized, but no format-relevant reward existed |
| `rejected` | Response missing, wrong user, already finalized, or CAS race lost |
| `skipped` | Signal is unknown and was not buffered |
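
One way to read the table together with step 4 of Path B is as a small decision function. The sketch below is an assumption about ordering and naming, not the service's code: the function and parameter names are hypothetical, and `pending_count` is taken to be the buffer size after the new signal has been appended.

```python
from typing import Optional

STRONG_SIGNALS = {
    "format_change_request",
    "format_keep_request",
    "content_correction",
    "regenerate_click",
    "thumbs_down",
}


def should_finalize(age_seconds: float, pending_count: int, signal: str) -> bool:
    """Step 4: finalize on age, buffer size, or an explicit/strong signal."""
    return age_seconds >= 5 or pending_count >= 3 or signal in STRONG_SIGNALS


def feedback_status(
    target_ok: bool,
    signal_known: bool,
    finalize_now: bool,
    cas_won: bool,
    reward: Optional[float],
) -> str:
    """Map the outcome of one /feedback call to a status string."""
    if not target_ok:  # missing, wrong user, or already finalized
        return "rejected"
    if not signal_known:  # unknown signal, never buffered
        return "skipped"
    if not finalize_now:
        return "queued"
    if not cas_won:  # another finalizer won the PENDING -> APPLIED race
        return "rejected"
    return "applied" if reward is not None else "applied_no_bandit_update"


# A lone thumbs_up one second after the response stays buffered.
status = feedback_status(
    target_ok=True,
    signal_known=True,
    finalize_now=should_finalize(1.0, 2, "thumbs_up"),
    cas_won=True,
    reward=None,
)
```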

---

## Reward State Machine

```text
PENDING
  |
  | queued feedback appends pending_signals
  v
PENDING
  |
  | finalizer applies a label and optional reward
  v
APPLIED

PENDING
  |
  | explicit skip path for non-applicable responses
  v
SKIPPED
```

A PENDING response is safe: no reward has been applied yet, so it cannot
corrupt the bandit. APE is waiting for more evidence or for the next user turn.
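
The transitions can be modeled as a compare-and-set on `reward_status`. The class below is a hypothetical in-memory stand-in for `ape_turn_record`, with a lock playing the role of the store's conditional write. It illustrates why only one finalizer can move a response out of PENDING.

```python
import threading
from typing import Dict

# The only transitions shown in the diagram above.
ALLOWED = {("PENDING", "APPLIED"), ("PENDING", "SKIPPED")}


class TurnLedger:
    """Hypothetical in-memory stand-in for ape_turn_record.reward_status."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}
        self._lock = threading.Lock()

    def write_pending(self, response_id: str) -> None:
        with self._lock:
            self._status[response_id] = "PENDING"

    def cas(self, response_id: str, expected: str, new: str) -> bool:
        """Set the status to `new` only if it is currently `expected`."""
        if (expected, new) not in ALLOWED:
            raise ValueError(f"illegal transition {expected} -> {new}")
        with self._lock:
            if self._status.get(response_id) != expected:
                return False  # already finalized, or unknown response
            self._status[response_id] = new
            return True


ledger = TurnLedger()
ledger.write_pending("r-1")
first = ledger.cas("r-1", "PENDING", "APPLIED")   # wins the race
second = ledger.cas("r-1", "PENDING", "APPLIED")  # loses: no longer PENDING
```

The losing caller sees `False` and must not touch the bandit, which is what keeps a reward from being applied twice.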

---

## Why UCB Cache Refresh Touches the Whole Cell

The UCB exploration term depends on total pulls in the cell:

```text
ucb = avg_reward + c * sqrt(2 * ln(total_cell_pulls) / arm_count)
```

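When one arm receives a reward, the total pull count changes, so every arm's
`cached_ucb` can change. For that reason every finalization that updates the
bandit, whether triggered by `/feedback` (Path B) or by step 5 of `/turn`
(Path A), refreshes the whole cell.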
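
A short sketch makes the dependency concrete. It assumes that `arm_count` in the formula means the number of pulls of the arm being scored, that `c` defaults to `1.0`, and that each arm is a dict with `pulls` and `reward_sum` fields. Those field names and the default are hypothetical; the `999.0` sentinel for unpulled arms comes from Path A.

```python
import math
from typing import Dict, List

UNPULLED_UCB = 999.0  # sentinel for unpulled arms


def refresh_cell(arms: List[Dict[str, float]], c: float = 1.0) -> None:
    """Recompute cached_ucb in place for every arm in one bandit cell."""
    total = sum(arm["pulls"] for arm in arms)  # total_cell_pulls
    for arm in arms:
        if arm["pulls"] == 0:
            arm["cached_ucb"] = UNPULLED_UCB
            continue
        avg = arm["reward_sum"] / arm["pulls"]
        bonus = c * math.sqrt(2 * math.log(total) / arm["pulls"])
        arm["cached_ucb"] = avg + bonus


cell = [
    {"pulls": 4, "reward_sum": 2.0},
    {"pulls": 1, "reward_sum": 0.5},
]
refresh_cell(cell)
before = cell[0]["cached_ucb"]

# Reward arrives for the second arm only; the first arm's own stats are untouched.
cell[1]["pulls"] += 1
cell[1]["reward_sum"] += 0.5
refresh_cell(cell)
after = cell[0]["cached_ucb"]
```

The first arm's score rises even though it received nothing, because `total` grew from 5 to 6. Refreshing only the rewarded arm would leave the other cached values stale.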

---

## See Also

- [03 - Admin config](./03-admin-config.md)
- [09 - API reference](./09-api-reference.md)