| # 02 - Runtime Paths | |
| > Path A generates a response and records attribution. Path B buffers feedback | |
| > signals and finalizes learning. The shared ledger is `ape_turn_record`; the | |
| > personalized learning table is `ape_user_bandit_state`. | |
| --- | |
| ## Path A - `/turn` | |
| `POST /turn` handles one user message end to end. | |
| ```text | |
| 1. Ensure session_id | |
|    If the request omits session_id, the orchestrator mints one. | |
| 2. Load history | |
|    Read recent messages from ape_messages by session_id. | |
| 3. Append user message | |
|    Write raw user text to ape_messages before generation. | |
| 4. Classify | |
|    LLM returns: | |
|      intent | |
|      topic | |
|      signal | |
| 5. Finalize previous pending response | |
|    If this user/session has a recent PENDING response: | |
|    - append classifier signal as source=llm if it is not no_signal | |
|    - append session_continue as source=derived if previous response age < 5 min | |
|    - run finalization: | |
|        composite detection | |
|        atomic resolver fallback | |
|        strongest format-relevant reward selection | |
|        CAS PENDING -> APPLIED | |
|        bandit update if reward exists | |
|        UCB cache refresh for the whole cell | |
| 6. Validate intent | |
|    Intent must be ACTIVE in ape_config. Unknown or inactive intents become | |
|    unmapped. | |
| 7. Resolve candidate strategies | |
|    Lookup order: | |
|    - exact policy: domain + intent + topic | |
|    - default topic policy: domain + intent + _default | |
|    - hardcoded fallback catalog | |
| 8. Load active instructions | |
|    Each candidate strategy needs an ACTIVE instruction. Missing instructions | |
|    fall back to safe defaults where available. | |
| 9. Load or create bandit cell | |
|    The cell is keyed by user_id_hash + domain + intent + topic. | |
|    One row exists per strategy. | |
| 10. Select strategy | |
|     UCB selects argmax(cached_ucb). | |
|     Unpulled arms have cached_ucb = 999.0 for natural exploration. | |
| 11. Generate answer | |
|     The synthesizer receives the selected strategy instruction. | |
| 12. Measure format compliance | |
|     format_compliance = 1 if rendered_format matches selected_strategy's | |
|     expected format; otherwise 0. | |
| 13. Write PENDING turn record | |
|     The row includes response_id, selected_strategy, rendered_format, | |
|     attribution_bandit_pk/sk, and initial pending_signals: | |
|     - format_compliance_pass, or | |
|     - format_compliance_fail | |
| 14. Append assistant message | |
|     Assistant message stores response_id so the UI can attach feedback to the | |
|     exact response. | |
| ``` | |
| The response body returns `response_id`. The client must echo that id when it | |
| posts feedback. | |
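Steps 7, 9, and 10 above can be sketched in a few lines. This is a minimal illustration, not the orchestrator's actual code: the `POLICIES` table, the `FALLBACK_CATALOG` contents, the strategy names, and all function names are hypothetical. Only the lookup order, the per-strategy rows, and the `999.0` sentinel come from the steps above.

```python
from typing import Dict, List, Tuple

UNPULLED_UCB = 999.0  # sentinel from step 10: unpulled arms win the argmax

# Hypothetical policy table keyed by (domain, intent, topic).
POLICIES: Dict[Tuple[str, str, str], List[str]] = {
    ("support", "how_to", "billing"): ["steps", "table"],
    ("support", "how_to", "_default"): ["steps", "prose"],
}
FALLBACK_CATALOG: List[str] = ["prose"]  # hypothetical hardcoded fallback


def resolve_candidates(domain: str, intent: str, topic: str) -> List[str]:
    """Step 7: exact policy, then the _default topic policy, then the fallback."""
    for key in ((domain, intent, topic), (domain, intent, "_default")):
        if key in POLICIES:
            return POLICIES[key]
    return FALLBACK_CATALOG


def load_or_create_cell(existing: Dict[str, float],
                        candidates: List[str]) -> Dict[str, float]:
    """Step 9: one entry per strategy; new arms start at the unpulled sentinel."""
    return {s: existing.get(s, UNPULLED_UCB) for s in candidates}


def select_strategy(cached_ucb: Dict[str, float]) -> str:
    """Step 10: argmax over cached_ucb; ties go to the first arm listed."""
    return max(cached_ucb, key=cached_ucb.get)


candidates = resolve_candidates("support", "how_to", "shipping")
cell = load_or_create_cell({"steps": 0.7}, candidates)
chosen = select_strategy(cell)  # "prose" has never been pulled, so it is chosen
```

Because an unpulled arm always carries the sentinel value, every candidate is tried at least once before average rewards start to drive selection.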
| --- | |
| ## Path A Streaming - `/turn/stream` | |
| `POST /turn/stream` has the same learning semantics as `/turn`. The difference | |
| is delivery: it returns Server-Sent Events while the synthesizer is producing | |
| tokens. | |
| Event sequence: | |
| | Event | When | Payload highlights | | |
| |---|---|---| | |
| | `metadata` | After strategy selection, before token streaming | `response_id`, `session_id`, `intent`, `topic`, `selected_strategy`, `candidate_strategies` | | |
| | `delta` | Many times during generation | Raw text chunk from the LLM | | |
| | `done` | After generation and writes complete | Final `answer`, `rendered_format`, `format_compliance`, `selection`, `classification`, timings | | |
| | `error` | If streaming fails | Error message | | |
| The streaming path preallocates `response_id` before token streaming so the UI | |
| can associate live feedback controls with the response as soon as metadata | |
| arrives. | |
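A client can consume this event sequence with a small Server-Sent Events parser. The sketch below assumes each `data:` field carries a JSON object and that `delta` payloads expose the chunk under a `text` key; both are assumptions for illustration, as are the function names. Only the event names and the `response_id` and `answer` fields come from the table above.

```python
import json
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) pairs from already-decoded SSE lines."""
    event, data = "message", []  # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates one event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())


def collect_answer(lines: Iterable[str]) -> Tuple[Optional[str], str]:
    """Return (response_id, answer) for one /turn/stream response."""
    response_id: Optional[str] = None
    chunks: List[str] = []
    for event, payload in iter_sse_events(lines):
        if event == "metadata":
            response_id = payload["response_id"]  # available before any token
        elif event == "delta":
            chunks.append(payload["text"])  # assumed key for the text chunk
        elif event == "done":
            return response_id, payload["answer"]
        elif event == "error":
            raise RuntimeError(payload.get("message", "stream failed"))
    return response_id, "".join(chunks)  # stream ended without a done event


stream = [
    "event: metadata", 'data: {"response_id": "r1"}', "",
    "event: delta", 'data: {"text": "Hel"}', "",
    "event: delta", 'data: {"text": "lo"}', "",
    "event: done", 'data: {"answer": "Hello"}', "",
]
result = collect_answer(stream)
```

Reading `response_id` from the `metadata` event is what lets the UI wire up feedback controls before generation finishes.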
| --- | |
| ## Path B - `/feedback` | |
| `POST /feedback` accepts an explicit UI signal for one `response_id`. | |
| ```text | |
| 1. Validate target | |
|    - response_id must exist | |
|    - user_id must hash to the response owner | |
|    - reward_status must still be PENDING | |
| 2. Validate signal | |
|    The signal must exist in ACTIVE signal_routing config. | |
| 3. Append to pending_signals | |
|    UI feedback is appended as: | |
|      { signal, source: "ui", ts } | |
| 4. Decide whether to finalize now | |
|    Finalize immediately if: | |
|    - response age >= 5 seconds, or | |
|    - pending signal count >= 3, or | |
|    - signal is explicit/strong: | |
|        format_change_request | |
|        format_keep_request | |
|        content_correction | |
|        regenerate_click | |
|        thumbs_down | |
|    Otherwise return status=queued. The response remains PENDING until another | |
|    signal arrives or the next /turn finalizes it. | |
| 5. Finalize response | |
|    - detect composite pattern first | |
|    - otherwise resolve atomic signals: | |
|        UI signals outrank LLM signals | |
|        LLM signals outrank derived-only signals | |
|    - record the winning label in ape_turn_record.signal | |
|    - separately scan all buffered signals plus the winner | |
|    - choose the highest absolute format-relevant normalized reward | |
|    - CAS reward_status from PENDING to APPLIED | |
|    - update the attributed bandit arm if a reward exists | |
|    - refresh cached_ucb for every arm in the cell | |
| ``` | |
| This design protects bandit learning from noisy satisfaction clicks. For | |
| example, `thumbs_up` can be the final analytics label, but because it has no | |
| format category it does not by itself update the bandit. A co-fired | |
| `format_compliance_pass` can still supply a +0.5 format reward. | |
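The split between the analytics label and the bandit reward can be sketched as follows. This is an illustration under stated assumptions: every value in `FORMAT_REWARDS` except the +0.5 for `format_compliance_pass` is hypothetical, the `derived` source on the compliance signal and the latest-timestamp tie-break are assumptions, and composite detection is omitted.

```python
from typing import Dict, List, Optional

STRONG_SIGNALS = {
    "format_change_request", "format_keep_request",
    "content_correction", "regenerate_click", "thumbs_down",
}
SOURCE_RANK = {"ui": 3, "llm": 2, "derived": 1}  # UI > LLM > derived

# Hypothetical normalized rewards for format-relevant signals.
# Only the +0.5 for format_compliance_pass is stated in the text.
FORMAT_REWARDS: Dict[str, float] = {
    "format_compliance_pass": 0.5,
    "format_compliance_fail": -0.5,
    "format_change_request": -1.0,
    "format_keep_request": 1.0,
}


def should_finalize_now(age_seconds: float, pending_count: int, signal: str) -> bool:
    """Step 4: finalize on age, buffer size, or an explicit/strong signal."""
    return age_seconds >= 5 or pending_count >= 3 or signal in STRONG_SIGNALS


def resolve_atomic(pending: List[dict]) -> str:
    """Step 5 label: highest-ranked source wins; ties go to the latest ts (assumed)."""
    winner = max(pending, key=lambda s: (SOURCE_RANK[s["source"]], s["ts"]))
    return winner["signal"]


def pick_format_reward(pending: List[dict]) -> Optional[float]:
    """Step 5 reward: largest absolute format-relevant reward, or None."""
    rewards = [FORMAT_REWARDS[s["signal"]] for s in pending
               if s["signal"] in FORMAT_REWARDS]
    return max(rewards, key=abs) if rewards else None


pending = [
    {"signal": "format_compliance_pass", "source": "derived", "ts": 1},
    {"signal": "thumbs_up", "source": "ui", "ts": 2},
]
label = resolve_atomic(pending)       # "thumbs_up" becomes the analytics label
reward = pick_format_reward(pending)  # 0.5 comes from format_compliance_pass
```

Keeping the two functions separate mirrors the design: the label records what the user said, while the reward only reflects signals that say something about format.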
| --- | |
| ## Feedback Statuses | |
| | Status | Meaning | | |
| |---|---| | |
| | `queued` | Signal was buffered; finalization will happen later | | |
| | `applied` | Response finalized and bandit row updated | | |
| | `applied_no_bandit_update` | Response finalized, but no format-relevant reward existed | | |
| | `rejected` | Response missing, wrong user, already finalized, or CAS race lost | | |
| | `skipped` | Signal is unknown and was not buffered | | |
| --- | |
| ## Reward State Machine | |
| ```text | |
| PENDING | |
| | | |
| | queued feedback appends pending_signals | |
| v | |
| PENDING | |
| | | |
| | finalizer applies a label and optional reward | |
| v | |
| APPLIED | |
| PENDING | |
| | | |
| | explicit skip path for non-applicable responses | |
| v | |
| SKIPPED | |
| ``` | |
| A PENDING response is safe. It means APE is waiting for more evidence or the | |
| next user turn. No reward has been applied yet, so it cannot corrupt the bandit. | |
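The PENDING to APPLIED transition relies on compare-and-swap so that only one finalizer wins. The sketch below uses an in-memory stand-in for `ape_turn_record`; the `TurnLedger` class and its methods are hypothetical, and a real store would use its own conditional-write primitive instead of a lock.

```python
import threading
from typing import Dict, Optional


class TurnLedger:
    """In-memory stand-in for ape_turn_record (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status: Dict[str, str] = {}

    def write_pending(self, response_id: str) -> None:
        with self._lock:
            self._status[response_id] = "PENDING"

    def compare_and_set(self, response_id: str, expected: str, new: str) -> bool:
        """Move to `new` only if the current status equals `expected`."""
        with self._lock:
            if self._status.get(response_id) != expected:
                return False
            self._status[response_id] = new
            return True


def finalize(ledger: TurnLedger, response_id: str, reward: Optional[float]) -> str:
    """Return the feedback status for one finalization attempt."""
    if not ledger.compare_and_set(response_id, "PENDING", "APPLIED"):
        return "rejected"  # missing, already finalized, or CAS race lost
    # The bandit update would happen here, and only when a reward exists.
    return "applied" if reward is not None else "applied_no_bandit_update"


ledger = TurnLedger()
ledger.write_pending("r1")
first = finalize(ledger, "r1", 0.5)   # wins the CAS
second = finalize(ledger, "r1", 0.5)  # loses: status is already APPLIED
```

Because the status check and the write happen as one step, a `/feedback` call and the next `/turn` cannot both apply a reward for the same response.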
| --- | |
| ## Why UCB Cache Refresh Touches the Whole Cell | |
| The UCB exploration term depends on total pulls in the cell: | |
| ```text | |
| ucb = avg_reward + c * sqrt(2 * ln(total_cell_pulls) / arm_count) | |
| ``` | |
| When one arm receives a reward, the total pull count changes, so every arm's | |
| `cached_ucb` can change. The finalizer, whether reached from Path B or from | |
| step 5 of Path A, therefore refreshes the whole cell after a bandit update. | |
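A small numeric sketch shows why one update moves every arm. It assumes `arm_count` is the pull count of the arm being scored, `total_cell_pulls` is the sum of pulls across the cell, and `c = 1.0`; the row layout and function names are hypothetical.

```python
import math
from typing import Dict

UNPULLED_UCB = 999.0  # sentinel for arms that have never been pulled


def ucb(avg_reward: float, arm_count: int, total_cell_pulls: int, c: float = 1.0) -> float:
    """Score one arm; unpulled arms get the sentinel instead of dividing by zero."""
    if arm_count == 0:
        return UNPULLED_UCB
    return avg_reward + c * math.sqrt(2 * math.log(total_cell_pulls) / arm_count)


def refresh_cell(cell: Dict[str, dict], c: float = 1.0) -> None:
    """Recompute cached_ucb for every arm from the cell-wide pull total."""
    total = sum(arm["arm_count"] for arm in cell.values())
    for arm in cell.values():
        arm["cached_ucb"] = ucb(arm["avg_reward"], arm["arm_count"], total, c)


cell = {
    "steps": {"avg_reward": 0.5, "arm_count": 1},
    "prose": {"avg_reward": 0.0, "arm_count": 0},
}
refresh_cell(cell)
before = cell["steps"]["cached_ucb"]  # ln(1) = 0, so only avg_reward remains

# A reward lands on "prose"; "steps" itself is untouched.
cell["prose"]["arm_count"] = 1
refresh_cell(cell)
after = cell["steps"]["cached_ucb"]   # grows because total_cell_pulls is now 2
```

Refreshing only the rewarded arm would leave `steps` with a stale score, which is why the refresh covers the whole cell.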
| --- | |
| ## See Also | |
| - [03 - Admin config](./03-admin-config.md) | |
| - [09 - API reference](./09-api-reference.md) | |