02 - Runtime Paths
Path A generates a response and records attribution. Path B buffers feedback signals and finalizes learning. The shared ledger is
ape_turn_record; the personalized learning table is ape_user_bandit_state.
Path A - /turn
POST /turn handles one user message end to end.
1. Ensure session_id
If the request omits session_id, the orchestrator mints one.
2. Load history
Read recent messages from ape_messages by session_id.
3. Append user message
Write raw user text to ape_messages before generation.
4. Classify
The LLM returns:
- intent
- topic
- signal
5. Finalize previous pending response
If this user/session has a recent PENDING response:
- append classifier signal as source=llm if it is not no_signal
- append session_continue as source=derived if previous response age < 5 min
- run finalization:
  - composite detection
  - atomic resolver fallback
  - strongest format-relevant reward selection
  - CAS PENDING -> APPLIED
  - bandit update if reward exists
  - UCB cache refresh for the whole cell
6. Validate intent
Intent must be ACTIVE in ape_config. Unknown or inactive intents become
unmapped.
7. Resolve candidate strategies
Lookup order:
- exact policy: domain + intent + topic
- default topic policy: domain + intent + _default
- hardcoded fallback catalog
8. Load active instructions
Each candidate strategy needs an ACTIVE instruction. Missing instructions
fall back to safe defaults where available.
9. Load or create bandit cell
The cell is keyed by user_id_hash + domain + intent + topic.
One row exists per strategy.
10. Select strategy
UCB selects argmax(cached_ucb).
Unpulled arms have cached_ucb = 999.0 for natural exploration.
11. Generate answer
The synthesizer receives the selected strategy instruction.
12. Measure format compliance
format_compliance = 1 if rendered_format matches selected_strategy's
expected format; otherwise 0.
13. Write PENDING turn record
The row includes response_id, selected_strategy, rendered_format,
attribution_bandit_pk/sk, and initial pending_signals:
- format_compliance_pass, or
- format_compliance_fail
14. Append assistant message
Assistant message stores response_id so the UI can attach feedback to the
exact response.
The response body returns response_id. The client must echo that id when it
posts feedback.
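Steps 7, 10, and 12 can be sketched in a few lines. This is a minimal illustration, not the orchestrator's actual code: the `POLICIES` table, the `FALLBACK_CATALOG` list, the strategy names, and all function names are hypothetical. Only the lookup order, the argmax over cached_ucb, the 999.0 sentinel, and the 1/0 compliance rule come from the steps above.

```python
from typing import Dict, List, Tuple

UNPULLED_UCB = 999.0  # sentinel from step 10: unpulled arms win the argmax

# Hypothetical policy table keyed by (domain, intent, topic).
POLICIES: Dict[Tuple[str, str, str], List[str]] = {
    ("support", "how_to", "billing"): ["numbered_steps", "short_paragraph"],
    ("support", "how_to", "_default"): ["numbered_steps", "bullet_list"],
}
# Hypothetical hardcoded fallback catalog.
FALLBACK_CATALOG: List[str] = ["short_paragraph"]


def resolve_candidates(domain: str, intent: str, topic: str) -> List[str]:
    """Step 7: exact policy, then the _default topic policy, then the fallback."""
    for key in ((domain, intent, topic), (domain, intent, "_default")):
        if key in POLICIES:
            return POLICIES[key]
    return FALLBACK_CATALOG


def select_strategy(cached_ucb: Dict[str, float]) -> str:
    """Step 10: return the strategy with the largest cached UCB value."""
    return max(cached_ucb, key=lambda name: cached_ucb[name])


def format_compliance(rendered_format: str, expected_format: str) -> int:
    """Step 12: 1 if the rendered format matches the expected one, else 0."""
    return 1 if rendered_format == expected_format else 0
```

Because an unpulled arm carries the 999.0 sentinel, `select_strategy({"numbered_steps": 0.4, "bullet_list": UNPULLED_UCB})` picks `bullet_list` without any separate exploration branch.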
Path A Streaming - /turn/stream
POST /turn/stream has the same learning semantics as /turn. The difference
is delivery: it returns Server-Sent Events while the synthesizer is producing
tokens.
Event sequence:
| Event | When | Payload highlights |
|---|---|---|
| metadata | After strategy selection, before token streaming | response_id, session_id, intent, topic, selected_strategy, candidate_strategies |
| delta | Many times during generation | Raw text chunk from the LLM |
| done | After generation and writes complete | Final answer, rendered_format, format_compliance, selection, classification, timings |
| error | If streaming fails | Error message |
The streaming path preallocates response_id before token streaming so the UI
can associate live feedback controls with the response as soon as metadata
arrives.
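A client might consume this event sequence roughly as follows. The parser handles the standard `event:` / `data:` line framing of Server-Sent Events. The JSON payload encoding and the `text` key on delta events are assumptions made for the example; the wire format of each payload is not specified here.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, Dict]]:
    """Yield (event_name, payload) pairs from text/event-stream lines."""
    event: str = "message"
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":  # a blank line terminates one event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())


def consume(lines: Iterable[str]) -> Tuple[str, str]:
    """Return (response_id, streamed_text) for one streamed turn."""
    response_id, chunks = "", []
    for name, payload in parse_sse(lines):
        if name == "metadata":
            response_id = payload["response_id"]  # feedback controls can bind here
        elif name == "delta":
            chunks.append(payload["text"])  # "text" key is an assumption
        elif name == "error":
            raise RuntimeError(str(payload))
    return response_id, "".join(chunks)
```

The point of the sketch is the ordering: `response_id` is known as soon as the metadata event is parsed, before any delta arrives.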
Path B - /feedback
POST /feedback accepts an explicit UI signal for one response_id.
1. Validate target
- response_id must exist
- user_id must hash to the response owner
- reward_status must still be PENDING
2. Validate signal
The signal must exist in ACTIVE signal_routing config.
3. Append to pending_signals
UI feedback is appended as:
{ signal, source: "ui", ts }
4. Decide whether to finalize now
Finalize immediately if:
- response age >= 5 seconds, or
- pending signal count >= 3, or
- signal is explicit/strong:
  - format_change_request
  - format_keep_request
  - content_correction
  - regenerate_click
  - thumbs_down
Otherwise return status=queued. The response remains PENDING until another
signal arrives or the next /turn finalizes it.
5. Finalize response
- detect composite pattern first
- otherwise resolve atomic signals:
  - UI signals outrank LLM signals
  - LLM signals outrank derived-only signals
- record the winning label in ape_turn_record.signal
- separately scan all buffered signals plus the winner
- choose the highest absolute format-relevant normalized reward
- CAS reward_status from PENDING to APPLIED
- update the attributed bandit arm if a reward exists
- refresh cached_ucb for every arm in the cell
This design protects bandit learning from noisy satisfaction clicks. For
example, thumbs_up can be the final analytics label, but because it has no
format category it does not by itself update the bandit. A co-fired
format_compliance_pass can still supply a +0.5 format reward.
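The split between the analytics label and the bandit reward can be sketched as two independent functions over the same buffer. This is an illustration under stated assumptions: the `SIGNAL_ROUTING` table and its reward values are hypothetical except for the +0.5 on format_compliance_pass, composite detection is omitted, the source attached to the compliance signal is assumed, and ties between signals of the same source are broken by buffer order, which the text does not specify.

```python
from typing import Dict, List, Optional, Tuple

# Source precedence from step 5: UI > LLM > derived.
SOURCE_RANK: Dict[str, int] = {"ui": 3, "llm": 2, "derived": 1}

# Hypothetical routing table: signal -> (category, normalized reward).
SIGNAL_ROUTING: Dict[str, Tuple[Optional[str], float]] = {
    "thumbs_up": (None, 1.0),                     # no format category
    "format_compliance_pass": ("format", 0.5),    # value taken from the text
    "format_compliance_fail": ("format", -0.5),   # assumed value
    "format_change_request": ("format", -1.0),    # assumed value
}


def resolve_label(signals: List[Dict[str, str]]) -> str:
    """Atomic resolver: the signal from the highest-ranked source wins."""
    return max(signals, key=lambda s: SOURCE_RANK[s["source"]])["signal"]


def select_format_reward(signals: List[Dict[str, str]]) -> Optional[float]:
    """Return the format-relevant reward with the largest magnitude, if any."""
    rewards = [
        SIGNAL_ROUTING[s["signal"]][1]
        for s in signals
        if SIGNAL_ROUTING[s["signal"]][0] == "format"
    ]
    return max(rewards, key=abs) if rewards else None
```

With a buffer holding format_compliance_pass and a UI thumbs_up, `resolve_label` returns `thumbs_up` while `select_format_reward` returns `0.5`, which matches the example in the paragraph above. With thumbs_up alone, the reward is `None` and the bandit is left untouched.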
Feedback Statuses
| Status | Meaning |
|---|---|
| queued | Signal was buffered; finalization will happen later |
| applied | Response finalized and bandit row updated |
| applied_no_bandit_update | Response finalized, but no format-relevant reward existed |
| rejected | Response missing, wrong user, already finalized, or CAS race lost |
| skipped | Signal is unknown and was not buffered |
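The queued status corresponds to the case where none of the finalize-now conditions from step 4 of Path B holds. A minimal sketch of that decision follows. The function and constant names are hypothetical, and it assumes the pending signal count is taken after the new signal has been appended.

```python
# Signals listed in step 4 as explicit/strong.
STRONG_SIGNALS = frozenset({
    "format_change_request",
    "format_keep_request",
    "content_correction",
    "regenerate_click",
    "thumbs_down",
})
MIN_AGE_SECONDS = 5.0      # response age threshold from step 4
MAX_PENDING_SIGNALS = 3    # buffered signal threshold from step 4


def should_finalize(signal: str, response_age_s: float, pending_count: int) -> bool:
    """Return True if /feedback should finalize now instead of queueing."""
    return (
        response_age_s >= MIN_AGE_SECONDS
        or pending_count >= MAX_PENDING_SIGNALS
        or signal in STRONG_SIGNALS
    )
```

For example, a thumbs_up that arrives one second after the response with two buffered signals would be queued, while a thumbs_down at the same moment would finalize immediately.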
Reward State Machine
PENDING
|
| queued feedback appends pending_signals
v
PENDING
|
| finalizer applies a label and optional reward
v
APPLIED

PENDING
|
| explicit skip path for non-applicable responses
v
SKIPPED
A PENDING response is safe. It means APE is waiting for more evidence or the next user turn. It does not corrupt the bandit.
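The compare-and-set transition is what keeps two concurrent finalizers from both applying a reward. The sketch below uses an in-memory object with a lock as a stand-in; the `TurnRecord` class is hypothetical, and a real store would presumably use a conditional write on reward_status instead.

```python
import threading
from typing import Dict, List


class TurnRecord:
    """In-memory stand-in for one ape_turn_record row (illustrative only)."""

    def __init__(self) -> None:
        self.reward_status = "PENDING"
        self.pending_signals: List[Dict[str, str]] = []
        self._lock = threading.Lock()

    def append_signal(self, signal: Dict[str, str]) -> bool:
        """Buffer a signal; only allowed while the record is still PENDING."""
        with self._lock:
            if self.reward_status != "PENDING":
                return False
            self.pending_signals.append(signal)
            return True

    def compare_and_set(self, expected: str, new: str) -> bool:
        """Move to `new` only if the current status equals `expected`."""
        with self._lock:
            if self.reward_status != expected:
                return False  # another finalizer won the race
            self.reward_status = new
            return True
```

A finalizer would update the bandit only when `compare_and_set("PENDING", "APPLIED")` returns True; a False result maps to the rejected status for a lost CAS race.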
Why UCB Cache Refresh Touches the Whole Cell
The UCB exploration term depends on total pulls in the cell:
ucb = avg_reward + c * sqrt(2 * ln(total_cell_pulls) / arm_count)
When one arm receives a reward, the total pull count changes, so every arm's
cached_ucb can change. Path B refreshes the whole cell after a bandit update.
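A whole-cell refresh could look like the sketch below. It reads arm_count in the formula as the pull count of the individual arm, which is an assumption, and it keeps the 999.0 sentinel for unpulled arms so the division is never by zero. The cell layout, the field names `pulls` and `reward_sum`, and the default value of `c` are hypothetical.

```python
import math
from typing import Dict

UNPULLED_UCB = 999.0  # sentinel for arms that have never been pulled


def refresh_cell(cell: Dict[str, Dict[str, float]], c: float = 1.0) -> None:
    """Recompute cached_ucb for every arm in one bandit cell, in place."""
    total_cell_pulls = sum(arm["pulls"] for arm in cell.values())
    for arm in cell.values():
        if arm["pulls"] == 0:
            arm["cached_ucb"] = UNPULLED_UCB
            continue
        avg_reward = arm["reward_sum"] / arm["pulls"]
        bonus = c * math.sqrt(2 * math.log(total_cell_pulls) / arm["pulls"])
        arm["cached_ucb"] = avg_reward + bonus


def apply_reward(cell: Dict[str, Dict[str, float]], strategy: str, reward: float) -> None:
    """Update the attributed arm, then refresh the whole cell."""
    cell[strategy]["pulls"] += 1
    cell[strategy]["reward_sum"] += reward
    refresh_cell(cell)
```

After `apply_reward` updates one arm, the cached value of every other pulled arm rises slightly because `total_cell_pulls` grew, which is why refreshing only the rewarded arm would leave stale values in the cell.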