File size: 11,648 Bytes
3a64edb | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 | # Optional Modules and Parallel Evaluation Plan
This document describes two optional extension points already preserved in the current TMCRA package:
- Embedder interface
- LLM planner interface
It also summarizes the parallel evaluation pattern used for the S500 baseline so later deployment, evaluation, and ablation runs can reuse the same structure.
## 1. Current Main Path
The frozen S500 baseline uses this core path:
```text
dialogue -> writer layer -> graph memory -> learned node/path scorer -> evidence selection -> answer layer
```
The responsibilities are:
- The writer layer converts dialogue into memory nodes, event units, profile signals, and temporal signals.
- The graph memory layer stores nodes, paths, and tunnel links.
- `node_scorer.pt` and `path_scorer.pt` perform learned node/path scoring.
- Evidence selection converts candidate memories into compact evidence.
- The answer-layer LLM produces the final response from the selected evidence.
Embedder and LLM planner modules are optional enhancement modules. They should not replace the main graph model. They are better treated as auxiliary channels, ablation switches, or higher-cost deployment paths.
## 2. Embedder Interface
The current code exposes three embedder integration points.
### 2.1 Write-time Indexing
At write time, TMCRA can build an embedding index for newly written memory nodes. That index can later serve as an auxiliary candidate source during retrieval.
Relevant configuration:
```bash
export TMCRA_EMBEDDER_MODEL_PATH="BAAI/bge-m3"
export TMCRA_EMBEDDER_DEVICE="cuda"
export TMCRA_EMBEDDER_MODEL_MAX_LENGTH="512"
export TMCRA_WRITE_EMBEDDER_INDEX_MODE="bge_m3"
export TMCRA_WRITE_EMBEDDER_INDEX_MAX_TERMS="96"
```
Purpose:
- Build semantic indexes after the writer stores memory nodes.
- Keep the original graph structure unchanged.
- Add a semantic candidate channel without replacing the learned node/path scorers.
### 2.2 Pre-recall Candidate Expansion
Before graph retrieval, the embedder can find candidate event ids, which are then passed into graph retrieval and scorer ranking.
Relevant configuration:
```bash
export TMCRA_EMBEDDER_PRE_RECALL_MODE="bge_m3"
export TMCRA_EMBEDDER_PRE_RECALL_K="16"
export TMCRA_EMBEDDER_INDEX_RECALL_MODE="bge_m3"
export TMCRA_EMBEDDER_INDEX_RECALL_K="24"
```
Purpose:
- Expand the candidate range before retrieval.
- Help when query wording and memory wording differ.
- Provide an auxiliary path for semantically close memories with weak graph paths.
### 2.3 Post-recall Fusion
After retrieval, embedder-matched events can be fused with graph-model results so semantically relevant nodes receive a limited boost.
Relevant configuration:
```bash
export TMCRA_EMBEDDER_FUSION_MODE="on"
export TMCRA_EMBEDDER_FUSION_WEIGHT="0.35"
export TMCRA_EMBEDDER_FUSION_SCORE_FLOOR="0.62"
export TMCRA_EMBEDDER_FUSION_TOP_K="16"
export TMCRA_EMBEDDER_FUSION_SELECT_K="4"
export TMCRA_EMBEDDER_FUSION_MAX_BOOST="0.42"
```
Purpose:
- Give semantically similar candidates a bounded score boost.
- Prevent the embedder from directly replacing main evidence ranking.
- Use embedding as an auxiliary recall layer for the learned graph scorer.
## 3. LLM Planner Interface
The current code exposes three main LLM planner paths. They run either after retrieval or before query-side retrieval expansion.
### 3.1 Evidence-unit Planner
The evidence-unit planner runs after retrieval and uses an LLM to normalize retrieved windows into evidence units.
Relevant configuration:
```bash
export TMCRA_EVIDENCE_UNIT_PLANNER_MODE="on"
export TMCRA_EVIDENCE_UNIT_PLANNER_BASE_URL="<openai-compatible-base-url>"
export TMCRA_EVIDENCE_UNIT_PLANNER_MODEL="<planner-model>"
export TMCRA_EVIDENCE_UNIT_PLANNER_API_KEY="<planner-api-key>"
export TMCRA_EVIDENCE_UNIT_PLANNER_MAX_CANDIDATES="10"
export TMCRA_EVIDENCE_UNIT_PLANNER_CHARS="1100"
export TMCRA_EVIDENCE_UNIT_PLANNER_MAX_TOKENS="760"
export TMCRA_EVIDENCE_UNIT_PLANNER_REORDER="0"
```
If planner-specific base/model/key values are not configured, this planner inherits answer-layer configuration:
```bash
export TMCRA_ANSWER_BASE_URL="<openai-compatible-base-url>"
export TMCRA_ANSWER_MODEL="<answer-model>"
export TMCRA_ANSWER_API_KEY="<answer-api-key>"
```
Purpose:
- Mark answer units, positive evidence, temporal anchors, current values, old values, constraints, and negative evidence.
- Help the final answer layer understand how to use the retrieved evidence.
- Organize evidence without replacing graph retrieval.
### 3.2 LLM Channel Planner
The LLM channel planner runs before final evidence is sent to the answer layer. It separates main evidence, coverage evidence, support evidence, and suppressed evidence.
Relevant configuration:
```bash
export TMCRA_LLM_CHANNEL_PLANNER_MODE="on"
export TMCRA_LLM_CHANNEL_PLANNER_MAX_WINDOWS="16"
export TMCRA_LLM_CHANNEL_PLANNER_WINDOW_CHARS="520"
export TMCRA_LLM_CHANNEL_PLANNER_MAX_TOKENS="700"
```
Purpose:
- Make coverage evidence supplement main facts instead of replacing them.
- Improve count/sum/ratio/duration/multi-unit tasks.
- Provide a higher-cost quality mode for experiments and selected deployments.
In the frozen S500 baseline record, this module was:
```text
llm_channel_planner=off
```
### 3.3 Query Graph Builder
The query graph builder runs before retrieval. It converts the user question into a compact query graph and can expand that graph into sidecar retrieval queries.
Relevant configuration:
```bash
export TMCRA_QUERY_GRAPH_BUILDER_MODE="on"
export TMCRA_QUERY_GRAPH_BASE_URL="<openai-compatible-base-url>"
export TMCRA_QUERY_GRAPH_MODEL="<query-graph-model>"
export TMCRA_QUERY_GRAPH_API_KEY="<query-graph-api-key>"
export TMCRA_QUERY_GRAPH_MAX_TOKENS="700"
export TMCRA_QUERY_GRAPH_SIDECAR_RETRIEVAL_MODE="on"
export TMCRA_QUERY_GRAPH_SIDECAR_MAX_QUERIES="6"
export TMCRA_QUERY_GRAPH_SIDECAR_TOP_K="4"
```
Purpose:
- Convert the question into task intent, required units, operation, and tunnel needs.
- Give complex multi-session, temporal, and profile questions a clearer retrieval direction.
- Test whether building a query graph before retrieval improves candidate recall.
## 4. Local Model Planner vs LLM Planner
The code also contains local model planner interfaces, for example:
```bash
export TMCRA_ANSWER_WINDOW_PLANNER_MODE="on"
export TMCRA_ANSWER_WINDOW_PLANNER_MODEL_PATH="<planner-checkpoint>"
export TMCRA_UNIFIED_OPERATION_PLANNER_MODE="on"
export TMCRA_UNIFIED_OPERATION_PLANNER_MODEL_PATH="<planner-checkpoint>"
export TMCRA_INJECTION_PLANNER_MODE="guided"
export TMCRA_INJECTION_PLANNER_MODEL_PATH="<planner-checkpoint>"
```
These are local model interfaces, not LLM planner interfaces.
The distinction is:
- LLM planner: calls an external or local LLM; higher cost; useful for validating capability ceilings.
- Local model planner: lower cost and better for productization, but requires targeted training and stability validation.
Recommended workflow:
```text
validate behavior with an LLM planner -> distill or train the useful behavior into the graph model or a local planner head
```
## 5. Parallel Evaluation Plan
The S500 baseline used shard-level parallelism:
```text
500 samples -> 10 shards -> 50 samples per shard
```
Each shard runs independently:
```text
input_shard_N.json -> shard_N/ -> predictions/debug/summary
```
Core parallelization principles:
- One independent process per shard.
- One independent output directory per shard.
- Writer key pool is rotated by shard index.
- Main model weights are read-only and shared.
- Predictions, samples_debug, and judge results are merged after all shards complete.
Key S500 baseline runtime configuration:
```text
samples=500
shards=10
per_shard=50
writer=DeepSeek v4 Flash
answer_layer=GPT5.4
llm_channel_planner=off
history_mode=controlled_answer_plus_distractors
```
### 5.1 Reusable Parallel Template
Recommended baseline template:
```bash
export TMCRA_RETRIEVAL_MODE="hybrid_node_scored"
export TMCRA_REQUIRE_LEARNED_SCORER="1"
export TMCRA_NODE_MODEL_DEVICE="cuda"
export TMCRA_NODE_MODEL_PATH="models/action_frame_tunnel_graph548_tunnel_fusion_train_20260524_042557/node_scorer.pt"
export TMCRA_PATH_MODEL_PATH="models/action_frame_tunnel_graph548_tunnel_fusion_train_20260524_042557/path_scorer.pt"
export TMCRA_WRITER_MODEL="deepseek-chat"
export TMCRA_WRITER_MAX_TOKENS="512"
export TMCRA_WRITER_TIMEOUT_SECONDS="180"
export TMCRA_WRITER_TEMPERATURE="0"
export TMCRA_WRITER_INPUT_MODE="delta"
export TMCRA_WRITER_MAX_PROPOSALS="2"
export TMCRA_ANSWER_MAX_TOKENS="512"
```
Single-shard execution shape:
```bash
python code/run_lme_s10_native_tmcra.py \
--data "<run-root>/input_shard_N.json" \
--repo "<tmcra-repo-root>" \
--service-root "<tmcra-service-root>" \
--out "<run-root>/shard_N" \
--limit 50 \
--top-k 10 \
--max-distractor-sessions 5 \
--max-distractor-chunks 1 \
--max-answer-chunks 4 \
--chunk-chars 7000
```
### 5.2 Suggested Rollout Order
Do not enable all optional modules at once. Use staged A/B testing:
1. **Baseline scorer-only**
- embedder off
- LLM channel planner off
- query graph builder off
- confirms frozen baseline stability
2. **Embedder pre-recall A/B**
- enable write-time indexing and pre-recall candidate expansion only
- measure candidate hit rate, retrieval latency, and error-type shifts
3. **Embedder fusion A/B**
- enable fusion only after pre-recall is stable
- keep boost bounded so embedder does not override the main graph scorer
4. **Evidence-unit planner A/B**
- enable LLM evidence-unit planner
- measure whether the answer layer uses retrieved evidence better
5. **LLM channel planner A/B**
- test mainly on multi/aggregation/temporal error clusters
- verify coverage evidence supplements main facts instead of replacing them
6. **Query graph builder A/B**
- validate the ceiling of query-graph-first retrieval
- if effective, distill the behavior into query-understanding or graph scorer training
### 5.3 Parallel Scale Guidance
Parallelism should not be determined only by the number of API keys. Also watch:
- GPU memory
- CPU memory
- writer latency
- answer-layer latency
- graph ingest / SQLite write overhead
- average writer calls per shard
Scale gradually:
```text
5 shards smoke -> 10 shards stable -> 20 shards stress -> 30 shards only if no memory/API/IO issue
```
If error rate rises, memory drops sharply, API 402/429 appears, chunk errors occur, or shards stall, reduce parallelism first and then resume missing samples.
## 6. Recommended Experiment Matrix
Minimal interpretable matrix:
| experiment | Embedder | LLM planner | purpose |
| --- | --- | --- | --- |
| baseline | off | off | fixed main graph-model baseline |
| embedder-pre | pre-recall on | off | test candidate expansion |
| embedder-fusion | pre-recall + fusion on | off | test semantic fusion |
| evidence-unit | off | evidence-unit on | test pre-answer evidence organization |
| channel-planner | off | channel planner on | test main/coverage separation |
| query-graph | off | query graph on | test query-graph retrieval |
| combined-light | pre-recall on | evidence-unit on | test lower-cost combined path |
| combined-heavy | pre-recall + fusion on | evidence-unit + channel planner on | test capability ceiling |
Each run should preserve:
- predictions
- samples_debug
- judge output
- by-task accuracy
- writer calls
- retrieval latency
- answer latency
- per-sample error type
This makes it possible to separate recall errors, evidence-selection errors, planner errors, answer-layer errors, and parallel-runtime instability.
|