# Eval Framework Output Format

## Output Directory Structure

After a run completes, the `--output-dir` directory contains 5 files:

```
output-dir/
├── pipeline_sessions.jsonl     # Stage 1 checkpoint: intermediate pipeline results (session level)
├── pipeline_qa.jsonl           # Stage 1 checkpoint: intermediate pipeline results (QA level)
├── session_records.jsonl       # Final result: session pipeline data + eval judgments
├── qa_records.jsonl            # Final result: QA pipeline data + eval judgments
└── aggregate_metrics.json      # Final result: baseline-level aggregate metrics
```
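
A quick way to confirm that a run finished is to check for the five files listed above. The following is a minimal sketch, not part of the framework: the helper name `missing_outputs` is hypothetical, and only the file names come from the listing.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = (
    "pipeline_sessions.jsonl",
    "pipeline_qa.jsonl",
    "session_records.jsonl",
    "qa_records.jsonl",
    "aggregate_metrics.json",
)


def missing_outputs(output_dir):
    """Return the expected file names that are absent from output_dir.

    Hypothetical helper for illustration; the framework itself may not
    provide an equivalent check.
    """
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all five files are present; if only the two `pipeline_*.jsonl` checkpoints exist, the eval stage probably did not complete.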

## File Details

### 1. `session_records.jsonl`

One session per line, containing the raw pipeline data and the `eval` judgment results:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "session_id": "S01",
  "memory_snapshot": [
    {
      "memory_id": "3",
      "text": "user: OBSERVATION: ...\nassistant: THOUGHT: ...",
      "session_id": "S01",
      "status": "active",
      "source": "FUMemory",
      "raw_backend_id": "3",
      "raw_backend_type": "linear",
      "metadata": {}
    }
  ],
  "memory_delta": [
    {
      "session_id": "S01",
      "op": "add",
      "text": "user: OBSERVATION: ...",
      "linked_previous": [],
      "raw_backend_id": "3",
      "metadata": {"baseline": "FUMemory"}
    }
  ],
  "gold_state": {
    "session_id": "S01",
    "cumulative_gold_memories": [...],
    "session_new_memories": [...],
    "session_update_memories": [...],
    "session_interference_memories": []
  },
  "eval": {
    "session_id": "S01",
    "recall": 0.8,
    "covered_count": 4,
    "num_gold": 5,
    "update_recall": 1.0,
    "update_covered_count": 2,
    "update_total": 2,
    "recall_reasoning": "4 of 5 gold points are covered...",
    "correctness_rate": 0.75,
    "num_memories": 8,
    "num_correct": 6,
    "num_hallucination": 1,
    "num_irrelevant": 1,
    "correctness_reasoning": "...",
    "correctness_records": [
      {"id": 1, "label": "correct"},
      {"id": 2, "label": "hallucination"}
    ],
    "update_score": 1.0,
    "update_num_updated": 2,
    "update_num_both": 0,
    "update_num_outdated": 0,
    "update_total_items": 2,
    "update_records": [
      {"memory_id": "mp_S08_3", "label": "updated", "reasoning": "..."}
    ],
    "interference_score": null,
    "interference_num_rejected": 0,
    "interference_num_memorized": 0,
    "interference_total_items": 0,
    "interference_records": []
  }
}
```

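**`eval` fields:**

| Field | Meaning |
|------|------|
| `recall` | Fraction of this session's gold points covered by the delta (0-1) |
| `update_recall` | Fraction of update-type gold points covered |
| `correctness_rate` | Fraction of memories in the delta that are correct |
| `num_hallucination` | Number of hallucinated memories in the delta |
| `num_irrelevant` | Number of irrelevant memories in the delta |
| `update_score` | Update-handling score (updated=1.0, both=0.5, outdated=0.0) |
| `interference_score` | Interference-rejection score (rejected=1.0, memorized=0.0); `null` in the sample above, where `interference_total_items` is 0 |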
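
The per-label weights for `update_score` and `interference_score` suggest a weighted mean over the judged items. The sketch below assumes exactly that, and assumes the score is `None` (JSON `null`) when there are no items, which matches the sample record. The function and constant names are hypothetical; the framework's actual computation may differ.

```python
# Label weights as stated in the field table.
UPDATE_WEIGHTS = {"updated": 1.0, "both": 0.5, "outdated": 0.0}
INTERFERENCE_WEIGHTS = {"rejected": 1.0, "memorized": 0.0}


def weighted_score(counts, weights):
    """Weighted mean of label counts; None when no items were judged.

    Assumption: the score is sum(weight * count) / total items.
    """
    total = sum(counts.get(label, 0) for label in weights)
    if total == 0:
        return None
    return sum(w * counts.get(label, 0) for label, w in weights.items()) / total


# Counts from the sample session record above.
print(weighted_score({"updated": 2, "both": 0, "outdated": 0}, UPDATE_WEIGHTS))  # 1.0
print(weighted_score({"rejected": 0, "memorized": 0}, INTERFERENCE_WEIGHTS))     # None
```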

### 2. `qa_records.jsonl`

One QA question per line, containing the retrieval results, the model's answer, and the judgment:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "checkpoint_id": "probe_e980c238",
  "question": "What was in the agent's inventory at step 1?",
  "gold_answer": "At step 1, the agent's inventory was empty.",
  "gold_evidence_memory_ids": ["mp_S04_1"],
  "gold_evidence_contents": ["The agent started with empty inventory"],
  "question_type": "factual_recall",
  "question_type_abbrev": "FR",
  "difficulty": "easy",
  "retrieval": {
    "query": "What was in the agent's inventory at step 1?",
    "top_k": 5,
    "items": [
      {
        "rank": 0,
        "memory_id": "memgallery:string_bundle",
        "text": "user: OBSERVATION: Your Inventory: ...",
        "score": 1.0,
        "raw_backend_id": null
      }
    ],
    "raw_trace": {"baseline": "FUMemory"}
  },
  "generated_answer": "The agent's inventory was empty at step 1.",
  "cited_memories": ["user: OBSERVATION: Inventory: nothing"],
  "eval": {
    "answer_label": "Correct",
    "answer_reasoning": "The response matches the reference answer...",
    "answer_is_valid": true,
    "evidence_hit_rate": 1.0,
    "evidence_covered_count": 1,
    "num_evidence": 1,
    "evidence_reasoning": "The cited memory covers the gold evidence...",
    "num_cited_memories": 1
  }
}
```

**`eval` fields:**

| Field | Meaning |
|------|------|
| `answer_label` | `Correct` / `Hallucination` / `Omission` |
| `answer_is_valid` | Whether the judgment succeeded (i.e. no LLM error occurred) |
| `evidence_hit_rate` | Fraction of gold evidence covered by the cited memories (0-1) |
| `evidence_covered_count` | Number of gold evidence items covered |
| `num_cited_memories` | Number of memories the model cited in its answer |
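
Because `answer_is_valid` marks judgments that failed, it can make sense to drop those records before computing label shares. The following is a minimal sketch under that assumption. The helper `label_ratios` is hypothetical, the toy records carry only the two fields used, and the `None` label on the invalid record is a guess at what a failed judgment might look like.

```python
from collections import Counter

LABELS = ("Correct", "Hallucination", "Omission")


def label_ratios(records):
    """Share of each answer label among records whose judgment succeeded."""
    valid = [r for r in records if r["eval"]["answer_is_valid"]]
    if not valid:
        return {}
    counts = Counter(r["eval"]["answer_label"] for r in valid)
    return {label: counts[label] / len(valid) for label in LABELS}


# Toy records for illustration only; values are made up.
records = [
    {"eval": {"answer_label": "Correct", "answer_is_valid": True}},
    {"eval": {"answer_label": "Omission", "answer_is_valid": True}},
    {"eval": {"answer_label": None, "answer_is_valid": False}},
]
print(label_ratios(records))
# {'Correct': 0.5, 'Hallucination': 0.0, 'Omission': 0.5}
```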

### 3. `aggregate_metrics.json`

Aggregate metrics at the baseline level, across 6 dimensions:

```json
{
  "baseline_id": "FUMemory",
  "memory_recall": {
    "avg_recall": 0.72,
    "avg_update_recall": 0.65,
    "num_sessions_with_recall": 110,
    "num_sessions_with_update": 85,
    "total_covered": 320,
    "total_gold": 445
  },
  "memory_correctness": {
    "avg_correctness": 0.81,
    "avg_hallucination": 0.08,
    "avg_irrelevant": 0.11,
    "num_sessions": 110,
    "total_memories": 1200,
    "total_correct": 972,
    "total_hallucination": 96,
    "total_irrelevant": 132
  },
  "update_handling": {
    "score": 0.65,
    "num_updated": 52,
    "num_both": 18,
    "num_outdated": 15,
    "num_total": 85
  },
  "interference_rejection": {
    "score": 0.0,
    "num_rejected": 0,
    "num_memorized": 0,
    "num_total": 0
  },
  "question_answering": {
    "correct_ratio": 0.58,
    "hallucination_ratio": 0.22,
    "omission_ratio": 0.20,
    "num_total": 990,
    "num_valid": 990
  },
  "evidence_coverage": {
    "hit_rate": 0.43,
    "num_covered": 425,
    "num_total": 990
  }
}
```

**The 6 dimensions** (↑ higher is better, ↓ lower is better):

| Dimension | Aggregation | Key metrics | Direction |
|------|---------|---------|------|
| Memory Recall | Averaged per session | `avg_recall` | ↑ |
| Memory Correctness | Averaged per session | `avg_correctness`, `avg_hallucination` | ↑, ↓ |
| Update Handling | Pooled across sessions | `score` | ↑ |
| Interference Rejection | Pooled across sessions | `score` | ↑ |
| Question Answering | Pooled across questions | `correct_ratio`, `hallucination_ratio` | ↑, ↓ |
| Evidence Coverage | Pooled across questions | `hit_rate` | ↑ |
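
The two aggregation styles in the table can give different numbers for the same data. The sketch below illustrates the general difference with toy recall counts. The field names follow `session_records.jsonl`, while the numbers and both function names are made up for illustration; this is not the framework's own aggregation code.

```python
# Toy per-session counts; values are illustrative only.
sessions = [
    {"covered_count": 4, "num_gold": 5},
    {"covered_count": 1, "num_gold": 1},
]


def per_session_average(rows):
    """Mean of each session's own ratio: every session weighs the same."""
    return sum(r["covered_count"] / r["num_gold"] for r in rows) / len(rows)


def pooled_ratio(rows):
    """Sum the counts first, then divide: every gold point weighs the same."""
    covered = sum(r["covered_count"] for r in rows)
    gold = sum(r["num_gold"] for r in rows)
    return covered / gold


print(f"{per_session_average(sessions):.3f}")  # 0.900
print(f"{pooled_ratio(sessions):.3f}")         # 0.833
```

Per-session averaging lets a short session count as much as a long one, whereas pooling weights each item equally, so sessions with many items dominate.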

### 4. `pipeline_sessions.jsonl` / `pipeline_qa.jsonl`

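These are the Stage 1 checkpoint files. Their structure is the same as `session_records.jsonl` / `qa_records.jsonl`, except that they **do not contain the `eval` field**.

Purpose: `--eval-only` mode skips the pipeline, restores directly from the checkpoints, and reruns only the eval stage. A typical scenario: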

```bash
# First full run
python -m eval_framework.cli --dataset ... --baseline FUMemory --output-dir results/FU

# Re-evaluate with a different judge model (without rerunning the pipeline)
OPENAI_MODEL=gpt-4o-mini python -m eval_framework.cli \
    --dataset ... --baseline FUMemory --output-dir results/FU --eval-only
```
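
Since a checkpoint row is described as a final record without `eval`, the relationship can be spot-checked after a run. This is a minimal sketch that assumes the remaining fields are identical in both files; both helper names are hypothetical.

```python
def strip_eval(record):
    """Return a copy of a final record without its `eval` field."""
    return {key: value for key, value in record.items() if key != "eval"}


def matches_checkpoint(final_record, checkpoint_record):
    """True if the checkpoint row equals the final row minus `eval`.

    Assumption: apart from `eval`, both files carry the same fields.
    """
    return (
        "eval" not in checkpoint_record
        and strip_eval(final_record) == checkpoint_record
    )


# Toy rows with a small subset of fields, for illustration only.
final = {"session_id": "S01", "memory_delta": [], "eval": {"recall": 0.8}}
checkpoint = {"session_id": "S01", "memory_delta": []}
print(matches_checkpoint(final, checkpoint))  # True
```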

## Result Analysis Example

```python
import json

# Load the aggregate metrics (same --output-dir as in the commands above)
with open("results/FU/aggregate_metrics.json", encoding="utf-8") as f:
    agg = json.load(f)
print(f"Recall: {agg['memory_recall']['avg_recall']:.2%}")
print(f"QA Correct: {agg['question_answering']['correct_ratio']:.2%}")

# Break down QA accuracy by question type
qa_by_type = {}
with open("results/FU/qa_records.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        qt = rec["question_type_abbrev"]
        label = rec["eval"]["answer_label"]
        qa_by_type.setdefault(qt, []).append(label)

for qt, labels in sorted(qa_by_type.items()):
    correct = sum(1 for label in labels if label == "Correct")
    print(f"  {qt}: {correct}/{len(labels)} = {correct/len(labels):.0%}")
```