# Eval Framework Output Format

## Output Directory Structure

After a run completes, `--output-dir` contains five files:

```
output-dir/
├── pipeline_sessions.jsonl     # Stage 1 checkpoint: intermediate pipeline results (session level)
├── pipeline_qa.jsonl           # Stage 1 checkpoint: intermediate pipeline results (QA level)
├── session_records.jsonl       # Final result: session pipeline data + eval judgments
├── qa_records.jsonl            # Final result: QA pipeline data + eval judgments
└── aggregate_metrics.json      # Final result: baseline-level summary metrics
```
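
As a quick sanity check after a run, the listing above can be turned into a small script. This is a minimal sketch and not part of the framework: `EXPECTED_FILES` and `missing_outputs` are hypothetical names, and `results/FU` is only an example path.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_FILES = [
    "pipeline_sessions.jsonl",
    "pipeline_qa.jsonl",
    "session_records.jsonl",
    "qa_records.jsonl",
    "aggregate_metrics.json",
]


def missing_outputs(output_dir):
    """Return the expected file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Example path (assumption); replace with your own --output-dir.
missing = missing_outputs("results/FU")
print("all outputs present" if not missing else f"missing: {missing}")
```

If only the two `pipeline_*.jsonl` files are present, the run most likely stopped before the eval stage finished.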

## File Details

### 1. `session_records.jsonl`

One session per line, containing the raw pipeline data and the `eval` judgment results:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "session_id": "S01",
  "memory_snapshot": [
    {
      "memory_id": "3",
      "text": "user: OBSERVATION: ...\nassistant: THOUGHT: ...",
      "session_id": "S01",
      "status": "active",
      "source": "FUMemory",
      "raw_backend_id": "3",
      "raw_backend_type": "linear",
      "metadata": {}
    }
  ],
  "memory_delta": [
    {
      "session_id": "S01",
      "op": "add",
      "text": "user: OBSERVATION: ...",
      "linked_previous": [],
      "raw_backend_id": "3",
      "metadata": {"baseline": "FUMemory"}
    }
  ],
  "gold_state": {
    "session_id": "S01",
    "cumulative_gold_memories": [...],
    "session_new_memories": [...],
    "session_update_memories": [...],
    "session_interference_memories": []
  },
  "eval": {
    "session_id": "S01",
    "recall": 0.8,
    "covered_count": 4,
    "num_gold": 5,
    "update_recall": 1.0,
    "update_covered_count": 2,
    "update_total": 2,
    "recall_reasoning": "4 of 5 gold points are covered...",
    "correctness_rate": 0.75,
    "num_memories": 8,
    "num_correct": 6,
    "num_hallucination": 1,
    "num_irrelevant": 1,
    "correctness_reasoning": "...",
    "correctness_records": [
      {"id": 1, "label": "correct"},
      {"id": 2, "label": "hallucination"}
    ],
    "update_score": 1.0,
    "update_num_updated": 2,
    "update_num_both": 0,
    "update_num_outdated": 0,
    "update_total_items": 2,
    "update_records": [
      {"memory_id": "mp_S08_3", "label": "updated", "reasoning": "..."}
    ],
    "interference_score": null,
    "interference_num_rejected": 0,
    "interference_num_memorized": 0,
    "interference_total_items": 0,
    "interference_records": []
  }
}
```

**`eval` fields:**

| Field | Meaning |
|------|------|
| `recall` | Fraction of this session's gold points covered by the memory delta (0-1) |
| `update_recall` | Coverage fraction for update-type gold points |
| `correctness_rate` | Fraction of memories in the delta that are correct |
| `num_hallucination` | Number of hallucinated memories in the delta |
| `num_irrelevant` | Number of irrelevant memories in the delta |
| `update_score` | Update-handling score (updated=1.0, both=0.5, outdated=0.0) |
| `interference_score` | Interference-rejection score (rejected=1.0, memorized=0.0) |
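
The rate fields look like simple ratios of the count fields in the same `eval` object, and the sample record is consistent with that reading (4/5 = 0.8, 6/8 = 0.75). The sketch below recomputes them under that assumption. The helper names `ratio` and `expected_session_scores` are hypothetical, and returning `None` for an empty denominator is modelled on the `"interference_score": null` in the sample rather than on the framework's source.

```python
# Per-label weights as listed in the table above.
UPDATE_WEIGHTS = {"updated": 1.0, "both": 0.5, "outdated": 0.0}


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def expected_session_scores(ev):
    """Recompute the rate fields of one `eval` dict from its count fields."""
    weighted_updates = (
        ev["update_num_updated"] * UPDATE_WEIGHTS["updated"]
        + ev["update_num_both"] * UPDATE_WEIGHTS["both"]
        + ev["update_num_outdated"] * UPDATE_WEIGHTS["outdated"]
    )
    return {
        "recall": ratio(ev["covered_count"], ev["num_gold"]),
        "update_recall": ratio(ev["update_covered_count"], ev["update_total"]),
        "correctness_rate": ratio(ev["num_correct"], ev["num_memories"]),
        "update_score": ratio(weighted_updates, ev["update_total_items"]),
        "interference_score": ratio(
            ev["interference_num_rejected"], ev["interference_total_items"]
        ),
    }


# Count fields copied from the sample record above.
sample_eval = {
    "covered_count": 4, "num_gold": 5,
    "update_covered_count": 2, "update_total": 2,
    "num_correct": 6, "num_memories": 8,
    "update_num_updated": 2, "update_num_both": 0,
    "update_num_outdated": 0, "update_total_items": 2,
    "interference_num_rejected": 0, "interference_total_items": 0,
}
print(expected_session_scores(sample_eval))
```

Comparing these recomputed values against the stored ones is a cheap way to spot records where the judge output was parsed inconsistently.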

### 2. `qa_records.jsonl`

One QA question per line, containing the retrieval results, the model's answer, and the judgment:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "checkpoint_id": "probe_e980c238",
  "question": "What was in the agent's inventory at step 1?",
  "gold_answer": "At step 1, the agent's inventory was empty.",
  "gold_evidence_memory_ids": ["mp_S04_1"],
  "gold_evidence_contents": ["The agent started with empty inventory"],
  "question_type": "factual_recall",
  "question_type_abbrev": "FR",
  "difficulty": "easy",
  "retrieval": {
    "query": "What was in the agent's inventory at step 1?",
    "top_k": 5,
    "items": [
      {
        "rank": 0,
        "memory_id": "memgallery:string_bundle",
        "text": "user: OBSERVATION: Your Inventory: ...",
        "score": 1.0,
        "raw_backend_id": null
      }
    ],
    "raw_trace": {"baseline": "FUMemory"}
  },
  "generated_answer": "The agent's inventory was empty at step 1.",
  "cited_memories": ["user: OBSERVATION: Inventory: nothing"],
  "eval": {
    "answer_label": "Correct",
    "answer_reasoning": "The response matches the reference answer...",
    "answer_is_valid": true,
    "evidence_hit_rate": 1.0,
    "evidence_covered_count": 1,
    "num_evidence": 1,
    "evidence_reasoning": "The cited memory covers the gold evidence...",
    "num_cited_memories": 1
  }
}
```

**`eval` fields:**

| Field | Meaning |
|------|------|
| `answer_label` | `Correct` / `Hallucination` / `Omission` |
| `answer_is_valid` | Whether the judgment succeeded (i.e. no LLM error) |
| `evidence_hit_rate` | Fraction of gold evidence covered by the cited memories (0-1) |
| `evidence_covered_count` | Number of gold evidence items covered |
| `num_cited_memories` | Number of memories the model cited in its answer |
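
One way to use these fields together is to tally answer labels over valid judgments and to flag answers that are `Correct` but only partly supported by the cited memories. The following is a sketch under assumptions: `summarize_qa` is a hypothetical helper, the demo records are made up for illustration, and excluding records with `answer_is_valid` set to `false` is a choice made here, not confirmed framework behavior.

```python
from collections import Counter


def summarize_qa(records):
    """Tally labels over valid records and list weakly grounded correct answers."""
    valid = [r for r in records if r["eval"]["answer_is_valid"]]
    label_counts = Counter(r["eval"]["answer_label"] for r in valid)
    weakly_grounded = []
    for r in valid:
        hit_rate = r["eval"]["evidence_hit_rate"]
        # Correct answer, but the citations cover only part of the gold evidence.
        if (r["eval"]["answer_label"] == "Correct"
                and hit_rate is not None and hit_rate < 1.0):
            weakly_grounded.append(r["checkpoint_id"])
    return label_counts, weakly_grounded


# Made-up records containing only the fields this sketch reads.
demo_records = [
    {"checkpoint_id": "probe_a", "eval": {
        "answer_label": "Correct", "answer_is_valid": True, "evidence_hit_rate": 1.0}},
    {"checkpoint_id": "probe_b", "eval": {
        "answer_label": "Correct", "answer_is_valid": True, "evidence_hit_rate": 0.5}},
    {"checkpoint_id": "probe_c", "eval": {
        "answer_label": "Omission", "answer_is_valid": True, "evidence_hit_rate": 0.0}},
    {"checkpoint_id": "probe_d", "eval": {
        "answer_label": "Hallucination", "answer_is_valid": False, "evidence_hit_rate": 0.0}},
]

label_counts, weakly_grounded = summarize_qa(demo_records)
print(dict(label_counts))
print(weakly_grounded)
```

For real data, each line of `qa_records.jsonl` can be parsed with `json.loads` and the resulting list passed to the same function.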

### 3. `aggregate_metrics.json`

Six-dimension summary metrics at the baseline level:

```json
{
  "baseline_id": "FUMemory",
  "memory_recall": {
    "avg_recall": 0.72,
    "avg_update_recall": 0.65,
    "num_sessions_with_recall": 110,
    "num_sessions_with_update": 85,
    "total_covered": 320,
    "total_gold": 445
  },
  "memory_correctness": {
    "avg_correctness": 0.81,
    "avg_hallucination": 0.08,
    "avg_irrelevant": 0.11,
    "num_sessions": 110,
    "total_memories": 1200,
    "total_correct": 972,
    "total_hallucination": 96,
    "total_irrelevant": 132
  },
  "update_handling": {
    "score": 0.65,
    "num_updated": 52,
    "num_both": 18,
    "num_outdated": 15,
    "num_total": 85
  },
  "interference_rejection": {
    "score": 0.0,
    "num_rejected": 0,
    "num_memorized": 0,
    "num_total": 0
  },
  "question_answering": {
    "correct_ratio": 0.58,
    "hallucination_ratio": 0.22,
    "omission_ratio": 0.20,
    "num_total": 990,
    "num_valid": 990
  },
  "evidence_coverage": {
    "hit_rate": 0.43,
    "num_covered": 425,
    "num_total": 990
  }
}
```

**The six dimensions** (↑ higher is better, ↓ lower is better):

| Dimension | Aggregation | Key metrics | Direction |
|------|---------|---------|------|
| Memory Recall | Averaged per session | `avg_recall` | ↑ |
| Memory Correctness | Averaged per session | `avg_correctness`, `avg_hallucination` | ↑, ↓ |
| Update Handling | Pooled across sessions | `score` | ↑ |
| Interference Rejection | Pooled across sessions | `score` | ↑ |
| Question Answering | Pooled across questions | `correct_ratio`, `hallucination_ratio` | ↑, ↓ |
| Evidence Coverage | Pooled across questions | `hit_rate` | ↑ |
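
The two aggregation styles in the table can give different numbers on the same data. The sketch below contrasts them on made-up session `eval` dicts. It assumes that per-session averaging skips sessions whose value is `null` and that pooling sums the counts before applying the 1.0 / 0.5 / 0.0 weights; both are readings of this document, not of the framework's source, and the function names are hypothetical.

```python
def mean_per_session(session_evals, key):
    """Per-session averaging: mean of `key` over sessions where it is not None."""
    values = [ev[key] for ev in session_evals if ev.get(key) is not None]
    return sum(values) / len(values) if values else None


def pooled_update_score(session_evals):
    """Pooling: sum the update counts over all sessions, then score once."""
    updated = sum(ev["update_num_updated"] for ev in session_evals)
    both = sum(ev["update_num_both"] for ev in session_evals)
    total = sum(ev["update_total_items"] for ev in session_evals)
    # Weights from the field table: updated=1.0, both=0.5, outdated=0.0.
    return (1.0 * updated + 0.5 * both) / total if total else None


# Made-up data: the third session has no gold points and no update items.
demo_sessions = [
    {"recall": 1.0, "update_num_updated": 1, "update_num_both": 0, "update_total_items": 1},
    {"recall": 0.5, "update_num_updated": 0, "update_num_both": 1, "update_total_items": 3},
    {"recall": None, "update_num_updated": 0, "update_num_both": 0, "update_total_items": 0},
]

print(mean_per_session(demo_sessions, "recall"))
print(pooled_update_score(demo_sessions))
```

Pooling weights every item equally, so sessions with many update items dominate the score, whereas per-session averaging weights every session equally. Two caveats when comparing against the sample `aggregate_metrics.json` above: it reports `0.0` rather than `null` for `interference_rejection` when `num_total` is 0, and this pooled formula applied to its update counts gives (52 + 0.5 × 18) / 85 ≈ 0.72 rather than the 0.65 shown, so the sample values are best read as illustrative.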

### 4. `pipeline_sessions.jsonl` / `pipeline_qa.jsonl`

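These are the Stage 1 checkpoint files. They have the same structure as `session_records.jsonl` / `qa_records.jsonl` but **do not contain the `eval` field**.

Purpose: in `--eval-only` mode the pipeline is skipped, its results are restored directly from the checkpoints, and only the eval stage is rerun. A typical scenario: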

```bash
# First full run
python -m eval_framework.cli --dataset ... --baseline FUMemory --output-dir results/FU

# Re-evaluate with a different judge model (without rerunning the pipeline)
OPENAI_MODEL=gpt-4o-mini python -m eval_framework.cli \
    --dataset ... --baseline FUMemory --output-dir results/FU --eval-only
```
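
Before launching an `--eval-only` run it can be worth confirming that a checkpoint is intact. The sketch below scans JSONL lines and reports records that fail to parse or that unexpectedly carry an `eval` field. `checkpoint_problems` is a hypothetical helper, not a framework function, and the demo lines are made up.

```python
import json


def checkpoint_problems(lines):
    """Scan JSONL lines from a Stage 1 checkpoint and describe any issues found."""
    problems = []
    num_records = 0
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        num_records += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {lineno}: not valid JSON (truncated write?)")
            continue
        if "eval" in record:
            problems.append(f"line {lineno}: unexpected `eval` field")
    if num_records == 0:
        problems.append("no records found")
    return problems


# Made-up lines: one good record, one with `eval`, one cut off mid-write.
demo_lines = [
    '{"sample_id": "a", "session_id": "S01"}\n',
    '{"sample_id": "a", "session_id": "S02", "eval": {}}\n',
    '{"sample_id": "a", "sess',
]
for problem in checkpoint_problems(demo_lines):
    print(problem)
```

Because an open text file iterates line by line, the same function accepts a file handle, for example one opened on `results/FU/pipeline_sessions.jsonl` with `encoding="utf-8"`.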

## Result Analysis Example

```python
import json

# Example path; replace with the --output-dir of your run.
output_dir = "results/FUMemory"

# Load the aggregate summary
with open(f"{output_dir}/aggregate_metrics.json", encoding="utf-8") as f:
    agg = json.load(f)
print(f"Recall: {agg['memory_recall']['avg_recall']:.2%}")
print(f"QA Correct: {agg['question_answering']['correct_ratio']:.2%}")

# Break down accuracy by QA type
qa_by_type = {}
with open(f"{output_dir}/qa_records.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        qt = rec["question_type_abbrev"]
        label = rec["eval"]["answer_label"]
        qa_by_type.setdefault(qt, []).append(label)

for qt, labels in sorted(qa_by_type.items()):
    correct = sum(1 for label in labels if label == "Correct")
    print(f"  {qt}: {correct}/{len(labels)} = {correct/len(labels):.0%}")
```