# Eval Framework Output Format

## Output Directory Structure

After a run completes, the `--output-dir` directory contains five files:

```
output-dir/
├── pipeline_sessions.jsonl   # Stage 1 checkpoint: intermediate pipeline results (session level)
├── pipeline_qa.jsonl         # Stage 1 checkpoint: intermediate pipeline results (QA level)
├── session_records.jsonl     # Final output: session pipeline data + eval judgments
├── qa_records.jsonl          # Final output: QA pipeline data + eval judgments
└── aggregate_metrics.json    # Final output: baseline-level aggregate metrics
```
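
A quick way to confirm that a run finished is to check for the five files listed above. The following is a minimal sketch; the helper name `missing_outputs` is hypothetical and not part of the framework.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_FILES = (
    "pipeline_sessions.jsonl",
    "pipeline_qa.jsonl",
    "session_records.jsonl",
    "qa_records.jsonl",
    "aggregate_metrics.json",
)


def missing_outputs(output_dir):
    """Return the expected file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all five files are present; a list containing only the three final-result files would suggest that Stage 1 finished but the eval stage did not.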

## File Details

### 1. `session_records.jsonl`

One session per line, containing the raw pipeline data and the `eval` judgment results:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "session_id": "S01",
  "memory_snapshot": [
    {
      "memory_id": "3",
      "text": "user: OBSERVATION: ...\nassistant: THOUGHT: ...",
      "session_id": "S01",
      "status": "active",
      "source": "FUMemory",
      "raw_backend_id": "3",
      "raw_backend_type": "linear",
      "metadata": {}
    }
  ],
  "memory_delta": [
    {
      "session_id": "S01",
      "op": "add",
      "text": "user: OBSERVATION: ...",
      "linked_previous": [],
      "raw_backend_id": "3",
      "metadata": {"baseline": "FUMemory"}
    }
  ],
  "gold_state": {
    "session_id": "S01",
    "cumulative_gold_memories": [...],
    "session_new_memories": [...],
    "session_update_memories": [...],
    "session_interference_memories": []
  },
  "eval": {
    "session_id": "S01",
    "recall": 0.8,
    "covered_count": 4,
    "num_gold": 5,
    "update_recall": 1.0,
    "update_covered_count": 2,
    "update_total": 2,
    "recall_reasoning": "4 of 5 gold points are covered...",
    "correctness_rate": 0.75,
    "num_memories": 8,
    "num_correct": 6,
    "num_hallucination": 1,
    "num_irrelevant": 1,
    "correctness_reasoning": "...",
    "correctness_records": [
      {"id": 1, "label": "correct"},
      {"id": 2, "label": "hallucination"}
    ],
    "update_score": 1.0,
    "update_num_updated": 2,
    "update_num_both": 0,
    "update_num_outdated": 0,
    "update_total_items": 2,
    "update_records": [
      {"memory_id": "mp_S08_3", "label": "updated", "reasoning": "..."}
    ],
    "interference_score": null,
    "interference_num_rejected": 0,
    "interference_num_memorized": 0,
    "interference_total_items": 0,
    "interference_records": []
  }
}
```

`eval` field reference:

| Field | Meaning |
|------|------|
| `recall` | Fraction of this session's gold points covered by the memory delta (0-1) |
| `update_recall` | Coverage fraction for gold points of type update |
| `correctness_rate` | Fraction of memories in the delta that are correct |
| `num_hallucination` | Number of hallucinated memories in the delta |
| `num_irrelevant` | Number of irrelevant memories in the delta |
| `update_score` | Update-handling score (updated=1.0, both=0.5, outdated=0.0) |
| `interference_score` | Interference-rejection score (rejected=1.0, memorized=0.0) |
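
The ratios in an `eval` object can be cross-checked against the raw counts stored next to them. The sketch below assumes each score is the weighted mean implied by the table (for example `update_score = (1.0 * updated + 0.5 * both + 0.0 * outdated) / total_items`) and that a zero denominator yields `null`; both are inferences from the sample record, not documented behavior. The function names are hypothetical.

```python
import math

# Label weights taken from the field table; treating each score as a
# weighted mean over labeled items is an assumption.
UPDATE_WEIGHTS = {"updated": 1.0, "both": 0.5, "outdated": 0.0}


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def recompute_session_scores(ev):
    """Recompute the headline ratios of one `eval` object from its counts."""
    weighted = (
        UPDATE_WEIGHTS["updated"] * ev["update_num_updated"]
        + UPDATE_WEIGHTS["both"] * ev["update_num_both"]
        + UPDATE_WEIGHTS["outdated"] * ev["update_num_outdated"]
    )
    return {
        "recall": ratio(ev["covered_count"], ev["num_gold"]),
        "update_recall": ratio(ev["update_covered_count"], ev["update_total"]),
        "correctness_rate": ratio(ev["num_correct"], ev["num_memories"]),
        "update_score": ratio(weighted, ev["update_total_items"]),
        "interference_score": ratio(
            ev["interference_num_rejected"], ev["interference_total_items"]
        ),
    }


def mismatches(ev, tol=1e-6):
    """List the score fields whose stored value differs from the recomputed one."""
    bad = []
    for key, expected in recompute_session_scores(ev).items():
        stored = ev[key]
        if expected is None or stored is None:
            # Both must be None (JSON null) to count as a match.
            if expected is not stored:
                bad.append(key)
        elif not math.isclose(stored, expected, abs_tol=tol):
            bad.append(key)
    return bad
```

Applied to the `eval` object of the sample record above, `mismatches` returns an empty list: 4/5 gives the recall of 0.8, 6/8 gives the correctness rate of 0.75, and the zero-item interference case stays `null`.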

### 2. `qa_records.jsonl`

One QA question per line, containing the retrieval results, the model's answer, and the judgment:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "checkpoint_id": "probe_e980c238",
  "question": "What was in the agent's inventory at step 1?",
  "gold_answer": "At step 1, the agent's inventory was empty.",
  "gold_evidence_memory_ids": ["mp_S04_1"],
  "gold_evidence_contents": ["The agent started with empty inventory"],
  "question_type": "factual_recall",
  "question_type_abbrev": "FR",
  "difficulty": "easy",
  "retrieval": {
    "query": "What was in the agent's inventory at step 1?",
    "top_k": 5,
    "items": [
      {
        "rank": 0,
        "memory_id": "memgallery:string_bundle",
        "text": "user: OBSERVATION: Your Inventory: ...",
        "score": 1.0,
        "raw_backend_id": null
      }
    ],
    "raw_trace": {"baseline": "FUMemory"}
  },
  "generated_answer": "The agent's inventory was empty at step 1.",
  "cited_memories": ["user: OBSERVATION: Inventory: nothing"],
  "eval": {
    "answer_label": "Correct",
    "answer_reasoning": "The response matches the reference answer...",
    "answer_is_valid": true,
    "evidence_hit_rate": 1.0,
    "evidence_covered_count": 1,
    "num_evidence": 1,
    "evidence_reasoning": "The cited memory covers the gold evidence...",
    "num_cited_memories": 1
  }
}
```

`eval` field reference:

| Field | Meaning |
|------|------|
| `answer_label` | `Correct` / `Hallucination` / `Omission` |
| `answer_is_valid` | Whether the judgment succeeded (that is, no LLM error occurred) |
| `evidence_hit_rate` | Fraction of the gold evidence covered by the cited memories (0-1) |
| `evidence_covered_count` | Number of gold evidence items that were covered |
| `num_cited_memories` | Number of memories the model cited in its answer |
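
Because `answer_is_valid` flags failed judgments, it is usually worth filtering on it before tallying labels. The sketch below computes the share of each label among valid records; dropping invalid records from the denominator is an assumption about how one might want to count, and `label_ratios` is a hypothetical helper.

```python
from collections import Counter

# The three labels listed in the field table above.
LABELS = ("Correct", "Hallucination", "Omission")


def label_ratios(records):
    """Share of each answer_label among records whose judgment is valid."""
    valid = [r["eval"] for r in records if r["eval"]["answer_is_valid"]]
    counts = Counter(ev["answer_label"] for ev in valid)
    total = len(valid)
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}
```

With an input of parsed `qa_records.jsonl` lines, the result is a dict keyed by the three labels whose values sum to 1.0 whenever at least one record is valid.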

### 3. `aggregate_metrics.json`

Baseline-level summary metrics across six dimensions:

```json
{
  "baseline_id": "FUMemory",
  "memory_recall": {
    "avg_recall": 0.72,
    "avg_update_recall": 0.65,
    "num_sessions_with_recall": 110,
    "num_sessions_with_update": 85,
    "total_covered": 320,
    "total_gold": 445
  },
  "memory_correctness": {
    "avg_correctness": 0.81,
    "avg_hallucination": 0.08,
    "avg_irrelevant": 0.11,
    "num_sessions": 110,
    "total_memories": 1200,
    "total_correct": 972,
    "total_hallucination": 96,
    "total_irrelevant": 132
  },
  "update_handling": {
    "score": 0.65,
    "num_updated": 52,
    "num_both": 18,
    "num_outdated": 15,
    "num_total": 85
  },
  "interference_rejection": {
    "score": 0.0,
    "num_rejected": 0,
    "num_memorized": 0,
    "num_total": 0
  },
  "question_answering": {
    "correct_ratio": 0.58,
    "hallucination_ratio": 0.22,
    "omission_ratio": 0.20,
    "num_total": 990,
    "num_valid": 990
  },
  "evidence_coverage": {
    "hit_rate": 0.43,
    "num_covered": 425,
    "num_total": 990
  }
}
```

The six dimensions:

| Dimension | Aggregation | Key metrics | Direction |
|------|---------|---------|------|
| Memory Recall | Averaged per session | `avg_recall` | ↑ |
| Memory Correctness | Averaged per session | `avg_correctness`, `avg_hallucination` | ↑, ↓ |
| Update Handling | Pooled across sessions | `score` | ↑ |
| Interference Rejection | Pooled across sessions | `score` | ↑ |
| Question Answering | Pooled across questions | `correct_ratio`, `hallucination_ratio` | ↑, ↓ |
| Evidence Coverage | Pooled across questions | `hit_rate` | ↑ |
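
The two aggregation styles in the table can give different numbers on the same data. The sketch below contrasts them using the `covered_count` and `num_gold` fields of session `eval` objects. It only illustrates the general difference between per-session averaging and pooling, under the assumption that sessions with no gold points are skipped; it is not the framework's own aggregation code.

```python
def per_session_average(sessions):
    """Mean of per-session ratios; every session carries equal weight."""
    values = [s["covered_count"] / s["num_gold"] for s in sessions if s["num_gold"]]
    return sum(values) / len(values) if values else None


def pooled_rate(sessions):
    """Ratio of summed counts; sessions with more items carry more weight."""
    covered = sum(s["covered_count"] for s in sessions)
    gold = sum(s["num_gold"] for s in sessions)
    return covered / gold if gold else None


# Made-up counts for illustration only: a small session that is fully
# covered and a larger one that is mostly missed.
sessions = [
    {"covered_count": 1, "num_gold": 1},
    {"covered_count": 1, "num_gold": 4},
]
print(per_session_average(sessions))  # 0.625
print(pooled_rate(sessions))          # 0.4
```

Per-session averaging keeps one long session from dominating the result, while pooling is more stable when many sessions contain only a handful of items.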

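### 4. `pipeline_sessions.jsonl` / `pipeline_qa.jsonl`

Stage 1 checkpoint files. Their structure is the same as `session_records.jsonl` / `qa_records.jsonl`, except that they do not contain the `eval` field.

Purpose: in `--eval-only` mode the pipeline is skipped; the run resumes from the checkpoints and reruns only the eval stage. A typical scenario: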

```bash
# First full run
python -m eval_framework.cli --dataset ... --baseline FUMemory --output-dir results/FU

# Re-evaluate with a different judge model (without rerunning the pipeline)
OPENAI_MODEL=gpt-4o-mini python -m eval_framework.cli \
    --dataset ... --baseline FUMemory --output-dir results/FU --eval-only
```
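
If the checkpoint records really are the final records minus `eval`, that relationship can be verified after a re-evaluation. This is a minimal sketch under that assumption, and it further assumes both files list records in the same order; `read_jsonl`, `strip_eval`, and `matches_checkpoint` are hypothetical helpers.

```python
import json


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def strip_eval(record):
    """Return a copy of the record without its top-level `eval` field."""
    return {key: value for key, value in record.items() if key != "eval"}


def matches_checkpoint(final_path, checkpoint_path):
    """True when every final record equals its checkpoint record plus `eval`."""
    final = [strip_eval(rec) for rec in read_jsonl(final_path)]
    return final == read_jsonl(checkpoint_path)
```

For example, `matches_checkpoint("results/FU/qa_records.jsonl", "results/FU/pipeline_qa.jsonl")` returning `False` after an `--eval-only` run would suggest the two files came from different pipeline runs.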

## Result Analysis Example

```python
import json

# Load the aggregate metrics
with open("results/FUMemory/aggregate_metrics.json", encoding="utf-8") as f:
    agg = json.load(f)
print(f"Recall: {agg['memory_recall']['avg_recall']:.2%}")
print(f"QA Correct: {agg['question_answering']['correct_ratio']:.2%}")

# Break down answer accuracy by question type
qa_by_type = {}
with open("results/FUMemory/qa_records.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        qt = rec["question_type_abbrev"]
        label = rec["eval"]["answer_label"]
        qa_by_type.setdefault(qt, []).append(label)

for qt, labels in sorted(qa_by_type.items()):
    correct = sum(1 for lbl in labels if lbl == "Correct")
    print(f"  {qt}: {correct}/{len(labels)} = {correct/len(labels):.0%}")
```