# Eval Framework Output Format

## Output Directory Structure

After a run completes, the directory passed to `--output-dir` contains five files:

```
output-dir/
├── pipeline_sessions.jsonl   # Stage 1 checkpoint: intermediate pipeline results (session level)
├── pipeline_qa.jsonl         # Stage 1 checkpoint: intermediate pipeline results (QA level)
├── session_records.jsonl     # Final results: session pipeline data + eval judgments
├── qa_records.jsonl          # Final results: QA pipeline data + eval judgments
└── aggregate_metrics.json    # Final results: baseline-level summary metrics
```
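
Before analyzing a run, it can help to confirm that all five files were actually written. The following is a minimal sketch; the helper name `missing_outputs` and the constant `EXPECTED_FILES` are hypothetical and not part of the framework, and only the file names come from the listing above.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = (
    "pipeline_sessions.jsonl",
    "pipeline_qa.jsonl",
    "session_records.jsonl",
    "qa_records.jsonl",
    "aggregate_metrics.json",
)


def missing_outputs(output_dir):
    """Return the expected file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means the run produced every expected file; a list containing only the three final-result files suggests that Stage 1 finished but the eval stage did not.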

## File Details

### 1. `session_records.jsonl`

One session per line. Each record contains the raw pipeline data plus the `eval` judgment:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "session_id": "S01",
  "memory_snapshot": [
    {
      "memory_id": "3",
      "text": "user: OBSERVATION: ...\nassistant: THOUGHT: ...",
      "session_id": "S01",
      "status": "active",
      "source": "FUMemory",
      "raw_backend_id": "3",
      "raw_backend_type": "linear",
      "metadata": {}
    }
  ],
  "memory_delta": [
    {
      "session_id": "S01",
      "op": "add",
      "text": "user: OBSERVATION: ...",
      "linked_previous": [],
      "raw_backend_id": "3",
      "metadata": {"baseline": "FUMemory"}
    }
  ],
  "gold_state": {
    "session_id": "S01",
    "cumulative_gold_memories": [...],
    "session_new_memories": [...],
    "session_update_memories": [...],
    "session_interference_memories": []
  },
  "eval": {
    "session_id": "S01",
    "recall": 0.8,
    "covered_count": 4,
    "num_gold": 5,
    "update_recall": 1.0,
    "update_covered_count": 2,
    "update_total": 2,
    "recall_reasoning": "4 of 5 gold points are covered...",
    "correctness_rate": 0.75,
    "num_memories": 8,
    "num_correct": 6,
    "num_hallucination": 1,
    "num_irrelevant": 1,
    "correctness_reasoning": "...",
    "correctness_records": [
      {"id": 1, "label": "correct"},
      {"id": 2, "label": "hallucination"}
    ],
    "update_score": 1.0,
    "update_num_updated": 2,
    "update_num_both": 0,
    "update_num_outdated": 0,
    "update_total_items": 2,
    "update_records": [
      {"memory_id": "mp_S08_3", "label": "updated", "reasoning": "..."}
    ],
    "interference_score": null,
    "interference_num_rejected": 0,
    "interference_num_memorized": 0,
    "interference_total_items": 0,
    "interference_records": []
  }
}
```

**`eval` fields:**

| Field | Meaning |
|------|------|
| `recall` | Fraction of this session's gold points covered by the memory delta (0-1) |
| `update_recall` | Coverage fraction for update-type gold points |
| `correctness_rate` | Fraction of memories in the delta judged correct |
| `num_hallucination` | Number of hallucinated memories in the delta |
| `num_irrelevant` | Number of irrelevant memories in the delta |
| `update_score` | Update-handling score (updated=1.0, both=0.5, outdated=0.0) |
| `interference_score` | Interference-rejection score (rejected=1.0, memorized=0.0) |
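
The ratio fields in `eval` appear to be quotients of the count fields stored alongside them. The sketch below recomputes them under that assumption, using the per-label weights from the table. The helper names `safe_ratio` and `recompute_session_scores` are hypothetical, and returning `None` for an empty denominator is a guess based on `interference_score` being `null` in the example record.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is 0."""
    return numerator / denominator if denominator else None


def recompute_session_scores(ev):
    """Rebuild the ratio fields of one session's `eval` object from its counts.

    Assumes each score is a plain quotient of the counts in the same object.
    """
    # Weights from the table: updated=1.0, both=0.5, outdated=0.0.
    update_points = ev["update_num_updated"] * 1.0 + ev["update_num_both"] * 0.5
    return {
        "recall": safe_ratio(ev["covered_count"], ev["num_gold"]),
        "update_recall": safe_ratio(ev["update_covered_count"], ev["update_total"]),
        "correctness_rate": safe_ratio(ev["num_correct"], ev["num_memories"]),
        "update_score": safe_ratio(update_points, ev["update_total_items"]),
        # rejected=1.0, memorized=0.0, so only rejected items earn points.
        "interference_score": safe_ratio(
            ev["interference_num_rejected"], ev["interference_total_items"]
        ),
    }
```

Comparing the recomputed values with the stored ones is a cheap consistency check on a record; a mismatch would mean the quotient assumption does not hold for that field.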

### 2. `qa_records.jsonl`

One QA question per line. Each record contains the retrieval results, the model's answer, and the judgment:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "checkpoint_id": "probe_e980c238",
  "question": "What was in the agent's inventory at step 1?",
  "gold_answer": "At step 1, the agent's inventory was empty.",
  "gold_evidence_memory_ids": ["mp_S04_1"],
  "gold_evidence_contents": ["The agent started with empty inventory"],
  "question_type": "factual_recall",
  "question_type_abbrev": "FR",
  "difficulty": "easy",
  "retrieval": {
    "query": "What was in the agent's inventory at step 1?",
    "top_k": 5,
    "items": [
      {
        "rank": 0,
        "memory_id": "memgallery:string_bundle",
        "text": "user: OBSERVATION: Your Inventory: ...",
        "score": 1.0,
        "raw_backend_id": null
      }
    ],
    "raw_trace": {"baseline": "FUMemory"}
  },
  "generated_answer": "The agent's inventory was empty at step 1.",
  "cited_memories": ["user: OBSERVATION: Inventory: nothing"],
  "eval": {
    "answer_label": "Correct",
    "answer_reasoning": "The response matches the reference answer...",
    "answer_is_valid": true,
    "evidence_hit_rate": 1.0,
    "evidence_covered_count": 1,
    "num_evidence": 1,
    "evidence_reasoning": "The cited memory covers the gold evidence...",
    "num_cited_memories": 1
  }
}
```

**`eval` fields:**

| Field | Meaning |
|------|------|
| `answer_label` | `Correct` / `Hallucination` / `Omission` |
| `answer_is_valid` | Whether the judgment succeeded (i.e., no LLM error occurred) |
| `evidence_hit_rate` | Fraction of gold evidence covered by the cited memories (0-1) |
| `evidence_covered_count` | Number of gold evidence items covered |
| `num_cited_memories` | Number of memories the model cited in its answer |
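
As an illustration of how `answer_label` and `answer_is_valid` interact, the sketch below computes the share of each label among records whose judgment succeeded. Restricting the denominator to valid records is an assumption, not documented framework behavior, and the helper name `answer_label_ratios` is hypothetical.

```python
from collections import Counter


def answer_label_ratios(records):
    """Return {label: share} over QA records whose judgment succeeded.

    Records with a falsy `answer_is_valid` are excluded from both the
    numerator and the denominator (an assumption of this sketch).
    """
    valid = [r["eval"] for r in records if r["eval"].get("answer_is_valid")]
    if not valid:
        return {}
    counts = Counter(ev["answer_label"] for ev in valid)
    return {label: n / len(valid) for label, n in counts.items()}
```

Passing the parsed lines of `qa_records.jsonl` to this function yields a dictionary keyed by `Correct`, `Hallucination`, and `Omission` (only labels that occur are present).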

### 3. `aggregate_metrics.json`

Baseline-level summary metrics across six dimensions:

```json
{
  "baseline_id": "FUMemory",
  "memory_recall": {
    "avg_recall": 0.72,
    "avg_update_recall": 0.65,
    "num_sessions_with_recall": 110,
    "num_sessions_with_update": 85,
    "total_covered": 320,
    "total_gold": 445
  },
  "memory_correctness": {
    "avg_correctness": 0.81,
    "avg_hallucination": 0.08,
    "avg_irrelevant": 0.11,
    "num_sessions": 110,
    "total_memories": 1200,
    "total_correct": 972,
    "total_hallucination": 96,
    "total_irrelevant": 132
  },
  "update_handling": {
    "score": 0.65,
    "num_updated": 52,
    "num_both": 18,
    "num_outdated": 15,
    "num_total": 85
  },
  "interference_rejection": {
    "score": 0.0,
    "num_rejected": 0,
    "num_memorized": 0,
    "num_total": 0
  },
  "question_answering": {
    "correct_ratio": 0.58,
    "hallucination_ratio": 0.22,
    "omission_ratio": 0.20,
    "num_total": 990,
    "num_valid": 990
  },
  "evidence_coverage": {
    "hit_rate": 0.43,
    "num_covered": 425,
    "num_total": 990
  }
}
```

**The six dimensions:**

| Dimension | Aggregation | Key metrics | Direction |
|------|---------|---------|------|
| Memory Recall | Averaged per session | `avg_recall` | ↑ |
| Memory Correctness | Averaged per session | `avg_correctness`, `avg_hallucination` | ↑, ↓ |
| Update Handling | Pooled across sessions | `score` | ↑ |
| Interference Rejection | Pooled across sessions | `score` | ↑ |
| Question Answering | Pooled across questions | `correct_ratio`, `hallucination_ratio` | ↑, ↓ |
| Evidence Coverage | Pooled across questions | `hit_rate` | ↑ |
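
The two aggregation styles in the table can give different numbers for the same data. Per-session averaging weights every session equally, while pooling weights every item equally. The sketch below contrasts them on two made-up sessions; the function names and the sample counts are hypothetical, and the framework's own aggregation code may differ in detail (for example in how it treats sessions without a score).

```python
def session_average(values):
    """Mean of per-session scores, skipping sessions whose score is None."""
    scored = [v for v in values if v is not None]
    return sum(scored) / len(scored) if scored else None


def pooled_ratio(pairs):
    """Sum numerators and denominators across sessions, then divide once."""
    total_num = sum(n for n, _ in pairs)
    total_den = sum(d for _, d in pairs)
    return total_num / total_den if total_den else None


# Two hypothetical sessions: 1 of 2 items covered, then 6 of 8.
sessions = [(1, 2), (6, 8)]
print(session_average([n / d for n, d in sessions]))  # 0.625
print(pooled_ratio(sessions))                         # 0.7
```

The small session pulls the per-session average down more than it pulls the pooled ratio, which is why the aggregation column matters when comparing baselines.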

### 4. `pipeline_sessions.jsonl` / `pipeline_qa.jsonl`

Stage 1 checkpoint files. They have the same structure as `session_records.jsonl` / `qa_records.jsonl` but **do not contain the `eval` field**.

Purpose: in `--eval-only` mode the pipeline is skipped, its results are restored from these checkpoints, and only the eval stage is rerun. A typical scenario:

```bash
# First full run
python -m eval_framework.cli --dataset ... --baseline FUMemory --output-dir results/FU

# Re-evaluate with a different judge model (without rerunning the pipeline)
OPENAI_MODEL=gpt-4o-mini python -m eval_framework.cli \
    --dataset ... --baseline FUMemory --output-dir results/FU --eval-only
```
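
Because the checkpoint files are described as the final records without `eval`, that relationship can be checked directly. The sketch below assumes both files list records in the same order and that no other field differs; neither is guaranteed by the description above. The helper names `strip_eval`, `load_jsonl`, and `checkpoint_matches` are hypothetical.

```python
import json


def strip_eval(record):
    """Return a copy of a final record without its `eval` field."""
    return {key: value for key, value in record.items() if key != "eval"}


def load_jsonl(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def checkpoint_matches(final_path, checkpoint_path):
    """True when the final records minus `eval` equal the checkpoint records."""
    final_records = [strip_eval(r) for r in load_jsonl(final_path)]
    return final_records == load_jsonl(checkpoint_path)
```

For example, `checkpoint_matches("results/FU/qa_records.jsonl", "results/FU/pipeline_qa.jsonl")` would be expected to return `True` after an `--eval-only` rerun if the assumptions hold.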

## Result Analysis Example

```python
import json

# Same directory as --output-dir in the commands above
output_dir = "results/FU"

# Load the aggregate summary
with open(f"{output_dir}/aggregate_metrics.json", encoding="utf-8") as f:
    agg = json.load(f)
print(f"Recall: {agg['memory_recall']['avg_recall']:.2%}")
print(f"QA Correct: {agg['question_answering']['correct_ratio']:.2%}")

# Break down answer correctness by question type
qa_by_type = {}
with open(f"{output_dir}/qa_records.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        qt = rec["question_type_abbrev"]
        label = rec["eval"]["answer_label"]
        qa_by_type.setdefault(qt, []).append(label)

for qt, labels in sorted(qa_by_type.items()):
    correct = sum(1 for label in labels if label == "Correct")
    print(f"  {qt}: {correct}/{len(labels)} = {correct/len(labels):.0%}")
```