# Eval Framework User Guide

## 1. Architecture Overview

```
eval_framework/
├── cli.py                    # Entry point: CLI parsing + three-stage orchestration (Pipeline → Eval → Aggregate)
├── config.py                 # EvalConfig dataclass
├── openai_compat.py          # Compatibility patch for the GPT-5 series: max_tokens → max_completion_tokens
├── datasets/
│   ├── schemas.py            # Shared runtime data structures (NormalizedTurn, MemorySnapshotRecord, RetrievalRecord, etc.)
│   └── domain_a_v2.py        # Loader for the domain_a_v2 dataset
├── memory_adapters/
│   ├── base.py               # MemoryAdapter abstract base class (7 interface methods)
│   ├── registry.py           # Baseline registry + Mem-Gallery default config overrides
│   ├── memgallery_native.py  # Unified adapter for the 11 built-in Mem-Gallery baselines
│   ├── amem.py               # Adapter for the external A-Mem baseline
│   ├── memoryos.py           # Adapter for the external MemoryOS baseline
│   └── export_utils.py       # Normalization utilities for snapshots and retrieval results
├── pipeline/
│   ├── runner.py             # Feeds dialogue in session order → produces snapshot/delta → triggers QA
│   ├── qa_runner.py          # Runs retrieve + answer for each checkpoint question
│   ├── gold_state.py         # Cumulative construction of gold memory points
│   └── records.py            # PipelineSessionRecord / PipelineCheckpointQARecord
├── evaluators/
│   ├── extraction.py         # Session-level evaluation: Recall + Correctness + Update + Interference
│   ├── qa.py                 # Checkpoint QA evaluation: answer correctness + evidence coverage
│   └── aggregate.py          # Aggregates all session/QA evaluations into baseline-level summary metrics
└── judges/
    ├── llm_client.py         # OpenAI-compatible LLM calls + JSON parsing + retries + concurrency control
    └── prompts.py            # 6 sets of LLM judge prompt templates
```

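As a rough illustration of what the `openai_compat.py` patch in the tree above is described as doing, the sketch below renames `max_tokens` to `max_completion_tokens` for GPT-5 series models. The function name `adapt_token_param` and the `gpt-5` prefix check are assumptions made for this example, not the framework's actual code.

```python
from typing import Any


def adapt_token_param(model: str, kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of kwargs with max_tokens renamed for GPT-5 series models.

    Hypothetical helper: the prefix check is an assumption about how
    GPT-5 series models are detected.
    """
    adapted = dict(kwargs)  # never mutate the caller's dict
    if model.lower().startswith("gpt-5") and "max_tokens" in adapted:
        adapted["max_completion_tokens"] = adapted.pop("max_tokens")
    return adapted


gpt5_kwargs = adapt_token_param("gpt-5-mini", {"max_tokens": 1024, "temperature": 0.0})
legacy_kwargs = adapt_token_param("gpt-4o", {"max_tokens": 1024})
```

Returning a copy keeps the caller's request dictionary reusable across models.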

## 2. Execution Flow

The whole eval runs in three stages (`cli.py: run_eval()`):

### Stage 1: Pipeline (serial, because adapters are stateful)

```
for each sample:
    adapter = create_adapter(baseline_name)
    adapter.reset()
    for each session in sample.sessions:
        for each turn in session.turns:
            adapter.ingest_turn(turn)                    # feed one dialogue turn
        adapter.end_session(session_id)                  # trigger post-session processing (e.g. GA reflection, RF optimization)
        snapshot = adapter.snapshot_memories()           # take a snapshot
        delta = adapter.export_memory_delta(session_id)  # export this session's delta
        → PipelineSessionRecord

        # QA is triggered once all covered_sessions of a checkpoint have completed
        for each question in checkpoint:
            retrieval = adapter.retrieve(question, top_k=5)
            answer = answer_fn(question, retrieval)      # an external LLM answerer can be injected
            → PipelineCheckpointQARecord
```

When the pipeline finishes, it writes the checkpoint files `pipeline_sessions.jsonl` and `pipeline_qa.jsonl`. Pass `--eval-only` to skip this stage and resume directly from those checkpoints.

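To make the checkpoint mechanism concrete, here is a minimal sketch of writing and reloading JSONL records, one JSON object per line. The helper names `write_jsonl` / `read_jsonl` and the record fields are hypothetical; the framework's real record schema lives in `pipeline/records.py`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterable


def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
    """Write one JSON object per line (hypothetical helper)."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict[str, Any]]:
    """Read records back, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative records only; the field names are assumptions.
sessions = [
    {"sample_id": "sample_001", "session_id": "S00", "delta": []},
    {"sample_id": "sample_001", "session_id": "S01", "delta": ["User prefers dark mode"]},
]

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "pipeline_sessions.jsonl"
    write_jsonl(checkpoint, sessions)
    restored = read_jsonl(checkpoint)
```

Because each line is an independent JSON document, a later stage can reload the records without rerunning the stateful adapters.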

### Stage 2: Eval (parallel, ThreadPoolExecutor)

- **Session evaluation** (`evaluators/extraction.py`): two LLM calls per session (Recall and Correctness), plus one call per update point and one per interference point:
  1. **Recall**: how many of this session's gold points are covered by the delta?
  2. **Correctness**: is each delta memory correct / hallucination / irrelevant?
  3. **Update handling**: each update gold point → updated / both / outdated
  4. **Interference rejection**: each interference gold point → rejected / memorized

- **QA evaluation** (`evaluators/qa.py`): 2 LLM calls per question:
  1. **Answer correctness**: Correct / Hallucination / Omission
  2. **Evidence coverage**: how many gold evidence points are covered by the cited memories?

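The parallel evaluation pattern can be sketched with the standard library alone. In the sketch below, `judge_session` is a stand-in for a real LLM judge call, and the two constants mirror `--max-eval-workers` and `LLM_MAX_CONCURRENT` only by analogy; none of this is the framework's actual implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_EVAL_WORKERS = 10   # analogous to --max-eval-workers
LLM_MAX_CONCURRENT = 5  # analogous to the LLM_MAX_CONCURRENT env var

llm_slots = threading.Semaphore(LLM_MAX_CONCURRENT)


def judge_session(session_id: str) -> dict[str, object]:
    """Stand-in for an LLM judge call (hypothetical); returns a fake score."""
    with llm_slots:  # at most LLM_MAX_CONCURRENT judge calls run at once
        return {"session_id": session_id, "recall": 1.0}


session_ids = [f"S{i:02d}" for i in range(8)]

with ThreadPoolExecutor(max_workers=MAX_EVAL_WORKERS) as pool:
    # map() preserves input order, so results line up with session_ids
    results = list(pool.map(judge_session, session_ids))
```

Separating the worker count from the semaphore lets many evaluation tasks be queued while the number of in-flight LLM requests stays bounded.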

### Stage 3: Aggregate

All session-level and QA-level evaluation results are aggregated into baseline-level metrics along 6 dimensions:

| Dimension | Aggregation | Key metrics |
|-----------|-------------|-------------|
| Memory Recall | Averaged per session | `avg_recall`, `avg_update_recall` |
| Memory Correctness | Averaged per session | `avg_correctness`, `avg_hallucination` |
| Update Handling | Pooled across sessions | `score` (updated=1.0, both=0.5, outdated=0.0) |
| Interference Rejection | Pooled across sessions | `score` (rejected/total) |
| Question Answering | Pooled across questions | `correct_ratio`, `hallucination_ratio`, `omission_ratio` |
| Evidence Coverage | Pooled across questions | `hit_rate` |

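The difference between per-session averaging and cross-session pooling, and the Update Handling scoring (updated=1.0, both=0.5, outdated=0.0), can be illustrated as follows. The input shapes and function names are assumptions made for this sketch, not the structures used by `evaluators/aggregate.py`.

```python
from statistics import mean

UPDATE_SCORES = {"updated": 1.0, "both": 0.5, "outdated": 0.0}


def avg_per_session(per_session_values: list[float]) -> float:
    """Per-session averaging: every session counts equally."""
    return mean(per_session_values) if per_session_values else 0.0


def pooled_update_score(labels_by_session: list[list[str]]) -> float:
    """Pooling: flatten all update labels across sessions, then average once."""
    pooled = [UPDATE_SCORES[label] for labels in labels_by_session for label in labels]
    return mean(pooled) if pooled else 0.0


def pooled_interference_score(labels_by_session: list[list[str]]) -> float:
    """rejected / total over all interference points from all sessions."""
    pooled = [label for labels in labels_by_session for label in labels]
    return pooled.count("rejected") / len(pooled) if pooled else 0.0


avg_recall = avg_per_session([1.0, 0.5])
update_score = pooled_update_score([["updated", "both"], ["outdated", "updated"]])
interference_score = pooled_interference_score(
    [["rejected"], ["memorized", "rejected", "rejected"]]
)
```

With pooling, a session that contains many update points weighs more than a session with a single one, whereas per-session averaging treats both sessions the same.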

Output files:
- `session_records.jsonl`: one record per session, containing pipeline data + eval results
- `qa_records.jsonl`: same layout, one record per question
- `aggregate_metrics.json`: baseline-level summary

## 3. Supported Baselines

### 3.1 Built into Mem-Gallery (11 baselines)

These baselines are wrapped uniformly by `MemGalleryNativeAdapter`. Place `memengine/` and `default_config/` (copied from Mem-Gallery's `benchmark/` directory) in the same parent directory as `eval_framework/`.

| Baseline | Type | Characteristics | Extra dependencies |
|----------|------|-----------------|--------------------|
| `FUMemory` | text-only | Full storage (FIFO truncation) | None |
| `STMemory` | text-only | Short-term memory | None |
| `LTMemory` | text-only | Long-term memory, embedding retrieval | sentence-transformers |
| `GAMemory` | text-only | Importance judge + self-reflection | LLM API |
| `MGMemory` | text-only | Multi-tier storage (working/FIFO/recall/archival) | LLM API, sentence-transformers |
| `RFMemory` | text-only | Reflection optimizer | LLM API |
| `MMMemory` | multimodal | Multimodal memory | torch |
| `MMFUMemory` | multimodal | Multimodal full storage | torch |
| `NGMemory` | multimodal | Knowledge-graph node storage | torch |
| `AUGUSTUSMemory` | multimodal | Concept extraction + graph | LLM API, torch |
| `UniversalRAGMemory` | multimodal | RAG routing + storage | LLM API |


### 3.2 External Adapters

| Baseline | Source | Installation | External service required |
|----------|--------|--------------|---------------------------|
| `Mem0` | [mem0ai/mem0](https://github.com/mem0ai/mem0) | `pip install mem0ai` | No (built-in Qdrant + SQLite) |
| `Mem0-Graph` | Same as above (graph mode) | `pip install "mem0ai[graph]"` | Requires Neo4j |
| `SimpleMem` | [aiming-lab/SimpleMem](https://github.com/aiming-lab/SimpleMem) | clone + requirements | No |
| `Omni-SimpleMem` | Same as above (omni mode) | Same as above | No |
| `Zep` | [getzep/zep](https://github.com/getzep/zep) | `pip install zep-python` | Requires a Zep server |
| `A-Mem` | [A-Mem](https://arxiv.org/abs/2502.12110) | clone the source | No |
| `MemoryOS` | [MemoryOS](https://github.com/BAI-LAB/MemoryOS) | clone the source | No |

**Paper sources:**

| Baseline | Paper | GitHub |
|----------|-------|--------|
| Mem0 / Mem0-Graph | [arXiv:2504.19413](https://arxiv.org/abs/2504.19413) | https://github.com/mem0ai/mem0 |
| SimpleMem | [arXiv:2601.02553](https://arxiv.org/abs/2601.02553) | https://github.com/aiming-lab/SimpleMem |
| Omni-SimpleMem | [arXiv:2604.01007](https://arxiv.org/abs/2604.01007) | https://github.com/aiming-lab/SimpleMem |
| MemVerse | [arXiv:2512.03627](https://arxiv.org/abs/2512.03627) | https://github.com/KnowledgeXLab/MemVerse |
| Memobase | N/A | https://github.com/memodb-io/memobase |
| Supermemory | N/A | https://github.com/supermemoryai/supermemory |
| Zep | [arXiv:2501.13956](https://arxiv.org/abs/2501.13956) | https://github.com/getzep/zep |

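
### 3.3 Adding a New Baseline

Implement the 7 abstract methods of `MemoryAdapter`:

```python
from typing import Any

# Assumed import paths, based on the directory layout in Section 1.
from eval_framework.datasets.schemas import MemoryDeltaRecord, MemorySnapshotRecord, NormalizedTurn, RetrievalRecord
from eval_framework.memory_adapters.base import MemoryAdapter


class MyAdapter(MemoryAdapter):
    def reset(self) -> None: ...  # clear all state before a new sample
    def ingest_turn(self, turn: NormalizedTurn) -> None: ...  # feed one dialogue turn
    def end_session(self, session_id: str) -> None: ...  # post-session processing
    def snapshot_memories(self) -> list[MemorySnapshotRecord]: ...  # full memory state
    def export_memory_delta(self, session_id: str) -> list[MemoryDeltaRecord]: ...  # this session's changes
    def retrieve(self, query: str, top_k: int) -> RetrievalRecord: ...  # memories for a question
    def get_capabilities(self) -> dict[str, Any]: ...  # describe what the adapter supports
```

Then register the adapter in `EXTERNAL_ADAPTER_REGISTRY` in `registry.py`.
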
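For orientation, here is a self-contained toy adapter that stores every turn verbatim and retrieves by naive keyword overlap. The `Turn` and `ToyAdapter` types are simplified stand-ins defined only for this sketch; a real adapter would subclass `MemoryAdapter` and return the record types from `datasets/schemas.py`, whose exact fields are not shown in this guide.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Turn:
    """Simplified stand-in for NormalizedTurn (only the fields used here)."""
    session_id: str
    role: str
    text: str


@dataclass
class ToyAdapter:
    """Stores each turn as one memory; mirrors the 7-method interface."""
    memories: list[tuple[str, str]] = field(default_factory=list)  # (session_id, text)

    def reset(self) -> None:
        self.memories.clear()

    def ingest_turn(self, turn: Turn) -> None:
        self.memories.append((turn.session_id, f"{turn.role}: {turn.text}"))

    def end_session(self, session_id: str) -> None:
        pass  # no post-session processing in this toy example

    def snapshot_memories(self) -> list[str]:
        return [text for _, text in self.memories]

    def export_memory_delta(self, session_id: str) -> list[str]:
        # The delta is whatever was stored during the given session.
        return [text for sid, text in self.memories if sid == session_id]

    def retrieve(self, query: str, top_k: int) -> list[str]:
        # Rank memories by the number of query words they share (case-insensitive).
        words = set(query.lower().split())
        scored = [(len(words & set(text.lower().split())), text) for _, text in self.memories]
        ranked = sorted((item for item in scored if item[0] > 0), key=lambda item: -item[0])
        return [text for _, text in ranked[:top_k]]

    def get_capabilities(self) -> dict[str, Any]:
        return {"multimodal": False}  # illustrative key, not a framework contract


adapter = ToyAdapter()
adapter.reset()
adapter.ingest_turn(Turn("S00", "user", "I prefer dark mode"))
adapter.end_session("S00")
adapter.ingest_turn(Turn("S01", "user", "My cat is named Miso"))
adapter.end_session("S01")
delta = adapter.export_memory_delta("S01")
hits = adapter.retrieve("which mode does the user prefer", top_k=5)
```

The call order at the bottom follows the Stage 1 loop: ingest every turn, close the session, then export the delta and answer retrieval queries.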

## 4. Data Adaptation

### 4.1 Dataset Format (domain_a_v2)

The loader `load_domain_a_v2_academic(data_dir)` expects three files under `data_dir`:

```
data_dir/
├── domain_a_v2.json              # Main dialogue data (JSON array)
├── stage4_memory_points.jsonl    # Gold memory points for each session
└── stage4b_qa_checkpoints.jsonl  # Checkpoint QA questions
```

Each sample in **`domain_a_v2.json`** has the following structure:

```json
{
  "uuid": "unique-id",
  "sample_id": "sample_001",
  "sessions": [
    {
      "_v2_session_id": "S00",
      "dialogue": [
        {
          "role": "user",
          "content": "Hello...",
          "timestamp": "2025-01-01T10:00:00",
          "attachments": [{"caption": "photo of...", "type": "image_caption"}]
        },
        {"role": "assistant", "content": "Hi..."}
      ],
      "memory_points": [...]  // only required for S00
    },
    {"_v2_session_id": "S01", "dialogue": [...]}
  ]
}
```


**`stage4_memory_points.jsonl`** holds one sample per line (pretty-printed here for readability):

```json
{
  "uuid": "...", "sample_id": "sample_001",
  "memory_sessions": [
    {
      "session_id": "S01",
      "memory_points": [
        {
          "memory_id": "m001",
          "memory_content": "User prefers dark mode",
          "memory_type": "preference",
          "memory_source": "normal",
          "is_update": false,
          "original_memories": [],
          "importance": 0.8
        }
      ]
    }
  ]
}
```


**`stage4b_qa_checkpoints.jsonl`** holds one sample per line (pretty-printed here for readability):

```json
{
  "uuid": "...", "sample_id": "sample_001",
  "checkpoints": [
    {
      "checkpoint_id": "cp01",
      "covered_sessions": ["S00", "S01"],
      "questions": [
        {
          "question": "What theme does the user prefer?",
          "answer": "Dark mode",
          "question_type": "preference_recall",
          "question_type_abbrev": "pref",
          "difficulty": "easy",
          "evidence": [{"memory_id": "m001"}]
        }
      ]
    }
  ]
}
```

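Because checkpoint questions reference gold memory points by `memory_id`, a quick consistency check before running the eval can catch dangling references. The sketch below uses inline data shaped like the two JSONL examples; the function name `find_dangling_evidence` is hypothetical, and the check itself is a suggestion rather than something the loader is known to perform. Memory points for S00 that live in `domain_a_v2.json` would need to be added to the known set as well.

```python
import json

# One line from each file, shaped like the examples (trimmed to the fields used).
memory_line = json.dumps({
    "sample_id": "sample_001",
    "memory_sessions": [
        {"session_id": "S01", "memory_points": [{"memory_id": "m001"}]},
    ],
})
checkpoint_line = json.dumps({
    "sample_id": "sample_001",
    "checkpoints": [{
        "checkpoint_id": "cp01",
        "covered_sessions": ["S00", "S01"],
        "questions": [
            {"question": "What theme does the user prefer?", "evidence": [{"memory_id": "m001"}]},
            {"question": "What is the user's job?", "evidence": [{"memory_id": "m999"}]},
        ],
    }],
})


def find_dangling_evidence(memory_sample: dict, checkpoint_sample: dict) -> list[tuple[str, str]]:
    """Return (checkpoint_id, memory_id) pairs whose memory_id has no gold point."""
    known_ids = {
        point["memory_id"]
        for session in memory_sample["memory_sessions"]
        for point in session["memory_points"]
    }
    return [
        (checkpoint["checkpoint_id"], evidence["memory_id"])
        for checkpoint in checkpoint_sample["checkpoints"]
        for question in checkpoint["questions"]
        for evidence in question["evidence"]
        if evidence["memory_id"] not in known_ids
    ]


dangling = find_dangling_evidence(json.loads(memory_line), json.loads(checkpoint_line))
```

Here the second question cites `m999`, which has no gold point, so it is the only pair reported.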
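
### 4.2 Adapting Your Own Data

There are two ways to plug in a new data source:

**Path A: convert to the domain_a_v2 format** (recommended)
- Arrange the raw dialogues into the three-file format described above
- Run the existing CLI as-is

**Path B: write a new dataset loader**
- Add a loader under `datasets/` that returns a `DomainAV2AcademicBundle` (or an equivalent structure)
- Inject it through the `load_domain_bundle` parameter of `run_eval()` in `cli.py`
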
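For Path A, the conversion of the main dialogue file can be sketched as below. The raw input shape (a list of sessions, each a list of `(role, content)` pairs) and the function name `to_domain_a_v2_sample` are assumptions for illustration; the output keys follow the `domain_a_v2.json` example in Section 4.1. The two JSONL files with gold memory points and QA checkpoints still need to be authored separately.

```python
import json
import uuid


def to_domain_a_v2_sample(sample_id: str, raw_sessions: list[list[tuple[str, str]]]) -> dict:
    """Convert raw (role, content) sessions into one domain_a_v2-style sample."""
    sessions = []
    for index, raw_turns in enumerate(raw_sessions):
        sessions.append({
            "_v2_session_id": f"S{index:02d}",  # S00, S01, ... as in the example
            "dialogue": [{"role": role, "content": content} for role, content in raw_turns],
        })
    return {"uuid": str(uuid.uuid4()), "sample_id": sample_id, "sessions": sessions}


raw = [
    [("user", "Hello"), ("assistant", "Hi, how can I help?")],
    [("user", "Please switch my editor to dark mode.")],
]
sample = to_domain_a_v2_sample("sample_001", raw)

# domain_a_v2.json is a JSON array of samples.
serialized = json.dumps([sample], ensure_ascii=False, indent=2)
```

Optional per-turn fields such as `timestamp` and `attachments` are omitted here and can be added to each dialogue entry when the source data has them.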

### 4.3 Key Data Structures

Each dialogue turn is normalized into a `NormalizedTurn`:

```python
NormalizedTurn(
    sample_id="sample_001",
    session_id="S01",
    turn_index=0,
    role="user",  # "user" | "assistant"
    text="Hello...",
    attachments=(Attachment(caption="...", type="image_caption"),),
    timestamp="2025-01-01T10:00:00",
)
```

Gold memory annotations support three kinds of markers. The first two are values of `memory_source`; the third is a separate boolean field:
- `normal`: a regular memory point
- `interference`: distractor information (should not be memorized)
- `is_update=True`: an update-type memory (should replace the old memory)

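How these markers could route a gold point to the different judges (Recall, Update handling, Interference rejection) can be sketched as a simple partition. The field names come from the `stage4_memory_points.jsonl` example; the routing order and the function name `partition_gold_points` are assumptions, and `pipeline/gold_state.py` may organize this differently.

```python
from typing import Any


def partition_gold_points(points: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Split gold memory points into normal / update / interference buckets by memory_id."""
    buckets: dict[str, list[str]] = {"normal": [], "update": [], "interference": []}
    for point in points:
        if point.get("memory_source") == "interference":
            buckets["interference"].append(point["memory_id"])  # should NOT be memorized
        elif point.get("is_update", False):
            buckets["update"].append(point["memory_id"])        # should replace an old memory
        else:
            buckets["normal"].append(point["memory_id"])
    return buckets


gold = [
    {"memory_id": "m001", "memory_source": "normal", "is_update": False},
    {"memory_id": "m002", "memory_source": "normal", "is_update": True},
    {"memory_id": "m003", "memory_source": "interference", "is_update": False},
]
buckets = partition_gold_points(gold)
```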

## 5. Environment Setup (uv)

### 5.1 Install uv

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

### 5.2 Initialize the Project Environment

```bash
cd /data1/toby/nips26

# Create the virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate
```

### 5.3 Install Core Dependencies

```bash
# Minimal dependencies (enough for text-only baselines such as FUMemory/STMemory)
uv pip install openai tenacity

# Embedding-retrieval baselines (LTMemory, GAMemory, MGMemory, etc.)
uv pip install sentence-transformers

# Multimodal baselines (MMMemory, NGMemory, AUGUSTUSMemory, etc.)
uv pip install torch torchvision transformers

# External baselines (A-Mem, MemoryOS): install extra dependencies per their own docs
# A-Mem needs the requirements from its source directory
# MemoryOS needs the memoryos package
```


### 5.4 Environment Variables (.env file)

Create a `.env` file in the project root (`nips26/`); the framework loads it automatically:

```bash
# .env
# Required: LLM API (shared by pipeline answering and judge evaluation)
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1  # or a compatible endpoint
OPENAI_MODEL=gpt-4o

# Optional
OPENAI_TEMPERATURE=0.0
OPENAI_MAX_TOKENS=1024
OPENAI_TIMEOUT=120
JUDGE_TEMPERATURE=0.0   # temperature used only by the judge
LLM_MAX_CONCURRENT=5    # upper bound on concurrent LLM calls
```

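As a hedged sketch of how these variables might be consumed, the snippet below reads them from a mapping and falls back to the example values for the optional ones. The `LLMSettings` dataclass and `load_llm_settings` function are hypothetical; the framework's real configuration lives in `config.py` (`EvalConfig`), and whether the optional variables default to these exact values is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMSettings:
    api_key: str
    base_url: str
    model: str
    temperature: float
    max_tokens: int
    timeout: float
    max_concurrent: int


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Build settings from environment variables; raises KeyError if a required one is missing."""
    return LLMSettings(
        api_key=env["OPENAI_API_KEY"],      # required
        base_url=env["OPENAI_BASE_URL"],    # required
        model=env["OPENAI_MODEL"],          # required
        temperature=float(env.get("OPENAI_TEMPERATURE", "0.0")),
        max_tokens=int(env.get("OPENAI_MAX_TOKENS", "1024")),
        timeout=float(env.get("OPENAI_TIMEOUT", "120")),
        max_concurrent=int(env.get("LLM_MAX_CONCURRENT", "5")),
    )


# Passing an explicit mapping keeps the example independent of the real environment.
settings = load_llm_settings({
    "OPENAI_API_KEY": "sk-test",
    "OPENAI_BASE_URL": "https://api.openai.com/v1",
    "OPENAI_MODEL": "gpt-4o",
    "LLM_MAX_CONCURRENT": "8",
})
```

Failing fast on a missing required variable surfaces configuration mistakes before any LLM call is made.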

### 5.5 Mem-Gallery Local Dependencies

The built-in Mem-Gallery baselines need the Mem-Gallery source placed in the same parent directory as `eval_framework/`:

```bash
# Assuming the Mem-Gallery repo is at /path/to/Mem-Gallery
cp -r /path/to/Mem-Gallery/benchmark/memengine /data1/toby/nips26/
cp -r /path/to/Mem-Gallery/benchmark/default_config /data1/toby/nips26/
```

The resulting directory layout should be:

```
nips26/
├── eval_framework/
├── memengine/        # Mem-Gallery memory engine
└── default_config/   # Mem-Gallery default configuration
```


## 6. Usage Examples

### Basic Usage

```bash
# Run a single baseline
python -m eval_framework.cli \
    --dataset /path/to/domain_a_v2_data/ \
    --baseline FUMemory \
    --output-dir eval_framework/results/FUMemory

# Smoke mode (runs only the first sample, for quick validation)
python -m eval_framework.cli \
    --dataset /path/to/domain_a_v2_data/ \
    --baseline FUMemory \
    --output-dir eval_framework/results/FUMemory_smoke \
    --smoke

# Dry run (prints the configuration without running anything)
python -m eval_framework.cli \
    --dataset /path/to/domain_a_v2_data/ \
    --baseline FUMemory \
    --dry-run

# Rerun only the eval stage (resumes from the checkpoint; the pipeline is not rerun)
python -m eval_framework.cli \
    --dataset /path/to/domain_a_v2_data/ \
    --baseline FUMemory \
    --output-dir eval_framework/results/FUMemory \
    --eval-only

# Adjust eval concurrency
python -m eval_framework.cli \
    --dataset /path/to/domain_a_v2_data/ \
    --baseline MGMemory \
    --output-dir eval_framework/results/MGMemory \
    --max-eval-workers 10
```


### Running Baselines in Batch

```bash
DATASET="/path/to/domain_a_v2_data"
for baseline in FUMemory STMemory LTMemory GAMemory MGMemory RFMemory A-Mem MemoryOS; do
    echo "=== Running $baseline ==="
    python -m eval_framework.cli \
        --dataset "$DATASET" \
        --baseline "$baseline" \
        --output-dir "eval_framework/results/$baseline"
done
```

### Output Files

After a run completes, `output-dir` contains:

```
results/FUMemory/
├── pipeline_sessions.jsonl   # Stage 1 checkpoint: session-level pipeline results
├── pipeline_qa.jsonl         # Stage 1 checkpoint: QA-level pipeline results
├── session_records.jsonl     # Final session results (including eval)
├── qa_records.jsonl          # Final QA results (including eval)
└── aggregate_metrics.json    # Baseline-level summary metrics
```


## 7. Estimating LLM API Cost

LLM calls per sample:

| Source | Number of calls |
|--------|-----------------|
| Pipeline answer (one per QA question) | N_questions |
| Session Recall judge | N_sessions |
| Session Correctness judge | N_sessions |
| Update judge | N_update_points (one call per point) |
| Interference judge | N_interference_points (one call per point) |
| QA Answer judge | N_questions |
| QA Evidence judge | N_questions |

In a typical setup, one sample takes roughly 20-50 LLM calls. Use `LLM_MAX_CONCURRENT` to cap concurrency and avoid hitting rate limits.

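Summing the rows of the table gives a closed-form estimate: 3 × N_questions + 2 × N_sessions + N_update_points + N_interference_points. The helper below just encodes that arithmetic; the function name and the example counts are illustrative, and retries are not counted.

```python
def estimate_llm_calls(
    n_sessions: int,
    n_questions: int,
    n_update_points: int,
    n_interference_points: int,
) -> int:
    """Estimate LLM calls for one sample from the per-source counts in the table."""
    pipeline_answers = n_questions                           # one answer per QA question
    session_judges = 2 * n_sessions                          # Recall + Correctness
    point_judges = n_update_points + n_interference_points   # one call per point
    qa_judges = 2 * n_questions                              # Answer + Evidence
    return pipeline_answers + session_judges + point_judges + qa_judges


# Made-up counts: 5 sessions, 8 questions, 3 update points, 2 interference points.
total_calls = estimate_llm_calls(5, 8, 3, 2)  # 8 + 10 + 5 + 16 = 39
```

Multiplying the result by the number of samples and baselines gives a rough budget for a full run.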