# Eval Framework User Guide
## 1. Architecture Overview
```
eval_framework/
├── cli.py                    # Entry point: CLI parsing + three-stage orchestration (Pipeline → Eval → Aggregate)
├── config.py                 # EvalConfig dataclass
├── openai_compat.py          # Compatibility patch: max_tokens → max_completion_tokens for GPT-5-series models
├── datasets/
│   ├── schemas.py            # Shared runtime data structures (NormalizedTurn, MemorySnapshotRecord, RetrievalRecord, etc.)
│   └── domain_a_v2.py        # Loader for the domain_a_v2 dataset
├── memory_adapters/
│   ├── base.py               # MemoryAdapter abstract base class (7 interface methods)
│   ├── registry.py           # Baseline registry + Mem-Gallery default-config overrides
│   ├── memgallery_native.py  # Unified adapter for Mem-Gallery's 11 built-in baselines
│   ├── amem.py               # External baseline adapter for A-Mem
│   ├── memoryos.py           # External baseline adapter for MemoryOS
│   └── export_utils.py       # Snapshot/retrieval normalization utilities
├── pipeline/
│   ├── runner.py             # Feeds dialogue session by session → emits snapshot/delta → triggers QA
│   ├── qa_runner.py          # Runs retrieve + answer for each checkpoint question
│   ├── gold_state.py         # Incremental construction of gold memory points
│   └── records.py            # PipelineSessionRecord / PipelineCheckpointQARecord
├── evaluators/
│   ├── extraction.py         # Session-level evaluation: Recall + Correctness + Update + Interference
│   ├── qa.py                 # Checkpoint QA evaluation: answer correctness + evidence coverage
│   └── aggregate.py          # Aggregates all session/QA evaluations into baseline-level summary metrics
└── judges/
    ├── llm_client.py         # OpenAI-compatible LLM calls + JSON parsing + retries + concurrency control
    └── prompts.py            # 6 LLM-judge prompt templates
```
## 2. Execution Flow
The evaluation runs in three stages (`cli.py: run_eval()`):
### Stage 1 — Pipeline (serial; adapters are stateful)
```
for each sample:
    adapter = create_adapter(baseline_name)
    adapter.reset()
    for each session in sample.sessions:
        for each turn in session.turns:
            adapter.ingest_turn(turn)              # feed one dialogue turn
        adapter.end_session(session_id)            # post-session processing (e.g. GA reflection, RF optimization)
        snapshot = adapter.snapshot_memories()     # take a snapshot
        delta = adapter.export_memory_delta()      # export this session's increment
        → PipelineSessionRecord
        # once all covered_sessions of a checkpoint are complete, trigger QA
        for each question in checkpoint:
            retrieval = adapter.retrieve(question, top_k=5)
            answer = answer_fn(question, retrieval)  # an external LLM answerer can be injected
            → PipelineCheckpointQARecord
```
After the pipeline finishes, it writes the checkpoint files `pipeline_sessions.jsonl` and `pipeline_qa.jsonl`; `--eval-only` skips this stage and resumes directly from these checkpoints.
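The checkpoint trigger in the pseudocode above ("fire QA once every session a checkpoint covers has finished") can be sketched as follows; the `Checkpoint` dataclass and helper name here are illustrative, not the framework's actual classes:

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    checkpoint_id: str
    covered_sessions: list[str]

def due_checkpoints(checkpoints: list[Checkpoint], completed: list[str]) -> list[Checkpoint]:
    """Return checkpoints whose covered sessions have all been processed."""
    done = set(completed)
    return [cp for cp in checkpoints if set(cp.covered_sessions) <= done]

cps = [
    Checkpoint("cp01", ["S00", "S01"]),
    Checkpoint("cp02", ["S00", "S01", "S02"]),
]
# After S00 and S01 finish, only cp01 is due:
[cp.checkpoint_id for cp in due_checkpoints(cps, ["S00", "S01"])]  # -> ['cp01']
```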
### Stage 2 — Eval (parallel, ThreadPoolExecutor)
- **Session evaluation** (`evaluators/extraction.py`) — 4+ LLM calls per session:
  1. **Recall**: how many of this session's gold points are covered by the delta?
  2. **Correctness**: is each delta memory correct / hallucination / irrelevant?
  3. **Update handling**: each update gold point → updated / both / outdated
  4. **Interference rejection**: each interference gold point → rejected / memorized
- **QA evaluation** (`evaluators/qa.py`) — 2 LLM calls per question:
  1. **Answer correctness**: Correct / Hallucination / Omission
  2. **Evidence coverage**: how many gold evidence points are covered by the cited memories
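These per-record evaluations are independent of one another, so they can be fanned out with `concurrent.futures.ThreadPoolExecutor`. A minimal sketch, with a stub standing in for the real LLM judges:

```python
from concurrent.futures import ThreadPoolExecutor

def judge_session(record: dict) -> dict:
    # Stand-in for the real LLM judge calls; returns a per-session score dict.
    return {"session_id": record["session_id"], "recall": 1.0}

records = [{"session_id": f"S{i:02d}"} for i in range(4)]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(judge_session, records))  # order-preserving
```

`pool.map` keeps results in input order, which makes it easy to zip the scores back onto the pipeline records.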
### Stage 3 — Aggregate
Aggregates all session-level and QA-level evaluation results into baseline-level metrics across 6 dimensions:
| Dimension | Aggregation | Key metrics |
|-----------|-------------|-------------|
| Memory Recall | averaged per session | `avg_recall`, `avg_update_recall` |
| Memory Correctness | averaged per session | `avg_correctness`, `avg_hallucination` |
| Update Handling | pooled across sessions | `score` (updated=1.0, both=0.5, outdated=0.0) |
| Interference Rejection | pooled across sessions | `score` (rejected/total) |
| Question Answering | pooled across questions | `correct_ratio`, `hallucination_ratio`, `omission_ratio` |
| Evidence Coverage | pooled across questions | `hit_rate` |
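The Update Handling row, for instance, pools per-point verdicts across sessions using the weights shown (updated=1.0, both=0.5, outdated=0.0). A sketch of that pooling:

```python
WEIGHTS = {"updated": 1.0, "both": 0.5, "outdated": 0.0}

def update_handling_score(verdicts: list[str]) -> float:
    """Pool per-point update verdicts into a single [0, 1] score."""
    if not verdicts:
        return 0.0
    return sum(WEIGHTS[v] for v in verdicts) / len(verdicts)

update_handling_score(["updated", "both", "outdated", "updated"])  # -> 0.625
```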
Output files:
- `session_records.jsonl` — per-session pipeline data + eval results
- `qa_records.jsonl` — per-question pipeline data + eval results
- `aggregate_metrics.json` — baseline-level summary
## 3. Supported Baselines
### 3.1 Mem-Gallery Built-ins (11)
All are wrapped by the unified `MemGalleryNativeAdapter`. You must place `memengine/` and `default_config/` (copied from Mem-Gallery's `benchmark/` directory) in the directory next to `eval_framework/`.
| Baseline | Type | Characteristics | Extra dependencies |
|----------|------|-----------------|--------------------|
| `FUMemory` | text-only | stores the full history (FIFO truncation) | — |
| `STMemory` | text-only | short-term memory | — |
| `LTMemory` | text-only | long-term memory with embedding retrieval | sentence-transformers |
| `GAMemory` | text-only | importance judge + self-reflection | LLM API |
| `MGMemory` | text-only | multi-tier storage (working/FIFO/recall/archival) | LLM API, sentence-transformers |
| `RFMemory` | text-only | reflection optimizer | LLM API |
| `MMMemory` | multimodal | multimodal memory | torch |
| `MMFUMemory` | multimodal | multimodal full-history storage | torch |
| `NGMemory` | multimodal | knowledge-graph node storage | torch |
| `AUGUSTUSMemory` | multimodal | concept extraction + graph | LLM API, torch |
| `UniversalRAGMemory` | multimodal | RAG routing + storage | LLM API |
### 3.2 External Adapters
| Baseline | Source | Installation | External service required |
|----------|--------|--------------|---------------------------|
| `Mem0` | [mem0ai/mem0](https://github.com/mem0ai/mem0) | `pip install mem0ai` | no (embedded Qdrant + SQLite) |
| `Mem0-Graph` | same (graph mode) | `pip install "mem0ai[graph]"` | Neo4j |
| `SimpleMem` | [aiming-lab/SimpleMem](https://github.com/aiming-lab/SimpleMem) | clone + requirements | no |
| `Omni-SimpleMem` | same (omni mode) | same | no |
| `Zep` | [getzep/zep](https://github.com/getzep/zep) | `pip install zep-python` | Zep server |
| `A-Mem` | [A-Mem](https://arxiv.org/abs/2504.19413) | clone source | no |
| `MemoryOS` | [MemoryOS](https://github.com/memodb-io/memobase) | clone source | no |
**Paper sources:**
| Baseline | Paper | GitHub |
|----------|------|--------|
| Mem0 / Mem0-Graph | [arXiv:2504.19413](https://arxiv.org/abs/2504.19413) | https://github.com/mem0ai/mem0 |
| SimpleMem | [arXiv:2601.02553](https://arxiv.org/abs/2601.02553) | https://github.com/aiming-lab/SimpleMem |
| Omni-SimpleMem | [arXiv:2604.01007](https://arxiv.org/abs/2604.01007) | https://github.com/aiming-lab/SimpleMem |
| MemVerse | [arXiv:2512.03627](https://arxiv.org/abs/2512.03627) | https://github.com/KnowledgeXLab/MemVerse |
| Memobase | — | https://github.com/memodb-io/memobase |
| Supermemory | — | https://github.com/supermemoryai/supermemory |
| Zep | [arXiv:2501.13956](https://arxiv.org/abs/2501.13956) | https://github.com/getzep/zep |
### 3.3 Adding a New Baseline
Implement the 7 abstract methods of `MemoryAdapter`:
```python
class MyAdapter(MemoryAdapter):
    def reset(self) -> None: ...
    def ingest_turn(self, turn: NormalizedTurn) -> None: ...
    def end_session(self, session_id: str) -> None: ...
    def snapshot_memories(self) -> list[MemorySnapshotRecord]: ...
    def export_memory_delta(self, session_id: str) -> list[MemoryDeltaRecord]: ...
    def retrieve(self, query: str, top_k: int) -> RetrievalRecord: ...
    def get_capabilities(self) -> dict[str, Any]: ...
```
Then register it in the `EXTERNAL_ADAPTER_REGISTRY` in `registry.py`.
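For illustration only (not a real baseline), here is a toy keyword-matching adapter that satisfies the same seven-method shape; it uses plain dicts in place of the framework's record classes, so treat it as a sketch rather than a drop-in implementation:

```python
class KeywordAdapter:
    """Toy adapter: stores user turns verbatim, retrieves by word overlap."""

    def reset(self) -> None:
        self.memories: list[dict] = []
        self.session_start = 0  # index where the current session's delta begins

    def ingest_turn(self, turn: dict) -> None:
        if turn["role"] == "user":
            self.memories.append({"text": turn["text"], "session_id": turn["session_id"]})

    def end_session(self, session_id: str) -> None:
        pass  # no post-session processing in this toy example

    def snapshot_memories(self) -> list[dict]:
        return list(self.memories)

    def export_memory_delta(self, session_id: str) -> list[dict]:
        delta = self.memories[self.session_start:]
        self.session_start = len(self.memories)
        return delta

    def retrieve(self, query: str, top_k: int) -> list[dict]:
        q = set(query.lower().split())
        ranked = sorted(self.memories,
                        key=lambda m: -len(q & set(m["text"].lower().split())))
        return ranked[:top_k]

    def get_capabilities(self) -> dict:
        return {"multimodal": False}
```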
## 4. Data Adaptation
### 4.1 Dataset Format (domain_a_v2)
The loader `load_domain_a_v2_academic(data_dir)` expects three files under `data_dir`:
```
data_dir/
├── domain_a_v2.json               # main dialogue data (JSON array)
├── stage4_memory_points.jsonl     # gold memory points per session
└── stage4b_qa_checkpoints.jsonl   # checkpoint QA questions
```
Each sample in **`domain_a_v2.json`** has the structure:
```json
{
  "uuid": "unique-id",
  "sample_id": "sample_001",
  "sessions": [
    {
      "_v2_session_id": "S00",
      "dialogue": [
        {
          "role": "user",
          "content": "Hello...",
          "timestamp": "2025-01-01T10:00:00",
          "attachments": [{"caption": "photo of...", "type": "image_caption"}]
        },
        {"role": "assistant", "content": "Hi..."}
      ],
      "memory_points": [...] // only needed for S00
    },
    {"_v2_session_id": "S01", "dialogue": [...]}
  ]
}
```
**`stage4_memory_points.jsonl`** — one sample per line:
```json
{
  "uuid": "...", "sample_id": "sample_001",
  "memory_sessions": [
    {
      "session_id": "S01",
      "memory_points": [
        {
          "memory_id": "m001",
          "memory_content": "User prefers dark mode",
          "memory_type": "preference",
          "memory_source": "normal",
          "is_update": false,
          "original_memories": [],
          "importance": 0.8
        }
      ]
    }
  ]
}
```
**`stage4b_qa_checkpoints.jsonl`** — one sample per line:
```json
{
  "uuid": "...", "sample_id": "sample_001",
  "checkpoints": [
    {
      "checkpoint_id": "cp01",
      "covered_sessions": ["S00", "S01"],
      "questions": [
        {
          "question": "What theme does the user prefer?",
          "answer": "Dark mode",
          "question_type": "preference_recall",
          "question_type_abbrev": "pref",
          "difficulty": "easy",
          "evidence": [{"memory_id": "m001"}]
        }
      ]
    }
  ]
}
```
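Because checkpoint questions cite gold memory points by `memory_id`, it is worth verifying those links when preparing your own data. A small consistency check (the function name and dict access are assumptions based on the schemas above, not a framework utility):

```python
def check_evidence_links(memory_sample: dict, checkpoint_sample: dict) -> list[str]:
    """Return evidence memory_ids cited by questions but missing from the gold points."""
    known = {
        point["memory_id"]
        for sess in memory_sample["memory_sessions"]
        for point in sess["memory_points"]
    }
    missing = []
    for cp in checkpoint_sample["checkpoints"]:
        for q in cp["questions"]:
            for ev in q.get("evidence", []):
                if ev["memory_id"] not in known:
                    missing.append(ev["memory_id"])
    return missing
```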
### 4.2 Adapting Your Own Data
There are two paths for plugging in a new data source:
**Path A: convert to the domain_a_v2 format** (recommended)
- Arrange your raw dialogues into the three-file format above
- Run directly with the existing CLI
**Path B: write a new dataset loader**
- Add a loader under `datasets/` that returns a `DomainAV2AcademicBundle` (or an equivalent structure)
- Inject it via the `load_domain_bundle` parameter of `run_eval()` in `cli.py`
### 4.3 Key Data Structures
Each dialogue turn is normalized into a `NormalizedTurn`:
```python
NormalizedTurn(
    sample_id="sample_001",
    session_id="S01",
    turn_index=0,
    role="user",  # "user" | "assistant"
    text="Hello...",
    attachments=(Attachment(caption="...", type="image_caption"),),
    timestamp="2025-01-01T10:00:00",
)
```
Gold memory annotations carry three source markers:
- `normal` — a regular memory point
- `interference` — distractor information (should not be memorized)
- `is_update=True` — an updating memory (should replace an older one)
## 5. Environment Setup (uv)
### 5.1 Install uv
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
### 5.2 Initialize the Project Environment
```bash
cd /data1/toby/nips26
# create the virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate
```
### 5.3 Install Core Dependencies
```bash
# minimal dependencies (enough for text-only baselines such as FUMemory/STMemory)
uv pip install openai tenacity

# embedding-retrieval baselines (LTMemory, GAMemory, MGMemory, etc.)
uv pip install sentence-transformers

# multimodal baselines (MMMemory, NGMemory, AUGUSTUSMemory, etc.)
uv pip install torch torchvision transformers

# external baselines (A-Mem, MemoryOS) — install extra dependencies per their own docs:
#   A-Mem needs the requirements file in its source directory
#   MemoryOS needs the memoryos package
```
### 5.4 Environment Variables (.env file)
Create a `.env` file in the project root (`nips26/`); the framework loads it automatically:
```bash
# .env
# required — LLM API (shared by pipeline answering and judge evaluation)
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1   # or a compatible endpoint
OPENAI_MODEL=gpt-4o
# optional
OPENAI_TEMPERATURE=0.0
OPENAI_MAX_TOKENS=1024
OPENAI_TIMEOUT=120
JUDGE_TEMPERATURE=0.0    # judge-specific temperature
LLM_MAX_CONCURRENT=5     # LLM concurrency cap
```
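If you want to replicate that auto-loading behavior outside the framework, a minimal `.env` parser is easy to sketch. This assumes simple `KEY=VALUE` lines with `#` comments and no quoting (the framework's own loader may differ):

```python
import os

def load_dotenv(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    loaded: dict[str, str] = {}
    with open(path) as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()  # drop trailing comments
            if not line or "=" not in line:
                continue
            key, value = line.split("=", 1)
            loaded[key.strip()] = value.strip()
            # do not clobber variables already set in the shell
            os.environ.setdefault(key.strip(), value.strip())
    return loaded
```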
### 5.5 Mem-Gallery Local Dependencies
The Mem-Gallery built-in baselines require its source to sit in the directory next to `eval_framework/`:
```bash
# assuming the Mem-Gallery repo is at /path/to/Mem-Gallery
cp -r /path/to/Mem-Gallery/benchmark/memengine /data1/toby/nips26/
cp -r /path/to/Mem-Gallery/benchmark/default_config /data1/toby/nips26/
```
The final directory layout should be:
```
nips26/
├── eval_framework/
├── memengine/        # Mem-Gallery memory engines
└── default_config/   # Mem-Gallery default configs
```
## 6. Running Examples
### Basic Runs
```bash
# run a single baseline
python -m eval_framework.cli \
    --dataset /path/to/domain_a_v2_data/ \
    --baseline FUMemory \
    --output-dir eval_framework/results/FUMemory

# smoke mode (runs only the first sample, for quick validation)
python -m eval_framework.cli \
    --dataset /path/to/domain_a_v2_data/ \
    --baseline FUMemory \
    --output-dir eval_framework/results/FUMemory_smoke \
    --smoke

# dry-run (print the resolved config without actually running)
python -m eval_framework.cli \
    --dataset /path/to/domain_a_v2_data/ \
    --baseline FUMemory \
    --dry-run

# re-run only the eval stage (resume from checkpoints; the pipeline is not re-run)
python -m eval_framework.cli \
    --dataset /path/to/domain_a_v2_data/ \
    --baseline FUMemory \
    --output-dir eval_framework/results/FUMemory \
    --eval-only

# adjust eval concurrency
python -m eval_framework.cli \
    --dataset /path/to/domain_a_v2_data/ \
    --baseline MGMemory \
    --output-dir eval_framework/results/MGMemory \
    --max-eval-workers 10
```
### Running All Baselines in a Batch
```bash
DATASET="/path/to/domain_a_v2_data"
for baseline in FUMemory STMemory LTMemory GAMemory MGMemory RFMemory A-Mem MemoryOS; do
    echo "=== Running $baseline ==="
    python -m eval_framework.cli \
        --dataset "$DATASET" \
        --baseline "$baseline" \
        --output-dir "eval_framework/results/$baseline"
done
```
### Output Files
After a run completes, `output-dir` contains:
```
results/FUMemory/
├── pipeline_sessions.jsonl   # Stage 1 checkpoint — session-level pipeline results
├── pipeline_qa.jsonl         # Stage 1 checkpoint — QA-level pipeline results
├── session_records.jsonl     # final session results (with eval)
├── qa_records.jsonl          # final QA results (with eval)
└── aggregate_metrics.json    # baseline-level summary metrics
```
## 7. Estimating LLM API Cost
LLM call volume per sample:
| Source | Calls |
|--------|-------|
| Pipeline answer (per QA question) | N_questions |
| Session Recall judge | N_sessions |
| Session Correctness judge | N_sessions |
| Update judge | N_update_points (one per point) |
| Interference judge | N_interference_points (one per point) |
| QA Answer judge | N_questions |
| QA Evidence judge | N_questions |
A typical sample makes roughly 20-50 LLM calls. Use `LLM_MAX_CONCURRENT` to cap concurrency and avoid rate limits.
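Summing the table rows gives a straightforward per-sample estimate; for example, with 10 sessions, 8 questions, 3 update points, and 2 interference points (illustrative counts, not dataset statistics):

```python
def llm_calls_per_sample(n_sessions: int, n_questions: int,
                         n_update: int, n_interference: int) -> int:
    """Total LLM calls per sample, summing the rows of the cost table."""
    pipeline_answers = n_questions              # answer each QA question
    session_judges = 2 * n_sessions             # recall + correctness judges
    point_judges = n_update + n_interference    # one call per gold point
    qa_judges = 2 * n_questions                 # answer + evidence judges
    return pipeline_answers + session_judges + point_judges + qa_judges

llm_calls_per_sample(10, 8, 3, 2)  # -> 49
```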