Eval Framework Usage Guide
1. Overall Architecture
eval_framework/
├── cli.py                   # Entry point: CLI parsing + three-stage orchestration (Pipeline → Eval → Aggregate)
├── config.py                # EvalConfig dataclass
├── openai_compat.py         # Compatibility patch: max_tokens → max_completion_tokens for the GPT-5 family
├── datasets/
│   ├── schemas.py           # Shared runtime data structures (NormalizedTurn, MemorySnapshotRecord, RetrievalRecord, etc.)
│   └── domain_a_v2.py       # domain_a_v2 dataset loader
├── memory_adapters/
│   ├── base.py              # MemoryAdapter abstract base class (7 interface methods)
│   ├── registry.py          # Baseline registry + Mem-Gallery default config overrides
│   ├── memgallery_native.py # Unified adapter for the 11 built-in Mem-Gallery baselines
│   ├── amem.py              # A-Mem external baseline adapter
│   ├── memoryos.py          # MemoryOS external baseline adapter
│   └── export_utils.py      # Snapshot / retrieval-result normalization utilities
├── pipeline/
│   ├── runner.py            # Feeds dialogues session by session → produces snapshot/delta → triggers QA
│   ├── qa_runner.py         # Runs retrieve + answer for each checkpoint question
│   ├── gold_state.py        # Incremental construction of gold memory points
│   └── records.py           # PipelineSessionRecord / PipelineCheckpointQARecord
├── evaluators/
│   ├── extraction.py        # Session-level evaluation: Recall + Correctness + Update + Interference
│   ├── qa.py                # Checkpoint QA evaluation: answer correctness + evidence coverage
│   └── aggregate.py         # Aggregates all session/QA evaluations into baseline-level summary metrics
└── judges/
    ├── llm_client.py        # OpenAI-compatible LLM calls + JSON parsing + retries + concurrency control
    └── prompts.py           # 6 LLM judge prompt templates
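The openai_compat.py shim mentioned above can be sketched roughly as follows. This is an illustrative sketch, not the actual module: the function name and the model-prefix check are assumptions; the real patch may hook the OpenAI client differently.

```python
def patch_token_param(kwargs: dict, model: str) -> dict:
    """Rename max_tokens to max_completion_tokens for models that require it.

    Hypothetical sketch of the openai_compat.py behavior: newer OpenAI models
    (GPT-5 family) reject max_tokens and expect max_completion_tokens instead.
    """
    out = dict(kwargs)
    if "max_tokens" in out and model.startswith("gpt-5"):
        out["max_completion_tokens"] = out.pop("max_tokens")
    return out
```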
2. Execution Flow
The eval runs in three stages (cli.py: run_eval()):
Stage 1 — Pipeline (serial; adapters are stateful)
for each sample:
    adapter = create_adapter(baseline_name)
    adapter.reset()
    for each session in sample.sessions:
        for each turn in session.turns:
            adapter.ingest_turn(turn)                     # feed one dialogue turn
        adapter.end_session(session_id)                   # session post-processing (e.g. GA reflection, RF optimization)
        snapshot = adapter.snapshot_memories()            # take a snapshot
        delta = adapter.export_memory_delta(session_id)   # export this session's increment
        → PipelineSessionRecord
        # when all covered_sessions of a checkpoint are complete, trigger QA
        for each question in checkpoint:
            retrieval = adapter.retrieve(question, top_k=5)
            answer = answer_fn(question, retrieval)       # an external LLM answerer can be injected
            → PipelineCheckpointQARecord
After the pipeline finishes, it writes the checkpoint files pipeline_sessions.jsonl + pipeline_qa.jsonl; with --eval-only you can skip this stage and resume directly from these checkpoints.
Stage 2 — Eval (parallel, ThreadPoolExecutor)
Session evaluation (evaluators/extraction.py) — 4+ LLM calls per session:
- Recall: how many of this session's gold points are covered by the delta?
- Correctness: is each delta memory correct / hallucination / irrelevant?
- Update handling: each update gold point → updated / both / outdated
- Interference rejection: each interference gold point → rejected / memorized
QA evaluation (evaluators/qa.py) — 2 LLM calls per question:
- Answer correctness: Correct / Hallucination / Omission
- Evidence coverage: how many gold evidence points are covered by the cited memories
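Stage 2's parallelism is a plain thread pool: each session's judge calls are independent, so they fan out across workers. A minimal sketch of the pattern, where `judge_session` is a stand-in for the real evaluator functions (their actual names and return shapes are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def judge_session(record: dict) -> dict:
    # Stand-in for the per-session LLM judge calls in evaluators/extraction.py.
    return {"session_id": record["session_id"], "recall": 1.0}

def run_eval_stage(session_records: list[dict], max_workers: int = 8) -> list[dict]:
    # Sessions are judged concurrently; note that LLM_MAX_CONCURRENT separately
    # caps in-flight API requests inside the LLM client.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_session, session_records))
```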
Stage 3 — Aggregate
All session- and QA-level evaluation results are aggregated into baseline-level metrics along 6 dimensions:
| Dimension | Aggregation | Key metrics |
|---|---|---|
| Memory Recall | averaged per session | avg_recall, avg_update_recall |
| Memory Correctness | averaged per session | avg_correctness, avg_hallucination |
| Update Handling | pooled across sessions | score (updated=1.0, both=0.5, outdated=0.0) |
| Interference Rejection | pooled across sessions | score (rejected/total) |
| Question Answering | pooled across questions | correct_ratio, hallucination_ratio, omission_ratio |
| Evidence Coverage | pooled across questions | hit_rate |
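Pooling across sessions (rather than averaging per session) weights every gold point equally, which matters when sessions carry uneven point counts. A minimal sketch of the update-handling score under the weights from the table above (illustrative, not the aggregate.py implementation):

```python
# Weights for update-handling labels, as documented in the aggregation table.
UPDATE_WEIGHTS = {"updated": 1.0, "both": 0.5, "outdated": 0.0}

def update_handling_score(labels: list[str]) -> float:
    """Pool all update gold-point labels across sessions into one score."""
    if not labels:
        return 0.0
    return sum(UPDATE_WEIGHTS[label] for label in labels) / len(labels)
```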
Output files:
- session_records.jsonl — one record per session, pipeline data + eval results
- qa_records.jsonl — same, at the QA level
- aggregate_metrics.json — baseline-level summary
3. Supported Baselines
3.1 Mem-Gallery Built-ins (11)
All are wrapped by the unified MemGalleryNativeAdapter; place memengine/ and default_config/ (copied from Mem-Gallery's benchmark/ directory) in the directory next to eval_framework/.
| Baseline | Type | Characteristics | Extra dependencies |
|---|---|---|---|
| FUMemory | text-only | full storage (FIFO truncation) | — |
| STMemory | text-only | short-term memory | — |
| LTMemory | text-only | long-term memory, embedding retrieval | sentence-transformers |
| GAMemory | text-only | importance judge + self-reflection | LLM API |
| MGMemory | text-only | multi-tier storage (working/FIFO/recall/archival) | LLM API, sentence-transformers |
| RFMemory | text-only | reflection optimizer | LLM API |
| MMMemory | multimodal | multimodal memory | torch |
| MMFUMemory | multimodal | multimodal full storage | torch |
| NGMemory | multimodal | knowledge-graph node storage | torch |
| AUGUSTUSMemory | multimodal | concept extraction + graph | LLM API, torch |
| UniversalRAGMemory | multimodal | RAG routing + storage | LLM API |
3.2 External Adapters
| Baseline | Source | Installation | External service |
|---|---|---|---|
| Mem0 | mem0ai/mem0 | pip install mem0ai | no (bundled Qdrant + SQLite) |
| Mem0-Graph | same (graph mode) | pip install "mem0ai[graph]" | Neo4j required |
| SimpleMem | aiming-lab/SimpleMem | clone + requirements | no |
| Omni-SimpleMem | same (omni mode) | same | no |
| Zep | getzep/zep | pip install zep-python | Zep server required |
| A-Mem | A-Mem | clone the source | no |
| MemoryOS | MemoryOS | clone the source | no |
Paper sources:
| Baseline | Paper | GitHub |
|---|---|---|
| Mem0 / Mem0-Graph | arXiv:2504.19413 | https://github.com/mem0ai/mem0 |
| SimpleMem | arXiv:2601.02553 | https://github.com/aiming-lab/SimpleMem |
| Omni-SimpleMem | arXiv:2604.01007 | https://github.com/aiming-lab/SimpleMem |
| MemVerse | arXiv:2512.03627 | https://github.com/KnowledgeXLab/MemVerse |
| Memobase | — | https://github.com/memodb-io/memobase |
| Supermemory | — | https://github.com/supermemoryai/supermemory |
| Zep | arXiv:2501.13956 | https://github.com/getzep/zep |
3.3 Adding a New Baseline
Implement the 7 abstract methods of MemoryAdapter:
class MyAdapter(MemoryAdapter):
def reset(self) -> None: ...
def ingest_turn(self, turn: NormalizedTurn) -> None: ...
def end_session(self, session_id: str) -> None: ...
def snapshot_memories(self) -> list[MemorySnapshotRecord]: ...
def export_memory_delta(self, session_id: str) -> list[MemoryDeltaRecord]: ...
def retrieve(self, query: str, top_k: int) -> RetrievalRecord: ...
def get_capabilities(self) -> dict[str, Any]: ...
Then register the adapter in EXTERNAL_ADAPTER_REGISTRY in registry.py.
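Registration presumably maps a `--baseline` name to an adapter factory. A hypothetical sketch — the actual entry shape in registry.py is not shown in this guide, so the dict-of-classes layout and the `register` helper are assumptions:

```python
# Hypothetical registry shape: CLI baseline name → adapter class (factory).
EXTERNAL_ADAPTER_REGISTRY: dict[str, type] = {}

def register(name: str, adapter_cls: type) -> None:
    # Fail loudly on duplicate names so two adapters cannot shadow each other.
    if name in EXTERNAL_ADAPTER_REGISTRY:
        raise ValueError(f"baseline {name!r} already registered")
    EXTERNAL_ADAPTER_REGISTRY[name] = adapter_cls

class MyAdapter:  # stand-in for a MemoryAdapter subclass
    pass

register("MyAdapter", MyAdapter)
```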
4. Data Adaptation
4.1 Dataset Format (domain_a_v2)
The loader load_domain_a_v2_academic(data_dir) expects three files under data_dir:
data_dir/
├── domain_a_v2.json              # main dialogue data (JSON array)
├── stage4_memory_points.jsonl    # gold memory points per session
└── stage4b_qa_checkpoints.jsonl  # checkpoint QA questions
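A minimal sketch of loading these three files, keyed by sample_id for easy joining. This is illustrative only: the real load_domain_a_v2_academic returns a richer bundle object, and the plain-dict return shape here is an assumption.

```python
import json
from pathlib import Path

def load_bundle(data_dir: str) -> dict:
    root = Path(data_dir)
    # Main dialogue data: a JSON array of samples.
    samples = json.loads((root / "domain_a_v2.json").read_text())

    def read_jsonl(name: str) -> dict:
        # Key each JSONL record by sample_id for O(1) lookup against samples.
        lines = (root / name).read_text().splitlines()
        records = [json.loads(line) for line in lines if line.strip()]
        return {r["sample_id"]: r for r in records}

    return {
        "samples": samples,
        "memory_points": read_jsonl("stage4_memory_points.jsonl"),
        "checkpoints": read_jsonl("stage4b_qa_checkpoints.jsonl"),
    }
```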
Structure of each sample in domain_a_v2.json:
{
"uuid": "unique-id",
"sample_id": "sample_001",
"sessions": [
{
"_v2_session_id": "S00",
"dialogue": [
{
"role": "user",
"content": "Hello...",
"timestamp": "2025-01-01T10:00:00",
"attachments": [{"caption": "photo of...", "type": "image_caption"}]
},
{"role": "assistant", "content": "Hi..."}
],
"memory_points": [...] // 仅 S00 需要
},
{"_v2_session_id": "S01", "dialogue": [...]}
]
}
stage4_memory_points.jsonl, one sample per line:
{
"uuid": "...", "sample_id": "sample_001",
"memory_sessions": [
{
"session_id": "S01",
"memory_points": [
{
"memory_id": "m001",
"memory_content": "User prefers dark mode",
"memory_type": "preference",
"memory_source": "normal",
"is_update": false,
"original_memories": [],
"importance": 0.8
}
]
}
]
}
stage4b_qa_checkpoints.jsonl, one sample per line:
{
"uuid": "...", "sample_id": "sample_001",
"checkpoints": [
{
"checkpoint_id": "cp01",
"covered_sessions": ["S00", "S01"],
"questions": [
{
"question": "What theme does the user prefer?",
"answer": "Dark mode",
"question_type": "preference_recall",
"question_type_abbrev": "pref",
"difficulty": "easy",
"evidence": [{"memory_id": "m001"}]
}
]
}
]
}
4.2 Adapting Your Own Data
There are two paths for plugging in a new data source:
Path A: convert to the domain_a_v2 format (recommended)
- Reshape your raw dialogues into the three-file format above
- Run with the existing CLI as-is
Path B: write a new dataset loader
- Add a loader under datasets/ that returns a DomainAV2AcademicBundle (or an equivalent structure)
- Inject it via the load_domain_bundle parameter of run_eval() in cli.py
4.3 Key Data Structures
Each dialogue turn is normalized into a NormalizedTurn:
NormalizedTurn(
sample_id="sample_001",
session_id="S01",
turn_index=0,
role="user", # "user" | "assistant"
text="Hello...",
attachments=(Attachment(caption="...", type="image_caption"),),
timestamp="2025-01-01T10:00:00",
)
The gold memory annotations carry three source markers:
- normal — a regular memory point
- interference — distractor information that should NOT be memorized
- is_update=True — an updating memory that should replace an older one
5. Environment Setup (uv)
5.1 Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
5.2 Initialize the project environment
cd /data1/toby/nips26
# create the virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate
5.3 Install Core Dependencies
# minimal deps (enough for text-only baselines such as FUMemory/STMemory)
uv pip install openai tenacity
# embedding-retrieval baselines (LTMemory, GAMemory, MGMemory, ...)
uv pip install sentence-transformers
# multimodal baselines (MMMemory, NGMemory, AUGUSTUSMemory, ...)
uv pip install torch torchvision transformers
# external baselines (A-Mem, MemoryOS) — install extra deps per their own docs
# A-Mem needs the requirements in its source tree
# MemoryOS needs the memoryos package
5.4 Environment Variables (.env file)
Create a .env file in the project root (nips26/); the framework loads it automatically:
# .env
# required — LLM API (used by both pipeline answering and judge evaluation)
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1 # or a compatible endpoint
OPENAI_MODEL=gpt-4o
# optional
OPENAI_TEMPERATURE=0.0
OPENAI_MAX_TOKENS=1024
OPENAI_TIMEOUT=120
JUDGE_TEMPERATURE=0.0 # judge-specific temperature
LLM_MAX_CONCURRENT=5 # max concurrent LLM calls
5.5 Mem-Gallery Local Dependencies
The built-in Mem-Gallery baselines need their source placed next to eval_framework/:
# assuming the Mem-Gallery repo is at /path/to/Mem-Gallery
cp -r /path/to/Mem-Gallery/benchmark/memengine /data1/toby/nips26/
cp -r /path/to/Mem-Gallery/benchmark/default_config /data1/toby/nips26/
The final layout should be:
nips26/
├── eval_framework/
├── memengine/       # Mem-Gallery memory engine
└── default_config/  # Mem-Gallery default configs
6. Running Examples
Basic runs
# run a single baseline
python -m eval_framework.cli \
--dataset /path/to/domain_a_v2_data/ \
--baseline FUMemory \
--output-dir eval_framework/results/FUMemory
# smoke mode (run only the 1st sample, for quick validation)
python -m eval_framework.cli \
--dataset /path/to/domain_a_v2_data/ \
--baseline FUMemory \
--output-dir eval_framework/results/FUMemory_smoke \
--smoke
# dry-run (print the resolved config without actually running)
python -m eval_framework.cli \
--dataset /path/to/domain_a_v2_data/ \
--baseline FUMemory \
--dry-run
# re-run only the eval stage (resume from checkpoints; the pipeline is not re-run)
python -m eval_framework.cli \
--dataset /path/to/domain_a_v2_data/ \
--baseline FUMemory \
--output-dir eval_framework/results/FUMemory \
--eval-only
# adjust eval concurrency
python -m eval_framework.cli \
--dataset /path/to/domain_a_v2_data/ \
--baseline MGMemory \
--output-dir eval_framework/results/MGMemory \
--max-eval-workers 10
Run all baselines in a batch
DATASET="/path/to/domain_a_v2_data"
for baseline in FUMemory STMemory LTMemory GAMemory MGMemory RFMemory A-Mem MemoryOS; do
echo "=== Running $baseline ==="
python -m eval_framework.cli \
--dataset "$DATASET" \
--baseline "$baseline" \
--output-dir "eval_framework/results/$baseline"
done
Output files
After a run, output-dir contains:
results/FUMemory/
├── pipeline_sessions.jsonl  # Stage 1 checkpoint — session-level pipeline results
├── pipeline_qa.jsonl        # Stage 1 checkpoint — QA-level pipeline results
├── session_records.jsonl    # final session results (with eval)
├── qa_records.jsonl         # final QA results (with eval)
└── aggregate_metrics.json   # baseline-level summary metrics
7. LLM API Cost Estimation
LLM calls per sample:
| Source | Calls |
|---|---|
| Pipeline answer (per QA question) | N_questions |
| Session Recall judge | N_sessions |
| Session Correctness judge | N_sessions |
| Update judge | N_update_points (one per point) |
| Interference judge | N_interference_points (one per point) |
| QA Answer judge | N_questions |
| QA Evidence judge | N_questions |
A typical sample amounts to roughly 20-50 LLM calls. Use LLM_MAX_CONCURRENT to cap concurrency and avoid rate limits.
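The table above sums to a simple closed form, which is handy for budgeting a run before launching it. A sketch (derived directly from the table; the per-sample totals it produces are estimates, not measured counts):

```python
def calls_per_sample(n_questions: int, n_sessions: int,
                     n_update_points: int, n_interference_points: int) -> int:
    # Pipeline answer + QA Answer judge + QA Evidence judge = 3 calls per question;
    # Recall + Correctness judges = 2 calls per session;
    # Update / Interference judges run once per gold point.
    return (3 * n_questions
            + 2 * n_sessions
            + n_update_points
            + n_interference_points)
```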