File size: 6,233 Bytes

85b19cf

# VisTrajQA → Eval Framework 数据适配指南

## 概述

`convert_vistrajqa.py` 将 VisTrajQA 的 `sessions-*.jsonl` 转换为 eval_framework 所需的 domain_a_v2 三文件格式，从而可以用 Mem-Gallery / A-Mem / MemoryOS 等 baseline 进行统一评测。

## 快速使用

```bash
# 转换所有数据源（text-only 模式，默认）
python -m eval_framework.datasets.convert_vistrajqa \
    --input data/generated/sessions-vab.jsonl \
           data/generated/sessions-eb-nav.jsonl \
           data/generated/sessions-arena.jsonl \
           data/generated/sessions-eb-alfred.jsonl \
           data/generated/sessions-infini-thor.jsonl \
    --output eval_framework/converted/all

# 只转换某个数据源
python -m eval_framework.datasets.convert_vistrajqa \
    --input data/generated/sessions-vab.jsonl \
    --output eval_framework/converted/vab

# multimodal 模式（image caption 作为 attachment 而非内联文本）
python -m eval_framework.datasets.convert_vistrajqa \
    --input data/generated/sessions-vab.jsonl \
    --output eval_framework/converted/vab-mm \
    --multimodal

# 转换后直接跑 eval
python -m eval_framework.cli \
    --dataset eval_framework/converted/all \
    --baseline FUMemory \
    --output-dir eval_framework/results/FUMemory
```

## 转换映射

### 数据结构映射

```
VisTrajQA session                →  eval_framework sample
├── session_id                   →  sample_id
├── step_plan[]                  →  sessions[].dialogue[] (user + assistant turns)
├── probes[]                     →  checkpoints[] (probe checkpoints)
├── post_trajectory_qa[]         →  checkpoints[-1] (post-trajectory checkpoint)
└── memory_points[]              →  gold memory points (S00 embedded + stage4)
```

### Session 切分

一条 VisTrajQA 轨迹（如 30 步，4 个 probe 在 step 6/12/18/24）按 probe 边界切分为 5 个 session：

```
步骤 1-6   → S00   (probe 1 在此 session 结束后触发)
步骤 7-12  → S01   (probe 2)
步骤 13-18 → S02   (probe 3)
步骤 19-24 → S03   (probe 4)
步骤 25-30 → S04   (post-trajectory QA 在全部 session 结束后触发)
```

这样保证 eval_framework 的 runner 在每个 session 完成后恰好触发对应的 checkpoint。

### Turn 构建

每个 step 生成 2 个 dialogue turn：

| Turn | Role | 内容 |
|------|------|------|
| User turn | `user` | OBSERVATION + FEEDBACK + IMAGE caption（text-only 模式） |
| Assistant turn | `assistant` | THOUGHT + ACTION |

**text-only 模式**（默认）：image caption 直接写入 user turn 文本，格式为 `IMAGE: <caption>`。适用于所有 text-only baseline。

**multimodal 模式**（`--multimodal`）：image caption 作为 `attachment` 附加，不写入正文。适用于 MMMemory 等多模态 baseline。

### Memory Point 映射

| VisTrajQA 字段 | eval_framework 字段 | 说明 |
|----------------|---------------------|------|
| `mp_id` | `memory_id` | 如 `mp_S04_1` |
| `content` | `memory_content` | 一句话事实描述 |
| `type` | `memory_type` | `event_memory` / `state_memory` / `spatial_memory` |
| `source` | `memory_source` | `primary` (文本) / `secondary` (推断) |
| `is_update` | `is_update` | 是否为更新型记忆 |
| `original_memories` | `original_memories` | 被替换的旧内容列表 |
| `importance` | `importance` | 0.4 / 0.6 / 0.8 / 1.0 |
| `update_type` | `update_type` | `status_update` / `location_change` / ... |

Memory point 按 `step_num` 分配到对应 session：
- S00 的 memory points 嵌入在 `domain_a_v2.json` 的 session 对象中
- 其他 session 的 memory points 写入 `stage4_memory_points.jsonl`

### QA / Checkpoint 映射

**Probe checkpoint**：每个 probe 生成一个 checkpoint，`covered_sessions` 为该 probe 及之前所有 session。

**Post-trajectory checkpoint**：覆盖全部 session，包含 9 类 QA。

| VisTrajQA QA type | eval_framework question_type | 缩写 |
|----|----|-----|
| FR | factual_recall | FR |
| DU | dynamic_update | DU |
| MB | memory_boundary | MB |
| TR | temporal_reasoning | TR |
| KR | knowledge_reasoning | KR |
| VFR | visual_factual_recall | VFR |
| VS | visual_search | VS |
| VU | visual_update | VU |
| CMR | cross_modal_reasoning | CMR |

Evidence 字段从 `["mp_S04_1"]`（字符串列表）转换为 `[{"memory_id": "mp_S04_1"}]`（字典列表）以匹配 eval_framework 格式。

## 输出文件

```
eval_framework/converted/all/
├── domain_a_v2.json               # 主对话数据 (JSON array)
├── stage4_memory_points.jsonl     # 每 session 的 gold memory points
└── stage4b_qa_checkpoints.jsonl   # checkpoint QA 题目
```

## 评测维度与 VisTrajQA 的对应

| eval_framework 维度 | 测量内容 | 对应 VisTrajQA 特性 |
|-----|-----|-----|
| Memory Recall | 记忆系统存储了多少 gold points | 直接对应，所有 MP 类型 |
| Memory Correctness | 存储的记忆是否正确 | 检测 hallucination |
| Update Handling | 更新型记忆是否正确替换 | 对应 `is_update=true` 的 MP |
| Interference Rejection | 干扰信息是否被过滤 | VisTrajQA 无 interference 标注，此维度为空 |
| QA Accuracy | 问答正确率 | 对应 9 类 QA (FR/DU/MB/TR/KR/VFR/VS/VU/CMR) |
| Evidence Coverage | 回答引用了多少 gold evidence | 对应 evidence memory_point_ids |

> **注意**：VisTrajQA 没有 interference（干扰信息）标注，因此 eval_framework 的 Interference Rejection 维度在评测结果中会为空值。MB（Memory Boundary）类型的题目在 QA 层面测试了类似能力。

## 注意事项

1. **text-only baseline（FU/ST/LT/GA/MG/RF）**：使用默认 `--text-only`，image caption 内联到用户消息文本中
2. **multimodal baseline（MM/MMFU/NG/AUGUSTUS）**：使用 `--multimodal`，caption 作为 attachment
3. **caption 质量**：text-only baseline 对图像的理解完全依赖 caption 质量。如果 `image_caption` 为空，用户 turn 中不会有任何视觉信息
4. **Arena 数据**：observation 恒为空字符串，视觉信息完全来自 image_caption
5. **转换器会自动验证**：运行后会调用 `load_domain_a_v2_academic` 检验输出是否合法