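# TubeToken Experiment Plan v4 (Final / Experiment-Ready)

> Main thread: **TubeToken** is the core framework. **Existence / Null modeling** and **Text-Audio Conditional Compression** are natural components of TubeToken, not external patches on SimToken.  
> Goal of v4: building on v3 Reviewer-Revised, finalize the plan before experiments start. This means fixing the implementation of the matched-compute baseline, correcting the Phase 0 red-light conditions, making the H3 CosSim baseline precise, adding the gradient-conflict risk of multi-expression training, restructuring the main table and the fairness analysis table, and spelling out proposal amortization efficiency in multi-expression settings.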

---

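## 0. Summary of Final v4 Changes

This version is the final plan before experiments start. v3 already provided a complete framework for launching experiments; v4 only makes finalization-level refinements, focused on removing ambiguities that could later draw reviewer objections or force experimental rework.

Relative to v3, v4 makes the following final changes:

1. **Fix a single implementation for SimToken + matched compute**: the four candidate options are dropped and the plan commits to **SimToken + multiple keyframe prompting with the same number of keyframes as TubeToken-Fast**. This control is conceptually closest to the source of TubeToken-Fast's extra computation, and it removes any suspicion of choosing a favorable baseline after the experiments finish.
2. **Correct the third red-light condition of Milestone 1**: judgments that cannot be observed in Phase 0, such as "TubeToken-Minimal is expected to gain nothing from selection", are removed. The condition now relies only on quantities measurable in Phase 0: Recall@32, Oracle Tube J/F, and Oracle Refined J/F.
3. **Make the H3 CosSim baseline of the Fixed Q-Former precise**: the Fixed Q-Former produces exactly the same output for different expressions on the same tube, so its cross-expression CosSim is **identically 1.0**, not "close to 1". Whether the Conditioned Q-Former falls significantly below 1.0 is the direct evidence for H3.
4. **Add the gradient-conflict risk of multi-expression training and its mitigations**: if different expressions demand contradictory temporal / audio / spatial attention on the same tube, gradients can be accumulated separately via gradient accumulation, or expression pairs with small semantic differences can be sampled first.
5. **Restructure the main table for a top-venue format**: the main table is trimmed to 8 rows and keeps only the major public baselines and the main TubeToken configurations. SimToken + SAM2 proposals, learned reranker, matched compute, TubeToken-Minimal, and TubeToken-Fast move to a separate Fairness Analysis Table.
6. **Separate per-video and per-expression cost in the efficiency analysis**: SAM2 proposals are a one-time per-video cost. When a video has K expressions, the proposal cost is amortized across them; only CondQFormer and the selector are per-expression costs.
7. **Clarify how Selection Acc@3 treats the null tube**: for positive samples, object-level Top-3 is computed with the null tube excluded. "GT tube Top-3 but Null Top-1" is a separate null calibration metric computed on the full ranking.
8. **Define a mutually exclusive priority for error decomposition**: each failed sample falls into exactly one error category, decided in the order Proposal miss → Null FN with GT Top-3 → Null FN without GT Top-3 → Selection error → Refinement error → Null FP.
9. **Update the Phase -1 Go/No-Go criteria**: SimToken reproduction and the multi-expression audit can start in parallel. If the SimToken reproduction differs by more than 1.5 J&F, subsequent experiments are paused. If multi-expression data is insufficient, H3 direct validation drops from P0 to P2 and the fallback narrative is used.
10. **Update the Appendix checklist**: all final reviewer refinements are recorded with their implementation status, forming a pre-experiment checklist.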

## 1. Core Research Hypotheses

### 1.1 Task Reformulation

Referring Audio-Visual Segmentation (Ref-AVS) should not be modeled only as:

\[
\text{MLLM} \rightarrow \langle SEG \rangle \rightarrow \text{SAM}
\]

but rather as:

\[
\text{Candidate Object Tubes}
\rightarrow
\text{Text-Audio Conditioned Tube Selection}
\rightarrow
\text{SAM Refinement}
\]

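In other words, Ref-AVS is in essence closer to **object-level retrieval + mask refinement**:

1. Which candidate object instances exist in the video?
2. Which object instance is jointly referred to by the text and the audio?
3. If no object satisfies the condition, can the model explicitly select Null?
4. Can the selected object tube be further refined into a high-quality mask?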

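### 1.2 Main Hypotheses

**H1: An object tube is a better intermediate representation for Ref-AVS than a global `<SEG>` token.**  
A tube explicitly preserves identity across frames, which lowers the risk of identity switches with multiple same-category instances, occlusion, and objects entering or leaving the frame.

**H2: Null / Existence should be solved through explicit candidate modeling.**  
TubeToken introduces a learnable null tube, turning the Null decision into a candidate selection problem instead of relying on the SAM decoder to passively output an empty mask.

**H3: The same candidate tube should expose different temporal evidence under different referring expressions, so the tube representation must be dynamically modulated by the text/audio condition.**  
In TubeToken, conditional compression is not a substitute for whole-video token pooling; it is **tube-level evidence summarization**. For different expressions, the same object tube may need to attend to different frames, actions, audio segments, or spatial relations.

**Preconditions and validation requirements for H3:**

1. At the data level, first confirm whether Ref-AVSBench contains multiple expressions that point to the same video or the same target.
2. If a multi-expression structure exists, training must use it explicitly: run forward passes with at least two different expressions for the same video / same tube, sharing proposals but using different conditional queries.
3. Validating H3 cannot rely on AC alone. AC only shows whether the model attends to the correct region / tube; it does not show that the same tube yields differentiated evidence summaries under different expressions.
4. The direct validation metric for H3 is the cosine similarity of \(\tilde{z}_i\) for the same video, the same matched GT tube, and different expressions. Because the Fixed Q-Former does not depend on the expression, its outputs for different expressions on the same tube are identical and CosSim \(\equiv 1.0\). The conditioned Q-Former's similarity should be significantly below 1.0, with no drop in selection performance.
5. If the data audit finds that each video has only one expression on average, H3 is not a main contribution and the paper's main thread falls back to "proposal-conditioned instance grounding + explicit null reasoning".

**H4: TubeToken's gains must be explained separately through proposal recall, oracle upper bound, selection accuracy, refinement quality, and efficiency breakdown.**  
Reporting only the final J/F/S cannot answer where the improvement comes from, nor whether the bottleneck lies in proposal, selection, or refinement.

**H5: TubeToken's improvement must still hold under fair compute and fair proposal conditions.**  
Controls such as SimToken + SAM2 proposals, SimToken + matched compute, and SAM2 proposals + learned reranker (no null tube) are required to rule out the explanations "SAM2 proposals are simply stronger" and "there is simply more computation".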

## 2. Method Version Definitions

### 2.1 TubeToken-Full

The full method consists of four stages.

---

### Stage 1: Candidate tube generation

Run SAM2 automatic mask generation on keyframes to produce candidate masks, then propagate them to earlier and later frames with the SAM2 tracking / memory mechanism to obtain candidate object tubes:

\[
\mathcal{O} = \{o_1, o_2, \dots, o_N\}
\]

Each tube:

\[
o_i = \{m_{i,t}, b_{i,t}, f_{i,t}\}_{t=1}^{T}
\]

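where:

- \(m_{i,t}\): mask at frame \(t\);
- \(b_{i,t}\): bbox at frame \(t\);
- \(f_{i,t}\): mask-pooled visual feature.

**Implementation convention:**  
By default, SAM2 AMG runs on keyframes and SAM2 propagation is used on non-keyframes, instead of re-running AMG on every frame. This keeps the proposal stage from becoming too expensive.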

---

### Stage 2: Text-audio conditioned tube representation

The text expression is encoded as \(e_{text}\) and the audio as \(e_{audio}\). The conditioned query is constructed as:

\[
Q = Q_0 + W_t e_{text} + W_a e_{audio} + W_{ta}(e_{text} \odot e_{audio})
\]

The temporal features \(\{f_{i,t}\}_{t=1}^{T}\) of each tube are then compressed under this condition:

\[
\tilde{z}_i = \text{CondQFormer}(Q, \{f_{i,t}\}_{t=1}^{T})
\]

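The goal of this module is not merely to reduce the token count, but to let the same tube form different evidence summaries under different expressions.

**Constraint (carried over from v3):** if the dataset contains multi-expression samples, Stage 2 training must explicitly include, within a batch, forward passes of different expressions for the same video / same tube. Otherwise H3 remains an inference-time assumption and cannot be claimed as strongly supported by experiments.

#### 2.2 Feature Source

Default setting: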

\[
f_{i,t} = \text{MaskPool}(\text{SAM2ImageEncoder}(I_t), m_{i,t})
\]

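That is, Stage 2 reuses SAM2 image encoder features and does not add a separate ViT or CLIP visual encoder. This has three benefits:

1. proposal generation and tube representation use consistent visual features;
2. it avoids the compute cost and fairness disputes of an additional visual encoder;
3. the efficiency table is cleaner and easier to compare with SimToken and SAM2-based baselines.

Optional extension: if SAM2 encoder features are insufficiently aligned with text/audio semantics, a lightweight projector can be added: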

\[
f'_{i,t} = W_v f_{i,t}
\]

By default, however, no additional large-scale visual-language encoder is introduced.

---

### Stage 3: Tube selection with null tube

Add a learnable null tube:

\[
z_{null}
\]

All candidate tubes are fed into the tube selector together with the null tube:

\[
P(i \mid video, audio, text) =
\text{Softmax}([s_1, s_2, \dots, s_N, s_{null}])
\]

If \(P(null)\) is the largest, output an empty mask; otherwise select the highest-scoring object tube.

The existence probability is then naturally defined as:

\[
p_{exist} = 1 - P(null)
\]
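The selection rule above can be sketched in a few lines. This is a minimal illustration, assuming the selector has already produced one scalar score per object tube plus one score for the null tube; the function name `select_tube` and the tie-breaking behavior (the first maximum wins, so an object tube beats the null tube on an exact tie) are assumptions of this sketch, not part of the plan.

```python
import math

def select_tube(object_scores, null_score):
    """Softmax over N object-tube scores plus one null score.

    Returns (selected_index, p_exist). selected_index is None when the
    null tube has the highest probability, meaning an empty mask is output.
    """
    logits = list(object_scores) + [null_score]
    peak = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    p_exist = 1.0 - probs[-1]                # p_exist = 1 - P(null)
    best = max(range(len(probs)), key=probs.__getitem__)
    selected = None if best == len(object_scores) else best
    return selected, p_exist

# Hypothetical scores, for illustration only.
selected, p_exist = select_tube([2.0, 0.5], null_score=-1.0)
```

Thresholding `p_exist` instead of taking the argmax gives the operating points studied in the null threshold sensitivity analysis.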

#### Default tube selector structure

The default design uses:

1. reference query \(q_{ref}=\text{MLP}([e_{text},e_{audio}])\);
2. tube tokens \(\{\tilde{z}_i\}_{i=1}^{N}\);
3. inter-tube self-attention;
4. reference-conditioned cross-attention;
5. per-tube classification head.

Required ablations:

- w/ inter-tube self-attention;
- w/o inter-tube self-attention;
- independent tube scoring, where each tube is scored independently by a linear layer on \([q_{ref}; \tilde{z}_i]\).

---

### Stage 4: SAM refinement

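After a tube is selected, the tube itself contributes only its per-frame bbox as a box prompt (never its mask), which is combined with the text/audio semantic prompt \(q_{ref}\) for SAM refinement: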

\[
\hat{m}_t = \text{SAMRefine}(I_t, b_{i,t}, q_{ref})
\]

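By default the tube mask is not used as a mask prompt, which avoids the interpretability problem of "self-refinement". The tube mask is used only for:

1. generating the bbox;
2. extracting the tube feature;
3. proposal matching;
4. computing the oracle upper bound.

Additional controls are required:

- bbox-only prompt;
- bbox + semantic prompt;
- bbox + mask prompt.

If bbox + mask prompt brings no clear gain, the main text uses bbox-only or bbox + semantic prompt as the default version.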

---

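## 3. Data Audit and Diagnostic Subset Construction

The data audit must be completed before formal training. This step determines whether the later experiments are convincing. v3 promoted the data audit to **Phase -1**, in which the multi-expression structure and the SimToken reproduction are prerequisites for entering Phase 0.

### 3.1 Required Statistics

| Item | Purpose |
|---|---|
| Number of referring expressions per video | Decide whether H3 can be trained and validated directly |
| Number of expressions per GT object / tube | Build the H3 direct validation subset |
| Whether the positive expression set \(\mathcal{P}_i\) in the SimToken alignment loss can be reused | Decide the implementation path for multi-expression training |
| Proportion of Null samples | Assess the training difficulty of the null tube / weighted CE |
| Proportion of frames in which the GT target is visible | Decide whether frame-level existence is needed; if the proportion is low, do not introduce it |
| Distribution of the target's first appearance time | Build the late-target subset and test whether first-frame bias is reduced |
| Proportion of same-category multi-instance videos | Test whether inter-tube reasoning and hard negatives are necessary |
| Proportion of small / occluded targets | Assess proposal recall risk |
| Proportion of audio-dependent expressions | Test whether audio-conditioned compression has room to help |
| Proportion of spatial-relation expressions | Test whether a spatial/relation query is necessary |
| Relation between proposal miss and target attributes | Analyze systematic SAM2 proposal bias on small, occluded, and unseen-category targets |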

#### 3.1.1 Decision Rules for the Multi-expression Audit

| Audit result | Impact on H3 and CondQFormer |
|---|---|
| Average expressions per video > 1.5, and the same GT object has multiple expressions | Proceed with H3 as planned; use multi-expression training and direct cosine validation |
| Most videos have only 1 expression, but a few have multiple | H3 becomes a diagnostic contribution; report direct validation on the multi-expression subset |
| Essentially every video has only 1 expression | Do not make H3 a core claim; describe CondQFormer as learned tube compression / multimodal query adaptation |

---

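### 3.2 Diagnostic Subsets

At least the following subsets are built.

#### 3.2.1 Late-target subset

Samples whose target first becomes visible in the second half of the video.

Definition: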

\[
t_{first} = \min \{t \mid g_t \neq \emptyset\}
\]

If:

\[
t_{first} > 0.5T
\]

the sample is assigned to the late-target subset.
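A minimal sketch of the late-target rule, assuming per-frame GT visibility is available as a list of booleans with frames indexed \(t = 1, \dots, T\). The function name and the choice to return `False` for samples with no visible GT frame (Null samples) are assumptions of this sketch.

```python
def is_late_target(gt_visible, threshold=0.5):
    """Return True when the first GT-visible frame satisfies t_first > threshold * T.

    gt_visible: list of bools, one per frame, True where the GT mask is non-empty.
    """
    total_frames = len(gt_visible)
    for t, visible in enumerate(gt_visible, start=1):  # frames are 1-indexed
        if visible:
            return t > threshold * total_frames
    return False  # no visible GT frame: treated as not late-target (assumption)
```

With four frames, a target first visible at frame 4 is late-target, while one first visible at frame 2 is not, because the inequality is strict.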

---

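#### 3.2.2 Audio-critical subset

The two-stage definition from v3 is kept.

**Stage A: coarse filtering**

Filter by text keywords:

- sounding;
- making sound;
- longest sound;
- intermittent sound;
- silent;
- audio;
- heard;
- emitting sound;
- playing instrument, etc.

**Stage B: fine filtering**

After the w/o Audio variant is trained, samples that satisfy all of the following conditions form the strict audio-critical subset:

1. the Full model predicts correctly or clearly exceeds the threshold;
2. the w/o Audio model predicts incorrectly or its J/F drops significantly;
3. the video contains at least two visual candidates, and vision alone cannot reliably identify the target.

This excludes pseudo audio-critical samples whose expression contains audio words but whose target is visually unique.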

---

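#### 3.2.3 Same-category distractor subset

The video contains multiple candidate objects of the same category or of very similar appearance, and the expression must discriminate between instances.

Preferred data sources:

1. the dataset's original object annotations;
2. if no annotation is available, zero-shot object discovery with CLIP / Grounding DINO / OWL-ViT;
3. clustering SAM2 proposals by mask-pooled CLIP similarity to approximately identify same-category candidates.

The construction procedure and the manual spot-check accuracy must be reported for this subset, so that reviewers cannot question its reliability.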

---

#### 3.2.4 Null subset

The original Null samples, further divided into:

1. visual object exists but not referred；
2. audio exists but no valid visual target；
3. text refers to absent object；
4. audio-text conflict / ambiguous null。

---

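#### 3.2.5 Small / occluded target subset

Used to analyze proposal miss.

Initial definitions:

- small: GT mask area is less than 5% of the image area;
- heavily occluded: fewer than \(0.5T\) consecutive visible frames, or mask area fluctuates sharply over time;
- partial target: the target appears only in some frames.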

---

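#### 3.2.6 Multi-expression H3 subset

Used to validate H3 directly.

Sample conditions:

1. the same video has at least two referring expressions;
2. these expressions point to the same GT object / GT tube, or at least to the same target instance that can be matched reliably;
3. the expressions differ semantically, for example in category, action, audio, spatial relation, interacting object, or temporal segment;
4. a matched GT tube exists among the SAM2 proposals, so that proposal miss does not interfere with H3 validation.

Reported content:

- average number of expressions per video;
- average number of expressions per GT object;
- number of samples in the H3 subset;
- distribution of expression difference types;
- manual spot-check accuracy.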

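## 4. Phase 0: Proposal Recall and Oracle Upper-Bound Pilot

This is the go / no-go experiment for TubeToken. If proposal recall or the oracle upper bound is insufficient, TubeToken's performance ceiling is limited by the proposal stage.

### 4.0 Phase -1 Prerequisite Baseline: SimToken Reproduction

The SimToken reproduction must be completed before running the Proposal Recall and Oracle upper-bound experiments.

Requirements:

1. Use the same data split, input resolution, audio features, training epochs, batch size, optimizer, scheduler, and evaluation script as the later TubeToken experiments.
2. Use our reproduced SimToken J/F/S as the primary reference in all Go/No-Go conditions.
3. Official SimToken numbers serve only as a side note. If the reproduced numbers differ from the official ones by more than 1.5 J&F, the source of the difference must be located first.
4. State explicitly in the paper: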

> All comparisons are conducted under the same training configuration as SimToken (reproduced), with official results cited where applicable.

### 4.1 Setup

- Proposal model: SAM2 automatic mask generation.
- Keyframe strategies:
  - stride = 4;
  - stride = 8;
  - stride = 16;
  - first / middle / last + audio-peak frames;
  - uniform + motion-peak frames;
  - uniform + audio-peak + motion-peak frames.
- Propagation: use the SAM2 memory / tracking mechanism to generate complete tubes.
- Candidate numbers: \(N=16,32,64,128\).

---

### 4.2 Tube Matching Definition

Since v3, matching uses the **GT-visible-frame mean tube IoU**, so that late-target or partial-target samples are not diluted by empty frames.

Let:

\[
\mathcal{T}_g = \{t \mid g_t \neq \emptyset\}
\]

Then:

\[
IoU_{tube}(o_i, g)=
\frac{1}{|\mathcal{T}_g|}
\sum_{t \in \mathcal{T}_g}
IoU(m_{i,t}, g_t)
\]

If:

\[
\max_i IoU_{tube}(o_i, g) \ge 0.5
\]

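the GT is considered covered by the proposals.

A stricter version is reported as well:

\[
IoU_{tube}^{all}(o_i, g)
=
\frac{1}{T}
\sum_{t=1}^{T}
IoU(m_{i,t}, g_t)
\]

It is used to analyze whether a tube produces spurious masks in frames where the GT is absent.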
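The two tube IoU variants can be sketched as follows. This is an illustrative sketch only: masks are represented as sets of pixel indices, and the value of \(IoU(\emptyset, \emptyset)\) for frames where both the tube mask and the GT are empty is not fixed by the definitions above, so the `empty_value=1.0` convention (an empty prediction on an empty GT frame counts as perfect) is an assumption.

```python
def mask_iou(pred, gt, empty_value=1.0):
    """IoU of two masks given as sets of pixel indices."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    if not union:
        return empty_value  # both empty: convention assumed by this sketch
    return len(pred & gt) / len(union)

def tube_iou_visible(tube_masks, gt_masks):
    """Mean IoU over frames where the GT is non-empty (GT-visible frames)."""
    visible = [t for t, g in enumerate(gt_masks) if g]
    if not visible:
        return 0.0  # Null sample: no GT-visible frame to match against
    return sum(mask_iou(tube_masks[t], gt_masks[t]) for t in visible) / len(visible)

def tube_iou_all(tube_masks, gt_masks):
    """Mean IoU over all T frames; penalizes masks on GT-absent frames."""
    pairs = list(zip(tube_masks, gt_masks))
    return sum(mask_iou(m, g) for m, g in pairs) / len(pairs)

# Toy example: GT absent in frame 0, tube leaks a spurious pixel there.
gt = [set(), {1, 2}, {1, 2}]
tube = [{5}, {1, 2}, {1}]
covered = tube_iou_visible(tube, gt) >= 0.5
```

In the toy example the visible-frame IoU is 0.75 while the all-frame IoU is 0.5, which is exactly the gap the stricter variant is meant to expose.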

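#### 4.2.1 Precise Definition of Oracle Refined J/F

**Oracle Tube J/F**: among the top-N candidate tubes, pick the tube with the highest \(IoU_{tube}\) and evaluate J/F directly on that tube's masks.

**Oracle Refined J/F**: among the top-N candidate tubes, pick the oracle tube, use only that tube's bbox as the SAM / SAM2 box prompt, and evaluate J/F after refinement.

Constraints:

1. the GT mask must not be used as a mask prompt;
2. the oracle GT box must not be used;
3. the bbox comes from the oracle proposal tube;
4. the refinement setting must match the actual Stage 4 default setting.

Only then is Oracle Refined J/F a reachable upper bound of the actual TubeToken refinement, rather than an idealized upper bound that depends on the GT mask.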

---

### 4.3 Metrics

| Metric | Explanation |
|---|---|
| Recall@16 / 32 / 64 / 128 | Whether a GT tube exists among the top-N tubes |
| Oracle Tube J/F | Proposal upper bound when always selecting the tube with the highest \(IoU_{tube}\) |
| Oracle Refined J/F | Upper bound after selecting the oracle tube and running SAM refinement with the proposal bbox prompt only |
| Proposal coverage by subset | Reported separately on late-target, small, occluded, and unseen |
| Proposal miss % | Proportion of samples whose GT is not covered |
| Average tubes per video | Compute cost and pruning difficulty |
| Proposal generation latency | Efficiency assessment |
| Tube temporal purity | Whether a tube produces many false positives in GT-absent frames |

---

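### 4.4 Go / No-Go Decision Criteria

In all thresholds below, SimToken means **our reproduced SimToken**, not merely the cited official numbers.

#### 4.4.1 Milestone 1 green-light conditions

All of the following hold:

1. Recall@32 ≥ 85%, where matching uses GT-visible-frame IoU ≥ 0.5;
2. Oracle Tube J/F ≥ reproduced SimToken J/F + 5%;
3. Oracle Refined J/F ≥ Oracle Tube J/F + 3%, showing that SAM refinement has clear headroom;
4. Recall@32 on the small / occluded subset ≥ 70%, so that proposals have no systematic blind spot on the key hard samples.

Strategy: TubeToken proceeds as planned, and the default Balanced configuration uses \(N=32\).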

#### 4.4.2 Milestone 1 yellow-light conditions

| Condition | Follow-up strategy |
|---|---|
| Recall@32 is 80%-85%, and Oracle Tube J/F meets the green-light condition | Continue, but default to \(N=64\) and analyze proposal miss prominently in the paper |
| Oracle Tube J/F only reaches ≥ SimToken + 2%, but Oracle Refined J/F ≥ SimToken + 5% | Continue, but shift the paper's focus from selection to refinement; emphasize proposal-conditioned refinement |
| Recall@32 ≥ 85%, but small/occluded Recall@32 < 70% | Continue the main thread, but add backup experiments with detector-assisted proposals or high-resolution proposals |

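#### 4.4.3 Milestone 1 red-light conditions

If any one condition holds, pause the TubeToken main thread and prioritize switching to EC-SimToken or redoing the proposal stage:

1. Recall@64 < 80%;
2. Oracle Tube J/F ≤ reproduced SimToken J/F;
3. Recall@32 ≥ 85%, and the gap between Oracle Refined J/F and Oracle Tube J/F is < 1%, and Oracle Tube J/F ≤ reproduced SimToken J/F + 2%.

The third red-light condition uses only quantities observable in Phase 0. It means that proposal quality is only slightly better than SimToken and bbox-only refinement adds almost nothing. In that case TubeToken lacks a sufficient foothold on this dataset and should not rely on expected selection gains that cannot be verified before Milestone 2.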
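Because the Go/No-Go rules must be fixed before results are seen, it can help to encode them as a function. The sketch below is one possible encoding under stated assumptions: all inputs are in percentage points, "+ 5%" and similar offsets are read as absolute points, the second yellow row is read as Oracle Tube J/F lying in [SimToken + 2, SimToken + 5), red conditions are checked first, and combinations not covered by any rule return `"unspecified"`. The function name and return labels are hypothetical.

```python
def milestone1_decision(recall32, recall64, oracle_tube, oracle_refined,
                        simtoken, hard_recall32):
    """Map Phase 0 measurements to a Milestone 1 decision label.

    hard_recall32 is Recall@32 on the small / occluded subset.
    """
    # Red light: any one condition pauses the TubeToken main thread.
    if (recall64 < 80
            or oracle_tube <= simtoken
            or (recall32 >= 85
                and oracle_refined - oracle_tube < 1
                and oracle_tube <= simtoken + 2)):
        return "red"
    # Green light: all four conditions hold.
    if (recall32 >= 85
            and oracle_tube >= simtoken + 5
            and oracle_refined >= oracle_tube + 3
            and hard_recall32 >= 70):
        return "green"
    # Yellow light: rows of the yellow-light table, checked in order.
    if 80 <= recall32 < 85 and oracle_tube >= simtoken + 5:
        return "yellow: use N=64, analyze proposal miss"
    if simtoken + 2 <= oracle_tube < simtoken + 5 and oracle_refined >= simtoken + 5:
        return "yellow: shift focus to refinement"
    if recall32 >= 85 and hard_recall32 < 70:
        return "yellow: add detector-assisted or high-resolution proposals"
    return "unspecified"  # not covered by the stated rules
```

The `"unspecified"` branch makes visible that the three light levels do not partition every possible outcome, which is worth resolving before Phase 0 results arrive.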

---

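### 4.5 Fallback Strategies if Recall Is Insufficient

1. increase the number of keyframes;
2. use audio-peak / motion-peak keyframes;
3. run an open-vocabulary detector on category words in the text to generate boxes, then feed them to SAM2;
4. use SimToken / EC-SimToken masks as additional proposals;
5. introduce a hybrid fallback: if proposal confidence is low, fall back to global semantic prompt segmentation.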

## 5. Baselines and Model Variants

### 5.1 Models That Must Be Reproduced / Compared

| Model | Purpose |
|---|---|
| EEMC | Original Ref-AVS baseline |
| TSAM | SAM-based Ref-AVS baseline |
| SAM2-LOVE | SAM2-based Ref-AVS baseline |
| SimToken | Most direct comparison; must be reproduced |
| EC-SimToken | Strengthened global token baseline, showing that TubeToken does not only beat a weak baseline |
| SimToken + SAM2 proposals | Controls for the gain from SAM2 proposals, using zero-parameter reranking |
| SAM2 proposals + learned reranker (no null tube) | Separates the contributions of the learned tube reranker and the null tube |
| SimToken + matched compute | Fair matched-compute control |
| TubeToken-Minimal | Minimal tube selection framework |
| TubeToken-Full | Full method |

If EEMC, TSAM, and SAM2-LOVE cannot be fully reproduced, official results may be cited. However, SimToken, SimToken + SAM2 proposals, SAM2 proposals + learned reranker, SimToken + matched compute, and TubeToken must be compared under the same training / input / evaluation setting.

---

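### 5.2 Main TubeToken Ablations

| Variant | Purpose |
|---|---|
| TubeToken-Full | Full model |
| TubeToken-Minimal | SAM2 proposals + fixed tube feature + selector + null tube; no CondQFormer, no refinement |
| SAM2 proposals + learned reranker (no null tube) | Separates the contributions of the learned selector and the null tube |
| w/o null tube | Validates explicit Null modeling |
| null tube → binary existence head | Compares the null tube with an extra binary classification head |
| w/o null tube + mask-area threshold | Distinguishes whether Null performance comes from the tube framework or from the null tube design |
| fixed Q-Former | Checks that the gain comes from conditioning rather than from added parameters |
| text-conditioned only | Validates the contribution of the text condition |
| audio-conditioned only | Validates the contribution of the audio condition |
| text+audio-conditioned | Full conditional compression |
| w/o inter-tube self-attention | Checks whether relative comparison between tubes is necessary |
| independent tube scoring | Each tube is scored independently by a linear layer on \([q_{ref};\tilde{z}_i]\) |
| w/o SAM refinement | Validates tube selection on its own |
| bbox prompt refinement | Default refinement scheme |
| bbox + semantic prompt refinement | Checks whether the semantic prompt contributes |
| bbox + mask prompt refinement | Checks whether the mask prompt brings gains or overfitting |
| N=16/32/64/128 | Analyzes the candidate count and the recall/efficiency trade-off |
| stride=4/8/16 | Analyzes the keyframe count and the efficiency trade-off |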

---

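### 5.3 Fairness Control Variants

#### 5.3.1 SimToken + SAM2 proposals: zero-parameter proposal reranking baseline

Purpose: answer "Does TubeToken improve only because it uses SAM2 proposals?"

This baseline must use parameter-free reranking; the vague wording "rerank or fusion" is not acceptable.

Implementation:

1. Keep SimToken's global `<SEG>` generation and obtain \(F_{seg}\).
2. Use exactly the same SAM2 proposals and tube construction as TubeToken.
3. Extract the temporal mask-pooled features \(f_{i,t}\) for each proposal tube.
4. Use the following zero-parameter score: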

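\[
\text{score}(o_i)
=
F_{seg}^{\top}
\cdot
\frac{1}{T}
\sum_{t=1}^{T} f_{i,t}
\]

5. Select the proposal tube with the highest score and use the same output setting as TubeToken-Minimal.

This scheme introduces no additional learnable parameters and uses \(F_{seg}\) the same way SimToken does, which minimizes the room for reviewers to argue that the control was weakened.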
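The zero-parameter score is a dot product between \(F_{seg}\) and the temporal mean of a tube's mask-pooled features. A minimal sketch follows, assuming \(F_{seg}\) and \(f_{i,t}\) already live in the same \(d\)-dimensional space (how that alignment is obtained without new parameters is not specified here) and representing vectors as plain Python lists. Function names are hypothetical.

```python
def rerank_score(f_seg, tube_feats):
    """Dot product of f_seg with the temporal mean of a tube's features.

    f_seg: list of d floats; tube_feats: list of T feature vectors (d floats each).
    """
    num_frames = len(tube_feats)
    dim = len(f_seg)
    mean_feat = [sum(frame[k] for frame in tube_feats) / num_frames
                 for k in range(dim)]
    return sum(a * b for a, b in zip(f_seg, mean_feat))

def pick_tube(f_seg, tubes):
    """Index of the proposal tube with the highest zero-parameter score."""
    return max(range(len(tubes)), key=lambda i: rerank_score(f_seg, tubes[i]))

# Toy 2-d features for two tubes over two frames (illustrative values only).
f_seg = [1.0, 0.0]
tubes = [
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.5, 0.0]],
]
best = pick_tube(f_seg, tubes)
```

Because the score is an unnormalized dot product, tubes with larger feature norms are favored; whether to L2-normalize is a design choice that should be fixed before the comparison is run.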

---

#### 5.3.2 SAM2 proposals + learned reranker (no null tube)

Purpose: answer "Does TubeToken-Minimal's improvement come from the learned tube selector or from the null tube?"

Implementation:

1. Use the same SAM2 proposals, tube construction, tube features, and \(q_{ref}\) as TubeToken-Minimal.
2. Train a learned reranker / classifier that scores the non-null candidate tubes.
3. Do not add a learnable null tube.
4. Handle Null cases with a mask-area threshold or a calibrated score threshold.
5. Compare against TubeToken-Minimal: if TubeToken-Minimal is clearly better, the null tube contributes independently; if the learned reranker is already close to TubeToken-Minimal, the main gain comes from learned tube selection.

---

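#### 5.3.3 SimToken + matched compute: pre-registered matched-compute baseline

Purpose: answer "Is TubeToken just trading computation for performance?"

v4 fixes a single implementation and no longer keeps multiple candidate options:

> **SimToken + multiple keyframe prompting with the same number of keyframes as TubeToken-Fast.**

Implementation conventions:

1. Use the same number of keyframes as TubeToken-Fast, by default the stride=16 keyframe budget of TubeToken-Fast.
2. Run SimToken's global `<SEG>` / SAM prompting pipeline separately on each keyframe.
3. Merge the per-keyframe predictions into a video-level mask with a single propagation / aggregation rule, which must be fixed before the experiments.
4. Do not use SAM2 proposal tube reranking, a learned tube selector, or a null tube.
5. Report latency, FLOPs, the number of SAM/SAM2 calls, and the MLLM token count, keeping them as close as possible to the compute budget of TubeToken-Fast.

Rationale: TubeToken-Fast's extra computation comes mainly from more keyframes and from proposal/propagation processing, and multiple keyframe prompting is the most direct, most interpretable, and hardest-to-challenge matched-compute enhancement on the SimToken side. This baseline must be pre-registered before the experiments begin and cannot be swapped afterwards based on the final results.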

## 6. Training Design

### 6.1 Tube label assignment

For positive videos, the candidate tube with the largest GT-visible-frame mean tube IoU is chosen as the positive tube:

\[
i^* = \arg\max_i IoU_{tube}(o_i,g)
\]

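If the maximum IoU is below 0.5, the sample is marked as a proposal miss. During training, such a sample:

- is not used in the tube classification loss;
- can be used for proposal miss statistics;
- should not have a low-IoU tube forced to be its positive, since that would contaminate the selector.

For Null samples, the positive class is the null tube.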

---

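### 6.2 Loss function

Since v3, the default total loss **does not include the undefined \(\mathcal{L}_{cond}\)**. Null weighting is folded into the tube classification CE rather than written as a separate \(\mathcal{L}_{null}\).

Default total loss: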

\[
\mathcal{L}
=
\mathcal{L}_{tube}^{weighted}
+
\lambda_m y\mathcal{L}_{mask}
+
\lambda_r\mathcal{L}_{rank}
\]

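where:

\[
\mathcal{L}_{tube}^{weighted}
=
\sum_n
w_n \cdot
\text{CE}(P(\cdot \mid video_n,audio_n,text_n), c_n)
\]

- \(n\) indexes training samples, and \(c_n\) is the target candidate of sample \(n\): the matched tube \(i^*\) for positive samples and the null tube for Null samples;
- positive samples: \(w_n=1\);
- Null samples: \(w_n=w_{null}\), controlled by the curriculum;
- \(y\in\{0,1\}\) in the mask term: indicator that the sample is neither Null nor a proposal miss;
- \(\mathcal{L}_{mask}\): BCE + Dice, computed only on samples that are neither Null nor proposal miss;
- \(\mathcal{L}_{rank}\): hard negative ranking loss.

Hard negative ranking, where \(\mathcal{N}\) is the set of mined hard-negative tubes and \(\Delta\) is the margin: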

\[
\mathcal{L}_{rank}
=
\sum_{j\in\mathcal{N}}
\max(0,\Delta-s_{i^*}+s_j)
\]
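The two selection-related loss terms can be sketched on raw scores as below. This is a per-sample illustration in plain Python rather than a training implementation: the null tube is assumed to be the last entry of the logits, the weight is supplied by the caller according to the sample type, and the function names are hypothetical.

```python
import math

def weighted_tube_ce(logits, target, weight):
    """Weighted cross-entropy for one sample.

    logits: scores [s_1, ..., s_N, s_null]; target: index of the correct
    candidate (N for the null tube); weight: 1.0 for positives, w_null for Null.
    """
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in logits))
    return weight * (log_sum_exp - logits[target])

def hard_negative_rank_loss(scores, positive, negatives, margin):
    """Sum of hinge terms max(0, margin - s_pos + s_neg) over hard negatives."""
    return sum(max(0.0, margin - scores[positive] + scores[j]) for j in negatives)

# Toy values, for illustration only.
ce = weighted_tube_ce([0.0, 0.0], target=0, weight=2.0)
rank = hard_negative_rank_loss([2.0, 1.8, 0.0], positive=0,
                               negatives=[1, 2], margin=0.5)
```

In the toy example only the negative scoring 1.8 violates the margin, so the ranking loss is about 0.3; the negative at 0.0 contributes nothing.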

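#### 6.2.1 Optional \(\mathcal{L}_{cond}\) auxiliary term

If the experiments decide to use attention supervision, \(\mathcal{L}_{cond}\) must be defined and ablated separately; it must not appear as a dangling term in the default loss.

Optional definition: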

\[
\mathcal{L}_{cond}
=
-
\sum_{t,l}
\bar{M}_{t,l}
\log A_{t,l}
\]

where:

- \(A_{t,l}\): CondQFormer attention on the \(l\)-th patch / region of frame \(t\);
- \(\bar{M}_{t,l}\): the normalized GT mask or matched proposal mask;
- this term is used only on samples with reliable GT spatial supervision.

If this term is used, the total loss becomes:

\[
\mathcal{L}
=
\mathcal{L}_{tube}^{weighted}
+
\lambda_m y\mathcal{L}_{mask}
+
\lambda_r\mathcal{L}_{rank}
+
\lambda_c\mathcal{L}_{cond}
\]

and results are reported with / without \(\mathcal{L}_{cond}\).

---

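### 6.3 Multi-expression training for CondQFormer

This is the necessary training-level implementation of H3.

Precondition: the data audit confirms that the same video or the same GT object has multiple referring expressions.

Training procedure:

1. For each multi-expression sample, generate SAM2 proposals once to obtain the shared candidate tubes \(\mathcal{O}\).
2. Within the same batch or gradient accumulation window, sample at least two different expressions: \(r_a, r_b\).
3. Construct a conditioned query for each expression on the same set of tubes: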

\[
Q_a = Q_0 + W_t e_{text}^{a} + W_a e_{audio}^{a} + W_{ta}(e_{text}^{a} \odot e_{audio}^{a})
\]

\[
Q_b = Q_0 + W_t e_{text}^{b} + W_a e_{audio}^{b} + W_{ta}(e_{text}^{b} \odot e_{audio}^{b})
\]

4. Obtain, respectively:

\[
\tilde{z}_{i}^{a} = \text{CondQFormer}(Q_a, \{f_{i,t}\}_{t=1}^{T})
\]

\[
\tilde{z}_{i}^{b} = \text{CondQFormer}(Q_b, \{f_{i,t}\}_{t=1}^{T})
\]

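5. Tube proposals are shared, but the tube selection loss is computed independently for each expression.
6. If the two expressions point to the same GT tube, both selections must be correct. \(\tilde{z}_{i}^{a}\) and \(\tilde{z}_{i}^{b}\) are not forced to be equal, because H3 requires precisely that different expressions expose different evidence.
7. If the two expressions point to different targets, they serve as inter-expression hard negatives that strengthen instance discrimination within the same video.

**Implementation note: gradient-conflict risk.**  
When two expressions require attention to different evidence on the same tube, for example one relying on audio-active frames and the other on spatial position, the shared CondQFormer parameters may receive conflicting gradients and training may oscillate. If loss oscillation, attention collapse, or a clear drop in positive-sample Selection Acc appears, apply the following mitigations:

1. place the forward / backward passes of different expressions in the same gradient accumulation window, but compute their gradients separately before accumulating, instead of forcing them into one merged forward pass;
2. early in training, prefer expression pairs with small semantic differences, for example both visual or both audio expressions;
3. once training is stable, gradually add cross-modality expression pairs, for example audio-expression vs spatial-expression;
4. log the multi-expression pair types together with training stability, so that gradient conflict is not mistaken for ineffective conditioning.

Training logs:

- proportion of multi-expression samples in each batch;
- number of expressions per shared proposal set;
- distribution of expression pair types: visual-visual, audio-audio, visual-audio, spatial-audio;
- comparison of results with and without multi-expression training.

If the dataset does not support multi-expression training, the paper must weaken its H3 claims.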

---

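### 6.4 Null tube curriculum

Null tube training is unstable early on, so a curriculum is used:

| Stage | epoch | Null weight \(w_{null}\) |
|---|---:|---:|
| Warmup | 0-2 | 2.0 |
| Middle | 3-6 | 1.0 |
| Final | 7+ | 0.5 |

Null oversampling is used as well, but the target ratio must be stated explicitly.

Default settings:

- target proportion of Null samples in each batch: 25%;
- if the original Null proportion is above 25%, do not downsample; use the natural distribution;
- if the original Null proportion is below 25%, oversample to reach the target;
- the Null proportion in a single batch should in principle not exceed 33%, except in the dedicated sampling-ratio ablation.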
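The curriculum table and the default sampling rule can be written down directly. A minimal sketch, assuming epochs are 0-indexed as in the table; the helper names are hypothetical, and the 33% soft cap is deliberately left out because the rules above do not say how it interacts with a natural Null proportion above 33%.

```python
def null_weight(epoch):
    """Null-sample loss weight w_null following the three-stage curriculum."""
    if epoch <= 2:
        return 2.0   # Warmup: epochs 0-2
    if epoch <= 6:
        return 1.0   # Middle: epochs 3-6
    return 0.5       # Final: epoch 7 and later

def null_samples_per_batch(batch_size, natural_ratio, target_ratio=0.25):
    """Number of Null samples in a batch.

    Oversamples up to target_ratio when Null samples are rare and keeps the
    natural distribution (no downsampling) when they are already more frequent.
    """
    ratio = max(natural_ratio, target_ratio)
    return round(batch_size * ratio)
```

For a batch of 32, a natural Null proportion of 10% is raised to 8 Null samples, while a natural proportion of 40% is left at roughly 13.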

The effect of the Null sampling ratio on the following metrics must be reported:

- Null FPR;
- Positive FNR;
- Null S;
- Tube Selection Acc@1;
- the error rate of "GT tube Top-3 but null tube Top-1".

Null sampling ratio ablation:

| Ratio | Purpose |
|---:|---|
| 0% | no-oversampling baseline |
| 12.5% | weak oversampling |
| 25% | default setting |
| 33% | stronger oversampling |
| 50% | checks whether the model becomes overly conservative and over-predicts null |

---

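### 6.5 Hard negative mining

Hard negative mining is introduced in stages to keep engineering dependencies clean.

#### Milestone 2: TubeToken-Minimal stage

Use only hard negatives that do not depend on CondQFormer:

1. tubes with high IoU against the GT that are not the target;
2. tubes spatially close to the GT bbox / mask;
3. tubes whose mask-pooled visual feature is similar to the GT's;
4. if category labels exist, different instances of the same category.

#### Milestone 3: CondQFormer stage

Add text/audio mismatch negatives:

1. similar to the text but mismatched with the audio;
2. synchronized with the audio but mismatched with the text;
3. tubes highly correlated with an audio-critical expression that are not the GT;
4. high-scoring wrong tubes in same-category distractor samples;
5. when different expressions in the same video point to different targets, the target tube of the other expression serves as a hard negative.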

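## 7. Evaluation Metrics

### 7.1 Standard Metrics

| Metric | Description |
|---|---|
| Seen J / F / J&F | Segmentation quality on seen categories |
| Unseen J / F / J&F | Generalization to unseen categories |
| Mix J / F / J&F | Overall performance |
| Null S | Empty-target performance on the Null subset |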

---

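### 7.2 TubeToken-Specific Metrics

| Metric | Description |
|---|---|
| Recall@N | Whether the proposal stage covers the GT |
| Oracle Tube J/F | Proposal upper bound |
| Oracle Refined J/F | Upper bound of proposal + bbox-only refinement |
| Tube Selection Acc@1 | When the GT tube is covered, whether the Top-1 prediction is the matched GT tube |
| Tube Selection Acc@3 | Whether the matched GT tube is in the Top-3 |
| GT Top-3 but Null Top-1 Rate | Proportion of cases where the GT tube is in the Top-3 but the null tube ranks first |
| Null Accuracy | Whether the null tube is selected correctly |
| Null FPR | Proportion of Null videos in which a non-empty tube is wrongly selected |
| Positive FNR | Proportion of positive videos in which the null tube is wrongly selected |
| Existence AUC | Discriminative power of \(p_{exist}=1-P(null)\) |
| Reliability Diagram / ECE | Whether the existence probability is calibrated |
| Refinement Gain | J/F improvement from before to after SAM refinement |
| Latency / FPS / Memory | Efficiency metrics |
| \(AC\) | Whether attention mass concentrates on the GT region / GT tube |
| \(\widehat{AC}_{tube}\) | Normalized tube-level AC, defined as \(N\cdot AC_{tube}\), for comparison across different N |
| H3 Cosine Similarity Gap | Difference between the conditioned and the fixed Q-Former in \(\tilde{z}_i\) similarity for the same tube under different expressions; since the fixed Q-Former gives exactly 1.0, the gap equals 1 minus the conditioned CosSim |

**Definition of Tube Selection Acc:**  
Among samples whose GT tube is covered by the proposals, the proportion where the selector's Top-1 prediction matches the matched GT tube. Proposal miss samples are excluded from this metric but must be reported separately.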

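**Null handling in Selection Acc@3:**  
When evaluating object-level Top-3 on positive samples, first remove the null tube from the candidate ranking, then check whether the matched GT tube is in the Top-3. Otherwise, a case where the null tube ranks within the first three and the GT tube ranks 4th in the full ranking (3rd among object tubes) would be miscounted as an object selection failure. Null-calibration cases are reported separately as **GT Top-3 but Null Top-1 Rate**, which is computed on the full ranking that includes the null tube.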
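The difference between the two rankings is easy to get wrong, so a small sketch may help. It assumes a ranking is a list of candidate ids sorted by descending score, with the null tube represented by the sentinel string `"null"`; both the representation and the function names are assumptions of this sketch. Following the wording of this subsection, the calibration flag is computed on the full ranking.

```python
NULL = "null"  # sentinel id for the null tube (assumption of this sketch)

def gt_in_object_top_k(ranking, gt_id, k=3):
    """Object-level Top-k: drop the null tube first, then check the first k."""
    objects = [c for c in ranking if c != NULL]
    return gt_id in objects[:k]

def gt_top_k_but_null_top1(ranking, gt_id, k=3):
    """Null-calibration flag computed on the full ranking (null included)."""
    return ranking[0] == NULL and gt_id in ranking[:k]

# Null ranks 2nd and the GT tube ranks 4th overall, i.e. 3rd among objects.
ranking = ["a", NULL, "b", "gt"]
object_level_hit = gt_in_object_top_k(ranking, "gt")
full_ranking_hit = "gt" in ranking[:3]
```

Here `object_level_hit` is true while `full_ranking_hit` is false, which is the case that excluding the null tube is meant to handle.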

If there are fewer than 200 Null samples, the Reliability Diagram is the primary calibration analysis and ECE is only a supporting number.

---

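### 7.3 Error decomposition

Each failed sample is classified as:

| Error type | Criterion |
|---|---|
| Proposal miss | No tube among the top-N candidates has GT-visible-frame mean IoU ≥ 0.5 with the GT |
| Selection error | The GT tube is covered, and the selector picks a non-null object tube other than the matched GT tube |
| Refinement error | The selector is correct, but the refined mask's J/F is clearly low |
| Null false positive | A non-empty tube is selected in a Null video |
| Null false negative | The null tube is selected in a positive video |
| GT tube Top-3 but Null Top-1 | In a positive sample, the matched GT tube is in the Top-3 but the null tube has the highest score |

The last type should not simply be merged into Selection error or Null FN. It indicates that the model can identify the candidate but that existence / null calibration is off.

**Mutually exclusive priority:**  
Error decomposition must ensure that each failed sample falls into exactly one category, so that the proportions do not overlap. The default priority is:

1. Proposal miss;
2. Null FN with GT Top-3, i.e. a positive sample where null is ranked 1st and the matched GT tube is in the object-level Top-3;
3. Null FN without GT Top-3;
4. Selection error;
5. Refinement error;
6. Null FP.

When reporting, categories 2 and 3 can be merged into total Null FN, with GT Top-3 but Null Top-1 listed separately as the calibration subtype of Null FN.

This analysis must be reported separately on the Seen, Unseen, Null, late-target, same-category distractor, and audio-critical subsets.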
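The priority order can be implemented as a single cascade so that every failed sample receives exactly one label. The sketch below makes several assumptions: rankings are lists of candidate ids sorted by descending score with `"null"` as the null-tube sentinel, `refined_ok` is a boolean produced by whatever J/F threshold the evaluation fixes, and the label strings are hypothetical. Null FP is checked first in code only because it applies exclusively to Null samples, so the outcome is the same as placing it last.

```python
NULL = "null"  # sentinel id for the null tube (assumption of this sketch)

def classify_failure(is_null_sample, gt_covered, ranking, gt_id, refined_ok):
    """Return exactly one error category, or None if the sample is a success."""
    top1 = ranking[0]
    if is_null_sample:
        # Null FP is the only failure mode of a Null sample.
        return None if top1 == NULL else "null_fp"
    if not gt_covered:
        return "proposal_miss"
    if top1 == NULL:
        objects = [c for c in ranking if c != NULL]  # object-level ranking
        if gt_id in objects[:3]:
            return "null_fn_with_gt_top3"
        return "null_fn_without_gt_top3"
    if top1 != gt_id:
        return "selection_error"
    if not refined_ok:
        return "refinement_error"
    return None
```

For example, a positive sample with ranking `["null", "a", "b", "gt"]` is labeled `null_fn_with_gt_top3`, because the GT tube is third among object tubes even though it is fourth overall.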

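## 8. Diagnostic Experiments

### 8.1 Is Conditioning Actually Effective?

Since v3, the conditioning diagnosis is split into two levels:

1. **Correctness level**: whether the model attends to the correct GT region / GT tube. Measured by AC and \(\widehat{AC}_{tube}\).
2. **Expression-sensitivity level**: whether the same tube yields different evidence summaries under different referring expressions. Measured by H3 direct validation.

The two levels must not be conflated. A high AC only shows that the model attends to the correct object; it does not directly prove H3.

#### 8.1.1 H3 direct validation: representation difference of the same tube under different expressions

Applicable subset: 3.2.6 Multi-expression H3 subset.

Experimental setup:

1. generate the shared candidate tubes once per video;
2. find the matched GT tube \(o_{i^*}\);
3. for two expressions \(r_a,r_b\) of the same video, run the fixed Q-Former and the conditioned Q-Former separately;
4. record the output representations of the same tube: \(\tilde{z}_{i^*}^{a}\) and \(\tilde{z}_{i^*}^{b}\).

Metric: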

\[
\text{CosSim}_{same\ tube}
=
\cos(\tilde{z}_{i^*}^{a},\tilde{z}_{i^*}^{b})
\]

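Report:

| Model | Same-tube cross-expression CosSim | Selection Acc@1 | H3 interpretation |
|---|---:|---:|---|
| Fixed Q-Former | 1.0 |  | Independent of the expression; deterministic identity baseline |
| Text-conditioned |  |  | Whether text differences change the tube summary |
| Audio-conditioned |  |  | Whether audio differences change the tube summary |
| Text+Audio-conditioned |  |  | Whether full conditioning produces the largest difference |

Expected results:

- the Fixed Q-Former's cross-expression CosSim is \(\equiv 1.0\); this is a deterministic baseline, not an empirical approximation;
- the Text+Audio-conditioned Q-Former's CosSim is significantly below 1.0;
- the lower CosSim must not come at the cost of lower Selection Acc;
- if CosSim shows no difference but performance still improves, the paper should say "learned compression improves selection" rather than strongly claim "expression-conditioned evidence summarization".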
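A minimal sketch of the same-tube cross-expression metric, assuming each tube summary \(\tilde{z}_{i^*}\) has been pooled to a single vector and is given as a list of floats. The function names are hypothetical, and the toy vectors are illustrative values, not results.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_gap(z_a, z_b):
    """Gap to the fixed Q-Former baseline, whose CosSim is exactly 1.0."""
    return 1.0 - cosine(z_a, z_b)

# A fixed Q-Former returns the same summary for both expressions.
z_fixed = [3.0, 4.0]
fixed_sim = cosine(z_fixed, z_fixed)

# Toy conditioned summaries for expressions r_a and r_b.
conditioned_gap = cosine_gap([1.0, 0.0], [0.0, 1.0])
```

If a Q-Former emits several query tokens per tube, the tokens need to be pooled or compared token by token before this metric applies; that choice should be fixed along with the metric.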

---

#### 8.1.2 Attention Concentration metric

For patch-level or frame-level attention \(A\), define:

\[
AC
=
\frac{
\sum_{t,l} A_{t,l} \cdot \mathbf{1}[(t,l)\in GT]
}{
\sum_{t,l} A_{t,l}
}
\]

If the attention is tube-level, the raw tube attention concentration is:

\[
AC_{tube}
=
\sum_i A_i \cdot \mathbf{1}[i=i^*]
\]

However, \(AC_{tube}\) depends on the number of candidates \(N\). To make values comparable across different N, the normalized version introduced in v3 is used:

\[
\widehat{AC}_{tube}=N\cdot AC_{tube}
\]

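Under this normalization the random (uniform-attention) baseline is always 1.0, and attention fully concentrated on the GT tube gives \(N\).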
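Both concentration measures reduce to a ratio of attention mass. The sketch below assumes attention weights are given as flat lists of non-negative floats; the tube-level function divides by the total mass so that it also works when the weights are not already normalized to sum to 1, which is a small generalization of the formula above. Function names are hypothetical.

```python
def attention_concentration(attn, in_gt):
    """Share of attention mass that falls on GT positions.

    attn: flat list of attention weights over (frame, patch) positions;
    in_gt: list of bools of the same length, True where the position is in the GT.
    """
    total = sum(attn)
    return sum(a for a, g in zip(attn, in_gt) if g) / total

def normalized_tube_ac(tube_attn, gt_index):
    """N * AC_tube: 1.0 for uniform attention, N when all mass is on the GT tube."""
    num_tubes = len(tube_attn)
    return num_tubes * tube_attn[gt_index] / sum(tube_attn)

uniform = normalized_tube_ac([0.25, 0.25, 0.25, 0.25], gt_index=2)
peaked = normalized_tube_ac([1.0, 0.0, 0.0, 0.0], gt_index=0)
```

With four candidates, uniform attention yields 1.0 and fully concentrated attention yields 4.0, matching the stated baseline and maximum.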

Compare:

- fixed Q-Former;
- text-conditioned;
- audio-conditioned;
- text+audio-conditioned.

Report separately on the following expression types:

1. audio-related expressions;
2. spatial relation expressions;
3. category-only expressions;
4. same-category distractor samples;
5. multi-expression H3 subset.

---

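### 8.2 Audio robustness

| Experiment | Purpose |
|---|---|
| audio removed | Tests the overall contribution of the audio module |
| audio amplitude zeroed, temporal length preserved | Distinguishes missing audio from all-zero audio features; checks whether the model uses only the "audio present or not" signal |
| audio shuffled | Tests reliance on temporal synchronization |
| same-category audio swapped | Tests reliance on fine-grained audio differences |
| cross-category audio swapped | Tests whether audio semantics are used, rather than mere audio presence |
| audio-text conflict | Tests whether the model degrades reasonably under conflicting conditions |
| strict audio-critical subset | Tests the gain on audio-critical samples |

**Grouping requirements for audio swapping:**

1. Same-category swap: for example, one guitar clip replaced by another guitar clip;
2. Cross-category swap: for example, a guitar clip replaced by a dog bark or a human voice.

Only if the cross-category swap causes a significant drop, and zeroed audio and removed audio show an interpretable difference, is there strong evidence that the model actually uses audio semantics.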

---

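### 8.3 First-frame bias / temporal coverage

| Experiment | Purpose |
|---|---|
| late-target subset | Whether TubeToken beats SimToken when the target appears in the second half |
| keyframe stride ablation | Analyzes the effect of keyframe coverage on performance |
| partial target subset | Tests robustness when the target appears only in some frames |
| target disappears subset | Tests tracking stability |
| GT-visible-frame IoU vs all-frame IoU | Separates target localization quality from the spurious-mask problem |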

---

### 8.4 Same-category distractor

Report:

- TubeToken vs SimToken;
- w/ self-attention vs w/o self-attention;
- hard-negative ranking loss ablation;
- Selection Acc@1 / Acc@3;
- error decomposition on same-category distractor samples.

The key question is whether TubeToken reduces confusion between same-category instances.

---

### 8.5 Null threshold sensitivity

Although TubeToken uses a null tube and needs no hand-tuned mask area threshold, it is still necessary to show how

\[
p_{exist}=1-P(null)
\]

behaves under different thresholds in terms of:

- Null FPR;
- Positive FNR;
- J&F;
- Null S;
- GT tube Top-3 but Null Top-1 Rate.

This shows whether the model is sensitive to the threshold.

Also compare:

1. null tube;
2. binary existence head;
3. mask-area threshold.

## 9. Efficiency and Matched-Compute Comparison

Reviewers will ask whether TubeToken merely trades computation for performance, so efficiency and matched-compute controls must be reported proactively.

### 9.1 Efficiency Items to Report

| Item | Description |
|---|---|
| Proposal generation time | SAM2 AMG + keyframe processing, measured per video |
| Tracking / propagation time | SAM2 memory propagation |
| Tube selection time | Conditional compression + selector, measured per expression |
| SAM refinement time | bbox prompt refinement |
| Total latency per video | Full inference time, reported separately for single-expression and multi-expression settings |
| FPS | Video-level speed |
| Peak GPU memory | GPU memory usage |
| MLLM token count | Compared with SimToken |
| Number of SAM/SAM2 calls | Makes the compute cost transparent |
| Candidate tube number | N=16/32/64/128 |
| Keyframe stride | stride=4/8/16 |
| Amortized proposal cost per expression | In multi-expression settings, SAM2 proposal generation runs once per video and is amortized across K expressions |
| Per-expression incremental cost | Incremental time of CondQFormer, selector, and refinement for each expression |

---

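### 9.2 Three TubeToken Configurations

| Configuration | Default setting | Purpose |
|---|---|---|
| Fast | N=16, stride=16 | Close to SimToken's compute budget |
| Balanced | N=32, stride=8 | Trade-off between performance and efficiency |
| Accuracy | N=64 or 128, stride=4 | Best possible performance |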

---

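### 9.3 Matched-Compute Comparison

The following must be included:

1. **SimToken + matched compute**, fixed as multiple keyframe prompting with the same number of keyframes as TubeToken-Fast;
2. **SimToken + SAM2 proposals**;
3. **SAM2 proposals + learned reranker (no null tube)**;
4. **TubeToken-Fast**.

Report the performance of these variants at comparable latency / FLOPs / number of SAM calls. The implementation of the matched compute baseline must be fixed before the experiments; it cannot be picked afterwards from candidates such as multi-scale prompting or multiple decode attempts based on the results.

If TubeToken-Fast clearly outperforms SimToken + matched compute, this is a strong answer to the "just trading computation for performance" objection.

### 9.4 Proposal amortization in multi-expression settings

If the same video has \(K\) referring expressions, TubeToken's inference cost should be decomposed as: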

\[
C_{video}
=
C_{proposal}^{video}
+
K\cdot(C_{cond}^{expr}+C_{select}^{expr}+C_{refine}^{expr})
\]

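where \(C_{proposal}^{video}\) is the one-time per-video cost of SAM2 AMG + propagation and must not be counted \(K\) times, and \(C_{expr}=C_{cond}^{expr}+C_{select}^{expr}+C_{refine}^{expr}\) denotes the per-expression cost. The following are therefore reported in addition:

| Metric | Definition |
|---|---|
| Proposal cost per video | One-time cost of generating candidate tubes for a video |
| Amortized proposal cost per expression | \(C_{proposal}^{video}/K\) |
| Incremental expression cost | Per-expression cost of CondQFormer + selector + refinement, i.e. \(C_{expr}\) |
| Total cost for K expressions | \(C_{proposal}^{video}+K\cdot C_{expr}\) |

This prevents reviewers from assuming that TubeToken reruns SAM2 proposals for every expression, and it also shows TubeToken's potential efficiency advantage on multi-expression videos.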
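The cost decomposition is simple arithmetic, sketched below. The numbers in the example are placeholders chosen only to make the arithmetic easy to follow (they are not measurements), and the function names are hypothetical.

```python
def total_cost(c_proposal, c_cond, c_select, c_refine, num_expressions):
    """C_video = C_proposal + K * (C_cond + C_select + C_refine)."""
    c_expr = c_cond + c_select + c_refine
    return c_proposal + num_expressions * c_expr

def amortized_proposal_cost(c_proposal, num_expressions):
    """Proposal cost attributed to each expression: C_proposal / K."""
    return c_proposal / num_expressions

# Placeholder costs in seconds, for illustration only.
K = 4
shared = total_cost(8.0, 0.5, 0.25, 1.25, K)        # proposals run once per video
rerun = K * total_cost(8.0, 0.5, 0.25, 1.25, 1)     # proposals rerun per expression
per_expression_share = amortized_proposal_cost(8.0, K)
```

With these placeholder values, sharing proposals across four expressions costs 16.0 in total versus 40.0 when proposals are rerun for every expression.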

---

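## 10. Main Table Design

### 10.1 Main comparison table

The main table keeps only public baselines, the reproduced primary baseline, and the main TubeToken configurations, so that it is not bloated by all the fairness control variants. Fairness controls go into 10.2.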

| Method | Seen J&F | Unseen J&F | Mix J&F | Null S | FPS | Memory |
|---|---:|---:|---:|---:|---:|---:|
| EEMC |  |  |  |  |  |  |
| TSAM |  |  |  |  |  |  |
| SAM2-LOVE |  |  |  |  |  |  |
| SimToken official |  |  |  |  |  |  |
| SimToken reproduced |  |  |  |  |  |  |
| EC-SimToken |  |  |  |  |  |  |
| TubeToken-Balanced |  |  |  |  |  |  |
| TubeToken-Accuracy |  |  |  |  |  |  |

---

### 10.2 Fairness analysis table

This table specifically addresses fairness: whether TubeToken's gains come from SAM2 proposals, learned reranking, the null tube, or extra computation.

| Method | Matched Proposal? | Matched Compute? | Null Modeling | Seen J&F | Unseen J&F | Mix J&F | Null S | FPS |
|---|---|---|---|---:|---:|---:|---:|---:|
| SimToken reproduced | No | Base | Implicit / mask output |  |  |  |  |  |
| SimToken + SAM2 proposals zero-param rerank | Yes | No | SimToken implicit |  |  |  |  |  |
| SAM2 proposals + learned reranker (no null tube) | Yes | Partial | threshold / calibrated score |  |  |  |  |  |
| SimToken + matched compute (multiple keyframe prompting) | No | Yes, TubeToken-Fast budget | SimToken implicit |  |  |  |  |  |
| TubeToken-Minimal | Yes | TubeToken-Fast/Balanced reported | learnable null tube |  |  |  |  |  |
| TubeToken-Fast | Yes | Yes | learnable null tube |  |  |  |  |  |

---

### 10.3 Proposal analysis table

| Split | Recall@16 | Recall@32 | Recall@64 | Oracle Tube J&F | Oracle Refined J&F bbox-only | Proposal Miss % |
|---|---:|---:|---:|---:|---:|---:|
| Seen |  |  |  |  |  |  |
| Unseen |  |  |  |  |  |  |
| Late-target |  |  |  |  |  |  |
| Small/occluded |  |  |  |  |  |  |
| Audio-critical |  |  |  |  |  |  |
| Multi-expression H3 subset |  |  |  |  |  |  |

---

### 10.4 Ablation table

| Variant | Seen J&F | Unseen J&F | Null S | Selection Acc@1 | Null FPR | GT Top-3 Null Top-1 | FPS |
|---|---:|---:|---:|---:|---:|---:|---:|
| Full |  |  |  |  |  |  |  |
| TubeToken-Minimal |  |  |  |  |  |  |  |
| SAM2 proposals + learned reranker (no null tube) |  |  |  |  |  |  |  |
| w/o null tube |  |  |  |  |  |  |  |
| binary existence head |  |  |  |  |  |  |  |
| mask-area threshold |  |  |  |  |  |  |  |
| fixed Q-Former |  |  |  |  |  |  |  |
| text-only cond |  |  |  |  |  |  |  |
| audio-only cond |  |  |  |  |  |  |  |
| text+audio cond |  |  |  |  |  |  |  |
| w/o multi-expression training |  |  |  |  |  |  |  |
| w/ optional \(\mathcal{L}_{cond}\) |  |  |  |  |  |  |  |
| w/o self-attn |  |  |  |  |  |  |  |
| independent scoring |  |  |  |  |  |  |  |
| w/o refinement |  |  |  |  |  |  |  |
| bbox+mask prompt |  |  |  |  |  |  |  |

---

### 10.5 Error decomposition table

| Split | Proposal Miss | Selection Error | Refinement Error | Null FP | Null FN | GT Top-3 but Null Top-1 |
|---|---:|---:|---:|---:|---:|---:|
| Seen |  |  |  | - |  |  |
| Unseen |  |  |  | - |  |  |
| Null | - | - | - |  | - | - |
| Same-category |  |  |  | - |  |  |
| Late-target |  |  |  | - |  |  |
| Audio-critical |  |  |  | - |  |  |
| Multi-expression H3 subset |  |  |  | - |  |  |

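Note: Late-target, Same-category, and Audio-critical are normally positive-only subsets, so Null FP does not apply and is marked "-". If a subset's definition includes Null samples, split it into separate positive and null rows.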
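Each failed sample is assigned to exactly one category, in the priority order Proposal miss → Null FN with GT Top-3 → Null FN without GT Top-3 → Selection error → Refinement error → Null FP. The sketch below shows one way to encode that order. The boolean flags are hypothetical names introduced here, and how each flag is computed (IoU thresholds, success criteria) is assumed to follow the definitions elsewhere in this plan.

```python
from typing import Optional


def error_category(
    gt_is_null: bool,       # the sample has no referred object
    proposal_hit: bool,     # some candidate tube matches the GT object
    predicted_null: bool,   # the null tube is ranked Top-1
    gt_in_top3: bool,       # GT tube is in the object-level Top-3 (null excluded)
    selected_is_gt: bool,   # the selected tube is the GT tube
    refined_ok: bool,       # the refined mask passes the success criterion
) -> Optional[str]:
    """Return one mutually exclusive error label, or None for a correct sample."""
    if not gt_is_null:
        # Positive samples: checks are ordered by the stated priority.
        if not proposal_hit:
            return "proposal_miss"
        if predicted_null:
            return "null_fn_gt_top3" if gt_in_top3 else "null_fn_no_gt_top3"
        if not selected_is_gt:
            return "selection_error"
        if not refined_ok:
            return "refinement_error"
        return None
    # Null samples: the only possible failure is predicting an object.
    return None if predicted_null else "null_fp"
```

Because the first matching branch returns immediately, a sample with both a proposal miss and a null prediction is counted only as a proposal miss, which keeps the table columns summing to the total failure count.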

---

### 10.6 Conditioning analysis table

| Model | Overall \(\widehat{AC}_{tube}\) | Audio-expression \(\widehat{AC}_{tube}\) | Spatial-expression \(\widehat{AC}_{tube}\) | Same-category \(\widehat{AC}_{tube}\) | Cross-expression CosSim | Selection Acc@1 |
|---|---:|---:|---:|---:|---:|---:|
| Fixed Q-Former |  |  |  |  |  |  |
| Text-conditioned |  |  |  |  |  |  |
| Audio-conditioned |  |  |  |  |  |  |
| Text+Audio-conditioned |  |  |  |  |  |  |

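## 11. Visualization Plan

### 11.1 Required visualizations

1. **Tube selection visualization**  
   Show the top-5 candidate tubes, their selector scores, and the final selection.

2. **Null case visualization**  
   Show a case where the null tube scores highest and the output is an empty mask.

3. **Same-category distractor**  
   Show two similar objects, with TubeToken selecting the correct target tube.

4. **Late-target case**  
   Show that when the target is absent from the first frame, TubeToken still finds it through tube selection.

5. **Conditional attention map**  
   For the same video under different expressions, show the compressor attending to different tubes / temporal segments.

6. **Attention Concentration visualization**  
   Show the difference in attention mass between the fixed Q-Former and the conditioned Q-Former.

7. **Failure cases**  
   Show at least three failure types: proposal miss, selection error, and refinement error.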

---

### 11.2 Visualization standard

Each case should include:

- key frames of the input video;
- the expression;
- the audio waveform or audio activity;
- candidate tubes;
- selection scores;
- the selected tube;
- the final mask;
- the GT mask;
- the corresponding error category or diagnostic subset label.

---

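## 12. Implementation Order and Milestones

### Phase -1: Data audit and SimToken reproduction

Goal: confirm whether H3 has a data basis, and establish the primary baseline for all Go/No-Go decisions.

Deliverables:

- SimToken reproduced result;
- analysis of the reproduced vs. official gap;
- multi-expression audit;
- H3 subset construction result;
- Null sample ratio and batch sampling plan.

The two Phase -1 tasks can start in parallel: the SimToken reproduction establishes the primary baseline for all thresholds, and the multi-expression audit determines how strongly H3 can be claimed.

Go / No-Go conditions:

| Phase -1 outcome | Recommendation |
|---|---|
| SimToken reproduction is within 1.5 J&F of the official result, and the average number of expressions per video is > 1.5 | Proceed fully with Phase 0 as planned in v4; H3 remains a P0 direct validation |
| SimToken reproduction is within 1.5 J&F of the official result, but nearly every video has only 1 expression | Proceed with Phase 0, but demote H3 direct validation from P0 to P2 and adopt the fallback narrative |
| SimToken reproduction gap is > 1.5 J&F | Pause subsequent experiments and locate the source of the gap first, since all Go/No-Go thresholds depend on this baseline |

By the end of Phase -1, it must be stated explicitly whether H3 is under strong validation, weak validation, or narrative fallback.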
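The Go / No-Go table can be written as a small decision function so that the branch taken is recorded unambiguously. This is a sketch under one stated assumption: the table does not say what to do when the average expression count lies between "basically 1" and 1.5, so the code treats every value at or below 1.5 as the fallback branch. The function name and return labels are hypothetical.

```python
def phase_minus1_decision(repro_gap_jf: float, avg_expr_per_video: float) -> str:
    """Map Phase -1 measurements to one of the three table rows.

    repro_gap_jf: reproduced-minus-official SimToken J&F, in points.
    avg_expr_per_video: average number of expressions per video from the audit.
    """
    if abs(repro_gap_jf) > 1.5:
        # Every later threshold depends on this baseline, so stop here.
        return "pause_and_debug_reproduction"
    if avg_expr_per_video > 1.5:
        return "proceed_h3_as_p0"
    # Assumption: anything at or below 1.5 is treated as insufficient for H3.
    return "proceed_h3_demoted_to_p2"
```

Logging the returned label together with the two inputs gives a fixed record of why H3 was kept at P0 or demoted.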

---

### Milestone 1: Data audit and proposal recall

Goal: determine whether TubeToken is feasible.

Deliverables:

- data statistics table;
- Recall@N;
- Oracle Tube J/F;
- Oracle Refined J/F bbox-only;
- proposal miss analysis;
- go / no-go decision.

Green-light conditions:

- Recall@32 ≥ 85%;
- Oracle Tube J/F ≥ reproduced SimToken J/F + 5%;
- Oracle Refined J/F ≥ Oracle Tube J/F + 3%;
- Small / occluded subset Recall@32 ≥ 70%.

Yellow-light conditions:

- Recall@32 is 80%-85% but Oracle Tube J/F meets the green-light condition: proceed, but default to N=64;
- Oracle Tube J/F is only ≥ SimToken + 2%, but Oracle Refined J/F ≥ SimToken + 5%: proceed, but shift the paper's focus to refinement.

Red-light conditions:

- Recall@64 < 80%;
- Oracle Tube J/F ≤ reproduced SimToken J/F;
- Recall@32 ≥ 85%, the gap between Oracle Refined J/F and Oracle Tube J/F is < 1%, and Oracle Tube J/F ≤ reproduced SimToken J/F + 2%;
- proposals show unacceptable systematic blind spots on small / occluded / unseen targets.
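Since every Milestone 1 condition is a measurable quantity, the traffic-light rule can be evaluated mechanically. The sketch below assumes that the "+N%" margins are absolute J/F points and that red conditions take precedence over yellow and green. Both are assumptions, as are the function name and the `blind_spot` flag, which stands in for the qualitative blind-spot judgment.

```python
def milestone1_light(
    recall32: float,
    recall64: float,
    small_occluded_recall32: float,
    oracle_tube_jf: float,
    oracle_refined_jf: float,
    simtoken_jf: float,
    blind_spot: bool = False,
) -> str:
    """Classify Milestone 1 results; all inputs are percentages / J/F points."""
    refine_gap = oracle_refined_jf - oracle_tube_jf
    if (
        recall64 < 80
        or oracle_tube_jf <= simtoken_jf
        or (recall32 >= 85 and refine_gap < 1 and oracle_tube_jf <= simtoken_jf + 2)
        or blind_spot
    ):
        return "red"
    tube_green = oracle_tube_jf >= simtoken_jf + 5
    if (
        recall32 >= 85
        and tube_green
        and refine_gap >= 3
        and small_occluded_recall32 >= 70
    ):
        return "green"
    if 80 <= recall32 < 85 and tube_green:
        return "yellow_use_n64"
    if oracle_tube_jf >= simtoken_jf + 2 and oracle_refined_jf >= simtoken_jf + 5:
        return "yellow_refinement_focus"
    # The listed conditions do not cover every combination of measurements.
    return "undetermined"
```

The explicit `"undetermined"` branch is a design choice: it surfaces combinations that the plan does not list instead of silently mapping them to a color.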

---

### Milestone 2: TubeToken-Minimal + Fairness Controls

Implement the minimal version:

- SAM2 proposals;
- tube construction;
- fixed tube feature;
- selector + null tube;
- no conditional Q-Former;
- no SAM refinement.

Implement the fairness controls at the same time:

1. SimToken + SAM2 proposals with zero-parameter reranking;
2. SAM2 proposals + learned reranker (no null tube);
3. SimToken + matched compute;
4. w/o null tube + mask-area threshold.

Goal: verify that object tube selection outperforms the global token baseline, and rule out the explanations "SAM2 proposals are simply stronger" and "it simply uses more compute".

Green-light conditions:

- TubeToken-Minimal beats reproduced SimToken by ≥ 2% on both Seen and Unseen J&F;
- TubeToken-Minimal beats SimToken + SAM2 proposals;
- TubeToken-Minimal Null S ≤ SimToken Null S × 1.5;
- Tube Selection Acc@1 ≥ 70%.

Yellow-light conditions:

- TubeToken-Minimal beats SimToken but not SimToken + SAM2 proposals: proposals dominate the gain, so strengthen the selector or adjust the paper narrative;
- TubeToken-Minimal beats SimToken only on the Null subset and ties on Seen / Unseen: continue to Milestone 3, but Minimal cannot be presented as a main contribution.

Red-light conditions:

- TubeToken-Minimal fails to beat SimToken on both Seen and Unseen, and also fails to beat SimToken + SAM2 proposals: redesign the selector or fall back to EC-SimToken.

---

### Milestone 3: Add Conditional Compression

Implement:

- fixed Q-Former;
- text-conditioned Q-Former;
- audio-conditioned Q-Former;
- text+audio-conditioned Q-Former;
- multi-expression training;
- H3 cosine similarity validation.

Goal: show that conditioning itself is effective, rather than the gain coming from the added parameters of a learnable Q-Former.

Required deliverables:

- conditioning ablation;
- \(\widehat{AC}_{tube}\);
- H3 cross-expression CosSim;
- attention visualization;
- audio-critical subset results;
- audio zeroed / removed / shuffled / swapped robustness.

Green-light conditions:

- Text+Audio conditioned Q-Former beats Fixed Q-Former by ≥ 1.5% on both Seen and Unseen;
- on audio-related expressions, conditioned \(\widehat{AC}_{tube}\) ≥ fixed × 1.3;
- for the same video under different expressions, the CosSim of CondQFormer's \(\tilde{z}_i\) is clearly below that of Fixed Q-Former (which is identically 1.0);
- the performance gain on the strict audio-critical subset is ≥ 2%.

Yellow-light conditions:

- CondQFormer yields a clear overall gain, but the \(\widehat{AC}_{tube}\) difference is not significant: reframe the paper as learned tube compression;
- Text-only is already sufficient and audio conditioning adds < 0.5%: reframe audio conditioning as a robustness improvement rather than a main contribution.

Red-light conditions:

- the gap between Fixed Q-Former and Text+Audio conditioned Q-Former is < 0.5%, with no gain on any subset: conditioning is ineffective; consider CLIP visual features or fall back on the paper narrative.
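The H3 cross-expression CosSim compares the compressed tokens \(\tilde{z}_i\) that one tube receives under different expressions. The standard-library sketch below averages the cosine similarity over all expression pairs for a single tube. Averaging over pairs is an assumption about the aggregation, and the function names are hypothetical. A Fixed Q-Former yields identical vectors for every expression, so its score is 1.0 up to floating-point rounding.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cross_expression_cossim(tokens: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity of one tube's tokens across expressions.

    tokens[k] is the compressed token of the same tube under expression k.
    """
    pairs = list(combinations(tokens, 2))
    if not pairs:
        raise ValueError("at least two expressions are required")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy vectors, not model outputs: identical tokens mimic a Fixed Q-Former.
    fixed = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
    conditioned = [[1.0, 0.0], [0.0, 1.0]]
    print(cross_expression_cossim(fixed))        # approximately 1.0
    print(cross_expression_cossim(conditioned))  # 0.0
```

Tubes with only one expression cannot contribute to this metric, which is why the multi-expression audit in Phase -1 determines whether the measurement is possible at all.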

---

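### Milestone 4: Add SAM Refinement

Implement:

- bbox prompt refinement;
- bbox + semantic prompt refinement;
- bbox + mask prompt as a control.

Goal: demonstrate the contribution of refinement and confirm the default scheme.

Green-light conditions:

- Bbox prompt refinement improves J over w/o refinement by ≥ 2%;
- the gap between Oracle Refined J/F and the actual TubeToken-Full J/F is ≤ 10%;
- Bbox + mask prompt is not significantly better than bbox-only.

Yellow-light conditions:

- Refinement gain < 1%: demote SAM refinement to an optional module and shift the paper's focus back to tube selection.

Red-light conditions:

- Bbox + mask prompt is significantly better than bbox-only, and the gap stems from the mask prompt's GT-quality dependency: proposal mask quality is insufficient, so return to Milestone 1 and revise the proposals.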

---

### Milestone 5: Full experiments and paper analysis

Complete:

- main table;
- ablations;
- hard subsets;
- error decomposition;
- efficiency;
- equal-compute comparison;
- visualizations;
- failure cases;
- reliability diagram / threshold sensitivity.

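## 13. Risks and Mitigations

| Risk | Severity | Mitigation |
|---|---|---|
| Ref-AVSBench lacks multi-expression structure | Very high | Do not present H3 as a main contribution; fall back to the learned tube compression / proposal-conditioned instance grounding narrative |
| SimToken reproduction deviates too much from the official numbers | High | First locate differences in training, inputs, and evaluation; all subsequent Go/No-Go decisions use the reproduced number |
| Gradient conflicts in multi-expression training | Medium-high | Use gradient accumulation to accumulate gradients from different expressions separately; early on, sample expression pairs with small semantic differences, then introduce cross-modality pairs once training is stable |
| The SimToken + matched compute implementation is questioned | High | Fix it before the experiments as multiple keyframe prompting with the TubeToken-Fast keyframe budget, leaving no room for post-hoc selection |
| Multi-expression efficiency is misread as rerunning proposals for every expression | Medium | Report proposal per-video cost, amortized proposal cost per expression, and incremental expression cost |
| Recall@32 below 80% | Very high | Increase the number of proposals, add a detector, use a hybrid fallback |
| Oracle Tube J/F not above reproduced SimToken | Very high | Pause the TubeToken main line; revise refinement, high-resolution features, or the proposal method, or fall back to EC-SimToken |
| Oracle Refined J/F is defined unfairly | High | Fix it as oracle proposal bbox-only, with no GT mask prompt |
| The SimToken + SAM2 proposals control is too weak | High | Use zero-parameter \(F_{seg}\) reranking and publish the formula |
| TubeToken-Minimal beats SimToken but not SimToken + SAM2 proposals | High | Proposals are the main source of gain; strengthen the tube selector or adjust the paper narrative |
| Learned reranker and TubeToken-Minimal differ only slightly | Medium-high | The null tube contributes little; downgrade Null-related claims |
| \(\mathcal{L}_{cond}\) is ill-defined | Medium-high | Removed by default; if used, define it separately and run a with/without ablation |
| Null tube is unstable | Medium-high | 25% Null oversampling + weighted CE curriculum; report sensitivity to the sampling ratio |
| Null oversampling is too strong and positives are misclassified as Null | High | Monitor Positive FNR and GT Top-3 but Null Top-1 Rate |
| Conditioning yields only a small gain | High | Strengthen the diagnostic subsets, \(\widehat{AC}_{tube}\), H3 CosSim, and the fixed Q-Former control |
| H3 CosSim shows no clear difference | High | Do not emphasize expression-conditioned summarization; emphasize learned compression or the selection architecture instead |
| TubeToken compute cost is too high | High | Report Fast/Balanced/Accuracy and the matched-compute baseline |
| Refinement gain is not evident | Medium | Shift the focus to selection accuracy and hard cases; keep refinement as an optional module |
| Self-attention contributes nothing | Low | Remove self-attention and use a simpler selector |
| Attention maps are not interpretable | Medium-high | Re-diagnose with \(\widehat{AC}_{tube}\), query grouping, and H3 CosSim |
| Tight engineering coupling to SAM2 | Medium | State clearly that the core contribution is tube-level text/audio selection, not proposal generation |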

## 14. Experiment Priorities

### P0: Must complete

1. SimToken reproduction and analysis of the gap to the official results;
2. Multi-expression audit;
3. Proposal Recall@N;
4. Oracle Tube J/F and bbox-only Oracle Refined J/F;
5. TubeToken-Minimal vs SimToken;
6. TubeToken-Minimal vs SimToken + SAM2 proposals;
7. SAM2 proposals + learned reranker (no null tube);
8. TubeToken-Fast vs SimToken + matched compute (fixed as multiple keyframe prompting);
9. Null tube ablation;
10. mask-area threshold Null baseline;
11. Null oversampling ratio ablation;
12. fixed Q-Former vs text+audio conditioned Q-Former;
13. \(\widehat{AC}_{tube}\);
14. H3 cross-expression CosSim (if supported by the multi-expression audit; otherwise demoted to P2);
15. Error decomposition;
16. GT Top-3 but Null Top-1 Rate;
17. Efficiency table.

---

### P1: Strongly recommended

1. late-target subset;
2. strict audio-critical subset;
3. same-category distractor subset;
4. threshold sensitivity;
5. conditioning attention visualization;
6. H3 cross-expression visualization;
7. self-attention ablation;
8. Reliability Diagram;
9. same-category vs cross-category audio swap;
10. audio amplitude zeroed, temporal length preserved.

---

### P2: If time permits

1. audio shuffled;
2. cross-dataset validation, e.g., AVSBench / MeViS;
3. frame-level existence;
4. open-vocabulary detector assisted proposals;
5. manual hard negative benchmark;
6. hybrid fallback with EC-SimToken;
7. optional \(\mathcal{L}_{cond}\) attention supervision.

## 15. Expected Paper Narrative

### 15.1 Primary narrative: when H3 holds

If the multi-expression audit, multi-expression training, H3 CosSim, and \(\widehat{AC}_{tube}\) all support H3, the paper's main line should read:

> Existing Ref-AVS methods often compress multimodal evidence into a global semantic token, implicitly coupling existence judgment, instance grounding, and frame-level segmentation. We find that this implicit coupling becomes fragile in samples requiring instance-level comparison, temporal coverage, explicit null reasoning, or expression-dependent temporal evidence. We therefore formulate Ref-AVS as text-audio conditioned object-tube retrieval followed by mask refinement. Based on this view, we propose TubeToken, which constructs candidate object tubes, summarizes each tube with expression-conditioned temporal evidence, selects the referred tube through multimodal reasoning, handles Null cases via a learnable null tube, and refines the selected tube with SAM.

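The Introduction should include data-driven motivation, for example:

- how much SimToken drops on the same-category distractor subset;
- how much SimToken drops on the late-target subset;
- how much performance drops on the audio-critical subset when audio is removed;
- whether Null false positives concentrate in particular sample types;
- the CosSim difference between fixed Q-Former and conditioned Q-Former on the H3 subset.

This upgrades the narrative from "we believe the global token is inadequate" to "we show with diagnostic data that the global token has systematic weaknesses".

### 15.2 Fallback narrative: when H3 is weak

If the dataset lacks sufficient multi-expression videos, or the H3 CosSim / \(\widehat{AC}_{tube}\) evidence for the conditioned Q-Former is insufficient, avoid strong claims of "expression-conditioned evidence summarization". Use the following instead: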

> We formulate Ref-AVS as proposal-conditioned instance grounding with explicit null reasoning. TubeToken improves robustness by decomposing global segmentation into candidate object tube construction, learned tube selection, null-aware existence modeling, and optional mask refinement.

In this case the paper's main contributions become:

1. candidate object tube formulation;
2. explicit null tube / existence modeling;
3. fairness-controlled comparison with SimToken + SAM2 proposals and matched compute;
4. diagnostic error decomposition;
5. optional learned compression rather than a strong conditioning claim.

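## 16. Minimum Acceptable Conclusion Criteria

The final results can support a complete paper if they meet the following conditions:

1. The SimToken reproduction is credible, and all key comparisons are based on reproduced SimToken;
2. Recall@32 or Recall@64 is sufficiently high and Oracle Tube J/F is clearly above reproduced SimToken, showing that proposals are not an unacceptable bottleneck;
3. Oracle Refined J/F uses a bbox-only prompt and is clearly above Oracle Tube J/F, showing that refinement has attainable headroom;
4. TubeToken is no more than 2 points below SimToken on Seen / Unseen / Mix; if the main splits are only on par, there must be significant gains on the Null, late-target, same-category, and audio-critical subsets, supported by a three-way argument on efficiency, robustness, and interpretability;
5. TubeToken-Fast beats SimToken + matched compute (multiple keyframe prompting) under a comparable compute budget;
6. TubeToken-Minimal beats SimToken + SAM2 proposals, showing that the tube selection framework itself is effective;
7. The comparison between SAM2 proposals + learned reranker (no null tube) and TubeToken-Minimal explains the separate contributions of the selector and the null tube;
8. fixed Q-Former is clearly weaker than text+audio conditioned Q-Former;
9. If H3 is claimed, all of the following must hold: the multi-expression audit supports it, multi-expression training is effective, Fixed Q-Former CosSim \(\equiv 1.0\) while conditioned CosSim is significantly below 1.0, and \(\widehat{AC}_{tube}\) improves;
10. the null tube clearly outperforms the mask-area threshold and the binary existence head;
11. Null oversampling does not cause an unacceptable rise in Positive FNR or GT Top-3 but Null Top-1 Rate;
12. error decomposition clearly attributes the main failures to proposal miss, selection error, refinement error, Null FP/FN, or Null calibration;
13. although compute cost may be higher, the Fast/Balanced/Accuracy settings show a reasonable compute-performance trade-off.

If condition 2 fails, fall back to the EC-SimToken route promptly to avoid over-investing in a low-recall TubeToken. If condition 9 fails, keep the TubeToken framework but reduce the weight of CondQFormer / H3 in the paper.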

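## 17. Final Execution Recommendations

Proceed in the following order:

1. **Start with Phase -1: SimToken reproduction + multi-expression audit.**  
   This underpins every Go/No-Go condition and determines whether the H3 narrative holds.

2. **Then run Phase 0: proposal recall + bbox-only Oracle Tube / Refined J/F.**  
   This is the hard prerequisite for TubeToken, and Oracle Refined J/F must match the actual refinement setting.

3. **Then run the Milestone 2 fairness controls.**  
   TubeToken-Minimal, SimToken + SAM2 proposals with zero-parameter reranking, SAM2 proposals + learned reranker (no null tube), and SimToken + matched compute (multiple keyframe prompting) must all be completed together.

4. **Add CondQFormer only after the tube framework is confirmed effective.**  
   If multi-expression data is sufficient, multi-expression training and H3 CosSim must be added at the same time; if not, do not present H3 as a main contribution.

5. **Add refinement last.**  
   Refinement is a performance enhancement and should not be the sole pillar of the paper narrative. If bbox-only refinement yields little gain, demote it to an optional module.

This path minimizes risk: if proposal recall or the oracle upper bound is unsatisfactory, the project can switch back to EC-SimToken in time; if TubeToken-Minimal already shows a clear advantage, continued investment in the full TubeToken is justified; if H3 validation falls short, the tube-level retrieval contribution can be kept while the CondQFormer narrative is revised.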

---

## Appendix A. Reviewer Suggestion Implementation Checklist

| Reviewer suggestion | Location | Status |
|---|---|---|
| Add direct H3 validation instead of relying on AC alone | 1.2, 3.2.6, 8.1.1, 10.6, 12 | Done |
| Check the dataset's multi-expression structure | 3.1, 3.1.1, Phase -1 | Done |
| Make CondQFormer use multi-expression training explicitly | 6.3, 12 Milestone 3 | Done |
| Base Go/No-Go on reproduced SimToken rather than numbers of unclear origin | 4.0, 4.4, 12 | Done |
| Use a bbox-only prompt for Oracle Refined J/F, not the GT mask | 4.2.1, 4.3, 10.3 | Done |
| Use zero-parameter reranking for SimToken + SAM2 proposals | 5.3.1 | Done |
| Add SAM2 proposals + learned reranker (no null tube) | 5.1, 5.2, 5.3.2, 10.2, 10.4, 12 | Done |
| Remove or define the dangling \(\mathcal{L}_{cond}\) | 6.2, 6.2.1 | Done |
| Specify the Null oversampling ratio | 6.4, 14 | Done |
| Add the GT Top-3 but Null Top-1 error type | 7.2, 7.3, 10.5 | Done |
| Use the normalized \(\widehat{AC}_{tube}\) | 7.2, 8.1.2, 10.6 | Done |
| Add the audio amplitude zeroed control experiment | 8.2, 14 | Done |
| Fix the missing Late-target column in the error decomposition table | 10.5 | Done |
| Add TubeToken-Minimal to the main table | 10.2 (moved to the fairness analysis table in v4) | Done |
| Write green / yellow / red conditions for each Milestone | 12 | Done |
| Add a narrative fallback plan | 15.2, 16, 17 | Done |
| Fix a single implementation of SimToken + matched compute | 5.3.3, 9.3, 10.2, 12 | Done in v4 |
| Restate the third Phase 0 red-light condition in observable quantities | 4.4.3, 12 Milestone 1 | Done in v4 |
| Set the Fixed Q-Former CosSim baseline to exactly 1.0 | 1.2, 8.1.1, 10.6, 16 | Done in v4 |
| Add the gradient-conflict risk for multi-expression training | 6.3, 13 | Done in v4 |
| Slim down the main table and move fairness controls to a separate table | 10.1, 10.2 | Done in v4 |
| Add multi-expression proposal amortization efficiency | 9.1, 9.4 | Done in v4 |
| Exclude the null tube from Selection Acc@3 | 7.2 | Done in v4 |
| Use a mutually exclusive priority for error decomposition | 7.3 | Done in v4 |
| Make the Phase -1 Go/No-Go branches for SimToken reproduction and the H3 audit explicit | 12 Phase -1 | Done in v4 |