	# TubeToken Phase -1 / Phase 0 Experiment Log

	This document records the actual experiment progress, observations, and next actions for the TubeToken v4 plan.

	## Phase -1 Summary

	### Data Audit

	Audit output:

	```text
	Expressions: 20459
	Videos: 3574
	Objects (vid, fid): 7461
	Splits: val 1349, train 14113, test_s 2288, TODO 25, test_u 1656, test_n 1028

	Expressions/video mean: 5.724
	Expressions/video median: 6.0
	Videos with >=2 expressions: 3521
	Expressions/object mean: 2.742
	Objects with >=2 expressions: 5836
	H3 candidate objects: 5781
	H3 candidate expressions: 18614

	Null split expressions: 1028 (5.02%)
	Audio-keyword expressions: 15890 (77.67%)
	Spatial-keyword expressions: 5924 (28.96%)
	Same-category distractor heuristic expressions: 2563 (12.53%)
	Small-target expressions: 10037
	Partial-target expressions: 33
	Area-unstable expressions: 41
	Late-target expressions: 0
	```

	Decision:

	- Multi-expression structure is strong.
	- H3 direct validation remains a P0 target.
	- Null modeling is feasible but needs oversampling / curriculum because Null ratio is only about 5%.
	- Small-target proposal recall is a major risk.
	- Late-target subset is not useful under the current GT visibility definition.

	### SimToken Reproduction

	Reproduced results:

	```text
	test_seen:
	mIoU = 0.7189123889
	F = 0.8113823722
	J&F = 0.7651473806

	test_unseen:
	mIoU = 0.6996124670
	F = 0.7915967433
	J&F = 0.7456046051

	test_n:
	S = 0.0117917573
	```

	Paper/report result:

	```text
	Seen: J 72.0, F 81.3, J&F 76.7
	Unseen: J 69.8, F 79.1, J&F 74.5
	Mix: J 70.9, F 80.2, J&F 75.6
	Null S: 0.012
	```

	Decision:

	- SimToken reproduction passes Phase -1.
	- Difference from the report is far below the 1.5 J&F pause threshold: seen J&F is 76.51 vs. 76.7 reported (-0.19), and unseen J&F is 74.56 vs. 74.5 reported (+0.06).
	- Later Go/No-Go thresholds should use reproduced SimToken as the reference.
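The pass decision above can be re-derived from the logged numbers. The sketch below assumes J is the logged mIoU and that J&F is the plain mean of J and F, which is consistent with the reproduced values; the function names are illustrative and not part of the repository.

```python
# Reproduced metrics copied from the log (assumption: J == mIoU).
REPRO = {
    "test_seen": {"J": 0.7189123889, "F": 0.8113823722},
    "test_unseen": {"J": 0.6996124670, "F": 0.7915967433},
}
# Paper-reported J&F, in percentage points.
PAPER_JF = {"test_seen": 76.7, "test_unseen": 74.5}
PAUSE_THRESHOLD_PT = 1.5


def j_and_f(j, f):
    """J&F as the arithmetic mean of region (J) and boundary (F) scores."""
    return (j + f) / 2.0


def reproduction_gaps(repro, paper_jf):
    """Signed gap (reproduced minus reported) per split, in points."""
    return {
        split: 100.0 * j_and_f(m["J"], m["F"]) - paper_jf[split]
        for split, m in repro.items()
    }


gaps = reproduction_gaps(REPRO, PAPER_JF)
passes = all(abs(gap) < PAUSE_THRESHOLD_PT for gap in gaps.values())
for split, gap in gaps.items():
    print(f"{split}: {gap:+.2f} pt")
print("reproduction passes:", passes)
```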

	Working Phase 0 reference:

	```text
	SimToken seen J&F = 0.7651
	SimToken unseen J&F = 0.7456
	Seen/unseen average = 0.7554
	Target Oracle Tube J&F for green light ~= 0.8054
	```
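The reference block above is simple arithmetic: average the reproduced seen and unseen J&F, then add a 5-point margin. A minimal sketch, assuming the margin is an absolute 0.05 on the 0-1 scale (which matches the logged 0.8054); `green_light_target` is an illustrative name.

```python
def green_light_target(seen_jf, unseen_jf, margin=0.05):
    """Return (reference, target) where target = mean(seen, unseen) + margin."""
    reference = (seen_jf + unseen_jf) / 2.0
    return reference, reference + margin


# Full-precision reproduced J&F values from the SimToken reproduction.
reference, target = green_light_target(0.7651473806, 0.7456046051)
print(f"reference={reference:.4f} target={target:.4f}")
```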

	## Phase 0 Proposal Experiments

	### Implementation Notes

	Scripts added:

	```text
	tools/tubetoken/phase0_common.py
	tools/tubetoken/generate_sam2_proposals.py
	tools/tubetoken/evaluate_phase0_proposals.py
	tools/tubetoken/evaluate_oracle_refine_sam2.py
	```

	SAM2 proposal generation uses:

	- SAM2 automatic mask generation on keyframes.
	- SAM2 video propagation to form tubes.
	- Cache format: one `.npz` per video with `masks`, `scores`, `keyframes`, and `boxes_xyxy`.
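For orientation, a per-video cache with the four keys listed above could be written and read as follows. This is a sketch only: the key names come from the log, but the array shapes and dtypes (tubes × frames × height × width masks, one score and one source keyframe per tube) and the helper names are assumptions, not the layout used by `generate_sam2_proposals.py`.

```python
import os
import tempfile

import numpy as np

CACHE_KEYS = ("masks", "scores", "keyframes", "boxes_xyxy")


def save_proposals(path, masks, scores, keyframes, boxes_xyxy):
    """Write one compressed .npz per video with the four cache keys."""
    np.savez_compressed(
        path, masks=masks, scores=scores,
        keyframes=keyframes, boxes_xyxy=boxes_xyxy,
    )


def load_proposals(path):
    """Read the cache back into a plain dict of arrays."""
    with np.load(path) as data:
        return {key: data[key] for key in CACHE_KEYS}


# Toy tensors; shapes are hypothetical.
num_tubes, num_frames, height, width = 3, 4, 8, 8
masks = np.zeros((num_tubes, num_frames, height, width), dtype=bool)
masks[0, :, 2:5, 2:5] = True  # tube 0 covers a 3x3 patch in every frame
toy = dict(
    masks=masks,
    scores=np.array([0.9, 0.5, 0.1], dtype=np.float32),
    keyframes=np.array([0, 0, 2], dtype=np.int64),
    boxes_xyxy=np.zeros((num_tubes, num_frames, 4), dtype=np.float32),
)

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "toy_video.npz")
    save_proposals(cache_path, **toy)
    loaded = load_proposals(cache_path)

print({key: value.shape for key, value in loaded.items()})
```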

	Important implementation correction:

	- Initial unidirectional propagation was invalid for Phase 0 because proposals from later keyframes were not truly propagated backward.
	- Bidirectional propagation was added.
	- Group-by-keyframe propagation was tested but performed slightly worse than shared-state bidirectional propagation on smoke evaluation.
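The reason unidirectional propagation was invalid can be seen with a toy frame-coverage model. The sketch below does not call SAM2; it only lists which frames a proposal seeded at a given keyframe can reach. The keyframe rule (every `stride`-th frame starting at 0) and the 10-frame clip length are assumptions for illustration.

```python
def keyframe_indices(num_frames, stride):
    """Assumed keyframe rule: every `stride`-th frame starting at frame 0."""
    return list(range(0, num_frames, stride))


def covered_frames(keyframe, num_frames, bidirectional):
    """Frames reachable by a proposal seeded at `keyframe`."""
    forward = list(range(keyframe, num_frames))
    if not bidirectional:
        return forward
    backward = list(range(keyframe - 1, -1, -1))  # walk back to frame 0
    return sorted(forward + backward)


NUM_FRAMES, STRIDE = 10, 8  # hypothetical clip length
keyframes = keyframe_indices(NUM_FRAMES, STRIDE)
late = keyframes[-1]
forward_only = covered_frames(late, NUM_FRAMES, bidirectional=False)
both_ways = covered_frames(late, NUM_FRAMES, bidirectional=True)
print("keyframes:", keyframes)
print("forward only from", late, "->", forward_only)
print("bidirectional from", late, "->", both_ways)
```

In this toy setting a proposal that first appears at the last keyframe has no mask on the earlier frames unless it is also propagated backward, so its tube IoU against a full-length ground-truth track is penalized regardless of mask quality.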

	### Smoke Results

	#### Unidirectional Smoke, stride=8, N=128, 5 videos

	Result:

	```text
	all: R@16=0.800, R@32=0.900, R@64=1.000, R@128=1.000, Oracle J&F=0.9577
	small: R@16=1.000, R@32=1.000, R@64=1.000, R@128=1.000, Oracle J&F=0.9798
	test_s: R@16=0.700, R@32=0.850, R@64=1.000, R@128=1.000, Oracle J&F=0.9743
	test_u: R@16=1.000, R@32=1.000, R@64=1.000, R@128=1.000, Oracle J&F=0.9244
	```

	Interpretation:

	- Code path worked, but the sample was too small and optimistic.

	#### Shared-state Bidirectional Smoke, stride=8, N=64, 30 videos

	Result:

	```text
	all: n=163, R@16=0.718, R@32=0.883, R@64=0.951, Oracle J&F=0.9080, miss=4.91%
	audio_keyword: n=130, R@16=0.738, R@32=0.923, R@64=0.977, Oracle J&F=0.9214, miss=2.31%
	h3_candidate: n=163, R@16=0.718, R@32=0.883, R@64=0.951, Oracle J&F=0.9080, miss=4.91%
	small: n=51, R@16=0.647, R@32=0.882, R@64=1.000, Oracle J&F=0.9654, miss=0.00%
	spatial_keyword: n=14, R@16=0.500, R@32=0.929, R@64=1.000, Oracle J&F=0.9106, miss=0.00%
	test_s: n=43, R@16=0.628, R@32=0.698, R@64=0.814, Oracle J&F=0.8409, miss=18.60%
	test_u: n=120, R@16=0.750, R@32=0.950, R@64=1.000, Oracle J&F=0.9321, miss=0.00%
	```

	Interpretation:

	- Bidirectional propagation fixed the small smoke behavior.
	- However, `test_s` remained much weaker than `test_u`.
	- Full validation was required before making a Phase 0 decision.
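For reading the tables, the logged numbers are consistent with `miss` being one minus recall at the cache size (for example 0.951 and 4.91% in the shared-state smoke run). The sketch below shows one way R@N and miss could be computed from per-expression ranks. The matching rule (what counts as a hit) is not stated in this log, so the input here is simply the 0-based rank of the first matching tube, or `None` when no tube in the cache matches; both function names are illustrative.

```python
def recall_at(first_hit_ranks, n):
    """Fraction of expressions whose first matching tube is in the top n."""
    hits = sum(1 for rank in first_hit_ranks if rank is not None and rank < n)
    return hits / len(first_hit_ranks)


def miss_rate(first_hit_ranks, cache_size):
    """Fraction of expressions with no matching tube in the whole cache."""
    return 1.0 - recall_at(first_hit_ranks, cache_size)


# Toy ranks for ten expressions with a cache of N=64 tubes.
ranks = [0, 3, 20, 40, None, None, 10, 15, 31, 63]
for n in (16, 32, 64):
    print(f"R@{n} = {recall_at(ranks, n):.3f}")
print(f"miss = {miss_rate(ranks, 64):.2%}")
```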

	#### Group-by-keyframe Bidirectional Smoke, stride=8, N=64, 30 videos

	Result:

	```text
	all: n=163, R@16=0.718, R@32=0.847, R@64=0.914, Oracle J&F=0.9024, miss=8.59%
	audio_keyword: n=130, R@16=0.738, R@32=0.877, R@64=0.931, Oracle J&F=0.9138, miss=6.92%
	h3_candidate: n=163, R@16=0.718, R@32=0.847, R@64=0.914, Oracle J&F=0.9024, miss=8.59%
	small: n=51, R@16=0.647, R@32=0.882, R@64=1.000, Oracle J&F=0.9695, miss=0.00%
	spatial_keyword: n=14, R@16=0.500, R@32=0.929, R@64=1.000, Oracle J&F=0.8945, miss=0.00%
	test_s: n=43, R@16=0.628, R@32=0.698, R@64=0.814, Oracle J&F=0.8416, miss=18.60%
	test_u: n=120, R@16=0.750, R@32=0.900, R@64=0.950, Oracle J&F=0.9241, miss=5.00%
	```

	Decision:

	- Group-by-keyframe is worse than shared-state bidirectional for recall (overall R@64 0.914 vs. 0.951, miss 8.59% vs. 4.91%), while R@16 is unchanged at 0.718.
	- Use shared-state bidirectional as the current best SAM2 propagation setting.

	### Full Results: stride=8, N=64

	Full shared-state bidirectional result:

	```text
	all: n=3944, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7491, miss=24.62%
	area_unstable: n=18, R@16=0.556, R@32=0.556, R@64=0.889, Oracle J&F=0.7114, miss=11.11%
	audio_keyword: n=2844, R@16=0.475, R@32=0.610, R@64=0.766, Oracle J&F=0.7569, miss=23.42%
	h3_candidate: n=3932, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7488, miss=24.64%
	partial: n=8, R@16=0.250, R@32=0.250, R@64=1.000, Oracle J&F=0.8123, miss=0.00%
	same_category: n=330, R@16=0.482, R@32=0.588, R@64=0.709, Oracle J&F=0.7261, miss=29.09%
	small: n=1631, R@16=0.237, R@32=0.392, R@64=0.633, Oracle J&F=0.6367, miss=36.73%
	spatial_keyword: n=965, R@16=0.331, R@32=0.476, R@64=0.658, Oracle J&F=0.6714, miss=34.20%
	test_s: n=2288, R@16=0.326, R@32=0.483, R@64=0.657, Oracle J&F=0.6674, miss=34.27%
	test_u: n=1656, R@16=0.665, R@32=0.755, R@64=0.887, Oracle J&F=0.8618, miss=11.29%
	```

	Decision:

	- `stride=8, N=64` is a Phase 0 red-light configuration.
	- It fails the v4 Go/No-Go criteria:
	  - Overall Recall@32 is 0.597, below 85%.
	  - Overall Recall@64 is 0.754, below 80%.
	  - Small-target Recall@32 is 0.392, far below 70%.
	  - Oracle Tube J&F is 0.7491, below the `SimToken + 5` target of 0.8054.
	  - `test_s` Oracle J&F is 0.6674, far below the reproduced SimToken seen J&F of 0.7651.
	- Do not proceed to TubeToken-Minimal with this proposal cache.
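The red-light call can be checked mechanically against the full `stride=8, N=64` table. The sketch below encodes each threshold named in the decision as a minimum and lists the metrics that fall short. How the v4 plan combines individual criteria into red, yellow, or green is not spelled out in this log, so the code only reports per-criterion failures; the dictionary keys are illustrative names.

```python
# Minimum values named in the decision (0-1 scale).
V4_THRESHOLDS = {
    "all_r32": 0.85,
    "all_r64": 0.80,
    "small_r32": 0.70,
    "all_oracle_jf": 0.8054,     # reproduced SimToken mean + 5 points
    "test_s_oracle_jf": 0.7651,  # reproduced SimToken seen J&F
}

# Values copied from the full stride=8, N=64 results.
STRIDE8_N64 = {
    "all_r32": 0.597,
    "all_r64": 0.754,
    "small_r32": 0.392,
    "all_oracle_jf": 0.7491,
    "test_s_oracle_jf": 0.6674,
}


def failed_criteria(metrics, thresholds):
    """Names of criteria whose metric is below its minimum."""
    return [name for name, minimum in thresholds.items() if metrics[name] < minimum]


failed = failed_criteria(STRIDE8_N64, V4_THRESHOLDS)
print(f"{len(failed)}/{len(V4_THRESHOLDS)} criteria failed: {failed}")
```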

	Main bottleneck:

	- Proposal recall, especially for `test_s`, small targets, and spatial expressions.
	- Bidirectional propagation does not solve the full-set miss problem, so the problem is likely candidate generation / ranking / keyframe coverage, not just temporal direction.

	## Phase 0 Completed Results

	### stride=8, N=128 Full Evaluation (Yellow Light)

	Completed 2026-04-26. Proposal directory: `proposals_stride8_n128_miss` (all 542 test_s+test_u videos, N=128).

	```text
	all: n=3944, R@16=0.469, R@32=0.597, R@64=0.754, R@128=0.867, Oracle J&F=0.8407, miss=13.31%
	audio_keyword: n=2844, R@16=0.475, R@32=0.610, R@64=0.766, R@128=0.870, Oracle J&F=0.8445, miss=12.97%
	small: n=1631, R@16=0.237, R@32=0.392, R@64=0.633, R@128=0.821, Oracle J&F=0.7942, miss=17.90%
	spatial_keyword: n=965, R@16=0.331, R@32=0.476, R@64=0.658, R@128=0.804, Oracle J&F=0.7902, miss=19.59%
	test_s: n=2288, R@16=0.326, R@32=0.483, R@64=0.657, R@128=0.813, Oracle J&F=0.7941, miss=18.71%
	test_u: n=1656, R@16=0.665, R@32=0.755, R@64=0.887, R@128=0.941, Oracle J&F=0.9052, miss=5.86%
	```

	Go/No-Go decision: Yellow Light (conditional green light)

	| Criterion | Threshold | Current value | Status |
	|------|------|--------|------|
	| Oracle Tube J&F (all) | ≥ SimToken mean + 5 points ≈ 0.8054 | 0.8407 | ✅ |
	| test_s Oracle J&F | ≥ SimToken seen 0.7651 | 0.7941 | ✅ |
	| test_s R@128 (revised criterion) | ≥ 0.75 | 0.813 | ✅ |

	Note: the original R@32 criterion (≥ 85%) is not met (0.597), but it was designed for the N=64 setting; for the N=128 run it is replaced by R@128. The 13.31% miss rate is a generation bottleneck that increasing N cannot fix; it requires stride=4.

	Additional completed work:
	- Stratified evaluation subset `eval_subset_150.txt` (156 videos, covering 6 strata)
	- Precomputed CLIP text features: `data/text_embed/` (19395 files, 768-dim)
	- TubeToken-Minimal framework skeleton: `datasets/dataset_tubetoken.py`, `models/tubetoken_minimal.py`, `train_tubetoken.py` (smoke test passed)

	### EC-SimToken v1 (Completed, Diagnosed as a Failure)

	Completed 2026-04-27. Trained for 5 epochs with batch_size=12, null_aug_prob=0.25, exist_loss_weight=1.0.
	Checkpoint: `checkpoints/ec_simtoken/ec_simtoken_v1_ep5.pth`.

	Segmentation metrics (compared with the SimToken baseline)

	| Split | mIoU | F | J&F | SimToken J&F |
	|-------|------|---|-----|--------------|
	| test_s | 0.7062 | 0.8003 | 0.7533 | 0.7651 |
	| test_u | 0.6855 | 0.7844 | 0.7350 | 0.7456 |

	Segmentation quality is slightly below SimToken (test_s -1.18 pt, test_u -1.06 pt): the 5-epoch fine-tune did not cause a collapse, but it brought no improvement either.

	Existence head metrics (ep5, threshold=0.50)

	```text
	── p_exist distribution ─────────────────────────────────────
	split n mean med p10 p25 p75 p90 min max
	test_s(+) 2288 0.850 0.910 0.648 0.812 0.957 0.977 0.005 0.996
	test_u(+) 1656 0.839 0.914 0.598 0.793 0.957 0.977 0.018 0.992
	test_n(null) 1028 0.889 0.953 0.792 0.910 0.969 0.980 0.000 0.992

	AUC-ROC (null vs positive): 0.3605
	test_n null_tp=53/1028 (5.2%) Null_S=0.0100
	```

	Existence loss trajectory

	| Epoch | mean exist_loss | Range |
	|-------|----------------|------|
	| 1 | 0.6218 | 0.82 → 0.54 |
	| 2 | 0.3770 | 0.40 → 0.34 |
	| 3 | 0.2860 | 0.33 → 0.27 |
	| 4 | 0.2383 | 0.31 → 0.24 |
	| 5 | 0.2351 | 0.24 → 0.23 |

	Failure diagnosis

	exist_loss does converge (0.82 → 0.23), which shows that the existence head learned some task on the training set. However, AUC = 0.36 is below 0.5 (chance), and the mean p_exist for null samples (0.889) is higher than for positive samples (0.839-0.850): the ordering is fully inverted.
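To make the "AUC below 0.5 means inverted ordering" reading concrete, AUC can be computed as the fraction of (positive, null) pairs in which the positive sample receives the higher p_exist. The sketch below uses made-up toy scores, not the logged distributions, and assumes the convention that positives should score higher; the log does not state which convention its AUC-ROC script uses.

```python
def pairwise_auc(positive_scores, null_scores):
    """AUC as P(score_pos > score_null), counting ties as half a win."""
    wins = 0.0
    for pos in positive_scores:
        for null in null_scores:
            if pos > null:
                wins += 1.0
            elif pos == null:
                wins += 0.5
    return wins / (len(positive_scores) * len(null_scores))


# Toy p_exist values where null samples tend to score higher than positives.
toy_positive = [0.60, 0.80, 0.85]
toy_null = [0.90, 0.95, 0.70]

auc = pairwise_auc(toy_positive, toy_null)
flipped = pairwise_auc(toy_null, toy_positive)
print(f"AUC = {auc:.4f}, AUC with roles swapped = {flipped:.4f}")
```

With no ties, swapping the two groups gives exactly 1 minus the AUC, so a value well below 0.5 indicates that the head separates the groups in the wrong direction rather than not at all.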

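	Root cause: train/test distribution mismatch.

	| | Training null (synthetic) | test_n (real) |
	|---|---|---|
	| Construction | Random audio swap | Target is genuinely absent from the video |
	| Audio features | Completely mismatched with the video (random) | Semantically coherent; the target is simply not visible |
	| Model response | seg_embedding is disrupted, which the head can detect | seg_embedding is "confident but wrong", which the head cannot distinguish |

	The existence head learned to detect the embedding anomaly caused by audio swap rather than genuine target absence. A threshold sweep is pointless because the distribution ordering is already inverted.

	Decision: EC-SimToken v1 is classified as a diagnostic experiment and will not serve as a strong baseline in the paper's main table. No further tuning is planned, since adjusting the threshold, loss weight, or null_aug_prob cannot fix the distribution mismatch. The J&F results are kept for reference; the existence head is not claimed to be effective in any external report.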

	## Pending Experiments (Deferred)

	### Experiment B: stride=4, N=128

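	Status: in progress (interrupted, resumable). 227/542 videos completed (41.9%); generation speed is about 44 s/video.
	Goal: verify whether denser keyframes can push the test_s miss rate below 18.71%.
	Proposal directory: `runs/tubetoken_phase0/proposals_stride4_n128` (NPZ files are kept after an interruption; a resumed run automatically skips completed videos)
	Measured cost: stride=4 is about 3.4× slower than stride=8 (4 keyframes vs. 3, plus a larger propagation state). A single process takes about 6-7 h for the full set; 2-shard parallel takes about 3.5 h.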
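The time estimates follow from the logged throughput. A minimal sketch, assuming a constant 44 s/video and perfectly balanced shards (so the results are lower bounds on wall-clock time); `eta_hours` is an illustrative helper.

```python
def eta_hours(num_videos, seconds_per_video, num_shards=1):
    """Idealized wall-clock hours with evenly balanced shards."""
    return num_videos * seconds_per_video / num_shards / 3600.0


TOTAL_VIDEOS, DONE_VIDEOS, SEC_PER_VIDEO = 542, 227, 44.0
remaining = TOTAL_VIDEOS - DONE_VIDEOS

full_single = eta_hours(TOTAL_VIDEOS, SEC_PER_VIDEO)
full_two_shards = eta_hours(TOTAL_VIDEOS, SEC_PER_VIDEO, num_shards=2)
remaining_two_shards = eta_hours(remaining, SEC_PER_VIDEO, num_shards=2)

print(f"progress: {100.0 * DONE_VIDEOS / TOTAL_VIDEOS:.1f}%")
print(f"full set, 1 process: {full_single:.2f} h")
print(f"full set, 2 shards:  {full_two_shards:.2f} h")
print(f"remaining {remaining} videos, 2 shards: {remaining_two_shards:.2f} h")
```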

	Step 1: Resume proposal generation (2-shard parallel; start both commands at the same time in two terminals)

	```bash
	# Terminal 1 (shard 0)
	cd /workspace/SimToken
	python tools/tubetoken/generate_sam2_proposals.py \
	--data_dir /workspace/SimToken/data \
	--out_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
	--splits test_s,test_u \
	--sam2_repo /workspace/sam2 \
	--model_cfg configs/sam2.1/sam2.1_hiera_l.yaml \
	--checkpoint /workspace/sam2/checkpoints/sam2.1_hiera_large.pt \
	--stride 4 --max_tubes 128 \
	--device cuda --amp_dtype bf16 \
	--quiet_sam2 --no_group_by_keyframe \
	--num_shards 2 --shard_id 0 \
	2>&1 | tee runs/tubetoken_phase0/proposals_stride4_n128_s0.log

	# Terminal 2 (shard 1)
	cd /workspace/SimToken
	python tools/tubetoken/generate_sam2_proposals.py \
	--data_dir /workspace/SimToken/data \
	--out_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
	--splits test_s,test_u \
	--sam2_repo /workspace/sam2 \
	--model_cfg configs/sam2.1/sam2.1_hiera_l.yaml \
	--checkpoint /workspace/sam2/checkpoints/sam2.1_hiera_large.pt \
	--stride 4 --max_tubes 128 \
	--device cuda --amp_dtype bf16 \
	--quiet_sam2 --no_group_by_keyframe \
	--num_shards 2 --shard_id 1 \
	2>&1 | tee runs/tubetoken_phase0/proposals_stride4_n128_s1.log
	```
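The `--num_shards` / `--shard_id` flags and the skip-on-resume behavior could work roughly as sketched below. This is an assumption about the mechanism, not the script's actual code: videos are assigned round-robin over a sorted list, and a video counts as done when `<video_id>.npz` already exists in the output directory. Both helper names and the file naming are hypothetical.

```python
import tempfile
from pathlib import Path


def shard_videos(video_ids, num_shards, shard_id):
    """Round-robin assignment over a sorted list, so shards are disjoint."""
    ordered = sorted(video_ids)
    return [vid for index, vid in enumerate(ordered) if index % num_shards == shard_id]


def pending_videos(video_ids, out_dir):
    """Keep only videos that do not yet have a cached .npz file."""
    out_dir = Path(out_dir)
    return [vid for vid in video_ids if not (out_dir / f"{vid}.npz").exists()]


video_ids = [f"vid_{i}" for i in range(6)]

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "vid_0.npz").touch()  # pretend vid_0 finished before the interruption
    shard0 = shard_videos(video_ids, num_shards=2, shard_id=0)
    shard1 = shard_videos(video_ids, num_shards=2, shard_id=1)
    todo = pending_videos(shard0, tmp)

print("shard 0:", shard0)
print("shard 1:", shard1)
print("shard 0 still to run:", todo)
```

Under this scheme both shards can write to the same output directory without coordination, because no video is assigned to more than one shard.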

	Step 2: Quick subset evaluation (about 5 minutes, after generation finishes)

	```bash
	mkdir -p runs/tubetoken_phase0/eval_stride4_n128_subset

	python tools/tubetoken/evaluate_phase0_proposals.py \
	--data_dir /workspace/SimToken/data \
	--proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
	--out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_subset \
	--audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \
	--splits test_s,test_u \
	--video_list /workspace/SimToken/runs/tubetoken_phase0/eval_subset_150.txt \
	--recall_ns 16,32,64,128 \
	2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_subset.log
	```

	Step 3: Full-set evaluation (after the subset passes)

	```bash
	mkdir -p runs/tubetoken_phase0/eval_stride4_n128_full

	python tools/tubetoken/evaluate_phase0_proposals.py \
	--data_dir /workspace/SimToken/data \
	--proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
	--out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_full \
	--audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \
	--splits test_s,test_u \
	--recall_ns 16,32,64,128 \
	2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_full.log
	```

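	Decision rules (from the experiment recommendations):

	| Subset test_s Oracle J&F | Meaning | Impact on Milestone 2 |
	|------------------------|------|---------------------|
	| ≥ 0.77 | Green-light candidate; triggers full-set confirmation | If the full set passes, switch the backend to stride=4 |
	| 0.72–0.77 | Marginal improvement | Keep stride=8, N=128; no change |
	| < 0.72 | The generation bottleneck runs deeper than keyframe density | Keep stride=8, N=128; stop pursuing a green light |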
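The decision table can be expressed as a small function for use in an evaluation script. This is a sketch: the thresholds come from the table, while the boundary handling (0.77 counts as a green-light candidate, 0.72 as marginal) and the returned labels are assumptions.

```python
def stride4_decision(subset_test_s_oracle_jf):
    """Map the subset test_s Oracle J&F to one of the three outcomes."""
    if subset_test_s_oracle_jf >= 0.77:
        return "green_candidate"  # run the full-set evaluation next
    if subset_test_s_oracle_jf >= 0.72:
        return "marginal"         # keep stride=8, N=128
    return "deeper_bottleneck"    # keep stride=8, N=128 and stop stride exploration


for value in (0.80, 0.75, 0.70):
    print(f"{value:.2f} -> {stride4_decision(value)}")
```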

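	### EC-SimToken v2 (To Be Designed)

	Status: on hold. After Experiment B completes, decide whether to start based on progress on the main TubeToken line.
	Prerequisite: the root cause of the v1 failure has been identified (see Phase 0 Completed Results above); v2 must switch to in-distribution null samples.
	Direction: cross-video query swap (with same-category filtering), or use a train_n split directly (if the dataset provides one).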

	---

	### TubeToken-Minimal Training Proposals (Train Split)

	Status: pending; queued to run after stride=4 completes.
	Estimated time: 2767 train videos × ~15 s ≈ 12 hours.

	```bash
	mkdir -p runs/tubetoken_phase0/proposals_stride8_n128_train

	python tools/tubetoken/generate_sam2_proposals.py \
	--data_dir /workspace/SimToken/data \
	--out_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride8_n128_train \
	--splits train \
	--sam2_repo /workspace/sam2 \
	--model_cfg configs/sam2.1/sam2.1_hiera_l.yaml \
	--checkpoint /workspace/sam2/checkpoints/sam2.1_hiera_large.pt \
	--stride 8 --max_tubes 128 \
	--device cuda --amp_dtype bf16 \
	--quiet_sam2 --no_group_by_keyframe \
	2>&1 | tee runs/tubetoken_phase0/proposals_stride8_n128_train.log
	```

	## Next Experiment (Active)

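	### Experiment B: stride=4, N=128 (Resume + Evaluation)

	Current status: 227/542 NPZ files completed; the run was interrupted. For the resume command, see Pending Experiments → Experiment B.

	Step 1: Resume generation (see the 2-shard commands under Pending Experiments; about 315 videos remain, roughly 2-2.5 h with 2 shards)

	Step 2: Subset evaluation (about 5 minutes, after generation finishes)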

	```bash
	mkdir -p runs/tubetoken_phase0/eval_stride4_n128_subset

	python tools/tubetoken/evaluate_phase0_proposals.py \
	--data_dir /workspace/SimToken/data \
	--proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
	--out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_subset \
	--audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \
	--splits test_s,test_u \
	--video_list /workspace/SimToken/runs/tubetoken_phase0/eval_subset_150.txt \
	--recall_ns 16,32,64,128 \
	2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_subset.log
	```

	Step 3: Full-set evaluation (run when subset test_s Oracle J&F ≥ 0.77)

	```bash
	mkdir -p runs/tubetoken_phase0/eval_stride4_n128_full

	python tools/tubetoken/evaluate_phase0_proposals.py \
	--data_dir /workspace/SimToken/data \
	--proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
	--out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_full \
	--audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \
	--splits test_s,test_u \
	--recall_ns 16,32,64,128 \
	2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_full.log
	```

	Decision rules

	| Subset test_s Oracle J&F | Conclusion | Next step |
	|------------------------|------|------|
	| ≥ 0.77 | Green-light candidate | Run the full set; if it passes, switch the backend to stride=4 |
	| 0.72–0.77 | Marginal improvement | Keep stride=8 N=128; do not change the backend |
	| < 0.72 | Keyframe density is not the main cause | Stop exploring stride; TubeToken-Minimal uses stride=8 |

	Full-set green-light criteria (compared with stride=8)

	| Metric | stride=8 N=128 | Expected for stride=4 |
	|------|----------------|---------------|
	| test_s R@128 | 0.813 | Clear improvement |
	| test_s miss% | 18.71% | Clear reduction |
	| small R@128 | 0.821 | Improvement |
	| all Oracle J&F | 0.8407 | Maintained or improved |
	| test_s Oracle J&F | 0.7941 | Maintained or improved |