| # TubeToken Phase -1 / Phase 0 Experiment Log |
|
|
| This document records the actual experiment progress, observations, and next actions for the TubeToken v4 plan. |
|
|
| ## Phase -1 Summary |
|
|
| ### Data Audit |
|
|
| Audit output: |
|
|
| ```text |
| Expressions: 20459 |
| Videos: 3574 |
| Objects (vid, fid): 7461 |
| Splits: val 1349, train 14113, test_s 2288, TODO 25, test_u 1656, test_n 1028 |
| |
| Expressions/video mean: 5.724 |
| Expressions/video median: 6.0 |
| Videos with >=2 expressions: 3521 |
| Expressions/object mean: 2.742 |
| Objects with >=2 expressions: 5836 |
| H3 candidate objects: 5781 |
| H3 candidate expressions: 18614 |
| |
| Null split expressions: 1028 (5.02%) |
| Audio-keyword expressions: 15890 (77.67%) |
| Spatial-keyword expressions: 5924 (28.96%) |
| Same-category distractor heuristic expressions: 2563 (12.53%) |
| Small-target expressions: 10037 |
| Partial-target expressions: 33 |
| Area-unstable expressions: 41 |
| Late-target expressions: 0 |
| ``` |
|
|
| Decision: |
|
|
| - Multi-expression structure is strong. |
| - H3 direct validation remains a P0 target. |
| - Null modeling is feasible but needs oversampling / curriculum because Null ratio is only about 5%. |
| - Small-target proposal recall is a major risk. |
| - Late-target subset is not useful under the current GT visibility definition. |
|
|
| ### SimToken Reproduction |
|
|
| Reproduced results: |
|
|
| ```text |
| test_seen: |
| mIoU = 0.7189123889 |
| F = 0.8113823722 |
| J&F = 0.7651473806 |
| |
| test_unseen: |
| mIoU = 0.6996124670 |
| F = 0.7915967433 |
| J&F = 0.7456046051 |
| |
| test_n: |
| S = 0.0117917573 |
| ``` |
|
|
| Paper/report result: |
|
|
| ```text |
| Seen: J 72.0, F 81.3, J&F 76.7 |
| Unseen: J 69.8, F 79.1, J&F 74.5 |
| Mix: J 70.9, F 80.2, J&F 75.6 |
| Null S: 0.012 |
| ``` |
|
|
| Decision: |
|
|
| - SimToken reproduction passes Phase -1. |
| - Difference from the report is far below the 1.5 J&F pause threshold. |
| - Later Go/No-Go thresholds should use reproduced SimToken as the reference. |
|
|
| Working Phase 0 reference: |
|
|
| ```text |
| SimToken seen J&F = 0.7651 |
| SimToken unseen J&F = 0.7456 |
| Seen/unseen average = 0.7554 |
| Target Oracle Tube J&F for green light ~= 0.8054 |
| ``` |
|
|
| ## Phase 0 Proposal Experiments |
|
|
| ### Implementation Notes |
|
|
| Scripts added: |
|
|
| ```text |
| tools/tubetoken/phase0_common.py |
| tools/tubetoken/generate_sam2_proposals.py |
| tools/tubetoken/evaluate_phase0_proposals.py |
| tools/tubetoken/evaluate_oracle_refine_sam2.py |
| ``` |
|
|
| SAM2 proposal generation uses: |
|
|
| - SAM2 automatic mask generation on keyframes. |
| - SAM2 video propagation to form tubes. |
| - Cache format: one `.npz` per video with `masks`, `scores`, `keyframes`, and `boxes_xyxy`. |
|
|
| Important implementation correction: |
|
|
| - Initial unidirectional propagation was invalid for Phase 0 because proposals from later keyframes were not truly propagated backward. |
| - Bidirectional propagation was added. |
| - Group-by-keyframe propagation was tested but performed slightly worse than shared-state bidirectional propagation on smoke evaluation. |
|
|
| ### Smoke Results |
|
|
| #### Unidirectional Smoke, stride=8, N=128, 5 videos |
|
|
| Result: |
|
|
| ```text |
| all: R@16=0.800, R@32=0.900, R@64=1.000, R@128=1.000, Oracle J&F=0.9577 |
| small: R@16=1.000, R@32=1.000, R@64=1.000, R@128=1.000, Oracle J&F=0.9798 |
| test_s: R@16=0.700, R@32=0.850, R@64=1.000, R@128=1.000, Oracle J&F=0.9743 |
| test_u: R@16=1.000, R@32=1.000, R@64=1.000, R@128=1.000, Oracle J&F=0.9244 |
| ``` |
|
|
| Interpretation: |
|
|
| - Code path worked, but the sample was too small and optimistic. |
|
|
| #### Shared-state Bidirectional Smoke, stride=8, N=64, 30 videos |
|
|
| Result: |
|
|
| ```text |
| all: n=163, R@16=0.718, R@32=0.883, R@64=0.951, Oracle J&F=0.9080, miss=4.91% |
| audio_keyword: n=130, R@16=0.738, R@32=0.923, R@64=0.977, Oracle J&F=0.9214, miss=2.31% |
| h3_candidate: n=163, R@16=0.718, R@32=0.883, R@64=0.951, Oracle J&F=0.9080, miss=4.91% |
| small: n=51, R@16=0.647, R@32=0.882, R@64=1.000, Oracle J&F=0.9654, miss=0.00% |
| spatial_keyword: n=14, R@16=0.500, R@32=0.929, R@64=1.000, Oracle J&F=0.9106, miss=0.00% |
| test_s: n=43, R@16=0.628, R@32=0.698, R@64=0.814, Oracle J&F=0.8409, miss=18.60% |
| test_u: n=120, R@16=0.750, R@32=0.950, R@64=1.000, Oracle J&F=0.9321, miss=0.00% |
| ``` |
|
|
| Interpretation: |
|
|
| - Bidirectional propagation fixed the small smoke behavior. |
| - However, `test_s` remained much weaker than `test_u`. |
| - Full validation was required before making a Phase 0 decision. |
|
|
| #### Group-by-keyframe Bidirectional Smoke, stride=8, N=64, 30 videos |
|
|
| Result: |
|
|
| ```text |
| all: n=163, R@16=0.718, R@32=0.847, R@64=0.914, Oracle J&F=0.9024, miss=8.59% |
| audio_keyword: n=130, R@16=0.738, R@32=0.877, R@64=0.931, Oracle J&F=0.9138, miss=6.92% |
| h3_candidate: n=163, R@16=0.718, R@32=0.847, R@64=0.914, Oracle J&F=0.9024, miss=8.59% |
| small: n=51, R@16=0.647, R@32=0.882, R@64=1.000, Oracle J&F=0.9695, miss=0.00% |
| spatial_keyword: n=14, R@16=0.500, R@32=0.929, R@64=1.000, Oracle J&F=0.8945, miss=0.00% |
| test_s: n=43, R@16=0.628, R@32=0.698, R@64=0.814, Oracle J&F=0.8416, miss=18.60% |
| test_u: n=120, R@16=0.750, R@32=0.900, R@64=0.950, Oracle J&F=0.9241, miss=5.00% |
| ``` |
|
|
| Decision: |
|
|
| - Group-by-keyframe is worse than shared-state bidirectional for recall. |
| - Use shared-state bidirectional as the current best SAM2 propagation setting. |
|
|
| ### Full Results: stride=8, N=64 |
|
|
| Full shared-state bidirectional result: |
|
|
| ```text |
| all: n=3944, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7491, miss=24.62% |
| area_unstable: n=18, R@16=0.556, R@32=0.556, R@64=0.889, Oracle J&F=0.7114, miss=11.11% |
| audio_keyword: n=2844, R@16=0.475, R@32=0.610, R@64=0.766, Oracle J&F=0.7569, miss=23.42% |
| h3_candidate: n=3932, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7488, miss=24.64% |
| partial: n=8, R@16=0.250, R@32=0.250, R@64=1.000, Oracle J&F=0.8123, miss=0.00% |
| same_category: n=330, R@16=0.482, R@32=0.588, R@64=0.709, Oracle J&F=0.7261, miss=29.09% |
| small: n=1631, R@16=0.237, R@32=0.392, R@64=0.633, Oracle J&F=0.6367, miss=36.73% |
| spatial_keyword: n=965, R@16=0.331, R@32=0.476, R@64=0.658, Oracle J&F=0.6714, miss=34.20% |
| test_s: n=2288, R@16=0.326, R@32=0.483, R@64=0.657, Oracle J&F=0.6674, miss=34.27% |
| test_u: n=1656, R@16=0.665, R@32=0.755, R@64=0.887, Oracle J&F=0.8618, miss=11.29% |
| ``` |
|
|
| Decision: |
|
|
| - `stride=8, N=64` is a Phase 0 red-light configuration. |
| - It fails the v4 Go/No-Go criteria: |
| - Overall Recall@32 is below 85%. |
| - Overall Recall@64 is below 80%. |
| - Small-target Recall@32 is far below 70%. |
| - Oracle Tube J&F is below the target `SimToken + 5`. |
| - `test_s` Oracle J&F is far below reproduced SimToken seen J&F. |
| - Do not proceed to TubeToken-Minimal with this proposal cache. |
|
|
| Main bottleneck: |
|
|
| - Proposal recall, especially for `test_s`, small targets, and spatial expressions. |
| - Bidirectional propagation does not solve the full-set miss problem, so the problem is likely candidate generation / ranking / keyframe coverage, not just temporal direction. |
|
|
| ## Phase 0 Completed Results |
|
|
| ### stride=8, N=128 Full Evaluation (Yellow Light) |
|
|
| Completed 2026-04-26. Proposal directory: `proposals_stride8_n128_miss` (all 542 test_s+test_u videos, N=128). |
|
|
| ```text |
| all: n=3944, R@16=0.469, R@32=0.597, R@64=0.754, R@128=0.867, Oracle J&F=0.8407, miss=13.31% |
| audio_keyword: n=2844, R@16=0.475, R@32=0.610, R@64=0.766, R@128=0.870, Oracle J&F=0.8445, miss=12.97% |
| small: n=1631, R@16=0.237, R@32=0.392, R@64=0.633, R@128=0.821, Oracle J&F=0.7942, miss=17.90% |
| spatial_keyword: n=965, R@16=0.331, R@32=0.476, R@64=0.658, R@128=0.804, Oracle J&F=0.7902, miss=19.59% |
| test_s: n=2288, R@16=0.326, R@32=0.483, R@64=0.657, R@128=0.813, Oracle J&F=0.7941, miss=18.71% |
| test_u: n=1656, R@16=0.665, R@32=0.755, R@64=0.887, R@128=0.941, Oracle J&F=0.9052, miss=5.86% |
| ``` |
|
|
| Go/No-Go decision: **Yellow Light (条件绿灯)** |
|
|
| | 条件 | 阈值 | 当前值 | 状态 | |
| |------|------|--------|------| |
| | Oracle Tube J&F (all) | ≥ SimToken均值+5% ≈ 0.8054 | 0.8407 | ✅ | |
| | test_s Oracle J&F | ≥ SimToken seen 0.7651 | 0.7941 | ✅ | |
| | test_s R@128 (修订条件) | ≥ 0.75 | 0.813 | ✅ | |
|
|
| 注:R@32 原始条件(≥85%)未达标(0.597),但该条件是为 N=64 语境设计的,在 N=128 运行时以 R@128 替代。13.31% miss 是生成瓶颈,增加 N 无法解决,需 stride=4。 |
|
|
| 额外完成工作: |
| - 分层评估子集 `eval_subset_150.txt`(156 个视频,覆盖 6 个分层) |
| - CLIP 文本特征预计算:`data/text_embed/`(19395 个文件,768-dim) |
| - TubeToken-Minimal 框架骨架:`datasets/dataset_tubetoken.py`, `models/tubetoken_minimal.py`, `train_tubetoken.py`(smoke test 通过) |
|
|
| ### EC-SimToken v1(已完成 — 诊断失败) |
|
|
| 完成于 2026-04-27。训练 5 epoch,batch_size=12,null_aug_prob=0.25,exist_loss_weight=1.0。 |
| Checkpoint: `checkpoints/ec_simtoken/ec_simtoken_v1_ep5.pth`。 |
| |
| **分割指标(与 SimToken 基线对比)** |
| |
| | Split | mIoU | F | J&F | SimToken J&F | |
| |-------|------|---|-----|--------------| |
| | test_s | 0.7062 | 0.8003 | 0.7533 | 0.7651 | |
| | test_u | 0.6855 | 0.7844 | 0.7350 | 0.7456 | |
| |
| 分割能力略低于 SimToken(test_s -1.18pt,test_u -1.06pt),5 epoch fine-tune 未造成崩溃但未带来改善。 |
| |
| **Existence head 指标(ep5,threshold=0.50)** |
| |
| ```text |
| ── p_exist distribution ───────────────────────────────────── |
| split n mean med p10 p25 p75 p90 min max |
| test_s(+) 2288 0.850 0.910 0.648 0.812 0.957 0.977 0.005 0.996 |
| test_u(+) 1656 0.839 0.914 0.598 0.793 0.957 0.977 0.018 0.992 |
| test_n(null) 1028 0.889 0.953 0.792 0.910 0.969 0.980 0.000 0.992 |
| |
| AUC-ROC (null vs positive): 0.3605 |
| test_n null_tp=53/1028 (5.2%) Null_S=0.0100 |
| ``` |
| |
| **Existence loss 轨迹** |
| |
| | Epoch | mean exist_loss | 范围 | |
| |-------|----------------|------| |
| | 1 | 0.6218 | 0.82 → 0.54 | |
| | 2 | 0.3770 | 0.40 → 0.34 | |
| | 3 | 0.2860 | 0.33 → 0.27 | |
| | 4 | 0.2383 | 0.31 → 0.24 | |
| | 5 | 0.2351 | 0.24 → 0.23 | |
| |
| **失败诊断** |
| |
| exist_loss 确实收敛(0.82 → 0.23),说明 existence head 在训练集上学会了某个任务。但 AUC=0.36 < 0.5(随机),且 null 的 p_exist 均值(0.889)高于正样本(0.839-0.850),方向完全反转。 |
| |
| 根本原因:**训练-测试分布不匹配**。 |
| |
| | | 训练 null(合成) | test_n(真实) | |
| |---|---|---| |
| | 构造方式 | 随机 audio swap | 目标真实不在视频中 | |
| | 音频特征 | 与视频完全不匹配(随机) | 语义连贯,只是目标不可见 | |
| | 模型反应 | seg_embedding 混乱,head 可检测 | seg_embedding "自信但错误",head 无法区分 | |
| |
| existence head 学会了检测 audio-swap 造成的 embedding 异常,而非真实目标缺失。threshold sweep 无意义(分布顺序已反转)。 |
| |
| **决策**: EC-SimToken v1 定性为**诊断实验**,不作为论文主表强 baseline。不继续调参(调 threshold / loss weight / null_aug_prob 均无法修复分布错配)。保留 J&F 结果供参考,existence head 结论不对外声称有效。 |
| |
| ## Pending Experiments (Deferred) |
| |
| ### Experiment B: stride=4, N=128 |
| |
| **状态**: **进行中(已中断,可续跑)**。已完成 227/542 个视频(41.9%),生成速度约 44s/video。 |
| **目标**: 验证更密关键帧能否将 test_s miss% 从 18.71% 进一步降低。 |
| **Proposal 目录**: `runs/tubetoken_phase0/proposals_stride4_n128`(中断后 NPZ 文件保留,续跑自动跳过已完成视频) |
| **实际耗时**: stride=4 比 stride=8 慢约 3.4×(4 个 keyframe vs 3 个 + 更大 propagation state)。单进程全集约 6-7h;2-shard 并行约 3.5h。 |
| |
| **Step 1: 续跑生成 proposals(2-shard 并行,在两个终端同时启动)** |
| |
| ```bash |
| # Terminal 1 (shard 0) |
| cd /workspace/SimToken |
| python tools/tubetoken/generate_sam2_proposals.py \ |
| --data_dir /workspace/SimToken/data \ |
| --out_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \ |
| --splits test_s,test_u \ |
| --sam2_repo /workspace/sam2 \ |
| --model_cfg configs/sam2.1/sam2.1_hiera_l.yaml \ |
| --checkpoint /workspace/sam2/checkpoints/sam2.1_hiera_large.pt \ |
| --stride 4 --max_tubes 128 \ |
| --device cuda --amp_dtype bf16 \ |
| --quiet_sam2 --no_group_by_keyframe \ |
| --num_shards 2 --shard_id 0 \ |
| 2>&1 | tee runs/tubetoken_phase0/proposals_stride4_n128_s0.log |
| |
| # Terminal 2 (shard 1) |
| cd /workspace/SimToken |
| python tools/tubetoken/generate_sam2_proposals.py \ |
| --data_dir /workspace/SimToken/data \ |
| --out_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \ |
| --splits test_s,test_u \ |
| --sam2_repo /workspace/sam2 \ |
| --model_cfg configs/sam2.1/sam2.1_hiera_l.yaml \ |
| --checkpoint /workspace/sam2/checkpoints/sam2.1_hiera_large.pt \ |
| --stride 4 --max_tubes 128 \ |
| --device cuda --amp_dtype bf16 \ |
| --quiet_sam2 --no_group_by_keyframe \ |
| --num_shards 2 --shard_id 1 \ |
| 2>&1 | tee runs/tubetoken_phase0/proposals_stride4_n128_s1.log |
| ``` |
| |
| **Step 2: 子集快速评估(生成完成后约 5 分钟)** |
| |
| ```bash |
| mkdir -p runs/tubetoken_phase0/eval_stride4_n128_subset |
|
|
| python tools/tubetoken/evaluate_phase0_proposals.py \ |
| --data_dir /workspace/SimToken/data \ |
| --proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \ |
| --out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_subset \ |
| --audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \ |
| --splits test_s,test_u \ |
| --video_list /workspace/SimToken/runs/tubetoken_phase0/eval_subset_150.txt \ |
| --recall_ns 16,32,64,128 \ |
| 2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_subset.log |
| ``` |
| |
| **Step 3: 全集评估(子集通过后)** |
| |
| ```bash |
| mkdir -p runs/tubetoken_phase0/eval_stride4_n128_full |
|
|
| python tools/tubetoken/evaluate_phase0_proposals.py \ |
| --data_dir /workspace/SimToken/data \ |
| --proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \ |
| --out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_full \ |
| --audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \ |
| --splits test_s,test_u \ |
| --recall_ns 16,32,64,128 \ |
| 2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_full.log |
| ``` |
| |
| **决策规则(来自实验建议)**: |
| |
| | 子集 test_s Oracle J&F | 含义 | 对 Milestone 2 影响 | |
| |------------------------|------|---------------------| |
| | ≥ 0.77 | 绿灯候选,触发全集确认 | 若全集通过,切换 backend 为 stride=4 | |
| | 0.72–0.77 | 边际改善 | 保持 stride=8,N=128,不调整 | |
| | < 0.72 | 生成瓶颈深于关键帧密度 | 保持 stride=8,N=128,不再追求绿灯 | |
| |
| ### EC-SimToken v2(待设计) |
| |
| **状态**: 暂缓。等待 Experiment B 完成后,视 TubeToken 主线进度再决定是否启动。 |
| **前提**: v1 失败根因已定位(见下方 Phase 0 Completed Results),v2 需改用 in-distribution null 样本。 |
| **方向**: cross-video query swap(同类别过滤)或直接使用 train_n split(如数据集提供)。 |
| |
| --- |
| |
| ### TubeToken-Minimal 训练 proposals (Train Split) |
| |
| **状态**: 待执行,依赖 stride=4 完成后排队。 |
| **预计耗时**: 2767 个 train 视频 × ~15s = 约 12 小时。 |
| |
| ```bash |
| mkdir -p runs/tubetoken_phase0/proposals_stride8_n128_train |
|
|
| python tools/tubetoken/generate_sam2_proposals.py \ |
| --data_dir /workspace/SimToken/data \ |
| --out_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride8_n128_train \ |
| --splits train \ |
| --sam2_repo /workspace/sam2 \ |
| --model_cfg configs/sam2.1/sam2.1_hiera_l.yaml \ |
| --checkpoint /workspace/sam2/checkpoints/sam2.1_hiera_large.pt \ |
| --stride 8 --max_tubes 128 \ |
| --device cuda --amp_dtype bf16 \ |
| --quiet_sam2 --no_group_by_keyframe \ |
| 2>&1 | tee runs/tubetoken_phase0/proposals_stride8_n128_train.log |
| ``` |
| |
| ## Next Experiment (Active) |
| |
| ### Experiment B: stride=4, N=128(续跑 + 评估) |
| |
| **当前状态**: 227/542 NPZ 已完成,中断。续跑命令见 Pending Experiments → Experiment B。 |
| |
| **Step 1: 续跑生成**(见 Pending Experiments 中的 2-shard 命令,剩余约 315 个视频,2-shard 约 2-2.5h) |
| |
| **Step 2: 子集评估(生成完成后,约 5 分钟)** |
| |
| ```bash |
| mkdir -p runs/tubetoken_phase0/eval_stride4_n128_subset |
|
|
| python tools/tubetoken/evaluate_phase0_proposals.py \ |
| --data_dir /workspace/SimToken/data \ |
| --proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \ |
| --out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_subset \ |
| --audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \ |
| --splits test_s,test_u \ |
| --video_list /workspace/SimToken/runs/tubetoken_phase0/eval_subset_150.txt \ |
| --recall_ns 16,32,64,128 \ |
| 2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_subset.log |
| ``` |
| |
| **Step 3: 全集评估(子集 test_s Oracle J&F ≥ 0.77 时执行)** |
| |
| ```bash |
| mkdir -p runs/tubetoken_phase0/eval_stride4_n128_full |
|
|
| python tools/tubetoken/evaluate_phase0_proposals.py \ |
| --data_dir /workspace/SimToken/data \ |
| --proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \ |
| --out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_full \ |
| --audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \ |
| --splits test_s,test_u \ |
| --recall_ns 16,32,64,128 \ |
| 2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_full.log |
| ``` |
| |
| **决策规则** |
| |
| | 子集 test_s Oracle J&F | 结论 | 后续 | |
| |------------------------|------|------| |
| | ≥ 0.77 | 绿灯候选 | 跑全集;若全集通过,切换 backend 为 stride=4 | |
| | 0.72–0.77 | 边际改善 | 保持 stride=8 N=128,不调整 backend | |
| | < 0.72 | 关键帧密度不是主因 | 停止 stride 探索,TubeToken-Minimal 用 stride=8 | |
| |
| **全集绿灯标准**(与 stride=8 对比) |
| |
| | 指标 | stride=8 N=128 | 期望 stride=4 | |
| |------|----------------|---------------| |
| | test_s R@128 | 0.813 | 明显提升 | |
| | test_s miss% | 18.71% | 明显下降 | |
| | small R@128 | 0.821 | 提升 | |
| | all Oracle J&F | 0.8407 | 维持或提升 | |
| | test_s Oracle J&F | 0.7941 | 维持或提升 | |
| |
| |