SimToken / TubeToken_Phase0_Experiment_Log.md
yfan07's picture
Upload folder using huggingface_hub
9af2926 verified

TubeToken Phase -1 / Phase 0 Experiment Log

This document records the actual experiment progress, observations, and next actions for the TubeToken v4 plan.

Phase -1 Summary

Data Audit

Audit output:

Expressions: 20459
Videos: 3574
Objects (vid, fid): 7461
Splits: val 1349, train 14113, test_s 2288, TODO 25, test_u 1656, test_n 1028

Expressions/video mean: 5.724
Expressions/video median: 6.0
Videos with >=2 expressions: 3521
Expressions/object mean: 2.742
Objects with >=2 expressions: 5836
H3 candidate objects: 5781
H3 candidate expressions: 18614

Null split expressions: 1028 (5.02%)
Audio-keyword expressions: 15890 (77.67%)
Spatial-keyword expressions: 5924 (28.96%)
Same-category distractor heuristic expressions: 2563 (12.53%)
Small-target expressions: 10037
Partial-target expressions: 33
Area-unstable expressions: 41
Late-target expressions: 0

Decision:

  • Multi-expression structure is strong.
  • H3 direct validation remains a P0 target.
  • Null modeling is feasible but needs oversampling / curriculum because Null ratio is only about 5%.
  • Small-target proposal recall is a major risk.
  • Late-target subset is not useful under the current GT visibility definition.

SimToken Reproduction

Reproduced results:

test_seen:
  mIoU = 0.7189123889
  F    = 0.8113823722
  J&F  = 0.7651473806

test_unseen:
  mIoU = 0.6996124670
  F    = 0.7915967433
  J&F  = 0.7456046051

test_n:
  S = 0.0117917573

Paper/report result:

Seen:   J 72.0, F 81.3, J&F 76.7
Unseen: J 69.8, F 79.1, J&F 74.5
Mix:    J 70.9, F 80.2, J&F 75.6
Null S: 0.012

Decision:

  • SimToken reproduction passes Phase -1.
  • Difference from the report is far below the 1.5 J&F pause threshold.
  • Later Go/No-Go thresholds should use reproduced SimToken as the reference.

Working Phase 0 reference:

SimToken seen J&F   = 0.7651
SimToken unseen J&F = 0.7456
Seen/unseen average = 0.7554
Target Oracle Tube J&F for green light ~= 0.8054

Phase 0 Proposal Experiments

Implementation Notes

Scripts added:

tools/tubetoken/phase0_common.py
tools/tubetoken/generate_sam2_proposals.py
tools/tubetoken/evaluate_phase0_proposals.py
tools/tubetoken/evaluate_oracle_refine_sam2.py

SAM2 proposal generation uses:

  • SAM2 automatic mask generation on keyframes.
  • SAM2 video propagation to form tubes.
  • Cache format: one .npz per video with masks, scores, keyframes, and boxes_xyxy.

Important implementation correction:

  • Initial unidirectional propagation was invalid for Phase 0 because proposals from later keyframes were not truly propagated backward.
  • Bidirectional propagation was added.
  • Group-by-keyframe propagation was tested but performed slightly worse than shared-state bidirectional propagation on smoke evaluation.

Smoke Results

Unidirectional Smoke, stride=8, N=128, 5 videos

Result:

all:    R@16=0.800, R@32=0.900, R@64=1.000, R@128=1.000, Oracle J&F=0.9577
small:  R@16=1.000, R@32=1.000, R@64=1.000, R@128=1.000, Oracle J&F=0.9798
test_s: R@16=0.700, R@32=0.850, R@64=1.000, R@128=1.000, Oracle J&F=0.9743
test_u: R@16=1.000, R@32=1.000, R@64=1.000, R@128=1.000, Oracle J&F=0.9244

Interpretation:

  • Code path worked, but the sample was too small and optimistic.

Shared-state Bidirectional Smoke, stride=8, N=64, 30 videos

Result:

all:             n=163, R@16=0.718, R@32=0.883, R@64=0.951, Oracle J&F=0.9080, miss=4.91%
audio_keyword:   n=130, R@16=0.738, R@32=0.923, R@64=0.977, Oracle J&F=0.9214, miss=2.31%
h3_candidate:    n=163, R@16=0.718, R@32=0.883, R@64=0.951, Oracle J&F=0.9080, miss=4.91%
small:           n=51,  R@16=0.647, R@32=0.882, R@64=1.000, Oracle J&F=0.9654, miss=0.00%
spatial_keyword: n=14,  R@16=0.500, R@32=0.929, R@64=1.000, Oracle J&F=0.9106, miss=0.00%
test_s:          n=43,  R@16=0.628, R@32=0.698, R@64=0.814, Oracle J&F=0.8409, miss=18.60%
test_u:          n=120, R@16=0.750, R@32=0.950, R@64=1.000, Oracle J&F=0.9321, miss=0.00%

Interpretation:

  • Bidirectional propagation fixed the small smoke behavior.
  • However, test_s remained much weaker than test_u.
  • Full validation was required before making a Phase 0 decision.

Group-by-keyframe Bidirectional Smoke, stride=8, N=64, 30 videos

Result:

all:             n=163, R@16=0.718, R@32=0.847, R@64=0.914, Oracle J&F=0.9024, miss=8.59%
audio_keyword:   n=130, R@16=0.738, R@32=0.877, R@64=0.931, Oracle J&F=0.9138, miss=6.92%
h3_candidate:    n=163, R@16=0.718, R@32=0.847, R@64=0.914, Oracle J&F=0.9024, miss=8.59%
small:           n=51,  R@16=0.647, R@32=0.882, R@64=1.000, Oracle J&F=0.9695, miss=0.00%
spatial_keyword: n=14,  R@16=0.500, R@32=0.929, R@64=1.000, Oracle J&F=0.8945, miss=0.00%
test_s:          n=43,  R@16=0.628, R@32=0.698, R@64=0.814, Oracle J&F=0.8416, miss=18.60%
test_u:          n=120, R@16=0.750, R@32=0.900, R@64=0.950, Oracle J&F=0.9241, miss=5.00%

Decision:

  • Group-by-keyframe is worse than shared-state bidirectional for recall.
  • Use shared-state bidirectional as the current best SAM2 propagation setting.

Full Results: stride=8, N=64

Full shared-state bidirectional result:

all:             n=3944, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7491, miss=24.62%
area_unstable:   n=18,   R@16=0.556, R@32=0.556, R@64=0.889, Oracle J&F=0.7114, miss=11.11%
audio_keyword:   n=2844, R@16=0.475, R@32=0.610, R@64=0.766, Oracle J&F=0.7569, miss=23.42%
h3_candidate:    n=3932, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7488, miss=24.64%
partial:         n=8,    R@16=0.250, R@32=0.250, R@64=1.000, Oracle J&F=0.8123, miss=0.00%
same_category:   n=330,  R@16=0.482, R@32=0.588, R@64=0.709, Oracle J&F=0.7261, miss=29.09%
small:           n=1631, R@16=0.237, R@32=0.392, R@64=0.633, Oracle J&F=0.6367, miss=36.73%
spatial_keyword: n=965,  R@16=0.331, R@32=0.476, R@64=0.658, Oracle J&F=0.6714, miss=34.20%
test_s:          n=2288, R@16=0.326, R@32=0.483, R@64=0.657, Oracle J&F=0.6674, miss=34.27%
test_u:          n=1656, R@16=0.665, R@32=0.755, R@64=0.887, Oracle J&F=0.8618, miss=11.29%

Decision:

  • stride=8, N=64 is a Phase 0 red-light configuration.
  • It fails the v4 Go/No-Go criteria:
    • Overall Recall@32 is below 85%.
    • Overall Recall@64 is below 80%.
    • Small-target Recall@32 is far below 70%.
    • Oracle Tube J&F is below the target SimToken + 5.
    • test_s Oracle J&F is far below reproduced SimToken seen J&F.
  • Do not proceed to TubeToken-Minimal with this proposal cache.

Main bottleneck:

  • Proposal recall, especially for test_s, small targets, and spatial expressions.
  • Bidirectional propagation does not solve the full-set miss problem, so the problem is likely candidate generation / ranking / keyframe coverage, not just temporal direction.

Phase 0 Completed Results

stride=8, N=128 Full Evaluation (Yellow Light)

Completed 2026-04-26. Proposal directory: proposals_stride8_n128_miss (all 542 test_s+test_u videos, N=128).

all:             n=3944, R@16=0.469, R@32=0.597, R@64=0.754, R@128=0.867, Oracle J&F=0.8407, miss=13.31%
audio_keyword:   n=2844, R@16=0.475, R@32=0.610, R@64=0.766, R@128=0.870, Oracle J&F=0.8445, miss=12.97%
small:           n=1631, R@16=0.237, R@32=0.392, R@64=0.633, R@128=0.821, Oracle J&F=0.7942, miss=17.90%
spatial_keyword: n=965,  R@16=0.331, R@32=0.476, R@64=0.658, R@128=0.804, Oracle J&F=0.7902, miss=19.59%
test_s:          n=2288, R@16=0.326, R@32=0.483, R@64=0.657, R@128=0.813, Oracle J&F=0.7941, miss=18.71%
test_u:          n=1656, R@16=0.665, R@32=0.755, R@64=0.887, R@128=0.941, Oracle J&F=0.9052, miss=5.86%

Go/No-Go decision: Yellow Light (条件绿灯)

条件 阈值 当前值 状态
Oracle Tube J&F (all) ≥ SimToken均值+5% ≈ 0.8054 0.8407
test_s Oracle J&F ≥ SimToken seen 0.7651 0.7941
test_s R@128 (修订条件) ≥ 0.75 0.813

注:R@32 原始条件(≥85%)未达标(0.597),但该条件是为 N=64 语境设计的,在 N=128 运行时以 R@128 替代。13.31% miss 是生成瓶颈,增加 N 无法解决,需 stride=4。

额外完成工作:

  • 分层评估子集 eval_subset_150.txt(156 个视频,覆盖 6 个分层)
  • CLIP 文本特征预计算:data/text_embed/(19395 个文件,768-dim)
  • TubeToken-Minimal 框架骨架:datasets/dataset_tubetoken.py, models/tubetoken_minimal.py, train_tubetoken.py(smoke test 通过)

EC-SimToken v1(已完成 — 诊断失败)

完成于 2026-04-27。训练 5 epoch,batch_size=12,null_aug_prob=0.25,exist_loss_weight=1.0。 Checkpoint: checkpoints/ec_simtoken/ec_simtoken_v1_ep5.pth

分割指标(与 SimToken 基线对比)

Split mIoU F J&F SimToken J&F
test_s 0.7062 0.8003 0.7533 0.7651
test_u 0.6855 0.7844 0.7350 0.7456

分割能力略低于 SimToken(test_s -1.18pt,test_u -1.06pt),5 epoch fine-tune 未造成崩溃但未带来改善。

Existence head 指标(ep5,threshold=0.50)

── p_exist distribution ─────────────────────────────────────
split           n   mean    med    p10    p25    p75    p90    min    max
test_s(+)    2288  0.850  0.910  0.648  0.812  0.957  0.977  0.005  0.996
test_u(+)    1656  0.839  0.914  0.598  0.793  0.957  0.977  0.018  0.992
test_n(null) 1028  0.889  0.953  0.792  0.910  0.969  0.980  0.000  0.992

AUC-ROC (null vs positive): 0.3605
test_n null_tp=53/1028 (5.2%)  Null_S=0.0100

Existence loss 轨迹

Epoch mean exist_loss 范围
1 0.6218 0.82 → 0.54
2 0.3770 0.40 → 0.34
3 0.2860 0.33 → 0.27
4 0.2383 0.31 → 0.24
5 0.2351 0.24 → 0.23

失败诊断

exist_loss 确实收敛(0.82 → 0.23),说明 existence head 在训练集上学会了某个任务。但 AUC=0.36 < 0.5(随机),且 null 的 p_exist 均值(0.889)高于正样本(0.839-0.850),方向完全反转。

根本原因:训练-测试分布不匹配

训练 null(合成) test_n(真实)
构造方式 随机 audio swap 目标真实不在视频中
音频特征 与视频完全不匹配(随机) 语义连贯,只是目标不可见
模型反应 seg_embedding 混乱,head 可检测 seg_embedding "自信但错误",head 无法区分

existence head 学会了检测 audio-swap 造成的 embedding 异常,而非真实目标缺失。threshold sweep 无意义(分布顺序已反转)。

决策: EC-SimToken v1 定性为诊断实验,不作为论文主表强 baseline。不继续调参(调 threshold / loss weight / null_aug_prob 均无法修复分布错配)。保留 J&F 结果供参考,existence head 结论不对外声称有效。

Pending Experiments (Deferred)

Experiment B: stride=4, N=128

状态: 进行中(已中断,可续跑)。已完成 227/542 个视频(41.9%),生成速度约 44s/video。 目标: 验证更密关键帧能否将 test_s miss% 从 18.71% 进一步降低。 Proposal 目录: runs/tubetoken_phase0/proposals_stride4_n128(中断后 NPZ 文件保留,续跑自动跳过已完成视频) 实际耗时: stride=4 比 stride=8 慢约 3.4×(4 个 keyframe vs 3 个 + 更大 propagation state)。单进程全集约 6-7h;2-shard 并行约 3.5h。

Step 1: 续跑生成 proposals(2-shard 并行,在两个终端同时启动)

# Terminal 1 (shard 0)
cd /workspace/SimToken
python tools/tubetoken/generate_sam2_proposals.py \
  --data_dir   /workspace/SimToken/data \
  --out_dir    /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
  --splits     test_s,test_u \
  --sam2_repo  /workspace/sam2 \
  --model_cfg  configs/sam2.1/sam2.1_hiera_l.yaml \
  --checkpoint /workspace/sam2/checkpoints/sam2.1_hiera_large.pt \
  --stride 4 --max_tubes 128 \
  --device cuda --amp_dtype bf16 \
  --quiet_sam2 --no_group_by_keyframe \
  --num_shards 2 --shard_id 0 \
  2>&1 | tee runs/tubetoken_phase0/proposals_stride4_n128_s0.log

# Terminal 2 (shard 1)
cd /workspace/SimToken
python tools/tubetoken/generate_sam2_proposals.py \
  --data_dir   /workspace/SimToken/data \
  --out_dir    /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
  --splits     test_s,test_u \
  --sam2_repo  /workspace/sam2 \
  --model_cfg  configs/sam2.1/sam2.1_hiera_l.yaml \
  --checkpoint /workspace/sam2/checkpoints/sam2.1_hiera_large.pt \
  --stride 4 --max_tubes 128 \
  --device cuda --amp_dtype bf16 \
  --quiet_sam2 --no_group_by_keyframe \
  --num_shards 2 --shard_id 1 \
  2>&1 | tee runs/tubetoken_phase0/proposals_stride4_n128_s1.log

Step 2: 子集快速评估(生成完成后约 5 分钟)

mkdir -p runs/tubetoken_phase0/eval_stride4_n128_subset

python tools/tubetoken/evaluate_phase0_proposals.py \
  --data_dir /workspace/SimToken/data \
  --proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
  --out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_subset \
  --audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \
  --splits test_s,test_u \
  --video_list /workspace/SimToken/runs/tubetoken_phase0/eval_subset_150.txt \
  --recall_ns 16,32,64,128 \
  2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_subset.log

Step 3: 全集评估(子集通过后)

mkdir -p runs/tubetoken_phase0/eval_stride4_n128_full

python tools/tubetoken/evaluate_phase0_proposals.py \
  --data_dir /workspace/SimToken/data \
  --proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
  --out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_full \
  --audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \
  --splits test_s,test_u \
  --recall_ns 16,32,64,128 \
  2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_full.log

决策规则(来自实验建议)

子集 test_s Oracle J&F 含义 对 Milestone 2 影响
≥ 0.77 绿灯候选,触发全集确认 若全集通过,切换 backend 为 stride=4
0.72–0.77 边际改善 保持 stride=8,N=128,不调整
< 0.72 生成瓶颈深于关键帧密度 保持 stride=8,N=128,不再追求绿灯

EC-SimToken v2(待设计)

状态: 暂缓。等待 Experiment B 完成后,视 TubeToken 主线进度再决定是否启动。 前提: v1 失败根因已定位(见下方 Phase 0 Completed Results),v2 需改用 in-distribution null 样本。 方向: cross-video query swap(同类别过滤)或直接使用 train_n split(如数据集提供)。


TubeToken-Minimal 训练 proposals (Train Split)

状态: 待执行,依赖 stride=4 完成后排队。 预计耗时: 2767 个 train 视频 × ~15s = 约 12 小时。

mkdir -p runs/tubetoken_phase0/proposals_stride8_n128_train

python tools/tubetoken/generate_sam2_proposals.py \
  --data_dir /workspace/SimToken/data \
  --out_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride8_n128_train \
  --splits train \
  --sam2_repo /workspace/sam2 \
  --model_cfg configs/sam2.1/sam2.1_hiera_l.yaml \
  --checkpoint /workspace/sam2/checkpoints/sam2.1_hiera_large.pt \
  --stride 8 --max_tubes 128 \
  --device cuda --amp_dtype bf16 \
  --quiet_sam2 --no_group_by_keyframe \
  2>&1 | tee runs/tubetoken_phase0/proposals_stride8_n128_train.log

Next Experiment (Active)

Experiment B: stride=4, N=128(续跑 + 评估)

当前状态: 227/542 NPZ 已完成,中断。续跑命令见 Pending Experiments → Experiment B。

Step 1: 续跑生成(见 Pending Experiments 中的 2-shard 命令,剩余约 315 个视频,2-shard 约 2-2.5h)

Step 2: 子集评估(生成完成后,约 5 分钟)

mkdir -p runs/tubetoken_phase0/eval_stride4_n128_subset

python tools/tubetoken/evaluate_phase0_proposals.py \
  --data_dir     /workspace/SimToken/data \
  --proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
  --out_dir      /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_subset \
  --audit_csv    /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \
  --splits       test_s,test_u \
  --video_list   /workspace/SimToken/runs/tubetoken_phase0/eval_subset_150.txt \
  --recall_ns    16,32,64,128 \
  2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_subset.log

Step 3: 全集评估(子集 test_s Oracle J&F ≥ 0.77 时执行)

mkdir -p runs/tubetoken_phase0/eval_stride4_n128_full

python tools/tubetoken/evaluate_phase0_proposals.py \
  --data_dir     /workspace/SimToken/data \
  --proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride4_n128 \
  --out_dir      /workspace/SimToken/runs/tubetoken_phase0/eval_stride4_n128_full \
  --audit_csv    /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \
  --splits       test_s,test_u \
  --recall_ns    16,32,64,128 \
  2>&1 | tee runs/tubetoken_phase0/eval_stride4_n128_full.log

决策规则

子集 test_s Oracle J&F 结论 后续
≥ 0.77 绿灯候选 跑全集;若全集通过,切换 backend 为 stride=4
0.72–0.77 边际改善 保持 stride=8 N=128,不调整 backend
< 0.72 关键帧密度不是主因 停止 stride 探索,TubeToken-Minimal 用 stride=8

全集绿灯标准(与 stride=8 对比)

指标 stride=8 N=128 期望 stride=4
test_s R@128 0.813 明显提升
test_s miss% 18.71% 明显下降
small R@128 0.821 提升
all Oracle J&F 0.8407 维持或提升
test_s Oracle J&F 0.7941 维持或提升