yfan07 committed on Apr 25

Commit

e214bf0

verified ·

1 Parent(s): d5c375d

Clean experimental files and restore original SimToken layout


Files changed (38)

Residual_Prompt_Bridge.md +0 -501
SEG_LTPO_results.md +0 -488
analyze_d2_csv.py +0 -239
build_rpb_dev_manifest.py +0 -71
cache_q_features.py +0 -125
cache_q_smoke/test_s/000000.pt +0 -3
cache_q_smoke/test_s/index.jsonl +0 -1
checkpoints/rpb_dev_mixed_pm_only_a018_wm005.pth +0 -3
checkpoints/rpb_dev_pm_only_a018.pth +0 -3
checkpoints/rpb_probe_eval_directional_pm_only_a02.pth +0 -3
d2_basic.py +0 -340
d2_llm_space.py +0 -314
decoder_invariance_check.py +0 -256
dev_subsets_rpb_v1.json +0 -620
log/rpb_dev_eval_baseline_step0.txt +0 -5
log/rpb_dev_eval_pm_only_a02_step0.txt +0 -7
log/rpb_dev_mixed_pm_only_a015_wm005.txt +0 -11
log/rpb_dev_mixed_pm_only_a018_wm005.txt +0 -11
log/rpb_dev_pm_only_a012.txt +0 -11
log/rpb_dev_pm_only_a015.txt +0 -11
log/rpb_dev_pm_only_a018.txt +0 -11
log/rpb_dev_qonly_pm_only_a018.txt +0 -11
log/rpb_e1_baseline.txt +0 -5
log/rpb_e4_min.txt +0 -16
log/rpb_e4_min_v2.txt +0 -11
log/rpb_probe_a1_teacher_only.txt +0 -22
log/rpb_probe_a1_teacher_only_v2.txt +0 -22
log/rpb_probe_a1p_directional_pm_only.txt +0 -22
log/rpb_probe_a1p_directional_pm_only_a02.txt +0 -22
log/rpb_probe_eval_directional_pm_only_a02.txt +0 -11
log/rpb_probe_eval_directional_pm_only_a02_step0.txt +0 -7
log/rpb_probe_mixed_pm_only_a02_wm005_s80.txt +0 -11
seg_ltpo.py +0 -1372
setup_simtoken.md +0 -163
simtoken_experiment.md +0 -369
target_frame_sweep.py +0 -265
train_cached_gate.py +0 -439
upload_hf.py +0 -74

Residual_Prompt_Bridge.md DELETED

@@ -1,501 +0,0 @@
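-# Residual Prompt Bridge: Paper-Oriented Experiment Roadmap
-## 1. Current main claim
-The paper's main claim is now formally locked as:
-> **We propose an image-conditioned directional prompt correction module that orthogonalizes prompt updates to steer language-side prompts toward a more decodable SAM prompt manifold, mitigating cross-distribution prompt interface mismatch.**
-Equivalent statement (translated from the Chinese wording):
-> **We propose an image-conditioned, directional prompt correction that uses orthogonalized updates to deflect language-side prompts toward a more decodable SAM prompt manifold, thereby mitigating cross-distribution prompt interface mismatch.**
-From now on, every experiment serves only this claim; the method story must not sprawl into an "all-in-one system".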
----
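-## 2. Current project status
-The RPB project has completed the most critical early screening:
-1. **Implementation correctness: passed**
-   - checkpoint / LoRA compatibility issues are fixed
-   - the bridge path does not automatically break the baseline
-   - the identity-preserving sanity check has passed
-2. **The geometric mechanism points in a clear direction**
-   - an additive residual is not enough to push `p_hat` away from `q`
-   - the directional bridge is clearly better than the additive one
-   - orthogonalization shifts the residual budget from radial scaling to directional correction
-3. **The minimal core has emerged**
-   - `image-conditioned`
-   - `p_mask-only`
-   - `directional`
-   - `orthogonal`
-   - `single-token correction`
-4. **The role of mixed training is still undetermined**
-   - weak mixed does not erase the bridge
-   - but for now it looks more like an enhancer / compatibility probe than a stable decoder-facing calibration mechanism
-So the priority now is not to add more modules, but to turn this **minimal effective core** into a stable, reproducible, submission-ready method skeleton.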
----
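-## 3. Two criteria: Mechanism Pass vs Paper Pass
-### 3.1 Mechanism pass
-The question it answers:
-> Does the method design actually capture the essence of the problem?
-The mechanism pass currently needs to be supported by the following evidence:
-- additive vs directional: directional is clearly better at moving `p_hat` away from identity
-- without orthogonal vs with orthogonal: orthogonalization clearly improves how efficiently `Δp` uses its geometric budget
-- `Δp` consistently points toward `p_mask`
-- `p_hat` moves clearly away from `q`
-- the seen/unseen alignment ratio is healthy
-- weak mixed does not pull the bridge straight back to the baseline
-### 3.2 Paper pass
-The question it answers:
-> Is the method already strong enough to carry a top-venue method paper on its own?
-The paper pass requires these stronger conditions:
-- a stable, same-direction headline trend on larger-scale evaluation
-- a clear, reproducible advantage on at least the unseen split
-- an acceptable cost on seen / null
-- a stable trend across 2 random seeds
-- a complete minimal closed-loop ablation
-Current status:
-- **mechanism pass: close to passing, but still missing larger-scale validation and key baselines**
-- **paper pass: not yet passed**
-Every subsequent group of experiments must state explicitly whether it advances the mechanism pass or the paper pass.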
----
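-## 4. Freezing the minimal core method
-On the pure RPB standalone route, only the following components are kept:
-- `image-conditioned correction`
-- `p_mask-only teacher`
-- `directional bridge`
-- `orthogonalized update`
-- `single-token prompt correction`
-Explicitly **excluded from the main line** for now:
-- `z_gt` as the main teacher
-- calibrator
-- refinement
-- multi-token bridge
-- a full, all-in-one bridge system
-These can later serve at most as ablations, extensions, or hybrid components, not as the main method itself.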
----
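-## 5. Summary of current experimental facts
-### 5.1 Confirmed positive results
-- the bridge can be plugged in safely without automatically breaking the baseline
-- after the checkpoint / LoRA fix, the RPB path is essentially equivalent to the baseline
-- with `directional + orthogonal`:
-  - `Δp` is highly aligned with `p_mask`
-  - `Δp` no longer wastes most of its budget along the direction parallel to `q`
-  - `p_hat` can move clearly out of the identity region
-- `p_mask-only teacher-only` on quick eval already shows:
-  - a small but controlled drop on seen
-  - a slight positive signal on unseen
-  - null essentially unchanged
-### 5.2 Confirmed negative results
-- an additive residual is not enough to actually rotate the prompt
-- `L_mask` is not the main bottleneck at this early stage
-- `z_gt` is currently not the main teacher for the sparse bridge
-- weak mixed cannot yet reliably pull seen back to the baseline
-### 5.3 Most important working hypothesis
-> `p_mask-only + image-conditioned + directional + orthogonal` has already captured the main problem, but a more stable operating point still has to be found, and the headline trend has to be shown not to be noise.
-### 5.4 Fixed dev, stage A: current record
-Fixed dev subsets: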
-- `test_s`: 200 samples
-- `test_u`: 200 samples
-- `test_n`: 200 samples
-- manifest: `/workspace/SimToken/dev_subsets_rpb_v1.json`
-#### Fixed dev baseline
-| Setting | Seen mIoU | Seen F | Unseen mIoU | Unseen F | Null |
-|---|---:|---:|---:|---:|---:|
-| baseline | 0.72554 | 0.81811 | 0.68531 | 0.77238 | 0.01452 |
-#### Teacher-only alpha search
-| Setting | Seen mIoU | Seen F | Unseen mIoU | Unseen F | Null | Seen cos(p_hat,p_mask) | Unseen cos(p_hat,p_mask) | Mechanism assessment |
-|---|---:|---:|---:|---:|---:|---:|---:|---|
-| image, alpha=0.20 | 0.72517 | 0.81376 | 0.68596 | 0.77730 | 0.01426 | 0.09502 | 0.06611 | Strongest mechanism, at a cost in Seen F |
-| image, alpha=0.18 | 0.72692 | 0.81705 | 0.68595 | 0.77354 | 0.01448 | 0.02873 | 0.00605 | Good performance balance, weaker mechanism |
-| image, alpha=0.15 | 0.72669 | 0.81725 | 0.68569 | 0.77330 | 0.01448 | 0.02373 | 0.00282 | Closer to identity |
-| image, alpha=0.12 | 0.72651 | 0.81748 | 0.68578 | 0.77314 | 0.01449 | 0.01871 | -0.00046 | Light-perturbation regime, weakest mechanism |
-Stage A teacher-only conclusions:
-- `alpha=0.20` is the mechanism candidate: it clearly changes the prompt geometry.
-- `alpha=0.18` is the performance-balance candidate: seen / unseen / null are all more stable.
-- `alpha=0.12/0.15` are already too close to identity to serve as primary mechanism evidence.
-#### Weak mixed local validation
-| Setting | Seen mIoU | Seen F | Unseen mIoU | Unseen F | Null | Seen cos(p_hat,p_mask) | Unseen cos(p_hat,p_mask) | Role assessment |
-|---|---:|---:|---:|---:|---:|---:|---:|---|
-| image, alpha=0.18, weak mixed | 0.72704 | 0.81554 | 0.68706 | 0.77454 | 0.01451 | 0.04079 | 0.01325 | Current best performance-balance candidate |
-| image, alpha=0.15, weak mixed | 0.72684 | 0.81607 | 0.68674 | 0.77419 | 0.01451 | 0.03382 | 0.00882 | Stable but slightly weaker than alpha=0.18 mixed |
-Current weak mixed conclusions:
-- weak mixed did not pull the bridge back to identity.
-- for both `alpha=0.15/0.18`, weak mixed acts more like a mild enhancement than a destructive pullback.
-- `alpha=0.18 + weak mixed` is the current best operating point on fixed dev.
-#### q-only directional baseline
-| Setting | Seen mIoU | Seen F | Unseen mIoU | Unseen F | Null | Seen cos(p_hat,p_mask) | Unseen cos(p_hat,p_mask) | Assessment |
-|---|---:|---:|---:|---:|---:|---:|---:|---|
-| q-only, alpha=0.18 | 0.72311 | 0.81206 | 0.68289 | 0.77666 | 0.01424 | 0.12061 | 0.09598 | Stronger alignment but worse mIoU |
-q-only conclusions:
-- the directional / orthogonal mechanism is strong by itself: q-only also raises teacher alignment substantially.
-- q-only prompt steering is more aggressive, with a higher `gate_mean` and a larger `delta_norm`.
-- q-only mIoU is lower than the image-conditioned candidate on both seen and unseen.
-- current evidence supports this reading: the value of image conditioning is not simply a higher teacher cosine, but constraining the directional correction so that prompt steering and decoder compatibility are better balanced.
-#### Stage A current candidate
-Current best candidate on fixed dev:
-> **image-conditioned + p_mask-only + directional + orthogonal + alpha=0.18 + weak mixed**
-Corresponding checkpoint:
-> `/workspace/SimToken/checkpoints/rpb_dev_mixed_pm_only_a018_wm005.pth`
----
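-## 6. Experimental discipline: stop freely tuning on test
-Starting from the next stage, a **dev tuning subset** must be frozen; alpha and mixed settings must no longer be tuned freely on `test_s/test_u/test_n`.
-Fix immediately:
-- `dev_seen`
-- `dev_unseen`
-- `dev_null`
-Each split can start with `100` or `200` samples. After that:
-- alpha selection
-- mixed selection
-- warm-start configuration
-- early stopping
-are all done on dev only.
-The real test splits are used only for a later one-off confirmation and the final tables.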
----
-## 7. Three-stage roadmap
-## Stage A: lock the operating point of the minimal core
-### Goal
-Answer:
-> Can the current minimal core reach a stable, acceptable performance-geometry balance on a larger quick eval?
-### Only two kinds of experiments in this stage
-#### A1. Teacher-only operating point search
-Fixed:
-- image-conditioned
-- `p_mask-only`
-- directional
-- orthogonal
-- single-token
-- no `z_gt`
-- no calibrator
-- no refinement
-Sweep only:
-- `alpha = 0.12, 0.15, 0.18, 0.20`
-Current judgment: `0.20` is already a promising pass, so there is no need to explore larger alpha values.
-#### A2. Weak mixed local validation
-Warm-start only from the best teacher-only checkpoint; no large sweep.
-Test only:
-- `best_alpha`
-- `best_alpha - 0.03`
-and two very weak mask strengths:
-- `λ_mask = 0.05`
-- `λ_mask = 0.10`
-The goal of mixed is not to raise scores, but to determine which role it actually plays:
-- calibration
-- enhancement
-- or destructive pullback
-### Stage A key metrics
-Geometric metrics:
-- `cos(p_hat, p_mask)_seen`
-- `cos(p_hat, p_mask)_unseen`
-- `cos(p_hat, q)`
-- `cos(Δp, p_mask)`
-- `cos(Δp, q)`
-- `align_ratio = cos_u / cos_s`
-Performance metrics:
-- `mIoU_seen`
-- `mIoU_unseen`
-- `Fscore_seen`
-- `Fscore_unseen`
-- `Null metric`
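The geometric metrics listed above could be computed along the following lines. This is only a sketch: the function names, the dictionary keys, and the toy vectors are hypothetical, and it assumes `Δp = p_hat - q` and that `align_ratio` is taken over split-level mean cosines, which the roadmap implies but does not state in detail.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def geometry_metrics(q, p_hat, p_mask):
    """Per-sample geometry metrics for one prompt token (assumes dp = p_hat - q)."""
    delta_p = [ph - x for ph, x in zip(p_hat, q)]
    return {
        "cos(p_hat, p_mask)": cosine(p_hat, p_mask),
        "cos(p_hat, q)": cosine(p_hat, q),
        "cos(dp, p_mask)": cosine(delta_p, p_mask),
        "cos(dp, q)": cosine(delta_p, q),
    }

def align_ratio(cos_seen, cos_unseen):
    """align_ratio = cos_u / cos_s, using split-level mean cosines."""
    return cos_unseen / cos_seen

# Toy 2-d vectors, purely illustrative.
m = geometry_metrics(q=[1.0, 0.0], p_hat=[1.0, 0.2], p_mask=[0.0, 1.0])
print(m)
# Fixed-dev cosines reported for alpha=0.18 + weak mixed (seen 0.04079, unseen 0.01325).
print(align_ratio(0.04079, 0.01325))
```

A ratio well below 1 means the correction aligns with the teacher less on unseen than on seen; where the "healthy" threshold lies is not defined in the roadmap.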
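-### Stage A pass criteria
-If, on dev or a larger quick eval, a stable point can be found that satisfies:
-- unseen is consistently no worse than baseline, ideally slightly better
-- the cost on seen is controlled
-- null is essentially unchanged or the cost is acceptable
-- `cos(p_hat, p_mask)` clearly leaves the identity region
-- the seen/unseen alignment ratio is healthy
-then stage A passes.
-### Stage A stop conditions
-If, after completing:
-1. local alpha search
-2. local weak mixed search
-3. 100 / 200-sample quick eval
-any of the following still holds, stop the pure RPB standalone main line:
-- no stable, same-direction unseen advantage on the larger quick eval
-- the seen/unseen tradeoff is highly sensitive to alpha
-- the null cost cannot be brought down to near baseline
-- mixed remains only an enhancer rather than a decoder-facing calibration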
----
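-## Stage B: minimal closed-loop ablation
-Stage B starts only after stage A passes.
-### Goal
-Make the method skeleton self-consistent and build closed-loop evidence for the mechanism pass.
-### The 4 required key ablations
-1. **additive vs directional**
-2. **directional without orthogonalization vs with orthogonalization**
-3. **q-only directional vs image-conditioned directional**
-4. **`p_mask-only` vs `p_mask + weak z_gt`**
-These 4 are enough to support the method argument; no further trick ablations will be added.
-### Additional requirements for stage B
-- at least 2 random-seed repeats
-- at least one larger-scale validation
-- establish geometry-performance coupling:
-  - how much the prompt geometry is rewritten
-  - how that relates to seen/unseen performance
-  - how that relates to collapse back toward identity
-### Stage B stop conditions
-If, after completing:
-1. local alpha search
-2. local weak mixed search
-3. 100 / 200-sample quick eval
-4. at least one larger-scale validation
-5. 2 random-seed repeats
-any of the following still holds, stop pure RPB standalone:
-- no stable, same-direction unseen advantage on the large subset / full split
-- the best point depends heavily on seed or alpha, and the trend is unstable
-- the null cost cannot be controlled
-- mixed cannot provide a stable calibration effect
-- the headline result is still only a very weak fluctuation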
----
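-## Stage C: decide the paper positioning
-### Route 1: pure RPB standalone
-If the following hold:
-- a stable unseen gain on larger evaluation
-- an acceptable cost on seen / null
-- stable across 2 seeds
-- a complete minimal closed-loop ablation
-then go with:
-> **pure RPB method paper**
-### Route 2: RPB + TTO hybrid
-If instead:
-- the mechanism holds
-- but the paper pass is not solid enough
-- the headline result is still weak or unstable
-then switch positioning immediately to:
-> **RPB + TTO hybrid method paper**
-In that case RPB is no longer the standalone main method, but rather:
-- an amortized prompt corrector
-- a front-end module that improves the starting point for test-time refinement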
----
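-## 8. The hybrid route as an explicit Plan B
-If pure RPB can ultimately only achieve:
-- a stable small gain on unseen
-- a small drop on seen
-- null unchanged or slightly better
-then a standalone top-venue paper will be a hard sell.
-But RPB is still valuable as a front-end prompt corrector:
-- it improves the geometry of the initial `q`
-- it gives q-LTPO / selective refinement a better initialization
-- it reduces the number of steps and the instability of test-time optimization
-The hybrid paper narrative can be stated explicitly as:
-1. train-time: amortized interface correction
-2. test-time: instance-specific prompt refinement
-3. combined: addresses global interface mismatch and per-sample refinement at the same time
-Current judgment: hybrid is a very strong Plan B, not a stopgap rescue route.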
----
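-## 9. How negative results feed into the paper's argument
-A clear "design convergence chain" has already emerged and can be carried over directly into the paper's method argument:
-### Why not an additive residual
-Because under the additive form:
-- `Δp` mainly fights the component parallel to `q`
-- the teacher direction is swallowed by the large-norm `q`
-- the result looks more like scaling than rotation
-### Why directional
-Because only the directional form turns the correction into explicit control of the prompt direction rather than a numerical perturbation.
-### Why orthogonal
-Because only orthogonalization keeps the residual budget from being wasted on radial scaling.
-### Why only `p_mask` is kept for now
-Because in the current sparse bridge, `p_mask` has consistently been the main teacher, and `z_gt` has not yet become a primary signal.
-### Why mixed is not a main module
-Because mixed currently looks more like a compatibility / enhancement probe than a stable calibration mechanism.
-This chain must be written out explicitly in the paper so that reviewers see the method converging step by step from diagnosis, rather than stacking modules blindly.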
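The additive / directional / orthogonal argument in this section can be illustrated with a small sketch. The roadmap does not spell out the exact update rule, so everything here is an assumption for illustration: "orthogonalization" is taken to mean removing the component of the raw update parallel to `q`, the step length is taken to be `alpha * ||q||`, and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def orthogonalize(delta, q):
    """Remove the component of delta that is parallel to q."""
    scale = dot(delta, q) / dot(q, q)
    return [d - scale * x for d, x in zip(delta, q)]

def directional_update(q, delta, alpha):
    """Assumed update: step alpha * ||q|| along the unit direction orthogonal to q."""
    perp = orthogonalize(delta, q)
    perp_norm = norm(perp)
    if perp_norm < 1e-12:          # delta is purely radial: nothing to rotate
        return list(q)
    step = alpha * norm(q) / perp_norm
    return [x + step * p for x, p in zip(q, perp)]

q = [10.0, 0.0]                    # large-norm prompt
delta = [1.0, 1.0]                 # raw residual, partly parallel to q
p_additive = [x + d for x, d in zip(q, delta)]
p_directional = directional_update(q, delta, alpha=0.18)

print(cosine(p_additive, q))       # stays close to 1 in this toy case
print(cosine(p_directional, q))    # equals 1 / sqrt(1 + alpha**2) under this rule
```

Under this assumed rule the rotation angle depends only on `alpha`, not on the norm of `q`, which is one way to read the claim that the teacher direction is no longer swallowed by a large-norm `q`.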
----
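-## 10. Most direct execution advice
-Do not branch out; follow this order strictly:
-1. **Freeze the paper's main claim now**
-2. **Switch to the fixed dev subsets now; stop freely tuning on test**
-3. **Finish stage A: operating point search for the minimal core**
-4. **Add the key baseline: q-only directional**
-5. **Run two seeds**
-6. **Then make the keep-or-drop decision for pure RPB standalone**
-The most important execution principle right now:
-> **First prove that the minimal core holds up reliably; if the headline is not strong enough, promptly repurpose it as a hybrid front end instead of making pure RPB more complex.**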
----
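-## 11. Clear conclusions for the current stage
-### Is the current direction worth continuing?
-**Yes.**
-### What should be done first?
-Not adding more modules, but:
-- finding the best operating point for teacher-only `p_mask-only directional orthogonal`
-- using very weak mixed to test whether mixed can provide calibration
-- showing on dev and on a larger quick eval that the trend is not noise
-### When should pure RPB be stopped?
-If the headline is still weak and unstable after stages A + B are complete, stop pure RPB standalone.
-### What happens after stopping?
-Switch directly to:
-> **RPB + TTO hybrid**
-This route is currently the explicit Plan B, and quite possibly the stronger path to a top-venue method paper.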

SEG_LTPO_results.md DELETED

@@ -1,488 +0,0 @@
-# SEG-LTPO: Experimental Results and Analysis
----
-## Method 1: SEG-LTPO-simple (ES-based, zeroth-order)
-### Overview
-SEG-LTPO-simple performs test-time optimization of SimToken's single semantic token **Fseg** using antithetic Evolution Strategies (ES), guided by an internal reward signal that requires no ground-truth masks.
-**Optimization loop** (T=5 steps, 4 anchor frames):
-```
-eps_t ~ N(0, σ_t² I)
-F± = F_curr ± eps_t
-F_curr = F_curr + η_t · (R+ − R−) / (2σ_t²) · eps_t
-best_F = argmax_F R(F) over all evaluated candidates
-```
-**Reward function:**
-```
-R = λ1·R_temp_feat + λ2·R_iou_pred + λ3·R_align_contrast − λ4·R_area
-  = 0.3·R_temp + 0.4·R_iou + 1.0·R_align − 0.3·R_area
-```
-- **R_align_contrast**: cosine(Fseg, z_inside) − β·cosine(Fseg, z_outside); main signal
-- **R_iou_pred**: SAM's internal mask quality head output
-- **R_temp_feat**: feature-space cosine consistency between adjacent anchor frames
-- **R_area**: average foreground ratio (degenerate-mask penalty)
-**Reward gating**: accept optimized Fseg only when R(best_F) > R(F_init) + gate_delta.
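The ES loop and reward gating described above can be sketched in plain Python as follows. This is a toy illustration rather than the code from `seg_ltpo.py`: the quadratic `toy_reward`, the constant `eta`, and the function names are hypothetical stand-ins, while the update rule, the best-candidate tracking, and the gating condition follow the description above.

```python
import random

def es_optimize(reward, f_init, sigmas, eta, gate_delta=0.0, seed=0):
    """Antithetic ES loop with best-candidate tracking and reward gating."""
    rng = random.Random(seed)
    f_curr = list(f_init)
    r_init = reward(f_init)
    best_f, best_r = list(f_init), r_init
    for sigma in sigmas:
        eps = [rng.gauss(0.0, sigma) for _ in f_curr]
        f_plus = [f + e for f, e in zip(f_curr, eps)]
        f_minus = [f - e for f, e in zip(f_curr, eps)]
        r_plus, r_minus = reward(f_plus), reward(f_minus)
        # Track the best candidate among everything evaluated so far.
        for cand, r in ((f_plus, r_plus), (f_minus, r_minus)):
            if r > best_r:
                best_f, best_r = cand, r
        # Antithetic update: F += eta * (R+ - R-) / (2 sigma^2) * eps
        coef = eta * (r_plus - r_minus) / (2.0 * sigma ** 2)
        f_curr = [f + coef * e for f, e in zip(f_curr, eps)]
    # Reward gating: keep the initial token unless the gain exceeds gate_delta.
    if best_r > r_init + gate_delta:
        return best_f, best_r
    return list(f_init), r_init

def toy_reward(f):
    # Hypothetical stand-in for R(F): peaks at f = (1, 1, 1).
    return -sum((x - 1.0) ** 2 for x in f)

f0 = [0.0, 0.0, 0.0]
f_best, r_best = es_optimize(
    toy_reward, f0, sigmas=[0.10, 0.08, 0.06, 0.04, 0.02], eta=0.05
)
print(toy_reward(f0), r_best)
```

Because of the gate, the returned reward can never fall below the initial reward, which is the property the Null-stability finding relies on.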
-### Results (Unseen split, full 1656 samples)
-| Method | mIoU | F | Δ mIoU |
-|--------|------|---|--------|
-| Baseline | 0.6989 | 0.7927 | — |
-| Best-of-2 Random | 0.7050 (subset) → 0.7030 (full) | 0.7953 | +0.0040 |
-| SEG-LTPO-simple (ES) | **0.7050** | **0.7960** | **+0.0061** |
-> Best-of-2 and LTPO-ES results at full scale confirmed in the q-LTPO evaluation run below.
-### Key Findings
-1. **Reward signal is valid**: both Best-of-2 and ES-LTPO outperform baseline, confirming R_align_contrast provides useful signal.
-2. **ES update is noisy**: in 500-sample ablation, Best-of-2 (0.7235) slightly outperformed iterative ES (0.7228), due to extremely low SNR of single-sample gradient estimation in 256d space. At full scale (1656), ES-LTPO recovers (+0.0065 vs +0.0040), but the margin over Best-of-2 is small.
-3. **Null stability**: Null S metric change negligible (+0.00025), reward gating effectively suppresses false positives.
----
-## Method 2: q-LTPO-autograd (first-order, Adam maximize)
-### Overview
-**Core insight from LTPO analysis**: optimize the variable that is *directly consumed* by the downstream module, using autograd rather than noisy zeroth-order estimation.
-**Three design decisions borrowed from original LTPO:**
-1. **Optimize q, not Fseg.** In SimToken+SAM, the token that directly enters the mask decoder's cross-attention is `q = sparse_emb = Fseg.unsqueeze(1)` (prompt encoder passes text_embeds through unchanged). We set `q = nn.Parameter(q_init)` and optimize q directly, bypassing the prompt encoder entirely. This requires no invertibility of ε_p — q_best is used directly for final inference.
-2. **Use autograd when reward is differentiable.** The mask decoder (transformer + MLP + matmul) is fully differentiable. With soft masks instead of hard thresholds, all reward terms are differentiable w.r.t. q. Adam maximize replaces the low-SNR score-function estimator.
-3. **Track best_q by task reward (no regularization), gate at the end.** λ_reg penalty is excluded from gating to avoid penalizing solutions that drifted slightly from q_init but achieved better task reward.
-**Stage 0: Gradient connectivity check (verified)**
-```
-grad_norm (step 0): 0.503070
-reward trajectory:  [0.4650, 0.4709, 0.4770, 0.4831, 0.4892]  ← strictly monotone
-gradient_connected: True
-```
-### Optimization loop
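-```python
-import torch
-from torch import nn
-# R_task, clip_to_L2_ball, q_init, lr_auto, max_drift, lambda_reg and gate_delta are defined elsewhere.
-T = 5
-q = nn.Parameter(q_init.float().detach().clone())
-optimizer = torch.optim.Adam([q], lr=lr_auto, maximize=True)
-with torch.no_grad():
-    best_q, best_reward = q_init.clone(), R_task(q_init)
-init_reward = best_reward
-for step in range(T):
-    optimizer.zero_grad()
-    R_full = R_task(q) - lambda_reg * (q - q_init).pow(2).sum()
-    R_full.backward()
-    optimizer.step()
-    clip_to_L2_ball(q, q_init, max_drift)      # hard norm constraint
-    with torch.no_grad():                      # fresh forward at the post-step q
-        reward = R_task(q)
-    if reward > best_reward:
-        best_q, best_reward = q.detach().clone(), reward
-# gating: keep best_q only if it beats the initial task reward by gate_delta
-q_final = best_q if best_reward > init_reward + gate_delta else q_init
-```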
-**Hyperparameters (auto-scaled from q_init):**
-- `lr = 0.01 × RMS(q_init)`
-- `max_drift = 0.5 × ||q_init||`
-- `λ_reg = 0.01`, `gate_delta = 0.0`
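A minimal sketch of the auto-scaling rules and the hard drift constraint, written with plain lists so it runs without PyTorch. The function names are hypothetical, and this version returns a new vector, whereas the `clip_to_L2_ball` call in the loop above appears to act in place.

```python
import math

def rms(v):
    return math.sqrt(sum(x * x for x in v) / len(v))

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def auto_hparams(q_init, lr_scale=0.01, drift_scale=0.5):
    """lr = 0.01 x RMS(q_init), max_drift = 0.5 x ||q_init||."""
    return lr_scale * rms(q_init), drift_scale * l2_norm(q_init)

def clip_to_l2_ball(q, q_init, max_drift):
    """Project q back onto the L2 ball of radius max_drift centred at q_init."""
    diff = [a - b for a, b in zip(q, q_init)]
    dist = l2_norm(diff)
    if dist <= max_drift:
        return list(q)
    scale = max_drift / dist
    return [b + scale * d for b, d in zip(q_init, diff)]

q_init = [3.0, 4.0]                          # toy token with ||q_init|| = 5
lr, max_drift = auto_hparams(q_init)
q_clipped = clip_to_l2_ball([13.0, 4.0], q_init, max_drift)
print(lr, max_drift, q_clipped)
```

Scaling both quantities by the size of `q_init` keeps the relative step size and the relative trust region the same across samples whose tokens differ in norm.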
-### Staged reward build-up
-**Stage 1** (R_iou + R_area_soft + λ_reg):
-```
-R_task = 0.6·R_iou_pred − 0.2·sigmoid(mask_logits/τ).mean()
-         where τ=5.0 (temperature to avoid sigmoid saturation)
-```
-**Stage 2** (Stage 1 + R_align_det):
-```
-R_task = 0.4·R_iou_pred + 1.0·R_align_det − 0.3·R_area_soft
-R_align_det = mean_t [ cosine(q, stopgrad(z_in^t)) − 0.5·cosine(q, stopgrad(z_out^t)) ]
-```
-z_in/z_out are stopgrad'd to avoid coupling: q first finds a mask, then moves toward the masked region's semantics.
-### Results (Unseen split)
-#### 200-sample subset (Stage 1 vs Stage 2 fair comparison, same baseline)
-| Method | mIoU | F | Δ mIoU |
-|--------|------|---|--------|
-| Baseline | 0.6749 | 0.7763 | — |
-| Best-of-2 ES | 0.6801 | 0.7803 | +0.0052 |
-| LTPO-ES | 0.6838 | 0.7826 | +0.0089 |
-| q-LTPO Stage 1 | 0.6979 | 0.7802 | +0.0230 |
-| q-LTPO Stage 2 | **0.6989** | **0.7810** | **+0.0240** |
-On 200 samples: Stage 2 marginally better than Stage 1 on both metrics.
-#### Full evaluation (Unseen, 1656 samples)
-| Method | mIoU | F | Δ mIoU vs Baseline |
-|--------|------|---|---------------------|
-| Baseline | 0.6990 | 0.7924 | — |
-| Best-of-2 ES | 0.7030 | 0.7953 | +0.0040 (+0.57%) |
-| LTPO-ES | 0.7055 | 0.7969 | +0.0065 (+0.93%) |
-| **q-LTPO Stage 1** | **0.7285** | **0.8013** | **+0.0295 (+4.22%)** |
-| q-LTPO Stage 2 | 0.7273 | 0.8002 | +0.0283 (+4.04%) |
-**Stage 1 beats Stage 2 on full eval** (opposite of 200-sample trend). R_align_det adds noise at scale: in harder Unseen samples, the initial mask quality is lower, making stopgrad z_in/z_out a less reliable target.
-### Evaluation Status (after e0 fix)
-| Split | Baseline mIoU/S | q-LTPO S1 (no e0) | q-LTPO S1 (e0) | Status |
-|-------|-----------------|-------------------|----------------|--------|
-| Unseen (1656) | 0.6990 | **0.7285** | — | Done (pre-e0) |
-| Seen (200-sample) | 0.7483 | 0.7618 (+0.0136) | **0.7634 (+0.0151)** | Quick-val done |
-| Null (200-sample, S↓) | 0.0619 | 0.0646 (+4.4%) | **0.0634 (+2.4%)** | Quick-val done |
-| Unseen (200-sample) | 0.6761 | — | **0.6929 (+0.0168)** | Quick-val done |
-| Seen (full) | — | — | — | Pending |
-| Null (full, S↓) | 0.0120 | 0.0126 (+5.0%) | — | Pending e0 run |
-| Unseen (full) | — | — | — | Pending |
----
-## Null Safety Analysis and e0-Modulated Reward
-### Root Cause: R_iou_pred is a Conditional Quality Metric
-The original q-LTPO Stage 1 reward:
-```
-R_task = 0.6·R_iou_pred − 0.2·R_area_soft
-```
-caused Null S metric degradation (+4.4% on 200-sample quick validation, +5.0% on full Null).
-**Root cause**: `R_iou_pred` is SAM's internal mask quality head — it measures *how good the mask is given that segmentation was performed*, not *whether the target exists*. On Null frames, SAM still outputs `R_iou_pred ≈ 0.73–0.74` because it confidently segments the most prominent region (even if no audio target exists). The optimizer sees positive `R_iou_pred` and expands the mask accordingly.
-**Why oracle gating approaches fail methodologically:**
-- **Path A (gate_delta threshold)**: Distribution analysis showed Null reward_gain p50 = +0.0166 ≈ Seen p50 = +0.0181. The two distributions overlap heavily; any threshold that blocks most Null samples also blocks most Seen/Unseen samples.
-- **Path B (area-based reject rule)**: Threshold 0.02 (area fraction) was derived by observing Null mean_area = 0.0094 vs Seen mean_area = 0.054 from the test distribution. This is benchmark-specific tuning = test-set overfitting. **Not a valid method.**
-Both oracle approaches are useful for diagnostic analysis only. The principled fix must be structural.
-### Principled Fix: e0-Modulated Reward
-**Key insight**: decouple *existence* from *quality*. Use the initial mask area as a proxy for the prior probability that a real target exists.
-```python
-e0 = stopgrad( sigmoid(lrm_init / area_temp).mean() )   # R_area_soft at q_init
-R_task = λ_iou · e0 · R_iou_pred  −  λ_area · R_area_soft
-```
-**Why stopgrad on e0 is critical:**
-- Without stopgrad: gradients flow through e0 → optimizer first inflates area to increase e0, then uses the higher e0 to justify larger R_iou reward ("area gaming").
-- With stopgrad: e0 is a fixed scalar from the initialization. Gradients only flow through the explicit terms `R_iou_pred` and `R_area_soft`.
-**Effect by split:**
-| Split | mean e0 | Effective λ_iou = 0.6·e0 | Behavior |
-|-------|---------|--------------------------|----------|
-| Null | 0.037 | 0.022 | Area penalty dominates → conservative |
-| Seen | 0.120 | 0.072 | Balanced optimization |
-| Unseen | 0.150 | 0.090 | Full optimization drive |
-The e0 ratio relative to Null (about 3.2× for Seen, about 4.1× for Unseen) arises naturally from the initial mask size, providing automatic split-specific optimization strength without any threshold tuning.
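The effect of e0 on the reward can be sketched without an autograd framework. In this illustration `e0` is passed in as a plain float computed once at `q_init`, which plays the role of the stopgrad; the function names and the toy logits are hypothetical, while the weights and the mean e0 values come from the text and table above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_area(mask_logits, area_temp=5.0):
    """Mean of sigmoid(logit / temperature) over all pixels."""
    return sum(sigmoid(v / area_temp) for v in mask_logits) / len(mask_logits)

def e0_modulated_reward(r_iou_pred, mask_logits, e0,
                        lambda_iou=0.6, lambda_area=0.2, area_temp=5.0):
    """R_task = lambda_iou * e0 * R_iou_pred - lambda_area * R_area_soft.

    e0 is a constant here: it is computed once at q_init and never
    recomputed from the current mask, mirroring the stopgrad.
    """
    area = soft_area(mask_logits, area_temp)
    return lambda_iou * e0 * r_iou_pred - lambda_area * area

# Toy initial mask: 9 confident background pixels, 1 confident foreground pixel.
init_logits = [-20.0] * 9 + [20.0]
e0 = soft_area(init_logits)
print(e0, e0_modulated_reward(0.8, init_logits, e0))

# Effective IoU weight per split, using the mean e0 values reported above.
for split, mean_e0 in (("Null", 0.037), ("Seen", 0.120), ("Unseen", 0.150)):
    print(split, round(0.6 * mean_e0, 3))
```

In an autograd setting the same effect would come from detaching `e0` from the graph, so that gradients reach the reward only through `R_iou_pred` and `R_area_soft`.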
-**Implementation fix also addressed (best_q tracking bug):**
-Before fix, `q_{N+1}` (post-step) was evaluated using `lrm/iou` from `q_N` (pre-step), corrupting best_q selection. Fixed by adding a fresh `no_grad` forward after each `optimizer.step()`.
-### Quick Validation Results (200 samples each, e0 modulation)
-#### Null split (S metric, lower is better)
-| Method | S metric | Δ relative |
-|--------|----------|-----------|
-| Baseline | 0.0619 | — |
-| q-LTPO S1 (no e0) | 0.0646 | +4.4% |
-| **q-LTPO S1 (e0)** | **0.0634** | **+2.4%** |
-Diagnostic stats with e0:
-```
-acceptance rate      : 1.000
-mean e0              : 0.0372
-reward_gain p10/50/90: 0.0 / 0.0000 / +0.0123   ← p50=0 means >50% of samples frozen
-mean drift           : 0.4962                    ← down from ~0.8 without e0
-area (hard) init→best: 0.0094 → 0.0098           ← minimal area expansion
-reward↑ & area+20%↑  : 0.040                     ← low Null-safety risk
-```
-#### Seen split (mIoU, higher is better)
-| Method | mIoU | F | Δ mIoU |
-|--------|------|---|--------|
-| Baseline | 0.7483 | — | — |
-| q-LTPO S1 (no e0) | 0.7618 | — | +0.0136 |
-| **q-LTPO S1 (e0)** | **0.7634** | — | **+0.0151** |
-Diagnostic stats with e0:
-```
-mean e0              : 0.1200
-reward_gain p10/50/90: +0.0026 / +0.0181 / +0.0944
-mean drift           : 0.5225
-area (hard) init→best: 0.054 → (slight increase)
-```
-#### Unseen split (mIoU, higher is better)
-| Method | mIoU | F | Δ mIoU |
-|--------|------|---|--------|
-| Baseline | 0.6761 | 0.7776 | — |
-| **q-LTPO S1 (e0)** | **0.6929** | **0.7765** | **+0.0168** |
-Diagnostic stats with e0:
-```
-acceptance rate      : 1.000
-mean e0              : 0.1506
-reward_gain p10/50/90: +0.0011 / +0.0055 / +0.0293
-mean drift           : 0.6666
-R_iou_pred init→best : 0.8029 → 0.8802
-area (hard) init→best: 0.0635 → 0.0650
-reward↑ & area+20%↑  : 0.125
-```
-### Analysis: e0 is a Pareto Improvement
-Three conditions for Pareto improvement all satisfied on quick validation:
-1. **Null safer**: degradation halved (+4.4% → +2.4%). p50 reward_gain = 0.0000, meaning >50% of Null samples produce `best_q ≈ q_init`.
-2. **Seen maintained and slightly improved**: +0.0151 vs +0.0136 without e0.
-3. **Unseen not hurt — gains even larger**: +0.0168 > Seen +0.0151. The "harder positives suppressed" failure mode did not materialize.
-**e0 hierarchy confirms split-level discriminability:**
-```
-Null (0.037)  <<  Seen (0.120)  <  Unseen (0.150)
-```
-The ordering is sensible: Null frames have small/empty initial masks → low e0. Unseen e0 slightly exceeds Seen, possibly because the model produces slightly larger (less specific) masks on novel object-sentence combinations.
-**Residual Null degradation (+2.4%) assessment**: Acceptable for now. The absolute magnitude is +0.0015 in S metric, while Seen/Unseen absolute gains are 10–11× larger. The residual originates from a small tail of Null samples where e0 is still large enough to permit some mask expansion. Further suppression (e.g., e0², sqrt(e0+ε)) risks hurting harder positives and should only be explored after full-set confirmation.
----
-## Summary and Comparison
-### Pre-e0 (original q-LTPO Stage 1, full Unseen)
-| Method | Unseen mIoU | Δ vs Baseline | Relative to ES-LTPO |
-|--------|-------------|---------------|----------------------|
-| Baseline | 0.6990 | — | — |
-| ES-LTPO | 0.7055 | +0.0065 | 1× |
-| **q-LTPO Stage 1** | **0.7285** | **+0.0295** | **4.5×** |
-### e0-Modulated Stage 1 (quick validation, 200 samples)
-| Split | Baseline | e0-Stage1 | Δ | e0 |
-|-------|----------|-----------|---|-----|
-| Null (S↓) | 0.0619 | 0.0634 | +2.4% (rel) | 0.037 |
-| Seen | 0.7483 | 0.7634 | +0.0151 | 0.120 |
-| Unseen | 0.6761 | 0.6929 | +0.0168 | 0.150 |
-q-LTPO-autograd with e0 modulation is the current primary method candidate. It achieves first-order gradient-based optimization with automatic Null-safety via the initial-area existence prior, without any test-set-derived thresholds.
----
-## Hyperparameter Configurations
-### ES-LTPO (Method 1)
-```python
-LTPOConfig(
-    T=5, num_anchors=4,
-    sigma_schedule=[0.10, 0.08, 0.06, 0.04, 0.02],
-    eta_scale=0.5,
-    lambda1=0.3, lambda2=0.4, lambda3=1.0, lambda4=0.3,
-    beta=0.5, gate_delta=0.0, trust_delta=None,
-)
-```
-### q-LTPO Stage 1 with e0 (current primary candidate)
-```python
-QLTPOConfig(
-    stage=1, T=5, num_anchors=4,
-    lr=0.0,              # auto: 0.01 × RMS(q_init)
-    max_drift=0.0,       # auto: 0.5 × ||q_init||
-    lambda_iou=0.6, lambda_area=0.2,
-    lambda_reg=0.01, area_temp=5.0,
-    gate_delta=0.0,
-    e0_modulation="identity",   # e0 = R_area_soft(q_init), stopgrad
-    e0_eps=1e-4,
-    # oracle-only fields (disabled, not used in final method):
-    null_area_threshold=0.02,
-    null_gate_delta=0.0,
-)
-```
-### Full Unseen Evaluation with e0 (1656 samples)
-| Method | mIoU | F | Δ mIoU |
-|--------|------|---|--------|
-| Baseline | 0.6990 | 0.7926 | — |
-| q-LTPO S1 (no e0) | 0.7285 | 0.8013 | +0.0295 (+4.22%) |
-| **q-LTPO S1 (e0)** | **0.7240** | **0.7985** | **+0.0250 (+3.56%)** |
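-Compared with the no-e0 version, the e0 version has slightly lower mIoU (-0.0045) but better Null safety. The relative improvement ratios of F and mIoU are roughly consistent (about 60%).
-**Full evaluation status (updated):**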
-| Split | Baseline | q-LTPO S1 (e0) | Δ | Status |
-|-------|----------|----------------|---|--------|
-| Unseen (full, 1656) | 0.6990 / 0.7926 | 0.7240 / 0.7985 | +3.56% mIoU | ✅ Done |
-| Seen (full) | — | — | — | Pending |
-| Null (full, S↓) | 0.0120 | — | — | Pending |
----
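-## Direction B: Boundary Precision Experiments (concluded; outcome: failure)
-### B-Step1: Multimask Post-Processing (complete failure)
-Replace single-mask decoding with SAM's multi-mask output (K=3), and select the best candidate using either iou_pred or a Sobel edge score.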
-| Method | mIoU | F | ΔF vs s1 |
-|--------|------|---|----------|
-| s1 (single mask) | 0.6979 | 0.8024 | — |
-| s1_mm (iou_pred selection) | 0.6979 | 0.7917 | -0.0107 |
-| s1_mm_edge (Sobel selection) | 0.5715 | 0.6820 | -0.1204 |
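-**Root cause:** SAM's internal single-mask selection is already optimal; external re-selection is worse. In the 1024×1024 normalized space, Sobel picks texture fragments rather than the semantic target and fails catastrophically.
-### B1: Asymmetric area-expansion penalty (mechanistically ineffective)
-Hypothesis: LTPO makes the mask expand into non-target regions (lower precision), so add a penalty term to suppress it.
-**Experimental conclusion: the hypothesis is wrong.** During LTPO the soft area actually decreases (-16%) rather than increases:
-```
-soft area:  0.1507 → 0.1267  (-16%)   ← background logits become more negative
-hard area:  0.0635 → 0.0650  (+2.4%)  ← actual mask region grows slightly
-```
-**"Mask sharpening" effect:** driven by R_iou_pred, Adam makes the logits more bimodal (foreground more positive, background more negative), so the soft area drops because the 93% of pixels that are background contribute less. The precondition for the B1 penalty (soft area increasing) never occurs:
-```
-B1 activation rate : 0.025   ← triggered on only 2.5% of samples
-B1 mean excess     : 0.00002 ← negligible
-```
-**Conclusion:** Direction B failed across the board, from multi-mask selection to area constraints, and is no longer pursued. The root cause of F-score lagging behind mIoU is not mask precision but the quality of the reward proxy signal (see Path A).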
----
-## Direction II: Frame-Adaptive Token Optimization (preliminary exploration, deferred)
-### Method design
-Extend the single shared token q into a per-video token trajectory:
-```
-q_t = q_global + delta_t
-```
-where q_global is the globally shared token and delta_t is the local residual for each anchor frame, initialized to 0. Joint optimization:
-```
-max  Σ_t [λ_iou · e0_t · R_iou(q_t) - λ_area · R_area(q_t)]
-   - λ_residual · ||delta||² - λ_smooth · Σ_t ||delta_t - delta_{t+1}||²  - λ_reg · ||q_global - q_init||²
-```
-Each anchor frame uses its own e0_t (per-frame existence prior). delta_t is subject to a hard clip: `||delta_t|| ≤ scale × ||q_init||`.
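The regularization part of this objective and the per-frame hard clip can be sketched as below. It assumes that `||delta||²` sums the squared norms of all per-frame residuals; the function names are hypothetical, and the `lam_residual` and `lam_reg` values in the example are placeholders (only `λ_smooth = 0.01` appears in the probe table).

```python
import math

def sq_norm(v):
    return sum(x * x for x in v)

def fa_penalty(q_global, q_init, deltas, lam_residual, lam_smooth, lam_reg):
    """Residual + temporal-smoothness + drift penalties of the frame-adaptive objective."""
    residual = sum(sq_norm(d) for d in deltas)
    smooth = sum(
        sq_norm([a - b for a, b in zip(deltas[t], deltas[t + 1])])
        for t in range(len(deltas) - 1)
    )
    reg = sq_norm([a - b for a, b in zip(q_global, q_init)])
    return lam_residual * residual + lam_smooth * smooth + lam_reg * reg

def clip_delta(delta, q_init, scale):
    """Hard clip: ||delta_t|| <= scale * ||q_init||."""
    limit = scale * math.sqrt(sq_norm(q_init))
    n = math.sqrt(sq_norm(delta))
    if n <= limit:
        return list(delta)
    return [x * limit / n for x in delta]

q_init = [1.0, 1.0]
deltas = [[1.0, 0.0], [0.0, 1.0]]            # toy per-frame residuals
penalty = fa_penalty(q_init, q_init, deltas,
                     lam_residual=0.1, lam_smooth=0.01, lam_reg=0.01)
clipped = clip_delta([3.0, 4.0], [1.0, 0.0], scale=0.3)
print(penalty, clipped)
```

With all residuals at their zero initialization and `q_global = q_init`, the penalty is zero, so the frame-adaptive objective starts from the same point as the single-token one.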
-### 200-sample Probe Results (Unseen split)
-| Method | mIoU | F | reward gain p50 | delta ‖Δ‖ |
-|--------|------|---|-----------------|-----------|
-| baseline | 0.6745 | 0.7763 | — | — |
-| s1 | 0.6945 | 0.7773 | +0.0053 | — |
-| fa_base (unconstrained) | 0.6945 | 0.7711 | +0.0112 | 1.675 |
-| fa_smooth (λ_smooth=0.01) | 0.6960 | 0.7731 | +0.0104 | 1.488 |
-| fa_c03 (delta clip 0.3×) | 0.6959 | 0.7722 | +0.0112 | — |
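-### Key findings
-**Reward-metric gap (the core problem):**
-```
-reward gain p50:   s1 = +0.0053    fa_c03 = +0.0112  (fa is 2.1× higher)
-R_iou_pred gain:   s1 +0.077       fa_c03 +0.114
-actual mIoU gain:  s1 +2.96%       fa_c03 +3.17%     (only 0.21% apart)
-```
-fa collects far more reward, but mIoU gains almost nothing extra and F even drops slightly.
-**Conclusion:** the bottleneck is not the optimization structure but the insufficient task relevance of R_iou_pred itself. R_iou_pred measures "how clean the mask is", not "whether the mask contains the correct audio target". All architecture variants (single token / frame-adaptive) hit the same ceiling.
-Direction II will not be tuned further under the old reward; whether to reintroduce it will be reconsidered once Path A (new reward) shows a positive signal.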
----
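-## Path A: AVT-Aware Reward Redesign
-### Motivation
-In Ref-AVS the referent is not necessarily the sounding object itself (it may be a person holding the sounding object, or an object related to the sound source). A pure audio-alignment reward pushes optimization toward the sound source rather than the referent the text points to. What is needed is a referent consistency defined jointly by audio + text + global visual context.
-### AVT Proxy Reward design
-**Core insight:** Fseg (= q_init) is already a multimodal token fusing audio + video + text, so it can serve directly as a frozen AVT teacher.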
-```python
-R_avt   = mean_t  cos(z_in_t,  q_init)
-R_avt_c = mean_t [cos(z_in_t,  q_init) - β · cos(z_out_t, q_init)]
-```
-- `z_in_t`: soft-masked image feature of anchor frame t (SAM 256-dim space)
-- `q_init`: frozen Fseg (the AVT anchor; receives no optimization gradient)
-- high R_avt → the mask region aligns with the queried referent; low R_avt → the mask points to the wrong target
-Difference from Stage 2: Stage 2 aligns the current q (moving) with z_in (current mask), which causes self-confirmation bias; R_avt uses q_init (fixed) as the teacher, which breaks that bias.
-### Step A0: Reward–Metric Correlation Study (next step)
-**Purpose:** before moving to full optimization, verify with data whether the new reward predicts real metric changes better than R_iou_pred.
-**Setup (200 samples, Unseen split):**
-For each (video, segment) sample:
-1. Baseline decode → IoU_base, F_base
-2. q-LTPO s1 → q_best; record reward_gain, r_avt_gain, r_avt_c_gain (all computed inside q_ltpo_autograd)
-3. LTPO decode → IoU_ltpo, F_ltpo
-4. Δ = LTPO - baseline
-Output a Pearson correlation table:
-```
-Pearson r with ΔmIoU:
-  R_iou_pred_gain  : +0.xxx  ← current proxy
-  R_avt_gain       : +0.xxx  ← cos(z_in, q_init)
-  R_avt_c_gain     : +0.xxx  ← contrastive version
-Wrong direction (gain>0 but Δ<0):
-  R_iou / ΔmIoU : 0.xxx
-  R_avt / ΔmIoU : 0.xxx
-```
-**Run command:**
-```bash
-python load_model.py --eval_split test_u --max_eval_rows 200
-```
-**Decision criteria:**
-- `r(R_avt, ΔmIoU) > r(R_iou, ΔmIoU)` → AVT proxy is better; proceed to Step A1
-- the two are similar → the reward itself is not the bottleneck; reassess
-- `R_avt / ΔF wrong frac` clearly lower than `R_iou / ΔF` → AVT can explain why F-score does not follow mIoU
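The two statistics in the Step A0 table could be computed as in the sketch below. The function names are hypothetical, the denominator of the wrong-direction fraction (all samples) is an assumption because the plan does not state it, and the numbers in the example are arbitrary placeholders, not experimental results.

```python
import math

def pearson_r(x, y):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx < 1e-24 or vy < 1e-24:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def wrong_direction_frac(gain, delta_metric):
    """Share of all samples where the proxy rose (gain > 0) but the metric fell."""
    wrong = sum(1 for g, d in zip(gain, delta_metric) if g > 0 and d < 0)
    return wrong / len(gain)

# Placeholder per-sample values, for illustration only.
proxy_gain = [0.10, 0.20, -0.10]
delta_miou = [0.01, -0.02, 0.03]
print(pearson_r(proxy_gain, delta_miou))
print(wrong_direction_frac(proxy_gain, delta_miou))
```

Running the same two functions once per proxy (`R_iou_pred_gain`, `R_avt_gain`, `R_avt_c_gain`) and once per metric (ΔmIoU, ΔF) would fill in the table sketched above.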
-### Step A1: Hybrid Reward (after Step A0 validation)
-```
-R_task = λ1 · e0 · R_iou_pred + λ2 · R_avt_c - λ3 · R_area_soft
-```
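-- R_iou_pred remains responsible for mask quality (shape quality signal)
-- R_avt_c is responsible for referent correctness (task-specific signal)
-- only the combination has a chance of maintaining IoU while improving F
-Candidate weights: `λ1=0.6, λ2=0.5, λ3=0.2` (AVT as an auxiliary term, not a full replacement for R_iou).
-If Step A1 shows a positive signal, then consider combining Direction II (frame-adaptive) with the new reward.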

analyze_d2_csv.py DELETED

@@ -1,239 +0,0 @@
-import argparse
-import csv
-import math
-from collections import defaultdict
-import numpy as np
-def parse_args():
-    parser = argparse.ArgumentParser(description="Analyze D2 frame-level CSV.")
-    parser.add_argument("--csv", required=True, help="Path to d2_llm_space.py or d2_basic.py CSV output.")
-    parser.add_argument("--beta", type=float, default=1.0)
-    parser.add_argument("--failure_iou", type=float, default=0.5)
-    parser.add_argument("--bottom_frac", type=float, default=0.2)
-    parser.add_argument("--pr_points", type=int, default=10)
-    return parser.parse_args()
-def read_rows(path, beta):
-    rows = []
-    with open(path, newline="") as f:
-        reader = csv.DictReader(f)
-        for row in reader:
-            row_beta = float(row["beta"])
-            if abs(row_beta - beta) > 1e-8:
-                continue
-            q_col = "h_type" if "h_type" in row else "q_type"
-            rows.append(
-                {
-                    "sample_idx": int(row["sample_idx"]),
-                    "frame": int(row["frame"]),
-                    "anchor_type": row[q_col],
-                    "s_pred": float(row["s_pred"]),
-                    "s_gt": float(row["s_gt"]),
-                    "frame_iou": float(row["frame_iou"]),
-                    "iou_pred": float(row["iou_pred"]),
-                    "pred_area": float(row["pred_area"]),
-                    "gt_area": float(row["gt_area"]),
-                }
-            )
-    if not rows:
-        raise RuntimeError(f"No rows found for beta={beta} in {path}")
-    return rows
-def corr(x, y):
-    x = np.asarray(x, dtype=np.float64)
-    y = np.asarray(y, dtype=np.float64)
-    if len(x) < 2 or np.std(x) < 1e-12 or np.std(y) < 1e-12:
-        return float("nan")
-    return float(np.corrcoef(x, y)[0, 1])
-def residualize(y, controls):
-    y = np.asarray(y, dtype=np.float64)
-    cols = [np.ones(len(y), dtype=np.float64)]
-    for control in controls:
-        cols.append(np.asarray(control, dtype=np.float64))
-    x = np.stack(cols, axis=1)
-    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
-    return y - x @ coef
-def r2_score(y, y_pred):
-    y = np.asarray(y, dtype=np.float64)
-    y_pred = np.asarray(y_pred, dtype=np.float64)
-    ss_res = np.sum((y - y_pred) ** 2)
-    ss_tot = np.sum((y - y.mean()) ** 2)
-    if ss_tot < 1e-12:
-        return float("nan")
-    return float(1.0 - ss_res / ss_tot)
-def linear_r2(y, features):
-    y = np.asarray(y, dtype=np.float64)
-    cols = [np.ones(len(y), dtype=np.float64)]
-    for feature in features:
-        cols.append(np.asarray(feature, dtype=np.float64))
-    x = np.stack(cols, axis=1)
-    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
-    return r2_score(y, x @ coef)
-def real_rows(rows):
-    return [r for r in rows if r["anchor_type"] == "real"]
-def bottom_failure_enrichment(rows, failure_iou, bottom_frac):
-    rr = real_rows(rows)
-    n = len(rr)
-    k = max(1, int(round(n * bottom_frac)))
-    sorted_rows = sorted(rr, key=lambda r: r["s_pred"])
-    bottom = sorted_rows[:k]
-    baseline_rate = np.mean([r["frame_iou"] < failure_iou for r in rr])
-    bottom_rate = np.mean([r["frame_iou"] < failure_iou for r in bottom])
-    total_failures = sum(r["frame_iou"] < failure_iou for r in rr)
-    covered_failures = sum(r["frame_iou"] < failure_iou for r in bottom)
-    recall = covered_failures / max(total_failures, 1)
-    enrichment = bottom_rate / max(baseline_rate, 1e-12)
-    return {
-        "n": n,
-        "k": k,
-        "baseline_failure_rate": baseline_rate,
-        "bottom_failure_rate": bottom_rate,
-        "bottom_failure_recall": recall,
-        "enrichment": enrichment,
-        "total_failures": total_failures,
-    }
-def pr_curve(rows, failure_iou, points):
-    rr = sorted(real_rows(rows), key=lambda r: r["s_pred"])
-    total_failures = sum(r["frame_iou"] < failure_iou for r in rr)
-    out = []
-    for frac in np.linspace(0.05, 1.0, points):
-        k = max(1, int(round(len(rr) * frac)))
-        selected = rr[:k]
-        failures = sum(r["frame_iou"] < failure_iou for r in selected)
-        precision = failures / k
-        recall = failures / max(total_failures, 1)
-        out.append((frac, precision, recall))
-    return out
-def margin_rows(rows):
-    grouped = defaultdict(dict)
-    for r in rows:
-        key = (r["sample_idx"], r["frame"])
-        grouped[key][r["anchor_type"]] = r
-    out = []
-    for key, group in grouped.items():
-        if "real" not in group:
-            continue
-        controls = [group[name]["s_pred"] for name in ("shuffled", "wrong_ref") if name in group]
-        if not controls:
-            continue
-        real = group["real"]
-        item = dict(real)
-        item["s_margin"] = real["s_pred"] - max(controls)
-        out.append(item)
-    return out
-def bottom_failure_enrichment_for_score(rows, score_key, failure_iou, bottom_frac):
-    n = len(rows)
-    k = max(1, int(round(n * bottom_frac)))
-    sorted_rows = sorted(rows, key=lambda r: r[score_key])
-    bottom = sorted_rows[:k]
-    baseline_rate = np.mean([r["frame_iou"] < failure_iou for r in rows])
-    bottom_rate = np.mean([r["frame_iou"] < failure_iou for r in bottom])
-    total_failures = sum(r["frame_iou"] < failure_iou for r in rows)
-    covered_failures = sum(r["frame_iou"] < failure_iou for r in bottom)
-    return {
-        "n": n,
-        "k": k,
-        "baseline_failure_rate": baseline_rate,
-        "bottom_failure_rate": bottom_rate,
-        "bottom_failure_recall": covered_failures / max(total_failures, 1),
-        "enrichment": bottom_rate / max(baseline_rate, 1e-12),
-    }
-def main():
-    args = parse_args()
-    rows = read_rows(args.csv, args.beta)
-    rr = real_rows(rows)
-    print(f"CSV: {args.csv}")
-    print(f"beta: {args.beta}")
-    print(f"real frames: {len(rr)}")
-    print(f"failure definition: frame_iou < {args.failure_iou}")
-    print("\nReal s_pred Correlations")
-    print(f"corr(s_pred, frame_iou): {corr([r['s_pred'] for r in rr], [r['frame_iou'] for r in rr]):+.4f}")
-    print(f"corr(s_pred, iou_pred):  {corr([r['s_pred'] for r in rr], [r['iou_pred'] for r in rr]):+.4f}")
-    print(f"corr(s_pred, pred_area): {corr([r['s_pred'] for r in rr], [r['pred_area'] for r in rr]):+.4f}")
-    s_pred_values = [r["s_pred"] for r in rr]
-    frame_iou_values = [r["frame_iou"] for r in rr]
-    iou_pred_values = [r["iou_pred"] for r in rr]
-    pred_area_values = [r["pred_area"] for r in rr]
-    gt_area_values = [r["gt_area"] for r in rr]
-    partial_iou_pred = corr(
-        residualize(s_pred_values, [iou_pred_values]),
-        residualize(frame_iou_values, [iou_pred_values]),
-    )
-    partial_iou_area = corr(
-        residualize(s_pred_values, [iou_pred_values, pred_area_values]),
-        residualize(frame_iou_values, [iou_pred_values, pred_area_values]),
-    )
-    partial_iou_area_gt = corr(
-        residualize(s_pred_values, [iou_pred_values, pred_area_values, gt_area_values]),
-        residualize(frame_iou_values, [iou_pred_values, pred_area_values, gt_area_values]),
-    )
-    r2_iou_pred = linear_r2(frame_iou_values, [iou_pred_values])
-    r2_iou_pred_s = linear_r2(frame_iou_values, [iou_pred_values, s_pred_values])
-    r2_iou_pred_area = linear_r2(frame_iou_values, [iou_pred_values, pred_area_values])
-    r2_iou_pred_area_s = linear_r2(frame_iou_values, [iou_pred_values, pred_area_values, s_pred_values])
-    print("\nPartial Correlation / Residual Gain")
-    print(f"partial corr(s_pred, frame_iou | iou_pred):                 {partial_iou_pred:+.4f}")
-    print(f"partial corr(s_pred, frame_iou | iou_pred,pred_area):       {partial_iou_area:+.4f}")
-    print(f"partial corr(s_pred, frame_iou | iou_pred,pred_area,gt_area): {partial_iou_area_gt:+.4f}")
-    print(f"R2 frame_iou ~ iou_pred:                       {r2_iou_pred:.4f}")
-    print(f"R2 frame_iou ~ iou_pred + s_pred:              {r2_iou_pred_s:.4f} (gain {r2_iou_pred_s - r2_iou_pred:+.4f})")
-    print(f"R2 frame_iou ~ iou_pred + pred_area:           {r2_iou_pred_area:.4f}")
-    print(f"R2 frame_iou ~ iou_pred + pred_area + s_pred:  {r2_iou_pred_area_s:.4f} (gain {r2_iou_pred_area_s - r2_iou_pred_area:+.4f})")
-    stats = bottom_failure_enrichment(rows, args.failure_iou, args.bottom_frac)
-    print("\nBottom-k Failure Enrichment")
-    print(f"bottom_frac: {args.bottom_frac:.2f} ({stats['k']}/{stats['n']} frames)")
-    print(f"total failures: {stats['total_failures']}")
-    print(f"random/baseline failure rate: {stats['baseline_failure_rate']:.4f}")
-    print(f"bottom-s_pred failure rate:   {stats['bottom_failure_rate']:.4f}")
-    print(f"bottom-s_pred failure recall: {stats['bottom_failure_recall']:.4f}")
-    print(f"enrichment:                  {stats['enrichment']:.2f}x")
-    print("\nPR Curve Summary")
-    print("selected_frac | precision | recall")
-    for frac, precision, recall in pr_curve(rows, args.failure_iou, args.pr_points):
-        print(f"{frac:.2f} | {precision:.4f} | {recall:.4f}")
-    mr = margin_rows(rows)
-    if mr:
-        print("\nOffline Margin-D2")
-        print(f"margin frames: {len(mr)}")
-        print(f"corr(s_margin, frame_iou): {corr([r['s_margin'] for r in mr], [r['frame_iou'] for r in mr]):+.4f}")
-        print(f"corr(s_margin, pred_area): {corr([r['s_margin'] for r in mr], [r['pred_area'] for r in mr]):+.4f}")
-        mstats = bottom_failure_enrichment_for_score(mr, "s_margin", args.failure_iou, args.bottom_frac)
-        print(f"bottom-s_margin failure rate:   {mstats['bottom_failure_rate']:.4f}")
-        print(f"bottom-s_margin failure recall: {mstats['bottom_failure_recall']:.4f}")
-        print(f"margin enrichment:              {mstats['enrichment']:.2f}x")
-    else:
-        print("\nOffline Margin-D2 skipped: shuffled/wrong_ref controls not available.")
-if __name__ == "__main__":
-    main()

build_rpb_dev_manifest.py DELETED

@@ -1,71 +0,0 @@
-import argparse
-import json
-import os
-import random
-import pandas as pd
-def sample_indices(size, count, seed):
-    if count <= 0:
-        return []
-    if count > size:
-        raise ValueError(f"Requested {count} samples from a split of size {size}")
-    rng = random.Random(seed)
-    indices = list(range(size))
-    rng.shuffle(indices)
-    selected = sorted(indices[:count])
-    return selected
-def main():
-    parser = argparse.ArgumentParser(description="Build a fixed subset manifest for RPB dev experiments.")
-    parser.add_argument("--metadata", type=str, default="/workspace/SimToken/data/metadata.csv")
-    parser.add_argument("--output", type=str, required=True)
-    parser.add_argument("--seed", type=int, default=42)
-    parser.add_argument("--train_rows", type=int, default=0)
-    parser.add_argument("--test_s_rows", type=int, default=200)
-    parser.add_argument("--test_u_rows", type=int, default=200)
-    parser.add_argument("--test_n_rows", type=int, default=200)
-    args = parser.parse_args()
-    metadata = pd.read_csv(args.metadata, header=0)
-    split_sizes = {
-        "train": int((metadata["split"] == "train").sum()),
-        "test_s": int((metadata["split"] == "test_s").sum()),
-        "test_u": int((metadata["split"] == "test_u").sum()),
-        "test_n": int((metadata["split"] == "test_n").sum()),
-    }
-    manifest = {
-        "train": sample_indices(split_sizes["train"], args.train_rows, args.seed),
-        "test_s": sample_indices(split_sizes["test_s"], args.test_s_rows, args.seed + 1),
-        "test_u": sample_indices(split_sizes["test_u"], args.test_u_rows, args.seed + 2),
-        "test_n": sample_indices(split_sizes["test_n"], args.test_n_rows, args.seed + 3),
-    }
-    # Remove empty entries so train.py only subsets the splits we intentionally fix.
-    manifest = {key: value for key, value in manifest.items() if value}
-    os.makedirs(os.path.dirname(os.path.abspath(args.output)), exist_ok=True)
-    with open(args.output, "w", encoding="utf-8") as f:
-        json.dump(
-            {
-                "metadata": {
-                    "seed": args.seed,
-                    "split_sizes": split_sizes,
-                    "source_metadata": os.path.abspath(args.metadata),
-                },
-                "subsets": manifest,
-            },
-            f,
-            indent=2,
-        )
-    print(f"saved subset manifest to {args.output}")
-    for split_name, indices in manifest.items():
-        print(f"{split_name}: {len(indices)} samples")
-if __name__ == "__main__":
-    main()

cache_q_features.py DELETED

@@ -1,125 +0,0 @@
-import json
-import os
-from functools import partial
-from itertools import islice
-import torch
-import transformers
-from torch.utils.data import DataLoader
-from tqdm import tqdm
-from configs import args
-from datasets import REFAVS
-from decoder_invariance_check import build_model, set_seed
-from load_model import collate_fn, dict_to_cuda
-def _jsonable_size(size):
-    if isinstance(size, torch.Tensor):
-        return [int(x) for x in size.detach().cpu().tolist()]
-    return [int(x) for x in size]
-def main():
-    set_seed(42)
-    torch.set_grad_enabled(False)
-    tokenizer = transformers.AutoTokenizer.from_pretrained(
-        args.mllm,
-        cache_dir=None,
-        model_max_length=2048,
-        padding_side="right",
-        use_fast=False,
-    )
-    tokenizer.pad_token = tokenizer.unk_token
-    tokenizer.add_tokens("[SEG]")
-    seg_token_idx = tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
-    dataset = REFAVS(args.cache_split, args, tokenizer, input_type="refer")
-    loader = DataLoader(
-        dataset,
-        batch_size=1,
-        shuffle=False,
-        num_workers=0,
-        collate_fn=partial(collate_fn, tokenizer=tokenizer),
-    )
-    split_root = os.path.join(args.cache_root, args.cache_split)
-    os.makedirs(split_root, exist_ok=True)
-    index_path = os.path.join(split_root, "index.jsonl")
-    if os.path.exists(index_path) and not args.overwrite_cache:
-        raise FileExistsError(
-            f"{index_path} already exists. Pass --overwrite_cache to rebuild it."
-        )
-    limit = args.max_eval_rows if args.max_eval_rows > 0 else len(dataset)
-    print(f"cache split={args.cache_split} | samples={min(limit, len(dataset))}")
-    print(f"cache root: {split_root}")
-    model = build_model(tokenizer, seg_token_idx)
-    model.eval()
-    rows = []
-    for sample_idx, batch in enumerate(
-        tqdm(islice(loader, limit), total=min(limit, len(dataset)), desc=f"Caching {args.cache_split}")
-    ):
-        batch = dict_to_cuda(batch)
-        with torch.cuda.amp.autocast(dtype=torch.bfloat16):
-            output = model.forward(
-                images=batch["images"],
-                images_clip=batch["images_clip"],
-                audio_features=batch["audio_feats"],
-                image_features=batch["image_feats"],
-                input_ids=batch["input_ids"],
-                labels=batch["labels"],
-                attention_masks=batch["attention_masks"],
-                masks_list=batch["masks"],
-                resize_list=batch["resizes"],
-                orgsize_list=batch["orgsizes"],
-                conversation_list=batch["convs"],
-                refs_num=batch["refs_num"],
-                fids=batch["fids"],
-                vids=batch["vids"],
-                contrast=args.ct_weight,
-                ref_ids=batch["ref_ids"],
-                inference=True,
-            )
-        cache_name = f"{sample_idx:06d}.pt"
-        cache_path = os.path.join(split_root, cache_name)
-        item = {
-            "sample_idx": sample_idx,
-            "vid": batch["vids"][0],
-            "refs": batch["refs"][0],
-            "fids": [int(x) for x in batch["fids"][0]],
-            "resize": _jsonable_size(batch["resizes"][0]),
-            "orgsize": _jsonable_size(batch["orgsizes"][0]),
-            "q": output["seg_embeddings"][0].detach().cpu().float(),
-        }
-        torch.save(item, cache_path)
-        rows.append(
-            {
-                "sample_idx": sample_idx,
-                "path": cache_name,
-                "vid": item["vid"],
-                "refs": item["refs"],
-                "fids": item["fids"],
-                "resize": item["resize"],
-                "orgsize": item["orgsize"],
-                "num_seg": int(item["q"].shape[0]),
-            }
-        )
-    if not rows:
-        raise RuntimeError("No samples were cached.")
-    with open(index_path, "w") as f:
-        for row in rows:
-            f.write(json.dumps(row) + "\n")
-    print(f"cached samples: {len(rows)}")
-    print(f"saved index: {index_path}")
-if __name__ == "__main__":
-    main()

cache_q_smoke/test_s/000000.pt DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:5f85d7cf7b83caf6fedb153a2cea2b36dd144ee3c0e34039483e20d208ea92d3
-size 2327

cache_q_smoke/test_s/index.jsonl DELETED

@@ -1 +0,0 @@
-{"sample_idx": 0, "path": "000000.pt", "vid": "-3ABOVeVmpU_136000_146000", "refs": ["the object that keeps making sound at all times"], "fids": [1], "resize": [576, 1024], "orgsize": [720, 1280], "num_seg": 1}

checkpoints/rpb_dev_mixed_pm_only_a018_wm005.pth DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:2c1facc9eac5ffdfd12c97d252af2c8eedc4e526a53931d301b0ef4bed698213
-size 30841132766

checkpoints/rpb_dev_pm_only_a018.pth DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:33e8b6251c69d7d4de055b488a2f2345eece1991831a2f08ce5f1d1cb795ae5f
-size 30841115170

checkpoints/rpb_probe_eval_directional_pm_only_a02.pth DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:6dc5cd6b02f5d54a026694f6a1217f46137fdaa4499a71fa7b9bd95ede17da6c
-size 30841141852

d2_basic.py DELETED

@@ -1,340 +0,0 @@
-import csv
-import math
-import os
-from functools import partial
-import numpy as np
-import torch
-import torch.nn.functional as F
-import transformers
-from torch.utils.data import DataLoader
-from configs import args
-from datasets import REFAVS
-from decoder_invariance_check import build_model, set_seed
-from load_model import collate_fn, dict_to_cuda
-def make_loader(tokenizer):
-    dataset = REFAVS(args.eval_split, args, tokenizer, input_type="refer")
-    return DataLoader(
-        dataset,
-        batch_size=1,
-        shuffle=False,
-        num_workers=0,
-        collate_fn=partial(collate_fn, tokenizer=tokenizer),
-    )
-def build_tokenizer():
-    tokenizer = transformers.AutoTokenizer.from_pretrained(
-        args.mllm,
-        cache_dir=None,
-        model_max_length=2048,
-        padding_side="right",
-        use_fast=False,
-    )
-    tokenizer.pad_token = tokenizer.unk_token
-    tokenizer.add_tokens("[SEG]")
-    seg_token_idx = tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
-    return tokenizer, seg_token_idx
-def get_q(model, batch):
-    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
-        output = model.forward(
-            images=batch["images"],
-            images_clip=batch["images_clip"],
-            audio_features=batch["audio_feats"],
-            image_features=batch["image_feats"],
-            input_ids=batch["input_ids"],
-            labels=batch["labels"],
-            attention_masks=batch["attention_masks"],
-            masks_list=batch["masks"],
-            resize_list=batch["resizes"],
-            orgsize_list=batch["orgsizes"],
-            conversation_list=batch["convs"],
-            refs_num=batch["refs_num"],
-            fids=batch["fids"],
-            vids=batch["vids"],
-            contrast=args.ct_weight,
-            ref_ids=batch["ref_ids"],
-            inference=True,
-        )
-    return output["seg_embeddings"][0][0].float()
-def decode_low_res(model, batch, q):
-    visual_model = model.get_model().visual_model
-    sparse, dense = visual_model.prompt_encoder(
-        points=None,
-        boxes=None,
-        masks=None,
-        text_embeds=q.view(1, 1, -1).to(next(visual_model.parameters()).dtype),
-    )
-    sparse = sparse.to(q.dtype)
-    dense = dense.to(q.dtype)
-    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
-        low_res_masks, iou_predictions = visual_model.mask_decoder(
-            image_embeddings=batch["image_feats"][0],
-            image_pe=visual_model.prompt_encoder.get_dense_pe(),
-            sparse_prompt_embeddings=sparse,
-            dense_prompt_embeddings=dense,
-            multimask_output=False,
-        )
-    return low_res_masks.float(), iou_predictions.float().squeeze(-1)
-def masks_to_64(mask_logits_or_binary):
-    if mask_logits_or_binary.ndim == 3:
-        mask_logits_or_binary = mask_logits_or_binary.unsqueeze(1)
-    return F.interpolate(
-        mask_logits_or_binary.float(),
-        size=(64, 64),
-        mode="bilinear",
-        align_corners=False,
-    ).clamp(0.0, 1.0)
-def d2_scores(image_embeddings, mask64, q, beta):
-    feats = image_embeddings.float()
-    if mask64.shape[0] != feats.shape[0]:
-        raise ValueError(f"Mask/frame mismatch: {mask64.shape} vs {feats.shape}")
-    q = F.normalize(q.float().view(1, -1), dim=-1)
-    mask = mask64.float()
-    comp = 1.0 - mask
-    z_in = (feats * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp_min(1e-6)
-    z_out = (feats * comp).sum(dim=(2, 3)) / comp.sum(dim=(2, 3)).clamp_min(1e-6)
-    z_in = F.normalize(z_in, dim=-1)
-    z_out = F.normalize(z_out, dim=-1)
-    return (z_in @ q.T).squeeze(-1) - beta * (z_out @ q.T).squeeze(-1)
-def frame_iou(pred_logits, gt_masks):
-    pred = (torch.sigmoid(pred_logits.float()) > 0.4).float()
-    gt = gt_masks.float()
-    if pred.ndim == 4:
-        pred = pred.squeeze(1)
-    inter = (pred * gt).sum(dim=(1, 2))
-    union = torch.maximum(pred, gt).sum(dim=(1, 2))
-    num_pixels = pred.shape[-1] * pred.shape[-2]
-    no_obj = gt.sum(dim=(1, 2)) == 0
-    inter_no_obj = ((1.0 - pred) * (1.0 - gt)).sum(dim=(1, 2))
-    inter = torch.where(no_obj, inter_no_obj, inter)
-    union = torch.where(no_obj, torch.full_like(union, float(num_pixels)), union)
-    return inter / union.clamp_min(1e-7)
-def frame_fscore_proxy(pred_logits, gt_masks):
-    pred = (torch.sigmoid(pred_logits.float()) > 0.4).float()
-    gt = gt_masks.float()
-    if pred.ndim == 4:
-        pred = pred.squeeze(1)
-    tp = (pred * gt).sum(dim=(1, 2))
-    precision = tp / pred.sum(dim=(1, 2)).clamp_min(1e-7)
-    recall = tp / gt.sum(dim=(1, 2)).clamp_min(1e-7)
-    beta2 = 0.3
-    fscore = (1 + beta2) * precision * recall / (beta2 * precision + recall).clamp_min(1e-7)
-    no_obj = gt.sum(dim=(1, 2)) == 0
-    return torch.where(no_obj, torch.zeros_like(fscore), fscore)
-def parse_betas():
-    raw = os.environ.get("D2_BETAS", "0.5")
-    return [float(x.strip()) for x in raw.split(",") if x.strip()]
-def collect_q_pool(model, tokenizer, limit):
-    q_pool = []
-    loader = make_loader(tokenizer)
-    for sample_idx, batch in enumerate(loader):
-        if sample_idx >= limit:
-            break
-        batch = dict_to_cuda(batch)
-        q = get_q(model, batch)
-        q_pool.append(
-            {
-                "sample_idx": sample_idx,
-                "vid": batch["vids"][0],
-                "ref": batch["refs"][0][0],
-                "fid": int(batch["fids"][0][0]),
-                "q": q.cpu(),
-            }
-        )
-        print(f"Collected q {sample_idx}: vid={q_pool[-1]['vid']} ref={q_pool[-1]['ref']}")
-    if not q_pool:
-        raise RuntimeError("No q vectors collected. Is the selected split empty?")
-    return q_pool
-def choose_shuffled_idx(sample_idx, q_pool):
-    if len(q_pool) <= 1:
-        return None
-    return (sample_idx + 1) % len(q_pool)
-def choose_wrong_ref_idx(sample_idx, q_pool):
-    current = q_pool[sample_idx]
-    for item in q_pool:
-        if item["sample_idx"] == sample_idx:
-            continue
-        if item["vid"] == current["vid"] and item["fid"] != current["fid"]:
-            return item["sample_idx"]
-    for item in q_pool:
-        if item["sample_idx"] == sample_idx:
-            continue
-        if item["vid"] == current["vid"] and item["ref"] != current["ref"]:
-            return item["sample_idx"]
-    return None
-def run_d2(model, tokenizer, q_pool, betas, limit):
-    rows = []
-    loader = make_loader(tokenizer)
-    q_lookup = {item["sample_idx"]: item for item in q_pool}
-    generator = torch.Generator(device="cuda")
-    generator.manual_seed(1234)
-    for sample_idx, batch in enumerate(loader):
-        if sample_idx >= limit:
-            break
-        batch = dict_to_cuda(batch)
-        item = q_lookup[sample_idx]
-        real_q = item["q"].cuda()
-        low_res_masks, iou_predictions = decode_low_res(model, batch, real_q)
-        pred_mask64 = masks_to_64(torch.sigmoid(low_res_masks))
-        gt_masks = batch["masks"][0][0].float()
-        gt_mask64 = masks_to_64(gt_masks)
-        image_embeddings = batch["image_feats"][0].float()
-        pred_logits_hr = model.get_model().visual_model.postprocess_masks(
-            low_res_masks.to(batch["image_feats"][0].dtype),
-            input_size=batch["resizes"][0],
-            original_size=batch["orgsizes"][0],
-        ).squeeze(1)
-        frame_ious = frame_iou(pred_logits_hr, gt_masks)
-        frame_fscores = frame_fscore_proxy(pred_logits_hr, gt_masks)
-        pred_area = (torch.sigmoid(pred_logits_hr.float()) > 0.4).float().mean(dim=(1, 2))
-        gt_area = gt_masks.float().mean(dim=(1, 2))
-        shuffled_idx = choose_shuffled_idx(sample_idx, q_pool)
-        wrong_ref_idx = choose_wrong_ref_idx(sample_idx, q_pool)
-        q_controls = [
-            ("real", real_q, sample_idx),
-            ("random", torch.randn(real_q.shape, device=real_q.device, generator=generator), None),
-        ]
-        if shuffled_idx is not None:
-            q_controls.append(("shuffled", q_lookup[shuffled_idx]["q"].cuda(), shuffled_idx))
-        if wrong_ref_idx is not None:
-            q_controls.append(("wrong_ref", q_lookup[wrong_ref_idx]["q"].cuda(), wrong_ref_idx))
-        for beta in betas:
-            for q_type, q, q_source_idx in q_controls:
-                pred_scores = d2_scores(image_embeddings, pred_mask64, q, beta)
-                gt_scores = d2_scores(image_embeddings, gt_mask64, q, beta)
-                base_info = {
-                    "sample_idx": sample_idx,
-                    "vid": item["vid"],
-                    "ref": item["ref"],
-                    "fid": item["fid"],
-                    "split": args.eval_split,
-                    "frame_iou": math.nan,
-                    "frame_fscore_proxy": math.nan,
-                    "iou_pred": math.nan,
-                    "pred_area": math.nan,
-                    "gt_area": math.nan,
-                }
-                for frame_idx in range(pred_scores.shape[0]):
-                    base_info_frame = dict(base_info)
-                    base_info_frame.update(
-                        {
-                            "frame_iou": frame_ious[frame_idx].item(),
-                            "frame_fscore_proxy": frame_fscores[frame_idx].item(),
-                            "iou_pred": iou_predictions[frame_idx].item(),
-                            "pred_area": pred_area[frame_idx].item(),
-                            "gt_area": gt_area[frame_idx].item(),
-                        }
-                    )
-                    row = dict(base_info_frame)
-                    row.update(
-                        {
-                            "frame": frame_idx,
-                            "q_type": q_type,
-                            "beta": beta,
-                            "s_pred": pred_scores[frame_idx].item(),
-                            "s_gt": gt_scores[frame_idx].item(),
-                            "q_source_idx": q_source_idx if q_source_idx is not None else "",
-                        }
-                    )
-                    rows.append(row)
-        real_rows = [
-            r for r in rows if r["sample_idx"] == sample_idx and r["q_type"] == "real" and r["beta"] == betas[0]
-        ]
-        s_pred_values = [r["s_pred"] for r in real_rows]
-        print(
-            f"D2 {sample_idx}: vid={item['vid']} ref={item['ref']} "
-            f"mean_s_pred={np.mean(s_pred_values):.4f} min_s_pred={np.min(s_pred_values):.4f} "
-            f"mean_iou={frame_ious.mean().item():.4f}"
-        )
-    return rows
-def print_summary(rows):
-    real_rows = [r for r in rows if r["q_type"] == "real"]
-    if not real_rows:
-        return
-    by_beta = sorted(set(r["beta"] for r in real_rows))
-    print("\nSummary")
-    print(f"rows: {len(rows)}")
-    for beta in by_beta:
-        beta_rows = [r for r in rows if r["beta"] == beta]
-        print(f"\nbeta={beta}")
-        for q_type in sorted(set(r["q_type"] for r in beta_rows)):
-            qr = [r for r in beta_rows if r["q_type"] == q_type]
-            print(
-                f"{q_type:10s} "
-                f"mean_s_pred={np.mean([r['s_pred'] for r in qr]):+.4f} "
-                f"mean_s_gt={np.mean([r['s_gt'] for r in qr]):+.4f}"
-            )
-        real_beta = [r for r in beta_rows if r["q_type"] == "real"]
-        s_pred = np.array([r["s_pred"] for r in real_beta])
-        frame_iou_values = np.array([r["frame_iou"] for r in real_beta])
-        if len(s_pred) > 1 and np.std(s_pred) > 1e-8 and np.std(frame_iou_values) > 1e-8:
-            corr = np.corrcoef(s_pred, frame_iou_values)[0, 1]
-            print(f"corr(real s_pred, frame_iou)={corr:+.4f}")
-        else:
-            print("corr(real s_pred, frame_iou)=nan")
-def main():
-    set_seed(42)
-    torch.set_grad_enabled(False)
-    betas = parse_betas()
-    tokenizer, seg_token_idx = build_tokenizer()
-    limit = args.max_eval_rows if args.max_eval_rows > 0 else 30
-    print(f"Split: {args.eval_split} | samples: {limit} | betas: {betas}")
-    model = build_model(tokenizer, seg_token_idx)
-    q_pool = collect_q_pool(model, tokenizer, limit)
-    rows = run_d2(model, tokenizer, q_pool, betas, limit)
-    print_summary(rows)
-    csv_path = os.environ.get("D2_BASIC_CSV", f"/workspace/SimToken/d2_basic_{args.eval_split}_{limit}.csv")
-    os.makedirs(os.path.dirname(os.path.abspath(csv_path)), exist_ok=True)
-    with open(csv_path, "w", newline="") as f:
-        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
-        writer.writeheader()
-        writer.writerows(rows)
-    print(f"\nSaved CSV: {csv_path}")
-if __name__ == "__main__":
-    main()

d2_llm_space.py DELETED

@@ -1,314 +0,0 @@
-import csv
-import math
-import os
-from functools import partial
-import numpy as np
-import torch
-import torch.nn.functional as F
-import transformers
-from torch.utils.data import DataLoader
-from configs import args
-from datasets import REFAVS
-from decoder_invariance_check import build_model, set_seed
-from d2_basic import frame_fscore_proxy, frame_iou
-from load_model import collate_fn, dict_to_cuda
-def build_tokenizer():
-    tokenizer = transformers.AutoTokenizer.from_pretrained(
-        args.mllm,
-        cache_dir=None,
-        model_max_length=2048,
-        padding_side="right",
-        use_fast=False,
-    )
-    tokenizer.pad_token = tokenizer.unk_token
-    tokenizer.add_tokens("[SEG]")
-    seg_token_idx = tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
-    return tokenizer, seg_token_idx
-def make_loader(tokenizer):
-    dataset = REFAVS(args.eval_split, args, tokenizer, input_type="refer")
-    return DataLoader(
-        dataset,
-        batch_size=1,
-        shuffle=False,
-        num_workers=0,
-        collate_fn=partial(collate_fn, tokenizer=tokenizer),
-    )
-def forward_for_hidden_and_q(model, batch):
-    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
-        output = model.forward(
-            images=batch["images"],
-            images_clip=batch["images_clip"],
-            audio_features=batch["audio_feats"],
-            image_features=batch["image_feats"],
-            input_ids=batch["input_ids"],
-            labels=batch["labels"],
-            attention_masks=batch["attention_masks"],
-            masks_list=batch["masks"],
-            resize_list=batch["resizes"],
-            orgsize_list=batch["orgsizes"],
-            conversation_list=batch["convs"],
-            refs_num=batch["refs_num"],
-            fids=batch["fids"],
-            vids=batch["vids"],
-            contrast=args.ct_weight,
-            ref_ids=batch["ref_ids"],
-            inference=True,
-        )
-    h_seg = output["seg_hidden_states"][0][0].float()
-    q = output["seg_embeddings"][0][0].float()
-    return h_seg, q
-def decode_low_res(model, batch, q):
-    visual_model = model.get_model().visual_model
-    sparse, dense = visual_model.prompt_encoder(
-        points=None,
-        boxes=None,
-        masks=None,
-        text_embeds=q.view(1, 1, -1).to(next(visual_model.parameters()).dtype),
-    )
-    sparse = sparse.to(q.dtype)
-    dense = dense.to(q.dtype)
-    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
-        low_res_masks, iou_predictions = visual_model.mask_decoder(
-            image_embeddings=batch["image_feats"][0],
-            image_pe=visual_model.prompt_encoder.get_dense_pe(),
-            sparse_prompt_embeddings=sparse,
-            dense_prompt_embeddings=dense,
-            multimask_output=False,
-        )
-    return low_res_masks.float(), iou_predictions.float().squeeze(-1)
-def clip_projected_tokens(model, batch):
-    images = torch.cat(batch["images_clip"], dim=0)
-    with torch.no_grad():
-        clip_tokens = model.encode_images(images)
-        projector = model.get_model().mm_projector
-        clip_tokens = clip_tokens.to(projector.weight.dtype)
-        llm_tokens = projector(clip_tokens).float()
-    return llm_tokens
-def infer_square_grid(num_tokens):
-    grid = int(math.sqrt(num_tokens))
-    if grid * grid != num_tokens:
-        raise ValueError(f"Expected square patch-token grid, got {num_tokens} tokens")
-    return grid
-def masks_to_token_grid(mask_logits_or_binary, num_tokens):
-    if mask_logits_or_binary.ndim == 3:
-        mask_logits_or_binary = mask_logits_or_binary.unsqueeze(1)
-    grid = infer_square_grid(num_tokens)
-    return F.interpolate(
-        mask_logits_or_binary.float(),
-        size=(grid, grid),
-        mode="bilinear",
-        align_corners=False,
-    ).flatten(2).transpose(1, 2).clamp(0.0, 1.0)
-def d2_scores_llm(llm_tokens, mask_tokens, h_seg, beta):
-    if llm_tokens.shape[:2] != mask_tokens.shape[:2]:
-        raise ValueError(f"Token/mask mismatch: {llm_tokens.shape} vs {mask_tokens.shape}")
-    h = F.normalize(h_seg.float().view(1, -1), dim=-1)
-    tokens = llm_tokens.float()
-    mask = mask_tokens.float()
-    comp = 1.0 - mask
-    z_in = (tokens * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1e-6)
-    z_out = (tokens * comp).sum(dim=1) / comp.sum(dim=1).clamp_min(1e-6)
-    z_in = F.normalize(z_in, dim=-1)
-    z_out = F.normalize(z_out, dim=-1)
-    return (z_in @ h.T).squeeze(-1) - beta * (z_out @ h.T).squeeze(-1)
-def parse_betas():
-    raw = os.environ.get("D2_BETAS", "0.5")
-    return [float(x.strip()) for x in raw.split(",") if x.strip()]
-def collect_hidden_pool(model, tokenizer, limit):
-    pool = []
-    loader = make_loader(tokenizer)
-    for sample_idx, batch in enumerate(loader):
-        if sample_idx >= limit:
-            break
-        batch = dict_to_cuda(batch)
-        h_seg, q = forward_for_hidden_and_q(model, batch)
-        pool.append(
-            {
-                "sample_idx": sample_idx,
-                "vid": batch["vids"][0],
-                "ref": batch["refs"][0][0],
-                "fid": int(batch["fids"][0][0]),
-                "h": h_seg.cpu(),
-                "q": q.cpu(),
-            }
-        )
-        print(f"Collected h {sample_idx}: vid={pool[-1]['vid']} ref={pool[-1]['ref']}")
-    if not pool:
-        raise RuntimeError("No hidden states collected. Is the selected split empty?")
-    return pool
-def choose_shuffled_idx(sample_idx, pool):
-    if len(pool) <= 1:
-        return None
-    return (sample_idx + 1) % len(pool)
-def choose_wrong_ref_idx(sample_idx, pool):
-    current = pool[sample_idx]
-    for item in pool:
-        if item["sample_idx"] == sample_idx:
-            continue
-        if item["vid"] == current["vid"] and item["fid"] != current["fid"]:
-            return item["sample_idx"]
-    for item in pool:
-        if item["sample_idx"] == sample_idx:
-            continue
-        if item["vid"] == current["vid"] and item["ref"] != current["ref"]:
-            return item["sample_idx"]
-    return None
-def run_d2_llm(model, tokenizer, pool, betas, limit):
-    rows = []
-    lookup = {item["sample_idx"]: item for item in pool}
-    generator = torch.Generator(device="cuda")
-    generator.manual_seed(1234)
-    loader = make_loader(tokenizer)
-    for sample_idx, batch in enumerate(loader):
-        if sample_idx >= limit:
-            break
-        batch = dict_to_cuda(batch)
-        item = lookup[sample_idx]
-        h_real = item["h"].cuda()
-        q_real = item["q"].cuda()
-        low_res_masks, iou_predictions = decode_low_res(model, batch, q_real)
-        llm_tokens = clip_projected_tokens(model, batch)
-        pred_mask_tokens = masks_to_token_grid(torch.sigmoid(low_res_masks), llm_tokens.shape[1])
-        gt_masks = batch["masks"][0][0].float()
-        gt_mask_tokens = masks_to_token_grid(gt_masks, llm_tokens.shape[1])
-        pred_logits_hr = model.get_model().visual_model.postprocess_masks(
-            low_res_masks.to(batch["image_feats"][0].dtype),
-            input_size=batch["resizes"][0],
-            original_size=batch["orgsizes"][0],
-        ).squeeze(1)
-        frame_ious = frame_iou(pred_logits_hr, gt_masks)
-        frame_fscores = frame_fscore_proxy(pred_logits_hr, gt_masks)
-        pred_area = (torch.sigmoid(pred_logits_hr.float()) > 0.4).float().mean(dim=(1, 2))
-        gt_area = gt_masks.float().mean(dim=(1, 2))
-        shuffled_idx = choose_shuffled_idx(sample_idx, pool)
-        wrong_ref_idx = choose_wrong_ref_idx(sample_idx, pool)
-        controls = [
-            ("real", h_real, sample_idx),
-            ("random", torch.randn(h_real.shape, device=h_real.device, generator=generator), None),
-        ]
-        if shuffled_idx is not None:
-            controls.append(("shuffled", lookup[shuffled_idx]["h"].cuda(), shuffled_idx))
-        if wrong_ref_idx is not None:
-            controls.append(("wrong_ref", lookup[wrong_ref_idx]["h"].cuda(), wrong_ref_idx))
-        for beta in betas:
-            for h_type, h, h_source_idx in controls:
-                pred_scores = d2_scores_llm(llm_tokens, pred_mask_tokens, h, beta)
-                gt_scores = d2_scores_llm(llm_tokens, gt_mask_tokens, h, beta)
-                for frame_idx in range(pred_scores.shape[0]):
-                    rows.append(
-                        {
-                            "sample_idx": sample_idx,
-                            "vid": item["vid"],
-                            "ref": item["ref"],
-                            "fid": item["fid"],
-                            "split": args.eval_split,
-                            "frame": frame_idx,
-                            "h_type": h_type,
-                            "beta": beta,
-                            "s_pred": pred_scores[frame_idx].item(),
-                            "s_gt": gt_scores[frame_idx].item(),
-                            "h_source_idx": h_source_idx if h_source_idx is not None else "",
-                            "frame_iou": frame_ious[frame_idx].item(),
-                            "frame_fscore_proxy": frame_fscores[frame_idx].item(),
-                            "iou_pred": iou_predictions[frame_idx].item(),
-                            "pred_area": pred_area[frame_idx].item(),
-                            "gt_area": gt_area[frame_idx].item(),
-                        }
-                    )
-        real_rows = [
-            r for r in rows if r["sample_idx"] == sample_idx and r["h_type"] == "real" and r["beta"] == betas[0]
-        ]
-        s_pred_values = [r["s_pred"] for r in real_rows]
-        print(
-            f"D2-LLM {sample_idx}: vid={item['vid']} ref={item['ref']} "
-            f"mean_s_pred={np.mean(s_pred_values):.4f} min_s_pred={np.min(s_pred_values):.4f} "
-            f"mean_iou={frame_ious.mean().item():.4f}"
-        )
-    return rows
-def print_summary(rows):
-    print("\nSummary")
-    print(f"rows: {len(rows)}")
-    for beta in sorted(set(r["beta"] for r in rows)):
-        beta_rows = [r for r in rows if r["beta"] == beta]
-        print(f"\nbeta={beta}")
-        for h_type in sorted(set(r["h_type"] for r in beta_rows)):
-            hr = [r for r in beta_rows if r["h_type"] == h_type]
-            print(
-                f"{h_type:10s} "
-                f"mean_s_pred={np.mean([r['s_pred'] for r in hr]):+.4f} "
-                f"mean_s_gt={np.mean([r['s_gt'] for r in hr]):+.4f}"
-            )
-        real_rows = [r for r in beta_rows if r["h_type"] == "real"]
-        s_pred = np.array([r["s_pred"] for r in real_rows])
-        frame_iou_values = np.array([r["frame_iou"] for r in real_rows])
-        if len(s_pred) > 1 and np.std(s_pred) > 1e-8 and np.std(frame_iou_values) > 1e-8:
-            corr = np.corrcoef(s_pred, frame_iou_values)[0, 1]
-            print(f"corr(real s_pred, frame_iou)={corr:+.4f}")
-        else:
-            print("corr(real s_pred, frame_iou)=nan")
-def main():
-    set_seed(42)
-    torch.set_grad_enabled(False)
-    betas = parse_betas()
-    tokenizer, seg_token_idx = build_tokenizer()
-    limit = args.max_eval_rows if args.max_eval_rows > 0 else 30
-    print(f"Split: {args.eval_split} | samples: {limit} | betas: {betas}")
-    model = build_model(tokenizer, seg_token_idx)
-    pool = collect_hidden_pool(model, tokenizer, limit)
-    rows = run_d2_llm(model, tokenizer, pool, betas, limit)
-    print_summary(rows)
-    csv_path = os.environ.get("D2_LLM_CSV", f"/workspace/SimToken/d2_llm_{args.eval_split}_{limit}.csv")
-    os.makedirs(os.path.dirname(os.path.abspath(csv_path)), exist_ok=True)
-    with open(csv_path, "w", newline="") as f:
-        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
-        writer.writeheader()
-        writer.writerows(rows)
-    print(f"\nSaved CSV: {csv_path}")
-if __name__ == "__main__":
-    main()
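
The scoring rule in `d2_scores_llm` above can be restated without PyTorch for a single frame. This is a minimal sketch rather than the repository's code: the helper names `normalize`, `masked_mean`, and `d2_score` are illustrative, and tokens are plain Python lists. It mirrors the same steps: a soft-mask-weighted mean of the tokens inside and outside the mask, L2 normalization, and a cosine with the hidden state `h`, with the outside term weighted by `beta`.

```python
import math


def normalize(v, eps=1e-12):
    # L2-normalize a vector, guarding against a zero norm.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def masked_mean(tokens, weights, eps=1e-6):
    # Weighted mean of token vectors; weights are soft mask values in [0, 1].
    total = max(sum(weights), eps)
    dim = len(tokens[0])
    return [sum(w * t[d] for t, w in zip(tokens, weights)) / total for d in range(dim)]


def d2_score(tokens, mask, h, beta):
    # cos(z_in, h) - beta * cos(z_out, h) for one frame.
    comp = [1.0 - m for m in mask]
    z_in = normalize(masked_mean(tokens, mask))
    z_out = normalize(masked_mean(tokens, comp))
    h_unit = normalize(h)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return dot(z_in, h_unit) - beta * dot(z_out, h_unit)


# Two orthogonal tokens; the mask selects only the first one.
tokens = [[1.0, 0.0], [0.0, 1.0]]
mask = [1.0, 0.0]
aligned = d2_score(tokens, mask, h=[1.0, 0.0], beta=0.5)
misaligned = d2_score(tokens, mask, h=[0.0, 1.0], beta=0.5)
print(aligned, misaligned)
```

A hidden state aligned with the masked region scores higher than one aligned with the background, which is the contrast that the `real`, `random`, `shuffled`, and `wrong_ref` controls are set up to probe.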

decoder_invariance_check.py DELETED

@@ -1,256 +0,0 @@
-import csv
-import os
-import random
-from functools import partial
-import numpy as np
-import torch
-import transformers
-from peft import LoraConfig, get_peft_model
-from torch.utils.data import DataLoader
-from transformers import AutoConfig
-from configs import args
-from datasets import REFAVS
-from load_model import collate_fn, dict_to_cuda
-from models.avs_model import Simtoken_ForCausalLM
-def set_seed(seed=42):
-    torch.manual_seed(seed)
-    np.random.seed(seed)
-    random.seed(seed)
-    torch.cuda.manual_seed_all(seed)
-    torch.backends.cudnn.deterministic = True
-    torch.backends.cudnn.benchmark = False
-def find_lora_target_modules(model, target_modules=("q_proj", "v_proj")):
-    modules = set()
-    excluded = [
-        "visual_model",
-        "vision_tower",
-        "mm_projector",
-        "text_hidden_fcs",
-        "audio_feature_layer",
-    ]
-    for name, module in model.named_modules():
-        if not isinstance(module, torch.nn.Linear):
-            continue
-        if any(x in name for x in excluded):
-            continue
-        if any(x in name for x in target_modules):
-            modules.add(name)
-    return sorted(modules)
-def build_model(tokenizer, seg_token_idx):
-    model_args = {
-        "train_mask_decoder": True,
-        "out_dim": 256,
-        "ce_loss_weight": 1.0,
-        "dice_loss_weight": 0.5,
-        "bce_loss_weight": 2.0,
-        "seg_token_idx": seg_token_idx,
-        "vision_pretrained": args.vision_pretrained,
-        "vision_tower": args.vision_tower,
-        "use_im_start_end": False,
-        "compress": args.compress,
-        "start": args.start,
-    }
-    model = Simtoken_ForCausalLM.from_pretrained(
-        args.mllm,
-        torch_dtype=torch.bfloat16,
-        low_cpu_mem_usage=True,
-        **model_args,
-    )
-    model.config.eos_token_id = tokenizer.eos_token_id
-    model.config.bos_token_id = tokenizer.bos_token_id
-    model.config.pad_token_id = tokenizer.pad_token_id
-    model.get_model().initialize_vision_modules(model.get_model().config)
-    vision_tower = model.get_model().get_vision_tower()
-    vision_tower.to(dtype=torch.float32, device="cuda")
-    model_args_from_pt = AutoConfig.from_pretrained(args.mllm)
-    model_args_from_pt.use_cluster = True
-    model_args_from_pt.freeze = False
-    model_args_from_pt.mm_tune = True
-    model_args_from_pt.spatial_cluster_rate0 = 64
-    model_args_from_pt.spatial_cluster_rate1 = 32
-    model_args_from_pt.spatial_cluster_rate2 = 16
-    model_args_from_pt.temporal_cluster_rate = 0.0625
-    model_args_from_pt.vision_tune = False
-    model.get_model().initialize_cluster_modules(model_args_from_pt)
-    model.get_model().initialize_lisa_modules(model.get_model().config)
-    lora_config = LoraConfig(
-        r=8,
-        lora_alpha=16,
-        target_modules=find_lora_target_modules(model),
-        lora_dropout=0.05,
-        bias="none",
-        task_type="CAUSAL_LM",
-    )
-    model = get_peft_model(model, lora_config)
-    model = model.to("cuda")
-    model.resize_token_embeddings(len(tokenizer))
-    state = torch.load(args.saved_model, map_location="cpu")
-    missing, unexpected = model.load_state_dict(state, strict=False)
-    print(f"Loaded checkpoint: {args.saved_model}")
-    print(f"Missing keys: {len(missing)} | Unexpected keys: {len(unexpected)}")
-    model.eval()
-    return model
-def get_seg_embedding(model, batch):
-    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
-        output = model.forward(
-            images=batch["images"],
-            images_clip=batch["images_clip"],
-            audio_features=batch["audio_feats"],
-            image_features=batch["image_feats"],
-            input_ids=batch["input_ids"],
-            labels=batch["labels"],
-            attention_masks=batch["attention_masks"],
-            masks_list=batch["masks"],
-            resize_list=batch["resizes"],
-            orgsize_list=batch["orgsizes"],
-            conversation_list=batch["convs"],
-            refs_num=batch["refs_num"],
-            fids=batch["fids"],
-            vids=batch["vids"],
-            contrast=args.ct_weight,
-            ref_ids=batch["ref_ids"],
-            inference=True,
-        )
-    return output["seg_embeddings"][0][0:1]
-def check_one_sample(model, batch):
-    q = get_seg_embedding(model, batch)
-    image_embeddings = batch["image_feats"][0]
-    visual_model = model.get_model().visual_model
-    sparse, dense = visual_model.prompt_encoder(
-        points=None,
-        boxes=None,
-        masks=None,
-        text_embeds=q.unsqueeze(1),
-    )
-    sparse = sparse.to(q.dtype)
-    dense = dense.to(q.dtype)
-    decoder = visual_model.mask_decoder
-    image_pe = visual_model.prompt_encoder.get_dense_pe()
-    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
-        full_masks, full_iou = decoder(
-            image_embeddings=image_embeddings,
-            image_pe=image_pe,
-            sparse_prompt_embeddings=sparse,
-            dense_prompt_embeddings=dense,
-            multimask_output=False,
-        )
-        rows = []
-        for t in range(image_embeddings.shape[0]):
-            single_masks, single_iou = decoder(
-                image_embeddings=image_embeddings[t : t + 1],
-                image_pe=image_pe,
-                sparse_prompt_embeddings=sparse,
-                dense_prompt_embeddings=dense,
-                multimask_output=False,
-            )
-            diff = (full_masks[t : t + 1] - single_masks).float().abs()
-            iou_diff = (full_iou[t : t + 1] - single_iou).float().abs()
-            rows.append(
-                {
-                    "vid": batch["vids"][0],
-                    "ref": batch["refs"][0][0],
-                    "frame": t,
-                    "max_abs_diff": diff.max().item(),
-                    "mean_abs_diff": diff.mean().item(),
-                    "iou_pred_diff": iou_diff.max().item(),
-                }
-            )
-    return rows
-def main():
-    set_seed(42)
-    torch.set_grad_enabled(False)
-    tokenizer = transformers.AutoTokenizer.from_pretrained(
-        args.mllm,
-        cache_dir=None,
-        model_max_length=2048,
-        padding_side="right",
-        use_fast=False,
-    )
-    tokenizer.pad_token = tokenizer.unk_token
-    tokenizer.add_tokens("[SEG]")
-    seg_token_idx = tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
-    dataset = REFAVS(args.eval_split, args, tokenizer, input_type="refer")
-    loader = DataLoader(
-        dataset,
-        batch_size=1,
-        shuffle=False,
-        num_workers=0,
-        collate_fn=partial(collate_fn, tokenizer=tokenizer),
-    )
-    limit = args.max_eval_rows if args.max_eval_rows > 0 else 1
-    print(f"Split: {args.eval_split} | samples to check: {limit}")
-    model = build_model(tokenizer, seg_token_idx)
-    all_rows = []
-    for sample_idx, batch in enumerate(loader):
-        if sample_idx >= limit:
-            break
-        batch = dict_to_cuda(batch)
-        rows = check_one_sample(model, batch)
-        all_rows.extend(rows)
-        print(f"\nSample {sample_idx}: vid={rows[0]['vid']} ref={rows[0]['ref']}")
-        print("frame | max_abs_diff | mean_abs_diff | iou_pred_diff")
-        for row in rows:
-            print(
-                f"{row['frame']:02d} | "
-                f"{row['max_abs_diff']:.8e} | "
-                f"{row['mean_abs_diff']:.8e} | "
-                f"{row['iou_pred_diff']:.8e}"
-            )
-    if not all_rows:
-        raise RuntimeError("No rows were checked. Is the selected split empty?")
-    max_diff = max(row["max_abs_diff"] for row in all_rows)
-    mean_diff = sum(row["mean_abs_diff"] for row in all_rows) / len(all_rows)
-    max_iou_diff = max(row["iou_pred_diff"] for row in all_rows)
-    print("\nSummary")
-    print(f"checked frames: {len(all_rows)}")
-    print(f"global max_abs_diff: {max_diff:.8e}")
-    print(f"average mean_abs_diff: {mean_diff:.8e}")
-    print(f"global max_iou_pred_diff: {max_iou_diff:.8e}")
-    csv_path = os.environ.get("DECODER_INVARIANCE_CSV")
-    if csv_path:
-        os.makedirs(os.path.dirname(os.path.abspath(csv_path)), exist_ok=True)
-        with open(csv_path, "w", newline="") as f:
-            writer = csv.DictWriter(f, fieldnames=list(all_rows[0].keys()))
-            writer.writeheader()
-            writer.writerows(all_rows)
-        print(f"Saved CSV: {csv_path}")
-if __name__ == "__main__":
-    main()
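
The check in `check_one_sample` above compares a batched decoder call against per-frame calls. The same pattern can be illustrated on toy functions. This is a hedged sketch under stated assumptions: `batch_invariance_report`, `per_frame`, and `batch_centered` are hypothetical names, and simple list functions stand in for the mask decoder.

```python
def batch_invariance_report(fn, frames):
    # Run fn on the full batch, then on each frame alone, and record the gap.
    full = fn(frames)
    rows = []
    for t, frame in enumerate(frames):
        single = fn([frame])[0]
        diffs = [abs(a - b) for a, b in zip(full[t], single)]
        rows.append(
            {
                "frame": t,
                "max_abs_diff": max(diffs),
                "mean_abs_diff": sum(diffs) / len(diffs),
            }
        )
    return rows


def per_frame(batch):
    # Batch-invariant: each frame is processed independently.
    return [[2.0 * x for x in frame] for frame in batch]


def batch_centered(batch):
    # Not batch-invariant: the output depends on a statistic of the whole batch.
    count = sum(len(frame) for frame in batch)
    mean = sum(sum(frame) for frame in batch) / count
    return [[x - mean for x in frame] for frame in batch]


frames = [[1.0, 2.0], [3.0, 5.0]]
invariant_rows = batch_invariance_report(per_frame, frames)
coupled_rows = batch_invariance_report(batch_centered, frames)
print(invariant_rows)
print(coupled_rows)
```

A function that treats frames independently reports zero difference, while one that mixes information across the batch does not. With a real decoder under bfloat16 autocast, small nonzero differences may also come from numerics rather than cross-frame coupling.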

dev_subsets_rpb_v1.json DELETED

@@ -1,620 +0,0 @@
-{
-  "metadata": {
-    "seed": 42,
-    "split_sizes": {
-      "train": 14113,
-      "test_s": 2288,
-      "test_u": 1656,
-      "test_n": 1028
-    },
-    "source_metadata": "/workspace/SimToken/data/metadata.csv"
-  },
-  "subsets": {
-    "test_s": [
-      6,
-      16,
-      36,
-      71,
-      74,
-      88,
-      108,
-      114,
-      116,
-      122,
-      126,
-      128,
-      134,
-      138,
-      139,
-      146,
-      152,
-      159,
-      177,
-      196,
-      217,
-      219,
-      249,
-      256,
-      268,
-      276,
-      279,
-      286,
-      287,
-      297,
-      298,
-      299,
-      312,
-      313,
-      324,
-      331,
-      332,
-      347,
-      378,
-      383,
-      402,
-      410,
-      412,
-      420,
-      451,
-      452,
-      458,
-      467,
-      477,
-      484,
-      486,
-      497,
-      499,
-      512,
-      526,
-      533,
-      543,
-      550,
-      551,
-      567,
-      574,
-      576,
-      581,
-      594,
-      596,
-      608,
-      616,
-      625,
-      627,
-      642,
-      646,
-      663,
-      692,
-      700,
-      704,
-      724,
-      745,
-      754,
-      795,
-      815,
-      819,
-      831,
-      843,
-      854,
-      867,
-      895,
-      946,
-      953,
-      965,
-      975,
-      979,
-      989,
-      1004,
-      1007,
-      1008,
-      1010,
-      1023,
-      1039,
-      1051,
-      1052,
-      1072,
-      1075,
-      1080,
-      1088,
-      1099,
-      1101,
-      1104,
-      1106,
-      1134,
-      1138,
-      1169,
-      1180,
-      1201,
-      1205,
-      1221,
-      1230,
-      1247,
-      1258,
-      1272,
-      1279,
-      1284,
-      1294,
-      1297,
-      1312,
-      1329,
-      1339,
-      1343,
-      1367,
-      1379,
-      1406,
-      1417,
-      1461,
-      1462,
-      1468,
-      1473,
-      1474,
-      1489,
-      1493,
-      1500,
-      1510,
-      1517,
-      1552,
-      1556,
-      1557,
-      1589,
-      1609,
-      1612,
-      1618,
-      1622,
-      1624,
-      1644,
-      1647,
-      1665,
-      1669,
-      1676,
-      1682,
-      1683,
-      1691,
-      1700,
-      1726,
-      1746,
-      1748,
-      1758,
-      1764,
-      1765,
-      1778,
-      1785,
-      1786,
-      1808,
-      1826,
-      1852,
-      1861,
-      1883,
-      1891,
-      1916,
-      1938,
-      1944,
-      1967,
-      1971,
-      1980,
-      1986,
-      2034,
-      2044,
-      2067,
-      2074,
-      2082,
-      2085,
-      2118,
-      2128,
-      2156,
-      2176,
-      2182,
-      2185,
-      2188,
-      2194,
-      2206,
-      2211,
-      2215,
-      2247,
-      2256
-    ],
-    "test_u": [
-      4,
-      16,
-      26,
-      38,
-      40,
-      48,
-      50,
-      65,
-      83,
-      92,
-      102,
-      117,
-      120,
-      135,
-      144,
-      153,
-      155,
-      185,
-      200,
-      201,
-      211,
-      219,
-      221,
-      226,
-      227,
-      240,
-      245,
-      251,
-      252,
-      255,
-      267,
-      272,
-      274,
-      276,
-      278,
-      282,
-      284,
-      286,
-      303,
-      309,
-      313,
-      328,
-      345,
-      348,
-      358,
-      363,
-      374,
-      376,
-      379,
-      383,
-      385,
-      387,
-      393,
-      396,
-      400,
-      412,
-      417,
-      428,
-      434,
-      452,
-      453,
-      456,
-      459,
-      463,
-      473,
-      490,
-      493,
-      504,
-      517,
-      525,
-      535,
-      543,
-      544,
-      545,
-      549,
-      550,
-      565,
-      584,
-      585,
-      594,
-      602,
-      603,
-      606,
-      638,
-      642,
-      643,
-      651,
-      684,
-      687,
-      692,
-      700,
-      721,
-      728,
-      752,
-      757,
-      779,
-      783,
-      785,
-      794,
-      803,
-      807,
-      814,
-      847,
-      849,
-      853,
-      854,
-      861,
-      867,
-      884,
-      900,
-      903,
-      906,
-      924,
-      930,
-      931,
-      941,
-      948,
-      957,
-      968,
-      972,
-      980,
-      987,
-      995,
-      996,
-      1007,
-      1009,
-      1028,
-      1033,
-      1034,
-      1040,
-      1054,
-      1098,
-      1104,
-      1111,
-      1121,
-      1126,
-      1134,
-      1155,
-      1161,
-      1167,
-      1180,
-      1186,
-      1192,
-      1212,
-      1214,
-      1219,
-      1226,
-      1254,
-      1256,
-      1259,
-      1261,
-      1270,
-      1278,
-      1285,
-      1288,
-      1290,
-      1305,
-      1310,
-      1323,
-      1325,
-      1343,
-      1360,
-      1375,
-      1376,
-      1404,
-      1411,
-      1426,
-      1429,
-      1442,
-      1449,
-      1452,
-      1456,
-      1475,
-      1478,
-      1479,
-      1484,
-      1493,
-      1499,
-      1500,
-      1501,
-      1506,
-      1517,
-      1523,
-      1528,
-      1536,
-      1545,
-      1546,
-      1550,
-      1561,
-      1570,
-      1598,
-      1609,
-      1611,
-      1625,
-      1632,
-      1634,
-      1635,
-      1641,
-      1654,
-      1655
-    ],
-    "test_n": [
-      4,
-      5,
-      9,
-      16,
-      20,
-      25,
-      27,
-      33,
-      37,
-      40,
-      45,
-      46,
-      48,
-      53,
-      56,
-      60,
-      62,
-      67,
-      77,
-      78,
-      80,
-      81,
-      86,
-      90,
-      94,
-      99,
-      102,
-      106,
-      108,
-      111,
-      116,
-      121,
-      126,
-      127,
-      132,
-      143,
-      148,
-      153,
-      155,
-      156,
-      158,
-      160,
-      164,
-      168,
-      170,
-      171,
-      173,
-      175,
-      183,
-      184,
-      185,
-      188,
-      189,
-      190,
-      196,
-      202,
-      206,
-      208,
-      212,
-      217,
-      221,
-      222,
-      223,
-      233,
-      242,
-      246,
-      247,
-      259,
-      262,
-      269,
-      283,
-      298,
-      299,
-      306,
-      316,
-      317,
-      323,
-      330,
-      332,
-      334,
-      354,
-      357,
-      367,
-      372,
-      395,
-      397,
-      400,
-      405,
-      407,
-      420,
-      431,
-      435,
-      436,
-      444,
-      446,
-      461,
-      464,
-      470,
-      479,
-      481,
-      483,
-      485,
-      487,
-      494,
-      512,
-      516,
-      520,
-      524,
-      529,
-      530,
-      539,
-      540,
-      541,
-      554,
-      559,
-      560,
-      564,
-      568,
-      571,
-      572,
-      576,
-      577,
-      581,
-      585,
-      592,
-      602,
-      609,
-      620,
-      630,
-      632,
-      677,
-      678,
-      684,
-      693,
-      694,
-      695,
-      702,
-      716,
-      724,
-      727,
-      732,
-      735,
-      736,
-      747,
-      750,
-      752,
-      755,
-      758,
-      764,
-      767,
-      774,
-      775,
-      777,
-      779,
-      780,
-      782,
-      795,
-      800,
-      812,
-      815,
-      818,
-      821,
-      823,
-      825,
-      828,
-      834,
-      841,
-      843,
-      846,
-      848,
-      860,
-      861,
-      863,
-      869,
-      871,
-      878,
-      882,
-      891,
-      893,
-      896,
-      898,
-      899,
-      901,
-      906,
-      930,
-      940,
-      944,
-      969,
-      970,
-      973,
-      980,
-      990,
-      993,
-      996,
-      997,
-      1007,
-      1012,
-      1013,
-      1019,
-      1025
-    ]
-  }
-}
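
The manifest above lists 200 sorted indices for each test split and records `"seed": 42`. A manifest of this shape could be produced as follows. This is only a sketch: it assumes indices are drawn uniformly without replacement, the function name `sample_dev_subsets` is hypothetical, and it is not expected to reproduce the exact indices listed, since the script that generated them is not shown here.

```python
import json
import random


def sample_dev_subsets(split_sizes, per_split, seed):
    # Draw `per_split` distinct indices per split from one seeded generator.
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(range(size), per_split))
        for name, size in split_sizes.items()
    }


# Test split sizes copied from the manifest metadata.
test_sizes = {"test_s": 2288, "test_u": 1656, "test_n": 1028}
manifest = {
    "metadata": {"seed": 42, "split_sizes": test_sizes},
    "subsets": sample_dev_subsets(test_sizes, per_split=200, seed=42),
}
print(json.dumps(manifest)[:120])
```

Using one seeded `random.Random` instance keeps the draw reproducible without touching the global random state.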

log/rpb_dev_eval_baseline_step0.txt DELETED

@@ -1,5 +0,0 @@
-Epoch 0: running_loss 0.004542401526123285  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.7255374467872275  true fscore 0.8181094569922425
-valuate on test_u_refer:  miou 0.68531153425507  true fscore 0.7723772643739357
- valuate on  test_n_refer:   metric 0.014519116841256618

log/rpb_dev_eval_pm_only_a02_step0.txt DELETED

@@ -1,7 +0,0 @@
-Epoch 0: running_loss 0.013856410048902035  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.7251653336426284  true fscore 0.8137564373598434
-bridge on test_s_refer: cos_delta_p_mask_mean=0.752373 | cos_delta_q_mean=-0.063845 | cos_delta_z_gt_mean=0.066832 | cos_p_hat_p_mask_mean=0.095022 | cos_p_hat_q_mean=0.991696 | cos_p_hat_z_gt_mean=0.058512 | cos_p_mask_z_gt_mean=0.064319 | delta_norm_mean=4.838175 | gate_mean=0.642605 | gate_std=0.066554 | p_hat_norm_mean=37.143986 | p_mask_norm_mean=0.855194 | q_norm_mean=37.143986 | z_gt_norm_mean=1.270137
-valuate on test_u_refer:  miou 0.6859597001315854  true fscore 0.7773032036889345
-bridge on test_u_refer: cos_delta_p_mask_mean=0.752107 | cos_delta_q_mean=-0.052752 | cos_delta_z_gt_mean=0.059016 | cos_p_hat_p_mask_mean=0.066111 | cos_p_hat_q_mean=0.994380 | cos_p_hat_z_gt_mean=0.056506 | cos_p_mask_z_gt_mean=0.056127 | delta_norm_mean=3.232154 | gate_mean=0.529798 | gate_std=0.041540 | p_hat_norm_mean=30.350392 | p_mask_norm_mean=0.854621 | q_norm_mean=30.350392 | z_gt_norm_mean=1.131404
- valuate on  test_n_refer:   metric 0.014255181886255741
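
The `bridge on ...` lines in these logs use a `key=value | key=value` layout, so they can be read into a dictionary for comparison across runs. This is a minimal sketch: `parse_bridge_line` is a hypothetical helper, not part of the repository, and it assumes every value is a plain float.

```python
def parse_bridge_line(line):
    # Drop a leading diff marker, then split "bridge on <split>: k=v | k=v ...".
    head, _, body = line.lstrip("-").strip().partition(": ")
    split_name = head.split(" on ", 1)[1].strip()
    stats = {}
    for field in body.split("|"):
        key, sep, value = field.strip().partition("=")
        if sep:
            stats[key] = float(value)
    return split_name, stats


sample = (
    "-bridge on test_s_refer: cos_delta_p_mask_mean=0.752373 | "
    "cos_delta_q_mean=-0.063845 | delta_norm_mean=4.838175 | gate_mean=0.642605"
)
split_name, stats = parse_bridge_line(sample)
print(split_name, stats["gate_mean"], stats["delta_norm_mean"])
```

With the statistics in a dictionary, quantities such as `gate_mean` or `delta_norm_mean` can be tabulated across the different runs.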

log/rpb_dev_mixed_pm_only_a015_wm005.txt DELETED

@@ -1,11 +0,0 @@
-Epoch 0: running_loss 0.12634180719032884  Learning Rate:0.000048
-Epoch 1: running_loss 0.06299160566413775  Learning Rate:0.000038
-Epoch 2: running_loss 0.04188278445508331  Learning Rate:0.000021
-Epoch 3: running_loss 0.03136271081166342  Learning Rate:0.000006
-Epoch 4: running_loss 0.025073944311589002  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.7268448945908449  true fscore 0.8160740848700516
-bridge on test_s_refer: cos_delta_p_mask_mean=0.780949 | cos_delta_q_mean=-0.022341 | cos_delta_z_gt_mean=0.080238 | cos_p_hat_p_mask_mean=0.033820 | cos_p_hat_q_mean=0.998889 | cos_p_hat_z_gt_mean=0.053521 | cos_p_mask_z_gt_mean=0.064319 | delta_norm_mean=1.741799 | gate_mean=0.298187 | gate_std=0.074034 | p_hat_norm_mean=37.144979 | p_mask_norm_mean=0.855194 | q_norm_mean=37.144979 | z_gt_norm_mean=1.270137
-valuate on test_u_refer:  miou 0.6867437321859904  true fscore 0.774193259445019
-bridge on test_u_refer: cos_delta_p_mask_mean=0.787519 | cos_delta_q_mean=-0.014046 | cos_delta_z_gt_mean=0.070144 | cos_p_hat_p_mask_mean=0.008821 | cos_p_hat_q_mean=0.999587 | cos_p_hat_z_gt_mean=0.052258 | cos_p_mask_z_gt_mean=0.056127 | delta_norm_mean=0.869715 | gate_mean=0.187340 | gate_std=0.030662 | p_hat_norm_mean=30.349741 | p_mask_norm_mean=0.854621 | q_norm_mean=30.349741 | z_gt_norm_mean=1.131404
- valuate on  test_n_refer:   metric 0.014510215260088444

log/rpb_dev_mixed_pm_only_a018_wm005.txt DELETED

@@ -1,11 +0,0 @@
-Epoch 0: running_loss 0.12581317650619894  Learning Rate:0.000048
-Epoch 1: running_loss 0.0626903815427795  Learning Rate:0.000038
-Epoch 2: running_loss 0.04165894452792903  Learning Rate:0.000021
-Epoch 3: running_loss 0.031184122432023287  Learning Rate:0.000006
-Epoch 4: running_loss 0.024928097636438905  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.727035479994347  true fscore 0.8155373766715638
-bridge on test_s_refer: cos_delta_p_mask_mean=0.779142 | cos_delta_q_mean=-0.026866 | cos_delta_z_gt_mean=0.080963 | cos_p_hat_p_mask_mean=0.040792 | cos_p_hat_q_mean=0.998394 | cos_p_hat_z_gt_mean=0.054268 | cos_p_mask_z_gt_mean=0.064319 | delta_norm_mean=2.094408 | gate_mean=0.298949 | gate_std=0.074175 | p_hat_norm_mean=37.145271 | p_mask_norm_mean=0.855194 | q_norm_mean=37.145271 | z_gt_norm_mean=1.270137
-valuate on test_u_refer:  miou 0.6870561258980442  true fscore 0.774542552176863
-bridge on test_u_refer: cos_delta_p_mask_mean=0.786014 | cos_delta_q_mean=-0.016895 | cos_delta_z_gt_mean=0.071182 | cos_p_hat_p_mask_mean=0.013252 | cos_p_hat_q_mean=0.999403 | cos_p_hat_z_gt_mean=0.052698 | cos_p_mask_z_gt_mean=0.056127 | delta_norm_mean=1.046129 | gate_mean=0.187813 | gate_std=0.030748 | p_hat_norm_mean=30.349577 | p_mask_norm_mean=0.854621 | q_norm_mean=30.349577 | z_gt_norm_mean=1.131404
- valuate on  test_n_refer:   metric 0.014507208950817585

log/rpb_dev_pm_only_a012.txt DELETED

@@ -1,11 +0,0 @@
-Epoch 0: running_loss 0.1251933453604579  Learning Rate:0.000291
-Epoch 1: running_loss 0.06243458506651223  Learning Rate:0.000225
-Epoch 2: running_loss 0.04142383218277246  Learning Rate:0.000124
-Epoch 3: running_loss 0.030912025278666988  Learning Rate:0.000035
-Epoch 4: running_loss 0.024670254811644553  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.7265147582390341  true fscore 0.8174789174459874
-bridge on test_s_refer: cos_delta_p_mask_mean=0.785657 | cos_delta_q_mean=-0.012593 | cos_delta_z_gt_mean=0.074588 | cos_p_hat_p_mask_mean=0.018714 | cos_p_hat_q_mean=0.999648 | cos_p_hat_z_gt_mean=0.051832 | cos_p_mask_z_gt_mean=0.064319 | delta_norm_mean=0.980784 | gate_mean=0.209955 | gate_std=0.050712 | p_hat_norm_mean=37.145389 | p_mask_norm_mean=0.855194 | q_norm_mean=37.145389 | z_gt_norm_mean=1.270137
-valuate on test_u_refer:  miou 0.685781483513075  true fscore 0.7731429794151335
-bridge on test_u_refer: cos_delta_p_mask_mean=0.790605 | cos_delta_q_mean=-0.008125 | cos_delta_z_gt_mean=0.065258 | cos_p_hat_p_mask_mean=-0.000455 | cos_p_hat_q_mean=0.999863 | cos_p_hat_z_gt_mean=0.051334 | cos_p_mask_z_gt_mean=0.056127 | delta_norm_mean=0.502185 | gate_mean=0.135438 | gate_std=0.020096 | p_hat_norm_mean=30.347839 | p_mask_norm_mean=0.854621 | q_norm_mean=30.347839 | z_gt_norm_mean=1.131404
- valuate on  test_n_refer:   metric 0.014490844681859016

log/rpb_dev_pm_only_a015.txt DELETED

@@ -1,11 +0,0 @@
-Epoch 0: running_loss 0.12516111659351736  Learning Rate:0.000291
-Epoch 1: running_loss 0.06237624154891819  Learning Rate:0.000225
-Epoch 2: running_loss 0.04133288407077392  Learning Rate:0.000124
-Epoch 3: running_loss 0.03080323277390562  Learning Rate:0.000035
-Epoch 4: running_loss 0.024568469962105155  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.7266912544447951  true fscore 0.8172510598856024
-bridge on test_s_refer: cos_delta_p_mask_mean=0.784637 | cos_delta_q_mean=-0.015801 | cos_delta_z_gt_mean=0.074893 | cos_p_hat_p_mask_mean=0.023727 | cos_p_hat_q_mean=0.999446 | cos_p_hat_z_gt_mean=0.052317 | cos_p_mask_z_gt_mean=0.064319 | delta_norm_mean=1.230677 | gate_mean=0.210794 | gate_std=0.050954 | p_hat_norm_mean=37.144974 | p_mask_norm_mean=0.855194 | q_norm_mean=37.144974 | z_gt_norm_mean=1.270137
-valuate on test_u_refer:  miou 0.6856936469832761  true fscore 0.7733012911863625
-bridge on test_u_refer: cos_delta_p_mask_mean=0.789761 | cos_delta_q_mean=-0.010194 | cos_delta_z_gt_mean=0.065751 | cos_p_hat_p_mask_mean=0.002815 | cos_p_hat_q_mean=0.999784 | cos_p_hat_z_gt_mean=0.051617 | cos_p_mask_z_gt_mean=0.056127 | delta_norm_mean=0.630081 | gate_mean=0.135950 | gate_std=0.020168 | p_hat_norm_mean=30.349286 | p_mask_norm_mean=0.854621 | q_norm_mean=30.349286 | z_gt_norm_mean=1.131404
- valuate on  test_n_refer:   metric 0.014483190141618252

log/rpb_dev_pm_only_a018.txt DELETED

@@ -1,11 +0,0 @@
-Epoch 0: running_loss 0.12512886058539152  Learning Rate:0.000291
-Epoch 1: running_loss 0.062317848962266  Learning Rate:0.000225
-Epoch 2: running_loss 0.04124188135998944  Learning Rate:0.000124
-Epoch 3: running_loss 0.03069439489627257  Learning Rate:0.000035
-Epoch 4: running_loss 0.024466648511588574  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.7269170961743339  true fscore 0.817047117385082
-bridge on test_s_refer: cos_delta_p_mask_mean=0.783528 | cos_delta_q_mean=-0.019011 | cos_delta_z_gt_mean=0.075155 | cos_p_hat_p_mask_mean=0.028732 | cos_p_hat_q_mean=0.999199 | cos_p_hat_z_gt_mean=0.052798 | cos_p_mask_z_gt_mean=0.064319 | delta_norm_mean=1.480661 | gate_mean=0.211391 | gate_std=0.051102 | p_hat_norm_mean=37.145608 | p_mask_norm_mean=0.855194 | q_norm_mean=37.145608 | z_gt_norm_mean=1.270137
-valuate on test_u_refer:  miou 0.6859480822706291  true fscore 0.7735356919141486
-bridge on test_u_refer: cos_delta_p_mask_mean=0.788825 | cos_delta_q_mean=-0.012263 | cos_delta_z_gt_mean=0.066219 | cos_p_hat_p_mask_mean=0.006046 | cos_p_hat_q_mean=0.999688 | cos_p_hat_z_gt_mean=0.051902 | cos_p_mask_z_gt_mean=0.056127 | delta_norm_mean=0.757877 | gate_mean=0.136287 | gate_std=0.020245 | p_hat_norm_mean=30.346972 | p_mask_norm_mean=0.854621 | q_norm_mean=30.346972 | z_gt_norm_mean=1.131404
- valuate on  test_n_refer:   metric 0.014475596137344837

log/rpb_dev_qonly_pm_only_a018.txt DELETED

@@ -1,11 +0,0 @@
-Epoch 0: running_loss 0.1250931837130338  Learning Rate:0.000291
-Epoch 1: running_loss 0.06158186250831932  Learning Rate:0.000225
-Epoch 2: running_loss 0.03905615148444971  Learning Rate:0.000124
-Epoch 3: running_loss 0.028493995574535802  Learning Rate:0.000035
-Epoch 4: running_loss 0.022694221674464644  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.7231086666105239  true fscore 0.8120589338685386
-bridge on test_s_refer: cos_delta_p_mask_mean=0.740588 | cos_delta_q_mean=-0.082204 | cos_delta_z_gt_mean=0.083615 | cos_p_hat_p_mask_mean=0.120609 | cos_p_hat_q_mean=0.986413 | cos_p_hat_z_gt_mean=0.063688 | cos_p_mask_z_gt_mean=0.064319 | delta_norm_mean=6.165701 | gate_mean=0.922904 | gate_std=0.048146 | p_hat_norm_mean=37.145128 | p_mask_norm_mean=0.855194 | q_norm_mean=37.145128 | z_gt_norm_mean=1.270137
-valuate on test_u_refer:  miou 0.6828930461963626  true fscore 0.7766606059018523
-bridge on test_u_refer: cos_delta_p_mask_mean=0.750842 | cos_delta_q_mean=-0.072793 | cos_delta_z_gt_mean=0.080115 | cos_p_hat_p_mask_mean=0.095975 | cos_p_hat_q_mean=0.989300 | cos_p_hat_z_gt_mean=0.061951 | cos_p_mask_z_gt_mean=0.056127 | delta_norm_mean=4.458672 | gate_mean=0.815494 | gate_std=0.064275 | p_hat_norm_mean=30.349046 | p_mask_norm_mean=0.854621 | q_norm_mean=30.349046 | z_gt_norm_mean=1.131404
- valuate on  test_n_refer:   metric 0.014240134507417679

log/rpb_e1_baseline.txt DELETED

@@ -1,5 +0,0 @@
-Epoch 0: running_loss 0.0045423684641718864  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.7299158895817891  true fscore 0.8098922965396196
-valuate on test_u_refer:  miou 0.7330115197712439  true fscore 0.8183729078620672
- valuate on  test_n_refer:   metric 0.1223459392786026

log/rpb_e4_min.txt DELETED

@@ -1,16 +0,0 @@
-Epoch 0: running_loss 7.052718125283718  Learning Rate:0.000097
-Epoch 1: running_loss 3.5262171775102615  Learning Rate:0.000075
-Epoch 2: running_loss 2.35092111180226  Learning Rate:0.000041
-Epoch 3: running_loss 1.7629929669201374  Learning Rate:0.000012
-Epoch 4: running_loss 1.4105001017451286  Learning Rate:0.000000
-Epoch 0: running_loss 7.052717879414558  Learning Rate:0.000097
-Epoch 1: running_loss 3.526217419654131  Learning Rate:0.000075
-Epoch 2: running_loss 2.3509211614727974  Learning Rate:0.000041
-Epoch 3: running_loss 1.762992987409234  Learning Rate:0.000012
-Epoch 4: running_loss 1.410500232875347  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.010701371397460661  true fscore 0.16367542997933923
-bridge on test_s_refer: cos_p_hat_p_mask_mean=-0.003076 | cos_p_hat_q_mean=1.000000 | cos_p_hat_z_gt_mean=0.031631 | cos_p_mask_z_gt_mean=0.072929 | delta_norm_mean=0.003709 | gate_mean=0.019151 | gate_std=0.000754 | p_hat_norm_mean=6.222885 | p_mask_norm_mean=0.854909 | q_norm_mean=6.223040 | z_gt_norm_mean=1.275222
-valuate on test_u_refer:  miou 0.03141531638093511  true fscore 0.1579975866433233
-bridge on test_u_refer: cos_p_hat_p_mask_mean=-0.004606 | cos_p_hat_q_mean=1.000000 | cos_p_hat_z_gt_mean=-0.000177 | cos_p_mask_z_gt_mean=0.081724 | delta_norm_mean=0.003449 | gate_mean=0.019014 | gate_std=0.000658 | p_hat_norm_mean=5.875611 | p_mask_norm_mean=0.855032 | q_norm_mean=5.875684 | z_gt_norm_mean=0.969146
- valuate on  test_n_refer:   metric 0.15515293180942535

log/rpb_e4_min_v2.txt DELETED

@@ -1,11 +0,0 @@
-Epoch 0: running_loss 0.2470331892836839  Learning Rate:0.000097
-Epoch 1: running_loss 0.12353144341614097  Learning Rate:0.000075
-Epoch 2: running_loss 0.08232998211557667  Learning Rate:0.000041
-Epoch 3: running_loss 0.0617638936964795  Learning Rate:0.000012
-Epoch 4: running_loss 0.04941030433401465  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.729936970449844  true fscore 0.8099028875399381
-bridge on test_s_refer: cos_p_hat_p_mask_mean=-0.009047 | cos_p_hat_q_mean=1.000000 | cos_p_hat_z_gt_mean=0.060572 | cos_p_mask_z_gt_mean=0.072929 | delta_norm_mean=0.004936 | gate_mean=0.024371 | gate_std=0.005409 | p_hat_norm_mean=36.236958 | p_mask_norm_mean=0.854909 | q_norm_mean=36.239986 | z_gt_norm_mean=1.275222
-valuate on test_u_refer:  miou 0.7330397108156467  true fscore 0.8183516443520784
-bridge on test_u_refer: cos_p_hat_p_mask_mean=-0.004755 | cos_p_hat_q_mean=1.000000 | cos_p_hat_z_gt_mean=0.013517 | cos_p_mask_z_gt_mean=0.081724 | delta_norm_mean=0.004417 | gate_mean=0.023295 | gate_std=0.004361 | p_hat_norm_mean=30.846060 | p_mask_norm_mean=0.855032 | q_norm_mean=30.848833 | z_gt_norm_mean=0.969146
- valuate on  test_n_refer:   metric 0.12235464155673981

log/rpb_probe_a1_teacher_only.txt DELETED

@@ -1,22 +0,0 @@
-Epoch 0: running_loss 0.15941409580409527  Learning Rate:0.000150
-Epoch 1: running_loss 0.07969226781278849  Learning Rate:0.000300
-Epoch 2: running_loss 0.05310918173442284  Learning Rate:0.000298
-Epoch 3: running_loss 0.03982830489985645  Learning Rate:0.000291
-Epoch 4: running_loss 0.03184974528849125  Learning Rate:0.000280
-Epoch 5: running_loss 0.02652722302203377  Learning Rate:0.000265
-Epoch 6: running_loss 0.02272333244660071  Learning Rate:0.000246
-Epoch 7: running_loss 0.019872855627909303  Learning Rate:0.000225
-Epoch 8: running_loss 0.017649518532885447  Learning Rate:0.000201
-Epoch 9: running_loss 0.015872883144766092  Learning Rate:0.000176
-Epoch 10: running_loss 0.014423399655656382  Learning Rate:0.000150
-Epoch 11: running_loss 0.013206382282078266  Learning Rate:0.000124
-Epoch 12: running_loss 0.012179449988672366  Learning Rate:0.000099
-Epoch 13: running_loss 0.011303224135190248  Learning Rate:0.000075
-Epoch 14: running_loss 0.010542566950122515  Learning Rate:0.000054
-Epoch 15: running_loss 0.0098747648880817  Learning Rate:0.000035
-Epoch 16: running_loss 0.009292871307800798  Learning Rate:0.000020
-Epoch 17: running_loss 0.008775248295731015  Learning Rate:0.000009
-Epoch 18: running_loss 0.008311718702316284  Learning Rate:0.000002
-Epoch 19: running_loss 0.007893257355317474  Learning Rate:0.000000
-valuate on train_overfit:  miou 0.8857842811448791  true fscore 0.9381048823706806
-bridge on train_overfit: cos_p_hat_p_mask_mean=0.004767 | cos_p_hat_q_mean=0.999904 | cos_p_hat_z_gt_mean=0.058385 | cos_p_mask_z_gt_mean=0.065508 | delta_norm_mean=0.571159 | gate_mean=0.425535 | gate_std=0.188610 | p_hat_norm_mean=32.916147 | p_mask_norm_mean=0.854710 | q_norm_mean=33.257832 | z_gt_norm_mean=1.191098

log/rpb_probe_a1_teacher_only_v2.txt DELETED

@@ -1,22 +0,0 @@
-Epoch 0: running_loss 0.15941409580409527  Learning Rate:0.000150
-Epoch 1: running_loss 0.0796922636218369  Learning Rate:0.000300
-Epoch 2: running_loss 0.05310917769869169  Learning Rate:0.000298
-Epoch 3: running_loss 0.03982830559834838  Learning Rate:0.000291
-Epoch 4: running_loss 0.03184974305331707  Learning Rate:0.000280
-Epoch 5: running_loss 0.02652722333247463  Learning Rate:0.000265
-Epoch 6: running_loss 0.022723329652632986  Learning Rate:0.000246
-Epoch 7: running_loss 0.019872855744324625  Learning Rate:0.000225
-Epoch 8: running_loss 0.017649516980681155  Learning Rate:0.000201
-Epoch 9: running_loss 0.015872882585972546  Learning Rate:0.000176
-Epoch 10: running_loss 0.01442340033298189  Learning Rate:0.000150
-Epoch 11: running_loss 0.013206382825349769  Learning Rate:0.000124
-Epoch 12: running_loss 0.012179449773751773  Learning Rate:0.000099
-Epoch 13: running_loss 0.011303224002144166  Learning Rate:0.000075
-Epoch 14: running_loss 0.010542566763858001  Learning Rate:0.000054
-Epoch 15: running_loss 0.00987476430600509  Learning Rate:0.000035
-Epoch 16: running_loss 0.009292872293907054  Learning Rate:0.000020
-Epoch 17: running_loss 0.0087752483992113  Learning Rate:0.000009
-Epoch 18: running_loss 0.008311718849367216  Learning Rate:0.000002
-Epoch 19: running_loss 0.007893257355317474  Learning Rate:0.000000
-valuate on train_overfit:  miou 0.8857840351993218  true fscore 0.9381047114729881
-bridge on train_overfit: cos_delta_p_mask_mean=0.354064 | cos_delta_q_mean=-0.604202 | cos_delta_z_gt_mean=0.126264 | cos_p_hat_p_mask_mean=0.004767 | cos_p_hat_q_mean=0.999904 | cos_p_hat_z_gt_mean=0.058385 | cos_p_mask_z_gt_mean=0.065508 | delta_norm_mean=0.571159 | gate_mean=0.425535 | gate_std=0.188610 | p_hat_norm_mean=32.916147 | p_mask_norm_mean=0.854710 | q_norm_mean=33.257831 | z_gt_norm_mean=1.191098

log/rpb_probe_a1p_directional_pm_only.txt DELETED

@@ -1,22 +0,0 @@
-Epoch 0: running_loss 0.11214640829712152  Learning Rate:0.000150
-Epoch 1: running_loss 0.05601485609076917  Learning Rate:0.000300
-Epoch 2: running_loss 0.03723815083503723  Learning Rate:0.000298
-Epoch 3: running_loss 0.02785203023813665  Learning Rate:0.000291
-Epoch 4: running_loss 0.022219109814614058  Learning Rate:0.000280
-Epoch 5: running_loss 0.018464789803450305  Learning Rate:0.000265
-Epoch 6: running_loss 0.01578202284872532  Learning Rate:0.000246
-Epoch 7: running_loss 0.013773231767117977  Learning Rate:0.000225
-Epoch 8: running_loss 0.012206872407760885  Learning Rate:0.000201
-Epoch 9: running_loss 0.010958488751202821  Learning Rate:0.000176
-Epoch 10: running_loss 0.009943378030915152  Learning Rate:0.000150
-Epoch 11: running_loss 0.009091336939794322  Learning Rate:0.000124
-Epoch 12: running_loss 0.00837581454274746  Learning Rate:0.000099
-Epoch 13: running_loss 0.007767901090638978  Learning Rate:0.000075
-Epoch 14: running_loss 0.007241058039168516  Learning Rate:0.000054
-Epoch 15: running_loss 0.006779163610190153  Learning Rate:0.000035
-Epoch 16: running_loss 0.006378827452221338  Learning Rate:0.000020
-Epoch 17: running_loss 0.006023053286804093  Learning Rate:0.000009
-Epoch 18: running_loss 0.005704282390836038  Learning Rate:0.000002
-Epoch 19: running_loss 0.005416269856505096  Learning Rate:0.000000
-valuate on train_overfit:  miou 0.883418077353781  true fscore 0.937678836286068
-bridge on train_overfit: cos_delta_p_mask_mean=0.818447 | cos_delta_q_mean=-0.029885 | cos_delta_z_gt_mean=0.063824 | cos_p_hat_p_mask_mean=0.047561 | cos_p_hat_q_mean=0.998200 | cos_p_hat_z_gt_mean=0.059441 | cos_p_mask_z_gt_mean=0.065508 | delta_norm_mean=2.004932 | gate_mean=0.598515 | gate_std=0.034498 | p_hat_norm_mean=33.257835 | p_mask_norm_mean=0.854710 | q_norm_mean=33.257834 | z_gt_norm_mean=1.191098

log/rpb_probe_a1p_directional_pm_only_a02.txt DELETED

@@ -1,22 +0,0 @@
-Epoch 0: running_loss 0.11209722375497222  Learning Rate:0.000150
-Epoch 1: running_loss 0.05594216543249786  Learning Rate:0.000300
-Epoch 2: running_loss 0.03709370751554767  Learning Rate:0.000298
-Epoch 3: running_loss 0.027660266729071736  Learning Rate:0.000291
-Epoch 4: running_loss 0.02200547931715846  Learning Rate:0.000280
-Epoch 5: running_loss 0.018238045663262408  Learning Rate:0.000265
-Epoch 6: running_loss 0.015544687730393239  Learning Rate:0.000246
-Epoch 7: running_loss 0.013526892522349954  Learning Rate:0.000225
-Epoch 8: running_loss 0.01195424489883913  Learning Rate:0.000201
-Epoch 9: running_loss 0.010702831950038672  Learning Rate:0.000176
-Epoch 10: running_loss 0.009686671324412931  Learning Rate:0.000150
-Epoch 11: running_loss 0.008837080444209278  Learning Rate:0.000124
-Epoch 12: running_loss 0.008126160953767024  Learning Rate:0.000099
-Epoch 13: running_loss 0.007524690058614526  Learning Rate:0.000075
-Epoch 14: running_loss 0.007005957514047622  Learning Rate:0.000054
-Epoch 15: running_loss 0.0065534417517483234  Learning Rate:0.000035
-Epoch 16: running_loss 0.006162627901443664  Learning Rate:0.000020
-Epoch 17: running_loss 0.005816713182462586  Learning Rate:0.000009
-Epoch 18: running_loss 0.005507827319793011  Learning Rate:0.000002
-Epoch 19: running_loss 0.005229406012222171  Learning Rate:0.000000
-valuate on train_overfit:  miou 0.8791497684578644  true fscore 0.9370119273662567
-bridge on train_overfit: cos_delta_p_mask_mean=0.808940 | cos_delta_q_mean=-0.059708 | cos_delta_z_gt_mean=0.061659 | cos_p_hat_p_mask_mean=0.095240 | cos_p_hat_q_mean=0.992816 | cos_p_hat_z_gt_mean=0.062994 | cos_p_mask_z_gt_mean=0.065508 | delta_norm_mean=4.005328 | gate_mean=0.600366 | gate_std=0.034520 | p_hat_norm_mean=33.257835 | p_mask_norm_mean=0.854710 | q_norm_mean=33.257836 | z_gt_norm_mean=1.191098

log/rpb_probe_eval_directional_pm_only_a02.txt DELETED

@@ -1,11 +0,0 @@
-Epoch 0: running_loss 0.12453739601187408  Learning Rate:0.000291
-Epoch 1: running_loss 0.06081169372191653  Learning Rate:0.000225
-Epoch 2: running_loss 0.039517335942946374  Learning Rate:0.000124
-Epoch 3: running_loss 0.029158065939554945  Learning Rate:0.000035
-Epoch 4: running_loss 0.02320093212183565  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.7251764057789819  true fscore 0.8044321979023517
-bridge on test_s_refer: cos_delta_p_mask_mean=0.754565 | cos_delta_q_mean=-0.062171 | cos_delta_z_gt_mean=0.077296 | cos_p_hat_p_mask_mean=0.084720 | cos_p_hat_q_mean=0.992132 | cos_p_hat_z_gt_mean=0.070147 | cos_p_mask_z_gt_mean=0.072929 | delta_norm_mean=4.598394 | gate_mean=0.625537 | gate_std=0.054432 | p_hat_norm_mean=36.239987 | p_mask_norm_mean=0.854909 | q_norm_mean=36.239987 | z_gt_norm_mean=1.275222
-valuate on test_u_refer:  miou 0.7347305961538223  true fscore 0.8193065231665969
-bridge on test_u_refer: cos_delta_p_mask_mean=0.754954 | cos_delta_q_mean=-0.054195 | cos_delta_z_gt_mean=0.089436 | cos_p_hat_p_mask_mean=0.077127 | cos_p_hat_q_mean=0.994077 | cos_p_hat_z_gt_mean=0.023352 | cos_p_mask_z_gt_mean=0.081724 | delta_norm_mean=3.370293 | gate_mean=0.544416 | gate_std=0.033540 | p_hat_norm_mean=30.852975 | p_mask_norm_mean=0.855032 | q_norm_mean=30.852975 | z_gt_norm_mean=0.969146
- valuate on  test_n_refer:   metric 0.12181796133518219

log/rpb_probe_eval_directional_pm_only_a02_step0.txt DELETED

@@ -1,7 +0,0 @@
-Epoch 0: running_loss 0.01385641098022461  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.7251643069144439  true fscore 0.8044421944022179
-bridge on test_s_refer: cos_delta_p_mask_mean=0.754565 | cos_delta_q_mean=-0.062169 | cos_delta_z_gt_mean=0.077297 | cos_p_hat_p_mask_mean=0.084709 | cos_p_hat_q_mean=0.992133 | cos_p_hat_z_gt_mean=0.070145 | cos_p_mask_z_gt_mean=0.072929 | delta_norm_mean=4.598003 | gate_mean=0.625515 | gate_std=0.054416 | p_hat_norm_mean=36.238429 | p_mask_norm_mean=0.854909 | q_norm_mean=36.238428 | z_gt_norm_mean=1.275222
-valuate on test_u_refer:  miou 0.7346898949889146  true fscore 0.819309664927423
-bridge on test_u_refer: cos_delta_p_mask_mean=0.754958 | cos_delta_q_mean=-0.054197 | cos_delta_z_gt_mean=0.089438 | cos_p_hat_p_mask_mean=0.077138 | cos_p_hat_q_mean=0.994077 | cos_p_hat_z_gt_mean=0.023334 | cos_p_mask_z_gt_mean=0.081724 | delta_norm_mean=3.370548 | gate_mean=0.544434 | gate_std=0.033514 | p_hat_norm_mean=30.854847 | p_mask_norm_mean=0.855032 | q_norm_mean=30.854847 | z_gt_norm_mean=0.969146
- valuate on  test_n_refer:   metric 0.12185448408126831

log/rpb_probe_mixed_pm_only_a02_wm005_s80.txt DELETED

@@ -1,11 +0,0 @@
-Epoch 0: running_loss 0.11956256674602628  Learning Rate:0.000048
-Epoch 1: running_loss 0.059521447168663144  Learning Rate:0.000038
-Epoch 2: running_loss 0.03955021120297412  Learning Rate:0.000021
-Epoch 3: running_loss 0.029611277248477563  Learning Rate:0.000006
-Epoch 4: running_loss 0.023673650273121894  Learning Rate:0.000000
-valuate on test_s_refer:  miou 0.7234249453799384  true fscore 0.8020988971926272
-bridge on test_s_refer: cos_delta_p_mask_mean=0.752115 | cos_delta_q_mean=-0.071252 | cos_delta_z_gt_mean=0.081856 | cos_p_hat_p_mask_mean=0.098034 | cos_p_hat_q_mean=0.989714 | cos_p_hat_z_gt_mean=0.072197 | cos_p_mask_z_gt_mean=0.072929 | delta_norm_mean=5.254162 | gate_mean=0.718218 | gate_std=0.053861 | p_hat_norm_mean=36.239985 | p_mask_norm_mean=0.854909 | q_norm_mean=36.239985 | z_gt_norm_mean=1.275222
-valuate on test_u_refer:  miou 0.7361468947966933  true fscore 0.8214005154371261
-bridge on test_u_refer: cos_delta_p_mask_mean=0.754059 | cos_delta_q_mean=-0.063183 | cos_delta_z_gt_mean=0.096618 | cos_p_hat_p_mask_mean=0.090575 | cos_p_hat_q_mean=0.991959 | cos_p_hat_z_gt_mean=0.025874 | cos_p_mask_z_gt_mean=0.081724 | delta_norm_mean=3.926547 | gate_mean=0.635724 | gate_std=0.036734 | p_hat_norm_mean=30.848887 | p_mask_norm_mean=0.855032 | q_norm_mean=30.848887 | z_gt_norm_mean=0.969146
- valuate on  test_n_refer:   metric 0.12358559668064117

seg_ltpo.py DELETED

@@ -1,1372 +0,0 @@
-"""
-SEG-LTPO: test-time optimization of SimToken's Fseg / q prompt token.
-Two optimizers are provided:
-ltpo_optimize  – original antithetic-ES zeroth-order optimizer (Fseg space).
-q_ltpo_autograd – autograd optimizer that directly optimizes q (= sparse
-                  prompt embedding passed to the mask decoder) via Adam
-                  maximize, with a differentiable reward.  This is the
-                  recommended path when the reward can be made differentiable.
-Staged autograd reward build-up:
-  Stage 0  check_grad_connectivity  — verify ∂R_iou/∂q ≠ 0
-  Stage 1  QLTPOConfig(stage=1)     — R = 0.6·R_iou − 0.2·R_area_soft − λ_reg·‖q−q₀‖²
-  Stage 2  QLTPOConfig(stage=2)     — Stage 1 + 1.0·R_align_det  (z_in/z_out stopgrad)
-  Stage 3  QLTPOConfig(stage=3)     — Stage 2 + 0.2·R_temp_feat  (full reward)
-Reward gating: use best_q only when R_task(best_q) > R_task(q_init) + gate_delta.
---- ES baseline (original) ---
-Reward:
-    R = λ1·R_temp_feat + λ2·R_iou_pred + λ3·R_align_contrast − λ4·R_area
-Update (antithetic ES, step t):
-    F_curr = F_curr + η_t · (R+ − R−)/(2σ_t²) · eps_t
-    best_F = argmax_F R(F)
-"""
-from __future__ import annotations
-from dataclasses import dataclass, field
-from typing import Any, Dict, List, Optional, Tuple
-import torch
-import torch.nn.functional as F
-# ---------------------------------------------------------------------------
-# Per-sample diagnostics accumulator for q_ltpo_autograd
-# ---------------------------------------------------------------------------
-_q_ltpo_stats: List[Dict[str, Any]] = []
-def reset_q_ltpo_stats() -> None:
-    global _q_ltpo_stats
-    _q_ltpo_stats = []
-def get_q_ltpo_stats() -> List[Dict[str, Any]]:
-    return list(_q_ltpo_stats)
-# ---------------------------------------------------------------------------
-# Configuration
-# ---------------------------------------------------------------------------
-@dataclass
-class LTPOConfig:
-    T: int = 5
-    num_anchors: int = 4
-    sigma_schedule: List[float] = field(
-        default_factory=lambda: [0.10, 0.08, 0.06, 0.04, 0.02]
-    )
-    eta_scale: float = 0.5      # η_t = eta_scale · σ_t
-    # Reward weights
-    lambda1: float = 0.3        # R_temp_feat
-    lambda2: float = 0.4        # R_iou_pred
-    lambda3: float = 1.0        # R_align_contrast
-    lambda4: float = 0.3        # R_area penalty
-    beta: float = 0.5           # background penalty coefficient in R_align_contrast
-    # Reward gating: fall back to F_init when improvement < gate_delta
-    gate_delta: float = 0.0
-    # L2 trust-region radius on Fseg; None = disabled
-    trust_delta: Optional[float] = None
-# ---------------------------------------------------------------------------
-# Utilities
-# ---------------------------------------------------------------------------
-def get_sam_model(model):
-    """Return SAM visual_model, unwrapping a PeftModel wrapper if present."""
-    base = model.base_model.model if hasattr(model, "base_model") else model
-    return base.model.visual_model
-def get_anchor_indices(num_frames: int, num_anchors: int) -> List[int]:
-    """Uniformly sample anchor frame indices from [0, num_frames-1]."""
-    return [round(v) for v in torch.linspace(0, num_frames - 1, num_anchors).tolist()]
-def _precompute_dense_emb(
-    sam_model, model_dtype: torch.dtype, device: torch.device
-) -> torch.Tensor:
-    """
-    Constant 'no-mask' dense embedding from SAM's prompt encoder.
-    Independent of Fseg; precompute once per sample to avoid redundant calls.
-    Shape: [1, 256, 64, 64].
-    """
-    pe = sam_model.prompt_encoder
-    H, W = pe.image_embedding_size
-    return (
-        pe.no_mask_embed.weight           # [1, 256]
-        .reshape(1, -1, 1, 1)
-        .expand(1, -1, H, W)
-        .contiguous()
-        .to(model_dtype)
-        .to(device)
-    )
-# ---------------------------------------------------------------------------
-# Lightweight SAM decode (skips prompt_encoder overhead)
-# ---------------------------------------------------------------------------
-def _decode_on_anchors(
-    fseg: torch.Tensor,                 # [1, 256] float32
-    image_embeds_anchor: torch.Tensor,  # [A, 256, 64, 64] model dtype
-    dense_emb: torch.Tensor,            # [1, 256, 64, 64] model dtype (constant)
-    mask_decoder,
-    dense_pe: torch.Tensor,             # [1, 256, 64, 64]
-    model_dtype: torch.dtype,
-) -> Tuple[torch.Tensor, torch.Tensor]:
-    """
-    Decode anchor frames for a given Fseg.
-    Since no points/boxes are used, prompt_encoder simply concatenates
-    text_embeds onto an empty sparse tensor, so sparse_emb == Fseg.unsqueeze(1).
-    We exploit this to skip the full prompt_encoder call each iteration.
-    Returns:
-        low_res_masks: [A, 1, 256, 256]
-        iou_preds:     [A, 1]
-    """
-    sparse_emb = fseg.to(model_dtype).unsqueeze(1)  # [1, 1, 256]
-    with torch.no_grad():
-        low_res_masks, iou_preds = mask_decoder(
-            image_embeddings=image_embeds_anchor,
-            image_pe=dense_pe,
-            sparse_prompt_embeddings=sparse_emb,
-            dense_prompt_embeddings=dense_emb,
-            multimask_output=False,
-        )
-    return low_res_masks, iou_preds  # [A,1,256,256], [A,1]
-# ---------------------------------------------------------------------------
-# Reward computation
-# ---------------------------------------------------------------------------
-def _compute_reward(
-    fseg: torch.Tensor,                 # [1, 256] float32
-    low_res_masks: torch.Tensor,        # [A, 1, 256, 256]
-    iou_preds: torch.Tensor,            # [A, 1]
-    image_embeds_anchor: torch.Tensor,  # [A, 256, 64, 64]
-    cfg: LTPOConfig,
-) -> float:
-    num_anchor = low_res_masks.shape[0]
-    device = fseg.device
-    # Work entirely in float32 for numerical stability
-    masks_soft = torch.sigmoid(low_res_masks.float().squeeze(1))  # [A, 256, 256]
-    img_embs   = image_embeds_anchor.float()                       # [A, 256, 64, 64]
-    # q lives in SAM's 256-d prompt space (same as Fseg after text_hidden_fcs)
-    q = F.normalize(fseg[0].float(), dim=0)  # [256]
-    # Downsample soft masks 256×256 → 64×64 to match image_embed spatial dims.
-    # Keep as soft weights (no hard threshold) so the reward surface is smooth.
-    masks_64 = F.interpolate(
-        masks_soft.unsqueeze(1), size=(64, 64),
-        mode="bilinear", align_corners=False,
-    ).squeeze(1)  # [A, 64, 64]
-    # ── Per-frame masked pooling ──────────────────────────────────────────
-    z_ins:  List[torch.Tensor] = []
-    z_outs: List[torch.Tensor] = []
-    for t in range(num_anchor):
-        m   = masks_64[t]   # [64, 64]
-        img = img_embs[t]   # [256, 64, 64]
-        # Soft weighted average pooling over foreground / background
-        z_in  = (img * m.unsqueeze(0)).sum(dim=[1, 2]) / (m.sum() + 1e-6)
-        z_out = (img * (1.0 - m).unsqueeze(0)).sum(dim=[1, 2]) / ((1.0 - m).sum() + 1e-6)
-        z_ins.append(F.normalize(z_in,  dim=0))   # [256]
-        z_outs.append(F.normalize(z_out, dim=0))  # [256]
-    # ── R_align_contrast ──────────────────────────────────────────────────
-    # Maximise Fseg↔inside alignment while penalising Fseg↔outside alignment.
-    # Contrast term prevents reward-hacking via large masks:
-    # a large mask pulls inside and outside features together, shrinking the gap.
-    r_align = sum(
-        (q @ z_ins[t]) - cfg.beta * (q @ z_outs[t])
-        for t in range(num_anchor)
-    ) / num_anchor
-    # ── R_iou_pred ────────────────────────────────────────────────────────
-    # SAM's internal mask-quality head, calibrated during SAM training.
-    r_iou = iou_preds.float().mean()
-    # ── R_temp_feat ───────────────────────────────────────────────────────
-    # Feature-space consistency between adjacent anchor frames.
-    # Harder to game than mask-IoU: large masks pool diverse background
-    # features across frames, degrading cosine similarity.
-    r_temp = torch.tensor(0.0, device=device)
-    if num_anchor > 1:
-        r_temp = sum(
-            z_ins[t] @ z_ins[t + 1] for t in range(num_anchor - 1)
-        ) / (num_anchor - 1)
-    # ── R_area ────────────────────────────────────────────────────────────
-    r_area = masks_64.mean()
-    R = (cfg.lambda1 * r_temp
-         + cfg.lambda2 * r_iou
-         + cfg.lambda3 * r_align
-         - cfg.lambda4 * r_area)
-    return R.item()
-# ---------------------------------------------------------------------------
-# Ablation baseline: Best-of-2 Random (no iterative update)
-# ---------------------------------------------------------------------------
-def best_of_2_optimize(
-    F_init: torch.Tensor,
-    image_embeds: torch.Tensor,
-    anchor_indices: List[int],
-    sam_model,
-    model_dtype: torch.dtype,
-    cfg: LTPOConfig,
-) -> torch.Tensor:
-    """
-    Best-of-2 Random baseline.
-    Sample one antithetic pair (F+, F-) using the first sigma value,
-    evaluate both, return whichever has the higher reward.
-    No iterative update — serves as the ablation for the update rule.
-    Same reward gating as ltpo_optimize for a fair comparison.
-    """
-    device = F_init.device
-    image_embeds_anchor = image_embeds[anchor_indices]
-    dense_emb = _precompute_dense_emb(sam_model, model_dtype, device)
-    dense_pe  = sam_model.prompt_encoder.get_dense_pe().to(device)
-    mask_dec  = sam_model.mask_decoder
-    lrm0, iou0 = _decode_on_anchors(
-        F_init, image_embeds_anchor, dense_emb, mask_dec, dense_pe, model_dtype
-    )
-    R_init = _compute_reward(F_init, lrm0, iou0, image_embeds_anchor, cfg)
-    sigma = cfg.sigma_schedule[0]
-    eps   = torch.randn_like(F_init) * sigma
-    F_plus  = F_init + eps
-    F_minus = F_init - eps
-    lrm_p, iou_p = _decode_on_anchors(
-        F_plus,  image_embeds_anchor, dense_emb, mask_dec, dense_pe, model_dtype
-    )
-    lrm_m, iou_m = _decode_on_anchors(
-        F_minus, image_embeds_anchor, dense_emb, mask_dec, dense_pe, model_dtype
-    )
-    R_plus  = _compute_reward(F_plus,  lrm_p, iou_p, image_embeds_anchor, cfg)
-    R_minus = _compute_reward(F_minus, lrm_m, iou_m, image_embeds_anchor, cfg)
-    best_R, best_F = R_init, F_init.clone()
-    if R_plus  > best_R: best_R, best_F = R_plus,  F_plus.clone()
-    if R_minus > best_R: best_R, best_F = R_minus, F_minus.clone()
-    if best_R <= R_init + cfg.gate_delta:
-        return F_init
-    return best_F
-# ---------------------------------------------------------------------------
-# Full-video decode with a given Fseg
-# ---------------------------------------------------------------------------
-def _sobel_edge(rgb_frames: torch.Tensor) -> torch.Tensor:
-    """Compute Sobel edge magnitude from normalized RGB frames.
-    Args:
-        rgb_frames: [T, 3, H, W] float32 (SAM-normalized, CUDA)
-    Returns:
-        edge: [T, 1, H, W] float32, non-negative
-    """
-    gray = rgb_frames.float().mean(dim=1, keepdim=True)  # [T, 1, H, W]
-    kx = torch.tensor([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],
-                      dtype=torch.float32, device=rgb_frames.device).view(1, 1, 3, 3)
-    ky = kx.transpose(2, 3)
-    gx = F.conv2d(gray, kx, padding=1)
-    gy = F.conv2d(gray, ky, padding=1)
-    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)  # [T, 1, H, W]
-def _boundary_edge_score(
-    low_res_masks: torch.Tensor,   # [T, K, 256, 256] logits
-    rgb_frames: torch.Tensor,      # [T, 3, H, W] float32
-    resize: tuple,                 # (H_resized, W_resized)
-    area_temp: float = 5.0,
-) -> torch.Tensor:
-    """Score each of K mask candidates by boundary-edge alignment.
-    R_edge = <soft_boundary_band, Sobel_edge> / (sum(soft_boundary_band) + ε)
-    Rewards masks whose boundaries coincide with image edges.
-    Returns: [T, K] float32 scores (higher = better boundary alignment)
-    """
-    T, K = low_res_masks.shape[:2]
-    H_r, W_r = resize
-    # Upsample all candidates to resized image resolution at once
-    masks_up = F.interpolate(
-        low_res_masks.reshape(T * K, 1, 256, 256).float(),
-        size=(H_r, W_r), mode="bilinear", align_corners=False,
-    ).reshape(T, K, H_r, W_r)  # [T, K, H, W]
-    E = _sobel_edge(rgb_frames[:, :, :H_r, :W_r])  # [T, 1, H, W]
-    m  = torch.sigmoid(masks_up / area_temp)                     # [T, K, H, W]
-    b  = 4.0 * m * (1.0 - m)                                    # soft boundary band
-    num = (b * E.squeeze(1).unsqueeze(1)).sum(dim=[2, 3])        # [T, K]
-    den = b.sum(dim=[2, 3]) + 1e-6
-    return num / den                                             # [T, K]
-def decode_full_video(
-    fseg: torch.Tensor,                    # [1, 256] float32
-    image_embeds: torch.Tensor,            # [T, 256, 64, 64] model dtype on CUDA
-    sam_model,
-    resize: tuple,                         # (H_resized, W_resized)
-    orgsize: tuple,                        # (H_orig, W_orig)
-    model_dtype: torch.dtype,
-    rgb_frames: Optional[torch.Tensor] = None,  # [T, 3, H, W]; enables edge selection
-    multimask: bool = False,               # True = 3 candidates; False = single mask
-) -> torch.Tensor:
-    """Decode all T frames with the given Fseg.
-    Selection logic (applied per-frame):
-      - multimask=False                  : original single-mask decode (baseline); rgb_frames is ignored
-      - multimask=True,  rgb_frames=None : 3 candidates, select by SAM iou_pred
-      - multimask=True,  rgb_frames=*   : 3 candidates, select by boundary-edge score
-        (boundary band × Sobel edge; directly rewards boundary-image alignment)
-    Returns raw logit mask [T, H_orig, W_orig] (not yet sigmoid).
-    """
-    device    = image_embeds.device
-    dense_emb = _precompute_dense_emb(sam_model, model_dtype, device)
-    dense_pe  = sam_model.prompt_encoder.get_dense_pe().to(device)
-    sparse_emb = fseg.to(model_dtype).unsqueeze(1)  # [1, 1, 256]
-    with torch.no_grad():
-        low_res_masks, iou_preds = sam_model.mask_decoder(
-            image_embeddings=image_embeds,
-            image_pe=dense_pe,
-            sparse_prompt_embeddings=sparse_emb,
-            dense_prompt_embeddings=dense_emb,
-            multimask_output=multimask,
-        )  # [T, K, 256, 256], [T, K]  where K=1 or K=3
-    if multimask:
-        T = low_res_masks.shape[0]
-        if rgb_frames is not None:
-            # Step 1b: boundary-edge score selects best candidate
-            scores = _boundary_edge_score(low_res_masks, rgb_frames, resize)
-        else:
-            # Step 1a: SAM's own iou_pred selects best candidate
-            scores = iou_preds
-        best_idx = scores.argmax(dim=1)  # [T]
-        low_res_masks = low_res_masks[torch.arange(T, device=device), best_idx].unsqueeze(1)
-    pred_mask = sam_model.postprocess_masks(
-        low_res_masks, input_size=resize, original_size=orgsize
-    )  # [T, 1, H, W]
-    return pred_mask.squeeze(1)  # [T, H, W]
-# ---------------------------------------------------------------------------
-# Main optimisation loop
-# ---------------------------------------------------------------------------
-def ltpo_optimize(
-    F_init: torch.Tensor,          # [1, 256] float32 on CUDA
-    image_embeds: torch.Tensor,    # [T, 256, 64, 64] model dtype on CUDA
-    anchor_indices: List[int],
-    sam_model,
-    model_dtype: torch.dtype,
-    cfg: LTPOConfig,
-) -> torch.Tensor:
-    """
-    Optimise Fseg at test time via antithetic ES.
-    Returns best Fseg found [1, 256] float32.
-    Falls back to F_init when reward gating rejects all updates.
-    """
-    device = F_init.device
-    image_embeds_anchor = image_embeds[anchor_indices]  # [A, 256, 64, 64]
-    # Precompute constants shared across every optimisation step
-    dense_emb = _precompute_dense_emb(sam_model, model_dtype, device)
-    dense_pe  = sam_model.prompt_encoder.get_dense_pe().to(device)
-    mask_dec  = sam_model.mask_decoder
-    # ── Evaluate initial token ────────────────────────────────────────────
-    lrm0, iou0 = _decode_on_anchors(
-        F_init, image_embeds_anchor, dense_emb, mask_dec, dense_pe, model_dtype
-    )
-    R_init = _compute_reward(F_init, lrm0, iou0, image_embeds_anchor, cfg)
-    best_F, best_R = F_init.clone(), R_init
-    F_curr = F_init.clone()
-    # ── Optimisation loop ─────────────────────────────────────────────────
-    for t in range(cfg.T):
-        sigma_t = cfg.sigma_schedule[t]
-        eta_t   = cfg.eta_scale * sigma_t
-        eps     = torch.randn_like(F_curr) * sigma_t
-        F_plus  = F_curr + eps
-        F_minus = F_curr - eps
-        lrm_p, iou_p = _decode_on_anchors(
-            F_plus,  image_embeds_anchor, dense_emb, mask_dec, dense_pe, model_dtype
-        )
-        lrm_m, iou_m = _decode_on_anchors(
-            F_minus, image_embeds_anchor, dense_emb, mask_dec, dense_pe, model_dtype
-        )
-        R_plus  = _compute_reward(F_plus,  lrm_p, iou_p, image_embeds_anchor, cfg)
-        R_minus = _compute_reward(F_minus, lrm_m, iou_m, image_embeds_anchor, cfg)
-        # Track the best token seen across all evaluated candidates
-        if R_plus > best_R:
-            best_R, best_F = R_plus,  F_plus.clone()
-        if R_minus > best_R:
-            best_R, best_F = R_minus, F_minus.clone()
-        # Antithetic policy-gradient update of the iterate
-        # Formula: F_{t+1} = F_t + η_t · (R+ - R−)/(2σ_t²) · eps_t
-        grad_est = (R_plus - R_minus) / (2.0 * sigma_t ** 2)
-        F_curr   = F_curr + eta_t * grad_est * eps
-        # Optional L2 trust-region: keep F_curr within radius trust_delta of F_init
-        if cfg.trust_delta is not None:
-            diff = F_curr - F_init
-            norm = diff.norm()
-            if norm > cfg.trust_delta:
-                F_curr = F_init + diff * (cfg.trust_delta / norm)
-    # ── Reward gating ─────────────────────────────────────────────────────
-    # Reject the update when there is no meaningful improvement over the
-    # initial token (handles Null-like samples where no target exists).
-    if best_R <= R_init + cfg.gate_delta:
-        return F_init
-    return best_F
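The update rule in `ltpo_optimize` can be tried in isolation on a toy problem. The sketch below applies the same antithetic estimator, `(R+ − R−) / (2σ²) · eps`, to a hypothetical quadratic reward in pure Python. The reward function, the fixed `sigma` and `eta` (the original derives `eta_t` from `cfg.eta_scale * sigma_t`), and the function names are assumptions for illustration; the trust-region clip and reward gating are omitted.

```python
import random

def reward(x, target):
    """Toy black-box reward: negative squared distance to a hidden target."""
    return -sum((a - b) ** 2 for a, b in zip(x, target))

def antithetic_es(x0, target, steps=50, sigma=0.1, eta=0.05, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    r_init = reward(x, target)
    best_x, best_r = list(x), r_init
    for _ in range(steps):
        # Perturbation already scaled by sigma, as in the original loop.
        eps = [rng.gauss(0.0, 1.0) * sigma for _ in x]
        x_plus = [a + e for a, e in zip(x, eps)]
        x_minus = [a - e for a, e in zip(x, eps)]
        r_plus, r_minus = reward(x_plus, target), reward(x_minus, target)
        # Keep the best candidate evaluated so far.
        if r_plus > best_r:
            best_x, best_r = x_plus, r_plus
        if r_minus > best_r:
            best_x, best_r = x_minus, r_minus
        # Antithetic estimate: (R+ - R-) / (2 sigma^2) * eps.
        g = (r_plus - r_minus) / (2.0 * sigma ** 2)
        x = [a + eta * g * e for a, e in zip(x, eps)]
    return best_x, best_r, r_init

best_x, best_r, r_init = antithetic_es([0.0] * 4, [1.0] * 4)
print(r_init, best_r)
```

Tracking the best evaluated candidate separately from the iterate means the returned point can never score below the starting point, which is what makes the final gating comparison against `R_init` meaningful.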
-# ===========================================================================
-# q-LTPO-autograd: differentiable test-time optimization of the prompt token
-# ===========================================================================
-@dataclass
-class QLTPOConfig:
-    """Configuration for q_ltpo_autograd (Stages 1–3 + Stage 2-ext variants).
-    stage controls which reward terms are active:
-      1   R_iou + R_area_soft + reg                      (baseline autograd)
-      2   Stage 1 + R_align_det (z_in/z_out stopgrad)   (self-bootstrapped alignment)
-      3   Stage 2 + R_temp_feat                          (full reward)
-      21  Stage 1 + R_tether    (P1a: tether probe)      (frozen r_ref via q_init attn)
-      22  Stage 1 + R_faithful  (P1b: faithful ext-ref)  (z_in/z_out vs frozen r_ref)
-      4   Frame-adaptive tokens q_t = q_global + delta_t (Direction II, see below)
-    """
-    stage: int = 1
-    T: int = 5
-    num_anchors: int = 4
-    # ── Optimizer ──────────────────────────────────────────────────────────
-    # lr=0  → auto-set to 0.01 × RMS(q_init); any positive value is used directly
-    lr: float = 0.0
-    # max_drift=0 → auto-set to 0.5 × ‖q_init‖; any positive value is a hard radius
-    max_drift: float = 0.0
-    # ── Stage 1 reward weights ─────────────────────────────────────────────
-    lambda_iou: float = 0.6
-    lambda_area: float = 0.2
-    lambda_reg: float = 0.01
-    area_temp: float = 5.0      # sigmoid temperature for R_area_soft
-    # ── Stage 2 additional weights ─────────────────────────────────────────
-    lambda_align: float = 1.0
-    beta_align: float = 0.5     # background penalty coefficient in R_align
-    # ── Stage 3 additional weights ─────────────────────────────────────────
-    lambda_temp: float = 0.2
-    # ── Gating ─────────────────────────────────────────────────────────────
-    gate_delta: float = 0.0
-    # ── e0-modulated R_iou (principled Null-safety) ────────────────────────
-    # e0 = stopgrad(R_area_soft(q_init)): the initial soft-area fraction acts
-    # as an existence prior on the R_iou term.
-    #   "none"     → original behavior (e0 = 1, no modulation)
-    #   "identity" → e0 = R_area_soft(q_init)          [first version]
-    #   "sqrt"     → e0 = sqrt(R_area_soft(q_init) + e0_eps)
-    e0_modulation: str = "identity"
-    e0_eps: float = 1e-4   # epsilon for "sqrt" variant
-    # ── Stage 2-ext: external reference (stages 21 and 22) ────────────────
-    # r_ref = AttnPool(image_feats_anchor, q_init): frozen visual anchor derived
-    # from q_init's attention over SAM image features. Breaks Stage 2's
-    # self-confirming bias by providing a mask-independent teacher.
-    # r_ref_temp: softmax temperature for attention pooling (sqrt(256) = 16).
-    r_ref_temp: float = 16.0
-    # ── Direction B: boundary precision rewards ────────────────────────────
-    # B1: asymmetric area expansion penalty
-    #   Only penalises growth beyond (1+τ)×e0; allows mask contraction.
-    #   Targets the observed pattern where LTPO slightly expands masks into
-    #   non-target regions (recall↑ but precision↓, hurting F-score).
-    # B2: boundary sharpness reward
-    #   -mean(4m(1-m)) with temperature=1.0; rewards bimodal (certain)
-    #   mask predictions, encouraging cleaner boundary predictions.
-    lambda_area_inc: float = 0.0   # B1 weight (0 = disabled)
-    area_inc_tau: float = 0.0      # B1 tolerance band: allow (1+τ)×e0
-    lambda_sharp: float = 0.0      # B2 weight (0 = disabled)
-    # ── Oracle Null-safety gate (analysis only; NOT for final method) ──────
-    # Derived from test-set distribution (Null area_hard ≈ 0.01, Seen ≈ 0.05)
-    # so must not be used in reported results.  Set null_gate_delta=0 to disable.
-    null_area_threshold: float = 0.02   # hard area fraction below which guard activates
-    null_gate_delta: float = 0.0        # 0 = disabled; 0.05 = oracle experiment
-    # ── Direction II: Frame-adaptive token optimization (stage=4) ─────────
-    # q_t = q_global + delta_t, where delta_t is a per-anchor residual.
-    # Optimizes q_global and {delta_t} jointly with Adam.
-    # lambda_residual: soft L2 penalty on delta_t
-    # lambda_smooth_temp: temporal smoothness penalty on adjacent delta differences
-    # max_delta_drift_scale: per-anchor hard L2 clip = scale × ‖q_init‖
-    #   Prevents individual anchors from wandering to a completely different visual mode.
-    #   Keep << max_drift (0.5) so delta stays a "small frame correction" to q_global.
-    #   0.1 is tight (delta ≤ 20% of global drift budget), 0.3 is moderate.
-    lambda_residual: float = 0.001
-    lambda_smooth_temp: float = 0.0
-    max_delta_drift_scale: float = 0.1   # per-anchor clip = scale × ‖q_init‖
-# ---------------------------------------------------------------------------
-# e0 helper
-# ---------------------------------------------------------------------------
-def _compute_e0(r_area_soft_init: float, cfg: "QLTPOConfig") -> float:
-    """Compute the existence-prior weight from the initial soft area."""
-    if cfg.e0_modulation == "identity":
-        return r_area_soft_init
-    if cfg.e0_modulation == "sqrt":
-        return (r_area_soft_init + cfg.e0_eps) ** 0.5
-    return 1.0  # "none"
-# ---------------------------------------------------------------------------
-# Differentiable anchor decode (float32 throughout; no torch.no_grad)
-# ---------------------------------------------------------------------------
-def _decode_on_anchors_diff(
-    q: torch.Tensor,                        # [1, 256] float32
-    image_embeds_anchor_fp32: torch.Tensor, # [A, 256, 64, 64] float32
-    dense_emb_fp32: torch.Tensor,           # [1, 256, 64, 64] float32
-    mask_decoder,
-    dense_pe_fp32: torch.Tensor,            # [1, 256, 64, 64] float32
-) -> Tuple[torch.Tensor, torch.Tensor]:
-    """Differentiable mask-decoder forward.
-    All inputs are float32 to avoid fp16 gradient truncation.
-    q may be a Parameter (requires_grad=True) or a plain detached tensor.
-    Returns low_res_masks [A,1,256,256] and iou_preds [A,1], both float32.
-    """
-    sparse_emb = q.unsqueeze(1)  # [1, 1, 256]
-    low_res_masks, iou_preds = mask_decoder(
-        image_embeddings=image_embeds_anchor_fp32,
-        image_pe=dense_pe_fp32,
-        sparse_prompt_embeddings=sparse_emb,
-        dense_prompt_embeddings=dense_emb_fp32,
-        multimask_output=False,
-    )
-    return low_res_masks, iou_preds  # [A,1,256,256], [A,1]
-# ---------------------------------------------------------------------------
-# Differentiable reward components
-# ---------------------------------------------------------------------------
-def _task_reward_stage1(
-    lrm: torch.Tensor,   # [A,1,256,256] float32
-    iou: torch.Tensor,   # [A,1] float32
-    cfg: QLTPOConfig,
-    e0: float = 1.0,
-) -> torch.Tensor:
-    """Task reward (no regularization): used for best_q tracking and gating.
-    e0 is the stopgrad existence prior: R_area_soft(q_init) scaled via
-    cfg.e0_modulation.  When e0 << 1 the iou term is suppressed, so the
-    optimizer sees only the area-penalty gradient and naturally tends toward
-    smaller (more conservative) masks — the correct behavior when the initial
-    prediction is near-empty (Null frames).
-    Optional boundary precision terms (Direction B):
-      B1 (lambda_area_inc > 0): asymmetric expansion penalty
-        -λ_inc · ReLU(r_area - (1+τ)·e0)
-        Penalises mask growth beyond the initial area (+ tolerance band τ).
-        e0 doubles as the stopgrad initial-area threshold at zero extra cost; it
-        equals the initial soft area only when cfg.e0_modulation == "identity".
-      B2 (lambda_sharp > 0): boundary sharpness reward
-        -λ_sharp · mean(4m(1-m))  with m = sigmoid(lrm), temperature=1.0
-        Maximises bimodality of mask logits → cleaner boundary predictions.
-    """
-    r_iou  = iou.mean()
-    r_area = torch.sigmoid(lrm / cfg.area_temp).mean()
-    R = cfg.lambda_iou * e0 * r_iou - cfg.lambda_area * r_area
-    # B1: penalise expansion beyond (1+τ)×e0 (allow contraction freely)
-    if cfg.lambda_area_inc > 0.0:
-        area_ceil = (1.0 + cfg.area_inc_tau) * e0
-        R = R - cfg.lambda_area_inc * F.relu(r_area - area_ceil)
-    # B2: reward confident (bimodal) boundary predictions
-    if cfg.lambda_sharp > 0.0:
-        m_sharp = torch.sigmoid(lrm)            # temperature=1.0 (sharp)
-        boundary_uncertain = 4.0 * m_sharp * (1.0 - m_sharp)
-        R = R - cfg.lambda_sharp * boundary_uncertain.mean()
-    return R
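To make the interaction between the existence prior `e0` and the B1 expansion penalty concrete, here is a scalar sketch of the Stage 1 reward without the B2 sharpness term. The function name `stage1_reward_scalar` and the sample values are hypothetical; the default weights are the `QLTPOConfig` defaults shown above.

```python
def stage1_reward_scalar(r_iou, r_area, e0, lambda_iou=0.6, lambda_area=0.2,
                         lambda_area_inc=0.0, area_inc_tau=0.0):
    """Scalar form of the Stage 1 reward (B2 sharpness term omitted)."""
    reward = lambda_iou * e0 * r_iou - lambda_area * r_area
    if lambda_area_inc > 0.0:
        # B1: only growth beyond (1 + tau) * e0 is penalised.
        excess = max(r_area - (1.0 + area_inc_tau) * e0, 0.0)
        reward -= lambda_area_inc * excess
    return reward

# Null-like sample: tiny initial area, so the IoU term is almost switched off.
null_like = stage1_reward_scalar(r_iou=0.9, r_area=0.01, e0=0.01)

# Mask shrinks below its initial area: B1 stays silent.
shrunk_plain = stage1_reward_scalar(0.9, 0.03, 0.05)
shrunk_b1 = stage1_reward_scalar(0.9, 0.03, 0.05, lambda_area_inc=1.0)

# Mask grows to twice its initial area: B1 subtracts the excess.
grown_plain = stage1_reward_scalar(0.9, 0.10, 0.05)
grown_b1 = stage1_reward_scalar(0.9, 0.10, 0.05, lambda_area_inc=1.0)
print(null_like, shrunk_b1 - shrunk_plain, grown_b1 - grown_plain)
```

The asymmetry is the point of B1: contraction leaves the reward unchanged, while expansion is charged in proportion to how far the soft area overshoots the ceiling.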
-def _task_reward_stage2(
-    q: torch.Tensor,      # [1, 256] float32
-    lrm: torch.Tensor,    # [A,1,256,256] float32
-    iou: torch.Tensor,    # [A,1] float32
-    image_embeds_anchor_fp32: torch.Tensor,  # [A, 256, 64, 64] float32
-    cfg: QLTPOConfig,
-    e0: float = 1.0,
-) -> torch.Tensor:
-    """Stage 2 task reward: Stage 1 + R_align_det (z_in/z_out are stopgrad)."""
-    r_s1 = _task_reward_stage1(lrm, iou, cfg, e0)
-    A = lrm.shape[0]
-    masks_64 = F.interpolate(
-        torch.sigmoid(lrm.squeeze(1) / cfg.area_temp).unsqueeze(1),
-        size=(64, 64), mode="bilinear", align_corners=False,
-    ).squeeze(1)  # [A, 64, 64]
-    q_norm = F.normalize(q[0], dim=0)  # [256]
-    r_align = torch.tensor(0.0, device=q.device)
-    for t in range(A):
-        m   = masks_64[t].detach()          # stopgrad on z_in/z_out
-        img = image_embeds_anchor_fp32[t]   # [256, 64, 64]
-        z_in  = F.normalize((img * m.unsqueeze(0)).sum(dim=[1, 2]) / (m.sum() + 1e-6), dim=0)
-        z_out = F.normalize((img * (1 - m).unsqueeze(0)).sum(dim=[1, 2]) / ((1 - m).sum() + 1e-6), dim=0)
-        r_align = r_align + q_norm @ z_in - cfg.beta_align * (q_norm @ z_out)
-    r_align = r_align / A
-    return r_s1 + cfg.lambda_align * r_align
-def _task_reward_stage3(
-    q: torch.Tensor,
-    lrm: torch.Tensor,
-    iou: torch.Tensor,
-    image_embeds_anchor_fp32: torch.Tensor,
-    cfg: QLTPOConfig,
-    e0: float = 1.0,
-) -> torch.Tensor:
-    """Stage 3 task reward: Stage 2 + R_temp_feat."""
-    r_s2 = _task_reward_stage2(q, lrm, iou, image_embeds_anchor_fp32, cfg, e0)
-    A = lrm.shape[0]
-    if A < 2:
-        return r_s2
-    masks_64 = F.interpolate(
-        torch.sigmoid(lrm.squeeze(1) / cfg.area_temp).unsqueeze(1),
-        size=(64, 64), mode="bilinear", align_corners=False,
-    ).squeeze(1)  # [A, 64, 64]
-    z_ins = []
-    for t in range(A):
-        m   = masks_64[t].detach()
-        img = image_embeds_anchor_fp32[t]
-        z_in = F.normalize((img * m.unsqueeze(0)).sum(dim=[1, 2]) / (m.sum() + 1e-6), dim=0)
-        z_ins.append(z_in)
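-    # z_ins are built from detached masks and fixed image features, so r_temp
-    # carries no gradient w.r.t. q; it only shifts best-q tracking and gating.
-    r_temp = sum(z_ins[t] @ z_ins[t + 1] for t in range(A - 1)) / (A - 1)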
-    return r_s2 + cfg.lambda_temp * r_temp
-@torch.no_grad()
-def _compute_r_ref(
-    q_init: torch.Tensor,               # [1, 256] float32
-    image_embeds_anchor: torch.Tensor,  # [A, 256, 64, 64] float32
-    temp: float = 16.0,
-) -> Tuple[torch.Tensor, torch.Tensor]:
-    """Frozen external visual reference via attention pooling guided by q_init.
-    r_ref: regions most attended by q_init (positive anchor).
-    r_neg: regions least attended by q_init (anti-attended negative).
-    Both are in the SAM 256d space — no projection needed.
-    Computed once before the optimization loop and kept fixed (stopgrad).
-    """
-    img_flat = image_embeds_anchor.flatten(2)          # [A, 256, H*W]
-    q_norm   = F.normalize(q_init[0], dim=0)           # [256]
-    img_norm = F.normalize(img_flat, dim=1)            # [A, 256, H*W]
-    # cosine similarity between q and each spatial position
-    attn = torch.einsum('d,adp->ap', q_norm, img_norm)  # [A, H*W]
-    attn_w_pos = torch.softmax( attn / temp, dim=-1)    # [A, H*W]
-    attn_w_neg = torch.softmax(-attn / temp, dim=-1)    # [A, H*W] anti-attended
-    # soft attention pooling in the original (non-normalized) feature space
-    r_ref_frames = torch.einsum('ap,adp->ad', attn_w_pos, img_flat)  # [A, 256]
-    r_neg_frames = torch.einsum('ap,adp->ad', attn_w_neg, img_flat)  # [A, 256]
-    r_ref = F.normalize(r_ref_frames.mean(0), dim=0)   # [256]
-    r_neg = F.normalize(r_neg_frames.mean(0), dim=0)   # [256]
-    return r_ref, r_neg
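Because `attn` in `_compute_r_ref` holds cosine similarities in [-1, 1], dividing by `temp = 16` confines the softmax logits to [-1/16, 1/16]. The sketch below, in pure Python with a hypothetical set of five similarity values, measures how peaked the resulting weights can be. The helper name `softmax_with_temp` and the comparison temperature 0.1 are assumptions chosen for illustration.

```python
import math

def softmax_with_temp(xs, temp):
    """Numerically stable softmax of xs / temp."""
    scaled = [x / temp for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities spanning the full possible range.
cosines = [-1.0, -0.5, 0.0, 0.5, 1.0]

w16 = softmax_with_temp(cosines, 16.0)  # temperature used by r_ref_temp
w01 = softmax_with_temp(cosines, 0.1)   # a much sharper temperature, for contrast

ratio16 = max(w16) / min(w16)
print(ratio16, max(w01))
```

At temperature 16 the largest weight exceeds the smallest by at most a factor of `exp(1/8)`, so the pooled `r_ref` is close to a plain spatial mean. Whether that is intended is worth checking against the experiment logs before reusing this default.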
-def _task_reward_stage2_tether(
-    q: torch.Tensor,        # [1, 256] float32
-    lrm: torch.Tensor,      # [A,1,256,256] float32
-    iou: torch.Tensor,      # [A,1] float32
-    r_ref: torch.Tensor,    # [256] frozen
-    r_neg: torch.Tensor,    # [256] frozen
-    cfg: QLTPOConfig,
-    e0: float = 1.0,
-) -> torch.Tensor:
-    """Stage 21 (P1a tether): Stage 1 + R_tether.
-    R_tether = cos(q, r_ref) - beta·cos(q, r_neg)
-    q is pulled toward the frozen visual anchor without touching mask features.
-    Tests whether a fixed external reference stabilizes the optimization trajectory.
-    """
-    r_s1    = _task_reward_stage1(lrm, iou, cfg, e0)
-    q_norm  = F.normalize(q[0], dim=0)
-    r_tether = q_norm @ r_ref - cfg.beta_align * (q_norm @ r_neg)
-    return r_s1 + cfg.lambda_align * r_tether
-def _task_reward_stage2_faithful(
-    q: torch.Tensor,                        # [1, 256] float32
-    lrm: torch.Tensor,                      # [A,1,256,256] float32
-    iou: torch.Tensor,                      # [A,1] float32
-    image_embeds_anchor_fp32: torch.Tensor, # [A, 256, 64, 64] float32
-    r_ref: torch.Tensor,                    # [256] frozen
-    cfg: QLTPOConfig,
-    e0: float = 1.0,
-) -> torch.Tensor:
-    """Stage 22 (P1b faithful): Stage 1 + R_faithful.
-    R_faithful = mean_t[ cos(z_in(q,t), r_ref) - beta·cos(z_out(q,t), r_ref) ]
-    z_in/z_out come from the *current* mask (change during optimization), but the
-    teacher r_ref is frozen — breaking Stage 2's self-confirming bias while keeping
-    the same structural form (mask-region vs. reference alignment).
-    """
-    r_s1   = _task_reward_stage1(lrm, iou, cfg, e0)
-    A      = lrm.shape[0]
-    masks_64 = F.interpolate(
-        torch.sigmoid(lrm.squeeze(1) / cfg.area_temp).unsqueeze(1),
-        size=(64, 64), mode="bilinear", align_corners=False,
-    ).squeeze(1)  # [A, 64, 64]
-    r_align = torch.tensor(0.0, device=q.device)
-    for t in range(A):
-        m   = masks_64[t].detach()           # stopgrad on mask weights only
-        img = image_embeds_anchor_fp32[t]    # [256, 64, 64]
-        z_in  = F.normalize((img * m.unsqueeze(0)).sum(dim=[1, 2]) / (m.sum() + 1e-6), dim=0)
-        z_out = F.normalize((img * (1 - m).unsqueeze(0)).sum(dim=[1, 2]) / ((1 - m).sum() + 1e-6), dim=0)
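-        # teacher is r_ref (frozen), not z_in itself, so there is no confirmation bias.
-        # Note: m is detached and img/r_ref are fixed, so r_align has no gradient
-        # w.r.t. q here; it only affects best-q tracking and gating.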
-        r_align = r_align + z_in @ r_ref - cfg.beta_align * (z_out @ r_ref)
-    r_align = r_align / A
-    return r_s1 + cfg.lambda_align * r_align
-def _decode_on_anchors_diff_adaptive(
-    q_global: torch.Tensor,                  # [1, 256] float32, requires_grad
-    delta: torch.Tensor,                     # [A, 256] float32, requires_grad
-    image_embeds_anchor_fp32: torch.Tensor,  # [A, 256, 64, 64] float32, detached
-    dense_emb_fp32: torch.Tensor,            # [1, 256, 64, 64] float32, detached
-    mask_decoder,
-    dense_pe_fp32: torch.Tensor,             # [1, 256, 64, 64] float32, detached
-) -> Tuple[torch.Tensor, torch.Tensor]:
-    """Frame-adaptive differentiable decode: each anchor t uses q_t = q_global + delta[t].
-    Loops over A anchors to preserve gradient flow through both q_global and delta.
-    Returns low_res_masks [A,1,256,256] and iou_preds [A,1], both float32.
-    """
-    A = image_embeds_anchor_fp32.shape[0]
-    lrm_list: List[torch.Tensor] = []
-    iou_list: List[torch.Tensor] = []
-    for t in range(A):
-        q_t = q_global + delta[t : t + 1]      # [1, 256]
-        sparse_emb = q_t.unsqueeze(1)           # [1, 1, 256]
-        lrm_t, iou_t = mask_decoder(
-            image_embeddings=image_embeds_anchor_fp32[t : t + 1],
-            image_pe=dense_pe_fp32,
-            sparse_prompt_embeddings=sparse_emb,
-            dense_prompt_embeddings=dense_emb_fp32,
-            multimask_output=False,
-        )  # [1,1,256,256], [1,1]
-        lrm_list.append(lrm_t)
-        iou_list.append(iou_t)
-    return torch.cat(lrm_list, dim=0), torch.cat(iou_list, dim=0)  # [A,1,256,256], [A,1]
-def _task_reward_frame_adaptive(
-    lrm: torch.Tensor,   # [A, 1, 256, 256] float32
-    iou: torch.Tensor,   # [A, 1] float32
-    cfg: "QLTPOConfig",
-    e0_vec: List[float],  # per-anchor existence priors [A]
-) -> torch.Tensor:
-    """Per-anchor task reward averaged over anchors (no regularization)."""
-    A = lrm.shape[0]
-    R = torch.tensor(0.0, device=lrm.device)
-    for t in range(A):
-        r_iou_t  = iou[t].mean()
-        r_area_t = torch.sigmoid(lrm[t] / cfg.area_temp).mean()
-        R = R + cfg.lambda_iou * e0_vec[t] * r_iou_t - cfg.lambda_area * r_area_t
-    return R / A
-def _compute_full_reward_adaptive(
-    q_global: torch.Tensor,   # [1, 256]
-    delta: torch.Tensor,      # [A, 256]
-    lrm: torch.Tensor,        # [A, 1, 256, 256]
-    iou: torch.Tensor,        # [A, 1]
-    q_init: torch.Tensor,     # [1, 256] detached
-    cfg: "QLTPOConfig",
-    e0_vec: List[float],
-) -> torch.Tensor:
-    """Full adaptive reward = task - residual penalty - temporal smoothness - L2 reg."""
-    r_task   = _task_reward_frame_adaptive(lrm, iou, cfg, e0_vec)
-    r_delta  = delta.pow(2).sum()
-    r_reg    = (q_global - q_init).pow(2).sum()
-    R = r_task - cfg.lambda_residual * r_delta - cfg.lambda_reg * r_reg
-    A = delta.shape[0]
-    if A > 1 and cfg.lambda_smooth_temp > 0.0:
-        r_smooth = torch.tensor(0.0, device=delta.device)
-        for t in range(A - 1):
-            r_smooth = r_smooth + (delta[t] - delta[t + 1]).pow(2).sum()
-        R = R - cfg.lambda_smooth_temp * r_smooth / (A - 1)
-    return R
-def _compute_task_reward(
-    q: torch.Tensor,
-    lrm: torch.Tensor,
-    iou: torch.Tensor,
-    image_embeds_anchor_fp32: torch.Tensor,
-    cfg: QLTPOConfig,
-    e0: float = 1.0,
-    r_ref: Optional[torch.Tensor] = None,
-    r_neg: Optional[torch.Tensor] = None,
-) -> torch.Tensor:
-    """Dispatch to the correct stage's task reward."""
-    if cfg.stage == 1:
-        return _task_reward_stage1(lrm, iou, cfg, e0)
-    if cfg.stage == 2:
-        return _task_reward_stage2(q, lrm, iou, image_embeds_anchor_fp32, cfg, e0)
-    if cfg.stage == 21:
-        assert r_ref is not None and r_neg is not None, "stage 21 requires r_ref/r_neg"
-        return _task_reward_stage2_tether(q, lrm, iou, r_ref, r_neg, cfg, e0)
-    if cfg.stage == 22:
-        assert r_ref is not None, "stage 22 requires r_ref"
-        return _task_reward_stage2_faithful(q, lrm, iou, image_embeds_anchor_fp32, r_ref, cfg, e0)
-    # Stage 3; also the fallback for any stage value not handled above.
-    return _task_reward_stage3(q, lrm, iou, image_embeds_anchor_fp32, cfg, e0)
-def _compute_full_reward(
-    q: torch.Tensor,
-    lrm: torch.Tensor,
-    iou: torch.Tensor,
-    image_embeds_anchor_fp32: torch.Tensor,
-    q_init: torch.Tensor,
-    cfg: QLTPOConfig,
-    e0: float = 1.0,
-    r_ref: Optional[torch.Tensor] = None,
-    r_neg: Optional[torch.Tensor] = None,
-) -> torch.Tensor:
-    """Full reward = task reward - lambda_reg * ||q - q_init||^2 (used for backward)."""
-    r_task = _compute_task_reward(q, lrm, iou, image_embeds_anchor_fp32, cfg, e0, r_ref, r_neg)
-    r_reg  = (q - q_init).pow(2).sum()
-    return r_task - cfg.lambda_reg * r_reg
-# ---------------------------------------------------------------------------
-# Stage 0: gradient connectivity check
-# ---------------------------------------------------------------------------
-def check_grad_connectivity(
-    F_init: torch.Tensor,         # [1, 256] any dtype
-    image_embeds: torch.Tensor,   # [T, 256, 64, 64] any dtype
-    anchor_indices: List[int],
-    sam_model,
-    model_dtype: torch.dtype,
-    num_steps: int = 5,
-    lr: float = 0.0,
-) -> dict:
-    """Stage 0: verify ∂R_iou_pred/∂q ≠ 0 and reward rises with Adam maximize.
-    Runs num_steps of Adam on R = R_iou_pred only (the simplest differentiable
-    reward, no custom ops required).  Returns a diagnostic dict.
-    Usage:
-        diag = check_grad_connectivity(F_init, image_embeds, anchors, sam, dtype)
-        print(diag['grad_norm_step0'], diag['reward_trajectory'])
-        # expect grad_norm > 0 and rewards generally increasing
-        # (Adam steps are not guaranteed to be monotonic)
-    """
-    device = F_init.device
-    image_embeds_anchor = image_embeds[anchor_indices].float().detach()
-    dense_emb = _precompute_dense_emb(sam_model, model_dtype, device).float().detach()
-    dense_pe  = sam_model.prompt_encoder.get_dense_pe().to(device).float().detach()
-    mask_dec  = sam_model.mask_decoder
-    q_init_fp32 = F_init.float().detach()
-    if lr <= 0:
-        lr = 0.01 * (q_init_fp32.norm() / (q_init_fp32.numel() ** 0.5)).item()
-    q = torch.nn.Parameter(q_init_fp32.clone())
-    optimizer = torch.optim.Adam([q], lr=lr, maximize=True)
-    grad_norms, rewards = [], []
-    for step in range(num_steps):
-        optimizer.zero_grad()
-        lrm, iou = _decode_on_anchors_diff(q, image_embeds_anchor, dense_emb, mask_dec, dense_pe)
-        R = iou.mean()
-        R.backward()
-        grad_norm = q.grad.norm().item() if q.grad is not None else 0.0
-        grad_norms.append(grad_norm)
-        rewards.append(R.item())
-        optimizer.step()
-    return {
-        "grad_norm_step0": grad_norms[0],
-        "grad_norms": grad_norms,
-        "reward_trajectory": rewards,
-        "gradient_connected": grad_norms[0] > 1e-8,
-    }
-# ---------------------------------------------------------------------------
-# AVT proxy reward (Step A0: reward–metric correlation study)
-# ---------------------------------------------------------------------------
-@torch.no_grad()
-def _compute_avt_proxy_reward(
-    q_init_fp32: torch.Tensor,               # [1, 256] — frozen AVT anchor (= Fseg)
-    lrm: torch.Tensor,                        # [A, 1, 256, 256] float32
-    image_embeds_anchor_fp32: torch.Tensor,   # [A, 256, 64, 64] float32
-    cfg: "QLTPOConfig",
-    beta: float = 0.5,
-) -> Tuple[float, float]:
-    """Task-specific proxy reward using frozen q_init (Fseg) as teacher.
-    q_init = Fseg is already the audio+video+text fusion token produced by SimToken.
-    Using it as a frozen reference breaks Stage 2's self-confirming bias while
-    measuring whether the mask region aligns with the correct referent.
-    Returns:
-        R_avt   = mean_t cos(z_in_t, q_init)                          [scalar]
-        R_avt_c = mean_t [cos(z_in_t, q_init) - beta·cos(z_out_t, q_init)]  [scalar]
-    """
-    A = lrm.shape[0]
-    q_norm = F.normalize(q_init_fp32[0], dim=0)  # [256]
-    masks_64 = F.interpolate(
-        torch.sigmoid(lrm.squeeze(1) / cfg.area_temp).unsqueeze(1),
-        size=(64, 64), mode="bilinear", align_corners=False,
-    ).squeeze(1)  # [A, 64, 64]
-    r_avt, r_avt_c = 0.0, 0.0
-    for t in range(A):
-        m   = masks_64[t]
-        img = image_embeds_anchor_fp32[t]
-        z_in  = F.normalize(
-            (img * m.unsqueeze(0)).sum(dim=[1, 2]) / (m.sum() + 1e-6), dim=0
-        )
-        z_out = F.normalize(
-            (img * (1.0 - m).unsqueeze(0)).sum(dim=[1, 2]) / ((1.0 - m).sum() + 1e-6), dim=0
-        )
-        c_in  = (q_norm @ z_in).item()
-        c_out = (q_norm @ z_out).item()
-        r_avt   += c_in
-        r_avt_c += c_in - beta * c_out
-    return r_avt / A, r_avt_c / A
-# ---------------------------------------------------------------------------
-# Stage 1–3: q-LTPO-autograd main optimizer
-# ---------------------------------------------------------------------------
-def q_ltpo_autograd(
-    F_init: torch.Tensor,         # [1, 256] any dtype on CUDA
-    image_embeds: torch.Tensor,   # [T, 256, 64, 64] any dtype on CUDA
-    anchor_indices: List[int],
-    sam_model,
-    model_dtype: torch.dtype,
-    cfg: QLTPOConfig,
-) -> torch.Tensor:
-    """Optimise the SAM prompt token q at test time via Adam maximize.
-    q is initialised to F_init (= Fseg after text_hidden_fcs projection).
-    The prompt encoder is bypassed: sparse_emb = q.unsqueeze(1), identical
-    to what prompt_encoder produces when text_embeds is the only prompt.
-    All computation is done in float32 to avoid fp16 gradient truncation.
-    Returns best_q as float32 [1, 256].  Falls back to F_init when gating
-    rejects all updates.
-    """
-    device = F_init.device
-    # ── Precompute constants (float32, detached) ──────────────────────────
-    q_init_fp32 = F_init.float().detach()
-    image_embeds_anchor = image_embeds[anchor_indices].float().detach()
-    dense_emb = _precompute_dense_emb(sam_model, model_dtype, device).float().detach()
-    dense_pe  = sam_model.prompt_encoder.get_dense_pe().to(device).float().detach()
-    mask_dec  = sam_model.mask_decoder
-    # ── Auto-scale lr and max_drift from q_init magnitude ─────────────────
-    rms = q_init_fp32.norm() / (q_init_fp32.numel() ** 0.5)
-    lr        = cfg.lr        if cfg.lr        > 0 else 0.01 * rms.item()
-    max_drift = cfg.max_drift if cfg.max_drift > 0 else 0.5  * q_init_fp32.norm().item()
-    # ── Precompute frozen external reference (stages 21, 22 only) ────────
-    r_ref, r_neg = None, None
-    if cfg.stage in (21, 22):
-        r_ref, r_neg = _compute_r_ref(q_init_fp32, image_embeds_anchor, cfg.r_ref_temp)
-    # ── Baseline forward + e0 existence prior ────────────────────────────
-    with torch.no_grad():
-        lrm0, iou0 = _decode_on_anchors_diff(
-            q_init_fp32, image_embeds_anchor, dense_emb, mask_dec, dense_pe
-        )
-        # e0 = stopgrad(R_area_soft(q_init)): fixes the scalar before the loop.
-        # Suppresses R_iou when the initial mask is near-empty (existence prior).
-        r_area_soft_init = torch.sigmoid(lrm0 / cfg.area_temp).mean().item()
-        e0 = _compute_e0(r_area_soft_init, cfg)
-        R_init_task = _compute_task_reward(
-            q_init_fp32, lrm0, iou0, image_embeds_anchor, cfg, e0=e0,
-            r_ref=r_ref, r_neg=r_neg,
-        ).item()
-    # ── Optimisation setup ────────────────────────────────────────────────
-    q = torch.nn.Parameter(q_init_fp32.clone())
-    optimizer = torch.optim.Adam([q], lr=lr, maximize=True)
-    best_q      = q.detach().clone()
-    best_reward = R_init_task
-    hit_clip    = False
-    # ── Optimisation loop ─────────────────────────────────────────────────
-    # Track per-step soft area to diagnose whether B1 penalty ever activates.
-    _step_soft_areas: List[float] = []
-    for step in range(cfg.T):
-        optimizer.zero_grad()
-        lrm, iou = _decode_on_anchors_diff(
-            q, image_embeds_anchor, dense_emb, mask_dec, dense_pe
-        )
-        R_full = _compute_full_reward(q, lrm, iou, image_embeds_anchor, q_init_fp32, cfg, e0=e0,
-                                      r_ref=r_ref, r_neg=r_neg)
-        R_full.backward()
-        optimizer.step()
-        # Hard L2 norm clip: keep q within max_drift ball around q_init
-        with torch.no_grad():
-            diff = q - q_init_fp32
-            d    = diff.norm()
-            if d > max_drift:
-                q.copy_(q_init_fp32 + diff * (max_drift / d))
-                hit_clip = True
-        # Fresh no_grad forward on the post-step q_{N+1} for correct tracking.
-        # (Pre-step lrm/iou would mismatch the updated q, causing wrong best_q.)
-        with torch.no_grad():
-            lrm_eval, iou_eval = _decode_on_anchors_diff(
-                q.detach(), image_embeds_anchor, dense_emb, mask_dec, dense_pe
-            )
-            # Record soft area at this step for B1 activation diagnosis
-            _step_soft_areas.append(
-                torch.sigmoid(lrm_eval / cfg.area_temp).mean().item()
-            )
-            r_task = _compute_task_reward(
-                q.detach(), lrm_eval, iou_eval, image_embeds_anchor, cfg, e0=e0,
-                r_ref=r_ref, r_neg=r_neg,
-            ).item()
-            if r_task > best_reward:
-                best_reward = r_task
-                best_q = q.detach().clone()
-    # Peak excess: how much did soft area exceed e0 at its highest point?
-    # b1_peak_excess > 0  ↔  B1 ReLU was non-zero at that step.
-    # b1_peak_excess = 0  ↔  B1 never activated (area stayed below e0 throughout).
-    _max_step_area = max(_step_soft_areas) if _step_soft_areas else r_area_soft_init
-    b1_peak_excess = max(_max_step_area - e0, 0.0)
-    # ── Reward gating: clean re-eval of best_q vs q_init ─────────────────
-    with torch.no_grad():
-        lrm_b, iou_b = _decode_on_anchors_diff(
-            best_q, image_embeds_anchor, dense_emb, mask_dec, dense_pe
-        )
-        R_best_task = _compute_task_reward(
-            best_q, lrm_b, iou_b, image_embeds_anchor, cfg, e0=e0,
-            r_ref=r_ref, r_neg=r_neg,
-        ).item()
-    area_init = (lrm0 > 0).float().mean().item()
-    effective_gate = (
-        cfg.null_gate_delta
-        if (cfg.null_gate_delta > 0 and area_init < cfg.null_area_threshold)
-        else cfg.gate_delta
-    )
-    accepted = R_best_task > R_init_task + effective_gate
-    # ── Mask soft-IoU: how much did the mask actually change? ─────────────
-    # Answers whether q-drift translated into mask change, or fell in a
-    # flat direction of the mask decoder manifold.
-    with torch.no_grad():
-        m0 = torch.sigmoid(lrm0 / cfg.area_temp).squeeze(1)   # [A,256,256]
-        mb = torch.sigmoid(lrm_b / cfg.area_temp).squeeze(1)   # [A,256,256]
-        inter = (m0 * mb).sum(dim=[1, 2])
-        union = (m0 + mb - m0 * mb).sum(dim=[1, 2])
-        mask_soft_iou = (inter / (union + 1e-6)).mean().item()
-        # Soft area at best_q — tracks whether B1 asymmetric penalty worked
-        r_area_soft_best = mb.mean().item()  # sigmoid(lrm_b/area_temp).mean()
-    # Reward decomposition: iou contribution to reward gain
-    R_iou_contrib_gain = (
-        cfg.lambda_iou * e0 * (iou_b.mean().item() - iou0.mean().item())
-    )
-    # AVT proxy reward (Step A0 correlation study)
-    r_avt_init, r_avt_c_init = _compute_avt_proxy_reward(
-        q_init_fp32, lrm0, image_embeds_anchor, cfg
-    )
-    r_avt_best, r_avt_c_best = _compute_avt_proxy_reward(
-        q_init_fp32, lrm_b, image_embeds_anchor, cfg
-    )
-    # ── Per-sample diagnostics ────────────────────────────────────────────
-    _q_ltpo_stats.append({
-        "accepted":           accepted,
-        "reward_gain":        R_best_task - R_init_task,
-        "drift":              (best_q - q_init_fp32).norm().item(),
-        "hit_clip":           hit_clip,
-        "e0":                 e0,
-        "R_iou_pred_init":    iou0.mean().item(),
-        "R_iou_pred_best":    iou_b.mean().item(),
-        "area_hard_init":     area_init,
-        "area_hard_best":     (lrm_b > 0).float().mean().item(),
-        "r_area_soft_init":   r_area_soft_init,
-        "r_area_soft_best":   r_area_soft_best,
-        "b1_peak_excess":     b1_peak_excess,
-        "mask_soft_iou":      mask_soft_iou,
-        "R_iou_contrib_gain": R_iou_contrib_gain,
-        # AVT proxy: frozen q_init as teacher — task-specific alignment
-        "r_avt_init":         r_avt_init,
-        "r_avt_best":         r_avt_best,
-        "r_avt_gain":         r_avt_best - r_avt_init,
-        "r_avt_c_init":       r_avt_c_init,
-        "r_avt_c_best":       r_avt_c_best,
-        "r_avt_c_gain":       r_avt_c_best - r_avt_c_init,
-    })
-    if not accepted:
-        return F_init.float()
-    return best_q
-# ===========================================================================
-# Direction II: Frame-adaptive token optimization (stage=4)
-# q_t = q_global + delta_t  — shared global token + per-anchor residual
-# ===========================================================================
-def q_ltpo_frame_adaptive(
-    F_init: torch.Tensor,         # [1, 256] any dtype on CUDA
-    image_embeds: torch.Tensor,   # [T, 256, 64, 64] any dtype on CUDA
-    anchor_indices: List[int],
-    sam_model,
-    model_dtype: torch.dtype,
-    cfg: QLTPOConfig,
-) -> Tuple[torch.Tensor, torch.Tensor]:
-    """Frame-adaptive q-LTPO: optimize q_global and per-anchor delta jointly.
-    Each anchor frame t gets its own token q_t = q_global + delta_t.
-    delta_t is initialized to zero so q_t starts equal to q_init for all frames.
-    Per-frame existence priors e0_t suppress optimization on near-empty anchors.
-    Returns:
-        q_global [1, 256] float32  — shared global token
-        delta    [A, 256] float32  — per-anchor residuals (zero if not accepted)
-    """
-    device = F_init.device
-    A = len(anchor_indices)
-    q_init_fp32 = F_init.float().detach()
-    image_embeds_anchor = image_embeds[anchor_indices].float().detach()
-    dense_emb = _precompute_dense_emb(sam_model, model_dtype, device).float().detach()
-    dense_pe  = sam_model.prompt_encoder.get_dense_pe().to(device).float().detach()
-    mask_dec  = sam_model.mask_decoder
-    rms             = q_init_fp32.norm() / (q_init_fp32.numel() ** 0.5)
-    lr              = cfg.lr       if cfg.lr       > 0 else 0.01 * rms.item()
-    max_drift       = cfg.max_drift if cfg.max_drift > 0 else 0.5  * q_init_fp32.norm().item()
-    max_delta_drift = cfg.max_delta_drift_scale * q_init_fp32.norm().item()
-    # ── Baseline: per-anchor e0 existence priors ────────────────────────────
-    with torch.no_grad():
-        lrm0, iou0 = _decode_on_anchors_diff(
-            q_init_fp32, image_embeds_anchor, dense_emb, mask_dec, dense_pe
-        )
-        e0_vec: List[float] = []
-        for t in range(A):
-            e0_t = torch.sigmoid(lrm0[t] / cfg.area_temp).mean().item()
-            e0_vec.append(_compute_e0(e0_t, cfg))
-        e0_global = sum(e0_vec) / A
-        R_init_task = _task_reward_frame_adaptive(lrm0, iou0, cfg, e0_vec).item()
-    # ── Setup optimization ───────────────────────────────────────────────────
-    q_global = torch.nn.Parameter(q_init_fp32.clone())
-    delta    = torch.nn.Parameter(torch.zeros(A, 256, device=device, dtype=torch.float32))
-    optimizer = torch.optim.Adam([q_global, delta], lr=lr, maximize=True)
-    best_q_global = q_global.detach().clone()
-    best_delta    = delta.detach().clone()
-    best_reward   = R_init_task
-    hit_clip      = False
-    # ── Optimization loop ────────────────────────────────────────────────────
-    for step in range(cfg.T):
-        optimizer.zero_grad()
-        lrm, iou = _decode_on_anchors_diff_adaptive(
-            q_global, delta, image_embeds_anchor, dense_emb, mask_dec, dense_pe
-        )
-        R_full = _compute_full_reward_adaptive(
-            q_global, delta, lrm, iou, q_init_fp32, cfg, e0_vec
-        )
-        R_full.backward()
-        optimizer.step()
-        # Clip q_global and each per-anchor delta within trust regions
-        with torch.no_grad():
-            diff = q_global - q_init_fp32
-            d = diff.norm()
-            if d > max_drift:
-                q_global.copy_(q_init_fp32 + diff * (max_drift / d))
-                hit_clip = True
-            for t in range(A):
-                dn = delta[t].norm()
-                if dn > max_delta_drift:
-                    delta[t].copy_(delta[t] * (max_delta_drift / dn))
-        # Track best (no_grad re-eval of task reward without reg)
-        with torch.no_grad():
-            lrm_eval, iou_eval = _decode_on_anchors_diff_adaptive(
-                q_global.detach(), delta.detach(),
-                image_embeds_anchor, dense_emb, mask_dec, dense_pe
-            )
-            r_task = _task_reward_frame_adaptive(lrm_eval, iou_eval, cfg, e0_vec).item()
-            if r_task > best_reward:
-                best_reward   = r_task
-                best_q_global = q_global.detach().clone()
-                best_delta    = delta.detach().clone()
-    # ── Gating ───────────────────────────────────────────────────────────────
-    with torch.no_grad():
-        lrm_b, iou_b = _decode_on_anchors_diff_adaptive(
-            best_q_global, best_delta, image_embeds_anchor, dense_emb, mask_dec, dense_pe
-        )
-        R_best_task = _task_reward_frame_adaptive(lrm_b, iou_b, cfg, e0_vec).item()
-    accepted = R_best_task > R_init_task + cfg.gate_delta
-    area_init = (lrm0 > 0).float().mean().item()
-    r_area_soft_init = sum(torch.sigmoid(lrm0[t] / cfg.area_temp).mean().item() for t in range(A)) / A
-    r_area_soft_best = sum(torch.sigmoid(lrm_b[t] / cfg.area_temp).mean().item() for t in range(A)) / A
-    # Actual mask soft-IoU between init and best (per anchor, averaged)
-    m0 = torch.sigmoid(lrm0 / cfg.area_temp).squeeze(1)   # [A,256,256]
-    mb = torch.sigmoid(lrm_b / cfg.area_temp).squeeze(1)   # [A,256,256]
-    inter = (m0 * mb).sum(dim=[1, 2])
-    union = (m0 + mb - m0 * mb).sum(dim=[1, 2])
-    mask_soft_iou_fa = (inter / (union + 1e-6)).mean().item()
-    _q_ltpo_stats.append({
-        "accepted":           accepted,
-        "reward_gain":        R_best_task - R_init_task,
-        "drift":              (best_q_global - q_init_fp32).norm().item(),
-        "delta_norm":         best_delta.norm().item(),
-        "hit_clip":           hit_clip,
-        "e0":                 e0_global,
-        "R_iou_pred_init":    iou0.mean().item(),
-        "R_iou_pred_best":    iou_b.mean().item(),
-        "area_hard_init":     area_init,
-        "area_hard_best":     (lrm_b > 0).float().mean().item(),
-        "r_area_soft_init":   r_area_soft_init,
-        "r_area_soft_best":   r_area_soft_best,
-        "b1_peak_excess":     0.0,
-        "mask_soft_iou":      mask_soft_iou_fa,
-        "R_iou_contrib_gain": cfg.lambda_iou * e0_global * (iou_b.mean().item() - iou0.mean().item()),
-    })
-    if not accepted:
-        return q_init_fp32, torch.zeros(A, 256, device=device, dtype=torch.float32)
-    return best_q_global, best_delta
-def decode_full_video_adaptive(
-    q_global: torch.Tensor,       # [1, 256] float32
-    delta: torch.Tensor,          # [A, 256] float32
-    anchor_indices: List[int],
-    image_embeds: torch.Tensor,   # [T, 256, 64, 64] model dtype on CUDA
-    sam_model,
-    resize: tuple,
-    orgsize: tuple,
-    model_dtype: torch.dtype,
-) -> torch.Tensor:
-    """Decode all T frames with frame-adaptive tokens.
-    Each frame is assigned to its nearest anchor by index distance, then decoded
-    with q_t = q_global + delta[anchor_idx].
-    Returns raw logit masks [T, H_orig, W_orig].
-    """
-    T      = image_embeds.shape[0]
-    A      = len(anchor_indices)
-    device = image_embeds.device
-    dense_emb = _precompute_dense_emb(sam_model, model_dtype, device)
-    dense_pe  = sam_model.prompt_encoder.get_dense_pe().to(device)
-    # Nearest-anchor assignment for every frame
-    anchor_arr = torch.tensor(anchor_indices, dtype=torch.float32)
-    frame_to_anchor = [int((anchor_arr - t).abs().argmin().item()) for t in range(T)]
-    pred_masks: List[torch.Tensor] = []
-    with torch.no_grad():
-        for t in range(T):
-            a   = frame_to_anchor[t]
-            q_t = (q_global + delta[a : a + 1]).to(model_dtype)  # [1, 256]
-            sparse_emb = q_t.unsqueeze(1)                         # [1, 1, 256]
-            lrm_t, _ = sam_model.mask_decoder(
-                image_embeddings=image_embeds[t : t + 1],
-                image_pe=dense_pe,
-                sparse_prompt_embeddings=sparse_emb,
-                dense_prompt_embeddings=dense_emb,
-                multimask_output=False,
-            )  # [1, 1, 256, 256]
-            pred_t = sam_model.postprocess_masks(lrm_t, input_size=resize, original_size=orgsize)
-            pred_masks.append(pred_t.squeeze(0).squeeze(0))  # [H, W]
-    return torch.stack(pred_masks, dim=0)  # [T, H_orig, W_orig]
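
The nearest-anchor assignment used by the deleted `decode_full_video_adaptive` can be illustrated without tensors. The following is a plain-Python sketch, not the repository's code; `nearest_anchor_map` is a hypothetical helper name. It assumes ties should go to the earlier anchor, which is what a first-minimum `argmin` would give.

```python
from typing import List


def nearest_anchor_map(anchor_indices: List[int], num_frames: int) -> List[int]:
    """For each frame t, return the position (into anchor_indices) of the closest anchor."""
    if not anchor_indices:
        raise ValueError("need at least one anchor")
    mapping = []
    for t in range(num_frames):
        # min() keeps the first minimum, so a tie goes to the earlier anchor.
        best = min(range(len(anchor_indices)), key=lambda a: abs(anchor_indices[a] - t))
        mapping.append(best)
    return mapping


# Frame 7 is equally far from anchors 5 and 9, so it maps to position 1 (anchor 5).
print(nearest_anchor_map([0, 5, 9], 10))
```

Each frame would then be decoded with `q_global + delta[mapping[t]]`, so frames far from any anchor simply reuse the residual of their closest anchor.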

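Both optimization loops in the deleted `seg_ltpo.py` follow each Adam step with a hard clip that keeps the token inside an L2 ball around `q_init`. A minimal sketch of that projection, assuming PyTorch is available; `project_to_ball` is a hypothetical name, and the original code performs the same arithmetic in place with `copy_` under `torch.no_grad()`.

```python
import torch


def project_to_ball(q: torch.Tensor, center: torch.Tensor, radius: float):
    """Project q onto the L2 ball of the given radius around center.

    Returns (projected_q, was_clipped).
    """
    diff = q - center
    dist = diff.norm()
    if dist > radius:
        # Rescale the offset so its norm equals the radius exactly.
        return center + diff * (radius / dist), True
    return q, False


center = torch.zeros(1, 4)
q = torch.tensor([[3.0, 4.0, 0.0, 0.0]])  # distance 5 from center
clipped, hit = project_to_ball(q, center, radius=2.5)
print(clipped, hit)
```

Because the direction of the offset is preserved, the clip only limits how far the token drifts; it does not change which way the optimizer moved it.
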
setup_simtoken.md DELETED

@@ -1,163 +0,0 @@
-# SimToken Setup
-This document describes how to rebuild the SimToken environment on a new machine and prepare for the follow-up A-min experiments.
----
-## 1. Create Environment
-First check the GPU and CUDA driver status:
-```bash
-nvidia-smi
-```
-Create the conda environment:
-```bash
-/opt/miniforge3/condabin/conda create -n simtoken python=3.10 -y
-/opt/miniforge3/condabin/conda activate simtoken
-python -m pip install --upgrade pip wheel "setuptools<81"
-pip install \
-  torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 \
-  --index-url https://download.pytorch.org/whl/cu121
-pip install \
-  transformers==4.30.2 \
-  peft==0.2.0 \
-  accelerate==0.21.0 \
-  sentencepiece \
-  protobuf \
-  safetensors \
-  numpy==1.26.4 \
-  pandas \
-  matplotlib \
-  opencv-python \
-  pillow \
-  tqdm \
-  einops \
-  timm \
-  requests \
-  towhee \
-  huggingface_hub
-```
-Quick sanity check:
-```bash
-python - <<'PY'
-import torch
-print("torch:", torch.__version__)
-print("cuda available:", torch.cuda.is_available())
-print("device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu")
-PY
-```
----
-## 2. Download from HuggingFace
-If the new machine is initialized from HuggingFace rather than with a migration tool, log in first:
-```bash
-huggingface-cli login
-```
-Download the full repo:
-```bash
-mkdir -p /workspace/SimToken
-cd /workspace/SimToken
-huggingface-cli download yfan07/SimToken \
-  --repo-type model \
-  --local-dir . \
-  --local-dir-use-symlinks False
-```
-After the download finishes, extract the data:
-```bash
-cd /workspace/SimToken/data
-tar -xf image_embed.tar
-tar -xzf gt_mask.tar.gz
-tar -xzf audio_embed.tar.gz
-tar -xf media.tar
-```
----
-## 3. Pre-download Model Weights
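-`transformers==4.30.2` may have network/API compatibility issues with newer versions of `huggingface_hub`. It is safer to download the models into the local cache with the CLI first, then set `TRANSFORMERS_OFFLINE=1` when running experiments.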
-```bash
-# Chat-UniVi-7B
-huggingface-cli download Chat-UniVi/Chat-UniVi-7B-v1.5
-# CLIP ViT-L
-huggingface-cli download openai/clip-vit-large-patch14
-```
-After the download finishes, run an offline check:
-```bash
-cd /workspace/SimToken
-TRANSFORMERS_OFFLINE=1 /opt/miniforge3/condabin/conda run -n simtoken \
-  python -m py_compile train.py load_model.py decoder_invariance_check.py
-```
----
-## 4. Upload to HuggingFace
-After the experiments, if you need to re-upload to HuggingFace, first pack the data directories into archives to reduce the file count:
-```bash
-cd /workspace/SimToken/data
-tar -cf image_embed.tar image_embed/
-tar -czf gt_mask.tar.gz gt_mask/
-tar -czf audio_embed.tar.gz audio_embed/
-tar -cf media.tar media/
-ls -lh *.tar*
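-# HuggingFace enforces a hard 50GB limit per file; if image_embed.tar exceeds 50GB,
-# split it into shards smaller than 50GB before uploading.
-split -b 45G -d -a 2 image_embed.tar image_embed.tar.part-
-# Verify that the concatenated shards still yield the complete tar file listing.
-cat image_embed.tar.part-* | tar -tf - | grep -v '/$' | wc -l
-# Delete the oversized original tar only after the shard check passes, to avoid a failed upload.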
-rm -f image_embed.tar
-rm -rf image_embed/ gt_mask/ audio_embed/ media/
-```
-To restore `image_embed.tar` after downloading:
-```bash
-cd /workspace/SimToken/data
-cat image_embed.tar.part-* > image_embed.tar
-tar -xf image_embed.tar
-```
-Clean caches and upload:
-```bash
-cd /workspace/SimToken
-find . -name "__pycache__" -prune -exec rm -rf {} +
-find . -name ".pytest_cache" -prune -exec rm -rf {} +
-find . -name ".cache" -prune -exec rm -rf {} +
-find . -name "*.pyc" -delete
-huggingface-cli login
-python upload_hf.py --repo yfan07/SimToken
-```

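The upload steps in the deleted `setup_simtoken.md` split an oversized tar and check the shards by listing the concatenated stream. A stricter byte-level check can be sketched with the Python standard library alone. The helper names below are hypothetical and not part of the repository; the sketch assumes the shards sort lexicographically into their original order, as the two-digit numeric suffixes from `split -d -a 2` do.

```python
import hashlib
from pathlib import Path
from typing import Iterable

CHUNK = 1 << 20  # read 1 MiB at a time so large archives never sit in memory


def sha256_of_files(paths: Iterable[Path]) -> str:
    """SHA-256 of the byte stream formed by reading the files in order."""
    digest = hashlib.sha256()
    for path in paths:
        with open(path, "rb") as fh:
            for block in iter(lambda: fh.read(CHUNK), b""):
                digest.update(block)
    return digest.hexdigest()


def shards_match(original: Path, shard_glob: str) -> bool:
    """True if the sorted shards concatenate to exactly the original file."""
    shards = sorted(original.parent.glob(shard_glob))
    if not shards:
        return False
    return sha256_of_files([original]) == sha256_of_files(shards)
```

A call such as `shards_match(Path("image_embed.tar"), "image_embed.tar.part-*")` would be run before removing the original tar. It reads every byte twice, so it is slower than the listing check but also catches corruption inside file contents.
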
simtoken_experiment.md DELETED

@@ -1,369 +0,0 @@
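-# SimToken Experiment Roadmap
-## 0. Current Status
-The preliminary diagnostics are complete, and the roadmap has converged on **A-min dynamic referent gate training**.
-Confirmed conclusions:
-1. **The SAM decoder downstream decodes frame by frame, in batch-parallel fashion**
-   `mask_decoder(image_embeddings[0:T])[t]` and `mask_decoder(image_embeddings[t:t+1])[0]` differ only by mixed-precision numerical noise. The old decoder-level joint-frame competition hypothesis is closed.
-2. **The target_frame sweep is essentially ineffective**
-   The q generated for different target frames is nearly identical: `cos_to_q5` is usually `0.997+`, and the oracle gain on Seen/Null is about `+0.0009`. This TTO route is closed.
-3. **Raw SAM-space D2 fails**
-   The 256-dim `q/Fseg` and the SAM image embedding do not share a semantic space in which cosine similarity is directly meaningful: `real q ≈ shuffled/wrong_ref q`, and random q even scores higher. This definition is closed.
-4. **LLM-space D2 carries a weak diagnostic signal but is not suitable as the main reward**
-   Computing D2 between the 4096-dim `[SEG]` hidden state and the visual tokens obtained from `mm_projector(CLIP patch tokens)` yields a positive correlation:
-   - `corr(s_pred, frame_iou) ≈ +0.316`
-   - the failure rate within the bottom 20% of `s_pred` is about `1.60x` the random baseline
-   - the partial correlation after controlling for `iou_pred` / `pred_area` is about `+0.14`
-   Conclusion: `s_pred(beta=1.0)` can serve as a diagnostic signal or as a candidate input to a frame-aware gate, but not as the core TTO reward.
-5. **margin-D2 is ineffective**
-   The offline `s_margin = s(real) - max(s(shuffled), s(wrong_ref))` has a failure enrichment of about `0.93x`; it cancels out the useful generic visibility/quality signal. This route is closed.
-The cleanest current explanation is:
-> q itself is usually a stable referent anchor; the main bottleneck is neither q generation nor simple q selection, but how the SAM decoder uses the existing `mask_token -> q` sparse self-attention path.
-Update, 2026-04-22:
-Full training takes about 2-4 hours per epoch, and the bottleneck is mainly the 7B MLLM forward pass, not the gate itself. The experiment strategy has therefore been adjusted to:
-1. first cache `q = seg_embeddings` under a fixed checkpoint;
-2. train gate-only on cached q + cached SAM image embeddings;
-3. use the cached eval split to quickly judge whether the gate yields a generalization gain;
-4. run full A-min joint training only after the gate-only generalization signal holds.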
----
-## 1. Current A-min Implementation
-The A-min dynamic referent gate has been added to the code:
-- File: `models/segment_anything/modeling/transformer.py`
-- Module: `ReferentGate`
-- Insertion point: in `TwoWayAttentionBlock`, after sparse self-attention + `norm1` and before token-to-image cross-attention
-- Applies to: `mask_tokens` only
-- Does not apply to: `iou_token` or `q/sparse_prompt` itself
-SAM token indices:
-```python
-tokens = [iou_token, mask_tokens..., sparse_prompt(q)]
-```
-Therefore:
-```python
-iou_token index: 0
-mask token range: 1 : 1 + num_mask_tokens
-q token index: 1 + num_mask_tokens
-```
-Form of the A-min gate:
-```python
-alpha = sigmoid(Linear([mask_token, q, cos(mask_token, q)]))
-mask_token = mask_token + alpha * Linear(q)
-```
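-To keep the initial behavior of old checkpoints unchanged, the `proj(q)` branch is zero-initialized. The `gate` branch is currently zero-initialized as well, so that alpha has a clean observation baseline: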
-```python
-nn.init.zeros_(self.gate.weight)
-nn.init.zeros_(self.gate.bias)
-nn.init.zeros_(self.proj.weight)
-nn.init.zeros_(self.proj.bias)
-```
-At initialization the gate is the identity:
-```text
-max_abs_diff(gate(mask, q), mask) = 0.0
-alpha_mean = 0.5
-alpha_std = 0.0
-```
-The training forward pass currently keeps the full chain: `prepare_inputs_labels_for_multimodal -> MLLM forward -> text_hidden_fcs -> SAM mask decoder -> loss`. `--gate_only` only controls which parameters are frozen and no longer changes the forward semantics.
----
-## 2. Newly Added Tools
-### 2.1 Training Script Enhancements
-`train.py` now supports:
-- `--max_steps`
-- `--overfit_samples`
-- `--log_gate_stats_every`
-- `--skip_eval_after_train`
-- `--eval_train_only`
-At startup it prints whether the referent gate parameters are trainable, whether they are in the optimizer, and the initial `proj_norm/gate_norm`.
-### 2.2 Cached-q Route
-New scripts:
-- `cache_q_features.py`
-  - caches `q = seg_embeddings` offline
-  - cache files are small because only q and a little metadata are saved
-  - `image_embeddings` still come from the existing `data/image_embed/{vid}.pt`
-  - `gt_masks` still come from the existing `data/gt_mask/...`
-- `train_cached_gate.py`
-  - loads the base model and the cached q
-  - freezes all parameters and trains only `referent_gate`
-  - supports `--eval_only` and `--disable_gate`
-  - supports `--save_gate_only`, which saves only the gate parameters (checkpoint of about 1.6MB)
-  - supports `--gate_checkpoint`, which overlays a gate-only checkpoint on the base checkpoint
-  - gate stats record:
-```text
-batch_miou
-batch_fscore
-proj_norm
-gate_norm
-proj_grad_norm
-gate_grad_norm
-alpha_mean / alpha_std / alpha_min / alpha_max
-```
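-Cached decoding has been optimized: one dataloader batch is flattened into a paired frame batch for a single call to `mask_decoder.forward_modified_v3`, which avoids the main overhead of calling the decoder per sample without producing a prompt/image cross product.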
----
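-## 3. Completed Experiment Results
-### 3.1 Consistency of Cached Identity with the Original Pipeline
-First, the first 10 samples of `test_s` were used to check that the cached pipeline matches the original `load_model.py`: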
-```text
-cached identity:
-mIoU   = 0.9686462879
-Fscore = 0.9868578851
-original load_model.py:
-mIoU   = 0.9686277151
-Fscore = 0.9868472159
-diff:
-mIoU   = +0.0000186
-Fscore = +0.0000107
-```
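-Conclusion: the differences are far below 0.001, so the cached q pipeline is consistent with the original eval pipeline and can be used for quick gate-only validation.
-### 3.2 Gate Probe: Gradient Path and Alpha Differentiation
-Ran 50 optimizer steps on cached train128: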
-```text
-step 5:
-proj_norm=0.074015
-gate_norm=0.064479
-proj_grad_norm=0.052291
-gate_grad_norm=0.000170
-alpha_mean=0.4999
-alpha_std=0.0019
-step 50:
-proj_norm=0.428711
-gate_norm=0.523223
-proj_grad_norm=0.022453
-gate_grad_norm=0.000504
-alpha_mean=0.5063
-alpha_std=0.0112
-```
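-Conclusions:
-- `proj_norm` grows steadily from 0, so the injection branch receives gradient;
-- `gate_norm` also starts to grow, so the alpha control branch takes part in learning;
-- `alpha_std` grows from 0, indicating that the gate responds differently to different inputs;
-- the computation graph, frozen scope, and optimizer param groups are all normal.
-### 3.3 overfit32: Expressiveness Check
-Cached train32 identity baseline: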
-```text
-mIoU   = 0.8814558
-Fscore = 0.9375512
-```
-Cached gate overfit32, 200 steps, lr=1e-4:
-```text
-mIoU   = 0.9085821
-Fscore = 0.9444574
-```
-Gain:
-```text
-mIoU   = +0.0271263
-Fscore = +0.0069063
-```
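-Conclusion: with q, the SAM image embeddings, and the original mask decoder parameters all frozen, training only the A-min gate clearly improves training-set mIoU, which shows that the gate mechanism has expressive capacity and the gradient path is clear.
-### 3.4 overfit32 Generalization Evaluation
-On the first 200 samples of each cached eval split, identity baseline: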
-```text
-test_s mIoU   = 0.7390979
-test_s Fscore = 0.8190672
-test_u mIoU   = 0.6732285
-test_u Fscore = 0.7734924
-test_n metric = 0.0606105
-```
-overfit32 gate checkpoint:
-```text
-test_s mIoU   = 0.7199481
-test_s Fscore = 0.8045849
-test_u mIoU   = 0.6672303
-test_u Fscore = 0.7663978
-test_n metric = 0.0648588
-```
-delta:
-```text
-test_s mIoU   = -0.0191498
-test_s Fscore = -0.0144823
-test_u mIoU   = -0.0059983
-test_u Fscore = -0.0070946
-test_n metric = +0.0042483
-```
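-Conclusions:
-- the overfit32 gate does not generalize;
-- the Null metric rises slightly, suggesting that small-sample overfitting mildly inflates the foreground;
-- this is not a failure of the method but the expected outcome: 32 samples are not enough to learn a referent anchoring that generalizes;
-- the next step is to enlarge the cached train set and lower the lr.
----
-## 4. Next Experiment: Cached train256 Gate-Only
-The user has already built the q cache for train256. The next step is a more conservative gate-only generalization experiment on train256.
-### Step 1: Train cached gate-only on train256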
-```bash
-cd /workspace/SimToken
-mkdir -p log checkpoints
-TRANSFORMERS_OFFLINE=1 python -u -W ignore train_cached_gate.py \
-  --cache_split train \
-  --cache_root /workspace/SimToken/cache_q \
-  --name cached_gate_train256_s300_lr3e5 \
-  --epochs 20 \
-  --max_steps 300 \
-  --batch_size 8 \
-  --lr 3e-5 \
-  --saved_model /workspace/SimToken/checkpoints/simtoken_pretrained.pth \
-  --log_root /workspace/SimToken/log \
-  --checkpoint_root /workspace/SimToken/checkpoints \
-  --log_gate_stats_every 50 \
-  --skip_eval_after_train \
-  --save_gate_only \
-  2>&1 | tee /workspace/SimToken/log/cached_gate_train256_s300_lr3e5.stdout
-```
-Key things to watch during training:
-```text
-batch_miou / batch_fscore: do they improve gradually?
-proj_norm: does it keep growing?
-alpha_std: does it differentiate mildly?
-Null risk: does alpha show an extreme shift?
-```
-If `proj_norm` is still close to 0 within the first 100 steps, lr=3e-5 is probably too small; switch back to 1e-4 or use layer-wise lr.
-### Step 2: Evaluate the cached train256 gate checkpoint
-```bash
-for split in test_s test_u test_n; do
-  TRANSFORMERS_OFFLINE=1 python -u -W ignore train_cached_gate.py \
-    --cache_split $split \
-    --cache_root /workspace/SimToken/cache_q \
-    --batch_size 8 \
-    --saved_model /workspace/SimToken/checkpoints/simtoken_pretrained.pth \
-    --gate_checkpoint /workspace/SimToken/checkpoints/cached_gate_train256_s300_lr3e5.pth \
-    --eval_only \
-    --name cached_gate_train256_s300_lr3e5_${split}_200 \
-    2>&1 | tee /workspace/SimToken/log/cached_gate_train256_s300_lr3e5_${split}_200.stdout
-done
-```
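-For the baseline comparison, use the identity results on 200 samples from 3.4.
-### Step 3: Decide based on the results
-Decision criteria:
-- Seen / Unseen both improve: move on to a larger cached train set or to full A-min;
-- Seen improves but Unseen does not: the gate may still be learning dataset patterns; more train cache or stronger regularization is needed;
-- Seen / Unseen both drop: do not run full A-min; tune lr, regularization, or gate capacity first;
-- Null metric stays `< 0.07`: do not add an area penalty for now;
-- Null metric exceeds `0.10`: strong danger signal; an area penalty or a constraint on predicted area is needed.
-If train256 shows a weak positive gain of small magnitude, look at the alpha distribution and at hard/easy frames first instead of scaling up right away. If alpha is nearly identical across all frames, it may just be a global bias; if alpha is systematically higher on hard frames, it looks more like referent anchoring.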
----
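-## 5. Success Criteria
-A-min cannot be judged successful on overall mIoU alone; all of the following must hold:
-1. Seen / Unseen mIoU improve consistently;
-2. the improvement trend on Unseen is at least no weaker than on Seen;
-3. the Null metric does not degrade and the predicted area does not inflate;
-4. hard frames improve more noticeably;
-5. if gate alpha is logged, alpha on hard frames should be systematically higher than on easy frames.
-Failure interpretations:
-- if Seen improves but Unseen does not: the gate may have learned dataset patterns rather than referent anchoring;
-- if Null degrades: the gate may have amplified generic foreground saliency;
-- if gate-only shows no change but full A-min gains: the gate needs to co-adapt with the mask decoder / text projection;
-- if all splits drop: the gate insertion point, initialization, or learning rate needs to be re-checked.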
----
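-## 6. Follow-up Mechanism Analysis
-If A-min yields a positive gain, follow up with hook analysis of:
-1. `mask_token -> q` in sparse self-attention;
-2. the attention of mask tokens to image tokens in token-to-image attention;
-3. gate alpha on hard/easy frames before and after A-min;
-4. the relationship between `s_pred(beta=1.0)` and gate alpha.
-This part is for the paper's explanation and does not block current work.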
----
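-## 7. One-Sentence Conclusion
-> The A-min gate's gradient path, expressive capacity, and cached-pipeline consistency have all been verified; overfit32 significantly improves the training set but does not generalize. The current main line is to validate gate-only generalization on a larger cached train set (the train256 cache is already built) before deciding whether to invest in full A-min joint training.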

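The gate formula and the zero initialization described in section 1 of the deleted `simtoken_experiment.md` can be checked with a small standalone module. This is a hypothetical reconstruction, not the repository's `ReferentGate`: the class name, the scalar (per-token) alpha, and the tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferentGateSketch(nn.Module):
    """alpha = sigmoid(Linear([mask_token, q, cos])); mask_token += alpha * Linear(q)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Linear(2 * dim + 1, 1)  # assumed: one alpha per mask token
        self.proj = nn.Linear(dim, dim)
        # Zero-init both branches so the module starts as an exact identity.
        for layer in (self.gate, self.proj):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, mask_tokens: torch.Tensor, q: torch.Tensor):
        # mask_tokens: [B, M, C]; q: [B, 1, C], broadcast over the M mask tokens.
        q_exp = q.expand_as(mask_tokens)
        cos = F.cosine_similarity(mask_tokens, q_exp, dim=-1).unsqueeze(-1)
        alpha = torch.sigmoid(self.gate(torch.cat([mask_tokens, q_exp, cos], dim=-1)))
        return mask_tokens + alpha * self.proj(q_exp), alpha


gate = ReferentGateSketch(dim=8)
mask_tokens, q = torch.randn(2, 4, 8), torch.randn(2, 1, 8)
out, alpha = gate(mask_tokens, q)
print((out - mask_tokens).abs().max().item(), alpha.mean().item(), alpha.std().item())
```

With both branches zeroed, the projection contributes nothing and every alpha equals `sigmoid(0)`, which is consistent with the `max_abs_diff = 0.0`, `alpha_mean = 0.5`, `alpha_std = 0.0` values the document lists for the initial state.
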
target_frame_sweep.py DELETED

@@ -1,265 +0,0 @@
-import csv
-import os
-import random
-from functools import partial
-import numpy as np
-import torch
-import torch.nn.functional as F
-import transformers
-from torch.utils.data import DataLoader
-from configs import args
-from datasets import REFAVS
-from decoder_invariance_check import build_model, set_seed
-from load_model import collate_fn, dict_to_cuda
-from utils import utility
-def decode_with_q(model, batch, q):
-    visual_model = model.get_model().visual_model
-    image_embeddings = batch["image_feats"][0]
-    sparse, dense = visual_model.prompt_encoder(
-        points=None,
-        boxes=None,
-        masks=None,
-        text_embeds=q.unsqueeze(1),
-    )
-    sparse = sparse.to(q.dtype)
-    dense = dense.to(q.dtype)
-    low_res_masks, iou_predictions = visual_model.mask_decoder(
-        image_embeddings=image_embeddings,
-        image_pe=visual_model.prompt_encoder.get_dense_pe(),
-        sparse_prompt_embeddings=sparse,
-        dense_prompt_embeddings=dense,
-        multimask_output=False,
-    )
-    pred_masks = visual_model.postprocess_masks(
-        low_res_masks,
-        input_size=batch["resizes"][0],
-        original_size=batch["orgsizes"][0],
-    ).squeeze(1)
-    return pred_masks.unsqueeze(0), iou_predictions.squeeze(-1)
-def get_q_for_target_frame(model, batch, target_frame):
-    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
-        output = model.forward(
-            images=batch["images"],
-            images_clip=batch["images_clip"],
-            audio_features=batch["audio_feats"],
-            image_features=batch["image_feats"],
-            input_ids=batch["input_ids"],
-            labels=batch["labels"],
-            attention_masks=batch["attention_masks"],
-            masks_list=batch["masks"],
-            resize_list=batch["resizes"],
-            orgsize_list=batch["orgsizes"],
-            conversation_list=batch["convs"],
-            refs_num=batch["refs_num"],
-            fids=batch["fids"],
-            vids=batch["vids"],
-            contrast=args.ct_weight,
-            ref_ids=batch["ref_ids"],
-            inference=True,
-            target_frame=target_frame,
-        )
-    return output["seg_embeddings"][0][0:1]
-def mask_area(pred_masks):
-    return (torch.sigmoid(pred_masks) > 0.4).float().mean().item()
-def mean_mask_iou_to_others(mask, other_masks):
-    if not other_masks:
-        return 1.0
-    binary = (torch.sigmoid(mask) > 0.4).float()
-    other_binary = [(torch.sigmoid(m) > 0.4).float() for m in other_masks]
-    vals = []
-    for other in other_binary:
-        inter = (binary * other).sum()
-        union = torch.maximum(binary, other).sum()
-        vals.append((inter / (union + 1e-7)).item())
-    return float(np.mean(vals))
-def evaluate_one_sample(model, batch, sample_idx):
-    rows = []
-    qs = []
-    pred_masks_by_tf = []
-    gt_masks = batch["masks"][0]
-    vid = batch["vids"][0]
-    ref = batch["refs"][0][0]
-    for target_frame in range(args.frame_n):
-        q = get_q_for_target_frame(model, batch, target_frame)
-        qs.append(q.float().squeeze(0))
-        with torch.cuda.amp.autocast(dtype=torch.bfloat16):
-            pred_masks, iou_predictions = decode_with_q(model, batch, q)
-        pred_masks_by_tf.append(pred_masks.detach())
-        miou = utility.mask_iou(pred_masks.float(), gt_masks.float())
-        fscore = utility.Eval_Fmeasure(pred_masks.float(), gt_masks.float(), None)
-        null_metric = utility.metric_s_for_null(pred_masks.float())
-        area = mask_area(pred_masks)
-        mean_iou_pred = iou_predictions.float().mean().item()
-        rows.append(
-            {
-                "sample_idx": sample_idx,
-                "vid": vid,
-                "ref": ref,
-                "target_frame": target_frame,
-                "mean_iou_pred": mean_iou_pred,
-                "mask_area": area,
-                "null_metric": float(null_metric),
-                "miou": miou,
-                "fscore": fscore,
-                "cos_to_q5": 0.0,
-                "mean_cos_to_other_q": 0.0,
-                "mean_mask_iou_to_other_tf": 0.0,
-            }
-        )
-    q_stack = F.normalize(torch.stack(qs, dim=0), dim=-1)
-    q_cos = q_stack @ q_stack.T
-    q5_idx = min(5, len(qs) - 1)
-    for i, row in enumerate(rows):
-        other = [j for j in range(len(rows)) if j != i]
-        row["cos_to_q5"] = q_cos[i, q5_idx].item()
-        row["mean_cos_to_other_q"] = q_cos[i, other].mean().item()
-        row["mean_mask_iou_to_other_tf"] = mean_mask_iou_to_others(
-            pred_masks_by_tf[i], [pred_masks_by_tf[j] for j in other]
-        )
-    return rows
-def print_sample_summary(rows):
-    print(f"\nSample {rows[0]['sample_idx']}: vid={rows[0]['vid']} ref={rows[0]['ref']}")
-    print("tf | miou | fscore | null_s | iou_pred | area | cos_to_q5 | mean_q_cos")
-    for row in rows:
-        print(
-            f"{row['target_frame']:02d} | "
-            f"{row['miou']:.4f} | "
-            f"{row['fscore']:.4f} | "
-            f"{row['null_metric']:.4f} | "
-            f"{row['mean_iou_pred']:.4f} | "
-            f"{row['mask_area']:.4f} | "
-            f"{row['cos_to_q5']:.4f} | "
-            f"{row['mean_cos_to_other_q']:.4f}"
-        )
-    best_miou = max(rows, key=lambda x: x["miou"])
-    best_iou_pred = max(rows, key=lambda x: x["mean_iou_pred"])
-    fixed = rows[min(5, len(rows) - 1)]
-    miou_values = [row["miou"] for row in rows]
-    q5_values = [row["cos_to_q5"] for row in rows]
-    print(
-        "Best miou tf="
-        f"{best_miou['target_frame']} ({best_miou['miou']:.4f}); "
-        "best iou_pred tf="
-        f"{best_iou_pred['target_frame']} ({best_iou_pred['mean_iou_pred']:.4f}); "
-        f"fixed tf=5 miou={fixed['miou']:.4f}"
-    )
-    print(
-        f"target-frame miou range={max(miou_values) - min(miou_values):.4f}; "
-        f"min cos_to_q5={min(q5_values):.4f}"
-    )
-def main():
-    set_seed(42)
-    torch.set_grad_enabled(False)
-    tokenizer = transformers.AutoTokenizer.from_pretrained(
-        args.mllm,
-        cache_dir=None,
-        model_max_length=2048,
-        padding_side="right",
-        use_fast=False,
-    )
-    tokenizer.pad_token = tokenizer.unk_token
-    tokenizer.add_tokens("[SEG]")
-    seg_token_idx = tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
-    dataset = REFAVS(args.eval_split, args, tokenizer, input_type="refer")
-    loader = DataLoader(
-        dataset,
-        batch_size=1,
-        shuffle=False,
-        num_workers=0,
-        collate_fn=partial(collate_fn, tokenizer=tokenizer),
-    )
-    limit = args.max_eval_rows if args.max_eval_rows > 0 else 1
-    print(f"Split: {args.eval_split} | samples to sweep: {limit}")
-    model = build_model(tokenizer, seg_token_idx)
-    all_rows = []
-    for sample_idx, batch in enumerate(loader):
-        if sample_idx >= limit:
-            break
-        batch = dict_to_cuda(batch)
-        rows = evaluate_one_sample(model, batch, sample_idx)
-        all_rows.extend(rows)
-        print_sample_summary(rows)
-    if not all_rows:
-        raise RuntimeError("No rows were checked. Is the selected split empty?")
-    fixed_rows = [r for r in all_rows if r["target_frame"] == min(5, args.frame_n - 1)]
-    oracle_by_sample = {}
-    iou_pred_by_sample = {}
-    for row in all_rows:
-        key = row["sample_idx"]
-        if key not in oracle_by_sample or row["miou"] > oracle_by_sample[key]["miou"]:
-            oracle_by_sample[key] = row
-        if key not in iou_pred_by_sample or row["mean_iou_pred"] > iou_pred_by_sample[key]["mean_iou_pred"]:
-            iou_pred_by_sample[key] = row
-    fixed_miou = np.mean([r["miou"] for r in fixed_rows])
-    fixed_null_metric = np.mean([r["null_metric"] for r in fixed_rows])
-    oracle_miou = np.mean([r["miou"] for r in oracle_by_sample.values()])
-    iou_pred_selected_miou = np.mean([r["miou"] for r in iou_pred_by_sample.values()])
-    min_cos_to_q5 = np.mean(
-        [min(r["cos_to_q5"] for r in all_rows if r["sample_idx"] == sample_idx) for sample_idx in oracle_by_sample]
-    )
-    mean_miou_range = np.mean(
-        [
-            max(r["miou"] for r in all_rows if r["sample_idx"] == sample_idx)
-            - min(r["miou"] for r in all_rows if r["sample_idx"] == sample_idx)
-            for sample_idx in oracle_by_sample
-        ]
-    )
-    print("\nSummary")
-    print(f"samples: {len(fixed_rows)}")
-    print(f"fixed target_frame=5 mean miou: {fixed_miou:.4f}")
-    print(f"fixed target_frame=5 mean null_s: {fixed_null_metric:.4f}")
-    print(f"oracle best-target-frame mean miou: {oracle_miou:.4f}")
-    print(f"best-by-iou_pred selected mean miou: {iou_pred_selected_miou:.4f}")
-    print(f"oracle gain over fixed: {oracle_miou - fixed_miou:+.4f}")
-    print(f"iou_pred-selection gain over fixed: {iou_pred_selected_miou - fixed_miou:+.4f}")
-    print(f"mean target-frame miou range: {mean_miou_range:.4f}")
-    print(f"mean sample min cos_to_q5: {min_cos_to_q5:.4f}")
-    csv_path = os.environ.get("TARGET_FRAME_SWEEP_CSV")
-    if csv_path:
-        os.makedirs(os.path.dirname(os.path.abspath(csv_path)), exist_ok=True)
-        with open(csv_path, "w", newline="") as f:
-            writer = csv.DictWriter(f, fieldnames=list(all_rows[0].keys()))
-            writer.writeheader()
-            writer.writerows(all_rows)
-        print(f"Saved CSV: {csv_path}")
-if __name__ == "__main__":
-    main()
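
Review note: the summary block in the removed `target_frame_sweep.py` rescans `all_rows` once per sample to get the per-sample minimum and range. Below is a minimal sketch of the same aggregation done in one pass by grouping rows on `sample_idx`. The helper name `summarize_sweep` and the `fixed_frame` parameter are hypothetical; only the row keys (`sample_idx`, `target_frame`, `miou`, `mean_iou_pred`) come from the script.

```python
from collections import defaultdict
from statistics import mean


def summarize_sweep(rows, fixed_frame=5):
    """Aggregate per-sample sweep rows (hypothetical helper, not part of the repo)."""
    by_sample = defaultdict(list)
    for row in rows:
        by_sample[row["sample_idx"]].append(row)

    fixed, oracle, selected, ranges = [], [], [], []
    for sample_rows in by_sample.values():
        mious = [r["miou"] for r in sample_rows]
        # Oracle: best mIoU over all target frames of this sample.
        oracle.append(max(mious))
        # Selection by predicted IoU: take the row with the highest
        # mean_iou_pred and report its true mIoU (first row wins ties).
        selected.append(max(sample_rows, key=lambda r: r["mean_iou_pred"])["miou"])
        ranges.append(max(mious) - min(mious))
        fixed.extend(r["miou"] for r in sample_rows if r["target_frame"] == fixed_frame)

    return {
        "fixed_miou": mean(fixed) if fixed else float("nan"),
        "oracle_miou": mean(oracle),
        "iou_pred_selected_miou": mean(selected),
        "mean_miou_range": mean(ranges),
    }


rows = [
    {"sample_idx": 0, "target_frame": 0, "miou": 0.5, "mean_iou_pred": 0.9},
    {"sample_idx": 0, "target_frame": 5, "miou": 0.7, "mean_iou_pred": 0.6},
    {"sample_idx": 1, "target_frame": 0, "miou": 0.6, "mean_iou_pred": 0.3},
    {"sample_idx": 1, "target_frame": 5, "miou": 0.4, "mean_iou_pred": 0.8},
]
summary = summarize_sweep(rows)
print(summary)
```

In this toy input the oracle beats the fixed frame while selection by `mean_iou_pred` does worse than the fixed frame, which is the kind of gap the script's "gain over fixed" lines report.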

train_cached_gate.py DELETED

@@ -1,439 +0,0 @@
-import json
-import os
-import random
-import cv2
-import numpy as np
-import torch
-import transformers
-from torch.optim import AdamW
-from torch.utils.data import DataLoader, Dataset, Subset
-from tqdm import tqdm
-from configs import args
-from decoder_invariance_check import build_model, set_seed
-from models.avs_model import dice_loss, sigmoid_ce_loss
-from utils import utility
-def _total_norm(values):
-    if not values:
-        return 0.0
-    return float(sum(v * v for v in values) ** 0.5)
-def collect_referent_gate_stats(model):
-    gate_modules = [(n, m) for n, m in model.named_modules() if n.endswith("referent_gate")]
-    proj_norms = []
-    gate_norms = []
-    proj_grad_norms = []
-    gate_grad_norms = []
-    alpha_tensors = []
-    for _, module in gate_modules:
-        proj_norms.append(module.proj.weight.detach().float().norm().item())
-        gate_norms.append(module.gate.weight.detach().float().norm().item())
-        if module.proj.weight.grad is not None:
-            proj_grad_norms.append(module.proj.weight.grad.detach().float().norm().item())
-        if module.gate.weight.grad is not None:
-            gate_grad_norms.append(module.gate.weight.grad.detach().float().norm().item())
-        if module.last_alpha is not None:
-            alpha_tensors.append(module.last_alpha.detach().float().reshape(-1))
-    stats = {
-        "modules": len(gate_modules),
-        "proj_norm": _total_norm(proj_norms),
-        "gate_norm": _total_norm(gate_norms),
-        "proj_grad_norm": _total_norm(proj_grad_norms),
-        "gate_grad_norm": _total_norm(gate_grad_norms),
-    }
-    if alpha_tensors:
-        alpha = torch.cat(alpha_tensors)
-        stats.update(
-            {
-                "alpha_mean": alpha.mean().item(),
-                "alpha_std": alpha.std(unbiased=False).item(),
-                "alpha_min": alpha.min().item(),
-                "alpha_max": alpha.max().item(),
-            }
-        )
-    else:
-        stats.update(
-            {
-                "alpha_mean": float("nan"),
-                "alpha_std": float("nan"),
-                "alpha_min": float("nan"),
-                "alpha_max": float("nan"),
-            }
-        )
-    return stats
-def zero_referent_gate(model):
-    with torch.no_grad():
-        for name, module in model.named_modules():
-            if not name.endswith("referent_gate"):
-                continue
-            module.gate.weight.zero_()
-            module.gate.bias.zero_()
-            module.proj.weight.zero_()
-            module.proj.bias.zero_()
-            module.last_alpha = None
-def referent_gate_state_dict(model):
-    return {
-        name: param.detach().cpu()
-        for name, param in model.state_dict().items()
-        if "referent_gate" in name
-    }
-def load_referent_gate_checkpoint(model, path):
-    checkpoint = torch.load(path, map_location="cpu")
-    if isinstance(checkpoint, dict) and checkpoint.get("type") == "referent_gate_only":
-        checkpoint = checkpoint["state_dict"]
-    gate_state = {k: v for k, v in checkpoint.items() if "referent_gate" in k}
-    if not gate_state:
-        raise RuntimeError(f"No referent_gate parameters found in {path}")
-    current = model.state_dict()
-    missing_shape = [
-        k
-        for k, v in gate_state.items()
-        if k not in current or tuple(current[k].shape) != tuple(v.shape)
-    ]
-    if missing_shape:
-        raise RuntimeError(f"Gate checkpoint has incompatible keys: {missing_shape[:5]}")
-    current.update(gate_state)
-    model.load_state_dict(current, strict=True)
-    print(f"loaded referent gate checkpoint: {path} ({len(gate_state)} tensors)")
-def log_gate_stats(model, step, loss_value, batch_metrics=None):
-    stats = collect_referent_gate_stats(model)
-    metric_text = ""
-    if batch_metrics is not None:
-        metric_text = (
-            f"batch_miou={batch_metrics['miou']:.4f} "
-            f"batch_fscore={batch_metrics['fscore']:.4f} "
-        )
-    message = (
-        f"gate_stats step={step} "
-        f"loss={loss_value:.6f} "
-        f"{metric_text}"
-        f"proj_norm={stats['proj_norm']:.6f} "
-        f"gate_norm={stats['gate_norm']:.6f} "
-        f"proj_grad_norm={stats['proj_grad_norm']:.6f} "
-        f"gate_grad_norm={stats['gate_grad_norm']:.6f} "
-        f"alpha_mean={stats['alpha_mean']:.4f} "
-        f"alpha_std={stats['alpha_std']:.4f} "
-        f"alpha_min={stats['alpha_min']:.4f} "
-        f"alpha_max={stats['alpha_max']:.4f}"
-    )
-    print(message)
-    os.makedirs(args.log_root, exist_ok=True)
-    with open(os.path.join(args.log_root, f"{args.name}.txt"), "a") as f:
-        f.write(message + "\n")
-class CachedQDataset(Dataset):
-    def __init__(self, split, cfg):
-        self.split = split
-        self.cfg = cfg
-        self.root = os.path.join(cfg.cache_root, split)
-        self.index_path = os.path.join(self.root, "index.jsonl")
-        if not os.path.exists(self.index_path):
-            raise FileNotFoundError(f"Missing cache index: {self.index_path}")
-        with open(self.index_path) as f:
-            self.rows = [json.loads(line) for line in f if line.strip()]
-    def __len__(self):
-        return len(self.rows)
-    def _load_masks(self, vid, fids):
-        masks = []
-        for fid in fids:
-            frames = []
-            for frame_idx in range(self.cfg.frame_n):
-                path = os.path.join(
-                    self.cfg.data_dir,
-                    "gt_mask",
-                    vid,
-                    f"fid_{int(fid)}",
-                    f"0000{frame_idx}.png",
-                )
-                mask = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
-                if mask is None:
-                    raise FileNotFoundError(path)
-                frames.append(torch.as_tensor(mask > 0, dtype=torch.float32))
-            masks.append(torch.stack(frames, dim=0))
-        return torch.stack(masks, dim=0)
-    def __getitem__(self, idx):
-        row = self.rows[idx]
-        cache = torch.load(os.path.join(self.root, row["path"]), map_location="cpu")
-        vid = cache["vid"]
-        return {
-            "sample_idx": cache["sample_idx"],
-            "vid": vid,
-            "refs": cache["refs"],
-            "fids": cache["fids"],
-            "q": cache["q"].float(),
-            "image_embeddings": torch.load(
-                os.path.join(self.cfg.data_dir, "image_embed", f"{vid}.pt"),
-                map_location="cpu",
-            ).float(),
-            "gt_masks": self._load_masks(vid, cache["fids"]),
-            "resize": tuple(cache["resize"]),
-            "orgsize": tuple(cache["orgsize"]),
-        }
-def collate_cached(batch):
-    return batch
-def decode_batch(visual_model, batch, device):
-    image_pe = visual_model.prompt_encoder.get_dense_pe()
-    frame_qs = []
-    frame_image_embeddings = []
-    prompt_spans = []
-    for sample_idx, sample in enumerate(batch):
-        q = sample["q"].to(device=device, dtype=torch.float32)
-        image_embeddings = sample["image_embeddings"].to(device=device, dtype=torch.float32)
-        frames = image_embeddings.shape[0]
-        for prompt_idx in range(q.shape[0]):
-            start = len(frame_qs) * frames
-            frame_qs.append(q[prompt_idx].unsqueeze(0).expand(frames, -1))
-            frame_image_embeddings.append(image_embeddings)
-            prompt_spans.append((sample_idx, prompt_idx, start, start + frames))
-    if not frame_qs:
-        raise RuntimeError("No cached prompts were provided for decoding.")
-    frame_qs = torch.cat(frame_qs, dim=0)
-    frame_image_embeddings = torch.cat(frame_image_embeddings, dim=0)
-    sparse_embeddings, dense_embeddings = visual_model.prompt_encoder(
-        points=None,
-        boxes=None,
-        masks=None,
-        text_embeds=frame_qs.unsqueeze(1),
-    )
-    sparse_embeddings = sparse_embeddings.to(frame_qs.dtype)
-    dense_embeddings = dense_embeddings.to(frame_qs.dtype)
-    low_res_masks = visual_model.mask_decoder.forward_modified_v3(
-        image_embeddings=frame_image_embeddings,
-        image_pe=image_pe,
-        sparse_prompt_embeddings=sparse_embeddings,
-        dense_prompt_embeddings=dense_embeddings,
-    ).unsqueeze(1)
-    pred_by_sample = [[] for _ in batch]
-    for sample_idx, _, start, end in prompt_spans:
-        sample = batch[sample_idx]
-        pred_mask = visual_model.postprocess_masks(
-            low_res_masks[start:end],
-            input_size=sample["resize"],
-            original_size=sample["orgsize"],
-        )
-        pred_by_sample[sample_idx].append(pred_mask.squeeze(1))
-    return [torch.stack(pred_masks, dim=0) for pred_masks in pred_by_sample]
-def decode_sample(visual_model, sample, device):
-    return decode_batch(visual_model, [sample], device)[0]
-def compute_mask_loss(pred_masks, gt_masks):
-    mask_bce_loss = 0.0
-    mask_dice_loss = 0.0
-    num_masks = 0
-    for pred_mask, gt_mask in zip(pred_masks, gt_masks):
-        gt_mask = gt_mask.to(device=pred_mask.device, dtype=pred_mask.dtype)
-        num_seg, frames, height, width = gt_mask.shape
-        gt_flat = gt_mask.view(num_seg * frames, height, width)
-        pred_flat = pred_mask.view(num_seg * frames, height, width)
-        mask_bce_loss = mask_bce_loss + (
-            sigmoid_ce_loss(pred_flat, gt_flat, num_masks=gt_flat.shape[0])
-            * gt_flat.shape[0]
-        )
-        mask_dice_loss = mask_dice_loss + (
-            dice_loss(pred_flat, gt_flat, num_masks=gt_flat.shape[0])
-            * gt_flat.shape[0]
-        )
-        num_masks += gt_flat.shape[0]
-    mask_bce_loss = 2.0 * mask_bce_loss / (num_masks + 1e-8)
-    mask_dice_loss = 0.5 * mask_dice_loss / (num_masks + 1e-8)
-    return mask_bce_loss + mask_dice_loss
-def compute_batch_metrics(pred_masks, gt_masks):
-    total_iou = 0.0
-    total_fscore = 0.0
-    count = 0
-    for pred_mask, gt_mask in zip(pred_masks, gt_masks):
-        gt_mask = gt_mask.to(device=pred_mask.device, dtype=pred_mask.dtype)
-        num_seg, frames = pred_mask.shape[:2]
-        weight = num_seg * frames
-        total_iou += utility.mask_iou(pred_mask.detach().float(), gt_mask.float()) * weight
-        total_fscore += utility.Eval_Fmeasure(pred_mask.detach().float(), gt_mask.float(), None) * weight
-        count += weight
-    return {
-        "miou": total_iou / max(1, count),
-        "fscore": total_fscore / max(1, count),
-    }
-def evaluate(model, loader):
-    model.eval()
-    visual_model = model.get_model().visual_model
-    total_iou = 0.0
-    total_fscore = 0.0
-    total_null = 0.0
-    count = 0
-    with torch.no_grad():
-        for batch in tqdm(loader, desc=f"Cached eval {args.cache_split}"):
-            with torch.cuda.amp.autocast(dtype=torch.bfloat16):
-                batch_pred = decode_batch(visual_model, batch, "cuda")
-            for sample, pred in zip(batch, batch_pred):
-                gt = sample["gt_masks"].to(device=pred.device, dtype=pred.dtype)
-                num_seg, frames = pred.shape[:2]
-                weight = num_seg * frames
-                if args.cache_split == "test_n":
-                    total_null += float(utility.metric_s_for_null(pred.float())) * weight
-                else:
-                    total_iou += utility.mask_iou(pred.float(), gt.float()) * weight
-                    total_fscore += utility.Eval_Fmeasure(pred.float(), gt.float(), None) * weight
-                count += weight
-    if count == 0:
-        raise RuntimeError("No cached samples were evaluated.")
-    if args.cache_split == "test_n":
-        print(f"cached evaluate on test_n_refer, metric: {total_null / count}")
-    else:
-        print(
-            f"cached evaluate on {args.cache_split}: "
-            f"miou: {total_iou / count} fscore: {total_fscore / count}"
-        )
-def train(model, loader):
-    if args.disable_gate:
-        raise ValueError("--disable_gate is only valid with --eval_only")
-    for p in model.parameters():
-        p.requires_grad = False
-    for name, p in model.named_parameters():
-        if "referent_gate" in name:
-            p.requires_grad = True
-    gate_params = [p for p in model.parameters() if p.requires_grad]
-    optimizer = AdamW(gate_params, lr=args.lr, betas=(0.9, 0.95), weight_decay=0.01)
-    stats = collect_referent_gate_stats(model)
-    print(
-        "cached gate init: "
-        f"modules={stats['modules']} "
-        f"proj_norm={stats['proj_norm']:.6f} "
-        f"gate_norm={stats['gate_norm']:.6f} "
-        f"trainable_params={sum(p.numel() for p in gate_params)}"
-    )
-    visual_model = model.get_model().visual_model
-    step = 0
-    for epoch in range(args.epochs):
-        model.train()
-        order_loader = loader
-        for batch in tqdm(order_loader, desc=f"Cached gate train {epoch + 1}/{args.epochs}"):
-            if args.max_steps > 0 and step >= args.max_steps:
-                break
-            with torch.cuda.amp.autocast(dtype=torch.bfloat16):
-                pred_masks = decode_batch(visual_model, batch, "cuda")
-            gt_masks = [sample["gt_masks"] for sample in batch]
-            loss = compute_mask_loss(pred_masks, gt_masks)
-            optimizer.zero_grad()
-            loss.backward()
-            step += 1
-            if args.log_gate_stats_every > 0 and step % args.log_gate_stats_every == 0:
-                batch_metrics = compute_batch_metrics(pred_masks, gt_masks)
-                log_gate_stats(model, step, loss.item(), batch_metrics)
-            optimizer.step()
-        if args.max_steps > 0 and step >= args.max_steps:
-            print(f"stopped early at cached optimizer step {step}")
-            break
-    os.makedirs(args.checkpoint_root, exist_ok=True)
-    ckpt_path = os.path.join(args.checkpoint_root, f"{args.name}.pth")
-    if args.save_gate_only:
-        torch.save(
-            {
-                "type": "referent_gate_only",
-                "base_model": args.saved_model,
-                "state_dict": referent_gate_state_dict(model),
-            },
-            ckpt_path,
-        )
-    else:
-        torch.save(model.state_dict(), ckpt_path)
-    print(f"cached gate model saved as {ckpt_path}")
-def main():
-    set_seed(42)
-    random.seed(42)
-    np.random.seed(42)
-    tokenizer = transformers.AutoTokenizer.from_pretrained(
-        args.mllm,
-        cache_dir=None,
-        model_max_length=2048,
-        padding_side="right",
-        use_fast=False,
-    )
-    tokenizer.pad_token = tokenizer.unk_token
-    tokenizer.add_tokens("[SEG]")
-    seg_token_idx = tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
-    dataset = CachedQDataset(args.cache_split, args)
-    if args.overfit_samples > 0:
-        n = min(args.overfit_samples, len(dataset))
-        dataset = Subset(dataset, list(range(n)))
-        print(f"cached overfit_samples enabled: using first {n} samples")
-    loader = DataLoader(
-        dataset,
-        batch_size=args.batch_size,
-        shuffle=not args.eval_only,
-        num_workers=4,
-        collate_fn=collate_cached,
-    )
-    model = build_model(tokenizer, seg_token_idx)
-    if args.gate_checkpoint:
-        load_referent_gate_checkpoint(model, args.gate_checkpoint)
-    if args.disable_gate:
-        zero_referent_gate(model)
-        print("disable_gate enabled: referent gate forced to identity")
-    if args.eval_only:
-        evaluate(model, loader)
-        return
-    train(model, loader)
-    if not args.skip_eval_after_train:
-        evaluate(model, loader)
-if __name__ == "__main__":
-    main()
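
Review note: in the removed `decode_batch`, each prompt's slice start is computed as `len(frame_qs) * frames`, which gives the right offset only when every sample in the batch has the same number of frames. A minimal sketch of the same bookkeeping with a running offset follows; it stays correct for mixed frame counts. The function name `build_prompt_spans` and its two list arguments are hypothetical stand-ins for the tensor shapes used in the script.

```python
def build_prompt_spans(frame_counts, prompt_counts):
    """Return (sample_idx, prompt_idx, start, end) for each prompt.

    frame_counts[i]  : number of frames in sample i
    prompt_counts[i] : number of cached prompts (rows of q) in sample i
    Hypothetical helper illustrating the span bookkeeping only.
    """
    spans = []
    offset = 0  # running position in the flattened (prompt x frame) batch
    for sample_idx, (frames, prompts) in enumerate(zip(frame_counts, prompt_counts)):
        for prompt_idx in range(prompts):
            spans.append((sample_idx, prompt_idx, offset, offset + frames))
            offset += frames
    return spans


# Uniform frame counts: identical to the len(frame_qs) * frames formula.
uniform = build_prompt_spans([10, 10], [2, 1])
# Mixed frame counts: the running offset keeps the slices contiguous.
mixed = build_prompt_spans([10, 5], [1, 2])
print(uniform)
print(mixed)
```

With the script's `frame_n`-sized clips the two formulas agree, so this matters only if clips of different lengths are ever batched together.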

upload_hf.py DELETED

@@ -1,74 +0,0 @@
-"""Upload the current SimToken workspace to HuggingFace Hub.
-Example:
-    python upload_hf.py --repo yfan07/SimToken
-"""
-from __future__ import annotations
-import argparse
-import logging
-from pathlib import Path
-from huggingface_hub import HfApi, create_repo
-ROOT = Path(__file__).resolve().parent
-IGNORE_PATTERNS = [
-    ".git/**",
-    "**/__pycache__/**",
-    "**/.pytest_cache/**",
-    "**/.cache/**",
-    "**/*.pyc",
-    "**/*.pyo",
-    "upload.log",
-]
-def parse_args() -> argparse.Namespace:
-    parser = argparse.ArgumentParser(description="Upload SimToken to HuggingFace Hub.")
-    parser.add_argument("--repo", required=True, help="Repo id, e.g. yfan07/SimToken")
-    parser.add_argument("--repo_type", default="model", choices=["model", "dataset", "space"])
-    parser.add_argument("--private", action="store_true", help="Create repo as private if missing.")
-    parser.add_argument("--num_workers", type=int, default=4)
-    return parser.parse_args()
-def main() -> None:
-    args = parse_args()
-    logging.basicConfig(
-        level=logging.INFO,
-        format="%(asctime)s %(levelname)s %(message)s",
-        handlers=[logging.FileHandler(ROOT / "upload.log"), logging.StreamHandler()],
-    )
-    create_repo(
-        repo_id=args.repo,
-        repo_type=args.repo_type,
-        private=args.private,
-        exist_ok=True,
-    )
-    api = HfApi()
-    if hasattr(api, "upload_large_folder"):
-        logging.info("Uploading %s to %s with upload_large_folder", ROOT, args.repo)
-        api.upload_large_folder(
-            repo_id=args.repo,
-            repo_type=args.repo_type,
-            folder_path=str(ROOT),
-            ignore_patterns=IGNORE_PATTERNS,
-            num_workers=args.num_workers,
-        )
-    else:
-        logging.info("Uploading %s to %s with upload_folder", ROOT, args.repo)
-        api.upload_folder(
-            repo_id=args.repo,
-            repo_type=args.repo_type,
-            folder_path=str(ROOT),
-            ignore_patterns=IGNORE_PATTERNS,
-        )
-if __name__ == "__main__":
-    main()
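
Review note: before running an upload like the one in the removed `upload_hf.py`, it can help to preview which paths the `IGNORE_PATTERNS` list would skip. The sketch below uses the standard library's `fnmatchcase` as an approximation. Whether the Hub client applies exactly these semantics is an assumption here, and the helper `is_ignored` is hypothetical.

```python
from fnmatch import fnmatchcase

# Copied from the removed upload_hf.py.
IGNORE_PATTERNS = [
    ".git/**",
    "**/__pycache__/**",
    "**/.pytest_cache/**",
    "**/.cache/**",
    "**/*.pyc",
    "**/*.pyo",
    "upload.log",
]


def is_ignored(rel_path, patterns=IGNORE_PATTERNS):
    """True if a POSIX-style relative path matches any ignore pattern.

    Uses plain fnmatch semantics, where '*' also matches '/'.
    This is an approximation, not the Hub client's own filter.
    """
    return any(fnmatchcase(rel_path, pattern) for pattern in patterns)


candidates = [
    ".git/config",
    "models/__pycache__/avs_model.cpython-310.pyc",
    "upload.log",
    "train_cached_gate.py",
    "stale.pyc",
]
skipped = [path for path in candidates if is_ignored(path)]
print(skipped)
```

Under plain fnmatch matching, a top-level file such as `stale.pyc` is not caught by `**/*.pyc`, because that pattern requires a `/` somewhere in the path. Adding a bare `*.pyc` entry would cover that case if it matters.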