dieKarotte committed 20 days ago

Commit

bf04039

verified ·

1 Parent(s): dd39446

Add files using upload-large-folder tool

Files changed (21)

.gitattributes +3 -0
Evaluation_Results/Comparing_Different_Pre-Training_Targets.png +3 -0
Evaluation_Results/Comparing_with_the_SOTA_Ensemble_Models.png +3 -0
Evaluation_Results/Comparing_with_the_SOTA_Single_Models.png +3 -0
__pycache__/BEATs.cpython-310.pyc +0 -0
__pycache__/BEATs.cpython-312.pyc +0 -0
__pycache__/modules.cpython-311.pyc +0 -0
__pycache__/modules.cpython-312.pyc +0 -0
__pycache__/spatial_beats.cpython-310.pyc +0 -0
__pycache__/spatial_dataset.cpython-311.pyc +0 -0
__pycache__/test_vectorized_matching.cpython-311.pyc +0 -0
docs/00_START_HERE.md +228 -0
docs/0427_v11_series.md +184 -0
docs/0429_v11a_with_dynamic.md +475 -0
docs/V11_QUICK_START.md +345 -0
docs/gemini.md +63 -0
docs/spatial_beats_implementation_spec.md +706 -0
docs/spatial_beats_training_overview.md +608 -0
docs/v13_honest_postmortem.md +170 -0
docs/v13_spatial_beats_design.md +528 -0
docs/v13d_spatial_beats_design.md +333 -0

.gitattributes CHANGED

@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+Evaluation_Results/Comparing_with_the_SOTA_Single_Models.png filter=lfs diff=lfs merge=lfs -text
+Evaluation_Results/Comparing_Different_Pre-Training_Targets.png filter=lfs diff=lfs merge=lfs -text
+Evaluation_Results/Comparing_with_the_SOTA_Ensemble_Models.png filter=lfs diff=lfs merge=lfs -text

Evaluation_Results/Comparing_Different_Pre-Training_Targets.png ADDED

Git LFS Details

SHA256: c2f5ea4f6c904c39e72d28b5587423d4f28260ef536cca4655568aebb70332ac
Pointer size: 131 Bytes
Size of remote file: 242 kB

Evaluation_Results/Comparing_with_the_SOTA_Ensemble_Models.png ADDED

Git LFS Details

SHA256: 4e4916c2f7c1d6cc32ce3093a8eb3a97cf52cc8fe181a91ed25dc7e1d908324a
Pointer size: 131 Bytes
Size of remote file: 151 kB

Evaluation_Results/Comparing_with_the_SOTA_Single_Models.png ADDED

Git LFS Details

SHA256: e3b0e623169fa02769ed21d0e821ef5be5d7f9e1fd7aaaef0e2e386b61e52fe3
Pointer size: 131 Bytes
Size of remote file: 437 kB

__pycache__/BEATs.cpython-310.pyc ADDED

Binary file (4.13 kB)

__pycache__/BEATs.cpython-312.pyc ADDED

Binary file (7.34 kB)

__pycache__/modules.cpython-311.pyc ADDED

Binary file (11.1 kB)

__pycache__/modules.cpython-312.pyc ADDED

Binary file (10.2 kB)

__pycache__/spatial_beats.cpython-310.pyc ADDED

Binary file (49.1 kB)

__pycache__/spatial_dataset.cpython-311.pyc ADDED

Binary file (76 kB)

__pycache__/test_vectorized_matching.cpython-311.pyc ADDED

Binary file (19 kB)

docs/00_START_HERE.md ADDED

	@@ -0,0 +1,228 @@

+# 🚀 START HERE — Spatial-BEATs Documentation Guide
+**Welcome!** You have access to comprehensive analysis of the Spatial-BEATs codebase. This guide will direct you to exactly what you need.
+---
+## ⚡ Quick Pick: Your Task
+### "I have 5 minutes"
+→ Read: [`SPATIAL_FRAMEWORKS_QUICK_REFERENCE.md`](SPATIAL_FRAMEWORKS_QUICK_REFERENCE.md)
+### "I have 15 minutes"
+→ Read: [`README_DOCUMENTATION_INDEX.md`](README_DOCUMENTATION_INDEX.md) then [`ANALYSIS_COMPLETION_SUMMARY.md`](ANALYSIS_COMPLETION_SUMMARY.md)
+### "I have 30 minutes"
+→ Choose one:
+- **New to codebase?** Read: Part 1 of [`SPATIAL_AUDIO_FRAMEWORKS_ANALYSIS.md`](SPATIAL_AUDIO_FRAMEWORKS_ANALYSIS.md)
+- **Debugging DOA gap?** Read: [`doa_train_valid_gap_analysis.md`](doa_train_valid_gap_analysis.md) Executive Summary
+- **Planning experiments?** Read: [`0427_v11_series.md`](0427_v11_series.md) Section 1-2
+### "I have 1-2 hours"
+→ Full reading path for your role:
+- **Researcher**: QUICK_REF → 0427_v11_series.md → ANALYSIS Part 2-3
+- **Contributor**: QUICK_REF → ANALYSIS Part 1-2 → Pick component → read code
+- **Investigator**: DOA_GAP Executive → Part 6 → Part 8 → Appendix
+---
+## 📚 The Five Documents
+### 1. **README_DOCUMENTATION_INDEX.md**
+🏠 **Navigation hub** — Where to find what
+- Use case lookup (choose your problem)
+- Code component quick reference
+- Reading order for different roles
+- Cross-reference guide
+**👉 Read this first if**: You're not sure where to start
+---
+### 2. **SPATIAL_FRAMEWORKS_QUICK_REFERENCE.md**
+⚡ **Practitioner's card** — Fast lookup
+- Framework table (4 frameworks, 1 page)
+- Route A/B/C comparison
+- Version series highlights
+- Code locations by component
+- Loss weight patterns
+- When to use each configuration
+**👉 Read this for**: Quick answers, practitioner reference
+---
+### 3. **SPATIAL_AUDIO_FRAMEWORKS_ANALYSIS.md**
+📖 **Deep technical reference** — Architecture bible
+- Part 1: Four spatial frameworks (Spatial-AST, DCASE SELD, EINV2, DETR)
+- Part 2: Routes A/B/C with full specifications
+- Part 3: Version evolution (v7→v11)
+- Part 4-10: Implementation, configs, metrics, future work
+- Appendix: Code reference table
+**👉 Read this for**: Deep understanding, architecture details, code paths
+---
+### 4. **doa_train_valid_gap_analysis.md**
+🔍 **Diagnostic & fix guide** — Root cause analysis
+- Executive Summary: 6 critical mechanisms
+- Part 1: Data pipeline analysis
+- Part 2: Loss computation asymmetry
+- Part 3: Training configuration (v9/v10)
+- Part 4: Validation metrics
+- **Part 6: Root causes ranked by severity**
+- **Part 7: Diagnostic checklist**
+- **Part 8: Recommended fixes (prioritized)**
+- Appendix: Code reference with exact line numbers
+**👉 Read this for**: Debugging train/val gaps, understanding root causes
+---
+### 5. **ANALYSIS_COMPLETION_SUMMARY.md**
+📋 **Executive overview** — What was found
+- Deliverables summary (5 docs, 1,883 lines)
+- Key findings (frameworks, routes, v11 series)
+- Next steps (immediate vs experimental)
+- How to use documents
+- Verification checklist
+**👉 Read this for**: Overview, decision-making, what comes next
+---
+## 🎯 Choose Your Path
+### Path 1: "I want to understand the architecture (60 min)"
+1. SPATIAL_FRAMEWORKS_QUICK_REFERENCE.md (5 min)
+2. SPATIAL_AUDIO_FRAMEWORKS_ANALYSIS.md Part 1-2 (25 min)
+3. Pick a component from Part 6 Appendix, find in code (20 min)
+4. spatial_beats_ov123_frame_routes.md if curious (10 min)
+**Outcome**: Can navigate codebase, understand paradigms, modify code confidently
+---
+### Path 2: "I need to debug a train/val gap (30 min)"
+1. doa_train_valid_gap_analysis.md Executive Summary (2 min)
+2. Part 6: Check which mechanisms apply to your situation (5 min)
+3. Part 7: Diagnostics—check your logs (10 min)
+4. Part 8: Pick a fix priority (5 min)
+5. Appendix: Get code locations (2 min)
+6. SPATIAL_FRAMEWORKS_QUICK_REFERENCE.md if you need to modify (optional)
+**Outcome**: Root cause identified, fix strategy chosen, code locations ready
+---
+### Path 3: "I want to run an experiment (v11 series) (45 min)"
+1. SPATIAL_FRAMEWORKS_QUICK_REFERENCE.md (5 min)
+2. 0427_v11_series.md Section 1-2 (15 min)
+3. SPATIAL_AUDIO_FRAMEWORKS_ANALYSIS.md Part 3 (15 min)
+4. 0427_v11_series.md Part 4 (verification method) (5 min)
+5. Copy shell script from QUICK_REF (5 min)
+**Outcome**: Experiment ready to launch, understanding of success metrics
+---
+### Path 4: "I'm new, I want to understand everything (2 hours)"
+1. SPATIAL_FRAMEWORKS_QUICK_REFERENCE.md (10 min)
+2. SPATIAL_AUDIO_FRAMEWORKS_ANALYSIS.md Part 1-3 (40 min)
+3. spatial_beats_ov123_frame_routes.md (25 min)
+4. spatial_beats_training_overview.md (20 min)
+5. Pick component, trace through code with Part 6 references (20 min)
+6. doa_train_valid_gap_analysis.md Part 6 for context (5 min)
+**Outcome**: Comprehensive understanding of system, ready to contribute
+---
+## 🔑 Key Findings at a Glance
+### 4 Spatial Frameworks in Codebase
+- **Spatial-AST**: Task tokens (pre-trunk)
+- **DCASE SELD**: Per-class activity+DOA
+- **EINV2**: Learnable track queries
+- **DETR**: Per-frame K-slot allocation
+### 3 Parallel Routes (A/B/C)
+- **Route A**: Per-frame K-slot, per-step Hungarian
+- **Route B**: Learnable queries, clip-level Hungarian (PRODUCTION v9)
+- **Route C**: Per-class vectors (PROTOTYPE, v11c test)
+### DOA Train/Val Gap Root Causes
+1. ⚠️⚠️⚠️ **ZERO spatial augmentation (rotations)** — 40-60% of variance
+2. ⚠️⚠️ **SpecAugment train-only** — 10-20% variance
+3. ⚠️⚠️ **v10 freezes direction head** — 30-40% on multi-source
+4. ⚠️ **Regression sensitivity** — 5-15% variance
+5. ⚠️ **Detached prediction asymmetry** — 2-5% variance
+### v11 Experiments (Parallel Runs)
+- **v11a**: DOA demixer → ov2 angles ↓ 5pp+
+- **v11b**: LocalSpatial pre-pool → test IV necessity
+- **v11c**: ACCDOA paradigm → ov3 binding ↓ 5pp+
+- **v11d**: Post-hoc calibration → ov1 ranking ↑ 5pp+
+---
+## 📞 FAQ
+**Q: Where do I find the direction head loss?**
+A: `SPATIAL_AUDIO_FRAMEWORKS_ANALYSIS.md` Appendix → search "direction loss" → `spatial_loss.py:1562-1565`
+**Q: What's the difference between routes?**
+A: Compare table in `SPATIAL_FRAMEWORKS_QUICK_REFERENCE.md` or detailed Part 2 of `SPATIAL_AUDIO_FRAMEWORKS_ANALYSIS.md`
+**Q: Should I implement fix #1, #2, or #3?**
+A: Read `doa_train_valid_gap_analysis.md` Part 6, pick based on your gap size and risk tolerance.
+**Q: How do I run v11a?**
+A: Shell script in `SPATIAL_FRAMEWORKS_QUICK_REFERENCE.md` v11 section + spec in `0427_v11_series.md` Section 2.2
+**Q: I'm stuck on a component. Where's the code?**
+A: `SPATIAL_AUDIO_FRAMEWORKS_ANALYSIS.md` Part 6 has complete reference table with file:line for every component.
+---
+## 🎁 You Now Have
+✅ **Navigation guide** for all documents
+✅ **Quick reference card** with all the essentials
+✅ **Architecture bible** with code paths
+✅ **Diagnostic guide** for train/val gaps
+✅ **Experimental specifications** for v11 series
+✅ **Comprehensive metadata** (1,883 lines, 77KB)
+✅ **All findings tied to exact code locations**
+---
+## 🚀 Next Steps
+1. **Choose your path above** based on how much time you have
+2. **Follow the reading order** in that path
+3. **Use cross-references** when you need more detail
+4. **Check Appendices** for exact code locations
+5. **Reference Part 6/Part 8** when implementing
+---
+## 📊 Document Overview
+| Document | Size | Time | Purpose |
+|----------|------|------|---------|
+| README_DOCUMENTATION_INDEX | 12KB | 5-10m | Navigation hub |
+| SPATIAL_FRAMEWORKS_QUICK_REFERENCE | 7KB | 5-10m | Quick lookup |
+| SPATIAL_AUDIO_FRAMEWORKS_ANALYSIS | 28KB | 30-45m | Deep reference |
+| doa_train_valid_gap_analysis | 19KB | 20-30m | Diagnostics |
+| ANALYSIS_COMPLETION_SUMMARY | 11KB | 10m | Executive summary |
+| **TOTAL** | **77KB** | **2-4 hours** | **Complete set** |
+---
+**Status**: ✅ Complete and ready for use
+**Created**: 2026-04-27
+**Next update**: After v11 experiments
+👉 **Pick your path above and start reading!**

docs/0427_v11_series.md ADDED

	@@ -0,0 +1,184 @@

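+# 2026-04-27: v11 experiment series (DOA demixer / ACCDOA paradigm / calibration control)
+This document records the outcome of the "what to do after v9" discussion. It splits the diagnosis in docs/0424.md into four independent experiments that can be evaluated separately and run in parallel: v11a / v11b / v11c / v11d.
+## 1. Recap of the previous diagnosis
+See docs/0423.md (v9 design) and docs/0424.md (breakdown of the v9 real dump). The failure modes of the three real splits do not overlap:
+| split    | Main symptom | Responsible layer |
+| -------- | ------------ | ----------------- |
+| real_ov1 | the raw 4-track output has a same-class candidate for 100% of GT, but after act>=0.5, 37% of GT lose their same-class prediction | ranking / activity calibration (**not architecture**) |
+| real_ov2 | 73.9% of predictions are "same class present, but angle error >20°" | the direction head itself (**architecture**) |
+| real_ov3 | 24.5% have no same-class candidate at the raw level + avg_pred 1.82 < avg_gt 2.88 | binding (query→source) + too few active tracks (**architecture**) |
+v9 already added `ClassHeadSpectralDemixer` to the class head (Fix C: cross-attention from the track latent to the frequency axis of the BEATs trunk pre-pool grid, with a zero-gated residual). However, **the direction / distance heads have no corresponding path**: they only see `track_time_features`, a single D-dimensional vector whose upstream input is `fused_spatial_embeddings`, which has already been averaged by `FrequencyPool(mean)`. With multiple sources, a single D-vector obtained after averaging the frequency axis cannot represent two directions at once; this is the physical root cause of the real_ov2 failure.
+## 2. Experiment definitions
+### v11a: symmetric demixer for DOA / distance (minimal change)
+**Goal**: directly test whether "v9 Fix C did not cover DOA" is the root cause of the real_ov2 angle errors.
+**Approach**:
+* Add another `ClassHeadSpectralDemixer` instance to `FrameTrackPredictionHeads` (independent parameters, identical structure), named `spatial_head_demixer`, applied to the inputs of `direction_head` and `distance_head`.
+* The KV is shared with the class demixer: BEATs trunk pre-pool tokens `[B, T_p*F_p, D]` + grid_size `(T_p, F_p)` + pre-pool time mask.
+* Zero-gated initialization: `out_proj.weight = out_proj.bias = 0`, `gate = 1e-2`, so the epoch-0 forward pass is bit-equivalent to v9.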
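Why the zero-gated initialization leaves the epoch-0 forward pass unchanged can be illustrated with a minimal sketch. It uses plain Python lists instead of tensors, and `gated_residual` is a hypothetical helper, not the repository's `ClassHeadSpectralDemixer`; the only assumption is that the demixer output enters the head input as `x + gate * out_proj(attn_out)`.

```python
def gated_residual(x, attn_out, out_weight, out_bias, gate):
    """Return x + gate * (out_weight @ attn_out + out_bias) on plain lists."""
    projected = [
        sum(w * a for w, a in zip(row, attn_out)) + b
        for row, b in zip(out_weight, out_bias)
    ]
    return [xi + gate * pi for xi, pi in zip(x, projected)]


dim = 3
x = [0.5, -1.25, 2.0]          # stands in for the head input (track_time_features)
attn_out = [7.0, -3.0, 11.0]   # arbitrary attention output of the new demixer
zero_w = [[0.0] * dim for _ in range(dim)]  # out_proj.weight = 0
zero_b = [0.0] * dim                        # out_proj.bias = 0

# With a zeroed output projection the residual branch contributes exactly 0,
# so the head input is unchanged no matter what the attention produces.
out = gated_residual(x, attn_out, zero_w, zero_b, gate=1e-2)
```

Because the branch output is exactly zero, the value of `gate` does not matter at initialization; it only scales the update once `out_proj` starts to move away from zero.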
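+**Hot start**: `RESUME_CKPT` = v9 best.pt; `strict=False`; the 13 new parameters take the default zero-gated initial values. `--no-resume-optimizer --reset-epoch-on-resume --reset-best-on-resume`.
+**Expectations**:
+* real_ov2: `class_right_angle_wrong` drops significantly from 73.9%; `mean_best_angle_when_same_class_exists` shrinks.
+* `F20 / LE_CD / ocls` on the sim split (`valid__hm3d__`) do not regress (zero gating keeps epoch 0 safe, so training can only improve on v9, not blow it up).
+* real_ov1 / real_ov3 will not necessarily improve: their bottleneck is not DOA itself.
+**What it proves**: the demixer that v9 added on the class side is effective, but DOA needs the same path to obtain an equivalent gain; the single post-frequency_pool vector is the hard bottleneck for multi-source DOA demixing.
+**Entry point**: `run_ov1_v11a_ov123_top4.sh` → preset `ov1_local_spatial_v11a_ov123_top4`.
+### v11b: replace the DOA demixer's KV with the LocalSpatial pre-pool
+**Goal**: a follow-up question. The KV in v11a is the BEATs mono fbank pre-pool, which carries no directional information by itself (IV is only mixed in later by `local_spatial_fuser`). If the DOA demixer attends directly to the CNN pre-pool of the 7-channel FOA + IV input, will it do better than v11a?
+**Approach**:
+* Make `LocalSpatialEncoder.forward(foa_feat, return_pre_pool=True)` additionally return the 4D CNN feature `[B, D_s, T_f, F_cnn]` (taken before the `mean(dim=-1)` frequency collapse).
+* `build_local_spatial_fusion(..., return_local_pre_pool=True)` passes it through and reshapes it to `[B, T_f*F_cnn, D_s]` + grid `(T_f, F_cnn)`.
+* Add `local_spatial_pre_pool_proj: Linear(D_s -> D=768)`, Xavier-initialized and scaled by `local_spatial_proj_scale_init`, bias=0.
+* `FrameTrackPredictionHeads.forward` accepts `spatial_pre_pool_features / spatial_pre_pool_grid_size / spatial_pre_pool_time_mask`; when they are passed, the DOA demixer uses this KV, otherwise it falls back to the v11a BEATs trunk pre-pool.
+**Hot start**: same v9 best.pt + strict=False. `spatial_head_demixer` is still zero-gated; `local_spatial_pre_pool_proj` is a new parameter, but demixer gate=1e-2 and `out_proj=0` guarantee that the epoch-0 values match v9.
+**Expectations**:
+* If v11b is clearly better than v11a on real_ov2 → the physical IV signal really is a necessary input for DOA, and the BEATs trunk pre-pool carries too little information.
+* If v11b ≈ v11a → the BEATs trunk pre-pool already carries enough spatial context (`local_spatial_fuser` mixed IV back in), and the DOA head's bottleneck is only the demixer itself.
+* If v11b < v11a → the new KV introduces too much noise, or the projection is undertrained.
+**What it proves**: where the directional prior comes from in this architecture, i.e. whether the post-fuser features are enough or IV must be read directly before the fuser.
+**Entry point**: `run_ov1_v11b_ov123_top4.sh` → preset `ov1_local_spatial_v11b_ov123_top4`.
+### v11c: ACCDOA paradigm control
+**Goal**: question the "K-track DETR paradigm" itself. v11a/v11b both patch the heads, but for 24.5% of real_ov3 GT no same-class candidate exists even in the raw 4-track output. This means the problem is not in the heads but in the binding stage, **"which query is responsible for which source"**. Swap out the whole paradigm and see whether that sidesteps the problem.
+**Approach**:
+* `readout_scheme = local_spatial_accdoa`, a per-class 3D vector field `v_c`: `||v_c||` = activity_c, `v_c/||v_c||` = DOA_c. No queries, no Hungarian matching.
+* In ov2/ov3, same-class-in-same-frame is almost nonexistent, so the per-class output is unambiguous.
+* Wire in the existing `accdoa_heads` + `compute_frame_accdoa_losses` (already implemented in the repo).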
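A minimal sketch of how one per-class ACCDOA vector can be decoded, assuming the common convention `x = cos(el)·cos(az)`, `y = cos(el)·sin(az)`, `z = sin(el)` and a 0.5 activity threshold. Both the axis convention and the `decode_accdoa` helper are illustrative assumptions; the repository's `accdoa_heads` may differ.

```python
import math


def decode_accdoa(vector, threshold=0.5):
    """Decode one per-class ACCDOA vector.

    Returns (activity, azimuth_deg, elevation_deg), or None when the vector
    norm (the activity) is below the threshold.
    """
    x, y, z = vector
    activity = math.sqrt(x * x + y * y + z * z)
    if activity < threshold:
        return None
    azimuth = math.degrees(math.atan2(y, x))
    # Clamp guards against rounding pushing z / activity slightly outside [-1, 1].
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z / activity))))
    return activity, azimuth, elevation


active = decode_accdoa((0.0, 0.8, 0.0))    # pointing along +y
inactive = decode_accdoa((0.1, 0.1, 0.1))  # norm about 0.17, below threshold
```

Because activity and direction share one vector, there is no separate activity head and no assignment step: each class index is its own output slot.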
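+**Cold start**: the topology is incompatible with v9 (no `source_query_decoder`, no `FrameTrackPredictionHeads`). Use `--init-from-spatial-ckpt` to initialize from the ov1 local_spatial warmup ckpt instead (inheriting the BEATs trunk + LocalSpatialEncoder + fuser), strict=False.
+**Schedule**: 24 epochs, `lr=3e-5` (the default 1e-4 was chosen for single-source ov1 and was never tuned on multi-source data; 3e-5 is in the same range as v9 and safer).
+**Expectations**:
+* real_ov3 `no_same_class_pred_but_other_preds_exist` drops significantly: query binding no longer exists, and every class naturally has its own vector slot.
+* real_ov2 may also improve, because DOA comes from per-class vectors that are already separated by class, which avoids multiple sources sharing a single D-vector in the head.
+* ocls on sim ov1 may be slightly below v9 (ACCDOA couples activity and DOA, so the class signal is diluted by the vector magnitude); this cost is known.
+**What it proves**: whether the query-binding stage is the real bottleneck for real_ov3. If v11c alone lifts ov3, head patching (v11a/v11b) cannot cure ov3; if v11c cannot rescue ov3 either, the problem lies further upstream, possibly in insufficient spatio-temporal resolution of `LocalSpatialEncoder`.
+**Entry point**: `run_ov1_v11c_ov123_accdoa.sh` → preset `ov1_local_spatial_v11c_ov123_accdoa`.
+### v11d: activity calibration + Top-K̂ decoding (pure post-processing)
+**Goal**: real_ov1 loses its hits at "raw output has a same-class candidate 100% of the time → 37% lost after act>=0.5", which is a threshold / ranking problem, **not a model problem**. So there is no architecture change and no retraining, only a decode change. The same sweep also produces a Pareto front for v9 / v11a / v11b, so that ranking gains are not later misattributed to head changes.
+**Approach**:
+`scripts/calibrate_activity.py` re-reads the already dumped `*__pred.csv` files (`eval_v7k_real_valid.py` has already written `activity_prob` and, when the v10 head is present, `num_active_pred` into these CSVs). Three decode modes:
+* `threshold`: fixed threshold (sweep 0.3 / 0.4 / 0.5 / 0.6).
+* `topk_hat`: per frame, take the top K̂ tracks by descending `activity_prob`, where K̂ = `num_active_pred` (the argmax of v10's num_active_head).
+* `topk_hat_min`: the AND of the two above (tracks within the top K̂ must also pass a minimum threshold).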
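The three decode modes can be sketched for a single frame as follows. `decode_frame` is a hypothetical helper written for illustration; it is not the code in `scripts/calibrate_activity.py`, and tie-breaking between equal probabilities is an assumption.

```python
def decode_frame(activity_probs, mode, threshold=0.5, k_hat=None):
    """Return the sorted track indices kept for one frame under a decode mode."""
    # Track indices ordered by descending activity probability.
    order = sorted(range(len(activity_probs)), key=lambda k: -activity_probs[k])
    if mode == "threshold":
        kept = [k for k in order if activity_probs[k] >= threshold]
    elif mode in ("topk_hat", "topk_hat_min"):
        if k_hat is None:
            raise ValueError("k_hat (num_active_pred) is required for top-K decoding")
        kept = order[:k_hat]
        if mode == "topk_hat_min":
            # Top-K candidates must additionally pass the minimum threshold.
            kept = [k for k in kept if activity_probs[k] >= threshold]
    else:
        raise ValueError(f"unknown decode mode: {mode}")
    return sorted(kept)


probs = [0.9, 0.45, 0.2, 0.05]                                   # one frame, K=4 tracks
by_threshold = decode_frame(probs, "threshold", threshold=0.5)   # [0]
by_topk = decode_frame(probs, "topk_hat", k_hat=2)               # [0, 1]
by_topk_min = decode_frame(probs, "topk_hat_min", threshold=0.3, k_hat=3)  # [0, 1]
```

In this toy frame, the fixed 0.5 threshold drops track 1 (0.45) while `topk_hat` with K̂ = 2 keeps it, which is the kind of recall difference the sweep is meant to expose.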
+For each (split, mode, thr), compute metrics defined the same way as in `analyze_csv_dump.py`: `hit_share`, `class_right_angle_wrong`, `matched_tp_precision/recall`, `mean_best_angle_when_same_class_exists`. Then use `pick_best` to find the best configuration for each split.
+**Input**: any directory that already contains `__pred.csv / __gt.csv` (the v9 best dump, or a dump from any epoch of v11a/b/c).
+**Expectations**:
+* real_ov1: after lowering the threshold or switching to `topk_hat`, `no_same_class_pred_but_other_preds_exist` drops significantly; there is a clear knee where lowering the threshold raises recall sharply while precision drops only slightly.
+* real_ov2 / real_ov3: threshold tuning gives limited gains; their problem is not ranking.
+* Produce a table so that the production decode configuration can be chosen per split (the three splits need not share one threshold in production).
+**What it proves**: the 37% loss on real_ov1 is a correctable problem of over-tight decoding / activity distribution drift that **does not need training-side resources**; it also separates "ranking gain" from "head-change gain" when reporting v11a/v11b/v11c, which avoids confounded attribution.
+**Entry point**: `scripts/calibrate_activity.py --dump-dir <csv_dir>`.
+## 3. Change list
+Code changes are concentrated in four files:
+* `spatial_modules.py`
+  * `LocalSpatialEncoder.forward` gains a `return_pre_pool=False` parameter.
+  * `FrameTrackPredictionHeads.__init__` gains 5 spatial demixer config options; `forward` gains 3 `spatial_pre_pool_*` kwargs and reuses the `ClassHeadSpectralDemixer` class as `spatial_head_demixer`.
+* `spatial_beats.py`
+  * `SpatialBEATsConfig` gains 5 fields: `use_spatial_head_demixer / spatial_head_demixer_layers / heads / dropout / spatial_demixer_use_local_spatial_kv`.
+  * `__init__` passes the new kwargs through at both `FrameTrackPredictionHeads` construction sites and creates `local_spatial_pre_pool_proj` there when needed.
+  * `build_local_spatial_fusion` gains a `return_local_pre_pool` option and returns a 6-tuple; it reshapes the 4D CNN feature to `[B, T_f*F_cnn, D_s]`.
+  * The two readout branches in `forward` (`local_spatial`, `local_spatial_track`) switch to the 6-tuple according to the config and pass the tensor projected by `local_spatial_pre_pool_proj` to the frame_track head.
+* `train_spatial_beats.py`
+  * Adds 3 preset factories: `make_ov1_local_spatial_v11a_ov123_top4_config` (inherits v9 + `use_spatial_head_demixer=True`), `v11b` (v11a + `spatial_demixer_use_local_spatial_kv=True`), `v11c` (wraps `make_ov123_local_spatial_accdoa_config` + 24 epochs + lr=3e-5).
+  * Three dispatch branches + argparse `--preset` choices.
+* `scripts/calibrate_activity.py` (new file, pure stdlib)
+  * Reuses the geometry / counting logic of `analyze_csv_dump.py`, but reads the `num_active_pred` column for Top-K̂ decoding; outputs the best configuration per split.
+New shell entry points:
+* `run_ov1_v11a_ov123_top4.sh` (master_port 29561)
+* `run_ov1_v11b_ov123_top4.sh` (master_port 29562)
+* `run_ov1_v11c_ov123_accdoa.sh` (master_port 29563)
+## 4. Verification method
+After each experiment finishes, re-validate it with the same set of metrics:
+```bash
+# 1. dump real valid CSV
+python3 scripts/eval_v7k_real_valid.py \
+  --ckpt <ckpt_path> \
+  --dump-pred-dir <dump_dir> \
+  --dump-splits real_ov1,real_ov2,real_ov3 \
+  --activity-threshold 0.5
+# 2. metrics under the threshold decode
+python3 scripts/analyze_csv_dump.py \
+  --dump-dir <dump_dir> \
+  --threshold 0.5 \
+  --threshold-sweep 0.3 0.4 0.6
+# 3. (for v11d) sweep over all decode modes
+python3 scripts/calibrate_activity.py \
+  --dump-dir <dump_dir> \
+  --thresholds 0.3 0.4 0.5 0.6 \
+  --modes threshold topk_hat topk_hat_min \
+  --json-out calibration.json
+```
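+Hard pass criteria:
+| Experiment | Primary observation | Side-effect guard |
+| ---- | ---------------------------------------------------------- | -------------------------------- |
+| v11a | real_ov2 `class_right_angle_wrong` drops ≥ 5 pp | sim ov1 ocls / F20 do not regress |
+| v11b | whether real_ov2 / real_ov3 beat v11a | same as v11a + the projection does not blow up GPU memory |
+| v11c | real_ov3 `no_same_class_pred_but_other_preds_exist` drops ≥ 5 pp | sim ov1 ocls regression < 3 pp |
+| v11d | real_ov1 `hit_share` improves ≥ 5 pp under the best decode | precision does not collapse (>= 0.6×baseline) |
+## 5. Suggested run order
+1. **v11a** (cheapest, most likely to hit real_ov2 directly, hot start fully preserved).
+2. **v11d** in parallel: use the existing v9 best dump to produce the ranking Pareto front and obtain the "decode gain baseline" for real_ov1. Once the v11a dump is available, "DOA gain" and "decode gain" can then be separated immediately.
+3. **v11b**: start only if the v11a gain falls short, or to test "direct IV read vs post-fuser".
+4. **v11c**: run separately as the paradigm control, mainly to observe real_ov3. It is not comparable with v11a/v11b on the same basis (different topology), but sim ov1 should stay within an acceptable regression.
+## 6. Out of scope for this round
+* K=4 → 6/8 + query orthogonality regularization (P3 in the plan). It is expensive and requires a restart; decide whether to do it after the v11c results are in.
+* Upgrading the `SourceQueryDecoder` memory to the pre-freq-pool `[B, T_s*F_p, D]` (P2a in the plan). The change surface is too large; first use the head-side demixer of v11a/v11b to obtain an equivalent gain.
+* The real-data finetune schedule (`run_ov1_v9_real_balanced_*.sh` is already running) is orthogonal to the v11 series.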

docs/0429_v11a_with_dynamic.md ADDED

	@@ -0,0 +1,475 @@

+# 0429 · v11a_with_dynamic_10hz: dynamic DOA supervision + real/QA data expansion
+> Records the full pipeline of the current `ov1_local_spatial_v11a_with_dynamic_10hz` preset:
+> data → loader → model → loss, and marks every difference from `v11a_real_balanced_10hz`.
+> Corresponding code entry points:
+> - Preset: `train_spatial_beats.py::make_ov1_local_spatial_v11a_with_dynamic_10hz_config`
+> - Run script: `run_ov1_v11a_with_dynamic_10hz.sh`
+> - DCASE converter: `tools/dcase_starss_to_jsonl.py`
+---
+## 1. One-line summary
+**v11a_with_dynamic_10hz = v11a_real_balanced_10hz with expanded training data + loader/loss upgraded to per-frame targets.**
+- **Model architecture unchanged**: still `local_spatial_track` + `SourceQueryDecoder (K=4)` + spatial_head_demixer
+  (the new v11a component), 10 Hz token rate, ov123 top4 directory.
+- **Prediction side unchanged**: `FrameTrackPredictionOutput` is still the per-frame 4-tuple
+  `(activity, class, direction, distance)` with shape `[B, K, T_s, ...]`.
+- **Supervision side upgraded**: target tensors grow from `[B, N_gt]` to `[B, N_gt, T_s]`. Static sources are broadcast along the T_s axis
+  (same as the old behavior); dynamic sources are linearly interpolated from their per-frame trajectory onto the 10 Hz grid.
+- **Data expansion**: 5 new training manifests (qa_moving / qa_counting / qa_lr_pair /
+  qa_same_doa / dcase_starss_foa.train) and 1 new validation manifest (dcase_starss_foa.valid).
+- **Hot-start**: by default, training continues from `v11a_real_balanced_10hz/03_ov123_top4/best.pt` with `strict=False`
+  and does not inherit optimizer/epoch/best. The epoch-0 loss on the upstream ov123 static clips should match v11a
+  (for static sources, per-frame targets degenerate to a broadcast scalar).
+---
+## 2. Data: new manifests and sample format
+### 2.1 Training set composition
+| manifest | Records | Type | DOA source | distance valid | Replication |
+| --- | --- | --- | --- | --- | --- |
+| `ov1_foa.jsonl` (sim) | - | ov1 static | scalar | ✓ | 1 |
+| `ov2_foa.jsonl` (sim) | - | ov2 static | scalar | ✓ | 3 |
+| `ov3_foa.jsonl` (sim) | - | ov3 static | scalar | ✓ | 3 |
+| `ov1_real_static_foa_mapped.jsonl` | - | ov1 real static | scalar | ✗ (null) | 4 |
+| `ov2_real_static_foa_mapped.jsonl` | - | ov2 real static | scalar | ✗ | 8 |
+| `ov3_real_static_foa_mapped.jsonl` | - | ov3 real static | scalar | ✗ | 8 |
+| **`qa_moving.jsonl`** | 19 597 | QA sim dynamic (single source, smooth trajectory) | `frames[]` per-frame | ✓ | **2** |
+| **`qa_counting.jsonl`** | 2 428 | QA sim static (2-5 sources) | scalar | ✓ | **1** |
+| **`qa_lr_pair.jsonl`** | 6 631 | QA sim static (left/right pairs) | scalar | ✓ | **1** |
+| **`qa_same_doa.jsonl`** | 7 896 | QA sim static (multiple sources in the same direction) | scalar | ✓ | **1** |
+| **`dcase_starss_foa.train.jsonl`** | 12 805 | DCASE real recordings, 20 s, multi-source, dynamic | `frames[]` per-frame | ✗ (-1) | **2** |
+The corresponding setting is `train_manifest_replication = (1, 3, 3, 4, 8, 8, 2, 1, 1, 1, 2)`.
+- Total training clips (before replication) = ov123 sim + ov123 real + the 5 new manifests ≈ ov123 base + 49 357.
+- Compared with v11a, the validation set adds `dcase_starss_foa.valid.jsonl` (4 560 clips) as the unified evaluation entry point for real recordings.
+### 2.2 Manifest schema (qa_moving / DCASE as examples)
+**qa_moving.jsonl** (from the `build_qa_foa_moving.py` synthesis pipeline)
+```jsonc
+{
+  "scene_id": "...",
+  "output_foa_path": "/abs/.../foa.wav",
+  "output_duration_seconds": 10.0,
+  "sample_rate": 16000,
+  "frame_rate": 10.0,
+  "num_frames": 100,
+  "sources": [
+    {
+      "source_index": 0,
+      "is_moving": true,
+      "mono_target_label": "speech",       // FSD50K 63-class name
+      "active_time": [0.0, 10.0],
+      "doa": null,                          // ← scalar doa is null for dynamic sources
+      "distance_cm": 150.0,                 // clip-level fallback distance
+      "trajectory": "...",                  // sweep_arc / lshape / ...
+      "frames": [
+        {"frame_idx": 0, "doa": {"azimuth_deg": 175.6, "elevation_deg": -3.2},
+         "distance_cm": 150.2},
+        {"frame_idx": 1, "doa": {"azimuth_deg": 177.9, "elevation_deg": -3.1},
+         "distance_cm": 150.1},
+        ...
+      ]
+    }
+  ]
+}
+```
+**dcase_starss_foa.{train,valid,test}.jsonl** (generated by `tools/dcase_starss_to_jsonl.py`)
+```jsonc
+{
+  "scene_id": "fold1_starss22__fold4_room10_mix001_0",
+  "dataset_source": "dcase_starss",
+  "split": "train",
+  "output_foa_path": "/abs/.../foa.wav",
+  "output_duration_seconds": 20.0,
+  "sample_rate": 16000,
+  "frame_rate": 10.0,
+  "sources": [
+    {
+      "source_index": 0,
+      "is_moving": true,
+      "dcase_class_idx": 5,
+      "dcase_source_idx": 3,
+      "mono_target_label": "speech",         // remapped via DCASE_TO_FSD50K
+      "mono_primary_label": "male_speech",
+      "active_time": [1.3, 4.8],
+      "full_time": [0.0, 20.0],
+      "doa": null,
+      "distance_cm": -1,                     // DCASE has no distance
+      "distance_valid": false,
+      "frames": [
+        {"frame_idx": 13, "time_s": 1.3,
+         "doa": {"azimuth_deg": -45.0, "elevation_deg": 10.0},
+         "distance_cm": -1},
+        ...
+      ]
+    },
+    ...   // possibly 4+ tracks, but at most 4 simultaneously active per frame (K=4 is enforced by the matcher)
+  ]
+}
+```
+### 2.3 One-time step to generate the DCASE manifest
+```bash
+python tools/dcase_starss_to_jsonl.py \
+    --dcase-root /apdcephfs_cq10/.../DCASE2024_seld_baseline/prepared_datasets/starss23_foa_plus_29cls_20s \
+    --output /apdcephfs_cq10/.../data/metadata/dcase_starss_foa.jsonl \
+    --per-split-output
+```
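+- Scans `metadata_dev/<dataset>/<stem>.csv`; each row is `frame_idx, class_idx, source_idx, az_deg, el_deg, dist_cm`.
+- Groups rows into tracks by `(class_idx, source_idx)`; if the gap between adjacent labelled frames exceeds `gap_split_frames` (default 50 frames = 5 s),
+  the track is split into multiple `SourceEvent` segments, which avoids interpolating across silent intervals.
+- **Class space compression**: the 29 DCASE classes are mapped to their semantic nearest neighbors among the 63 FSD50K classes
+  (`DCASE_TO_FSD50K` dict, see the top of the file); DCASE classes with no match in the FSD50K vocabulary (e.g. `unknown_*`) have their whole track dropped.
+- Output statistics: 18 061 CSV → train 12 805 / valid 4 560 / test 505.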
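The gap-splitting rule described above can be sketched in a few lines. `split_track_on_gaps` is a hypothetical helper operating on bare frame indices; the real converter in `tools/dcase_starss_to_jsonl.py` also carries DOA and class data per frame, which is omitted here.

```python
def split_track_on_gaps(frame_indices, gap_split_frames=50):
    """Split one (class_idx, source_idx) track into segments.

    A new segment starts whenever two adjacent labelled frames are more than
    gap_split_frames apart (50 frames = 5 s at 10 Hz).
    """
    segments = []
    for idx in sorted(frame_indices):
        if segments and idx - segments[-1][-1] <= gap_split_frames:
            segments[-1].append(idx)   # gap small enough: same segment
        else:
            segments.append([idx])     # first frame, or gap too large
    return segments


segments = split_track_on_gaps([13, 14, 15, 70, 71])
```

Here frames 15 and 70 are 55 frames apart, so the track yields two segments and no interpolation happens across the silent interval between them.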
+### 2.4 FSD50K 63-class label aliases
+The qa_*/DCASE manifests may contain fine-grained labels (`male_singing` / `female_singing`),
+but the v11a vocab only has the compressed 63 classes (it contains `singing` but no gendered variants). In `spatial_dataset.py`,
+a `_LABEL_ALIASES.get(raw, raw)` normalization runs once before `_resolve_class_index` / `_resolve_class_label`:
+```python
+_LABEL_ALIASES = {
+    "male_singing": "singing",
+    "female_singing": "singing",
+}
+```
+This folds the 508 `male_singing` + 527 `female_singing` entries into `singing` without regenerating the jsonl files.
+---
+## 3. Loader: per-frame rework of `spatial_dataset.py`
+### 3.1 `SourceEvent` gains 5 optional fields
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+from torch import Tensor
+
+@dataclass
+class SourceEvent:
+    class_index: int
+    class_label: str
+    azimuth_deg: float        # static scalar; for dynamic sources, the frames[0] fallback
+    elevation_deg: float
+    distance: float           # for dynamic sources, the fallback from the first valid frame
+    distance_valid: bool
+    start_time_seconds: float
+    end_time_seconds: float
+    # ---- dynamic trajectory (set only for dynamic sources) ----
+    frame_times_s: Optional[Tensor] = None      # [N_f] seconds, relative to clip start
+    frame_azi_deg: Optional[Tensor] = None      # [N_f] degrees, not unwrapped
+    frame_ele_deg: Optional[Tensor] = None      # [N_f] degrees
+    frame_distance_m: Optional[Tensor] = None   # [N_f] meters
+    frame_distance_valid: Optional[Tensor] = None  # [N_f] bool
+```
+### 3.2 `_parse_frame_trajectory`
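+Extracts five 1D tensors from the manifest's `frames[]` and supports two layouts:
+1. **qa_moving**: each frame carries `frame_idx` (no `time_s`); the clip-level `frame_rate` is used to compute `time_s = frame_idx / frame_rate`.
+2. **DCASE converter output**: each frame gives `time_s` directly, so the `frame_idx / frame_rate` conversion is skipped.
+Distance unit handling: `distance_cm` is read first (divided by 100 to get meters), and `-1` or a missing value is marked `distance_valid=False`;
+`distance_m` is read as the second choice.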
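The two layouts and the distance rules can be sketched with plain lists. `parse_frames` is a hypothetical stand-in for `_parse_frame_trajectory`; storing 0.0 for invalid distances and treating a negative `distance_m` as invalid are assumptions, and the DOA fields are left out for brevity.

```python
def parse_frames(frames, frame_rate):
    """Parse manifest frames[] into (times_s, distance_m, distance_valid) lists."""
    times_s, distance_m, distance_valid = [], [], []
    for frame in frames:
        # DCASE layout gives time_s directly; qa_moving only has frame_idx.
        if "time_s" in frame:
            times_s.append(float(frame["time_s"]))
        else:
            times_s.append(frame["frame_idx"] / frame_rate)

        # distance_cm takes priority; distance_m is the second choice.
        if frame.get("distance_cm") is not None:
            meters = frame["distance_cm"] / 100.0
        elif frame.get("distance_m") is not None:
            meters = float(frame["distance_m"])
        else:
            meters = -1.0  # missing: treated like the -1 sentinel
        valid = meters >= 0.0
        distance_m.append(meters if valid else 0.0)  # 0.0 placeholder is an assumption
        distance_valid.append(valid)
    return times_s, distance_m, distance_valid


qa = parse_frames([{"frame_idx": 5, "distance_cm": 150.0}], frame_rate=10.0)
dcase = parse_frames([{"frame_idx": 13, "time_s": 1.3, "distance_cm": -1}], frame_rate=10.0)
```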
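+### 3.3 Fallbacks in `_build_source_event_from_nested_entry`
+- The top-level `doa` of a dynamic source is usually `null`, so `_get_float` is replaced by `_maybe_get_float`,
+  and the DOA of `frames[0]` fills in `azimuth_deg` / `elevation_deg` (a scalar fallback that is used only when
+  the loss layer hits the static path).
+- Distance is handled the same way: if the source-level `distance_valid=False` but `frames[]` has at least one `distance_cm >= 0`,
+  the distance of the first valid frame becomes the scalar fallback; otherwise `distance_valid=False` is kept.
+### 3.4 Trajectory cropping in `_maybe_crop_sample`
+For random/center cropping, besides cropping the waveform and updating `start/end_time_seconds`, the loader must also:
+- Filter `frame_times_s` with the `new_start/new_end` window and rebase the remaining frame times to the new clip start (`- crop_start_seconds`).
+- Truncate `frame_azi_deg` / `frame_ele_deg` / `frame_distance_m` / `frame_distance_valid` together, using the same indices.
+This keeps the time axis of the cropped `SourceEvent` aligned with the waveform.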
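A minimal sketch of the trajectory cropping step, using plain lists. `crop_trajectory` is a hypothetical helper, and treating the crop window as half-open (`start <= t < end`) is an assumption about the boundary behavior.

```python
def crop_trajectory(frame_times_s, per_frame_values, crop_start_s, crop_end_s):
    """Keep frames inside [crop_start_s, crop_end_s) and rebase their times.

    per_frame_values maps a field name (e.g. "frame_azi_deg") to a list that is
    index-aligned with frame_times_s; every field is cut with the same indices.
    """
    keep = [i for i, t in enumerate(frame_times_s) if crop_start_s <= t < crop_end_s]
    new_times = [frame_times_s[i] - crop_start_s for i in keep]
    new_values = {
        name: [values[i] for i in keep] for name, values in per_frame_values.items()
    }
    return new_times, new_values


times, fields = crop_trajectory(
    [0.0, 1.0, 2.0, 3.0],
    {"frame_azi_deg": [10.0, 20.0, 30.0, 40.0], "frame_distance_valid": [True, True, False, True]},
    crop_start_s=1.0,
    crop_end_s=3.0,
)
```

Filtering all fields through one shared index list is what keeps azimuth, elevation, distance and validity aligned after the crop.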
+### 3.5 Collate: per-frame targets `[B, N_gt, T_s]`
+Key changes in `collate_spatial_batch` relative to the old implementation:
+```python
+t_s_max = int(target_num_steps.max())          # max token count within the batch
+source_azimuth_deg   = zeros(B, N_gt_max, t_s_max)   # previously (B, N_gt_max)
+source_elevation_deg = zeros(B, N_gt_max, t_s_max)
+source_distance      = zeros(B, N_gt_max, t_s_max)
+source_distance_valid = ones (B, N_gt_max, t_s_max, dtype=bool)  # defaults to True
+for b, sample in enumerate(samples):
+    t_s_i = int(target_num_steps[b])                  # valid token count of this sample (assumed definition)
+    t_axis = arange(t_s_i) / target_token_rate        # valid time axis of this sample
+    for s, source in enumerate(sample.sources):
+        azi_row, ele_row, dist_row, dist_valid_row = _build_per_frame_targets(
+            source=source, t_axis=t_axis, t_s_max=t_s_max,
+        )
+        source_azimuth_deg[b, s]   = azi_row
+        source_elevation_deg[b, s] = ele_row
+        source_distance[b, s]      = dist_row
+        source_distance_valid[b, s] = dist_valid_row
+```
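+`_build_per_frame_targets` has two paths:
+- **Static source** (`frame_times_s is None`): fills the scalar into `[0:t_s_i)` and zeros into `[t_s_i:t_s_max)` (padding).
+  The behavior is equivalent to the old broadcast.
+- **Dynamic source**: linearly interpolates onto `t_axis`. Azimuth first goes through `_unwrap_deg` to remove ±180° jumps
+  (both qa_moving and DCASE can cross the seam, e.g. 170° → -170°) and is wrapped back to `[-180, 180]` after interpolation;
+  elevation / distance are linearly interpolated directly; `distance_valid` uses a **valid only if both endpoints are valid** rule
+  (`_linear_interp_valid_mask`), so that no fake valid values are guessed inside segments of unknown distance.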
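The dynamic-source path can be sketched with scalar helpers. These functions are illustrative stand-ins for `_unwrap_deg` and `_linear_interp_valid_mask`, not the repository code; clamping outside the labelled time range and wrapping to the half-open interval `[-180, 180)` are assumptions.

```python
def unwrap_deg(angles):
    """Remove ±360° jumps so that consecutive angles differ by less than 180°."""
    out = [angles[0]]
    for a in angles[1:]:
        delta = (a - out[-1] + 180.0) % 360.0 - 180.0  # shortest signed step
        out.append(out[-1] + delta)
    return out


def wrap_deg(angle):
    """Wrap an angle back into [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def _bracket(t, times):
    """Return (i, j, w): bracketing indices and the interpolation weight of j."""
    if t <= times[0]:
        return 0, 0, 0.0
    if t >= times[-1]:
        return len(times) - 1, len(times) - 1, 0.0
    for j in range(1, len(times)):
        if t <= times[j]:
            return j - 1, j, (t - times[j - 1]) / (times[j] - times[j - 1])


def interp_azimuth(t, times, azimuths):
    i, j, w = _bracket(t, times)
    unwrapped = unwrap_deg(azimuths)
    return wrap_deg(unwrapped[i] + w * (unwrapped[j] - unwrapped[i]))


def interp_valid(t, times, valid):
    """A target frame is valid only if both bracketing source frames are valid."""
    i, j, _ = _bracket(t, times)
    return valid[i] and valid[j]


mid = interp_azimuth(0.5, [0.0, 1.0], [170.0, -170.0])  # crosses the ±180° seam
```

Without the unwrap step, interpolating halfway between 170° and -170° would give 0°, which points in the opposite direction; with it, the midpoint lands on the seam at ±180°.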
+### 3.6 Contract change of `SpatialBatch`
+```python
+@dataclass
+class SpatialBatch:
+    ...
+    source_azimuth_deg:     Tensor   # [B, N_gt_max, T_s_max]     ← was [B, N_gt_max]
+    source_elevation_deg:   Tensor   # [B, N_gt_max, T_s_max]
+    source_distance:        Tensor   # [B, N_gt_max, T_s_max]
+    source_distance_valid:  Tensor   # [B, N_gt_max, T_s_max]  new field
+    source_class_indices:   Tensor   # [B, N_gt_max]  (class stays clip-level)
+    source_start_time_seconds: Tensor  # [B, N_gt_max]
+    source_end_time_seconds:   Tensor  # [B, N_gt_max]
+    source_valid_mask:         Tensor  # [B, N_gt_max]
+```
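+`source_class_indices` stays clip-level: v11a has no need for one source changing class, and the class is constant within a track.
+---
+## 4. Model: identical to v11a
+**Unchanged**, so that the hot-start works. The existing v11a configuration is recorded briefly here for reference: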
+```
+                        FOA 4-ch waveform @ 16 kHz
+                                  │
+                                  ▼
+                     SpatialBEATsPreprocessor  (mel, iv feat)
+                                  │
+              ┌───────────────────┴──────────────────┐
+              ▼                                      ▼
+     SpatialPatchEmbedding                SpatialDeltaPatchAdapter
+     (mel → 768-d patch tokens)           (+IV residual contribution)
+              └───────────────┬──────────────────────┘
+                              ▼
+                BEATs TransformerEncoder (12 层，冻结)
+                BEATs TransformerEncoder (12 layers, frozen)
+                              ▼
+               LocalSpatialEncoder  (IV-aware conv over (T_p, F_p))
+                              │
+                              ▼
+            TemporalResampler → fused_spatial_embeddings [B, T_s, 768]
+                   (T_s @ 10 Hz, cfg.target_token_rate=10)
+                              │
+                              ▼
+      SourceQueryDecoder  (K=4 queries × T_s decode steps, two-stage)
+        • track_latents:  [B, K, D]
+        • track_time_feat:[B, K, T_s, D]
+                              │
+                              ▼
+   FrameTrackHeads (+ SpatialHeadDemixer 1 层 attn refine, heads=8)
+        • pred_activity:          [B, K, T_s]
+        • pred_class_logits:      [B, K, T_s, 63]
+        • pred_direction:         [B, K, T_s, 3]  L2-normed
+        • pred_distance:          [B, K, T_s]     softplus, meters
+        • pred_num_active_logits: [B, T_s, K+1]   (v10 num_active head)
+```
+Key new components of v11a relative to v9 (all kept):
+- `use_spatial_head_demixer=True` (1 self-attn layer, 8 heads, dropout 0.1): one pass of decorrelation along the
+  track dimension, applied after the FrameTrack head output.
+- `local_spatial_lr_scale=1.0`: LocalSpatialEncoder and the heads use the same LR (the v9 default of 0.3 was too low).
+---
+## 5. Loss: per-frame targets + distance valid mask
+Entry point: `compute_frame_track_losses(prediction_output, batch, temporal_padding_mask, config)`.
+### 5.1 Target extraction
+```python
+targets = _frame_source_target_tensors(batch, t_s_max, device)
+# Returns:
+#   window_mask:           [B, N_gt, T_s]  (True inside active_time)
+#   source_valid:          [B, N_gt]
+#   source_class:          [B, N_gt]
+#   source_direction:      [B, N_gt, T_s, 3]  ← per-frame unit vector
+#   source_distance:       [B, N_gt, T_s]     ← per-frame, in meters
+#   source_distance_valid: [B, N_gt, T_s]     ← per-frame bool
+```
+对于来自 loader 的 `source_azimuth_deg` / `source_elevation_deg`，`_align_t` 处理长度不匹配：
+batch 内 `t_s_max` 可能与 loader 构造时的大小不同（不同 DataLoader 的 collate 边界），
+短则 pad 末帧的值，长则截断。
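The pad-or-truncate rule for `_align_t` can be sketched with plain Python lists. This is only an illustration: the name `align_last_axis` and the list-based layout are assumptions, and the real helper in `spatial_loss.py` operates on tensors and may differ in detail.

```python
def align_last_axis(values, t_s_max):
    """Pad with the last frame's value, or truncate, so that len == t_s_max."""
    if t_s_max <= 0:
        return []
    if not values:
        raise ValueError("cannot pad an empty sequence")
    if len(values) >= t_s_max:
        return list(values[:t_s_max])  # too long: truncate
    # too short: repeat the last frame's value
    return list(values) + [values[-1]] * (t_s_max - len(values))


azimuth_deg = [10.0, 12.0, 14.0]
print(align_last_axis(azimuth_deg, 5))  # [10.0, 12.0, 14.0, 14.0, 14.0]
print(align_last_axis(azimuth_deg, 2))  # [10.0, 12.0]
```

Repeating the last value (rather than zero-padding) keeps a padded static source at the same direction in every frame.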
+### 5.2 Hungarian matching (per-frame cost)
+`_match_frame_tracks` (two strategies, `per_frame` or `segment`; the preset uses `segment`)
+performs the matching on a cost tensor of shape `[B, N, K, T]`:
+```
+cost[b, n, k, t] =   class_cost_w * NLL(pred_class, target_class[b, n])
+                   + dir_cost_w   * (1 - pred_direction[b,k,t] · target_direction[b,n,t])
+                   + dist_cost_w  * |pred_distance[b,k,t] - target_distance[b,n,t]|
+                   + (1 - σ(pred_activity[b,k,t]))         # include_activity_cost
+```
+**Key point**: `target_direction` / `target_distance` are broadcast from `[B, N, T, *]` to `[B, N, 1, T, *]` (previously they were clip-level scalars),
+and the cost is accumulated independently per frame, so the best-match track of a dynamic source can differ from frame to frame. Segment matching
+additionally applies a continuity bonus of −2.0 so that the same GT tends to stay on the same track across consecutive segments with the same active set.
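A minimal sketch of per-frame matching, assuming only the direction and distance terms of the cost (class and activity terms are left out) and using brute-force enumeration as a stand-in for the Hungarian solver. The function names are hypothetical, and the `segment` strategy with its continuity bonus is not modeled.

```python
from itertools import permutations


def frame_cost(pred_dir, pred_dist, tgt_dir, tgt_dist, dir_w=1.0, dist_w=1.0):
    """Direction + distance cost for one (GT, track) pair in a single frame."""
    dot = sum(p * t for p, t in zip(pred_dir, tgt_dir))
    return dir_w * (1.0 - dot) + dist_w * abs(pred_dist - tgt_dist)


def match_one_frame(pred_dirs, pred_dists, tgt_dirs, tgt_dists):
    """Brute-force optimal assignment of N GT sources to K >= N tracks."""
    n, k = len(tgt_dirs), len(pred_dirs)
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(k), n):  # perm[i] is the track given to GT i
        cost = sum(
            frame_cost(pred_dirs[perm[i]], pred_dists[perm[i]], tgt_dirs[i], tgt_dists[i])
            for i in range(n)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm)


# Two tracks with fixed predictions; one GT source that moves from +x to +y.
pred_dirs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred_dists = [1.0, 1.0]
gt_dir_per_frame = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
assignment = [match_one_frame(pred_dirs, pred_dists, [d], [1.0])[0] for d in gt_dir_per_frame]
print(assignment)  # [0, 1]
```

Because the target direction is indexed per frame, the moving source is assigned to track 0 in the first frame and track 1 in the second, which a single clip-level target could not express.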
+### 5.3 Building the supervision tensors
+```python
+# matched_track: [B, N_gt, T_s] (matching result, k in [0, K) or -1)
+valid_assign = matched_track >= 0
+idx_b, idx_gt, idx_t = valid_assign.nonzero(as_tuple=True)
+idx_k = matched_track[idx_b, idx_gt, idx_t]
+activity_target[idx_b, idx_k, idx_t] = 1.0
+class_target    [idx_b, idx_k, idx_t] = targets["source_class"][idx_b, idx_gt]
+direction_target[idx_b, idx_k, idx_t] = targets["source_direction"][idx_b, idx_gt, idx_t]   # ← 3D indexing
+distance_target [idx_b, idx_k, idx_t] = targets["source_distance" ][idx_b, idx_gt, idx_t]   # ← 3D indexing
+dist_supervise_mask[idx_b, idx_k, idx_t] = targets["source_distance_valid"][idx_b, idx_gt, idx_t]
+```
+Compared with the old version (where `targets["source_direction"][idx_b, idx_gt]` had shape `[M, 3]`), the difference is that the third axis is now
+indexed with the concrete `idx_t`, so the per-frame GT is actually retrieved. `supervise_mask` is the activity-winning mask; `dist_supervise_mask`
+additionally ANDs it with a **per-frame distance validity**: when an entire STARSS/DCASE source is False,
+the distance loss back-propagates no gradient.
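The effect of the per-frame validity mask on the distance term can be illustrated in plain Python. This is a sketch under assumptions: the smooth-L1 form with `beta=1.0` and the mean over valid entries are illustrative choices, not confirmed details of `compute_frame_track_losses`.

```python
def smooth_l1(pred, target, beta=1.0):
    """Quadratic near zero, linear beyond beta."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta


def masked_distance_loss(pred, target, supervise_mask, distance_valid):
    """Average smooth-L1 over entries that are both matched and distance-valid."""
    total, count = 0.0, 0
    for p, t, s, v in zip(pred, target, supervise_mask, distance_valid):
        if s and v:  # per-frame AND of the two masks
            total += smooth_l1(p, t)
            count += 1
    return total / count if count else 0.0  # no valid frame: zero loss


pred = [1.0, 2.0, 5.0]
target = [1.5, 2.0, 1.0]
print(masked_distance_loss(pred, target, [True, True, True], [True, False, True]))
print(masked_distance_loss(pred, target, [True, True, True], [False, False, False]))
```

The second call mirrors a source without distance labels: every frame is masked out, so the term contributes nothing.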
+### 5.4 Loss terms
+| Term | Formula | Mask |
+| --- | --- | --- |
+| activity | `BCE_with_logits(pred_activity, activity_target, pos_weight=dyn)` | `valid_time` expanded to [B, K, T_s] |
+| num_active (v10) | `CE(pred_num_active_logits, active_count)` | `valid_time` |
+| class | `CE(pred_class_logits, class_target)` + optional ontology smoothing | `supervise_mask` |
+| direction | `mean(1 - pred · target)` | `supervise_mask` |
+| distance | `smooth_l1(pred, target)` | `dist_supervise_mask` ← **per-frame validity** |
+**ADPIT duplicate** & **nonwinner soft activity** (the two auxiliary terms from v9/v10) likewise use per-frame `source_direction[..., t]`
+and `source_distance_valid[..., t]` indexing throughout; the old 2D accesses such as `batch.source_azimuth_deg[:, 0]` were replaced with
+`[:, 0, 0]` (44 places in total) to avoid shape conflicts.
+Final aggregation:
+```
+loss_total = λ_act  · loss_activity
+           + λ_cls  · loss_class
+           + λ_dir  · loss_direction
+           + λ_dist · loss_distance
+           + λ_na   · loss_num_active
+```
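+The λ weights are identical to v11a (`SpatialLossConfig` uses the v10 phase-2 baseline values).
+### 5.5 Degenerate equivalence for static sources
+Because the loader broadcasts static sources along the T_s axis, `target_direction[b, gt, 0:T_s_i]` is identical in every frame,
+so the Hungarian cost `(1 - pred·target)` equals the per-clip version term by term; the same holds for distance. Therefore
+**the epoch-0 loss on ov123 sim/real clips matches v11a numerically**, which is also why a warm start directly from v11a best.pt is possible.
+---
+## 6. Training configuration (preset diff)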
+```python
+def make_ov1_local_spatial_v11a_with_dynamic_10hz_config(...):
+    cfg = make_ov1_local_spatial_v11a_real_balanced_10hz_config(...)
+    # only the data and the epoch count changed
+    cfg.train_manifest_paths = (
+        ov1_sim, ov2_sim, ov3_sim,
+        ov1_real, ov2_real, ov3_real,
+        qa_moving, qa_counting, qa_lr_pair, qa_same_doa,
+        dcase_starss_train,
+    )
+    cfg.train_manifest_replication = (1, 3, 3, 4, 8, 8, 2, 1, 1, 1, 2)
+    cfg.val_manifest_paths = (
+        ov1_sim, ov2_sim, ov3_sim,
+        ov1_real, ov2_real, ov3_real,
+        dcase_starss_valid,
+    )
+    cfg.test_manifest_paths = cfg.val_manifest_paths
+    cfg.num_epochs = 15
+    cfg.output_dir = "checkpoints/spatial_beats_ov1_local_spatial_v11a_with_dynamic_10hz_exp/03_ov123_top4"
+    return cfg
+```
+Launch script (`run_ov1_v11a_with_dynamic_10hz.sh`) defaults:
+```
+GPUS=8  BATCH_SIZE=4  NUM_WORKERS=8
+SPATIAL_EPOCHS=15  SPATIAL_LR=1.5e-5  AMP=fp32
+RESUME_CKPT=checkpoints/spatial_beats_ov1_local_spatial_v11a_real_balanced_10hz_exp/03_ov123_top4/best.pt
+  --no-resume-optimizer --reset-epoch-on-resume --reset-best-on-resume
+```
+---
+## 7. Differences from v11a_real_balanced_10hz at a glance
+| Dimension | v11a_real_balanced_10hz | **v11a_with_dynamic_10hz** |
+| --- | --- | --- |
+| Number of training manifests | 6 (ov123 sim + ov123 real) | **11** (+5 QA/DCASE manifests) |
+| Training clips (before replication) | ~O(20k) | +49 357 |
+| Real-recording supervision | ov123 real, static scalars | **+DCASE STARSS 20 s, per-frame** |
+| Dynamic-source supervision | none | **qa_moving + DCASE dynamic tracks** |
+| `SourceEvent` fields | no `frame_*` | +5 fields: `frame_times_s/azi/ele/dist/distance_valid` |
+| Loader target shape | `source_*: [B, N_gt]` | **`source_*: [B, N_gt, T_s]`** (static broadcast, dynamic interpolated) |
+| `source_distance_valid` | `[B, N_gt]` | **`[B, N_gt, T_s]`** (per-frame) |
+| Hungarian cost | dir/dist cost uses clip-level scalar targets | **uses broadcast per-frame `[B, N, T, *]` targets** |
+| Distance supervision mask | per-source `distance_valid` | **per-(source, frame) `dist_supervise_mask`**; STARSS/DCASE back-propagates no gradient |
+| Class vocabulary aliases | none | **`_LABEL_ALIASES` maps `male_singing`/`female_singing → singing`** |
+| Validation set | ov123 sim+real | **+ DCASE valid (4 560 clips)** |
+| `num_epochs` | 15 (inherited from v9) | 15 (unchanged) |
+| Model architecture | local_spatial_track + K=4 + demixer | **identical** |
+| Warm start | v9_real_balanced_10hz best.pt | **v11a_real_balanced_10hz best.pt**, strict=False |
+| Output directory | `.../v11a_real_balanced_10hz_exp/03_ov123_top4` | `.../v11a_with_dynamic_10hz_exp/03_ov123_top4` |
+---
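+## 8. Known pitfalls & compatibility notes
+1. **The `t_s` of `pred_*` may differ from the loader's `T_s_max`.** On the loss side, `_align_t` pads/truncates the last dimension of the GT;
+   the loader already ensures `target_num_steps = round(duration × target_token_rate)` is aligned with the model's temporal resampler,
+   so truncation is normally needed only when the longest sample in the batch sets `t_s_max` while the frame-track head output is shorter.
+2. **Azimuth crossing ±180°**: a natural trajectory `175.6° → -177.3°` was observed in qa_moving;
+   `_unwrap_deg` unwraps it to `175.6° → 182.7°` for interpolation and finally wraps back to `[-180, 180]`, so the interpolation does not take the erroneous long way around (a sweep of about -353°).
+3. **A DCASE clip can contain 6+ tracks**, but at most 4 are active simultaneously in any frame (a DCASE specification constraint);
+   `active_count.clamp(max=K)` in `_match_frame_tracks_per_frame` is a defensive layer.
+4. **Clips with distance=-1** take the branch where `dist_supervise_mask` is all False inside the loss:
+   `loss_distance = pred_distance.sum() * 0.0`, the gradient is 0, and such a clip contributes only activity/class/direction loss.
+5. **`male_singing/female_singing`** are loader-side aliases; if new fine-grained labels show up later, add the mapping directly to
+   `_LABEL_ALIASES` in `spatial_dataset.py`; there is no need to regenerate the data.
+6. **The epoch-0 loss on static ov123 clips must match v11a.** This is the smoke test verifying that the loader/loss upgrade introduced no
+   regression; a smoke script was run to confirm variance=0 for static targets and variance>0 for qa_moving targets.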
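The ±180° azimuth handling can be sketched as follows. The helper names `unwrap_deg`, `wrap_deg`, and `interp_azimuth` are illustrative stand-ins; the repository's `_unwrap_deg` may be implemented differently (for example, vectorized over a whole trajectory).

```python
def unwrap_deg(angles):
    """Remove 360° jumps so consecutive angles differ by at most 180°."""
    if not angles:
        return []
    out = [angles[0]]
    for a in angles[1:]:
        delta = (a - out[-1] + 180.0) % 360.0 - 180.0  # shortest signed step
        out.append(out[-1] + delta)
    return out


def wrap_deg(angle):
    """Map any angle back into [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def interp_azimuth(a0, a1, alpha):
    """Interpolate along the short arc between two azimuths (alpha in [0, 1])."""
    u0, u1 = unwrap_deg([a0, a1])
    return wrap_deg(u0 + alpha * (u1 - u0))


print(unwrap_deg([175.6, -177.3]))          # second value unwraps to about 182.7
print(interp_azimuth(175.6, -177.3, 0.5))   # about 179.15, not about -0.85
```

Interpolating the raw values would pass through roughly -0.85° at the midpoint, which points to the front instead of the rear.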
+---
+## 9. Related script index
+- `tools/dcase_starss_to_jsonl.py`: DCASE CSV → jsonl + FSD50K class mapping
+- `run_ov1_v11a_with_dynamic_10hz.sh`: training entry point (includes existence warnings for the 6 dynamic manifests)
+- `spatial_dataset.py::_parse_frame_trajectory` / `_build_per_frame_targets`: dynamic target construction
+- `spatial_loss.py::_frame_source_target_tensors` / `compute_frame_track_losses`: per-frame loss
+- Preset: `train_spatial_beats.py::make_ov1_local_spatial_v11a_with_dynamic_10hz_config`
+  (CLI name: `--preset ov1_local_spatial_v11a_with_dynamic_10hz`)

docs/V11_QUICK_START.md ADDED

	@@ -0,0 +1,345 @@

+# v11 Architecture - Quick Start Guide
+## What is v11?
+The v11 series represents a major architectural enhancement addressing the classification accuracy plateau at ~51% in the Spatial-BEATs model. It introduces:
+1. **SpatialDeltaPatchAdapterV2**: Enhanced front-end spatial encoder (17.4M params)
+2. **SpatialAdapterLayer**: In-trunk spatial conditioning (1.2M params total)
+3. **Multiple routing options**: Route A/B (track-based) or Route C (ACCDOA class-based)
+---
+## Four v11 Variants
+### 1. v11_phase1_cls - Phase 1 Classification Refinement
+**Use this first** to diagnose if the new adapter improves classification accuracy.
+**What it does:**
+- Enables SpatialDeltaPatchAdapterV2 only
+- Freezes direction/distance heads
+- Trains classification + num_active heads
+- Hot-starts from v10 phase-1 best checkpoint
+**Command:**
+```bash
+./run_ov1_v11_phase1_cls.sh
+```
+**Environment variables:**
+```bash
+SPATIAL_EPOCHS=10          # Default: 10 epochs
+SPATIAL_LR=7.5e-6          # Default: 7.5e-6
+BATCH_SIZE=8               # Default: 8
+GPUS=8                     # Default: 8
+```
+**Expected output:**
+```
+Epoch 1: cls_acc=0.720 (should be at least v10 level)
+Epoch 5: cls_acc=0.755 (expect improvement trend)
+Epoch 10: cls_acc=0.78+ (best of phase-1)
+```
+**What to look for:**
+- Does cls_acc improve beyond v10 phase-1 peak (0.78)?
+- How quickly does it converge?
+- Does val_loss plateau or continue improving?
+---
+### 2. v11a_ov123_top4 - Route B + Spatial Demixer (Full Architecture)
+**Use after v11_phase1_cls confirms improvement.**
+**What it does:**
+- Enables SpatialDeltaPatchAdapterV2 + trunk adapters + spatial_head_demixer
+- Trains all heads (activity, class, direction, distance)
+- Uses demixer for both class AND spatial heads
+- Hot-starts from v9 ov123_top4 best checkpoint
+**Command:**
+```bash
+./run_ov1_v11a_ov123_top4.sh
+```
+**Key metrics:**
+- `azi_mae_deg`: Azimuth mean absolute error (primary DOA metric)
+- `class_acc`: Matched-source class accuracy
+- `activity_f1`: Source presence F1-score
+**Expected improvement:**
+```
+Metric                  v9 Baseline    v11a Target    Expected Delta
+────────────────────────────────────────────────────────────────────
+azi_mae_deg (train)     10°            8-9°           -1 to -2°
+azi_mae_deg (val)       30°            24-26°         -4 to -6°
+class_acc (val)         73%            75%+           +2%
+```
+**What to look for:**
+- Validation azimuth error should be significantly lower
+- Train/val gap should narrow (from ~20° toward ~15°)
+- No collapse in accuracy metrics
+---
+### 3. v11b_ov123_top4 - Route B + LocalSpatial Demixer KV
+**Use for comparison with v11a.**
+**What it does:**
+- Same as v11a, BUT
+- Demixer attends to LocalSpatial's 7-channel pre-pool (FOA + IV)
+- Instead of BEATs mono mel-filterbank features
+- Hypothesis: Spatial features better for DOA decomposition
+**Command:**
+```bash
+./run_ov1_v11b_ov123_top4.sh
+```
+**Comparison with v11a:**
+```
+Aspect              v11a (BEATs KV)        v11b (LocalSpatial KV)
+─────────────────────────────────────────────────────────────────
+Demixer KV source   BEATs trunk            LocalSpatial pre-pool
+Channels            1 (mono fbank)         7 (4 FOA + 3 IV)
+Prior knowledge     Semantic               Spatial physics
+Expected advantage  Better for class       Better for direction
+Computational cost  Lower                  Higher
+```
+**When to pick v11b over v11a:**
+- If DOA error (azi_mae_deg) is more important than class accuracy
+- If you have GPU budget for extra feature processing
+- For acoustic scenes where spatial features matter more
+---
+### 4. v11c_ov123_accdoa - Paradigm Shift to ACCDOA (Route C)
+**Use as a "simplicity first" baseline.**
+**What it does:**
+- Enables SpatialDeltaPatchAdapterV2 + trunk adapters
+- Replaces query decoder + Hungarian matching with per-class ACCDOA heads
+- Each class gets its own spatial slot (no matching needed)
+- Activity encoded in vector magnitude, direction in unit vector
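As a rough illustration of how an ACCDOA-style output is read, the sketch below decodes one frame of per-class vectors: the vector norm acts as the activity score and the normalized vector as the direction. The function name, the dictionary layout, and the `0.5` threshold are assumptions for illustration, not taken from the v11c code.

```python
import math


def decode_accdoa(class_vectors, threshold=0.5):
    """class_vectors: {class_name: (x, y, z)} for a single frame.

    Returns {class_name: unit_direction} for classes whose vector norm
    exceeds the activity threshold.
    """
    detections = {}
    for name, (x, y, z) in class_vectors.items():
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > threshold:  # magnitude encodes activity
            detections[name] = (x / norm, y / norm, z / norm)
    return detections


frame = {"speech": (0.0, 0.9, 0.0), "dog": (0.1, 0.0, 0.0)}
print(decode_accdoa(frame))  # {'speech': (0.0, 1.0, 0.0)}
```

Since each class owns exactly one vector, two simultaneous sources of the same class cannot both be represented, which is the trade-off noted for this route.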
+**Command:**
+```bash
+./run_ov1_v11c_ov123_accdoa.sh
+```
+**Key differences from v11a:**
+```
+Aspect              v11a (Route B, Track)      v11c (Route C, ACCDOA)
+─────────────────────────────────────────────────────────────────────
+Paradigm            K learnable tracks         Per-class slots
+Matching            Hungarian (clip-level)     None (inherent per-class)
+Activity loss       Binary cross-entropy       MSE on magnitude
+Direction repr.     L2 normalized vector      Unit vector (normalized)
+Scalability         O(K×T_s) per-frame        O(num_classes×T_s)
+ov2/ov3 fit         Good (overlap ambiguity)  Better (same-class=0)
+```
+**When to pick v11c:**
+- For DCASE evaluation (uses official SELD metrics)
+- If Hungarian matching is a bottleneck
+- For datasets with no overlapping same-class sources (ov2/ov3 constraints)
+- For interpretability (each class = one direction)
+---
+## Decision Tree: Which v11 to Run?
+```
+START
+  │
+  ├─→ "Do I want to diagnose if new adapters help classification?"
+  │   └─→ YES: Run v11_phase1_cls
+  │           ↓ (wait for results)
+  │           Does cls_acc improve?
+  │           ├─→ YES ✓
+  │           │   └─→ Proceed to multi-head experiments
+  │           └─→ NO ✗
+  │               └─→ Back to drawing board (architecture issue)
+  │
+  ├─→ "Is direction-of-arrival (DOA) error my primary concern?"
+  │   ├─→ YES: Need DOA focus
+  │   │   ├─→ "Do I have GPU budget for LocalSpatial features?"
+  │   │   │   ├─→ YES: Run v11b_ov123_top4
+  │   │   │   └─→ NO: Run v11a_ov123_top4
+  │   └─→ NO: Skip v11a/v11b
+  │
+  └─→ "Am I targeting DCASE evaluation / ov2/ov3 constraints?"
+      ├─→ YES: Run v11c_ov123_accdoa
+      └─→ NO: Run v11a_ov123_top4 (default full-featured)
+```
+---
+## Monitoring Experiments
+### Key Metrics to Track
+**Classification**:
+- `class_acc`: Top-1 accuracy on matched sources
+- `class_precision`: Per-class precision
+- `class_recall`: Per-class recall
+**Direction (DOA)**:
+- `azi_mae_deg`: **Primary metric** - azimuth mean absolute error
+- `ele_mae_deg`: Elevation mean absolute error
+- `azi_std_deg`: Azimuth error standard deviation
+**Distance**:
+- `dist_mae_m`: Distance mean absolute error
+**Activity**:
+- `activity_f1`: Source presence F1-score
+- `num_active_mae`: Mean absolute error in source count
+**Gap Analysis**:
+- `train_azi_mae_deg`: Training set azimuth error
+- `val_azi_mae_deg`: Validation set azimuth error
+- `gap = val - train`: **Gap should decrease with v11**
+### TensorBoard Visualization
+```bash
+tensorboard --logdir=checkpoints/spatial_beats_v11_phase1_cls_exp/ov123_top4 --port=6006
+```
+**Plots to monitor:**
+- `metrics/val_azi_mae_deg`: Should decrease smoothly
+- `metrics/train_azi_mae_deg`: Should decrease with training
+- `loss/total`: Should follow training dynamics (may oscillate)
+- `loss/frame_direction`: DOA-specific loss component
+---
+## Checkpoint Management
+### Hot-Start Strategy
+Each v11 variant is designed to hot-start from a previous checkpoint:
+**v11_phase1_cls**:
+```
+Loads from: v10_phase1_cls best.pt
+Missing params: V2 adapter + trunk adapters
+Initialize with: Zero-init adapters (identity at epoch-0)
+Benefit: Inherits v10's frozen classification features
+```
+**v11a_ov123_top4**:
+```
+Loads from: v9_ov123_top4 best.pt
+Missing params: V2 + trunk adapters + spatial_demixer (added to heads)
+Initialize with: Zero-init everything (identity at epoch-0)
+Benefit: Inherits v9's proven multi-head balance
+```
+**v11b_ov123_top4**:
+```
+Same as v11a, but adds LocalSpatial pre-pool processing
+```
+**v11c_ov123_accdoa**:
+```
+Loads from: ov1_local_spatial baseline (v9 incompatible)
+Missing params: ACCDOAHeads (entire head replacement)
+Initialize with: Zero-init (no class/spatial heads to inherit)
+Benefit: Simpler routing = faster convergence
+```
+### How to Load a Checkpoint Manually
+```python
+import torch
+from train_spatial_beats import make_ov1_local_spatial_v11a_ov123_top4_config
+from spatial_beats import SpatialBEATs
+# Create model with v11a config
+cfg = make_ov1_local_spatial_v11a_ov123_top4_config()
+model = SpatialBEATs(cfg)
+# Load the v9 checkpoint on CPU; strict=False tolerates keys that are missing from the checkpoint
+ckpt = torch.load('checkpoints/.../v9_best.pt', map_location='cpu')
+result = model.load_state_dict(ckpt['model'], strict=False)
+# Only the new v11 modules should be listed as missing; they keep their zero init (identity behavior)
+print('missing keys:', result.missing_keys)
+print('unexpected keys:', result.unexpected_keys)
+# Ready to train!
+model.train()
+```
+---
+## Troubleshooting
+### Issue: "CUDA out of memory"
+**Solution**: Reduce batch size or sequence length
+```bash
+BATCH_SIZE=4 ./run_ov1_v11a_ov123_top4.sh
+```
+### Issue: "ClassHeadSpectralDemixer not initialized"
+**Solution**: Ensure config enables it:
+```python
+cfg.use_class_head_demixer = True  # For v11a
+cfg.use_spatial_head_demixer = True  # For v11a (added in v11)
+```
+### Issue: "Large train/val gap not shrinking"
+**Diagnosis steps**:
+1. Check if Dropout is OFF during evaluation
+2. Verify SpecAugment is applied only during training
+3. Run diagnostic: evaluate same checkpoint in train/eval modes
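+```python
+# Run inside a session where `model`, `evaluate`, and `val_loader` already exist;
+# this snippet does not define them.
+model.eval()
+val_error_no_dropout = evaluate(model, val_loader)
+model.train()  # dropout active (train mode also affects layers such as BatchNorm)
+val_error_with_dropout = evaluate(model, val_loader)
+model.eval()   # restore eval mode afterwards
+print(f'Dropout effect: {val_error_with_dropout - val_error_no_dropout:.1f}°')
+```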
+### Issue: "Trunk adapters not being applied"
+**Check**: Verify config flag is True
+```python
+if not cfg.use_trunk_spatial_adapters:
+    print("WARNING: Trunk adapters disabled!")
+    cfg.use_trunk_spatial_adapters = True
+```
+---
+## Next Steps After v11 Experiments
+1. **Analyze results** (docs/V11_IMPLEMENTATION_SUMMARY.md contains diagnostic templates)
+2. **Pick best variant** based on your primary metric
+3. **Fine-tune hyperparameters** (learning rate, and the dropout rate if you change it later)
+4. **Run official evaluation** on test set using DCASE metrics
+5. **Consider multi-stage training**:
+   - Stage 1: Classification only (v11_phase1_cls)
+   - Stage 2: Full pipeline (v11a/b/c)
+   - Stage 3: Fine-tuning (reduce LR, increase epochs)
+---
+## Citation & References
+This architecture is built on:
+- **BEATs** (Microsoft): Base semantic encoder (https://arxiv.org/abs/2212.09058)
+- **DCASE SELD**: Official evaluation metrics (https://github.com/sharathadavanne/seld-dcase2023)
+- **EINV2 paradigm**: Track-based source modeling
+- **Spatial audio physics**: FOA (First-Order Ambisonics) + Intensity Vectors
+For detailed technical justification, see:
+- docs/V11_IMPLEMENTATION_SUMMARY.md
+- docs/doa_train_valid_gap_analysis.md
+- SPATIAL_AUDIO_FRAMEWORKS_ANALYSIS_COMPREHENSIVE.md

docs/gemini.md ADDED

	@@ -0,0 +1,63 @@

+# Spatial-BEATs Reference Implementation Guide
+This document defines the final technical details of the `Spatial-BEATs` model architecture, feature engineering, and training procedure, and serves as the sole reference for the code implementation.
+## 1. Architecture Specification
+### 1.1 Input Front-End (Stem)
+- **Input feature map**: $7 \times 128 \times 1024$ (Channels $\times$ Mel-bins $\times$ Time-frames).
+- **Channel definition**:
+  - `[0:4]`: log-mel of W, X, Y, Z.
+  - `[4:7]`: IVx, IVy, IVz (Intensity Vector), aligned in time/frequency.
+- **Patch Embedding**:
+  - Structure: `nn.Conv2d(7, embed_dim, kernel_size=16, stride=16)`.
+  - Initialization: channel 0 (W) reuses the BEATs pretrained weights; channels 1-6 are randomly initialized.
+### 1.2 Spatial Token Extraction (Source Queries)
+- **Number of tokens ($K$)**: 4.
+- **Implementation**:
+  - Define `nn.Parameter(torch.randn(1, 4, embed_dim))` as the Source Queries.
+  - Use 2 Transformer Decoder layers.
+  - **Query**: Source Queries.
+  - **Key/Value**: the output sequence of the BEATs trunk (dense patch tokens).
+- **Output**: 4 `Spatial Tokens` of dimension `embed_dim`.
+### 1.3 Prediction Heads
+Each Spatial Token is connected independently to the following MLP layers:
+- **Objectness**: `Linear -> Sigmoid` (1 unit).
+- **Azimuth**: `Linear -> tanh` (2 units: $\sin, \cos$). The angle is computed with `atan2`.
+- **Elevation**: `Linear -> tanh` (2 units: $\sin, \cos$).
+- **Distance**: `Linear` (1 unit, in **centimeters**).
+- **Class**: `Linear -> Softmax` (N units, corresponding to the FSD50k classes).
+## 2. Coordinate System and Physical Features (Spatial Physics)
+### 2.1 Coordinate System (DCASE Standard)
+- **Axes**: +x front, +y left, +z up.
+- **Azimuth**: $[-180, 180]$, increasing counter-clockwise. +90° is left, -90° is right.
+- **Elevation**: $[-90, 90]$, increasing upward.
+- **Distance**: regressed in **centimeters (cm)**.
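The axis and angle conventions above correspond to the usual spherical-to-Cartesian mapping, sketched here with the standard library. The function names are illustrative only.

```python
import math


def angles_to_unit_vector(azimuth_deg, elevation_deg):
    """+x front, +y left, +z up; azimuth counter-clockwise, elevation upward."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def unit_vector_to_angles(x, y, z):
    """Inverse mapping; atan2 keeps azimuth within [-180, 180]."""
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation


left = angles_to_unit_vector(90.0, 0.0)          # points along +y (left)
print(left)
print(unit_vector_to_angles(*angles_to_unit_vector(-135.0, 30.0)))
```

Working with the unit vector avoids the discontinuity at ±180° that a direct azimuth regression would face.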
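+### 2.2 Intensity Vector (IV) Computation
+During feature extraction, the IV is computed as follows:
+- $I_x = \text{Re}\{W^* \cdot X\}$
+- $I_y = \text{Re}\{W^* \cdot Y\}$
+- $I_z = \text{Re}\{W^* \cdot Z\}$
+- All $I$ components are mapped through the Mel filterbank to match the resolution of the log-mel features.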
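For a single time-frequency bin, the three formulas reduce to a conjugate product of complex STFT values. The sketch below assumes the inputs are Python `complex` numbers for one bin; the Mel projection and any normalization (which depend on the FOA channel convention) are left out.

```python
def intensity_vector(w, x, y, z):
    """Active intensity Re{W* . X/Y/Z} for one complex STFT bin."""
    wc = w.conjugate()
    return ((wc * x).real, (wc * y).real, (wc * z).real)


# Idealized case: X carries the same signal as W, Y and Z are silent,
# so the intensity points along +x.
print(intensity_vector(1 + 1j, 1 + 1j, 0j, 0j))  # (2.0, 0.0, 0.0)
```

In a full pipeline this would be evaluated for every (frame, frequency) bin before applying the same Mel filterbank as the log-mel channels.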
+## 3. Training Recipe
+### 3.1 Loss Function (Hungarian Loss)
+- **Matching algorithm**: use `scipy.optimize.linear_sum_assignment` (Hungarian matching) to match the 4 predicted tokens against $N$ GT sources ($N \le 4$).
+- **Matching cost**: combines position error (Az/El/Dist), class error, and the objectness score.
+- **Total loss**:
+  - For matched tokens: compute $L_{MSE}(pos) + L_{BCE}(obj) + L_{CrossEntropy}(cls)$.
+  - For unmatched tokens: compute $L_{BCE}(obj, 0)$.
+### 3.2 Training Stages
+1.  **Stage 1 (Stem & Head Warmup)**: freeze the BEATs trunk (Transformer layers) and train only the new patch embedding and the spatial decoder/heads.
+2.  **Stage 2 (Joint Fine-tuning)**: unfreeze the entire trunk and fine-tune it with a low learning rate of $1 \times 10^{-5}$.
+## 4. LLM Interface
+- The 4 extracted `Spatial Tokens` are aligned to the LLM hidden space through a `Linear` projection layer.
+- In the prompt, these 4 tokens are arranged in object-wise order and represent the spatial entities in the audio.

docs/spatial_beats_implementation_spec.md ADDED

	@@ -0,0 +1,706 @@

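+# Spatial-BEATs Implementation Specification
+## 1. Goal
+This specification consolidates the earlier discussions into a `Spatial-BEATs` design that can be implemented directly.
+The goal is to build a standalone `Spatial Encoder`:
+- The input is the full `FOA` audio and its derived spatial features
+- The full `FOA` features pass through the `BEATs backbone`
+- Reuse of the `BEATs` pretrained weights is maximized
+- The output is a set of `source-level spatial tokens`
+- These tokens are fed to the LLM as a separate modality
+- The original semantic audio encoder is left untouched
+The key principle is:
+> Do not run `W-only` through the trunk and bolt a small spatial adapter on the side; instead, let the full FOA spatial features actually enter the BEATs trunk, and produce structured spatial tokens after the trunk.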
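+## 2. Final Task Definition
+### 2.1 Core task
+The main task of `Spatial-BEATs` is defined as:
+- Given a multi-source `FOA` audio clip
+- Predict the spatial representations of up to `K` potential sound sources in it
+- Each representation corresponds to one `source token`
+Each source token carries at least:
+- `objectness`
+- `azimuth`
+- `elevation`
+- `distance`
+Optionally, it also carries:
+- `source class auxiliary logits`
+- `source embedding`
+### 2.2 Recommended form of supervision
+If every source in the training data is annotated, the recommended approach is:
+- `set prediction`
+- `K` predicted tokens against `N` GT sources
+- one-to-one matching via `Hungarian matching`
+Not recommended:
+- a single scene-level spatial token
+- regressing only a global spatial summary of the whole clip
+The reason is that these lose the multi-source structure, which hinders relational reasoning by the downstream LLM.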
+## 3. Final Architecture
+Recommended final architecture:
+```text
+FOA waveform
+  -> SpatialBEATsPreprocessor
+  -> FOA feature map [B, C_foa, T, F]
+  -> FOA patch embedding
+  -> BEATs trunk
+  -> Spatial query decoder
+  -> K source tokens
+  -> Spatial prediction heads
+  -> LLM projector
+```
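+To maximize reuse of the BEATs trunk, this design avoids changing the Transformer structure inside the trunk as far as possible.
+## 4. Input Feature Definition
+### 4.1 Default recommended features
+Recommended input channels for the first version:
+- `W_logmel`
+- `X_logmel`
+- `Y_logmel`
+- `Z_logmel`
+- `IVx`
+- `IVy`
+- `IVz`
+That is:
+- `C_foa = 7`
+This is the default recommended scheme.
+### 4.2 Alternative input features
+To reduce complexity at first, one can use:
+- `WXYZ logmel`
+That is:
+- `C_foa = 4`
+However, this is only suitable for a minimal prototype.
+If the goal is to learn spatial direction and structure stably, prefer `WXYZ + IV`.
+### 4.3 Suggested front-end parameters
+To maximize reuse of the BEATs trunk, keep the time-frequency resolution close to that of BEATs:
+- sample rate: preferably `16k`
+- mel bins: `128`
+- frame length: `25 ms`
+- frame shift: `10 ms`
+Reasons:
+- The trunk sees a patch geometry closer to that of the original BEATs
+- The patch embedding and the resulting sequence length are easier to keep consistent
+- Reuse of the pretrained weights is more stable
+### 4.4 Why not reuse the binaural front-end of Spatial-AST
+Spatial-AST uses:
+- binaural log-mel
+- IPD
+This suits binaural audio and does not transfer directly to FOA.
+For FOA, priority should go to:
+- the ambisonic channels themselves
+- the intensity vector
+- other FOA physical features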
+## 5. Which BEATs Modules to Modify
+The modification plan is described module by module below.
+### 5.1 Modules kept unchanged
+Keep the following as far as possible:
+- `TransformerEncoder`
+- `TransformerSentenceEncoderLayer`
+- `MultiheadAttention`
+- `pos_conv`
+- `LayerNorm`
+- `FFN`
+- `post_extract_proj`
+In other words, the trunk structure in `backbone.py` and the trunk logic in `BEATs.py` stay as untouched as possible.
+### 5.2 Modules that must be modified
+The following must be redone:
+1. `preprocess`
+2. `patch_embedding`
+3. the output-head logic of `extract_features`
+4. the downstream `predictor`
+### 5.3 Recommended new modules
+Suggested additions:
+1. `SpatialBEATsPreprocessor`
+2. `SpatialPatchEmbedding`
+3. `SpatialQueryDecoder`
+4. `SpatialPredictionHead`
+5. `SpatialTokenProjector`
+6. `HungarianMatcher`
+7. `SpatialSetCriterion`
+## 6. Code-Level Mapping
+### 6.1 Existing files
+Keep and reuse:
+- [BEATs.py](/apdcephfs_cq10/share_1603164/user/schmittzhu/code/unilm/beats/BEATs.py)
+- [backbone.py](/apdcephfs_cq10/share_1603164/user/schmittzhu/code/unilm/beats/backbone.py)
+Suggested new files:
+- `spatial_beats.py`
+- `spatial_modules.py`
+- `spatial_loss.py`
+- `spatial_dataset.py`
+- `train_spatial_beats.py`
+### 6.2 Suggested contents of `spatial_beats.py`
+Suggested implementation:
+- `SpatialBEATsConfig`
+- `SpatialBEATs`
+- `SpatialBEATs.extract_spatial_tokens()`
+- `SpatialBEATs.forward()`
+### 6.3 Suggested contents of `spatial_modules.py`
+Suggested implementation:
+- `SpatialBEATsPreprocessor`
+- `SpatialPatchEmbedding`
+- `SpatialQueryDecoder`
+- `SpatialPredictionHead`
+- `SpatialTokenProjector`
+### 6.4 Suggested contents of `spatial_loss.py`
+Suggested implementation:
+- `HungarianMatcher`
+- `SpatialSetCriterion`
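+## 7. How to Reuse the Pretrained Weights
+### 7.1 Default recommended checkpoint
+Default recommendation:
+- `BEATs_iter3+ (AS2M) pre-trained`
+rather than:
+- fine-tuned checkpoints
+Reasons:
+- `pre-trained` is better suited as a trunk initialization
+- `fine-tuned` is biased toward AudioSet classification
+- the spatial encoder here should have a role separate from the original semantic encoder
+### 7.2 Layers that must be loaded directly
+These layers should be loaded directly from the original BEATs checkpoint:
+- `post_extract_proj`
+- `encoder.pos_conv`
+- `encoder.layers.*`
+- `encoder.layer_norm`
+- `layer_norm`
+That is, apart from the input stem and the output heads, the trunk parameters are inherited as far as possible.
+### 7.3 Layers that need special initialization
+The following layers cannot be strict-loaded directly, because their shapes differ or they do not exist in the checkpoint:
+- `patch_embedding`
+- the new `query decoder`
+- the new `spatial heads`
+- the new `LLM projector`
+### 7.4 Initialization strategy for the new patch embedding
+The original BEATs stem is:
+- `Conv2d(1, embed_dim, kernel_size=patch, stride=patch)`
+The suggested new stem is:
+- `Conv2d(C_foa, embed_dim, kernel_size=patch, stride=patch)`
+Recommended initialization strategies:
+#### Option A: conservative initialization (default recommendation)
+- the `W_logmel` channel inherits the original stem weights
+- the other spatial channels are initialized to `0` or small random values
+Pros:
+- preserves the original BEATs initial distribution as much as possible
+- trunk adaptation is more stable
+Cons:
+- the spatial channels are picked up slowly early in training
+#### Option B: channel inflation
+- copy the original stem weights to all input channels
+- then normalize by the number of channels
+Pros:
+- all channels enter the trunk from the start
+Cons:
+- the initial statistics are more likely to deviate from the original BEATs
+Final recommendation:
+- use `Option A` for the first version
+- compare against `Option B` later in an ablation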
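The two stem-initialization schemes can be sketched on nested Python lists standing in for the convolution weights. The function name `init_foa_stem`, the `[out][kh][kw]` layout, and the use of exact zeros for the extra channels are assumptions for illustration; a real implementation would operate on the `Conv2d` weight tensor.

```python
import copy


def init_foa_stem(mono_weight, num_channels, scheme="A"):
    """mono_weight: nested list [out][kh][kw] taken from the 1-channel stem.

    Returns a nested list [out][num_channels][kh][kw].
    Scheme "A": channel 0 inherits the mono kernel, the rest start at zero.
    Scheme "B": the mono kernel is copied to every channel and divided by
    the channel count.
    """
    new_weight = []
    for kernel in mono_weight:
        if scheme == "A":
            zeros = [[0.0] * len(row) for row in kernel]
            chans = [copy.deepcopy(kernel)]
            chans += [copy.deepcopy(zeros) for _ in range(num_channels - 1)]
        elif scheme == "B":
            scaled = [[w / num_channels for w in row] for row in kernel]
            chans = [copy.deepcopy(scaled) for _ in range(num_channels)]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        new_weight.append(chans)
    return new_weight


mono = [[[1.0, 2.0], [3.0, 4.0]]]  # one output filter with a 2x2 kernel
stem_a = init_foa_stem(mono, num_channels=7, scheme="A")
stem_b = init_foa_stem(mono, num_channels=7, scheme="B")
print(stem_a[0][0], stem_a[0][1])
print(stem_b[0][0])
```

With scheme A the stem initially reproduces the mono response on the `W` channel alone; with scheme B the summed response matches the mono stem only when all channels carry the same input, which is rarely true for FOA.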
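+## 8. Final Design of the Spatial Token Module
+### 8.1 Why not global pooling
+The original BEATs output path is closer to:
+- patch sequence
+- mean pooling
+- clip-level prediction
+This does not suit a multi-source spatial task.
+### 8.2 Final recommendation: Query Decoder
+After the trunk output, add:
+- `K` learnable source queries
+- a lightweight `cross-attention decoder`
+Inputs:
+- encoder memory: `H in R^{B x T x D}`
+- source queries: `Q in R^{B x K x D}`
+Output:
+- `Z in R^{B x K x D}`
+Here `Z[:, i, :]` is the `i`-th `source token`.
+### 8.3 Why the query decoder is currently the best choice
+Its advantages:
+- it does not change the trunk's internal structure
+- the full FOA features still pass through the backbone
+- it fits multi-source set prediction
+- it is the most favorable option for maximizing reuse of the trunk weights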
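+## 9. Output Head Design
+For each source token `z_i`, predict:
+- `objectness`
+- `azimuth`
+- `elevation`
+- `distance`
+- optional `class_aux`
+### 9.1 Discrete or continuous
+For the first version, discrete classification heads are recommended throughout:
+- `azimuth`: 360 bins
+- `elevation`: 180 bins
+- `distance`: bucketed according to the data, e.g. one bin per `0.5m`
+Reasons:
+- consistent with existing Spatial-AST/BAT experience
+- classification heads are more stable
+- easier to build discrete coordinate embeddings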
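One possible way to turn continuous labels into the class indices these heads would be trained on is shown below. The bin edges, the wrap-around of +180° onto bin 0, and the 20-bin distance range are assumptions for illustration; the spec does not fix them.

```python
import math


def azimuth_to_bin(azimuth_deg, num_bins=360):
    """Azimuth in [-180, 180] to a bin index; +180 wraps onto the -180 bin."""
    width = 360.0 / num_bins
    return int(math.floor((azimuth_deg + 180.0) / width)) % num_bins


def elevation_to_bin(elevation_deg, num_bins=180):
    """Elevation in [-90, 90] to a bin index, clamped at the top edge."""
    width = 180.0 / num_bins
    idx = int(math.floor((elevation_deg + 90.0) / width))
    return min(max(idx, 0), num_bins - 1)


def distance_to_bin(distance_m, bin_width_m=0.5, num_bins=20):
    """Distance in meters to a bin index; far sources share the last bin."""
    idx = int(math.floor(distance_m / bin_width_m))
    return min(max(idx, 0), num_bins - 1)


print(azimuth_to_bin(0.0), elevation_to_bin(0.0), distance_to_bin(1.2))  # 180 90 2
```

Azimuth is circular, so its index wraps; elevation and distance are bounded, so their indices are clamped instead.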
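+### 9.2 Objectness head
+Recommended addition:
+- `objectness_head: D -> 1`
+It is used to:
+- decide whether the current token corresponds to a real sound source
+- serve as part of the Hungarian matching
+- keep or prune tokens at inference time
+### 9.3 Class head
+The class head is recommended as an:
+- `auxiliary head`
+rather than as the main content fed to the final LLM.
+The purpose of this is to:
+- make it easier for the query tokens to learn source-slot alignment
+- without turning Spatial-BEATs into a second strong semantic encoder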
+## 10. Loss Design
+Recommended total loss:
+```text
+L_total =
+  lambda_obj * L_obj
+  + lambda_azi * L_azi
+  + lambda_ele * L_ele
+  + lambda_dist * L_dist
+  + lambda_cls * L_cls_aux
+```
+### 10.1 Matching
+Use `Hungarian matching`:
+- predictions: `K` tokens
+- GT: `N` sources
+- the cost consists of:
+  - objectness cost
+  - azimuth cost
+  - elevation cost
+  - distance cost
+  - optional class cost
+### 10.2 Loss term definitions
+Recommended:
+- `L_obj`: BCE or focal loss
+- `L_azi`: cross entropy
+- `L_ele`: cross entropy
+- `L_dist`: cross entropy
+- `L_cls_aux`: cross entropy or BCE
+### 10.3 Suggested initial loss weights
+For the first version, start from the following weights:
+```text
+lambda_obj = 1.0
+lambda_azi = 2.0
+lambda_ele = 2.0
+lambda_dist = 1.0
+lambda_cls = 0.25
+```
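+Explanation:
+- the direction tasks are usually the most critical
+- distance comes next
+- objectness must be stable
+- class supervision is only auxiliary
+### 10.4 Practices to avoid
+Not recommended for the first version:
+- letting a heavy classification loss overwhelm the spatial losses
+- copying the `1250 * cls` weighting of Spatial-AST directly
+Reasons:
+- one of Spatial-AST's goals is to preserve sound event detection
+- here the main goal of `Spatial-BEATs` is the spatial tokens
+- the original project already has a separate semantic encoder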
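+## 11. Training Strategy
+### 11.1 Is SSL needed in the first stage
+Current final conclusion:
+- the first version does **not** need to redo BEATs-style SSL
+because the following are already available:
+- multi-source supervision
+- spatial annotations for each source
+- a reusable pretrained BEATs trunk
+So the first stage should prioritize:
+- `supervised multi-source spatial training`
+### 11.2 Suggested staged training
+#### Stage A: Warmup
+Freeze:
+- most of the trunk
+Train only:
+- FOA preprocessor
+- patch embedding
+- query decoder
+- spatial heads
+- LLM projector
+Purpose:
+- let the new input stem and the new output heads attach to the trunk stably
+#### Stage B: Upper-trunk finetune
+Unfreeze:
+- several upper layers of the trunk
+Purpose:
+- let the trunk gradually adapt to the FOA spatial task
+#### Stage C: Near-full finetune
+Further unfreeze:
+- more encoder layers
+Purpose:
+- raise the ceiling of the spatial representation
+### 11.3 Suggested learning rates
+Recommended:
+- trunk: smaller lr
+- new modules: larger lr
+For example: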
+```text
+lr_trunk = 1e-5 ~ 5e-5
+lr_new = 1e-4 ~ 5e-4
+```
+combined with:
+- layer-wise lr decay
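Layer-wise learning-rate decay can be computed as below. This follows one common convention (the top trunk layer gets the base rate and each layer beneath it is scaled by a constant factor); the decay value `0.75` and the 12-layer trunk are illustrative assumptions rather than settings from this spec.

```python
def layerwise_lrs(base_lr, num_layers, decay):
    """LR for trunk layers 0..num_layers-1; the top layer gets base_lr."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


lrs = layerwise_lrs(base_lr=5e-5, num_layers=12, decay=0.75)
for layer_index, lr in enumerate(lrs):
    print(f"encoder.layers.{layer_index}: lr={lr:.2e}")
```

Lower layers, which hold the most generic pretrained features, move the least; the new modules would sit in a separate parameter group with the larger `lr_new`.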
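+## 12. Final Form of the Spatial Tokens Sent to the LLM
+This is one of the most critical interface definitions in the project.
+### 12.1 Internal token form
+`Spatial-BEATs` outputs internally:
+- `Z in R^{B x K x D}`
+where:
+- `B`: batch size
+- `K`: number of source tokens
+- `D`: Spatial-BEATs hidden dim, preferably the same as the BEATs trunk
+### 12.2 Do not feed raw logits directly to the LLM
+Do not give the LLM directly:
+- azimuth logits
+- elevation logits
+- distance logits
+- objectness logits
+These are supervision heads, not the final modality representation.
+### 12.3 Final recommended form of the LLM spatial token
+The recommended form of each token sent to the LLM is: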
+```text
+s_i = Proj([z_i ; e_azi(i) ; e_ele(i) ; e_dist(i) ; e_obj(i)])
+```
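+where:
+- `z_i`: the latent token output by the query decoder
+- `e_azi(i)`: embedding looked up from the predicted azimuth bin
+- `e_ele(i)`: embedding looked up from the predicted elevation bin
+- `e_dist(i)`: embedding looked up from the predicted distance bin
+- `e_obj(i)`: embedding derived from the objectness/confidence
+- `Proj`: an MLP/Linear projecting to the LLM hidden size
+Finally:
+- `s_i in R^{d_llm}`
+### 12.4 Why the hybrid "latent + structured embedding" form
+Reasons:
+1. `z_i` retains rich implicit spatial structure
+2. the `coordinate embeddings` give the LLM explicit discrete spatial cues
+3. `confidence` helps the LLM tell reliable tokens from unreliable ones
+This is more suitable than passing either of the following alone:
+- the raw latent token
+- explicit coordinates as one-hot / scalar
+### 12.5 Final sequence form
+When feeding the LLM, the recommended form is: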
+```text
+<SPATIAL_START>, s_1, s_2, ..., s_K, <SPATIAL_END>
+```
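+and:
+- sort by `objectness` from high to low
+- low-confidence tokens can be truncated or masked directly
+### 12.6 Whether to keep all K tokens
+Default recommendation:
+- keep all `K` during training
+- filter by `objectness` at inference
+For example:
+- keep the top `K_keep`
+- or keep tokens with `obj > threshold`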
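The inference-time ordering and filtering can be sketched as a small index-selection helper. The name `select_tokens` and its arguments are hypothetical; the scores are assumed to be objectness probabilities in `[0, 1]`.

```python
def select_tokens(objectness, k_keep=None, threshold=None):
    """Return token indices sorted by objectness (high to low), then filtered."""
    order = sorted(range(len(objectness)), key=lambda i: objectness[i], reverse=True)
    if threshold is not None:
        order = [i for i in order if objectness[i] > threshold]
    if k_keep is not None:
        order = order[:k_keep]
    return order


scores = [0.2, 0.9, 0.6, 0.05]
print(select_tokens(scores, threshold=0.5))  # [1, 2]
print(select_tokens(scores, k_keep=3))       # [1, 2, 0]
```

The returned indices would then be used to gather the corresponding `s_i` before they are placed between the start and end markers.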
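+## 13. Relationship to the Original Semantic Audio Encoder
+To avoid "two encoders doing the same thing", the following division of responsibilities is recommended:
+- original semantic audio encoder: responsible for `what`
+- Spatial-BEATs: responsible for `where / spatial structure / relations`
+### 13.1 May Spatial-BEATs learn classes
+Yes, but only as an auxiliary task.
+Suggestions:
+- the class head is used only for training
+- the spatial tokens finally fed to the LLM do not directly expose the full class logits
+### 13.2 Is alignment with the semantic encoder needed
+Not required for the first version.
+If stronger source grounding is desired later, the following can be added:
+- semantic distillation
+- cross-encoder alignment
+- source-wise contrastive loss
+But these belong to a second phase.
+## 14. Recommended Configuration for the First Version
+Default suggestions for the first version:
+- input features: `WXYZ + IVxyz`
+- `C_foa = 7`
+- sample rate: `16k`
+- mel bins: `128`
+- patch configuration: same as BEATs
+- pretrained weights: `BEATs_iter3+ AS2M pre-trained`
+- trunk: load as much as possible
+- patch stem: `W` inherited, other channels small-initialized
+- output: `K` source tokens
+- token decoding: lightweight query decoder
+- supervision: Hungarian matching + multi-head spatial classification
+- LLM input: hybrid tokens of `latent + structured coordinate embedding`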
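+## 15. Implementation Priorities
+Proceed in the following order of priority:
+1. implement the `FOA preprocessor`
+2. implement the multi-channel `patch embedding`
+3. complete trunk ckpt loading
+4. implement the `query decoder`
+5. implement the `objectness / azi / ele / dist` heads
+6. implement the `Hungarian matcher + criterion`
+7. implement the `LLM projector`
+8. complete the training script
+## 16. Open Questions Still Requiring User Confirmation
+The following questions directly affect the implementation details of the first version:
+1. What is the main sample rate of the current `FOA` data: `16k`, `24k`, `32k`, or `48k`?
+2. Roughly what is the `maximum number of simultaneous sources` per sample? This affects the default `K`.
+3. Does every source have a `source-level class label`? If so, the class head and the matching will be more stable.
+4. Should `distance` be discrete classification or continuous regression? The current default recommendation is discrete classification.
+5. What is the hidden size of the downstream LLM? Is there already a fixed audio token projector?
+6. Should Spatial-BEATs have some auxiliary source-semantic capability in the first version, or strictly handle space only?
+## 17. Conclusion
+The final design is now settled:
+- **the full FOA features enter the BEATs trunk**
+- **reuse of the pretrained trunk is maximized**
+- **the input stem is rebuilt**
+- **the output is rebuilt as multi-source spatial tokens**
+- **the first version uses supervised set prediction**
+- **what the LLM finally receives is not raw logits but spatial tokens fusing latents and coordinate embeddings**
+This is currently the `Spatial-BEATs` design that best fits the project goals and is also the most robust.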

docs/spatial_beats_training_overview.md ADDED

	@@ -0,0 +1,608 @@

+# Spatial-BEATs Training And Architecture Overview
+This document summarizes the current `Spatial-BEATs` implementation in this repository:
+- model architecture
+- tensor shape flow
+- dataset contract
+- variable-length batching
+- supervision and losses
+- stage-1 training setup
+- current `ov1/ov2/ov3` presets
+The implementation described here corresponds to:
+- [spatial_beats.py](/apdcephfs_cq10/share_1603164/user/schmittzhu/code/unilm/beats/spatial_beats.py)
+- [spatial_modules.py](/apdcephfs_cq10/share_1603164/user/schmittzhu/code/unilm/beats/spatial_modules.py)
+- [spatial_dataset.py](/apdcephfs_cq10/share_1603164/user/schmittzhu/code/unilm/beats/spatial_dataset.py)
+- [spatial_loss.py](/apdcephfs_cq10/share_1603164/user/schmittzhu/code/unilm/beats/spatial_loss.py)
+- [train_spatial_beats.py](/apdcephfs_cq10/share_1603164/user/schmittzhu/code/unilm/beats/train_spatial_beats.py)
+- [spatial_beats_ov123_stage1_config.py](/apdcephfs_cq10/share_1603164/user/schmittzhu/code/unilm/beats/spatial_beats_ov123_stage1_config.py)
+## 1. Goal
+`Spatial-BEATs` is a separate spatial encoder for FOA audio.
+It is designed to:
+- reuse the BEATs backbone and pretrained weights
+- take full FOA input instead of only the `W` channel
+- learn spatial structure through explicit supervision
+- output fixed-rate spatial tokens for an LLM
+- stay separate from the original audio encoder used for semantic audio understanding
+The current implementation follows the simplified design:
+- the main objective is to train the FOA front-end and BEATs trunk to produce spatially informative embeddings
+- the supervision heads are lightweight readout heads
+- the final LLM tokens are taken from the encoder-side spatial embeddings, not from the final logits
+## 2. High-Level Architecture
+The end-to-end model path is:
+```text
+FOA waveform
+  -> FOA spatial preprocessor
+  -> multi-channel patch embedding
+  -> BEATs trunk
+  -> frequency pooling
+  -> temporal resampling to 2.5 Hz
+  -> shallow temporal readout
+  -> spatial embeddings
+  -> fixed-slot supervision heads
+  -> projector
+  -> LLM spatial tokens
+```
+More concretely:
+```text
+[B, 4, T]
+  -> [B, 7, T_f, 128]
+  -> [B, N_p, 512]
+  -> [B, N_p, 768]
+  -> [B, T_p, 768]
+  -> [B, T_s_max, 768]
+  -> [B, T_s_max, 768]
+  -> [B, T_s_max, 4, 768]
+  -> [B, T_s_max, d_llm]
+```
+Where:
+- `B`: batch size
+- `T`: waveform length in samples
+- `T_f`: acoustic frame count before patching
+- `N_p`: number of BEATs patches
+- `T_p`: time-axis patch count after frequency pooling
+- `T_s_max`: padded token count in the batch after resampling to `2.5 Hz`
+- `d_llm`: spatial token width sent to the LLM
+## 3. Input And Front-End
+### 3.1 Input audio
+The model expects:
+- FOA waveform
+- shape `[B, 4, T]`
+- channel order: `W, X, Y, Z`
+- sample rate: `16 kHz`
+### 3.2 Qwen-like low-level mel setup
+The current front-end is aligned to the Qwen-2.5-Omni audio tower style low-level parameters:
+- `sample_rate = 16000`
+- `num_mel_bins = 128`
+- `n_fft = 400`
+- `win_length = 400`
+- `hop_length = 160`
+- `dither = 0.0`
+These parameters are shared between:
+- `SpatialBEATsConfig`
+- `SpatialDatasetConfig`
+This keeps the data pipeline and the model front-end consistent.
+### 3.3 FOA feature construction
+The preprocessor converts FOA waveform into a 7-channel feature map:
+- `W_logmel`
+- `X_logmel`
+- `Y_logmel`
+- `Z_logmel`
+- `IVx`
+- `IVy`
+- `IVz`
+Output shape:
+- `foa_feat: [B, 7, T_f, 128]`
+This allows the whole FOA structure to enter the backbone instead of relying on only `W`.
+## 4. Backbone And Spatial Embedding Path
+### 4.1 Spatial patch embedding
+The model replaces the original single-channel patch stem with a 7-channel patch embedding:
+- input: `foa_feat [B, 7, T_f, 128]`
+- output: `patch_tokens [B, N_p, 512]`
+- also returns `grid_size = (T_p, F_p)`
+This is the first modified entry point for reusing BEATs on FOA input.
+### 4.2 Reused BEATs trunk
+The trunk reuses BEATs pretrained components:
+- `layer_norm`
+- `post_extract_proj`
+- `encoder.pos_conv`
+- all transformer layers
+- `encoder.layer_norm`
+Flow:
+- input: `patch_tokens [B, N_p, 512]`
+- output: `encoder_memory [B, N_p, 768]`
+### 4.3 Frequency pooling
+The patch sequence is reshaped back into a patch grid and pooled over the frequency axis:
+- input: `encoder_memory [B, N_p, 768]` with `grid_size=(T_p, F_p)`
+- reshaped internally to `[B, T_p, F_p, 768]`
+- pooled output: `temporal_patch_tokens [B, T_p, 768]`
+This produces a time-aligned sequence before the final token-rate conversion.
+### 4.4 Temporal resampling
+The temporal resampler converts the patch-rate sequence into the final spatial token rate:
+- target token rate: `2.5 Hz`
+- per-sample target length:
+```text
+T_s_i = round(duration_i * 2.5)
+```
+Batch handling:
+- each sample is resampled independently
+- the batch is padded to `T_s_max = max_i(T_s_i)`
+- a temporal mask is produced
+Outputs:
+- `temporal_tokens: [B, T_s_max, 768]`
+- `temporal_padding_mask: [B, T_s_max]`
+Mask convention:
+- `False`: valid time step
+- `True`: padded time step
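A minimal sketch of the per-sample length rule and the mask convention above, written with plain Python lists rather than tensors. The helper names `target_steps` and `padding_mask` are illustrative assumptions and are not taken from `spatial_beats.py`. One detail worth checking against the real code: Python's built-in `round` rounds halves to even, so a 5.0 s clip yields 12 steps here, not 13.

```python
def target_steps(duration_seconds: float, token_rate_hz: float = 2.5) -> int:
    """Token count for one clip: round(duration * rate)."""
    return round(duration_seconds * token_rate_hz)


def padding_mask(durations, token_rate_hz: float = 2.5):
    """Return (steps per sample, mask) where mask[b][t] is True on padded steps."""
    steps = [target_steps(d, token_rate_hz) for d in durations]
    t_s_max = max(steps)
    mask = [[t >= n for t in range(t_s_max)] for n in steps]
    return steps, mask


steps, mask = padding_mask([20.0, 10.0, 4.0])
print(steps)         # [50, 25, 10]
print(len(mask[0]))  # 50, i.e. T_s_max
print(mask[1][24], mask[1][25])  # False True
```

The mask follows the document's convention (`False` = valid, `True` = padded), so it can be used directly as a key-padding mask in the readout.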
+### 4.5 Shallow temporal readout
+The shallow temporal readout refines the resampled sequence with a lightweight transformer encoder:
+- input: `temporal_tokens [B, T_s_max, 768]`
+- output: `spatial_embeddings [B, T_s_max, 768]`
+This is the main representation used for both:
+- spatial supervision
+- final projection to LLM tokens
+## 5. Supervision Heads
+The current stage-1 design does not use a heavy decoder.
+Instead, it uses a fixed-slot readout for supervision only.
+### 5.1 Fixed-slot readout
+The readout expands each time step into a small number of internal supervision slots:
+- max slots per step: `K = 4`
+- input: `spatial_embeddings [B, T_s_max, 768]`
+- output: `slot_latents [B, T_s_max, 4, 768]`
+Important:
+- `K=4` is only a supervision capacity
+- it does not change the final LLM token count
+- the final LLM-visible token rate is still `2.5 Hz`
+### 5.2 Prediction heads
+Each supervision slot predicts:
+- `pred_activity: [B, T_s_max, 4]`
+- `pred_azi_logits: [B, T_s_max, 4, 360]`
+- `pred_ele_logits: [B, T_s_max, 4, 180]`
+- `pred_dist: [B, T_s_max, 4, 1]`
+- `pred_class_logits: [B, T_s_max, 4, C]`
+Where:
+- `C = 65`
+- the class vocabulary comes from:
+  - `/apdcephfs_cq12/share_302080740/user/schmittzhu/data/fsd50k/FSD50K.ground_truth/final_vocabulary.csv`
+These heads are used to supply explicit training loss and push the front-end plus BEATs trunk to learn spatial structure.
+## 6. LLM Spatial Tokens
+The final LLM tokens are not taken from slot logits.
+They are projected from the encoder-side spatial embeddings:
+- input: `spatial_embeddings [B, T_s_max, 768]`
+- output: `llm_spatial_tokens [B, T_s_max, d_llm]`
+Therefore:
+- `2.5 Hz` means final LLM-visible tokens arrive at `2.5 tokens/second`
+- a `20 s` clip produces about `50` spatial tokens
+- a `10 s` clip produces about `25` spatial tokens
+This is the externally visible spatial token interface.
+## 7. Pretrained Weight Reuse
+The model initializes from `BEATs_iter3+ AS2M`.
+Current pretrained loading logic:
+- selectively load BEATs trunk modules
+- skip task-specific components that do not match
+- inflate the old single-channel patch embedding into the new 7-channel stem
+Patch stem initialization rule:
+- original BEATs patch weight is copied into channel `0` of the new 7-channel stem
+- remaining channels start from zero
+This is a conservative initialization intended to preserve BEATs trunk stability while enabling FOA adaptation.
+## 8. Dataset Contract
+### 8.1 Supported manifests
+The dataset loader currently supports:
+- `ov1_foa.jsonl`
+- `ov2_foa.jsonl`
+- `ov3_foa.jsonl`
+It handles:
+- single-source top-level manifest style
+- nested multi-source manifest style with `sources`
+### 8.2 Required scene-level data
+At scene level the dataset expects one FOA path, typically:
+- `output_foa_path`
+or compatible fallback names already handled in the parser.
+### 8.3 Required source-level data
+For each source, the loader extracts:
+- source class
+- azimuth
+- elevation
+- distance
+- weak time window
+Internally each source is converted into a `SourceEvent` containing:
+- `class_index`
+- `class_label`
+- `azimuth_deg`
+- `elevation_deg`
+- `distance_m`
+- `start_time_seconds`
+- `end_time_seconds`
+### 8.4 Vocabulary mapping
+Source labels are mapped to `final_vocabulary.csv`.
+The loader supports several field aliases, including:
+- `mono_target_label`
+- `mono_primary_label`
+- `final_label`
+- `source_label`
+- `label`
+and several id-style aliases if an integer class index is already present.
+## 9. Variable-Length Batching
+Handling mixed-length FOA clips is a core part of the current implementation.
+### 9.1 Waveform padding
+At batch time:
+- each waveform is padded to the batch maximum waveform length
+- the padded tensor has shape `[B, 4, T_max]`
+- a waveform padding mask is created:
+```text
+waveform_padding_mask: [B, T_max]
+```
+Mask convention:
+- `False`: valid waveform sample
+- `True`: padded sample
+### 9.2 Temporal token padding
+After temporal resampling:
+- each sample has its own `T_s_i = round(duration_i * 2.5)`
+- the batch is padded to `T_s_max`
+- the model returns:
+```text
+temporal_padding_mask: [B, T_s_max]
+target_num_steps: [B]
+```
+All temporal supervision, matching, and loss computation respect these lengths.
+### 9.3 Long clip truncation
+The current training presets cap clip duration at:
+- `20.0 seconds`
+The dataset applies cropping before batching.
+Preset crop policy:
+- `crop_mode = "start"`
+This means:
+- clips longer than 20 seconds keep their first 20 seconds; the remainder is discarded
+- training and validation follow the same deterministic sequence policy
+If needed later, the dataset also supports:
+- `random`
+- `center`
+- `none`
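The crop policy can be sketched as a function that returns a sample range. This is a hypothetical helper: `crop_window` and its arguments are assumptions rather than the dataset's actual API, and the behaviour shown for `center` and `random` is only a guess at the usual meaning of those names.

```python
import random


def crop_window(num_samples, sample_rate=16000, max_seconds=20.0,
                crop_mode="start", rng=None):
    """Return (begin, end) sample indices of the kept region."""
    max_samples = int(round(max_seconds * sample_rate))
    if crop_mode == "none" or num_samples <= max_samples:
        return 0, num_samples  # nothing to cut
    slack = num_samples - max_samples
    if crop_mode == "start":
        begin = 0  # keep the first max_seconds
    elif crop_mode == "center":
        begin = slack // 2
    elif crop_mode == "random":
        begin = (rng or random).randint(0, slack)
    else:
        raise ValueError(f"unknown crop_mode: {crop_mode}")
    return begin, begin + max_samples


print(crop_window(30 * 16000))                      # (0, 320000)
print(crop_window(30 * 16000, crop_mode="center"))  # (80000, 400000)
```

Whenever `begin` is non-zero, the source time windows would presumably need to be shifted by `begin / sample_rate` so that supervision stays aligned with the cropped audio.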
+## 10. Matching And Losses
+### 10.1 Weak temporal supervision
+The model uses weak source windows:
+- each source provides `start_time_seconds` and `end_time_seconds`
+- these define a valid supervision window, not guaranteed frame-level activity
+The loss code first converts source windows into a time-window mask:
+```text
+window_mask: [B, N_gt, T_s_max]
+```
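One way to build such a window mask for a single sample is sketched below. It assumes token `t` covers the interval `[t / 2.5, (t + 1) / 2.5)` seconds and marks a step as in-window when the source window overlaps that interval; the real overlap rule in `spatial_loss.py` may differ, and `window_mask` is an illustrative name.

```python
def window_mask(sources, num_steps, token_rate_hz=2.5):
    """sources: list of (start_s, end_s). Returns a [N_gt][num_steps] list of bools."""
    mask = []
    for start, end in sources:
        row = []
        for t in range(num_steps):
            t0 = t / token_rate_hz        # token interval start (seconds)
            t1 = (t + 1) / token_rate_hz  # token interval end (seconds)
            row.append(start < t1 and end > t0)  # any overlap counts
        mask.append(row)
    return mask


# One source active from 1.0 s to 2.0 s in a 4 s clip (10 tokens at 2.5 Hz)
print(window_mask([(1.0, 2.0)], 10)[0])
```

In this sketch `True` means "inside the weak window", which is the opposite polarity of the padding masks; the two would be combined before matching.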
+### 10.2 Per-step fixed-slot matching
+Matching is performed per time step:
+- only on valid temporal positions
+- only within each source's weak time window
+- between active GT sources and the `K=4` slot predictions
+The current matcher uses a detached cost built from:
+- activity
+- class
+- azimuth
+- elevation
+- distance
+The output contains the assigned GT target for each valid slot-time pair.
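The document does not say which assignment solver the matcher uses. As an illustration only, the sketch below solves one time step by brute force, which is cheap because there are at most `K = 4` slots (24 permutations). The function name `match_step` and the cost values are made up; in the real matcher the cost would be built from the detached activity, class, azimuth, elevation, and distance terms listed above.

```python
from itertools import permutations


def match_step(cost):
    """cost[g][k] = cost of assigning GT source g to slot k, with len(cost) <= K.

    Returns the slot index chosen for each GT source, minimising total cost.
    """
    num_gt = len(cost)
    if num_gt == 0:
        return []
    num_slots = len(cost[0])
    best, best_total = None, float("inf")
    for slots in permutations(range(num_slots), num_gt):
        total = sum(cost[g][k] for g, k in enumerate(slots))
        if total < best_total:
            best, best_total = list(slots), total
    return best


# Two active GT sources, four slots: a greedy pick for source 0 would take slot 0,
# but the joint optimum gives slot 0 to source 1.
print(match_step([[0.1, 0.2, 9.0, 9.0],
                  [0.1, 0.9, 9.0, 9.0]]))  # [1, 0]
```

Slots left unassigned at a step would then be supervised as inactive by the activity loss.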
+### 10.3 Multi-task loss terms
+Current loss terms are:
+- `loss_activity`
+- `loss_azi`
+- `loss_ele`
+- `loss_dist`
+- `loss_cls_aux`
+- `loss_temp`
+Their roles:
+- `loss_activity`
+  - `BCEWithLogits` on slot activity
+  - computed over valid time steps
+- `loss_azi`
+  - cross-entropy over 360 azimuth bins
+- `loss_ele`
+  - cross-entropy over 180 elevation bins
+- `loss_dist`
+  - `SmoothL1Loss` on continuous distance regression
+- `loss_cls_aux`
+  - auxiliary source class cross-entropy
+- `loss_temp`
+  - temporal smoothness regularization over valid consecutive steps
+The total loss is the weighted sum defined in `SpatialLossConfig`.
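To make the classification targets and the weighted sum concrete, here is a small sketch. It assumes 1-degree bins, azimuth wrapped into `[0, 360)`, and elevation in `[-90, 90]`; the actual angle convention and the weights in `SpatialLossConfig` are not given in this document, so the function names and the weight values below are placeholders.

```python
import math


def azimuth_to_bin(azimuth_deg, num_bins=360):
    """Wrap azimuth into [0, 360) and map it to a bin index."""
    return int(math.floor(azimuth_deg % 360.0 * num_bins / 360.0)) % num_bins


def elevation_to_bin(elevation_deg, num_bins=180):
    """Clip elevation to [-90, 90] and map it to a bin index."""
    clipped = min(max(elevation_deg, -90.0), 90.0)
    return min(int((clipped + 90.0) * num_bins / 180.0), num_bins - 1)


def total_loss(terms, weights):
    """Weighted sum over the named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


print(azimuth_to_bin(-90.0))   # 270
print(elevation_to_bin(0.0))   # 90
print(total_loss({"loss_azi": 2.0, "loss_dist": 4.0},
                 {"loss_azi": 1.0, "loss_dist": 0.5}))  # 4.0
```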
+## 11. Stage-1 Training Flow
+The current training entry is stage-1 encoder-focused training.
+High-level flow per step:
+```text
+batch
+  -> SpatialBEATs.forward()
+  -> match_fixed_slots()
+  -> compute_spatial_losses()
+  -> backward()
+  -> optimizer.step()
+```
+### 11.1 Trainable modules in stage 1
+By default, stage 1 trains:
+- `preprocessor`
+- `patch_embedding`
+- `frequency_pool`
+- `temporal_resampler`
+- `temporal_readout`
+- `slot_readout`
+- `prediction_heads`
+It can also unfreeze the BEATs trunk.
+The projector is kept frozen by default in stage 1.
+### 11.2 Optimizer
+The current trainer uses:
+- `AdamW`
+Default preset values:
+- `batch_size = 4`
+- `num_epochs = 20`
+- `learning_rate = 1e-4`
+- `weight_decay = 0.05`
+## 12. Current Presets
+### 12.1 `OV123_STAGE1_CFG`
+Defined in:
+- [spatial_beats_ov123_stage1_config.py](/apdcephfs_cq10/share_1603164/user/schmittzhu/code/unilm/beats/spatial_beats_ov123_stage1_config.py)
+This preset is intended to train on:
+- `ov1_foa.jsonl`
+- `ov2_foa.jsonl`
+- `ov3_foa.jsonl`
+with split filtering:
+- train: `("train",)`
+- val: `("valid",)`
+- test: `("test",)`
+and clip truncation:
+- `max_clip_duration_seconds = 20.0`
+- `crop_mode = "start"`
+### 12.2 `OV23_STAGE1_CFG`
+This is the safer baseline preset using only:
+- `ov2_foa.jsonl`
+- `ov3_foa.jsonl`
+It uses the same split and truncation policy.
+### 12.3 Important note on `ov1`
+The trainer is already written to use `split` filtering for `ov1`.
+If the active `ov1` manifest at the configured path does not yet contain `split`, then:
+- the `OV123` preset will not automatically include those samples in train, valid, or test
+- the fix is simply to point the preset at the updated `ov1` manifest path
+The code path itself already supports split-aware loading.
+## 13. Current Runtime Status
+The current implementation has already been checked on:
+- a real FOA waveform file
+- mixed-length real manifest samples
+- full forward pass
+- fixed-slot matching
+- multi-task loss computation
+- BEATs pretrained weight loading
+The following paths are already operational:
+- dataset parsing
+- waveform batching
+- mixed-length temporal masking
+- model forward
+- matching
+- loss computation
+- stage-1 optimization loop
+## 14. Recommended Launch Pattern
+Example usage:
+```python
+from spatial_beats_ov123_stage1_config import OV123_STAGE1_CFG
+from train_spatial_beats import main
+main(OV123_STAGE1_CFG)
+```
+If `ov1` still needs a different manifest path, update only:
+```python
+OV123_STAGE1_CFG.train_manifest_paths
+OV123_STAGE1_CFG.val_manifest_paths
+OV123_STAGE1_CFG.test_manifest_paths
+```
+or rebuild the config through:
+```python
+from train_spatial_beats import make_ov123_stage1_config
+```
+## 15. Summary
+The current `Spatial-BEATs` implementation is a FOA-first BEATs-based spatial encoder with:
+- Qwen-like low-level mel settings
+- a 7-channel FOA front-end
+- reused BEATs trunk
+- fixed-rate `2.5 Hz` spatial token output
+- fixed-slot supervision heads
+- variable-length batching
+- split-aware `ov1/ov2/ov3` training presets
+The central training idea is:
+- use explicit spatial supervision to shape the front-end and BEATs trunk
+- keep the supervision head lightweight
+- use encoder-side spatial embeddings as the final source of LLM spatial tokens

docs/v13_honest_postmortem.md ADDED Viewed

	@@ -0,0 +1,170 @@

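+# v13 Experiment Series Postmortem: An Honest Record of My Misjudgments
+> Date: 2026-05-02
+> Purpose: keep the evidence of the misjudgments so the same mistakes are not repeated
+> Author's note: I misdiagnosed the state of these experiments several times; this document records those errors and their root causes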
+---
+## 1. Core mistake: overlooking that v13_D's cls warmup is 8 epochs, not 3
+### Facts
+v13_D config:
+```python
+cfg.frame_spatial_loss_warmup_epochs = 8   # v12 uses 3
+cfg.num_epochs = 25
+```
+This means **ep0-7 are all cls-only training** (the warmup mechanism holds the spatial lambda down to warmup_scale); ep8 is the first epoch in which the spatial loss is actually released.
+### Actual v13_D training curve
+```
+ep   F20     o_cls   azi      phase
+ 0   0.311   0.650   28.6°    cls warmup, spatial loss close to 0
+ 1   0.340   0.768   25.5°    ...
+ 3   0.308   0.788   26.5°    ...
+ 7   0.193   0.786   31.0°    spatial not ramped yet, azi keeps drifting
+ 8   0.397   0.876   18.5°    ★ spatial loss actually released, F jumps
+ 9   0.402   0.868   17.3°
+10   0.402   0.864   17.2°    (best so far)
+```
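+### My misjudgment (two conversations ago)
+Looking at the ep0-7 data, I wrote:
+> v13_D: reaches its best at ep1, then diverges all the way
+> This is a completely different problem, not overfitting: the top-k rank loss itself is being used wrong.
+> Top-K rank loss + EMA + resume optimizer + cosine LR: these four changes stacked together are fighting each other
+**Completely wrong.** What actually happened:
+- The ep1 "best" is just an inflated F20 value during cls warmup and does not reflect spatial performance
+- F drops from ep1 to ep7 because the spatial loss is held down while the trunk is learning classes (trunk weights change → no spatial supervision → azi drifts)
+- Once the spatial loss is released at ep8, F immediately goes **0.19 → 0.40** (+107%)
+- **Top-K rank, EMA, cosine LR, and resume optimizer all work as intended**
+### Why I made this mistake
+- I applied intuition from v12's 3-epoch warmup to v13_D's 8-epoch warmup
+- I did not first look at the full ep0-10 curve, and diagnosed from only the first few epochs
+- The surface symptom "F is dropping" made me rush to an explanation without checking it against the **ramp schedule**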
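+## 2. Secondary mistake: misjudging the state of v13_B
+v13_B actually stopped at ep4:
+```
+ep  F20   o_cls
+ 0  0.255 0.647
+ 1  0.292 0.771
+ 2  0.298 0.779
+ 3  0.357 0.776   ← first epoch with spatial released (warmup=3)
+ 4  0.356 0.775
+```
+v13_B's warmup is 3 (same as v12), so ep3 is the first epoch with the spatial loss released. ep3-4 are only the first 2 epochs after release, so F has not yet had time to rise.
+### My misjudgment
+> v13_B's ASL + soft-F1 is not enough to shift the precision/recall trade-off at this data scale
+**Wrong**: training stopped at ep4, so ASL + soft-F1 never had time to learn. Had it run the same 15 epochs as v12, F might well have reached 0.38-0.40.
+### Correction
+- The conclusion for v13_B should be **"run incomplete, no conclusion possible"**, not "design failed"
+- If the user still wants to see v13_B's real effect, it should be restarted and run for the full 15+ epochs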
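+## 3. The v13_C judgment is basically correct, but also needs calibration
+v13_C's spatial warmup is also 3 epochs (inherited from v12). Run for the full 15 epochs:
+```
+ep  F20    val_loss
+ 3  0.385  2.67    first epoch with spatial released, F jumps straight to 0.385
+ 7  0.387  3.40
+10  0.385  3.38    F is flat
+15  0.385  3.60    val_loss keeps rising
+```
+v13_C **really did** reach 0.385 at ep3 and then failed to improve for the next 12 epochs; this is genuine overfitting (val_loss goes from 2.67 to 3.60).
+But my attribution may also be off:
+- I said "real replication 6× is a failed recipe"
+- But v13_C actually stacks four changes at once, C-1 (real 6×) + C-2 (refinement) + C-3 (V3 adapter) + C-4 (log-dist), so C-1 cannot be blamed on its own
+- The right approach: run an ablation with only C-2 enabled, or only C-1 enabled, and look at each separately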
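+## 4. The v13_E design rests on a wrong diagnosis
+v13_E's stated goal is "F 0.40-0.43, based on the lessons of the v13_D collapse".
+But v13_D did not actually collapse: F had already reached 0.402 at ep10 and was still rising.
+**The actual value of v13_E**:
+- It enables num_active head training plus the SELD evaluator's top-K̂ gate, which v13_D does not do
+- It serves as a **follow-up experiment** after v13_D (v13_F?), not a replacement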
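+## 5. Lessons learned
+### Lesson 1: look at the full trajectory, not a prefix
+From now on, before evaluating an experiment's state, wait until the **spatial ramp has finished** and there are **at least 5 epochs of spatial-phase data**. Judging from the ep0-7 cls warmup data was a serious mistake.
+### Lesson 2: the warmup schedule is a key difference between experiments
+v12: warmup=3, v13_B/C: warmup=3, v13_D: warmup=8. The same epoch number corresponds to completely different training phases. Plots should mark "spatial_enabled_epoch" and align curves on it as the reference point.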
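The alignment suggested in Lesson 2 can be sketched in a few lines. The F20 values are copied from the curves in sections 1 and 2; the function name `align_to_spatial_start` is illustrative and not part of the training code.

```python
def align_to_spatial_start(curve, warmup_epochs):
    """curve: {epoch: F20}. Re-index by epochs since the spatial loss was enabled,
    dropping the cls-only warmup epochs."""
    return {ep - warmup_epochs: f20
            for ep, f20 in sorted(curve.items())
            if ep >= warmup_epochs}


v13_d = {0: 0.311, 1: 0.340, 3: 0.308, 7: 0.193, 8: 0.397, 9: 0.402, 10: 0.402}
v13_b = {0: 0.255, 1: 0.292, 2: 0.298, 3: 0.357, 4: 0.356}

print(align_to_spatial_start(v13_d, warmup_epochs=8))  # {0: 0.397, 1: 0.402, 2: 0.402}
print(align_to_spatial_start(v13_b, warmup_epochs=3))  # {0: 0.357, 1: 0.356}
```

On this axis both runs are only two or three epochs into their spatial phase, which is the comparison the raw epoch numbers hide.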
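+### Lesson 3: diagnoses must be grounded in an understanding of the pipeline
+I repeatedly said things like "Top-K rank and Hungarian work against each other" and "ASL and the gate cancel each other out", but these were all **post-hoc explanations based on a few epochs of data**, not real mechanistic derivations. Next time, first ask: "Which stage of the pipeline does this data come from?"
+### Lesson 4: do not rush to write a new experiment to replace an old one
+v13_E was not needed in the first place. It would have been enough to let v13_D finish its 25 epochs. The motivation for writing v13_E was "v13_D collapsed, so it needs rescuing", and that premise was itself wrong.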
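+## 6. What to do next
+### Immediate actions
+- **Let v13_D keep running to 25 epochs**; do not stop it
+- **Do not rush to a conclusion on v13_B**. If compute is available, restart it and run the full 15 epochs; otherwise mark it "incomplete"
+- **Keep the v13_C conclusion** (it does overfit), but do not blame a single change
+### Final expectation for v13_D (extrapolated from the v12 curve)
+```
+v12 curve after spatial release:
+  ep3:  0.353  (just released)
+  ep12: 0.378  (best, +0.025 / 9 ep)
+v13_D by analogy:
+  ep10: 0.402  (ramp just finished after release)
+  ep19 (+9 ep): ~0.425  (conservative estimate)
+  ep22 (+12 ep): ~0.43~0.46  (optimistic)
+```
+EMA + cosine LR may add another +0.005~0.01. Final expectation for v13_D: **0.43 ~ 0.46**.
+### Repositioning v13_E
+From "replacement for v13_D" to "follow-up experiment after v13_D":
+- First wait for v13_D to finish
+- Use v13_D best.pt as the hot start
+- Enable the num_active head + top-K̂ gate on top of it and see whether F rises further
+- If it rises → v13_F path
+- If it does not → num_active matters little for this task
+The code and run scripts are already in place and can be used at any time without affecting the v13_D experiment.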
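+## 7. Apology to the user
+I gave overconfident, wrong judgments several times:
+- "v13_C F is stuck at 0.38 and no longer rising": that was indeed overfitting, but blaming C-1 for it was too hasty
+- "v13_D is not converging": completely wrong
+- "the Top-K rank loss itself is being used wrong": no evidence
+- "every change to the activity loss has failed": an inference based on wrong data
+Things I should have done but did not:
+- Look at the full trajectory before diagnosing
+- Pay attention to differences in the warmup schedule
+- Say "ep0-7 is data from the cls warmup period and cannot represent final performance"
+**Going forward**: before evaluating an experiment, first read its preset code and check the schedule, then interpret the data. Do not conclude "collapsed / not converging" from the first few epochs.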

docs/v13_spatial_beats_design.md ADDED Viewed

	@@ -0,0 +1,528 @@

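+# v13_B + v13_C Design Document
+> Date: 2026-05-01
+> Authors: Claude + user
+> Goal: starting from v12 (F20=0.378), push F20 to 0.45+ with two orthogonal experiments; the merged experiment v14 targets 0.55~0.62
+> Design principle: **all changes are incremental**, controlled by cfg flags that default to off, so no existing experiment is broken (v7/v11/v12 all remain reproducible)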
+---
+## 0. Background and bottleneck diagnosis (from the v12 per-subset analysis)
+v12 best.pt (aggregate F20=0.3779), broken down by subset:
+```
+subset        N      F20    ER20   LE_CD  LR_CD  o_cls  o_azi   aP    aR
+ov1_sim       4800   0.386  0.686   26.8° 0.640  0.796  25.5°  0.950 0.046
+ov2_sim       1718   0.299  0.926   29.6° 0.546  0.653  30.0°  0.916 0.151
+ov3_sim       1612   0.270  0.917   30.5° 0.499  0.599  31.8°  0.888 0.165
+ov1_real      3374   0.140  0.924  117.7° 0.232  0.809  27.4°  0.767 0.144
+ov2_real      2230   0.098  0.866  121.1° 0.198  0.747  38.3°  0.655 0.227
+ov3_real       740   0.052  0.766  146.8° 0.125  0.624  51.5°  0.612 0.232
+dcase_starss  4560   0.071  1.171  130.3° 0.185  0.698  36.0°  0.625 0.176
+unified       35021  (~0.40, estimated)
+```
+**Three diagnoses**:
+1. **The bottleneck is activity_recall (0.13), not cls and not spatial.** oracle_class_acc is still about 0.80 on real data (ov1_real), and oracle_azi_mae_deg on real data is 27-51°, which indicates the representation is healthy and is being blocked by the activity gate.
+2. **The real/sim gap is very large**: at the same ov level, real F20 is 64~81% lower than sim. The cause is that dcase_real makes up only 6% of train.
+3. **The overlap penalty is significant**: sim ov3 F20 is 30% lower than ov1 (0.27 vs 0.39); the K=4 track head assigns slots inconsistently under overlap.
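The percentages in diagnoses 2 and 3 follow directly from the F20 column of the table. A short check, with `relative_drop` as an illustrative helper name:

```python
# F20 values copied from the per-subset table above
f20 = {
    "ov1_sim": 0.386, "ov2_sim": 0.299, "ov3_sim": 0.270,
    "ov1_real": 0.140, "ov2_real": 0.098, "ov3_real": 0.052,
}


def relative_drop(reference, value):
    """Fractional drop of `value` relative to `reference`."""
    return (reference - value) / reference


for ov in ("ov1", "ov2", "ov3"):
    gap = relative_drop(f20[f"{ov}_sim"], f20[f"{ov}_real"])
    print(f"{ov}: real is {gap:.0%} below sim")   # 64%, 67%, 81%

overlap = relative_drop(f20["ov1_sim"], f20["ov3_sim"])
print(f"sim ov3 is {overlap:.0%} below sim ov1")  # 30%
```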
+---
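+## 1. Experiment split: B and C are orthogonal dimensions
+| Dimension | v13_B | v13_C |
+|---|---|---|
+| Changes loss / head output interface | ✅ | ❌ (keeps v12 loss) |
+| Changes data ratio / augment | augment only | ✅ replication |
+| Changes backbone architecture / capacity | ❌ | ✅ refinement + adapter V3 |
+| Hot start | v12 best.pt (strict=False) | v12 best.pt (strict=False) |
+| Subsets it can address | activity bottleneck on sim + real | ov2/ov3 overlap + real gap |
+Rationale: keep the gains from B and from C from contaminating each other, which makes ablation easier. v14 = B + C merged.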
+---
+## 2. Shared constraints (key: do not break existing experiments)
+All B/C changes must satisfy the following:
+### 2.1 Cfg flags default to off
+Each new change maps to a field in `SpatialBEATsConfig` / `SpatialLossConfig` / `TrainSpatialBEATsConfig`, and **the default value means "this change is not enabled"**.
+```python
+# spatial_modules side
+self.use_class_activity_bias: bool = False        # [B-1]
+self.use_class_conditional_gate: bool = False     # [B-3]
+self.use_track_refinement: bool = False           # [C-2]
+self.track_refinement_layers: int = 2
+self.patch_adapter_version: str = "v1"            # "v1"/"v2"/"v3"  [C-3] adds v3
+self.use_log_distance_head: bool = False          # [C-4]
+# spatial_loss side
+self.activity_loss_type: str = "bce"              # "bce"/"asymmetric"  [B-2]
+self.asl_gamma_neg: float = 4.0
+self.asl_gamma_pos: float = 0.0
+self.asl_probability_margin: float = 0.05
+self.soft_f1_weight: float = 0.0                  # 0 = disabled  [B-4]
+self.distance_loss_type: str = "l1"               # "l1"/"laplace_nll"  [C-4]
+# dataset side
+self.use_spec_augment: bool = False               # [B-5]
+self.spec_augment_time_mask_ratio: float = 0.0
+self.spec_augment_freq_mask_ratio: float = 0.0
+self.random_gain_db: float = 0.0
+self.channel_dropout_prob: float = 0.0
+self.lowpass_sim_real_prob: float = 0.0
+```
+The `ov1_unified_v12` preset and all earlier presets are **left untouched**: they do not set these flags, so the new code branches are never taken.
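+### 2.2 Hot-start safety
+Every newly added `nn.Module` / `nn.Parameter` uses **zero initialization** or an **identity-equivalent initialization**:
+- `class_activity_bias`: `torch.zeros(num_classes)` → logit unchanged
+- `GatingMLP`: last-layer bias/weight zero-init → gate_logit = 0
+- `TrackRefinementDecoder`: `layer_scale = zeros(num_layers)` → residual x + 0 = x
+- `log_distance_head`: bias initialized to `log(mean_distance_v12) ≈ log(1.5)`
+When loading from v12 best.pt with `strict=False`:
+- keys that do not exist in the checkpoint → keep the zero initialization
+- keys that exist (all v12 components) → loaded normally
+- the ep0 forward output should match v12 best.pt exactly (or differ numerically by < 1e-5)
+### 2.3 Full backward compatibility
+Any old script (for example `run_ov1_unified_v12.sh`) runs **without changing a single line**, because every new field has a default value and defaults to disabled.
+### 2.4 Preset naming
+- `ov1_unified_v13b`: enables all B-related flags
+- `ov1_unified_v13c`: enables all C-related flags
+- `ov1_unified_v14`: B + C all on, hot-started from max(v13b.best, v13c.best)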
+---
+## 3. v13_B detailed design: full rewrite of loss + decision
+### [B-1] Per-class learnable logit bias
+**Why**: a global threshold of 0.5 that treats every class the same is not reasonable. Rare classes (jackhammer) and common classes (singing) have very different activity priors, so the model should learn a logit bias per class (equivalent to a per-class threshold).
+**Implementation**:
+- File: `spatial_modules.py`
+- Class: `FrameTrackPredictionHeads` (location: where `class_logits` and `activity_logits` are emitted)
+- New parameter: `self.class_activity_bias = nn.Parameter(torch.zeros(num_classes))`
+- New flag: `self.use_class_activity_bias: bool`
+- Forward change:
+```python
+# before:
+activity_logit = self.activity_head(token)  # [B, K, T, 1]
+# after:
+activity_logit_raw = self.activity_head(token)  # [B, K, T, 1]
+if self.use_class_activity_bias:
+    class_probs = F.softmax(class_logits, dim=-1)  # [B, K, T, C]
+    # use class_probs as a soft weighted assignment (avoids argmax blocking gradients)
+    expected_bias = torch.einsum('bktc,c->bkt', class_probs, self.class_activity_bias)
+    activity_logit = activity_logit_raw + expected_bias.unsqueeze(-1)
+else:
+    activity_logit = activity_logit_raw
+```
+**Train/inference consistency**: the bias is the **same quantity** in the training BCE loss and in the inference sigmoid, so no threshold sweep is needed. At inference the threshold stays at 0.0 (logit space) or 0.5 (prob space), which are exactly equivalent.
+**Parameter count**: 63 scalars, negligible.
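The expected-bias computation can be checked on a single slot with scalar arithmetic. This is a plain-Python sketch of the same formula as the `einsum` above, not the module code; `biased_activity_logit` is an illustrative name.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_activity_logit(activity_logit, class_logits, class_bias):
    """activity_logit + sum_c softmax(class_logits)[c] * class_bias[c]."""
    probs = softmax(class_logits)
    expected_bias = sum(p * b for p, b in zip(probs, class_bias))
    return activity_logit + expected_bias


class_logits = [2.0, 0.5, -1.0]
# Zero-initialised bias leaves the logit unchanged (hot-start safe).
print(biased_activity_logit(-0.3, class_logits, [0.0, 0.0, 0.0]))  # -0.3
# A bias concentrated on the dominant class shifts the logit by roughly that bias.
print(biased_activity_logit(-0.3, class_logits, [1.0, 0.0, 0.0]))
```

Because the class probabilities sum to 1, a bias that is identical for every class acts as a single global threshold shift; only differences between classes make the threshold class-dependent.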
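+### [B-2] Replace BCE with Asymmetric Loss
+**Why**: BCE weights FN and FP equally. The current activity_recall=0.13 shows the FN penalty is far too weak. ASL down-weights easy negatives with the factor `p^γ-` (applied to the margin-shifted probability) and uses a weak `γ+=0` on positives, which performs markedly better than BCE under positive/negative imbalance.
+**Implementation**:
+- File: `spatial_loss.py`
+- New function: `asymmetric_loss_with_logits(logits, targets, gamma_neg=4, gamma_pos=0, margin=0.05)`
+- In `compute_frame_track_losses` (or the equivalent function), branch on `config.activity_loss_type`: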
+```python
+if config.activity_loss_type == "asymmetric":
+    loss_act = asymmetric_loss_with_logits(
+        activity_logit, target_active,
+        gamma_neg=config.asl_gamma_neg,
+        gamma_pos=config.asl_gamma_pos,
+        margin=config.asl_probability_margin,
+    )
+else:  # "bce"
+    loss_act = F.binary_cross_entropy_with_logits(activity_logit, target_active, ...)
+```
+**Math**:
+```
+p = sigmoid(logit)
+positive: -( (1-p)**γ+ ) * log(p)
+negative: p_shifted = max(p - m, 0)
+          -( p_shifted**γ- ) * log(1 - p_shifted)
+```
+**Parameters**: `γ+ = 0`, `γ- = 4`, `margin = 0.05` (starting point recommended by the ASL paper)
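A scalar version of the formula above makes the behaviour easy to inspect. This is a sketch for one logit at a time, not the batched `asymmetric_loss_with_logits`; the `eps` clamp is an added assumption for numerical safety.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def asymmetric_loss(logit, target, gamma_neg=4.0, gamma_pos=0.0,
                    margin=0.05, eps=1e-8):
    p = sigmoid(logit)
    if target == 1:
        # positive branch: -(1 - p)^gamma_pos * log(p)
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # negative branch: shift p by the margin, then focus with gamma_neg
    p_shifted = max(p - margin, 0.0)
    return -(p_shifted ** gamma_neg) * math.log(max(1.0 - p_shifted, eps))


def bce(logit, target):
    p = sigmoid(logit)
    return -math.log(p) if target == 1 else -math.log(1.0 - p)


print(asymmetric_loss(-4.0, 0))          # easy negative below the margin: zero loss
print(asymmetric_loss(3.0, 0), bce(3.0, 0))  # hard negative: down-weighted vs BCE
print(asymmetric_loss(-2.0, 1), bce(-2.0, 1))  # missed positive: same as BCE when gamma_pos=0
```

With `gamma_neg=0`, `gamma_pos=0`, and `margin=0` the function reduces to plain BCE, which is a convenient sanity check when wiring up the config switch.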
+### [B-3] Class-conditional gating MLP
+**Why**: activity currently looks only at the token embedding. It should also depend on the confidence of class/DOA: when the class softmax is sharp and the DOA is stable, the model can be bolder about predicting active.
+**Implementation**:
+- File: `spatial_modules.py`
+- New class: `ClassConditionalGate(embed_dim, num_classes, hidden_dim=128)`
+  - Input: `fused_token [B, K, T, D]`, `class_logits [B, K, T, C]`, `pred_dir [B, K, T, 3]`
+  - Fusion: `gate_input = concat(token, class_emb_avg, dir_vec)` → MLP → `gate_logit [B, K, T, 1]`
+  - class_emb is the class embedding weighted by `class_logits.softmax()` (adds `nn.Embedding(C, 32)`)
+- In FrameTrackPredictionHeads:
+```python
+if self.use_class_conditional_gate:
+    gate_logit = self.class_conditional_gate(token, class_logits, pred_dir)
+    activity_logit = activity_logit + self.gate_scale * gate_logit
+```
+**Initialization**: MLP last layer `weight=zero, bias=zero` → gate_logit = 0 → ep0 is equivalent to v12.
+**Parameter count**: ~80K.
+### [B-4] Soft-F1 auxiliary loss
+**Why**: BCE/ASL are still per-sample losses, so there is a gap between the optimization target and the macro-F20 evaluation. Soft-F1 aggregates per class directly and has the same structure as the DCASE evaluation.
+**Implementation**:
+- File: `spatial_loss.py`
+- New function: `soft_macro_f1_loss(activity_logits, class_logits, target_active, target_class)`
+  - For each class `c`:
+    - `p_c = sigmoid(act_logit) * softmax(class)[c]`  (soft activity for class c)
+    - `y_c = (target_active and target_class==c)`
+    - `tp_c = sum(p_c * y_c)`, `fp_c = sum(p_c * (1-y_c))`, `fn_c = sum((1-p_c) * y_c)`
+    - `f1_c = 2 tp_c / (2 tp_c + fp_c + fn_c + eps)`
+  - `loss = 1 - mean(f1_c)`
+- In the total loss:
+```python
+if config.soft_f1_weight > 0:
+    total_loss = total_loss + config.soft_f1_weight * soft_macro_f1_loss(...)
+```
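The per-class aggregation can be sketched on plain lists. Here `soft_preds[n][c]` stands for the already-computed `p_c` of slot-time pair `n`, and `targets[n][c]` for `y_c`; the signature is simplified relative to the planned `soft_macro_f1_loss` and is an assumption.

```python
def soft_macro_f1_loss(soft_preds, targets, eps=1e-8):
    """1 - mean over classes of the soft F1 built from soft TP/FP/FN counts."""
    num_classes = len(soft_preds[0])
    f1_per_class = []
    for c in range(num_classes):
        tp = sum(p[c] * y[c] for p, y in zip(soft_preds, targets))
        fp = sum(p[c] * (1 - y[c]) for p, y in zip(soft_preds, targets))
        fn = sum((1 - p[c]) * y[c] for p, y in zip(soft_preds, targets))
        f1_per_class.append(2 * tp / (2 * tp + fp + fn + eps))
    return 1.0 - sum(f1_per_class) / num_classes


targets = [[1.0, 0.0], [0.0, 1.0]]
print(soft_macro_f1_loss([[1.0, 0.0], [0.0, 1.0]], targets))  # close to 0
print(soft_macro_f1_loss([[0.0, 1.0], [1.0, 0.0]], targets))  # 1.0
```

One design point to decide in the real implementation: in this sketch a class with no positives in the batch gets F1 = 0 and pulls the mean down, so it may be preferable to average only over classes present in the batch.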
+**Warmup** (confirmed): use `soft_f1_weight=0.1` for the first 3 epochs (ep0-2), then hard-switch to `0.3` from ep3 onward.
+Implementation: in the epoch loop of `train_spatial_beats.py`, set `train_cfg.loss.soft_f1_weight` dynamically based on `epoch >= soft_f1_warmup_epochs`, with new config fields:
+```python
+cfg.loss.soft_f1_weight_warmup: float = 0.1        # used when ep < warmup_epochs
+cfg.loss.soft_f1_weight: float = 0.3               # ep >= warmup_epochs
+cfg.loss.soft_f1_warmup_epochs: int = 3
+```
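The hard switch can be written as a small pure function that the epoch loop would call. The function name is hypothetical; the defaults mirror the three config fields above.

```python
def soft_f1_weight_for_epoch(epoch, warmup_weight=0.1, final_weight=0.3,
                             warmup_epochs=3):
    """Weight used for the soft-F1 term in a given (0-indexed) epoch."""
    return warmup_weight if epoch < warmup_epochs else final_weight


print([soft_f1_weight_for_epoch(ep) for ep in range(5)])  # [0.1, 0.1, 0.1, 0.3, 0.3]
```

Keeping the final value in a separate field rather than overwriting `soft_f1_weight` in place would avoid losing it if a run is resumed mid-warmup.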
+### [B-5] Real-distribution augment
+**Why**: sim_static has clean reverberation, so the activity criterion the model learns breaks down at low SNR. Augmentation exposes the model to a range of "contaminated" spectrograms and makes it more robust on real data.
+**Implementation**:
+- File: `spatial_dataset.py`
+- Add the augment pipeline in `SpatialDataset.__getitem__` or in collate
+- Enable only on the training set (`split='train'`), not on valid/test
+- New config flags:
+  - `use_spec_augment` (default False)
+  - `spec_augment_time_mask_ratio` (0.2 = 20% of the time length)
+  - `spec_augment_freq_mask_ratio` (0.15)
+  - `random_gain_db` (±8)
+  - `channel_dropout_prob` (0.1)
+  - `lowpass_sim_real_prob` (0.1, cutoff ∈ U[4000, 8000] Hz)
+**Order**: waveform-level augment (gain, lowpass, channel_dropout) → feature-level augment (SpecAugment).
+**Important**: augmentation only touches the 4-channel FOA waveform / delta features; **target labels are unchanged**.
+### B experiment preset: `make_ov1_unified_v13b_config`
+Hot start: `v12_best.pt` (strict=False)
+Switches:
+```python
+cfg.model.use_class_activity_bias = True          # [B-1]
+cfg.model.use_class_conditional_gate = True       # [B-3]
+cfg.loss.activity_loss_type = "asymmetric"        # [B-2]
+cfg.loss.asl_gamma_neg = 4.0
+cfg.loss.asl_probability_margin = 0.05
+cfg.loss.soft_f1_weight = 0.3                     # [B-4]
+cfg.dataset.use_spec_augment = True               # [B-5]
+cfg.dataset.spec_augment_time_mask_ratio = 0.2
+cfg.dataset.spec_augment_freq_mask_ratio = 0.15
+cfg.dataset.random_gain_db = 8.0
+cfg.dataset.channel_dropout_prob = 0.1
+cfg.dataset.lowpass_sim_real_prob = 0.1
+cfg.learning_rate = 1e-5
+cfg.num_epochs = 15
+cfg.output_dir = "checkpoints/spatial_beats_ov1_unified_v13b_exp/03_ov123_top4"
+```
+**Data manifests are reused from v12 unchanged**: unified train/valid + old ov1/2/3 sim/real + dcase_starss as val.
+---
+## 4. v13_C detailed design: full rewrite of data + architecture
+### [C-1] Real data upsampling (replication)
+**Why**: real data (dcase_real) makes up 6% of train, too little for the gradients to register. Standard practice in the DCASE community is 20-30% real.
+**Implementation**:
+- Preprocessing script: `scripts/split_unified_train_by_source.py`
+  - Reads `unified_spatial_foa_fsd63_all/train.jsonl`
+  - Splits it into three files by the `data_source` field:
+    - `train_sim_static.jsonl`
+    - `train_qa_sim.jsonl`
+    - `train_dcase_real.jsonl`
+  - Writes them into the same `unified_spatial_foa_fsd63_all/` directory
+- Preset:
+```python
+cfg.train_manifest_paths = (
+    unified_root / "train_sim_static.jsonl",
+    unified_root / "train_qa_sim.jsonl",
+    unified_root / "train_dcase_real.jsonl",
+)
+cfg.train_manifest_replication = (1, 1, 6)
+```
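+**Impact estimate** (based on the known v12 distribution):
+- sim_static 304K × 1 = 304K
+- qa_sim ~? × 1 = ~?
+- dcase_real 20K × 6 = 120K
+- real share goes from 6% → ~25% (depending on the size of qa_sim)
+**Compatibility**: the `train_manifest_replication` mechanism already exists in `train_spatial_beats.py` (used by v7j), so no new framework code is needed. Only the preset changes.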
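Since the qa_sim size is left open above, the resulting real share is easiest to read as a function of it. The sketch below uses the 304K and 20K counts from the estimate; the qa_sim values passed in are purely hypothetical probes, and `real_share` is an illustrative name.

```python
def real_share(sim_static=304_000, qa_sim=0, dcase_real=20_000,
               replication=(1, 1, 6)):
    """Fraction of training samples that are real after replication."""
    counts = [sim_static * replication[0],
              qa_sim * replication[1],
              dcase_real * replication[2]]
    return counts[2] / sum(counts)


for qa in (0, 56_000, 100_000):  # hypothetical qa_sim sizes
    before = real_share(qa_sim=qa, replication=(1, 1, 1))
    after = real_share(qa_sim=qa)
    print(f"qa_sim={qa:>7}: {before:.1%} -> {after:.1%}")
```

Under these counts the 6x replication lands in the 20-30% band for any qa_sim size up to roughly 170K.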
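+### [C-2] Track-wise Refinement Transformer (2 layers)
+**Why**: the K=4 track slots do not know what the others are doing. Under overlap the same source gets claimed by several slots, or one slot gets claimed by several sources. Self-attention is introduced so that slots can "repel" each other.
+**Implementation**:
+- File: `spatial_modules.py`
+- New class: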
+```python
+import torch
+import torch.nn as nn
+class TrackRefinementDecoder(nn.Module):
+    def __init__(self, num_tracks=4, embed_dim=768, num_layers=2,
+                 num_heads=8, dim_feedforward=2048, dropout=0.0):
+        super().__init__()  # required before registering parameters/submodules
+        self.track_queries = nn.Parameter(torch.randn(num_tracks, embed_dim) * 0.02)
+        self.layers = nn.ModuleList([
+            nn.TransformerDecoderLayer(
+                d_model=embed_dim, nhead=num_heads,
+                dim_feedforward=dim_feedforward, dropout=dropout,
+                activation='gelu', norm_first=True, batch_first=True,
+            ) for _ in range(num_layers)
+        ])
+        # Zero-init layer scale: ep0 refinement = identity
+        self.layer_scale = nn.Parameter(torch.zeros(num_layers))
+    def forward(self, memory):
+        # memory: fused_spatial_embeddings [B, T, D]
+        # returns: refined track tokens [B, K, T, D]
+        B, T, D = memory.shape
+        K = self.track_queries.size(0)
+        # broadcast the K queries over batch and time: [B, K, T, D]
+        q = self.track_queries[None, :, None, :].expand(B, K, T, D)
+        # run the decoder independently at each time step
+        # for simplicity, flatten T into the batch: [B*T, K, D] cross-attends to [B*T, 1, D]
+        q_flat = q.permute(0, 2, 1, 3).reshape(B * T, K, D)
+        mem_flat = memory.reshape(B * T, 1, D)
+        for i, layer in enumerate(self.layers):
+            out = layer(q_flat, mem_flat)
+            q_flat = q_flat + self.layer_scale[i] * (out - q_flat)
+        # reshape back to [B, K, T, D]
+        refined = q_flat.reshape(B, T, K, D).permute(0, 2, 1, 3).contiguous()
+        return refined
+```
+- In `SpatialBEATs`:
+```python
+if cfg.use_track_refinement:
+    self.track_refinement = TrackRefinementDecoder(
+        num_tracks=cfg.num_tracks,
+        embed_dim=cfg.encoder_embed_dim,
+        num_layers=cfg.track_refinement_layers,
+    )
+else:
+    self.track_refinement = None  # keeps the `is not None` check below well-defined
+# after encode_patches, before the heads:
+if self.track_refinement is not None:
+    track_tokens = self.track_refinement(encoder_memory)  # [B, K, T, D]
+    # the input to FrameTrackPredictionHeads changes from [B, T, D] to [B, K, T, D]
+else:
+    track_tokens = None  # heads keep the old expand logic
+```
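+- The forward of `FrameTrackPredictionHeads` takes an additional `track_tokens: Optional[Tensor]` parameter:
+  - None passed → keeps the existing "copy [B,T,D] into K slots" behavior
+  - `[B,K,T,D]` passed → the heads run on the refined tokens
+**Parameter count**: 2 layers × (self_attn + cross_attn + FFN) ≈ 2 × 7.9M ≈ 16M with the defaults above (per layer: about 2.4M for each attention block and 3.1M for the FFN at embed_dim=768, dim_feedforward=2048).
+**Zero-init check**: with `layer_scale = zeros(2)` and the residual formula `q + scale * (out - q)`, the ep0 output equals `track_queries` (static, independent of memory). That would discard the temporal information. **Fix**: use `q + scale * layer_out` instead, and initialize the track tokens from a projection of `memory`:
+A safer equivalent initialization in practice: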
+```python
+# Zero-init scheme: the layers leave the query unchanged; the query itself first absorbs the memory information
+# Idea: apply an "identity fallback" before refinement: if scale=0, output = memory broadcast to K
+def forward(self, memory):
+    B, T, D = memory.shape
+    K = ...
+    # initial track_tokens = memory broadcast to K (+ a very small query offset)
+    track_tokens = memory[:, None, :, :].expand(B, K, T, D).contiguous()
+    track_tokens = track_tokens + 0.02 * self.track_queries[None, :, None, :]
+    # refine
+    for i, layer in enumerate(self.layers):
+        ...
+        track_tokens = track_tokens + self.layer_scale[i] * delta
+    return track_tokens
+```
+With this, when `layer_scale=0` the refinement output ≈ `memory` broadcast to K (up to the small query offset), which is equivalent to v12's "copy [B,T,D] to [B,K,T,D]". Hot-start safe.
+### [C-3] Multi-scale Patch Adapter V3
+**Why**: the V2 adapter used in v12 only sees 3 time bins (30 ms) and cannot capture the early reflections (50-150 ms) of the room impulse response. V3 adds multi-scale + dilated conv.
+**Implementation**:
+- File: `spatial_modules.py`
+- New class: `SpatialDeltaPatchAdapterV3`
+  - Three branches:
+    - branch_3x3: `Conv2d(C, H, kernel=3, padding=1)` (same as V2)
+    - branch_5x5: `Conv2d(C, H, kernel=5, padding=2)` (mid scale)
+    - branch_dilated: `Conv2d(C, H, kernel=3, padding=2, dilation=2)` (long time span)
+  - fuse: `torch.cat` along channel → `Conv2d(3H, H, kernel=1)`
+  - followed by the existing V2 SE block + residual + patchify
+- cfg: `patch_adapter_version: str = "v1"` gains the option `"v3"`
+**Parameter count**: ~1M more than V2.
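For the three branches to be concatenated along the channel axis they must produce the same spatial size. The standard convolution output-size formula confirms that the listed kernel/padding/dilation triples all preserve it at stride 1; the helper names below are illustrative.

```python
def conv_out_size(n, kernel, padding, dilation=1, stride=1):
    """Output length of a convolution along one axis."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def span(kernel, dilation=1):
    """Number of input positions covered by one kernel application."""
    return dilation * (kernel - 1) + 1


branches = {
    "branch_3x3": dict(kernel=3, padding=1, dilation=1),
    "branch_5x5": dict(kernel=5, padding=2, dilation=1),
    "branch_dilated": dict(kernel=3, padding=2, dilation=2),
}
for name, cfg in branches.items():
    print(name, conv_out_size(128, **cfg), span(cfg["kernel"], cfg["dilation"]))
```

At the 10 ms hop used by the front-end, a span of 5 frames corresponds to 50 ms per layer, so reaching the upper end of the 50-150 ms range would rely on stacking or on the downstream patch and trunk context.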
+### [C-4] Log-distance head + Laplace NLL loss
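+**Why**: dist_mae=0.57 is poor. The distance distribution is long-tailed and becomes approximately Gaussian after a log transform. An added uncertainty output lets the model assign a large variance to distances it is unsure about, which reduces the loss contribution of high-bias samples.
+**Implementation**:
+- File: `spatial_modules.py`, class `FrameTrackPredictionHeads`
+- Upgrade the existing `distance_head: Linear(D, 1)` to `distance_head: Linear(D, 2)`, which outputs `[log_dist, log_var]`
+- Initialization: `bias[0] = log(1.5)` (near the v12 mean distance), `bias[1] = log(0.2^2)` (starting at var=0.04)
+- cfg: `use_log_distance_head: bool = False`, `distance_loss_type: str = "l1" / "laplace_nll"`
+- File: `spatial_loss.py`, Laplace NLL: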
+```python
+def laplace_nll_loss(pred_log_dist, pred_log_var, target_dist, mask):
+    # mask selects the positions that count toward the loss (target_dist > 0)
+    pred_dist = pred_log_dist.exp()
+    pred_b = (0.5 * pred_log_var).exp()  # Laplace scale b = exp(0.5 * log_var)
+    # |target - pred| / b + log(b): the Laplace NLL up to the constant log(2)
+    nll = (target_dist - pred_dist).abs() / pred_b + pred_log_var * 0.5
+    return (nll * mask).sum() / mask.sum().clamp(min=1)
+```
+At inference, `pred_distance = exp(pred_log_dist)`.
+**Early-training stability** (confirmed): v13c switches to Laplace NLL from ep 0, with no L1 warmup. If loss_dist turns out to be unstable early in training, revisit this.
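The trade-off that the uncertainty output creates can be seen on single scalars. The sketch below is a pure-Python version of the loss above (no masking, no batching); the target values 4.0 and 1.5 and the alternative scale 2.0 are made-up numbers for illustration.

```python
import math


def laplace_nll(pred_log_dist, pred_log_var, target_dist):
    """Scalar form of the loss above: |target - exp(log_dist)| / b + log(b)."""
    b = math.exp(0.5 * pred_log_var)
    return abs(target_dist - math.exp(pred_log_dist)) / b + 0.5 * pred_log_var


# Head initialization from the text: bias[0] = log(1.5), bias[1] = log(0.2^2)
init_log_dist, init_log_var = math.log(1.5), math.log(0.2 ** 2)
init_scale = math.exp(0.5 * init_log_var)  # initial Laplace scale b

wide_log_var = math.log(2.0 ** 2)  # hypothetical "unsure" prediction, b = 2.0

# A sample the head gets badly wrong (predicts 1.5, target 4.0):
confident = laplace_nll(init_log_dist, init_log_var, 4.0)
uncertain = laplace_nll(init_log_dist, wide_log_var, 4.0)

# A sample it gets right (target 1.5): here the wide scale is the more expensive one
right_confident = laplace_nll(init_log_dist, init_log_var, 1.5)
right_uncertain = laplace_nll(init_log_dist, wide_log_var, 1.5)

print(init_scale, confident, uncertain, right_confident, right_uncertain)
```

With the initial scale of 0.2, a large early error is divided by a small `b`, which is one plausible source of the early instability mentioned above.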
+### C experiment preset: `make_ov1_unified_v13c_config`
+Warm start: `v12_best.pt` (strict=False)
+Switches:
+```python
+cfg.model.use_track_refinement = True             # [C-2]
+cfg.model.track_refinement_layers = 2
+cfg.model.patch_adapter_version = "v3"            # [C-3]
+cfg.model.use_log_distance_head = True            # [C-4]
+cfg.loss.distance_loss_type = "laplace_nll"       # [C-4]
+cfg.train_manifest_paths = (sim_static, qa_sim, dcase_real)   # [C-1]
+cfg.train_manifest_replication = (1, 1, 6)
+cfg.learning_rate = 1e-5
+cfg.num_epochs = 20
+cfg.output_dir = "checkpoints/spatial_beats_ov1_unified_v13c_exp/03_ov123_top4"
+```
+**The loss otherwise follows the v12 defaults** (BCE + L1 → Laplace), `soft_f1_weight=0`, `activity_loss_type="bce"`.
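To see what `train_manifest_replication = (1, 1, 6)` does to the data mix, the sketch below computes the share of each manifest in one epoch. It assumes that replication simply repeats a manifest's entries N times; the clip counts are hypothetical, not the real manifest sizes.

```python
def effective_shares(manifest_sizes, replication):
    """Fraction of one epoch drawn from each manifest after replication."""
    weighted = [n * r for n, r in zip(manifest_sizes, replication)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical clip counts for (sim_static, qa_sim, dcase_real)
sizes = (40_000, 20_000, 2_000)
before = effective_shares(sizes, (1, 1, 1))
after = effective_shares(sizes, (1, 1, 6))
print([round(x, 3) for x in before], [round(x, 3) for x in after])
```

The same function can be used to judge the C-1 fallback (6× → 4×) before rerunning.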
+---
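+## 5. v14 merged experiment (follow-up)
+Start once both B and C are validated (F20 > 0.42):
+- Warm start: `max(v13b.best, v13c.best).pt`
+- All B and C flags enabled
+- LR = 5e-6 (more conservative, to keep the two sets of changes from diverging)
+- epochs = 20
+Expected F20 = 0.55~0.62.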
+---
+## 6. Expected results & risk matrix
+### B expectations
+- Aggregate F20: 0.378 → **0.45~0.52**
+- sim ov1: 0.386 → 0.50~0.55
+- real ov1: 0.140 → 0.22~0.28
+- dcase: 0.071 → 0.15~0.20
+- activity_recall: 0.13 → 0.40~0.55
+### C expectations
+- Aggregate F20: 0.378 → **0.44~0.50**
+- ov3_sim: 0.270 → 0.38~0.42
+- real ov1: 0.140 → 0.20~0.26
+- dcase: 0.071 → 0.14~0.19
+- dist_mae: 0.566 → 0.38~0.42
+### Risks
+| Risk | Likelihood | Fallback |
+|---|---|---|
+| B-2 ASL γ- too large, diverges | Medium | Run 1 ep with γ-=2 to verify, then raise to 4 |
+| B-3 gate blocks good samples | Low | Change gate_scale from 0.5 to 0.2 and rerun |
+| B-4 soft-F1 conflicts with ASL | Medium | Lower soft_f1_weight from 0.3 to 0.1 |
+| B-5 augmentation too strong, sim drops | Medium | Halve the augmentation ratio and rerun |
+| C-1 6× real replication makes sim drop | Medium | Reduce to 4× |
+| C-2 refinement does not learn | Medium | Set a manual layer_scale warmup |
+| C-3 multi-scale runs out of GPU memory | Low | Drop the dilated branch |
+| C-4 log-dist unstable early | Medium | Use L1 for the first 3 ep, then switch |
+| v14 merge diverges | High | Lower LR to 3e-6, freeze the first half of the trunk |
+---
+## 7. File checklist
+| File | Change type | B/C |
+|---|---|---|
+| `spatial_modules.py` | 3 new classes + forward branches on existing classes | B+C |
+| `spatial_loss.py` | 2 new loss functions + config branches | B+C |
+| `spatial_dataset.py` | New augmentation logic + config fields | B |
+| `spatial_beats.py` | Config fields + optional module construction + forward branches | B+C |
+| `train_spatial_beats.py` | 2 new preset factories + CLI dispatch + choices | B+C |
+| `scripts/split_unified_train_by_source.py` | New file, preprocessing script | C |
+| `run_ov1_unified_v13b.sh` | New file | B |
+| `run_ov1_unified_v13c.sh` | New file | C |
+| `docs/v13_spatial_beats_design.md` | This document | - |
+All changes to existing files are **added branches**; no existing logic is removed or modified.
+---
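+## 8. Verification steps
+After each step, verify in order:
+1. **Syntax check**: `python -c "import ast; ast.parse(open(path).read())"`
+2. **Old preset regression**: `python train_spatial_beats.py --preset ov1_unified_v12 --dry-run` (or run ep=1 up to the first batch) and confirm F20 matches v12
+3. **Zero-init equivalence of new modules**: run an ep=0 validation under `v13b_config` / `v13c_config` and confirm the metrics differ from the v12 best.pt validation by < 1%
+4. **B/C training**: run the full 15/20 ep and watch the F20 curve
+5. **Per-subset eval**: use `eval_v12_per_subset.py --preset ov1_unified_v13b --checkpoint ...` to see the gain on each subset
+6. **Test eval**: run the final test with the same script plus `--split test`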
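Step 1 can be run over all touched Python files at once. A small sketch using only the standard library; `syntax_errors` is a hypothetical helper name, and the file list is taken from the checklist in section 7 with paths assumed to be relative to the repository root.

```python
import ast
from pathlib import Path


def syntax_errors(paths):
    """Return {path: error message} for every file that fails to read or parse."""
    errors = {}
    for path in paths:
        try:
            ast.parse(Path(path).read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, OSError) as exc:
            errors[str(path)] = f"{type(exc).__name__}: {exc}"
    return errors


# Python files listed in section 7 (paths assumed relative to the repo root)
FILES = [
    "spatial_modules.py",
    "spatial_loss.py",
    "spatial_dataset.py",
    "spatial_beats.py",
    "train_spatial_beats.py",
    "scripts/split_unified_train_by_source.py",
]

if __name__ == "__main__":
    for path, err in syntax_errors(FILES).items():
        print(f"{path}: {err}")
```

An empty output means every listed file parsed; missing files are reported rather than raising.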

docs/v13d_spatial_beats_design.md ADDED

	@@ -0,0 +1,333 @@

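+# v13_D design doc: dual improvements to the loss mechanism and training details
+> Date: 2026-05-02
+> Origin: the ep3 results of the v13_B / v13_C experiments show that "loss/arch changes contribute very little by the end of cls warmup", so the problem needs to be approached again from the **training mechanism** side
+> Goal: push F20 from v12's 0.378 to **0.43+**
+> Design principle: keep the v13_B/C approach of "incremental, without breaking existing experiments"; every new flag defaults to False
+---
+## 0. Why neither v13_B nor v13_C worked
+ep3 (end of cls warmup) comparison: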
+|  | v12 ep3 | v13_B ep3 | v13_C ep3 |
+|---|---|---|---|
+| F20 | 0.3529 | 0.3565 (+0.004) | 0.3849 (+0.032) |
+| o_cls | 0.7793 | 0.7757 | 0.7779 |
+| aR | 0.120 | 0.194 | 0.090 |
+| aP | 0.866 | 0.860 | 0.859 |
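+**The oracle_class_acc of the three experiments is essentially the same** → all three runs learn the same representation, and none of the new changes reached it. The loss/head decision-level changes (ASL / gate / soft-F1) have **barely taken effect** by the end of cls warmup, because:
+1. **Zero-init modules need 3-5 ep for layer_scale to climb**, while cls warmup lasts only 3 ep
+2. **ASL γ-=4 drops the absolute loss_activity from 0.29 to 0.07**, which in effect weakens the gradient signal to the activity head
+3. **Augmentation actually lowers activity_precision** (0.95 → 0.86)
+4. **v12 itself gained only +0.025 from ep3 to ep12**, so by extrapolation the best v13_B/C result would be about 0.38~0.41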
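Point 2 can be illustrated with the negative-sample term of an asymmetric focal-style loss. This is a minimal sketch that assumes the negative term is `p^γ · (−log(1 − p))` with no probability margin; the ASL variant actually used in v13_B may differ. It compares the gradient with respect to the logit for γ-=0 (plain BCE) and γ-=4.

```python
import math


def negative_term(p, gamma_neg):
    """Loss on an inactive slot predicted active with probability p.
    gamma_neg=0 recovers plain BCE; larger values down-weight easy negatives."""
    return (p ** gamma_neg) * -math.log(1.0 - p)


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def negative_grad_wrt_logit(p, gamma_neg, eps=1e-6):
    """Central-difference estimate of d(loss)/d(logit), with p = sigmoid(logit)."""
    logit = math.log(p / (1.0 - p))
    hi = negative_term(_sigmoid(logit + eps), gamma_neg)
    lo = negative_term(_sigmoid(logit - eps), gamma_neg)
    return (hi - lo) / (2.0 * eps)


for p in (0.05, 0.25, 0.5):
    bce_grad = negative_grad_wrt_logit(p, 0.0)
    asl_grad = negative_grad_wrt_logit(p, 4.0)
    print(p, bce_grad, asl_grad)
```

For moderate probabilities the γ-=4 gradient is much smaller than the BCE gradient, which is consistent with the weaker activity-head signal described above.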
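+## 1. The v13_D angle: **the training mechanism itself**
+No architecture changes and no representation changes; only four things change:
+- **D-1**: longer cls warmup (3 ep → 8 ep) + cosine LR + 25 ep total, so the representation is learned more stably
+- **D-2**: replace BCE with a Top-K rank activity loss, directly targeting the root bottleneck of activity_recall=0.13
+- **D-5**: resume the optimizer, continuing from v12's final Adam momentum, to avoid unstable gradient directions in the first 2-3 ep
+- **D-6**: EMA weights; validate with the EMA model, which should reduce epoch-to-epoch oscillation by 1-2 points
+D-2 is the core; D-1 / D-5 / D-6 are supporting changes.
+## 2. Detailed design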
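+### D-1: longer cls warmup + cosine LR + 25 epochs
+#### Diagnosis
+v12 uses `frame_spatial_loss_warmup_epochs = 3`: ep0-2 train only cls + activity, and the spatial loss is enabled from ep3. However:
+- v12 curve: cls_acc ep0=0.65 → ep1=0.81 → ep2=0.85 → ep3=0.84 → ... → ep12=0.83
+- **ep2 already reaches 0.85 but ep3 dips slightly to 0.84**, which suggests that **enabling the spatial loss at ep3 disturbs cls at that moment**
+- cls never gets another chance to rise; it hovers around 0.83 afterwards
+- oracle_cls on FSD63 being stuck at 0.78 reflects what the class head has learned, not the ceiling of the trunk representation
+#### Changes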
+```python
+cfg.frame_spatial_loss_warmup_epochs = 8      # 3 → 8
+cfg.frame_spatial_loss_ramp_epochs = 2        # 1 → 2 (smoother ramp)
+cfg.num_epochs = 25                           # 15 → 25
+# LR: cosine schedule; the peak is reached at the end of the 3-epoch linear LR warmup
+cfg.use_cosine_lr = True                      # new flag
+cfg.cosine_lr_warmup_epochs = 3               # linear warmup to peak over the first 3 ep
+cfg.cosine_lr_min_ratio = 0.05                # decays to peak * 0.05 at the end
+```
+**Training loop changes**:
+```python
+# ep 0-2: linear warmup LR 0 → peak_lr
+# ep 3-24: cosine decay peak_lr → peak_lr * min_ratio
+# ep 0-7: spatial loss weight = warmup_scale (0.0 or 0.1)
+# ep 8-9: linear ramp to 1.0
+# ep 10-24: full spatial loss
+```
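The two schedules above can be written as per-epoch functions. This is a sketch under assumptions: the schedule is stepped once per epoch, the warmup reaches `peak_lr` exactly at ep 2, and the spatial-loss ramp takes evenly spaced values strictly between `warmup_scale` and 1.0. The function names are illustrative, not existing helpers in `train_spatial_beats.py`.

```python
import math


def lr_at_epoch(epoch, peak_lr=1.5e-5, warmup_epochs=3, total_epochs=25, min_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay to peak_lr * min_ratio."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs + 1) / (total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


def spatial_weight(epoch, warmup_epochs=8, ramp_epochs=2, warmup_scale=0.0):
    """warmup_scale during cls warmup, a linear ramp, then the full weight 1.0."""
    if epoch < warmup_epochs:
        return warmup_scale
    step = epoch - warmup_epochs + 1
    if step > ramp_epochs:
        return 1.0
    return warmup_scale + (1.0 - warmup_scale) * step / (ramp_epochs + 1)


for ep in range(25):
    print(ep, f"{lr_at_epoch(ep):.2e}", round(spatial_weight(ep), 3))
```

Printing the table before a run makes it easy to confirm that the LR is already decaying when the spatial loss is switched on at ep 8.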
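+### D-2: Top-K rank activity loss (core)
+#### Diagnosis
+The current BCE (or ASL) decides independently at each `(b, k, t)` position whether that slot is active. Problems:
+- With K=4 slots, ov1 data always has 3/4 of the slots inactive, so the classes are extremely imbalanced
+- Without a discriminative signal, the BCE optimum is `sigmoid(act_logit) ≈ 0.25` (the average prior), so the model does not dare to predict active
+- SELD evaluation actually decides by **sorting by activity logit and taking the top-K̂**
+- The training objective and the evaluation objective are misaligned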
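The "0.25 prior" claim can be verified directly: if the head cannot tell slots apart and predicts the same probability everywhere, the BCE-minimizing constant equals the fraction of active slots. A minimal sketch with a grid search; `mean_bce` is an illustrative helper, not code from `spatial_loss.py`.

```python
import math


def mean_bce(p, active_fraction):
    """Average BCE when every slot is predicted active with the same probability p."""
    return -(active_fraction * math.log(p) + (1.0 - active_fraction) * math.log(1.0 - p))


# ov1 with K=4: exactly 1 of 4 slots is active in every frame
grid = [i / 1000 for i in range(1, 1000)]
best_p = min(grid, key=lambda p: mean_bce(p, active_fraction=0.25))
print(best_p, mean_bce(best_p, 0.25))
```

A head sitting at this optimum never crosses a 0.5 decision threshold, which matches the low activity_recall reported for v12.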
+#### Design
+**Top-K rank loss**: at every frame `(b, t)`, force the activity logits of the top-`N_active_gt` slots to be **at least a margin higher than those of the other slots**, instead of regressing 0/1 independently.
+```python
+import torch.nn.functional as F
+from torch import Tensor
+def topk_rank_activity_loss(
+    activity_logit: Tensor,     # [B, K, T]
+    target_active: Tensor,      # [B, K, T], 0/1
+    valid_time: Tensor,         # [B, T]
+    margin: float = 2.0,
+    bce_weight: float = 0.1,    # cfg: topk_rank_bce_weight
+) -> Tensor:
+    """
+    Per-frame margin ranking loss:
+      For each active slot i (target=1) and each inactive slot j (target=0),
+      enforce logit[i] > logit[j] + margin.
+    Equivalent to:
+      loss = Σ_{i in A, j in I} max(0, margin + logit[j] - logit[i])
+    This gives a direct gradient that "ranks" active slots above inactive ones,
+    which aligns with the DCASE eval pipeline (take top-K̂ per frame).
+    Plus a weak binary regularizer to anchor logit magnitude:
+      + bce_weight * BCE(activity_logit, target_active)
+    """
+    target_active = target_active.float()
+    valid = valid_time.float()                 # [B, T]
+    # Loop-free formulation using broadcasting:
+    # pairwise diff and mask have shape [B, K_i, K_j, T]
+    # mask = target_active[i] * (1 - target_active[j])
+    act = target_active.unsqueeze(2)           # [B, K, 1, T]
+    ina = (1.0 - target_active).unsqueeze(1)   # [B, 1, K, T]
+    pair_mask = act * ina                      # [B, K_i, K_j, T]
+    logit_i = activity_logit.unsqueeze(2)      # [B, K, 1, T]
+    logit_j = activity_logit.unsqueeze(1)      # [B, 1, K, T]
+    diff = logit_j - logit_i + margin          # want this <= 0
+    # hinge loss, masked
+    hinge = F.relu(diff) * pair_mask           # [B, K, K, T]
+    # Frames without any (active, inactive) pair are excluded from the average
+    pair_valid = pair_mask.sum(dim=(1, 2))     # [B, T]
+    time_mask = valid * (pair_valid > 0).float()
+    loss_rank = (hinge.sum(dim=(1, 2)) * time_mask).sum() / time_mask.sum().clamp(min=1.0)
+    # Anchor term: prevents logits from drifting to ±inf; averaged over valid frames only
+    loss_bce = F.binary_cross_entropy_with_logits(
+        activity_logit, target_active, reduction='none'
+    )                                          # [B, K, T]
+    num_slots = activity_logit.shape[1]
+    loss_bce = (loss_bce * valid.unsqueeze(1)).sum() / (valid.sum() * num_slots).clamp(min=1.0)
+    return loss_rank + bce_weight * loss_bce
+```
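A hand-checkable single-frame case helps when unit-testing the ranking term. The sketch below is a plain-Python reference for the hinge sum only (no BCE anchor, no batching or time masking); `frame_rank_loss` is an illustrative name and the logits are made-up values.

```python
def frame_rank_loss(logits, targets, margin=2.0):
    """Sum of hinge terms over (active i, inactive j) pairs for one frame."""
    total = 0.0
    for i, t_i in enumerate(targets):
        for j, t_j in enumerate(targets):
            if t_i == 1 and t_j == 0:
                total += max(0.0, margin + logits[j] - logits[i])
    return total


logits = [2.0, -1.0, 0.5, -3.0]   # K=4 slots
targets = [1, 0, 0, 0]            # ov1: one active slot

# Pairs (0,1): 2 - 1 - 2 = -1 -> 0;  (0,2): 2 + 0.5 - 2 = 0.5;  (0,3): 2 - 3 - 2 = -3 -> 0
print(frame_rank_loss(logits, targets))  # 0.5
```

Only slot 2 violates the margin, so only that pair receives a gradient; once the active logit leads every inactive logit by at least `margin`, the ranking term is exactly zero.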
+#### Why this should be better than ASL
+| Property | BCE | ASL | **Top-K rank** |
+|---|---|---|---|
+| Optimization target | per-element logprob | per-element logprob (γ-weighted) | **pairwise ranking** |
+| Affected by K imbalance | Severely | Mitigated | No (only rank matters) |
+| Aligned with DCASE evaluation | ❌ | ❌ | **✓** (top-K̂) |
+| Training stability | Good | Medium (collapses if γ is too large) | **Good** (hinge + small BCE anchor) |
+| Known effect | v12 aR=0.13 | v13_B aR=0.19, aP drops | **Not yet validated**, but the mechanism is sound |
+```python
+# spatial_loss.py
+frame_activity_loss_type: str = "bce"        # + "topk_rank"
+topk_rank_margin: float = 2.0
+topk_rank_bce_weight: float = 0.1            # BCE weight of the anchor term
+```
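+### D-5: Resume optimizer (low-cost improvement)
+#### Diagnosis
+v13_B/C both used `--no-resume-optimizer`, so Adam's m/v moment buffers were rebuilt from zero. Gradient directions are unstable for the first 2-3 epochs, and more so right after switching the loss function.
+#### Changes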
+```bash
+# run_ov1_unified_v13d.sh does not pass --no-resume-optimizer
+# but sets LR to 1/3 of v12's final LR (to avoid being too aggressive after resume)
+SPATIAL_LR="${SPATIAL_LR:-7e-6}"   # v12 used 2e-5; this is 2e-5/3 ≈ 7e-6
+```
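+**Note**: the optimizer state also carries LR-schedule state (each param group stores its `lr`, and an `initial_lr` once a scheduler has been attached). Since we switch to a cosine schedule, the schedule must be reset while the Adam moments are kept. In the implementation:
+```python
+# Load optimizer_state_dict (restores the Adam moments)
+optimizer.load_state_dict(ckpt['optimizer_state_dict'])
+# But reset the LR of every param_group to the new LR (the cosine scheduler starts from here)
+for pg in optimizer.param_groups:
+    pg['lr'] = new_peak_lr
+    pg['initial_lr'] = new_peak_lr  # a saved 'initial_lr' would otherwise be reused by a new scheduler
+# Drop the old scheduler step count (avoids schedule confusion):
+# the scheduler restarts from epoch=0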
+```
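+### D-6: EMA weights
+#### Diagnosis
+v12's F20 oscillates between 0.367 and 0.378 over ep10-14, which suggests the optimizer is stuck bouncing around a saddle point or flat region. EMA is a smoothed average over roughly the most recent N weights, so it can settle in the middle of that region instead of at one end.
+#### Changes
+**New `EMAModel` helper**: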
+```python
+from typing import Dict
+import torch
+import torch.nn as nn
+from torch import Tensor
+class EMAModel:
+    def __init__(self, model: nn.Module, decay: float = 0.9995):
+        self.decay = decay
+        self.shadow: Dict[str, Tensor] = {
+            name: p.detach().clone()
+            for name, p in model.named_parameters()
+            if p.requires_grad
+        }
+    @torch.no_grad()
+    def update(self, model: nn.Module):
+        for name, p in model.named_parameters():
+            if not p.requires_grad: continue
+            self.shadow[name].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)
+    def apply_to(self, model: nn.Module) -> Dict[str, Tensor]:
+        """Swap model params with EMA shadow, returns backup for restoration."""
+        backup = {}
+        for name, p in model.named_parameters():
+            if name in self.shadow:
+                backup[name] = p.data.clone()
+                p.data.copy_(self.shadow[name])
+        return backup
+    def restore(self, model: nn.Module, backup: Dict[str, Tensor]):
+        for name, p in model.named_parameters():
+            if name in backup:
+                p.data.copy_(backup[name])
+```
+**Training loop**:
+```python
+# After every step:
+if ema_model is not None:
+    ema_model.update(model)
+# Before validation:
+if ema_model is not None:
+    backup = ema_model.apply_to(model)
+val_metrics = evaluate_one_epoch(model, ...)
+if ema_model is not None:
+    ema_model.restore(model, backup)
+# When saving best.pt, save the EMA weights rather than the raw weights
+if ema_model is not None:
+    backup = ema_model.apply_to(model)
+    torch.save({'model_state_dict': model.state_dict(), ...})
+    ema_model.restore(model, backup)
+```
+#### Config flag
+```python
+# TrainSpatialBEATsConfig
+use_ema: bool = False
+ema_decay: float = 0.9995
+ema_start_epoch: int = 3   # no EMA for the first 3 ep (avoids warmup noise)
+```
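To judge whether `ema_decay = 0.9995` matches the training length, it helps to convert the decay into a step count. The sketch below uses scalar stand-ins for weights: the half-life follows from `decay ** n = 0.5`, and the closed form is checked against the same update rule as `EMAModel.update`. Function names are illustrative.

```python
import math


def ema_half_life(decay):
    """Number of updates after which an old weight's contribution has halved."""
    return math.log(0.5) / math.log(decay)


def ema_after(shadow0, value, decay, steps):
    """Closed form of repeating shadow = decay * shadow + (1 - decay) * value."""
    return value + (shadow0 - value) * decay ** steps


def ema_loop(shadow0, value, decay, steps):
    """Same recurrence as EMAModel.update, applied to a scalar."""
    shadow = shadow0
    for _ in range(steps):
        shadow = decay * shadow + (1.0 - decay) * value
    return shadow


decay = 0.9995
print(ema_half_life(decay), 1.0 / (1.0 - decay))  # half-life and time constant, in steps
print(ema_after(0.0, 1.0, decay, 2000), ema_loop(0.0, 1.0, decay, 2000))
```

If one epoch has far fewer steps than the time constant, the shadow weights lag the live weights by more than an epoch, which is worth keeping in mind when reading early EMA validation numbers.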
+## 3. v13_D preset
+```python
+def make_ov1_unified_v13d_config(...):
+    cfg = make_ov1_unified_v12_config(...)  # start from v12
+    # D-1: longer cls warmup, cosine schedule
+    cfg.frame_spatial_loss_warmup_epochs = 8
+    cfg.frame_spatial_loss_ramp_epochs = 2
+    cfg.num_epochs = 25
+    cfg.use_cosine_lr = True
+    cfg.cosine_lr_warmup_epochs = 3
+    cfg.cosine_lr_min_ratio = 0.05
+    cfg.learning_rate = 1.5e-5   # peak LR
+    # D-2: Top-K rank activity loss
+    cfg.loss.frame_activity_loss_type = "topk_rank"
+    cfg.loss.topk_rank_margin = 2.0
+    cfg.loss.topk_rank_bce_weight = 0.1
+    # D-5: resume optimizer (handled in the run script by omitting --no-resume-optimizer)
+    # D-6: EMA
+    cfg.use_ema = True
+    cfg.ema_decay = 0.9995
+    cfg.ema_start_epoch = 3
+    cfg.output_dir = "checkpoints/spatial_beats_ov1_unified_v13d_exp/03_ov123_top4"
+    return cfg
+```
+## 4. Implementation steps
+| File | Change | D-* |
+|---|---|---|
+| `spatial_loss.py` | Add `_topk_rank_activity_loss` + config fields + branch | D-2 |
+| `train_spatial_beats.py` | Add `EMAModel` class + cosine LR scheduler + training loop hooks | D-1, D-6 |
+| `train_spatial_beats.py` | Add `make_ov1_unified_v13d_config` + CLI + choices | - |
+| `run_ov1_unified_v13d.sh` | New script, without `--no-resume-optimizer` | D-5 |
+| `docs/v13d_spatial_beats_design.md` | This document | - |
+All changes are controlled by cfg flags that default to False → v12/v13_B/v13_C are unaffected.
+## 5. Expected results
+| Metric | v12 best | v13_D expected | Mechanism |
+|---|---|---|---|
+| F20 | 0.378 | **0.42 ~ 0.46** | Top-K rank + EMA + long warmup |
+| aR | 0.126 | **0.25 ~ 0.40** | Top-K forces active slots up |
+| aP | 0.855 | **0.80 ~ 0.85** | May drop slightly (the cost of higher recall), but the hinge keeps the rank signal |
+| class_acc | 0.834 | **0.86 ~ 0.88** | Long warmup lets cls actually finish learning |
+| azi_mae | 19.7° | **18~20°** | Unchanged; not a target |
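+## 6. Risks and fallbacks
+| Risk | Likelihood | Fallback |
+|---|---|---|
+| Hinge gradient of Top-K rank saturates (margin too large) | Medium | Lower margin to 1.0 |
+| margin=2.0 makes the logit distribution blow up (pulled apart at both ends) | Low | Raise the anchor BCE weight from 0.1 to 0.3 |
+| EMA lowers F instead (EMA initialization issue at warm start) | Low | Raise ema_start_epoch to 5 |
+| Cosine LR peak too high, destroying the v12 representation | Medium | Lower peak_lr to 1e-5 (same as v12) |
+| Resumed optimizer locks v12's Adam moments into a wrong direction | Low | If the loss blows up in ep0-2, fall back to --no-resume-optimizer |
+| No overall gain (none of the changes help) | Medium | Write an ablation; run v13_D_noema / v13_D_nocos to diagnose |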
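+## 7. Relation to v13_B/C
+- **v13_D does not depend on the v13_B/C changes**; it warm-starts directly from v12 best.pt
+- v13_B/C can be seen as attempts to "change the module structure", v13_D as an attempt to "change the training mechanism"
+- If v13_D succeeds (F ≥ 0.42), **the 2-layer refinement from v13_C can be added back on top of it**, and that would be the real v14
+## 8. Verification steps
+1. Syntax: `python -c "import ast; ast.parse(open('spatial_loss.py').read())"`
+2. Top-K rank loss unit test: verify against a hand calculation for a known activity_logit + target
+3. Model construction + hot start from v12 best.pt: confirm missing=0, unexpected=0
+4. Single-batch forward + backward: confirm the loss is a scalar and the gradients contain no NaN
+5. EMA unit test: shadow weights are correct after update
+6. 1-epoch dry run: inspect the cosine LR curve and how the EMA shadow changes with steps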
+---
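+## Appendix: why no further changes (D-7 per-class expert, etc.)
+Diagnosis: the main reason v13_B/C failed is not "too few changes" but "changes aimed at the wrong target". v13_D touches only the loss mechanism + training schedule, i.e. the **minimum necessary changes**:
+- D-2 Top-K directly targets the activity_recall bottleneck
+- D-1 longer warmup gives the representation more time to learn
+- D-6 EMA reduces late-training oscillation
+- D-5 resume optimizer keeps the optimizer state (Adam moments) continuous across the warm start
+Adding more changes (per-class expert, class-conditional gate v2) would repeat v13_B's mistake: changing the head without reaching the bottleneck, and changing too many things at once to ablate.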