# Spatial-BEATs Token Interface Clarification

## 1. Why `25 * 4 = 100` Is Not the Right Final Token Count

The previous discussion mixed two different concepts:

- internal multi-source slot capacity
- final LLM-visible token rate

These should be separated.

For the current design:

- `2.5 Hz` means the **final spatial token rate visible to the LLM**
- for a `10 s` clip, this means:
  - `T_s = 10 * 2.5 = 25`

So the correct final token count is:

- `25 spatial tokens`

not:

- `25 * 4 = 100`

The `4` only refers to:

- internal source slots per time step

It is an internal modeling capacity, not an external token-rate multiplier.
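
The arithmetic above can be sketched in a few lines. The helper name `spatial_token_count` is hypothetical, and rounding for clip lengths that are not a multiple of `0.4 s` is an assumption, since this note only covers the `10 s` case.

```python
def spatial_token_count(duration_s: float, rate_hz: float = 2.5) -> int:
    """Number of LLM-visible spatial tokens for one clip.

    The slot count K is deliberately not a parameter: slots are pooled
    before the LLM interface, so they never multiply the token count.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(duration_s * rate_hz)


K = 4                            # internal source slots per time step
T_s = spatial_token_count(10.0)  # 10 s clip at 2.5 Hz

print(T_s)      # 25 tokens visible to the LLM
print(T_s * K)  # 100 internal slot latents, never exposed as tokens
```

The `100` figure is still meaningful, but only as the number of internal slot latents per clip.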

## 2. Corrected Design

The corrected interface is:

```text
FOA waveform
-> FOA features
-> BEATs trunk
-> temporal memory at 2.5 Hz [B, T_s, D]
-> per-step source slots (K=4) [B, T_s, K, D]
-> objectness-weighted slot pooling [B, T_s, D]
-> MLP projector
-> final LLM spatial tokens [B, T_s, d_llm]
```

With the default setup:

- `T_s = 25`
- `K = 4`

So:

- internal representation: `[B, 25, 4, D]`
- final LLM tokens: `[B, 25, d_llm]`

## 3. What `objectness-weighted pooling + MLP projector` Means

At each time step `t`, the model first predicts `K=4` source slots:

- `z_{t,1}, z_{t,2}, z_{t,3}, z_{t,4}`

Each slot also has an objectness score:

- `o_{t,1}, o_{t,2}, o_{t,3}, o_{t,4}`

These objectness scores are normalized across the `K` slots:

```text
alpha_{t,k} = softmax(o_{t,:})_k
```

Then the slot latents are pooled:

```text
h_t = sum_{k=1..K} alpha_{t,k} * z_{t,k}
```

This produces one pooled latent for this time step:

- `h_t`
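
The two formulas above can be illustrated for a single time step with plain Python lists. This is a minimal sketch, not the model code: the function names `softmax` and `pool_slots` and the toy values are assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_slots(slots, objectness):
    """Collapse K slot latents z_{t,k} into one pooled latent h_t.

    slots:      K vectors of length D for one time step
    objectness: K raw objectness scores o_{t,k}
    """
    alpha = softmax(objectness)  # alpha_{t,k}, sums to 1 over k
    dim = len(slots[0])
    return [sum(a * z[i] for a, z in zip(alpha, slots)) for i in range(dim)]


# Toy time step with K=4 slots and D=3 (values are made up).
z_t = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
o_t = [4.0, 0.0, 0.0, 0.0]  # slot 1 is far more "object-like" than the rest

h_t = pool_slots(z_t, o_t)
print(len(h_t))  # 3: the K axis is gone, only D remains
```

Because the softmax weights always sum to 1, `h_t` is a convex combination of the slot latents; with equal scores it reduces to their plain mean.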

The same idea is used to pool the structured slot-level predictions:

```text
c_t = sum_k alpha_{t,k} * c_{t,k}
u_t = sum_k alpha_{t,k} * u_{t,k}
d_t = sum_k alpha_{t,k} * d_{t,k}
o_t = sum_k alpha_{t,k} * e_{obj,t,k}
```

where:

- `c_{t,k}` is the slot-level class-context embedding
- `u_{t,k}` is the slot-level direction embedding/vector
- `d_{t,k}` is the slot-level distance embedding
- `e_{obj,t,k}` is the slot-level confidence embedding

Note that `o_t` (one index) is the pooled confidence embedding, not the scalar objectness score `o_{t,k}` used to compute `alpha_{t,k}`.

Then the final per-step spatial token is formed as:

```text
s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t])
```

where:

- `Proj` is an MLP projector into the LLM hidden space
- `[ ; ]` denotes concatenation along the feature dimension

So the final sequence is:

```text
S = [s_1, s_2, ..., s_{T_s}]
```

For a `10 s` clip:

- `S` has `25` tokens
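
The concatenate-then-project step can be sketched as follows, taking the pooled vectors as given. This note only says "MLP projector", so the two-layer shape, the GELU activation, the random initialization, and every width in `dims` are assumptions chosen to keep the example small.

```python
import math
import random


def linear(x, weight, bias):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x):
    """Tanh approximation of GELU (the activation choice is an assumption)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def init_linear(rng, n_in, n_out):
    """Random weights and zero bias, only so the example can run."""
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def make_projector(rng, n_in, n_hidden, d_llm):
    """Hypothetical two-layer MLP standing in for Proj."""
    w1, b1 = init_linear(rng, n_in, n_hidden)
    w2, b2 = init_linear(rng, n_hidden, d_llm)

    def proj(x):
        hidden = [gelu(v) for v in linear(x, w1, b1)]
        return linear(hidden, w2, b2)

    return proj


def spatial_token(proj, h_t, c_t, u_t, d_t, o_t):
    """s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t])."""
    return proj(h_t + c_t + u_t + d_t + o_t)  # list '+' concatenates


rng = random.Random(0)
T_s, d_llm = 25, 16                              # d_llm kept tiny for the example
dims = {"h": 8, "c": 4, "u": 3, "d": 2, "o": 2}  # hypothetical feature widths
proj = make_projector(rng, sum(dims.values()), 32, d_llm)

S = []
for _ in range(T_s):
    # Random stand-ins for the pooled vectors of one time step.
    p = {k: [rng.gauss(0.0, 1.0) for _ in range(n)] for k, n in dims.items()}
    S.append(spatial_token(proj, p["h"], p["c"], p["u"], p["d"], p["o"]))

print(len(S), len(S[0]))  # 25 16: one token per time step, each of width d_llm
```

The slot count `K` does not appear anywhere in this step, which is why the sequence length stays at `T_s`.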

## 4. Why This Is Better

This corrected design meets both goals:

1. multi-source capacity inside the model
2. fixed low-rate spatial tokens outside the model

Advantages:

- the model can still represent up to `4` sources at each time step
- the final LLM token rate stays fixed at `2.5 Hz` (`25` tokens for a `10 s` clip)
- the external token interface is simpler and easier to scale
- it avoids inflating the LLM token count by a factor of `K`

## 5. Corrected Tensor Shapes

Recommended tensor shapes:

- `temporal_memory`: `[B, T_s, D]`
- `slot_tokens`: `[B, T_s, K, D]`
- `pred_obj`: `[B, T_s, K]`
- `pred_azi_logits`: `[B, T_s, K, 360]`
- `pred_ele_logits`: `[B, T_s, K, 180]`
- `pred_dist`: `[B, T_s, K, 1]`
- `pred_class_logits`: `[B, T_s, K, C_cls]`
- `pooled_spatial_latents`: `[B, T_s, D]`
- `llm_spatial_tokens`: `[B, T_s, d_llm]`

For the default setup:

- `T_s = 25`
- `K = 4`

Therefore:

- internal slots: `[B, 25, 4, D]`
- final LLM tokens: `[B, 25, d_llm]`
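
One way to keep an implementation aligned with this list is a small shape check. The sketch below is only an illustration: `SPEC` mirrors the names above, while `expected_shapes`, `check_shapes`, and the concrete sizes for `B`, `D`, `C_cls`, and `d_llm` are hypothetical placeholders.

```python
# Symbolic shape spec copied from the list above; ints are fixed sizes.
SPEC = {
    "temporal_memory": ("B", "T_s", "D"),
    "slot_tokens": ("B", "T_s", "K", "D"),
    "pred_obj": ("B", "T_s", "K"),
    "pred_azi_logits": ("B", "T_s", "K", 360),
    "pred_ele_logits": ("B", "T_s", "K", 180),
    "pred_dist": ("B", "T_s", "K", 1),
    "pred_class_logits": ("B", "T_s", "K", "C_cls"),
    "pooled_spatial_latents": ("B", "T_s", "D"),
    "llm_spatial_tokens": ("B", "T_s", "d_llm"),
}


def expected_shapes(sizes):
    """Resolve symbolic dimension names to concrete integers."""
    return {
        name: tuple(sizes[d] if isinstance(d, str) else d for d in dims)
        for name, dims in SPEC.items()
    }


def check_shapes(actual, sizes):
    """Return mismatch messages; an empty list means every shape matches."""
    errors = []
    for name, want in expected_shapes(sizes).items():
        got = actual.get(name)
        if got != want:
            errors.append(f"{name}: expected {want}, got {got}")
    return errors


# T_s and K follow the default setup; the other sizes are placeholders.
sizes = {"B": 2, "T_s": 25, "K": 4, "D": 768, "C_cls": 13, "d_llm": 4096}
shapes = expected_shapes(sizes)
print(shapes["slot_tokens"])         # (2, 25, 4, 768)
print(shapes["llm_spatial_tokens"])  # (2, 25, 4096)

# A K-inflated LLM interface (25 * 4 = 100 tokens) is reported as a mismatch.
bad = dict(shapes, llm_spatial_tokens=(2, 100, 4096))
print(check_shapes(bad, sizes))
```

With NumPy arrays or PyTorch tensors, `actual` could be built from `tuple(x.shape)` for each named output.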

## 6. What Should Be Updated in the Main Design

The main design should be interpreted as:

- internal `4` slots
- external fixed `25` spatial tokens for a `10 s` clip

So any previous statement implying:

- `2.5 Hz * 4 = 10 tokens / second`

should be considered obsolete for the final LLM interface.

The correct statement is:

- final LLM-visible spatial tokens are `2.5 tokens / second`

and:

- `K=4` is only internal source-slot capacity.