# Spatial-BEATs Token Interface Clarification

## 1. Why `25 * 4 = 100` Is Not the Right Final Token Count

The previous discussion mixed two different concepts:

- internal multi-source slot capacity
- final LLM-visible token rate

These should be separated.

For the current design:

- `2.5 Hz` means the **final spatial token rate visible to the LLM**
- for a `10 s` clip, this means:
  - `T_s = 10 * 2.5 = 25`

So the correct final token count is:

- `25 spatial tokens`

not:

- `25 * 4 = 100`

The `4` only refers to:

- internal source slots per time step

It is an internal modeling capacity, not an external token-rate multiplier.

## 2. Corrected Design

The corrected interface is:

```text
FOA waveform
-> FOA features
-> BEATs trunk
-> temporal memory at 2.5 Hz          [B, T_s, D]
-> per-step source slots (K=4)        [B, T_s, K, D]
-> objectness-weighted slot pooling   [B, T_s, D]
-> MLP projector
-> final LLM spatial tokens           [B, T_s, d_llm]
```

With the default setup:

- `T_s = 25`
- `K = 4`

So:

- internal representation: `[B, 25, 4, D]`
- final LLM tokens: `[B, 25, d_llm]`

## 3. What `objectness-weighted pooling + MLP projector` Means

At each time step `t`, the model first predicts `K=4` source slots:

- `z_{t,1}, z_{t,2}, z_{t,3}, z_{t,4}`

Each slot also has an objectness score:

- `o_{t,1}, o_{t,2}, o_{t,3}, o_{t,4}`

These objectness scores are normalized across the `K` slots:

```text
alpha_{t,k} = softmax(o_{t,:})_k
```

Then the slot latents are pooled:

```text
h_t = sum_{k=1..K} alpha_{t,k} * z_{t,k}
```

This produces one pooled latent for this time step:

- `h_t`

The same idea is used to pool the structured slot-level predictions:

```text
c_t = sum_k alpha_{t,k} * c_{t,k}
u_t = sum_k alpha_{t,k} * u_{t,k}
d_t = sum_k alpha_{t,k} * d_{t,k}
o_t = sum_k alpha_{t,k} * e_{obj,t,k}
```

where:

- `c_{t,k}` is the slot-level class-context embedding
- `u_{t,k}` is the slot-level direction embedding/vector
- `d_{t,k}` is the slot-level distance embedding
- `e_{obj,t,k}` is the slot-level confidence embedding

Then the final per-step spatial token is formed as:

```text
s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t])
```

where:

- `Proj` is an MLP projector into the LLM hidden space

So the final sequence is:

```text
S = [s_1, s_2, ..., s_{T_s}]
```

For a `10 s` clip:

- `S` has `25` tokens

## 4. Why This Is Better

This corrected design keeps both goals:

1. multi-source capacity inside the model
2. fixed low-rate spatial tokens outside the model

Advantages:

- the model can still represent up to `4` sources at each time step
- the final LLM token count stays fixed at `2.5 Hz`
- the external token interface is simpler and easier to scale
- it avoids unnecessarily inflating the LLM token count by `K`

## 5. Corrected Tensor Shapes

Recommended tensor shapes:

- `temporal_memory`: `[B, T_s, D]`
- `slot_tokens`: `[B, T_s, K, D]`
- `pred_obj`: `[B, T_s, K]`
- `pred_azi_logits`: `[B, T_s, K, 360]`
- `pred_ele_logits`: `[B, T_s, K, 180]`
- `pred_dist`: `[B, T_s, K, 1]`
- `pred_class_logits`: `[B, T_s, K, C_cls]`
- `pooled_spatial_latents`: `[B, T_s, D]`
- `llm_spatial_tokens`: `[B, T_s, d_llm]`

For the default setup:

- `T_s = 25`
- `K = 4`

Therefore:

- internal slots: `[B, 25, 4, D]`
- final LLM tokens: `[B, 25, d_llm]`

## 6. What Should Be Updated in the Main Design

The main design should be interpreted as:

- internal `4` slots
- external fixed `25` spatial tokens for a `10 s` clip

So any previous statement implying:

- `2.5 Hz * 4 = 10 tokens / second`

should be considered obsolete for the final LLM interface.

The correct statement is:

- final LLM-visible spatial tokens are `2.5 tokens / second`

and:

- `K=4` is only internal source-slot capacity.
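
As a shape-level illustration of the interface described above, the following NumPy sketch runs objectness-weighted pooling over `K=4` slots followed by an MLP projector, and checks that a `10 s` clip yields `25` LLM-visible tokens rather than `100`. It is a minimal sketch under stated assumptions, not the actual implementation: the helper names (`pool_slots`, `mlp_projector`, `num_spatial_tokens`), the batch size, the widths chosen for `D`, `d_llm`, and the structured embeddings, the two-layer ReLU form of `Proj`, and the random inputs are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_slots(slots, objectness):
    """Objectness-weighted pooling over the slot axis.

    slots:      [B, T_s, K, F]  (any feature width F)
    objectness: [B, T_s, K]     (unnormalized scores o_{t,k})
    returns:    [B, T_s, F]
    """
    alpha = softmax(objectness, axis=-1)           # [B, T_s, K]
    return (alpha[..., None] * slots).sum(axis=2)  # sum over K

def mlp_projector(x, w1, b1, w2, b2):
    # Assumed form of Proj: two linear layers with a ReLU in between.
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def num_spatial_tokens(clip_seconds, rate_hz=2.5):
    # Final LLM-visible token count; K does not enter here.
    return int(round(clip_seconds * rate_hz))

rng = np.random.default_rng(0)
B, K, D, d_llm = 2, 4, 16, 32        # B, D, d_llm are illustrative values
T_s = num_spatial_tokens(10.0)       # 10 s * 2.5 Hz = 25

z = rng.normal(size=(B, T_s, K, D))  # slot latents z_{t,k}
o = rng.normal(size=(B, T_s, K))     # objectness scores o_{t,k}
# Structured slot-level embeddings; their widths are hypothetical.
c = rng.normal(size=(B, T_s, K, 8))      # class-context c_{t,k}
u = rng.normal(size=(B, T_s, K, 3))      # direction u_{t,k}
d = rng.normal(size=(B, T_s, K, 4))      # distance d_{t,k}
e_obj = rng.normal(size=(B, T_s, K, 4))  # confidence e_{obj,t,k}

# Pool every slot-level quantity with the same weights alpha_{t,k},
# then concatenate: [h_t ; c_t ; u_t ; d_t ; o_t].
pooled = [pool_slots(x, o) for x in (z, c, u, d, e_obj)]
fused = np.concatenate(pooled, axis=-1)  # [B, T_s, 16+8+3+4+4]

d_in, d_hidden = fused.shape[-1], 64
w1 = rng.normal(size=(d_in, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
w2 = rng.normal(size=(d_hidden, d_llm)) * 0.1
b2 = np.zeros(d_llm)

llm_spatial_tokens = mlp_projector(fused, w1, b1, w2, b2)
print(z.shape)                   # (2, 25, 4, 16)  internal slots
print(llm_spatial_tokens.shape)  # (2, 25, 32)     final LLM tokens
```

The slot axis `K` appears only in the internal tensors and is summed out inside `pool_slots`, so the sequence length handed to the LLM depends on the clip duration and the `2.5 Hz` rate alone.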