Spatial-BEATs Token Interface Clarification
1. Why 25 * 4 = 100 Is Not the Right Final Token Count
The previous discussion mixed two different concepts:
- internal multi-source slot capacity
- final LLM-visible token rate
These should be separated.
For the current design:
- 2.5 Hz means the final spatial token rate visible to the LLM
- for a 10 s clip, this means: T_s = 10 * 2.5 = 25
So the correct final token count is:
25 spatial tokens
not:
25 * 4 = 100
The 4 only refers to:
- internal source slots per time step
It is an internal modeling capacity, not an external token-rate multiplier.
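The distinction can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name spatial_token_count is hypothetical, and rounding for clip lengths that do not give a whole number of steps is an assumption, since the design only states the 10 s case.

```python
def spatial_token_count(duration_s: float, token_rate_hz: float = 2.5) -> int:
    """Number of LLM-visible spatial tokens for a clip of the given length."""
    # K (source slots per step) is deliberately not a parameter here:
    # it does not change how many tokens the LLM sees.
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(duration_s * token_rate_hz)


T_s = spatial_token_count(10.0)   # 10 * 2.5 = 25
K = 4                             # internal source slots per time step
internal_slots = T_s * K          # 100 slot latents, internal to the model only
print(T_s, internal_slots)        # prints: 25 100
```

The 100 here counts internal slot latents; the LLM-visible sequence length is still T_s = 25.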
2. Corrected Design
The corrected interface is:
FOA waveform
-> FOA features
-> BEATs trunk
-> temporal memory at 2.5 Hz [B, T_s, D]
-> per-step source slots (K=4) [B, T_s, K, D]
-> objectness-weighted slot pooling [B, T_s, D]
-> MLP projector
-> final LLM spatial tokens [B, T_s, d_llm]
With the default setup:
- T_s = 25
- K = 4
So:
- internal representation: [B, 25, 4, D]
- final LLM tokens: [B, 25, d_llm]
3. What objectness-weighted pooling + MLP projector Means
At each time step t, the model first predicts K=4 source slots:
z_{t,1}, z_{t,2}, z_{t,3}, z_{t,4}
Each slot also has an objectness score:
o_{t,1}, o_{t,2}, o_{t,3}, o_{t,4}
These objectness scores are normalized across the K slots:
alpha_{t,k} = softmax(o_{t,:})_k
Then the slot latents are pooled:
h_t = sum_{k=1..K} alpha_{t,k} * z_{t,k}
This produces one pooled latent for this time step:
h_t
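The pooling step for a single time step can be sketched in plain Python as follows. This is a minimal illustration of the formulas above, not the actual implementation: the helper names softmax and pool_slots are hypothetical, and the toy values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_slots(slots, objectness):
    """Objectness-weighted pooling: h_t = sum_k alpha_{t,k} * z_{t,k}.

    slots:      K slot latents z_{t,k}, each a list of D floats
    objectness: K raw objectness scores o_{t,k}
    """
    alpha = softmax(objectness)          # alpha_{t,k}, sums to 1 over k
    dim = len(slots[0])
    return [sum(a * z[i] for a, z in zip(alpha, slots)) for i in range(dim)]


# Toy example with K=4 slots and D=2.
z_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
o_t = [0.0, 0.0, 0.0, 0.0]       # equal scores give uniform weights of 0.25
h_t = pool_slots(z_t, o_t)
print(h_t)                       # prints: [0.5, 0.5]
```

Because the weights come from a softmax, they always sum to 1, so h_t is a convex combination of the K slot latents; a slot with a much higher objectness score dominates the result.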
The same idea is used to pool the structured slot-level predictions:
c_t = sum_k alpha_{t,k} * c_{t,k}
u_t = sum_k alpha_{t,k} * u_{t,k}
d_t = sum_k alpha_{t,k} * d_{t,k}
o_t = sum_k alpha_{t,k} * e_{obj,t,k}
where:
- c_{t,k} is the slot-level class-context embedding
- u_{t,k} is the slot-level direction embedding/vector
- d_{t,k} is the slot-level distance embedding
- e_{obj,t,k} is the slot-level confidence embedding
Then the final per-step spatial token is formed as:
s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t])
where:
- Proj is an MLP projector into the LLM hidden space
So the final sequence is:
S = [s_1, s_2, ..., s_{T_s}]
For a 10 s clip:
- S has 25 tokens
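A rough sketch of the concatenate-then-project step is shown below. The design only says that Proj is an MLP, so the two-layer ReLU structure, the random weights, and every dimension used here are assumptions chosen to keep the example small; the helper names are hypothetical.

```python
import random


def linear(x, weight, bias):
    """y = W x + b, with weight stored as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def make_linear(n_in, n_out, rng):
    """Randomly initialised (weight, bias) pair standing in for trained parameters."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def project(h, c, u, d, o, layer1, layer2):
    """s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t]) with an assumed two-layer ReLU MLP."""
    x = h + c + u + d + o                       # list concatenation = [h ; c ; u ; d ; o]
    hidden = [max(0.0, v) for v in linear(x, *layer1)]
    return linear(hidden, *layer2)


rng = random.Random(0)
# Toy sizes; none of these values come from the design.
D, d_c, d_u, d_d, d_o, d_hidden, d_llm = 8, 4, 3, 2, 2, 16, 12
layer1 = make_linear(D + d_c + d_u + d_d + d_o, d_hidden, rng)
layer2 = make_linear(d_hidden, d_llm, rng)

T_s = 25                                        # 10 s clip at 2.5 Hz
S = []
for t in range(T_s):
    # Random stand-ins for the pooled per-step vectors h_t, c_t, u_t, d_t, o_t.
    h, c, u, d, o = ([rng.gauss(0, 1) for _ in range(n)] for n in (D, d_c, d_u, d_d, d_o))
    S.append(project(h, c, u, d, o, layer1, layer2))

print(len(S), len(S[0]))                        # prints: 25 12
```

The sequence length of S depends only on T_s; K never appears in this step because the slots have already been pooled.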
4. Why This Is Better
This corrected design keeps both goals:
- multi-source capacity inside the model
- fixed low-rate spatial tokens outside the model
Advantages:
- the model can still represent up to 4 sources at each time step
- the final LLM token rate stays fixed at 2.5 Hz
- the external token interface is simpler and easier to scale
- it avoids unnecessarily inflating the LLM token count by a factor of K
5. Corrected Tensor Shapes
Recommended tensor shapes:
- temporal_memory: [B, T_s, D]
- slot_tokens: [B, T_s, K, D]
- pred_obj: [B, T_s, K]
- pred_azi_logits: [B, T_s, K, 360]
- pred_ele_logits: [B, T_s, K, 180]
- pred_dist: [B, T_s, K, 1]
- pred_class_logits: [B, T_s, K, C_cls]
- pooled_spatial_latents: [B, T_s, D]
- llm_spatial_tokens: [B, T_s, d_llm]
For the default setup:
- T_s = 25
- K = 4
Therefore:
- internal slots: [B, 25, 4, D]
- final LLM tokens: [B, 25, d_llm]
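The recommended shapes can be written down as a small lookup that is easy to assert against in unit tests. This is only a sketch: the function name spatial_shapes is hypothetical, and the values used for B, D, d_llm, and C_cls are placeholders, since the design does not fix them.

```python
def spatial_shapes(B, T_s, K, D, d_llm, C_cls):
    """Return the recommended shape of each tensor as a tuple."""
    return {
        "temporal_memory": (B, T_s, D),
        "slot_tokens": (B, T_s, K, D),
        "pred_obj": (B, T_s, K),
        "pred_azi_logits": (B, T_s, K, 360),
        "pred_ele_logits": (B, T_s, K, 180),
        "pred_dist": (B, T_s, K, 1),
        "pred_class_logits": (B, T_s, K, C_cls),
        "pooled_spatial_latents": (B, T_s, D),
        "llm_spatial_tokens": (B, T_s, d_llm),
    }


# T_s=25 and K=4 are the defaults from the design; the rest are placeholders.
shapes = spatial_shapes(B=2, T_s=25, K=4, D=768, d_llm=4096, C_cls=10)

# Only slot-level tensors carry the K axis; the LLM-facing tensor does not.
print(shapes["slot_tokens"])          # prints: (2, 25, 4, 768)
print(shapes["llm_spatial_tokens"])   # prints: (2, 25, 4096)
```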
6. What Should Be Updated in the Main Design
The main design should be interpreted as:
- internal 4 slots
- external fixed 25 spatial tokens for a 10 s clip
So any previous statement implying:
2.5 Hz * 4 = 10 tokens / second
should be considered obsolete for the final LLM interface.
The correct statement is:
- final LLM-visible spatial tokens are 2.5 tokens / second
and:
- K=4 is only internal source-slot capacity.