# Spatial-BEATs Token Interface Clarification
## 1. Why `25 * 4 = 100` Is Not the Right Final Token Count
The previous discussion mixed two different concepts:
- internal multi-source slot capacity
- final LLM-visible token rate
These should be separated.
For the current design:
- `2.5 Hz` means the **final spatial token rate visible to the LLM**
- for a `10 s` clip, this gives `T_s = 10 * 2.5 = 25`
So the correct final token count is:
- `25 spatial tokens`
not:
- `25 * 4 = 100`
The `4` only refers to:
- internal source slots per time step
It is an internal modeling capacity, not an external token-rate multiplier.
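The arithmetic above can be written as a small sketch. The helper name `spatial_token_count` is hypothetical, and flooring for clip lengths that do not align with the token rate is an assumption; the note itself only covers the `10 s` case.

```python
import math

def spatial_token_count(duration_s: float, rate_hz: float = 2.5) -> int:
    """Number of LLM-visible spatial tokens for a clip of the given length."""
    # Flooring is an assumption for durations that are not a multiple of 1 / rate_hz.
    return math.floor(duration_s * rate_hz)

K = 4                               # internal source slots per time step
T_s = spatial_token_count(10.0)     # LLM-visible tokens for a 10 s clip
internal_slots = T_s * K            # slot latents held inside the model only

print(T_s, internal_slots)          # 25 100
```

Only `T_s` counts against the LLM context; `internal_slots` describes capacity inside the model.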
## 2. Corrected Design
The corrected interface is:
```text
FOA waveform
-> FOA features
-> BEATs trunk
-> temporal memory at 2.5 Hz [B, T_s, D]
-> per-step source slots (K=4) [B, T_s, K, D]
-> objectness-weighted slot pooling [B, T_s, D]
-> MLP projector
-> final LLM spatial tokens [B, T_s, d_llm]
```
With the default setup:
- `T_s = 25`
- `K = 4`
So:
- internal representation: `[B, 25, 4, D]`
- final LLM tokens: `[B, 25, d_llm]`
## 3. What `objectness-weighted pooling + MLP projector` Means
At each time step `t`, the model first predicts `K=4` source slots:
- `z_{t,1}, z_{t,2}, z_{t,3}, z_{t,4}`
Each slot also has an objectness score:
- `o_{t,1}, o_{t,2}, o_{t,3}, o_{t,4}`
These objectness scores are normalized across the `K` slots:
```text
alpha_{t,k} = softmax(o_{t,:})_k
```
Then the slot latents are pooled:
```text
h_t = sum_{k=1..K} alpha_{t,k} * z_{t,k}
```
This produces one pooled latent for this time step:
- `h_t`
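As a minimal NumPy sketch of the softmax normalization and pooling above: the function name `objectness_weighted_pool`, the random inputs, and the small width `D = 8` are illustrative assumptions, not the project's actual code.

```python
import numpy as np

def objectness_weighted_pool(slots, objectness):
    """Pool K slots per time step.

    slots:      [B, T_s, K, D] slot latents z_{t,k}
    objectness: [B, T_s, K]    raw objectness scores o_{t,k}
    returns:    pooled latents [B, T_s, D] and weights alpha [B, T_s, K]
    """
    # Softmax over the K slot axis, shifted by the max for numerical stability.
    shifted = objectness - objectness.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    alpha = weights / weights.sum(axis=-1, keepdims=True)
    # h_t = sum_k alpha_{t,k} * z_{t,k}
    pooled = (alpha[..., None] * slots).sum(axis=2)
    return pooled, alpha

rng = np.random.default_rng(0)
B, T_s, K, D = 2, 25, 4, 8          # D = 8 is only for illustration
z = rng.standard_normal((B, T_s, K, D))
o = rng.standard_normal((B, T_s, K))

h, alpha = objectness_weighted_pool(z, o)
print(h.shape)                      # (2, 25, 8)
```

When all `K` scores at a step are equal, the weights become uniform and `h_t` reduces to the plain mean of the slots.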
The same weights `alpha_{t,k}` are used to pool the structured slot-level predictions:
```text
c_t = sum_k alpha_{t,k} * c_{t,k}
u_t = sum_k alpha_{t,k} * u_{t,k}
d_t = sum_k alpha_{t,k} * d_{t,k}
o_t = sum_k alpha_{t,k} * e_{obj,t,k}
```
where:
- `c_{t,k}` is the slot-level class-context embedding
- `u_{t,k}` is the slot-level direction embedding/vector
- `d_{t,k}` is the slot-level distance embedding
- `e_{obj,t,k}` is the slot-level confidence embedding, so the pooled `o_t` is an embedding and is distinct from the scalar objectness scores `o_{t,k}`
Then the final per-step spatial token is formed as:
```text
s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t])
```
where:
- `Proj` is an MLP projector into the LLM hidden space
So the final sequence is:
```text
S = [s_1, s_2, ..., s_{T_s}]
```
For a `10 s` clip:
- `S` has `25` tokens
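The concatenate-then-project step can be sketched as follows. The note only says that `Proj` is an MLP, so the two-layer ReLU form, the embedding widths, the hidden size, and the random weights here are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T_s = 2, 25
# Illustrative widths (assumptions): latent, class, direction, distance, confidence.
D, D_c, D_u, D_d, D_o = 8, 6, 3, 4, 4
d_hidden, d_llm = 32, 16

# Pooled per-step inputs, standing in for the outputs of the slot pooling.
h = rng.standard_normal((B, T_s, D))
c = rng.standard_normal((B, T_s, D_c))
u = rng.standard_normal((B, T_s, D_u))
d = rng.standard_normal((B, T_s, D_d))
o = rng.standard_normal((B, T_s, D_o))

# Randomly initialized two-layer MLP (assumed architecture).
d_in = D + D_c + D_u + D_d + D_o
W1 = rng.standard_normal((d_in, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((d_hidden, d_llm)) * 0.1
b2 = np.zeros(d_llm)

def project(h, c, u, d, o):
    """s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t]), applied independently per time step."""
    x = np.concatenate([h, c, u, d, o], axis=-1)   # [B, T_s, d_in]
    hidden = np.maximum(x @ W1 + b1, 0.0)          # ReLU
    return hidden @ W2 + b2                        # [B, T_s, d_llm]

S = project(h, c, u, d, o)
print(S.shape)                                     # (2, 25, 16)
```

Because the projector acts on each time step separately, the sequence length `T_s` passes through unchanged and only the feature width changes.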
## 4. Why This Is Better
This corrected design keeps both goals:
1. multi-source capacity inside the model
2. fixed low-rate spatial tokens outside the model
Advantages:
- the model can still represent up to `4` sources at each time step
- the final LLM token rate stays fixed at `2.5 Hz` (`25` tokens for a `10 s` clip)
- the external token interface is simpler and easier to scale
- it avoids unnecessarily inflating the LLM token count by `K`
## 5. Corrected Tensor Shapes
Recommended tensor shapes:
- `temporal_memory`: `[B, T_s, D]`
- `slot_tokens`: `[B, T_s, K, D]`
- `pred_obj`: `[B, T_s, K]`
- `pred_azi_logits`: `[B, T_s, K, 360]`
- `pred_ele_logits`: `[B, T_s, K, 180]`
- `pred_dist`: `[B, T_s, K, 1]`
- `pred_class_logits`: `[B, T_s, K, C_cls]`
- `pooled_spatial_latents`: `[B, T_s, D]`
- `llm_spatial_tokens`: `[B, T_s, d_llm]`
For the default setup:
- `T_s = 25`
- `K = 4`
Therefore:
- internal slots: `[B, 25, 4, D]`
- final LLM tokens: `[B, 25, d_llm]`
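One way to keep an implementation aligned with these shapes is a small shape check, sketched below. The helper names `expected_shapes` and `check_shapes` are hypothetical, and the values used for `D`, `C_cls`, and `d_llm` are placeholders rather than the project's real sizes.

```python
import numpy as np

def expected_shapes(B, T_s, K, D, C_cls, d_llm):
    """Shape spec for the tensors listed in this section."""
    return {
        "temporal_memory": (B, T_s, D),
        "slot_tokens": (B, T_s, K, D),
        "pred_obj": (B, T_s, K),
        "pred_azi_logits": (B, T_s, K, 360),
        "pred_ele_logits": (B, T_s, K, 180),
        "pred_dist": (B, T_s, K, 1),
        "pred_class_logits": (B, T_s, K, C_cls),
        "pooled_spatial_latents": (B, T_s, D),
        "llm_spatial_tokens": (B, T_s, d_llm),
    }

def check_shapes(tensors, spec):
    """Return the names whose tensor is missing or whose shape differs from the spec."""
    return [
        name for name, shape in spec.items()
        if name not in tensors or tuple(tensors[name].shape) != shape
    ]

# Placeholder sizes (assumptions) with the default T_s = 25 and K = 4.
spec = expected_shapes(B=2, T_s=25, K=4, D=8, C_cls=5, d_llm=16)
tensors = {name: np.zeros(shape) for name, shape in spec.items()}
mismatches = check_shapes(tensors, spec)
print(mismatches)   # []
```

A check like this would flag the `25 * 4 = 100` mistake directly, since an `llm_spatial_tokens` tensor of shape `[B, 100, d_llm]` would not match the spec.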
## 6. What Should Be Updated in the Main Design
The main design should be interpreted as:
- internal `4` slots
- external fixed `25` spatial tokens for a `10 s` clip
So any previous statement implying:
- `2.5 Hz * 4 = 10 tokens / second`
should be considered obsolete for the final LLM interface.
The correct statement is:
- final LLM-visible spatial tokens are `2.5 tokens / second`
and:
- `K=4` is only internal source-slot capacity.