Spatial-BEATs Token Interface Clarification

1. Why `25 * 4 = 100` Is Not the Right Final Token Count

The previous discussion mixed two different concepts:

internal multi-source slot capacity
final LLM-visible token rate

These should be separated.

For the current design:

2.5 Hz is the final spatial token rate visible to the LLM
for a 10 s clip, this gives:
- T_s = 10 s * 2.5 Hz = 25

So the correct final token count is:

25 spatial tokens

not:

25 * 4 = 100

The 4 only refers to:

internal source slots per time step

It is an internal modeling capacity, not an external token-rate multiplier.
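
The arithmetic above can be sketched in a few lines of Python. The helper name `spatial_token_count` is hypothetical, and rounding to the nearest integer is an assumption; the note only covers clips where duration times rate is a whole number.

```python
def spatial_token_count(duration_s: float, rate_hz: float = 2.5) -> int:
    """Number of LLM-visible spatial tokens for a clip.

    Rounding is an assumption; the note only covers durations for which
    duration_s * rate_hz is already an integer.
    """
    return int(round(duration_s * rate_hz))


K = 4  # internal source slots per time step, not a token-rate multiplier

T_s = spatial_token_count(10.0)  # 10 s * 2.5 Hz = 25
internal_slots = T_s * K         # 100 slot latents exist only inside the model
llm_tokens = T_s                 # 25 tokens reach the LLM

print(T_s, internal_slots, llm_tokens)  # 25 100 25
```

Only `llm_tokens` counts against the LLM context; `internal_slots` describes the intermediate `[B, T_s, K, D]` representation.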

2. Corrected Design

The corrected interface is:

FOA waveform
  -> FOA features
  -> BEATs trunk
  -> temporal memory at 2.5 Hz              [B, T_s, D]
  -> per-step source slots (K=4)            [B, T_s, K, D]
  -> objectness-weighted slot pooling       [B, T_s, D]
  -> MLP projector
  -> final LLM spatial tokens               [B, T_s, d_llm]

With the default setup:

T_s = 25
K = 4

So:

internal representation: [B, 25, 4, D]
final LLM tokens: [B, 25, d_llm]

3. What `objectness-weighted pooling + MLP projector` Means

At each time step t, the model first predicts K=4 source slots:

z_{t,1}, z_{t,2}, z_{t,3}, z_{t,4}

Each slot also has an objectness score:

o_{t,1}, o_{t,2}, o_{t,3}, o_{t,4}

These objectness scores are normalized across the K slots:

alpha_{t,k} = softmax(o_{t,:})_k

Then the slot latents are pooled:

h_t = sum_{k=1..K} alpha_{t,k} * z_{t,k}

This produces one pooled latent for this time step:

h_t
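
As a minimal NumPy sketch of this pooling step, assuming the slot latents and objectness scores are already available as arrays: the function name `objectness_weighted_pool` and the size `D = 8` are illustrative choices, not taken from the actual implementation.

```python
import numpy as np


def objectness_weighted_pool(slots: np.ndarray, obj_scores: np.ndarray) -> np.ndarray:
    """Pool K slot latents into one latent per time step.

    slots:      [B, T_s, K, D]  slot latents z_{t,k}
    obj_scores: [B, T_s, K]     objectness scores o_{t,k}
    returns:    [B, T_s, D]     pooled latents h_t
    """
    # Softmax over the K axis, shifted by the max for numerical stability.
    shifted = obj_scores - obj_scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    alpha = weights / weights.sum(axis=-1, keepdims=True)  # [B, T_s, K]
    # h_t = sum_k alpha_{t,k} * z_{t,k}
    return np.einsum("btk,btkd->btd", alpha, slots)


rng = np.random.default_rng(0)
B, T_s, K, D = 2, 25, 4, 8  # D = 8 is an arbitrary illustrative size
z = rng.standard_normal((B, T_s, K, D))
o = rng.standard_normal((B, T_s, K))
h = objectness_weighted_pool(z, o)
print(h.shape)  # (2, 25, 8)
```

When all K scores at a time step are equal, the softmax weights are uniform and the pooling reduces to a plain mean over slots.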

The same idea is used to pool the structured slot-level predictions:

c_t = sum_k alpha_{t,k} * c_{t,k}
u_t = sum_k alpha_{t,k} * u_{t,k}
d_t = sum_k alpha_{t,k} * d_{t,k}
o_t = sum_k alpha_{t,k} * e_{obj,t,k}

where:

c_{t,k} is the slot-level class-context embedding
u_{t,k} is the slot-level direction embedding/vector
d_{t,k} is the slot-level distance embedding
e_{obj,t,k} is the slot-level objectness (confidence) embedding; the pooled o_t is therefore an embedding, distinct from the scalar scores o_{t,k}

Then the final per-step spatial token is formed as:

s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t])

where:

Proj is an MLP projector into the LLM hidden space

So the final sequence is:

S = [s_1, s_2, ..., s_{T_s}]

For a 10 s clip:

S has 25 tokens
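
The full token construction can be sketched end to end in NumPy. This is an illustration under stated assumptions, not the project's implementation: the embedding widths, the two-layer ReLU form of `Proj`, the random weights, and the helper names `pool` and `build_spatial_tokens` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def pool(alpha, per_slot):
    """sum_k alpha_{t,k} * x_{t,k}: [B,T_s,K] and [B,T_s,K,E] -> [B,T_s,E]."""
    return np.einsum("btk,btke->bte", alpha, per_slot)


def build_spatial_tokens(z, obj_scores, c, u, d, e_obj, params):
    """Form s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t]) for every time step."""
    alpha = softmax(obj_scores, axis=-1)             # [B, T_s, K]
    # The latent and every structured embedding share the same weights.
    pooled = [pool(alpha, x) for x in (z, c, u, d, e_obj)]
    feats = np.concatenate(pooled, axis=-1)          # [B, T_s, d_in]
    w1, b1, w2, b2 = params
    hidden = np.maximum(feats @ w1 + b1, 0.0)        # assumed ReLU hidden layer
    return hidden @ w2 + b2                          # [B, T_s, d_llm]


rng = np.random.default_rng(0)
B, T_s, K = 2, 25, 4
D, E_c, E_u, E_d, E_o = 16, 8, 3, 4, 4               # illustrative widths
d_in, d_hidden, d_llm = D + E_c + E_u + E_d + E_o, 32, 64

z = rng.standard_normal((B, T_s, K, D))
obj_scores = rng.standard_normal((B, T_s, K))
c = rng.standard_normal((B, T_s, K, E_c))
u = rng.standard_normal((B, T_s, K, E_u))
d = rng.standard_normal((B, T_s, K, E_d))
e_obj = rng.standard_normal((B, T_s, K, E_o))
params = (
    rng.standard_normal((d_in, d_hidden)) * 0.1, np.zeros(d_hidden),
    rng.standard_normal((d_hidden, d_llm)) * 0.1, np.zeros(d_llm),
)

S = build_spatial_tokens(z, obj_scores, c, u, d, e_obj, params)
print(S.shape)  # (2, 25, 64)
```

The slot axis K disappears before the projector runs, so the output length depends only on T_s.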

4. Why This Is Better

The corrected design satisfies both goals:

multi-source capacity inside the model
fixed low-rate spatial tokens outside the model

Advantages:

the model can still represent up to 4 sources at each time step
the final LLM token rate stays fixed at 2.5 Hz, independent of K
the external token interface is simpler and easier to scale
it avoids inflating the LLM token count by a factor of K

5. Corrected Tensor Shapes

Recommended tensor shapes:

temporal_memory: [B, T_s, D]
slot_tokens: [B, T_s, K, D]
pred_obj: [B, T_s, K]
pred_azi_logits: [B, T_s, K, 360]
pred_ele_logits: [B, T_s, K, 180]
pred_dist: [B, T_s, K, 1]
pred_class_logits: [B, T_s, K, C_cls]
pooled_spatial_latents: [B, T_s, D]
llm_spatial_tokens: [B, T_s, d_llm]

For the default setup:

T_s = 25
K = 4

Therefore:

internal slots: [B, 25, 4, D]
final LLM tokens: [B, 25, d_llm]
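
One way to keep these shapes from drifting is a small shape check on the model outputs. The sketch below assumes the outputs arrive as a dict of arrays keyed by the tensor names listed above; the helpers `expected_shapes` and `check_shapes`, and the sizes chosen for `D`, `C_cls`, and `d_llm`, are hypothetical.

```python
import numpy as np


def expected_shapes(B, T_s, K, D, C_cls, d_llm):
    """Recommended shape for each named tensor."""
    return {
        "temporal_memory": (B, T_s, D),
        "slot_tokens": (B, T_s, K, D),
        "pred_obj": (B, T_s, K),
        "pred_azi_logits": (B, T_s, K, 360),
        "pred_ele_logits": (B, T_s, K, 180),
        "pred_dist": (B, T_s, K, 1),
        "pred_class_logits": (B, T_s, K, C_cls),
        "pooled_spatial_latents": (B, T_s, D),
        "llm_spatial_tokens": (B, T_s, d_llm),
    }


def check_shapes(outputs, **dims):
    """Return a list of mismatch messages; empty when every shape agrees."""
    errors = []
    for name, shape in expected_shapes(**dims).items():
        if name not in outputs:
            errors.append(f"{name}: missing")
        elif tuple(outputs[name].shape) != shape:
            got = tuple(outputs[name].shape)
            errors.append(f"{name}: expected {shape}, got {got}")
    return errors


# Illustrative sizes; only T_s = 25 and K = 4 come from the default setup.
dims = dict(B=2, T_s=25, K=4, D=16, C_cls=13, d_llm=64)
good = {name: np.zeros(shape) for name, shape in expected_shapes(**dims).items()}
print(check_shapes(good, **dims))  # []

# The obsolete interface: slots flattened into 25 * 4 = 100 LLM tokens.
bad = dict(good, llm_spatial_tokens=np.zeros((2, 100, 64)))
print(check_shapes(bad, **dims))
```

The second call reports a single mismatch on `llm_spatial_tokens`, which is the error this note corrects.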

6. What Should Be Updated in the Main Design

The main design should be interpreted as:

internal: 4 source slots per time step
external: a fixed 25 spatial tokens for a 10 s clip

So any previous statement implying:

2.5 Hz * 4 = 10 tokens / second

should be considered obsolete for the final LLM interface.

The correct statement is:

final LLM-visible spatial tokens are 2.5 tokens / second

and:

K=4 is only internal source-slot capacity.