
Spatial-BEATs Token Interface Clarification

1. Why 25 * 4 = 100 Is Not the Right Final Token Count

The previous discussion mixed two different concepts:

  • internal multi-source slot capacity
  • final LLM-visible token rate

These should be separated.

For the current design:

  • 2.5 Hz is the rate of the final spatial tokens visible to the LLM
  • for a 10 s clip, this gives:
    • T_s = 10 s * 2.5 Hz = 25

So the correct final token count is:

  • 25 spatial tokens

not:

  • 25 * 4 = 100

The 4 only refers to:

  • internal source slots per time step

It is an internal modeling capacity, not an external token-rate multiplier.
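The arithmetic above can be checked with a small sketch. The function name `spatial_token_count` and the rounding rule are assumptions made for illustration; the note only specifies the 2.5 Hz rate, the 10 s example, and K = 4.

```python
def spatial_token_count(clip_seconds: float, token_rate_hz: float = 2.5) -> int:
    """LLM-visible spatial tokens for one clip: T_s = duration * rate.

    Rounding for clips whose length is not a multiple of 0.4 s is not
    specified in the note; round() is used here as a placeholder.
    """
    return int(round(clip_seconds * token_rate_hz))


K = 4                              # internal source slots per time step
T_s = spatial_token_count(10.0)    # external, LLM-visible token count
internal_slot_count = T_s * K      # size of the internal slot grid only

print(f"LLM-visible tokens: {T_s}")
print(f"internal slots (never sent to the LLM): {internal_slot_count}")
```

Only `T_s` determines the LLM sequence length; `internal_slot_count` describes the size of the internal `[T_s, K]` slot grid.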

2. Corrected Design

The corrected interface is:

FOA waveform
  -> FOA features
  -> BEATs trunk
  -> temporal memory at 2.5 Hz              [B, T_s, D]
  -> per-step source slots (K=4)            [B, T_s, K, D]
  -> objectness-weighted slot pooling       [B, T_s, D]
  -> MLP projector
  -> final LLM spatial tokens               [B, T_s, d_llm]

With the default setup:

  • T_s = 25
  • K = 4

So:

  • internal representation: [B, 25, 4, D]
  • final LLM tokens: [B, 25, d_llm]

3. What Objectness-Weighted Pooling + MLP Projector Means

At each time step t, the model first predicts K=4 source slots:

  • z_{t,1}, z_{t,2}, z_{t,3}, z_{t,4}

Each slot also has an objectness score:

  • o_{t,1}, o_{t,2}, o_{t,3}, o_{t,4}

These objectness scores are normalized across the K slots:

alpha_{t,k} = softmax(o_{t,:})_k

Then the slot latents are pooled:

h_t = sum_{k=1..K} alpha_{t,k} * z_{t,k}

This produces one pooled latent for this time step:

  • h_t
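The pooling step can be sketched in NumPy as follows. This is a minimal illustration, not the project's implementation: the function names, the random inputs, and the latent width D = 8 are assumptions, and the objectness scores are treated as unnormalized logits fed to the softmax.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def objectness_weighted_pool(slots: np.ndarray, objectness: np.ndarray):
    """Pool K slots per time step into one latent.

    slots:      [B, T_s, K, D]  slot latents z_{t,k}
    objectness: [B, T_s, K]     objectness scores o_{t,k}
    returns:    pooled [B, T_s, D] and weights alpha [B, T_s, K]
    """
    alpha = softmax(objectness, axis=-1)                # normalize over the K slots
    pooled = np.einsum("btk,btkd->btd", alpha, slots)   # h_t = sum_k alpha_{t,k} z_{t,k}
    return pooled, alpha


rng = np.random.default_rng(0)
B, T_s, K, D = 2, 25, 4, 8   # D = 8 is an arbitrary illustrative width
slots = rng.standard_normal((B, T_s, K, D))
objectness = rng.standard_normal((B, T_s, K))

pooled, alpha = objectness_weighted_pool(slots, objectness)
print(pooled.shape)  # (2, 25, 8): the K axis is gone
```

Because the weights sum to 1 over the K slots, equal objectness scores reduce the pooling to a plain mean over slots, while one dominant score makes h_t approach that slot's latent.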

The same idea is used to pool the structured slot-level predictions:

c_t = sum_k alpha_{t,k} * c_{t,k}
u_t = sum_k alpha_{t,k} * u_{t,k}
d_t = sum_k alpha_{t,k} * d_{t,k}
o_t = sum_k alpha_{t,k} * e_{obj,t,k}

where:

  • c_{t,k} is the slot-level class-context embedding
  • u_{t,k} is the slot-level direction embedding/vector
  • d_{t,k} is the slot-level distance embedding
  • e_{obj,t,k} is the slot-level objectness (confidence) embedding, so the pooled o_t is an embedding and is distinct from the scalar scores o_{t,k}

Then the final per-step spatial token is formed as:

s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t])

where:

  • Proj is an MLP projector into the LLM hidden space

So the final sequence is:

S = [s_1, s_2, ..., s_{T_s}]

For a 10 s clip:

  • S has 25 tokens
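A hedged end-to-end sketch of the token formation is given below. The note only states that Proj is an MLP; the two-layer ReLU form, all embedding widths (D, D_c, D_u, D_d, D_o, d_hidden, d_llm), the helper names, and the random weights are assumptions chosen to make the shapes concrete.

```python
import numpy as np


def pool(alpha: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Weighted sum over the slot axis: [B,T_s,K] x [B,T_s,K,F] -> [B,T_s,F]."""
    return np.einsum("btk,btkf->btf", alpha, x)


def mlp_project(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU (assumed form of Proj)."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
B, T_s, K = 1, 25, 4
# Hypothetical widths for the latent and the four slot-level embeddings.
D, D_c, D_u, D_d, D_o = 16, 8, 3, 4, 4
d_hidden, d_llm = 32, 64

# alpha_{t,k}: softmax of objectness scores over the K slots.
scores = rng.standard_normal((B, T_s, K))
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)

# Random stand-ins for the slot-level quantities.
z_slots = rng.standard_normal((B, T_s, K, D))        # z_{t,k}
c_slots = rng.standard_normal((B, T_s, K, D_c))      # c_{t,k}
u_slots = rng.standard_normal((B, T_s, K, D_u))      # u_{t,k}
d_slots = rng.standard_normal((B, T_s, K, D_d))      # d_{t,k}
e_obj_slots = rng.standard_normal((B, T_s, K, D_o))  # e_{obj,t,k}

# Pool every quantity with the same weights, then concatenate:
# [h_t ; c_t ; u_t ; d_t ; o_t]
fused = np.concatenate(
    [pool(alpha, x) for x in (z_slots, c_slots, u_slots, d_slots, e_obj_slots)],
    axis=-1,
)

d_in = fused.shape[-1]
w1 = rng.standard_normal((d_in, d_hidden)) * 0.02
b1 = np.zeros(d_hidden)
w2 = rng.standard_normal((d_hidden, d_llm)) * 0.02
b2 = np.zeros(d_llm)

S = mlp_project(fused, w1, b1, w2, b2)  # s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t])
print(fused.shape, S.shape)
```

The slot axis K disappears before the projector, so the sequence handed to the LLM has length T_s regardless of how many slots are used internally.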

4. Why This Is Better

This corrected design keeps both goals:

  1. multi-source capacity inside the model
  2. fixed low-rate spatial tokens outside the model

Advantages:

  • the model can still represent up to 4 sources at each time step
  • the final LLM token rate stays fixed at 2.5 Hz, independent of K
  • the external token interface is simpler and easier to scale
  • it avoids unnecessarily inflating the LLM token count by K

5. Corrected Tensor Shapes

Recommended tensor shapes:

  • temporal_memory: [B, T_s, D]
  • slot_tokens: [B, T_s, K, D]
  • pred_obj: [B, T_s, K]
  • pred_azi_logits: [B, T_s, K, 360]
  • pred_ele_logits: [B, T_s, K, 180]
  • pred_dist: [B, T_s, K, 1]
  • pred_class_logits: [B, T_s, K, C_cls]
  • pooled_spatial_latents: [B, T_s, D]
  • llm_spatial_tokens: [B, T_s, d_llm]

For the default setup:

  • T_s = 25
  • K = 4

Therefore:

  • internal slots: [B, 25, 4, D]
  • final LLM tokens: [B, 25, d_llm]
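One way to guard against the old 25 * 4 layout creeping back in is a shape-contract check over the model outputs. The sketch below assumes the outputs arrive as a dict keyed by the tensor names listed above; the helper names `expected_shapes` and `check_shapes`, and the values chosen for D, C_cls, and d_llm, are hypothetical.

```python
import numpy as np


def expected_shapes(B, T_s, K, D, C_cls, d_llm):
    """Recommended shapes from this note, keyed by tensor name."""
    return {
        "temporal_memory": (B, T_s, D),
        "slot_tokens": (B, T_s, K, D),
        "pred_obj": (B, T_s, K),
        "pred_azi_logits": (B, T_s, K, 360),
        "pred_ele_logits": (B, T_s, K, 180),
        "pred_dist": (B, T_s, K, 1),
        "pred_class_logits": (B, T_s, K, C_cls),
        "pooled_spatial_latents": (B, T_s, D),
        "llm_spatial_tokens": (B, T_s, d_llm),
    }


def check_shapes(tensors, **dims):
    """Return a list of mismatch messages; an empty list means all shapes agree."""
    problems = []
    for name, want in expected_shapes(**dims).items():
        if name not in tensors:
            problems.append(f"{name}: missing")
        elif tuple(tensors[name].shape) != want:
            got = tuple(tensors[name].shape)
            problems.append(f"{name}: got {got}, expected {want}")
    return problems


# Illustrative sizes; D, C_cls, and d_llm are placeholders.
dims = dict(B=2, T_s=25, K=4, D=16, C_cls=13, d_llm=64)
outputs = {name: np.zeros(shape) for name, shape in expected_shapes(**dims).items()}
print(check_shapes(outputs, **dims))  # no mismatches reported

# The obsolete layout exposes T_s * K = 100 tokens to the LLM and is flagged.
outputs["llm_spatial_tokens"] = np.zeros((2, 100, 64))
print(check_shapes(outputs, **dims))
```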

6. What Should Be Updated in the Main Design

The main design should be interpreted as:

  • 4 internal source slots per time step
  • a fixed 25 external spatial tokens for a 10 s clip

So any previous statement implying:

  • 2.5 Hz * 4 = 10 tokens / second

should be considered obsolete for the final LLM interface.

The correct statement is:

  • final LLM-visible spatial tokens are 2.5 tokens / second

and:

  • K=4 is only internal source-slot capacity.