# Spatial-BEATs Token Interface Clarification

## 1. Why `25 * 4 = 100` Is Not the Right Final Token Count

The previous discussion mixed two different concepts:

- internal multi-source slot capacity
- final LLM-visible token rate

These should be separated.

For the current design:

- `2.5 Hz` means the **final spatial token rate visible to the LLM**
- for a `10 s` clip, this means:
  - `T_s = 10 * 2.5 = 25`

So the correct final token count is:

- `25 spatial tokens`

not:

- `25 * 4 = 100`

The `4` only refers to:

- internal source slots per time step

It is an internal modeling capacity, not an external token-rate multiplier.
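The arithmetic above can be checked with a small sketch. The helper name `spatial_token_count` and its parameters are illustrative and not part of the design, and rounding to the nearest integer for clip lengths that do not divide evenly is an assumption. The point is that `K` never enters the token-count calculation.

```python
def spatial_token_count(clip_seconds: float, token_rate_hz: float = 2.5) -> int:
    """Number of LLM-visible spatial tokens for one clip.

    Hypothetical helper: the slot count K is deliberately not a parameter,
    because slots are pooled away before the LLM sees anything.
    Rounding to the nearest integer is an assumption.
    """
    return round(clip_seconds * token_rate_hz)


T_s = spatial_token_count(10.0)   # 10 s * 2.5 Hz
K = 4                             # internal source slots per time step
internal_slots = T_s * K          # slot vectors that exist only inside the model
```

Here `T_s` is the number the LLM sees, while `internal_slots` only describes how many slot vectors the model holds before pooling.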

## 2. Corrected Design

The corrected interface is:

```text
FOA waveform
  -> FOA features
  -> BEATs trunk
  -> temporal memory at 2.5 Hz              [B, T_s, D]
  -> per-step source slots (K=4)            [B, T_s, K, D]
  -> objectness-weighted slot pooling       [B, T_s, D]
  -> MLP projector
  -> final LLM spatial tokens               [B, T_s, d_llm]
```

With the default setup:

- `T_s = 25`
- `K = 4`

So:

- internal representation: `[B, 25, 4, D]`
- final LLM tokens: `[B, 25, d_llm]`

## 3. What `objectness-weighted pooling + MLP projector` Means

At each time step `t`, the model first predicts `K=4` source slots:

- `z_{t,1}, z_{t,2}, z_{t,3}, z_{t,4}`

Each slot also has an objectness score:

- `o_{t,1}, o_{t,2}, o_{t,3}, o_{t,4}`

These objectness scores are normalized across the `K` slots:

```text
alpha_{t,k} = softmax(o_{t,:})_k
```

Then the slot latents are pooled:

```text
h_t = sum_{k=1..K} alpha_{t,k} * z_{t,k}
```

This produces one pooled latent for this time step:

- `h_t`
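A minimal pure-Python sketch of this pooling step for a single time step follows. The function names `softmax` and `pool_slots` and the toy values are illustrative; an actual implementation would most likely operate on batched tensors of shape `[B, T_s, K, D]` instead of nested lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_slots(slots, objectness):
    """Objectness-weighted pooling for one time step.

    slots:      K lists of D floats (z_{t,1..K})
    objectness: K unnormalized floats (o_{t,1..K})
    returns:    (alpha, h_t), where h_t is a single list of D floats
    """
    alpha = softmax(objectness)
    dim = len(slots[0])
    h_t = [sum(a * z[i] for a, z in zip(alpha, slots)) for i in range(dim)]
    return alpha, h_t


# Toy example with K=4 slots and D=2; the values are made up.
slots = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
objectness = [4.0, 0.0, 0.0, 0.0]  # slot 1 scores far higher than the rest
alpha, h_t = pool_slots(slots, objectness)
```

Because the softmax weights always sum to one, `h_t` is a convex combination of the slot latents: a slot with a much higher objectness score dominates the result, and the output keeps dimension `D` regardless of `K`.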

The same weights `alpha_{t,k}` are used to pool the structured slot-level predictions:

```text
c_t = sum_k alpha_{t,k} * c_{t,k}
u_t = sum_k alpha_{t,k} * u_{t,k}
d_t = sum_k alpha_{t,k} * d_{t,k}
o_t = sum_k alpha_{t,k} * e_{obj,t,k}
```

where:

- `c_{t,k}` is the slot-level class-context embedding
- `u_{t,k}` is the slot-level direction embedding/vector
- `d_{t,k}` is the slot-level distance embedding
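- `e_{obj,t,k}` is the slot-level confidence embedding
- `o_t` (no slot index) is the resulting pooled confidence embedding, distinct from the scalar objectness scores `o_{t,k}` defined earlier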

Then the final per-step spatial token is formed as:

```text
s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t])
```

where:

- `Proj` is an MLP projector into the LLM hidden space

So the final sequence is:

```text
S = [s_1, s_2, ..., s_{T_s}]
```

For a `10 s` clip:

- `S` has `25` tokens
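The concatenate-then-project step can be sketched as below. Everything specific here is an assumption made for illustration: the two-layer ReLU form of `Proj`, the hidden width of 32, the toy sizes for `D` and `d_llm`, and the widths chosen for `c_t`, `u_t`, `d_t`, and `o_t`. The weights are random stand-ins for a trained projector, and the per-step vectors are random stand-ins for real pooled outputs.

```python
import random


def concat(*vectors):
    """[h_t ; c_t ; u_t ; d_t ; o_t] as one flat list."""
    return [x for v in vectors for x in v]


def linear(x, weight, bias):
    """y = W x + b, with weight stored as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def make_layer(rng, n_in, n_out):
    """Random weights for illustration only; a real projector is trained."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
              for _ in range(n_out)]
    return weight, [0.0] * n_out


def project(x, layer1, layer2):
    """Two-layer MLP with ReLU (depth and activation are assumptions)."""
    hidden = [max(0.0, v) for v in linear(x, *layer1)]
    return linear(hidden, *layer2)


rng = random.Random(0)
T_s, D, d_llm = 25, 8, 16                          # D and d_llm are toy sizes
dims = {"h": D, "c": 4, "u": 3, "d": 2, "o": 2}    # hypothetical widths
in_dim = sum(dims.values())
layer1 = make_layer(rng, in_dim, 32)
layer2 = make_layer(rng, 32, d_llm)

S = []
for t in range(T_s):
    # Stand-ins for the pooled per-step vectors h_t, c_t, u_t, d_t, o_t.
    parts = [[rng.random() for _ in range(n)] for n in dims.values()]
    S.append(project(concat(*parts), layer1, layer2))
```

The loop emits exactly one vector of width `d_llm` per time step, so the length of `S` depends only on `T_s` and never on `K`.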

## 4. Why This Is Better

This corrected design keeps both goals:

1. multi-source capacity inside the model
2. a fixed, low-rate spatial token stream at the LLM interface

Advantages:

- the model can still represent up to `4` sources at each time step
- the final LLM token rate stays fixed at `2.5 Hz` (`25` tokens for a `10 s` clip), regardless of `K`
- the external token interface is simpler and easier to scale
- it avoids unnecessarily inflating the LLM token count by `K`

## 5. Corrected Tensor Shapes

Recommended tensor shapes:

- `temporal_memory`: `[B, T_s, D]`
- `slot_tokens`: `[B, T_s, K, D]`
- `pred_obj`: `[B, T_s, K]`
- `pred_azi_logits`: `[B, T_s, K, 360]`
- `pred_ele_logits`: `[B, T_s, K, 180]`
- `pred_dist`: `[B, T_s, K, 1]`
- `pred_class_logits`: `[B, T_s, K, C_cls]`
- `pooled_spatial_latents`: `[B, T_s, D]`
- `llm_spatial_tokens`: `[B, T_s, d_llm]`

For the default setup:

- `T_s = 25`
- `K = 4`

Therefore:

- internal slots: `[B, 25, 4, D]`
- final LLM tokens: `[B, 25, d_llm]`
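One way to keep these shapes honest is to encode them as a lookup and assert that the LLM-facing tensor never carries a slot axis. The sketch below is illustrative: the function names are hypothetical, and the values used for `B`, `D`, `d_llm`, and `C_cls` are placeholders, not settings taken from the design.

```python
def expected_shapes(B, T_s, K, D, d_llm, C_cls):
    """Shapes from the list above, keyed by tensor name."""
    return {
        "temporal_memory": (B, T_s, D),
        "slot_tokens": (B, T_s, K, D),
        "pred_obj": (B, T_s, K),
        "pred_azi_logits": (B, T_s, K, 360),
        "pred_ele_logits": (B, T_s, K, 180),
        "pred_dist": (B, T_s, K, 1),
        "pred_class_logits": (B, T_s, K, C_cls),
        "pooled_spatial_latents": (B, T_s, D),
        "llm_spatial_tokens": (B, T_s, d_llm),
    }


def llm_token_count(shapes):
    """Tokens per clip that reach the LLM: the T_s axis of the final tensor."""
    shape = shapes["llm_spatial_tokens"]
    assert len(shape) == 3, "final tokens must not carry a slot axis"
    return shape[1]


# Placeholder sizes; only T_s=25 and K=4 come from the default setup.
shapes = expected_shapes(B=2, T_s=25, K=4, D=768, d_llm=4096, C_cls=13)
tokens_per_clip = llm_token_count(shapes)
```

A check like `llm_token_count` could run in a unit test so that a later change which accidentally forwards `[B, T_s, K, d_llm]` to the LLM fails early.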

## 6. What Should Be Updated in the Main Design

The main design should be interpreted as:

- internal `4` slots
- external fixed `25` spatial tokens for a `10 s` clip

So any previous statement implying:

- `2.5 Hz * 4 = 10 tokens / second`

should be considered obsolete for the final LLM interface.

The correct statement is:

- the final LLM-visible spatial token rate is `2.5 tokens / second`

and:

- `K=4` is only internal source-slot capacity.