	# Spatial-BEATs Token Interface Clarification

	## 1. Why `25 * 4 = 100` Is Not the Right Final Token Count

	The previous discussion mixed two different concepts:

	- internal multi-source slot capacity
	- final LLM-visible token rate

	These should be separated.

	For the current design:

	- `2.5 Hz` is the final spatial token rate visible to the LLM
	- for a `10 s` clip, this gives `T_s = 10 * 2.5 = 25`

	So the correct final token count is:

	- `25 spatial tokens`

	not:

	- `25 * 4 = 100`

	The `4` only refers to:

	- internal source slots per time step

	It is an internal modeling capacity, not an external token-rate multiplier.
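
The arithmetic above can be checked with a minimal sketch. The helper name `spatial_token_count` is hypothetical, and rounding the product to the nearest integer is an assumption; this note does not say how clip lengths that are not a multiple of `0.4 s` are handled.

```python
def spatial_token_count(duration_s: float, rate_hz: float = 2.5) -> int:
    """Number of LLM-visible spatial tokens for a clip (hypothetical helper)."""
    # Assumption: round to the nearest integer; the note does not define
    # the behavior for durations that are not a multiple of 1 / rate_hz.
    return round(duration_s * rate_hz)


K = 4                              # internal source slots per time step
T_s = spatial_token_count(10.0)    # final LLM-visible token count
internal_slot_vectors = T_s * K    # internal capacity only, not an LLM token count

print(T_s)                    # 25
print(internal_slot_vectors)  # 100
```

The `100` counts internal slot vectors only; the LLM still sees `25` tokens.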

	## 2. Corrected Design

	The corrected interface is:

	```text
	FOA waveform
	-> FOA features
	-> BEATs trunk
	-> temporal memory at 2.5 Hz [B, T_s, D]
	-> per-step source slots (K=4) [B, T_s, K, D]
	-> objectness-weighted slot pooling [B, T_s, D]
	-> MLP projector
	-> final LLM spatial tokens [B, T_s, d_llm]
	```

	With the default setup:

	- `T_s = 25`
	- `K = 4`

	So:

	- internal representation: `[B, 25, 4, D]`
	- final LLM tokens: `[B, 25, d_llm]`
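
As a rough sketch, the shape flow of the pipeline above can be traced stage by stage. The function name `trace_shapes` is hypothetical, and the values `D = 768` and `d_llm = 4096` are placeholders chosen for illustration; this note does not specify them.

```python
def trace_shapes(B, T_s=25, K=4, D=768, d_llm=4096):
    """Return (stage, shape) pairs for the corrected interface.

    D and d_llm defaults are illustrative placeholders, not values from the design.
    """
    return [
        ("temporal memory at 2.5 Hz", (B, T_s, D)),
        ("per-step source slots", (B, T_s, K, D)),
        ("objectness-weighted slot pooling", (B, T_s, D)),
        ("final LLM spatial tokens", (B, T_s, d_llm)),
    ]


for stage, shape in trace_shapes(B=2):
    print(f"{stage:35s} {shape}")

# K appears only in the intermediate slot tensor; pooling removes that axis,
# so the sequence length seen by the LLM is T_s, not T_s * K.
```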

	## 3. What `objectness-weighted pooling + MLP projector` Means

	At each time step `t`, the model first predicts `K=4` source slots:

	- `z_{t,1}, z_{t,2}, z_{t,3}, z_{t,4}`

	Each slot also has an objectness score:

	- `o_{t,1}, o_{t,2}, o_{t,3}, o_{t,4}`

	These objectness scores are normalized across the `K` slots:

	```text
	alpha_{t,k} = softmax(o_{t,:})_k
	```

	Then the slot latents are pooled:

	```text
	h_t = sum_{k=1..K} alpha_{t,k} * z_{t,k}
	```

	This produces one pooled latent for this time step:

	- `h_t`
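
A minimal pure-Python sketch of the two formulas above, for a single time step. The function names and the toy values (`K = 4` slots with latent size `D = 2`) are assumptions made for illustration; a real implementation would operate on batched tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_slots(slot_latents, objectness):
    """Compute h_t = sum_k alpha_{t,k} * z_{t,k} for one time step.

    slot_latents: K lists of length D (the z_{t,k})
    objectness:   K scalar scores (the o_{t,k})
    """
    alpha = softmax(objectness)
    dim = len(slot_latents[0])
    return [sum(a * z[i] for a, z in zip(alpha, slot_latents)) for i in range(dim)]


# Toy example: equal objectness scores give equal weights (0.25 each),
# so the pooled latent is the plain mean of the four slots.
z_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
o_t = [0.0, 0.0, 0.0, 0.0]
h_t = pool_slots(z_t, o_t)
print(h_t)  # [0.5, 0.5]
```

Because softmax weights always sum to 1, `h_t` is a convex combination of the slot latents even when every slot has a low objectness score.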

	The same idea is used to pool the structured slot-level predictions:

	```text
	c_t = sum_{k=1..K} alpha_{t,k} * c_{t,k}
	u_t = sum_{k=1..K} alpha_{t,k} * u_{t,k}
	d_t = sum_{k=1..K} alpha_{t,k} * d_{t,k}
	o_t = sum_{k=1..K} alpha_{t,k} * e_{obj,t,k}
	```

	where:

	- `c_{t,k}` is the slot-level class-context embedding
	- `u_{t,k}` is the slot-level direction embedding/vector
	- `d_{t,k}` is the slot-level distance embedding
	- `e_{obj,t,k}` is the slot-level confidence embedding; its pooled form `o_t` is an embedding and is distinct from the scalar objectness scores `o_{t,k}`

	Then the final per-step spatial token is formed as:

	```text
	s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t])
	```

	where:

	- `Proj` is an MLP projector into the LLM hidden space
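
The concatenate-then-project step can be sketched as follows. Everything specific here is an assumption for illustration: the component sizes, the two-layer ReLU structure, the random weights, and the helper names `linear`, `relu`, and `make_projector`. This note only states that `Proj` is an MLP into the LLM hidden space.

```python
import random


def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_projector(in_dim, hidden_dim, out_dim, seed=0):
    """Build a toy two-layer MLP (assumed structure, random weights)."""
    rng = random.Random(seed)

    def init(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    w1, b1 = init(hidden_dim, in_dim), [0.0] * hidden_dim
    w2, b2 = init(out_dim, hidden_dim), [0.0] * out_dim

    def proj(x):
        return linear(relu(linear(x, w1, b1)), w2, b2)

    return proj


# Toy pooled features for one time step (sizes are placeholders).
h_t = [0.1] * 8         # pooled slot latent
c_t = [0.2] * 4         # pooled class-context embedding
u_t = [0.0, 1.0, 0.0]   # pooled direction vector
d_t = [0.5] * 2         # pooled distance embedding
o_t = [0.9] * 2         # pooled confidence embedding

x_t = h_t + c_t + u_t + d_t + o_t   # [h_t ; c_t ; u_t ; d_t ; o_t]
proj = make_projector(in_dim=len(x_t), hidden_dim=16, out_dim=6)
s_t = proj(x_t)

print(len(x_t), len(s_t))  # 19 6
```

The projector input width is the sum of the five component widths, and its output width plays the role of `d_llm`; neither depends on `K`.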

	So the final sequence is:

	```text
	S = [s_1, s_2, ..., s_{T_s}]
	```

	For a `10 s` clip:

	- `S` has `25` tokens

	## 4. Why This Is Better

	This corrected design meets both goals:

	1. multi-source capacity inside the model
	2. a fixed, low-rate spatial token stream at the LLM interface

	Advantages:

	- the model can still represent up to `4` sources at each time step
	- the final LLM token rate stays fixed at `2.5 Hz` (`25` tokens for a `10 s` clip)
	- the external token interface is simpler and easier to scale
	- it avoids unnecessarily inflating the LLM token count by `K`

	## 5. Corrected Tensor Shapes

	Recommended tensor shapes:

	- `temporal_memory`: `[B, T_s, D]`
	- `slot_tokens`: `[B, T_s, K, D]`
	- `pred_obj`: `[B, T_s, K]`
	- `pred_azi_logits`: `[B, T_s, K, 360]`
	- `pred_ele_logits`: `[B, T_s, K, 180]`
	- `pred_dist`: `[B, T_s, K, 1]`
	- `pred_class_logits`: `[B, T_s, K, C_cls]`
	- `pooled_spatial_latents`: `[B, T_s, D]`
	- `llm_spatial_tokens`: `[B, T_s, d_llm]`

	For the default setup:

	- `T_s = 25`
	- `K = 4`

	Therefore:

	- internal slots: `[B, 25, 4, D]`
	- final LLM tokens: `[B, 25, d_llm]`

	## 6. What Should Be Updated in the Main Design

	The main design should be interpreted as:

	- internal `4` slots
	- external fixed `25` spatial tokens for a `10 s` clip

	So any previous statement implying:

	- `2.5 Hz * 4 = 10 tokens / second`

	should be considered obsolete for the final LLM interface.

	The correct statement is:

	- final LLM-visible spatial tokens are `2.5 tokens / second`

	and:

	- `K=4` is only internal source-slot capacity.