# Spatial-BEATs Token Interface Clarification

## 1. Why `25 * 4 = 100` Is Not the Right Final Token Count

The previous discussion mixed two different concepts:

- internal multi-source slot capacity
- final LLM-visible token rate

These should be separated.

For the current design:

- `2.5 Hz` means the **final spatial token rate visible to the LLM**
- for a `10 s` clip, this means:
  - `T_s = 10 * 2.5 = 25`

So the correct final token count is:

- `25 spatial tokens`

not:

- `25 * 4 = 100`

The `4` only refers to:

- internal source slots per time step

It is an internal modeling capacity, not an external token-rate multiplier.
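
The distinction can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the design itself: the function names are hypothetical, and flooring non-integer products is an assumption, since the note only covers the `10 s` case.

```python
import math


def spatial_token_count(duration_s: float, token_rate_hz: float = 2.5) -> int:
    """LLM-visible spatial tokens for a clip (T_s).

    Flooring is an assumption; the design note only states the 10 s case.
    """
    return math.floor(duration_s * token_rate_hz)


def internal_slot_count(duration_s: float, num_slots: int = 4,
                        token_rate_hz: float = 2.5) -> int:
    """Internal slot latents (T_s * K); these are never exposed to the LLM."""
    return spatial_token_count(duration_s, token_rate_hz) * num_slots


t_s = spatial_token_count(10.0)       # 25 tokens reach the LLM
n_slots = internal_slot_count(10.0)   # 100 slot latents stay inside the model
```

Only `t_s` counts against the LLM context budget; `n_slots` describes internal capacity.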

## 2. Corrected Design

The corrected interface is:

```text
FOA waveform
  -> FOA features
  -> BEATs trunk
  -> temporal memory at 2.5 Hz              [B, T_s, D]
  -> per-step source slots (K=4)            [B, T_s, K, D]
  -> objectness-weighted slot pooling       [B, T_s, D]
  -> MLP projector
  -> final LLM spatial tokens               [B, T_s, d_llm]
```

With the default setup:

- `T_s = 25`
- `K = 4`

So:

- internal representation: `[B, 25, 4, D]`
- final LLM tokens: `[B, 25, d_llm]`

## 3. What `objectness-weighted pooling + MLP projector` Means

At each time step `t`, the model first predicts `K=4` source slots:

- `z_{t,1}, z_{t,2}, z_{t,3}, z_{t,4}`

Each slot also has an objectness score:

- `o_{t,1}, o_{t,2}, o_{t,3}, o_{t,4}`

These objectness scores are normalized across the `K` slots:

```text
alpha_{t,k} = softmax(o_{t,:})_k
```

Then the slot latents are pooled:

```text
h_t = sum_{k=1..K} alpha_{t,k} * z_{t,k}
```

This produces one pooled latent for this time step:

- `h_t`
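
The softmax and pooling steps for a single time step can be sketched in plain Python. This is an illustrative sketch only: the helper names `softmax` and `pool_slots` are hypothetical, and the toy values (`D = 2`, equal scores) are chosen to make the result easy to verify by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_slots(slot_latents, objectness):
    """Pool K slot latents into one latent for a single time step.

    slot_latents: K lists of D floats (z_{t,k})
    objectness:   K raw scores (o_{t,k})
    returns:      one list of D floats (h_t)
    """
    alpha = softmax(objectness)
    dim = len(slot_latents[0])
    return [
        sum(a * z[d] for a, z in zip(alpha, slot_latents))
        for d in range(dim)
    ]


# Toy example: K=4 slots, D=2 (both sizes are illustrative).
z_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
o_t = [0.0, 0.0, 0.0, 0.0]      # equal scores give uniform weights of 0.25
h_t = pool_slots(z_t, o_t)      # [0.5, 0.5]
```

When one slot has a much higher objectness score than the others, `h_t` approaches that slot's latent, so a single dominant source passes through pooling almost unchanged.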

The same weights `alpha_{t,k}` are used to pool the structured slot-level predictions:

```text
c_t = sum_k alpha_{t,k} * c_{t,k}
u_t = sum_k alpha_{t,k} * u_{t,k}
d_t = sum_k alpha_{t,k} * d_{t,k}
o_t = sum_k alpha_{t,k} * e_{obj,t,k}
```

where:

- `c_{t,k}` is the slot-level class-context embedding
- `u_{t,k}` is the slot-level direction embedding/vector
- `d_{t,k}` is the slot-level distance embedding
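- `e_{obj,t,k}` is the slot-level confidence (objectness) embedding; the pooled `o_t` is therefore an embedding, distinct from the scalar scores `o_{t,k}`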

Then the final per-step spatial token is formed as:

```text
s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t])
```

where:

- `Proj` is an MLP projector into the LLM hidden space

So the final sequence is:

```text
S = [s_1, s_2, ..., s_{T_s}]
```

For a `10 s` clip:

- `S` has `25` tokens
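
The concatenate-then-project step can be sketched as follows. The note only says that `Proj` is an MLP, so the two-layer ReLU structure, the random weights, and every dimension below (`D`, `d_cls`, `d_dir`, `d_dist`, `d_obj`, `d_hidden`, `d_llm`) are assumptions made for illustration.

```python
import random


def linear(x, weight, bias):
    """y = W x + b, with the weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def make_linear(n_in, n_out, rng):
    """Random weights for illustration only (hypothetical initialisation)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
              for _ in range(n_out)]
    return weight, [0.0] * n_out


def project_token(h_t, c_t, u_t, d_t, o_t, layers):
    """s_t = Proj([h_t ; c_t ; u_t ; d_t ; o_t]) with an assumed 2-layer MLP."""
    x = h_t + c_t + u_t + d_t + o_t              # list concatenation
    (w1, b1), (w2, b2) = layers
    hidden = [max(0.0, v) for v in linear(x, w1, b1)]   # ReLU
    return linear(hidden, w2, b2)


rng = random.Random(0)
# All sizes are placeholders, not values from the design.
D, d_cls, d_dir, d_dist, d_obj, d_hidden, d_llm = 8, 4, 3, 2, 2, 16, 12
layers = [make_linear(D + d_cls + d_dir + d_dist + d_obj, d_hidden, rng),
          make_linear(d_hidden, d_llm, rng)]

T_s = 25   # 10 s clip at 2.5 Hz
S = [project_token([0.1] * D, [0.2] * d_cls, [0.0, 0.0, 1.0],
                   [0.5] * d_dist, [0.9] * d_obj, layers)
     for _ in range(T_s)]
# S holds 25 tokens, each of length d_llm; K never appears in its shape.
```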

## 4. Why This Is Better

This corrected design meets both goals:

1. multi-source capacity inside the model
2. fixed low-rate spatial tokens outside the model

Advantages:

- the model can still represent up to `4` sources at each time step
- the final LLM token rate stays fixed at `2.5 Hz`
- the external token interface is simpler and easier to scale
- it avoids unnecessarily inflating the LLM token count by `K`

## 5. Corrected Tensor Shapes

Recommended tensor shapes:

- `temporal_memory`: `[B, T_s, D]`
- `slot_tokens`: `[B, T_s, K, D]`
- `pred_obj`: `[B, T_s, K]`
- `pred_azi_logits`: `[B, T_s, K, 360]`
- `pred_ele_logits`: `[B, T_s, K, 180]`
- `pred_dist`: `[B, T_s, K, 1]`
- `pred_class_logits`: `[B, T_s, K, C_cls]`
- `pooled_spatial_latents`: `[B, T_s, D]`
- `llm_spatial_tokens`: `[B, T_s, d_llm]`

For the default setup:

- `T_s = 25`
- `K = 4`

Therefore:

- internal slots: `[B, 25, 4, D]`
- final LLM tokens: `[B, 25, d_llm]`
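
One possible way to keep these shapes honest during development is a small lookup table plus a checker. This is a sketch under stated assumptions: `expected_shapes` and `check_shapes` are hypothetical helpers, and the values passed for `B`, `D`, `d_llm`, and `C_cls` are placeholders rather than settings from the design.

```python
def expected_shapes(B, D, d_llm, C_cls, T_s=25, K=4):
    """Expected tensor shapes, keyed by the names used in this note."""
    return {
        "temporal_memory": (B, T_s, D),
        "slot_tokens": (B, T_s, K, D),
        "pred_obj": (B, T_s, K),
        "pred_azi_logits": (B, T_s, K, 360),
        "pred_ele_logits": (B, T_s, K, 180),
        "pred_dist": (B, T_s, K, 1),
        "pred_class_logits": (B, T_s, K, C_cls),
        "pooled_spatial_latents": (B, T_s, D),
        "llm_spatial_tokens": (B, T_s, d_llm),
    }


def check_shapes(actual, expected):
    """Return the names whose actual shape differs from the expected one."""
    return [name for name, shape in expected.items()
            if tuple(actual.get(name, ())) != shape]


# Placeholder sizes for illustration.
shapes = expected_shapes(B=2, D=16, d_llm=32, C_cls=5)

# A regression to the obsolete 25 * 4 = 100 token interface is flagged.
wrong = dict(shapes, llm_spatial_tokens=(2, 100, 32))
mismatches = check_shapes(wrong, shapes)   # ["llm_spatial_tokens"]
```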

## 6. What Should Be Updated in the Main Design

The main design should be interpreted as:

- internal `4` slots
- external fixed `25` spatial tokens for a `10 s` clip

So any previous statement implying:

- `2.5 Hz * 4 = 10 tokens / second`

should be considered obsolete for the final LLM interface.

The correct statement is:

- final LLM-visible spatial tokens are `2.5 tokens / second`

and:

- `K=4` is only internal source-slot capacity.