---
language:
- en
license: apache-2.0
tags:
- multimodal
- embedding
- trimodal
- retrieval
- image-text-audio
- audio
- speech
- memory-augmentation
- feature-extraction
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---

# AIST-87M

`AIST-87M` is a compact audio + image + speech + text embedding model for
human-memory augmentation workloads.

It is the single-audio evolution of the earlier dual-audio tower line: the
runtime audio path uses one merged native `mn20_as` EfficientAT encoder instead
of a separate EfficientAT + Whisper dual branch. The LoRA training weights are
merged into the native audio encoder in this release artifact, so there is no
separate LoRA pass at inference time.

Core stack:

- text: `MongoDB/mdbr-leaf-ir`
- image: `mobilenetv4_conv_medium.e180_r384_in12k`
- audio: native merged `mn20_as` EfficientAT encoder
- projection output: `1280d`
- Matryoshka slices: `[1280, 768, 512, 256, 128]`
- exact loaded params: `87,118,774`

The canonical name follows the Augmem naming standard:

- `AIST` = audio + image + speech + text
- `87M` = exact loaded parameter count rounded to integer millions

## Runtime Contract

This model returns L2-normalized embeddings in a shared 1280-dimensional space.
For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:

```text
z1280 = l2norm(model(input))
z768  = l2norm(z1280[0:768])
z512  = l2norm(z1280[0:512])
```
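The truncate-and-renormalize step above can be sketched in plain Python. This is a minimal illustration, not the release code: `l2norm` and `matryoshka_slice` are hypothetical helper names, and the input vector is a synthetic stand-in for a real 1280-d model output.

```python
import math


def l2norm(vec):
    """Scale a vector to unit L2 length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def matryoshka_slice(z_full, dim):
    """Keep the first `dim` coordinates of an embedding and renormalize."""
    if not 0 < dim <= len(z_full):
        raise ValueError(f"dim must be in 1..{len(z_full)}, got {dim}")
    return l2norm(z_full[:dim])


# Synthetic stand-in for a 1280-d model output (assumption for illustration).
z1280 = l2norm([math.sin(i + 1) for i in range(1280)])
z768 = matryoshka_slice(z1280, 768)
z512 = matryoshka_slice(z1280, 512)
```

Renormalizing after truncation matters because a prefix of a unit vector generally has norm below 1, so dot products between raw prefixes would no longer be cosine similarities.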

The release safetensors file is self-contained and includes the text encoder,
image encoder, merged native audio encoder, and the three projection heads.

## Evaluation Scope

This release uses a human-memory evaluation slice rather than a broad
leaderboard sweep. The slice is chosen to match practical memory augmentation
surfaces:

- text continuity: duplicate-question and semantic textual similarity tasks
- image recall: Flickr30k text-image and image-text retrieval
- audio recall: speech/general-audio text-audio retrieval tasks

Primary metrics:

- text continuity: `main_score`
- image recall: `NDCG@10`
- audio recall: `NDCG@10`
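For readers unfamiliar with the retrieval metric, the following is a minimal sketch of `NDCG@10` under binary relevance. The card does not specify the exact variant used by the evaluation harness (graded relevance, tie handling), so treat this as an assumption-laden illustration; `ndcg_at_k` is a hypothetical helper name.

```python
import math


def ndcg_at_k(ranked_relevance, num_relevant, k=10):
    """NDCG@k for binary relevance.

    ranked_relevance: 0/1 labels of the retrieved items, best-ranked first.
    num_relevant: total number of relevant items that exist for the query.
    """
    # Discounted gain of the actual ranking, truncated to the top k.
    dcg = sum(rel / math.log2(rank + 2)
              for rank, rel in enumerate(ranked_relevance[:k]))
    # Ideal ranking places every relevant item first.
    ideal_hits = min(num_relevant, k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

With a single relevant item per query, a hit at rank 1 scores 1.0 and a hit at rank 2 scores `1 / log2(3)`, which is why `NDCG@10` sits between `R@1` and `R@10` in the task table.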

## Human-Memory Slice

Source: `aist87m_memory_slice_release_report.md` and
`aist87m_memory_slice_release_report.json`.

| Dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
|---:|---:|---:|---:|---:|---:|
| 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
| 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
| 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |

Selected 1280d task scores:

| Task | Family | Metric | Score | R@1 | R@10 |
|---|---|---|---:|---:|---:|
| SprintDuplicateQuestions | Text continuity | main_score | 0.875 | - | - |
| STSBenchmark | Text continuity | main_score | 0.651 | - | - |
| Flickr30kT2IRetrieval | Image recall | NDCG@10 | 0.469 | 0.296 | 0.672 |
| Flickr30kI2TRetrieval | Image recall | NDCG@10 | 0.381 | 0.082 | 0.407 |
| CommonVoiceMini21T2ARetrieval | Audio recall | NDCG@10 | 0.028 | 0.006 | 0.062 |
| MACST2ARetrieval | Audio recall | NDCG@10 | 0.110 | 0.033 | 0.214 |
| UrbanSound8KT2ARetrieval | Audio recall | NDCG@10 | 0.009 | 0.002 | 0.018 |
| ClothoT2ARetrieval | Audio recall | NDCG@10 | 0.269 | 0.128 | 0.443 |
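The 1280d aggregate row is consistent with unweighted means of the rounded task scores above: each family column averages its own tasks, and `Overall` appears to average all 8 tasks rather than the three family means. A small sketch that recomputes them (the helper names are hypothetical; the scores are copied from the table):

```python
from collections import defaultdict

# (task, family, score) copied from the 1280d task table.
SCORES = [
    ("SprintDuplicateQuestions", "text", 0.875),
    ("STSBenchmark", "text", 0.651),
    ("Flickr30kT2IRetrieval", "image", 0.469),
    ("Flickr30kI2TRetrieval", "image", 0.381),
    ("CommonVoiceMini21T2ARetrieval", "audio", 0.028),
    ("MACST2ARetrieval", "audio", 0.110),
    ("UrbanSound8KT2ARetrieval", "audio", 0.009),
    ("ClothoT2ARetrieval", "audio", 0.269),
]


def family_means(rows):
    """Unweighted mean score per task family."""
    by_family = defaultdict(list)
    for _task, family, score in rows:
        by_family[family].append(score)
    return {family: sum(vals) / len(vals) for family, vals in by_family.items()}


def overall_mean(rows):
    """Unweighted mean over all tasks (not over family means)."""
    return sum(score for _task, _family, score in rows) / len(rows)


means = family_means(SCORES)
overall = overall_mean(SCORES)
```

Because four of the eight tasks are audio tasks, the task-level mean weights audio recall more heavily than a mean over the three family scores would.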

## Task-Aligned Comparisons

Comparisons below are only for locally available, task-aligned runs from the
same raw AIST line and its audio baselines.

| Comparison | Dim | Paired tasks | Summary |
|---|---:|---:|---|
| vs native `mn20_as` audio baseline | 768 | 4 | slightly lower selected audio recall on average; UrbanSound8K is flat |
| vs dual-audio tower | 768 | 6 | smaller single-audio runtime, but lower paired text/image/audio scores |
| vs `AIST-95M` | 1280 | 2 | only paired Flickr tasks are available locally; `AIST-95M` remains stronger on that pair |

This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model.
Broad diagnostic runs contain many task families that are not part of this
release gate.

## Runtime Footprint vs Dual-Audio Tower

`AIST-87M` replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny
audio branches with one merged native `mn20_as` EfficientAT encoder. The result
is a smaller deployed path with the same 1280d output contract.

| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---:|---:|---:|
| Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
| Audio encoders | 1 | 2 | removes Whisper branch |
| Audio encoder parameters | 19,886,566 | 26,117,671 | -23.9% |
| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
| Audio projection input width | 1,280 | 2,304 | -44.4% |
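The percentage deltas in the table follow from the two value columns as a relative change against the dual-audio baseline. A minimal sketch that reproduces them (`relative_delta_pct` is a hypothetical helper; the values are copied from the table):

```python
def relative_delta_pct(new, baseline):
    """Relative change of `new` versus `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# (AIST-87M value, AIST-95M value) copied from the footprint table.
FOOTPRINT = {
    "loaded_parameters": (87_118_774, 95_315_959),
    "safetensors_mb": (348.9, 381.9),
    "audio_encoder_parameters": (19_886_566, 26_117_671),
    "audio_path_parameters": (32_193_126, 40_390_311),
    "audio_projection_input_width": (1_280, 2_304),
}

deltas = {name: round(relative_delta_pct(new, old), 1)
          for name, (new, old) in FOOTPRINT.items()}
```

Note that the absolute parameter saving (8,197,185) is the same for the whole model and for the audio path, since only the audio path changed; it is a larger fraction of the smaller audio-path total.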

Exact-gate tradeoff against the same dual-audio local baseline:

| 1280d exact-gate slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---:|---:|---:|
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |

Audio-stack throughput was measured in PyTorch on an NVIDIA L4, using synthetic
10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder ->
projection -> normalized embedding. Median wall time is taken over 50 timed
iterations after 20 warmup iterations. The timing excludes audio file decode,
dataset download, and MTEB result serialization.

| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
|---:|---:|---:|---:|---:|---:|
| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |
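The throughput and speedup columns can be derived from the batch size and median wall time, assuming every clip is 10 s long as stated above. The sketch below uses hypothetical helper names and the rounded medians from the table, so results can differ from the printed figures in the last digit.

```python
CLIP_SECONDS = 10.0  # synthetic clip length stated for the benchmark


def throughput(batch_size, median_ms, clip_seconds=CLIP_SECONDS):
    """Return (clips per second, seconds of audio per second)."""
    clips_per_s = batch_size / (median_ms / 1000.0)
    return clips_per_s, clips_per_s * clip_seconds


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Batch 8 row: AIST-87M at 16.46 ms vs the dual-audio tower at 60.29 ms.
clips_87m, audio_s_87m = throughput(8, 16.46)
clips_95m, _ = throughput(8, 60.29)
batch8_speedup = speedup(60.29, 16.46)
```

For both models, clips per second is higher at batch 8 than at batch 16 in the table, which suggests batch 8 is closer to the throughput sweet spot on this GPU for these measurements.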

Projection-only throughput at feature batch 2048 is also higher for the
single-audio path: 314k features/s for `AIST-87M` vs 282k features/s for the
dual-audio tower. Raw benchmark output is included as
`aist87m_vs_dual_audio_throughput_l4_20260504.json`.
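The warmup-then-median protocol described for the audio-stack benchmark can be sketched as a small timing harness. This is not the benchmark script shipped with the release; `median_wall_ms` is a hypothetical helper, and the workload below is a CPU stand-in for the real encoder call.

```python
import statistics
import time


def median_wall_ms(fn, warmup=20, iters=50):
    """Median wall-clock time of `fn` in milliseconds.

    Runs `warmup` untimed calls first, then times `iters` calls.
    For GPU work, `fn` itself should block until the device is done
    (for example by calling torch.cuda.synchronize() before returning),
    otherwise only the kernel launch is timed.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def dummy_workload():
    # Placeholder for waveform -> encoder -> projection -> normalize.
    return sum(i * i for i in range(1000))


elapsed_ms = median_wall_ms(dummy_workload)
```

The median is less sensitive than the mean to occasional slow iterations, which is a common reason to report it for short GPU benchmarks.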

## Architecture

```text
Text   -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
Image  -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
Audio  -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280
```

The audio encoder in this artifact is the merged native checkpoint:

`mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt`

## Parameter Count

| Component | Params |
|---|---:|
| Text encoder (`MongoDB/mdbr-leaf-ir`) | 22,861,056 |
| Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`) | 8,434,512 |
| Audio encoder (merged native `mn20_as`) | 19,886,566 |
| Image projection head | 12,306,560 |
| Audio projection head | 12,306,560 |
| Text projection head | 11,323,520 |
| **Total exact loaded params** | **87,118,774** |
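As a sanity check, the component counts above sum exactly to the stated total, and rounding that total to integer millions yields the `87M` suffix. The sketch below assumes the naming rule is plain rounding to the nearest million; `canonical_suffix` is a hypothetical helper.

```python
# Component parameter counts copied from the table above.
COMPONENT_PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder": 19_886_566,
    "image_projection_head": 12_306_560,
    "audio_projection_head": 12_306_560,
    "text_projection_head": 11_323_520,
}

total_params = sum(COMPONENT_PARAMS.values())


def canonical_suffix(param_count):
    """Parameter count rounded to integer millions, e.g. '87M'."""
    return f"{round(param_count / 1_000_000)}M"


model_name = f"AIST-{canonical_suffix(total_params)}"
```

The three projection heads together account for roughly 41% of the loaded parameters, more than any single encoder.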

## Files

| File | Purpose |
|---|---|
| `AIST-87M.safetensors` | Self-contained release artifact |
| `aist_81m_raw_mn20_lora.yaml` | Training recipe for the source run |
| `manifest.json` | Release manifest with checksums and eval coverage |
| `parameter_breakdown.json` | Exact parameter accounting |
| `aist87m_memory_slice_release_report.md` | Human-memory slice report |
| `aist87m_memory_slice_release_report.json` | Machine-readable evaluation summary |
| `aist87m_vs_dual_audio_throughput_l4_20260504.json` | L4 throughput benchmark vs dual-audio tower |

## Caveats

- The model is optimized and reported for memory-relevant embedding surfaces,
  not broad leaderboard coverage.
- The single-audio path is smaller and simpler than the dual-audio tower, but
  it does not dominate the dual-audio tower on paired diagnostic scores.
- 1280d, 768d, and 512d human-memory slices are complete for the release checkpoint.