---
language:
- en
license: apache-2.0
tags:
- multimodal
- embedding
- trimodal
- retrieval
- image-text-audio
- audio
- speech
- memory-augmentation
- feature-extraction
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---
# AIST-87M
`AIST-87M` is a compact audio + image + speech + text embedding model for
human-memory augmentation workloads.

It is the single-audio evolution of the earlier dual-audio tower line: the
runtime audio path uses one merged native `mn20_as` EfficientAT encoder instead
of separate EfficientAT + Whisper branches. The LoRA training weights are
merged into the native audio encoder in this release artifact, so there is no
separate LoRA pass at inference time.

Core stack:
- text: `MongoDB/mdbr-leaf-ir`
- image: `mobilenetv4_conv_medium.e180_r384_in12k`
- audio: native merged `mn20_as` EfficientAT encoder
- projection output: `1280d`
- Matryoshka slices: `[1280, 768, 512, 256, 128]`
- exact loaded params: `87,118,774`

The canonical name follows the Augmem naming standard:
- `AIST` = audio + image + speech + text
- `87M` = exact loaded parameter count rounded to integer millions
## Runtime Contract
This model returns L2-normalized embeddings in a shared 1280-dimensional space.
For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:
```text
z1280 = l2norm(model(input))
z768 = l2norm(z1280[0:768])
z512 = l2norm(z1280[0:512])
```
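
The truncate-and-renormalize rule above can be sketched without any dependencies. The helper names (`l2norm`, `matryoshka_slice`, `cosine`) and the stand-in vector are illustrative assumptions, not part of the release. Real code would apply the same two steps to the embedding returned by the model. Because every slice is renormalized, a plain dot product between two slices of the same width equals their cosine similarity.

```python
import math

MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)  # slices listed in this card


def l2norm(vec):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def matryoshka_slice(z1280, dim):
    """Keep the first `dim` coordinates of a 1280-d embedding and renormalize."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported slice {dim}; expected one of {MATRYOSHKA_DIMS}")
    if len(z1280) != 1280:
        raise ValueError("expected a 1280-d embedding")
    return l2norm(z1280[:dim])


def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Stand-in vector for illustration only; a real embedding comes from the model.
z1280 = l2norm([math.sin(i + 1) for i in range(1280)])
z768 = matryoshka_slice(z1280, 768)
z512 = matryoshka_slice(z1280, 512)
```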
The release safetensors file is self-contained and includes the text encoder,
image encoder, merged native audio encoder, and the three projection heads.
## Evaluation Scope
This release uses a human-memory evaluation slice rather than a broad
leaderboard sweep. The slice is chosen to match practical memory augmentation
surfaces:
- text continuity: duplicate-question and semantic textual similarity tasks
- image recall: Flickr30k text-image and image-text retrieval
- audio recall: speech/general-audio text-audio retrieval tasks

Primary metrics:
- text continuity: `main_score`
- image recall: `NDCG@10`
- audio recall: `NDCG@10`
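
For readers unfamiliar with the retrieval metrics, the following is a minimal sketch of NDCG@k and recall@k under two assumptions: relevance is binary and result ids are unique. The function names and sample ids are hypothetical, and the evaluation harness behind the reported numbers may differ in detail (for example graded relevance or tie handling).

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query whose single relevant clip is ranked second.
ranked = ["clip_7", "clip_3", "clip_9", "clip_1"]
relevant = {"clip_3"}
print(recall_at_k(ranked, relevant, 1), recall_at_k(ranked, relevant, 10))
print(round(ndcg_at_k(ranked, relevant, 10), 3))
```

A relevant item at position 2 contributes `1 / log2(3)`, so a query can miss R@1 and still earn partial NDCG@10 credit.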
## Human-Memory Slice
Source: `aist87m_memory_slice_release_report.md` and
`aist87m_memory_slice_release_report.json`.

| Dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
|---:|---:|---:|---:|---:|---:|
| 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
| 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
| 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |

Selected 1280d task scores:

| Task | Family | Metric | Score | R@1 | R@10 |
|---|---|---|---:|---:|---:|
| SprintDuplicateQuestions | Text continuity | main_score | 0.875 | - | - |
| STSBenchmark | Text continuity | main_score | 0.651 | - | - |
| Flickr30kT2IRetrieval | Image recall | NDCG@10 | 0.469 | 0.296 | 0.672 |
| Flickr30kI2TRetrieval | Image recall | NDCG@10 | 0.381 | 0.082 | 0.407 |
| CommonVoiceMini21T2ARetrieval | Audio recall | NDCG@10 | 0.028 | 0.006 | 0.062 |
| MACST2ARetrieval | Audio recall | NDCG@10 | 0.110 | 0.033 | 0.214 |
| UrbanSound8KT2ARetrieval | Audio recall | NDCG@10 | 0.009 | 0.002 | 0.018 |
| ClothoT2ARetrieval | Audio recall | NDCG@10 | 0.269 | 0.128 | 0.443 |
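
The 1280d aggregates in the slice table are consistent with unweighted means of the task scores listed here. This is an inference from the published numbers, not a statement of how the release report computes them. The sketch below reproduces the arithmetic; the dictionary layout is an illustrative choice.

```python
from statistics import mean

# Primary scores copied from the 1280d task table, grouped by family.
SCORES_1280D = {
    "Text continuity": [0.875, 0.651],
    "Image recall": [0.469, 0.381],
    "Audio recall": [0.028, 0.110, 0.009, 0.269],
}

# Per-family mean over the tasks in that family.
family_means = {family: mean(scores) for family, scores in SCORES_1280D.items()}

# Mean over all eight tasks (each task weighted equally, not each family).
overall = mean(score for scores in SCORES_1280D.values() for score in scores)

for family, value in family_means.items():
    print(f"{family}: {value:.3f}")
print(f"Overall: {overall:.3f}")
```

Averaging the three family means instead would give a different figure, so under this reading the Overall column weights tasks rather than families, and the four audio tasks carry half of it.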
## Task-Aligned Comparisons
Comparisons below are only for locally available, task-aligned runs from the
same raw AIST line and its audio baselines.

| Comparison | Dim | Paired tasks | Read |
|---|---:|---:|---|
| vs native `mn20_as` audio baseline | 768 | 4 | slightly lower selected audio recall on average; UrbanSound8K is flat |
| vs dual-audio tower | 768 | 6 | smaller single-audio runtime, but lower paired text/image/audio scores |
| vs `AIST-95M` | 1280 | 2 | only paired Flickr tasks are available locally; `AIST-95M` remains stronger on that pair |

This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model.
Broad diagnostic runs contain many task families that are not part of this
release gate.
## Runtime Footprint vs Dual-Audio Tower
`AIST-87M` replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny
audio branches with one merged native `mn20_as` EfficientAT encoder. The result
is a smaller deployed path with the same 1280d output contract.

| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---:|---:|---:|
| Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
| Audio encoders | 1 | 2 | removes Whisper branch |
| Audio encoder parameters | 19,886,566 | 26,117,671 | -23.9% |
| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
| Audio projection input width | 1,280 | 2,304 | -44.4% |

Exact-gate tradeoff against the same dual-audio local baseline:

| 1280d exact-gate slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---:|---:|---:|
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |

PyTorch audio-stack throughput was measured on an NVIDIA L4 using synthetic
10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder ->
projection -> normalized embedding. Median wall time is taken over 50 timed
iterations after 20 warmup iterations. The timing excludes audio file decode,
dataset download, and MTEB result serialization.

| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
|---:|---:|---:|---:|---:|---:|
| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |

Projection-only throughput at feature batch 2048 is also higher for the
single-audio path: 314k features/s for `AIST-87M` vs 282k features/s for the
dual-audio tower. Raw benchmark output is included as
`aist87m_vs_dual_audio_throughput_l4_20260504.json`.
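
The throughput and speedup columns follow from the median batch times. Below is a small sketch of that arithmetic, assuming every clip is 10 s long as stated in the benchmark description. The function names are illustrative. Values recomputed from the rounded medians can differ from the table in the last digit, since the table was presumably derived from unrounded timings.

```python
CLIP_SECONDS = 10  # synthetic clip length stated in the benchmark description


def throughput(batch_size, median_ms):
    """Return (clips per second, audio-seconds per second) for one batch size."""
    clips_per_s = batch_size / (median_ms / 1000.0)
    return clips_per_s, clips_per_s * CLIP_SECONDS


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# (batch, AIST-87M median ms, AIST-95M median ms) copied from the table above.
ROWS = [(1, 5.36, 10.50), (8, 16.46, 60.29), (16, 41.19, 133.95)]

for batch, ms_87m, ms_95m in ROWS:
    clips, audio_s = throughput(batch, ms_87m)
    print(
        f"batch {batch}: {clips:.1f} clips/s, {audio_s:.0f} audio-s/s, "
        f"{speedup(ms_95m, ms_87m):.2f}x vs dual-audio"
    )
```

For `AIST-87M`, clips/s at batch 16 is lower than at batch 8 in these measurements, so the larger batch does not improve throughput on this setup.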
## Architecture
```text
Text -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
Image -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
Audio -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280
```
The audio encoder in this artifact is the merged native checkpoint:
`mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt`
## Parameter Count
| Component | Params |
|---|---:|
| Text encoder (`MongoDB/mdbr-leaf-ir`) | 22,861,056 |
| Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`) | 8,434,512 |
| Audio encoder (merged native `mn20_as`) | 19,886,566 |
| Image projection head | 12,306,560 |
| Audio projection head | 12,306,560 |
| Text projection head | 11,323,520 |
| **Total exact loaded params** | **87,118,774** |
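
As a quick consistency check, the component counts can be summed to recover the total, the `87M` name suffix, and the audio-path figure from the footprint table. This is a sketch using numbers copied from this card. The dictionary keys and variable names are illustrative and do not mirror the layout of `parameter_breakdown.json`.

```python
# Component counts copied from the Parameter Count table.
COMPONENT_PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder": 19_886_566,
    "image_projection": 12_306_560,
    "audio_projection": 12_306_560,
    "text_projection": 11_323_520,
}
DUAL_AUDIO_TOTAL = 95_315_959  # AIST-95M loaded parameters, from the footprint table

total = sum(COMPONENT_PARAMS.values())

# Naming rule from this card: parameter count rounded to integer millions.
model_name = f"AIST-{round(total / 1_000_000)}M"

# Audio path = audio encoder plus its projection head.
audio_path = COMPONENT_PARAMS["audio_encoder"] + COMPONENT_PARAMS["audio_projection"]

# Relative change in loaded parameters versus the dual-audio tower.
delta_pct = 100.0 * (total - DUAL_AUDIO_TOTAL) / DUAL_AUDIO_TOTAL

print(model_name, f"{total:,}", f"{audio_path:,}", f"{delta_pct:+.1f}%")
```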
## Files
| File | Purpose |
|---|---|
| `AIST-87M.safetensors` | Self-contained release artifact |
| `aist_81m_raw_mn20_lora.yaml` | Training recipe for the source run |
| `manifest.json` | Release manifest with checksums and eval coverage |
| `parameter_breakdown.json` | Exact parameter accounting |
| `aist87m_memory_slice_release_report.md` | Human-memory slice report |
| `aist87m_memory_slice_release_report.json` | Machine-readable evaluation summary |
| `aist87m_vs_dual_audio_throughput_l4_20260504.json` | L4 throughput benchmark vs dual-audio tower |
## Caveats
- The model is optimized and reported for memory-relevant embedding surfaces,
not broad leaderboard coverage.
- The single-audio path is smaller and simpler than the dual-audio tower, but
it does not dominate the dual-audio tower on paired diagnostic scores.
- 1280d, 768d, and 512d human-memory slices are complete for the release checkpoint.