---
language:
- en
license: apache-2.0
tags:
- multimodal
- embedding
- trimodal
- retrieval
- image-text-audio
- audio
- speech
- memory-augmentation
- feature-extraction
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---

# AIST-87M

`AIST-87M` is a compact audio + image + speech + text embedding model for
human-memory augmentation workloads.

It is the single-audio successor to the earlier dual-audio tower line: the
runtime audio path uses one merged native `mn20_as` EfficientAT encoder instead
of separate EfficientAT and Whisper branches. The LoRA weights from training
are merged into the native audio encoder in this release artifact, so inference
needs no separate LoRA pass.

Core stack:

- text: `MongoDB/mdbr-leaf-ir`
- image: `mobilenetv4_conv_medium.e180_r384_in12k`
- audio: native merged `mn20_as` EfficientAT encoder
- projection output: `1280d`
- Matryoshka slices: `[1280, 768, 512, 256, 128]`
- exact loaded params: `87,118,774`

The canonical name follows the Augmem naming standard:

- `AIST` = audio + image + speech + text
- `87M` = exact loaded parameter count rounded to integer millions

## Runtime Contract

This model returns L2-normalized embeddings in a shared 1280-dimensional space.
For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:

```text
z1280 = l2norm(model(input))
z768  = l2norm(z1280[0:768])
z512  = l2norm(z1280[0:512])
```
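
The truncate-and-renormalize step above can be sketched in plain Python. This is a minimal illustration, not the release's inference code: `l2norm` and `matryoshka_slice` are hypothetical helper names, and the four-element vector stands in for a real 1280-d model output.

```python
import math

def l2norm(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def matryoshka_slice(z_full, dim):
    """Keep the first `dim` coordinates, then renormalize to unit length."""
    if not 0 < dim <= len(z_full):
        raise ValueError(f"dim must be in 1..{len(z_full)}")
    return l2norm(z_full[:dim])

# Toy stand-in for a model output; a real embedding would have 1280 entries.
z_full = l2norm([3.0, 4.0, 12.0, 84.0])
z_small = matryoshka_slice(z_full, 2)  # analogous to z768 = l2norm(z1280[0:768])
```

Renormalizing after truncation matters because a prefix of a unit vector is generally shorter than unit length, so dot products between raw prefixes would no longer be cosine similarities.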

The release safetensors file is self-contained and includes the text encoder,
image encoder, merged native audio encoder, and the three projection heads.

## Evaluation Scope

This release uses a human-memory evaluation slice rather than a broad
leaderboard sweep. The slice is chosen to match practical memory augmentation
surfaces:

- text continuity: duplicate-question and semantic textual similarity tasks
- image recall: Flickr30k text-image and image-text retrieval
- audio recall: speech/general-audio text-audio retrieval tasks

Primary metrics:

- text continuity: `main_score`
- image recall: `NDCG@10`
- audio recall: `NDCG@10`
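
For readers unfamiliar with the retrieval metrics, the sketch below shows one common definition of NDCG@k and recall@k with binary relevance. It is an illustration under assumptions: the benchmark harness behind this card may use graded relevance or a different R@k convention, and the function names and clip ids are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: gain 1 for a relevant id, else 0."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    ideal = [1.0] * min(len(relevant_ids), k)  # best case: all relevant ids first
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical ranking for one text query; the single relevant clip is ranked second.
ranked = ["clip_7", "clip_2", "clip_9", "clip_4"]
relevant = {"clip_2"}
score = ndcg_at_k(ranked, relevant, k=10)
```

With a single relevant item, NDCG@10 depends only on the rank at which that item appears, which is why it is reported alongside R@1 and R@10 for the retrieval tasks.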

## Human-Memory Slice

Source: `aist87m_memory_slice_release_report.md` and
`aist87m_memory_slice_release_report.json`.

| Dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
|---:|---:|---:|---:|---:|---:|
| 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
| 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
| 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |

Selected 1280d task scores:

| Task | Family | Metric | Score | R@1 | R@10 |
|---|---|---|---:|---:|---:|
| SprintDuplicateQuestions | Text continuity | main_score | 0.875 | - | - |
| STSBenchmark | Text continuity | main_score | 0.651 | - | - |
| Flickr30kT2IRetrieval | Image recall | NDCG@10 | 0.469 | 0.296 | 0.672 |
| Flickr30kI2TRetrieval | Image recall | NDCG@10 | 0.381 | 0.082 | 0.407 |
| CommonVoiceMini21T2ARetrieval | Audio recall | NDCG@10 | 0.028 | 0.006 | 0.062 |
| MACST2ARetrieval | Audio recall | NDCG@10 | 0.110 | 0.033 | 0.214 |
| UrbanSound8KT2ARetrieval | Audio recall | NDCG@10 | 0.009 | 0.002 | 0.018 |
| ClothoT2ARetrieval | Audio recall | NDCG@10 | 0.269 | 0.128 | 0.443 |
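
As a consistency check, the rounded 1280d task scores above can be averaged per family and across all eight tasks. This sketch assumes unweighted means; the published aggregates were presumably computed from unrounded scores in the release report, so agreement is only expected to about three decimals.

```python
from statistics import mean

# Rounded 1280d task scores copied from the table above, grouped by family.
scores_1280d = {
    "Text continuity": [0.875, 0.651],
    "Image recall": [0.469, 0.381],
    "Audio recall": [0.028, 0.110, 0.009, 0.269],
}

# Unweighted mean within each family.
family_avg = {family: mean(vals) for family, vals in scores_1280d.items()}

# Unweighted mean over all eight tasks (not over the three family averages).
all_scores = [s for vals in scores_1280d.values() for s in vals]
overall = mean(all_scores)

for family, avg in family_avg.items():
    print(f"{family}: {avg:.3f}")
print(f"Overall: {overall:.3f}")
```

Under this assumption the overall figure lines up with a mean over tasks rather than a mean over families, so the four audio tasks carry half of the overall weight.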

## Task-Aligned Comparisons

The comparisons below cover only locally available, task-aligned runs from the
same raw AIST line and its audio baselines.

| Comparison | Dim | Paired tasks | Takeaway |
|---|---:|---:|---|
| vs native `mn20_as` audio baseline | 768 | 4 | slightly lower selected audio recall on average; UrbanSound8K is flat |
| vs dual-audio tower | 768 | 6 | smaller single-audio runtime, but lower paired text/image/audio scores |
| vs `AIST-95M` | 1280 | 2 | only paired Flickr tasks are available locally; `AIST-95M` remains stronger on that pair |

This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model.
Broad diagnostic runs contain many task families that are not part of this
release gate.

## Runtime Footprint vs Dual-Audio Tower

`AIST-87M` replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny
audio branches with one merged native `mn20_as` EfficientAT encoder. The result
is a smaller deployed path with the same 1280d output contract.

| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---:|---:|---:|
| Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
| Audio encoders | 1 | 2 | removes Whisper branch |
| Audio encoder parameters | 19,886,566 | 26,117,671 | -23.9% |
| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
| Audio projection input width | 1,280 | 2,304 | -44.4% |
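
The percentage deltas in the table follow from a plain relative-change formula. The sketch below recomputes them from the listed counts; `relative_delta_pct` is a hypothetical helper name, and the safetensors row is left out because its sizes are already rounded.

```python
def relative_delta_pct(new, baseline):
    """Signed percentage change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline

# (AIST-87M, AIST-95M dual-audio tower) values copied from the table above.
rows = {
    "Loaded parameters": (87_118_774, 95_315_959),
    "Audio encoder parameters": (19_886_566, 26_117_671),
    "Audio path parameters incl. projection": (32_193_126, 40_390_311),
    "Audio projection input width": (1_280, 2_304),
}

deltas = {name: round(relative_delta_pct(a, b), 1) for name, (a, b) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

The absolute difference in loaded parameters equals the difference in audio path parameters, which is what one would expect if the audio path is the only part that changed between the two models.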

Exact-gate tradeoff against the same dual-audio local baseline:

| 1280d exact-gate slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---:|---:|---:|
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |

PyTorch audio-stack throughput was measured on an NVIDIA L4 using synthetic
10 s, 32 kHz CPU waveforms passed through the full path: waveform -> audio
encoder -> projection -> normalized embedding. Median wall time is taken over
50 timed iterations after 20 warmup iterations. The measurement excludes audio
file decode, dataset download, and MTEB result serialization.

| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
|---:|---:|---:|---:|---:|---:|
| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |
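
The throughput and speedup columns can be derived from the median times. The sketch below assumes clips/s is batch size divided by median wall time and that audio-s/s multiplies that by the 10 s clip length; `throughput` is a hypothetical helper name.

```python
CLIP_SECONDS = 10.0  # each synthetic clip is 10 s long, per the benchmark setup

def throughput(batch_size, median_ms, clip_seconds=CLIP_SECONDS):
    """Return (clips per second, audio-seconds per second) for one batch size."""
    clips_per_s = batch_size / (median_ms / 1000.0)
    return clips_per_s, clips_per_s * clip_seconds

# (batch, AIST-87M median ms, AIST-95M median ms) copied from the table above.
rows = [(1, 5.36, 10.50), (8, 16.46, 60.29), (16, 41.19, 133.95)]

for batch, ms_87m, ms_95m in rows:
    clips_87m, audio_87m = throughput(batch, ms_87m)
    speedup = ms_95m / ms_87m  # ratio of median wall times at the same batch size
    print(f"batch={batch:>2}  {clips_87m:6.1f} clips/s  "
          f"{audio_87m:7.0f} audio-s/s  {speedup:.2f}x")
```

Because the table's medians are rounded to two decimals, recomputed throughput can differ from the published figures in the last digit.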

Projection-only throughput at feature batch 2048 is also higher for the
single-audio path: 314k features/s for `AIST-87M` vs 282k features/s for the
dual-audio tower. Raw benchmark output is included as
`aist87m_vs_dual_audio_throughput_l4_20260504.json`.
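
The measurement protocol described in this section (warmup iterations followed by timed iterations, reporting the median) can be sketched as a small harness. This is not the benchmark script behind the published numbers: `median_wall_ms` and `dummy_audio_stack` are hypothetical, and the workload is a CPU stand-in for the real audio path.

```python
import statistics
import time

def median_wall_ms(fn, warmup=20, iters=50):
    """Median wall-clock time of `fn()` in milliseconds after warmup calls."""
    for _ in range(warmup):
        fn()  # warmup calls are executed but not timed
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def dummy_audio_stack():
    # Hypothetical stand-in for waveform -> encoder -> projection -> l2norm.
    # On a GPU, the timed callable would also need to wait for the device to
    # finish (for example via torch.cuda.synchronize()) before the timer stops.
    return sum(i * i for i in range(10_000))

ms = median_wall_ms(dummy_audio_stack)
print(f"median wall time: {ms:.3f} ms")
```

Reporting the median rather than the mean keeps occasional slow iterations from skewing the result.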

## Architecture

```text
Text   -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
Image  -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
Audio  -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280
```
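
Because all three towers project into the same 1280-d space and outputs are L2-normalized, cross-modal retrieval can be expressed as a dot-product ranking. The sketch below assumes cosine similarity is the intended comparison; the 3-d vectors and ids are toy values, not model outputs.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def rank_by_similarity(query, candidates):
    """Return candidate ids sorted by descending dot product with the query."""
    scored = [(dot(query, vec), cand_id) for cand_id, vec in candidates.items()]
    return [cand_id for _, cand_id in sorted(scored, reverse=True)]

# Toy unit vectors standing in for embeddings from different towers.
text_query = [1.0, 0.0, 0.0]
gallery = {
    "image_a": [0.6, 0.8, 0.0],
    "audio_b": [0.8, 0.0, 0.6],
    "image_c": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity(text_query, gallery)
```

A mixed gallery of image and audio embeddings is ranked in a single pass here, which is the practical benefit of a shared output space.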

The audio encoder in this artifact is the merged native checkpoint:

`mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt`

## Parameter Count

| Component | Params |
|---|---:|
| Text encoder (`MongoDB/mdbr-leaf-ir`) | 22,861,056 |
| Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`) | 8,434,512 |
| Audio encoder (merged native `mn20_as`) | 19,886,566 |
| Image projection head | 12,306,560 |
| Audio projection head | 12,306,560 |
| Text projection head | 11,323,520 |
| **Total exact loaded params** | **87,118,774** |
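
The total and the `87M` name suffix can be reproduced from the component counts. The dictionary keys below are illustrative labels chosen for this sketch, not necessarily the keys used in `parameter_breakdown.json`.

```python
# Component parameter counts copied from the table above.
components = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder": 19_886_566,
    "image_projection": 12_306_560,
    "audio_projection": 12_306_560,
    "text_projection": 11_323_520,
}

total = sum(components.values())

# Audio path = audio encoder plus its projection head.
audio_path = components["audio_encoder"] + components["audio_projection"]

# Name suffix: exact loaded parameter count rounded to integer millions.
name = f"AIST-{round(total / 1_000_000)}M"

print(total, audio_path, name)
```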

## Files

| File | Purpose |
|---|---|
| `AIST-87M.safetensors` | Self-contained release artifact |
| `aist_81m_raw_mn20_lora.yaml` | Training recipe for the source run |
| `manifest.json` | Release manifest with checksums and eval coverage |
| `parameter_breakdown.json` | Exact parameter accounting |
| `aist87m_memory_slice_release_report.md` | Human-memory slice report |
| `aist87m_memory_slice_release_report.json` | Machine-readable evaluation summary |
| `aist87m_vs_dual_audio_throughput_l4_20260504.json` | L4 throughput benchmark vs dual-audio tower |

## Caveats

- The model is optimized and reported for memory-relevant embedding surfaces,
  not broad leaderboard coverage.
- The single-audio path is smaller and simpler than the dual-audio tower, but
  it does not dominate the dual-audio tower on paired diagnostic scores.
- Human-memory slice results are complete for the 1280d, 768d, and 512d outputs
  of the release checkpoint; this card reports no results for the 256d and 128d
  Matryoshka slices.