---
language:
- en
license: apache-2.0
tags:
- multimodal
- embedding
- trimodal
- retrieval
- image-text-audio
- audio
- speech
- memory-augmentation
- feature-extraction
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---

# AIST-87M

`AIST-87M` is a compact audio + image + speech + text embedding model for
human-memory augmentation workloads.

It is the single-audio evolution of the earlier dual-audio tower line: the
runtime audio path uses one merged native `mn20_as` EfficientAT encoder instead
of a separate EfficientAT + Whisper dual branch. The LoRA training weights are
merged into the native audio encoder in this release artifact, so there is no
separate LoRA pass at inference time.

Core stack:

- text: `MongoDB/mdbr-leaf-ir`
- image: `mobilenetv4_conv_medium.e180_r384_in12k`
- audio: native merged `mn20_as` EfficientAT encoder
- projection output: `1280d`
- Matryoshka slices: `[1280, 768, 512, 256, 128]`
- exact loaded params: `87,118,774`

The canonical name follows the Augmem naming standard:

- `AIST` = audio + image + speech + text
- `87M` = exact loaded parameter count rounded to integer millions

## Runtime Contract

This model returns L2-normalized embeddings in a shared 1280-dimensional space.
For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:

```text
z1280 = l2norm(model(input))
z768 = l2norm(z1280[0:768])
z512 = l2norm(z1280[0:512])
```
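The truncate-and-renormalize rule above can be sketched in plain Python. This is only an illustration under stated assumptions: `l2norm` and `matryoshka_slice` are hypothetical helper names rather than functions shipped with the release, and a synthetic vector stands in for a real `model(input)` output.

```python
import math
from typing import List, Sequence

# Slice widths listed in the card's core stack.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def l2norm(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 length (an all-zero vector is returned as is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def matryoshka_slice(z_full: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates of a full embedding, then renormalize."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported Matryoshka slice: {dim}")
    return l2norm(list(z_full[:dim]))


# Synthetic stand-in for a real 1280-d model output (assumption for the demo).
z1280 = l2norm([math.sin(i + 1) for i in range(1280)])
z768 = matryoshka_slice(z1280, 768)
z512 = matryoshka_slice(z1280, 512)
```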

The release safetensors file is self-contained and includes the text encoder,
image encoder, merged native audio encoder, and the three projection heads.

## Evaluation Scope

This release uses a human-memory evaluation slice rather than a broad
leaderboard sweep. The slice is chosen to match practical memory augmentation
surfaces:

- text continuity: duplicate-question and semantic textual similarity tasks
- image recall: Flickr30k text-image and image-text retrieval
- audio recall: speech/general-audio text-audio retrieval tasks

Primary metrics:

- text continuity: `main_score`
- image recall: `NDCG@10`
- audio recall: `NDCG@10`
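For readers unfamiliar with the retrieval metrics, the sketch below shows one common way to compute `NDCG@10` and `R@k` for a single query. It assumes binary relevance and defines `R@k` as the fraction of relevant items found in the top k. The card does not state the exact definitions used by the evaluation harness, so treat the function names and conventions here as illustrative assumptions.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    # A hit at 0-based position `pos` contributes 1 / log2(pos + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked_ids[:k])
        if item in relevant
    )
    # The ideal ranking places every relevant item first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: the single relevant clip is returned at rank 3.
ranking = ["clip_a", "clip_b", "clip_c", "clip_d"]
relevant = {"clip_c"}
r_at_1 = recall_at_k(ranking, relevant, k=1)    # 0.0
r_at_10 = recall_at_k(ranking, relevant, k=10)  # 1.0
ndcg_10 = ndcg_at_k(ranking, relevant, k=10)    # 1 / log2(4) = 0.5
```

`NDCG@10` rewards a relevant item more the higher it is ranked, while `R@k` only checks whether it appears in the top k.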

## Human-Memory Slice

Source: `aist87m_memory_slice_release_report.md` and
`aist87m_memory_slice_release_report.json`.

| Dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
|---:|---:|---:|---:|---:|---:|
| 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
| 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
| 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |

Selected 1280d task scores:

| Task | Family | Metric | Score | R@1 | R@10 |
|---|---|---|---:|---:|---:|
| SprintDuplicateQuestions | Text continuity | main_score | 0.875 | - | - |
| STSBenchmark | Text continuity | main_score | 0.651 | - | - |
| Flickr30kT2IRetrieval | Image recall | NDCG@10 | 0.469 | 0.296 | 0.672 |
| Flickr30kI2TRetrieval | Image recall | NDCG@10 | 0.381 | 0.082 | 0.407 |
| CommonVoiceMini21T2ARetrieval | Audio recall | NDCG@10 | 0.028 | 0.006 | 0.062 |
| MACST2ARetrieval | Audio recall | NDCG@10 | 0.110 | 0.033 | 0.214 |
| UrbanSound8KT2ARetrieval | Audio recall | NDCG@10 | 0.009 | 0.002 | 0.018 |
| ClothoT2ARetrieval | Audio recall | NDCG@10 | 0.269 | 0.128 | 0.443 |
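The 1280d summary row can be recomputed from the eight task scores. The check below is a reader-side sketch rather than the report's own aggregation code, and it works from the rounded scores printed above. The family columns match unweighted per-family means, and `Overall` matches the mean over all eight tasks rather than the mean of the three family averages.

```python
from collections import defaultdict
from statistics import mean

# (family, 1280d primary-metric score) copied from the task table above.
TASK_SCORES = {
    "SprintDuplicateQuestions": ("Text continuity", 0.875),
    "STSBenchmark": ("Text continuity", 0.651),
    "Flickr30kT2IRetrieval": ("Image recall", 0.469),
    "Flickr30kI2TRetrieval": ("Image recall", 0.381),
    "CommonVoiceMini21T2ARetrieval": ("Audio recall", 0.028),
    "MACST2ARetrieval": ("Audio recall", 0.110),
    "UrbanSound8KT2ARetrieval": ("Audio recall", 0.009),
    "ClothoT2ARetrieval": ("Audio recall", 0.269),
}

by_family = defaultdict(list)
for family, score in TASK_SCORES.values():
    by_family[family].append(score)

# Per-family means: 0.763 (text), 0.425 (image), 0.104 (audio).
family_means = {family: round(mean(vals), 3) for family, vals in by_family.items()}

# Mean over all eight tasks: 0.349, the reported Overall value.
overall_task_mean = round(mean([score for _, score in TASK_SCORES.values()]), 3)

# Mean of the three family averages: 0.431, which is NOT the reported Overall.
overall_family_mean = round(mean(list(family_means.values())), 3)
```

Because four of the eight tasks are audio retrieval, the task-level mean weights audio recall at one half of `Overall`.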

## Task-Aligned Comparisons

The comparisons below cover only locally available, task-aligned runs from the
same raw AIST line and its audio baselines.

| Comparison | Dim | Paired tasks | Takeaway |
|---|---:|---:|---|
| vs native `mn20_as` audio baseline | 768 | 4 | slightly lower selected audio recall on average; UrbanSound8K is flat |
| vs dual-audio tower | 768 | 6 | smaller single-audio runtime, but lower paired text/image/audio scores |
| vs `AIST-95M` | 1280 | 2 | only paired Flickr tasks are available locally; `AIST-95M` remains stronger on that pair |

This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model.
Broad diagnostic runs contain many task families that are not part of this
release gate.

## Runtime Footprint vs Dual-Audio Tower

`AIST-87M` replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny
audio branches with one merged native `mn20_as` EfficientAT encoder. The result
is a smaller deployed path with the same 1280d output contract.

| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---:|---:|---:|
| Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
| Audio encoders | 1 | 2 | removes Whisper branch |
| Audio encoder parameters | 19,886,566 | 26,117,671 | -23.9% |
| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
| Audio projection input width | 1,280 | 2,304 | -44.4% |

Exact-gate tradeoff against the same dual-audio local baseline:

| 1280d exact-gate slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---:|---:|---:|
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |

Audio-stack throughput was measured in PyTorch on an NVIDIA L4, using synthetic
10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder ->
projection -> normalized embedding. Median wall time is taken over 50 timed
iterations after 20 warmup iterations. The measurement excludes audio file
decode, dataset download, and MTEB result serialization. Throughput is reported
in clips per second and in seconds of audio processed per wall-clock second
(audio-s/s).

| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
|---:|---:|---:|---:|---:|---:|
| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |

Projection-only throughput at feature batch 2048 is also higher for the
single-audio path: 314k features/s for `AIST-87M` vs 282k features/s for the
dual-audio tower. Raw benchmark output is included as
`aist87m_vs_dual_audio_throughput_l4_20260504.json`.
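The timing protocol described above (20 warmup iterations, then the median over 50 timed iterations) and the derived throughput columns can be sketched as follows. This is not the benchmark script behind the JSON file: `median_wall_ms` and `throughput` are hypothetical helpers, and a trivial CPU workload stands in for the audio stack.

```python
import statistics
import time
from typing import Callable, Tuple


def median_wall_ms(step: Callable[[], object], warmup: int = 20, iters: int = 50) -> float:
    """Median wall-clock time of `step` in milliseconds, after warmup runs."""
    for _ in range(warmup):
        step()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        # If `step` launches GPU work, it should block until that work has
        # finished (for example via torch.cuda.synchronize()); otherwise only
        # the launch overhead is timed.
        step()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def throughput(batch: int, median_ms: float, clip_seconds: float = 10.0) -> Tuple[float, float]:
    """Return (clips per second, seconds of audio per second) for one batch size."""
    clips_per_s = batch / (median_ms / 1000.0)
    return clips_per_s, clips_per_s * clip_seconds


# Stand-in workload so the harness runs anywhere (assumption for the demo).
demo_ms = median_wall_ms(lambda: sum(range(1000)), warmup=2, iters=5)

# Recompute the batch-8 row of the table from its published medians.
clips_87m, audio_s_87m = throughput(batch=8, median_ms=16.46)
clips_95m, _ = throughput(batch=8, median_ms=60.29)
speedup = 60.29 / 16.46
```

Recomputing from the rounded medians reproduces the batch-8 row (486.0 clips/s, 3.66x). Other rows can differ in the last digit, presumably because the published figures were derived from unrounded timings.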

## Architecture

```text
Text -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
Image -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
Audio -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280
```
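Because all three heads project into the same L2-normalized space, cross-modal retrieval reduces to ranking candidates by dot product, which equals cosine similarity for unit vectors. The sketch below illustrates this with toy 3-d unit vectors in place of real 1280-d embeddings. The item names and `rank_by_similarity` are made up for the example and are not part of the release.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit-length."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Sort candidates from any modality by similarity to the query embedding."""
    scored = [(name, dot(query, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy unit vectors standing in for a text query and mixed-modality memories.
text_query = [1.0, 0.0, 0.0]
memories = {
    "photo_beach.jpg": [0.8, 0.6, 0.0],  # image-tower embedding (toy values)
    "clip_rain.wav": [0.0, 1.0, 0.0],    # audio-tower embedding (toy values)
    "note_trip.txt": [0.6, 0.0, 0.8],    # text-tower embedding (toy values)
}
ranked = rank_by_similarity(text_query, memories)
top_name = ranked[0][0]  # "photo_beach.jpg"
```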

The audio encoder in this artifact is the merged native checkpoint:

`mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt`

## Parameter Count

| Component | Params |
|---|---:|
| Text encoder (`MongoDB/mdbr-leaf-ir`) | 22,861,056 |
| Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`) | 8,434,512 |
| Audio encoder (merged native `mn20_as`) | 19,886,566 |
| Image projection head | 12,306,560 |
| Audio projection head | 12,306,560 |
| Text projection head | 11,323,520 |
| **Total exact loaded params** | **87,118,774** |
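The accounting above can be cross-checked with a few lines of arithmetic. The snippet is a reader-side sanity check, not the contents of `parameter_breakdown.json`. The dictionary keys are illustrative, and the dual-audio total is taken from the runtime footprint table.

```python
# Component counts copied from the parameter table above.
COMPONENT_PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder": 19_886_566,
    "image_projection": 12_306_560,
    "audio_projection": 12_306_560,
    "text_projection": 11_323_520,
}
# Loaded parameters of the AIST-95M dual-audio tower (runtime footprint table).
DUAL_AUDIO_TOTAL = 95_315_959

# Sum of all components: 87,118,774.
total = sum(COMPONENT_PARAMS.values())

# Naming rule: exact loaded count rounded to integer millions -> "AIST-87M".
name = f"AIST-{round(total / 1_000_000)}M"

# Audio path including its projection head: 32,193,126.
audio_path = COMPONENT_PARAMS["audio_encoder"] + COMPONENT_PARAMS["audio_projection"]

# Relative change in loaded parameters vs the dual-audio tower: -8.6 (%).
delta_pct = round(100.0 * (total - DUAL_AUDIO_TOTAL) / DUAL_AUDIO_TOTAL, 1)
```

The same relative-change formula reproduces the other percentage deltas in the footprint table when applied to the corresponding pair of values.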

## Files

| File | Purpose |
|---|---|
| `AIST-87M.safetensors` | Self-contained release artifact |
| `aist_81m_raw_mn20_lora.yaml` | Training recipe for the source run |
| `manifest.json` | Release manifest with checksums and eval coverage |
| `parameter_breakdown.json` | Exact parameter accounting |
| `aist87m_memory_slice_release_report.md` | Human-memory slice report |
| `aist87m_memory_slice_release_report.json` | Machine-readable evaluation summary |
| `aist87m_vs_dual_audio_throughput_l4_20260504.json` | L4 throughput benchmark vs dual-audio tower |
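Since the manifest carries checksums, a download can be verified before loading. The card does not describe the manifest layout or the hash algorithm, so the sketch below makes two assumptions: the checksums are SHA-256 hex digests, and the expected digest is copied out of `manifest.json` by hand. The helper names are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Self-check on a temporary file, using the standard SHA-256 test vector for b"abc".
ABC_SHA256 = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    demo_path = tmp.name
demo_ok = matches_expected(demo_path, ABC_SHA256)
os.remove(demo_path)
```

For the real artifact, the call would look like `matches_expected("AIST-87M.safetensors", expected_hex)`, with `expected_hex` taken from the manifest entry for that file.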

## Caveats

- The model is optimized and reported for memory-relevant embedding surfaces,
  not broad leaderboard coverage.
- The single-audio path is smaller and simpler than the dual-audio tower, but
  it does not dominate the dual-audio tower on paired diagnostic scores.
- Human-memory slice results are complete for 1280d, 768d, and 512d on the
  release checkpoint; no results are reported in this card for the 256d and
  128d Matryoshka slices.
