---
language:
- en
license: apache-2.0
tags:
- multimodal
- embedding
- trimodal
- retrieval
- image-text-audio
- audio
- speech
- memory-augmentation
- feature-extraction
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---

# AIST-87M

`AIST-87M` is a compact audio + image + speech + text embedding model for human-memory augmentation workloads. It is the single-audio evolution of the earlier dual-audio tower line: the runtime audio path uses one merged native `mn20_as` EfficientAT encoder instead of a separate EfficientAT + Whisper dual branch.

The LoRA training weights are merged into the native audio encoder in this release artifact, so there is no separate LoRA pass at inference time.

Core stack:

- text: `MongoDB/mdbr-leaf-ir`
- image: `mobilenetv4_conv_medium.e180_r384_in12k`
- audio: native merged `mn20_as` EfficientAT encoder
- projection output: `1280d`
- Matryoshka slices: `[1280, 768, 512, 256, 128]`
- exact loaded params: `87,118,774`

The canonical name follows the Augmem naming standard:

- `AIST` = audio + image + speech + text
- `87M` = exact loaded parameter count rounded to integer millions

## Runtime Contract

This model returns L2-normalized embeddings in a shared 1280-dimensional space. For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:

```text
z1280 = l2norm(model(input))
z768  = l2norm(z1280[0:768])
z512  = l2norm(z1280[0:512])
```

The release safetensors file is self-contained and includes the text encoder, image encoder, merged native audio encoder, and the three projection heads.

## Evaluation Scope

This release uses a human-memory evaluation slice rather than a broad leaderboard sweep.
The slice is chosen to match practical memory augmentation surfaces:

- text continuity: duplicate-question and semantic textual similarity tasks
- image recall: Flickr30k text-image and image-text retrieval
- audio recall: speech/general-audio text-audio retrieval tasks

Primary metrics:

- text continuity: `main_score`
- image recall: `NDCG@10`
- audio recall: `NDCG@10`

## Human-Memory Slice

Source: `aist87m_memory_slice_release_report.md` and `aist87m_memory_slice_release_report.json`.

| Dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
|---:|---:|---:|---:|---:|---:|
| 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
| 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
| 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |

Selected 1280d task scores:

| Task | Family | Metric | Score | R@1 | R@10 |
|---|---|---|---:|---:|---:|
| SprintDuplicateQuestions | Text continuity | main_score | 0.875 | - | - |
| STSBenchmark | Text continuity | main_score | 0.651 | - | - |
| Flickr30kT2IRetrieval | Image recall | NDCG@10 | 0.469 | 0.296 | 0.672 |
| Flickr30kI2TRetrieval | Image recall | NDCG@10 | 0.381 | 0.082 | 0.407 |
| CommonVoiceMini21T2ARetrieval | Audio recall | NDCG@10 | 0.028 | 0.006 | 0.062 |
| MACST2ARetrieval | Audio recall | NDCG@10 | 0.110 | 0.033 | 0.214 |
| UrbanSound8KT2ARetrieval | Audio recall | NDCG@10 | 0.009 | 0.002 | 0.018 |
| ClothoT2ARetrieval | Audio recall | NDCG@10 | 0.269 | 0.128 | 0.443 |

## Task-Aligned Comparisons

Comparisons below are only for locally available, task-aligned runs from the same raw AIST line and its audio baselines.

| Comparison | Dim | Paired tasks | Read |
|---|---:|---:|---|
| vs native `mn20_as` audio baseline | 768 | 4 | slightly lower selected audio recall on average; UrbanSound8K is flat |
| vs dual-audio tower | 768 | 6 | smaller single-audio runtime, but lower paired text/image/audio scores |
| vs `AIST-95M` | 1280 | 2 | only paired Flickr tasks are available locally; `AIST-95M` remains stronger on that pair |

This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model. Broad diagnostic runs contain many task families that are not part of this release gate.

## Runtime Footprint vs Dual-Audio Tower

`AIST-87M` replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny audio branches with one merged native `mn20_as` EfficientAT encoder. The result is a smaller deployed path with the same 1280d output contract.

| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---:|---:|---:|
| Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
| Audio encoders | 1 | 2 | removes Whisper branch |
| Audio encoder parameters | 19,886,566 | 26,117,671 | -23.9% |
| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
| Audio projection input width | 1,280 | 2,304 | -44.4% |

Exact-gate tradeoff against the same dual-audio local baseline:

| 1280d exact-gate slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---:|---:|---:|
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |

PyTorch audio-stack throughput was measured on an NVIDIA L4 using synthetic 10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder -> projection -> normalized embedding. The median wall time is taken over 50 timed iterations after 20 warmup iterations.
This excludes audio file decode, dataset download, and MTEB result serialization.

| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
|---:|---:|---:|---:|---:|---:|
| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |

Projection-only throughput at feature batch 2048 is also higher for the single-audio path: 314k features/s for `AIST-87M` vs 282k features/s for the dual-audio tower.

Raw benchmark output is included as `aist87m_vs_dual_audio_throughput_l4_20260504.json`.

## Architecture

```text
Text  -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
Image -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
Audio -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280
```

The audio encoder in this artifact is the merged native checkpoint:

`mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt`

## Parameter Count

| Component | Params |
|---|---:|
| Text encoder (`MongoDB/mdbr-leaf-ir`) | 22,861,056 |
| Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`) | 8,434,512 |
| Audio encoder (merged native `mn20_as`) | 19,886,566 |
| Image projection head | 12,306,560 |
| Audio projection head | 12,306,560 |
| Text projection head | 11,323,520 |
| **Total exact loaded params** | **87,118,774** |

## Files

| File | Purpose |
|---|---|
| `AIST-87M.safetensors` | Self-contained release artifact |
| `aist_81m_raw_mn20_lora.yaml` | Training recipe for the source run |
| `manifest.json` | Release manifest with checksums and eval coverage |
| `parameter_breakdown.json` | Exact parameter accounting |
| `aist87m_memory_slice_release_report.md` | Human-memory slice report |
| `aist87m_memory_slice_release_report.json` | Machine-readable evaluation summary |
| `aist87m_vs_dual_audio_throughput_l4_20260504.json` | L4 throughput benchmark vs dual-audio tower |

## Caveats

- The model is optimized and reported for memory-relevant embedding surfaces, not broad leaderboard coverage.
- The single-audio path is smaller and simpler than the dual-audio tower, but it does not dominate the dual-audio tower on paired diagnostic scores.
- 1280d, 768d, and 512d human-memory slices are complete for the release checkpoint.
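
## Matryoshka Truncation Sketch

The truncate-and-renormalize rule from the Runtime Contract can be sketched in plain Python. This is a minimal illustration and not the release's loading or inference code. The helper names (`l2norm`, `truncate_embedding`, `cosine`) are hypothetical, and the two input vectors are synthetic stand-ins for real 1280-d model outputs. Restricting slices to the listed Matryoshka dimensions is an assumption of this sketch.

```python
import math

# Slice widths listed in this card's "Core stack" section.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def l2norm(vec):
    """Return vec scaled to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def truncate_embedding(z1280, dim):
    """Keep the leading `dim` coordinates of a 1280-d embedding, then renormalize."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported slice: {dim}")
    if len(z1280) != 1280:
        raise ValueError("expected a 1280-d embedding")
    return l2norm(z1280[:dim])


def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Synthetic stand-ins (assumption): real vectors would come from the model.
query = l2norm([math.sin(0.01 * i) + 0.5 for i in range(1280)])
item = l2norm([math.cos(0.02 * i) + 0.5 for i in range(1280)])

# Compare both vectors at the same slice width.
q512 = truncate_embedding(query, 512)
i512 = truncate_embedding(item, 512)
score_512 = cosine(q512, i512)
```

Both sides of a comparison should be truncated to the same width before scoring. Renormalizing after the slice keeps the dot product equal to cosine similarity at that width.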