	---
	language:
	- en
	license: apache-2.0
	tags:
	- multimodal
	- embedding
	- trimodal
	- retrieval
	- image-text-audio
	- audio
	- speech
	- memory-augmentation
	- feature-extraction
	library_name: pytorch
	pipeline_tag: feature-extraction
	datasets:
	- custom
	---

	# AIST-87M

	`AIST-87M` is a compact audio + image + speech + text embedding model for
	human-memory augmentation workloads.

	It is the single-audio evolution of the earlier dual-audio tower line: the
	runtime audio path uses one merged native `mn20_as` EfficientAT encoder instead
	of a separate EfficientAT + Whisper dual branch. The LoRA training weights are
	merged into the native audio encoder in this release artifact, so there is no
	separate LoRA pass at inference time.

	Core stack:

	- text: `MongoDB/mdbr-leaf-ir`
	- image: `mobilenetv4_conv_medium.e180_r384_in12k`
	- audio: native merged `mn20_as` EfficientAT encoder
	- projection output: `1280d`
	- Matryoshka slices: `[1280, 768, 512, 256, 128]`
	- exact loaded params: `87,118,774`

	The canonical name follows the Augmem naming standard:

	- `AIST` = audio + image + speech + text
	- `87M` = exact loaded parameter count rounded to integer millions

	## Runtime Contract

	This model returns L2-normalized embeddings in a shared 1280-dimensional space.
	For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:

	```text
	z1280 = l2norm(model(input))
	z768 = l2norm(z1280[0:768])
	z512 = l2norm(z1280[0:512])
	```
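The truncate-and-renormalize step can be sketched in plain Python. This is a minimal illustration, assuming the full 1280-d embedding is already available as a sequence of floats; the helper names `l2norm` and `matryoshka_slice` are hypothetical and not part of the release.

```python
import math
from typing import Sequence

# Slice widths listed in the "Core stack" section of this card.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def l2norm(vec: Sequence[float], eps: float = 1e-12) -> list[float]:
    """Scale a vector to unit L2 length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def matryoshka_slice(z_full: Sequence[float], dim: int) -> list[float]:
    """Keep the first `dim` coordinates of a full embedding and renormalize."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported slice width: {dim}")
    if len(z_full) < dim:
        raise ValueError("embedding is shorter than the requested slice")
    return l2norm(z_full[:dim])


# Stand-in for model output; a real embedding would come from the model.
z1280 = l2norm([math.sin(i + 1.0) for i in range(1280)])
z512 = matryoshka_slice(z1280, 512)
```

Renormalizing matters because a prefix of a unit vector generally has a norm below 1, which would skew dot-product scores if slices were compared without it.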

	The release safetensors file is self-contained and includes the text encoder,
	image encoder, merged native audio encoder, and the three projection heads.

	## Evaluation Scope

	This release uses a human-memory evaluation slice rather than a broad
	leaderboard sweep. The slice is chosen to match practical memory augmentation
	surfaces:

	- text continuity: duplicate-question and semantic textual similarity tasks
	- image recall: Flickr30k text-image and image-text retrieval
	- audio recall: speech/general-audio text-audio retrieval tasks

	Primary metrics:

	- text continuity: `main_score`
	- image recall: `NDCG@10`
	- audio recall: `NDCG@10`
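For readers unfamiliar with the retrieval metrics, the sketch below computes Recall@k and NDCG@10 for a single query with binary relevance labels. It uses the textbook definitions; the evaluation harness behind the reported numbers may differ in details such as graded relevance or tie handling, and the function names are illustrative.

```python
import math
from typing import Sequence


def recall_at_k(ranked_relevance: Sequence[int], k: int) -> float:
    """Fraction of all relevant items that appear in the top k results."""
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:k]) / total_relevant


def ndcg_at_k(ranked_relevance: Sequence[int], k: int = 10) -> float:
    """NDCG@k with binary gains and a log2 position discount."""

    def dcg(rels: Sequence[int]) -> float:
        # rank is 0-based, so the discount for the top hit is log2(2) = 1.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0


# One query whose single relevant item was ranked second.
ranking = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
r_at_1 = recall_at_k(ranking, 1)
r_at_10 = recall_at_k(ranking, 10)
ndcg_10 = ndcg_at_k(ranking, 10)
```

Dataset-level scores are then averages of these per-query values.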

	## Human-Memory Slice

	Source: `aist87m_memory_slice_release_report.md` and
	`aist87m_memory_slice_release_report.json`.

	| Dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
	|---:|---:|---:|---:|---:|---:|
	| 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
	| 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
	| 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |

	Selected 1280d task scores:

	| Task | Family | Metric | Score | R@1 | R@10 |
	|---|---|---|---:|---:|---:|
	| SprintDuplicateQuestions | Text continuity | main_score | 0.875 | - | - |
	| STSBenchmark | Text continuity | main_score | 0.651 | - | - |
	| Flickr30kT2IRetrieval | Image recall | NDCG@10 | 0.469 | 0.296 | 0.672 |
	| Flickr30kI2TRetrieval | Image recall | NDCG@10 | 0.381 | 0.082 | 0.407 |
	| CommonVoiceMini21T2ARetrieval | Audio recall | NDCG@10 | 0.028 | 0.006 | 0.062 |
	| MACST2ARetrieval | Audio recall | NDCG@10 | 0.110 | 0.033 | 0.214 |
	| UrbanSound8KT2ARetrieval | Audio recall | NDCG@10 | 0.009 | 0.002 | 0.018 |
	| ClothoT2ARetrieval | Audio recall | NDCG@10 | 0.269 | 0.128 | 0.443 |
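The 1280d summary row can be reproduced from the per-task scores. The sketch below assumes each family score is an unweighted mean of its tasks and that the overall score is an unweighted mean over all eight tasks. That aggregation is inferred from the published numbers; the release report does not state it in this card.

```python
from collections import defaultdict
from statistics import mean

# (task, family, score) rows copied from the 1280d task table above.
TASK_SCORES = [
    ("SprintDuplicateQuestions", "text", 0.875),
    ("STSBenchmark", "text", 0.651),
    ("Flickr30kT2IRetrieval", "image", 0.469),
    ("Flickr30kI2TRetrieval", "image", 0.381),
    ("CommonVoiceMini21T2ARetrieval", "audio", 0.028),
    ("MACST2ARetrieval", "audio", 0.110),
    ("UrbanSound8KT2ARetrieval", "audio", 0.009),
    ("ClothoT2ARetrieval", "audio", 0.269),
]

# Group scores by family, then average within each family.
by_family = defaultdict(list)
for _task, family, score in TASK_SCORES:
    by_family[family].append(score)
family_means = {family: mean(scores) for family, scores in by_family.items()}

# Assumed aggregation: the overall score averages tasks, not families.
overall = mean(score for _task, _family, score in TASK_SCORES)

for family, value in family_means.items():
    print(f"{family}: {value:.3f}")
print(f"overall: {overall:.3f}")
```

Because the overall score averages tasks rather than families, the four audio tasks carry half of its weight.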

	## Task-Aligned Comparisons

	The comparisons below cover only locally available, task-aligned runs from the
	same raw AIST line and its audio baselines.

	| Comparison | Dim | Paired tasks | Read |
	|---|---:|---:|---|
	| vs native `mn20_as` audio baseline | 768 | 4 | slightly lower selected audio recall on average; UrbanSound8K is flat |
	| vs dual-audio tower | 768 | 6 | smaller single-audio runtime, but lower paired text/image/audio scores |
	| vs `AIST-95M` | 1280 | 2 | only paired Flickr tasks are available locally; `AIST-95M` remains stronger on that pair |

	This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model.
	Broad diagnostic runs contain many task families that are not part of this
	release gate.

	## Runtime Footprint vs Dual-Audio Tower

	`AIST-87M` replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny
	audio branches with one merged native `mn20_as` EfficientAT encoder. The result
	is a smaller deployed path with the same 1280d output contract.

	| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
	|---|---:|---:|---:|
	| Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
	| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
	| Audio encoders | 1 | 2 | removes Whisper branch |
	| Audio encoder parameters | 19,886,566 | 26,117,671 | -23.9% |
	| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
	| Audio projection input width | 1,280 | 2,304 | -44.4% |
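The Delta column is a relative change against the dual-audio tower. A short sketch that recomputes it from the values in the table (the helper name `relative_delta_pct` is illustrative):

```python
def relative_delta_pct(new: float, baseline: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# (AIST-87M, AIST-95M dual-audio tower) pairs copied from the table above.
FOOTPRINT = {
    "Loaded parameters": (87_118_774, 95_315_959),
    "Safetensors artifact (MB)": (348.9, 381.9),
    "Audio encoder parameters": (19_886_566, 26_117_671),
    "Audio path parameters incl. projection": (32_193_126, 40_390_311),
    "Audio projection input width": (1_280, 2_304),
}

deltas = {
    name: round(relative_delta_pct(new, baseline), 1)
    for name, (new, baseline) in FOOTPRINT.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```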

	Exact-gate tradeoff against the same dual-audio local baseline:

	| 1280d exact-gate slice | AIST-87M | AIST-95M dual-audio tower | Delta |
	|---|---:|---:|---:|
	| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
	| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
	| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
	| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |

	Audio-stack throughput was measured in PyTorch on an NVIDIA L4, using synthetic
	10 s, 32 kHz CPU-resident waveforms passed through waveform -> audio encoder ->
	projection -> normalized embedding. Median wall time is taken over 50 timed
	iterations after 20 warmup iterations. The measurement excludes audio file
	decode, dataset download, and MTEB result serialization.

	| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
	|---:|---:|---:|---:|---:|---:|
	| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
	| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
	| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |
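The throughput and speedup columns follow from the median latencies. The sketch below shows the conversion, using the 10 s clip length stated above. It works from the rounded medians in the table, so a recomputed value can differ from the published one in the last digit.

```python
CLIP_SECONDS = 10.0  # each synthetic benchmark clip is 10 s of audio

# batch size -> (AIST-87M median ms, AIST-95M median ms), copied from the table.
MEDIAN_MS = {1: (5.36, 10.50), 8: (16.46, 60.29), 16: (41.19, 133.95)}


def clips_per_second(batch_size: int, median_ms: float) -> float:
    """Clips processed per wall-clock second at the given batch latency."""
    return batch_size / (median_ms / 1000.0)


def audio_seconds_per_second(batch_size: int, median_ms: float) -> float:
    """Seconds of audio embedded per wall-clock second."""
    return clips_per_second(batch_size, median_ms) * CLIP_SECONDS


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


for batch, (ms_87m, ms_95m) in MEDIAN_MS.items():
    print(
        f"batch {batch}: "
        f"{clips_per_second(batch, ms_87m):.1f} clips/s vs "
        f"{clips_per_second(batch, ms_95m):.1f} clips/s, "
        f"speedup {speedup(ms_95m, ms_87m):.2f}x"
    )
```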

	Projection-only throughput at feature batch 2048 is also higher for the
	single-audio path: 314k features/s for `AIST-87M` vs 282k features/s for the
	dual-audio tower. Raw benchmark output is included as
	`aist87m_vs_dual_audio_throughput_l4_20260504.json`.
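The timing protocol described in this section (20 warmup iterations, then the median over 50 timed iterations) can be sketched as a small harness. This is not the script that produced the published JSON: `median_wall_ms` and `dummy_step` are hypothetical names, and the workload here is a CPU stand-in. For a GPU run, a barrier such as `torch.cuda.synchronize` would be passed as `sync` so queued kernels are included in each sample.

```python
import statistics
import time
from typing import Callable, Optional


def median_wall_ms(
    step: Callable[[], object],
    warmup: int = 20,
    iters: int = 50,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Median wall-clock time of `step` in milliseconds."""
    # Warmup runs are executed but never timed.
    for _ in range(warmup):
        step()
    samples = []
    for _ in range(iters):
        if sync is not None:
            sync()  # drain pending device work before starting the clock
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()  # wait for this step's device work to finish
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def dummy_step() -> int:
    """Hypothetical stand-in for waveform -> encoder -> projection -> l2norm."""
    return sum(i * i for i in range(10_000))


elapsed_ms = median_wall_ms(dummy_step)
print(f"median: {elapsed_ms:.3f} ms")
```

The median is less sensitive than the mean to occasional slow iterations, which is a common reason to report it for latency.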

	## Architecture

	```text
	Text -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
	Image -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
	Audio -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280
	```
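This card does not define the internals of `DeepProjectionHead-d2`. As a consistency check only, the sketch below shows one hypothetical layout that reproduces the projection-head parameter counts listed in the Parameter Count section: an input linear layer into a hidden width of 1920, two hidden linear layers, a LayerNorm after each of those three layers, and an output linear layer to 1280. The hidden width, the depth, and the layer arrangement are all assumptions chosen to match the counts. Other layouts could match as well, and the real module may differ.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layernorm_params(width: int) -> int:
    """Scale plus shift vectors of a LayerNorm."""
    return 2 * width


def hypothetical_head_params(
    n_in: int, hidden: int = 1920, n_out: int = 1280, depth: int = 2
) -> int:
    """Parameter count of the assumed layout (not the confirmed architecture)."""
    total = linear_params(n_in, hidden) + layernorm_params(hidden)
    total += depth * (linear_params(hidden, hidden) + layernorm_params(hidden))
    total += linear_params(hidden, n_out)
    return total


# Encoder output widths from the architecture diagram above.
image_or_audio_head = hypothetical_head_params(1280)
text_head = hypothetical_head_params(768)
print(image_or_audio_head, text_head)
```

Under these assumptions the 1280-d input heads come to 12,306,560 parameters and the 768-d text head to 11,323,520, matching the published breakdown.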

	The audio encoder in this artifact is the merged native checkpoint:

	`mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt`

	## Parameter Count

	| Component | Params |
	|---|---:|
	| Text encoder (`MongoDB/mdbr-leaf-ir`) | 22,861,056 |
	| Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`) | 8,434,512 |
	| Audio encoder (merged native `mn20_as`) | 19,886,566 |
	| Image projection head | 12,306,560 |
	| Audio projection head | 12,306,560 |
	| Text projection head | 11,323,520 |
	| Total exact loaded params | 87,118,774 |
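The breakdown can be checked with simple arithmetic. The sketch below sums the component counts from the table and derives the audio-path figure used in the runtime footprint comparison (audio encoder plus audio projection head).

```python
# Component parameter counts copied from the table above.
COMPONENT_PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder": 19_886_566,
    "image_projection_head": 12_306_560,
    "audio_projection_head": 12_306_560,
    "text_projection_head": 11_323_520,
}

total_params = sum(COMPONENT_PARAMS.values())

# Audio path = encoder + its projection head.
audio_path_params = (
    COMPONENT_PARAMS["audio_encoder"] + COMPONENT_PARAMS["audio_projection_head"]
)

# The "87M" suffix is the total rounded to integer millions.
name_suffix = f"{round(total_params / 1_000_000)}M"

print(total_params, audio_path_params, name_suffix)
```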

	## Files

	| File | Purpose |
	|---|---|
	| `AIST-87M.safetensors` | Self-contained release artifact |
	| `aist_81m_raw_mn20_lora.yaml` | Training recipe for the source run |
	| `manifest.json` | Release manifest with checksums and eval coverage |
	| `parameter_breakdown.json` | Exact parameter accounting |
	| `aist87m_memory_slice_release_report.md` | Human-memory slice report |
	| `aist87m_memory_slice_release_report.json` | Machine-readable evaluation summary |
	| `aist87m_vs_dual_audio_throughput_l4_20260504.json` | L4 throughput benchmark vs dual-audio tower |

	## Caveats

	- The model is optimized and reported for memory-relevant embedding surfaces,
	not broad leaderboard coverage.
	- The single-audio path is smaller and simpler than the dual-audio tower, but
	it does not dominate the dual-audio tower on paired diagnostic scores.
	- The 1280d, 768d, and 512d human-memory slices are complete for the release
	checkpoint; results for the 256d and 128d slices are not reported in this card.