AIT-75M – Audio, Image, Text Embeddings
AIT-75M maps image, audio, and text into a shared 1280-dim embedding space, enabling cross-modal retrieval with a single vector index. Embeddings from all three modalities support Matryoshka truncation down to 128 dims.
Built for edge deployment – the entire model runs on a Raspberry Pi 5.
Also available in GGUF format for quantized edge deployment (114 MB at Q8_0).
Architecture
AIT-75M uses lightweight edge encoders with learned projection heads that expand through a 1920-dim hidden layer before projecting into a shared 1280-dim embedding space:
Text --> LEAF-IR (768-d) -----------> DeepProjectionHead (768 -> 1920 -> 1280)
Image --> MobileNetV4-Medium (1280-d) --> DeepProjectionHead (1280 -> 1920 -> 1280)
Audio --> EfficientAT mn20_as (1920-d) --> DeepProjectionHead (1920 -> 1920 -> 1280)
All outputs are L2-normalized into the shared 1280-dim space for cross-modal cosine similarity.
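Since every embedding is unit-norm, cross-modal retrieval reduces to a single matrix product. A minimal sketch with random stand-ins for the real per-modality pipelines (encode_text / encode_image here are hypothetical names, not part of the released API):

import torch
import torch.nn.functional as F

# Stand-ins for encode_text(queries) and encode_image(gallery);
# the real pipelines produce L2-normalized (N, 1280) embeddings.
text_emb = F.normalize(torch.randn(4, 1280), dim=-1)
image_emb = F.normalize(torch.randn(100, 1280), dim=-1)

# On unit vectors, cosine similarity is a plain dot product.
scores = text_emb @ image_emb.T   # (4, 100)
top1 = scores.argmax(dim=-1)      # best-matching image per text query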
| Component | Architecture | Params | Size |
|---|---|---|---|
| Text encoder | LEAF-IR (MongoDB/mdbr-leaf-ir) | 22.7M | 87.2 MB |
| Image encoder | MobileNetV4-Medium (timm) | 8.4M | 32.4 MB |
| Audio encoder | EfficientAT mn20_as | 17.9M | 68.5 MB |
| Image projection | DeepProjectionHead (1280 -> 1920 -> 1280) | 8.6M | 32.9 MB |
| Audio projection | DeepProjectionHead (1920 -> 1920 -> 1280) | 9.8M | 37.5 MB |
| Text projection | DeepProjectionHead (768 -> 1920 -> 1280) | 7.6M | 29.1 MB |
| Total | | 75.2M | 287.7 MB |
Projection head detail
Each DeepProjectionHead is a depth-1 residual MLP with Matryoshka-aware training:
Linear(encoder_dim, 1920) -> GELU -> LayerNorm -> Dropout(0.2)
-> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.2) + residual
-> Linear(1920, 1280)
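A minimal PyTorch reconstruction of this head follows; the exact module layout (dropout placement, residual boundaries) is inferred from the listing above rather than taken from the reference implementation, though the resulting parameter counts match the component table:

import torch
import torch.nn as nn

class DeepProjectionHead(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 1920, out_dim: int = 1280, p: float = 0.2):
        super().__init__()
        # Linear -> GELU -> LayerNorm -> Dropout, as in the listing.
        self.inp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.LayerNorm(hidden_dim), nn.Dropout(p),
        )
        # Second block, wrapped in a residual connection.
        self.block = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.LayerNorm(hidden_dim), nn.Dropout(p),
        )
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.inp(x)
        h = h + self.block(h)  # the "+ residual" step in the listing
        return self.out(h)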
Matryoshka dimensions
Embeddings can be truncated to [1280, 768, 512, 256, 128] dimensions while preserving retrieval quality – trained with Matryoshka Representation Learning (MRL).
Benchmarks
All benchmarks were run on a single NVIDIA L4 GPU; the SALT evaluation uses 5K trimodal samples.
Cross-modal retrieval – SALT (5K trimodal samples)
| Direction | AIT-75M (75M) | TEG-421M (421M) | ImageBind (1.2B) | EBind (1.78B*) |
|---|---|---|---|---|
| Image -> Text R@1 | 0.615 | 0.620 | 0.736 | 0.783 |
| Text -> Image R@1 | 0.614 | 0.672 | 0.712 | 0.779 |
| Text -> Audio R@1 | 0.103 | 0.113 | 0.038 | 0.047 |
| Audio -> Text R@1 | 0.082 | 0.115 | 0.039 | 0.035 |
| Image -> Audio R@1 | 0.062 | 0.083 | 0.023 | 0.027 |
| Audio -> Image R@1 | 0.063 | 0.081 | 0.025 | 0.032 |
Audio retrieval – AudioCaps & Clotho
| Benchmark | Direction | AIT-75M | CLAP-Large | ImageBind | EBind |
|---|---|---|---|---|---|
| AudioCaps | A->T R@1 | 0.210 | 0.420 | 0.116 | 0.225 |
| AudioCaps | T->A R@1 | 0.148 | 0.280 | 0.080 | 0.219 |
| Clotho | A->T R@1 | 0.208 | 0.195 | 0.061 | 0.088 |
| Clotho | T->A R@1 | 0.172 | 0.167 | 0.074 | 0.118 |
AIT-75M achieves the best Clotho A->T R@1 of all models listed, including the audio-specialist CLAP-Large, while remaining fully trimodal.
Image-text retrieval – MSCOCO & Flickr30k
| Benchmark | Direction | AIT-75M (75M) | EBind (1.78B*) | ImageBind (1.2B) |
|---|---|---|---|---|
| Flickr30k | I->T R@1 | 0.478 | 0.951 | 0.918 |
| Flickr30k | T->I R@1 | 0.303 | 0.853 | 0.766 |
| MSCOCO 5K | I->T R@1 | 0.320 | 0.743 | 0.658 |
| MSCOCO 5K | T->I R@1 | 0.208 | 0.559 | 0.490 |
Zero-shot classification – ESC-50
| Model | Params | Accuracy |
|---|---|---|
| CLAP-Large | 67.8M | 90.5% |
| AIT-75M | 75M | 93.2% |
| EBind | 1.78B* | 77.0% |
| ImageBind | 1.2B | 66.4% |
Best ESC-50 accuracy in this comparison (93.2%) at 75M params – ahead of CLAP-Large (90.5%) while being trimodal.
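Zero-shot classification embeds the class names as text prototypes and assigns each audio clip to its nearest prototype in the shared space. A sketch with hypothetical stand-ins for the real encoder pipelines:

import torch
import torch.nn.functional as F

classes = ["dog", "rain", "sea waves", "crying baby", "clock tick"]  # ESC-50 subset for illustration

# Stand-ins for encode_text(classes) -> (C, 1280) and encode_audio(clip) -> (1, 1280).
class_emb = F.normalize(torch.randn(len(classes), 1280), dim=-1)
audio_emb = F.normalize(torch.randn(1, 1280), dim=-1)

# Predicted class = highest cosine similarity against the text prototypes.
pred = (audio_emb @ class_emb.T).argmax(dim=-1)
print(classes[pred.item()])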
Text retrieval – MTEB (NDCG@10)
Text-text retrieval quality in the shared embedding space, measured on MTEB retrieval tasks:
| Task | AIT-75M | Raw LEAF-IR | Recovery |
|---|---|---|---|
| ArguAna | 0.544 | 0.594 | 92% |
| CQADupstackGaming | 0.506 | 0.607 | 83% |
| CQADupstackUnix | 0.355 | 0.428 | 83% |
| FEVERHardNegatives | 0.551 | 0.863 | 64% |
| HotpotQAHardNegatives | 0.531 | 0.700 | 76% |
| FiQA2018 | 0.292 | 0.392 | 74% |
| ClimateFEVER | 0.215 | 0.353 | 61% |
| SCIDOCS | 0.153 | 0.198 | 77% |
| TRECCOVID | 0.474 | 0.820 | 58% |
The text projection head recovers 58-92% of raw LEAF-IR's retrieval quality while mapping into the cross-modal shared space (Recovery = AIT-75M / raw LEAF-IR, e.g. 0.544 / 0.594 ≈ 92% on ArguAna).
Usage
Loading components
from safetensors.torch import load_file

# Load the full checkpoint (all six components in one file).
tensors = load_file("AIT-75M.safetensors")

def extract(prefix: str) -> dict:
    # Pull out one component's state dict by tensor-key prefix.
    return {k.removeprefix(prefix): v for k, v in tensors.items() if k.startswith(prefix)}

text_enc_sd = extract("text_encoder.")
image_enc_sd = extract("image_encoder.")
audio_enc_sd = extract("audio_encoder.")
image_proj_sd = extract("image_projection.")
audio_proj_sd = extract("audio_projection.")
text_proj_sd = extract("text_projection.")
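The extracted state dicts can then be loaded into matching modules. Using the DeepProjectionHead sketch from the Architecture section (the checkpoint's parameter names may differ from that sketch, so strict=False is used and a key remap may be needed):

# Input dims per modality, from the component table above.
image_proj = DeepProjectionHead(in_dim=1280)
audio_proj = DeepProjectionHead(in_dim=1920)
text_proj = DeepProjectionHead(in_dim=768)

# strict=False tolerates naming differences between the sketch and the checkpoint.
image_proj.load_state_dict(image_proj_sd, strict=False)
image_proj.eval()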
Matryoshka truncation
import torch.nn.functional as F
# Full 1280-dim embedding
embedding = model(input) # (N, 1280)
# Truncate to 256-dim and re-normalize
embedding_256 = F.normalize(embedding[:, :256], dim=-1)
File layout
AIT-75M.safetensors # All components in one file (~288 MB)
Tensor key prefixes
| Prefix | Component | Tensors |
|---|---|---|
| text_encoder.* | LEAF-IR (float32) | 103 |
| image_encoder.* | MobileNetV4-Medium | 462 |
| audio_encoder.* | EfficientAT mn20_as | 312 |
| image_projection.* | Projection head | 10 |
| audio_projection.* | Projection head | 10 |
| text_projection.* | Projection head | 10 |
Training
- Loss: InfoNCE (contrastive) with Matryoshka Representation Learning (see the loss sketch after this list)
- Data: ~2.2M synthetically generated trimodal triplets (WordNet) + 200K MSCOCO img+txt + 262K WavCaps aud+txt + 1.5M Nomic text pairs
- Hardware: 2x NVIDIA L4 GPUs
- Text retrieval fine-tune: Phase 1 warm start from d20 checkpoint, text-head-only with frozen image/audio heads, Nomic supervised text pairs mixed at lambda_tt=0.25
- Optimizer: AdamW, lr=1e-3, weight decay=1e-4, cosine scheduler
- Epochs: 7 (text fine-tune from pre-trained trimodal base)
- Projection heads only – source encoders are frozen during training
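A minimal sketch of the combined objective, assuming a symmetric InfoNCE averaged uniformly over the Matryoshka prefixes (the prefix weights and temperature below are assumptions, not published values):

import torch
import torch.nn.functional as F

MRL_DIMS = [1280, 768, 512, 256, 128]

def matryoshka_infonce(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # emb_a, emb_b: (N, 1280) paired embeddings from two modalities;
    # row i of emb_a is the positive match for row i of emb_b.
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    losses = []
    for d in MRL_DIMS:
        # Re-normalize each truncated prefix before computing similarities.
        a = F.normalize(emb_a[:, :d], dim=-1)
        b = F.normalize(emb_b[:, :d], dim=-1)
        logits = a @ b.T / temperature
        # Symmetric InfoNCE: cross-entropy in both retrieval directions.
        losses.append(0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)))
    return torch.stack(losses).mean()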
Design decisions
- 3-head shared space: All modalities project into a learned 1280-dim space (image-native dimension) instead of targeting a pre-existing text encoder space
- LEAF-IR text encoder: 23M-param retrieval-optimized text encoder replaces 300M Gemma, enabling fully edge-deployable text inference
- Frozen source encoders: MobileNetV4, EfficientAT, and LEAF-IR are kept frozen; only projection heads are trained
- Text retrieval fine-tune: Nomic supervised text pairs (1.5M) mixed into trimodal training to improve text-text retrieval while preserving cross-modal alignment
- Edge-first: All source encoders can run on devices like Raspberry Pi 5
Limitations
- Audio retrieval lags behind specialist models like CLAP on audio-only benchmarks
- Image-text retrieval gives up accuracy relative to larger vision encoders in exchange for edge deployability
- Text retrieval recovers 58-92% of raw LEAF-IR quality (gap is domain-dependent)
Links
- Website: augmem.ai
- GitHub: github.com/augmem
License
Apache 2.0