AIT-75M β€” Audio, Image, Text Embeddings

AIT-75M maps image, audio, and text into a shared 1280-dim embedding space, enabling cross-modal retrieval with a single vector index. Embeddings support Matryoshka truncation down to 128 dims.

Built for edge deployment β€” the entire model runs on a Raspberry Pi 5.

Also available in GGUF format for quantized edge deployment (114 MB at Q8_0).

Architecture

AIT-75M uses lightweight edge encoders with learned projection heads that expand through a 1920-dim hidden layer before projecting into a shared 1280-dim embedding space:

Text  --> LEAF-IR (768-d)              --> DeepProjectionHead (768 -> 1920 -> 1280)
Image --> MobileNetV4-Medium (1280-d)  --> DeepProjectionHead (1280 -> 1920 -> 1280)
Audio --> EfficientAT mn20_as (1920-d) --> DeepProjectionHead (1920 -> 1920 -> 1280)

All outputs are L2-normalized into the shared 1280-dim space for cross-modal cosine similarity.
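
Because every modality lands in the same unit-norm space, cross-modal similarity is a plain matrix product, and one vector index can serve all three modalities. A minimal sketch (the random embeddings below stand in for actual AIT-75M outputs):

import torch
import torch.nn.functional as F

# Stand-ins for L2-normalized AIT-75M outputs from any two modalities
text_emb  = F.normalize(torch.randn(8, 1280), dim=-1)
image_emb = F.normalize(torch.randn(8, 1280), dim=-1)

sim = text_emb @ image_emb.T   # (8, 8) cosine similarity matrix
nearest = sim.argmax(dim=-1)   # best-matching image for each text query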

Component          Architecture                                 Params   Size
Text encoder       LEAF-IR (MongoDB/mdbr-leaf-ir)               22.7M    87.2 MB
Image encoder      MobileNetV4-Medium (timm)                    8.4M     32.4 MB
Audio encoder      EfficientAT mn20_as                          17.9M    68.5 MB
Image projection   DeepProjectionHead (1280 -> 1920 -> 1280)    8.6M     32.9 MB
Audio projection   DeepProjectionHead (1920 -> 1920 -> 1280)    9.8M     37.5 MB
Text projection    DeepProjectionHead (768 -> 1920 -> 1280)     7.6M     29.1 MB
Total                                                           75.2M    287.7 MB

Projection head detail

Each DeepProjectionHead is a depth-1 residual MLP with Matryoshka-aware training:

Linear(encoder_dim, 1920) -> GELU -> LayerNorm -> Dropout(0.2)
  -> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.2) + residual
  -> Linear(1920, 1280)
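
A minimal PyTorch sketch of this head, following the layer order above; module names here are illustrative and may not match the checkpoint's actual key names:

import torch.nn as nn

class DeepProjectionHead(nn.Module):
    """Input projection -> one residual block -> output projection."""
    def __init__(self, encoder_dim, hidden_dim=1920, out_dim=1280, p=0.2):
        super().__init__()
        self.inp = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim), nn.GELU(),
            nn.LayerNorm(hidden_dim), nn.Dropout(p),
        )
        self.block = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.LayerNorm(hidden_dim), nn.Dropout(p),
        )
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        h = self.inp(x)
        h = h + self.block(h)  # the depth-1 residual connection
        return self.out(h)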

Matryoshka dimensions

Embeddings can be truncated to [1280, 768, 512, 256, 128] dimensions while preserving retrieval quality β€” trained with Matryoshka Representation Learning (MRL).

Benchmarks

All benchmarks were run on a single NVIDIA L4 GPU; the SALT evaluation uses 5K trimodal samples.

Cross-modal retrieval β€” SALT (5K trimodal samples)

Direction           AIT-75M (75M)   TEG-421M (421M)   ImageBind (1.2B)   EBind (1.78B*)
Image -> Text R@1   0.615           0.620             0.736              0.783
Text -> Image R@1   0.614           0.672             0.712              0.779
Text -> Audio R@1   0.103           0.113             0.038              0.047
Audio -> Text R@1   0.082           0.115             0.039              0.035
Image -> Audio R@1  0.062           0.083             0.023              0.027
Audio -> Image R@1  0.063           0.081             0.025              0.032

Audio retrieval β€” AudioCaps & Clotho

Benchmark   Direction   AIT-75M   CLAP-Large   ImageBind   EBind
AudioCaps   A->T R@1    0.210     0.420        0.116       0.225
AudioCaps   T->A R@1    0.148     0.280        0.080       0.219
Clotho      A->T R@1    0.208     0.195        0.061       0.088
Clotho      T->A R@1    0.172     0.167        0.074       0.118

AIT-75M posts the best Clotho A->T R@1 of all models compared, including CLAP-Large, while being fully trimodal.

Image-text retrieval β€” MSCOCO & Flickr30k

Benchmark   Direction   AIT-75M (75M)   EBind (1.78B*)   ImageBind (1.2B)
Flickr30k   I->T R@1    0.478           0.951            0.918
Flickr30k   T->I R@1    0.303           0.853            0.766
MSCOCO 5K   I->T R@1    0.320           0.743            0.658
MSCOCO 5K   T->I R@1    0.208           0.559            0.490

Zero-shot classification β€” ESC-50

Model        Params   Accuracy
CLAP-Large   67.8M    90.5%
AIT-75M      75M      93.2%
EBind        1.78B*   77.0%
ImageBind    1.2B     66.4%

AIT-75M posts the best ESC-50 accuracy of the models compared (93.2%) at 75M params, ahead of CLAP-Large (90.5%) while being fully trimodal.
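
Zero-shot classification here is just retrieval against the class labels: embed each label as text, embed each clip as audio, and take the nearest label. A sketch assuming hypothetical embed_text / embed_audio helpers (each wrapping an encoder plus its projection head) and an illustrative prompt template:

# esc50_classes: the 50 ESC-50 label strings; clips: a batch of audio inputs
# embed_text / embed_audio are hypothetical helpers returning unit-norm embeddings
class_emb = embed_text([f"the sound of {c}" for c in esc50_classes])  # (50, 1280)
audio_emb = embed_audio(clips)                                        # (N, 1280)
pred = (audio_emb @ class_emb.T).argmax(dim=-1)  # cosine sim -> predicted class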

Text retrieval β€” MTEB (NDCG@10)

Text-text retrieval quality in the shared embedding space, measured on MTEB retrieval tasks:

Task                    AIT-75M   Raw LEAF-IR   Recovery
ArguAna                 0.544     0.594         92%
CQADupstackGaming       0.506     0.607         83%
CQADupstackUnix         0.355     0.428         83%
FEVERHardNegatives      0.551     0.863         64%
HotpotQAHardNegatives   0.531     0.700         76%
FiQA2018                0.292     0.392         74%
ClimateFEVER            0.215     0.353         61%
SCIDOCS                 0.153     0.198         77%
TRECCOVID               0.474     0.820         58%

The text projection head recovers 58-92% of raw LEAF-IR's NDCG@10 (Recovery = AIT-75M / raw LEAF-IR) while mapping into the cross-modal shared space.

Usage

Loading components

from safetensors.torch import load_file

# Load the entire model (all components in one file)
tensors = load_file("AIT-75M.safetensors")

# Extract a component's state dict by its tensor-key prefix
def component_state_dict(prefix):
    return {k.removeprefix(prefix + "."): v
            for k, v in tensors.items() if k.startswith(prefix + ".")}

text_enc_sd   = component_state_dict("text_encoder")
image_enc_sd  = component_state_dict("image_encoder")
audio_enc_sd  = component_state_dict("audio_encoder")
image_proj_sd = component_state_dict("image_projection")
audio_proj_sd = component_state_dict("audio_projection")
text_proj_sd  = component_state_dict("text_projection")

Matryoshka truncation

import torch.nn.functional as F

# Full 1280-dim embedding
embedding = model(input)  # (N, 1280)

# Truncate to 256-dim and re-normalize
embedding_256 = F.normalize(embedding[:, :256], dim=-1)
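
Re-normalization after truncation is required: dropping dimensions changes the vector norm, and cosine similarities are only comparable across Matryoshka sizes once the truncated vectors are unit-length again.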

File layout

AIT-75M.safetensors     # All components in one file (~288 MB)

Tensor key prefixes

Prefix               Component             Tensors
text_encoder.*       LEAF-IR (float32)     103
image_encoder.*      MobileNetV4-Medium    462
audio_encoder.*      EfficientAT mn20_as   312
image_projection.*   Projection head       10
audio_projection.*   Projection head       10
text_projection.*    Projection head       10
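
A quick sanity check that a loaded checkpoint matches the tensor counts above (reusing the tensors dict from the loading example):

expected = {
    "text_encoder": 103, "image_encoder": 462, "audio_encoder": 312,
    "image_projection": 10, "audio_projection": 10, "text_projection": 10,
}
for prefix, count in expected.items():
    found = sum(1 for k in tensors if k.startswith(prefix + "."))
    assert found == count, f"{prefix}: expected {count} tensors, found {found}"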

Training

  • Loss: InfoNCE (contrastive) with Matryoshka Representation Learning (sketched after this list)
  • Data: ~2.2M synthetically generated trimodal triplets (WordNet) + 200K MSCOCO img+txt + 262K WavCaps aud+txt + 1.5M Nomic text pairs
  • Hardware: 2x NVIDIA L4 GPUs
  • Text retrieval fine-tune: Phase 1 warm start from d20 checkpoint, text-head-only with frozen image/audio heads, Nomic supervised text pairs mixed at lambda_tt=0.25
  • Optimizer: AdamW, lr=1e-3, weight decay=1e-4, cosine scheduler
  • Epochs: 7 (text fine-tune from pre-trained trimodal base)
  • Projection heads only β€” source encoders are frozen during training
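
A minimal sketch of the objective named above: symmetric in-batch InfoNCE averaged over the Matryoshka dimensions, with each truncated prefix re-normalized before computing similarities. The temperature and the uniform per-dimension weighting are assumptions, not documented values:

import torch
import torch.nn.functional as F

MRL_DIMS = [1280, 768, 512, 256, 128]

def info_nce(a, b, temperature=0.07):  # temperature is an assumed value
    # Symmetric contrastive loss over in-batch negatives; a, b: (N, d), unit-norm
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def matryoshka_info_nce(emb_a, emb_b, dims=MRL_DIMS):
    # Average the loss over truncated-and-renormalized embedding prefixes
    return torch.stack([
        info_nce(F.normalize(emb_a[:, :d], dim=-1),
                 F.normalize(emb_b[:, :d], dim=-1))
        for d in dims
    ]).mean()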

Design decisions

  • 3-head shared space: All modalities project into a learned 1280-dim space (image-native dimension) instead of targeting a pre-existing text encoder space
  • LEAF-IR text encoder: 23M-param retrieval-optimized text encoder replaces 300M Gemma, enabling fully edge-deployable text inference
  • Frozen source encoders: MobileNetV4, EfficientAT, and LEAF-IR are kept frozen; only projection heads are trained
  • Text retrieval fine-tune: Nomic supervised text pairs (1.5M) mixed into trimodal training to improve text-text retrieval while preserving cross-modal alignment
  • Edge-first: All source encoders can run on devices like Raspberry Pi 5

Limitations

  • Audio retrieval lags behind specialist models like CLAP on audio-only benchmarks
  • Image-text retrieval trades accuracy for edge deployability relative to larger vision encoders
  • Text retrieval recovers 58-92% of raw LEAF-IR quality (gap is domain-dependent)

License

Apache 2.0
