gcoderw committed on
Commit 88ebf8b · verified · 1 Parent(s): d3436cb

Rename TE-86M to AIT-86M and update package contents

Files changed (2)
  1. AIT-86M.safetensors +3 -0
  2. README.md +11 -282
AIT-86M.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5245f12b721086d90b4fe649c8f38b6928e658ba81087a620398f3dec567e2b7
+ size 346056964
README.md CHANGED
@@ -18,298 +18,27 @@ datasets:
  - custom
  ---

- # TE-86M — Trimodal Embeddings (Depth-2)

- **TE-86M** maps image, audio, and text into a shared 1280-dim embedding space, enabling cross-modal retrieval with a single vector index. All three modalities share a unified space with full Matryoshka truncation support down to 128 dims.

- Built for edge deployment: the entire model runs on a Raspberry Pi 5.

- Successor to [TE-75M](https://huggingface.co/augmem/TE-75M), with depth-2 residual projection heads that break through the cross-modal retrieval ceiling of depth-1 architectures while maintaining text retrieval quality.

- > Also available in [GGUF format](https://huggingface.co/augmem/TE-86M-GGUF) for quantized edge deployment.
-
- ## Architecture
-
- TE-86M uses lightweight edge encoders with depth-2 residual projection heads that expand through a 1920-dim hidden layer before projecting into a shared 1280-dim embedding space:
-
- ```
- Text  --> LEAF-IR (768-d) -----------> DeepProjectionHead-d2 (768 -> 1920 -> 1920 -> 1280)
- Image --> MobileNetV4-Medium (1280-d) --> DeepProjectionHead-d2 (1280 -> 1920 -> 1920 -> 1280)
- Audio --> EfficientAT mn20_as (1920-d) --> DeepProjectionHead-d2 (1920 -> 1920 -> 1920 -> 1280)
- ```
-
- All outputs are L2-normalized into the shared 1280-dim space for cross-modal cosine similarity.
-
- | Component | Architecture | Params | Size |
- |---|---|---|---|
- | Text encoder | LEAF-IR (MongoDB/mdbr-leaf-ir) | 22.7M | 87.2 MB |
- | Image encoder | MobileNetV4-Medium (timm) | 8.4M | 32.4 MB |
- | Audio encoder | EfficientAT mn20_as | 17.9M | 68.5 MB |
- | Image projection | DeepProjectionHead-d2 (1280 -> 1920 -> 1920 -> 1280) | 12.3M | 47.0 MB |
- | Audio projection | DeepProjectionHead-d2 (1920 -> 1920 -> 1920 -> 1280) | 13.5M | 51.7 MB |
- | Text projection | DeepProjectionHead-d2 (768 -> 1920 -> 1920 -> 1280) | 11.3M | 43.2 MB |
- | **Total** | | **86.1M** | **329.9 MB** |
-
- ### Projection head detail
-
- Each `DeepProjectionHead-d2` is a depth-2 residual MLP with Matryoshka-aware training:
-
- ```
- Linear(encoder_dim, 1920) -> GELU -> LayerNorm -> Dropout(0.3)
-   -> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.3) + residual
-   -> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.3) + residual
-   -> Linear(1920, 1280)
- ```
-
- ### Why depth-2?
-
- Ablation experiments showed depth-1 heads hit an I->T retrieval ceiling at ~0.60 R@1 regardless of hyperparameter tuning. Depth-2 heads broke through to 0.618, providing the representational capacity to serve cross-modal AND text retrieval simultaneously. The extra 11M params (75M -> 86M) remain edge-viable.
-
- ### Matryoshka dimensions
-
- Embeddings can be truncated to `[1280, 768, 512, 256, 128]` dimensions while preserving retrieval quality — trained with Matryoshka Representation Learning (MRL).
-
- ## Benchmarks
-
- SALT retrieval benchmarks use 5K trimodal samples. Full MTEB / MAEB evaluation used the 768-d Matryoshka truncation.
-
- ### Cross-modal retrieval — SALT (5K trimodal samples)
-
- | Direction | TE-86M (86M) | TE-75M (75M) | ImageBind (1.2B) | EBind (1.78B*) |
- |---|---|---|---|---|
- | Image -> Text R@1 | 0.618 | 0.615 | 0.736 | **0.783** |
- | Text -> Image R@1 | 0.630 | 0.614 | 0.712 | **0.779** |
- | Text -> Audio R@1 | **0.108** | 0.103 | 0.038 | 0.047 |
- | Audio -> Text R@1 | 0.087 | 0.082 | 0.039 | 0.035 |
- | Image -> Audio R@1 | **0.068** | 0.062 | 0.023 | 0.027 |
- | Audio -> Image R@1 | **0.070** | 0.063 | 0.025 | 0.032 |
-
- ### Audio retrieval — AudioCaps & Clotho
-
- | Benchmark | Direction | TE-86M | TE-75M | CLAP-Large | ImageBind | EBind |
- |---|---|---|---|---|---|---|
- | AudioCaps | A->T R@1 | 0.229 | 0.210 | **0.420** | 0.116 | 0.225 |
- | AudioCaps | T->A R@1 | 0.156 | 0.148 | **0.280** | 0.080 | 0.219 |
- | Clotho | A->T R@1 | **0.219** | 0.208 | 0.195 | 0.061 | 0.088 |
- | Clotho | T->A R@1 | **0.177** | 0.172 | 0.167 | 0.074 | 0.118 |
-
- ### Image-text retrieval — MSCOCO & Flickr30k
-
- | Benchmark | Direction | TE-86M (86M) | TE-75M (75M) | EBind (1.78B*) | ImageBind (1.2B) |
- |---|---|---|---|---|---|
- | Flickr30k | I->T R@1 | 0.494 | 0.478 | **0.951** | 0.918 |
- | Flickr30k | T->I R@1 | 0.332 | 0.303 | **0.853** | 0.766 |
- | MSCOCO 5K | I->T R@1 | 0.343 | 0.320 | **0.743** | 0.658 |
- | MSCOCO 5K | T->I R@1 | 0.225 | 0.208 | **0.559** | 0.490 |
-
- ### Zero-shot classification — ESC-50
-
- | Model | Params | Accuracy |
- |---|---|---|
- | TE-86M | 86M | **93.9%** |
- | CLAP-Large | 67.8M | 90.5% |
- | TE-75M | 75M | 93.2% |
- | EBind | 1.78B* | 77.0% |
- | ImageBind | 1.2B | 66.4% |
-
- ### Text retrieval — MTEB (NDCG@10)
-
- Text-text retrieval quality in the shared embedding space, measured on MTEB retrieval tasks:
-
- | Task | TE-86M | TE-75M | Raw LEAF-IR | Recovery |
- |---|---|---|---|---|
- | ArguAna | 0.545 | 0.544 | 0.594 | 92% |
- | CQADupstackGaming | 0.515 | 0.506 | 0.607 | 85% |
- | CQADupstackUnix | 0.334 | 0.355 | 0.428 | 78% |
- | FEVERHardNegatives | 0.561 | 0.551 | 0.863 | 65% |
- | HotpotQAHardNegatives | 0.554 | 0.531 | 0.700 | 79% |
- | FiQA2018 | 0.291 | 0.292 | 0.392 | 74% |
- | ClimateFEVER | 0.231 | 0.215 | 0.353 | 65% |
- | SCIDOCS | 0.154 | 0.153 | 0.198 | 78% |
- | TRECCOVID | 0.507 | 0.474 | 0.820 | 62% |
-
- TE-86M improves MTEB text retrieval over TE-75M on 7/9 tasks. The depth-2 projection heads recover 62-92% of raw LEAF-IR's retrieval quality while mapping into the cross-modal shared space.
-
- ### Full MTEB / MAEB kitchen-sink evaluation
-
- Full evaluation used the 768-d Matryoshka truncation of the TE-86M checkpoint `checkpoints/trimodal_3head_h1920_d2_v2_d30_tt025/best_model.pt` with MTEB 2.12.15.
- The run completed 63/71 requested tasks (`MTEB(eng, v2)` and `MAEB`). Failed tasks were: BirdCLEF, CREMA_DClustering, CommonLanguageAgeDetection, FleursT2ARetrieval, IEMOCAPGender, VoxCelebSA, VoxPopuliGenderClustering, VoxPopuliLanguageID.
-
- | Family | Tasks | Mean primary score |
- |---|---:|---:|
- | Text retrieval | 11 | 0.437 |
- | Text reranking | 1 | 0.314 |
- | Text summarization | 1 | 0.308 |
- | Text classification | 8 | 0.693 |
- | Text clustering | 8 | 0.445 |
- | Text STS / pair | 12 | 0.780 |
- | Audio retrieval | 8 | 0.169 |
- | Audio classification | 9 | 0.379 |
- | Audio pair / reranking | 4 | 0.620 |
- | Audio clustering | 1 | 0.009 |
-
- <details>
- <summary>Full per-task scores</summary>
-
- | Task | Family | Metric | Score | NDCG@10 | Recall@10 | Subsets |
- |---|---|---|---:|---:|---:|---:|
- | BeijingOpera | Audio classification | Main score | 0.868 | | | 1 |
- | CREMA_D | Audio classification | Main score | 0.285 | | | 1 |
- | FSD2019Kaggle | Audio classification | Main score | 0.548 | | | 2 |
- | GTZANGenre | Audio classification | Main score | 0.742 | | | 1 |
- | MInDS14 | Audio classification | Main score | 0.094 | | | 12 |
- | MridinghamTonic | Audio classification | Main score | 0.350 | | | 1 |
- | RavdessZeroshot | Audio classification | Main score | 0.197 | | | 1 |
- | SIBFLEURS | Audio classification | Main score | 0.218 | | | 102 |
- | SpeechCommandsZeroshotv0.02 | Audio classification | Main score | 0.111 | | | 1 |
- | VehicleSoundClustering | Audio clustering | Main score | 0.009 | | | 1 |
- | CREMADPairClassification | Audio pair / reranking | Main score | 0.528 | | | 1 |
- | GTZANAudioReranking | Audio pair / reranking | NDCG@10 | 0.874 | 0.874 | 0.987 | 1 |
- | NMSQAPairClassification | Audio pair / reranking | Main score | 0.547 | | | 1 |
- | VoxPopuliAccentPairClassification | Audio pair / reranking | Main score | 0.529 | | | 1 |
- | ClothoT2ARetrieval | Audio retrieval | NDCG@10 | 0.294 | 0.294 | 0.467 | 1 |
- | CommonVoiceMini21T2ARetrieval | Audio retrieval | NDCG@10 | 0.023 | 0.023 | 0.052 | 117 |
- | GigaSpeechT2ARetrieval | Audio retrieval | NDCG@10 | 0.002 | 0.002 | 0.004 | 1 |
- | JamAltArtistA2ARetrieval | Audio retrieval | NDCG@10 | 0.873 | 0.873 | 0.183 | 4 |
- | JamAltLyricA2TRetrieval | Audio retrieval | NDCG@10 | 0.008 | 0.008 | 0.013 | 4 |
- | MACST2ARetrieval | Audio retrieval | NDCG@10 | 0.136 | 0.136 | 0.252 | 1 |
- | SpokenSQuADT2ARetrieval | Audio retrieval | NDCG@10 | 0.010 | 0.010 | 0.020 | 1 |
- | UrbanSound8KT2ARetrieval | Audio retrieval | NDCG@10 | 0.009 | 0.009 | 0.020 | 1 |
- | BIOSSES | Text STS / pair | Main score | 0.803 | | | 1 |
- | SICK-R | Text STS / pair | Main score | 0.746 | | | 1 |
- | STS12 | Text STS / pair | Main score | 0.709 | | | 1 |
- | STS13 | Text STS / pair | Main score | 0.783 | | | 1 |
- | STS14 | Text STS / pair | Main score | 0.731 | | | 1 |
- | STS15 | Text STS / pair | Main score | 0.828 | | | 1 |
- | STS17 | Text STS / pair | Main score | 0.859 | | | 1 |
- | STS22.v2 | Text STS / pair | Main score | 0.679 | | | 1 |
- | STSBenchmark | Text STS / pair | Main score | 0.810 | | | 1 |
- | SprintDuplicateQuestions | Text STS / pair | Main score | 0.957 | | | 1 |
- | TwitterSemEval2015 | Text STS / pair | Main score | 0.620 | | | 1 |
- | TwitterURLCorpus | Text STS / pair | Main score | 0.837 | | | 1 |
- | AmazonCounterfactualClassification | Text classification | Main score | 0.681 | | | 1 |
- | Banking77Classification | Text classification | Main score | 0.742 | | | 1 |
- | ImdbClassification | Text classification | Main score | 0.722 | | | 1 |
- | MTOPDomainClassification | Text classification | Main score | 0.896 | | | 1 |
- | MassiveIntentClassification | Text classification | Main score | 0.623 | | | 1 |
- | MassiveScenarioClassification | Text classification | Main score | 0.709 | | | 1 |
- | ToxicConversationsClassification | Text classification | Main score | 0.623 | | | 1 |
- | TweetSentimentExtractionClassification | Text classification | Main score | 0.545 | | | 1 |
- | ArXivHierarchicalClusteringP2P | Text clustering | Main score | 0.549 | | | 1 |
- | ArXivHierarchicalClusteringS2S | Text clustering | Main score | 0.523 | | | 1 |
- | BiorxivClusteringP2P.v2 | Text clustering | Main score | 0.352 | | | 1 |
- | MedrxivClusteringP2P.v2 | Text clustering | Main score | 0.352 | | | 1 |
- | MedrxivClusteringS2S.v2 | Text clustering | Main score | 0.333 | | | 1 |
- | StackExchangeClustering.v2 | Text clustering | Main score | 0.589 | | | 1 |
- | StackExchangeClusteringP2P.v2 | Text clustering | Main score | 0.405 | | | 1 |
- | TwentyNewsgroupsClustering.v2 | Text clustering | Main score | 0.457 | | | 1 |
- | MindSmallReranking | Text reranking | Main score | 0.314 | 0.317 | 0.542 | 1 |
- | ArguAna | Text retrieval | NDCG@10 | 0.546 | 0.546 | 0.826 | 1 |
- | AskUbuntuDupQuestions | Text retrieval | NDCG@10 | 0.659 | 0.659 | 0.740 | 1 |
- | CQADupstackGamingRetrieval | Text retrieval | NDCG@10 | 0.519 | 0.519 | 0.659 | 1 |
- | CQADupstackUnixRetrieval | Text retrieval | NDCG@10 | 0.355 | 0.355 | 0.462 | 1 |
- | ClimateFEVERHardNegatives | Text retrieval | NDCG@10 | 0.239 | 0.239 | 0.301 | 1 |
- | FEVERHardNegatives | Text retrieval | NDCG@10 | 0.569 | 0.569 | 0.765 | 1 |
- | FiQA2018 | Text retrieval | NDCG@10 | 0.292 | 0.292 | 0.361 | 1 |
- | HotpotQAHardNegatives | Text retrieval | NDCG@10 | 0.553 | 0.553 | 0.605 | 1 |
- | SCIDOCS | Text retrieval | NDCG@10 | 0.154 | 0.154 | 0.164 | 1 |
- | TRECCOVID | Text retrieval | NDCG@10 | 0.513 | 0.513 | 0.014 | 1 |
- | Touche2020Retrieval.v3 | Text retrieval | NDCG@10 | 0.414 | 0.414 | 0.173 | 1 |
- | SummEvalSummarization.v2 | Text summarization | Main score | 0.308 | | | 1 |
-
- </details>
-
- ## Usage
-
- ### Loading components
-
- ```python
- from safetensors.torch import load_file
-
- # Load entire model
- tensors = load_file("TE-86M.safetensors")
-
- # Extract components by prefix
- text_enc_sd = {k.removeprefix("text_encoder."): v for k, v in tensors.items() if k.startswith("text_encoder.")}
- image_enc_sd = {k.removeprefix("image_encoder."): v for k, v in tensors.items() if k.startswith("image_encoder.")}
- audio_enc_sd = {k.removeprefix("audio_encoder."): v for k, v in tensors.items() if k.startswith("audio_encoder.")}
- image_proj_sd = {k.removeprefix("image_projection."): v for k, v in tensors.items() if k.startswith("image_projection.")}
- audio_proj_sd = {k.removeprefix("audio_projection."): v for k, v in tensors.items() if k.startswith("audio_projection.")}
- text_proj_sd = {k.removeprefix("text_projection."): v for k, v in tensors.items() if k.startswith("text_projection.")}
- ```
-
- ### Matryoshka truncation
-
- ```python
- import torch.nn.functional as F
-
- # Full 1280-dim embedding
- embedding = model(input)  # (N, 1280)
-
- # Truncate to 256-dim and re-normalize
- embedding_256 = F.normalize(embedding[:, :256], dim=-1)
- ```

  ## File layout

  ```
- TE-86M.safetensors   # All components in one file (~330 MB)
- ```
-
- ### Tensor key prefixes
-
- | Prefix | Component | Tensors |
- |---|---|---|
- | `text_encoder.*` | LEAF-IR (float32) | 103 |
- | `image_encoder.*` | MobileNetV4-Medium | 462 |
- | `audio_encoder.*` | EfficientAT mn20_as | 312 |
- | `image_projection.*` | Depth-2 projection head | 14 |
- | `audio_projection.*` | Depth-2 projection head | 14 |
- | `text_projection.*` | Depth-2 projection head | 14 |
-
- ## Training
-
- - **Loss**: InfoNCE (contrastive) with Matryoshka Representation Learning
- - **Data**: ~2.2M synthetically generated trimodal triplets (WordNet) + 200K MSCOCO img+txt + 262K WavCaps aud+txt + 1.5M Nomic text pairs
- - **Hardware**: 2x NVIDIA L4 GPUs
- - **Optimizer**: AdamW, lr=1.41e-3, weight decay=1e-4, cosine scheduler
- - **Epochs**: 50
- - **Batch size**: 4096
- - **Dropout**: 0.20 -> 0.25 (ep27) -> 0.30 (ep29) — mid-run regularization increases
- - **Text mixing**: λ_tt=0.5 (ep1-9) -> 0.25 (ep10-50) — Nomic supervised text pairs
- - **Projection heads only** — source encoders are frozen during training
-
- ### Improvements over TE-75M
-
- | Change | TE-75M | TE-86M |
- |---|---|---|
- | Projection depth | 1 (single residual block) | 2 (two residual blocks) |
- | Head params | 26.1M | 37.2M |
- | Total params | 75.2M | 86.1M |
- | SALT I->T R@1 | 0.615 | 0.618 (+0.5%) |
- | SALT T->I R@1 | 0.614 | 0.630 (+2.6%) |
- | MSCOCO I->T R@1 | 0.320 | 0.343 (+7.2%) |
- | Clotho A->T R@1 | 0.208 | 0.219 (+5.3%) |
- | ESC-50 | 93.2% | 93.9% (+0.7%) |
-
- ### Design decisions
-
- - **Depth-2 residual heads**: Ablation confirmed depth-1 hits I->T ceiling at ~0.60 regardless of dropout or λ_tt. Depth-2 provides capacity to serve cross-modal and text retrieval simultaneously.
- - **3-head shared space**: All modalities project into a learned 1280-dim space (image-native dimension)
- - **LEAF-IR text encoder**: 23M-param retrieval-optimized text encoder enables fully edge-deployable text inference
- - **Frozen source encoders**: MobileNetV4, EfficientAT, and LEAF-IR are kept frozen; only projection heads are trained
- - **Edge-first**: All source encoders can run on devices like Raspberry Pi 5
-
- ## Limitations
-
- - Audio retrieval lags behind specialist models like CLAP on audio-only benchmarks
- - Image-text retrieval trades accuracy for edge deployability compared with larger vision encoders
- - Text retrieval recovers 62-92% of raw LEAF-IR quality (gap is domain-dependent)

- ## Links

- - **Website**: [augmem.ai](https://augmem.ai)
- - **GitHub**: [github.com/augmem](https://github.com/augmem)

  ## License

  - custom
  ---

+ # AIT-86M — Audio, Image, Text Embeddings (Depth-2)

+ **AIT-86M** maps image, audio, and text into a shared 1280-dim embedding space for cross-modal retrieval with a single vector index. All three modalities share one space with full Matryoshka truncation support down to 128 dims.

+ Built for edge deployment, with a single combined safetensors artifact.

+ Successor to [TE-75M](https://huggingface.co/augmem/TE-75M).

+ > Also available in [GGUF format](https://huggingface.co/augmem/AIT-86M-GGUF) for quantized edge deployment.
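
+ A minimal cross-modal retrieval sketch over embeddings already in the shared space (illustrative only: the tensors below are random stand-ins for model outputs, and cosine similarity on L2-normalized vectors reduces to a dot product):

+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Stand-ins for L2-normalized 1280-dim embeddings in the shared space
+ text_query = F.normalize(torch.randn(1, 1280), dim=-1)      # one text query
+ image_index = F.normalize(torch.randn(1000, 1280), dim=-1)  # indexed image embeddings
+
+ # Cosine similarity is a dot product on normalized vectors
+ scores = text_query @ image_index.T          # (1, 1000)
+ top5 = scores.topk(5, dim=-1).indices        # indices of the 5 nearest images
+ ```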

  ## File layout

+ ```text
+ AIT-86M.safetensors
  ```
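
+ A minimal loading sketch, assuming the tensor key prefixes follow the previous TE-86M layout (`text_encoder.*`, `image_encoder.*`, `audio_encoder.*`, plus the three `*_projection.*` heads); verify against the actual checkpoint keys before relying on them:

+ ```python
+ from safetensors.torch import load_file
+
+ # Load the combined checkpoint (all encoders and projection heads in one file)
+ tensors = load_file("AIT-86M.safetensors")
+
+ # Split out one component by prefix, e.g. the text projection head
+ text_proj_sd = {
+     k.removeprefix("text_projection."): v
+     for k, v in tensors.items()
+     if k.startswith("text_projection.")
+ }
+ ```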

+ ## Notes

+ - shared trimodal embedding space
+ - Matryoshka truncation: `1280 / 768 / 512 / 256 / 128` (see the sketch below)
+ - intended for retrieval and embedding use, not generation
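
+ A minimal Matryoshka truncation sketch (the `embedding` tensor is a random stand-in for model output; truncate the leading dims, then re-normalize):

+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Stand-in for a batch of full 1280-dim embeddings from the model
+ embedding = F.normalize(torch.randn(4, 1280), dim=-1)
+
+ # Keep the leading 256 dims, then re-normalize for cosine similarity
+ embedding_256 = F.normalize(embedding[:, :256], dim=-1)
+ ```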

  ## License