augmem/teg-421m

@@ -61,48 +61,64 @@ Embeddings can be truncated to `[768, 512, 256, 128]` dimensions while preservin
 ## Benchmarks
 ### Cross-modal retrieval — SALT (5K trimodal samples)
-| Direction | TEG-421M | ImageBind | EBind |
-|---|---|---|---|
-| Text → Image R@1 | **0.687** | 0.712 | 0.779 |
-| Image → Text R@1 | **0.624** | 0.736 | 0.783 |
-| Text → Audio R@1 | **0.117** | 0.038 | 0.047 |
-| Audio → Text R@1 | **0.104** | 0.039 | 0.035 |
-| Audio → Image R@1 | **0.059** | 0.023 | 0.027 |
-| Image → Audio R@1 | **0.057** | 0.025 | 0.032 |
-TEG significantly outperforms both ImageBind and EBind on all audio cross-modal directions while remaining competitive on vision-text with encoders ~3x smaller.
 ### Audio retrieval — AudioCaps & Clotho
-| Benchmark | Direction | TEG-421M | CLAP-Small | CLAP-Large | ImageBind | EBind |
-|---|---|---|---|---|---|---|
-| AudioCaps | A→T R@1 | 0.156 | **0.425** | 0.420 | 0.116 | 0.225 |
-| AudioCaps | T→A R@1 | 0.145 | **0.315** | 0.280 | 0.080 | 0.219 |
-| Clotho | A→T R@1 | 0.159 | 0.166 | **0.195** | 0.061 | 0.088 |
-| Clotho | T→A R@1 | 0.125 | **0.159** | 0.167 | 0.074 | 0.118 |
-CLAP models still lead on audio-only benchmarks (they're audio specialists), but TEG closes much of the gap vs other trimodal models while adding image support.
 ### Image-text retrieval — Flickr30k (MTEB)
-| Direction | TEG-421M |
-|---|---|
-| I→T R@1 | 0.481 |
-| I→T R@10 | 0.835 |
-| T→I R@1 | 0.375 |
-| T→I R@10 | 0.763 |
 ### Zero-shot classification — ESC-50
-| Model | Accuracy |
-|---|---|
-| CLAP-Large | **0.905** |
-| TEG-421M | 0.829 |
-| EBind | 0.770 |
-| CLAP-Small | 0.751 |
-| ImageBind | 0.664 |
 ## Usage

 ## Benchmarks
+All benchmarks were run on a single NVIDIA L4 GPU, using 5K samples where applicable.
 ### Cross-modal retrieval — SALT (5K trimodal samples)
+| Direction | TEG-421M (421M) | LCO-3B (4.7B) | Nemotron-3B (4.7B) | ImageBind (1.2B) | EBind (~200M) |
+|---|---|---|---|---|---|
+| Text → Image R@1 | 0.687 | 0.660 | 0.529 | 0.712 | **0.779** |
+| Image → Text R@1 | 0.624 | 0.564 | 0.299 | 0.736 | **0.783** |
+| Text → Audio R@1 | **0.117** | 0.042 | 0.018 | 0.038 | 0.047 |
+| Audio → Text R@1 | **0.104** | 0.032 | 0.010 | 0.039 | 0.035 |
+| Audio → Image R@1 | **0.059** | 0.027 | 0.016 | 0.023 | 0.027 |
+| Image → Audio R@1 | **0.057** | 0.034 | 0.018 | 0.025 | 0.032 |
+TEG leads every audio cross-modal direction, with R@1 roughly 1.7-10x that of the other models, three of which (ImageBind, LCO-3B, Nemotron-3B) are about 3-11x larger. On vision-text it trails EBind and ImageBind, but its encoders are small enough for edge deployment.
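For readers unfamiliar with the R@1 figures above, the following is a minimal sketch of how Recall@K is commonly computed for paired cross-modal retrieval. It assumes row `i` of the query matrix is paired with row `i` of the gallery matrix and that ranking uses cosine similarity; the exact SALT scoring protocol is not stated here, and `recall_at_k` and the toy vectors are hypothetical, not model output.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose paired item (same row index) ranks in the top k."""
    # L2-normalise so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T  # shape: (num_queries, num_gallery)
    # Similarity of each query to its own ground-truth pair (the diagonal).
    true_sims = np.diag(sims)[:, None]
    # Rank = number of gallery items scoring strictly higher than the true pair.
    # Ties therefore count in the query's favour (an assumption of this sketch).
    ranks = (sims > true_sims).sum(axis=1)
    return float((ranks < k).mean())

# Toy data: 4 paired items in 3 dimensions (made-up vectors).
text = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
audio = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.7, 0.7, 0.1], [0.0, 0.1, 0.9]])

r1 = recall_at_k(text, audio, k=1)  # text -> audio direction
r4 = recall_at_k(text, audio, k=4)
print(r1, r4)
```

Swapping the two arguments gives the reverse direction (here audio → text), which is why each modality pair appears twice in the table.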
 ### Audio retrieval — AudioCaps & Clotho
+| Benchmark | Direction | TEG-421M | LCO-3B | Nemotron-3B | CLAP-Small | CLAP-Large | ImageBind | EBind |
+|---|---|---|---|---|---|---|---|---|
+| AudioCaps | A→T R@1 | 0.156 | 0.250 | 0.050 | **0.425** | 0.420 | 0.116 | 0.225 |
+| AudioCaps | T→A R@1 | 0.145 | 0.215 | 0.075 | **0.315** | 0.280 | 0.080 | 0.219 |
+| Clotho | A→T R@1 | 0.159 | 0.178 | 0.038 | 0.166 | **0.195** | 0.061 | 0.088 |
+| Clotho | T→A R@1 | 0.125 | **0.187** | 0.070 | 0.159 | 0.167 | 0.074 | 0.118 |
+CLAP models lead three of the four audio-only rows (they are audio specialists with no image support). Among trimodal models, TEG trails LCO-3B by 0.02-0.09 R@1 while being about 11x smaller, and it outscores ImageBind and Nemotron-3B in every row.
 ### Image-text retrieval — Flickr30k (MTEB)
+| Direction | TEG-421M | LCO-3B | Nemotron-3B |
+|---|---|---|---|
+| I→T R@1 | 0.481 | **0.840** | 0.419 |
+| I→T R@10 | 0.835 | **0.990** | 0.875 |
+| T→I R@1 | 0.375 | **0.765** | 0.563 |
+| T→I R@10 | 0.763 | **0.963** | 0.869 |
+LCO-3B excels on Flickr30k due to its 4.7B Qwen2.5-Omni backbone, but at roughly 10x the image encoding cost (15.4 vs 158 items/s on the L4).
 ### Zero-shot classification — ESC-50
+| Model | Params | Accuracy |
+|---|---|---|
+| CLAP-Large | 67.8M | **0.905** |
+| LCO-3B | 4.7B | 0.853 |
+| TEG-421M | 421M | 0.829 |
+| EBind | ~200M | 0.770 |
+| CLAP-Small | 27.5M | 0.751 |
+| Nemotron-3B | 4.7B | 0.727 |
+| ImageBind | 1.2B | 0.664 |
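As a rough illustration of what a zero-shot accuracy number like those above measures, the sketch below assigns each audio embedding to the class whose text embedding is most similar by cosine similarity. The prompt template and embedding calls used for ESC-50 are not given here, so the embeddings are made-up 2-D vectors and `zero_shot_accuracy` is a hypothetical helper.

```python
import numpy as np

def zero_shot_accuracy(audio_emb, label_emb, targets):
    """Predict the class whose label embedding is closest to each audio embedding."""
    # Normalise both sides so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    c = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    preds = (a @ c.T).argmax(axis=1)  # index of the best-matching class per clip
    return float((preds == targets).mean()), preds

# Toy stand-ins: 3 class-name embeddings and 4 audio-clip embeddings.
label_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
audio_emb = np.array([[0.9, 0.2], [0.1, 1.0], [-0.8, 0.3], [0.6, 0.7]])
targets = np.array([0, 1, 2, 0])

acc, preds = zero_shot_accuracy(audio_emb, label_emb, targets)
print(acc, preds.tolist())
```

No classifier is trained in this setup: only the class names are embedded, which is what makes the evaluation zero-shot.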
+### Throughput — items/s on NVIDIA L4
+| Modality | TEG-421M | LCO-3B | Nemotron-3B | ImageBind |
+|---|---|---|---|---|
+| Text | **470** | 90 | 90 | — |
+| Audio | **180** | 5.2 | 42.8 | — |
+| Image | **158** | 15.4 | 15.4 | — |
+TEG is **35x faster than LCO on audio** and **10x faster on image** — the difference between real-time edge inference and datacenter-only deployment.
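The speedup figures quoted above follow directly from the throughput table. A small sketch of the arithmetic, using only the items/s values listed for TEG-421M and LCO-3B (the dictionary names are illustrative):

```python
# Throughput in items/s on an NVIDIA L4, copied from the table above.
teg = {"Text": 470, "Audio": 180, "Image": 158}
lco = {"Text": 90, "Audio": 5.2, "Image": 15.4}

# Speedup = TEG throughput divided by LCO throughput, per modality.
speedup = {modality: teg[modality] / lco[modality] for modality in teg}

for modality, ratio in speedup.items():
    print(f"{modality}: {ratio:.1f}x")
```

The audio and image ratios come out at about 34.6x and 10.3x, which the text rounds to 35x and 10x; text encoding is about 5.2x faster.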
 ## Usage