Add LCO-3B and Nemotron-3B benchmark comparisons + throughput table
README.md
CHANGED
|
@@ -61,48 +61,64 @@ Embeddings can be truncated to `[768, 512, 256, 128]` dimensions while preservin
|
|
| 61 |
|
| 62 |
## Benchmarks
|
| 63 |
|
|
|
|
|
|
|
| 64 |
### Cross-modal retrieval — SALT (5K trimodal samples)
|
| 65 |
|
| 66 |
-
| Direction | TEG-421M | ImageBind | EBind |
|
| 67 |
-
|---|---|---|---|
|
| 68 |
-
| Text → Image R@1 |
|
| 69 |
-
| Image → Text R@1 |
|
| 70 |
-
| Text → Audio R@1 | **0.117** | 0.038 | 0.047 |
|
| 71 |
-
| Audio → Text R@1 | **0.104** | 0.039 | 0.035 |
|
| 72 |
-
| Audio → Image R@1 | **0.059** | 0.023 | 0.027 |
|
| 73 |
-
| Image → Audio R@1 | **0.057** | 0.025 | 0.032 |
|
| 74 |
|
| 75 |
-
TEG
|
| 76 |
|
| 77 |
### Audio retrieval — AudioCaps & Clotho
|
| 78 |
|
| 79 |
-
| Benchmark | Direction | TEG-421M | CLAP-Small | CLAP-Large | ImageBind | EBind |
|
| 80 |
-
|---|---|---|---|---|---|---|
|
| 81 |
-
| AudioCaps | A→T R@1 | 0.156 | **0.425** | 0.420 | 0.116 | 0.225 |
|
| 82 |
-
| AudioCaps | T→A R@1 | 0.145 | **0.315** | 0.280 | 0.080 | 0.219 |
|
| 83 |
-
| Clotho | A→T R@1 | 0.159 | 0.166 | **0.195** | 0.061 | 0.088 |
|
| 84 |
-
| Clotho | T→A R@1 | 0.125 | **0.
|
| 85 |
|
| 86 |
-
CLAP models
|
| 87 |
|
| 88 |
### Image-text retrieval — Flickr30k (MTEB)
|
| 89 |
|
| 90 |
-
| Direction | TEG-421M |
|
| 91 |
-
|---|---|
|
| 92 |
-
| I→T R@1 | 0.481 |
|
| 93 |
-
| I→T R@10 | 0.835 |
|
| 94 |
-
| T→I R@1 | 0.375 |
|
| 95 |
-
| T→I R@10 | 0.763 |
|
|
|
|
|
|
|
| 96 |
|
| 97 |
### Zero-shot classification — ESC-50
|
| 98 |
|
| 99 |
-
| Model | Accuracy |
|
| 100 |
-
|---|---|
|
| 101 |
-
| CLAP-Large | **0.905** |
|
| 102 |
-
|
|
| 103 |
-
|
|
| 104 |
-
|
|
| 105 |
-
|
|
| 106 |
|
| 107 |
## Usage
|
| 108 |
|
|
|
|
| 61 |
|
| 62 |
## Benchmarks
|
| 63 |
|
| 64 |
+
All benchmarks were run on a single NVIDIA L4 GPU, using 5K samples where applicable.
|
| 65 |
+
|
| 66 |
### Cross-modal retrieval — SALT (5K trimodal samples)
|
| 67 |
|
| 68 |
+| Direction | TEG-421M (421M) | LCO-3B (4.7B) | Nemotron-3B (4.7B) | ImageBind (1.2B) | EBind |
+|---|---|---|---|---|---|
+| Text → Image R@1 | 0.687 | 0.660 | 0.529 | 0.712 | **0.779** |
+| Image → Text R@1 | 0.624 | 0.564 | 0.299 | 0.736 | **0.783** |
+| Text → Audio R@1 | **0.117** | 0.042 | 0.018 | 0.038 | 0.047 |
+| Audio → Text R@1 | **0.104** | 0.032 | 0.010 | 0.039 | 0.035 |
+| Audio → Image R@1 | **0.059** | 0.027 | 0.016 | 0.023 | 0.027 |
+| Image → Audio R@1 | **0.057** | 0.034 | 0.018 | 0.025 | 0.032 |
|
| 76 |
|
| 77 |
+
TEG leads every audio cross-modal direction, with R@1 roughly 1.7x to 10x higher than the other models, including models up to about 11x larger. On vision-text retrieval it trails EBind and ImageBind, but its encoders are small enough for edge deployment.
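
For readers unfamiliar with the metric, the R@1 figures above can be understood through the following minimal sketch of Recall@K over paired embeddings. It is an illustration only, not necessarily the evaluation script used for these numbers. The helper names `cosine` and `recall_at_k` are hypothetical, the embeddings are assumed to be precomputed, and query `i` is assumed to match gallery item `i`. Plain Python lists keep the sketch dependency-free; a real evaluation over 5K samples would normally be vectorized.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(queries: Sequence[Vector], gallery: Sequence[Vector], k: int = 1) -> float:
    """Fraction of queries whose paired gallery item (same index) ranks in the top k.

    Assumption: queries[i] and gallery[i] form the ground-truth pair.
    """
    if not queries or len(queries) != len(gallery):
        raise ValueError("queries and gallery must be non-empty, index-aligned pairs")
    hits = 0
    for i, query in enumerate(queries):
        sims = [cosine(query, item) for item in gallery]
        # Rank of the true match = number of gallery items scoring strictly higher.
        rank = sum(1 for s in sims if s > sims[i])
        if rank < k:
            hits += 1
    return hits / len(queries)
```

Calling `recall_at_k(text_embs, image_embs, k=1)` would correspond to a Text → Image R@1 entry, and swapping the arguments gives the reverse direction. The same function with `k=10` corresponds to an R@10 entry.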
|
| 78 |
|
| 79 |
### Audio retrieval — AudioCaps & Clotho
|
| 80 |
|
| 81 |
+| Benchmark | Direction | TEG-421M | LCO-3B | Nemotron-3B | CLAP-Small | CLAP-Large | ImageBind | EBind |
+|---|---|---|---|---|---|---|---|---|
+| AudioCaps | A→T R@1 | 0.156 | 0.250 | 0.050 | **0.425** | 0.420 | 0.116 | 0.225 |
+| AudioCaps | T→A R@1 | 0.145 | 0.215 | 0.075 | **0.315** | 0.280 | 0.080 | 0.219 |
+| Clotho | A→T R@1 | 0.159 | 0.178 | 0.038 | 0.166 | **0.195** | 0.061 | 0.088 |
+| Clotho | T→A R@1 | 0.125 | **0.187** | 0.070 | 0.159 | 0.167 | 0.074 | 0.118 |
|
| 87 |
|
| 88 |
+
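CLAP models, which are audio specialists with no image support, lead three of the four audio-only rows; LCO-3B takes Clotho T→A. Among trimodal models, TEG ranks second to LCO-3B on Clotho and third on AudioCaps (behind LCO-3B and EBind) while being about 11x smaller than LCO-3B.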
|
| 89 |
|
| 90 |
### Image-text retrieval — Flickr30k (MTEB)
|
| 91 |
|
| 92 |
+| Direction | TEG-421M | LCO-3B | Nemotron-3B |
+|---|---|---|---|
+| I→T R@1 | 0.481 | **0.840** | 0.419 |
+| I→T R@10 | 0.835 | **0.990** | 0.875 |
+| T→I R@1 | 0.375 | **0.765** | 0.563 |
+| T→I R@10 | 0.763 | **0.963** | 0.869 |
|
| 98 |
+
|
| 99 |
+
LCO-3B leads every Flickr30k metric thanks to its 4.7B-parameter Qwen2.5-Omni backbone, but it encodes images about 10x slower than TEG (15.4 vs 158 items/s on an L4).
|
| 100 |
|
| 101 |
### Zero-shot classification — ESC-50
|
| 102 |
|
| 103 |
+| Model | Params | Accuracy |
+|---|---|---|
+| CLAP-Large | 67.8M | **0.905** |
+| LCO-3B | 4.7B | 0.853 |
+| TEG-421M | 421M | 0.829 |
+| EBind | ~200M | 0.770 |
+| CLAP-Small | 27.5M | 0.751 |
+| Nemotron-3B | 4.7B | 0.727 |
+| ImageBind | 1.2B | 0.664 |
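
As a rough illustration of what zero-shot classification means for an embedding model, the sketch below assigns each audio clip the label whose text embedding is closest in cosine similarity, then scores accuracy. It assumes precomputed embeddings in a shared space. The function names (`zero_shot_classify`, `accuracy`) and any label prompts are hypothetical, and this is not claimed to be the exact protocol behind the table.

```python
import math
from typing import Mapping, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(audio_emb: Vector, label_embs: Mapping[str, Vector]) -> str:
    """Return the label whose text embedding is most similar to the audio embedding."""
    return max(label_embs, key=lambda label: cosine(audio_emb, label_embs[label]))


def accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that equal the target label."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and the same length")
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)
```

For ESC-50, `label_embs` would hold one text embedding per class (50 in total), and no audio classifier is trained: the class names themselves act as the classifier.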
|
| 112 |
+
|
| 113 |
+
### Throughput — items/s on NVIDIA L4
|
| 114 |
+
|
| 115 |
+| Modality | TEG-421M | LCO-3B | Nemotron-3B | ImageBind |
+|---|---|---|---|---|
+| Text | **470** | 90 | 90 | — |
+| Audio | **180** | 5.2 | 42.8 | — |
+| Image | **158** | 15.4 | 15.4 | — |
|
| 120 |
+
|
| 121 |
+
TEG is about **35x faster than LCO-3B on audio** and **10x faster on image**, which is the difference between real-time edge inference and datacenter-only deployment.
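
The speedup figures follow directly from the throughput table. The short sketch below reproduces the arithmetic; the `THROUGHPUT` dictionary simply transcribes the table's items/s values, and the `speedup` helper is a hypothetical name introduced for illustration.

```python
# Items/s on an NVIDIA L4, transcribed from the throughput table.
THROUGHPUT = {
    "Text": {"TEG-421M": 470.0, "LCO-3B": 90.0, "Nemotron-3B": 90.0},
    "Audio": {"TEG-421M": 180.0, "LCO-3B": 5.2, "Nemotron-3B": 42.8},
    "Image": {"TEG-421M": 158.0, "LCO-3B": 15.4, "Nemotron-3B": 15.4},
}


def speedup(modality: str, baseline: str, model: str = "TEG-421M") -> float:
    """How many times faster `model` is than `baseline` for one modality."""
    row = THROUGHPUT[modality]
    return row[model] / row[baseline]


if __name__ == "__main__":
    for modality in THROUGHPUT:
        for baseline in ("LCO-3B", "Nemotron-3B"):
            print(f"{modality:5s} vs {baseline}: {speedup(modality, baseline):.1f}x")
```

Against LCO-3B this gives about 34.6x on audio and 10.3x on image, matching the rounded 35x and 10x figures, plus about 5.2x on text.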
|
| 122 |
|
| 123 |
## Usage
|
| 124 |
|