gcoderw committed on
Commit b15b053 · verified · 1 Parent(s): 1ffc68a

Add LCO-3B and Nemotron-3B benchmark comparisons + throughput table

Files changed (1)
  1. README.md +45 -29
README.md CHANGED
@@ -61,48 +61,64 @@ Embeddings can be truncated to `[768, 512, 256, 128]` dimensions while preservin

  ## Benchmarks

  ### Cross-modal retrieval — SALT (5K trimodal samples)

- | Direction | TEG-421M | ImageBind | EBind |
- |---|---|---|---|
- | Text → Image R@1 | **0.687** | 0.712 | 0.779 |
- | Image → Text R@1 | **0.624** | 0.736 | 0.783 |
- | Text → Audio R@1 | **0.117** | 0.038 | 0.047 |
- | Audio → Text R@1 | **0.104** | 0.039 | 0.035 |
- | Audio → Image R@1 | **0.059** | 0.023 | 0.027 |
- | Image → Audio R@1 | **0.057** | 0.025 | 0.032 |

- TEG significantly outperforms both ImageBind and EBind on all audio cross-modal directions while remaining competitive on vision-text with encoders ~3x smaller.

  ### Audio retrieval — AudioCaps & Clotho

- | Benchmark | Direction | TEG-421M | CLAP-Small | CLAP-Large | ImageBind | EBind |
- |---|---|---|---|---|---|---|
- | AudioCaps | A→T R@1 | 0.156 | **0.425** | 0.420 | 0.116 | 0.225 |
- | AudioCaps | T→A R@1 | 0.145 | **0.315** | 0.280 | 0.080 | 0.219 |
- | Clotho | A→T R@1 | 0.159 | 0.166 | **0.195** | 0.061 | 0.088 |
- | Clotho | T→A R@1 | 0.125 | **0.159** | 0.167 | 0.074 | 0.118 |

- CLAP models still lead on audio-only benchmarks (they're audio specialists), but TEG closes much of the gap vs other trimodal models while adding image support.

  ### Image-text retrieval — Flickr30k (MTEB)

- | Direction | TEG-421M |
- |---|---|
- | I→T R@1 | 0.481 |
- | I→T R@10 | 0.835 |
- | T→I R@1 | 0.375 |
- | T→I R@10 | 0.763 |

  ### Zero-shot classification — ESC-50

- | Model | Accuracy |
- |---|---|
- | CLAP-Large | **0.905** |
- | TEG-421M | 0.829 |
- | EBind | 0.770 |
- | CLAP-Small | 0.751 |
- | ImageBind | 0.664 |

  ## Usage


  ## Benchmarks

+ All benchmarks run on a single NVIDIA L4 GPU with 5K samples where applicable.
+
  ### Cross-modal retrieval — SALT (5K trimodal samples)

+ | Direction | TEG-421M (421M) | LCO-3B (4.7B) | Nemotron-3B (4.7B) | ImageBind (1.2B) | EBind |
+ |---|---|---|---|---|---|
+ | Text → Image R@1 | 0.687 | 0.660 | 0.529 | 0.712 | **0.779** |
+ | Image → Text R@1 | 0.624 | 0.564 | 0.299 | 0.736 | **0.783** |
+ | Text → Audio R@1 | **0.117** | 0.042 | 0.018 | 0.038 | 0.047 |
+ | Audio → Text R@1 | **0.104** | 0.032 | 0.010 | 0.039 | 0.035 |
+ | Audio → Image R@1 | **0.059** | 0.027 | 0.016 | 0.023 | 0.027 |
+ | Image → Audio R@1 | **0.057** | 0.034 | 0.018 | 0.025 | 0.032 |

+ TEG leads all audio cross-modal directions by 2-10x over models that are 3-11x larger. Vision-text trails EBind/ImageBind but uses encoders small enough for edge deployment.

  ### Audio retrieval — AudioCaps & Clotho

+ | Benchmark | Direction | TEG-421M | LCO-3B | Nemotron-3B | CLAP-Small | CLAP-Large | ImageBind | EBind |
+ |---|---|---|---|---|---|---|---|---|
+ | AudioCaps | A→T R@1 | 0.156 | 0.250 | 0.050 | **0.425** | 0.420 | 0.116 | 0.225 |
+ | AudioCaps | T→A R@1 | 0.145 | 0.215 | 0.075 | **0.315** | 0.280 | 0.080 | 0.219 |
+ | Clotho | A→T R@1 | 0.159 | 0.178 | 0.038 | 0.166 | **0.195** | 0.061 | 0.088 |
+ | Clotho | T→A R@1 | 0.125 | **0.187** | 0.070 | 0.159 | 0.167 | 0.074 | 0.118 |

+ CLAP models lead on audio-only benchmarks (audio specialists with no image support). Among trimodal models, TEG is competitive with LCO while being 11x smaller.

  ### Image-text retrieval — Flickr30k (MTEB)

+ | Direction | TEG-421M | LCO-3B | Nemotron-3B |
+ |---|---|---|---|
+ | I→T R@1 | 0.481 | **0.840** | 0.419 |
+ | I→T R@10 | 0.835 | **0.990** | 0.875 |
+ | T→I R@1 | 0.375 | **0.765** | 0.563 |
+ | T→I R@10 | 0.763 | **0.963** | 0.869 |
+
+ LCO excels on Flickr30k due to its 4.7B Qwen2.5-Omni backbone, but at 10x the image encoding cost.

  ### Zero-shot classification — ESC-50

+ | Model | Params | Accuracy |
+ |---|---|---|
+ | CLAP-Large | 67.8M | **0.905** |
+ | LCO-3B | 4.7B | 0.853 |
+ | TEG-421M | 421M | 0.829 |
+ | EBind | ~200M | 0.770 |
+ | CLAP-Small | 27.5M | 0.751 |
+ | Nemotron-3B | 4.7B | 0.727 |
+ | ImageBind | 1.2B | 0.664 |
+
+ ### Throughput — items/s on NVIDIA L4
+
+ | Modality | TEG-421M | LCO-3B | Nemotron-3B | ImageBind |
+ |---|---|---|---|---|
+ | Text | **470** | 90 | 90 | — |
+ | Audio | **180** | 5.2 | 42.8 | — |
+ | Image | **158** | 15.4 | 15.4 | — |
+
+ TEG is **35x faster than LCO on audio** and **10x faster on image** — the difference between real-time edge inference and datacenter-only deployment.

  ## Usage
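
Reviewer note on the metric: the R@1 and R@10 columns in the tables above are recall-at-k scores. The commit does not show the evaluation code, so the following is only a minimal sketch of how such a score is commonly computed. It assumes that query `i` is paired with gallery item `i` and that items are ranked by cosine similarity. The function names and the toy vectors are hypothetical, not taken from the repository.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(queries, gallery, k=1):
    """Fraction of queries whose paired item (same index) is in the top k.

    Assumption: queries[i] and gallery[i] form the ground-truth pair.
    """
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, g) for g in gallery]
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(gallery)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy 2-D "embeddings" standing in for text queries and image gallery items.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[0.9, 0.1], [0.7, 0.7], [0.2, 0.8]]

print(recall_at_k(text, image, k=1))
print(recall_at_k(text, image, k=2))
```

On this toy data only the first query retrieves its pair at rank 1, while all three pairs appear within the top 2, which is why R@10 is always at least as high as R@1 in the Flickr30k table.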
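
Reviewer note on the ratios: the "35x", "10x" and "11x" figures in the added README text can be checked directly against the table values. The sketch below copies the numbers from the diff above; the dictionary layout and helper names are illustrative only.

```python
# Parameter counts as listed in the added tables.
params = {"TEG-421M": 421e6, "LCO-3B": 4.7e9, "ImageBind": 1.2e9}

# Throughput in items/s on an NVIDIA L4, copied from the throughput table.
throughput = {
    "Text": {"TEG-421M": 470, "LCO-3B": 90, "Nemotron-3B": 90},
    "Audio": {"TEG-421M": 180, "LCO-3B": 5.2, "Nemotron-3B": 42.8},
    "Image": {"TEG-421M": 158, "LCO-3B": 15.4, "Nemotron-3B": 15.4},
}


def speedup(modality, baseline, model="TEG-421M"):
    """How many times faster `model` is than `baseline` for one modality."""
    return throughput[modality][model] / throughput[modality][baseline]


def size_ratio(larger, smaller="TEG-421M"):
    """How many times more parameters `larger` has than `smaller`."""
    return params[larger] / params[smaller]


print(f"Audio speedup vs LCO-3B: {speedup('Audio', 'LCO-3B'):.1f}x")  # 34.6x
print(f"Image speedup vs LCO-3B: {speedup('Image', 'LCO-3B'):.1f}x")  # 10.3x
print(f"LCO-3B size vs TEG-421M: {size_ratio('LCO-3B'):.1f}x")        # 11.2x
```

The computed values round to the 35x, 10x and 11x quoted in the summary sentences.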