gcoderw committed on
Commit d2d7cf9 · verified · 1 Parent(s): ff3f75e

Publish TE-86M safetensors release

Files changed (3)
  1. .gitattributes +1 -34
  2. README.md +227 -0
  3. TE-86M.safetensors +3 -0
.gitattributes CHANGED
@@ -1,35 +1,2 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
  *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,227 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - multimodal
+ - embedding
+ - matryoshka
+ - trimodal
+ - image-text-audio
+ - retrieval
+ - cross-modal
+ - edge
+ - rag
+ library_name: safetensors
+ pipeline_tag: feature-extraction
+ datasets:
+ - custom
+ ---
+
+ # TE-86M — Trimodal Embeddings (Depth-2)
+
+ **TE-86M** maps image, audio, and text into a shared 1280-dim embedding space, so a single vector index can serve cross-modal retrieval. Embeddings support full Matryoshka truncation down to 128 dims.
+
+ Built for edge deployment — the entire model runs on a Raspberry Pi 5.
+
+ Successor to [TE-75M](https://huggingface.co/augmem/TE-75M), with depth-2 residual projection heads that break through the cross-modal retrieval ceiling of depth-1 architectures while maintaining text retrieval quality.
+
+ > Also available in [GGUF format](https://huggingface.co/augmem/TE-86M-GGUF) for quantized edge deployment.
+
+ ## Architecture
+
+ TE-86M uses lightweight edge encoders with depth-2 residual projection heads that expand through a 1920-dim hidden layer before projecting into a shared 1280-dim embedding space:
+
+ ```
+ Text --> LEAF-IR (768-d) -----------> DeepProjectionHead-d2 (768 -> 1920 -> 1920 -> 1280)
+ Image --> MobileNetV4-Medium (1280-d) --> DeepProjectionHead-d2 (1280 -> 1920 -> 1920 -> 1280)
+ Audio --> EfficientAT mn20_as (1920-d) --> DeepProjectionHead-d2 (1920 -> 1920 -> 1920 -> 1280)
+ ```
+
+ All outputs are L2-normalized into the shared 1280-dim space for cross-modal cosine similarity.
+
+ | Component | Architecture | Params | Size |
+ |---|---|---|---|
+ | Text encoder | LEAF-IR (MongoDB/mdbr-leaf-ir) | 22.7M | 87.2 MB |
+ | Image encoder | MobileNetV4-Medium (timm) | 8.4M | 32.4 MB |
+ | Audio encoder | EfficientAT mn20_as | 17.9M | 68.5 MB |
+ | Image projection | DeepProjectionHead-d2 (1280 -> 1920 -> 1920 -> 1280) | 12.3M | 47.0 MB |
+ | Audio projection | DeepProjectionHead-d2 (1920 -> 1920 -> 1920 -> 1280) | 13.5M | 51.7 MB |
+ | Text projection | DeepProjectionHead-d2 (768 -> 1920 -> 1920 -> 1280) | 11.3M | 43.2 MB |
+ | **Total** | | **86.1M** | **329.9 MB** |
+
+ ### Projection head detail
+
+ Each `DeepProjectionHead-d2` is a depth-2 residual MLP with Matryoshka-aware training:
+
+ ```
+ Linear(encoder_dim, 1920) -> GELU -> LayerNorm -> Dropout(0.3)
+ -> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.3) + residual
+ -> Linear(1920, 1920) -> GELU -> LayerNorm -> Dropout(0.3) + residual
+ -> Linear(1920, 1280)
+ ```
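
As a sanity check on the layer diagram above, the per-head parameter counts can be reproduced with plain arithmetic. This is a sketch under two assumptions the README does not state: every `Linear` carries a bias, and every `LayerNorm` has a learnable scale and shift. The function name `head_param_count` is hypothetical. Under those assumptions the results round to the 11.3M / 12.3M / 13.5M figures in the component table, and the layout (4 Linear + 3 LayerNorm, two tensors each) is consistent with the 14 tensors listed per projection head.

```python
def head_param_count(encoder_dim, hidden=1920, out_dim=1280, depth=2):
    """Count parameters of one projection head under the assumed layout."""
    def linear(n_in, n_out):
        return n_in * n_out + n_out          # weight + bias (bias is an assumption)

    layer_norm = 2 * hidden                  # scale + shift (affine is an assumption)
    total = linear(encoder_dim, hidden) + layer_norm          # input block
    total += depth * (linear(hidden, hidden) + layer_norm)    # residual blocks
    total += linear(hidden, out_dim)                          # output projection
    return total


# Encoder output widths taken from the architecture diagram
ENCODER_DIMS = {"text": 768, "image": 1280, "audio": 1920}
counts = {name: head_param_count(dim) for name, dim in ENCODER_DIMS.items()}

for name, n in counts.items():
    print(f"{name:>5}: {n:,} params ({n / 1e6:.1f}M)")
print(f"heads total: {sum(counts.values()) / 1e6:.1f}M")
```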
+
+ ### Why depth-2?
+
+ Ablation experiments showed depth-1 heads hit an I->T retrieval ceiling at ~0.60 R@1 regardless of hyperparameter tuning. Depth-2 heads broke through to 0.618, providing the representational capacity to serve cross-modal AND text retrieval simultaneously. The extra 11M params (75M -> 86M) remain edge-viable.
+
+ ### Matryoshka dimensions
+
+ Embeddings can be truncated to `[1280, 768, 512, 256, 128]` dimensions while preserving retrieval quality — trained with Matryoshka Representation Learning (MRL).
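
The truncation step can be sketched without any dependencies: keep the leading coordinates, then rescale to unit length so that dot products are cosine similarities again. The helper name `truncate_embedding` and the random stand-in vector are hypothetical, and the byte counts assume float32 storage, which the README does not specify for a vector index.

```python
import math
import random

MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit L2 norm."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


# Random stand-in for a real 1280-dim embedding (not model output)
rng = random.Random(0)
full = [rng.gauss(0.0, 1.0) for _ in range(1280)]

for dim in MATRYOSHKA_DIMS:
    vec = truncate_embedding(full, dim)
    # 4 bytes per coordinate if the index stores float32 values
    print(f"{dim:>4} dims -> {len(vec) * 4} bytes per vector")
```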
+
+ ## Benchmarks
+
+ All benchmarks run on a single NVIDIA L4 GPU with 5K SALT samples.
+
+ ### Cross-modal retrieval — SALT (5K trimodal samples)
+
+ | Direction | TE-86M (86M) | TE-75M (75M) | ImageBind (1.2B) | EBind (1.78B*) |
+ |---|---|---|---|---|
+ | Image -> Text R@1 | 0.618 | 0.615 | 0.736 | **0.783** |
+ | Text -> Image R@1 | 0.630 | 0.614 | 0.712 | **0.779** |
+ | Text -> Audio R@1 | **0.108** | 0.103 | 0.038 | 0.047 |
+ | Audio -> Text R@1 | **0.087** | 0.082 | 0.039 | 0.035 |
+ | Image -> Audio R@1 | **0.068** | 0.062 | 0.023 | 0.027 |
+ | Audio -> Image R@1 | **0.070** | 0.063 | 0.025 | 0.032 |
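
For readers unfamiliar with the metric, R@1 is the fraction of queries whose top-ranked candidate is the correct match. The sketch below is a generic illustration, not the evaluation script behind the table: it assumes query `i` pairs with candidate `i`, and the similarity values are made up.

```python
def recall_at_1(sim):
    """sim[i][j] is the similarity of query i to candidate j.

    Assumes the correct candidate for query i sits at index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)  # index of top-ranked candidate
        hits += best == i
    return hits / len(sim)


# Toy 3x3 similarity matrix (illustrative values only)
sim = [
    [0.9, 0.1, 0.3],  # query 0 ranks candidate 0 first: hit
    [0.2, 0.4, 0.8],  # query 1 ranks candidate 2 first: miss
    [0.1, 0.2, 0.7],  # query 2 ranks candidate 2 first: hit
]
print(f"R@1 = {recall_at_1(sim):.3f}")
```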
+
+ ### Audio retrieval — AudioCaps & Clotho
+
+ | Benchmark | Direction | TE-86M | TE-75M | CLAP-Large | ImageBind | EBind |
+ |---|---|---|---|---|---|---|
+ | AudioCaps | A->T R@1 | 0.229 | 0.210 | **0.420** | 0.116 | 0.225 |
+ | AudioCaps | T->A R@1 | 0.156 | 0.148 | **0.280** | 0.080 | 0.219 |
+ | Clotho | A->T R@1 | **0.219** | 0.208 | 0.195 | 0.061 | 0.088 |
+ | Clotho | T->A R@1 | **0.177** | 0.172 | 0.167 | 0.074 | 0.118 |
+
+ ### Image-text retrieval — MSCOCO & Flickr30k
+
+ | Benchmark | Direction | TE-86M (86M) | TE-75M (75M) | EBind (1.78B*) | ImageBind (1.2B) |
+ |---|---|---|---|---|---|
+ | Flickr30k | I->T R@1 | 0.494 | 0.478 | **0.951** | 0.918 |
+ | Flickr30k | T->I R@1 | 0.332 | 0.303 | **0.853** | 0.766 |
+ | MSCOCO 5K | I->T R@1 | 0.343 | 0.320 | **0.743** | 0.658 |
+ | MSCOCO 5K | T->I R@1 | 0.225 | 0.208 | **0.559** | 0.490 |
+
+ ### Zero-shot classification — ESC-50
+
+ | Model | Params | Accuracy |
+ |---|---|---|
+ | TE-86M | 86M | **93.9%** |
+ | CLAP-Large | 67.8M | 90.5% |
+ | TE-75M | 75M | 93.2% |
+ | EBind | 1.78B* | 77.0% |
+ | ImageBind | 1.2B | 66.4% |
+
+ ### Text retrieval — MTEB (NDCG@10)
+
+ Text-text retrieval quality in the shared embedding space, measured on MTEB retrieval tasks:
+
+ | Task | TE-86M | TE-75M | Raw LEAF-IR | Recovery |
+ |---|---|---|---|---|
+ | ArguAna | 0.545 | 0.544 | 0.594 | 92% |
+ | CQADupstackGaming | 0.515 | 0.506 | 0.607 | 85% |
+ | CQADupstackUnix | 0.334 | 0.355 | 0.428 | 78% |
+ | FEVERHardNegatives | 0.561 | 0.551 | 0.863 | 65% |
+ | HotpotQAHardNegatives | 0.554 | 0.531 | 0.700 | 79% |
+ | FiQA2018 | 0.291 | 0.292 | 0.392 | 74% |
+ | ClimateFEVER | 0.231 | 0.215 | 0.353 | 65% |
+ | SCIDOCS | 0.154 | 0.153 | 0.198 | 78% |
+ | TRECCOVID | 0.507 | 0.474 | 0.820 | 62% |
+
+ TE-86M improves MTEB text retrieval over TE-75M on 7/9 tasks. The depth-2 projection heads recover 62-92% of raw LEAF-IR's retrieval quality while mapping into the cross-modal shared space.
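
The README does not define the Recovery column, but the numbers are consistent with reading it as the TE-86M score divided by the raw LEAF-IR score, rounded to a whole percent. The short check below recomputes that ratio and the 7/9 win count from the table values; the dictionary name `MTEB` is just a label for this sketch.

```python
# NDCG@10 values copied from the MTEB table: (TE-86M, TE-75M, raw LEAF-IR)
MTEB = {
    "ArguAna": (0.545, 0.544, 0.594),
    "CQADupstackGaming": (0.515, 0.506, 0.607),
    "CQADupstackUnix": (0.334, 0.355, 0.428),
    "FEVERHardNegatives": (0.561, 0.551, 0.863),
    "HotpotQAHardNegatives": (0.554, 0.531, 0.700),
    "FiQA2018": (0.291, 0.292, 0.392),
    "ClimateFEVER": (0.231, 0.215, 0.353),
    "SCIDOCS": (0.154, 0.153, 0.198),
    "TRECCOVID": (0.507, 0.474, 0.820),
}

# Recovery read as TE-86M / raw LEAF-IR (an inference from the numbers)
recovery = {task: round(100 * te86 / raw) for task, (te86, _, raw) in MTEB.items()}
# Tasks where TE-86M scores strictly higher than TE-75M
wins = sum(te86 > te75 for te86, te75, _ in MTEB.values())

for task, pct in recovery.items():
    print(f"{task:<22} {pct}%")
print(f"TE-86M beats TE-75M on {wins}/{len(MTEB)} tasks")
```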
+
+ ## Usage
+
+ ### Loading components
+
+ ```python
+ from safetensors.torch import load_file
+
+ # Load entire model
+ tensors = load_file("TE-86M.safetensors")
+
+ # Extract components by prefix (str.removeprefix requires Python 3.9+)
+ text_enc_sd = {k.removeprefix("text_encoder."): v for k, v in tensors.items() if k.startswith("text_encoder.")}
+ image_enc_sd = {k.removeprefix("image_encoder."): v for k, v in tensors.items() if k.startswith("image_encoder.")}
+ audio_enc_sd = {k.removeprefix("audio_encoder."): v for k, v in tensors.items() if k.startswith("audio_encoder.")}
+ image_proj_sd = {k.removeprefix("image_projection."): v for k, v in tensors.items() if k.startswith("image_projection.")}
+ audio_proj_sd = {k.removeprefix("audio_projection."): v for k, v in tensors.items() if k.startswith("audio_projection.")}
+ text_proj_sd = {k.removeprefix("text_projection."): v for k, v in tensors.items() if k.startswith("text_projection.")}
+ ```
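
The six comprehensions above follow one pattern, so the split can also be done in a single pass. The helper below is a sketch: `split_by_prefix` is a hypothetical name, and the keys in the toy dictionary are invented for illustration rather than taken from the real file. It works on any mapping, including the one returned by `load_file`.

```python
PREFIXES = (
    "text_encoder", "image_encoder", "audio_encoder",
    "image_projection", "audio_projection", "text_projection",
)


def split_by_prefix(tensors, prefixes=PREFIXES):
    """Group entries by the part of the key before the first dot, dropping that prefix."""
    groups = {prefix: {} for prefix in prefixes}
    for key, value in tensors.items():
        prefix, _, rest = key.partition(".")
        if prefix in groups:  # keys with an unknown prefix are skipped
            groups[prefix][rest] = value
    return groups


# Toy stand-in for the loaded tensor dict (key names are made up)
dummy = {
    "text_encoder.layer.weight": 1,
    "text_projection.out.bias": 2,
    "unknown.key": 3,
}
groups = split_by_prefix(dummy)
print({name: sorted(sd) for name, sd in groups.items() if sd})
```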
+
+ ### Matryoshka truncation
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Placeholder standing in for real model output: N unit-norm 1280-dim embeddings
+ embedding = F.normalize(torch.randn(4, 1280), dim=-1)  # (N, 1280)
+
+ # Truncate to 256-dim and re-normalize
+ embedding_256 = F.normalize(embedding[:, :256], dim=-1)
+ ```
+
+ ## File layout
+
+ ```
+ TE-86M.safetensors # All components in one file (~330 MB)
+ ```
+
+ ### Tensor key prefixes
+
+ | Prefix | Component | Tensors |
+ |---|---|---|
+ | `text_encoder.*` | LEAF-IR (float32) | 103 |
+ | `image_encoder.*` | MobileNetV4-Medium | 462 |
+ | `audio_encoder.*` | EfficientAT mn20_as | 312 |
+ | `image_projection.*` | Depth-2 projection head | 14 |
+ | `audio_projection.*` | Depth-2 projection head | 14 |
+ | `text_projection.*` | Depth-2 projection head | 14 |
+
+ ## Training
+
+ - **Loss**: InfoNCE (contrastive) with Matryoshka Representation Learning
+ - **Data**: ~2.2M synthetically generated trimodal triplets (WordNet) + 200K MSCOCO image-text pairs + 262K WavCaps audio-text pairs + 1.5M Nomic text pairs
+ - **Hardware**: 2x NVIDIA L4 GPUs
+ - **Optimizer**: AdamW, lr=1.41e-3, weight decay=1e-4, cosine scheduler
+ - **Epochs**: 50
+ - **Batch size**: 4096
+ - **Dropout**: 0.20 -> 0.25 (ep27) -> 0.30 (ep29) — mid-run regularization increases
+ - **Text mixing**: λ_tt=0.5 (ep1-9) -> 0.25 (ep10-50) — Nomic supervised text pairs
+ - **Projection heads only** — source encoders are frozen during training
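
The dropout and text-mixing schedules in the list above can be written as simple step functions of the epoch. This is a sketch, not the training code: it assumes epochs are 1-indexed and that each new value takes effect at the start of the named epoch, and the function names are hypothetical.

```python
def dropout_at(epoch):
    """Dropout rate for a 1-indexed epoch: 0.20, then 0.25 from ep27, then 0.30 from ep29."""
    if epoch >= 29:
        return 0.30
    if epoch >= 27:
        return 0.25
    return 0.20


def lambda_tt_at(epoch):
    """Text-text mixing weight: 0.5 for epochs 1-9, 0.25 for epochs 10-50."""
    return 0.5 if epoch <= 9 else 0.25


for epoch in (1, 9, 10, 26, 27, 28, 29, 50):
    print(f"epoch {epoch:>2}: dropout={dropout_at(epoch):.2f}, lambda_tt={lambda_tt_at(epoch):.2f}")
```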
+
+ ### Improvements over TE-75M
+
+ | Change | TE-75M | TE-86M |
+ |---|---|---|
+ | Projection depth | 1 (single residual block) | 2 (two residual blocks) |
+ | Head params | 26.1M | 37.2M |
+ | Total params | 75.2M | 86.1M |
+ | SALT I->T R@1 | 0.615 | 0.618 (+0.5%) |
+ | SALT T->I R@1 | 0.614 | 0.630 (+2.6%) |
+ | MSCOCO I->T R@1 | 0.320 | 0.343 (+7.2%) |
+ | Clotho A->T R@1 | 0.208 | 0.219 (+5.3%) |
+ | ESC-50 | 93.2% | 93.9% (+0.7 pp) |
+
+ ### Design decisions
+
+ - **Depth-2 residual heads**: Ablation confirmed depth-1 hits I->T ceiling at ~0.60 regardless of dropout or λ_tt. Depth-2 provides capacity to serve cross-modal and text retrieval simultaneously.
+ - **3-head shared space**: All modalities project into a learned 1280-dim space (image-native dimension)
+ - **LEAF-IR text encoder**: 23M-param retrieval-optimized text encoder enables fully edge-deployable text inference
+ - **Frozen source encoders**: MobileNetV4, EfficientAT, and LEAF-IR are kept frozen; only projection heads are trained
+ - **Edge-first**: All source encoders can run on devices like Raspberry Pi 5
+
+ ## Limitations
+
+ - Audio retrieval lags behind specialist models like CLAP on audio-only benchmarks
+ - Image-text retrieval gives up accuracy relative to models with larger vision encoders in exchange for edge deployability
+ - Text retrieval recovers 62-92% of raw LEAF-IR quality (gap is domain-dependent)
+
+ ## Links
+
+ - **Website**: [augmem.ai](https://augmem.ai)
+ - **GitHub**: [github.com/augmem](https://github.com/augmem)
+
+ ## License
+
+ Apache 2.0
TE-86M.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5245f12b721086d90b4fe649c8f38b6928e658ba81087a620398f3dec567e2b7
+ size 346056964