AbstractPhil committed · verified
Commit e063df5 · 1 parent: 30bb4b6

Upload README.md with huggingface_hub

Files changed (1): README.md (+185, -3)
---
license: apache-2.0
tags:
- geometric-deep-learning
- distillation
- consensus
- pentachoron
- procrustes
- caption-embedding
- sentence-similarity
- feature-extraction
language: en
pipeline_tag: feature-extraction
---

# GEOLIP Consensus-Distilled Caption Encoder

**A standalone 23M-parameter caption encoder trained via geometric consensus distillation from 5 BERT-family models.**

No expert models needed at inference. Just a tokenizer and this model.

## What Is This?

Five independently trained language models (BERT-base, ModernBERT-base, RoBERTa-base, ALBERT-base-v2, and DistilBERT-base) were aligned into a shared geometric space via whitened Procrustes rotation. Their normalized centroid (the **geometric consensus**) proved remarkably stable: five different random seeds produced the same consensus point to three decimal places.

This model was trained from scratch to reproduce that consensus directly from text. It distills the geometric intersection of the five experts (the subspace where all five agree) into a single small transformer.
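
The alignment step can be sketched in NumPy. This is a minimal illustrative sketch, not the released pipeline: it assumes ZCA whitening, orthogonal Procrustes to a reference expert, and the choice of expert 0 as the anchor; the names `whiten` and `procrustes_rotation` are placeholders.

```python
import numpy as np

def whiten(X):
    """ZCA-whiten embeddings: zero mean, approximately identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    U, S, _ = np.linalg.svd(cov)
    return Xc @ (U @ np.diag(1.0 / np.sqrt(S + 1e-8)) @ U.T)

def procrustes_rotation(X, Y):
    """Orthogonal matrix R minimizing ||X @ R - Y||_F (classic Procrustes)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy stand-ins for the five experts' embeddings of the same captions
rng = np.random.default_rng(0)
experts = [rng.normal(size=(1000, 768)) for _ in range(5)]

whitened = [whiten(E) for E in experts]
anchor = whitened[0]  # align every expert to expert 0's whitened space
aligned = [W @ procrustes_rotation(W, anchor) for W in whitened]

# Per-caption geometric consensus: normalized centroid across experts
centroid = np.mean(aligned, axis=0)
consensus = centroid / np.linalg.norm(centroid, axis=1, keepdims=True)
```

Whitening first removes per-expert scale and correlation, so the rotation only has to account for orientation differences between the spaces.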

## Results

| Metric | Value |
|---|---|
| **Val cosine to consensus** | **0.8621** |
| **Val R@1** | **1.000** |
| **Val CV** | **0.0817** |
| Training data | CC12M captions (500,000 samples) |
| Epochs | 30 |
| Warm-started | Yes |
| Parameters | ~23M |
| Position capacity | 8,192 tokens |

### STS-B Comparison (mean-pooled, no fine-tuning)

| Model | Params | STS-B Spearman |
|---|---|---|
| DistilBERT-base | 66M | 0.5717 |
| RoBERTa-base | 125M | 0.5436 |
| **Consensus Student** | **23M** | **0.4814** |
| ALBERT-base-v2 | 12M | 0.4784 |
| BERT-base | 110M | 0.4729 |
| ModernBERT-base | 149M | 0.4215 |

The student beats BERT-base (5x larger) and ModernBERT-base (7x larger) on STS-B despite being trained from scratch on image captions, which are out of domain for sentence similarity.
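
For reference, the mean-pooled STS-B protocol embeds each sentence by averaging token states, scores each pair by cosine similarity, and reports the Spearman rank correlation against the gold scores. A self-contained sketch on synthetic data (the helpers `mean_pool`, `spearman`, and `sts_score` are illustrative; a real run would pool the model's hidden states over actual STS-B pairs):

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors over non-padding positions."""
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

def spearman(x, y):
    """Spearman rank correlation (toy version, assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

def sts_score(emb_a, emb_b, gold):
    """Spearman between pairwise cosine similarities and gold scores."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return spearman((a * b).sum(axis=1), gold)

# Synthetic sanity check: mean-pool fake hidden states, then score pairs
# where higher gold similarity means more shared signal
rng = np.random.default_rng(0)
hidden_a = rng.normal(size=(50, 12, 768))
mask = np.ones((50, 12), dtype=np.int64)
emb_a = mean_pool(hidden_a, mask)
gold = rng.uniform(0.0, 5.0, size=50)
emb_b = (gold[:, None] / 5.0) * emb_a + rng.normal(size=(50, 768))
corr = sts_score(emb_a, emb_b, gold)
```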

### Training Curve

(t/v = train/val, acc = R@1, cos = cosine to consensus, cv = pentachoron CV)

| Epoch | t_acc | t_cos | v_acc | v_cos | v_cv | Time |
|---|---|---|---|---|---|---|
| 1 | 1.000 | 0.804 | 1.000 | 0.803 | 0.104 | 689s |
| 2 | 1.000 | 0.807 | 1.000 | 0.810 | 0.085 | 688s |
| 3 | 1.000 | 0.811 | 1.000 | 0.820 | 0.103 | 688s |
| 4 | 1.000 | 0.815 | 1.000 | 0.825 | 0.084 | 689s |
| 5 | 1.000 | 0.819 | 1.000 | 0.819 | 0.086 | 689s |
| 6 | 1.000 | 0.821 | 1.000 | 0.821 | 0.095 | 689s |
| 7 | 1.000 | 0.824 | 1.000 | 0.820 | 0.091 | 688s |
| 8 | 1.000 | 0.827 | 1.000 | 0.834 | 0.088 | 689s |
| 9 | 1.000 | 0.829 | 1.000 | 0.829 | 0.088 | 688s |
| 10 | 1.000 | 0.831 | 1.000 | 0.829 | 0.087 | 689s |
| 11 | 1.000 | 0.833 | 1.000 | 0.836 | 0.082 | 689s |
| 12 | 1.000 | 0.835 | 1.000 | 0.838 | 0.084 | 689s |
| 13 | 1.000 | 0.837 | 1.000 | 0.842 | 0.083 | 688s |
| 14 | 1.000 | 0.839 | 1.000 | 0.842 | 0.081 | 689s |
| 15 | 1.000 | 0.842 | 1.000 | 0.840 | 0.078 | 688s |
| 16 | 1.000 | 0.843 | 1.000 | 0.843 | 0.086 | 689s |
| 17 | 1.000 | 0.846 | 1.000 | 0.845 | 0.086 | 689s |
| 18 | 1.000 | 0.847 | 1.000 | 0.848 | 0.087 | 689s |
| 19 | 1.000 | 0.849 | 1.000 | 0.849 | 0.082 | 688s |
| 20 | 1.000 | 0.851 | 1.000 | 0.849 | 0.078 | 690s |
| 21 | 1.000 | 0.853 | 1.000 | 0.855 | 0.087 | 689s |
| 22 | 1.000 | 0.855 | 1.000 | 0.856 | 0.083 | 689s |
| 23 | 1.000 | 0.857 | 1.000 | 0.855 | 0.078 | 689s |
| 24 | 1.000 | 0.858 | 1.000 | 0.857 | 0.093 | 688s |
| 25 | 1.000 | 0.860 | 1.000 | 0.859 | 0.092 | 689s |
| 26 | 1.000 | 0.861 | 1.000 | 0.860 | 0.079 | 689s |
| 27 | 1.000 | 0.863 | 1.000 | 0.862 | 0.084 | 689s |
| 28 | 1.000 | 0.863 | 1.000 | 0.862 | 0.091 | 688s |
| 29 | 1.000 | 0.863 | 1.000 | 0.862 | 0.081 | 688s |
| 30 | 1.000 | 0.863 | 1.000 | 0.862 | 0.082 | 689s |


## Usage

```python
import torch
from transformers import AutoTokenizer
from caption_encoder import CaptionEncoder

# Load tokenizer and model weights
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = CaptionEncoder(
    vocab_size=30522, max_len=8192, d_model=384,
    n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
    dropout=0.0, pad_token_id=0,
)
model.load_state_dict(torch.load("best_model.pt", weights_only=True))
model.eval()

# Encode
texts = ["A cat sitting on a windowsill", "A dog playing fetch on the beach"]
tokens = tokenizer(texts, max_length=512, padding="max_length",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(tokens["input_ids"], tokens["attention_mask"])

# embeddings: (2, 768), L2-normalized
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.3f}")
```

## Architecture

```
Input text
  │
  ├── BERT WordPiece tokenizer (30,522 vocab)
  ├── Token embeddings (384-dim)
  ├── Position embeddings (8,192 capacity)
  │
  ├── 6× Transformer encoder layers
  │     (384-dim, 6 heads, 1536 FFN, GELU, pre-norm)
  │
  ├── Mean pool over non-padding tokens
  ├── Projection: 384 → 384 → GELU → LN → 768
  └── L2 normalize
        │
        └── (B, 768) consensus-aligned embedding
```
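
The diagram maps onto standard PyTorch modules. The sketch below is an illustrative reimplementation from the diagram only; the shipped `caption_encoder.py` is the authoritative version and may differ in details such as weight initialization or masking.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionEncoder(nn.Module):
    """Illustrative reconstruction of the architecture described above."""

    def __init__(self, vocab_size=30522, max_len=8192, d_model=384,
                 n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
                 dropout=0.0, pad_token_id=0):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_token_id)
        self.pos_emb = nn.Embedding(max_len, d_model)  # 8,192-token capacity
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, dropout=dropout,
            activation="gelu", batch_first=True, norm_first=True)  # pre-norm
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.LayerNorm(d_model), nn.Linear(d_model, output_dim))

    def forward(self, input_ids, attention_mask):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        x = self.encoder(x, src_key_padding_mask=attention_mask == 0)
        # Mean pool over non-padding tokens
        mask = attention_mask.unsqueeze(-1).to(x.dtype)
        pooled = (x * mask).sum(1) / mask.sum(1).clamp(min=1)
        return F.normalize(self.proj(pooled), dim=-1)  # unit-norm (B, 768)
```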

## The Consensus Distillation Pipeline

```
5 Expert Models (frozen)
  │
  ├── BERT-base-uncased (110M, MLM)
  ├── ModernBERT-base (149M, MLM + rotary)
  ├── RoBERTa-base (125M, MLM + dynamic masking)
  ├── ALBERT-base-v2 (12M, MLM + SOP + factorized)
  └── DistilBERT-base (66M, distilled from BERT)
        │
        ├── Extract embeddings on CC12M captions
        ├── Whitened Procrustes alignment to shared space
        ├── Consensus = normalized centroid
        │     (constant to 3 decimal places across 5 seeds)
        │
        └── Train student with:
              ├── InfoNCE(student, consensus): retrieval alignment
              ├── MSE(student, consensus): direct regression
              └── Pentachoron CV → 0.084: geometric regularity
```
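
The student objective in the final stage pairs a contrastive term with a regression term. A minimal sketch, assuming in-batch negatives for InfoNCE; the temperature (0.07) and MSE weight (1.0) are placeholders, not published values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, consensus, temperature=0.07, mse_weight=1.0):
    """InfoNCE + MSE against per-caption consensus targets (illustrative)."""
    s = F.normalize(student, dim=-1)
    c = F.normalize(consensus, dim=-1)
    logits = s @ c.T / temperature                  # (B, B) similarity matrix
    labels = torch.arange(len(s), device=s.device)  # each caption matches itself
    info_nce = F.cross_entropy(logits, labels)      # retrieval alignment
    mse = F.mse_loss(s, c)                          # direct regression
    return info_nce + mse_weight * mse
```

A perfectly distilled student (outputs equal to the consensus) drives both terms toward their minimum, so the combined loss decreases as the two diagnostics in the training curve (R@1 and cosine) improve.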

## Key Properties

**Geometric regularity.** The embedding space has pentachoron CV ≈ 0.08–0.10, meaning local neighborhoods are uniformly distributed. The space is smooth, interpolable, and well-conditioned for downstream operations.
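
As an assumed, illustrative definition (the blog posts describe the exact metric): the pentachoron CV of five points is the coefficient of variation of their ten pairwise distances, which is 0 for a perfectly regular 4-simplex and grows as the configuration becomes irregular.

```python
import numpy as np

def pentachoron_cv(points):
    """Coefficient of variation of the 10 pairwise distances among 5 points.
    (Assumed definition: ~0 for a regular 4-simplex / pentachoron.)"""
    assert len(points) == 5
    d = [np.linalg.norm(points[i] - points[j])
         for i in range(5) for j in range(i + 1, 5)]
    return float(np.std(d) / np.mean(d))

# The rows of the 5x5 identity form a regular 4-simplex: all pairwise
# distances equal sqrt(2), so the CV is essentially zero
regular_cv = pentachoron_cv(np.eye(5))
```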

**Multi-teacher consensus.** The target is the geometric intersection of five experts, not any single teacher. Individual model errors cancel; what remains is what five independent systems agree on.

**Minimal data requirement.** The consensus manifold is smooth enough (CV = 0.084) that 18K examples were sufficient for R@1 = 1.000 on held-out data, suggesting the function from text to consensus embedding has a low effective Lipschitz constant.

**8K position capacity.** Trained on 512-token sequences, but the position embeddings extend to 8,192, so the model is ready for long-context applications without retraining.

## GEOLIP Family

| System | Type | Output |
|---|---|---|
| [CLIP-L ctx576](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576) | Memory bank | pooled (768,) |
| [CLIP-L seq77](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77) | Memory + sequence | pooled + seq (77, 768) |
| [Meridian bigG](https://huggingface.co/AbstractPhil/geolip-clip-vit-bigG-patch14-ctx576-seq77) | Memory + sequence | pooled + seq (77, 1280) |
| [Conduit v0](https://huggingface.co/AbstractPhil/geolip-bertenstein) | Multi-expert hub | aligned (1024,) |
| **Consensus Distilled** | **Student** | **consensus (768,)** |

## Citation

See [Geometric Memory Part I](https://huggingface.co/blog/AbstractPhil/geometric-memory-ft1) and Part II for the full methodology.

## License

Apache 2.0