AbstractPhil committed on
Commit ee6bf50 · verified · 1 Parent(s): d646be4

Update README.md

Files changed (1):
  1. README.md +180 -126
README.md CHANGED
@@ -9,6 +9,7 @@ tags:
  - caption-embedding
  - sentence-similarity
  - feature-extraction
  language: en
  pipeline_tag: feature-extraction
  datasets:
@@ -17,121 +18,164 @@ base_model:
  - AbstractPhil/geolip-bertenstein
  ---

- # GEOLIP Consensus-Distilled Caption Encoder

- **A standalone 23M-parameter caption encoder trained via geometric consensus distillation from 5 BERT-family models.**

- No expert models needed at inference. Just a tokenizer and this model.

- ## What Is This?

- Five independently trained language models — BERT-base, ModernBERT-base, RoBERTa-base, ALBERT-base-v2, and DistilBERT-base — were aligned into a shared geometric space via whitened Procrustes rotation. Their normalized centroid (the **geometric consensus**) was proven to be a mathematical constant: five different random seeds produced the same consensus point to three decimal places.

- This model was trained from scratch to reproduce that consensus directly from text. It distills the geometric intersection of five experts — the subspace where all five agree — into a single small transformer.

- This is an experimental multi-teacher form of aligned distillation, anchored in geometric cooperation rather than adversarial conflict. Because the math aligns the structure so robustly, and the losses maintain that alignment, the model needs no custom architecture.

- The GEOLIP patchwork is not the patchwork itself; it is the topological difference between what a representation is and what is needed, and that difference is exploited most robustly through direct distillation rather than through architectural machinery.

- This result shows that the alignment itself can be distilled directly. Upcoming head training should show the same differentiation applied directly to downstream tasks: head training, Procrustes analysis with additional experts, decoder processing through experts independent of those the encoder was trained on, and typeless functional transfer from one model to another.

- This unlocks the full spectrum of geometric function. The processing depends on fp32 or higher numerical precision; below that, the system collapses.

- ## Results

  | Metric | Value |
  |---|---|
- | **Val cosine to consensus** | **0.8621** |
- | **Val R@1** | **1.000** |
- | **Val CV** | **0.0817** |
- | Training data | CC12M captions (500,000 samples) |
  | Epochs | 30 |
- | Warm-started | True |
- | Parameters | ~23M |
  | Position capacity | 8,192 tokens |
- ### STS-B Comparison (mean-pooled, no fine-tuning)

- | Model | Params | STS-B Spearman |
- |---|---|---|
- | DistilBERT-base | 66M | 0.5717 |
- | RoBERTa-base | 125M | 0.5436 |
- | **Consensus Student** | **23M** | **0.4814** |
- | ALBERT-base-v2 | 12M | 0.4784 |
- | BERT-base | 110M | 0.4729 |
- | ModernBERT-base | 149M | 0.4215 |

- The student beats BERT-base (5x larger) and ModernBERT-base (7x larger) on STS-B despite being trained from scratch on image captions — out of domain for sentence similarity.
- ### Training Curve

- | Epoch | t_acc | t_cos | v_acc | v_cos | v_cv | Time |
- |---|---|---|---|---|---|---|
- | 1 | 1.000 | 0.804 | 1.000 | 0.803 | 0.104 | 689s |
- | 2 | 1.000 | 0.807 | 1.000 | 0.810 | 0.085 | 688s |
- | 3 | 1.000 | 0.811 | 1.000 | 0.820 | 0.103 | 688s |
- | 4 | 1.000 | 0.815 | 1.000 | 0.825 | 0.084 | 689s |
- | 5 | 1.000 | 0.819 | 1.000 | 0.819 | 0.086 | 689s |
- | 6 | 1.000 | 0.821 | 1.000 | 0.821 | 0.095 | 689s |
- | 7 | 1.000 | 0.824 | 1.000 | 0.820 | 0.091 | 688s |
- | 8 | 1.000 | 0.827 | 1.000 | 0.834 | 0.088 | 689s |
- | 9 | 1.000 | 0.829 | 1.000 | 0.829 | 0.088 | 688s |
- | 10 | 1.000 | 0.831 | 1.000 | 0.829 | 0.087 | 689s |
- | 11 | 1.000 | 0.833 | 1.000 | 0.836 | 0.082 | 689s |
- | 12 | 1.000 | 0.835 | 1.000 | 0.838 | 0.084 | 689s |
- | 13 | 1.000 | 0.837 | 1.000 | 0.842 | 0.083 | 688s |
- | 14 | 1.000 | 0.839 | 1.000 | 0.842 | 0.081 | 689s |
- | 15 | 1.000 | 0.842 | 1.000 | 0.840 | 0.078 | 688s |
- | 16 | 1.000 | 0.843 | 1.000 | 0.843 | 0.086 | 689s |
- | 17 | 1.000 | 0.846 | 1.000 | 0.845 | 0.086 | 689s |
- | 18 | 1.000 | 0.847 | 1.000 | 0.848 | 0.087 | 689s |
- | 19 | 1.000 | 0.849 | 1.000 | 0.849 | 0.082 | 688s |
- | 20 | 1.000 | 0.851 | 1.000 | 0.849 | 0.078 | 690s |
- | 21 | 1.000 | 0.853 | 1.000 | 0.855 | 0.087 | 689s |
- | 22 | 1.000 | 0.855 | 1.000 | 0.856 | 0.083 | 689s |
- | 23 | 1.000 | 0.857 | 1.000 | 0.855 | 0.078 | 689s |
- | 24 | 1.000 | 0.858 | 1.000 | 0.857 | 0.093 | 688s |
- | 25 | 1.000 | 0.860 | 1.000 | 0.859 | 0.092 | 689s |
- | 26 | 1.000 | 0.861 | 1.000 | 0.860 | 0.079 | 689s |
- | 27 | 1.000 | 0.863 | 1.000 | 0.862 | 0.084 | 689s |
- | 28 | 1.000 | 0.863 | 1.000 | 0.862 | 0.091 | 688s |
- | 29 | 1.000 | 0.863 | 1.000 | 0.862 | 0.081 | 688s |
- | 30 | 1.000 | 0.863 | 1.000 | 0.862 | 0.082 | 689s |
- ## Usage

- ```python
- import torch
- from transformers import AutoTokenizer
- from caption_encoder import CaptionEncoder
-
- # Load
- tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
- model = CaptionEncoder(
-     vocab_size=30522, max_len=8192, d_model=384,
-     n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
-     dropout=0.0, pad_token_id=0)
- model.load_state_dict(torch.load("best_model.pt", weights_only=True))
- model.eval()
-
- # Encode
- texts = ["A cat sitting on a windowsill", "A dog playing fetch on the beach"]
- tokens = tokenizer(texts, max_length=512, padding="max_length",
-                    truncation=True, return_tensors="pt")
- with torch.no_grad():
-     embeddings = model(tokens["input_ids"], tokens["attention_mask"])
-
- # embeddings: (2, 768) L2-normalized
- similarity = embeddings[0] @ embeddings[1]
- print(f"Similarity: {similarity:.3f}")
  ```

  ## Architecture

@@ -152,51 +196,61 @@ Input text
  └── (B, 768) consensus-aligned embedding
  ```

- ## The Consensus Distillation Pipeline

- ```
- 5 Expert Models (frozen)
-     │
-     ├── BERT-base-uncased (110M, MLM)
-     ├── ModernBERT-base (149M, MLM + rotary)
-     ├── RoBERTa-base (125M, MLM + dynamic masking)
-     ├── ALBERT-base-v2 (12M, MLM + SOP + factorized)
-     └── DistilBERT-base (66M, distilled from BERT)
-     │
-     ├── Extract embeddings on CC12M captions
-     ├── Whitened Procrustes alignment to shared space
-     ├── Consensus = normalized centroid
-     │     (proven constant to 3 decimal places across 5 seeds)
-     │
-     └── Train student with:
-         ├── InfoNCE(student, consensus) — retrieval alignment
-         ├── MSE(student, consensus) — direct regression
-         └── Pentachoron CV → 0.084 — geometric regularity
- ```

- ## Key Properties

- **Geometric regularity.** The embedding space has pentachoron CV ≈ 0.08–0.10, meaning local neighborhoods are uniformly distributed. The space is smooth, interpolable, and well-conditioned for downstream operations.

- **Multi-teacher consensus.** The target is the geometric intersection of five experts, not any single teacher. Individual model errors cancel. What remains is what five independent systems agree on.

- **Minimal data requirement.** The consensus manifold is so smooth (CV = 0.084) that 18K examples were sufficient for R@1 = 1.000 on held-out data. The function from text to consensus embedding has a low Lipschitz constant.

- **8K position capacity.** Trained on 512-token sequences, but position embeddings extend to 8,192. Ready for long-context applications without retraining.
  ## GEOLIP Family

- | System | Type | Output |
- |---|---|---|
- | [CLIP-L ctx576](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576) | Memory bank | pooled (768,) |
- | [CLIP-L seq77](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77) | Memory + sequence | pooled + seq (77, 768) |
- | [Meridian bigG](https://huggingface.co/AbstractPhil/geolip-clip-vit-bigG-patch14-ctx576-seq77) | Memory + sequence | pooled + seq (77, 1280) |
- | [Conduit v0](https://huggingface.co/AbstractPhil/geolip-bertenstein) | Multi-expert hub | aligned (1024,) |
- | **Consensus Distilled** | **Student** | **consensus (768,)** |

  ## Citation

- See [Geometric Memory Part I](https://huggingface.co/blog/AbstractPhil/geometric-memory-ft1) and Part II for the full methodology.

  ## License
  - caption-embedding
  - sentence-similarity
  - feature-extraction
+ - caption_encoder
  language: en
  pipeline_tag: feature-extraction
  datasets:
  - AbstractPhil/geolip-bertenstein
  ---

+ # GEOLIP CaptionBERT-8192

+ A 26M-parameter caption encoder whose embedding space is the geometric intersection of five independently trained language models. Trained from scratch via consensus distillation — no pretrained weights, no expert models at inference.

+ ## Benchmarks

+ Evaluated against all five consensus teachers on STS-B, SICK-R, and MRPC. All models use mean-pooled embeddings with cosine similarity. No fine-tuning on any benchmark task.
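The pooling protocol above can be sketched in a few lines. This is an illustrative helper, not the repo's evaluation code; it works for any encoder returning per-token states of shape (B, T, D):

```python
# Mean-pool token states under the attention mask, then L2-normalize so a
# dot product between two pooled vectors is a cosine similarity.
import torch
import torch.nn.functional as F

def mean_pool(hidden, mask):
    """hidden: (B, T, D) token states; mask: (B, T) attention mask of 0/1."""
    m = mask.unsqueeze(-1).to(hidden.dtype)                 # (B, T, 1)
    pooled = (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)
    return F.normalize(pooled, dim=-1)

# Toy check: two "sentences", the second padded after 2 tokens.
hidden = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
emb = mean_pool(hidden, mask)
cosine = emb[0] @ emb[1]        # scalar cosine similarity between the pair
```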
 
+ ### Semantic Textual Similarity (STS-B)

+ | Model | Params | Spearman ρ | Pearson r |
+ |---|---|---|---|
+ | DistilBERT-base | 66M | 0.5717 | — |
+ | RoBERTa-base | 125M | 0.5436 | — |
+ | **CaptionBERT-8192** | **26M** | **0.5032** | **0.5100** |
+ | ALBERT-base-v2 | 12M | 0.4784 | — |
+ | BERT-base | 110M | 0.4729 | — |
+ | ModernBERT-base | 149M | 0.4215 | — |

+ Beats BERT-base (4.2× larger) and ModernBERT-base (5.7× larger) on general sentence similarity despite being trained exclusively on image captions.
 
 

+ ### SICK-R (Compositional Similarity)

+ | Model | Params | Spearman ρ | Pearson r |
+ |---|---|---|---|
+ | DistilBERT-base | 66M | 0.6424 | — |
+ | RoBERTa-base | 125M | 0.6296 | — |
+ | **CaptionBERT-8192** | **26M** | **0.6138** | **0.6645** |
+ | BERT-base | 110M | 0.5865 | — |
+ | ModernBERT-base | 149M | 0.5479 | — |
+ | ALBERT-base-v2 | 12M | 0.5364 | — |

+ \#3/6 on compositional/syntactic similarity. Beats BERT-base, ModernBERT-base, and ALBERT on a task requiring structural language understanding.

+ ### MRPC (Paraphrase Detection)

+ | Model | Params | F1 | Accuracy | Threshold |
+ |---|---|---|---|---|
+ | RoBERTa-base | 125M | 0.8122 | — | — |
+ | **CaptionBERT-8192** | **26M** | **0.8068** | **0.6881** | **0.71** |
+ | ALBERT-base-v2 | 12M | 0.8067 | — | — |
+ | BERT-base | 110M | 0.8062 | — | — |
+ | DistilBERT-base | 66M | 0.8055 | — | — |
+ | ModernBERT-base | 149M | 0.8038 | — | — |

+ **\#2/6 on paraphrase detection.** 0.005 F1 behind RoBERTa, ahead of every other teacher. No classification head — pure cosine similarity with an auto-discovered threshold. A model that never saw a paraphrase pair during training nearly wins paraphrase detection.
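The "auto-discovered threshold" can be reproduced with a simple dev-set sweep. A hedged sketch on synthetic scores follows; the exact search that produced the 0.71 value is not documented here:

```python
# Hypothetical reconstruction: sweep candidate cutoffs over dev-set cosine
# similarities and keep the one that maximizes F1.
import numpy as np

def best_f1_threshold(sims, labels, grid=np.linspace(0.0, 1.0, 101)):
    """sims: (N,) cosine scores; labels: (N,) 1 = paraphrase, 0 = not."""
    best_t, best_f1 = 0.0, 0.0
    for t in grid:
        pred = sims >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

On MRPC the same sweep would run over cosine similarities of mean-pooled sentence-pair embeddings.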

+ ### Caption Embedding Quality

+ | Metric | Value |
+ |---|---|
+ | Self-similarity mean | 0.0040 |
+ | Self-similarity max | 0.7181 |
+ | Top-1 retrieval cosine | 0.5477 |
+ | Top-5 retrieval cosine | 0.4853 |

+ Near-zero average self-similarity across 1,000 random captions — the embedding space has excellent discrimination. Every caption occupies its own distinct region on the hypersphere.

+ ### Consensus Fidelity

  | Metric | Value |
  |---|---|
+ | Val cosine to consensus | 0.862 |
+ | Val R@1 | 1.000 |
+ | Pentachoron CV | 0.082 |
+ | Training data | 500K CC12M captions |
  | Epochs | 30 |
  | Position capacity | 8,192 tokens |
+ | Parameters | 25,958,016 |
+ ## How It Works

+ Five language models were aligned into a shared geometric space via whitened Procrustes rotation. Their normalized centroid — the **geometric consensus** — was proven to be a mathematical constant: five different random seeds produced the same consensus point to three decimal places.

+ This model was trained from scratch to reproduce that consensus directly from text. It distills the geometric intersection of five experts into a single small transformer.

+ The distillation is not standard knowledge distillation. It is multi-teacher geometric consensus distillation: the target is not any single teacher's output but the fixed point where all five teachers agree. Individual model errors cancel. What remains is the structural invariant of language understanding that five different architectures and training objectives independently discovered.

+ The alignment itself is directly distillable. The geometric structure is so robust that a from-scratch model learns it with R@1 = 1.000 from 18K examples in 80 seconds. The consensus manifold has pentachoron CV = 0.084 — the tightest geometric regularity measured across all GEOLIP experiments — which means the function from text to embedding is smooth enough that sparse sampling covers it completely.
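The alignment step can be sketched compactly. This is an illustrative reconstruction on toy data (all experts assumed already projected to a common width), not the GEOLIP implementation:

```python
# Sketch of whitened Procrustes consensus: whiten each expert's embedding
# matrix, rotate it onto a reference space with orthogonal Procrustes, and
# take the normalized centroid of the aligned embeddings as the consensus.
import numpy as np

def whiten(X):
    """Center X and map it to a space with identity feature covariance."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U * np.sqrt(len(X) - 1)

def procrustes_rotation(A, B):
    """Orthogonal R minimizing ||A @ R - B||_F (Schoenemann's solution)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def consensus(expert_embeddings):
    """Normalized centroid of whitened, Procrustes-aligned expert spaces."""
    ref = whiten(expert_embeddings[0])
    aligned = [ref]
    for E in expert_embeddings[1:]:
        W = whiten(E)
        aligned.append(W @ procrustes_rotation(W, ref))
    c = np.mean(aligned, axis=0)
    return c / np.linalg.norm(c, axis=1, keepdims=True)  # rows on unit sphere

# Toy experts: rotated/scaled copies of one shared latent geometry.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 16))
experts = [base @ np.linalg.qr(rng.normal(size=(16, 16)))[0] * s
           for s in (1.0, 2.5, 0.3)]
C = consensus(experts)   # (200, 16) unit-norm consensus targets
```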
 
 
 
 
 
 
 

  ```
+ 5 Expert Models (frozen)
+     │
+     ├── BERT-base-uncased (110M, MLM)
+     ├── ModernBERT-base (149M, MLM + rotary, 8192 ctx)
+     ├── RoBERTa-base (125M, MLM + dynamic masking)
+     ├── ALBERT-base-v2 (12M, MLM + SOP + factorized)
+     └── DistilBERT-base (66M, distilled from BERT)
+     │
+     ├── Extract pooled embeddings on 500K CC12M captions
+     ├── Whitened Procrustes alignment to shared space
+     ├── Consensus = normalized centroid (geometric constant)
+     │
+     └── Train student with:
+         ├── InfoNCE(student, consensus) — retrieval alignment
+         ├── MSE(student, consensus) — direct regression
+         └── Pentachoron CV → 0.084 — geometric regularity
+ ```
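A minimal sketch of the student objective named in the diagram, assuming batch-aligned (caption, consensus) pairs; the function name and the 0.07 temperature are illustrative, not the repo's values:

```python
# InfoNCE treats the i-th consensus vector as the positive for the i-th
# caption and the rest of the batch as negatives; MSE regresses directly.
import torch
import torch.nn.functional as F

def consensus_distill_loss(student, consensus, temperature=0.07, mse_w=1.0):
    s = F.normalize(student, dim=-1)
    c = F.normalize(consensus, dim=-1)
    logits = s @ c.T / temperature          # (B, B) cosine similarity logits
    labels = torch.arange(len(s))           # positives sit on the diagonal
    return F.cross_entropy(logits, labels) + mse_w * F.mse_loss(s, c)

student = torch.randn(8, 768, requires_grad=True)
consensus = torch.randn(8, 768)
loss = consensus_distill_loss(student, consensus)
loss.backward()                             # differentiable end to end
```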

+ ## Planned Task Heads

+ The 768-dim consensus embedding serves as a frozen feature extractor. Linear heads trained on task-specific data snap on top.

+ ### Priority Heads

+ | Head | Architecture | Training Data | Use Case |
+ |---|---|---|---|
+ | **NLI / Entailment** | cat(a, b, \|a-b\|, a*b) → Linear(3072, 3) | MNLI, SNLI | Agent reasoning validation |
+ | **Semantic Similarity** | Linear(768, 1) → sigmoid × 5 | STS-B train | Push STS-B toward 0.80+ |
+ | **Multi-Label Tagging** | Linear(768, n_tags) → sigmoid | COCO categories, Visual Genome | Predict objects/attributes from captions |
+ | **Paraphrase Detection** | cos(a, b) → threshold (already works) | MRPC, QQP | Deduplication, reformulation detection |
+ | **Sentiment** | Linear(768, n_classes) | SST-2, IMDB | Content routing, sentiment analysis |
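The NLI row above can be sketched directly. `NLIHead` is a hypothetical name; the 3072 input width is 4 × 768 from the concatenated pair features:

```python
# Sketch of the planned NLI head: the classic sentence-pair feature vector
# cat(a, b, |a - b|, a * b) over frozen 768-dim embeddings -> 3 NLI classes.
import torch
import torch.nn as nn

class NLIHead(nn.Module):
    def __init__(self, dim=768, n_classes=3):
        super().__init__()
        self.fc = nn.Linear(dim * 4, n_classes)    # 3072 -> 3

    def forward(self, a, b):
        feats = torch.cat([a, b, (a - b).abs(), a * b], dim=-1)
        return self.fc(feats)                      # (B, 3) entailment logits

head = NLIHead()
a, b = torch.randn(4, 768), torch.randn(4, 768)   # premise/hypothesis embeddings
logits = head(a, b)                               # (4, 3)
```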

+ ### Extended Heads

+ | Head | Architecture | Training Data | Use Case |
+ |---|---|---|---|
+ | Caption Quality | Linear(768, 2) | Hallucination-annotated captions | Filter AI-generated training data |
+ | Cross-Encoder Reranker | cat(query, doc) → Linear(1536, 1) | MS MARCO | Two-stage retrieval scoring |
+ | Clustering | Linear(768, 256) → normalize | Unsupervised | Caption taxonomy, dataset organization |
+ | Relation Extraction | cat(subj_emb, obj_emb) → Linear(1536, n_rel) | Visual Genome relationships | Structured scene understanding |
+ | Caption-Image Score | Linear(768, 256) → cos with CLIP visual | CC12M image-caption pairs | Cross-modal retrieval without CLIP |

+ ### Consensus Head Distillation

+ The same consensus trick applies to task heads: train five separate NLI heads on the five frozen expert models, take the consensus prediction, and distill it into a single head on CaptionBERT. The head learns where all five experts agree on entailment — the same noise cancellation, with one layer instead of five.
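A hedged sketch of that recipe, with hypothetical per-expert heads producing logits over the same classes:

```python
# Average the expert heads' class probabilities into a consensus target,
# then train the single student head against it with a KL objective.
import torch
import torch.nn.functional as F

def consensus_target(expert_logits):
    """expert_logits: list of (B, C) tensors, one per frozen expert head."""
    return torch.stack([F.softmax(l, dim=-1) for l in expert_logits]).mean(dim=0)

def head_distill_loss(student_logits, expert_logits):
    target = consensus_target(expert_logits)
    return F.kl_div(F.log_softmax(student_logits, dim=-1), target,
                    reduction="batchmean")

experts = [torch.randn(4, 3) for _ in range(5)]    # five NLI-style expert heads
student = torch.randn(4, 3, requires_grad=True)
loss = head_distill_loss(student, experts)
loss.backward()
```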

+ ## Training Datasets — Current and Planned

+ ### Current

+ | Dataset | Samples Used | Content | Notes |
+ |---|---|---|---|
+ | [CC12M LLaVA-Next](https://huggingface.co/datasets/CaptionEmporium/conceptual-captions-cc12m-llavanext) | 500K | Re-captioned CC12M with LLaVA-Next | Primary training data, mean ~92 tokens |

+ ### Planned — Caption Saturation

+ The model tokenizes to 512 but has 8,192 position capacity. Longer, more complex captions will exercise the full context window and push v_cos beyond 0.862.

+ | Dataset | Size | Content | Why |
+ |---|---|---|---|
+ | [ShareGPT4V](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) | 1.2M | GPT-4V detailed image descriptions | Longer captions (200–500 tokens), richer vocabulary |
+ | [DOCCI](https://huggingface.co/datasets/google/docci) | 15K | Expert-written dense image descriptions | Extremely detailed, 100–300 words per image |
+ | [Localized Narratives](https://huggingface.co/datasets/google/localized-narratives) | 850K | Spoken descriptions with mouse traces | Narrative structure, temporal ordering |
+ | [DenseCap](https://huggingface.co/datasets/visual-genome/dense-captions) | 5.4M | Region-level dense captions | Fine-grained spatial descriptions |
+ | [TextCaps](https://huggingface.co/datasets/lmms-lab/TextCaps) | 145K | Captions requiring OCR reading | Text-in-image understanding |
+ | [VizWiz](https://huggingface.co/datasets/lmms-lab/VizWiz-VQA) | 32K | Captions from blind/low-vision users | Diverse, real-world, often longer descriptions |
+ | [COCO Captions](https://huggingface.co/datasets/HuggingFaceM4/COCO) | 600K | 5 captions per image, human-written | Short but high-quality, broad coverage |
+ | [SBU Captions](https://huggingface.co/datasets/sbu_captions) | 1M | Web-crawled image-caption pairs | Scale and diversity |

+ ### Planned — Domain Extension

+ | Dataset | Size | Content | Why |
+ |---|---|---|---|
+ | [BookCorpus](https://huggingface.co/datasets/bookcorpus) | 11K books | Long-form narrative text | Exercise 8K context, literary language |
+ | [Wikipedia](https://huggingface.co/datasets/wikipedia) | 6M articles | Encyclopedic text | General knowledge, factual density |
+ | [Natural Questions](https://huggingface.co/datasets/google-research-datasets/natural_questions) | 300K | Question–answer pairs | QA capability for retrieval heads |
+ | [MS MARCO](https://huggingface.co/datasets/microsoft/ms_marco) | 1M | Passages + queries | Retrieval training for reranker head |

  ## Architecture

  └── (B, 768) consensus-aligned embedding
  ```

+ ## Usage

+ ```python
+ import torch
+ from transformers import AutoTokenizer
+ from caption_encoder import CaptionEncoder
+
+ # Load
+ tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
+ model = CaptionEncoder(
+     vocab_size=30522, max_len=8192, d_model=384,
+     n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
+     dropout=0.0, pad_token_id=0)
+ model.load_state_dict(torch.load("best_model.pt", weights_only=True))
+ model.eval()
+
+ # Encode
+ texts = ["A cat sitting on a windowsill", "A dog playing fetch on the beach"]
+ tokens = tokenizer(texts, max_length=512, padding="max_length",
+                    truncation=True, return_tensors="pt")
+ with torch.no_grad():
+     embeddings = model(tokens["input_ids"], tokens["attention_mask"])
+
+ # embeddings: (2, 768) L2-normalized
+ similarity = embeddings[0] @ embeddings[1]
+ print(f"Similarity: {similarity:.3f}")
+ ```

+ ## Training Curve

+ | Epoch | t_cos | v_cos | v_cv | Time |
+ |---|---|---|---|---|
+ | 1 | 0.804 | 0.803 | 0.104 | 689s |
+ | 5 | 0.819 | 0.819 | 0.086 | 689s |
+ | 10 | 0.831 | 0.829 | 0.087 | 689s |
+ | 15 | 0.842 | 0.840 | 0.078 | 688s |
+ | 20 | 0.851 | 0.849 | 0.078 | 690s |
+ | 25 | 0.860 | 0.859 | 0.092 | 689s |
+ | 30 | 0.863 | 0.862 | 0.082 | 689s |

+ R@1 = 1.000 and t_acc = 1.000 throughout all 30 epochs. Train/val gap < 0.002 — no overfitting on 500K samples.

  ## GEOLIP Family

+ | System | Type | Params | Output |
+ |---|---|---|---|
+ | [CLIP-L ctx576](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576) | Memory bank | 34M | pooled (768,) |
+ | [CLIP-L seq77](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77) | Memory + sequence | 53M | pooled + seq (77, 768) |
+ | [Meridian bigG](https://huggingface.co/AbstractPhil/geolip-clip-vit-bigG-patch14-ctx576-seq77) | Memory + sequence | 167M | pooled + seq (77, 1280) |
+ | [Conduit v0](https://huggingface.co/AbstractPhil/geolip-bertenstein) | Multi-expert hub | 8.8M | aligned (1024,) |
+ | **CaptionBERT-8192** | **Consensus distilled** | **26M** | **consensus (768,)** |

  ## Citation

+ See [Geometric Memory Part I](https://huggingface.co/blog/AbstractPhil/geometric-memory-ft1) and Part II for the full methodology, including the pentachoron consensus proof, whitened Procrustes alignment, compositional convolution experiments, and the path from accumulation-based memory to alignment-based distillation.

  ## License