---
tags:
- caption-embedding
- sentence-similarity
- feature-extraction
- caption_encoder
language: en
pipeline_tag: feature-extraction
datasets:
base_model:
- AbstractPhil/geolip-bertenstein
---

# GEOLIP CaptionBERT-8192

A 26M-parameter caption encoder whose embedding space is the geometric intersection of five independently trained language models. Trained from scratch via consensus distillation: no pretrained weights, no expert models at inference.

## Benchmarks

Evaluated against all five consensus teachers on STS-B, SICK-R, and MRPC. All models use mean-pooled embeddings with cosine similarity. No fine-tuning on any benchmark task.
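
As an illustration of that protocol (helper names are ours, not the repo's evaluation script): each sentence pair is scored by the cosine of its two embeddings, and the benchmark number is the rank correlation of those scores against the human gold labels.

```python
import numpy as np

def spearman(x, y):
    # Spearman rho = Pearson correlation of the ranks (no tie handling)
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def sts_score(emb_a, emb_b, gold):
    # emb_a, emb_b: (N, d) L2-normalized sentence embeddings; gold: (N,) human scores
    sims = np.sum(emb_a * emb_b, axis=1)   # row-wise cosine similarity
    return spearman(sims, gold)
```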

### Semantic Textual Similarity (STS-B)

| Model | Params | Spearman ρ | Pearson r |
|---|---|---|---|
| DistilBERT-base | 66M | 0.5717 | – |
| RoBERTa-base | 125M | 0.5436 | – |
| **CaptionBERT-8192** | **26M** | **0.5032** | **0.5100** |
| ALBERT-base-v2 | 12M | 0.4784 | – |
| BERT-base | 110M | 0.4729 | – |
| ModernBERT-base | 149M | 0.4215 | – |

Beats BERT-base (4.2× larger) and ModernBERT-base (5.7× larger) on general sentence similarity despite being trained exclusively on image captions.

### SICK-R (Compositional Similarity)

| Model | Params | Spearman ρ | Pearson r |
|---|---|---|---|
| DistilBERT-base | 66M | 0.6424 | – |
| RoBERTa-base | 125M | 0.6296 | – |
| **CaptionBERT-8192** | **26M** | **0.6138** | **0.6645** |
| BERT-base | 110M | 0.5865 | – |
| ModernBERT-base | 149M | 0.5479 | – |
| ALBERT-base-v2 | 12M | 0.5364 | – |

#3/6 on compositional/syntactic similarity. Beats BERT-base, ModernBERT-base, and ALBERT on a task requiring structural language understanding.

### MRPC (Paraphrase Detection)

| Model | Params | F1 | Accuracy | Threshold |
|---|---|---|---|---|
| RoBERTa-base | 125M | 0.8122 | – | – |
| **CaptionBERT-8192** | **26M** | **0.8068** | **0.6881** | **0.71** |
| ALBERT-base-v2 | 12M | 0.8067 | – | – |
| BERT-base | 110M | 0.8062 | – | – |
| DistilBERT-base | 66M | 0.8055 | – | – |
| ModernBERT-base | 149M | 0.8038 | – | – |

**#2/6 on paraphrase detection.** 0.005 F1 behind RoBERTa, ahead of every other teacher. No classification head: pure cosine similarity with an auto-discovered threshold. A model that has never seen a paraphrase pair during training nearly wins paraphrase detection.
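
The auto-discovered threshold can be found with a simple sweep over candidate cutoffs, keeping the one that maximizes F1 on a validation split. A minimal sketch (`best_f1_threshold` is illustrative, not the repo's code):

```python
import numpy as np

def best_f1_threshold(sims, labels):
    # sims: (N,) cosine similarities; labels: (N,) 1 = paraphrase, 0 = not
    best_f1, best_t = 0.0, 0.5
    for t in np.unique(sims):          # every observed similarity is a candidate cutoff
        preds = sims >= t
        tp = np.sum(preds & (labels == 1))
        fp = np.sum(preds & (labels == 0))
        fn = np.sum(~preds & (labels == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, float(t)
    return best_f1, best_t
```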

### Caption Embedding Quality

| Metric | Value |
|---|---|
| Self-similarity mean | 0.0040 |
| Self-similarity max | 0.7181 |
| Top-1 retrieval cosine | 0.5477 |
| Top-5 retrieval cosine | 0.4853 |

Near-zero average self-similarity across 1000 random captions: the embedding space has excellent discrimination. Every caption occupies its own distinct region on the hypersphere.
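
Those two statistics are just the off-diagonal entries of the caption-to-caption cosine matrix; a sketch of how such numbers can be computed (hypothetical helper, not the evaluation script):

```python
import numpy as np

def cross_caption_similarity(emb):
    # emb: (N, d) caption embeddings; compare every caption to every other one
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    off_diag = sims[~np.eye(len(emb), dtype=bool)]   # drop the trivial self-matches
    return float(off_diag.mean()), float(off_diag.max())
```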

### Consensus Fidelity

| Metric | Value |
|---|---|
| Val cosine to consensus | 0.862 |
| Val R@1 | 1.000 |
| Pentachoron CV | 0.082 |
| Training data | 500K CC12M captions |
| Epochs | 30 |
| Position capacity | 8,192 tokens |
| Parameters | 25,958,016 |

## How It Works

Five language models were aligned into a shared geometric space via whitened Procrustes rotation. Their normalized centroid, the **geometric consensus**, was proven to be a mathematical constant: five different random seeds produced the same consensus point to three decimal places.
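
That alignment step can be sketched as follows: whitening removes each teacher's covariance structure, an orthogonal Procrustes rotation (closed form via SVD) maps it onto a shared anchor space, and the consensus is the normalized centroid of the aligned embeddings. This is a minimal illustration under those assumptions, not the exact pipeline; `anchor` stands in for whichever space the teachers are aligned to.

```python
import numpy as np

def whiten(X, eps=1e-8):
    # Zero-mean, identity-covariance (ZCA) transform of an (N, d) embedding matrix
    X = X - X.mean(axis=0)
    cov = X.T @ X / len(X)
    w, V = np.linalg.eigh(cov)
    return X @ (V / np.sqrt(w + eps)) @ V.T

def procrustes(X, Y):
    # Orthogonal R minimizing ||X @ R - Y||_F (SVD / Kabsch solution)
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def consensus(teacher_embs, anchor):
    # Align each whitened teacher onto the anchor, then take the normalized centroid
    aligned = [whiten(E) @ procrustes(whiten(E), anchor) for E in teacher_embs]
    c = np.mean(aligned, axis=0)
    return c / np.linalg.norm(c, axis=1, keepdims=True)
```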

This model was trained from scratch to reproduce that consensus directly from text. It distills the geometric intersection of five experts into a single small transformer.

The distillation is not standard knowledge distillation. It is multi-teacher geometric consensus distillation: the target is not any single teacher's output but the fixed point where all five teachers agree. Individual model errors cancel. What remains is the structural invariant of language understanding that five different architectures and training objectives independently discovered.

The alignment itself is directly distillable. The geometric structure is so robust that a from-scratch model learns it with R@1 = 1.000 from 18K examples in 80 seconds. The consensus manifold has pentachoron CV = 0.084, the tightest geometric regularity measured across all GEOLIP experiments, which means the function from text to embedding is smooth enough that sparse sampling covers it completely.
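
R@1 here means that each caption's student embedding retrieves its own consensus target as the nearest neighbor in the batch. A sketch of the metric (illustrative helper, not the training code):

```python
import numpy as np

def recall_at_1(student, target):
    # For each student embedding, find the nearest consensus vector;
    # a hit means the nearest one is its own row.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    nearest = np.argmax(s @ t.T, axis=1)
    return float(np.mean(nearest == np.arange(len(s))))
```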

```
5 Expert Models (frozen)
        │
   ├── BERT-base-uncased (110M, MLM)
   ├── ModernBERT-base (149M, MLM + rotary, 8192 ctx)
   ├── RoBERTa-base (125M, MLM + dynamic masking)
   ├── ALBERT-base-v2 (12M, MLM + SOP + factorized)
   └── DistilBERT-base (66M, distilled from BERT)
        │
   ├── Extract pooled embeddings on 500K CC12M captions
   ├── Whitened Procrustes alignment to shared space
   └── Consensus = normalized centroid (geometric constant)
        │
   └── Train student with:
       ├── InfoNCE(student, consensus) → retrieval alignment
       ├── MSE(student, consensus) → direct regression
       └── Pentachoron CV ≈ 0.084 → geometric regularity
```
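
The two student losses in the diagram can be sketched numerically (a NumPy stand-in for the actual training code; the pentachoron regularizer is omitted, and the temperature value is an assumption):

```python
import numpy as np

def distill_losses(student, consensus, temperature=0.07):
    # Both losses operate on L2-normalized embeddings
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    c = consensus / np.linalg.norm(consensus, axis=1, keepdims=True)

    # InfoNCE: each caption's positive is its own consensus vector (the diagonal);
    # every other row in the batch acts as a negative
    logits = (s @ c.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    info_nce = -float(np.mean(np.diag(log_probs)))

    # MSE: direct regression onto the consensus point
    mse = float(np.mean((s - c) ** 2))
    return info_nce, mse
```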

## Planned Task Heads

The 768-dim consensus embedding serves as a frozen feature extractor. Linear heads trained on task-specific data snap on top.

### Priority Heads

| Head | Architecture | Training Data | Use Case |
|---|---|---|---|
| **NLI / Entailment** | cat(a, b, \|a-b\|, a*b) → Linear(3072, 3) | MNLI, SNLI | Agent reasoning validation |
| **Semantic Similarity** | Linear(768, 1) → sigmoid×5 | STS-B train | Push STS-B toward 0.80+ |
| **Multi-Label Tagging** | Linear(768, n_tags) → sigmoid | COCO categories, Visual Genome | Predict objects/attributes from captions |
| **Paraphrase Detection** | cos(a, b) → threshold (already works) | MRPC, QQP | Deduplication, reformulation detection |
| **Sentiment** | Linear(768, n_classes) | SST-2, IMDB | Content routing, sentiment analysis |
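
The NLI head's input, cat(a, b, |a-b|, a*b), is the standard sentence-pair feature map for classifiers over frozen embeddings; a sketch with hypothetical names:

```python
import numpy as np

def nli_features(a, b):
    # a, b: (B, 768) premise / hypothesis embeddings
    # Output: (B, 3072) feature vector fed to Linear(3072, 3)
    return np.concatenate([a, b, np.abs(a - b), a * b], axis=1)
```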

### Extended Heads

| Head | Architecture | Training Data | Use Case |
|---|---|---|---|
| Caption Quality | Linear(768, 2) | Hallucination-annotated captions | Filter AI-generated training data |
| Cross-Encoder Reranker | cat(query, doc) → Linear(1536, 1) | MS MARCO | Two-stage retrieval scoring |
| Clustering | Linear(768, 256) → normalize | Unsupervised | Caption taxonomy, dataset organization |
| Relation Extraction | cat(subj_emb, obj_emb) → Linear(1536, n_rel) | Visual Genome relationships | Structured scene understanding |
| Caption-Image Score | Linear(768, 256) → cos with CLIP visual | CC12M image-caption pairs | Cross-modal retrieval without CLIP |

### Consensus Head Distillation

The same consensus trick applies to task heads. Train five separate NLI heads on the five frozen expert models, take the consensus prediction, and distill it into a single head on CaptionBERT. The head learns where all five experts agree on entailment: same noise cancellation, one layer instead of five.
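
A sketch of that recipe (hypothetical helper, assuming the five per-expert NLI heads already produce logits): average the experts' softmax outputs to form the soft target the single student head is distilled onto.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def consensus_nli_targets(expert_logits):
    # expert_logits: list of five (N, 3) arrays, one per frozen expert's NLI head
    probs = np.stack([softmax(l) for l in expert_logits])
    return probs.mean(axis=0)   # (N, 3) soft labels for training one student head
```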

## Training Datasets: Current and Planned

### Current

| Dataset | Samples Used | Content | Notes |
|---|---|---|---|
| [CC12M LLaVA-Next](https://huggingface.co/datasets/CaptionEmporium/conceptual-captions-cc12m-llavanext) | 500K | Re-captioned CC12M with LLaVA-Next | Primary training data, mean ~92 tokens |

### Planned: Caption Saturation

Training currently truncates input at 512 tokens, but the model has 8,192 positions of capacity. Longer, more complex captions will exercise the full context window and push v_cos beyond 0.862.

| Dataset | Size | Content | Why |
|---|---|---|---|
| [ShareGPT4V](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) | 1.2M | GPT-4V detailed image descriptions | Longer captions (200-500 tokens), richer vocabulary |
| [DOCCI](https://huggingface.co/datasets/google/docci) | 15K | Expert-written dense image descriptions | Extremely detailed, 100-300 words per image |
| [Localized Narratives](https://huggingface.co/datasets/google/localized-narratives) | 850K | Spoken descriptions with mouse traces | Narrative structure, temporal ordering |
| [DenseCap](https://huggingface.co/datasets/visual-genome/dense-captions) | 5.4M | Region-level dense captions | Fine-grained spatial descriptions |
| [TextCaps](https://huggingface.co/datasets/lmms-lab/TextCaps) | 145K | Captions requiring OCR reading | Text-in-image understanding |
| [VizWiz](https://huggingface.co/datasets/lmms-lab/VizWiz-VQA) | 32K | Captions from blind/low-vision users | Diverse, real-world, often longer descriptions |
| [COCO Captions](https://huggingface.co/datasets/HuggingFaceM4/COCO) | 600K | 5 captions per image, human-written | Short but high-quality, broad coverage |
| [SBU Captions](https://huggingface.co/datasets/sbu_captions) | 1M | Web-crawled image-caption pairs | Scale and diversity |

### Planned: Domain Extension

| Dataset | Size | Content | Why |
|---|---|---|---|
| [BookCorpus](https://huggingface.co/datasets/bookcorpus) | 11K books | Long-form narrative text | Exercise 8K context, literary language |
| [Wikipedia](https://huggingface.co/datasets/wikipedia) | 6M articles | Encyclopedic text | General knowledge, factual density |
| [Natural Questions](https://huggingface.co/datasets/google-research-datasets/natural_questions) | 300K | Question-answer pairs | QA capability for retrieval heads |
| [MS MARCO](https://huggingface.co/datasets/microsoft/ms_marco) | 1M | Passages + queries | Retrieval training for reranker head |

## Architecture

```
└── (B, 768) consensus-aligned embedding
```

## Usage

```python
import torch
from transformers import AutoTokenizer
from caption_encoder import CaptionEncoder  # model class shipped in this repo

# Load
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = CaptionEncoder(
    vocab_size=30522, max_len=8192, d_model=384,
    n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
    dropout=0.0, pad_token_id=0)
model.load_state_dict(torch.load("best_model.pt", weights_only=True))
model.eval()

# Encode
texts = ["A cat sitting on a windowsill", "A dog playing fetch on the beach"]
tokens = tokenizer(texts, max_length=512, padding="max_length",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(tokens["input_ids"], tokens["attention_mask"])

# embeddings: (2, 768) L2-normalized
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.3f}")
```

## Training Curve

| Epoch | t_cos | v_cos | v_cv | Time |
|---|---|---|---|---|
| 1 | 0.804 | 0.803 | 0.104 | 689s |
| 5 | 0.819 | 0.819 | 0.086 | 689s |
| 10 | 0.831 | 0.829 | 0.087 | 689s |
| 15 | 0.842 | 0.840 | 0.078 | 688s |
| 20 | 0.851 | 0.849 | 0.078 | 690s |
| 25 | 0.860 | 0.859 | 0.092 | 689s |
| 30 | 0.863 | 0.862 | 0.082 | 689s |

R@1 = 1.000 and t_acc = 1.000 throughout all 30 epochs. Train/val gap < 0.002; no overfitting on 500K samples.

## GEOLIP Family

| System | Type | Params | Output |
|---|---|---|---|
| [CLIP-L ctx576](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576) | Memory bank | 34M | pooled (768,) |
| [CLIP-L seq77](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77) | Memory + sequence | 53M | pooled + seq (77, 768) |
| [Meridian bigG](https://huggingface.co/AbstractPhil/geolip-clip-vit-bigG-patch14-ctx576-seq77) | Memory + sequence | 167M | pooled + seq (77, 1280) |
| [Conduit v0](https://huggingface.co/AbstractPhil/geolip-bertenstein) | Multi-expert hub | 8.8M | aligned (1024,) |
| **CaptionBERT-8192** | **Consensus distilled** | **26M** | **consensus (768,)** |

## Citation

See [Geometric Memory Part I](https://huggingface.co/blog/AbstractPhil/geometric-memory-ft1) and Part II for the full methodology, including the pentachoron consensus proof, whitened Procrustes alignment, compositional convolution experiments, and the path from accumulation-based memory to alignment-based distillation.

## License