---
license: apache-2.0
tags:
- geometric-deep-learning
- distillation
- consensus
- pentachoron
- procrustes
- caption-embedding
- sentence-similarity
- feature-extraction
language: en
pipeline_tag: feature-extraction
---

# GEOLIP Consensus-Distilled Caption Encoder

**A standalone 23M-parameter caption encoder trained via geometric consensus distillation from five BERT-family models.**

No expert models are needed at inference, just a tokenizer and this model.

## What Is This?

Five independently trained language models (BERT-base, ModernBERT-base, RoBERTa-base, ALBERT-base-v2, and DistilBERT-base) were aligned into a shared geometric space via whitened Procrustes rotation. Their normalized centroid (the **geometric consensus**) was found to be constant across runs: five different random seeds produced the same consensus point to three decimal places.

This model was trained from scratch to reproduce that consensus directly from text. It distills the geometric intersection of the five experts, the subspace where all five agree, into a single small transformer.

## Results

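The alignment step described above can be sketched with NumPy. This is an illustrative reconstruction under stated assumptions: `whiten`, `procrustes_rotation`, the anchor choice, and the toy data are not from the released pipeline, and the real experts output different widths (a shared projection is assumed before this step).

```python
import numpy as np

def whiten(X, eps=1e-5):
    """ZCA-whiten: zero-center and decorrelate so every direction
    has (approximately) unit variance before rotation."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / len(Xc)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

def procrustes_rotation(A, B):
    """Orthogonal R minimizing ||A @ R - B||_F (SVD/Kabsch solution)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Toy stand-ins: five "experts" embed the same 100 captions in 16 dims.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 16))
experts = [whiten(base + 0.1 * rng.normal(size=base.shape)) for _ in range(5)]

# Rotate every expert into the first expert's (whitened) frame.
anchor = experts[0]
aligned = [E @ procrustes_rotation(E, anchor) for E in experts]

# Consensus = row-normalized centroid of the aligned expert embeddings.
centroid = np.stack(aligned).mean(axis=0)
consensus = centroid / np.linalg.norm(centroid, axis=1, keepdims=True)
```

Because each rotation is orthogonal, alignment preserves distances within each expert's space; only the centroid averaging mixes the five views.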
| Metric | Value |
|---|---|
| **Val cosine to consensus** | **0.8621** |
| **Val R@1** | **1.000** |
| **Val pentachoron CV** | **0.0817** |
| Training data | CC12M captions (500,000 samples) |
| Epochs | 30 |
| Warm-started | Yes |
| Parameters | ~23M |
| Position capacity | 8,192 tokens |

### STS-B Comparison (mean-pooled, no fine-tuning)

| Model | Params | STS-B Spearman |
|---|---|---|
| DistilBERT-base | 66M | 0.5717 |
| RoBERTa-base | 125M | 0.5436 |
| **Consensus Student** | **23M** | **0.4814** |
| ALBERT-base-v2 | 12M | 0.4784 |
| BERT-base | 110M | 0.4729 |
| ModernBERT-base | 149M | 0.4215 |

The student beats BERT-base (roughly 5x larger) and ModernBERT-base (roughly 7x larger) on STS-B despite being trained from scratch on image captions, which are out of domain for sentence similarity.

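For reference, the STS-B numbers above reduce to one statistic: the Spearman rank correlation between embedding cosine similarities and human similarity scores. A dependency-free sketch (function names are illustrative, and unlike SciPy's implementation this handles no rank ties):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (no tie handling): Pearson
    correlation of the two rank vectors."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def sts_score(emb_a, emb_b, gold):
    """Score sentence pairs: cosine of L2-normalized embeddings
    vs. gold human similarity judgments."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return spearman((a * b).sum(axis=1), gold)
```

Rank correlation only cares about ordering, which is why encoders with very different absolute similarity scales can still be compared on one table.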
### Training Curve

Columns: train/val retrieval accuracy (R@1 against the consensus targets), train/val cosine to consensus, and val pentachoron CV.

| Epoch | Train acc | Train cos | Val acc | Val cos | Val CV | Time (s) |
|---|---|---|---|---|---|---|
| 1 | 1.000 | 0.804 | 1.000 | 0.803 | 0.104 | 689 |
| 2 | 1.000 | 0.807 | 1.000 | 0.810 | 0.085 | 688 |
| 3 | 1.000 | 0.811 | 1.000 | 0.820 | 0.103 | 688 |
| 4 | 1.000 | 0.815 | 1.000 | 0.825 | 0.084 | 689 |
| 5 | 1.000 | 0.819 | 1.000 | 0.819 | 0.086 | 689 |
| 6 | 1.000 | 0.821 | 1.000 | 0.821 | 0.095 | 689 |
| 7 | 1.000 | 0.824 | 1.000 | 0.820 | 0.091 | 688 |
| 8 | 1.000 | 0.827 | 1.000 | 0.834 | 0.088 | 689 |
| 9 | 1.000 | 0.829 | 1.000 | 0.829 | 0.088 | 688 |
| 10 | 1.000 | 0.831 | 1.000 | 0.829 | 0.087 | 689 |
| 11 | 1.000 | 0.833 | 1.000 | 0.836 | 0.082 | 689 |
| 12 | 1.000 | 0.835 | 1.000 | 0.838 | 0.084 | 689 |
| 13 | 1.000 | 0.837 | 1.000 | 0.842 | 0.083 | 688 |
| 14 | 1.000 | 0.839 | 1.000 | 0.842 | 0.081 | 689 |
| 15 | 1.000 | 0.842 | 1.000 | 0.840 | 0.078 | 688 |
| 16 | 1.000 | 0.843 | 1.000 | 0.843 | 0.086 | 689 |
| 17 | 1.000 | 0.846 | 1.000 | 0.845 | 0.086 | 689 |
| 18 | 1.000 | 0.847 | 1.000 | 0.848 | 0.087 | 689 |
| 19 | 1.000 | 0.849 | 1.000 | 0.849 | 0.082 | 688 |
| 20 | 1.000 | 0.851 | 1.000 | 0.849 | 0.078 | 690 |
| 21 | 1.000 | 0.853 | 1.000 | 0.855 | 0.087 | 689 |
| 22 | 1.000 | 0.855 | 1.000 | 0.856 | 0.083 | 689 |
| 23 | 1.000 | 0.857 | 1.000 | 0.855 | 0.078 | 689 |
| 24 | 1.000 | 0.858 | 1.000 | 0.857 | 0.093 | 688 |
| 25 | 1.000 | 0.860 | 1.000 | 0.859 | 0.092 | 689 |
| 26 | 1.000 | 0.861 | 1.000 | 0.860 | 0.079 | 689 |
| 27 | 1.000 | 0.863 | 1.000 | 0.862 | 0.084 | 689 |
| 28 | 1.000 | 0.863 | 1.000 | 0.862 | 0.091 | 688 |
| 29 | 1.000 | 0.863 | 1.000 | 0.862 | 0.081 | 688 |
| 30 | 1.000 | 0.863 | 1.000 | 0.862 | 0.082 | 689 |

## Usage

```python
import torch
from transformers import AutoTokenizer

from caption_encoder import CaptionEncoder  # shipped alongside this checkpoint

# Load the tokenizer and the trained weights
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = CaptionEncoder(
    vocab_size=30522, max_len=8192, d_model=384,
    n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
    dropout=0.0, pad_token_id=0)
model.load_state_dict(torch.load("best_model.pt", weights_only=True))
model.eval()

# Encode a batch of captions
texts = ["A cat sitting on a windowsill", "A dog playing fetch on the beach"]
tokens = tokenizer(texts, max_length=512, padding="max_length",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(tokens["input_ids"], tokens["attention_mask"])

# embeddings: (2, 768), L2-normalized, so a dot product is cosine similarity
similarity = (embeddings[0] @ embeddings[1]).item()
print(f"Similarity: {similarity:.3f}")
```

## Architecture

```
Input text
    ↓
├── BERT WordPiece tokenizer (30,522 vocab)
├── Token embeddings (384-dim)
└── Position embeddings (8,192 capacity)
    ↓
6× Transformer encoder layers
    (384-dim, 6 heads, 1536 FFN, GELU, pre-norm)
    ↓
├── Mean pool over non-padding tokens
├── Projection: 384 → 384 → GELU → LN → 768
└── L2 normalize
    ↓
(B, 768) consensus-aligned embedding
```

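The block diagram above can be reconstructed in a few lines of PyTorch. This sketch is illustrative only: the repository's `caption_encoder.py` is authoritative and may differ in details (e.g. a final encoder norm or initialization), but it follows the hyperparameters shown in the Usage section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionEncoder(nn.Module):
    def __init__(self, vocab_size=30522, max_len=8192, d_model=384,
                 n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
                 dropout=0.0, pad_token_id=0):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model, padding_idx=pad_token_id)
        self.pos = nn.Embedding(max_len, d_model)   # learned positions, 8K capacity
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, d_ff, dropout=dropout,
            activation="gelu", batch_first=True, norm_first=True)  # pre-norm
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Sequential(                  # 384 -> 384 -> GELU -> LN -> 768
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.LayerNorm(d_model), nn.Linear(d_model, output_dim))

    def forward(self, input_ids, attention_mask):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok(input_ids) + self.pos(pos)
        x = self.encoder(x, src_key_padding_mask=attention_mask == 0)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (x * mask).sum(1) / mask.sum(1).clamp(min=1)  # mean over real tokens
        return F.normalize(self.proj(pooled), dim=-1)          # unit-norm output

model = CaptionEncoder().eval()
```

Mean pooling over non-padding tokens is what makes the `padding` strategy in the Usage snippet a non-issue: padded positions never reach the pooled embedding.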
## The Consensus Distillation Pipeline

```
5 Expert Models (frozen)
    ├── BERT-base-uncased (110M, MLM)
    ├── ModernBERT-base (149M, MLM + rotary)
    ├── RoBERTa-base (125M, MLM + dynamic masking)
    ├── ALBERT-base-v2 (12M, MLM + SOP + factorized)
    └── DistilBERT-base (66M, distilled from BERT)
        ↓
├── Extract embeddings on CC12M captions
├── Whitened Procrustes alignment to shared space
└── Consensus = normalized centroid
      (constant to 3 decimal places across 5 seeds)
        ↓
Train student with:
    ├── InfoNCE(student, consensus) → retrieval alignment
    ├── MSE(student, consensus) → direct regression
    └── Pentachoron CV → 0.084 → geometric regularity
```

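The two training losses named in the pipeline above can be sketched as follows. The temperature and the relative weighting of the two terms are assumptions; this card does not state them.

```python
import torch
import torch.nn.functional as F

def distill_losses(student, consensus, temperature=0.07):
    """InfoNCE over in-batch pairs plus direct MSE to the consensus targets.

    student, consensus: (B, D) L2-normalized embeddings of the same captions,
    row i of each corresponding to the same caption.
    """
    logits = student @ consensus.t() / temperature        # (B, B) similarities
    targets = torch.arange(len(student), device=student.device)
    info_nce = F.cross_entropy(logits, targets)           # match caption i to target i
    mse = F.mse_loss(student, consensus)                  # direct regression
    return info_nce, mse

B, D = 8, 768
s = F.normalize(torch.randn(B, D), dim=-1)
c = F.normalize(torch.randn(B, D), dim=-1)
nce, mse = distill_losses(s, c)
loss = nce + mse  # equal weighting assumed for illustration
```

InfoNCE shapes the retrieval geometry (each caption must rank its own consensus target first), while the MSE term pins down the absolute position of each embedding.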
## Key Properties

**Geometric regularity.** The embedding space has a pentachoron CV of roughly 0.08–0.10, meaning local neighborhoods are uniformly spread. The space is smooth, interpolable, and well-conditioned for downstream operations.

**Multi-teacher consensus.** The target is the geometric intersection of five experts, not any single teacher, so individual model errors cancel. What remains is what five independent systems agree on.

**Minimal data requirement.** The consensus manifold is smooth enough (CV = 0.084) that 18K examples were sufficient for R@1 = 1.000 on held-out data; the function from text to consensus embedding appears to have a low Lipschitz constant.

**8K position capacity.** The model was trained on 512-token sequences, but its position embeddings extend to 8,192 tokens, so longer inputs can be encoded without retraining (positions beyond 512 were not seen during training).

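One plausible reading of the pentachoron CV used throughout this card: take 5 embeddings as the vertices of a pentachoron (a 4-simplex) and compute the coefficient of variation (std/mean) of its 10 edge lengths, so that 0 means a perfectly regular simplex. This is an illustrative reconstruction, not necessarily the training code's exact metric.

```python
import torch

def pentachoron_cv(emb, idx):
    """Coefficient of variation of the 10 pairwise distances among
    5 embeddings (the pentachoron's vertices). 0 = regular 4-simplex."""
    pts = emb[idx]                         # (5, D) vertex embeddings
    d = torch.cdist(pts, pts)              # (5, 5) pairwise distances
    iu = torch.triu_indices(5, 5, offset=1)
    edges = d[iu[0], iu[1]]                # the 10 unique edge lengths
    return (edges.std() / edges.mean()).item()

emb = torch.nn.functional.normalize(torch.randn(100, 768), dim=-1)
cv = pentachoron_cv(emb, torch.arange(5))  # CV for one sampled pentachoron
```

Under this reading, averaging the CV over many sampled 5-tuples measures how uniformly points are spread through the space, which matches the "geometric regularity" claim above.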
## GEOLIP Family

| System | Type | Output |
|---|---|---|
| [CLIP-L ctx576](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576) | Memory bank | pooled (768,) |
| [CLIP-L seq77](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77) | Memory + sequence | pooled + seq (77, 768) |
| [Meridian bigG](https://huggingface.co/AbstractPhil/geolip-clip-vit-bigG-patch14-ctx576-seq77) | Memory + sequence | pooled + seq (77, 1280) |
| [Conduit v0](https://huggingface.co/AbstractPhil/geolip-bertenstein) | Multi-expert hub | aligned (1024,) |
| **Consensus Distilled** (this model) | **Student** | **consensus (768,)** |

## Citation

See [Geometric Memory Part I](https://huggingface.co/blog/AbstractPhil/geometric-memory-ft1) and Part II for the full methodology.

## License

Apache 2.0