AbstractPhil committed · verified
Commit e063df5 · 1 parent: 30bb4b6

Upload README.md with huggingface_hub

Files changed (1): README.md (+185, -3)
---
license: apache-2.0
tags:
- geometric-deep-learning
- distillation
- consensus
- pentachoron
- procrustes
- caption-embedding
- sentence-similarity
- feature-extraction
language: en
pipeline_tag: feature-extraction
---

# GEOLIP Consensus-Distilled Caption Encoder

**A standalone 23M-parameter caption encoder trained via geometric consensus distillation from 5 BERT-family models.**

No expert models needed at inference. Just a tokenizer and this model.

## What Is This?

Five independently trained language models (BERT-base, ModernBERT-base, RoBERTa-base, ALBERT-base-v2, and DistilBERT-base) were aligned into a shared geometric space via whitened Procrustes rotation. Their normalized centroid (the **geometric consensus**) proved remarkably stable: five different random seeds produced the same consensus point to three decimal places.

This model was trained from scratch to reproduce that consensus directly from text. It distills the geometric intersection of the five experts (the subspace where all five agree) into a single small transformer.
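
The alignment step can be sketched in NumPy. This is a minimal illustrative sketch, not the released pipeline: it assumes ZCA whitening, orthogonal Procrustes to a reference expert, and the choice of expert 0 as the anchor; the names `whiten` and `procrustes_rotation` are placeholders.

```python
import numpy as np

def whiten(X):
    """ZCA-whiten embeddings: zero mean, approximately identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    U, S, _ = np.linalg.svd(cov)
    return Xc @ (U @ np.diag(1.0 / np.sqrt(S + 1e-8)) @ U.T)

def procrustes_rotation(X, Y):
    """Orthogonal matrix R minimizing ||X @ R - Y||_F (classic Procrustes)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy stand-ins for the five experts' embeddings of the same captions
rng = np.random.default_rng(0)
experts = [rng.normal(size=(1000, 768)) for _ in range(5)]

whitened = [whiten(E) for E in experts]
anchor = whitened[0]  # align every expert to expert 0's whitened space
aligned = [W @ procrustes_rotation(W, anchor) for W in whitened]

# Per-caption geometric consensus: normalized centroid across experts
centroid = np.mean(aligned, axis=0)
consensus = centroid / np.linalg.norm(centroid, axis=1, keepdims=True)
```

Whitening first removes per-expert scale and correlation, so the rotation only has to account for orientation differences between the spaces.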

## Results

| Metric | Value |
|---|---|
| **Val cosine to consensus** | **0.8621** |
| **Val R@1** | **1.000** |
| **Val CV** | **0.0817** |
| Training data | CC12M captions (500,000 samples) |
| Epochs | 30 |
| Warm-started | Yes |
| Parameters | ~23M |
| Position capacity | 8,192 tokens |

### STS-B Comparison (mean-pooled, no fine-tuning)

| Model | Params | STS-B Spearman |
|---|---|---|
| DistilBERT-base | 66M | 0.5717 |
| RoBERTa-base | 125M | 0.5436 |
| **Consensus Student** | **23M** | **0.4814** |
| ALBERT-base-v2 | 12M | 0.4784 |
| BERT-base | 110M | 0.4729 |
| ModernBERT-base | 149M | 0.4215 |

The student beats BERT-base (5x larger) and ModernBERT-base (7x larger) on STS-B despite being trained from scratch on image captions, which are out of domain for sentence similarity.
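
For reference, the mean-pooled STS-B protocol embeds each sentence by averaging token states, scores each pair by cosine similarity, and reports the Spearman rank correlation against the gold scores. A self-contained sketch on synthetic data (the helpers `mean_pool`, `spearman`, and `sts_score` are illustrative; a real run would pool the model's hidden states over actual STS-B pairs):

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors over non-padding positions."""
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

def spearman(x, y):
    """Spearman rank correlation (toy version, assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

def sts_score(emb_a, emb_b, gold):
    """Spearman between pairwise cosine similarities and gold scores."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return spearman((a * b).sum(axis=1), gold)

# Synthetic sanity check: mean-pool fake hidden states, then score pairs
# where higher gold similarity means more shared signal
rng = np.random.default_rng(0)
hidden_a = rng.normal(size=(50, 12, 768))
mask = np.ones((50, 12), dtype=np.int64)
emb_a = mean_pool(hidden_a, mask)
gold = rng.uniform(0.0, 5.0, size=50)
emb_b = (gold[:, None] / 5.0) * emb_a + rng.normal(size=(50, 768))
corr = sts_score(emb_a, emb_b, gold)
```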

### Training Curve

(t/v = train/val, acc = R@1, cos = cosine to consensus, cv = pentachoron CV)

| Epoch | t_acc | t_cos | v_acc | v_cos | v_cv | Time |
|---|---|---|---|---|---|---|
| 1 | 1.000 | 0.804 | 1.000 | 0.803 | 0.104 | 689s |
| 2 | 1.000 | 0.807 | 1.000 | 0.810 | 0.085 | 688s |
| 3 | 1.000 | 0.811 | 1.000 | 0.820 | 0.103 | 688s |
| 4 | 1.000 | 0.815 | 1.000 | 0.825 | 0.084 | 689s |
| 5 | 1.000 | 0.819 | 1.000 | 0.819 | 0.086 | 689s |
| 6 | 1.000 | 0.821 | 1.000 | 0.821 | 0.095 | 689s |
| 7 | 1.000 | 0.824 | 1.000 | 0.820 | 0.091 | 688s |
| 8 | 1.000 | 0.827 | 1.000 | 0.834 | 0.088 | 689s |
| 9 | 1.000 | 0.829 | 1.000 | 0.829 | 0.088 | 688s |
| 10 | 1.000 | 0.831 | 1.000 | 0.829 | 0.087 | 689s |
| 11 | 1.000 | 0.833 | 1.000 | 0.836 | 0.082 | 689s |
| 12 | 1.000 | 0.835 | 1.000 | 0.838 | 0.084 | 689s |
| 13 | 1.000 | 0.837 | 1.000 | 0.842 | 0.083 | 688s |
| 14 | 1.000 | 0.839 | 1.000 | 0.842 | 0.081 | 689s |
| 15 | 1.000 | 0.842 | 1.000 | 0.840 | 0.078 | 688s |
| 16 | 1.000 | 0.843 | 1.000 | 0.843 | 0.086 | 689s |
| 17 | 1.000 | 0.846 | 1.000 | 0.845 | 0.086 | 689s |
| 18 | 1.000 | 0.847 | 1.000 | 0.848 | 0.087 | 689s |
| 19 | 1.000 | 0.849 | 1.000 | 0.849 | 0.082 | 688s |
| 20 | 1.000 | 0.851 | 1.000 | 0.849 | 0.078 | 690s |
| 21 | 1.000 | 0.853 | 1.000 | 0.855 | 0.087 | 689s |
| 22 | 1.000 | 0.855 | 1.000 | 0.856 | 0.083 | 689s |
| 23 | 1.000 | 0.857 | 1.000 | 0.855 | 0.078 | 689s |
| 24 | 1.000 | 0.858 | 1.000 | 0.857 | 0.093 | 688s |
| 25 | 1.000 | 0.860 | 1.000 | 0.859 | 0.092 | 689s |
| 26 | 1.000 | 0.861 | 1.000 | 0.860 | 0.079 | 689s |
| 27 | 1.000 | 0.863 | 1.000 | 0.862 | 0.084 | 689s |
| 28 | 1.000 | 0.863 | 1.000 | 0.862 | 0.091 | 688s |
| 29 | 1.000 | 0.863 | 1.000 | 0.862 | 0.081 | 688s |
| 30 | 1.000 | 0.863 | 1.000 | 0.862 | 0.082 | 689s |


## Usage

```python
import torch
from transformers import AutoTokenizer
from caption_encoder import CaptionEncoder

# Load tokenizer and model weights
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = CaptionEncoder(
    vocab_size=30522, max_len=8192, d_model=384,
    n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
    dropout=0.0, pad_token_id=0,
)
model.load_state_dict(torch.load("best_model.pt", weights_only=True))
model.eval()

# Encode
texts = ["A cat sitting on a windowsill", "A dog playing fetch on the beach"]
tokens = tokenizer(texts, max_length=512, padding="max_length",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(tokens["input_ids"], tokens["attention_mask"])

# embeddings: (2, 768), L2-normalized
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.3f}")
```

## Architecture

```
Input text
  │
  ├── BERT WordPiece tokenizer (30,522 vocab)
  ├── Token embeddings (384-dim)
  ├── Position embeddings (8,192 capacity)
  │
  ├── 6× Transformer encoder layers
  │     (384-dim, 6 heads, 1536 FFN, GELU, pre-norm)
  │
  ├── Mean pool over non-padding tokens
  ├── Projection: 384 → 384 → GELU → LN → 768
  └── L2 normalize
        │
        └── (B, 768) consensus-aligned embedding
```
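
The diagram maps onto standard PyTorch modules. The sketch below is an illustrative reimplementation from the diagram only; the shipped `caption_encoder.py` is the authoritative version and may differ in details such as weight initialization or masking.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionEncoder(nn.Module):
    """Illustrative reconstruction of the architecture described above."""

    def __init__(self, vocab_size=30522, max_len=8192, d_model=384,
                 n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
                 dropout=0.0, pad_token_id=0):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_token_id)
        self.pos_emb = nn.Embedding(max_len, d_model)  # 8,192-token capacity
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, dropout=dropout,
            activation="gelu", batch_first=True, norm_first=True)  # pre-norm
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.LayerNorm(d_model), nn.Linear(d_model, output_dim))

    def forward(self, input_ids, attention_mask):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        x = self.encoder(x, src_key_padding_mask=attention_mask == 0)
        # Mean pool over non-padding tokens
        mask = attention_mask.unsqueeze(-1).to(x.dtype)
        pooled = (x * mask).sum(1) / mask.sum(1).clamp(min=1)
        return F.normalize(self.proj(pooled), dim=-1)  # unit-norm (B, 768)
```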

## The Consensus Distillation Pipeline

```
5 Expert Models (frozen)
  │
  ├── BERT-base-uncased (110M, MLM)
  ├── ModernBERT-base (149M, MLM + rotary)
  ├── RoBERTa-base (125M, MLM + dynamic masking)
  ├── ALBERT-base-v2 (12M, MLM + SOP + factorized)
  └── DistilBERT-base (66M, distilled from BERT)
        │
        ├── Extract embeddings on CC12M captions
        ├── Whitened Procrustes alignment to shared space
        ├── Consensus = normalized centroid
        │     (constant to 3 decimal places across 5 seeds)
        │
        └── Train student with:
              ├── InfoNCE(student, consensus): retrieval alignment
              ├── MSE(student, consensus): direct regression
              └── Pentachoron CV → 0.084: geometric regularity
```
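
The student objective in the final stage pairs a contrastive term with a regression term. A minimal sketch, assuming in-batch negatives for InfoNCE; the temperature (0.07) and MSE weight (1.0) are placeholders, not published values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, consensus, temperature=0.07, mse_weight=1.0):
    """InfoNCE + MSE against per-caption consensus targets (illustrative)."""
    s = F.normalize(student, dim=-1)
    c = F.normalize(consensus, dim=-1)
    logits = s @ c.T / temperature                  # (B, B) similarity matrix
    labels = torch.arange(len(s), device=s.device)  # each caption matches itself
    info_nce = F.cross_entropy(logits, labels)      # retrieval alignment
    mse = F.mse_loss(s, c)                          # direct regression
    return info_nce + mse_weight * mse
```

A perfectly distilled student (outputs equal to the consensus) drives both terms toward their minimum, so the combined loss decreases as the two diagnostics in the training curve (R@1 and cosine) improve.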

## Key Properties

**Geometric regularity.** The embedding space has pentachoron CV ≈ 0.08–0.10, meaning local neighborhoods are uniformly distributed. The space is smooth, interpolable, and well-conditioned for downstream operations.
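
As an assumed, illustrative definition (the blog posts describe the exact metric): the pentachoron CV of five points is the coefficient of variation of their ten pairwise distances, which is 0 for a perfectly regular 4-simplex and grows as the configuration becomes irregular.

```python
import numpy as np

def pentachoron_cv(points):
    """Coefficient of variation of the 10 pairwise distances among 5 points.
    (Assumed definition: ~0 for a regular 4-simplex / pentachoron.)"""
    assert len(points) == 5
    d = [np.linalg.norm(points[i] - points[j])
         for i in range(5) for j in range(i + 1, 5)]
    return float(np.std(d) / np.mean(d))

# The rows of the 5x5 identity form a regular 4-simplex: all pairwise
# distances equal sqrt(2), so the CV is essentially zero
regular_cv = pentachoron_cv(np.eye(5))
```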

**Multi-teacher consensus.** The target is the geometric intersection of five experts, not any single teacher. Individual model errors cancel; what remains is what five independent systems agree on.

**Minimal data requirement.** The consensus manifold is smooth enough (CV = 0.084) that 18K examples were sufficient for R@1 = 1.000 on held-out data, suggesting the function from text to consensus embedding has a low effective Lipschitz constant.

**8K position capacity.** Trained on 512-token sequences, but the position embeddings extend to 8,192, so the model is ready for long-context applications without retraining.

## GEOLIP Family

| System | Type | Output |
|---|---|---|
| [CLIP-L ctx576](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576) | Memory bank | pooled (768,) |
| [CLIP-L seq77](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77) | Memory + sequence | pooled + seq (77, 768) |
| [Meridian bigG](https://huggingface.co/AbstractPhil/geolip-clip-vit-bigG-patch14-ctx576-seq77) | Memory + sequence | pooled + seq (77, 1280) |
| [Conduit v0](https://huggingface.co/AbstractPhil/geolip-bertenstein) | Multi-expert hub | aligned (1024,) |
| **Consensus Distilled** | **Student** | **consensus (768,)** |

## Citation

See [Geometric Memory Part I](https://huggingface.co/blog/AbstractPhil/geometric-memory-ft1) and Part II for the full methodology.

## License

Apache 2.0