# GEOLIP-CLIP-ViT-L/14 ctx576: Memory-Extended CLIP Text Encoder
Extends CLIP-ViT-L/14 text encoder from 77 tokens to 576 effective context via geometric memory bank, trained by distillation from ModernBERT-large.
## Key Results

| Metric | Train | Val |
|---|---|---|
| m_acc (top1) | 0.945 | 0.944 |
| m_acc (top5) | 1.000 | 1.000 |
| Pentachoron CV | 0.162 | 0.160 |
| Procrustes pre-alignment | cos 0.001 → 0.816 | – |
| Trainable params | 34.5M | – |
| Effective context | 576 tokens | – |
## Usage

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "AbstractPhil/geolip-clip-vit-large-patch14-ctx576",
    trust_remote_code=True,
)
model = model.to("cuda").eval()

# Single text -> 768-dim CLIP-compatible embedding
embedding = model.encode("A vast sweeping landscape of rolling green hills...")
print(embedding.shape)  # (768,)

# Batch
embeddings = model.encode([
    "A cat sleeping on a warm blanket",
    "A medieval castle overlooking a turbulent sea with waves crashing "
    "against ancient rocks and seabirds wheeling overhead against a sky "
    "painted in shades of grey and gold as the sun breaks through clouds",
])
print(embeddings.shape)  # (2, 768)

# HuggingFace forward() API
output = model(texts=["A photo of a dog"])
print(output.last_hidden_state.shape)  # (1, 1, 768)
```
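Because the outputs live in CLIP's 768-dim text space, downstream comparison is plain cosine similarity. A minimal sketch; random vectors stand in for `model.encode(...)` outputs so the snippet runs without downloading the model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for model.encode(...) outputs (real embeddings are 768-dim).
rng = np.random.default_rng(0)
query = rng.standard_normal(768)
doc = rng.standard_normal(768)

score = cosine_similarity(query, doc)
assert -1.0 <= score <= 1.0  # cosine similarity is always in [-1, 1]
```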
## Architecture

```
Long caption (576 tokens)
    │
    ├── ModernBERT-large ──── 4096 ctx, 1024-dim ──── teacher (training only)
    │       (frozen, 395M)
    │
    └── CLIP-L text enc ───── 18 tok × 32 segments ── student
            (frozen, 768-dim, 123M)

        + Geometric Memory System (34.5M trainable)
            ├── Depth compressor (6-layer profile → 768-dim anchor)
            ├── Memory bank (64 anchors, 2-layer cross-attention)
            ├── CLIP cross-attention (memory → CLIP hidden states)
            ├── GRU gate (memory state updates)
            ├── Layer fusion (learned weighted sum of 6 CLIP layers)
            └── Teacher projector (768 → 1024, Procrustes-initialized)
```
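The teacher projector's Procrustes initialization (768 → 1024 in the tree above) can be sketched with the standard SVD solution to the orthogonal Procrustes problem. The shapes follow the card; the data, sample count, and helper names are illustrative stand-ins, not the trained weights:

```python
import numpy as np

def procrustes_init(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Semi-orthogonal W minimizing ||X @ W - Y||_F (orthogonal Procrustes).

    X: (n, d_student) student embeddings; Y: (n, d_teacher) teacher targets.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U @ Vt  # (d_student, d_teacher) with orthonormal rows

def mean_cos(A: np.ndarray, B: np.ndarray) -> float:
    """Mean row-wise cosine similarity between two embedding matrices."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A * B).sum(axis=1).mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((2048, 768))       # stand-in CLIP-side embeddings
Y = X @ rng.standard_normal((768, 1024))   # stand-in ModernBERT-side targets
W = procrustes_init(X, Y)

# Pre-alignment jumps well above the ~0 cosine of an unfitted projection.
print(mean_cos(X @ W, Y))
```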
## How It Works
- Long text split into 18-token segments (stride 14, overlap 4)
- Each segment: frozen CLIP text encoder forward (standard, untouched)
- Memory tokens cross-attend to CLIP's multi-layer hidden states
- Depth-profile anchors stored in geometric memory bank
- GRU gate controls rolling memory state across segments
- Final output: 768-dim embedding in CLIP's text space
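The sliding-window step above (18-token windows, stride 14, 4-token overlap) can be sketched as follows. `segment_tokens` is an illustrative helper, not the released implementation, and boundary handling may differ:

```python
def segment_tokens(token_ids, seg_len=18, stride=14):
    """Split a token sequence into overlapping fixed-length windows.

    Consecutive windows share seg_len - stride = 4 tokens; a final
    right-aligned window covers any tail the strided loop misses.
    """
    segments = []
    for start in range(0, max(len(token_ids) - seg_len, 0) + 1, stride):
        segments.append(token_ids[start:start + seg_len])
    # Cover any remaining tail tokens with one last full window.
    if segments and (len(segments) - 1) * stride + seg_len < len(token_ids):
        segments.append(token_ids[-seg_len:])
    return segments

# A 96-token caption (the dataset mean) yields 7 segments.
segs = segment_tokens(list(range(96)))
print(len(segs))  # -> 7
```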
## Training
- Teacher: ModernBERT-large (frozen, 4096 context)
- Losses: InfoNCE(student→teacher) × 1.0 + SVD Procrustes regularizer × 0.3 + pentachoron CV × 0.05
- Data: CC12M with LLaVA-next detailed captions (50K train, 2K val, mean 96 tokens)
- Hardware: NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM)
- Training time: ~103 minutes (10 epochs Γ ~10.3 min/epoch)
- Batch size: 64
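The objective is a weighted sum of the three terms listed above (weights 1.0 / 0.3 / 0.05). A NumPy sketch of the InfoNCE term alone, with an illustrative temperature, random stand-in embeddings, and a hypothetical helper name:

```python
import numpy as np

def info_nce(student: np.ndarray, teacher: np.ndarray, tau: float = 0.07) -> float:
    """One-directional InfoNCE: each student row must match its own teacher
    row against all other teacher rows in the batch."""
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / tau                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())      # diagonal = positive pairs

rng = np.random.default_rng(0)
t = rng.standard_normal((64, 1024))                  # stand-in teacher batch
s_good = t + 0.01 * rng.standard_normal((64, 1024))  # near-aligned student
s_bad = rng.standard_normal((64, 1024))              # unaligned student

# Aligned students incur far lower loss than unaligned ones.
assert info_nce(s_good, t) < info_nce(s_bad, t)
```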
## Geometric Properties
- Procrustes init: CLIP→ModernBERT cos 0.001 → 0.816 (strongest alignment seen across all GEOLIP experiments)
- CV trajectory: 0.185 (E1) → 0.162 (E10), stable in the 0.16–0.19 band
- CV regularization on bank anchors prevented projector shortcut collapse
- 18-token segments forced genuine memory accumulation (vs 55-token segments which plateaued at 0.670)
## Why 18-Token Segments?
55-token segments → 2 segments per caption → pentachoron CV dead (needs 5+ anchors) → projector shortcut → plateau at m_acc = 0.670.
18-token segments → 7 segments per caption → CV active → memory bank forced to accumulate → m_acc = 0.945.
The geometry regularization must be applied at the bottleneck (the bank anchors). If it cannot activate, the model finds cheap projector shortcuts.
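The segment counts follow directly from window arithmetic at the dataset's mean caption length of 96 tokens. The 18-token stride (14) comes from the segmentation described above; the stride for the 55-token baseline is not stated in the card, so non-overlapping windows are assumed:

```python
import math

def num_segments(n_tokens: int, seg_len: int, stride: int) -> int:
    """Number of strided windows needed to cover n_tokens tokens."""
    if n_tokens <= seg_len:
        return 1
    return math.ceil((n_tokens - seg_len) / stride) + 1

# Mean caption length in the training data is 96 tokens.
print(num_segments(96, seg_len=18, stride=14))  # -> 7: enough anchors for
                                                #    the 5-point pentachoron CV
print(num_segments(96, seg_len=55, stride=55))  # -> 2: CV term cannot activate
```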
## Training Metrics
Full training metrics are available in `metrics.json`; TensorBoard logs are in `tensorboard/`.
The model has most likely not yet fully converged and would benefit from substantially more training data, but these results are a strong starting point.

### Training Curve
| Epoch | Train m_acc | Val m_acc | Val m_acc5 | CV |
|---|---|---|---|---|
| 1 | 0.628 | 0.835 | – | 0.185 |
| 2 | 0.832 | 0.533* | 0.987 | 0.162 |
| 3 | 0.851 | 0.868 | 0.995 | 0.189 |
| 4 | 0.878 | 0.870 | 0.999 | 0.168 |
| 5 | 0.876 | 0.866 | 0.998 | 0.177 |
| 6 | 0.888 | 0.889 | 0.999 | 0.170 |
| 7 | 0.917 | 0.924 | 1.000 | 0.166 |
| 8 | 0.932 | 0.935 | 1.000 | 0.160 |
| 9 | 0.941 | 0.938 | 1.000 | 0.161 |
| 10 | 0.945 | 0.944 | 1.000 | 0.162 |
*E2 val dip: cosine LR transient at epoch boundary, recovered by E3.
## GEOLIP Family
| System | Student | Teacher | m_acc | CV |
|---|---|---|---|---|
| GEOLIP-Bertenstein | BERT-large hub | DINOv2+Whisper+ESM2+CodeBERT | R@1=1.0 | 0.20 |
| GEOLIP-BERT-8192 | BERT-large (512 ctx) | ModernBERT+Longformer | 0.927 | 0.20 |
| GEOLIP-CLIP-ctx576 | CLIP-L (77 ctx) | ModernBERT-large | 0.945 | 0.162 |
All use the same blueprint: a frozen expert teacher, Procrustes initialization, an InfoNCE alignment loss, and pentachoron CV regularization on the bottleneck representation.
## License
Apache 2.0
## Base Model
openai/clip-vit-large-patch14