GEOLIP-CLIP-ViT-L/14 ctx576 — Memory-Extended CLIP Text Encoder

Extends the CLIP-ViT-L/14 text encoder from 77 tokens to a 576-token effective context via a geometric memory bank, trained by distillation from ModernBERT-large.

Key Results

| Metric | Train | Val |
|---|---|---|
| m_acc (top-1) | 0.945 | 0.944 |
| m_acc (top-5) | 1.000 | 1.000 |
| Pentachoron CV | 0.162 | 0.160 |
| Procrustes pre-alignment cos | 0.001 → 0.816 | — |
| Trainable params | 34.5M | — |
| Effective context | 576 tokens | — |

Usage

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "AbstractPhil/geolip-clip-vit-large-patch14-ctx576",
    trust_remote_code=True)
model = model.to("cuda").eval()

# Single text → 768-dim CLIP-compatible embedding
embedding = model.encode("A vast sweeping landscape of rolling green hills...")
print(embedding.shape)  # (768,)

# Batch
embeddings = model.encode([
    "A cat sleeping on a warm blanket",
    "A medieval castle overlooking a turbulent sea with waves crashing "
    "against ancient rocks and seabirds wheeling overhead against a sky "
    "painted in shades of grey and gold as the sun breaks through clouds",
])
print(embeddings.shape)  # (2, 768)

# Hugging Face forward() API
output = model(texts=["A photo of a dog"])
print(output.last_hidden_state.shape)  # (1, 1, 768)
```

Architecture

```
Long caption (576 tokens)
    │
    ├── ModernBERT-large ──── 4096 ctx, 1024-dim ──── teacher (training only)
    │   (frozen, 395M)
    │
    └── CLIP-L text enc ───── 18 tok × 32 segments ── student
        (frozen, 768-dim, 123M)
        + Geometric Memory System (34.5M trainable)
            ├── Depth compressor (6-layer profile → 768-dim anchor)
            ├── Memory bank (64 anchors, 2-layer cross-attention)
            ├── CLIP cross-attention (memory → CLIP hidden states)
            ├── GRU gate (memory state updates)
            ├── Layer fusion (learned weighted sum of 6 CLIP layers)
            └── Teacher projector (768 → 1024, Procrustes-initialized)
```

How It Works

  1. Long text split into 18-token segments (stride 14, overlap 4)
  2. Each segment: frozen CLIP text encoder forward (standard, untouched)
  3. Memory tokens cross-attend to CLIP's multi-layer hidden states
  4. Depth-profile anchors stored in geometric memory bank
  5. GRU gate controls rolling memory state across segments
  6. Final output: 768-dim embedding in CLIP's text space
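The rolling memory update in steps 3–5 can be sketched with dummy tensors. This is a minimal numpy sketch under stated assumptions: the attention, gate, and candidate weights below are random stand-ins for the trained modules, and the single GRU-style gate is a simplification of the actual memory system.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768           # CLIP hidden size
n_mem = 64        # memory bank anchors
n_seg = 7         # segments for a ~96-token caption (window 18, stride 14)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random stand-ins for learned parameters (illustrative only).
W_z = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)   # update gate weights
W_h = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)   # candidate weights

memory = np.zeros((n_mem, d))                 # rolling memory state
for _ in range(n_seg):
    # Steps 2-3: each segment yields CLIP hidden states; memory attends to them.
    seg_hidden = rng.standard_normal((18, d)) # stand-in for frozen CLIP outputs
    scores = memory @ seg_hidden.T / np.sqrt(d)
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = scores / scores.sum(axis=1, keepdims=True)
    read = attn @ seg_hidden                  # what memory reads from the segment

    # Step 5: GRU-style gate blends old memory with the new read.
    x = np.concatenate([memory, read], axis=1)  # (n_mem, 2d)
    z = sigmoid(x @ W_z)                        # update gate
    h = np.tanh(x @ W_h)                        # candidate state
    memory = z * memory + (1.0 - z) * h

print(memory.shape)  # (64, 768)
```

The gate lets each anchor decide, per segment, how much of its previous state to keep versus how much of the new segment to absorb — which is what allows information from early segments to survive to the end of a long caption.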

Training

  • Teacher: ModernBERT-large (frozen, 4096 context)
  • Losses: InfoNCE(studentβ†’teacher) Γ— 1.0 + SVD Procrustes regularizer Γ— 0.3 + pentachoron CV Γ— 0.05
  • Data: CC12M with LLaVA-next detailed captions (50K train, 2K val, mean 96 tokens)
  • Hardware: NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM)
  • Training time: ~103 minutes (10 epochs Γ— ~10.3 min/epoch)
  • Batch size: 64

Geometric Properties

  • Procrustes init: CLIPβ†’ModernBERT cos 0.001 β†’ 0.816 (strongest alignment seen across all GEOLIP experiments)
  • CV trajectory: 0.185 (E1) β†’ 0.162 (E10), stable in the 0.16-0.19 band
  • CV regularization on bank anchors prevented projector shortcut collapse
  • 18-token segments forced genuine memory accumulation (vs 55-token segments which plateaued at 0.670)

Why 18-Token Segments?

55-token segments → 2 segments per caption → pentachoron CV dead (needs 5+ anchors) → projector shortcut → plateau at m_acc = 0.670.

18-token segments → 7 segments per caption → CV active → memory bank forced to accumulate → m_acc = 0.945.

The geometry regularization must be applied at the bottleneck (bank anchors). If it can't activate, the model finds cheap projector shortcuts.
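The segment counts above follow directly from sliding-window arithmetic (window 18 and stride 14 are from the description; the 55-token variant's stride of 51 is an assumption that keeps the same 4-token overlap):

```python
import math

def num_segments(n_tokens, window, stride):
    """Number of sliding windows needed to cover n_tokens tokens."""
    if n_tokens <= window:
        return 1
    return math.ceil((n_tokens - window) / stride) + 1

# Mean caption length in the training data is 96 tokens.
print(num_segments(96, window=18, stride=14))  # 7 -> CV active (needs 5+ anchors)
print(num_segments(96, window=55, stride=51))  # 2 -> CV dead, projector shortcut
```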

Training Metrics

Full training metrics available in `metrics.json`. TensorBoard logs in `tensorboard/`.

Most likely the model has not fully converged and would benefit from substantially more training data. It's a good start though.

Training Curve

| Epoch | Train m_acc | Val m_acc | Val m_acc5 | CV |
|---|---|---|---|---|
| 1 | 0.628 | 0.835 | — | 0.185 |
| 2 | 0.832 | 0.533* | 0.987 | 0.162 |
| 3 | 0.851 | 0.868 | 0.995 | 0.189 |
| 4 | 0.878 | 0.870 | 0.999 | 0.168 |
| 5 | 0.876 | 0.866 | 0.998 | 0.177 |
| 6 | 0.888 | 0.889 | 0.999 | 0.170 |
| 7 | 0.917 | 0.924 | 1.000 | 0.166 |
| 8 | 0.932 | 0.935 | 1.000 | 0.160 |
| 9 | 0.941 | 0.938 | 1.000 | 0.161 |
| 10 | 0.945 | 0.944 | 1.000 | 0.162 |

*E2 val dip: cosine LR transient at epoch boundary, recovered by E3.

GEOLIP Family

| System | Student | Teacher | m_acc | CV |
|---|---|---|---|---|
| GEOLIP-Bertenstein | BERT-large hub | DINOv2+Whisper+ESM2+CodeBERT | R@1=1.0 | 0.20 |
| GEOLIP-BERT-8192 | BERT-large (512 ctx) | ModernBERT+Longformer | 0.927 | 0.20 |
| GEOLIP-CLIP-ctx576 | CLIP-L (77 ctx) | ModernBERT-large | 0.945 | 0.162 |

All use the same blueprint: frozen expert teacher + Procrustes initialization + InfoNCE alignment force + pentachoron CV on the bottleneck representation.
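The Procrustes initialization shared by the family can be sketched as follows: given paired student/teacher embeddings, the best semi-orthogonal map between the two spaces falls out of one SVD. A minimal numpy sketch with random stand-in embeddings, using the 768 → 1024 projector shapes from this card:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_s, d_t = 256, 768, 1024       # paired samples, student dim, teacher dim

X = rng.standard_normal((n, d_s))  # student embeddings (stand-ins)
Y = rng.standard_normal((n, d_t))  # teacher embeddings (stand-ins)

# Orthogonal Procrustes: W = argmin ||X @ W - Y||_F over semi-orthogonal W,
# solved by the SVD of the cross-covariance X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
W = U @ Vt                          # (768, 1024), orthonormal rows

# Semi-orthogonality: W @ W.T = I_768, so the init preserves student geometry
# while landing in the teacher's space -- a warm start for InfoNCE alignment.
print(np.allclose(W @ W.T, np.eye(d_s), atol=1e-8))  # True
```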

License

Apache 2.0
