GEOLIP-CLIP-ViT-L/14 ctx576 — Memory-Extended CLIP Text Encoder

Extends the CLIP-ViT-L/14 text encoder from 77 tokens to a 576-token effective context via a geometric memory bank, trained by distillation from ModernBERT-large.

Key Results

| Metric | Train | Val |
|---|---|---|
| m_acc (top-1) | 0.945 | 0.944 |
| m_acc (top-5) | 1.000 | 1.000 |
| Pentachoron CV | 0.162 | 0.160 |
| Procrustes pre-alignment cos | 0.001 → 0.816 | — |
| Trainable params | 34.5M | — |
| Effective context | 576 tokens | — |

Usage

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "AbstractPhil/geolip-clip-vit-large-patch14-ctx576",
    trust_remote_code=True)
model = model.to("cuda").eval()

# Single text → 768-dim CLIP-compatible embedding
embedding = model.encode("A vast sweeping landscape of rolling green hills...")
print(embedding.shape)  # (768,)

# Batch
embeddings = model.encode([
    "A cat sleeping on a warm blanket",
    "A medieval castle overlooking a turbulent sea with waves crashing "
    "against ancient rocks and seabirds wheeling overhead against a sky "
    "painted in shades of grey and gold as the sun breaks through clouds",
])
print(embeddings.shape)  # (2, 768)

# Hugging Face forward() API
output = model(texts=["A photo of a dog"])
print(output.last_hidden_state.shape)  # (1, 1, 768)
```

Architecture

```
Long caption (576 tokens)
    │
    ├── ModernBERT-large ──── 4096 ctx, 1024-dim ──── teacher (training only)
    │   (frozen, 395M)
    │
    └── CLIP-L text enc ───── 18 tok × 32 segments ── student
        (frozen, 768-dim, 123M)
        + Geometric Memory System (34.5M trainable)
            ├── Depth compressor (6-layer profile → 768-dim anchor)
            ├── Memory bank (64 anchors, 2-layer cross-attention)
            ├── CLIP cross-attention (memory → CLIP hidden states)
            ├── GRU gate (memory state updates)
            ├── Layer fusion (learned weighted sum of 6 CLIP layers)
            └── Teacher projector (768 → 1024, Procrustes-initialized)
```

How It Works

  1. Long text split into 18-token segments (stride 14, overlap 4)
  2. Each segment: frozen CLIP text encoder forward (standard, untouched)
  3. Memory tokens cross-attend to CLIP's multi-layer hidden states
  4. Depth-profile anchors stored in geometric memory bank
  5. GRU gate controls rolling memory state across segments
  6. Final output: 768-dim embedding in CLIP's text space
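The rolling memory update in steps 3–5 can be sketched with dummy tensors. This is a minimal numpy sketch under stated assumptions: the attention, gate, and candidate weights below are random stand-ins for the trained modules, and the single GRU-style gate is a simplification of the actual memory system.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768           # CLIP hidden size
n_mem = 64        # memory bank anchors
n_seg = 7         # segments for a ~96-token caption (window 18, stride 14)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random stand-ins for learned parameters (illustrative only).
W_z = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)   # update gate weights
W_h = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)   # candidate weights

memory = np.zeros((n_mem, d))                 # rolling memory state
for _ in range(n_seg):
    # Steps 2-3: each segment yields CLIP hidden states; memory attends to them.
    seg_hidden = rng.standard_normal((18, d)) # stand-in for frozen CLIP outputs
    scores = memory @ seg_hidden.T / np.sqrt(d)
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = scores / scores.sum(axis=1, keepdims=True)
    read = attn @ seg_hidden                  # what memory reads from the segment

    # Step 5: GRU-style gate blends old memory with the new read.
    x = np.concatenate([memory, read], axis=1)  # (n_mem, 2d)
    z = sigmoid(x @ W_z)                        # update gate
    h = np.tanh(x @ W_h)                        # candidate state
    memory = z * memory + (1.0 - z) * h

print(memory.shape)  # (64, 768)
```

The gate lets each anchor decide, per segment, how much of its previous state to keep versus how much of the new segment to absorb — which is what allows information from early segments to survive to the end of a long caption.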

Training

  • Teacher: ModernBERT-large (frozen, 4096 context)
  • Losses: InfoNCE(studentβ†’teacher) Γ— 1.0 + SVD Procrustes regularizer Γ— 0.3 + pentachoron CV Γ— 0.05
  • Data: CC12M with LLaVA-next detailed captions (50K train, 2K val, mean 96 tokens)
  • Hardware: NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM)
  • Training time: ~103 minutes (10 epochs Γ— ~10.3 min/epoch)
  • Batch size: 64

Geometric Properties

  • Procrustes init: CLIPβ†’ModernBERT cos 0.001 β†’ 0.816 (strongest alignment seen across all GEOLIP experiments)
  • CV trajectory: 0.185 (E1) β†’ 0.162 (E10), stable in the 0.16-0.19 band
  • CV regularization on bank anchors prevented projector shortcut collapse
  • 18-token segments forced genuine memory accumulation (vs 55-token segments which plateaued at 0.670)

Why 18-Token Segments?

55-token segments → 2 segments per caption → pentachoron CV dead (needs 5+ anchors) → projector shortcut → plateau at m_acc = 0.670.

18-token segments → 7 segments per caption → CV active → memory bank forced to accumulate → m_acc = 0.945.

The geometry regularization must be applied at the bottleneck (bank anchors). If it can't activate, the model finds cheap projector shortcuts.
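The segment counts above follow directly from sliding-window arithmetic (window 18 and stride 14 are from the description; the 55-token variant's stride of 51 is an assumption that keeps the same 4-token overlap):

```python
import math

def num_segments(n_tokens, window, stride):
    """Number of sliding windows needed to cover n_tokens tokens."""
    if n_tokens <= window:
        return 1
    return math.ceil((n_tokens - window) / stride) + 1

# Mean caption length in the training data is 96 tokens.
print(num_segments(96, window=18, stride=14))  # 7 -> CV active (needs 5+ anchors)
print(num_segments(96, window=55, stride=51))  # 2 -> CV dead, projector shortcut
```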

Training Metrics

Full training metrics available in `metrics.json`. TensorBoard logs in `tensorboard/`.

Most likely the model has not fully converged and would benefit from substantially more training data. It's a good start though.

Training Curve

| Epoch | Train m_acc | Val m_acc | Val m_acc5 | CV |
|---|---|---|---|---|
| 1 | 0.628 | 0.835 | — | 0.185 |
| 2 | 0.832 | 0.533* | 0.987 | 0.162 |
| 3 | 0.851 | 0.868 | 0.995 | 0.189 |
| 4 | 0.878 | 0.870 | 0.999 | 0.168 |
| 5 | 0.876 | 0.866 | 0.998 | 0.177 |
| 6 | 0.888 | 0.889 | 0.999 | 0.170 |
| 7 | 0.917 | 0.924 | 1.000 | 0.166 |
| 8 | 0.932 | 0.935 | 1.000 | 0.160 |
| 9 | 0.941 | 0.938 | 1.000 | 0.161 |
| 10 | 0.945 | 0.944 | 1.000 | 0.162 |

*E2 val dip: cosine LR transient at epoch boundary, recovered by E3.

GEOLIP Family

| System | Student | Teacher | m_acc | CV |
|---|---|---|---|---|
| GEOLIP-Bertenstein | BERT-large hub | DINOv2+Whisper+ESM2+CodeBERT | R@1=1.0 | 0.20 |
| GEOLIP-BERT-8192 | BERT-large (512 ctx) | ModernBERT+Longformer | 0.927 | 0.20 |
| GEOLIP-CLIP-ctx576 | CLIP-L (77 ctx) | ModernBERT-large | 0.945 | 0.162 |

All use the same blueprint: frozen expert teacher + Procrustes initialization + InfoNCE alignment force + pentachoron CV on the bottleneck representation.
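The Procrustes initialization shared by the family can be sketched as follows: given paired student/teacher embeddings, the best semi-orthogonal map between the two spaces falls out of one SVD. A minimal numpy sketch with random stand-in embeddings, using the 768 → 1024 projector shapes from this card:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_s, d_t = 256, 768, 1024       # paired samples, student dim, teacher dim

X = rng.standard_normal((n, d_s))  # student embeddings (stand-ins)
Y = rng.standard_normal((n, d_t))  # teacher embeddings (stand-ins)

# Orthogonal Procrustes: W = argmin ||X @ W - Y||_F over semi-orthogonal W,
# solved by the SVD of the cross-covariance X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
W = U @ Vt                          # (768, 1024), orthonormal rows

# Semi-orthogonality: W @ W.T = I_768, so the init preserves student geometry
# while landing in the teacher's space -- a warm start for InfoNCE alignment.
print(np.allclose(W @ W.T, np.eye(d_s), atol=1e-8))  # True
```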

License

Apache 2.0
