GEOLIP-CLIP-ViT-L/14 ctx576-seq77

Memory-Extended CLIP text encoder with 77-position sequence output for diffusion cross-attention.

Extends CLIP-ViT-L/14 from 77 to 576 effective tokens. Outputs both:

  • Pooled: (768,) β€” backward compatible with v1
  • Sequence: (77, 768) β€” drop-in for SD/SDXL UNet cross-attention

Results

| Metric | Value |
|---|---|
| Pooled m_acc (top-1) | 0.957 |
| Sequence cosine similarity | 0.734 |
| Pentachoron CV | 0.164 |
| Trainable params | 53.5M (19M seq head + 34.5M memory) |
| Effective context | 576 tokens (32 × 18) |
| Training time | 78 minutes (30 min phase 1 + 48 min phase 2) |

Built on geolip-clip-vit-large-patch14-ctx576 (v1 pooled model, m_acc=0.945).
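The 576-token effective context comes from feeding the frozen 77-ctx CLIP encoder 32 segments of 18 tokens each. A minimal sketch of that segmentation (the function name, padding scheme, and segment handling are illustrative assumptions, not the model's actual API):

```python
# Hypothetical sketch: split a long caption's token ids into the 32 x 18
# segments implied by the 576-token effective context. Not the repo's code.
def segment_ids(token_ids, segment_len=18, max_segments=32, pad_id=0):
    """Split a flat list of token ids into fixed-length segments."""
    ids = token_ids[: segment_len * max_segments]  # cap at 32 * 18 = 576 tokens
    segments = [ids[i : i + segment_len] for i in range(0, len(ids), segment_len)]
    # pad the final partial segment so every segment has segment_len ids
    if segments and len(segments[-1]) < segment_len:
        segments[-1] = segments[-1] + [pad_id] * (segment_len - len(segments[-1]))
    return segments

segs = segment_ids(list(range(600)))
print(len(segs), len(segs[0]))  # 32 18
```

Each segment then passes through the frozen encoder sequentially, with the memory system carrying state across segments.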

SD15 Tests

Blend rate: 0 = original sequence, 1.0 = fully blended, 2.0 = amplified vector.

0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0
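One natural reading of this sweep is linear interpolation (and, past 1.0, extrapolation) between the original CLIP sequence and the reconstructed one. A hedged sketch under that assumption; see memory_clip_seq.py in the repo for the actual blending code:

```python
import torch

# Assumed blend semantics: rate 0 -> original sequence, 1.0 -> reconstructed
# sequence, >1.0 amplifies along the reconstruction direction.
def blend(orig_seq, recon_seq, rate):
    return orig_seq + rate * (recon_seq - orig_seq)

orig = torch.zeros(77, 768)   # stand-in for CLIP's own sequence
recon = torch.ones(77, 768)   # stand-in for the reconstructed sequence
for rate in (0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0):
    out = blend(orig, recon, rate)
    print(rate, out[0, 0].item())
```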

(Image: blend-rate sweep for the "still_life" prompt below.)

    "still_life": (
        "A meticulously arranged still life painting in the Dutch Golden Age "
        "style featuring a silver goblet overflowing with deep red wine next "
        "to a half peeled lemon with its rind spiraling downward and a cracked "
        "walnut revealing its inner flesh beside a porcelain plate holding "
        "slices of rare roast beef garnished with fresh rosemary sprigs and "
        "a small bouquet of wilting tulips in shades of pink and white all set "
        "against a dark moody background with dramatic chiaroscuro lighting "
        "that highlights the reflective surfaces and textures of each object "
        "while casting deep shadows that add depth and mystery to the composition "
        "with a single fly resting on the edge of the goblet and droplets of "
        "condensation catching the light on the silver surface"
    ),

(Image: blend-rate sweep for the "castle" prompt below.)

    "castle": (
        "A vast sweeping landscape of rolling green hills under dramatic "
        "storm clouds with a lone oak tree in the foreground its branches "
        "bent by wind casting long shadows across a field of wildflowers "
        "in purple yellow and white while in the distance a medieval stone "
        "castle sits atop a cliff overlooking a turbulent sea with waves "
        "crashing against ancient rocks and seabirds wheeling overhead "
        "against a sky painted in shades of grey and gold as the sun "
        "breaks through the clouds illuminating the castle towers"
    ),

As you'll see below, image quality does suffer for prompts shorter than about 13 tokens when the sequence head is used. I'll figure out a tweak later.

(Image: blend-rate sweep for the "short" prompt below.)

    "short": "A medieval castle on a cliff overlooking the sea at sunset",

Usage

A full blending example is included in the repo:

https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77/resolve/main/memory_clip_seq.py

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77",
    trust_remote_code=True)
model.to("cuda").eval()

# Pooled output (backward compatible)
emb = model.encode("A long detailed caption...")
print(emb.shape)  # (768,)

# Sequence output for diffusion
seq = model.encode("A long caption...", return_sequence=True)
print(seq.shape)  # (77, 768)

# Or via encode_for_diffusion
seq = model.encode_for_diffusion(["caption 1", "caption 2"])
print(seq.shape)  # (2, 77, 768)

# Full forward
out = model(texts=["A caption"], output_sequence=True)
out.last_hidden_state  # (1, 77, 768) β€” for UNet cross-attention
out.hidden_states[0]   # pooled (1, 768)
out.hidden_states[1]   # sequence (1, 77, 768)
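The (B, 77, 768) sequence is shaped to serve as keys and values for a UNet's cross-attention, with flattened image latents as queries. A minimal shape-level sketch (a generic `torch.nn.MultiheadAttention`, not the SD UNet's actual attention blocks):

```python
import torch

# Shape sketch only: the 77-position text sequence acts as keys/values,
# flattened spatial latents act as queries, as in SD cross-attention.
attn = torch.nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
latent_queries = torch.randn(1, 64 * 64, 768)  # flattened 64x64 spatial grid
text_seq = torch.randn(1, 77, 768)             # stand-in for the model's output
out, weights = attn(latent_queries, text_seq, text_seq)
print(out.shape)  # torch.Size([1, 4096, 768])
```

In an actual diffusers pipeline, the same sequence would typically be passed via the `prompt_embeds` argument rather than wired in by hand.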

Architecture

Long caption (576 tokens)
    β”‚
    β”œβ”€β”€ Frozen CLIP-L text encoder (77 ctx, causal, 12 layers)
    β”‚   Processes 18-token segments sequentially
    β”‚
    β”œβ”€β”€ Geometric Memory System (34.5M, from v1)
    β”‚   β”œβ”€β”€ Depth compressor (6-layer profile β†’ 768-dim anchor)
    β”‚   β”œβ”€β”€ Memory bank (64 anchors, 2-layer cross-attention)
    β”‚   β”œβ”€β”€ CLIP cross-attention (memory β†’ CLIP hidden states)
    β”‚   β”œβ”€β”€ GRU gate (rolling memory state)
    β”‚   └── Layer fusion (learned weighted sum of 6 CLIP layers)
    β”‚
    └── Sequence Reconstructor (19M, NEW)
        β”œβ”€β”€ 77 learned query tokens with positional encoding
        β”œβ”€β”€ Cross-attend to: memory_tokens + bank_anchors + content_tokens
        β”œβ”€β”€ Self-attend among 77 output positions
        └── Output: (B, 77, 768) in CLIP's native distribution
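The Sequence Reconstructor's contract can be sketched as a tiny module: learned query tokens cross-attend to the memory context, then self-attend among the 77 output positions. Dimensions are from the card; layer counts, norms, and names here are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Illustrative sketch of the Sequence Reconstructor's shape contract.
# Real model: 19M params; this toy version only mirrors the dataflow.
class SequenceReconstructorSketch(nn.Module):
    def __init__(self, d=768, n_queries=77, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d) * 0.02)  # learned positions
        self.cross = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, context):  # context: memory tokens + bank anchors + content tokens
        b = context.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q = self.norm1(q + self.cross(q, context, context)[0])  # cross-attend to context
        q = self.norm2(q + self.self_attn(q, q, q)[0])          # self-attend among 77 slots
        return q                                                # (B, 77, 768)

ctx = torch.randn(2, 64 + 32, 768)  # e.g. 64 bank anchors + 32 content tokens
out = SequenceReconstructorSketch()(ctx)
print(out.shape)  # torch.Size([2, 77, 768])
```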

Training

Two-phase training on 50K CC12M LLaVA-next captions:

Phase 1 β€” Sequence head only (5 epochs, 30 min): v1 memory system frozen. Only the SequenceReconstructor trains.

| Epoch | m_acc | s_cos | CV |
|---|---|---|---|
| 1 | 0.944 | 0.582 | 0.162 |
| 3 | 0.946 | 0.681 | 0.163 |
| 5 | 0.948 | 0.712 | 0.162 |

Phase 2 β€” Joint fine-tune (5 epochs, 48 min): Everything unfrozen. v1 components at reduced LR.

| Epoch | m_acc | s_cos | CV |
|---|---|---|---|
| 1 | 0.939 | 0.700 | 0.165 |
| 3 | 0.948 | 0.715 | 0.164 |
| 5 | 0.957 | 0.734 | 0.164 |

Losses:

  • InfoNCE: student pooled ↔ ModernBERT-large pooled (alignment force)
  • Procrustes SVD: geometric regularizer on pooled output
  • Pentachoron CV: bottleneck regularizer on bank anchors
  • Sequence MSE + cosine: reconstructed 77 ↔ CLIP's own last_hidden_state
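The sequence term (MSE plus a cosine penalty against CLIP's own last_hidden_state) can be sketched as follows. The weights are illustrative, and the InfoNCE, Procrustes, and pentachoron terms are omitted here for brevity:

```python
import torch
import torch.nn.functional as F

# Sketch of the sequence reconstruction loss: MSE plus (1 - cosine
# similarity) per position, averaged. Weights are assumed, not from the card.
def sequence_loss(recon, target, w_mse=1.0, w_cos=1.0):
    mse = F.mse_loss(recon, target)
    cos = 1.0 - F.cosine_similarity(recon, target, dim=-1).mean()
    return w_mse * mse + w_cos * cos

recon = torch.randn(2, 77, 768)
loss = sequence_loss(recon, recon)
print(loss.item())  # ~0 for identical inputs
```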

Key design decision: Sequence target is CLIP's own last_hidden_state, NOT ModernBERT's. The UNet was trained on CLIP's sequence distribution. The reconstructor learns to produce sequences in that distribution, enriched with the full 576-token context from the memory bank.

Hardware

NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB VRAM)

GEOLIP Family

| System | Output | m_acc | s_cos | CV |
|---|---|---|---|---|
| ctx576 | pooled (768,) | 0.945 | — | 0.162 |
| ctx576-seq77 | pooled + seq (77, 768) | 0.957 | 0.734 | 0.164 |

License

Apache 2.0

