# GEOLIP-CLIP-ViT-L/14 ctx576-seq77
Memory-Extended CLIP text encoder with 77-position sequence output for diffusion cross-attention.
Extends CLIP-ViT-L/14 from 77 to 576 effective tokens. Outputs both:

- Pooled `(768,)`: backward compatible with v1
- Sequence `(77, 768)`: drop-in for SD/SDXL UNet cross-attention
## Results
| Metric | Value |
|---|---|
| Pooled m_acc (top1) | 0.957 |
| Sequence cosine similarity | 0.734 |
| Pentachoron CV | 0.164 |
| Trainable params | 53.5M (19M seq head + 34.5M memory) |
| Effective context | 576 tokens (32 × 18) |
| Training time | 78 minutes (30 phase1 + 48 phase2) |
Built on geolip-clip-vit-large-patch14-ctx576 (v1 pooled model, m_acc=0.945).
## SD15 Tests

Blend rate: 0 = original sequence, 1.0 = fully blended, 2.0 = amplified vector.
Rates tested: 0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0
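The blend rate reads as linear interpolation between the original CLIP sequence and the memory-enriched one (and, above 1.0, extrapolation past it). A minimal sketch of that arithmetic; the `blend` helper is hypothetical, not part of the released code:

```python
import numpy as np

def blend(clip_seq: np.ndarray, geolip_seq: np.ndarray, rate: float) -> np.ndarray:
    """Lerp from the original CLIP sequence toward the GEOLIP sequence.
    rate=0 -> original, rate=1 -> fully blended, rate>1 -> amplified."""
    return clip_seq + rate * (geolip_seq - clip_seq)

# Toy (77, 768) sequences just to show the endpoints
clip_seq = np.zeros((77, 768), dtype=np.float32)
geolip_seq = np.ones((77, 768), dtype=np.float32)

print(float(blend(clip_seq, geolip_seq, 0.0)[0, 0]))  # 0.0  original
print(float(blend(clip_seq, geolip_seq, 1.0)[0, 0]))  # 1.0  fully blended
print(float(blend(clip_seq, geolip_seq, 2.0)[0, 0]))  # 2.0  amplified
```

At rate 2.0 the result lies past the GEOLIP sequence along the same direction, which is what "amplified vector" refers to above.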
"still_life": (
"A meticulously arranged still life painting in the Dutch Golden Age "
"style featuring a silver goblet overflowing with deep red wine next "
"to a half peeled lemon with its rind spiraling downward and a cracked "
"walnut revealing its inner flesh beside a porcelain plate holding "
"slices of rare roast beef garnished with fresh rosemary sprigs and "
"a small bouquet of wilting tulips in shades of pink and white all set "
"against a dark moody background with dramatic chiaroscuro lighting "
"that highlights the reflective surfaces and textures of each object "
"while casting deep shadows that add depth and mystery to the composition "
"with a single fly resting on the edge of the goblet and droplets of "
"condensation catching the light on the silver surface"
),
"castle": (
"A vast sweeping landscape of rolling green hills under dramatic "
"storm clouds with a lone oak tree in the foreground its branches "
"bent by wind casting long shadows across a field of wildflowers "
"in purple yellow and white while in the distance a medieval stone "
"castle sits atop a cliff overlooking a turbulent sea with waves "
"crashing against ancient rocks and seabirds wheeling overhead "
"against a sky painted in shades of grey and gold as the sun "
"breaks through the clouds illuminating the castle towers"
),
As you'll see below, the images do in fact suffer in sequences below 13 sequence if the system is accessed. I'll figure out a tweak later.
"short": "A medieval castle on a cliff overlooking the sea at sunset",
## Usage

A full blending example is included in the repository code:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77",
    trust_remote_code=True)
model.to("cuda").eval()

# Pooled output (backward compatible)
emb = model.encode("A long detailed caption...")
print(emb.shape)  # (768,)

# Sequence output for diffusion
seq = model.encode("A long caption...", return_sequence=True)
print(seq.shape)  # (77, 768)

# Or via encode_for_diffusion
seq = model.encode_for_diffusion(["caption 1", "caption 2"])
print(seq.shape)  # (2, 77, 768)

# Full forward
out = model(texts=["A caption"], output_sequence=True)
out.last_hidden_state   # (1, 77, 768) -> for UNet cross-attention
out.hidden_states[0]    # pooled (1, 768)
out.hidden_states[1]    # sequence (1, 77, 768)
```
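The `(B, 77, 768)` sequence is shaped to slot into SD1.5's cross-attention as `encoder_hidden_states`. A minimal shape sketch of that consumption; the projection weights are random placeholders and 320 is the channel width of SD1.5's first UNet attention block:

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_img, n_txt, d_txt, d = 1, 64, 77, 768, 320

latent = rng.normal(size=(B, n_img, d))             # UNet spatial tokens
seq = rng.normal(size=(B, n_txt, d_txt))            # GEOLIP (B, 77, 768) output

W_q = rng.normal(size=(d, d)) / np.sqrt(d)          # to_q placeholder
W_k = rng.normal(size=(d_txt, d)) / np.sqrt(d_txt)  # to_k: 768 -> 320
W_v = rng.normal(size=(d_txt, d)) / np.sqrt(d_txt)  # to_v: 768 -> 320

q, k, v = latent @ W_q, seq @ W_k, seq @ W_v
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v                                      # image tokens attend to text
print(out.shape)  # (1, 64, 320)
```

Because only the key/value projections touch the text side, any `(B, 77, 768)` sequence in CLIP's distribution is a drop-in replacement for the stock encoder output.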
## Architecture

```
Long caption (576 tokens)
        │
        ├── Frozen CLIP-L text encoder (77 ctx, causal, 12 layers)
        │     Processes 18-token segments sequentially
        │
        ├── Geometric Memory System (34.5M, from v1)
        │     ├── Depth compressor (6-layer profile → 768-dim anchor)
        │     ├── Memory bank (64 anchors, 2-layer cross-attention)
        │     ├── CLIP cross-attention (memory → CLIP hidden states)
        │     ├── GRU gate (rolling memory state)
        │     └── Layer fusion (learned weighted sum of 6 CLIP layers)
        │
        └── Sequence Reconstructor (19M, NEW)
              ├── 77 learned query tokens with positional encoding
              ├── Cross-attend to: memory_tokens + bank_anchors + content_tokens
              ├── Self-attend among 77 output positions
              └── Output: (B, 77, 768) in CLIP's native distribution
```
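The 576-token effective context comes from the frozen 77-ctx encoder consuming fixed 18-token segments sequentially (32 × 18 = 576). A minimal sketch of that chunking; `segment_ids` and the pad id are illustrative, not the released implementation:

```python
def segment_ids(token_ids, seg_len=18, n_segs=32):
    """Split a long caption (up to seg_len * n_segs = 576 tokens) into
    fixed-size segments for sequential encoding; pad the final segment.
    Pad id 0 is an assumption for illustration."""
    pad = 0
    ids = list(token_ids)[: seg_len * n_segs]   # truncate past 576
    ids += [pad] * (-len(ids) % seg_len)        # pad last segment to 18
    return [ids[i:i + seg_len] for i in range(0, len(ids), seg_len)]

segs = segment_ids(range(100))
print(len(segs), len(segs[0]))  # 6 18
```

Each segment passes through the frozen CLIP encoder in turn while the GRU gate carries a rolling memory state across segments, which is how the model sees all 576 tokens despite the 77-position window.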
## Training
Two-phase training on 50K CC12M LLaVA-next captions:
**Phase 1** – Sequence head only (5 epochs, 30 min): v1 memory system frozen. Only the SequenceReconstructor trains.
| Epoch | m_acc | s_cos | CV |
|---|---|---|---|
| 1 | 0.944 | 0.582 | 0.162 |
| 3 | 0.946 | 0.681 | 0.163 |
| 5 | 0.948 | 0.712 | 0.162 |
**Phase 2** – Joint fine-tune (5 epochs, 48 min): Everything unfrozen. v1 components at reduced LR.
| Epoch | m_acc | s_cos | CV |
|---|---|---|---|
| 1 | 0.939 | 0.700 | 0.165 |
| 3 | 0.948 | 0.715 | 0.164 |
| 5 | 0.957 | 0.734 | 0.164 |
Losses:
- InfoNCE: student pooled ↔ ModernBERT-large pooled (alignment force)
- Procrustes SVD: geometric regularizer on pooled output
- Pentachoron CV: bottleneck regularizer on bank anchors
- Sequence MSE + cosine: reconstructed 77 → CLIP's own last_hidden_state
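The sequence terms can be sketched as per-position MSE plus cosine distance against CLIP's own last_hidden_state. A minimal illustration; the function name and the absence of loss weights are simplifications, not the released training code:

```python
import numpy as np

def seq_losses(pred: np.ndarray, target: np.ndarray):
    """MSE + per-position cosine distance between the reconstructed
    (77, 768) sequence and CLIP's own last_hidden_state."""
    mse = np.mean((pred - target) ** 2)
    num = np.sum(pred * target, axis=-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1) + 1e-8
    cos = np.mean(1.0 - num / den)   # 0 when perfectly aligned
    return float(mse), float(cos)

target = np.random.default_rng(0).normal(size=(77, 768))
mse, cos = seq_losses(target, target)   # identical -> both ~0
print(round(mse, 6), round(cos, 6))
```

MSE matches magnitudes while the cosine term matches directions per position, so together they push the reconstruction into CLIP's native distribution rather than merely near it in L2.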
Key design decision: Sequence target is CLIP's own last_hidden_state, NOT ModernBERT's.
The UNet was trained on CLIP's sequence distribution. The reconstructor learns to produce
sequences in that distribution, enriched with the full 576-token context from the memory bank.
## Hardware
NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB VRAM)
## GEOLIP Family
| System | Output | m_acc | s_cos | CV |
|---|---|---|---|---|
| ctx576 | pooled (768,) | 0.945 | n/a | 0.162 |
| ctx576-seq77 | pooled + seq (77, 768) | 0.957 | 0.734 | 0.164 |
## License
Apache 2.0
Base model: openai/clip-vit-large-patch14

