GEOLIP-CLIP-ViT-bigG/14 ctx576-seq77

This model endured a chain of generations and multiple failed attempts. The weights always endured, never once shattering: rescaled, rebalanced, rescaled again, rebalanced again, perturbed, partially corrupted, and restored.

I give you a name:

Meridian.

The geometric line that expands horizon into another dimension.

SDXL inference example code available in the repo.

colab_sdxl_inference_test.py


Memory-Extended OpenCLIP-bigG/14 text encoder with 77-position sequence output for SDXL.

Extends SDXL's second text encoder from 77 to 576 effective tokens. Outputs both:

  • Pooled: (1280,) β€” for SDXL's add_text_embeds
  • Sequence: (77, 1280) β€” for SDXL's UNet cross-attention (bigG half)

Results

Metric                      Value
Pooled m_acc (top1)         0.8436
Sequence cosine similarity  0.4253
Pentachoron CV              0.1654
Extraction layers           12/32 (37.5% coverage)
clip_skip support           -1 (layer 31), -2 (layer 30)
Effective context           576 tokens (32 Γ— 18)
Training                    10 epochs, fp32, everything trainable

Training Curve

Epoch  m_acc  s_cos  CV     Time
1      0.733  0.430  0.173  2473s
2      0.635  0.421  0.168  2493s
3      0.685  0.414  0.166  2493s
4      0.654  0.423  0.168  2483s
5      0.671  0.424  0.164  2487s
6      0.749  0.425  0.165  2485s
7      0.792  0.426  0.162  2494s
8      0.821  0.425  0.164  2483s
9      0.838  0.425  0.164  2488s
10     0.844  0.425  0.165  2491s

Usage

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "AbstractPhil/geolip-clip-vit-bigG-patch14-ctx576-seq77",
    trust_remote_code=True)
model.to("cuda").eval()

# Pooled output
emb = model.encode("A long detailed caption...")
print(emb.shape)  # (1280,)

# Sequence output for SDXL
seq = model.encode("A long caption...", return_sequence=True)
print(seq.shape)  # (77, 1280)

# Both for SDXL pipeline
pooled, sequence = model.encode_for_sdxl(["caption 1", "caption 2"])
print(pooled.shape)     # (2, 1280) β€” add_text_embeds
print(sequence.shape)   # (2, 77, 1280) β€” concatenate with CLIP-L for UNet

Architecture

Long caption (576 tokens)
    β”‚
    β”œβ”€β”€ Frozen OpenCLIP-bigG/14 (77 ctx, 32 layers, 1280-dim)
    β”‚   Processes 18-token segments sequentially
    β”‚
    β”œβ”€β”€ 12-layer extraction: (1, 3, 7, 9, 13, 15, 19, 21, 25, 27, 30, 31)
    β”‚   37.5% coverage, layers 30+31 locked for clip_skip
    β”‚
    β”œβ”€β”€ Geometric Memory System
    β”‚   β”œβ”€β”€ Depth compressor (12-layer profile β†’ 1280-dim anchor)
    β”‚   β”œβ”€β”€ Memory bank (64 anchors, 2-layer cross-attention)
    β”‚   β”œβ”€β”€ CLIP cross-attention (memory β†’ CLIP hidden states)
    β”‚   β”œβ”€β”€ GRU gate (rolling memory state)
    β”‚   β”œβ”€β”€ Layer fusion (learned weighted sum of 12 layers)
    β”‚   └── Inter-segment LayerNorm firebreak
    β”‚
    └── Sequence Reconstructor
        β”œβ”€β”€ 77 learned query tokens with positional encoding
        β”œβ”€β”€ Cross-attend to: memory_tokens + bank_anchors + content_tokens
        β”œβ”€β”€ Self-attend among 77 output positions
        └── Output: (B, 77, 1280) in bigG's native distribution

SDXL Integration

SDXL UNet expects cat(clip_l_seq, clip_g_seq) = (B, 77, 2048):

# CLIP-L memory β†’ (B, 77, 768)   from geolip-clip-vit-large-patch14-ctx576-seq77
# bigG memory  β†’ (B, 77, 1280)  from this model
# cat          β†’ (B, 77, 2048)  for UNet cross-attention
# bigG pooled  β†’ (B, 1280)      for add_text_embeds
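A shape-only sketch of that assembly (numpy placeholders stand in for the real encoder outputs; the variable names are illustrative, not pipeline API names):

```python
import numpy as np

B = 2  # batch of two captions

# Placeholders for the three encoder outputs described above.
clip_l_seq  = np.zeros((B, 77, 768))   # from the CLIP-L memory model
bigg_seq    = np.zeros((B, 77, 1280))  # sequence output of this model
bigg_pooled = np.zeros((B, 1280))      # pooled output of this model

# UNet cross-attention input: concatenate along the feature axis.
prompt_embeds   = np.concatenate([clip_l_seq, bigg_seq], axis=-1)
add_text_embeds = bigg_pooled

print(prompt_embeds.shape)    # (2, 77, 2048)
print(add_text_embeds.shape)  # (2, 1280)
```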

Training Notes

  • Numerical precision: 1280-dim Γ— 27 segments overflows fp16. Entire training runs in fp32. CV measurement at fp16 showed 0.62 (corrupted); fp32 converged to ~0.16 (correct). The pentachoron CV doubles as a numerical precision auditor.
  • Segment firebreak: Inter-segment LayerNorm prevents precision compounding.
  • SVD stability: Procrustes uses A @ B.T (NΓ—N) instead of A.T @ B (DΓ—D) when batch < dim.
  • 12-layer expansion: Trained from 6-layer bank checkpoint with zero-init new columns. Sequence head reinitialized fresh β€” pretrained seq heads from frozen banks get stuck.

Unsorted Thoughts and Followup Notes

Takes nearly an hour per epoch on the bigG trainer, so it's going to be a bit before it's ready. Didn't think this one would be such a problem.

Lost some sleep getting that one set up, but clip_l and clip_g both have memory banks now. G had the most problems and the most errors, and it will require some additional mechanisms to ensure the sequence works correctly as well. Surprisingly, though, reconstruction already approximates clip_g's layer sequence at roughly 43% similarity, and the model reaches over 84.4% ModernBERT accuracy with memory.

I believe the dimensional scaling problem can be solved with the correct tokenization differentiation, similar to geolip-Bertenstein. That would allow direct translation through a uniformly distributed representation rather than a conduit series.

Clip_G is a big one. The SVD-based scaling system uses a series of alignment paradigms, and those paradigms include padding. The padding covers only ModernBERT's 1024 dims while simultaneously consuming all 1280 dims of CLIP_G.

I cannot simply project ModernBERT upward; that slowly introduces corruption and incorrectness. I cannot simply crush clip_g down; that introduces compounding rounding errors and corruption down the scale, and does not produce the correct information.

Which means the geometric state of clip_g being captured is simply more complex, and the geometric structure loss is compensating for those 256 dimensions directly instead of asymmetrically, indirectly, through inductive learning. This compensation effect caused a cascade fault of incorrect accumulation, low-hanging triangulation, that built up within the sequential analysis toolkit. It bled into the bank as well, which required a series of tests to repair and 10 full epochs to get close to the clip_l version.

Clip_l (768 dims) is smaller than ModernBERT (1024 dims); clip_g (1280 dims) is larger.

Bertenstein showed the formula for correct multiscale access, but it was run on considerably fewer attractors. This sort of memory experiment is substantially more aligned with meshing than with a conduit, and yet I believe I can directly adapt a similar principle and it will work for direct complexity association.

This is a crossover with common machine-learning projection practices. We've run onto the same island and both came up with similar reasoning as to why. I have a potential solution, but it requires setup and planning. We all use the same machines and the same ships to find the islands; these systems map them similarly and use the information similarly, but we're essentially speaking different dialects of the same outcomes.

Large dimensions overwhelm smaller dimensions, and I believe I have the solution for this. The smaller dimension doesn't necessarily carry less information, but the processes of accumulation often treat it as though it does. More values give more credence to bias, and bias forms more readily through more values and more averaged rounding. Simple in the end, and this system runs along similar principles.

However, geometric structures use anchoring. This is why ModernBERT's projection estimations survive while direct clip_g sequence learning failed. That, and a lack of data: I couldn't simply jam all 32 layers in there, I had to cap it, but this wasn't the core reason. You can learn a single layer and predict that single layer with high accuracy using these anchoring systems in differential anchoring; David ran that on pure linear systems with minimal geometry and it works across many layers.

These systems are analytical-differentiation injunctive biases that are not defined by the law of averages; they are defined by the complexity of accumulation through multitier association. This is a much richer elemental process, and yet we still ran onto the same island without the correct safeguards. I believed the sequential system would correctly accumulate for the task just by accessing the formatted bank, but I was sorely mistaken, and I apologize for my incorrect assumption.

I will install these safeguards, and the sequential system will be more likely to align, but there are no guarantees.

GEOLIP Family

System             Encoder  Output                    s_cos
ctx576             CLIP-L   pooled (768,)             β€”
ctx576-seq77       CLIP-L   pooled + seq (77, 768)    0.734
ctx576-seq77 bigG  bigG     pooled + seq (77, 1280)   0.4253

License

Apache 2.0
