Battle experiment notebook uploaded

This wasn't meant to be seen by others, and it's messy, but there you go. I can't keep rewriting the same explanations over and over to justify the empirical data here.

It's a mess, but it's there. https://huggingface.co/AbstractPhil/geolip-vit-base-x3/blob/main/hypersphere_manifold_experimentation.ipynb

I'll update subsequent versions as I go.

Almost within the dimensional spectrum of aligned label counts


The 512-anchor system seems perfectly suited for a combined text + vision shared space, but the concern is that baking the two together means they can never be separated.

Even now, the label system is essentially baked into the soup as a sinking signal to create output, but the output is not yet the geometric structure that needs to exist to apply an nth anchored patchwork to a representational space as needed.

The test results show that the 80 BCE labels pretty much cover what they need in the embedding space, and the rest is incidental. They may or may not snap to it.

I forced anchor hops through dropout; otherwise the system would just default to the CLS anchor, because the downstream funnels more conveniently that way. Path of least resistance.
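The anchor-hop forcing can be sketched roughly like this. It is a minimal illustration, not the repo's actual API; the function name and signature are mine. Randomly hiding a subset of anchors during training removes the path of least resistance onto a single CLS-like anchor.

```python
import torch

def route_with_anchor_dropout(emb: torch.Tensor, anchors: torch.Tensor,
                              p: float = 0.3, training: bool = True) -> torch.Tensor:
    """Pick the nearest anchor per embedding, randomly hiding anchors in training.

    emb:     (B, D) L2-normalized embeddings.
    anchors: (K, D) L2-normalized anchor directions.
    """
    sims = emb @ anchors.T                                 # (B, K) cosine similarities
    if training and p > 0:
        # Drop a random subset of anchors so routing cannot collapse
        # onto a single convenient anchor.
        keep = torch.rand(anchors.size(0), device=emb.device) > p
        keep[torch.randint(anchors.size(0), (1,))] = True  # guarantee one survivor
        sims = sims.masked_fill(~keep, float("-inf"))
    return sims.argmax(dim=-1)                             # (B,) anchor indices
```

At evaluation time (`training=False`) this reduces to a plain nearest-anchor argmax.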

Soup imperfect

Even with high-fidelity structural integrity, the structure still finds a way to learn a shape different from the intended outcome.

I'll be debugging for the coming hours until I get the exact topology I need.


Invariably, no matter how many safeguards I install, the objective somehow finds a way to skip steps.

Without the full anchored topology in the trained and prepared orthogonality, with the correct association, the correct differentiation, the correct offset, the correct Procrustes alignment, and the correct... y'know what, it needs about a billion needles lined up.

This might have to be a multi-pass train no matter how I size it up.

Heavy Soup

The 2048-anchor soup is going to take a few hours to cook, but it'll be ready.


BERT just kinda works like this, but the ViT structure doesn't. The patches require training.


V3 experiments incoming

This is a highly experimental repo, expect rapid changes.

I've begun training a much thicker soup with many more anchors and the advanced constellation. That should help keep cohesion much cleaner.

Upgrades V3

Anchor dropout, which helps discourage anchor dependency.


Residual access to hypersphere params in the transformer path. Potentially produces more accurate downstream representations at no parameter cost.

Fused and expanded anchor constellation to support a wider formula for experimentation on geolip-vit-tiny 256.

GeoLIP ViT Base x3

Geometric vision system: 3-expert consensus soup + from-scratch ViT encoder.

Components

1. Base Tier Soup (teacher)

800K parameter geometric fusion of 3 pretrained vision experts on a 128-d hypersphere.

| Expert | Architecture | Training | Dim |
|---|---|---|---|
| clip_l14_openai | ViT-L/14 | Text-supervised (CLIP) | 768 |
| dinov2_b14 | ViT-B/14 | Self-supervised (DINO) | 768 |
| siglip_b16_384 | ViT-B/16 | Sigmoid contrastive (SigLIP) | 768 |

Pipeline: GPA alignment at 768-d → PCA to 128-d → per-expert whitened Procrustes calibration → Procrustes-initialized projectors → geometric autograd training.
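The whitened Procrustes step in that pipeline can be sketched in a few lines of NumPy. This is a generic illustration under the assumption that each expert's features are whitened and then rotated onto a shared frame; the function names are mine, not the repo's.

```python
import numpy as np

def whiten(X: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """ZCA-whiten features: zero mean, (near-)identity covariance."""
    Xc = X - X.mean(0)
    cov = Xc.T @ Xc / len(Xc)
    vals, vecs = np.linalg.eigh(cov)
    # vecs / sqrt(vals) scales each eigenvector column -> W = V diag(1/sqrt(vals)) V^T
    return Xc @ (vecs / np.sqrt(vals + eps)) @ vecs.T

def procrustes(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Orthogonal R minimizing ||A @ R - B||_F (classic Procrustes solution)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt
```

Each expert's whitened features would be rotated by its own `procrustes` solution onto the shared consensus frame before projection.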

| Metric | Value |
|---|---|
| mAP (COCO) | 0.837 |
| Parameters | 799,952 |
| Anchors | 256 × 128-d |
| Consensus CV (768-d) | 0.2793 |
| Consensus CV (128-d) | 0.2731 |
| Optimizer | Adam, no weight decay |

2. From-Scratch ViT Encoder (student)

11M parameter ViT trained from Xavier initialization against the soup's consensus targets. No pretrained weights anywhere. Same architecture pattern as CaptionBERT.

| Config | Value |
|---|---|
| Layers | 6 |
| Hidden dim | 384 |
| Heads | 6 |
| FFN dim | 1536 |
| Patch size | 16 |
| Image size | 224 |
| Output dim | 128 (on hypersphere) |
| Parameters | 11,216,768 |

Training: Raw COCO images → encoder → 128-d embedding → frozen soup pipeline (constellation + patchwork + classifier) → BCE loss. Additional losses: InfoNCE + MSE against consensus targets, whitened Procrustes alignment, pentachoron CV (calibrated to measured consensus).
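A stripped-down version of the InfoNCE + MSE portion of that loss stack might look like the following (the CV, BCE, and Procrustes terms are omitted; names and the temperature value are illustrative, not from the repo):

```python
import torch
import torch.nn.functional as F

def distill_losses(student: torch.Tensor, consensus: torch.Tensor, temp: float = 0.07):
    """InfoNCE + MSE against frozen consensus targets (a sketch).

    student, consensus: (B, 128), projected onto the unit hypersphere here.
    """
    s = F.normalize(student, dim=-1)
    t = F.normalize(consensus, dim=-1)
    logits = s @ t.T / temp                        # (B, B) similarity matrix
    labels = torch.arange(len(s), device=s.device)
    nce = F.cross_entropy(logits, labels)          # each embedding must match its own target
    mse = F.mse_loss(s, t)                         # direct pull toward the target
    return nce, mse
```

When the student exactly reproduces its consensus target, the MSE term vanishes and the InfoNCE term approaches zero.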

Results (20 epochs, still converging)

| Metric | E1 | E10 | E20 |
|---|---|---|---|
| nce_acc | 0.340 | 0.887 | 0.972 |
| cos→consensus | 0.325 | 0.557 | 0.599 |
| R@1 (5K) | 0.032 | 0.254 | 0.323 |
| mAP | 0.151 | 0.380 | 0.429 |
| F1 | 0.162 | 0.361 | 0.418 |
| Active anchors | 95 | 96 | 94 |

All metrics are still climbing at E20. The model needs 60-90 epochs to fully converge (matching CaptionBERT's text encoder trajectory).

Architecture

Training (soup as teacher):
  3 expert features → Procrustes projectors → mean → L2-norm → 128-d consensus targets
  Raw images → from-scratch ViT → 128-d embedding
  Losses: InfoNCE + MSE + CV + BCE (through frozen soup) + Procrustes alignment
  Geometric autograd: tangential=0.01, separation=1.0

Inference (standalone):
  Raw image → ViT encoder → 128-d embedding (on hypersphere)
  No experts needed. Geometry is baked in.
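The consensus-target step of the training path ("3 expert features → Procrustes projectors → mean → L2-norm") can be sketched as follows, assuming plain linear projectors, which is a simplification of the actual soup:

```python
import torch
import torch.nn.functional as F

def consensus_targets(expert_feats, projectors):
    """Mean of the projected expert views, renormalized onto the hypersphere.

    expert_feats: list of (B, 768) frozen-expert features.
    projectors:   list of (768, 128) Procrustes-initialized projection matrices.
    """
    views = [F.normalize(f @ P, dim=-1) for f, P in zip(expert_feats, projectors)]
    return F.normalize(torch.stack(views).mean(0), dim=-1)   # (B, 128) consensus targets
```

If the three experts agreed perfectly, the consensus target would coincide with any single projected view; disagreement pulls the target toward their spherical average.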

Key Findings

  • 800K soup params beat 81.7M (34-expert soup at 0.732 mAP) and 75.6M (34-expert bank at 0.782 mAP)
  • Proper calibration (GPA + whitened Procrustes + measured CV target) is essential β€” without it, constellation collapses to 1/256 active anchors
  • From-scratch ViT learns the 3-expert consensus representation from raw pixels with the same convergence dynamics as CaptionBERT on text
  • Cross-model weight cosine is 0.000 but activation Procrustes is 0.999 β€” the models encode identical geometry through completely different weight configurations

Files

  • base_tier_soup_calibrated.pt β€” Trained soup (teacher)
  • geolip_vit_encoder_e20.pt β€” ViT encoder at epoch 20
  • base_tier_soup_calibrated.py β€” Soup training script
  • vit_encoder_from_scratch.py β€” Encoder training script
  • runs/ β€” Tensorboard logs

Data

Usage

```python
import torch

# Load the encoder checkpoint
ckpt = torch.load("geolip_vit_encoder_e20.pt", weights_only=False)
# ckpt["encoder_state_dict"]: model weights
# ckpt["config"]: architecture config
# ckpt["mAP"], ckpt["cos"], ckpt["r1"]: metrics
```

license: apache-2.0

Experiment 2.5 Update: COCO convergence is slow but steady.

BCE loss isn't the best catalyst for geometry, but it does work to funnel through an aligned transformer.

I underestimated the complexity of associative cross-modal differences, but it is converging. Shared space is a very tricky thing to teach as an associative connection. Routing is easy; distilling is not, when multimodal structures and multiple adjacent representations are used as loss targets.

If this fails to meet direct expectations, I'll form a proper hub and teach using the Bertenstein method. Bertenstein works because it's always expecting to hear from the experts, and there is always one anchored expert in charge.

The expert/student distillation process requires skilled teachers with similar utility, which is different from simply funneling information through a route and pooling it.

geolip-captionbert-8192 accepts this pooled, funneled information and produces useful output because the shared expert information has similar access utilities.

In either case, geolip-captionbert-8192 was trained from scratch, and so was this model. Neither inherits weights from any large pretraining run; they inherit the geometry and structure through distillation, in order to represent complex structure that quite simply should not exist in the smaller model via direct implicit learning.

geolip-vit-x3 must learn to predict the pixel data using the experts' outputs as markers for loss, which means it can never get a full picture of anything outside of its own tools.

This model is exceptionally small, absurdly small even by ViT standards, because even at this size it is too much. The model cannot overfit if it uses every tool at its disposal; it will train indefinitely unless a cascade overflow happens, a math-continuity corruption occurs, or the substructure collapses to a simpler shortcut-centric behavior that would require scrambling.

The anchors are strong enough and tuned to the experts, the external losses are tuned to teach the expert responses, the expert data is used as a loss-attenuation method, and the structure conforms to those losses specifically because it's required to teach the model to be standalone and compliant without needing the experts later.

I gave the model everything I could geometrically, and it must discover the way to connect them.

I'm teaching SigLIP, DINO ViT, and CLIP ViT to communicate on the same manifold. They are essentially speaking three dialects of the same Roman tongue after a thousand years of separate evolution.

The fact that this works at all is a testament to the hypersphere attenuation.

=================================================================
GEOLIP VISION ENCODER β€” FROM SCRATCH
  ViT: 6L/384d/6h, patch16
  196 patches + CLS → 128-d output
  Device: cuda
=================================================================

  Loading soup...
  Soup: mAP=0.837 CV_target=0.2731
  train: loaded cached targets (118,287)
  val: loaded cached targets (5,000)
  Caching train images (118,287)...

=================================================================
BUILD ENCODER
=================================================================
  Architecture: 6L/384d/6h, patch16
  Input: 224×224 → 196 patches
  Output: 128-d (on hypersphere)
  Parameters: 11,216,768

=================================================================
TRAINING
  20 epochs, lr=0.0003, batch=48
  Losses: InfoNCE + MSE + CV + BCE + Procrustes alignment
  CV target: 0.2731
  Images: train=118,287 val=5,000 (cached as tensors)
=================================================================
E 1/20 train: 100%|██████████| 2465/2465 [02:44<00:00, 14.97batch/s, cos=0.258, loss=2.6911, nce_acc=0.339, ordered=1]
  E1 train: 165s loss=2.6891 nce=2.2529 mse=0.0120 bce=0.1963 nce_acc=0.340
  E1 val:   mAP=0.151 F1=0.162 R@1=0.032 cos=0.325 cv=0.2663 anchors=95/256 seen=5000/5000 ★
E 2/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.32batch/s, cos=0.368, loss=1.7954, nce_acc=0.553, ordered=1]
  E2 train: 161s loss=1.7948 nce=1.4297 mse=0.0099 bce=0.1473 nce_acc=0.553
  E2 val:   mAP=0.206 F1=0.197 R@1=0.062 cos=0.390 cv=0.2552 anchors=99/256 seen=5000/5000 ★
E 3/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.37batch/s, cos=0.416, loss=1.4860, nce_acc=0.641, ordered=1]
  E3 train: 160s loss=1.4854 nce=1.1484 mse=0.0092 bce=0.1338 nce_acc=0.641
  E3 val:   mAP=0.246 F1=0.244 R@1=0.091 cos=0.427 cv=0.2234 anchors=98/256 seen=5000/5000 ★
E 4/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.448, loss=1.2913, nce_acc=0.695, ordered=1]
  E4 train: 160s loss=1.2910 nce=0.9727 mse=0.0087 bce=0.1265 nce_acc=0.695
  E4 val:   mAP=0.272 F1=0.266 R@1=0.113 cos=0.453 cv=0.2078 anchors=99/256 seen=5000/5000 ★
E 5/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.475, loss=1.1334, nce_acc=0.743, ordered=1]
  E5 train: 160s loss=1.1331 nce=0.8303 mse=0.0083 bce=0.1205 nce_acc=0.743
  E5 val:   mAP=0.296 F1=0.292 R@1=0.139 cos=0.473 cv=0.2133 anchors=98/256 seen=5000/5000 ★
E 6/20 train: 100%|██████████| 2465/2465 [02:37<00:00, 15.63batch/s, cos=0.499, loss=1.0005, nce_acc=0.784, ordered=1]
  E6 train: 158s loss=1.0003 nce=0.7111 mse=0.0079 bce=0.1158 nce_acc=0.784
  E6 val:   mAP=0.317 F1=0.311 R@1=0.164 cos=0.495 cv=0.1835 anchors=98/256 seen=5000/5000 ★
E 7/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.60batch/s, cos=0.520, loss=0.8947, nce_acc=0.815, ordered=1]
  E7 train: 158s loss=0.8943 nce=0.6172 mse=0.0075 bce=0.1115 nce_acc=0.815
  E7 val:   mAP=0.337 F1=0.335 R@1=0.190 cos=0.513 cv=0.1809 anchors=96/256 seen=5000/5000 ★
E 8/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.59batch/s, cos=0.539, loss=0.8030, nce_acc=0.842, ordered=1]
  E8 train: 158s loss=0.8028 nce=0.5365 mse=0.0072 bce=0.1076 nce_acc=0.843
  E8 val:   mAP=0.344 F1=0.331 R@1=0.207 cos=0.523 cv=0.1779 anchors=95/256 seen=5000/5000 ★
E 9/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.58batch/s, cos=0.557, loss=0.7229, nce_acc=0.866, ordered=1]
  E9 train: 158s loss=0.7228 nce=0.4665 mse=0.0070 bce=0.1041 nce_acc=0.866
  E9 val:   mAP=0.361 F1=0.349 R@1=0.218 cos=0.537 cv=0.1764 anchors=95/256 seen=5000/5000 ★
E10/20 train:  36%|███▌      | 892/2465 [00:57<01:40, 15.69batch/s, cos=0.572, loss=0.6548, nce_acc=0.887, ordered=1]

Experiment 2.5:

The Xavier-initialized and Procrustes-aligned embedding array attached to a standard patch16 subset should suffice.

I'll be training this like CaptionBERT, but with a twist: the soup expert is the alignment bank for this one, and I trained it first instead of later.

The alignment and R@1 are nearly perfect, so it should be cohesive enough through the chain of conceptualization to coalesce through the implications.

Now comes the other question: whether the actual patches will learn based on the embedding and encoding spectrum, and how quickly I can make them learn.

The output this encoder produces is a 128-dimensional enriched representational lookup plane on a hypersphere. This is more than enough information to house access to any data route that exists.

The dimensional spectrum of even a 5-d object is so expansive and so enriched that the full spectrum of this shape requires specific curation of its behavior. This is what most of the mechanisms are tasked with overall: pruning the effect of rigidity-indifference preservation on the hypersphere-represented structure.

In other words, those 128 dimensions represent more information than I could express with words.

Experiment 2:

95/256 anchors survive, emergent geometric structure formed.

R@1 = 97.1%: not quite there, but getting close. Experiment 2 was successful enough to push harder in this direction.

Anchor collapse says the system doesn't need all those anchors. It started grabbing more by the end, which means the system aligned and then started growing further on a constraint I was unaware of.

This drift curve needs to be controlled. Direct anchored emergence during training is risky. The bank itself survived so well because it was anchored post-training, which gave added cohesion and complexity association that I have yet to find a runtime process to train. I will be analyzing the emergence to preserve the anchoring.

=================================================================
PHASE 5: TRAINING
  20 epochs, lr=0.001, CV target=0.2731
=================================================================
  E 1: mAP=0.788 F1=0.731 R@1=0.971 cos=0.806 cv=0.1213 anchors=226/256 nce=0.999 loss=0.1676 ★
  E 2: mAP=0.803 F1=0.742 R@1=0.971 cos=0.809 cv=0.1178 anchors=200/256 nce=0.999 loss=0.1459 ★
  E 3: mAP=0.810 F1=0.735 R@1=0.973 cos=0.808 cv=0.1197 anchors=161/256 nce=0.999 loss=0.1431 ★
  E 4: mAP=0.817 F1=0.752 R@1=0.971 cos=0.811 cv=0.1262 anchors=131/256 nce=0.999 loss=0.1404 ★
  E 5: mAP=0.823 F1=0.755 R@1=0.971 cos=0.812 cv=0.1232 anchors=113/256 nce=0.999 loss=0.1389 ★
  E 6: mAP=0.825 F1=0.755 R@1=0.972 cos=0.815 cv=0.1105 anchors=104/256 nce=0.999 loss=0.1379 ★
  E 7: mAP=0.827 F1=0.767 R@1=0.970 cos=0.814 cv=0.1125 anchors=101/256 nce=0.999 loss=0.1369 ★
  E 8: mAP=0.829 F1=0.763 R@1=0.971 cos=0.815 cv=0.1239 anchors=99/256 nce=0.999 loss=0.1361 ★
  E 9: mAP=0.832 F1=0.764 R@1=0.972 cos=0.815 cv=0.1164 anchors=98/256 nce=0.999 loss=0.1355 ★
  E10: mAP=0.833 F1=0.765 R@1=0.968 cos=0.814 cv=0.1166 anchors=99/256 nce=0.999 loss=0.1345 ★
  E11: mAP=0.834 F1=0.763 R@1=0.971 cos=0.814 cv=0.1214 anchors=98/256 nce=0.999 loss=0.1346 ★
  E12: mAP=0.833 F1=0.764 R@1=0.973 cos=0.813 cv=0.1200 anchors=95/256 nce=0.999 loss=0.1343
  E13: mAP=0.836 F1=0.761 R@1=0.972 cos=0.813 cv=0.1081 anchors=94/256 nce=0.999 loss=0.1338 ★
  E14: mAP=0.836 F1=0.772 R@1=0.973 cos=0.812 cv=0.1170 anchors=95/256 nce=0.999 loss=0.1334
  E15: mAP=0.835 F1=0.774 R@1=0.970 cos=0.812 cv=0.1223 anchors=95/256 nce=0.999 loss=0.1338
  E16: mAP=0.837 F1=0.777 R@1=0.968 cos=0.812 cv=0.1225 anchors=96/256 nce=1.000 loss=0.1339 ★
  E17: mAP=0.834 F1=0.772 R@1=0.973 cos=0.811 cv=0.1089 anchors=95/256 nce=0.999 loss=0.1327
  E18: mAP=0.834 F1=0.770 R@1=0.973 cos=0.812 cv=0.1156 anchors=95/256 nce=0.999 loss=0.1321
  E19: mAP=0.834 F1=0.773 R@1=0.970 cos=0.811 cv=0.1224 anchors=96/256 nce=0.999 loss=0.1328
  E20: mAP=0.835 F1=0.770 R@1=0.971 cos=0.812 cv=0.1159 anchors=96/256 nce=0.999 loss=0.1328

  Best mAP: 0.837
  CV target: 0.2731

Experiment 1:

Total collapse. The three models did not conform and the patchwork did not learn. The objectives are not correct.

One anchor was defaulted to; none of the others were utilized. The memory bank solves this problem through queue assessment with InfoNCE hub processing, but this model uses a different form of anchoring that did not work.

THE ENTIRE MODEL became the anchor, instead of the anchor points within the model. I'm thinking there wasn't enough scattering, so I'll try some additional tweaks.

Post

Active anchors: 1/256 (0.4%)
Every single image → anchor 65
Anchor entropy: 0.0000
Anchors within cos>0.5 per image: 1.0
Nearest anchor dist: 0.016; next nearest: 0.665

Effective dim: 23.6/128
Top-20 SVs explain 99.2%
Self-sim off-diag: 0.969

Expert uniqueness: 0.0008–0.0011

There is only one active anchor, which is essentially CLS. The uniqueness collapsed. The distance is fine; the entropy is dead.

Shortcut bypass: additional nonlinearity must be added.
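The collapse diagnostics above (active anchors, anchor entropy, effective dimension) can be reproduced with a few lines. This is a generic sketch, not the actual logging code; function names are mine.

```python
import numpy as np

def anchor_usage(emb: np.ndarray, anchors: np.ndarray):
    """Active-anchor count and assignment entropy; a collapsed run scores (1, 0.0)."""
    idx = (emb @ anchors.T).argmax(-1)                # nearest anchor per embedding
    counts = np.bincount(idx, minlength=len(anchors))
    p = counts[counts > 0] / counts.sum()
    return int((counts > 0).sum()), float(-(p * np.log(p)).sum())

def effective_dim(emb: np.ndarray) -> float:
    """Participation ratio of the embedding spectrum (e.g. 23.6/128 in the failed run)."""
    s = np.linalg.svd(emb - emb.mean(0), compute_uv=False)
    p = s**2 / (s**2).sum()
    return float(1.0 / (p**2).sum())
```

A healthy run keeps many anchors active with nonzero entropy and an effective dimension well above 1.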

Assessment

Without the centered Procrustes loss, the same result happened. The collapse forms around one of the earlier anchors, near the outer midpoint about which all three models simultaneously rotate, which is not the direct center.

This point has noise, invalidity, incorrect association, and additional problems stemming from the attention mechanisms internal to the queried models.

Hypothesis based on research

The Procrustes alignment must align center-wise, and it must be defined precisely to specification.
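Center-wise alignment means solving Procrustes on mean-centered clouds and carrying the translation separately, so the rotation is about the cloud's mean rather than the origin. A minimal sketch of what that could look like:

```python
import numpy as np

def centered_procrustes(A: np.ndarray, B: np.ndarray):
    """Rotation R and translation t aligning cloud A onto B about their means."""
    a_mu, b_mu = A.mean(0), B.mean(0)
    # Solve the orthogonal Procrustes problem on the centered clouds.
    U, _, Vt = np.linalg.svd((A - a_mu).T @ (B - b_mu))
    R = U @ Vt
    t = b_mu - a_mu @ R          # translation recovered from the means
    return R, t
```

Skipping the centering step folds the offset between cloud means into the rotation, which would drag the alignment toward exactly the kind of off-center point described above.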
