Battle experiment notebook uploaded
This wasn't meant to be seen by others, but it's too damn messy so there you go. I can't keep rewriting the same explanations over and over to justify the empirical data here.
It's a mess, but it's there. https://huggingface.co/AbstractPhil/geolip-vit-base-x3/blob/main/hypersphere_manifold_experimentation.ipynb
I'll update subsequent versions as I go.
Almost within the dimensional spectrum of aligned label counts
The 512-anchor system seems perfectly suited for a combined text + vision shared space, but the concern with baking the two together is that they can never be separated.
Even now the label system is essentially baked into the soup as a sinking signal to create output, but the output isn't yet the geometric structure needed to apply an nth anchored patchwork to a representational space on demand.
The test results show the 80 BCE labels pretty much cover what they need on the embedding space; the rest is incidental and may or may not snap to it.
I forced anchor hops through dropout, otherwise the system would just default to CLS anchor because downstream funnels more conveniently that way. Path of least resistance.
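The anchor-hop forcing can be sketched as dropout over the anchor bank at routing time. This is my own minimal numpy illustration of the idea (the function name and toy setup are mine, not the repo's code): randomly mask a fraction of anchors each step so the embedding cannot always snap to the same convenient (CLS-like) anchor.

```python
import numpy as np

def route_with_anchor_dropout(emb, anchors, p_drop=0.3, rng=None):
    """Nearest-anchor routing with dropout: mask a random subset of
    anchors so the embedding is forced to hop instead of always
    defaulting to the same anchor (the path of least resistance)."""
    rng = np.random.default_rng(0) if rng is None else rng
    # cosine similarity on the hypersphere: both sides L2-normalized
    emb = emb / np.linalg.norm(emb)
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    sims = anchors @ emb
    keep = rng.random(sims.shape) >= p_drop   # True = anchor survives
    if not keep.any():                        # edge case: all dropped
        keep[:] = True
    sims = np.where(keep, sims, -np.inf)
    return int(np.argmax(sims))

# toy demo: 8 anchors in 4-d, embedding very close to anchor 3
rng = np.random.default_rng(42)
anchors = rng.normal(size=(8, 4))
emb = anchors[3] + 0.01 * rng.normal(size=4)
hits = {route_with_anchor_dropout(emb, anchors, 0.5, rng) for _ in range(200)}
# with dropout, routing sometimes lands on other anchors, not only 3
```

Without the mask, every call would return anchor 3; the dropout is what spreads traffic across the bank during training.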
Soup imperfect
Even with high-fidelity structural integrity, the structure still finds a way to learn a different shape than the intended outcome.
I'll be debugging for the coming hours until I get the exact topology I need.
Invariably, no matter how many safeguards I install, the objective somehow finds a way to skip steps.
Without the full anchored topology in the trained and prepared orthogonality, with the correct association, the correct differentiation, the correct offset, the correct Procrustes alignment, and the correct... y'know what, it needs about a billion needles lined up.
This might be a multipass train no matter how I size it up.
Heavy Soup
The 2048-anchor soup is going to take a few hours to cook, but it'll be ready.
BERT just kinda works like this, but the ViT structure doesn't. The patches require training.
V3 experiments incoming
This is a highly experimental repo, expect rapid changes.
I've begun training a much thicker soup with a ton more anchors and the advanced constellation. That should help keep cohesion much cleaner.
Upgrades V3
- Anchor dropout, which helps discourage anchor dependency.
- Residual access to hypersphere params in the transformer path; potentially more accurate downstream at no parameter cost.
- Fused and expanded anchor constellation to support a wider formula for experimentation on geolip-vit-tiny 256.
GeoLIP ViT Base x3
Geometric vision system: 3-expert consensus soup + from-scratch ViT encoder.
Components
1. Base Tier Soup (teacher)
800K-parameter geometric fusion of 3 pretrained vision experts on a 128-d hypersphere.
| Expert | Architecture | Training | Dim |
|---|---|---|---|
| clip_l14_openai | ViT-L/14 | Text-supervised (CLIP) | 768 |
| dinov2_b14 | ViT-B/14 | Self-supervised (DINO) | 768 |
| siglip_b16_384 | ViT-B/16 | Sigmoid contrastive (SigLIP) | 768 |
Pipeline: GPA alignment at 768-d → PCA to 128-d → per-expert whitened Procrustes calibration → Procrustes-initialized projectors → geometric autograd training.
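The per-expert Procrustes calibration step in this pipeline can be sketched with the classical orthogonal Procrustes solution (SVD of the cross-covariance). This is a minimal illustration of the math, not the repo's actual calibration code:

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Orthogonal matrix R minimizing ||X @ R - Y||_F,
    via SVD of X^T Y. X, Y: (n_samples, dim) feature matrices,
    e.g. one expert's features vs. the shared consensus frame."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# sanity check: recover a known orthogonal map exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # ground-truth rotation
Y = X @ Q
R = orthogonal_procrustes(X, Y)
err = np.abs(X @ R - Y).max()   # ~0: the map is recovered
```

The "whitened" variant in the pipeline would first decorrelate each expert's features before solving for R; the SVD core stays the same.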
| Metric | Value |
|---|---|
| mAP (COCO) | 0.837 |
| Parameters | 799,952 |
| Anchors | 256 × 128-d |
| Consensus CV (768-d) | 0.2793 |
| Consensus CV (128-d) | 0.2731 |
| Optimizer | Adam, no weight decay |
2. From-Scratch ViT Encoder (student)
11M-parameter ViT trained from Xavier initialization against the soup's consensus targets. No pretrained weights anywhere. Same architecture pattern as CaptionBERT.
| Config | Value |
|---|---|
| Layers | 6 |
| Hidden dim | 384 |
| Heads | 6 |
| FFN dim | 1536 |
| Patch size | 16 |
| Image size | 224 |
| Output dim | 128 (on hypersphere) |
| Parameters | 11,216,768 |
Training: Raw COCO images → encoder → 128-d embedding → frozen soup pipeline (constellation + patchwork + classifier) → BCE loss. Additional losses: InfoNCE + MSE against consensus targets, whitened Procrustes alignment, pentachoron CV (calibrated to measured consensus).
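The student-side InfoNCE + MSE part of this recipe can be sketched in a few lines. A minimal numpy sketch under my own naming (the real training script batches through the frozen soup and adds BCE/CV/Procrustes terms on top):

```python
import numpy as np

def info_nce(z, t, temp=0.07):
    """InfoNCE over a batch: embedding z[i] should match its own
    consensus target t[i] against all other targets in the batch."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = (z @ t.T) / temp                       # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))                  # diagonal = positives

def distill_loss(z, t, w_mse=1.0):
    """InfoNCE + MSE against the frozen soup's consensus targets."""
    return info_nce(z, t) + w_mse * np.mean((z - t) ** 2)

# toy check: embeddings near their targets score far lower loss
rng = np.random.default_rng(0)
t = rng.normal(size=(32, 128))
t /= np.linalg.norm(t, axis=1, keepdims=True)
noisy = t + 0.1 * rng.normal(size=t.shape)   # "trained" student
far = rng.normal(size=t.shape)               # "untrained" student
l_close, l_far = distill_loss(noisy, t), distill_loss(far, t)
```

The MSE term pins the student to the exact consensus coordinates while InfoNCE only cares about relative ranking; using both is what drives cos→consensus and nce_acc up together in the logs below.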
Results (20 epochs, still converging)
| Metric | E1 | E10 | E20 |
|---|---|---|---|
| nce_acc | 0.340 | 0.887 | 0.972 |
| cos→consensus | 0.325 | 0.557 | 0.599 |
| R@1 (5K) | 0.032 | 0.254 | 0.323 |
| mAP | 0.151 | 0.380 | 0.429 |
| F1 | 0.162 | 0.361 | 0.418 |
| Active anchors | 95 | 96 | 94 |
All metrics still climbing at E20. Model needs 60-90 epochs to fully converge (matching CaptionBERT's text encoder trajectory).
Architecture
Training (soup as teacher):
3 expert features → Procrustes projectors → mean → L2-norm → 128-d consensus targets
Raw images → from-scratch ViT → 128-d embedding
Losses: InfoNCE + MSE + CV + BCE(through frozen soup) + Procrustes alignment
Geometric autograd: tangential=0.01, separation=1.0
Inference (standalone):
Raw image → ViT encoder → 128-d embedding (on hypersphere)
No experts needed. Geometry is baked in.
Key Findings
- 800K soup params beat 81.7M (34-expert soup at 0.732 mAP) and 75.6M (34-expert bank at 0.782 mAP)
- Proper calibration (GPA + whitened Procrustes + measured CV target) is essential: without it, the constellation collapses to 1/256 active anchors
- From-scratch ViT learns the 3-expert consensus representation from raw pixels with the same convergence dynamics as CaptionBERT on text
- Cross-model weight cosine is 0.000 but activation Procrustes is 0.999: the models encode identical geometry through completely different weight configurations
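The "identical geometry, different weights" finding can be illustrated with a toy example (mine, not the repo's evaluation code): two linear "models" whose weight matrices are nearly orthogonal to each other as flat vectors, yet whose activations are related by a rotation and therefore align almost perfectly under Procrustes.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 64))                  # shared input data
W1 = rng.normal(size=(64, 64))                  # model A's weights
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random rotation
W2 = W1 @ Q                                     # model B: same map, rotated basis

# flat cosine between the two weight matrices: near zero
w_cos = (W1.ravel() @ W2.ravel()) / (np.linalg.norm(W1) * np.linalg.norm(W2))

# Procrustes-align B's activations to A's: near perfect
A1, A2 = X @ W1, X @ W2
U, _, Vt = np.linalg.svd(A2.T @ A1)
aligned = A2 @ (U @ Vt)
p_cos = (aligned.ravel() @ A1.ravel()) / (np.linalg.norm(aligned) * np.linalg.norm(A1))
# w_cos ~ 0 while p_cos ~ 1: same geometry, different coordinates
```

This is the benign version of the phenomenon: weight-space comparison sees nothing, activation-space Procrustes sees everything.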
Files
- `base_tier_soup_calibrated.pt`: trained soup (teacher)
- `geolip_vit_encoder_e20.pt`: ViT encoder at epoch 20
- `base_tier_soup_calibrated.py`: soup training script
- `vit_encoder_from_scratch.py`: encoder training script
- `runs/`: TensorBoard logs
Data
- Training features: AbstractPhil/bulk-coco-features
- Images: COCO 2017 (118K train, 5K val)
Usage
```python
import torch

# Load encoder checkpoint
ckpt = torch.load("geolip_vit_encoder_e20.pt", weights_only=False)
# ckpt["encoder_state_dict"] → model weights
# ckpt["config"]             → architecture config
# ckpt["mAP"], ckpt["cos"], ckpt["r1"] → metrics
```
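Once you have 128-d hypersphere embeddings from the encoder, retrieval reduces to a dot product, since everything is L2-normalized. A generic sketch with synthetic data (the function name is mine; swap in real embeddings):

```python
import numpy as np

def retrieve_top_k(query, gallery, k=5):
    """On the unit hypersphere, cosine similarity is just a dot
    product, so nearest-neighbour retrieval is a single matmul."""
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = gallery @ query
    top = np.argsort(-sims)[:k]        # indices of best matches
    return top, sims[top]

# synthetic stand-in for encoder outputs: 1000 items in 128-d
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 128))
query = gallery[123] + 0.05 * rng.normal(size=128)  # near item 123
top, sims = retrieve_top_k(query, gallery)
# top[0] == 123: the perturbed query finds its source item
```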
license: apache-2.0
Experiment 2.5 Update: COCO convergence is slow but steady.
BCE loss isn't the best catalyst for geometry, but it does work to funnel through an aligned transformer.
I underestimated the complexity of associative cross-modal differences, but it is converging. A shared space is a very tricky thing to teach as an associative connection. Routing is easy; distillation is harder with multimodal structures and multiple adjacent representations used as loss targets.
If this fails to meet direct expectations, I'll form a proper hub and teach using the Bertenstein method. Bertenstein works because it's always expecting to hear from the experts, and there is always one anchored expert in charge.
The expert/student distillation process requires skilled teachers with similar utility, which is different from simply funneling information through a route and pooling it.
geolip-captionbert-8192 accepts this pooled, funneled information and produces useful output because the shared expert information has similar access utilities.
In either case, geolip-captionbert-8192 was trained from scratch, and so is this model. They are not inheriting weights from any large training run; they are inheriting geometry and structure through distillation, in order to represent complex structure that quite simply should not exist in the smaller model via direct implicit learning.
geolip-vit-x3 must learn to predict from the pixel data using the experts' outputs as markers for loss, which means it can never get a full picture of anything outside of its own tools.
This model is exceptionally small, absurdly small even by ViT standards, because even at this size it is too much. The model cannot overfit if it uses every tool at its disposal; it will train indefinitely unless a cascade overflow happens, a math continuity corruption occurs, or the substructure collapses to a simpler shortcut-centric behavior that would require scrambling.
The anchors are strong enough and tuned to the experts, the external losses are tuned to teach the expert responses, the expert data is used as attenuation targets for the losses, and the structure conforms to those losses specifically because it's required to make the model standalone and compliant without needing the experts later.
I gave the model everything I could geometrically, and it must discover the way to connect them.
I'm teaching SigLIP, DINOv2 and CLIP-ViT to communicate on the same manifold. They are essentially speaking three foreign dialects that evolved from the same Roman root a thousand years apart.
The fact that this works at all is a testament to the hypersphere attenuation.
```
=================================================================
GEOLIP VISION ENCODER - FROM SCRATCH
ViT: 6L/384d/6h, patch16
196 patches + CLS → 128-d output
Device: cuda
=================================================================
Loading soup...
Soup: mAP=0.837 CV_target=0.2731
train: loaded cached targets (118,287)
val: loaded cached targets (5,000)
Caching train images (118,287)...
=================================================================
BUILD ENCODER
=================================================================
Architecture: 6L/384d/6h, patch16
Input: 224×224 → 196 patches
Output: 128-d (on hypersphere)
Parameters: 11,216,768
=================================================================
TRAINING
20 epochs, lr=0.0003, batch=48
Losses: InfoNCE + MSE + CV + BCE + Procrustes alignment
CV target: 0.2731
Images: train=118,287 val=5,000 (cached as tensors)
=================================================================
E 1/20 train: 100%|██████████| 2465/2465 [02:44<00:00, 14.97batch/s, cos=0.258, loss=2.6911, nce_acc=0.339, ordered=1]
E1 train: 165s loss=2.6891 nce=2.2529 mse=0.0120 bce=0.1963 nce_acc=0.340
E1 val: mAP=0.151 F1=0.162 R@1=0.032 cos=0.325 cv=0.2663 anchors=95/256 seen=5000/5000 ✓
E 2/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.32batch/s, cos=0.368, loss=1.7954, nce_acc=0.553, ordered=1]
E2 train: 161s loss=1.7948 nce=1.4297 mse=0.0099 bce=0.1473 nce_acc=0.553
E2 val: mAP=0.206 F1=0.197 R@1=0.062 cos=0.390 cv=0.2552 anchors=99/256 seen=5000/5000 ✓
E 3/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.37batch/s, cos=0.416, loss=1.4860, nce_acc=0.641, ordered=1]
E3 train: 160s loss=1.4854 nce=1.1484 mse=0.0092 bce=0.1338 nce_acc=0.641
E3 val: mAP=0.246 F1=0.244 R@1=0.091 cos=0.427 cv=0.2234 anchors=98/256 seen=5000/5000 ✓
E 4/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.448, loss=1.2913, nce_acc=0.695, ordered=1]
E4 train: 160s loss=1.2910 nce=0.9727 mse=0.0087 bce=0.1265 nce_acc=0.695
E4 val: mAP=0.272 F1=0.266 R@1=0.113 cos=0.453 cv=0.2078 anchors=99/256 seen=5000/5000 ✓
E 5/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.475, loss=1.1334, nce_acc=0.743, ordered=1]
E5 train: 160s loss=1.1331 nce=0.8303 mse=0.0083 bce=0.1205 nce_acc=0.743
E5 val: mAP=0.296 F1=0.292 R@1=0.139 cos=0.473 cv=0.2133 anchors=98/256 seen=5000/5000 ✓
E 6/20 train: 100%|██████████| 2465/2465 [02:37<00:00, 15.63batch/s, cos=0.499, loss=1.0005, nce_acc=0.784, ordered=1]
E6 train: 158s loss=1.0003 nce=0.7111 mse=0.0079 bce=0.1158 nce_acc=0.784
E6 val: mAP=0.317 F1=0.311 R@1=0.164 cos=0.495 cv=0.1835 anchors=98/256 seen=5000/5000 ✓
E 7/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.60batch/s, cos=0.520, loss=0.8947, nce_acc=0.815, ordered=1]
E7 train: 158s loss=0.8943 nce=0.6172 mse=0.0075 bce=0.1115 nce_acc=0.815
E7 val: mAP=0.337 F1=0.335 R@1=0.190 cos=0.513 cv=0.1809 anchors=96/256 seen=5000/5000 ✓
E 8/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.59batch/s, cos=0.539, loss=0.8030, nce_acc=0.842, ordered=1]
E8 train: 158s loss=0.8028 nce=0.5365 mse=0.0072 bce=0.1076 nce_acc=0.843
E8 val: mAP=0.344 F1=0.331 R@1=0.207 cos=0.523 cv=0.1779 anchors=95/256 seen=5000/5000 ✓
E 9/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.58batch/s, cos=0.557, loss=0.7229, nce_acc=0.866, ordered=1]
E9 train: 158s loss=0.7228 nce=0.4665 mse=0.0070 bce=0.1041 nce_acc=0.866
E9 val: mAP=0.361 F1=0.349 R@1=0.218 cos=0.537 cv=0.1764 anchors=95/256 seen=5000/5000 ✓
E10/20 train:  36%|███▋      | 892/2465 [00:57<01:40, 15.69batch/s, cos=0.572, loss=0.6548, nce_acc=0.887, ordered=1]
```
Experiment 2.5:
The Xavier-aligned and Procrustes-initialized embedding array attached to a standard patch16 stem should suffice.
I'll be training this like CaptionBERT but with a twist: the soup expert is the alignment bank for this one, and I trained it first instead of later.
The alignment and R@1 are nearly perfect, so it should be cohesive enough through the chain of conceptualization to coalesce through the implications.
Whether the actual patches will learn from the embedding and encoding spectrum, and how quickly I can make them learn, is another story.
The output this encoder produces is a 128-dimensional enriched representational lookup plane on a hypersphere. This is more than enough information to house access to any data route that exists.
Even the spectrum of a 5-d object is so expansive and enriched that its full shape requires a specific curation of behavior. This is what most of the mechanisms are tasked with overall: pruning the effects of rigidity, indifference, and preservation on the hypersphere-represented structure.
In other words, those 128 dimensions represent more information than I could express with words.
Experiment 2:
95/256 anchors survive; emergent geometric structure formed.
R@1 = 97.1%. Not quite there, but getting close. Experiment 2 was successful enough to push harder in this direction.
The anchor collapse says the system doesn't need all those anchors. It started grabbing more by the end, which means the system aligned and then kept growing under a constraint I was unaware of.
This drift curve needs to be controlled. Direct anchored emergence during training is risky. The bank itself survived so well because it was anchored post-training, which gave added cohesion and complexity association that I have yet to reproduce at training time. I will be analyzing the emergence to preserve the anchoring.
```
=================================================================
PHASE 5: TRAINING
20 epochs, lr=0.001, CV target=0.2731
=================================================================
E 1: mAP=0.788 F1=0.731 R@1=0.971 cos=0.806 cv=0.1213 anchors=226/256 nce=0.999 loss=0.1676 ✓
E 2: mAP=0.803 F1=0.742 R@1=0.971 cos=0.809 cv=0.1178 anchors=200/256 nce=0.999 loss=0.1459 ✓
E 3: mAP=0.810 F1=0.735 R@1=0.973 cos=0.808 cv=0.1197 anchors=161/256 nce=0.999 loss=0.1431 ✓
E 4: mAP=0.817 F1=0.752 R@1=0.971 cos=0.811 cv=0.1262 anchors=131/256 nce=0.999 loss=0.1404 ✓
E 5: mAP=0.823 F1=0.755 R@1=0.971 cos=0.812 cv=0.1232 anchors=113/256 nce=0.999 loss=0.1389 ✓
E 6: mAP=0.825 F1=0.755 R@1=0.972 cos=0.815 cv=0.1105 anchors=104/256 nce=0.999 loss=0.1379 ✓
E 7: mAP=0.827 F1=0.767 R@1=0.970 cos=0.814 cv=0.1125 anchors=101/256 nce=0.999 loss=0.1369 ✓
E 8: mAP=0.829 F1=0.763 R@1=0.971 cos=0.815 cv=0.1239 anchors=99/256 nce=0.999 loss=0.1361 ✓
E 9: mAP=0.832 F1=0.764 R@1=0.972 cos=0.815 cv=0.1164 anchors=98/256 nce=0.999 loss=0.1355 ✓
E10: mAP=0.833 F1=0.765 R@1=0.968 cos=0.814 cv=0.1166 anchors=99/256 nce=0.999 loss=0.1345 ✓
E11: mAP=0.834 F1=0.763 R@1=0.971 cos=0.814 cv=0.1214 anchors=98/256 nce=0.999 loss=0.1346 ✓
E12: mAP=0.833 F1=0.764 R@1=0.973 cos=0.813 cv=0.1200 anchors=95/256 nce=0.999 loss=0.1343
E13: mAP=0.836 F1=0.761 R@1=0.972 cos=0.813 cv=0.1081 anchors=94/256 nce=0.999 loss=0.1338 ✓
E14: mAP=0.836 F1=0.772 R@1=0.973 cos=0.812 cv=0.1170 anchors=95/256 nce=0.999 loss=0.1334
E15: mAP=0.835 F1=0.774 R@1=0.970 cos=0.812 cv=0.1223 anchors=95/256 nce=0.999 loss=0.1338
E16: mAP=0.837 F1=0.777 R@1=0.968 cos=0.812 cv=0.1225 anchors=96/256 nce=1.000 loss=0.1339 ✓
E17: mAP=0.834 F1=0.772 R@1=0.973 cos=0.811 cv=0.1089 anchors=95/256 nce=0.999 loss=0.1327
E18: mAP=0.834 F1=0.770 R@1=0.973 cos=0.812 cv=0.1156 anchors=95/256 nce=0.999 loss=0.1321
E19: mAP=0.834 F1=0.773 R@1=0.970 cos=0.811 cv=0.1224 anchors=96/256 nce=0.999 loss=0.1328
E20: mAP=0.835 F1=0.770 R@1=0.971 cos=0.812 cv=0.1159 anchors=96/256 nce=0.999 loss=0.1328
Best mAP: 0.837
CV target: 0.2731
```
Experiment 1:
Total collapse. The three models did not conform and the patchwork did not learn. The objectives are not correct.
One anchor was defaulted to; none of the others were utilized. The memory bank solves this problem through queue assessment with the InfoNCE hub processing, but this model is a different form of anchoring that did not work.
THE ENTIRE MODEL became the anchor, instead of the anchor points within the model. I'm thinking there wasn't enough scattering, so I'll try some additional tweaks.
Post
Active anchors: 1/256 (0.4%)
Every single image β anchor 65
Anchor entropy: 0.0000
Anchors within cos>0.5 per image: 1.0
Nearest anchor dist: 0.016 → next nearest: 0.665
Effective dim: 23.6/128
Top-20 SVs explain 99.2%
Self-sim off-diag: 0.969
Expert uniqueness: 0.0008–0.0011
There is only one active anchor, which is essentially CLS. The uniqueness collapsed. The distance is fine; the entropy is dead.
A shortcut bypass: additional nonlinearity must be added.
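The collapse diagnostics above (active anchors, assignment entropy) are easy to recompute from hard anchor assignments. A generic sketch under my own naming, assuming nearest-anchor routing on the hypersphere:

```python
import numpy as np

def anchor_diagnostics(emb, anchors):
    """Active-anchor count and assignment entropy: the two numbers
    that flag a routing collapse (1 active anchor, entropy 0.0)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    assign = np.argmax(emb @ anchors.T, axis=1)        # hard routing
    counts = np.bincount(assign, minlength=len(anchors))
    p = counts[counts > 0] / counts.sum()
    entropy = float(-(p * np.log(p)).sum())
    return int((counts > 0).sum()), entropy

# reproduce the collapsed case: every image routed to anchor 65
rng = np.random.default_rng(0)
anchors = rng.normal(size=(256, 128))
collapsed = np.tile(anchors[65], (100, 1))
active, ent = anchor_diagnostics(collapsed, anchors)
# active == 1, ent == 0.0: the signature of total collapse
```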
Assessment
Without the centered Procrustes loss, the same result occurred. The collapse forms around one of the earlier anchors, at the outer midpoint that all three models simultaneously rotate around, which is not the true center.
This point carries noise, invalid and incorrect associations, and additional problems stemming from the attention mechanisms internal to the queried models.
Hypothesis based on research
The Procrustes alignment must align about the center, and it must be defined precisely to specification.