---
license: apache-2.0
tags:
- geometric-deep-learning
- distillation
- consensus
- pentachoron
- procrustes
- caption-embedding
- sentence-similarity
- feature-extraction
- caption_encoder
language: en
pipeline_tag: feature-extraction
datasets:
- CaptionEmporium/conceptual-captions-cc12m-llavanext
base_model:
- AbstractPhil/geolip-bertenstein
---

**Estimated Trained Samples** 
| Component | Samples |
|---|---|
| **CORE** | 300,000,000 |
| **BANK** | 17,500,000 |



# Newest: Prepping 12m conceptual-captions BERT extractions, aka 36m extractions * 5 models
So around 180,000,000 total samples, which is fundamentally different from repeating a single 200k or 500k task like I've been doing.

https://huggingface.co/datasets/AbstractPhil/conceptual-captions-12m-webdataset-berts

You can track the process there.



The dataset is going to be in .pt chunks because they load directly to VRAM nearly instantly in Colab, and the system operates on them more quickly than through dataloaders.
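As a minimal sketch of why the chunks are fast, assuming each `.pt` file holds a plain dict of tensors (the filename and chunk layout here are hypothetical, not the actual dataset format):

```python
import torch

# Hypothetical chunk layout: each .pt file is assumed to hold a dict of
# pre-extracted tensors (e.g. teacher embeddings) for one slice of the set.
def load_chunk(path: str, device: str = "cuda"):
    # map_location moves every tensor in the file straight to the target
    # device, skipping a CPU staging copy and DataLoader worker overhead.
    return torch.load(path, map_location=device)
```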

I'll be running the full 12m set on all three captions, no exceptions - short llava, long llava, and original captions.

After the 36m 5 expert dataset training completes, the core model will be ready.

It's legitimately wild watching the system sit at 100% validation accuracy, but reaching that requires additional complexity, so accuracy isn't the measure to analyze.
Recall is solved; the internal geometric structure still needs to align to the larger spectrum of rigidity that the smooth manifold deviations require for full cohesion, meaning more data.
The smooth curves will hammer into rigid structures, rather than rigid structures smoothing out over time.
This experiment will determine whether this training process is viable overall. As it stands, I'm essentially going to let this thing run now that it's going.
The results will be ready when they're ready, and the outcome should be a serious 896-dim caption utility for reuse, with a heavily prepared 128-dim geometric
anchor system that will be entirely reusable as a pure geometric anchor bank.

36 million samples for roughly 10 epochs should be a fair assessment. Hopefully the data isn't too much.

So it'll be around 36,000,000 * 5 * 10, roughly 1.8b more samples give or take, which should be enough for full shared caption cohesion.

The training itself can be handled on a single G4 in a few days, nothing too major. Everything is within cost.

The data prep, on the other hand, may take a bit longer, but I can run multiple T4s at low cost to prepare the chunks over time.

Saturating the internals of the anchor and the subsystem will allow for more complex processes and easy alignment with pieces of the data. After that,
it will be quite fast to sample the most accurate captions and begin forming ViT associations, which will allow for full next-token prediction capacity
thanks to the internal similarity mechanisms and the solidity of the steel-formed anchor bank.


# 2 additional epochs, 1m samples run
500k samples of 5 experts, so I guess that's... 2.5m samples per epoch then. 

The alignment tightened considerably. The count went from 0.087 to around 0.1-something last I checked. It's rising every batch, and the
anchor will continue to train on low heat now that both have begun to align.

As the subsystems aligned, the core system aligned around them, and accuracy is still R@1: 99.9%, meaning nearly 100% for validation.

With that the depth has been expanded with a massive influx of geometric information from the experts. Fully distilled through direct inference and utility.

# Older: Back in the oven.

I'm going to unfreeze the model and let it align with the 500k captions now that the new alignment bank is present.

This should fundamentally shift the model toward a near-center alignment, somewhere between 0.25 and the 0.082 it held while frozen.


# Older NLI head preliminary

So the outcome is surprisingly good. Around 75-76% accuracy give or take for NLI with the prototype conv5d that I'm working with.

It's not a true conv5d; it's more an accumulator meant to encapsulate the necessary behavioral implications of every stretch that a conv5d would imply.
When paired with structural geometry, this structure beats the MLP at resisting overfitting.

The MLP overfit the model at around 71% validation accuracy against nearly 95% training accuracy.

Conv5d preserved the bank's geometry instead of completely collapsing it into noise, allowing around 75-76% validation with 80% training accuracy.

So the problem is still present. Without the bank the model reached around 64%; with the anchor bank it reaches around 75% with geometric solidity, but that's not
enough to say the NLI head works yet.

As you can see, either way the model does eventually begin to overfit in its current state. It's simply too small and predominantly distilled,
which means it will have problems no matter which task I attempt to teach it.


![image](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/Yudr9ZF_bWx-8xAbSqGDr.png)

HOWEVER, it's enough to say that it can with more training. This is enough to continue for me.

If I were to, say, unlock the model's weights and train against ALL FIVE EXPERTS, this would be a trivial task. The system would learn it instantly.

However, this is an attempt to train WITHOUT the experts, as they are a large burden on time and effort. I need to test the system's capacity
to handle training its own heads, without the experts forcing their geometric structure into the mix.

The geometric alignment helps, but it's not enough yet. It requires more.

# Older Below

# OKAY

Now after all that prefitting, reconstruction, capacity extension, and deterministic vaulting - WE CAN TRAIN THE NLI HEAD!

Let's see if it takes.


# GEOLIP CaptionBERT-8192-anchored

This will be the real prototype; fingerprinting was the earlier approach, and the full upcoming prototype is ready for training.

https://huggingface.co/AbstractPhil/geolip-axis-prototype

The example code and prototype axis modulators are present there as they are, and they will be utilized throughout upcoming experiments.

For CaptionBERT, upcoming checkpoints will be pushed once the process succeeds; likely 1 hour per epoch for 5 epochs or so should be more than enough.

This marks the first use of a new prototype object dubbed AnchorBank, which is designed specifically to house the necessary implications that the model is distilled with,
while specifically aligning the expectation of those distillation valuations into the bank itself.

This allows the model to POTENTIALLY solve nth token lookup without a head, so a head will allow finetuning. If successful, the anchor bank will contain
all the knowledge the model requires to geometrically represent its data in expanded structures - if the losses and training process are correctly aligned to the task.

**HOPEFULLY** after this refit, the structure will be capable of NLI-head token prediction; if not, I'll work with a different small LLM project and then
determine the potential utility of directly integrating the two in a MOE pipeline instead of a full collective behavioral implication.

If that goes well, the MOE can be adapted into collective behavior if the systems align correctly, but that's a different process.

# GEOLIP CaptionBERT-8192-fingerprinted

The next iteration will require an expanded fingerprinting axis-based relational bank, specifically to the alignment of the data and the teachers at training time.

The differentiation between what is learned and what is retained, specifically expert-to-expert, will enable this fingerprint to preserve the student model's integrity,
which should allow cross_entropy training without complete geometric collapse and rapid overfitting.

As it stands this model is too rigid to train heads on, but I will directly improve it today and instill a core memory of geometry.

This geometry will be ever-learning, meaning when the core model trains from any experts, the bank must train as well. This geometry houses the entire
internalized geometric embedding anchored fingerprinting spectrum, and this will likely evolve over the coming hours until the functional prototype comes
to full fruition. Wish me luck as I design the reusable compact mechanism.

The final state of this will be a transparent embedding system with a transformer, specifically aligned stepwise.

No tricks, no gimmicks, just pure alignment math through solid and careful hypersphere rigidity analysis.

This alignment will allow the student to learn independently, without collapsing to overfitting due to exceeding internal utility, while the external heads
still have more than a reasonable amount of information to access.


# GEOLIP CaptionBERT-8192

A 26M-parameter caption encoder whose embedding space is the geometric intersection of five independently trained language models. Trained from scratch via consensus distillation β€” no pretrained weights, no expert models at inference.

## Benchmarks

Evaluated against all five consensus teachers on STS-B, SICK-R, and MRPC. All models use mean-pooled embeddings with cosine similarity. No fine-tuning on any benchmark task.

### Semantic Textual Similarity (STS-B)

| Model | Params | Spearman ρ | Pearson r |
|---|---|---|---|
| DistilBERT-base | 66M | 0.5717 | β€” |
| RoBERTa-base | 125M | 0.5436 | β€” |
| **CaptionBERT-8192** | **26M** | **0.5032** | **0.5100** |
| ALBERT-base-v2 | 12M | 0.4784 | β€” |
| BERT-base | 110M | 0.4729 | β€” |
| ModernBERT-base | 149M | 0.4215 | β€” |

Beats BERT-base (4.2Γ— larger) and ModernBERT-base (5.7Γ— larger) on general sentence similarity despite being trained exclusively on image captions.

### SICK-R (Compositional Similarity)

| Model | Params | Spearman ρ | Pearson r |
|---|---|---|---|
| DistilBERT-base | 66M | 0.6424 | β€” |
| RoBERTa-base | 125M | 0.6296 | β€” |
| **CaptionBERT-8192** | **26M** | **0.6138** | **0.6645** |
| BERT-base | 110M | 0.5865 | β€” |
| ModernBERT-base | 149M | 0.5479 | β€” |
| ALBERT-base-v2 | 12M | 0.5364 | β€” |

\#3/6 on compositional/syntactic similarity. Beats BERT-base, ModernBERT-base, and ALBERT on a task requiring structural language understanding.

### MRPC (Paraphrase Detection)

| Model | Params | F1 | Accuracy | Threshold |
|---|---|---|---|---|
| RoBERTa-base | 125M | 0.8122 | β€” | β€” |
| **CaptionBERT-8192** | **26M** | **0.8068** | **0.6881** | **0.71** |
| ALBERT-base-v2 | 12M | 0.8067 | β€” | β€” |
| BERT-base | 110M | 0.8062 | β€” | β€” |
| DistilBERT-base | 66M | 0.8055 | β€” | β€” |
| ModernBERT-base | 149M | 0.8038 | β€” | β€” |

**\#2/6 on paraphrase detection.** 0.005 F1 behind RoBERTa, ahead of every other teacher. No classification head β€” pure cosine similarity with auto-discovered threshold. A model that has never seen a paraphrase pair during training nearly wins paraphrase detection.
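The auto-discovered threshold can be found with a plain sweep over observed cosine values, keeping whichever maximizes F1 on the dev pairs; a minimal sketch (the inputs here stand in for MRPC dev similarities and labels):

```python
import numpy as np

def best_f1_threshold(cos_sims: np.ndarray, labels: np.ndarray):
    """Sweep every observed cosine value as a threshold; return (best_f1, threshold).

    cos_sims: (N,) pairwise cosine scores; labels: (N,) with 1 = paraphrase.
    """
    best_f1, best_t = 0.0, 0.0
    for t in np.unique(cos_sims):
        pred = cos_sims >= t            # classify as paraphrase above threshold
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp == 0:
            continue                    # F1 undefined/zero without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_t = float(f1), float(t)
    return best_f1, best_t
```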

### Caption Embedding Quality

| Metric | Value |
|---|---|
| Self-similarity mean | 0.0040 |
| Self-similarity max | 0.7181 |
| Top-1 retrieval cosine | 0.5477 |
| Top-5 retrieval cosine | 0.4853 |

Near-zero average self-similarity across 1000 random captions β€” the embedding space has excellent discrimination. Every caption occupies its own distinct region on the hypersphere.
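The two self-similarity statistics are the mean and max of the off-diagonal entries of the pairwise cosine matrix; a sketch, assuming the embedding rows are already L2-normalized:

```python
import numpy as np

def self_similarity(embs: np.ndarray):
    """embs: (N, D), rows L2-normalized. Mean/max cosine over distinct caption pairs."""
    sims = embs @ embs.T                              # (N, N) pairwise cosine matrix
    off_diag = sims[~np.eye(len(embs), dtype=bool)]   # drop the trivial self pairs
    return float(off_diag.mean()), float(off_diag.max())
```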

### Consensus Fidelity

| Metric | Value |
|---|---|
| Val cosine to consensus | 0.862 |
| Val R@1 | 1.000 |
| Pentachoron CV | 0.082 |
| Training data | 500K CC12M captions |
| Epochs | 30 |
| Position capacity | 8,192 tokens |
| Parameters | 25,958,016 |

## How It Works

Five language models were aligned into a shared geometric space via whitened Procrustes rotation. Their normalized centroid β€” the **geometric consensus** β€” was proven to be a mathematical constant: five different random seeds produced the same consensus point to three decimal places.

This model was trained from scratch to reproduce that consensus directly from text. It distills the geometric intersection of five experts into a single small transformer.

The distillation is not standard knowledge distillation. It is multi-teacher geometric consensus distillation: the target is not any single teacher's output but the fixed point where all five teachers agree. Individual model errors cancel. What remains is the structural invariant of language understanding that five different architectures and training objectives independently discovered.

The alignment itself is directly distillable. The geometric structure is so robust that a from-scratch model learns it with R@1=1.000 from 18K examples in 80 seconds. The consensus manifold has pentachoron CV=0.084 β€” the tightest geometric regularity measured across all GEOLIP experiments β€” which means the function from text to embedding is smooth enough that sparse sampling covers it completely.
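The pipeline described above (whiten each expert's embedding cloud, rotate all clouds into a shared frame via orthogonal Procrustes, take the normalized centroid) can be sketched as follows. This is an illustrative reconstruction from the description, not the actual GEOLIP code; it assumes all experts share one embedding dimension and N > D samples:

```python
import numpy as np

def whiten(X: np.ndarray) -> np.ndarray:
    """PCA-whiten an (N, D) embedding cloud: zero mean, identity covariance."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U * np.sqrt(len(X) - 1)      # decorrelated, unit-variance scores

def procrustes(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Orthogonal rotation R minimizing ||A @ R - B||_F."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def align_and_consensus(experts):
    """experts: list of (N, D) embedding clouds for the same N captions.

    Whitens each expert, rotates every cloud into the first expert's frame,
    and returns the L2-normalized per-sample centroid: the geometric consensus.
    """
    whitened = [whiten(E) for E in experts]
    ref = whitened[0]
    aligned = [W @ procrustes(W, ref) for W in whitened]
    c = np.mean(aligned, axis=0)
    return c / np.linalg.norm(c, axis=1, keepdims=True)
```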

```
5 Expert Models (frozen)
    β”‚
    β”œβ”€β”€ BERT-base-uncased        (110M, MLM)
    β”œβ”€β”€ ModernBERT-base          (149M, MLM + rotary, 8192 ctx)
    β”œβ”€β”€ RoBERTa-base             (125M, MLM + dynamic masking)
    β”œβ”€β”€ ALBERT-base-v2           (12M, MLM + SOP + factorized)
    └── DistilBERT-base          (66M, distilled from BERT)
        β”‚
        β”œβ”€β”€ Extract pooled embeddings on 500K CC12M captions
        β”œβ”€β”€ Whitened Procrustes alignment to shared space
        β”œβ”€β”€ Consensus = normalized centroid (geometric constant)
        β”‚
        └── Train student with:
            β”œβ”€β”€ InfoNCE(student, consensus)   β€” retrieval alignment
            β”œβ”€β”€ MSE(student, consensus)       β€” direct regression
            └── Pentachoron CV β†’ 0.084        β€” geometric regularity
```
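The two student losses in the diagram can be sketched as below; the temperature value is an assumption, not a reported hyperparameter:

```python
import torch
import torch.nn.functional as F

def distill_losses(student: torch.Tensor, consensus: torch.Tensor, temp: float = 0.07):
    """student, consensus: (B, D) L2-normalized embeddings for the same batch.

    InfoNCE treats each row's matching consensus vector as the positive and the
    rest of the batch as in-batch negatives; MSE regresses onto the target directly.
    """
    logits = student @ consensus.T / temp      # (B, B) similarity matrix
    targets = torch.arange(len(student))       # positives sit on the diagonal
    info_nce = F.cross_entropy(logits, targets)
    mse = F.mse_loss(student, consensus)
    return info_nce, mse
```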

## Planned Task Heads

The 768-dim consensus embedding serves as a frozen feature extractor. Linear heads trained on task-specific data snap on top.

### Priority Heads

| Head | Architecture | Training Data | Use Case |
|---|---|---|---|
| **NLI / Entailment** | cat(a, b, \|a-b\|, a*b) β†’ Linear(3072, 3) | MNLI, SNLI | Agent reasoning validation |
| **Semantic Similarity** | Linear(768, 1) β†’ sigmoidΓ—5 | STS-B train | Push STS-B toward 0.80+ |
| **Multi-Label Tagging** | Linear(768, n_tags) β†’ sigmoid | COCO categories, Visual Genome | Predict objects/attributes from captions |
| **Paraphrase Detection** | cos(a, b) β†’ threshold (already works) | MRPC, QQP | Deduplication, reformulation detection |
| **Sentiment** | Linear(768, n_classes) | SST-2, IMDB | Content routing, sentiment analysis |
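
The NLI row's `cat(a, b, |a-b|, a*b) -> Linear(3072, 3)` architecture is the standard sentence-pair feature head; a minimal sketch over the frozen 768-dim embeddings:

```python
import torch
import torch.nn as nn

class NLIHead(nn.Module):
    """Sentence-pair head: cat(a, b, |a-b|, a*b) -> linear over 3 NLI classes."""
    def __init__(self, dim: int = 768, n_classes: int = 3):
        super().__init__()
        self.fc = nn.Linear(4 * dim, n_classes)   # 4 * 768 = 3072 input features

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (B, dim) frozen premise/hypothesis embeddings
        feats = torch.cat([a, b, (a - b).abs(), a * b], dim=-1)
        return self.fc(feats)
```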

### Extended Heads

| Head | Architecture | Training Data | Use Case |
|---|---|---|---|
| Caption Quality | Linear(768, 2) | Hallucination-annotated captions | Filter AI-generated training data |
| Cross-Encoder Reranker | cat(query, doc) β†’ Linear(1536, 1) | MS MARCO | Two-stage retrieval scoring |
| Clustering | Linear(768, 256) β†’ normalize | Unsupervised | Caption taxonomy, dataset organization |
| Relation Extraction | cat(subj_emb, obj_emb) β†’ Linear(1536, n_rel) | Visual Genome relationships | Structured scene understanding |
| Caption-Image Score | Linear(768, 256) β†’ cos with CLIP visual | CC12M image-caption pairs | Cross-modal retrieval without CLIP |

### Consensus Head Distillation

The same consensus trick applies to task heads. Train five separate NLI heads on the five frozen expert models, take the consensus prediction, distill into a single head on CaptionBERT. The head learns where all five experts agree on entailment β€” same noise cancellation, one layer instead of five.
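A sketch of that head-level consensus distillation, with the soft target taken as the mean of the expert heads' class distributions; the KL objective here is an assumption about how the distillation would be implemented, not a confirmed detail:

```python
import torch
import torch.nn.functional as F

def consensus_head_target(teacher_logits):
    """Average the expert heads' class distributions into one soft target.

    teacher_logits: list of (B, C) logit tensors, one per frozen expert head.
    """
    probs = torch.stack([F.softmax(l, dim=-1) for l in teacher_logits])
    return probs.mean(dim=0)                    # (B, C) consensus distribution

def head_distill_loss(student_logits, consensus_probs):
    """KL(consensus || student): train one head toward where the experts agree."""
    log_p = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_p, consensus_probs, reduction="batchmean")
```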

## Training Datasets β€” Current and Planned

### Current

| Dataset | Samples Used | Content | Notes |
|---|---|---|---|
| [CC12M LLaVA-Next](https://huggingface.co/datasets/CaptionEmporium/conceptual-captions-cc12m-llavanext) | 500K | Re-captioned CC12M with LLaVA-Next | Primary training data, mean ~92 tokens |

### Planned β€” Caption Saturation

The model tokenizes to 512 but has 8,192 position capacity. Longer, more complex captions will exercise the full context window and push v_cos beyond 0.862.

| Dataset | Size | Content | Why |
|---|---|---|---|
| [ShareGPT4V](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) | 1.2M | GPT-4V detailed image descriptions | Longer captions (200-500 tokens), richer vocabulary |
| [DOCCI](https://huggingface.co/datasets/google/docci) | 15K | Expert-written dense image descriptions | Extremely detailed, 100-300 words per image |
| [Localized Narratives](https://huggingface.co/datasets/google/localized-narratives) | 850K | Spoken descriptions with mouse traces | Narrative structure, temporal ordering |
| [DenseCap](https://huggingface.co/datasets/visual-genome/dense-captions) | 5.4M | Region-level dense captions | Fine-grained spatial descriptions |
| [TextCaps](https://huggingface.co/datasets/lmms-lab/TextCaps) | 145K | Captions requiring OCR reading | Text-in-image understanding |
| [VizWiz](https://huggingface.co/datasets/lmms-lab/VizWiz-VQA) | 32K | Captions from blind/low-vision users | Diverse, real-world, often longer descriptions |
| [COCO Captions](https://huggingface.co/datasets/HuggingFaceM4/COCO) | 600K | 5 captions per image, human-written | Short but high-quality, broad coverage |
| [SBU Captions](https://huggingface.co/datasets/sbu_captions) | 1M | Web-crawled image-caption pairs | Scale and diversity |

### Planned β€” Domain Extension

| Dataset | Size | Content | Why |
|---|---|---|---|
| [BookCorpus](https://huggingface.co/datasets/bookcorpus) | 11K books | Long-form narrative text | Exercise 8K context, literary language |
| [Wikipedia](https://huggingface.co/datasets/wikipedia) | 6M articles | Encyclopedic text | General knowledge, factual density |
| [Natural Questions](https://huggingface.co/datasets/google-research-datasets/natural_questions) | 300K | Question-answer pairs | QA capability for retrieval heads |
| [MS MARCO](https://huggingface.co/datasets/microsoft/ms_marco) | 1M | Passages + queries | Retrieval training for reranker head |

## Architecture

```
Input text
    β”‚
    β”œβ”€β”€ BERT WordPiece tokenizer (30,522 vocab)
    β”œβ”€β”€ Token embeddings (384-dim)
    β”œβ”€β”€ Position embeddings (8,192 capacity)
    β”‚
    β”œβ”€β”€ 6Γ— Transformer Encoder Layer
    β”‚   (384-dim, 6 heads, 1536 FFN, GELU, pre-norm)
    β”‚
    β”œβ”€β”€ Mean pool over non-padding tokens
    β”œβ”€β”€ Projection: 384 β†’ 384 β†’ GELU β†’ LN β†’ 768
    └── L2 normalize
        β”‚
        └── (B, 768) consensus-aligned embedding
```

## Usage

```python
import torch
from transformers import AutoTokenizer
from caption_encoder import CaptionEncoder

# Load
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = CaptionEncoder(
    vocab_size=30522, max_len=8192, d_model=384,
    n_heads=6, n_layers=6, d_ff=1536, output_dim=768,
    dropout=0.0, pad_token_id=0)
model.load_state_dict(torch.load("best_model.pt", weights_only=True))
model.eval()

# Encode
texts = ["A cat sitting on a windowsill", "A dog playing fetch on the beach"]
tokens = tokenizer(texts, max_length=512, padding="max_length",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(tokens["input_ids"], tokens["attention_mask"])

# embeddings: (2, 768) L2-normalized
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.3f}")
```

## Training Curve

| Epoch | t_cos | v_cos | v_cv | Time |
|---|---|---|---|---|
| 1 | 0.804 | 0.803 | 0.104 | 689s |
| 5 | 0.819 | 0.819 | 0.086 | 689s |
| 10 | 0.831 | 0.829 | 0.087 | 689s |
| 15 | 0.842 | 0.840 | 0.078 | 688s |
| 20 | 0.851 | 0.849 | 0.078 | 690s |
| 25 | 0.860 | 0.859 | 0.092 | 689s |
| 30 | 0.863 | 0.862 | 0.082 | 689s |

R@1=1.000 and t_acc=1.000 throughout all 30 epochs. Train/val gap < 0.002 β€” no overfitting on 500K samples.

## GEOLIP Family

| System | Type | Params | Output |
|---|---|---|---|
| [CLIP-L ctx576](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576) | Memory bank | 34M | pooled (768,) |
| [CLIP-L seq77](https://huggingface.co/AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77) | Memory + sequence | 53M | pooled + seq (77, 768) |
| [Meridian bigG](https://huggingface.co/AbstractPhil/geolip-clip-vit-bigG-patch14-ctx576-seq77) | Memory + sequence | 167M | pooled + seq (77, 1280) |
| [Conduit v0](https://huggingface.co/AbstractPhil/geolip-bertenstein) | Multi-expert hub | 8.8M | aligned (1024,) |
| **CaptionBERT-8192** | **Consensus distilled** | **26M** | **consensus (768,)** |

## Citation

See [Geometric Memory Part I](https://huggingface.co/blog/AbstractPhil/geometric-memory-ft1) and Part II for the full methodology, including the pentachoron consensus proof, whitened Procrustes alignment, compositional convolution experiments, and the path from accumulation-based memory to alignment-based distillation.

## License

Apache 2.0