LeVLJEPA+ – vit_base_patch16_224 / CC12M

Official checkpoint for LeVLJEPA+, a non-contrastive vision-language pretraining method based on joint-embedding prediction and SIGReg regularisation. Trained on CC12M for 50,000 steps with batch size 2,048.

Model summary

| Property | Value |
|---|---|
| Vision encoder | vit_base_patch16_224 (timm) |
| Text encoder | GPT-2 (12L / 12H / 768D) |
| Embedding dim | 768 |
| Projector | 4-layer MLP, width 2048 |
| Training objective | Cross-modal prediction + SIGReg |
| Multi-view (DINO-style crops) | Yes |
| Training data | CC12M (~12M image-caption pairs) |
| Training steps | 50,000 |

Method

LeVLJEPA+ aligns image and text embeddings through predictive losses rather than contrastive classification:

  1. Cross-modal prediction – image embeddings predict stop-gradient text embeddings and vice versa via modality-specific MLP predictors.
  2. SIGReg regularisation – each modality's marginal embedding distribution is independently regularised toward an isotropic Gaussian, preventing representation collapse without needing negative pairs.
  3. Visual multi-view prediction (LeVLJEPA+ only) – DINO-style global/local crops are generated and a consistency loss encourages all views of the same image to agree in representation space.

The objective uses no negatives, no temperature parameter, and no momentum encoder.
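
To make the objective concrete, here is a minimal training-loss sketch, not the official implementation: the predictor architecture is assumed (a 2-layer MLP), and SIGReg is replaced by a simple moment-matching surrogate that pushes each batch toward zero mean and identity covariance; the actual SIGReg statistic differs. The multi-view consistency term of LeVLJEPA+ is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED = 768

# Modality-specific predictors (architecture assumed for illustration).
img2txt = nn.Sequential(nn.Linear(EMBED, 2048), nn.GELU(), nn.Linear(2048, EMBED))
txt2img = nn.Sequential(nn.Linear(EMBED, 2048), nn.GELU(), nn.Linear(2048, EMBED))

def gaussian_reg(z):
    # Surrogate for SIGReg: penalise deviation of the batch mean from zero
    # and of the batch covariance from the identity (isotropic Gaussian target).
    mean = z.mean(dim=0)
    zc = z - mean
    cov = (zc.T @ zc) / (z.shape[0] - 1)
    eye = torch.eye(z.shape[1], device=z.device)
    return mean.pow(2).sum() + (cov - eye).pow(2).mean()

def levljepa_loss(img_emb, txt_emb, lam=1.0):
    # Cross-modal prediction: each modality predicts the stop-gradient
    # embedding of the other, so no negatives or temperature are needed.
    pred_txt = F.normalize(img2txt(img_emb), dim=-1)
    pred_img = F.normalize(txt2img(txt_emb), dim=-1)
    tgt_txt  = F.normalize(txt_emb.detach(), dim=-1)
    tgt_img  = F.normalize(img_emb.detach(), dim=-1)
    pred_loss = (1 - (pred_txt * tgt_txt).sum(-1)).mean() \
              + (1 - (pred_img * tgt_img).sum(-1)).mean()
    reg = gaussian_reg(img_emb) + gaussian_reg(txt_emb)
    return pred_loss + lam * reg

img_emb = torch.randn(256, EMBED)  # placeholder batch of image embeddings
txt_emb = torch.randn(256, EMBED)  # placeholder batch of text embeddings
loss = levljepa_loss(img_emb, txt_emb)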

Usage

import torch
import torch.nn as nn
import timm
from transformers import GPT2Config, GPT2Model, AutoTokenizer
from safetensors.torch import load_file

# LeVLJEPA+ adds a multi-view consistency loss during training;
# at inference the vision encoder is used identically to LeVLJEPA.

HIDDEN = 768
EMBED  = 768

# ── Vision encoder ───────────────────────────────────────────────────────────
vision_encoder = timm.create_model(
    "vit_base_patch16_224", pretrained=False, num_classes=0, dynamic_img_size=True
)
vision_pre_proj = nn.Sequential(
    nn.Linear(HIDDEN, 2048), nn.BatchNorm1d(2048), nn.GELU(), nn.Linear(2048, EMBED)
)

# ── Text encoder ─────────────────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

text_encoder = GPT2Model(GPT2Config(
    n_embd=HIDDEN, n_layer=12, n_head=12,
    n_inner=HIDDEN * 4, vocab_size=tokenizer.vocab_size,
    attn_pdrop=0.0, resid_pdrop=0.0, embd_pdrop=0.0,
))
text_pre_proj = nn.Sequential(
    nn.Linear(HIDDEN, 2048), nn.BatchNorm1d(2048), nn.GELU(), nn.Linear(2048, EMBED)
)

# ── Load weights ─────────────────────────────────────────────────────────────
from huggingface_hub import hf_hub_download

vision_weights = load_file(hf_hub_download("Machine-Learning-Oncology/LeVLJEPA-plus-ViT-B-CC12M", "vision_encoder.safetensors"))
text_weights   = load_file(hf_hub_download("Machine-Learning-Oncology/LeVLJEPA-plus-ViT-B-CC12M", "text_encoder.safetensors"))

# Each checkpoint also stores the cross-modal projector (projector.*);
# it is not needed for feature extraction, so those keys are filtered out.
vision_encoder.load_state_dict({k[len("encoder."):]: v for k, v in vision_weights.items() if k.startswith("encoder.")})
vision_pre_proj.load_state_dict({k[len("pre_proj."):]: v for k, v in vision_weights.items() if k.startswith("pre_proj.")})
text_encoder.load_state_dict({k[len("encoder."):]: v for k, v in text_weights.items() if k.startswith("encoder.")})
text_pre_proj.load_state_dict({k[len("pre_proj."):]: v for k, v in text_weights.items() if k.startswith("pre_proj.")})

vision_encoder.eval()
vision_pre_proj.eval()  # BatchNorm must run in eval mode for single-image batches
text_encoder.eval()
text_pre_proj.eval()

# ── Encode an image ──────────────────────────────────────────────────────────
from torchvision import transforms
from PIL import Image

transform = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("image.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0)

with torch.no_grad():
    image_features = vision_pre_proj(vision_encoder(pixel_values))  # (1, 768)

# ── Encode a caption ─────────────────────────────────────────────────────────
inputs = tokenizer("a photo of a cat", return_tensors="pt", padding=True)
with torch.no_grad():
    # Last-token pooling; for right-padded batches, index the last non-pad
    # position via the attention mask instead of -1.
    text_hidden = text_encoder(**inputs).last_hidden_state[:, -1, :]
    text_features = text_pre_proj(text_hidden)  # (1, 768)
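
Image and caption embeddings live in the same 768-dimensional space, so image-text pairs can be scored directly; the L2 normalisation and cosine scoring below are a common-practice assumption rather than documented behaviour of this checkpoint.

# ── Score image-caption similarity ───────────────────────────────────────────
import torch.nn.functional as F

image_features = F.normalize(image_features, dim=-1)
text_features  = F.normalize(text_features, dim=-1)
similarity = (image_features @ text_features.T).item()
print(f"cosine similarity: {similarity:.4f}")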

Files

| File | Contents |
|---|---|
| vision_encoder.safetensors | Vision encoder (encoder.*), pre-projection head (pre_proj.*), and cross-modal projector MLP (projector.*) |
| text_encoder.safetensors | Text encoder (encoder.*), pre-projection head (pre_proj.*), and cross-modal projector MLP (projector.*) |
| config.json | Architecture and training hyperparameters |
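
The bundled config.json records the architecture and training hyperparameters; the snippet below simply downloads and prints it verbatim, making no assumptions about its field names.

import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("Machine-Learning-Oncology/LeVLJEPA-plus-ViT-B-CC12M", "config.json")
with open(cfg_path) as f:
    print(json.dumps(json.load(f), indent=2))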

Citation

@article{levljepa2026,
  title  = {LeVLJEPA: End-to-End Vision-Language Pretraining Without Contrastive Negatives},
  author = {Kuhn, Lukas and Serra, Giuseppe and Balestriero, Randall and Buettner, Florian},
  year   = {2026},
}