# LeVLJEPA+: End-to-End Vision-Language Pretraining Without Contrastive Negatives
Official checkpoint for LeVLJEPA+, a non-contrastive vision-language pretraining method based on joint-embedding prediction and SIGReg regularisation. Trained on CC12M for 50,000 steps with batch size 2,048.
| Property | Value |
|---|---|
| Vision encoder | vit_base_patch16_224 (timm) |
| Text encoder | GPT-2 (12L / 12H / 768D) |
| Embedding dim | 768 |
| Projector | 4-layer MLP, width 2048 |
| Training objective | Cross-modal prediction + SIGReg |
| Multi-view (DINO-style crops) | Yes (see sketch below) |
| Training data | CC12M (~12 M image-caption pairs) |
| Training steps | 50,000 |
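The multi-view row refers to DINO-style crops applied during pretraining only. A minimal sketch of such a pipeline, assuming typical DINO-style settings; the crop sizes, counts, and scale ranges below are illustrative, not read from the released config:

```python
# Illustrative DINO-style multi-crop augmentation; all hyperparameters here
# are assumptions, not LeVLJEPA+'s actual training configuration.
from torchvision import transforms

global_crop = transforms.RandomResizedCrop(224, scale=(0.4, 1.0))
local_crop = transforms.RandomResizedCrop(96, scale=(0.05, 0.4))

def multi_crop(image, n_global=2, n_local=4):
    views = [global_crop(image) for _ in range(n_global)]
    views += [local_crop(image) for _ in range(n_local)]
    return views
```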
LeVLJEPA+ aligns image and text embeddings through predictive losses rather than contrastive classification: the objective uses no negatives, no temperature parameter, and no momentum encoder.
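As a rough illustration of what a negative-free objective of this shape can look like, here is a minimal sketch. Everything in it is an assumption for illustration: `img_to_txt` and `txt_to_img` are hypothetical cross-modal predictor heads, and `sigreg` is a generic isotropic-Gaussian penalty standing in for the actual SIGReg statistic defined in the paper.

```python
# A minimal sketch, NOT the official objective. The cross-modal term is
# assumed to be plain regression between embeddings, and sigreg() is a
# generic stand-in for the SIGReg statistic from the paper.
import torch
import torch.nn.functional as F

def sigreg(z: torch.Tensor) -> torch.Tensor:
    # Penalize deviation of batch statistics from an isotropic Gaussian
    # (zero mean, identity covariance).
    mean = z.mean(dim=0)
    cov = torch.cov(z.T)
    eye = torch.eye(z.shape[1], device=z.device)
    return mean.pow(2).sum() + (cov - eye).pow(2).sum()

def jepa_style_loss(img_emb, txt_emb, img_to_txt, txt_to_img, lam=1.0):
    # Each modality predicts the other's embedding: no negatives,
    # no temperature, no momentum encoder.
    pred = F.mse_loss(img_to_txt(img_emb), txt_emb) + \
           F.mse_loss(txt_to_img(txt_emb), img_emb)
    reg = sigreg(img_emb) + sigreg(txt_emb)
    return pred + lam * reg
```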
The following script rebuilds both encoders and loads the released weights:

```python
import torch
import torch.nn as nn
import timm
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import GPT2Config, GPT2Model, AutoTokenizer

# LeVLJEPA+ adds a multi-view consistency loss during training;
# at inference the vision encoder is used identically to LeVLJEPA.
HIDDEN = 768  # encoder hidden size
EMBED = 768   # shared embedding dimension
# ── Vision encoder ───────────────────────────────────────────────────────────
vision_encoder = timm.create_model(
    "vit_base_patch16_224", pretrained=False, num_classes=0, dynamic_img_size=True
)
vision_pre_proj = nn.Sequential(
    nn.Linear(HIDDEN, 2048), nn.BatchNorm1d(2048), nn.GELU(), nn.Linear(2048, EMBED)
)
# ── Text encoder ─────────────────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
text_encoder = GPT2Model(GPT2Config(
    n_embd=HIDDEN, n_layer=12, n_head=12,
    n_inner=HIDDEN * 4, vocab_size=tokenizer.vocab_size,
    attn_pdrop=0.0, resid_pdrop=0.0, embd_pdrop=0.0,
))
text_pre_proj = nn.Sequential(
    nn.Linear(HIDDEN, 2048), nn.BatchNorm1d(2048), nn.GELU(), nn.Linear(2048, EMBED)
)
# ── Load weights ─────────────────────────────────────────────────────────────
repo_id = "Machine-Learning-Oncology/LeVLJEPA-plus-ViT-B-CC12M"
vision_weights = load_file(hf_hub_download(repo_id, "vision_encoder.safetensors"))
text_weights = load_file(hf_hub_download(repo_id, "text_encoder.safetensors"))
vision_encoder.load_state_dict({k[len("encoder."):]: v for k, v in vision_weights.items() if k.startswith("encoder.")})
vision_pre_proj.load_state_dict({k[len("pre_proj."):]: v for k, v in vision_weights.items() if k.startswith("pre_proj.")})
text_encoder.load_state_dict({k[len("encoder."):]: v for k, v in text_weights.items() if k.startswith("encoder.")})
text_pre_proj.load_state_dict({k[len("pre_proj."):]: v for k, v in text_weights.items() if k.startswith("pre_proj.")})
vision_encoder.eval()
vision_pre_proj.eval()  # BatchNorm1d must be in eval mode for batch size 1
text_encoder.eval()
text_pre_proj.eval()
# ── Encode an image ──────────────────────────────────────────────────────────
from torchvision import transforms
from PIL import Image

transform = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),  # ImageNet stats
])
image = Image.open("image.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0)
with torch.no_grad():
    image_features = vision_pre_proj(vision_encoder(pixel_values))  # (1, 768)
# ── Encode a caption ─────────────────────────────────────────────────────────
inputs = tokenizer("a photo of a cat", return_tensors="pt", padding=True)
with torch.no_grad():
    # Last-token pooling: GPT-2 is causal, so the final position attends to
    # the whole caption. For batched captions with right padding, index each
    # sequence's true last token instead.
    text_hidden = text_encoder(**inputs).last_hidden_state[:, -1, :]
    text_features = text_pre_proj(text_hidden)  # (1, 768)
```
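The two feature vectors live in the same 768-dimensional space and can be scored directly. A usage sketch, assuming cosine similarity over L2-normalized features (whether the official evaluation normalizes is not stated here):

```python
# Score the image-caption pair; cosine similarity is an assumption.
import torch.nn.functional as F

similarity = F.cosine_similarity(image_features, text_features).item()
print(f"image-caption similarity: {similarity:.3f}")
```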
| File | Contents |
|---|---|
| `vision_encoder.safetensors` | Vision encoder (`encoder.*`), pre-projection head (`pre_proj.*`), and cross-modal projector MLP (`projector.*`) |
| `text_encoder.safetensors` | Text encoder (`encoder.*`), pre-projection head (`pre_proj.*`), and cross-modal projector MLP (`projector.*`) |
| `config.json` | Architecture and training hyperparameters |
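As a quick sanity check (not an official utility), the safetensors header lists parameter groups without loading any tensors, so the key prefixes in the table can be verified directly:

```python
# List top-level key prefixes in the vision checkpoint.
from huggingface_hub import hf_hub_download
from safetensors import safe_open

path = hf_hub_download(
    "Machine-Learning-Oncology/LeVLJEPA-plus-ViT-B-CC12M",
    "vision_encoder.safetensors",
)
with safe_open(path, framework="pt") as f:
    print(sorted({k.split(".", 1)[0] for k in f.keys()}))
# expected: ['encoder', 'pre_proj', 'projector']
```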
```bibtex
@article{levljepa2026,
  title  = {LeVLJEPA: End-to-End Vision-Language Pretraining Without Contrastive Negatives},
  author = {Kuhn, Lukas and Serra, Giuseppe and Balestriero, Randall and Buettner, Florian},
  year   = {2026},
}
```