LeVLJEPA+ – vit_base_patch16_224 / CC12M

Official checkpoint for LeVLJEPA+, a non-contrastive vision-language pretraining method based on joint-embedding prediction and SIGReg regularisation. Trained on CC12M for 50,000 steps with batch size 2,048.

Model summary

| Property | Value |
|---|---|
| Vision encoder | vit_base_patch16_224 (timm) |
| Text encoder | GPT-2 (12L / 12H / 768D) |
| Embedding dim | 768 |
| Projector | 4-layer MLP, width 2048 |
| Training objective | Cross-modal prediction + SIGReg |
| Multi-view (DINO-style crops) | Yes |
| Training data | CC12M (~12M image-caption pairs) |
| Training steps | 50,000 |

Method

LeVLJEPA+ aligns image and text embeddings through predictive losses rather than contrastive classification:

  1. Cross-modal prediction – image embeddings predict stop-gradient text embeddings and vice versa via modality-specific MLP predictors.
  2. SIGReg regularisation – each modality's marginal embedding distribution is independently regularised toward an isotropic Gaussian, preventing representation collapse without needing negative pairs.
  3. Visual multi-view prediction (LeVLJEPA+ only) – DINO-style global/local crops are generated and a consistency loss encourages all views of the same image to agree in representation space.

The objective uses no negatives, no temperature parameter, and no momentum encoder.
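
To make the objective concrete, here is a minimal training-loss sketch, not the official implementation: the predictor architecture is assumed (a 2-layer MLP), and SIGReg is replaced by a simple moment-matching surrogate that pushes each batch toward zero mean and identity covariance; the actual SIGReg statistic differs. The multi-view consistency term of LeVLJEPA+ is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED = 768

# Modality-specific predictors (architecture assumed for illustration).
img2txt = nn.Sequential(nn.Linear(EMBED, 2048), nn.GELU(), nn.Linear(2048, EMBED))
txt2img = nn.Sequential(nn.Linear(EMBED, 2048), nn.GELU(), nn.Linear(2048, EMBED))

def gaussian_reg(z):
    # Surrogate for SIGReg: penalise deviation of the batch mean from zero
    # and of the batch covariance from the identity (isotropic Gaussian target).
    mean = z.mean(dim=0)
    zc = z - mean
    cov = (zc.T @ zc) / (z.shape[0] - 1)
    eye = torch.eye(z.shape[1], device=z.device)
    return mean.pow(2).sum() + (cov - eye).pow(2).mean()

def levljepa_loss(img_emb, txt_emb, lam=1.0):
    # Cross-modal prediction: each modality predicts the stop-gradient
    # embedding of the other, so no negatives or temperature are needed.
    pred_txt = F.normalize(img2txt(img_emb), dim=-1)
    pred_img = F.normalize(txt2img(txt_emb), dim=-1)
    tgt_txt  = F.normalize(txt_emb.detach(), dim=-1)
    tgt_img  = F.normalize(img_emb.detach(), dim=-1)
    pred_loss = (1 - (pred_txt * tgt_txt).sum(-1)).mean() \
              + (1 - (pred_img * tgt_img).sum(-1)).mean()
    reg = gaussian_reg(img_emb) + gaussian_reg(txt_emb)
    return pred_loss + lam * reg

img_emb = torch.randn(256, EMBED)  # placeholder batch of image embeddings
txt_emb = torch.randn(256, EMBED)  # placeholder batch of text embeddings
loss = levljepa_loss(img_emb, txt_emb)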

Usage

import torch
import torch.nn as nn
import timm
from transformers import GPT2Config, GPT2Model, AutoTokenizer
from safetensors.torch import load_file

# LeVLJEPA+ adds a multi-view consistency loss during training;
# at inference the vision encoder is used identically to LeVLJEPA.

HIDDEN = 768
EMBED  = 768

# ── Vision encoder ───────────────────────────────────────────────────────────
vision_encoder = timm.create_model(
    "vit_base_patch16_224", pretrained=False, num_classes=0, dynamic_img_size=True
)
vision_pre_proj = nn.Sequential(
    nn.Linear(HIDDEN, 2048), nn.BatchNorm1d(2048), nn.GELU(), nn.Linear(2048, EMBED)
)

# ── Text encoder ─────────────────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

text_encoder = GPT2Model(GPT2Config(
    n_embd=HIDDEN, n_layer=12, n_head=12,
    n_inner=HIDDEN * 4, vocab_size=tokenizer.vocab_size,
    attn_pdrop=0.0, resid_pdrop=0.0, embd_pdrop=0.0,
))
text_pre_proj = nn.Sequential(
    nn.Linear(HIDDEN, 2048), nn.BatchNorm1d(2048), nn.GELU(), nn.Linear(2048, EMBED)
)

# ── Load weights ─────────────────────────────────────────────────────────────
from huggingface_hub import hf_hub_download

vision_weights = load_file(hf_hub_download("Machine-Learning-Oncology/LeVLJEPA-plus-ViT-B-CC12M", "vision_encoder.safetensors"))
text_weights   = load_file(hf_hub_download("Machine-Learning-Oncology/LeVLJEPA-plus-ViT-B-CC12M", "text_encoder.safetensors"))

# Each checkpoint also stores the cross-modal projector (projector.*);
# it is not needed for feature extraction, so those keys are filtered out.
vision_encoder.load_state_dict({k[len("encoder."):]: v for k, v in vision_weights.items() if k.startswith("encoder.")})
vision_pre_proj.load_state_dict({k[len("pre_proj."):]: v for k, v in vision_weights.items() if k.startswith("pre_proj.")})
text_encoder.load_state_dict({k[len("encoder."):]: v for k, v in text_weights.items() if k.startswith("encoder.")})
text_pre_proj.load_state_dict({k[len("pre_proj."):]: v for k, v in text_weights.items() if k.startswith("pre_proj.")})

vision_encoder.eval()
vision_pre_proj.eval()  # BatchNorm must run in eval mode for single-image batches
text_encoder.eval()
text_pre_proj.eval()

# ── Encode an image ──────────────────────────────────────────────────────────
from torchvision import transforms
from PIL import Image

transform = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("image.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0)

with torch.no_grad():
    image_features = vision_pre_proj(vision_encoder(pixel_values))  # (1, 768)

# ── Encode a caption ─────────────────────────────────────────────────────────
inputs = tokenizer("a photo of a cat", return_tensors="pt", padding=True)
with torch.no_grad():
    # Last-token pooling; for right-padded batches, index the last non-pad
    # position via the attention mask instead of -1.
    text_hidden = text_encoder(**inputs).last_hidden_state[:, -1, :]
    text_features = text_pre_proj(text_hidden)  # (1, 768)
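
Image and caption embeddings live in the same 768-dimensional space, so image-text pairs can be scored directly; the L2 normalisation and cosine scoring below are a common-practice assumption rather than documented behaviour of this checkpoint.

# ── Score image-caption similarity ───────────────────────────────────────────
import torch.nn.functional as F

image_features = F.normalize(image_features, dim=-1)
text_features  = F.normalize(text_features, dim=-1)
similarity = (image_features @ text_features.T).item()
print(f"cosine similarity: {similarity:.4f}")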

Files

| File | Contents |
|---|---|
| vision_encoder.safetensors | Vision encoder (encoder.*), pre-projection head (pre_proj.*), and cross-modal projector MLP (projector.*) |
| text_encoder.safetensors | Text encoder (encoder.*), pre-projection head (pre_proj.*), and cross-modal projector MLP (projector.*) |
| config.json | Architecture and training hyperparameters |
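
The bundled config.json records the architecture and training hyperparameters; the snippet below simply downloads and prints it verbatim, making no assumptions about its field names.

import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("Machine-Learning-Oncology/LeVLJEPA-plus-ViT-B-CC12M", "config.json")
with open(cfg_path) as f:
    print(json.dumps(json.load(f), indent=2))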

Citation

@article{levljepa2026,
  title  = {LeVLJEPA: End-to-End Vision-Language Pretraining Without Contrastive Negatives},
  author = {Kuhn, Lukas and Serra, Giuseppe and Balestriero, Randall and Buettner, Florian},
  year   = {2026},
}