Paper: [Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture](https://arxiv.org/abs/2301.08243) (I-JEPA).
A 302M-parameter Vision Transformer (ViT-Large/8) pretrained with I-JEPA on global Sentinel-2 imagery. Used by forestWHY to produce six differential attention/embedding panels from a temporal pair of 13-band Sentinel-2 tiles.
| Field | Value |
|---|---|
| Backbone | ViT-Large/8 |
| Embed dim | 1024 |
| Depth | 24 transformer blocks |
| Heads | 16 |
| Patch size | 8 × 8 |
| Input | 13-band Sentinel-2, 64 × 64 |
| Pretraining | I-JEPA (no contrastive heads) |
Input bands (channel order): B1, B2, B3, B4, B5, B6, B7, B8, B8A, B9, B10, B11, B12
(Sentinel-Hub naming).
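As a reference for the expected layout, here is a minimal sketch of assembling an input cube in this channel order; the `stack_bands` helper and the per-band array dictionary are illustrative assumptions, and any radiometric scaling or normalization is left to your own preprocessing:

```python
import numpy as np
import torch

# Sentinel-2 bands in the channel order expected by the encoder.
BAND_ORDER = ["B1", "B2", "B3", "B4", "B5", "B6", "B7",
              "B8", "B8A", "B9", "B10", "B11", "B12"]

def stack_bands(bands: dict[str, np.ndarray]) -> torch.Tensor:
    """Stack per-band 64 x 64 arrays into a (1, 13, 64, 64) float tensor.

    `bands` maps Sentinel-2 band names to 64 x 64 arrays; scaling and
    normalization are not handled here.
    """
    cube = np.stack([bands[name] for name in BAND_ORDER], axis=0)
    return torch.from_numpy(cube).float().unsqueeze(0)  # (1, 13, 64, 64)
```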
Two-stage pretraining:
`s2_ijepa_gee_vitl_full_encoder_final.pt` – encoder-only checkpoint (1.1 GB).
Loadable via the `S2Encoder` class in `forestwhy/cookbook/src/forestwhy/jepa.py`:

```python
from huggingface_hub import hf_hub_download
import torch

# Drop the S2Encoder class from forestwhy.jepa into your project, then:
from forestwhy.jepa import S2Encoder

# Download the encoder-only checkpoint from the Hub.
ckpt_path = hf_hub_download(
    repo_id="Siddharth63/forestWHY-JEPA-vitl",
    filename="s2_ijepa_gee_vitl_full_encoder_final.pt",
)

# Restore the encoder weights and switch to inference mode.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
encoder = S2Encoder(embed_dim=1024, depth=24, num_heads=16)
encoder.load_state_dict(ckpt["encoder"])
encoder.eval()
```
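As a quick sanity check, the loaded encoder can be run on a dummy tile. The exact output layout depends on the `S2Encoder` implementation (e.g. whether a class token is prepended); the shape below assumes a plain ViT-Large/8 over a 64 × 64 input, i.e. 8 × 8 = 64 patch tokens of dimension 1024:

```python
import torch

# Dummy 13-band, 64 x 64 tile; replace with real, preprocessed Sentinel-2 data.
x = torch.randn(1, 13, 64, 64)

with torch.no_grad():
    tokens = encoder(x)

# Expected to be (1, 64, 1024) for a plain ViT-Large/8 without a class token.
print(tokens.shape)
```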
For the full panel-generation pipeline (six differential panels per temporal pair), see `forestwhy.jepa.make_jepa_panels`.
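The panels themselves are produced by `make_jepa_panels`; as an illustration of the underlying idea only (not the forestWHY implementation), a single differential panel can be sketched as a per-patch cosine-distance map between the embeddings of the two acquisition dates:

```python
import torch
import torch.nn.functional as F

def embedding_change_map(encoder, x_t0, x_t1, grid=8):
    """Cosine-distance map between patch embeddings of two acquisition dates.

    x_t0, x_t1: (1, 13, 64, 64) tensors for the earlier/later tile.
    Assumes the encoder returns one token per 8 x 8 patch (no class token).
    Larger values indicate larger embedding change between the two dates.
    """
    with torch.no_grad():
        e0 = encoder(x_t0)  # (1, 64, 1024)
        e1 = encoder(x_t1)  # (1, 64, 1024)
    change = 1.0 - F.cosine_similarity(e0, e1, dim=-1)  # (1, 64)
    return change.reshape(grid, grid).cpu().numpy()
```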
MIT. The training data is © Copernicus / ESA (Sentinel-2), provided under the Copernicus open licence.
Built for the Liquid AI hackathon LFM track.