RetJEPA

Pure predictive JEPA prototype for evidence retrieval.

The first implementation intentionally trains without a contrastive ranking loss:

query -> BERT query encoder -> Transformer predictor -> CLS pool -> z_pred
positive passage -> frozen BERT target/context encoder -> CLS pool -> z_target

loss = mse(normalize(z_pred), normalize(stopgrad(z_target)))
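The loss above can be illustrated with a minimal plain-Python sketch. It assumes the MSE is averaged over embedding dimensions (the reduction is not stated here), and the helper names `l2_normalize`, `jepa_loss`, and `cosine` are hypothetical, not taken from the repository. In real training the target would come from the frozen encoder with gradients stopped; plain floats carry no gradient, so that step is implicit.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (epsilon guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, 1e-12) for x in vec]

def jepa_loss(z_pred, z_target):
    """Mean squared error between unit-normalized prediction and target.

    Assumption: mean reduction over dimensions. z_target is treated as a
    constant, which is what stopgrad achieves under autograd.
    """
    p = l2_normalize(z_pred)
    t = l2_normalize(z_target)
    return sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)

def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    p, t = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(p, t))

# Toy embeddings, for illustration only.
z_pred = [0.5, 1.0, -2.0, 0.25]
z_target = [1.0, 0.5, -1.5, 0.0]

loss = jepa_loss(z_pred, z_target)
# For unit vectors, ||p - t||^2 = 2 - 2 * cos(p, t), so with mean reduction
# the loss is a rescaled cosine distance.
expected = (2.0 - 2.0 * cosine(z_pred, z_target)) / len(z_pred)
```

Because both embeddings are normalized, minimizing this loss is equivalent to maximizing the cosine similarity between the predicted and target embeddings; no negatives enter the objective.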

UV Setup

Use Python 3.10-3.12 for the ML stack. UV is the intended package manager so the resolved package set stays stable across machines.

uv sync

The repository includes uv.lock; run uv sync on the GPU machine before training so it installs the same locked package versions.

Smoke Validation

Before launching real training, run the local tiny-model smoke test:

bash scripts/smoke_tiny.sh

This creates a tiny BERT model and tiny pair/corpus data under .tmp/retjepa_smoke, runs one pure-JEPA training step with gradient accumulation, writes a checkpoint, and evaluates frozen_bert vs retjepa. It does not download BERT or BGE-M3.

Data Format

Training/evaluation pairs are JSONL, CSV, or TSV with at least:

{"query_id": "q1", "passage_id": "p1", "query": "question text", "passage": "positive evidence passage"}

Evaluation can also use a separate retrieval corpus:

{"passage_id": "p1", "passage": "candidate passage text"}

The distinction matters: pair files define positive supervision and qrels, while the corpus file defines the candidate retrieval pool. If no corpus is configured, evaluation falls back to the unique positive passages from the pair file, which is useful only for smoke tests.
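The relationship between pair files, qrels, and the fallback corpus can be sketched as follows. This is a minimal illustration for the JSONL case only; the function names `load_pairs`, `build_qrels`, and `fallback_corpus` are hypothetical and may not match the repository's own loaders.

```python
import json

REQUIRED_FIELDS = ("query_id", "passage_id", "query", "passage")

def load_pairs(lines):
    """Parse JSONL pair records and check the four required fields."""
    pairs = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_FIELDS if key not in record]
        if missing:
            raise ValueError(f"pair record missing fields: {missing}")
        pairs.append(record)
    return pairs

def build_qrels(pairs):
    """Map each query_id to the set of its positive passage_ids."""
    qrels = {}
    for pair in pairs:
        qrels.setdefault(pair["query_id"], set()).add(pair["passage_id"])
    return qrels

def fallback_corpus(pairs):
    """Candidate pool used when no corpus file is configured:
    the unique positive passages found in the pair file."""
    corpus = {}
    for pair in pairs:
        corpus.setdefault(pair["passage_id"], pair["passage"])
    return corpus

# Toy records in the documented pair format.
sample_lines = [
    '{"query_id": "q1", "passage_id": "p1", "query": "question text", "passage": "positive evidence passage"}',
    '{"query_id": "q2", "passage_id": "p1", "query": "another question", "passage": "positive evidence passage"}',
    '{"query_id": "q2", "passage_id": "p2", "query": "another question", "passage": "second passage"}',
]
pairs = load_pairs(sample_lines)
qrels = build_qrels(pairs)
corpus = fallback_corpus(pairs)
```

With the fallback, every candidate is a positive for some query, so retrieval is far easier than against a full corpus with many unrelated passages.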

Convert a local table into the expected pair/corpus files:

bash scripts/prepare_pairs.sh \
  --input raw_pairs.jsonl \
  --query-field query \
  --passage-field passage \
  --query-id-field query_id \
  --passage-id-field passage_id \
  --output-pairs data/nq_small/train_pairs.jsonl \
  --output-corpus data/nq_small/corpus.jsonl

The same converter can read HuggingFace datasets with --dataset, --config-name, and --split when the dataset already exposes query and positive-passage fields.

Full NQ Preparation

For the first full NQ run, use ir_datasets through the dedicated script:

uv sync
bash scripts/prepare_nq_full.sh --output-dir data/nq_full

This writes:

data/nq_full/train_pairs.jsonl
data/nq_full/dev_pairs.jsonl
data/nq_full/corpus.jsonl

By default this uses the DPR Wikipedia-100 NQ retrieval setup:

dpr-w100/natural-questions/train
dpr-w100/natural-questions/dev
dpr-w100

This is the standard DPR-style full-Wikipedia NQ retrieval setup: around 59K train queries, 8.9M train qrels, 6.5K dev queries, 980K dev qrels, and about 21M corpus passages according to the ir_datasets DPR Wiki100 documentation. The preparation script writes positive-only pair files by default using relevance >= 1, so DPR hard negatives are not used by the first pure-prediction objective.
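The positive-only filtering described above can be sketched as below. The function `positive_pairs`, the tuple layout of the judgments, and the toy data are assumptions for illustration; the actual preparation script reads its inputs through ir_datasets and may be organized differently.

```python
def positive_pairs(qrels, queries, passages, min_relevance=1):
    """Yield pair records for judgments with relevance >= min_relevance.

    qrels:    iterable of (query_id, passage_id, relevance) tuples
    queries:  dict mapping query_id -> query text
    passages: dict mapping passage_id -> passage text
    """
    for query_id, passage_id, relevance in qrels:
        if relevance < min_relevance:
            continue  # lower-scored judgments are not used as positives
        yield {
            "query_id": query_id,
            "passage_id": passage_id,
            "query": queries[query_id],
            "passage": passages[passage_id],
        }

# Toy data, not taken from the real dataset.
toy_queries = {"q1": "who wrote hamlet"}
toy_passages = {
    "p1": "Hamlet is a tragedy written by William Shakespeare.",
    "p2": "Macbeth is a tragedy set in Scotland.",
}
toy_qrels = [("q1", "p1", 1), ("q1", "p2", 0)]

records = list(positive_pairs(toy_qrels, toy_queries, toy_passages))
```

Only the first judgment survives the filter, so the resulting pair file contains positives alone.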

The official Google NQ split is also supported, but it is not the default because the Google-hosted files can fail in some environments:

bash scripts/prepare_nq_full.sh --preset official --output-dir data/nq_official

For a pair-only preparation without rewriting the roughly 21M-passage corpus:

bash scripts/prepare_nq_full.sh --output-dir data/nq_full --skip-corpus

Train

bash scripts/train_pure_jepa.sh configs/pure_jepa_bert_nq_small.yaml

The trainer supports:

gradient accumulation via training.grad_accum_steps
bf16/fp16 autocast via training.precision
TF32 on CUDA via training.tf32
gradient clipping via training.grad_clip
scheduler warmup via training.scheduler and training.warmup_steps
checkpoint cadence via training.checkpoint_every_steps
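Two of these settings interact with the effective training schedule, which the sketch below illustrates. It assumes a linear warmup that then holds the base rate; the actual shape is selected by training.scheduler and is not specified here. The function names and numeric values are hypothetical.

```python
def effective_batch_size(micro_batch_size, grad_accum_steps):
    """Examples contributing to one optimizer update under gradient accumulation."""
    return micro_batch_size * grad_accum_steps

def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup (an assumed schedule): ramp up to base_lr over
    warmup_steps optimizer updates, then hold base_lr."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)

# Hypothetical values, for illustration only.
batch = effective_batch_size(micro_batch_size=16, grad_accum_steps=8)
lr_first = warmup_lr(step=0, base_lr=1e-4, warmup_steps=100)
lr_after = warmup_lr(step=500, base_lr=1e-4, warmup_steps=100)
```

Warmup is counted here in optimizer updates rather than micro-batches, so raising the accumulation factor lengthens warmup in terms of examples seen.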

Evaluate

bash scripts/eval_pure_jepa.sh \
  configs/pure_jepa_bert_nq_small.yaml \
  outputs/pure_jepa_bert_nq_small/checkpoint_last.pt

Evaluation compares:

frozen_bert: frozen BERT dense baseline
retjepa: predicted evidence embedding from the trained model
bge_m3: BAAI/bge-m3 strong external retrieval baseline
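All three systems can be compared with the same dense-retrieval procedure: embed the query, score every candidate passage, and check where the positives rank. The sketch below assumes dot-product scoring and a top-k hit metric; neither the scoring function nor the reported metrics are stated here, and the function names are hypothetical.

```python
def rank_passages(query_vec, passage_vecs):
    """Return passage ids sorted by dot-product score, highest first."""
    scores = {
        pid: sum(q * p for q, p in zip(query_vec, vec))
        for pid, vec in passage_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

def hit_at_k(ranked_ids, positive_ids, k):
    """1.0 if any positive passage appears in the top k results, else 0.0."""
    return 1.0 if any(pid in positive_ids for pid in ranked_ids[:k]) else 0.0

# Toy 2-dimensional embeddings, for illustration only.
passage_vecs = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
query_vec = [0.9, 0.1]

ranked = rank_passages(query_vec, passage_vecs)
```

For retjepa the query vector would be the predicted evidence embedding, while for the baselines it would be the encoder's own query embedding; the ranking step is otherwise identical.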
