vjepa2-echonet-ef

Research / triage-aid only. Not a medical device. Not for clinical use.

A frozen V-JEPA 2 ViT-L video encoder paired with a linear (Ridge) regression head, trained to predict left-ventricular ejection fraction (EF) from apical four-chamber echocardiographic videos.

The headline finding is that a frozen, generic video foundation model (V-JEPA 2, trained on natural internet video and never tuned on echo) produces representations strong enough to support HFrEF screening at AUROC 0.955 on the EchoNet-Dynamic test split when paired with a 1024-d → 1 linear head. The encoder is not fine-tuned.

Model details

Base encoder          facebook/vjepa2-vitl-fpc64-256 (V-JEPA 2 ViT-L, 326M params, frozen)
Head                  Linear(1024 → 1) fitted by Ridge regression with α tuned on the validation split
Trainable parameters  1,025 (1024 weights + 1 bias)
Input                 Video clip; 64 frames at 256×256, replicated to 3 channels for grayscale
Pre-pool aggregation  3 evenly-spaced clips per video → V-JEPA encoder → mean over (clip × token) → 1024-d vector
Output                Continuous left-ventricular ejection fraction (EF) prediction in %
Framework             PyTorch + HuggingFace Transformers (transformers.models.vjepa2.VJEPA2Model)
License               CC-BY-NC 4.0 (inherits from V-JEPA 2 base; downstream data use also gated by the EchoNet-Dynamic agreement)

Intended use

Direct use

  • Research baselines for echo + foundation-model studies.
  • Reference implementation for "frozen video encoder + lightweight head" workflows.
  • Demonstration of label-efficient transfer when full echo-specialised training data is unavailable.

Out-of-scope use

  • Any clinical decision-making. This model is not validated for, and must not be used for, diagnosis, triage, treatment selection, or any patient-affecting decision.
  • Echo views other than apical 4-chamber (parasternal, subcostal, apical 2/3/5-chamber).
  • Point-of-care or handheld ultrasound (POCUS) — not in the training distribution.
  • Pediatric, fetal, or stress echocardiography.
  • Any dataset, vendor, or institution outside the EchoNet-Dynamic distribution without re-validation.

Training data

EchoNet-Dynamic, Stanford/Cedars-Sinai (Ouyang et al., Nature 2020):

  • 10,030 apical 4-chamber echocardiogram videos.
  • Native resolution 112×112, ~50 FPS, grayscale.
  • Each video labelled with EF, end-systolic volume (ESV), and end-diastolic volume (EDV) computed from manually-traced LV contours via Simpson's method.
  • Single institution, single ultrasound vendor.
  • Official train/val/test split: 7,464 / 1,288 / 1,277 (used as-is).

EchoNet-Dynamic is released for research use under its own data use agreement; this model's downstream use inherits those constraints.

Training procedure

Embedding extraction (one-shot, per video)

  1. Sample 3 evenly-spaced clips of 64 frames each from the source video.
  2. Resize each clip to 256×256 and replicate the grayscale channel to 3 channels.
  3. Forward through V-JEPA 2 ViT-L (fp16, frozen), read last_hidden_state, mean-pool over tokens.
  4. Mean-pool the 3 clip embeddings into a single 1024-dim float32 vector per video.
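
The clip sampling and pooling in the steps above can be sketched as follows. This is a minimal illustration, not the repo's extraction code: the exact clip-placement rule, the handling of videos shorter than 64 frames, and the helper names (`clip_starts`, `sample_clips`, `pool_embeddings`) are assumptions. Resizing to 256×256 is left to the video processor and is not shown.

```python
import numpy as np

CLIP_LEN = 64  # frames per clip, as stated above
N_CLIPS = 3    # clips per video, as stated above

def clip_starts(n_frames, n_clips=N_CLIPS, clip_len=CLIP_LEN):
    """Start indices of evenly spaced clips (assumed placement rule)."""
    last_start = max(n_frames - clip_len, 0)
    return [int(round(s)) for s in np.linspace(0, last_start, n_clips)]

def sample_clips(video):
    """(T, H, W) uint8 grayscale -> (N_CLIPS, CLIP_LEN, H, W, 3) uint8."""
    n_frames = video.shape[0]
    clips = []
    for start in clip_starts(n_frames):
        idx = np.arange(start, start + CLIP_LEN)
        # Assumption: short videos are padded by repeating the last frame.
        idx = np.minimum(idx, n_frames - 1)
        clip = video[idx]
        # Replicate the single grayscale channel to 3 channels.
        clips.append(np.repeat(clip[..., None], 3, axis=-1))
    return np.stack(clips)

def pool_embeddings(tokens):
    """(n_clips, n_tokens, dim) encoder outputs -> one (dim,) float32 vector."""
    per_clip = tokens.astype(np.float32).mean(axis=1)  # mean over tokens
    return per_clip.mean(axis=0)                       # mean over clips

# Synthetic stand-in for one 200-frame, 112x112 echo video (not real data).
video = np.random.default_rng(0).integers(0, 256, size=(200, 112, 112), dtype=np.uint8)
clips = sample_clips(video)
print(clips.shape, clip_starts(200))
```

Because every clip yields the same number of tokens, averaging over tokens and then over clips gives the same vector as a single mean over (clip × token).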

Head training

  • Input: 7,464 train embeddings.
  • Loss: squared-error (Ridge).
  • Hyperparameter: alpha ∈ {0.01, 0.1, 1, 3, 10, 30, 100, 300, 1000} selected by minimising MAE on the 1,288-row validation split.
  • The fitted Ridge coefficients and intercept are loaded into a torch.nn.Linear(1024, 1) for inference parity with PyTorch-native heads.

The entire head training takes seconds because it operates in embedding space, not pixel space.
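
As a rough sketch of the head fit, the alpha sweep can be written with a closed-form ridge solve in NumPy. This is a stand-in for illustration only: the source does not say which Ridge implementation was used, the data below is synthetic, and `fit_ridge` and `select_alpha` are hypothetical helper names.

```python
import numpy as np

ALPHAS = [0.01, 0.1, 1, 3, 10, 30, 100, 300, 1000]  # grid stated above

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def select_alpha(X_tr, y_tr, X_val, y_val, alphas=ALPHAS):
    """Fit on train for each alpha; keep the fit with the lowest validation MAE."""
    best = None
    for alpha in alphas:
        w, b = fit_ridge(X_tr, y_tr, alpha)
        mae = float(np.mean(np.abs(X_val @ w + b - y_val)))
        if best is None or mae < best[0]:
            best = (mae, alpha, w, b)
    return best

# Synthetic embeddings and labels (not EchoNet data), 32-d instead of 1024-d.
rng = np.random.default_rng(0)
dim = 32
true_w = rng.normal(size=dim)
X_tr, X_val = rng.normal(size=(500, dim)), rng.normal(size=(200, dim))
y_tr = X_tr @ true_w + 55.0 + rng.normal(scale=2.0, size=500)
y_val = X_val @ true_w + 55.0 + rng.normal(scale=2.0, size=200)

val_mae, alpha, w, b = select_alpha(X_tr, y_tr, X_val, y_val)
print(f"alpha={alpha}, val MAE={val_mae:.3f}")
```

For the real 1024-d head, `w` would be copied into the `weight` of a torch.nn.Linear(1024, 1), which has shape (1, 1024), and `b` into its `bias`.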

Compute

  • Hardware: single NVIDIA GB10 (Grace + Blackwell, aarch64).
  • Embedding pass: 6.6 hours wall-time for the full 10,030 videos at fp16 (2.4 s/video).
  • Head fit: seconds.

Evaluation

Test set

  • EchoNet-Dynamic official TEST split, n=1,277.
  • Class distribution by clinical EF cutoffs: 160 HFrEF (EF<40), 125 HFmrEF (40≤EF<50), 992 HFpEF (EF≥50).

Headline metrics

Metric                        Value
MAE (EF percentage points)    5.443
RMSE                          7.093
R² (EF regression)            0.663
AUROC, screening EF<40        0.955
AUPRC, screening EF<40        0.840
Prevalence, EF<40             0.125
AUROC, screening EF<50        0.934
AUPRC, screening EF<50        0.853
Prevalence, EF<50             0.223

For context, the published EchoNet end-to-end 3D-CNN specialist (Ouyang et al., Nature 2020) achieves MAE ≈ 4.1 on the same test set. This model trails that specialist by roughly 1.3 EF points of MAE while training only a 1,025-parameter linear head on top of frozen general-purpose video features.
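
For readers reproducing the regression metrics, the definitions can be sketched in plain Python. The five-video toy example is made up for illustration and `regression_metrics` is a hypothetical helper, not the repo's eval script.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted EF values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Toy values only (not EchoNet data): low EFs over-predicted, high EFs under-predicted.
y_true = [30.0, 45.0, 55.0, 60.0, 65.0]
y_pred = [38.0, 47.0, 54.0, 58.0, 61.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE {mae:.3f}  RMSE {rmse:.3f}  R2 {r2:.3f}")
```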

Figures

ROC curves with val-tuned operating points

[Figure: ROC for EF<40 and EF<50 screening]

Left panel: HFrEF screening (true label EF<40), AUROC 0.955. Right panel: any reduced EF (true label EF<50), AUROC 0.934. Operating points are the val-tuned cutoffs (45.83 and 53.17 respectively) applied to the TEST split.

Predicted vs actual EF — TEST scatter

[Figure: Predicted vs actual EF on TEST]

Each point is one TEST video, coloured by clinical class. The dashed line is y = x (perfect prediction). Two patterns are visible: (i) strong correlation along the diagonal, confirming the regression signal; (ii) red dots clustered above the diagonal in the actual-EF<30 range — these are HFrEF cases the model predicts too high (the off-by-two failure mode). MAE on TEST is 5.44.

Reliability diagram

[Figure: Reliability diagram]

Bin actual EF in 8-percent-wide buckets; plot the mean predicted EF per bucket against the bucket centre, with n annotations for bucket counts. The model is well-calibrated near the population mean (40–60% bin centres) and regresses to the mean at both extremes — over-predicting in the very-low-EF range (predicts ~30 when truth ≈ 14) and under-predicting at high EFs. This Ridge-regression-to-mean bias is the reason the val-tuned screening cutoff (45.83 for EF<40) is higher than the clinical threshold (40), not lower.
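
The bucketing behind the reliability diagram can be sketched as below. The bucket edges (multiples of 8), the synthetic data, and the helper name `reliability_table` are assumptions made for illustration; the synthetic predictions are deliberately shrunk toward the mean to mimic the behaviour described above.

```python
import numpy as np

def reliability_table(true_ef, pred_ef, width=8.0):
    """Rows of (bucket centre, mean predicted EF, count), bucketed by actual EF."""
    true_ef = np.asarray(true_ef, dtype=float)
    pred_ef = np.asarray(pred_ef, dtype=float)
    # Assumption: bucket edges sit at multiples of `width`.
    bucket = np.floor(true_ef / width).astype(int)
    rows = []
    for b in np.unique(bucket):
        mask = bucket == b
        centre = float((b + 0.5) * width)
        rows.append((centre, float(pred_ef[mask].mean()), int(mask.sum())))
    return rows

# Synthetic stand-in (not EchoNet data): predictions regress toward the mean of 55.
rng = np.random.default_rng(0)
true_ef = rng.normal(55.0, 12.0, size=2000)
pred_ef = 55.0 + 0.7 * (true_ef - 55.0) + rng.normal(0.0, 4.0, size=2000)

table = reliability_table(true_ef, pred_ef)
for centre, mean_pred, n in table:
    print(f"centre {centre:5.1f}  mean predicted {mean_pred:5.1f}  n={n}")
```

In a well-calibrated regressor the mean prediction tracks the bucket centre; with shrinkage it sits above the centre in low buckets and below it in high ones.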

Operating points (binary screening)

The screening rule is: flag iff predicted_EF < cutoff. The val-tuned cutoff is the value below which a target_sensitivity fraction of the positive-class predictions in the validation set fall, i.e. the target_sensitivity-quantile of the positive-class predicted-EF distribution on VAL. Positives are videos with true EF<40 or true EF<50, depending on the screening task.
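
A minimal sketch of that tuning rule, assuming NumPy's default linear-interpolated quantile (the source does not specify the interpolation) and using synthetic validation data. `tune_cutoff` and `sensitivity` are hypothetical helper names.

```python
import numpy as np

def tune_cutoff(pred_ef, true_ef, clinical_threshold, target_sensitivity=0.85):
    """Predicted-EF cutoff at the target-sensitivity quantile of the positive class."""
    pred_ef = np.asarray(pred_ef, dtype=float)
    positives = pred_ef[np.asarray(true_ef) < clinical_threshold]
    return float(np.quantile(positives, target_sensitivity))

def sensitivity(pred_ef, true_ef, clinical_threshold, cutoff):
    """Fraction of true positives flagged by the rule predicted_EF < cutoff."""
    is_positive = np.asarray(true_ef) < clinical_threshold
    flagged = np.asarray(pred_ef) < cutoff
    return float(flagged[is_positive].mean())

# Synthetic stand-in for VAL (not EchoNet data): predictions shrink toward the mean.
rng = np.random.default_rng(0)
true_val = rng.normal(55.0, 12.0, size=2000)
pred_val = 55.0 + 0.7 * (true_val - 55.0) + rng.normal(0.0, 4.0, size=2000)

cutoff = tune_cutoff(pred_val, true_val, clinical_threshold=40.0)
val_sens = sensitivity(pred_val, true_val, 40.0, cutoff)
print(f"cutoff {cutoff:.2f}, VAL sensitivity {val_sens:.3f}")
```

Using the (1 - target)-quantile instead would flag only about 15% of positives, which is the bug noted in the changelog.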

EF<40 screening (HFrEF, prevalence 12.5% on TEST)

Operating point               Cutoff (predicted EF)  Sensitivity  Specificity   TP   FP   FN    TN
Naive                         40.00                  0.519        0.992         83    9   77  1108
Val-tuned (target sens 0.85)  45.83                  0.775        0.957        124   48   36  1069

EF<50 screening (any reduced EF, prevalence 22.3% on TEST)

Operating point               Cutoff (predicted EF)  Sensitivity  Specificity   TP   FP   FN    TN
Naive                         50.00                  0.751        0.940        214   60   71   932
Val-tuned (target sens 0.85)  53.17                  0.874        0.837        249  162   36   830
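
The sensitivity and specificity columns follow directly from the TP/FP/FN/TN counts. A short check, using only the counts from the two tables above (`sens_spec` is a hypothetical helper name):

```python
def sens_spec(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# (TP, FP, FN, TN) copied from the operating-point tables above.
OPERATING_POINTS = {
    "EF<40 naive":     (83, 9, 77, 1108),
    "EF<40 val-tuned": (124, 48, 36, 1069),
    "EF<50 naive":     (214, 60, 71, 932),
    "EF<50 val-tuned": (249, 162, 36, 830),
}

results = {name: sens_spec(*counts) for name, counts in OPERATING_POINTS.items()}
for name, (sens, spec) in results.items():
    print(f"{name:>16}: sensitivity {sens:.3f}, specificity {spec:.3f}")
```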

Val→test sensitivity gap. The val-tuned cutoff is calibrated to hit a target sensitivity on VAL, but TEST sensitivity is a noisy estimate of that target. For HFrEF (rare class, 161 positives in VAL) the realised TEST sens is 0.775 vs target 0.85 — a ~7.5 pp gap consistent with sampling variance in the rare class. For EF<50 (more positives, more stable threshold) the realised TEST sens is 0.874, slightly above target. Operating points should be re-tuned per deployment population.
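
To put a rough scale on that gap, a back-of-the-envelope binomial calculation is sketched below. It assumes independent positives and, crudely, treats the noise in the VAL-derived cutoff as a second independent binomial term of similar size. That second term is an approximation introduced here, not an analysis from the eval script.

```python
import math

def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent cases."""
    return math.sqrt(p * (1.0 - p) / n)

target, realised = 0.85, 0.775
n_test_pos, n_val_pos = 160, 161  # HFrEF positives in TEST and VAL, as stated above

se_test = binomial_se(target, n_test_pos)
se_val = binomial_se(target, n_val_pos)            # crude proxy for cutoff noise
se_combined = math.sqrt(se_test ** 2 + se_val ** 2)
gap = target - realised

print(f"test-only SE {se_test:.4f}, gap = {gap / se_test:.2f} SE")
print(f"combined SE  {se_combined:.4f}, gap = {gap / se_combined:.2f} SE")
```

Under these assumptions the gap is a little under two combined standard errors, so it is plausible as sampling noise but not negligible.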

3-class confusion (rows = true, cols = predicted; HFrEF / HFmrEF / HFpEF)

                Predicted
              HFrEF  HFmrEF  HFpEF
True  HFrEF    83     61     16
      HFmrEF    8     62     55
      HFpEF     1     59    932

Per-class accuracy: HFrEF 52%, HFmrEF 50%, HFpEF 94%. Most errors (183 of 200) are off-by-one, landing in an adjacent class, which is the less harmful failure mode; the remaining 17 are off-by-two.
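
The per-class figures can be recomputed from the confusion matrix above. The sketch uses only the counts in that table; the helper names are hypothetical.

```python
CLASSES = ["HFrEF", "HFmrEF", "HFpEF"]

# Rows = true class, columns = predicted class, copied from the table above.
CONFUSION = [
    [83, 61, 16],
    [8, 62, 55],
    [1, 59, 932],
]

def per_class_accuracy(matrix):
    """Diagonal count divided by row total, for each true class."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def counts_by_class_distance(matrix):
    """Number of videos whose predicted class is 0, 1, or 2 classes from the truth."""
    counts = {0: 0, 1: 0, 2: 0}
    for i, row in enumerate(matrix):
        for j, n in enumerate(row):
            counts[abs(i - j)] += n
    return counts

accuracy = per_class_accuracy(CONFUSION)
by_distance = counts_by_class_distance(CONFUSION)
for name, acc in zip(CLASSES, accuracy):
    print(f"{name:>6}: {acc:.1%}")
print(by_distance)
```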

Limitations & risks

  • Not a medical device. No regulatory clearance (FDA, CE, etc.). Use is restricted to research and demonstration.
  • 16 of 160 HFrEF cases (10%) were predicted as HFpEF — off-by-two errors. This rate is too high for a deployed screener.
  • Single institution / single vendor. EchoNet-Dynamic is one institution and one ultrasound system. Generalisation to other vendors, countries, or clinical workflows has not been measured.
  • Apical 4-chamber view only. The model was trained exclusively on A4C clips. Other views are out of distribution and will produce confidently wrong predictions. A view-classification gate is required before any deployment.
  • No demographic, clinical, or device metadata. No subgroup analysis by age, sex, or comorbidity is possible from the EchoNet-Dynamic release used here.
  • No calibration. Predictions are point estimates from a frozen-encoder + linear head. No predictive uncertainty is reported. Probability calibration (e.g. Platt scaling on the val set) is recommended before any operating-point use.
  • EF label noise floor. EF is itself a noisy measurement (~5 EF points of absolute disagreement between expert cardiologists), so the achievable MAE is bounded below by roughly 3–4 EF points.
  • Long-tail performance. HFrEF and HFmrEF classes are underrepresented (12.5% and 9.8% of test set). Confidence intervals on subgroup metrics are correspondingly wide.

Bias considerations

The EchoNet-Dynamic dataset does not include detailed demographic metadata in the public release. The model has not been evaluated for bias across age, sex, race/ethnicity, body habitus, or scanning operator. Any deployment in a population whose distribution differs from the EchoNet cohort would require independent validation.

Ethical considerations

EF is used in real clinical pathways to decide on guideline-directed medical therapy for heart failure. A miscalibrated model could plausibly cause harm if naively deployed. The intended use of this artefact is to support method development and reproducible benchmarking — not to influence patient care.

How to use

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoModel, AutoVideoProcessor

REPO_ID = "vselvarajijay/vjepa2-echonet-ef"
BASE_MODEL = "facebook/vjepa2-vitl-fpc64-256"

# 1. Load V-JEPA 2 frozen encoder from HuggingFace
processor = AutoVideoProcessor.from_pretrained(BASE_MODEL)
encoder = AutoModel.from_pretrained(BASE_MODEL, dtype=torch.float16).eval()

# 2. Pull the trained Ridge head from this repo
head_path = hf_hub_download(repo_id=REPO_ID, filename="head.safetensors")
head = torch.nn.Linear(1024, 1)
head.load_state_dict(load_file(head_path))
head.eval()

# 3. Encode a video clip and predict EF
#    `frames` must be a numpy array shape (T=64, H=256, W=256, C=3), uint8.
#    For grayscale echo, replicate the single channel to 3 channels.
#    See `inference_example.py` in this repo for full clip-sampling + multi-clip averaging.
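inputs = processor(videos=list(frames), return_tensors="pt")
# The processor returns float32 tensors; cast them to the encoder's fp16 dtype
# so the forward pass does not fail on a dtype mismatch.
inputs = {k: v.to(encoder.dtype) if torch.is_floating_point(v) else v for k, v in inputs.items()}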
with torch.inference_mode():
    hidden = encoder(**inputs).last_hidden_state            # (1, N_tokens, 1024)
    embedding = hidden.float().mean(dim=1).squeeze(0)        # (1024,)
    ef_pred = head(embedding).item()                          # float, EF in %

# Reduced-EF screening (HFrEF) at the val-tuned cutoff:
print(f"EF: {ef_pred:.2f}%   Reduced-EF flag: {ef_pred < 45.83}")

The snippet above pulls just the head from this repo. For an end-to-end script that decodes a .avi/.mp4 directly and averages multiple clips, see inference_example.py.

Citation

If you use this model, cite both the V-JEPA 2 base and the EchoNet-Dynamic dataset:

@article{ouyang2020video,
  title={Video-based AI for beat-to-beat assessment of cardiac function},
  author={Ouyang, David and He, Bryan and Ghorbani, Amirata and Yuan, Neal and
          Ebinger, Joseph and Langlotz, Curtis P and Heidenreich, Paul A and
          Harrington, Robert A and Liang, David H and Ashley, Euan A and Zou, James Y},
  journal={Nature},
  volume={580},
  pages={252--256},
  year={2020}
}

@misc{assran2025vjepa2,
  title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction, and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and others},
  year={2025},
  publisher={Meta AI}
}

Changelog

  • 2026-05-08 — Initial draft. Full-dataset training run (n=7,464 train) reports MAE 5.4, R² 0.66, AUROC@40 0.955.
  • 2026-05-08 — Eval script's threshold-tuning bug fixed (replaced (1 - target)-quantile with target-quantile of positive predictions). Tuned operating points populated: EF<40 @ cutoff 45.83 → Sens 0.775, Spec 0.957; EF<50 @ cutoff 53.17 → Sens 0.874, Spec 0.837.
  • 2026-05-08 — Added ROC, scatter, and reliability figures (figures/) generated by app.figures.
  • 2026-05-08 — Ablation: doubled clips_per_video from 3 to 6. Result: marginally worse (MAE 5.487 vs 5.443) — within run-to-run noise. Mean-pooling more clips at the embedding level does not improve signal because the dominant embedding axes are EF-irrelevant appearance variance. 3-clip remains canonical. Future clip-count scaling will need concat or attention pooling, not averaging.
  • 2026-05-08 — v0.1.0 published.