vjepa2-echonet-ef

Research / triage-aid only. Not a medical device. Not for clinical use.

A frozen V-JEPA 2 ViT-L video encoder paired with a linear (Ridge) regression head, trained to predict left-ventricular ejection fraction (EF) from apical four-chamber echocardiographic videos.

The headline finding is that a frozen, generic video foundation model (V-JEPA 2, trained on natural internet video and never tuned on echo) produces representations strong enough to support clinically meaningful HFrEF screening when paired with a 1024-d → 1 linear head. The encoder is not fine-tuned.

Model details


Base encoder	`facebook/vjepa2-vitl-fpc64-256` (V-JEPA 2 ViT-L, 326M params, frozen)
Head	`Linear(1024 → 1)` fitted by Ridge regression with α tuned on the validation split
Trainable parameters	1,025 (1024 weights + 1 bias)
Input	Video clip; 64 frames at 256×256, with the grayscale channel replicated to 3 channels
Pre-pool aggregation	3 evenly-spaced clips per video → V-JEPA encoder → mean over (clip × token) → 1024-d vector
Output	Continuous left-ventricular ejection fraction (EF) prediction in %
Framework	PyTorch + HuggingFace Transformers (`transformers.models.vjepa2.VJEPA2Model`)
License	CC-BY-NC 4.0 (inherits from V-JEPA 2 base; downstream data use also gated by the EchoNet-Dynamic agreement)

Intended use

Direct use

Research baselines for echo + foundation-model studies.
Reference implementation for "frozen video encoder + lightweight head" workflows.
Demonstration of label-efficient transfer when full echo-specialised training data is unavailable.

Out-of-scope use

Any clinical decision-making. This model is not validated for, and must not be used for, diagnosis, triage, treatment selection, or any patient-affecting decision.
Echo views other than apical 4-chamber (parasternal, subcostal, apical 2/3/5-chamber).
Point-of-care or handheld ultrasound (POCUS) — not in the training distribution.
Pediatric, fetal, or stress echocardiography.
Any dataset, vendor, or institution outside the EchoNet-Dynamic distribution without re-validation.

Training data

EchoNet-Dynamic, Stanford/Cedars-Sinai (Ouyang et al., Nature 2020):

10,030 apical 4-chamber echocardiogram videos.
Native resolution 112×112, ~50 FPS, grayscale.
Each video labelled with EF, end-systolic volume (ESV), and end-diastolic volume (EDV) computed from manually-traced LV contours via Simpson's method.
Single institution, single ultrasound vendor.
Official train/val/test split: 7,464 / 1,288 / 1,277 (used as-is).

EchoNet-Dynamic is released for research use under its own data use agreement; this model's downstream use inherits those constraints.

Training procedure

Embedding extraction (one-shot, per video)

Sample 3 evenly-spaced clips of 64 frames each from the source video.
Resize each clip to 256×256 and replicate the grayscale channel to 3 channels.
Forward through V-JEPA 2 ViT-L (fp16, frozen), read last_hidden_state, mean-pool over tokens.
Mean-pool the 3 clip embeddings into a single 1024-dim float32 vector per video.
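The sampling and pooling steps above can be sketched in NumPy as follows. This is a minimal illustration, not the repository's implementation: the helper names (`clip_start_indices`, `extract_clips`, `pool_video_embedding`), the exact definition of "evenly spaced", and the padding rule for videos shorter than 64 frames are all assumptions. Resizing to 256×256 and the encoder forward pass are left out.

```python
import numpy as np


def clip_start_indices(n_frames: int, clip_len: int = 64, n_clips: int = 3) -> list[int]:
    """Start indices of n_clips windows spread evenly from the first to the last valid start."""
    last_start = max(n_frames - clip_len, 0)
    return [int(round(s)) for s in np.linspace(0, last_start, n_clips)]


def extract_clips(video: np.ndarray, clip_len: int = 64, n_clips: int = 3) -> np.ndarray:
    """video: (T, H, W) uint8 grayscale -> (n_clips, clip_len, H, W, 3) uint8."""
    n_frames = video.shape[0]
    clips = []
    for start in clip_start_indices(n_frames, clip_len, n_clips):
        # Assumption: short videos are padded by repeating the last frame.
        idx = np.minimum(np.arange(start, start + clip_len), n_frames - 1)
        clip = video[idx]
        # Replicate the single grayscale channel to 3 channels.
        clips.append(np.repeat(clip[..., None], 3, axis=-1))
    return np.stack(clips)


def pool_video_embedding(token_embeddings: np.ndarray) -> np.ndarray:
    """(n_clips, n_tokens, dim) encoder outputs -> (dim,) float32 video embedding."""
    per_clip = token_embeddings.astype(np.float32).mean(axis=1)  # mean over tokens
    return per_clip.mean(axis=0)                                 # mean over clips
```

Because every clip yields the same number of tokens, averaging over tokens and then over clips gives the same vector as a single mean over (clip × token).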

Head training

Input: 7,464 train embeddings.
Loss: squared-error (Ridge).
Hyperparameter: alpha ∈ {0.01, 0.1, 1, 3, 10, 30, 100, 300, 1000} selected by minimising MAE on the 1,288-row validation split.
The fitted Ridge coefficients and intercept are loaded into a torch.nn.Linear(1024, 1) for inference parity with PyTorch-native heads.

The entire head training takes seconds because it operates in embedding space, not pixel space.
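As a rough sketch of the head fit, the alpha search can be written with a closed-form Ridge solve in NumPy. The source does not say which Ridge implementation was used, so the functions `fit_ridge` and `select_alpha` are hypothetical. The objective assumed here is squared error plus alpha times the squared weight norm, with an unpenalised intercept, which is also the default behaviour of `sklearn.linear_model.Ridge`.

```python
import numpy as np

ALPHAS = (0.01, 0.1, 1, 3, 10, 30, 100, 300, 1000)  # grid from the card


def fit_ridge(X: np.ndarray, y: np.ndarray, alpha: float):
    """Closed-form Ridge with an unpenalised intercept. Returns (weights, bias)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def select_alpha(X_train, y_train, X_val, y_val, alphas=ALPHAS):
    """Fit on train for each alpha; keep the fit with the lowest validation MAE."""
    best = None
    for alpha in alphas:
        w, b = fit_ridge(X_train, y_train, alpha)
        val_mae = float(np.mean(np.abs(X_val @ w + b - y_val)))
        if best is None or val_mae < best[1]:
            best = (alpha, val_mae, w, b)
    return best  # (alpha, val_mae, weights, bias)
```

With 1024-d embeddings, `weights` has shape `(1024,)` and `bias` is a scalar, which correspond to the `(1, 1024)` weight and `(1,)` bias of the `torch.nn.Linear(1024, 1)` head.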

Compute

Hardware: single NVIDIA GB10 (Grace + Blackwell, aarch64).
Embedding pass: ~6.6 hours wall-time for the full 10,030 videos at fp16 (~2.4 s/video).
Head fit: seconds.

Evaluation

Test set

EchoNet-Dynamic official TEST split, n=1,277.
Class distribution by clinical EF cutoffs: 160 HFrEF (EF<40), 125 HFmrEF (40≤EF<50), 992 HFpEF (EF≥50).

Headline metrics

Metric	Value
MAE (EF percentage points)	5.443
RMSE	7.093
R² (EF regression)	0.663
AUROC, screening EF<40	0.955
AUPRC, screening EF<40	0.840
Prevalence, EF<40	0.125
AUROC, screening EF<50	0.934
AUPRC, screening EF<50	0.853
Prevalence, EF<50	0.223

For context, the published EchoNet end-to-end 3D-CNN specialist (Ouyang et al., Nature 2020) achieves MAE ≈ 4.1 on the same test set. This model is within 1.3 EF points of that specialist while training only a 1,025-parameter linear head on top of frozen general-purpose video features.
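For readers reproducing the headline metrics, the sketch below shows one way to compute MAE, RMSE, R², and screening AUROC from arrays of true and predicted EF. The function names are hypothetical and the evaluation script in the repository may differ. AUROC is computed through the Mann-Whitney pair-counting identity, scoring each video by negative predicted EF so that lower predictions rank as more suspicious. AUPRC is omitted.

```python
import numpy as np


def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """MAE, RMSE and R² in EF percentage points."""
    err = y_pred - y_true
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "r2": 1.0 - ss_res / ss_tot,
    }


def screening_auroc(y_true: np.ndarray, y_pred: np.ndarray, ef_threshold: float) -> float:
    """AUROC for detecting true EF < ef_threshold from predicted EF."""
    positive = y_true < ef_threshold
    score = -y_pred  # lower predicted EF means a higher screening score
    pos, neg = score[positive], score[~positive]
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))
```

The pairwise comparison is quadratic in memory, which is fine at the scale of the 1,277-video test split; a rank-based implementation would be preferable for much larger sets.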

Figures

ROC curves with val-tuned operating points

Left panel: HFrEF screening (true label EF<40), AUROC 0.955. Right panel: any reduced EF (true label EF<50), AUROC 0.934. Operating points are the val-tuned cutoffs (45.83 and 53.17 respectively) applied to the TEST split.

Predicted vs actual EF — TEST scatter

Each point is one TEST video, coloured by clinical class. The dashed line is y = x (perfect prediction). Two patterns are visible: (i) strong correlation along the diagonal, confirming the regression signal; (ii) red dots clustered above the diagonal in the actual-EF<30 range — these are HFrEF cases the model predicts too high (the off-by-two failure mode). MAE on TEST is 5.44.

Reliability diagram

Bin actual EF in 8-percent-wide buckets; plot the mean predicted EF per bucket against the bucket centre, with n annotations for bucket counts. The model is well-calibrated near the population mean (40–60% bin centres) and regresses to the mean at both extremes — over-predicting in the very-low-EF range (predicts ~30 when truth ≈ 14) and under-predicting at high EFs. This Ridge-regression-to-mean bias is the reason the val-tuned screening cutoff (45.83 for EF<40) is higher than the clinical threshold (40), not lower.
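The binning behind the reliability diagram can be sketched as below. The helper `reliability_table` is hypothetical, and the bin edges are assumed to start at 0 (0–8, 8–16, and so on); the source states only the 8-point bin width.

```python
import numpy as np


def reliability_table(y_true: np.ndarray, y_pred: np.ndarray, bin_width: float = 8.0):
    """Bin videos by actual EF; return (bin_centre, mean_predicted_EF, n) per non-empty bin."""
    bin_idx = np.floor(y_true / bin_width).astype(int)
    rows = []
    for b in np.unique(bin_idx):
        mask = bin_idx == b
        centre = (b + 0.5) * bin_width
        rows.append((float(centre), float(y_pred[mask].mean()), int(mask.sum())))
    return rows
```

A perfectly calibrated regressor would put each mean prediction on its bin centre; regression to the mean shows up as mean predictions above the centre in low-EF bins and below it in high-EF bins.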

Operating points (binary screening)

The screening rule is flag iff predicted_EF < cutoff. The val-tuned cutoff is the value at which target_sensitivity fraction of HFrEF cases in the validation set fall below the cutoff (i.e. the target_sensitivity-quantile of the positive-class predicted-EF distribution on VAL).
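A minimal sketch of that rule, assuming NumPy arrays of true and predicted EF on the validation split. The function names are hypothetical, and `np.quantile` uses its default linear interpolation, which the source does not specify.

```python
import numpy as np


def val_tuned_cutoff(y_true_val: np.ndarray, y_pred_val: np.ndarray,
                     ef_threshold: float, target_sensitivity: float = 0.85) -> float:
    """target_sensitivity-quantile of predicted EF among true positives (EF < ef_threshold)."""
    positive_preds = y_pred_val[y_true_val < ef_threshold]
    return float(np.quantile(positive_preds, target_sensitivity))


def screen(y_pred: np.ndarray, cutoff: float) -> np.ndarray:
    """Flag a video iff its predicted EF is strictly below the cutoff."""
    return y_pred < cutoff
```

Using the `(1 - target_sensitivity)`-quantile instead would flag only about 15% of positives, which is the opposite of the intended behaviour.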

EF<40 screening (HFrEF, prevalence 12.5% on TEST)

Operating point	Cutoff (predicted EF)	Sensitivity	Specificity	TP	FP	FN	TN
Naive	40.00	0.519	0.992	83	9	77	1108
Val-tuned (target sens 0.85)	45.83	0.775	0.957	124	48	36	1069

EF<50 screening (any reduced EF, prevalence 22.3% on TEST)

Operating point	Cutoff (predicted EF)	Sensitivity	Specificity	TP	FP	FN	TN
Naive	50.00	0.751	0.940	214	60	71	932
Val-tuned (target sens 0.85)	53.17	0.874	0.837	249	162	36	830
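The sensitivity and specificity columns follow directly from the confusion counts. The short check below recomputes them from the TP/FP/FN/TN values in the two tables above; `sens_spec` is a hypothetical helper.

```python
def sens_spec(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# (TP, FP, FN, TN) copied from the operating-point tables.
operating_points = {
    "EF<40 naive (cutoff 40.00)": (83, 9, 77, 1108),
    "EF<40 val-tuned (cutoff 45.83)": (124, 48, 36, 1069),
    "EF<50 naive (cutoff 50.00)": (214, 60, 71, 932),
    "EF<50 val-tuned (cutoff 53.17)": (249, 162, 36, 830),
}

for name, counts in operating_points.items():
    sens, spec = sens_spec(*counts)
    print(f"{name}: sensitivity={sens:.3f} specificity={spec:.3f}")
```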

Val→test sensitivity gap. The val-tuned cutoff is calibrated to hit a target sensitivity on VAL, but TEST sensitivity is a noisy estimate of that target. For HFrEF (rare class, 161 positives in VAL) the realised TEST sens is 0.775 vs target 0.85 — a ~7.5 pp gap consistent with sampling variance in the rare class. For EF<50 (more positives, more stable threshold) the realised TEST sens is 0.874, slightly above target. Operating points should be re-tuned per deployment population.

3-class confusion (rows = true, cols = predicted; HFrEF / HFmrEF / HFpEF)

                Predicted
              HFrEF  HFmrEF  HFpEF
True  HFrEF    83     61     16
      HFmrEF    8     62     55
      HFpEF     1     59    932

Per-class accuracy: HFrEF 52%, HFmrEF 50%, HFpEF 94%. Of the 200 misclassified videos, 183 are off-by-one (adjacent class), the more clinically benign failure mode; the remaining 17 are off-by-two.
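The per-class accuracies and the split between adjacent-class and two-class errors can be recomputed from the confusion matrix. The sketch below uses the matrix exactly as printed; the variable names are illustrative.

```python
import numpy as np

# Rows = true class, columns = predicted class, order HFrEF / HFmrEF / HFpEF.
confusion = np.array([
    [83, 61, 16],
    [8, 62, 55],
    [1, 59, 932],
])

# Recall per true class: diagonal count divided by the row total.
per_class_acc = np.diag(confusion) / confusion.sum(axis=1)

# Distance between true and predicted class index for every cell.
true_idx, pred_idx = np.indices(confusion.shape)
distance = np.abs(true_idx - pred_idx)
off_by_one = int(confusion[distance == 1].sum())  # adjacent-class errors
off_by_two = int(confusion[distance == 2].sum())  # HFrEF <-> HFpEF errors

print(np.round(per_class_acc, 2), off_by_one, off_by_two)
```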

Limitations & risks

Not a medical device. No regulatory clearance (FDA, CE, etc.). Use is restricted to research and demonstration.
16 of 160 HFrEF cases (10%) were predicted as HFpEF — off-by-two errors. This rate is too high for a deployed screener.
Single institution / single vendor. EchoNet-Dynamic is one institution and one ultrasound system. Generalisation to other vendors, countries, or clinical workflows has not been measured.
Apical 4-chamber view only. The model was trained exclusively on A4C clips. Other views are out of distribution and will produce confidently wrong predictions. A view-classification gate is required before any deployment.
No demographic, clinical, or device metadata. No subgroup analysis by age, sex, or comorbidity is possible from the EchoNet-Dynamic release used here.
No calibration. Predictions are point estimates from a frozen-encoder + linear head. No predictive uncertainty is reported. Probability calibration (e.g. Platt scaling on the val set) is recommended before any operating-point use.
EF label noise floor. EF is itself a noisy measurement (~5% absolute uncertainty between expert cardiologists), so the achievable MAE is bounded below by roughly 3–4 EF points.
Long-tail performance. HFrEF and HFmrEF classes are underrepresented (12.5% and 9.8% of test set). Confidence intervals on subgroup metrics are correspondingly wide.

Bias considerations

The EchoNet-Dynamic dataset does not include detailed demographic metadata in the public release. The model has not been evaluated for bias across age, sex, race/ethnicity, body habitus, or scanning operator. Any deployment in a population whose distribution differs from the EchoNet cohort would require independent validation.

Ethical considerations

EF is used in real clinical pathways to decide on guideline-directed medical therapy for heart failure. A miscalibrated model could plausibly cause harm if naively deployed. The intended use of this artefact is to support method development and reproducible benchmarking — not to influence patient care.

How to use

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoModel, AutoVideoProcessor

REPO_ID = "vselvarajijay/vjepa2-echonet-ef"
BASE_MODEL = "facebook/vjepa2-vitl-fpc64-256"

# 1. Load V-JEPA 2 frozen encoder from HuggingFace
processor = AutoVideoProcessor.from_pretrained(BASE_MODEL)
encoder = AutoModel.from_pretrained(BASE_MODEL, dtype=torch.float16).eval()

# 2. Pull the trained Ridge head from this repo
head_path = hf_hub_download(repo_id=REPO_ID, filename="head.safetensors")
head = torch.nn.Linear(1024, 1)
head.load_state_dict(load_file(head_path))
head.eval()

# 3. Encode a video clip and predict EF
#    `frames` must be a numpy array shape (T=64, H=256, W=256, C=3), uint8.
#    For grayscale echo, replicate the single channel to 3 channels.
#    See `inference_example.py` in this repo for full clip-sampling + multi-clip averaging.
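inputs = processor(videos=list(frames), return_tensors="pt")
# The encoder weights are fp16, so cast the processor's float32 tensors to match.
inputs = {k: v.to(encoder.dtype) if torch.is_floating_point(v) else v for k, v in inputs.items()}
with torch.inference_mode():
    hidden = encoder(**inputs).last_hidden_state            # (1, N_tokens, 1024)
    embedding = hidden.float().mean(dim=1).squeeze(0)        # (1024,)
    ef_pred = head(embedding).item()                          # float, EF in %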

# Reduced-EF screening (HFrEF) at the val-tuned cutoff:
print(f"EF: {ef_pred:.2f}%   Reduced-EF flag: {ef_pred < 45.83}")

The snippet above pulls just the head from this repo. For an end-to-end script that decodes a .avi/.mp4 directly and averages multiple clips, see inference_example.py.

Citation

If you use this model, cite both the V-JEPA 2 base and the EchoNet-Dynamic dataset:

@article{ouyang2020video,
  title={Video-based AI for beat-to-beat assessment of cardiac function},
  author={Ouyang, David and He, Bryan and Ghorbani, Amirata and Yuan, Neal and
          Ebinger, Joseph and Langlotz, Curtis P and Heidenreich, Paul A and
          Harrington, Robert A and Liang, David H and Ashley, Euan A and Zou, James Y},
  journal={Nature},
  volume={580},
  pages={252--256},
  year={2020}
}

@misc{assran2025vjepa2,
  title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction, and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and others},
  year={2025},
  publisher={Meta AI}
}

Changelog

2026-05-08 — Initial draft. Full-dataset training run (n=7,464 train) reports MAE 5.4, R² 0.66, AUROC@40 0.955.
2026-05-08 — Eval script's threshold-tuning bug fixed (replaced (1 - target)-quantile with target-quantile of positive predictions). Tuned operating points populated: EF<40 @ cutoff 45.83 → Sens 0.775, Spec 0.957; EF<50 @ cutoff 53.17 → Sens 0.874, Spec 0.837.
2026-05-08 — Added ROC, scatter, and reliability figures (figures/) generated by app.figures.
2026-05-08 — Ablation: doubled clips_per_video from 3 to 6. Result: marginally worse (MAE 5.487 vs 5.443) — within run-to-run noise. Mean-pooling more clips at the embedding level does not improve signal because the dominant embedding axes are EF-irrelevant appearance variance. 3-clip remains canonical. Future clip-count scaling will need concat or attention pooling, not averaging.
2026-05-08 — v0.1.0 published.
