vjepa2-echonet-ef
Research / triage-aid only. Not a medical device. Not for clinical use.
A frozen V-JEPA 2 ViT-L video encoder paired with a linear (Ridge) regression head, trained to predict left-ventricular ejection fraction (EF) from apical four-chamber echocardiographic videos.
The headline finding is that a frozen, generic video foundation model (V-JEPA 2, trained on natural internet video and never tuned on echo) produces representations strong enough to support HFrEF screening at research grade (test AUROC 0.955 for EF<40) when paired with a 1024-d → 1 linear head. The encoder is never fine-tuned.
Model details
| Field | Value |
|---|---|
| Base encoder | facebook/vjepa2-vitl-fpc64-256 (V-JEPA 2 ViT-L, 326M params, frozen) |
| Head | Linear(1024 → 1) fitted by Ridge regression with α tuned on the validation split |
| Trainable parameters | 1,025 (1024 weights + 1 bias) |
| Input | Video clip; 64 frames at 256×256, with grayscale frames replicated to 3 channels |
| Pre-pool aggregation | 3 evenly-spaced clips per video → V-JEPA encoder → mean over (clip × token) → 1024-d vector |
| Output | Continuous left-ventricular ejection fraction (EF) prediction in % |
| Framework | PyTorch + HuggingFace Transformers (transformers.models.vjepa2.VJEPA2Model) |
| License | CC-BY-NC 4.0 (inherits from V-JEPA 2 base; downstream data use also gated by the EchoNet-Dynamic agreement) |
Intended use
Direct use
- Research baselines for echo + foundation-model studies.
- Reference implementation for "frozen video encoder + lightweight head" workflows.
- Demonstration of label-efficient transfer when full echo-specialised training data is unavailable.
Out-of-scope use
- Any clinical decision-making. This model is not validated for, and must not be used for, diagnosis, triage, treatment selection, or any patient-affecting decision.
- Echo views other than apical 4-chamber (parasternal, subcostal, apical 2/3/5-chamber).
- Point-of-care or handheld ultrasound (POCUS) — not in the training distribution.
- Pediatric, fetal, or stress echocardiography.
- Any dataset, vendor, or institution outside the EchoNet-Dynamic distribution without re-validation.
Training data
EchoNet-Dynamic, Stanford/Cedars-Sinai (Ouyang et al., Nature 2020):
- 10,030 apical 4-chamber echocardiogram videos.
- Native resolution 112×112, ~50 FPS, grayscale.
- Each video labelled with EF, end-systolic volume (ESV), and end-diastolic volume (EDV) computed from manually-traced LV contours via Simpson's method.
- Single institution, single ultrasound vendor.
- Official train/val/test split: 7,464 / 1,288 / 1,277 (used as-is).
EchoNet-Dynamic is released for research use under its own data use agreement; this model's downstream use inherits those constraints.
Training procedure
Embedding extraction (one-shot, per video)
- Sample 3 evenly-spaced clips of 64 frames each from the source video.
- Resize each clip to 256×256 and replicate the grayscale channel to 3 channels.
- Forward through V-JEPA 2 ViT-L (fp16, frozen), read `last_hidden_state`, and mean-pool over tokens.
- Mean-pool the 3 clip embeddings into a single 1024-dim float32 vector per video.
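The clip-sampling step above can be sketched as follows. The card does not state exactly where the three clips are placed or how videos shorter than 64 frames are handled, so the names `clip_starts` and `clip_frame_indices`, the first-to-last-window spacing, and the wrap-around padding are illustrative assumptions, not the repo's implementation.

```python
def clip_starts(n_frames: int, clip_len: int = 64, n_clips: int = 3) -> list[int]:
    """Start indices of n_clips windows spread evenly from the first to the last valid window."""
    if n_frames <= 0:
        raise ValueError("video has no frames")
    last_start = max(n_frames - clip_len, 0)  # 0 when the video is shorter than one clip
    if n_clips == 1:
        return [last_start // 2]
    return [round(i * last_start / (n_clips - 1)) for i in range(n_clips)]


def clip_frame_indices(n_frames: int, clip_len: int = 64, n_clips: int = 3) -> list[list[int]]:
    """Frame indices for each clip; short videos wrap around (an assumption, not from the card)."""
    return [
        [(start + k) % n_frames for k in range(clip_len)]
        for start in clip_starts(n_frames, clip_len, n_clips)
    ]


starts = clip_starts(200)            # a 200-frame video
short = clip_frame_indices(50)       # a video shorter than one clip
```

For a 200-frame video the three windows start at frames 0, 68, and 136, so neighbouring clips do not overlap; for videos shorter than 192 frames they would.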
Head training
- Input: 7,464 train embeddings.
- Loss: squared-error (Ridge).
- Hyperparameter: `alpha ∈ {0.01, 0.1, 1, 3, 10, 30, 100, 300, 1000}`, selected by minimising MAE on the 1,288-row validation split.
- The fitted Ridge coefficients and intercept are loaded into a `torch.nn.Linear(1024, 1)` for inference parity with PyTorch-native heads.
The entire head training takes seconds because it operates in embedding space, not pixel space.
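A minimal sketch of the alpha sweep, assuming the usual Ridge formulation with an unpenalised intercept and solving it in closed form with NumPy. The data here is synthetic (16-d rather than 1024-d), and `fit_ridge` and `select_alpha` are hypothetical names rather than functions from this repo.

```python
import numpy as np

ALPHAS = [0.01, 0.1, 1, 3, 10, 30, 100, 300, 1000]  # grid listed in the card


def fit_ridge(X, y, alpha):
    """Closed-form ridge: centre the data so the intercept is not penalised, then solve."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def select_alpha(X_tr, y_tr, X_va, y_va, alphas=ALPHAS):
    """Fit on train for each alpha; keep the fit with the lowest validation MAE."""
    best = None
    for alpha in alphas:
        w, b = fit_ridge(X_tr, y_tr, alpha)
        mae = float(np.mean(np.abs(X_va @ w + b - y_va)))
        if best is None or mae < best[0]:
            best = (mae, alpha, w, b)
    return best


# Synthetic stand-in for the train/val embeddings and EF labels.
rng = np.random.default_rng(0)
true_w = rng.normal(size=16)
X_tr, X_va = rng.normal(size=(200, 16)), rng.normal(size=(80, 16))
y_tr = X_tr @ true_w + 55 + rng.normal(scale=0.5, size=200)
y_va = X_va @ true_w + 55 + rng.normal(scale=0.5, size=80)

val_mae, alpha, w, b = select_alpha(X_tr, y_tr, X_va, y_va)
# To mirror the card, w (shape (d,)) would become the Linear weight of shape (1, d) and b its bias.
```

Because each fit is a single linear solve in embedding space, sweeping nine alphas stays cheap even at 1024 dimensions.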
Compute
- Hardware: single NVIDIA GB10 (Grace + Blackwell, aarch64).
- Embedding pass: 6.6 hours wall-time for the full 10,030 videos at fp16 (2.4 s/video).
- Head fit: seconds.
Evaluation
Test set
- EchoNet-Dynamic official TEST split, n=1,277.
- Class distribution by clinical EF cutoffs: 160 HFrEF (EF<40), 125 HFmrEF (40≤EF<50), 992 HFpEF (EF≥50).
Headline metrics
| Metric | Value |
|---|---|
| MAE (EF percentage points) | 5.443 |
| RMSE | 7.093 |
| R² (EF regression) | 0.663 |
| AUROC, screening EF<40 | 0.955 |
| AUPRC, screening EF<40 | 0.840 |
| Prevalence, EF<40 | 0.125 |
| AUROC, screening EF<50 | 0.934 |
| AUPRC, screening EF<50 | 0.853 |
| Prevalence, EF<50 | 0.223 |
For context, the published EchoNet end-to-end 3D-CNN specialist (Ouyang et al., Nature 2020) achieves MAE ≈ 4.1 on the same test set. This model is within 1.3 EF points of that specialist while training only a 1,025-parameter linear head on top of frozen general-purpose video features.
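For reference, the three regression metrics in the table follow their standard definitions. The sketch below uses toy values, not the model's outputs, and `regression_metrics` is a hypothetical helper name.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, RMSE and R² for paired lists of actual and predicted EF (in percentage points)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy example only.
y_true = [20.0, 35.0, 55.0, 60.0, 70.0]
y_pred = [30.0, 40.0, 55.0, 58.0, 65.0]
metrics = regression_metrics(y_true, y_pred)
```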
Figures
ROC curves with val-tuned operating points
Left panel: HFrEF screening (true label EF<40), AUROC 0.955. Right panel: any reduced EF (true label EF<50), AUROC 0.934. Operating points are the val-tuned cutoffs (45.83 and 53.17 respectively) applied to the TEST split.
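The AUROC values quoted here treat a lower predicted EF as a more suspicious case. A minimal pure-Python sketch of that computation, using the pairwise (Mann-Whitney) definition on toy values; `auroc` is a hypothetical helper, and the evaluation script may well use a library routine instead.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example only.
actual_ef = [25.0, 38.0, 45.0, 55.0, 62.0, 70.0]
predicted_ef = [33.0, 47.0, 44.0, 52.0, 60.0, 66.0]
labels = [ef < 40 for ef in actual_ef]    # positive class: true EF < 40
scores = [-p for p in predicted_ef]       # negate so that lower predicted EF ranks higher
value = auroc(labels, scores)
```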
Predicted vs actual EF — TEST scatter
Each point is one TEST video, coloured by clinical class. The dashed line is y = x (perfect prediction). Two patterns are visible: (i) strong correlation along the diagonal, confirming the regression signal; (ii) red dots clustered above the diagonal in the actual-EF<30 range. These are HFrEF cases the model predicts too high, and they are the source of the off-by-two failure mode, in which a true HFrEF case is predicted as HFpEF. MAE on TEST is 5.44.
Reliability diagram
Bin actual EF in 8-percent-wide buckets; plot the mean predicted EF per bucket against the bucket centre, with n annotations for bucket counts. The model is well-calibrated near the population mean (40–60% bin centres) and regresses to the mean at both extremes — over-predicting in the very-low-EF range (predicts ~30 when truth ≈ 14) and under-predicting at high EFs. This Ridge-regression-to-mean bias is the reason the val-tuned screening cutoff (45.83 for EF<40) is higher than the clinical threshold (40), not lower.
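The binning behind the reliability diagram can be sketched as below. Bucket edges starting at 0 and the helper name `reliability_bins` are assumptions; the values are toy data chosen to mimic the regression-to-the-mean pattern described above.

```python
from collections import defaultdict


def reliability_bins(actual, predicted, width=8.0):
    """Group videos by actual-EF bucket; return {bucket_centre: (mean_predicted_ef, n)}."""
    buckets = defaultdict(list)
    for a, p in zip(actual, predicted):
        buckets[int(a // width)].append(p)
    return {
        (k + 0.5) * width: (sum(v) / len(v), len(v))
        for k, v in sorted(buckets.items())
    }


# Toy example only: over-prediction at low EF, under-prediction at high EF.
actual = [14.0, 15.0, 44.0, 46.0, 75.0]
predicted = [30.0, 32.0, 45.0, 47.0, 68.0]
bins = reliability_bins(actual, predicted)
```

Plotting each bucket centre against its mean predicted EF, with the identity line for reference, gives the diagram; points above the line at low EF and below it at high EF are the shrinkage toward the mean.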
Operating points (binary screening)
The screening rule is flag iff predicted_EF < cutoff. The val-tuned cutoff is the value at which target_sensitivity fraction of HFrEF cases in the validation set fall below the cutoff (i.e. the target_sensitivity-quantile of the positive-class predicted-EF distribution on VAL).
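A minimal sketch of the cutoff rule just described, on toy validation data. The linear-interpolation quantile matches NumPy's default convention, which is an assumption about the evaluation script; `quantile`, `tune_cutoff`, and `flag` are hypothetical names.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, for q in [0, 1]."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values")
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def tune_cutoff(val_actual, val_predicted, ef_threshold=40.0, target_sensitivity=0.85):
    """Cutoff below which roughly target_sensitivity of the true positives' predictions fall."""
    positives = [p for a, p in zip(val_actual, val_predicted) if a < ef_threshold]
    return quantile(positives, target_sensitivity)


def flag(predicted_ef, cutoff):
    """Screening rule: flag iff predicted EF is strictly below the cutoff."""
    return predicted_ef < cutoff


# Toy validation split only.
val_actual = [22.0, 28.0, 33.0, 36.0, 39.0, 52.0, 60.0, 65.0]
val_predicted = [30.0, 35.0, 40.0, 45.0, 50.0, 51.0, 58.0, 63.0]
cutoff = tune_cutoff(val_actual, val_predicted)
```

Because the positives' predictions are pulled up toward the mean, the tuned cutoff in this toy case lands well above the clinical threshold of 40, the same direction as the 45.83 reported in this card.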
EF<40 screening (HFrEF, prevalence 12.5% on TEST)
| Operating point | Cutoff (predicted EF) | Sensitivity | Specificity | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|
| Naive | 40.00 | 0.519 | 0.992 | 83 | 9 | 77 | 1108 |
| Val-tuned (target sens 0.85) | 45.83 | 0.775 | 0.957 | 124 | 48 | 36 | 1069 |
EF<50 screening (any reduced EF, prevalence 22.3% on TEST)
| Operating point | Cutoff (predicted EF) | Sensitivity | Specificity | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|
| Naive | 50.00 | 0.751 | 0.940 | 214 | 60 | 71 | 932 |
| Val-tuned (target sens 0.85) | 53.17 | 0.874 | 0.837 | 249 | 162 | 36 | 830 |
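The sensitivity and specificity columns follow directly from the TP/FP/FN/TN counts in the two tables. A short sketch that recomputes them; `screening_metrics` is a hypothetical helper, and PPV and NPV are included only as further quantities derivable from the same counts.

```python
def screening_metrics(tp, fp, fn, tn):
    """Standard binary screening metrics from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts copied from the operating-point tables above.
naive_40 = screening_metrics(tp=83, fp=9, fn=77, tn=1108)
tuned_40 = screening_metrics(tp=124, fp=48, fn=36, tn=1069)
naive_50 = screening_metrics(tp=214, fp=60, fn=71, tn=932)
tuned_50 = screening_metrics(tp=249, fp=162, fn=36, tn=830)
```

At the val-tuned EF<40 cutoff the same counts give a PPV of 124/172 ≈ 0.72 at the TEST prevalence of 12.5%; PPV depends on prevalence and would differ in another population.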
Val→test sensitivity gap. The val-tuned cutoff is calibrated to hit a target sensitivity on VAL, but TEST sensitivity is a noisy estimate of that target. For HFrEF (rare class, 161 positives in VAL) the realised TEST sens is 0.775 vs target 0.85 — a ~7.5 pp gap consistent with sampling variance in the rare class. For EF<50 (more positives, more stable threshold) the realised TEST sens is 0.874, slightly above target. Operating points should be re-tuned per deployment population.
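As a rough check on the size of that gap, the binomial standard error of a sensitivity estimate can be computed from the number of positives. This sketch uses a normal approximation and covers only the TEST-side noise; it ignores the additional variance from estimating the cutoff on the VAL positives, so it understates the total uncertainty.

```python
import math


def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent cases."""
    return math.sqrt(p * (1.0 - p) / n)


target = 0.85        # target sensitivity on VAL
realised = 0.775     # realised sensitivity on TEST
n_test_pos = 160     # HFrEF cases in TEST

se = binomial_se(target, n_test_pos)
gap_in_se = (target - realised) / se
```

With 160 positives the test-side standard error is about 2.8 pp, so the 7.5 pp gap is roughly 2.7 test-side standard errors before the cutoff's own estimation noise is taken into account.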
3-class confusion (rows = true, cols = predicted; HFrEF / HFmrEF / HFpEF)
| True class | Pred. HFrEF | Pred. HFmrEF | Pred. HFpEF |
|---|---|---|---|
| HFrEF | 83 | 61 | 16 |
| HFmrEF | 8 | 62 | 55 |
| HFpEF | 1 | 59 | 932 |
Per-class accuracy: HFrEF 52%, HFmrEF 50%, HFpEF 94%. Of the 200 misclassified videos, 183 are off-by-one (adjacent class), which is the more clinically benign failure mode; the remaining 17 are off-by-two.
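The per-class figures and the off-by-one/off-by-two split can be recomputed from the confusion matrix. The counts match the naive rows of the screening tables, which suggests the classes come from applying the clinical cutoffs (40 and 50) directly to predicted EF; `ef_class` encodes that reading as an assumption, and `summarise` is a hypothetical helper.

```python
CLASSES = ["HFrEF", "HFmrEF", "HFpEF"]

# Rows = true class, columns = predicted class, copied from the matrix above.
CONFUSION = [
    [83, 61, 16],
    [8, 62, 55],
    [1, 59, 932],
]


def ef_class(ef):
    """Map an EF value to a class index using the clinical cutoffs (assumed mapping)."""
    if ef < 40:
        return 0
    if ef < 50:
        return 1
    return 2


def summarise(confusion):
    """Per-class recall plus error counts grouped by class distance (1 or 2)."""
    recall = [row[i] / sum(row) for i, row in enumerate(confusion)]
    off_by = {1: 0, 2: 0}
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if i != j:
                off_by[abs(i - j)] += count
    return recall, off_by


recall, off_by = summarise(CONFUSION)
```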
Limitations & risks
- Not a medical device. No regulatory clearance (FDA, CE, etc.). Use is restricted to research and demonstration.
- 16 of 160 HFrEF cases (10%) were predicted as HFpEF — off-by-two errors. This rate is too high for a deployed screener.
- Single institution / single vendor. EchoNet-Dynamic is one institution and one ultrasound system. Generalisation to other vendors, countries, or clinical workflows has not been measured.
- Apical 4-chamber view only. The model was trained exclusively on A4C clips. Other views are out of distribution and will produce confidently wrong predictions. A view-classification gate is required before any deployment.
- No demographic, clinical, or device metadata. No subgroup analysis by age, sex, or comorbidity is possible from the EchoNet-Dynamic release used here.
- No calibration or uncertainty estimates. Predictions are point estimates from a frozen-encoder + linear head, and no predictive uncertainty is reported. Calibrating the mapping from predicted EF to the probability of reduced EF (e.g. Platt scaling on the val set) is recommended before any operating-point use.
- EF label noise floor. EF is itself a noisy measurement (~5% absolute uncertainty between expert cardiologists), so the achievable MAE is bounded below by roughly 3–4 EF points.
- Long-tail performance. HFrEF and HFmrEF classes are underrepresented (12.5% and 9.8% of test set). Confidence intervals on subgroup metrics are correspondingly wide.
Bias considerations
The EchoNet-Dynamic dataset does not include detailed demographic metadata in the public release. The model has not been evaluated for bias across age, sex, race/ethnicity, body habitus, or scanning operator. Any deployment in a population whose distribution differs from the EchoNet cohort would require independent validation.
Ethical considerations
EF is used in real clinical pathways to decide on guideline-directed medical therapy for heart failure. A miscalibrated model could plausibly cause harm if naively deployed. The intended use of this artefact is to support method development and reproducible benchmarking — not to influence patient care.
How to use
```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoModel, AutoVideoProcessor

REPO_ID = "vselvarajijay/vjepa2-echonet-ef"
BASE_MODEL = "facebook/vjepa2-vitl-fpc64-256"

# 1. Load V-JEPA 2 frozen encoder from HuggingFace
processor = AutoVideoProcessor.from_pretrained(BASE_MODEL)
encoder = AutoModel.from_pretrained(BASE_MODEL, dtype=torch.float16).eval()

# 2. Pull the trained Ridge head from this repo
head_path = hf_hub_download(repo_id=REPO_ID, filename="head.safetensors")
head = torch.nn.Linear(1024, 1)
head.load_state_dict(load_file(head_path))
head.eval()

# 3. Encode a video clip and predict EF
# `frames` must be a numpy array shape (T=64, H=256, W=256, C=3), uint8.
# For grayscale echo, replicate the single channel to 3 channels.
# See `inference_example.py` in this repo for full clip-sampling + multi-clip averaging.
inputs = processor(videos=list(frames), return_tensors="pt")
with torch.inference_mode():
    hidden = encoder(**inputs).last_hidden_state        # (1, N_tokens, 1024)
    embedding = hidden.float().mean(dim=1).squeeze(0)   # (1024,)
    ef_pred = head(embedding).item()                    # float, EF in %

# Reduced-EF screening (HFrEF) at the val-tuned cutoff:
print(f"EF: {ef_pred:.2f}% Reduced-EF flag: {ef_pred < 45.83}")
```
The snippet above pulls just the head from this repo. For an end-to-end script that decodes a .avi/.mp4 directly and averages multiple clips, see inference_example.py.
Citation
If you use this model, cite both the V-JEPA 2 base and the EchoNet-Dynamic dataset:
```bibtex
@article{ouyang2020video,
  title={Video-based AI for beat-to-beat assessment of cardiac function},
  author={Ouyang, David and He, Bryan and Ghorbani, Amirata and Yuan, Neal and
          Ebinger, Joseph and Langlotz, Curtis P and Heidenreich, Paul A and
          Harrington, Robert A and Liang, David H and Ashley, Euan A and Zou, James Y},
  journal={Nature},
  volume={580},
  pages={252--256},
  year={2020}
}

@misc{assran2025vjepa2,
  title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction, and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and others},
  year={2025},
  publisher={Meta AI}
}
```
Changelog
- 2026-05-08: Initial draft. Full-dataset training run (n=7,464 train) reports MAE 5.4, R² 0.66, AUROC@40 0.955.
- 2026-05-08: Eval script's threshold-tuning bug fixed (replaced `(1 - target)`-quantile with `target`-quantile of positive predictions). Tuned operating points populated: EF<40 @ cutoff 45.83 → Sens 0.775, Spec 0.957; EF<50 @ cutoff 53.17 → Sens 0.874, Spec 0.837.
- 2026-05-08: Added ROC, scatter, and reliability figures (`figures/`) generated by `app.figures`.
- 2026-05-08: Ablation: doubled `clips_per_video` from 3 to 6. Result: marginally worse (MAE 5.487 vs 5.443), within run-to-run noise. Mean-pooling more clips at the embedding level does not improve signal because the dominant embedding axes are EF-irrelevant appearance variance. 3-clip remains canonical. Future clip-count scaling will need concat or attention pooling, not averaging.
- 2026-05-08: v0.1.0 published.

