Copilot committed

Commit d28c63e · Parent: d9e4621

Add AI-Endo project hub UI


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

This view is limited to 50 files because it contains too many changes.

Files changed (50):
  1. Dockerfile +1 -1
  2. README.md +33 -11
  3. app.py +544 -114
  4. dinov2/.github/workflows/lint.yaml +38 -0
  5. dinov2/.gitignore +11 -0
  6. dinov2/CODE_OF_CONDUCT.md +80 -0
  7. dinov2/CONTRIBUTING.md +31 -0
  8. dinov2/LICENSE +203 -0
  9. dinov2/MODEL_CARD.md +272 -0
  10. dinov2/README.md +620 -0
  11. dinov2/conda-extras.yaml +24 -0
  12. dinov2/conda.yaml +21 -0
  13. dinov2/pyproject.toml +29 -0
  14. dinov2/requirements-dev.txt +3 -0
  15. dinov2/requirements-extras.txt +2 -0
  16. dinov2/requirements.txt +11 -0
  17. dinov2/scripts/lint.sh +28 -0
  18. dinov2/setup.cfg +8 -0
  19. dinov2/setup.py +88 -0
  20. explainability.py +112 -0
  21. model/transformer.py +28 -8
  22. model_manager.py +10 -0
  23. predictor.py +372 -15
  24. runtime-requirements.txt +1 -1
  25. scripts/publish_model_repo.py +156 -0
  26. scripts/publish_space_repo.py +98 -0
  27. scripts/stage_space_bundle.py +104 -0
  28. scripts/stage_vendor_sources.py +38 -0
  29. vjepa2/.flake8 +5 -0
  30. vjepa2/.github/workflows/base_tests.yaml +29 -0
  31. vjepa2/.github/workflows/linters.yaml +48 -0
  32. vjepa2/.gitignore +32 -0
  33. vjepa2/APACHE-LICENSE +201 -0
  34. vjepa2/CHANGELOG.md +5 -0
  35. vjepa2/CODE_OF_CONDUCT.md +80 -0
  36. vjepa2/CONTRIBUTING.md +39 -0
  37. vjepa2/LICENSE +21 -0
  38. vjepa2/README.md +450 -0
  39. vjepa2/app/__init__.py +0 -0
  40. vjepa2/app/main.py +84 -0
  41. vjepa2/app/main_distributed.py +269 -0
  42. vjepa2/app/scaffold.py +17 -0
  43. vjepa2/app/vjepa/train.py +536 -0
  44. vjepa2/app/vjepa/transforms.py +154 -0
  45. vjepa2/app/vjepa/utils.py +267 -0
  46. vjepa2/app/vjepa_droid/droid.py +232 -0
  47. vjepa2/app/vjepa_droid/train.py +524 -0
  48. vjepa2/app/vjepa_droid/transforms.py +156 -0
  49. vjepa2/app/vjepa_droid/utils.py +253 -0
  50. vjepa2/configs/eval/vitg-384/coin.yaml +163 -0
Dockerfile CHANGED

@@ -4,7 +4,7 @@ ENV DEBIAN_FRONTEND=noninteractive \
      PYTHONUNBUFFERED=1 \
      PIP_NO_CACHE_DIR=1 \
      SPACE_MODEL_DIR=/app/model \
-     SPACE_ENABLED_MODELS=dinov2 \
+     SPACE_ENABLED_MODELS=dinov2,aiendo,vjepa2 \
      SPACE_DEFAULT_MODEL=dinov2

  RUN apt-get update && apt-get install -y --no-install-recommends \
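The Dockerfile change above flips the Space from a single-model build to the three-model configuration purely through `SPACE_ENABLED_MODELS`. As a minimal sketch of how such a comma-separated allow-list can be parsed (the helper name and the `KNOWN_MODEL_KEYS` tuple here are illustrative; the repository's own parsing lives in `_enabled_model_keys` in app.py):

```python
import os

# Assumed model-family keys; the real registry lives in model_registry.py.
KNOWN_MODEL_KEYS = ("dinov2", "aiendo", "vjepa2")

def enabled_model_keys(default: str = "dinov2") -> list[str]:
    """Parse SPACE_ENABLED_MODELS into an ordered, de-duplicated list of known keys."""
    configured = os.getenv("SPACE_ENABLED_MODELS", "").strip()
    if not configured:
        return [default]
    keys: list[str] = []
    for raw in configured.split(","):
        key = raw.strip().lower()
        if key in KNOWN_MODEL_KEYS and key not in keys:
            keys.append(key)
    return keys or [default]  # fall back to the default model if nothing valid remains

os.environ["SPACE_ENABLED_MODELS"] = "dinov2,aiendo,vjepa2"
print(enabled_model_keys())  # ['dinov2', 'aiendo', 'vjepa2']
```

Ignoring unknown keys instead of raising keeps a mistyped Space setting from taking the whole demo down.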
README.md CHANGED

@@ -1,5 +1,4 @@
- ---
- title: DINO-ENDO Phase Recognition
+ title: AI-Endo Project Hub
  emoji: 🩺
  colorFrom: blue
  colorTo: green

@@ -7,11 +6,12 @@ sdk: docker
  app_port: 7860
  ---

- # DINO-ENDO Streamlit Space
+ # AI-Endo Project Hub

  This folder is an isolated Hugging Face Space scaffold for the phase-recognition models in this repository.
- It is intentionally separate from the existing FastAPI webapp and defaults to a **DINO-Endo demo** on paid GPU hardware such as **1x A10G (24 GB VRAM)**.
- The same code can still expose AI-Endo and V-JEPA2 when you opt into them through environment variables.
+ It is intentionally separate from the existing FastAPI webapp and is designed to expose **DINO-Endo, AI-Endo, and V-JEPA2** on paid GPU hardware such as **1x A10G (24 GB VRAM)**.
+ The public UI now behaves like a small **project hub**: DINO-Endo Surgery is the first featured workspace, and the same landing page can later host additional projects without rebuilding the overall shell.
+ The default featured model remains **DINO-Endo**, but the same Space can load and unload all three model families one at a time.

  ## Supported model families

@@ -42,15 +42,15 @@ A fully local `model/` folder is still supported as a fallback.

  ## Default Space behavior

- The Docker Space is configured to boot as a **DINO-Endo-first demo**:
+ The Docker Space is configured to boot as a **three-model public demo** with **DINO-Endo** selected by default:

- - `SPACE_ENABLED_MODELS=dinov2`
+ - `SPACE_ENABLED_MODELS=dinov2,aiendo,vjepa2`
  - `SPACE_DEFAULT_MODEL=dinov2`

- If you want the same Space build to expose multiple model families again, override those environment variables in Space Settings, for example:
+ If you want to narrow the public picker to a subset of models, override those environment variables in Space Settings, for example:

  ```text
- SPACE_ENABLED_MODELS=dinov2,aiendo,vjepa2
+ SPACE_ENABLED_MODELS=dinov2
  SPACE_DEFAULT_MODEL=dinov2
  ```

@@ -67,11 +67,22 @@ If a required checkpoint is missing locally, it will try to download it from the

  ### Upload and dashboard behavior

- - The Space now keeps a single active predictor loaded at a time and unloads the previous model when the picker changes.
+ - The top of the app is a reusable project-hub landing section, with DINO-Endo Surgery as the current live workspace.
+ - The active model family is selected through a visible **model slider** in the workspace rather than a hidden picker.
+ - The Space now keeps a single active predictor loaded at a time and unloads the previous model when the model slider changes.
  - MP4 is the primary video upload format, while `mov`, `avi`, `mkv`, `webm`, and `m4v` remain enabled as fallback containers.
  - `.streamlit/config.toml` raises the default Streamlit single-file upload ceiling to **4096 MB** for this Space.
  - Uploaded videos are immediately spooled to local disk for metadata probing and analysis, instead of repeatedly reading the in-memory upload object on every rerun.
  - The UI shows file size, duration, fps, frame count, resolution, working-storage headroom, and suppresses inline preview for very large uploads to keep the browser path lighter.
+ - V-JEPA2 is labeled as a slower first load so users understand the cold-cache cost of its very large encoder checkpoint.
+
+ ### Explainability behavior
+
+ - The sidebar includes an opt-in live explainability toggle for encoder/decoder visualizations.
+ - DINO-Endo and V-JEPA2 use true encoder self-attention maps, while AI-Endo uses a labeled proxy encoder overlay from ResNet activations.
+ - AI-Endo and DINO-Endo render decoder-side temporal attention strips from the custom Transformer path.
+ - V-JEPA2 renders a labeled proxy temporal strip from decoder feature energy because its classifier head is an MLP, not an attention block.
+ - Encoder controls expose **layer/head sliders** when the loaded model supports true encoder attention.

  ### Common environment variables

@@ -116,6 +127,15 @@ That script refreshes the vendored source copies inside this folder before publishing.
  4. Upload your checkpoints to HF **model repo(s)**.
  5. Configure the relevant repo IDs (and `HF_TOKEN` only if the repos are private).

+ ### Deployment helper scripts
+
+ - `python scripts/stage_space_bundle.py --overwrite --output-dir /tmp/dino_space_minimal_upload`
+   - stages a code-only upload bundle for the current multi-model Space without local caches or checkpoints.
+ - `python scripts/publish_model_repo.py --family aiendo --repo-id <owner/repo> --model-dir /path/to/model`
+   - publishes one model family to a Hugging Face **model repo** and automatically switches to `upload_large_folder()` for very large bundles.
+ - `python scripts/publish_space_repo.py --repo-id <owner/space> --dino-model-repo-id <owner/dino-repo> --aiendo-model-repo-id <owner/aiendo-repo> --vjepa2-model-repo-id <owner/vjepa2-repo>`
+   - stages/uploads the Docker Space bundle and updates the key Space environment variables for the three-model demo.
+
  ## Local smoke test

  Once the Space dependencies are installed, you can smoke test a predictor directly:

@@ -129,12 +149,14 @@ python scripts/smoke_test.py --model vjepa2 --model-dir /path/to/model

  ## Scope of v1

  - Streamlit UI
- - DINO-Endo demo by default, with optional multi-model selector when enabled
+ - project-hub landing page with DINO-Endo Surgery as the first hosted workspace
+ - three-model slider for DINO-Endo, AI-Endo, and V-JEPA2, with DINO-Endo selected by default
  - image upload and video upload
  - dashboard-style model/runtime status
  - robust video metadata probing with OpenCV + ffprobe fallback
  - large single-file uploads up to the configured Streamlit cap
  - per-frame phase timeline output for video
+ - optional live encoder/decoder explainability sidebar with true attention where available and labeled proxies elsewhere
  - JSON / CSV export

  Not included in v1:
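The README above repeatedly stresses that only one predictor stays loaded at a time and that the previous model is unloaded when the slider changes. A minimal sketch of that one-model-at-a-time policy (class and loader names here are hypothetical; the commit's actual implementation is `SpaceModelManager` in model_manager.py):

```python
import gc

class OneActiveModelManager:
    """Keep at most one predictor resident, as the Space README describes."""

    def __init__(self, loaders):
        self._loaders = loaders          # model_key -> zero-arg factory returning a predictor
        self._active_key = None
        self._active_predictor = None

    def get_predictor(self, model_key: str):
        if self._active_key != model_key:
            self.unload_model()          # free the previous model before loading the next one
            self._active_predictor = self._loaders[model_key]()
            self._active_key = model_key
        return self._active_predictor

    def unload_model(self) -> None:
        self._active_predictor = None
        self._active_key = None
        gc.collect()                     # drop references so GPU/host memory can be reclaimed
```

Loading lazily and unloading before every switch is what keeps a 24 GB A10G workable when each of the three checkpoints is large on its own.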
app.py CHANGED
@@ -4,6 +4,7 @@ import json
4
  import os
5
  import time
6
  from collections import Counter
 
7
  from pathlib import Path
8
 
9
  import cv2
@@ -13,6 +14,7 @@ import streamlit as st
13
  import torch
14
  from PIL import Image
15
 
 
16
  from model_manager import SpaceModelManager
17
  from model_registry import MODEL_SPECS, get_model_source_summary
18
  from predictor import MODEL_LABELS, PHASE_LABELS, normalize_model_key
@@ -28,7 +30,80 @@ from video_utils import (
28
  spool_uploaded_video,
29
  )
30
 
31
- st.set_page_config(page_title="DINO-Endo Phase Recognition", layout="wide")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32
 
33
 
34
  def _phase_index(phase: str) -> int:
@@ -43,6 +118,10 @@ def _image_to_rgb(uploaded_file) -> np.ndarray:
43
  return np.array(image)
44
 
45
 
 
 
 
 
46
  def _enabled_model_keys() -> list[str]:
47
  configured = os.getenv("SPACE_ENABLED_MODELS", "").strip()
48
  if not configured:
@@ -82,7 +161,223 @@ def _default_model_key(enabled_model_keys: list[str]) -> str:
82
  def _space_caption(enabled_model_keys: list[str]) -> str:
83
  if enabled_model_keys == ["dinov2"]:
84
  return "Streamlit Hugging Face Space demo for the DINO-Endo phase-recognition stack."
85
- return "DINO-first Streamlit Hugging Face Space demo for DINO-Endo, AI-Endo, and V-JEPA2."
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
86
 
87
 
88
  def _get_model_manager() -> SpaceModelManager:
@@ -126,7 +421,100 @@ def _prepare_staged_video(uploaded_file):
126
  return temp_path, meta
127
 
128
 
129
- def _analyse_video(video_path: str | Path, predictor, frame_stride: int, max_frames: int):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
130
  temp_path = Path(video_path)
131
  capture = cv2.VideoCapture(str(temp_path))
132
  if not capture.isOpened():
@@ -141,6 +529,7 @@ def _analyse_video(video_path: str | Path, predictor, frame_stride: int, max_fra
141
  records = []
142
  processed = 0
143
  frame_index = 0
 
144
 
145
  try:
146
  while True:
@@ -154,7 +543,7 @@ def _analyse_video(video_path: str | Path, predictor, frame_stride: int, max_fra
154
 
155
  rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
156
  started = time.perf_counter()
157
- result = predictor.predict(rgb)
158
  elapsed_ms = (time.perf_counter() - started) * 1000.0
159
 
160
  probs = result.get("probs", [0.0, 0.0, 0.0, 0.0])
@@ -174,6 +563,9 @@ def _analyse_video(video_path: str | Path, predictor, frame_stride: int, max_fra
174
  records.append(record)
175
  processed += 1
176
 
 
 
 
177
  if total_frames > 0:
178
  progress.progress(min(frame_index + 1, total_frames) / total_frames)
179
  else:
@@ -192,18 +584,6 @@ def _analyse_video(video_path: str | Path, predictor, frame_stride: int, max_fra
192
  return records, {"fps": fps, "total_frames": total_frames, "sampled_frames": processed}
193
 
194
 
195
- def _records_to_frame(records):
196
- if not records:
197
- return pd.DataFrame(columns=["frame_index", "timestamp_sec", "phase", "confidence"])
198
- return pd.DataFrame.from_records(records)
199
-
200
-
201
- def _download_payloads(df: pd.DataFrame):
202
- json_payload = df.to_json(orient="records", indent=2).encode("utf-8")
203
- csv_payload = df.to_csv(index=False).encode("utf-8")
204
- return json_payload, csv_payload
205
-
206
-
207
  def _render_single_result(result: dict):
208
  probs = result.get("probs", [0.0, 0.0, 0.0, 0.0])
209
  metrics = st.columns(3)
@@ -215,7 +595,7 @@ def _render_single_result(result: dict):
215
  st.bar_chart(prob_df.set_index("phase"))
216
  st.download_button(
217
  label="Download JSON",
218
- data=json.dumps(result, indent=2).encode("utf-8"),
219
  file_name="phase_prediction.json",
220
  mime="application/json",
221
  key="download-single-json",
@@ -269,31 +649,23 @@ def main():
269
  enabled_model_keys = _enabled_model_keys()
270
  default_model_key = _default_model_key(enabled_model_keys)
271
  manager = _get_model_manager()
 
 
 
272
 
273
- st.title("DINO-Endo Surgical Phase Recognition")
274
  st.caption(_space_caption(enabled_model_keys))
275
 
276
- st.sidebar.markdown("### Model")
277
- if len(enabled_model_keys) == 1:
278
- model_key = enabled_model_keys[0]
279
- st.sidebar.write(MODEL_LABELS[model_key])
280
- else:
281
- model_key = st.sidebar.selectbox(
282
- "Model",
283
- options=enabled_model_keys,
284
- index=enabled_model_keys.index(default_model_key),
285
- format_func=lambda key: MODEL_LABELS[key],
286
- )
287
-
288
- previous_selected_model_key = st.session_state.get("selected_model_key")
289
- st.session_state["selected_model_key"] = model_key
290
  if previous_selected_model_key is not None and previous_selected_model_key != model_key:
291
  manager.unload_model()
292
 
 
 
293
  source_summary = get_model_source_summary(model_key)
294
- manager_status = manager.status()
295
  st.sidebar.markdown("### Runtime")
296
  st.sidebar.write(f"Selected model: `{MODEL_LABELS[model_key]}`")
 
297
  st.sidebar.write(f"CUDA available: `{torch.cuda.is_available()}`")
298
  if torch.cuda.is_available():
299
  st.sidebar.write(f"Device: `{torch.cuda.get_device_name(torch.cuda.current_device())}`")
@@ -308,16 +680,13 @@ def main():
308
  st.sidebar.write(f"HF repo: `{source_summary['repo_id'] or 'local-only'}`")
309
  if source_summary["subfolder"]:
310
  st.sidebar.write(f"Repo subfolder: `{source_summary['subfolder']}`")
 
 
 
 
311
  st.sidebar.write(f"Video upload cap: `{STREAMLIT_SERVER_MAX_UPLOAD_MB} MB`")
312
  st.sidebar.write(f"Working storage free: `{format_bytes(get_workspace_free_bytes())}`")
313
 
314
- if manager_status.is_loaded and manager_status.active_model_label:
315
- st.sidebar.success(f"Loaded model: {manager_status.active_model_label}")
316
- else:
317
- st.sidebar.info("No model is currently loaded.")
318
- if manager_status.last_error:
319
- st.sidebar.error(manager_status.last_error)
320
-
321
  prepare_col, unload_col = st.sidebar.columns(2)
322
  if prepare_col.button("Load model", use_container_width=True):
323
  try:
@@ -331,90 +700,151 @@ def main():
331
  manager.unload_model()
332
  st.sidebar.success("Model unloaded")
333
 
 
 
 
 
 
 
 
 
334
  image_tab, video_tab = st.tabs(["Image", "Video"])
335
 
336
  with image_tab:
337
- uploaded_image = st.file_uploader("Upload an RGB frame", type=["png", "jpg", "jpeg"], key="image-uploader")
338
- if uploaded_image is not None:
339
- rgb = _image_to_rgb(uploaded_image)
340
- st.image(rgb, caption=uploaded_image.name, use_container_width=True)
341
- if st.button("Run image inference", key="run-image"):
342
- try:
343
- with st.spinner(f"Running {MODEL_LABELS[model_key]} on {uploaded_image.name}..."):
344
- predictor = manager.get_predictor(model_key)
345
- predictor.reset_state()
346
- started = time.perf_counter()
347
- result = predictor.predict(rgb)
348
- result["inference_ms"] = round((time.perf_counter() - started) * 1000.0, 3)
349
- predictor.reset_state()
350
- except Exception as exc:
351
- st.error(str(exc))
352
- else:
353
- _render_single_result(result)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
354
 
355
  with video_tab:
356
- frame_stride = st.slider("Analyze every Nth frame", min_value=1, max_value=30, value=5, step=1)
357
- max_frames = st.slider("Maximum sampled frames", min_value=10, max_value=600, value=180, step=10)
358
- uploaded_video = st.file_uploader(
359
- "Upload a video (MP4 preferred)",
360
- type=SUPPORTED_VIDEO_TYPES,
361
- key="video-uploader",
362
- help=(
363
- f"Single-file uploads are enabled up to {STREAMLIT_SERVER_MAX_UPLOAD_MB} MB. "
364
- "MP4 is preferred; MOV/AVI/MKV/WEBM/M4V stay enabled as fallback containers."
365
- ),
366
- max_upload_size=STREAMLIT_SERVER_MAX_UPLOAD_MB,
367
  )
368
- if uploaded_video is not None:
369
- try:
370
- temp_path, video_meta = _prepare_staged_video(uploaded_video)
371
- except Exception as exc:
372
- st.error(str(exc))
373
- else:
374
- info_cols = st.columns(5)
375
- info_cols[0].metric("File size", video_meta["file_size_label"])
376
- info_cols[1].metric("Duration", video_meta["duration_label"])
377
- info_cols[2].metric("FPS", f"{video_meta.get('fps', 0.0):.2f}" if video_meta.get("fps") else "Unknown")
378
- info_cols[3].metric("Frames", int(video_meta.get("frame_count", 0)))
379
- info_cols[4].metric("Resolution", video_meta["resolution_label"])
380
- if video_meta.get("format_name"):
381
- st.caption(f"Container detected by ffprobe: {video_meta['format_name']}")
382
-
383
- recommended_stride = recommended_frame_stride(video_meta.get("duration_seconds"))
384
- st.caption(
385
- f"Recommended frame stride for this video: every {recommended_stride} frame(s). "
386
- "Use higher values for very long videos to keep analysis times reasonable."
387
- )
388
-
389
- if should_show_inline_preview(video_meta["file_size_bytes"]):
390
- st.video(uploaded_video)
391
  else:
392
- st.info(
393
- "Inline preview is disabled for uploads larger than "
394
- "256 MB to avoid pushing very large media back through the browser. "
395
- "The staged video on disk is still used for analysis."
 
 
 
 
 
 
 
 
 
396
  )
397
 
398
- if st.button("Analyze video", key="run-video"):
399
- try:
400
- with st.spinner(f"Running {MODEL_LABELS[model_key]} on {uploaded_video.name}..."):
401
- predictor = manager.get_predictor(model_key)
402
- records, analysis_meta = _analyse_video(
403
- temp_path,
404
- predictor,
405
- frame_stride=frame_stride,
406
- max_frames=max_frames,
407
- )
408
- meta = {
409
- **video_meta,
410
- **analysis_meta,
411
- }
412
- except Exception as exc:
413
- st.error(str(exc))
414
  else:
415
- _render_video_results(records, meta)
416
- else:
417
- _clear_video_stage()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
418
 
419
 
420
  if __name__ == "__main__":
 
4
  import os
5
  import time
6
  from collections import Counter
7
+ from dataclasses import dataclass
8
  from pathlib import Path
9
 
10
  import cv2
 
14
  import torch
15
  from PIL import Image
16
 
17
+ from explainability import ExplainabilitySpec
18
  from model_manager import SpaceModelManager
19
  from model_registry import MODEL_SPECS, get_model_source_summary
20
  from predictor import MODEL_LABELS, PHASE_LABELS, normalize_model_key
 
30
  spool_uploaded_video,
31
  )
32
 
33
+ st.set_page_config(page_title="AI-Endo Project Hub", layout="wide")
34
+
35
+ MODEL_OPTION_LABELS = {
36
+ "aiendo": "AI-Endo",
37
+ "dinov2": "DINO-Endo",
38
+ "vjepa2": "V-JEPA2 (slower first load)",
39
+ }
40
+
41
+ MODEL_LOAD_NOTES = {
42
+ "aiendo": "AI-Endo uses the ResNet + MS-TCN + Transformer stack.",
43
+ "dinov2": "DINO-Endo remains the default public model in this demo.",
44
+ "vjepa2": "V-JEPA2 can take longer on the first load because the encoder checkpoint is several gigabytes.",
45
+ }
46
+
47
+ FALLBACK_EXPLAINABILITY_SPECS = {
48
+ "aiendo": ExplainabilitySpec(
49
+ encoder_mode="proxy",
50
+ encoder_label="ResNet layer4 activation energy (proxy)",
51
+ decoder_mode="attention",
52
+ decoder_label="Temporal Transformer attention",
53
+ ),
54
+ "dinov2": ExplainabilitySpec(
55
+ encoder_mode="attention",
56
+ encoder_label="DINOv2 encoder self-attention",
57
+ decoder_mode="attention",
58
+ decoder_label="Fusion Transformer temporal attention",
59
+ encoder_layer_count=12,
60
+ encoder_head_count=6,
61
+ ),
62
+ "vjepa2": ExplainabilitySpec(
63
+ encoder_mode="attention",
64
+ encoder_label="V-JEPA2 encoder self-attention",
65
+ decoder_mode="proxy",
66
+ decoder_label="MLP decoder feature energy (proxy)",
67
+ encoder_layer_count=24,
68
+ encoder_head_count=16,
69
+ ),
70
+ }
71
+
72
+
73
+ SPACE_TITLE = "AI-Endo Project Hub"
74
+ FEATURED_PROJECT_TITLE = "DINO-Endo Surgery Workspace"
75
+ MODEL_SLIDER_KEY = "workspace-model-slider"
76
+ SELECTED_MODEL_STATE_KEY = "selected_model_key"
77
+
78
+
79
+ @dataclass(frozen=True)
80
+ class HostedProject:
81
+ key: str
82
+ title: str
83
+ status: str
84
+ summary: str
85
+ highlights: tuple[str, ...]
86
+ tags: tuple[str, ...]
87
+
88
+
89
+ HOSTED_PROJECTS = (
90
+ HostedProject(
91
+ key="dino-endo-surgery",
92
+ title=FEATURED_PROJECT_TITLE,
93
+ status="Live now",
94
+ summary=(
95
+ "Upload single frames or full videos, swap between DINO-Endo, AI-Endo, and V-JEPA2, "
96
+ "and inspect optional explainability overlays inside one surgical phase-recognition workspace."
97
+ ),
98
+ highlights=(
99
+ "Large video uploads with on-disk staging",
100
+ "One-click JSON and CSV export",
101
+ "Live encoder and decoder explainability",
102
+ "Manual load and unload for GPU-safe model switching",
103
+ ),
104
+ tags=("Computer vision", "Medical video", "Multi-model inference"),
105
+ ),
106
+ )
107
 
108
 
109
  def _phase_index(phase: str) -> int:
 
118
  return np.array(image)
119
 
120
 
121
+ def _model_option_label(model_key: str) -> str:
122
+ return MODEL_OPTION_LABELS.get(model_key, MODEL_LABELS.get(model_key, model_key))
123
+
124
+
125
  def _enabled_model_keys() -> list[str]:
126
  configured = os.getenv("SPACE_ENABLED_MODELS", "").strip()
127
  if not configured:
 
161
  def _space_caption(enabled_model_keys: list[str]) -> str:
162
  if enabled_model_keys == ["dinov2"]:
163
  return "Streamlit Hugging Face Space demo for the DINO-Endo phase-recognition stack."
164
+ return "Streamlit Hugging Face Space demo for DINO-Endo, AI-Endo, and V-JEPA2 with one active model loaded at a time."
165
+
166
+
167
+ def _inject_app_styles() -> None:
168
+ st.markdown(
169
+ """
170
+ <style>
171
+ .block-container {
172
+ padding-top: 2.4rem;
173
+ padding-bottom: 2rem;
174
+ }
175
+
176
+ .hub-hero,
177
+ .hub-card,
178
+ .workspace-card {
179
+ border-radius: 22px;
180
+ border: 1px solid rgba(148, 163, 184, 0.22);
181
+ background: linear-gradient(180deg, rgba(15, 23, 42, 0.86), rgba(15, 23, 42, 0.66));
182
+ box-shadow: 0 20px 45px rgba(15, 23, 42, 0.18);
183
+ }
184
+
185
+ .hub-hero {
186
+ padding: 2rem 2.25rem;
187
+ margin-bottom: 1rem;
188
+ background: linear-gradient(135deg, rgba(14, 165, 233, 0.18), rgba(16, 185, 129, 0.18), rgba(15, 23, 42, 0.9));
189
+ }
190
+
191
+ .hub-eyebrow {
192
+ margin: 0;
193
+ color: #67e8f9;
194
+ font-size: 0.78rem;
195
+ font-weight: 700;
196
+ letter-spacing: 0.18em;
197
+ text-transform: uppercase;
198
+ }
199
+
200
+ .hub-hero h1,
201
+ .workspace-card h2,
202
+ .hub-card h3 {
203
+ margin: 0.4rem 0 0 0;
204
+ color: #f8fafc;
205
+ }
206
+
207
+ .hub-subtitle,
208
+ .workspace-copy,
209
+ .hub-card p,
210
+ .hub-card li {
211
+ color: rgba(226, 232, 240, 0.92);
212
+ line-height: 1.55;
213
+ }
214
+
215
+ .hub-subtitle {
216
+ margin-top: 0.8rem;
217
+ max-width: 62rem;
218
+ font-size: 1.03rem;
219
+ }
220
+
221
+ .hub-chip-row {
222
+ display: flex;
223
+ flex-wrap: wrap;
224
+ gap: 0.55rem;
225
+ margin-top: 1rem;
226
+ }
227
+
228
+ .hub-chip,
229
+ .hub-status {
230
+ display: inline-flex;
231
+ align-items: center;
232
+ border-radius: 999px;
233
+ padding: 0.32rem 0.78rem;
234
+ font-size: 0.82rem;
235
+ font-weight: 600;
236
+ }
237
+
238
+ .hub-chip {
239
+ background: rgba(15, 23, 42, 0.56);
240
+ border: 1px solid rgba(103, 232, 249, 0.24);
241
+ color: #e2e8f0;
242
+ }
243
+
244
+ .hub-status {
245
+ background: rgba(34, 197, 94, 0.18);
246
+ border: 1px solid rgba(34, 197, 94, 0.28);
247
+ color: #bbf7d0;
248
+ margin-bottom: 0.7rem;
249
+ }
250
+
251
+ .hub-card,
252
+ .workspace-card {
253
+ padding: 1.25rem 1.4rem;
254
+ height: 100%;
255
+ }
256
+
257
+ .hub-card ul {
258
+ margin: 0.8rem 0 0 1rem;
259
+ padding: 0;
260
+ }
261
+
262
+ .workspace-card {
263
+ margin: 0.3rem 0 1rem 0;
264
+ }
265
+ </style>
266
+ """,
267
+ unsafe_allow_html=True,
268
+ )
269
+
270
+
271
+ def _render_hub_chips(labels: list[str] | tuple[str, ...]) -> str:
272
+ return "".join(f'<span class="hub-chip">{label}</span>' for label in labels)
273
+
274
+
275
+ def _render_project_hub(enabled_model_keys: list[str]) -> None:
276
+ featured = HOSTED_PROJECTS[0]
277
+ enabled_labels = [_model_option_label(key) for key in enabled_model_keys]
278
+ st.markdown(
279
+ f"""
280
+ <section class="hub-hero">
281
+ <p class="hub-eyebrow">Multi-project landing page</p>
282
+ <h1>{SPACE_TITLE}</h1>
283
+ <p class="hub-subtitle">
284
+ A polished landing page for applied vision demos. {FEATURED_PROJECT_TITLE} is the first live workspace,
285
+ and the layout is ready to host more projects later without rebuilding the app shell.
286
+ </p>
287
+ <div class="hub-chip-row">
288
+ {_render_hub_chips(tuple(enabled_labels) + ("Future-project ready", "Streamlit + Docker Space"))}
289
+ </div>
290
+ </section>
291
+ """,
292
+ unsafe_allow_html=True,
293
+ )
294
+
295
+ metrics = st.columns(4)
296
+ metrics[0].metric("Hosted projects", len(HOSTED_PROJECTS))
297
+ metrics[1].metric("Model families", len(enabled_model_keys))
298
+ metrics[2].metric("Explainability", "Opt-in")
299
+ metrics[3].metric("Exports", "JSON + CSV")
300
+
301
+ left_col, right_col = st.columns([1.8, 1.2], gap="large")
302
+ with left_col:
303
+ highlights_html = "".join(f"<li>{item}</li>" for item in featured.highlights)
304
+ st.markdown(
305
+ f"""
306
+ <section class="hub-card">
307
+ <span class="hub-status">{featured.status}</span>
308
+ <h3>{featured.title}</h3>
309
+ <p>{featured.summary}</p>
310
+ <div class="hub-chip-row">{_render_hub_chips(featured.tags)}</div>
311
+ <ul>{highlights_html}</ul>
312
+ </section>
313
+ """,
314
+ unsafe_allow_html=True,
315
+ )
316
+
317
+ with right_col:
318
+ st.markdown(
319
+ """
320
+ <section class="hub-card">
321
+ <span class="hub-status">Platform shell</span>
322
+ <h3>Ready for more demos</h3>
323
+ <p>
324
+ The top section now works as a reusable project hub instead of a one-off page. Add more project cards
325
+ and workspace blocks here later, while keeping one shared brand, layout, and deployment target.
326
+ </p>
327
+ <ul>
328
+ <li>Keep each project's controls inside its own workspace section.</li>
329
+ <li>Reuse the same landing-page hero, metrics, and project-card layout.</li>
330
+ <li>Preserve one-model-at-a-time loading so future demos stay GPU-friendly.</li>
331
+ </ul>
332
+ </section>
333
+ """,
334
+ unsafe_allow_html=True,
335
+ )
336
+
337
+
338
+ def _render_workspace_header(enabled_model_keys: list[str], model_key: str) -> None:
339
+ selected_label = _model_option_label(model_key)
340
+ selection_note = (
341
+ "Use the model slider to move between DINO-Endo, AI-Endo, and V-JEPA2. "
342
+ "Only one model stays loaded at a time so the Space remains responsive on shared GPU hardware."
343
+ )
344
+ st.markdown(
345
+ f"""
346
+ <section class="workspace-card">
347
+        <p class="hub-eyebrow">Featured project</p>
+        <h2>{FEATURED_PROJECT_TITLE}</h2>
+        <p class="workspace-copy">
+          {selection_note}
+        </p>
+        <div class="hub-chip-row">
+          {_render_hub_chips(tuple(_model_option_label(key) for key in enabled_model_keys))}
+          <span class="hub-chip">Selected: {selected_label}</span>
+        </div>
+        </section>
+        """,
+        unsafe_allow_html=True,
+    )
+
+
+def _resolve_model_selection(enabled_model_keys: list[str], default_model_key: str) -> tuple[str | None, str]:
+    previous_selected_model_key = st.session_state.get(SELECTED_MODEL_STATE_KEY)
+    current_slider_value = st.session_state.get(MODEL_SLIDER_KEY)
+    if current_slider_value not in enabled_model_keys:
+        st.session_state[MODEL_SLIDER_KEY] = default_model_key
+
+    if len(enabled_model_keys) == 1:
+        model_key = enabled_model_keys[0]
+        st.session_state[MODEL_SLIDER_KEY] = model_key
+        return previous_selected_model_key, model_key
+
+    model_key = st.select_slider(
+        "Project model slider",
+        options=enabled_model_keys,
+        key=MODEL_SLIDER_KEY,
+        format_func=_model_option_label,
+        help="Prominent model-family slider for the DINO-Endo project workspace.",
+    )
+    return previous_selected_model_key, model_key
 
 
 def _get_model_manager() -> SpaceModelManager:

     return temp_path, meta
 
 
+def _records_to_frame(records):
+    if not records:
+        return pd.DataFrame(columns=["frame_index", "timestamp_sec", "phase", "confidence"])
+    return pd.DataFrame.from_records(records)
+
+
+def _download_payloads(df: pd.DataFrame):
+    json_payload = df.to_json(orient="records", indent=2).encode("utf-8")
+    csv_payload = df.to_csv(index=False).encode("utf-8")
+    return json_payload, csv_payload
+
+
+def _get_explainability_spec(manager: SpaceModelManager, model_key: str) -> ExplainabilitySpec:
+    predictor = manager.get_loaded_predictor(model_key)
+    if predictor is not None and hasattr(predictor, "get_explainability_spec"):
+        return predictor.get_explainability_spec()
+    return FALLBACK_EXPLAINABILITY_SPECS[model_key]
+
+
+def _build_explainability_config(manager: SpaceModelManager, model_key: str):
+    spec = _get_explainability_spec(manager, model_key)
+    st.sidebar.markdown("### Explainability")
+    enabled = st.sidebar.toggle(
+        "Enable live encoder/decoder maps",
+        value=False,
+        help="Shows encoder heatmaps and decoder temporal strips on every processed frame. Leave this off if you want the fastest video analysis path.",
+    )
+    config = {"enabled": enabled}
+    if not enabled:
+        return config, spec
+
+    st.sidebar.caption(f"Encoder view: {spec.encoder_label}")
+    st.sidebar.caption(f"Decoder view: {spec.decoder_label}")
+    if spec.encoder_mode == "attention" and spec.encoder_layer_count > 0 and spec.encoder_head_count > 0:
+        default_layer = spec.encoder_layer_count - 1
+        config["encoder_layer"] = st.sidebar.slider(
+            "Encoder layer",
+            min_value=1,
+            max_value=spec.encoder_layer_count,
+            value=default_layer + 1,
+            key=f"explainability-layer-{model_key}",
+        ) - 1
+        config["encoder_head"] = st.sidebar.slider(
+            "Encoder head",
+            min_value=1,
+            max_value=spec.encoder_head_count,
+            value=1,
+            key=f"explainability-head-{model_key}",
+        ) - 1
+    else:
+        st.sidebar.info("This model uses a proxy encoder overlay instead of true encoder attention.")
+
+    st.sidebar.caption("Decoder strips are rendered as temporal heat strips rather than projected back onto the frame.")
+    return config, spec
+
+
+def _render_explainability_panel(target, payload: dict | None, *, enabled: bool, spec: ExplainabilitySpec, title: str) -> None:
+    with target.container():
+        st.markdown(f"### {title}")
+        if not enabled:
+            st.caption("Turn on the explainability toggle in the sidebar to inspect encoder heatmaps and decoder temporal strips.")
+            return
+
+        st.caption(f"Encoder default: {spec.encoder_label}")
+        st.caption(f"Decoder default: {spec.decoder_label}")
+        if payload is None:
+            st.info("Run image or video inference to populate this live explainability panel.")
+            return
+
+        layer_index = payload.get("encoder_layer")
+        head_index = payload.get("encoder_head")
+        encoder_caption = f"{payload['encoder_label']} ({payload['encoder_kind']})"
+        if layer_index is not None and head_index is not None:
+            encoder_caption += f" · layer {int(layer_index) + 1}, head {int(head_index) + 1}"
+        st.caption(encoder_caption)
+        st.image(payload["encoder_visualization"], use_container_width=True)
+
+        st.caption(f"{payload['decoder_label']} ({payload['decoder_kind']})")
+        st.image(payload["decoder_visualization"], use_container_width=True)
+
+        notes = payload.get("notes")
+        if notes:
+            st.caption(notes)
+
+
+def _analyse_video(
+    video_path: str | Path,
+    predictor,
+    frame_stride: int,
+    max_frames: int,
+    *,
+    explainability_config: dict | None = None,
+    explainability_callback=None,
+):
     temp_path = Path(video_path)
     capture = cv2.VideoCapture(str(temp_path))
     if not capture.isOpened():

     records = []
     processed = 0
     frame_index = 0
+    explain_enabled = bool(explainability_config and explainability_config.get("enabled"))
 
     try:
         while True:

             rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
             started = time.perf_counter()
+            result = predictor.predict(rgb, explainability=explainability_config if explain_enabled else None)
             elapsed_ms = (time.perf_counter() - started) * 1000.0
 
             probs = result.get("probs", [0.0, 0.0, 0.0, 0.0])

             records.append(record)
             processed += 1
 
+            if explain_enabled and explainability_callback is not None:
+                explainability_callback(result.get("explainability"), processed, frame_index)
+
             if total_frames > 0:
                 progress.progress(min(frame_index + 1, total_frames) / total_frames)
             else:

     return records, {"fps": fps, "total_frames": total_frames, "sampled_frames": processed}
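The every-Nth-frame sampling that `_analyse_video` drives with its `frame_stride` and `max_frames` parameters can be sketched in isolation (a hypothetical standalone helper, not part of the app's API):

```python
def sample_indices(total_frames: int, stride: int, max_frames: int) -> list[int]:
    """Indices of frames to analyse: every `stride`-th frame, capped at `max_frames`."""
    indices = []
    for frame_index in range(total_frames):
        if frame_index % stride == 0:  # keep only every stride-th frame
            indices.append(frame_index)
            if len(indices) >= max_frames:  # stop once the sample budget is spent
                break
    return indices

print(sample_indices(100, 5, 10))  # [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
```

For a long recording, raising `stride` (or lowering `max_frames`) trades temporal resolution for a shorter analysis run.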
 def _render_single_result(result: dict):
     probs = result.get("probs", [0.0, 0.0, 0.0, 0.0])
     metrics = st.columns(3)

     st.bar_chart(prob_df.set_index("phase"))
     st.download_button(
         label="Download JSON",
+        data=json.dumps(result, indent=2, default=str).encode("utf-8"),
         file_name="phase_prediction.json",
         mime="application/json",
         key="download-single-json",

     enabled_model_keys = _enabled_model_keys()
     default_model_key = _default_model_key(enabled_model_keys)
     manager = _get_model_manager()
+    _inject_app_styles()
+    _render_project_hub(enabled_model_keys)
+    previous_selected_model_key, model_key = _resolve_model_selection(enabled_model_keys, default_model_key)
 
+    _render_workspace_header(enabled_model_keys, model_key)
     st.caption(_space_caption(enabled_model_keys))
 
+    st.session_state[SELECTED_MODEL_STATE_KEY] = model_key
 
     if previous_selected_model_key is not None and previous_selected_model_key != model_key:
         manager.unload_model()
 
+    explainability_config, explainability_spec = _build_explainability_config(manager, model_key)
+
     source_summary = get_model_source_summary(model_key)
     st.sidebar.markdown("### Runtime")
     st.sidebar.write(f"Selected model: `{MODEL_LABELS[model_key]}`")
+    st.sidebar.caption(MODEL_LOAD_NOTES[model_key])
     st.sidebar.write(f"CUDA available: `{torch.cuda.is_available()}`")
     if torch.cuda.is_available():
         st.sidebar.write(f"Device: `{torch.cuda.get_device_name(torch.cuda.current_device())}`")

     st.sidebar.write(f"HF repo: `{source_summary['repo_id'] or 'local-only'}`")
     if source_summary["subfolder"]:
         st.sidebar.write(f"Repo subfolder: `{source_summary['subfolder']}`")
+    with st.sidebar.expander("Checkpoint requirements", expanded=False):
+        st.write(", ".join(source_summary["required_files"]))
+        if source_summary["optional_files"]:
+            st.caption("Optional: " + ", ".join(source_summary["optional_files"]))
     st.sidebar.write(f"Video upload cap: `{STREAMLIT_SERVER_MAX_UPLOAD_MB} MB`")
     st.sidebar.write(f"Working storage free: `{format_bytes(get_workspace_free_bytes())}`")
 
     prepare_col, unload_col = st.sidebar.columns(2)
     if prepare_col.button("Load model", use_container_width=True):
         try:

         manager.unload_model()
         st.sidebar.success("Model unloaded")
 
+    manager_status = manager.status()
+    if manager_status.is_loaded and manager_status.active_model_label:
+        st.sidebar.success(f"Loaded model: {manager_status.active_model_label}")
+    else:
+        st.sidebar.info("No model is currently loaded.")
+    if manager_status.last_error:
+        st.sidebar.error(manager_status.last_error)
+
     image_tab, video_tab = st.tabs(["Image", "Video"])
 
     with image_tab:
+        image_main_col, image_explain_col = st.columns([3, 2], gap="large")
+        image_explain_placeholder = image_explain_col.empty()
+        image_result = None
+
+        with image_main_col:
+            uploaded_image = st.file_uploader("Upload an RGB frame", type=["png", "jpg", "jpeg"], key="image-uploader")
+            if uploaded_image is not None:
+                rgb = _image_to_rgb(uploaded_image)
+                st.image(rgb, caption=uploaded_image.name, use_container_width=True)
+                if st.button("Run image inference", key="run-image"):
+                    try:
+                        with st.spinner(f"Running {MODEL_LABELS[model_key]} on {uploaded_image.name}..."):
+                            predictor = manager.get_predictor(model_key)
+                            predictor.reset_state()
+                            started = time.perf_counter()
+                            image_result = predictor.predict(
+                                rgb,
+                                explainability=explainability_config if explainability_config.get("enabled") else None,
+                            )
+                            image_result["inference_ms"] = round((time.perf_counter() - started) * 1000.0, 3)
+                            predictor.reset_state()
+                    except Exception as exc:
+                        st.error(str(exc))
+                    else:
+                        _render_single_result(image_result)
+
+        _render_explainability_panel(
+            image_explain_placeholder,
+            image_result.get("explainability") if image_result else None,
+            enabled=bool(explainability_config.get("enabled")),
+            spec=explainability_spec,
+            title="Explainability",
+        )
 
     with video_tab:
+        video_main_col, video_explain_col = st.columns([3, 2], gap="large")
+        video_explain_placeholder = video_explain_col.empty()
+        _render_explainability_panel(
+            video_explain_placeholder,
+            None,
+            enabled=bool(explainability_config.get("enabled")),
+            spec=explainability_spec,
+            title="Explainability",
         )
+
+        with video_main_col:
+            frame_stride = st.slider("Analyze every Nth frame", min_value=1, max_value=30, value=5, step=1)
+            max_frames = st.slider("Maximum sampled frames", min_value=10, max_value=600, value=180, step=10)
+            uploaded_video = st.file_uploader(
+                "Upload a video (MP4 preferred)",
+                type=SUPPORTED_VIDEO_TYPES,
+                key="video-uploader",
+                help=(
+                    f"Single-file uploads are enabled up to {STREAMLIT_SERVER_MAX_UPLOAD_MB} MB. "
+                    "MP4 is preferred; MOV/AVI/MKV/WEBM/M4V stay enabled as fallback containers."
+                ),
+                max_upload_size=STREAMLIT_SERVER_MAX_UPLOAD_MB,
+            )
+            if uploaded_video is not None:
+                try:
+                    temp_path, video_meta = _prepare_staged_video(uploaded_video)
+                except Exception as exc:
+                    st.error(str(exc))
                 else:
+                    info_cols = st.columns(5)
+                    info_cols[0].metric("File size", video_meta["file_size_label"])
+                    info_cols[1].metric("Duration", video_meta["duration_label"])
+                    info_cols[2].metric("FPS", f"{video_meta.get('fps', 0.0):.2f}" if video_meta.get("fps") else "Unknown")
+                    info_cols[3].metric("Frames", int(video_meta.get("frame_count", 0)))
+                    info_cols[4].metric("Resolution", video_meta["resolution_label"])
+                    if video_meta.get("format_name"):
+                        st.caption(f"Container detected by ffprobe: {video_meta['format_name']}")
+
+                    recommended_stride = recommended_frame_stride(video_meta.get("duration_seconds"))
+                    st.caption(
+                        f"Recommended frame stride for this video: every {recommended_stride} frame(s). "
+                        "Use higher values for very long videos to keep analysis times reasonable."
                     )
 
+                    if should_show_inline_preview(video_meta["file_size_bytes"]):
+                        st.video(uploaded_video)
                     else:
+                        st.info(
+                            "Inline preview is disabled for uploads larger than "
+                            "256 MB to avoid pushing very large media back through the browser. "
+                            "The staged video on disk is still used for analysis."
+                        )
+
+                    if st.button("Analyze video", key="run-video"):
+                        latest_payload = {"value": None}
+
+                        def _video_explainability_callback(payload, processed_count: int, current_frame_index: int):
+                            latest_payload["value"] = payload
+                            _render_explainability_panel(
+                                video_explain_placeholder,
+                                payload,
+                                enabled=True,
+                                spec=explainability_spec,
+                                title=f"Live explainability · sampled frame {processed_count}",
+                            )
+
+                        try:
+                            with st.spinner(f"Running {MODEL_LABELS[model_key]} on {uploaded_video.name}..."):
+                                predictor = manager.get_predictor(model_key)
+                                records, analysis_meta = _analyse_video(
+                                    temp_path,
+                                    predictor,
+                                    frame_stride=frame_stride,
+                                    max_frames=max_frames,
+                                    explainability_config=explainability_config if explainability_config.get("enabled") else None,
+                                    explainability_callback=(
+                                        _video_explainability_callback
+                                        if explainability_config.get("enabled")
+                                        else None
+                                    ),
+                                )
+                                meta = {
+                                    **video_meta,
+                                    **analysis_meta,
+                                }
+                        except Exception as exc:
+                            st.error(str(exc))
+                        else:
+                            _render_video_results(records, meta)
+                            if explainability_config.get("enabled"):
+                                _render_explainability_panel(
+                                    video_explain_placeholder,
+                                    latest_payload["value"],
+                                    enabled=True,
+                                    spec=explainability_spec,
+                                    title="Explainability",
+                                )
+            else:
+                _clear_video_stage()
 
 
 if __name__ == "__main__":
dinov2/.github/workflows/lint.yaml ADDED
@@ -0,0 +1,38 @@
+name: Lint
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+
+jobs:
+  run-linters:
+    name: Run linters
+    runs-on: ubuntu-20.04
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v3
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: 3.9
+          cache: 'pip'
+          cache-dependency-path: '**/requirements*.txt'
+      - name: Install Python (development) dependencies
+        run: |
+          pip install -r requirements-dev.txt
+      - name: Run flake8
+        run: |
+          flake8
+      - name: Run black
+        if: always()
+        run: |
+          black --check dinov2
+      - name: Run pylint
+        if: always()
+        run: |
+          pylint --exit-zero dinov2
dinov2/.gitignore ADDED
@@ -0,0 +1,11 @@
+build/
+dist/
+*.egg-info/
+**/__pycache__/
+
+**/.ipynb_checkpoints
+**/.ipynb_checkpoints/**
+
+*.swp
+
+.vscode/
dinov2/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,80 @@
+# Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to make participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, sex characteristics, gender identity and expression,
+level of experience, education, socio-economic status, nationality, personal
+appearance, race, religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+  advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+  address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies within all project spaces, and it also applies when
+an individual is representing the project or its community in public spaces.
+Examples of representing a project or community include using an official
+project e-mail address, posting via an official social media account, or acting
+as an appointed representative at an online or offline event. Representation of
+a project may be further defined and clarified by project maintainers.
+
+This Code of Conduct also applies outside the project spaces when there is a
+reasonable belief that an individual's behavior may have a negative impact on
+the project or its community.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at <opensource-conduct@meta.com>. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see
+https://www.contributor-covenant.org/faq
dinov2/CONTRIBUTING.md ADDED
@@ -0,0 +1,31 @@
+# Contributing to DINOv2
+We want to make contributing to this project as easy and transparent as
+possible.
+
+## Pull Requests
+We actively welcome your pull requests.
+
+1. Fork the repo and create your branch from `main`.
+2. If you've added code that should be tested, add tests.
+3. If you've changed APIs, update the documentation.
+4. Ensure the test suite passes.
+5. Make sure your code lints.
+6. If you haven't already, complete the Contributor License Agreement ("CLA").
+
+## Contributor License Agreement ("CLA")
+In order to accept your pull request, we need you to submit a CLA. You only need
+to do this once to work on any of Meta's open source projects.
+
+Complete your CLA here: <https://code.facebook.com/cla>
+
+## Issues
+We use GitHub issues to track public bugs. Please ensure your description is
+clear and has sufficient instructions to be able to reproduce the issue.
+
+Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe
+disclosure of security bugs. In those cases, please go through the process
+outlined on that page and do not file a public issue.
+
+## License
+By contributing to DINOv2, you agree that your contributions will be licensed
+under the LICENSE file in the root directory of this source tree.
dinov2/LICENSE ADDED
@@ -0,0 +1,203 @@
+
+
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!) The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright [yyyy] [name of copyright owner]
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
dinov2/MODEL_CARD.md ADDED
@@ -0,0 +1,272 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
# Model Card for DINOv2-S/B/L/g

These are Vision Transformer models trained following the method described in the papers:
"DINOv2: Learning Robust Visual Features without Supervision"
and
"Vision Transformers Need Registers".

We provide 8 models:
- 1 ViT-g trained from scratch with 3 ViT-S/B/L models distilled from the ViT-g, without registers.
- 1 ViT-g trained from scratch with 3 ViT-S/B/L models distilled from the ViT-g, with registers.

## Model Details
The model takes an image as input and returns a class token and patch tokens, and optionally 4 register tokens.

The embedding dimension is:
- 384 for ViT-S.
- 768 for ViT-B.
- 1024 for ViT-L.
- 1536 for ViT-g.

The models follow a Transformer architecture, with a patch size of 14. In the case of registers, we add 4 register tokens, learned during training, to the input sequence after the patch embedding.

For a 224x224 image, this results in 1 class token + 256 patch tokens, and optionally 4 register tokens.

The models can accept larger images provided the image shapes are multiples of the patch size (14).
If this condition is not met, the model will crop to the closest smaller multiple of the patch size.
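The sizing rules above are easy to check with a few lines of arithmetic (an illustrative sketch, not part of the DINOv2 code; the helper name is made up):

```python
def token_counts(height: int, width: int, patch_size: int = 14, num_registers: int = 0) -> int:
    """Crop each side down to the closest smaller multiple of the patch size, then count tokens."""
    h = (height // patch_size) * patch_size  # cropped height
    w = (width // patch_size) * patch_size   # cropped width
    patch_tokens = (h // patch_size) * (w // patch_size)
    return 1 + patch_tokens + num_registers  # 1 class token + patch tokens (+ registers)

print(token_counts(224, 224))                   # 257 = 1 class token + 16*16 patch tokens
print(token_counts(224, 224, num_registers=4))  # 261
print(token_counts(230, 230))                   # also 257: 230 is cropped down to 224
```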

### Model Description

- **Developed by:** Meta AI
- **Model type:** Vision Transformer
- **License:** Apache License 2.0

- **Repository:** https://github.com/facebookresearch/dinov2
- **Paper:** https://arxiv.org/abs/2304.07193
- **Demo:** https://dinov2.metademolab.com/

## Uses

The models are vision backbones providing multi-purpose features for downstream tasks.

### Direct Use

The models can be used without fine-tuning, with downstream classifiers as simple as linear layers, to obtain competitive results:
- on depth estimation and semantic segmentation, using linear layers.
- on image classification, using k-NN classifiers on the class token.
- on image classification, with logistic regression classifiers applied on the class token.
- on image classification, with a linear layer applied on the class token and the average of the patch tokens.
- on image retrieval, using nearest neighbors.

### Downstream Use

It is technically possible to fine-tune the models, for small gains (we measured +2% on ImageNet-1k classification).
We recommend keeping this as a very last step and only when necessary, as the features already provide good performance out of the box.

## Bias, Risks, and Limitations

Despite improvements thanks to the training method not using annotations, we still observe significant biases in our models toward rich households from Western countries.

### Recommendations

We expect fine-tuning to increase the biases in the features produced by the model, as they will be tuned to the fine-tuning labels.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch

# DINOv2
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

# DINOv2 with registers
dinov2_vits14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
dinov2_vitb14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
dinov2_vitl14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_reg')
dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
```
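As with most torchvision-style pipelines, inputs are typically RGB images normalized with the ImageNet mean and standard deviation before being fed to the backbone (an assumption drawn from the standard DINOv2 transforms, not stated in this card). A dependency-free sketch of the per-channel arithmetic, with `normalize_pixel` being a hypothetical helper for illustration:

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Map one RGB pixel from [0, 1] floats to ImageNet-normalized values."""
    return tuple((c - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

# A pixel equal to the channel means maps to (0.0, 0.0, 0.0):
print(normalize_pixel((0.485, 0.456, 0.406)))
```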

## Training Details

### Training Data

- **Training data:** LVD-142M (see paper)
- **Training regime:** fp16 using PyTorch-FSDP mixed-precision.

### Training Procedure

- **Training objective:**
  - DINO self-distillation loss with multi-crop
  - iBOT masked-image modeling loss
  - KoLeo regularization on [CLS] tokens
- **Architectures:**
  - ViT-S (21M params): Patch size 14, embedding dimension 384, 6 heads, MLP FFN
  - ViT-B (86M params): Patch size 14, embedding dimension 768, 12 heads, MLP FFN
  - ViT-L (0.3B params): Patch size 14, embedding dimension 1024, 16 heads, MLP FFN
  - ViT-g (1.1B params): Patch size 14, embedding dimension 1536, 24 heads, SwiGLU FFN
- **Distillation:**
  - Distillation follows the standard DINOv2 pretraining procedure, except the teacher is a pretrained, frozen ViT-g.

## Evaluation

We refer users to the associated papers for the evaluation protocols.

<table>
<tr>
<th colspan="2"></th>
<th colspan="3">ImageNet-1k</th>
<th>NYU-Depth v2</th>
<th>SUN-RGBD</th>
<th>ADE20k</th>
<th>iNaturalist 2018</th>
<th>Oxford-H</th>
</tr>
<tr>
<th rowspan="2">model</th>
<th rowspan="2">with <br /> registers</th>
<th>classif. (acc)</th>
<th>classif. (acc)</th>
<th>classif. V2 (acc)</th>
<th>depth (RMSE)</th>
<th>depth (RMSE)</th>
<th>segm. (mIoU)</th>
<th>classif. (acc)</th>
<th>retrieval (mAP)</th>
</tr>
<tr>
<!-- <th>^</th> -->
<th>k-NN</th>
<th>linear</th>
<th>linear</th>
<th>linear<br />4 layers</th>
<th>NYU-D transfer</th>
<th>multiscale</th>
<th>linear</th>
<th>nearest neighbor</th>
</tr>
<tr>
<td>ViT-S/14</td>
<td align="center">:x:</td>
<td align="right">79.0%</td>
<td align="right">81.1%</td>
<td align="right">70.8%</td>
<td align="right">0.417</td>
<td align="right">0.431</td>
<td align="right">47.2</td>
<td align="right">69.5%</td>
<td align="right">43.2</td>
</tr>
<tr>
<td>ViT-S/14</td>
<td align="center">:white_check_mark:</td>
<td align="right">79.1%</td>
<td align="right">80.9%</td>
<td align="right">71.0%</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">67.6%</td>
<td align="right">39.5</td>
</tr>
<tr>
<td>ViT-B/14</td>
<td align="center">:x:</td>
<td align="right">82.1%</td>
<td align="right">84.5%</td>
<td align="right">74.9%</td>
<td align="right">0.362</td>
<td align="right">0.400</td>
<td align="right">51.3</td>
<td align="right">76.3%</td>
<td align="right">49.5</td>
</tr>
<tr>
<td>ViT-B/14</td>
<td align="center">:white_check_mark:</td>
<td align="right">82.0%</td>
<td align="right">84.6%</td>
<td align="right">75.6%</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">73.8%</td>
<td align="right">51.0</td>
</tr>
<tr>
<td>ViT-L/14</td>
<td align="center">:x:</td>
<td align="right">83.5%</td>
<td align="right">86.3%</td>
<td align="right">77.6%</td>
<td align="right">0.333</td>
<td align="right">0.396</td>
<td align="right">53.1</td>
<td align="right">79.8%</td>
<td align="right">54.0</td>
</tr>
<tr>
<td>ViT-L/14</td>
<td align="center">:white_check_mark:</td>
<td align="right">83.8%</td>
<td align="right">86.7%</td>
<td align="right">78.5%</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">80.9%</td>
<td align="right">55.7</td>
</tr>
<tr>
<td>ViT-g/14</td>
<td align="center">:x:</td>
<td align="right">83.5%</td>
<td align="right">86.5%</td>
<td align="right">78.4%</td>
<td align="right">0.298</td>
<td align="right">0.362</td>
<td align="right">53.0</td>
<td align="right">81.6%</td>
<td align="right">52.3</td>
</tr>
<tr>
<td>ViT-g/14</td>
<td align="center">:white_check_mark:</td>
<td align="right">83.7%</td>
<td align="right">87.1%</td>
<td align="right">78.8%</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">81.5%</td>
<td align="right">58.2</td>
</tr>
</table>

## Environmental Impact

- **Hardware Type:** Nvidia A100
- **Hours used:** 22,000 for ViT-g, 4,500 for ViT-S distillation, 5,300 for ViT-B distillation, 8,000 for ViT-L distillation
- **Cloud Provider:** Private infra
- **Compute Region:** USA
- **Carbon Emitted:** 7t CO2eq

#### Hardware

Nvidia A100 GPUs

#### Software

PyTorch 2.0,
xFormers 0.0.18

**BibTeX**

```
@misc{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal={arXiv:2304.07193},
  year={2023}
}
@misc{darcet2023vitneedreg,
  title={Vision Transformers Need Registers},
  author={Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
  journal={arXiv:2309.16588},
  year={2023}
}
```
dinov2/README.md ADDED
@@ -0,0 +1,620 @@
:new: [2023-10-26] *Added DINOv2 backbones with registers, following [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588).*

# DINOv2: Learning Robust Visual Features without Supervision

**[Meta AI Research, FAIR](https://ai.facebook.com/research/)**

Maxime Oquab,
Timothée Darcet,
Théo Moutakanni,
Huy V. Vo,
Marc Szafraniec,
Vasil Khalidov,
Patrick Labatut,
Armand Joulin,
Piotr Bojanowski

[[`Paper #1`](https://arxiv.org/abs/2304.07193)] [[`Paper #2`](https://arxiv.org/abs/2309.16588)] [[`Blog`](https://ai.facebook.com/blog/dino-v2-computer-vision-self-supervised-learning/)] [[`Demo`](https://dinov2.metademolab.com)] [[`BibTeX`](#citing-dinov2)]

PyTorch implementation and pretrained models for DINOv2. For details, see the papers: **[DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)** and **[Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588)**.

DINOv2 models produce high-performance visual features that can be directly employed with classifiers as simple as linear layers on a variety of computer vision tasks; these visual features are robust and perform well across domains without any requirement for fine-tuning. The models were pretrained on a dataset of 142 M images without using any labels or annotations.

https://github.com/facebookresearch/dinov2/assets/60359573/f168823e-7922-415a-b429-578badf5c356

<div align="center">
Visualization of the first three principal components of the patch features of all frames, mapped to RGB values.
</div>

## Pretrained models

<table style="margin: auto">
<thead>
<tr>
<th>model</th>
<th># of<br />params</th>
<th>with<br />registers</th>
<th>ImageNet<br />k-NN</th>
<th>ImageNet<br />linear</th>
<th>download</th>
</tr>
</thead>
<tbody>
<tr>
<td>ViT-S/14 distilled</td>
<td align="right">21 M</td>
<td align="center">:x:</td>
<td align="right">79.0%</td>
<td align="right">81.1%</td>
<td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth">backbone only</a></td>
</tr>
<tr>
<td>ViT-S/14 distilled</td>
<td align="right">21 M</td>
<td align="center">:white_check_mark:</td>
<td align="right">79.1%</td>
<td align="right">80.9%</td>
<td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_reg4_pretrain.pth">backbone only</a></td>
</tr>
<tr>
<td>ViT-B/14 distilled</td>
<td align="right">86 M</td>
<td align="center">:x:</td>
<td align="right">82.1%</td>
<td align="right">84.5%</td>
<td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_pretrain.pth">backbone only</a></td>
</tr>
<tr>
<td>ViT-B/14 distilled</td>
<td align="right">86 M</td>
<td align="center">:white_check_mark:</td>
<td align="right">82.0%</td>
<td align="right">84.6%</td>
<td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_reg4_pretrain.pth">backbone only</a></td>
</tr>
<tr>
<td>ViT-L/14 distilled</td>
<td align="right">300 M</td>
<td align="center">:x:</td>
<td align="right">83.5%</td>
<td align="right">86.3%</td>
<td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth">backbone only</a></td>
</tr>
<tr>
<td>ViT-L/14 distilled</td>
<td align="right">300 M</td>
<td align="center">:white_check_mark:</td>
<td align="right">83.8%</td>
<td align="right">86.7%</td>
<td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_reg4_pretrain.pth">backbone only</a></td>
</tr>
<tr>
<td>ViT-g/14</td>
<td align="right">1,100 M</td>
<td align="center">:x:</td>
<td align="right">83.5%</td>
<td align="right">86.5%</td>
<td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth">backbone only</a></td>
</tr>
<tr>
<td>ViT-g/14</td>
<td align="right">1,100 M</td>
<td align="center">:white_check_mark:</td>
<td align="right">83.7%</td>
<td align="right">87.1%</td>
<td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_reg4_pretrain.pth">backbone only</a></td>
</tr>
</tbody>
</table>

### Pretrained backbones (via PyTorch Hub)

Please follow the instructions [here](https://pytorch.org/get-started/locally/) to install PyTorch (the only required dependency for loading the model). Installing PyTorch with CUDA support is strongly recommended.

A corresponding [model card](MODEL_CARD.md) is included in the repository.

```python
import torch

# DINOv2
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

# DINOv2 with registers
dinov2_vits14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
dinov2_vitb14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
dinov2_vitl14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_reg')
dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
```

### Pretrained heads - Image classification

<table style="margin: auto">
<thead>
<tr>
<th rowspan="2">backbone</th>
<th rowspan="2">with<br />registers</th>
<th>download</th>
</tr>
<tr>
<th>ImageNet</th>
</tr>
</thead>
<tbody>
<tr>
<td>ViT-S/14 distilled</td>
<td align="center">:x:</td>
<td>
linear head (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_linear4_head.pth">4 layers</a>)
</td>
</tr>
<tr>
<td>ViT-S/14 distilled</td>
<td align="center">:white_check_mark:</td>
<td>
linear head (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_reg4_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_reg4_linear4_head.pth">4 layers</a>)
</td>
</tr>
<tr>
<td>ViT-B/14 distilled</td>
<td align="center">:x:</td>
<td>
linear head (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_linear4_head.pth">4 layers</a>)
</td>
</tr>
<tr>
<td>ViT-B/14 distilled</td>
<td align="center">:white_check_mark:</td>
<td>
linear head (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_reg4_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_reg4_linear4_head.pth">4 layers</a>)
</td>
</tr>
<tr>
<td>ViT-L/14 distilled</td>
<td align="center">:x:</td>
<td>
linear head (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_linear4_head.pth">4 layers</a>)
</td>
</tr>
<tr>
<td>ViT-L/14 distilled</td>
<td align="center">:white_check_mark:</td>
<td>
linear head (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_reg4_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_reg4_linear4_head.pth">4 layers</a>)
</td>
</tr>
<tr>
<td>ViT-g/14</td>
<td align="center">:x:</td>
<td>
linear head (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_linear4_head.pth">4 layers</a>)
</td>
</tr>
<tr>
<td>ViT-g/14</td>
<td align="center">:white_check_mark:</td>
<td>
linear head (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_reg4_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_reg4_linear4_head.pth">4 layers</a>)
</td>
</tr>
</tbody>
</table>

The (full) classifier models can be loaded via PyTorch Hub:

```python
import torch

# DINOv2
dinov2_vits14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_lc')
dinov2_vitb14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_lc')
dinov2_vitl14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_lc')
dinov2_vitg14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_lc')

# DINOv2 with registers
dinov2_vits14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg_lc')
dinov2_vitb14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg_lc')
dinov2_vitl14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_reg_lc')
dinov2_vitg14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg_lc')
```

### Pretrained heads - Depth estimation

<table style="margin: auto">
<thead>
<tr>
<th rowspan="2">backbone</th>
<th colspan="2">download head</th>
</tr>
<tr>
<th>NYUd</th>
<th>KITTI</th>
</tr>
</thead>
<tbody>
<tr>
<td>ViT-S/14 distilled</td>
<td>
linear (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_nyu_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_nyu_linear4_head.pth">4 layers</a>),
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_nyu_dpt_head.pth">DPT</a>
</td>
<td>
linear (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_kitti_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_kitti_linear4_head.pth">4 layers</a>),
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_kitti_dpt_head.pth">DPT</a>
</td>
</tr>
<tr>
<td>ViT-B/14 distilled</td>
<td>
linear (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_nyu_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_nyu_linear4_head.pth">4 layers</a>),
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_nyu_dpt_head.pth">DPT</a>
</td>
<td>
linear (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_kitti_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_kitti_linear4_head.pth">4 layers</a>),
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_kitti_dpt_head.pth">DPT</a>
</td>
</tr>
<tr>
<td>ViT-L/14 distilled</td>
<td>
linear (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_nyu_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_nyu_linear4_head.pth">4 layers</a>),
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_nyu_dpt_head.pth">DPT</a>
</td>
<td>
linear (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_kitti_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_kitti_linear4_head.pth">4 layers</a>),
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_kitti_dpt_head.pth">DPT</a>
</td>
</tr>
<tr>
<td>ViT-g/14</td>
<td>
linear (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_nyu_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_nyu_linear4_head.pth">4 layers</a>),
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_nyu_dpt_head.pth">DPT</a>
</td>
<td>
linear (<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_kitti_linear_head.pth">1 layer</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_kitti_linear4_head.pth">4 layers</a>),
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_kitti_dpt_head.pth">DPT</a>
</td>
</tr>
</tbody>
</table>

### Pretrained heads - Semantic segmentation

<table style="margin: auto">
<thead>
<tr>
<th rowspan="2">backbone</th>
<th>download model</th>
<th colspan="2">download head</th>
</tr>
<tr>
<th>ADE20K</th>
<th>ADE20K</th>
<th>VOC2012</th>
</tr>
</thead>
<tbody>
<tr>
<td>ViT-S/14 distilled</td>
<td></td>
<td>
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_ade20k_linear_head.pth">linear</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_ade20k_ms_head.pth">multi-scale</a>
</td>
<td>
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_voc2012_linear_head.pth">linear</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_voc2012_ms_head.pth">multi-scale</a>
</td>
</tr>
<tr>
<td>ViT-B/14 distilled</td>
<td></td>
<td>
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_ade20k_linear_head.pth">linear</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_ade20k_ms_head.pth">multi-scale</a>
</td>
<td>
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_voc2012_linear_head.pth">linear</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_voc2012_ms_head.pth">multi-scale</a>
</td>
</tr>
<tr>
<td>ViT-L/14 distilled</td>
<td></td>
<td>
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_ade20k_linear_head.pth">linear</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_ade20k_ms_head.pth">multi-scale</a>
</td>
<td>
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_voc2012_linear_head.pth">linear</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_voc2012_ms_head.pth">multi-scale</a>
</td>
</tr>
<tr>
<td>ViT-g/14</td>
<td>
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_ade20k_m2f.pth">Mask2Former</a>
</td>
<td>
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_ade20k_linear_head.pth">linear</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_ade20k_ms_head.pth">multi-scale</a>
</td>
<td>
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_voc2012_linear_head.pth">linear</a>,
<a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_voc2012_ms_head.pth">multi-scale</a>
</td>
</tr>
</tbody>
</table>

## Installation

The training and evaluation code requires PyTorch 2.0 and [xFormers](https://github.com/facebookresearch/xformers) 0.0.18, as well as a number of other 3rd party packages. Note that the code has only been tested with the specified versions and also expects a Linux environment. To set up all the required dependencies for training and evaluation, please follow the instructions below:

*[conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html)* **(Recommended)** - Clone the repository and then create and activate a `dinov2` conda environment using the provided environment definition:

```shell
conda env create -f conda.yaml
conda activate dinov2
```

*[pip](https://pip.pypa.io/en/stable/getting-started/)* - Clone the repository and then use the provided `requirements.txt` to install the dependencies:

```shell
pip install -r requirements.txt
```

For dense tasks (depth estimation and semantic segmentation), there are additional dependencies (specific versions of `mmcv` and `mmsegmentation`) which are captured in the `extras` dependency specifications:

*[conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html)* **(Recommended)**:

```shell
conda env create -f conda-extras.yaml
conda activate dinov2-extras
```

*[pip](https://pip.pypa.io/en/stable/getting-started/)*:

```shell
pip install -r requirements.txt -r requirements-extras.txt
```

## Data preparation

### ImageNet-1k

The root directory of the dataset should hold the following contents:

- `<ROOT>/test/ILSVRC2012_test_00000001.JPEG`
- `<ROOT>/test/[..]`
- `<ROOT>/test/ILSVRC2012_test_00100000.JPEG`
- `<ROOT>/train/n01440764/n01440764_10026.JPEG`
- `<ROOT>/train/[...]`
- `<ROOT>/train/n15075141/n15075141_9993.JPEG`
- `<ROOT>/val/n01440764/ILSVRC2012_val_00000293.JPEG`
- `<ROOT>/val/[...]`
- `<ROOT>/val/n15075141/ILSVRC2012_val_00049174.JPEG`
- `<ROOT>/labels.txt`

The provided dataset implementation expects a few additional metadata files to be present under the extra directory:

- `<EXTRA>/class-ids-TRAIN.npy`
- `<EXTRA>/class-ids-VAL.npy`
- `<EXTRA>/class-names-TRAIN.npy`
- `<EXTRA>/class-names-VAL.npy`
- `<EXTRA>/entries-TEST.npy`
- `<EXTRA>/entries-TRAIN.npy`
- `<EXTRA>/entries-VAL.npy`

These metadata files can be generated (once) with the following lines of Python code:

```python
from dinov2.data.datasets import ImageNet

for split in ImageNet.Split:
    dataset = ImageNet(split=split, root="<ROOT>", extra="<EXTRA>")
    dataset.dump_extra()
```

Note that the root and extra directories do not have to be distinct directories.

### ImageNet-22k

Please adapt the [dataset class](dinov2/data/datasets/image_net_22k.py) to match your local setup.

<br />

:warning: To execute the commands provided in the next sections for training and evaluation, the `dinov2` package should be included in the Python module search path, i.e. simply prefix the command to run with `PYTHONPATH=.`.

## Training

### Fast setup: training DINOv2 ViT-L/16 on ImageNet-1k

Run DINOv2 training on 4 A100-80GB nodes (32 GPUs) in a SLURM cluster environment with submitit:

```shell
python dinov2/run/train/train.py \
    --nodes 4 \
    --config-file dinov2/configs/train/vitl16_short.yaml \
    --output-dir <PATH/TO/OUTPUT/DIR> \
    train.dataset_path=ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
```
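The `train.dataset_path` argument packs a dataset name and its keyword arguments into a single colon-separated string. The rough parser below is only an illustration of that format (the actual parsing happens inside the dinov2 codebase, not in this function):

```python
def parse_dataset_spec(spec: str):
    """Split e.g. 'ImageNet:split=TRAIN:root=/data:extra=/data' into (name, kwargs)."""
    name, *parts = spec.split(":")
    kwargs = dict(part.split("=", 1) for part in parts)
    return name, kwargs

name, kwargs = parse_dataset_spec("ImageNet:split=TRAIN:root=/data/in1k:extra=/data/in1k")
print(name)    # ImageNet
print(kwargs)  # {'split': 'TRAIN', 'root': '/data/in1k', 'extra': '/data/in1k'}
```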

Training time is approximately 1 day and the resulting checkpoint should reach 81.6% on k-NN eval and 82.9% on linear eval.

The training code saves the weights of the teacher in the `eval` folder every 12500 iterations for evaluation.
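Since a teacher checkpoint is written every 12500 iterations and the evaluation commands later in this README reference folders such as `eval/training_24999`, the folder for the k-th periodic checkpoint appears to be named after the last completed (0-indexed) iteration. A small inferred sketch, not code from this repository:

```python
def eval_checkpoint_dir(k: int, period: int = 12500) -> str:
    """Folder holding the k-th periodic teacher checkpoint (1-based k, 0-indexed iterations)."""
    return f"eval/training_{period * k - 1}"

print(eval_checkpoint_dir(1))  # eval/training_12499
print(eval_checkpoint_dir(2))  # eval/training_24999
```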
459
+
460
+ ### Long setup: training DINOv2 ViT-L/14 on ImageNet-22k
461
+
462
+ Run DINOv2 training on 12 A100-80GB nodes (96 GPUs) in a SLURM cluster environment with submitit:
463
+
464
+ ```shell
465
+ python dinov2/run/train/train.py \
466
+ --nodes 12 \
467
+ --config-file dinov2/configs/train/vitl14.yaml \
468
+ --output-dir <PATH/TO/OUTPUT/DIR> \
469
+ train.dataset_path=ImageNet22k:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
470
+ ```
471
+
472
+ Training time is approximately 3.3 days and the resulting checkpoint should reach 82.0% on k-NN eval and 84.5% on linear eval.
473
+
474
+ The training code saves the weights of the teacher in the `eval` folder every 12500 iterations for evaluation.
475
+
476
+
477
+ ## Evaluation
478
+
479
+ The training code regularly saves the teacher weights. To evaluate the model, run the following commands on a single node:
480
+
481
+ ### k-NN classification on ImageNet-1k
482
+
483
+ ```shell
484
+ python dinov2/run/eval/knn.py \
485
+ --config-file <PATH/TO/OUTPUT/DIR>/config.yaml \
486
+ --pretrained-weights <PATH/TO/OUTPUT/DIR>/eval/training_24999/teacher_checkpoint.pth \
487
+ --output-dir <PATH/TO/OUTPUT/DIR>/eval/training_24999/knn \
488
+ --train-dataset ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
489
+ --val-dataset ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
490
+ ```
491
+
492
+ ### Logistic regression classification on ImageNet-1k
493
+
494
+ ```shell
495
+ python dinov2/run/eval/log_regression.py \
496
+ --config-file <PATH/TO/OUTPUT/DIR>/config.yaml \
497
+ --pretrained-weights <PATH/TO/OUTPUT/DIR>/eval/training_24999/teacher_checkpoint.pth \
498
+ --output-dir <PATH/TO/OUTPUT/DIR>/eval/training_24999/logreg \
499
+ --train-dataset ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
500
+ --val-dataset ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
501
+ ```
502
+
503
+ ### Linear classification with data augmentation on ImageNet-1k
504
+
505
+ ```shell
506
+ python dinov2/run/eval/linear.py \
507
+ --config-file <PATH/TO/OUTPUT/DIR>/config.yaml \
508
+ --pretrained-weights <PATH/TO/OUTPUT/DIR>/eval/training_24999/teacher_checkpoint.pth \
509
+ --output-dir <PATH/TO/OUTPUT/DIR>/eval/training_24999/linear \
510
+ --train-dataset ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
511
+ --val-dataset ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
512
+ ```
513
+
514
+ We also release the weights of the linear classification heads obtained when evaluating the different models:
515
+
516
+ <table style="margin: auto">
517
+ <tr>
518
+ <th>model</th>
519
+ <th>with<br />registers</th>
520
+ <th>ImageNet<br />top-1</th>
521
+ <th>linear evaluation</th>
522
+ </tr>
523
+ <tr>
524
+ <td>ViT-S/14 distilled</td>
525
+ <td align="center">:x:</td>
526
+ <td align="right">81.1%</td>
527
+ <td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_linear_head.pth">linear head weights</a></td>
528
+ </tr>
529
+ <tr>
530
+ <td>ViT-S/14 distilled</td>
531
+ <td align="center">:white_check_mark:</td>
532
+ <td align="right">80.8%</td>
533
+ <td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_reg4_linear_head.pth">linear head weights</a></td>
534
+ </tr>
535
+ <tr>
536
+ <td>ViT-B/14 distilled</td>
537
+ <td align="center">:x:</td>
538
+ <td align="right">84.5%</td>
539
+ <td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_linear_head.pth">linear head weights</a></td>
540
+ </tr>
541
+ <tr>
542
+ <td>ViT-B/14 distilled</td>
543
+ <td align="center">:white_check_mark:</td>
544
+ <td align="right">84.4%</td>
545
+ <td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_reg4_linear_head.pth">linear head weights</a></td>
546
+ </tr>
547
+ <tr>
548
+ <td>ViT-L/14 distilled</td>
549
+ <td align="center">:x:</td>
550
+ <td align="right">86.3%</td>
551
+ <td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_linear_head.pth">linear head weights</a></td>
552
+ </tr>
553
+ <tr>
554
+ <td>ViT-L/14 distilled</td>
555
+ <td align="center">:white_check_mark:</td>
556
+ <td align="right">86.5%</td>
557
+ <td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_reg4_linear_head.pth">linear head weights</a></td>
558
+ </tr>
559
+ <tr>
560
+ <td>ViT-g/14</td>
561
+ <td align="center">:x:</td>
562
+ <td align="right">86.5%</td>
563
+ <td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_linear_head.pth">linear head weights</a></td>
564
+ </tr>
565
+ <tr>
566
+ <td>ViT-g/14</td>
567
+ <td align="center">:white_check_mark:</td>
568
+ <td align="right">87.0%</td>
569
+ <td><a href="https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_reg4_linear_head.pth">linear head weights</a></td>
570
+ </tr>
571
+ </table>
572
+
573
+ The performance of the provided pretrained model weights can be evaluated as follows on ImageNet-1k:
574
+
575
+ ```shell
576
+ python dinov2/run/eval/linear.py \
577
+ --config-file dinov2/configs/eval/vitg14_pretrain.yaml \
578
+ --pretrained-weights https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth \
579
+ --train-dataset ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
580
+ --val-dataset ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
581
+ ```
582
+
583
+ ## Notebooks
584
+
585
+ A few notebooks are provided to help the community leverage the models and code:
586
+
587
+ <ul>
588
+ <li><a href="https://github.com/facebookresearch/dinov2/blob/main/notebooks/depth_estimation.ipynb">Depth estimation</a> - How to load and use the depth heads in combination with a matching backbone via mmcv</li>
589
+ <li><a href="https://github.com/facebookresearch/dinov2/blob/main/notebooks/semantic_segmentation.ipynb">Semantic segmentation</a> - How to load and use the segmentation heads in combination with a matching backbone via mmcv, and also how to load and use the Mask2Former-based segmentation model trained on ADE20K</li>
590
+ </ul>
591
+
592
+ ## License
593
+
594
+ DINOv2 code and model weights are released under the Apache License 2.0. See [LICENSE](LICENSE) for additional details.
595
+
596
+ ## Contributing
597
+
598
+ See [contributing](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md).
599
+
600
+ ## Citing DINOv2
601
+
602
+ If you find this repository useful, please consider giving a star :star: and citation :t-rex::
603
+
604
+ ```
605
+ @misc{oquab2023dinov2,
606
+ title={DINOv2: Learning Robust Visual Features without Supervision},
607
+ author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
608
+ journal={arXiv:2304.07193},
609
+ year={2023}
610
+ }
611
+ ```
612
+
613
+ ```
614
+ @misc{darcet2023vitneedreg,
615
+ title={Vision Transformers Need Registers},
616
+ author={Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
617
+ journal={arXiv:2309.16588},
618
+ year={2023}
619
+ }
620
+ ```
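The training and evaluation commands above all pass datasets as colon-separated `Name:key=value` strings (e.g. `ImageNet:split=TRAIN:root=...:extra=...`). A minimal illustrative parser for that convention (a sketch with made-up paths, not dinov2's actual dataset loader, and assuming paths contain no colons):

```python
def parse_dataset_spec(spec):
    # Split "Name:key=value:key=value" into a dataset name plus kwargs.
    # Illustrative only: dinov2's real parser lives in dinov2/data.
    name, _, rest = spec.partition(":")
    kwargs = {}
    if rest:
        for item in rest.split(":"):
            key, _, value = item.partition("=")
            kwargs[key] = value
    return name, kwargs


print(parse_dataset_spec("ImageNet:split=TRAIN:root=/data:extra=/data"))
print(parse_dataset_spec("ImageNet22k"))
```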
dinov2/conda-extras.yaml ADDED
@@ -0,0 +1,24 @@
1
+ name: dinov2-extras
2
+ channels:
3
+ - defaults
4
+ - pytorch
5
+ - nvidia
6
+ - xformers
7
+ - conda-forge
8
+ dependencies:
9
+ - python=3.9
10
+ - pytorch::pytorch=2.0.0
11
+ - pytorch::pytorch-cuda=11.7.0
12
+ - pytorch::torchvision=0.15.0
13
+ - omegaconf
14
+ - torchmetrics=0.10.3
15
+ - fvcore
16
+ - iopath
17
+ - xformers::xformers=0.0.18
18
+ - pip
19
+ - pip:
20
+ - git+https://github.com/facebookincubator/submitit
21
+ - --extra-index-url https://pypi.nvidia.com
22
+ - cuml-cu11
23
+ - mmcv-full==1.5.0
24
+ - mmsegmentation==0.27.0
dinov2/conda.yaml ADDED
@@ -0,0 +1,21 @@
1
+ name: dinov2
2
+ channels:
3
+ - defaults
4
+ - pytorch
5
+ - nvidia
6
+ - conda-forge
7
+ dependencies:
8
+ - python=3.9
9
+ - pytorch=2.0.0
10
+ - pytorch-cuda=11.7
11
+ - torchvision=0.15.0
12
+ - omegaconf
13
+ - torchmetrics=0.10.3
14
+ - fvcore
15
+ - iopath
16
+ - pip
17
+ - pip:
18
+ - git+https://github.com/facebookincubator/submitit
19
+ - --extra-index-url https://pypi.nvidia.com
20
+ - cuml-cu11
21
+ - xformers==0.0.20 # Updated xformers version compatible with PyTorch 2.0
dinov2/pyproject.toml ADDED
@@ -0,0 +1,29 @@
1
+ [tool.black]
2
+ line-length = 120
3
+
4
+ [tool.pylint.master]
5
+ persistent = false
6
+ score = false
7
+
8
+ [tool.pylint.messages_control]
9
+ disable = "all"
10
+ enable = [
11
+ "miscellaneous",
12
+ "similarities",
13
+ ]
14
+
15
+ [tool.pylint.similarities]
16
+ ignore-comments = true
17
+ ignore-docstrings = true
18
+ ignore-imports = true
19
+ min-similarity-lines = 8
20
+
21
+ [tool.pylint.reports]
22
+ reports = false
23
+
24
+ [tool.pylint.miscellaneous]
25
+ notes = [
26
+ "FIXME",
27
+ "XXX",
28
+ "TODO",
29
+ ]
dinov2/requirements-dev.txt ADDED
@@ -0,0 +1,3 @@
1
+ black==22.6.0
2
+ flake8==5.0.4
3
+ pylint==2.15.0
dinov2/requirements-extras.txt ADDED
@@ -0,0 +1,2 @@
1
+ mmcv-full==1.5.0
2
+ mmsegmentation==0.27.0
dinov2/requirements.txt ADDED
@@ -0,0 +1,11 @@
1
+ --extra-index-url https://download.pytorch.org/whl/cu117
2
+ torch==2.0.0
3
+ torchvision==0.15.0
4
+ omegaconf
5
+ torchmetrics==0.10.3
6
+ fvcore
7
+ iopath
8
+ xformers==0.0.18
9
+ submitit
10
+ --extra-index-url https://pypi.nvidia.com
11
+ cuml-cu11
dinov2/scripts/lint.sh ADDED
@@ -0,0 +1,28 @@
1
+ #!/bin/sh
2
+
3
+ if [ -n "$1" ]; then
4
+ echo "linting \"$1\""
5
+ fi
6
+
7
+ echo "running black"
8
+ if [ -n "$1" ]; then
9
+ black "$1"
10
+ else
11
+ black dinov2
12
+ fi
13
+
14
+ echo "running flake8"
15
+ if [ -n "$1" ]; then
16
+ flake8 "$1"
17
+ else
18
+ flake8
19
+ fi
20
+
21
+ echo "running pylint"
22
+ if [ -n "$1" ]; then
23
+ pylint "$1"
24
+ else
25
+ pylint dinov2
26
+ fi
27
+
28
+ exit 0
dinov2/setup.cfg ADDED
@@ -0,0 +1,8 @@
1
+ [flake8]
2
+ max-line-length = 120
3
+ ignore = E203,E501,W503
4
+ per-file-ignores =
5
+ __init__.py:F401
6
+ hubconf.py:F401
7
+ exclude =
8
+ venv
dinov2/setup.py ADDED
@@ -0,0 +1,88 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ #
3
+ # This source code is licensed under the Apache License, Version 2.0
4
+ # found in the LICENSE file in the root directory of this source tree.
5
+
6
+ from pathlib import Path
7
+ import re
8
+ from typing import List, Tuple
9
+
10
+ from setuptools import setup, find_packages
11
+
12
+
13
+ NAME = "dinov2"
14
+ DESCRIPTION = "PyTorch code and models for the DINOv2 self-supervised learning method."
15
+
16
+ URL = "https://github.com/facebookresearch/dinov2"
17
+ AUTHOR = "FAIR"
18
+ REQUIRES_PYTHON = ">=3.9.0"
19
+ HERE = Path(__file__).parent
20
+
21
+
22
+ try:
23
+ with open(HERE / "README.md", encoding="utf-8") as f:
24
+ long_description = "\n" + f.read()
25
+ except FileNotFoundError:
26
+ long_description = DESCRIPTION
27
+
28
+
29
+ def get_requirements(path: str = HERE / "requirements.txt") -> Tuple[List[str], List[str]]:
30
+ requirements = []
31
+ extra_indices = []
32
+ with open(path) as f:
33
+ for line in f.readlines():
34
+ line = line.rstrip("\r\n")
35
+ if line.startswith("--extra-index-url "):
36
+ extra_indices.append(line[18:])
37
+ continue
38
+ requirements.append(line)
39
+ return requirements, extra_indices
40
+
41
+
42
+ def get_package_version() -> str:
43
+ with open(HERE / "dinov2/__init__.py") as f:
44
+ result = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]", f.read(), re.M)
45
+ if result:
46
+ return result.group(1)
47
+ raise RuntimeError("Can't get package version")
48
+
49
+
50
+ requirements, extra_indices = get_requirements()
51
+ version = get_package_version()
52
+ dev_requirements, _ = get_requirements(HERE / "requirements-dev.txt")
53
+ extras_requirements, _ = get_requirements(HERE / "requirements-extras.txt")
54
+
55
+
56
+ setup(
57
+ name=NAME,
58
+ version=version,
59
+ description=DESCRIPTION,
60
+ long_description=long_description,
61
+ long_description_content_type="text/markdown",
62
+ author=AUTHOR,
63
+ python_requires=REQUIRES_PYTHON,
64
+ url=URL,
65
+ packages=find_packages(),
66
+ package_data={
67
+ "": ["*.yaml"],
68
+ },
69
+ install_requires=requirements,
70
+ extras_require={
71
+ "dev": dev_requirements,
72
+ "extras": extras_requirements,
73
+ },
74
+ dependency_links=extra_indices,
75
+ include_package_data=True,
76
+ license="Apache",
77
+ license_files=("LICENSE",),
78
+ classifiers=[
79
+ # Trove classifiers: https://github.com/pypa/trove-classifiers/blob/main/src/trove_classifiers/__init__.py
80
+ "Development Status :: 3 - Alpha",
81
+ "Intended Audience :: Developers",
82
+ "Intended Audience :: Science/Research",
83
+ "License :: OSI Approved :: Apache Software License",
84
+ "Programming Language :: Python :: 3.9",
85
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
86
+ "Topic :: Software Development :: Libraries :: Python Modules",
87
+ ],
88
+ )
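`setup.py`'s `get_requirements` strips `--extra-index-url` lines out of the requirements files before handing the rest to setuptools. A standalone sketch of that split (the helper is re-declared here so it runs on its own):

```python
def split_requirements(lines):
    # Mirrors setup.py's get_requirements(): pull --extra-index-url entries
    # out into their own list, keep everything else as requirements.
    requirements, extra_indices = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("--extra-index-url "):
            extra_indices.append(line[len("--extra-index-url "):])
            continue
        requirements.append(line)
    return requirements, extra_indices


reqs, indices = split_requirements([
    "--extra-index-url https://download.pytorch.org/whl/cu117\n",
    "torch==2.0.0\n",
    "omegaconf\n",
])
print(reqs)     # ['torch==2.0.0', 'omegaconf']
print(indices)  # ['https://download.pytorch.org/whl/cu117']
```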
explainability.py ADDED
@@ -0,0 +1,112 @@
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass
4
+
5
+ import cv2
6
+ import numpy as np
7
+ import torch
8
+
9
+
10
+ @dataclass(frozen=True)
11
+ class ExplainabilitySpec:
12
+ encoder_mode: str
13
+ encoder_label: str
14
+ decoder_mode: str
15
+ decoder_label: str
16
+ encoder_layer_count: int = 0
17
+ encoder_head_count: int = 0
18
+
19
+
20
+ class ModuleOutputRecorder:
21
+ def __init__(self) -> None:
22
+ self.handle = None
23
+ self.output = None
24
+
25
+ def attach(self, module) -> None:
26
+ self.remove()
27
+ self.handle = module.register_forward_hook(self._hook)
28
+
29
+ def clear(self) -> None:
30
+ self.output = None
31
+
32
+ def remove(self) -> None:
33
+ if self.handle is not None:
34
+ self.handle.remove()
35
+ self.handle = None
36
+ self.output = None
37
+
38
+ def _hook(self, module, inputs, output) -> None: # pragma: no cover - hook signature
39
+ if torch.is_tensor(output):
40
+ self.output = output.detach()
41
+ else:
42
+ self.output = output
43
+
44
+
45
+ def clamp_index(index: int | None, upper_bound: int) -> int:
46
+ if upper_bound <= 0:
47
+ return 0
48
+ if index is None:
49
+ return upper_bound - 1
50
+ return max(0, min(int(index), upper_bound - 1))
51
+
52
+
53
+ def normalize_map(values) -> np.ndarray:
54
+ array = np.asarray(values, dtype=np.float32)
55
+ if array.ndim != 2:
56
+ raise ValueError(f"Expected a 2D array, got shape {array.shape}")
57
+
58
+ array = array.copy()
59
+ min_value = float(array.min(initial=0.0))
60
+ array -= min_value
61
+ max_value = float(array.max(initial=0.0))
62
+ if max_value > 0:
63
+ array /= max_value
64
+ return array
65
+
66
+
67
+ def resize_rgb_image(rgb_image: np.ndarray, size: tuple[int, int]) -> np.ndarray:
68
+ width, height = size
69
+ return cv2.resize(rgb_image, (width, height), interpolation=cv2.INTER_LINEAR)
70
+
71
+
72
+ def feature_energy_map(feature_tensor: torch.Tensor, output_shape: tuple[int, int]) -> np.ndarray:
73
+ tensor = feature_tensor.detach().float()
74
+ while tensor.dim() > 3:
75
+ tensor = tensor[0]
76
+ if tensor.dim() == 3:
77
+ tensor = tensor.abs().mean(dim=0)
78
+ elif tensor.dim() != 2:
79
+ raise ValueError(f"Unexpected feature tensor shape: {tuple(feature_tensor.shape)}")
80
+
81
+ heatmap = normalize_map(tensor.cpu().numpy())
82
+ height, width = output_shape
83
+ return cv2.resize(heatmap, (width, height), interpolation=cv2.INTER_CUBIC)
84
+
85
+
86
+ def render_heatmap_overlay(rgb_image: np.ndarray, heatmap: np.ndarray, alpha: float = 0.45) -> np.ndarray:
87
+ if heatmap.shape != rgb_image.shape[:2]:
88
+ heatmap = cv2.resize(heatmap, (rgb_image.shape[1], rgb_image.shape[0]), interpolation=cv2.INTER_CUBIC)
89
+ colored = cv2.applyColorMap((normalize_map(heatmap) * 255.0).astype(np.uint8), cv2.COLORMAP_TURBO)
90
+ colored = cv2.cvtColor(colored, cv2.COLOR_BGR2RGB)
91
+ return cv2.addWeighted(rgb_image, 1.0 - alpha, colored, alpha, 0.0)
92
+
93
+
94
+ def render_temporal_strip(values, *, active_index: int | None = None, cell_width: int = 12, height: int = 72) -> np.ndarray:
95
+ sequence = np.asarray(values, dtype=np.float32).reshape(1, -1)
96
+ if sequence.size == 0:
97
+ sequence = np.zeros((1, 1), dtype=np.float32)
98
+
99
+ normalized = normalize_map(sequence)
100
+ strip = (normalized * 255.0).astype(np.uint8)
101
+ strip = np.repeat(strip, height, axis=0)
102
+ strip = np.repeat(strip, cell_width, axis=1)
103
+ colored = cv2.applyColorMap(strip, cv2.COLORMAP_TURBO)
104
+ colored = cv2.cvtColor(colored, cv2.COLOR_BGR2RGB)
105
+
106
+ if active_index is not None and sequence.shape[1] > 0:
107
+ clamped = clamp_index(active_index, sequence.shape[1])
108
+ x0 = clamped * cell_width
109
+ x1 = min(colored.shape[1] - 1, x0 + cell_width - 1)
110
+ cv2.rectangle(colored, (x0, 0), (x1, colored.shape[0] - 1), (255, 255, 255), 2)
111
+
112
+ return colored
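The index-clamping and normalization helpers in `explainability.py` have simple invariants worth pinning down: `clamp_index(None, n)` selects the last valid index, out-of-range indices clamp into `[0, n-1]`, and `normalize_map` anchors at zero (because of `min(initial=0.0)`) and scales the peak to 1.0. A minimal standalone check, with the two pure functions re-declared so it runs on its own:

```python
import numpy as np


def clamp_index(index, upper_bound):
    # Mirrors explainability.clamp_index: None selects the last valid index.
    if upper_bound <= 0:
        return 0
    if index is None:
        return upper_bound - 1
    return max(0, min(int(index), upper_bound - 1))


def normalize_map(values):
    # Mirrors explainability.normalize_map: anchor at zero (min(initial=0.0)
    # never exceeds 0 for non-negative maps), then scale the peak to 1.0.
    array = np.asarray(values, dtype=np.float32).copy()
    array -= float(array.min(initial=0.0))
    max_value = float(array.max(initial=0.0))
    if max_value > 0:
        array /= max_value
    return array


print(clamp_index(None, 5))  # → 4
print(clamp_index(99, 5))    # → 4
heat = normalize_map([[2.0, 4.0], [6.0, 8.0]])
print(heat.min(), heat.max())
```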
model/transformer.py CHANGED
@@ -146,7 +146,7 @@ class Decoder(nn.Module):
146
  super(Decoder, self).__init__()
147
  self.layers = nn.ModuleList([DecoderLayer(d_model, d_ff, d_k, d_v, n_heads, len_q) for _ in range(n_layers)])
148
 
149
- def forward(self, dec_inputs, enc_outputs):
150
  '''
151
  dec_inputs: [batch_size, tgt_len, d_model] [512, 1, 5]
152
  enc_intpus: [batch_size, src_len, d_model] [512, 30, 5]
@@ -160,6 +160,8 @@ class Decoder(nn.Module):
160
  # dec_outputs: [batch_size, tgt_len, d_model], dec_self_attn: [batch_size, n_heads, tgt_len, tgt_len], dec_enc_attn: [batch_size, h_heads, tgt_len, src_len]
161
  dec_outputs, dec_enc_attn = layer(dec_outputs, enc_outputs)
162
  dec_enc_attns.append(dec_enc_attn)
 
 
163
  return dec_outputs
164
 
165
 
@@ -175,7 +177,7 @@ class Transformer2_3_1(nn.Module):
175
  self.encoder = Encoder(d_model, d_ff, d_k, d_v, n_layers, n_heads, len_q)
176
  self.decoder = Decoder(d_model, d_ff, d_k, d_v, 1, n_heads, len_q)
177
 
178
- def forward(self, enc_inputs, dec_inputs):
179
  '''
180
  enc_inputs: [Frames, src_len, d_model] [512, 30, 5]
181
  dec_inputs: [Frames, 1, d_model] [512, 1, 5]
@@ -185,8 +187,11 @@ class Transformer2_3_1(nn.Module):
185
 
186
  # enc_outputs: [batch_size, src_len, d_model], enc_self_attns: [n_layers, batch_size, n_heads, src_len, src_len]
187
  enc_outputs, enc_self_attns = self.encoder(enc_inputs) # Self-attention for temporal features
188
- dec_outputs = self.decoder(dec_inputs, enc_outputs)
189
- return dec_outputs
 
 
 
190
 
191
 
192
  class Transformer(nn.Module):
@@ -210,7 +215,7 @@ class Transformer(nn.Module):
210
  nn.Linear(self.d_model, out_features, bias=False)
211
  )
212
 
213
- def forward(self, x, long_feature):
214
  # x: [B, 256, T]; long_feature: [B, T, 256]
215
  B, D, T = x.shape
216
  out_features = x.transpose(1, 2) # [B, T, 256]
@@ -238,9 +243,24 @@ class Transformer(nn.Module):
238
  win = out_features[:, i - spa_len + 1:i + 1, :]
239
  out_feas.append(win)
240
  out_feas = torch.stack(out_feas, dim=0).squeeze(1)
241
- out_feas, _ = self.spatial_encoder(out_feas)
242
 
243
  # Temporal-spatial fusion
244
- output = self.transformer(inputs, out_feas) # [T, B, 1, 256] collapsed → [T, B, 256]
 
 
 
 
245
  output = self.out(output) # [T, B, C]
246
- return output.transpose(0, 1) # [B, T, C]
 
 
 
 
 
 
 
 
 
 
 
 
146
  super(Decoder, self).__init__()
147
  self.layers = nn.ModuleList([DecoderLayer(d_model, d_ff, d_k, d_v, n_heads, len_q) for _ in range(n_layers)])
148
 
149
+ def forward(self, dec_inputs, enc_outputs, return_attentions=False):
150
  '''
151
  dec_inputs: [batch_size, tgt_len, d_model] [512, 1, 5]
152
  enc_inputs: [batch_size, src_len, d_model] [512, 30, 5]
 
160
  # dec_outputs: [batch_size, tgt_len, d_model], dec_self_attn: [batch_size, n_heads, tgt_len, tgt_len], dec_enc_attn: [batch_size, h_heads, tgt_len, src_len]
161
  dec_outputs, dec_enc_attn = layer(dec_outputs, enc_outputs)
162
  dec_enc_attns.append(dec_enc_attn)
163
+ if return_attentions:
164
+ return dec_outputs, dec_enc_attns
165
  return dec_outputs
166
 
167
 
 
177
  self.encoder = Encoder(d_model, d_ff, d_k, d_v, n_layers, n_heads, len_q)
178
  self.decoder = Decoder(d_model, d_ff, d_k, d_v, 1, n_heads, len_q)
179
 
180
+ def forward(self, enc_inputs, dec_inputs, return_attentions=False):
181
  '''
182
  enc_inputs: [Frames, src_len, d_model] [512, 30, 5]
183
  dec_inputs: [Frames, 1, d_model] [512, 1, 5]
 
187
 
188
  # enc_outputs: [batch_size, src_len, d_model], enc_self_attns: [n_layers, batch_size, n_heads, src_len, src_len]
189
  enc_outputs, enc_self_attns = self.encoder(enc_inputs) # Self-attention for temporal features
190
+ decoder_outputs = self.decoder(dec_inputs, enc_outputs, return_attentions=return_attentions)
191
+ if return_attentions:
192
+ dec_outputs, dec_enc_attns = decoder_outputs
193
+ return dec_outputs, {"encoder_self_attns": enc_self_attns, "decoder_cross_attns": dec_enc_attns}
194
+ return decoder_outputs
195
 
196
 
197
  class Transformer(nn.Module):
 
215
  nn.Linear(self.d_model, out_features, bias=False)
216
  )
217
 
218
+ def forward(self, x, long_feature, return_attention=False):
219
  # x: [B, 256, T]; long_feature: [B, T, 256]
220
  B, D, T = x.shape
221
  out_features = x.transpose(1, 2) # [B, T, 256]
 
243
  win = out_features[:, i - spa_len + 1:i + 1, :]
244
  out_feas.append(win)
245
  out_feas = torch.stack(out_feas, dim=0).squeeze(1)
246
+ out_feas, spatial_attn = self.spatial_encoder(out_feas)
247
 
248
  # Temporal-spatial fusion
249
+ transformer_outputs = self.transformer(inputs, out_feas, return_attentions=return_attention)
250
+ if return_attention:
251
+ output, attention_meta = transformer_outputs
252
+ else:
253
+ output = transformer_outputs
254
  output = self.out(output) # [T, B, C]
255
+ output = output.transpose(0, 1) # [B, T, C]
256
+ if not return_attention:
257
+ return output
258
+
259
+ decoder_attn = attention_meta["decoder_cross_attns"][-1]
260
+ spatial_attn_last = spatial_attn[-1]
261
+ decoder_strip = decoder_attn[-1].mean(dim=0).mean(dim=0).detach()
262
+ spatial_strip = spatial_attn_last.mean(dim=0).mean(dim=0).detach()
263
+ return output, {
264
+ "decoder_strip": decoder_strip,
265
+ "spatial_strip": spatial_strip,
266
+ }
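The `decoder_strip` computed above collapses a `[batch, heads, tgt_len, src_len]` cross-attention tensor from the last decoder layer into one weight per source frame: index the last batch element, then average over heads and target positions. A shape-only sketch of that reduction (NumPy stands in for torch here; sizes are hypothetical):

```python
import numpy as np

# Hypothetical cross-attention from the last decoder layer:
# [batch=4, heads=8, tgt_len=1, src_len=30].
attn = np.random.rand(4, 8, 1, 30).astype(np.float32)

# attn[-1]            -> [heads, tgt_len, src_len]
# .mean(axis=0)       -> [tgt_len, src_len]   (average over heads)
# .mean(axis=0)       -> [src_len]            (average over target positions)
strip = attn[-1].mean(axis=0).mean(axis=0)
print(strip.shape)  # one attention weight per source frame
```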
model_manager.py CHANGED
@@ -56,6 +56,16 @@ class SpaceModelManager:
56
  self.current_predictor = predictor
57
  return predictor
58
 
 
 
 
 
 
 
 
 
 
 
59
  def reset_predictor_state(self) -> None:
60
  if self.current_predictor is not None and hasattr(self.current_predictor, "reset_state"):
61
  self.current_predictor.reset_state()
 
56
  self.current_predictor = predictor
57
  return predictor
58
 
59
+ def get_loaded_predictor(self, model_key: str | None = None):
60
+ if self.current_predictor is None:
61
+ return None
62
+ if model_key is None:
63
+ return self.current_predictor
64
+ normalized_key = normalize_model_key(model_key)
65
+ if self.current_model_key != normalized_key:
66
+ return None
67
+ return self.current_predictor
68
+
69
  def reset_predictor_state(self) -> None:
70
  if self.current_predictor is not None and hasattr(self.current_predictor, "reset_state"):
71
  self.current_predictor.reset_state()
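The new `get_loaded_predictor` guard can be exercised in isolation: no key returns whatever is loaded, a matching (normalized) key returns it, and a mismatched key returns `None`. A minimal sketch, with a hypothetical `normalize_model_key` standing in for the module's real helper:

```python
def normalize_model_key(key):
    # Hypothetical stand-in for the project's normalize_model_key helper.
    return key.strip().lower()


class ManagerSketch:
    # Mirrors SpaceModelManager.get_loaded_predictor's guard logic.
    def __init__(self, key, predictor):
        self.current_model_key = key
        self.current_predictor = predictor

    def get_loaded_predictor(self, model_key=None):
        if self.current_predictor is None:
            return None
        if model_key is None:
            return self.current_predictor
        if self.current_model_key != normalize_model_key(model_key):
            return None
        return self.current_predictor


mgr = ManagerSketch("dino-endo", object())
print(mgr.get_loaded_predictor() is not None)              # loaded, no key
print(mgr.get_loaded_predictor("  DINO-ENDO ") is not None)  # normalized match
print(mgr.get_loaded_predictor("vjepa2") is None)          # key mismatch
```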
predictor.py CHANGED
@@ -1,5 +1,6 @@
1
  from __future__ import annotations
2
 
 
3
  import os
4
  import sys
5
  from contextlib import nullcontext
@@ -21,6 +22,15 @@ except ImportError: # pragma: no cover
21
  from model.resnet import ResNet
22
  from model.mstcn import MultiStageModel
23
  from model.transformer import Transformer
 
 
 
 
 
 
 
 
 
24
 
25
  PHASE_LABELS = ("idle", "marking", "injection", "dissection")
26
  MODEL_LABELS = {
@@ -108,6 +118,37 @@ def _resolve_vendor_repo(repo_name: str, extra_candidates=()):
108
  raise FileNotFoundError(f"Required vendor repo '{repo_name}' not found. Stage it into this folder or keep the repo-root copy available.")
109
 
110
 
111
  class Predictor:
112
  def __init__(self, model_dir: str | None = None, device: str = "cuda"):
113
  self.device = torch.device(device if torch.cuda.is_available() else "cpu")
@@ -118,6 +159,14 @@ class Predictor:
118
  self.frame_feature_cache = None
119
  self.label_dict = dict(enumerate(PHASE_LABELS))
120
  self.available = False
 
 
 
 
 
 
 
 
121
 
122
  self._norm_mean = None
123
  self._norm_std = None
@@ -134,6 +183,9 @@ class Predictor:
134
  paras = {k.replace("share.", "resnet."): v for k, v in paras.items()}
135
  self.resnet.load_state_dict(paras, strict=True)
136
  self.resnet.to(self.device).eval()
 
 
 
137
 
138
  self.fusion = MultiStageModel(
139
  mstcn_stages=2,
@@ -174,11 +226,18 @@ class Predictor:
174
  self.predict(dummy)
175
  self.reset_state()
176
 
 
 
 
177
  def reset_state(self):
178
  self.frame_feature_cache = None
 
179
  if torch.cuda.is_available():
180
  torch.cuda.empty_cache()
181
 
 
 
 
182
  def unload(self):
183
  self.available = False
184
  self.resnet.to("cpu")
@@ -188,6 +247,10 @@ class Predictor:
188
  self.fusion = None
189
  self.transformer = None
190
  self.frame_feature_cache = None
 
 
 
 
191
  if torch.cuda.is_available():
192
  torch.cuda.empty_cache()
193
 
@@ -200,7 +263,10 @@ class Predictor:
200
  self.frame_feature_cache = torch.cat([self.frame_feature_cache, feature], dim=0)
201
 
202
  @torch.inference_mode()
203
- def predict(self, rgb_image: np.ndarray):
 
 
 
204
  if self._norm_mean is not None:
205
  tensor = self._preprocess_gpu(rgb_image)
206
  else:
@@ -216,33 +282,91 @@ class Predictor:
216
  single_frame_feature = feature.unsqueeze(1)
217
  temporal_input = single_frame_feature.transpose(1, 2)
218
  temporal_feature = self.fusion(temporal_input)
219
- outputs = self.transformer(temporal_feature.detach(), single_frame_feature)
 
 
 
 
 
 
 
 
220
  final_logits = outputs[-1, -1, :]
221
  probs = F.softmax(final_logits.float(), dim=-1)
222
  pred_np = probs.detach().cpu().numpy()
223
  confidence = float(np.max(pred_np))
224
  phase_idx = max(0, min(3, int(np.argmax(pred_np))))
225
  phase = self.label_dict.get(phase_idx, "idle")
226
- return {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": 1}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
227
 
228
  if self.frame_feature_cache.shape[0] < 30:
229
  available_frames = self.frame_feature_cache.shape[0] + 1
230
  cat_frame_feature = torch.cat([self.frame_feature_cache, feature], dim=0).unsqueeze(0)
231
  temporal_input = cat_frame_feature.transpose(1, 2)
232
  temporal_feature = self.fusion(temporal_input)
233
- outputs = self.transformer(temporal_feature.detach(), cat_frame_feature)
 
 
 
 
 
 
 
 
234
  final_logits = outputs[-1, -1, :]
235
  probs = F.softmax(final_logits.float(), dim=-1)
236
  pred_np = probs.detach().cpu().numpy()
237
  confidence = float(np.max(pred_np))
238
  phase_idx = max(0, min(3, int(np.argmax(pred_np))))
239
  phase = self.label_dict.get(phase_idx, "idle")
240
- return {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": available_frames}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
241
 
242
  cat_frame_feature = self.frame_feature_cache.unsqueeze(0)
243
  temporal_input = cat_frame_feature.transpose(1, 2)
244
  temporal_feature = self.fusion(temporal_input)
245
- outputs = self.transformer(temporal_feature.detach(), cat_frame_feature)
 
 
 
 
 
 
 
 
246
  final_logits = outputs[-1, -1, :]
247
  probs = F.softmax(final_logits.float(), dim=-1)
248
  pred_np = probs.detach().cpu().numpy()
@@ -250,7 +374,22 @@ class Predictor:
250
  confidence = float(np.max(pred_np))
251
  phase_idx = max(0, min(3, int(np.argmax(pred_np))))
252
  phase = self.label_dict.get(phase_idx, "idle")
253
- return {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": min(self.trans_seq, self.frame_feature_cache.shape[0])}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
254
 
255
 
256
  class PredictorDinoV2:
@@ -267,7 +406,19 @@ class PredictorDinoV2:
267
  A.CenterCrop(height=224, width=224),
268
  A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225), max_pixel_value=255.0),
269
  ])
 
 
 
 
270
  self.frame_features = []
 
 
 
 
 
 
 
 
271
  self._load_models(self.model_dir)
272
 
273
  def _amp_context(self):
@@ -297,6 +448,14 @@ class PredictorDinoV2:
297
  encoder_load = self.backbone.load_state_dict(encoder_state, strict=False)
298
  _validate_load_result(encoder_load, "DINOv2 backbone")
299
  self.backbone.to(self.device).eval()
 
 
 
 
 
 
 
 
300
 
301
  decoder_path = os.path.join(model_dir, "fusion_transformer_decoder_best_model.pth")
302
  if not os.path.exists(decoder_path):
@@ -326,7 +485,7 @@ class PredictorDinoV2:
326
  d_model=d_model,
327
  )
328
 
329
- def forward(self, x):
330
  x = x.permute(0, 2, 1)
331
  x_reduced = self.reduce(x)
332
  mstcn_input = x_reduced.permute(0, 2, 1)
@@ -341,8 +500,15 @@ class PredictorDinoV2:
341
  else:
342
  transformer_input = mstcn_input.detach()
343
 
344
- transformer_out = self.transformer(transformer_input, x_reduced)
345
- return transformer_out.permute(0, 2, 1)
 
 
 
 
 
 
 
346
 
347
  self.decoder = FusionTransformerDecoder()
348
  decoder_load = self.decoder.load_state_dict(decoder_state, strict=False)
@@ -359,6 +525,7 @@ class PredictorDinoV2:
359
 
360
  def reset_state(self):
361
  self.frame_features = []
 
362
  if torch.cuda.is_available():
363
  torch.cuda.empty_cache()
364
 
@@ -367,6 +534,40 @@ class PredictorDinoV2:
367
  self.predict(dummy_img)
368
  self.reset_state()
369
370
  def unload(self):
371
  if self.backbone is not None:
372
  self.backbone.to("cpu")
@@ -375,15 +576,33 @@ class PredictorDinoV2:
375
  self.backbone = None
376
  self.decoder = None
377
  self.frame_features = []
 
 
378
  self.available = False
379
  if torch.cuda.is_available():
380
  torch.cuda.empty_cache()
381
 
382
      @torch.inference_mode()
-     def predict(self, rgb_image: np.ndarray):
          if not self.available or self.backbone is None or self.decoder is None:
              raise RuntimeError("DINO-Endo predictor is not available")

          processed = self.aug(image=rgb_image)["image"]
          chw = np.transpose(processed, (2, 0, 1))
          tensor = torch.tensor(chw, dtype=torch.float32).unsqueeze(0).to(self.device)
@@ -408,7 +627,11 @@ class PredictorDinoV2:

          decoder_input = seq.transpose(1, 2)
          with self._amp_context():
-             logits = self.decoder(decoder_input)

          if logits.dim() != 3:
              raise ValueError(f"Unexpected DINOv2 decoder output shape: {tuple(logits.shape)}")
@@ -424,7 +647,22 @@ class PredictorDinoV2:
          confidence = float(np.max(pred_np))
          phase_idx = int(np.argmax(pred_np))
          phase = self.label_dict.get(phase_idx, "idle")
-         return {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": available_frames}


  class PredictorVJEPA2:
@@ -443,6 +681,15 @@ class PredictorVJEPA2:
          self._feature_buffer = []
          self._vjepa_mean = torch.tensor([0.485, 0.456, 0.406], dtype=torch.float32).view(3, 1, 1, 1)
          self._vjepa_std = torch.tensor([0.229, 0.224, 0.225], dtype=torch.float32).view(3, 1, 1, 1)
          self._load_models(self.model_dir)

      def _amp_context(self):
@@ -509,7 +756,9 @@ class PredictorVJEPA2:
          sys.path.insert(0, str(vjepa2_path))

          from src.models import vision_transformer as vjepa_vit
          from src.utils.checkpoint_loader import robust_checkpoint_loader

          encoder_path = os.path.join(model_dir, "vjepa_encoder_human.pt")
          if not os.path.exists(encoder_path):
@@ -530,6 +779,14 @@
          encoder_load = self.encoder.load_state_dict(encoder_state, strict=False)
          self._validate_load_result(encoder_load, "V-JEPA2 encoder")
          self.encoder.to(self.device).eval()

          decoder_path = os.path.join(model_dir, "mlp_decoder_human.pth")
          if not os.path.exists(decoder_path):
@@ -566,6 +823,7 @@
      def reset_state(self):
          self._frame_buffer = []
          self._feature_buffer = []
          if torch.cuda.is_available():
              torch.cuda.empty_cache()

@@ -574,6 +832,67 @@
          self.predict(dummy)
          self.reset_state()

      def unload(self):
          if self.encoder is not None:
              self.encoder.to("cpu")
@@ -583,15 +902,30 @@
          self.decoder = None
          self._frame_buffer = []
          self._feature_buffer = []
          self.available = False
          if torch.cuda.is_available():
              torch.cuda.empty_cache()

      @torch.inference_mode()
-     def predict(self, rgb_image: np.ndarray):
          if not self.available:
              raise RuntimeError("V-JEPA2 predictor is not available")

          frame = np.ascontiguousarray(rgb_image, dtype=np.uint8)
          self._frame_buffer.append(frame)
          if len(self._frame_buffer) > self._clip_frames:
@@ -625,7 +959,30 @@
          confidence = float(np.max(pred_np))
          phase_idx = int(np.argmax(pred_np))
          phase = self.label_dict.get(phase_idx, "idle")
-         return {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": available_frames}


  def create_predictor(model_key: str, model_dir: str | None = None, device: str | None = None):
  from __future__ import annotations

+ import math
  import os
  import sys
  from contextlib import nullcontext

  from model.resnet import ResNet
  from model.mstcn import MultiStageModel
  from model.transformer import Transformer
+ from explainability import (
+     ExplainabilitySpec,
+     ModuleOutputRecorder,
+     clamp_index,
+     feature_energy_map,
+     render_heatmap_overlay,
+     render_temporal_strip,
+     resize_rgb_image,
+ )

  PHASE_LABELS = ("idle", "marking", "injection", "dissection")
  MODEL_LABELS = {

          raise FileNotFoundError(f"Required vendor repo '{repo_name}' not found. Stage it into this folder or keep the repo-root copy available.")


+ def _build_explainability_payload(
+     *,
+     display_image: np.ndarray,
+     encoder_heatmap: np.ndarray,
+     encoder_kind: str,
+     encoder_label: str,
+     decoder_values,
+     decoder_kind: str,
+     decoder_label: str,
+     active_decoder_index: int | None = None,
+     encoder_layer: int | None = None,
+     encoder_head: int | None = None,
+     notes: str | None = None,
+ ) -> dict:
+     payload = {
+         "encoder_kind": encoder_kind,
+         "encoder_label": encoder_label,
+         "encoder_visualization": render_heatmap_overlay(display_image, encoder_heatmap),
+         "decoder_kind": decoder_kind,
+         "decoder_label": decoder_label,
+         "decoder_visualization": render_temporal_strip(decoder_values, active_index=active_decoder_index),
+     }
+     if encoder_layer is not None:
+         payload["encoder_layer"] = int(encoder_layer)
+     if encoder_head is not None:
+         payload["encoder_head"] = int(encoder_head)
+     if notes:
+         payload["notes"] = notes
+     return payload


  class Predictor:
      def __init__(self, model_dir: str | None = None, device: str = "cuda"):
          self.device = torch.device(device if torch.cuda.is_available() else "cpu")

          self.frame_feature_cache = None
          self.label_dict = dict(enumerate(PHASE_LABELS))
          self.available = False
+         self._resnet_activation = None
+         self._resnet_activation_hook = None
+         self._explainability_spec = ExplainabilitySpec(
+             encoder_mode="proxy",
+             encoder_label="ResNet layer4 activation energy (proxy)",
+             decoder_mode="attention",
+             decoder_label="Temporal Transformer attention",
+         )

          self._norm_mean = None
          self._norm_std = None

          paras = {k.replace("share.", "resnet."): v for k, v in paras.items()}
          self.resnet.load_state_dict(paras, strict=True)
          self.resnet.to(self.device).eval()
+         self._resnet_activation_hook = self.resnet.resnet.layer4[-1].relu.register_forward_hook(
+             self._capture_resnet_activation
+         )

          self.fusion = MultiStageModel(
              mstcn_stages=2,

          self.predict(dummy)
          self.reset_state()

+     def _capture_resnet_activation(self, module, inputs, output):  # pragma: no cover - hook signature
+         self._resnet_activation = output.detach()
+
      def reset_state(self):
          self.frame_feature_cache = None
+         self._resnet_activation = None
          if torch.cuda.is_available():
              torch.cuda.empty_cache()

+     def get_explainability_spec(self) -> ExplainabilitySpec:
+         return self._explainability_spec
+
      def unload(self):
          self.available = False
          self.resnet.to("cpu")
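The `ModuleOutputRecorder` / `register_forward_hook` pattern above caches a layer's latest output without touching the model itself. A minimal framework-free sketch of the same pattern — `TinyModule`, `_Handle`, and `Recorder` are illustrative stand-ins for `torch.nn.Module`, `RemovableHandle`, and the `ModuleOutputRecorder` defined in `explainability.py` (not shown in this diff):

```python
class _Handle:
    # mimics torch.utils.hooks.RemovableHandle: detaches exactly one hook
    def __init__(self, hooks, fn):
        self._hooks, self._fn = hooks, fn

    def remove(self):
        if self._fn in self._hooks:
            self._hooks.remove(self._fn)


class TinyModule:
    # mimics the hook half of torch.nn.Module's forward-hook API
    def __init__(self):
        self._hooks = []

    def register_forward_hook(self, fn):
        self._hooks.append(fn)
        return _Handle(self._hooks, fn)

    def __call__(self, x):
        out = x * 2  # stand-in for the real forward computation
        for fn in self._hooks:
            fn(self, (x,), out)  # hooks receive (module, inputs, output)
        return out


class Recorder:
    # caches the most recent forward output of whatever module it is attached to
    def __init__(self):
        self.handle = None
        self.output = None

    def attach(self, module):
        self.remove()  # never hold two hooks at once
        self.handle = module.register_forward_hook(lambda m, i, o: setattr(self, "output", o))

    def clear(self):
        self.output = None

    def remove(self):
        if self.handle is not None:
            self.handle.remove()
            self.handle = None


layer = TinyModule()
rec = Recorder()
rec.attach(layer)
layer(21)          # forward pass fires the hook
print(rec.output)  # 42
```

The real recorder stores `output.detach()` so the cached tensor does not keep the autograd graph alive, which is why `_capture_resnet_activation` calls `detach()` before stashing the activation.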
          self.fusion = None
          self.transformer = None
          self.frame_feature_cache = None
+         self._resnet_activation = None
+         if self._resnet_activation_hook is not None:
+             self._resnet_activation_hook.remove()
+             self._resnet_activation_hook = None
          if torch.cuda.is_available():
              torch.cuda.empty_cache()

          self.frame_feature_cache = torch.cat([self.frame_feature_cache, feature], dim=0)

      @torch.inference_mode()
+     def predict(self, rgb_image: np.ndarray, explainability: dict | None = None):
+         explain_enabled = bool(explainability and explainability.get("enabled"))
+         attention_meta = None
+         display_image = resize_rgb_image(rgb_image, (224, 224)) if explain_enabled else None
          if self._norm_mean is not None:
              tensor = self._preprocess_gpu(rgb_image)
          else:

          single_frame_feature = feature.unsqueeze(1)
          temporal_input = single_frame_feature.transpose(1, 2)
          temporal_feature = self.fusion(temporal_input)
+         transformer_outputs = self.transformer(
+             temporal_feature.detach(),
+             single_frame_feature,
+             return_attention=explain_enabled,
+         )
+         if explain_enabled:
+             outputs, attention_meta = transformer_outputs
+         else:
+             outputs = transformer_outputs
          final_logits = outputs[-1, -1, :]
          probs = F.softmax(final_logits.float(), dim=-1)
          pred_np = probs.detach().cpu().numpy()
          confidence = float(np.max(pred_np))
          phase_idx = max(0, min(3, int(np.argmax(pred_np))))
          phase = self.label_dict.get(phase_idx, "idle")
+         frames_used = 1
+         result = {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": frames_used}
+         if explain_enabled and attention_meta is not None and display_image is not None and self._resnet_activation is not None:
+             encoder_heatmap = feature_energy_map(self._resnet_activation, display_image.shape[:2])
+             result["explainability"] = _build_explainability_payload(
+                 display_image=display_image,
+                 encoder_heatmap=encoder_heatmap,
+                 encoder_kind="proxy",
+                 encoder_label=self._explainability_spec.encoder_label,
+                 decoder_values=attention_meta["decoder_strip"].detach().cpu().numpy(),
+                 decoder_kind="attention",
+                 decoder_label=self._explainability_spec.decoder_label,
+                 active_decoder_index=frames_used - 1,
+                 notes="Encoder view is a proxy activation map because the ResNet backbone is not attention-based.",
+             )
+         return result

          if self.frame_feature_cache.shape[0] < 30:
              available_frames = self.frame_feature_cache.shape[0] + 1
              cat_frame_feature = torch.cat([self.frame_feature_cache, feature], dim=0).unsqueeze(0)
              temporal_input = cat_frame_feature.transpose(1, 2)
              temporal_feature = self.fusion(temporal_input)
+             transformer_outputs = self.transformer(
+                 temporal_feature.detach(),
+                 cat_frame_feature,
+                 return_attention=explain_enabled,
+             )
+             if explain_enabled:
+                 outputs, attention_meta = transformer_outputs
+             else:
+                 outputs = transformer_outputs
              final_logits = outputs[-1, -1, :]
              probs = F.softmax(final_logits.float(), dim=-1)
              pred_np = probs.detach().cpu().numpy()
              confidence = float(np.max(pred_np))
              phase_idx = max(0, min(3, int(np.argmax(pred_np))))
              phase = self.label_dict.get(phase_idx, "idle")
+             result = {
+                 "phase": phase,
+                 "probs": pred_np.tolist(),
+                 "confidence": confidence,
+                 "frames_used": available_frames,
+             }
+             if explain_enabled and attention_meta is not None and display_image is not None and self._resnet_activation is not None:
+                 encoder_heatmap = feature_energy_map(self._resnet_activation, display_image.shape[:2])
+                 result["explainability"] = _build_explainability_payload(
+                     display_image=display_image,
+                     encoder_heatmap=encoder_heatmap,
+                     encoder_kind="proxy",
+                     encoder_label=self._explainability_spec.encoder_label,
+                     decoder_values=attention_meta["decoder_strip"].detach().cpu().numpy(),
+                     decoder_kind="attention",
+                     decoder_label=self._explainability_spec.decoder_label,
+                     active_decoder_index=available_frames - 1,
+                     notes="Encoder view is a proxy activation map because the ResNet backbone is not attention-based.",
+                 )
+             return result

          cat_frame_feature = self.frame_feature_cache.unsqueeze(0)
          temporal_input = cat_frame_feature.transpose(1, 2)
          temporal_feature = self.fusion(temporal_input)
+         transformer_outputs = self.transformer(
+             temporal_feature.detach(),
+             cat_frame_feature,
+             return_attention=explain_enabled,
+         )
+         if explain_enabled:
+             outputs, attention_meta = transformer_outputs
+         else:
+             outputs = transformer_outputs
          final_logits = outputs[-1, -1, :]
          probs = F.softmax(final_logits.float(), dim=-1)
          pred_np = probs.detach().cpu().numpy()

          confidence = float(np.max(pred_np))
          phase_idx = max(0, min(3, int(np.argmax(pred_np))))
          phase = self.label_dict.get(phase_idx, "idle")
+         frames_used = min(self.trans_seq, self.frame_feature_cache.shape[0])
+         result = {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": frames_used}
+         if explain_enabled and attention_meta is not None and display_image is not None and self._resnet_activation is not None:
+             encoder_heatmap = feature_energy_map(self._resnet_activation, display_image.shape[:2])
+             result["explainability"] = _build_explainability_payload(
+                 display_image=display_image,
+                 encoder_heatmap=encoder_heatmap,
+                 encoder_kind="proxy",
+                 encoder_label=self._explainability_spec.encoder_label,
+                 decoder_values=attention_meta["decoder_strip"].detach().cpu().numpy(),
+                 decoder_kind="attention",
+                 decoder_label=self._explainability_spec.decoder_label,
+                 active_decoder_index=frames_used - 1,
+                 notes="Encoder view is a proxy activation map because the ResNet backbone is not attention-based.",
+             )
+         return result

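Because the ResNet backbone has no attention maps, the `Predictor` path builds its encoder view from `feature_energy_map`, imported from `explainability.py` (not shown in this diff). A plausible reading of such a proxy, sketched in NumPy — the `energy_map` helper below is hypothetical, not the project's implementation, and the real helper presumably also resizes the map onto the display image with OpenCV:

```python
import numpy as np

def energy_map(activation: np.ndarray) -> np.ndarray:
    """Collapse a (C, H, W) activation block to an (H, W) map scaled to [0, 1]."""
    energy = np.abs(activation).mean(axis=0)  # per-location mean |activation| over channels
    lo, hi = energy.min(), energy.max()
    if hi - lo < 1e-8:                        # flat activations: avoid divide-by-zero
        return np.zeros_like(energy)
    return (energy - lo) / (hi - lo)

act = np.array([[[1.0, -3.0], [0.0, 1.0]],
                [[1.0, 3.0], [0.0, 1.0]]])   # toy C=2, H=2, W=2 activation
heat = energy_map(act)                       # [[1/3, 1.0], [0.0, 1/3]]
```

The min-max normalization matters for overlay rendering: `render_heatmap_overlay` can then treat the map as alpha-like weights without knowing the activation scale of the layer being probed.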
  class PredictorDinoV2:

          A.CenterCrop(height=224, width=224),
          A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225), max_pixel_value=255.0),
          ])
+         self.display_aug = A.Compose([
+             A.SmallestMaxSize(max_size=256, interpolation=cv2.INTER_LINEAR),
+             A.CenterCrop(height=224, width=224),
+         ])
          self.frame_features = []
+         self._attention_recorder = ModuleOutputRecorder()
+         self._attention_layer_index = None
+         self._explainability_spec = ExplainabilitySpec(
+             encoder_mode="attention",
+             encoder_label="DINOv2 encoder self-attention",
+             decoder_mode="attention",
+             decoder_label="Fusion Transformer temporal attention",
+         )
          self._load_models(self.model_dir)

      def _amp_context(self):

          encoder_load = self.backbone.load_state_dict(encoder_state, strict=False)
          _validate_load_result(encoder_load, "DINOv2 backbone")
          self.backbone.to(self.device).eval()
+         self._explainability_spec = ExplainabilitySpec(
+             encoder_mode="attention",
+             encoder_label="DINOv2 encoder self-attention",
+             decoder_mode="attention",
+             decoder_label="Fusion Transformer temporal attention",
+             encoder_layer_count=len(self.backbone.blocks),
+             encoder_head_count=int(self.backbone.num_heads),
+         )

          decoder_path = os.path.join(model_dir, "fusion_transformer_decoder_best_model.pth")
          if not os.path.exists(decoder_path):

              d_model=d_model,
          )

+         def forward(self, x, return_attention=False):
              x = x.permute(0, 2, 1)
              x_reduced = self.reduce(x)
              mstcn_input = x_reduced.permute(0, 2, 1)

              else:
                  transformer_input = mstcn_input.detach()

+             transformer_outputs = self.transformer(
+                 transformer_input,
+                 x_reduced,
+                 return_attention=return_attention,
+             )
+             if return_attention:
+                 transformer_out, attention_meta = transformer_outputs
+                 return transformer_out.permute(0, 2, 1), attention_meta
+             return transformer_outputs.permute(0, 2, 1)

          self.decoder = FusionTransformerDecoder()
          decoder_load = self.decoder.load_state_dict(decoder_state, strict=False)

      def reset_state(self):
          self.frame_features = []
+         self._attention_recorder.clear()
          if torch.cuda.is_available():
              torch.cuda.empty_cache()

          self.predict(dummy_img)
          self.reset_state()

+     def get_explainability_spec(self) -> ExplainabilitySpec:
+         return self._explainability_spec
+
+     def _ensure_attention_hook(self, layer_index: int) -> None:
+         clamped_layer = clamp_index(layer_index, self._explainability_spec.encoder_layer_count)
+         if self._attention_layer_index == clamped_layer and self._attention_recorder.handle is not None:
+             return
+         self._attention_recorder.attach(self.backbone.blocks[clamped_layer].norm1)
+         self._attention_layer_index = clamped_layer
+
+     def _compute_encoder_attention_map(self, head_index: int, output_shape: tuple[int, int]) -> np.ndarray:
+         if self._attention_recorder.output is None or self._attention_layer_index is None:
+             raise RuntimeError("DINO encoder attention recorder did not capture any tokens")
+
+         tokens = self._attention_recorder.output.to(self.device)
+         block = self.backbone.blocks[self._attention_layer_index]
+         attn_module = block.attn
+         qkv = attn_module.qkv(tokens).reshape(tokens.shape[0], tokens.shape[1], 3, attn_module.num_heads, -1).permute(
+             2, 0, 3, 1, 4
+         )
+         q = qkv[0] * attn_module.scale
+         k = qkv[1]
+         attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)
+
+         head = clamp_index(head_index, attn.shape[1])
+         patch_start = 1 + int(getattr(self.backbone, "num_register_tokens", 0))
+         cls_attention = attn[0, head, 0, patch_start:]
+         patch_count = int(cls_attention.numel())
+         grid_size = int(math.sqrt(patch_count))
+         if grid_size * grid_size != patch_count:
+             raise RuntimeError(f"Unexpected DINO patch attention size: {patch_count}")
+         heatmap = cls_attention.view(grid_size, grid_size).detach().cpu().numpy()
+         return cv2.resize(heatmap, (output_shape[1], output_shape[0]), interpolation=cv2.INTER_CUBIC)
+
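`_compute_encoder_attention_map` recomputes q and k from the hooked pre-attention tokens, softmaxes q·kᵀ, and keeps the CLS-to-patch row of one head reshaped to a square grid. A NumPy sketch of that reduction with toy shapes — `cls_attention_grid` and its dimensions are illustrative only; the real code runs on the hooked DINOv2 tensors and skips register tokens via `patch_start`:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cls_attention_grid(q, k, head, n_prefix_tokens):
    """q, k: (heads, tokens, head_dim). Return one head's CLS->patch attention
    reshaped to a square patch grid, mirroring the reduction above."""
    scale = q.shape[-1] ** -0.5                                  # 1/sqrt(head_dim)
    attn = softmax((q * scale) @ k.transpose(0, 2, 1), axis=-1)  # (heads, T, T)
    cls_row = attn[head, 0, n_prefix_tokens:]                    # CLS query -> patch keys
    grid = int(np.sqrt(cls_row.size))
    assert grid * grid == cls_row.size, "patch tokens must form a square grid"
    return cls_row.reshape(grid, grid)

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 5, 8))   # 2 heads, 1 CLS + 4 patch tokens, head_dim 8
k = rng.normal(size=(2, 5, 8))
heat = cls_attention_grid(q, k, head=0, n_prefix_tokens=1)  # 2x2 patch grid
```

Hooking `norm1` (the pre-attention LayerNorm output) and re-running `qkv` is what lets the predictor recover attention weights from a backbone whose attention module never exposes them.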
      def unload(self):
          if self.backbone is not None:
              self.backbone.to("cpu")

          self.backbone = None
          self.decoder = None
          self.frame_features = []
+         self._attention_recorder.remove()
+         self._attention_layer_index = None
          self.available = False
          if torch.cuda.is_available():
              torch.cuda.empty_cache()

      @torch.inference_mode()
+     def predict(self, rgb_image: np.ndarray, explainability: dict | None = None):
          if not self.available or self.backbone is None or self.decoder is None:
              raise RuntimeError("DINO-Endo predictor is not available")

+         explain_enabled = bool(explainability and explainability.get("enabled"))
+         encoder_layer = clamp_index(
+             explainability.get("encoder_layer") if explainability else None,
+             self._explainability_spec.encoder_layer_count,
+         )
+         encoder_head = clamp_index(
+             explainability.get("encoder_head") if explainability else None,
+             self._explainability_spec.encoder_head_count,
+         )
+         if explain_enabled:
+             self._ensure_attention_hook(encoder_layer)
+             self._attention_recorder.clear()
+             display_image = self.display_aug(image=rgb_image)["image"]
+         else:
+             display_image = None
+
          processed = self.aug(image=rgb_image)["image"]
          chw = np.transpose(processed, (2, 0, 1))
          tensor = torch.tensor(chw, dtype=torch.float32).unsqueeze(0).to(self.device)

          decoder_input = seq.transpose(1, 2)
          with self._amp_context():
+             decoder_outputs = self.decoder(decoder_input, return_attention=explain_enabled)
+             if explain_enabled:
+                 logits, attention_meta = decoder_outputs
+             else:
+                 logits = decoder_outputs

          if logits.dim() != 3:
              raise ValueError(f"Unexpected DINOv2 decoder output shape: {tuple(logits.shape)}")

          confidence = float(np.max(pred_np))
          phase_idx = int(np.argmax(pred_np))
          phase = self.label_dict.get(phase_idx, "idle")
+         result = {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": available_frames}
+         if explain_enabled and display_image is not None:
+             encoder_heatmap = self._compute_encoder_attention_map(encoder_head, display_image.shape[:2])
+             result["explainability"] = _build_explainability_payload(
+                 display_image=display_image,
+                 encoder_heatmap=encoder_heatmap,
+                 encoder_kind="attention",
+                 encoder_label=self._explainability_spec.encoder_label,
+                 decoder_values=attention_meta["decoder_strip"].detach().cpu().numpy(),
+                 decoder_kind="attention",
+                 decoder_label=self._explainability_spec.decoder_label,
+                 active_decoder_index=available_frames - 1,
+                 encoder_layer=encoder_layer,
+                 encoder_head=encoder_head,
+             )
+         return result


  class PredictorVJEPA2:

          self._feature_buffer = []
          self._vjepa_mean = torch.tensor([0.485, 0.456, 0.406], dtype=torch.float32).view(3, 1, 1, 1)
          self._vjepa_std = torch.tensor([0.229, 0.224, 0.225], dtype=torch.float32).view(3, 1, 1, 1)
+         self._attention_recorder = ModuleOutputRecorder()
+         self._attention_layer_index = None
+         self._rotate_queries_or_keys = None
+         self._explainability_spec = ExplainabilitySpec(
+             encoder_mode="attention",
+             encoder_label="V-JEPA2 encoder self-attention",
+             decoder_mode="proxy",
+             decoder_label="MLP decoder feature energy (proxy)",
+         )
          self._load_models(self.model_dir)

      def _amp_context(self):

          sys.path.insert(0, str(vjepa2_path))

          from src.models import vision_transformer as vjepa_vit
+         from src.models.utils.modules import rotate_queries_or_keys
          from src.utils.checkpoint_loader import robust_checkpoint_loader
+         self._rotate_queries_or_keys = rotate_queries_or_keys

          encoder_path = os.path.join(model_dir, "vjepa_encoder_human.pt")
          if not os.path.exists(encoder_path):

          encoder_load = self.encoder.load_state_dict(encoder_state, strict=False)
          self._validate_load_result(encoder_load, "V-JEPA2 encoder")
          self.encoder.to(self.device).eval()
+         self._explainability_spec = ExplainabilitySpec(
+             encoder_mode="attention",
+             encoder_label="V-JEPA2 encoder self-attention",
+             decoder_mode="proxy",
+             decoder_label="MLP decoder feature energy (proxy)",
+             encoder_layer_count=len(self.encoder.blocks),
+             encoder_head_count=int(self.encoder.num_heads),
+         )

          decoder_path = os.path.join(model_dir, "mlp_decoder_human.pth")
          if not os.path.exists(decoder_path):

      def reset_state(self):
          self._frame_buffer = []
          self._feature_buffer = []
+         self._attention_recorder.clear()
          if torch.cuda.is_available():
              torch.cuda.empty_cache()

          self.predict(dummy)
          self.reset_state()

+     def get_explainability_spec(self) -> ExplainabilitySpec:
+         return self._explainability_spec
+
+     def _ensure_attention_hook(self, layer_index: int) -> None:
+         clamped_layer = clamp_index(layer_index, self._explainability_spec.encoder_layer_count)
+         if self._attention_layer_index == clamped_layer and self._attention_recorder.handle is not None:
+             return
+         self._attention_recorder.attach(self.encoder.blocks[clamped_layer].norm1)
+         self._attention_layer_index = clamped_layer
+
+     def _compute_encoder_attention_map(
+         self,
+         *,
+         head_index: int,
+         temporal_group_index: int,
+         output_shape: tuple[int, int],
+     ) -> np.ndarray:
+         if self._attention_recorder.output is None or self._attention_layer_index is None:
+             raise RuntimeError("V-JEPA2 encoder attention recorder did not capture any tokens")
+         if self._rotate_queries_or_keys is None:
+             raise RuntimeError("V-JEPA2 rotation helper is unavailable")
+
+         tokens = self._attention_recorder.output.to(self.device)
+         block = self.encoder.blocks[self._attention_layer_index]
+         attn_module = block.attn
+         qkv = attn_module.qkv(tokens).unflatten(-1, (3, attn_module.num_heads, -1)).permute(2, 0, 3, 1, 4)
+         q, k = qkv[0], qkv[1]
+
+         patch_grid = self._crop_size // 16
+         temporal_groups = self._clip_frames // self._tubelet_size
+         if hasattr(attn_module, "separate_positions"):
+             mask = torch.arange(int(temporal_groups * patch_grid * patch_grid), device=tokens.device)
+             d_mask, h_mask, w_mask = attn_module.separate_positions(mask, patch_grid, patch_grid)
+             offset = 0
+             qd = self._rotate_queries_or_keys(q[..., offset : offset + attn_module.d_dim], pos=d_mask)
+             kd = self._rotate_queries_or_keys(k[..., offset : offset + attn_module.d_dim], pos=d_mask)
+             offset += attn_module.d_dim
+             qh = self._rotate_queries_or_keys(q[..., offset : offset + attn_module.h_dim], pos=h_mask)
+             kh = self._rotate_queries_or_keys(k[..., offset : offset + attn_module.h_dim], pos=h_mask)
+             offset += attn_module.h_dim
+             qw = self._rotate_queries_or_keys(q[..., offset : offset + attn_module.w_dim], pos=w_mask)
+             kw = self._rotate_queries_or_keys(k[..., offset : offset + attn_module.w_dim], pos=w_mask)
+             offset += attn_module.w_dim
+             q_parts = [qd, qh, qw]
+             k_parts = [kd, kh, kw]
+             if offset < attn_module.head_dim:
+                 q_parts.append(q[..., offset:])
+                 k_parts.append(k[..., offset:])
+             q = torch.cat(q_parts, dim=-1)
+             k = torch.cat(k_parts, dim=-1)
+
+         attn = ((q @ k.transpose(-2, -1)) * attn_module.scale).softmax(dim=-1)
+         head = clamp_index(head_index, attn.shape[1])
+         group_size = patch_grid * patch_grid
+         group_index = clamp_index(temporal_group_index, temporal_groups)
+         start = group_index * group_size
+         end = start + group_size
+         group_attention = attn[0, head, start:end, start:end].mean(dim=0)
+         heatmap = group_attention.view(patch_grid, patch_grid).detach().cpu().numpy()
+         return cv2.resize(heatmap, (output_shape[1], output_shape[0]), interpolation=cv2.INTER_CUBIC)
+
      def unload(self):
          if self.encoder is not None:
              self.encoder.to("cpu")

          self.decoder = None
          self._frame_buffer = []
          self._feature_buffer = []
+         self._attention_recorder.remove()
+         self._attention_layer_index = None
          self.available = False
          if torch.cuda.is_available():
              torch.cuda.empty_cache()

      @torch.inference_mode()
+     def predict(self, rgb_image: np.ndarray, explainability: dict | None = None):
          if not self.available:
              raise RuntimeError("V-JEPA2 predictor is not available")

+         explain_enabled = bool(explainability and explainability.get("enabled"))
+         encoder_layer = clamp_index(
+             explainability.get("encoder_layer") if explainability else None,
+             self._explainability_spec.encoder_layer_count,
+         )
+         encoder_head = clamp_index(
+             explainability.get("encoder_head") if explainability else None,
+             self._explainability_spec.encoder_head_count,
+         )
+         if explain_enabled:
+             self._ensure_attention_hook(encoder_layer)
+             self._attention_recorder.clear()
+
          frame = np.ascontiguousarray(rgb_image, dtype=np.uint8)
          self._frame_buffer.append(frame)
          if len(self._frame_buffer) > self._clip_frames:

          confidence = float(np.max(pred_np))
          phase_idx = int(np.argmax(pred_np))
          phase = self.label_dict.get(phase_idx, "idle")
+         result = {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": available_frames}
+         if explain_enabled:
+             latest_group_index = latest_feature_idx // self._tubelet_size
+             display_image = resize_rgb_image(frame, (self._crop_size, self._crop_size))
+             encoder_heatmap = self._compute_encoder_attention_map(
+                 head_index=encoder_head,
+                 temporal_group_index=latest_group_index,
+                 output_shape=display_image.shape[:2],
+             )
+             decoder_proxy_values = [feature.abs().mean().item() for feature in self._feature_buffer]
+             result["explainability"] = _build_explainability_payload(
+                 display_image=display_image,
+                 encoder_heatmap=encoder_heatmap,
+                 encoder_kind="attention",
+                 encoder_label=self._explainability_spec.encoder_label,
+                 decoder_values=decoder_proxy_values,
+                 decoder_kind="proxy",
+                 decoder_label=self._explainability_spec.decoder_label,
+                 active_decoder_index=available_frames - 1,
+                 encoder_layer=encoder_layer,
+                 encoder_head=encoder_head,
+                 notes="Decoder view is a proxy feature-energy strip because the V-JEPA2 classifier head is an MLP.",
+             )
+         return result

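The V-JEPA2 path pairs the attention heatmap with a proxy decoder strip: one scalar per buffered clip feature (`feature.abs().mean()`), handed to `render_temporal_strip` from `explainability.py`, which this diff does not show. A small sketch of the likely normalization, under the assumption that the strip is a min-max scaled sequence with the active frame flagged — `temporal_strip` is a hypothetical stand-in, not the project's renderer:

```python
def temporal_strip(values, active_index=None):
    """Normalize per-frame decoder scores to [0, 1] for a strip visualization."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # flat inputs map to all-zero rather than dividing by zero
    norm = [0.0 if span == 0 else (v - lo) / span for v in values]
    flags = [i == active_index for i in range(len(values))]
    return norm, flags

norm, flags = temporal_strip([0.2, 0.8, 0.5], active_index=2)
```

Normalizing per call keeps the strip comparable frame-to-frame even though raw feature magnitudes drift as the clip buffer fills.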

  def create_predictor(model_key: str, model_dir: str | None = None, device: str | None = None):
runtime-requirements.txt CHANGED
@@ -1,5 +1,5 @@
  --extra-index-url https://download.pytorch.org/whl/cu121
- streamlit>=1.40,<2
+ streamlit>=1.55,<2
  torch==2.5.1
  torchvision==0.20.1
  numpy>=1.26,<3
scripts/publish_model_repo.py ADDED
@@ -0,0 +1,156 @@
+ from __future__ import annotations
+
+ import argparse
+ import os
+ import shutil
+ import tempfile
+ from pathlib import Path
+
+ from huggingface_hub import HfApi
+
+ import sys
+
+ SCRIPT_PATH = Path(__file__).resolve()
+ SPACE_ROOT = SCRIPT_PATH.parents[1]
+ if str(SPACE_ROOT) not in sys.path:
+     sys.path.insert(0, str(SPACE_ROOT))
+
+ from model_registry import MODEL_SPECS
+
+
+ ENV_VAR_BY_FAMILY = {
+     "aiendo": "AIENDO_MODEL_REPO_ID",
+     "dinov2": "DINO_MODEL_REPO_ID",
+     "vjepa2": "VJEPA2_MODEL_REPO_ID",
+ }
+
+
+ def _render_model_card(*, family: str, repo_id: str, copied_files: list[str]) -> str:
+     spec = MODEL_SPECS[family]
+     file_list = "\n".join(f"- `{name}`" for name in copied_files)
+     return f"""---
+ tags:
+ - medical-imaging
+ - endoscopy
+ - surgical-phase-recognition
+ - {family}
+ ---
+
+ # {spec.label} checkpoints for the AI-Endo Hugging Face Space
+
+ This repository stores the published checkpoint set for the **{spec.label}** phase-recognition path used by `hf_spaces/DINO-ENDO/`.
+
+ ## Files
+
+ {file_list}
+
+ ## Consumed by the Space
+
+ Set the following Space environment variable so the Streamlit Space can download these files lazily at runtime:
+
+ ```text
+ {ENV_VAR_BY_FAMILY[family]}={repo_id}
+ ```
+ """
+
+
+ def _stage_model_family(*, family: str, model_dir: Path, staging_dir: Path, repo_id: str) -> int:
+     spec = MODEL_SPECS[family]
+     copied_files: list[str] = []
+     total_bytes = 0
+
+     for filename in spec.required_files:
+         src = model_dir / filename
+         if not src.exists():
+             raise FileNotFoundError(f"Missing required checkpoint: {src}")
+         dst = staging_dir / filename
+         shutil.copy2(src, dst)
+         copied_files.append(filename)
+         total_bytes += src.stat().st_size
+
+     for filename in spec.optional_files:
+         src = model_dir / filename
+         if not src.exists():
+             continue
+         dst = staging_dir / filename
+         shutil.copy2(src, dst)
+         copied_files.append(filename)
+         total_bytes += src.stat().st_size
+
+     (staging_dir / "README.md").write_text(
+         _render_model_card(family=family, repo_id=repo_id, copied_files=copied_files),
+         encoding="utf-8",
+     )
+     return total_bytes
+
+
+ def _should_use_large_upload(mode: str, total_bytes: int) -> bool:
+     if mode == "always":
+         return True
+     if mode == "never":
+         return False
+     return total_bytes >= 2 * 1024 * 1024 * 1024
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(description="Publish a model-family checkpoint repo for the HF Space.")
+     parser.add_argument("--family", choices=sorted(MODEL_SPECS), required=True)
+     parser.add_argument("--repo-id", required=True, help="Target Hugging Face model repo ID.")
+     parser.add_argument(
+         "--model-dir",
+         default=str(SPACE_ROOT / "model"),
+         help="Directory containing the local checkpoints to publish.",
+     )
+     parser.add_argument(
+         "--upload-mode",
+         choices=("auto", "never", "always"),
+         default="auto",
+         help="Choose whether to force upload_large_folder for this family.",
+     )
+     parser.add_argument("--revision", default=None, help="Optional target revision or branch.")
+     parser.add_argument("--private", action="store_true", help="Create the model repo as private.")
+     parser.add_argument(
+         "--token-env",
+         default="HF_TOKEN",
+         help="Environment variable name containing the Hugging Face write token.",
+     )
+     return parser.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     model_dir = Path(args.model_dir).expanduser().resolve()
+     token = os.getenv(args.token_env) or None
+     api = HfApi(token=token)
+     api.create_repo(repo_id=args.repo_id, repo_type="model", private=args.private, exist_ok=True)
+
+     with tempfile.TemporaryDirectory(prefix=f"hf-space-{args.family}-") as temp_dir:
+         staging_dir = Path(temp_dir)
+         total_bytes = _stage_model_family(
+             family=args.family,
+             model_dir=model_dir,
+             staging_dir=staging_dir,
+             repo_id=args.repo_id,
+         )
+
+         upload_kwargs = {
+             "repo_id": args.repo_id,
+             "repo_type": "model",
+             "folder_path": str(staging_dir),
+         }
+         if args.revision:
+             upload_kwargs["revision"] = args.revision
+
+         if _should_use_large_upload(args.upload_mode, total_bytes):
+             api.upload_large_folder(**upload_kwargs)
+             mode = "upload_large_folder"
+         else:
+             api.upload_folder(**upload_kwargs)
+             mode = "upload_folder"
+
+     print(f"Published {args.family} checkpoints to {args.repo_id} via {mode}")
+     print(f"Suggested Space variable: {ENV_VAR_BY_FAMILY[args.family]}={args.repo_id}")
+
+
+ if __name__ == "__main__":
+     main()
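`_should_use_large_upload` above switches from `upload_folder` to `upload_large_folder` once the staged checkpoints reach 2 GiB in `auto` mode. A standalone sketch of that decision rule for quick experimentation — it mirrors the function in the diff, with a `GIB` constant added here for readability:

```python
GIB = 1024 ** 3  # one gibibyte in bytes

def should_use_large_upload(mode: str, total_bytes: int) -> bool:
    # "always"/"never" are explicit overrides; "auto" applies the 2 GiB threshold
    if mode == "always":
        return True
    if mode == "never":
        return False
    return total_bytes >= 2 * GIB

print(should_use_large_upload("auto", 3 * GIB))  # True
```

The threshold matters because `upload_large_folder` chunks and resumes uploads, which is the safer path for multi-gigabyte encoder checkpoints, while `upload_folder` is a single commit and faster for small decoder heads.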
scripts/publish_space_repo.py ADDED
@@ -0,0 +1,98 @@
+ from __future__ import annotations
+
+ import argparse
+ import os
+ import tempfile
+ from pathlib import Path
+
+ from huggingface_hub import HfApi
+
+ import sys
+
+ SCRIPT_PATH = Path(__file__).resolve()
+ SPACE_ROOT = SCRIPT_PATH.parents[1]
+ if str(SPACE_ROOT) not in sys.path:
+     sys.path.insert(0, str(SPACE_ROOT))
+
+ from stage_space_bundle import stage_bundle
+
+
+ def _space_variables(args: argparse.Namespace) -> dict[str, str]:
+     variables = {
+         "SPACE_ENABLED_MODELS": args.enabled_models,
+         "SPACE_DEFAULT_MODEL": args.default_model,
+     }
+     if args.aiendo_model_repo_id:
+         variables["AIENDO_MODEL_REPO_ID"] = args.aiendo_model_repo_id
+     if args.dino_model_repo_id:
+         variables["DINO_MODEL_REPO_ID"] = args.dino_model_repo_id
+     if args.vjepa2_model_repo_id:
+         variables["VJEPA2_MODEL_REPO_ID"] = args.vjepa2_model_repo_id
+     return variables
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(description="Publish the staged Docker Space bundle and set its variables.")
+     parser.add_argument("--repo-id", required=True, help="Target Hugging Face Space repo ID.")
+     parser.add_argument(
+         "--bundle-dir",
+         default=None,
+         help="Optional pre-staged bundle directory. If omitted, a temporary bundle is staged automatically.",
+     )
+     parser.add_argument("--enabled-models", default="dinov2,aiendo,vjepa2")
+     parser.add_argument("--default-model", default="dinov2")
+     parser.add_argument("--aiendo-model-repo-id", default=None)
+     parser.add_argument("--dino-model-repo-id", default=None)
+     parser.add_argument("--vjepa2-model-repo-id", default=None)
+     parser.add_argument("--revision", default=None, help="Optional target revision or branch.")
+     parser.add_argument("--private", action="store_true", help="Create the Space repo as private.")
+     parser.add_argument(
+         "--token-env",
+         default="HF_TOKEN",
+         help="Environment variable name containing the Hugging Face write token.",
+     )
+     return parser.parse_args()
+
+
+ def _publish_bundle(api: HfApi, *, repo_id: str, bundle_dir: Path, revision: str | None) -> None:
+     upload_kwargs = {
+         "repo_id": repo_id,
+         "repo_type": "space",
+         "folder_path": str(bundle_dir),
+     }
+     if revision:
+         upload_kwargs["revision"] = revision
+     api.upload_folder(**upload_kwargs)
+
+
+ def main() -> None:
+     args = parse_args()
+     token = os.getenv(args.token_env) or None
+     api = HfApi(token=token)
+     api.create_repo(repo_id=args.repo_id, repo_type="space", space_sdk="docker", private=args.private, exist_ok=True)
+
+     if args.bundle_dir:
+         bundle_dir = Path(args.bundle_dir).expanduser().resolve()
+         if not bundle_dir.exists():
+             raise FileNotFoundError(f"Bundle directory not found: {bundle_dir}")
+         _publish_bundle(api, repo_id=args.repo_id, bundle_dir=bundle_dir, revision=args.revision)
+     else:
+         with tempfile.TemporaryDirectory(prefix="hf-space-bundle-") as temp_dir:
+             bundle_dir = stage_bundle(SPACE_ROOT, Path(temp_dir), overwrite=True)
+             _publish_bundle(api, repo_id=args.repo_id, bundle_dir=bundle_dir, revision=args.revision)
+
+     for key, value in _space_variables(args).items():
+         api.add_space_variable(
+             repo_id=args.repo_id,
+             key=key,
+             value=value,
+             description=f"Managed by publish_space_repo.py for {key}",
+         )
+
+     print(f"Published Space bundle to {args.repo_id}")
+     for key, value in _space_variables(args).items():
+         print(f"{key}={value}")
+
+
+ if __name__ == "__main__":
+     main()
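The `_space_variables` helper in `publish_space_repo.py` only emits a per-family repo ID when one was passed on the command line. A minimal standalone re-statement, using `types.SimpleNamespace` in place of the parsed args (the repo ID below is a made-up example):

```python
from types import SimpleNamespace


def space_variables(args) -> dict[str, str]:
    # Mirrors _space_variables: always include the two required Space
    # settings, then add only the model repo IDs that were provided.
    variables = {
        "SPACE_ENABLED_MODELS": args.enabled_models,
        "SPACE_DEFAULT_MODEL": args.default_model,
    }
    if args.aiendo_model_repo_id:
        variables["AIENDO_MODEL_REPO_ID"] = args.aiendo_model_repo_id
    if args.dino_model_repo_id:
        variables["DINO_MODEL_REPO_ID"] = args.dino_model_repo_id
    if args.vjepa2_model_repo_id:
        variables["VJEPA2_MODEL_REPO_ID"] = args.vjepa2_model_repo_id
    return variables


args = SimpleNamespace(
    enabled_models="dinov2,aiendo,vjepa2",
    default_model="dinov2",
    aiendo_model_repo_id=None,
    dino_model_repo_id="user/dino-endo-models",  # hypothetical repo ID
    vjepa2_model_repo_id=None,
)
```

Omitted repo IDs simply never appear as Space variables, so the Space falls back to whatever defaults the app defines.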
scripts/stage_space_bundle.py ADDED
@@ -0,0 +1,104 @@
+ from __future__ import annotations
+
+ import argparse
+ import shutil
+ from pathlib import Path
+
+
+ ROOT_FILES = (
+     ".dockerignore",
+     ".gitattributes",
+     ".gitignore",
+     "Dockerfile",
+     "README.md",
+     "app.py",
+     "explainability.py",
+     "model_manager.py",
+     "model_registry.py",
+     "predictor.py",
+     "requirements.txt",
+     "runtime-requirements.txt",
+     "video_utils.py",
+ )
+
+ ROOT_DIRS = (
+     ".streamlit",
+     "dinov2",
+     "model",
+     "scripts",
+     "vjepa2",
+ )
+
+ IGNORE_PATTERNS = (
+     ".git",
+     ".cache",
+     "__pycache__",
+     ".pytest_cache",
+     ".mypy_cache",
+     "*.egg-info",
+     "*.ipynb",
+     "*.pt",
+     "*.pth",
+     "*.pyc",
+     "*.pyo",
+     "assets",
+     "notebooks",
+     "tests",
+ )
+
+
+ def _copy_item(src: Path, dst: Path) -> None:
+     if not src.exists():
+         raise FileNotFoundError(f"Missing required Space item: {src}")
+
+     if src.is_dir():
+         shutil.copytree(src, dst, ignore=shutil.ignore_patterns(*IGNORE_PATTERNS))
+     else:
+         dst.parent.mkdir(parents=True, exist_ok=True)
+         shutil.copy2(src, dst)
+
+
+ def stage_bundle(space_root: Path, output_dir: Path, overwrite: bool) -> Path:
+     if output_dir.exists():
+         if not overwrite:
+             raise FileExistsError(f"Destination already exists: {output_dir}")
+         shutil.rmtree(output_dir)
+
+     output_dir.mkdir(parents=True, exist_ok=True)
+
+     for name in ROOT_FILES:
+         _copy_item(space_root / name, output_dir / name)
+     for name in ROOT_DIRS:
+         _copy_item(space_root / name, output_dir / name)
+
+     return output_dir
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(
+         description="Stage a code-only Hugging Face Space bundle from the local DINO-ENDO scaffold."
+     )
+     parser.add_argument(
+         "--output-dir",
+         default="/tmp/dino_space_minimal_upload",
+         help="Destination directory for the staged bundle.",
+     )
+     parser.add_argument(
+         "--overwrite",
+         action="store_true",
+         help="Replace the destination directory if it already exists.",
+     )
+     return parser.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     script_path = Path(__file__).resolve()
+     space_root = script_path.parents[1]
+     output_dir = Path(args.output_dir).expanduser().resolve()
+     staged_dir = stage_bundle(space_root, output_dir, overwrite=args.overwrite)
+     print(f"Staged Space bundle at {staged_dir}")
+
+
+ if __name__ == "__main__":
+     main()
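`_copy_item` keeps caches and checkpoint weights out of the bundle by passing `shutil.ignore_patterns` to `copytree`. A quick sketch of how that filter behaves on a sample directory listing (the filter is called per directory with the entry names and returns the names to skip):

```python
import shutil

# Subset of the IGNORE_PATTERNS tuple above, enough to show the behavior.
ignore = shutil.ignore_patterns("__pycache__", "*.pt", "*.pth", "*.pyc")

# copytree invokes the callable as ignore(directory, entry_names) and
# omits every returned name from the copy.
names = ["app.py", "weights.pt", "__pycache__", "notes.md"]
skipped = ignore("some/dir", names)
print(sorted(skipped))  # ['__pycache__', 'weights.pt']
```

Because matching is per-name with `fnmatch`, directory entries like `__pycache__` and glob matches like `*.pt` are both pruned wherever they appear in the tree.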
scripts/stage_vendor_sources.py ADDED
@@ -0,0 +1,38 @@
+ from __future__ import annotations
+
+ import argparse
+ import shutil
+ from pathlib import Path
+
+
+ def copy_tree(src: Path, dst: Path, overwrite: bool) -> None:
+     if not src.exists():
+         raise FileNotFoundError(f"Source directory not found: {src}")
+     if dst.exists():
+         if not overwrite:
+             print(f"Skipping existing {dst}")
+             return
+         shutil.rmtree(dst)
+     shutil.copytree(
+         src,
+         dst,
+         ignore=shutil.ignore_patterns('.git', '__pycache__', '.pytest_cache', '.mypy_cache', '*.pyc', '*.pyo'),
+     )
+     print(f"Copied {src} -> {dst}")
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description='Copy vendored dinov2/ and vjepa2/ source trees into the Space folder.')
+     parser.add_argument('--overwrite', action='store_true', help='Replace existing destination directories.')
+     args = parser.parse_args()
+
+     script_path = Path(__file__).resolve()
+     space_root = script_path.parents[1]
+     repo_root = script_path.parents[3]
+
+     copy_tree(repo_root / 'dinov2', space_root / 'dinov2', overwrite=args.overwrite)
+     copy_tree(repo_root / 'vjepa2', space_root / 'vjepa2', overwrite=args.overwrite)
+
+
+ if __name__ == '__main__':
+     main()
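`stage_vendor_sources.py` derives both roots from its own location with `Path.parents`: one level above `scripts/` is the Space folder, three levels up is the outer repository that holds the vendored `dinov2/` and `vjepa2/` trees. A sketch of that indexing on an illustrative (made-up) layout:

```python
from pathlib import PurePosixPath

# Illustrative path only; the real layout is whatever repo the script lives in.
script_path = PurePosixPath("/repo/projects/space/scripts/stage_vendor_sources.py")

space_root = script_path.parents[1]  # parents[0] is scripts/, parents[1] the Space folder
repo_root = script_path.parents[3]   # three levels above scripts/: the outer repo

print(space_root)  # /repo/projects/space
print(repo_root)   # /repo
```

Note `parents[0]` is the containing directory, so `parents[3]` assumes the Space folder is nested two directories deep inside the outer repo; moving the script breaks that assumption.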
vjepa2/.flake8 ADDED
@@ -0,0 +1,5 @@
+ [flake8]
+ max-line-length = 119
+ select = E,F,W
+ ignore = E203,E701,W503
+ per-file-ignores=__init__.py:F401 version.py:F401
vjepa2/.github/workflows/base_tests.yaml ADDED
@@ -0,0 +1,29 @@
+ name: UnitTests
+
+ on: [push]
+
+ jobs:
+   unittests:
+     runs-on: ubuntu-latest
+     strategy:
+       max-parallel: 4
+
+     steps:
+       - uses: actions/checkout@v4
+       - name: Set up Python 3.12
+         uses: actions/setup-python@v5
+         with:
+           python-version: '3.12'
+       - name: Add conda to system path
+         run: |
+           # $CONDA is an environment variable pointing to the root of the miniconda directory
+           echo $CONDA/bin >> $GITHUB_PATH
+       - name: Install dependencies
+         run: |
+           conda create --name test-env python=3.12
+           conda install pytest
+           echo "Starting setup from $PWD"
+           pip install -e .
+       - name: Test with pytest
+         run: |
+           pytest tests
vjepa2/.github/workflows/linters.yaml ADDED
@@ -0,0 +1,48 @@
+ name: Lint (Common Code)
+
+ on:
+   push:
+     branches:
+       - master
+     paths:
+       - 'app/'
+       - 'evals/*.py'
+       - 'src/'
+       - 'tests/'
+   pull_request:
+     branches:
+       - master
+       - 'gh/**'
+     paths:
+       - 'app/'
+       - 'evals/*.py'
+       - 'src/'
+       - 'tests/'
+
+ jobs:
+   run-linters:
+     name: Run linters
+     runs-on: ubuntu-latest
+
+     steps:
+       - uses: actions/checkout@v4
+       - name: Set up Python 3.12
+         uses: actions/setup-python@v5
+         with:
+           python-version: '3.12'
+       - name: Install Python lint dependencies
+         run: |
+           pip install -r requirements-test.txt
+       - name: Set lint paths
+         run: echo "lint_paths=app evals/*.py src tests" >> "$GITHUB_ENV"
+       - name: Run isort
+         run: |
+           python -m isort $lint_paths --check
+       - name: Run flake8
+         if: always()
+         run: |
+           python -m flake8 --config .flake8 --show-source --statistics $lint_paths
+       - name: Run black
+         if: always()
+         run: |
+           python -m black --check $lint_paths
vjepa2/.gitignore ADDED
@@ -0,0 +1,32 @@
+ *.pyc
+ .vscode/
+ .*.swp
+
+ run_vjepa_aws.py
+ run.py
+ main_distributed_video.py
+ main_video.py
+
+ app/vjepa/configs/temp_aws
+ app/main_dev.py
+ app/main_distributed_dev.py
+ evals/ava/alphaction/data
+
+ run_evals.py
+ run_evals_v2.py
+ run_pretrain.py
+
+ *.egg-info/
+ *.ipynb_checkpoints/
+
+ traces/
+ third_party/*
+
+ evals/simu_env_planning/local/
+ evals/simu_env_planning/docker2/
+ evals/simu_env_planning/docker/
+ app/vjepa_droid/local/
+ app/vjepa_droid_v2/local/
+ app/vjepa_droid_v3/local/
+ app/vjepa_droid_v4/local/
+ configs/local
vjepa2/APACHE-LICENSE ADDED
@@ -0,0 +1,201 @@
+                                  Apache License
+                            Version 2.0, January 2004
+                         http://www.apache.org/licenses/
+
+    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+    1. Definitions.
+
+       "License" shall mean the terms and conditions for use, reproduction,
+       and distribution as defined by Sections 1 through 9 of this document.
+
+       "Licensor" shall mean the copyright owner or entity authorized by
+       the copyright owner that is granting the License.
+
+       "Legal Entity" shall mean the union of the acting entity and all
+       other entities that control, are controlled by, or are under common
+       control with that entity. For the purposes of this definition,
+       "control" means (i) the power, direct or indirect, to cause the
+       direction or management of such entity, whether by contract or
+       otherwise, or (ii) ownership of fifty percent (50%) or more of the
+       outstanding shares, or (iii) beneficial ownership of such entity.
+
+       "You" (or "Your") shall mean an individual or Legal Entity
+       exercising permissions granted by this License.
+
+       "Source" form shall mean the preferred form for making modifications,
+       including but not limited to software source code, documentation
+       source, and configuration files.
+
+       "Object" form shall mean any form resulting from mechanical
+       transformation or translation of a Source form, including but
+       not limited to compiled object code, generated documentation,
+       and conversions to other media types.
+
+       "Work" shall mean the work of authorship, whether in Source or
+       Object form, made available under the License, as indicated by a
+       copyright notice that is included in or attached to the work
+       (an example is provided in the Appendix below).
+
+       "Derivative Works" shall mean any work, whether in Source or Object
+       form, that is based on (or derived from) the Work and for which the
+       editorial revisions, annotations, elaborations, or other modifications
+       represent, as a whole, an original work of authorship. For the purposes
+       of this License, Derivative Works shall not include works that remain
+       separable from, or merely link (or bind by name) to the interfaces of,
+       the Work and Derivative Works thereof.
+
+       "Contribution" shall mean any work of authorship, including
+       the original version of the Work and any modifications or additions
+       to that Work or Derivative Works thereof, that is intentionally
+       submitted to Licensor for inclusion in the Work by the copyright owner
+       or by an individual or Legal Entity authorized to submit on behalf of
+       the copyright owner. For the purposes of this definition, "submitted"
+       means any form of electronic, verbal, or written communication sent
+       to the Licensor or its representatives, including but not limited to
+       communication on electronic mailing lists, source code control systems,
+       and issue tracking systems that are managed by, or on behalf of, the
+       Licensor for the purpose of discussing and improving the Work, but
+       excluding communication that is conspicuously marked or otherwise
+       designated in writing by the copyright owner as "Not a Contribution."
+
+       "Contributor" shall mean Licensor and any individual or Legal Entity
+       on behalf of whom a Contribution has been received by Licensor and
+       subsequently incorporated within the Work.
+
+    2. Grant of Copyright License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       copyright license to reproduce, prepare Derivative Works of,
+       publicly display, publicly perform, sublicense, and distribute the
+       Work and such Derivative Works in Source or Object form.
+
+    3. Grant of Patent License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       (except as stated in this section) patent license to make, have made,
+       use, offer to sell, sell, import, and otherwise transfer the Work,
+       where such license applies only to those patent claims licensable
+       by such Contributor that are necessarily infringed by their
+       Contribution(s) alone or by combination of their Contribution(s)
+       with the Work to which such Contribution(s) was submitted. If You
+       institute patent litigation against any entity (including a
+       cross-claim or counterclaim in a lawsuit) alleging that the Work
+       or a Contribution incorporated within the Work constitutes direct
+       or contributory patent infringement, then any patent licenses
+       granted to You under this License for that Work shall terminate
+       as of the date such litigation is filed.
+
+    4. Redistribution. You may reproduce and distribute copies of the
+       Work or Derivative Works thereof in any medium, with or without
+       modifications, and in Source or Object form, provided that You
+       meet the following conditions:
+
+       (a) You must give any other recipients of the Work or
+           Derivative Works a copy of this License; and
+
+       (b) You must cause any modified files to carry prominent notices
+           stating that You changed the files; and
+
+       (c) You must retain, in the Source form of any Derivative Works
+           that You distribute, all copyright, patent, trademark, and
+           attribution notices from the Source form of the Work,
+           excluding those notices that do not pertain to any part of
+           the Derivative Works; and
+
+       (d) If the Work includes a "NOTICE" text file as part of its
+           distribution, then any Derivative Works that You distribute must
+           include a readable copy of the attribution notices contained
+           within such NOTICE file, excluding those notices that do not
+           pertain to any part of the Derivative Works, in at least one
+           of the following places: within a NOTICE text file distributed
+           as part of the Derivative Works; within the Source form or
+           documentation, if provided along with the Derivative Works; or,
+           within a display generated by the Derivative Works, if and
+           wherever such third-party notices normally appear. The contents
+           of the NOTICE file are for informational purposes only and
+           do not modify the License. You may add Your own attribution
+           notices within Derivative Works that You distribute, alongside
+           or as an addendum to the NOTICE text from the Work, provided
+           that such additional attribution notices cannot be construed
+           as modifying the License.
+
+       You may add Your own copyright statement to Your modifications and
+       may provide additional or different license terms and conditions
+       for use, reproduction, or distribution of Your modifications, or
+       for any such Derivative Works as a whole, provided Your use,
+       reproduction, and distribution of the Work otherwise complies with
+       the conditions stated in this License.
+
+    5. Submission of Contributions. Unless You explicitly state otherwise,
+       any Contribution intentionally submitted for inclusion in the Work
+       by You to the Licensor shall be under the terms and conditions of
+       this License, without any additional terms or conditions.
+       Notwithstanding the above, nothing herein shall supersede or modify
+       the terms of any separate license agreement you may have executed
+       with Licensor regarding such Contributions.
+
+    6. Trademarks. This License does not grant permission to use the trade
+       names, trademarks, service marks, or product names of the Licensor,
+       except as required for reasonable and customary use in describing the
+       origin of the Work and reproducing the content of the NOTICE file.
+
+    7. Disclaimer of Warranty. Unless required by applicable law or
+       agreed to in writing, Licensor provides the Work (and each
+       Contributor provides its Contributions) on an "AS IS" BASIS,
+       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+       implied, including, without limitation, any warranties or conditions
+       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+       PARTICULAR PURPOSE. You are solely responsible for determining the
+       appropriateness of using or redistributing the Work and assume any
+       risks associated with Your exercise of permissions under this License.
+
+    8. Limitation of Liability. In no event and under no legal theory,
+       whether in tort (including negligence), contract, or otherwise,
+       unless required by applicable law (such as deliberate and grossly
+       negligent acts) or agreed to in writing, shall any Contributor be
+       liable to You for damages, including any direct, indirect, special,
+       incidental, or consequential damages of any character arising as a
+       result of this License or out of the use or inability to use the
+       Work (including but not limited to damages for loss of goodwill,
+       work stoppage, computer failure or malfunction, or any and all
+       other commercial damages or losses), even if such Contributor
+       has been advised of the possibility of such damages.
+
+    9. Accepting Warranty or Additional Liability. While redistributing
+       the Work or Derivative Works thereof, You may choose to offer,
+       and charge a fee for, acceptance of support, warranty, indemnity,
+       or other liability obligations and/or rights consistent with this
+       License. However, in accepting such obligations, You may act only
+       on Your own behalf and on Your sole responsibility, not on behalf
+       of any other Contributor, and only if You agree to indemnify,
+       defend, and hold each Contributor harmless for any liability
+       incurred by, or claims asserted against, such Contributor by reason
+       of your accepting any such warranty or additional liability.
+
+    END OF TERMS AND CONDITIONS
+
+    APPENDIX: How to apply the Apache License to your work.
+
+       To apply the Apache License to your work, attach the following
+       boilerplate notice, with the fields enclosed by brackets "[]"
+       replaced with your own identifying information. (Don't include
+       the brackets!) The text should be enclosed in the appropriate
+       comment syntax for the file format. We also recommend that a
+       file or class name and description of purpose be included on the
+       same "printed page" as the copyright notice for easier
+       identification within third-party archives.
+
+    Copyright 2018-2021 William Falcon
+
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
vjepa2/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
+ # Changelog
+
+ ## [0.0.1] - 2025-06-05
+
+ Initial release of V-JEPA 2 codebase
vjepa2/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,80 @@
+ # Code of Conduct
+
+ ## Our Pledge
+
+ In the interest of fostering an open and welcoming environment, we as
+ contributors and maintainers pledge to make participation in our project and
+ our community a harassment-free experience for everyone, regardless of age, body
+ size, disability, ethnicity, sex characteristics, gender identity and expression,
+ level of experience, education, socio-economic status, nationality, personal
+ appearance, race, religion, or sexual identity and orientation.
+
+ ## Our Standards
+
+ Examples of behavior that contributes to creating a positive environment
+ include:
+
+ * Using welcoming and inclusive language
+ * Being respectful of differing viewpoints and experiences
+ * Gracefully accepting constructive criticism
+ * Focusing on what is best for the community
+ * Showing empathy towards other community members
+
+ Examples of unacceptable behavior by participants include:
+
+ * The use of sexualized language or imagery and unwelcome sexual attention or
+   advances
+ * Trolling, insulting/derogatory comments, and personal or political attacks
+ * Public or private harassment
+ * Publishing others' private information, such as a physical or electronic
+   address, without explicit permission
+ * Other conduct which could reasonably be considered inappropriate in a
+   professional setting
+
+ ## Our Responsibilities
+
+ Project maintainers are responsible for clarifying the standards of acceptable
+ behavior and are expected to take appropriate and fair corrective action in
+ response to any instances of unacceptable behavior.
+
+ Project maintainers have the right and responsibility to remove, edit, or
+ reject comments, commits, code, wiki edits, issues, and other contributions
+ that are not aligned to this Code of Conduct, or to ban temporarily or
+ permanently any contributor for other behaviors that they deem inappropriate,
+ threatening, offensive, or harmful.
+
+ ## Scope
+
+ This Code of Conduct applies within all project spaces, and it also applies when
+ an individual is representing the project or its community in public spaces.
+ Examples of representing a project or community include using an official
+ project e-mail address, posting via an official social media account, or acting
+ as an appointed representative at an online or offline event. Representation of
+ a project may be further defined and clarified by project maintainers.
+
+ This Code of Conduct also applies outside the project spaces when there is a
+ reasonable belief that an individual's behavior may have a negative impact on
+ the project or its community.
+
+ ## Enforcement
+
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
+ reported by contacting the project team at <opensource-conduct@fb.com>. All
+ complaints will be reviewed and investigated and will result in a response that
+ is deemed necessary and appropriate to the circumstances. The project team is
+ obligated to maintain confidentiality with regard to the reporter of an incident.
+ Further details of specific enforcement policies may be posted separately.
+
+ Project maintainers who do not follow or enforce the Code of Conduct in good
+ faith may face temporary or permanent repercussions as determined by other
+ members of the project's leadership.
+
+ ## Attribution
+
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+ available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+ [homepage]: https://www.contributor-covenant.org
+
+ For answers to common questions about this code of conduct, see
+ https://www.contributor-covenant.org/faq
vjepa2/CONTRIBUTING.md ADDED
@@ -0,0 +1,39 @@
+ # Contributing to V-JEPA 2
+ We want to make contributing to this project as easy and transparent as
+ possible.
+
+ ## Pull Requests
+ We welcome your pull requests.
+
+ 1. Fork the repo and create your branch from `main`.
+ 2. If you've added code that should be tested, add tests.
+ 3. If you've changed APIs, update the documentation.
+ 4. Ensure the test suite passes.
+ 5. Make sure your code is consistent with style guidance (below) and lints.
+ 6. If you haven't already, complete the Contributor License Agreement ("CLA").
+ 7. Add reviewer(s) for approval.
+
+ ## Contributor License Agreement ("CLA")
+ In order to accept your pull request, we need you to submit a CLA. You only need
+ to do this once to work on any of Facebook's open source projects.
+
+ Complete your CLA here: <https://code.facebook.com/cla>
+
+ ## Issues
+ We use GitHub issues to track public bugs. Please ensure your description is
+ clear and has sufficient instructions to be able to reproduce the issue.
+
+ Meta has a [bounty program](https://bugbounty.meta.com/) for the safe
+ disclosure of security bugs. In those cases, please go through the process
+ outlined on that page and do not file a public issue.
+
+ ## Coding Style
+ * 4 spaces for indentation rather than tabs
+ * 119 character line length
+ * PEP8 formatting
+
+ We recommend using `black`, `isort`, and `flake8` to format your code changes.
+
+ ## License
+ By contributing to this repository, you agree that your contributions will be licensed
+ under the LICENSE file in the root directory of this source tree.
vjepa2/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) Meta Platforms, Inc. and affiliates.
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
vjepa2/README.md ADDED
@@ -0,0 +1,450 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
2
+
3
+ ### [Meta FAIR](https://ai.meta.com/research/)
4
+
5
+ Mahmoud Assran∗, Adrien Bardes∗, David Fan∗, Quentin Garrido∗, Russell Howes∗, Mojtaba
6
+ Komeili∗, Matthew Muckley∗, Ammar Rizvi∗, Claire Roberts∗, Koustuv Sinha∗, Artem Zholus*,
7
+ Sergio Arnaud*, Abha Gejji*, Ada Martin*, Francois Robert Hogan*, Daniel Dugas*, Piotr
8
+ Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil
9
+ Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier*, Yann LeCun*, Michael
10
+ Rabbat*, Nicolas Ballas*
11
+
12
+ *Core Team
13
+
14
+ [[`Paper`](https://arxiv.org/abs/2506.09985)] [[`Blog`](https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks)] [[`BibTex`](#Citation)]
15
+
16
+ Official Pytorch codebase for V-JEPA 2 and V-JEPA 2-AC.
17
+
18
+ V-JEPA 2 is a self-supervised approach to training video encoders, using internet-scale video data, that attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
19
+
20
+ <p align="center">
21
+ <img src="assets/flowchart.png" width=100%>
22
+ </p>
23
+
24
+ <!---
25
+ ## Updates
26
+
27
+ * **[Jun-6-25]:** V-JEPA 2 is released. [[`Blog`](https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks)]
28
+ --->
29
+
30
+ ## V-JEPA 2 Pre-training
31
+
32
+ **(Top)** The encoder and predictor are pre-trained through self-supervised learning from video using a masked latent feature prediction objective, leveraging abundant natural videos to bootstrap physical world understanding and prediction. **(Bottom)** Performance of V-JEPA 2 on downstream understanding and prediction tasks.
33
+
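The masked latent feature prediction objective can be sketched in a few lines. Everything below is an illustrative toy, not the training code: random linear maps stand in for the online encoder, the EMA target encoder, and the transformer predictor, and a pooled context vector replaces the real per-token prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D = 32, 16  # toy token dim and latent dim

# Random linear maps stand in for the networks.
W_enc = rng.standard_normal((D_in, D)) / np.sqrt(D_in)   # online encoder
W_tgt = W_enc + 0.01 * rng.standard_normal((D_in, D))    # EMA target encoder
W_pred = rng.standard_normal((D, D)) / np.sqrt(D)        # predictor

tokens = rng.standard_normal((64, D_in))  # 64 video patch tokens
mask = rng.random(64) < 0.75              # hide ~75% of the tokens

# The online encoder sees only the visible tokens; the target encoder sees all.
context = tokens[~mask] @ W_enc
target = tokens[mask] @ W_tgt             # no gradient flows here in practice

# Predict each masked token's feature from the pooled context (toy predictor).
pred = np.broadcast_to(context.mean(axis=0) @ W_pred, target.shape)

loss = np.mean((pred - target) ** 2)      # L2 regression on masked positions
print(round(float(loss), 4))
```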
34
+ <img align="left" src="https://github.com/user-attachments/assets/914942d8-6a1e-409d-86ff-ff856b7346ab" width=65%>&nbsp;
35
+ <table>
36
+ <tr>
37
+ <th colspan="1">Benchmark</th>
38
+ <th colspan="1">VJEPA 2</th>
39
+ <th colspan="1">Previous Best</th>
40
+ </tr>
41
+ <tr>
42
+ <td>EK100</td>
43
+ <td>39.7%</td>
44
+ <td>27.6% (PlausiVL)</td>
45
+ </tr>
46
+ <tr>
47
+ <td>SSv2 (Probe)</td>
48
+ <td>77.3%</td>
49
+ <td>69.7% (InternVideo2-1B)</td>
50
+ </tr>
51
+ <tr>
52
+ <td>Diving48 (Probe)</td>
53
+ <td>90.2%</td>
54
+ <td>86.4% (InternVideo2-1B)</td>
55
+ </tr>
56
+ <tr>
57
+ <td>MVP (Video QA)</td>
58
+ <td>44.5%</td>
59
+ <td>39.9% (InternVL-2.5)</td>
60
+ </tr>
61
+ <tr>
62
+ <td>TempCompass (Video QA)</td>
63
+ <td>76.9%</td>
64
+ <td>75.3% (Tarsier 2)</td>
65
+ </tr>
66
+ </table>
67
+
68
+ ## V-JEPA 2-AC Post-training
69
+
70
+ **(Top)** After post-training with a small amount of robot data, we can deploy the model on a robot arm in new environments, and tackle foundational tasks like reaching, grasping, and pick-and-place by planning from image goals. **(Bottom)** Performance on robot manipulation tasks using a Franka arm, with input provided through a monocular RGB camera.
71
+
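Planning from an image goal amounts to searching for an action that minimizes an energy: the distance between the latent predicted under a candidate action and the goal-image latent. The sketch below illustrates that loop with the cross-entropy method; the linear `predict` dynamics and all dimensions are toy stand-ins for the action-conditioned predictor, not the optimizer actually used for the robot results above.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A = 8, 4  # toy latent dim and action dim

# Stand-in for the action-conditioned predictor: linear latent dynamics.
M_z = 0.9 * np.eye(D)
M_a = 0.5 * rng.standard_normal((A, D))

def predict(z, a):
    """Predicted next latent for state z under action(s) a."""
    return z @ M_z + a @ M_a

z0 = rng.standard_normal(D)          # embedding of the current frame
a_true = rng.standard_normal(A)      # unknown action we hope to recover
z_goal = predict(z0, a_true)         # embedding of the goal image

# Cross-entropy method: sample actions, keep the lowest-energy elites,
# refit the sampling Gaussian, repeat.
mu, sigma = np.zeros(A), np.ones(A)
for _ in range(20):
    cand = mu + sigma * rng.standard_normal((256, A))
    energy = np.linalg.norm(predict(z0, cand) - z_goal, axis=1)
    elites = cand[np.argsort(energy)[:16]]
    mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6

# Residual energy of the planned action (should be near zero).
print(round(float(np.linalg.norm(predict(z0, mu) - z_goal)), 4))
```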
72
+ <img align="left" src="https://github.com/user-attachments/assets/c5d42221-0102-4216-911d-061a4369a805" width=65%>&nbsp;
73
+ <table>
74
+ <tr>
75
+ <th colspan="1"></th>
76
+ <th colspan="1"></th>
77
+ <th colspan="2">Grasp</th>
78
+ <th colspan="2">Pick-and-Place</th>
79
+ </tr>
80
+ <tr>
81
+ <th colspan="1">Method</th>
82
+ <th colspan="1">Reach</th>
83
+ <th colspan="1">Cup</th>
84
+ <th colspan="1">Box</th>
85
+ <th colspan="1">Cup</th>
86
+ <th colspan="1">Box</th>
87
+ </tr>
88
+ <tr>
89
+ <td>Octo</td>
90
+ <td>100%</td>
91
+ <td>10%</td>
92
+ <td>0%</td>
93
+ <td>10%</td>
94
+ <td>10%</td>
95
+ </tr>
96
+ <tr>
97
+ <td>Cosmos</td>
98
+ <td>80%</td>
99
+ <td>0%</td>
100
+ <td>20%</td>
101
+ <td>0%</td>
102
+ <td>0%</td>
103
+ </tr>
104
+ <tr>
105
+ <td>VJEPA 2-AC</td>
106
+ <td>100%</td>
107
+ <td>60%</td>
108
+ <td>20%</td>
109
+ <td>80%</td>
110
+ <td>50%</td>
111
+ </tr>
112
+ </table>
113
+
114
+ ## Models
115
+
116
+ ### V-JEPA 2
117
+
118
+ #### HuggingFace
119
+
120
+ See our [HuggingFace collection](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6) for V-JEPA 2.
121
+
122
+ #### Pretrained Checkpoints
123
+
124
+ <table>
125
+ <tr>
126
+ <th colspan="1">Model</th>
127
+ <th colspan="1">#Parameters</th>
128
+ <th colspan="1">Resolution</th>
129
+ <th colspan="1">Download Link</th>
130
+ <th colspan="1">Pretraining Config</th>
131
+ </tr>
132
+ <tr>
133
+ <td>ViT-L/16</td>
134
+ <td>300M</td>
135
+ <td>256</td>
136
+ <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vitl.pt">checkpoint</a></td>
137
+ <td><a href="configs/train/vitl16">configs</a></td>
138
+ </tr>
139
+ <tr>
140
+ <td>ViT-H/16</td>
141
+ <td>600M</td>
142
+ <td>256</td>
143
+ <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vith.pt">checkpoint</a></td>
144
+ <td><a href="configs/train/vith16/">configs</a></td>
145
+ </tr>
146
+ <tr>
147
+ <td>ViT-g/16</td>
148
+ <td>1B</td>
149
+ <td>256</td>
150
+ <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vitg.pt">checkpoint</a></td>
151
+ <td><a href="configs/train/vitg16">configs</a></td>
152
+ </tr>
153
+ <tr>
154
+ <td>ViT-g/16<sub>384</sub></td>
155
+ <td>1B</td>
156
+ <td>384</td>
157
+ <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vitg-384.pt">checkpoint</a></td>
158
+ <td><a href="configs/train/vitg16">configs</a></td>
159
+ </tr>
160
+ </table>
161
+
162
+ #### Pretrained backbones (via PyTorch Hub)
163
+
164
+ Please install [PyTorch](https://pytorch.org/get-started/locally/), [timm](https://pypi.org/project/timm/) and [einops](https://pypi.org/project/einops/) locally, then run the following to load each model. Installing PyTorch with CUDA support is strongly recommended.
165
+
166
+ ```python
167
+ import torch
168
+
169
+ # preprocessor
170
+ processor = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_preprocessor')
171
+ # models
172
+ vjepa2_vit_large = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_large')
173
+ vjepa2_vit_huge = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_huge')
174
+ vjepa2_vit_giant = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_giant')
175
+ vjepa2_vit_giant_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_giant_384')
176
+
177
+ ```
178
+
179
+ #### Pretrained checkpoints on Huggingface
180
+
181
+ You can also use our pretrained checkpoints on [Huggingface](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6).
182
+
183
+ ```python
184
+ from transformers import AutoVideoProcessor, AutoModel
185
+
186
+ hf_repo = "facebook/vjepa2-vitg-fpc64-256"
187
+ # facebook/vjepa2-vitl-fpc64-256
188
+ # facebook/vjepa2-vith-fpc64-256
189
+ # facebook/vjepa2-vitg-fpc64-256
190
+ # facebook/vjepa2-vitg-fpc64-384
191
+
192
+
193
+ model = AutoModel.from_pretrained(hf_repo)
194
+ processor = AutoVideoProcessor.from_pretrained(hf_repo)
195
+ ```
196
+
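As a sanity check on input sizing: assuming 16×16 spatial patches and a tubelet depth of 2 frames (values inferred from the model names above, e.g. `fpc64-256`; verify them against the configs), the patch-token count per clip works out as follows.

```python
def num_tokens(frames: int, size: int, patch: int = 16, tubelet: int = 2) -> int:
    """Patch-token count for a `frames`-frame clip at `size`x`size` resolution."""
    return (frames // tubelet) * (size // patch) ** 2

print(num_tokens(64, 256))  # 32 temporal slots x 256 spatial patches = 8192
print(num_tokens(64, 384))  # 32 temporal slots x 576 spatial patches = 18432
```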
197
+ #### Evaluation Attentive Probes
198
+
199
+ We share the trained attentive probes for two of our visual understanding evals (Something-Something v2 and Diving48) and the action anticipation eval EPIC-KITCHENS-100.
200
+
201
+ <table>
202
+ <tr>
203
+ <th colspan="1">Model</th>
204
+ <th colspan="4">SSv2</th>
205
+ <th colspan="4">Diving48</th>
206
+ <th colspan="4">EK100</th>
207
+ </tr>
208
+ <tr>
209
+ <th colspan="1"></th>
210
+ <th colspan="1">Checkpoint</th>
211
+ <th colspan="1">Training Config</th>
212
+ <th colspan="1">Inference Config</th>
213
+ <th colspan="1">Result</th>
214
+ <th colspan="1">Checkpoint</th>
215
+ <th colspan="1">Training Config</th>
216
+ <th colspan="1">Inference Config</th>
217
+ <th colspan="1">Result</th>
218
+ <th colspan="1">Checkpoint</th>
219
+ <th colspan="1">Training Config</th>
220
+ <th colspan="1">Inference Config</th>
221
+ <th colspan="1">Result</th>
222
+ </tr>
223
+ <tr>
224
+ <td>ViT-L/16</td>
225
+ <td><a href="https://dl.fbaipublicfiles.com/vjepa2/evals/ssv2-vitl-16x2x3.pt">checkpoint</a></td>
226
+ <td><a href="configs/eval/vitl/ssv2.yaml">config</a></td>
227
+ <td><a href="configs/inference/vitl/ssv2.yaml">config</a></td>
228
+ <td>73.7%</td>
229
+ <td><a href="https://dl.fbaipublicfiles.com/vjepa2/evals/diving48-vitl-256.pt">checkpoint</a></td>
230
+ <td><a href="configs/eval/vitl/diving48.yaml">config</a></td>
231
+ <td><a href="configs/inference/vitl/diving48.yaml">config</a></td>
232
+ <td>89.0%</td>
233
+ <td><a href="https://dl.fbaipublicfiles.com/vjepa2/evals/ek100-vitl-256.pt">checkpoint</a></td>
234
+ <td><a href="configs/eval/vitl/ek100.yaml">config</a></td>
235
+ <td><a href="configs/inference/vitl/ek100.yaml">config</a></td>
236
+ <td>32.7 R@5</td>
237
+ </tr>
238
+ <tr>
239
+ <td>ViT-g/16<sub>384</sub></td>
240
+ <td><a href="https://dl.fbaipublicfiles.com/vjepa2/evals/ssv2-vitg-384-64x2x3.pt">checkpoint</a></td>
241
+ <td><a href="configs/eval/vitg-384/ssv2.yaml">config</a></td>
242
+ <td><a href="configs/inference/vitg-384/ssv2.yaml">config</a></td>
243
+ <td>77.3%</td>
244
+ <td><a href="https://dl.fbaipublicfiles.com/vjepa2/evals/diving48-vitg-384-32x4x3.pt">checkpoint</a></td>
245
+ <td><a href="configs/eval/vitg-384/diving48.yaml">config</a></td>
246
+ <td><a href="configs/inference/vitg-384/diving48.yaml">config</a></td>
247
+ <td>90.2%</td>
248
+ <td><a href="https://dl.fbaipublicfiles.com/vjepa2/evals/ek100-vitg-384.pt">checkpoint</a></td>
249
+ <td><a href="configs/eval/vitg-384/ek100.yaml">config</a></td>
250
+ <td><a href="configs/inference/vitg-384/ek100.yaml">config</a></td>
251
+ <td>39.7 R@5</td>
252
+ </tr>
253
+ </table>
254
+
255
+ ### V-JEPA 2-AC
256
+
257
+ Our action-conditioned checkpoint was trained from the ViT-g encoder.
258
+ <table>
259
+ <tr>
260
+ <th colspan="1">Model</th>
261
+ <th colspan="1">Download Link</th>
262
+ <th colspan="1">Training Config</th>
263
+ </tr>
264
+ <tr>
265
+ <td>ViT-g/16</td>
266
+ <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2-ac-vitg.pt">checkpoint</a></td>
267
+ <td><a href="configs/train/vitg16/droid-256px-8f.yaml">config</a></td>
268
+ </tr>
269
+ </table>
270
+
271
+ #### Pretrained action-conditioned backbone (via PyTorch Hub)
272
+
273
+ Please install [PyTorch](https://pytorch.org/get-started/locally/), [timm](https://pypi.org/project/timm/) and [einops](https://pypi.org/project/einops/) locally, then run the following to load each model. Installing PyTorch with CUDA support is strongly recommended.
274
+
275
+ ```python
276
+ import torch
277
+
278
+ vjepa2_encoder, vjepa2_ac_predictor = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_ac_vit_giant')
279
+ ```
280
+
281
+
282
+ See [energy_landscape_example.ipynb](notebooks/energy_landscape_example.ipynb) for an example notebook computing the energy landscape of the pretrained action-conditioned backbone using a robot trajectory collected from our lab.
283
+ To run this notebook, you'll need to additionally install [Jupyter](https://jupyter.org/install) and [Scipy](https://scipy.org/install/) in your conda environment.
284
+
285
+
286
+ ## Getting Started
287
+
288
+ ### Setup
289
+
290
+ ```
291
+ conda create -n vjepa2-312 python=3.12
292
+ conda activate vjepa2-312
293
+ pip install . # or `pip install -e .` for development mode
294
+ ```
295
+
296
+ **Note to macOS users:** V-JEPA 2 relies on [`decord`](https://github.com/dmlc/decord), which does not support macOS (and, unfortunately, is also no longer under development). In order to run the V-JEPA 2 code on macOS, you will need a different `decord` implementation. We do not make specific recommendations, although some users have reported the use of [`eva-decord`](https://github.com/georgia-tech-db/eva-decord) (see [PR 1](https://github.com/facebookresearch/vjepa2/pull/1)) or [`decord2`](https://github.com/johnnynunez/decord2) (see [PR 31](https://github.com/facebookresearch/vjepa2/pull/31)). We leave the selection of the `decord` package up to the user's discretion.
297
+
298
+ ### Usage Demo
299
+
300
+ See [vjepa2_demo.ipynb](notebooks/vjepa2_demo.ipynb) [(Colab Link)](https://colab.research.google.com/github/facebookresearch/vjepa2/blob/main/notebooks/vjepa2_demo.ipynb) or [vjepa2_demo.py](notebooks/vjepa2_demo.py) for an example of how to load both the HuggingFace and PyTorch V-JEPA 2 models and run inference on a sample video to get a sample classification result.
301
+
302
+ The script assumes the presence of downloaded model checkpoints, so you will need to download the model weights and update the corresponding paths in the script. E.g.:
303
+ ```
304
+ wget https://dl.fbaipublicfiles.com/vjepa2/vitg-384.pt -P YOUR_DIR
305
+ wget https://dl.fbaipublicfiles.com/vjepa2/evals/ssv2-vitg-384-64x2x3.pt -P YOUR_DIR
306
+
307
+ # Then update your model paths in vjepa2_demo.py.
308
+ pt_model_path = YOUR_DIR/vitg-384.pt
309
+ classifier_model_path = YOUR_DIR/ssv2-vitg-384-64x2x3.pt
310
+
311
+ # Then run the script (assumes your machine has a GPU)
312
+ python -m notebooks.vjepa2_demo
313
+ ```
314
+
315
+ ### Probe-based evaluation
316
+
317
+ Probe-based evaluation consists of training an attentive probe on top of frozen V-JEPA 2 features. We provide scripts for training your own probes, as well as checkpoints to run inference directly.
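An attentive probe pools the frozen patch features with a single learnable query via cross-attention, then applies a linear classifier. The forward pass can be sketched as below; the shapes and parameter initializations are illustrative stand-ins, not the repository's probe implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

D, N, C = 16, 128, 10                  # feature dim, #frozen tokens, #classes
feats = rng.standard_normal((N, D))    # frozen V-JEPA 2 patch features

# Probe parameters (these are trained; the backbone stays frozen).
query = rng.standard_normal(D)         # learnable pooling query
W_k = rng.standard_normal((D, D)) / np.sqrt(D)
W_v = rng.standard_normal((D, D)) / np.sqrt(D)
W_cls = rng.standard_normal((D, C)) / np.sqrt(D)

# Single-query cross-attention pools the N tokens into one vector.
attn = softmax((feats @ W_k) @ query / np.sqrt(D))   # (N,)
pooled = attn @ (feats @ W_v)                        # (D,)
logits = pooled @ W_cls                              # (C,)
print(logits.shape)
```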
318
+
319
+ #### Training probes
320
+
321
+ Evaluations can be run either locally or distributed via SLURM (running locally is useful for debugging and validation).
322
+ These sample commands launch Something-Something v2 video classification; other evals are launched by specifying the corresponding config.
323
+ Use the provided training configs under [Evaluation Attentive Probes](#evaluation-attentive-probes). These configs allow you to train multiple probes in parallel with various optimization parameters.
324
+ Change filepaths as needed (e.g. `folder`, `checkpoint`, `dataset_train`, `dataset_val`) to match locations of data and downloaded checkpoints on your local filesystem.
325
+ Change the number of nodes and the local batch size as needed so as not to exceed available GPU memory.
326
+
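If you script these filepath edits, a small helper can load a config, override the path fields, and write a local copy (PyYAML is already a dependency of this codebase). The key names below (`folder`, `checkpoint`) follow the text above, but real configs may nest some fields, so verify against the file you downloaded.

```python
import yaml  # PyYAML, already used by this codebase

def patch_config(src: str, dst: str, overrides: dict) -> dict:
    """Load a YAML config, override top-level fields, and save a local copy."""
    with open(src) as f:
        cfg = yaml.safe_load(f)
    cfg.update(overrides)  # NB: some path fields may be nested in real configs
    with open(dst, "w") as f:
        yaml.safe_dump(cfg, f)
    return cfg

# Demo on a minimal stand-in config (the real files live under configs/eval).
with open("ssv2-demo.yaml", "w") as f:
    f.write("folder: /tmp/out\ncheckpoint: /tmp/ckpt.pt\n")

cfg = patch_config(
    "ssv2-demo.yaml",
    "ssv2-local.yaml",
    {"folder": "/my/output/dir", "checkpoint": "/my/ckpts/vitl.pt"},
)
print(cfg["folder"])
```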
327
+ ##### Local
328
+
329
+ To run locally, specify the GPUs to use on the command line:
330
+ ```
331
+ python -m evals.main --fname configs/eval/vitl/ssv2.yaml \
332
+ --devices cuda:0 cuda:1
333
+ ```
334
+
335
+ ##### Distributed
336
+
337
+ ```
338
+ python -m evals.main_distributed \
339
+ --fname configs/eval/vitl/ssv2.yaml \
340
+ --time 8600 \
341
+ --account my_account --qos=my_qos
342
+ ```
343
+
344
+ #### Inference from existing probes
345
+
346
+ Use provided inference configs under [Evaluation Attentive Probes](#evaluation-attentive-probes).
347
+ Download the corresponding checkpoint, rename it to `latest.pt`, and place it in a folder whose path matches the variables in the config:
348
+ ```
349
+ [folder]/[eval_name]/[tag]/latest.pt
350
+ ```
351
+ Then run inference, locally or distributed, using the same evaluation commands as above, but with configs from `configs/inference`.
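Concretely, staging the downloaded ViT-L SSv2 probe could look like this (the directory names are placeholders; make them match the `folder`, eval name, and `tag` fields of your inference config):

```shell
# Placeholder layout: [folder]=probe-root, [eval_name]=ssv2, [tag]=vitl16.
mkdir -p probe-root/ssv2/vitl16
# After downloading the probe checkpoint, move it into place:
# wget https://dl.fbaipublicfiles.com/vjepa2/evals/ssv2-vitl-16x2x3.pt
# mv ssv2-vitl-16x2x3.pt probe-root/ssv2/vitl16/latest.pt
ls probe-root/ssv2
```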
352
+
353
+ ### Pretraining
354
+
355
+ Likewise, training can also be run locally or distributed. Pretraining and cooldown training phases are
356
+ run with the same command using different configs.
357
+ These sample commands launch initial training of a ViT-L model. Configs for cooldown (or action-conditioned) training
358
+ can be found in the same directory as the config for initial training.
359
+
360
+ #### Local
361
+
362
+ ```
363
+ python -m app.main --fname configs/train/vitl16/pretrain-256px-16f.yaml \
364
+ --devices cuda:0
365
+ ```
366
+
367
+ #### Distributed
368
+
369
+ ```
370
+ python -m app.main_distributed \
371
+ --fname configs/train/vitl16/pretrain-256px-16f.yaml \
372
+ --time 6000 \
373
+ --account my_account --qos=my_qos
374
+ ```
375
+
376
+ ### Post-training
377
+
378
+ Post-training of the action-conditioned model, starting from the pretrained V-JEPA 2 backbone, also follows a similar interface, and can be run locally or distributed using [this config](configs/train/vitg16/droid-256px-8f.yaml).
379
+ We post-train the model starting from the ViT-g/16 backbone.
380
+
381
+ #### Local
382
+
383
+ ```
384
+ python -m app.main --fname configs/train/vitg16/droid-256px-8f.yaml \
385
+ --devices cuda:0
386
+ ```
387
+
388
+ #### Distributed
389
+
390
+ ```
391
+ python -m app.main_distributed \
392
+ --fname configs/train/vitg16/droid-256px-8f.yaml
393
+ --time 6000
394
+ --account my_account --qos=my_qos
395
+ ```
396
+
397
+
398
+ ## Code Structure
399
+
400
+ ```
401
+ .
402
+ ├── app # training loops
403
+ │ ├── vjepa # video JEPA pre-training
404
+ │ ├── vjepa_droid # training the action-conditioned model
405
+ │ ├── main_distributed.py # entrypoint for launching the app on a SLURM cluster
406
+ │ └── main.py # entrypoint for launching the app locally on your machine
407
+ ├── configs # config files with experiment params for training and evaluation
408
+ │ ├── train # pretraining (phase 1), cooldown (phase 2), and action-conditioned training
409
+ │ └── eval # frozen evaluations
410
+ ├── evals # evaluation loops training an attentive probe with frozen backbone...
411
+ │ ├── action_anticipation_frozen # action anticipation
412
+ │ ├── image_classification_frozen # image understanding
413
+ │ ├── video_classification_frozen # video understanding
414
+ │ ├── main_distributed.py # entrypoint for distributed evaluations
415
+ │ └── main.py # entrypoint for locally-run evaluations
416
+ ├── src # the package
417
+ │ ├── datasets # datasets, data loaders, ...
418
+ │ ├── models # model definitions
419
+ │ ├── masks # mask collators, masking utilities, ...
420
+ │ └── utils # shared utilities
421
+ └── tests # unit tests for some modules in `src`
422
+
423
+ ```
424
+
425
+ ## License
426
+
427
+ The majority of V-JEPA 2 is licensed under MIT; however, portions of the project are available under separate license terms:
428
+
429
+ [src/datasets/utils/video/randaugment.py](src/datasets/utils/video/randaugment.py)<br>
430
+ [src/datasets/utils/video/randerase.py](src/datasets/utils/video/randerase.py)<br>
431
+ [src/datasets/utils/worker_init_fn.py](src/datasets/utils/worker_init_fn.py)<br>
432
+
433
+ are licensed under the Apache 2.0 license.
434
+
435
+
436
+ ## Citation
437
+ If you find this repository useful in your research, please consider giving it a star :star: and citing it:
438
+ ```bibtex
439
+ @article{assran2025vjepa2,
440
+ title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
441
+ author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and
442
+ Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and
443
+ Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and
444
+ Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and
445
+ Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and
446
+ Rabbat, Michael and Ballas, Nicolas},
447
+ journal={arXiv preprint arXiv:2506.09985},
448
+ year={2025}
449
+ }
450
+ ```
vjepa2/app/__init__.py ADDED
File without changes
vjepa2/app/main.py ADDED
@@ -0,0 +1,84 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import argparse
7
+ import multiprocessing as mp
8
+ import pprint
9
+ from pathlib import Path
10
+
11
+ import yaml
12
+
13
+ from app.scaffold import main as app_main
14
+ from src.utils.distributed import init_distributed
15
+
16
+ parser = argparse.ArgumentParser()
17
+ parser.add_argument("--fname", type=str, help="name of config file to load", default="configs.yaml")
18
+ parser.add_argument(
19
+ "--devices",
20
+ type=str,
21
+ nargs="+",
22
+ default=["cuda:0", "cuda:1", "cuda:2", "cuda:3", "cuda:4", "cuda:5", "cuda:6", "cuda:7"],
23
+ help="which devices to use on local machine",
24
+ )
25
+ parser.add_argument(
26
+ "--debugmode",
27
+ type=bool,
28
+ default=False,
29
+ help="Setting this to true will not spin up new processes. "
30
+ "The main code runs the main process, which makes it easier to \
31
+ debug with checkpointing.",
32
+ )
33
+
34
+
35
+ def process_main(rank, fname, world_size, devices):
36
+ import os
37
+
38
+ os.environ["CUDA_VISIBLE_DEVICES"] = str(devices[rank].split(":")[-1])
39
+
40
+ import logging
41
+
42
+ from src.utils.logging import get_logger
43
+
44
+ logger = get_logger(force=True)
45
+ if rank == 0:
46
+ logger.setLevel(logging.INFO)
47
+ else:
48
+ logger.setLevel(logging.ERROR)
49
+
50
+ logger.info(f"called-params {fname}")
51
+
52
+ # Load config
53
+ params = None
54
+ with open(fname, "r") as y_file:
55
+ params = yaml.load(y_file, Loader=yaml.FullLoader)
56
+ logger.info("loaded params...")
57
+
58
+ # Log config
59
+ if rank == 0:
60
+ pprint.PrettyPrinter(indent=4).pprint(params)
61
+ folder = params["folder"]
62
+ params_path = os.path.join(folder, "params-pretrain.yaml")
63
+ folder = Path(folder)
64
+ folder.mkdir(parents=True, exist_ok=True)
65
+ with open(params_path, "w") as f:
66
+ yaml.dump(params, f)
67
+
68
+ # Init distributed (access to comm between GPUS on same machine)
69
+ world_size, rank = init_distributed(rank_and_world_size=(rank, world_size))
70
+ logger.info(f"Running... (rank: {rank}/{world_size})")
71
+
72
+ # Launch the app with loaded config
73
+ app_main(params["app"], args=params)
74
+
75
+
76
+ if __name__ == "__main__":
77
+ args = parser.parse_args()
78
+ if args.debugmode:
79
+ process_main(rank=0, fname=args.fname, world_size=1, devices=["cuda:0"])
80
+ else:
81
+ num_gpus = len(args.devices)
82
+ mp.set_start_method("spawn")
83
+ for rank in range(num_gpus):
84
+ mp.Process(target=process_main, args=(rank, args.fname, num_gpus, args.devices)).start()
vjepa2/app/main_distributed.py ADDED
@@ -0,0 +1,269 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import argparse
7
+ import copy
8
+ import datetime
9
+ import os
10
+ import pprint
11
+ import shutil
12
+ from pathlib import Path
13
+
14
+ import submitit
15
+ import yaml
16
+
17
+ from app.scaffold import main as app_main
18
+ from src.utils.logging import get_logger, git_information
19
+
20
+ logger = get_logger(force=True)
21
+
22
+
23
+ parser = argparse.ArgumentParser()
24
+ parser.add_argument(
25
+ "--fname",
26
+ type=str,
27
+ help="yaml file containing config file names to launch",
28
+ default="configs.yaml",
29
+ )
30
+ parser.add_argument("--exclude", type=str, help="nodes to exclude from training", default=None)
31
+ parser.add_argument(
32
+ "--batch-launch",
33
+ action="store_true",
34
+ help="whether fname points to a file to batch-launch several config files",
35
+ )
36
+ parser.add_argument(
37
+ "--use_fname_as_folder",
38
+ action="store_true",
39
+ help="whether to append fname filename to folder",
40
+ )
41
+ parser.add_argument(
42
+ "--folder",
43
+ type=str,
44
+ default=None,
45
+ help="if specified, override 'folder' field in the .yaml with this",
46
+ )
47
+ parser.add_argument(
48
+ "--account",
49
+ type=str,
50
+ default="jepa",
51
+ help="Cluster account to use when submitting jobs",
52
+ )
53
+ parser.add_argument(
54
+ "--partition",
55
+ type=str,
56
+ default="learn",
57
+ help="Cluster partition to use when submitting jobs",
58
+ )
59
+ parser.add_argument(
60
+ "--qos",
61
+ type=str,
62
+ default=None,
63
+ help="If specified, cluster partition to use when submitting jobs",
64
+ )
65
+ parser.add_argument("--time", type=int, default=4300, help="time in minutes to run job")
66
+
67
+
68
+ class Trainer:
69
+ def __init__(self, args_pretrain, load_model=None):
70
+ self.app = args_pretrain["app"]
71
+ self.args_pretrain = args_pretrain
72
+ self.load_model = load_model
73
+
74
+ def __call__(self):
75
+ app = self.app
76
+ params = self.args_pretrain
77
+ load_model = self.load_model
78
+
79
+ logger.info("loaded pretrain params...")
80
+ pp = pprint.PrettyPrinter(indent=4)
81
+ pp.pprint(params)
82
+
83
+ # Launch app with loaded config
84
+ resume_preempt = False if load_model is None else load_model
85
+ app_main(app, args=params, resume_preempt=resume_preempt)
86
+
87
+ def checkpoint(self):
88
+ fb_trainer = Trainer(self.args_pretrain, True)
89
+ return submitit.helpers.DelayedSubmission(
90
+ fb_trainer,
91
+ )
92
+
93
+
94
+ def copy_code_folder(code_folder, ignore_patterns, ignore_paths):
95
+ path_to_node_folder = {}
96
+
97
+ for path in ignore_paths:
98
+ split_path = path.split("/")
99
+ base_path = "/".join(split_path[:-1])
100
+ node_folder = split_path[-1]
101
+ path_to_node_folder[base_path] = node_folder
102
+
103
+ def ignore_func(path, names):
104
+ ignore_list = ignore_patterns
105
+ if path in path_to_node_folder.keys():
106
+ ignore_list.append(path_to_node_folder[path])
107
+ return ignore_list
108
+
109
+ if not os.path.exists(code_folder):
110
+ shutil.copytree(".", code_folder, ignore=ignore_func)
111
+
112
+
113
+ def update_folder_with_timestamp(args_list):
114
+ new_args_list = copy.deepcopy(args_list)
115
+ for i, args in enumerate(args_list):
116
+ folder = args["folder"]
117
+ load_checkpoint = args["meta"].get("load_checkpoint", False) if "meta" in args else False
118
+ if not load_checkpoint and Path(folder).exists():
119
+ timestamp = datetime.datetime.now().strftime("%y_%m_%d_%H_%M_%S")
120
+ folder = folder.rstrip("/") + f"_{timestamp}"
121
+ logger.info(f"Folder already exists but `load_checkpoint` is False. Logging to new folder {folder}...")
122
+ new_args_list[i]["folder"] = folder
123
+ return new_args_list
124
+
125
+
126
+ def launch_app_with_parsed_args(
127
+ args_for_pretrain,
128
+ account,
129
+ partition,
130
+ qos,
131
+ mem_per_gpu="210G",
132
+ timeout=4300,
133
+ nodes=1,
134
+ tasks_per_node=1,
135
+ cpus_per_task=12,
136
+ exclude_nodes=None,
137
+ ):
138
+ args_for_pretrain = update_folder_with_timestamp(args_for_pretrain)
139
+ for ap in args_for_pretrain:
140
+ folder = ap["folder"]
141
+ Path(folder).mkdir(parents=True, exist_ok=True)
142
+ folder = args_for_pretrain[0]["folder"]
143
+
144
+ # -------------- Copy code --------------
145
+ code_folder = os.path.join(folder, "code")
146
+ ignore_patterns = [
147
+ "__pycache__",
148
+ ".vscode",
149
+ ".git",
150
+ "core",
151
+ ]
152
+ ignore_paths = [
153
+ "./evals/ava/alphaction/data",
154
+ "./demos",
155
+ "./traces",
156
+ ]
157
+ copy_code_folder(code_folder, ignore_patterns, ignore_paths)
158
+ os.chdir(code_folder)
159
+ # ---------------------------------------
160
+
161
+ # -------------- Save config file --------------
162
+ params_path = os.path.join(folder, "params-pretrain.yaml")
163
+ if not os.path.exists(params_path):
164
+ with open(params_path, "w") as f:
165
+ yaml.dump(args_for_pretrain, f)
166
+ # ----------------------------------------------
167
+
168
+ # -------------- Save git info file --------------
169
+ git_info_fpath = os.path.join(folder, "git-info.txt")
170
+ with open(git_info_fpath, "w") as f:
171
+ f.write(git_information())
172
+ # ----------------------------------------------
173
+
174
+ # -------------- SET JOB NAME --------------
175
+ folder_ = folder
176
+ if folder[-1] == "/":
177
+ folder_ = folder[:-1]
178
+ job_name = folder_.split("/")[-1]
179
+ # ------------------------------------------
180
+
181
+ executor = submitit.AutoExecutor(folder=os.path.join(folder, "job_%j"), slurm_max_num_timeout=20)
182
+ executor.update_parameters(
183
+ name=job_name,
184
+ slurm_partition=partition,
185
+ slurm_account=account,
186
+ slurm_qos=qos,
187
+ slurm_mem_per_gpu=mem_per_gpu,
188
+ timeout_min=timeout,
189
+ nodes=nodes,
190
+ tasks_per_node=tasks_per_node,
191
+ cpus_per_task=cpus_per_task,
192
+ gpus_per_node=tasks_per_node,
193
+ )
194
+
195
+ if exclude_nodes is not None:
196
+ executor.update_parameters(slurm_exclude=exclude_nodes)
197
+
198
+ jobs, trainers = [], []
199
+ with executor.batch():
200
+ for ap in args_for_pretrain:
201
+ # TODO Create sub folder and ap['folder']=subfolder
202
+ fb_trainer = Trainer(ap)
203
+ job = executor.submit(
204
+ fb_trainer,
205
+ )
206
+ trainers.append(fb_trainer)
207
+ jobs.append(job)
208
+
209
+ for job in jobs:
210
+ print(job.job_id)
211
+
212
+
213
+ def launch():
214
+ # ---------------------------------------------------------------------- #
215
+ # 1. Put config file names in a list
216
+ # ---------------------------------------------------------------------- #
217
+ config_fnames = [args.fname]
218
+
219
+ # -- If batch-launch is True, then the args.fname yaml file is not a
220
+ # -- config, but actually specifies a list of other config files
221
+ # -- to run in a slurm job array
222
+ if args.batch_launch:
223
+ with open(args.fname, "r") as y_file:
224
+ config_fnames = yaml.load(y_file, Loader=yaml.FullLoader)
225
+ # ---------------------------------------------------------------------- #
226
+
227
+ # ---------------------------------------------------------------------- #
228
+ # 2. Parse each yaml config file as a dict and place in list
229
+ # ---------------------------------------------------------------------- #
230
+ nodes, tasks_per_node = None, None
231
+ configs = []
232
+ for f in config_fnames:
233
+ with open(f, "r") as y_file:
234
+ _params = yaml.load(y_file, Loader=yaml.FullLoader)
235
+ if args.use_fname_as_folder:
236
+ assert not args.folder, "Don't specify --folder if adding fname to folder"
237
+ _params["folder"] = str(Path(_params["folder"]) / f.split("/")[-1].split(".yaml")[0])
238
+ elif args.folder:
239
+ _params["folder"] = args.folder
240
+ nodes = int(_params.get("nodes"))
241
+ tasks_per_node = int(_params.get("tasks_per_node"))
242
+ cpus_per_task = int(_params.get("cpus_per_task", 32))
243
+ mem_per_gpu = _params.get("mem_per_gpu", "210G")
244
+ configs += [_params]
245
+ logger.info(f"Loaded {len(configs)} config files")
246
+ logger.info(f"Running all jobs with {nodes=} / {tasks_per_node=}")
247
+ # ---------------------------------------------------------------------- #
248
+
249
+ # ---------------------------------------------------------------------- #
250
+ # 3. Launch evals with parsed config files
251
+ # ---------------------------------------------------------------------- #
252
+ launch_app_with_parsed_args(
253
+ args_for_pretrain=configs,
254
+ account=args.account,
255
+ partition=args.partition,
256
+ qos=args.qos,
257
+ mem_per_gpu=mem_per_gpu,
258
+ cpus_per_task=cpus_per_task,
259
+ timeout=args.time,
260
+ nodes=nodes,
261
+ tasks_per_node=tasks_per_node,
262
+ exclude_nodes=args.exclude,
263
+ )
264
+ # ---------------------------------------------------------------------- #
265
+
266
+
267
+ if __name__ == "__main__":
268
+ args = parser.parse_args()
269
+ launch()
vjepa2/app/scaffold.py ADDED
@@ -0,0 +1,17 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import importlib
7
+ import logging
8
+ import sys
9
+
10
+ logging.basicConfig(stream=sys.stdout, level=logging.INFO)
11
+ logger = logging.getLogger()
12
+
13
+
14
+ def main(app, args, resume_preempt=False):
15
+
16
+ logger.info(f"Running pre-training of app: {app}")
17
+ return importlib.import_module(f"app.{app}.train").main(args=args, resume_preempt=resume_preempt)
vjepa2/app/vjepa/train.py ADDED
@@ -0,0 +1,536 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import os
7
+
8
+ # -- FOR DISTRIBUTED TRAINING ENSURE ONLY 1 DEVICE VISIBLE PER PROCESS
9
+ try:
10
+ # -- WARNING: IF DOING DISTRIBUTED TRAINING ON A NON-SLURM CLUSTER, MAKE
11
+ # -- SURE TO UPDATE THIS TO GET LOCAL-RANK ON NODE, OR ENSURE
12
+ # -- THAT YOUR JOBS ARE LAUNCHED WITH ONLY 1 DEVICE VISIBLE
13
+ # -- TO EACH PROCESS
14
+ os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["SLURM_LOCALID"]
15
+ except Exception:
16
+ pass
17
+
18
+ import copy
19
+ import gc
20
+ import random
21
+ import time
22
+
23
+ import numpy as np
24
+ import torch
25
+ import torch.multiprocessing as mp
26
+ import torch.nn.functional as F
27
+ from torch.nn.parallel import DistributedDataParallel
28
+
29
+ from app.vjepa.transforms import make_transforms
30
+ from app.vjepa.utils import init_opt, init_video_model, load_checkpoint
31
+ from src.datasets.data_manager import init_data
32
+ from src.masks.multiseq_multiblock3d import MaskCollator
33
+ from src.masks.utils import apply_masks
34
+ from src.utils.distributed import init_distributed
35
+ from src.utils.logging import AverageMeter, CSVLogger, get_logger, gpu_timer
36
+
37
+ # --
38
+ log_timings = True
39
+ log_freq = 10
40
+ CHECKPOINT_FREQ = 1
41
+ GARBAGE_COLLECT_ITR_FREQ = 50
42
+ # --
43
+
44
+ _GLOBAL_SEED = 0
45
+ random.seed(_GLOBAL_SEED)
46
+ np.random.seed(_GLOBAL_SEED)
47
+ torch.manual_seed(_GLOBAL_SEED)
48
+ torch.backends.cudnn.benchmark = True
49
+
50
+
51
+ logger = get_logger(__name__, force=True)
52
+
53
+
54
+ def main(args, resume_preempt=False):
55
+ # ----------------------------------------------------------------------- #
56
+ # PASSED IN PARAMS FROM CONFIG FILE
57
+ # ----------------------------------------------------------------------- #
58
+
59
+ # -- META
60
+ folder = args.get("folder")
61
+ cfgs_meta = args.get("meta")
62
+ load_model = cfgs_meta.get("load_checkpoint") or resume_preempt
63
+ r_file = cfgs_meta.get("read_checkpoint", None)
64
+ seed = cfgs_meta.get("seed", _GLOBAL_SEED)
65
+ save_every_freq = cfgs_meta.get("save_every_freq", -1)
66
+ skip_batches = cfgs_meta.get("skip_batches", -1)
67
+ use_sdpa = cfgs_meta.get("use_sdpa", False)
68
+ sync_gc = cfgs_meta.get("sync_gc", False)
69
+ which_dtype = cfgs_meta.get("dtype")
70
+ logger.info(f"{which_dtype=}")
71
+ if which_dtype.lower() == "bfloat16":
72
+ dtype = torch.bfloat16
73
+ mixed_precision = True
74
+ elif which_dtype.lower() == "float16":
75
+ dtype = torch.float16
76
+ mixed_precision = True
77
+ else:
78
+ dtype = torch.float32
79
+ mixed_precision = False
80
+
81
+ # -- MASK
82
+ cfgs_mask = args.get("mask")
83
+
84
+ # -- MODEL
85
+ cfgs_model = args.get("model")
86
+ compile_model = cfgs_model.get("compile_model", False)
87
+ use_activation_checkpointing = cfgs_model.get("use_activation_checkpointing", False)
88
+ model_name = cfgs_model.get("model_name")
89
+ pred_depth = cfgs_model.get("pred_depth")
90
+ pred_num_heads = cfgs_model.get("pred_num_heads", None)
91
+ pred_embed_dim = cfgs_model.get("pred_embed_dim")
92
+ uniform_power = cfgs_model.get("uniform_power", False)
93
+ use_mask_tokens = cfgs_model.get("use_mask_tokens", False)
94
+ zero_init_mask_tokens = cfgs_model.get("zero_init_mask_tokens", True)
95
+ use_rope = cfgs_model.get("use_rope", False)
96
+ use_silu = cfgs_model.get("use_silu", False)
97
+ use_pred_silu = cfgs_model.get("use_pred_silu", False)
98
+ wide_silu = cfgs_model.get("wide_silu", True)
99
+
100
+ # -- DATA
101
+ cfgs_data = args.get("data")
102
+ dataset_type = cfgs_data.get("dataset_type", "videodataset")
103
+ dataset_paths = cfgs_data.get("datasets", [])
104
+ datasets_weights = cfgs_data.get("datasets_weights")
105
+ dataset_fpcs = cfgs_data.get("dataset_fpcs")
106
+ max_num_frames = max(dataset_fpcs)
107
+ if datasets_weights is not None:
108
+ assert len(datasets_weights) == len(dataset_paths), "Must have one sampling weight specified for each dataset"
109
+ batch_size = cfgs_data.get("batch_size")
110
+ tubelet_size = cfgs_data.get("tubelet_size")
111
+ fps = cfgs_data.get("fps")
112
+ crop_size = cfgs_data.get("crop_size", 224)
113
+ patch_size = cfgs_data.get("patch_size")
114
+ pin_mem = cfgs_data.get("pin_mem", False)
115
+ num_workers = cfgs_data.get("num_workers", 1)
116
+ persistent_workers = cfgs_data.get("persistent_workers", True)
117
+
118
+ # -- DATA AUGS
119
+ cfgs_data_aug = args.get("data_aug")
120
+ ar_range = cfgs_data_aug.get("random_resize_aspect_ratio", [3 / 4, 4 / 3])
121
+ rr_scale = cfgs_data_aug.get("random_resize_scale", [0.3, 1.0])
122
+ motion_shift = cfgs_data_aug.get("motion_shift", False)
123
+ reprob = cfgs_data_aug.get("reprob", 0.0)
124
+ use_aa = cfgs_data_aug.get("auto_augment", False)
125
+
126
+ # -- LOSS
127
+ cfgs_loss = args.get("loss")
128
+ loss_exp = cfgs_loss.get("loss_exp")
129
+
130
+ # -- OPTIMIZATION
131
+ cfgs_opt = args.get("optimization")
132
+ is_anneal = cfgs_opt.get("is_anneal", False)
133
+ anneal_ckpt = cfgs_opt.get("anneal_ckpt", None)
134
+ if is_anneal and anneal_ckpt is None:
135
+ raise ValueError("Must specify anneal_ckpt if is_anneal is True")
136
+ resume_anneal = cfgs_opt.get("resume_anneal", False) or (is_anneal and resume_preempt)
137
+ ipe = cfgs_opt.get("ipe", None)
138
+ ipe_scale = cfgs_opt.get("ipe_scale", 1.0)
139
+ wd = float(cfgs_opt.get("weight_decay"))
140
+ final_wd = float(cfgs_opt.get("final_weight_decay"))
141
+ num_epochs = cfgs_opt.get("epochs")
142
+ warmup = cfgs_opt.get("warmup")
143
+ start_lr = cfgs_opt.get("start_lr")
144
+ lr = cfgs_opt.get("lr")
145
+ final_lr = cfgs_opt.get("final_lr")
146
+ ema = cfgs_opt.get("ema")
147
+ betas = cfgs_opt.get("betas", (0.9, 0.999))
148
+ eps = cfgs_opt.get("eps", 1.0e-8)
149
+ # ----------------------------------------------------------------------- #
150
+ # ----------------------------------------------------------------------- #
151
+
152
+ np.random.seed(seed)
153
+ torch.manual_seed(seed)
154
+ torch.backends.cudnn.benchmark = True
155
+ try:
156
+ mp.set_start_method("spawn")
157
+ except Exception:
158
+ pass
159
+
160
+ # -- init torch distributed backend
161
+ world_size, rank = init_distributed()
162
+ logger.info(f"Initialized (rank/world-size) {rank}/{world_size}")
163
+
164
+ # -- set device
165
+ if not torch.cuda.is_available():
166
+ device = torch.device("cpu")
167
+ else:
168
+ device = torch.device("cuda:0")
169
+ torch.cuda.set_device(device)
170
+
171
+ # -- log/checkpointing paths
172
+ log_file = os.path.join(folder, f"log_r{rank}.csv")
173
+ latest_file = "latest.pt"
174
+ latest_path = os.path.join(folder, latest_file)
175
+ load_path = None
176
+ if load_model:
177
+ if is_anneal:
178
+ if os.path.exists(latest_path) and resume_anneal:
179
+ load_path = latest_path
180
+ else:
181
+ load_path = anneal_ckpt
182
+ resume_anneal = False
183
+ else:
184
+ load_path = r_file if r_file is not None else latest_path
185
+ if not os.path.exists(load_path):
186
+ load_path = None
187
+ load_model = False
188
+
189
+ # -- make csv_logger
190
+ csv_logger = CSVLogger(
191
+ log_file,
192
+ ("%d", "epoch"),
193
+ ("%d", "itr"),
194
+ ("%.5f", "loss"),
195
+ ("%d", "iter-time(ms)"),
196
+ ("%d", "gpu-time(ms)"),
197
+ ("%d", "dataload-time(ms)"),
198
+ )
199
+
200
+ # -- init model
201
+ encoder, predictor = init_video_model(
202
+ uniform_power=uniform_power,
203
+ use_mask_tokens=use_mask_tokens,
204
+ num_mask_tokens=int(len(cfgs_mask) * len(dataset_fpcs)),
205
+ zero_init_mask_tokens=zero_init_mask_tokens,
206
+ device=device,
207
+ patch_size=patch_size,
208
+ max_num_frames=max_num_frames,
209
+ tubelet_size=tubelet_size,
210
+ model_name=model_name,
211
+ crop_size=crop_size,
212
+ pred_depth=pred_depth,
213
+ pred_num_heads=pred_num_heads,
214
+ pred_embed_dim=pred_embed_dim,
215
+ use_sdpa=use_sdpa,
216
+ use_silu=use_silu,
217
+ use_pred_silu=use_pred_silu,
218
+ wide_silu=wide_silu,
219
+ use_rope=use_rope,
220
+ use_activation_checkpointing=use_activation_checkpointing,
221
+ )
222
+ target_encoder = copy.deepcopy(encoder)
223
+
224
+ if compile_model:
225
+ logger.info("Compiling encoder, target_encoder, and predictor.")
226
+ torch._dynamo.config.optimize_ddp = False
227
+ encoder.compile()
228
+ target_encoder.compile()
229
+ predictor.compile()
230
+
231
+ mask_collator = MaskCollator(
232
+ cfgs_mask=cfgs_mask,
233
+ dataset_fpcs=dataset_fpcs,
234
+ crop_size=crop_size,
235
+ patch_size=patch_size,
236
+ tubelet_size=tubelet_size,
237
+ )
238
+ transform = make_transforms(
239
+ random_horizontal_flip=True,
240
+ random_resize_aspect_ratio=ar_range,
241
+ random_resize_scale=rr_scale,
242
+ reprob=reprob,
243
+ auto_augment=use_aa,
244
+ motion_shift=motion_shift,
245
+ crop_size=crop_size,
246
+ )
247
+
248
+ # -- init data-loaders/samplers
249
+ (unsupervised_loader, unsupervised_sampler) = init_data(
250
+ data=dataset_type,
251
+ root_path=dataset_paths,
252
+ batch_size=batch_size,
253
+ training=True,
254
+ dataset_fpcs=dataset_fpcs,
255
+ fps=fps,
256
+ transform=transform,
257
+ rank=rank,
258
+ world_size=world_size,
259
+ datasets_weights=datasets_weights,
260
+ persistent_workers=persistent_workers,
261
+ collator=mask_collator,
262
+ num_workers=num_workers,
263
+ pin_mem=pin_mem,
264
+ log_dir=None,
265
+ )
266
+ try:
267
+ _dlen = len(unsupervised_loader)
268
+ except Exception: # Different interface for webdataset
269
+ _dlen = unsupervised_loader.num_batches
270
+ if ipe is None:
271
+ ipe = _dlen
272
+ logger.info(f"iterations per epoch/dataset length: {ipe}/{_dlen}")
273
+
274
+ # -- init optimizer and scheduler
275
+ optimizer, scaler, scheduler, wd_scheduler = init_opt(
276
+ is_anneal=is_anneal,
277
+ encoder=encoder,
278
+ predictor=predictor,
279
+ wd=wd,
280
+ final_wd=final_wd,
281
+ start_lr=start_lr,
282
+ ref_lr=lr,
283
+ final_lr=final_lr,
284
+ iterations_per_epoch=ipe,
285
+ warmup=warmup,
286
+ num_epochs=num_epochs,
287
+ ipe_scale=ipe_scale,
288
+ mixed_precision=mixed_precision,
289
+ betas=betas,
290
+ eps=eps,
291
+ )
292
+ encoder = DistributedDataParallel(encoder, static_graph=True, find_unused_parameters=False)
293
+ predictor = DistributedDataParallel(predictor, static_graph=False, find_unused_parameters=False)
294
+ target_encoder = DistributedDataParallel(target_encoder, static_graph=True, find_unused_parameters=False)
295
+ for p in target_encoder.parameters():
296
+ p.requires_grad = False
297
+
298
+ # -- momentum schedule
299
+ momentum_scheduler = (
300
+ ema[0] + i * (ema[1] - ema[0]) / (ipe * num_epochs * ipe_scale)
301
+ for i in range(int(ipe * num_epochs * ipe_scale) + 1)
302
+ )
303
+
304
+ start_epoch = 0
305
+ # -- load training checkpoint
306
+ if load_model or os.path.exists(latest_path):
307
+ (
308
+ encoder,
309
+ predictor,
310
+ target_encoder,
311
+ optimizer,
312
+ scaler,
313
+ start_epoch,
314
+ ) = load_checkpoint(
315
+ r_path=load_path,
316
+ encoder=encoder,
317
+ predictor=predictor,
318
+ target_encoder=target_encoder,
319
+ opt=optimizer,
320
+ scaler=scaler,
321
+ is_anneal=is_anneal and not resume_anneal,
322
+ )
323
+ if not is_anneal or resume_anneal:
324
+ for _ in range(start_epoch * ipe):
325
+ scheduler.step()
326
+ wd_scheduler.step()
327
+ next(momentum_scheduler)
328
+ mask_collator.step()
329
+
330
+ def save_checkpoint(epoch, path):
331
+ if rank != 0:
332
+ return
333
+ save_dict = {
334
+ "encoder": encoder.state_dict(),
335
+ "predictor": predictor.state_dict(),
336
+ "opt": optimizer.state_dict(),
337
+ "scaler": None if scaler is None else scaler.state_dict(),
338
+ "target_encoder": target_encoder.state_dict(),
339
+ "epoch": epoch,
340
+ "loss": loss_meter.avg,
341
+ "batch_size": batch_size,
342
+ "world_size": world_size,
343
+ "lr": lr,
344
+ }
345
+ try:
346
+ torch.save(save_dict, path)
347
+ except Exception as e:
348
+ logger.info(f"Encountered exception when saving checkpoint: {e}")
349
+
350
+ logger.info("Initializing loader...")
351
+ unsupervised_sampler.set_epoch(start_epoch)
352
+ loader = iter(unsupervised_loader)
353
+
354
+ if skip_batches > 0:
355
+ logger.info(f"Skip {skip_batches} batches")
356
+ # -- update distributed-data-loader epoch
357
+
358
+ for itr in range(skip_batches):
359
+ if itr % 10 == 0:
360
+ logger.info(f"Skip {itr}/{skip_batches} batches")
361
+ try:
362
+ _ = next(loader)
363
+ except Exception:
364
+ loader = iter(unsupervised_loader)
365
+ _ = next(loader)
366
+
367
+ if sync_gc:
368
+ gc.disable()
369
+ gc.collect()
370
+
371
+ # -- TRAINING LOOP
372
+ for epoch in range(start_epoch, num_epochs):
373
+ logger.info("Epoch %d" % (epoch + 1))
374
+
375
+ loss_meter = AverageMeter()
376
+ mask_meters = {fpc: AverageMeter() for fpc in dataset_fpcs}
377
+ iter_time_meter = AverageMeter()
378
+ gpu_time_meter = AverageMeter()
379
+ data_elapsed_time_meter = AverageMeter()
380
+
381
+ for itr in range(ipe):
382
+ itr_start_time = time.time()
383
+
384
+ iter_retries = 0
385
+ iter_successful = False
386
+ while not iter_successful:
387
+ try:
388
+ sample = next(loader)
389
+ iter_successful = True
390
+ except StopIteration:
391
+ logger.info("Exhausted data loaders. Refreshing...")
392
+ unsupervised_sampler.set_epoch(epoch)
393
+ loader = iter(unsupervised_loader)
394
+ except Exception as e:
395
+ NUM_RETRIES = 5
396
+ if iter_retries < NUM_RETRIES:
397
+ logger.warning(f"Encountered exception when loading data (num retries {iter_retries}):\n{e}")
398
+ iter_retries += 1
399
+ time.sleep(5)
400
+ else:
401
+ logger.warning(f"Exceeded max retries ({NUM_RETRIES}) when loading data. Skipping batch.")
402
+ raise e
403
+
404
+ for _fpc_sample in sample:
405
+ bs, fpc = _fpc_sample[0][-1][0].size()
406
+ mask_meters[fpc].update(bs / batch_size)
407
+
408
+ def load_clips():
409
+ all_clips, all_masks_enc, all_masks_pred = [], [], []
410
+ for fpc_sample in sample:
411
+ udata, masks_enc, masks_pred = fpc_sample
412
+ all_clips += [udata[0][0].to(device, non_blocking=True)]
413
+ all_masks_enc += [[m.to(device, non_blocking=True) for m in masks_enc]]
414
+ all_masks_pred += [[m.to(device, non_blocking=True) for m in masks_pred]]
415
+ return all_clips, all_masks_enc, all_masks_pred
416
+
417
+ clips, masks_enc, masks_pred = load_clips()
418
+ data_elapsed_time_ms = (time.time() - itr_start_time) * 1000.0
419
+
420
+ if sync_gc and (itr + 1) % GARBAGE_COLLECT_ITR_FREQ == 0:
421
+ logger.info("Running garbage collection...")
422
+ gc.collect()
423
+
424
+ def train_step():
425
+ _new_lr = scheduler.step()
426
+ _new_wd = wd_scheduler.step()
427
+ # --
428
+
429
+ def forward_target(c):
430
+ with torch.no_grad():
431
+ h = target_encoder(c)
432
+ h = [F.layer_norm(hi, (hi.size(-1),)) for hi in h]
433
+ return h
434
+
435
+ def forward_context(c):
436
+ z = encoder(c, masks_enc)
437
+ z = predictor(z, masks_enc, masks_pred)
438
+ return z
439
+
440
+ def loss_fn(z, h):
441
+ # Assumption: predictor will have returned only masked tokens for z
442
+ h = [apply_masks(hi, mi, concat=False) for hi, mi in zip(h, masks_pred)]
443
+
444
+ loss, n = 0, 0
445
+ for zi, hi in zip(z, h):
446
+ for zij, hij in zip(zi, hi):
447
+ loss += torch.mean(torch.abs(zij - hij) ** loss_exp) / loss_exp
448
+ n += 1
449
+ loss /= n
450
+ return loss
451
+
452
+ # Step 1. Forward
453
+ with torch.amp.autocast('cuda', dtype=dtype, enabled=mixed_precision):
454
+ h = forward_target(clips)
455
+ z = forward_context(clips)
456
+ loss = loss_fn(z, h) # jepa prediction loss
457
+
458
+ # Step 2. Backward & step
459
+ if mixed_precision:
460
+ scaler.scale(loss).backward()
461
+ scaler.unscale_(optimizer)
462
+ else:
463
+ loss.backward()
464
+ if mixed_precision:
465
+ scaler.step(optimizer)
466
+ scaler.update()
467
+ else:
468
+ optimizer.step()
469
+ optimizer.zero_grad()
470
+
471
+ # Step 3. momentum update of target encoder
472
+ m = next(momentum_scheduler)
473
+ with torch.no_grad():
474
+ params_k = []
475
+ params_q = []
476
+ for param_q, param_k in zip(encoder.parameters(), target_encoder.parameters()):
477
+ params_k.append(param_k)
478
+ params_q.append(param_q)
479
+ torch._foreach_mul_(params_k, m)
480
+ torch._foreach_add_(params_k, params_q, alpha=1 - m)
481
+
482
+ return (
483
+ loss.detach().item(),
484
+ _new_lr,
485
+ _new_wd,
486
+ )
487
+
488
+ (
489
+ loss,
490
+ _new_lr,
491
+ _new_wd,
492
+ ), gpu_etime_ms = gpu_timer(train_step)
493
+ iter_elapsed_time_ms = (time.time() - itr_start_time) * 1000.0
494
+ loss_meter.update(loss)
495
+ iter_time_meter.update(iter_elapsed_time_ms)
496
+ gpu_time_meter.update(gpu_etime_ms)
497
+ data_elapsed_time_meter.update(data_elapsed_time_ms)
498
+
499
+ # -- Logging
500
+ def log_stats():
501
+ csv_logger.log(epoch + 1, itr, loss, iter_elapsed_time_ms, gpu_etime_ms, data_elapsed_time_ms)
502
+ if (itr % log_freq == 0) or (itr == ipe - 1) or np.isnan(loss) or np.isinf(loss):
503
+ logger.info(
504
+ "[%d, %5d] loss: %.3f "
505
+ "masks: %s "
506
+ "[wd: %.2e] [lr: %.2e] "
507
+ "[mem: %.2e] "
508
+ "[iter: %.1f ms] "
509
+ "[gpu: %.1f ms] "
510
+ "[data: %.1f ms]"
511
+ % (
512
+ epoch + 1,
513
+ itr,
514
+ loss_meter.avg,
515
+ "[" + ", ".join([f"{k}: " + "%.1f" % mask_meters[k].avg for k in mask_meters]) + "]",
516
+ _new_wd,
517
+ _new_lr,
518
+ torch.cuda.max_memory_allocated() / 1024.0**2,
519
+ iter_time_meter.avg,
520
+ gpu_time_meter.avg,
521
+ data_elapsed_time_meter.avg,
522
+ )
523
+ )
524
+
525
+ log_stats()
526
+ assert not np.isnan(loss), "loss is nan"
527
+
528
+ # -- Save Checkpoint
529
+ logger.info("avg. loss %.3f" % loss_meter.avg)
530
+ # -- Save Last
531
+ if epoch % CHECKPOINT_FREQ == 0 or epoch == (num_epochs - 1):
532
+ save_checkpoint(epoch + 1, latest_path)
533
+ if save_every_freq > 0 and epoch % save_every_freq == 0:
534
+ save_every_file = f"e{epoch}.pt"
535
+ save_every_path = os.path.join(folder, save_every_file)
536
+ save_checkpoint(epoch + 1, save_every_path)
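The training loop above anneals the target encoder's EMA momentum linearly from `ema[0]` to `ema[1]` over the whole schedule via a generator expression, then applies `k = m*k + (1-m)*q` each step. A minimal, pure-Python sketch of that schedule (the endpoint values 0.998 and 1.0 are illustrative, not taken from any config in this commit):

```python
def momentum_schedule(ema_start, ema_end, total_steps):
    """Linearly anneal the EMA momentum, mirroring the generator
    expression used for the target encoder in train.py."""
    for i in range(total_steps + 1):
        yield ema_start + i * (ema_end - ema_start) / total_steps


sched = momentum_schedule(0.998, 1.0, 4)
print([round(m, 4) for m in sched])  # → [0.998, 0.9985, 0.999, 0.9995, 1.0]
```

Driving the momentum toward 1.0 freezes the target encoder progressively, which is the usual stabilizer for JEPA/BYOL-style self-distillation.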
vjepa2/app/vjepa/transforms.py ADDED
@@ -0,0 +1,154 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torchvision.transforms as transforms
+
+import src.datasets.utils.video.transforms as video_transforms
+from src.datasets.utils.video.randerase import RandomErasing
+
+
+def make_transforms(
+    random_horizontal_flip=True,
+    random_resize_aspect_ratio=(3 / 4, 4 / 3),
+    random_resize_scale=(0.3, 1.0),
+    reprob=0.0,
+    auto_augment=False,
+    motion_shift=False,
+    crop_size=224,
+    normalize=((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
+):
+
+    _frames_augmentation = VideoTransform(
+        random_horizontal_flip=random_horizontal_flip,
+        random_resize_aspect_ratio=random_resize_aspect_ratio,
+        random_resize_scale=random_resize_scale,
+        reprob=reprob,
+        auto_augment=auto_augment,
+        motion_shift=motion_shift,
+        crop_size=crop_size,
+        normalize=normalize,
+    )
+    return _frames_augmentation
+
+
+class VideoTransform(object):
+
+    def __init__(
+        self,
+        random_horizontal_flip=True,
+        random_resize_aspect_ratio=(3 / 4, 4 / 3),
+        random_resize_scale=(0.3, 1.0),
+        reprob=0.0,
+        auto_augment=False,
+        motion_shift=False,
+        crop_size=224,
+        normalize=((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
+    ):
+
+        self.random_horizontal_flip = random_horizontal_flip
+        self.random_resize_aspect_ratio = random_resize_aspect_ratio
+        self.random_resize_scale = random_resize_scale
+        self.auto_augment = auto_augment
+        self.motion_shift = motion_shift
+        self.crop_size = crop_size
+        self.mean = torch.tensor(normalize[0], dtype=torch.float32)
+        self.std = torch.tensor(normalize[1], dtype=torch.float32)
+        if not self.auto_augment:
+            # Without auto-augment, PIL and tensor conversions simply scale uint8 space by 255.
+            self.mean *= 255.0
+            self.std *= 255.0
+
+        self.autoaug_transform = video_transforms.create_random_augment(
+            input_size=(crop_size, crop_size),
+            # auto_augment="rand-m4-n4-w1-mstd0.5-inc1",
+            auto_augment="rand-m7-n4-mstd0.5-inc1",
+            interpolation="bicubic",
+        )
+
+        self.spatial_transform = (
+            video_transforms.random_resized_crop_with_shift if motion_shift else video_transforms.random_resized_crop
+        )
+
+        self.reprob = reprob
+        self.erase_transform = RandomErasing(
+            reprob,
+            mode="pixel",
+            max_count=1,
+            num_splits=1,
+            device="cpu",
+        )
+
+    def __call__(self, buffer):
+
+        if self.auto_augment:
+            buffer = [transforms.ToPILImage()(frame) for frame in buffer]
+            buffer = self.autoaug_transform(buffer)
+            buffer = [transforms.ToTensor()(img) for img in buffer]
+            buffer = torch.stack(buffer)  # T C H W
+            buffer = buffer.permute(0, 2, 3, 1)  # T H W C
+        elif torch.is_tensor(buffer):
+            # TODO: ensure input is always a tensor?
+            buffer = buffer.to(torch.float32)
+        else:
+            buffer = torch.tensor(buffer, dtype=torch.float32)
+
+        buffer = buffer.permute(3, 0, 1, 2)  # T H W C -> C T H W
+
+        buffer = self.spatial_transform(
+            images=buffer,
+            target_height=self.crop_size,
+            target_width=self.crop_size,
+            scale=self.random_resize_scale,
+            ratio=self.random_resize_aspect_ratio,
+        )
+        if self.random_horizontal_flip:
+            buffer, _ = video_transforms.horizontal_flip(0.5, buffer)
+
+        buffer = _tensor_normalize_inplace(buffer, self.mean, self.std)
+        if self.reprob > 0:
+            buffer = buffer.permute(1, 0, 2, 3)
+            buffer = self.erase_transform(buffer)
+            buffer = buffer.permute(1, 0, 2, 3)
+
+        return buffer
+
+
+def tensor_normalize(tensor, mean, std):
+    """
+    Normalize a given tensor by subtracting the mean and dividing the std.
+    Args:
+        tensor (tensor): tensor to normalize.
+        mean (tensor or list): mean value to subtract.
+        std (tensor or list): std to divide.
+    """
+    if tensor.dtype == torch.uint8:
+        tensor = tensor.float()
+        tensor = tensor / 255.0
+    if isinstance(mean, list):
+        mean = torch.tensor(mean)
+    if isinstance(std, list):
+        std = torch.tensor(std)
+    tensor = tensor - mean
+    tensor = tensor / std
+    return tensor
+
+
+def _tensor_normalize_inplace(tensor, mean, std):
+    """
+    Normalize a given tensor by subtracting the mean and dividing the std.
+    Args:
+        tensor (tensor): tensor to normalize (with dimensions C, T, H, W).
+        mean (tensor): mean value to subtract (in 0 to 255 floats).
+        std (tensor): std to divide (in 0 to 255 floats).
+    """
+    if tensor.dtype == torch.uint8:
+        tensor = tensor.float()
+
+    C, T, H, W = tensor.shape
+    tensor = tensor.view(C, -1).permute(1, 0)  # Make C the last dimension
+    tensor.sub_(mean).div_(std)
+    tensor = tensor.permute(1, 0).view(C, T, H, W)  # Put C back in front
+    return tensor
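When auto-augment is off, `VideoTransform` pre-scales the ImageNet mean/std by 255 and normalizes raw uint8 pixel values directly; this is algebraically the same as dividing by 255 first and normalizing in [0, 1]. A small pure-Python sketch checking that equivalence for a single channel value (the constants are the ImageNet red-channel statistics from the file above):

```python
def normalize_pixel(value, mean, std, uint8_space=True):
    """Normalize one pixel the way VideoTransform does: either with
    mean/std pre-scaled by 255 (uint8 space) or in [0, 1] space."""
    if uint8_space:
        return (value - mean * 255.0) / (std * 255.0)
    return (value / 255.0 - mean) / std


# Both paths agree for a red-channel pixel of 128:
a = normalize_pixel(128, 0.485, 0.229, uint8_space=True)
b = normalize_pixel(128, 0.485, 0.229, uint8_space=False)
print(abs(a - b) < 1e-9)  # → True
```

Keeping the statistics in uint8 space lets `_tensor_normalize_inplace` skip a full-tensor division by 255.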
vjepa2/app/vjepa/utils.py ADDED
@@ -0,0 +1,267 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import logging
7
+ import sys
8
+ import warnings
9
+
10
+ import torch
11
+ import yaml
12
+
13
+ import src.models.predictor as vit_pred
14
+ import src.models.vision_transformer as video_vit
15
+ from src.utils.checkpoint_loader import robust_checkpoint_loader
16
+ from src.utils.schedulers import CosineWDSchedule, LinearDecaySchedule, WarmupCosineSchedule
17
+ from src.utils.wrappers import MultiSeqWrapper, PredictorMultiSeqWrapper
18
+
19
+ logging.basicConfig(stream=sys.stdout, level=logging.INFO)
20
+ logger = logging.getLogger()
21
+
22
+ MAX_RETRIES = 3
23
+
24
+
25
+ def build_eval_args(
26
+ model_name,
27
+ patch_size,
28
+ tubelet_size,
29
+ num_frames,
30
+ logging_folder,
31
+ checkpoint,
32
+ write_tag,
33
+ eval_cfg_paths,
34
+ uniform_power=False,
35
+ use_sdpa=False,
36
+ clip_duration=None,
37
+ use_silu=False,
38
+ wide_silu=True,
39
+ tag=None,
40
+ ):
41
+ """
42
+ Helper function to parse the pre-training configs to construct the
43
+ evaluation configs, return as a list of eval configs.
44
+ """
45
+ # By convention, the pre-training config should specify any required evals
46
+ # in the 'evals' key
47
+ if eval_cfg_paths is None:
48
+ logger.info("No evaluations specified!")
49
+ return
50
+
51
+ eval_nodes = None
52
+ eval_tasks_per_node = None
53
+ args_eval = []
54
+ for i, f in enumerate(eval_cfg_paths):
55
+ with open(f, "r") as y_file:
56
+ _args = yaml.load(y_file, Loader=yaml.FullLoader)
57
+ _tag = _args.get("tag", "")
58
+ _args["tag"] = f"{tag}-{_tag}"
59
+ _nodes = _args.get("nodes", None)
60
+ _tasks = _args.get("tasks_per_node", 8)
61
+ eval_nodes = _nodes if eval_nodes is None else eval_nodes
62
+ eval_tasks_per_node = _tasks if eval_tasks_per_node is None else eval_tasks_per_node
63
+ if (eval_nodes != _nodes) or (eval_tasks_per_node != _tasks):
64
+ warnings.warn("Configs for online evals must use same number of nodes for slurm-batch processing")
65
+
66
+ # Model params
67
+ _args["pretrain"] = {}
68
+ _args["pretrain"]["model_name"] = model_name
69
+ _args["pretrain"]["patch_size"] = patch_size
70
+ _args["pretrain"]["tubelet_size"] = tubelet_size
71
+ _args["pretrain"]["uniform_power"] = uniform_power
72
+ _args["pretrain"]["use_sdpa"] = use_sdpa
73
+ _args["pretrain"]["clip_duration"] = clip_duration
74
+ _args["pretrain"]["use_silu"] = use_silu
75
+ _args["pretrain"]["wide_silu"] = wide_silu
76
+
77
+ # Data params
78
+ _args["pretrain"]["frames_per_clip"] = num_frames
79
+
80
+ # Misc
81
+ _args["pretrain"]["folder"] = logging_folder
82
+ _args["pretrain"]["checkpoint"] = checkpoint
83
+ _args["pretrain"]["write_tag"] = write_tag
84
+
85
+ args_eval += [_args]
86
+
87
+ return eval_nodes, eval_tasks_per_node, args_eval
88
+
89
+
90
+ def load_checkpoint(
91
+ r_path,
92
+ encoder,
93
+ predictor,
94
+ target_encoder,
95
+ opt,
96
+ scaler,
97
+ is_anneal=False,
98
+ ):
99
+ logger.info(f"Loading checkpoint from {r_path}")
100
+ checkpoint = robust_checkpoint_loader(r_path, map_location=torch.device("cpu"))
101
+
102
+ epoch = 0
103
+ if not is_anneal:
104
+ epoch = checkpoint["epoch"]
105
+
106
+ # -- loading encoder
107
+ pretrained_dict = checkpoint["encoder"]
108
+ msg = encoder.load_state_dict(pretrained_dict)
109
+ logger.info(f"loaded pretrained encoder from epoch {epoch} with msg: {msg}")
110
+
111
+ # -- loading predictor
112
+ pretrained_dict = checkpoint["predictor"]
113
+ msg = predictor.load_state_dict(pretrained_dict)
114
+ logger.info(f"loaded pretrained predictor from epoch {epoch} with msg: {msg}")
115
+
116
+ # -- loading target_encoder
117
+ if target_encoder is not None:
118
+ print(list(checkpoint.keys()))
119
+ pretrained_dict = checkpoint["target_encoder"]
120
+ msg = target_encoder.load_state_dict(pretrained_dict)
121
+ logger.info(f"loaded pretrained target encoder from epoch {epoch} with msg: {msg}")
122
+
123
+ # -- loading optimizer
124
+ opt.load_state_dict(checkpoint["opt"])
125
+ if scaler is not None:
126
+ scaler.load_state_dict(checkpoint["scaler"])
127
+ logger.info(f"loaded optimizers from epoch {epoch}")
128
+ logger.info(f"read-path: {r_path}")
129
+ del checkpoint
130
+
131
+ return (
132
+ encoder,
133
+ predictor,
134
+ target_encoder,
135
+ opt,
136
+ scaler,
137
+ epoch,
138
+ )
139
+
140
+
141
+ def init_video_model(
142
+ device,
143
+ patch_size=16,
144
+ max_num_frames=16,
145
+ tubelet_size=2,
146
+ model_name="vit_base",
147
+ crop_size=224,
148
+ pred_depth=6,
149
+ pred_num_heads=None,
150
+ pred_embed_dim=384,
151
+ uniform_power=False,
152
+ use_mask_tokens=False,
153
+ num_mask_tokens=2,
154
+ zero_init_mask_tokens=True,
155
+ use_sdpa=False,
156
+ use_rope=False,
157
+ use_silu=False,
158
+ use_pred_silu=False,
159
+ wide_silu=False,
160
+ use_activation_checkpointing=False,
161
+ ):
162
+ encoder = video_vit.__dict__[model_name](
163
+ img_size=crop_size,
164
+ patch_size=patch_size,
165
+ num_frames=max_num_frames,
166
+ tubelet_size=tubelet_size,
167
+ uniform_power=uniform_power,
168
+ use_sdpa=use_sdpa,
169
+ use_silu=use_silu,
170
+ wide_silu=wide_silu,
171
+ use_activation_checkpointing=use_activation_checkpointing,
172
+ use_rope=use_rope,
173
+ )
174
+ encoder = MultiSeqWrapper(encoder)
175
+ predictor = vit_pred.__dict__["vit_predictor"](
176
+ img_size=crop_size,
177
+ use_mask_tokens=use_mask_tokens,
178
+ patch_size=patch_size,
179
+ num_frames=max_num_frames,
180
+ tubelet_size=tubelet_size,
181
+ embed_dim=encoder.backbone.embed_dim,
182
+ predictor_embed_dim=pred_embed_dim,
183
+            depth=pred_depth,
+            num_heads=encoder.backbone.num_heads if pred_num_heads is None else pred_num_heads,
+            uniform_power=uniform_power,
+            num_mask_tokens=num_mask_tokens,
+            zero_init_mask_tokens=zero_init_mask_tokens,
+            use_rope=use_rope,
+            use_sdpa=use_sdpa,
+            use_silu=use_pred_silu,
+            wide_silu=wide_silu,
+            use_activation_checkpointing=use_activation_checkpointing,
+        )
+        predictor = PredictorMultiSeqWrapper(predictor)
+
+    encoder.to(device)
+    predictor.to(device)
+    logger.info(encoder)
+    logger.info(predictor)
+
+    def count_parameters(model):
+        return sum(p.numel() for p in model.parameters() if p.requires_grad)
+
+    logger.info(f"Encoder number of parameters: {count_parameters(encoder)}")
+    logger.info(f"Predictor number of parameters: {count_parameters(predictor)}")
+
+    return encoder, predictor
+
+
+def init_opt(
+    is_anneal,
+    encoder,
+    predictor,
+    iterations_per_epoch,
+    start_lr,
+    ref_lr,
+    warmup,
+    num_epochs,
+    wd=1e-6,
+    final_wd=1e-6,
+    final_lr=0.0,
+    mixed_precision=False,
+    ipe_scale=1.25,
+    betas=(0.9, 0.999),
+    eps=1e-8,
+    zero_init_bias_wd=True,
+):
+    param_groups = [
+        {"params": (p for n, p in encoder.named_parameters() if ("bias" not in n) and (len(p.shape) != 1))},
+        {"params": (p for n, p in predictor.named_parameters() if ("bias" not in n) and (len(p.shape) != 1))},
+        {
+            "params": (p for n, p in encoder.named_parameters() if ("bias" in n) or (len(p.shape) == 1)),
+            "WD_exclude": zero_init_bias_wd,
+            "weight_decay": 0,
+        },
+        {
+            "params": (p for n, p in predictor.named_parameters() if ("bias" in n) or (len(p.shape) == 1)),
+            "WD_exclude": zero_init_bias_wd,
+            "weight_decay": 0,
+        },
+    ]
+
+    optimizer = torch.optim.AdamW(param_groups, betas=betas, eps=eps)
+    if not is_anneal:
+        scheduler = WarmupCosineSchedule(
+            optimizer,
+            warmup_steps=int(warmup * iterations_per_epoch),
+            start_lr=start_lr,
+            ref_lr=ref_lr,
+            final_lr=final_lr,
+            T_max=int(ipe_scale * num_epochs * iterations_per_epoch),
+        )
+    else:
+        scheduler = LinearDecaySchedule(
+            optimizer,
+            ref_lr=ref_lr,
+            final_lr=final_lr,
+            T_max=int(ipe_scale * num_epochs * iterations_per_epoch),
+        )
+    wd_scheduler = CosineWDSchedule(
+        optimizer,
+        ref_wd=wd,
+        final_wd=final_wd,
+        T_max=int(ipe_scale * num_epochs * iterations_per_epoch),
+    )
+    scaler = torch.amp.GradScaler('cuda') if mixed_precision else None
+    return optimizer, scaler, scheduler, wd_scheduler
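The `init_opt` param groups above follow a common convention: biases and 1-D parameters (e.g. norm scales) are excluded from weight decay. A minimal sketch of the same grouping rule on plain `(name, shape)` pairs — the parameter names below are hypothetical, for illustration only:

```python
def split_param_groups(named_shapes):
    """Split parameters into weight-decay / no-weight-decay groups.

    A parameter is excluded from weight decay when its name contains
    "bias" or it is 1-D (e.g. LayerNorm scales), mirroring the
    filters used to build init_opt's param_groups.
    """
    decay, no_decay = [], []
    for name, shape in named_shapes:
        if "bias" in name or len(shape) == 1:
            no_decay.append(name)
        else:
            decay.append(name)
    return decay, no_decay


# Hypothetical parameter names, mimicking a ViT block.
params = [
    ("blocks.0.attn.qkv.weight", (768, 768)),
    ("blocks.0.attn.qkv.bias", (768,)),
    ("blocks.0.norm1.weight", (768,)),
]
decay, no_decay = split_param_groups(params)
# decay holds only the 2-D weight; bias and norm scale are exempt.
```

In `init_opt` itself the no-decay groups additionally carry `"weight_decay": 0` and a `WD_exclude` flag so the `CosineWDSchedule` leaves them untouched.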
vjepa2/app/vjepa_droid/droid.py ADDED
@@ -0,0 +1,232 @@
+# Copyright (c) Facebook, Inc. and its affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+#
+
+import json
+import os
+from logging import getLogger
+from math import ceil
+
+import h5py
+import numpy as np
+import pandas as pd
+import torch
+import torch.utils.data
+from decord import VideoReader, cpu
+from scipy.spatial.transform import Rotation
+
+_GLOBAL_SEED = 0
+logger = getLogger()
+
+
+def init_data(
+    data_path,
+    batch_size,
+    frames_per_clip=16,
+    fps=5,
+    crop_size=224,
+    rank=0,
+    world_size=1,
+    camera_views=0,
+    stereo_view=False,
+    drop_last=True,
+    num_workers=10,
+    pin_mem=True,
+    persistent_workers=True,
+    collator=None,
+    transform=None,
+    camera_frame=False,
+    tubelet_size=2,
+):
+    dataset = DROIDVideoDataset(
+        data_path=data_path,
+        frames_per_clip=frames_per_clip,
+        transform=transform,
+        fps=fps,
+        camera_views=camera_views,
+        frameskip=tubelet_size,
+        camera_frame=camera_frame,
+    )
+
+    dist_sampler = torch.utils.data.distributed.DistributedSampler(
+        dataset, num_replicas=world_size, rank=rank, shuffle=True
+    )
+
+    data_loader = torch.utils.data.DataLoader(
+        dataset,
+        collate_fn=collator,
+        sampler=dist_sampler,
+        batch_size=batch_size,
+        drop_last=drop_last,
+        pin_memory=pin_mem,
+        num_workers=num_workers,
+        persistent_workers=(num_workers > 0) and persistent_workers,
+    )
+
+    logger.info("VideoDataset unsupervised data loader created")
+
+    return data_loader, dist_sampler
+
+
+def get_json(directory):
+    for filename in os.listdir(directory):
+        if filename.endswith(".json"):
+            file_path = os.path.join(directory, filename)
+            try:
+                with open(file_path, "r") as f:
+                    return json.load(f)
+            except json.JSONDecodeError:
+                print(f"Error decoding JSON in file: {file_path}")
+            except Exception as e:
+                print(f"An unexpected error occurred while processing {file_path}: {e}")
+
+
+class DROIDVideoDataset(torch.utils.data.Dataset):
+    """Video classification dataset."""
+
+    def __init__(
+        self,
+        data_path,
+        camera_views=["left_mp4_path", "right_mp4_path"],
+        frameskip=2,
+        frames_per_clip=16,
+        fps=5,
+        transform=None,
+        camera_frame=False,
+    ):
+        self.data_path = data_path
+        self.frames_per_clip = frames_per_clip
+        self.frameskip = frameskip
+        self.fps = fps
+        self.transform = transform
+        self.camera_frame = camera_frame
+        if VideoReader is None:
+            raise ImportError('Unable to import "decord" which is required to read videos.')
+
+        # Camera views
+        # ---
+        # wrist camera view
+        # left camera view
+        # right camera view
+        self.camera_views = camera_views
+        self.h5_name = "trajectory.h5"
+
+        samples = list(pd.read_csv(data_path, header=None, delimiter=" ").values[:, 0])
+        self.samples = samples
+
+    def __getitem__(self, index):
+        path = self.samples[index]
+
+        # -- keep trying to load videos until you find a valid sample
+        loaded_video = False
+        while not loaded_video:
+            try:
+                buffer, actions, states, extrinsics, indices = self.loadvideo_decord(path)
+                loaded_video = True
+            except Exception as e:
+                logger.info(f"Encountered exception when loading video {path=} {e=}")
+                loaded_video = False
+                index = np.random.randint(self.__len__())
+                path = self.samples[index]
+
+        return buffer, actions, states, extrinsics, indices
+
+    def poses_to_diffs(self, poses):
+        xyz = poses[:, :3]  # shape [T, 3]
+        thetas = poses[:, 3:6]  # euler angles, shape [T, 3]
+        matrices = [Rotation.from_euler("xyz", theta, degrees=False).as_matrix() for theta in thetas]
+        xyz_diff = xyz[1:] - xyz[:-1]
+        angle_diff = [matrices[t + 1] @ matrices[t].T for t in range(len(matrices) - 1)]
+        angle_diff = [Rotation.from_matrix(mat).as_euler("xyz", degrees=False) for mat in angle_diff]
+        angle_diff = np.stack([d for d in angle_diff], axis=0)
+        closedness = poses[:, -1:]
+        closedness_delta = closedness[1:] - closedness[:-1]
+        return np.concatenate([xyz_diff, angle_diff, closedness_delta], axis=1)
+
+    def transform_frame(self, poses, extrinsics):
+        gripper = poses[:, -1:]
+        poses = poses[:, :-1]
+
+        def pose_to_transform(pose):
+            trans = pose[:3]  # shape [3]
+            theta = pose[3:6]  # euler angles, shape [3]
+            Rot = Rotation.from_euler("xyz", theta, degrees=False).as_matrix()
+            T = np.eye(4)
+            T[:3, :3] = Rot
+            T[:3, 3] = trans
+            return T
+
+        def transform_to_pose(transform):
+            trans = transform[:3, 3]
+            Rot = transform[:3, :3]
+            angle = Rotation.from_matrix(Rot).as_euler("xyz", degrees=False)
+            return np.concatenate([trans, angle], axis=0)
+
+        new_pose = []
+        for p, e in zip(poses, extrinsics):
+            p_transform = pose_to_transform(p)
+            e_transform = pose_to_transform(e)
+            new_pose_transform = np.linalg.inv(e_transform) @ p_transform
+            new_pose += [transform_to_pose(new_pose_transform)]
+        new_pose = np.stack(new_pose, axis=0)
+
+        return np.concatenate([new_pose, gripper], axis=1)
+
+    def loadvideo_decord(self, path):
+        # -- load metadata
+        metadata = get_json(path)
+        if metadata is None:
+            raise Exception(f"No metadata for video {path=}")
+
+        # -- load trajectory info
+        tpath = os.path.join(path, self.h5_name)
+        trajectory = h5py.File(tpath)
+
+        # -- randomly sample a camera view
+        camera_view = self.camera_views[torch.randint(0, len(self.camera_views), (1,))]
+        mp4_name = metadata[camera_view].split("recordings/MP4/")[-1]
+        camera_name = mp4_name.split(".")[0]
+        extrinsics = trajectory["observation"]["camera_extrinsics"][f"{camera_name}_left"]
+        states = np.concatenate(
+            [
+                np.array(trajectory["observation"]["robot_state"]["cartesian_position"]),
+                np.array(trajectory["observation"]["robot_state"]["gripper_position"])[:, None],
+            ],
+            axis=1,
+        )  # [T, 7]
+        vpath = os.path.join(path, "recordings/MP4", mp4_name)
+        vr = VideoReader(vpath, num_threads=-1, ctx=cpu(0))
+        # --
+        vfps = vr.get_avg_fps()
+        fpc = self.frames_per_clip
+        fps = self.fps if self.fps is not None else vfps
+        fstp = ceil(vfps / fps)
+        nframes = int(fpc * fstp)
+        vlen = len(vr)
+
+        if vlen < nframes:
+            raise Exception(f"Video is too short {vpath=}, {nframes=}, {vlen=}")
+
+        # sample a random window of nframes
+        ef = np.random.randint(nframes, vlen)
+        sf = ef - nframes
+        indices = np.arange(sf, sf + nframes, fstp).astype(np.int64)
+        # --
+        states = states[indices, :][:: self.frameskip]
+        extrinsics = extrinsics[indices, :][:: self.frameskip]
+        if self.camera_frame:
+            states = self.transform_frame(states, extrinsics)
+        actions = self.poses_to_diffs(states)
+        # --
+        vr.seek(0)  # go to start of video before sampling frames
+        buffer = vr.get_batch(indices).asnumpy()
+        if self.transform is not None:
+            buffer = self.transform(buffer)
+
+        return buffer, actions, states, extrinsics, indices
+
+    def __len__(self):
+        return len(self.samples)
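`poses_to_diffs` above turns absolute gripper poses into per-step deltas: translation differences, relative rotations `R[t+1] @ R[t].T`, and gripper-closedness deltas. The rotation part can be sketched in plain numpy, using z-axis rotation matrices in place of scipy's `Rotation` (the angles below are made up for illustration):

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def relative_rotations(angles):
    """Per-step relative rotation R[t+1] @ R[t].T, as in poses_to_diffs."""
    mats = [rot_z(a) for a in angles]
    return [mats[t + 1] @ mats[t].T for t in range(len(mats) - 1)]


# For rotations about a single axis, the relative rotation is simply
# the rotation by the angle difference, so the delta is easy to check:
diffs = relative_rotations([0.0, 0.1, 0.3])
```

In the dataset itself the relative rotation matrices are converted back to Euler angles with `Rotation.from_matrix(...).as_euler(...)`, giving a 3-vector rotation delta per step.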
vjepa2/app/vjepa_droid/train.py ADDED
@@ -0,0 +1,524 @@
+# Copyright (c) Facebook, Inc. and its affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+#
+
+import os
+
+# -- FOR DISTRIBUTED TRAINING ENSURE ONLY 1 DEVICE VISIBLE PER PROCESS
+try:
+    # -- WARNING: IF DOING DISTRIBUTED TRAINING ON A NON-SLURM CLUSTER, MAKE
+    # --          SURE TO UPDATE THIS TO GET LOCAL-RANK ON NODE, OR ENSURE
+    # --          THAT YOUR JOBS ARE LAUNCHED WITH ONLY 1 DEVICE VISIBLE
+    # --          TO EACH PROCESS
+    os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["SLURM_LOCALID"]
+except Exception:
+    pass
+
+import copy
+import gc
+import random
+import time
+
+import numpy as np
+import torch
+import torch.multiprocessing as mp
+import torch.nn.functional as F
+from torch.nn.parallel import DistributedDataParallel
+
+from app.vjepa_droid.droid import init_data
+from app.vjepa_droid.transforms import make_transforms
+from app.vjepa_droid.utils import init_opt, init_video_model, load_checkpoint, load_pretrained
+from src.utils.distributed import init_distributed
+from src.utils.logging import AverageMeter, CSVLogger, get_logger, gpu_timer
+
+# --
+log_timings = True
+log_freq = 10
+CHECKPOINT_FREQ = 1
+GARBAGE_COLLECT_ITR_FREQ = 50
+# --
+
+_GLOBAL_SEED = 0
+random.seed(_GLOBAL_SEED)
+np.random.seed(_GLOBAL_SEED)
+torch.manual_seed(_GLOBAL_SEED)
+torch.backends.cudnn.benchmark = True
+
+
+logger = get_logger(__name__, force=True)
+
+
+def main(args, resume_preempt=False):
+    # ----------------------------------------------------------------------- #
+    #  PASSED IN PARAMS FROM CONFIG FILE
+    # ----------------------------------------------------------------------- #
+
+    # -- META
+    folder = args.get("folder")
+    cfgs_meta = args.get("meta")
+    r_file = cfgs_meta.get("resume_checkpoint", None)
+    p_file = cfgs_meta.get("pretrain_checkpoint", None)
+    load_predictor = cfgs_meta.get("load_predictor", False)
+    context_encoder_key = cfgs_meta.get("context_encoder_key", "encoder")
+    target_encoder_key = cfgs_meta.get("target_encoder_key", "target_encoder")
+    load_encoder = cfgs_meta.get("load_encoder", True)
+    seed = cfgs_meta.get("seed", _GLOBAL_SEED)
+    save_every_freq = cfgs_meta.get("save_every_freq", -1)
+    skip_batches = cfgs_meta.get("skip_batches", -1)
+    use_sdpa = cfgs_meta.get("use_sdpa", False)
+    sync_gc = cfgs_meta.get("sync_gc", False)
+    which_dtype = cfgs_meta.get("dtype")
+    logger.info(f"{which_dtype=}")
+    if which_dtype.lower() == "bfloat16":
+        dtype = torch.bfloat16
+        mixed_precision = True
+    elif which_dtype.lower() == "float16":
+        dtype = torch.float16
+        mixed_precision = True
+    else:
+        dtype = torch.float32
+        mixed_precision = False
+
+    # -- MODEL
+    cfgs_model = args.get("model")
+    compile_model = cfgs_model.get("compile_model", False)
+    use_activation_checkpointing = cfgs_model.get("use_activation_checkpointing", False)
+    model_name = cfgs_model.get("model_name")
+    pred_depth = cfgs_model.get("pred_depth")
+    pred_num_heads = cfgs_model.get("pred_num_heads", None)
+    pred_embed_dim = cfgs_model.get("pred_embed_dim")
+    pred_is_frame_causal = cfgs_model.get("pred_is_frame_causal", True)
+    uniform_power = cfgs_model.get("uniform_power", False)
+    use_rope = cfgs_model.get("use_rope", False)
+    use_silu = cfgs_model.get("use_silu", False)
+    use_pred_silu = cfgs_model.get("use_pred_silu", False)
+    wide_silu = cfgs_model.get("wide_silu", True)
+    use_extrinsics = cfgs_model.get("use_extrinsics", False)
+
+    # -- DATA
+    cfgs_data = args.get("data")
+    datasets = cfgs_data.get("datasets", [])
+    dataset_path = datasets[0]
+    dataset_fpcs = cfgs_data.get("dataset_fpcs")
+    max_num_frames = max(dataset_fpcs)
+    camera_frame = cfgs_data.get("camera_frame", False)
+    camera_views = cfgs_data.get("camera_views", ["left_mp4_path"])
+    stereo_view = cfgs_data.get("stereo_view", False)
+    batch_size = cfgs_data.get("batch_size")
+    tubelet_size = cfgs_data.get("tubelet_size")
+    fps = cfgs_data.get("fps")
+    crop_size = cfgs_data.get("crop_size", 256)
+    patch_size = cfgs_data.get("patch_size")
+    pin_mem = cfgs_data.get("pin_mem", False)
+    num_workers = cfgs_data.get("num_workers", 1)
+    persistent_workers = cfgs_data.get("persistent_workers", True)
+
+    # -- DATA AUGS
+    cfgs_data_aug = args.get("data_aug")
+    horizontal_flip = cfgs_data_aug.get("horizontal_flip", False)
+    ar_range = cfgs_data_aug.get("random_resize_aspect_ratio", [3 / 4, 4 / 3])
+    rr_scale = cfgs_data_aug.get("random_resize_scale", [0.3, 1.0])
+    motion_shift = cfgs_data_aug.get("motion_shift", False)
+    reprob = cfgs_data_aug.get("reprob", 0.0)
+    use_aa = cfgs_data_aug.get("auto_augment", False)
+
+    # -- LOSS
+    cfgs_loss = args.get("loss")
+    loss_exp = cfgs_loss.get("loss_exp")
+    normalize_reps = cfgs_loss.get("normalize_reps")
+    auto_steps = min(cfgs_loss.get("auto_steps", 1), max_num_frames)
+    # --
+    tokens_per_frame = int((crop_size // patch_size) ** 2)
+
+    # -- OPTIMIZATION
+    cfgs_opt = args.get("optimization")
+    ipe = cfgs_opt.get("ipe", None)
+    wd = float(cfgs_opt.get("weight_decay"))
+    final_wd = float(cfgs_opt.get("final_weight_decay"))
+    num_epochs = cfgs_opt.get("epochs")
+    anneal = cfgs_opt.get("anneal")
+    warmup = cfgs_opt.get("warmup")
+    start_lr = cfgs_opt.get("start_lr")
+    lr = cfgs_opt.get("lr")
+    final_lr = cfgs_opt.get("final_lr")
+    enc_lr_scale = cfgs_opt.get("enc_lr_scale", 1.0)
+    betas = cfgs_opt.get("betas", (0.9, 0.999))
+    eps = cfgs_opt.get("eps", 1.0e-8)
+    # ----------------------------------------------------------------------- #
+    # ----------------------------------------------------------------------- #
+
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    torch.backends.cudnn.benchmark = True
+    try:
+        mp.set_start_method("spawn")
+    except Exception:
+        pass
+
+    # -- init torch distributed backend
+    world_size, rank = init_distributed()
+    logger.info(f"Initialized (rank/world-size) {rank}/{world_size}")
+
+    # -- set device
+    if not torch.cuda.is_available():
+        device = torch.device("cpu")
+    else:
+        device = torch.device("cuda:0")
+        torch.cuda.set_device(device)
+
+    # -- log/checkpointing paths
+    log_file = os.path.join(folder, f"log_r{rank}.csv")
+    latest_path = os.path.join(folder, "latest.pt")
+    resume_path = os.path.join(folder, r_file) if r_file is not None else latest_path
+    if not os.path.exists(resume_path):
+        resume_path = None
+
+    # -- make csv_logger
+    csv_logger = CSVLogger(
+        log_file,
+        ("%d", "epoch"),
+        ("%d", "itr"),
+        ("%.5f", "loss"),
+        ("%d", "iter-time(ms)"),
+        ("%d", "gpu-time(ms)"),
+        ("%d", "dataload-time(ms)"),
+        mode="+a",
+    )
+
+    # -- init model
+    encoder, predictor = init_video_model(
+        uniform_power=uniform_power,
+        device=device,
+        patch_size=patch_size,
+        max_num_frames=512,
+        tubelet_size=tubelet_size,
+        model_name=model_name,
+        crop_size=crop_size,
+        pred_depth=pred_depth,
+        pred_num_heads=pred_num_heads,
+        pred_embed_dim=pred_embed_dim,
+        action_embed_dim=7,
+        pred_is_frame_causal=pred_is_frame_causal,
+        use_extrinsics=use_extrinsics,
+        use_sdpa=use_sdpa,
+        use_silu=use_silu,
+        use_pred_silu=use_pred_silu,
+        wide_silu=wide_silu,
+        use_rope=use_rope,
+        use_activation_checkpointing=use_activation_checkpointing,
+    )
+    target_encoder = copy.deepcopy(encoder)
+
+    if compile_model:
+        logger.info("Compiling encoder, target_encoder, and predictor.")
+        torch._dynamo.config.optimize_ddp = False
+        encoder.compile()
+        target_encoder.compile()
+        predictor.compile()
+
+    video_collator = torch.utils.data.default_collate
+    transform = make_transforms(
+        random_horizontal_flip=horizontal_flip,
+        random_resize_aspect_ratio=ar_range,
+        random_resize_scale=rr_scale,
+        reprob=reprob,
+        auto_augment=use_aa,
+        motion_shift=motion_shift,
+        crop_size=crop_size,
+    )
+
+    # -- init data-loaders/samplers
+    (unsupervised_loader, unsupervised_sampler) = init_data(
+        data_path=dataset_path,
+        batch_size=batch_size,
+        frames_per_clip=max_num_frames,
+        tubelet_size=1,
+        fps=fps,
+        camera_views=camera_views,
+        camera_frame=camera_frame,
+        stereo_view=stereo_view,
+        transform=transform,
+        collator=video_collator,
+        num_workers=num_workers,
+        world_size=world_size,
+        pin_mem=pin_mem,
+        persistent_workers=persistent_workers,
+        rank=rank,
+    )
+    _dlen = len(unsupervised_loader)
+    if ipe is None:
+        ipe = _dlen
+    logger.info(f"iterations per epoch/dataset length: {ipe}/{_dlen}")
+
+    # -- init optimizer and scheduler
+    optimizer, scaler, scheduler, wd_scheduler = init_opt(
+        encoder=encoder,
+        predictor=predictor,
+        wd=wd,
+        final_wd=final_wd,
+        start_lr=start_lr,
+        ref_lr=lr,
+        final_lr=final_lr,
+        enc_lr_scale=enc_lr_scale,
+        iterations_per_epoch=ipe,
+        anneal=anneal,
+        warmup=warmup,
+        num_epochs=num_epochs,
+        mixed_precision=mixed_precision,
+        betas=betas,
+        eps=eps,
+    )
+    encoder = DistributedDataParallel(encoder, static_graph=True)
+    predictor = DistributedDataParallel(predictor, static_graph=False, find_unused_parameters=True)
+    target_encoder = DistributedDataParallel(target_encoder)
+    for p in target_encoder.parameters():
+        p.requires_grad = False
+
+    # -- load pretrained weights
+    encoder, predictor, target_encoder = load_pretrained(
+        r_path=p_file,
+        encoder=encoder,
+        predictor=predictor,
+        context_encoder_key=context_encoder_key,
+        target_encoder_key=target_encoder_key,
+        target_encoder=target_encoder,
+        load_predictor=load_predictor,
+        load_encoder=load_encoder,
+    )
+
+    start_epoch = 0
+    # -- load training checkpoint
+    if os.path.exists(latest_path):
+        (
+            encoder,
+            predictor,
+            target_encoder,
+            optimizer,
+            scaler,
+            start_epoch,
+        ) = load_checkpoint(
+            r_path=resume_path,
+            encoder=encoder,
+            predictor=predictor,
+            target_encoder=target_encoder,
+            opt=optimizer,
+            scaler=scaler,
+        )
+        for _ in range(start_epoch * ipe):
+            scheduler.step()
+            wd_scheduler.step()
+
+    def save_checkpoint(epoch, path):
+        if rank != 0:
+            return
+        save_dict = {
+            "encoder": encoder.state_dict(),
+            "predictor": predictor.state_dict(),
+            "opt": optimizer.state_dict(),
+            "scaler": None if scaler is None else scaler.state_dict(),
+            "target_encoder": target_encoder.state_dict(),
+            "epoch": epoch,
+            "loss": loss_meter.avg,
+            "batch_size": batch_size,
+            "world_size": world_size,
+            "lr": lr,
+        }
+        try:
+            torch.save(save_dict, path)
+        except Exception as e:
+            logger.info(f"Encountered exception when saving checkpoint: {e}")
+
+    logger.info("Initializing loader...")
+    unsupervised_sampler.set_epoch(start_epoch)
+    loader = iter(unsupervised_loader)
+
+    if skip_batches > 0:
+        logger.info(f"Skip {skip_batches} batches")
+        # -- update distributed-data-loader epoch
+
+        for itr in range(skip_batches):
+            if itr % 10 == 0:
+                logger.info(f"Skip {itr}/{skip_batches} batches")
+            try:
+                _ = next(loader)
+            except Exception:
+                loader = iter(unsupervised_loader)
+                _ = next(loader)
+
+    if sync_gc:
+        gc.disable()
+        gc.collect()
+
+    # -- TRAINING LOOP
+    for epoch in range(start_epoch, num_epochs):
+        logger.info("Epoch %d" % (epoch + 1))
+
+        loss_meter = AverageMeter()
+        jloss_meter = AverageMeter()
+        sloss_meter = AverageMeter()
+        iter_time_meter = AverageMeter()
+        gpu_time_meter = AverageMeter()
+        data_elapsed_time_meter = AverageMeter()
+
+        for itr in range(ipe):
+            itr_start_time = time.time()
+
+            iter_retries = 0
+            iter_successful = False
+            while not iter_successful:
+                try:
+                    sample = next(loader)
+                    iter_successful = True
+                except StopIteration:
+                    logger.info("Exhausted data loaders. Refreshing...")
+                    unsupervised_sampler.set_epoch(epoch)
+                    loader = iter(unsupervised_loader)
+                except Exception as e:
+                    NUM_RETRIES = 5
+                    if iter_retries < NUM_RETRIES:
+                        logger.warning(f"Encountered exception when loading data (num retries {iter_retries}):\n{e}")
+                        iter_retries += 1
+                        time.sleep(5)
+                    else:
+                        logger.warning(f"Exceeded max retries ({NUM_RETRIES}) when loading data. Skipping batch.")
+                        raise e
+
+            def load_clips():
+                clips = sample[0].to(device, non_blocking=True)  # [B C T H W]
+                actions = sample[1].to(device, dtype=torch.float, non_blocking=True)  # [B T-1 7]
+                states = sample[2].to(device, dtype=torch.float, non_blocking=True)  # [B T 7]
+                extrinsics = sample[3].to(device, dtype=torch.float, non_blocking=True)  # [B T 7]
+                return (clips, actions, states, extrinsics)
+
+            clips, actions, states, extrinsics = load_clips()
+            data_elapsed_time_ms = (time.time() - itr_start_time) * 1000.0
+
+            if sync_gc and (itr + 1) % GARBAGE_COLLECT_ITR_FREQ == 0:
+                logger.info("Running garbage collection...")
+                gc.collect()
+
+            def train_step():
+                _new_lr = scheduler.step()
+                _new_wd = wd_scheduler.step()
+                # --
+
+                def forward_target(c):
+                    with torch.no_grad():
+                        c = c.permute(0, 2, 1, 3, 4).flatten(0, 1).unsqueeze(2).repeat(1, 1, 2, 1, 1)
+                        h = target_encoder(c)
+                        h = h.view(batch_size, max_num_frames, -1, h.size(-1)).flatten(1, 2)
+                        if normalize_reps:
+                            h = F.layer_norm(h, (h.size(-1),))
+                        return h
+
+                def forward_predictions(z):
+
+                    def _step_predictor(_z, _a, _s, _e):
+                        _z = predictor(_z, _a, _s, _e)
+                        if normalize_reps:
+                            _z = F.layer_norm(_z, (_z.size(-1),))
+                        return _z
+
+                    # -- one step of predictor with teacher forcing
+                    _z, _a, _s, _e = z[:, :-tokens_per_frame], actions, states[:, :-1], extrinsics[:, :-1]
+                    z_tf = _step_predictor(_z, _a, _s, _e)
+
+                    # -- full auto-regressive rollouts of predictor
+                    _z = torch.cat([z[:, : tokens_per_frame], z_tf[:, : tokens_per_frame]], dim=1)
+                    for n in range(1, auto_steps):
+                        _a, _s, _e = actions[:, : n + 1], states[:, : n + 1], extrinsics[:, : n + 1]
+                        _z_nxt = _step_predictor(_z, _a, _s, _e)[:, -tokens_per_frame:]
+                        _z = torch.cat([_z, _z_nxt], dim=1)
+                    z_ar = _z[:, tokens_per_frame:]
+
+                    return z_tf, z_ar
+
+                def loss_fn(z, h):
+                    _h = h[:, tokens_per_frame : z.size(1) + tokens_per_frame]
+                    return torch.mean(torch.abs(z - _h) ** loss_exp) / loss_exp
+
+                # Step 1. Forward
+                with torch.cuda.amp.autocast(dtype=dtype, enabled=mixed_precision):
+                    h = forward_target(clips)
+                    z_tf, z_ar = forward_predictions(h)
+                    jloss = loss_fn(z_tf, h)
+                    sloss = loss_fn(z_ar, h)
+                    loss = jloss + sloss
+
+                # Step 2. Backward & step
+                if mixed_precision:
+                    scaler.scale(loss).backward()
+                    scaler.unscale_(optimizer)
+                else:
+                    loss.backward()
+                if mixed_precision:
+                    scaler.step(optimizer)
+                    scaler.update()
+                else:
+                    optimizer.step()
+                optimizer.zero_grad()
+
+                return (
+                    float(loss),
+                    float(jloss),
+                    float(sloss),
+                    _new_lr,
+                    _new_wd,
+                )
+
+            (
+                loss,
+                jloss,
+                sloss,
+                _new_lr,
+                _new_wd,
+            ), gpu_etime_ms = gpu_timer(train_step)
+            iter_elapsed_time_ms = (time.time() - itr_start_time) * 1000.0
+            loss_meter.update(loss)
+            jloss_meter.update(jloss)
+            sloss_meter.update(sloss)
+            iter_time_meter.update(iter_elapsed_time_ms)
+            gpu_time_meter.update(gpu_etime_ms)
+            data_elapsed_time_meter.update(data_elapsed_time_ms)
+
+            # -- Logging
+            def log_stats():
+                csv_logger.log(epoch + 1, itr, loss, iter_elapsed_time_ms, gpu_etime_ms, data_elapsed_time_ms)
+                if (itr % log_freq == 0) or (itr == ipe - 1) or np.isnan(loss) or np.isinf(loss):
+                    logger.info(
+                        "[%d, %5d] loss: %.3f [%.2f, %.2f] "
+                        "[wd: %.2e] [lr: %.2e] "
+                        "[mem: %.2e] "
+                        "[iter: %.1f ms] "
+                        "[gpu: %.1f ms] "
+                        "[data: %.1f ms]"
+                        % (
+                            epoch + 1,
+                            itr,
+                            loss_meter.avg,
+                            jloss_meter.avg,
+                            sloss_meter.avg,
+                            _new_wd,
+                            _new_lr,
+                            torch.cuda.max_memory_allocated() / 1024.0**2,
+                            iter_time_meter.avg,
+                            gpu_time_meter.avg,
+                            data_elapsed_time_meter.avg,
+                        )
+                    )
+
+            log_stats()
+            assert not np.isnan(loss), "loss is nan"
+
+        # -- Save Checkpoint
+        logger.info("avg. loss %.3f" % loss_meter.avg)
+        # -- Save Last
+        if epoch % CHECKPOINT_FREQ == 0 or epoch == (num_epochs - 1):
+            save_checkpoint(epoch + 1, latest_path)
+        if save_every_freq > 0 and epoch % save_every_freq == 0:
+            save_every_file = f"e{epoch}.pt"
+            save_every_path = os.path.join(folder, save_every_file)
+            save_checkpoint(epoch + 1, save_every_path)
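The train step above combines two losses: a teacher-forced prediction (`jloss`, each step conditioned on ground-truth history) and an autoregressive rollout (`sloss`, each step conditioned on the model's own previous outputs). The distinction can be sketched on a toy scalar "world model" — the dynamics function below is made up for illustration:

```python
def step(z):
    # Toy "predictor": imperfect dynamics (true dynamics double the state).
    return 2.0 * z + 0.1


ground_truth = [1.0, 2.0, 4.0, 8.0]  # exact doubling trajectory

# Teacher forcing: predict each next state from the *true* previous state.
teacher_forced = [step(z) for z in ground_truth[:-1]]

# Autoregressive rollout: feed the model its own predictions.
rollout = [ground_truth[0]]
for _ in range(len(ground_truth) - 1):
    rollout.append(step(rollout[-1]))
rollout = rollout[1:]

# Teacher-forced error stays constant; rollout error compounds.
tf_err = [abs(p - t) for p, t in zip(teacher_forced, ground_truth[1:])]
ar_err = [abs(p - t) for p, t in zip(rollout, ground_truth[1:])]
```

Training on both terms, as in `train_step`, penalizes not just one-step accuracy but also the drift that accumulates when the predictor consumes its own outputs.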
vjepa2/app/vjepa_droid/transforms.py ADDED
@@ -0,0 +1,156 @@
1
+ # Copyright (c) Facebook, Inc. and its affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+ #
7
+
8
+ import torch
9
+ import torchvision.transforms as transforms
10
+
11
+ import src.datasets.utils.video.transforms as video_transforms
12
+ from src.datasets.utils.video.randerase import RandomErasing
13
+
14
+
15
+ def make_transforms(
16
+ random_horizontal_flip=True,
17
+ random_resize_aspect_ratio=(3 / 4, 4 / 3),
18
+ random_resize_scale=(0.3, 1.0),
19
+ reprob=0.0,
20
+ auto_augment=False,
21
+ motion_shift=False,
22
+ crop_size=224,
23
+ normalize=((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
24
+ ):
25
+
26
+ _frames_augmentation = VideoTransform(
27
+ random_horizontal_flip=random_horizontal_flip,
28
+ random_resize_aspect_ratio=random_resize_aspect_ratio,
29
+ random_resize_scale=random_resize_scale,
30
+ reprob=reprob,
31
+ auto_augment=auto_augment,
32
+ motion_shift=motion_shift,
33
+ crop_size=crop_size,
34
+ normalize=normalize,
35
+ )
36
+ return _frames_augmentation
37
+
38
+
39
+ class VideoTransform(object):
40
+
41
+ def __init__(
42
+ self,
43
+ random_horizontal_flip=True,
44
+ random_resize_aspect_ratio=(3 / 4, 4 / 3),
45
+ random_resize_scale=(0.3, 1.0),
46
+ reprob=0.0,
47
+ auto_augment=False,
48
+ motion_shift=False,
49
+ crop_size=224,
50
+ normalize=((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
51
+ ):
52
+
53
+ self.random_horizontal_flip = random_horizontal_flip
54
+ self.random_resize_aspect_ratio = random_resize_aspect_ratio
55
+ self.random_resize_scale = random_resize_scale
56
+ self.auto_augment = auto_augment
57
+ self.motion_shift = motion_shift
58
+ self.crop_size = crop_size
59
+ self.mean = torch.tensor(normalize[0], dtype=torch.float32)
60
+ self.std = torch.tensor(normalize[1], dtype=torch.float32)
61
+ if not self.auto_augment:
62
+ # Without auto-augment, PIL and tensor conversions simply scale uint8 space by 255.
63
+ self.mean *= 255.0
64
+ self.std *= 255.0
65
+
+        self.autoaug_transform = video_transforms.create_random_augment(
+            input_size=(crop_size, crop_size),
+            # auto_augment="rand-m4-n4-w1-mstd0.5-inc1",
+            auto_augment="rand-m7-n4-mstd0.5-inc1",
+            interpolation="bicubic",
+        )
+
+        self.spatial_transform = (
+            video_transforms.random_resized_crop_with_shift if motion_shift else video_transforms.random_resized_crop
+        )
+
+        self.reprob = reprob
+        self.erase_transform = RandomErasing(
+            reprob,
+            mode="pixel",
+            max_count=1,
+            num_splits=1,
+            device="cpu",
+        )
+
+    def __call__(self, buffer):
+
+        if self.auto_augment:
+            buffer = [transforms.ToPILImage()(frame) for frame in buffer]
+            buffer = self.autoaug_transform(buffer)
+            buffer = [transforms.ToTensor()(img) for img in buffer]
+            buffer = torch.stack(buffer)  # T C H W
+            buffer = buffer.permute(0, 2, 3, 1)  # T H W C
+        elif torch.is_tensor(buffer):
+            # TODO: ensure input is always a tensor?
+            buffer = buffer.to(torch.float32)
+        else:
+            buffer = torch.tensor(buffer, dtype=torch.float32)
+
+        buffer = buffer.permute(3, 0, 1, 2)  # T H W C -> C T H W
+
+        buffer = self.spatial_transform(
+            images=buffer,
+            target_height=self.crop_size,
+            target_width=self.crop_size,
+            scale=self.random_resize_scale,
+            ratio=self.random_resize_aspect_ratio,
+        )
+        if self.random_horizontal_flip:
+            buffer, _ = video_transforms.horizontal_flip(0.5, buffer)
+
+        buffer = _tensor_normalize_inplace(buffer, self.mean, self.std)
+        if self.reprob > 0:
+            buffer = buffer.permute(1, 0, 2, 3)
+            buffer = self.erase_transform(buffer)
+            buffer = buffer.permute(1, 0, 2, 3)
+
+        return buffer
+
+
+def tensor_normalize(tensor, mean, std):
+    """
+    Normalize a given tensor by subtracting the mean and dividing the std.
+    Args:
+        tensor (tensor): tensor to normalize.
+        mean (tensor or list): mean value to subtract.
+        std (tensor or list): std to divide.
+    """
+    if tensor.dtype == torch.uint8:
+        tensor = tensor.float()
+        tensor = tensor / 255.0
+    if type(mean) == list:
+        mean = torch.tensor(mean)
+    if type(std) == list:
+        std = torch.tensor(std)
+    tensor = tensor - mean
+    tensor = tensor / std
+    return tensor
+
+
+def _tensor_normalize_inplace(tensor, mean, std):
+    """
+    Normalize a given tensor by subtracting the mean and dividing the std.
+    Args:
+        tensor (tensor): tensor to normalize (with dimensions C, T, H, W).
+        mean (tensor): mean value to subtract (in 0 to 255 floats).
+        std (tensor): std to divide (in 0 to 255 floats).
+    """
+    if tensor.dtype == torch.uint8:
+        tensor = tensor.float()
+
+    C, T, H, W = tensor.shape
+    tensor = tensor.view(C, -1).permute(1, 0)  # Make C the last dimension
+    tensor.sub_(mean).div_(std)
+    tensor = tensor.permute(1, 0).view(C, T, H, W)  # Put C back in front
+    return tensor
vjepa2/app/vjepa_droid/utils.py ADDED
@@ -0,0 +1,253 @@
+# Copyright (c) Facebook, Inc. and its affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+#
+
+import logging
+import sys
+
+import torch
+
+import src.models.ac_predictor as vit_ac_pred
+import src.models.vision_transformer as video_vit
+from src.utils.checkpoint_loader import robust_checkpoint_loader
+from src.utils.schedulers import CosineWDSchedule, WSDSchedule
+
+logging.basicConfig(stream=sys.stdout, level=logging.INFO)
+logger = logging.getLogger()
+
+
+def load_pretrained(
+    r_path,
+    encoder=None,
+    predictor=None,
+    target_encoder=None,
+    context_encoder_key="encoder",
+    target_encoder_key="target_encoder",
+    load_predictor=False,
+    load_encoder=True,
+):
+    logger.info(f"Loading pretrained model from {r_path}")
+    checkpoint = robust_checkpoint_loader(r_path, map_location=torch.device("cpu"))
+
+    epoch = checkpoint["epoch"]
+
+    if load_encoder:
+        # -- loading encoder
+        pretrained_dict = checkpoint[context_encoder_key]
+        pretrained_dict = {k.replace("backbone.", ""): v for k, v in pretrained_dict.items()}
+        msg = encoder.load_state_dict(pretrained_dict, strict=False)
+        logger.info(f"loaded pretrained encoder from epoch {epoch} with msg: {msg}")
+
+    if load_predictor:
+        # -- loading predictor
+        pretrained_dict = checkpoint["predictor"]
+        pretrained_dict = {k.replace("backbone.", ""): v for k, v in pretrained_dict.items()}
+        msg = predictor.load_state_dict(pretrained_dict, strict=False)
+        logger.info(f"loaded pretrained predictor from epoch {epoch} with msg: {msg}")
+
+    # -- loading target_encoder
+    if load_encoder:
+        if target_encoder is not None:
+            print(list(checkpoint.keys()))
+            pretrained_dict = checkpoint[target_encoder_key]
+            pretrained_dict = {k.replace("backbone.", ""): v for k, v in pretrained_dict.items()}
+            msg = target_encoder.load_state_dict(pretrained_dict, strict=False)
+            logger.info(f"loaded pretrained target encoder from epoch {epoch} with msg: {msg}")
+
+    del checkpoint
+
+    return (
+        encoder,
+        predictor,
+        target_encoder,
+    )
+
+
+def load_checkpoint(
+    r_path,
+    encoder,
+    predictor,
+    target_encoder,
+    opt=None,
+    scaler=None,
+    replace_kw=["backbone."],
+):
+    logger.info(f"Loading checkpoint from {r_path}")
+    checkpoint = robust_checkpoint_loader(r_path, map_location=torch.device("cpu"))
+
+    epoch = checkpoint["epoch"]
+
+    # -- loading encoder
+    pretrained_dict = checkpoint["encoder"]
+    for kw in replace_kw:
+        pretrained_dict = {k.replace(kw, ""): v for k, v in pretrained_dict.items()}
+    msg = encoder.load_state_dict(pretrained_dict, strict=False)
+    logger.info(f"loaded pretrained encoder from epoch {epoch} with msg: {msg}")
+
+    # -- loading predictor
+    pretrained_dict = checkpoint["predictor"]
+    for kw in replace_kw:
+        pretrained_dict = {k.replace(kw, ""): v for k, v in pretrained_dict.items()}
+    msg = predictor.load_state_dict(pretrained_dict, strict=False)
+    logger.info(f"loaded pretrained predictor from epoch {epoch} with msg: {msg}")
+
+    # -- loading target_encoder
+    if target_encoder is not None:
+        print(list(checkpoint.keys()))
+        pretrained_dict = checkpoint["target_encoder"]
+        for kw in replace_kw:
+            pretrained_dict = {k.replace(kw, ""): v for k, v in pretrained_dict.items()}
+        msg = target_encoder.load_state_dict(pretrained_dict, strict=False)
+        logger.info(f"loaded pretrained target encoder from epoch {epoch} with msg: {msg}")
+
+    # -- loading optimizer
+    if opt is not None:
+        opt.load_state_dict(checkpoint["opt"])
+
+    if scaler is not None:
+        scaler.load_state_dict(checkpoint["scaler"])
+
+    logger.info(f"loaded optimizers from epoch {epoch}")
+    logger.info(f"read-path: {r_path}")
+    del checkpoint
+
+    return (
+        encoder,
+        predictor,
+        target_encoder,
+        opt,
+        scaler,
+        epoch,
+    )
+
+
+def init_video_model(
+    device,
+    patch_size=16,
+    max_num_frames=16,
+    tubelet_size=2,
+    model_name="vit_base",
+    crop_size=224,
+    pred_depth=6,
+    pred_num_heads=None,
+    pred_embed_dim=384,
+    uniform_power=False,
+    use_sdpa=False,
+    use_rope=False,
+    use_silu=False,
+    use_pred_silu=False,
+    wide_silu=False,
+    pred_is_frame_causal=True,
+    use_activation_checkpointing=False,
+    return_all_tokens=False,
+    action_embed_dim=7,
+    use_extrinsics=False,
+    old_pred=False,
+):
+    encoder = video_vit.__dict__[model_name](
+        img_size=crop_size,
+        patch_size=patch_size,
+        num_frames=max_num_frames,
+        tubelet_size=tubelet_size,
+        uniform_power=uniform_power,
+        use_sdpa=use_sdpa,
+        use_silu=use_silu,
+        wide_silu=wide_silu,
+        use_activation_checkpointing=use_activation_checkpointing,
+        use_rope=use_rope,
+    )
+
+    predictor = vit_ac_pred.__dict__["vit_ac_predictor"](
+        img_size=crop_size,
+        patch_size=patch_size,
+        num_frames=max_num_frames,
+        tubelet_size=tubelet_size,
+        embed_dim=encoder.embed_dim,
+        predictor_embed_dim=pred_embed_dim,
+        action_embed_dim=action_embed_dim,
+        depth=pred_depth,
+        is_frame_causal=pred_is_frame_causal,
+        num_heads=encoder.num_heads if pred_num_heads is None else pred_num_heads,
+        uniform_power=uniform_power,
+        use_rope=use_rope,
+        use_sdpa=use_sdpa,
+        use_silu=use_pred_silu,
+        wide_silu=wide_silu,
+        use_extrinsics=use_extrinsics,
+        use_activation_checkpointing=use_activation_checkpointing,
+    )
+
+    encoder.to(device)
+    predictor.to(device)
+    logger.info(encoder)
+    logger.info(predictor)
+
+    def count_parameters(model):
+        return sum(p.numel() for p in model.parameters() if p.requires_grad)
+
+    logger.info(f"Encoder number of parameters: {count_parameters(encoder)}")
+    logger.info(f"Predictor number of parameters: {count_parameters(predictor)}")
+
+    return encoder, predictor
+
+
+def init_opt(
+    encoder,
+    predictor,
+    iterations_per_epoch,
+    start_lr,
+    ref_lr,
+    warmup,
+    anneal,
+    num_epochs,
+    wd=1e-6,
+    final_wd=1e-6,
+    final_lr=0.0,
+    mixed_precision=False,
+    betas=(0.9, 0.999),
+    eps=1e-8,
+    zero_init_bias_wd=True,
+    enc_lr_scale=1.0,
+):
+    param_groups = [
+        {
+            "params": (p for n, p in encoder.named_parameters() if ("bias" not in n) and (len(p.shape) != 1)),
+            "lr_scale": enc_lr_scale,
+        },
+        {
+            "params": (p for n, p in predictor.named_parameters() if ("bias" not in n) and (len(p.shape) != 1)),
+        },
+        {
+            "params": (p for n, p in encoder.named_parameters() if ("bias" in n) or (len(p.shape) == 1)),
+            "WD_exclude": zero_init_bias_wd,
+            "weight_decay": 0,
+            "lr_scale": enc_lr_scale,
+        },
+        {
+            "params": (p for n, p in predictor.named_parameters() if ("bias" in n) or (len(p.shape) == 1)),
+            "WD_exclude": zero_init_bias_wd,
+            "weight_decay": 0,
+        },
+    ]
+
+    optimizer = torch.optim.AdamW(param_groups, betas=betas, eps=eps)
+    scheduler = WSDSchedule(
+        optimizer,
+        warmup_steps=int(warmup * iterations_per_epoch),
+        anneal_steps=int(anneal * iterations_per_epoch),
+        start_lr=start_lr,
+        ref_lr=ref_lr,
+        final_lr=final_lr,
+        T_max=int(num_epochs * iterations_per_epoch),
+    )
+    wd_scheduler = CosineWDSchedule(
+        optimizer,
+        ref_wd=wd,
+        final_wd=final_wd,
+        T_max=int(num_epochs * iterations_per_epoch),
+    )
+    scaler = torch.cuda.amp.GradScaler() if mixed_precision else None
+    return optimizer, scaler, scheduler, wd_scheduler
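`init_opt` above builds four parameter groups so that biases and 1-D tensors (e.g. norm scales) are routed into groups with `weight_decay: 0` while everything else keeps the scheduled decay. The selection predicate can be sketched standalone; the parameter names and shapes below are hypothetical stand-ins for `encoder.named_parameters()`:

```python
# Hypothetical (name, shape) pairs standing in for named_parameters().
params = [
    ("patch_embed.weight", (384, 3, 16, 16)),
    ("patch_embed.bias", (384,)),
    ("blocks.0.norm.weight", (384,)),
    ("blocks.0.attn.qkv.weight", (1152, 384)),
]

# Same predicate as the param_groups in init_opt: a parameter skips
# weight decay if its name contains "bias" or it is one-dimensional.
decayed = [n for n, s in params if "bias" not in n and len(s) != 1]
no_decay = [n for n, s in params if "bias" in n or len(s) == 1]
print(decayed)   # ['patch_embed.weight', 'blocks.0.attn.qkv.weight']
print(no_decay)  # ['patch_embed.bias', 'blocks.0.norm.weight']
```

Excluding these parameters from decay is a common ViT training choice: shrinking norm scales and biases toward zero tends to hurt rather than regularize.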
vjepa2/configs/eval/vitg-384/coin.yaml ADDED
@@ -0,0 +1,163 @@
+cpus_per_task: 16
+eval_name: video_classification_frozen
+folder: /your_folder/evals/vitg-384/coin
+mem_per_gpu: 220G
+nodes: 16
+resume_checkpoint: true
+tag: coin-vitg16-384-16x8x3
+tasks_per_node: 8
+experiment:
+  classifier:
+    num_heads: 16
+    num_probe_blocks: 4
+  data:
+    dataset_type: VideoDataset
+    dataset_train: /your_data_folder/COIN/train_paths.csv
+    dataset_val: /your_data_folder/COIN/val_paths.csv
+    frame_step: 4
+    frames_per_clip: 16
+    num_classes: 180
+    num_segments: 8
+    num_views_per_segment: 3
+    resolution: 384
+  optimization:
+    batch_size: 1
+    multihead_kwargs:
+    - final_lr: 0.0
+      final_weight_decay: 0.01
+      lr: 0.005
+      start_lr: 0.005
+      warmup: 0.0
+      weight_decay: 0.01
+    - final_lr: 0.0
+      final_weight_decay: 0.01
+      lr: 0.003
+      start_lr: 0.003
+      warmup: 0.0
+      weight_decay: 0.01
+    - final_lr: 0.0
+      final_weight_decay: 0.01
+      lr: 0.001
+      start_lr: 0.001
+      warmup: 0.0
+      weight_decay: 0.01
+    - final_lr: 0.0
+      final_weight_decay: 0.01
+      lr: 0.0003
+      start_lr: 0.0003
+      warmup: 0.0
+      weight_decay: 0.01
+    - final_lr: 0.0
+      final_weight_decay: 0.01
+      lr: 0.0001
+      start_lr: 0.0001
+      warmup: 0.0
+      weight_decay: 0.01
+    - final_lr: 0.0
+      final_weight_decay: 0.1
+      lr: 0.005
+      start_lr: 0.005
+      warmup: 0.0
+      weight_decay: 0.1
+    - final_lr: 0.0
+      final_weight_decay: 0.1
+      lr: 0.003
+      start_lr: 0.003
+      warmup: 0.0
+      weight_decay: 0.1
+    - final_lr: 0.0
+      final_weight_decay: 0.1
+      lr: 0.001
+      start_lr: 0.001
+      warmup: 0.0
+      weight_decay: 0.1
+    - final_lr: 0.0
+      final_weight_decay: 0.1
+      lr: 0.0003
+      start_lr: 0.0003
+      warmup: 0.0
+      weight_decay: 0.1
+    - final_lr: 0.0
+      final_weight_decay: 0.1
+      lr: 0.0001
+      start_lr: 0.0001
+      warmup: 0.0
+      weight_decay: 0.1
+    - final_lr: 0.0
+      final_weight_decay: 0.4
+      lr: 0.005
+      start_lr: 0.005
+      warmup: 0.0
+      weight_decay: 0.4
+    - final_lr: 0.0
+      final_weight_decay: 0.4
+      lr: 0.003
+      start_lr: 0.003
+      warmup: 0.0
+      weight_decay: 0.4
+    - final_lr: 0.0
+      final_weight_decay: 0.4
+      lr: 0.001
+      start_lr: 0.001
+      warmup: 0.0
+      weight_decay: 0.4
+    - final_lr: 0.0
+      final_weight_decay: 0.4
+      lr: 0.0003
+      start_lr: 0.0003
+      warmup: 0.0
+      weight_decay: 0.4
+    - final_lr: 0.0
+      final_weight_decay: 0.4
+      lr: 0.0001
+      start_lr: 0.0001
+      warmup: 0.0
+      weight_decay: 0.4
+    - final_lr: 0.0
+      final_weight_decay: 0.8
+      lr: 0.005
+      start_lr: 0.005
+      warmup: 0.0
+      weight_decay: 0.8
+    - final_lr: 0.0
+      final_weight_decay: 0.8
+      lr: 0.003
+      start_lr: 0.003
+      warmup: 0.0
+      weight_decay: 0.8
+    - final_lr: 0.0
+      final_weight_decay: 0.8
+      lr: 0.001
+      start_lr: 0.001
+      warmup: 0.0
+      weight_decay: 0.8
+    - final_lr: 0.0
+      final_weight_decay: 0.8
+      lr: 0.0003
+      start_lr: 0.0003
+      warmup: 0.0
+      weight_decay: 0.8
+    - final_lr: 0.0
+      final_weight_decay: 0.8
+      lr: 0.0001
+      start_lr: 0.0001
+      warmup: 0.0
+      weight_decay: 0.8
+    num_epochs: 20
+    use_bfloat16: true
+    use_pos_embed: false
+  model_kwargs:
+    checkpoint: /your_vjepa2_checkpoints/vitg-384.pt
+    module_name: evals.video_classification_frozen.modelcustom.vit_encoder_multiclip
+    pretrain_kwargs:
+      encoder:
+        checkpoint_key: target_encoder
+        img_temporal_dim_size: null
+        model_name: vit_giant_xformers
+        patch_size: 16
+        tubelet_size: 2
+        uniform_power: true
+        use_rope: true
+    wrapper_kwargs:
+      max_frames: 128
+      use_pos_embed: false
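The `multihead_kwargs` list in `coin.yaml` enumerates a full grid of 5 learning rates against 4 weight-decay values, one frozen-probe head per combination. One way to regenerate the same 20 entries programmatically, should the sweep need editing (this script is an illustration, not part of the repo):

```python
from itertools import product

# Sweep axes as they appear in the config: weight decay varies in the
# outer loop, learning rate in the inner loop.
lrs = [0.005, 0.003, 0.001, 0.0003, 0.0001]
wds = [0.01, 0.1, 0.4, 0.8]

grid = [
    {"final_lr": 0.0, "final_weight_decay": wd, "lr": lr,
     "start_lr": lr, "warmup": 0.0, "weight_decay": wd}
    for wd, lr in product(wds, lrs)
]
print(len(grid))  # 20
print(grid[0])    # first entry: lr=0.005, weight_decay=0.01
```

Training all heads jointly amortizes the frozen-encoder forward pass across the whole hyperparameter sweep, which is why a single eval run can try 20 configurations.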