Ryukijano committed on
Commit 8682216 · verified · 1 Parent(s): cc13c69

Deploy minimal DINO-Endo Space app

Files changed (13)
  1. .dockerignore +15 -0
  2. .gitignore +3 -0
  3. Dockerfile +38 -0
  4. README.md +130 -14
  5. app.py +305 -243
  6. model/__init__.py +0 -0
  7. model/mstcn.py +183 -0
  8. model/resnet.py +19 -0
  9. model/transformer.py +246 -0
  10. model_registry.py +156 -0
  11. predictor.py +642 -0
  12. requirements.txt +13 -4
  13. scripts/smoke_test.py +57 -0
.dockerignore ADDED
@@ -0,0 +1,15 @@
+ __pycache__/
+ *.py[cod]
+ *.so
+ *.egg-info/
+ .git/
+ .gitignore
+ .cache/
+ .pytest_cache/
+ .mypy_cache/
+ .streamlit/
+ .env
+ .env.*
+ venv/
+ .venv/
+ *.ipynb
.gitignore ADDED
@@ -0,0 +1,3 @@
+ .cache/
+ __pycache__/
+ *.pyc
Dockerfile ADDED
@@ -0,0 +1,38 @@
+ FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
+
+ ENV DEBIAN_FRONTEND=noninteractive \
+     PYTHONUNBUFFERED=1 \
+     PIP_NO_CACHE_DIR=1 \
+     SPACE_MODEL_DIR=/app/model \
+     SPACE_ENABLED_MODELS=dinov2 \
+     SPACE_DEFAULT_MODEL=dinov2
+
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     bash \
+     curl \
+     wget \
+     procps \
+     python3 \
+     python3-pip \
+     python3-venv \
+     git \
+     git-lfs \
+     ffmpeg \
+     libgl1 \
+     libglib2.0-0 \
+     && rm -rf /var/lib/apt/lists/*
+
+ RUN useradd -m -u 1000 user && mkdir -p /app && chown user:user /app
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH
+ WORKDIR /app
+
+ COPY --chown=user requirements.txt /app/requirements.txt
+ RUN python3 -m pip install --upgrade pip && \
+     python3 -m pip install -r requirements.txt
+
+ COPY --chown=user . /app
+
+ EXPOSE 7860
+ CMD ["python3", "-m", "streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0", "--server.headless=true"]
README.md CHANGED
@@ -1,14 +1,130 @@
- ---
- title: LLaMA Mesh
- emoji: 👀
- colorFrom: red
- colorTo: green
- sdk: gradio
- sdk_version: 5.6.0
- app_file: app.py
- pinned: false
- license: llama3.1
- short_description: Create 3D mesh by chatting.
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: DINO-ENDO Phase Recognition
+ emoji: 🩺
+ colorFrom: blue
+ colorTo: green
+ sdk: docker
+ app_port: 7860
+ ---
+
+ # DINO-ENDO Streamlit Space
+
+ This folder is an isolated Hugging Face Space scaffold for the phase-recognition models in this repository.
+ It is intentionally separate from the existing FastAPI webapp and defaults to a **DINO-Endo demo** on paid GPU hardware such as **1x A10G (24 GB VRAM)**.
+ The same code can still expose AI-Endo and V-JEPA2 when you opt into them through environment variables.
+
+ ## Supported model families
+
+ - **AI-Endo**
+   - `resnet50.pth`
+   - `fusion.pth`
+   - `transformer.pth`
+ - **DINO-Endo**
+   - `dinov2_vit14s_latest_checkpoint.pth`
+   - `fusion_transformer_decoder_best_model.pth`
+   - optional `dinov2_decoder.pth`
+   - vendored `dinov2/` source tree
+ - **V-JEPA2**
+   - `vjepa_encoder_human.pt`
+   - `mlp_decoder_human.pth`
+   - vendored `vjepa2/` source tree
+
+ ## Weight delivery strategy
+
+ The default design is:
+
+ 1. Keep the **Space repo mostly code-only**.
+ 2. Upload weights to one or more **Hugging Face model repos**.
+ 3. Let the Space populate `model/` (or `SPACE_MODEL_DIR`) on demand via `huggingface_hub`.
+
+ This works better than checking all weights directly into the Space repo because code and weights stay versioned separately and Space rebuilds stay lighter.
+ A fully local `model/` folder is still supported as a fallback.
+
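+ As a rough illustration of step 3, here is a minimal standalone sketch using a hypothetical repo ID `your-org/dino-endo-weights`; the committed `model_registry.py` implements the full version with env-var-driven repo IDs, revisions, and subfolders:
+
+ ```python
+ from pathlib import Path
+
+ from huggingface_hub import hf_hub_download
+
+ # Hypothetical repo ID; in the Space it comes from PHASE_MODEL_REPO_ID / DINO_MODEL_REPO_ID.
+ REPO_ID = "your-org/dino-endo-weights"
+ MODEL_DIR = Path("model")
+ MODEL_DIR.mkdir(exist_ok=True)
+
+ for filename in (
+     "dinov2_vit14s_latest_checkpoint.pth",
+     "fusion_transformer_decoder_best_model.pth",
+ ):
+     if not (MODEL_DIR / filename).exists():  # prefer a local copy; download only when missing
+         hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=MODEL_DIR)
+ ```
+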
+ ## Default Space behavior
+
+ The Docker Space is configured to boot as a **DINO-Endo-first demo**:
+
+ - `SPACE_ENABLED_MODELS=dinov2`
+ - `SPACE_DEFAULT_MODEL=dinov2`
+
+ If you want the same Space build to expose multiple model families again, override those environment variables in Space Settings, for example:
+
+ ```text
+ SPACE_ENABLED_MODELS=dinov2,aiendo,vjepa2
+ SPACE_DEFAULT_MODEL=dinov2
+ ```
+
+ The Dockerfile is also set up to be **HF Dev Mode compatible**:
+
+ - app code lives under `/app`
+ - `/app` is owned by uid `1000`
+ - the required Dev Mode packages (`bash`, `curl`, `wget`, `procps`, `git`, `git-lfs`) are installed
+
+ ## Runtime configuration
+
+ The app looks for model files in `SPACE_MODEL_DIR` first (default: `./model`).
+ If a required checkpoint is missing locally, it will try to download it from the configured model repo(s).
+
+ ### Common environment variables
+
+ - `SPACE_ENABLED_MODELS` — comma-separated list of model families to expose in the UI
+ - `SPACE_DEFAULT_MODEL` — default selected model when multiple model families are enabled
+ - `SPACE_MODEL_DIR` — local directory where checkpoints should live (default: `./model`)
+ - `PHASE_MODEL_REPO_ID` — shared HF model repo for all weights
+ - `PHASE_MODEL_REVISION` — optional shared revision/tag/commit
+ - `HF_TOKEN` — only needed for private or gated repos
+
+ If `HF_HOME` / `HF_HUB_CACHE` are not set explicitly, the app will automatically use persistent `/data` storage when it exists and otherwise fall back to a local cache inside the Space folder.
+
+ ### Per-model overrides
+
+ - `AIENDO_MODEL_REPO_ID`, `DINO_MODEL_REPO_ID`, `VJEPA2_MODEL_REPO_ID`
+ - `AIENDO_MODEL_REVISION`, `DINO_MODEL_REVISION`, `VJEPA2_MODEL_REVISION`
+ - `AIENDO_MODEL_SUBFOLDER`, `DINO_MODEL_SUBFOLDER`, `VJEPA2_MODEL_SUBFOLDER`
+
+ Use subfolder env vars if you store multiple model families in one repo under different directories.
+
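+ For example, assuming a hypothetical shared repo `your-org/phase-models` with one directory per model family, the Space Settings could combine the shared repo ID with per-family subfolders:
+
+ ```text
+ PHASE_MODEL_REPO_ID=your-org/phase-models
+ AIENDO_MODEL_SUBFOLDER=aiendo
+ DINO_MODEL_SUBFOLDER=dinov2
+ VJEPA2_MODEL_SUBFOLDER=vjepa2
+ ```
+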
+ ## Local development vs. publishing
+
+ The required vendored `dinov2/` and `vjepa2/` source trees are now staged inside this folder, so the Space scaffold is self-contained.
+ If those upstream source trees change and you want to refresh the copies here before publishing, run:
+
+ ```bash
+ python scripts/stage_vendor_sources.py --overwrite
+ ```
+
+ ## Publishing checklist
+
+ 1. Populate the Space folder files here.
+ 2. Run `python scripts/stage_vendor_sources.py --overwrite` if you need to refresh the vendored source copies.
+ 3. Push the contents of this folder to a Hugging Face **Docker Space** (see the sketch below).
+ 4. Upload your checkpoints to HF **model repo(s)**.
+ 5. Configure the relevant repo IDs (and `HF_TOKEN` only if the repos are private).
+
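+ Steps 3 and 4 can be done with plain `git` plus the `huggingface_hub` CLI; a sketch only, with hypothetical user, Space, and repo names:
+
+ ```bash
+ # Step 3: push this folder to a Docker Space (hypothetical names)
+ git remote add space https://huggingface.co/spaces/your-user/dino-endo-demo
+ git push space main
+
+ # Step 4: upload local checkpoints to a model repo (hypothetical names)
+ huggingface-cli upload your-org/dino-endo-weights ./model . --repo-type=model
+ ```
+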
+ ## Local smoke test
+
+ Once the Space dependencies are installed, you can smoke test a predictor directly:
+
+ ```bash
+ python scripts/smoke_test.py --model dinov2 --model-dir /path/to/model
+ python scripts/smoke_test.py --model aiendo --model-dir /path/to/model
+ python scripts/smoke_test.py --model vjepa2 --model-dir /path/to/model
+ ```
+
+ ## Scope of v1
+
+ - Streamlit UI
+ - DINO-Endo demo by default, with optional multi-model selector when enabled
+ - image upload and video upload
+ - per-frame phase timeline output for video
+ - JSON / CSV export
+
+ Not included in v1:
+
+ - auth / user management
+ - SQL database
+ - PDF/HTML report generation
+ - background queue processing
+ - polyp segmentation
app.py CHANGED
@@ -1,243 +1,305 @@
- import os
- os.environ["TOKENIZERS_PARALLELISM"] = "false"
-
- import gradio as gr
- import os
- # import spaces
- from transformers import GemmaTokenizer, AutoModelForCausalLM
- from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
- from threading import Thread
- import torch
-
- # Set an environment variable
- HF_TOKEN = os.environ.get("HF_TOKEN", None)
-
-
- DESCRIPTION = '''
- <div>
- <h1 style="text-align: center;">LLaMA-Mesh</h1>
- <div>
- <a style="display:inline-block" href="https://research.nvidia.com/labs/toronto-ai/LLaMA-Mesh/"><img src='https://img.shields.io/badge/public_website-8A2BE2'></a>
- <a style="display:inline-block; margin-left: .5em" href="https://github.com/nv-tlabs/LLaMA-Mesh"><img src='https://img.shields.io/github/stars/nv-tlabs/LLaMA-Mesh?style=social'/></a>
- </div>
- <p>LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models. <a style="display:inline-block" href="https://research.nvidia.com/labs/toronto-ai/LLaMA-Mesh/">[Project Page]</a> <a style="display:inline-block" href="https://github.com/nv-tlabs/LLaMA-Mesh">[Code]</a></p>
- <p> Notice: (1) The default token length is 4096. If you observe incomplete generated meshes, try to increase the maximum token length into 8192.</p>
- <p>(2) We only support generating a single mesh per dialog round. To generate another mesh, click the "clear" button and start a new dialog.</p>
- <p>(3) If the LLM refuses to generate a 3D mesh, try adding more explicit instructions to the prompt, such as "create a 3D model of a table <strong>in OBJ format</strong>." A more effective approach is to request the mesh generation at the start of the dialog.</p>
- </div>
- '''
-
- LICENSE = """
- <p/>
-
- ---
- Built with Meta Llama 3.1 8B
- """
-
- PLACEHOLDER = """
- <div style="padding: 30px; text-align: center; display: flex; flex-direction: column; align-items: center;">
- <h1 style="font-size: 28px; margin-bottom: 2px; opacity: 0.55;">LLaMA-Mesh</h1>
- <p style="font-size: 18px; margin-bottom: 2px; opacity: 0.65;">Create 3D meshes by chatting.</p>
- </div>
- """
-
-
- css = """
- h1 {
-   text-align: center;
-   display: block;
- }
-
- #duplicate-button {
-   margin: auto;
-   color: white;
-   background: #1565c0;
-   border-radius: 100vh;
- }
- """
- # Load the tokenizer and model
- model_path = "Zhengyi/LLaMA-Mesh"
- tokenizer = AutoTokenizer.from_pretrained(model_path)
- model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda:0", torch_dtype=torch.float16).to('cuda')
- terminators = [
-     tokenizer.eos_token_id,
-     tokenizer.convert_tokens_to_ids("<|eot_id|>")
- ]
-
-
- from trimesh.exchange.gltf import export_glb
- import gradio as gr
- import trimesh
- import numpy as np
- import tempfile
- def apply_gradient_color(mesh_text):
-     """
-     Apply a gradient color to the mesh vertices based on the Y-axis and save as GLB.
-     Args:
-         mesh_text (str): The input mesh in OBJ format as a string.
-     Returns:
-         str: Path to the GLB file with gradient colors applied.
-     """
-     # Load the mesh
-     temp_file = tempfile.NamedTemporaryFile(suffix=f"", delete=False).name
-     with open(temp_file+".obj", "w") as f:
-         f.write(mesh_text)
-     # return temp_file
-     mesh = trimesh.load_mesh(temp_file+".obj", file_type='obj')
-
-     # Get vertex coordinates
-     vertices = mesh.vertices
-     y_values = vertices[:, 1]  # Y-axis values
-
-     # Normalize Y values to range [0, 1] for color mapping
-     y_normalized = (y_values - y_values.min()) / (y_values.max() - y_values.min())
-
-     # Generate colors: Map normalized Y values to RGB gradient (e.g., blue to red)
-     colors = np.zeros((len(vertices), 4))  # RGBA
-     colors[:, 0] = y_normalized  # Red channel
-     colors[:, 2] = 1 - y_normalized  # Blue channel
-     colors[:, 3] = 1.0  # Alpha channel (fully opaque)
-
-     # Attach colors to mesh vertices
-     mesh.visual.vertex_colors = colors
-
-     # Export to GLB format
-     glb_path = temp_file+".glb"
-     with open(glb_path, "wb") as f:
-         f.write(export_glb(mesh))
-
-     return glb_path
-
- def visualize_mesh(mesh_text):
-     """
-     Convert the provided 3D mesh text into a visualizable format.
-     This function assumes the input is in OBJ format.
-     """
-     temp_file = "temp_mesh.obj"
-     with open(temp_file, "w") as f:
-         f.write(mesh_text)
-     return temp_file
-
- # @spaces.GPU(duration=120)
- def chat_llama3_8b(message: str,
-                    history: list,
-                    temperature: float,
-                    max_new_tokens: int
-                    ) -> str:
-     """
-     Generate a streaming response using the llama3-8b model.
-     Args:
-         message (str): The input message.
-         history (list): The conversation history used by ChatInterface.
-         temperature (float): The temperature for generating the response.
-         max_new_tokens (int): The maximum number of new tokens to generate.
-     Returns:
-         str: The generated response.
-     """
-     conversation = []
-     for user, assistant in history:
-         conversation.extend([{"role": "user", "content": user}, {"role": "assistant", "content": assistant}])
-     conversation.append({"role": "user", "content": message})
-
-     input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)
-
-     streamer = TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)
-
-     generate_kwargs = dict(
-         input_ids= input_ids,
-         streamer=streamer,
-         max_new_tokens=max_new_tokens,
-         do_sample=True,
-         temperature=temperature,
-         eos_token_id=terminators,
-     )
-     # This will enforce greedy generation (do_sample=False) when the temperature is passed 0, avoiding the crash.
-     if temperature == 0:
-         generate_kwargs['do_sample'] = False
-
-     t = Thread(target=model.generate, kwargs=generate_kwargs)
-     t.start()
-
-     outputs = []
-     for text in streamer:
-         outputs.append(text)
-         #print(outputs)
-         yield "".join(outputs)
-
-
- # Gradio block
- chatbot=gr.Chatbot(height=450, placeholder=PLACEHOLDER, label='Gradio ChatInterface')
-
- with gr.Blocks(fill_height=True, css=css) as demo:
-     with gr.Column():
-         gr.Markdown(DESCRIPTION)
-         # gr.DuplicateButton(value="Duplicate Space for private use", elem_id="duplicate-button")
-         with gr.Row():
-             with gr.Column(scale=3):
-                 gr.ChatInterface(
-                     fn=chat_llama3_8b,
-                     chatbot=chatbot,
-                     fill_height=True,
-                     additional_inputs_accordion=gr.Accordion(label="⚙️ Parameters", open=False, render=False),
-                     additional_inputs=[
-                         gr.Slider(minimum=0,
-                                   maximum=1,
-                                   step=0.1,
-                                   value=0.95,
-                                   label="Temperature",
-                                   render=False),
-                         gr.Slider(minimum=128,
-                                   maximum=8192,
-                                   step=1,
-                                   value=4096,
-                                   label="Max new tokens",
-                                   render=False),
-                     ],
-                     examples=[
-                         ['Create a 3D model of a wooden hammer'],
-                         ['Create a 3D model of a pyramid in obj format'],
-                         ['Create a 3D model of a cabinet.'],
-                         ['Create a low poly 3D model of a coffe cup'],
-                         ['Create a 3D model of a table.'],
-                         ["Create a low poly 3D model of a tree."],
-                         ['Write a python code for sorting.'],
-                         ['How to setup a human base on Mars? Give short answer.'],
-                         ['Explain theory of relativity to me like I’m 8 years old.'],
-                         ['What is 9,000 * 9,000?'],
-                         ['Create a 3D model of a soda can.'],
-                         ['Create a 3D model of a sword.'],
-                         ['Create a 3D model of a wooden barrel'],
-                         ['Create a 3D model of a chair.']
-                     ],
-                     cache_examples=False,
-                 )
-                 gr.Markdown(LICENSE)
-
-             with gr.Column(scale=2):
-                 output_model = gr.Model3D(
-                     label="3D Mesh Visualization",
-                     interactive=False,
-                 )
-                 gr.Markdown("You can copy the generated 3d objects in the left and paste in the textbox below. Put the button and you will see the visualization of the 3D mesh.")
-
-                 # Add the text box for 3D mesh input and button
-                 mesh_input = gr.Textbox(
-                     label="3D Mesh Input",
-                     placeholder="Paste your 3D mesh in OBJ format here...",
-                     lines=5,
-                 )
-                 visualize_button = gr.Button("Visualize 3D Mesh")
-
-                 # Link the button to the visualization function
-                 visualize_button.click(
-                     fn=apply_gradient_color,
-                     inputs=[mesh_input],
-                     outputs=[output_model]
-                 )
-
- if __name__ == "__main__":
-     demo.launch()
-
-
-
-
+ from __future__ import annotations
+
+ import json
+ import os
+ import tempfile
+ import time
+ from collections import Counter
+ from pathlib import Path
+
+ import cv2
+ import numpy as np
+ import pandas as pd
+ import streamlit as st
+ import torch
+ from PIL import Image
+
+ from model_registry import MODEL_SPECS, ensure_model_artifacts, get_model_source_summary
+ from predictor import MODEL_LABELS, PHASE_LABELS, create_predictor, normalize_model_key
+
+ st.set_page_config(page_title="DINO-Endo Phase Recognition", layout="wide")
+
+
+ def _phase_index(phase: str) -> int:
+     try:
+         return PHASE_LABELS.index(phase)
+     except ValueError:
+         return -1
+
+
+ def _image_to_rgb(uploaded_file) -> np.ndarray:
+     image = Image.open(uploaded_file).convert("RGB")
+     return np.array(image)
+
+
+ def _enabled_model_keys() -> list[str]:
+     configured = os.getenv("SPACE_ENABLED_MODELS", "").strip()
+     if not configured:
+         return list(MODEL_SPECS.keys())
+
+     enabled_keys = []
+     seen = set()
+     for token in configured.split(","):
+         raw = token.strip()
+         if not raw:
+             continue
+         normalized = normalize_model_key(raw)
+         if normalized not in MODEL_SPECS:
+             raise RuntimeError(f"SPACE_ENABLED_MODELS contains unsupported model '{raw}'")
+         if normalized not in seen:
+             enabled_keys.append(normalized)
+             seen.add(normalized)
+
+     if not enabled_keys:
+         raise RuntimeError("SPACE_ENABLED_MODELS did not resolve to any supported models")
+     return enabled_keys
+
+
+ def _default_model_key(enabled_model_keys: list[str]) -> str:
+     configured = os.getenv("SPACE_DEFAULT_MODEL", "").strip()
+     if not configured:
+         return "dinov2" if "dinov2" in enabled_model_keys else enabled_model_keys[0]
+
+     normalized = normalize_model_key(configured)
+     if normalized not in enabled_model_keys:
+         raise RuntimeError(
+             f"SPACE_DEFAULT_MODEL '{configured}' is not enabled by SPACE_ENABLED_MODELS"
+         )
+     return normalized
+
+
+ def _space_caption(enabled_model_keys: list[str]) -> str:
+     if enabled_model_keys == ["dinov2"]:
+         return "Streamlit Hugging Face Space demo for the DINO-Endo phase-recognition stack."
+     return "DINO-first Streamlit Hugging Face Space demo for DINO-Endo, AI-Endo, and V-JEPA2."
+
+
+ def _ensure_predictor(model_key: str):
+     active_key = st.session_state.get("active_model_key")
+     active_predictor = st.session_state.get("active_predictor")
+
+     if active_predictor is not None and active_key != model_key:
+         active_predictor.unload()
+         st.session_state.pop("active_predictor", None)
+         st.session_state.pop("active_model_key", None)
+
+     if st.session_state.get("active_predictor") is None:
+         with st.spinner(f"Preparing {MODEL_LABELS[model_key]}..."):
+             model_dir = ensure_model_artifacts(model_key)
+             predictor = create_predictor(model_key, model_dir=str(model_dir))
+             predictor.warm_up()
+         st.session_state["active_predictor"] = predictor
+         st.session_state["active_model_key"] = model_key
+
+     return st.session_state["active_predictor"]
+
+
+ def _analyse_video(uploaded_file, predictor, frame_stride: int, max_frames: int):
+     suffix = Path(uploaded_file.name).suffix or ".mp4"
+     with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
+         tmp.write(uploaded_file.getbuffer())
+         temp_path = Path(tmp.name)
+
+     capture = cv2.VideoCapture(str(temp_path))
+     if not capture.isOpened():
+         temp_path.unlink(missing_ok=True)
+         raise RuntimeError("Unable to open uploaded video")
+
+     total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT) or 0)
+     fps = float(capture.get(cv2.CAP_PROP_FPS) or 0.0)
+     progress = st.progress(0)
+     status = st.empty()
+
+     predictor.reset_state()
+     records = []
+     processed = 0
+     frame_index = 0
+
+     try:
+         while True:
+             ok, frame = capture.read()
+             if not ok:
+                 break
+
+             if frame_index % frame_stride != 0:
+                 frame_index += 1
+                 continue
+
+             rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+             started = time.perf_counter()
+             result = predictor.predict(rgb)
+             elapsed_ms = (time.perf_counter() - started) * 1000.0
+
+             probs = result.get("probs", [0.0, 0.0, 0.0, 0.0])
+             record = {
+                 "frame_index": frame_index,
+                 "timestamp_sec": round(frame_index / fps, 3) if fps > 0 else None,
+                 "phase": result.get("phase", "unknown"),
+                 "phase_id": _phase_index(result.get("phase", "unknown")),
+                 "confidence": float(result.get("confidence", 0.0)),
+                 "frames_used": int(result.get("frames_used", processed + 1)),
+                 "idle": float(probs[0]) if len(probs) > 0 else 0.0,
+                 "marking": float(probs[1]) if len(probs) > 1 else 0.0,
+                 "injection": float(probs[2]) if len(probs) > 2 else 0.0,
+                 "dissection": float(probs[3]) if len(probs) > 3 else 0.0,
+                 "inference_ms": round(elapsed_ms, 3),
+             }
+             records.append(record)
+             processed += 1
+
+             if total_frames > 0:
+                 progress.progress(min(frame_index + 1, total_frames) / total_frames)
+             else:
+                 progress.progress(min(processed / max_frames, 1.0))
+             status.caption(f"Processed {processed} sampled frames")
+
+             frame_index += 1
+             if processed >= max_frames:
+                 break
+     finally:
+         capture.release()
+         temp_path.unlink(missing_ok=True)
+         predictor.reset_state()
+
+     progress.empty()
+     status.empty()
+     return records, {"fps": fps, "total_frames": total_frames, "sampled_frames": processed}
+
+
+ def _records_to_frame(records):
+     if not records:
+         return pd.DataFrame(columns=["frame_index", "timestamp_sec", "phase", "confidence"])
+     return pd.DataFrame.from_records(records)
+
+
+ def _download_payloads(df: pd.DataFrame):
+     json_payload = df.to_json(orient="records", indent=2).encode("utf-8")
+     csv_payload = df.to_csv(index=False).encode("utf-8")
+     return json_payload, csv_payload
+
+
+ def _render_single_result(result: dict):
+     probs = result.get("probs", [0.0, 0.0, 0.0, 0.0])
+     metrics = st.columns(3)
+     metrics[0].metric("Predicted phase", result.get("phase", "unknown").upper())
+     metrics[1].metric("Confidence", f"{float(result.get('confidence', 0.0)):.1%}")
+     metrics[2].metric("Frames used", int(result.get("frames_used", 1)))
+
+     prob_df = pd.DataFrame({"phase": list(PHASE_LABELS), "probability": probs})
+     st.bar_chart(prob_df.set_index("phase"))
+     st.download_button(
+         label="Download JSON",
+         data=json.dumps(result, indent=2).encode("utf-8"),
+         file_name="phase_prediction.json",
+         mime="application/json",
+         key="download-single-json",
+     )
+
+
+ def _render_video_results(records, meta):
+     if not records:
+         st.warning("No frames were processed from the uploaded video.")
+         return
+
+     df = _records_to_frame(records)
+     counts = Counter(df["phase"].tolist())
+     dominant_phase, dominant_count = counts.most_common(1)[0]
+
+     metrics = st.columns(4)
+     metrics[0].metric("Sampled frames", int(meta["sampled_frames"]))
+     metrics[1].metric("Dominant phase", dominant_phase.upper())
+     metrics[2].metric("Mean confidence", f"{df['confidence'].mean():.1%}")
+     metrics[3].metric("Average inference", f"{df['inference_ms'].mean():.1f} ms")
+
+     chart_df = df.copy()
+     if "timestamp_sec" in chart_df and chart_df["timestamp_sec"].notna().any():
+         chart_df = chart_df.set_index("timestamp_sec")
+     else:
+         chart_df = chart_df.set_index("frame_index")
+
+     st.subheader("Confidence timeline")
+     st.line_chart(chart_df[["confidence"]])
+
+     st.subheader("Phase timeline")
+     st.line_chart(chart_df[["phase_id"]])
+
+     st.subheader("Per-frame predictions")
+     st.dataframe(df, use_container_width=True, hide_index=True)
+
+     json_payload, csv_payload = _download_payloads(df)
+     left, right = st.columns(2)
+     left.download_button("Download JSON", json_payload, file_name="phase_timeline.json", mime="application/json")
+     right.download_button("Download CSV", csv_payload, file_name="phase_timeline.csv", mime="text/csv")
+
+
+ def main():
+     enabled_model_keys = _enabled_model_keys()
+     default_model_key = _default_model_key(enabled_model_keys)
+
+     st.title("DINO-Endo Surgical Phase Recognition")
+     st.caption(_space_caption(enabled_model_keys))
+
+     st.sidebar.markdown("### Model")
+     if len(enabled_model_keys) == 1:
+         model_key = enabled_model_keys[0]
+         st.sidebar.write(MODEL_LABELS[model_key])
+     else:
+         model_key = st.sidebar.selectbox(
+             "Model",
+             options=enabled_model_keys,
+             index=enabled_model_keys.index(default_model_key),
+             format_func=lambda key: MODEL_LABELS[key],
+         )
+
+     source_summary = get_model_source_summary(model_key)
+     st.sidebar.markdown("### Runtime")
+     st.sidebar.write(f"CUDA available: `{torch.cuda.is_available()}`")
+     if torch.cuda.is_available():
+         st.sidebar.write(f"Device: `{torch.cuda.get_device_name(torch.cuda.current_device())}`")
+     st.sidebar.write(f"Model dir: `{source_summary['model_dir']}`")
+     st.sidebar.write(f"HF repo: `{source_summary['repo_id'] or 'local-only'}`")
+     if source_summary["subfolder"]:
+         st.sidebar.write(f"Repo subfolder: `{source_summary['subfolder']}`")
+
+     image_tab, video_tab = st.tabs(["Image", "Video"])
+
+     with image_tab:
+         uploaded_image = st.file_uploader("Upload an RGB frame", type=["png", "jpg", "jpeg"], key="image-uploader")
+         if uploaded_image is not None:
+             rgb = _image_to_rgb(uploaded_image)
+             st.image(rgb, caption=uploaded_image.name, use_container_width=True)
+             if st.button("Run image inference", key="run-image"):
+                 predictor = _ensure_predictor(model_key)
+                 predictor.reset_state()
+                 started = time.perf_counter()
+                 result = predictor.predict(rgb)
+                 result["inference_ms"] = round((time.perf_counter() - started) * 1000.0, 3)
+                 predictor.reset_state()
+                 _render_single_result(result)
+
+     with video_tab:
+         frame_stride = st.slider("Analyze every Nth frame", min_value=1, max_value=30, value=5, step=1)
+         max_frames = st.slider("Maximum sampled frames", min_value=10, max_value=600, value=180, step=10)
+         uploaded_video = st.file_uploader(
+             "Upload a video",
+             type=["mp4", "mov", "avi", "mkv", "webm", "m4v"],
+             key="video-uploader",
+         )
+         if uploaded_video is not None:
+             st.video(uploaded_video)
+             if st.button("Analyze video", key="run-video"):
+                 predictor = _ensure_predictor(model_key)
+                 records, meta = _analyse_video(uploaded_video, predictor, frame_stride=frame_stride, max_frames=max_frames)
+                 _render_video_results(records, meta)
+
+     if st.sidebar.button("Unload active model"):
+         predictor = st.session_state.get("active_predictor")
+         if predictor is not None:
+             predictor.unload()
+             st.session_state.pop("active_predictor", None)
+             st.session_state.pop("active_model_key", None)
+             st.sidebar.success("Model unloaded")
+
+
+ if __name__ == "__main__":
+     main()
model/__init__.py ADDED
File without changes
model/mstcn.py ADDED
@@ -0,0 +1,183 @@
+ from typing import Optional
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ import copy
+
+
+ class MultiStageModel(nn.Module):
+     def __init__(self, mstcn_stages, mstcn_layers, mstcn_f_maps, mstcn_f_dim, out_features, mstcn_causal_conv, is_train=True, dropout_prob: float = 0.0):
+         self.num_stages = mstcn_stages
+         self.num_layers = mstcn_layers
+         self.num_f_maps = mstcn_f_maps
+         self.dim = mstcn_f_dim
+         self.num_classes = out_features
+         self.causal_conv = mstcn_causal_conv
+         self.is_train = is_train
+         print(f"num_stages_classification: {self.num_stages}, num_layers: {self.num_layers}, num_f_maps: {self.num_f_maps}, dim: {self.dim}")
+         super(MultiStageModel, self).__init__()
+         self.stage1 = SingleStageModel(self.num_layers,
+                                        self.num_f_maps,
+                                        self.dim,
+                                        self.num_classes,
+                                        causal_conv=self.causal_conv,
+                                        is_train=is_train,
+                                        dropout_prob=dropout_prob)
+         self.stages = SingleStageModel(self.num_layers,
+                                        self.num_f_maps,
+                                        self.num_classes,
+                                        self.num_classes,
+                                        causal_conv=self.causal_conv,
+                                        is_train=is_train,
+                                        dropout_prob=dropout_prob)
+
+         self.smoothing = False
+
+     def forward(self, x):
+         """
+         If is_train is False (inference), return first-stage features [B, num_f_maps, T]
+         so downstream Transformer receives 32-d features, matching the working pipeline.
+         If is_train is True (training/classification), return stacked class logits.
+         """
+         out = self.stage1(x)
+         if not self.is_train:
+             # Inference path: return temporal features (num_f_maps channels)
+             return out
+
+         # Training path: run second stage on class probabilities
+         outputs_classes = out.unsqueeze(0)
+         out_classes = self.stages(F.softmax(out, dim=1))
+         outputs_classes = torch.cat((outputs_classes, out_classes.unsqueeze(0)), dim=0)
+         return outputs_classes
+
+     @staticmethod
+     def add_model_specific_args(parser):  # pragma: no cover
+         mstcn_reg_model_specific_args = parser.add_argument_group(title='mstcn reg specific args options')
+         mstcn_reg_model_specific_args.add_argument("--mstcn_stages", default=4, type=int)
+         mstcn_reg_model_specific_args.add_argument("--mstcn_layers", default=10, type=int)
+         mstcn_reg_model_specific_args.add_argument("--mstcn_f_maps", default=64, type=int)
+         mstcn_reg_model_specific_args.add_argument("--mstcn_f_dim", default=2048, type=int)
+         mstcn_reg_model_specific_args.add_argument("--mstcn_causal_conv", action='store_true')
+         return parser
+
+
+ class SingleStageModel(nn.Module):
+     def __init__(self,
+                  num_layers: int,
+                  num_f_maps: int,
+                  dim: int,
+                  num_classes: int,
+                  causal_conv: bool = False,
+                  is_train: bool = True,
+                  dropout_prob: float = 0.0):
+         super(SingleStageModel, self).__init__()
+         self.conv_1x1 = nn.Conv1d(dim, num_f_maps, 1)
+         self.is_train = is_train
+         self.layers = nn.ModuleList([
+             copy.deepcopy(DilatedResidualLayer(2 ** i, num_f_maps, num_f_maps, causal_conv=causal_conv, dropout_prob=dropout_prob))
+             for i in range(num_layers)
+         ])
+         if self.is_train:
+             self.conv_out_classes = nn.Conv1d(num_f_maps, num_classes, 1)
+
+     def forward(self, x):
+         out = self.conv_1x1(x)
+         for layer in self.layers:
+             out = layer(out)
+         if self.is_train:
+             out = self.conv_out_classes(out)
+         return out
+
+
+ class DilatedResidualLayer(nn.Module):
+     def __init__(self,
+                  dilation: int,
+                  in_channels: int,
+                  out_channels: int,
+                  causal_conv: bool = False,
+                  kernel_size: int = 3,
+                  dropout_prob: float = 0.0):
+         super(DilatedResidualLayer, self).__init__()
+         self.causal_conv = causal_conv
+         self.dilation = dilation
+         self.kernel_size = kernel_size
+         padding = (dilation * (kernel_size - 1)) if self.causal_conv else dilation
+         self.conv_dilated = nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding, dilation=dilation)
+         self.conv_1x1 = nn.Conv1d(out_channels, out_channels, 1)
+         self.dropout = nn.Dropout(dropout_prob)
+
+         self.activation = nn.ReLU(inplace=True)
+
+     def forward(self, x):
+         out = self.activation(self.conv_dilated(x))
+         out = self.dropout(out)
+         if self.causal_conv:
+             out = out[:, :, :-(self.dilation * 2)]
+         out = self.activation(self.conv_1x1(out))
+         out = self.dropout(out)
+         return x + out
+
+
+ class SingleStageModel1(nn.Module):
+     def __init__(self,
+                  num_layers,
+                  num_f_maps,
+                  dim,
+                  num_classes,
+                  causal_conv=False):
+         super(SingleStageModel1, self).__init__()
+         self.conv_1x1 = nn.Conv1d(dim, num_f_maps, 1)
+
+         self.layers = nn.ModuleList([
+             copy.deepcopy(
+                 DilatedResidualLayer(2**i,
+                                      num_f_maps,
+                                      num_f_maps,
+                                      causal_conv=causal_conv))
+             for i in range(num_layers)
+         ])
+         self.conv_out_classes = nn.Conv1d(num_f_maps, num_classes, 1)
+
+     def forward(self, x):
+         out = self.conv_1x1(x)
+         for layer in self.layers:
+             out = layer(out)
+         out_classes = self.conv_out_classes(out)
+         return out_classes, out
+
+
+ class MultiStageModel1(nn.Module):
+     def __init__(self, mstcn_stages, mstcn_layers, mstcn_f_maps, mstcn_f_dim, out_features, mstcn_causal_conv):
+         self.num_stages = mstcn_stages  # 4 #2
+         self.num_layers = mstcn_layers  # 10 #5
+         self.num_f_maps = mstcn_f_maps  # 64 #64
+         self.dim = mstcn_f_dim  # 2048 # 2048
+         self.num_classes = out_features  # 7
+         self.causal_conv = mstcn_causal_conv
+         print(
+             f"num_stages_classification: {self.num_stages}, num_layers: {self.num_layers}, num_f_maps:"
+             f" {self.num_f_maps}, dim: {self.dim}")
+         super(MultiStageModel1, self).__init__()
+         self.stage1 = SingleStageModel1(self.num_layers,
+                                         self.num_f_maps,
+                                         self.dim,
+                                         self.num_classes,
+                                         causal_conv=self.causal_conv)
+         self.stages = nn.ModuleList([
+             copy.deepcopy(
+                 SingleStageModel1(self.num_layers,
+                                   self.num_f_maps,
+                                   self.num_classes,
+                                   self.num_classes,
+                                   causal_conv=self.causal_conv))
+             for s in range(self.num_stages - 1)
+         ])
+         self.smoothing = False
+
+     def forward(self, x):
+         out_classes, _ = self.stage1(x)
+         outputs_classes = out_classes.unsqueeze(0)
+         for s in self.stages:
+             out_classes, out = s(F.softmax(out_classes, dim=1))
+             outputs_classes = torch.cat(
+                 (outputs_classes, out_classes.unsqueeze(0)), dim=0)
+         return out
model/resnet.py ADDED
@@ -0,0 +1,19 @@
+ import torch
+ import torch.nn as nn
+ import torchvision
+ from torchvision import models, transforms
+ from torchvision.models import ResNet50_Weights
+
+ # User's ResNet variant (adapted for 2048-d features, no head)
+ class ResNet(nn.Module):
+     def __init__(self, out_channels=4, has_fc=False):
+         super(ResNet, self).__init__()
+         self.resnet = torchvision.models.resnet50(pretrained=False)
+         if not has_fc:
+             self.resnet.fc = nn.Identity()  # Output 2048-d features
+         else:
+             # Keep the original fc layer for compatibility
+             pass
+
+     def forward(self, x):
+         return self.resnet(x)
model/transformer.py ADDED
@@ -0,0 +1,246 @@
+ import torch
+ import numpy as np
+ import torch.nn as nn
+ import math
+
+
+ # some code adapted from https://wmathor.com/index.php/archives/1455/
+
+
+ class ScaledDotProductAttention(nn.Module):
+     def __init__(self, d_k, n_heads):
+         super(ScaledDotProductAttention, self).__init__()
+         self.d_k = d_k
+         self.n_heads = n_heads
+
+     def forward(self, Q, K, V):
+         '''
+         Q: [batch_size, n_heads, len_q=1, d_k]
+         K: [batch_size, n_heads, len_k, d_k]
+         V: [batch_size, n_heads, len_v(=len_k), d_v]
+         '''
+         scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(
+             self.d_k)  # scores : [batch_size, n_heads, len_q, len_k]
+
+         attn = nn.Softmax(dim=-1)(scores)  # [batch_size, n_heads, len_q, len_k]
+         context = torch.matmul(attn, V)  # [batch_size, n_heads, len_q, d_v]
+         return context, attn
+
+
+ class MultiHeadAttention(nn.Module):
+     def __init__(self, d_model, d_k, d_v, n_heads, len_q, len_k):
+         super(MultiHeadAttention, self).__init__()
+
+         self.W_Q = nn.Linear(d_model, d_k * n_heads, bias=False)
+         self.W_K = nn.Linear(d_model, d_k * n_heads, bias=False)
+         self.W_V = nn.Linear(d_model, d_v * n_heads, bias=False)
+         self.fc = nn.Linear(n_heads * d_v, d_model, bias=False)  # Linear only changes the last dimension
+
+         self.d_model = d_model
+         self.d_k = d_k
+         self.d_v = d_v
+         self.n_heads = n_heads
+         self.ScaledDotProductAttention = ScaledDotProductAttention(self.d_k, n_heads)
+         self.len_q = len_q
+         self.len_k = len_k
+
+     def forward(self, input_Q, input_K, input_V):
+         '''
+         input_Q: [batch_size, len_q, d_model] [512, 1, 5] --> Spatial info
+         input_K: [batch_size, len_k, d_model] [512, 30, 5] --> Temporal info
+         input_V: [batch_size, len_v(=len_k), d_model] [512, 30, 5] --> Temporal info
+         '''
+         residual, batch_size = input_Q, input_Q.size(0)
+         # (B, S, D) -proj-> (B, S, D_new) -split-> (B, S, H, W) -trans-> (B, H, S, W)
+         Q = self.W_Q(input_Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)  # Q: [batch_size, n_heads, len_q, d_k]
+
+         K = self.W_K(input_K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)  # K: [batch_size, n_heads, len_k, d_k]
+
+         V = self.W_V(input_V).view(batch_size, -1, self.n_heads, self.d_v).transpose(1, 2)  # V: [batch_size, n_heads, len_v(=len_k), d_v]
+
+         # context: [batch_size, n_heads, len_q, d_v], attn: [batch_size, n_heads, len_q, len_k]
+         context, attn = self.ScaledDotProductAttention(Q, K, V)
+         context = context.transpose(1, 2).reshape(batch_size, -1,
+                                                   self.n_heads * self.d_v)  # context: [batch_size, len_q, n_heads * d_v]
+         output = self.fc(context)  # [batch_size, len_q, d_model]
+         layer_norm = nn.LayerNorm(self.d_model).to(output.device)
+         return layer_norm(output + residual), attn  # All batch size dimensions are preserved.
+
+
+ class PoswiseFeedForwardNet(nn.Module):
+     def __init__(self, d_model, d_ff):
+         super(PoswiseFeedForwardNet, self).__init__()
+         self.fc = nn.Sequential(
+             nn.Linear(d_model, d_ff, bias=False),
+             nn.ReLU(),
+             nn.Linear(d_ff, d_model, bias=False)
+         )
+         self.d_model = d_model
+
+     def forward(self, inputs):
+         '''
+         inputs: [batch_size, seq_len, d_model]
+         '''
+         residual = inputs
+         output = self.fc(inputs)
+         layer_norm = nn.LayerNorm(self.d_model).to(output.device)
+         return layer_norm(output + residual)  # [batch_size, seq_len, d_model]
+
+
+ class EncoderLayer(nn.Module):
+     def __init__(self, d_model, d_ff, d_k, d_v, n_heads, len_q):
+         super(EncoderLayer, self).__init__()
+         self.enc_self_attn = MultiHeadAttention(d_model, d_k, d_v, n_heads, 1, len_q)
+         self.pos_ffn = PoswiseFeedForwardNet(d_model, d_ff)
+
+     def forward(self, enc_inputs):
+         '''
+         enc_inputs: [batch_size, src_len, d_model]
+         '''
+         # enc_outputs: [batch_size, src_len, d_model], attn: [batch_size, n_heads, src_len, src_len]
+         enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs)  # enc_inputs to same Q,K,V
+         enc_outputs = self.pos_ffn(enc_outputs)  # enc_outputs: [batch_size, src_len, d_model]
+         return enc_outputs, attn
+
+
+ class Encoder(nn.Module):
+     def __init__(self, d_model, d_ff, d_k, d_v, n_layers, n_heads, len_q):
+         super(Encoder, self).__init__()
+         self.layers = nn.ModuleList([EncoderLayer(d_model, d_ff, d_k, d_v, n_heads, len_q) for _ in range(n_layers)])
+
+     def forward(self, enc_inputs):
+         '''
+         enc_inputs: [batch_size, src_len, d_model]
+         '''
+         enc_outputs = enc_inputs
+         enc_self_attns = []
+         for layer in self.layers:
+             # enc_outputs: [batch_size, src_len, d_model], enc_self_attn: [batch_size, n_heads, src_len, src_len]
+             enc_outputs, enc_self_attn = layer(enc_outputs)
+             enc_self_attns.append(enc_self_attn)
+         return enc_outputs, enc_self_attns
+
+
+ class DecoderLayer(nn.Module):
+     def __init__(self, d_model, d_ff, d_k, d_v, n_heads, len_q):
+         super(DecoderLayer, self).__init__()
+         self.dec_enc_attn = MultiHeadAttention(d_model, d_k, d_v, n_heads, 1, len_q)
+         self.pos_ffn = PoswiseFeedForwardNet(d_model, d_ff)
+
+     def forward(self, dec_inputs, enc_outputs):
+         '''
+         dec_inputs: [batch_size, tgt_len, d_model] [512, 1, 5] --> Spatial info
+         enc_outputs: [batch_size, src_len, d_model] [512, 30, 5] --> Temporal info
+         dec_self_attn_mask: [batch_size, tgt_len, tgt_len]
+         dec_enc_attn_mask: [batch_size, tgt_len, src_len]
+         '''
+         # dec_outputs: [batch_size, tgt_len, d_model], dec_self_attn: [batch_size, n_heads, tgt_len, tgt_len]
+         # dec_outputs: [batch_size, tgt_len, d_model], dec_enc_attn: [batch_size, n_heads, tgt_len, src_len]
+         dec_outputs, dec_enc_attn = self.dec_enc_attn(dec_inputs, enc_outputs, enc_outputs)
+         dec_outputs = self.pos_ffn(dec_outputs)  # [batch_size, tgt_len, d_model]
+         return dec_outputs, dec_enc_attn
+
+
+ class Decoder(nn.Module):
+     def __init__(self, d_model, d_ff, d_k, d_v, n_layers, n_heads, len_q):
+         super(Decoder, self).__init__()
+         self.layers = nn.ModuleList([DecoderLayer(d_model, d_ff, d_k, d_v, n_heads, len_q) for _ in range(n_layers)])
+
+     def forward(self, dec_inputs, enc_outputs):
+         '''
+         dec_inputs: [batch_size, tgt_len, d_model] [512, 1, 5]
+         enc_inputs: [batch_size, src_len, d_model] [512, 30, 5]
+         enc_outputs: [batch_size, src_len, d_model]
+         '''
+         dec_outputs = dec_inputs  # self.tgt_emb(dec_inputs) # [batch_size, tgt_len, d_model]
+         # dec_self_attn_subsequence_mask = get_attn_subsequence_mask(dec_inputs).cuda() # [batch_size, tgt_len, tgt_len]
+
+         dec_enc_attns = []
+         for layer in self.layers:
+             # dec_outputs: [batch_size, tgt_len, d_model], dec_self_attn: [batch_size, n_heads, tgt_len, tgt_len], dec_enc_attn: [batch_size, n_heads, tgt_len, src_len]
+             dec_outputs, dec_enc_attn = layer(dec_outputs, enc_outputs)
+             dec_enc_attns.append(dec_enc_attn)
+         return dec_outputs
+
+
+ # d_model, Embedding Size
+ # d_ff, FeedForward dimension
+ # d_k = d_v, dimension of K(=Q), V
+ # n_layers, number of Encoder / Decoder layers
+ # n_heads, number of heads in Multi-Head Attention
+
+ class Transformer2_3_1(nn.Module):
+     def __init__(self, d_model, d_ff, d_k, d_v, n_layers, n_heads, len_q):
+         super(Transformer2_3_1, self).__init__()
+         self.encoder = Encoder(d_model, d_ff, d_k, d_v, n_layers, n_heads, len_q)
+         self.decoder = Decoder(d_model, d_ff, d_k, d_v, 1, n_heads, len_q)
+
+     def forward(self, enc_inputs, dec_inputs):
+         '''
+         enc_inputs: [Frames, src_len, d_model] [512, 30, 5]
+         dec_inputs: [Frames, 1, d_model] [512, 1, 5]
+         '''
+         # tensor to store decoder outputs
+         # outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)
+
+         # enc_outputs: [batch_size, src_len, d_model], enc_self_attns: [n_layers, batch_size, n_heads, src_len, src_len]
+         enc_outputs, enc_self_attns = self.encoder(enc_inputs)  # Self-attention for temporal features
+         dec_outputs = self.decoder(dec_inputs, enc_outputs)
+         return dec_outputs
+
+
+ class Transformer(nn.Module):
+     def __init__(self, mstcn_f_maps, mstcn_f_dim, out_features, len_q, d_model=None):
+         super(Transformer, self).__init__()
+         # Use provided d_model (256) else fall back to mstcn_f_maps
+         self.d_model = d_model if d_model is not None else mstcn_f_maps
+         self.num_classes = out_features
+         self.len_q = len_q
+
+         # Spatial encoder with d_ff = d_model; heads=8; d_k=d_v=d_model
+         self.spatial_encoder = EncoderLayer(self.d_model, self.d_model, self.d_model, self.d_model, 8, 5)
+         self.transformer = Transformer2_3_1(d_model=self.d_model, d_ff=self.d_model, d_k=self.d_model,
+                                             d_v=self.d_model, n_layers=1, n_heads=8, len_q=len_q)
+         self.fc = nn.Linear(mstcn_f_dim, self.d_model, bias=False)
+
+         # Final head 256 -> num_classes, no bias to match checkpoint
+         self.out = nn.Sequential(
+             nn.ReLU(),
+             nn.Dropout(p=0.1),
+             nn.Linear(self.d_model, out_features, bias=False)
+         )
+
+     def forward(self, x, long_feature):
+         # x: [B, 256, T]; long_feature: [B, T, 256]
+         B, D, T = x.shape
+         out_features = x.transpose(1, 2)  # [B, T, 256]
+
+         # Build sliding windows for temporal inputs
+         inputs = []
+         for i in range(T):
+             if i < self.len_q - 1:
+                 pad = torch.zeros((B, self.len_q - 1 - i, self.d_model), device=x.device)
+                 win = torch.cat([pad, out_features[:, :i + 1, :]], dim=1)
+             else:
+                 win = out_features[:, i - self.len_q + 1:i + 1, :]
+             inputs.append(win)
+         inputs = torch.stack(inputs, dim=0).squeeze(1)  # [T, B, len_q, 256]
+
+         # Project long features and create spatial windows
+         feas = torch.tanh(self.fc(long_feature))  # [B, T, 256]
+         spa_len = min(10, T)
+         out_feas = []
+         for i in range(T):
+             if i < spa_len - 1:
+                 pad = torch.zeros((B, spa_len - 1 - i, self.d_model), device=feas.device)
+                 win = torch.cat([pad, feas[:, :i + 1, :]], dim=1)
+             else:
+                 win = out_features[:, i - spa_len + 1:i + 1, :]
+             out_feas.append(win)
+         out_feas = torch.stack(out_feas, dim=0).squeeze(1)
+         out_feas, _ = self.spatial_encoder(out_feas)
+
+         # Temporal-spatial fusion
+         output = self.transformer(inputs, out_feas)  # [T, B, 1, 256] collapsed → [T, B, 256]
+         output = self.out(output)  # [T, B, C]
+         return output.transpose(0, 1)  # [B, T, C]
model_registry.py ADDED
@@ -0,0 +1,156 @@
+ from __future__ import annotations
+
+ import os
+ import shutil
+ from dataclasses import dataclass
+ from pathlib import Path
+ from typing import Dict, Iterable, Tuple
+
+ from huggingface_hub import hf_hub_download
+ from huggingface_hub.utils import EntryNotFoundError
+
+ APP_ROOT = Path(__file__).resolve().parent
+ MODEL_ROOT = Path(os.environ.get("SPACE_MODEL_DIR", APP_ROOT / "model")).expanduser().resolve()
+
+
+ def _default_hf_home() -> Path:
+     data_dir = Path("/data")
+     if data_dir.is_dir():
+         return data_dir / ".huggingface"
+     return APP_ROOT / ".cache" / "huggingface"
+
+
+ HF_HOME = Path(os.environ.setdefault("HF_HOME", str(_default_hf_home()))).expanduser().resolve()
+ os.environ.setdefault("HF_HUB_CACHE", str(HF_HOME / "hub"))
+
+
+ @dataclass(frozen=True)
+ class ModelSpec:
+     key: str
+     label: str
+     required_files: Tuple[str, ...]
+     optional_files: Tuple[str, ...] = ()
+
+
+ MODEL_SPECS: Dict[str, ModelSpec] = {
+     "aiendo": ModelSpec(
+         key="aiendo",
+         label="AI-Endo",
+         required_files=("resnet50.pth", "fusion.pth", "transformer.pth"),
+     ),
+     "dinov2": ModelSpec(
+         key="dinov2",
+         label="DINO-Endo",
+         required_files=("dinov2_vit14s_latest_checkpoint.pth", "fusion_transformer_decoder_best_model.pth"),
+         optional_files=("dinov2_decoder.pth",),
+     ),
+     "vjepa2": ModelSpec(
+         key="vjepa2",
+         label="V-JEPA2",
+         required_files=("vjepa_encoder_human.pt", "mlp_decoder_human.pth"),
+     ),
+ }
+
+
+ def _repo_env_name(model_key: str) -> str:
+     prefix = {"aiendo": "AIENDO", "dinov2": "DINO", "vjepa2": "VJEPA2"}[model_key]
+     return f"{prefix}_MODEL_REPO_ID"
+
+
+ def _revision_env_name(model_key: str) -> str:
+     prefix = {"aiendo": "AIENDO", "dinov2": "DINO", "vjepa2": "VJEPA2"}[model_key]
+     return f"{prefix}_MODEL_REVISION"
+
+
+ def _subfolder_env_name(model_key: str) -> str:
+     prefix = {"aiendo": "AIENDO", "dinov2": "DINO", "vjepa2": "VJEPA2"}[model_key]
+     return f"{prefix}_MODEL_SUBFOLDER"
+
+
+ def get_model_repo_id(model_key: str) -> str | None:
+     return os.getenv(_repo_env_name(model_key)) or os.getenv("PHASE_MODEL_REPO_ID")
+
+
+ def get_model_revision(model_key: str) -> str | None:
+     return os.getenv(_revision_env_name(model_key)) or os.getenv("PHASE_MODEL_REVISION")
+
+
+ def get_model_subfolder(model_key: str) -> str:
+     return (os.getenv(_subfolder_env_name(model_key)) or "").strip("/")
+
+
+ def get_hf_token() -> str | None:
+     return os.getenv("HF_TOKEN") or os.getenv("HUGGING_FACE_HUB_TOKEN") or os.getenv("HUGGINGFACEHUB_API_TOKEN")
+
+
+ def ensure_model_root() -> Path:
+     MODEL_ROOT.mkdir(parents=True, exist_ok=True)
+     HF_HOME.mkdir(parents=True, exist_ok=True)
+     Path(os.environ["HF_HUB_CACHE"]).mkdir(parents=True, exist_ok=True)
+     return MODEL_ROOT
+
+
+ def _remote_filename(model_key: str, filename: str) -> str:
+     subfolder = get_model_subfolder(model_key)
+     return f"{subfolder}/{filename}" if subfolder else filename
+
+
+ def _download_to_model_root(model_key: str, filename: str, *, optional: bool = False) -> Path | None:
+     target = ensure_model_root() / filename
+     if target.exists():
+         return target
+
+     repo_id = get_model_repo_id(model_key)
+     if not repo_id:
+         if optional:
+             return None
+         raise FileNotFoundError(
+             f"Missing {filename} in {MODEL_ROOT}. Set {_repo_env_name(model_key)} or PHASE_MODEL_REPO_ID, "
+             f"or copy the checkpoint into the local model directory."
+         )
+
+     try:
+         downloaded = hf_hub_download(
+             repo_id=repo_id,
+             filename=_remote_filename(model_key, filename),
+             repo_type="model",
+             revision=get_model_revision(model_key),
+             token=get_hf_token(),
+         )
+     except EntryNotFoundError:
+         if optional:
+             return None
+         raise
+
+     downloaded_path = Path(downloaded)
+     if downloaded_path.resolve() != target.resolve():
+         shutil.copy2(downloaded_path, target)
+     return target
+
+
+ def ensure_model_artifacts(model_key: str) -> Path:
+     if model_key not in MODEL_SPECS:
+         raise KeyError(f"Unknown model key: {model_key}")
+
+     spec = MODEL_SPECS[model_key]
+     ensure_model_root()
+
+     for filename in spec.required_files:
+         _download_to_model_root(model_key, filename, optional=False)
+     for filename in spec.optional_files:
+         _download_to_model_root(model_key, filename, optional=True)
+
+     return MODEL_ROOT
+
+
+ def get_model_source_summary(model_key: str) -> dict:
+     spec = MODEL_SPECS[model_key]
+     return {
+         "label": spec.label,
+         "model_dir": str(MODEL_ROOT),
+         "repo_id": get_model_repo_id(model_key),
+         "revision": get_model_revision(model_key),
+         "subfolder": get_model_subfolder(model_key),
+         "required_files": list(spec.required_files),
+         "optional_files": list(spec.optional_files),
+     }
predictor.py ADDED
@@ -0,0 +1,642 @@
1
+ from __future__ import annotations
2
+
3
+ import os
4
+ import sys
5
+ from contextlib import nullcontext
6
+ from pathlib import Path
7
+
8
+ import albumentations as A
9
+ import cv2
10
+ import numpy as np
11
+ import torch
12
+ import torch.nn as nn
13
+ import torch.nn.functional as F
14
+
15
+ try:
16
+ from torch.amp import autocast
17
+ MIXED_PRECISION_AVAILABLE = True
18
+ except ImportError: # pragma: no cover
19
+ MIXED_PRECISION_AVAILABLE = False
20
+
21
+ from model.resnet import ResNet
22
+ from model.mstcn import MultiStageModel
23
+ from model.transformer import Transformer
24
+
25
+ PHASE_LABELS = ("idle", "marking", "injection", "dissection")
26
+ MODEL_LABELS = {
27
+ "aiendo": "AI-Endo",
28
+ "dinov2": "DINO-Endo",
29
+ "vjepa2": "V-JEPA2",
30
+ }
31
+
32
+
33
+ def _app_root() -> Path:
34
+ return Path(__file__).resolve().parent
35
+
36
+
37
+ def default_model_dir() -> str:
38
+ return str(Path(os.environ.get("SPACE_MODEL_DIR", _app_root() / "model")).expanduser().resolve())
39
+
40
+
41
+ def normalize_model_key(name: str | None) -> str:
42
+ token = (name or "aiendo").lower().replace("-", "").replace("_", "").strip()
43
+ if token in ("aiendo", "resnet", "aiendoresnet", "aiendoresnetmstcn", "aiendoresnetmstcntransformer"):
44
+ return "aiendo"
45
+ if token in ("dinov2", "dinov2endo", "dinoendo", "dino"):
46
+ return "dinov2"
47
+ if token in ("vjepa2", "vjepa", "vjepa2endo"):
48
+ return "vjepa2"
49
+ raise KeyError(f"Unsupported model key: {name}")
50
+
51
+
+ def _load_trusted_checkpoint(path: str, map_location="cpu"):
+     try:
+         return torch.load(path, map_location=map_location, weights_only=False)
+     except TypeError:  # pragma: no cover
+         return torch.load(path, map_location=map_location)
+
+
+ def _strip_state_dict_prefixes(state_dict, prefixes):
+     cleaned_state = {}
+     for key, value in state_dict.items():
+         while any(key.startswith(prefix) for prefix in prefixes):
+             for prefix in prefixes:
+                 if key.startswith(prefix):
+                     key = key[len(prefix):]
+         cleaned_state[key] = value
+     return cleaned_state
+
+
+ def _validate_load_result(
+     load_result,
+     model_name: str,
+     *,
+     allowed_missing=(),
+     allowed_missing_prefixes=(),
+     allowed_unexpected=(),
+     allowed_unexpected_prefixes=(),
+ ):
+     missing = [
+         key
+         for key in load_result.missing_keys
+         if key not in allowed_missing and not any(key.startswith(prefix) for prefix in allowed_missing_prefixes)
+     ]
+     unexpected = [
+         key
+         for key in load_result.unexpected_keys
+         if key not in allowed_unexpected and not any(key.startswith(prefix) for prefix in allowed_unexpected_prefixes)
+     ]
+     if missing or unexpected:
+         problems = []
+         if missing:
+             problems.append(f"missing={missing[:10]}")
+         if unexpected:
+             problems.append(f"unexpected={unexpected[:10]}")
+         raise RuntimeError(f"{model_name} checkpoint mismatch ({'; '.join(problems)})")
+
+
+ def _resolve_vendor_repo(repo_name: str, extra_candidates=()):
+     app_root = _app_root()
+     candidates = [app_root / repo_name]
+     if len(app_root.parents) >= 2:
+         candidates.append(app_root.parents[1] / repo_name)
+     candidates.extend(extra_candidates)
+
+     for candidate in candidates:
+         if candidate and candidate.exists():
+             return candidate
+     raise FileNotFoundError(f"Required vendor repo '{repo_name}' not found. Stage it into this folder or keep the repo-root copy available.")
+
+
+ class Predictor:
+     def __init__(self, model_dir: str | None = None, device: str = "cuda"):
+         self.device = torch.device(device if torch.cuda.is_available() else "cpu")
+         self.model_dir = model_dir or default_model_dir()
+         self.seq_length = 1024
+         self.trans_seq = 30
+         self.aug = A.Compose([A.Resize(height=224, width=224), A.Normalize()])
+         self.frame_feature_cache = None
+         self.label_dict = dict(enumerate(PHASE_LABELS))
+         self.available = False
+
+         self._norm_mean = None
+         self._norm_std = None
+         if self.device.type == "cuda":
+             self._norm_mean = torch.tensor([0.485, 0.456, 0.406], device=self.device).view(1, 3, 1, 1)
+             self._norm_std = torch.tensor([0.229, 0.224, 0.225], device=self.device).view(1, 3, 1, 1)
+
+         self._load_models(self.model_dir)
+
+     def _load_models(self, model_dir: str):
+         self.resnet = ResNet(out_channels=4, has_fc=False)
+         paras = torch.load(os.path.join(model_dir, "resnet50.pth"), map_location=self.device)["model"]
+         paras = {k: v for k, v in paras.items() if "fc" not in k and "embed" not in k}
+         paras = {k.replace("share.", "resnet."): v for k, v in paras.items()}
+         self.resnet.load_state_dict(paras, strict=True)
+         self.resnet.to(self.device).eval()
+
+         self.fusion = MultiStageModel(
+             mstcn_stages=2,
+             mstcn_layers=8,
+             mstcn_f_maps=32,
+             mstcn_f_dim=2048,
+             out_features=4,
+             mstcn_causal_conv=True,
+             is_train=False,
+         )
+         fusion_weights = torch.load(os.path.join(model_dir, "fusion.pth"), map_location=self.device)
+         fusion_load = self.fusion.load_state_dict(fusion_weights, strict=False)
+         _validate_load_result(
+             fusion_load,
+             "AI-Endo fusion",
+             allowed_unexpected_prefixes=("stage1.conv_out_classes.",),
+         )
+         self.fusion.to(self.device).eval()
+
+         self.transformer = Transformer(32, 2048, 4, 30, d_model=32)
+         trans_weights = torch.load(os.path.join(model_dir, "transformer.pth"), map_location=self.device)
+         self.transformer.load_state_dict(trans_weights)
+         self.transformer.to(self.device).eval()
+         self.available = True
+
+     def _amp_context(self):
+         return autocast("cuda") if MIXED_PRECISION_AVAILABLE and self.device.type == "cuda" else nullcontext()
+
+     def _preprocess_gpu(self, rgb_image: np.ndarray) -> torch.Tensor:
+         tensor = torch.from_numpy(rgb_image).permute(2, 0, 1).unsqueeze(0)
+         tensor = tensor.to(self.device, dtype=torch.float32, non_blocking=True).div_(255.0)
+         if tensor.shape[-2:] != (224, 224):
+             tensor = F.interpolate(tensor, size=(224, 224), mode="bilinear", align_corners=False)
+         return (tensor - self._norm_mean) / self._norm_std
+
+     def warm_up(self):
+         dummy = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
+         self.predict(dummy)
+         self.reset_state()
+
+     def reset_state(self):
+         self.frame_feature_cache = None
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+
+     def unload(self):
+         self.available = False
+         self.resnet.to("cpu")
+         self.fusion.to("cpu")
+         self.transformer.to("cpu")
+         self.resnet = None
+         self.fusion = None
+         self.transformer = None
+         self.frame_feature_cache = None
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+
+     def _cache_features(self, feature: torch.Tensor):
+         # Maintain a sliding window of per-frame features capped at seq_length
+         # (the original `>` comparison let the cache grow one frame past the cap).
+         if self.frame_feature_cache is None:
+             self.frame_feature_cache = feature
+         elif self.frame_feature_cache.shape[0] >= self.seq_length:
+             self.frame_feature_cache = torch.cat([self.frame_feature_cache[1:], feature], dim=0)
+         else:
+             self.frame_feature_cache = torch.cat([self.frame_feature_cache, feature], dim=0)
+
+     @torch.inference_mode()
+     def predict(self, rgb_image: np.ndarray):
+         if self._norm_mean is not None:
+             tensor = self._preprocess_gpu(rgb_image)
+         else:
+             processed = self.aug(image=rgb_image)["image"]
+             chw = np.transpose(processed, (2, 0, 1))
+             tensor = torch.from_numpy(chw).unsqueeze(0).contiguous().to(self.device)
+
+         with self._amp_context():
+             feature = self.resnet(tensor).clone()
+         self._cache_features(feature)
+
+         # _cache_features has already appended the current frame, so the cache is
+         # never empty here and must not be concatenated with `feature` again
+         # (doing so would double-count the newest frame). One path therefore
+         # covers the first frame, short histories, and full windows alike.
+         cat_frame_feature = self.frame_feature_cache.unsqueeze(0)
+         temporal_input = cat_frame_feature.transpose(1, 2)
+         temporal_feature = self.fusion(temporal_input)
+         outputs = self.transformer(temporal_feature.detach(), cat_frame_feature)
+         final_logits = outputs[-1, -1, :]
+         probs = F.softmax(final_logits.float(), dim=-1)
+         pred_np = probs.detach().cpu().numpy()
+
+         confidence = float(np.max(pred_np))
+         phase_idx = max(0, min(3, int(np.argmax(pred_np))))
+         phase = self.label_dict.get(phase_idx, "idle")
+         return {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": min(self.trans_seq, self.frame_feature_cache.shape[0])}
+
+
+ class PredictorDinoV2:
+     def __init__(self, model_dir: str | None = None, device: str = "cuda"):
+         self.device = torch.device(device if torch.cuda.is_available() else "cpu")
+         self.model_dir = model_dir or default_model_dir()
+         self.seq_length = 30
+         self.available = False
+         self.backbone = None
+         self.decoder = None
+         self.label_dict = dict(enumerate(PHASE_LABELS))
+         self.aug = A.Compose([
+             A.SmallestMaxSize(max_size=256, interpolation=cv2.INTER_LINEAR),
+             A.CenterCrop(height=224, width=224),
+             A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225), max_pixel_value=255.0),
+         ])
+         self.frame_features = []
+         self._load_models(self.model_dir)
+
+     def _amp_context(self):
+         return autocast("cuda") if MIXED_PRECISION_AVAILABLE and self.device.type == "cuda" else nullcontext()
+
+     def _resolve_local_dino_repo(self):
+         app_root = _app_root()
+         candidates = [app_root / "dinov2"]
+         if len(app_root.parents) >= 2:
+             candidates.append(app_root.parents[1] / "dinov2")
+         candidates.append(Path(torch.hub.get_dir()) / "facebookresearch_dinov2_main")
+         for candidate in candidates:
+             if (candidate / "hubconf.py").is_file():
+                 return str(candidate)
+         raise FileNotFoundError("Local DINOv2 repo not found. Stage dinov2/ into this folder or keep the repo-root copy available.")
+
+     def _load_models(self, model_dir: str):
+         repo_path = self._resolve_local_dino_repo()
+         self.backbone = torch.hub.load(repo_path, "dinov2_vits14", source="local", pretrained=False)
+
+         encoder_path = os.path.join(model_dir, "dinov2_vit14s_latest_checkpoint.pth")
+         if not os.path.exists(encoder_path):
+             raise FileNotFoundError("DINOv2 encoder checkpoint not found")
+         encoder_checkpoint = _load_trusted_checkpoint(encoder_path, map_location="cpu")
+         encoder_state = encoder_checkpoint.get("student", encoder_checkpoint)
+         encoder_state = _strip_state_dict_prefixes(encoder_state, ("module.", "model."))
+         encoder_load = self.backbone.load_state_dict(encoder_state, strict=False)
+         _validate_load_result(encoder_load, "DINOv2 backbone")
+         self.backbone.to(self.device).eval()
+
+         decoder_path = os.path.join(model_dir, "fusion_transformer_decoder_best_model.pth")
+         if not os.path.exists(decoder_path):
+             raise FileNotFoundError("DINOv2 decoder checkpoint not found")
+         decoder_checkpoint = _load_trusted_checkpoint(decoder_path, map_location="cpu")
+         decoder_state = decoder_checkpoint.get("state_dict", decoder_checkpoint)
+         decoder_state = _strip_state_dict_prefixes(decoder_state, ("module.", "model."))
+
+         class FusionTransformerDecoder(nn.Module):
+             def __init__(self, feature_dim=384, num_classes=4, mstcn_stages=2, mstcn_layers=8, mstcn_f_maps=16, mstcn_f_dim=256, seq_length=30, d_model=256):
+                 super().__init__()
+                 self.reduce = nn.Linear(feature_dim, mstcn_f_dim)
+                 self.mstcn = MultiStageModel(
+                     mstcn_stages=mstcn_stages,
+                     mstcn_layers=mstcn_layers,
+                     mstcn_f_maps=mstcn_f_maps,
+                     mstcn_f_dim=mstcn_f_dim,
+                     out_features=num_classes,
+                     mstcn_causal_conv=True,
+                     is_train=False,
+                 )
+                 self.transformer = Transformer(
+                     mstcn_f_maps=mstcn_f_maps,
+                     mstcn_f_dim=mstcn_f_dim,
+                     out_features=num_classes,
+                     len_q=seq_length,
+                     d_model=d_model,
+                 )
+
+             def forward(self, x):
+                 x = x.permute(0, 2, 1)
+                 x_reduced = self.reduce(x)
+                 mstcn_input = x_reduced.permute(0, 2, 1)
+                 temporal_features = self.mstcn(mstcn_input)
+                 if isinstance(temporal_features, (list, tuple)):
+                     temporal_features = temporal_features[-1]
+                 elif isinstance(temporal_features, torch.Tensor) and temporal_features.dim() == 4:
+                     temporal_features = temporal_features[-1]
+
+                 if temporal_features.shape[1] == mstcn_input.shape[1]:
+                     transformer_input = temporal_features.detach()
+                 else:
+                     transformer_input = mstcn_input.detach()
+
+                 transformer_out = self.transformer(transformer_input, x_reduced)
+                 return transformer_out.permute(0, 2, 1)
+
+         self.decoder = FusionTransformerDecoder()
+         decoder_load = self.decoder.load_state_dict(decoder_state, strict=False)
+         _validate_load_result(
+             decoder_load,
+             "DINOv2 decoder",
+             allowed_unexpected_prefixes=(
+                 "mstcn.stage1.conv_out_classes.",
+                 "mstcn.stages.conv_out_classes.",
+             ),
+         )
+         self.decoder.to(self.device).eval()
+         self.available = True
+
+     def reset_state(self):
+         self.frame_features = []
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+
+     def warm_up(self):
+         dummy_img = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
+         self.predict(dummy_img)
+         self.reset_state()
+
+     def unload(self):
+         if self.backbone is not None:
+             self.backbone.to("cpu")
+         if self.decoder is not None:
+             self.decoder.to("cpu")
+         self.backbone = None
+         self.decoder = None
+         self.frame_features = []
+         self.available = False
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+
+     @torch.inference_mode()
+     def predict(self, rgb_image: np.ndarray):
+         if not self.available or self.backbone is None or self.decoder is None:
+             raise RuntimeError("DINO-Endo predictor is not available")
+
+         processed = self.aug(image=rgb_image)["image"]
+         chw = np.transpose(processed, (2, 0, 1))
+         tensor = torch.tensor(chw, dtype=torch.float32).unsqueeze(0).to(self.device)
+
+         with self._amp_context():
+             feats = self.backbone.forward_features(tensor)
+         if isinstance(feats, dict):
+             feats = feats.get("x_norm_clstoken", next(iter(feats.values())))
+         if feats.dim() == 3:
+             feats = feats.mean(dim=1)
+
+         self.frame_features.append(feats.squeeze(0).detach().cpu())
+         if len(self.frame_features) > self.seq_length:
+             self.frame_features = self.frame_features[-self.seq_length:]
+
+         # Pad short histories by repeating the newest frame so the decoder always
+         # sees a fixed-length window of seq_length features.
+         available_frames = len(self.frame_features)
+         seq = torch.stack(self.frame_features).unsqueeze(0).to(self.device)
+         if available_frames < self.seq_length:
+             last_frame = seq[:, -1:, :]
+             padding = last_frame.repeat(1, self.seq_length - available_frames, 1)
+             seq = torch.cat([seq, padding], dim=1)
+
+         decoder_input = seq.transpose(1, 2)
+         with self._amp_context():
+             logits = self.decoder(decoder_input)
+
+         if logits.dim() != 3:
+             raise ValueError(f"Unexpected DINOv2 decoder output shape: {tuple(logits.shape)}")
+         if logits.shape[1] == len(self.label_dict):
+             last = logits[0, :, -1]
+         elif logits.shape[2] == len(self.label_dict):
+             last = logits[0, -1, :]
+         else:
+             raise ValueError(f"Unexpected DINOv2 class dimension in decoder output: {tuple(logits.shape)}")
+
+         probs = torch.softmax(last, dim=0)
+         pred_np = probs.detach().cpu().numpy()
+         confidence = float(np.max(pred_np))
+         phase_idx = int(np.argmax(pred_np))
+         phase = self.label_dict.get(phase_idx, "idle")
+         return {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": available_frames}
+
+
+ class PredictorVJEPA2:
+     def __init__(self, model_dir: str | None = None, device: str = "cuda"):
+         self.device = torch.device(device if torch.cuda.is_available() else "cpu")
+         self.model_dir = model_dir or default_model_dir()
+         self.available = False
+         self.encoder = None
+         self.decoder = None
+         self.label_dict = dict(enumerate(PHASE_LABELS))
+         self._clip_frames = 16
+         self._tubelet_size = 2
+         self._crop_size = 256
+         self._decoder_seq_length = 30
+         self._frame_buffer = []
+         self._feature_buffer = []
+         self._vjepa_mean = torch.tensor([0.485, 0.456, 0.406], dtype=torch.float32).view(3, 1, 1, 1)
+         self._vjepa_std = torch.tensor([0.229, 0.224, 0.225], dtype=torch.float32).view(3, 1, 1, 1)
+         self._load_models(self.model_dir)
+
+     def _amp_context(self):
+         return autocast("cuda") if MIXED_PRECISION_AVAILABLE and self.device.type == "cuda" else nullcontext()
+
+     def _resolve_vjepa_repo(self):
+         extras = []
+         app_root = _app_root()
+         if len(app_root.parents) >= 2:
+             extras.append(app_root.parents[1] / "webapp" / "vjepa2")
+         return _resolve_vendor_repo("vjepa2", extras)
+
+     @staticmethod
+     def _clean_checkpoint_keys(state_dict):
+         cleaned_state = {}
+         for key, value in state_dict.items():
+             while key.startswith("module.") or key.startswith("backbone."):
+                 if key.startswith("module."):
+                     key = key[len("module."):]
+                 elif key.startswith("backbone."):
+                     key = key[len("backbone."):]
+             cleaned_state[key] = value
+         return cleaned_state
+
+     @staticmethod
+     def _validate_load_result(load_result, model_name: str):
+         if load_result.unexpected_keys:
+             sample = ", ".join(load_result.unexpected_keys[:5])
+             raise RuntimeError(f"{model_name} load had unexpected keys: {sample}")
+         if load_result.missing_keys:
+             sample = ", ".join(load_result.missing_keys[:5])
+             raise RuntimeError(f"{model_name} load missed required keys: {sample}")
+
+     def _extract_temporal_features(self, features) -> torch.Tensor:
+         # `features` may be a raw tensor or a dict of token tensors, so the
+         # parameter is deliberately left unannotated.
+         if isinstance(features, dict):
+             features = features.get("x_norm_patchtokens", features.get("x_norm_clstoken", next(iter(features.values()))))
+
+         if features.dim() == 2:
+             return features.unsqueeze(1).repeat(1, self._clip_frames, 1)
+         if features.dim() != 3:
+             raise ValueError(f"Unexpected V-JEPA2 encoder output shape: {tuple(features.shape)}")
+
+         temporal_tokens = self._clip_frames // self._tubelet_size
+         if temporal_tokens <= 0:
+             raise ValueError("Invalid V-JEPA2 temporal configuration")
+         if features.shape[1] % temporal_tokens != 0:
+             raise ValueError(
+                 f"Cannot reshape V-JEPA2 features of shape {tuple(features.shape)} into {temporal_tokens} temporal groups"
+             )
+
+         # Average patch tokens within each temporal group, then repeat per tubelet
+         # so every frame of the clip gets one feature vector.
+         spatial_tokens = features.shape[1] // temporal_tokens
+         features = features.view(features.shape[0], temporal_tokens, spatial_tokens, features.shape[2]).mean(dim=2)
+         return features.repeat_interleave(self._tubelet_size, dim=1)[:, : self._clip_frames, :]
+
+     def _preprocess_clip(self, frames) -> torch.Tensor:
+         resized_frames = [cv2.resize(frame, (self._crop_size, self._crop_size), interpolation=cv2.INTER_LINEAR) for frame in frames]
+         clip = np.stack(resized_frames, axis=0).astype(np.float32) / 255.0
+         tensor = torch.from_numpy(np.transpose(clip, (3, 0, 1, 2)))
+         return (tensor - self._vjepa_mean) / self._vjepa_std
+
+     def _load_models(self, model_dir: str):
+         vjepa2_path = self._resolve_vjepa_repo()
+         if str(vjepa2_path) not in sys.path:
+             sys.path.insert(0, str(vjepa2_path))
+
+         from src.models import vision_transformer as vjepa_vit
+         from src.utils.checkpoint_loader import robust_checkpoint_loader
+
+         encoder_path = os.path.join(model_dir, "vjepa_encoder_human.pt")
+         if not os.path.exists(encoder_path):
+             raise FileNotFoundError("V-JEPA2 encoder not found")
+
+         checkpoint = robust_checkpoint_loader(encoder_path, map_location=torch.device("cpu"))
+         encoder_state = self._clean_checkpoint_keys(checkpoint.get("encoder", checkpoint))
+
+         self.encoder = vjepa_vit.vit_large(
+             patch_size=16,
+             num_frames=self._clip_frames,
+             tubelet_size=self._tubelet_size,
+             img_size=self._crop_size,
+             uniform_power=True,
+             use_sdpa=True,
+             use_rope=True,
+         )
+         encoder_load = self.encoder.load_state_dict(encoder_state, strict=False)
+         self._validate_load_result(encoder_load, "V-JEPA2 encoder")
+         self.encoder.to(self.device).eval()
+
+         decoder_path = os.path.join(model_dir, "mlp_decoder_human.pth")
+         if not os.path.exists(decoder_path):
+             raise FileNotFoundError("V-JEPA2 MLP decoder not found")
+
+         decoder_checkpoint = torch.load(decoder_path, map_location="cpu")
+         decoder_state = decoder_checkpoint.get("model", decoder_checkpoint)
+         decoder_in_dim = int(decoder_checkpoint.get("in_dim", 1024))
+         decoder_num_classes = int(decoder_checkpoint.get("num_classes", len(self.label_dict)))
+         self._decoder_seq_length = int(decoder_checkpoint.get("seq_length", self._decoder_seq_length))
+
+         class MLPDecoder(nn.Module):
+             def __init__(self, in_dim=1024, hidden_dim=256, num_classes=4):
+                 super().__init__()
+                 self.norm = nn.LayerNorm(in_dim)
+                 self.fc1 = nn.Linear(in_dim, hidden_dim)
+                 self.fc2 = nn.Linear(hidden_dim, hidden_dim)
+                 self.fc3 = nn.Linear(hidden_dim, num_classes)
+                 self.relu = nn.ReLU()
+                 self.drop = nn.Dropout(0.5)
+
+             def forward(self, x):
+                 x = x.mean(dim=1)
+                 x = self.norm(x)
+                 x = self.drop(self.relu(self.fc1(x)))
+                 x = self.drop(self.relu(self.fc2(x)))
+                 return self.fc3(x)
+
+         self.decoder = MLPDecoder(in_dim=decoder_in_dim, num_classes=decoder_num_classes)
+         self.decoder.load_state_dict(decoder_state, strict=True)
+         self.decoder.to(self.device).eval()
+         self.available = True
+
+     def reset_state(self):
+         self._frame_buffer = []
+         self._feature_buffer = []
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+
+     def warm_up(self):
+         dummy = np.random.randint(0, 255, (self._crop_size, self._crop_size, 3), dtype=np.uint8)
+         self.predict(dummy)
+         self.reset_state()
+
+     def unload(self):
+         if self.encoder is not None:
+             self.encoder.to("cpu")
+         if self.decoder is not None:
+             self.decoder.to("cpu")
+         self.encoder = None
+         self.decoder = None
+         self._frame_buffer = []
+         self._feature_buffer = []
+         self.available = False
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+
+     @torch.inference_mode()
+     def predict(self, rgb_image: np.ndarray):
+         if not self.available:
+             raise RuntimeError("V-JEPA2 predictor is not available")
+
+         frame = np.ascontiguousarray(rgb_image, dtype=np.uint8)
+         self._frame_buffer.append(frame)
+         if len(self._frame_buffer) > self._clip_frames:
+             self._frame_buffer = self._frame_buffer[-self._clip_frames:]
+
+         # Repeat the last frame until a full clip is buffered.
+         clip_frames = list(self._frame_buffer)
+         while len(clip_frames) < self._clip_frames:
+             clip_frames.append(clip_frames[-1])
+
+         tensor = self._preprocess_clip(clip_frames).unsqueeze(0).to(self.device)
+         with self._amp_context():
+             features = self._extract_temporal_features(self.encoder(tensor))
+
+         latest_feature_idx = min(len(self._frame_buffer), self._clip_frames) - 1
+         latest_feature = features[0, latest_feature_idx].float().detach().cpu()
+         self._feature_buffer.append(latest_feature)
+         if len(self._feature_buffer) > self._decoder_seq_length:
+             self._feature_buffer = self._feature_buffer[-self._decoder_seq_length:]
+
+         available_frames = len(self._feature_buffer)
+         seq = torch.stack(self._feature_buffer, dim=0).unsqueeze(0).to(self.device)
+         if available_frames < self._decoder_seq_length:
+             padding = seq[:, -1:, :].repeat(1, self._decoder_seq_length - available_frames, 1)
+             seq = torch.cat([seq, padding], dim=1)
+
+         with self._amp_context():
+             logits = self.decoder(seq)
+
+         probs = torch.softmax(logits[0], dim=0)
+         pred_np = probs.detach().cpu().numpy()
+         confidence = float(np.max(pred_np))
+         phase_idx = int(np.argmax(pred_np))
+         phase = self.label_dict.get(phase_idx, "idle")
+         return {"phase": phase, "probs": pred_np.tolist(), "confidence": confidence, "frames_used": available_frames}
+
+
+ def create_predictor(model_key: str, model_dir: str | None = None, device: str | None = None):
+     resolved_key = normalize_model_key(model_key)
+     resolved_device = device or ("cuda" if torch.cuda.is_available() else "cpu")
+     resolved_model_dir = model_dir or default_model_dir()
+
+     if resolved_key == "aiendo":
+         return Predictor(model_dir=resolved_model_dir, device=resolved_device)
+     if resolved_key == "dinov2":
+         return PredictorDinoV2(model_dir=resolved_model_dir, device=resolved_device)
+     if resolved_key == "vjepa2":
+         return PredictorVJEPA2(model_dir=resolved_model_dir, device=resolved_device)
+     raise KeyError(f"Unsupported model key: {model_key}")
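
For reference, a minimal sketch of how these predictors are meant to be driven one RGB frame at a time (hypothetical driver code, not part of the Space; it assumes the matching checkpoints are already staged in the model directory, otherwise construction raises `FileNotFoundError`):

```python
import numpy as np

from predictor import create_predictor

# Stand-in frames; any HxWx3 uint8 RGB array works in place of the random data.
predictor = create_predictor("dinov2")
predictor.reset_state()
for _ in range(5):
    frame = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
    result = predictor.predict(frame)
    print(result["phase"], result["confidence"], result["frames_used"])
predictor.unload()
```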
requirements.txt CHANGED
@@ -1,4 +1,13 @@
- accelerate
- transformers
- trimesh
- numpy
+ --extra-index-url https://download.pytorch.org/whl/cu121
+ streamlit>=1.40,<2
+ torch==2.5.1
+ torchvision==0.20.1
+ numpy>=1.26,<3
+ pandas>=2.2,<3
+ opencv-python-headless>=4.10,<5
+ pillow>=10,<12
+ albumentations>=2.0,<3
+ huggingface_hub>=0.27,<1
+ pyyaml>=6,<7
+ timm>=1.0,<2
+ einops>=0.8,<1
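
The `--extra-index-url` line pulls CUDA 12.1 wheels to match the pinned `torch==2.5.1` / `torchvision==0.20.1` pair. An optional post-install sanity check (not part of the Space itself):

```python
import torch
import torchvision

# Expect 2.5.1 and 0.20.1; torch.version.cuda reports "12.1" for cu121 wheels,
# and cuda.is_available() is False on CPU-only hosts.
print(torch.__version__, torchvision.__version__, torch.version.cuda, torch.cuda.is_available())
```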
scripts/smoke_test.py ADDED
@@ -0,0 +1,57 @@
+ from __future__ import annotations
+
+ import argparse
+ import os
+ import sys
+ from pathlib import Path
+
+ import numpy as np
+
+ SCRIPT_PATH = Path(__file__).resolve()
+ SPACE_ROOT = SCRIPT_PATH.parents[1]
+ if str(SPACE_ROOT) not in sys.path:
+     sys.path.insert(0, str(SPACE_ROOT))
+
+ from predictor import create_predictor
+
+
+ MODEL_REQUIREMENTS = {
+     "aiendo": ("resnet50.pth", "fusion.pth", "transformer.pth"),
+     "dinov2": ("dinov2_vit14s_latest_checkpoint.pth", "fusion_transformer_decoder_best_model.pth"),
+     "vjepa2": ("vjepa_encoder_human.pt", "mlp_decoder_human.pth"),
+ }
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(description="Smoke test the isolated HF Space predictors.")
+     parser.add_argument("--model", choices=sorted(MODEL_REQUIREMENTS), required=True)
+     parser.add_argument("--model-dir", default=str(SPACE_ROOT / "model"))
+     parser.add_argument("--device", default="cuda")
+     parser.add_argument("--image-size", type=int, default=256)
+     return parser.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     model_dir = Path(args.model_dir).expanduser().resolve()
+     missing = [name for name in MODEL_REQUIREMENTS[args.model] if not (model_dir / name).exists()]
+     if missing:
+         raise FileNotFoundError(f"Missing required checkpoints in {model_dir}: {', '.join(missing)}")
+
+     os.environ["SPACE_MODEL_DIR"] = str(model_dir)
+     dummy = np.random.randint(0, 255, (args.image_size, args.image_size, 3), dtype=np.uint8)
+
+     predictor = create_predictor(args.model, model_dir=str(model_dir), device=args.device)
+     predictor.reset_state()
+     result = predictor.predict(dummy)
+     predictor.unload()
+
+     print(f"model={args.model}")
+     print(f"phase={result.get('phase')}")
+     print(f"confidence={result.get('confidence')}")
+     print(f"frames_used={result.get('frames_used')}")
+
+
+ if __name__ == "__main__":
+     main()
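
Assuming the checkpoints listed in `MODEL_REQUIREMENTS` are staged under `model/`, the smoke test can be run as, for example, `python scripts/smoke_test.py --model dinov2 --device cpu`. Passing `--device cuda` on a CPU-only host is also safe, since each predictor falls back to CPU when `torch.cuda.is_available()` is false.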