murphylmf committed
Commit ae166e6 · 1 Parent(s): eaea719
Files changed (50)
  1. README.md +61 -7
  2. app.py +457 -0
  3. environment.yml +14 -0
  4. inference.py +186 -0
  5. install.sh +74 -0
  6. packages.txt +7 -0
  7. requirements.txt +22 -0
  8. static/teaser.svg +0 -0
  9. unish/__pycache__/pipeline.cpython-310.pyc +0 -0
  10. unish/heads/__pycache__/align_net.cpython-310.pyc +0 -0
  11. unish/heads/__pycache__/dpt_head.cpython-310.pyc +0 -0
  12. unish/heads/__pycache__/head_act.cpython-310.pyc +0 -0
  13. unish/heads/__pycache__/human_head_cliff.cpython-310.pyc +0 -0
  14. unish/heads/__pycache__/pose_transformer.cpython-310.pyc +0 -0
  15. unish/heads/__pycache__/t_cond_mlp.cpython-310.pyc +0 -0
  16. unish/heads/__pycache__/utils.cpython-310.pyc +0 -0
  17. unish/heads/__pycache__/vit.cpython-310.pyc +0 -0
  18. unish/heads/align_net.py +571 -0
  19. unish/heads/dpt_head.py +500 -0
  20. unish/heads/head_act.py +125 -0
  21. unish/heads/human_head_cliff.py +97 -0
  22. unish/heads/pose_transformer.py +364 -0
  23. unish/heads/t_cond_mlp.py +199 -0
  24. unish/heads/utils.py +108 -0
  25. unish/heads/vit.py +346 -0
  26. unish/pi3/models/__pycache__/pi3.cpython-310.pyc +0 -0
  27. unish/pi3/models/dinov2/__init__.py +6 -0
  28. unish/pi3/models/dinov2/__pycache__/__init__.cpython-310.pyc +0 -0
  29. unish/pi3/models/dinov2/hub/__init__.py +4 -0
  30. unish/pi3/models/dinov2/hub/__pycache__/__init__.cpython-310.pyc +0 -0
  31. unish/pi3/models/dinov2/hub/__pycache__/backbones.cpython-310.pyc +0 -0
  32. unish/pi3/models/dinov2/hub/__pycache__/utils.cpython-310.pyc +0 -0
  33. unish/pi3/models/dinov2/hub/backbones.py +156 -0
  34. unish/pi3/models/dinov2/hub/utils.py +39 -0
  35. unish/pi3/models/dinov2/layers/__init__.py +11 -0
  36. unish/pi3/models/dinov2/layers/__pycache__/__init__.cpython-310.pyc +0 -0
  37. unish/pi3/models/dinov2/layers/__pycache__/attention.cpython-310.pyc +0 -0
  38. unish/pi3/models/dinov2/layers/__pycache__/block.cpython-310.pyc +0 -0
  39. unish/pi3/models/dinov2/layers/__pycache__/dino_head.cpython-310.pyc +0 -0
  40. unish/pi3/models/dinov2/layers/__pycache__/drop_path.cpython-310.pyc +0 -0
  41. unish/pi3/models/dinov2/layers/__pycache__/layer_scale.cpython-310.pyc +0 -0
  42. unish/pi3/models/dinov2/layers/__pycache__/mlp.cpython-310.pyc +0 -0
  43. unish/pi3/models/dinov2/layers/__pycache__/patch_embed.cpython-310.pyc +0 -0
  44. unish/pi3/models/dinov2/layers/__pycache__/swiglu_ffn.cpython-310.pyc +0 -0
  45. unish/pi3/models/dinov2/layers/attention.py +89 -0
  46. unish/pi3/models/dinov2/layers/block.py +259 -0
  47. unish/pi3/models/dinov2/layers/dino_head.py +58 -0
  48. unish/pi3/models/dinov2/layers/drop_path.py +34 -0
  49. unish/pi3/models/dinov2/layers/layer_scale.py +27 -0
  50. unish/pi3/models/dinov2/layers/mlp.py +40 -0
README.md CHANGED
@@ -1,13 +1,67 @@
  ---
- title: UniSH
- emoji: 🏆
- colorFrom: pink
- colorTo: red
+ title: UniSH (Unified Scene & Human Reconstruction)
+ emoji: 🏃‍♂️
+ colorFrom: blue
+ colorTo: purple
  sdk: gradio
- sdk_version: 6.3.0
+ sdk_version: 5.0.0
  app_file: app.py
  pinned: false
- license: apache-2.0
+ license: cc-by-nc-4.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
+
+ <div align="center">
+
+ Mengfei Li<sup>1</sup>, Peng Li<sup>1</sup>, Zheng Zhang<sup>2</sup>, Jiahao Lu<sup>1</sup>, Chengfeng Zhao<sup>1</sup>, Wei Xue<sup>1</sup>, <br>
+ Qifeng Liu<sup>1</sup>, Sida Peng<sup>3</sup>, Wenxiao Zhang<sup>1</sup>, Wenhan Luo<sup>1</sup>, Yuan Liu<sup>1†</sup>, Yike Guo<sup>1†</sup>
+
+ <sup>1</sup>The Hong Kong University of Science and Technology, <sup>2</sup>Beijing University of Posts and Telecommunications, <sup>3</sup>Zhejiang University
+
+ <a href="https://murphylmf.github.io/UniSH/"><img src="https://img.shields.io/badge/Project-Page-8A2BE2" alt="Project Page"></a>
+ <a href="https://arxiv.org/abs/2601.01222"><img src="https://img.shields.io/badge/arXiv-2601.01222-b31b1b.svg" alt="arXiv"></a>
+ <a href="https://github.com/murphylmf/UniSH"><img src="https://img.shields.io/badge/GitHub-Code-black.svg" alt="Code"></a>
+
+ </div>
+
+ ## Abstract
+
+ We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos.
+
+ To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real data by directly optimizing the geometric correspondence between the SMPL mesh and the human point cloud. This approach enables our feed-forward model to jointly recover high-fidelity scene geometry, human point clouds, camera parameters, and coherent, metric-scale SMPL bodies, all in a single forward pass. Extensive experiments demonstrate that our model achieves state-of-the-art performance on human-centric scene reconstruction and delivers highly competitive results on global human motion estimation, comparing favorably against both optimization-based frameworks and HMR-only methods.
+
+ ## Method
+
+ ![Teaser](static/teaser.svg)
+
+ **The network architecture of UniSH.**
+ UniSH takes a monocular video as input. The video frames are processed by the **Reconstruction Branch** to predict per-frame camera extrinsics *E*, confidence maps *C*, and pointmaps *P*. Camera intrinsics *K* are derived from the pointmaps. Human crops from the video are fed into the **Human Body Branch** along with *K* to estimate global SMPL shape parameters *β* and per-frame pose parameters *θ<sub>i</sub>*. Features from both branches are processed by **AlignNet** to predict the global scene scale *s* and per-frame SMPL translations *t<sub>i</sub>* for coherent scene and human alignment.
+
+ ## Usage
+
+ This Space provides an interactive demo for UniSH.
+
+ 1. **Upload a Video**: Upload a monocular video containing a human.
+ 2. **Set Duration**: Choose the duration to process (default: 3 seconds).
+ 3. **Run Inference**: Click "Run Inference" to generate the 3D reconstruction.
+ 4. **Visualize**: The result will be displayed in an interactive 3D viewer where you can rotate, pan, and zoom.
+
+ ## BibTeX
+
+ ```bibtex
+ @misc{li2026unishunifyingscenehuman,
+ title={UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass},
+ author={Mengfei Li and Peng Li and Zheng Zhang and Jiahao Lu and Chengfeng Zhao and Wei Xue and Qifeng Liu and Sida Peng and Wenxiao Zhang and Wenhan Luo and Yuan Liu and Yike Guo},
+ year={2026},
+ eprint={2601.01222},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2601.01222},
+ }
+ ```
+
+ ## Acknowledgements
+
+ This website is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
+ Template borrowed from [Nerfies](https://github.com/nerfies/nerfies.github.io).
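Beyond the hosted demo described in the Usage section of this README, the same pipeline can be driven programmatically with the helpers that `app.py` imports from `unish.utils.inference_utils`. A minimal sketch, assuming the repository root is on `PYTHONPATH`, the checkpoint and SMPL assets are already in place, and `my_clip.mp4` is an illustrative input path:

```python
# Minimal sketch of the Space's predict() flow (illustrative paths; the values
# mirror the defaults used in app.py: fps=6.0, human_idx=0, target_size=518, chunk_size=30).
import torch
from unish.utils.inference_utils import load_model, process_video, run_inference

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = load_model()  # app.py calls this without arguments; inference.py passes --checkpoint instead
model.to(device)
model.eval()

data_dict = process_video("my_clip.mp4", 6.0, 0, 518, bbox_scale=1.0)  # video, fps, human_idx, target_size
results = run_inference(model, data_dict, device, chunk_size=30)
print(results["seq_name"])
```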
app.py ADDED
@@ -0,0 +1,457 @@
1
+ import gradio as gr
2
+ import spaces
3
+ import os
4
+ import sys
5
+ import shutil
6
+ import tempfile
7
+ import torch
8
+ import cv2
9
+ import subprocess
10
+ import numpy as np
11
+ import trimesh
12
+ from huggingface_hub import hf_hub_download
13
+
14
+ # Add current directory to path
15
+ sys.path.append(os.path.dirname(os.path.abspath(__file__)))
16
+
17
+ from unish.utils.inference_utils import (
18
+ load_model,
19
+ process_video,
20
+ run_inference,
21
+ generate_mixed_geometries_in_memory,
22
+ save_smpl_meshes_per_frame,
23
+ save_scene_only_point_clouds,
24
+ save_human_point_clouds,
25
+ save_camera_parameters_per_frame
26
+ )
27
+
28
+ MODEL = None
29
+ BODY_MODELS_PATH = "body_models/"
30
+
31
+ def download_smpl_assets(body_models_path):
32
+ """
33
+ Download SMPL models from private repository if they don't exist.
34
+ The path logic mimics SMPLWrapper's expectation:
35
+ 1. SMPLWrapper appends 'smpl' if not present in body_models_path.
36
+ 2. smplx library expects another 'smpl' folder inside that (or appends it).
37
+ Based on existing structure 'body_models/smpl/smpl/SMPL_*.pkl', the target dir is constructed below.
38
+ """
39
+ if 'smpl' not in body_models_path:
40
+ model_path = os.path.join(body_models_path, 'smpl')
41
+ else:
42
+ model_path = body_models_path
43
+
44
+ # smplx looks for a 'smpl' folder inside the given model_path
45
+ target_dir = os.path.join(model_path, 'smpl')
46
+
47
+ os.makedirs(target_dir, exist_ok=True)
48
+
49
+ files = ["SMPL_NEUTRAL.pkl", "SMPL_MALE.pkl", "SMPL_FEMALE.pkl"]
50
+ token = os.environ.get("SMPL_DOWNLOAD_TOKEN")
51
+
52
+ for filename in files:
53
+ file_path = os.path.join(target_dir, filename)
54
+ if not os.path.exists(file_path):
55
+ if not token:
56
+ print(f"Warning: SMPL_DOWNLOAD_TOKEN not set. Cannot download {filename}.")
57
+ continue
58
+
59
+ print(f"Downloading {filename} to {target_dir}...")
60
+ try:
61
+ hf_hub_download(
62
+ repo_id="Murphyyyy/UniSH-Private-Assets",
63
+ filename=filename,
64
+ local_dir=target_dir,
65
+ token=token
66
+ )
67
+ except Exception as e:
68
+ print(f"Failed to download {filename}: {e}")
69
+
70
+ def pack_sequence_to_glb(base_dir, output_path, start_frame=0, end_frame=60, scene_rate=0.5):
71
+ scene = trimesh.Scene()
72
+
73
+ print(f">>> Packing frames {start_frame} to {end_frame}...")
74
+
75
+ valid_count = 0
76
+
77
+ for i in range(start_frame, end_frame):
78
+ frame_node_name = f"frame_{valid_count}"
79
+
80
+ s_path = os.path.join(base_dir, "scene_only_point_clouds", f"scene_only_frame_{i:04d}.ply")
81
+ h_path = os.path.join(base_dir, "human_only_point_clouds", f"human_frame_{i:04d}.ply")
82
+ smpl_path = os.path.join(base_dir, "smpl_meshes_per_frame", f"smpl_mesh_frame_{i:04d}.ply")
83
+
84
+ if not (os.path.exists(h_path) or os.path.exists(smpl_path)):
85
+ continue
86
+
87
+ scene.graph.update(frame_node_name, parent="world")
88
+
89
+ if os.path.exists(smpl_path):
90
+ try:
91
+ smpl = trimesh.load(smpl_path)
92
+ flesh_color = [255, 160, 122, 255]
93
+ smpl.visual.vertex_colors = np.tile(flesh_color, (len(smpl.vertices), 1))
94
+
95
+ scene.add_geometry(smpl, node_name=f"{frame_node_name}_smpl", parent_node_name=frame_node_name)
96
+ except Exception as e:
97
+ pass
98
+
99
+ if os.path.exists(h_path):
100
+ try:
101
+ human = trimesh.load(h_path)
102
+ if isinstance(human, trimesh.PointCloud):
103
+ scene.add_geometry(human, node_name=f"{frame_node_name}_human", parent_node_name=frame_node_name)
104
+ except: pass
105
+
106
+ if os.path.exists(s_path):
107
+ try:
108
+ s_obj = trimesh.load(s_path)
109
+ if isinstance(s_obj, trimesh.PointCloud):
110
+ total_pts = len(s_obj.vertices)
111
+ if total_pts > 0:
112
+ if scene_rate < 0.99:
113
+ count = int(total_pts * scene_rate)
114
+ if count > 100:
115
+ idx = np.random.choice(total_pts, count, replace=False)
116
+ s_obj = trimesh.PointCloud(s_obj.vertices[idx], colors=s_obj.colors[idx])
117
+ scene.add_geometry(s_obj, node_name=f"{frame_node_name}_scene", parent_node_name=frame_node_name)
118
+ except: pass
119
+
120
+ valid_count += 1
121
+
122
+ if valid_count == 0:
123
+ print("Error: No valid frames found.")
124
+ return
125
+
126
+ try:
127
+ rot = trimesh.transformations.rotation_matrix(np.radians(-90), [1, 0, 0])
128
+ scene.apply_transform(rot)
129
+ except: pass
130
+
131
+ os.makedirs(os.path.dirname(output_path), exist_ok=True)
132
+ print(f">>> Exporting to {output_path}...")
133
+ scene.export(output_path)
134
+ print(f">>> Done! Saved {valid_count} frames.")
135
+
136
+ def get_player_html(glb_abs_path):
137
+ html_content = f"""
138
+ <!DOCTYPE html>
139
+ <html>
140
+ <head>
141
+ <meta charset="utf-8">
142
+ <title>UniSH Viewer</title>
143
+ <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css">
144
+ <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
145
+ <style>
146
+ #canvas-container {{
147
+ width: 100%;
148
+ height: 600px;
149
+ background: #f5f5f5;
150
+ border-radius: 8px;
151
+ position: relative;
152
+ overflow: hidden;
153
+ box-shadow: inset 0 0 20px rgba(0,0,0,0.05);
154
+ }}
155
+ .slider {{
156
+ width: 100%;
157
+ }}
158
+ </style>
159
+ <script type="importmap">
160
+ {{
161
+ "imports": {{
162
+ "three": "https://unpkg.com/three@0.158.0/build/three.module.js",
163
+ "three/addons/": "https://unpkg.com/three@0.158.0/examples/jsm/"
164
+ }}
165
+ }}
166
+ </script>
167
+ </head>
168
+ <body>
169
+ <div class="box" style="padding: 10px; background: #f5f5f5;">
170
+ <div id="canvas-container">
171
+ <div id="loading-overlay" style="position: absolute; top:0; left:0; width:100%; height:100%; background: rgba(0,0,0,0.7); color: white; display: flex; flex-direction: column; justify-content: center; align-items: center; z-index: 10;">
172
+ <span class="icon is-large"><i class="fas fa-spinner fa-pulse"></i></span>
173
+ <p style="margin-top: 10px;">Loading 3D Sequence...</p>
174
+ </div>
175
+ </div>
176
+
177
+ <div class="columns is-vcentered is-mobile" style="margin-top: 10px; padding: 0 10px;">
178
+ <div class="column is-narrow">
179
+ <button id="play-btn" class="button is-dark is-rounded is-small">
180
+ <span class="icon is-small"><i class="fas fa-play"></i></span>
181
+ </button>
182
+ </div>
183
+ <div class="column">
184
+ <input id="frame-slider" class="slider is-fullwidth is-circle is-dark" step="1" min="0" max="0" value="0" type="range">
185
+ </div>
186
+ <div class="column is-narrow">
187
+ <span id="frame-count" class="tag is-light" style="width: 80px;">Frame: 0</span>
188
+ </div>
189
+ </div>
190
+ </div>
191
+
192
+ <script type="module">
193
+ import * as THREE from 'three';
194
+ import {{ OrbitControls }} from 'three/addons/controls/OrbitControls.js';
195
+ import {{ GLTFLoader }} from 'three/addons/loaders/GLTFLoader.js';
196
+
197
+ // Inject the model path using f-string from Python
198
+ const MODEL_PATH = "/file={glb_abs_path}";
199
+ const FPS = 10;
200
+
201
+ let scene, camera, renderer, controls;
202
+ let frames = [];
203
+ let currentFrame = 0;
204
+ let isPlaying = false;
205
+ let intervalId = null;
206
+
207
+ const container = document.getElementById('canvas-container');
208
+ const slider = document.getElementById('frame-slider');
209
+ const playBtn = document.getElementById('play-btn');
210
+ const frameLabel = document.getElementById('frame-count');
211
+ const loadingOverlay = document.getElementById('loading-overlay');
212
+
213
+ init();
214
+
215
+ function init() {{
216
+ scene = new THREE.Scene();
217
+ scene.background = new THREE.Color(0xf5f5f5);
218
+
219
+ camera = new THREE.PerspectiveCamera(50, container.clientWidth / container.clientHeight, 0.1, 1000);
220
+ camera.position.set(-0.000, -4.272, 0.000);
221
+
222
+ renderer = new THREE.WebGLRenderer({{ antialias: true, alpha: true }});
223
+ renderer.setSize(container.clientWidth, container.clientHeight);
224
+ renderer.setPixelRatio(window.devicePixelRatio);
225
+
226
+ renderer.shadowMap.enabled = false;
227
+ renderer.useLegacyLights = false;
228
+
229
+ container.appendChild(renderer.domElement);
230
+
231
+ const hemiLight = new THREE.HemisphereLight(0xffffff, 0x444444, 3.0);
232
+ scene.add(hemiLight);
233
+
234
+ const dirLight = new THREE.DirectionalLight(0xffffff, 3.0);
235
+ dirLight.position.set(5, 10, 7);
236
+ scene.add(dirLight);
237
+
238
+ const frontLight = new THREE.DirectionalLight(0xffffff, 2.0);
239
+ frontLight.position.set(0, 0, 5);
240
+ scene.add(frontLight);
241
+
242
+ controls = new OrbitControls(camera, renderer.domElement);
243
+ controls.enableDamping = true;
244
+ controls.dampingFactor = 0.05;
245
+
246
+ controls.target.set(0.000, 0.000, 0.000);
247
+
248
+ const loader = new GLTFLoader();
249
+ console.log("Loading:", MODEL_PATH);
250
+
251
+ loader.load(MODEL_PATH, function (gltf) {{
252
+ const root = gltf.scene;
253
+ scene.add(root);
254
+
255
+ frames = [];
256
+ root.traverse((node) => {{
257
+
258
+ if (node.isMesh) {{
259
+ node.geometry.computeVertexNormals();
260
+ if (node.geometry.attributes.color) {{
261
+ node.geometry.deleteAttribute('color');
262
+ }}
263
+ node.material = new THREE.MeshStandardMaterial({{
264
+ color: 0xff9966,
265
+ roughness: 0.4,
266
+ metalness: 0.0,
267
+ side: THREE.DoubleSide
268
+ }});
269
+ node.material.vertexColors = false;
270
+ }}
271
+
272
+ if (node.isPoints) {{
273
+ if (node.name.toLowerCase().includes('scene')) {{
274
+ node.material.size = 0.05;
275
+ node.material.sizeAttenuation = true;
276
+ }}
277
+ if (node.name.toLowerCase().includes('human')) {{
278
+ node.material.size = 0.005;
279
+ }}
280
+ }}
281
+
282
+ if (node.name && node.name.startsWith('frame_')) {{
283
+ const parts = node.name.split('_');
284
+ if (parts.length === 2 && !isNaN(parseInt(parts[1]))) {{
285
+ const idx = parseInt(parts[1]);
286
+ frames[idx] = node;
287
+ node.visible = false;
288
+ }}
289
+ }}
290
+ }});
291
+
292
+ frames = frames.filter(n => n !== undefined);
293
+ console.log(`Loaded ${{frames.length}} frames.`);
294
+
295
+ if (frames.length > 0) {{
296
+ slider.max = frames.length - 1;
297
+ loadingOverlay.style.display = 'none';
298
+ showFrame(0);
299
+ }} else {{
300
+ loadingOverlay.innerHTML = "<p>No frames found.</p>";
301
+ }}
302
+
303
+ }}, undefined, function (error) {{
304
+ console.error(error);
305
+ loadingOverlay.innerHTML = "<p>Error loading model.</p>";
306
+ }});
307
+
308
+ window.addEventListener('resize', onWindowResize);
309
+ animate();
310
+ }}
311
+
312
+ function showFrame(idx) {{
313
+ if (!frames[idx]) return;
314
+ if (frames[currentFrame]) frames[currentFrame].visible = false;
315
+ frames[idx].visible = true;
316
+ currentFrame = idx;
317
+ slider.value = idx;
318
+ frameLabel.innerText = `Frame: ${{idx}}`;
319
+ }}
320
+
321
+ function togglePlay() {{
322
+ if (frames.length === 0) return;
323
+ isPlaying = !isPlaying;
324
+
325
+ const icon = playBtn.querySelector('.fa-play, .fa-pause');
326
+
327
+ if (isPlaying) {{
328
+ if(icon) {{ icon.classList.remove('fa-play'); icon.classList.add('fa-pause'); }}
329
+ intervalId = setInterval(() => {{
330
+ let next = currentFrame + 1;
331
+ if (next >= frames.length) next = 0;
332
+ showFrame(next);
333
+ }}, 1000 / FPS);
334
+ }} else {{
335
+ if(icon) {{ icon.classList.remove('fa-pause'); icon.classList.add('fa-play'); }}
336
+ clearInterval(intervalId);
337
+ }}
338
+ }}
339
+
340
+ slider.addEventListener('input', (e) => {{
341
+ if (isPlaying) togglePlay();
342
+ showFrame(parseInt(e.target.value));
343
+ }});
344
+ playBtn.addEventListener('click', togglePlay);
345
+
346
+ function onWindowResize() {{
347
+ camera.aspect = container.clientWidth / container.clientHeight;
348
+ camera.updateProjectionMatrix();
349
+ renderer.setSize(container.clientWidth, container.clientHeight);
350
+ }}
351
+
352
+ function animate() {{
353
+ requestAnimationFrame(animate);
354
+ controls.update();
355
+ renderer.render(scene, camera);
356
+ }}
357
+ </script>
358
+ </body>
359
+ </html>
360
+ """
361
+ return html_content
362
+
363
+ @spaces.GPU(duration=120)
364
+ def predict(video_path, duration_seconds=3.0):
365
+ global MODEL
366
+
367
+ # 0. Setup directories
368
+ output_dir = tempfile.mkdtemp()
369
+
370
+ # 1. Trim video
371
+ duration = min(float(duration_seconds), 10.0)
372
+ trimmed_video_path = os.path.join(output_dir, "input_trimmed.mp4")
373
+
374
+ cmd = [
375
+ "ffmpeg", "-i", video_path,
376
+ "-t", str(duration),
377
+ "-c:v", "libx264", "-c:a", "aac",
378
+ trimmed_video_path, "-y"
379
+ ]
380
+ subprocess.run(cmd, check=True)
381
+
382
+ # 2. Load Model
383
+ if MODEL is None:
384
+ MODEL = load_model()
385
+
386
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
387
+ MODEL.to(device)
388
+ MODEL.eval()
389
+
390
+ # 3. Process Video
391
+ fps = 6.0
392
+ target_size = 518
393
+ human_idx = 0
394
+ bbox_scale = 1.0
395
+
396
+ # Check and download SMPL assets
397
+ download_smpl_assets(BODY_MODELS_PATH)
398
+
399
+ data_dict = process_video(
400
+ trimmed_video_path, fps, human_idx, target_size,
401
+ bbox_scale=bbox_scale
402
+ )
403
+
404
+ # 4. Run Inference
405
+ results = run_inference(MODEL, data_dict, device, chunk_size=30)
406
+
407
+ # 5. Generate Geometries & Save
408
+ seq_name = results['seq_name']
409
+
410
+ viz_scene_point_clouds, viz_smpl_meshes, viz_scene_only_point_clouds, smpl_points_for_camera = generate_mixed_geometries_in_memory(
411
+ results, BODY_MODELS_PATH, fps=fps, conf_thres=0.1
412
+ )
413
+
414
+ # Save to disk
415
+ save_smpl_meshes_per_frame(results, output_dir, BODY_MODELS_PATH)
416
+ save_scene_only_point_clouds(viz_scene_only_point_clouds, output_dir, seq_name)
417
+ save_human_point_clouds(viz_scene_point_clouds, viz_scene_only_point_clouds, output_dir, seq_name, results)
418
+
419
+ # 6. Pack to GLB
420
+ base_dir = os.path.join(output_dir, seq_name)
421
+ output_glb_path = os.path.join(output_dir, "output.glb")
422
+
423
+ num_frames = len(viz_scene_point_clouds)
424
+
425
+ pack_sequence_to_glb(
426
+ base_dir,
427
+ output_glb_path,
428
+ start_frame=0,
429
+ end_frame=num_frames,
430
+ scene_rate=0.5
431
+ )
432
+
433
+ return get_player_html(output_glb_path)
434
+
435
+ with gr.Blocks() as demo:
436
+ gr.Markdown("# UniSH Demo")
437
+ gr.Markdown("Upload a video to reconstruct scene and human in 3D.")
438
+
439
+ with gr.Row():
440
+ with gr.Column():
441
+ input_video = gr.Video(label="Input Video")
442
+ duration_slider = gr.Slider(minimum=1, maximum=10, value=3, step=1, label="Duration to Process (seconds)")
443
+ submit_btn = gr.Button("Run Inference", variant="primary")
444
+
445
+ with gr.Column():
446
+ output_html = gr.HTML(label="3D Result", min_height=600)
447
+
448
+ submit_btn.click(
449
+ predict,
450
+ inputs=[input_video, duration_slider],
451
+ outputs=[output_html]
452
+ )
453
+
454
+ demo.queue()
455
+ demo.launch()
456
+
457
+
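A note on the output format: `pack_sequence_to_glb` writes every timestep as a scene-graph node named `frame_<i>`, and the embedded three.js player animates the sequence simply by toggling node visibility at `FPS = 10`. A rough sanity-check sketch (the GLB path is illustrative, and glTF exporters can decorate node names, so treat the count as approximate):

```python
# Rough check that an exported GLB contains the per-frame nodes the viewer expects.
import trimesh

scene = trimesh.load("output.glb")  # illustrative path; use the file written by pack_sequence_to_glb
frame_nodes = [name for name in scene.graph.nodes if name.startswith("frame_")]
print(f"{len(frame_nodes)} frame nodes, {len(scene.geometry)} geometries")
```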
environment.yml ADDED
@@ -0,0 +1,14 @@
+ name: unish
+ channels:
+ - conda-forge
+ - defaults
+ dependencies:
+ - python=3.10
+ - pip
+ - git
+ - ninja
+ - mesalib
+ - libgl-devel
+ - libegl-devel
+ - gxx_linux-64=11.*
+ - ffmpeg
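In a typical setup this file is consumed with `conda env create -f environment.yml` followed by `conda activate unish`; the accompanying `install.sh` (further below) refuses to run unless a conda environment is active.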
inference.py ADDED
@@ -0,0 +1,186 @@
1
+ import argparse
2
+ import os
3
+ import torch
4
+ import numpy as np
5
+ import random
6
+ import logging
7
+ from unish.utils.inference_utils import *
8
+
9
+ def setup_seed(seed):
10
+ torch.manual_seed(seed)
11
+ torch.cuda.manual_seed_all(seed)
12
+ np.random.seed(seed)
13
+ random.seed(seed)
14
+ torch.backends.cudnn.deterministic = True
15
+
16
+ def setup_logging(output_dir):
17
+ os.makedirs(output_dir, exist_ok=True)
18
+
19
+ # Create logger
20
+ logger = logging.getLogger()
21
+ logger.setLevel(logging.INFO)
22
+
23
+ # Create handlers
24
+ c_handler = logging.StreamHandler()
25
+ f_handler = logging.FileHandler(os.path.join(output_dir, 'inference.log'), mode='w')
26
+
27
+ # Create formatters and add it to handlers
28
+ c_format = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
29
+ f_format = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
30
+ c_handler.setFormatter(c_format)
31
+ f_handler.setFormatter(f_format)
32
+
33
+ # Add handlers to the logger
34
+ logger.addHandler(c_handler)
35
+ logger.addHandler(f_handler)
36
+
37
+ return logger
38
+
39
+ def main():
40
+ parser = argparse.ArgumentParser(description="Video Inference Script")
41
+ parser.add_argument("--video_path", type=str, required=True,
42
+ help="Path to the input video file or directory containing images")
43
+ parser.add_argument("--fps", type=float, default=6.0,
44
+ help="Target FPS for frame extraction (default: 6.0)")
45
+ parser.add_argument("--original_fps", type=float, default=30.0,
46
+ help="Original FPS of the image sequence (default: 30.0, used only for directory input)")
47
+ parser.add_argument("--target_size", type=int, default=518,
48
+ help="Target size for frame processing (default: 518)")
49
+ parser.add_argument("--checkpoint", type=str, default="checkpoints/unish_release.safetensors",
50
+ help="Path to the model checkpoint")
51
+ parser.add_argument("--output_dir", type=str, default="inference_results_video",
52
+ help="Output directory for results")
53
+ parser.add_argument("--body_models_path", type=str, default="body_models/",
54
+ help="Path to SMPL body models")
55
+ parser.add_argument("--device", type=str, default="cuda",
56
+ help="Device to run inference on")
57
+ parser.add_argument("--save_results", action="store_true", default=True,
58
+ help="Save additional results including smpl_points_for_camera (default: True)")
59
+ parser.add_argument("--chunk_size", type=int, default=30,
60
+ help="Number of frames to process in each chunk during inference (default: 30)")
61
+ parser.add_argument("--gpu_id", type=int, default=0,
62
+ help="GPU ID to use for inference (default: 0)")
63
+ parser.add_argument("--camera_mode", type=str, default="fixed",
64
+ choices=["predicted", "fixed"],
65
+ help="Camera mode: 'predicted' uses model-predicted camera parameters, "
66
+ "'fixed' uses a fixed camera angle (default: predicted)")
67
+ parser.add_argument("--human_idx", type=int, default=0,
68
+ help="Human index to process (default: 0)")
69
+ parser.add_argument("--start_idx", type=int, default=None,
70
+ help="Start frame index for processing (default: None, process from beginning)")
71
+ parser.add_argument("--end_idx", type=int, default=None,
72
+ help="End frame index for processing (default: None, process to end)")
73
+ parser.add_argument("--bbox_scale", type=float, default=1.0,
74
+ help="Scale factor for bounding box size (default: 1.0)")
75
+ parser.add_argument("--conf_thres", type=float, default=0.1,
76
+ help="Confidence threshold for point cloud generation (default: 0.1)")
77
+
78
+ # New arguments
79
+ parser.add_argument("--seed", type=int, default=42, help="Random seed for reproducibility")
80
+ parser.add_argument("--yolo_ckpt", type=str, default="ckpts/yolo11n.pt", help="Path to YOLO checkpoint")
81
+ parser.add_argument("--sam2_model", type=str, default="facebook/sam2-hiera-large", help="SAM2 model name or path")
82
+
83
+ args = parser.parse_args()
84
+
85
+ # Setup seed
86
+ setup_seed(args.seed)
87
+
88
+ # Setup logging
89
+ logger = setup_logging(args.output_dir)
90
+
91
+ # Setup device
92
+ if torch.cuda.is_available():
93
+ if args.device == "cuda":
94
+ # Use specified GPU ID
95
+ device = torch.device(f"cuda:{args.gpu_id}")
96
+ # Set the current CUDA device
97
+ torch.cuda.set_device(args.gpu_id)
98
+ logger.info(
99
+ f"Using GPU {args.gpu_id}: {torch.cuda.get_device_name(args.gpu_id)}")
100
+ else:
101
+ device = torch.device(args.device)
102
+ else:
103
+ device = torch.device("cpu")
104
+ logger.info("CUDA not available, using CPU")
105
+
106
+ logger.info(f"Using device: {device}")
107
+
108
+ # Load model
109
+ logger.info("Loading model...")
110
+ model = load_model(args.checkpoint)
111
+ model = model.to(device)
112
+ model.eval()
113
+
114
+ # Process video
115
+ logger.info(f"Processing video: {args.video_path}")
116
+ data_dict = process_video(
117
+ args.video_path, args.fps, args.human_idx, args.target_size,
118
+ bbox_scale=args.bbox_scale, start_idx=args.start_idx, end_idx=args.end_idx,
119
+ original_fps=args.original_fps,
120
+ yolo_ckpt=args.yolo_ckpt, sam2_model=args.sam2_model
121
+ )
122
+
123
+ # Run inference
124
+ results = run_inference(model, data_dict, device, args.chunk_size)
125
+
126
+ # Create output directory
127
+ os.makedirs(args.output_dir, exist_ok=True)
128
+
129
+ viz_scene_point_clouds, viz_smpl_meshes, viz_scene_only_point_clouds, smpl_points_for_camera = generate_mixed_geometries_in_memory(
130
+ results, args.body_models_path, fps=args.fps, conf_thres=args.conf_thres
131
+ )
132
+
133
+ # Determine camera mode based on arguments
134
+ use_predicted_camera = (args.camera_mode == "predicted")
135
+ logger.info(f"Using {args.camera_mode} camera mode")
136
+
137
+ original_rgb_images = results['rgb_images']
138
+
139
+ if original_rgb_images is not None:
140
+ if hasattr(original_rgb_images, 'permute'): # It's a torch tensor
141
+ original_rgb_images = original_rgb_images.permute(
142
+ 0, 2, 3, 1).cpu().numpy() # [S, H, W, 3]
143
+ elif not isinstance(original_rgb_images, np.ndarray):
144
+ original_rgb_images = np.array(original_rgb_images)
145
+
146
+ # Ensure proper data type and range
147
+ if original_rgb_images.max() <= 1.0:
148
+ original_rgb_images = (
149
+ original_rgb_images * 255).astype(np.uint8)
150
+
151
+ original_human_boxes = data_dict['human_boxes']
152
+
153
+ run_visualization(viz_scene_point_clouds, viz_smpl_meshes, smpl_points_for_camera,
154
+ args.output_dir, results['seq_name'],
155
+ fps=args.fps, # Use original fps
156
+ rgb_images=original_rgb_images,
157
+ human_boxes=original_human_boxes,
158
+ chunk_size=args.chunk_size, # Use original chunk size
159
+ results=results,
160
+ use_predicted_camera=use_predicted_camera,
161
+ scene_only_point_clouds=viz_scene_only_point_clouds,
162
+ conf_thres=args.conf_thres)
163
+
164
+ if args.save_results:
165
+
166
+ logger.info("Creating SMPL meshes per frame...")
167
+ save_smpl_meshes_per_frame(
168
+ results, args.output_dir, args.body_models_path)
169
+
170
+ logger.info("Saving scene point clouds (without human)...")
171
+ save_scene_only_point_clouds(
172
+ viz_scene_only_point_clouds, args.output_dir, results['seq_name'])
173
+
174
+ logger.info("Saving human point clouds...")
175
+ save_human_point_clouds(viz_scene_point_clouds,
176
+ viz_scene_only_point_clouds, args.output_dir, results['seq_name'], results)
177
+
178
+ logger.info("Saving camera parameters per frame...")
179
+ save_camera_parameters_per_frame(
180
+ results, args.output_dir, results['seq_name'])
181
+
182
+ logger.info(f"Inference completed! Results saved to {args.output_dir}")
183
+
184
+
185
+ if __name__ == "__main__":
186
+ main()
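Based on the argument parser above, a representative local invocation might look like `python inference.py --video_path demo.mp4 --camera_mode predicted --save_results`, where `demo.mp4` stands in for any monocular clip and all other options fall back to the defaults listed in the parser.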
install.sh ADDED
@@ -0,0 +1,74 @@
+ #!/bin/bash
+ set -e
+
+ # ==========================================
+ # UniSH Auto-Install Script
+ # ==========================================
+
+ get_cuda_version() {
+     if [ ! -z "$1" ]; then echo "$1"; return; fi
+     if command -v nvidia-smi &> /dev/null; then
+         DRIVER_CUDA_MAJOR=$(nvidia-smi | grep "CUDA Version" | awk -F'CUDA Version:' '{print $2}' | awk -F'.' '{print $1}' | tr -d '[:space:]')
+         if [ "$DRIVER_CUDA_MAJOR" == "12" ]; then echo "12.1"; elif [ "$DRIVER_CUDA_MAJOR" == "11" ]; then echo "11.8"; else echo "12.1"; fi
+     else echo "12.1"; fi
+ }
+
+ if [[ -z "$CONDA_PREFIX" ]]; then
+     echo "❌ Error: Please activate the conda environment first!"
+     exit 1
+ fi
+
+ TARGET_CUDA=$(get_cuda_version "$1")
+ echo "========================================"
+ echo " Detected/Selected CUDA: $TARGET_CUDA"
+ echo "========================================"
+
+ if [[ "$TARGET_CUDA" == "12.1" ]]; then TORCH_INDEX_URL="https://download.pytorch.org/whl/cu121";
+ elif [[ "$TARGET_CUDA" == "11.8" ]]; then TORCH_INDEX_URL="https://download.pytorch.org/whl/cu118";
+ else TORCH_INDEX_URL=""; fi
+
+ echo "[1/6] Installing PyTorch 2.4.1 (CUDA $TARGET_CUDA)..."
+ pip install torch==2.4.1 torchvision==0.19.1 --index-url $TORCH_INDEX_URL
+
+ echo "[2/6] Installing Safe Requirements..."
+ pip install -r requirements.txt
+
+ echo "[3/6] Installing Custom Utils3D..."
+ pip install "git+https://github.com/EasternJournalist/utils3d.git@3fab839f0be9931dac7c8488eb0e1600c236e183"
+
+ echo "[4/6] Installing Heavy Dependencies..."
+ pip install open3d==0.19.0 --no-deps
+ pip install ultralytics==8.3.227 --no-deps
+ pip install timm==1.0.24 --no-deps
+
+ echo "[5/6] Installing MMCV & PyTorch3D..."
+ pip install mmcv==2.2.0 --no-deps --no-binary mmcv
+ pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable" --no-build-isolation
+
+ echo "[6/6] Installing SAM 2 (With Setuptools Fix)..."
+
+ pip install setuptools==69.5.1 wheel
+ rm -rf _tmp_install_sam2
+
+ mkdir -p _tmp_install_sam2
+ cd _tmp_install_sam2
+
+ echo " -> Cloning SAM 2..."
+ git clone https://github.com/facebookresearch/segment-anything-2.git --depth 1
+ cd segment-anything-2
+
+ echo " -> Patching setup.py..."
+ python -c "
+ path = 'setup.py'
+ with open(path, 'r') as f: c = f.read()
+ c = c.replace('torch>=2.5.1', 'torch>=2.4.1')
+ with open(path, 'w') as f: f.write(c)
+ "
+ pip install . --no-deps --no-build-isolation
+ cd ../..
+ rm -rf _tmp_install_sam2
+
+ echo "========================================"
+ echo "Installation Complete!"
+ python -c "import torch; print(f'PyTorch: {torch.__version__} | CUDA: {torch.version.cuda}')"
+ echo "========================================"
packages.txt ADDED
@@ -0,0 +1,7 @@
+ ffmpeg
+ libgl1-mesa-glx
+ libglib2.0-0
+ libegl1-mesa
+ xvfb
+
+
requirements.txt ADDED
@@ -0,0 +1,22 @@
+ torch==2.4.1
+ torchvision==0.19.1
+ numpy
+ scipy
+ trimesh
+ tqdm
+ opencv-python-headless
+ pillow
+ gradio
+ spaces
+ ninja
+ einops
+ safetensors
+ huggingface_hub
+ open3d==0.19.0
+ ultralytics==8.3.227
+ timm==1.0.24
+ git+https://github.com/EasternJournalist/utils3d.git@3fab839f0be9931dac7c8488eb0e1600c236e183
+ mmcv==2.2.0 --find-links https://download.openmmlab.com/mmcv/dist/cu121/torch2.4/index.html
+ pytorch3d @ https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/py310_cu121_pyt241/pytorch3d-0.7.8-cp310-cp310-linux_x86_64.whl
+ git+https://github.com/facebookresearch/segment-anything-2.git
+ smplx
static/teaser.svg ADDED
unish/__pycache__/pipeline.cpython-310.pyc ADDED
Binary file (6.53 kB).
 
unish/heads/__pycache__/align_net.cpython-310.pyc ADDED
Binary file (13.6 kB).
 
unish/heads/__pycache__/dpt_head.cpython-310.pyc ADDED
Binary file (12.6 kB).
 
unish/heads/__pycache__/head_act.cpython-310.pyc ADDED
Binary file (3.11 kB).
 
unish/heads/__pycache__/human_head_cliff.cpython-310.pyc ADDED
Binary file (2.92 kB).
 
unish/heads/__pycache__/pose_transformer.cpython-310.pyc ADDED
Binary file (10.9 kB).
 
unish/heads/__pycache__/t_cond_mlp.cpython-310.pyc ADDED
Binary file (6.08 kB).
 
unish/heads/__pycache__/utils.cpython-310.pyc ADDED
Binary file (3.14 kB).
 
unish/heads/__pycache__/vit.cpython-310.pyc ADDED
Binary file (11.2 kB).
 
unish/heads/align_net.py ADDED
@@ -0,0 +1,571 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ import math
5
+ import numpy as np
6
+
7
+ from unish.utils.data_utils import rot6d_to_rotmat
8
+ from unish.utils.constants import SMPL_MEAN_PARAMS
9
+
10
+
11
+ class TimeStepRoPE1D(nn.Module):
12
+ """1D RoPE for timestep embedding, similar to pi3's RoPE2D but for 1D time sequence"""
13
+
14
+ def __init__(self, freq=100.0):
15
+ super().__init__()
16
+ self.base = freq
17
+ self.cache = {}
18
+ self.max_train_len = 120
19
+
20
+ def get_cos_sin(self, D, seq_len, device, dtype):
21
+ if (D, seq_len, device, dtype) in self.cache:
22
+ return self.cache[D, seq_len, device, dtype]
23
+
24
+ if seq_len <= self.max_train_len:
25
+ assert D % 2 == 0
26
+
27
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, D, 2).float().to(device) / D))
28
+ t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
29
+ freqs = torch.einsum("i,j->ij", t, inv_freq).to(dtype)
30
+
31
+ freqs = torch.cat((freqs, freqs), dim=-1)
32
+ cos = freqs.cos() # (seq_len, D)
33
+ sin = freqs.sin() # (seq_len, D)
34
+ self.cache[D, seq_len, device, dtype] = (cos, sin)
35
+ return cos, sin
36
+
37
+ else:
38
+ cos_train, sin_train = self.get_cos_sin(D, self.max_train_len, device, dtype)
39
+ cos_train_res = cos_train.transpose(0, 1).unsqueeze(0)
40
+ sin_train_res = sin_train.transpose(0, 1).unsqueeze(0)
41
+
42
+ # [1, D, max_train_len] -> [1, D, seq_len]
43
+ cos_interp = F.interpolate(cos_train_res, size=seq_len, mode='linear', align_corners=True)
44
+ sin_interp = F.interpolate(sin_train_res, size=seq_len, mode='linear', align_corners=True)
45
+
46
+ # [1, D, seq_len] -> [seq_len, D]
47
+ cos_final = cos_interp.squeeze(0).transpose(0, 1)
48
+ sin_final = sin_interp.squeeze(0).transpose(0, 1)
49
+
50
+ self.cache[D, seq_len, device, dtype] = (cos_final, sin_final)
51
+ return cos_final, sin_final
52
+
53
+ @staticmethod
54
+ def rotate_half(x):
55
+ x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
56
+ return torch.cat((-x2, x1), dim=-1)
57
+
58
+ def apply_rope1d(self, tokens, pos1d, cos, sin):
59
+ """Apply 1D RoPE to tokens based on 1D positions"""
60
+ cos = torch.nn.functional.embedding(pos1d, cos)[:, None, :, :] # [batch, 1, seq_len, D]
61
+ sin = torch.nn.functional.embedding(pos1d, sin)[:, None, :, :] # [batch, 1, seq_len, D]
62
+ return (tokens * cos) + (self.rotate_half(tokens) * sin)
63
+
64
+ def forward(self, tokens, positions):
65
+ """
66
+ Apply 1D RoPE to tokens based on timestep positions.
67
+ Args:
68
+ tokens: [batch, num_heads, seq_len, head_dim]
69
+ positions: [batch, seq_len] - timestep positions (0, 1, 2, ...)
70
+ Returns:
71
+ tokens with RoPE applied: [batch, num_heads, seq_len, head_dim]
72
+ """
73
+ head_dim = tokens.size(3)
74
+ assert head_dim % 2 == 0, "head_dim should be a multiple of two"
75
+ assert positions.ndim == 2 # [batch, seq_len]
76
+
77
+ cos, sin = self.get_cos_sin(head_dim, int(positions.max()) + 1, tokens.device, tokens.dtype)
78
+
79
+ return self.apply_rope1d(tokens, positions.long(), cos, sin)
80
+
81
+
82
+ class TransformerDecoderLayer(nn.Module):
83
+ """单层Transformer Decoder with RoPE support"""
84
+
85
+ def __init__(self, hidden_dim=512, num_heads=8, ff_dim=1024, dropout=0.1, use_rope=True):
86
+ super().__init__()
87
+
88
+ self.use_rope = use_rope
89
+ self.hidden_dim = hidden_dim
90
+ self.num_heads = num_heads
91
+ self.head_dim = hidden_dim // num_heads
92
+
93
+ if use_rope:
94
+ self.self_attention = None
95
+ self.cross_attention = None
96
+
97
+ self.self_q_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
98
+ self.self_k_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
99
+ self.self_v_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
100
+ self.self_out_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
101
+
102
+ self.cross_q_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
103
+ self.cross_k_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
104
+ self.cross_v_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
105
+ self.cross_out_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
106
+
107
+ # RoPE for timestep embedding
108
+ self.timestep_rope = TimeStepRoPE1D(freq=100.0)
109
+ else:
110
+ self.self_attention = nn.MultiheadAttention(
111
+ embed_dim=hidden_dim,
112
+ num_heads=num_heads,
113
+ dropout=dropout,
114
+ batch_first=True
115
+ )
116
+
117
+ self.cross_attention = nn.MultiheadAttention(
118
+ embed_dim=hidden_dim,
119
+ num_heads=num_heads,
120
+ dropout=dropout,
121
+ batch_first=True
122
+ )
123
+
124
+ self.feed_forward = nn.Sequential(
125
+ nn.Linear(hidden_dim, ff_dim),
126
+ nn.ReLU(),
127
+ nn.Dropout(dropout),
128
+ nn.Linear(ff_dim, hidden_dim),
129
+ nn.Dropout(dropout)
130
+ )
131
+
132
+ self.norm1 = nn.LayerNorm(hidden_dim) # for self attention
133
+ self.norm2 = nn.LayerNorm(hidden_dim) # for cross attention
134
+ self.norm3 = nn.LayerNorm(hidden_dim) # for feed forward
135
+
136
+ # Dropout
137
+ self.dropout = nn.Dropout(dropout)
138
+ self.attn_dropout = nn.Dropout(dropout)
139
+
140
+ # Scale factor for attention
141
+ self.scale = self.head_dim ** -0.5
142
+
143
+ # Gradient checkpointing flag
144
+ self.use_gradient_checkpoint = False
145
+
146
+ def gradient_checkpointing_enable(self):
147
+ """Enable gradient checkpointing for memory optimization."""
148
+ self.use_gradient_checkpoint = True
149
+
150
+ def _rope_attention(self, q_proj, k_proj, v_proj, out_proj, query, key, value, timestep_pos=None):
151
+ """Apply RoPE-based attention using torch.nn.functional.scaled_dot_product_attention"""
152
+ batch_size, seq_len, _ = query.shape
153
+
154
+ # Project Q, K, V
155
+ q = q_proj(query).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
156
+ k = k_proj(key).view(batch_size, key.shape[1], self.num_heads, self.head_dim).transpose(1, 2)
157
+ v = v_proj(value).view(batch_size, value.shape[1], self.num_heads, self.head_dim).transpose(1, 2)
158
+
159
+ # Apply RoPE to Q and K if timestep positions are provided
160
+ if timestep_pos is not None and self.use_rope:
161
+ # For self-attention, both q and k use the same timestep positions
162
+ if query.shape == key.shape: # self-attention case
163
+ q = self.timestep_rope(q, timestep_pos)
164
+ k = self.timestep_rope(k, timestep_pos)
165
+ else: # cross-attention case
166
+ # Only apply RoPE to query (cam_token), key/value are spatial features
167
+ q = self.timestep_rope(q, timestep_pos)
168
+
169
+ attn_output = F.scaled_dot_product_attention(
170
+ q, k, v,
171
+ dropout_p=self.attn_dropout.p if self.training else 0.0,
172
+ scale=self.scale
173
+ )
174
+
175
+ # Reshape output
176
+ attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.hidden_dim)
177
+
178
+ # Output projection
179
+ return out_proj(attn_output)
180
+
181
+ def forward(self, query, key, value, self_attn_mask=None, cross_attn_mask=None, timestep_pos=None):
182
+ """
183
+ Args:
184
+ query: [batch, num_views, hidden_dim]
185
+ key: [batch, num_views, hidden_dim]
186
+ value: [batch, num_views, hidden_dim]
187
+ timestep_pos: [batch, num_views] - timestep positions for RoPE
188
+ """
189
+ if self.use_gradient_checkpoint and self.training:
190
+ from torch.utils.checkpoint import checkpoint
191
+
192
+ if self.use_rope:
193
+ # 1. Self Attention + Residual with RoPE (with gradient checkpointing)
194
+ self_attn_output = checkpoint(
195
+ self._rope_attention,
196
+ self.self_q_proj, self.self_k_proj, self.self_v_proj, self.self_out_proj,
197
+ query, query, query, timestep_pos,
198
+ use_reentrant=False
199
+ )
200
+ query = self.norm1(query + self.dropout(self_attn_output))
201
+
202
+ # 2. Cross Attention + Residual with RoPE (with gradient checkpointing)
203
+ cross_attn_output = checkpoint(
204
+ self._rope_attention,
205
+ self.cross_q_proj, self.cross_k_proj, self.cross_v_proj, self.cross_out_proj,
206
+ query, key, value, timestep_pos,
207
+ use_reentrant=False
208
+ )
209
+ query = self.norm2(query + self.dropout(cross_attn_output))
210
+ else:
211
+ # 1. Self Attention + Residual (with gradient checkpointing)
212
+ def self_attn_fn(q, k, v):
213
+ out, _ = self.self_attention(q, k, v, attn_mask=self_attn_mask)
214
+ return out
215
+ self_attn_output = checkpoint(self_attn_fn, query, query, query, use_reentrant=False)
216
+ query = self.norm1(query + self.dropout(self_attn_output))
217
+
218
+ # 2. Cross Attention + Residual (with gradient checkpointing)
219
+ def cross_attn_fn(q, k, v):
220
+ out, _ = self.cross_attention(q, k, v, attn_mask=cross_attn_mask)
221
+ return out
222
+ cross_attn_output = checkpoint(cross_attn_fn, query, key, value, use_reentrant=False)
223
+ query = self.norm2(query + self.dropout(cross_attn_output))
224
+
225
+ # 3. Feed Forward + Residual (with gradient checkpointing)
226
+ ff_output = checkpoint(self.feed_forward, query, use_reentrant=False)
227
+ query = self.norm3(query + ff_output)
228
+ else:
229
+ # Original implementation without gradient checkpointing
230
+ if self.use_rope:
231
+ # 1. Self Attention + Residual with RoPE
232
+ self_attn_output = self._rope_attention(
233
+ self.self_q_proj, self.self_k_proj, self.self_v_proj, self.self_out_proj,
234
+ query, query, query, timestep_pos
235
+ )
236
+ query = self.norm1(query + self.dropout(self_attn_output))
237
+
238
+ # 2. Cross Attention + Residual with RoPE
239
+ cross_attn_output = self._rope_attention(
240
+ self.cross_q_proj, self.cross_k_proj, self.cross_v_proj, self.cross_out_proj,
241
+ query, key, value, timestep_pos
242
+ )
243
+ query = self.norm2(query + self.dropout(cross_attn_output))
244
+ else:
245
+ # 1. Self Attention + Residual (original implementation)
246
+ self_attn_output, _ = self.self_attention(query, query, query, attn_mask=self_attn_mask)
247
+ query = self.norm1(query + self.dropout(self_attn_output))
248
+
249
+ # 2. Cross Attention + Residual (original implementation)
250
+ cross_attn_output, _ = self.cross_attention(query, key, value, attn_mask=cross_attn_mask)
251
+ query = self.norm2(query + self.dropout(cross_attn_output))
252
+
253
+ # 3. Feed Forward + Residual
254
+ ff_output = self.feed_forward(query)
255
+ query = self.norm3(query + ff_output)
256
+
257
+ return query
258
+
259
+
260
+ class CrossViewTransformerDecoderLayer(nn.Module):
261
+ """Cross-view Transformer Decoder Layer for V4 - handles concatenated tokens from multiple views"""
262
+
263
+ def __init__(self, hidden_dim=512, num_heads=8, ff_dim=1024, dropout=0.1, use_rope=True):
264
+ super().__init__()
265
+
266
+ self.use_rope = use_rope
267
+ self.hidden_dim = hidden_dim
268
+ self.num_heads = num_heads
269
+ self.head_dim = hidden_dim // num_heads
270
+
271
+ if use_rope:
272
+ self.self_attention = None
273
+ self.cross_attention = None
274
+
275
+ # Self-attention components
276
+ self.self_q_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
277
+ self.self_k_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
278
+ self.self_v_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
279
+ self.self_out_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
280
+
281
+ # Cross-attention components
282
+ self.cross_q_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
283
+ self.cross_k_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
284
+ self.cross_v_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
285
+ self.cross_out_proj = nn.Linear(hidden_dim, hidden_dim, bias=True)
286
+
287
+ # RoPE for timestep embedding
288
+ self.timestep_rope = TimeStepRoPE1D(freq=100.0)
289
+ else:
290
+ # Self-attention layer
291
+ self.self_attention = nn.MultiheadAttention(
292
+ embed_dim=hidden_dim,
293
+ num_heads=num_heads,
294
+ dropout=dropout,
295
+ batch_first=True
296
+ )
297
+
298
+ # Cross-attention layer
299
+ self.cross_attention = nn.MultiheadAttention(
300
+ embed_dim=hidden_dim,
301
+ num_heads=num_heads,
302
+ dropout=dropout,
303
+ batch_first=True
304
+ )
305
+
306
+ self.feed_forward = nn.Sequential(
307
+ nn.Linear(hidden_dim, ff_dim),
308
+ nn.ReLU(),
309
+ nn.Dropout(dropout),
310
+ nn.Linear(ff_dim, hidden_dim),
311
+ nn.Dropout(dropout)
312
+ )
313
+
314
+ self.norm1 = nn.LayerNorm(hidden_dim) # for self attention
315
+ self.norm2 = nn.LayerNorm(hidden_dim) # for cross attention
316
+ self.norm3 = nn.LayerNorm(hidden_dim) # for feed forward
317
+
318
+ self.dropout = nn.Dropout(dropout)
319
+ self.attn_dropout = nn.Dropout(dropout)
320
+
321
+ self.scale = self.head_dim ** -0.5
322
+
323
+ self.use_gradient_checkpoint = False
324
+
325
+ def gradient_checkpointing_enable(self):
326
+ """Enable gradient checkpointing for memory optimization."""
327
+ self.use_gradient_checkpoint = True
328
+
329
+ def _rope_attention(self, q_proj, k_proj, v_proj, out_proj, query, key, value, query_timestep_pos=None, key_timestep_pos=None):
330
+ """Apply RoPE-based attention for cross-view scenarios using torch.nn.functional.scaled_dot_product_attention"""
331
+ batch_size, query_seq_len, _ = query.shape
332
+ _, key_seq_len, _ = key.shape
333
+
334
+ # Project Q, K, V
335
+ q = q_proj(query).view(batch_size, query_seq_len, self.num_heads, self.head_dim).transpose(1, 2)
336
+ k = k_proj(key).view(batch_size, key_seq_len, self.num_heads, self.head_dim).transpose(1, 2)
337
+ v = v_proj(value).view(batch_size, key_seq_len, self.num_heads, self.head_dim).transpose(1, 2)
338
+
339
+ # Apply RoPE to Q and K if timestep positions are provided
340
+ if self.use_rope:
341
+ if query_timestep_pos is not None:
342
+ q_scale = q[:, :, 0:1, :] # [batch, num_heads, 1, head_dim] - scale token
343
+ q_cam = q[:, :, 1:, :] # [batch, num_heads, num_views, head_dim] - cam tokens
344
+
345
+ cam_timestep_pos = query_timestep_pos[:, 1:]
346
+ q_cam_rope = self.timestep_rope(q_cam, cam_timestep_pos)
347
+
348
+ q = torch.cat([q_scale, q_cam_rope], dim=2)
349
+ if key_timestep_pos is not None:
350
+ k = self.timestep_rope(k, key_timestep_pos)
351
+
352
+ attn_output = F.scaled_dot_product_attention(
353
+ q, k, v,
354
+ dropout_p=self.attn_dropout.p if self.training else 0.0,
355
+ scale=self.scale
356
+ )
357
+
358
+ # Reshape output
359
+ attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, query_seq_len, self.hidden_dim)
360
+
361
+ # Output projection
362
+ return out_proj(attn_output)
363
+
364
+ def forward(self, query, key, value, query_timestep_pos=None, key_timestep_pos=None):
365
+ """
366
+ Args:
367
+ query: [batch, num_queries, hidden_dim] - cam tokens + scale token
368
+ key: [batch, num_views * num_tokens, hidden_dim] - concatenated feature tokens from all views
369
+ value: [batch, num_views * num_tokens, hidden_dim] - concatenated feature tokens from all views
370
+ query_timestep_pos: [batch, num_queries] - timestep positions for query tokens
371
+ key_timestep_pos: [batch, num_views * num_tokens] - timestep positions for key/value tokens
372
+ """
373
+ if self.use_gradient_checkpoint and self.training:
374
+ from torch.utils.checkpoint import checkpoint
375
+
376
+ if self.use_rope:
377
+ # 1. Self Attention + Residual with RoPE (with gradient checkpointing)
378
+ self_attn_output = checkpoint(
379
+ self._rope_attention,
380
+ self.self_q_proj, self.self_k_proj, self.self_v_proj, self.self_out_proj,
381
+ query, query, query, query_timestep_pos, query_timestep_pos,
382
+ use_reentrant=False
383
+ )
384
+ query = self.norm1(query + self.dropout(self_attn_output))
385
+
386
+ # 2. Cross Attention + Residual with RoPE (with gradient checkpointing)
387
+ cross_attn_output = checkpoint(
388
+ self._rope_attention,
389
+ self.cross_q_proj, self.cross_k_proj, self.cross_v_proj, self.cross_out_proj,
390
+ query, key, value, query_timestep_pos, key_timestep_pos,
391
+ use_reentrant=False
392
+ )
393
+ query = self.norm2(query + self.dropout(cross_attn_output))
394
+ else:
395
+ # 1. Self Attention + Residual (with gradient checkpointing)
396
+ def self_attn_fn(q, k, v):
397
+ out, _ = self.self_attention(q, k, v)
398
+ return out
399
+ self_attn_output = checkpoint(self_attn_fn, query, query, query, use_reentrant=False)
400
+ query = self.norm1(query + self.dropout(self_attn_output))
401
+
402
+ # 2. Cross Attention + Residual (with gradient checkpointing)
403
+ def cross_attn_fn(q, k, v):
404
+ out, _ = self.cross_attention(q, k, v)
405
+ return out
406
+ cross_attn_output = checkpoint(cross_attn_fn, query, key, value, use_reentrant=False)
407
+ query = self.norm2(query + self.dropout(cross_attn_output))
408
+
409
+ # 3. Feed Forward + Residual (with gradient checkpointing)
410
+ ff_output = checkpoint(self.feed_forward, query, use_reentrant=False)
411
+ query = self.norm3(query + ff_output)
412
+ else:
413
+ # Original implementation without gradient checkpointing
414
+ if self.use_rope:
415
+ # 1. Self Attention + Residual with RoPE
416
+ self_attn_output = self._rope_attention(
417
+ self.self_q_proj, self.self_k_proj, self.self_v_proj, self.self_out_proj,
418
+ query, query, query, query_timestep_pos, query_timestep_pos
419
+ )
420
+ query = self.norm1(query + self.dropout(self_attn_output))
421
+
422
+ # 2. Cross Attention + Residual with RoPE
423
+ cross_attn_output = self._rope_attention(
424
+ self.cross_q_proj, self.cross_k_proj, self.cross_v_proj, self.cross_out_proj,
425
+ query, key, value, query_timestep_pos, key_timestep_pos
426
+ )
427
+ query = self.norm2(query + self.dropout(cross_attn_output))
428
+ else:
429
+ # 1. Self Attention + Residual (original implementation)
430
+ self_attn_output, _ = self.self_attention(query, query, query)
431
+ query = self.norm1(query + self.dropout(self_attn_output))
432
+
433
+ # 2. Cross Attention + Residual (original implementation)
434
+ cross_attn_output, _ = self.cross_attention(query, key, value)
435
+ query = self.norm2(query + self.dropout(cross_attn_output))
436
+
437
+ # 3. Feed Forward + Residual
438
+ ff_output = self.feed_forward(query)
439
+ query = self.norm3(query + ff_output)
440
+
441
+ return query
442
+
443
+
444
+ class AlignNet(nn.Module):
445
+ def __init__(self, aggregated_dim=2048, cam_dim=1024, hidden_dim=512, num_heads=8, ff_dim=512, dropout=0.1, use_rope=True, num_decoder_layers=2):
446
+ super().__init__()
447
+
448
+ self.use_rope = use_rope
449
+ self.hidden_dim = hidden_dim
450
+ self.num_decoder_layers = num_decoder_layers
451
+
452
+ self.scale_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
453
+
454
+ self.cam_feature_adapter = nn.Sequential(
455
+ nn.LayerNorm(cam_dim),
456
+ nn.Linear(cam_dim, hidden_dim),
457
+ nn.ReLU(),
458
+ nn.Dropout(dropout)
459
+ )
460
+
461
+ self.patch_feature_adapter = nn.Sequential(
462
+ nn.LayerNorm(aggregated_dim),
463
+ nn.Linear(aggregated_dim, hidden_dim),
464
+ nn.ReLU(),
465
+ nn.Dropout(dropout)
466
+ )
467
+ self.register_feature_adapter = nn.Sequential(
468
+ nn.LayerNorm(aggregated_dim),
469
+ nn.Linear(aggregated_dim, hidden_dim),
470
+ nn.ReLU(),
471
+ nn.Dropout(dropout)
472
+ )
473
+
474
+ self.decoder_layers = nn.ModuleList([
475
+ CrossViewTransformerDecoderLayer(hidden_dim, num_heads, ff_dim, dropout, use_rope=use_rope)
476
+ for _ in range(num_decoder_layers)
477
+ ])
478
+
479
+ mean_params = SMPL_MEAN_PARAMS
480
+ init_body_pose = torch.from_numpy(mean_params['pose'].astype(np.float32)).unsqueeze(0)
481
+ init_betas = torch.from_numpy(mean_params['shape'].astype('float32')).unsqueeze(0)
482
+ init_cam = torch.from_numpy(mean_params['cam'].astype(np.float32)).unsqueeze(0)
483
+ self.register_buffer('init_body_pose', init_body_pose)
484
+ self.register_buffer('init_betas', init_betas)
485
+ self.register_buffer('init_cam', init_cam)
486
+
487
+ self.trans_head = nn.Linear(hidden_dim, 3)
488
+
489
+ self.scale_head = nn.Linear(hidden_dim, 1)
490
+
491
+ self.joint_conversion_fn = rot6d_to_rotmat
492
+
493
+ def gradient_checkpointing_enable(self):
494
+ """Enable gradient checkpointing for memory optimization."""
495
+ for layer in self.decoder_layers:
496
+ if hasattr(layer, 'gradient_checkpointing_enable'):
497
+ layer.gradient_checkpointing_enable()
498
+
499
+ def forward(self, hidden_tokens, cam_token, fps=6.0):
500
+ batch_size, num_views, num_tokens, _ = hidden_tokens.shape
501
+
502
+ register_tokens = hidden_tokens[:, :, :5, :]
503
+ patch_tokens = hidden_tokens[:, :, 5:, :]
504
+
505
+ if cam_token.dim() == 4:
506
+ cam_token = cam_token.squeeze(2) # [batch, num_views, 1, 1024] -> [batch, num_views, 1024]
507
+
508
+ cam_adapted = self.cam_feature_adapter(cam_token) # [batch, num_views, hidden_dim]
509
+
510
+ patch_tokens_reshaped = patch_tokens.view(batch_size * num_views, patch_tokens.shape[2], -1) # [batch*num_views, 777, 2048]
511
+ patch_adapted_tokens = self.patch_feature_adapter(patch_tokens_reshaped) # [batch*num_views, 777, hidden_dim]
512
+ patch_adapted_tokens = patch_adapted_tokens.view(batch_size, num_views, patch_tokens.shape[2], -1) # [batch, num_views, 777, hidden_dim]
513
+
514
+ register_tokens_reshaped = register_tokens.view(batch_size * num_views, 5, -1) # [batch*num_views, 5, 2048]
515
+ register_adapted_tokens = self.register_feature_adapter(register_tokens_reshaped) # [batch*num_views, 5, hidden_dim]
516
+ register_adapted_tokens = register_adapted_tokens.view(batch_size, num_views, 5, -1) # [batch, num_views, 5, hidden_dim]
517
+
518
+ fused_features_per_view = torch.cat([register_adapted_tokens, patch_adapted_tokens], dim=2) # [batch, num_views, 782, hidden_dim]
519
+
520
+ concatenated_features = fused_features_per_view.view(batch_size, num_views * num_tokens, -1)
521
+
522
+ scale_token_expanded = self.scale_token.expand(batch_size, -1, -1)
523
+
524
+ query_tokens = torch.cat([scale_token_expanded, cam_adapted], dim=1)
525
+
526
+ if self.use_rope:
527
+ base_fps = 6.0
528
+
529
+ time_scale = base_fps / fps
530
+
531
+ scale_timestep = torch.zeros((batch_size, 1), device=cam_adapted.device, dtype=torch.long)
532
+
533
+ cam_timestep_float = torch.arange(num_views, device=cam_adapted.device, dtype=torch.float32) * time_scale
534
+ cam_timestep = cam_timestep_float.round().long().unsqueeze(0).expand(batch_size, -1)
535
+ query_timestep_pos = torch.cat([scale_timestep, cam_timestep], dim=1) # [batch, 1 + num_views]
536
+
537
+ key_timestep_base_float = torch.arange(num_views, device=cam_adapted.device, dtype=torch.float32) * time_scale
538
+ key_timestep_base = key_timestep_base_float.round().long()
539
+ key_timestep_pos = key_timestep_base.unsqueeze(1).expand(-1, num_tokens).flatten()
540
+ key_timestep_pos = key_timestep_pos.unsqueeze(0).expand(batch_size, -1) # [batch, num_views * num_tokens]
541
+ else:
542
+ query_timestep_pos = None
543
+ key_timestep_pos = None
544
+
545
+ decoder_output = query_tokens
546
+ for i, layer in enumerate(self.decoder_layers):
547
+ residual = decoder_output
548
+
549
+ decoder_output = layer(
550
+ decoder_output, concatenated_features, concatenated_features,
551
+ query_timestep_pos=query_timestep_pos, key_timestep_pos=key_timestep_pos
552
+ )
553
+
554
+ decoder_output = decoder_output + residual
555
+
556
+ scale_output = decoder_output[:, 0, :]
557
+ cam_outputs = decoder_output[:, 1:, :]
558
+
559
+ scale_logits = self.scale_head(scale_output) # [batch, 1]
560
+ scale = F.softplus(scale_logits)
561
+
562
+ trans_raw = self.trans_head(cam_outputs) # [batch, num_views, 3]
563
+ xy, z = trans_raw.split([2, 1], dim=-1) # xy: [batch, num_views, 2], z: [batch, num_views, 1]
564
+ z = torch.exp(z)
565
+ trans = torch.cat([xy * z, z], dim=-1) # [batch, num_views, 3]
566
+
567
+
568
+ return {
569
+ "scale": scale, # [batch, 1]
570
+ "trans_cam": trans, # [batch, num_views, 3]
571
+ }
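Editor's note: a minimal usage sketch for the AlignNet head above (illustrative only, not part of the committed file; the batch/view/token counts are hypothetical and follow the 5-register + 777-patch layout assumed in the comments, and the SMPL mean parameters shipped with this repo must be available when the module is constructed).

import torch
from unish.heads.align_net import AlignNet

align_net = AlignNet(aggregated_dim=2048, cam_dim=1024, hidden_dim=512).eval()
hidden_tokens = torch.randn(2, 8, 5 + 777, 2048)   # [batch, views, register + patch tokens, aggregated_dim]
cam_token = torch.randn(2, 8, 1, 1024)             # squeezed internally to [batch, views, cam_dim]
with torch.no_grad():
    out = align_net(hidden_tokens, cam_token, fps=6.0)
print(out["scale"].shape)       # torch.Size([2, 1])
print(out["trans_cam"].shape)   # torch.Size([2, 8, 3])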
unish/heads/dpt_head.py ADDED
@@ -0,0 +1,500 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+
8
+ # Inspired by https://github.com/DepthAnything/Depth-Anything-V2
9
+
10
+
11
+ import os
12
+ from typing import List, Dict, Tuple, Union
13
+
14
+ import torch
15
+ import torch.nn as nn
16
+ import torch.nn.functional as F
17
+ from .head_act import activate_head
18
+ from .utils import create_uv_grid, position_grid_to_embed
19
+
20
+
21
+ class DPTHead(nn.Module):
22
+ """
23
+ DPT Head for dense prediction tasks.
24
+
25
+ This implementation follows the architecture described in "Vision Transformers for Dense Prediction"
26
+ (https://arxiv.org/abs/2103.13413). The DPT head processes features from a vision transformer
27
+ backbone and produces dense predictions by fusing multi-scale features.
28
+
29
+ Args:
30
+ dim_in (int): Input dimension (channels).
31
+ patch_size (int, optional): Patch size. Default is 14.
32
+ output_dim (int, optional): Number of output channels. Default is 4.
33
+ activation (str, optional): Activation type. Default is "inv_log".
34
+ conf_activation (str, optional): Confidence activation type. Default is "expp1".
35
+ features (int, optional): Feature channels for intermediate representations. Default is 256.
36
+ out_channels (List[int], optional): Output channels for each intermediate layer.
37
+ intermediate_layer_idx (List[int], optional): Indices of layers from aggregated tokens used for DPT.
38
+ pos_embed (bool, optional): Whether to use positional embedding. Default is True.
39
+ feature_only (bool, optional): If True, return features only without the last several layers and activation head. Default is False.
40
+ down_ratio (int, optional): Downscaling factor for the output resolution. Default is 1.
41
+ """
42
+
43
+ def __init__(
44
+ self,
45
+ dim_in: int,
46
+ patch_size: int = 14,
47
+ output_dim: int = 4,
48
+ activation: str = "inv_log",
49
+ conf_activation: str = "expp1",
50
+ features: int = 256,
51
+ out_channels: List[int] = [256, 512, 1024, 1024],
52
+ intermediate_layer_idx: List[int] = [4, 11, 17, 23],
53
+ pos_embed: bool = True,
54
+ feature_only: bool = False,
55
+ down_ratio: int = 1,
56
+ ) -> None:
57
+ super(DPTHead, self).__init__()
58
+ self.patch_size = patch_size
59
+ self.activation = activation
60
+ self.conf_activation = conf_activation
61
+ self.pos_embed = pos_embed
62
+ self.feature_only = feature_only
63
+ self.down_ratio = down_ratio
64
+ self.intermediate_layer_idx = intermediate_layer_idx
65
+
66
+ self.norm = nn.LayerNorm(dim_in)
67
+
68
+ # Projection layers for each output channel from tokens.
69
+ self.projects = nn.ModuleList(
70
+ [
71
+ nn.Conv2d(
72
+ in_channels=dim_in,
73
+ out_channels=oc,
74
+ kernel_size=1,
75
+ stride=1,
76
+ padding=0,
77
+ )
78
+ for oc in out_channels
79
+ ]
80
+ )
81
+
82
+ # Resize layers for upsampling feature maps.
83
+ self.resize_layers = nn.ModuleList(
84
+ [
85
+ nn.ConvTranspose2d(
86
+ in_channels=out_channels[0], out_channels=out_channels[0], kernel_size=4, stride=4, padding=0
87
+ ),
88
+ nn.ConvTranspose2d(
89
+ in_channels=out_channels[1], out_channels=out_channels[1], kernel_size=2, stride=2, padding=0
90
+ ),
91
+ nn.Identity(),
92
+ nn.Conv2d(
93
+ in_channels=out_channels[3], out_channels=out_channels[3], kernel_size=3, stride=2, padding=1
94
+ ),
95
+ ]
96
+ )
97
+
98
+ self.scratch = _make_scratch(
99
+ out_channels,
100
+ features,
101
+ expand=False,
102
+ )
103
+
104
+ # Attach additional modules to scratch.
105
+ self.scratch.stem_transpose = None
106
+ self.scratch.refinenet1 = _make_fusion_block(features)
107
+ self.scratch.refinenet2 = _make_fusion_block(features)
108
+ self.scratch.refinenet3 = _make_fusion_block(features)
109
+ self.scratch.refinenet4 = _make_fusion_block(features, has_residual=False)
110
+
111
+ head_features_1 = features
112
+ head_features_2 = 32
113
+
114
+ if feature_only:
115
+ self.scratch.output_conv1 = nn.Conv2d(head_features_1, head_features_1, kernel_size=3, stride=1, padding=1)
116
+ else:
117
+ self.scratch.output_conv1 = nn.Conv2d(
118
+ head_features_1, head_features_1 // 2, kernel_size=3, stride=1, padding=1
119
+ )
120
+ conv2_in_channels = head_features_1 // 2
121
+
122
+ self.scratch.output_conv2 = nn.Sequential(
123
+ nn.Conv2d(conv2_in_channels, head_features_2, kernel_size=3, stride=1, padding=1),
124
+ nn.ReLU(inplace=True),
125
+ nn.Conv2d(head_features_2, output_dim, kernel_size=1, stride=1, padding=0),
126
+ )
127
+
128
+ def forward(
129
+ self,
130
+ aggregated_tokens_list: List[torch.Tensor],
131
+ images: torch.Tensor,
132
+ patch_start_idx: int,
133
+ frames_chunk_size: int = 8,
134
+ ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
135
+ """
136
+ Forward pass through the DPT head, supports processing by chunking frames.
137
+ Args:
138
+ aggregated_tokens_list (List[Tensor]): List of token tensors from different transformer layers.
139
+ images (Tensor): Input images with shape [B, S, 3, H, W], in range [0, 1].
140
+ patch_start_idx (int): Starting index for patch tokens in the token sequence.
141
+ Used to separate patch tokens from other tokens (e.g., camera or register tokens).
142
+ frames_chunk_size (int, optional): Number of frames to process in each chunk.
143
+ If None or larger than S, all frames are processed at once. Default: 8.
144
+
145
+ Returns:
146
+ Tensor or Tuple[Tensor, Tensor]:
147
+ - If feature_only=True: Feature maps with shape [B, S, C, H, W]
148
+ - Otherwise: Tuple of (predictions, confidence) both with shape [B, S, 1, H, W]
149
+ """
150
+ B, S, _, H, W = images.shape
151
+
152
+ # If frames_chunk_size is not specified or greater than S, process all frames at once
153
+ if frames_chunk_size is None or frames_chunk_size >= S:
154
+ return self._forward_impl(aggregated_tokens_list, images, patch_start_idx)
155
+
156
+ # Otherwise, process frames in chunks to manage memory usage
157
+ assert frames_chunk_size > 0
158
+
159
+ # Process frames in batches
160
+ all_preds = []
161
+ all_conf = []
162
+
163
+ for frames_start_idx in range(0, S, frames_chunk_size):
164
+ frames_end_idx = min(frames_start_idx + frames_chunk_size, S)
165
+
166
+ # Process batch of frames
167
+ if self.feature_only:
168
+ chunk_output = self._forward_impl(
169
+ aggregated_tokens_list, images, patch_start_idx, frames_start_idx, frames_end_idx
170
+ )
171
+ all_preds.append(chunk_output)
172
+ else:
173
+ chunk_preds, chunk_conf = self._forward_impl(
174
+ aggregated_tokens_list, images, patch_start_idx, frames_start_idx, frames_end_idx
175
+ )
176
+ all_preds.append(chunk_preds)
177
+ all_conf.append(chunk_conf)
178
+
179
+ # Concatenate results along the sequence dimension
180
+ if self.feature_only:
181
+ return torch.cat(all_preds, dim=1)
182
+ else:
183
+ return torch.cat(all_preds, dim=1), torch.cat(all_conf, dim=1)
184
+
185
+ def _forward_impl(
186
+ self,
187
+ aggregated_tokens_list: List[torch.Tensor],
188
+ images: torch.Tensor,
189
+ patch_start_idx: int,
190
+ frames_start_idx: int = None,
191
+ frames_end_idx: int = None,
192
+ ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
193
+ """
194
+ Implementation of the forward pass through the DPT head.
195
+
196
+ This method processes a specific chunk of frames from the sequence.
197
+
198
+ Args:
199
+ aggregated_tokens_list (List[Tensor]): List of token tensors from different transformer layers.
200
+ images (Tensor): Input images with shape [B, S, 3, H, W].
201
+ patch_start_idx (int): Starting index for patch tokens.
202
+ frames_start_idx (int, optional): Starting index for frames to process.
203
+ frames_end_idx (int, optional): Ending index for frames to process.
204
+
205
+ Returns:
206
+ Tensor or Tuple[Tensor, Tensor]: Feature maps or (predictions, confidence).
207
+ """
208
+ if frames_start_idx is not None and frames_end_idx is not None:
209
+ images = images[:, frames_start_idx:frames_end_idx].contiguous()
210
+
211
+ B, S, _, H, W = images.shape
212
+
213
+ patch_h, patch_w = H // self.patch_size, W // self.patch_size
214
+
215
+ out = []
216
+ dpt_idx = 0
217
+
218
+ for layer_idx in self.intermediate_layer_idx:
219
+ x = aggregated_tokens_list[layer_idx][:, :, patch_start_idx:]
220
+
221
+ x = x.to(self.projects[0].weight.dtype)
222
+
223
+ # Select frames if processing a chunk
224
+ if frames_start_idx is not None and frames_end_idx is not None:
225
+ x = x[:, frames_start_idx:frames_end_idx]
226
+
227
+ x = x.reshape(B * S, -1, x.shape[-1])
228
+
229
+ x = self.norm(x)
230
+
231
+ x = x.permute(0, 2, 1).reshape((x.shape[0], x.shape[-1], patch_h, patch_w))
232
+
233
+ x = self.projects[dpt_idx](x)
234
+ if self.pos_embed:
235
+ x = self._apply_pos_embed(x, W, H).to(self.projects[0].weight.dtype)
236
+
237
+ x = self.resize_layers[dpt_idx](x)
238
+
239
+ out.append(x)
240
+ dpt_idx += 1
241
+
242
+ # Fuse features from multiple layers.
243
+ out = self.scratch_forward(out)
244
+ # Interpolate fused output to match target image resolution.
245
+ out = custom_interpolate(
246
+ out,
247
+ (int(patch_h * self.patch_size / self.down_ratio), int(patch_w * self.patch_size / self.down_ratio)),
248
+ mode="bilinear",
249
+ align_corners=True,
250
+ )
251
+
252
+ if self.pos_embed:
253
+ out = self._apply_pos_embed(out, W, H).to(self.projects[0].weight.dtype)
254
+
255
+ if self.feature_only:
256
+ return out.view(B, S, *out.shape[1:])
257
+
258
+ out = self.scratch.output_conv2(out)
259
+ preds, conf = activate_head(out, activation=self.activation, conf_activation=self.conf_activation)
260
+
261
+ preds = preds.view(B, S, *preds.shape[1:])
262
+ conf = conf.view(B, S, *conf.shape[1:])
263
+ return preds, conf
264
+
265
+ def _apply_pos_embed(self, x: torch.Tensor, W: int, H: int, ratio: float = 0.1) -> torch.Tensor:
266
+ """
267
+ Apply positional embedding to tensor x.
268
+ """
269
+ patch_w = x.shape[-1]
270
+ patch_h = x.shape[-2]
271
+ pos_embed = create_uv_grid(patch_w, patch_h, aspect_ratio=W / H, dtype=x.dtype, device=x.device)
272
+ pos_embed = position_grid_to_embed(pos_embed, x.shape[1])
273
+ pos_embed = pos_embed * ratio
274
+ pos_embed = pos_embed.permute(2, 0, 1)[None].expand(x.shape[0], -1, -1, -1)
275
+ return x + pos_embed
276
+
277
+ def scratch_forward(self, features: List[torch.Tensor]) -> torch.Tensor:
278
+ """
279
+ Forward pass through the fusion blocks.
280
+
281
+ Args:
282
+ features (List[Tensor]): List of feature maps from different layers.
283
+
284
+ Returns:
285
+ Tensor: Fused feature map.
286
+ """
287
+ layer_1, layer_2, layer_3, layer_4 = features
288
+
289
+ layer_1_rn = self.scratch.layer1_rn(layer_1)
290
+ layer_2_rn = self.scratch.layer2_rn(layer_2)
291
+ layer_3_rn = self.scratch.layer3_rn(layer_3)
292
+ layer_4_rn = self.scratch.layer4_rn(layer_4)
293
+
294
+ out = self.scratch.refinenet4(layer_4_rn, size=layer_3_rn.shape[2:])
295
+ del layer_4_rn, layer_4
296
+
297
+ out = self.scratch.refinenet3(out, layer_3_rn, size=layer_2_rn.shape[2:])
298
+ del layer_3_rn, layer_3
299
+
300
+ out = self.scratch.refinenet2(out, layer_2_rn, size=layer_1_rn.shape[2:])
301
+ del layer_2_rn, layer_2
302
+
303
+ out = self.scratch.refinenet1(out, layer_1_rn)
304
+ del layer_1_rn, layer_1
305
+
306
+ out = self.scratch.output_conv1(out)
307
+ return out
308
+
309
+
310
+ ################################################################################
311
+ # Modules
312
+ ################################################################################
313
+
314
+
315
+ def _make_fusion_block(features: int, size: int = None, has_residual: bool = True, groups: int = 1) -> nn.Module:
316
+ return FeatureFusionBlock(
317
+ features,
318
+ nn.ReLU(inplace=True),
319
+ deconv=False,
320
+ bn=False,
321
+ expand=False,
322
+ align_corners=True,
323
+ size=size,
324
+ has_residual=has_residual,
325
+ groups=groups,
326
+ )
327
+
328
+
329
+ def _make_scratch(in_shape: List[int], out_shape: int, groups: int = 1, expand: bool = False) -> nn.Module:
330
+ scratch = nn.Module()
331
+ out_shape1 = out_shape
332
+ out_shape2 = out_shape
333
+ out_shape3 = out_shape
334
+ if len(in_shape) >= 4:
335
+ out_shape4 = out_shape
336
+
337
+ if expand:
338
+ out_shape1 = out_shape
339
+ out_shape2 = out_shape * 2
340
+ out_shape3 = out_shape * 4
341
+ if len(in_shape) >= 4:
342
+ out_shape4 = out_shape * 8
343
+
344
+ scratch.layer1_rn = nn.Conv2d(
345
+ in_shape[0], out_shape1, kernel_size=3, stride=1, padding=1, bias=False, groups=groups
346
+ )
347
+ scratch.layer2_rn = nn.Conv2d(
348
+ in_shape[1], out_shape2, kernel_size=3, stride=1, padding=1, bias=False, groups=groups
349
+ )
350
+ scratch.layer3_rn = nn.Conv2d(
351
+ in_shape[2], out_shape3, kernel_size=3, stride=1, padding=1, bias=False, groups=groups
352
+ )
353
+ if len(in_shape) >= 4:
354
+ scratch.layer4_rn = nn.Conv2d(
355
+ in_shape[3], out_shape4, kernel_size=3, stride=1, padding=1, bias=False, groups=groups
356
+ )
357
+ return scratch
358
+
359
+
360
+ class ResidualConvUnit(nn.Module):
361
+ """Residual convolution module."""
362
+
363
+ def __init__(self, features, activation, bn, groups=1):
364
+ """Init.
365
+
366
+ Args:
367
+ features (int): number of features
368
+ """
369
+ super().__init__()
370
+
371
+ self.bn = bn
372
+ self.groups = groups
373
+ self.conv1 = nn.Conv2d(features, features, kernel_size=3, stride=1, padding=1, bias=True, groups=self.groups)
374
+ self.conv2 = nn.Conv2d(features, features, kernel_size=3, stride=1, padding=1, bias=True, groups=self.groups)
375
+
376
+ self.norm1 = None
377
+ self.norm2 = None
378
+
379
+ self.activation = activation
380
+ self.skip_add = nn.quantized.FloatFunctional()
381
+
382
+ def forward(self, x):
383
+ """Forward pass.
384
+
385
+ Args:
386
+ x (tensor): input
387
+
388
+ Returns:
389
+ tensor: output
390
+ """
391
+
392
+ out = self.activation(x)
393
+ out = self.conv1(out)
394
+ if self.norm1 is not None:
395
+ out = self.norm1(out)
396
+
397
+ out = self.activation(out)
398
+ out = self.conv2(out)
399
+ if self.norm2 is not None:
400
+ out = self.norm2(out)
401
+
402
+ return self.skip_add.add(out, x)
403
+
404
+
405
+ class FeatureFusionBlock(nn.Module):
406
+ """Feature fusion block."""
407
+
408
+ def __init__(
409
+ self,
410
+ features,
411
+ activation,
412
+ deconv=False,
413
+ bn=False,
414
+ expand=False,
415
+ align_corners=True,
416
+ size=None,
417
+ has_residual=True,
418
+ groups=1,
419
+ ):
420
+ """Init.
421
+
422
+ Args:
423
+ features (int): number of features
424
+ """
425
+ super(FeatureFusionBlock, self).__init__()
426
+
427
+ self.deconv = deconv
428
+ self.align_corners = align_corners
429
+ self.groups = groups
430
+ self.expand = expand
431
+ out_features = features
432
+ if self.expand == True:
433
+ out_features = features // 2
434
+
435
+ self.out_conv = nn.Conv2d(
436
+ features, out_features, kernel_size=1, stride=1, padding=0, bias=True, groups=self.groups
437
+ )
438
+
439
+ if has_residual:
440
+ self.resConfUnit1 = ResidualConvUnit(features, activation, bn, groups=self.groups)
441
+
442
+ self.has_residual = has_residual
443
+ self.resConfUnit2 = ResidualConvUnit(features, activation, bn, groups=self.groups)
444
+
445
+ self.skip_add = nn.quantized.FloatFunctional()
446
+ self.size = size
447
+
448
+ def forward(self, *xs, size=None):
449
+ """Forward pass.
450
+
451
+ Returns:
452
+ tensor: output
453
+ """
454
+ output = xs[0]
455
+
456
+ if self.has_residual:
457
+ res = self.resConfUnit1(xs[1])
458
+ output = self.skip_add.add(output, res)
459
+
460
+ output = self.resConfUnit2(output)
461
+
462
+ if (size is None) and (self.size is None):
463
+ modifier = {"scale_factor": 2}
464
+ elif size is None:
465
+ modifier = {"size": self.size}
466
+ else:
467
+ modifier = {"size": size}
468
+
469
+ output = custom_interpolate(output, **modifier, mode="bilinear", align_corners=self.align_corners)
470
+ output = self.out_conv(output)
471
+
472
+ return output
473
+
474
+
475
+ def custom_interpolate(
476
+ x: torch.Tensor,
477
+ size: Tuple[int, int] = None,
478
+ scale_factor: float = None,
479
+ mode: str = "bilinear",
480
+ align_corners: bool = True,
481
+ ) -> torch.Tensor:
482
+ """
483
+ Custom interpolate to avoid INT_MAX issues in nn.functional.interpolate.
484
+ """
485
+ if size is None:
486
+ size = (int(x.shape[-2] * scale_factor), int(x.shape[-1] * scale_factor))
487
+
488
+ INT_MAX = 1610612736
489
+
490
+ input_elements = size[0] * size[1] * x.shape[0] * x.shape[1]
491
+
492
+ if input_elements > INT_MAX:
493
+ chunks = torch.chunk(x, chunks=(input_elements // INT_MAX) + 1, dim=0)
494
+ interpolated_chunks = [
495
+ nn.functional.interpolate(chunk, size=size, mode=mode, align_corners=align_corners) for chunk in chunks
496
+ ]
497
+ x = torch.cat(interpolated_chunks, dim=0)
498
+ return x.contiguous()
499
+ else:
500
+ return nn.functional.interpolate(x, size=size, mode=mode, align_corners=align_corners)
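Editor's note: a minimal shape-check sketch for DPTHead (illustrative only, not part of the committed file; dim_in=2048 and the 5 leading non-patch tokens are hypothetical choices, and the 24 token tensors simply cover the default intermediate_layer_idx of [4, 11, 17, 23]).

import torch
from unish.heads.dpt_head import DPTHead

head = DPTHead(dim_in=2048, patch_size=14, output_dim=4).eval()
B, S, H, W = 1, 2, 224, 224
patch_h, patch_w = H // 14, W // 14                       # 16 x 16 patch grid
tokens = [torch.randn(B, S, 5 + patch_h * patch_w, 2048) for _ in range(24)]
images = torch.rand(B, S, 3, H, W)
with torch.no_grad():
    preds, conf = head(tokens, images, patch_start_idx=5)
print(preds.shape, conf.shape)   # [1, 2, 224, 224, 3] and [1, 2, 224, 224]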
unish/heads/head_act.py ADDED
@@ -0,0 +1,125 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+
8
+ import torch
9
+ import torch.nn.functional as F
10
+
11
+
12
+ def activate_pose(pred_pose_enc, trans_act="linear", quat_act="linear", fl_act="linear"):
13
+ """
14
+ Activate pose parameters with specified activation functions.
15
+
16
+ Args:
17
+ pred_pose_enc: Tensor containing encoded pose parameters [translation, quaternion, focal length]
18
+ trans_act: Activation type for translation component
19
+ quat_act: Activation type for quaternion component
20
+ fl_act: Activation type for focal length component
21
+
22
+ Returns:
23
+ Activated pose parameters tensor
24
+ """
25
+ T = pred_pose_enc[..., :3]
26
+ quat = pred_pose_enc[..., 3:7]
27
+ fl = pred_pose_enc[..., 7:] # or fov
28
+
29
+ T = base_pose_act(T, trans_act)
30
+ quat = base_pose_act(quat, quat_act)
31
+ fl = base_pose_act(fl, fl_act) # or fov
32
+
33
+ pred_pose_enc = torch.cat([T, quat, fl], dim=-1)
34
+
35
+ return pred_pose_enc
36
+
37
+
38
+ def base_pose_act(pose_enc, act_type="linear"):
39
+ """
40
+ Apply basic activation function to pose parameters.
41
+
42
+ Args:
43
+ pose_enc: Tensor containing encoded pose parameters
44
+ act_type: Activation type ("linear", "inv_log", "exp", "relu")
45
+
46
+ Returns:
47
+ Activated pose parameters
48
+ """
49
+ if act_type == "linear":
50
+ return pose_enc
51
+ elif act_type == "inv_log":
52
+ return inverse_log_transform(pose_enc)
53
+ elif act_type == "exp":
54
+ return torch.exp(pose_enc)
55
+ elif act_type == "relu":
56
+ return F.relu(pose_enc)
57
+ else:
58
+ raise ValueError(f"Unknown act_type: {act_type}")
59
+
60
+
61
+ def activate_head(out, activation="norm_exp", conf_activation="expp1"):
62
+ """
63
+ Process network output to extract 3D points and confidence values.
64
+
65
+ Args:
66
+ out: Network output tensor (B, C, H, W)
67
+ activation: Activation type for 3D points
68
+ conf_activation: Activation type for confidence values
69
+
70
+ Returns:
71
+ Tuple of (3D points tensor, confidence tensor)
72
+ """
73
+    # Move channels from dim 1 to the last dimension => (B, H, W, C)
74
+ fmap = out.permute(0, 2, 3, 1) # B,H,W,C expected
75
+
76
+ # Split into xyz (first C-1 channels) and confidence (last channel)
77
+ xyz = fmap[:, :, :, :-1]
78
+ conf = fmap[:, :, :, -1]
79
+
80
+ if activation == "norm_exp":
81
+ d = xyz.norm(dim=-1, keepdim=True).clamp(min=1e-8)
82
+ xyz_normed = xyz / d
83
+ pts3d = xyz_normed * torch.expm1(d)
84
+ elif activation == "norm":
85
+ pts3d = xyz / xyz.norm(dim=-1, keepdim=True)
86
+ elif activation == "exp":
87
+ pts3d = torch.exp(xyz)
88
+ elif activation == "relu":
89
+ pts3d = F.relu(xyz)
90
+ elif activation == "inv_log":
91
+ pts3d = inverse_log_transform(xyz)
92
+ elif activation == "xy_inv_log":
93
+ xy, z = xyz.split([2, 1], dim=-1)
94
+ z = inverse_log_transform(z)
95
+ pts3d = torch.cat([xy * z, z], dim=-1)
96
+ elif activation == "sigmoid":
97
+ pts3d = torch.sigmoid(xyz)
98
+ elif activation == "linear":
99
+ pts3d = xyz
100
+ else:
101
+ raise ValueError(f"Unknown activation: {activation}")
102
+
103
+ if conf_activation == "expp1":
104
+ conf_out = 1 + conf.exp()
105
+ elif conf_activation == "expp0":
106
+ conf_out = conf.exp()
107
+ elif conf_activation == "sigmoid":
108
+ conf_out = torch.sigmoid(conf)
109
+ else:
110
+ raise ValueError(f"Unknown conf_activation: {conf_activation}")
111
+
112
+ return pts3d, conf_out
113
+
114
+
115
+ def inverse_log_transform(y):
116
+ """
117
+ Apply inverse log transform: sign(y) * (exp(|y|) - 1)
118
+
119
+ Args:
120
+ y: Input tensor
121
+
122
+ Returns:
123
+ Transformed tensor
124
+ """
125
+ return torch.sign(y) * (torch.expm1(torch.abs(y)))
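Editor's note: a small sketch of activate_head on a raw 4-channel prediction map (illustrative only, not part of the committed file).

import torch
from unish.heads.head_act import activate_head

out = torch.randn(2, 4, 32, 32)      # (B, C, H, W): 3 point channels + 1 confidence channel
pts3d, conf = activate_head(out, activation="norm_exp", conf_activation="expp1")
print(pts3d.shape, conf.shape)       # torch.Size([2, 32, 32, 3]) torch.Size([2, 32, 32])
print(bool((conf >= 1).all()))       # True: "expp1" maps confidence through 1 + exp(x)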
unish/heads/human_head_cliff.py ADDED
@@ -0,0 +1,97 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ import numpy as np
5
+ import einops
6
+
7
+
8
+ from unish.utils.data_utils import rot6d_to_rotmat
9
+ from unish.utils.constants import SMPL_MEAN_PARAMS
10
+ from .pose_transformer import TransformerDecoder
11
+
12
+ TRANSFORMER_DECODER={'depth': 6,
13
+ 'heads': 8,
14
+ 'mlp_dim': 1024,
15
+ 'dim_head': 64,
16
+ 'dropout': 0.0,
17
+ 'emb_dropout': 0.0,
18
+ 'norm': 'layer',
19
+ 'context_dim': 1280}
20
+
21
+ NUM_POSE_INPUT = 23
22
+ NUM_BETAS_INPUT = 10
23
+ NUM_BETAS = 10
24
+ NUM_POSE_PARAMS = 23
25
+
26
+ class HumanHeadCliff(nn.Module):
27
+
28
+ def __init__(self):
29
+ super().__init__()
30
+ self.joint_rep_dim = 6
31
+ npose = self.joint_rep_dim * (NUM_POSE_INPUT + 1)
32
+ self.npose = npose
33
+ transformer_args = dict(
34
+ num_tokens=1,
35
+ token_dim=(3 + npose + NUM_BETAS_INPUT + 3),
36
+ dim=1024,
37
+ )
38
+ transformer_args = (transformer_args | dict(TRANSFORMER_DECODER))
39
+ self.transformer = TransformerDecoder(
40
+ **transformer_args
41
+ )
42
+ dim=transformer_args['dim']
43
+ self.decpose = nn.Linear(dim, self.joint_rep_dim * (NUM_POSE_PARAMS + 1))
44
+ self.decshape = nn.Linear(dim, NUM_BETAS)
45
+ # self.deccam = nn.Linear(dim, 3)
46
+ # self.deckp = nn.Linear(dim, 88)
47
+
48
+ mean_params = SMPL_MEAN_PARAMS
49
+ init_body_pose = torch.from_numpy(mean_params['pose'].astype(np.float32)).unsqueeze(0)
50
+ init_betas = torch.from_numpy(mean_params['shape'].astype('float32')).unsqueeze(0)
51
+ init_cam = torch.from_numpy(mean_params['cam'].astype(np.float32)).unsqueeze(0)
52
+ self.register_buffer('init_body_pose', init_body_pose)
53
+ self.register_buffer('init_betas', init_betas)
54
+ self.register_buffer('init_cam', init_cam)
55
+
56
+ def gradient_checkpointing_enable(self):
57
+ """Enable gradient checkpointing for memory optimization."""
58
+ if hasattr(self.transformer, 'gradient_checkpointing_enable'):
59
+ self.transformer.gradient_checkpointing_enable()
60
+
61
+ def forward(self, x, bbox_info, **kwargs):
62
+ """
63
+ x: (B, N, C, H, W)
64
+ bbox_info: [cx / f, cy / f, box_size / f], (B, N, 3)
65
+ """
66
+
67
+ batch_size, num_views = x.shape[:2]
68
+ x = einops.rearrange(x, 'b n c h w -> (b n) (h w) c')
69
+
70
+ init_body_pose = self.init_body_pose.expand(batch_size * num_views, -1)
71
+ init_betas = self.init_betas.expand(batch_size * num_views, -1)
72
+ init_cam = self.init_cam.expand(batch_size * num_views, -1)
73
+ bbox_info = bbox_info.view(-1, 3)
74
+
75
+ pred_body_pose = init_body_pose
76
+ pred_betas = init_betas
77
+ pred_cam = init_cam
78
+ token = torch.cat([bbox_info, pred_body_pose, pred_betas, pred_cam], dim=-1)[:, None, :]
79
+
80
+ # Pass through transformer
81
+ token_out = self.transformer(token, context=x)
82
+ token_out = token_out.squeeze(1) # (B, C)
83
+
84
+ pred_body_pose = self.decpose(token_out) + pred_body_pose
85
+ pred_betas = self.decshape(token_out) + pred_betas
86
+
87
+ joint_conversion_fn = rot6d_to_rotmat
88
+
89
+ pred_body_pose = pred_body_pose.view(-1, 6)
90
+ pred_body_pose = joint_conversion_fn(pred_body_pose).view(batch_size, num_views, -1)
91
+ pred_betas = pred_betas.view(batch_size, num_views, -1).mean(dim=1)
92
+ token_out = token_out.view(batch_size, num_views, -1)
93
+
94
+ pred_smpl_params = {'pose_cam': pred_body_pose,
95
+ 'token_out': token_out,
96
+ 'betas': pred_betas}
97
+ return pred_smpl_params
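Editor's note: a minimal usage sketch for HumanHeadCliff (illustrative only, not part of the committed file; the feature-map size is hypothetical, the channel count of 1280 matches the decoder's context_dim above, and the SMPL mean parameters shipped with this repo must be available at construction time).

import torch
from unish.heads.human_head_cliff import HumanHeadCliff

head = HumanHeadCliff().eval()
x = torch.randn(2, 4, 1280, 16, 12)    # (B, N, C, H, W) per-view feature maps
bbox_info = torch.zeros(2, 4, 3)       # [cx / f, cy / f, box_size / f] per view
with torch.no_grad():
    params = head(x, bbox_info)
print(params["pose_cam"].shape)        # torch.Size([2, 4, 216]) flattened rotation matrices
print(params["betas"].shape)           # torch.Size([2, 10]), averaged over views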
unish/heads/pose_transformer.py ADDED
@@ -0,0 +1,364 @@
1
+ from inspect import isfunction
2
+ from typing import Callable, Optional
3
+
4
+ import torch
5
+ from einops import rearrange
6
+ from einops.layers.torch import Rearrange
7
+ from torch import nn
8
+
9
+ from .t_cond_mlp import (
10
+ AdaptiveLayerNorm1D,
11
+ FrequencyEmbedder,
12
+ normalization_layer,
13
+ )
14
+ # from .vit import Attention, FeedForward
15
+
16
+
17
+ def exists(val):
18
+ return val is not None
19
+
20
+
21
+ def default(val, d):
22
+ if exists(val):
23
+ return val
24
+ return d() if isfunction(d) else d
25
+
26
+
27
+ class PreNorm(nn.Module):
28
+ def __init__(self, dim: int, fn: Callable, norm: str = "layer", norm_cond_dim: int = -1):
29
+ super().__init__()
30
+ self.norm = normalization_layer(norm, dim, norm_cond_dim)
31
+ self.fn = fn
32
+
33
+ def forward(self, x: torch.Tensor, *args, **kwargs):
34
+ if isinstance(self.norm, AdaptiveLayerNorm1D):
35
+ return self.fn(self.norm(x, *args), **kwargs)
36
+ else:
37
+ return self.fn(self.norm(x), **kwargs)
38
+
39
+
40
+ class FeedForward(nn.Module):
41
+ def __init__(self, dim, hidden_dim, dropout=0.0):
42
+ super().__init__()
43
+ self.net = nn.Sequential(
44
+ nn.Linear(dim, hidden_dim),
45
+ nn.GELU(),
46
+ nn.Dropout(dropout),
47
+ nn.Linear(hidden_dim, dim),
48
+ nn.Dropout(dropout),
49
+ )
50
+
51
+ def forward(self, x):
52
+ return self.net(x)
53
+
54
+
55
+ class Attention(nn.Module):
56
+ def __init__(self, dim, heads=8, dim_head=64, dropout=0.0):
57
+ super().__init__()
58
+ inner_dim = dim_head * heads
59
+ project_out = not (heads == 1 and dim_head == dim)
60
+
61
+ self.heads = heads
62
+ self.scale = dim_head**-0.5
63
+
64
+ self.attend = nn.Softmax(dim=-1)
65
+ self.dropout = nn.Dropout(dropout)
66
+
67
+ self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
68
+
69
+ self.to_out = (
70
+ nn.Sequential(nn.Linear(inner_dim, dim), nn.Dropout(dropout))
71
+ if project_out
72
+ else nn.Identity()
73
+ )
74
+
75
+ def forward(self, x):
76
+ qkv = self.to_qkv(x).chunk(3, dim=-1)
77
+ q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=self.heads), qkv)
78
+
79
+ dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
80
+
81
+ attn = self.attend(dots)
82
+ attn = self.dropout(attn)
83
+
84
+ out = torch.matmul(attn, v)
85
+ out = rearrange(out, "b h n d -> b n (h d)")
86
+ return self.to_out(out)
87
+
88
+
89
+ class CrossAttention(nn.Module):
90
+ def __init__(self, dim, context_dim=None, heads=8, dim_head=64, dropout=0.0):
91
+ super().__init__()
92
+ inner_dim = dim_head * heads
93
+ project_out = not (heads == 1 and dim_head == dim)
94
+
95
+ self.heads = heads
96
+ self.scale = dim_head**-0.5
97
+
98
+ self.attend = nn.Softmax(dim=-1)
99
+ self.dropout = nn.Dropout(dropout)
100
+
101
+ context_dim = default(context_dim, dim)
102
+ self.to_kv = nn.Linear(context_dim, inner_dim * 2, bias=False)
103
+ self.to_q = nn.Linear(dim, inner_dim, bias=False)
104
+
105
+ self.to_out = (
106
+ nn.Sequential(nn.Linear(inner_dim, dim), nn.Dropout(dropout))
107
+ if project_out
108
+ else nn.Identity()
109
+ )
110
+
111
+ def forward(self, x, context=None):
112
+ context = default(context, x)
113
+ k, v = self.to_kv(context).chunk(2, dim=-1)
114
+ q = self.to_q(x)
115
+ q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=self.heads), [q, k, v])
116
+
117
+ dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
118
+
119
+ attn = self.attend(dots)
120
+ attn = self.dropout(attn)
121
+
122
+ out = torch.matmul(attn, v)
123
+ out = rearrange(out, "b h n d -> b n (h d)")
124
+ return self.to_out(out)
125
+
126
+
127
+ class Transformer(nn.Module):
128
+ def __init__(
129
+ self,
130
+ dim: int,
131
+ depth: int,
132
+ heads: int,
133
+ dim_head: int,
134
+ mlp_dim: int,
135
+ dropout: float = 0.0,
136
+ norm: str = "layer",
137
+ norm_cond_dim: int = -1,
138
+ ):
139
+ super().__init__()
140
+ self.layers = nn.ModuleList([])
141
+ for _ in range(depth):
142
+ sa = Attention(dim, heads=heads, dim_head=dim_head, dropout=dropout)
143
+ ff = FeedForward(dim, mlp_dim, dropout=dropout)
144
+ self.layers.append(
145
+ nn.ModuleList(
146
+ [
147
+ PreNorm(dim, sa, norm=norm, norm_cond_dim=norm_cond_dim),
148
+ PreNorm(dim, ff, norm=norm, norm_cond_dim=norm_cond_dim),
149
+ ]
150
+ )
151
+ )
152
+
153
+ def forward(self, x: torch.Tensor, *args):
154
+ for attn, ff in self.layers:
155
+ x = attn(x, *args) + x
156
+ x = ff(x, *args) + x
157
+ return x
158
+
159
+
160
+ class TransformerCrossAttn(nn.Module):
161
+ def __init__(
162
+ self,
163
+ dim: int,
164
+ depth: int,
165
+ heads: int,
166
+ dim_head: int,
167
+ mlp_dim: int,
168
+ dropout: float = 0.0,
169
+ norm: str = "layer",
170
+ norm_cond_dim: int = -1,
171
+ context_dim: Optional[int] = None,
172
+ ):
173
+ super().__init__()
174
+ self.layers = nn.ModuleList([])
175
+ for _ in range(depth):
176
+ sa = Attention(dim, heads=heads, dim_head=dim_head, dropout=dropout)
177
+ ca = CrossAttention(
178
+ dim, context_dim=context_dim, heads=heads, dim_head=dim_head, dropout=dropout
179
+ )
180
+ ff = FeedForward(dim, mlp_dim, dropout=dropout)
181
+ self.layers.append(
182
+ nn.ModuleList(
183
+ [
184
+ PreNorm(dim, sa, norm=norm, norm_cond_dim=norm_cond_dim),
185
+ PreNorm(dim, ca, norm=norm, norm_cond_dim=norm_cond_dim),
186
+ PreNorm(dim, ff, norm=norm, norm_cond_dim=norm_cond_dim),
187
+ ]
188
+ )
189
+ )
190
+
191
+ def forward(self, x: torch.Tensor, *args, context=None, context_list=None):
192
+ if context_list is None:
193
+ context_list = [context] * len(self.layers)
194
+ if len(context_list) != len(self.layers):
195
+ raise ValueError(f"len(context_list) != len(self.layers) ({len(context_list)} != {len(self.layers)})")
196
+
197
+ b, n = x.shape[:2]
198
+
199
+ for i, (self_attn, cross_attn, ff) in enumerate(self.layers):
200
+ x = self_attn(x, *args) + x
201
+ # TODO
202
+ # x = x.view(b*n, 1, -1)
203
+ x = cross_attn(x, *args, context=context_list[i]) + x
204
+ # x = x.view(b, n, -1)
205
+ x = ff(x, *args) + x
206
+ return x
207
+
208
+
209
+ class DropTokenDropout(nn.Module):
210
+ def __init__(self, p: float = 0.1):
211
+ super().__init__()
212
+ if p < 0 or p > 1:
213
+ raise ValueError(
214
+ "dropout probability has to be between 0 and 1, " "but got {}".format(p)
215
+ )
216
+ self.p = p
217
+
218
+ def forward(self, x: torch.Tensor):
219
+ # x: (batch_size, seq_len, dim)
220
+ if self.training and self.p > 0:
221
+ zero_mask = torch.full_like(x[0, :, 0], self.p).bernoulli().bool()
222
+ # TODO: permutation idx for each batch using torch.argsort
223
+ if zero_mask.any():
224
+ x = x[:, ~zero_mask, :]
225
+ return x
226
+
227
+
228
+ class ZeroTokenDropout(nn.Module):
229
+ def __init__(self, p: float = 0.1):
230
+ super().__init__()
231
+ if p < 0 or p > 1:
232
+ raise ValueError(
233
+ "dropout probability has to be between 0 and 1, " "but got {}".format(p)
234
+ )
235
+ self.p = p
236
+
237
+ def forward(self, x: torch.Tensor):
238
+ # x: (batch_size, seq_len, dim)
239
+ if self.training and self.p > 0:
240
+ zero_mask = torch.full_like(x[:, :, 0], self.p).bernoulli().bool()
241
+ # Zero-out the masked tokens
242
+ x[zero_mask, :] = 0
243
+ return x
244
+
245
+
246
+ class TransformerEncoder(nn.Module):
247
+ def __init__(
248
+ self,
249
+ num_tokens: int,
250
+ token_dim: int,
251
+ dim: int,
252
+ depth: int,
253
+ heads: int,
254
+ mlp_dim: int,
255
+ dim_head: int = 64,
256
+ dropout: float = 0.0,
257
+ emb_dropout: float = 0.0,
258
+ emb_dropout_type: str = "drop",
259
+ emb_dropout_loc: str = "token",
260
+ norm: str = "layer",
261
+ norm_cond_dim: int = -1,
262
+ token_pe_numfreq: int = -1,
263
+ ):
264
+ super().__init__()
265
+ if token_pe_numfreq > 0:
266
+ token_dim_new = token_dim * (2 * token_pe_numfreq + 1)
267
+ self.to_token_embedding = nn.Sequential(
268
+ Rearrange("b n d -> (b n) d", n=num_tokens, d=token_dim),
269
+ FrequencyEmbedder(token_pe_numfreq, token_pe_numfreq - 1),
270
+ Rearrange("(b n) d -> b n d", n=num_tokens, d=token_dim_new),
271
+ nn.Linear(token_dim_new, dim),
272
+ )
273
+ else:
274
+ self.to_token_embedding = nn.Linear(token_dim, dim)
275
+ self.pos_embedding = nn.Parameter(torch.randn(1, num_tokens, dim))
276
+ if emb_dropout_type == "drop":
277
+ self.dropout = DropTokenDropout(emb_dropout)
278
+ elif emb_dropout_type == "zero":
279
+ self.dropout = ZeroTokenDropout(emb_dropout)
280
+ else:
281
+ raise ValueError(f"Unknown emb_dropout_type: {emb_dropout_type}")
282
+ self.emb_dropout_loc = emb_dropout_loc
283
+
284
+ self.transformer = Transformer(
285
+ dim, depth, heads, dim_head, mlp_dim, dropout, norm=norm, norm_cond_dim=norm_cond_dim
286
+ )
287
+
288
+ def forward(self, inp: torch.Tensor, *args, **kwargs):
289
+ x = inp
290
+
291
+ if self.emb_dropout_loc == "input":
292
+ x = self.dropout(x)
293
+ x = self.to_token_embedding(x)
294
+
295
+ if self.emb_dropout_loc == "token":
296
+ x = self.dropout(x)
297
+ b, n, _ = x.shape
298
+ x += self.pos_embedding[:, :n]
299
+
300
+ if self.emb_dropout_loc == "token_afterpos":
301
+ x = self.dropout(x)
302
+ x = self.transformer(x, *args)
303
+ return x
304
+
305
+
306
+ class TransformerDecoder(nn.Module):
307
+ def __init__(
308
+ self,
309
+ num_tokens: int,
310
+ token_dim: int,
311
+ dim: int,
312
+ depth: int,
313
+ heads: int,
314
+ mlp_dim: int,
315
+ dim_head: int = 64,
316
+ dropout: float = 0.0,
317
+ emb_dropout: float = 0.0,
318
+ emb_dropout_type: str = 'drop',
319
+ norm: str = "layer",
320
+ norm_cond_dim: int = -1,
321
+ context_dim: Optional[int] = None,
322
+ skip_token_embedding: bool = False,
323
+ ):
324
+ super().__init__()
325
+ if not skip_token_embedding:
326
+ self.to_token_embedding = nn.Linear(token_dim, dim)
327
+ else:
328
+ self.to_token_embedding = nn.Identity()
329
+ if token_dim != dim:
330
+ raise ValueError(
331
+ f"token_dim ({token_dim}) != dim ({dim}) when skip_token_embedding is True"
332
+ )
333
+
334
+ self.pos_embedding = nn.Parameter(torch.randn(1, num_tokens, dim))
335
+ if emb_dropout_type == "drop":
336
+ self.dropout = DropTokenDropout(emb_dropout)
337
+ elif emb_dropout_type == "zero":
338
+ self.dropout = ZeroTokenDropout(emb_dropout)
339
+ elif emb_dropout_type == "normal":
340
+ self.dropout = nn.Dropout(emb_dropout)
341
+
342
+ self.transformer = TransformerCrossAttn(
343
+ dim,
344
+ depth,
345
+ heads,
346
+ dim_head,
347
+ mlp_dim,
348
+ dropout,
349
+ norm=norm,
350
+ norm_cond_dim=norm_cond_dim,
351
+ context_dim=context_dim,
352
+ )
353
+
354
+ def forward(self, inp: torch.Tensor, *args, context=None, context_list=None):
355
+
356
+ x = self.to_token_embedding(inp)
357
+ b, n, _ = x.shape
358
+
359
+ x = self.dropout(x)
360
+ x += self.pos_embedding[:, :n]
361
+
362
+ x = self.transformer(x, *args, context=context, context_list=context_list)
363
+ return x
364
+
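Editor's note: a standalone sketch of the TransformerDecoder above, mirroring how HumanHeadCliff instantiates it (illustrative only, not part of the committed file; the context length is hypothetical).

import torch
from unish.heads.pose_transformer import TransformerDecoder

dec = TransformerDecoder(num_tokens=1, token_dim=160, dim=1024, depth=6, heads=8,
                         mlp_dim=1024, dim_head=64, context_dim=1280)
token = torch.randn(2, 1, 160)         # one query token per sample
context = torch.randn(2, 192, 1280)    # e.g. flattened per-view image features
out = dec(token, context=context)
print(out.shape)                       # torch.Size([2, 1, 1024])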
unish/heads/t_cond_mlp.py ADDED
@@ -0,0 +1,199 @@
1
+ import copy
2
+ from typing import List, Optional
3
+
4
+ import torch
5
+
6
+
7
+ class AdaptiveLayerNorm1D(torch.nn.Module):
8
+ def __init__(self, data_dim: int, norm_cond_dim: int):
9
+ super().__init__()
10
+ if data_dim <= 0:
11
+ raise ValueError(f"data_dim must be positive, but got {data_dim}")
12
+ if norm_cond_dim <= 0:
13
+ raise ValueError(f"norm_cond_dim must be positive, but got {norm_cond_dim}")
14
+ self.norm = torch.nn.LayerNorm(
15
+ data_dim
16
+ ) # TODO: Check if elementwise_affine=True is correct
17
+ self.linear = torch.nn.Linear(norm_cond_dim, 2 * data_dim)
18
+ torch.nn.init.zeros_(self.linear.weight)
19
+ torch.nn.init.zeros_(self.linear.bias)
20
+
21
+ def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
22
+ # x: (batch, ..., data_dim)
23
+ # t: (batch, norm_cond_dim)
24
+ # return: (batch, data_dim)
25
+ x = self.norm(x)
26
+ alpha, beta = self.linear(t).chunk(2, dim=-1)
27
+
28
+ # Add singleton dimensions to alpha and beta
29
+ if x.dim() > 2:
30
+ alpha = alpha.view(alpha.shape[0], *([1] * (x.dim() - 2)), alpha.shape[1])
31
+ beta = beta.view(beta.shape[0], *([1] * (x.dim() - 2)), beta.shape[1])
32
+
33
+ return x * (1 + alpha) + beta
34
+
35
+
36
+ class SequentialCond(torch.nn.Sequential):
37
+ def forward(self, input, *args, **kwargs):
38
+ for module in self:
39
+ if isinstance(module, (AdaptiveLayerNorm1D, SequentialCond, ResidualMLPBlock)):
40
+ # print(f'Passing on args to {module}', [a.shape for a in args])
41
+ input = module(input, *args, **kwargs)
42
+ else:
43
+ # print(f'Skipping passing args to {module}', [a.shape for a in args])
44
+ input = module(input)
45
+ return input
46
+
47
+
48
+ def normalization_layer(norm: Optional[str], dim: int, norm_cond_dim: int = -1):
49
+ if norm == "batch":
50
+ return torch.nn.BatchNorm1d(dim)
51
+ elif norm == "layer":
52
+ return torch.nn.LayerNorm(dim)
53
+ elif norm == "ada":
54
+ assert norm_cond_dim > 0, f"norm_cond_dim must be positive, got {norm_cond_dim}"
55
+ return AdaptiveLayerNorm1D(dim, norm_cond_dim)
56
+ elif norm is None:
57
+ return torch.nn.Identity()
58
+ else:
59
+ raise ValueError(f"Unknown norm: {norm}")
60
+
61
+
62
+ def linear_norm_activ_dropout(
63
+ input_dim: int,
64
+ output_dim: int,
65
+ activation: torch.nn.Module = torch.nn.ReLU(),
66
+ bias: bool = True,
67
+ norm: Optional[str] = "layer", # Options: ada/batch/layer
68
+ dropout: float = 0.0,
69
+ norm_cond_dim: int = -1,
70
+ ) -> SequentialCond:
71
+ layers = []
72
+ layers.append(torch.nn.Linear(input_dim, output_dim, bias=bias))
73
+ if norm is not None:
74
+ layers.append(normalization_layer(norm, output_dim, norm_cond_dim))
75
+ layers.append(copy.deepcopy(activation))
76
+ if dropout > 0.0:
77
+ layers.append(torch.nn.Dropout(dropout))
78
+ return SequentialCond(*layers)
79
+
80
+
81
+ def create_simple_mlp(
82
+ input_dim: int,
83
+ hidden_dims: List[int],
84
+ output_dim: int,
85
+ activation: torch.nn.Module = torch.nn.ReLU(),
86
+ bias: bool = True,
87
+ norm: Optional[str] = "layer", # Options: ada/batch/layer
88
+ dropout: float = 0.0,
89
+ norm_cond_dim: int = -1,
90
+ ) -> SequentialCond:
91
+ layers = []
92
+ prev_dim = input_dim
93
+ for hidden_dim in hidden_dims:
94
+ layers.extend(
95
+ linear_norm_activ_dropout(
96
+ prev_dim, hidden_dim, activation, bias, norm, dropout, norm_cond_dim
97
+ )
98
+ )
99
+ prev_dim = hidden_dim
100
+ layers.append(torch.nn.Linear(prev_dim, output_dim, bias=bias))
101
+ return SequentialCond(*layers)
102
+
103
+
104
+ class ResidualMLPBlock(torch.nn.Module):
105
+ def __init__(
106
+ self,
107
+ input_dim: int,
108
+ hidden_dim: int,
109
+ num_hidden_layers: int,
110
+ output_dim: int,
111
+ activation: torch.nn.Module = torch.nn.ReLU(),
112
+ bias: bool = True,
113
+ norm: Optional[str] = "layer", # Options: ada/batch/layer
114
+ dropout: float = 0.0,
115
+ norm_cond_dim: int = -1,
116
+ ):
117
+ super().__init__()
118
+ if not (input_dim == output_dim == hidden_dim):
119
+ raise NotImplementedError(
120
+ f"input_dim {input_dim} != output_dim {output_dim} is not implemented"
121
+ )
122
+
123
+ layers = []
124
+ prev_dim = input_dim
125
+ for i in range(num_hidden_layers):
126
+ layers.append(
127
+ linear_norm_activ_dropout(
128
+ prev_dim, hidden_dim, activation, bias, norm, dropout, norm_cond_dim
129
+ )
130
+ )
131
+ prev_dim = hidden_dim
132
+ self.model = SequentialCond(*layers)
133
+ self.skip = torch.nn.Identity()
134
+
135
+ def forward(self, x: torch.Tensor, *args, **kwargs) -> torch.Tensor:
136
+ return x + self.model(x, *args, **kwargs)
137
+
138
+
139
+ class ResidualMLP(torch.nn.Module):
140
+ def __init__(
141
+ self,
142
+ input_dim: int,
143
+ hidden_dim: int,
144
+ num_hidden_layers: int,
145
+ output_dim: int,
146
+ activation: torch.nn.Module = torch.nn.ReLU(),
147
+ bias: bool = True,
148
+ norm: Optional[str] = "layer", # Options: ada/batch/layer
149
+ dropout: float = 0.0,
150
+ num_blocks: int = 1,
151
+ norm_cond_dim: int = -1,
152
+ ):
153
+ super().__init__()
154
+ self.input_dim = input_dim
155
+ self.model = SequentialCond(
156
+ linear_norm_activ_dropout(
157
+ input_dim, hidden_dim, activation, bias, norm, dropout, norm_cond_dim
158
+ ),
159
+ *[
160
+ ResidualMLPBlock(
161
+ hidden_dim,
162
+ hidden_dim,
163
+ num_hidden_layers,
164
+ hidden_dim,
165
+ activation,
166
+ bias,
167
+ norm,
168
+ dropout,
169
+ norm_cond_dim,
170
+ )
171
+ for _ in range(num_blocks)
172
+ ],
173
+ torch.nn.Linear(hidden_dim, output_dim, bias=bias),
174
+ )
175
+
176
+ def forward(self, x: torch.Tensor, *args, **kwargs) -> torch.Tensor:
177
+ return self.model(x, *args, **kwargs)
178
+
179
+
180
+ class FrequencyEmbedder(torch.nn.Module):
181
+ def __init__(self, num_frequencies, max_freq_log2):
182
+ super().__init__()
183
+ frequencies = 2 ** torch.linspace(0, max_freq_log2, steps=num_frequencies)
184
+ self.register_buffer("frequencies", frequencies)
185
+
186
+ def forward(self, x):
187
+ # x should be of size (N,) or (N, D)
188
+ N = x.size(0)
189
+ if x.dim() == 1: # (N,)
190
+ x = x.unsqueeze(1) # (N, D) where D=1
191
+ x_unsqueezed = x.unsqueeze(-1) # (N, D, 1)
192
+ scaled = self.frequencies.view(1, 1, -1) * x_unsqueezed # (N, D, num_frequencies)
193
+ s = torch.sin(scaled)
194
+ c = torch.cos(scaled)
195
+ embedded = torch.cat([s, c, x_unsqueezed], dim=-1).view(
196
+ N, -1
197
+ ) # (N, D * 2 * num_frequencies + D)
198
+ return embedded
199
+
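Editor's note: a small sketch of the FrequencyEmbedder above (illustrative only, not part of the committed file; the frequency count is hypothetical).

import torch
from unish.heads.t_cond_mlp import FrequencyEmbedder

emb = FrequencyEmbedder(num_frequencies=8, max_freq_log2=7)
x = torch.randn(4, 3)       # (N, D)
feat = emb(x)
print(feat.shape)           # torch.Size([4, 51]): D * (2 * num_frequencies + 1) = 3 * 17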
unish/heads/utils.py ADDED
@@ -0,0 +1,108 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ import torch
8
+ import torch.nn as nn
9
+
10
+
11
+ def position_grid_to_embed(pos_grid: torch.Tensor, embed_dim: int, omega_0: float = 100) -> torch.Tensor:
12
+ """
13
+ Convert 2D position grid (HxWx2) to sinusoidal embeddings (HxWxC)
14
+
15
+ Args:
16
+ pos_grid: Tensor of shape (H, W, 2) containing 2D coordinates
17
+ embed_dim: Output channel dimension for embeddings
18
+
19
+ Returns:
20
+ Tensor of shape (H, W, embed_dim) with positional embeddings
21
+ """
22
+ H, W, grid_dim = pos_grid.shape
23
+ assert grid_dim == 2
24
+ pos_flat = pos_grid.reshape(-1, grid_dim) # Flatten to (H*W, 2)
25
+
26
+ # Process x and y coordinates separately
27
+ emb_x = make_sincos_pos_embed(embed_dim // 2, pos_flat[:, 0], omega_0=omega_0) # [1, H*W, D/2]
28
+ emb_y = make_sincos_pos_embed(embed_dim // 2, pos_flat[:, 1], omega_0=omega_0) # [1, H*W, D/2]
29
+
30
+ # Combine and reshape
31
+ emb = torch.cat([emb_x, emb_y], dim=-1) # [1, H*W, D]
32
+
33
+ return emb.view(H, W, embed_dim) # [H, W, D]
34
+
35
+
36
+ def make_sincos_pos_embed(embed_dim: int, pos: torch.Tensor, omega_0: float = 100) -> torch.Tensor:
37
+ """
38
+ This function generates a 1D positional embedding from a given grid using sine and cosine functions.
39
+
40
+ Args:
41
+ - embed_dim: The embedding dimension.
42
+ - pos: The position to generate the embedding from.
43
+
44
+ Returns:
45
+ - emb: The generated 1D positional embedding.
46
+ """
47
+ assert embed_dim % 2 == 0
48
+ omega = torch.arange(embed_dim // 2, dtype=torch.double, device=pos.device)
49
+ omega /= embed_dim / 2.0
50
+ omega = 1.0 / omega_0**omega # (D/2,)
51
+
52
+ pos = pos.reshape(-1) # (M,)
53
+ out = torch.einsum("m,d->md", pos, omega) # (M, D/2), outer product
54
+
55
+ emb_sin = torch.sin(out) # (M, D/2)
56
+ emb_cos = torch.cos(out) # (M, D/2)
57
+
58
+ emb = torch.cat([emb_sin, emb_cos], dim=1) # (M, D)
59
+ return emb.float()
60
+
61
+
62
+ # Inspired by https://github.com/microsoft/moge
63
+
64
+
65
+ def create_uv_grid(
66
+ width: int, height: int, aspect_ratio: float = None, dtype: torch.dtype = None, device: torch.device = None
67
+ ) -> torch.Tensor:
68
+ """
69
+    Create a normalized UV grid of shape (height, width, 2).
70
+
71
+ The grid spans horizontally and vertically according to an aspect ratio,
72
+ ensuring the top-left corner is at (-x_span, -y_span) and the bottom-right
73
+ corner is at (x_span, y_span), normalized by the diagonal of the plane.
74
+
75
+ Args:
76
+ width (int): Number of points horizontally.
77
+ height (int): Number of points vertically.
78
+ aspect_ratio (float, optional): Width-to-height ratio. Defaults to width/height.
79
+ dtype (torch.dtype, optional): Data type of the resulting tensor.
80
+ device (torch.device, optional): Device on which the tensor is created.
81
+
82
+ Returns:
83
+        torch.Tensor: A (height, width, 2) tensor of UV coordinates.
84
+ """
85
+ # Derive aspect ratio if not explicitly provided
86
+ if aspect_ratio is None:
87
+ aspect_ratio = float(width) / float(height)
88
+
89
+ # Compute normalized spans for X and Y
90
+ diag_factor = (aspect_ratio**2 + 1.0) ** 0.5
91
+ span_x = aspect_ratio / diag_factor
92
+ span_y = 1.0 / diag_factor
93
+
94
+ # Establish the linspace boundaries
95
+ left_x = -span_x * (width - 1) / width
96
+ right_x = span_x * (width - 1) / width
97
+ top_y = -span_y * (height - 1) / height
98
+ bottom_y = span_y * (height - 1) / height
99
+
100
+ # Generate 1D coordinates
101
+ x_coords = torch.linspace(left_x, right_x, steps=width, dtype=dtype, device=device)
102
+ y_coords = torch.linspace(top_y, bottom_y, steps=height, dtype=dtype, device=device)
103
+
104
+ # Create 2D meshgrid (width x height) and stack into UV
105
+ uu, vv = torch.meshgrid(x_coords, y_coords, indexing="xy")
106
+ uv_grid = torch.stack((uu, vv), dim=-1)
107
+
108
+ return uv_grid
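Editor's note: a quick sketch combining the two helpers above, mirroring how the DPT head builds its positional embedding (illustrative only, not part of the committed file; the grid size is hypothetical). With "xy" meshgrid indexing the returned grid is laid out as (height, width, 2).

import torch
from unish.heads.utils import create_uv_grid, position_grid_to_embed

uv = create_uv_grid(width=37, height=26, dtype=torch.float32)   # (26, 37, 2)
pe = position_grid_to_embed(uv, embed_dim=128)                  # (26, 37, 128)
print(uv.shape, pe.shape)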
unish/heads/vit.py ADDED
@@ -0,0 +1,346 @@
+ # Copyright (c) OpenMMLab. All rights reserved.
+ import math
+
+ import torch
+ from functools import partial
+ import torch.nn as nn
+ import torch.nn.functional as F
+ import torch.utils.checkpoint as checkpoint
+
+ from timm.models.layers import drop_path, to_2tuple, trunc_normal_
+
+ def vit():
+     return ViT(
+         img_size=(256, 192),
+         patch_size=16,
+         embed_dim=1280,
+         depth=32,
+         num_heads=16,
+         ratio=1,
+         use_checkpoint=False,
+         mlp_ratio=4,
+         qkv_bias=True,
+         drop_path_rate=0.55,
+     )
+
+ def get_abs_pos(abs_pos, h, w, ori_h, ori_w, has_cls_token=True):
+     """
+     Calculate absolute positional embeddings. If needed, resize embeddings and remove cls_token
+     dimension for the original embeddings.
+     Args:
+         abs_pos (Tensor): absolute positional embeddings with (1, num_position, C).
+         has_cls_token (bool): If true, has 1 embedding in abs_pos for cls token.
+         hw (Tuple): size of input image tokens.
+
+     Returns:
+         Absolute positional embeddings after processing with shape (1, H, W, C)
+     """
+     cls_token = None
+     B, L, C = abs_pos.shape
+     if has_cls_token:
+         cls_token = abs_pos[:, 0:1]
+         abs_pos = abs_pos[:, 1:]
+
+     if ori_h != h or ori_w != w:
+         new_abs_pos = F.interpolate(
+             abs_pos.reshape(1, ori_h, ori_w, -1).permute(0, 3, 1, 2),
+             size=(h, w),
+             mode="bicubic",
+             align_corners=False,
+         ).permute(0, 2, 3, 1).reshape(B, -1, C)
+
+     else:
+         new_abs_pos = abs_pos
+
+     if cls_token is not None:
+         new_abs_pos = torch.cat([cls_token, new_abs_pos], dim=1)
+     return new_abs_pos
+
+ class DropPath(nn.Module):
+     """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
+     """
+     def __init__(self, drop_prob=None):
+         super(DropPath, self).__init__()
+         self.drop_prob = drop_prob
+
+     def forward(self, x):
+         return drop_path(x, self.drop_prob, self.training)
+
+     def extra_repr(self):
+         return 'p={}'.format(self.drop_prob)
+
+ class Mlp(nn.Module):
+     def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
+         super().__init__()
+         out_features = out_features or in_features
+         hidden_features = hidden_features or in_features
+         self.fc1 = nn.Linear(in_features, hidden_features)
+         self.act = act_layer()
+         self.fc2 = nn.Linear(hidden_features, out_features)
+         self.drop = nn.Dropout(drop)
+
+     def forward(self, x):
+         x = self.fc1(x)
+         x = self.act(x)
+         x = self.fc2(x)
+         x = self.drop(x)
+         return x
+
+ class Attention(nn.Module):
+     def __init__(
+             self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0.,
+             proj_drop=0., attn_head_dim=None,):
+         super().__init__()
+         self.num_heads = num_heads
+         head_dim = dim // num_heads
+         self.dim = dim
+
+         if attn_head_dim is not None:
+             head_dim = attn_head_dim
+         all_head_dim = head_dim * self.num_heads
+
+         self.scale = qk_scale or head_dim ** -0.5
+
+         self.qkv = nn.Linear(dim, all_head_dim * 3, bias=qkv_bias)
+
+         self.attn_drop = nn.Dropout(attn_drop)
+         self.proj = nn.Linear(all_head_dim, dim)
+         self.proj_drop = nn.Dropout(proj_drop)
+
+     def forward(self, x):
+         B, N, C = x.shape
+         qkv = self.qkv(x)
+         qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
+         q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)
+
+         q = q * self.scale
+         attn = (q @ k.transpose(-2, -1))
+
+         attn = attn.softmax(dim=-1)
+         attn = self.attn_drop(attn)
+
+         x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
+         x = self.proj(x)
+         x = self.proj_drop(x)
+
+         return x
+
+ class Block(nn.Module):
+
+     def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None,
+                  drop=0., attn_drop=0., drop_path=0., act_layer=nn.GELU,
+                  norm_layer=nn.LayerNorm, attn_head_dim=None
+                  ):
+         super().__init__()
+
+         self.norm1 = norm_layer(dim)
+         self.attn = Attention(
+             dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
+             attn_drop=attn_drop, proj_drop=drop, attn_head_dim=attn_head_dim
+         )
+
+         # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
+         self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
+         self.norm2 = norm_layer(dim)
+         mlp_hidden_dim = int(dim * mlp_ratio)
+         self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
+
+     def forward(self, x):
+         x = x + self.drop_path(self.attn(self.norm1(x)))
+         x = x + self.drop_path(self.mlp(self.norm2(x)))
+         return x
+
+
+ class PatchEmbed(nn.Module):
+     """ Image to Patch Embedding
+     """
+     def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, ratio=1):
+         super().__init__()
+         img_size = to_2tuple(img_size)
+         patch_size = to_2tuple(patch_size)
+         num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0]) * (ratio ** 2)
+         self.patch_shape = (int(img_size[0] // patch_size[0] * ratio), int(img_size[1] // patch_size[1] * ratio))
+         self.origin_patch_shape = (int(img_size[0] // patch_size[0]), int(img_size[1] // patch_size[1]))
+         self.img_size = img_size
+         self.patch_size = patch_size
+         self.num_patches = num_patches
+
+         self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=(patch_size[0] // ratio), padding=4 + 2 * (ratio//2-1))
+
+     def forward(self, x, **kwargs):
+         B, C, H, W = x.shape
+         x = self.proj(x)
+         Hp, Wp = x.shape[2], x.shape[3]
+
+         x = x.flatten(2).transpose(1, 2)
+         return x, (Hp, Wp)
+
+
+ class HybridEmbed(nn.Module):
+     """ CNN Feature Map Embedding
+     Extract feature map from CNN, flatten, project to embedding dim.
+     """
+     def __init__(self, backbone, img_size=224, feature_size=None, in_chans=3, embed_dim=768):
+         super().__init__()
+         assert isinstance(backbone, nn.Module)
+         img_size = to_2tuple(img_size)
+         self.img_size = img_size
+         self.backbone = backbone
+         if feature_size is None:
+             with torch.no_grad():
+                 training = backbone.training
+                 if training:
+                     backbone.eval()
+                 o = self.backbone(torch.zeros(1, in_chans, img_size[0], img_size[1]))[-1]
+                 feature_size = o.shape[-2:]
+                 feature_dim = o.shape[1]
+                 backbone.train(training)
+         else:
+             feature_size = to_2tuple(feature_size)
+             feature_dim = self.backbone.feature_info.channels()[-1]
+         self.num_patches = feature_size[0] * feature_size[1]
+         self.proj = nn.Linear(feature_dim, embed_dim)
+
+     def forward(self, x):
+         x = self.backbone(x)[-1]
+         x = x.flatten(2).transpose(1, 2)
+         x = self.proj(x)
+         return x
+
+
+ class ViT(nn.Module):
+     def __init__(self,
+                  img_size=224, patch_size=16, in_chans=3, num_classes=80, embed_dim=768, depth=12,
+                  num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop_rate=0., attn_drop_rate=0.,
+                  drop_path_rate=0., hybrid_backbone=None, norm_layer=None, use_checkpoint=False,
+                  frozen_stages=-1, ratio=1, last_norm=True,
+                  patch_padding='pad', freeze_attn=False, freeze_ffn=False,
+                  ):
+         super(ViT, self).__init__()
+         norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
+         self.num_classes = num_classes
+         self.num_features = self.embed_dim = embed_dim  # num_features for consistency with other models
+         self.frozen_stages = frozen_stages
+         self.use_checkpoint = use_checkpoint
+         self.patch_padding = patch_padding
+         self.freeze_attn = freeze_attn
+         self.freeze_ffn = freeze_ffn
+         self.depth = depth
+
+         if hybrid_backbone is not None:
+             self.patch_embed = HybridEmbed(
+                 hybrid_backbone, img_size=img_size, in_chans=in_chans, embed_dim=embed_dim)
+         else:
+             self.patch_embed = PatchEmbed(
+                 img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim, ratio=ratio)
+         num_patches = self.patch_embed.num_patches
+
+         # since the pretraining model has class token
+         self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
+
+         dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]  # stochastic depth decay rule
+
+         self.blocks = nn.ModuleList([
+             Block(
+                 dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
+                 drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer,
+             )
+             for i in range(depth)])
+
+         self.last_norm = norm_layer(embed_dim) if last_norm else nn.Identity()
+
+         if self.pos_embed is not None:
+             trunc_normal_(self.pos_embed, std=.02)
+
+         self._freeze_stages()
+
+     def _freeze_stages(self):
+         """Freeze parameters."""
+         if self.frozen_stages >= 0:
+             self.patch_embed.eval()
+             for param in self.patch_embed.parameters():
+                 param.requires_grad = False
+
+             for i in range(1, self.frozen_stages + 1):
+                 m = self.blocks[i]
+                 m.eval()
+                 for param in m.parameters():
+                     param.requires_grad = False
+
+         if self.freeze_attn:
+             for i in range(0, self.depth):
+                 m = self.blocks[i]
+                 m.attn.eval()
+                 m.norm1.eval()
+                 for param in m.attn.parameters():
+                     param.requires_grad = False
+                 for param in m.norm1.parameters():
+                     param.requires_grad = False
+
+         if self.freeze_ffn:
+             self.pos_embed.requires_grad = False
+             self.patch_embed.eval()
+             for param in self.patch_embed.parameters():
+                 param.requires_grad = False
+             for i in range(0, self.depth):
+                 m = self.blocks[i]
+                 m.mlp.eval()
+                 m.norm2.eval()
+                 for param in m.mlp.parameters():
+                     param.requires_grad = False
+                 for param in m.norm2.parameters():
+                     param.requires_grad = False
+
+     def init_weights(self):
+         """Initialize the weights in backbone.
+         Args:
+             pretrained (str, optional): Path to pre-trained weights.
+                 Defaults to None.
+         """
+         def _init_weights(m):
+             if isinstance(m, nn.Linear):
+                 trunc_normal_(m.weight, std=.02)
+                 if isinstance(m, nn.Linear) and m.bias is not None:
+                     nn.init.constant_(m.bias, 0)
+             elif isinstance(m, nn.LayerNorm):
+                 nn.init.constant_(m.bias, 0)
+                 nn.init.constant_(m.weight, 1.0)
+
+         self.apply(_init_weights)
+
+     def get_num_layers(self):
+         return len(self.blocks)
+
+     @torch.jit.ignore
+     def no_weight_decay(self):
+         return {'pos_embed', 'cls_token'}
+
+     def forward_features(self, x):
+         B, C, H, W = x.shape
+         x, (Hp, Wp) = self.patch_embed(x)
+
+         if self.pos_embed is not None:
+             # fit for multiple GPU training
+             # since the first element for pos embed (sin-cos manner) is zero, it will cause no difference
+             x = x + self.pos_embed[:, 1:] + self.pos_embed[:, :1]
+
+         for blk in self.blocks:
+             if self.use_checkpoint:
+                 x = checkpoint.checkpoint(blk, x)
+             else:
+                 x = blk(x)
+
+         x = self.last_norm(x)
+
+         xp = x.permute(0, 2, 1).reshape(B, -1, Hp, Wp).contiguous()
+
+         return xp
+
+     def forward(self, x):
+         x = self.forward_features(x)
+         return x
+
+     def train(self, mode=True):
+         """Convert the model into training mode."""
+         super().train(mode)
+         self._freeze_stages()
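Note (not part of the committed file): a minimal usage sketch for the `vit()` factory above, assuming the package layout in this diff makes it importable as `unish.heads.vit`. With `img_size=(256, 192)` and `patch_size=16`, `forward_features` returns a feature map of shape `(B, 1280, 16, 12)`.

```python
import torch
from unish.heads.vit import vit  # import path assumed from this diff

model = vit().eval()  # ViT-Huge-sized backbone: embed_dim=1280, depth=32
with torch.no_grad():
    feats = model(torch.randn(1, 3, 256, 192))  # (B, C, H, W) matching img_size=(256, 192)
print(feats.shape)  # expected: torch.Size([1, 1280, 16, 12])
```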
unish/pi3/models/__pycache__/pi3.cpython-310.pyc ADDED
Binary file (7.01 kB). View file
 
unish/pi3/models/dinov2/__init__.py ADDED
@@ -0,0 +1,6 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ #
+ # This source code is licensed under the Apache License, Version 2.0
+ # found in the LICENSE file in the root directory of this source tree.
+
+ __version__ = "0.0.1"
unish/pi3/models/dinov2/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (181 Bytes). View file
 
unish/pi3/models/dinov2/hub/__init__.py ADDED
@@ -0,0 +1,4 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ #
+ # This source code is licensed under the Apache License, Version 2.0
+ # found in the LICENSE file in the root directory of this source tree.
unish/pi3/models/dinov2/hub/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (164 Bytes). View file
 
unish/pi3/models/dinov2/hub/__pycache__/backbones.cpython-310.pyc ADDED
Binary file (3.99 kB). View file
 
unish/pi3/models/dinov2/hub/__pycache__/utils.cpython-310.pyc ADDED
Binary file (1.78 kB). View file
 
unish/pi3/models/dinov2/hub/backbones.py ADDED
@@ -0,0 +1,156 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ #
+ # This source code is licensed under the Apache License, Version 2.0
+ # found in the LICENSE file in the root directory of this source tree.
+
+ from enum import Enum
+ from typing import Union
+
+ import torch
+
+ from .utils import _DINOV2_BASE_URL, _make_dinov2_model_name
+
+
+ class Weights(Enum):
+     LVD142M = "LVD142M"
+
+
+ def _make_dinov2_model(
+     *,
+     arch_name: str = "vit_large",
+     img_size: int = 518,
+     patch_size: int = 14,
+     init_values: float = 1.0,
+     ffn_layer: str = "mlp",
+     block_chunks: int = 0,
+     num_register_tokens: int = 0,
+     interpolate_antialias: bool = False,
+     interpolate_offset: float = 0.1,
+     pretrained: bool = True,
+     weights: Union[Weights, str] = Weights.LVD142M,
+     **kwargs,
+ ):
+     from ..models import vision_transformer as vits
+
+     if isinstance(weights, str):
+         try:
+             weights = Weights[weights]
+         except KeyError:
+             raise AssertionError(f"Unsupported weights: {weights}")
+
+     model_base_name = _make_dinov2_model_name(arch_name, patch_size)
+     vit_kwargs = dict(
+         img_size=img_size,
+         patch_size=patch_size,
+         init_values=init_values,
+         ffn_layer=ffn_layer,
+         block_chunks=block_chunks,
+         num_register_tokens=num_register_tokens,
+         interpolate_antialias=interpolate_antialias,
+         interpolate_offset=interpolate_offset,
+     )
+     vit_kwargs.update(**kwargs)
+     model = vits.__dict__[arch_name](**vit_kwargs)
+
+     if pretrained:
+         model_full_name = _make_dinov2_model_name(arch_name, patch_size, num_register_tokens)
+         url = _DINOV2_BASE_URL + f"/{model_base_name}/{model_full_name}_pretrain.pth"
+         state_dict = torch.hub.load_state_dict_from_url(url, map_location="cpu")
+         model.load_state_dict(state_dict, strict=True)
+
+     return model
+
+
+ def dinov2_vits14(*, pretrained: bool = True, weights: Union[Weights, str] = Weights.LVD142M, **kwargs):
+     """
+     DINOv2 ViT-S/14 model (optionally) pretrained on the LVD-142M dataset.
+     """
+     return _make_dinov2_model(arch_name="vit_small", pretrained=pretrained, weights=weights, **kwargs)
+
+
+ def dinov2_vitb14(*, pretrained: bool = True, weights: Union[Weights, str] = Weights.LVD142M, **kwargs):
+     """
+     DINOv2 ViT-B/14 model (optionally) pretrained on the LVD-142M dataset.
+     """
+     return _make_dinov2_model(arch_name="vit_base", pretrained=pretrained, weights=weights, **kwargs)
+
+
+ def dinov2_vitl14(*, pretrained: bool = True, weights: Union[Weights, str] = Weights.LVD142M, **kwargs):
+     """
+     DINOv2 ViT-L/14 model (optionally) pretrained on the LVD-142M dataset.
+     """
+     return _make_dinov2_model(arch_name="vit_large", pretrained=pretrained, weights=weights, **kwargs)
+
+
+ def dinov2_vitg14(*, pretrained: bool = True, weights: Union[Weights, str] = Weights.LVD142M, **kwargs):
+     """
+     DINOv2 ViT-g/14 model (optionally) pretrained on the LVD-142M dataset.
+     """
+     return _make_dinov2_model(
+         arch_name="vit_giant2",
+         ffn_layer="swiglufused",
+         weights=weights,
+         pretrained=pretrained,
+         **kwargs,
+     )
+
+
+ def dinov2_vits14_reg(*, pretrained: bool = True, weights: Union[Weights, str] = Weights.LVD142M, **kwargs):
+     """
+     DINOv2 ViT-S/14 model with registers (optionally) pretrained on the LVD-142M dataset.
+     """
+     return _make_dinov2_model(
+         arch_name="vit_small",
+         pretrained=pretrained,
+         weights=weights,
+         num_register_tokens=4,
+         interpolate_antialias=True,
+         interpolate_offset=0.0,
+         **kwargs,
+     )
+
+
+ def dinov2_vitb14_reg(*, pretrained: bool = True, weights: Union[Weights, str] = Weights.LVD142M, **kwargs):
+     """
+     DINOv2 ViT-B/14 model with registers (optionally) pretrained on the LVD-142M dataset.
+     """
+     return _make_dinov2_model(
+         arch_name="vit_base",
+         pretrained=pretrained,
+         weights=weights,
+         num_register_tokens=4,
+         interpolate_antialias=True,
+         interpolate_offset=0.0,
+         **kwargs,
+     )
+
+
+ def dinov2_vitl14_reg(*, pretrained: bool = True, weights: Union[Weights, str] = Weights.LVD142M, **kwargs):
+     """
+     DINOv2 ViT-L/14 model with registers (optionally) pretrained on the LVD-142M dataset.
+     """
+     return _make_dinov2_model(
+         arch_name="vit_large",
+         pretrained=pretrained,
+         weights=weights,
+         num_register_tokens=4,
+         interpolate_antialias=True,
+         interpolate_offset=0.0,
+         **kwargs,
+     )
+
+
+ def dinov2_vitg14_reg(*, pretrained: bool = True, weights: Union[Weights, str] = Weights.LVD142M, **kwargs):
+     """
+     DINOv2 ViT-g/14 model with registers (optionally) pretrained on the LVD-142M dataset.
+     """
+     return _make_dinov2_model(
+         arch_name="vit_giant2",
+         ffn_layer="swiglufused",
+         weights=weights,
+         pretrained=pretrained,
+         num_register_tokens=4,
+         interpolate_antialias=True,
+         interpolate_offset=0.0,
+         **kwargs,
+     )
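Note (not part of the committed file): a hedged sketch of calling one of these hub entrypoints. It assumes the vendored `..models.vision_transformer` module imported inside `_make_dinov2_model` is also present in this repository, and passes `pretrained=False` so no checkpoint is downloaded from `_DINOV2_BASE_URL`.

```python
from unish.pi3.models.dinov2.hub.backbones import dinov2_vitl14  # import path assumed from this diff

model = dinov2_vitl14(pretrained=False)  # ViT-L/14 with the img_size=518, patch_size=14 defaults
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```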
unish/pi3/models/dinov2/hub/utils.py ADDED
@@ -0,0 +1,39 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ #
+ # This source code is licensed under the Apache License, Version 2.0
+ # found in the LICENSE file in the root directory of this source tree.
+
+ import itertools
+ import math
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+
+ _DINOV2_BASE_URL = "https://dl.fbaipublicfiles.com/dinov2"
+
+
+ def _make_dinov2_model_name(arch_name: str, patch_size: int, num_register_tokens: int = 0) -> str:
+     compact_arch_name = arch_name.replace("_", "")[:4]
+     registers_suffix = f"_reg{num_register_tokens}" if num_register_tokens else ""
+     return f"dinov2_{compact_arch_name}{patch_size}{registers_suffix}"
+
+
+ class CenterPadding(nn.Module):
+     def __init__(self, multiple):
+         super().__init__()
+         self.multiple = multiple
+
+     def _get_pad(self, size):
+         new_size = math.ceil(size / self.multiple) * self.multiple
+         pad_size = new_size - size
+         pad_size_left = pad_size // 2
+         pad_size_right = pad_size - pad_size_left
+         return pad_size_left, pad_size_right
+
+     @torch.inference_mode()
+     def forward(self, x):
+         pads = list(itertools.chain.from_iterable(self._get_pad(m) for m in x.shape[:1:-1]))
+         output = F.pad(x, pads)
+         return output
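Note (not part of the committed file): a small sketch of `CenterPadding`. For a 4D input it pads only the trailing spatial dimensions (H and W) up to the nearest multiple, splitting the padding between the two sides.

```python
import torch
from unish.pi3.models.dinov2.hub.utils import CenterPadding  # import path assumed from this diff

pad = CenterPadding(multiple=14)
x = torch.randn(1, 3, 250, 333)
print(pad(x).shape)  # torch.Size([1, 3, 252, 336]) -- H and W rounded up to multiples of 14
```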
unish/pi3/models/dinov2/layers/__init__.py ADDED
@@ -0,0 +1,11 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ #
+ # This source code is licensed under the Apache License, Version 2.0
+ # found in the LICENSE file in the root directory of this source tree.
+
+ from .dino_head import DINOHead
+ from .mlp import Mlp
+ from .patch_embed import PatchEmbed
+ from .swiglu_ffn import SwiGLUFFN, SwiGLUFFNFused
+ from .block import NestedTensorBlock
+ from .attention import MemEffAttention
unish/pi3/models/dinov2/layers/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (451 Bytes). View file
 
unish/pi3/models/dinov2/layers/__pycache__/attention.cpython-310.pyc ADDED
Binary file (2.47 kB). View file
 
unish/pi3/models/dinov2/layers/__pycache__/block.cpython-310.pyc ADDED
Binary file (8.04 kB). View file
 
unish/pi3/models/dinov2/layers/__pycache__/dino_head.cpython-310.pyc ADDED
Binary file (1.99 kB). View file
 
unish/pi3/models/dinov2/layers/__pycache__/drop_path.cpython-310.pyc ADDED
Binary file (1.21 kB). View file
 
unish/pi3/models/dinov2/layers/__pycache__/layer_scale.cpython-310.pyc ADDED
Binary file (1.01 kB). View file
 
unish/pi3/models/dinov2/layers/__pycache__/mlp.cpython-310.pyc ADDED
Binary file (1.2 kB). View file
 
unish/pi3/models/dinov2/layers/__pycache__/patch_embed.cpython-310.pyc ADDED
Binary file (2.65 kB). View file
 
unish/pi3/models/dinov2/layers/__pycache__/swiglu_ffn.cpython-310.pyc ADDED
Binary file (2.12 kB). View file
 
unish/pi3/models/dinov2/layers/attention.py ADDED
@@ -0,0 +1,89 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ #
+ # This source code is licensed under the Apache License, Version 2.0
+ # found in the LICENSE file in the root directory of this source tree.
+
+ # References:
+ # https://github.com/facebookresearch/dino/blob/master/vision_transformer.py
+ # https://github.com/rwightman/pytorch-image-models/tree/master/timm/models/vision_transformer.py
+
+ import logging
+ import os
+ import warnings
+
+ from torch import Tensor
+ from torch import nn
+
+
+ logger = logging.getLogger("dinov2")
+
+
+ XFORMERS_ENABLED = os.environ.get("XFORMERS_DISABLED") is None
+ try:
+     if XFORMERS_ENABLED:
+         from xformers.ops import memory_efficient_attention, unbind
+
+         XFORMERS_AVAILABLE = True
+         # warnings.warn("xFormers is available (Attention)")
+     else:
+         # warnings.warn("xFormers is disabled (Attention)")
+         raise ImportError
+ except ImportError:
+     XFORMERS_AVAILABLE = False
+     # warnings.warn("xFormers is not available (Attention)")
+
+
+ class Attention(nn.Module):
+     def __init__(
+         self,
+         dim: int,
+         num_heads: int = 8,
+         qkv_bias: bool = False,
+         proj_bias: bool = True,
+         attn_drop: float = 0.0,
+         proj_drop: float = 0.0,
+     ) -> None:
+         super().__init__()
+         self.num_heads = num_heads
+         head_dim = dim // num_heads
+         self.scale = head_dim**-0.5
+
+         self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
+         self.attn_drop = nn.Dropout(attn_drop)
+         self.proj = nn.Linear(dim, dim, bias=proj_bias)
+         self.proj_drop = nn.Dropout(proj_drop)
+
+     def forward(self, x: Tensor, attn_bias=None) -> Tensor:
+         B, N, C = x.shape
+         qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
+
+         q, k, v = qkv[0] * self.scale, qkv[1], qkv[2]
+         attn = q @ k.transpose(-2, -1)
+
+         attn = attn.softmax(dim=-1)
+         attn = self.attn_drop(attn)
+
+         x = (attn @ v).transpose(1, 2).reshape(B, N, C)
+         x = self.proj(x)
+         x = self.proj_drop(x)
+         return x
+
+
+ class MemEffAttention(Attention):
+     def forward(self, x: Tensor, attn_bias=None) -> Tensor:
+         if not XFORMERS_AVAILABLE:
+             if attn_bias is not None:
+                 raise AssertionError("xFormers is required for using nested tensors")
+             return super().forward(x)
+
+         B, N, C = x.shape
+         qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
+
+         q, k, v = unbind(qkv, 2)
+
+         x = memory_efficient_attention(q, k, v, attn_bias=attn_bias)
+         x = x.reshape([B, N, C])
+
+         x = self.proj(x)
+         x = self.proj_drop(x)
+         return x
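Note (not part of the committed file): a sketch showing that `MemEffAttention` works as a drop-in dense attention layer; when xFormers is unavailable (and `attn_bias` is None) it falls back to the plain `Attention.forward` path above, so the call below works either way.

```python
import torch
from unish.pi3.models.dinov2.layers.attention import MemEffAttention  # import path assumed from this diff

attn = MemEffAttention(dim=384, num_heads=6, qkv_bias=True)
tokens = torch.randn(2, 197, 384)  # (batch, sequence, channels)
print(attn(tokens).shape)  # torch.Size([2, 197, 384])
```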
unish/pi3/models/dinov2/layers/block.py ADDED
@@ -0,0 +1,259 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ #
+ # This source code is licensed under the Apache License, Version 2.0
+ # found in the LICENSE file in the root directory of this source tree.
+
+ # References:
+ # https://github.com/facebookresearch/dino/blob/master/vision_transformer.py
+ # https://github.com/rwightman/pytorch-image-models/tree/master/timm/layers/patch_embed.py
+
+ import logging
+ import os
+ from typing import Callable, List, Any, Tuple, Dict
+ import warnings
+
+ import torch
+ from torch import nn, Tensor
+
+ from .attention import Attention, MemEffAttention
+ from .drop_path import DropPath
+ from .layer_scale import LayerScale
+ from .mlp import Mlp
+
+
+ logger = logging.getLogger("dinov2")
+
+
+ XFORMERS_ENABLED = os.environ.get("XFORMERS_DISABLED") is None
+ try:
+     if XFORMERS_ENABLED:
+         from xformers.ops import fmha, scaled_index_add, index_select_cat
+
+         XFORMERS_AVAILABLE = True
+         # warnings.warn("xFormers is available (Block)")
+     else:
+         # warnings.warn("xFormers is disabled (Block)")
+         raise ImportError
+ except ImportError:
+     XFORMERS_AVAILABLE = False
+     # warnings.warn("xFormers is not available (Block)")
+
+
+ class Block(nn.Module):
+     def __init__(
+         self,
+         dim: int,
+         num_heads: int,
+         mlp_ratio: float = 4.0,
+         qkv_bias: bool = False,
+         proj_bias: bool = True,
+         ffn_bias: bool = True,
+         drop: float = 0.0,
+         attn_drop: float = 0.0,
+         init_values=None,
+         drop_path: float = 0.0,
+         act_layer: Callable[..., nn.Module] = nn.GELU,
+         norm_layer: Callable[..., nn.Module] = nn.LayerNorm,
+         attn_class: Callable[..., nn.Module] = Attention,
+         ffn_layer: Callable[..., nn.Module] = Mlp,
+     ) -> None:
+         super().__init__()
+         # print(f"biases: qkv: {qkv_bias}, proj: {proj_bias}, ffn: {ffn_bias}")
+         self.norm1 = norm_layer(dim)
+         self.attn = attn_class(
+             dim,
+             num_heads=num_heads,
+             qkv_bias=qkv_bias,
+             proj_bias=proj_bias,
+             attn_drop=attn_drop,
+             proj_drop=drop,
+         )
+         self.ls1 = LayerScale(dim, init_values=init_values) if init_values else nn.Identity()
+         self.drop_path1 = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
+
+         self.norm2 = norm_layer(dim)
+         mlp_hidden_dim = int(dim * mlp_ratio)
+         self.mlp = ffn_layer(
+             in_features=dim,
+             hidden_features=mlp_hidden_dim,
+             act_layer=act_layer,
+             drop=drop,
+             bias=ffn_bias,
+         )
+         self.ls2 = LayerScale(dim, init_values=init_values) if init_values else nn.Identity()
+         self.drop_path2 = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
+
+         self.sample_drop_ratio = drop_path
+
+     def forward(self, x: Tensor) -> Tensor:
+         def attn_residual_func(x: Tensor) -> Tensor:
+             return self.ls1(self.attn(self.norm1(x)))
+
+         def ffn_residual_func(x: Tensor) -> Tensor:
+             return self.ls2(self.mlp(self.norm2(x)))
+
+         if self.training and self.sample_drop_ratio > 0.1:
+             # the overhead is compensated only for a drop path rate larger than 0.1
+             x = drop_add_residual_stochastic_depth(
+                 x,
+                 residual_func=attn_residual_func,
+                 sample_drop_ratio=self.sample_drop_ratio,
+             )
+             x = drop_add_residual_stochastic_depth(
+                 x,
+                 residual_func=ffn_residual_func,
+                 sample_drop_ratio=self.sample_drop_ratio,
+             )
+         elif self.training and self.sample_drop_ratio > 0.0:
+             x = x + self.drop_path1(attn_residual_func(x))
+             x = x + self.drop_path1(ffn_residual_func(x))  # FIXME: drop_path2
+         else:
+             x = x + attn_residual_func(x)
+             x = x + ffn_residual_func(x)
+         return x
+
+
+ def drop_add_residual_stochastic_depth(
+     x: Tensor,
+     residual_func: Callable[[Tensor], Tensor],
+     sample_drop_ratio: float = 0.0,
+ ) -> Tensor:
+     # 1) extract subset using permutation
+     b, n, d = x.shape
+     sample_subset_size = max(int(b * (1 - sample_drop_ratio)), 1)
+     brange = (torch.randperm(b, device=x.device))[:sample_subset_size]
+     x_subset = x[brange]
+
+     # 2) apply residual_func to get residual
+     residual = residual_func(x_subset)
+
+     x_flat = x.flatten(1)
+     residual = residual.flatten(1)
+
+     residual_scale_factor = b / sample_subset_size
+
+     # 3) add the residual
+     x_plus_residual = torch.index_add(x_flat, 0, brange, residual.to(dtype=x.dtype), alpha=residual_scale_factor)
+     return x_plus_residual.view_as(x)
+
+
+ def get_branges_scales(x, sample_drop_ratio=0.0):
+     b, n, d = x.shape
+     sample_subset_size = max(int(b * (1 - sample_drop_ratio)), 1)
+     brange = (torch.randperm(b, device=x.device))[:sample_subset_size]
+     residual_scale_factor = b / sample_subset_size
+     return brange, residual_scale_factor
+
+
+ def add_residual(x, brange, residual, residual_scale_factor, scaling_vector=None):
+     if scaling_vector is None:
+         x_flat = x.flatten(1)
+         residual = residual.flatten(1)
+         x_plus_residual = torch.index_add(x_flat, 0, brange, residual.to(dtype=x.dtype), alpha=residual_scale_factor)
+     else:
+         x_plus_residual = scaled_index_add(
+             x, brange, residual.to(dtype=x.dtype), scaling=scaling_vector, alpha=residual_scale_factor
+         )
+     return x_plus_residual
+
+
+ attn_bias_cache: Dict[Tuple, Any] = {}
+
+
+ def get_attn_bias_and_cat(x_list, branges=None):
+     """
+     this will perform the index select, cat the tensors, and provide the attn_bias from cache
+     """
+     batch_sizes = [b.shape[0] for b in branges] if branges is not None else [x.shape[0] for x in x_list]
+     all_shapes = tuple((b, x.shape[1]) for b, x in zip(batch_sizes, x_list))
+     if all_shapes not in attn_bias_cache.keys():
+         seqlens = []
+         for b, x in zip(batch_sizes, x_list):
+             for _ in range(b):
+                 seqlens.append(x.shape[1])
+         attn_bias = fmha.BlockDiagonalMask.from_seqlens(seqlens)
+         attn_bias._batch_sizes = batch_sizes
+         attn_bias_cache[all_shapes] = attn_bias
+
+     if branges is not None:
+         cat_tensors = index_select_cat([x.flatten(1) for x in x_list], branges).view(1, -1, x_list[0].shape[-1])
+     else:
+         tensors_bs1 = tuple(x.reshape([1, -1, *x.shape[2:]]) for x in x_list)
+         cat_tensors = torch.cat(tensors_bs1, dim=1)
+
+     return attn_bias_cache[all_shapes], cat_tensors
+
+
+ def drop_add_residual_stochastic_depth_list(
+     x_list: List[Tensor],
+     residual_func: Callable[[Tensor, Any], Tensor],
+     sample_drop_ratio: float = 0.0,
+     scaling_vector=None,
+ ) -> Tensor:
+     # 1) generate random set of indices for dropping samples in the batch
+     branges_scales = [get_branges_scales(x, sample_drop_ratio=sample_drop_ratio) for x in x_list]
+     branges = [s[0] for s in branges_scales]
+     residual_scale_factors = [s[1] for s in branges_scales]
+
+     # 2) get attention bias and index+concat the tensors
+     attn_bias, x_cat = get_attn_bias_and_cat(x_list, branges)
+
+     # 3) apply residual_func to get residual, and split the result
+     residual_list = attn_bias.split(residual_func(x_cat, attn_bias=attn_bias))  # type: ignore
+
+     outputs = []
+     for x, brange, residual, residual_scale_factor in zip(x_list, branges, residual_list, residual_scale_factors):
+         outputs.append(add_residual(x, brange, residual, residual_scale_factor, scaling_vector).view_as(x))
+     return outputs
+
+
+ class NestedTensorBlock(Block):
+     def forward_nested(self, x_list: List[Tensor]) -> List[Tensor]:
+         """
+         x_list contains a list of tensors to nest together and run
+         """
+         assert isinstance(self.attn, MemEffAttention)
+
+         if self.training and self.sample_drop_ratio > 0.0:
+
+             def attn_residual_func(x: Tensor, attn_bias=None) -> Tensor:
+                 return self.attn(self.norm1(x), attn_bias=attn_bias)
+
+             def ffn_residual_func(x: Tensor, attn_bias=None) -> Tensor:
+                 return self.mlp(self.norm2(x))
+
+             x_list = drop_add_residual_stochastic_depth_list(
+                 x_list,
+                 residual_func=attn_residual_func,
+                 sample_drop_ratio=self.sample_drop_ratio,
+                 scaling_vector=self.ls1.gamma if isinstance(self.ls1, LayerScale) else None,
+             )
+             x_list = drop_add_residual_stochastic_depth_list(
+                 x_list,
+                 residual_func=ffn_residual_func,
+                 sample_drop_ratio=self.sample_drop_ratio,
+                 scaling_vector=self.ls2.gamma if isinstance(self.ls1, LayerScale) else None,
+             )
+             return x_list
+         else:
+
+             def attn_residual_func(x: Tensor, attn_bias=None) -> Tensor:
+                 return self.ls1(self.attn(self.norm1(x), attn_bias=attn_bias))
+
+             def ffn_residual_func(x: Tensor, attn_bias=None) -> Tensor:
+                 return self.ls2(self.mlp(self.norm2(x)))
+
+             attn_bias, x = get_attn_bias_and_cat(x_list)
+             x = x + attn_residual_func(x, attn_bias=attn_bias)
+             x = x + ffn_residual_func(x)
+             return attn_bias.split(x)
+
+     def forward(self, x_or_x_list):
+         if isinstance(x_or_x_list, Tensor):
+             return super().forward(x_or_x_list)
+         elif isinstance(x_or_x_list, list):
+             if not XFORMERS_AVAILABLE:
+                 raise AssertionError("xFormers is required for using nested tensors")
+             return self.forward_nested(x_or_x_list)
+         else:
+             raise AssertionError
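Note (not part of the committed file): a sketch of `NestedTensorBlock` on a single dense tensor, where it behaves as an ordinary pre-norm transformer block with LayerScale; the list (nested-tensor) input path additionally requires xFormers.

```python
import torch
from unish.pi3.models.dinov2.layers.attention import MemEffAttention
from unish.pi3.models.dinov2.layers.block import NestedTensorBlock  # import paths assumed from this diff

block = NestedTensorBlock(dim=384, num_heads=6, qkv_bias=True, init_values=1.0,
                          attn_class=MemEffAttention)
x = torch.randn(2, 197, 384)
print(block(x).shape)  # torch.Size([2, 197, 384])
```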
unish/pi3/models/dinov2/layers/dino_head.py ADDED
@@ -0,0 +1,58 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ #
+ # This source code is licensed under the Apache License, Version 2.0
+ # found in the LICENSE file in the root directory of this source tree.
+
+ import torch
+ import torch.nn as nn
+ from torch.nn.init import trunc_normal_
+ from torch.nn.utils import weight_norm
+
+
+ class DINOHead(nn.Module):
+     def __init__(
+         self,
+         in_dim,
+         out_dim,
+         use_bn=False,
+         nlayers=3,
+         hidden_dim=2048,
+         bottleneck_dim=256,
+         mlp_bias=True,
+     ):
+         super().__init__()
+         nlayers = max(nlayers, 1)
+         self.mlp = _build_mlp(nlayers, in_dim, bottleneck_dim, hidden_dim=hidden_dim, use_bn=use_bn, bias=mlp_bias)
+         self.apply(self._init_weights)
+         self.last_layer = weight_norm(nn.Linear(bottleneck_dim, out_dim, bias=False))
+         self.last_layer.weight_g.data.fill_(1)
+
+     def _init_weights(self, m):
+         if isinstance(m, nn.Linear):
+             trunc_normal_(m.weight, std=0.02)
+             if isinstance(m, nn.Linear) and m.bias is not None:
+                 nn.init.constant_(m.bias, 0)
+
+     def forward(self, x):
+         x = self.mlp(x)
+         eps = 1e-6 if x.dtype == torch.float16 else 1e-12
+         x = nn.functional.normalize(x, dim=-1, p=2, eps=eps)
+         x = self.last_layer(x)
+         return x
+
+
+ def _build_mlp(nlayers, in_dim, bottleneck_dim, hidden_dim=None, use_bn=False, bias=True):
+     if nlayers == 1:
+         return nn.Linear(in_dim, bottleneck_dim, bias=bias)
+     else:
+         layers = [nn.Linear(in_dim, hidden_dim, bias=bias)]
+         if use_bn:
+             layers.append(nn.BatchNorm1d(hidden_dim))
+         layers.append(nn.GELU())
+         for _ in range(nlayers - 2):
+             layers.append(nn.Linear(hidden_dim, hidden_dim, bias=bias))
+             if use_bn:
+                 layers.append(nn.BatchNorm1d(hidden_dim))
+             layers.append(nn.GELU())
+         layers.append(nn.Linear(hidden_dim, bottleneck_dim, bias=bias))
+         return nn.Sequential(*layers)
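Note (not part of the committed file): a sketch of the `DINOHead` data flow: MLP bottleneck, L2 normalization, then the weight-normalized prototype layer.

```python
import torch
from unish.pi3.models.dinov2.layers.dino_head import DINOHead  # import path assumed from this diff

head = DINOHead(in_dim=384, out_dim=65536)
logits = head(torch.randn(8, 384))
print(logits.shape)  # torch.Size([8, 65536])
```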
unish/pi3/models/dinov2/layers/drop_path.py ADDED
@@ -0,0 +1,34 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ #
+ # This source code is licensed under the Apache License, Version 2.0
+ # found in the LICENSE file in the root directory of this source tree.
+
+ # References:
+ # https://github.com/facebookresearch/dino/blob/master/vision_transformer.py
+ # https://github.com/rwightman/pytorch-image-models/tree/master/timm/layers/drop.py
+
+
+ from torch import nn
+
+
+ def drop_path(x, drop_prob: float = 0.0, training: bool = False):
+     if drop_prob == 0.0 or not training:
+         return x
+     keep_prob = 1 - drop_prob
+     shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
+     random_tensor = x.new_empty(shape).bernoulli_(keep_prob)
+     if keep_prob > 0.0:
+         random_tensor.div_(keep_prob)
+     output = x * random_tensor
+     return output
+
+
+ class DropPath(nn.Module):
+     """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
+
+     def __init__(self, drop_prob=None):
+         super(DropPath, self).__init__()
+         self.drop_prob = drop_prob
+
+     def forward(self, x):
+         return drop_path(x, self.drop_prob, self.training)
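Note (not part of the committed file): a sketch of the stochastic-depth behaviour of `drop_path`: entire samples are zeroed and the survivors are rescaled by `1 / keep_prob`, so the expected value is preserved; outside training it is the identity.

```python
import torch
from unish.pi3.models.dinov2.layers.drop_path import drop_path  # import path assumed from this diff

x = torch.ones(4, 3, 8)
y = drop_path(x, drop_prob=0.5, training=True)
print(y[:, 0, 0])  # each sample is either all zeros or scaled by 2.0
print(torch.equal(drop_path(x, drop_prob=0.5, training=False), x))  # True
```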
unish/pi3/models/dinov2/layers/layer_scale.py ADDED
@@ -0,0 +1,27 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ #
+ # This source code is licensed under the Apache License, Version 2.0
+ # found in the LICENSE file in the root directory of this source tree.
+
+ # Modified from: https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/vision_transformer.py#L103-L110
+
+ from typing import Union
+
+ import torch
+ from torch import Tensor
+ from torch import nn
+
+
+ class LayerScale(nn.Module):
+     def __init__(
+         self,
+         dim: int,
+         init_values: Union[float, Tensor] = 1e-5,
+         inplace: bool = False,
+     ) -> None:
+         super().__init__()
+         self.inplace = inplace
+         self.gamma = nn.Parameter(init_values * torch.ones(dim))
+
+     def forward(self, x: Tensor) -> Tensor:
+         return x.mul_(self.gamma) if self.inplace else x * self.gamma
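Note (not part of the committed file): a sketch of `LayerScale`; with the default `init_values=1e-5` the residual branch it wraps starts out close to zero, which is the usual motivation for this layer.

```python
import torch
from unish.pi3.models.dinov2.layers.layer_scale import LayerScale  # import path assumed from this diff

ls = LayerScale(dim=384, init_values=1e-5)
x = torch.randn(2, 197, 384)
print(ls(x).abs().max())  # roughly 1e-5 times the input scale
```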
unish/pi3/models/dinov2/layers/mlp.py ADDED
@@ -0,0 +1,40 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ #
+ # This source code is licensed under the Apache License, Version 2.0
+ # found in the LICENSE file in the root directory of this source tree.
+
+ # References:
+ # https://github.com/facebookresearch/dino/blob/master/vision_transformer.py
+ # https://github.com/rwightman/pytorch-image-models/tree/master/timm/layers/mlp.py
+
+
+ from typing import Callable, Optional
+
+ from torch import Tensor, nn
+
+
+ class Mlp(nn.Module):
+     def __init__(
+         self,
+         in_features: int,
+         hidden_features: Optional[int] = None,
+         out_features: Optional[int] = None,
+         act_layer: Callable[..., nn.Module] = nn.GELU,
+         drop: float = 0.0,
+         bias: bool = True,
+     ) -> None:
+         super().__init__()
+         out_features = out_features or in_features
+         hidden_features = hidden_features or in_features
+         self.fc1 = nn.Linear(in_features, hidden_features, bias=bias)
+         self.act = act_layer()
+         self.fc2 = nn.Linear(hidden_features, out_features, bias=bias)
+         self.drop = nn.Dropout(drop)
+
+     def forward(self, x: Tensor) -> Tensor:
+         x = self.fc1(x)
+         x = self.act(x)
+         x = self.drop(x)
+         x = self.fc2(x)
+         x = self.drop(x)
+         return x
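Note (not part of the committed file): a sketch of this `Mlp`, which applies the shared dropout module after the activation and again after the second linear layer.

```python
import torch
from unish.pi3.models.dinov2.layers.mlp import Mlp  # import path assumed from this diff

mlp = Mlp(in_features=384, hidden_features=1536, drop=0.1)
print(mlp(torch.randn(2, 197, 384)).shape)  # torch.Size([2, 197, 384])
```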