聂如 committed
Commit 91126af · 1 Parent(s): 7829591

Add design file

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. LEGAL.md +7 -0
  2. LICENSE.txt +7 -0
  3. README.md +100 -13
  4. app.py +291 -0
  5. dust3r/croco/datasets/__init__.py +0 -0
  6. dust3r/croco/datasets/crops/README.MD +104 -0
  7. dust3r/croco/datasets/crops/extract_crops_from_images.py +159 -0
  8. dust3r/croco/datasets/habitat_sim/README.MD +76 -0
  9. dust3r/croco/datasets/habitat_sim/__init__.py +0 -0
  10. dust3r/croco/datasets/habitat_sim/generate_from_metadata.py +92 -0
  11. dust3r/croco/datasets/habitat_sim/generate_from_metadata_files.py +27 -0
  12. dust3r/croco/datasets/habitat_sim/generate_multiview_images.py +177 -0
  13. dust3r/croco/datasets/habitat_sim/multiview_habitat_sim_generator.py +390 -0
  14. dust3r/croco/datasets/habitat_sim/pack_metadata_files.py +69 -0
  15. dust3r/croco/datasets/habitat_sim/paths.py +129 -0
  16. dust3r/croco/datasets/pairs_dataset.py +109 -0
  17. dust3r/croco/datasets/transforms.py +95 -0
  18. dust3r/croco/models/__pycache__/blocks.cpython-312.pyc +0 -0
  19. dust3r/croco/models/__pycache__/croco.cpython-312.pyc +0 -0
  20. dust3r/croco/models/__pycache__/dpt_block.cpython-312.pyc +0 -0
  21. dust3r/croco/models/__pycache__/masking.cpython-312.pyc +0 -0
  22. dust3r/croco/models/__pycache__/pos_embed.cpython-312.pyc +0 -0
  23. dust3r/croco/models/blocks.py +307 -0
  24. dust3r/croco/models/criterion.py +37 -0
  25. dust3r/croco/models/croco.py +288 -0
  26. dust3r/croco/models/dpt_block.py +450 -0
  27. dust3r/croco/models/head_downstream.py +58 -0
  28. dust3r/croco/models/masking.py +25 -0
  29. dust3r/croco/models/pos_embed.py +159 -0
  30. dust3r/croco/models/transformer_utils.py +1021 -0
  31. dust3r/croco/models/x_transformer.py +558 -0
  32. dust3r/croco/utils/misc.py +583 -0
  33. dust3r/dust3r/__init__.py +2 -0
  34. dust3r/dust3r/__pycache__/__init__.cpython-312.pyc +0 -0
  35. dust3r/dust3r/__pycache__/model.cpython-312.pyc +0 -0
  36. dust3r/dust3r/__pycache__/patch_embed.cpython-312.pyc +0 -0
  37. dust3r/dust3r/__pycache__/viz.cpython-312.pyc +0 -0
  38. dust3r/dust3r/datasets/CustomDataset.py +145 -0
  39. dust3r/dust3r/datasets/__init__.py +39 -0
  40. dust3r/dust3r/datasets/__pycache__/CustomDataset.cpython-312.pyc +0 -0
  41. dust3r/dust3r/datasets/__pycache__/__init__.cpython-312.pyc +0 -0
  42. dust3r/dust3r/datasets/base/__init__.py +2 -0
  43. dust3r/dust3r/datasets/base/__pycache__/__init__.cpython-312.pyc +0 -0
  44. dust3r/dust3r/datasets/base/__pycache__/base_stereo_view_dataset.cpython-312.pyc +0 -0
  45. dust3r/dust3r/datasets/base/__pycache__/batched_sampler.cpython-312.pyc +0 -0
  46. dust3r/dust3r/datasets/base/__pycache__/easy_dataset.cpython-312.pyc +0 -0
  47. dust3r/dust3r/datasets/base/base_stereo_view_dataset.py +774 -0
  48. dust3r/dust3r/datasets/base/batched_sampler.py +74 -0
  49. dust3r/dust3r/datasets/base/easy_dataset.py +157 -0
  50. dust3r/dust3r/datasets/utils/__init__.py +2 -0
LEGAL.md ADDED
@@ -0,0 +1,7 @@
+ Legal Disclaimer
+
+ Within this source code, the comments in Chinese shall be the original, governing version. Comments in other languages are for reference only. In the event of any conflict between the Chinese-language comments and comments in any other language, the Chinese version shall prevail.
+
+ 法律免责声明
+
+ 关于代码注释部分,中文注释为官方版本,其它语言注释仅做参考。中文注释可能与其它语言注释存在不一致,当中文注释与其它语言注释存在不一致时,请以中文注释为准。
LICENSE.txt ADDED
@@ -0,0 +1,7 @@
+ FLARE, Copyright (c) 2025-present Ant Group, is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license.
+
+ A summary of the CC BY-NC-SA 4.0 license is located here:
+ https://creativecommons.org/licenses/by-nc-sa/4.0/
+
+ The CC BY-NC-SA 4.0 license is located here:
+ https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
README.md CHANGED
@@ -1,13 +1,100 @@
- ---
- title: FLARE
- emoji: 🦀
- colorFrom: blue
- colorTo: yellow
- sdk: gradio
- sdk_version: 5.19.0
- app_file: app.py
- pinned: false
- license: apache-2.0
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
+ [![Website](https://img.shields.io/website-up-down-green-red/http/shields.io.svg)](https://zhanghe3z.github.io/FLARE/)
+ [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97-Hugging%20Face-yellow)](https://huggingface.co/AntResearch/FLARE)
+ [![Video](https://img.shields.io/badge/Video-Demo-red)](https://zhanghe3z.github.io/FLARE/videos/teaser_video.mp4)
+
+ Official implementation of **FLARE** (CVPR 2025), a feed-forward model for joint camera pose estimation, 3D reconstruction, and novel view synthesis from sparse, uncalibrated views.
+
+ ![Teaser](./assets/teaser.jpg)
+
+ <!-- TOC start (generated with https://github.com/derlin/bitdowntoc) -->
+
+ - [📖 Overview](#-overview)
+ - [🛠️ TODO List](#-todo-list)
+ - [🌍 Installation](#-installation)
+ - [💿 Checkpoints](#-checkpoints)
+ - [🎯 Run a Demo (Point Cloud and Camera Pose Estimation)](#-run-a-demo-point-cloud-and-camera-pose-estimation)
+ - [👀 Visualization](#-visualization)
+ - [📜 Citation](#-citation)
+
+ <!-- TOC end -->
+
+ ## 📖 Overview
+ We present FLARE, a feed-forward model that simultaneously estimates high-quality camera poses, 3D geometry, and appearance from as few as 2-8 uncalibrated images. Our cascaded learning paradigm:
+
+ 1. **Camera Pose Estimation**: directly regresses camera poses without bundle adjustment.
+ 2. **Geometry Reconstruction**: decomposes geometry reconstruction into two simpler sub-problems.
+ 3. **Appearance Modeling**: enables photorealistic novel view synthesis via 3D Gaussians.
+
+ FLARE achieves state-of-the-art performance with inference times under 0.5 seconds.
+
+ ## 🛠️ TODO List
+ - [x] Release point cloud and camera pose estimation code.
+ - [x] Update Gradio demo (app.py).
+ - [ ] Release novel view synthesis code. (~2 weeks)
+ - [ ] Release evaluation code. (~2 weeks)
+ - [ ] Release training code.
+ - [ ] Release data processing code.
+
+ ## 🌍 Installation
+
+ ```bash
+ conda create -n flare python=3.8
+ conda activate flare
+ conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia # pick the CUDA version matching your system
+ pip install -r requirements.txt
+ conda uninstall ffmpeg
+ conda install -c conda-forge ffmpeg
+ ```
+
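+ A quick sanity check after installation (a hedged suggestion, not part of the original setup):
+
+ ```bash
+ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
+ ```
+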
+ ## 💿 Checkpoints
+ Download the checkpoint from [Hugging Face](https://huggingface.co/AntResearch/FLARE/blob/main/geometry_pose.pth) and place it at `checkpoints/geometry_pose.pth`.
+
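+ For example, a minimal sketch assuming the standard Hugging Face resolve-URL layout for the link above:
+
+ ```bash
+ mkdir -p checkpoints
+ wget https://huggingface.co/AntResearch/FLARE/resolve/main/geometry_pose.pth -P checkpoints/
+ ```
+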
+ ## 🎯 Run a Demo (Point Cloud and Camera Pose Estimation)
+
+ ```bash
+ sh scripts/run_pose_pointcloud.sh
+ ```
+
+ The script runs a command of the following form:
+
+ ```bash
+ torchrun --nproc_per_node=1 run_pose_pointcloud.py \
+     --test_dataset "1 @ CustomDataset(split='train', ROOT='Your/Data/Path', resolution=(512,384), seed=1, num_views=7, gt_num_image=0, aug_portrait_or_landscape=False, sequential_input=False)" \
+     --model "AsymmetricMASt3R(pos_embed='RoPE100', patch_embed_cls='ManyAR_PatchEmbed', img_size=(512, 512), head_type='catmlp+dpt', output_mode='pts3d+desc24', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12, two_confs=True, desc_conf_mode=('exp', 0, inf))" \
+     --pretrained "Your/Checkpoint/Path" \
+     --test_criterion "MeshOutput(sam=False)" --output_dir "log/" --amp 1 --seed 1 --num_workers 0
+ ```
+
+ **To run the demo with ground-truth camera poses:**
+ enable the `wpose=True` flag in both `CustomDataset` and `AsymmetricMASt3R`. An example script demonstrating this setup is provided in `scripts/run_pose_pointcloud_wpose.sh`:
+
+ ```bash
+ sh scripts/run_pose_pointcloud_wpose.sh
+ ```
+
+ ## 👀 Visualization
+
+ ```bash
+ sh ./visualizer/vis.sh
+ ```
+
+ Or invoke the visualizer directly on a single result:
+
+ ```bash
+ CUDA_VISIBLE_DEVICES=0 python visualizer/run_vis.py --result_npz data/mesh/IMG_1511.HEIC.JPG.JPG/pred.npz --results_folder data/mesh/IMG_1511.HEIC.JPG.JPG/
+ ```
+
+ ## 📜 Citation
+ ```bibtex
+ @misc{zhang2025flarefeedforwardgeometryappearance,
+       title={FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views},
+       author={Shangzhan Zhang and Jianyuan Wang and Yinghao Xu and Nan Xue and Christian Rupprecht and Xiaowei Zhou and Yujun Shen and Gordon Wetzstein},
+       year={2025},
+       eprint={2502.12138},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2502.12138},
+ }
+ ```
app.py ADDED
@@ -0,0 +1,291 @@
+ import spaces
+ import mast3r.utils.path_to_dust3r  # noqa
+ import dust3r.utils.path_to_croco  # noqa: F401
+ import os
+ import sys
+ import os.path as path
+ import torch
+ import tempfile
+ import gradio
+ import shutil
+ import math
+ from mast3r.model import AsymmetricMASt3R
+ import matplotlib.pyplot as pl
+ from dust3r.utils.image import load_images
+ import torch.nn.functional as F
+ from pytorch3d.ops import knn_points
+ from dust3r.utils.geometry import xy_grid
+ import numpy as np
+ import cv2
+ from dust3r.utils.device import to_numpy
+ import trimesh
+ from dust3r.viz import add_scene_cam, CAM_COLORS, OPENGL, pts3d_to_trimesh, cat_meshes
+ from scipy.spatial.transform import Rotation
+
+ pl.ion()
+ # for gpu >= Ampere and pytorch >= 1.12
+ torch.backends.cuda.matmul.allow_tf32 = True
+ batch_size = 1
+ inf = float('inf')
+ weights_path = "checkpoints/geometry_pose.pth"
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ ckpt = torch.load(weights_path, map_location=device)
+ model = AsymmetricMASt3R(pos_embed='RoPE100', patch_embed_cls='ManyAR_PatchEmbed', img_size=(512, 512), head_type='catmlp+dpt', output_mode='pts3d+desc24', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12, two_confs=True, desc_conf_mode=('exp', 0, inf))
+ # pretrained weights downloaded from the Hub replace the freshly constructed model above
+ model = AsymmetricMASt3R.from_pretrained("zhang3z/FLARE").to(device)
+ # model.from_pretrained(ckpt['model'], strict=False)
+ model = model.to(device).eval()
+
+ tmpdirname = tempfile.mkdtemp(suffix='_FLARE_gradio_demo')
+ image_size = 512
+ silent = True
+ gradio_delete_cache = 7200
+ backbone = torch.hub.load(
+     "facebookresearch/dinov2", "dinov2_vitb14_reg"
+ )
+ backbone = backbone.eval().cuda()
+
+ class FileState:
+     def __init__(self, outfile_name=None):
+         self.outfile_name = outfile_name
+
+     def __del__(self):
+         if self.outfile_name is not None and os.path.isfile(self.outfile_name):
+             os.remove(self.outfile_name)
+         self.outfile_name = None
+
+ def pad_to_square(reshaped_image):
+     B, C, H, W = reshaped_image.shape
+     max_dim = max(H, W)
+     pad_height = max_dim - H
+     pad_width = max_dim - W
+     padding = (pad_width // 2, pad_width - pad_width // 2,
+                pad_height // 2, pad_height - pad_height // 2)
+     padded_image = F.pad(reshaped_image, padding, mode='constant', value=0)
+     return padded_image
+
+ def generate_rank_by_dino(
+     reshaped_image, backbone, query_frame_num, image_size=336
+ ):
+     # Downsample image to image_size x image_size
+     # because we found it is unnecessary to use high resolution
+     rgbs = pad_to_square(reshaped_image)
+     rgbs = F.interpolate(
+         rgbs,  # resize the padded image (the original passed the un-padded input here, discarding the padding)
+         (image_size, image_size),
+         mode="bilinear",
+         align_corners=True,
+     )
+     rgbs = _resnet_normalize_image(rgbs.cuda())
+
+     # Get the image features (patch level)
+     frame_feat = backbone(rgbs, is_training=True)
+     frame_feat = frame_feat["x_norm_patchtokens"]
+     frame_feat_norm = F.normalize(frame_feat, p=2, dim=1)
+
+     # Compute the similarity matrix
+     frame_feat_norm = frame_feat_norm.permute(1, 0, 2)
+     similarity_matrix = torch.bmm(
+         frame_feat_norm, frame_feat_norm.transpose(-1, -2)
+     )
+     similarity_matrix = similarity_matrix.mean(dim=0)
+     distance_matrix = 100 - similarity_matrix.clone()
+
+     # Ignore self-pairing
+     similarity_matrix.fill_diagonal_(-100)
+
+     similarity_sum = similarity_matrix.sum(dim=1)
+
+     # Find the most common frame
+     most_common_frame_index = torch.argmax(similarity_sum).item()
+     return most_common_frame_index
+
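+ # Usage sketch for generate_rank_by_dino (hypothetical shapes): given a stack of
+ # N frames in [0, 1], e.g. images = torch.rand(5, 3, 384, 512), it returns the
+ # index of the frame most similar to all others; that frame is used below as the
+ # anchor (first) view for reconstruction.
+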
+ _RESNET_MEAN = [0.485, 0.456, 0.406]
+ _RESNET_STD = [0.229, 0.224, 0.225]
+ _resnet_mean = torch.tensor(_RESNET_MEAN).view(1, 3, 1, 1).cuda()
+ _resnet_std = torch.tensor(_RESNET_STD).view(1, 3, 1, 1).cuda()
+ def _resnet_normalize_image(img: torch.Tensor) -> torch.Tensor:
+     return (img - _resnet_mean) / _resnet_std
+
+ def calculate_index_mappings(query_index, S, device=None):
+     """
+     Construct an ordering that swaps [query_index] and [0],
+     so that the content at query_index is placed at position [0].
+     """
+     new_order = torch.arange(S)
+     new_order[0] = query_index
+     new_order[query_index] = 0
+     if device is not None:
+         new_order = new_order.to(device)
+     return new_order
+
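+ # Worked example (hypothetical values): with S=4 and query_index=2,
+ # calculate_index_mappings(2, 4) returns tensor([2, 1, 0, 3]); indexing a
+ # list of views with this order moves the query view to the front.
+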
+ def _convert_scene_output_to_glb(outfile, imgs, pts3d, mask, focals, cams2world, cam_size=0.05,
+                                  cam_color=None, as_pointcloud=False,
+                                  transparent_cams=False, silent=False):
+     assert len(pts3d) == len(mask) <= len(imgs) <= len(cams2world) == len(focals)
+     pts3d = to_numpy(pts3d)
+     imgs = to_numpy(imgs)
+     focals = to_numpy(focals)
+     mask = to_numpy(mask)
+     cams2world = to_numpy(cams2world)
+
+     scene = trimesh.Scene()
+     # full pointcloud
+     if as_pointcloud:
+         pts = np.concatenate([p[m] for p, m in zip(pts3d, mask)]).reshape(-1, 3)
+         col = np.concatenate([p[m] for p, m in zip(imgs, mask)]).reshape(-1, 3)
+         valid_msk = np.isfinite(pts.sum(axis=1))
+         pct = trimesh.PointCloud(pts[valid_msk], colors=col[valid_msk])
+         scene.add_geometry(pct)
+     else:
+         meshes = []
+         for i in range(len(imgs)):
+             pts3d_i = pts3d[i].reshape(imgs[i].shape)
+             msk_i = mask[i] & np.isfinite(pts3d_i.sum(axis=-1))
+             meshes.append(pts3d_to_trimesh(imgs[i], pts3d_i, msk_i))
+         mesh = trimesh.Trimesh(**cat_meshes(meshes))
+         scene.add_geometry(mesh)
+
+     # add each camera
+     for i, pose_c2w in enumerate(cams2world):
+         if isinstance(cam_color, list):
+             camera_edge_color = cam_color[i]
+         else:
+             camera_edge_color = cam_color or CAM_COLORS[i % len(CAM_COLORS)]
+         add_scene_cam(scene, pose_c2w, camera_edge_color,
+                       None if transparent_cams else imgs[i], focals[i],
+                       imsize=imgs[i].shape[1::-1], screen_width=cam_size)
+
+     rot = np.eye(4)
+     rot[:3, :3] = Rotation.from_euler('y', np.deg2rad(180)).as_matrix()
+     scene.apply_transform(np.linalg.inv(cams2world[0] @ OPENGL @ rot))
+     if not silent:
+         print('(exporting 3D scene to', outfile, ')')
+
+     scene.export(file_obj=outfile)
+     return outfile
+
+ @spaces.GPU(duration=180)
+ def local_get_reconstructed_scene(inputfiles, min_conf_thr, cam_size):
+
+     batch = load_images(inputfiles, size=image_size, verbose=not silent)
+     images = [gt['img'] for gt in batch]
+     images = torch.cat(images, dim=0)
+     images = images / 2 + 0.5
+     index = generate_rank_by_dino(images, backbone, query_frame_num=1)
+     sorted_order = calculate_index_mappings(index, len(images), device=device)
+     sorted_batch = []
+     for i in range(len(batch)):
+         sorted_batch.append(batch[sorted_order[i]])
+     batch = sorted_batch
+     ignore_keys = set(['depthmap', 'dataset', 'label', 'instance', 'idx', 'rng', 'vid'])
+     ignore_dtype_keys = set(['true_shape', 'camera_pose', 'pts3d', 'fxfycxcy', 'img_org', 'camera_intrinsics', 'depthmap', 'depth_anything', 'fxfycxcy_unorm'])
+     dtype = torch.bfloat16
+     for view in batch:
+         for name in view.keys():  # pseudo_focal
+             if name in ignore_keys:
+                 continue
+             if isinstance(view[name], torch.Tensor):
+                 view[name] = view[name].to(device, non_blocking=True)
+             else:
+                 view[name] = torch.tensor(view[name]).to(device, non_blocking=True)
+             if view[name].dtype == torch.float32 and name not in ignore_dtype_keys:
+                 view[name] = view[name].to(dtype)
+     view1 = batch[:1]
+     view2 = batch[1:]
+     with torch.cuda.amp.autocast(enabled=True, dtype=dtype):
+         pred1, pred2, pred_cameras = model(view1, view2, True, dtype)
+     pts3d = pred2['pts3d']
+     conf = pred2['conf']
+     pts3d = pts3d.detach().cpu()
+     B, N, H, W, _ = pts3d.shape
+     thres = torch.quantile(conf.flatten(2, 3), min_conf_thr, dim=-1)[0]
+     masks_conf = conf > thres[None, :, None, None]
+     masks_conf = masks_conf.cpu()
+
+     images = [view['img'] for view in view1 + view2]
+     shape = torch.stack([view['true_shape'] for view in view1 + view2], dim=1).detach().cpu().numpy()
+     images = torch.stack(images, 1).float().permute(0, 1, 3, 4, 2).detach().cpu().numpy()
+     images = images / 2 + 0.5
+     images = images.reshape(B, N, H, W, 3)
+     # estimate focal length
+     images = images[0]
+     pts3d = pts3d[0]
+     masks_conf = masks_conf[0]
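+     # Focal voting: with the principal point assumed at the image center, a pinhole
+     # camera gives u = f*x/z and v = f*y/z for a centered pixel (u, v) and camera-frame
+     # point (x, y, z), so each pixel votes f = u*z/x (or v*z/y); the median vote below
+     # is a robust estimate of the shared focal length.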
+     xy_over_z = (pts3d[..., :2] / pts3d[..., 2:3]).nan_to_num(posinf=0, neginf=0)  # homogeneous (x,y,1)
+     pp = torch.tensor((W/2, H/2)).to(xy_over_z)
+     pixels = xy_grid(W, H, device=xy_over_z.device).view(1, -1, 2) - pp.view(-1, 1, 2)  # B,HW,2
+     u, v = pixels[:1].unbind(dim=-1)
+     x, y, z = pts3d[:1].reshape(-1, 3).unbind(dim=-1)
+     fx_votes = (u * z) / x
+     fy_votes = (v * z) / y
+     # assume square pixels, hence same focal for X and Y
+     f_votes = torch.cat((fx_votes.view(B, -1), fy_votes.view(B, -1)), dim=-1)
+     focal = torch.nanmedian(f_votes, dim=-1).values
+     focal = focal.item()
+     pts3d = pts3d.numpy()
+     # use PnP to estimate camera poses
+     pred_poses = []
+     for i in range(pts3d.shape[0]):
+         shape_input_each = shape[:, i]
+         mesh_grid = xy_grid(shape_input_each[0, 1], shape_input_each[0, 0])
+         cur_inlier = conf[0, i] > torch.quantile(conf[0, i], 0.6)
+         cur_inlier = cur_inlier.detach().cpu().numpy()
+         ransac_thres = 0.5
+         confidence = 0.9999
+         iterationsCount = 10_000
+         cur_pts3d = pts3d[i]
+         K = np.float32([(focal, 0, W/2), (0, focal, H/2), (0, 0, 1)])
+         success, r_pose, t_pose, _ = cv2.solvePnPRansac(cur_pts3d[cur_inlier].astype(np.float64), mesh_grid[cur_inlier].astype(np.float64), K, None,
+                                                         flags=cv2.SOLVEPNP_SQPNP,
+                                                         iterationsCount=iterationsCount,
+                                                         reprojectionError=1,
+                                                         confidence=confidence)
+         r_pose = cv2.Rodrigues(r_pose)[0]
+         RT = np.r_[np.c_[r_pose, t_pose], [(0, 0, 0, 1)]]
+         cam2world = np.linalg.inv(RT)
+         pred_poses.append(cam2world)
+     pred_poses = np.stack(pred_poses, axis=0)
+     pred_poses = torch.tensor(pred_poses)
+     # use kNN to clean the point cloud: drop points whose mean distance to their
+     # 10 nearest neighbours exceeds the 95th percentile (likely outliers)
+     K = 10
+     points = torch.tensor(pts3d.reshape(1, -1, 3)).cuda()
+     knn = knn_points(points, points, K=K)
+     dists = knn.dists
+     mean_dists = dists.mean(dim=-1)
+     masks_dist = mean_dists < torch.quantile(mean_dists.reshape(-1), 0.95)
+     masks_dist = masks_dist.detach().cpu().numpy()
+     masks_conf = (masks_conf > 0) & masks_dist.reshape(-1, H, W)
+     masks_conf = masks_conf > 0
+     outdir = tempfile.mkdtemp(suffix='_FLARE_gradio_demo')
+     os.makedirs(outdir, exist_ok=True)
+     focals = [focal] * len(images)
+     outfile_name = tempfile.mktemp(suffix='_scene.glb', dir=outdir)
+
+     _convert_scene_output_to_glb(outfile_name, images, pts3d, masks_conf, focals, pred_poses, as_pointcloud=True,
+                                  transparent_cams=False, cam_size=cam_size, silent=silent)
+     return filestate, outfile_name
+
+ css = """.gradio-container {margin: 0 !important; min-width: 100%};"""
+ title = "FLARE Demo"
+ with gradio.Blocks(css=css, title=title, delete_cache=(gradio_delete_cache, gradio_delete_cache)) as demo:
+     filestate = gradio.State(None)
+     gradio.HTML('<h2 style="text-align: center;">3D Reconstruction with FLARE</h2>')
+     with gradio.Column():
+         inputfiles = gradio.File(file_count="multiple")
+         snapshot = gradio.Image(None, visible=False)
+         with gradio.Row():
+             # adjust the confidence threshold
+             min_conf_thr = gradio.Slider(label="min_conf_thr", value=0.1, minimum=0.0, maximum=1, step=0.05)
+             # adjust the camera size in the output pointcloud
+             cam_size = gradio.Slider(label="cam_size", value=0.2, minimum=0.001, maximum=1.0, step=0.001)
+         run_btn = gradio.Button("Run")
+         outmodel = gradio.Model3D()
+
+     # events
+     run_btn.click(fn=local_get_reconstructed_scene,
+                   inputs=[inputfiles, min_conf_thr, cam_size],
+                   outputs=[filestate, outmodel])
+
+ demo.launch(show_error=True, share=None, server_name=None, server_port=None)
+ shutil.rmtree(tmpdirname)
dust3r/croco/datasets/__init__.py ADDED
File without changes
dust3r/croco/datasets/crops/README.MD ADDED
@@ -0,0 +1,104 @@
+ ## Generation of crops from the real datasets
+
+ The instructions below allow you to generate the crops used for pre-training CroCo v2 from the following real-world datasets: ARKitScenes, MegaDepth, 3DStreetView and IndoorVL.
+
+ ### Download the metadata of the crops to generate
+
+ First, download the metadata and put it in `./data/`:
+ ```
+ mkdir -p data
+ cd data/
+ wget https://download.europe.naverlabs.com/ComputerVision/CroCo/data/crop_metadata.zip
+ unzip crop_metadata.zip
+ rm crop_metadata.zip
+ cd ..
+ ```
+
+ ### Prepare the original datasets
+
+ Second, download the original datasets into `./data/original_datasets/`.
+ ```
+ mkdir -p data/original_datasets
+ ```
+
+ ##### ARKitScenes
+
+ Download the `raw` dataset from https://github.com/apple/ARKitScenes/blob/main/DATA.md and put it in `./data/original_datasets/ARKitScenes/`.
+ The resulting file structure should look like:
+ ```
+ ./data/original_datasets/ARKitScenes/
+ └───Training
+     └───40753679
+     │   │ ultrawide
+     │   │ ...
+     └───40753686
+         ...
+ ```
+
+ ##### MegaDepth
+
+ Download the `MegaDepth v1 Dataset` from https://www.cs.cornell.edu/projects/megadepth/ and put it in `./data/original_datasets/MegaDepth/`.
+ The resulting file structure should look like:
+
+ ```
+ ./data/original_datasets/MegaDepth/
+ └───0000
+ │   └───images
+ │   │   │ 1000557903_87fa96b8a4_o.jpg
+ │   │   └ ...
+ │   └─── ...
+ └───0001
+ │   │
+ │   └ ...
+ └─── ...
+ ```
+
+ ##### 3DStreetView
+
+ Download the `3D_Street_View` dataset from https://github.com/amir32002/3D_Street_View and put it in `./data/original_datasets/3DStreetView/`.
+ The resulting file structure should look like:
+
+ ```
+ ./data/original_datasets/3DStreetView/
+ └───dataset_aligned
+ │   └───0002
+ │   │   │ 0000002_0000001_0000002_0000001.jpg
+ │   │   └ ...
+ │   └─── ...
+ └───dataset_unaligned
+ │   └───0003
+ │   │   │ 0000003_0000001_0000002_0000001.jpg
+ │   │   └ ...
+ │   └─── ...
+ ```
+
+ ##### IndoorVL
+
+ Download the `IndoorVL` datasets using [Kapture](https://github.com/naver/kapture).
+
+ ```
+ pip install kapture
+ mkdir -p ./data/original_datasets/IndoorVL
+ cd ./data/original_datasets/IndoorVL
+ kapture_download_dataset.py update
+ kapture_download_dataset.py install "HyundaiDepartmentStore_*"
+ kapture_download_dataset.py install "GangnamStation_*"
+ cd -
+ ```
+
+ ### Extract the crops
+
+ Now, extract the crops for each of the datasets:
+ ```
+ for dataset in ARKitScenes MegaDepth 3DStreetView IndoorVL;
+ do
+     python3 datasets/crops/extract_crops_from_images.py --crops ./data/crop_metadata/${dataset}/crops_release.txt --root-dir ./data/original_datasets/${dataset}/ --output-dir ./data/${dataset}_crops/ --imsize 256 --nthread 8 --max-subdir-levels 5 --ideal-number-pairs-in-dir 500;
+ done
+ ```
+
+ ##### Note for IndoorVL
+
+ Due to some legal issues, we can only release 144,228 pairs out of the 1,593,689 pairs used in the paper.
+ To compensate in terms of the number of pre-training iterations, the pre-training command in this repository uses 125 training epochs, including 12 warm-up epochs, and a cosine learning-rate schedule over 250 epochs, instead of 100, 10 and 200 respectively.
+ The impact on performance is negligible.
dust3r/croco/datasets/crops/extract_crops_from_images.py ADDED
@@ -0,0 +1,159 @@
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+ #
+ # --------------------------------------------------------
+ # Extracting crops for pre-training
+ # --------------------------------------------------------
+
+ import os
+ import argparse
+ from tqdm import tqdm
+ from PIL import Image
+ import functools
+ from multiprocessing import Pool
+ import math
+
+
+ def arg_parser():
+     parser = argparse.ArgumentParser('Generate cropped image pairs from image crop list')
+
+     parser.add_argument('--crops', type=str, required=True, help='crop file')
+     parser.add_argument('--root-dir', type=str, required=True, help='root directory')
+     parser.add_argument('--output-dir', type=str, required=True, help='output directory')
+     parser.add_argument('--imsize', type=int, default=256, help='size of the crops')
+     parser.add_argument('--nthread', type=int, required=True, help='number of simultaneous threads')
+     parser.add_argument('--max-subdir-levels', type=int, default=5, help='maximum number of subdirectories')
+     parser.add_argument('--ideal-number-pairs-in-dir', type=int, default=500, help='number of pairs stored in a dir')
+     return parser
+
+
+ def main(args):
+     listing_path = os.path.join(args.output_dir, 'listing.txt')
+
+     print(f'Loading list of crops ... ({args.nthread} threads)')
+     crops, num_crops_to_generate = load_crop_file(args.crops)
+
+     print(f'Preparing jobs ({len(crops)} candidate image pairs)...')
+     num_levels = min(math.ceil(math.log(num_crops_to_generate, args.ideal_number_pairs_in_dir)), args.max_subdir_levels)
+     num_pairs_in_dir = math.ceil(num_crops_to_generate ** (1/num_levels))
+
+     jobs = prepare_jobs(crops, num_levels, num_pairs_in_dir)
+     del crops
+
+     os.makedirs(args.output_dir, exist_ok=True)
+     mmap = Pool(args.nthread).imap_unordered if args.nthread > 1 else map
+     call = functools.partial(save_image_crops, args)
+
+     print(f"Generating cropped images to {args.output_dir} ...")
+     with open(listing_path, 'w') as listing:
+         listing.write('# pair_path\n')
+         for results in tqdm(mmap(call, jobs), total=len(jobs)):
+             for path in results:
+                 listing.write(f'{path}\n')
+     print('Finished writing listing to', listing_path)
+
+
+ def load_crop_file(path):
+     data = open(path).read().splitlines()
+     pairs = []
+     num_crops_to_generate = 0
+     for line in tqdm(data):
+         if line.startswith('#'):
+             continue
+         line = line.split(', ')
+         if len(line) < 8:
+             img1, img2, rotation = line
+             pairs.append((img1, img2, int(rotation), []))
+         else:
+             l1, r1, t1, b1, l2, r2, t2, b2 = map(int, line)
+             rect1, rect2 = (l1, t1, r1, b1), (l2, t2, r2, b2)
+             pairs[-1][-1].append((rect1, rect2))
+             num_crops_to_generate += 1
+     return pairs, num_crops_to_generate
+
+
+ def prepare_jobs(pairs, num_levels, num_pairs_in_dir):
+     jobs = []
+     powers = [num_pairs_in_dir**level for level in reversed(range(num_levels))]
+
+     def get_path(idx):
+         idx_array = []
+         d = idx
+         for level in range(num_levels - 1):
+             idx_array.append(idx // powers[level])
+             idx = idx % powers[level]
+         idx_array.append(d)
+         return '/'.join(map(lambda x: hex(x)[2:], idx_array))
+
+
88
+ idx = 0
89
+ for pair_data in tqdm(pairs):
90
+ img1, img2, rotation, crops = pair_data
91
+ if -60 <= rotation and rotation <= 60:
92
+ rotation = 0 # most likely not a true rotation
93
+ paths = [get_path(idx + k) for k in range(len(crops))]
94
+ idx += len(crops)
95
+ jobs.append(((img1, img2), rotation, crops, paths))
96
+ return jobs
97
+
98
+
99
+ def load_image(path):
100
+ try:
101
+ return Image.open(path).convert('RGB')
102
+ except Exception as e:
103
+ print('skipping', path, e)
104
+ raise OSError()
105
+
106
+
107
+ def save_image_crops(args, data):
108
+ # load images
109
+ img_pair, rot, crops, paths = data
110
+ try:
111
+ img1, img2 = [load_image(os.path.join(args.root_dir, impath)) for impath in img_pair]
112
+ except OSError as e:
113
+ return []
114
+
115
+ def area(sz):
116
+ return sz[0] * sz[1]
117
+
118
+ tgt_size = (args.imsize, args.imsize)
119
+
120
+ def prepare_crop(img, rect, rot=0):
121
+ # actual crop
122
+ img = img.crop(rect)
123
+
124
+ # resize to desired size
125
+ interp = Image.Resampling.LANCZOS if area(img.size) > 4*area(tgt_size) else Image.Resampling.BICUBIC
126
+ img = img.resize(tgt_size, resample=interp)
127
+
128
+ # rotate the image
129
+ rot90 = (round(rot/90) % 4) * 90
130
+ if rot90 == 90:
131
+ img = img.transpose(Image.Transpose.ROTATE_90)
132
+ elif rot90 == 180:
133
+ img = img.transpose(Image.Transpose.ROTATE_180)
134
+ elif rot90 == 270:
135
+ img = img.transpose(Image.Transpose.ROTATE_270)
136
+ return img
137
+
138
+ results = []
139
+ for (rect1, rect2), path in zip(crops, paths):
140
+ crop1 = prepare_crop(img1, rect1)
141
+ crop2 = prepare_crop(img2, rect2, rot)
142
+
143
+ fullpath1 = os.path.join(args.output_dir, path+'_1.jpg')
144
+ fullpath2 = os.path.join(args.output_dir, path+'_2.jpg')
145
+ os.makedirs(os.path.dirname(fullpath1), exist_ok=True)
146
+
147
+ assert not os.path.isfile(fullpath1), fullpath1
148
+ assert not os.path.isfile(fullpath2), fullpath2
149
+ crop1.save(fullpath1)
150
+ crop2.save(fullpath2)
151
+ results.append(path)
152
+
153
+ return results
154
+
155
+
156
+ if __name__ == '__main__':
157
+ args = arg_parser().parse_args()
158
+ main(args)
159
+
dust3r/croco/datasets/habitat_sim/README.MD ADDED
@@ -0,0 +1,76 @@
+ ## Generation of synthetic image pairs using Habitat-Sim
+
+ These instructions allow you to generate pre-training pairs from the Habitat simulator.
+ As we did not save the metadata of the pairs used in the original paper, they are not strictly the same, but these data use the same settings and are equivalent.
+
+ ### Download Habitat-Sim scenes
+ Download Habitat-Sim scenes:
+ - Download links can be found here: https://github.com/facebookresearch/habitat-sim/blob/main/DATASETS.md
+ - We used scenes from the HM3D, habitat-test-scenes, Replica, ReplicaCad and ScanNet datasets.
+ - Please put the scenes under `./data/habitat-sim-data/scene_datasets/` following the structure below, or manually update the paths in `paths.py`.
+ ```
+ ./data/
+ └──habitat-sim-data/
+    └──scene_datasets/
+       ├──hm3d/
+       ├──gibson/
+       ├──habitat-test-scenes/
+       ├──replica_cad_baked_lighting/
+       ├──replica_cad/
+       ├──ReplicaDataset/
+       └──scannet/
+ ```
+
+ ### Image pairs generation
+ We provide metadata to generate reproducible image pairs for pre-training and validation.
+ Experiments described in the paper used similar data, but their generation was not reproducible at the time.
+
+ Specifications:
+ - 256x256 resolution images, with a 60-degree field of view.
+ - Up to 1000 image pairs per scene.
+ - Number of scenes / number of image pairs per dataset:
+   - ScanNet: 1097 scenes / 985,209 pairs
+   - HM3D:
+     - hm3d/train: 800 scenes / 800k pairs
+     - hm3d/val: 100 scenes / 100k pairs
+     - hm3d/minival: 10 scenes / 10k pairs
+   - habitat-test-scenes: 3 scenes / 3k pairs
+   - replica_cad_baked_lighting: 13 scenes / 13k pairs
+ - Scenes from hm3d/val and hm3d/minival were not used for pre-training but kept for validation purposes.
+
+ Download the metadata and extract it:
+ ```bash
+ mkdir -p data/habitat_release_metadata/
+ cd data/habitat_release_metadata/
+ wget https://download.europe.naverlabs.com/ComputerVision/CroCo/data/habitat_release_metadata/multiview_habitat_metadata.tar.gz
+ tar -xvf multiview_habitat_metadata.tar.gz
+ cd ../..
+ # Location of the metadata
+ METADATA_DIR="./data/habitat_release_metadata/multiview_habitat_metadata"
+ ```
+
+ Generate image pairs from metadata:
+ - The following command will print a list of command lines to generate image pairs for each scene:
+ ```bash
+ # Target output directory
+ PAIRS_DATASET_DIR="./data/habitat_release/"
+ python datasets/habitat_sim/generate_from_metadata_files.py --input_dir=$METADATA_DIR --output_dir=$PAIRS_DATASET_DIR
+ ```
+ - Multiple such commands can be launched in parallel, e.g. using GNU Parallel:
+ ```bash
+ python datasets/habitat_sim/generate_from_metadata_files.py --input_dir=$METADATA_DIR --output_dir=$PAIRS_DATASET_DIR | parallel -j 16
+ ```
+
+ ## Metadata generation
+
+ Image pairs were randomly sampled using the following commands, whose outputs contain randomness and are thus not exactly reproducible:
+ ```bash
+ # Print command lines to generate image pairs from the different scenes available.
+ PAIRS_DATASET_DIR=MY_CUSTOM_PATH
+ python datasets/habitat_sim/generate_multiview_images.py --list_commands --output_dir=$PAIRS_DATASET_DIR
+
+ # Once a dataset is generated, pack metadata files for reproducibility.
+ METADATA_DIR=MY_CUSTOM_PATH
+ python datasets/habitat_sim/pack_metadata_files.py $PAIRS_DATASET_DIR $METADATA_DIR
+ ```
dust3r/croco/datasets/habitat_sim/__init__.py ADDED
File without changes
dust3r/croco/datasets/habitat_sim/generate_from_metadata.py ADDED
@@ -0,0 +1,92 @@
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+ """
+ Script to generate image pairs for a given scene, reproducing poses provided in a metadata file.
+ """
+ import os
+ from datasets.habitat_sim.multiview_habitat_sim_generator import MultiviewHabitatSimGenerator
+ from datasets.habitat_sim.paths import SCENES_DATASET
+ import argparse
+ import quaternion
+ import PIL.Image
+ import cv2
+ import json
+ from tqdm import tqdm
+
+ def generate_multiview_images_from_metadata(metadata_filename,
+                                             output_dir,
+                                             overload_params=dict(),
+                                             scene_datasets_paths=None,
+                                             exist_ok=False):
+     """
+     Generate images from a metadata file for reproducibility purposes.
+     """
+     # Reorder paths by decreasing label length, to avoid collisions when testing whether a string starts with such a label
+     if scene_datasets_paths is not None:
+         scene_datasets_paths = dict(sorted(scene_datasets_paths.items(), key=lambda x: len(x[0]), reverse=True))
+
+     with open(metadata_filename, 'r') as f:
+         input_metadata = json.load(f)
+     metadata = dict()
+     for key, value in input_metadata.items():
+         # Optionally replace some paths
+         if key in ("scene_dataset_config_file", "scene", "navmesh") and value != "":
+             if scene_datasets_paths is not None:
+                 for dataset_label, dataset_path in scene_datasets_paths.items():
+                     if value.startswith(dataset_label):
+                         value = os.path.normpath(os.path.join(dataset_path, os.path.relpath(value, dataset_label)))
+                         break
+         metadata[key] = value
+
+     # Overload some parameters
+     for key, value in overload_params.items():
+         metadata[key] = value
+
+     generation_entries = dict([(key, value) for key, value in metadata.items() if not (key in ('multiviews', 'output_dir', 'generate_depth'))])
+     generate_depth = metadata["generate_depth"]
+
+     os.makedirs(output_dir, exist_ok=exist_ok)
+
+     generator = MultiviewHabitatSimGenerator(**generation_entries)
+
+     # Generate views
+     for idx_label, data in tqdm(metadata['multiviews'].items()):
+         positions = data["positions"]
+         orientations = data["orientations"]
+         n = len(positions)
+         for oidx in range(n):
+             observation = generator.render_viewpoint(positions[oidx], quaternion.from_float_array(orientations[oidx]))
+             observation_label = f"{oidx + 1}"  # Leonid is indexing starting from 1
+             # Color image saved using PIL
+             img = PIL.Image.fromarray(observation['color'][:, :, :3])
+             filename = os.path.join(output_dir, f"{idx_label}_{observation_label}.jpeg")
+             img.save(filename)
+             if generate_depth:
+                 # Depth image as EXR file
+                 filename = os.path.join(output_dir, f"{idx_label}_{observation_label}_depth.exr")
+                 cv2.imwrite(filename, observation['depth'], [cv2.IMWRITE_EXR_TYPE, cv2.IMWRITE_EXR_TYPE_HALF])
+                 # Camera parameters
+                 camera_params = dict([(key, observation[key].tolist()) for key in ("camera_intrinsics", "R_cam2world", "t_cam2world")])
+                 filename = os.path.join(output_dir, f"{idx_label}_{observation_label}_camera_params.json")
+                 with open(filename, "w") as f:
+                     json.dump(camera_params, f)
+     # Save metadata
+     with open(os.path.join(output_dir, "metadata.json"), "w") as f:
+         json.dump(metadata, f)
+
+     generator.close()
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--metadata_filename", required=True)
+     parser.add_argument("--output_dir", required=True)
+     args = parser.parse_args()
+
+     generate_multiview_images_from_metadata(metadata_filename=args.metadata_filename,
+                                             output_dir=args.output_dir,
+                                             scene_datasets_paths=SCENES_DATASET,
+                                             overload_params=dict(),
+                                             exist_ok=True)
dust3r/croco/datasets/habitat_sim/generate_from_metadata_files.py ADDED
@@ -0,0 +1,27 @@
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+ """
+ Script generating command lines to generate image pairs from metadata files.
+ """
+ import os
+ import glob
+ from tqdm import tqdm
+ import argparse
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--input_dir", required=True)
+     parser.add_argument("--output_dir", required=True)
+     parser.add_argument("--prefix", default="", help="Command-line prefix, useful e.g. to set up the environment.")
+     args = parser.parse_args()
+
+     input_metadata_filenames = glob.iglob(f"{args.input_dir}/**/metadata.json", recursive=True)
+
+     for metadata_filename in tqdm(input_metadata_filenames):
+         output_dir = os.path.join(args.output_dir, os.path.relpath(os.path.dirname(metadata_filename), args.input_dir))
+         # Do not process the scene if the output metadata file already exists
+         if os.path.exists(os.path.join(output_dir, "metadata.json")):
+             continue
+         commandline = f"{args.prefix}python datasets/habitat_sim/generate_from_metadata.py --metadata_filename={metadata_filename} --output_dir={output_dir}"
+         print(commandline)
dust3r/croco/datasets/habitat_sim/generate_multiview_images.py ADDED
@@ -0,0 +1,177 @@
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+ import os
+ from tqdm import tqdm
+ import argparse
+ import PIL.Image
+ import numpy as np
+ import json
+ from datasets.habitat_sim.multiview_habitat_sim_generator import MultiviewHabitatSimGenerator, NoNaviguableSpaceError
+ from datasets.habitat_sim.paths import list_scenes_available
+ import cv2
+ import quaternion
+ import shutil
+
+ def generate_multiview_images_for_scene(scene_dataset_config_file,
+                                         scene,
+                                         navmesh,
+                                         output_dir,
+                                         views_count,
+                                         size,
+                                         exist_ok=False,
+                                         generate_depth=False,
+                                         **kwargs):
+     """
+     Generate tuples of overlapping views for a given scene.
+     generate_depth: generate depth images and camera parameters.
+     """
+     if os.path.exists(output_dir) and not exist_ok:
+         print(f"Scene {scene}: data already generated. Ignoring generation.")
+         return
+     try:
+         print(f"Scene {scene}: {size} multiview acquisitions to generate...")
+         os.makedirs(output_dir, exist_ok=exist_ok)
+
+         metadata_filename = os.path.join(output_dir, "metadata.json")
+
+         metadata_template = dict(scene_dataset_config_file=scene_dataset_config_file,
+                                  scene=scene,
+                                  navmesh=navmesh,
+                                  views_count=views_count,
+                                  size=size,
+                                  generate_depth=generate_depth,
+                                  **kwargs)
+         metadata_template["multiviews"] = dict()
+
+         if os.path.exists(metadata_filename):
+             print("Metadata file already exists:", metadata_filename)
+             print("Loading already generated metadata file...")
+             with open(metadata_filename, "r") as f:
+                 metadata = json.load(f)
+
+             for key in metadata_template.keys():
+                 if key != "multiviews":
+                     assert metadata_template[key] == metadata[key], f"existing file is inconsistent with the input parameters:\nKey: {key}\nmetadata: {metadata[key]}\ntemplate: {metadata_template[key]}."
+         else:
+             print("No temporary file found. Starting generation from scratch...")
+             metadata = metadata_template
+
+         starting_id = len(metadata["multiviews"])
+         print(f"Starting generation from index {starting_id}/{size}...")
+         if starting_id >= size:
+             print("Generation already done.")
+             return
+
+         generator = MultiviewHabitatSimGenerator(scene_dataset_config_file=scene_dataset_config_file,
+                                                  scene=scene,
+                                                  navmesh=navmesh,
+                                                  views_count=views_count,
+                                                  size=size,
+                                                  **kwargs)
+
+         for idx in tqdm(range(starting_id, size)):
+             # Generate / re-generate the observations
+             try:
+                 data = generator[idx]
+                 observations = data["observations"]
+                 positions = data["positions"]
+                 orientations = data["orientations"]
+
+                 idx_label = f"{idx:08}"
+                 for oidx, observation in enumerate(observations):
+                     observation_label = f"{oidx + 1}"  # Leonid is indexing starting from 1
+                     # Color image saved using PIL
+                     img = PIL.Image.fromarray(observation['color'][:, :, :3])
+                     filename = os.path.join(output_dir, f"{idx_label}_{observation_label}.jpeg")
+                     img.save(filename)
+                     if generate_depth:
+                         # Depth image as EXR file
+                         filename = os.path.join(output_dir, f"{idx_label}_{observation_label}_depth.exr")
+                         cv2.imwrite(filename, observation['depth'], [cv2.IMWRITE_EXR_TYPE, cv2.IMWRITE_EXR_TYPE_HALF])
+                         # Camera parameters
+                         camera_params = dict([(key, observation[key].tolist()) for key in ("camera_intrinsics", "R_cam2world", "t_cam2world")])
+                         filename = os.path.join(output_dir, f"{idx_label}_{observation_label}_camera_params.json")
+                         with open(filename, "w") as f:
+                             json.dump(camera_params, f)
+                 metadata["multiviews"][idx_label] = {"positions": positions.tolist(),
+                                                      "orientations": orientations.tolist(),
+                                                      "covisibility_ratios": data["covisibility_ratios"].tolist(),
+                                                      "valid_fractions": data["valid_fractions"].tolist(),
+                                                      "pairwise_visibility_ratios": data["pairwise_visibility_ratios"].tolist()}
+             except RecursionError:
+                 print("Recursion error: unable to sample observations for this scene. We will stop there.")
+                 break
+
+             # Regularly save a temporary metadata file, in case we need to restart the generation
+             if idx % 10 == 0:
+                 with open(metadata_filename, "w") as f:
+                     json.dump(metadata, f)
+
+         # Save metadata
+         with open(metadata_filename, "w") as f:
+             json.dump(metadata, f)
+
+         generator.close()
+     except NoNaviguableSpaceError:
+         pass
+
+ def create_commandline(scene_data, generate_depth, exist_ok=False):
+     """
+     Create a command-line string to generate a scene.
+     """
+     def my_formatting(val):
+         if val is None or val == "":
+             return '""'
+         else:
+             return val
+     commandline = f"""python {__file__} --scene {my_formatting(scene_data.scene)}
+     --scene_dataset_config_file {my_formatting(scene_data.scene_dataset_config_file)}
+     --navmesh {my_formatting(scene_data.navmesh)}
+     --output_dir {my_formatting(scene_data.output_dir)}
+     --generate_depth {int(generate_depth)}
+     --exist_ok {int(exist_ok)}
+     """
+     commandline = " ".join(commandline.split())
+     return commandline
+
+ if __name__ == "__main__":
+     os.umask(2)
+
+     parser = argparse.ArgumentParser(description="""Example of use -- listing commands to generate data for the scenes available:
+     > python datasets/habitat_sim/generate_multiview_images.py --list_commands
+     """)
+
+     parser.add_argument("--output_dir", type=str, required=True)
+     parser.add_argument("--list_commands", action='store_true', help="list command lines to run if true")
+     parser.add_argument("--scene", type=str, default="")
+     parser.add_argument("--scene_dataset_config_file", type=str, default="")
+     parser.add_argument("--navmesh", type=str, default="")
+
+     parser.add_argument("--generate_depth", type=int, default=1)
+     parser.add_argument("--exist_ok", type=int, default=0)
+
+     kwargs = dict(resolution=(256, 256), hfov=60, views_count=2, size=1000)
+
+     args = parser.parse_args()
+     generate_depth = bool(args.generate_depth)
+     exist_ok = bool(args.exist_ok)
+
+     if args.list_commands:
+         # Listing scenes available...
+         scenes_data = list_scenes_available(base_output_dir=args.output_dir)
+
+         for scene_data in scenes_data:
+             print(create_commandline(scene_data, generate_depth=generate_depth, exist_ok=exist_ok))
+     else:
+         if args.scene == "" or args.output_dir == "":
+             print("Missing scene or output dir argument!")
+             print(parser.format_help())
+         else:
+             generate_multiview_images_for_scene(scene=args.scene,
+                                                 scene_dataset_config_file=args.scene_dataset_config_file,
+                                                 navmesh=args.navmesh,
+                                                 output_dir=args.output_dir,
+                                                 exist_ok=exist_ok,
+                                                 generate_depth=generate_depth,
+                                                 **kwargs)
dust3r/croco/datasets/habitat_sim/multiview_habitat_sim_generator.py ADDED
@@ -0,0 +1,390 @@
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+ import os
+ import numpy as np
+ import quaternion
+ import habitat_sim
+ import json
+ from sklearn.neighbors import NearestNeighbors
+ import cv2
+
+ # OpenCV to Habitat camera convention transformation
+ R_OPENCV2HABITAT = np.stack((habitat_sim.geo.RIGHT, -habitat_sim.geo.UP, habitat_sim.geo.FRONT), axis=0)
+ R_HABITAT2OPENCV = R_OPENCV2HABITAT.T
+ DEG2RAD = np.pi / 180
+
+ def compute_camera_intrinsics(height, width, hfov):
+     f = width/2 / np.tan(hfov/2 * np.pi/180)
+     cu, cv = width/2, height/2
+     return f, cu, cv
+
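+ # Worked example (hypothetical values): for width=256 and hfov=60,
+ # f = 128 / tan(30°) ≈ 221.7 pixels, with the principal point (cu, cv)
+ # at the image center.
+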
22
+ def compute_camera_pose_opencv_convention(camera_position, camera_orientation):
23
+ R_cam2world = quaternion.as_rotation_matrix(camera_orientation) @ R_OPENCV2HABITAT
24
+ t_cam2world = np.asarray(camera_position)
25
+ return R_cam2world, t_cam2world
26
+
27
+ def compute_pointmap(depthmap, hfov):
28
+ """ Compute a HxWx3 pointmap in camera frame from a HxW depth map."""
29
+ height, width = depthmap.shape
30
+ f, cu, cv = compute_camera_intrinsics(height, width, hfov)
31
+ # Cast depth map to point
32
+ z_cam = depthmap
33
+ u, v = np.meshgrid(range(width), range(height))
34
+ x_cam = (u - cu) / f * z_cam
35
+ y_cam = (v - cv) / f * z_cam
36
+ X_cam = np.stack((x_cam, y_cam, z_cam), axis=-1)
37
+ return X_cam
38
+
39
+ def compute_pointcloud(depthmap, hfov, camera_position, camera_rotation):
40
+ """Return a 3D point cloud corresponding to valid pixels of the depth map"""
41
+ R_cam2world, t_cam2world = compute_camera_pose_opencv_convention(camera_position, camera_rotation)
42
+
43
+ X_cam = compute_pointmap(depthmap=depthmap, hfov=hfov)
44
+ valid_mask = (X_cam[:,:,2] != 0.0)
45
+
46
+ X_cam = X_cam.reshape(-1, 3)[valid_mask.flatten()]
47
+ X_world = X_cam @ R_cam2world.T + t_cam2world.reshape(1, 3)
48
+ return X_world
49
+
50
+ def compute_pointcloud_overlaps_scikit(pointcloud1, pointcloud2, distance_threshold, compute_symmetric=False):
51
+ """
52
+ Compute 'overlapping' metrics based on a distance threshold between two point clouds.
53
+ """
54
+ nbrs = NearestNeighbors(n_neighbors=1, algorithm = 'kd_tree').fit(pointcloud2)
55
+ distances, indices = nbrs.kneighbors(pointcloud1)
56
+ intersection1 = np.count_nonzero(distances.flatten() < distance_threshold)
57
+
58
+ data = {"intersection1": intersection1,
59
+ "size1": len(pointcloud1)}
60
+ if compute_symmetric:
61
+ nbrs = NearestNeighbors(n_neighbors=1, algorithm = 'kd_tree').fit(pointcloud1)
62
+ distances, indices = nbrs.kneighbors(pointcloud2)
63
+ intersection2 = np.count_nonzero(distances.flatten() < distance_threshold)
64
+ data["intersection2"] = intersection2
65
+ data["size2"] = len(pointcloud2)
66
+
67
+ return data
68
+
69
+ def _append_camera_parameters(observation, hfov, camera_location, camera_rotation):
70
+ """
71
+ Add camera parameters to the observation dictionnary produced by Habitat-Sim
72
+ In-place modifications.
73
+ """
74
+ R_cam2world, t_cam2world = compute_camera_pose_opencv_convention(camera_location, camera_rotation)
75
+ height, width = observation['depth'].shape
76
+ f, cu, cv = compute_camera_intrinsics(height, width, hfov)
77
+ K = np.asarray([[f, 0, cu],
78
+ [0, f, cv],
79
+ [0, 0, 1.0]])
80
+ observation["camera_intrinsics"] = K
81
+ observation["t_cam2world"] = t_cam2world
82
+ observation["R_cam2world"] = R_cam2world
83
+
84
+ def look_at(eye, center, up, return_cam2world=True):
85
+ """
86
+ Return camera pose looking at a given center point.
87
+ Analogous of gluLookAt function, using OpenCV camera convention.
88
+ """
89
+ z = center - eye
90
+ z /= np.linalg.norm(z, axis=-1, keepdims=True)
91
+ y = -up
92
+ y = y - np.sum(y * z, axis=-1, keepdims=True) * z
93
+ y /= np.linalg.norm(y, axis=-1, keepdims=True)
94
+ x = np.cross(y, z, axis=-1)
95
+
96
+ if return_cam2world:
97
+ R = np.stack((x, y, z), axis=-1)
98
+ t = eye
99
+ else:
100
+ # World to camera transformation
101
+ # Transposed matrix
102
+ R = np.stack((x, y, z), axis=-2)
103
+ t = - np.einsum('...ij, ...j', R, eye)
104
+ return R, t
105
+
106
+ def look_at_for_habitat(eye, center, up, return_cam2world=True):
107
+ R, t = look_at(eye, center, up)
108
+ orientation = quaternion.from_rotation_matrix(R @ R_OPENCV2HABITAT.T)
109
+ return orientation, t
110
+
111
+ def generate_orientation_noise(pan_range, tilt_range, roll_range):
112
+ return (quaternion.from_rotation_vector(np.random.uniform(*pan_range) * DEG2RAD * habitat_sim.geo.UP)
113
+ * quaternion.from_rotation_vector(np.random.uniform(*tilt_range) * DEG2RAD * habitat_sim.geo.RIGHT)
114
+ * quaternion.from_rotation_vector(np.random.uniform(*roll_range) * DEG2RAD * habitat_sim.geo.FRONT))
115
+
116
+
117
+ class NoNaviguableSpaceError(RuntimeError):
118
+ def __init__(self, *args):
119
+ super().__init__(*args)
120
+
121
+ class MultiviewHabitatSimGenerator:
122
+ def __init__(self,
123
+ scene,
124
+ navmesh,
125
+ scene_dataset_config_file,
126
+ resolution = (240, 320),
127
+ views_count=2,
128
+ hfov = 60,
129
+ gpu_id = 0,
130
+ size = 10000,
131
+ minimum_covisibility = 0.5,
132
+ transform = None):
133
+ self.scene = scene
134
+ self.navmesh = navmesh
135
+ self.scene_dataset_config_file = scene_dataset_config_file
136
+ self.resolution = resolution
137
+ self.views_count = views_count
138
+ assert(self.views_count >= 1)
139
+ self.hfov = hfov
140
+ self.gpu_id = gpu_id
141
+ self.size = size
142
+ self.transform = transform
143
+
144
+ # Noise added to camera orientation
145
+ self.pan_range = (-3, 3)
146
+ self.tilt_range = (-10, 10)
147
+ self.roll_range = (-5, 5)
148
+
149
+ # Height range to sample cameras
150
+ self.height_range = (1.2, 1.8)
151
+
152
+ # Random steps between the camera views
153
+ self.random_steps_count = 5
154
+ self.random_step_variance = 2.0
155
+
156
+ # Minimum fraction of the scene which should be valid (well defined depth)
157
+ self.minimum_valid_fraction = 0.7
158
+
159
+ # Distance threshold to see to select pairs
160
+ self.distance_threshold = 0.05
161
+ # Minimum IoU of a view point cloud with respect to the reference view to be kept.
162
+ self.minimum_covisibility = minimum_covisibility
163
+
164
+ # Maximum number of retries.
165
+ self.max_attempts_count = 100
166
+
167
+ self.seed = None
168
+ self._lazy_initialization()
169
+
170
+ def _lazy_initialization(self):
171
+ # Lazy random seeding and instantiation of the simulator to deal with multiprocessing properly
172
+ if self.seed == None:
173
+ # Re-seed numpy generator
174
+ np.random.seed()
175
+ self.seed = np.random.randint(2**32-1)
176
+ sim_cfg = habitat_sim.SimulatorConfiguration()
177
+ sim_cfg.scene_id = self.scene
178
+ if self.scene_dataset_config_file is not None and self.scene_dataset_config_file != "":
179
+ sim_cfg.scene_dataset_config_file = self.scene_dataset_config_file
180
+ sim_cfg.random_seed = self.seed
181
+ sim_cfg.load_semantic_mesh = False
182
+ sim_cfg.gpu_device_id = self.gpu_id
183
+
184
+ depth_sensor_spec = habitat_sim.CameraSensorSpec()
185
+ depth_sensor_spec.uuid = "depth"
186
+ depth_sensor_spec.sensor_type = habitat_sim.SensorType.DEPTH
187
+ depth_sensor_spec.resolution = self.resolution
188
+ depth_sensor_spec.hfov = self.hfov
189
+ depth_sensor_spec.position = [0.0, 0.0, 0]
190
+ # the sensor orientation is left at its default value
191
+
192
+ rgb_sensor_spec = habitat_sim.CameraSensorSpec()
193
+ rgb_sensor_spec.uuid = "color"
194
+ rgb_sensor_spec.sensor_type = habitat_sim.SensorType.COLOR
195
+ rgb_sensor_spec.resolution = self.resolution
196
+ rgb_sensor_spec.hfov = self.hfov
197
+ rgb_sensor_spec.position = [0.0, 0.0, 0]
198
+ agent_cfg = habitat_sim.agent.AgentConfiguration(sensor_specifications=[rgb_sensor_spec, depth_sensor_spec])
199
+
200
+ cfg = habitat_sim.Configuration(sim_cfg, [agent_cfg])
201
+ self.sim = habitat_sim.Simulator(cfg)
202
+ if self.navmesh is not None and self.navmesh != "":
203
+ # Use pre-computed navmesh when available (usually better than those generated automatically)
204
+ self.sim.pathfinder.load_nav_mesh(self.navmesh)
205
+
206
+ if not self.sim.pathfinder.is_loaded:
207
+ # Try to compute a navmesh
208
+ navmesh_settings = habitat_sim.NavMeshSettings()
209
+ navmesh_settings.set_defaults()
210
+ self.sim.recompute_navmesh(self.sim.pathfinder, navmesh_settings, True)
211
+
212
+ # Ensure that the navmesh is not empty
213
+ if not self.sim.pathfinder.is_loaded:
214
+ raise NoNaviguableSpaceError(f"No navigable location (scene: {self.scene} -- navmesh: {self.navmesh})")
215
+
216
+ self.agent = self.sim.initialize_agent(agent_id=0)
217
+
218
+ def close(self):
219
+ self.sim.close()
220
+
221
+ def __del__(self):
222
+ self.sim.close()
223
+
224
+ def __len__(self):
225
+ return self.size
226
+
227
+ def sample_random_viewpoint(self):
228
+ """ Sample a random viewpoint using the navmesh """
229
+ nav_point = self.sim.pathfinder.get_random_navigable_point()
230
+
231
+ # Sample a random viewpoint height
232
+ viewpoint_height = np.random.uniform(*self.height_range)
233
+ viewpoint_position = nav_point + viewpoint_height * habitat_sim.geo.UP
234
+ viewpoint_orientation = quaternion.from_rotation_vector(np.random.uniform(0, 2 * np.pi) * habitat_sim.geo.UP) * generate_orientation_noise(self.pan_range, self.tilt_range, self.roll_range)
235
+ return viewpoint_position, viewpoint_orientation, nav_point
236
+
237
+ def sample_other_random_viewpoint(self, observed_point, nav_point):
238
+ """ Sample a random viewpoint close to an existing one, using the navmesh and a reference observed point."""
239
+ other_nav_point = nav_point
240
+
241
+ walk_directions = self.random_step_variance * np.asarray([1,0,1])
242
+ for i in range(self.random_steps_count):
243
+ temp = self.sim.pathfinder.snap_point(other_nav_point + walk_directions * np.random.normal(size=3))
244
+ # Snapping may return nan when it fails
245
+ if not np.isnan(temp[0]):
246
+ other_nav_point = temp
247
+
248
+ other_viewpoint_height = np.random.uniform(*self.height_range)
249
+ other_viewpoint_position = other_nav_point + other_viewpoint_height * habitat_sim.geo.UP
250
+
251
+ # Set viewing direction towards the central point
252
+ rotation, position = look_at_for_habitat(eye=other_viewpoint_position, center=observed_point, up=habitat_sim.geo.UP, return_cam2world=True)
253
+ rotation = rotation * generate_orientation_noise(self.pan_range, self.tilt_range, self.roll_range)
254
+ return position, rotation, other_nav_point
255
+
256
+ def is_other_pointcloud_overlapping(self, ref_pointcloud, other_pointcloud):
257
+ """ Check if a viewpoint is valid and overlaps significantly with a reference one. """
258
+ # Observation
259
+ pixels_count = self.resolution[0] * self.resolution[1]
260
+ valid_fraction = len(other_pointcloud) / pixels_count
261
+ assert valid_fraction <= 1.0 and valid_fraction >= 0.0
262
+ overlap = compute_pointcloud_overlaps_scikit(ref_pointcloud, other_pointcloud, self.distance_threshold, compute_symmetric=True)
263
+ covisibility = min(overlap["intersection1"] / pixels_count, overlap["intersection2"] / pixels_count)
264
+ is_valid = (valid_fraction >= self.minimum_valid_fraction) and (covisibility >= self.minimum_covisibility)
265
+ return is_valid, valid_fraction, covisibility
266
+
267
+ def is_other_viewpoint_overlapping(self, ref_pointcloud, observation, position, rotation):
268
+ """ Check if a viewpoint is valid and overlaps significantly with a reference one. """
269
+ # Observation
270
+ other_pointcloud = compute_pointcloud(observation['depth'], self.hfov, position, rotation)
271
+ return self.is_other_pointcloud_overlapping(ref_pointcloud, other_pointcloud)
272
+
273
+ def render_viewpoint(self, viewpoint_position, viewpoint_orientation):
274
+ agent_state = habitat_sim.AgentState()
275
+ agent_state.position = viewpoint_position
276
+ agent_state.rotation = viewpoint_orientation
277
+ self.agent.set_state(agent_state)
278
+ viewpoint_observations = self.sim.get_sensor_observations(agent_ids=0)
279
+ _append_camera_parameters(viewpoint_observations, self.hfov, viewpoint_position, viewpoint_orientation)
280
+ return viewpoint_observations
281
+
282
+ def __getitem__(self, useless_idx):
283
+ ref_position, ref_orientation, nav_point = self.sample_random_viewpoint()
284
+ ref_observations = self.render_viewpoint(ref_position, ref_orientation)
285
+ # Extract point cloud
286
+ ref_pointcloud = compute_pointcloud(depthmap=ref_observations['depth'], hfov=self.hfov,
287
+ camera_position=ref_position, camera_rotation=ref_orientation)
288
+
289
+ pixels_count = self.resolution[0] * self.resolution[1]
290
+ ref_valid_fraction = len(ref_pointcloud) / pixels_count
291
+ assert ref_valid_fraction <= 1.0 and ref_valid_fraction >= 0.0
292
+ if ref_valid_fraction < self.minimum_valid_fraction:
293
+ # This should produce a recursion error at some point when something is very wrong.
294
+ return self[0]
295
+ # Pick a reference observed point in the point cloud
296
+ observed_point = np.mean(ref_pointcloud, axis=0)
297
+
298
+ # Add the first image as reference
299
+ viewpoints_observations = [ref_observations]
300
+ viewpoints_covisibility = [ref_valid_fraction]
301
+ viewpoints_positions = [ref_position]
302
+ viewpoints_orientations = [quaternion.as_float_array(ref_orientation)]
303
+ viewpoints_clouds = [ref_pointcloud]
304
+ viewpoints_valid_fractions = [ref_valid_fraction]
305
+
306
+ for _ in range(self.views_count - 1):
307
+ # Generate another viewpoint using a simple random walk
308
+ successful_sampling = False
309
+ for sampling_attempt in range(self.max_attempts_count):
310
+ position, rotation, _ = self.sample_other_random_viewpoint(observed_point, nav_point)
311
+ # Observation
312
+ other_viewpoint_observations = self.render_viewpoint(position, rotation)
313
+ other_pointcloud = compute_pointcloud(other_viewpoint_observations['depth'], self.hfov, position, rotation)
314
+
315
+ is_valid, valid_fraction, covisibility = self.is_other_pointcloud_overlapping(ref_pointcloud, other_pointcloud)
316
+ if is_valid:
317
+ successful_sampling = True
318
+ break
319
+ if not successful_sampling:
320
+ print("WARNING: Maximum number of attempts reached.")
321
+ # Dirty hack: retry with a new reference viewpoint
322
+ return self[0]
323
+ viewpoints_observations.append(other_viewpoint_observations)
324
+ viewpoints_covisibility.append(covisibility)
325
+ viewpoints_positions.append(position)
326
+ viewpoints_orientations.append(quaternion.as_float_array(rotation)) # WXYZ convention for the quaternion encoding.
327
+ viewpoints_clouds.append(other_pointcloud)
328
+ viewpoints_valid_fractions.append(valid_fraction)
329
+
330
+ # Estimate relations between all pairs of images
331
+ pairwise_visibility_ratios = np.ones((len(viewpoints_observations), len(viewpoints_observations)))
332
+ for i in range(len(viewpoints_observations)):
333
+ pairwise_visibility_ratios[i,i] = viewpoints_valid_fractions[i]
334
+ for j in range(i+1, len(viewpoints_observations)):
335
+ overlap = compute_pointcloud_overlaps_scikit(viewpoints_clouds[i], viewpoints_clouds[j], self.distance_threshold, compute_symmetric=True)
336
+ pairwise_visibility_ratios[i,j] = overlap['intersection1'] / pixels_count
337
+ pairwise_visibility_ratios[j,i] = overlap['intersection2'] / pixels_count
338
+
339
+ # IoU is relative to image 0
340
+ data = {"observations": viewpoints_observations,
341
+ "positions": np.asarray(viewpoints_positions),
342
+ "orientations": np.asarray(viewpoints_orientations),
343
+ "covisibility_ratios": np.asarray(viewpoints_covisibility),
344
+ "valid_fractions": np.asarray(viewpoints_valid_fractions, dtype=float),
345
+ "pairwise_visibility_ratios": np.asarray(pairwise_visibility_ratios, dtype=float),
346
+ }
347
+
348
+ if self.transform is not None:
349
+ data = self.transform(data)
350
+ return data
351
+
352
+ def generate_random_spiral_trajectory(self, images_count = 100, max_radius=0.5, half_turns=5, use_constant_orientation=False):
353
+ """
354
+ Return a list of images corresponding to a spiral trajectory from a random starting point.
355
+ Useful to generate nice visualisations.
356
+ Use an even number of half turns to get a nice "C1-continuous" loop effect
357
+ """
358
+ ref_position, ref_orientation, navpoint = self.sample_random_viewpoint()
359
+ ref_observations = self.render_viewpoint(ref_position, ref_orientation)
360
+ ref_pointcloud = compute_pointcloud(depthmap=ref_observations['depth'], hfov=self.hfov,
361
+ camera_position=ref_position, camera_rotation=ref_orientation)
362
+ pixels_count = self.resolution[0] * self.resolution[1]
363
+ if len(ref_pointcloud) / pixels_count < self.minimum_valid_fraction:
364
+ # Dirty hack: ensure that the valid part of the image is significant
365
+ return self.generate_random_spiral_trajectory(images_count, max_radius, half_turns, use_constant_orientation)
366
+
367
+ # Pick an observed point in the point cloud
368
+ observed_point = np.mean(ref_pointcloud, axis=0)
369
+ ref_R, ref_t = compute_camera_pose_opencv_convention(ref_position, ref_orientation)
370
+
371
+ images = []
372
+ is_valid = []
373
+ # Spiral trajectory, optionally with a constant orientation
374
+ for i, alpha in enumerate(np.linspace(0, 1, images_count)):
375
+ r = max_radius * np.abs(np.sin(alpha * np.pi)) # Increase then decrease the radius
376
+ theta = alpha * half_turns * np.pi
377
+ x = r * np.cos(theta)
378
+ y = r * np.sin(theta)
379
+ z = 0.0
380
+ position = ref_position + (ref_R @ np.asarray([x, y, z]).reshape(3,1)).flatten()
381
+ if use_constant_orientation:
382
+ orientation = ref_orientation
383
+ else:
384
+ # trajectory looking at a mean point in front of the ref observation
385
+ orientation, position = look_at_for_habitat(eye=position, center=observed_point, up=habitat_sim.geo.UP)
386
+ observations = self.render_viewpoint(position, orientation)
387
+ images.append(observations['color'][...,:3])
388
+ _is_valid, valid_fraction, iou = self.is_other_viewpoint_overlapping(ref_pointcloud, observations, position, orientation)
389
+ is_valid.append(_is_valid)
390
+ return images, np.all(is_valid)
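For orientation, a minimal usage sketch of the generator defined above, assuming habitat-sim, numpy-quaternion and the scene assets are installed; all paths are placeholders:

# hedged usage sketch; scene/navmesh paths below are placeholders
generator = MultiviewHabitatSimGenerator(
    scene="./data/scene.glb",
    navmesh="./data/scene.navmesh",
    scene_dataset_config_file="",
    views_count=2,
    minimum_covisibility=0.5)
sample = generator[0]  # the index is ignored; every access renders a fresh multi-view sample
print(sample["positions"].shape, sample["covisibility_ratios"])
generator.close()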
dust3r/croco/datasets/habitat_sim/pack_metadata_files.py ADDED
@@ -0,0 +1,69 @@
1
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
2
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
3
+ """
4
+ Utility script to pack metadata files of the dataset in order to be able to re-generate it elsewhere.
5
+ """
6
+ import os
7
+ import glob
8
+ from tqdm import tqdm
9
+ import shutil
10
+ import json
11
+ from datasets.habitat_sim.paths import *
12
+ import argparse
13
+ import collections
14
+
15
+ if __name__ == "__main__":
16
+ parser = argparse.ArgumentParser()
17
+ parser.add_argument("input_dir")
18
+ parser.add_argument("output_dir")
19
+ args = parser.parse_args()
20
+
21
+ input_dirname = args.input_dir
22
+ output_dirname = args.output_dir
23
+
24
+ input_metadata_filenames = glob.iglob(f"{input_dirname}/**/metadata.json", recursive=True)
25
+
26
+ images_count = collections.defaultdict(int)
27
+
28
+ os.makedirs(output_dirname)
29
+ for input_filename in tqdm(input_metadata_filenames):
30
+ # Skip metadata files that contain no views
31
+ with open(input_filename, "r") as f:
32
+ original_metadata = json.load(f)
33
+ if "multiviews" not in original_metadata or len(original_metadata["multiviews"]) == 0:
34
+ print("No views in", input_filename)
35
+ continue
36
+
37
+ relpath = os.path.relpath(input_filename, input_dirname)
38
+ print(relpath)
39
+
40
+ # Copy metadata, while replacing scene paths by generic keys depending on the dataset, for portability.
41
+ # Data paths are sorted by decreasing length to avoid potential bugs due to paths starting by the same string pattern.
42
+ scenes_dataset_paths = dict(sorted(SCENES_DATASET.items(), key=lambda x: len(x[1]), reverse=True))
43
+ metadata = dict()
44
+ for key, value in original_metadata.items():
45
+ if key in ("scene_dataset_config_file", "scene", "navmesh") and value != "":
46
+ known_path = False
47
+ for dataset, dataset_path in scenes_dataset_paths.items():
48
+ if value.startswith(dataset_path):
49
+ value = os.path.join(dataset, os.path.relpath(value, dataset_path))
50
+ known_path = True
51
+ break
52
+ if not known_path:
53
+ raise KeyError("Unknown path: " + value)
54
+ metadata[key] = value
55
+
56
+ # Compile some general statistics while packing data
57
+ scene_split = metadata["scene"].split("/")
58
+ upper_level = "/".join(scene_split[:2]) if scene_split[0] == "hm3d" else scene_split[0]
59
+ images_count[upper_level] += len(metadata["multiviews"])
60
+
61
+ output_filename = os.path.join(output_dirname, relpath)
62
+ os.makedirs(os.path.dirname(output_filename), exist_ok=True)
63
+ with open(output_filename, "w") as f:
64
+ json.dump(metadata, f)
65
+
66
+ # Print statistics
67
+ print("Images count:")
68
+ for upper_level, count in images_count.items():
69
+ print(f"- {upper_level}: {count}")
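A typical invocation of this packing script, run from the croco root so that the `datasets.habitat_sim.paths` import resolves (both paths are placeholders): `python -m datasets.habitat_sim.pack_metadata_files /path/to/raw_dataset /path/to/packed_metadata`.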
dust3r/croco/datasets/habitat_sim/paths.py ADDED
@@ -0,0 +1,129 @@
1
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
2
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
3
+
4
+ """
5
+ Paths to Habitat-Sim scenes
6
+ """
7
+
8
+ import os
9
+ import json
10
+ import collections
11
+ from tqdm import tqdm
12
+
13
+
14
+ # Hardcoded path to the different scene datasets
15
+ SCENES_DATASET = {
16
+ "hm3d": "./data/habitat-sim-data/scene_datasets/hm3d/",
17
+ "gibson": "./data/habitat-sim-data/scene_datasets/gibson/",
18
+ "habitat-test-scenes": "./data/habitat-sim/scene_datasets/habitat-test-scenes/",
19
+ "replica_cad_baked_lighting": "./data/habitat-sim/scene_datasets/replica_cad_baked_lighting/",
20
+ "replica_cad": "./data/habitat-sim/scene_datasets/replica_cad/",
21
+ "replica": "./data/habitat-sim/scene_datasets/ReplicaDataset/",
22
+ "scannet": "./data/habitat-sim/scene_datasets/scannet/"
23
+ }
24
+
25
+ SceneData = collections.namedtuple("SceneData", ["scene_dataset_config_file", "scene", "navmesh", "output_dir"])
26
+
27
+ def list_replicacad_scenes(base_output_dir, base_path=SCENES_DATASET["replica_cad"]):
28
+ scene_dataset_config_file = os.path.join(base_path, "replicaCAD.scene_dataset_config.json")
29
+ scenes = [f"apt_{i}" for i in range(6)] + ["empty_stage"]
30
+ navmeshes = [f"navmeshes/apt_{i}_static_furniture.navmesh" for i in range(6)] + ["empty_stage.navmesh"]
31
+ scenes_data = []
32
+ for idx in range(len(scenes)):
33
+ output_dir = os.path.join(base_output_dir, "ReplicaCAD", scenes[idx])
34
+ # Add scene
35
+ data = SceneData(scene_dataset_config_file=scene_dataset_config_file,
36
+ scene = scenes[idx] + ".scene_instance.json",
37
+ navmesh = os.path.join(base_path, navmeshes[idx]),
38
+ output_dir = output_dir)
39
+ scenes_data.append(data)
40
+ return scenes_data
41
+
42
+ def list_replica_cad_baked_lighting_scenes(base_output_dir, base_path=SCENES_DATASET["replica_cad_baked_lighting"]):
43
+ scene_dataset_config_file = os.path.join(base_path, "replicaCAD_baked.scene_dataset_config.json")
44
+ scenes = sum([[f"Baked_sc{i}_staging_{j:02}" for i in range(5)] for j in range(21)], [])
45
+ # no precomputed navmeshes are available for these scenes (navmesh is left empty below)
46
+ scenes_data = []
47
+ for idx in range(len(scenes)):
48
+ output_dir = os.path.join(base_output_dir, "replica_cad_baked_lighting", scenes[idx])
49
+ data = SceneData(scene_dataset_config_file=scene_dataset_config_file,
50
+ scene = scenes[idx],
51
+ navmesh = "",
52
+ output_dir = output_dir)
53
+ scenes_data.append(data)
54
+ return scenes_data
55
+
56
+ def list_replica_scenes(base_output_dir, base_path):
57
+ scenes_data = []
58
+ for scene_id in os.listdir(base_path):
59
+ scene = os.path.join(base_path, scene_id, "mesh.ply")
60
+ navmesh = os.path.join(base_path, scene_id, "habitat/mesh_preseg_semantic.navmesh") # Not sure if I should use it
61
+ scene_dataset_config_file = ""
62
+ output_dir = os.path.join(base_output_dir, scene_id)
63
+ # Add scene
64
+ data = SceneData(scene_dataset_config_file = scene_dataset_config_file,
65
+ scene = scene,
66
+ navmesh = navmesh,
67
+ output_dir = output_dir)
68
+ scenes_data.append(data)
69
+ return scenes_data
70
+
71
+
72
+ def list_scenes(base_output_dir, base_path):
73
+ """
74
+ Generic method iterating through a base_path folder to find scenes.
75
+ """
76
+ scenes_data = []
77
+ for root, dirs, files in os.walk(base_path, followlinks=True):
78
+ folder_scenes_data = []
79
+ for file in files:
80
+ name, ext = os.path.splitext(file)
81
+ if ext == ".glb":
82
+ scene = os.path.join(root, name + ".glb")
83
+ navmesh = os.path.join(root, name + ".navmesh")
84
+ if not os.path.exists(navmesh):
85
+ navmesh = ""
86
+ relpath = os.path.relpath(root, base_path)
87
+ output_dir = os.path.abspath(os.path.join(base_output_dir, relpath, name))
88
+ data = SceneData(scene_dataset_config_file="",
89
+ scene = scene,
90
+ navmesh = navmesh,
91
+ output_dir = output_dir)
92
+ folder_scenes_data.append(data)
93
+
94
+ # Specific check for HM3D:
95
+ # When two meshes xxxx.basis.glb and xxxx.glb are present, use the 'basis' version.
96
+ basis_scenes = [data.scene[:-len(".basis.glb")] for data in folder_scenes_data if data.scene.endswith(".basis.glb")]
97
+ if len(basis_scenes) != 0:
98
+ folder_scenes_data = [data for data in folder_scenes_data if not (data.scene[:-len(".glb")] in basis_scenes)]
99
+
100
+ scenes_data.extend(folder_scenes_data)
101
+ return scenes_data
102
+
103
+ def list_scenes_available(base_output_dir, scenes_dataset_paths=SCENES_DATASET):
104
+ scenes_data = []
105
+
106
+ # HM3D
107
+ for split in ("minival", "train", "val", "examples"):
108
+ scenes_data += list_scenes(base_output_dir=os.path.join(base_output_dir, f"hm3d/{split}/"),
109
+ base_path=f"{scenes_dataset_paths['hm3d']}/{split}")
110
+
111
+ # Gibson
112
+ scenes_data += list_scenes(base_output_dir=os.path.join(base_output_dir, "gibson"),
113
+ base_path=scenes_dataset_paths["gibson"])
114
+
115
+ # Habitat test scenes (just a few)
116
+ scenes_data += list_scenes(base_output_dir=os.path.join(base_output_dir, "habitat-test-scenes"),
117
+ base_path=scenes_dataset_paths["habitat-test-scenes"])
118
+
119
+ # ReplicaCAD (baked lightning)
120
+ scenes_data += list_replica_cad_baked_lighting_scenes(base_output_dir=base_output_dir)
121
+
122
+ # ScanNet
123
+ scenes_data += list_scenes(base_output_dir=os.path.join(base_output_dir, "scannet"),
124
+ base_path=scenes_dataset_paths["scannet"])
125
+
126
+ # Replica
127
+ scenes_data += list_replica_scenes(base_output_dir=os.path.join(base_output_dir, "replica"),
128
+ base_path=scenes_dataset_paths["replica"])
129
+ return scenes_data
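A short sketch of how these helpers fit together, assuming the hardcoded SCENES_DATASET paths exist (the output directory is a placeholder):

# enumerate every scene reachable through SCENES_DATASET
scenes = list_scenes_available(base_output_dir="./output/")
print(len(scenes), "scenes found")
first = scenes[0]  # a SceneData namedtuple
print(first.scene, "->", first.output_dir)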
dust3r/croco/datasets/pairs_dataset.py ADDED
@@ -0,0 +1,109 @@
1
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
2
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
3
+
4
+ import os
5
+ from torch.utils.data import Dataset
6
+ from PIL import Image
7
+
8
+ from datasets.transforms import get_pair_transforms
9
+
10
+ def load_image(impath):
11
+ return Image.open(impath)
12
+
13
+ def load_pairs_from_cache_file(fname, root=''):
14
+ assert os.path.isfile(fname), "cannot parse pairs from {:s}, file does not exist".format(fname)
15
+ with open(fname, 'r') as fid:
16
+ lines = fid.read().strip().splitlines()
17
+ pairs = [ (os.path.join(root,l.split()[0]), os.path.join(root,l.split()[1])) for l in lines]
18
+ return pairs
19
+
20
+ def load_pairs_from_list_file(fname, root=''):
21
+ assert os.path.isfile(fname), "cannot parse pairs from {:s}, file does not exist".format(fname)
22
+ with open(fname, 'r') as fid:
23
+ lines = fid.read().strip().splitlines()
24
+ pairs = [ (os.path.join(root,l+'_1.jpg'), os.path.join(root,l+'_2.jpg')) for l in lines if not l.startswith('#')]
25
+ return pairs
26
+
27
+
28
+ def write_cache_file(fname, pairs, root=''):
29
+ if len(root)>0:
30
+ if not root.endswith('/'): root+='/'
31
+ assert os.path.isdir(root)
32
+ s = ''
33
+ for im1, im2 in pairs:
34
+ if len(root)>0:
35
+ assert im1.startswith(root), im1
36
+ assert im2.startswith(root), im2
37
+ s += '{:s} {:s}\n'.format(im1[len(root):], im2[len(root):])
38
+ with open(fname, 'w') as fid:
39
+ fid.write(s[:-1])
40
+
41
+ def parse_and_cache_all_pairs(dname, data_dir='./data/'):
42
+ if dname=='habitat_release':
43
+ dirname = os.path.join(data_dir, 'habitat_release')
44
+ assert os.path.isdir(dirname), "cannot find folder for habitat_release pairs: "+dirname
45
+ cache_file = os.path.join(dirname, 'pairs.txt')
46
+ assert not os.path.isfile(cache_file), "cache file already exists: "+cache_file
47
+
48
+ print('Parsing pairs for dataset: '+dname)
49
+ pairs = []
50
+ for root, dirs, files in os.walk(dirname):
51
+ if 'val' in root: continue
52
+ dirs.sort()
53
+ pairs += [ (os.path.join(root,f), os.path.join(root,f[:-len('_1.jpeg')]+'_2.jpeg')) for f in sorted(files) if f.endswith('_1.jpeg')]
54
+ print('Found {:,} pairs'.format(len(pairs)))
55
+ print('Writing cache to: '+cache_file)
56
+ write_cache_file(cache_file, pairs, root=dirname)
57
+
58
+ else:
59
+ raise NotImplementedError('Unknown dataset: '+dname)
60
+
61
+ def dnames_to_image_pairs(dnames, data_dir='./data/'):
62
+ """
63
+ dnames: list of datasets with image pairs, separated by +
64
+ """
65
+ all_pairs = []
66
+ for dname in dnames.split('+'):
67
+ if dname=='habitat_release':
68
+ dirname = os.path.join(data_dir, 'habitat_release')
69
+ assert os.path.isdir(dirname), "cannot find folder for habitat_release pairs: "+dirname
70
+ cache_file = os.path.join(dirname, 'pairs.txt')
71
+ assert os.path.isfile(cache_file), "cannot find cache file for habitat_release pairs, please first create the cache file, see instructions. "+cache_file
72
+ pairs = load_pairs_from_cache_file(cache_file, root=dirname)
73
+ elif dname in ['ARKitScenes', 'MegaDepth', '3DStreetView', 'IndoorVL']:
74
+ dirname = os.path.join(data_dir, dname+'_crops')
75
+ assert os.path.isdir(dirname), "cannot find folder for {:s} pairs: {:s}".format(dname, dirname)
76
+ list_file = os.path.join(dirname, 'listing.txt')
77
+ assert os.path.isfile(list_file), "cannot find list file for {:s} pairs, see instructions. {:s}".format(dname, list_file)
78
+ pairs = load_pairs_from_list_file(list_file, root=dirname)
79
+ print(' {:s}: {:,} pairs'.format(dname, len(pairs)))
80
+ all_pairs += pairs
81
+ if '+' in dnames: print(' Total: {:,} pairs'.format(len(all_pairs)))
82
+ return all_pairs
83
+
84
+
85
+ class PairsDataset(Dataset):
86
+
87
+ def __init__(self, dnames, trfs='', totensor=True, normalize=True, data_dir='./data/'):
88
+ super().__init__()
89
+ self.image_pairs = dnames_to_image_pairs(dnames, data_dir=data_dir)
90
+ self.transforms = get_pair_transforms(transform_str=trfs, totensor=totensor, normalize=normalize)
91
+
92
+ def __len__(self):
93
+ return len(self.image_pairs)
94
+
95
+ def __getitem__(self, index):
96
+ im1path, im2path = self.image_pairs[index]
97
+ im1 = load_image(im1path)
98
+ im2 = load_image(im2path)
99
+ if self.transforms is not None: im1, im2 = self.transforms(im1, im2)
100
+ return im1, im2
101
+
102
+
103
+ if __name__=="__main__":
104
+ import argparse
105
+ parser = argparse.ArgumentParser(prog="Computing and caching list of pairs for a given dataset")
106
+ parser.add_argument('--data_dir', default='./data/', type=str, help="path where data are stored")
107
+ parser.add_argument('--dataset', default='habitat_release', type=str, help="name of the dataset")
108
+ args = parser.parse_args()
109
+ parse_and_cache_all_pairs(dname=args.dataset, data_dir=args.data_dir)
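A hedged usage sketch for the dataset class, assuming the habitat_release pair cache has already been built as described above:

from torch.utils.data import DataLoader

dataset = PairsDataset("habitat_release", trfs="crop224+acolor", data_dir="./data/")
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
im1, im2 = next(iter(loader))  # two aligned batches of shape (32, 3, 224, 224)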
dust3r/croco/datasets/transforms.py ADDED
@@ -0,0 +1,95 @@
1
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
2
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
3
+
4
+ import torch
5
+ import torchvision.transforms
6
+ import torchvision.transforms.functional as F
7
+
8
+ # "Pair": apply a transform on a pair
9
+ # "Both": apply the exact same transform to both images
10
+
11
+ class ComposePair(torchvision.transforms.Compose):
12
+ def __call__(self, img1, img2):
13
+ for t in self.transforms:
14
+ img1, img2 = t(img1, img2)
15
+ return img1, img2
16
+
17
+ class NormalizeBoth(torchvision.transforms.Normalize):
18
+ def forward(self, img1, img2):
19
+ img1 = super().forward(img1)
20
+ img2 = super().forward(img2)
21
+ return img1, img2
22
+
23
+ class ToTensorBoth(torchvision.transforms.ToTensor):
24
+ def __call__(self, img1, img2):
25
+ img1 = super().__call__(img1)
26
+ img2 = super().__call__(img2)
27
+ return img1, img2
28
+
29
+ class RandomCropPair(torchvision.transforms.RandomCrop):
30
+ # the crop will be intentionally different for the two images with this class
31
+ def forward(self, img1, img2):
32
+ img1 = super().forward(img1)
33
+ img2 = super().forward(img2)
34
+ return img1, img2
35
+
36
+ class ColorJitterPair(torchvision.transforms.ColorJitter):
37
+ # can be symmetric (same for both images) or asymmetric (different jitter params for each image) depending on assymetric_prob
38
+ def __init__(self, assymetric_prob, **kwargs):
39
+ super().__init__(**kwargs)
40
+ self.assymetric_prob = assymetric_prob
41
+ def jitter_one(self, img, fn_idx, brightness_factor, contrast_factor, saturation_factor, hue_factor):
42
+ for fn_id in fn_idx:
43
+ if fn_id == 0 and brightness_factor is not None:
44
+ img = F.adjust_brightness(img, brightness_factor)
45
+ elif fn_id == 1 and contrast_factor is not None:
46
+ img = F.adjust_contrast(img, contrast_factor)
47
+ elif fn_id == 2 and saturation_factor is not None:
48
+ img = F.adjust_saturation(img, saturation_factor)
49
+ elif fn_id == 3 and hue_factor is not None:
50
+ img = F.adjust_hue(img, hue_factor)
51
+ return img
52
+
53
+ def forward(self, img1, img2):
54
+
55
+ fn_idx, brightness_factor, contrast_factor, saturation_factor, hue_factor = self.get_params(
56
+ self.brightness, self.contrast, self.saturation, self.hue
57
+ )
58
+ img1 = self.jitter_one(img1, fn_idx, brightness_factor, contrast_factor, saturation_factor, hue_factor)
59
+ if torch.rand(1) < self.assymetric_prob: # asymmetric: re-draw jitter params for the second image
60
+ fn_idx, brightness_factor, contrast_factor, saturation_factor, hue_factor = self.get_params(
61
+ self.brightness, self.contrast, self.saturation, self.hue
62
+ )
63
+ img2 = self.jitter_one(img2, fn_idx, brightness_factor, contrast_factor, saturation_factor, hue_factor)
64
+ return img1, img2
65
+
66
+ def get_pair_transforms(transform_str, totensor=True, normalize=True):
67
+ # transform_str is eg crop224+color
68
+ trfs = []
69
+ for s in transform_str.split('+'):
70
+ if s.startswith('crop'):
71
+ size = int(s[len('crop'):])
72
+ trfs.append(RandomCropPair(size))
73
+ elif s=='acolor':
74
+ trfs.append(ColorJitterPair(assymetric_prob=1.0, brightness=(0.6, 1.4), contrast=(0.6, 1.4), saturation=(0.6, 1.4), hue=0.0))
75
+ elif s=='': # if transform_str was ""
76
+ pass
77
+ else:
78
+ raise NotImplementedError('Unknown augmentation: '+s)
79
+
80
+ if totensor:
81
+ trfs.append( ToTensorBoth() )
82
+ if normalize:
83
+ trfs.append( NormalizeBoth(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) )
84
+
85
+ if len(trfs)==0:
86
+ return None
87
+ elif len(trfs)==1:
88
+ return trfs[0]
89
+ else:
90
+ return ComposePair(trfs)
91
+
92
+
93
+
94
+
95
+
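A minimal sketch of building and applying a pair transform (the blank images are placeholders for a real pair):

from PIL import Image

trf = get_pair_transforms("crop224+acolor")  # random crop + asymmetric color jitter
img1 = Image.new("RGB", (320, 240))
img2 = Image.new("RGB", (320, 240))
t1, t2 = trf(img1, img2)  # two normalized tensors of shape (3, 224, 224)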
dust3r/croco/models/__pycache__/blocks.cpython-312.pyc ADDED
Binary file (19.6 kB).
dust3r/croco/models/__pycache__/croco.cpython-312.pyc ADDED
Binary file (15.2 kB).
dust3r/croco/models/__pycache__/dpt_block.cpython-312.pyc ADDED
Binary file (16.9 kB).
dust3r/croco/models/__pycache__/masking.cpython-312.pyc ADDED
Binary file (1.28 kB).
dust3r/croco/models/__pycache__/pos_embed.cpython-312.pyc ADDED
Binary file (8.31 kB).
dust3r/croco/models/blocks.py ADDED
@@ -0,0 +1,307 @@
1
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
2
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
3
+
4
+
5
+ # --------------------------------------------------------
6
+ # Main encoder/decoder blocks
7
+ # --------------------------------------------------------
8
+ # References:
9
+ # timm
10
+ # https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py
11
+ # https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/helpers.py
12
+ # https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/drop.py
13
+ # https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/mlp.py
14
+ # https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/patch_embed.py
15
+
16
+
17
+ import torch
18
+ import torch.nn as nn
19
+
20
+ from itertools import repeat
21
+ import collections.abc
22
+
23
+
24
+ def _ntuple(n):
25
+ def parse(x):
26
+ if isinstance(x, collections.abc.Iterable) and not isinstance(x, str):
27
+ return x
28
+ return tuple(repeat(x, n))
29
+ return parse
30
+ to_2tuple = _ntuple(2)
31
+
32
+ def drop_path(x, drop_prob: float = 0., training: bool = False, scale_by_keep: bool = True):
33
+ """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
34
+ """
35
+ if drop_prob == 0. or not training:
36
+ return x
37
+ keep_prob = 1 - drop_prob
38
+ shape = (x.shape[0],) + (1,) * (x.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
39
+ random_tensor = x.new_empty(shape).bernoulli_(keep_prob)
40
+ if keep_prob > 0.0 and scale_by_keep:
41
+ random_tensor.div_(keep_prob)
42
+ return x * random_tensor
43
+
44
+ class DropPath(nn.Module):
45
+ """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
46
+ """
47
+ def __init__(self, drop_prob: float = 0., scale_by_keep: bool = True):
48
+ super(DropPath, self).__init__()
49
+ self.drop_prob = drop_prob
50
+ self.scale_by_keep = scale_by_keep
51
+
52
+ def forward(self, x):
53
+ return drop_path(x, self.drop_prob, self.training, self.scale_by_keep)
54
+
55
+ def extra_repr(self):
56
+ return f'drop_prob={round(self.drop_prob,3):0.3f}'
57
+
58
+ class Mlp(nn.Module):
59
+ """ MLP as used in Vision Transformer, MLP-Mixer and related networks"""
60
+ def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, bias=True, drop=0.):
61
+ super().__init__()
62
+ out_features = out_features or in_features
63
+ hidden_features = hidden_features or in_features
64
+ bias = to_2tuple(bias)
65
+ drop_probs = to_2tuple(drop)
66
+
67
+ self.fc1 = nn.Linear(in_features, hidden_features, bias=bias[0])
68
+ self.act = act_layer()
69
+ self.drop1 = nn.Dropout(drop_probs[0])
70
+ self.fc2 = nn.Linear(hidden_features, out_features, bias=bias[1])
71
+ self.drop2 = nn.Dropout(drop_probs[1])
72
+
73
+ def forward(self, x):
74
+ x = self.fc1(x)
75
+ x = self.act(x)
76
+ x = self.drop1(x)
77
+ x = self.fc2(x)
78
+ x = self.drop2(x)
79
+ return x
80
+
81
+ class Attention(nn.Module):
82
+
83
+ def __init__(self, dim, rope=None, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
84
+ super().__init__()
85
+ self.num_heads = num_heads
86
+ head_dim = dim // num_heads
87
+ self.scale = head_dim ** -0.5
88
+ self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
89
+ self.attn_drop = nn.Dropout(attn_drop)
90
+ self.proj = nn.Linear(dim, dim)
91
+ self.proj_drop = nn.Dropout(proj_drop)
92
+ self.rope = rope
93
+
94
+ def forward(self, x, xpos):
95
+ B, N, C = x.shape
96
+
97
+ qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).transpose(1,3)
98
+ q, k, v = [qkv[:,:,i] for i in range(3)]
99
+ # q,k,v = qkv.unbind(2) # make torchscript happy (cannot use tensor as tuple)
100
+
101
+ if self.rope is not None:
102
+ q = self.rope(q, xpos)
103
+ k = self.rope(k, xpos)
104
+
105
+ # attn = (q @ k.transpose(-2, -1)) * self.scale
106
+ # attn = attn.softmax(dim=-1)
107
+ # attn = self.attn_drop(attn)
108
+ # x_old = (attn @ v).transpose(1, 2).reshape(B, N, C)
109
+ # with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
110
+ x = torch.nn.functional.scaled_dot_product_attention(q, k, v, scale = self.scale, dropout_p=0.).transpose(1, 2).reshape(B, N, C)
113
+ x = self.proj(x)
114
+ x = self.proj_drop(x)
115
+ return x
116
+
117
+ class LayerNorm(nn.LayerNorm):
118
+ def forward(self, x):
119
+ t = x.dtype
120
+ x = super().forward(x.type(torch.float32))
121
+ return x.type(t)
122
+
123
+ class Block(nn.Module):
124
+
125
+ def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, drop=0., attn_drop=0.,
126
+ drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, rope=None):
127
+ super().__init__()
128
+ self.norm1 = norm_layer(dim)
129
+ self.attn = Attention(dim, rope=rope, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)
130
+ # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
131
+ self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
132
+ self.norm2 = norm_layer(dim)
133
+ mlp_hidden_dim = int(dim * mlp_ratio)
134
+ self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
135
+
136
+ def forward(self, x, xpos):
137
+ dtype = x.dtype
138
+ x = x + self.drop_path(self.attn(self.norm1(x), xpos))
139
+ x = x + self.drop_path(self.mlp(self.norm2(x)))
140
+ return x
141
+
142
+ class CrossAttention(nn.Module):
143
+
144
+ def __init__(self, dim, rope=None, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
145
+ super().__init__()
146
+ self.num_heads = num_heads
147
+ head_dim = dim // num_heads
148
+ self.scale = head_dim ** -0.5
149
+
150
+ self.projq = nn.Linear(dim, dim, bias=qkv_bias)
151
+ self.projk = nn.Linear(dim, dim, bias=qkv_bias)
152
+ self.projv = nn.Linear(dim, dim, bias=qkv_bias)
153
+ self.attn_drop = nn.Dropout(attn_drop)
154
+ self.proj = nn.Linear(dim, dim)
155
+ self.proj_drop = nn.Dropout(proj_drop)
156
+
157
+ self.rope = rope
158
+
159
+ def forward(self, query, key, value, qpos, kpos):
160
+ B, Nq, C = query.shape
161
+ Nk = key.shape[1]
162
+ Nv = value.shape[1]
163
+
164
+ q = self.projq(query).reshape(B,Nq,self.num_heads, C// self.num_heads).permute(0, 2, 1, 3)
165
+ k = self.projk(key).reshape(B,Nk,self.num_heads, C// self.num_heads).permute(0, 2, 1, 3)
166
+ v = self.projv(value).reshape(B,Nv,self.num_heads, C// self.num_heads).permute(0, 2, 1, 3)
167
+
168
+ if self.rope is not None:
169
+ q = self.rope(q, qpos)
170
+ k = self.rope(k, kpos)
171
+
172
+ # attn = (q @ k.transpose(-2, -1)) * self.scale
173
+ # attn = attn.softmax(dim=-1)
174
+ # attn = self.attn_drop(attn)
175
+
176
+ # x_old = (attn @ v).transpose(1, 2).reshape(B, Nq, C)
177
+ # with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
178
+ x = torch.nn.functional.scaled_dot_product_attention(q, k, v, scale = self.scale, dropout_p=0.).transpose(1, 2).reshape(B, Nq, C)
181
+ x = self.proj(x)
182
+ x = self.proj_drop(x)
183
+ return x
184
+
185
+
186
+ class DecoderBlock_onlyself(nn.Module):
187
+
188
+ def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, drop=0., attn_drop=0.,
189
+ drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, norm_mem=True, rope=None):
190
+ super().__init__()
191
+ self.attn = Attention(dim, rope=rope, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)
192
+ # self.cross_attn = CrossAttention(dim, rope=rope, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)
193
+ self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
194
+ self.norm1 = norm_layer(dim)
195
+ self.norm3 = norm_layer(dim)
196
+ mlp_hidden_dim = int(dim * mlp_ratio)
197
+ self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
198
+ # self.norm_y = norm_layer(dim) if norm_mem else nn.Identity()
199
+
200
+ def forward(self, x, xpos, split=False):
201
+ # if split==False:
202
+ # else:
203
+ # x = x.reshape(-1, y.shape[1], y.shape[2])
204
+ # x = x + self.drop_path(self.attn(self.norm1(x), xpos.reshape(-1, y.shape[1], 2)))
205
+ # x = x.reshape(y.shape[0], -1, y.shape[2])
206
+ x = x + self.drop_path(self.attn(self.norm1(x), xpos))
207
+ # y_ = self.norm_y(y)
208
+ # x = x + self.drop_path(self.cross_attn(self.norm2(x), y_, y_, xpos, ypos))
209
+ x = x + self.drop_path(self.mlp(self.norm3(x)))
210
+ return x
211
+
212
+
213
+
214
+ class DecoderBlock_onlycross(nn.Module):
215
+
216
+ def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, drop=0., attn_drop=0.,
217
+ drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, norm_mem=True, rope=None):
218
+ super().__init__()
219
+ self.cross_attn = CrossAttention(dim, rope=rope, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)
220
+ self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
221
+ self.norm2 = norm_layer(dim)
222
+ self.norm3 = norm_layer(dim)
223
+ mlp_hidden_dim = int(dim * mlp_ratio)
224
+ self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
225
+ self.norm_y = norm_layer(dim) if norm_mem else nn.Identity()
226
+ def forward(self, x, y, xpos, ypos, split=False):
227
+ y_ = self.norm_y(y)
228
+ x = x + self.drop_path(self.cross_attn(self.norm2(x), y_, y_, xpos, ypos))
229
+ x = x + self.drop_path(self.mlp(self.norm3(x)))
230
+ return x, y
231
+
232
+ class DecoderBlock(nn.Module):
233
+
234
+ def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, drop=0., attn_drop=0.,
235
+ drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, norm_mem=True, rope=None):
236
+ super().__init__()
237
+ self.norm1 = norm_layer(dim)
238
+ self.attn = Attention(dim, rope=rope, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)
239
+ self.cross_attn = CrossAttention(dim, rope=rope, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)
240
+ self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
241
+ self.norm2 = norm_layer(dim)
242
+ self.norm3 = norm_layer(dim)
243
+ mlp_hidden_dim = int(dim * mlp_ratio)
244
+ self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
245
+ self.norm_y = norm_layer(dim) if norm_mem else nn.Identity()
246
+
247
+ def forward(self, x, y, xpos, ypos, split=False):
248
+ # if split==False:
249
+ x = x + self.drop_path(self.attn(self.norm1(x), xpos))
250
+ # else:
251
+ # x = x.reshape(-1, y.shape[1], y.shape[2])
252
+ # x = x + self.drop_path(self.attn(self.norm1(x), xpos.reshape(-1, y.shape[1], 2)))
253
+ # x = x.reshape(y.shape[0], -1, y.shape[2])
254
+ y_ = self.norm_y(y)
255
+ x = x + self.drop_path(self.cross_attn(self.norm2(x), y_, y_, xpos, ypos))
256
+ x = x + self.drop_path(self.mlp(self.norm3(x)))
257
+ return x, y
258
+
259
+
260
+ # patch embedding
261
+ class PositionGetter(object):
262
+ """ return positions of patches """
263
+
264
+ def __init__(self):
265
+ self.cache_positions = {}
266
+
267
+ def __call__(self, b, h, w, device):
268
+ if not (h,w) in self.cache_positions:
269
+ x = torch.arange(w, device=device)
270
+ y = torch.arange(h, device=device)
271
+ self.cache_positions[h,w] = torch.cartesian_prod(y, x) # (h*w, 2)
272
+ pos = self.cache_positions[h,w].view(1, h*w, 2).expand(b, -1, 2).clone()
273
+ return pos
274
+
275
+ class PatchEmbed(nn.Module):
276
+ """ just adding _init_weights + position getter compared to timm.models.layers.patch_embed.PatchEmbed"""
277
+
278
+ def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True):
279
+ super().__init__()
280
+ img_size = to_2tuple(img_size)
281
+ patch_size = to_2tuple(patch_size)
282
+ self.img_size = img_size
283
+ self.patch_size = patch_size
284
+ self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
285
+ self.num_patches = self.grid_size[0] * self.grid_size[1]
286
+ self.flatten = flatten
287
+
288
+ self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
289
+ self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
290
+
291
+ self.position_getter = PositionGetter()
292
+
293
+ def forward(self, x):
294
+ B, C, H, W = x.shape
295
+ torch._assert(H == self.img_size[0], f"Input image height ({H}) doesn't match model ({self.img_size[0]}).")
296
+ torch._assert(W == self.img_size[1], f"Input image width ({W}) doesn't match model ({self.img_size[1]}).")
297
+ x = self.proj(x)
298
+ pos = self.position_getter(B, x.size(2), x.size(3), x.device)
299
+ if self.flatten:
300
+ x = x.flatten(2).transpose(1, 2) # BCHW -> BNC
301
+ x = self.norm(x)
302
+ return x, pos
303
+
304
+ def _init_weights(self):
305
+ w = self.proj.weight.data
306
+ torch.nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
307
+
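A shape sketch for the building blocks above; note the Attention module relies on the scale= argument of scaled_dot_product_attention, so PyTorch >= 2.1 is assumed:

import torch

embed = PatchEmbed(img_size=224, patch_size=16, embed_dim=768)
x, pos = embed(torch.randn(2, 3, 224, 224))  # x: (2, 196, 768), pos: (2, 196, 2)
block = Block(dim=768, num_heads=12)
y = block(x, pos)  # output keeps the input shape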
dust3r/croco/models/criterion.py ADDED
@@ -0,0 +1,37 @@
1
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
2
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
3
+ #
4
+ # --------------------------------------------------------
5
+ # Criterion to train CroCo
6
+ # --------------------------------------------------------
7
+ # References:
8
+ # MAE: https://github.com/facebookresearch/mae
9
+ # --------------------------------------------------------
10
+
11
+ import torch
12
+
13
+ class MaskedMSE(torch.nn.Module):
14
+
15
+ def __init__(self, norm_pix_loss=False, masked=True):
16
+ """
17
+ norm_pix_loss: normalize each patch by their pixel mean and variance
18
+ masked: compute loss over the masked patches only
19
+ """
20
+ super().__init__()
21
+ self.norm_pix_loss = norm_pix_loss
22
+ self.masked = masked
23
+
24
+ def forward(self, pred, mask, target):
25
+
26
+ if self.norm_pix_loss:
27
+ mean = target.mean(dim=-1, keepdim=True)
28
+ var = target.var(dim=-1, keepdim=True)
29
+ target = (target - mean) / (var + 1.e-6)**.5
30
+
31
+ loss = (pred - target) ** 2
32
+ loss = loss.mean(dim=-1) # [N, L], mean loss per patch
33
+ if self.masked:
34
+ loss = (loss * mask).sum() / mask.sum() # mean loss on masked patches
35
+ else:
36
+ loss = loss.mean() # mean loss
37
+ return loss
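A sketch of the expected shapes, for orientation (values are illustrative: 196 patches of 16*16*3 pixels):

import torch

criterion = MaskedMSE(norm_pix_loss=True)
pred = torch.randn(4, 196, 768)
target = torch.randn(4, 196, 768)
mask = torch.rand(4, 196) > 0.1  # True where a patch was masked
loss = criterion(pred, mask, target)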
dust3r/croco/models/croco.py ADDED
@@ -0,0 +1,288 @@
1
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
2
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
3
+
4
+
5
+ # --------------------------------------------------------
6
+ # CroCo model during pretraining
7
+ # --------------------------------------------------------
8
+
9
+
10
+
11
+ import torch
12
+ import torch.nn as nn
13
+ torch.backends.cuda.matmul.allow_tf32 = True # for gpu >= Ampere and pytorch >= 1.12
14
+ from functools import partial
15
+ from models.blocks import Block, DecoderBlock, PatchEmbed, DecoderBlock_onlyself, DecoderBlock_onlycross
16
+ from models.pos_embed import get_2d_sincos_pos_embed, RoPE2D
17
+ from models.masking import RandomMask
18
+ from mast3r.modules import AttnBlock
19
+
20
+ class CroCoNet(nn.Module):
21
+
22
+ def __init__(self,
23
+ img_size=224, # input image size
24
+ patch_size=16, # patch_size
25
+ mask_ratio=0.9, # ratios of masked tokens
26
+ enc_embed_dim=768, # encoder feature dimension
27
+ enc_depth=12, # encoder depth
28
+ enc_num_heads=12, # encoder number of heads in the transformer block
29
+ dec_embed_dim=512, # decoder feature dimension
30
+ dec_depth=8, # decoder depth
31
+ dec_num_heads=16, # decoder number of heads in the transformer block
32
+ mlp_ratio=4,
33
+ norm_layer=partial(nn.LayerNorm, eps=1e-6),
34
+ norm_im2_in_dec=True, # whether to apply normalization of the 'memory' = (second image) in the decoder
35
+ pos_embed='cosine', # positional embedding (either cosine or RoPE100)
36
+ ):
37
+
38
+ super(CroCoNet, self).__init__()
39
+
40
+ # patch embeddings (with initialization done as in MAE)
41
+ self._set_patch_embed(img_size, patch_size, enc_embed_dim)
42
+
43
+ # mask generations
44
+ self._set_mask_generator(self.patch_embed.num_patches, mask_ratio)
45
+
46
+ self.pos_embed = pos_embed
47
+ if pos_embed=='cosine':
48
+ # positional embedding of the encoder
49
+ enc_pos_embed = get_2d_sincos_pos_embed(enc_embed_dim, int(self.patch_embed.num_patches**.5), n_cls_token=0)
50
+ self.register_buffer('enc_pos_embed', torch.from_numpy(enc_pos_embed).float())
51
+ # positional embedding of the decoder
52
+ dec_pos_embed = get_2d_sincos_pos_embed(dec_embed_dim, int(self.patch_embed.num_patches**.5), n_cls_token=0)
53
+ self.register_buffer('dec_pos_embed', torch.from_numpy(dec_pos_embed).float())
54
+ # pos embedding in each block
55
+ self.rope = None # nothing for cosine
56
+ elif pos_embed.startswith('RoPE'): # eg RoPE100
57
+ self.enc_pos_embed = None # nothing to add in the encoder with RoPE
58
+ self.dec_pos_embed = None # nothing to add in the decoder with RoPE
59
+ if RoPE2D is None: raise ImportError("Cannot find cuRoPE2D, please install it following the README instructions")
60
+ freq = float(pos_embed[len('RoPE'):])
61
+ self.rope = RoPE2D(freq=freq)
62
+ else:
63
+ raise NotImplementedError('Unknown pos_embed '+pos_embed)
64
+
65
+ # transformer for the encoder
66
+ self.enc_depth = enc_depth
67
+ self.enc_embed_dim = enc_embed_dim
68
+ self.enc_blocks = nn.ModuleList([
69
+ Block(enc_embed_dim, enc_num_heads, mlp_ratio, qkv_bias=True, norm_layer=norm_layer, rope=self.rope)
70
+ for i in range(enc_depth)])
71
+ self.enc_norm = norm_layer(enc_embed_dim)
72
+ self.enc_blocks_stage2 = nn.ModuleList([
73
+ Block(enc_embed_dim, enc_num_heads, mlp_ratio, qkv_bias=True, norm_layer=norm_layer, rope=self.rope)
74
+ for i in range(enc_depth//6)])
75
+ self.enc_norm_stage2 = norm_layer(enc_embed_dim)
76
+ # masked tokens
77
+ self._set_mask_token(dec_embed_dim)
78
+
79
+ # decoder
80
+ self._set_decoder(enc_embed_dim, dec_embed_dim, dec_num_heads, dec_depth, mlp_ratio, norm_layer, norm_im2_in_dec)
81
+
82
+ # prediction head
83
+ self._set_prediction_head(dec_embed_dim, patch_size)
84
+
85
+ # initialize weights
86
+ self.initialize_weights()
87
+
88
+ def _set_patch_embed(self, img_size=224, patch_size=16, enc_embed_dim=768):
89
+ self.patch_embed = PatchEmbed(img_size, patch_size, 3, enc_embed_dim)
90
+
91
+
92
+ def _set_mask_generator(self, num_patches, mask_ratio):
93
+ self.mask_generator = RandomMask(num_patches, mask_ratio)
94
+
95
+ def _set_mask_token(self, dec_embed_dim):
96
+ self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_embed_dim))
97
+
98
+ def _set_decoder(self, enc_embed_dim, dec_embed_dim, dec_num_heads, dec_depth, mlp_ratio, norm_layer, norm_im2_in_dec):
99
+ self.dec_depth = dec_depth
100
+ self.dec_embed_dim = dec_embed_dim
101
+ # transfer from encoder to decoder
102
+ self.decoder_embed = nn.Linear(enc_embed_dim, dec_embed_dim, bias=True)
103
+ # transformer for the decoder
104
+ self.dec_blocks = nn.ModuleList([
105
+ DecoderBlock(dec_embed_dim, dec_num_heads, mlp_ratio=mlp_ratio, qkv_bias=True, norm_layer=norm_layer, norm_mem=norm_im2_in_dec, rope=self.rope)
106
+ for i in range(dec_depth)])
107
+ self.dec_blocks_fine = nn.ModuleList([
108
+ DecoderBlock_onlyself(dec_embed_dim, dec_num_heads, mlp_ratio=mlp_ratio, qkv_bias=True, norm_layer=norm_layer, norm_mem=norm_im2_in_dec, rope=self.rope)
109
+ for i in range(dec_depth)])
110
+ self.dec_blocks_point_cross = nn.ModuleList([
111
+ DecoderBlock_onlycross(dec_embed_dim, dec_num_heads, mlp_ratio=mlp_ratio, qkv_bias=True, norm_layer=norm_layer, norm_mem=norm_im2_in_dec, rope=self.rope)
112
+ for i in range(dec_depth)])
113
+ # final norm layer
114
+ self.cam_cond_encoder = nn.ModuleList([AttnBlock(dec_embed_dim, dec_num_heads, mlp_ratio=mlp_ratio, attn_class=nn.MultiheadAttention)
115
+ for _ in range(dec_depth)])
116
+ self.pose_token_ref = nn.Parameter(torch.randn(1, 1, dec_embed_dim))
117
+ self.pose_token_source = nn.Parameter(torch.randn(1, 1, dec_embed_dim))
118
+
119
+ self.cam_cond_embed = nn.ModuleList([nn.Linear(dec_embed_dim, dec_embed_dim, bias=False) for i in range(dec_depth)])
120
+ self.dec_norm = norm_layer(dec_embed_dim)
121
+ self.dec_cam_norm = norm_layer(dec_embed_dim)
122
+
123
+ def _set_prediction_head(self, dec_embed_dim, patch_size):
124
+ self.prediction_head = nn.Linear(dec_embed_dim, patch_size**2 * 3, bias=True)
125
+
126
+
127
+ def initialize_weights(self):
128
+ # patch embed
129
+ self.patch_embed._init_weights()
130
+ # mask tokens
131
+ if self.mask_token is not None: torch.nn.init.normal_(self.mask_token, std=.02)
132
+ # linears and layer norms
133
+ self.apply(self._init_weights)
134
+
135
+ def _init_weights(self, m):
136
+ if isinstance(m, nn.Linear):
137
+ # we use xavier_uniform following official JAX ViT:
138
+ torch.nn.init.xavier_uniform_(m.weight)
139
+ if isinstance(m, nn.Linear) and m.bias is not None:
140
+ nn.init.constant_(m.bias, 0)
141
+ elif isinstance(m, nn.LayerNorm):
142
+ if m.elementwise_affine == True:
143
+ nn.init.constant_(m.bias, 0)
144
+ nn.init.constant_(m.weight, 1.0)
145
+
146
+ def _encode_image_fine(self, image, shapes, dtype=torch.float32):
147
+ """
148
+ image has B x 3 x img_size x img_size
149
+ shapes: per-view patch-grid shapes passed to the fine patch embedding
152
+ """
153
+ # embed the image into patches (x has size B x Npatches x C)
154
+ # and get the position of each returned patch (pos has size B x Npatches x 2)
155
+ x, pos = self.patch_embed_fine(image, shapes)
156
+ x = x.to(dtype)
157
+ # add positional embedding without cls token
158
+ B,N,C = x.size()
159
+ posvis = pos
160
+ # now apply the transformer encoder and normalization
161
+ for blk in self.enc_fine_blocks:
162
+ x = blk(x, posvis)
163
+ x = self.enc_fine_norm(x)
164
+ x, pos = self.patch_embed_fine2(x)
165
+ x = self.enc_fine_norm2(x)
166
+ return x, pos, None
167
+
168
+ def _encode_image(self, image, do_mask=False, return_all_blocks=False):
169
+ """
170
+ image has B x 3 x img_size x img_size
171
+ do_mask: whether to perform masking or not
172
+ return_all_blocks: if True, return the features at the end of every block
173
+ instead of just the features from the last block (eg for some prediction heads)
174
+ """
175
+ # embed the image into patches (x has size B x Npatches x C)
176
+ # and get the position of each returned patch (pos has size B x Npatches x 2)
177
+ x, pos = self.patch_embed(image)
178
+ # add positional embedding without cls token
179
+ if self.enc_pos_embed is not None:
180
+ x = x + self.enc_pos_embed[None,...]
181
+ # apply masking
182
+ B,N,C = x.size()
183
+ if do_mask:
184
+ masks = self.mask_generator(x)
185
+ x = x[~masks].view(B, -1, C)
186
+ posvis = pos[~masks].view(B, -1, 2)
187
+ else:
188
+ B,N,C = x.size()
189
+ masks = torch.zeros((B,N), dtype=bool)
190
+ posvis = pos
191
+ # now apply the transformer encoder and normalization
192
+ if return_all_blocks:
193
+ out = []
194
+ for blk in self.enc_blocks:
195
+ x = blk(x, posvis)
196
+ out.append(x)
197
+ out[-1] = self.enc_norm(out[-1])
198
+ return out, pos, masks
199
+ else:
200
+ for blk in self.enc_blocks:
201
+ x = blk(x, posvis)
202
+ x = self.enc_norm(x)
203
+ return x, pos, masks
204
+
205
+ def _decoder(self, feat1, pos1, masks1, feat2, pos2, return_all_blocks=False):
206
+ """
207
+ return_all_blocks: if True, return the features at the end of every block
208
+ instead of just the features from the last block (eg for some prediction heads)
209
+
210
+ masks1 can be None => assume image1 fully visible
211
+ """
212
+ # encoder to decoder layer
213
+ visf1 = self.decoder_embed(feat1)
214
+ f2 = self.decoder_embed(feat2)
215
+ # append masked tokens to the sequence
216
+ B,Nenc,C = visf1.size()
217
+ if masks1 is None: # downstreams
218
+ f1_ = visf1
219
+ else: # pretraining
220
+ Ntotal = masks1.size(1)
221
+ f1_ = self.mask_token.repeat(B, Ntotal, 1).to(dtype=visf1.dtype)
222
+ f1_[~masks1] = visf1.view(B * Nenc, C)
223
+ # add positional embedding
224
+ if self.dec_pos_embed is not None:
225
+ f1_ = f1_ + self.dec_pos_embed
226
+ f2 = f2 + self.dec_pos_embed
227
+ # apply Transformer blocks
228
+ out = f1_
229
+ out2 = f2
230
+ if return_all_blocks:
231
+ _out, out = out, []
232
+ for blk in self.dec_blocks:
233
+ _out, out2 = blk(_out, out2, pos1, pos2)
234
+ out.append(_out)
235
+ out[-1] = self.dec_norm(out[-1])
236
+ else:
237
+ for blk in self.dec_blocks:
238
+ out, out2 = blk(out, out2, pos1, pos2)
239
+ out = self.dec_norm(out)
240
+ return out
241
+
242
+ def patchify(self, imgs):
243
+ """
244
+ imgs: (B, 3, H, W)
245
+ x: (B, L, patch_size**2 *3)
246
+ """
247
+ p = self.patch_embed.patch_size[0]
248
+ assert imgs.shape[2] == imgs.shape[3] and imgs.shape[2] % p == 0
249
+
250
+ h = w = imgs.shape[2] // p
251
+ x = imgs.reshape(shape=(imgs.shape[0], 3, h, p, w, p))
252
+ x = torch.einsum('nchpwq->nhwpqc', x)
253
+ x = x.reshape(shape=(imgs.shape[0], h * w, p**2 * 3))
254
+
255
+ return x
256
+
257
+ def unpatchify(self, x, channels=3):
258
+ """
259
+ x: (N, L, patch_size**2 *channels)
260
+ imgs: (N, 3, H, W)
261
+ """
262
+ patch_size = self.patch_embed.patch_size[0]
263
+ h = w = int(x.shape[1]**.5)
264
+ assert h * w == x.shape[1]
265
+ x = x.reshape(shape=(x.shape[0], h, w, patch_size, patch_size, channels))
266
+ x = torch.einsum('nhwpqc->nchpwq', x)
267
+ imgs = x.reshape(shape=(x.shape[0], channels, h * patch_size, w * patch_size))
268
+ return imgs
269
+
270
+ def forward(self, img1, img2):
271
+ """
272
+ img1: tensor of size B x 3 x img_size x img_size
273
+ img2: tensor of size B x 3 x img_size x img_size
274
+
275
+ out will be B x N x (3*patch_size*patch_size)
276
+ masks are also returned as B x N just in case
277
+ """
278
+ # encoder of the masked first image
279
+ feat1, pos1, mask1 = self._encode_image(img1, do_mask=True)
280
+ # encoder of the second image
281
+ feat2, pos2, _ = self._encode_image(img2, do_mask=False)
282
+ # decoder
283
+ decfeat = self._decoder(feat1, pos1, mask1, feat2, pos2)
284
+ # prediction head
285
+ out = self.prediction_head(decfeat)
286
+ # get target
287
+ target = self.patchify(img1)
288
+ return out, mask1, target
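For reference, a minimal standalone sketch of the patchify/unpatchify round trip implemented by the two methods above (the patch size and image size below are arbitrary illustrative choices, not values mandated by the model):

```python
# Standalone re-implementation of the patchify/unpatchify pair for illustration.
import torch

def patchify(imgs, p):
    # (B, C, H, W) -> (B, L, p*p*C), assuming H == W and H % p == 0
    B, C, H, W = imgs.shape
    assert H == W and H % p == 0
    h = w = H // p
    x = imgs.reshape(B, C, h, p, w, p)
    x = torch.einsum('nchpwq->nhwpqc', x)
    return x.reshape(B, h * w, p * p * C)

def unpatchify(x, p, channels=3):
    # (B, L, p*p*channels) -> (B, channels, H, W), assuming L is a square number
    B, L, _ = x.shape
    h = w = int(L ** 0.5)
    assert h * w == L
    x = x.reshape(B, h, w, p, p, channels)
    x = torch.einsum('nhwpqc->nchpwq', x)
    return x.reshape(B, channels, h * p, w * p)

imgs = torch.randn(2, 3, 224, 224)
assert torch.equal(unpatchify(patchify(imgs, 16), 16), imgs)  # lossless round trip
```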
dust3r/croco/models/dpt_block.py ADDED
@@ -0,0 +1,450 @@
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+ # --------------------------------------------------------
+ # DPT head for ViTs
+ # --------------------------------------------------------
+ # References:
+ # https://github.com/isl-org/DPT
+ # https://github.com/EPFL-VILAB/MultiMAE/blob/main/multimae/output_adapters.py
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from einops import rearrange, repeat
+ from typing import Union, Tuple, Iterable, List, Optional, Dict
+
+ def pair(t):
+     return t if isinstance(t, tuple) else (t, t)
+
+ def make_scratch(in_shape, out_shape, groups=1, expand=False):
+     scratch = nn.Module()
+
+     out_shape1 = out_shape
+     out_shape2 = out_shape
+     out_shape3 = out_shape
+     out_shape4 = out_shape
+     if expand:
+         out_shape1 = out_shape
+         out_shape2 = out_shape * 2
+         out_shape3 = out_shape * 4
+         out_shape4 = out_shape * 8
+
+     scratch.layer1_rn = nn.Conv2d(
+         in_shape[0],
+         out_shape1,
+         kernel_size=3,
+         stride=1,
+         padding=1,
+         bias=False,
+         groups=groups,
+     )
+     scratch.layer2_rn = nn.Conv2d(
+         in_shape[1],
+         out_shape2,
+         kernel_size=3,
+         stride=1,
+         padding=1,
+         bias=False,
+         groups=groups,
+     )
+     scratch.layer3_rn = nn.Conv2d(
+         in_shape[2],
+         out_shape3,
+         kernel_size=3,
+         stride=1,
+         padding=1,
+         bias=False,
+         groups=groups,
+     )
+     scratch.layer4_rn = nn.Conv2d(
+         in_shape[3],
+         out_shape4,
+         kernel_size=3,
+         stride=1,
+         padding=1,
+         bias=False,
+         groups=groups,
+     )
+
+     scratch.layer_rn = nn.ModuleList([
+         scratch.layer1_rn,
+         scratch.layer2_rn,
+         scratch.layer3_rn,
+         scratch.layer4_rn,
+     ])
+
+     return scratch
+
+ class ResidualConvUnit_custom(nn.Module):
+     """Residual convolution module."""
+
+     def __init__(self, features, activation, bn):
+         """Init.
+         Args:
+             features (int): number of features
+         """
+         super().__init__()
+
+         self.bn = bn
+
+         self.groups = 1
+
+         self.conv1 = nn.Conv2d(
+             features,
+             features,
+             kernel_size=3,
+             stride=1,
+             padding=1,
+             bias=not self.bn,
+             groups=self.groups,
+         )
+
+         self.conv2 = nn.Conv2d(
+             features,
+             features,
+             kernel_size=3,
+             stride=1,
+             padding=1,
+             bias=not self.bn,
+             groups=self.groups,
+         )
+
+         if self.bn:
+             self.bn1 = nn.BatchNorm2d(features)
+             self.bn2 = nn.BatchNorm2d(features)
+
+         self.activation = activation
+
+         self.skip_add = nn.quantized.FloatFunctional()
+
+     def forward(self, x):
+         """Forward pass.
+         Args:
+             x (tensor): input
+         Returns:
+             tensor: output
+         """
+
+         out = self.activation(x)
+         out = self.conv1(out)
+         if self.bn:
+             out = self.bn1(out)
+
+         out = self.activation(out)
+         out = self.conv2(out)
+         if self.bn:
+             out = self.bn2(out)
+
+         # self.groups is fixed to 1 above, so this branch is never taken here
+         if self.groups > 1:
+             out = self.conv_merge(out)
+
+         return self.skip_add.add(out, x)
+
+ class FeatureFusionBlock_custom(nn.Module):
+     """Feature fusion block."""
+
+     def __init__(
+         self,
+         features,
+         activation,
+         deconv=False,
+         bn=False,
+         expand=False,
+         align_corners=True,
+         width_ratio=1,
+     ):
+         """Init.
+         Args:
+             features (int): number of features
+         """
+         super(FeatureFusionBlock_custom, self).__init__()
+         self.width_ratio = width_ratio
+
+         self.deconv = deconv
+         self.align_corners = align_corners
+
+         self.groups = 1
+
+         self.expand = expand
+         out_features = features
+         if self.expand:
+             out_features = features // 2
+
+         self.out_conv = nn.Conv2d(
+             features,
+             out_features,
+             kernel_size=1,
+             stride=1,
+             padding=0,
+             bias=True,
+             groups=1,
+         )
+
+         self.resConfUnit1 = ResidualConvUnit_custom(features, activation, bn)
+         self.resConfUnit2 = ResidualConvUnit_custom(features, activation, bn)
+
+         self.skip_add = nn.quantized.FloatFunctional()
+
+     def forward(self, *xs):
+         """Forward pass.
+         Returns:
+             tensor: output
+         """
+         output = xs[0]
+
+         if len(xs) == 2:
+             res = self.resConfUnit1(xs[1])
+             if self.width_ratio != 1:
+                 res = F.interpolate(res, size=(output.shape[2], output.shape[3]), mode='bilinear')
+
+             output = self.skip_add.add(output, res)
+
+         output = self.resConfUnit2(output)
+
+         if self.width_ratio != 1:
+             if (output.shape[3] / output.shape[2]) < (2 / 3) * self.width_ratio:
+                 shape = 3 * output.shape[3]
+             else:
+                 shape = int(self.width_ratio * 2 * output.shape[2])
+             output = F.interpolate(output, size=(2 * output.shape[2], shape), mode='bilinear')
+         else:
+             output = F.interpolate(output, scale_factor=2,
+                                    mode="bilinear", align_corners=self.align_corners)
+         output = self.out_conv(output)
+         return output
+
+ def make_fusion_block(features, use_bn, width_ratio=1):
+     return FeatureFusionBlock_custom(
+         features,
+         nn.ReLU(False),
+         deconv=False,
+         bn=use_bn,
+         expand=False,
+         align_corners=True,
+         width_ratio=width_ratio,
+     )
+
+ class Interpolate(nn.Module):
+     """Interpolation module."""
+
+     def __init__(self, scale_factor, mode, align_corners=False):
+         """Init.
+         Args:
+             scale_factor (float): scaling
+             mode (str): interpolation mode
+         """
+         super(Interpolate, self).__init__()
+
+         self.interp = nn.functional.interpolate
+         self.scale_factor = scale_factor
+         self.mode = mode
+         self.align_corners = align_corners
+
+     def forward(self, x):
+         """Forward pass.
+         Args:
+             x (tensor): input
+         Returns:
+             tensor: interpolated data
+         """
+         dtype = x.dtype
+         x = self.interp(
+             x.float(),
+             scale_factor=self.scale_factor,
+             mode=self.mode,
+             align_corners=self.align_corners,
+         )
+         x = x.to(dtype)
+         return x
+
+ class DPTOutputAdapter(nn.Module):
+     """DPT output adapter.
+
+     :param num_channels: Number of output channels
+     :param stride_level: Stride level compared to the full-sized image.
+         E.g. 4 for 1/4th the size of the image.
+     :param patch_size_full: Int or tuple of the patch size over the full image size.
+         Patch size for smaller inputs will be computed accordingly.
+     :param hooks: Index of intermediate layers
+     :param layer_dims: Dimension of intermediate layers
+     :param feature_dim: Feature dimension
+     :param last_dim: out_channels/in_channels for the last two Conv2d when head_type == regression
+     :param use_bn: If set to True, activates batch norm
+     :param dim_tokens_enc: Dimension of tokens coming from encoder
+     """
+
+     def __init__(self,
+                  num_channels: int = 1,
+                  stride_level: int = 1,
+                  patch_size: Union[int, Tuple[int, int]] = 16,
+                  main_tasks: Iterable[str] = ('rgb',),
+                  hooks: List[int] = [2, 5, 8, 11],
+                  layer_dims: List[int] = [96, 192, 384, 768],
+                  feature_dim: int = 256,
+                  last_dim: int = 32,
+                  use_bn: bool = False,
+                  dim_tokens_enc: Optional[int] = None,
+                  head_type: str = 'regression',
+                  output_width_ratio=1,
+                  **kwargs):
+         super().__init__()
+         self.num_channels = num_channels
+         self.stride_level = stride_level
+         self.patch_size = pair(patch_size)
+         self.main_tasks = main_tasks
+         self.hooks = hooks
+         self.layer_dims = layer_dims
+         self.feature_dim = feature_dim
+         self.dim_tokens_enc = dim_tokens_enc * len(self.main_tasks) if dim_tokens_enc is not None else None
+         self.head_type = head_type
+
+         # Actual patch height and width, taking into account stride of input
+         self.P_H = max(1, self.patch_size[0] // stride_level)
+         self.P_W = max(1, self.patch_size[1] // stride_level)
+
+         self.scratch = make_scratch(layer_dims, feature_dim, groups=1, expand=False)
+
+         self.scratch.refinenet1 = make_fusion_block(feature_dim, use_bn, output_width_ratio)
+         self.scratch.refinenet2 = make_fusion_block(feature_dim, use_bn, output_width_ratio)
+         self.scratch.refinenet3 = make_fusion_block(feature_dim, use_bn, output_width_ratio)
+         self.scratch.refinenet4 = make_fusion_block(feature_dim, use_bn, output_width_ratio)
+
+         if self.head_type == 'regression':
+             # The "DPTDepthModel" head
+             self.head = nn.Sequential(
+                 nn.Conv2d(feature_dim, feature_dim // 2, kernel_size=3, stride=1, padding=1),
+                 Interpolate(scale_factor=2, mode="bilinear", align_corners=True),
+                 nn.Conv2d(feature_dim // 2, last_dim, kernel_size=3, stride=1, padding=1),
+                 nn.ReLU(True),
+                 nn.Conv2d(last_dim, self.num_channels, kernel_size=1, stride=1, padding=0)
+             )
+         elif self.head_type == 'semseg':
+             # The "DPTSegmentationModel" head
+             self.head = nn.Sequential(
+                 nn.Conv2d(feature_dim, feature_dim, kernel_size=3, padding=1, bias=False),
+                 nn.BatchNorm2d(feature_dim) if use_bn else nn.Identity(),
+                 nn.ReLU(True),
+                 nn.Dropout(0.1, False),
+                 nn.Conv2d(feature_dim, self.num_channels, kernel_size=1),
+                 Interpolate(scale_factor=2, mode="bilinear", align_corners=True),
+             )
+         else:
+             raise ValueError('DPT head_type must be "regression" or "semseg".')
+
+         if self.dim_tokens_enc is not None:
+             self.init(dim_tokens_enc=dim_tokens_enc)
+
+     def init(self, dim_tokens_enc=768):
+         """
+         Initialize parts of decoder that are dependent on dimension of encoder tokens.
+         Should be called when setting up MultiMAE.
+
+         :param dim_tokens_enc: Dimension of tokens coming from encoder
+         """
+         # Set up activation postprocessing layers
+         if isinstance(dim_tokens_enc, int):
+             dim_tokens_enc = 4 * [dim_tokens_enc]
+
+         self.dim_tokens_enc = [dt * len(self.main_tasks) for dt in dim_tokens_enc]
+
+         self.act_1_postprocess = nn.Sequential(
+             nn.Conv2d(
+                 in_channels=self.dim_tokens_enc[0],
+                 out_channels=self.layer_dims[0],
+                 kernel_size=1, stride=1, padding=0,
+             ),
+             nn.ConvTranspose2d(
+                 in_channels=self.layer_dims[0],
+                 out_channels=self.layer_dims[0],
+                 kernel_size=4, stride=4, padding=0,
+                 bias=True, dilation=1, groups=1,
+             )
+         )
+
+         self.act_2_postprocess = nn.Sequential(
+             nn.Conv2d(
+                 in_channels=self.dim_tokens_enc[1],
+                 out_channels=self.layer_dims[1],
+                 kernel_size=1, stride=1, padding=0,
+             ),
+             nn.ConvTranspose2d(
+                 in_channels=self.layer_dims[1],
+                 out_channels=self.layer_dims[1],
+                 kernel_size=2, stride=2, padding=0,
+                 bias=True, dilation=1, groups=1,
+             )
+         )
+
+         self.act_3_postprocess = nn.Sequential(
+             nn.Conv2d(
+                 in_channels=self.dim_tokens_enc[2],
+                 out_channels=self.layer_dims[2],
+                 kernel_size=1, stride=1, padding=0,
+             )
+         )
+
+         self.act_4_postprocess = nn.Sequential(
+             nn.Conv2d(
+                 in_channels=self.dim_tokens_enc[3],
+                 out_channels=self.layer_dims[3],
+                 kernel_size=1, stride=1, padding=0,
+             ),
+             nn.Conv2d(
+                 in_channels=self.layer_dims[3],
+                 out_channels=self.layer_dims[3],
+                 kernel_size=3, stride=2, padding=1,
+             )
+         )
+
+         self.act_postprocess = nn.ModuleList([
+             self.act_1_postprocess,
+             self.act_2_postprocess,
+             self.act_3_postprocess,
+             self.act_4_postprocess
+         ])
+
+     def adapt_tokens(self, encoder_tokens):
+         # Adapt tokens
+         x = []
+         x.append(encoder_tokens[:, :])
+         x = torch.cat(x, dim=-1)
+         return x
+
+     def forward(self, encoder_tokens: List[torch.Tensor], image_size):
+         assert self.dim_tokens_enc is not None, 'Need to call init(dim_tokens_enc) function first'
+         H, W = image_size
+
+         # Number of patches in height and width
+         N_H = H // (self.stride_level * self.P_H)
+         N_W = W // (self.stride_level * self.P_W)
+
+         # Hook decoder onto 4 layers from specified ViT layers
+         layers = [encoder_tokens[hook] for hook in self.hooks]
+
+         # Extract only task-relevant tokens and ignore global tokens.
+         layers = [self.adapt_tokens(l) for l in layers]
+
+         # Reshape tokens to spatial representation
+         layers = [rearrange(l, 'b (nh nw) c -> b c nh nw', nh=N_H, nw=N_W) for l in layers]
+
+         layers = [self.act_postprocess[idx](l) for idx, l in enumerate(layers)]
+         # Project layers to chosen feature dim
+         layers = [self.scratch.layer_rn[idx](l) for idx, l in enumerate(layers)]
+
+         # Fuse layers using refinement stages
+         path_4 = self.scratch.refinenet4(layers[3])
+         path_3 = self.scratch.refinenet3(path_4, layers[2])
+         path_2 = self.scratch.refinenet2(path_3, layers[1])
+         path_1 = self.scratch.refinenet1(path_2, layers[0])
+
+         # Output head
+         out = self.head(path_1)
+
+         return out
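A hedged usage sketch for `DPTOutputAdapter`: it consumes a list of per-layer token tensors of shape `(B, N_H*N_W, dim_tokens_enc)` indexed by `hooks` and produces a dense map. The import path assumes this repo's layout is importable as a package; batch size, resolution and embedding width are arbitrary choices here.

```python
import torch
from dust3r.croco.models.dpt_block import DPTOutputAdapter  # path assumed from this repo layout

B, H, W, D = 2, 224, 224, 768
adapter = DPTOutputAdapter(num_channels=1, hooks=[2, 5, 8, 11], dim_tokens_enc=D)
# 12 fake ViT layers, each a (B, 14*14, D) token map for a 224x224 image with 16x16 patches
tokens = [torch.randn(B, (H // 16) * (W // 16), D) for _ in range(12)]
out = adapter(tokens, image_size=(H, W))
print(out.shape)  # torch.Size([2, 1, 224, 224]) for the regression head
```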
dust3r/croco/models/head_downstream.py ADDED
@@ -0,0 +1,58 @@
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+ # --------------------------------------------------------
+ # Heads for downstream tasks
+ # --------------------------------------------------------
+
+ """
+ A head is a module whose __init__ defines only the head hyperparameters.
+ A setup(croconet) method takes a CroCoNet and sets all layers according to the head and croconet attributes.
+ The forward takes the features as well as a dictionary img_info containing the keys 'width' and 'height'
+ """
+
+ import torch
+ import torch.nn as nn
+ from .dpt_block import DPTOutputAdapter
+
+
+ class PixelwiseTaskWithDPT(nn.Module):
+     """ DPT module for CroCo.
+     by default, hooks_idx will be equal to:
+     * for encoder-only: 4 equally spread layers
+     * for encoder+decoder: last encoder + 3 equally spread layers of the decoder
+     """
+
+     def __init__(self, *, hooks_idx=None, layer_dims=[96, 192, 384, 768],
+                  output_width_ratio=1, num_channels=1, postprocess=None, **kwargs):
+         super(PixelwiseTaskWithDPT, self).__init__()
+         self.return_all_blocks = True  # backbone needs to return all layers
+         self.postprocess = postprocess
+         self.output_width_ratio = output_width_ratio
+         self.num_channels = num_channels
+         self.hooks_idx = hooks_idx
+         self.layer_dims = layer_dims
+
+     def setup(self, croconet):
+         dpt_args = {'output_width_ratio': self.output_width_ratio, 'num_channels': self.num_channels}
+         if self.hooks_idx is None:
+             if hasattr(croconet, 'dec_blocks'):  # encoder + decoder
+                 step = {8: 3, 12: 4, 24: 8}[croconet.dec_depth]
+                 hooks_idx = [croconet.dec_depth + croconet.enc_depth - 1 - i * step for i in range(3, -1, -1)]
+             else:  # encoder only
+                 step = croconet.enc_depth // 4
+                 hooks_idx = [croconet.enc_depth - 1 - i * step for i in range(3, -1, -1)]
+             self.hooks_idx = hooks_idx
+             print(f' PixelwiseTaskWithDPT: automatically setting hook_idxs={self.hooks_idx}')
+         dpt_args['hooks'] = self.hooks_idx
+         dpt_args['layer_dims'] = self.layer_dims
+         self.dpt = DPTOutputAdapter(**dpt_args)
+         dim_tokens = [croconet.enc_embed_dim if hook < croconet.enc_depth else croconet.dec_embed_dim
+                       for hook in self.hooks_idx]
+         dpt_init_args = {'dim_tokens_enc': dim_tokens}
+         self.dpt.init(**dpt_init_args)
+
+     def forward(self, x, img_info):
+         out = self.dpt(x, image_size=(img_info['height'], img_info['width']))
+         if self.postprocess:
+             out = self.postprocess(out)
+         return out
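A minimal sketch of how `setup` wires this head to a backbone. `backbone` below is a stand-in namespace exposing only the attributes `setup()` actually reads in the encoder-only case; a real CroCoNet would provide the same fields, and the import path is assumed:

```python
import torch
from types import SimpleNamespace
from dust3r.croco.models.head_downstream import PixelwiseTaskWithDPT  # path assumed

backbone = SimpleNamespace(enc_depth=12, enc_embed_dim=768)  # encoder-only stand-in
head = PixelwiseTaskWithDPT(num_channels=1)
head.setup(backbone)  # hooks default to 4 equally spread encoder layers: [2, 5, 8, 11]
feats = [torch.randn(2, 196, 768) for _ in range(12)]  # all-blocks features, 14x14 grid
out = head(feats, {'height': 224, 'width': 224})        # dense (2, 1, 224, 224) map
```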
dust3r/croco/models/masking.py ADDED
@@ -0,0 +1,25 @@
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+
+ # --------------------------------------------------------
+ # Masking utils
+ # --------------------------------------------------------
+
+ import torch
+ import torch.nn as nn
+
+ class RandomMask(nn.Module):
+     """
+     random masking
+     """
+
+     def __init__(self, num_patches, mask_ratio):
+         super().__init__()
+         self.num_patches = num_patches
+         self.num_mask = int(mask_ratio * self.num_patches)
+
+     def __call__(self, x):
+         noise = torch.rand(x.size(0), self.num_patches, device=x.device)
+         argsort = torch.argsort(noise, dim=1)
+         return argsort < self.num_mask
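A quick sketch of `RandomMask` behavior (the patch count, ratio and batch size are arbitrary choices; the import path is assumed): each sample independently masks exactly `int(mask_ratio * num_patches)` patches.

```python
import torch
from dust3r.croco.models.masking import RandomMask  # path assumed

mask_gen = RandomMask(num_patches=196, mask_ratio=0.9)
x = torch.randn(4, 196, 768)   # (B, Npatches, C) patch tokens
masks = mask_gen(x)            # (B, Npatches) boolean, True = masked
assert masks.shape == (4, 196)
assert (masks.sum(dim=1) == int(0.9 * 196)).all()  # exactly 176 masked per sample
```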
dust3r/croco/models/pos_embed.py ADDED
@@ -0,0 +1,159 @@
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+
+ # --------------------------------------------------------
+ # Position embedding utils
+ # --------------------------------------------------------
+
+
+ import numpy as np
+
+ import torch
+
+ # --------------------------------------------------------
+ # 2D sine-cosine position embedding
+ # References:
+ # MAE: https://github.com/facebookresearch/mae/blob/main/util/pos_embed.py
+ # Transformer: https://github.com/tensorflow/models/blob/master/official/nlp/transformer/model_utils.py
+ # MoCo v3: https://github.com/facebookresearch/moco-v3
+ # --------------------------------------------------------
+ def get_2d_sincos_pos_embed(embed_dim, grid_size, n_cls_token=0):
+     """
+     grid_size: int of the grid height and width
+     return:
+     pos_embed: [grid_size*grid_size, embed_dim] or [n_cls_token+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
+     """
+     grid_h = np.arange(grid_size, dtype=np.float32)
+     grid_w = np.arange(grid_size, dtype=np.float32)
+     grid = np.meshgrid(grid_w, grid_h)  # here w goes first
+     grid = np.stack(grid, axis=0)
+
+     grid = grid.reshape([2, 1, grid_size, grid_size])
+     pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
+     if n_cls_token > 0:
+         pos_embed = np.concatenate([np.zeros([n_cls_token, embed_dim]), pos_embed], axis=0)
+     return pos_embed
+
+
+ def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
+     assert embed_dim % 2 == 0
+
+     # use half of dimensions to encode grid_h
+     emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0])  # (H*W, D/2)
+     emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1])  # (H*W, D/2)
+
+     emb = np.concatenate([emb_h, emb_w], axis=1)  # (H*W, D)
+     return emb
+
+
+ def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
+     """
+     embed_dim: output dimension for each position
+     pos: a list of positions to be encoded: size (M,)
+     out: (M, D)
+     """
+     assert embed_dim % 2 == 0
+     omega = np.arange(embed_dim // 2, dtype=float)
+     omega /= embed_dim / 2.
+     omega = 1. / 10000**omega  # (D/2,)
+
+     pos = pos.reshape(-1)  # (M,)
+     out = np.einsum('m,d->md', pos, omega)  # (M, D/2), outer product
+
+     emb_sin = np.sin(out)  # (M, D/2)
+     emb_cos = np.cos(out)  # (M, D/2)
+
+     emb = np.concatenate([emb_sin, emb_cos], axis=1)  # (M, D)
+     return emb
+
+
+ # --------------------------------------------------------
+ # Interpolate position embeddings for high-resolution
+ # References:
+ # MAE: https://github.com/facebookresearch/mae/blob/main/util/pos_embed.py
+ # DeiT: https://github.com/facebookresearch/deit
+ # --------------------------------------------------------
+ def interpolate_pos_embed(model, checkpoint_model):
+     if 'pos_embed' in checkpoint_model:
+         pos_embed_checkpoint = checkpoint_model['pos_embed']
+         embedding_size = pos_embed_checkpoint.shape[-1]
+         num_patches = model.patch_embed.num_patches
+         num_extra_tokens = model.pos_embed.shape[-2] - num_patches
+         # height (== width) for the checkpoint position embedding
+         orig_size = int((pos_embed_checkpoint.shape[-2] - num_extra_tokens) ** 0.5)
+         # height (== width) for the new position embedding
+         new_size = int(num_patches ** 0.5)
+         # class_token and dist_token are kept unchanged
+         if orig_size != new_size:
+             print("Position interpolate from %dx%d to %dx%d" % (orig_size, orig_size, new_size, new_size))
+             extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
+             # only the position tokens are interpolated
+             pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
+             pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
+             pos_tokens = torch.nn.functional.interpolate(
+                 pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
+             pos_tokens = pos_tokens.permute(0, 2, 3, 1).flatten(1, 2)
+             new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
+             checkpoint_model['pos_embed'] = new_pos_embed
+
+
+ # ----------------------------------------------------------
+ # RoPE2D: RoPE implementation in 2D
+ # ----------------------------------------------------------
+
+ try:
+     from models.curope import cuRoPE2D
+     RoPE2D = cuRoPE2D
+ except ImportError:
+     print('Warning: cannot find the CUDA-compiled version of RoPE2D, using a slow PyTorch version instead')
+
+     class RoPE2D(torch.nn.Module):
+
+         def __init__(self, freq=100.0, F0=1.0):
+             super().__init__()
+             self.base = freq
+             self.F0 = F0
+             self.cache = {}
+
+         def get_cos_sin(self, D, seq_len, device, dtype):
+             if (D, seq_len, device, dtype) not in self.cache:
+                 inv_freq = 1.0 / (self.base ** (torch.arange(0, D, 2).float().to(device) / D))
+                 t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
+                 freqs = torch.einsum("i,j->ij", t, inv_freq).to(dtype)
+                 freqs = torch.cat((freqs, freqs), dim=-1)
+                 cos = freqs.cos()  # (Seq, Dim)
+                 sin = freqs.sin()
+                 self.cache[D, seq_len, device, dtype] = (cos, sin)
+             return self.cache[D, seq_len, device, dtype]
+
+         @staticmethod
+         def rotate_half(x):
+             x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2:]
+             return torch.cat((-x2, x1), dim=-1)
+
+         def apply_rope1d(self, tokens, pos1d, cos, sin):
+             assert pos1d.ndim == 2
+             cos = torch.nn.functional.embedding(pos1d, cos)[:, None, :, :]
+             sin = torch.nn.functional.embedding(pos1d, sin)[:, None, :, :]
+             return (tokens * cos) + (self.rotate_half(tokens) * sin)
+
+         def forward(self, tokens, positions):
+             """
+             input:
+                 * tokens: batch_size x nheads x ntokens x dim
+                 * positions: batch_size x ntokens x 2 (y and x position of each token)
+             output:
+                 * tokens after applying RoPE2D (batch_size x nheads x ntokens x dim)
+             """
+             assert tokens.size(3) % 2 == 0, "number of dimensions should be a multiple of two"
+             D = tokens.size(3) // 2
+             assert positions.ndim == 3 and positions.shape[-1] == 2  # Batch, Seq, 2
+             cos, sin = self.get_cos_sin(D, int(positions.max()) + 1, tokens.device, tokens.dtype)
+             # split features into two along the feature dimension, and apply rope1d on each half
+             y, x = tokens.chunk(2, dim=-1)
+             y = self.apply_rope1d(y, positions[:, :, 0], cos, sin)
+             x = self.apply_rope1d(x, positions[:, :, 1], cos, sin)
+             tokens = torch.cat((y, x), dim=-1)
+             return tokens
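An illustrative check of the 2D sin-cos embedding and the PyTorch RoPE2D fallback above (grid size, head count and dims are arbitrary choices; the import path is assumed, and if the cuRoPE2D extension is installed the same interface applies):

```python
import torch
from dust3r.croco.models.pos_embed import get_2d_sincos_pos_embed, RoPE2D  # path assumed

pe = get_2d_sincos_pos_embed(embed_dim=768, grid_size=14)
print(pe.shape)  # (196, 768): one row per patch; half the dims encode y, half encode x

rope = RoPE2D(freq=100.0)
tokens = torch.randn(2, 12, 196, 64)  # batch x nheads x ntokens x dim (dim must be even)
ys, xs = torch.meshgrid(torch.arange(14), torch.arange(14), indexing='ij')
positions = torch.stack([ys.flatten(), xs.flatten()], dim=-1)[None].repeat(2, 1, 1)
rotated = rope(tokens, positions)  # same shape; each token is rotated by its (y, x) position
assert rotated.shape == tokens.shape
```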
dust3r/croco/models/transformer_utils.py ADDED
@@ -0,0 +1,1021 @@
1
+ import os
2
+ from collections import OrderedDict
3
+ from typing import Tuple, Union
4
+
5
+ import numpy as np
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+
10
+ from torch import nn, einsum
11
+ from torch.utils.checkpoint import checkpoint
12
+
13
+ from einops import rearrange, repeat
14
+
15
+ from inspect import isfunction
16
+ try:
17
+ from flash_attn import flash_attn_qkvpacked_func, flash_attn_func, flash_attn_varlen_qkvpacked_func
18
+ from flash_attn.bert_padding import unpad_input, pad_input
19
+ except:
20
+ flash_attn_qkvpacked_func, flash_attn_func, flash_attn_varlen_qkvpacked_func = None, None, None
21
+ unpad_input, pad_input = None, None
22
+ from .x_transformer import AttentionLayers, BasicEncoder
23
+ import math
24
+
25
+
26
+
27
+ def _init_weights(module):
28
+ if isinstance(module, nn.Linear):
29
+ torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
30
+ if module.bias is not None:
31
+ torch.nn.init.zeros_(module.bias)
32
+
33
+ def exists(val):
34
+ return val is not None
35
+
36
+ def default(val, d):
37
+ if exists(val):
38
+ return val
39
+ return d() if isfunction(d) else d
40
+
41
+ def zero_module(module):
42
+ """
43
+ Zero out the parameters of a module and return it.
44
+ """
45
+ for p in module.parameters():
46
+ p.detach().zero_()
47
+ return module
48
+
49
+ # Copy from CLIP GitHub
50
+ class LayerNorm(nn.LayerNorm):
51
+ """Subclass torch's LayerNorm to handle fp16."""
52
+
53
+ def forward(self, x: torch.Tensor):
54
+ orig_type = x.dtype
55
+ ret = super().forward(x.type(torch.float32))
56
+ return ret.type(orig_type)
57
+
58
+ class QuickGELU(nn.Module):
59
+ def forward(self, x: torch.Tensor):
60
+ return x * torch.sigmoid(1.702 * x)
61
+
62
+ def modulate(x, shift, scale):
63
+ # from https://github.com/facebookresearch/DiT/blob/796c29e532f47bba17c5b9c5eb39b9354b8b7c64/models.py#L19
64
+ return x * (1 + scale.unsqueeze(0)) + shift.unsqueeze(0)
65
+
66
+
67
+ class MultiheadAttentionFlashV2(nn.Module):
68
+ def __init__(self, embed_dim, n_head, bias=False, shift_group=None, qkv_packed=False, window_size=None):
69
+ super().__init__()
70
+
71
+ self.head_dim = embed_dim// n_head
72
+ self.embed_dim = embed_dim
73
+ self.n_head = n_head
74
+ self.to_q = nn.Linear(embed_dim, embed_dim, bias=bias)
75
+ self.to_k = nn.Linear(embed_dim, embed_dim, bias=bias)
76
+ self.to_v = nn.Linear(embed_dim, embed_dim, bias=bias)
77
+ self.shift_group = shift_group
78
+ self.qkv_packed = qkv_packed
79
+ self.window_size = window_size
80
+
81
+
82
+ def forward(self, q, k, v, dropout_p=0.0, softmax_scale=None, causal=False, need_weights=False, attn_mask=None):
83
+ q = q.permute(1, 0, 2)
84
+ k = k.permute(1, 0, 2)
85
+ v = v.permute(1, 0, 2)
86
+
87
+ h = self.n_head
88
+ q = self.to_q(q)
89
+ k = self.to_k(k)
90
+ v = self.to_v(v)
91
+
92
+ q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b n h d', h=h), (q, k, v))
93
+ # print(q.dtype, k.dtype, v.dtype)
94
+ if self.qkv_packed:
95
+ bsz, q_len, heads, head_dim = q.shape
96
+ group_size = self.shift_group
97
+ nheads = self.n_head
98
+ qkv = torch.stack([q,k,v], dim=2)
99
+ qkv = qkv.reshape(bsz, q_len, 3, 2, nheads // 2, self.head_dim).permute(0, 3, 1, 2, 4, 5).reshape(bsz * 2,
100
+ q_len, 3,
101
+ nheads // 2,
102
+ self.head_dim)
103
+
104
+ x = rearrange(qkv, "b s three h d -> b s (three h d)")
105
+ key_padding_mask = torch.ones(x.shape[0], x.shape[1], device=x.device, dtype=x.dtype)
106
+ x_unpad, indices, cu_q_lens, max_s = unpad_input(x, key_padding_mask)
107
+ cu_q_len_tmp = torch.arange(0, max_s, group_size, device=key_padding_mask.device, dtype=cu_q_lens.dtype)
108
+ cu_q_len_tmp2 = cu_q_len_tmp + group_size // 2
109
+ cu_q_len_tmp2[cu_q_len_tmp2 >= max_s] = torch.iinfo(cu_q_len_tmp2.dtype).min
110
+ cu_q_len_tmp = torch.stack([cu_q_len_tmp, cu_q_len_tmp2]).repeat(bsz, 1) + cu_q_lens[:-1].unsqueeze(-1)
111
+ cu_q_lens = torch.cat([cu_q_len_tmp, cu_q_lens[1:].unsqueeze(-1)], dim=-1).view(-1)
112
+ cu_q_lens = cu_q_lens[cu_q_lens >= 0]
113
+ x_unpad = rearrange(
114
+ x_unpad, "nnz (three h d) -> nnz three h d", three=3, h=nheads // 2
115
+ )
116
+ output_unpad = flash_attn_varlen_qkvpacked_func(
117
+ x_unpad, cu_q_lens, group_size, 0.0, softmax_scale=None, causal=False,
118
+ )
119
+ output = rearrange(
120
+ pad_input(
121
+ rearrange(output_unpad, "nnz h d -> nnz (h d)"), indices, bsz * 2, q_len
122
+ ),
123
+ "b s (h d) -> b s h d",
124
+ h=nheads // 2,
125
+ )
126
+ r_out = output.reshape(bsz, 2, q_len, nheads // 2, self.head_dim).transpose(1, 2).reshape(bsz, q_len, nheads,
127
+ self.head_dim)
128
+ else:
129
+ if self.shift_group is not None:
130
+ bsz, q_len, heads, head_dim = q.shape
131
+ assert q_len % self.shift_group == 0
132
+
133
+ def shift(qkv, bsz, q_len, group_size, num_heads, head_dim):
134
+ qkv[:, num_heads // 2:] = qkv[:, num_heads // 2:].roll(-group_size // 2, dims=2)
135
+ qkv = qkv.transpose(1, 2).reshape(bsz * (q_len // group_size), group_size, num_heads, head_dim).transpose(1, 2)
136
+ return qkv
137
+
138
+ q = shift(q, bsz, q_len, self.shift_group, h, self.head_dim)
139
+ k = shift(k, bsz, q_len, self.shift_group, h, self.head_dim)
140
+ v = shift(v, bsz, q_len, self.shift_group, h, self.head_dim)
141
+ if self.window_size:
142
+ out = flash_attn_func(q, k, v, dropout_p=dropout_p, softmax_scale=softmax_scale, causal=causal, window_size=(self.window_size // 2, self.window_size // 2))
143
+ else:
144
+ out = flash_attn_func(q, k, v, dropout_p=dropout_p, softmax_scale=softmax_scale, causal=causal)
145
+
146
+ if self.shift_group is not None:
147
+ out = out.transpose(1, 2).contiguous()
148
+ out = rearrange(out, '(b l) g h d -> b (l g) h d', l=q_len // self.shift_group)
149
+ r_out = out.clone()
150
+ r_out[:, :, h//2:] = r_out[:, :, h//2:].roll(h//2, dims=1)
151
+ else:
152
+ r_out = out
153
+
154
+ r_out = rearrange(r_out, 'b n h d -> b n (h d)')
155
+ r_out = r_out.permute(1, 0, 2)
156
+ return (r_out,)
157
+
158
+ class PSUpsamplerBlock(nn.Module):
159
+ def __init__(self, d_model: int, d_model_out: int, scale_factor: int):
160
+ super().__init__()
161
+
162
+ # self.mlp = nn.Sequential(OrderedDict([
163
+ # ("c_fc", nn.Linear(d_model, d_model_out * scale_factor**2)),
164
+ # ("gelu", QuickGELU()),
165
+ # ("c_proj", nn.Linear(d_model_out * scale_factor**2, d_model_out * scale_factor**2))
166
+ # ]))
167
+ # self.ln_2 = LayerNorm(d_model)
168
+ self.scale_factor = scale_factor
169
+ self.d_model_out = d_model_out
170
+ self.residual_fc = nn.Linear(d_model, d_model_out * (scale_factor**2))
171
+ self.pixelshuffle = nn.PixelShuffle(scale_factor)
172
+
173
+ def forward(self, x: torch.Tensor):
174
+ # mlp block
175
+ # x.shape b, l, d
176
+ # y = self.ln_2(x)
177
+ # y = self.mlp(y)
178
+ # For here we have two cases:
179
+ # 1. If we have a modulation function for the MLP, we use it to modulate the output of the MLP
180
+ # 2. If we don't have a modulation function for the MLP, we use the modulation function for the attention
181
+ x = self.residual_fc(x)# .repeat(1, 1, self.scale_factor**2)
182
+ # x = x + y
183
+ bs, l, c = x.shape
184
+ resolution = int(np.sqrt(l))
185
+ x = x.permute(0, 2, 1).reshape(bs, c, resolution, resolution)
186
+ x = self.pixelshuffle(x)
187
+ x = x.reshape(bs, self.d_model_out, resolution*self.scale_factor*resolution*self.scale_factor)
188
+ x = x.permute(0, 2, 1)
189
+ # x = rearrange(x, 'b l (s c) -> b (l s) c', s=self.scale_factor**2)
190
+ return x
191
+
192
+ class ResidualAttentionBlock(nn.Module):
193
+ def __init__(self, d_model: int,
194
+ n_head: int,
195
+ attn_mask: torch.Tensor = None,
196
+ modulate_feature_size: int = None,
197
+ modulate_act_type: str = 'gelu',
198
+ cross_att: bool = None,
199
+ flash_v2: bool = None,
200
+ qkv_packed: bool = None,
201
+ shift_group: int = None,
202
+ window_size: int = None,):
203
+ super().__init__()
204
+
205
+ print('vit flashv2', flash_v2)
206
+
207
+ self.flash_v2 = flash_v2
208
+ self.window_size = window_size
209
+ if self.flash_v2:
210
+ self.attn = MultiheadAttentionFlashV2(d_model, n_head, shift_group=shift_group, qkv_packed=qkv_packed, window_size=window_size)
211
+ else:
212
+ self.attn = nn.MultiheadAttention(d_model, n_head)
213
+
214
+ self.ln_1 = LayerNorm(d_model)
215
+ self.mlp = nn.Sequential(OrderedDict([
216
+ ("c_fc", nn.Linear(d_model, d_model * 4)),
217
+ ("gelu", QuickGELU()),
218
+ ("c_proj", nn.Linear(d_model * 4, d_model))
219
+ ]))
220
+ self.ln_2 = LayerNorm(d_model)
221
+ self.attn_mask = attn_mask
222
+ self.window_size = window_size
223
+
224
+ if modulate_feature_size is not None:
225
+ act_dict = {'gelu': QuickGELU,
226
+ 'silu': nn.SiLU}
227
+ self.modulation_fn = nn.Sequential(
228
+ LayerNorm(modulate_feature_size),
229
+ act_dict[modulate_act_type](),
230
+ nn.Linear(modulate_feature_size, 3 * d_model, bias=True)
231
+ )
232
+ self.mlp_modulation_fn = nn.Sequential(
233
+ LayerNorm(modulate_feature_size),
234
+ act_dict[modulate_act_type](),
235
+ nn.Linear(modulate_feature_size, 3 * d_model, bias=True)
236
+ )
237
+ else:
238
+ self.modulation_fn = None
239
+ self.mlp_modulation_fn = None
240
+
241
+ self.cross_att = cross_att
242
+ if self.cross_att:
243
+ self.cross_att = CrossAttention(query_dim=d_model, context_dim=d_model,
244
+ heads=n_head, dim_head=int(d_model//n_head), dropout=0)
245
+ self.ln_1_5 = LayerNorm(d_model)
246
+
247
+ def attention(self, x: torch.Tensor, index):
248
+ if self.attn_mask is not None:
249
+ self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device)
250
+ length = x.shape[0]
251
+ attn_mask = self.attn_mask[:length, :length]
252
+ else:
253
+ attn_mask = None
254
+ if self.window_size is not None:
255
+ x = x.permute(1, 0, 2)
256
+ b, l, c = x.shape
257
+ # print(x.shape)
258
+ assert l % self.window_size == 0
259
+ if index % 2 == 0:
260
+ x = rearrange(x, 'b (p w) c -> (b p) w c', w=self.window_size)
261
+ x = x.permute(1, 0, 2) # w, bp, c
262
+ x = self.attn(x, x, x, need_weights=False, attn_mask=attn_mask)[0]
263
+ x = x.permute(1, 0, 2) # bp, w, c
264
+ x = rearrange(x, '(b l) w c -> b (l w) c', l=l//self.window_size, w=self.window_size)
265
+ x = x.permute(1, 0, 2) # w, bp, c
266
+ else:
267
+ x = torch.roll(x, shifts=self.window_size//2, dims=1)
268
+ x = rearrange(x, 'b (p w) c -> (b p) w c', w=self.window_size)
269
+ x = x.permute(1, 0, 2) # w, bp, c
270
+ x = self.attn(x, x, x, need_weights=False, attn_mask=attn_mask)[0]
271
+ x = x.permute(1, 0, 2) # w, bp, c
272
+ x = rearrange(x, '(b l) w c -> b (l w) c', l=l//self.window_size, w=self.window_size)
273
+ x = torch.roll(x, shifts=-self.window_size//2, dims=1)
274
+ x = x.permute(1, 0, 2)
275
+ else:
276
+ x = self.attn(x, x, x, need_weights=False, attn_mask=attn_mask)[0]
277
+
278
+ return x
279
+
280
+ def forward(self, x: torch.Tensor, modulation: torch.Tensor = None, context: torch.Tensor = None, index=None):
281
+ # self attention block
282
+ y = self.ln_1(x)
283
+ if self.modulation_fn is not None:
284
+ shift, scale, gate = self.modulation_fn(modulation).chunk(3, dim=1)
285
+ y = modulate(y, shift, scale)
286
+ y = self.attention(y, index)
287
+ # If we have modulation func for mlp as well, we will just use the gate for the attention
288
+ if self.modulation_fn is not None and self.mlp_modulation_fn is not None:
289
+ y = y * gate.unsqueeze(0)
290
+ x = x + y
291
+
292
+ # cross attention block
293
+ if self.cross_att:
294
+ y = self.cross_att(self.ln_1_5(x), context=context)
295
+ # print(y.mean().item())
296
+ x = x + y
297
+
298
+ # mlp block
299
+ y = self.ln_2(x)
300
+ if self.mlp_modulation_fn is not None:
301
+ shift, scale, gate = self.mlp_modulation_fn(modulation).chunk(3, dim=1)
302
+ y = modulate(y, shift, scale)
303
+ y = self.mlp(y)
304
+ # For here we have two cases:
305
+ # 1. If we have a modulation function for the MLP, we use it to modulate the output of the MLP
306
+ # 2. If we don't have a modulation function for the MLP, we use the modulation function for the attention
307
+ if self.modulation_fn is not None:
308
+ y = y * gate.unsqueeze(0)
309
+ x = x + y
310
+
311
+ return x
312
+
313
+
314
+ class Transformer(nn.Module):
315
+ def __init__(self,
316
+ width: int,
317
+ layers: int,
318
+ heads: int,
319
+ attn_mask: torch.Tensor = None,
320
+ modulate_feature_size: int = None,
321
+ modulate_act_type: str = 'gelu',
322
+ cross_att_layers: int = 0,
323
+ return_all_layers=False,
324
+ flash_v2=True,
325
+ qkv_packed=False,
326
+ shift_group=None,
327
+ window_size=None):
328
+
329
+ super().__init__()
330
+ self.width = width
331
+ self.layers = layers
332
+
333
+ blocks = []
334
+ for _ in range(layers):
335
+ layer = ResidualAttentionBlock(width,
336
+ heads,
337
+ attn_mask,
338
+ modulate_feature_size=modulate_feature_size,
339
+ modulate_act_type=modulate_act_type,
340
+ cross_att = (_ + cross_att_layers)>=layers,
341
+ flash_v2=flash_v2,
342
+ qkv_packed=qkv_packed,
343
+ shift_group=shift_group,
344
+ window_size=window_size)
345
+ blocks.append(layer)
346
+
347
+ self.resblocks = nn.Sequential(*blocks)
348
+
349
+ self.grad_checkpointing = False
350
+ self.return_all_layers = return_all_layers
351
+ self.flash_v2 = flash_v2
352
+
353
+ def set_grad_checkpointing(self, flag=True):
354
+ self.grad_checkpointing = flag
355
+
356
+ def forward(self,
357
+ x: torch.Tensor,
358
+ modulation: torch.Tensor = None,
359
+ context: torch.Tensor = None,
360
+ additional_residuals = None):
361
+
362
+ all_x = []
363
+ if additional_residuals is not None:
364
+ assert len(additional_residuals) == self.layers
365
+ for res_i, module in enumerate(self.resblocks):
366
+ if self.grad_checkpointing:
367
+ # print("Grad checkpointing")
368
+ x = checkpoint(module, x, modulation, context, res_i)
369
+ else:
370
+ x = module(x, modulation, context, res_i)
371
+ if additional_residuals is not None:
372
+ add_res = additional_residuals[res_i]
373
+ x[:, :add_res.shape[1]] = x[:, :add_res.shape[1]] + add_res
374
+ all_x.append(x)
375
+ if self.return_all_layers:
376
+ return all_x
377
+ else:
378
+ return x
379
+
380
+ class GaussianUpsampler(nn.Module):
381
+ def __init__(self, width,
382
+ up_ratio,
383
+ ch_decay=1,
384
+ low_channels=64,
385
+ window_size=False,
386
+ with_additional_inputs=False):
387
+
388
+ super().__init__()
389
+ self.up_ratio = up_ratio
390
+ self.low_channels = low_channels
391
+ self.window_size = window_size
392
+ self.base_width = width
393
+ self.with_additional_inputs = with_additional_inputs
394
+ for res_log2 in range(int(np.log2(up_ratio))):
395
+ _width = width
396
+ width = max(width // ch_decay, 64)
397
+ heads = int(width / 64)
398
+ width = heads * 64
399
+ if self.with_additional_inputs:
400
+ self.add_module(f'upsampler_{res_log2}', PSUpsamplerBlock(_width+self.base_width, width, 2))
401
+ else:
402
+ self.add_module(f'upsampler_{res_log2}', PSUpsamplerBlock(_width, width, 2))
403
+ encoder = Transformer(width, 2, heads,
404
+ modulate_feature_size=None,
405
+ modulate_act_type=None,
406
+ cross_att_layers=0,
407
+ return_all_layers=False,
408
+ flash_v2=False,
409
+ qkv_packed=False,
410
+ shift_group=False,
411
+ window_size=window_size)
412
+ self.add_module(f'attention_{res_log2}', encoder)
413
+ self.out_channels = width
414
+ self.ln_post = LayerNorm(width)
415
+
416
+ def forward(self, x, additional_inputs=None):
417
+ if self.with_additional_inputs:
418
+ assert len(additional_inputs) == int(np.log2(self.up_ratio))
419
+ for res_log2 in range(int(np.log2(self.up_ratio))):
420
+ if self.with_additional_inputs:
421
+ add_input = additional_inputs[res_log2]
422
+ scale = x.shape[1] // add_input.shape[1]
423
+ add_input = add_input.repeat_interleave(scale, 1)
424
+ x = torch.cat([x, add_input], dim=2)
425
+ x = getattr(self, f'upsampler_{res_log2}')(x)
426
+ x = x.permute(1, 0, 2)
427
+ x = getattr(self, f'attention_{res_log2}')(x)
428
+ x = x.permute(1, 0, 2)
429
+ x = self.ln_post(x)
430
+ return x
431
+
432
+
433
+
434
+ class HyperGaussianUpsampler(nn.Module):
435
+ def __init__(self, width,
436
+ resolution,
437
+ up_ratio,
438
+ ch_decay=1,
439
+ window_size=False,
440
+ with_additional_inputs=False,
441
+ upsampler_kwargs={}):
442
+
443
+ super().__init__()
444
+ self.up_ratio = up_ratio
445
+ self.window_size = window_size
446
+ self.base_width = width
447
+ self.with_additional_inputs = with_additional_inputs
448
+ self.resolution = resolution
449
+ for res_log2 in range(int(np.log2(up_ratio))):
450
+ if res_log2 == 0:
451
+ _width = width
452
+ width = width
453
+ heads = int(width / 64)
454
+ width = heads * 64
455
+ if self.with_additional_inputs:
456
+ self.add_module(f'upsampler_{res_log2}', PSUpsamplerBlock(_width+self.base_width, width, 2))
457
+ else:
458
+ self.add_module(f'upsampler_{res_log2}', PSUpsamplerBlock(_width, width, 2))
459
+ encoder = Transformer(width, 2, heads,
460
+ modulate_feature_size=None,
461
+ modulate_act_type=None,
462
+ cross_att_layers=0,
463
+ return_all_layers=False,
464
+ flash_v2=False,
465
+ qkv_packed=False,
466
+ shift_group=False,
467
+ window_size=window_size)
468
+ self.add_module(f'attention_{res_log2}', encoder)
469
+ self.resolution = self.resolution * 2
470
+ else:
471
+ self.resolution = self.resolution * 2
472
+ self.add_module(f'upsample_{res_log2}',
473
+ UpsamplerLayers_conv(in_channels=width,
474
+ out_channels=width,
475
+ resolution=self.resolution,
476
+ conv_block_type = 'convnext',
477
+ **upsampler_kwargs))
478
+ self.out_channels = width
479
+ # self.ln_post = LayerNorm(width)
480
+ self.ln_post = LayerNorm([self.resolution, self.resolution, width])
481
+
482
+ def forward(self, x, additional_inputs=None):
483
+ if self.with_additional_inputs:
484
+ assert len(additional_inputs) == int(np.log2(self.up_ratio))
485
+ for res_log2 in range(int(np.log2(self.up_ratio))):
486
+ if res_log2 == 0:
487
+ if self.with_additional_inputs:
488
+ add_input = additional_inputs[res_log2]
489
+ scale = x.shape[1] // add_input.shape[1]
490
+ add_input = add_input.repeat_interleave(scale, 1)
491
+ x = torch.cat([x, add_input], dim=2)
492
+ x = getattr(self, f'upsampler_{res_log2}')(x)
493
+ x = x.permute(1, 0, 2)
494
+ x = getattr(self, f'attention_{res_log2}')(x)
495
+ x = x.permute(1, 0, 2)
496
+ x = x.reshape(x.shape[0], int(math.sqrt(x.shape[1])), int(math.sqrt(x.shape[1])), -1).permute(0, 3, 1, 2)
497
+ else:
498
+ x = getattr(self, f'upsample_{res_log2}')(x)
499
+ x = self.ln_post(x.permute(0, 2, 3, 1))
500
+ return x
501
+
502
+ class VisionTransformer(nn.Module):
503
+ def __init__(self,
504
+ # transformer params
505
+ in_channels: int,
506
+ patch_size: int,
507
+ width: int,
508
+ layers: int,
509
+ heads: int,
510
+ weight: str = None,
511
+ encode_layers: int = 0,
512
+ shift_group = False,
513
+ flash_v2 = False,
514
+ qkv_packed = False,
515
+ window_size = False,
516
+ use_pe = False,
517
+ # modualtion params
518
+ modulate_feature_size: int = None,
519
+ modulate_act_type: str = 'gelu',
520
+ # camera condition
521
+ camera_condition: str = 'plucker',
522
+ # init params
523
+ disable_dino=False,
524
+ error_weight_init_mode='mean',
525
+ # other params
526
+ add_zero_conv=False,
527
+ return_all_layers=False,
528
+ disable_post_ln=False,
529
+ rope=None):
530
+ super().__init__()
531
+ self.patch_size = patch_size
532
+ self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)
533
+ self.use_pe = use_pe
534
+ self.rope = rope
535
+ self.disable_dino = disable_dino
536
+ # if not self.disable_dino:
537
+ # scale = width ** -0.5
538
+ # self.class_embedding = nn.Parameter(scale * torch.randn(width))
539
+ # self.positional_embedding = nn.Parameter(scale * torch.randn((input_res// patch_size) ** 2 + 1, width))
540
+ # else:
541
+ # if self.use_pe:
542
+ # self.positional_embedding = nn.Parameter(torch.zeros(1, (input_res// patch_size) ** 2, width))
543
+ # nn.init.trunc_normal_(self.positional_embedding, std=0.02)
544
+ self.ln_pre = LayerNorm(width)
545
+ self.add_zero_conv = add_zero_conv
546
+ self.return_all_layers = return_all_layers
547
+ self.disable_post_ln = disable_post_ln
548
+ self.flash_v2 = flash_v2
549
+ self.qkv_packed = qkv_packed
550
+
551
+ self.camera_condition = camera_condition
552
+ if self.camera_condition == 'plucker': assert modulate_feature_size is None
553
+
554
+ if self.add_zero_conv:
555
+ assert self.return_all_layers
556
+ self.zero_convs = nn.ModuleList([zero_module(nn.Conv1d(in_channels=width, out_channels=width, kernel_size=1, stride=1, bias=True)) for _ in range(layers)])
557
+
558
+ self.encode_layers = encode_layers
559
+ if self.encode_layers > 0:
560
+ self.encoder = Transformer(width, encode_layers, heads,
561
+ modulate_feature_size=modulate_feature_size,
562
+ modulate_act_type=modulate_act_type,
563
+ cross_att_layers=0,
564
+ return_all_layers=return_all_layers,
565
+ flash_v2=flash_v2,
566
+ qkv_packed=qkv_packed,
567
+ shift_group=shift_group,
568
+ window_size=window_size)
569
+ self.transformer = Transformer(width, layers-encode_layers, heads,
570
+ modulate_feature_size=modulate_feature_size,
571
+ modulate_act_type=modulate_act_type,
572
+ cross_att_layers=0,
573
+ return_all_layers=return_all_layers,
574
+ flash_v2=flash_v2,
575
+ qkv_packed=qkv_packed,
576
+ shift_group=shift_group,
577
+ window_size=window_size)
578
+
579
+ if not self.disable_post_ln:
580
+ self.ln_post = LayerNorm(width)
581
+
582
+ if weight is not None:
583
+ if not self.disable_dino:
584
+ if "clip" in weight:
585
+ raise NotImplementedError()
586
+ elif weight.startswith("vit_b_16"):
587
+ load_timm_to_clip(self, config_name=weight, init_mode=error_weight_init_mode)
588
+ elif weight.startswith("vit_b_8"):
589
+ load_timm_to_clip(self, config_name=weight, init_mode=error_weight_init_mode)
590
+ else:
591
+ raise NotImplementedError()
592
+ else:
593
+ self.apply(_init_weights)
594
+
595
+ # Init the weight and bias of modulation_fn to zero
596
+ if modulate_feature_size != 0:
597
+ for block in self.transformer.resblocks:
598
+ if block.modulation_fn is not None:
599
+ block.modulation_fn[2].weight.data.zero_()
600
+ block.modulation_fn[2].bias.data.zero_()
601
+ if block.mlp_modulation_fn is not None:
602
+ block.mlp_modulation_fn[2].weight.data.zero_()
603
+ block.mlp_modulation_fn[2].bias.data.zero_()
604
+ for block in self.transformer.resblocks:
605
+ if block.cross_att:
606
+ zero_module(block.cross_att.to_out)
607
+
608
+ def set_grad_checkpointing(self, flag=True):
609
+ self.transformer.set_grad_checkpointing(flag)
610
+
611
+ def forward(self,
612
+ x: torch.Tensor,
613
+ modulation: torch.Tensor = None,
614
+ context: torch.Tensor = None,
615
+ additional_residuals=None,
616
+ abla_crossview=False,
617
+ pos=None):
618
+
619
+ # image tokenization
620
+ bs, vs = x.shape[:2]
621
+ x = rearrange(x, 'b v c h w -> (b v) c h w')
622
+ pos = rearrange(pos, 'b v c h -> (b v) c h')
623
+ if self.camera_condition == 'plucker' and modulation is not None:
624
+ modulation = rearrange(modulation, 'b v c h w -> (b v) c h w')
625
+ x = torch.cat([x, modulation], dim=1)
626
+ modulation = None
627
+
628
+ x = self.conv1(x) # shape = [*, width, grid, grid]
629
+ x = x.reshape(x.shape[0], x.shape[1], -1) # shape = [*, width, grid ** 2]
630
+ x = x.permute(0, 2, 1) # shape = [*, grid ** 2, width]
631
+
632
+ # pre-normalization
633
+ x = self.ln_pre(x)
634
+ B, N, C = x.shape
635
+ x = x.reshape(B, N, -1, 64)
636
+ x = x.permute(0, 2, 1, 3)
637
+ # print('pre x mean: ', x.mean().item())
638
+ # print('pre x var: ', x.var().item())
639
+ x = x + self.rope(torch.ones_like(x).to(x), pos)
640
+ # print('x mean: ', x.mean().item())
641
+ # print('x var: ', x.var().item())
642
+
643
+ x = x.permute(0, 2, 1, 3)
644
+ x = x.reshape(B, N, -1)
645
+ # use encode to extract features
646
+ if self.encode_layers > 0:
647
+ x = x.permute(1, 0, 2) # NLD -> LND
648
+ x = self.encoder(x, modulation, context, additional_residuals=additional_residuals)
649
+ x = x.permute(1, 0, 2) # LND -> NLD
650
+ if not self.disable_dino:
651
+ x = x.permute(1, 0, 2) # NLD -> LND
652
+ else:
653
+ if not abla_crossview:
654
+ # flatten x along the video dimension
655
+ x = rearrange(x, '(b v) n d -> b (v n) d', v=vs)
656
+ # print(x.shape)
657
+ x = x.permute(1, 0, 2) # NLD -> LND
658
+ else:
659
+ x = x.permute(1, 0, 2)
660
+ x = self.transformer(x, modulation, context, additional_residuals=additional_residuals)
661
+
662
+
663
+ if self.add_zero_conv:
664
+ assert isinstance(x, (list, tuple))
665
+ assert len(x) == len(self.zero_convs)
666
+ new_x = []
667
+ for sub_x, sub_zero_conv in zip(x, self.zero_convs):
668
+ sub_x_out = sub_zero_conv(sub_x.permute(1, 2, 0))
669
+ new_x.append(sub_x_out.permute(2, 0, 1))
670
+ x = new_x
671
+
672
+ if self.return_all_layers:
673
+ assert isinstance(x, (list, tuple))
674
+ if not self.disable_post_ln:
675
+ x_final = x[-1].permute(1, 0, 2) # LND -> NLD
676
+ x_final = self.ln_post(x_final)
677
+ x_final = rearrange(x_final, 'b (v n) d -> b v n d', v=vs)
678
+ x = [s.permute(1, 0, 2) for s in x]
679
+ x.append(x_final)
680
+ return x
681
+
682
+ if not self.disable_post_ln:
683
+ x = x.permute(1, 0, 2) # LND -> NLD
684
+ x = self.ln_post(x)
685
+ if not self.disable_dino:
686
+ x = rearrange(x, '(b v) n d -> b v n d', b=bs, v=vs)
687
+ else:
688
+ if not abla_crossview:
689
+ # reshape x back to video dimension
690
+ x = rearrange(x, 'b (v n) d -> b v n d', v=vs)
691
+ else:
692
+ x = rearrange(x, '(b v) n d -> b v n d', v=vs)
693
+ return x
694
+
695
+ def extra_repr(self) -> str:
696
+ pass
697
+
698
+
699
+ class VisionTransformer_fusion(nn.Module):
+     def __init__(self,
+                  # transformer params
+                  in_channels: int,
+                  patch_size: int,
+                  width: int,
+                  layers: int,
+                  heads: int,
+                  weight: str = None,
+                  encode_layers: int = 0,
+                  shift_group=False,
+                  flash_v2=False,
+                  qkv_packed=False,
+                  window_size=False,
+                  use_pe=False,
+                  # modulation params
+                  modulate_feature_size: int = None,
+                  modulate_act_type: str = 'gelu',
+                  # camera condition
+                  camera_condition: str = 'plucker',
+                  # init params
+                  disable_dino=False,
+                  error_weight_init_mode='mean',
+                  # other params
+                  add_zero_conv=False,
+                  return_all_layers=False,
+                  disable_post_ln=False,
+                  rope=None):
+         super().__init__()
+         self.patch_size = patch_size
+         self.use_pe = use_pe
+         self.rope = rope
+         self.disable_dino = disable_dino
+         self.ln_pre = LayerNorm(width)
+         self.add_zero_conv = add_zero_conv
+         self.return_all_layers = return_all_layers
+         self.disable_post_ln = disable_post_ln
+         self.flash_v2 = flash_v2
+         self.qkv_packed = qkv_packed
+ 
+         self.camera_condition = camera_condition
+         if self.camera_condition == 'plucker':
+             assert modulate_feature_size is None
+ 
+         if self.add_zero_conv:
+             assert self.return_all_layers
+             self.zero_convs = nn.ModuleList(
+                 [zero_module(nn.Conv1d(in_channels=width, out_channels=width, kernel_size=1, stride=1, bias=True))
+                  for _ in range(layers)])
+ 
+         self.encode_layers = encode_layers
+         if self.encode_layers > 0:
+             self.encoder = Transformer(width, encode_layers, heads,
+                                        modulate_feature_size=modulate_feature_size,
+                                        modulate_act_type=modulate_act_type,
+                                        cross_att_layers=0,
+                                        return_all_layers=return_all_layers,
+                                        flash_v2=flash_v2,
+                                        qkv_packed=qkv_packed,
+                                        shift_group=shift_group,
+                                        window_size=window_size)
+         self.transformer = Transformer(width, layers - encode_layers, heads,
+                                        modulate_feature_size=modulate_feature_size,
+                                        modulate_act_type=modulate_act_type,
+                                        cross_att_layers=0,
+                                        return_all_layers=return_all_layers,
+                                        flash_v2=flash_v2,
+                                        qkv_packed=qkv_packed,
+                                        shift_group=shift_group,
+                                        window_size=window_size)
+ 
+         if not self.disable_post_ln:
+             self.ln_post = LayerNorm(width)
+ 
+         if weight is not None:
+             if not self.disable_dino:
+                 if "clip" in weight:
+                     raise NotImplementedError()
+                 elif weight.startswith("vit_b_16") or weight.startswith("vit_b_8"):
+                     load_timm_to_clip(self, config_name=weight, init_mode=error_weight_init_mode)
+                 else:
+                     raise NotImplementedError()
+             else:
+                 self.apply(_init_weights)
+ 
+         # Init the weight and bias of modulation_fn to zero
+         if modulate_feature_size != 0:
+             for block in self.transformer.resblocks:
+                 if block.modulation_fn is not None:
+                     block.modulation_fn[2].weight.data.zero_()
+                     block.modulation_fn[2].bias.data.zero_()
+                 if block.mlp_modulation_fn is not None:
+                     block.mlp_modulation_fn[2].weight.data.zero_()
+                     block.mlp_modulation_fn[2].bias.data.zero_()
+         for block in self.transformer.resblocks:
+             if block.cross_att:
+                 zero_module(block.cross_att.to_out)
+ 
+     def set_grad_checkpointing(self, flag=True):
+         self.transformer.set_grad_checkpointing(flag)
+ 
+     def forward(self,
+                 x: torch.Tensor,
+                 modulation: torch.Tensor = None,
+                 context: torch.Tensor = None,
+                 additional_residuals=None,
+                 abla_crossview=False,
+                 pos=None):
+ 
+         # image tokenization
+         bs, vs = x.shape[:2]
+         x = rearrange(x, 'b v h g -> (b v) h g')  # shape = [*, grid ** 2, width]
+         pos = rearrange(pos, 'b v c h -> (b v) c h')
+ 
+         # apply rotary position embedding per 64-dim head:
+         # add the rotary embedding of a unit tensor at the given positions
+         B, N, C = x.shape
+         x = x.reshape(B, N, -1, 64)
+         x = x.permute(0, 2, 1, 3)
+         x = x + self.rope(torch.ones_like(x).to(x), pos)
+         x = x.permute(0, 2, 1, 3)
+         x = x.reshape(B, N, -1)
+         # use the encoder to extract per-view features first
+         if self.encode_layers > 0:
+             x = x.permute(1, 0, 2)  # NLD -> LND
+             x = self.encoder(x, modulation, context, additional_residuals=additional_residuals)
+             x = x.permute(1, 0, 2)  # LND -> NLD
+         if not self.disable_dino:
+             x = x.permute(1, 0, 2)  # NLD -> LND
+         else:
+             if not abla_crossview:
+                 # flatten x along the view dimension for cross-view attention
+                 x = rearrange(x, '(b v) n d -> b (v n) d', v=vs)
+                 x = x.permute(1, 0, 2)  # NLD -> LND
+             else:
+                 x = x.permute(1, 0, 2)
+         x = self.transformer(x, modulation, context, additional_residuals=additional_residuals)
+ 
+         if self.add_zero_conv:
+             assert isinstance(x, (list, tuple))
+             assert len(x) == len(self.zero_convs)
+             new_x = []
+             for sub_x, sub_zero_conv in zip(x, self.zero_convs):
+                 sub_x_out = sub_zero_conv(sub_x.permute(1, 2, 0))
+                 new_x.append(sub_x_out.permute(2, 0, 1))
+             x = new_x
+ 
+         if self.return_all_layers:
+             assert isinstance(x, (list, tuple))
+             if not self.disable_post_ln:
+                 x_final = x[-1].permute(1, 0, 2)  # LND -> NLD
+                 x_final = self.ln_post(x_final)
+                 x_final = rearrange(x_final, 'b (v n) d -> b v n d', v=vs)
+                 x = [s.permute(1, 0, 2) for s in x]
+                 x.append(x_final)
+             return x
+ 
+         if not self.disable_post_ln:
+             x = x.permute(1, 0, 2)  # LND -> NLD
+             x = self.ln_post(x)
+         if not self.disable_dino:
+             x = rearrange(x, '(b v) n d -> b v n d', b=bs, v=vs)
+         else:
+             if not abla_crossview:
+                 # reshape x back to the view dimension
+                 x = rearrange(x, 'b (v n) d -> b v n d', v=vs)
+             else:
+                 x = rearrange(x, '(b v) n d -> b v n d', v=vs)
+         return x
+ 
+     def extra_repr(self) -> str:
+         return ''
+ 
+ 
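+ # Minimal usage sketch (shapes only). Assumes `my_rope` matches the 64-dim
+ # per-head split used in forward(), and that inputs are already patchified
+ # tokens of shape [batch, views, num_patches, width]; the names below are
+ # illustrative, not part of this module.
+ #
+ #   vit = VisionTransformer_fusion(in_channels=6, patch_size=16, width=768,
+ #                                  layers=12, heads=12, disable_dino=True,
+ #                                  rope=my_rope)
+ #   tokens = torch.randn(2, 4, 196, 768)     # [b, v, n, d]
+ #   pos = patch_positions                    # hypothetical RoPE grid coordinates, [b, v, n, 2]
+ #   out = vit(tokens, pos=pos)               # -> [2, 4, 196, 768]
+ 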
+ def resize_pos_embed(state_dict, model, interpolation: str = 'bicubic'):
+     """
+     Resize positional embeddings, implementation from google/simclr and open_clip.
+     """
+     # Rescale the grid of position embeddings when loading from state_dict
+     old_pos_embed = state_dict.get('positional_embedding', None)
+     if old_pos_embed is None:
+         return
+ 
+     # Compute the grid size and extra tokens
+     old_pos_len = state_dict["positional_embedding"].shape[0]
+     old_grid_size = round((state_dict["positional_embedding"].shape[0]) ** 0.5)
+     grid_size = round((model.positional_embedding.shape[0]) ** 0.5)
+     if old_grid_size == grid_size:
+         return
+     extra_tokens = old_pos_len - (old_grid_size ** 2)
+ 
+     if extra_tokens:
+         pos_emb_tok, pos_emb_img = old_pos_embed[:extra_tokens], old_pos_embed[extra_tokens:]
+     else:
+         pos_emb_tok, pos_emb_img = None, old_pos_embed
+ 
+     # Only interpolate the positional emb part, not the extra token part.
+     pos_emb_img = pos_emb_img.reshape(1, old_grid_size, old_grid_size, -1).permute(0, 3, 1, 2)
+     pos_emb_img = F.interpolate(
+         pos_emb_img,
+         size=grid_size,
+         mode=interpolation,
+         align_corners=True,
+     )
+     pos_emb_img = pos_emb_img.permute(0, 2, 3, 1).reshape(1, grid_size * grid_size, -1)[0]
+ 
+     # Concatenate the extra tokens back in front of the interpolated grid
+     if pos_emb_tok is not None:
+         new_pos_embed = torch.cat([pos_emb_tok, pos_emb_img], dim=0)
+     else:
+         new_pos_embed = pos_emb_img
+     state_dict['positional_embedding'] = new_pos_embed
+ 
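+ # Sketch of how resize_pos_embed is used when loading a checkpoint trained at a
+ # different input resolution (illustrative):
+ #
+ #   sd = torch.load('vit_b_16_in.pth')['model']
+ #   resize_pos_embed(sd, model)              # rewrites sd['positional_embedding'] in place
+ #   model.load_state_dict(sd, strict=False)
+ 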
+ 
+ myname2timmname = {
+     "vit_b_16_mae": None,
+     "vit_b_16_in": "vit_base_patch16_224",
+     "vit_b_16_in21k": 'vit_base_patch16_224_in21k',
+     "vit_b_16_sam": 'vit_base_patch16_224_sam',
+     "vit_b_16_dino": 'vit_base_patch16_224_dino',
+     "vit_b_16_mill_in21k": 'vit_base_patch16_224_miil_in21k',
+     "vit_b_16_mill": 'vit_base_patch16_224_miil',
+     "vit_b_8_dino": 'vit_base_patch16_224_dino',
+ }
+ 
+ def load_timm_to_clip(module, config_name="vit_b_16_mae", init_mode='zero'):
+     from torch import nn
+     from clip.model import LayerNorm as CLIPLayerNorm
+     from clip.model import QuickGELU
+ 
+     from torch.nn import GELU
+     from torch.nn import LayerNorm
+ 
+     import json
+     now_dir = os.path.abspath(os.path.dirname(__file__))
+     timm2clip = json.load(open(f"{now_dir}/timm2clip_vit_b_16.json"))
+ 
+     assert config_name in myname2timmname, f"The name {config_name} is not one of {list(myname2timmname.keys())}"
+     # Try the known checkpoint locations in order; fail with a clear message otherwise.
+     candidate_paths = [
+         f"/sensei-fs/users/hatan/model/{config_name}.pth",
+         f"/input/yhxu/models/dino_weights/{config_name}.pth",
+         f"/home/yhxu/models/dino_weights/{config_name}.pth",
+         f"/nas2/zifan/checkpoint/dino_weights/{config_name}.pth",
+     ]
+     timm_weight = None
+     for path in candidate_paths:
+         try:
+             timm_weight = torch.load(path)["model"]
+             break
+         except Exception:
+             continue
+     if timm_weight is None:
+         raise FileNotFoundError(
+             "Please download weight with support/dump_timm_weights.py. \n"
+             "If using mae weight, please check https://github.com/facebookresearch/mae, "
+             "and download the weight as vit_b_16_mae.pth")
+ 
+     # Build model's state dict
+     clipname2timmweight = {}
+     for timm_key, clip_key in timm2clip.items():
+         timm_value = timm_weight[timm_key]
+         clipname2timmweight[clip_key[len("visual."):]] = timm_value.squeeze()
+ 
+     # Resize positional embedding
+     resize_pos_embed(clipname2timmweight, module)
+ 
+     # Load weight to model.
+     try:
+         status = module.load_state_dict(clipname2timmweight, strict=False)
+     except RuntimeError:
+         # conv1 shape mismatch: the model expects more input channels than the
+         # pretrained 3-channel patch embedding provides, so inflate it.
+         print('conv1.weight has a shape mismatch, re-initializing it')
+         if init_mode == 'zero':
+             # keep the RGB filters and zero-init the extra channels
+             new_weight = torch.zeros_like(clipname2timmweight['conv1.weight'])
+             new_weight = new_weight.repeat(1, 2, 1, 1)
+             new_weight[:, :3] = clipname2timmweight['conv1.weight']
+         elif init_mode == 'mean':
+             # replicate the RGB filters and rescale so activations keep their magnitude
+             new_weight = clipname2timmweight['conv1.weight'].repeat(1, 3, 1, 1) / 3
+ 
+         clipname2timmweight['conv1.weight'] = new_weight
+         status = module.load_state_dict(clipname2timmweight, strict=False)
+ 
+     # Since the timm model has a bias, we add it back here.
+     module.conv1.bias = nn.Parameter(clipname2timmweight['conv1.bias'])
+ 
+     # Reinit the visual weights that are not covered by timm
+     module.ln_pre.reset_parameters()
+ 
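+ # Worked example of the 'mean' inflation above: for a conv1 expecting 9 input
+ # channels, each 3-channel pretrained filter is tiled 3x along dim=1 and divided
+ # by 3, so feeding the same image into every channel group reproduces the
+ # original pretrained response:
+ #
+ #   w = torch.randn(768, 3, 16, 16)               # pretrained patch-embed filters
+ #   w9 = w.repeat(1, 3, 1, 1) / 3                 # inflated to 9 input channels
+ #   x = torch.randn(1, 3, 16, 16)
+ #   y3 = torch.einsum('oikl,bikl->bo', w, x)
+ #   y9 = torch.einsum('oikl,bikl->bo', w9, x.repeat(1, 3, 1, 1))
+ #   torch.allclose(y3, y9, atol=1e-5)             # True
+ 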
+ def convert_clip_to_timm(module):
+     """Recursively swap CLIP-specific modules for their timm equivalents
+     (module-conversion pattern adapted from detectron2's FrozenBatchNorm converter)."""
+     from clip.model import LayerNorm as CLIPLayerNorm
+     from clip.model import QuickGELU
+     from torch.nn import GELU, LayerNorm
+ 
+     res = module
+     if isinstance(module, CLIPLayerNorm):
+         # Timm uses eps=1e-6 while CLIP uses eps=1e-5
+         res = LayerNorm(module.normalized_shape, eps=1e-6, elementwise_affine=module.elementwise_affine)
+         if module.elementwise_affine:
+             res.weight.data = module.weight.data.clone().detach()
+             res.bias.data = module.bias.data.clone().detach()
+     elif isinstance(module, QuickGELU):
+         # Timm uses GELU while CLIP uses QuickGELU
+         res = GELU()
+     else:
+         for name, child in module.named_children():
+             new_child = convert_clip_to_timm(child)
+             if new_child is not child:
+                 res.add_module(name, new_child)
+     return res
dust3r/croco/models/x_transformer.py ADDED
@@ -0,0 +1,558 @@
+ """from https://github.com/lucidrains/x-transformers"""
2
+ import math
3
+ from random import random
4
+
5
+ import torch
6
+ from torch import nn, einsum
7
+ import torch.nn.functional as F
8
+ from torch.utils.checkpoint import checkpoint
9
+
10
+ from functools import partial, wraps
11
+ from inspect import isfunction
12
+
13
+ from einops import rearrange, repeat, reduce
14
+
15
+
16
+ # constants
17
+
18
+ DEFAULT_DIM_HEAD = 64
19
+
20
+
21
+ # helpers
22
+
23
+ def exists(val):
24
+ return val is not None
25
+
26
+
27
+ def default(val, d):
28
+ if exists(val):
29
+ return val
30
+ return d() if isfunction(d) else d
31
+
32
+
33
+ def cast_tuple(val, depth):
34
+ return val if isinstance(val, tuple) else (val,) * depth
35
+
36
+
37
+ # init helpers
38
+
39
+ def init_zero_(layer):
40
+ nn.init.constant_(layer.weight, 0.)
41
+ if exists(layer.bias):
42
+ nn.init.constant_(layer.bias, 0.)
43
+
44
+
45
+ # keyword argument helpers
46
+
47
+ def pick_and_pop(keys, d):
48
+ values = list(map(lambda key: d.pop(key), keys))
49
+ return dict(zip(keys, values))
50
+
51
+
52
+ def group_dict_by_key(cond, d):
53
+ return_val = [dict(), dict()]
54
+ for key in d.keys():
55
+ match = bool(cond(key))
56
+ ind = int(not match)
57
+ return_val[ind][key] = d[key]
58
+ return (*return_val,)
59
+
60
+
61
+ def string_begins_with(prefix, str):
62
+ return str.startswith(prefix)
63
+
64
+
65
+ def group_by_key_prefix(prefix, d):
66
+ return group_dict_by_key(partial(string_begins_with, prefix), d)
67
+
68
+
69
+ def groupby_prefix_and_trim(prefix, d):
70
+ kwargs_with_prefix, kwargs = group_dict_by_key(partial(string_begins_with, prefix), d)
71
+ kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items())))
72
+ return kwargs_without_prefix, kwargs
73
+
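+ # Example of the prefix grouping used by AttentionLayers below: kwargs that
+ # start with 'ff_' or 'attn_' are split out and stripped of their prefix:
+ #
+ #   ff_kwargs, rest = groupby_prefix_and_trim('ff_', {'ff_mult': 2, 'attn_heads': 8})
+ #   # ff_kwargs == {'mult': 2}, rest == {'attn_heads': 8}
+ 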
+ 
+ # initializations
+ 
+ def deepnorm_init(
+         transformer,
+         beta,
+         module_name_match_list=['.ff.', '.to_v', '.to_out']
+ ):
+     for name, module in transformer.named_modules():
+         if type(module) != nn.Linear:
+             continue
+ 
+         needs_beta_gain = any(map(lambda substr: substr in name, module_name_match_list))
+         gain = beta if needs_beta_gain else 1
+         nn.init.xavier_normal_(module.weight.data, gain=gain)
+ 
+         if exists(module.bias):
+             nn.init.constant_(module.bias.data, 0)
+ 
+ 
+ # activations
+ 
+ class ReluSquared(nn.Module):
+     def forward(self, x):
+         return F.relu(x) ** 2
+ 
+ 
+ # norms
+ 
+ class Scale(nn.Module):
+     def __init__(self, value, fn):
+         super().__init__()
+         self.value = value
+         self.fn = fn
+ 
+     def forward(self, x, **kwargs):
+         out = self.fn(x, **kwargs)
+         scale_fn = lambda t: t * self.value
+ 
+         if not isinstance(out, tuple):
+             return scale_fn(out)
+ 
+         return (scale_fn(out[0]), *out[1:])
+ 
+ 
+ class ScaleNorm(nn.Module):
+     def __init__(self, dim, eps=1e-5):
+         super().__init__()
+         self.eps = eps
+         self.g = nn.Parameter(torch.ones(1) * (dim ** -0.5))
+ 
+     def forward(self, x):
+         norm = torch.norm(x, dim=-1, keepdim=True)
+         return x / norm.clamp(min=self.eps) * self.g
+ 
+ 
+ class RMSNorm(nn.Module):
+     def __init__(self, dim, eps=1e-8):
+         super().__init__()
+         self.scale = dim ** -0.5
+         self.eps = eps
+         self.g = nn.Parameter(torch.ones(dim))
+ 
+     def forward(self, x):
+         norm = torch.norm(x, dim=-1, keepdim=True) * self.scale
+         return x / norm.clamp(min=self.eps) * self.g
+ 
+ 
+ # residual and residual gates
+ 
+ class Residual(nn.Module):
+     def __init__(self, dim, scale_residual=False, scale_residual_constant=1.):
+         super().__init__()
+         self.residual_scale = nn.Parameter(torch.ones(dim)) if scale_residual else None
+         self.scale_residual_constant = scale_residual_constant
+ 
+     def forward(self, x, residual):
+         if exists(self.residual_scale):
+             residual = residual * self.residual_scale
+ 
+         if self.scale_residual_constant != 1:
+             residual = residual * self.scale_residual_constant
+ 
+         return x + residual
+ 
+ 
+ class GRUGating(nn.Module):
+     def __init__(self, dim, scale_residual=False, **kwargs):
+         super().__init__()
+         self.gru = nn.GRUCell(dim, dim)
+         self.residual_scale = nn.Parameter(torch.ones(dim)) if scale_residual else None
+ 
+     def forward(self, x, residual):
+         if exists(self.residual_scale):
+             residual = residual * self.residual_scale
+ 
+         gated_output = self.gru(
+             rearrange(x, 'b n d -> (b n) d'),
+             rearrange(residual, 'b n d -> (b n) d')
+         )
+ 
+         return gated_output.reshape_as(x)
+ 
+ 
+ # feedforward
+ 
+ class GLU(nn.Module):
+     def __init__(self, dim_in, dim_out, activation):
+         super().__init__()
+         self.act = activation
+         self.proj = nn.Linear(dim_in, dim_out * 2)
+ 
+     def forward(self, x):
+         x, gate = self.proj(x).chunk(2, dim=-1)
+         return x * self.act(gate)
+ 
+ 
+ class FeedForward(nn.Module):
+     def __init__(
+             self,
+             dim,
+             dim_out=None,
+             mult=4,
+             glu=False,
+             swish=False,
+             relu_squared=False,
+             post_act_ln=False,
+             dropout=0.,
+             no_bias=False,
+             zero_init_output=False
+     ):
+         super().__init__()
+         inner_dim = int(dim * mult)
+         dim_out = default(dim_out, dim)
+ 
+         if relu_squared:
+             activation = ReluSquared()
+         elif swish:
+             activation = nn.SiLU()
+         else:
+             activation = nn.GELU()
+ 
+         project_in = nn.Sequential(
+             nn.Linear(dim, inner_dim, bias=not no_bias),
+             activation
+         ) if not glu else GLU(dim, inner_dim, activation)
+ 
+         self.ff = nn.Sequential(
+             project_in,
+             nn.LayerNorm(inner_dim) if post_act_ln else nn.Identity(),
+             nn.Dropout(dropout),
+             nn.Linear(inner_dim, dim_out, bias=not no_bias)
+         )
+ 
+         # init last linear layer to 0
+         if zero_init_output:
+             init_zero_(self.ff[-1])
+ 
+     def forward(self, x):
+         return self.ff(x)
+ 
+ 
+ # attention
+ 
+ class Attention(nn.Module):
+     def __init__(
+             self,
+             dim,
+             kv_dim=None,
+             dim_head=DEFAULT_DIM_HEAD,
+             heads=8,
+             causal=False,
+             dropout=0.,
+             zero_init_output=False,
+             shared_kv=False,
+             value_dim_head=None,
+             flash_attention=True,
+     ):
+         super().__init__()
+         self.scale = dim_head ** -0.5
+         if kv_dim is None:
+             kv_dim = dim
+ 
+         self.heads = heads
+         self.causal = causal
+ 
+         value_dim_head = default(value_dim_head, dim_head)
+         q_dim = k_dim = dim_head * heads
+         v_dim = out_dim = value_dim_head * heads
+ 
+         self.to_q = nn.Linear(dim, q_dim, bias=False)
+         self.to_k = nn.Linear(kv_dim, k_dim, bias=False)
+ 
+         # shared key / values, for further memory savings during inference
+         assert not (shared_kv and value_dim_head != dim_head), \
+             'key and value head dimensions must be equal for shared key / values'
+         self.to_v = nn.Linear(kv_dim, v_dim, bias=False) if not shared_kv else None
+ 
+         # output projection
+         self.to_out = nn.Linear(out_dim, dim, bias=False)
+ 
+         # dropout
+         self.dropout_p = dropout
+         self.dropout = nn.Dropout(dropout)
+ 
+         # Flash Attention, needs PyTorch >= 1.13
+         self.flash = flash_attention
+         assert self.flash
+ 
+         # Use torch.nn.functional.scaled_dot_product_attention if available;
+         # otherwise fall back to the xformers library.
+         self.use_xformer = not hasattr(torch.nn.functional, 'scaled_dot_product_attention')
+ 
+         # init output projection to 0
+         if zero_init_output:
+             init_zero_(self.to_out)
+ 
+     def forward(
+             self,
+             x,
+             context=None,
+             mask=None,
+             context_mask=None,
+     ):
+         h = self.heads
+         kv_input = default(context, x)
+ 
+         q_input = x
+         k_input = kv_input
+         v_input = kv_input
+ 
+         q = self.to_q(q_input)
+         k = self.to_k(k_input)
+         v = self.to_v(v_input) if exists(self.to_v) else k
+ 
+         if self.use_xformer:
+             # Since xformers only accepts bf16/fp16, we need to convert qkv to bf16/fp16
+             dtype = q.dtype
+             q, k, v = map(lambda t: t.bfloat16() if t.dtype == torch.float32 else t, (q, k, v))
+ 
+             q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b n h d', h=h), (q, k, v))
+             try:
+                 import xformers.ops as xops
+             except ImportError as e:
+                 print("Please install xformers to use flash attention for PyTorch < 2.0.0.")
+                 raise e
+ 
+             # Use the flash attention support from the xformers library
+             if self.causal:
+                 attention_bias = xops.LowerTriangularMask()
+             else:
+                 attention_bias = None
+ 
+             # memory_efficient_attention takes the input as (batch, seq_len, heads, dim)
+             out = xops.memory_efficient_attention(
+                 q, k, v, attn_bias=attention_bias,
+             )
+ 
+             out = out.to(dtype)
+ 
+             out = rearrange(out, 'b n h d -> b n (h d)')
+         else:
+             q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=h), (q, k, v))
+             # efficient attention using Flash Attention CUDA kernels
+             out = torch.nn.functional.scaled_dot_product_attention(
+                 q, k, v, attn_mask=None, dropout_p=self.dropout_p, is_causal=self.causal,
+             )
+             out = rearrange(out, 'b h n d -> b n (h d)')
+ 
+         out = self.to_out(out)
+ 
+         if exists(mask):
+             mask = rearrange(mask, 'b n -> b n 1')
+             out = out.masked_fill(~mask, 0.)
+ 
+         return out
+ 
+     def extra_repr(self) -> str:
+         return f"causal: {self.causal}, flash attention: {self.flash}, " \
+                f"use_xformers (if False, use torch.nn.functional.scaled_dot_product_attention): {self.use_xformer}"
+ 
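+ # Shape sketch for Attention (illustrative values): self-attention when
+ # `context` is None, cross-attention otherwise.
+ #
+ #   attn = Attention(dim=512, heads=8)
+ #   x = torch.randn(2, 100, 512)
+ #   out = attn(x)                        # -> [2, 100, 512]
+ #   ctx = torch.randn(2, 77, 512)
+ #   out = attn(x, context=ctx)           # -> [2, 100, 512], keys/values from ctx
+ 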
+ 
+ def modulate(x, shift, scale):
+     # from https://github.com/facebookresearch/DiT/blob/796c29e532f47bba17c5b9c5eb39b9354b8b7c64/models.py#L19
+     return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
+ 
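+ # Worked example of the DiT-style modulation: per-sample shift/scale vectors of
+ # shape [b, d] are broadcast over the sequence dimension of x ([b, n, d]):
+ #
+ #   x = torch.ones(2, 4, 8)
+ #   shift, scale = torch.zeros(2, 8), torch.full((2, 8), 0.5)
+ #   modulate(x, shift, scale)            # every element becomes 1 * (1 + 0.5) = 1.5
+ 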
+ 
+ class AttentionLayers(nn.Module):
+     def __init__(
+             self,
+             dim,
+             depth,
+             heads=8,
+             ctx_dim=None,
+             causal=False,
+             cross_attend=False,
+             only_cross=False,
+             use_scalenorm=False,
+             use_rmsnorm=False,
+             residual_attn=False,
+             cross_residual_attn=False,
+             macaron=False,
+             pre_norm=True,
+             gate_residual=False,
+             scale_residual=False,
+             scale_residual_constant=1.,
+             deepnorm=False,
+             sandwich_norm=False,
+             zero_init_branch_output=False,
+             layer_dropout=0.,
+             # Below are the arguments used for this img2nerf project
+             modulate_feature_size=None,  # None disables the DiT-style modulation branch
+             checkpointing=False,
+             checkpoint_every=1,
+             **kwargs
+     ):
+         super().__init__()
+ 
+         # Add checkpointing
+         self.checkpointing = checkpointing
+         self.checkpoint_every = checkpoint_every
+ 
+         ff_kwargs, kwargs = groupby_prefix_and_trim('ff_', kwargs)
+         attn_kwargs, kwargs = groupby_prefix_and_trim('attn_', kwargs)
+ 
+         self.dim = dim
+         self.depth = depth
+         self.layers = nn.ModuleList([])
+ 
+         # determine deepnorm and residual scale
+         if deepnorm:
+             assert scale_residual_constant == 1, 'scale residual constant is being overridden by deep norm settings'
+             pre_norm = sandwich_norm = False
+             scale_residual = True
+             scale_residual_constant = (2 * depth) ** 0.25
+ 
+         assert not (not pre_norm and sandwich_norm), 'sandwich norm cannot be used when not using prenorm'
+         self.pre_norm = pre_norm
+         self.sandwich_norm = sandwich_norm
+ 
+         self.residual_attn = residual_attn
+         self.cross_residual_attn = cross_residual_attn
+         self.cross_attend = cross_attend
+ 
+         norm_class = ScaleNorm if use_scalenorm else nn.LayerNorm
+         norm_class = RMSNorm if use_rmsnorm else norm_class
+         norm_fn = partial(norm_class, dim)
+ 
+         if cross_attend and not only_cross:
+             default_block = ('a', 'c', 'f')
+         elif cross_attend and only_cross:
+             default_block = ('c', 'f')
+         else:
+             default_block = ('a', 'f')
+ 
+         if macaron:
+             default_block = ('f',) + default_block
+ 
+         # zero init
+         if zero_init_branch_output:
+             attn_kwargs = {**attn_kwargs, 'zero_init_output': True}
+             ff_kwargs = {**ff_kwargs, 'zero_init_output': True}
+ 
+         # calculate layer block order
+         layer_types = default_block * depth
+         self.layer_types = layer_types
+ 
+         # stochastic depth
+         self.layer_dropouts = cast_tuple(layer_dropout, len(layer_types))
+ 
+         # iterate and construct layers
+         for ind, layer_type in enumerate(self.layer_types):
+             is_last_layer = ind == (len(self.layer_types) - 1)
+ 
+             if layer_type == 'a':
+                 layer = Attention(dim, heads=heads, causal=causal, **attn_kwargs)
+             elif layer_type == 'c':
+                 layer = Attention(dim, kv_dim=ctx_dim, heads=heads, **attn_kwargs)
+             elif layer_type == 'f':
+                 layer = FeedForward(dim, **ff_kwargs)
+                 layer = layer if not macaron else Scale(0.5, layer)
+             else:
+                 raise Exception(f'invalid layer type {layer_type}')
+ 
+             residual_fn = GRUGating if gate_residual else Residual
+             residual = residual_fn(dim, scale_residual=scale_residual, scale_residual_constant=scale_residual_constant)
+ 
+             pre_branch_norm = norm_fn() if pre_norm else None
+             post_branch_norm = norm_fn() if sandwich_norm else None
+             post_main_norm = norm_fn() if not pre_norm and not is_last_layer else None
+ 
+             # The whole modulation part is copied from DiT
+             # https://github.com/facebookresearch/DiT
+             modulation = None
+             if modulate_feature_size is not None:
+                 modulation = nn.Sequential(
+                     nn.LayerNorm(modulate_feature_size),
+                     nn.GELU(),
+                     nn.Linear(modulate_feature_size, 3 * dim, bias=True)
+                 )
+ 
+             norms = nn.ModuleList([
+                 pre_branch_norm,
+                 post_branch_norm,
+                 post_main_norm,
+             ])
+ 
+             self.layers.append(nn.ModuleList([
+                 norms,
+                 layer,
+                 residual,
+                 modulation,
+             ]))
+ 
+         if deepnorm:
+             init_gain = (8 * depth) ** -0.25
+             deepnorm_init(self, init_gain)
+ 
+     def forward(
+             self,
+             x,
+             context=None,
+             modulation=None,
+             mask=None,
+             context_mask=None,
+     ):
+         assert not (self.cross_attend ^ exists(context)), 'context must be passed in if cross_attend is set to True'
+ 
+         num_layers = len(self.layer_types)
+         assert num_layers % self.checkpoint_every == 0
+ 
+         for start_layer_idx in range(0, num_layers, self.checkpoint_every):
+             end_layer_idx = min(start_layer_idx + self.checkpoint_every, num_layers)
+ 
+             def run_layers(x, context, modulation, start, end):
+                 for ind, (layer_type, (norm, block, residual_fn, modulation_fn), layer_dropout) in enumerate(
+                         zip(self.layer_types[start: end], self.layers[start: end], self.layer_dropouts[start: end])):
+                     residual = x
+ 
+                     pre_branch_norm, post_branch_norm, post_main_norm = norm
+ 
+                     if exists(pre_branch_norm):
+                         x = pre_branch_norm(x)
+ 
+                     if modulation_fn is not None:
+                         shift, scale, gate = modulation_fn(modulation).chunk(3, dim=1)
+                         x = modulate(x, shift, scale)
+ 
+                     if layer_type == 'a':
+                         out = block(x, mask=mask)
+                     elif layer_type == 'c':
+                         out = block(x, context=context, mask=mask, context_mask=context_mask)
+                     elif layer_type == 'f':
+                         out = block(x)
+ 
+                     if exists(post_branch_norm):
+                         out = post_branch_norm(out)
+ 
+                     if modulation_fn is not None:
+                         # TODO: add an option to use the gate or not.
+                         out = out * gate.unsqueeze(1)
+ 
+                     x = residual_fn(out, residual)
+ 
+                     if exists(post_main_norm):
+                         x = post_main_norm(x)
+ 
+                 return x
+ 
+             if self.checkpointing:
+                 x = checkpoint(run_layers, x, context, modulation, start_layer_idx, end_layer_idx)
+             else:
+                 x = run_layers(x, context, modulation, start_layer_idx, end_layer_idx)
+ 
+         return x
+ 
dust3r/croco/utils/misc.py ADDED
@@ -0,0 +1,583 @@
+ # Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+ #
+ # --------------------------------------------------------
+ # utility functions for CroCo
+ # --------------------------------------------------------
+ # References:
+ # MAE: https://github.com/facebookresearch/mae
+ # DeiT: https://github.com/facebookresearch/deit
+ # BEiT: https://github.com/microsoft/unilm/tree/master/beit
+ # --------------------------------------------------------
+ 
+ import builtins
+ import datetime
+ import os
+ import time
+ import math
+ import json
+ from collections import defaultdict, deque
+ from pathlib import Path
+ import numpy as np
+ from datetime import timedelta
+ import torch
+ import torch.distributed as dist
+ from torch import inf, Tensor
+ import functools
+ from typing import cast, Dict, Iterable, List, Optional, Tuple, Union
+ # private helpers used by clip_grad_norm_ below (available in recent torch, >= 2.x)
+ from torch.utils._foreach_utils import (_group_tensors_by_device_and_dtype,
+                                         _has_foreach_support, _device_has_foreach_support)
+ 
+ class SmoothedValue(object):
+     """Track a series of values and provide access to smoothed values over a
+     window or the global series average.
+     """
+ 
+     def __init__(self, window_size=20, fmt=None):
+         if fmt is None:
+             fmt = "{median:.4f} ({global_avg:.4f})"
+         self.deque = deque(maxlen=window_size)
+         self.total = 0.0
+         self.count = 0
+         self.fmt = fmt
+ 
+     def update(self, value, n=1):
+         self.deque.append(value)
+         self.count += n
+         self.total += value * n
+ 
+     def synchronize_between_processes(self):
+         """
+         Warning: does not synchronize the deque!
+         """
+         if not is_dist_avail_and_initialized():
+             return
+         t = torch.tensor([self.count, self.total], dtype=torch.float64, device='cuda')
+         dist.barrier()
+         dist.all_reduce(t)
+         t = t.tolist()
+         self.count = int(t[0])
+         self.total = t[1]
+ 
+     @property
+     def median(self):
+         d = torch.tensor(list(self.deque))
+         return d.median().item()
+ 
+     @property
+     def avg(self):
+         d = torch.tensor(list(self.deque), dtype=torch.float32)
+         return d.mean().item()
+ 
+     @property
+     def global_avg(self):
+         return self.total / self.count
+ 
+     @property
+     def max(self):
+         return max(self.deque)
+ 
+     @property
+     def value(self):
+         return self.deque[-1]
+ 
+     def __str__(self):
+         return self.fmt.format(
+             median=self.median,
+             avg=self.avg,
+             global_avg=self.global_avg,
+             max=self.max,
+             value=self.value)
+ 
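+ # Example (illustrative): smoothing a scalar over a window.
+ #
+ #   meter = SmoothedValue(window_size=20)
+ #   meter.update(0.5); meter.update(0.3); meter.update(0.4)
+ #   str(meter)   # -> '0.4000 (0.4000)' with the default '{median} ({global_avg})' format
+ 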
+ 
+ class MetricLogger(object):
+     def __init__(self, delimiter="\t"):
+         self.meters = defaultdict(SmoothedValue)
+         self.delimiter = delimiter
+ 
+     def update(self, **kwargs):
+         for k, v in kwargs.items():
+             if v is None:
+                 continue
+             if isinstance(v, torch.Tensor):
+                 v = v.item()
+             assert isinstance(v, (float, int))
+             self.meters[k].update(v)
+ 
+     def __getattr__(self, attr):
+         if attr in self.meters:
+             return self.meters[attr]
+         if attr in self.__dict__:
+             return self.__dict__[attr]
+         raise AttributeError("'{}' object has no attribute '{}'".format(
+             type(self).__name__, attr))
+ 
+     def __str__(self):
+         loss_str = []
+         for name, meter in self.meters.items():
+             loss_str.append(
+                 "{}: {}".format(name, str(meter))
+             )
+         return self.delimiter.join(loss_str)
+ 
+     def synchronize_between_processes(self):
+         for meter in self.meters.values():
+             meter.synchronize_between_processes()
+ 
+     def add_meter(self, name, meter):
+         self.meters[name] = meter
+ 
+     def log_every(self, iterable, print_freq, header=None, max_iter=None):
+         if not header:
+             header = ''
+         start_time = time.time()
+         end = time.time()
+         iter_time = SmoothedValue(fmt='{avg:.4f}')
+         data_time = SmoothedValue(fmt='{avg:.4f}')
+         len_iterable = min(len(iterable), max_iter) if max_iter else len(iterable)
+         space_fmt = ':' + str(len(str(len_iterable))) + 'd'
+         log_msg = [
+             header,
+             '[{0' + space_fmt + '}/{1}]',
+             'eta: {eta}',
+             '{meters}',
+             'time: {time}',
+             'data: {data}'
+         ]
+         if torch.cuda.is_available():
+             log_msg.append('max mem: {memory:.0f}')
+         log_msg = self.delimiter.join(log_msg)
+         MB = 1024.0 * 1024.0
+         for i, obj in enumerate(iterable):
+             data_time.update(time.time() - end)
+             yield obj
+             iter_time.update(time.time() - end)
+             if i % print_freq == 0 or i == len_iterable - 1:
+                 eta_seconds = iter_time.global_avg * (len_iterable - i)
+                 eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))
+                 if torch.cuda.is_available():
+                     print(log_msg.format(
+                         i, len_iterable, eta=eta_string,
+                         meters=str(self),
+                         time=str(iter_time), data=str(data_time),
+                         memory=torch.cuda.max_memory_allocated() / MB))
+                 else:
+                     print(log_msg.format(
+                         i, len_iterable, eta=eta_string,
+                         meters=str(self),
+                         time=str(iter_time), data=str(data_time)))
+             end = time.time()
+             if max_iter and i + 1 >= max_iter:
+                 break
+         total_time = time.time() - start_time
+         total_time_str = str(datetime.timedelta(seconds=int(total_time)))
+         print('{} Total time: {} ({:.4f} s / it)'.format(
+             header, total_time_str, total_time / len_iterable))
+ 
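+ # Example (illustrative): wrapping a dataloader to get periodic progress lines.
+ #
+ #   logger = MetricLogger(delimiter="  ")
+ #   for batch in logger.log_every(data_loader, print_freq=50, header='Epoch [0]'):
+ #       loss = train_step(batch)      # hypothetical step function
+ #       logger.update(loss=loss)
+ 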
178
+
179
+ def setup_for_distributed(is_master):
180
+ """
181
+ This function disables printing when not in master process
182
+ """
183
+ builtin_print = builtins.print
184
+
185
+ def print(*args, **kwargs):
186
+ force = kwargs.pop('force', False)
187
+ force = force #or (get_world_size() > 8)
188
+ if is_master or force:
189
+ now = datetime.datetime.now().time()
190
+ builtin_print('[{}] '.format(now), end='') # print with time stamp
191
+ builtin_print(*args, **kwargs)
192
+
193
+ builtins.print = print
194
+
195
+
196
+ def is_dist_avail_and_initialized():
197
+ if not dist.is_available():
198
+ return False
199
+ if not dist.is_initialized():
200
+ return False
201
+ return True
202
+
203
+
204
+ def get_world_size():
205
+ if not is_dist_avail_and_initialized():
206
+ return 1
207
+ return dist.get_world_size()
208
+
209
+
210
+ def get_rank():
211
+ if not is_dist_avail_and_initialized():
212
+ return 0
213
+ return dist.get_rank()
214
+
215
+
216
+ def is_main_process():
217
+ return get_rank() == 0
218
+
219
+
220
+ def save_on_master(*args, **kwargs):
221
+ if is_main_process():
222
+ torch.save(*args, **kwargs)
223
+
+ 
+ def init_distributed_mode(args):
+     nodist = args.nodist if hasattr(args, 'nodist') else False
+     if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ and not nodist:
+         args.rank = int(os.environ["RANK"])
+         args.world_size = int(os.environ['WORLD_SIZE'])
+         args.gpu = int(os.environ['LOCAL_RANK'])
+     else:
+         print('Not using distributed mode')
+         setup_for_distributed(is_master=True)  # hack
+         args.distributed = False
+         return
+ 
+     args.distributed = True
+     torch.cuda.set_device(args.gpu)
+     args.dist_backend = 'nccl'
+     print('| distributed init (rank {}): {}, gpu {}'.format(
+         args.rank, args.dist_url, args.gpu), flush=True)
+     torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
+                                          timeout=timedelta(seconds=72000000),
+                                          world_size=args.world_size, rank=args.rank)
+     torch.distributed.barrier()
+     setup_for_distributed(args.gpu == 0)
+ 
+ def _no_grad(func):
+     """
+     This wrapper is needed to avoid a circular import when using @torch.no_grad on the exposed functions
+     clip_grad_norm_ and clip_grad_value_ themselves.
+     """
+ 
+     def _no_grad_wrapper(*args, **kwargs):
+         with torch.no_grad():
+             return func(*args, **kwargs)
+ 
+     functools.update_wrapper(_no_grad_wrapper, func)
+     return _no_grad_wrapper
+ 
+ 
+ @_no_grad
+ def clip_grad_norm_(
+         parameters,
+         max_norm,
+         norm_type=2.0,
+         error_if_nonfinite=False,
+         foreach=None,
+ ):
+     r"""Clip the gradient norm of an iterable of parameters.
+ 
+     The norm is computed over the norms of the individual gradients of all parameters,
+     as if the norms of the individual gradients were concatenated into a single vector.
+     Gradients are modified in-place.
+ 
+     Args:
+         parameters (Iterable[Tensor] or Tensor): an iterable of Tensors or a
+             single Tensor that will have gradients normalized
+         max_norm (float): max norm of the gradients
+         norm_type (float): type of the used p-norm. Can be ``'inf'`` for
+             infinity norm.
+         error_if_nonfinite (bool): if True, an error is thrown if the total
+             norm of the gradients from :attr:`parameters` is ``nan``,
+             ``inf``, or ``-inf``. Default: False (will switch to True in the future)
+         foreach (bool): use the faster foreach-based implementation.
+             If ``None``, use the foreach implementation for CUDA and CPU native tensors and silently
+             fall back to the slow implementation for other device types.
+             Default: ``None``
+ 
+     Returns:
+         Total norm of the parameter gradients (viewed as a single vector).
+     """
+     if isinstance(parameters, torch.Tensor):
+         parameters = [parameters]
+     grads = [p.grad for p in parameters if p.grad is not None]
+     max_norm = float(max_norm)
+     norm_type = float(norm_type)
+     if len(grads) == 0:
+         return torch.tensor(0.0)
+     first_device = grads[0].device
+     grouped_grads: Dict[
+         Tuple[torch.device, torch.dtype], Tuple[List[List[Tensor]], List[int]]
+     ] = _group_tensors_by_device_and_dtype(
+         [grads]
+     )  # type: ignore[assignment]
+ 
+     norms: List[Tensor] = []
+     for (device, _), ([device_grads], _) in grouped_grads.items():  # type: ignore[assignment]
+         if (foreach is None and _has_foreach_support(device_grads, device)) or (
+                 foreach and _device_has_foreach_support(device)
+         ):
+             norms.extend(torch._foreach_norm(device_grads, norm_type))
+         elif foreach:
+             raise RuntimeError(
+                 f"foreach=True was passed, but can't use the foreach API on {device.type} tensors"
+             )
+         else:
+             norms.extend([torch.linalg.vector_norm(g, norm_type) for g in device_grads])
+ 
+     total_norm = torch.linalg.vector_norm(
+         torch.stack([norm.to(first_device) for norm in norms]), norm_type
+     )
+ 
+     if error_if_nonfinite and torch.logical_or(total_norm.isnan(), total_norm.isinf()):
+         raise RuntimeError(
+             f'The total norm of order {norm_type} for gradients from `parameters` '
+             'is non-finite, so it cannot be clipped.')
+ 
+     # apply the in-place clipping promised by the docstring (as in torch.nn.utils.clip_grad_norm_);
+     # multiplying by the coefficient clamped to 1 is a no-op when the norm is already small
+     clip_coef = max_norm / (total_norm + 1e-6)
+     clip_coef_clamped = torch.clamp(clip_coef, max=1.0)
+     for (device, _), ([device_grads], _) in grouped_grads.items():  # type: ignore[assignment]
+         if (foreach is None and _has_foreach_support(device_grads, device)) or (
+                 foreach and _device_has_foreach_support(device)
+         ):
+             torch._foreach_mul_(device_grads, clip_coef_clamped.to(device))
+         else:
+             clip_coef_clamped_device = clip_coef_clamped.to(device)
+             for g in device_grads:
+                 g.mul_(clip_coef_clamped_device)
+ 
+     return total_norm
+ 
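+ # Typical use inside a training step (illustrative):
+ #
+ #   loss.backward()
+ #   total_norm = clip_grad_norm_(model.parameters(), max_norm=1.0)
+ #   optimizer.step(); optimizer.zero_grad()
+ 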
+ 
+ class NativeScalerWithGradNormCount:
+     state_dict_key = "amp_scaler"
+ 
+     def __init__(self, enabled=True):
+         # NOTE: the scaler is deliberately constructed disabled; backward/step
+         # still run through it so the interface stays the same for fp32/bf16 training.
+         self._scaler = torch.cuda.amp.GradScaler(enabled=False)
+ 
+     def __call__(self, loss, optimizer, clip_grad=10, parameters=None, create_graph=False, update_grad=True):
+         self._scaler.scale(loss).backward(create_graph=create_graph)
+         if update_grad:
+             self._scaler.unscale_(optimizer)  # unscale the gradients of optimizer's assigned params in-place
+             with torch.no_grad():
+                 if isinstance(parameters, torch.Tensor):
+                     parameters = [parameters]
+                 parameters = [p for p in parameters if p.grad is not None]
+                 for p in parameters:
+                     # replace non-finite gradient entries before clipping
+                     p.grad.nan_to_num_(nan=0., posinf=1e-3, neginf=-1e-3)
+                 norm = torch.nn.utils.clip_grad_norm_(parameters, 10.)
+             self._scaler.step(optimizer)
+             self._scaler.update()
+         else:
+             norm = None
+         return norm
+ 
+     def state_dict(self):
+         return self._scaler.state_dict()
+ 
+     def load_state_dict(self, state_dict):
+         self._scaler.load_state_dict(state_dict)
+ 
+ 
+ def get_grad_norm_(parameters, norm_type: float = 2.0) -> torch.Tensor:
+     if isinstance(parameters, torch.Tensor):
+         parameters = [parameters]
+     parameters = [p for p in parameters if p.grad is not None]
+     norm_type = float(norm_type)
+     if len(parameters) == 0:
+         return torch.tensor(0.)
+     device = parameters[0].grad.device
+     if norm_type == inf:
+         total_norm = max(p.grad.detach().abs().max().to(device) for p in parameters)
+     else:
+         total_norm = torch.norm(torch.stack([torch.norm(p.grad.detach(), norm_type).to(device) for p in parameters]), norm_type)
+ 
+     return total_norm
+ 
+ 
+ def save_model(args, epoch, model_without_ddp, optimizer, loss_scaler, fname=None, best_so_far=None):
+     output_dir = Path(args.output_dir)
+     if fname is None: fname = str(epoch)
+     checkpoint_path = output_dir / ('checkpoint-%s.pth' % fname)
+     to_save = {
+         'model': model_without_ddp.state_dict(),
+         'optimizer': optimizer.state_dict(),
+         'scaler': loss_scaler.state_dict(),
+         'args': args,
+         'epoch': epoch,
+     }
+     if best_so_far is not None: to_save['best_so_far'] = best_so_far
+     print(f'>> Saving model to {checkpoint_path} ...')
+     save_on_master(to_save, checkpoint_path)
+     if is_main_process():
+         os.system('ossutil64 cp -f %s oss://antsys-vilab/zsz/checkpoints/%s -j 200' % (checkpoint_path, checkpoint_path))
+ 
+ 
+ def load_model(args, model_without_ddp, optimizer, loss_scaler):
+     args.start_epoch = 0
+     best_so_far = None
+     if args.resume is not None:
+         if args.resume.startswith('https'):
+             checkpoint = torch.hub.load_state_dict_from_url(
+                 args.resume, map_location='cpu', check_hash=True)
+         else:
+             checkpoint = torch.load(args.resume, map_location='cpu')
+         print("Resume checkpoint %s" % args.resume)
+         model_without_ddp.load_state_dict(checkpoint['model'], strict=False)
+         args.start_epoch = checkpoint['epoch'] + 1
+         optimizer.load_state_dict(checkpoint['optimizer'])
+         if 'scaler' in checkpoint:
+             loss_scaler.load_state_dict(checkpoint['scaler'])
+         print("With optim & sched! start_epoch={:d}".format(args.start_epoch), end='')
+         if 'best_so_far' in checkpoint:
+             best_so_far = checkpoint['best_so_far']
+             print(" & best_so_far={:g}".format(best_so_far))
+         else:
+             print("")
+     return best_so_far
+ 
+ def all_reduce_mean(x):
+     world_size = get_world_size()
+     if world_size > 1:
+         x_reduce = torch.tensor(x).cuda()
+         dist.all_reduce(x_reduce)
+         x_reduce /= world_size
+         return x_reduce.item()
+     else:
+         return x
+ 
+ 
+ def _replace(text, src, tgt, rm=''):
+     """ Advanced string replacement.
+     Given a text:
+     - replace each character in src by the corresponding character in tgt
+     - remove all characters in rm
+     """
+     if len(tgt) == 1:
+         tgt = tgt * len(src)
+     assert len(src) == len(tgt), f"'{src}' and '{tgt}' should have the same len"
+     for s, t in zip(src, tgt):
+         text = text.replace(s, t)
+     for c in rm:
+         text = text.replace(c, '')
+     return text
+ 
+ 
+ def filename(obj):
+     r""" transform a python obj or cmd into a proper filename.
+     - \1 gets replaced by slash '/'
+     - \2 gets replaced by comma ','
+     """
+     if not isinstance(obj, str):
+         obj = repr(obj)
+     obj = str(obj).replace('()', '')
+     obj = _replace(obj, '_,(*/\1\2', '-__x%/,', rm=' )\'"')
+     assert all(len(s) < 256 for s in obj.split(os.sep)), 'filename too long (>256 characters):\n' + obj
+     return obj
+ 
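+ # Example of the mapping above: underscores become dashes, commas and opening
+ # parens become underscores, '*' becomes 'x', '/' becomes '%', and spaces,
+ # quotes and closing parens are dropped:
+ #
+ #   filename("f(a, b=1)")   # -> 'f_a_b=1'
+ 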
+ def _get_num_layer_for_vit(var_name, enc_depth, dec_depth):
+     if var_name in ("cls_token", "mask_token", "pos_embed", "global_tokens"):
+         return 0
+     elif var_name.startswith("patch_embed"):
+         return 0
+     elif var_name.startswith("enc_blocks"):
+         layer_id = int(var_name.split('.')[1])
+         return layer_id + 1
+     elif var_name.startswith('decoder_embed') or var_name.startswith('enc_norm'):  # part of the last block
+         return enc_depth
+     elif var_name.startswith('dec_blocks'):
+         layer_id = int(var_name.split('.')[1])
+         return enc_depth + layer_id + 1
+     elif var_name.startswith('dec_norm'):  # part of the last block
+         return enc_depth + dec_depth
+     elif any(var_name.startswith(k) for k in ['head', 'prediction_head']):
+         return enc_depth + dec_depth + 1
+     else:
+         raise NotImplementedError(var_name)
+ 
+ def get_parameter_groups(model, weight_decay, layer_decay=1.0, skip_list=(), no_lr_scale_list=[]):
+     parameter_group_names = {}
+     parameter_group_vars = {}
+     enc_depth, dec_depth = None, None
+     # prepare layer decay values
+     assert layer_decay == 1.0 or 0. < layer_decay < 1.
+     if layer_decay < 1.:
+         enc_depth = model.enc_depth
+         dec_depth = model.dec_depth if hasattr(model, 'dec_blocks') else 0
+         num_layers = enc_depth + dec_depth
+         layer_decay_values = list(layer_decay ** (num_layers + 1 - i) for i in range(num_layers + 2))
+ 
+     for name, param in model.named_parameters():
+         if not param.requires_grad:
+             continue  # frozen weights
+ 
+         # Assign weight decay values
+         if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
+             group_name = "no_decay"
+             this_weight_decay = 0.
+         elif 'mask_token' in name or 'patch_embed' in name or 'enc_blocks' in name:
+             group_name = "encoder"
+             this_weight_decay = weight_decay
+         else:
+             group_name = "decay"
+             this_weight_decay = weight_decay
+ 
+         # Assign layer ID for LR scaling
+         if layer_decay < 1.:
+             skip_scale = False
+             layer_id = _get_num_layer_for_vit(name, enc_depth, dec_depth)
+             group_name = "layer_%d_%s" % (layer_id, group_name)
+             if name in no_lr_scale_list:
+                 skip_scale = True
+                 group_name = f'{group_name}_no_lr_scale'
+         else:
+             layer_id = 0
+             skip_scale = True
+ 
+         if group_name not in parameter_group_names:
+             if not skip_scale:
+                 scale = layer_decay_values[layer_id]
+             elif group_name == "encoder":
+                 scale = 0.5
+             else:
+                 scale = 1.
+ 
+             parameter_group_names[group_name] = {
+                 "weight_decay": this_weight_decay,
+                 "params": [],
+                 "lr_scale": scale
+             }
+             parameter_group_vars[group_name] = {
+                 "weight_decay": this_weight_decay,
+                 "params": [],
+                 "lr_scale": scale
+             }
+         parameter_group_vars[group_name]["params"].append(param)
+         parameter_group_names[group_name]["params"].append(name)
+     print("Param groups = %s" % json.dumps(parameter_group_names, indent=2))
+     return list(parameter_group_vars.values())
+ 
+ 
+ def adjust_learning_rate(optimizer, epoch, args):
+     """Adjust the learning rate with warm restarts every args.cycle_epoch epochs:
+     the peak of each cycle follows a slow half-cycle cosine over the full run,
+     and within a cycle the lr warms up linearly, then decays with a cosine."""
+     lr_peak = args.min_lr + (args.lr - args.min_lr) * 0.5 * \
+         (1. + math.cos(math.pi * (epoch - args.warmup_epochs) / (args.epochs - args.warmup_epochs)))
+     warmup_iters = 2
+     T_0 = args.cycle_epoch
+     # position inside the current cycle (keeps the fractional part of epoch)
+     epoch_int = int(epoch) % T_0
+     decimal_part = epoch - math.floor(epoch)
+     T_cur = decimal_part + epoch_int
+ 
+     if T_cur < warmup_iters:
+         warmup_ratio = T_cur / warmup_iters
+         lr = args.min_lr + (lr_peak - args.min_lr) * warmup_ratio
+     else:
+         T_cur_adjusted = T_cur - warmup_iters
+         T_i = T_0 - warmup_iters
+         lr = args.min_lr + (lr_peak - args.min_lr) * (1 + math.cos(math.pi * T_cur_adjusted / T_i)) / 2
+     for param_group in optimizer.param_groups:
+         if "lr_scale" in param_group:
+             param_group["lr"] = lr * param_group["lr_scale"]
+         else:
+             param_group["lr"] = lr
+     return lr
+ 
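+ # Worked example of the cosine phase above: with min_lr=1e-5, cycle peak 1e-4,
+ # warmup_iters=2 and cycle length T_0=100, at epoch 10 the phase is (10 - 2) / (100 - 2):
+ #
+ #   lr = 1e-5 + (1e-4 - 1e-5) * (1 + math.cos(math.pi * 8 / 98)) / 2   # ~= 9.85e-5
+ 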
dust3r/dust3r/__init__.py ADDED
@@ -0,0 +1,2 @@
+ # Copyright (C) 2024-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
dust3r/dust3r/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (150 Bytes)
 
dust3r/dust3r/__pycache__/model.cpython-312.pyc ADDED
Binary file (12 kB)
 
dust3r/dust3r/__pycache__/patch_embed.cpython-312.pyc ADDED
Binary file (4.86 kB)
 
dust3r/dust3r/__pycache__/viz.cpython-312.pyc ADDED
Binary file (22.3 kB)
 
dust3r/dust3r/datasets/CustomDataset.py ADDED
@@ -0,0 +1,145 @@
+ # Copyright (C) 2024-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+ #
+ # --------------------------------------------------------
+ # Dataloader for custom image folders, adapted from the preprocessed
+ # ARKitScenes dataloader (dataset at https://github.com/apple/ARKitScenes -
+ # Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
+ # Public License, https://github.com/apple/ARKitScenes/tree/main?tab=readme-ov-file#license)
+ # See datasets_preprocess/preprocess_arkitscenes.py
+ # --------------------------------------------------------
+ import os
+ import os.path as osp
+ import glob
+ import random
+ import cv2
+ import numpy as np
+ import mast3r.utils.path_to_dust3r  # noqa
+ from dust3r.datasets.base.base_stereo_view_dataset import BaseStereoViewDataset_test
+ 
+ 
+ class CustomDataset(BaseStereoViewDataset_test):
+     def __init__(self, *args, split, ROOT, wpose=False, sequential_input=False, index_list=None, **kwargs):
+         self.ROOT = ROOT
+         self.wpose = wpose
+         self.sequential_input = sequential_input
+         super().__init__(*args, **kwargs)
+ 
+     def __len__(self):
+         return 684000
+ 
+     @staticmethod
+     def image_read(image_file):
+         img = cv2.imread(image_file)
+         return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
+ 
+     def read_cam_file(self, filename):
+         with open(filename) as f:
+             lines = [line.rstrip() for line in f.readlines()]
+         # extrinsics: lines [1, 5), 4x4 matrix
+         extrinsics = np.fromstring(' '.join(lines[1:5]), dtype=np.float32, sep=' ')
+         extrinsics = extrinsics.reshape((4, 4))
+         # intrinsics: lines [7, 10), 3x3 matrix
+         intrinsics = np.fromstring(' '.join(lines[7:10]), dtype=np.float32, sep=' ')
+         intrinsics = intrinsics.reshape((3, 3))
+         # depth_min & depth_interval: line 11
+         depth_min = float(lines[11].split()[0])
+         depth_interval = float(lines[11].split()[1])
+         return intrinsics, extrinsics, depth_min, depth_interval
+ 
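+     # Expected layout of the per-image camera file, inferred from the parsing
+     # above (MVSNet-style; line numbers are 0-based):
+     #
+     #   extrinsic
+     #   <4x4 world-to-camera matrix on lines 1-4>
+     #   <blank>
+     #   intrinsic
+     #   <3x3 intrinsic matrix on lines 7-9>
+     #   <blank>
+     #   <depth_min depth_interval>          (line 11)
+ 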
+ 
+     def _get_views(self, idx, resolution, rng):
+         images_list = glob.glob(osp.join(self.ROOT, '*.png')) + glob.glob(osp.join(self.ROOT, '*.jpg')) + glob.glob(osp.join(self.ROOT, '*.JPG'))
+         images_list = sorted(images_list)
+         if self.num_image != len(images_list):
+             images_list = random.sample(images_list, self.num_image)
+         self.gt_num_image = 0
+         views = []
+         for image in images_list:
+             rgb_image = self.image_read(image)
+             H, W = rgb_image.shape[:2]
+             if not self.wpose:
+                 intrinsics = np.array([[W, 0, W / 2], [0, H, H / 2], [0, 0, 1]])
+                 camera_pose = np.eye(4)
+             else:
+                 image_index = image.split('/')[-1].split('.')[0]
+                 proj_mat_filename = os.path.join(self.ROOT, image_index + '.txt')
+                 intrinsics, camera_pose, depth_min, depth_interval = self.read_cam_file(proj_mat_filename)
+                 camera_pose = np.linalg.inv(camera_pose)
+ 
+             depthmap = np.zeros((H, W))
+             rgb_image, depthmap, intrinsics = self._crop_resize_if_necessary(
+                 rgb_image, depthmap, intrinsics, resolution, rng=rng, info=None)
+             rgb_image_orig = rgb_image.copy()
+             H, W = depthmap.shape[:2]
+             fxfycxcy = np.array([intrinsics[0, 0] / W, intrinsics[1, 1] / H, intrinsics[0, 2] / W, intrinsics[1, 2] / H]).astype(np.float32)
+             views.append(dict(
+                 img_org=rgb_image_orig,
+                 img=rgb_image,
+                 depthmap=depthmap.astype(np.float32),
+                 camera_pose=camera_pose.astype(np.float32),
+                 camera_intrinsics=intrinsics.astype(np.float32),
+                 fxfycxcy=fxfycxcy,
+                 dataset='custom',
+                 label=image,
+                 instance=image,
+             ))
+         return views
+ 
+ 
+ if __name__ == "__main__":
+     from dust3r.datasets.base.base_stereo_view_dataset import view_name
+     from dust3r.viz import SceneViz, auto_cam_size
+     from dust3r.utils.image import rgb
+     import nerfvis.scene as scene_vis
+ 
+     # NOTE: the original snippet did not construct `dataset`; this is an
+     # illustrative instantiation (adjust ROOT/resolution to your data).
+     dataset = CustomDataset(split='train', ROOT='data/custom_images', resolution=224)
+ 
+     for idx in np.random.permutation(len(dataset)):
+         views = dataset[idx]
+         view_idxs = list(range(len(views)))
+         poses = [views[view_idx]['camera_pose'] for view_idx in view_idxs]
+         cam_size = max(auto_cam_size(poses), 0.001)
+         pts3ds = []
+         colors = []
+         valid_masks = []
+         c2ws = []
+         for view_idx in view_idxs:
+             pts3d = views[view_idx]['pts3d']
+             pts3ds.append(pts3d)
+             valid_mask = views[view_idx]['valid_mask']
+             valid_masks.append(valid_mask)
+             color = rgb(views[view_idx]['img'])
+             colors.append(color)
+             c2ws.append(views[view_idx]['camera_pose'])
+ 
+         pts3ds = np.stack(pts3ds, axis=0)
+         colors = np.stack(colors, axis=0)
+         valid_masks = np.stack(valid_masks, axis=0)
+         c2ws = np.stack(c2ws)
+         scene_vis.set_title("My Scene")
+         scene_vis.set_opencv()
+         f = 1111.0 / 2.5
+         z = 10.
+         scene_vis.add_camera_frustum("cameras", r=c2ws[:, :3, :3], t=c2ws[:, :3, 3], focal_length=f,
+                                      image_width=colors.shape[2], image_height=colors.shape[1],
+                                      z=z, connect=False, color=[1.0, 0.0, 0.0])
+         for i in range(len(c2ws)):
+             scene_vis.add_image(
+                 f"images/{i}",
+                 colors[i],  # Can be a list of paths too (requires joblib for that)
+                 r=c2ws[i, :3, :3],
+                 t=c2ws[i, :3, 3],
+                 # Alternatively: from nerfvis.utils import split_mat4; **split_mat4(c2ws)
+                 focal_length=f,
+                 z=z,
+             )
+         scene_vis.display(port=8081)
+ 
dust3r/dust3r/datasets/__init__.py ADDED
@@ -0,0 +1,39 @@
+ # Copyright (C) 2024-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+ from .utils.transforms import *
+ from .base.batched_sampler import BatchedRandomSampler  # noqa
+ from .CustomDataset import CustomDataset  # noqa
+ 
+ 
+ def get_data_loader(dataset, batch_size, num_workers=8, shuffle=True, drop_last=True, pin_mem=True):
+     import torch
+     from croco.utils.misc import get_world_size, get_rank
+ 
+     # the dataset can also be given as a string describing its constructor call
+     if isinstance(dataset, str):
+         dataset = eval(dataset)
+ 
+     world_size = get_world_size()
+     rank = get_rank()
+     try:
+         sampler = dataset.make_sampler(batch_size, shuffle=shuffle, world_size=world_size,
+                                        rank=rank, drop_last=drop_last)
+     except (AttributeError, NotImplementedError):
+         # not available for this dataset
+         if torch.distributed.is_initialized():
+             sampler = torch.utils.data.DistributedSampler(
+                 dataset, num_replicas=world_size, rank=rank, shuffle=shuffle, drop_last=drop_last
+             )
+         elif shuffle:
+             sampler = torch.utils.data.RandomSampler(dataset)
+         else:
+             sampler = torch.utils.data.SequentialSampler(dataset)
+ 
+     data_loader = torch.utils.data.DataLoader(
+         dataset,
+         sampler=sampler,
+         batch_size=batch_size,
+         num_workers=num_workers,
+         pin_memory=pin_mem,
+         drop_last=drop_last,
+     )
+ 
+     return data_loader
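+ 
+ # Typical use (illustrative; the string form is eval'd into a constructor call):
+ #
+ #   loader = get_data_loader("CustomDataset(split='train', ROOT='data/imgs', resolution=224)",
+ #                            batch_size=2, num_workers=4)
+ #   for batch in loader:
+ #       ...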
dust3r/dust3r/datasets/__pycache__/CustomDataset.cpython-312.pyc ADDED
Binary file (8.14 kB)
 
dust3r/dust3r/datasets/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (1.76 kB)
 
dust3r/dust3r/datasets/base/__init__.py ADDED
@@ -0,0 +1,2 @@
+ # Copyright (C) 2024-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
dust3r/dust3r/datasets/base/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (164 Bytes).
 
dust3r/dust3r/datasets/base/__pycache__/base_stereo_view_dataset.cpython-312.pyc ADDED
Binary file (32.6 kB).
 
dust3r/dust3r/datasets/base/__pycache__/batched_sampler.cpython-312.pyc ADDED
Binary file (4.09 kB).
 
dust3r/dust3r/datasets/base/__pycache__/easy_dataset.cpython-312.pyc ADDED
Binary file (8.87 kB).
 
dust3r/dust3r/datasets/base/base_stereo_view_dataset.py ADDED
@@ -0,0 +1,774 @@
+ # Copyright (C) 2024-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+ #
+ # --------------------------------------------------------
+ # base class for implementing datasets
+ # --------------------------------------------------------
+ import copy
+ import random
+
+ import cv2
+ import numpy as np
+ import PIL
+ import torch
+ import torchvision.transforms as transforms
+ from scipy.spatial.transform import Rotation
+
+ from dust3r.datasets.base.easy_dataset import EasyDataset
+ from dust3r.datasets.utils.transforms import ImgNorm
+ from dust3r.utils.geometry import depthmap_to_absolute_camera_coordinates, geotrf, inv
+ import dust3r.datasets.utils.cropping as cropping
+
+
+ class BaseStereoViewDataset_test(EasyDataset):
+     """ Define all basic options.
+
+     Usage:
+         class MyDataset(BaseStereoViewDataset):
+             def _get_views(self, idx, rng):
+                 # overload here
+                 views = []
+                 views.append(dict(img=, ...))
+                 return views
+     """
+
+     def __init__(self, *,  # only keyword arguments
+                  split=None,
+                  resolution=None,  # square_size or (width, height) or list of [(width, height), ...]
+                  transform=ImgNorm,
+                  aug_crop=False,
+                  seed=None,
+                  num_views=8,
+                  gt_num_image=0,
+                  aug_monocular=False,
+                  aug_portrait_or_landscape=False,
+                  aug_rot90=False,
+                  aug_swap=False,
+                  only_pose=False,
+                  sequential_input=False,
+                  overfit=False,
+                  caculate_mask=False):
+         self.sequential_input = sequential_input
+         self.split = split
+         self.num_image = num_views
+         self._set_resolutions(resolution)
+         self.gt_num_image = gt_num_image
+         self.aug_monocular = aug_monocular
+         self.aug_portrait_or_landscape = aug_portrait_or_landscape
+         if isinstance(transform, str):  # resolve string specs before deriving transform_org
+             transform = eval(transform)
+         self.transform = transform
+         self.transform_org = transforms.Compose([t for t in transform.transforms if type(t).__name__ != 'ColorJitter'])
+         self.aug_rot90 = aug_rot90
+         self.aug_swap = aug_swap
+         self.only_pose = only_pose
+         self.overfit = overfit
+         self.rendering = False
+         self.caculate_mask = caculate_mask
+         self.aug_crop = aug_crop
+         self.seed = seed
+         self.kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
+
+     def __len__(self):
+         return len(self.scenes)
+
+     def sequential_sample(self, im_start, last, interal):
+         # sample num_image frames spaced by `interal`, each jittered by up to interal//2,
+         # then add gt_num_image jittered copies of already-chosen frames
+         im_list = [
+             im_start + i * interal + random.choice(list(range(-interal // 2, interal // 2)))
+             for i in range(self.num_image)
+         ]
+         im_list += [
+             random.choice(im_list) + random.choice(list(range(-interal // 2, interal // 2)))
+             for _ in range(self.gt_num_image)
+         ]
+         return im_list
+
+     def get_stats(self):
+         return f"{len(self)} pairs"
+
+     def __repr__(self):
+         resolutions_str = '[' + ';'.join(f'{w}x{h}' for w, h in self._resolutions) + ']'
+         return f"""{type(self).__name__}({self.get_stats()},
+             {self.split=},
+             {self.seed=},
+             resolutions={resolutions_str},
+             {self.transform=})""".replace('self.', '').replace('\n', '').replace(' ', '')
+
+     def _get_views(self, idx, resolution, rng):
+         raise NotImplementedError()
+
+     def _swap_view_aug(self, views):
+         # shuffle the views in place (random.shuffle returns None)
+         random.shuffle(views)
+
+     def __getitem__(self, idx):
+         if isinstance(idx, tuple):
+             # the idx is specifying the aspect-ratio
+             idx, ar_idx = idx
+         else:
+             assert len(self._resolutions) == 1
+             ar_idx = 0
+
+         # set-up the rng
+         if self.seed:  # reseed for each __getitem__
+             self._rng = np.random.default_rng(seed=self.seed + idx)
+         elif not hasattr(self, '_rng'):
+             seed = torch.initial_seed()  # this is different for each dataloader process
+             self._rng = np.random.default_rng(seed=seed)
+
+         # over-loaded code
+         resolution = self._resolutions[ar_idx]  # DO NOT CHANGE THIS (compatible with BatchedRandomSampler)
+         views = self._get_views(idx, resolution, self._rng)
+
+         if self.only_pose:
+             for view in views:
+                 # this allows to check whether the RNG is in the same state each time
+                 view['rng'] = int.from_bytes(self._rng.bytes(4), 'big')
+             return views
+         else:
+             for v, view in enumerate(views):
+                 assert 'pts3d' not in view, f"pts3d should not be there, they will be computed afterwards based on intrinsics+depthmap for view {view_name(view)}"
+                 view['idx'] = (idx, ar_idx, v)
+                 # encode the image
+                 width, height = view['img'].size
+                 view['true_shape'] = np.int32((height, width))
+                 view['img'] = self.transform_org(view['img'])
+                 view['img_org'] = self.transform_org(view['img_org'])
+                 if 'depth_anything' not in view:
+                     view['depth_anything'] = np.zeros_like(view['depthmap'])
+                 assert 'camera_intrinsics' in view
+                 if 'camera_pose' not in view:
+                     view['camera_pose'] = np.full((4, 4), np.nan, dtype=np.float32)
+                 else:
+                     assert np.isfinite(view['camera_pose']).all(), f'NaN in camera pose for view {view_name(view)}'
+                 assert 'pts3d' not in view
+                 assert 'valid_mask' not in view
+                 assert np.isfinite(view['depthmap']).all(), f'NaN in depthmap for view {view_name(view)}'
+                 pts3d, valid_mask = depthmap_to_absolute_camera_coordinates(**view)
+
+                 view['pts3d'] = pts3d
+                 view['valid_mask'] = valid_mask & np.isfinite(pts3d).all(axis=-1)
+
+                 # check all datatypes
+                 for key, val in view.items():
+                     res, err_msg = is_good_type(key, val)
+                     assert res, f"{err_msg} with {key}={val} for view {view_name(view)}"
+                 K = view['camera_intrinsics']
+
+             # de-normalize the intrinsics to pixel units
+             for view in views:
+                 fxfycxcy = view['fxfycxcy'].copy()
+                 H, W = view['img'].shape[1:]
+                 fxfycxcy[0] = fxfycxcy[0] * W
+                 fxfycxcy[1] = fxfycxcy[1] * H
+                 fxfycxcy[2] = fxfycxcy[2] * W
+                 fxfycxcy[3] = fxfycxcy[3] * H
+                 view['fxfycxcy_unorm'] = fxfycxcy
+
+             # last thing done!
+             for view in views:
+                 view['render_mask'] = np.ones((view['img'].shape[1], view['img'].shape[2]), dtype=np.uint8) > 0.1
+
+             for view in views:
+                 # transpose to make sure all views are the same size
+                 transpose_to_landscape(view)
+                 if 'sky_mask' in view:
+                     view.pop('sky_mask')
+                 # this allows to check whether the RNG is in the same state each time
+                 view['rng'] = int.from_bytes(self._rng.bytes(4), 'big')
+             return views
+
+     def _set_resolutions(self, resolutions):
+         assert resolutions is not None, 'undefined resolution'
+
+         if not isinstance(resolutions, list):
+             resolutions = [resolutions]
+
+         self._resolutions = []
+         for resolution in resolutions:
+             if isinstance(resolution, int):
+                 width = height = resolution
+             else:
+                 width, height = resolution
+             assert isinstance(width, int), f'Bad type for {width=} {type(width)=}, should be int'
+             assert isinstance(height, int), f'Bad type for {height=} {type(height)=}, should be int'
+             assert width >= height
+             self._resolutions.append((width, height))
+
+     def _crop_resize_if_necessary(self, image, depthmap, intrinsics, resolution, rng=None, info=None, depth_anything=None):
+         """ This function:
+         - first downsizes the image with LANCZOS interpolation,
+           which is better than bilinear interpolation for downscaling,
+         - then crops (if necessary) with bilinear interpolation.
+         """
+         if not isinstance(image, PIL.Image.Image):
+             image = PIL.Image.fromarray(image)
+
+         # transpose the resolution if necessary
+         W, H = image.size  # new size
+         assert resolution[0] >= resolution[1]
+         if H > 1.1 * W:
+             # image is portrait mode
+             resolution = resolution[::-1]
+         elif 0.7 < H / W < 1.3 and resolution[0] != resolution[1] and self.aug_portrait_or_landscape:
+             # image is roughly square, so we choose (portrait, landscape) randomly
+             if rng.integers(2):
+                 resolution = resolution[::-1]
+
+         # high-quality Lanczos down-scaling
+         target_resolution = np.array(resolution)
+         if depth_anything is not None:
+             image, depthmap, intrinsics, depth_anything = cropping.rescale_image_depthmap(image, depthmap, intrinsics, target_resolution, depth_anything=depth_anything)
+         else:
+             image, depthmap, intrinsics = cropping.rescale_image_depthmap(image, depthmap, intrinsics, target_resolution)
+
+         # actual cropping (if necessary) with bilinear interpolation
+         offset_factor = 0.5
+         intrinsics2 = cropping.camera_matrix_of_crop(intrinsics, image.size, resolution, offset_factor=offset_factor)
+         crop_bbox = cropping.bbox_from_intrinsics_in_out(intrinsics, intrinsics2, resolution)
+         if depth_anything is not None:
+             image, depthmap, intrinsics2, depth_anything = cropping.crop_image_depthmap(image, depthmap, intrinsics, crop_bbox, depth_anything=depth_anything)
+             return image, depthmap, intrinsics2, depth_anything
+         else:
+             image, depthmap, intrinsics2 = cropping.crop_image_depthmap(image, depthmap, intrinsics, crop_bbox)
+             return image, depthmap, intrinsics2
+
+     def _crop_resize_if_necessary_test(self, image, depthmap, intrinsics, resolution, rng=None, info=None, depth_anything=None):
+         """ Same as _crop_resize_if_necessary, but without the random
+         portrait/landscape augmentation (deterministic for testing).
+         """
+         if not isinstance(image, PIL.Image.Image):
+             image = PIL.Image.fromarray(image)
+
+         # transpose the resolution if necessary
+         W, H = image.size  # new size
+         assert resolution[0] >= resolution[1]
+         if H > 1.1 * W:
+             # image is portrait mode
+             resolution = resolution[::-1]
+
+         # high-quality Lanczos down-scaling
+         target_resolution = np.array(resolution)
+         if depth_anything is not None:
+             image, depthmap, intrinsics, depth_anything = cropping.rescale_image_depthmap(image, depthmap, intrinsics, target_resolution, depth_anything=depth_anything)
+         else:
+             image, depthmap, intrinsics = cropping.rescale_image_depthmap(image, depthmap, intrinsics, target_resolution)
+
+         # actual cropping (if necessary) with bilinear interpolation
+         offset_factor = 0.5
+         intrinsics2 = cropping.camera_matrix_of_crop(intrinsics, image.size, resolution, offset_factor=offset_factor)
+         crop_bbox = cropping.bbox_from_intrinsics_in_out(intrinsics, intrinsics2, resolution)
+         if depth_anything is not None:
+             image, depthmap, intrinsics2, depth_anything = cropping.crop_image_depthmap(image, depthmap, intrinsics, crop_bbox, depth_anything=depth_anything)
+         else:
+             image, depthmap, intrinsics2 = cropping.crop_image_depthmap(image, depthmap, intrinsics, crop_bbox)
+
+         return image, depthmap, intrinsics2
+
+
+ def rotate_90(views, k=1):
+     RT = np.eye(4, dtype=np.float32)
+     RT[:3, :3] = Rotation.from_euler('z', 90 * k, degrees=True).as_matrix()
+
+     for view in views:
+         view['img'] = torch.rot90(view['img'], k=k, dims=(-2, -1))  # WARNING!! dims=(-1, -2) != dims=(-2, -1)
+         view['depthmap'] = np.rot90(view['depthmap'], k=k).copy()
+         view['camera_pose'] = view['camera_pose'] @ RT
+
+         RT2 = np.eye(3, dtype=np.float32)
+         RT2[:2, :2] = RT[:2, :2] * ((1, -1), (-1, 1))
+         H, W = view['depthmap'].shape
+         if k % 4 == 0:
+             pass
+         elif k % 4 == 1:
+             # top-left (0,0) pixel becomes (0,H-1)
+             RT2[:2, 2] = (0, H - 1)
+         elif k % 4 == 2:
+             # top-left (0,0) pixel becomes (W-1,H-1)
+             RT2[:2, 2] = (W - 1, H - 1)
+         elif k % 4 == 3:
+             # top-left (0,0) pixel becomes (W-1,0)
+             RT2[:2, 2] = (W - 1, 0)
+         else:
+             raise ValueError(f'Bad value for {k=}')
+
+         view['camera_intrinsics'][:2, 2] = geotrf(RT2, view['camera_intrinsics'][:2, 2])
+         if k % 2 == 1:
+             K = view['camera_intrinsics']
+             np.fill_diagonal(K, K.diagonal()[[1, 0, 2]])
+
+         pts3d, valid_mask = depthmap_to_absolute_camera_coordinates(**view)
+         view['pts3d'] = pts3d
+         view['valid_mask'] = np.rot90(view['valid_mask'], k=k).copy()
+         view['true_shape'] = np.int32((H, W))
+         intrinsics = view['camera_intrinsics']
+         fxfycxcy = np.array([intrinsics[0, 0] / W, intrinsics[1, 1] / H, intrinsics[0, 2] / W, intrinsics[1, 2] / H]).astype(np.float32)
+         view['fxfycxcy'] = fxfycxcy
+
+
+ def reciprocal_1d(corres_1_to_2, corres_2_to_1, shape1, shape2, ret_recip=False):
+     is_reciprocal1 = np.abs(unravel_xy(corres_2_to_1[corres_1_to_2], shape1) - unravel_xy(np.arange(len(corres_1_to_2)), shape1)).sum(-1) < 4
+     pos1 = is_reciprocal1.nonzero()[0]
+     pos2 = corres_1_to_2[pos1]
+     if ret_recip:
+         return is_reciprocal1, pos1, pos2
+     return pos1, pos2
+
+
+ def reproject_view(pts3d, view2):
+     shape = view2['pts3d'].shape[:2]
+     return reproject(pts3d, view2['camera_intrinsics'], inv(view2['camera_pose']), shape)
+
+
+ def reproject(pts3d, K, world2cam, shape):
+     H, W, THREE = pts3d.shape
+     assert THREE == 3
+
+     # reproject in camera2 space
+     with np.errstate(divide='ignore', invalid='ignore'):
+         pos = geotrf(K @ world2cam[:3], pts3d, norm=1, ncol=2)
+
+     # quantize to pixel positions
+     return (H, W), ravel_xy(pos, shape)
+
+
+ def ravel_xy(pos, shape):
+     H, W = shape
+     with np.errstate(invalid='ignore'):
+         qx, qy = pos.reshape(-1, 2).round().astype(np.int32).T
+     quantized_pos = qx.clip(min=0, max=W - 1, out=qx) + W * qy.clip(min=0, max=H - 1, out=qy)
+     return quantized_pos
+
+
+ def unravel_xy(pos, shape):
+     # convert (x + W*y) back to 2d (x, y) coordinates
+     return np.unravel_index(pos, shape)[0].base[:, ::-1].copy()
+
+
+ class BaseStereoViewDataset(EasyDataset):
+     """ Define all basic options.
+
+     Usage:
+         class MyDataset(BaseStereoViewDataset):
+             def _get_views(self, idx, rng):
+                 # overload here
+                 views = []
+                 views.append(dict(img=, ...))
+                 return views
+     """
+
+     def __init__(self, *,  # only keyword arguments
+                  split=None,
+                  resolution=None,  # square_size or (width, height) or list of [(width, height), ...]
+                  transform=ImgNorm,
+                  aug_crop=False,
+                  seed=None,
+                  num_views=8,
+                  gt_num_image=0,
+                  aug_monocular=False,
+                  aug_portrait_or_landscape=True,
+                  aug_rot90=False,
+                  aug_swap=False,
+                  only_pose=False,
+                  sequential_input=False,
+                  overfit=False,
+                  caculate_mask=False):
+         self.sequential_input = sequential_input
+         self.split = split
+         self.num_image = num_views
+         self._set_resolutions(resolution)
+         self.gt_num_image = gt_num_image
+         self.aug_monocular = aug_monocular
+         self.aug_portrait_or_landscape = aug_portrait_or_landscape
+         if isinstance(transform, str):  # resolve string specs before deriving transform_org
+             transform = eval(transform)
+         self.transform = transform
+         self.transform_org = transforms.Compose([t for t in transform.transforms if type(t).__name__ != 'ColorJitter'])
+         self.aug_rot90 = aug_rot90
+         self.aug_swap = aug_swap
+         self.only_pose = only_pose
+         self.overfit = overfit
+         self.rendering = False
+         self.caculate_mask = caculate_mask
+         self.aug_crop = aug_crop
+         self.seed = seed
+         self.kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
+
+     def __len__(self):
+         return len(self.scenes)
+
+     def sequential_sample(self, im_start, last, interal):
+         # sample num_image frames spaced by `interal`, each jittered by up to interal//2,
+         # then add gt_num_image jittered copies of already-chosen frames
+         im_list = [
+             im_start + i * interal + random.choice(list(range(-interal // 2, interal // 2)))
+             for i in range(self.num_image)
+         ]
+         im_list += [
+             random.choice(im_list) + random.choice(list(range(-interal // 2, interal // 2)))
+             for _ in range(self.gt_num_image)
+         ]
+         return im_list
+
+     def get_stats(self):
+         return f"{len(self)} pairs"
+
+     def __repr__(self):
+         resolutions_str = '[' + ';'.join(f'{w}x{h}' for w, h in self._resolutions) + ']'
+         return f"""{type(self).__name__}({self.get_stats()},
+             {self.split=},
+             {self.seed=},
+             resolutions={resolutions_str},
+             {self.transform=})""".replace('self.', '').replace('\n', '').replace(' ', '')
+
+     def _get_views(self, idx, resolution, rng):
+         raise NotImplementedError()
+
+     def _swap_view_aug(self, views):
+         # shuffle the views in place (random.shuffle returns None)
+         random.shuffle(views)
+
+     def __getitem__(self, idx):
+         if isinstance(idx, tuple):
+             # the idx is specifying the aspect-ratio
+             idx, ar_idx = idx
+         else:
+             assert len(self._resolutions) == 1
+             ar_idx = 0
+
+         # set-up the rng
+         if self.seed:  # reseed for each __getitem__
+             self._rng = np.random.default_rng(seed=self.seed + idx)
+         elif not hasattr(self, '_rng'):
+             seed = torch.initial_seed()  # this is different for each dataloader process
+             self._rng = np.random.default_rng(seed=seed)
+
+         # over-loaded code
+         resolution = self._resolutions[ar_idx]  # DO NOT CHANGE THIS (compatible with BatchedRandomSampler)
+         # retry a few times: _get_views can fail on corrupted samples
+         flag = False
+         i = 0
+         while not flag and i < 1000:
+             try:
+                 views = self._get_views(idx, resolution, self._rng)
+                 flag = True
+             except Exception:
+                 flag = False
+                 i += 1
+
+         if self.only_pose:
+             for view in views:
+                 # this allows to check whether the RNG is in the same state each time
+                 view['rng'] = int.from_bytes(self._rng.bytes(4), 'big')
+             return views
+         else:
+             for v, view in enumerate(views):
+                 assert 'pts3d' not in view, f"pts3d should not be there, they will be computed afterwards based on intrinsics+depthmap for view {view_name(view)}"
+                 view['idx'] = (idx, ar_idx, v)
+                 # encode the image
+                 width, height = view['img'].size
+                 view['true_shape'] = np.int32((height, width))
+                 view['img'] = self.transform(view['img'])
+                 view['img_org'] = self.transform_org(view['img_org'])
+                 if 'depth_anything' not in view:
+                     view['depth_anything'] = np.zeros_like(view['depthmap'])
+                 assert 'camera_intrinsics' in view
+                 if 'camera_pose' not in view:
+                     view['camera_pose'] = np.full((4, 4), np.nan, dtype=np.float32)
+                 else:
+                     assert np.isfinite(view['camera_pose']).all(), f'NaN in camera pose for view {view_name(view)}'
+                 assert 'pts3d' not in view
+                 assert 'valid_mask' not in view
+                 assert np.isfinite(view['depthmap']).all(), f'NaN in depthmap for view {view_name(view)}'
+                 pts3d, valid_mask = depthmap_to_absolute_camera_coordinates(**view)
+
+                 view['pts3d'] = pts3d
+                 view['valid_mask'] = valid_mask & np.isfinite(pts3d).all(axis=-1)
+
+                 # check all datatypes
+                 for key, val in view.items():
+                     res, err_msg = is_good_type(key, val)
+                     assert res, f"{err_msg} with {key}={val} for view {view_name(view)}"
+                 K = view['camera_intrinsics']
+
+             # if self.aug_swap:
+             #     self._swap_view_aug(views)
+
+             if self.aug_monocular:
+                 if self._rng.random() < self.aug_monocular:
+                     random_idxs = random.choices(list(range(len(views) - 1)), k=self.num_image + self.gt_num_image - 1)
+                     views_copy = [views[-1]] + [copy.deepcopy(views[random_idxs[i]]) for i in range(len(views) - 1)]
+                     views = views_copy
+
+             # if self.aug_rot90 is False:
+             #     pass
+             # elif self.aug_rot90 == 'same':
+             #     rotate_90(views, k=self._rng.choice(4))
+             # elif self.aug_rot90 == 'diff':
+             #     views_list = []
+             #     for view in views:
+             #         views_list += rotate_90([view], k=self._rng.choice(4))
+             #     views = views_list
+             # else:
+             #     raise ValueError(f'Bad value for {self.aug_rot90=}')
+             if not self.rendering:
+                 self._rng.shuffle(views)
+
+             if self.caculate_mask:
+                 for view1 in views[self.num_image:]:
+                     render_mask = []
+                     start = True
+                     height, width = view1['true_shape']
+                     for view2 in views[:self.num_image]:
+                         shape1, corres1_to_2 = reproject_view(view1['pts3d'], view2)
+                         shape2, corres2_to_1 = reproject_view(view2['pts3d'], view1)
+                         # compute reciprocal correspondences:
+                         # pos1 == valid pixels (correspondences) in image1
+                         is_reciprocal1, pos1, pos2 = reciprocal_1d(corres1_to_2, corres2_to_1, shape1, shape2, ret_recip=True)
+                         render_mask.append(is_reciprocal1.reshape(shape1))
+                         if start:
+                             view2['render_mask'] = np.ones((view2['img'].shape[1], view2['img'].shape[2]), dtype=np.uint8) > 0.1
+                             start = False
+                     render_mask = np.stack(render_mask, axis=0).sum(0)
+                     render_mask = cv2.dilate(render_mask / 16, self.kernel)
+                     view1['render_mask'] = render_mask > 1e-5
+                     assert view1['render_mask'].shape == (height, width)
+             else:
+                 for view in views:
+                     view['render_mask'] = np.ones((view['img'].shape[1], view['img'].shape[2]), dtype=np.uint8) > 0.1
+
+             # de-normalize the intrinsics to pixel units
+             for view in views:
+                 fxfycxcy = view['fxfycxcy'].copy()
+                 H, W = view['img'].shape[1:]
+                 fxfycxcy[0] = fxfycxcy[0] * W
+                 fxfycxcy[1] = fxfycxcy[1] * H
+                 fxfycxcy[2] = fxfycxcy[2] * W
+                 fxfycxcy[3] = fxfycxcy[3] * H
+                 view['fxfycxcy_unorm'] = fxfycxcy
+
+             # last thing done!
+             for view in views:
+                 # transpose to make sure all views are the same size
+                 transpose_to_landscape(view)
+                 if 'sky_mask' in view:
+                     view.pop('sky_mask')
+                 # this allows to check whether the RNG is in the same state each time
+                 view['rng'] = int.from_bytes(self._rng.bytes(4), 'big')
+             return views
+
+     def _set_resolutions(self, resolutions):
+         assert resolutions is not None, 'undefined resolution'
+
+         if not isinstance(resolutions, list):
+             resolutions = [resolutions]
+
+         self._resolutions = []
+         for resolution in resolutions:
+             if isinstance(resolution, int):
+                 width = height = resolution
+             else:
+                 width, height = resolution
+             assert isinstance(width, int), f'Bad type for {width=} {type(width)=}, should be int'
+             assert isinstance(height, int), f'Bad type for {height=} {type(height)=}, should be int'
+             assert width >= height
+             self._resolutions.append((width, height))
+
+     def _crop_resize_if_necessary(self, image, depthmap, intrinsics, resolution, rng=None, info=None, depth_anything=None):
+         """ This function:
+         - first downsizes the image with LANCZOS interpolation,
+           which is better than bilinear interpolation for downscaling,
+         - then crops (if necessary) with bilinear interpolation.
+         """
+         if not isinstance(image, PIL.Image.Image):
+             image = PIL.Image.fromarray(image)
+
+         # transpose the resolution if necessary
+         W, H = image.size  # new size
+         assert resolution[0] >= resolution[1]
+         if H > 1.1 * W:
+             # image is portrait mode
+             resolution = resolution[::-1]
+         elif 0.7 < H / W < 1.3 and resolution[0] != resolution[1] and self.aug_portrait_or_landscape:
+             # image is roughly square, so we choose (portrait, landscape) randomly
+             if rng.integers(2):
+                 resolution = resolution[::-1]
+
+         # high-quality Lanczos down-scaling
+         target_resolution = np.array(resolution)
+         if depth_anything is not None:
+             image, depthmap, intrinsics, depth_anything = cropping.rescale_image_depthmap(image, depthmap, intrinsics, target_resolution, depth_anything=depth_anything)
+         else:
+             image, depthmap, intrinsics = cropping.rescale_image_depthmap(image, depthmap, intrinsics, target_resolution)
+
+         # actual cropping (if necessary) with bilinear interpolation
+         offset_factor = 0.5
+         intrinsics2 = cropping.camera_matrix_of_crop(intrinsics, image.size, resolution, offset_factor=offset_factor)
+         crop_bbox = cropping.bbox_from_intrinsics_in_out(intrinsics, intrinsics2, resolution)
+         if depth_anything is not None:
+             image, depthmap, intrinsics2, depth_anything = cropping.crop_image_depthmap(image, depthmap, intrinsics, crop_bbox, depth_anything=depth_anything)
+             return image, depthmap, intrinsics2, depth_anything
+         else:
+             image, depthmap, intrinsics2 = cropping.crop_image_depthmap(image, depthmap, intrinsics, crop_bbox)
+             return image, depthmap, intrinsics2
+
+     def _crop_resize_if_necessary_test(self, image, depthmap, intrinsics, resolution, rng=None, info=None, depth_anything=None):
+         """ Same as _crop_resize_if_necessary, but without the random
+         portrait/landscape augmentation (deterministic for testing).
+         """
+         if not isinstance(image, PIL.Image.Image):
+             image = PIL.Image.fromarray(image)
+
+         # transpose the resolution if necessary
+         W, H = image.size  # new size
+         assert resolution[0] >= resolution[1]
+         if H > 1.1 * W:
+             # image is portrait mode
+             resolution = resolution[::-1]
+
+         # high-quality Lanczos down-scaling
+         target_resolution = np.array(resolution)
+         if depth_anything is not None:
+             image, depthmap, intrinsics, depth_anything = cropping.rescale_image_depthmap(image, depthmap, intrinsics, target_resolution, depth_anything=depth_anything)
+         else:
+             image, depthmap, intrinsics = cropping.rescale_image_depthmap(image, depthmap, intrinsics, target_resolution)
+
+         # actual cropping (if necessary) with bilinear interpolation
+         offset_factor = 0.5
+         intrinsics2 = cropping.camera_matrix_of_crop(intrinsics, image.size, resolution, offset_factor=offset_factor)
+         crop_bbox = cropping.bbox_from_intrinsics_in_out(intrinsics, intrinsics2, resolution)
+         if depth_anything is not None:
+             image, depthmap, intrinsics2, depth_anything = cropping.crop_image_depthmap(image, depthmap, intrinsics, crop_bbox, depth_anything=depth_anything)
+         else:
+             image, depthmap, intrinsics2 = cropping.crop_image_depthmap(image, depthmap, intrinsics, crop_bbox)
+
+         return image, depthmap, intrinsics2
+
+
+ def is_good_type(key, v):
+     """ returns (is_good, err_msg) """
+     if isinstance(v, (str, int, tuple)):
+         return True, None
+     if v.dtype not in (np.float32, torch.float32, bool, np.int32, np.int64, np.uint8):
+         return False, f"bad {v.dtype=}"
+     return True, None
+
+
+ def view_name(view, batch_index=None):
+     def sel(x): return x[batch_index] if batch_index not in (None, slice(None)) else x
+     db = sel(view['dataset'])
+     label = sel(view['label'])
+     instance = sel(view['instance'])
+     return f"{db}/{label}/{instance}"
+
+
+ def transpose_to_landscape(view):
+     height, width = view['true_shape']
+
+     if width < height:
+         # rectify portrait to landscape
+         assert view['img'].shape == (3, height, width)
+         view['img'] = view['img'].swapaxes(1, 2)
+
+         if 'render_mask' in view:
+             assert view['render_mask'].shape == (height, width)
+             view['render_mask'] = view['render_mask'].swapaxes(0, 1)
+
+         assert view['img_org'].shape == (3, height, width)
+         view['img_org'] = view['img_org'].swapaxes(1, 2)
+
+         assert view['valid_mask'].shape == (height, width)
+         view['valid_mask'] = view['valid_mask'].swapaxes(0, 1)
+
+         assert view['depthmap'].shape == (height, width)
+         view['depthmap'] = view['depthmap'].swapaxes(0, 1)
+
+         assert view['pts3d'].shape == (height, width, 3)
+         view['pts3d'] = view['pts3d'].swapaxes(0, 1)
+
+         assert view['depth_anything'].shape == (height, width)
+         view['depth_anything'] = view['depth_anything'].swapaxes(0, 1)
+
+         # pixel x/y swap for the intrinsics is currently disabled
+         view['camera_intrinsics'] = view['camera_intrinsics']  # [[1, 0, 2]]
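
A standalone numpy sketch (illustrative, not part of the commit) of the pixel quantization implemented by `ravel_xy`/`unravel_xy` above: sub-pixel (x, y) positions are rounded, clipped to the image bounds, and flattened to x + W*y:

    import numpy as np

    H, W = 4, 6                                        # toy image shape
    pos = np.array([[1.2, 2.7], [5.9, 0.1]])           # sub-pixel (x, y) positions
    qx, qy = pos.reshape(-1, 2).round().astype(np.int32).T
    flat = qx.clip(0, W - 1) + W * qy.clip(0, H - 1)   # same formula as ravel_xy
    ys, xs = np.unravel_index(flat, (H, W))            # back to integer (y, x)
    print(flat, xs, ys)                                # [19  5] [1 5] [3 0]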
dust3r/dust3r/datasets/base/batched_sampler.py ADDED
@@ -0,0 +1,74 @@
+ # Copyright (C) 2024-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+ #
+ # --------------------------------------------------------
+ # Random sampling under a constraint
+ # --------------------------------------------------------
+ import numpy as np
+ import torch
+
+
+ class BatchedRandomSampler:
+     """ Random sampling under a constraint: each sample in the batch has the same feature,
+     which is chosen randomly from a known pool of 'features' for each batch.
+
+     For instance, the 'feature' could be the image aspect-ratio.
+
+     The index returned is a tuple (sample_idx, feat_idx).
+     This sampler ensures that each series of `batch_size` indices has the same `feat_idx`.
+     """
+
+     def __init__(self, dataset, batch_size, pool_size, world_size=1, rank=0, drop_last=True):
+         self.batch_size = batch_size
+         self.pool_size = pool_size
+
+         self.len_dataset = N = len(dataset)
+         self.total_size = round_by(N, batch_size * world_size) if drop_last else N
+         assert world_size == 1 or drop_last, 'must drop the last batch in distributed mode'
+
+         # distributed sampler
+         self.world_size = world_size
+         self.rank = rank
+         self.epoch = None
+
+     def __len__(self):
+         return self.total_size // self.world_size
+
+     def set_epoch(self, epoch):
+         self.epoch = epoch
+
+     def __iter__(self):
+         # prepare RNG
+         if self.epoch is None:
+             assert self.world_size == 1 and self.rank == 0, 'use set_epoch() if distributed mode is used'
+             seed = int(torch.empty((), dtype=torch.int64).random_().item())
+         else:
+             seed = self.epoch + 777
+         rng = np.random.default_rng(seed=seed)
+
+         # random indices (will restart from 0 if not drop_last)
+         sample_idxs = np.arange(self.total_size)
+         rng.shuffle(sample_idxs)
+
+         # random feat_idxs (same across each batch)
+         n_batches = (self.total_size + self.batch_size - 1) // self.batch_size
+         feat_idxs = rng.integers(self.pool_size, size=n_batches)
+         feat_idxs = np.broadcast_to(feat_idxs[:, None], (n_batches, self.batch_size))
+         feat_idxs = feat_idxs.ravel()[:self.total_size]
+
+         # put them together
+         idxs = np.c_[sample_idxs, feat_idxs]  # shape = (total_size, 2)
+
+         # Distributed sampler: we select a subset of batches
+         # make sure the slice for each node is aligned with batch_size
+         size_per_proc = self.batch_size * ((self.total_size + self.world_size *
+                                             self.batch_size - 1) // (self.world_size * self.batch_size))
+         idxs = idxs[self.rank * size_per_proc: (self.rank + 1) * size_per_proc]
+
+         yield from (tuple(idx) for idx in idxs)
+
+
+ def round_by(total, multiple, up=False):
+     if up:
+         total = total + multiple - 1
+     return (total // multiple) * multiple
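
A small sketch (illustrative, not part of the commit) of the batch-constant feature index that `BatchedRandomSampler` yields; `range(10)` stands in for a dataset since only its length is used:

    sampler = BatchedRandomSampler(range(10), batch_size=4, pool_size=3, drop_last=False)
    for sample_idx, feat_idx in sampler:
        print(sample_idx, feat_idx)  # feat_idx stays constant within each run of 4 samples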
dust3r/dust3r/datasets/base/easy_dataset.py ADDED
@@ -0,0 +1,157 @@
+ # Copyright (C) 2024-present Naver Corporation. All rights reserved.
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+ #
+ # --------------------------------------------------------
+ # A dataset base class that you can easily resize and combine.
+ # --------------------------------------------------------
+ import numpy as np
+ from dust3r.datasets.base.batched_sampler import BatchedRandomSampler
+
+
+ class EasyDataset:
+     """ a dataset that you can easily resize and combine.
+     Examples:
+     ---------
+         2 * dataset ==> duplicate each element 2x
+
+         10 @ dataset ==> set the size to 10 (random sampling, duplicates if necessary)
+
+         dataset1 + dataset2 ==> concatenate datasets
+     """
+
+     def __add__(self, other):
+         return CatDataset([self, other])
+
+     def __rmul__(self, factor):
+         return MulDataset(factor, self)
+
+     def __rmatmul__(self, factor):
+         return ResizedDataset(factor, self)
+
+     def set_epoch(self, epoch):
+         pass  # nothing to do by default
+
+     def make_sampler(self, batch_size, shuffle=True, world_size=1, rank=0, drop_last=True):
+         if not shuffle:
+             raise NotImplementedError()  # cannot deal yet
+         num_of_aspect_ratios = len(self._resolutions)
+         return BatchedRandomSampler(self, batch_size, num_of_aspect_ratios, world_size=world_size, rank=rank, drop_last=drop_last)
+
+
+ class MulDataset(EasyDataset):
+     """ Artificially augmenting the size of a dataset.
+     """
+     multiplicator: int
+
+     def __init__(self, multiplicator, dataset):
+         assert isinstance(multiplicator, int) and multiplicator > 0
+         self.multiplicator = multiplicator
+         self.dataset = dataset
+
+     def __len__(self):
+         return self.multiplicator * len(self.dataset)
+
+     def __repr__(self):
+         return f'{self.multiplicator}*{repr(self.dataset)}'
+
+     def __getitem__(self, idx):
+         if isinstance(idx, tuple):
+             idx, other = idx
+             return self.dataset[idx // self.multiplicator, other]
+         else:
+             return self.dataset[idx // self.multiplicator]
+
+     @property
+     def _resolutions(self):
+         return self.dataset._resolutions
+
+
+ class ResizedDataset(EasyDataset):
+     """ Artificially changing the size of a dataset.
+     """
+     new_size: int
+
+     def __init__(self, new_size, dataset):
+         assert isinstance(new_size, int) and new_size > 0
+         self.new_size = new_size
+         self.dataset = dataset
+
+     def __len__(self):
+         return self.new_size
+
+     def __repr__(self):
+         size_str = str(self.new_size)
+         for i in range((len(size_str) - 1) // 3):
+             sep = -4 * i - 3
+             size_str = size_str[:sep] + '_' + size_str[sep:]
+         return f'{size_str} @ {repr(self.dataset)}'
+
+     def set_epoch(self, epoch):
+         # this random shuffle only depends on the epoch
+         rng = np.random.default_rng(seed=epoch + 777)
+
+         # shuffle all indices
+         perm = rng.permutation(len(self.dataset))
+
+         # repeat the permutation until the target size is met
+         shuffled_idxs = np.concatenate([perm] * (1 + (len(self) - 1) // len(self.dataset)))
+         self._idxs_mapping = shuffled_idxs[:self.new_size]
+
+         assert len(self._idxs_mapping) == self.new_size
+
+     def __getitem__(self, idx):
+         assert hasattr(self, '_idxs_mapping'), 'You need to call dataset.set_epoch() to use ResizedDataset.__getitem__()'
+         if isinstance(idx, tuple):
+             idx, other = idx
+             return self.dataset[self._idxs_mapping[idx], other]
+         else:
+             return self.dataset[self._idxs_mapping[idx]]
+
+     @property
+     def _resolutions(self):
+         return self.dataset._resolutions
+
+
+ class CatDataset(EasyDataset):
+     """ Concatenation of several datasets
+     """
+
+     def __init__(self, datasets):
+         for dataset in datasets:
+             assert isinstance(dataset, EasyDataset)
+         self.datasets = datasets
+         self._cum_sizes = np.cumsum([len(dataset) for dataset in datasets])
+
+     def __len__(self):
+         return self._cum_sizes[-1]
+
+     def __repr__(self):
+         # remove uselessly long transform
+         return ' + '.join(repr(dataset).replace(',transform=Compose( ToTensor() Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)))', '') for dataset in self.datasets)
+
+     def set_epoch(self, epoch):
+         for dataset in self.datasets:
+             dataset.set_epoch(epoch)
+
+     def __getitem__(self, idx):
+         other = None
+         if isinstance(idx, tuple):
+             idx, other = idx
+
+         if not (0 <= idx < len(self)):
+             raise IndexError()
+
+         db_idx = np.searchsorted(self._cum_sizes, idx, 'right')
+         dataset = self.datasets[db_idx]
+         new_idx = idx - (self._cum_sizes[db_idx - 1] if db_idx > 0 else 0)
+
+         if other is not None:
+             new_idx = (new_idx, other)
+         return dataset[new_idx]
+
+     @property
+     def _resolutions(self):
+         resolutions = self.datasets[0]._resolutions
+         for dataset in self.datasets[1:]:
+             assert tuple(dataset._resolutions) == tuple(resolutions)
+         return resolutions
@@ -0,0 +1,2 @@
 
 
 
1
+ # Copyright (C) 2024-present Naver Corporation. All rights reserved.
2
+ # Licensed under CC BY-NC-SA 4.0 (non-commercial use only).