chinmaygarde committed
Commit 3ea6165 · unverified · 1 Parent(s): 6d2df77

Attempt an export to ONNX.
.gitattributes CHANGED
@@ -1,3 +1,4 @@
+exports/** filter=lfs diff=lfs merge=lfs -text
 *.jpg filter=lfs diff=lfs merge=lfs -text
 *.jpeg filter=lfs diff=lfs merge=lfs -text
 *.png filter=lfs diff=lfs merge=lfs -text
.gitignore CHANGED
@@ -51,3 +51,4 @@ checkpoints
 pretrain
 *.png
 *.jpg
+.claude/settings.local.json
.python-version ADDED
@@ -0,0 +1 @@
+3.12
README.md CHANGED
@@ -169,6 +169,72 @@ Visualize the sampling points (like Fig. 6 in the paper):
 python viz_sample_points.py --config configs/r50_nuimg_704x256.py --weights checkpoints/r50_nuimg_704x256.pth
 ```
 
+## Changes from upstream
+
+This fork adds ONNX export support targeting [ONNX Runtime's CoreML Execution Provider](https://onnxruntime.ai/docs/execution-providers/CoreML-ExecutionProvider.html) for inference on Apple Silicon (Mac Studio).
+
+### Dependency management
+
+- `pyproject.toml` / `uv.lock` — project dependencies managed with [uv](https://docs.astral.sh/uv/)
+- `justfile` — task runner for common operations
+
+### ONNX export
+
+Three code changes were required to make the model traceable with `torch.onnx.export`:
+
+**`models/sparsebev_sampling.py`** — `sampling_4d()`
+- Replaced 6-dimensional advanced tensor indexing (not supported by the ONNX tracer) with `torch.gather` for best-view selection
+
+**`models/csrc/wrapper.py`** — new `msmv_sampling_onnx()`
+- Added an ONNX-compatible sampling path that uses 4D `F.grid_sample` (ONNX opset 16+) and `torch.gather` for view selection, replacing the original 5D volumetric `grid_sample`, which is not in the ONNX spec
+- The existing CUDA kernel path (`msmv_sampling` / `msmv_sampling_pytorch`) is preserved and used when CUDA is available
+
+**`models/sparsebev_transformer.py`**
+- `SparseBEVTransformerDecoder.forward()`: added a fast path that accepts pre-computed `time_diff` and `lidar2img` tensors directly, bypassing the NumPy preprocessing that is not traceable
+- `SparseBEVTransformerDecoderLayer.forward()`: replaced a masked in-place assignment (`tensor[mask] = value`) with `torch.where`, which is ONNX-compatible
+- `SparseBEVSelfAttention.calc_bbox_dists()`: replaced a Python loop over the batch dimension with a vectorised `torch.norm` using broadcasting
+
+### New files
+
+| File | Purpose |
+|------|---------|
+| `export_onnx.py` | Exports the model to ONNX, runs ORT CPU + CoreML EP validation |
+| `models/onnx_wrapper.py` | Thin `nn.Module` wrapper that accepts pre-computed tensors instead of `img_metas` dicts |
+| `justfile` | `just onnx_export` / `just onnx_export_validate` |
+| `exports/` | ONNX model files tracked via Git LFS |
+
+### Running the export
+
+```bash
+just onnx_export
+# or with validation against PyTorch and CoreML EP:
+just onnx_export_validate
+```
+
+Exported models land in `exports/` as `sparsebev_{config}_opset{N}.onnx` (+ `.onnx.data` for weights).
+
+**Inference with ONNX Runtime:**
+
+```python
+import onnxruntime as ort
+sess = ort.InferenceSession(
+    'exports/sparsebev_r50_nuimg_704x256_400q_36ep_opset18.onnx',
+    providers=[('CoreMLExecutionProvider', {'MLComputeUnits': 'ALL'}),
+               'CPUExecutionProvider'],
+)
+cls_scores, bbox_preds = sess.run(None, {
+    'img': img_np,              # [1, 48, 3, 256, 704] float32 BGR
+    'lidar2img': lidar2img_np,  # [1, 48, 4, 4] float32
+    'time_diff': time_diff_np,  # [1, 8] float32, seconds since frame 0
+})
+# cls_scores: [6, 1, 400, 10] raw logits per decoder layer
+# bbox_preds: [6, 1, 400, 10] raw box params — decode with NMSFreeCoder
+```
+
+The `MLComputeUnits` option must be passed explicitly; without it, ONNX Runtime discards the CoreML EP on the first unsupported partition instead of falling back per-node.
+
+---
+
 ## Acknowledgements
 
 Many thanks to these excellent open-source projects:
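The `torch.where` rewrite described under `SparseBEVTransformerDecoderLayer.forward()` can be checked in isolation. A minimal sketch with toy values (not the model's real tensors):

```python
import torch

# Masked in-place assignment, as in the original code. The pattern
# `t[mask] = value` is not traceable by the ONNX exporter.
time_diff = torch.tensor([[0.0, 0.5, 1.0]])
inplace = time_diff.clone()
inplace[inplace < 1e-5] = 1.0

# Functional equivalent used by the fork: exports as a single ONNX Where.
functional = torch.where(time_diff < 1e-5,
                         torch.ones_like(time_diff), time_diff)

assert torch.equal(inplace, functional)
```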
export_onnx.py ADDED
@@ -0,0 +1,179 @@
+"""
+Export SparseBEV to ONNX for inference via ONNX Runtime CoreML EP.
+
+Usage:
+    python export_onnx.py \
+        --config configs/r50_nuimg_704x256_400q_36ep.py \
+        --weights checkpoints/r50_nuimg_704x256_400q_36ep.pth \
+        --out sparsebev.onnx
+
+Then run with CoreML EP:
+    import onnxruntime as ort, numpy as np
+    sess = ort.InferenceSession('sparsebev.onnx',
+                                providers=['CoreMLExecutionProvider',
+                                           'CPUExecutionProvider'])
+    outputs = sess.run(None, {'img': img_np, 'lidar2img': l2i_np, 'time_diff': td_np})
+    cls_scores, bbox_preds = outputs  # raw logits, apply NMSFreeCoder.decode() separately
+
+Input format (all float32 numpy arrays):
+    img        [1, 48, 3, 256, 704]  BGR, pixel values in [0, 255]
+    lidar2img  [1, 48, 4, 4]         LiDAR-to-image projection matrices
+    time_diff  [1, 8]                seconds since frame 0, one value per frame
+                                     (frame 0 = 0.0, frame k = timestamp[0] - timestamp[k])
+"""
+
+import argparse
+import sys
+from unittest.mock import MagicMock
+
+# mmcv is installed without compiled C++ ops (no mmcv-full on macOS).
+# SparseBEV doesn't use any mmcv ops at inference time, so stub out the
+# missing extension module before anything else imports mmcv.ops.
+sys.modules['mmcv._ext'] = MagicMock()
+
+import torch
+import numpy as np
+
+# Register all custom mmdet3d modules by importing the local package
+sys.path.insert(0, '.')
+import models  # noqa: F401  triggers __init__.py which registers DETECTORS etc.
+
+from mmcv import Config
+from mmdet3d.models import build_detector
+from mmcv.runner import load_checkpoint
+from models.onnx_wrapper import SparseBEVOnnxWrapper
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--config', default='configs/r50_nuimg_704x256_400q_36ep.py')
+    parser.add_argument('--weights', default='checkpoints/r50_nuimg_704x256_400q_36ep.pth')
+    parser.add_argument('--out-dir', default='exports',
+                        help='Directory to write the ONNX model into')
+    parser.add_argument('--out', default=None,
+                        help='Override output filename (default: derived from config + opset)')
+    parser.add_argument('--opset', type=int, default=18,
+                        help='ONNX opset version (18 recommended for torch 2.x)')
+    parser.add_argument('--validate', action='store_true',
+                        help='Run ORT inference and compare to PyTorch output')
+    return parser.parse_args()
+
+
+def build_dummy_inputs(num_frames=8, num_cameras=6, H=256, W=704):
+    """Return (img, lidar2img, time_diff) dummy tensors for export / validation."""
+    img = torch.zeros(1, num_frames * num_cameras, 3, H, W)
+    lidar2img = torch.eye(4).reshape(1, 1, 4, 4).expand(1, num_frames * num_cameras, 4, 4).contiguous()
+    time_diff = torch.zeros(1, num_frames)
+    return img, lidar2img, time_diff
+
+
+def main():
+    args = parse_args()
+
+    # ------------------------------------------------------------------ #
+    # Resolve output path
+    # ------------------------------------------------------------------ #
+    import os
+    os.makedirs(args.out_dir, exist_ok=True)
+
+    if args.out is None:
+        # Derive a descriptive name from the config stem.
+        # e.g. configs/r50_nuimg_704x256_400q_36ep.py
+        #   -> sparsebev_r50_nuimg_704x256_400q_36ep_opset18.onnx
+        config_stem = os.path.splitext(os.path.basename(args.config))[0]
+        args.out = os.path.join(args.out_dir,
+                                f'sparsebev_{config_stem}_opset{args.opset}.onnx')
+    else:
+        args.out = os.path.join(args.out_dir, os.path.basename(args.out))
+
+    # ------------------------------------------------------------------ #
+    # Load model
+    # ------------------------------------------------------------------ #
+    cfg = Config.fromfile(args.config)
+    model = build_detector(cfg.model, train_cfg=None, test_cfg=cfg.get('test_cfg'))
+    load_checkpoint(model, args.weights, map_location='cpu')
+    model.eval()
+
+    wrapper = SparseBEVOnnxWrapper(model).eval()
+
+    # ------------------------------------------------------------------ #
+    # Dummy inputs
+    # ------------------------------------------------------------------ #
+    img, lidar2img, time_diff = build_dummy_inputs()
+
+    # ------------------------------------------------------------------ #
+    # Reference PyTorch forward (for later numerical comparison)
+    # ------------------------------------------------------------------ #
+    with torch.no_grad():
+        ref_cls, ref_bbox = wrapper(img, lidar2img, time_diff)
+    print(f'PyTorch output shapes: cls={tuple(ref_cls.shape)} bbox={tuple(ref_bbox.shape)}')
+
+    # ------------------------------------------------------------------ #
+    # ONNX export
+    # ------------------------------------------------------------------ #
+    print(f'Exporting to {args.out} (opset {args.opset}) …')
+    torch.onnx.export(
+        wrapper,
+        (img, lidar2img, time_diff),
+        args.out,
+        opset_version=args.opset,
+        input_names=['img', 'lidar2img', 'time_diff'],
+        output_names=['cls_scores', 'bbox_preds'],
+        do_constant_folding=True,
+        verbose=False,
+    )
+    print('Export done.')
+
+    # ------------------------------------------------------------------ #
+    # ONNX model check
+    # ------------------------------------------------------------------ #
+    import onnx
+    model_proto = onnx.load(args.out)
+    onnx.checker.check_model(model_proto)
+    print('ONNX checker passed.')
+
+    # ------------------------------------------------------------------ #
+    # Optional: validate ORT CPU output against PyTorch
+    # ------------------------------------------------------------------ #
+    if args.validate:
+        import onnxruntime as ort
+
+        print('Running ORT CPU validation …')
+        sess = ort.InferenceSession(args.out, providers=['CPUExecutionProvider'])
+        feeds = {
+            'img': img.numpy(),
+            'lidar2img': lidar2img.numpy(),
+            'time_diff': time_diff.numpy(),
+        }
+        ort_cls, ort_bbox = sess.run(None, feeds)
+
+        cls_diff = np.abs(ref_cls.numpy() - ort_cls).max()
+        bbox_diff = np.abs(ref_bbox.numpy() - ort_bbox).max()
+        print(f'Max absolute diff — cls: {cls_diff:.6f}  bbox: {bbox_diff:.6f}')
+
+        if cls_diff < 5e-2 and bbox_diff < 5e-2:
+            print('Validation PASSED.')
+        else:
+            print('WARNING: diff is larger than expected — check for unsupported ops.')
+
+        # -------------------------------------------------------------- #
+        # CoreML EP — must pass MLComputeUnits explicitly; without it ORT
+        # discards the EP entirely on first partition error instead of
+        # falling back per-node to the CPU provider.
+        # -------------------------------------------------------------- #
+        print('\nRunning CoreML EP …')
+        sess_cml = ort.InferenceSession(
+            args.out,
+            providers=[
+                ('CoreMLExecutionProvider', {'MLComputeUnits': 'ALL'}),
+                'CPUExecutionProvider',
+            ],
+        )
+        cml_cls, cml_bbox = sess_cml.run(None, feeds)
+        cml_cls_diff = np.abs(ref_cls.numpy() - cml_cls).max()
+        cml_bbox_diff = np.abs(ref_bbox.numpy() - cml_bbox).max()
+        print(f'CoreML EP max diff — cls: {cml_cls_diff:.6f}  bbox: {cml_bbox_diff:.6f}')
+
+
+if __name__ == '__main__':
+    main()
justfile ADDED
@@ -0,0 +1,20 @@
+python := "uv run python"
+
+config := "configs/r50_nuimg_704x256_400q_36ep.py"
+weights := "checkpoints/r50_nuimg_704x256_400q_36ep.pth"
+out_dir := "exports"
+
+# Export the model to ONNX (output goes to exports/ with a descriptive name)
+onnx_export config=config weights=weights out_dir=out_dir:
+    {{ python }} export_onnx.py \
+        --config {{ config }} \
+        --weights {{ weights }} \
+        --out-dir {{ out_dir }}
+
+# Export and validate against PyTorch + CoreML EP
+onnx_export_validate config=config weights=weights out_dir=out_dir:
+    {{ python }} export_onnx.py \
+        --config {{ config }} \
+        --weights {{ weights }} \
+        --out-dir {{ out_dir }} \
+        --validate
main.py ADDED
@@ -0,0 +1,6 @@
+def main():
+    print("Hello from sparsebev!")
+
+
+if __name__ == "__main__":
+    main()
models/csrc/wrapper.py CHANGED
@@ -91,3 +91,57 @@ def msmv_sampling(mlvl_feats, sampling_locations, scale_weights):
         return MSMVSamplingC23456.apply(*mlvl_feats, sampling_locations, scale_weights)
     else:
         return msmv_sampling_pytorch(mlvl_feats, sampling_locations, scale_weights)
+
+
+def msmv_sampling_onnx(mlvl_feats, uv, view_idx, scale_weights):
+    """
+    ONNX-compatible multi-scale multi-view sampling using 4D F.grid_sample.
+
+    Replaces the 5D volumetric grid_sample used in msmv_sampling_pytorch with
+    separate per-view 4D grid_samples followed by a torch.gather for view
+    selection. All ops are in ONNX opset 16.
+
+    Args:
+        mlvl_feats:    list of [BTG, C, N, H, W] channel-first feature maps
+        uv:            [BTG, Q, P, 2] normalised (u, v) in [0, 1]
+        view_idx:      [BTG, Q, P] integer camera-view indices
+        scale_weights: [BTG, Q, P, L] softmax weights over pyramid levels
+    Returns:
+        [BTG, Q, C, P]
+    """
+    BTG, C, N, _, _ = mlvl_feats[0].shape
+    _, Q, P, _ = uv.shape
+
+    # Convert UV from [0, 1] to [-1, 1] for F.grid_sample
+    uv_gs = uv * 2.0 - 1.0  # [BTG, Q, P, 2]
+
+    # Tile UV for all N views: [BTG*N, Q, P, 2]
+    # Use expand+contiguous+reshape (maps to ONNX Expand, better CoreML EP support
+    # than repeat_interleave, which maps to ONNX Tile and can trip up CoreML)
+    uv_gs = uv_gs.unsqueeze(1).expand(BTG, N, Q, P, 2).contiguous().reshape(BTG * N, Q, P, 2)
+
+    # Pre-expand view_idx for gathering along the N dim: [BTG, C, 1, Q, P]
+    view_idx_g = view_idx[:, None, None, :, :].expand(BTG, C, 1, Q, P)
+
+    final = torch.zeros(BTG, C, Q, P, device=mlvl_feats[0].device, dtype=mlvl_feats[0].dtype)
+
+    for lvl, feat in enumerate(mlvl_feats):
+        _, _, _, H_lvl, W_lvl = feat.shape
+
+        # [BTG, C, N, H, W] -> [BTG, N, C, H, W] -> [BTG*N, C, H, W]
+        feat_4d = feat.permute(0, 2, 1, 3, 4).reshape(BTG * N, C, H_lvl, W_lvl)
+
+        # 4D grid_sample: [BTG*N, C, Q, P]
+        sampled = F.grid_sample(feat_4d, uv_gs, mode='bilinear', padding_mode='zeros', align_corners=True)
+
+        # [BTG*N, C, Q, P] -> [BTG, N, C, Q, P] -> [BTG, C, N, Q, P]
+        sampled = sampled.reshape(BTG, N, C, Q, P).permute(0, 2, 1, 3, 4)
+
+        # Gather the selected camera view: [BTG, C, 1, Q, P] -> [BTG, C, Q, P]
+        sampled = torch.gather(sampled, 2, view_idx_g).squeeze(2)
+
+        # Accumulate with per-level scale weight
+        w = scale_weights[..., lvl].reshape(BTG, 1, Q, P)
+        final = final + sampled * w
+
+    return final.permute(0, 2, 1, 3)  # [BTG, Q, C, P]
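A quick sanity check of the property `msmv_sampling_onnx` relies on: with `align_corners=True`, a 4D `F.grid_sample` whose normalised grid point lands exactly on a pixel centre reproduces that pixel's value. Toy shapes only, not the model's real dimensions:

```python
import torch
import torch.nn.functional as F

N, C, H, W = 1, 2, 4, 4
feat = torch.arange(N * C * H * W, dtype=torch.float32).reshape(N, C, H, W)

# Target pixel (row=1, col=2), normalised to [-1, 1] under align_corners=True:
# x = -1 maps to column 0, x = +1 to column W-1 (same for y and rows).
x = 2.0 * 2 / (W - 1) - 1.0
y = 2.0 * 1 / (H - 1) - 1.0
grid = torch.tensor([[[[x, y]]]])  # [N, H_out=1, W_out=1, 2], (x, y) order

out = F.grid_sample(feat, grid, mode='bilinear',
                    padding_mode='zeros', align_corners=True)

# Bilinear weights collapse onto the single pixel at (1, 2)
assert torch.allclose(out[0, :, 0, 0], feat[0, :, 1, 2])
```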
models/onnx_wrapper.py ADDED
@@ -0,0 +1,60 @@
+import torch
+import torch.nn as nn
+
+
+class SparseBEVOnnxWrapper(nn.Module):
+    """
+    Thin wrapper around SparseBEV for ONNX export.
+
+    Accepts pre-computed tensors instead of the img_metas dict so the graph
+    boundary is clean. Returns raw decoder logits without NMS or decoding so
+    post-processing can stay in Python.
+
+    Inputs (all float32):
+        img        [B, T*N, 3, H, W] — BGR images, will be normalised inside
+        lidar2img  [B, T*N, 4, 4]    — LiDAR-to-image projection matrices
+        time_diff  [B, T]            — seconds since the first frame (per frame,
+                                       averaged across the N cameras)
+
+    Outputs:
+        cls_scores [num_layers, B, Q, num_classes]
+        bbox_preds [num_layers, B, Q, 10]
+    """
+
+    def __init__(self, model, image_h=256, image_w=704, num_frames=8, num_cameras=6):
+        super().__init__()
+        self.model = model
+        self.image_h = image_h
+        self.image_w = image_w
+        self.num_frames = num_frames
+        self.num_cameras = num_cameras
+
+        # Disable stochastic augmentations that are meaningless at inference
+        self.model.use_grid_mask = False
+        # Disable FP16 casting decorators
+        self.model.fp16_enabled = False
+
+    def forward(self, img, lidar2img, time_diff):
+        B, TN, C, H, W = img.shape
+
+        # Build a minimal img_metas. Only the Python-constant fields are here;
+        # the tensor fields (time_diff, lidar2img) are injected as real tensors
+        # so the ONNX tracer includes them in the graph.
+        img_shape = (self.image_h, self.image_w, C)
+        img_metas = [{
+            'img_shape': [img_shape] * TN,
+            'ori_shape': [img_shape] * TN,
+            'time_diff': time_diff,   # tensor — flows into the ONNX graph
+            'lidar2img': lidar2img,   # tensor — flows into the ONNX graph
+        }]
+
+        # Backbone + FPN
+        img_feats = self.model.extract_feat(img=img, img_metas=img_metas)
+
+        # Detection head — returns raw predictions, no NMS
+        outs = self.model.pts_bbox_head(img_feats, img_metas)
+
+        cls_scores = outs['all_cls_scores']  # [num_layers, B, Q, num_classes]
+        bbox_preds = outs['all_bbox_preds']  # [num_layers, B, Q, 10]
+
+        return cls_scores, bbox_preds
models/sparsebev_sampling.py CHANGED
@@ -2,7 +2,7 @@ import torch
 import torch.nn.functional as F
 from .bbox.utils import decode_bbox
 from .utils import rotation_3d_in_axis, DUMP
-from .csrc.wrapper import msmv_sampling, msmv_sampling_pytorch
+from .csrc.wrapper import msmv_sampling, msmv_sampling_pytorch, msmv_sampling_onnx, MSMV_CUDA
 
 
 def make_sample_points(query_bbox, offset, pc_range):
@@ -88,38 +88,55 @@ def sampling_4d(sample_points, mlvl_feats, scale_weights, lidar2img, image_h, im
     valid_mask = valid_mask.permute(0, 1, 3, 4, 2)  # [B, T, Q, GP, N]
     sample_points_cam = sample_points_cam.permute(0, 1, 3, 4, 2, 5)  # [B, T, Q, GP, N, 2]
 
-    # prepare batched indexing
-    i_batch = torch.arange(B, dtype=torch.long, device=sample_points.device)
-    i_query = torch.arange(Q, dtype=torch.long, device=sample_points.device)
-    i_time = torch.arange(T, dtype=torch.long, device=sample_points.device)
-    i_point = torch.arange(G * P, dtype=torch.long, device=sample_points.device)
-    i_batch = i_batch.view(B, 1, 1, 1, 1).expand(B, T, Q, G * P, 1)
-    i_time = i_time.view(1, T, 1, 1, 1).expand(B, T, Q, G * P, 1)
-    i_query = i_query.view(1, 1, Q, 1, 1).expand(B, T, Q, G * P, 1)
-    i_point = i_point.view(1, 1, 1, G * P, 1).expand(B, T, Q, G * P, 1)
-
     # we only keep at most one valid sampling point, see https://zhuanlan.zhihu.com/p/654821380
-    i_view = torch.argmax(valid_mask, dim=-1)[..., None]  # [B, T, Q, GP, 1]
-
-    # index the only one sampling point and its valid flag
-    sample_points_cam = sample_points_cam[i_batch, i_time, i_query, i_point, i_view, :]  # [B, Q, GP, 1, 2]
-    valid_mask = valid_mask[i_batch, i_time, i_query, i_point, i_view]  # [B, Q, GP, 1]
-
-    # treat the view index as a new axis for grid_sample and normalize the view index to [0, 1]
-    sample_points_cam = torch.cat([sample_points_cam, i_view[..., None].float() / (N - 1)], dim=-1)
-
-    # reorganize the tensor to stack T and G to the batch dim for better parallelism
-    sample_points_cam = sample_points_cam.reshape(B, T, Q, G, P, 1, 3)
-    sample_points_cam = sample_points_cam.permute(0, 1, 3, 2, 4, 5, 6)  # [B, T, G, Q, P, 1, 3]
-    sample_points_cam = sample_points_cam.reshape(B*T*G, Q, P, 3)
-
-    # reorganize the tensor to stack T and G to the batch dim for better parallelism
-    scale_weights = scale_weights.reshape(B, Q, G, T, P, -1)
-    scale_weights = scale_weights.permute(0, 2, 3, 1, 4, 5)
-    scale_weights = scale_weights.reshape(B*G*T, Q, P, -1)
-
-    # multi-scale multi-view grid sample
-    final = msmv_sampling(mlvl_feats, sample_points_cam, scale_weights)
+    i_view = torch.argmax(valid_mask, dim=-1, keepdim=True)  # [B, T, Q, GP, 1]
+
+    if MSMV_CUDA:
+        # Original fancy-indexing path (used with CUDA kernel on Linux/Windows)
+        i_batch = torch.arange(B, dtype=torch.long, device=sample_points.device)
+        i_query = torch.arange(Q, dtype=torch.long, device=sample_points.device)
+        i_time = torch.arange(T, dtype=torch.long, device=sample_points.device)
+        i_point = torch.arange(G * P, dtype=torch.long, device=sample_points.device)
+        i_batch = i_batch.view(B, 1, 1, 1, 1).expand(B, T, Q, G * P, 1)
+        i_time = i_time.view(1, T, 1, 1, 1).expand(B, T, Q, G * P, 1)
+        i_query = i_query.view(1, 1, Q, 1, 1).expand(B, T, Q, G * P, 1)
+        i_point = i_point.view(1, 1, 1, G * P, 1).expand(B, T, Q, G * P, 1)
+
+        sample_points_cam = sample_points_cam[i_batch, i_time, i_query, i_point, i_view, :]
+        valid_mask = valid_mask[i_batch, i_time, i_query, i_point, i_view]
+
+        # treat the view index as a new axis for grid_sample, normalise to [0, 1]
+        sample_points_cam = torch.cat([sample_points_cam, i_view[..., None].float() / (N - 1)], dim=-1)
+
+        sample_points_cam = sample_points_cam.reshape(B, T, Q, G, P, 1, 3)
+        sample_points_cam = sample_points_cam.permute(0, 1, 3, 2, 4, 5, 6)
+        sample_points_cam = sample_points_cam.reshape(B*T*G, Q, P, 3)
+
+        scale_weights = scale_weights.reshape(B, Q, G, T, P, -1)
+        scale_weights = scale_weights.permute(0, 2, 3, 1, 4, 5)
+        scale_weights = scale_weights.reshape(B*G*T, Q, P, -1)
+
+        final = msmv_sampling(mlvl_feats, sample_points_cam, scale_weights)
+    else:
+        # ONNX-compatible path: torch.gather + 4D grid_sample (no custom CUDA ops)
+        # Select best-view UV coords via gather [B, T, Q, GP, 1, 2]
+        i_view_uv = i_view.unsqueeze(-1).expand(B, T, Q, G * P, 1, 2)
+        sample_points_cam = torch.gather(sample_points_cam, 4, i_view_uv).squeeze(4)  # [B, T, Q, GP, 2]
+
+        # Reorganize UV to [B*T*G, Q, P, 2]
+        sample_points_cam = sample_points_cam.reshape(B, T, Q, G, P, 2)
+        sample_points_cam = sample_points_cam.permute(0, 1, 3, 2, 4, 5)  # [B, T, G, Q, P, 2]
+        sample_points_cam = sample_points_cam.reshape(B*T*G, Q, P, 2)
+
+        # Reorganize view_idx to [B*T*G, Q, P]
+        i_view = i_view.squeeze(4).reshape(B, T, Q, G, P)
+        i_view = i_view.permute(0, 1, 3, 2, 4).reshape(B*T*G, Q, P)
+
+        scale_weights = scale_weights.reshape(B, Q, G, T, P, -1)
+        scale_weights = scale_weights.permute(0, 2, 3, 1, 4, 5)
+        scale_weights = scale_weights.reshape(B*G*T, Q, P, -1)
+
+        final = msmv_sampling_onnx(mlvl_feats, sample_points_cam, i_view, scale_weights)
 
     # reorganize the sampled features
     C = final.shape[2]  # [BTG, Q, C, P]
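The two branches in `sampling_4d` are meant to pick the same per-point view; the gather-based selection can be checked against the fancy-indexing original on toy shapes (an illustrative sketch, not the project's test suite):

```python
import torch

# Toy shapes, not the model's real dimensions
B, T, Q, GP, N = 1, 2, 3, 4, 6
pts = torch.randn(B, T, Q, GP, N, 2)        # per-view UV coords
valid = torch.rand(B, T, Q, GP, N) > 0.5    # per-view validity

# Index of the first valid view along the N axis
i_view = torch.argmax(valid.long(), dim=-1, keepdim=True)  # [B, T, Q, GP, 1]

# Original path: batched advanced indexing (not ONNX-traceable at this rank)
i_b = torch.arange(B).view(B, 1, 1, 1, 1).expand(B, T, Q, GP, 1)
i_t = torch.arange(T).view(1, T, 1, 1, 1).expand(B, T, Q, GP, 1)
i_q = torch.arange(Q).view(1, 1, Q, 1, 1).expand(B, T, Q, GP, 1)
i_p = torch.arange(GP).view(1, 1, 1, GP, 1).expand(B, T, Q, GP, 1)
ref = pts[i_b, i_t, i_q, i_p, i_view, :].squeeze(4)        # [B, T, Q, GP, 2]

# ONNX path: gather along the view axis with a broadcast index
idx = i_view.unsqueeze(-1).expand(B, T, Q, GP, 1, 2)
out = torch.gather(pts, 4, idx).squeeze(4)                 # [B, T, Q, GP, 2]

assert torch.equal(out, ref)
```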
models/sparsebev_transformer.py CHANGED
@@ -56,18 +56,23 @@ class SparseBEVTransformerDecoder(BaseModule):
     def forward(self, query_bbox, query_feat, mlvl_feats, attn_mask, img_metas):
         cls_scores, bbox_preds = [], []
 
-        # calculate time difference according to timestamps
-        timestamps = np.array([m['img_timestamp'] for m in img_metas], dtype=np.float64)
-        timestamps = np.reshape(timestamps, [query_bbox.shape[0], -1, 6])
-        time_diff = timestamps[:, :1, :] - timestamps
-        time_diff = np.mean(time_diff, axis=-1).astype(np.float32)  # [B, F]
-        time_diff = torch.from_numpy(time_diff).to(query_bbox.device)  # [B, F]
-        img_metas[0]['time_diff'] = time_diff
-
-        # organize projections matrix and copy to CUDA
-        lidar2img = np.asarray([m['lidar2img'] for m in img_metas]).astype(np.float32)
-        lidar2img = torch.from_numpy(lidar2img).to(query_bbox.device)  # [B, N, 4, 4]
-        img_metas[0]['lidar2img'] = lidar2img
+        if isinstance(img_metas[0].get('time_diff'), torch.Tensor):
+            # ONNX export path: tensors pre-computed and injected by the wrapper
+            pass  # time_diff and lidar2img already set in img_metas[0]
+        else:
+            # Standard path: extract from img_metas using numpy
+            # calculate time difference according to timestamps
+            timestamps = np.array([m['img_timestamp'] for m in img_metas], dtype=np.float64)
+            timestamps = np.reshape(timestamps, [query_bbox.shape[0], -1, 6])
+            time_diff = timestamps[:, :1, :] - timestamps
+            time_diff = np.mean(time_diff, axis=-1).astype(np.float32)  # [B, F]
+            time_diff = torch.from_numpy(time_diff).to(query_bbox.device)  # [B, F]
+            img_metas[0]['time_diff'] = time_diff
+
+            # organize projections matrix and copy to CUDA
+            lidar2img = np.asarray([m['lidar2img'] for m in img_metas]).astype(np.float32)
+            lidar2img = torch.from_numpy(lidar2img).to(query_bbox.device)  # [B, N, 4, 4]
+            img_metas[0]['lidar2img'] = lidar2img
 
         # group image features in advance for sampling, see `sampling_4d` for more details
         for lvl, feat in enumerate(mlvl_feats):
@@ -178,9 +183,11 @@ class SparseBEVTransformerDecoderLayer(BaseModule):
         # calculate absolute velocity according to time difference
         time_diff = img_metas[0]['time_diff']  # [B, F]
         if time_diff.shape[1] > 1:
-            time_diff = time_diff.clone()
-            time_diff[time_diff < 1e-5] = 1.0
-            bbox_pred[..., 8:] = bbox_pred[..., 8:] / time_diff[:, 1:2, None]
+            time_diff = torch.where(time_diff < 1e-5, torch.ones_like(time_diff), time_diff)
+            bbox_pred = torch.cat([
+                bbox_pred[..., :8],
+                bbox_pred[..., 8:] / time_diff[:, 1:2, None],
+            ], dim=-1)
 
         if DUMP.enabled:
             query_bbox_dec = decode_bbox(query_bbox, self.pc_range)
@@ -236,16 +243,8 @@ class SparseBEVSelfAttention(BaseModule):
     @torch.no_grad()
     def calc_bbox_dists(self, bboxes):
         centers = decode_bbox(bboxes, self.pc_range)[..., :2]  # [B, Q, 2]
-
-        dist = []
-        for b in range(centers.shape[0]):
-            dist_b = torch.norm(centers[b].reshape(-1, 1, 2) - centers[b].reshape(1, -1, 2), dim=-1)
-            dist.append(dist_b[None, ...])
-
-        dist = torch.cat(dist, dim=0)  # [B, Q, Q]
-        dist = -dist
-
-        return dist
+        dist = torch.norm(centers.unsqueeze(2) - centers.unsqueeze(1), dim=-1)  # [B, Q, Q]
+        return -dist
 
 
 class SparseBEVSampling(BaseModule):
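The vectorised `calc_bbox_dists` can be checked against the per-batch loop it replaces; a toy sketch with random centres (illustrative only):

```python
import torch

B, Q = 2, 5
centers = torch.randn(B, Q, 2)  # stand-in for decoded box centres

# Loop version, as in the original code
ref = torch.stack([
    torch.norm(centers[b].reshape(-1, 1, 2) - centers[b].reshape(1, -1, 2), dim=-1)
    for b in range(B)
])                                                                  # [B, Q, Q]

# Vectorised version: broadcasting replaces the Python loop,
# which keeps the batch dimension inside the traced graph
out = torch.norm(centers.unsqueeze(2) - centers.unsqueeze(1), dim=-1)

assert torch.allclose(out, ref)
```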
pyproject.toml ADDED
@@ -0,0 +1,44 @@
+[project]
+name = "sparsebev"
+version = "0.1.0"
+description = "Add your description here"
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = [
+    "numpy>=2.4.3",
+    "onnx>=1.20.1",
+    "onnxruntime>=1.16",
+    "setuptools>=40,<72",  # <72 required: mmcv setup.py uses pkg_resources removed in >=72
+    "torch>=2.10.0",
+    "torchvision>=0.25.0",
+    # mmdet ecosystem — old packages with stale pins, needs --no-build-isolation
+    "mmdet==2.28.2",
+    "mmsegmentation==0.30.0",
+    "mmdet3d==1.0.0rc6",
+    "mmcv==1.7.0",
+    "fvcore>=0.1.5.post20221221",
+    "einops>=0.8.2",
+    "onnxscript>=0.6.2",
+]
+
+[tool.uv]
+# Build mmcv/mmdet without isolation so they see the pinned setuptools<72
+# (they import pkg_resources in setup.py, which was removed in setuptools>=72)
+no-build-isolation-package = ["mmcv", "mmdet", "mmdet3d", "mmsegmentation"]
+
+# mmdet3d==1.0.0rc6 has stale pins that conflict with Python 3.12 and modern torch.
+# Override to compatible modern versions.
+override-dependencies = [
+    "networkx>=2.5.1",
+    # mmdet3d pins numba==0.53.0 -> llvmlite==0.36.0, which only supports Python<3.10
+    "numba>=0.60.0",
+    "llvmlite>=0.43.0",
+    # setuptools>=72 removed pkg_resources as a top-level module; mmcv setup.py needs it
+    "setuptools<72",
+]
+
+[tool.uv.extra-build-dependencies]
+# mmdet3d/mmdet need torch at build time (they import it in setup.py)
+mmdet3d = ["torch"]
+mmdet = ["torch"]
+mmcv = ["setuptools"]
uv.lock ADDED
The diff for this file is too large to render. See raw diff