rayv9_learnt_baseline_snap

S23DR 2026 competition submission. Predicts building wireframes (vertices + edges) from multi-view images.

Pipeline

Per scene:
  1. v1d heatmap model  →  per-view vertex heatmaps  →  spatial NMS  →  ~80 rays
  2. rays + SfM points  →  6-channel 64×32×64 voxel volume
                        →  RayVoxelTransformerV7a  →  hybrid ray/dense NMS  →  world-space vertices (v9)
  3. COLMAP + depth     →  point fusion  →  EdgeDepthSegmentsModel  →  wireframe (baseline)
  4. snap               →  baseline verts snapped 80% toward nearest v9 vert within 2 m
                           + unmatched v9 verts appended
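
As an illustration of step 2, the SfM channel of the voxel volume can be sketched as a point-count splat followed by `log1p`. This is a minimal sketch, not the code in `voxel_grid.py`: the helper name `splat_points_log1p`, the axis-aligned bounds `grid_min`/`grid_max`, and the axis order are assumptions made for the example.

```python
import numpy as np

GRID_SHAPE = (64, 32, 64)  # voxel resolution quoted in the pipeline above

def splat_points_log1p(points, grid_min, grid_max, shape=GRID_SHAPE):
    """Count points per voxel and return log1p(count) as a float32 grid.

    Hypothetical helper; bounds and axis order are assumptions.
    """
    points = np.asarray(points, dtype=np.float64).reshape(-1, 3)
    grid_min = np.asarray(grid_min, dtype=np.float64)
    grid_max = np.asarray(grid_max, dtype=np.float64)
    dims = np.asarray(shape)

    # Map world coordinates to continuous voxel coordinates in [0, dims).
    scaled = (points - grid_min) / (grid_max - grid_min) * dims
    idx = np.floor(scaled).astype(np.int64)

    # Drop points that fall outside the grid bounds.
    inside = np.all((idx >= 0) & (idx < dims), axis=1)
    idx = idx[inside]

    counts = np.zeros(shape, dtype=np.float32)
    # np.add.at accumulates correctly when several points share a voxel.
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return np.log1p(counts)
```

The `log1p` compression keeps densely reconstructed regions from dominating the channel while leaving empty voxels at exactly zero.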

Validation results (100 scenes, hoho22k_2026_trainval)

| Method | Mean HSS | Median HSS |
|---|---|---|
| Learned baseline | 0.352 | 0.369 |
| + Snap to v9 | 0.411 | 0.453 |

Vertex F1@0.5 = 0.494, F1@1.0 = 0.685. Snap wins on 85/100 val scenes.
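
For readers who want to reproduce a vertex F1 at a distance threshold, one common approach is greedy one-to-one matching between predicted and ground-truth vertices. The matching protocol behind the numbers above is not stated here, so the sketch below is an assumption; `vertex_f1` is a hypothetical helper and the threshold is taken to be in the same units as the vertex coordinates.

```python
import numpy as np

def vertex_f1(pred, gt, threshold):
    """F1 from greedy one-to-one matching of predicted to ground-truth vertices.

    Hypothetical helper; the competition's own matching rule may differ.
    """
    pred = np.asarray(pred, dtype=np.float64).reshape(-1, 3)
    gt = np.asarray(gt, dtype=np.float64).reshape(-1, 3)
    if len(pred) == 0 or len(gt) == 0:
        return 0.0

    # Pairwise Euclidean distances, shape (n_pred, n_gt).
    dist = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)

    # Visit candidate pairs from closest to farthest; each vertex matches once.
    used_pred, used_gt, matches = set(), set(), 0
    for flat in np.argsort(dist, axis=None):
        i, j = (int(k) for k in np.unravel_index(flat, dist.shape))
        if dist[i, j] > threshold:
            break  # all remaining pairs are farther apart
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        matches += 1

    if matches == 0:
        return 0.0
    precision = matches / len(pred)
    recall = matches / len(gt)
    return 2 * precision * recall / (precision + recall)
```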

Files

| File | Purpose |
|---|---|
| `script.py` | Entry point. Loads all models, iterates dataset, writes `submission.json` |
| `v9_inference.py` | Full v9 vertex pipeline: COLMAP parsing, v1d heatmap model, spatial NMS, voxel volume construction, RayVoxelTransformer inference, hybrid NMS |
| `baseline_inference.py` | Learned baseline pipeline: point fusion, EdgeDepthSegmentsModel forward pass, postprocessing (merge vertices, snap to point cloud, snap horizontal) |
| `snap.py` | `snap_midpoint_plus_unmatched`: snaps baseline verts toward v9 verts and appends unmatched v9 verts |
| `model.py` | `RayVoxelTransformer` (V7a): dense image unprojection pathway + sparse ray/SfM transformer pathway, merged via learned gate |
| `dataset.py` | Voxel volume utilities for inference: `build_grid`, `splat_sfm`, `march_rays`, `assemble_volume` |
| `voxel_grid.py` | `VoxelGrid` class with world↔voxel transforms, ray marching, SfM splatting, GT target construction |
| `s23dr_2026_example/` | Learned baseline package: point fusion, tokenizer, `EdgeDepthSegmentsModel`, postprocessing, varifold loss |
| `v1d_checkpoint.pt` | v1d heatmap model (GroupNorm CNN, 7-channel input), 12 M params |
| `v9_checkpoint.pt` | RayVoxelTransformerV7a, 22 M params |
| `baseline_checkpoint.pt` | EdgeDepthSegmentsModel, 102 M params |
| `params.json` | Competition metadata (competition ID, dataset, time limit, output paths) |
| `submission.json` | Output: list of `{order_id, wf_vertices, wf_edges}` |
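
To make the `submission.json` layout concrete, here is a minimal sketch of building and writing records with the three keys listed above. The value shapes are assumptions for illustration (vertices as `[x, y, z]` lists, edges as pairs of vertex indices), and `make_entry` and `write_submission` are hypothetical helpers, not functions from `script.py`.

```python
import json

def make_entry(order_id, vertices, edges):
    """Build one submission record using plain-Python types so it serialises cleanly.

    Assumed layout: vertices are [x, y, z] triples, edges are vertex-index pairs.
    """
    return {
        "order_id": order_id,
        "wf_vertices": [[float(c) for c in v] for v in vertices],
        "wf_edges": [[int(a), int(b)] for a, b in edges],
    }

def write_submission(entries, path="submission.json"):
    """Write the list of records as a single JSON array."""
    with open(path, "w") as f:
        json.dump(entries, f)
```

Casting to `float` and `int` avoids the common failure where NumPy scalars or arrays are passed straight to `json.dump`, which cannot serialise them.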

Setup

The baseline checkpoint (`baseline_checkpoint.pt`, 102 MB) is not included in this repo. Download it separately and place it in the root directory of this project:

# From Hugging Face
wget https://huggingface.co/kc92/rayv9_learnt_baseline_snap/resolve/main/baseline_checkpoint.pt
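
Before running the pipeline it can help to confirm that all three checkpoints from the Files table are in place. The check below is a small sketch using only the standard library; `missing_checkpoints` is a hypothetical helper and is not part of the repo.

```python
from pathlib import Path

# Checkpoint filenames as listed in the Files table.
CHECKPOINTS = ("v1d_checkpoint.pt", "v9_checkpoint.pt", "baseline_checkpoint.pt")

def missing_checkpoints(root="."):
    """Return the checkpoint filenames that are not present under `root`."""
    root = Path(root)
    return [name for name in CHECKPOINTS if not (root / name).is_file()]
```

Calling `missing_checkpoints()` from the project root returns an empty list when everything is present, and otherwise names the files that still need to be downloaded.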

Usage

Competition submission (reads params.json, writes submission.json):

python script.py

Local smoke-test on the training split:

python script.py --mode local --n_scenes 10

Model details

v1d heatmap model

Input: 7-channel image (3-ch gestalt + 1-ch depth + 3-ch ADE segmentation), resized to 576×768
Architecture: dilated GroupNorm CNN encoder + transposed-conv decoder → sigmoid heatmap
NMS: 7×7 max-pool peaks with threshold 0.30, up to 60 peaks/view, spatial suppression at 20 px radius across views → ≤80 rays/scene
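
The per-view peak step can be sketched with NumPy as a stride-1 max-pool followed by a threshold and a top-k cut. This is an illustration under assumptions, not the code in `v9_inference.py`: `heatmap_peaks` is a hypothetical name, ties and the exact comparison (`>=` versus `>`) are guesses, and the cross-view suppression step is not covered.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def heatmap_peaks(heatmap, kernel=7, threshold=0.30, max_peaks=60):
    """Return (row, col, score) tuples for local maxima of a 2D heatmap.

    Hypothetical helper; defaults mirror the values quoted above.
    """
    heatmap = np.asarray(heatmap, dtype=np.float32)
    pad = kernel // 2
    # Pad with -inf so border pixels are compared only against real neighbours.
    padded = np.pad(heatmap, pad, mode="constant", constant_values=-np.inf)
    # Max over each kernel x kernel window, i.e. a stride-1 max-pool.
    pooled = sliding_window_view(padded, (kernel, kernel)).max(axis=(2, 3))

    # A pixel is a peak if it equals its window maximum and clears the threshold.
    is_peak = (heatmap == pooled) & (heatmap >= threshold)
    rows, cols = np.nonzero(is_peak)
    scores = heatmap[rows, cols]

    # Keep the strongest peaks first, capped at max_peaks.
    keep = np.argsort(-scores)[:max_peaks]
    return [(int(rows[k]), int(cols[k]), float(scores[k])) for k in keep]
```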

RayVoxelTransformer (V7a)

Input: 6-channel 64×32×64 voxel volume [log1p(ray_count), dir_x/y/z, mean_score, log1p(sfm)]
Dense pathway: image features (stride-4 CNN) unprojected into full grid → 3D conv → prob/offset
Sparse pathway: active ray voxels + SfM point tokens → joint self-attention transformer → prob/offset at active voxels
Merge: learned per-voxel gate between dense and sparse outputs
NMS: hybrid ray-guided NMS (threshold 0.35, radius 0.3 m) + dense NMS (threshold 0.45, radius 0.5 m), max 48 vertices
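
The radius-based suppression used in the NMS step can be illustrated with a generic greedy pass over scored 3D candidates. How the ray-guided and dense passes are combined is not described here, so this sketch covers a single pass only; `radius_nms` is a hypothetical helper and its defaults simply reuse the dense-NMS values quoted above.

```python
import numpy as np

def radius_nms(points, scores, threshold=0.45, radius=0.5, max_keep=48):
    """Greedy NMS: keep the highest-scoring points that are at least `radius` apart.

    Hypothetical helper; a single generic pass, not the hybrid scheme itself.
    """
    points = np.asarray(points, dtype=np.float64).reshape(-1, 3)
    scores = np.asarray(scores, dtype=np.float64).reshape(-1)

    kept = []
    # Visit candidates from highest to lowest score.
    for i in np.argsort(-scores):
        if scores[i] < threshold or len(kept) >= max_keep:
            break  # remaining candidates are weaker, or the cap is reached
        # Accept only if no already-kept point lies within the radius.
        if all(np.linalg.norm(points[i] - points[j]) >= radius for j in kept):
            kept.append(int(i))

    kept = np.asarray(kept, dtype=np.int64)
    return points[kept], scores[kept]
```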

Learned baseline

Input: fused COLMAP + depth point cloud, priority-sampled to 4096 tokens (3072 COLMAP + 1024 depth)
Architecture: Perceiver-style EdgeDepthSegmentsModel with cross-attention latent bottleneck → segment predictions → varifold-based vertex/edge extraction
Postprocessing: iterative vertex merging, snap to point cloud, snap horizontal
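
The token budget above (3072 COLMAP + 1024 depth = 4096) can be illustrated with a per-source top-k selection. The priority definition used by the baseline package is not given here, so the sketch assumes each point already carries a scalar priority; `sample_with_quotas` is a hypothetical helper, not the package's tokenizer, and it does no padding when a source has fewer points than its quota.

```python
import numpy as np

def sample_with_quotas(colmap_pts, colmap_priority, depth_pts, depth_priority,
                       n_colmap=3072, n_depth=1024):
    """Keep the highest-priority points from each source, up to its quota.

    Hypothetical helper; the per-point priorities are an assumed input.
    """
    def top_k(points, priority, k):
        points = np.asarray(points, dtype=np.float32).reshape(-1, 3)
        priority = np.asarray(priority, dtype=np.float32).reshape(-1)
        order = np.argsort(-priority)[:k]  # descending priority
        return points[order]

    colmap_sel = top_k(colmap_pts, colmap_priority, n_colmap)
    depth_sel = top_k(depth_pts, depth_priority, n_depth)

    # Source flag per token: 0 = COLMAP, 1 = depth.
    source = np.concatenate([
        np.zeros(len(colmap_sel), dtype=np.int64),
        np.ones(len(depth_sel), dtype=np.int64),
    ])
    return np.concatenate([colmap_sel, depth_sel], axis=0), source
```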

Snap

For each baseline vertex, find nearest v9 vertex within 2 m; move baseline vertex 80% of the way toward it
Append any v9 vertices not claimed by a snap as extra vertices (edges unchanged)
Sweep-optimised params: weight=0.80, radius=2.0 m
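
The snap rule above can be sketched in a few lines of NumPy. This is an illustration rather than the code in `snap.py`: `snap_toward_reference` is a hypothetical name, "within 2 m" is taken to include exactly 2 m, and several baseline vertices are allowed to snap toward the same v9 vertex.

```python
import numpy as np

def snap_toward_reference(base_verts, ref_verts, weight=0.80, radius=2.0):
    """Pull base vertices toward their nearest reference vertex, then append
    the reference vertices that no base vertex snapped to.

    Hypothetical helper; defaults mirror the sweep-optimised values above.
    """
    base = np.asarray(base_verts, dtype=np.float64).reshape(-1, 3)
    ref = np.asarray(ref_verts, dtype=np.float64).reshape(-1, 3)
    if len(base) == 0 or len(ref) == 0:
        return np.concatenate([base, ref], axis=0)

    # Nearest reference vertex for every base vertex.
    dist = np.linalg.norm(base[:, None, :] - ref[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    matched = dist[np.arange(len(base)), nearest] <= radius

    snapped = base.copy()
    # Linear interpolation: weight=0 leaves the vertex, weight=1 lands on the reference.
    snapped[matched] += weight * (ref[nearest[matched]] - base[matched])

    claimed = np.zeros(len(ref), dtype=bool)
    claimed[nearest[matched]] = True
    # Appending after the base vertices keeps existing edge indices valid.
    return np.concatenate([snapped, ref[~claimed]], axis=0)
```

Because the unmatched vertices are appended at the end, the first `len(base_verts)` rows keep their original order, which is why the baseline edge list can be reused unchanged.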
