---
license: apache-2.0
datasets:
- usm3d/hoho25k
language:
- en
tags:
- hoho25k
- s23dr2025
---
# S23DR 2025 Challenge - Winning Solution πŸ†

This repository contains the **winning solution** for the [Structured Semantic 3D Reconstruction (S23DR) Challenge 2025](https://huggingface.co/spaces/usm3d/S23DR2025) at CVPR 2025 Workshop.

Our method took first place in 3D wireframe reconstruction from multi-view images, combining COLMAP point clouds, semantic segmentation, and deep learning models to predict building wireframes with high accuracy.

## 🎯 Performance

Our solution achieved the best scores across key metrics:
- **HSS (Hybrid Structure Score)**: Superior spatial accuracy
- **F1 Score**: Excellent balance of precision and recall
- **IoU (Intersection over Union)**: High overlap with ground truth

## πŸ—οΈ Method Overview

Our approach consists of two main components:

### 1. Multi-Modal Data Fusion
- **COLMAP Point Clouds**: Dense 3D reconstruction from multi-view images
- **Semantic Segmentation**: ADE20K and Gestalt segmentation for building elements
- **Depth Information**: Fitted dense depth maps aligned with 3D structure

### 2. Deep Learning Models
We employ two specialized neural networks:

#### FastPointNet (Vertex Prediction)
- **Input**: 11D point cloud patches (xyz + rgb + features)
- **Architecture**: Enhanced PointNet with residual connections, channel attention, and multi-scale pooling
- **Output**: 3D vertex coordinates + confidence scores + classification
- **Model File**: `pnet.pth`
- **Features**: 
  - Deeper architecture with 7 conv layers
  - Lightweight channel attention mechanism
  - Group normalization for stability
  - Multi-scale global pooling (max + average)
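
The channel-attention idea can be illustrated with a minimal NumPy sketch. This is a squeeze-and-excite style gate over feature channels; the bottleneck size and random weights are purely illustrative, not taken from `pnet.pth`:

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Lightweight channel attention over a (C, N) patch feature map.

    features: C channels x N points.
    w1, w2:   weights of a small gating MLP (illustrative sizes).
    """
    squeeze = features.mean(axis=1)              # (C,) per-channel summary
    hidden = np.maximum(w1 @ squeeze, 0.0)       # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gate in (0, 1)
    return features * gate[:, None]              # re-weight each channel

rng = np.random.default_rng(0)
C, N = 64, 512
feats = rng.normal(size=(C, N))
w1 = rng.normal(size=(C // 8, C)) * 0.1          # C -> C/8 squeeze
w2 = rng.normal(size=(C, C // 8)) * 0.1          # C/8 -> C excite
out = channel_attention(feats, w1, w2)
```

Because the gate is computed from a single pooled vector, the extra cost is negligible compared to the per-point convolutions.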

#### ClassificationPointNet (Edge Classification)  
- **Input**: 6D point cloud patches (xyz + rgb)
- **Architecture**: Binary classification PointNet with deep feature extraction
- **Output**: Edge/no-edge classification with confidence
- **Model File**: `pnet_class.pth`
- **Features**:
  - 6-layer convolutional feature extraction
  - Dropout regularization (0.3-0.5)
  - Xavier initialization
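
One way to assemble the 6D input patch for a candidate edge is to gather the points lying near the vertex-to-vertex segment. This NumPy sketch (the radius value and helper name are illustrative, not the repository's API) uses clamped point-to-segment distances:

```python
import numpy as np

def edge_patch(points, v0, v1, radius=0.3):
    """Select points within `radius` of segment v0-v1 (a capsule)."""
    d = v1 - v0
    t = (points - v0) @ d / (d @ d)   # projection parameter along the segment
    t = np.clip(t, 0.0, 1.0)          # clamp to the segment endpoints
    closest = v0 + t[:, None] * d     # nearest point on the segment
    dist = np.linalg.norm(points - closest, axis=1)
    return points[dist <= radius]

pts = np.array([[0.0, 0.0, 0.0], [0.5, 0.1, 0.0],
                [0.5, 1.0, 0.0], [1.2, 0.0, 0.0]])
patch = edge_patch(pts, np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]))
# the point at y=1.0 falls outside the capsule and is excluded
```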

### 3. Patch-Based Processing Pipeline

Our pipeline processes local 3D patches around potential vertices:

1. **Initial Vertex Detection**: Extract candidates from semantic segmentation maps
2. **Point Cloud Clustering**: Group nearby 3D points using spatial clustering
3. **Patch Generation**: Create local point cloud patches (1-2m radius) centered on clusters
4. **Neural Refinement**: Use FastPointNet to refine vertex locations and classify validity
5. **Edge Prediction**: Generate candidate edges between vertices and classify using ClassificationPointNet
6. **Post-Processing**: Filter and merge results across multiple views
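
At a high level, the six steps chain together roughly as follows. This is a structural sketch only: every helper is a placeholder standing in for logic that lives in `predict.py`, and the names and signatures are illustrative, not the repository's actual API:

```python
# Placeholder helpers -- stand-ins for the real implementations.
def detect_vertex_candidates(views):        # step 1: segmentation-based candidates
    return views["candidates"]

def cluster_points(candidates):             # step 2: spatial clustering (identity here)
    return candidates

def extract_patch(cloud, center, radius):   # step 3: local patch around a cluster
    return [center]

def refine_vertex(patch):                   # step 4: FastPointNet role (stubbed)
    return patch[0], 0.9                    # refined position + confidence

def extract_edge_patch(cloud, a, b):        # step 5 input: points along the edge
    return cloud

def classify_edge(patch):                   # step 5: ClassificationPointNet role (stubbed)
    return 0.7

def predict_wireframe(views, cloud, vertex_threshold=0.59, edge_threshold=0.65):
    vertices = []
    for center in cluster_points(detect_vertex_candidates(views)):
        position, confidence = refine_vertex(extract_patch(cloud, center, radius=1.5))
        if confidence >= vertex_threshold:
            vertices.append(position)
    edges = []
    for i in range(len(vertices)):          # candidate edges between all vertex pairs
        for j in range(i + 1, len(vertices)):
            patch = extract_edge_patch(cloud, vertices[i], vertices[j])
            if classify_edge(patch) >= edge_threshold:
                edges.append((i, j))
    return vertices, edges                  # step 6 (cross-view merging) omitted

verts, edges = predict_wireframe({"candidates": [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]},
                                 cloud=[(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)])
```

The thresholds default to the competition values reported below.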

## πŸš€ Quick Start

### Training
```bash
# Train vertex prediction model
python train_pnet_v2.py

# Train edge classification model  
python train_pnet_class.py
```

### Evaluation
```bash
# Run evaluation on HoHo25k dataset
python train.py --vertex_threshold 0.59 --edge_threshold 0.65 --only_predicted_connections True
```

### Inference
```bash
# Generate predictions (used in competition)
# Uses pnet.pth and pnet_class.pth models
python script.py
```

## πŸ“ Key Files

### Core Models
- `fast_pointnet_v2.py` - Enhanced PointNet for vertex prediction
- `fast_pointnet_class.py` - PointNet for edge classification  
- `end_to_end.py` - VoxelUNet implementation (available but not used in final solution)

### Pipeline
- `predict.py` - Main wireframe prediction pipeline (2900+ lines)
- `train.py` - Training and evaluation script
- `utils.py` - COLMAP utilities and helper functions
- `visu.py` - 3D visualization tools using Open3D

### Data Processing
- `generate_pcloud_dataset.py` - Dataset generation from HoHo25k
- `create_pcloud()` - Multi-view point cloud fusion with semantic features

### Analysis
- `find_best_results.py` - Hyperparameter optimization and result analysis
- `color_visu.py` - Color legend generation for semantic classes

## πŸ”§ Technical Details

### Input Data Processing
- **Multi-view RGB images** with camera poses from COLMAP
- **Depth maps** fitted to COLMAP sparse reconstruction  
- **ADE20k segmentation** for building detection
- **Gestalt segmentation** for architectural elements (roof, walls, windows, etc.)

### Feature Engineering
- **11D Point Features**: xyz coordinates + rgb colors + semantic labels + multi-view consistency
- **Patch Normalization**: Center patches at local centroids with 0.5-2.0m radius
- **Data Augmentation**: Random rotation, translation, scaling, and noise injection
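
The patch normalization and rotation augmentation can be sketched as follows (a minimal NumPy version; the fixed radius, the z-axis-only rotation, and the function name are assumptions for illustration):

```python
import numpy as np

def normalize_patch(points, radius=1.5, rng=None):
    """Center an (N, 11) patch at its xyz centroid and scale by the patch radius.

    Passing an `rng` additionally applies a random rotation about the vertical
    axis as augmentation. Only columns 0-2 (xyz) are geometric; the remaining
    rgb/semantic channels pass through untouched.
    """
    out = np.asarray(points, dtype=float).copy()
    centroid = out[:, :3].mean(axis=0)
    out[:, :3] = (out[:, :3] - centroid) / radius   # center + scale to ~unit range
    if rng is not None:
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        out[:, :3] = out[:, :3] @ rot.T             # random z-axis rotation
    return out, centroid

rng = np.random.default_rng(1)
pts = rng.normal(size=(100, 11))
out, centroid = normalize_patch(pts, rng=rng)
```

Returning the centroid lets predicted vertex offsets be mapped back into world coordinates.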

### Training Strategy
- **Multi-task Learning**: Joint vertex position + confidence + classification prediction
- **Combined Loss**: SmoothL1 (position) + SoftPlus (confidence) + BCE (classification)
- **Optimization**: AdamW with cosine annealing, gradient clipping
- **Regularization**: Dropout, weight decay, label smoothing
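
A minimal NumPy sketch of the combined loss is below. The task weights and the exact way SoftPlus enters the confidence term are assumptions for illustration; only the SmoothL1 + BCE structure comes from the description above:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Huber-style loss used for the vertex position term."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

def softplus(x):
    return np.log1p(np.exp(x))

def bce(logit, label):
    """Binary cross-entropy on raw logits, for the classification term."""
    p = 1.0 / (1.0 + np.exp(-logit))
    eps = 1e-7
    return -(label * np.log(p + eps) + (1.0 - label) * np.log(1.0 - p + eps)).mean()

def combined_loss(pos_pred, pos_gt, conf_logit, conf_target, cls_logit, cls_label,
                  w_pos=1.0, w_conf=0.5, w_cls=1.0):
    """Weighted multi-task loss; weights and the confidence coupling are illustrative."""
    return (w_pos * smooth_l1(pos_pred, pos_gt)
            + w_conf * smooth_l1(softplus(conf_logit), conf_target)
            + w_cls * bce(cls_logit, cls_label))
```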

### Hyperparameter Optimization
Our best configuration:
- `vertex_threshold`: 0.59
- `edge_threshold`: 0.65  
- `only_predicted_connections`: True

## πŸ“Š Architecture Highlights

### FastPointNet Enhancements
- **Residual Connections**: Improved gradient flow in deep networks
- **Channel Attention**: Focus on important feature channels
- **Multi-Scale Features**: Combine max and average pooling (0.7 + 0.3 weighting)
- **Group Normalization**: Better stability for small batches
- **Leaky ReLU**: Prevent dying neurons (negative_slope=0.01)
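
The weighted max + average pooling is simple to state directly. This sketch collapses a (C, N) per-point feature map into a single patch descriptor using the 0.7/0.3 weighting mentioned above:

```python
import numpy as np

def multiscale_pool(features, w_max=0.7, w_avg=0.3):
    """Combine max and average pooling over the point dimension.

    features: (C, N) per-point features -> (C,) global patch descriptor.
    """
    return w_max * features.max(axis=1) + w_avg * features.mean(axis=1)

f = np.array([[0.0, 2.0, 4.0],
              [1.0, 1.0, 1.0]])
pooled = multiscale_pool(f)  # [0.7*4 + 0.3*2, 0.7*1 + 0.3*1] = [3.4, 1.0]
```

Mixing in the average term keeps gradient signal flowing to non-maximal points that pure max pooling would ignore.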

### Patch Processing Strategy
- **Hierarchical Clustering**: Group points by spatial proximity
- **Multi-View Consistency**: Aggregate features across camera views
- **Semantic-Aware Sampling**: Prioritize building-relevant regions
- **Edge-Aware Patches**: Generate candidate patches for all vertex pairs
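
As a much-simplified stand-in for the clustering step (the repository's actual clustering may differ; the merge distance here is illustrative), a greedy centroid-merging pass looks like this:

```python
import numpy as np

def greedy_cluster(points, merge_dist=0.5):
    """Greedy spatial clustering: each point joins the first cluster whose
    centroid lies within merge_dist, otherwise it seeds a new cluster."""
    centers, members = [], []
    for p in points:
        for k, c in enumerate(centers):
            if np.linalg.norm(p - c) <= merge_dist:
                members[k].append(p)
                centers[k] = np.mean(members[k], axis=0)  # update centroid
                break
        else:
            centers.append(p.astype(float))
            members.append([p])
    return np.array(centers)

pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 0.0, 0.0]])
clusters = greedy_cluster(pts)  # the two nearby points merge -> 2 clusters
```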

## 🎨 Visualization

The repository includes comprehensive 3D visualization tools:
- **Point Cloud Rendering**: COLMAP reconstructions with semantic colors
- **Wireframe Overlay**: Ground truth vs predicted wireframes  
- **Patch Visualization**: Local point cloud patches with predicted vertices
- **Camera Frustums**: Multi-view camera poses and coverage

## πŸ“ˆ Evaluation Metrics

We evaluate using three key metrics:
- **HSS (Hybrid Structure Score)**: Measures spatial accuracy of vertex positions
- **F1 Score**: Harmonic mean of precision and recall for edge detection
- **IoU**: Intersection over Union for overall wireframe quality

## πŸ“„ Citation

Please cite our work if you use this code:

```bibtex
TODO
```

## πŸ“‹ Requirements

- Python 3.8+
- PyTorch 1.12+
- CUDA-capable GPU (recommended)
- OpenCV, NumPy, SciPy
- Open3D (visualization)
- PyVista (optional, for advanced visualization)
- HuggingFace Datasets

## πŸ“œ License

Apache 2.0 - See LICENSE file for details.

## 🀝 Acknowledgments

The research was supported by Czech Science Foundation Grant No. 24-10738M. The access to the computational infrastructure of the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics" is also gratefully acknowledged. We also acknowledge the support from the Student Grant Competition of the Czech Technical University in Prague, grant No. SGS23/173/OHK3/3T/13.

Thanks to the S23DR Challenge organizers and the HoHo25k dataset creators for providing this excellent benchmark for 3D wireframe reconstruction research.

---

For detailed technical description, please refer to our paper and the comprehensive code documentation throughout the repository.