Point Transformer v3 & Dino-In-The-Room (DITR) on GridNet-HD
This repository provides the implementation and training pipelines for two models applied to the GridNet-HD dataset:
- Point Transformer v3 (PTv3): the baseline 3D model.
- Dino-In-The-Room (DITR): a fusion architecture combining Point Transformer v3 with DINOv2 image features, following the methodology proposed in the DITR paper. This model represents the current state of the art in multimodal 3D-2D fusion on multiple datasets.
Dataset Structure
The GridNet-HD dataset must follow the original structure:
dataset-root/
├── t1z5b/
│   ├── images/    # RGB images (.JPG)
│   ├── masks/     # Semantic segmentation masks (.png, single-channel labels)
│   ├── lidar/     # LiDAR point cloud (.las format with field "ground_truth")
│   └── pose/      # Camera poses and intrinsics (text files)
├── t1z6a/
│   └── ...
├── ...
├── split.json     # JSON file specifying the train/test split
└── README.md
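A quick way to sanity-check a local copy against this layout is to verify that every zone directory contains the four expected subfolders. This is an illustrative sketch, not part of the repository's tooling; the contents of split.json depend on the dataset release, so it is only loaded here, not interpreted:

```python
import json
from pathlib import Path

REQUIRED_SUBDIRS = ("images", "masks", "lidar", "pose")

def check_gridnethd_root(root: str) -> list:
    """Return the names of zone directories that match the expected layout."""
    root_path = Path(root)
    valid = []
    for zone in sorted(p for p in root_path.iterdir() if p.is_dir()):
        if all((zone / sub).is_dir() for sub in REQUIRED_SUBDIRS):
            valid.append(zone.name)
    return valid

def load_split(root: str) -> dict:
    """Load the train/test split; its exact schema depends on the release."""
    with open(Path(root) / "split.json") as f:
        return json.load(f)
```

Zones missing one of the four subfolders are simply skipped, which makes partial downloads easy to spot.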
Environment
The following environment was used to train and evaluate the baseline model.
| Component | Details |
|---|---|
| GPU | 4 x NVIDIA A40 (48 GB VRAM) |
| CUDA Version | 12.x (installed inside the Docker container) |
| OS | Ubuntu 22.04 LTS |
| RAM | 512 GB |
1. Point Transformer v3 (PTv3)
Start by cloning the repository:
git clone https://huggingface.co/heig-vd-geo/PTv3_GridNet-HD_baseline
Data Preparation
python prepare_gridnethd.py \
--gridnethd_root $path_to_GridNet-HD-dataset_public$ \
--split_json $path_to_split.json$ \
--out_root $path_to_PTv3_GridNet-HD_baseline$/data/gridnethd/pc \
--pointcept_root $path_to_PTv3_GridNet-HD_baseline$ \
--temporary_root $path_to_temp_directory$ \
--dino_projection False
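The preparation script converts each zone's .las point cloud into per-scene arrays consumable by Pointcept's data loaders. A minimal sketch of that conversion step, assuming Pointcept's usual coord/color/segment .npy convention (the file names and the exact dtypes used by prepare_gridnethd.py may differ):

```python
import numpy as np
from pathlib import Path

def save_pointcept_scene(out_dir, xyz, rgb, labels):
    """Write one scene in the coord/color/segment layout Pointcept loaders expect."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    np.save(out / "coord.npy", xyz.astype(np.float32))      # (N, 3) point coordinates
    np.save(out / "color.npy", rgb.astype(np.uint8))        # (N, 3) RGB in [0, 255]
    np.save(out / "segment.npy", labels.astype(np.int64))   # (N,) "ground_truth" labels
```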
Training
This repository follows the same structure as Pointcept, enabling seamless integration. Launch the container and train as follows:
docker run --gpus all -it --rm --shm-size=240g \
-v $path_to_PTv3_GridNet-HD_baseline$:/workspace/Pointcept \
pointcept/pointcept:v1.6.0-pytorch2.5.0-cuda12.4-cudnn9-devel bash
cd Pointcept
export PYTHONPATH=./
python tools/train.py \
--config-file configs/gridnethd/PTv3_gridnethd_color.py \
--options save_path=exp/gridnethd/ptv3_color/ \
--num-gpus 4
Evaluation
python tools/test.py \
--config-file configs/gridnethd/PTv3_gridnethd_color.py \
--options save_path=exp/gridnethd/ptv3_color/ \
weight=model_best_PTv3.pth
2. Dino-In-The-Room (DITR)
Data Preparation
First, precompute DINOv2 image features:
python dinov2/compute_dinov2_features.py \
--gridnethd_root $path_to_gridnet_hd$ \
--split_json $path_to_split.json$
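DINOv2 ViTs operate on 14×14-pixel patches, so an H×W image yields an (H//14)×(W//14) grid of patch tokens. A sketch of mapping a pixel coordinate to its flat patch-token index (the patch size is DINOv2's; how compute_dinov2_features.py resizes images before extraction is not shown here):

```python
PATCH = 14  # DINOv2 ViT patch size

def pixel_to_patch_index(u: int, v: int, width: int) -> int:
    """Map pixel (u, v) = (column, row) to the flat index of its DINOv2 patch token.

    Assumes the image height and width are multiples of PATCH, as DINOv2
    requires after resizing/cropping.
    """
    patches_per_row = width // PATCH
    return (v // PATCH) * patches_per_row + (u // PATCH)
```

For a 224×224 image this gives a 16×16 token grid, matching DINOv2's default input size.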
Then, prepare the dataset with feature projection enabled:
python prepare_gridnethd.py \
--gridnethd_root $path_to_GridNet-HD-dataset_public$ \
--split_json $path_to_split.json$ \
--out_root $path_to_PTv3_GridNet-HD_baseline$/data/gridnethd/pc \
--pointcept_root $path_to_PTv3_GridNet-HD_baseline$ \
--temporary_root $path_to_temp_directory$ \
--dino_projection True
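Feature projection relies on the camera poses and intrinsics under pose/: each LiDAR point is projected into the images with a pinhole model and picks up the DINOv2 feature of the patch it lands in. A minimal pinhole projection sketch (the matrix conventions here are assumptions for illustration, not necessarily those of prepare_gridnethd.py):

```python
import numpy as np

def project_points(xyz_world, K, R, t):
    """Project world-frame points into pixel coordinates with a pinhole camera.

    xyz_world: (N, 3) points, K: (3, 3) intrinsics,
    R, t: world-to-camera rotation (3, 3) and translation (3,).
    Returns (N, 2) pixel coordinates and an (N,) mask of points in front of the camera.
    """
    xyz_cam = xyz_world @ R.T + t      # world -> camera frame
    in_front = xyz_cam[:, 2] > 0       # keep only points with positive depth
    uvw = xyz_cam @ K.T                # camera frame -> homogeneous pixels
    uv = uvw[:, :2] / uvw[:, 2:3]      # perspective divide
    return uv, in_front
```

Points outside the image bounds or behind the camera would additionally be masked out before sampling features.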
Training
As with PTv3, integrate this repository within Pointcept:
docker run --gpus all -it --rm --shm-size=240g \
-v $path_to_PTv3_GridNet-HD_baseline$:/workspace/Pointcept \
pointcept/pointcept:v1.6.0-pytorch2.5.0-cuda12.4-cudnn9-devel bash
cd Pointcept
export PYTHONPATH=./
python tools/train.py \
--config-file configs/gridnethd/DITR_gridnethd_color_dinov2.py \
--options save_path=exp/gridnethd/ditr/ \
--num-gpus 4
Evaluation
python tools/test.py \
--config-file configs/gridnethd/DITR_gridnethd_color_dinov2.py \
--options save_path=exp/gridnethd/ditr/ \
weight=model_best_DITR.pth
Quantitative Results
Both PTv3 (XYZ + color) and DITR (Dino-In-The-Room) were evaluated on the test set with tile overlap and TTA (test-time augmentation).
| Class | PTv3 IoU (%) | DITR IoU (%) |
|---|---|---|
| Pylon | 97.12 | 96.81 |
| Conductor cable | 85.88 | 89.07 |
| Structural cable | 53.22 | 57.80 |
| Insulator | 90.63 | 93.20 |
| High vegetation | 88.30 | 88.81 |
| Low vegetation | 33.93 | 41.99 |
| Herbaceous vegetation | 91.72 | 90.05 |
| Rock, gravel, soil | 51.88 | 44.26 |
| Impervious soil (Road) | 79.63 | 79.49 |
| Water | 29.68 | 71.86 |
| Building | 60.49 | 70.26 |
| Mean IoU (mIoU) | 69.32 | 74.87 |
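Per-class IoU above follows the standard definition IoU = TP / (TP + FP + FN), and mIoU is the unweighted mean over the 11 classes. A sketch of computing both from a confusion matrix (the class ordering is assumed to match the table):

```python
import numpy as np

def iou_from_confusion(conf):
    """Per-class IoU from a (C, C) confusion matrix with rows = ground truth."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted as class c but actually another class
    fn = conf.sum(axis=1) - tp   # class c points predicted as something else
    return tp / (tp + fp + fn)

def mean_iou(conf):
    """Unweighted mean of the per-class IoUs."""
    return iou_from_confusion(conf).mean()
```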
References
Point Transformer v3 - PTv3 paper
@misc{wu2024pointtransformerv3simpler,
title={Point Transformer V3: Simpler, Faster, Stronger},
author={Xiaoyang Wu and Li Jiang and Peng-Shuai Wang and Zhijian Liu and Xihui Liu and Yu Qiao and Wanli Ouyang and Tong He and Hengshuang Zhao},
year={2024},
eprint={2312.10035},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2312.10035},
}
Dino-In-The-Room (DITR) - DITR paper
@misc{zeid2025dinoroomleveraging2d,
title={DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation},
author={Karim Abou Zeid and Kadir Yilmaz and Daan de Geus and Alexander Hermans and David Adrian and Timm Linder and Bastian Leibe},
year={2025},
eprint={2503.18944},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.18944},
}
DINOv2 - DINOv2 paper
@misc{oquab2024dinov2learningrobustvisual,
title={DINOv2: Learning Robust Visual Features without Supervision},
author={Maxime Oquab and TimothΓ©e Darcet and ThΓ©o Moutakanni and Huy Vo and Marc Szafraniec and Vasil Khalidov and Pierre Fernandez and Daniel Haziza and Francisco Massa and Alaaeldin El-Nouby and Mahmoud Assran and Nicolas Ballas and Wojciech Galuba and Russell Howes and Po-Yao Huang and Shang-Wen Li and Ishan Misra and Michael Rabbat and Vasu Sharma and Gabriel Synnaeve and Hu Xu and HervΓ© Jegou and Julien Mairal and Patrick Labatut and Armand Joulin and Piotr Bojanowski},
year={2024},
eprint={2304.07193},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2304.07193},
}
GridNet-HD Dataset - GridNet-HD paper
@misc{gridnet-hd-dataset,
title={GridNet-HD: A High-Resolution Multi-Modal Dataset for LiDAR-Image Fusion on Power Line Infrastructure},
author={Antoine Carreaud and Shanci Li and Malo De Lacour and Digre Frinde and Jan Skaloud and Adrien Gressin},
year={2026},
eprint={2601.13052},
url={https://arxiv.org/abs/2601.13052},
}
Authors & Contact
For questions, please open an issue or contact us directly.