Datasets
All datasets inherit from the torch_geometric Dataset class, allowing for
automated preprocessing and inference-time transforms.
See the official documentation
for more details.
Supported datasets
| Dataset | Download from ? | Which files ? | Where to ? |
|---|---|---|---|
| S3DIS | link | Stanford3dDataset_v1.2.zip |
data/s3dis/ |
| ScanNetV2 | link | scannetv2-labels.combined.tsv {{scan_name}}.aggregation.json {{scan_name}}.txt {{scan_name}}_vh_clean_2.0.010000.segs.json {{scan_name}}_vh_clean_2.ply |
data/scannet/ |
| KITTI-360 | link | data_3d_semantics.zip data_3d_semantics_test.zip |
data/kitti360/ |
| DALES | link | DALESObjects.tar.gz |
data/dales/ |
Structure of the data/ directory
S3DIS data directory structure.
βββ data
βββ s3dis # Structure for S3DIS
βββ Stanford3dDataset_v1.2.zip # (optional) Downloaded zipped dataset with non-aligned rooms
βββ raw # Raw dataset files
β βββ Area_{{1, 2, 3, 4, 5, 6}} # S3DIS's area/room/room.txt structure
β βββ Area_{{1, 2, 3, 4, 5, 6}}_alignmentAngle.txt # Room alignment angles required for entire floor reconstruction
β βββ {{room_name}}
β βββ {{room_name}}.txt
βββ processed # Preprocessed data
βββ {{train, val, test}} # Dataset splits
βββ {{preprocessing_hash}} # Preprocessing folder
βββ Area_{{1, 2, 3, 4, 5, 6}}.h5 # Preprocessed Area file
Warning β οΈ: Make sure you download
Stanford3dDataset_v1.2.zipand NOT the aligned version βStanford3dDataset_v1.2_Aligned_Version.zip, which does not contain theArea_{{1, 2, 3, 4, 5, 6}}_alignmentAngle.txtfiles.
ScanNetV2 data directory structure.
βββ data
ββββ scannet # Structure for ScanNetV2
βββ raw # Raw dataset files
| βββ scannetv2-labels.combined.tsv
| βββ scans
| β βββ {{scan_name}}
| β βββ {{scan_name}}.aggregation.json
| β βββ {{scan_name}}.txt
| β βββ {{scan_name}}_vh_clean_2.0.010000.segs.json
| β βββ {{scan_name}}_vh_clean_2.ply
| βββ scans_test
| βββ {{scan_name}}
| βββ {{scan_name}}_vh_clean_2.ply
βββ processed # Preprocessed data
βββ {{train, val, test}} # Dataset splits
βββ {{preprocessing_hash}} # Preprocessing folder
βββ {{scans, scans_test}}
βββ {{scan_name}}.h5 # Preprocessed scan file
KITTI-360 data directory structure.
βββ data
ββββ kitti360 # Structure for KITTI-360
βββ data_3d_semantics_test.zip # (optional) Downloaded zipped test dataset
βββ data_3d_semantics.zip # (optional) Downloaded zipped train dataset
βββ raw # Raw dataset files
β βββ data_3d_semantics # Contains all raw train and test sequences
β βββ {{sequence_name}} # KITTI-360's sequence/static/window.ply structure
β βββ static
β βββ {{window_name}}.ply
βββ processed # Preprocessed data
βββ {{train, val, test}} # Dataset splits
βββ {{preprocessing_hash}} # Preprocessing folder
βββ {{sequence_name}}
βββ {{window_name}}.h5 # Preprocessed window file
DALES data directory structure.
βββ data
βββ dales # Structure for DALES
βββ DALESObjects.tar.gz # (optional) Downloaded zipped dataset
βββ raw # Raw dataset files
β βββ {{train, test}} # DALES' split/tile.ply structure
β βββ {{tile_name}}.ply
βββ processed # Preprocessed data
βββ {{train, val, test}} # Dataset splits
βββ {{preprocessing_hash}} # Preprocessing folder
βββ {{tile_name}}.h5 # Preprocessed tile file
Warning β οΈ: Make sure you download the
DALESObjects.tar.gzand NOT βdales_semantic_segmentation_las.tar.gznor βdales_semantic_segmentation_ply.tar.gzversions, which do not contain all required point attributes.
Tip π‘: Already have the dataset on your machine ? Save memory πΎ by simply symlinking or copying the files to
data/<dataset_name>/raw/, following the above-describeddata/structure.
Automatic download and preprocessing
Following torch_geometric's Dataset behaviour:
- Dataset instantiation
β‘ Load preprocessed data indata/<dataset_name>/processed - Missing files in
data/<dataset_name>/processedstructure
β‘ Automatic preprocessing using files indata/<dataset_name>/raw - Missing files in
data/<dataset_name>/rawstructure
β‘ Automatic unzipping of the downloaded dataset indata/<dataset_name> - Missing downloaded dataset in
data/<dataset_name>structure
β‘Automaticmanual download todata/<dataset_name>
Warning β οΈ: We do not support β automatic download, for compliance reasons. Please manually download the required dataset files to the required location as indicated in the above table.
Setting up your own data/ and logs/ paths
The data/ and logs/ directories will store all your datasets and training
logs. By default, these are placed in the repository directory.
Since this may take some space, or your heavy data may be stored elsewhere, you
may specify other paths for these directories by creating a
configs/local/default.yaml file containing the following:
# @package paths
# path to data directory
data_dir: /path/to/your/data/
# path to logging directory
log_dir: /path/to/your/logs/
Pre-transforms, transforms, on-device transforms
Pre-transforms are the functions making up the preprocessing.
These are called only once and their output is saved in
data/<dataset_name>/processed/. These typically encompass neighbor search and
partition construction.
The transforms are called by the Dataloader at batch-creation time. These
typically encompass sampling and data augmentations and are performed on CPU,
before moving the batch to the GPU.
On-device transforms, are transforms to be performed on GPU. These are
typically compute-intensive operations that could not be done once and for all
at preprocessing time, and are too slow to be performed on CPU by the
Dataloader.
Preprocessing hash
Different from torch_geometric, you can have multiple
preprocessed versions of each dataset, identified by their preprocessing hash.
This hash will change whenever the preprocessing configuration (i.e. pre-transforms) is modified in an impactful way (e.g. changing the partition regularization).
Modifications of the transforms and on-device transforms will not affect your preprocessing hash.
Mini datasets
Each dataset has a "mini" version which only processes a portion of the data, to speedup experimentation. To use it, set the dataset config of your choice:
mini: True
Or, if you are using the CLI, use the following syntax:
# Train SPT on mini-DALES
python src/train.py experiment=dales +datamodule.mini=True
Creating your own dataset
To create your own dataset, you will need to do the following:
- create
YourDatasetclass inheriting fromsrc.datasets.BaseDataset - create
YourDataModuleclass inheriting fromsrc.datamodules.DataModule - create
configs/datamodule/<TASK>/your_dataset.yamlconfig
Instructions are provided in the docstrings of those classes, and you can get inspiration from our code for S3DIS, ScanNet, KITTI-360 and DALES to get started.
We suggest that your config inherits from configs/datamodule/<TASK>/default.yaml, where <TASK> is be semantic or panoptic, depending on your segmentation task of interest. See
configs/datamodule/<TASK>/s3dis.yaml, configs/datamodule/<TASK>/scannet.yaml, configs/datamodule/<TASK>/kitti360.yaml, and
configs/datamodule/<TASK>/dales.yaml for inspiration.
Semantic label format
The semantic labels of your dataset must follow certain rules.
Indeed, your points are expected to have labels within $[0, C]$, where: $C$ is
the num_classes you define in your YourDataset.
- All labels $[0, C - 1]$ are assumed to be present in your dataset. As such, they will all be used in metrics and losses computation.
- A point with the $C$ label will be considered void/ignored/unlabeled (whichever you call it). As such, it will be excluded from metrics and losses computation
Hence, make sure the output of your YourDataset.read_single_raw_cloud()
reader method never returns labels outside your $[0, C]$ range. Besides,
if some labels in $[0, C - 1]$ are not useful to you (ie absent from your
dataset), we recommend you remap your labels to a new $[0, C' - 1]$ range
(torch_geometric.nn.pool.consecutive.consecutive_cluster can help you with
that, if need be), while making sure you only use the label $C'$ for
void/ignored/unlabeled points.
Setting your train, val, and test sets
The clouds you use for your respective sets are to be specified in the
all_base_cloud_ids() method of your YourDataset.
def all_base_cloud_ids(self):
return {
'train': [...], # list of UNIQUE clouds ids in your train set
'val': [...], # list of UNIQUE clouds ids in your validation set
'test': [...] # list of UNIQUE clouds ids in your test set
}
Importantly, the cloud ids specified in each split must be uniquely identified:
we do not want clouds to have the same name in your train and test set.
Using the test set for validation
Generally, if you intend to run multiple experiments and tune some
hyperparameters to suit your dataset, you do need a validation set to avoid
contaminating your test set, which must be kept aside until final performance
evaluation. Yet, in some cases you might want to only use a train and a test
set. In this case you must set:
def all_base_cloud_ids(self):
return {
'train': [...], # list of UNIQUE clouds ids in your train set
'val': [], # empty list, no validation clouds
'test': [...] # list of UNIQUE clouds ids in your test set
}
Still, you can specify that you want to also use the test set as a val set
(which is dangerous ML practice) by setting in your
configs/datamodule/your_task/your_dataset.yaml datamodule config:
val_on_test: True
val points mixed with train/test points
It sometimes happens that your validation points are stored in the same
preprocessed files as your training or testing points. In this peculiar situation, it is
possible to load the relevant files when needed and slice only the required points
as an on_device_transform to save time.
In this case, the all_base_cloud_ids() method of your YourDataset may contain duplicate
entries between val and the other splits:
def all_base_cloud_ids(self):
return {
'train': [...], # list of clouds ids in your train set, may contain duplicates with val
'val': [...], # list of clouds ids in your vallidation set
'test': [...] # list of clouds ids in your test set, may contain duplicates with val
}
You must specify one of the following in your
configs/datamodule/your_task/your_dataset.yaml datamodule config:
val_mixed_in_train: True # if some preprocessed clouds contain both validation and train points
test_mixed_in_val: True # if some preprocessed clouds contain both validation and test points
Finally, your read_single_raw_cloud() method must return Data objects holding a is_val
boolean attribute indicating whether a point belongs to the validation set. If val_mixed_in_train
or test_mixed_in_val are specified, this attribute will be used for selecting the relevant
points at batch creation time. See S3DIS's read_s3dis_area() for an example of how is_val
can be specified.