Data structures

This project relies on specific data structures for manipulating hierarchical partitions. As a basic user, you will most likely be exposed to Data and NAG. As you go deeper into the project, you may need to familiarize yourself with CSRData, Cluster, InstanceData, and all the associated batching structures.

`Data`

A Data object stores a single-level graph. It inherits from torch_geometric's Data and has a similar behavior (see the official documentation for more on this).

Similar to torch_geometric's Data:

Data.pos holds node positions
Data.x holds node features
Data.y holds node labels
Data.edge_index holds edges between nodes
Data.edge_attr holds edge features

Important additional specificities of our Data object are:

Data.super_index holds, for each node, the index of the parent node (ie superpoint) in the partition level above. Said otherwise, super_index allows mapping from $P_i$ to $P_{i+1}$.
Data.sub holds a Cluster object indicating, for each node, the children nodes in the partition level below. Said otherwise, sub allows mapping from $P_i$ to $P_{i-1}$.
Data.to_trimmed() works like torch_geometric's Data.coalesce() with the additional constraint that (i,j) and (j,i) edges are considered duplicates
Data.save() and Data.load() allow optimized, memory-friendly I/O operations
Data.select() indexes the nodes à la numpy
Data.show() for interactive visualization (see visualization documentation)

The Batch class allows for batching Data objects together, while preserving their advanced mechanisms without index collisions.

`NAG`

NAG (Nested Acyclic Graph) stores the hierarchical partition in the form of a list of Data objects.

Important specificities of our Data object are:

NAG[i] returns a Data object holding the partition level ì
NAG.get_super_index() returns the index mapping nodes from any level i to j with i<j
NAG.get_sampling() produces indices for sampling the superpoints with certain constraints
NAG.save() and NAG.load() allow optimized, memory-friendly I/O operations
NAG.select() indexes the nodes of a specified partition level à la numpy and updates the rest of the NAG structure accordingly
NAG.show() for interactive visualization (see visualization documentation)

The NAGBatch class allows for batching NAG objects together, while preserving their advanced mechanisms without index collisions.

`CSRData`

CSRData is the data structure we use for manipulating sparse information. This implements the CSR (Compressed Sparse Row) representation. Simply put, this struture stores a list of tensor objects as values, and a pointers 1D tensor for indexing values. The jth tensor values related to element i, are stored in: CSRData.values[j][CSRData.pointers[i]:CSRData.pointers[i+1].

The CSRData structure is designed for conveniently storing, indexing, updating, and batching data of varying sizes such as lists of lists while making use of efficient, vectorized, torch operations.

CSRBatch is the datastructure for batching multiple CSRData objects together.

`Cluster`

Cluster inherits from CSRData and is specifically designed for storing cluster indices to efficiently gather indices of elements in a cluster i, given i. This structure could typically be implemented as a list of lists, but our implementation allows much faster parallelized operations for indexing, updating, and batching this data.

When two Cluster are batched in an ClusterBatch, the indices will be updated to avoid collision between the batch items.

`InstanceData`

InstanceData inherits from CSRData and is specifically designed for manipulating instance and panoptic segmentation data. By convention, the element i by which we index the values is a point/superpoint/predicted instance. The associated values describe overlaps between the ith element and the target instance objects of the dataset. These values include the following:

obj: the index of the target instance object with which element i overlaps
count: the number of points in the overlap
y: the target instance object's semantic label

Importantly, each target object in the InstanceData is expected to be described by a unique index in obj, regardless of its actual semantic class. It is not required for the labels of object instances in obj to be contiguous in [0, obj_max], although enforcing it may have beneficial downstream effects on memory and I/O times.

Finally, when two InstanceData are batched in an InstanceBatch, the obj indices will be updated to avoid collision between the batch items.

🚀 Getting familiar with the data structures

See the notebooks/demo_nag.ipynb notebook to play with and visualize a provided NAG, without needing a whole dataset.

For more information, have a look at the docstrings and code of each data structure, these are fairly commented and should help you gain a deeper understanding of their mechanisms.