File size: 5,388 Bytes

26225c5

# Data structures

This project relies on specific data structures for manipulating hierarchical 
partitions. 
As a basic user, you will most likely be exposed to `Data` and `NAG`.
As you go deeper into the project, you may need to familiarize yourself with 
`CSRData`, `Cluster`, `InstanceData`, and all the associated batching 
structures. 

## `Data`
A `Data` object stores a single-level graph. 
It inherits from `torch_geometric`'s `Data` and has a similar behavior (see the
[official documentation](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.data.Data.html#torch_geometric.data.Data) 
for more on this). 

Similar to `torch_geometric`'s `Data`:
- `Data.pos` holds node positions
- `Data.x` holds node features
- `Data.y` holds node labels
- `Data.edge_index` holds edges between nodes
- `Data.edge_attr` holds edge features

Important additional specificities of our `Data` object are:
- `Data.super_index` holds, for each node, the index of the parent node (ie 
superpoint) in the partition level above. Said otherwise, `super_index` 
allows mapping from $P_i$ to $P_{i+1}$.
- `Data.sub` holds a `Cluster` object indicating, for each node, the children 
nodes in the partition level below. Said otherwise, `sub` allows mapping from 
$P_i$ to $P_{i-1}$.
- `Data.to_trimmed()` works like `torch_geometric`'s `Data.coalesce()` with the
additional constraint that (i,j) and (j,i) edges are considered duplicates
- `Data.save()` and `Data.load()` allow optimized, memory-friendly I/O 
operations
- `Data.select()` indexes the nodes à la numpy
- `Data.show()` for interactive visualization (see 
[visualization documentation](docs/visualization.md))

The `Batch` class allows for batching `Data` objects together, while preserving
their advanced mechanisms without index collisions. 

## `NAG`
`NAG` (Nested Acyclic Graph) stores the hierarchical partition in the form of a 
list of `Data` objects.

Important specificities of our `Data` object are:
- `NAG[i]` returns a `Data` object holding the partition level `ì`
- `NAG.get_super_index()` returns the index mapping nodes from any level `i` to
`j` with `i<j`
- `NAG.get_sampling()` produces indices for sampling the superpoints with 
certain constraints
- `NAG.save()` and `NAG.load()` allow optimized, memory-friendly I/O operations
- `NAG.select()` indexes the nodes of a specified partition level à la numpy 
and updates the rest of the `NAG` structure accordingly
- `NAG.show()` for interactive visualization (see 
[visualization documentation](docs/visualization.md))

The `NAGBatch` class allows for batching `NAG` objects together, while 
preserving their advanced mechanisms without index collisions.

## `CSRData`
`CSRData` is the data structure we use for manipulating sparse information. 
This implements the [CSR (Compressed Sparse Row)](https://en.wikipedia.org/wiki/Sparse_matrix) representation. Simply 
put, this struture stores a list of tensor objects as `values`, and a `pointers`
1D tensor for indexing `values`. The `j`th tensor values related to element `i`,
are stored in: `CSRData.values[j][CSRData.pointers[i]:CSRData.pointers[i+1]`.

The `CSRData` structure is designed for conveniently storing, indexing, 
updating, and batching data of varying sizes such as lists of lists while making
use of efficient, vectorized, torch operations. 

`CSRBatch` is the datastructure for batching multiple `CSRData` objects 
together.

## `Cluster`
`Cluster` inherits from `CSRData` and is specifically designed for storing 
cluster indices to efficiently gather indices of elements in a cluster `i`, 
given `i`. This structure could typically be implemented as a list of lists, 
but our implementation allows much faster parallelized operations for indexing,
updating, and batching this data.

When two `Cluster` are batched in an `ClusterBatch`, the indices will be updated
to avoid collision between the batch items.

## `InstanceData`
`InstanceData` inherits from `CSRData` and is specifically designed for 
manipulating instance and panoptic segmentation data. By convention, the 
element `i` by which we index the `values` is a point/superpoint/predicted 
instance. The associated `values` describe overlaps between the `i`th element
and the target instance objects of the dataset. These `values` include the 
following:

- `obj`: the index of the target instance object with which element `i` overlaps
- `count`: the number of points in the overlap
- `y`: the target instance object's semantic label

Importantly, each target object in the `InstanceData` is expected to be 
described by a unique index in `obj`, regardless of its actual semantic class.
It is not required for the labels of object instances in `obj` to be contiguous
in `[0, obj_max]`, although enforcing it may have beneficial downstream effects
on memory and I/O times.

Finally, when two `InstanceData` are batched in an `InstanceBatch`, the `obj` 
indices will be updated to avoid collision between the batch items.

## 🚀  Getting familiar with the data structures
See the [`notebooks/demo_nag.ipynb`](../notebooks/demo_nag.ipynb) notebook to play with and visualize a 
provided `NAG`, without needing a whole dataset.

For more information, have a look at the docstrings and code of each data 
structure, these are fairly commented and should help you gain a deeper 
understanding of their mechanisms.