File size: 5,388 Bytes
26225c5 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 |
# Data structures
This project relies on specific data structures for manipulating hierarchical
partitions.
As a basic user, you will most likely be exposed to `Data` and `NAG`.
As you go deeper into the project, you may need to familiarize yourself with
`CSRData`, `Cluster`, `InstanceData`, and all the associated batching
structures.
## `Data`
A `Data` object stores a single-level graph.
It inherits from `torch_geometric`'s `Data` and has a similar behavior (see the
[official documentation](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.data.Data.html#torch_geometric.data.Data)
for more on this).
Similar to `torch_geometric`'s `Data`:
- `Data.pos` holds node positions
- `Data.x` holds node features
- `Data.y` holds node labels
- `Data.edge_index` holds edges between nodes
- `Data.edge_attr` holds edge features
Important additional specificities of our `Data` object are:
- `Data.super_index` holds, for each node, the index of the parent node (ie
superpoint) in the partition level above. Said otherwise, `super_index`
allows mapping from $P_i$ to $P_{i+1}$.
- `Data.sub` holds a `Cluster` object indicating, for each node, the children
nodes in the partition level below. Said otherwise, `sub` allows mapping from
$P_i$ to $P_{i-1}$.
- `Data.to_trimmed()` works like `torch_geometric`'s `Data.coalesce()` with the
additional constraint that (i,j) and (j,i) edges are considered duplicates
- `Data.save()` and `Data.load()` allow optimized, memory-friendly I/O
operations
- `Data.select()` indexes the nodes à la numpy
- `Data.show()` for interactive visualization (see
[visualization documentation](docs/visualization.md))
The `Batch` class allows for batching `Data` objects together, while preserving
their advanced mechanisms without index collisions.
## `NAG`
`NAG` (Nested Acyclic Graph) stores the hierarchical partition in the form of a
list of `Data` objects.
Important specificities of our `Data` object are:
- `NAG[i]` returns a `Data` object holding the partition level `ì`
- `NAG.get_super_index()` returns the index mapping nodes from any level `i` to
`j` with `i<j`
- `NAG.get_sampling()` produces indices for sampling the superpoints with
certain constraints
- `NAG.save()` and `NAG.load()` allow optimized, memory-friendly I/O operations
- `NAG.select()` indexes the nodes of a specified partition level à la numpy
and updates the rest of the `NAG` structure accordingly
- `NAG.show()` for interactive visualization (see
[visualization documentation](docs/visualization.md))
The `NAGBatch` class allows for batching `NAG` objects together, while
preserving their advanced mechanisms without index collisions.
## `CSRData`
`CSRData` is the data structure we use for manipulating sparse information.
This implements the [CSR (Compressed Sparse Row)](https://en.wikipedia.org/wiki/Sparse_matrix) representation. Simply
put, this struture stores a list of tensor objects as `values`, and a `pointers`
1D tensor for indexing `values`. The `j`th tensor values related to element `i`,
are stored in: `CSRData.values[j][CSRData.pointers[i]:CSRData.pointers[i+1]`.
The `CSRData` structure is designed for conveniently storing, indexing,
updating, and batching data of varying sizes such as lists of lists while making
use of efficient, vectorized, torch operations.
`CSRBatch` is the datastructure for batching multiple `CSRData` objects
together.
## `Cluster`
`Cluster` inherits from `CSRData` and is specifically designed for storing
cluster indices to efficiently gather indices of elements in a cluster `i`,
given `i`. This structure could typically be implemented as a list of lists,
but our implementation allows much faster parallelized operations for indexing,
updating, and batching this data.
When two `Cluster` are batched in an `ClusterBatch`, the indices will be updated
to avoid collision between the batch items.
## `InstanceData`
`InstanceData` inherits from `CSRData` and is specifically designed for
manipulating instance and panoptic segmentation data. By convention, the
element `i` by which we index the `values` is a point/superpoint/predicted
instance. The associated `values` describe overlaps between the `i`th element
and the target instance objects of the dataset. These `values` include the
following:
- `obj`: the index of the target instance object with which element `i` overlaps
- `count`: the number of points in the overlap
- `y`: the target instance object's semantic label
Importantly, each target object in the `InstanceData` is expected to be
described by a unique index in `obj`, regardless of its actual semantic class.
It is not required for the labels of object instances in `obj` to be contiguous
in `[0, obj_max]`, although enforcing it may have beneficial downstream effects
on memory and I/O times.
Finally, when two `InstanceData` are batched in an `InstanceBatch`, the `obj`
indices will be updated to avoid collision between the batch items.
## 🚀 Getting familiar with the data structures
See the [`notebooks/demo_nag.ipynb`](../notebooks/demo_nag.ipynb) notebook to play with and visualize a
provided `NAG`, without needing a whole dataset.
For more information, have a look at the docstrings and code of each data
structure, these are fairly commented and should help you gain a deeper
understanding of their mechanisms.
|