# Semantic2D: Enabling Semantic Scene Understanding with 2D Lidar Alone
S³-Net implementation code for our paper ["Semantic2D: Enabling Semantic Scene Understanding with 2D Lidar Alone"](https://arxiv.org/pdf/2409.09899).
Video demos can be found at [multimedia demonstrations](https://youtu.be/P1Hsvj6WUSY).
The Semantic2D dataset can be found and downloaded at: https://doi.org/10.5281/zenodo.18350696.
## Related Resources
- **Dataset Download:** https://doi.org/10.5281/zenodo.18350696
- **SALSA (Dataset and Labeling Framework):** https://github.com/TempleRAIL/semantic2d
- **S³-Net (Stochastic Semantic Segmentation):** https://github.com/TempleRAIL/s3_net
- **Semantic CNN Navigation:** https://github.com/TempleRAIL/semantic_cnn_nav
## S³-Net: Stochastic Semantic Segmentation Network
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
S³-Net (Stochastic Semantic Segmentation Network) is a deep learning model for semantic segmentation of 2D LiDAR scans. It uses a Variational Autoencoder (VAE) architecture with residual blocks to predict a semantic label for each LiDAR point.
## Demo Results
**S³-Net Segmentation**
![S³-Net Segmentation](./demo/1.lobby_s3net_segmentation.gif)
**Semantic Mapping**
![Semantic Mapping](./demo/2.lobby_semantic_mapping.gif)
**Semantic Navigation**
![Semantic Navigation](./demo/3.lobby_semantic_navigation.gif)
## Model Architecture
S³-Net uses an encoder-decoder architecture with stochastic latent representations:
```
Input (3 channels: scan, intensity, angle of incidence)
                   │
                   ▼
┌──────────────────────────────────────┐
│  Encoder (Conv1D + Residual Blocks)  │
│  - Conv1D (3 → 32)   stride=2        │
│  - Conv1D (32 → 64)  stride=2        │
│  - Residual Stack (2 layers)         │
└──────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│  VAE Reparameterization              │
│  - μ (mean) and σ (std) estimation   │
│  - Latent sampling z ~ N(μ, σ²)      │
│  - Monte Carlo KL divergence         │
└──────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│  Decoder (Residual + TransposeConv)  │
│  - Residual Stack (2 layers)         │
│  - TransposeConv1D (64 → 32)         │
│  - TransposeConv1D (32 → 10)         │
│  - Softmax (10 semantic classes)     │
└──────────────────────────────────────┘
                   │
                   ▼
Output (10 channels: semantic probabilities)
```
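The diagram above maps naturally onto a PyTorch sketch. Channel counts and strides follow the diagram; the kernel sizes, padding, residual-block internals, and the 1×1-convolution heads for μ and log σ² are illustrative assumptions, not the exact layers in `scripts/model.py`.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative residual block (the real one lives in scripts/model.py)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReLU(), nn.Conv1d(ch, ch, kernel_size=3, padding=1),
            nn.ReLU(), nn.Conv1d(ch, ch, kernel_size=1))

    def forward(self, x):
        return x + self.body(x)  # skip connection

class S3NetSketch(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Encoder: two stride-2 convolutions (3 -> 32 -> 64) + residual stack
        self.encoder = nn.Sequential(
            nn.Conv1d(3, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=4, stride=2, padding=1),
            ResidualBlock(64), ResidualBlock(64))
        # Heads estimating mu and log(sigma^2) (1x1 convs are an assumption)
        self.to_mu = nn.Conv1d(64, 64, kernel_size=1)
        self.to_logvar = nn.Conv1d(64, 64, kernel_size=1)
        # Decoder: residual stack + two transpose convolutions (64 -> 32 -> 10)
        self.decoder = nn.Sequential(
            ResidualBlock(64), ResidualBlock(64),
            nn.ConvTranspose1d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(32, num_classes, kernel_size=4, stride=2, padding=1))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + eps * sigma
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return torch.softmax(self.decoder(z), dim=1), mu, logvar
```

With stride-2 convolutions mirrored by stride-2 transpose convolutions, a scan of N points (N divisible by 4) yields per-point class probabilities of the same length.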
**Key Features:**
- **3 Input Channels:** Range scan, intensity, angle of incidence
- **10 Output Classes:** Background + 9 semantic classes
- **Stochastic Inference:** Multiple forward passes enable uncertainty estimation via majority voting
- **Loss Function:** Cross-Entropy + Lovasz-Softmax + β-VAE KL divergence
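A minimal sketch of this objective, omitting the Lovasz-Softmax term (implemented in `scripts/lovasz_losses.py`). The diagram mentions a Monte Carlo KL estimate; the closed-form Gaussian KL below is a simplifying assumption.

```python
import torch
import torch.nn.functional as F

def s3net_loss_sketch(logits, target, mu, logvar, beta=0.01):
    """Sketch of the training objective: cross-entropy plus a
    beta-weighted KL divergence (Lovasz-Softmax term omitted here)."""
    ce = F.cross_entropy(logits, target)
    # KL( N(mu, sigma^2) || N(0, 1) ) in closed form, averaged over elements
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return ce + beta * kl
```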
## Semantic Classes
| ID | Class | Description |
|----|------------|--------------------------------|
| 0 | Other | Background/unknown |
| 1 | Chair | Office and lounge chairs |
| 2 | Door | Doors (open/closed) |
| 3 | Elevator | Elevator doors |
| 4 | Person | Dynamic pedestrians |
| 5 | Pillar | Structural pillars/columns |
| 6 | Sofa | Sofas and couches |
| 7 | Table | Tables of all types |
| 8 | Trash bin | Waste receptacles |
| 9 | Wall | Walls and flat surfaces |
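For decoding predictions into human-readable names, the table transcribes directly into a lookup; the dict and helper below are illustrative, not part of the repository's API.

```python
# Class-ID-to-name mapping, transcribed from the table above
SEMANTIC_CLASSES = {
    0: "Other", 1: "Chair", 2: "Door", 3: "Elevator", 4: "Person",
    5: "Pillar", 6: "Sofa", 7: "Table", 8: "Trash bin", 9: "Wall",
}

def decode_labels(label_ids):
    """Map a sequence of predicted class IDs to class names."""
    return [SEMANTIC_CLASSES[i] for i in label_ids]
```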
## Requirements
- Python 3.7+
- PyTorch 1.7.1+
- TensorBoard
- NumPy
- Matplotlib
- tqdm
Install dependencies:
```bash
pip install torch torchvision tensorboardX numpy matplotlib tqdm
```
## Dataset Structure
S³-Net expects the Semantic2D dataset organized as follows:
```
~/semantic2d_data/
├── dataset.txt              # List of dataset folders
├── 2024-04-11-15-24-29/     # Dataset folder 1
│   ├── train.txt            # Training sample list
│   ├── dev.txt              # Validation sample list
│   ├── scans_lidar/         # Range scans (.npy)
│   ├── intensities_lidar/   # Intensity data (.npy)
│   └── semantic_label/      # Ground truth labels (.npy)
├── 2024-04-04-12-16-41/     # Dataset folder 2
│   └── ...
└── ...
```
**dataset.txt format:**
```
2024-04-11-15-24-29
2024-04-04-12-16-41
```
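Enumerating training samples from this layout can be sketched as below. The assumption that `train.txt` holds one sample ID per line is illustrative; the repository's loader in `scripts/train.py` is authoritative.

```python
from pathlib import Path

def list_samples(root, split="train.txt"):
    """Read dataset.txt, then each folder's sample list, and return
    (folder, sample_id) pairs. One-ID-per-line format is an assumption."""
    root = Path(root)
    pairs = []
    for folder in (root / "dataset.txt").read_text().split():
        for sample_id in (root / folder / split).read_text().split():
            pairs.append((folder, sample_id))
    return pairs
```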
## Usage
### Training
Train S³-Net on your dataset:
```bash
sh run_train.sh ~/semantic2d_data/ ~/semantic2d_data/
```
**Arguments:**
- `$1` - Training data directory (contains `dataset.txt` and subfolders)
- `$2` - Validation data directory
**Training Configuration** (in `scripts/train.py`):
| Parameter | Default | Description |
|-----------|---------|-------------|
| `NUM_EPOCHS` | 20000 | Total training epochs |
| `BATCH_SIZE` | 1024 | Samples per batch |
| `LEARNING_RATE` | 0.001 | Initial learning rate |
| `BETA` | 0.01 | β-VAE weight for KL divergence |
**Learning Rate Schedule:**
- Epochs 0-50000: `1e-4`
- Epochs 50000-480000: `2e-5`
- Epochs 480000+: Exponential decay
The model saves checkpoints every 2000 epochs to `./model/`.
### Inference Demo
Run semantic segmentation on test data:
```bash
sh run_eval_demo.sh ~/semantic2d_data/
```
**Arguments:**
- `$1` - Test data directory (reads `dev.txt` for sample list)
**Output:**
- `./output/semantic_ground_truth_*.png` - Ground truth visualizations
- `./output/semantic_s3net_*.png` - S³-Net predictions
**Example Output:**
| Ground Truth | S³-Net Prediction |
|:------------:|:-----------------:|
| ![Ground Truth](./output/semantic_ground_truth_7000.png) | ![S³-Net Prediction](./output/semantic_s3net_7000.png) |
### Stochastic Inference
S³-Net performs **32 stochastic forward passes** per sample and uses **majority voting** to determine the final prediction. This provides:
- More robust predictions
- Implicit uncertainty estimation
- Reduced noise in segmentation boundaries
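Assuming each pass yields per-point argmax labels, the voting step can be sketched with NumPy (the repository's inference code is in `scripts/decode_demo.py`):

```python
import numpy as np

def majority_vote(label_samples, num_classes=10):
    """Combine T stochastic passes into one prediction per LiDAR point.

    label_samples: (T, N) integer class labels from T forward passes.
    Returns: (N,) array with the most-voted class for each point.
    """
    # Count votes per class for each point -> (num_classes, N)
    votes = np.apply_along_axis(
        lambda v: np.bincount(v, minlength=num_classes), 0, label_samples)
    return votes.argmax(axis=0)
```

The vote counts themselves double as an implicit uncertainty signal: points where the winning class receives a small share of the 32 votes are less certain.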
## File Structure
```
s3_net/
├── demo/                    # Demo GIFs
│   ├── 1.lobby_s3net_segmentation.gif
│   ├── 2.lobby_semantic_mapping.gif
│   └── 3.lobby_semantic_navigation.gif
├── model/
│   └── s3_net_model.pth     # Pretrained model weights
├── output/                  # Inference output directory
├── scripts/
│   ├── model.py             # S³-Net model architecture
│   ├── train.py             # Training script
│   ├── decode_demo.py       # Inference/demo script
│   └── lovasz_losses.py     # Lovasz-Softmax loss function
├── run_train.sh             # Training driver script
├── run_eval_demo.sh         # Inference driver script
├── LICENSE                  # MIT License
└── README.md                # This file
```
## TensorBoard Monitoring
Training logs are saved to `./runs/`. View training progress:
```bash
tensorboard --logdir=runs
```
Monitored metrics:
- Training/Validation loss
- Cross-Entropy loss
- Lovasz-Softmax loss
## Pre-trained Model
A pre-trained model is included at `model/s3_net_model.pth`. This model was trained on the Semantic2D dataset with the Hokuyo UTM-30LX-EW LiDAR sensor.
To use the pre-trained model:
```bash
sh run_eval_demo.sh ~/semantic2d_data/
```
## Citation
```bibtex
@article{xie2026semantic2d,
title={Semantic2D: Enabling Semantic Scene Understanding with 2D Lidar Alone},
author={Xie, Zhanteng and Pan, Yipeng and Zhang, Yinqiang and Pan, Jia and Dames, Philip},
journal={arXiv preprint arXiv:2409.09899},
year={2026}
}
```