Dynamic Visual SLAM using a General 3D Prior
Paper • 2512.06868 • Published
Part of the ANIMA Perception Suite by Robot Flow Labs.
BALDUR enhances monocular visual SLAM for dynamic environments (crowds, traffic, moving robots). It combines a feed-forward depth + dynamic mask prediction network with patch-based bundle adjustment.
| Setting | Value |
|---|---|
| Dataset | TUM-RGBD Dynamic (3,784 frames, 4 sequences) |
| Split | 90/5/5 (train/val/test) |
| Best val_loss | 0.2680 (epoch 183) |
| Optimizer | AdamW (lr=0.0003, wd=0.05) |
| Scheduler | CosineAnnealingWarmRestarts (T_0=20) |
| Regularization | Dropout 0.3, heavy augmentation |
| Precision | bf16 mixed precision |
| Hardware | NVIDIA L4 (23GB) |
| Training time | ~5.3 hours |
| Date | 2026-04-02 |
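The warm-restart schedule above restarts the cosine decay every `T_0 = 20` epochs. A minimal sketch of the learning-rate curve it produces, assuming PyTorch's defaults of `T_mult = 1` and `eta_min = 0` (both assumptions, as the table only states `T_0`):

```python
import math

def cosine_warm_restarts(epoch, base_lr=3e-4, T_0=20, eta_min=0.0):
    """Learning rate at integer `epoch` for CosineAnnealingWarmRestarts
    with T_mult=1: the cosine decay restarts every T_0 epochs."""
    t_cur = epoch % T_0  # position within the current restart cycle
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / T_0)) / 2

# At epochs 0, 20, 40, ... the rate jumps back to the base value 3e-4;
# halfway through a cycle (e.g. epoch 10) it has decayed to 1.5e-4.
```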
| Format | File | Use Case |
|---|---|---|
| PyTorch (.pth) | pytorch/baldur_v1.pth | Training, fine-tuning |
| SafeTensors | pytorch/baldur_v1.safetensors | Fast, safe loading |
| ONNX | onnx/baldur_v1.onnx | Cross-platform inference |
| TensorRT FP32 | tensorrt/baldur_v1_fp32.trt | Full-precision inference |
| TensorRT FP16 | tensorrt/baldur_v1_fp16.trt | Edge deployment (Jetson/L4) |
```python
import torch

from anima_baldur.models.feed_forward import FeedForwardNetwork

model = FeedForwardNetwork(pretrained=False, max_depth=10.0)
model.load_state_dict(torch.load("pytorch/baldur_v1.pth", map_location="cpu"))
model.eval()

# Inference
rgb = torch.randn(1, 3, 480, 640)  # ImageNet-normalized RGB, (B, C, H, W)
with torch.no_grad():
    depth, dynamic_mask = model(rgb)
# depth: metres; dynamic_mask: 0 = static, 1 = dynamic
```
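The example above assumes the input is already ImageNet-normalized. A preprocessing sketch using the standard ImageNet statistics (the expected uint8 input layout is an assumption inferred from the comment above, not stated by the model card):

```python
import torch

# Standard ImageNet channel statistics, broadcast over (B, C, H, W).
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def preprocess(rgb_uint8):
    """Convert a (B, 3, H, W) uint8 RGB batch to ImageNet-normalized floats."""
    x = rgb_uint8.float() / 255.0  # scale to [0, 1]
    return (x - IMAGENET_MEAN) / IMAGENET_STD

# Example: one 480x640 frame, the resolution used in the inference snippet.
frame = torch.randint(0, 256, (1, 3, 480, 640), dtype=torch.uint8)
rgb = preprocess(frame)
```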
The dynamic mask identifies moving objects (people, vehicles) in the scene. It is trained self-supervised: the model learns to mask regions where depth prediction is unreliable, which in practice correspond to dynamic objects. Threshold the mask (`dynamic_mask > 0.5`) to exclude dynamic regions from SLAM mapping and bundle adjustment.
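The thresholding step can be sketched with `torch.where` on dummy tensors shaped like the model outputs (the zero sentinel for invalid depth is an assumption; use whatever value your SLAM backend treats as "no measurement"):

```python
import torch

depth = torch.tensor([[2.0, 4.5], [1.2, 9.9]])         # metres
dynamic_mask = torch.tensor([[0.1, 0.9], [0.7, 0.2]])  # 0 = static, 1 = dynamic

# Zero out depth in dynamic regions so SLAM mapping ignores them.
static_depth = torch.where(dynamic_mask > 0.5, torch.zeros_like(depth), depth)
# static_depth == [[2.0, 0.0], [0.0, 9.9]]
```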
Apache 2.0 © Robot Flow Labs / AIFLOW LABS LIMITED
```bibtex
@article{baldur2025,
  title   = {Dynamic Visual SLAM using a General 3D Prior},
  author  = {Zhong, Xingguang and Jin, Liren and Popovi\'c, Marija and Behley, Jens and Stachniss, Cyrill},
  year    = {2025},
  journal = {arXiv:2512.06868}
}
```