THOR — ViSTA-SLAM STA Model

Project THOR is ANIMA Wave-6's Tier-1 Foundation SLAM module, implementing the Symmetric Two-view Association (STA) frontend from the ViSTA-SLAM paper.

Paper

  • Title: ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association
  • Authors: Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers
  • arXiv: 2509.01584
  • Published: 1 September 2025

Model Summary

Property        Value
Input           Two RGB frames — (B, 3, 224, 224) each
Output          Quaternion (B, 4), Translation (B, 3), Pointmap (B, H, W, 3)
Parameters      ~35% fewer than SOTA SLAM frontends
Intrinsics      None required — intrinsic-free design
Checkpoint      epoch 198
Best val loss   0.782216
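The card specifies the (1, 3, 224, 224) input layout but not the preprocessing pipeline. A minimal sketch, assuming frames are bilinearly resized to 224×224 and scaled to [0, 1]; the scaling and any normalization statistics are assumptions, not confirmed by this card:

```python
import torch
import torch.nn.functional as F

def preprocess(frame: torch.Tensor) -> torch.Tensor:
    """Convert an (H, W, 3) uint8 RGB frame to the (1, 3, 224, 224) input layout.

    NOTE: the 224x224 target comes from the model card; scaling to [0, 1]
    is an assumption -- the card does not state normalization statistics.
    """
    x = frame.permute(2, 0, 1).unsqueeze(0).float() / 255.0       # (1, 3, H, W)
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    return x

frame = torch.randint(0, 256, (480, 640, 3), dtype=torch.uint8)   # dummy camera frame
img = preprocess(frame)
print(img.shape)  # torch.Size([1, 3, 224, 224])
```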

Architecture

The STA model uses a symmetric encoder that processes two consecutive RGB frames through shared weights, producing:

  1. Pose head — relative SE(3) camera transformation (quaternion + translation)
  2. Pointmap head — dense local 3D pointmap in normalised image coordinates

A Sim(3) pose graph backend handles global consistency and scale-drift correction.
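To feed the pose graph backend, the pose head's quaternion and translation can be assembled into a 4×4 SE(3) matrix. A sketch assuming a (w, x, y, z) quaternion layout; the card does not state the convention, so verify it against the repository:

```python
import torch

def pose_to_se3(q: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Build (B, 4, 4) SE(3) matrices from quaternions (B, 4) and translations (B, 3).

    ASSUMPTION: quaternions are laid out as (w, x, y, z); the model card
    does not specify the ordering.
    """
    q = q / q.norm(dim=-1, keepdim=True)               # ensure unit quaternions
    w, x, y, z = q.unbind(-1)
    R = torch.stack([                                  # standard quaternion-to-rotation formula
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y),
        2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)
    T = torch.eye(4).expand(q.shape[0], 4, 4).clone()
    T[:, :3, :3] = R
    T[:, :3, 3] = t
    return T

q = torch.tensor([[1.0, 0.0, 0.0, 0.0]])  # identity rotation
t = torch.tensor([[0.1, 0.0, 0.2]])
T = pose_to_se3(q, t)
print(T.shape)  # torch.Size([1, 4, 4])
```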

Usage

import torch
from anima_thor.models.sta_model import STAConfig, STAModel

# Load from this repository
config = STAConfig()
model = STAModel(config)

ckpt = torch.load("pytorch/thor_sta_v1.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()

# Inference
img_a = torch.randn(1, 3, 224, 224)   # current frame
img_b = torch.randn(1, 3, 224, 224)   # previous frame

with torch.no_grad():
    output = model(img_a, img_b)

print(output.quaternion.shape)   # (1, 4)
print(output.translation.shape)  # (1, 3)
print(output.pointmap.shape)     # (1, H, W, 3)
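Because the model outputs only pairwise relative poses, a global trajectory is obtained by chaining them. A sketch of that accumulation, assuming each relative matrix maps frame i+1 into frame i's coordinates; the STA output's actual composition convention should be checked against the repository:

```python
import numpy as np

def accumulate(rel_poses):
    """Chain per-pair relative SE(3) matrices into world-frame poses.

    ASSUMPTION: rel_poses[i] maps frame i+1 into frame i's coordinates;
    verify the convention of the STA output before relying on this.
    """
    world = [np.eye(4)]                   # first frame defines the world frame
    for T_rel in rel_poses:
        world.append(world[-1] @ T_rel)   # left-compose onto the running pose
    return world

# Two pure translations of 1 m along x compose to 2 m.
step = np.eye(4)
step[0, 3] = 1.0
traj = accumulate([step, step])
print(traj[-1][0, 3])  # 2.0
```

In the full system this raw chaining drifts, which is exactly what the Sim(3) pose graph backend described above is there to correct.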

ONNX inference

import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession(
    "onnx/thor_sta_v1.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

img_a = np.random.randn(1, 3, 224, 224).astype(np.float32)
img_b = np.random.randn(1, 3, 224, 224).astype(np.float32)

quaternion, translation, pointmap = sess.run(
    None, {"img_a": img_a, "img_b": img_b}
)

Downstream Contracts (ANIMA Wave-6)

Module     Dependency              Topic
BALDUR     Semantic mapping        Pointmap → voxel grid
HEIMDALL   Hierarchical planning   Pose stream @ 30 Hz
HERMOD     Exploration             Coverage map
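The BALDUR contract consumes the pointmap as a voxel grid. A minimal occupancy-voxelization sketch; the voxel size and grid extent are illustrative choices, as BALDUR's actual grid parameters are not specified in this card:

```python
import numpy as np

def voxelize(pointmap: np.ndarray, voxel_size: float = 0.05, extent: float = 2.0) -> np.ndarray:
    """Bin an (H, W, 3) pointmap into a boolean occupancy grid.

    ASSUMPTION: voxel_size and the [-extent, extent] bounds are illustrative;
    BALDUR's real grid parameters are not given in this card.
    """
    pts = pointmap.reshape(-1, 3)
    n = int(2 * extent / voxel_size)                   # voxels per axis
    idx = np.floor((pts + extent) / voxel_size).astype(int)
    keep = np.all((idx >= 0) & (idx < n), axis=1)      # drop out-of-bounds points
    grid = np.zeros((n, n, n), dtype=bool)
    grid[tuple(idx[keep].T)] = True
    return grid

pm = np.zeros((4, 4, 3), dtype=np.float32)             # toy pointmap: all points at the origin
grid = voxelize(pm)
print(grid.sum())  # 1
```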

Files

README.md                          # This file
pytorch/thor_sta_v1.pth            # PyTorch state dict
pytorch/thor_sta_v1.safetensors    # SafeTensors (if exported)
onnx/thor_sta_v1.onnx              # ONNX opset 17
tensorrt/thor_sta_v1_fp16.trt      # TensorRT FP16 (if exported)
tensorrt/thor_sta_v1_fp32.trt      # TensorRT FP32 (if exported)
configs/training.toml              # Training configuration
logs/training_history.json         # Epoch-by-epoch metrics

Citation

@article{zhang2025vistaslam,
  title   = {ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association},
  author  = {Zhang, Ganlin and Qian, Shenhan and Wang, Xi and Cremers, Daniel},
  journal = {arXiv preprint arXiv:2509.01584},
  year    = {2025},
}

License

MIT License — see LICENSE.
