[THOR] Full HF export — pth + safetensors + ONNX + TRT FP16/FP32 + paper + report

Files changed (10)

.gitattributes +1 -0
README.md +62 -11
TRAINING_REPORT.md +83 -0
checkpoints/best.pth +3 -0
onnx/thor_sta_v1.onnx +1 -1
paper.pdf +3 -0
pytorch/thor_sta_v1.pth +1 -1
pytorch/thor_sta_v1.safetensors +1 -1
tensorrt/thor_sta_v1_fp16.trt +2 -2
tensorrt/thor_sta_v1_fp32.trt +2 -2

.gitattributes CHANGED

@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tensorrt/thor_sta_v1_fp16.trt filter=lfs diff=lfs merge=lfs -text
 tensorrt/thor_sta_v1_fp32.trt filter=lfs diff=lfs merge=lfs -text

 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tensorrt/thor_sta_v1_fp16.trt filter=lfs diff=lfs merge=lfs -text
 tensorrt/thor_sta_v1_fp32.trt filter=lfs diff=lfs merge=lfs -text
+paper.pdf filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,17 +1,25 @@
 ---
 language: en
-license: mit
 tags:
   - visual-slam
   - robotics
   - pose-estimation
   - pointmap
   - computer-vision
 library_name: pytorch
 ---
 # THOR — ViSTA-SLAM STA Model
 **Project THOR** is ANIMA Wave-6's Tier-1 Foundation SLAM module, implementing the
 **Symmetric Two-view Association (STA)** frontend from the ViSTA-SLAM paper.
@@ -21,17 +29,19 @@ library_name: pytorch
 - **Authors**: Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers
 - **arXiv**: [2509.01584](https://arxiv.org/abs/2509.01584)
 - **Published**: 1 September 2025
 ## Model Summary
 | Property | Value |
 |---|---|
 | Input | Two RGB frames — `(B, 3, 224, 224)` each |
-| Output | Quaternion `(B,4)`, Translation `(B,3)`, Pointmap `(B,H,W,3)` |
-| Parameters | ~35% fewer than SOTA SLAM frontends |
 | Intrinsics | None required — intrinsic-free design |
-| Checkpoint epoch | 198 |
-| Best val loss | 0.782216 |
 ## Architecture
@@ -43,6 +53,16 @@ through shared weights, producing:
 A Sim(3) pose graph backend handles global consistency and scale-drift correction.
 ## Usage
 ```python
@@ -66,7 +86,7 @@ with torch.no_grad():
 print(output.quaternion.shape)   # (1, 4)
 print(output.translation.shape)  # (1, 3)
-print(output.pointmap.shape)     # (1, H, W, 3)
 ```
 ### ONNX inference
@@ -88,6 +108,26 @@ quaternion, translation, pointmap = sess.run(
 )
 ```
 ## Downstream Contracts (ANIMA Wave-6)
 | Module | Dependency | Topic |
@@ -100,15 +140,26 @@ quaternion, translation, pointmap = sess.run(
 ```
 README.md                          # This file
 pytorch/thor_sta_v1.pth            # PyTorch state dict
-pytorch/thor_sta_v1.safetensors    # SafeTensors (if exported)
 onnx/thor_sta_v1.onnx              # ONNX opset 17
-tensorrt/thor_sta_v1_fp16.trt      # TensorRT FP16 (if exported)
-tensorrt/thor_sta_v1_fp32.trt      # TensorRT FP32 (if exported)
 configs/training.toml              # Training configuration
-logs/training_history.json         # Epoch-by-epoch metrics
 ```
 ## Citation
 ```bibtex
@@ -122,4 +173,4 @@ logs/training_history.json         # Epoch-by-epoch metrics
 ## License
-MIT License — see [LICENSE](https://github.com/zhangganlin/vista-slam/blob/main/LICENSE).

 ---
 language: en
+license: apache-2.0
 tags:
   - visual-slam
   - robotics
+  - anima
+  - thor
+  - robot-flow-labs
   - pose-estimation
   - pointmap
   - computer-vision
+  - slam
+  - monocular-slam
 library_name: pytorch
+pipeline_tag: robotics
 ---
 # THOR — ViSTA-SLAM STA Model
+Part of the [ANIMA Perception Suite](https://robotflowlabs.com) by Robot Flow Labs.
 **Project THOR** is ANIMA Wave-6's Tier-1 Foundation SLAM module, implementing the
 **Symmetric Two-view Association (STA)** frontend from the ViSTA-SLAM paper.
 - **Authors**: Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers
 - **arXiv**: [2509.01584](https://arxiv.org/abs/2509.01584)
 - **Published**: 1 September 2025
+- **PDF**: [paper.pdf](paper.pdf) (included in this repo)
 ## Model Summary
 | Property | Value |
 |---|---|
 | Input | Two RGB frames — `(B, 3, 224, 224)` each |
+| Output | Quaternion `(B,4)`, Translation `(B,3)`, Pointmap `(B,224,224,3)` |
+| Parameters | ~12.4M (ResNet-18 backbone) |
 | Intrinsics | None required — intrinsic-free design |
+| Best epoch | 198 |
+| Best val loss | 0.782216 |
+| Training | 200 epochs, AdamW, lr=1.5e-5, bf16, NVIDIA L4 |
 ## Architecture
 A Sim(3) pose graph backend handles global consistency and scale-drift correction.
+## Exported Formats
+| Format | File | Size | Use Case |
+|--------|------|------|----------|
+| PyTorch (.pth) | `pytorch/thor_sta_v1.pth` | 49.6 MB | Training, fine-tuning |
+| SafeTensors | `pytorch/thor_sta_v1.safetensors` | 49.5 MB | Fast loading, safe |
+| ONNX (opset 17) | `onnx/thor_sta_v1.onnx` | 6.7 MB | Cross-platform inference |
+| TensorRT FP16 | `tensorrt/thor_sta_v1_fp16.trt` | 6.3 MB | Edge deployment (Jetson/L4) |
+| TensorRT FP32 | `tensorrt/thor_sta_v1_fp32.trt` | 11.4 MB | Full precision inference |
 ## Usage
 ```python
 print(output.quaternion.shape)   # (1, 4)
 print(output.translation.shape)  # (1, 3)
+print(output.pointmap.shape)     # (1, 224, 224, 3)
 ```
 ### ONNX inference
 )
 ```
+### TensorRT inference
+```python
+import tensorrt as trt
+import pycuda.driver as cuda
+import pycuda.autoinit
+import numpy as np
+TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
+runtime = trt.Runtime(TRT_LOGGER)
+with open("tensorrt/thor_sta_v1_fp16.trt", "rb") as f:
+    engine = runtime.deserialize_cuda_engine(f.read())
+context = engine.create_execution_context()
+context.set_input_shape("img_a", (1, 3, 224, 224))
+context.set_input_shape("img_b", (1, 3, 224, 224))
+# ... allocate buffers and run inference
+```
 ## Downstream Contracts (ANIMA Wave-6)
 | Module | Dependency | Topic |
 ```
 README.md                          # This file
+paper.pdf                          # ViSTA-SLAM paper (arXiv:2509.01584)
+TRAINING_REPORT.md                 # Full training report with metrics
+anima_module.yaml                  # ANIMA module manifest
 pytorch/thor_sta_v1.pth            # PyTorch state dict
+pytorch/thor_sta_v1.safetensors    # SafeTensors
 onnx/thor_sta_v1.onnx              # ONNX opset 17
+tensorrt/thor_sta_v1_fp16.trt      # TensorRT FP16
+tensorrt/thor_sta_v1_fp32.trt      # TensorRT FP32
+checkpoints/best.pth               # Best checkpoint (resume training)
 configs/training.toml              # Training configuration
+logs/training_history.json         # Epoch-by-epoch metrics (200 epochs)
 ```
+## Training
+- **Hardware**: NVIDIA L4 (23GB VRAM)
+- **Framework**: PyTorch 2.10 + CUDA 12.8
+- **Config**: See `configs/training.toml`
+- **Report**: See `TRAINING_REPORT.md`
 ## Citation
 ```bibtex
 ## License
+Apache 2.0 — Robot Flow Labs / AIFLOW LABS LIMITED
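
The README's Model Summary lists a quaternion `(B,4)` and a translation `(B,3)` among the outputs. As a minimal sketch, one such pair could be assembled into a 4x4 homogeneous transform as shown here. The `(w, x, y, z)` component order and the helper name `pose_to_matrix` are assumptions; the README does not state the quaternion convention, so check it against the model code before relying on this.

```python
import math


def pose_to_matrix(quaternion, translation):
    """Build a 4x4 homogeneous transform from one quaternion and translation.

    ASSUMPTION: the quaternion is ordered (w, x, y, z). The README does not
    specify the order; swap components if the model emits (x, y, z, w).
    """
    w, x, y, z = quaternion
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("quaternion must be non-zero")
    # Normalise so that small numerical drift does not skew the rotation.
    w, x, y, z = (c / norm for c in (w, x, y, z))
    tx, ty, tz = translation
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y), tx],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x), ty],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y), tz],
        [0.0, 0.0, 0.0, 1.0],
    ]
```

For a batched output, the same helper can be applied per sample, for example to `output.quaternion[0].tolist()` and `output.translation[0].tolist()`.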

TRAINING_REPORT.md ADDED

	@@ -0,0 +1,83 @@

+# TRAINING_REPORT.md — THOR ViSTA-SLAM STA Model
+## Training Configuration
+| Parameter | Value |
+|-----------|-------|
+| Model | STA (Symmetric Two-view Association) |
+| Architecture | ResNet-18 encoder + PoseHead + PointmapHead |
+| Parameters | ~12.4M |
+| Optimizer | AdamW |
+| Learning Rate | 1.5e-5 (cosine annealing + 5% warmup) |
+| Weight Decay | 0.05 |
+| Batch Size | 16 |
+| Epochs | 200 |
+| Mixed Precision | bf16 |
+| Gradient Clipping | max_norm=1.0 |
+| Seed | 42 |
+| GPU | NVIDIA L4 (23GB VRAM) |
+| Total Training Time | 3.6 hours |
+## Loss Components
+| Component | Description |
+|-----------|-------------|
+| ConfLoss (pointmap) | Point regression with L2.1 norm, alpha=0.4 |
+| RelPoseLoss (pose) | Relative SE(3) pose estimation loss |
+| ReprojLoss (reproj) | Reprojection error loss |
+## Results
+### Best Checkpoint
+| Metric | Value |
+|--------|-------|
+| Best Epoch | 198 |
+| Best Val Loss | 0.782216 |
+| Val Pointmap Loss | 0.426235 |
+| Val Pose Loss | 0.010962 |
+| Val Reproj Loss | 0.345019 |
+### Training Progression
+| Stage | Train Loss | Val Loss | LR |
+|-------|-----------|----------|-----|
+| Epoch 1 | 2.8586 | 2.6405 | 1.50e-06 |
+| Epoch 50 | 1.1821 | 1.1658 | 1.36e-05 |
+| Epoch 100 | 0.9068 | 0.9020 | 8.69e-06 |
+| Epoch 150 | 0.8142 | 0.8078 | 3.34e-06 |
+| Epoch 200 | 0.7888 | 0.7848 | 1.00e-06 |
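
The LR column above is consistent with the "cosine annealing + 5% warmup" entry in the configuration table. The sketch below is one schedule that reproduces the listed values to within about one percent. The floor `min_lr=1e-6` and the per-epoch (rather than per-step) stepping are assumptions inferred from the table, not taken from `configs/training.toml`, and `lr_at_epoch` is a hypothetical helper name.

```python
import math


def lr_at_epoch(epoch, base_lr=1.5e-5, min_lr=1.0e-6,
                total_epochs=200, warmup_frac=0.05):
    """Learning rate for a 1-indexed epoch, as in the progression table.

    ASSUMPTIONS: linear warmup over the first 5% of epochs, then cosine
    decay to min_lr; min_lr is inferred from the Epoch 200 row.
    """
    warmup_epochs = int(round(total_epochs * warmup_frac))  # 10 epochs here
    step = epoch - 1  # 0-indexed scheduler step
    if step < warmup_epochs:
        # Linear ramp: epoch 1 starts at base_lr / warmup_epochs.
        return base_lr * (step + 1) / warmup_epochs
    progress = (step - warmup_epochs) / (total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


for epoch in (1, 50, 100, 150, 200):
    print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.2e}")
```
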
+### Loss Breakdown (Epoch 200)
+| Component | Train | Val |
+|-----------|-------|-----|
+| Pointmap | 0.4273 | 0.4271 |
+| Pose | 0.0107 | 0.0109 |
+| Reproj | 0.3507 | 0.3468 |
+## Exported Formats
+| Format | File | Size |
+|--------|------|------|
+| PyTorch (.pth) | pytorch/thor_sta_v1.pth | 49.6 MB |
+| SafeTensors | pytorch/thor_sta_v1.safetensors | 49.5 MB |
+| ONNX (opset 17) | onnx/thor_sta_v1.onnx | 6.7 MB |
+| TensorRT FP16 | tensorrt/thor_sta_v1_fp16.trt | 6.3 MB |
+| TensorRT FP32 | tensorrt/thor_sta_v1_fp32.trt | 11.4 MB |
+## Checkpoint
+- Best: checkpoints/best.pth (epoch 198, val_loss=0.782216)
+- Contains: model state_dict, optimizer state, scheduler state, config
+## Paper Reference
+- **Title**: ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association
+- **Authors**: Zhang, Qian, Wang, Cremers
+- **arXiv**: 2509.01584
+- **Paper PDF**: paper.pdf (included in repo)
+## HuggingFace
+- Repo: [ilessio-aiflowlab/project_thor](https://huggingface.co/ilessio-aiflowlab/project_thor)
+## Notes
+- Training used synthetic data; the full STA training on ScanNet/ScanNet++/ARKit/CO3D is parked because it needs datacenter-scale compute
+- Best val_loss achieved at epoch 198, not at final epoch
+- Pose loss converges fastest, pointmap loss dominates total loss
+- The upstream pretrained ViT model (438M params) is separate from this ResNet-18 distillation target
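
The figures in the Best Checkpoint and Loss Breakdown tables of this report suggest that the total loss is the plain, unweighted sum of the three components. The short check below uses only numbers copied from those tables; the unweighted-sum reading is an inference from the arithmetic, not something the report states, and `total_loss` is a hypothetical helper.

```python
# Component losses copied from the report tables.
best_val = {"pointmap": 0.426235, "pose": 0.010962, "reproj": 0.345019}
epoch_200_val = {"pointmap": 0.4271, "pose": 0.0109, "reproj": 0.3468}
epoch_200_train = {"pointmap": 0.4273, "pose": 0.0107, "reproj": 0.3507}


def total_loss(components):
    """Unweighted sum of the loss components (assumed weighting of 1.0 each)."""
    return sum(components.values())


def share(components, name):
    """Fraction of the total contributed by one component."""
    return components[name] / total_loss(components)


print(f"best val total:      {total_loss(best_val):.6f}")       # report: 0.782216
print(f"epoch 200 val total: {total_loss(epoch_200_val):.4f}")  # report: 0.7848
print(f"epoch 200 train:     {total_loss(epoch_200_train):.4f}")  # report: 0.7888
print(f"pointmap share of best val loss: {share(best_val, 'pointmap'):.1%}")
```

The train total differs from the reported 0.7888 only in the last digit, which is what rounding each component to four decimals would produce. The pointmap term accounts for a little over half of the total, in line with the note that it dominates.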

checkpoints/best.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:351dcfb96befb73799e81a91c9d2498247a0cb4b555c3ef79dc1c86ee787e7c9
+size 62956211

onnx/thor_sta_v1.onnx CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1f2bc1f9732e428fa35db858cfad9e9b8a8dafb30e8798424a0cc4e1e83f7909
 size 6734123

 version https://git-lfs.github.com/spec/v1
+oid sha256:576a5054e466a4544a0ba7e58e6a16224d0a9e3a024922dcd4cf53dca9d294d3
 size 6734123

paper.pdf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2f38317f02fb316043321e1b372471db1d8b8ead4d31ac9ab1c45aac00aa4cf6
+size 9729032

pytorch/thor_sta_v1.pth CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:bd550bc16d7e6592a9ce1b0d70f2c4431061752bdb3b6fa08dc876c96102a136
 size 49558699

 version https://git-lfs.github.com/spec/v1
+oid sha256:92ab2c6716997ab1770e0091f85dc92ac1d8e2b0c6a5d2d7d2a3a14e257184cf
 size 49558699

pytorch/thor_sta_v1.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:277da895293ccf28c1bd512899debc9c8465b81a2a5f50540a55a4cb7163b001
 size 49515152

 version https://git-lfs.github.com/spec/v1
+oid sha256:dcaa916721c4b4a017a1da15a7dfb80503582990c4dca0c30b374fda6a731bf8
 size 49515152

tensorrt/thor_sta_v1_fp16.trt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a1513762d0665503014c2c20a6693bf6aec2414ba727d03017e1eb1b1cc3fba0
-size 6226044

 version https://git-lfs.github.com/spec/v1
+oid sha256:70578c40b0290440d86fe5aa69316d8ad82b5e39de4813fa3180b4b9cfdd8c34
+size 6322780

tensorrt/thor_sta_v1_fp32.trt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cb76e071a9523b748410cac1bb44d3b61ca35fd933062a5ee95c0a5ef58694c1
-size 10785236

 version https://git-lfs.github.com/spec/v1
+oid sha256:a5e0c292875ea7a648653baca6cc0c0d9c0bc0a9757f7808c686cb2185a9cb5d
+size 11414372
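
The binary files in this commit are stored as Git LFS pointers, each with three `key value` lines (`version`, `oid`, `size`). As a rough sketch, a pointer can be parsed and its byte size compared with the sizes quoted in the README's Exported Formats table. The helper name `parse_lfs_pointer` is hypothetical, and the example pointer text is copied from the new `pytorch/thor_sta_v1.pth` entry above.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Pointer for pytorch/thor_sta_v1.pth as added in this commit.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:92ab2c6716997ab1770e0091f85dc92ac1d8e2b0c6a5d2d7d2a3a14e257184cf
size 49558699
"""

pointer = parse_lfs_pointer(pointer_text)
size_mb = pointer["size"] / 1e6  # decimal megabytes, matching the README table
print(f"{pointer['hash_algo']} {pointer['digest'][:12]}... {size_mb:.1f} MB")
```

The parsed size of 49,558,699 bytes rounds to the 49.6 MB listed for the `.pth` file. Assuming float32 weights, that is roughly 12.4M parameters at 4 bytes each, which agrees with the parameter count in the README.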