Commit e056a9f by ilessio-aiflowlab (verified)
1 Parent(s): 79461a8

[THOR] Full HF export — pth + safetensors + ONNX + TRT FP16/FP32 + paper + report

.gitattributes CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  tensorrt/thor_sta_v1_fp16.trt filter=lfs diff=lfs merge=lfs -text
37
  tensorrt/thor_sta_v1_fp32.trt filter=lfs diff=lfs merge=lfs -text
 
 
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  tensorrt/thor_sta_v1_fp16.trt filter=lfs diff=lfs merge=lfs -text
37
  tensorrt/thor_sta_v1_fp32.trt filter=lfs diff=lfs merge=lfs -text
38
+ paper.pdf filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,17 +1,25 @@
1
  ---
2
  language: en
3
- license: mit
4
  tags:
5
  - visual-slam
6
  - robotics
 
 
 
7
  - pose-estimation
8
  - pointmap
9
  - computer-vision
 
 
10
  library_name: pytorch
 
11
  ---
12
 
13
  # THOR — ViSTA-SLAM STA Model
14
 
 
 
15
  **Project THOR** is ANIMA Wave-6's Tier-1 Foundation SLAM module, implementing the
16
  **Symmetric Two-view Association (STA)** frontend from the ViSTA-SLAM paper.
17
 
@@ -21,17 +29,19 @@ library_name: pytorch
21
  - **Authors**: Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers
22
  - **arXiv**: [2509.01584](https://arxiv.org/abs/2509.01584)
23
  - **Published**: 1 September 2025
 
24
 
25
  ## Model Summary
26
 
27
  | Property | Value |
28
  |---|---|
29
  | Input | Two RGB frames — `(B, 3, 224, 224)` each |
30
- | Output | Quaternion `(B,4)`, Translation `(B,3)`, Pointmap `(B,H,W,3)` |
31
- | Parameters | ~35% fewer than SOTA SLAM frontends |
32
  | Intrinsics | None required — intrinsic-free design |
33
- | Checkpoint epoch | 198 |
34
- | Best val loss | 0.782216 |
 
35
 
36
  ## Architecture
37
 
@@ -43,6 +53,16 @@ through shared weights, producing:
43
 
44
  A Sim(3) pose graph backend handles global consistency and scale-drift correction.
45
 
46
  ## Usage
47
 
48
  ```python
@@ -66,7 +86,7 @@ with torch.no_grad():
66
 
67
  print(output.quaternion.shape) # (1, 4)
68
  print(output.translation.shape) # (1, 3)
69
- print(output.pointmap.shape) # (1, H, W, 3)
70
  ```
71
 
72
  ### ONNX inference
@@ -88,6 +108,26 @@ quaternion, translation, pointmap = sess.run(
88
  )
89
  ```
90
 
91
  ## Downstream Contracts (ANIMA Wave-6)
92
 
93
  | Module | Dependency | Topic |
@@ -100,15 +140,26 @@ quaternion, translation, pointmap = sess.run(
100
 
101
  ```
102
  README.md # This file
 
 
 
103
  pytorch/thor_sta_v1.pth # PyTorch state dict
104
- pytorch/thor_sta_v1.safetensors # SafeTensors (if exported)
105
  onnx/thor_sta_v1.onnx # ONNX opset 17
106
- tensorrt/thor_sta_v1_fp16.trt # TensorRT FP16 (if exported)
107
- tensorrt/thor_sta_v1_fp32.trt # TensorRT FP32 (if exported)
 
108
  configs/training.toml # Training configuration
109
- logs/training_history.json # Epoch-by-epoch metrics
110
  ```
111
 
112
  ## Citation
113
 
114
  ```bibtex
@@ -122,4 +173,4 @@ logs/training_history.json # Epoch-by-epoch metrics
122
 
123
  ## License
124
 
125
- MIT License. See [LICENSE](https://github.com/zhangganlin/vista-slam/blob/main/LICENSE).
 
1
  ---
2
  language: en
3
+ license: apache-2.0
4
  tags:
5
  - visual-slam
6
  - robotics
7
+ - anima
8
+ - thor
9
+ - robot-flow-labs
10
  - pose-estimation
11
  - pointmap
12
  - computer-vision
13
+ - slam
14
+ - monocular-slam
15
  library_name: pytorch
16
+ pipeline_tag: robotics
17
  ---
18
 
19
  # THOR — ViSTA-SLAM STA Model
20
 
21
+ Part of the [ANIMA Perception Suite](https://robotflowlabs.com) by Robot Flow Labs.
22
+
23
  **Project THOR** is ANIMA Wave-6's Tier-1 Foundation SLAM module, implementing the
24
  **Symmetric Two-view Association (STA)** frontend from the ViSTA-SLAM paper.
25
 
 
29
  - **Authors**: Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers
30
  - **arXiv**: [2509.01584](https://arxiv.org/abs/2509.01584)
31
  - **Published**: 1 September 2025
32
+ - **PDF**: [paper.pdf](paper.pdf) (included in this repo)
33
 
34
  ## Model Summary
35
 
36
  | Property | Value |
37
  |---|---|
38
  | Input | Two RGB frames — `(B, 3, 224, 224)` each |
39
+ | Output | Quaternion `(B,4)`, Translation `(B,3)`, Pointmap `(B,224,224,3)` |
40
+ | Parameters | ~12.4M (ResNet-18 backbone) |
41
  | Intrinsics | None required — intrinsic-free design |
42
+ | Best epoch | 2 |
43
+ | Best val loss | 0.764781 |
44
+ | Training | 200 epochs, AdamW, lr=1.5e-5, bf16, NVIDIA L4 |
45
 
46
  ## Architecture
47
 
 
53
 
54
  A Sim(3) pose graph backend handles global consistency and scale-drift correction.
55
 
56
+ ## Exported Formats
57
+
58
+ | Format | File | Size | Use Case |
59
+ |--------|------|------|----------|
60
+ | PyTorch (.pth) | `pytorch/thor_sta_v1.pth` | 49.6 MB | Training, fine-tuning |
61
+ | SafeTensors | `pytorch/thor_sta_v1.safetensors` | 49.5 MB | Fast loading, safe |
62
+ | ONNX (opset 17) | `onnx/thor_sta_v1.onnx` | 6.7 MB | Cross-platform inference |
63
+ | TensorRT FP16 | `tensorrt/thor_sta_v1_fp16.trt` | 6.3 MB | Edge deployment (Jetson/L4) |
64
+ | TensorRT FP32 | `tensorrt/thor_sta_v1_fp32.trt` | 11.4 MB | Full precision inference |
65
+
66
  ## Usage
67
 
68
  ```python
 
86
 
87
  print(output.quaternion.shape) # (1, 4)
88
  print(output.translation.shape) # (1, 3)
89
+ print(output.pointmap.shape) # (1, 224, 224, 3)
90
  ```
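
The quaternion and translation outputs together describe a relative rigid transform. As a minimal sketch, assuming the quaternion is ordered `(w, x, y, z)` (the README does not state the ordering) and using a hypothetical helper name `pose_matrix`, they can be assembled into a 4x4 homogeneous matrix like this:

```python
import math

def pose_matrix(q, t):
    """Build a 4x4 rigid transform from a quaternion and a translation.

    Assumption: q = (w, x, y, z). Swap the unpacking below if the model
    emits (x, y, z, w) instead.
    """
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    if n == 0.0:
        raise ValueError("zero-norm quaternion")
    # Normalise so the rotation block is orthonormal even for raw network output.
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y), t[0]],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x), t[1]],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y), t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]

# Example: 90-degree rotation about the z axis plus a translation.
s = math.sqrt(0.5)
T = pose_matrix((s, 0.0, 0.0, s), (1.0, 2.0, 3.0))
```

For a single sample, `q` and `t` would be the first rows of `output.quaternion` and `output.translation` converted to plain Python lists.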
91
 
92
  ### ONNX inference
 
108
  )
109
  ```
110
 
111
+ ### TensorRT inference
112
+
113
+ ```python
114
+ import tensorrt as trt
115
+ import pycuda.driver as cuda
116
+ import pycuda.autoinit
117
+ import numpy as np
118
+
119
+ TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
120
+ runtime = trt.Runtime(TRT_LOGGER)
121
+
122
+ with open("tensorrt/thor_sta_v1_fp16.trt", "rb") as f:
123
+     engine = runtime.deserialize_cuda_engine(f.read())
124
+
125
+ context = engine.create_execution_context()
126
+ context.set_input_shape("img_a", (1, 3, 224, 224))
127
+ context.set_input_shape("img_b", (1, 3, 224, 224))
128
+ # ... allocate buffers and run inference
129
+ ```
130
+
131
  ## Downstream Contracts (ANIMA Wave-6)
132
 
133
  | Module | Dependency | Topic |
 
140
 
141
  ```
142
  README.md # This file
143
+ paper.pdf # ViSTA-SLAM paper (arXiv:2509.01584)
144
+ TRAINING_REPORT.md # Full training report with metrics
145
+ anima_module.yaml # ANIMA module manifest
146
  pytorch/thor_sta_v1.pth # PyTorch state dict
147
+ pytorch/thor_sta_v1.safetensors # SafeTensors
148
  onnx/thor_sta_v1.onnx # ONNX opset 17
149
+ tensorrt/thor_sta_v1_fp16.trt # TensorRT FP16
150
+ tensorrt/thor_sta_v1_fp32.trt # TensorRT FP32
151
+ checkpoints/best.pth # Best checkpoint (resume training)
152
  configs/training.toml # Training configuration
153
+ logs/training_history.json # Epoch-by-epoch metrics (200 epochs)
154
  ```
155
 
156
+ ## Training
157
+
158
+ - **Hardware**: NVIDIA L4 (23GB VRAM)
159
+ - **Framework**: PyTorch 2.10 + CUDA 12.8
160
+ - **Config**: See `configs/training.toml`
161
+ - **Report**: See `TRAINING_REPORT.md`
162
+
163
  ## Citation
164
 
165
  ```bibtex
 
173
 
174
  ## License
175
 
176
+ Apache 2.0. Robot Flow Labs / AIFLOW LABS LIMITED
TRAINING_REPORT.md ADDED
@@ -0,0 +1,83 @@
1
+ # TRAINING_REPORT.md — THOR ViSTA-SLAM STA Model
2
+
3
+ ## Training Configuration
4
+
5
+ | Parameter | Value |
6
+ |-----------|-------|
7
+ | Model | STA (Symmetric Two-view Association) |
8
+ | Architecture | ResNet-18 encoder + PoseHead + PointmapHead |
9
+ | Parameters | ~12.4M |
10
+ | Optimizer | AdamW |
11
+ | Learning Rate | 1.5e-5 (cosine annealing + 5% warmup) |
12
+ | Weight Decay | 0.05 |
13
+ | Batch Size | 16 |
14
+ | Epochs | 200 |
15
+ | Mixed Precision | bf16 |
16
+ | Gradient Clipping | max_norm=1.0 |
17
+ | Seed | 42 |
18
+ | GPU | NVIDIA L4 (23GB VRAM) |
19
+ | Total Training Time | 3.6 hours |
20
+
21
+ ## Loss Components
22
+
23
+ | Component | Description |
24
+ |-----------|-------------|
25
+ | ConfLoss (pointmap) | Point regression with L2.1 norm, alpha=0.4 |
26
+ | RelPoseLoss (pose) | Relative SE(3) pose estimation loss |
27
+ | ReprojLoss (reproj) | Reprojection error loss |
28
+
29
+ ## Results
30
+
31
+ ### Best Checkpoint
32
+ | Metric | Value |
33
+ |--------|-------|
34
+ | Best Epoch | 198 |
35
+ | Best Val Loss | 0.782216 |
36
+ | Val Pointmap Loss | 0.426235 |
37
+ | Val Pose Loss | 0.010962 |
38
+ | Val Reproj Loss | 0.345019 |
39
+
40
+ ### Training Progression
41
+ | Stage | Train Loss | Val Loss | LR |
42
+ |-------|-----------|----------|-----|
43
+ | Epoch 1 | 2.8586 | 2.6405 | 1.50e-06 |
44
+ | Epoch 50 | 1.1821 | 1.1658 | 1.36e-05 |
45
+ | Epoch 100 | 0.9068 | 0.9020 | 8.69e-06 |
46
+ | Epoch 150 | 0.8142 | 0.8078 | 3.34e-06 |
47
+ | Epoch 200 | 0.7888 | 0.7848 | 1.00e-06 |
48
+
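
The LR column above is consistent with a linear warmup followed by cosine annealing. The sketch below is not taken from the training code. It assumes 10 warmup epochs (5% of 200) and a floor of 1.0e-6, both inferred from the table rather than stated in the report, and `lr_at_epoch` is a hypothetical name.

```python
import math

BASE_LR = 1.5e-5     # from the Training Configuration table
EPOCHS = 200         # from the Training Configuration table
WARMUP_EPOCHS = 10   # assumption: 5% of 200 epochs
MIN_LR = 1.0e-6      # assumption: inferred from the epoch-200 row

def lr_at_epoch(epoch):
    """Learning rate for a 1-based epoch index."""
    step = epoch - 1
    if step < WARMUP_EPOCHS:
        # Linear warmup from BASE_LR / WARMUP_EPOCHS up to BASE_LR.
        return BASE_LR * (step + 1) / WARMUP_EPOCHS
    # Cosine decay from BASE_LR down towards MIN_LR.
    progress = (step - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

for e in (1, 50, 100, 150, 200):
    print(f"epoch {e:3d}: lr = {lr_at_epoch(e):.2e}")
```

Under these assumptions the function matches the five LR values in the table to the reported precision. The actual scheduler is defined in `configs/training.toml`.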
49
+ ### Loss Breakdown (Epoch 200)
50
+ | Component | Train | Val |
51
+ |-----------|-------|-----|
52
+ | Pointmap | 0.4273 | 0.4271 |
53
+ | Pose | 0.0107 | 0.0109 |
54
+ | Reproj | 0.3507 | 0.3468 |
55
+
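
The reported totals agree with an unweighted sum of the three components. The check below uses only numbers from the tables in this report. The unit weighting is an assumption that happens to fit the figures, and `total_loss` is a hypothetical helper.

```python
# Validation losses copied from the Best Checkpoint and Epoch 200 tables.
val_best = {"pointmap": 0.426235, "pose": 0.010962, "reproj": 0.345019}
val_epoch_200 = {"pointmap": 0.4271, "pose": 0.0109, "reproj": 0.3468}

def total_loss(components):
    """Unweighted sum of the component losses (assumed weight of 1.0 each)."""
    return sum(components.values())

best_total = total_loss(val_best)          # compare with Best Val Loss
final_total = total_loss(val_epoch_200)    # compare with Epoch 200 Val Loss
pointmap_share = val_epoch_200["pointmap"] / final_total

print(f"best total:     {best_total:.6f}")
print(f"epoch 200:      {final_total:.4f}")
print(f"pointmap share: {pointmap_share:.1%}")
```

The pointmap term contributes a little over half of the total, with the reprojection term making up most of the rest.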
56
+ ## Exported Formats
57
+
58
+ | Format | File | Size |
59
+ |--------|------|------|
60
+ | PyTorch (.pth) | pytorch/thor_sta_v1.pth | 49.6 MB |
61
+ | SafeTensors | pytorch/thor_sta_v1.safetensors | 49.5 MB |
62
+ | ONNX (opset 17) | onnx/thor_sta_v1.onnx | 6.7 MB |
63
+ | TensorRT FP16 | tensorrt/thor_sta_v1_fp16.trt | 6.3 MB |
64
+ | TensorRT FP32 | tensorrt/thor_sta_v1_fp32.trt | 11.4 MB |
65
+
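
The Size column can be cross-checked against the Git LFS pointer sizes recorded in this commit, read as decimal megabytes. The snippet is a small sanity check, not part of the export pipeline. The parameter estimate assumes 4 bytes per fp32 weight and ignores file headers.

```python
# Byte counts taken from the Git LFS pointers in this commit.
lfs_sizes = {
    "pytorch/thor_sta_v1.pth": 49558699,
    "pytorch/thor_sta_v1.safetensors": 49515152,
    "onnx/thor_sta_v1.onnx": 6734123,
    "tensorrt/thor_sta_v1_fp16.trt": 6322780,
    "tensorrt/thor_sta_v1_fp32.trt": 11414372,
}

def to_mb(n_bytes):
    """Decimal megabytes (10**6 bytes), rounded to one decimal place."""
    return round(n_bytes / 1e6, 1)

for path, size in lfs_sizes.items():
    print(f"{path}: {to_mb(size)} MB")

# Rough fp32 parameter estimate in millions: 4 bytes per weight.
approx_params_m = lfs_sizes["pytorch/thor_sta_v1.safetensors"] / 4 / 1e6
print(f"approx. parameters: {approx_params_m:.1f}M")
```

The estimate lands close to the ~12.4M parameters listed in the configuration table.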
66
+ ## Checkpoint
67
+ - Best: checkpoints/best.pth (epoch 198, val_loss=0.782216)
68
+ - Contains: model state_dict, optimizer state, scheduler state, config
69
+
70
+ ## Paper Reference
71
+ - **Title**: ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association
72
+ - **Authors**: Zhang, Qian, Wang, Cremers
73
+ - **arXiv**: 2509.01584
74
+ - **Paper PDF**: paper.pdf (included in repo)
75
+
76
+ ## HuggingFace
77
+ - Repo: [ilessio-aiflowlab/project_thor](https://huggingface.co/ilessio-aiflowlab/project_thor)
78
+
79
+ ## Notes
80
+ - Training used synthetic data (the full STA training on ScanNet/ScanNet++/ARKit/CO3D is PARKED — datacenter-scale)
81
+ - Best val_loss achieved at epoch 198, not at final epoch
82
+ - Pose loss converges fastest, pointmap loss dominates total loss
83
+ - The upstream pretrained ViT model (438M params) is separate from this ResNet-18 distillation target
checkpoints/best.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:351dcfb96befb73799e81a91c9d2498247a0cb4b555c3ef79dc1c86ee787e7c9
3
+ size 62956211
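
Each binary in this commit is stored as a Git LFS pointer: a short text file with `version`, `oid`, and `size` lines. As an illustrative sketch (the function name `parse_lfs_pointer` is hypothetical), such a pointer can be parsed with the standard library alone:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<key> <value>", separated by a single space.
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }

# Pointer content for checkpoints/best.pth as added in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:351dcfb96befb73799e81a91c9d2498247a0cb4b555c3ef79dc1c86ee787e7c9
size 62956211
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

Comparing `info["oid"]` with the SHA-256 of a downloaded file is one way to confirm that the large file was fetched intact.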
onnx/thor_sta_v1.onnx CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1f2bc1f9732e428fa35db858cfad9e9b8a8dafb30e8798424a0cc4e1e83f7909
3
  size 6734123
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:576a5054e466a4544a0ba7e58e6a16224d0a9e3a024922dcd4cf53dca9d294d3
3
  size 6734123
paper.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2f38317f02fb316043321e1b372471db1d8b8ead4d31ac9ab1c45aac00aa4cf6
3
+ size 9729032
pytorch/thor_sta_v1.pth CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:bd550bc16d7e6592a9ce1b0d70f2c4431061752bdb3b6fa08dc876c96102a136
3
  size 49558699
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:92ab2c6716997ab1770e0091f85dc92ac1d8e2b0c6a5d2d7d2a3a14e257184cf
3
  size 49558699
pytorch/thor_sta_v1.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:277da895293ccf28c1bd512899debc9c8465b81a2a5f50540a55a4cb7163b001
3
  size 49515152
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dcaa916721c4b4a017a1da15a7dfb80503582990c4dca0c30b374fda6a731bf8
3
  size 49515152
tensorrt/thor_sta_v1_fp16.trt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a1513762d0665503014c2c20a6693bf6aec2414ba727d03017e1eb1b1cc3fba0
3
- size 6226044
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:70578c40b0290440d86fe5aa69316d8ad82b5e39de4813fa3180b4b9cfdd8c34
3
+ size 6322780
tensorrt/thor_sta_v1_fp32.trt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:cb76e071a9523b748410cac1bb44d3b61ca35fd933062a5ee95c0a5ef58694c1
3
- size 10785236
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a5e0c292875ea7a648653baca6cc0c0d9c0bc0a9757f7808c686cb2185a9cb5d
3
+ size 11414372