OpenTouch VP2T Retrieval Encoder
This repository contains the best saved checkpoint from the training run of our OpenTouch cross-modal retrieval encoder, which is used in the VLA-HAND tactile editing experiments.
Task
The model is trained for vp2t retrieval:
visual + pose -> tactile
It aligns RGB video windows and right-hand 3D landmark windows with tactile pressure windows using a CLIP-style symmetric contrastive loss.
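As a rough illustration of what a CLIP-style symmetric contrastive loss computes, the NumPy sketch below scores every query against every tactile embedding in a batch and applies cross-entropy in both directions, treating the diagonal as the positive pairs. The function names and the temperature of 0.07 are assumptions chosen for illustration; the actual training code may use a learned temperature or a different formulation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(query, tactile, temperature=0.07):
    # temperature=0.07 is an assumed value, not taken from the training config.
    q = l2_normalize(query)
    t = l2_normalize(tactile)
    logits = q @ t.T / temperature            # [B, B] pairwise similarities
    idx = np.arange(len(q))                   # row i matches column i
    loss_q2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # query -> tactile
    loss_t2q = -log_softmax(logits, axis=0)[idx, idx].mean()  # tactile -> query
    return 0.5 * (loss_q2t + loss_t2q)

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 64))     # stand-in for fused visual+pose embeddings
tactile = rng.normal(size=(4, 64))   # stand-in for tactile embeddings

loss_random = symmetric_contrastive_loss(query, tactile)
loss_aligned = symmetric_contrastive_loss(tactile, tactile)
```

Perfectly aligned pairs give a loss close to zero, while unrelated random embeddings give a clearly larger value.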
Inputs And Outputs
Inputs follow the OpenTouch retrieval format:
rgb_images: [B, 20, 3, 224, 224]
hand_landmarks: [B, 20, 21, 3]
tactile_pressure: [B, 20, 1, 16, 16]
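A quick way to sanity-check a batch against these shapes is sketched below. The `check_batch` helper, the `float32` dtype, and the dictionary layout are assumptions for illustration; only the key names and shapes come from the list above.

```python
import numpy as np

B, T = 2, 20  # B is arbitrary; T matches the 20-frame window length

# Dummy batch with the documented shapes (dtype is an assumption).
batch = {
    "rgb_images": np.zeros((B, T, 3, 224, 224), dtype=np.float32),
    "hand_landmarks": np.zeros((B, T, 21, 3), dtype=np.float32),
    "tactile_pressure": np.zeros((B, T, 1, 16, 16), dtype=np.float32),
}

# Expected shape of each input, excluding the batch dimension.
EXPECTED = {
    "rgb_images": (20, 3, 224, 224),
    "hand_landmarks": (20, 21, 3),
    "tactile_pressure": (20, 1, 16, 16),
}

def check_batch(batch):
    """Hypothetical helper: validate shapes and return the batch size."""
    sizes = set()
    for name, trailing in EXPECTED.items():
        shape = batch[name].shape
        if shape[1:] != trailing:
            raise ValueError(f"{name}: expected [B, {trailing}], got {shape}")
        sizes.add(shape[0])
    if len(sizes) != 1:
        raise ValueError(f"inconsistent batch sizes: {sorted(sizes)}")
    return sizes.pop()

batch_size = check_batch(batch)
```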
The model outputs normalized 64-dimensional embeddings:
visual_features: [B, 64]
pose_features: [B, 64]
tactile_features: [B, 64]
For vp2t, visual and pose embeddings are concatenated and projected to a fused query embedding, then matched against tactile embeddings.
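The fusion and matching step can be sketched in NumPy as follows. The random embeddings and the weight matrix `W` are placeholders standing in for real encoder outputs and the learned projection, and re-normalizing the fused query is an assumption; the sketch only illustrates the data flow from two 64-dimensional embeddings to a ranked list of tactile candidates.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

B, D = 4, 64
rng = np.random.default_rng(0)

# Placeholder embeddings; in practice these come from the three encoders.
visual_features = l2_normalize(rng.normal(size=(B, D)))
pose_features = l2_normalize(rng.normal(size=(B, D)))
tactile_features = l2_normalize(rng.normal(size=(B, D)))

# Placeholder for the learned linear projection (2 * D -> D).
W = rng.normal(size=(2 * D, D)) * 0.1
b = np.zeros(D)

def fuse(visual, pose, W, b):
    # Concatenate along the feature axis, project back to D, then re-normalize.
    fused = np.concatenate([visual, pose], axis=-1) @ W + b
    return l2_normalize(fused)

query = fuse(visual_features, pose_features, W, b)   # [B, D]
similarity = query @ tactile_features.T              # [B, B] cosine similarities
ranking = np.argsort(-similarity, axis=1)            # best tactile match first
```

Each row of `ranking` lists tactile candidates for one visual+pose query in order of decreasing similarity.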
Checkpoint
Recommended checkpoint:
epoch_280.pt
This is the best saved checkpoint by average bidirectional mAP. The absolute best tactile-to-visual mAP occurred at epoch 285, but checkpoints were saved only every 10 epochs, so no checkpoint exists for that epoch.
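The selection criterion can be illustrated with the two epochs reported in the Metrics section. The sketch assumes that "average bidirectional mAP" means the plain mean of the two directional mAP values; the dictionary keys are hypothetical names, not fields from `results/results.jsonl`.

```python
# mAP values copied from the Metrics tables; key names are hypothetical.
reported = {
    280: {"vp2t_map": 0.1536, "t2vp_map": 0.1525},
    300: {"vp2t_map": 0.1519, "t2vp_map": 0.1533},
}

def average_map(metrics):
    # Assumed definition: unweighted mean of the two retrieval directions.
    return 0.5 * (metrics["vp2t_map"] + metrics["t2vp_map"])

best_epoch = max(reported, key=lambda epoch: average_map(reported[epoch]))

for epoch, metrics in sorted(reported.items()):
    print(f"epoch {epoch}: average mAP = {average_map(metrics):.5f}")
print("best saved epoch:", best_epoch)
```

Under this definition epoch 280 averages 0.15305 and epoch 300 averages 0.15260, which is consistent with recommending `epoch_280.pt`.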
Training Setup
dataset: OpenTouch official retrieval dataset
task_type: vp2t
sequence_length: 20
stride: 10
epochs: 300
batch_size: 4 (in the recorded run)
precision: amp_bf16
embed_dim: 64
visual backbone: google/vit-base-patch16-224-in21k
visual backbone freeze: true
tactile encoder: CNNetEmbedding
pose encoder: PoseEncoder
fusion: concat + linear projection
The training script is included in the VLA-HAND repo as:
scripts/run_opentouch_official_encoder_train.sh
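To show how `sequence_length: 20` and `stride: 10` interact, the sketch below enumerates window start indices for a recording of a given length. The `window_starts` helper is hypothetical, and dropping a trailing partial window is an assumption; the official OpenTouch data loader may handle sequence ends differently.

```python
def window_starts(num_frames, sequence_length=20, stride=10):
    """Hypothetical helper: start indices of full windows over a recording."""
    if num_frames < sequence_length:
        return []  # assumed: recordings shorter than one window are skipped
    return list(range(0, num_frames - sequence_length + 1, stride))

# A 50-frame recording yields four half-overlapping 20-frame windows.
starts = window_starts(50)
windows = [(start, start + 20) for start in starts]
print(windows)
```

With a stride of half the window length, consecutive windows share 10 frames, so neighboring validation windows are strongly correlated.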
Metrics
Validation set size: 2985 sliding windows.
Best saved checkpoint, epoch 280:
| Direction | R@1 | R@5 | R@10 | mAP |
|---|---|---|---|---|
| visual+pose -> tactile | 0.0647 | 0.2214 | 0.3374 | 0.1536 |
| tactile -> visual+pose | 0.0620 | 0.2251 | 0.3420 | 0.1525 |
Final epoch 300:
| Direction | R@1 | R@5 | R@10 | mAP |
|---|---|---|---|---|
| visual+pose -> tactile | 0.0616 | 0.2228 | 0.3307 | 0.1519 |
| tactile -> visual+pose | 0.0637 | 0.2268 | 0.3414 | 0.1533 |
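For readers unfamiliar with these metrics, the sketch below shows one common way to compute R@K and mAP from a query-by-candidate similarity matrix. It assumes each query has exactly one correct candidate (the paired window at the same index), in which case average precision reduces to the reciprocal rank of that candidate. The `retrieval_metrics` function and the toy matrix are illustrative and may not match the official evaluation code.

```python
import numpy as np

def retrieval_metrics(similarity, ks=(1, 5, 10)):
    """R@K and mAP, assuming candidate i is the only match for query i."""
    n = similarity.shape[0]
    order = np.argsort(-similarity, axis=1)           # candidates, best first
    # 1-based position of the correct candidate in each query's ranking.
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["mAP"] = float(np.mean(1.0 / ranks))      # one relevant item per query
    return metrics

# Toy example: rows are visual+pose queries, columns are tactile candidates.
similarity = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],   # correct candidate (index 1) is ranked second
    [0.2, 0.1, 0.7],
])

forward = retrieval_metrics(similarity, ks=(1, 2))     # visual+pose -> tactile
backward = retrieval_metrics(similarity.T, ks=(1, 2))  # tactile -> visual+pose
```

Transposing the similarity matrix swaps the roles of queries and candidates, which is how the two directions in the tables relate to each other.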
Files
epoch_280.pt: best saved checkpoint
results/results.jsonl: full validation history
config/OpenTouch-DINOv3-B16-AllModalities.json: model config used for training
Intended Use
This checkpoint is intended as a tactile/visual/pose representation model for OpenTouch-based tactile editing and analysis. It is not a VITRA action checkpoint and does not directly predict robot or hand actions.