OpenTouch VP2T Retrieval Encoder

This repository contains the best saved checkpoint from training our OpenTouch cross-modal retrieval encoder, which is used in the VLA-HAND tactile editing experiments.

Task

The model is trained for vp2t retrieval:

visual + pose -> tactile

It aligns RGB video windows and right-hand 3D landmark windows with tactile pressure windows using a CLIP-style symmetric contrastive loss.
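
As a rough illustration of this objective, the sketch below computes a CLIP-style symmetric contrastive loss in plain Python. It assumes that row i of the query embeddings and row i of the tactile embeddings form the positive pair and that all other rows in the batch act as negatives. The temperature of 0.07 and the function names are illustrative assumptions and are not taken from the training code, which would normally use a tensor library.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(query, tactile, temperature=0.07):
    """Average of query->tactile and tactile->query cross-entropy.

    query, tactile: lists of B L2-normalized vectors; row i of each is a pair.
    temperature: assumed value, not read from the actual training config.
    """
    b = len(query)
    # Scaled cosine similarities between every query and every tactile vector.
    logits = [
        [sum(q * t for q, t in zip(query[i], tactile[j])) / temperature
         for j in range(b)]
        for i in range(b)
    ]
    # Query -> tactile: the correct class for row i is column i.
    q2t = -sum(log_softmax(logits[i])[i] for i in range(b)) / b
    # Tactile -> query: same computation on the transposed logits.
    columns = [[logits[i][j] for i in range(b)] for j in range(b)]
    t2q = -sum(log_softmax(columns[j])[j] for j in range(b)) / b
    return 0.5 * (q2t + t2q)
```

The loss approaches zero when each query is most similar to its own tactile window and equals log(B) when all similarities in the batch are identical.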

Inputs And Outputs

Inputs follow the OpenTouch retrieval format:

rgb_images:        [B, 20, 3, 224, 224]
hand_landmarks:    [B, 20, 21, 3]
tactile_pressure:  [B, 20, 1, 16, 16]
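
A small shape check can catch format mismatches before a batch reaches the model. The sketch below is a hypothetical helper, not part of the OpenTouch code. It reads the dimensions above as 20 time steps per window, 3x224x224 RGB frames, 21 hand landmarks with 3 coordinates each, and a single-channel 16x16 pressure map.

```python
# Per-sample shapes (everything after the batch dimension B), as listed above.
EXPECTED_SHAPES = {
    "rgb_images": (20, 3, 224, 224),
    "hand_landmarks": (20, 21, 3),
    "tactile_pressure": (20, 1, 16, 16),
}


def check_batch_shapes(shapes):
    """Validate a batch and return its batch size B.

    shapes: dict mapping each input name to its full shape tuple,
    batch dimension first (for example the .shape of each array).
    """
    missing = EXPECTED_SHAPES.keys() - shapes.keys()
    if missing:
        raise ValueError(f"missing inputs: {sorted(missing)}")
    batch_sizes = set()
    for name, tail in EXPECTED_SHAPES.items():
        shape = tuple(shapes[name])
        if shape[1:] != tail:
            raise ValueError(f"{name}: expected [B, {tail}], got {shape}")
        batch_sizes.add(shape[0])
    if len(batch_sizes) != 1:
        raise ValueError(f"inconsistent batch sizes: {sorted(batch_sizes)}")
    return batch_sizes.pop()
```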

The model outputs normalized 64-dimensional embeddings:

visual_features:   [B, 64]
pose_features:     [B, 64]
tactile_features:  [B, 64]

For vp2t, visual and pose embeddings are concatenated and projected to a fused query embedding, then matched against tactile embeddings.
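
The fusion and matching step can be sketched as follows. This is a minimal plain-Python illustration, assuming the fused query is L2-normalized after the linear projection and that matching uses cosine similarity. The function names, the weight layout, and the re-normalization are assumptions, not details confirmed by the model code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_query(visual, pose, weight, bias):
    """Concatenate visual and pose embeddings, project, and re-normalize.

    weight: list of output rows, each of length len(visual) + len(pose).
    bias: one value per output row.
    """
    concat = list(visual) + list(pose)
    projected = [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weight, bias)
    ]
    return l2_normalize(projected)


def rank_tactile(query, tactile_bank):
    """Return gallery indices sorted by descending cosine similarity."""
    scores = [sum(q * t for q, t in zip(query, tac)) for tac in tactile_bank]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

With 64-dimensional visual and pose embeddings, the projection under this layout would take a 128-dimensional concatenation back to 64 dimensions.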

Checkpoint

Recommended checkpoint:

epoch_280.pt

This is the best saved checkpoint by average bidirectional mAP. The highest tactile-to-visual mAP overall occurred at epoch 285, but checkpoints were saved only every 10 epochs, so no checkpoint exists for that epoch.
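
The selection rule can be sketched in a few lines of Python. The mAP values are copied from the Metrics section of this README, which reports only epochs 280 and 300. The key names `vp2t` and `t2vp` are hypothetical, and the sketch assumes checkpoints fall on epochs divisible by 10.

```python
SAVE_EVERY = 10  # assumed: checkpoints exist only at epochs divisible by 10

# Validation mAP per direction, copied from the Metrics tables in this README.
VAL_MAP = {
    280: {"vp2t": 0.1536, "t2vp": 0.1525},
    300: {"vp2t": 0.1519, "t2vp": 0.1533},
}


def average_map(entry):
    """Mean of the two retrieval directions."""
    return (entry["vp2t"] + entry["t2vp"]) / 2


def best_saved_epoch(history, save_every=SAVE_EVERY):
    """Pick the checkpointed epoch with the highest average bidirectional mAP."""
    saved = {e: m for e, m in history.items() if e % save_every == 0}
    return max(saved, key=lambda e: average_map(saved[e]))
```

Epochs that were evaluated but not checkpointed, such as epoch 285, are filtered out before the comparison.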

Training Setup

dataset: OpenTouch official retrieval dataset
task_type: vp2t
sequence_length: 20
stride: 10
epochs: 300
batch_size: 4 in the recorded run
precision: amp_bf16
embed_dim: 64
visual backbone: google/vit-base-patch16-224-in21k
visual backbone freeze: true
tactile encoder: CNNetEmbedding
pose encoder: PoseEncoder
fusion: concat + linear projection

The training script is included in the VLA-HAND repo as:

scripts/run_opentouch_official_encoder_train.sh
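
To make the sequence_length and stride settings above concrete, the sketch below lists the start index of each sliding window over a recording. It assumes trailing frames that cannot fill a complete window are dropped. The actual dataset loader may handle the tail differently, and `window_starts` is a hypothetical helper name.

```python
def window_starts(num_frames, sequence_length=20, stride=10):
    """Start indices of all full windows over a recording of num_frames frames.

    With sequence_length=20 and stride=10, consecutive windows overlap
    by 10 frames.
    """
    if num_frames < sequence_length:
        return []
    return list(range(0, num_frames - sequence_length + 1, stride))


# Example: a 50-frame recording yields windows covering frames
# 0-19, 10-29, 20-39, and 30-49.
starts = window_starts(50)
```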

Metrics

Validation set size: 2985 sliding windows.

Best saved checkpoint, epoch 280:

Direction	R@1	R@5	R@10	mAP
visual+pose -> tactile	0.0647	0.2214	0.3374	0.1536
tactile -> visual+pose	0.0620	0.2251	0.3420	0.1525

Final epoch 300:

Direction	R@1	R@5	R@10	mAP
visual+pose -> tactile	0.0616	0.2228	0.3307	0.1519
tactile -> visual+pose	0.0637	0.2268	0.3414	0.1533
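
For reference, the sketch below shows one common way to compute R@K and mAP from a query-by-gallery similarity matrix. It assumes each query has exactly one relevant gallery item (its paired window at the same index), in which case average precision reduces to the reciprocal rank of that item. The OpenTouch evaluation code may define relevance or break ties differently, and `retrieval_metrics` is a hypothetical name.

```python
def retrieval_metrics(similarity, ks=(1, 5, 10)):
    """Recall@K and mAP for a square similarity matrix.

    similarity[i][j]: score of query i against gallery item j.
    The correct match for query i is assumed to be gallery item i.
    """
    n = len(similarity)
    ranks = []
    for i, row in enumerate(similarity):
        # 1-based rank of the true match: one plus the number of items
        # scoring strictly higher (ties resolve in favor of the match).
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / n for k in ks}
    # One relevant item per query, so AP equals the reciprocal rank.
    metrics["mAP"] = sum(1.0 / r for r in ranks) / n
    return metrics
```

Transposing the similarity matrix gives the metrics for the opposite retrieval direction.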

Files

epoch_280.pt                                     best saved checkpoint
results/results.jsonl                            full validation history
config/OpenTouch-DINOv3-B16-AllModalities.json   model config used for training

Intended Use

This checkpoint is intended as a tactile/visual/pose representation model for OpenTouch-based tactile editing and analysis. It is not a VITRA action checkpoint and does not directly predict robot or hand actions.
