OpenTouch VP2T Retrieval Encoder

This repository contains the best saved checkpoint of the OpenTouch cross-modal retrieval encoder that we trained for the VLA-HAND tactile editing experiments.

Task

The model is trained for vp2t retrieval:

visual + pose -> tactile

It aligns RGB video windows and right-hand 3D landmark windows with tactile pressure windows using a CLIP-style symmetric contrastive loss.
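A CLIP-style symmetric contrastive loss can be sketched as follows. This is a minimal NumPy illustration, not the training code: the function names and the temperature value of 0.07 are assumptions and are not taken from the training config.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(query, tactile, temperature=0.07):
    """Average of query->tactile and tactile->query cross-entropy.

    query, tactile: [B, D] arrays whose i-th rows form a matching pair.
    temperature is an assumed value, not taken from the training config.
    """
    q = l2_normalize(query)
    t = l2_normalize(tactile)
    logits = q @ t.T / temperature        # [B, B] scaled cosine similarities
    diag = np.arange(logits.shape[0])     # matching pairs sit on the diagonal
    loss_q2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2q = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_q2t + loss_t2q)
```

Each row of the similarity matrix treats the matching tactile window as the positive and every other window in the batch as a negative; the column-wise term does the same in the opposite direction, which is what makes the loss symmetric.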

Inputs And Outputs

Inputs follow the OpenTouch retrieval format:

rgb_images:        [B, 20, 3, 224, 224]
hand_landmarks:    [B, 20, 21, 3]
tactile_pressure:  [B, 20, 1, 16, 16]

The model outputs normalized 64-dimensional embeddings:

visual_features:   [B, 64]
pose_features:     [B, 64]
tactile_features:  [B, 64]

For vp2t, visual and pose embeddings are concatenated and projected to a fused query embedding, then matched against tactile embeddings.
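The fusion and matching step might look like the sketch below. It uses random stand-ins for the encoder outputs and a hypothetical single linear layer (`W`, `b`); re-normalizing the fused query before matching is an assumption, and the actual layer lives in the checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 64  # batch size of the recorded run and embed_dim from this card

# Stand-ins for encoder outputs; real features come from the trained encoders.
visual_features = rng.normal(size=(B, D))
pose_features = rng.normal(size=(B, D))
tactile_features = rng.normal(size=(B, D))

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

# Hypothetical fusion weights: one linear layer mapping 2*D -> D.
W = rng.normal(scale=0.02, size=(2 * D, D))
b = np.zeros(D)

fused = np.concatenate([visual_features, pose_features], axis=-1)  # [B, 128]
query = l2_normalize(fused @ W + b)                                # [B, 64]
tactile = l2_normalize(tactile_features)                           # [B, 64]

similarity = query @ tactile.T             # [B, B] cosine similarities
ranking = np.argsort(-similarity, axis=1)  # per query: tactile indices, best first
```

Because both sides are unit-normalized, the dot product is a cosine similarity and `ranking[i, 0]` is the tactile window retrieved first for query `i`.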

Checkpoint

Recommended checkpoint:

epoch_280.pt

This is the best saved checkpoint by average bidirectional mAP. The highest tactile-to-visual mAP was reached at epoch 285, but checkpoints were saved only every 10 epochs, so that epoch is not available.
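The selection criterion can be checked against the two epochs reported in the Metrics section. The dictionary keys `vp2t` and `t2vp` are labels chosen for this sketch, not field names from `results/results.jsonl`.

```python
# Validation mAP for the two epochs reported in the Metrics section.
reported = {
    280: {"vp2t": 0.1536, "t2vp": 0.1525},
    300: {"vp2t": 0.1519, "t2vp": 0.1533},
}

def average_map(scores):
    """Mean of the two retrieval directions."""
    return (scores["vp2t"] + scores["t2vp"]) / 2

averages = {epoch: average_map(s) for epoch, s in reported.items()}
best_epoch = max(averages, key=averages.get)

for epoch, value in sorted(averages.items()):
    print(f"epoch {epoch}: average mAP = {value:.5f}")
print("best of the two reported epochs:", best_epoch)
```

Epoch 300 is slightly better in the tactile -> visual+pose direction, but epoch 280 wins once both directions are averaged.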

Training Setup

dataset: OpenTouch official retrieval dataset
task_type: vp2t
sequence_length: 20
stride: 10
epochs: 300
batch_size: 4 in the recorded run
precision: amp_bf16
embed_dim: 64
visual backbone: google/vit-base-patch16-224-in21k
visual backbone freeze: true
tactile encoder: CNNetEmbedding
pose encoder: PoseEncoder
fusion: concat + linear projection
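With sequence_length 20 and stride 10, consecutive windows overlap by 10 frames. The sketch below shows one way such windows could be enumerated; dropping a trailing partial window is an assumption, and the official dataset loader may treat the tail differently.

```python
def window_starts(num_frames, sequence_length=20, stride=10):
    """Start indices of full windows; a trailing partial window is dropped (assumption)."""
    if num_frames < sequence_length:
        return []
    return list(range(0, num_frames - sequence_length + 1, stride))

# Example: a hypothetical 65-frame recording.
starts = window_starts(65)
windows = [(s, s + 20) for s in starts]  # half-open [start, end) frame ranges
print(windows)
```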

The training script is included in the VLA-HAND repo as:

scripts/run_opentouch_official_encoder_train.sh

Metrics

Validation set size: 2985 sliding windows.

Best saved checkpoint, epoch 280:

Direction                 R@1     R@5     R@10    mAP
visual+pose -> tactile    0.0647  0.2214  0.3374  0.1536
tactile -> visual+pose    0.0620  0.2251  0.3420  0.1525

Final epoch 300:

Direction                 R@1     R@5     R@10    mAP
visual+pose -> tactile    0.0616  0.2228  0.3307  0.1519
tactile -> visual+pose    0.0637  0.2268  0.3414  0.1533
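For reference, Recall@K and mAP over a query-by-gallery similarity matrix can be computed as in the sketch below. It assumes exactly one correct gallery item per query (the window at the same index), in which case average precision reduces to the reciprocal rank. Whether the official OpenTouch evaluation follows this exact protocol, including how ties are broken, is an assumption.

```python
import numpy as np

def retrieval_metrics(similarity, ks=(1, 5, 10)):
    """Recall@K and mAP for a square similarity matrix.

    similarity[i, j] scores query i against gallery item j; the correct
    match for query i is assumed to be gallery item i (one positive per query).
    """
    correct = np.diag(similarity)[:, None]
    # Rank of the correct item = 1 + number of gallery items scored strictly higher.
    ranks = 1 + (similarity > correct).sum(axis=1)
    metrics = {f"R@{k}": float((ranks <= k).mean()) for k in ks}
    # With a single positive per query, average precision is 1 / rank.
    metrics["mAP"] = float((1.0 / ranks).mean())
    return metrics
```

Passing the matrix of query-to-tactile similarities gives the visual+pose -> tactile row; passing its transpose gives the tactile -> visual+pose row.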

Files

epoch_280.pt                                     best saved checkpoint
results/results.jsonl                            full validation history
config/OpenTouch-DINOv3-B16-AllModalities.json   model config used for training

Intended Use

This checkpoint is intended as a tactile/visual/pose representation model for OpenTouch-based tactile editing and analysis. It is not a VITRA action checkpoint and does not directly predict robot or hand actions.
