HAPTOS β€” General Tactile Representation Learning

Part of the ANIMA Perception Suite by Robot Flow Labs.

Paper

AnyTouch 2: General Optical Tactile Representation Learning for Dynamic Tactile Perception (arXiv:2602.09617, GeWu-Lab)

Architecture

ViT-Base Masked Autoencoder (MAE) for tactile image representation learning with:

  • Encoder: 12 layers, 12 heads, dim 768 (106.4M params total)
  • Decoder: 6 layers, 8 heads, dim 512 (paper-matched)
  • Force head: MLP from CLS token -> 3D force vector (fx, fy, fz)
  • Mask ratio: 75%
  • Input: 224x224 tactile images
  • Loss: MSE reconstruction + frame-diff recon + L1 force + delta-force + cross-sensor matching + action matching
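For intuition, the 75% random masking over 16x16 patches of a 224x224 input can be sketched in plain Python. This is an illustrative stand-in, not the released code; the function name and seeding are assumptions:

```python
import random

def random_masking(num_patches, mask_ratio, seed=0):
    """Split patch indices into (kept, masked) sets, MAE-style."""
    rng = random.Random(seed)
    idx = list(range(num_patches))
    rng.shuffle(idx)  # uniform random permutation of patch indices
    num_keep = int(num_patches * (1 - mask_ratio))
    return sorted(idx[:num_keep]), sorted(idx[num_keep:])

# 224x224 input with 16x16 patches -> 196 patches; 75% masked leaves 49 visible
kept, masked = random_masking((224 // 16) ** 2, 0.75)
```

Only the 49 visible patches enter the encoder; the decoder reconstructs the 147 masked ones.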

Exported Formats

| Format | File | Size | Use Case |
|--------|------|------|----------|
| PyTorch (.pth) | pytorch/haptos_v1.pth | 376 MB | Training, fine-tuning |
| SafeTensors | pytorch/haptos_v1.safetensors | 376 MB | Fast loading, safe |
| ONNX | onnx/haptos_v1.onnx | 345 MB | Cross-platform inference |
| TensorRT FP16 | tensorrt/haptos_v1_fp16.trt | 175 MB | Edge deployment (Jetson/L4) |
| TensorRT FP32 | tensorrt/haptos_v1_fp32.trt | 345 MB | Full-precision inference |
| Checkpoint | checkpoints/best.pth | 1.1 GB | Resume training (optimizer + scheduler state) |

Training Details

| Setting | Value |
|---------|-------|
| Hardware | 8x NVIDIA L4 (23.7 GB each) |
| VRAM Usage | 19.0 GB / 23.7 GB (80%) per GPU |
| Effective Batch | 192 (24/GPU x 8 GPUs) |
| Optimizer | AdamW (betas=0.9, 0.95) |
| Learning Rate | 3e-4 |
| LR Schedule | Warmup + Cosine Annealing with Warm Restarts (T0=28, T_mult=2) |
| Precision | bf16 mixed precision |
| Epochs | 40 |
| Best Val Loss | 0.0836 (epoch 52) |
| Test Loss | 0.0825 |
| Test Recon Loss | 0.0090 |
| Test Force Loss (L1) | 0.7347 |
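Under T0=28 and T_mult=2, the cosine-with-warm-restarts schedule resets the learning rate to its peak at epoch 28, then again after 56 more epochs, and so on. A minimal sketch of the post-warmup schedule (warmup omitted; this mirrors the standard formula behind torch.optim.lr_scheduler.CosineAnnealingWarmRestarts, not the project's actual scheduler code):

```python
import math

def cosine_warm_restarts(epoch, eta_max=3e-4, eta_min=0.0, t0=28, t_mult=2):
    """LR at an integer epoch under cosine annealing with warm restarts."""
    t_i, start = t0, 0
    while epoch >= start + t_i:  # walk cycles (lengths t0, t0*t_mult, ...)
        start += t_i
        t_i *= t_mult
    t_cur = epoch - start        # position within the current cycle
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

For example, the rate starts at 3e-4, falls to half its peak midway through the first cycle (epoch 14), and jumps back to 3e-4 at the epoch-28 restart.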

Usage

import torch
from safetensors.torch import load_file

# Load weights
state_dict = load_file("pytorch/haptos_v1.safetensors")

# Build model
from anima_haptos.models.mae_cuda import TactileMAECuda
model = TactileMAECuda(
    img_size=224, patch_size=16, embed_dim=768,
    encoder_depth=12, num_heads=12,
    decoder_dim=512, decoder_depth=6, decoder_heads=8,
    mask_ratio=0.75, force_head=True, force_dim=3,
)
model.load_state_dict(state_dict)
model.eval()

# Extract features
img = torch.randn(1, 3, 224, 224)
features = model.get_encoder_features(img)  # [1, 768]
force = model.force_head(features)          # [1, 3] (fx, fy, fz)

Capabilities

  • Pixel-level: Masked reconstruction of tactile images
  • Physical-level: 3D contact force estimation (fx, fy, fz) with L1 supervision
  • Multi-sensor: Works across GelSight, DIGIT, DuraGel, Tac3D
  • Temporal: Processes tactile frame sequences
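The cross-sensor matching objective compares embeddings of the same contact captured by different sensors. A minimal pure-Python sketch of one way to score such a pair over the 768-d encoder features (illustrative only; the actual matching loss in the training code may use a different similarity or a learned head):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

A well-trained encoder should score a GelSight/DIGIT pair of the same contact higher than embeddings from unrelated contacts.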

Checkpoint Contents

best.pth includes full state for resume:

  • model_state_dict, optimizer_state_dict, scheduler_state_dict
  • early_stopping_state_dict, scaler_state_dict
  • epoch, global_step, val_loss
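A hedged sketch of how a resume routine might validate those keys before restoring state. The helper name is an assumption, and the checkpoint is simulated as a plain dict; in practice it would come from torch.load("checkpoints/best.pth"):

```python
# Keys listed in the checkpoint contents above
REQUIRED_KEYS = {
    "model_state_dict", "optimizer_state_dict", "scheduler_state_dict",
    "early_stopping_state_dict", "scaler_state_dict",
    "epoch", "global_step", "val_loss",
}

def validate_checkpoint(ckpt):
    """Raise if any resume key is missing; return the epoch to resume from."""
    missing = REQUIRED_KEYS - ckpt.keys()
    if missing:
        raise KeyError(f"checkpoint missing keys: {sorted(missing)}")
    return ckpt["epoch"] + 1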

Files

β”œβ”€β”€ README.md
β”œβ”€β”€ paper.pdf
β”œβ”€β”€ pytorch/
β”‚   β”œβ”€β”€ haptos_v1.pth
β”‚   └── haptos_v1.safetensors
β”œβ”€β”€ onnx/
β”‚   └── haptos_v1.onnx
β”œβ”€β”€ tensorrt/
β”‚   β”œβ”€β”€ haptos_v1_fp16.trt
β”‚   └── haptos_v1_fp32.trt
β”œβ”€β”€ checkpoints/
β”‚   └── best.pth
β”œβ”€β”€ configs/
β”‚   └── training.yaml
└── logs/
    └── training_history.json

License

Apache 2.0 β€” Robot Flow Labs / AIFLOW LABS LIMITED
