HAPTOS β€” General Tactile Representation Learning

Part of the ANIMA Perception Suite by Robot Flow Labs.

Paper

AnyTouch 2: General Optical Tactile Representation Learning for Dynamic Tactile Perception (arXiv:2602.09617, GeWu-Lab)

Architecture

ViT-Base Masked Autoencoder (MAE) for tactile image representation learning with:

  • Encoder: 12 layers, 12 heads, dim 768 (106.4M params total)
  • Decoder: 6 layers, 8 heads, dim 512 (paper-matched)
  • Force head: MLP from CLS token -> 3D force vector (fx, fy, fz)
  • Mask ratio: 75%
  • Input: 224x224 tactile images
  • Loss: MSE reconstruction + frame-diff recon + L1 force + delta-force + cross-sensor matching + action matching
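For intuition, the 75% random masking over 16x16 patches of a 224x224 input can be sketched in plain Python. This is an illustrative stand-in, not the released code; the function name and seeding are assumptions:

```python
import random

def random_masking(num_patches, mask_ratio, seed=0):
    """Split patch indices into (kept, masked) sets, MAE-style."""
    rng = random.Random(seed)
    idx = list(range(num_patches))
    rng.shuffle(idx)  # uniform random permutation of patch indices
    num_keep = int(num_patches * (1 - mask_ratio))
    return sorted(idx[:num_keep]), sorted(idx[num_keep:])

# 224x224 input with 16x16 patches -> 196 patches; 75% masked leaves 49 visible
kept, masked = random_masking((224 // 16) ** 2, 0.75)
```

Only the 49 visible patches enter the encoder; the decoder reconstructs the 147 masked ones.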

Exported Formats

| Format | File | Size | Use Case |
|--------|------|------|----------|
| PyTorch (.pth) | pytorch/haptos_v1.pth | 376 MB | Training, fine-tuning |
| SafeTensors | pytorch/haptos_v1.safetensors | 376 MB | Fast loading, safe |
| ONNX | onnx/haptos_v1.onnx | 345 MB | Cross-platform inference |
| TensorRT FP16 | tensorrt/haptos_v1_fp16.trt | 175 MB | Edge deployment (Jetson/L4) |
| TensorRT FP32 | tensorrt/haptos_v1_fp32.trt | 345 MB | Full-precision inference |
| Checkpoint | checkpoints/best.pth | 1.1 GB | Resume training (optimizer + scheduler state) |

Training Details

| Setting | Value |
|---------|-------|
| Hardware | 8x NVIDIA L4 (23.7 GB each) |
| VRAM Usage | 19.0 GB / 23.7 GB (80%) per GPU |
| Effective Batch | 192 (24/GPU x 8 GPUs) |
| Optimizer | AdamW (betas=0.9, 0.95) |
| Learning Rate | 3e-4 |
| LR Schedule | Warmup + Cosine Annealing with Warm Restarts (T0=28, T_mult=2) |
| Precision | bf16 mixed precision |
| Epochs | 40 |
| Best Val Loss | 0.0836 (epoch 52) |
| Test Loss | 0.0825 |
| Test Recon Loss | 0.0090 |
| Test Force Loss (L1) | 0.7347 |
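Under T0=28 and T_mult=2, the cosine-with-warm-restarts schedule resets the learning rate to its peak at epoch 28, then again after 56 more epochs, and so on. A minimal sketch of the post-warmup schedule (warmup omitted; this mirrors the standard formula behind torch.optim.lr_scheduler.CosineAnnealingWarmRestarts, not the project's actual scheduler code):

```python
import math

def cosine_warm_restarts(epoch, eta_max=3e-4, eta_min=0.0, t0=28, t_mult=2):
    """LR at an integer epoch under cosine annealing with warm restarts."""
    t_i, start = t0, 0
    while epoch >= start + t_i:  # walk cycles (lengths t0, t0*t_mult, ...)
        start += t_i
        t_i *= t_mult
    t_cur = epoch - start        # position within the current cycle
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

For example, the rate starts at 3e-4, falls to half its peak midway through the first cycle (epoch 14), and jumps back to 3e-4 at the epoch-28 restart.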

Usage

import torch
from safetensors.torch import load_file

# Load weights
state_dict = load_file("pytorch/haptos_v1.safetensors")

# Build model
from anima_haptos.models.mae_cuda import TactileMAECuda
model = TactileMAECuda(
    img_size=224, patch_size=16, embed_dim=768,
    encoder_depth=12, num_heads=12,
    decoder_dim=512, decoder_depth=6, decoder_heads=8,
    mask_ratio=0.75, force_head=True, force_dim=3,
)
model.load_state_dict(state_dict)
model.eval()

# Extract features
img = torch.randn(1, 3, 224, 224)
features = model.get_encoder_features(img)  # [1, 768]
force = model.force_head(features)          # [1, 3] (fx, fy, fz)

Capabilities

  • Pixel-level: Masked reconstruction of tactile images
  • Physical-level: 3D contact force estimation (fx, fy, fz) with L1 supervision
  • Multi-sensor: Works across GelSight, DIGIT, DuraGel, Tac3D
  • Temporal: Processes tactile frame sequences
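The cross-sensor matching objective compares embeddings of the same contact captured by different sensors. A minimal pure-Python sketch of one way to score such a pair over the 768-d encoder features (illustrative only; the actual matching loss in the training code may use a different similarity or a learned head):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

A well-trained encoder should score a GelSight/DIGIT pair of the same contact higher than embeddings from unrelated contacts.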

Checkpoint Contents

best.pth includes full state for resume:

  • model_state_dict, optimizer_state_dict, scheduler_state_dict
  • early_stopping_state_dict, scaler_state_dict
  • epoch, global_step, val_loss
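A hedged sketch of how a resume routine might validate those keys before restoring state. The helper name is an assumption, and the checkpoint is simulated as a plain dict; in practice it would come from torch.load("checkpoints/best.pth"):

```python
# Keys listed in the checkpoint contents above
REQUIRED_KEYS = {
    "model_state_dict", "optimizer_state_dict", "scheduler_state_dict",
    "early_stopping_state_dict", "scaler_state_dict",
    "epoch", "global_step", "val_loss",
}

def validate_checkpoint(ckpt):
    """Raise if any resume key is missing; return the epoch to resume from."""
    missing = REQUIRED_KEYS - ckpt.keys()
    if missing:
        raise KeyError(f"checkpoint missing keys: {sorted(missing)}")
    return ckpt["epoch"] + 1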

Files

β”œβ”€β”€ README.md
β”œβ”€β”€ paper.pdf
β”œβ”€β”€ pytorch/
β”‚   β”œβ”€β”€ haptos_v1.pth
β”‚   └── haptos_v1.safetensors
β”œβ”€β”€ onnx/
β”‚   └── haptos_v1.onnx
β”œβ”€β”€ tensorrt/
β”‚   β”œβ”€β”€ haptos_v1_fp16.trt
β”‚   └── haptos_v1_fp32.trt
β”œβ”€β”€ checkpoints/
β”‚   └── best.pth
β”œβ”€β”€ configs/
β”‚   └── training.yaml
└── logs/
    └── training_history.json

License

Apache 2.0 β€” Robot Flow Labs / AIFLOW LABS LIMITED
