Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation
Official code release for the paper "Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation"
Guo Ye*, Zexi Zhang*, Xu Zhao, Shang Wu, Haoran Lu, Shihan Lu, Han Liu *Equal contribution. Department of Computer Science, Northwestern University.
Think → Dream → Act. Each policy step runs in two passes. In the Think stage the policy proposes a draft action from the current state and a null tactile prediction. In the Dream stage a frozen V-JEPA2 world model forecasts the tactile outcome of that draft action. In the Act stage the policy integrates both the real observation and the predicted tactile feedback to refine the action, enabling fine-grained corrections for contact-rich manipulation.
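The two-pass step described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: `think_dream_act`, `toy_policy`, and `toy_world_model` are hypothetical names, and the toy functions only stand in for the real policy and the frozen world model so the sketch runs end to end.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def think_dream_act(
    policy: Callable[[Sequence[float], Optional[Sequence[float]]], Vector],
    world_model: Callable[[Sequence[float], Sequence[float]], Vector],
    observation: Sequence[float],
) -> Vector:
    """Run one two-pass policy step and return the refined action."""
    # Think: propose a draft action with a null (absent) tactile prediction.
    draft_action = policy(observation, None)
    # Dream: the world model forecasts the tactile outcome of the draft action.
    predicted_tactile = world_model(observation, draft_action)
    # Act: re-run the policy on the real observation plus the forecast.
    return policy(observation, predicted_tactile)


# Toy stand-ins (hypothetical), used only to exercise the control flow.
def toy_policy(obs: Sequence[float], tactile: Optional[Sequence[float]]) -> Vector:
    # With no tactile forecast the action is the observation itself;
    # with a forecast, every component is shifted by a simple correction.
    correction = 0.0 if tactile is None else -0.5 * sum(tactile)
    return [x + correction for x in obs]


def toy_world_model(obs: Sequence[float], action: Sequence[float]) -> Vector:
    # Pretend the predicted contact magnitude grows with the action magnitude.
    return [abs(a) for a in action]


final_action = think_dream_act(toy_policy, toy_world_model, [0.5, -0.25])
```

The policy is called twice per step, and the world model is only queried, never updated, which matches the frozen world model described above.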
Abstract
Vision-Language-Action (VLA) models have shown remarkable generalization by mapping web-scale knowledge to robotic control, yet they remain blind to physical contact. Consequently, they struggle with contact-rich manipulation tasks that require reasoning about force, texture, and slip. While some approaches incorporate low-dimensional tactile signals, they fail to capture the high-resolution dynamics essential for such interactions. To address this limitation, we introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high-resolution tactile images serve as micro-vision inputs coupled with wrist-camera local vision and third-person macro vision. To reconcile these multi-scale sensory streams, we first train a unified policy with a Hierarchical Spatial Alignment (HSA) loss that aligns tactile tokens with their spatial counterparts in the wrist and third-person views. To further deepen the model's understanding of fine-grained contact dynamics, we finetune the system with a tactile world model that predicts future tactile signals. To mitigate tactile data scarcity and the wear-prone nature of tactile sensors, we construct a hybrid large-scale dataset sourced from both high-fidelity digital twin and real-world experiments. By anticipating upcoming tactile states, DreamTacVLA acquires a rich model of contact physics and conditions its actions on both real observations and imagined consequences. Across contact-rich manipulation tasks, it outperforms state-of-the-art VLA baselines, achieving up to 95% success.
Method
Two-stage training. Stage 1: a multimodal encoder $E_\psi$ processes macro / local / micro views and is trained with the Hierarchical Spatial Alignment loss $\mathcal{L}_{HSA}$ together with the policy objective $\mathcal{L}_W$, producing a draft action $a^{(t)}_{\text{draft}}$. Stage 2: a tactile world model $W_\phi$ is trained to predict future tactile image sequences. At inference the policy "dreams" the future tactile feeling that would result from its draft action and refines the plan into a final action $a^{(t)}_{\text{final}}$.
Three-scale visual hierarchy. Third-person macro vision, wrist-camera local vision, and high-resolution tactile micro vision are fused with the HSA loss, which explicitly grounds what the robot feels within what the robot sees.
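To make the idea of aligning tactile tokens with their visual counterparts concrete, here is a minimal sketch of one possible alignment term. It assumes a plain cosine-distance objective over already-matched token pairs; the actual HSA loss in the paper and in ModelTrain/ may be formulated differently (for example, contrastively). The function names and the additive weighting are illustrative assumptions.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two token embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def alignment_loss(tactile_tokens, wrist_tokens, third_person_tokens) -> float:
    """Mean (1 - cosine) between each tactile token and its matched tokens.

    Assumes the i-th tactile token has already been paired with the i-th
    wrist-view token and the i-th third-person-view token.
    """
    total = 0.0
    for t, w, g in zip(tactile_tokens, wrist_tokens, third_person_tokens):
        total += (1.0 - cosine(t, w)) + (1.0 - cosine(t, g))
    return total / (2 * len(tactile_tokens))


def combined_loss(policy_loss: float, hsa_loss: float, hsa_weight: float = 1.0) -> float:
    # Assumed additive weighting, mirroring the --hsa_weight training flag.
    return policy_loss + hsa_weight * hsa_loss
```

Under this assumption the term is near 0 when tactile tokens point in the same direction as their matched visual tokens and grows toward 2 as they point in opposite directions.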
Tasks & Dataset
We evaluate on four contact-rich tasks: Peg-in-Hole, USB Insert, Gear Assembly, and Tool Stabilization.
The hybrid tactile dataset contains 2M tactile frames across 4 tasks and 9 objects: 80% high-fidelity simulation and 20% real-world demonstrations.
Results
Qualitative comparison of predicted vs. ground-truth tactile sequences on Peg-in-Hole and Tool Stabilization.
Ablation studies on model and data scaling.
Installation
git clone https://github.com/michaelyeah7/learning-to-feel-the-future.git
cd learning-to-feel-the-future
conda create -n dreamtacvla python=3.10 -y
conda activate dreamtacvla
pip install -r requirements.txt
Data & Pretrained Weights
Set the dataset root once (otherwise ./datasets/ is used):
export DOBOT_DATA_DIR=/path/to/datasets
Place pretrained V-JEPA / V-JEPA2 backbones under jepa_ckpt/, e.g. jepa_ckpt/vitl_peg_e150.pt. Task entries (paths, episode lengths, camera names) are configured in ModelTrain/constants.py.
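As a rough illustration of the dataset-root fallback and of what a task entry might contain, consider the sketch below. Only the DOBOT_DATA_DIR variable, the ./datasets/ default, and the three kinds of fields (paths, episode lengths, camera names) come from this README; the function names, dictionary keys, and placeholder values are hypothetical and may not match ModelTrain/constants.py.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_dataset_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return DOBOT_DATA_DIR if set and non-empty, otherwise ./datasets/."""
    env = os.environ if env is None else env
    return Path(env.get("DOBOT_DATA_DIR") or "./datasets/")


def make_task_entry(root: Path, task_name: str) -> dict:
    """Build a hypothetical task entry; keys and values are illustrative only."""
    return {
        "dataset_dir": str(root / task_name),  # where the episodes live
        "episode_len": 500,                    # placeholder, not a real setting
        "camera_names": ["third_person", "wrist", "tactile"],  # placeholder names
    }


entry = make_task_entry(resolve_dataset_root(), "dobot_peginhole_tac_1107")
```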
Training
A ready-to-run example for the peg-in-hole task with HSA + CLIP text conditioning:
bash train_peg.sh
Or invoke the trainer directly:
python ModelTrain/model_train.py \
--policy_class ACTJEPAAdapter \
--task_name dobot_peginhole_tac_1107 \
--ckpt_dir ckpt/my_experiment \
--vit_ckpt_path jepa_ckpt/vitl_peg_e150.pt \
--vit_model vitl \
--clip_model ViT-B-16 --freeze_clip \
--enable_text --text_prompt "Insert the peg into the hole" \
--enable_hsa --hsa_weight 1.0 \
--num_steps 20000 --batch_size 16 --lr 1e-5
Key arguments
| Group | Flag | Purpose |
|---|---|---|
| Required | --task_name | Task entry from ModelTrain/constants.py |
| Required | --ckpt_dir | Output directory for checkpoints |
| Required | --vit_ckpt_path / --vit_model | V-JEPA tactile backbone (vitl or vitg) |
| CLIP | --clip_model, --freeze_clip | RGB encoder variant; freeze recommended |
| Text | --enable_text, --text_prompt | Language conditioning |
| HSA | --enable_hsa, --hsa_weight | Hierarchical Spatial Alignment loss |
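For readers scripting their own launchers, the flags in the table could be declared with argparse roughly as follows. This is a sketch under assumptions, not the parser in ModelTrain/model_train.py: the defaults, the `required` settings, and the `choices` list are inferred from the table and the example command, and flags not listed in the table (such as --policy_class) are omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical mirror of the trainer flags listed in the table."""
    p = argparse.ArgumentParser(description="Sketch of the training flags")
    # Required group
    p.add_argument("--task_name", required=True)
    p.add_argument("--ckpt_dir", required=True)
    p.add_argument("--vit_ckpt_path", required=True)
    p.add_argument("--vit_model", required=True, choices=["vitl", "vitg"])
    # CLIP group (default is an assumption taken from the example command)
    p.add_argument("--clip_model", default="ViT-B-16")
    p.add_argument("--freeze_clip", action="store_true")
    # Text group
    p.add_argument("--enable_text", action="store_true")
    p.add_argument("--text_prompt", default="")
    # HSA group
    p.add_argument("--enable_hsa", action="store_true")
    p.add_argument("--hsa_weight", type=float, default=1.0)
    return p


args = build_parser().parse_args([
    "--task_name", "dobot_peginhole_tac_1107",
    "--ckpt_dir", "ckpt/my_experiment",
    "--vit_ckpt_path", "jepa_ckpt/vitl_peg_e150.pt",
    "--vit_model", "vitl",
    "--freeze_clip",
    "--enable_text", "--text_prompt", "Insert the peg into the hole",
    "--enable_hsa", "--hsa_weight", "1.0",
])
```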
Inference
python experiments/run_inference.py \
--ckpt_dir ckpt/my_experiment \
--task_name dobot_peginhole_tac_1107
Data Collection (real robot)
See scripts/:
- 4_collect2train_data.py: convert collected episodes into training format
- 6_dataset_count.py: dataset statistics
Visualization
bash compare_jepa_embeddings.sh # adapter embedding heatmaps across V-JEPA checkpoints
python visualize_adapter_embeddings.py # single-checkpoint visualization
Project Structure
learning-to-feel-the-future/
├── ModelTrain/        # DreamTacVLA training code (policy, world model, HSA loss)
├── dobot_control/     # Robot control + tactile feature extraction
├── experiments/       # Inference / control / launch nodes
├── scripts/           # Data collection utilities
├── examples/          # Minimal usage examples
├── third_party/       # Vendored dependencies (DynamixelSDK, Feetech)
├── robomimic-r2d2/    # Vendored robomimic fork
├── docs/figures/      # README figures
├── train_peg.sh       # Reference training script
└── requirements.txt
Citation
@article{ye2025dreamtacvla,
title = {Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation},
author = {Ye, Guo and Zhang, Zexi and Zhao, Xu and Wu, Shang and
Lu, Haoran and Lu, Shihan and Liu, Han},
journal = {arXiv preprint arXiv:2512.23864},
year = {2025}
}
Acknowledgments
This codebase builds on excellent open-source work: ACT / Mobile ALOHA, V-JEPA / V-JEPA 2, CLIP and open_clip, robomimic, GELLO, and the Dobot X-Trainer teleoperation stack. See THIRD-PARTY-LICENSES for license details.
License
Released under the MIT License. Note that the V-JEPA / V-JEPA 2 pretrained weights are distributed by Meta under CC BY-NC 4.0; please consult the upstream repository for the terms governing those weights.
Contact
For questions, please open a GitHub issue or contact Guo Ye at guoye2018@u.northwestern.edu.