---
license: mit
language:
- en
tags:
- tactile-sensing
- controlnet
- stable-diffusion
- depth-to-tactile
- image-generation
- robotics
- multi-modal
- diffusion
- ICRA
pipeline_tag: image-to-image
library_name: pytorch
---

<h1 align="center">MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation</h1>

<p align="center">
  <a href="https://github.com/sirine-b/MultiDiffSense"><img src="https://img.shields.io/badge/Code-GitHub-black?logo=github" alt="GitHub"></a>
  <a href="https://arxiv.org/abs/2602.19348"><img src="https://img.shields.io/badge/Paper-ICRA%202026-blue" alt="Paper"></a>
  <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-green" alt="License"></a>
</p>

MultiDiffSense is a **ControlNet-based diffusion model** that generates realistic, physically grounded tactile sensor images. It is dual-conditioned on rendered depth maps of 3D objects and structured text prompts, and produces outputs for three distinct tactile sensor modalities.

## Model Details

| | |
|---|---|
| **Architecture** | ControlNet built on Stable Diffusion 1.5 |
| **Task** | Depth map + text prompt → multi-modal tactile sensor image generation |
| **Input** | 512×512 depth map (viridis colourmap) + text prompt |
| **Output** | 512×512 tactile sensor image |
| **Training** | ~150 epochs, frozen SD backbone, lr = 1e-5, batch size 8 |
| **Parameters** | ~860M (SD 1.5) + ~360M (ControlNet copy) |

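The model expects its conditioning image in exactly the format listed above: a 512×512 depth map rendered with the viridis colourmap. A minimal preprocessing sketch (not the repo's own tooling; the input file and function name here are illustrative) for converting a raw depth array into a conforming conditioning image:

```python
# Hypothetical sketch: convert a raw depth array into the 512x512
# viridis-colourmapped PNG that MultiDiffSense takes as conditioning input.
import numpy as np
from matplotlib import cm
from PIL import Image

def depth_to_conditioning_image(depth: np.ndarray, out_path: str) -> None:
    # Normalise depth to [0, 1] so the colourmap spans its full range.
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)
    # Apply the viridis colourmap and drop the alpha channel.
    rgb = (cm.viridis(d)[..., :3] * 255).astype(np.uint8)
    # Resize to the 512x512 resolution the model was trained on.
    Image.fromarray(rgb).resize((512, 512), Image.BILINEAR).save(out_path)

depth_to_conditioning_image(np.load("depth.npy"), "depth_map.png")  # "depth.npy" is a placeholder
```
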
## Supported Tactile Sensor Modalities

<table>
  <thead>
    <tr>
      <th>Sensor</th>
      <th>Description</th>
      <th>Image Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>TacTip</strong></td>
      <td>Optical tactile sensor with pin-based deformation markers</td>
      <td><img src="https://cdn-uploads.huggingface.co/production/uploads/67b5f6b9abfba5ff6dd1b645/HFLb9F7xYiNmlfQAkh3KO.png" width="120"/></td>
    </tr>
    <tr>
      <td><strong>ViTac</strong></td>
      <td>Vision-based tactile sensor (no markers)</td>
      <td><img src="https://cdn-uploads.huggingface.co/production/uploads/67b5f6b9abfba5ff6dd1b645/2R9-qwRVSl6UUpXdl-6HC.png" width="120"/></td>
    </tr>
    <tr>
      <td><strong>ViTacTip</strong></td>
      <td>Combined vision-tactile sensor</td>
      <td><img src="https://cdn-uploads.huggingface.co/production/uploads/67b5f6b9abfba5ff6dd1b645/24s4nbM-Vx9vrAIONOqUI.png" width="120"/></td>
    </tr>
  </tbody>
</table>

## Files

| File | Description |
|------|-------------|
| `multidiffsense.ckpt` | Model checkpoint, trained on short prompts + depth maps |

## Usage

Clone the [GitHub repository](https://github.com/sirine-b/MultiDiffSense) and follow the installation instructions, then run inference. The checkpoint is downloaded automatically on first run:

```bash
git clone https://github.com/sirine-b/MultiDiffSense.git
cd MultiDiffSense
pip install -r requirements.txt

# Single depth map:
python multidiffsense/controlnet/generate.py \
  --source_image path/to/depth_map.png \
  --prompt '{"sensor_context": "captured by a high-resolution vision only sensor ViTac.", "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}'

# Batch generation from a prompt file:
python multidiffsense/controlnet/generate.py \
  --dataset_dir datasets \
  --prompt_json datasets/test/prompt_ViTacTip.json
```

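The `--prompt` argument is a JSON string with a `sensor_context` description and an `object_pose` (`x`, `y`, `z`, `yaw`). If you build prompts programmatically, a small sketch assuming only the fields shown above:

```python
# Sketch: compose the structured JSON prompt used by generate.py and
# invoke the script. Only the fields visible in the example above are assumed.
import json
import subprocess

prompt = json.dumps({
    "sensor_context": "captured by a high-resolution vision only sensor ViTac.",
    "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0},
})

subprocess.run([
    "python", "multidiffsense/controlnet/generate.py",
    "--source_image", "path/to/depth_map.png",
    "--prompt", prompt,
], check=True)
```
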
See the [GitHub repository](https://github.com/sirine-b/MultiDiffSense) for full documentation on dataset preparation, training from scratch, evaluation, and ablation studies.

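If you would rather fetch `multidiffsense.ckpt` yourself instead of relying on the automatic download, `huggingface_hub` can pull it directly; the repository id below is a placeholder for this model's actual Hub id:

```python
# Sketch: manually download the trained checkpoint from the Hugging Face Hub.
# "user/MultiDiffSense" is a placeholder repo id, not the confirmed one.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="user/MultiDiffSense",   # placeholder: replace with the real repo id
    filename="multidiffsense.ckpt",  # file listed in the Files table above
)
print(ckpt_path)  # local cache path of the downloaded checkpoint
```
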
## Citation

```bibtex
@inproceedings{multidiffsense2026,
  title     = {MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose},
  author    = {Sirine Bhouri and Lan Wei and Jian-Qing Zheng and Dandan Zhang},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.19348}
}
```

## License

MIT