---
license: mit
language:
- en
tags:
- tactile-sensing
- controlnet
- stable-diffusion
- depth-to-tactile
- image-generation
- robotics
- multi-modal
- diffusion
- ICRA
pipeline_tag: image-to-image
library_name: pytorch
---

<h1 align="center">MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation</h1>

<p align="center">
  <a href="https://github.com/sirine-b/MultiDiffSense"><img src="https://img.shields.io/badge/Code-GitHub-black?logo=github" alt="GitHub"></a>
  <a href="https://arxiv.org/abs/2602.19348"><img src="https://img.shields.io/badge/Paper-ICRA%202026-blue" alt="Paper"></a>
  <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-green" alt="License"></a>
</p>

MultiDiffSense is a **ControlNet-based diffusion model** that generates realistic, physically grounded tactile sensor images. It is dual-conditioned on rendered depth maps of 3D objects and structured text prompts, and produces outputs for three distinct tactile sensor modalities.

## Model Details

| | |
|---|---|
| **Architecture** | ControlNet built on Stable Diffusion 1.5 |
| **Task** | Depth map + text prompt → multi-modal tactile sensor image generation |
| **Input** | 512×512 depth map (viridis colourmap) + text prompt |
| **Output** | 512×512 tactile sensor image |
| **Training** | ~150 epochs, frozen SD backbone, lr = 1e-5, batch size 8 |
| **Parameters** | ~860M (SD 1.5) + ~360M (ControlNet copy) |

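The model expects its conditioning image in exactly the format listed above: a 512×512 depth map rendered with the viridis colourmap. A minimal preprocessing sketch (not the repo's own tooling; the input file and function name here are illustrative) for converting a raw depth array into a conforming conditioning image:

```python
# Hypothetical sketch: convert a raw depth array into the 512x512
# viridis-colourmapped PNG that MultiDiffSense takes as conditioning input.
import numpy as np
from matplotlib import cm
from PIL import Image

def depth_to_conditioning_image(depth: np.ndarray, out_path: str) -> None:
    # Normalise depth to [0, 1] so the colourmap spans its full range.
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)
    # Apply the viridis colourmap and drop the alpha channel.
    rgb = (cm.viridis(d)[..., :3] * 255).astype(np.uint8)
    # Resize to the 512x512 resolution the model was trained on.
    Image.fromarray(rgb).resize((512, 512), Image.BILINEAR).save(out_path)

depth_to_conditioning_image(np.load("depth.npy"), "depth_map.png")  # "depth.npy" is a placeholder
```
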
## Supported Tactile Sensor Modalities

<table>
  <thead>
    <tr>
      <th>Sensor</th>
      <th>Description</th>
      <th>Image Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>TacTip</strong></td>
      <td>Optical tactile sensor with pin-based deformation markers</td>
      <td><img src="https://cdn-uploads.huggingface.co/production/uploads/67b5f6b9abfba5ff6dd1b645/HFLb9F7xYiNmlfQAkh3KO.png" width="120"/></td>
    </tr>
    <tr>
      <td><strong>ViTac</strong></td>
      <td>Vision-based tactile sensor (no markers)</td>
      <td><img src="https://cdn-uploads.huggingface.co/production/uploads/67b5f6b9abfba5ff6dd1b645/2R9-qwRVSl6UUpXdl-6HC.png" width="120"/></td>
    </tr>
    <tr>
      <td><strong>ViTacTip</strong></td>
      <td>Combined vision-tactile sensor</td>
      <td><img src="https://cdn-uploads.huggingface.co/production/uploads/67b5f6b9abfba5ff6dd1b645/24s4nbM-Vx9vrAIONOqUI.png" width="120"/></td>
    </tr>
  </tbody>
</table>

## Files

| File | Description |
|------|-------------|
| `multidiffsense.ckpt` | Model checkpoint, trained on short prompts + depth maps |

## Usage

Clone the [GitHub repository](https://github.com/sirine-b/MultiDiffSense) and follow the installation instructions, then run inference. The checkpoint is downloaded automatically on first run:

```bash
git clone https://github.com/sirine-b/MultiDiffSense.git
cd MultiDiffSense
pip install -r requirements.txt

# Single depth map:
python multidiffsense/controlnet/generate.py \
  --source_image path/to/depth_map.png \
  --prompt '{"sensor_context": "captured by a high-resolution vision only sensor ViTac.", "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}'

# Batch generation from a prompt file:
python multidiffsense/controlnet/generate.py \
  --dataset_dir datasets \
  --prompt_json datasets/test/prompt_ViTacTip.json
```

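The `--prompt` argument is a JSON string with a `sensor_context` description and an `object_pose` (`x`, `y`, `z`, `yaw`). If you build prompts programmatically, a small sketch assuming only the fields shown above:

```python
# Sketch: compose the structured JSON prompt used by generate.py and
# invoke the script. Only the fields visible in the example above are assumed.
import json
import subprocess

prompt = json.dumps({
    "sensor_context": "captured by a high-resolution vision only sensor ViTac.",
    "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0},
})

subprocess.run([
    "python", "multidiffsense/controlnet/generate.py",
    "--source_image", "path/to/depth_map.png",
    "--prompt", prompt,
], check=True)
```
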
See the [GitHub repository](https://github.com/sirine-b/MultiDiffSense) for full documentation on dataset preparation, training from scratch, evaluation, and ablation studies.

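If you would rather fetch `multidiffsense.ckpt` yourself instead of relying on the automatic download, `huggingface_hub` can pull it directly; the repository id below is a placeholder for this model's actual Hub id:

```python
# Sketch: manually download the trained checkpoint from the Hugging Face Hub.
# "user/MultiDiffSense" is a placeholder repo id, not the confirmed one.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="user/MultiDiffSense",   # placeholder: replace with the real repo id
    filename="multidiffsense.ckpt",  # file listed in the Files table above
)
print(ckpt_path)  # local cache path of the downloaded checkpoint
```
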
## Citation

```bibtex
@inproceedings{multidiffsense2026,
  title     = {MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose},
  author    = {Sirine Bhouri and Lan Wei and Jian-Qing Zheng and Dandan Zhang},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.19348}
}
```

## License

MIT