---
license: cc-by-nc-sa-4.0
datasets:
- encord-team/E-MM1-100M
- encord-team/E-MM1-1M
language:
- en
---

# Model Card for `ebind-full`

![]()

<div style="display: flex; justify-content: space-between;">
<div style="flex: 1; padding: 10px;">
<a href="https://arxiv.org/abs/2511.14229" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/arXiv-2511.14229-b31b1b.svg?logo=arxiv" alt="arXiv Paper" style="vertical-align:middle;">
</a>
<a href="https://colab.research.google.com/github/encord-team/ebind/blob/main/misc/demo.ipynb" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="vertical-align:middle;">
</a>
<a href="https://huggingface.co/encord-team/ebind-full" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face Models" style="vertical-align:middle;">
</a>
<a href="https://huggingface.co/datasets/encord-team/E-MM1-100M" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-blue" alt="Hugging Face Datasets" style="vertical-align:middle;">
</a>
<a href="https://e-mm1.github.io" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/Project%20Page-blue?logo=github" alt="Project Page" style="vertical-align:middle;">
</a>
<div style="flex:1"></div>
<a href="https://encord.com/blog/how-we-built-multimodal-dataset-emm1/" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/%F0%9F%93%96-Blog-blue" alt="Blog" style="vertical-align:middle;">
</a>
<a href="https://twitter.com/encord_team" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/encord_team?label=%40encord_team&style=social" style="vertical-align:middle;">
</a>
<img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-Welcome-blue" style="vertical-align:middle;">
</div>
</div>

# EBind: Multi-Modal Embeddings

## Model Details

### Model Description

EBind is a multi-modal embedding model that supports image, video, audio, text, and 3D point cloud inputs. All modalities are projected into a shared embedding space, enabling cross-modal similarity computations.
The model builds on top of three other models: [Perception Encoder](https://huggingface.co/facebook/PE-Core-L14-336), [ImageBind](https://huggingface.co/nielsr/imagebind-huge), and [Uni3D](https://github.com/baaivision/Uni3D).
As indicated by the figure at the top, each input is first embedded by one of these three models.
Audio and 3D point cloud embeddings are then projected with an MLP into the embedding space of the Perception Encoder.
The model produces unit-norm embeddings that are directly usable for similarity comparisons via dot products (cosine similarity).

This version loads all encoders.
If you do not need all modalities, refer to the [audio-vision](https://huggingface.co/encord-team/ebind-audio-vision) and [3D-points-vision](https://huggingface.co/encord-team/ebind-points-vision) models, which load only the relevant encoders.
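
Because the embeddings are unit-norm, the dot product of two embeddings is exactly their cosine similarity. A minimal sketch of this property (plain PyTorch tensors as stand-ins for model outputs; not EBind-specific API):

```python
import torch
import torch.nn.functional as F

# Two random unit-norm 1024-d vectors standing in for embeddings.
a = F.normalize(torch.randn(1024), dim=-1)
b = F.normalize(torch.randn(1024), dim=-1)

# For unit-norm vectors, the dot product equals the cosine similarity.
assert torch.allclose(a @ b, F.cosine_similarity(a, b, dim=-1), atol=1e-6)
```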

- **Developed by:** The Encord ML Team ([ml@encord.com](mailto:ml@encord.com))
- **Model type:** Multimodal embedding model.
- **License:** The model is published under the [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.txt) license.

### Model Sources

- **Repository:** [GitHub](https://github.com/encord-team/ebind)
- **Project Page:** [e-mm1.github.io](https://e-mm1.github.io)
- **Paper:** [EBind: a practical approach to space binding](https://arxiv.org/abs/2511.14229)
- **Demo:** [Explore the embedding space](https://data.encord.com)

## Uses

### Direct Use

The model is intended to be used with direct file inputs for the supported modalities: image, video, audio, 3D point cloud, and text. It produces a 1024-dimensional embedding per input, suited for similarity computations.

**Downstream Use**

The model could be used to build multimodal LLMs, generative models, and systems that perceive their surroundings via visual, audio, and point cloud embeddings.

## Bias, Risks, and Limitations

The model was trained on the data specified in the paper.
As such, it will be biased towards data that "lives on the internet."
For specific use cases, a subsequent fine-tuning stage may be necessary.

## How to Get Started with the Model

**Option 1**
If you want to work within the repository, use [`uv`](https://docs.astral.sh/uv/) to install the necessary dependencies:

```bash
git clone https://github.com/encord-team/ebind
cd ebind
uv sync
```

**Option 2**
You can also install it as an external dependency for another project:

```bash
# Option 2.a: install directly from GitHub
python -m pip install git+https://github.com/encord-team/ebind
# Option 2.b: or install a local, editable version
git clone https://github.com/encord-team/ebind
cd /path/to/your/project
python -m pip install -e /path/to/ebind
```
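
Either way, a quick import check confirms the install worked (this simply exercises the imports used in the snippets below):

```bash
python -c "from ebind import EBindModel, EBindProcessor; print('ebind OK')"
```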

> [!WARNING]
> If you are running a project with `torch==2.8.0`, you should install `torchcodec==0.7.0` (as opposed to `0.8.0`,
> which is automatically installed with uv). `torchcodec==0.8.*` matches `torch==2.9.0`.

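For example, one way to pin it in a pip-managed environment (an illustrative command, not from the repo docs):

```bash
python -m pip install "torchcodec==0.7.0"
```
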
> [!NOTE]
> The 3D point cloud backbone has a few custom CUDA kernels that you might want to [compile](#compile-pointnet2-cuda-ops-optional).
> To do that, you will have to use Option 1 or Option 2.b above to get a local copy of the repository and compile the kernels.

### Loading the Model

```python
import torch
from ebind import EBindModel, EBindProcessor

model = EBindModel.from_pretrained("encord-team/ebind-full")
processor = EBindProcessor.from_pretrained("encord-team/ebind-full")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
processor = processor.to(device)
```

### Processing Multi-Modal Inputs

```python
inputs = {
    "image": ["examples/dog.png", "examples/cat.png"],
    "video": ["examples/dog.mp4", "examples/cat.mp4"],
    "audio": ["examples/dog.mp4", "examples/cat.mp4"],
    "text": ["A dog is howling in the street", "A cat is sleeping on the couch"],
    "points": ["examples/dog_point_cloud.npy", "examples/cat_point_cloud.npy"],
}

with torch.inference_mode():
    batch = processor(inputs, return_tensors="pt")  # set text_file_paths=True if passing text file paths instead of strings
    outputs = model.forward(**batch)
```
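
Each entry in `outputs` should hold one 1024-dimensional, unit-norm embedding per input. A quick sanity check, assuming `outputs` is a dict of tensors keyed by modality as in the snippet above:

```python
for modality, emb in outputs.items():
    print(modality, tuple(emb.shape))      # e.g. image (2, 1024)
    print(torch.linalg.norm(emb, dim=-1))  # each row should have norm ~1.0
```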

### Computing Cross-Modal Similarities

```python
keys = list(outputs.keys())
for i, modality in enumerate(keys):
    for modality2 in keys[i + 1:]:
        # Unit-norm embeddings: the matrix product gives pairwise cosine similarities.
        result = outputs[modality] @ outputs[modality2].T
        print(f"{modality} x {modality2} similarity:")
        print(result.cpu().detach().numpy())
        print("=" * 26)
```

Expected Output:

```
image x video similarity:
[[0.48 0.42]
 [0.41 0.6 ]]
==========================
image x audio similarity:
[[0.07 0.05]
 [0.02 0.12]]
==========================
image x text similarity:
[[0.16 0.07]
 [0.08 0.14]]
==========================
image x points similarity:
[[0.2  0.19]
 [0.18 0.19]]
==========================
video x audio similarity:
[[0.19 0.08]
 [0.03 0.16]]
==========================
video x text similarity:
[[0.26 0.05]
 [0.11 0.14]]
==========================
video x points similarity:
[[0.24 0.15]
 [0.17 0.26]]
==========================
audio x text similarity:
[[ 0.12 -0.  ]
 [ 0.07  0.09]]
==========================
audio x points similarity:
[[0.13 0.06]
 [0.1  0.12]]
==========================
text x points similarity:
[[0.19 0.14]
 [0.05 0.18]]
==========================
```

**Note:** The image/video similarity is significantly higher because they share the same vision encoder.
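
With these scores you can already do simple cross-modal retrieval, e.g. ranking the images for each text query (an illustrative sketch; the printed indices follow the example numbers above):

```python
# Text embeddings as queries, images as the candidate pool.
sims = outputs["text"] @ outputs["image"].T  # shape: (num_texts, num_images)
best = sims.argmax(dim=-1)                   # best-matching image per text
print(best.cpu().numpy())                    # e.g. [0 1] for the example above
```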

### Compile PointNet2 CUDA ops (optional)

If you have CUDA available, consider building the [PointNet2](https://github.com/erikwijmans/Pointnet2_PyTorch/tree/master/pointnet2_ops_lib/pointnet2_ops/_ext-src) custom ops used for embedding point clouds to get faster inference:

```bash
cd src/ebind/models/uni3d/pointnet2_ops && \
uv run python -c "import torch,sys; sys.exit(0 if torch.cuda.is_available() else 1)" && \
MAX_JOBS=$(nproc) uv run python setup.py build_ext --inplace
```

> We have modified the code slightly in `src/ebind/models/uni3d/pointnet2_ops/pointnet2_utils.py` to
> provide a fallback torch implementation so that the model can also run on hardware without a GPU.

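A hypothetical way to check which code path you are on (this assumes the vendored package keeps the upstream `pointnet2_ops._ext` module name for the compiled extension; the actual import path in this repo may differ):

```python
try:
    from pointnet2_ops import _ext  # noqa: F401  # hypothetical module path
    print("Compiled PointNet2 CUDA ops available.")
except ImportError:
    print("Using the pure-torch fallback implementation.")
```
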
## Evaluation

We have evaluated the model on multiple benchmarks.
We highlight that EBind performs nearly as well as models 4 and 17 times larger.

![]()
**Figure 1:** An average of the 13 benchmarks presented in the two tables below, plotted against model size.

![]()
![]()

## Citation

**BibTeX:**

```
@misc{broadbent2025ebindpracticalapproachspace,
      title={{EBind}: a practical approach to space binding},
      author={Jim Broadbent and Felix Cohen and Frederik Hvilshøj and Eric Landau and Eren Sasoglu},
      year={2025},
      eprint={2511.14229},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.14229},
}
```

## Try it now

Explore the multimodal E-MM1 dataset behind this model [here](https://data.encord.com/e-mm1/explorer)!

## Model Card Contact

Please reach out to [ml@encord.com](mailto:ml@encord.com) with any questions or feedback.