---
license: cc-by-nc-sa-4.0
datasets:
- encord-team/E-MM1-100M
- encord-team/E-MM1-1M
language:
- en
---

# Model Card for `ebind-full`

![]()

<div style="display: flex; justify-content: space-between;">
<div style="flex: 1; padding: 10px;">
<a href="https://arxiv.org/abs/2511.14229" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/arXiv-2511.14229-b31b1b.svg?logo=arxiv" alt="arXiv Paper" style="vertical-align:middle;">
</a>
<a href="https://colab.research.google.com/github/encord-team/ebind/blob/main/misc/demo.ipynb" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="vertical-align:middle;">
</a>
<a href="https://huggingface.co/encord-team/ebind-full" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face Models" style="vertical-align:middle;">
</a>
<a href="https://huggingface.co/datasets/encord-team/E-MM1-100M" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-blue" alt="Hugging Face Datasets" style="vertical-align:middle;">
</a>
<a href="https://e-mm1.github.io" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/Project%20Page-blue?logo=github" alt="Project Page" style="vertical-align:middle;">
</a>
<div style="flex:1"></div>
<a href="https://encord.com/blog/how-we-built-multimodal-dataset-emm1/" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/%F0%9F%93%96-Blog-blue" alt="Blog" style="vertical-align:middle;">
</a>
<a href="https://twitter.com/encord_team" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/encord_team?label=%40encord_team&style=social" style="vertical-align:middle;">
</a>
<img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-Welcome-blue" style="vertical-align:middle;">
</div>
</div>

# EBind: Multi-Modal Embeddings

## Model Details

### Model Description

EBind is a multi-modal embedding model that supports image, video, audio, text, and 3D point cloud inputs. All modalities are projected into a shared embedding space, enabling cross-modal similarity computations.
The model builds on top of three other models: [Perception Encoder](https://huggingface.co/facebook/PE-Core-L14-336), [ImageBind](https://huggingface.co/nielsr/imagebind-huge), and [Uni3D](https://github.com/baaivision/Uni3D).
As indicated by the figure at the top, each input is first embedded by one of these three models.
Audio and 3D point cloud embeddings are then projected with an MLP into the embedding space of the Perception Encoder.
The model produces unit-norm embeddings that are directly usable for similarity comparisons via dot products (cosine similarity).

This version loads all encoders.
If you do not need all modalities, refer to the [audio-vision](https://huggingface.co/encord-team/ebind-audio-vision) and [3D-points-vision](https://huggingface.co/encord-team/ebind-points-vision) models, which load only the relevant encoders.
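
Because the embeddings are unit-norm, the dot product of two embeddings is exactly their cosine similarity. A minimal sketch of this property (plain PyTorch tensors as stand-ins for model outputs; not EBind-specific API):

```python
import torch
import torch.nn.functional as F

# Two random unit-norm 1024-d vectors standing in for embeddings.
a = F.normalize(torch.randn(1024), dim=-1)
b = F.normalize(torch.randn(1024), dim=-1)

# For unit-norm vectors, the dot product equals the cosine similarity.
assert torch.allclose(a @ b, F.cosine_similarity(a, b, dim=-1), atol=1e-6)
```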

- **Developed by:** The Encord ML Team ([ml@encord.com](mailto:ml@encord.com))
- **Model type:** Multimodal embedding model.
- **License:** The model is published under the [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.txt) license.

### Model Sources

- **Repository:** [GitHub](https://github.com/encord-team/ebind)
- **Project Page:** [e-mm1.github.io](https://e-mm1.github.io)
- **Paper:** [EBind: a practical approach to space binding](https://arxiv.org/abs/2511.14229)
- **Demo:** [Explore the embedding space](https://data.encord.com)

## Uses

### Direct Use

The model is intended to be used with direct file inputs for the supported modalities: image, video, audio, 3D point cloud, and text. It produces a 1024-dimensional embedding per input, suited for similarity computations.

**Downstream Use**

The model could be used to build multimodal LLMs, generative models, and systems that perceive their surroundings via visual, audio, and point cloud embeddings.

## Bias, Risks, and Limitations

The model was trained on the data specified in the paper.
As such, it will be biased towards data that "lives on the internet."
For specific use cases, a subsequent fine-tuning stage may be necessary.

## How to Get Started with the Model

**Option 1**
If you want to work within the repository, use [`uv`](https://docs.astral.sh/uv/) to install the necessary dependencies:

```bash
git clone https://github.com/encord-team/ebind
cd ebind
uv sync
```

**Option 2**
You can also install it as an external dependency for another project:

```bash
# Option 2.a: install directly from GitHub
python -m pip install git+https://github.com/encord-team/ebind
# Option 2.b: or install a local, editable version
git clone https://github.com/encord-team/ebind
cd /path/to/your/project
python -m pip install -e /path/to/ebind
```
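
Either way, a quick import check confirms the install worked (this simply exercises the imports used in the snippets below):

```bash
python -c "from ebind import EBindModel, EBindProcessor; print('ebind OK')"
```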

> [!WARNING]
> If you are running a project with `torch==2.8.0`, you should install `torchcodec==0.7.0` (as opposed to `0.8.0`,
> which is automatically installed with uv). `torchcodec==0.8.*` matches `torch==2.9.0`.

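For example, one way to pin it in a pip-managed environment (an illustrative command, not from the repo docs):

```bash
python -m pip install "torchcodec==0.7.0"
```
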
> [!NOTE]
> The 3D point cloud backbone has a few custom CUDA kernels that you might want to [compile](#compile-pointnet2-cuda-ops-optional).
> To do that, you will have to use Option 1 or Option 2.b above to get a local copy of the repository and compile the kernels.

### Loading the Model

```python
import torch
from ebind import EBindModel, EBindProcessor

model = EBindModel.from_pretrained("encord-team/ebind-full")
processor = EBindProcessor.from_pretrained("encord-team/ebind-full")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
processor = processor.to(device)
```

### Processing Multi-Modal Inputs

```python
inputs = {
    "image": ["examples/dog.png", "examples/cat.png"],
    "video": ["examples/dog.mp4", "examples/cat.mp4"],
    "audio": ["examples/dog.mp4", "examples/cat.mp4"],
    "text": ["A dog is howling in the street", "A cat is sleeping on the couch"],
    "points": ["examples/dog_point_cloud.npy", "examples/cat_point_cloud.npy"],
}

with torch.inference_mode():
    batch = processor(inputs, return_tensors="pt")  # set text_file_paths=True if passing text file paths instead of strings
    outputs = model.forward(**batch)
```
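
Each entry in `outputs` should hold one 1024-dimensional, unit-norm embedding per input. A quick sanity check, assuming `outputs` is a dict of tensors keyed by modality as in the snippet above:

```python
for modality, emb in outputs.items():
    print(modality, tuple(emb.shape))      # e.g. image (2, 1024)
    print(torch.linalg.norm(emb, dim=-1))  # each row should have norm ~1.0
```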

### Computing Cross-Modal Similarities

```python
keys = list(outputs.keys())
for i, modality in enumerate(keys):
    for modality2 in keys[i + 1:]:
        # Unit-norm embeddings: the matrix product gives pairwise cosine similarities.
        result = outputs[modality] @ outputs[modality2].T
        print(f"{modality} x {modality2} similarity:")
        print(result.cpu().detach().numpy())
        print("=" * 26)
```

Expected Output:

```
image x video similarity:
[[0.48 0.42]
 [0.41 0.6 ]]
==========================
image x audio similarity:
[[0.07 0.05]
 [0.02 0.12]]
==========================
image x text similarity:
[[0.16 0.07]
 [0.08 0.14]]
==========================
image x points similarity:
[[0.2  0.19]
 [0.18 0.19]]
==========================
video x audio similarity:
[[0.19 0.08]
 [0.03 0.16]]
==========================
video x text similarity:
[[0.26 0.05]
 [0.11 0.14]]
==========================
video x points similarity:
[[0.24 0.15]
 [0.17 0.26]]
==========================
audio x text similarity:
[[ 0.12 -0.  ]
 [ 0.07  0.09]]
==========================
audio x points similarity:
[[0.13 0.06]
 [0.1  0.12]]
==========================
text x points similarity:
[[0.19 0.14]
 [0.05 0.18]]
==========================
```

**Note:** The image/video similarity is significantly higher because they share the same vision encoder.
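
With these scores you can already do simple cross-modal retrieval, e.g. ranking the images for each text query (an illustrative sketch; the printed indices follow the example numbers above):

```python
# Text embeddings as queries, images as the candidate pool.
sims = outputs["text"] @ outputs["image"].T  # shape: (num_texts, num_images)
best = sims.argmax(dim=-1)                   # best-matching image per text
print(best.cpu().numpy())                    # e.g. [0 1] for the example above
```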

### Compile PointNet2 CUDA ops (optional)

If you have CUDA available, consider building the [PointNet2](https://github.com/erikwijmans/Pointnet2_PyTorch/tree/master/pointnet2_ops_lib/pointnet2_ops/_ext-src) custom ops used for embedding point clouds to get faster inference:

```bash
cd src/ebind/models/uni3d/pointnet2_ops && \
uv run python -c "import torch,sys; sys.exit(0 if torch.cuda.is_available() else 1)" && \
MAX_JOBS=$(nproc) uv run python setup.py build_ext --inplace
```

> We have modified the code slightly in `src/ebind/models/uni3d/pointnet2_ops/pointnet2_utils.py` to
> provide a fallback torch implementation so that the model can also run on hardware without a GPU.

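A hypothetical way to check which code path you are on (this assumes the vendored package keeps the upstream `pointnet2_ops._ext` module name for the compiled extension; the actual import path in this repo may differ):

```python
try:
    from pointnet2_ops import _ext  # noqa: F401  # hypothetical module path
    print("Compiled PointNet2 CUDA ops available.")
except ImportError:
    print("Using the pure-torch fallback implementation.")
```
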
## Evaluation

We have evaluated the model on multiple benchmarks.
We highlight that EBind performs nearly as well as models 4 and 17 times larger.

![]()
**Figure 1:** An average of the 13 benchmarks presented in the two tables below, plotted against model size.

![]()
![]()

## Citation

**BibTeX:**

```
@misc{broadbent2025ebindpracticalapproachspace,
      title={{EBind}: a practical approach to space binding},
      author={Jim Broadbent and Felix Cohen and Frederik Hvilshøj and Eric Landau and Eren Sasoglu},
      year={2025},
      eprint={2511.14229},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.14229},
}
```

## Try it now

Explore the multimodal E-MM1 dataset behind this model [here](https://data.encord.com/e-mm1/explorer)!

## Model Card Contact

Please reach out to [ml@encord.com](mailto:ml@encord.com) with any questions or feedback.