---
license: cc-by-4.0
configs:
- config_name: affinity
  default: true
  data_files:
  - split: train
    path: data/affinity/data.csv
- config_name: p_ood_25
  data_dir: data/p_ood_25
- config_name: p_ood_28
  data_dir: data/p_ood_28
- config_name: p_ood_31
  data_dir: data/p_ood_31
- config_name: p_ood_33
  data_dir: data/p_ood_33
---
# InteractBind
> A physically grounded, large-scale protein–ligand interaction dataset
> for interpretable and interaction-aware binding prediction
---
## Links
- Paper: [A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?](https://arxiv.org/abs/2605.24045)
- GitHub: [ZhaohanM/InteractBind](https://github.com/ZhaohanM/InteractBind)
---
## Motivation
Most existing protein–ligand binding datasets provide only coarse-grained supervision, such as binary labels or scalar affinity values. While effective for prediction, these signals compress complex molecular interaction processes into a single outcome, limiting interpretability and mechanistic understanding.
**InteractBind** addresses this limitation by explicitly modelling *non-covalent interaction patterns* derived from experimentally resolved protein–ligand complexes.
It enables **token-level supervision**, bridging sequence-based representations with physically meaningful interaction structures.

---
## Dataset Overview
InteractBind is constructed from high-quality experimentally resolved complexes and includes:
- Protein sequences (FASTA and structure-aware sequence)
- Ligand molecular representations (SMILES and SELFIES)
- Binding labels and affinity annotations
- Token-level non-covalent interaction maps

The dataset is designed to support both **prediction accuracy** and **mechanistic interpretability**.
---
## Dataset
This repository provides benchmark CSVs with ground-truth residue-level interaction maps for evaluating protein–ligand interaction (PLI) prediction.
| Dataset | Type | Example Use |
|----------|------|--------------|
| InteractBind (affinity) | Full binding affinity table | Evaluate in-domain performance |
| InteractBind-P-25%/28%/31%/33% OOD | Protein OOD splits | Evaluate novel protein generalisation |
## Files
The Hugging Face Dataset Viewer is configured to read the CSV subsets under `data/`:
- `affinity`: the full InteractBind affinity table.
- `p_ood_25`, `p_ood_28`, `p_ood_31`, `p_ood_33`: protein OOD benchmark subsets with `train`, `validation`, and `test` splits.
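
The configurations above can be loaded with the Hugging Face `datasets` library. The following is a minimal sketch, not an official loader: `REPO_ID` is a hypothetical placeholder (this card does not state the Hub repository ID), `load_split` is an illustrative helper, and the split names are the ones listed above.

```python
# Hypothetical placeholder: replace with the actual Hub repository ID of this dataset.
REPO_ID = "<namespace>/<dataset-name>"

# Config names and their splits, as described in this card.
CONFIG_SPLITS = {
    "affinity": ["train"],
    "p_ood_25": ["train", "validation", "test"],
    "p_ood_28": ["train", "validation", "test"],
    "p_ood_31": ["train", "validation", "test"],
    "p_ood_33": ["train", "validation", "test"],
}


def load_split(config_name, split, repo_id=REPO_ID):
    """Load one split of one config (illustrative helper, not part of the dataset)."""
    if split not in CONFIG_SPLITS[config_name]:
        raise ValueError(f"config {config_name!r} has no split {split!r}")
    # Imported lazily so the checks above work without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, config_name, split=split)


# Example (requires network access and a real repository ID):
# ood_test = load_split("p_ood_25", "test")
```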

Each CSV includes seven residue-level binding-site fingerprint columns derived from the interaction maps:
- `Hydrogen bonding_binding_site`
- `Salt Bridges_binding_site`
- `π–π Stacking_binding_site`
- `Cation–π_binding_site`
- `Hydrophobic_binding_site`
- `Van der Waals_binding_site`
- `Overall_binding_site`

Each value is a binary list aligned position by position to the protein FASTA sequence. For example, `[0,0,1,0]` marks the third residue as a binding-site residue. Negative protein–ligand pairs without contact-map entries are encoded as all-zero fingerprints.
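
As a minimal sketch of working with these fingerprints, the helpers below parse one cell and map it back to residues. They assume each CSV cell is a JSON-style list string such as `"[0,0,1,0]"`; the function names and the toy sequence `MKTA` are illustrative and do not come from the dataset.

```python
import json


def parse_fingerprint(value):
    """Parse a fingerprint cell such as "[0,0,1,0]" into a list of 0/1 ints."""
    # Assumption: cells are JSON-style list strings; already-parsed lists pass through.
    bits = json.loads(value) if isinstance(value, str) else list(value)
    if any(b not in (0, 1) for b in bits):
        raise ValueError("fingerprint must contain only 0 and 1")
    return [int(b) for b in bits]


def binding_residues(sequence, fingerprint):
    """Return (1-based position, residue letter) for every marked residue."""
    if len(sequence) != len(fingerprint):
        raise ValueError("fingerprint length must match sequence length")
    return [
        (i + 1, aa)
        for i, (aa, bit) in enumerate(zip(sequence, fingerprint))
        if bit == 1
    ]


# Toy example: the third residue of a four-residue sequence is marked.
print(binding_residues("MKTA", parse_fingerprint("[0,0,1,0]")))  # → [(3, 'T')]
```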
## Supported Interaction Types
Structured annotations are provided for major non-covalent interaction categories:
- Hydrogen bonds
- Hydrophobic interactions
- Salt bridges
- π–π stacking
- Cation–π interactions
- Van der Waals contacts

Each interaction channel can be used independently or combined for multi-channel supervision.
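
One way to combine the channels is to stack the six per-type fingerprints into a per-residue multi-hot target. This is a sketch under assumptions: `row` is a hypothetical dictionary holding already-parsed fingerprint lists keyed by the column names from the Files section, and `stack_channels` and `any_contact` are illustrative helpers. The sketch does not assume that the union of the six channels equals `Overall_binding_site`, since this card does not state how that column is derived.

```python
# Column names of the six per-type channels, copied from the Files section.
CHANNELS = [
    "Hydrogen bonding_binding_site",
    "Salt Bridges_binding_site",
    "π–π Stacking_binding_site",
    "Cation–π_binding_site",
    "Hydrophobic_binding_site",
    "Van der Waals_binding_site",
]


def stack_channels(row, channels=CHANNELS):
    """Return one multi-hot vector per residue, with one entry per channel."""
    columns = [row[name] for name in channels]
    if len({len(column) for column in columns}) != 1:
        raise ValueError("all channels must have the same length")
    # Transpose from channel-major lists to residue-major rows.
    return [list(bits) for bits in zip(*columns)]


def any_contact(targets):
    """Collapse multi-hot targets to a single 0/1 flag per residue."""
    return [int(any(bits)) for bits in targets]


# Toy row for a four-residue protein (hypothetical values).
row = {name: [0, 0, 0, 0] for name in CHANNELS}
row["Hydrogen bonding_binding_site"] = [0, 1, 0, 0]
row["Hydrophobic_binding_site"] = [0, 1, 1, 0]
targets = stack_channels(row)  # 4 residues x 6 channels
```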
---
## Key Features
- **Physically grounded supervision:** Derived from experimentally resolved complexes rather than heuristic attention signals.
- **Token-level interaction maps:** Enables fine-grained modelling of residue–atom interactions.
- **Model-agnostic integration:** Compatible with sequence-based encoders (e.g., ESM, SELFormer) and other protein–ligand models.
- **Interpretability support:** Facilitates binding residue identification and interaction pattern analysis.
- **Scalable design:** Allows large-scale training without requiring full structural modelling during inference.
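
Binding residue identification can be scored against the ground-truth fingerprints. The sketch below computes generic per-residue precision, recall, and F1; it is not necessarily the metric used in the paper. The function name, the 0.5 threshold, and the convention of returning 0.0 when a ratio is undefined (for example, for all-zero fingerprints of negative pairs) are assumptions made for illustration.

```python
def residue_prf(true_bits, pred_scores, threshold=0.5):
    """Per-residue precision, recall, and F1 for one protein-ligand pair."""
    if len(true_bits) != len(pred_scores):
        raise ValueError("labels and predictions must have the same length")
    # Binarise predicted per-residue scores at the chosen threshold.
    pred_bits = [int(score >= threshold) for score in pred_scores]
    tp = sum(1 for t, p in zip(true_bits, pred_bits) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_bits, pred_bits) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_bits, pred_bits) if t == 1 and p == 0)
    # Convention (assumption): undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: one true positive, one false positive, one false negative.
scores = residue_prf([0, 0, 1, 1], [0.1, 0.9, 0.8, 0.2])
# → {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```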
---
## Research Applications
InteractBind supports a broad range of research directions:
- Protein–ligand binding prediction
- Binding site/pocket localisation
- Interaction-aware representation learning
- Mechanistic hypothesis generation
- Drug discovery and virtual screening
- Explainable AI for molecular modelling
---
## Citation
If you use InteractBind in your research, please cite:
```bibtex
@misc{meng2026largescaleinteractbind,
title = {A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?},
author = {Meng, Zhaohan and Bai, Zhen and Yuan, Ke and Ounis, Iadh and Meng, Zaiqiao and Xu, Hao and Loscalzo, Joseph},
year = {2026},
eprint = {2605.24045},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2605.24045}
}
```
