| license: cc-by-4.0 | |
| configs: | |
| - config_name: affinity | |
| default: true | |
| data_files: | |
| - split: train | |
| path: data/affinity/data.csv | |
| - config_name: p_ood_25 | |
| data_dir: data/p_ood_25 | |
| - config_name: p_ood_28 | |
| data_dir: data/p_ood_28 | |
| - config_name: p_ood_31 | |
| data_dir: data/p_ood_31 | |
| - config_name: p_ood_33 | |
| data_dir: data/p_ood_33 | |
| # InteractBind | |
| > A physically grounded, large-scale protein–ligand interaction dataset | |
| > for interpretable and interaction-aware binding prediction | |
| --- | |
| ## Links | |
| - Paper: [A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?](https://arxiv.org/abs/2605.24045) | |
| - GitHub: [ZhaohanM/InteractBind](https://github.com/ZhaohanM/InteractBind) | |
| --- | |
| ## Motivation | |
| Most existing protein–ligand binding datasets provide only coarse-grained supervision, such as binary labels or scalar affinity values. While effective for prediction, these signals compress complex molecular interaction processes into a single outcome, limiting interpretability and mechanistic understanding. | |
| **InteractBind** addresses this limitation by explicitly modelling *non-covalent interaction patterns* derived from experimentally resolved protein–ligand complexes. | |
| It enables **token-level supervision**, bridging sequence-based representations with physically meaningful interaction structures. | |
| --- | |
| ## Dataset Overview | |
| InteractBind is constructed from high-quality experimentally resolved complexes and includes: | |
| - Protein sequences (FASTA and structure-aware sequences) | |
| - Ligand molecular representations (SMILES and SELFIES) | |
| - Binding labels and affinity annotations | |
| - Token-level non-covalent interaction maps | |
| The dataset is designed to support both **prediction accuracy** and **mechanistic interpretability**. | |
| --- | |
| ## Dataset | |
| This repository provides benchmark CSVs with ground-truth residue-level interaction maps for evaluating protein–ligand interaction (PLI) prediction. | |
| | Dataset | Type | Example Use | | |
| |----------|------|--------------| | |
| | InteractBind (affinity) | Binding affinity splits | Evaluate in-domain performance | | |
| | InteractBind-P-25%/28%/31%/33% OOD | Protein OOD splits | Evaluate novel protein generalisation | | |
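One way to score binding-site predictions on these splits is to compare a predicted per-residue mask with the ground-truth mask (the binary fingerprint format described under Files). The sketch below is a generic residue-level precision/recall/F1 computation. It is not necessarily the metric used in the paper, and the function name `residue_prf` is hypothetical.

```python
def residue_prf(true_mask: list[int], pred_mask: list[int]) -> tuple[float, float, float]:
    """Precision, recall and F1 over residues flagged as binding-site residues.

    Both masks are binary lists with one entry per residue. This is a generic
    metric sketch, not the benchmark's official evaluation protocol.
    """
    if len(true_mask) != len(pred_mask):
        raise ValueError("masks must cover the same residues")
    pairs = list(zip(true_mask, pred_mask))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Convention chosen here: undefined ratios (empty denominators) become 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy six-residue example, not taken from the dataset.
true_mask = [0, 0, 1, 1, 0, 1]
pred_mask = [0, 1, 1, 0, 0, 1]
precision, recall, f1 = residue_prf(true_mask, pred_mask)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In the toy example there are two true positives, one false positive, and one false negative, so all three scores equal 2/3. Negative pairs with all-zero masks have no positive residues, so recall is reported as 0.0 under the convention above; you may prefer to exclude such pairs from residue-level scoring.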
| ## Files | |
| The Hugging Face Dataset Viewer is configured to read the CSV subsets under `data/`: | |
| - `affinity`: the full InteractBind affinity table. | |
| - `p_ood_25`, `p_ood_28`, `p_ood_31`, `p_ood_33`: protein OOD benchmark subsets with `train`, `validation`, and `test` splits. | |
| Each CSV includes seven residue-level binding-site fingerprint columns derived from the interaction maps: | |
| - `Hydrogen bonding_binding_site` | |
| - `Salt Bridges_binding_site` | |
| - `π–π Stacking_binding_site` | |
| - `Cation–π_binding_site` | |
| - `Hydrophobic_binding_site` | |
| - `Van der Waals_binding_site` | |
| - `Overall_binding_site` | |
| Each value is a binary list aligned to the protein FASTA sequence, with one entry per residue. For example, `[0,0,1,0]` marks the third residue as a binding-site residue. Negative protein–ligand pairs without contact-map entries are encoded as all-zero fingerprints. | |
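A minimal sketch of decoding one fingerprint cell into residue positions. It assumes the CSV cell is stored as a JSON-style list string such as `[0,0,1,0]`; the helper name `binding_site_positions` is hypothetical and not part of the dataset tooling.

```python
import json


def binding_site_positions(fingerprint: str, sequence: str) -> list[int]:
    """Return 1-based positions of residues flagged as binding-site residues.

    Assumes the cell holds a JSON-style list of 0/1 values with one entry
    per residue of the FASTA sequence (an assumption about the storage format).
    """
    bits = json.loads(fingerprint)
    if len(bits) != len(sequence):
        raise ValueError(
            f"fingerprint has {len(bits)} entries, sequence has {len(sequence)} residues"
        )
    if any(bit not in (0, 1) for bit in bits):
        raise ValueError("fingerprint must contain only 0 and 1")
    return [i + 1 for i, bit in enumerate(bits) if bit == 1]


# Toy four-residue sequence, not taken from the dataset.
print(binding_site_positions("[0,0,1,0]", "MKTA"))  # [3]
print(binding_site_positions("[0,0,0,0]", "MKTA"))  # [] (all-zero, e.g. a negative pair)
```

The length check catches rows where a fingerprint and its sequence have drifted out of alignment, which would otherwise silently shift every residue label.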
| ## Supported Interaction Types | |
| Structured annotations are provided for major non-covalent interaction categories: | |
| - Hydrogen bonds | |
| - Hydrophobic interactions | |
| - Salt bridges | |
| - π–π stacking | |
| - Cation–π interactions | |
| - Van der Waals contacts | |
| Each interaction channel can be used independently or combined for multi-channel supervision. | |
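As an illustration of multi-channel use, the sketch below stacks the six per-interaction fingerprint columns of one row into a channels × residues matrix and also derives a combined mask. The column names come from the Files section; the function names and the JSON-style cell format are assumptions. Whether the provided `Overall_binding_site` column equals this union is not stated here, so treat the union as illustrative only.

```python
import json

# Per-interaction column names as listed in the Files section.
CHANNELS = [
    "Hydrogen bonding_binding_site",
    "Salt Bridges_binding_site",
    "π–π Stacking_binding_site",
    "Cation–π_binding_site",
    "Hydrophobic_binding_site",
    "Van der Waals_binding_site",
]


def stack_channels(row: dict[str, str]) -> list[list[int]]:
    """Decode the six fingerprints of one CSV row into a channels x residues matrix."""
    matrix = [json.loads(row[name]) for name in CHANNELS]
    lengths = {len(channel) for channel in matrix}
    if len(lengths) != 1:
        raise ValueError(f"channels have inconsistent lengths: {sorted(lengths)}")
    return matrix


def union_mask(matrix: list[list[int]]) -> list[int]:
    """Mark a residue 1 if at least one channel flags it."""
    return [int(any(column)) for column in zip(*matrix)]


# Toy row with four residues, not taken from the dataset.
row = {name: "[0,0,0,0]" for name in CHANNELS}
row["Hydrogen bonding_binding_site"] = "[0,1,0,0]"
row["Hydrophobic_binding_site"] = "[0,1,0,1]"

matrix = stack_channels(row)
print(len(matrix), len(matrix[0]))  # 6 4
print(union_mask(matrix))           # [0, 1, 0, 1]
```

Keeping the channels as separate rows lets a model be supervised per interaction type, while the union gives a single target when only one binding-site label is needed.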
| --- | |
| ## Key Features | |
| - **Physically grounded supervision** | |
| Derived from experimentally resolved complexes rather than heuristic attention signals. | |
| - **Token-level interaction maps** | |
| Enable fine-grained modelling of residue–atom interactions. | |
| - **Model-agnostic integration** | |
| Compatible with sequence-based encoders (e.g., ESM, SELFormer) and other protein–ligand models. | |
| - **Interpretability support** | |
| Facilitates binding residue identification and interaction pattern analysis. | |
| - **Scalable design** | |
| Allows large-scale training without requiring full structural modelling during inference. | |
| --- | |
| ## Research Applications | |
| InteractBind supports a broad range of research directions: | |
| - Protein–ligand binding prediction | |
| - Binding site/pocket localisation | |
| - Interaction-aware representation learning | |
| - Mechanistic hypothesis generation | |
| - Drug discovery and virtual screening | |
| - Explainable AI for molecular modelling | |
| --- | |
| ## Citation | |
| If you use InteractBind in your research, please cite: | |
| ```bibtex | |
| @misc{meng2026largescaleinteractbind, | |
| title = {A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?}, | |
| author = {Meng, Zhaohan and Bai, Zhen and Yuan, Ke and Ounis, Iadh and Meng, Zaiqiao and Xu, Hao and Loscalzo, Joseph}, | |
| year = {2026}, | |
| eprint = {2605.24045}, | |
| archivePrefix = {arXiv}, | |
| primaryClass = {cs.LG}, | |
| url = {https://arxiv.org/abs/2605.24045} | |
| } | |
| ``` | |