Synthyra/InteractBind-bucket / README.md

	---
	license: cc-by-4.0
	configs:
	- config_name: affinity
	  default: true
	  data_files:
	  - split: train
	    path: data/affinity/data.csv
	- config_name: p_ood_25
	  data_dir: data/p_ood_25
	- config_name: p_ood_28
	  data_dir: data/p_ood_28
	- config_name: p_ood_31
	  data_dir: data/p_ood_31
	- config_name: p_ood_33
	  data_dir: data/p_ood_33
	---

	# InteractBind

	> A physically grounded, large-scale protein–ligand interaction dataset
	> for interpretable and interaction-aware binding prediction

	---

	## Links

	- Paper: [A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?](https://arxiv.org/abs/2605.24045)
	- GitHub: [ZhaohanM/InteractBind](https://github.com/ZhaohanM/InteractBind)

	---

	## Motivation

	Most existing protein–ligand binding datasets provide only coarse-grained supervision, such as binary labels or scalar affinity values. While effective for prediction, these signals compress complex molecular interaction processes into a single outcome, limiting interpretability and mechanistic understanding.

	InteractBind addresses this limitation by explicitly modelling non-covalent interaction patterns derived from experimentally resolved protein–ligand complexes.

	It enables token-level supervision, bridging sequence-based representations with physically meaningful interaction structures.

	---

	## Dataset Overview

	InteractBind is constructed from high-quality experimentally resolved complexes and includes:

	- Protein sequences (FASTA and structure-aware sequences)
	- Ligand molecular representations (SMILES and SELFIES)
	- Binding labels and affinity annotations
	- Token-level non-covalent interaction maps

	The dataset is designed to support both prediction accuracy and mechanistic interpretability.

	---

	## Dataset

	This repository provides benchmark CSVs with ground-truth residue-level interaction maps for evaluating protein–ligand interaction (PLI) prediction.

	| Dataset | Type | Example Use |
	|---------|------|-------------|
	| InteractBind (affinity) | Binding affinity splits | Evaluate in-domain performance |
	| InteractBind-P-25%/28%/31%/33% OOD | Protein OOD splits | Evaluate generalisation to novel proteins |
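
One generic way to use ground-truth residue-level maps for evaluation is to compare a predicted binding-site mask against the reference mask with per-residue precision, recall, and F1. The sketch below is illustrative only and is not necessarily the metric reported in the paper; the function name `residue_prf` and the toy masks are assumptions introduced here.

```python
def residue_prf(predicted, reference):
    """Per-residue precision, recall and F1 for two equal-length 0/1 masks.

    Illustrative helper (not part of the dataset tooling). When a denominator
    is zero, e.g. an all-zero reference mask, the metric is reported as 0.0.
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same length")
    tp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(predicted, reference) if p == 0 and r == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy masks: one true positive, one false positive, one false negative.
reference = [0, 0, 1, 1, 0]
predicted = [0, 1, 1, 0, 0]
print(residue_prf(predicted, reference))  # (0.5, 0.5, 0.5)
```

How all-zero reference masks should be scored is a design choice; returning 0.0 is only one convention, and excluding such pairs from residue-level metrics is another.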

	## Files

	The Hugging Face Dataset Viewer is configured to read the CSV subsets under `data/`:

	- `affinity`: the full InteractBind affinity table.
	- `p_ood_25`, `p_ood_28`, `p_ood_31`, `p_ood_33`: protein OOD benchmark subsets with `train`, `validation`, and `test` splits.

	Each CSV includes seven residue-level binding-site fingerprint columns derived from the interaction maps:

	- `Hydrogen bonding_binding_site`
	- `Salt Bridges_binding_site`
	- `π–π Stacking_binding_site`
	- `Cation–π_binding_site`
	- `Hydrophobic_binding_site`
	- `Van der Waals_binding_site`
	- `Overall_binding_site`

	Each value is a binary list with one entry per residue of the protein FASTA sequence. For example, `[0,0,1,0]` marks the third residue as a binding-site residue. Negative protein–ligand pairs without contact-map entries are encoded as all-zero fingerprints.
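
A minimal sketch of decoding one fingerprint value follows. It assumes the CSV cell stores the list as text such as `[0,0,1,0]`; the exact on-disk serialisation is an assumption. The helper name `binding_residues` and the toy sequence are hypothetical, not part of the dataset tooling.

```python
import json


def binding_residues(fingerprint: str, sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position, residue letter) for each flagged residue.

    Assumes `fingerprint` is a JSON-style list of 0/1 values, one per residue.
    """
    mask = json.loads(fingerprint)
    if len(mask) != len(sequence):
        raise ValueError(
            f"fingerprint length {len(mask)} != sequence length {len(sequence)}"
        )
    return [(i + 1, aa) for i, (flag, aa) in enumerate(zip(mask, sequence)) if flag == 1]


# Toy four-residue sequence; only the third residue is flagged.
print(binding_residues("[0,0,1,0]", "MKTA"))  # [(3, 'T')]
```

An all-zero fingerprint, as used for negative pairs, yields an empty list.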

	## Supported Interaction Types

	Structured annotations are provided for major non-covalent interaction categories:

	- Hydrogen bonds
	- Hydrophobic interactions
	- Salt bridges
	- π–π stacking
	- Cation–π interactions
	- Van der Waals contacts

	Each interaction channel can be used independently or combined for multi-channel supervision.
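
As a sketch of multi-channel use, the six per-type columns of one CSV row can be stacked into a channels-by-residues matrix. The column names are taken from this README; the helper names `stack_channels` and `any_channel`, the text serialisation of each cell, and the toy row are assumptions made for illustration.

```python
import json

# Column names as listed in this README. "Overall_binding_site" is left out
# so that each row of the stack corresponds to a single interaction type.
CHANNEL_COLUMNS = [
    "Hydrogen bonding_binding_site",
    "Salt Bridges_binding_site",
    "π–π Stacking_binding_site",
    "Cation–π_binding_site",
    "Hydrophobic_binding_site",
    "Van der Waals_binding_site",
]


def stack_channels(row, columns=CHANNEL_COLUMNS):
    """Build a (channels x residues) 0/1 matrix from one CSV row (a dict)."""
    matrix = [json.loads(row[name]) for name in columns]
    lengths = {len(channel) for channel in matrix}
    if len(lengths) > 1:
        raise ValueError(f"channels have different lengths: {sorted(lengths)}")
    return matrix


def any_channel(matrix):
    """Per-residue union: 1 where at least one channel flags the residue."""
    return [int(any(column)) for column in zip(*matrix)]


# Toy row for a four-residue protein (values invented for illustration).
row = {name: "[0,0,0,0]" for name in CHANNEL_COLUMNS}
row["Hydrogen bonding_binding_site"] = "[0,1,0,0]"
row["Hydrophobic_binding_site"] = "[0,1,1,0]"

matrix = stack_channels(row)
print(len(matrix), len(matrix[0]))  # 6 4
print(any_channel(matrix))          # [0, 1, 1, 0]
```

Whether this per-residue union coincides with `Overall_binding_site` is not stated here, so it is worth checking against the data before relying on it.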

	---

	## Key Features

	- Physically grounded supervision: derived from experimentally resolved complexes rather than heuristic attention signals.
	- Token-level interaction maps: enable fine-grained modelling of residue–atom interactions.
	- Model-agnostic integration: compatible with sequence-based encoders such as ESM and SELFormer, and with other protein–ligand models.
	- Interpretability support: facilitates binding-residue identification and interaction-pattern analysis.
	- Scalable design: allows large-scale training without requiring full structural modelling during inference.

	---

	## Research Applications

	InteractBind supports a broad range of research directions:

	- Protein–ligand binding prediction
	- Binding site/pocket localisation
	- Interaction-aware representation learning
	- Mechanistic hypothesis generation
	- Drug discovery and virtual screening
	- Explainable AI for molecular modelling

	---

	## Citation

	If you use InteractBind in your research, please cite:

	```bibtex
	@misc{meng2026largescaleinteractbind,
	title = {A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?},
	author = {Meng, Zhaohan and Bai, Zhen and Yuan, Ke and Ounis, Iadh and Meng, Zaiqiao and Xu, Hao and Loscalzo, Joseph},
	year = {2026},
	eprint = {2605.24045},
	archivePrefix = {arXiv},
	primaryClass = {cs.LG},
	url = {https://arxiv.org/abs/2605.24045}
	}
	```
