gabrielbianchin committed on
Commit e612d44 · verified · 1 Parent(s): 887e87e

Update README.md

Files changed (1)
  1. README.md +13 -13
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 license: cc-by-nc-nd-4.0
 datasets:
-- SaeedLab/SeqScreen
+- SaeedLab/BindScreen
 tags:
 - proteins
 - molecules
@@ -11,30 +11,30 @@ tags:
 - transformers
 ---
 
-# SeqScreen Frozen
+# BindScreen Frozen
 
-This model corresponds to the SeqScreen frozen configuration, in which both encoders are frozen and only the projection layers are trained on filtered ChEMBL.
+This model corresponds to the BindScreen frozen configuration, in which both encoders are frozen and only the projection layers are trained on filtered ChEMBL.
 
-\[[Github Repo](https://github.com/pcdslab/SeqScreen)\] | \[[Dataset on HuggingFace](https://huggingface.co/datasets/SaeedLab/SeqScreen)\] | \[[Model Collection](https://huggingface.co/collections/SaeedLab/seqscreen)\] | \[[Cite](#citation)\]
+\[[Github Repo](https://github.com/pcdslab/BindScreen)\] | \[[Dataset on HuggingFace](https://huggingface.co/datasets/SaeedLab/BindScreen)\] | \[[Model Collection](https://huggingface.co/collections/SaeedLab/bindscreen)\] | \[[Cite](#citation)\]
 
 ## Abstract
 
-Virtual screening aims to identify candidate molecules that bind to a target protein, playing a central role in computational drug discovery. Sequence-based deep learning methods offer an applicable alternative to structure-based approaches, but typically process one protein-molecule pair at a time, limiting their scalability to large molecular libraries. Contrastive learning methods inspired by CLIP have shown promise for learning joint protein-molecule representations, but standard CLIP training was designed for symmetric tasks and does not account for the asymmetric and one-to-many nature of protein-molecule binding. In this paper, we introduce *SeqScreen*, a sequence-based virtual screening method built on a dual-encoder contrastive architecture. SeqScreen introduces a protein-centric batch construction strategy and an asymmetric multi-positive InfoNCE loss to cope with the protein-centric nature of virtual screening. We conduct a systematic evaluation across 8 protein language models and 3 molecular language model variants. The protein-centric batch construction consistently outperforms standard CLIP training across all evaluated encoders, while requiring approximately 32 times fewer training epochs and 7 times fewer forward passes during inference compared to pair-based methods. On the LIT-PCBA dataset, SeqScreen outperforms all sequence-based baselines, achieving a relative improvement of up to 39% in EF at 0.5 over the best competing method, while remaining competitive with traditional docking approaches without requiring 3D structural information.
+Virtual screening aims to identify candidate molecules that bind to a target protein, playing a central role in computational drug discovery. Sequence-based deep learning methods offer a more broadly applicable alternative to structure-based approaches, since they do not require 3D structural information. However, they typically require a separate forward pass per protein-molecule pair, limiting their scalability to large molecular libraries. Contrastive learning methods inspired by CLIP address this by encoding proteins and molecules independently, allowing similarity analysis via simple comparisons rather than a forward pass per pair. However, standard CLIP training was designed for symmetric tasks and does not account for the asymmetric and one-to-many nature of protein-molecule binding. In this paper, we introduce *BindScreen*, a sequence-based virtual screening method built on a dual-encoder contrastive architecture. BindScreen introduces a protein-centric batch construction strategy and an asymmetric multi-positive InfoNCE loss to cope with the protein-centric nature of virtual screening. We conducted a systematic evaluation of 8 protein language models and 3 molecular language model variants against BindScreen. The proposed protein-centric batch construction consistently outperforms standard CLIP training across all evaluated encoders while substantially improving computational efficiency, reducing training cost by up to 32 times. In addition, our experiments demonstrate that BindScreen requires 7 times fewer inference computations than pairwise virtual screening approaches. On the LIT-PCBA dataset, BindScreen outperforms all sequence-based baselines, achieving a relative improvement of up to 39% in EF at 0.5 over the best competing method, while remaining competitive with traditional docking approaches without requiring 3D structural information.
 
 ## Model Details
 
-SeqScreen uses a dual-encoder architecture that independently encodes proteins and molecules using specialized language models. The protein branch processes amino acid sequences with ESM2 T36, and the molecule branch processes SMILES strings with MolDeBERTa MLC. Both representations are passed through projection heads to produce normalized embeddings.
+BindScreen uses a dual-encoder architecture that independently encodes proteins and molecules using specialized language models. The protein branch processes amino acid sequences with ESM2 T36, and the molecule branch processes SMILES strings with MolDeBERTa MLC. Both representations are passed through projection heads to produce normalized embeddings.
 
 ![Model](pipeline.png)
 
 Two configurations are available in this collection:
 
-- [SeqScreen-Frozen](https://huggingface.co/SaeedLab/SeqScreen-Frozen): only the projection layers are trained, both encoders are frozen.
-- [SeqScreen-Finetuning](https://huggingface.co/SaeedLab/SeqScreen-Finetuning): the projection layers and ESM2 T36 are trained, MolDeBERTa MLC is frozen.
+- [BindScreen-Frozen](https://huggingface.co/SaeedLab/BindScreen-Frozen): only the projection layers are trained, both encoders are frozen.
+- [BindScreen-Finetuning](https://huggingface.co/SaeedLab/BindScreen-Finetuning): the projection layers and ESM2 T36 are trained, MolDeBERTa MLC is frozen.
 
 ## Usage
 
-SeqScreen computes cosine similarities between protein and molecule embeddings, which can be used to rank candidate molecules for a given target protein.
+BindScreen computes cosine similarities between protein and molecule embeddings, which can be used to rank candidate molecules for a given target protein.
 
 ### Similarity
 
@@ -89,14 +89,14 @@ with torch.no_grad():
     mol_rep = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-8)
 
 
-# seqscreen
-seqscreen = AutoModel.from_pretrained(
-    'SaeedLab/SeqScreen-Frozen',
+# bindscreen
+bindscreen = AutoModel.from_pretrained(
+    'SaeedLab/BindScreen-Frozen',
     trust_remote_code=True
 ).eval()
 
 with torch.no_grad():
-    outputs = seqscreen(prot=prot_rep, mol=mol_rep)
+    outputs = bindscreen(prot=prot_rep, mol=mol_rep)
 
 print('Protein embeddings projected:', outputs.prot_rep)
 print('Molecule embeddings projected:', outputs.mol_rep)
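
The README text in this diff says the projected embeddings are compared by cosine similarity to rank candidate molecules for a target protein. The ranking step is not visible in the hunks above, so the following is only a minimal, dependency-free sketch of that idea. The helper names `cosine_similarity` and `rank_molecules` and the toy vectors are hypothetical and are not part of the repository. In real use the inputs would presumably be `outputs.prot_rep` and `outputs.mol_rep` converted to plain lists or handled with tensor operations.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, 1e-8)


def rank_molecules(prot_emb, mol_embs):
    """Return (molecule index, score) pairs, most similar first (hypothetical helper)."""
    scores = [cosine_similarity(prot_emb, m) for m in mol_embs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy, made-up embeddings standing in for projected model outputs.
prot_emb = [1.0, 0.0, 0.0]
mol_embs = [
    [0.0, 1.0, 0.0],  # orthogonal to the protein embedding
    [1.0, 0.0, 0.0],  # same direction as the protein embedding
    [1.0, 1.0, 0.0],  # 45 degrees from the protein embedding
]

ranking = rank_molecules(prot_emb, mol_embs)
for index, score in ranking:
    print(f"molecule {index}: cosine similarity {score:.4f}")
```

Because the README states that the projection heads produce normalized embeddings, the cosine similarity should reduce to a plain dot product on real outputs. The explicit normalization here just keeps the sketch valid for arbitrary vectors.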