gabrielbianchin committed · Commit c1ea99a · 1 Parent(s): 835e667

upload files

README.md ADDED
@@ -0,0 +1,56 @@
+ ---
+ license: cc-by-nc-nd-4.0
+ datasets:
+ - SaeedLab/SeqScreen
+ tags:
+ - proteins
+ - molecules
+ - bioinformatics
+ - drug-discovery
+ - feature-extraction
+ - transformers
+ ---
+
+ # SeqScreen Frozen
+
+ This model corresponds to the SeqScreen frozen configuration, in which only the projection layers are trained.
+
+ \[[GitHub Repo](https://github.com/pcdslab/SeqScreen)\] | \[[Dataset on HuggingFace](https://huggingface.co/datasets/SaeedLab/SeqScreen)\] | \[[Model Collection](https://huggingface.co/collections/SaeedLab/seqscreen)\] | \[[Cite](#citation)\]
+
+ ## Abstract
+
+ Virtual screening aims to identify candidate molecules that bind to a target protein, playing a central role in computational drug discovery. Sequence-based deep learning methods offer an applicable alternative to structure-based approaches, but typically process one protein-molecule pair at a time, limiting their scalability to large molecular libraries. Contrastive learning methods inspired by CLIP have shown promise for learning joint protein-molecule representations, but standard CLIP training was designed for symmetric tasks and does not account for the asymmetric and one-to-many nature of protein-molecule binding. In this paper, we introduce *SeqScreen*, a sequence-based virtual screening method built on a dual-encoder contrastive architecture. SeqScreen introduces a protein-centric batch construction strategy and an asymmetric multi-positive InfoNCE loss to cope with the protein-centric nature of virtual screening. We conduct a systematic evaluation across 8 protein language models and 3 molecular language model variants. The protein-centric batch construction consistently outperforms standard CLIP training across all evaluated encoders, while requiring approximately 32 times fewer training epochs and 7 times fewer forward passes during inference compared to pair-based methods. On the LIT-PCBA dataset, SeqScreen outperforms all sequence-based baselines, achieving a relative improvement of up to 39% in EF at 0.5 over the best competing method, while remaining competitive with traditional docking approaches without requiring 3D structural information.
+
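The abstract names an asymmetric multi-positive InfoNCE loss, but this README does not define it. For orientation only, here is one common multi-positive InfoNCE form, written in the protein-to-molecule direction. All symbols are assumptions introduced for illustration: $s_{ij}$ is the cosine similarity between protein $i$ and molecule $j$, $\mathcal{P}(i)$ is the set of in-batch molecules known to bind protein $i$, $\tau$ is a temperature, $B$ is the number of proteins and $M$ the number of molecules in the batch. The loss used in the paper may differ.

```latex
\mathcal{L}_{\text{prot}\to\text{mol}}
  = -\frac{1}{B} \sum_{i=1}^{B}
    \log \frac{\sum_{j \in \mathcal{P}(i)} \exp\left(s_{ij}/\tau\right)}
              {\sum_{j=1}^{M} \exp\left(s_{ij}/\tau\right)}
```

Under this reading, "asymmetric" would mean that only the protein-anchored term is used, with no matching molecule-to-protein term as in standard CLIP training.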
+ ## Model Details
+
+ SeqScreen uses a dual-encoder architecture that independently encodes proteins and molecules using specialized language models. The protein branch processes amino acid sequences with ESM2 T36, and the molecule branch processes SMILES strings with MolDeBERTa MLC. Both representations are passed through projection heads to produce normalized embeddings.
+
+ ![Model](pipeline.png)
+
+ Two configurations are available in this collection:
+
+ - [SeqScreen-Frozen](https://huggingface.co/SaeedLab/SeqScreen-Frozen): only the projection layers are trained; both encoders are frozen.
+ - [SeqScreen-Finetuning](https://huggingface.co/SaeedLab/SeqScreen-Finetuning): the projection layers and ESM2 T36 are trained; MolDeBERTa MLC is frozen.
+
+ ## Usage
+
+ SeqScreen computes cosine similarities between protein and molecule embeddings, which can be used to rank candidate molecules for a given target protein.
+
+ ### Similarity
+
+ ```python
+ # code
+ ```
+
+
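The `# code` placeholder in the README's Similarity section does not show a call. The sketch below is one way the ranking step might look. It assumes the repository loads through `AutoModel.from_pretrained(..., trust_remote_code=True)`, as the `auto_map` entry in `config.json` suggests, and that the inputs are precomputed pooled encoder embeddings of width 2560 (protein) and 768 (molecule). How those embeddings are pooled from ESM2 T36 and MolDeBERTa MLC is not specified in this commit. `load_seqscreen` and `rank_molecules` are hypothetical helper names.

```python
import torch


def load_seqscreen(repo_id: str = "SaeedLab/SeqScreen-Frozen"):
    """Load the projection heads from the Hub (assumed entry point; needs network access).

    The README states that downloading requires registration, so an
    authenticated Hugging Face session may be needed.
    """
    from transformers import AutoModel  # lazy import: the ranking helper works without it

    # trust_remote_code=True lets transformers import modeling_seqscreen.py from the repo.
    return AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()


def rank_molecules(similarity: torch.Tensor, top_k: int = 3):
    """Return (scores, indices) of the top_k molecules for each protein row."""
    k = min(top_k, similarity.shape[-1])
    return torch.topk(similarity, k=k, dim=-1)


# Toy similarity matrix: 1 protein x 4 molecules (values are made up).
toy_similarity = torch.tensor([[0.10, 0.80, -0.30, 0.55]])
scores, indices = rank_molecules(toy_similarity, top_k=2)
print(indices.tolist())  # [[1, 3]]
```

With the real model, the similarity matrix would come from `model(prot=prot_emb, mol=mol_emb).similarity`, where `prot_emb` has shape `(P, 2560)` and `mol_emb` has shape `(M, 768)`.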
+ ## Citation
+
+ The paper is under review. As soon as it is accepted, we will update this section.
+
+ ## License
+
+ This model and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of this model and its derivatives, which include models trained on outputs from the model or datasets created from the model, is prohibited and requires prior approval. Downloading the model requires prior registration on Hugging Face and agreeing to the terms of use. By downloading this model, you agree not to distribute, publish or reproduce a copy of the model. If another user within your organization wishes to use the model, they must register as an individual user and agree to comply with the terms of use. Users may not attempt to re-identify the deidentified data used to develop the underlying model. If you are a commercial entity, please contact the corresponding author.
+
+ ## Contact
+
+ For any additional questions or comments, contact Fahad Saeed (fsaeed@fiu.edu).
config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "auto_map": {
+     "AutoConfig": "configuration_seqscreen.SeqScreenConfig",
+     "AutoModel": "modeling_seqscreen.SeqScreenModel"
+   },
+   "architectures": [
+     "SeqScreenModel"
+   ],
+   "dropout": 0.1,
+   "dtype": "float32",
+   "model_type": "seqscreen",
+   "mol_dim": 768,
+   "proj_dim": 512,
+   "prot_dim": 2560,
+   "transformers_version": "4.57.3"
+ }
configuration_seqscreen.py ADDED
@@ -0,0 +1,16 @@
+ from transformers import PretrainedConfig
+
+ class SeqScreenConfig(PretrainedConfig):
+     model_type = "seqscreen"
+     def __init__(
+         self,
+         prot_dim: int = 2560,
+         mol_dim: int = 768,
+         proj_dim: int = 512,
+         dropout: float = 0.1,
+         **kwargs):
+         super().__init__(**kwargs)
+         self.prot_dim = prot_dim
+         self.mol_dim = mol_dim
+         self.proj_dim = proj_dim
+         self.dropout = dropout
convert_weights.py ADDED
@@ -0,0 +1,44 @@
+ import os
+ import torch
+ from configuration_seqscreen import SeqScreenConfig
+ from modeling_seqscreen import SeqScreenModel
+
+
+ def convert_model(checkpoint_path, save_directory):
+     config = SeqScreenConfig()
+     hf_model = SeqScreenModel(config)
+     hf_model.eval()
+
+     old_state_dict = torch.load(checkpoint_path, map_location="cpu")
+
+     expected_prefixes = ("proj_prot.", "proj_mol.")
+
+     new_state_dict = {}
+     for key, value in old_state_dict.items():
+         if key.startswith(expected_prefixes):
+             new_state_dict[key] = value
+         else:
+             print(f"[Skip] {key}")
+
+     missing = set(hf_model.state_dict().keys()) - set(new_state_dict.keys())
+     unexpected = set(new_state_dict.keys()) - set(hf_model.state_dict().keys())
+
+     if missing:
+         raise RuntimeError(f"Missing keys in checkpoint: {missing}")
+     if unexpected:
+         raise RuntimeError(f"Unexpected keys after filtering: {unexpected}")
+
+     hf_model.load_state_dict(new_state_dict, strict=True)
+     print("State dict loaded successfully.")
+
+     os.makedirs(save_directory, exist_ok=True)
+     hf_model.save_pretrained(save_directory)
+     config.save_pretrained(save_directory)
+     print(f"Model saved to: {save_directory}")
+
+
+ if __name__ == "__main__":
+     convert_model(
+         checkpoint_path="model.pt",
+         save_directory="./seqscreen_hf",
+     )
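The key-filtering step in `convert_model` can be tried without a checkpoint. The sketch below reproduces the prefix filter and the missing/unexpected set differences on a plain dictionary. `filter_state_dict` is a hypothetical helper, the `prot_encoder.embeddings.weight` key is invented to stand in for a frozen encoder weight, and the string values are placeholders for tensors.

```python
def filter_state_dict(old_state_dict, expected_prefixes, model_keys):
    """Keep only keys under the expected prefixes and report mismatches."""
    # str.startswith accepts a tuple of prefixes, as in convert_weights.py.
    kept = {k: v for k, v in old_state_dict.items() if k.startswith(expected_prefixes)}
    skipped = sorted(k for k in old_state_dict if k not in kept)
    missing = set(model_keys) - set(kept)      # keys the model needs but the checkpoint lacks
    unexpected = set(kept) - set(model_keys)   # kept keys the model does not define
    return kept, skipped, missing, unexpected


# Hypothetical checkpoint keys; values would be tensors in a real checkpoint.
checkpoint = {
    "proj_prot.projection.0.weight": "W_prot",
    "proj_mol.projection.0.weight": "W_mol",
    "prot_encoder.embeddings.weight": "frozen encoder weight",
}
model_keys = ["proj_prot.projection.0.weight", "proj_mol.projection.0.weight"]

kept, skipped, missing, unexpected = filter_state_dict(
    checkpoint, ("proj_prot.", "proj_mol."), model_keys
)
print(skipped)  # ['prot_encoder.embeddings.weight']
```

Because both set differences are empty here, the equivalent of `load_state_dict(..., strict=True)` would be expected to succeed. A checkpoint lacking either projection head would populate `missing` instead.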
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c1c541db2f8ae2d5e06ed1ab8d032278d818a8f4800880c482bc8b76600494a1
+ size 8930448
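As a sanity check, the LFS size above can be compared with the parameter count implied by `config.json` and the two projection heads in `modeling_seqscreen.py`. The arithmetic below assumes float32 storage (4 bytes per parameter, matching `"dtype": "float32"`) and that only the projection heads are stored. The helper names are hypothetical.

```python
def linear_params(in_dim: int, out_dim: int) -> int:
    """Weights plus bias of nn.Linear(in_dim, out_dim)."""
    return in_dim * out_dim + out_dim


def layernorm_params(dim: int) -> int:
    """Scale plus shift of nn.LayerNorm(dim)."""
    return 2 * dim


def projection_params(in_dim: int, out_dim: int) -> int:
    # Linear -> LayerNorm -> GELU -> Dropout -> Linear; GELU and Dropout hold no parameters.
    return (
        linear_params(in_dim, out_dim)
        + layernorm_params(out_dim)
        + linear_params(out_dim, out_dim)
    )


PROT_DIM, MOL_DIM, PROJ_DIM = 2560, 768, 512  # values from config.json
total_params = projection_params(PROT_DIM, PROJ_DIM) + projection_params(MOL_DIM, PROJ_DIM)
tensor_bytes = total_params * 4  # float32
lfs_size = 8930448               # size reported by the LFS pointer

print(total_params, tensor_bytes, lfs_size - tensor_bytes)  # 2232320 8929280 1168
```

The 1,168-byte remainder would be consistent with the safetensors header, although that is an assumption and has not been checked against the file itself.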
modeling_seqscreen.py ADDED
@@ -0,0 +1,54 @@
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from dataclasses import dataclass
+ from typing import Optional
+ from transformers.utils import ModelOutput
+ from transformers import PreTrainedModel
+
+ from .configuration_seqscreen import SeqScreenConfig
+
+ @dataclass
+ class SeqScreenModelOutput(ModelOutput):
+     prot_rep: Optional[torch.FloatTensor] = None
+     mol_rep: Optional[torch.FloatTensor] = None
+     similarity: Optional[torch.FloatTensor] = None
+
+ class ProjectionLayer(nn.Module):
+     def __init__(self, in_dim, out_dim, dropout):
+         super().__init__()
+         self.projection = nn.Sequential(
+             nn.Linear(in_dim, out_dim),
+             nn.LayerNorm(out_dim),
+             nn.GELU(),
+             nn.Dropout(dropout),
+             nn.Linear(out_dim, out_dim)
+         )
+
+     def forward(self, x):
+         x = self.projection(x)
+         return F.normalize(x, dim=-1)
+
+
+ class SeqScreenModel(PreTrainedModel):
+     config_class = SeqScreenConfig
+     base_model_prefix = "seqscreen"
+
+     def __init__(self, config: SeqScreenConfig):
+         super().__init__(config)
+
+         self.proj_prot = ProjectionLayer(config.prot_dim, config.proj_dim, dropout=config.dropout)
+         self.proj_mol = ProjectionLayer(config.mol_dim, config.proj_dim, dropout=config.dropout)
+
+         self.post_init()
+
+     def forward(self, prot: torch.Tensor, mol: torch.Tensor):
+         prot_rep = self.proj_prot(prot)
+         mol_rep = self.proj_mol(mol)
+         similarity = prot_rep @ mol_rep.T
+
+         return SeqScreenModelOutput(
+             prot_rep=prot_rep,
+             mol_rep=mol_rep,
+             similarity=similarity
+         )
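To make the tensor shapes in `forward` concrete, the sketch below rebuilds the projection head as a standalone module and runs it on random inputs. `ProjectionHead` is a stand-in that mirrors `ProjectionLayer` with untrained random weights, so the similarity values are meaningless. Only the shapes, the unit-norm outputs, and the cosine range are illustrated. The batch sizes (2 proteins, 5 molecules) are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Stand-in mirroring ProjectionLayer from the diff (untrained, random weights)."""

    def __init__(self, in_dim: int, out_dim: int, dropout: float = 0.1):
        super().__init__()
        self.projection = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.LayerNorm(out_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so that a dot product equals cosine similarity.
        return F.normalize(self.projection(x), dim=-1)


torch.manual_seed(0)
proj_prot = ProjectionHead(2560, 512).eval()  # eval() disables dropout
proj_mol = ProjectionHead(768, 512).eval()

prot = torch.randn(2, 2560)  # 2 protein embeddings (random placeholders)
mol = torch.randn(5, 768)    # 5 molecule embeddings (random placeholders)

with torch.no_grad():
    prot_rep = proj_prot(prot)
    mol_rep = proj_mol(mol)
    similarity = prot_rep @ mol_rep.T  # one row per protein, one column per molecule

print(similarity.shape)  # torch.Size([2, 5])
```

Because both projections are normalized, each entry of `similarity` lies in [-1, 1], and row `i` can be sorted directly to rank the molecules for protein `i`.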