Commit 41076ac (1 parent: 2a27b70) by gabrielbianchin: update readme
# TITAN-BBB
The paper is under review.

\[[Github Repo](https://github.com/pcdslab/TITAN-BBB)\] | \[[Dataset on HuggingFace](https://huggingface.co/datasets/SaeedLab/BBB)\] | \[[Cite](#citation)\]

## Abstract
The blood-brain barrier (BBB) restricts most compounds from entering the brain, making BBB permeability prediction crucial for drug discovery. Experimental assays are costly and limited, motivating computational approaches. While machine learning has shown promise, combining chemical descriptors with deep learning embeddings remains underexplored. Here, we introduce TITAN-BBB, a multi-modal architecture that combines tabular, image, and text-based features via an attention mechanism. For evaluation, we aggregated multiple literature sources to create the largest BBB permeability dataset to date, enabling robust training for both classification and regression tasks. TITAN-BBB achieves a balanced accuracy of 86.5% on classification and a mean absolute error of 0.436 on regression, outperforming state-of-the-art models on both tasks and demonstrating the benefits of combining deep and domain-specific representations.

## Model Details

TITAN-BBB is a multi-modal method designed for molecular property (BBB) prediction. The architecture combines three sources of information: embeddings from a pre-trained language model ([ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)), image representations extracted from a convolutional neural network ([ResNet50](https://docs.pytorch.org/vision/main/models/generated/torchvision.models.resnet50.html)), and classical molecular descriptors ([RDKit](https://www.rdkit.org/)).
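As an illustration of the descriptor modality, classical features can be computed from a SMILES string with RDKit. This is a minimal sketch with a hand-picked descriptor set; the actual features used by TITAN-BBB are defined in the source repository.

```py
from rdkit import Chem
from rdkit.Chem import Descriptors

# Example molecule from the usage section below.
smiles = "NCCc1nc(-c2ccccc2)cs1"
mol = Chem.MolFromSmiles(smiles)

# A few common descriptors; the model-specific feature set may differ.
features = {
    "MolWt": Descriptors.MolWt(mol),
    "MolLogP": Descriptors.MolLogP(mol),
    "TPSA": Descriptors.TPSA(mol),
    "NumHDonors": Descriptors.NumHDonors(mol),
}
print(features)
```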

TITAN-BBB consists of three stages: multi-modal feature projection, attention-based fusion, and prediction.

![Model](pipeline.jpg)
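The projection and fusion stages can be sketched as follows. This is a hypothetical minimal implementation with made-up dimensions and names, not the repository's code; the `text`, `image`, and `tab` inputs stand in for the ChemBERTa, ResNet50, and RDKit features.

```py
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch: project each modality to a shared space, then combine the
    three modality vectors with learned attention weights."""
    def __init__(self, text_dim=768, image_dim=2048, tab_dim=200, hidden=256):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, hidden)
        self.proj_image = nn.Linear(image_dim, hidden)
        self.proj_tab = nn.Linear(tab_dim, hidden)
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, 1)

    def forward(self, text, image, tab):
        # Stack projected modalities: (batch, 3, hidden)
        feats = torch.stack(
            [self.proj_text(text), self.proj_image(image), self.proj_tab(tab)],
            dim=1,
        )
        # One attention weight per modality, normalized across the three.
        weights = torch.softmax(self.attn(feats), dim=1)  # (batch, 3, 1)
        fused = (weights * feats).sum(dim=1)              # (batch, hidden)
        return self.head(fused), weights.squeeze(-1)

model = AttentionFusion()
logits, attn = model(torch.randn(2, 768), torch.randn(2, 2048), torch.randn(2, 200))
# logits has shape (2, 1); attn has shape (2, 3) and sums to 1 per input.
```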

## Model Usage

**Note**: The model is only available through `AutoModelForSequenceClassification`.

**Note:** This model uses a custom architecture (Transformer + CNN + RDKit) defined in the source repository. Therefore, you must set `trust_remote_code=True` when loading both the model and the tokenizer.

### Classification

Use the code below to obtain a score (between 0 and 1) indicating whether a molecule can cross the BBB.

```py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('SaeedLab/TITAN-BBB', subfolder='classification', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('SaeedLab/TITAN-BBB', subfolder='classification', trust_remote_code=True)

model.eval()

smiles = ["NCCc1nc(-c2ccccc2)cs1", "CC(=O)OCC(C)C"]
inputs = tokenizer(smiles)

with torch.no_grad():
    outputs = model(**inputs)

print(torch.sigmoid(outputs.logits))
```

### Regression

Use the code below to predict a molecule's blood-brain barrier permeability value.

```py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('SaeedLab/TITAN-BBB', subfolder='regression', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('SaeedLab/TITAN-BBB', subfolder='regression', trust_remote_code=True)

model.eval()

smiles = ["NCCc1nc(-c2ccccc2)cs1", "CC(=O)OCC(C)C"]
inputs = tokenizer(smiles)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits)
```

### Model Output

Both the classification and regression models return, for each input:
* `logits`: the raw output scores. For classification, apply a sigmoid to map them to scores between 0 and 1; for regression, use them directly as the prediction.
* `hidden_states`: the attention-based aggregation of the tabular, image, and text representations.
* `attentions`: the attention weights over the tabular, image, and text features for each input.
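Post-processing the returned `logits` might look like this; the tensor below is illustrative, and in practice the values come from `outputs.logits`.

```py
import torch

# Illustrative raw scores; in practice these come from outputs.logits.
logits = torch.tensor([[1.2], [-0.7]])

# Classification: sigmoid maps raw scores into (0, 1) BBB-crossing scores.
probs = torch.sigmoid(logits)

# Regression: the raw logits are themselves the permeability predictions.
preds = logits

print(probs)  # one score per input molecule, strictly between 0 and 1
```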

### Requirements

```
huggingface_hub
rdkit
torch
torchvision
transformers
```

## Citation