Commit 41076ac (1 parent: 2a27b70) by gabrielbianchin: update readme
# TITAN-BBB
The paper is under review.

\[[Github Repo](https://github.com/pcdslab/TITAN-BBB)\] | \[[Dataset on HuggingFace](https://huggingface.co/datasets/SaeedLab/BBB)\] | \[[Cite](#citation)\]

## Abstract
The blood-brain barrier (BBB) restricts most compounds from entering the brain, making BBB permeability prediction crucial for drug discovery. Experimental assays are costly and limited, motivating computational approaches. While machine learning has shown promise, combining chemical descriptors with deep learning embeddings remains underexplored. Here, we introduce TITAN-BBB, a multi-modal architecture that combines tabular, image, and text-based features via an attention mechanism. For evaluation, we aggregated multiple literature sources to create the largest BBB permeability dataset to date, enabling robust training for both classification and regression tasks. TITAN-BBB achieves a balanced accuracy of 86.5% on classification and a mean absolute error of 0.436 on regression, outperforming state-of-the-art models on both tasks and demonstrating the benefits of combining deep and domain-specific representations.

## Model Details

TITAN-BBB is a multi-modal method designed for molecular property (BBB) prediction. The architecture combines three sources of information: embeddings from a pre-trained language model ([ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)), image representations extracted from a convolutional neural network ([ResNet50](https://docs.pytorch.org/vision/main/models/generated/torchvision.models.resnet50.html)), and classical molecular descriptors ([RDKit](https://www.rdkit.org/)).
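As an illustration of the descriptor modality, classical features can be computed from a SMILES string with RDKit. This is a minimal sketch with a hand-picked descriptor set; the actual features used by TITAN-BBB are defined in the source repository.

```py
from rdkit import Chem
from rdkit.Chem import Descriptors

# Example molecule from the usage section below.
smiles = "NCCc1nc(-c2ccccc2)cs1"
mol = Chem.MolFromSmiles(smiles)

# A few common descriptors; the model-specific feature set may differ.
features = {
    "MolWt": Descriptors.MolWt(mol),
    "MolLogP": Descriptors.MolLogP(mol),
    "TPSA": Descriptors.TPSA(mol),
    "NumHDonors": Descriptors.NumHDonors(mol),
}
print(features)
```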

TITAN-BBB consists of three stages: multi-modal feature projection, attention-based fusion, and prediction.

![Model](pipeline.jpg)
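The projection and fusion stages can be sketched as follows. This is a hypothetical minimal implementation with made-up dimensions and names, not the repository's code; the `text`, `image`, and `tab` inputs stand in for the ChemBERTa, ResNet50, and RDKit features.

```py
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch: project each modality to a shared space, then combine the
    three modality vectors with learned attention weights."""
    def __init__(self, text_dim=768, image_dim=2048, tab_dim=200, hidden=256):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, hidden)
        self.proj_image = nn.Linear(image_dim, hidden)
        self.proj_tab = nn.Linear(tab_dim, hidden)
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, 1)

    def forward(self, text, image, tab):
        # Stack projected modalities: (batch, 3, hidden)
        feats = torch.stack(
            [self.proj_text(text), self.proj_image(image), self.proj_tab(tab)],
            dim=1,
        )
        # One attention weight per modality, normalized across the three.
        weights = torch.softmax(self.attn(feats), dim=1)  # (batch, 3, 1)
        fused = (weights * feats).sum(dim=1)              # (batch, hidden)
        return self.head(fused), weights.squeeze(-1)

model = AttentionFusion()
logits, attn = model(torch.randn(2, 768), torch.randn(2, 2048), torch.randn(2, 200))
# logits has shape (2, 1); attn has shape (2, 3) and sums to 1 per input.
```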

## Model Usage

**Note**: The model is only available through `AutoModelForSequenceClassification`.

**Note:** This model uses a custom architecture (Transformer + CNN + RDKit) defined in the source repository. Therefore, you must set `trust_remote_code=True` when loading both the model and the tokenizer.

### Classification

Use the code below to obtain a score (between 0 and 1) indicating whether a molecule can cross the BBB.

```py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('SaeedLab/TITAN-BBB', subfolder='classification', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('SaeedLab/TITAN-BBB', subfolder='classification', trust_remote_code=True)

model.eval()

smiles = ["NCCc1nc(-c2ccccc2)cs1", "CC(=O)OCC(C)C"]
inputs = tokenizer(smiles)

with torch.no_grad():
    outputs = model(**inputs)

print(torch.sigmoid(outputs.logits))
```

### Regression

Use the code below to predict a molecule's blood-brain barrier permeability value.

```py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('SaeedLab/TITAN-BBB', subfolder='regression', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('SaeedLab/TITAN-BBB', subfolder='regression', trust_remote_code=True)

model.eval()

smiles = ["NCCc1nc(-c2ccccc2)cs1", "CC(=O)OCC(C)C"]
inputs = tokenizer(smiles)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits)
```

### Model Output

Both the classification and regression models return, for each input:
* `logits`: the raw output scores. For classification, apply a sigmoid to map them to scores between 0 and 1; for regression, use them directly as the prediction.
* `hidden_states`: the attention-based aggregation of the tabular, image, and text representations.
* `attentions`: the attention weights over the tabular, image, and text features for each input.
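Post-processing the returned `logits` might look like this; the tensor below is illustrative, and in practice the values come from `outputs.logits`.

```py
import torch

# Illustrative raw scores; in practice these come from outputs.logits.
logits = torch.tensor([[1.2], [-0.7]])

# Classification: sigmoid maps raw scores into (0, 1) BBB-crossing scores.
probs = torch.sigmoid(logits)

# Regression: the raw logits are themselves the permeability predictions.
preds = logits

print(probs)  # one score per input molecule, strictly between 0 and 1
```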

### Requirements

```
huggingface_hub
rdkit
torch
torchvision
transformers
```

## Citation