---
language: en
library_name: transformers
pipeline_tag: text-generation
tags:
- t5
- molecule-to-protein
- smiles
- protein-generation
- binder
- ligand
license: apache-2.0
datasets:
- AI4PD/Mol2Pro-Binder-Dataset
---

# Mol2Pro-base

## Model description

- **Architecture:** [T5-efficient-base](https://huggingface.co/google/t5-efficient-base)
- **Tokenizer:** https://huggingface.co/AI4PD/Mol2Pro-tokenizer
- **Code:** https://github.com/AI4PDLab/Mol2Pro
- **Training data:** https://huggingface.co/datasets/AI4PD/Mol2Pro-Binder-Dataset
- **Paper:** https://doi.org/10.64898/2026.02.06.704305

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "AI4PD/Mol2Pro-base"
tokenizer_id = "AI4PD/Mol2Pro-tokenizer"

# Load the SMILES (input) and amino-acid (output) tokenizers
tokenizer_mol = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="smiles")
tokenizer_aa = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="aa")

# Load the seq2seq model
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
```
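A minimal generation sketch follows. Note this is an illustrative assumption, not the authors' recipe: the example SMILES (aspirin) and the decoding parameters (`max_new_tokens`, `do_sample`, `top_p`) are placeholder choices you should tune for your own use case.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load tokenizers and model as above
tokenizer_mol = AutoTokenizer.from_pretrained("AI4PD/Mol2Pro-tokenizer", subfolder="smiles")
tokenizer_aa = AutoTokenizer.from_pretrained("AI4PD/Mol2Pro-tokenizer", subfolder="aa")
model = AutoModelForSeq2SeqLM.from_pretrained("AI4PD/Mol2Pro-base")

# Example ligand: aspirin, used here purely for illustration
smiles = "CC(=O)Oc1ccccc1C(=O)O"

# Tokenize the SMILES input and sample one candidate protein sequence
inputs = tokenizer_mol(smiles, return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=512,  # assumed upper bound on generated length
        do_sample=True,      # sampling yields diverse candidates
        top_p=0.95,          # illustrative nucleus-sampling threshold
    )

# Decode the generated token IDs with the amino-acid tokenizer
sequence = tokenizer_aa.decode(out[0], skip_special_tokens=True)
print(sequence)
```

Generated sequences are candidates only; see the intended-use note below before drawing any conclusions from them.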

## Intended use

Research use only. The model generates candidate protein sequences conditioned on small-molecule (SMILES) inputs; it does not guarantee binding or function, and generated candidates must be validated experimentally.

## Citation

If you find this work useful, please cite:

```bibtex
@article{VicenteSola2026Generalise,
  title   = {Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data},
  author  = {Vicente-Sola, Alex and Dornfeld, Lars and Coines, Joan and Ferruz, Noelia},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.02.06.704305},
}
```