---
language: en
library_name: transformers
pipeline_tag: text-generation
tags:
- t5
- molecule-to-protein
- smiles
- protein-generation
- binder
- ligand
license: apache-2.0
datasets:
- AI4PD/Mol2Pro-Binder-Dataset
---

# Mol2Pro-base

## Model description

- **Architecture:** T5-efficient-base https://huggingface.co/google/t5-efficient-base
- **Tokenization:** https://huggingface.co/AI4PD/Mol2Pro-tokenizer
- **Code:** https://github.com/AI4PDLab/Mol2Pro
- **Training data:** https://huggingface.co/datasets/AI4PD/Mol2Pro-Binder-Dataset
- **Paper:** https://doi.org/10.64898/2026.02.06.704305

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "AI4PD/Mol2Pro-base"
tokenizer_id = "AI4PD/Mol2Pro-tokenizer"

# Load tokenizers: one for SMILES inputs, one for amino-acid outputs
tokenizer_mol = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="smiles")
tokenizer_aa = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="aa")

# Load model
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
```

## Intended use

Research use only. The model generates candidate protein sequences conditioned on small-molecule inputs; generated sequences are not guaranteed to bind or function and must be validated experimentally.

## Citation

If you find this work useful, please cite:

```bibtex
@article{VicenteSola2026Generalise,
  title   = {Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data},
  author  = {Vicente-Sola, Alex and Dornfeld, Lars and Coines, Joan and Ferruz, Noelia},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.02.06.704305},
}
```
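## Example: sampling a candidate binder

Once the model and tokenizers are loaded as above, a candidate sequence can be sampled from a SMILES string. This is a minimal sketch, assuming the SMILES tokenizer feeds the encoder and the amino-acid tokenizer decodes the output; the ligand and the sampling parameters below are illustrative choices, not values from the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "AI4PD/Mol2Pro-base"
tokenizer_id = "AI4PD/Mol2Pro-tokenizer"

tokenizer_mol = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="smiles")
tokenizer_aa = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="aa")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()

# Illustrative ligand: aspirin (any valid SMILES string can be substituted)
smiles = "CC(=O)Oc1ccccc1C(=O)O"
inputs = tokenizer_mol(smiles, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,  # illustrative length cap, not a value from the paper
        do_sample=True,      # sampling settings are assumptions; tune for your use case
        top_p=0.95,
    )

# Decode the generated token ids into an amino-acid sequence
sequence = tokenizer_aa.decode(outputs[0], skip_special_tokens=True)
print(sequence)
```

Greedy decoding (`do_sample=False`) yields a single deterministic candidate; sampling with `top_p` produces diverse candidates across calls, which is usually preferable when generating a pool for experimental screening.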