# SMILES-based Transformer Encoder-Decoder (SMI-TED)
[arXiv:2407.20267](https://arxiv.org/abs/2407.20267)
This repository provides a HuggingFace-compatible version of the SMI-TED model, a SMILES-based Transformer Encoder-Decoder for chemical language modeling.
---
## 📦 Forked Resources
- **Forked GitHub:** [bisect-group/materials-smi-ted-fork](https://github.com/bisect-group/materials-smi-ted-fork)
- **Forked HuggingFace:** [bisectgroup/materials-smi-ted-fork](https://huggingface.co/bisectgroup/materials-smi-ted-fork)
## 🏷️ Original Resources
- **Original GitHub:** [IBM/materials (smi_ted)](https://github.com/IBM/materials/tree/main/models/smi_ted)
- **Original HuggingFace:** [ibm/materials.smi-ted](https://huggingface.co/ibm/materials.smi-ted)
- **Publication:** [A Large Encoder-Decoder Family of Foundation Models for Chemical Language](https://arxiv.org/abs/2407.20267)
---
## 🚀 Usage
```bash
pip install smi-ted
```
```python
import torch
import smi_ted
from transformers import AutoConfig, AutoModel, AutoTokenizer
# Load config, tokenizer, and model from HuggingFace Hub
config = AutoConfig.from_pretrained("bisectgroup/materials-smi-ted-fork")
tokenizer = AutoTokenizer.from_pretrained("bisectgroup/materials-smi-ted-fork")
model = AutoModel.from_pretrained("bisectgroup/materials-smi-ted-fork")
# Link tokenizer to model (required for SMILES reconstruction)
model.smi_ted.tokenizer = tokenizer
model.smi_ted.set_padding_idx_from_tokenizer()
# Example SMILES strings
smiles = [
    'CC1C2CCC(C2)C1CN(CCO)C(=O)c1ccc(Cl)cc1',
    'COc1ccc(-c2cc(=O)c3c(O)c(OC)c(OC)cc3o2)cc1O',
    'CCOC(=O)c1ncn2c1CN(C)C(=O)c1cc(F)ccc1-2',
    'Clc1ccccc1-c1nc(-c2ccncc2)no1',
    'CC(C)(Oc1ccc(Cl)cc1)C(=O)OCc1cccc(CO)n1'
]
# Encode and decode SMILES
with torch.no_grad():
    encoder_outputs = model.encode(smiles)
    decoded_smiles = model.decode(encoder_outputs)
print(decoded_smiles)
```
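To check how faithfully the model reconstructs its inputs, the decoded strings can be compared with the originals. The sketch below is a minimal illustration, not part of the `smi_ted` package: `reconstruction_rate` is a hypothetical helper, and it assumes `model.decode` returns a list of SMILES strings in the same order as the input. The example lists are stand-ins for `smiles` and `decoded_smiles`.

```python
from typing import Sequence


def reconstruction_rate(original: Sequence[str], decoded: Sequence[str]) -> float:
    """Return the fraction of SMILES strings reproduced exactly.

    Hypothetical helper: compares strings character by character after
    stripping surrounding whitespace. It does not check chemical equivalence.
    """
    if len(original) != len(decoded):
        raise ValueError("original and decoded must have the same length")
    if not original:
        return 0.0
    matches = sum(o.strip() == d.strip() for o, d in zip(original, decoded))
    return matches / len(original)


# Stand-in data (assumption): replace with `smiles` and `decoded_smiles`
example_inputs = ["CCO", "c1ccccc1", "CC(=O)O"]
example_decoded = ["CCO", "c1ccccc1", "CC(O)=O"]

print(f"Exact-match reconstruction rate: "
      f"{reconstruction_rate(example_inputs, example_decoded):.2f}")
```

Exact string matching is a conservative measure: `CC(=O)O` and `CC(O)=O` both describe acetic acid but count as a mismatch here. Canonicalizing both lists with a cheminformatics toolkit such as RDKit before comparing would give a chemically meaningful rate.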
---
## 📝 Citation
If you use this model, please cite:
```bibtex
@article{soares2025open,
title={An open-source family of large encoder-decoder foundation models for chemistry},
author={Soares, Eduardo and Vital Brazil, Emilio and Shirasuna, Victor and Zubarev, Dmitry and Cerqueira, Renato and Schmidt, Kristin},
journal={Communications Chemistry},
volume={8},
number={1},
pages={193},
year={2025},
publisher={Nature Publishing Group UK London}
}
@article{soares2024large,
title={A large encoder-decoder family of foundation models for chemical language},
author={Soares, Eduardo and Shirasuna, Victor and Brazil, Emilio Vital and Cerqueira, Renato and Zubarev, Dmitry and Schmidt, Kristin},
journal={arXiv preprint arXiv:2407.20267},
year={2024}
}
```
---
## 📧 Contact
For questions or collaborations, contact:
- eduardo.soares@ibm.com
- evital@br.ibm.com
---
**Note:**
This fork adapts the original SMI-TED codebase for seamless integration with HuggingFace's AutoModel and AutoTokenizer interfaces. For full source code and training scripts, see the [original IBM repo](https://github.com/IBM/materials/tree/main/models/smi_ted).