starkAhmed43 committed on Jul 28, 2025

Commit 14063ff · verified · 1 Parent(s): 33802f3

Create README.md

Files changed (1)

README.md ADDED +85 -0

	@@ -0,0 +1,85 @@

+# SMILES-based Transformer Encoder-Decoder (SMI-TED)
+[![arXiv](https://img.shields.io/badge/arXiv-2407.20267-b31b1b.svg)](https://arxiv.org/abs/2407.20267)
+This repository provides a HuggingFace-compatible version of the SMI-TED model, a SMILES-based Transformer Encoder-Decoder for chemical language modeling.
+---
+## 📦 Forked Resources
+- **Forked GitHub:** [bisect-group/materials-smi-ted-fork](https://github.com/bisect-group/materials-smi-ted-fork)
+- **Forked HuggingFace:** [bisectgroup/materials-smi-ted-fork](https://huggingface.co/bisectgroup/materials-smi-ted-fork)
+## 🏷️ Original Resources
+- **Original GitHub:** [IBM/materials (smi_ted)](https://github.com/IBM/materials/tree/main/models/smi_ted)
+- **Original HuggingFace:** [ibm/materials.smi-ted](https://huggingface.co/ibm/materials.smi-ted)
+- **Publication:** [A Large Encoder-Decoder Family of Foundation Models for Chemical Language](https://arxiv.org/abs/2407.20267)
+---
+## 🚀 Usage
+```bash
+pip install smi-ted
+```
+```python
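+import torch
+# smi_ted is not referenced directly below; it is presumably imported for its
+# side effect of registering the SMI-TED classes with the Auto* interfaces.
+import smi_ted
+from transformers import AutoConfig, AutoModel, AutoTokenizer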
+# Load config, tokenizer, and model from HuggingFace Hub
+config = AutoConfig.from_pretrained("bisectgroup/materials-smi-ted-fork")
+tokenizer = AutoTokenizer.from_pretrained("bisectgroup/materials-smi-ted-fork")
+model = AutoModel.from_pretrained("bisectgroup/materials-smi-ted-fork")
+# Link tokenizer to model (required for SMILES reconstruction)
+model.smi_ted.tokenizer = tokenizer
+model.smi_ted.set_padding_idx_from_tokenizer()
+# Example SMILES strings
+smiles = [
+    'CC1C2CCC(C2)C1CN(CCO)C(=O)c1ccc(Cl)cc1',
+    'COc1ccc(-c2cc(=O)c3c(O)c(OC)c(OC)cc3o2)cc1O',
+    'CCOC(=O)c1ncn2c1CN(C)C(=O)c1cc(F)ccc1-2',
+    'Clc1ccccc1-c1nc(-c2ccncc2)no1',
+    'CC(C)(Oc1ccc(Cl)cc1)C(=O)OCc1cccc(CO)n1'
+]
+# Encode and decode SMILES
+with torch.no_grad():
+    encoder_outputs = model.encode(smiles)
+    decoded_smiles = model.decode(encoder_outputs)
+print(decoded_smiles)
+```
+---
+## 📝 Citation
+If you use this model, please cite:
+```bibtex
+@article{soares2024large,
+  title={A large encoder-decoder family of foundation models for chemical language},
+  author={Soares, Eduardo and Shirasuna, Victor and Brazil, Emilio Vital and Cerqueira, Renato and Zubarev, Dmitry and Schmidt, Kristin},
+  journal={arXiv preprint arXiv:2407.20267},
+  year={2024}
+}
+```
+---
+## 📧 Contact
+For questions or collaborations, contact:
+- eduardo.soares@ibm.com
+- evital@br.ibm.com
+---
+**Note:**
+This fork adapts the original SMI-TED codebase so that the model loads through HuggingFace's `AutoConfig`, `AutoModel`, and `AutoTokenizer` interfaces. For the full source code and training scripts, see the [original IBM repo](https://github.com/IBM/materials/tree/main/models/smi_ted).
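
The Usage snippet in the README encodes a batch of SMILES and decodes them back, then prints the result. A quick way to sanity-check such a round trip is to count how many decoded strings match their inputs exactly. The following is a minimal sketch, not part of the committed README: `reconstruction_report` is a hypothetical helper, and it assumes `model.decode` returns a list of SMILES strings in the same order as the input. The stand-in lists let it run without downloading the model.

```python
from typing import Sequence


def reconstruction_report(originals: Sequence[str], decoded: Sequence[str]) -> dict:
    """Compare input SMILES with decoded SMILES by exact string match.

    Hypothetical helper: assumes both sequences hold plain strings in the
    same order, which the README does not state explicitly.
    """
    if len(originals) != len(decoded):
        raise ValueError("originals and decoded must have the same length")

    # Collect (index, input, output) for every pair that differs.
    mismatches = [
        (i, src, out)
        for i, (src, out) in enumerate(zip(originals, decoded))
        if src.strip() != out.strip()
    ]
    total = len(originals)
    exact = total - len(mismatches)
    return {
        "total": total,
        "exact_matches": exact,
        "exact_match_rate": exact / total if total else 0.0,
        "mismatches": mismatches,
    }


# Stand-in data so the sketch runs without the model; in practice these would
# be the `smiles` and `decoded_smiles` variables from the Usage snippet.
smiles = ["Clc1ccccc1-c1nc(-c2ccncc2)no1", "CCO"]
decoded_smiles = ["Clc1ccccc1-c1nc(-c2ccncc2)no1", "CCN"]

report = reconstruction_report(smiles, decoded_smiles)
print(report["exact_match_rate"], report["mismatches"])
```

Exact string comparison is deliberately strict: two different SMILES strings can describe the same molecule, so a mismatch here does not necessarily mean a wrong structure. Canonicalizing both sides with a cheminformatics toolkit before comparing would give a fairer measure.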