	# SMILES-based Transformer Encoder-Decoder (SMI-TED)

	[![arXiv](https://img.shields.io/badge/arXiv-2407.20267-b31b1b.svg)](https://arxiv.org/abs/2407.20267)

	This repository provides a HuggingFace-compatible version of the SMI-TED model, a SMILES-based Transformer Encoder-Decoder for chemical language modeling.

	---

	## 📦 Forked Resources

	- Forked GitHub: [bisect-group/materials-smi-ted-fork](https://github.com/bisect-group/materials-smi-ted-fork)
	- Forked HuggingFace: [bisectgroup/materials-smi-ted-fork](https://huggingface.co/bisectgroup/materials-smi-ted-fork)

	## 🏷️ Original Resources

	- Original GitHub: [IBM/materials (smi_ted)](https://github.com/IBM/materials/tree/main/models/smi_ted)
	- Original HuggingFace: [ibm/materials.smi-ted](https://huggingface.co/ibm/materials.smi-ted)
	- Publication: [A Large Encoder-Decoder Family of Foundation Models for Chemical Language](https://arxiv.org/abs/2407.20267)

	---

	## 🚀 Usage

	```bash
	pip install smi-ted
	```

	```python
	import torch
	import smi_ted
	from transformers import AutoConfig, AutoModel, AutoTokenizer

	# Load config, tokenizer, and model from HuggingFace Hub
	config = AutoConfig.from_pretrained("bisectgroup/materials-smi-ted-fork")
	tokenizer = AutoTokenizer.from_pretrained("bisectgroup/materials-smi-ted-fork")
	model = AutoModel.from_pretrained("bisectgroup/materials-smi-ted-fork")

	# Link tokenizer to model (required for SMILES reconstruction)
	model.smi_ted.tokenizer = tokenizer
	model.smi_ted.set_padding_idx_from_tokenizer()

	# Example SMILES strings
	smiles = [
	    'CC1C2CCC(C2)C1CN(CCO)C(=O)c1ccc(Cl)cc1',
	    'COc1ccc(-c2cc(=O)c3c(O)c(OC)c(OC)cc3o2)cc1O',
	    'CCOC(=O)c1ncn2c1CN(C)C(=O)c1cc(F)ccc1-2',
	    'Clc1ccccc1-c1nc(-c2ccncc2)no1',
	    'CC(C)(Oc1ccc(Cl)cc1)C(=O)OCc1cccc(CO)n1',
	]

	# Encode and decode SMILES
	with torch.no_grad():
	    encoder_outputs = model.encode(smiles)
	    decoded_smiles = model.decode(encoder_outputs)

	print(decoded_smiles)
	```
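
To gauge how faithfully the encode/decode round trip above reproduces its inputs, one option is to compare each decoded string with its original. The following is a minimal sketch, not part of the SMI-TED API: `reconstruction_report` is a hypothetical helper, and it assumes `model.decode` returns one SMILES string per input, in the same order. Exact string equality is a strict criterion, since two different SMILES strings can describe the same molecule; canonicalizing both sides with a cheminformatics toolkit first would likely give a fairer comparison.

```python
from typing import List, Tuple


def reconstruction_report(
    original: List[str], decoded: List[str]
) -> Tuple[float, List[Tuple[int, str, str]]]:
    """Return the exact-match rate and the mismatched (index, original, decoded) triples."""
    if len(original) != len(decoded):
        raise ValueError("original and decoded must have the same length")
    # Collect every position where the decoded string differs from its input.
    mismatches = [
        (i, src, out)
        for i, (src, out) in enumerate(zip(original, decoded))
        if src.strip() != out.strip()
    ]
    # An empty input is reported as 0.0 rather than dividing by zero.
    rate = 1.0 - len(mismatches) / len(original) if original else 0.0
    return rate, mismatches


# Stand-in data (an assumption) so the sketch runs without downloading the model;
# in practice, pass `smiles` and `decoded_smiles` from the usage example.
example_original = ["CCO", "c1ccccc1", "CC(=O)O"]
example_decoded = ["CCO", "c1ccccc1", "CC(O)=O"]

rate, mismatches = reconstruction_report(example_original, example_decoded)
print(f"exact-match rate: {rate:.2f}")
for index, src, out in mismatches:
    print(f"  [{index}] {src} -> {out}")
```

The stand-in mismatch is deliberate: `CC(=O)O` and `CC(O)=O` both denote acetic acid, so a string-level mismatch does not necessarily mean the molecule was reconstructed incorrectly.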

	---

	## 📝 Citation

	If you use this model, please cite:

	```bibtex
	@article{soares2025open,
	  title={An open-source family of large encoder-decoder foundation models for chemistry},
	  author={Soares, Eduardo and Vital Brazil, Emilio and Shirasuna, Victor and Zubarev, Dmitry and Cerqueira, Renato and Schmidt, Kristin},
	  journal={Communications Chemistry},
	  volume={8},
	  number={1},
	  pages={193},
	  year={2025},
	  publisher={Nature Publishing Group UK London}
	}

	@article{soares2024large,
	  title={A large encoder-decoder family of foundation models for chemical language},
	  author={Soares, Eduardo and Shirasuna, Victor and Brazil, Emilio Vital and Cerqueira, Renato and Zubarev, Dmitry and Schmidt, Kristin},
	  journal={arXiv preprint arXiv:2407.20267},
	  year={2024}
	}
	```

	---

	## 📧 Contact

	For questions or collaborations, contact:
	- eduardo.soares@ibm.com
	- evital@br.ibm.com

	---

	Note:
	This fork adapts the original SMI-TED codebase so that the model and tokenizer load through HuggingFace's `AutoModel` and `AutoTokenizer` interfaces. For the full source code and training scripts, see the [original IBM repo](https://github.com/IBM/materials/tree/main/models/smi_ted).