# SMILES-based Transformer Encoder-Decoder (SMI-TED)

[![arXiv](https://img.shields.io/badge/arXiv-2407.20267-b31b1b.svg)](https://arxiv.org/abs/2407.20267)

This repository provides a HuggingFace-compatible version of the SMI-TED model, a SMILES-based Transformer Encoder-Decoder for chemical language modeling.
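
SMILES strings are plain-text encodings of molecular graphs, and chemical language models split them into atom- and bond-level tokens before embedding. As a rough illustration of that idea (not the model's actual tokenizer, which is loaded from the Hub below), here is a minimal regex-based SMILES tokenizer in the style commonly used in the chemical-LM literature:

```python
import re

# Common SMILES tokenization pattern: bracketed atoms, two-letter
# halogens (Br, Cl), aromatic atoms, ring-closure digits, bonds, etc.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[()=#+\-\\/.:@~*$])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_REGEX.findall(smiles)
    # Every character must be consumed, or the string has unsupported syntax
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(tokenize_smiles("Clc1ccccc1"))
# ['Cl', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
```

Note how `Cl` is kept as a single token rather than being split into carbon plus a stray character; the real SMI-TED vocabulary makes the same kind of distinction.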

---

## 📦 Forked Resources

- **Forked GitHub:** [bisect-group/materials-smi-ted-fork](https://github.com/bisect-group/materials-smi-ted-fork)
- **Forked HuggingFace:** [bisectgroup/materials-smi-ted-fork](https://huggingface.co/bisectgroup/materials-smi-ted-fork)

## 🏷️ Original Resources

- **Original GitHub:** [IBM/materials (smi_ted)](https://github.com/IBM/materials/tree/main/models/smi_ted)
- **Original HuggingFace:** [ibm/materials.smi-ted](https://huggingface.co/ibm/materials.smi-ted)
- **Publication:** [A Large Encoder-Decoder Family of Foundation Models for Chemical Language](https://arxiv.org/abs/2407.20267)

---

## 🚀 Usage

```bash
pip install smi-ted
```

```python
import torch
import smi_ted
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Load config, tokenizer, and model from the HuggingFace Hub
config = AutoConfig.from_pretrained("bisectgroup/materials-smi-ted-fork")
tokenizer = AutoTokenizer.from_pretrained("bisectgroup/materials-smi-ted-fork")
model = AutoModel.from_pretrained("bisectgroup/materials-smi-ted-fork")

# Link the tokenizer to the model (required for SMILES reconstruction)
model.smi_ted.tokenizer = tokenizer
model.smi_ted.set_padding_idx_from_tokenizer()

# Example SMILES strings
smiles = [
    'CC1C2CCC(C2)C1CN(CCO)C(=O)c1ccc(Cl)cc1',
    'COc1ccc(-c2cc(=O)c3c(O)c(OC)c(OC)cc3o2)cc1O',
    'CCOC(=O)c1ncn2c1CN(C)C(=O)c1cc(F)ccc1-2',
    'Clc1ccccc1-c1nc(-c2ccncc2)no1',
    'CC(C)(Oc1ccc(Cl)cc1)C(=O)OCc1cccc(CO)n1',
]

# Encode SMILES into latent embeddings, then decode them back
with torch.no_grad():
    encoder_outputs = model.encode(smiles)
    decoded_smiles = model.decode(encoder_outputs)

print(decoded_smiles)
```
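
Beyond reconstruction, the encoder embeddings can feed downstream tasks such as nearest-neighbour search over a molecule library. A minimal NumPy sketch, assuming the encoder yields one fixed-size vector per molecule (the random matrix below is only a stand-in for the output of `model.encode`):

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T

# Stand-in for model.encode(smiles): 5 molecules, 768-dim embeddings
rng = np.random.default_rng(0)
emb = rng.standard_normal((5, 768))

sim = cosine_similarity_matrix(emb)
# Most similar molecule to the first one (excluding itself)
nearest = int(np.argsort(sim[0])[-2])
print(f"nearest neighbour of molecule 0: molecule {nearest}")
```

The embedding dimensionality here is an assumption for illustration; in practice you would pass the tensor returned by `model.encode` (converted to NumPy) in place of the random matrix.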

---

## 📝 Citation

If you use this model, please cite:

```bibtex
@article{soares2024large,
  title={A large encoder-decoder family of foundation models for chemical language},
  author={Soares, Eduardo and Shirasuna, Victor and Brazil, Emilio Vital and Cerqueira, Renato and Zubarev, Dmitry and Schmidt, Kristin},
  journal={arXiv preprint arXiv:2407.20267},
  year={2024}
}
```

---

## 📧 Contact

For questions or collaborations, contact:

- eduardo.soares@ibm.com
- evital@br.ibm.com

---

**Note:**
This fork adapts the original SMI-TED codebase for seamless integration with HuggingFace's `AutoModel` and `AutoTokenizer` interfaces. For the full source code and training scripts, see the [original IBM repo](https://github.com/IBM/materials/tree/main/models/smi_ted).