# SMILES Tokenizer

This is a custom tokenizer for SMILES (Simplified Molecular Input Line Entry System) strings.

## Usage

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('suku9/smiles-tokenizer')

# Tokenize a SMILES string (aspirin)
smiles = 'CC(=O)OC1=CC=CC=C1C(=O)O'
tokens = tokenizer.tokenize(smiles)
print(tokens)

# Encode a SMILES string as PyTorch tensors
encoded = tokenizer(smiles, return_tensors='pt')
print(encoded)

# Use with a GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Resize the model's embeddings to match the tokenizer's vocabulary size
model.resize_token_embeddings(len(tokenizer))
```

## Features

- Specialized for SMILES strings
- Compatible with Hugging Face's transformers library
- Designed to work with GPT-2 models
- Preserves all functionality of the original SMILES tokenizer
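
To give a sense of what "specialized for SMILES strings" can mean in practice, here is a minimal sketch of SMILES-aware splitting. It uses a regular expression that is widely used for SMILES tokenization; it is an assumption that this tokenizer behaves similarly, and the names `SMILES_PATTERN` and `split_smiles` are hypothetical helpers, not part of this repository. The key idea is that two-letter atoms (`Cl`, `Br`) and bracketed atoms (`[Na+]`) stay together as single tokens rather than being split character by character.

```python
import re
from typing import List

# A commonly used SMILES regex. Assumption: this illustrates the general idea
# only and is not necessarily the rule used by suku9/smiles-tokenizer.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def split_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into chemically meaningful tokens (hypothetical helper)."""
    tokens = SMILES_PATTERN.findall(smiles)
    # If joining the tokens does not reproduce the input, some character was
    # not recognized by the pattern, so report it instead of dropping it silently.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES string: {smiles!r}")
    return tokens


print(split_smiles("ClCCBr"))       # ['Cl', 'C', 'C', 'Br']
print(split_smiles("[Na+].[Cl-]"))  # ['[Na+]', '.', '[Cl-]']
```

A general-purpose text tokenizer could split `Cl` into separate pieces or merge fragments across atom boundaries; keeping each atom, bond, and ring-closure symbol as its own token avoids that.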