# SMILES Tokenizer

This is a custom tokenizer for SMILES (Simplified Molecular Input Line Entry System) strings.

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('suku9/smiles-tokenizer')

# Tokenize a SMILES string
tokens = tokenizer.tokenize('CC(=O)OC1=CC=CC=C1C(=O)O')
print(tokens)

# Encode a SMILES string
encoded = tokenizer('CC(=O)OC1=CC=CC=C1C(=O)O', return_tensors='pt')
print(encoded)

# Use with a GPT-2 model
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2')

# Resize model embeddings to match tokenizer vocabulary size
model.resize_token_embeddings(len(tokenizer))
```

## Features

- Specialized for SMILES strings
- Compatible with Hugging Face's transformers library
- Designed to work with GPT-2 models
- Preserves all functionality of the original SMILES tokenizer
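
To illustrate what "specialized for SMILES strings" can mean in practice, here is a minimal sketch of atom-level SMILES splitting with a regular expression. It is an assumption for illustration only: the pattern is a widely used community regex, and the names `SMILES_PATTERN` and `split_smiles` are hypothetical. The actual rules used by `suku9/smiles-tokenizer` may differ.

```python
import re
from typing import List

# Assumption: a commonly used atom-level SMILES pattern, not necessarily
# the rule set implemented by suku9/smiles-tokenizer.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def split_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom, bond, branch, and ring-closure tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Characters the pattern does not cover are silently skipped by findall,
    # so verify that the tokens reassemble into the original string.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES string: {smiles!r}")
    return tokens


# Two-letter halogens stay together instead of being split into single letters.
print(split_smiles("ClCCBr"))  # ['Cl', 'C', 'C', 'Br']
```

Keeping multi-character units such as `Cl`, `Br`, and bracket atoms like `[NH4+]` as single tokens is the main difference from a generic character-level or byte-pair tokenizer, which may split them in chemically meaningless places.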