Pretrained on 1B+ SMILES strings (PubChem + ZINC + ChEMBL).

Refer to https://github.com/CSUBioGroup/MolMetaLM for more details.

Usage

Prepare tokenizer and model

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('wudejian789/MolMetaLM-base-1B')
model = AutoModel.from_pretrained('wudejian789/MolMetaLM-base-1B')

Obtain molecular representations from SMILES

smi = "COc1cc2c(cc1OC)CC([NH3+])C2"
tokenized_smi = tokenizer(" ".join(list(smi)), return_token_type_ids=False, 
                          return_tensors='pt', max_length=512, padding='longest', truncation=True)
emb_smi = model(**tokenized_smi).last_hidden_state
print(emb_smi.shape)  # (batch size, sequence length, embedding size)
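The model returns one embedding per token. If a single fixed-size vector per molecule is needed (e.g. for downstream similarity search or property prediction), a common approach is masked mean pooling over the token dimension. Below is a minimal NumPy sketch with placeholder data; the hidden size of 768 and the padding layout are illustrative assumptions, not values taken from this model card:

```python
import numpy as np

# Hypothetical stand-in for emb_smi.detach().numpy(): shape (batch, seq_len, hidden).
# A hidden size of 768 is assumed here purely for illustration.
emb = np.random.rand(1, 10, 768)

# Attention mask: 1 for real tokens, 0 for padding (last 3 positions here).
mask = np.ones((1, 10))
mask[0, 7:] = 0

# Masked mean pooling: sum only over non-padding positions, then
# divide by the number of real tokens.
pooled = (emb * mask[..., None]).sum(axis=1) / mask.sum(axis=1, keepdims=True)
print(pooled.shape)  # (1, 768)
```

With the real model output, `emb` would come from `emb_smi.detach().numpy()` and `mask` from `tokenized_smi['attention_mask']`.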
Model size: 0.1B parameters (Safetensors, BF16)