---
license: apache-2.0
tags:
- pytorch
- bert
- molecule-compound
pipeline_tag: fill-mask
---

# molcrawl-compounds-bert-medium

## Model Description

BERT medium foundation model pre-trained with a masked language modeling objective on compound SMILES strings from the MolCrawl dataset.

The tokenizer is a character-level BPE tokenizer (vocab_size=612) that encodes each SMILES character as a separate token. Input SMILES strings should be passed **without** spaces (e.g., `CC(=O)O`). The `[SEP]` token (id=13) is used as the end-of-sequence marker.

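As a minimal sketch of the input convention described above, the helper below removes any whitespace from a SMILES string before it is handed to the tokenizer. The name `strip_smiles_whitespace` is hypothetical and is not part of the MolCrawl code or the tokenizer API; it only illustrates the "no spaces" requirement.

```python
def strip_smiles_whitespace(smiles: str) -> str:
    """Remove all whitespace so the SMILES string is passed without spaces."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces drops spaces, tabs, and newlines alike.
    return "".join(smiles.split())


# Hypothetical inputs: space-separated characters and stray padding.
print(strip_smiles_whitespace("C C ( = O ) O"))
print(strip_smiles_whitespace("  CC(=O)O\n"))
```

The cleaned string can then be passed to the tokenizer in the same way as the prompt in the Usage section.
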
- **Model Type**: bert
- **Data Type**: Molecule/Compound
- **Training Date**: 2026-04-24

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("kojima-lab/molcrawl-compounds-bert-medium")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-compounds-bert-medium")

# Predict a masked SMILES token.
# Use tokenizer.mask_token instead of a hardcoded "[MASK]":
# BERT-style tokenizers vary ("[MASK]", "<mask>", etc.).
if tokenizer.mask_token is None:
    raise ValueError("This tokenizer has no mask_token; masked LM inference is not supported.")
prompt = "CC(=O){MASK}".replace("{MASK}", tokenizer.mask_token)
inputs = tokenizer(prompt, return_tensors="pt")
# Position of the mask token within the single input sequence
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Pick the highest-scoring vocabulary entry at the masked position
predicted_token_id = logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
result = prompt.replace(tokenizer.mask_token, predicted_token)
print(f"Predicted: {result}")
```

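The Usage snippet keeps only the single best prediction. As a hedged sketch, the helper below ranks several candidates for the masked position instead. `top_k_tokens` is a hypothetical function, not part of `transformers` or MolCrawl, and the four-token vocabulary and logits are made up for illustration.

```python
import math


def top_k_tokens(scores, id_to_token, k=5):
    """Return the k highest-scoring (token, probability) pairs.

    scores: raw logits for one masked position, as a list of floats.
    id_to_token: callable mapping a token id (list index) to its string.
    """
    # Numerically stable softmax over the vocabulary axis
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids by probability, highest first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token(i), probs[i]) for i in ranked[:k]]


# Toy example with a made-up four-token vocabulary (not the real one).
toy_vocab = ["C", "O", "N", "Cl"]
toy_logits = [1.0, 3.0, 0.5, 2.0]
for token, prob in top_k_tokens(toy_logits, toy_vocab.__getitem__, k=2):
    print(f"{token}\t{prob:.3f}")
```

With the real model, and assuming a single masked position, the call would look like `top_k_tokens(logits[0, mask_index[0]].tolist(), tokenizer.convert_ids_to_tokens)`, reusing `logits` and `mask_index` from the Usage snippet.
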
## Source Code

The training pipeline, configuration files, and data preparation scripts are
available in the MolCrawl GitHub repository:
[https://github.com/mmai-framework-lab/MolCrawl](https://github.com/mmai-framework-lab/MolCrawl)

## License

This model is released under the Apache-2.0 license.

## Citation

If you use this model, please cite:

```bibtex
@misc{molcrawl_compounds_bert_medium,
  title={molcrawl-compounds-bert-medium},
  author={{RIKEN}},
  year={2026},
  publisher={{Hugging Face}},
  url={{https://huggingface.co/kojima-lab/molcrawl-compounds-bert-medium}}
}
```