Add model card
README.md
CHANGED
@@ -11,24 +11,45 @@ pipeline_tag: fill-mask
 
 ## Model Description
 
-
+GPT-2 medium (345M parameters) foundation model pre-trained on compound SMILES strings from the MolCrawl dataset.
+
+The tokenizer is a character-level BPE tokenizer (vocab_size=612) that encodes each SMILES character as a separate token. Input SMILES strings should be passed **without** spaces (e.g. `CC(=O)O`). The `[SEP]` token (id=13) is used as the end-of-sequence marker.
+
+## Datasets
+
+- **MolCrawl compounds corpus (chembl + zinc + opv + reddb + pubchemqc)**: [https://github.com/mmai-framework-lab/MolCrawl-HFuploader/blob/main/workflows/hugging_face/run_upload_hf.sh](https://github.com/mmai-framework-lab/MolCrawl-HFuploader/blob/main/workflows/hugging_face/run_upload_hf.sh) (Pre-training corpus)
 
 - **Model Type**: bert
 - **Data Type**: Molecule/Compound
-- **Training Date**: 2026-04-
+- **Training Date**: 2026-04-24
 
 ## Usage
 
 ```python
-from transformers import
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+import torch
 
-
-model = AutoModel.from_pretrained("kojima-lab/molcrawl-compounds-bert-medium")
+model = AutoModelForMaskedLM.from_pretrained("kojima-lab/molcrawl-compounds-bert-medium")
 tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-compounds-bert-medium")
 
-#
-
-
+# Predict masked SMILES token
+# Use tokenizer.mask_token instead of hardcoded "[MASK]":
+# BERT-style tokenizers vary ("[MASK]", "<mask>", etc.)
+if tokenizer.mask_token is None:
+    raise ValueError("This tokenizer has no mask_token; masked LM inference is not supported.")
+prompt = "CC(=O){MASK}".replace("{MASK}", tokenizer.mask_token)
+inputs = tokenizer(prompt, return_tensors="pt")
+mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
+
+with torch.no_grad():
+    outputs = model(**inputs)
+    logits = outputs.logits
+
+predicted_token_id = logits[0, mask_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
+result = prompt.replace(tokenizer.mask_token, predicted_token)
+print(f"Predicted: {result}")
+
 ```
 
 ## Training
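
The added Model Description says SMILES input should contain no spaces, that each character is encoded as its own token, and that `[SEP]` marks the end of a sequence. A minimal plain-Python sketch of that convention might look like the following. The helper names `normalize_smiles` and `char_level_tokens` are hypothetical and not part of the repository. The real tokenizer loaded with `AutoTokenizer` may merge characters or add other special tokens, so this only illustrates the description and does not reproduce the tokenizer's actual output.

```python
# Illustration only: mirrors the model card's description of the input
# convention. The actual tokenizer (loaded via AutoTokenizer) may behave
# differently, e.g. by merging characters or adding other special tokens.
SEP_TOKEN = "[SEP]"  # end-of-sequence marker named in the model card (id=13)


def normalize_smiles(smiles: str) -> str:
    """Remove all whitespace; the card says SMILES must be passed without spaces."""
    return "".join(smiles.split())


def char_level_tokens(smiles: str) -> list:
    """Split a SMILES string into one token per character and append [SEP]."""
    return list(normalize_smiles(smiles)) + [SEP_TOKEN]


if __name__ == "__main__":
    # Hypothetical input with stray spaces, cleaned before splitting.
    print(char_level_tokens("C C(=O) O"))
    # prints ['C', 'C', '(', '=', 'O', ')', 'O', '[SEP]']
```

Stripping whitespace before the string reaches the tokenizer keeps a stray space from being treated as an unknown or separate token, which is presumably the reason the card asks for space-free input.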