deskull committed on
Commit dcaa2f5 · verified · 1 Parent(s): 43af36a

Add model card

Files changed (1)
  1. README.md +29 -8
README.md CHANGED
@@ -11,24 +11,45 @@ pipeline_tag: fill-mask
 
  ## Model Description
 
- This model was trained using the RIKEN Foundation Model training pipeline.
+ GPT-2 medium (345M parameters) foundation model pre-trained on compound SMILES strings from the MolCrawl dataset.
+
+ The tokenizer is a character-level BPE tokenizer (vocab_size=612) that encodes each SMILES character as a separate token. Input SMILES strings should be passed **without** spaces (e.g. `CC(=O)O`). The `[SEP]` token (id=13) is used as the end-of-sequence marker.
+
+ ## Datasets
+
+ - **MolCrawl compounds corpus (chembl + zinc + opv + reddb + pubchemqc)**: [https://github.com/mmai-framework-lab/MolCrawl-HFuploader/blob/main/workflows/hugging_face/run_upload_hf.sh](https://github.com/mmai-framework-lab/MolCrawl-HFuploader/blob/main/workflows/hugging_face/run_upload_hf.sh) (Pre-training corpus)
 
  - **Model Type**: bert
  - **Data Type**: Molecule/Compound
- - **Training Date**: 2026-04-22
+ - **Training Date**: 2026-04-24
 
  ## Usage
 
  ```python
- from transformers import AutoModel, AutoTokenizer
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+ import torch
 
- # Load model and tokenizer
- model = AutoModel.from_pretrained("kojima-lab/molcrawl-compounds-bert-medium")
+ model = AutoModelForMaskedLM.from_pretrained("kojima-lab/molcrawl-compounds-bert-medium")
  tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-compounds-bert-medium")
 
- # Example usage
- inputs = tokenizer("your input sequence", return_tensors="pt")
- outputs = model(**inputs)
+ # Predict masked SMILES token
+ # Use tokenizer.mask_token instead of hardcoded "[MASK]":
+ # BERT-style tokenizers vary ("[MASK]", "<mask>", etc.)
+ if tokenizer.mask_token is None:
+     raise ValueError("This tokenizer has no mask_token; masked LM inference is not supported.")
+ prompt = "CC(=O){MASK}".replace("{MASK}", tokenizer.mask_token)
+ inputs = tokenizer(prompt, return_tensors="pt")
+ mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+     logits = outputs.logits
+
+ predicted_token_id = logits[0, mask_index].argmax(dim=-1)
+ predicted_token = tokenizer.decode(predicted_token_id)
+ result = prompt.replace(tokenizer.mask_token, predicted_token)
+ print(f"Predicted: {result}")
+
  ```
 
  ## Training