---
license: apache-2.0
tags:
- pytorch
- bert
- protein
pipeline_tag: fill-mask
---

# molcrawl-protein-sequence-bert-medium

## Model Description

BERT-style masked language model (medium configuration, 345M parameters) pre-trained as a foundation model on protein amino acid sequences from the MolCrawl dataset.

- **Model Type**: bert
- **Data Type**: Protein
- **Training Date**: 2026-05-11

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("kojima-lab/molcrawl-protein-sequence-bert-medium")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-protein-sequence-bert-medium")

# Predict a masked amino acid.
# Use tokenizer.mask_token instead of a hardcoded "[MASK]":
# BERT-style tokenizers vary ("[MASK]", "<mask>", etc.).
if tokenizer.mask_token is None:
    raise ValueError("This tokenizer has no mask_token; masked LM inference is not supported.")
prompt = "MKTAYIAK{MASK}RQISFVKSHFSRQ".replace("{MASK}", tokenizer.mask_token)
inputs = tokenizer(prompt, return_tensors="pt")
# Column index of each mask token in the single input row
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Most likely token at the masked position
predicted_token_id = logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
result = prompt.replace(tokenizer.mask_token, predicted_token)
print(f"Predicted: {result}")
```

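The usage snippet builds its prompt by substituting a placeholder with the tokenizer's mask token. To mask an arbitrary position instead, a minimal sketch might look like the following. The helper `mask_position` is hypothetical (it is not part of MolCrawl or `transformers`), the residue `Q` at index 8 is an arbitrary placeholder, and the literal `"<mask>"` is an assumed stand-in for `tokenizer.mask_token`.

```python
def mask_position(sequence: str, index: int, mask_token: str) -> str:
    """Return `sequence` with the residue at zero-based `index` replaced by `mask_token`."""
    if not 0 <= index < len(sequence):
        raise IndexError(f"index {index} is outside a sequence of length {len(sequence)}")
    return sequence[:index] + mask_token + sequence[index + 1:]


# Assumed stand-in for illustration; in practice pass tokenizer.mask_token.
prompt = mask_position("MKTAYIAKQRQISFVKSHFSRQ", 8, "<mask>")
print(prompt)
```

The resulting string can be passed to the tokenizer exactly like `prompt` in the snippet above. Raising on an out-of-range index avoids silently returning an unmasked sequence.
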
## Source Code

The training pipeline, configuration files, and data preparation scripts are
available in the MolCrawl GitHub repository:
[https://github.com/mmai-framework-lab/MolCrawl](https://github.com/mmai-framework-lab/MolCrawl)

## License

This model is released under the Apache-2.0 license.

## Citation

If you use this model, please cite:

```bibtex
@misc{molcrawl_protein_sequence_bert_medium,
  title={molcrawl-protein-sequence-bert-medium},
  author={{RIKEN}},
  year={2026},
  publisher={{Hugging Face}},
  url={{https://huggingface.co/kojima-lab/molcrawl-protein-sequence-bert-medium}}
}
```

## Example Output

End-to-end inference test: the model was downloaded from this repository and run on CPU.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

REPO_ID = "kojima-lab/molcrawl-protein-sequence-bert-medium"
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForMaskedLM.from_pretrained(REPO_ID)
model.eval()

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSR<mask>VIVQDIAYLRSLGYNIVATPRGYVLAGG"
inputs = tokenizer(sequence, return_tensors="pt")
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    outputs = model(**inputs)

predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
predicted_aa = tokenizer.convert_ids_to_tokens(predicted_id.tolist())[0]
print(f"Predicted amino acid at mask: {predicted_aa}")
# => Predicted amino acid at mask: W
```

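The example reports only the single highest-scoring token. To inspect several candidates, one option is to softmax-normalise the scores at the masked position and keep the top few, as sketched below. The helper `top_k_candidates` is hypothetical and works on plain Python floats so that it runs without downloading the model. The toy scores are made up for illustration and are not model output; in practice they would come from `outputs.logits[0, mask_index]`, with token names from `tokenizer.convert_ids_to_tokens`.

```python
import math


def top_k_candidates(scores: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Softmax-normalise raw scores and return the k most probable tokens."""
    if not scores:
        return []
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {token: math.exp(score - peak) for token, score in scores.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)
    return [(token, value / total) for token, value in ranked[:k]]


# Toy logits for illustration only; these are not produced by the model.
toy_logits = {"W": 2.0, "L": 1.0, "A": 0.0, "G": -1.0}
for token, prob in top_k_candidates(toy_logits, k=2):
    print(f"{token}: {prob:.3f}")
```

Probabilities are computed over all supplied tokens before truncation, so the returned values sum to less than 1 whenever `k` is smaller than the number of scores.
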