kojima-lab/molcrawl-genome-sequence-bert-medium

deskull committed 15 days ago

Commit 7e2b04a · verified · 1 Parent(s): 346c427

Add model card

Files changed (1)

README.md ADDED (+75 -0)

	@@ -0,0 +1,75 @@

+---
+license: apache-2.0
+tags:
+- pytorch
+- bert
+- dna-genome
+pipeline_tag: fill-mask
+---
+# molcrawl-genome-sequence-bert-medium
+## Model Description
+BERT medium foundation model (masked language model) pre-trained on human genome DNA sequences from the [GRCh38](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) reference assembly.
+## Datasets
+- **GRCh38 human genome reference assembly**: [https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) (Pre-training corpus)
+- **Model Type**: bert
+- **Data Type**: DNA/Genome
+- **Training Date**: 2026-05-11
+## Usage
+```python
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+import torch
+model = AutoModelForMaskedLM.from_pretrained("kojima-lab/molcrawl-genome-sequence-bert-medium")
+tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-genome-sequence-bert-medium")
+# Predict masked DNA token
+# Use tokenizer.mask_token instead of hardcoded "[MASK]":
+# BERT-style tokenizers vary ("[MASK]", "<mask>", etc.)
+if tokenizer.mask_token is None:
+    raise ValueError("This tokenizer has no mask_token; masked LM inference is not supported.")
+prompt = "ATCGATCG{MASK}ATCGATCG".replace("{MASK}", tokenizer.mask_token)
+inputs = tokenizer(prompt, return_tensors="pt")
+mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
+with torch.no_grad():
+    outputs = model(**inputs)
+logits = outputs.logits
+predicted_token_id = logits[0, mask_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
+result = prompt.replace(tokenizer.mask_token, predicted_token)
+print(f"Predicted: {result}")
+```
+## Source Code
+Training pipeline, configuration files, and data preparation scripts are
+available in the MolCrawl GitHub repository:
+[https://github.com/mmai-framework-lab/MolCrawl](https://github.com/mmai-framework-lab/MolCrawl)
+## License
+This model is released under the Apache-2.0 license.
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{molcrawl_genome_sequence_bert_medium,
+  title={molcrawl-genome-sequence-bert-medium},
+  author={{RIKEN}},
+  year={2026},
+  publisher={{Hugging Face}},
+  url={{https://huggingface.co/kojima-lab/molcrawl-genome-sequence-bert-medium}}
+}
+```
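
The usage snippet in the README reports only the single highest-scoring token for the masked position. When inspecting a masked language model it is often more informative to look at several ranked candidates with their probabilities. The following is a minimal sketch of that ranking step in plain Python, not part of the committed README. The function name `top_k_candidates`, the toy vocabulary, and the toy logits are all hypothetical and chosen for illustration; the real tokenizer's vocabulary is not shown in this commit.

```python
import math


def top_k_candidates(mask_logits, id_to_token, k=3):
    """Rank vocabulary tokens for one masked position by softmax probability.

    mask_logits: list of floats, one score per vocabulary id.
    id_to_token: list mapping vocabulary id -> token string.
    Returns a list of (token, probability) pairs, best first.
    """
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(mask_logits)
    exps = [math.exp(x - peak) for x in mask_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort vocabulary ids by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id_to_token[i], probs[i]) for i in ranked]


# Hypothetical toy vocabulary and logits (assumptions, not model output).
toy_vocab = ["A", "C", "G", "T"]
toy_logits = [2.0, 0.5, 1.0, -1.0]
for token, prob in top_k_candidates(toy_logits, toy_vocab, k=2):
    print(f"{token}: {prob:.3f}")
```

With the README snippet, the per-position scores would presumably come from something like `logits[0, mask_index[0]].tolist()`. If PyTorch is already loaded, `torch.topk` on the softmaxed logits should give the same ranking without the manual sort.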