| --- |
| license: apache-2.0 |
| tags: |
| - pytorch |
| - bert |
| - dna-genome |
| pipeline_tag: fill-mask |
| --- |
| |
| # molcrawl-genome-sequence-bert-medium |
|
|
| ## Model Description |
|
|
| BERT medium (345M parameters) foundation model pre-trained on human genome DNA sequences from the [GRCh38](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) reference assembly. |
|
|
| ## Datasets |
|
|
| - **GRCh38 human genome reference assembly**: [https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) (Pre-training corpus) |
|
|
| ## Model Details |
|
|
| - **Model Type**: bert |
| - **Data Type**: DNA/Genome |
| - **Training Date**: 2026-05-11 |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoModelForMaskedLM, AutoTokenizer |
| import torch |
| |
| model = AutoModelForMaskedLM.from_pretrained("kojima-lab/molcrawl-genome-sequence-bert-medium") |
| tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-genome-sequence-bert-medium") |
| |
| # Predict masked DNA token |
| # Use tokenizer.mask_token instead of hardcoded "[MASK]": |
| # BERT-style tokenizers vary ("[MASK]", "<mask>", etc.) |
| if tokenizer.mask_token is None: |
|     raise ValueError("This tokenizer has no mask_token; masked LM inference is not supported.") |
| prompt = "ATCGATCG{MASK}ATCGATCG".replace("{MASK}", tokenizer.mask_token) |
| inputs = tokenizer(prompt, return_tensors="pt") |
| mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1] |
| |
| with torch.no_grad(): |
|     outputs = model(**inputs) |
| logits = outputs.logits |
| |
| predicted_token_id = logits[0, mask_index].argmax(dim=-1) |
| predicted_token = tokenizer.decode(predicted_token_id) |
| result = prompt.replace(tokenizer.mask_token, predicted_token) |
| print(f"Predicted: {result}") |
| |
| ``` |
|
|
| ## Source Code |
|
|
| Training pipeline, configuration files, and data preparation scripts are |
| available in the MolCrawl GitHub repository: |
| [https://github.com/mmai-framework-lab/MolCrawl](https://github.com/mmai-framework-lab/MolCrawl) |
|
|
| ## License |
|
|
| This model is released under the Apache-2.0 license. |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @misc{molcrawl_genome_sequence_bert_medium, |
| title={molcrawl-genome-sequence-bert-medium}, |
| author={{RIKEN}}, |
| year={2026}, |
| publisher={{Hugging Face}}, |
| url={{https://huggingface.co/kojima-lab/molcrawl-genome-sequence-bert-medium}} |
| } |
| ``` |
|
|