Commit 7e2b04a by deskull · verified · 1 parent: 346c427

Add model card

Files changed (1)
  1. README.md +75 -0
README.md ADDED
@@ -0,0 +1,75 @@
+ ---
+ license: apache-2.0
+ tags:
+ - pytorch
+ - bert
+ - dna-genome
+ pipeline_tag: fill-mask
+ ---
+
+ # molcrawl-genome-sequence-bert-medium
+
+ ## Model Description
+
+ BERT medium foundation model (masked language model) pre-trained on human genome DNA sequences from the [GRCh38](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) reference assembly.
+
+ ## Datasets
+
+ - **GRCh38 human genome reference assembly**: [https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) (Pre-training corpus)
+
+ - **Model Type**: bert
+ - **Data Type**: DNA/Genome
+ - **Training Date**: 2026-05-11
+
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+ import torch
+
+ model = AutoModelForMaskedLM.from_pretrained("kojima-lab/molcrawl-genome-sequence-bert-medium")
+ tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-genome-sequence-bert-medium")
+
+ # Predict a masked DNA token.
+ # Use tokenizer.mask_token instead of a hardcoded "[MASK]":
+ # BERT-style tokenizers vary ("[MASK]", "<mask>", etc.)
+ if tokenizer.mask_token is None:
+     raise ValueError("This tokenizer has no mask_token; masked LM inference is not supported.")
+ prompt = "ATCGATCG{MASK}ATCGATCG".replace("{MASK}", tokenizer.mask_token)
+ inputs = tokenizer(prompt, return_tensors="pt")
+ # Column index of the mask token within the (single) input sequence
+ mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
+
+ # Inference only, so gradient tracking is disabled
+ with torch.no_grad():
+     outputs = model(**inputs)
+     logits = outputs.logits
+
+ # Take the highest-scoring vocabulary entry at the masked position
+ predicted_token_id = logits[0, mask_index].argmax(dim=-1)
+ predicted_token = tokenizer.decode(predicted_token_id)
+ result = prompt.replace(tokenizer.mask_token, predicted_token)
+ print(f"Predicted: {result}")
+ ```
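The usage snippet above keeps only the single best token at the masked position. To inspect the runner-up candidates as well, the logits for that position can be turned into probabilities and ranked. The following is a minimal sketch and not part of the committed README: `top_k_tokens` is a hypothetical helper name, and the four-entry vocabulary and logit values are toy data, not the model's real vocabulary or output.

```python
import math


def top_k_tokens(logits, id_to_token, k=5):
    """Return the k most probable (token, probability) pairs for one masked position.

    logits: list of floats, one score per vocabulary entry.
    id_to_token: mapping from token id to token string.
    """
    # Numerically stable softmax: subtract the maximum before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids by probability, highest first, and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id_to_token[i], probs[i]) for i in ranked]


# Toy data (hypothetical): a 4-token vocabulary and made-up logits.
toy_vocab = {0: "A", 1: "C", 2: "G", 3: "T"}
toy_logits = [2.0, 0.5, 1.0, -1.0]

top = top_k_tokens(toy_logits, toy_vocab, k=2)
for token, prob in top:
    print(f"{token}: {prob:.3f}")
```

With the real model, the inputs would presumably come from the usage snippet: `logits[0, mask_index[0]].tolist()` for the scores and `{i: t for t, i in tokenizer.get_vocab().items()}` for the id-to-token mapping. This assumes the prompt contains exactly one mask token.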
+
+ ## Source Code
+
+ Training pipeline, configuration files, and data preparation scripts are
+ available in the MolCrawl GitHub repository:
+ [https://github.com/mmai-framework-lab/MolCrawl](https://github.com/mmai-framework-lab/MolCrawl)
+
+ ## License
+
+ This model is released under the Apache-2.0 license.
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{molcrawl_genome_sequence_bert_medium,
+   title={molcrawl-genome-sequence-bert-medium},
+   author={{RIKEN}},
+   year={2026},
+   publisher={{Hugging Face}},
+   url={{https://huggingface.co/kojima-lab/molcrawl-genome-sequence-bert-medium}}
+ }
+ ```