kojima-lab
/

molcrawl-rna-celltype-bert-medium

 pipeline_tag: fill-mask
 ---
+> ⚠️ **Input format**: this model uses a Geneformer-style WordLevel
+> tokenizer whose vocabulary contains **ENSEMBL gene IDs only** (e.g.
+> `ENSG00000000003`). Plain nucleotide strings (`AUGCAUGC...`) and
+> free text do NOT tokenize correctly: each such input collapses to a
+> single `[UNK]` token. The MLM head will still return an ENSG ID at
+> the mask position; this is by design, since gene IDs are the only
+> tokens it can predict. See the Example Output below for the correct
+> list-of-tokens calling convention.
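The collapse to `[UNK]` described in the note can be illustrated with a minimal pure-Python sketch of word-level lookup. This is not the model's actual tokenizer: `TOY_VOCAB`, its integer IDs, and the helper `word_level_encode` are hypothetical names invented for illustration, and the sketch assumes that each input token is matched against the vocabulary as a whole string.

```python
# Toy vocabulary (hypothetical): the real vocabulary and its IDs will differ.
TOY_VOCAB = {
    "[PAD]": 0,
    "[UNK]": 1,
    "[MASK]": 2,
    "ENSG00000000003": 3,
    "ENSG00000000005": 4,
}


def word_level_encode(tokens, vocab=TOY_VOCAB, unk_token="[UNK]"):
    """Map each token to its ID by exact match; unknown tokens become [UNK]."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


# A list of ENSEMBL gene IDs (plus a mask) maps to distinct vocabulary IDs.
gene_ids = word_level_encode(["ENSG00000000003", "[MASK]", "ENSG00000000005"])

# A nucleotide string is not a vocabulary entry, so it becomes a single [UNK].
nucleotides = word_level_encode(["AUGCAUGC"])

print(gene_ids)
print(nucleotides)
```

In this toy setup the gene-ID list encodes to `[3, 2, 4]`, while the nucleotide string encodes to `[1]`, the `[UNK]` ID, so no sequence information reaches the model.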
 # molcrawl-rna-celltype-bert-medium
 ## Model Description