deskull committed on
Commit b3f75c6 · verified · 1 Parent(s): fb01e49

docs: prominent input-format warning (this is a gene-ID model, not nucleotide)

Files changed (1)
  1. README.md +9 -0
README.md CHANGED
@@ -7,6 +7,15 @@ tags:
  pipeline_tag: fill-mask
  ---
 
+ > ⚠️ **Input format**: this model uses a Geneformer-style WordLevel
+ > tokenizer whose vocabulary is **ENSEMBL gene IDs only** (e.g.
+ > `ENSG00000000003`). Plain nucleotide strings (`AUGCAUGC...`) or
+ > free text will NOT tokenize correctly — they collapse to one
+ > `[UNK]` and the MLM head will return an ENSG-id at that mask
+ > position by design. See the Example Output below for the correct
+ > list-of-tokens calling convention.
+
+
  # molcrawl-rna-celltype-bert-medium
 
  ## Model Description
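
The behavior described in the added warning can be illustrated with a toy word-level lookup. This is a minimal sketch, not the model's real tokenizer: the four-entry `TOY_VOCAB`, its ID numbering, and the `encode` helper are hypothetical names introduced here for illustration only. It shows why a gene-ID vocabulary maps a nucleotide string to a single `[UNK]`, while a list of gene-ID tokens maps cleanly.

```python
# Toy illustration only: a hypothetical vocabulary, not the model's real one.
UNK = "[UNK]"
TOY_VOCAB = {UNK: 0, "[MASK]": 1, "ENSG00000000003": 2, "ENSG00000000005": 3}


def encode(tokens):
    """Map each token to its ID; anything outside the vocabulary becomes [UNK]."""
    return [TOY_VOCAB.get(tok, TOY_VOCAB[UNK]) for tok in tokens]


# A list of gene-ID tokens: every entry is found in the vocabulary.
gene_ids = encode(["ENSG00000000003", "[MASK]", "ENSG00000000005"])

# A nucleotide string is not a vocabulary entry, so it falls back to [UNK].
nucleotides = encode(["AUGCAUGC"])

print(gene_ids)     # [2, 1, 3]
print(nucleotides)  # [0]
```

In a word-level scheme like this one, lookup is by whole token, so no part of `AUGCAUGC` can match a gene ID. The model then only sees an unknown token, which is consistent with the warning's note that the MLM head still predicts a gene ID at the mask position.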