docs: prominent input-format warning (this is a gene-ID model, not nucleotide)
README.md
CHANGED
@@ -7,6 +7,15 @@ tags:
 pipeline_tag: fill-mask
 ---
 
+> ⚠️ **Input format**: this model uses a Geneformer-style WordLevel
+> tokenizer whose vocabulary is **ENSEMBL gene IDs only** (e.g.
+> `ENSG00000000003`). Plain nucleotide strings (`AUGCAUGC...`) or
+> free text will NOT tokenize correctly — they collapse to one
+> `[UNK]` and the MLM head will return an ENSG-id at that mask
+> position by design. See the Example Output below for the correct
+> list-of-tokens calling convention.
+
+
 # molcrawl-rna-celltype-bert-medium
 
 ## Model Description
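The behavior this warning describes can be illustrated with a toy word-level lookup. This is only a sketch and does not load the model's tokenizer: the `VOCAB` table, its ID numbering, the `encode` helper, and the whitespace splitting are all assumptions made for illustration. It shows why a space-separated list of gene IDs maps to distinct token IDs while a nucleotide string, which is a single unknown word, collapses to one `[UNK]`.

```python
# Toy word-level tokenizer: split on whitespace, then exact vocabulary lookup.
# VOCAB is hypothetical; the real model ships its own, much larger vocabulary.
VOCAB = {
    "[PAD]": 0,
    "[UNK]": 1,
    "[MASK]": 2,
    "ENSG00000000003": 3,
    "ENSG00000000005": 4,
    "ENSG00000000419": 5,
}


def encode(text):
    """Map each whitespace-separated token to its ID, or to [UNK] if absent."""
    return [VOCAB.get(token, VOCAB["[UNK]"]) for token in text.split()]


# Gene IDs are whole-word vocabulary entries, so each one gets its own ID.
gene_ids = encode("ENSG00000000003 [MASK] ENSG00000000419")

# A nucleotide string is one word that is not in the vocabulary.
nucleotides = encode("AUGCAUGCAUGC")

print(gene_ids)     # [3, 2, 5]
print(nucleotides)  # [1]
```

With the real model, the same reasoning suggests passing ENSEMBL gene IDs as separate tokens rather than as raw sequence text.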