About gene type for Geneformer input
Hi!
I have some samples containing different types of genes, such as protein-coding, non-coding, etc. Can Geneformer accept other types of genes, or should I restrict my gene IDs to specific types before tokenizing? Thank you!
Also, which version of the human genome annotation is appropriate? I see that Ensembl GRCh37 Release 113 (October 2024) has been released, and I plan to use it for gene symbol conversion because not all of my samples have ensembl_id (some only have gene symbols). Different samples may also be annotated with different versions; will this have any effect?
Thanks for your question! The genes that Geneformer is pretrained with are in the token dictionary corresponding to the given model. You can use genes other than those by adding them to the token dictionary, but you should train the model further using these tokens since the model won’t have seen them during pretraining. In terms of Ensembl versions, there can be differences, and we provide a mapping dictionary to consolidate some IDs, but you can edit this mapping file if needed.
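As a rough illustration of the advice above, here is a minimal sketch of checking which of your genes are already in the token dictionary after applying an ID mapping. It assumes both the token dictionary and the mapping are plain Python dicts keyed by Ensembl ID (in practice you would load them from the files shipped with your model version). The function name `consolidate_and_filter` and the toy IDs are hypothetical and only for demonstration, not part of Geneformer.

```python
def consolidate_and_filter(gene_ids, token_dict, id_mapping=None):
    """Split gene IDs into those the model has a token for and those it does not.

    gene_ids   : iterable of Ensembl gene IDs from your samples
    token_dict : dict mapping Ensembl ID -> token ID (assumed format)
    id_mapping : optional dict mapping older/alternate IDs -> consolidated IDs
    """
    id_mapping = id_mapping or {}
    known, unknown = [], []
    for gene_id in gene_ids:
        # Apply the consolidation mapping first; fall back to the ID itself.
        mapped = id_mapping.get(gene_id, gene_id)
        if mapped in token_dict:
            known.append(mapped)
        else:
            # Keep the original ID so it is easy to trace back to the sample.
            unknown.append(gene_id)
    return known, unknown


# Toy stand-ins for the real dictionaries (hypothetical values).
toy_token_dict = {"<pad>": 0, "<mask>": 1, "ENSG00000000003": 2, "ENSG00000000005": 3}
toy_id_mapping = {"ENSG_OLD_EXAMPLE": "ENSG00000000005"}
sample_genes = ["ENSG00000000003", "ENSG_OLD_EXAMPLE", "ENSG99999999999"]

known, unknown = consolidate_and_filter(sample_genes, toy_token_dict, toy_id_mapping)
print("in vocabulary:", known)
print("not in vocabulary:", unknown)
```

Genes that end up in `unknown` would either be dropped before tokenizing or added to the token dictionary, in which case the model should be trained further as described above.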