Unable to Reproduce Results for Gene Classification

#425
by mchatz - opened

Hey,

I am unable to reproduce the gene classification results for the dosage-sensitive task using the default settings from the provided notebook: https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/gene_classification.ipynb
Specifically, I am using the 6-layer Geneformer and the example input data from https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/gene_classification/dosage_sensitive_tfs/gc-30M_sample50k.dataset.

Issue:

I am getting a macro F1 score of 0.672, which is lower than expected.

The model is biased toward predicting the second class.
image.png

Please let me know if you have any suggestions or if additional configuration is required.

Thank you in advance.

Thank you for your question! If you are using the current version, please note that the default dictionary is for the 95M model, so you need to provide the 30M dictionary for the 30M model. Otherwise, the tokens will be scrambled from their true gene identity.
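To illustrate why a mismatched dictionary scrambles gene identity, here is a minimal, self-contained sketch. The two dictionaries and the gene names are toy data invented for this example (they are not the real Geneformer dictionaries), and `decode` is a hypothetical helper. The sketch only shows that token IDs written with one dictionary map to different genes when read with another.

```python
# Toy token dictionaries (hypothetical IDs, NOT the real Geneformer files).
# Assumption for illustration: the newer dictionary has extra special tokens,
# which shifts every gene to a different ID.
dict_30m = {"<pad>": 0, "<mask>": 1, "GENE_A": 2, "GENE_B": 3, "GENE_C": 4}
dict_95m = {"<pad>": 0, "<mask>": 1, "<cls>": 2, "<eos>": 3,
            "GENE_A": 4, "GENE_B": 5, "GENE_C": 6}


def decode(token_ids, token_dict):
    """Map token IDs back to gene names using the given dictionary."""
    id_to_gene = {tok: gene for gene, tok in token_dict.items()}
    return [id_to_gene.get(t, "<unknown>") for t in token_ids]


# A cell tokenized with the 30M-style dictionary.
cell_tokens = [dict_30m[g] for g in ["GENE_C", "GENE_A", "GENE_B"]]

right = decode(cell_tokens, dict_30m)  # read with the matching dictionary
wrong = decode(cell_tokens, dict_95m)  # read with the mismatched dictionary

print("matching dictionary:  ", right)
print("mismatched dictionary:", wrong)
```

With the mismatched dictionary, the same IDs resolve to different genes or to special tokens, so the gene labels used for classification no longer line up with the tokens in the dataset. In practice, the fix described above is to point the classifier at the 30M token dictionary. The exact argument name depends on the installed Geneformer version, so check the signature in your copy.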

ctheodoris changed discussion status to closed

Hello, when I run this notebook with

Example input_data_file for 30M model series: https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/gene_classification/dosage_sensitive_tfs/gc-30M_sample50k.dataset

cc.prepare_data(input_data_file="/path/to/gc-30M_sample50k.dataset",
                output_directory=output_dir,
                output_prefix=output_prefix)

I get the following error:

TypeError: label_classes() missing 1 required positional argument: 'id_class_dict'.

May I ask whether you have ever encountered this problem and, if so, how you solved it?
Thank you for your reply.
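For anyone debugging a similar message: this kind of TypeError means a function was called with fewer arguments than its signature requires, which can happen when different parts of an installation come from different versions. The sketch below is not Geneformer code. `label_classes` here is a hypothetical stand-in with an assumed parameter list, used only to show how `inspect.signature` reveals which arguments a function expects.

```python
import inspect


def label_classes(classifier, data, gene_class_dict, nproc, id_class_dict):
    """Hypothetical stand-in; the real function's parameters may differ."""
    return data


def required_positional(func):
    """List the positional parameters of func that have no default value."""
    params = inspect.signature(func).parameters.values()
    return [
        p.name
        for p in params
        if p.default is inspect.Parameter.empty
        and p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
    ]


expected = required_positional(label_classes)

# Calling with one argument too few reproduces the shape of the error.
try:
    label_classes("gene", [], {}, 1)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print("required arguments:", expected)
print("error:", error_message)
```

Running the same `inspect.signature` check on the function in the installed package, and comparing it with the call site in the traceback, shows whether the caller and the function come from matching versions.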
