Unable to Reproduce Results for Gene Classification

#425

by mchatz - opened Sep 23, 2024

Sep 23, 2024

Hey,

Using the default settings from the provided notebook (https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/gene_classification.ipynb), I am unable to reproduce the gene classification results for the dosage-sensitive task.
Specifically, I am using the 6-layer Geneformer and the example input data from https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/gene_classification/dosage_sensitive_tfs/gc-30M_sample50k.dataset.

Issue:

I am getting a macro F1 score of 0.672, which is lower than expected.

The model is biased toward predicting the second class.

Please let me know if you have any suggestions or if additional configuration is required.

Thank you in advance.

ctheodoris

Owner Sep 23, 2024

Thank you for your question! If you are using the current version, please note that the default dictionary is for the 95M model, so you need to provide the 30M dictionary for the 30M model. Otherwise, the tokens will be scrambled from their true gene identity.
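To illustrate why a mismatched dictionary scrambles gene identities, here is a minimal sketch. Both dictionaries, the gene names, and the token values are invented for the example and are not taken from the real Geneformer dictionaries; the only point is that the same token ID can refer to different entries in two vocabularies.

```python
# Toy illustration only: both dictionaries below are hypothetical.
# They stand in for a "30M-style" and a "95M-style" gene-to-token mapping.
dict_30m = {"<pad>": 0, "<mask>": 1, "GENE_A": 2, "GENE_B": 3, "GENE_C": 4}
dict_95m = {
    "<pad>": 0, "<mask>": 1, "<cls>": 2, "<eos>": 3,
    "GENE_A": 4, "GENE_B": 5, "GENE_C": 6,
}


def decode(token_ids, token_dict):
    """Map token IDs back to gene names using the given dictionary."""
    id_to_gene = {tok: gene for gene, tok in token_dict.items()}
    return [id_to_gene.get(t, "<unknown>") for t in token_ids]


# A cell tokenized with the 30M-style dictionary.
cell = [dict_30m[g] for g in ["GENE_C", "GENE_A", "GENE_B"]]

# Decoding with the matching dictionary recovers the original genes.
right = decode(cell, dict_30m)

# Decoding with the other dictionary silently yields different entries.
wrong = decode(cell, dict_95m)

print(right)
print(wrong)
```

In this toy case no error is raised with the wrong dictionary; the labels are simply attached to the wrong tokens, which would be consistent with a drop in classification performance rather than a crash. Which argument of your installed version accepts the dictionary path is worth checking against its signature.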

ctheodoris changed discussion status to closed Sep 23, 2024

jenny143

Dec 5, 2025

Hello, when I run this notebook with gc-30M_sample50k.dataset and call

cc.prepare_data(input_data_file="/path/to/gc-30M_sample50k.dataset",
                output_directory=output_dir,
                output_prefix=output_prefix)

I get the following error:

TypeError: label_classes() missing 1 required positional argument: 'id_class_dict'.

May I ask whether you have encountered this problem and, if so, how you solved it?
Thank you for your reply.
