Questions on replicating the human_dcm_hcm in silico perturbation results
Thanks for your great work. I ran into some difficulties when replicating the in silico perturbation results. Because of limited GPU memory, I set max_ncells to 2000 or 5000. The model is the one at the top level of the installed Geneformer folder, and the dataset is the example human_dcm_hcm. The rest of the code is unchanged from the original in_silico_perturbation.ipynb file, as shown below.
from geneformer import InSilicoPerturber
from geneformer import InSilicoPerturberStats
# in silico perturbation in deletion mode to determine genes whose
# deletion in the dilated cardiomyopathy (dcm) state significantly shifts
# the embedding towards non-failing (nf) state
isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb="all",
                        combos=0,
                        anchor_gene=None,
                        model_type="CellClassifier",
                        num_classes=3,
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        filter_data={"cell_type":["Cardiomyocyte1","Cardiomyocyte2","Cardiomyocyte3"]},
                        cell_states_to_model={"disease":(["dcm"],["nf"],["hcm"])},
                        max_ncells=5000,
                        emb_layer=0,
                        forward_batch_size=400,
                        nproc=16)
# outputs intermediate files from in silico perturbation
isp.perturb_data("/install_path/Geneformer/",
                 "/install_path/Geneformer/examples/datasets/human_dcm_hcm/",
                 "/install_path/Geneformer/examples/output/",
                 "human_dcm_hcm")
# I created the directories above to store the dataset and output files
ispstats = InSilicoPerturberStats(mode="goal_state_shift",
                                  genes_perturbed="all",
                                  combos=0,
                                  anchor_gene=None,
                                  cell_states_to_model={"disease":(["dcm"],["nf"],["hcm"])})
# extracts data from intermediate files and processes stats to output in final .csv
ispstats.get_stats("/install_path/Geneformer/examples/output/",
                   None,
                   "/install_path/Geneformer/examples/output/",
                   "human_dcm_hcm")
However, I found that the results differ a lot from those in Table 12, sheet "DCM_del_tx", most notably in the much lower absolute values of shift_to_goal_end and shift_to_alt_end.
For example, I got "ADGRL3 ENSG00000150471 6.18E-05 -6.88E-06" with 883 detections from a run with max_ncells = 5000, whereas the entry in sheet DCM_del_tx is "ENSG00000150471 ADGRL3 0.010678123 -0.026562714" with 1880 detections. The shifts in my results are hundreds to thousands of times smaller. Could you tell me what the possible causes might be? Thank you!
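To put a number on the gap, here is a quick check of the fold difference for the ADGRL3 row quoted above. It assumes the two values in each row are shift_to_goal_end followed by shift_to_alt_end, which the post does not state explicitly.

```python
# Values copied from the ADGRL3 rows quoted in the post.
# ASSUMPTION: column order is shift_to_goal_end, then shift_to_alt_end.
replicated = {"shift_to_goal_end": 6.18e-05, "shift_to_alt_end": -6.88e-06}
table_12 = {"shift_to_goal_end": 0.010678123, "shift_to_alt_end": -0.026562714}

# Ratio of absolute shifts: how many times smaller the replicated value is.
fold_smaller = {
    column: abs(table_12[column]) / abs(replicated[column])
    for column in replicated
}

for column, fold in fold_smaller.items():
    print(f"{column}: replicated value is about {fold:.0f}x smaller")
```

The two columns differ by quite different factors, so "1000 times" is only a rough midpoint between them.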
Thank you for your interest in Geneformer! My understanding from your comment is that you are using the pretrained model for this analysis. As discussed in the manuscript, we first fine-tuned the model to distinguish the cardiomyopathy states before performing the in silico perturbation. Loading the pretrained model as a CellClassifier adds a head layer with random, untrained weights, and with emb_layer = 0 you are modeling the embeddings from that untrained layer. We have released the fine-tuned model for cardiomyopathy in this repository, so you should use that model instead. For max_ncells, set it to None so that all cells are used to calculate the start and goal embedding positions. You can then use cell_inds_to_perturb to restrict the perturbations to a smaller subset of cells, and increase that subset beyond 5000 cells to approach the number of detections reported in Table 12.
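A minimal sketch of the settings after these changes, written as a plain dict of keyword arguments so it runs without Geneformer installed. The model path is a hypothetical placeholder, and the {"start": ..., "end": ...} form of cell_inds_to_perturb is an assumption that should be checked against the InSilicoPerturber docstring of the installed version. All other values are copied from the original post.

```python
# ASSUMPTION: hypothetical placeholder; point this at the released
# fine-tuned cardiomyopathy model, not the pretrained model directory.
FINE_TUNED_MODEL_DIR = "/path/to/fine_tuned_cardiomyopathy_model/"

isp_kwargs = dict(
    perturb_type="delete",
    perturb_rank_shift=None,
    genes_to_perturb="all",
    combos=0,
    anchor_gene=None,
    model_type="CellClassifier",
    num_classes=3,
    emb_mode="cell",
    cell_emb_style="mean_pool",
    filter_data={"cell_type": ["Cardiomyocyte1", "Cardiomyocyte2", "Cardiomyocyte3"]},
    cell_states_to_model={"disease": (["dcm"], ["nf"], ["hcm"])},
    # Use all cells to compute the start and goal embedding positions.
    max_ncells=None,
    # ASSUMPTION: dict with "start"/"end" keys; verify in the docstring.
    # Limits which cells are actually perturbed (here the first 5000).
    cell_inds_to_perturb={"start": 0, "end": 5000},
    emb_layer=0,
    forward_batch_size=400,
    nproc=16,
)

# Intended use (not executed here):
#   isp = InSilicoPerturber(**isp_kwargs)
#   isp.perturb_data(FINE_TUNED_MODEL_DIR, <dataset dir>, <output dir>, "human_dcm_hcm")
```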
@v2vJyl Have you managed to replicate the results? I'm trying to do so myself and having some trouble.
@ctheodoris
Thank you for your help in troubleshooting. cell_inds_to_perturb is described as "useful for splitting large datasets across separate GPUs". So does it allow us to run multiple analyses in parallel, whose outputs are then collated by InSilicoPerturberStats()?
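For illustration, one way to generate non-overlapping index ranges for such parallel runs is sketched below. The helper split_cell_ranges is hypothetical and not part of Geneformer, the "start"/"end" dict form is an assumption about what cell_inds_to_perturb accepts, and whether InSilicoPerturberStats collates the per-range outputs is exactly the open question here.

```python
def split_cell_ranges(n_cells, n_chunks):
    """Split range(n_cells) into contiguous, near-equal [start, end) ranges.

    Hypothetical helper, not part of Geneformer. Returns dicts shaped like
    {"start": ..., "end": ...} (end exclusive), one per non-empty chunk.
    """
    if n_cells < 0 or n_chunks < 1:
        raise ValueError("n_cells must be >= 0 and n_chunks must be >= 1")
    base, extra = divmod(n_cells, n_chunks)
    ranges = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional cell each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # skip empty chunks when n_chunks > n_cells
        ranges.append({"start": start, "end": start + size})
        start += size
    return ranges


# Example: 5000 cells split across 4 GPUs.
gpu_ranges = split_cell_ranges(5000, 4)
for gpu_id, cell_range in enumerate(gpu_ranges):
    print(f"GPU {gpu_id}: cell_inds_to_perturb={cell_range}")
```

Each range would be passed to a separate InSilicoPerturber run, with every run writing to the same or a per-run output directory depending on how the stats step expects to find the files.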