If the perturbation result of a single gene is sig=1, does it mean that the result of the single gene will also be included when selecting all gene perturbation results?

#414
by jialei233 - opened

Hi, ctheodoris!
When I was using "in_silico_perturbation" with "genes_to_perturb=all", I found that it took a long time to finish, so I instead perturbed only the single Ensembl ID I was interested in, and the result showed sig=0. If I perturb all genes instead, would the result for that gene be different or stay the same? Best wishes!!

Thank you for your question! Since in silico perturbation is inference only, the cosine shift for a given gene will not change regardless of whether you run it as a single gene or with genes_to_perturb="all". However, the result of the statistical test will change depending on what is used as the random (null) distribution. It is not clear from your message how you ran the statistics after your perturbation, but if the random distribution was drawn from the perturbations you performed, then the single gene you perturbed was compared to its own distribution, and it is unsurprising that it was not significant.

If you run the analysis with genes_to_perturb="all", the random distribution will include all other gene perturbations, so you can determine whether a particular perturbation is more perturbative than a random one. If you do not wish to perturb all genes, you could also run perturbations of random genes in random subsamples of cells to get a null distribution that you can provide to the stats module with the null distribution method option.

Of note, we recently made some changes to the "all" case to improve memory efficiency, so if you are using a prior version, consider pulling the recent one, as the decreased memory requirement may allow you to run larger batches and therefore speed up the process. Additionally, as with all statistical tests, the number of observations plays a role in the significance calculation, so if you run the in silico perturbations in all cells rather than a subsample, you will get a more confident measure of the distribution and gain statistical power.
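As a rough illustration of why the choice of null distribution matters, here is a minimal sketch on synthetic numbers. It does not call Geneformer and is not the test used by the stats module: the helper `permutation_p_value`, the simulated cosine shifts, and the permutation test itself are all assumptions made for the example. It compares one gene's shifts first against themselves and then against shifts from other (simulated) gene perturbations.

```python
import random


def permutation_p_value(test, null, n_perm=1000, seed=0):
    """Two-sided permutation test on the difference in means.

    Hypothetical helper for illustration only; not part of Geneformer.
    """
    rng = random.Random(seed)
    observed = abs(sum(test) / len(test) - sum(null) / len(null))
    pooled = list(test) + list(null)
    n_test = len(test)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled values to the two groups.
        rng.shuffle(pooled)
        a, b = pooled[:n_test], pooled[n_test:]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


rng = random.Random(42)
# Synthetic cosine shifts (assumed values, not real model output).
gene_shifts = [rng.gauss(-0.02, 0.005) for _ in range(200)]
other_gene_shifts = [rng.gauss(0.0, 0.005) for _ in range(1000)]

# Null drawn from the same perturbation: nothing to distinguish.
p_self = permutation_p_value(gene_shifts, gene_shifts)
# Null drawn from other gene perturbations: the shift can stand out.
p_null = permutation_p_value(gene_shifts, other_gene_shifts)

print(f"gene vs. itself:      p = {p_self:.4f}")
print(f"gene vs. other genes: p = {p_null:.4f}")
```

The cosine shifts are identical in both comparisons; only the reference distribution changes, which is what moves the significance call.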

ctheodoris changed discussion status to closed

Thank you very much for your answer and explanation!! I think I now understand why choosing a single gene perturbation resulted in no statistical difference. Moving forward, I will use genes_to_perturb="all" to obtain a better random distribution. Additionally, I will switch to the latest version of Geneformer. Once again, thank you for your patient response!!
