About in_silico_perturbation memory problem
Hello, I want to reproduce the results from the in silico treatment analysis section of your paper.
I first fine-tuned the model on the cardiomyopathy data you provided so that it could distinguish between the different states: hcm, nf, and dcm.
Next, I used InSilicoPerturber to perform gene deletion perturbation with the maximum cell number parameter set to None, that is, using all cells for testing.
I have only one GPU with 16 GB of memory, and running on all cells always ends in OOM. I tried setting the batch size to 4, but that still did not work. In the end, a batch size of 4 with a maximum cell count of 6500 does run, but the results are very different from those in your article.
To check the results quickly, I tested the two genes mentioned in this part of your article, ADCY5 and SRPK3, and the results are as follows.

My main question is: is the number of cells tested the main reason for such a large gap?
Or is there a minimum number of cells that you recommend testing to have enough statistical power?
Thank you for your question. The number of cells will definitely have a large effect on the statistical significance. There is no specific threshold for the number of cells that would be universal across situations - it depends on the effect size expected for the particular perturbation you are testing. You could consider doing a power calculation if you would like to determine the minimum number of cells to reach statistical significance in each situation.
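As a rough illustration of the power calculation mentioned above, here is a minimal sketch. It assumes a two-sided, two-sample comparison of means with a standardized effect size (Cohen's d) and uses the normal approximation; this is not necessarily the test used by the stats module, and the function name and example effect sizes are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def cells_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of cells needed in each group.

    Assumes a two-sided, two-sample comparison of means with a
    standardized effect size (Cohen's d) and a normal approximation.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = standard_normal.inv_cdf(power)           # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Hypothetical effect sizes: smaller expected shifts need many more cells.
for d in (0.5, 0.1, 0.05):
    print(f"effect size {d}: about {cells_per_group(d)} cells per group")
```

Under these assumptions, the required number of cells grows with the inverse square of the effect size, which is why no single threshold applies to every perturbation.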
I am assuming based on your question that you are running out of memory with the in silico perturber and not the stats module. The in silico perturber saves to disk every 100 cells and clears the memory every 1000 cells. If you are working with limited resources, you could alter the code to clear the memory more frequently. You can also just keep running each subset of cells until you reach all the cells. Then, put all of the raw data in a single directory that you pass to the in silico perturber stats function so that it can accumulate the data from all of them.
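The subset-by-subset approach can be sketched as follows. This is only an illustration with hypothetical helper names (chunk_ranges, run_in_chunks, toy_score), not the Geneformer API: in practice each chunk would be a separate in silico perturber run over that range of cells, with the raw output of every run written to the same directory. The toy per-cell function stands in for inference on one cell.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def chunk_ranges(n_cells: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) index ranges that together cover all cells."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)


def run_in_chunks(
    cells: Sequence[int],
    per_cell_fn: Callable[[int], int],
    chunk_size: int,
) -> List[int]:
    """Process cells one chunk at a time and accumulate the per-cell results."""
    results: List[int] = []
    for start, end in chunk_ranges(len(cells), chunk_size):
        # Each chunk is handled on its own; nothing carries over between chunks.
        results.extend(per_cell_fn(cell) for cell in cells[start:end])
    return results


# Toy stand-in for per-cell inference (hypothetical, for illustration only).
def toy_score(cell: int) -> int:
    return cell * cell


toy_cells = list(range(10))
whole = [toy_score(cell) for cell in toy_cells]
chunked = run_in_chunks(toy_cells, toy_score, chunk_size=4)
print(chunked == whole)
```

Because each cell is processed independently, the accumulated chunked results match a single run over all cells, and the chunk size only needs to be small enough to fit in GPU memory.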
You are right, the OOM appeared when I ran the perturbation, not during the subsequent stats step. You wrote: "You can also just keep running each subset of cells until you reach all the cells. Then, put all of the raw data in a single directory that you pass to the in silico perturber stats function so that it can accumulate the data from all of them."
Is my understanding correct that running it this way gives the same result as running without splitting the data?
You are correct, there is no difference. Each cell is processed independently of the others, and you are not training the model, only using it for inference.
Thank you very much
One more question. In the step where I calculate the perturbation [InSilicoPerturber], does the way the data is split affect the results?
For example, in the data of cardiomyopathy disease, there are three different categories, dcm, hcm and nf. Will the combination of different proportions of these three categories lead to different results?
I'm sorry to bother you again.
If you are asking whether the different categories need to be balanced, the answer is no for in silico perturbation because it is not training the model, it is only performing inference. You should keep in mind though that the number of cells will affect the statistical significance, as discussed above.