Filtering cells for in silico perturbation from provided gene list
Hello,
Thank you for the great work and documentation! I have a question regarding the perturb_data() function of the InSilicoPerturber module. Perturbing all genes would take a very long time, and I'm only interested in the effects of knocking out certain genes on cell state transitions. When I provide a list of more than a handful (~10) of genes, I get the following error:
[ERROR] No cells in dataset contain genes_to_perturb
I see that this happens because the filter_data_by_tokens function from perturber_utils keeps only cells in which all of the listed genes are tokenized, so that the same cells are used for every gene perturbation. When I set genes_to_perturb='all', however, this constraint is not enforced (otherwise no cell would have tokenized expression of every gene), and different cell sets are used for different gene perturbations.
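As a toy illustration of what I think is happening (plain Python sets with made-up gene IDs, not the actual filter_data_by_tokens code): requiring every listed gene in a cell shrinks the usable cell set quickly, while a per-gene filter keeps a different set of cells for each gene.

```python
# Toy data: each "cell" is the set of gene tokens it contains.
# The gene IDs are hypothetical placeholders.
cells = [
    {"ENSG_A", "ENSG_B", "ENSG_C"},
    {"ENSG_A", "ENSG_B"},
    {"ENSG_C"},
]


def cells_with_all(cells, genes):
    """Indices of cells that contain every gene in `genes` (gene-list behavior)."""
    wanted = set(genes)
    return [i for i, cell in enumerate(cells) if wanted <= cell]


def cells_with_gene(cells, gene):
    """Indices of cells that contain one gene (per-gene behavior)."""
    return [i for i, cell in enumerate(cells) if gene in cell]


genes = ["ENSG_A", "ENSG_B", "ENSG_C"]
shared = cells_with_all(cells, genes)
per_gene = {g: cells_with_gene(cells, g) for g in genes}
print("cells containing all genes:", shared)
print("cells per gene:", per_gene)
```

With a longer gene list the shared set becomes empty, which would match the error above.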
My question is: can I remove the constraint that is imposed when I specify a gene set? Or is my best bet to run InSilicoPerturber separately for each gene and then aggregate the results?
Thank you!
Thanks for your question. Yes, all options are possible with the following methods:
- perturb all genes individually: "all"
- perturb a set of genes in combination: [Ensembl1, Ensembl2,...]
- perturb a single gene: [Ensembl1]
The third option, as you said, can be run separately for each gene of interest. These runs are independent, so they are easy to parallelize, and this will be more efficient than perturbing all genes if you are only interested in a subset.
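A minimal sketch of the per-gene approach, with some assumptions: run_single_gene and run_per_gene are hypothetical helper names, the paths are placeholders, and the InSilicoPerturber keyword arguments shown (perturb_type, genes_to_perturb) should be checked against your installed Geneformer version, since other settings may be needed for your model. The thread pool is only for illustration; on a GPU cluster, separate jobs per gene may be a better fit.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_single_gene(gene, model_dir, input_data, output_root):
    """Run one single-gene perturbation and return its output directory.

    Assumption: constructor keywords and the perturb_data argument order
    match your Geneformer version; adjust as needed.
    """
    from geneformer import InSilicoPerturber  # imported lazily on purpose

    out_dir = os.path.join(output_root, gene)
    os.makedirs(out_dir, exist_ok=True)
    isp = InSilicoPerturber(perturb_type="delete", genes_to_perturb=[gene])
    # Arguments: model directory, input data file, output directory, output prefix.
    isp.perturb_data(model_dir, input_data, out_dir, gene)
    return out_dir


def run_per_gene(genes, run_one, max_workers=2):
    """Call run_one(gene) for each gene independently.

    Returns (results, failures) so that one failing gene, for example one
    with no matching cells, does not stop the remaining runs.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_one, gene): gene for gene in genes}
        for future in as_completed(futures):
            gene = futures[future]
            try:
                results[gene] = future.result()
            except Exception as exc:
                failures[gene] = exc
    return results, failures


# Example usage (paths and Ensembl IDs are placeholders):
# from functools import partial
# run_one = partial(run_single_gene, model_dir="path/to/model",
#                   input_data="path/to/dataset", output_root="isp_out")
# results, failures = run_per_gene(["Ensembl1", "Ensembl2"], run_one)
```

Each gene writes to its own output directory, so the per-gene results can be aggregated afterwards.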