Slicing in AnnData in tokenizer.py
Following up on a piece of previous discussion.
While 1. tokenizing external .h5ad data and then 2. extracting cell embeddings from pre-trained models, I had to replace the following line with
# X_view = adata[idx, coding_miRNA_loc].X
adata = adata.to_memory()
X_view = adata[:, coding_miRNA_loc][idx, :].X
This is probably because I installed a more recent version of AnnData (0.9.2).
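For anyone wondering whether the two slicing orders select the same data: they do for in-memory arrays. The sketch below uses a plain numpy array as a stand-in for adata.X (the variable names idx and coding_miRNA_loc match the snippet above; numpy's np.ix_ reproduces the outer-product row/column selection that AnnData's adata[idx, coding_miRNA_loc] performs). The difference only matters in backed mode, where slicing columns first and rows second avoids the error.

```python
import numpy as np

# Toy stand-in for adata.X: 6 cells x 5 genes.
X = np.arange(30).reshape(6, 5)

idx = [1, 3]                  # cell (row) indices for this chunk
coding_miRNA_loc = [0, 2, 4]  # gene (column) indices to keep

# One-step 2-D selection, like adata[idx, coding_miRNA_loc].X
# (np.ix_ gives the outer-product semantics AnnData uses):
one_step = X[np.ix_(idx, coding_miRNA_loc)]

# Two-step selection (columns first, then rows), mirroring
# adata[:, coding_miRNA_loc][idx, :].X:
two_step = X[:, coding_miRNA_loc][idx, :]

# Both orders select the same submatrix.
assert (one_step == two_step).all()
```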
Ran into the same issue and the same fix works, thank you so much!
Thank you for the update! Adding .to_memory() loads the whole adata into memory, but the purpose of the X_view approach is to process the file a chunk at a time, similarly to the .loom method we initially implemented. It would be ideal to find a different way to resolve the error without loading the whole dataset into memory, which can be a problem for very large datasets.
Thanks for the clarification, I will take a look at it over the weekend.
Thank you for following up! The variable adata is defined outside of that loop, but the loop, as you said, defines a chunk size, which in turn defines a narrower view of the data, X_view, for the subsequent operations. If we load adata into memory rather than X_view, the whole adata ends up in memory, which would be an issue for large datasets. It would be wonderful to find another way to resolve the error without loading the whole dataset into memory, though. Thank you again so much for looking into this!
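To make the chunking point concrete, here is a minimal sketch of the loop structure being described, again with a numpy array standing in for a backed AnnData (variable names chunk_size, idx, X_view, and coding_miRNA_loc follow the discussion; the loop body is illustrative, not the actual tokenizer code). Only one chunk's rows are materialized per iteration, and stacking the chunks reproduces the full column-filtered matrix:

```python
import numpy as np

# Toy stand-in for a backed AnnData .X matrix: 6 cells x 5 genes.
X = np.arange(30).reshape(6, 5)
coding_miRNA_loc = [0, 2, 4]  # gene columns to keep
chunk_size = 2

pieces = []
for start in range(0, X.shape[0], chunk_size):
    idx = list(range(start, min(start + chunk_size, X.shape[0])))
    # Only this chunk's rows are materialized for downstream work,
    # mirroring X_view = adata[:, coding_miRNA_loc][idx, :].X
    X_view = X[:, coding_miRNA_loc][idx, :]
    pieces.append(X_view)

chunked = np.vstack(pieces)
# Chunk-at-a-time processing matches processing the whole matrix at once.
assert (chunked == X[:, coding_miRNA_loc]).all()
```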
True. I've re-run with the original implementation (X_view = adata[idx, coding_miRNA_loc].X), and the previous issue is gone. I've also done some benchmarking: if your memory is larger than the adata, adding adata = adata.to_memory() before the for-loop gives a 5-10x speedup, e.g. 13 mins -> 2~3 mins for an adata of 1e6 genome-wide cells.
Glad to hear the prior error is resolved. I added an option to modify the chunk_size used by the anndata tokenizer to allow for speedups with a larger chunk_size when memory is not a limitation. Thank you for your suggestion!
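For intuition on why a larger chunk_size helps when memory allows: it directly reduces the number of reads against the backed file. A small sketch (count_chunks is a hypothetical helper for illustration only, not part of the Geneformer API):

```python
def count_chunks(n_cells, chunk_size):
    """Number of loop iterations for a given chunk_size.

    Hypothetical helper for illustration; larger chunk_size means
    fewer (but bigger) reads from the backed file, trading memory
    for speed.
    """
    return -(-n_cells // chunk_size)  # ceiling division

# For 1e6 cells, raising chunk_size cuts iterations dramatically:
assert count_chunks(1_000_000, 512) == 1954
assert count_chunks(1_000_000, 100_000) == 10
```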
