Slicing in AnnData in tokenizer.py
Following up on a piece of previous discussion.
While 1. tokenizing external .h5ad data and then 2. extracting cell embeddings from pre-trained models, I had to replace the following line with
# X_view = adata[idx, coding_miRNA_loc].X
adata = adata.to_memory()
X_view = adata[:, coding_miRNA_loc][idx, :].X
This is probably because I installed a more recent version of AnnData (0.9.2).
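For anyone wondering whether the two slicing orders select the same data: they do for in-memory arrays. The sketch below uses a plain numpy array as a stand-in for adata.X (the variable names idx and coding_miRNA_loc match the snippet above; numpy's np.ix_ reproduces the outer-product row/column selection that AnnData's adata[idx, coding_miRNA_loc] performs). The difference only matters in backed mode, where slicing columns first and rows second avoids the error.

```python
import numpy as np

# Toy stand-in for adata.X: 6 cells x 5 genes.
X = np.arange(30).reshape(6, 5)

idx = [1, 3]                  # cell (row) indices for this chunk
coding_miRNA_loc = [0, 2, 4]  # gene (column) indices to keep

# One-step 2-D selection, like adata[idx, coding_miRNA_loc].X
# (np.ix_ gives the outer-product semantics AnnData uses):
one_step = X[np.ix_(idx, coding_miRNA_loc)]

# Two-step selection (columns first, then rows), mirroring
# adata[:, coding_miRNA_loc][idx, :].X:
two_step = X[:, coding_miRNA_loc][idx, :]

# Both orders select the same submatrix.
assert (one_step == two_step).all()
```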
Ran into the same issue and the same fix works, thank you so much!
Thank you for the update! Adding .to_memory() loads the whole adata into memory, but the purpose of the X_view approach is to process the file a chunk at a time, similarly to the .loom method we initially implemented. It would be ideal to find a different way to resolve the error without loading the whole dataset into memory, which can be a problem for very large datasets.
Thanks for the clarification, I will take a look at it over the weekend.
Thank you for following up! The variable adata is defined outside of that loop, but the loop, as you said, defines a chunk size, which in turn defines a narrower view of the data, X_view, for the subsequent operations. If we load adata into memory rather than X_view, the whole adata ends up in memory, which would be an issue for large datasets. It would be wonderful to find another way to resolve the error without loading the whole dataset into memory, though. Thank you again so much for looking into this!
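To make the chunking point concrete, here is a minimal sketch of the loop structure being described, again with a numpy array standing in for a backed AnnData (variable names chunk_size, idx, X_view, and coding_miRNA_loc follow the discussion; the loop body is illustrative, not the actual tokenizer code). Only one chunk's rows are materialized per iteration, and stacking the chunks reproduces the full column-filtered matrix:

```python
import numpy as np

# Toy stand-in for a backed AnnData .X matrix: 6 cells x 5 genes.
X = np.arange(30).reshape(6, 5)
coding_miRNA_loc = [0, 2, 4]  # gene columns to keep
chunk_size = 2

pieces = []
for start in range(0, X.shape[0], chunk_size):
    idx = list(range(start, min(start + chunk_size, X.shape[0])))
    # Only this chunk's rows are materialized for downstream work,
    # mirroring X_view = adata[:, coding_miRNA_loc][idx, :].X
    X_view = X[:, coding_miRNA_loc][idx, :]
    pieces.append(X_view)

chunked = np.vstack(pieces)
# Chunk-at-a-time processing matches processing the whole matrix at once.
assert (chunked == X[:, coding_miRNA_loc]).all()
```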
True. I've re-run with the original implementation (X_view = adata[idx, coding_miRNA_loc].X), and the previous issue is gone. I've also done some benchmarking: if your memory is larger than the adata, adding adata = adata.to_memory() before the for-loop gives a 5-10x speedup, e.g. 13 mins -> 2~3 mins for an adata of 1e6 genome-wide cells.
Glad to hear the prior error is resolved. I added an option to modify the chunk_size used by the anndata tokenizer to allow for speedups with a larger chunk_size when memory is not a limitation. Thank you for your suggestion!
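For intuition on why a larger chunk_size helps when memory allows: it directly reduces the number of reads against the backed file. A small sketch (count_chunks is a hypothetical helper for illustration only, not part of the Geneformer API):

```python
def count_chunks(n_cells, chunk_size):
    """Number of loop iterations for a given chunk_size.

    Hypothetical helper for illustration; larger chunk_size means
    fewer (but bigger) reads from the backed file, trading memory
    for speed.
    """
    return -(-n_cells // chunk_size)  # ceiling division

# For 1e6 cells, raising chunk_size cuts iterations dramatically:
assert count_chunks(1_000_000, 512) == 1954
assert count_chunks(1_000_000, 100_000) == 10
```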
