Issues with the codebase
Hello,
There are a number of issues in the codebase that hurt usability. It would be great if these could be fixed!
`emb_extractor.py`
- This file shuffles the dataset. When running evaluation, this is unwanted, and it is tedious to have to reverse it afterwards.
- This file saves CSVs in the following format:

      8,tensor(-0.0535),tensor(-0.0137),tensor(0.0381),tensor(0.0003),tensor(0.0163),tensor(-0.0026),tensor(-0.0553)...

  which is not usable. Saving such a CSV is also extremely slow for large datasets.
  Tensors should be saved as torch files, or perhaps as NumPy array files. If the embeddings are saved as a CSV, they should be saved as full-precision floats, not as the `__str__` representation of each tensor.
- The `embeddings` variable, when created using the code in the notebook example, should be a NumPy array of floats or a torch tensor of floats. Currently it is created as an array of lists of tensors.
- In scripts with progress bars, `tqdm.notebook` is imported, which does not work when running in a terminal. `tqdm.auto` can be imported instead.
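As a sketch of the saving suggestion above (file names and the `embeddings` variable here are hypothetical stand-ins for the real pipeline's output), a binary array file preserves full precision and is far faster than writing stringified tensors to a CSV:

```python
import os
import tempfile

import numpy as np

# Hypothetical embeddings: in the real pipeline these come from the model's
# hidden states; here we simulate a (cells x dims) float32 array.
embeddings = np.random.rand(100, 256).astype(np.float32)

out_dir = tempfile.mkdtemp()
npy_path = os.path.join(out_dir, "embeddings.npy")

# Binary formats keep full precision and are much faster than CSV at scale.
np.save(npy_path, embeddings)
# A torch tensor could equivalently be saved with torch.save(embs.cpu(), path).

loaded = np.load(npy_path)
assert loaded.dtype == np.float32 and loaded.shape == (100, 256)

# If a CSV is still needed, write the float values themselves, not str(tensor):
csv_path = os.path.join(out_dir, "embeddings.csv")
np.savetxt(csv_path, embeddings, delimiter=",", fmt="%.8g")
```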
Thanks!
Thank you for your interest in Geneformer and for your suggestions!
Shuffling: the emb_extractor actually sorts the dataset by length so that the largest input is first. This is done to encounter memory limitations earlier so users can more easily optimize the maximum batch size they can use based on their resources. You can include labels for the cells so that the output embeddings are labeled accordingly. That way you can arrange the output in any order you prefer.
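The rationale for length-sorting can be sketched in a few lines (this is an illustrative toy, not Geneformer's actual code): if the longest inputs come first, an out-of-memory error surfaces on the very first batches, so users can tune the batch size immediately rather than deep into a run.

```python
# Toy tokenized dataset: each cell has a label and a variable-length
# sequence of token ids (field names here are hypothetical).
cells = [
    {"label": "cell_a", "input_ids": [1, 2, 3]},
    {"label": "cell_b", "input_ids": [4, 5, 6, 7, 8]},
    {"label": "cell_c", "input_ids": [9]},
]

# Sort longest-first: if the largest batch fits in memory, later ones will too.
cells_sorted = sorted(cells, key=lambda c: len(c["input_ids"]), reverse=True)
assert [c["label"] for c in cells_sorted] == ["cell_b", "cell_a", "cell_c"]
```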
Output data type: Thank you for your suggestion to include an option to output the embeddings as torch files. Right now the output is a dataframe so that it can be used for plotting, but if users are not interested in plotting, we can add the option to instead output torch files. If you have already implemented this, please submit a pull request so we can test it and merge it. I will note, though, that the output CSV should be a dataframe of floats, not the tensor string format you mentioned. Please make sure you are using the current version.
Embeddings variable format: Please specify which variable you are referring to (line of code would be helpful).
Progress bar: Thank you for the suggestion - we can update this for improved usability with batch jobs.
Current solution: Most of the issues you are having may be resolved by importing and using the function `get_embs` within the embedding extractor directly. That way you can prepare the data however you'd like, without sorting, and output the embedding tensors in whatever format you are interested in.
Thanks for the reply!
I found this sorting later (it was the reason for the issue in https://huggingface.co/ctheodoris/Geneformer/discussions/253). It would be beneficial to document this in the embedding extraction notebook, and also to add an option to disable this behavior.
For saving torch or numpy output, the function should just return `embs.cpu()` or `embs.cpu().numpy()`. This is also relevant to the embeddings-variable-format bullet point:
this was for this line in the extract_and_plot notebook. Pandas seems to convert the rows to arrays, but not the individual tensor values to floats. https://huggingface.co/ctheodoris/Geneformer/blob/main/geneformer/emb_extractor.py#L401 Instead of just calling `.cpu()`, `.numpy()` can also be added.
"path/to/input_data/",
"path/to/output_directory/",
"output_prefix")```
- Thanks! It would be great if there were a specific tutorial on how to get embeddings from an h5ad file, since this is probably the most widely used file format, and since the tokenizer now supports it.
Thanks!
Probably related to this discussion. While trying to 1. tokenize external .h5ad data and then 2. extract cell embeddings from the pre-trained Geneformer, I had to replace the following line:
https://huggingface.co/ctheodoris/Geneformer/blob/4302f4835eda5320b13de85092c97f2c6679b36e/geneformer/tokenizer.py#L213
which is:
```python
# X_view = adata[idx, coding_miRNA_loc].X
adata = adata.to_memory()
X_view = adata[:, coding_miRNA_loc][idx, :].X
```
This is probably because I installed AnnData 0.9.2, a more recent version.
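The workaround above slices in two steps (columns first, then rows). The distinction is easiest to see with plain NumPy arrays, used here as a stand-in for the AnnData matrix: chained slicing selects the full row/column cross product (which is what the tokenizer needs), whereas passing two integer arrays in one call pairs them element-wise.

```python
import numpy as np

X = np.arange(16).reshape(4, 4)
rows = np.array([0, 2])
cols = np.array([1, 3])

# Two-step (chained) slicing: columns first, then rows -> full cross product.
two_step = X[:, cols][rows, :]

# Equivalent cross-product selection in a single call:
cross = X[np.ix_(rows, cols)]
assert np.array_equal(two_step, cross)

# By contrast, two index arrays at once pair up element-wise:
paired = X[rows, cols]   # picks only elements (0,1) and (2,3)
assert paired.tolist() == [1, 11]
```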
Thank you for following up!
Sorting: by providing the argument "emb_label" with the desired label from the .dataset column, the final embeddings will be associated with that label. There is a reason for the sorting (to encounter memory constraints sooner, as described earlier). If the label is used, the sorting should not pose an issue.
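A hedged sketch of what this suggests (the `cell_id` and `dim_*` column names are hypothetical): if each embedding row carries its cell label, the length-sorted output can be restored to any desired order afterwards.

```python
import pandas as pd

# Embeddings as returned after length-sorting, with a hypothetical
# "cell_id" label column attached via emb_label.
emb_df = pd.DataFrame(
    {"cell_id": ["cell_b", "cell_a", "cell_c"],
     "dim_0": [0.5, 0.1, 0.9],
     "dim_1": [0.6, 0.2, 0.8]}
)

# The original order of cells in the input data:
original_order = ["cell_a", "cell_b", "cell_c"]

restored = (
    emb_df.set_index("cell_id")
          .loc[original_order]   # reindex back to the original order
          .reset_index()
)
assert restored["cell_id"].tolist() == original_order
```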
Tensor output: We added an option "output_torch_embs" to the extract_embs function. Please note this will output the embeddings as a tensor as well as a dataframe with the associated labels, so expect two outputs.
Pandas output: When running the code, the pandas dataframe embedding values are floats. Please provide the pandas version you are using, in case that is the reason we are unable to reproduce this issue. We added .numpy() in case this resolves the issue for you, but again, when we run this code the output is already in float format.
Anndata --> embeddings: The input to the model is rank value encodings, so AnnData files (or any other scRNA-seq files) first need to be converted to this format using the provided tokenizer before being used for any modeling (e.g. extracting embeddings, fine-tuning the model, etc.). We have an example dataset already indicated in the example notebook for extracting embeddings, but for additional clarity, we will also link to the example notebook for tokenization.
Thank you for noting this! Would you mind starting a new discussion with a title relevant to this as it is a separate issue? It will be helpful to future users who are looking for the answer to the same question. Thank you!