A question on input data and preprocessing steps (bin parameter)

#2
by ToGraduate - opened

Hello, I am very interested in your excellent single-cell foundation model, scGPT, particularly its potential to mitigate batch effects across multiple scRNA-seq datasets.

I have a question regarding expression binning, which I believe is a crucial prerequisite for batch correction. Specifically, I am curious whether scGPT supports relative binning of gene expression values on a per-cell basis.
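For context, the kind of per-cell relative binning I have in mind can be sketched as follows. This is a minimal numpy sketch of the quantile-binning idea described for scGPT, not code from the TDC repository; the function name `bin_per_cell` and the `n_bins` parameter are my own illustrative choices:

```python
import numpy as np

def bin_per_cell(row: np.ndarray, n_bins: int = 51) -> np.ndarray:
    """Per-cell relative binning: map one cell's non-zero expression
    values onto integer bins using that cell's own quantiles, so bin
    IDs are comparable across cells regardless of sequencing depth."""
    binned = np.zeros_like(row, dtype=np.int64)
    nonzero = row > 0
    if not nonzero.any():
        return binned
    # Quantile edges are computed from this cell only (relative binning).
    edges = np.quantile(row[nonzero], np.linspace(0, 1, n_bins - 1))
    # Bin 0 is reserved for zeros; non-zero values fall into 1..n_bins-1.
    binned[nonzero] = np.digitize(row[nonzero], edges)
    return binned
```

Under this scheme, every cell with at least one non-zero gene ends up with the same maximum bin ID (`n_bins - 1`), which is why I would expect binned data to look uniform across cells.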

However, I could not find any relevant hyperparameters or implementation for binning in the following source code files:

  1. DataLoader (https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/anndata_dataset.py)
  2. scGPTTokenizer (https://github.com/mims-harvard/TDC/blob/main/tdc/model_server/tokenizers/scgpt.py)
  3. scGPT model (https://github.com/mims-harvard/TDC/blob/main/tdc/model_server/models/scgpt.py).

In fact, based on the example data provided, it appears that binning is not applied: if the values were binned, every cell would share the same maximum (the highest bin ID), yet each cell retains a distinct maximum gene expression value:

data = adata.X.toarray()  # adata: the example AnnData object from the TDC DataLoader
data[0].max() # 3.0
data[1].max() # 57.0
data[2].max() # 3.0
data[3].max() # 18.0

I would greatly appreciate your clarification on how expression binning is handled in the model β€” globally, per batch, or per cell β€” and where this is configured in the pipeline. I would also like to confirm whether the input data are expected to be raw counts when binning is not used.
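As a quick heuristic for the raw-counts question, one can check whether the matrix is non-negative and integer-valued (raw UMI counts are; log-normalized values generally are not). This is my own illustrative check, reusing the `data` array from the snippet above:

```python
import numpy as np

def looks_like_raw_counts(data: np.ndarray) -> bool:
    """Heuristic: raw count matrices contain only non-negative integers."""
    return bool((data >= 0).all() and np.allclose(data, np.round(data)))
```

The per-cell maxima above (3.0, 57.0, 18.0) are consistent with integers, but this check alone cannot rule out that some other integer-preserving transform was applied, hence my question.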

Thank you very much for your time and this impactful work.
