Prediction of celltype on 3k PBMCs dataset using fine-tuned model

#130
by thereallda - opened

Hi, thanks for your fantastic work.
I am quite new to transformers and want to apply Geneformer to cell type classification. I was following your cell_classification notebook and used the model fine-tuned on immune organs to predict the cell types of the 3k PBMCs dataset (http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz).

I referred to the previous discussion (https://huggingface.co/ctheodoris/Geneformer/discussions/107) and performed the prediction as follows:

# 1. transform scRNA-seq expression data to rank value .dataset format
from geneformer import TranscriptomeTokenizer

tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ_major"}, nproc=4)
tk.tokenize_data("D:/jupyterNote/pySC/output/", output_directory="token_data/", output_prefix="tk_pbmc3k")

# 2. load new dataset
import pandas as pd
from datasets import load_from_disk

new_dataset = load_from_disk("D:/jupyterNote/Geneformer/examples/token_data/tk_pbmc3k.dataset/")
pd.DataFrame(new_dataset)

Input dataset

# 3. load the fine-tuned model
from transformers import BertForSequenceClassification, Trainer

ft_model = BertForSequenceClassification.from_pretrained("cell_class_test/230719_geneformer_CellClassifier_immune_L2048_B4_LR5e-05_LSlinear_WU500_E10_Oadamw_F0/")
ft_trainer = Trainer(model=ft_model)

# 4. perform prediction 
ct_predictions = ft_trainer.predict(new_dataset)
ct_pred = ct_predictions.predictions

# celltype : index mapping saved during fine-tuning
immune_label_idx_dict = target_dict_list[5]
# invert to index : celltype for direct lookup
idx_label_dict = {v: k for k, v in immune_label_idx_dict.items()}

# get the celltype for each predicted label id
ct_pred_id = ct_pred.argmax(-1)
ct_pred_label = [idx_label_dict[idx] for idx in ct_pred_id]

Finally, when I compared the Geneformer prediction with the celltype annotation from the dataset, I found that most predictions were different from the annotation.

import numpy as np
import scanpy as sc
import anndata

adata = anndata.read_h5ad("D:/jupyterNote/pySC/output/pbmc3k.h5ad")
adata.obs['geneformer_pred'] = ct_pred_label
sc.pl.umap(adata, color='geneformer_pred')
sc.pl.umap(adata, color='cell_type')
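Beyond eyeballing the two UMAPs, the disagreement can be quantified with a confusion table between the two annotations. A minimal sketch with hypothetical toy labels (with the real data you would pass `adata.obs["cell_type"]` and `adata.obs["geneformer_pred"]`):

```python
import pandas as pd

# Toy labels standing in for the adata.obs columns (hypothetical values).
truth = ["B", "T", "NK", "NK"]
pred = ["B", "T", "T", "NK"]

# Rows: original annotation; columns: Geneformer prediction.
confusion = pd.crosstab(
    pd.Series(truth, name="annotation"),
    pd.Series(pred, name="geneformer_pred"),
)

# Fraction of cells where the two annotations agree.
agreement = sum(t == p for t, p in zip(truth, pred)) / len(truth)
```
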

[UMAP colored by geneformer_pred]

[UMAP colored by cell_type]

Could you provide any suggestions on this? Any help would be appreciated.

I apologize for my mistakes. It turned out that I had used scaled data as input for tokenizing, so the predictions were incorrect. After using raw counts for tokenizing, the predictions seem generally accurate, despite the different cell type terminology between Geneformer and the 3k PBMCs dataset.

Also, I padded the whole dataset to the same length. Is that the right way to do it?

import numpy as np
from geneformer.pretrainer import token_dictionary

def preprocess_classifier_batch(cell_batch, max_len):
    # default to the longest cell in the batch
    if max_len is None:
        max_len = max(len(i) for i in cell_batch["input_ids"])
    def pad_label_example(example):
        #example["labels"] = np.pad(example["labels"], 
        #                           (0, max_len-len(example["input_ids"])), 
        #                           mode='constant', constant_values=-100)
        example["input_ids"] = np.pad(example["input_ids"], 
                                      (0, max_len-len(example["input_ids"])), 
                                      mode='constant', constant_values=token_dictionary.get("<pad>"))
        example["attention_mask"] = (example["input_ids"] != token_dictionary.get("<pad>")).astype(int)
        return example
    padded_batch = cell_batch.map(pad_label_example)
    return padded_batch

# Function to find the largest number smaller
# than or equal to N that is divisible by k
def find_largest_div(N, K):
    rem = N % K
    if(rem == 0):
        return N
    else:
        return N - rem
   
# pad all cells in the dataset to the same length
max_set_len = max(new_dataset["length"])
padded_dataset = preprocess_classifier_batch(new_dataset, max_set_len)
pd.DataFrame(padded_dataset)

# 3. load the fine-tuned model
# same as above

# 4. perform prediction 
ct_predictions = ft_trainer.predict(padded_dataset)
# same as above

# 5. plot umap 
# same as above

Thank you for your interest in Geneformer! I can’t see the end of the rank value encodings in the data frame since it’s cut off, but that looks generally correct for the padding - you just want to add the padding token at the end to make all the tensors the same length so they can be stacked into a batch for batched processing.
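To make the stacking point concrete, here is a standalone sketch of the same idea in pure numpy. The pad id of 0 is an assumption for illustration; in Geneformer it comes from `token_dictionary["<pad>"]`:

```python
import numpy as np

PAD_ID = 0  # assumed pad id; in Geneformer use token_dictionary["<pad>"]

def pad_cells(cells, max_len=None):
    """Right-pad each cell's token ids with PAD_ID and build attention masks."""
    if max_len is None:
        max_len = max(len(c) for c in cells)
    input_ids = np.full((len(cells), max_len), PAD_ID, dtype=np.int64)
    attention_mask = np.zeros((len(cells), max_len), dtype=np.int64)
    for i, c in enumerate(cells):
        input_ids[i, : len(c)] = c
        attention_mask[i, : len(c)] = 1  # real tokens get mask 1, padding stays 0
    return input_ids, attention_mask

ids, mask = pad_cells([[5, 9, 2], [7, 3]])
# ids  -> [[5, 9, 2], [7, 3, 0]]
# mask -> [[1, 1, 1], [1, 1, 0]]
```

With equal-length rows, the per-cell arrays can be stacked into a single batch tensor, and the attention mask tells the model to ignore the padded positions.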

On a side note, the datasets used for fine-tuning in the manuscript and example here were provided by one of the alternative methods, so they were chosen only for the purpose of comparison and do not necessarily represent the ideal dataset for fine-tuning for each tissue's cell types. If you are trying to train an ideal model for classification of any cell state (including PBMC cell type annotation), it would be best to train with multiple datasets in the training set, if available, to ensure the best generalizability, and to tune the fine-tuning hyperparameters on a validation set (which should be separate from the held-out test set used for evaluating the final best trained model).

The training set can be prepared to have the same labels as you'd like to classify in your final dataset and, if needed, can be balanced if there is an overwhelming majority of a particular state label (though in our experience the model is quite robust to imbalanced training datasets, likely because it has been pretrained on a large range of cells previously). Important hyperparameters to tune include max learning rate, learning schedule, warmup steps, and number of layers to freeze in the pretrained model, among others. Please see the example for hyperparameter tuning for more information.
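As one concrete example of the layer-freezing hyperparameter, a sketch using the Hugging Face `BertForSequenceClassification` layout. The tiny config is illustrative only; in practice you would load the pretrained Geneformer checkpoint with `from_pretrained(...)`, and the number of frozen layers is a tunable choice:

```python
from transformers import BertConfig, BertForSequenceClassification

# Tiny illustrative config; in practice load the pretrained Geneformer checkpoint.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=4,
                    num_attention_heads=2, intermediate_size=64, num_labels=3)
model = BertForSequenceClassification(config)

N_FROZEN = 2  # tunable: how many of the lowest encoder layers to keep frozen
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:N_FROZEN]:
    for param in layer.parameters():
        param.requires_grad = False
# The remaining encoder layers and the classification head stay trainable.
```
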

ctheodoris changed discussion status to closed
