Question Regarding Inference Behavior of vandijklab/C2S-Scale-Pythia-1b-pt

#2
by Moxshi000 - opened

Hello van Dijk Lab team and the Hugging Face community,

I've been experimenting with your C2S-Scale-Pythia-1b-pt model and had a question about its expected behavior at the public inference endpoint.

Based on the model card, the primary task is cell type prediction from a "cell sentence". To test this core functionality, I submitted a request containing a simple cell sentence composed of highly specific and classic gene markers for a hepatocyte, including genes like ALB, APOB, and TTR.
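For context, here is a minimal sketch of how such a request can be put together. The gene symbols and expression values are illustrative (only ALB, APOB, and TTR come from my actual test); the sketch assumes the usual C2S convention that a cell sentence is the cell's gene symbols ranked by descending expression, joined by spaces, and the plain `{"inputs": ...}` payload shape of the Hugging Face Inference API for text models:

```python
import json

# Illustrative hepatocyte marker genes with made-up expression values.
# In the C2S framework, a "cell sentence" is the space-separated list of
# gene symbols, ordered from highest to lowest expression.
expression = {"ALB": 500.0, "APOB": 320.0, "TTR": 410.0, "SERPINA1": 280.0}

# Rank genes by descending expression and join them into one sentence.
ranked_genes = sorted(expression, key=expression.get, reverse=True)
cell_sentence = " ".join(ranked_genes)

# Payload shape accepted by the Hugging Face Inference API for text models.
payload = json.dumps({"inputs": cell_sentence})

print(cell_sentence)  # ALB TTR APOB SERPINA1
print(payload)
```

Sending that payload is what produced the behavior described below.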

My expectation was to receive a simple string as a response, such as 'Hepatocyte', which would represent the model's classification of the input.

However, the response I received was not a cell type prediction. Instead, the model generated another long, unstructured list of genes, which included many common housekeeping genes. It behaved like a base language model simply continuing the sequence, rather than performing a classification.

Prior to this simple test, I also attempted more complex generative prompts, asking the model to create new cell sentences based on specific constraints. Those attempts also resulted in similarly incoherent outputs, where all instructions were ignored.

This behavior leads me to suspect that the public inference endpoint might be running the base Pythia pre-trained model rather than the version fine-tuned with the final classification head for cell type prediction.

Could you please clarify if this is the expected behavior? Is there a different API structure or method required to access the intended cell type prediction functionality of this model?

Thank you for your work on this specialized and interesting model, and for any guidance you can provide.

I ran into the same issue with both the larger C2S-Scale-1b and the Pythia-410m models. The only text the model generates is a list of genes (a cell sentence), and you cannot control the number of genes through your prompt. In my experience, the natural-language (plain-text) generation falls short of what the paper suggests. To predict cell types, I suggest following tutorial 4 (https://github.com/vandijklab/cell2sentence/blob/master/tutorials/c2s_tutorial_4_cell_type_prediction.ipynb) and using the predict_cell_types_of_data function instead. That function returns a simple label such as Hepatocyte — though in practice it may not be exactly "Hepatocyte", but rather a label from their training-data annotation, such as "CD8-positive, alpha-beta memory T cell".

If anyone else has managed to generate biological insight with the C2S framework, comments and corrections are very welcome. Thank you very much.
