|
|
---
|
|
|
license: apache-2.0
|
|
|
tags:
|
|
|
- biology
|
|
|
- single-cell
|
|
|
- cell-type-annotation
|
|
|
- large-language-model
|
|
|
- reasoning
|
|
|
- zero-shot
|
|
|
- few-shot
|
|
|
language:
|
|
|
- en
|
|
|
library_name: transformers
|
|
|
datasets:
|
|
|
- custom
|
|
|
pipeline_tag: text2text-generation
|
|
|
model_name: CellReasoner
|
|
|
author: guangshuo
|
|
|
---
|
|
|
|
|
|
# CellReasoner: A reasoning-enhanced large language model for cell type annotation π§¬π§
|
|
|
|
|
|
<div align="center">
|
|
|
|
|
|
[π Paper](#citation) | [π» GitHub](https://github.com/compbioNJU/CellReasoner)
|
|
|
|
|
|
</div>
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
## π Table of Contents
|
|
|
|
|
|
- [π CellReasoner: A reasoning-enhanced large language model for cell type annotation π§¬π§ ](#cellreasoner-a-reasoning-enhanced-large-language-model-for-cell-type-annotation-π§¬π§ )
|
|
|
- [π Table of Contents](#-table-of-contents)
|
|
|
- [π¬ Key Highlights](#-key-highlights)
|
|
|
- [π Key Results](#-key-results)
|
|
|
- [π§ Model Zoo](#-model-zoo)
|
|
|
- [ποΈββοΈ Training](#-training)
|
|
|
- [π Usage](#-usage)
|
|
|
- [π Citation](#citation)
|
|
|
|
|
|
---
|
|
|
|
|
|
### π¬ Key Highlights
|
|
|
|
|
|
- Only **a few expert-level reasoning samples** are needed to activate reasoning in a 7B LLM.
|
|
|
- **CellReasoner** achieves **expert-level interpretability** and **zero-/few-shot generalization**.
|
|
|
- Demonstrated **superior performance** across various **scRNA-seq** and **scATAC-seq** datasets.
|
|
|
- Compatible with **marker-by-marker annotation**, **ontology mapping**, and **biological reasoning**.
|
|
|
|
|
|
> π§ Less data, more reasoning: CellReasoner achieves accurate, interpretable, and scalable cell annotation with minimal supervision.
|
|
|
|
|
|
---
|
|
|
|
|
|
## π Key Results
|
|
|
|
|
|
### [PDAC dataset](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE197177)
|
|
|
|
|
|
| Model | Score |
|
|
|
|--------------------|-------|
|
|
|
| Deepseek-V3 | 0.50 |
|
|
|
| Deepseek-R1 | 0.53 |
|
|
|
| ChatGPT-o3 | 0.58 |
|
|
|
| ChatGPT-4o | 0.63 |
|
|
|
| singleR | 0.68 |
|
|
|
| **CellReasoner-7B** | **0.73** |
|
|
|
| **CellReasoner-32B** | **0.74** |
|
|
|
|
|
|
---
|
|
|
|
|
|
### [PBMC3K dataset](https://www.10xgenomics.com/cn/datasets/3-k-pbm-cs-from-a-healthy-donor-1-standard-1-1-0)
|
|
|
|
|
|
| Model | Score |
|
|
|
|--------------------|-------|
|
|
|
| Deepseek-V3 | 0.52 |
|
|
|
| Deepseek-R1 | 0.52 |
|
|
|
| ChatGPT-4o | 0.76 |
|
|
|
| ChatGPT-o3 | 0.85 |
|
|
|
| singleR | 0.83 |
|
|
|
| **CellReasoner-7B** | **0.87** |
|
|
|
| **CellReasoner-32B** | **0.84** |
|
|
|
|
|
|
---
|
|
|
|
|
|
## π§ Model Zoo
|
|
|
|
|
|
Our CellReasoner models are available on Hugging Face π€:
|
|
|
|
|
|
| Model | Backbone | Link |
|
|
|
|---------------------|----------------------------|------|
|
|
|
| **CellReasoner-7B** | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | [π€](https://huggingface.co/guangshuo/CellReasoner-7B) |
|
|
|
| **CellReasoner-32B** | [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | [π€](https://huggingface.co/guangshuo/CellReasoner-32B) |
|
|
|
|
|
|
---
|
|
|
|
|
|
## ποΈββοΈ Training
|
|
|
|
|
|
We use the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework for fine-tuning. It offers a flexible and efficient pipeline for supervised fine-tuning, LoRA, and multi-stage training strategies.
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
## π Usage
|
|
|
|
|
|
### π οΈ Step 1: Prepare Conda Environment
|
|
|
|
|
|
Make sure you have a working conda environment with the necessary dependencies installed. We recommend:
|
|
|
|
|
|
```bash
|
|
|
conda create -n cellreasoner python=3.11
|
|
|
conda activate cellreasoner
|
|
|
pip install -r requirements.txt
|
|
|
```
|
|
|
|
|
|
---
|
|
|
|
|
|
### π§ͺ Step 2: Preprocess Input Data
|
|
|
|
|
|
If your input is in **Seurat `.rds`** format, use the R preprocessing script:
|
|
|
|
|
|
```bash
|
|
|
Rscript s01.process_rds.R ./demo_data/pbmc_demo.rds ./output/ data/ranked_hvg.list
|
|
|
```
|
|
|
|
|
|
If your input is in **AnnData `.h5ad`** format, use the Python script:
|
|
|
|
|
|
```bash
|
|
|
python s01.process_h5ad.py \
|
|
|
--input_file ./demo_data/pbmc_demo.h5ad \
|
|
|
--output_path ./output_h5ad \
|
|
|
--ranked_hvg_list ./data/ranked_hvg.list
|
|
|
```
|
|
|
|
|
|
Both pipelines will generate the following output files:
|
|
|
|
|
|
```
|
|
|
output/
|
|
|
βββ pbmc_demo.h5
|
|
|
βββ pbmc_demo.meta.csv
|
|
|
```
|
|
|
|
|
|
---
|
|
|
|
|
|
### π§± Step 3: Build Dataset for CellReasoner
|
|
|
|
|
|
Build the model input file using:
|
|
|
|
|
|
```bash
|
|
|
python s02.build_dataset.py \
|
|
|
--h5_path ./output/pbmc_demo.h5 \
|
|
|
--output_path ./output/ \
|
|
|
--meta_file_path ./output/pbmc_demo.meta.csv
|
|
|
```
|
|
|
|
|
|
If your metadata includes cell type labels (for scoring), specify the column name:
|
|
|
|
|
|
```bash
|
|
|
python s02.build_dataset.py \
|
|
|
--h5_path ./output/pbmc_demo.h5 \
|
|
|
--output_path ./output/ \
|
|
|
--meta_file_path ./output/pbmc_demo.meta.csv \
|
|
|
--cell_type_column "seurat_annotations"
|
|
|
```
|
|
|
|
|
|
This will generate:
|
|
|
|
|
|
```
|
|
|
output/
|
|
|
βββ pbmc_demo_for_CellReasoner.json
|
|
|
```
|
|
|
|
|
|
---
|
|
|
|
|
|
### π€ Step 4: Run Inference with CellReasoner
|
|
|
|
|
|
```bash
|
|
|
python s03.inference.py \
|
|
|
--model "CellReasoner-7B" \
|
|
|
--output_path "./output" \
|
|
|
--input_json "./output/pbmc_demo_for_CellReasoner.json" \
|
|
|
--batch_size 2
|
|
|
```
|
|
|
|
|
|
Result:
|
|
|
|
|
|
```
|
|
|
output/
|
|
|
βββ pbmc_demo_CellReasoner_result.csv
|
|
|
```
|
|
|
|
|
|
---
|
|
|
|
|
|
### π Evaluation and Reasoning Visualization
|
|
|
|
|
|
To compute scores, generate plots, or view reasoning outputs, refer to:
|
|
|
|
|
|
```bash
|
|
|
s03.inference.ipynb
|
|
|
```
|
|
|
|
|
|
|
|
|
## Citation
|
|
|
|
|
|
```bibtex
|
|
|
@article {Cao2025.05.20.655112,
|
|
|
author = {Cao, Guangshuo and Shen, Yi and Wu, Jianghong and Chao, Haoyu and Chen, Ming and Chen, Dijun},
|
|
|
title = {CellReasoner: A reasoning-enhanced large language model for cell type annotation},
|
|
|
elocation-id = {2025.05.20.655112},
|
|
|
year = {2025},
|
|
|
doi = {10.1101/2025.05.20.655112},
|
|
|
URL = {https://www.biorxiv.org/content/early/2025/05/26/2025.05.20.655112},
|
|
|
eprint = {https://www.biorxiv.org/content/early/2025/05/26/2025.05.20.655112.full.pdf},
|
|
|
journal = {bioRxiv}
|
|
|
}
|
|
|
```
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|