File size: 5,856 Bytes
dc1f736 2d13d0b 0c0eb3a 2d13d0b 0c0eb3a 2d13d0b a6cd13b 2d13d0b 6498209 2d13d0b | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 | ---
license: apache-2.0
tags:
- biology
- single-cell
- cell-type-annotation
- large-language-model
- reasoning
- zero-shot
- few-shot
language:
- en
library_name: transformers
datasets:
- custom
pipeline_tag: text2text-generation
model_name: CellReasoner
author: guangshuo
---
# CellReasoner: A reasoning-enhanced large language model for cell type annotation π§¬π§
<div align="center">
[π Paper](#citation) | [π» GitHub](https://github.com/compbioNJU/CellReasoner)
</div>
---
---
## π Table of Contents
- [π CellReasoner: A reasoning-enhanced large language model for cell type annotation π§¬π§ ](#cellreasoner-a-reasoning-enhanced-large-language-model-for-cell-type-annotation-π§¬π§ )
- [π Table of Contents](#-table-of-contents)
- [π¬ Key Highlights](#-key-highlights)
- [π Key Results](#-key-results)
- [π§ Model Zoo](#-model-zoo)
- [ποΈββοΈ Training](#-training)
- [π Usage](#-usage)
- [π Citation](#citation)
---
### π¬ Key Highlights
- Only **a few expert-level reasoning samples** are needed to activate reasoning in a 7B LLM.
- **CellReasoner** achieves **expert-level interpretability** and **zero-/few-shot generalization**.
- Demonstrated **superior performance** across various **scRNA-seq** and **scATAC-seq** datasets.
- Compatible with **marker-by-marker annotation**, **ontology mapping**, and **biological reasoning**.
> π§ Less data, more reasoning: CellReasoner achieves accurate, interpretable, and scalable cell annotation with minimal supervision.
---
## π Key Results
### [PDAC dataset](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE197177)
| Model | Score |
|--------------------|-------|
| Deepseek-V3 | 0.50 |
| Deepseek-R1 | 0.53 |
| ChatGPT-o3 | 0.58 |
| ChatGPT-4o | 0.63 |
| singleR | 0.68 |
| **CellReasoner-7B** | **0.73** |
| **CellReasoner-32B** | **0.74** |
---
### [PBMC3K dataset](https://www.10xgenomics.com/cn/datasets/3-k-pbm-cs-from-a-healthy-donor-1-standard-1-1-0)
| Model | Score |
|--------------------|-------|
| Deepseek-V3 | 0.52 |
| Deepseek-R1 | 0.52 |
| ChatGPT-4o | 0.76 |
| ChatGPT-o3 | 0.85 |
| singleR | 0.83 |
| **CellReasoner-7B** | **0.87** |
| **CellReasoner-32B** | **0.84** |
---
## π§ Model Zoo
Our CellReasoner models are available on Hugging Face π€:
| Model | Backbone | Link |
|---------------------|----------------------------|------|
| **CellReasoner-7B** | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | [π€](https://huggingface.co/guangshuo/CellReasoner-7B) |
| **CellReasoner-32B** | [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | [π€](https://huggingface.co/guangshuo/CellReasoner-32B) |
---
## ποΈββοΈ Training
We use the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework for fine-tuning. It offers a flexible and efficient pipeline for supervised fine-tuning, LoRA, and multi-stage training strategies.
---
## π Usage
### π οΈ Step 1: Prepare Conda Environment
Make sure you have a working conda environment with the necessary dependencies installed. We recommend:
```bash
conda create -n cellreasoner python=3.11
conda activate cellreasoner
pip install -r requirements.txt
```
---
### π§ͺ Step 2: Preprocess Input Data
If your input is in **Seurat `.rds`** format, use the R preprocessing script:
```bash
Rscript s01.process_rds.R ./demo_data/pbmc_demo.rds ./output/ data/ranked_hvg.list
```
If your input is in **AnnData `.h5ad`** format, use the Python script:
```bash
python s01.process_h5ad.py \
--input_file ./demo_data/pbmc_demo.h5ad \
--output_path ./output_h5ad \
--ranked_hvg_list ./data/ranked_hvg.list
```
Both pipelines will generate the following output files:
```
output/
βββ pbmc_demo.h5
βββ pbmc_demo.meta.csv
```
---
### π§± Step 3: Build Dataset for CellReasoner
Build the model input file using:
```bash
python s02.build_dataset.py \
--h5_path ./output/pbmc_demo.h5 \
--output_path ./output/ \
--meta_file_path ./output/pbmc_demo.meta.csv
```
If your metadata includes cell type labels (for scoring), specify the column name:
```bash
python s02.build_dataset.py \
--h5_path ./output/pbmc_demo.h5 \
--output_path ./output/ \
--meta_file_path ./output/pbmc_demo.meta.csv \
--cell_type_column "seurat_annotations"
```
This will generate:
```
output/
βββ pbmc_demo_for_CellReasoner.json
```
---
### π€ Step 4: Run Inference with CellReasoner
```bash
python s03.inference.py \
--model "CellReasoner-7B" \
--output_path "./output" \
--input_json "./output/pbmc_demo_for_CellReasoner.json" \
--batch_size 2
```
Result:
```
output/
βββ pbmc_demo_CellReasoner_result.csv
```
---
### π Evaluation and Reasoning Visualization
To compute scores, generate plots, or view reasoning outputs, refer to:
```bash
s03.inference.ipynb
```
## Citation
```bibtex
@article {Cao2025.05.20.655112,
author = {Cao, Guangshuo and Shen, Yi and Wu, Jianghong and Chao, Haoyu and Chen, Ming and Chen, Dijun},
title = {CellReasoner: A reasoning-enhanced large language model for cell type annotation},
elocation-id = {2025.05.20.655112},
year = {2025},
doi = {10.1101/2025.05.20.655112},
URL = {https://www.biorxiv.org/content/early/2025/05/26/2025.05.20.655112},
eprint = {https://www.biorxiv.org/content/early/2025/05/26/2025.05.20.655112.full.pdf},
journal = {bioRxiv}
}
```
---
|