|
|
--- |
|
|
license: apache-2.0 |
|
|
library_name: pytorch |
|
|
tags: |
|
|
- biology |
|
|
- genomics |
|
|
- single-cell |
|
|
- transformer |
|
|
- diffusion |
|
|
- foundation-model |
|
|
pipeline_tag: feature-extraction |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
# ScDiVa: Masked Discrete Diffusion for Joint Modeling of Single-Cell Identity and Expression |
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://huggingface.co/warming666/ScDiVa/resolve/main/assets/scDiVa.png" alt="ScDiVa Architecture" width="800"/> |
|
|
</p> |
|
|
|
|
|
[**arXiv Paper**](https://arxiv.org/abs/2602.03477) | [**GitHub Repository**](https://github.com/wangmingxuan666/ScDiVa) | [**Dataset**](https://huggingface.co/datasets/warming666/ScDiVa)
|
|
|
|
|
</div> |
|
|
|
|
|
## Model Summary
|
|
|
|
|
**ScDiVa** (Single-cell Deep Variational Analysis) is a **94.5M parameter** foundation model pre-trained on **59 million** single-cell transcriptomes. It utilizes a novel **Masked Discrete Diffusion** framework to model gene expression as an unordered set, effectively capturing the complex topology of gene regulatory networks. |
|
|
|
|
|
Unlike traditional autoregressive models, ScDiVa employs a bidirectional Transformer encoder with **SwiGLU** activations, **Rotary Positional Embeddings (RoPE)**, and **RMSNorm**, optimized for: |
|
|
|
|
|
* **Reconstruction** |
|
|
* **Cell Type Annotation** |
|
|
* **Multi-batch Integration** |
|
|
* **Gene Perturbation Prediction** |
|
|
* **Gene Regulatory Network (GRN) Inference** |
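To make the dual denoising objective concrete, here is a minimal, illustrative PyTorch sketch of one training step: a random subset of gene tokens is masked, and the model is trained to recover both the gene identity (classification) and its expression value (regression). All tensor shapes and the stand-in prediction heads below are hypothetical simplifications for illustration, not the actual ScDiVa API.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of 2 cells, 16 gene tokens, vocabulary of 41,818 genes
B, L, V = 2, 16, 41818
gene_ids = torch.randint(0, V, (B, L))  # gene identity tokens
values = torch.rand(B, L)               # normalized expression values

# Corruption step: mask a random subset of positions
mask = torch.rand(B, L) < 0.3
mask[0, 0] = True  # ensure at least one masked position

# Stand-ins for the model's two output heads (illustrative only)
id_logits = torch.randn(B, L, V, requires_grad=True)  # identity classification head
value_pred = torch.randn(B, L, requires_grad=True)    # value regression head

# Dual denoising loss, computed only on masked positions
id_loss = F.cross_entropy(id_logits[mask], gene_ids[mask])
value_loss = F.mse_loss(value_pred[mask], values[mask])
loss = id_loss + value_loss
loss.backward()
```

The key point is that both loss terms are evaluated only at masked positions, so the model must denoise identity and value jointly rather than predicting genes in a fixed order.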
|
|
|
|
|
## Model Specifications
|
|
|
|
|
| Attribute | Value | |
|
|
| :--- | :--- | |
|
|
| **Parameters** | ~94.5M | |
|
|
| **Layers** | 12 | |
|
|
| **Hidden Size** | 512 | |
|
|
| **Attention Heads** | 8 | |
|
|
| **Max Sequence Length** | 1,200 genes | |
|
|
| **Vocabulary** | 41,818 genes | |
|
|
| **Training Objective** | Dual Denoising (Identity Classification + Value Regression) | |
|
|
|
|
|
--- |
|
|
|
|
|
## Quick Start
|
|
|
|
|
To use ScDiVa, you need the `modeling_scdiva.py` file (included in this repository). |
|
|
|
|
|
### 1. Installation |
|
|
|
|
|
```bash
pip install torch numpy huggingface_hub
```
|
|
|
|
|
### 2. Loading the Pre-trained Model |
|
|
|
|
|
You can load the model directly using the `from_pretrained` method defined in `modeling_scdiva.py`.
|
|
|
|
|
```python
import torch

from modeling_scdiva import ScDiVaModel

# Load the model directly from Hugging Face.
# This automatically downloads model.safetensors and the config.
model = ScDiVaModel.from_pretrained("warming666/ScDiVa")
model.eval()

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

print(f"ScDiVa loaded successfully on {device}")
```
|
|
|
|
|
### 3. Basic Inference Example |
|
|
|
|
|
```python
# Create a dummy input (batch size: 2, num genes: 41,818).
# In practice, replace this with your normalized gene expression matrix.
input_data = torch.randn(2, 41818).to(device)

with torch.no_grad():
    # Get latent embeddings (for clustering/integration)
    outputs = model.encode(input_data)
    embeddings = outputs['latent']
    print(f"Latent Embedding Shape: {embeddings.shape}")  # [2, 128]

    # Get annotation logits
    predictions = model.predict(input_data, task="annotation")
    print(f"Annotation Logits Shape: {predictions.shape}")  # [2, 100]
```
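The latent embeddings from `model.encode` plug into standard clustering and integration workflows. As a minimal illustration (using random dummy embeddings in place of real model output), pairwise cosine similarity between cells can be computed as:

```python
import torch
import torch.nn.functional as F

# Dummy stand-in for embeddings returned by model.encode (shape [cells, 128])
embeddings = torch.randn(4, 128)

# L2-normalize each row, then a matrix product gives pairwise cosine similarity
normed = F.normalize(embeddings, dim=1)
similarity = normed @ normed.T  # shape [4, 4]; diagonal entries are 1.0

print(similarity.shape)
```

The resulting similarity matrix can feed into graph-based clustering (e.g., building a k-nearest-neighbor graph over cells).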
|
|
|
|
|
--- |
|
|
|
|
|
## Repository Structure
|
|
|
|
|
This repository contains the core pre-trained weights and fine-tuned checkpoints for downstream tasks. |
|
|
|
|
|
```text
warming666/ScDiVa
├── config.json              # Model configuration
├── model.safetensors        # Pre-trained base weights (94.5M)
├── modeling_scdiva.py       # Model architecture definition code
└── downstream/              # Fine-tuned checkpoints
    ├── Multi-batch_Integration/
    │   ├── immune.pt
    │   ├── pbmc12k.pt
    │   └── ...
    ├── Annotation_FT/       # Fine-tuned for specific tissues
    │   ├── hpancreas.pt
    │   └── ms.pt
    ├── Annotation_Zeroshot/ # Weights for zero-shot projection
    └── Perturbation/        # Weights for gene perturbation tasks
```
|
|
|
|
|
To load a specific downstream checkpoint (e.g., for batch integration on the Immune dataset), download the corresponding `.pt` file from the `downstream` folder and load it with `torch.load()`.
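For example, the checkpoint can be fetched with `hf_hub_download` from the `huggingface_hub` package and restored via the usual `torch.load` / `load_state_dict` pattern. The snippet below demonstrates that pattern with a locally created dummy checkpoint, since the exact contents of each `.pt` file (plain state dict vs. wrapped dict) are not documented here.

```python
import torch

# In practice, fetch the checkpoint from the Hub first, e.g.:
#   from huggingface_hub import hf_hub_download
#   ckpt_path = hf_hub_download(repo_id="warming666/ScDiVa",
#                               filename="downstream/Multi-batch_Integration/immune.pt")

# Dummy model and checkpoint, just to illustrate the loading pattern
model = torch.nn.Linear(8, 4)
torch.save(model.state_dict(), "immune_dummy.pt")

# map_location="cpu" makes loading work without a GPU;
# weights_only=True avoids executing arbitrary pickled code
state_dict = torch.load("immune_dummy.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()
```

If a checkpoint turns out to wrap its weights (e.g., under a `"state_dict"` key), unwrap it before calling `load_state_dict`.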
|
|
|
|
|
--- |
|
|
|
|
|
## Benchmarks
|
|
|
|
|
ScDiVa achieves state-of-the-art performance across multiple benchmarks: |
|
|
|
|
|
* **Batch Integration**: Top-tier performance on PBMC12k (Avg-Bio: **0.9566**) and BMMC datasets. |
|
|
* **Annotation**: **98.6%** accuracy on hPancreas fine-tuning; **91.4%** average accuracy on zero-shot tasks. |
|
|
* **Perturbation**: Pearson correlation of **0.837** on Adamson dataset. |
|
|
|
|
|
For detailed results, please refer to our [arXiv paper](https://arxiv.org/abs/2602.03477).
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations & Bias
|
|
|
|
|
* **Input Normalization**: The model expects log-normalized gene expression data. Raw counts may lead to suboptimal performance. |
|
|
* **Gene Vocabulary**: Inputs must be aligned to the specific 41,818 gene vocabulary used during pre-training. |
|
|
* **Not for Clinical Use**: This model is for research purposes only and has not been validated for clinical diagnosis or treatment. |
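As a minimal sketch of the expected preprocessing (check the GitHub repository for the exact normalization used in pre-training — the scale factor here is an assumption), the standard log-normalization of raw counts in single-cell workflows looks like:

```python
import numpy as np

# Raw counts: 2 cells x 5 genes (toy example)
counts = np.array([[0, 3, 10, 0, 7],
                   [5, 0, 2, 1, 12]], dtype=np.float64)

# Scale each cell to a fixed total (1e4 is a common convention),
# then apply log1p to get log-normalized expression
totals = counts.sum(axis=1, keepdims=True)
log_norm = np.log1p(counts / totals * 1e4)

print(log_norm.shape)  # (2, 5)
```

Zero counts stay zero under this transform, and each cell's library size is normalized away before the log.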
|
|
|
|
|
--- |
|
|
|
|
|
## Citation
|
|
|
|
|
If you use ScDiVa in your research, please cite: |
|
|
|
|
|
```bibtex
@article{wang2026scdiva,
  title={ScDiVa: Masked Discrete Diffusion for Joint Modeling of Single-Cell Identity and Expression},
  author={Wang, Mingxuan and Chen, Cheng and Jiang, Gaoyang and Ren, Zijia and Zhao, Chuangxin and Shi, Lu and Ma, Yanbiao},
  journal={arXiv preprint arXiv:2602.03477},
  year={2026}
}
```
|
|
|
|
|
<div align="center"> |
|
|
<sub>Thank you to everyone who has helped me.</sub> |
|
|
</div> |
|
|
|
|
|