---
license: apache-2.0
library_name: pytorch
tags:
- biology
- genomics
- single-cell
- transformer
- diffusion
- foundation-model
pipeline_tag: feature-extraction
---
# ScDiVa: Masked Discrete Diffusion for Joint Modeling of Single-Cell Identity and Expression
[**📄 arXiv Paper**](https://arxiv.org/abs/2602.03477) | [**💻 GitHub Repository**](https://github.com/wangmingxuan666/ScDiVa) | [**📊 Dataset**](https://huggingface.co/datasets/warming666/ScDiVa)
## 🌟 Model Summary
**ScDiVa** (Single-cell Deep Variational Analysis) is a **94.5M parameter** foundation model pre-trained on **59 million** single-cell transcriptomes. It uses a **Masked Discrete Diffusion** framework to model gene expression as an unordered set, capturing the complex topology of gene regulatory networks.
Unlike traditional autoregressive models, ScDiVa employs a bidirectional Transformer encoder with **SwiGLU** activations, **Rotary Positional Embeddings (RoPE)**, and **RMSNorm**, optimized for:
* **Reconstruction**
* **Cell Type Annotation**
* **Multi-batch Integration**
* **Gene Perturbation Prediction**
* **Gene Regulatory Network (GRN) Inference**
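To make the feed-forward design concrete, here is a minimal SwiGLU block in PyTorch. The class name, dimensions, and bias settings are illustrative assumptions, not ScDiVa's actual module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Illustrative SwiGLU feed-forward block (not ScDiVa's exact layer)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)  # gating projection
        self.w_up = nn.Linear(dim, hidden, bias=False)    # value projection
        self.w_down = nn.Linear(hidden, dim, bias=False)  # projection back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate) elementwise-multiplied with the up projection, then down-projected
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 512)
out = SwiGLU(512, 1024)(x)
print(out.shape)  # torch.Size([2, 512])
```

SwiGLU replaces the plain two-layer MLP of the original Transformer with a gated variant, which is the design choice the model summary refers to.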
## 🏗️ Model Specifications
| Attribute | Value |
| :--- | :--- |
| **Parameters** | ~94.5M |
| **Layers** | 12 |
| **Hidden Size** | 512 |
| **Attention Heads** | 8 |
| **Max Sequence Length** | 1,200 genes |
| **Vocabulary** | 41,818 genes |
| **Training Objective** | Dual Denoising (Identity Classification + Value Regression) |
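The dual denoising objective in the table combines a classification loss over masked gene identities with a regression loss over their expression values. The following is a hedged sketch of that idea with dummy tensors; the tensor names, shapes, and equal loss weighting are assumptions, not ScDiVa's exact loss:

```python
import torch
import torch.nn.functional as F

# Dummy predictions/targets for a batch of 4 cells with 16 masked positions each
batch, n_masked, vocab = 4, 16, 41818
identity_logits = torch.randn(batch, n_masked, vocab)       # which gene sits at each masked slot
identity_targets = torch.randint(0, vocab, (batch, n_masked))
value_preds = torch.randn(batch, n_masked)                  # predicted expression values
value_targets = torch.randn(batch, n_masked)

# Identity classification: cross-entropy over the gene vocabulary
loss_id = F.cross_entropy(identity_logits.reshape(-1, vocab), identity_targets.reshape(-1))
# Value regression: mean squared error on the masked expression values
loss_val = F.mse_loss(value_preds, value_targets)

loss = loss_id + loss_val  # equal weighting assumed for illustration
print(f"total loss: {loss.item():.3f}")
```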
---
## 🚀 Quick Start
To use ScDiVa, you need the `modeling_scdiva.py` file (included in this repository).
### 1. Installation
```bash
pip install torch numpy huggingface_hub
```
### 2. Loading the Pre-trained Model
You can load the model directly using the `from_pretrained` method defined in our architecture.
```python
from modeling_scdiva import ScDiVaModel
import torch
# Load the model directly from Hugging Face
# This will automatically download model.safetensors and config
model = ScDiVaModel.from_pretrained("warming666/ScDiVa")
model.eval()
# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print(f"✅ ScDiVa loaded successfully on {device}")
```
### 3. Basic Inference Example
```python
# Create a dummy input (batch size: 2, num genes: 41,818)
# In practice, replace this with your log-normalized gene expression matrix
input_data = torch.randn(2, 41818).to(device)

with torch.no_grad():
    # Get latent embeddings (for clustering/integration)
    outputs = model.encode(input_data)
    embeddings = outputs['latent']
    print(f"Latent Embedding Shape: {embeddings.shape}")  # [2, 128]

    # Get annotation logits
    predictions = model.predict(input_data, task="annotation")
    print(f"Annotation Logits Shape: {predictions.shape}")  # [2, 100]
```
---
## 📂 Repository Structure
This repository contains the core pre-trained weights and fine-tuned checkpoints for downstream tasks.
```text
warming666/ScDiVa
├── config.json              # Model configuration
├── model.safetensors        # 🔥 Pre-trained base weights (94.5M)
├── modeling_scdiva.py       # Model architecture definition code
└── downstream/              # 📂 Fine-tuned checkpoints
    ├── Multi-batch_Integration/
    │   ├── immune.pt
    │   ├── pbmc12k.pt
    │   └── ...
    ├── Annotation_FT/       # Fine-tuned for specific tissues
    │   ├── hpancreas.pt
    │   └── ms.pt
    ├── Annotation_Zeroshot/ # Weights for zero-shot projection
    └── Perturbation/        # Weights for gene perturbation tasks
```
To load a specific downstream model (e.g., for Batch Integration on Immune dataset), you can download the specific `.pt` file from the `downstream` folder and load it using `torch.load()`.
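A minimal sketch of that loading pattern, assuming each downstream checkpoint is a plain PyTorch state dict (the repository does not document the checkpoint format, so confirm against `modeling_scdiva.py`). To keep the example self-contained, it round-trips a dummy state dict instead of downloading; for a real checkpoint you would first fetch the file, e.g. with `huggingface_hub.hf_hub_download(repo_id="warming666/ScDiVa", filename="downstream/Multi-batch_Integration/immune.pt")`:

```python
import os
import tempfile
import torch

# Dummy state dict standing in for a downloaded downstream checkpoint
dummy_state = {"encoder.weight": torch.zeros(4, 4)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "immune.pt")
    torch.save(dummy_state, path)

    # Standard loading pattern: map to CPU, then feed into model.load_state_dict(...)
    state_dict = torch.load(path, map_location="cpu")

print(sorted(state_dict.keys()))  # ['encoder.weight']
```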
---
## 📊 Benchmarks
ScDiVa achieves state-of-the-art performance across multiple benchmarks:
* **Batch Integration**: Top-tier performance on PBMC12k (Avg-Bio: **0.9566**) and BMMC datasets.
* **Annotation**: **98.6%** accuracy on hPancreas fine-tuning; **91.4%** average accuracy on zero-shot tasks.
* **Perturbation**: Pearson correlation of **0.837** on Adamson dataset.
For detailed results, please refer to our [arXiv paper](https://arxiv.org/abs/2602.03477).
---
## ⚠️ Limitations & Bias
* **Input Normalization**: The model expects log-normalized gene expression data. Raw counts may lead to suboptimal performance.
* **Gene Vocabulary**: Inputs must be aligned to the specific 41,818 gene vocabulary used during pre-training.
* **Not for Clinical Use**: This model is for research purposes only and has not been validated for clinical diagnosis or treatment.
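Regarding the normalization requirement above, here is a minimal sketch of one common log-normalization recipe (counts-per-10k followed by `log1p`, as in the scanpy convention). The target sum of 10,000 is an assumption; confirm the exact preprocessing against the ScDiVa repository:

```python
import numpy as np

# Toy raw-count matrix: 2 cells x 3 genes
counts = np.array([[10, 0, 90],
                   [5, 5, 0]], dtype=np.float64)

size_factors = counts.sum(axis=1, keepdims=True)  # total counts per cell
normalized = counts / size_factors * 1e4          # counts per 10k (assumed target sum)
log_norm = np.log1p(normalized)                   # natural log(1 + x)

print(log_norm.shape)  # (2, 3)
```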
---
## 📄 Citation
If you use ScDiVa in your research, please cite:
```bibtex
@article{wang2026scdiva,
  title={ScDiVa: Masked Discrete Diffusion for Joint Modeling of Single-Cell Identity and Expression},
  author={Wang, Mingxuan and Chen, Cheng and Jiang, Gaoyang and Ren, Zijia and Zhao, Chuangxin and Shi, Lu and Ma, Yanbiao},
  journal={arXiv preprint arXiv:2602.03477},
  year={2026}
}
```
Thanks to everyone who has helped with this project.