---
license: other
license_name: scienta-lab-eva-model-license
license_link: LICENSE
language:
  - en
tags:
  - biology
  - transcriptomics
  - rna-seq
  - gene-expression
  - foundation-model
  - single-cell
  - bulk-rna
  - immunology
library_name: transformers
pipeline_tag: feature-extraction
extra_gated_prompt: >-
  Before accessing EVA-RNA, please provide the following information.
  Your responses will be used solely to better understand our user community.
extra_gated_fields:
  Full name: text
  Affiliation (university, institute, or company): text
  I am a:
    type: select
    options:
      - Student (undergraduate or graduate)
      - PhD candidate
      - Academic researcher (postdoc, faculty, or staff scientist)
      - Industry professional
      - Other
  I accept the Scienta Lab EVA Model License:
    type: checkbox
---
|
|
|
|
|
# EVA-RNA: Foundation Model for Transcriptomics |
|
|
|
|
|
EVA-RNA is a transformer-based foundation model that produces sample-level and gene-level embeddings from RNA-seq profiles (bulk, microarray, and pseudobulked single-cell) in human and mouse.
|
|
|
|
|
## Installation |
|
|
|
|
|
We recommend installing with the [uv package manager](https://docs.astral.sh/uv/getting-started/installation/).
|
|
|
|
|
```bash
uv venv --python 3.10
source .venv/bin/activate
uv pip install transformers torch==2.6.0 scanpy anndata tqdm scipy scikit-misc
```
|
|
|
|
|
### Optional: Flash Attention |
|
|
|
|
|
EVA-RNA automatically uses Flash Attention when it is available, which lets it handle larger gene contexts. Flash Attention only runs on Ampere and newer GPUs (A100 and beyond). We recommend using the following wheel:
|
|
|
|
|
```bash
uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
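
To confirm that the wheel matches your environment, a quick import check (flash-attn releases expose a standard `__version__` attribute):

```bash
python -c "import flash_attn; print(flash_attn.__version__)"
```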
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python
import scanpy as sc
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

# Load example dataset (2,700 PBMCs, raw counts)
#
# NOTE: EVA is not meant to be used directly on single-cell data; it is
# designed primarily for bulk, microarray, and pseudobulked single-cell
# profiles. We use `pbmc3k` here purely as a convenient, quick-loading
# AnnData object. For single-cell data, pseudobulk by sample or cell type
# before encoding (see the sketch after this block).
adata = sc.datasets.pbmc3k()

# Subset to 2,000 highly variable genes for efficiency
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
adata = adata[:, adata.var.highly_variable].copy()

# Encode (gene symbols auto-converted, preprocessing applied, GPU used if available)
embeddings = model.encode_anndata(tokenizer, adata)
adata.obsm["X_eva"] = embeddings
```
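
As noted in the comments above, single-cell matrices should be pseudobulked before encoding. A minimal sketch, assuming an `adata.obs` column named `cell_type` (the column name is illustrative, not part of the EVA API):

```python
import anndata as ad
import numpy as np

# Sum raw counts within each group to form one pseudobulk sample per cell type;
# `encode_anndata` then normalizes and log1p-transforms them by default.
groups = adata.obs["cell_type"].astype(str).unique().tolist()
X = np.vstack([
    np.asarray(adata[adata.obs["cell_type"] == g].X.sum(axis=0)).ravel()
    for g in groups
])
pseudobulk = ad.AnnData(X=X, var=adata.var.copy())
pseudobulk.obs_names = groups

embeddings = model.encode_anndata(tokenizer, pseudobulk)
```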
|
|
|
|
|
### Options |
|
|
|
|
|
`model.encode_anndata()` accepts the following parameters; a combined usage sketch follows the list:
|
|
|
|
|
- `gene_column` — column in `adata.var` with gene identifiers (default: uses `adata.var_names`) |
|
|
- `species` — `"human"` or `"mouse"` for gene ID conversion (default: auto-detected) |
|
|
- `batch_size` — samples per inference batch (default: 32) |
|
|
- `device` — `"cpu"`, `"cuda"`, etc. (default: CUDA if available) |
|
|
- `show_progress` — show a progress bar (default: True) |
|
|
- `preprocess` — apply library-size normalization + log1p (default: True); set to False if data is already log-transformed |
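
For example, encoding pre-normalized mouse data on a specific device (the `gene_column` value below is illustrative):

```python
embeddings = model.encode_anndata(
    tokenizer,
    adata,
    gene_column="ensembl_id",  # illustrative: a column in adata.var
    species="mouse",
    batch_size=64,
    device="cuda:0",
    show_progress=False,
    preprocess=False,  # data is already library-size normalized and log1p-transformed
)
```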
|
|
|
|
|
## Advanced: Raw Tensor API |
|
|
|
|
|
For users who need direct control over inputs (mixed precision is applied automatically): |
|
|
|
|
|
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Gene IDs must be NCBI GeneIDs as strings
gene_ids = ["7157", "675", "672"]  # TP53, BRCA2, BRCA1
expression_values = [5.5, 3.2, 4.1]  # log1p-normalized

inputs = tokenizer(gene_ids, expression_values, padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.inference_mode():
    outputs = model(**inputs)

sample_embedding = outputs.cls_embedding   # (1, 256)
gene_embeddings = outputs.gene_embeddings  # (1, 3, 256)
```
|
|
|
|
|
### Batch Processing |
|
|
|
|
|
```python
batch_gene_ids = [
    ["7157", "675", "672"],
    ["7157", "1956", "5290"],
]
batch_expression = [
    [5.5, 3.2, 4.1],
    [2.1, 6.3, 1.8],
]

inputs = tokenizer(batch_gene_ids, batch_expression, padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.inference_mode():
    outputs = model(**inputs)

sample_embeddings = outputs.cls_embedding  # (2, 256)
```
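
The pooled embeddings can be compared directly; for example, cosine similarity between the two samples above:

```python
import torch.nn.functional as F

# Cosine similarity between the two (256-dim) sample embeddings
similarity = F.cosine_similarity(sample_embeddings[0], sample_embeddings[1], dim=0)
print(f"sample similarity: {similarity.item():.3f}")
```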
|
|
|
|
|
## GPU and Precision |
|
|
|
|
|
EVA-RNA automatically selects a compute precision based on the available hardware:
|
|
|
|
|
- **Ampere+ GPUs** (A100, H100, RTX 30/40 series): bfloat16 |
|
|
- **Older CUDA GPUs** (V100, RTX 20 series): float16 |
|
|
- **CPU**: full precision (float32) |
|
|
|
|
|
No manual `torch.autocast()` is needed. |
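
For reference, the selection behaves like the following sketch (an illustration of the table above, not the model's internal code):

```python
import torch

# bfloat16 on Ampere+ CUDA devices, float16 on older CUDA devices, float32 on CPU
if not torch.cuda.is_available():
    dtype = torch.float32
elif torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16
```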
|
|
|
|
|
> **Note on Flash Attention constraints:** When flash attention is installed and an
> Ampere+ GPU is detected, the model uses flash attention layers. These layers
> **require CUDA and half-precision inputs**. If you move the model to CPU, you will
> get a clear error asking you to move it back to the GPU. If you pass `autocast=False`,
> autocast is re-enabled automatically with a warning, since flash attention cannot
> run in full precision.
|
|
|
|
|
### Disabling Automatic Mixed Precision |
|
|
|
|
|
For advanced use cases requiring manual precision control, pass `autocast=False`. This only takes effect when flash attention is **not** active (i.e., on older GPUs or when flash attention is not installed):
|
|
|
|
|
```python
model = model.to("cuda").eval()

with torch.inference_mode():
    # Disable automatic mixed precision (ignored when flash attention is active)
    outputs = model(**inputs, autocast=False)

# Or via the sample_embedding helper
embedding = model.sample_embedding(
    gene_ids=gene_ids,
    expression_values=values,
    autocast=False,
)
```
|
|
|
|
|
## Converting Gene Symbols to NCBI Gene IDs |
|
|
|
|
|
The tokenizer vocabulary uses NCBI GeneIDs. A built-in gene mapper is included to convert gene symbols or Ensembl IDs:
|
|
|
|
|
```python
tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

# Available mappings:
#   "symbol_to_ncbi"       – human gene symbols → NCBI GeneIDs
#   "ensembl_to_ncbi"      – human Ensembl IDs → NCBI GeneIDs
#   "symbol_to_ncbi_mouse" – mouse gene symbols → NCBI GeneIDs
mapper = tokenizer.gene_mapper["symbol_to_ncbi"]

gene_symbols = ["TP53", "BRCA2", "BRCA1"]
gene_ids = [mapper[s] for s in gene_symbols]
# gene_ids = ["7157", "675", "672"]

expression_values = [5.5, 3.2, 4.1]
inputs = tokenizer(gene_ids, expression_values, padding=True, return_tensors="pt")
```
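
Symbols absent from the mapping raise a `KeyError` in the comprehension above. Assuming the mapper is dict-like (as the indexing suggests), a hedged way to drop unmapped genes along with their expression values:

```python
# Keep only genes with a known NCBI GeneID, and their matching expression values
pairs = [
    (mapper[s], v)
    for s, v in zip(gene_symbols, expression_values)
    if s in mapper
]
gene_ids = [g for g, _ in pairs]
expression_values = [v for _, v in pairs]
```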
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@article{scienta2026evauniversalmodelimmune,
  title={EVA: Towards a universal model of the immune system},
  author={Ethan Bandasack and Vincent Bouget and Apolline Bruley and Yannis Cattan and Charlotte Claye and Matthew Corney and Julien Duquesne and Karim El Kanbi and Aziz Fouché and Pierre Marschall and Francesco Strozzi},
  year={2026},
  eprint={2602.10168},
  archivePrefix={arXiv},
  primaryClass={q-bio.QM},
  url={https://arxiv.org/abs/2602.10168},
}
```
|
|
|
|
|
## License |
|
|
|
|
|
[Scienta Lab EVA Model License](LICENSE) |
|
|
|