license: other
license_name: scienta-lab-eva-model-license
license_link: LICENSE
language:
  - en
tags:
  - biology
  - transcriptomics
  - rna-seq
  - gene-expression
  - foundation-model
  - single-cell
  - bulk-rna
  - immunology
library_name: transformers
pipeline_tag: feature-extraction
extra_gated_prompt: >-
  Before accessing EVA-RNA, please provide the following information. Your
  responses will be used solely to better understand our user community.
extra_gated_fields:
  Full name: text
  Affiliation (university, institute, or company): text
  I am a:
    type: select
    options:
      - Student (undergraduate or graduate)
      - PhD candidate
      - Academic researcher (postdoc, faculty, or staff scientist)
      - Industry professional
      - Other
  I accept the Scienta Lab EVA Model License:
    type: checkbox

EVA-RNA: Foundation Model for Transcriptomics

EVA-RNA is a transformer-based foundation model that produces sample-level and gene-level embeddings from RNA-seq profiles (bulk, microarray, and pseudobulked single-cell) in human and mouse.

Installation

We recommend installing with the uv package manager.

uv venv --python 3.10
source .venv/bin/activate
uv pip install transformers torch==2.6.0 scanpy anndata tqdm scipy scikit-misc

Optional: Flash Attention

To handle larger gene contexts, EVA-RNA automatically uses Flash Attention when it is installed. Flash Attention requires an Ampere or newer GPU (A100 and beyond). We recommend installing the following wheel.

uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
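You can verify that the wheel installed correctly before loading the model; a quick sanity check, independent of EVA-RNA itself:

```python
import importlib.util

# True when the flash_attn package is importable in the current environment
has_flash = importlib.util.find_spec("flash_attn") is not None
print("Flash Attention available:", has_flash)
```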

Quick Start

import scanpy as sc
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

# Load example dataset (2,700 PBMCs, raw counts)
#
# NOTE: EVA is not meant to be directly used on single-cell data,
# as it's designed primarily for bulk, microarray and pseudobulked
# single-cell. We use `pbmc3k` here purely for convenience as a
# quick-loading AnnData object. For single-cell data, pseudobulk by
# sample or cell type before encoding.
adata = sc.datasets.pbmc3k()

# Subset to 2,000 highly variable genes for efficiency
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
adata = adata[:, adata.var.highly_variable].copy()

# Encode (gene symbols auto-converted, preprocessing applied, GPU used if available)
embeddings = model.encode_anndata(tokenizer, adata)
adata.obsm["X_eva"] = embeddings
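The comment above recommends pseudobulking single-cell data before encoding. A minimal sketch of pseudobulking by summing raw counts per group, using pandas (the grouping labels here are illustrative; in practice they would come from a column of `adata.obs` such as a sample ID or cell-type annotation):

```python
import numpy as np
import pandas as pd

def pseudobulk(counts: np.ndarray, groups) -> pd.DataFrame:
    """Sum raw counts over all cells sharing a group label.

    counts: (n_cells, n_genes) dense matrix of raw counts
    groups: length-n_cells labels (e.g. sample ID or cell type)
    """
    df = pd.DataFrame(counts)
    return df.groupby(pd.Series(groups, name="group")).sum()

# Toy example: 4 cells, 3 genes, two groups
counts = np.array([[1, 0, 2],
                   [3, 1, 0],
                   [0, 2, 2],
                   [1, 1, 1]])
pb = pseudobulk(counts, ["B", "B", "T", "T"])
```

The resulting groups-by-genes matrix can then be wrapped in an AnnData object and passed to encode_anndata.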

Options

model.encode_anndata() accepts the following parameters:

  • gene_column — column in adata.var with gene identifiers (default: uses adata.var_names)
  • species — "human" or "mouse" for gene ID conversion (default: auto-detected)
  • batch_size — samples per inference batch (default: 32)
  • device — "cpu", "cuda", etc. (default: CUDA if available)
  • show_progress — show a progress bar (default: True)
  • preprocess — apply library-size normalization + log1p (default: True); set to False if data is already log-transformed
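The default preprocessing (library-size normalization followed by log1p) can be reproduced manually when you need to inspect or customize it. A sketch with NumPy; note that the target sum of 1e4 is an assumption for illustration, not EVA-RNA's documented internal value:

```python
import numpy as np

def libsize_log1p(X: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Scale each row to `target_sum` total counts, then apply log1p.

    X: (n_samples, n_genes) matrix of raw counts.
    NOTE: target_sum=1e4 is an assumed value, not EVA-RNA's documented one.
    """
    totals = X.sum(axis=1, keepdims=True)
    return np.log1p(X / totals * target_sum)

X = np.array([[10.0, 90.0], [50.0, 50.0]])
Xn = libsize_log1p(X)
```

If you normalize yourself, pass preprocess=False to encode_anndata so the data is not transformed twice.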

Advanced: Raw Tensor API

For users who need direct control over inputs (mixed precision is applied automatically):

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Gene IDs must be NCBI GeneIDs as strings
gene_ids = ["7157", "675", "672"]  # TP53, BRCA2, BRCA1
expression_values = [5.5, 3.2, 4.1]  # log1p-normalized

inputs = tokenizer(gene_ids, expression_values, padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.inference_mode():
    outputs = model(**inputs)

sample_embedding = outputs.cls_embedding     # (1, 256)
gene_embeddings = outputs.gene_embeddings   # (1, 3, 256)

Batch Processing

batch_gene_ids = [
    ["7157", "675", "672"],
    ["7157", "1956", "5290"],
]
batch_expression = [
    [5.5, 3.2, 4.1],
    [2.1, 6.3, 1.8],
]

inputs = tokenizer(batch_gene_ids, batch_expression, padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.inference_mode():
    outputs = model(**inputs)
sample_embeddings = outputs.cls_embedding  # (2, 256)
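Once sample embeddings are extracted, standard tensor operations apply for downstream comparison; for example, cosine similarity between the two samples in the batch above (random tensors stand in for real model outputs here):

```python
import torch

torch.manual_seed(0)

# Stand-in for outputs.cls_embedding of shape (2, 256); real embeddings
# would come from the model
emb = torch.randn(2, 256)
sim = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2]).item()
print(f"cosine similarity: {sim:.3f}")
```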

GPU and Precision

EVA-RNA automatically applies mixed precision for optimal performance:

  • Ampere+ GPUs (A100, H100, RTX 30/40 series): bfloat16
  • Older CUDA GPUs (V100, RTX 20 series): float16
  • CPU: full precision (float32)

No manual torch.autocast() is needed.

Note on Flash Attention constraints: When Flash Attention is installed and an Ampere+ GPU is detected, the model uses Flash Attention layers. These layers require CUDA and half-precision inputs. If you move the model to CPU you will get a clear error asking you to move it back to GPU. If you pass autocast=False, autocast is re-enabled automatically with a warning, since Flash Attention cannot run in full precision.

Disabling Automatic Mixed Precision

For advanced use cases requiring manual precision control, pass autocast=False. This only takes effect when flash attention is not active (i.e., on older GPUs or when flash attention is not installed):

model = model.to("cuda").eval()

with torch.inference_mode():
    # Disable automatic mixed precision (ignored when flash attention is active)
    outputs = model(**inputs, autocast=False)

    # Or via sample_embedding
    embedding = model.sample_embedding(
        gene_ids=gene_ids,
        expression_values=values,
        autocast=False,
    )

Converting Gene Symbols to NCBI Gene IDs

The tokenizer vocabulary uses NCBI GeneIDs. A built-in gene mapper is included to convert gene symbols or Ensembl IDs:

tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

# Available mappings:
#   "symbol_to_ncbi"         – human gene symbols → NCBI GeneIDs
#   "ensembl_to_ncbi"        – human Ensembl IDs  → NCBI GeneIDs
#   "symbol_to_ncbi_mouse"   – mouse gene symbols → NCBI GeneIDs

mapper = tokenizer.gene_mapper["symbol_to_ncbi"]

gene_symbols = ["TP53", "BRCA2", "BRCA1"]
gene_ids = [mapper[s] for s in gene_symbols]
# gene_ids = ["7157", "675", "672"]

expression_values = [5.5, 3.2, 4.1]
inputs = tokenizer(gene_ids, expression_values, padding=True, return_tensors="pt")
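Real-world gene lists often contain symbols absent from the mapper. A defensive sketch that drops unmapped genes while keeping IDs and expression values aligned, using a stand-in dict (whether the actual gene_mapper entries support .get like a dict is an assumption to verify):

```python
# Stand-in for tokenizer.gene_mapper["symbol_to_ncbi"]
mapper = {"TP53": "7157", "BRCA2": "675", "BRCA1": "672"}

symbols = ["TP53", "NOT_A_GENE", "BRCA1"]
values = [5.5, 0.3, 4.1]

# Keep only genes the mapper knows, along with their expression values
kept = [(mapper.get(s), v) for s, v in zip(symbols, values)
        if mapper.get(s) is not None]
gene_ids = [g for g, _ in kept]
expression_values = [v for _, v in kept]
```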

Citation

@article{scienta2026evauniversalmodelimmune,
      title={EVA: Towards a universal model of the immune system}, 
      author={Ethan Bandasack and Vincent Bouget and Apolline Bruley and Yannis Cattan and Charlotte Claye and Matthew Corney and Julien Duquesne and Karim El Kanbi and Aziz Fouché and Pierre Marschall and Francesco Strozzi},
      year={2026},
      eprint={2602.10168},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM},
      url={https://arxiv.org/abs/2602.10168}, 
}

License

Scienta Lab EVA Model License