---
license: other
license_name: scienta-lab-eva-model-license
license_link: LICENSE
language:
  - en
tags:
  - biology
  - transcriptomics
  - rna-seq
  - gene-expression
  - foundation-model
  - single-cell
  - bulk-rna
  - immunology
library_name: transformers
pipeline_tag: feature-extraction
extra_gated_prompt: >-
  Before accessing EVA-RNA, please provide the following information.
  Your responses will be used solely to better understand our user community.
extra_gated_fields:
  Full name: text
  Affiliation (university, institute, or company): text
  I am a:
    type: select
    options:
      - Student (undergraduate or graduate)
      - PhD candidate
      - Academic researcher (postdoc, faculty, or staff scientist)
      - Industry professional
      - Other
  I accept the Scienta Lab EVA Model License:
    type: checkbox
---
|
|
|
|
|
# EVA-RNA: Foundation Model for Transcriptomics |
|
|
|
|
|
EVA-RNA is a transformer-based foundation model that produces sample-level and gene-level embeddings from RNA-seq profiles (bulk, microarray, and pseudobulked single-cell) in human and mouse.
|
|
|
|
|
## Installation |
|
|
|
|
|
We recommend installing with the [uv package manager](https://docs.astral.sh/uv/getting-started/installation/).
|
|
|
|
|
```bash
uv venv --python 3.10
source .venv/bin/activate
uv pip install transformers torch==2.6.0 scanpy anndata tqdm scipy scikit-misc
```
|
|
|
|
|
### Optional: Flash Attention |
|
|
|
|
|
EVA-RNA automatically uses Flash Attention when it is available, which lets it handle larger gene contexts. Flash Attention only runs on Ampere and newer GPUs (A100 and beyond). We recommend using the following wheel:
|
|
|
|
|
```bash
uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
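
To confirm that the wheel matches your environment, a quick import check (flash-attn releases expose a standard `__version__` attribute):

```bash
python -c "import flash_attn; print(flash_attn.__version__)"
```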
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python
import scanpy as sc
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

# Load example dataset (2,700 PBMCs, raw counts)
#
# NOTE: EVA is not meant to be used directly on single-cell data; it is
# designed primarily for bulk, microarray, and pseudobulked single-cell
# profiles. We use `pbmc3k` here purely as a convenient, quick-loading
# AnnData object. For single-cell data, pseudobulk by sample or cell type
# before encoding (see the sketch after this block).
adata = sc.datasets.pbmc3k()

# Subset to 2,000 highly variable genes for efficiency
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
adata = adata[:, adata.var.highly_variable].copy()

# Encode (gene symbols auto-converted, preprocessing applied, GPU used if available)
embeddings = model.encode_anndata(tokenizer, adata)
adata.obsm["X_eva"] = embeddings
```
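
As noted in the comments above, single-cell matrices should be pseudobulked before encoding. A minimal sketch, assuming an `adata.obs` column named `cell_type` (the column name is illustrative, not part of the EVA API):

```python
import anndata as ad
import numpy as np

# Sum raw counts within each group to form one pseudobulk sample per cell type;
# `encode_anndata` then normalizes and log1p-transforms them by default.
groups = adata.obs["cell_type"].astype(str).unique().tolist()
X = np.vstack([
    np.asarray(adata[adata.obs["cell_type"] == g].X.sum(axis=0)).ravel()
    for g in groups
])
pseudobulk = ad.AnnData(X=X, var=adata.var.copy())
pseudobulk.obs_names = groups

embeddings = model.encode_anndata(tokenizer, pseudobulk)
```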
|
|
|
|
|
### Options |
|
|
|
|
|
`model.encode_anndata()` accepts the following parameters; a combined usage sketch follows the list:
|
|
|
|
|
- `gene_column` — column in `adata.var` with gene identifiers (default: uses `adata.var_names`) |
|
|
- `species` — `"human"` or `"mouse"` for gene ID conversion (default: auto-detected) |
|
|
- `batch_size` — samples per inference batch (default: 32) |
|
|
- `device` — `"cpu"`, `"cuda"`, etc. (default: CUDA if available) |
|
|
- `show_progress` — show a progress bar (default: True) |
|
|
- `preprocess` — apply library-size normalization + log1p (default: True); set to False if data is already log-transformed |
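
For example, encoding pre-normalized mouse data on a specific device (the `gene_column` value below is illustrative):

```python
embeddings = model.encode_anndata(
    tokenizer,
    adata,
    gene_column="ensembl_id",  # illustrative: a column in adata.var
    species="mouse",
    batch_size=64,
    device="cuda:0",
    show_progress=False,
    preprocess=False,  # data is already library-size normalized and log1p-transformed
)
```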
|
|
|
|
|
## Advanced: Raw Tensor API |
|
|
|
|
|
For users who need direct control over inputs (mixed precision is applied automatically): |
|
|
|
|
|
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Gene IDs must be NCBI GeneIDs as strings
gene_ids = ["7157", "675", "672"]  # TP53, BRCA2, BRCA1
expression_values = [5.5, 3.2, 4.1]  # log1p-normalized

inputs = tokenizer(gene_ids, expression_values, padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.inference_mode():
    outputs = model(**inputs)

sample_embedding = outputs.cls_embedding   # (1, 256)
gene_embeddings = outputs.gene_embeddings  # (1, 3, 256)
```
|
|
|
|
|
### Batch Processing |
|
|
|
|
|
```python
batch_gene_ids = [
    ["7157", "675", "672"],
    ["7157", "1956", "5290"],
]
batch_expression = [
    [5.5, 3.2, 4.1],
    [2.1, 6.3, 1.8],
]

inputs = tokenizer(batch_gene_ids, batch_expression, padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.inference_mode():
    outputs = model(**inputs)

sample_embeddings = outputs.cls_embedding  # (2, 256)
```
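
The pooled embeddings can be compared directly; for example, cosine similarity between the two samples above:

```python
import torch.nn.functional as F

# Cosine similarity between the two (256-dim) sample embeddings
similarity = F.cosine_similarity(sample_embeddings[0], sample_embeddings[1], dim=0)
print(f"sample similarity: {similarity.item():.3f}")
```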
|
|
|
|
|
## GPU and Precision |
|
|
|
|
|
EVA-RNA automatically selects a compute precision based on the available hardware:
|
|
|
|
|
- **Ampere+ GPUs** (A100, H100, RTX 30/40 series): bfloat16 |
|
|
- **Older CUDA GPUs** (V100, RTX 20 series): float16 |
|
|
- **CPU**: full precision (float32) |
|
|
|
|
|
No manual `torch.autocast()` is needed. |
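
For reference, the selection behaves like the following sketch (an illustration of the table above, not the model's internal code):

```python
import torch

# bfloat16 on Ampere+ CUDA devices, float16 on older CUDA devices, float32 on CPU
if not torch.cuda.is_available():
    dtype = torch.float32
elif torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16
```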
|
|
|
|
|
> **Note on Flash Attention constraints:** When flash attention is installed and an
> Ampere+ GPU is detected, the model uses flash attention layers. These layers
> **require CUDA and half-precision inputs**. If you move the model to CPU, you will
> get a clear error asking you to move it back to the GPU. If you pass `autocast=False`,
> autocast is re-enabled automatically with a warning, since flash attention cannot
> run in full precision.
|
|
|
|
|
### Disabling Automatic Mixed Precision |
|
|
|
|
|
For advanced use cases requiring manual precision control, pass `autocast=False`. This only takes effect when flash attention is **not** active (i.e., on older GPUs or when flash attention is not installed):
|
|
|
|
|
```python
model = model.to("cuda").eval()

with torch.inference_mode():
    # Disable automatic mixed precision (ignored when flash attention is active)
    outputs = model(**inputs, autocast=False)

# Or via the sample_embedding helper
embedding = model.sample_embedding(
    gene_ids=gene_ids,
    expression_values=values,
    autocast=False,
)
```
|
|
|
|
|
## Converting Gene Symbols to NCBI Gene IDs |
|
|
|
|
|
The tokenizer vocabulary uses NCBI GeneIDs. A built-in gene mapper is included to convert gene symbols or Ensembl IDs:
|
|
|
|
|
```python
tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

# Available mappings:
#   "symbol_to_ncbi"       – human gene symbols → NCBI GeneIDs
#   "ensembl_to_ncbi"      – human Ensembl IDs → NCBI GeneIDs
#   "symbol_to_ncbi_mouse" – mouse gene symbols → NCBI GeneIDs
mapper = tokenizer.gene_mapper["symbol_to_ncbi"]

gene_symbols = ["TP53", "BRCA2", "BRCA1"]
gene_ids = [mapper[s] for s in gene_symbols]
# gene_ids = ["7157", "675", "672"]

expression_values = [5.5, 3.2, 4.1]
inputs = tokenizer(gene_ids, expression_values, padding=True, return_tensors="pt")
```
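
Symbols absent from the mapping raise a `KeyError` in the comprehension above. Assuming the mapper is dict-like (as the indexing suggests), a hedged way to drop unmapped genes along with their expression values:

```python
# Keep only genes with a known NCBI GeneID, and their matching expression values
pairs = [
    (mapper[s], v)
    for s, v in zip(gene_symbols, expression_values)
    if s in mapper
]
gene_ids = [g for g, _ in pairs]
expression_values = [v for _, v in pairs]
```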
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@article{scienta2026evauniversalmodelimmune,
  title={EVA: Towards a universal model of the immune system},
  author={Ethan Bandasack and Vincent Bouget and Apolline Bruley and Yannis Cattan and Charlotte Claye and Matthew Corney and Julien Duquesne and Karim El Kanbi and Aziz Fouché and Pierre Marschall and Francesco Strozzi},
  year={2026},
  eprint={2602.10168},
  archivePrefix={arXiv},
  primaryClass={q-bio.QM},
  url={https://arxiv.org/abs/2602.10168},
}
```
|
|
|
|
|
## License |
|
|
|
|
|
[Scienta Lab EVA Model License](LICENSE) |
|
|
|