---
license: mit
base_model:
- RaphaelMourad/Mistral-DNA-v1-422M-hg38
tags:
- genomics
- virus
library_name: transformers
paper:
  title: 'Vir2vec: A Genome-Wide Viral Embedding'
  url: https://www.biorxiv.org/content/10.64898/2025.12.12.693901v1
pipeline_tag: feature-extraction
---
# Vir2vec: A Genome-Wide Viral Embedding
## Model description
[vir2vec](https://www.biorxiv.org/content/10.64898/2025.12.12.693901v1) is a viral genomic language model (gLM) that produces fixed-length, genome-level embeddings. The model can be fine-tuned for downstream tasks such as viral discrimination, host-range prediction, and variant typing. For more details and the training scripts, see the [GitHub repository](https://github.com/simoRancati/Vir2vec).
## Intended use
vir2vec embeddings are intended for tasks including (but not limited to):
- Virus vs non-virus genome/read discrimination
- DNA vs RNA virus classification
- Host-range prediction
- Intra-genus separation (e.g., HIV-1 vs HIV-2)
- Variant/subtype typing (e.g., SARS-CoV-2 lineages)
- Phenotypic signal detection (e.g., tissue tropism proxies)
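Because every genome maps to a vector of the same length, the tasks above can be approached by training a lightweight classifier on top of the embeddings. The block below is a minimal sketch of one such option, a nearest-centroid classifier using cosine similarity. The function name `nearest_centroid_predict` and the 4-dimensional vectors are hypothetical placeholders, not real vir2vec outputs, and this is not necessarily the evaluation protocol used in the paper.

```python
import torch
import torch.nn.functional as F


def nearest_centroid_predict(train_emb, train_labels, query_emb):
    """Label each query embedding with the class whose centroid is most similar (cosine)."""
    classes = torch.unique(train_labels)  # sorted class ids
    # One mean vector per class: [num_classes, hidden_dim]
    centroids = torch.stack([train_emb[train_labels == c].mean(dim=0) for c in classes])
    # Cosine similarity between every query and every centroid: [num_queries, num_classes]
    sims = F.normalize(query_emb, dim=1) @ F.normalize(centroids, dim=1).T
    return classes[sims.argmax(dim=1)]


# Placeholder vectors standing in for genome embeddings (assumption, illustration only).
train_emb = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                          [0.9, 0.1, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 0.0],
                          [0.0, 0.1, 0.9, 0.0]])
train_labels = torch.tensor([0, 0, 1, 1])  # e.g. 0 = class A, 1 = class B
query_emb = torch.tensor([[0.8, 0.0, 0.1, 0.0],
                          [0.1, 0.0, 0.7, 0.0]])

pred = nearest_centroid_predict(train_emb, train_labels, query_emb)
print(pred.tolist())
```

In practice the placeholder tensors would be replaced by stacked embeddings computed as described under "How to use".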
## Model sizes
- 422M
- 138M
- 17M
## How to use
### Load from Hugging Face
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# The default revision is the 422M model. To load a smaller model, pass
# revision="138M" or revision="17M" to both from_pretrained calls.
tokenizer = AutoTokenizer.from_pretrained("pabloarozarenad/Vir2vec", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("pabloarozarenad/Vir2vec", trust_remote_code=True)
model.eval()
```
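The comments in the loading snippet say that the smaller models are selected through the `revision` argument. The helper below is a small sketch of how that choice could be made explicit and validated. `revision_kwargs` is a hypothetical name introduced here, and the revision strings are taken from those comments.

```python
def revision_kwargs(size: str = "422M") -> dict:
    """Return the extra from_pretrained keyword arguments for a given model size.

    Hypothetical helper: 422M is the default revision, so it needs no argument;
    138M and 17M are selected by passing their names as `revision`.
    """
    revisions = {"422M": None, "138M": "138M", "17M": "17M"}
    if size not in revisions:
        raise ValueError(f"Unknown model size {size!r}; expected one of {sorted(revisions)}")
    revision = revisions[size]
    return {} if revision is None else {"revision": revision}


# Intended usage (not executed here, since it downloads the model):
# model = AutoModelForCausalLM.from_pretrained(
#     "pabloarozarenad/Vir2vec", trust_remote_code=True, **revision_kwargs("138M")
# )
print(revision_kwargs("138M"))
```

The tokenizer and the model should be loaded with the same keyword arguments so that both come from the same revision.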
### Compute embeddings
```python
dna = "ACGTAGCATCGCGATGACTGCATCACT"
inputs = tokenizer(dna, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
last_hidden = outputs.hidden_states[-1]  # [1, seq_len, hidden_dim]
embedding = last_hidden.max(dim=1).values[0]  # [hidden_dim] (max pooling over tokens)
print(embedding.shape)
```
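The snippet above pools a single, unpadded sequence. When several sequences of different lengths are embedded in one batch, padded positions should be excluded before max pooling, otherwise padding vectors can leak into the embedding. The following is a minimal sketch of mask-aware max pooling. `masked_max_pool` is a hypothetical helper, and the tensors are synthetic stand-ins for `outputs.hidden_states[-1]` and `inputs["attention_mask"]`.

```python
import torch


def masked_max_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Max-pool over the sequence axis while ignoring padded positions.

    hidden:         [batch, seq_len, hidden_dim]
    attention_mask: [batch, seq_len], 1 for real tokens and 0 for padding
    returns:        [batch, hidden_dim]
    """
    mask = attention_mask.unsqueeze(-1).bool()  # [batch, seq_len, 1]
    # Padded positions become -inf so they can never win the max.
    masked = hidden.masked_fill(~mask, float("-inf"))
    return masked.max(dim=1).values


# Synthetic example: the last token of the first sequence is padding.
hidden = torch.tensor([[[1.0, -2.0], [3.0, -1.0], [9.0, 9.0]],
                       [[0.5, 0.5], [-1.0, 2.0], [0.0, 1.0]]])
attention_mask = torch.tensor([[1, 1, 0],
                               [1, 1, 1]])

pooled = masked_max_pool(hidden, attention_mask)
print(pooled.shape)
```

With the real model, a batch would be tokenized with `padding=True`, assuming the tokenizer defines a padding token, and the resulting `attention_mask` passed to the helper together with the last hidden state.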
## Access
Access to vir2vec is granted upon request. Each request must include an institutional email address, a brief description of the intended use, and the associated IRB protocol number.