metadata
license: mit
base_model:
  - RaphaelMourad/Mistral-DNA-v1-422M-hg38
tags:
  - genomics
  - virus
library_name: transformers
paper:
  title: 'Vir2vec: A Genome-Wide Viral Embedding'
  url: https://www.biorxiv.org/content/10.64898/2025.12.12.693901v1
pipeline_tag: feature-extraction

Vir2vec: A Genome-Wide Viral Embedding

Model description

vir2vec is a viral genomic language model (gLM) that produces fixed-length, genome-level embeddings and can be fine-tuned for downstream tasks such as viral discrimination, host-range prediction, and variant typing. For more details and the training scripts, see the project's GitHub repository.

Intended use

vir2vec embeddings are intended for tasks including (but not limited to):

  • Virus vs non-virus genome/read discrimination
  • DNA vs RNA virus classification
  • Host-range prediction
  • Intra-genus separation (e.g., HIV-1 vs HIV-2)
  • Variant/subtype typing (e.g., SARS-CoV-2 lineages)
  • Phenotypic signal detection (e.g., tissue tropism proxies)
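One simple way to use embeddings for classification tasks like those listed above is a linear probe: a single linear layer trained on frozen embeddings. The sketch below is an illustration under stated assumptions, not the authors' training procedure. The helper name `train_linear_probe`, its hyperparameters, and the random placeholder tensors are all hypothetical; in practice `embeddings` would hold one vir2vec vector per genome and `labels` the task labels.

```python
import torch
from torch import nn


def train_linear_probe(embeddings, labels, num_classes, epochs=200, lr=0.1):
    """Fit one linear layer on frozen embeddings with cross-entropy loss.

    embeddings: float tensor [n_samples, hidden_dim]
    labels:     long tensor  [n_samples] with values in [0, num_classes)
    """
    probe = nn.Linear(embeddings.shape[1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(embeddings), labels)
        loss.backward()
        optimizer.step()
    return probe


# Placeholder data only: random vectors stand in for real vir2vec embeddings.
torch.manual_seed(0)
labels = torch.randint(0, 2, (64,))
embeddings = torch.randn(64, 16)
# Shift one coordinate by class so the toy problem is linearly separable.
embeddings[:, 0] += 4.0 * (labels.float() * 2 - 1)

probe = train_linear_probe(embeddings, labels, num_classes=2)
accuracy = (probe(embeddings).argmax(dim=1) == labels).float().mean().item()
print(f"training accuracy on placeholder data: {accuracy:.2f}")
```

The accuracy printed here says nothing about vir2vec itself, because the inputs are synthetic. With real embeddings, evaluate on a held-out split rather than on the training set.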

Model sizes

  • 422M (default)
  • 138M (revision="138M")
  • 17M (revision="17M")

How to use

Load from Hugging Face

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 422M is the default. Pass revision="138M" or revision="17M" to both calls to load a smaller model.
tokenizer = AutoTokenizer.from_pretrained("pabloarozarenad/Vir2vec", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("pabloarozarenad/Vir2vec", trust_remote_code=True)
model.eval()

Compute embeddings

dna = "ACGTAGCATCGCGATGACTGCATCACT"
inputs = tokenizer(dna, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
    last_hidden = outputs.hidden_states[-1]  # [1, seq_len, hidden_dim]
    embedding = last_hidden.max(dim=1).values[0]  # [hidden_dim] (max pooling)

print(embedding.shape)
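The snippet above embeds a single sequence. When several sequences of different lengths are embedded in one batch, they must be padded, and a plain max over the token axis would then include the padded positions. A minimal sketch of max pooling that ignores padding follows. The helper `masked_max_pool` is hypothetical (it is not part of this repository), and the random tensors only stand in for model outputs.

```python
import torch


def masked_max_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Max-pool over the token axis, ignoring padded positions.

    hidden:         [batch, seq_len, hidden_dim]
    attention_mask: [batch, seq_len], 1 for real tokens and 0 for padding
    returns:        [batch, hidden_dim]
    """
    mask = attention_mask.unsqueeze(-1).bool()  # [batch, seq_len, 1]
    # Padded positions are set to -inf so they can never win the max.
    masked = hidden.masked_fill(~mask, float("-inf"))
    return masked.max(dim=1).values


# Toy check: random tensors stand in for outputs.hidden_states[-1].
hidden = torch.randn(2, 5, 8)
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])
pooled = masked_max_pool(hidden, attention_mask)
print(pooled.shape)  # torch.Size([2, 8])
```

With the real model, the inputs would come from `tokenizer(list_of_sequences, return_tensors="pt", padding=True)`, with `inputs["attention_mask"]` passed as the mask. This assumes the tokenizer defines a padding token, which this card does not state. A row whose mask is all zeros yields `-inf` values, so empty sequences should be filtered out beforehand.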

Access

Access to vir2vec is granted upon request. Each request must include an institutional email address, a brief description of the intended use, and the associated IRB protocol number.