---
license: mit
base_model:
- RaphaelMourad/Mistral-DNA-v1-422M-hg38
tags:
- genomics
- virus
library_name: transformers
paper:
  title: 'Vir2vec: A Genome-Wide Viral Embedding'
  url: https://www.biorxiv.org/content/10.64898/2025.12.12.693901v1
pipeline_tag: feature-extraction
---

# Vir2vec: A Genome-Wide Viral Embedding

## Model description
[vir2vec](https://www.biorxiv.org/content/10.64898/2025.12.12.693901v1) is a viral genomic language model (gLM) that produces fixed-length, genome-level embeddings. It can be fine-tuned for downstream tasks such as viral discrimination, host-range prediction, and variant typing. For more details and the training scripts, see the [GitHub repository](https://github.com/simoRancati/Vir2vec).

## Intended use
vir2vec embeddings are intended for tasks including (but not limited to):

- Virus vs non-virus genome/read discrimination
- DNA vs RNA virus classification
- Host-range prediction
- Intra-genus separation (e.g., HIV-1 vs HIV-2)
- Variant/subtype typing (e.g., SARS-CoV-2 lineages)
- Phenotypic signal detection (e.g., tissue tropism proxies)

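Several of the tasks listed above amount to assigning a label to a genome from its embedding. As a minimal sketch, assuming embeddings are available as plain lists of floats, a nearest-centroid rule with cosine similarity could look like the following. This is an illustration only, not the evaluation method of the vir2vec authors; `cosine_similarity`, `centroid`, and `nearest_centroid` are hypothetical helpers, and the two-dimensional vectors in the usage comment are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest_centroid(embedding, labelled):
    """Return the label whose class centroid is most similar to `embedding`.

    `labelled` maps each label to a list of reference embeddings.
    """
    centroids = {label: centroid(vs) for label, vs in labelled.items()}
    return max(centroids, key=lambda label: cosine_similarity(embedding, centroids[label]))


# Toy usage (made-up 2-D vectors standing in for real embeddings):
# references = {"virus": [[1.0, 0.0], [0.9, 0.1]], "non-virus": [[0.0, 1.0], [0.1, 0.9]]}
# nearest_centroid([0.8, 0.2], references)
```

In practice a trained classifier on top of the embeddings is likely to work better; the centroid rule is shown only because it needs no extra dependencies.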
## Model sizes
- 422M (default)
- 138M (`revision="138M"`)
- 17M (`revision="17M"`)

## How to use
### Load from Hugging Face
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 422M is the default. Pass revision="138M" or revision="17M" to both calls
# to load a smaller model.
tokenizer = AutoTokenizer.from_pretrained("pabloarozarenad/Vir2vec", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("pabloarozarenad/Vir2vec", trust_remote_code=True)
model.eval()  # inference mode
```

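Choosing a model size can be wrapped in a small helper so that the tokenizer and the model always use the same revision. The sketch below is illustrative: `revision_kwargs` and `load_vir2vec` are hypothetical names, and it assumes the smaller models are published under the revisions `138M` and `17M`, as stated in the loading instructions.

```python
REPO_ID = "pabloarozarenad/Vir2vec"
# Assumption: the non-default sizes are available as revisions "138M" and "17M".
SIZES = ("422M", "138M", "17M")


def revision_kwargs(size="422M"):
    """Build the keyword arguments passed to from_pretrained for a model size."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    kwargs = {"trust_remote_code": True}
    if size != "422M":  # 422M is the default, so no revision is passed for it
        kwargs["revision"] = size
    return kwargs


def load_vir2vec(size="422M"):
    """Hypothetical helper: load the tokenizer and model for the given size."""
    # Imported lazily so revision_kwargs can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = revision_kwargs(size)
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, **kwargs)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID, **kwargs)
    model.eval()  # inference mode
    return tokenizer, model
```

Calling `load_vir2vec("17M")` would then return the tokenizer and model for the smallest size, provided access to the repository has been granted.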
### Compute embeddings
```python
dna = "ACGTAGCATCGCGATGACTGCATCACT"
inputs = tokenizer(dna, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
    last_hidden = outputs.hidden_states[-1]  # [1, seq_len, hidden_dim]
    # Max pooling over the sequence dimension gives one fixed-length vector.
    embedding = last_hidden.max(dim=1).values[0]  # [hidden_dim]

print(embedding.shape)
```

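One possible way to apply the same max pooling to a genome that is too long for a single forward pass is to embed fixed-size windows and take the element-wise maximum across them. This is a sketch under assumptions, not necessarily the procedure used by the vir2vec authors: `split_windows` and `embed_genome` are hypothetical helpers, the window and stride values are placeholders, and `embed_fn` stands for any function that maps a DNA string to a list of floats (for example, the max-pooled embedding above converted with `.tolist()`).

```python
def split_windows(sequence, window=1000, stride=1000):
    """Split `sequence` into windows of at most `window` bases, `stride` apart.

    The window and stride defaults are placeholders, not values from the paper.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("require window > 0 and 0 < stride <= window")
    if len(sequence) <= window:
        return [sequence]
    # The last start is chosen so that the final window reaches the sequence end.
    starts = range(0, len(sequence) - window + stride, stride)
    return [sequence[s:s + window] for s in starts]


def embed_genome(sequence, embed_fn, window=1000, stride=1000):
    """Element-wise max over the embeddings of all windows of `sequence`.

    `embed_fn` is an assumed callable: DNA string -> list of floats.
    """
    vectors = [embed_fn(chunk) for chunk in split_windows(sequence, window, stride)]
    return [max(values) for values in zip(*vectors)]


# Stand-in embedding for demonstration only: base counts instead of a model.
def count_bases(chunk):
    return [chunk.count(base) for base in "ACGT"]


genome_vector = embed_genome("AAAACCGT", count_bases, window=4, stride=4)
```

Because max pooling is taken per dimension, the maximum over window-level vectors has the same length as a single-window embedding, whatever the genome length.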
## Access
Access to vir2vec is granted upon request. Each request must include an institutional email address, a brief description of the intended use, and the associated IRB protocol number.