ViroCaduceus

ViroCaduceus is a Caduceus-based (Mamba-DNA) nucleotide language model pre-trained on the ViroBlend (ViroBland) corpus, a small (~216 Mbp) mixed pretraining dataset that uses source-wise stratified sampling to balance human reference sequences, multi-species genomes, and viral in-domain sequences.
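
The exact sampling recipe behind ViroBlend is not given here. As a minimal sketch of what source-wise stratified sampling can look like, the snippet below groups records by source and draws an equal quota from each. The equal quotas, the record-level (rather than base-pair-level) sampling, the source labels, and the function name `stratified_sample` are all illustrative assumptions, not the corpus's actual procedure.

```python
import random
from collections import Counter


def stratified_sample(records, n_total, seed=0):
    """Draw about n_total records with an equal quota per source.

    records: list of (source, sequence) pairs.
    Equal quotas are an assumption made for illustration only.
    """
    rng = random.Random(seed)
    by_source = {}
    for source, seq in records:
        by_source.setdefault(source, []).append(seq)

    quota = n_total // len(by_source)
    sample = []
    for source in sorted(by_source):  # sorted for reproducible order
        pool = by_source[source]
        k = min(quota, len(pool))  # never draw more than the source holds
        sample.extend((source, s) for s in rng.sample(pool, k))
    return sample


# Toy, heavily imbalanced input (hypothetical source labels and sequences).
records = (
    [("human", "ACGT")] * 1000
    + [("multi_species", "GGCA")] * 300
    + [("viral", "TTAG")] * 50
)
sample = stratified_sample(records, n_total=90)
counts = Counter(source for source, _ in sample)
```

Even though the toy input is dominated by one source, each source contributes the same number of records to the sample, which is the balancing effect the corpus description refers to.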

It is released as part of the ViroBench benchmark for evaluating viral nucleotide foundation models.

Model details

| Item | Value |
|---|---|
| Architecture | Caduceus-Ph (d_model=256, Mamba backbone) |
| Pretraining data | ViroBlend (~216 Mbp) |

Quick start

Install dependencies:

pip install torch transformers mamba-ssm causal-conv1d

Extract an embedding for a random DNA sequence:

python get_embedding.py
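
The contents of get_embedding.py are not reproduced here. As a rough sketch of the kind of input the demo describes, a random DNA sequence can be generated with the standard library alone. The helper name `random_dna`, the length of 1024, and the four-letter alphabet are assumptions for illustration.

```python
import random


def random_dna(length, seed=None):
    """Return a random string over the alphabet A, C, G, T."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


# A fixed seed makes the sequence reproducible across runs.
seq = random_dna(1024, seed=0)
```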

Or load in Python (base model + local pytorch_model.bin):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

BASE = "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16"
# REPO = "YDXX/ViroCaduceus"  # after uploading to Hugging Face

tokenizer = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(BASE, trust_remote_code=True)
# load ViroCaduceus weights from pytorch_model.bin if needed (see get_embedding.py)
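
The snippet above leaves the weight-loading step to get_embedding.py. The sketch below shows one plausible way to do it with plain PyTorch, plus a masked mean-pooling helper for turning per-token hidden states into one vector per sequence. The helper names `load_local_weights` and `mean_pool` are hypothetical, and it is an assumption that the keys in pytorch_model.bin line up with the base model; `strict=False` is used so that any mismatch is reported instead of raising.

```python
import torch
from torch import nn


def load_local_weights(model: nn.Module, path: str = "pytorch_model.bin"):
    """Load a local state dict into `model` and report key mismatches.

    Returns (missing_keys, unexpected_keys). Non-empty lists suggest the
    checkpoint and the model do not fully match and need inspection.
    """
    state = torch.load(path, map_location="cpu")
    result = model.load_state_dict(state, strict=False)
    return list(result.missing_keys), list(result.unexpected_keys)


def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over non-padding positions.

    hidden: (batch, seq_len, dim)
    mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = mask.unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts
```

With the `model` from the snippet above, `load_local_weights(model)` would be called once after `from_pretrained`, and `mean_pool` would be applied to the last hidden state. How hidden states are exposed depends on the model's remote code, so check get_embedding.py for the authoritative procedure.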

Files

  • config.json: training export config
  • pytorch_model.bin: fine-tuned backbone weights
  • get_embedding.py: minimal embedding demo