ViroCaduceus
ViroCaduceus is a Caduceus-based (Mamba-DNA) nucleotide language model pre-trained on the ViroBlend (ViroBland) corpus, a small (~216 Mbp) mixed pretraining dataset that uses source-wise stratified sampling to balance human reference sequences, multi-species genomes, and in-domain viral sequences.
It is released as part of the ViroBench benchmark for evaluating viral nucleotide foundation models.
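As a rough illustration of what source-wise stratified sampling can look like, the sketch below draws sequences from each source until a per-source base-pair budget is reached. It is not the ViroBlend construction code: the function name `stratified_sample`, the equal budgets, and the toy sequences are all assumptions made for this example.

```python
import random


def stratified_sample(sources, per_source_bp, seed=0):
    """Draw sequences from each source until its base-pair budget is met.

    sources: dict mapping a source name to a list of DNA strings.
    per_source_bp: dict mapping a source name to its base-pair budget.
    """
    rng = random.Random(seed)
    sample = {}
    for name, seqs in sources.items():
        order = list(seqs)
        rng.shuffle(order)  # random order within the source
        picked, total = [], 0
        for seq in order:
            if total >= per_source_bp[name]:
                break
            picked.append(seq)
            total += len(seq)
        sample[name] = picked
    return sample


# Toy corpus: source names mirror the description above, sequences are made up.
toy_sources = {
    "human_reference": ["ACGT" * 5] * 50,  # 50 sequences of 20 bp each
    "multi_species": ["GGCA" * 5] * 10,
    "viral": ["TTAG" * 5] * 5,
}
budget = {name: 60 for name in toy_sources}  # equal budgets are an assumption
sample = stratified_sample(toy_sources, budget)
bp_per_source = {name: sum(len(s) for s in seqs) for name, seqs in sample.items()}
print(bp_per_source)
```

With equal budgets, the large human source no longer dominates the small viral source in the sampled mix, which is the balancing effect described above.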
Model details
| Item | Value |
|---|---|
| Architecture | Caduceus-Ph (d_model=256, Mamba backbone) |
| Pretraining data | ViroBlend (~216 Mbp) |
Quick start
Install dependencies:
pip install torch transformers mamba-ssm causal-conv1d
Extract an embedding for a random DNA sequence:
python get_embedding.py
Or load in Python (base model + local pytorch_model.bin):
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
BASE = "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16"
# REPO = "YDXX/ViroCaduceus" # after uploading to Hugging Face
tokenizer = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(BASE, trust_remote_code=True)
# load ViroCaduceus weights from pytorch_model.bin if needed (see get_embedding.py)
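The weight-loading step mentioned in the last comment is implemented in get_embedding.py, which is not reproduced here. The following is only a minimal sketch of one way to do it, assuming pytorch_model.bin is a plain PyTorch state dict (not wrapped in another dictionary) and that its key names may not match the base model exactly. The helpers `diff_keys` and `load_local_weights` are hypothetical names, not part of this repository.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names, both sorted."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)     # in the model, not in the checkpoint
    unexpected = sorted(checkpoint_keys - model_keys)  # in the checkpoint, not in the model
    return missing, unexpected


def load_local_weights(model, checkpoint_path="pytorch_model.bin"):
    """Load a local state dict into `model` and report key mismatches."""
    import torch  # imported lazily so diff_keys stays usable without torch

    # Assumption: the file holds a flat state dict of tensors.
    state = torch.load(checkpoint_path, map_location="cpu")
    missing, unexpected = diff_keys(model.state_dict().keys(), state.keys())
    # strict=False tolerates mismatches, e.g. if the file holds only backbone weights.
    model.load_state_dict(state, strict=False)
    return missing, unexpected


# Toy check with made-up parameter names (not the real Caduceus key names).
missing, unexpected = diff_keys(
    ["backbone.layer.weight", "lm_head.weight"],
    ["backbone.layer.weight", "extra.bias"],
)
print(missing, unexpected)
```

Calling `load_local_weights(model)` after the snippet above returns the two lists. If nearly every key shows up as missing, the checkpoint probably uses a different key prefix and nothing was actually loaded, so it is worth inspecting the lists before trusting the embeddings.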
Files
- config.json: training export config
- pytorch_model.bin: fine-tuned backbone weights
- get_embedding.py: minimal embedding demo