ViroDNABERT2
ViroDNABERT2 is a DNABERT-2-based nucleotide language model pre-trained on the ViroBlend (ViroBland) corpus, a small (216 Mbp) mixed pretraining dataset with source-wise stratified sampling to balance human reference, multi-species genomes, and viral in-domain sequences.
It is released as part of the ViroBench benchmark for evaluating viral nucleotide foundation models.
Model details
| Item | Value |
|---|---|
| Architecture | DNABERT-2-117M (BERT-style, BPE tokenizer) |
| Pretraining data | ViroBlend (~216 Mbp) |
Quick start
Install dependencies:
pip install torch transformers
Extract an embedding for a random DNA sequence:
python get_embedding.py
Or load in Python (base model + local pytorch_model.bin):
import torch
from transformers import AutoModel, AutoTokenizer
BASE = "zhihan1996/DNABERT-2-117M"
# REPO = "YDXX/ViroDNABERT2" # after uploading to Hugging Face
tokenizer = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
model = AutoModel.from_pretrained(BASE, trust_remote_code=True)
# load ViroDNABERT2 weights from pytorch_model.bin if needed (see get_embedding.py)
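The weight-loading step mentioned in the comment above can be sketched as follows. This is a minimal illustration, not the contents of get_embedding.py: the helper names (`random_dna`, `mean_pool`, `load_virodnabert2`, `embed`) are hypothetical, and it assumes that pytorch_model.bin is a plain state dict, that its keys match the DNABERT-2 backbone, and that mean pooling over tokens is an acceptable sequence embedding. Indexing the model output with `[0]` follows the upstream DNABERT-2 usage, where the first element holds the per-token hidden states.

```python
import random

import torch


def random_dna(length, seed=None):
    """Return a random A/C/G/T string (illustrative helper)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def mean_pool(hidden_states, attention_mask):
    """Average token embeddings over non-padding positions.

    hidden_states: [batch, seq_len, dim]; attention_mask: [batch, seq_len].
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


def load_virodnabert2(weights_path="pytorch_model.bin",
                      base="zhihan1996/DNABERT-2-117M"):
    """Load the base model, then overlay the local ViroDNABERT2 weights."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
    model = AutoModel.from_pretrained(base, trust_remote_code=True)
    state = torch.load(weights_path, map_location="cpu", weights_only=True)
    # strict=False because the checkpoint's key names are an assumption here;
    # check the report to confirm the backbone weights were actually matched.
    report = model.load_state_dict(state, strict=False)
    print("missing keys:", len(report.missing_keys))
    print("unexpected keys:", len(report.unexpected_keys))
    return tokenizer, model.eval()


def embed(sequence, tokenizer, model):
    """Return one mean-pooled embedding vector for a single DNA sequence."""
    input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        hidden = model(input_ids)[0]  # [1, seq_len, dim]
    # A single unpadded sequence has no padding, so every token counts.
    mask = torch.ones_like(input_ids)
    return mean_pool(hidden, mask)[0]
```

With the weights file in the working directory, `tokenizer, model = load_virodnabert2()` followed by `embed(random_dna(200, seed=0), tokenizer, model)` should return a single vector. If the printed report lists many missing keys, the checkpoint probably uses a different key prefix and the state dict would need renaming before loading.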
Files
- config.json: training export config
- pytorch_model.bin: fine-tuned backbone weights
- tokenizer.json / tokenizer_config.json: tokenizer files
- get_embedding.py: minimal embedding demo