---
license: mit
language:
- en
base_model:
- LongSafari/hyenadna-large-1m-seqlen-hf
- zhihan1996/DNABERT-2-117M
- InstaDeepAI/nucleotide-transformer-v2-50m-multi-species
pipeline_tag: text-classification
tags:
- metagenomics
- taxonomic-classification
- antimicrobial-resistance
- pathogen-detection
---

# Genomic Language Models for Metagenomic Sequence Analysis

We provide genomic language models fine-tuned for the following tasks:

- **Hierarchical taxonomic classification**
- **Antimicrobial resistance gene identification**
- **Pathogenicity detection**

See our [code](https://github.com/jhuapl-bio/microbert) for details on fine-tuning, evaluation, and implementation.

These are the official models from [Evaluating the Effectiveness of Parameter-Efficient Fine-Tuning in Genomic Classification Tasks](https://www.biorxiv.org/content/10.1101/2025.08.21.671544v1).

---

## Pretrained Foundation Models

Our models are built upon several pretrained genomic foundation models:

### Nucleotide Transformer (NT)
- [InstaDeepAI/nucleotide-transformer-v2-50m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-50m-multi-species)
- [InstaDeepAI/nucleotide-transformer-v2-100m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-100m-multi-species)
- [InstaDeepAI/nucleotide-transformer-v2-250m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-250m-multi-species)

### DNABERT
- [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)
- [zhihan1996/DNABERT-S](https://huggingface.co/zhihan1996/DNABERT-S)

### HyenaDNA
- [LongSafari/hyenadna-large-1m-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-large-1m-seqlen-hf)
- [LongSafari/hyenadna-medium-450k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-450k-seqlen-hf)
- [LongSafari/hyenadna-medium-160k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-160k-seqlen-hf)
- [LongSafari/hyenadna-small-32k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-small-32k-seqlen-hf)

We sincerely thank the teams behind NT, DNABERT, and HyenaDNA for making their tokenizers and pretrained models available for use :)

---

## Available Fine-Tuned Models

The following fine-tuned models are available:

- `taxonomy/DNABERT-2-117M-taxonomy`
- `taxonomy/hyenadna-large-1m-seqlen-hf-taxonomy`
- `taxonomy/nucleotide-transformer-v2-50m-multi-species-taxonomy`
- `amr/binary/hyenadna-small-32k-seqlen-hf`
- `amr/binary/nucleotide-transformer-v2-100m-multi-species`
- `amr/multiclass/DNABERT-S`
- `amr/multiclass/hyenadna-medium-450k-seqlen-hf`
- `amr/multiclass/nucleotide-transformer-v2-250m-multi-species`
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeePaC-fungal`
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeePaC-viral`
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeepSim-bacterial`
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeepSim-viral`
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeePaC-fungal`
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeePaC-viral`
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeepSim-bacterial`
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeepSim-viral`

To use these models, download the corresponding directories from this repository and follow the installation instructions in our [code repository](https://github.com/jhuapl-bio/microbert). Two setup modes are available: from source, or from our pre-built [Docker image](https://hub.docker.com/r/jhuaplbio/microbert-classify).
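
One possible way to fetch a single model directory from the list above, rather than the whole repository, is `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter. The sketch below is an illustration under assumptions: the repository ID `your-org/your-repo` is a placeholder (this card does not state it), and the helper names `model_patterns` and `download_model` are hypothetical, not part of the microbert code.

```python
from pathlib import Path


def model_patterns(model_path: str) -> list[str]:
    """Build glob patterns matching every file under one model directory."""
    prefix = model_path.strip("/")
    return [f"{prefix}/*"]


def download_model(repo_id: str, model_path: str, local_dir: str = "data") -> Path:
    """Download only the files under `model_path` from a Hub repository."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=repo_id,
        allow_patterns=model_patterns(model_path),
        local_dir=local_dir,
    )
    return Path(root) / model_path


# Example call (placeholder repository ID, replace with the real one):
# download_model("your-org/your-repo", "taxonomy/DNABERT-2-117M-taxonomy")
```

The returned path can then be adapted to whatever directory layout your inference script expects.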

After completing the source-code setup and downloading a model directory from this repository, you can run inference with the following sample code:

```python
import json
from pathlib import Path

import torch
import torch.nn.functional as F
from safetensors.torch import load_file
from transformers import AutoTokenizer

from analysis.experiment.utils.data_processor import DataProcessor
from analysis.experiment.models.hierarchical_model import (
    HierarchicalClassificationModel,
)

# Replace with the base directory containing the data processor,
# base model tokenizer, and trained model weights
model_dir = Path("data/LongSafari__hyenadna-large-1m-seqlen-hf")
data_processor_dir = model_dir / "data_processor"  # directory containing your data processor
metadata_path = data_processor_dir / "metadata.json"
base_model_dir = model_dir / "base_model"  # directory containing your base model files
trained_model_dir = model_dir / "model"  # directory containing your trained model files
trained_model_path = trained_model_dir / "model.safetensors"

# Load metadata
with open(metadata_path, "r") as f:
    metadata = json.load(f)

sequence_column = metadata["sequence_column"]
labels = metadata["labels"]
data_processor_filename = "data_processor.pkl"

# Load the data processor
data_processor = DataProcessor(
    sequence_column=sequence_column,
    labels=labels,
    save_file=data_processor_filename,
)
data_processor.load_processor(data_processor_dir)

# Get metadata-driven values
num_labels = data_processor.num_labels
class_weights = data_processor.class_weights

# Load the tokenizer from the local base model directory
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=base_model_dir.as_posix(),
    trust_remote_code=True,
    local_files_only=True,
)

# Build the model and load the fine-tuned weights
model = HierarchicalClassificationModel(base_model_dir.as_posix(), num_labels, class_weights)
state_dict = load_file(trained_model_path)
model.load_state_dict(state_dict, strict=False)
model.eval()  # disable dropout and other training-only behavior

# Named `sequence` to avoid shadowing the built-in `input`
sequence = "ATCG"

# Run inference
tokenized_input = tokenizer(
    sequence,
    return_tensors="pt",  # return results as PyTorch tensors
)
with torch.no_grad():
    outputs = model(**tokenized_input)

# Report the top prediction for each label column
for idx, col in enumerate(labels):
    logits = outputs["logits"][idx]  # [num_classes]
    probs = F.softmax(logits, dim=-1).cpu()
    topk = torch.topk(probs, k=1, dim=-1)
    topk_index = topk.indices.numpy().ravel()
    topk_prob = topk.values.item()
    topk_label = data_processor.encoders[col].inverse_transform(topk_index)
    print(f"{col}: {topk_label[0]} (p={topk_prob:.3f})")
```
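
The inference loop above computes one top prediction per label column. A minimal sketch of gathering such per-column scores into a single record, with top-k instead of top-1, might look like the following. It uses plain Python lists in place of tensors and fitted encoders so that it runs on its own; `summarize_predictions`, the example rank names, and the class names are all hypothetical and not part of the microbert code.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_predictions(logits_per_label, classes_per_label, k=1):
    """Return the top-k (class, probability) pairs for each label column."""
    summary = {}
    for col, logits in logits_per_label.items():
        probs = softmax(logits)
        # Class indices sorted from most to least probable
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        summary[col] = [(classes_per_label[col][i], probs[i]) for i in ranked[:k]]
    return summary


# Made-up logits and class names for two taxonomic ranks
example_logits = {"genus": [2.0, 0.5, -1.0], "species": [0.1, 3.0]}
example_classes = {
    "genus": ["Escherichia", "Salmonella", "Bacillus"],
    "species": ["E. albertii", "E. coli"],
}
result = summarize_predictions(example_logits, example_classes, k=2)
for col, top in result.items():
    print(col, [(name, round(p, 3)) for name, p in top])
```

In the real pipeline, the class names would come from `data_processor.encoders[col]` and the logits from `outputs["logits"]`, as in the sample code.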

---

## Authors & Contact

- Daniel Berman - daniel.berman@jhuapl.edu
- Daniel Jimenez - daniel.jimenez@jhuapl.edu
- Stanley Ta - stanley.ta@jhuapl.edu
- Brian Merritt - brian.merritt@jhuapl.edu
- Jeremy Ratcliff - jeremyratcliff@google.com
- Vijay Narayan - vijay.narayan@jhuapl.edu
- Molly Gallagher - molly.gallagher@jhuapl.edu

---

## Acknowledgement

This work was supported by funding from the **U.S. Centers for Disease Control and Prevention** through the **Office of Readiness and Response** under **Contract # 75D30124C20202**.