---
license: mit
language:
- en
base_model:
- LongSafari/hyenadna-large-1m-seqlen-hf
- zhihan1996/DNABERT-2-117M
- InstaDeepAI/nucleotide-transformer-v2-50m-multi-species
pipeline_tag: text-classification
tags:
- metagenomics
- taxonomic-classification
- antimicrobial-resistance
- pathogen-detection
---

# Genomic Language Models for Metagenomic Sequence Analysis

We provide genomic language models fine-tuned for the following tasks:

- **Taxonomic hierarchical classification**
- **Antimicrobial resistance gene identification**
- **Pathogenicity detection**

See [code](https://github.com/jhuapl-bio/microbert) for details on fine-tuning, evaluation, and implementation. These are the official models described in [Evaluating the Effectiveness of Parameter-Efficient Fine-Tuning in Genomic Classification Tasks](https://www.biorxiv.org/content/10.1101/2025.08.21.671544v1).

---

## Pretrained Foundation Models

Our models are built upon several pretrained genomic foundation models:

### Nucleotide Transformer (NT)
- [InstaDeepAI/nucleotide-transformer-v2-50m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-50m-multi-species)
- [InstaDeepAI/nucleotide-transformer-v2-100m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-100m-multi-species)
- [InstaDeepAI/nucleotide-transformer-v2-250m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-250m-multi-species)

### DNABERT
- [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)
- [zhihan1996/DNABERT-S](https://huggingface.co/zhihan1996/DNABERT-S)

### HyenaDNA
- [LongSafari/hyenadna-large-1m-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-large-1m-seqlen-hf)
- [LongSafari/hyenadna-medium-450k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-450k-seqlen-hf)
- [LongSafari/hyenadna-medium-160k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-160k-seqlen-hf)
- [LongSafari/hyenadna-small-32k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-small-32k-seqlen-hf)

We sincerely thank the teams behind NT, DNABERT, and HyenaDNA for making their tokenizers and pre-trained models available for use :)

---

## Available Fine-Tuned Models

We provide the following fine-tuned models:

- `taxonomy/DNABERT-2-117M-taxonomy`
- `taxonomy/hyenadna-large-1m-seqlen-hf-taxonomy`
- `taxonomy/nucleotide-transformer-v2-50m-multi-species-taxonomy`
- `amr/binary/hyenadna-small-32k-seqlen-hf`
- `amr/binary/nucleotide-transformer-v2-100m-multi-species`
- `amr/multiclass/DNABERT-S`
- `amr/multiclass/hyenadna-medium-450k-seqlen-hf`
- `amr/multiclass/nucleotide-transformer-v2-250m-multi-species`
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeePaC-fungal`
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeePaC-viral`
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeepSim-bacterial`
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeepSim-viral`
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeePaC-fungal`
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeePaC-viral`
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeepSim-bacterial`
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeepSim-viral`

To use these models, download the corresponding directories from this repository. You should also follow the installation instructions in our [code](https://github.com/jhuapl-bio/microbert) repository.
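As an illustration, a single fine-tuned model directory can be fetched programmatically with `huggingface_hub`. This is a minimal sketch; the `repo_id` below is a placeholder, so replace it with this repository's actual id as shown on the model page.

```
# Minimal sketch: download one fine-tuned model directory with huggingface_hub.
# The repo_id below is a placeholder; substitute this repository's actual id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="jhuapl-bio/microbert",  # placeholder repo id
    allow_patterns=["taxonomy/DNABERT-2-117M-taxonomy/*"],  # pull only the directory you need
)
print(local_dir)  # path to the downloaded snapshot on disk
```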
There are two modes of operation: setup from source code, or setup from our pre-built [Docker image](https://hub.docker.com/r/jhuaplbio/microbert-classify). Assuming you have followed the source-code setup instructions and downloaded the model directories from this repository, here is sample code to run inference:

```
import json
from pathlib import Path

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from safetensors.torch import load_file

from analysis.experiment.utils.data_processor import DataProcessor
from analysis.experiment.models.hierarchical_model import (
    HierarchicalClassificationModel,
)

# Replace with the base directory containing the data processor, base model
# tokenizer, and trained model weights files
model_dir = Path("data/LongSafari__hyenadna-large-1m-seqlen-hf")
data_processor_dir = model_dir / "data_processor"  # directory containing your data processor
metadata_path = data_processor_dir / "metadata.json"
base_model_dir = model_dir / "base_model"    # directory containing your base model files
trained_model_dir = model_dir / "model"      # directory containing your trained model files
trained_model_path = trained_model_dir / "model.safetensors"

# Load metadata
with open(metadata_path, "r") as f:
    metadata = json.load(f)

sequence_column = metadata["sequence_column"]
labels = metadata["labels"]
data_processor_filename = "data_processor.pkl"

# Load the data processor
data_processor = DataProcessor(
    sequence_column=sequence_column,
    labels=labels,
    save_file=data_processor_filename,
)
data_processor.load_processor(data_processor_dir)

# Get metadata-driven values
num_labels = data_processor.num_labels
class_weights = data_processor.class_weights

# Load the tokenizer from the Hugging Face Hub or a local path
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=base_model_dir.as_posix(),
    trust_remote_code=True,
    local_files_only=True,
)

# Load fine-tuned model weights
model = HierarchicalClassificationModel(base_model_dir.as_posix(), num_labels, class_weights)
state_dict = load_file(trained_model_path)
model.load_state_dict(state_dict, strict=False)
model.eval()

sequence = "ATCG"

# Run inference
tokenized_input = tokenizer(
    sequence,
    return_tensors="pt",  # return results as PyTorch tensors
)

with torch.no_grad():
    outputs = model(**tokenized_input)

# Decode the top prediction for each label in the hierarchy
for idx, col in enumerate(labels):
    logits = outputs["logits"][idx]  # [num_classes]
    probs = F.softmax(logits, dim=-1).cpu()
    topk = torch.topk(probs, k=1, dim=-1)
    topk_index = topk.indices.numpy().ravel()
    topk_prob = topk.values.numpy().ravel()
    topk_label = data_processor.encoders[col].inverse_transform(topk_index)
    print(f"{col}: {topk_label[0]} (p={topk_prob[0]:.3f})")
```

---

## Authors & Contact

- Daniel Berman — daniel.berman@jhuapl.edu
- Daniel Jimenez — daniel.jimenez@jhuapl.edu
- Stanley Ta — stanley.ta@jhuapl.edu
- Brian Merritt — brian.merritt@jhuapl.edu
- Jeremy Ratcliff — jeremyratcliff@google.com
- Vijay Narayan — vijay.narayan@jhuapl.edu
- Molly Gallagher — molly.gallagher@jhuapl.edu

---

## Acknowledgement

This work was supported by funding from the **U.S. Centers for Disease Control and Prevention** through the **Office of Readiness and Response** under **Contract # 75D30124C20202**.