	---
	license: mit
	language:
	- en
	base_model:
	- LongSafari/hyenadna-large-1m-seqlen-hf
	- zhihan1996/DNABERT-2-117M
	- InstaDeepAI/nucleotide-transformer-v2-50m-multi-species
	pipeline_tag: text-classification
	tags:
	- metagenomics
	- taxonomic-classification
	- antimicrobial-resistance
	- pathogen-detection
	---

	# Genomic Language Models for Metagenomic Sequence Analysis

	We provide genomic language models fine-tuned for the following tasks:

	- Taxonomic hierarchical classification
	- Antimicrobial resistance gene identification
	- Pathogenicity detection

	See the [code repository](https://github.com/jhuapl-bio/microbert) for details on fine-tuning, evaluation, and implementation.

	These are the official models from [Evaluating the Effectiveness of Parameter-Efficient Fine-Tuning in Genomic Classification Tasks](https://www.biorxiv.org/content/10.1101/2025.08.21.671544v1).

	---

	## Pretrained Foundation Models

	Our models are built upon several pretrained genomic foundation models:

	### Nucleotide Transformer (NT)
	- [InstaDeepAI/nucleotide-transformer-v2-50m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-50m-multi-species)
	- [InstaDeepAI/nucleotide-transformer-v2-100m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-100m-multi-species)
	- [InstaDeepAI/nucleotide-transformer-v2-250m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-250m-multi-species)

	### DNABERT
	- [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)
	- [zhihan1996/DNABERT-S](https://huggingface.co/zhihan1996/DNABERT-S)

	### HyenaDNA
	- [LongSafari/hyenadna-large-1m-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-large-1m-seqlen-hf)
	- [LongSafari/hyenadna-medium-450k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-450k-seqlen-hf)
	- [LongSafari/hyenadna-medium-160k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-160k-seqlen-hf)
	- [LongSafari/hyenadna-small-32k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-small-32k-seqlen-hf)

	We sincerely thank the teams behind NT, DNABERT, and HyenaDNA for making their tokenizers and pre-trained models available for use :)

	---

	## Available Fine-Tuned Models

	The following fine-tuned models are available:

	- `taxonomy/DNABERT-2-117M-taxonomy`
	- `taxonomy/hyenadna-large-1m-seqlen-hf-taxonomy`
	- `taxonomy/nucleotide-transformer-v2-50m-multi-species-taxonomy`
	- `amr/binary/hyenadna-small-32k-seqlen-hf`
	- `amr/binary/nucleotide-transformer-v2-100m-multi-species`
	- `amr/multiclass/DNABERT-S`
	- `amr/multiclass/hyenadna-medium-450k-seqlen-hf`
	- `amr/multiclass/nucleotide-transformer-v2-250m-multi-species`
	- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeePaC-fungal`
	- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeePaC-viral`
	- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeepSim-bacterial`
	- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeepSim-viral`
	- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeePaC-fungal`
	- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeePaC-viral`
	- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeepSim-bacterial`
	- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeepSim-viral`
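
Each fine-tuned model path above embeds the name of the base model it was trained from. The following is a minimal sketch, not part of the microbert code base, that recovers the Hugging Face Hub ID of the base model from such a path. The helper name `base_model_for` is hypothetical, and the prefix-matching rule is an assumption based only on the naming pattern of the list above.

```python
# Hub IDs of the base models listed under "Pretrained Foundation Models".
BASE_MODELS = [
    "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species",
    "InstaDeepAI/nucleotide-transformer-v2-100m-multi-species",
    "InstaDeepAI/nucleotide-transformer-v2-250m-multi-species",
    "zhihan1996/DNABERT-2-117M",
    "zhihan1996/DNABERT-S",
    "LongSafari/hyenadna-large-1m-seqlen-hf",
    "LongSafari/hyenadna-medium-450k-seqlen-hf",
    "LongSafari/hyenadna-medium-160k-seqlen-hf",
    "LongSafari/hyenadna-small-32k-seqlen-hf",
]


def base_model_for(fine_tuned_path: str) -> str:
    """Return the Hub ID of the base model named in a fine-tuned model path.

    Hypothetical helper: it assumes the last path component starts with the
    base model name, as in the list of fine-tuned models above.
    """
    leaf = fine_tuned_path.rstrip("/").split("/")[-1]
    for hub_id in BASE_MODELS:
        model_name = hub_id.split("/")[-1]
        if leaf.startswith(model_name):
            return hub_id
    raise ValueError(f"No known base model matches {fine_tuned_path!r}")


if __name__ == "__main__":
    print(base_model_for("taxonomy/DNABERT-2-117M-taxonomy"))
```

Knowing the base model is useful because the inference code needs the matching tokenizer files alongside the fine-tuned weights.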

	To use these models, download the corresponding directories from this repository and follow the installation instructions in our [code repository](https://github.com/jhuapl-bio/microbert).
	Two setup modes are available: from source code, or from our pre-built [Docker image](https://hub.docker.com/r/jhuaplbio/microbert-classify).
	Assuming you have completed the source setup and downloaded the model directories, the following sample code runs inference:

	```python
	import json
	from pathlib import Path

	import torch
	import torch.nn.functional as F
	from transformers import AutoTokenizer
	from safetensors.torch import load_file

	from analysis.experiment.utils.data_processor import DataProcessor
	from analysis.experiment.models.hierarchical_model import (
	    HierarchicalClassificationModel,
	)

	# Replace with the base directory containing the data processor, base model tokenizer, and trained model weights
	model_dir = Path("data/LongSafari__hyenadna-large-1m-seqlen-hf")
	data_processor_dir = model_dir / "data_processor"  # replace with directory containing your data processor
	metadata_path = data_processor_dir / "metadata.json"
	base_model_dir = model_dir / "base_model"  # replace with directory containing your base model files
	trained_model_dir = model_dir / "model"  # replace with directory containing your trained model files
	trained_model_path = trained_model_dir / "model.safetensors"

	# Load metadata
	with open(metadata_path, "r") as f:
	    metadata = json.load(f)

	sequence_column = metadata["sequence_column"]
	labels = metadata["labels"]
	data_processor_filename = "data_processor.pkl"

	# Load data processor
	data_processor = DataProcessor(
	    sequence_column=sequence_column,
	    labels=labels,
	    save_file=data_processor_filename,
	)
	data_processor.load_processor(data_processor_dir)

	# Get metadata-driven values
	num_labels = data_processor.num_labels
	class_weights = data_processor.class_weights

	# Load the tokenizer from the local base model directory
	tokenizer = AutoTokenizer.from_pretrained(
	    pretrained_model_name_or_path=base_model_dir.as_posix(),
	    trust_remote_code=True,
	    local_files_only=True,
	)

	# Load fine-tuned model weights
	model = HierarchicalClassificationModel(base_model_dir.as_posix(), num_labels, class_weights)
	state_dict = load_file(trained_model_path)
	model.load_state_dict(state_dict, strict=False)
	model.eval()  # switch to inference mode (disables dropout)

	sequence = "ATCG"  # replace with your DNA sequence

	# Run inference
	tokenized_input = tokenizer(
	    sequence,
	    return_tensors="pt",  # return results as PyTorch tensors
	)
	with torch.no_grad():
	    outputs = model(**tokenized_input)

	# Report the top-1 prediction for each label column
	for idx, col in enumerate(labels):
	    logits = outputs["logits"][idx]  # [num_classes]
	    probs = F.softmax(logits, dim=-1).cpu()
	    topk = torch.topk(probs, k=1, dim=-1)
	    topk_index = topk.indices.numpy().ravel()
	    topk_prob = topk.values
	    topk_label = data_processor.encoders[col].inverse_transform(topk_index)
	    print(col, topk_label, topk_prob)
	```
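
The sample above assumes the model directory is already on disk. One possible way to fetch a single model directory is `snapshot_download` from the `huggingface_hub` package, restricted with `allow_patterns`. This is a sketch under assumptions: the helper names `allow_patterns_for` and `download_model` are hypothetical, the repository ID is a placeholder you must fill in, and the layout of the downloaded directory (for example, whether it contains the `data_processor`, `base_model`, and `model` subdirectories used above) should be checked against the repository itself.

```python
from pathlib import Path


def allow_patterns_for(model_subdir: str) -> list[str]:
    """Build glob patterns that select every file under one model directory."""
    return [f"{model_subdir.strip('/')}/*"]


def download_model(repo_id: str, model_subdir: str, local_dir: str = "data") -> Path:
    """Download one model directory and return its local path.

    Hypothetical helper; repo_id is the Hub ID of this model repository.
    """
    # Imported lazily so allow_patterns_for works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=repo_id,
        allow_patterns=allow_patterns_for(model_subdir),
        local_dir=local_dir,
    )
    return Path(root) / model_subdir.strip("/")


# Example call (placeholder repository ID, replace before running):
# model_dir = download_model("<namespace>/<repo>", "taxonomy/DNABERT-2-117M-taxonomy")
```

Restricting the download to one directory avoids pulling the weights of every fine-tuned model when only one task is needed.
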
	---

	## Authors & Contact

	- Daniel Berman — daniel.berman@jhuapl.edu
	- Daniel Jimenez — daniel.jimenez@jhuapl.edu
	- Stanley Ta — stanley.ta@jhuapl.edu
	- Brian Merritt — brian.merritt@jhuapl.edu
	- Jeremy Ratcliff — jeremyratcliff@google.com
	- Vijay Narayan — vijay.narayan@jhuapl.edu
	- Molly Gallagher - molly.gallagher@jhuapl.edu

	---

	## Acknowledgement

	This work was supported by funding from the U.S. Centers for Disease Control and Prevention through the Office of Readiness and Response under Contract # 75D30124C20202.