bert-metagenome

A BERT model for DNA sequence embedding, pretrained on metagenomic contigs and complete microbial genomes.

Model

  • architecture: BERT, 24 layers, 768 hidden units, 12 attention heads
  • parameters: ~430M
  • input: 1000 bp DNA sequence (ACGT)
  • output: 768-dim embedding per position
  • pretraining data: metagenomic contigs + complete microbial genomes

Usage

import tensorflow as tf
from huggingface_hub import hf_hub_download

# download the pretrained weights from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="genomenet/bert-metagenome",
    filename="bert_1k_3.h5"
)

# load the model for inference (compile=False; pass custom_objects if your TF version needs them)
model = tf.keras.models.load_model(model_path, compile=False)
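# model.summary() lists all layer names if you need to confirm which transformer block to tap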

# build a sub-model that returns the per-position embeddings from transformer block 21
embedding_model = tf.keras.Model(
    inputs=model.input,
    outputs=model.get_layer("layer_transformer_block_21").output
)

# input: one-hot encoded DNA (batch, 1000, 4)
# output: embeddings (batch, 1000, 768)
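
As a quick sanity check, the embedding model can be run on a random one-hot batch. The shapes below follow the model card; the random input and variable names are purely illustrative:

# illustrative only: random one-hot batch of two 1000 bp "sequences"
import numpy as np
dummy_batch = np.eye(4, dtype=np.float32)[np.random.randint(0, 4, size=(2, 1000))]  # (2, 1000, 4)
embeddings = embedding_model.predict(dummy_batch)
print(embeddings.shape)  # expected: (2, 1000, 768)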

Tokenization

DNA sequences are one-hot encoded:

  • A = [1, 0, 0, 0]
  • C = [0, 1, 0, 0]
  • G = [0, 0, 1, 0]
  • T = [0, 0, 0, 1]
  • N = [0.25, 0.25, 0.25, 0.25]

Input shape: (batch_size, 1000, 4)
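
A minimal encoding helper following this mapping might look like the sketch below. The function name, the zero-padding for short sequences, and the fallback for ambiguous bases are assumptions, not part of the released code:

# sketch of a one-hot encoder for the mapping above (helper name and padding behaviour are assumptions)
import numpy as np

_MAPPING = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0.25, 0.25, 0.25, 0.25],
}

def one_hot_encode(seq, length=1000):
    """One-hot encode a DNA string to shape (length, 4), truncating or zero-padding to length."""
    arr = np.zeros((length, 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()[:length]):
        arr[i] = _MAPPING.get(base, _MAPPING["N"])  # treat unknown characters like N
    return arr

# batch of one sequence -> shape (1, 1000, 4)
batch = one_hot_encode("ACGT" * 250)[np.newaxis, ...]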

Applications

  • CRISPR array detection (fine-tuned version: genomenet/crispr-array-detection)
  • Sequence classification
  • Metagenome binning
  • Functional annotation
  • Sequence similarity search via embeddings (see the sketch below)
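
For the similarity-search use case, one rough approach is to mean-pool the per-position embeddings into a single vector per sequence and compare vectors by cosine similarity. The pooling strategy and metric are assumptions here, not something specified by the model card; one_hot_encode is the helper sketched in the Tokenization section:

# sketch: compare sequences via mean-pooled embeddings and cosine similarity (pooling/metric are assumptions)
import numpy as np

def embed(seq):
    """Return a single 768-dim vector by mean-pooling the per-position embeddings."""
    x = one_hot_encode(seq)[np.newaxis, ...]        # (1, 1000, 4)
    per_position = embedding_model.predict(x)[0]    # (1000, 768)
    return per_position.mean(axis=0)                # (768,)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

score = cosine_similarity(embed("ACGT" * 250), embed("TTGCA" * 200))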

Acknowledgements

  • BMBF de.NBI / GenomeNet
  • DFG SPP 2141
  • Helmholtz Centre for Infection Research (HZI)