# bert-metagenome
BERT model pretrained on metagenomic contigs and complete microbial genomes for DNA sequence embedding.
## Model
| Property | Value |
|---|---|
| Architecture | BERT, 24 layers, 768 hidden units, 12 attention heads |
| Parameters | ~430M |
| Input | 1,000 bp DNA sequence (ACGT) |
| Output | 768-dim embedding per position |
| Pretraining data | metagenomic contigs + complete microbial genomes |
## Usage
```python
import tensorflow as tf
from huggingface_hub import hf_hub_download

# Download the model weights from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="genomenet/bert-metagenome",
    filename="bert_1k_3.h5",
)

# Load for inference only (compile=False skips the optimizer state);
# pass custom_objects here if your Keras version requires it
model = tf.keras.models.load_model(model_path, compile=False)

# Expose the per-position embeddings from transformer block 21
embedding_model = tf.keras.Model(
    inputs=model.input,
    outputs=model.get_layer("layer_transformer_block_21").output,
)

# Input: one-hot encoded DNA, shape (batch, 1000, 4)
# Output: embeddings, shape (batch, 1000, 768)
```
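A quick smoke test (not from the original card) to confirm the expected shapes, using a random one-hot batch:

```python
import numpy as np

# Random one-hot batch: exactly one base set at each of the 1000 positions
x = np.zeros((1, 1000, 4), dtype=np.float32)
x[0, np.arange(1000), np.random.randint(0, 4, size=1000)] = 1.0

emb = embedding_model.predict(x)
print(emb.shape)  # expected: (1, 1000, 768)
```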
## Tokenization
DNA sequences are one-hot encoded:
- A = [1, 0, 0, 0]
- C = [0, 1, 0, 0]
- G = [0, 0, 1, 0]
- T = [0, 0, 0, 1]
- N = [0.25, 0.25, 0.25, 0.25]
Input shape: (batch_size, 1000, 4)
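As a convenience, a minimal encoding helper might look like the sketch below. It is not part of the repository, and padding short sequences with `N` is an assumption; the card only specifies the per-base encoding above.

```python
import numpy as np

# Row vectors for each base, matching the encoding table above
BASE_TO_ROW = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0.25, 0.25, 0.25, 0.25],
}

def one_hot_encode(seq: str, length: int = 1000) -> np.ndarray:
    """Encode a DNA string to shape (length, 4).

    Truncates long sequences; pads short ones with N (assumption).
    Unknown characters are treated as N.
    """
    seq = seq.upper()[:length].ljust(length, "N")
    return np.array(
        [BASE_TO_ROW.get(base, BASE_TO_ROW["N"]) for base in seq],
        dtype=np.float32,
    )

# Build a batch of shape (1, 1000, 4) ready for embedding_model
batch = one_hot_encode("ACGT" * 250)[np.newaxis]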
## Applications
- CRISPR array detection (fine-tuned version: genomenet/crispr-array-detection)
- Sequence classification
- Metagenome binning
- Functional annotation
- Sequence similarity search via embeddings (see the sketch after this list)
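One way to use the embeddings for similarity search is to mean-pool the per-position vectors into a single vector per sequence and rank candidates by cosine similarity. This is an illustrative sketch, not the card's prescribed method: the pooling choice is an assumption, `one_hot_encode` and `embedding_model` come from the snippets above, and `query_seq`/`database_seqs` are placeholder variables.

```python
import numpy as np

def embed(seqs):
    # Stack one-hot sequences, then mean-pool the per-position embeddings
    x = np.stack([one_hot_encode(s) for s in seqs])  # (n, 1000, 4)
    per_pos = embedding_model.predict(x)             # (n, 1000, 768)
    return per_pos.mean(axis=1)                      # (n, 768)

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# query_seq and database_seqs are placeholders for your own sequences
vecs = embed([query_seq] + database_seqs)
scores = [cosine_sim(vecs[0], v) for v in vecs[1:]]
best = int(np.argmax(scores))  # index of the most similar database sequence
```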
## Acknowledgements
- BMBF de.NBI / GenomeNet
- DFG SPP 2141
- Helmholtz Centre for Infection Research (HZI)