BactoTiny-Seq-86M

A from-scratch BERT-style encoder trained on bacterial 16S rRNA sequences for sequence representation learning and genus-group classification.

⚠️ Research use only. This model is not a diagnostic device. Predictions must not be used for clinical, food-safety, or regulatory decisions without reference confirmation by an accredited laboratory.


Overview

BactoTiny-Seq-86M is an 86M-parameter BERT encoder trained from random initialisation on bacterial 16S rRNA sequences from SILVA 138.1 NR99. No weights from any existing language model were used. All sequence knowledge was learned during masked language modelling (MLM) pretraining on a curated, fragment-augmented corpus of approximately 62,000 bacterial 16S sequences (about 558,000 sequences after augmentation).

The model produces 768-dimensional CLS embeddings usable for downstream bacterial group classification, sequence similarity, or transfer learning to other microbial genomics tasks.

Note on parameter count: Standard BERT-base reaches 110M parameters in part through its 30,522-token embedding matrix (~23M params). BactoTiny-Seq uses a 10-token character-level DNA vocabulary, making the token embedding table negligible (7,680 params). Almost the entire parameter budget therefore sits in the 12 transformer layers. This is an intentional design choice: the model is forced to learn sequence structure rather than vocabulary statistics.
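
The arithmetic behind this note can be checked with a short sketch. It assumes the standard Hugging Face BertModel layout (learned position embeddings for 512 positions, 2 token-type embeddings, biases on every linear layer, no pooler and no MLM head); these details are assumptions rather than facts taken from the released config, and `bert_encoder_params` is an illustrative name.

```python
def bert_encoder_params(vocab_size, hidden=768, layers=12, ffn=3072,
                        max_positions=512, type_vocab=2):
    """Count parameters of a BERT-style encoder (no pooler, no MLM head)."""
    layer_norm = 2 * hidden  # scale + shift
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + layers * per_layer


token_table_dna = 10 * 768        # token embedding table, DNA vocabulary
token_table_bert = 30_522 * 768   # token embedding table, BERT-base vocabulary

print(f"{token_table_dna:,} vs {token_table_bert:,}")
print(f"{bert_encoder_params(10):,}")      # 10-token DNA vocabulary
print(f"{bert_encoder_params(30_522):,}")  # BERT-base WordPiece vocabulary
```

Under these assumptions the 10-token encoder comes to 85,458,432 parameters, in line with the stated 86M, and the same encoder with a 30,522-token vocabulary comes to 108,891,648.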


Architecture

Parameter            Value
Total parameters     86M
Layers               12
Hidden size          768
Attention heads      12
FFN size             3072
Context window       512 tokens (~510 bp)
Vocabulary           10 tokens: A T G C N + [PAD] [UNK] [CLS] [SEP] [MASK]
Positional encoding  Learned absolute
Initialisation       Random; no pretrained weights from any existing model
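
To make the 10-token vocabulary and the 512-token window concrete, here is a minimal pure-Python sketch of character-level encoding. The token IDs and the `encode` helper are hypothetical: the released tokenizer may order its vocabulary differently, and mapping characters outside A/T/G/C/N to [UNK] is an assumption.

```python
# Hypothetical ID assignment; the released tokenizer's IDs may differ.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "A": 5, "T": 6, "G": 7, "C": 8, "N": 9}
MAX_TOKENS = 512


def encode(sequence, pad_to=None):
    """Map a DNA/RNA string to [CLS] + one token per base + [SEP]."""
    bases = sequence.upper().replace("U", "T")
    bases = bases[:MAX_TOKENS - 2]  # keep the 5' end; at most 510 bases fit
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(base, VOCAB["[UNK]"]) for base in bases]
    ids.append(VOCAB["[SEP]"])
    if pad_to is not None:
        ids += [VOCAB["[PAD]"]] * (pad_to - len(ids))
    return ids


print(encode("ACGUN", pad_to=8))   # [2, 5, 8, 7, 6, 9, 3, 0]
print(len(encode("A" * 1500)))     # 512
```

One token per base is why the 512-token window holds about 510 bp: two positions are taken by [CLS] and [SEP].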

Training

Setting                Value
Base data              SILVA 138.1 SSURef NR99
Sequences post-filter  62,081
Target genera          18
Pretraining objective  Masked Language Modelling (MLM, 15% masking)
Corpus augmentation    2 random fragments per sequence at 150/300/500/800 bp
Augmented corpus size  ~558,000 sequences
Effective batch size   128
Learning rate          1e-4, cosine decay with 5% linear warmup
Weight decay           0.01
Precision              bf16
Hardware               NVIDIA A100
Optimiser              AdamW
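
The learning-rate row can be read as the following schedule. This is a minimal sketch: the function name `lr_at` is illustrative, and decaying all the way to zero is an assumption, since the card does not state a final learning rate.

```python
import math


def lr_at(step, total_steps, peak_lr=1e-4, warmup_frac=0.05, final_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to final_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))


# Sample the schedule at the start, end of warmup, midpoint of decay, and end.
for step in (0, 49, 50, 525, 999):
    print(step, f"{lr_at(step, total_steps=1000):.2e}")
```

With 1,000 total steps, the first 50 steps ramp linearly to 1e-4, the rate passes through 5e-5 halfway through the decay phase, and it approaches zero at the final step.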

Fragment augmentation rationale: Without augmentation, the model trains exclusively on full-length sequences (~1,500 bp) truncated to 512 tokens, so it never encounters the pattern [CLS + N short tokens + (512-N) PAD]. At inference on partial fragments, the CLS representation then collapses. Augmenting the pretraining corpus with short fragments at the target evaluation lengths forces the model to learn useful CLS representations across all input lengths, which produces the short-fragment performance documented below.
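
A minimal sketch of the kind of fragment sampling described above. It assumes two fragments are drawn at each of the four lengths (together with the original sequence that gives 9 items per sequence, and 62,081 × 9 = 558,729, consistent with the stated ~558,000), that start positions are uniform, and that lengths longer than the source sequence are skipped. None of these details are confirmed by the card, and `sample_fragments` is an illustrative name.

```python
import random

FRAGMENT_LENGTHS = (150, 300, 500, 800)  # bp, from the training table
FRAGMENTS_PER_LENGTH = 2                 # assumed: two draws per length


def sample_fragments(sequence, rng):
    """Return the full sequence plus random sub-fragments at each target length."""
    items = [sequence]
    for length in FRAGMENT_LENGTHS:
        if length > len(sequence):
            continue  # assumed: skip lengths the sequence cannot supply
        for _ in range(FRAGMENTS_PER_LENGTH):
            start = rng.randrange(len(sequence) - length + 1)
            items.append(sequence[start:start + length])
    return items


rng = random.Random(0)
full_length = "".join(rng.choice("ACGT") for _ in range(1500))
corpus_items = sample_fragments(full_length, rng)
print(len(corpus_items))                        # 9
print(sorted({len(s) for s in corpus_items}))   # [150, 300, 500, 800, 1500]
print(62_081 * 9)                               # 558729
```

Each fragment is later wrapped in [CLS] and [SEP] and padded, so the model sees short inputs followed by padding during pretraining as well as at inference.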


Target Genus Groups (18)

Group                 Family              Notes
Aeromonas             Aeromonadaceae
Bacillus              Bacillaceae         Endospore-forming
Campylobacter         Campylobacteraceae  Microaerophile
Citrobacter           Enterobacteriaceae
Cronobacter           Enterobacteriaceae  Formerly Enterobacter sakazakii
Enterobacter          Enterobacteriaceae
Enterococcus          Enterococcaceae
Escherichia/Shigella  Enterobacteriaceae  Reported as combined group; 16S cannot distinguish them
Klebsiella            Enterobacteriaceae
Legionella            Legionellaceae
Listeria              Listeriaceae        Cold-tolerant
Proteus               Enterobacteriaceae
Pseudomonas           Pseudomonadaceae
Salmonella            Enterobacteriaceae
Serratia              Enterobacteriaceae
Staphylococcus        Staphylococcaceae
Vibrio                Vibrionaceae
Yersinia              Enterobacteriaceae

Clostridium was excluded: only 2 sequences were recovered from SILVA 138.1 NR99 under the applied quality filters, which is insufficient for training or evaluation.


Benchmark Results

All evaluations use a held-out 20% stratified test split, and all reported values are classification accuracy. Fragment lengths simulate partial 16S reads from primer-limited or short-read sequencing workflows.

Full benchmark table

Fragment     LogReg (4-mer)  XGBoost (4-mer)  BactoTiny-86M (probe)  BactoTiny-86M (MLP)
150 bp       0.524           0.846            0.705                  0.759
300 bp       0.599           0.934            0.814                  0.846
500 bp       0.740           0.970            0.870                  0.884
800 bp       0.796           0.983            0.910                  0.911
Full-length  0.632           0.986            0.952                  0.956
  • Probe: logistic regression on frozen CLS embeddings (StandardScaler + LogReg C=1.0)
  • MLP: 2-layer MLP head (768→512→18, GELU, dropout 0.1) trained on frozen CLS embeddings for 30 epochs

Per-class report: full-length sequences (linear probe)

Class                 Precision  Recall  F1     Support
Aeromonas             0.99       0.97    0.98   288
Bacillus              0.99       0.98    0.99   600
Campylobacter         1.00       1.00    1.00   204
Citrobacter           0.74       0.78    0.76   154
Cronobacter           0.84       0.60    0.70   45
Enterobacter          0.76       0.84    0.80   454
Enterococcus          0.99       0.99    0.99   383
Escherichia/Shigella  0.96       0.95    0.95   600
Klebsiella            0.89       0.88    0.88   600
Legionella            1.00       1.00    1.00   152
Listeria              1.00       0.99    1.00   253
Proteus               0.98       0.93    0.95   68
Pseudomonas           0.99       0.99    0.99   600
Salmonella            0.99       0.99    0.99   600
Serratia              0.90       0.88    0.89   262
Staphylococcus        0.99       1.00    0.99   600
Vibrio                0.99       0.99    0.99   600
Yersinia              0.97       0.94    0.96   141
Overall accuracy                         0.952  6604
Macro avg             0.94       0.93    0.93   6604
Weighted avg          0.95       0.95    0.95   6604
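
As a consistency check on the report above, F1 is the harmonic mean of precision and recall, and the macro average is the unweighted mean over classes. The sketch below recomputes a few entries from the rounded table values, so agreement is only expected to two decimals; `f1_score` here is a local helper, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the per-class report.
reported = {"Cronobacter": (0.84, 0.60), "Citrobacter": (0.74, 0.78),
            "Enterobacter": (0.76, 0.84)}
for name, (p, r) in reported.items():
    print(name, round(f1_score(p, r), 2))

# Per-class F1 values in table order, Aeromonas through Yersinia.
f1_values = [0.98, 0.99, 1.00, 0.76, 0.70, 0.80, 0.99, 0.95, 0.88,
             1.00, 1.00, 0.95, 0.99, 0.99, 0.89, 0.99, 0.99, 0.96]
print(round(sum(f1_values) / len(f1_values), 2))
```

The recomputed values (0.70, 0.76, 0.80, and a macro F1 of 0.93) match the table.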

Honest Assessment

Strengths:

Full-length classification at 0.952 (linear probe) is strong for a from-scratch model with a 10-token vocabulary. Campylobacter, Legionella, and Listeria, organisms with genuinely distinctive 16S signatures, each reach an F1 of 1.00. At 150 bp, the BactoTiny MLP (0.759) outperforms the logistic regression k-mer baseline (0.524) and closes to within 0.087 of XGBoost (0.846), which is a strong k-mer classifier. At full length, the BactoTiny MLP (0.956) is within 0.03 of XGBoost (0.986).

Limitations:

The five Enterobacteriaceae genera with lower F1, namely Cronobacter (0.70), Citrobacter (0.76), Enterobacter (0.80), Klebsiella (0.88), and Serratia (0.89), reflect a fundamental constraint of 16S-based identification. The 16S gene does not vary sufficiently within Enterobacteriaceae to cleanly resolve genus-level identity. This is a biological limitation that affects all 16S sequence-based methods, including XGBoost, and biochemical confirmation is required regardless of prediction. Cronobacter support in the test set is low (45 sequences), so its F1 estimate carries more uncertainty than those of better-represented classes.
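
To put a rough number on that uncertainty, a Wilson score interval can be computed for Cronobacter recall. The sketch assumes 27 of the 45 test sequences were recovered (0.60 × 45, back-calculated from the rounded recall) and treats test sequences as independent draws; both are assumptions, and the second call uses a class with recall 0.99 and support 600 purely for contrast.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


low, high = wilson_interval(27, 45)    # Cronobacter: recall 0.60, support 45
print(f"{low:.2f} to {high:.2f}")
low, high = wilson_interval(594, 600)  # recall 0.99, support 600
print(f"{low:.2f} to {high:.2f}")
```

Under these assumptions the Cronobacter interval spans roughly 0.45 to 0.73, far wider than the interval for a 600-sequence class.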

XGBoost on 4-mer frequencies remains the stronger classifier at all fragment lengths in direct accuracy comparison. The value of BactoTiny-Seq lies not in competitive accuracy against k-mer methods on this specific task but in transferable learned representations that capture structural and evolutionary patterns beyond frequency statistics. These are usable for embedding-based search, transfer learning, and downstream tasks where the k-mer approach does not generalise.


Limitations

  • Context window: 512 tokens (~510 bp). Sequences longer than this are truncated at the 3′ end.
  • 16S rRNA cannot distinguish Escherichia from Shigella; they are always reported as a combined group.
  • Trained on 18 target genera only. There is no out-of-distribution detection, so inputs from outside this set are still assigned one of the 18 labels and those predictions are unreliable.
  • Cronobacter and Proteus have small test support (45 and 68 sequences). Per-class metrics for these groups carry higher uncertainty.
  • The Enterobacteriaceae confusion cluster reflects the fundamental limitation of 16S-based identification within this family, not a fixable model defect.
  • Not validated on clinical, environmental, or food-safety isolates. Research prototype only.

Intended Use

  • Bacterial group prediction from 16S rRNA sequences as a research tool
  • CLS embedding extraction for downstream microbial ML tasks
  • Benchmarking learned vs frequency-based 16S representations
  • Transfer learning foundation for other bacterial sequence tasks

Out-of-Scope Use

  • Clinical or diagnostic microbiology without reference confirmation
  • Species-level identification (family/genus-group level only)
  • Regulatory, food-safety, or public health decisions
  • Organisms outside the 18 target genera

Usage

from transformers import BertModel, PreTrainedTokenizerFast
import torch
import numpy as np

MODEL_ID  = "EphAsad/BactoTiny-Seq-86M"

tokenizer = PreTrainedTokenizerFast.from_pretrained(MODEL_ID)
model     = BertModel.from_pretrained(MODEL_ID, add_pooling_layer=False)
model.eval()

def embed(sequence: str) -> np.ndarray:
    """Returns a 768-dimensional CLS embedding for a 16S rRNA sequence."""
    sequence = sequence.upper().replace("U", "T")
    # Dynamic padding: pad to this sequence's length (+2 for [CLS]/[SEP]),
    # rounded up to a multiple of 8 and capped at the 512-token context window.
    seq_len = min(len(sequence) + 2, 512)
    pad_len = ((seq_len + 7) // 8) * 8
    enc = tokenizer(
        sequence,
        max_length  = pad_len,
        truncation  = True,
        padding     = "max_length",
        return_tensors = "pt",
    )
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[:, 0, :].squeeze().numpy()

# Example
emb = embed("AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGG")
print(emb.shape)  # (768,)
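
For the sequence-similarity use mentioned in the overview, embeddings are commonly compared with cosine similarity. The helper names below are illustrative, the toy 3-dimensional vectors stand in for real 768-dimensional outputs of embed, and the choice of cosine similarity is itself an assumption, since the card does not say which metric was used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, references):
    """Sort (name, vector) pairs from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in references]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors; in practice each would come from embed(sequence).
references = [("ref_a", [1.0, 0.0, 0.0]), ("ref_b", [0.6, 0.8, 0.0])]
ranking = rank_by_similarity([0.5, 0.9, 0.0], references)
print(ranking[0][0])  # ref_b
```

Because the helpers only iterate over their inputs, they accept the NumPy arrays returned by embed as well as plain lists.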

Citation

@misc{bactotiny2026,
  author    = {Asad, Zain},
  title     = {BactoTiny-Seq-86M: From-scratch bacterial 16S rRNA transformer},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/EphAsad/BactoTiny-Seq-86M}
}

Project Context

BactoTiny-Seq is part of a portfolio of applied microbiology AI projects developed alongside four years of production laboratory informatics work. Related projects include BactAID (hybrid bacterial identification system, XGBoost + LoRA FLAN-T5, 95.1% accuracy across 140 genera), DomainEmbedder (domain-adaptive embeddings with A2C RL routing), and FireSOP. The unifying design philosophy across all projects is deterministic fallback at every model integration point, confidence-aware outputs, and human-in-the-loop design.
