BactoTiny-Beta

A from-scratch BERT-style encoder trained on bacterial 16S rRNA sequences for sequence representation learning and genus-group classification.

⚠️ Research use only. This model is not a diagnostic device. Predictions must not be used for clinical, food-safety, or regulatory decisions without reference confirmation by an accredited laboratory.


Overview

BactoTiny-Beta is an 86M-parameter BERT encoder trained from random initialisation on bacterial 16S rRNA sequences from SILVA 138.1 NR99. No weights from any existing language model were transferred. All sequence knowledge was learned during masked language modelling (MLM) pretraining on a curated corpus of ~62,000 bacterial 16S sequences, expanded to ~558,000 by fragment augmentation.

The model produces fixed-length sequence embeddings (768-dimensional CLS vectors) that can be used for downstream bacterial group classification, sequence similarity search, or transfer learning.


Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| FFN size | 3072 |
| Total parameters | ~86M |
| Context window | 512 tokens (~510 bp) |
| Vocabulary | 10 tokens: character-level DNA (A, T, G, C, N + 5 special) |
| Initialisation | Random (no pretrained weights from any existing model) |

Tokenizer: Character-level. Each nucleotide is one token. A vocabulary size of 10 means that nearly all model parameters reside in the transformer layers rather than in the embeddings, forcing the model to learn sequence structure directly.
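
As a quick illustration of what character-level tokenisation looks like in practice (a sketch assuming the tokenizer published with this checkpoint; the exact special-token names depend on the tokenizer config):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("EphAsad/BactoTiny-Beta")

enc = tokenizer("AGTCN")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# One token per nucleotide plus special tokens, e.g.
# ['[CLS]', 'A', 'G', 'T', 'C', 'N', '[SEP]'] (names depend on the config)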


Training

| Setting | Value |
|---|---|
| Base data | SILVA 138.1 SSURef NR99 |
| Sequences (post-filter) | 62,081 |
| Target genera | 18 (see below) |
| Pretraining objective | Masked Language Modelling (MLM, 15% masking) |
| Corpus augmentation | Fragment augmentation at 150/300/500/800 bp (2 fragments/seq/length) |
| Augmented corpus size | ~558,000 sequences |
| Epochs | 10 |
| Effective batch size | 128 (64 × grad_accum 2) |
| Learning rate | 2e-4 with linear warmup (5%) |
| Weight decay | 0.01 |
| Precision | bf16 |
| Hardware | NVIDIA A100 |
| Optimiser | AdamW |
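
For readers who want to reproduce the setup, here is a minimal sketch of the architecture and MLM objective implied by the two tables above, using Hugging Face transformers. This is an illustrative reconstruction, not the author's actual training script:

from transformers import (BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          PreTrainedTokenizerFast)

tokenizer = PreTrainedTokenizerFast.from_pretrained("EphAsad/BactoTiny-Beta")

config = BertConfig(
    vocab_size=10,                  # character-level DNA vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)     # random initialisation, no transferred weights

# 15% dynamic masking, matching the pretraining objective listed above
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)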

Fragment augmentation rationale: Full-length 16S sequences (~1500 bp) are truncated to 512 tokens at tokenisation. Without augmentation, the model never sees short-context inputs during pretraining, causing CLS embedding collapse on partial fragments at inference. Augmentation ensures the model trains on the same length distribution it will encounter downstream.
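
A hypothetical re-implementation of this augmentation step (the function name and uniform sampling are illustrative assumptions; the 2-fragments-per-sequence-per-length setting comes from the training table):

import random

def augment_fragments(seq, lengths=(150, 300, 500, 800), n_per_length=2):
    fragments = []
    for frag_len in lengths:
        if len(seq) <= frag_len:
            continue                          # skip lengths >= source length
        for _ in range(n_per_length):
            start = random.randint(0, len(seq) - frag_len)
            fragments.append(seq[start:start + frag_len])
    return fragments

# A full-length (~1500 bp) 16S sequence yields up to 8 fragments; together
# with the original this gives the ~9x expansion from 62,081 to ~558,000.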


Target Genus Groups (18)

| Group | Notes |
|---|---|
| Aeromonas | |
| Bacillus | Endospore-forming |
| Campylobacter | Microaerophile; requires 5% O₂ |
| Citrobacter | Enterobacteriaceae |
| Cronobacter | Formerly Enterobacter sakazakii |
| Enterobacter | Enterobacteriaceae |
| Enterococcus | |
| Escherichia/Shigella | 16S cannot distinguish these genera; reported as a combined group |
| Klebsiella | Enterobacteriaceae |
| Legionella | Requires BCYE media; will not grow on blood agar |
| Listeria | Cold-tolerant, facultative anaerobe |
| Proteus | Swarming motility, Enterobacteriaceae |
| Pseudomonas | Oxidase positive, obligate aerobe |
| Salmonella | Enterobacteriaceae |
| Serratia | Enterobacteriaceae |
| Staphylococcus | Gram-positive coccus |
| Vibrio | Halophilic curved rod |
| Yersinia | Temperature-dependent motility, Enterobacteriaceae |

Note: Clostridium was excluded from training due to insufficient SILVA representation (2 sequences) in this filtered corpus.


Benchmark Results

Classification accuracy on held-out test sequences (20% stratified split). Classifiers: a logistic regression linear probe and an MLP classifier, both trained on frozen CLS embeddings.
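
A sketch of this evaluation protocol with scikit-learn (variable names are hypothetical: `sequences` and `labels` stand in for the held-out corpus, and embed() is defined in the Usage section below):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# `sequences` and `labels` are assumed inputs; see the lead-in above
X = np.stack([embed(s) for s in sequences])   # frozen CLS embeddings, (n, 768)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.20, stratify=labels, random_state=0
)
probe = LogisticRegression(max_iter=1000)     # the encoder itself stays frozen
probe.fit(X_tr, y_tr)
print(accuracy_score(y_te, probe.predict(X_te)))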

Full benchmark: all fragment lengths

| Fragment Length | LogReg (4-mer) | XGBoost (4-mer) | BactoTiny-86M (probe) | BactoTiny-86M (MLP) |
|---|---|---|---|---|
| 150 bp | 0.524 | 0.846 | 0.185 | 0.171 |
| 300 bp | 0.599 | 0.934 | 0.262 | 0.247 |
| 500 bp | 0.740 | 0.970 | 0.329 | 0.313 |
| 800 bp | 0.796 | 0.983 | 0.368 | 0.346 |
| Full-length | 0.632 | 0.986 | 0.799 | 0.781 |

Per-class report: full-length sequences (linear probe)

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Aeromonas | 0.74 | 0.74 | 0.74 | 288 |
| Bacillus | 0.68 | 0.80 | 0.73 | 600 |
| Campylobacter | 0.92 | 0.90 | 0.91 | 204 |
| Citrobacter | 0.66 | 0.42 | 0.51 | 154 |
| Cronobacter | 0.56 | 0.22 | 0.32 | 45 |
| Enterobacter | 0.51 | 0.60 | 0.55 | 454 |
| Enterococcus | 0.76 | 0.82 | 0.79 | 383 |
| Escherichia/Shigella | 0.87 | 0.80 | 0.83 | 600 |
| Klebsiella | 0.87 | 0.79 | 0.83 | 600 |
| Legionella | 0.89 | 0.84 | 0.86 | 152 |
| Listeria | 0.98 | 0.93 | 0.95 | 253 |
| Proteus | 0.76 | 0.51 | 0.61 | 68 |
| Pseudomonas | 0.80 | 0.88 | 0.84 | 600 |
| Salmonella | 0.93 | 0.95 | 0.94 | 600 |
| Serratia | 0.66 | 0.51 | 0.57 | 262 |
| Staphylococcus | 0.92 | 0.91 | 0.91 | 600 |
| Vibrio | 0.80 | 0.81 | 0.80 | 600 |
| Yersinia | 0.92 | 0.87 | 0.90 | 141 |
| Overall accuracy | | | 0.80 | 6604 |

Honest Assessment of Results

What the model does well: Listeria (F1 0.95), Salmonella (0.94), Campylobacter (0.91), Staphylococcus (0.91), and Yersinia (0.90) all score strongly. These genera have genuinely distinctive 16S signatures, and the model has learned them effectively. Full-length classification accuracy of 0.80 is meaningful for a from-scratch model with a 10-token vocabulary.

Where the model struggles: The five Enterobacteriaceae genera with low F1 scores (Cronobacter 0.32, Citrobacter 0.51, Enterobacter 0.55, Serratia 0.57, Proteus 0.61) are genuinely difficult to separate by 16S rRNA. This is a biological limitation, not only a model limitation: the 16S gene is not sufficiently variable within Enterobacteriaceae to reliably discriminate at genus level. Biochemical confirmation is required regardless of any sequence-based prediction.

Fragment-length performance: Short-fragment accuracy (0.185–0.368 at 150–800 bp) is substantially below the XGBoost k-mer baseline at equivalent lengths. XGBoost on 4-mer frequencies is a strong baseline for 16S classification and outperforms this model at every fragment length in direct comparison. The primary value of BactoTiny-Beta is therefore not competitive accuracy against k-mer methods, but rather:

  • Learned sequence representations that capture structural and evolutionary patterns beyond frequency statistics
  • A domain-specific pretraining foundation for transfer learning to other bacterial genomics tasks
  • A proof of concept that a character-level transformer can be trained from scratch on bacterial sequence data and produce biologically meaningful embeddings

The short-fragment gap relative to XGBoost reflects that (a) the model is trained on 512-token windows of full-length sequences and may benefit from longer pretraining with more fragment diversity, and (b) 4-mer frequency profiles are a very efficient encoding for short sequences where there is little structural context for a transformer to exploit.
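
To make point (b) concrete, here is a hypothetical 4-mer frequency featuriser of the kind such baselines are built on (the exact baseline pipeline is not published here):

from itertools import product
import numpy as np

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]   # 256 features
KMER_INDEX = {k: i for i, k in enumerate(KMERS)}

def kmer_profile(seq: str, k: int = 4) -> np.ndarray:
    """Normalised 4-mer frequency vector; k-mers containing N are skipped."""
    counts = np.zeros(len(KMERS))
    seq = seq.upper().replace("U", "T")
    for i in range(len(seq) - k + 1):
        idx = KMER_INDEX.get(seq[i:i + k])
        if idx is not None:
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total else counts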


Limitations

  • Context window is 512 tokens (~510 bp). Sequences longer than this are truncated at the 3′ end.
  • 16S rRNA cannot reliably distinguish Escherichia from Shigella; these are always reported as a combined group.
  • Short fragments (<300 bp) produce low-confidence predictions. Minimum recommended input: 300 bp.
  • Trained on 18 target genera only. Organisms outside this set will produce low-confidence, unreliable predictions.
  • Cronobacter and Proteus have low test support (45 and 68 sequences respectively). Per-class metrics for these groups should be interpreted with caution.
  • The Enterobacteriaceae confusion cluster (Enterobacter, Citrobacter, Serratia, Cronobacter, Proteus) reflects the fundamental limitation of 16S-based genus-level identification within this family.
  • This model has not been validated against clinical, environmental, or food-safety isolates. It is a research prototype.

Intended Use

  • Bacterial group prediction from 16S rRNA sequences as a research tool
  • Sequence embedding extraction for downstream microbiological ML tasks
  • Benchmarking and comparison of learned versus frequency-based 16S representations
  • Educational demonstration of from-scratch biological sequence modelling

Out-of-Scope Use

  • Clinical or diagnostic microbiology without reference confirmation
  • Species-level identification (model operates at family/genus-group level only)
  • Regulatory, food-safety, or public health decision-making
  • Organisms outside the 18 target genera

Usage

from transformers import BertModel, PreTrainedTokenizerFast
import torch
import numpy as np

tokenizer = PreTrainedTokenizerFast.from_pretrained("EphAsad/BactoTiny-Beta")
model     = BertModel.from_pretrained("EphAsad/BactoTiny-Beta",
                                       add_pooling_layer=False)
model.eval()

def embed(sequence: str) -> np.ndarray:
    sequence = sequence.upper().replace("U", "T")  # accept RNA-alphabet input
    # Dynamic padding: pad to the sequence length, rounded up to a
    # multiple of 8, rather than to the global 512-token maximum
    seq_len  = min(len(sequence) + 2, 512)         # +2 for [CLS] and [SEP]
    pad_len  = ((seq_len + 7) // 8) * 8
    enc = tokenizer(sequence, max_length=pad_len,
                    truncation=True, padding="max_length",
                    return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[:, 0, :].squeeze().numpy()  # CLS vector

emb = embed("AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGG")
print(emb.shape)  # (768,)
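
The same embeddings can back the sequence similarity search mentioned above. A minimal example, assuming `ref_embs` and `ref_labels` are a reference set precomputed with embed():

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# `ref_embs` / `ref_labels` are assumed precomputed; see the lead-in above
query  = embed("AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGG")
scores = [cosine(query, e) for e in ref_embs]
print(ref_labels[int(np.argmax(scores))])   # label of the nearest reference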

Citation

@misc{bactotiny2025,
  author    = {Asad, Zain},
  title     = {BactoTiny-Beta: From-scratch bacterial 16S rRNA transformer},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/EphAsad/BactoTiny-Beta}
}

Project Context

BactoTiny-Beta is part of a portfolio of applied microbiology AI projects including BactAID (hybrid bacterial identification, XGBoost + LoRA FLAN-T5), DomainEmbedder (domain-adaptive embeddings with RL routing), and FireSOP (BM25 + FAISS hybrid SOP retrieval). The unifying design philosophy across all projects is deterministic fallback at every model integration point, confidence-aware outputs, and human-in-the-loop design.
