BactoTiny-Beta

A from-scratch BERT-style encoder trained on bacterial 16S rRNA sequences for sequence representation learning and genus-group classification.

⚠️ Research use only. This model is not a diagnostic device. Predictions must not be used for clinical, food-safety, or regulatory decisions without reference confirmation by an accredited laboratory.


Overview

BactoTiny-Beta is an 86M-parameter BERT encoder trained from random initialisation on bacterial 16S rRNA sequences from SILVA 138.1 NR99. No weights from any existing language model were transferred. All sequence knowledge was learned during masked language modelling (MLM) pretraining on a curated corpus of ~62,000 bacterial 16S sequences, expanded to ~558,000 by fragment augmentation.

The model produces fixed-length sequence embeddings (768-dimensional CLS vectors) that can be used for downstream bacterial group classification, sequence similarity search, or transfer learning.


Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| FFN size | 3072 |
| Total parameters | ~86M |
| Context window | 512 tokens (~510 bp) |
| Vocabulary | 10 tokens: character-level DNA (A, T, G, C, N + 5 special) |
| Initialisation | Random (no pretrained weights from any existing model) |

Tokenizer: Character-level. Each nucleotide is one token. A vocabulary size of 10 means that nearly all model parameters reside in the transformer layers rather than in the embeddings, forcing the model to learn sequence structure directly.
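
As a quick illustration of what character-level tokenisation looks like in practice (a sketch assuming the tokenizer published with this checkpoint; the exact special-token names depend on the tokenizer config):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("EphAsad/BactoTiny-Beta")

enc = tokenizer("AGTCN")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# One token per nucleotide plus special tokens, e.g.
# ['[CLS]', 'A', 'G', 'T', 'C', 'N', '[SEP]'] (names depend on the config)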


Training

| Setting | Value |
|---|---|
| Base data | SILVA 138.1 SSURef NR99 |
| Sequences (post-filter) | 62,081 |
| Target genera | 18 (see below) |
| Pretraining objective | Masked Language Modelling (MLM, 15% masking) |
| Corpus augmentation | Fragment augmentation at 150/300/500/800 bp (2 fragments/seq/length) |
| Augmented corpus size | ~558,000 sequences |
| Epochs | 10 |
| Effective batch size | 128 (64 × grad_accum 2) |
| Learning rate | 2e-4 with linear warmup (5%) |
| Weight decay | 0.01 |
| Precision | bf16 |
| Hardware | NVIDIA A100 |
| Optimiser | AdamW |
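
For readers who want to reproduce the setup, here is a minimal sketch of the architecture and MLM objective implied by the two tables above, using Hugging Face transformers. This is an illustrative reconstruction, not the author's actual training script:

from transformers import (BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          PreTrainedTokenizerFast)

tokenizer = PreTrainedTokenizerFast.from_pretrained("EphAsad/BactoTiny-Beta")

config = BertConfig(
    vocab_size=10,                  # character-level DNA vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)     # random initialisation, no transferred weights

# 15% dynamic masking, matching the pretraining objective listed above
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)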

Fragment augmentation rationale: Full-length 16S sequences (~1500 bp) are truncated to 512 tokens at tokenisation. Without augmentation, the model never sees short-context inputs during pretraining, causing CLS embedding collapse on partial fragments at inference. Augmentation ensures the model trains on the same length distribution it will encounter downstream.
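
A hypothetical re-implementation of this augmentation step (the function name and uniform sampling are illustrative assumptions; the 2-fragments-per-sequence-per-length setting comes from the training table):

import random

def augment_fragments(seq, lengths=(150, 300, 500, 800), n_per_length=2):
    fragments = []
    for frag_len in lengths:
        if len(seq) <= frag_len:
            continue                          # skip lengths >= source length
        for _ in range(n_per_length):
            start = random.randint(0, len(seq) - frag_len)
            fragments.append(seq[start:start + frag_len])
    return fragments

# A full-length (~1500 bp) 16S sequence yields up to 8 fragments; together
# with the original this gives the ~9x expansion from 62,081 to ~558,000.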


Target Genus Groups (18)

| Group | Notes |
|---|---|
| Aeromonas | |
| Bacillus | Endospore-forming |
| Campylobacter | Microaerophile; requires 5% O₂ |
| Citrobacter | Enterobacteriaceae |
| Cronobacter | Formerly Enterobacter sakazakii |
| Enterobacter | Enterobacteriaceae |
| Enterococcus | |
| Escherichia/Shigella | 16S cannot distinguish these genera; reported as a combined group |
| Klebsiella | Enterobacteriaceae |
| Legionella | Requires BCYE media; will not grow on blood agar |
| Listeria | Cold-tolerant, facultative anaerobe |
| Proteus | Swarming motility, Enterobacteriaceae |
| Pseudomonas | Oxidase positive, obligate aerobe |
| Salmonella | Enterobacteriaceae |
| Serratia | Enterobacteriaceae |
| Staphylococcus | Gram-positive coccus |
| Vibrio | Halophilic curved rod |
| Yersinia | Temperature-dependent motility, Enterobacteriaceae |

Note: Clostridium was excluded from training due to insufficient SILVA representation (2 sequences) in this filtered corpus.


Benchmark Results

Classification accuracy on held-out test sequences (20% stratified split). Classifiers: a logistic regression linear probe and an MLP classifier, both trained on frozen CLS embeddings.
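
A sketch of this evaluation protocol with scikit-learn (variable names are hypothetical: `sequences` and `labels` stand in for the held-out corpus, and embed() is defined in the Usage section below):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# `sequences` and `labels` are assumed inputs; see the lead-in above
X = np.stack([embed(s) for s in sequences])   # frozen CLS embeddings, (n, 768)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.20, stratify=labels, random_state=0
)
probe = LogisticRegression(max_iter=1000)     # the encoder itself stays frozen
probe.fit(X_tr, y_tr)
print(accuracy_score(y_te, probe.predict(X_te)))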

Full benchmark: all fragment lengths

| Fragment Length | LogReg (4-mer) | XGBoost (4-mer) | BactoTiny-86M (probe) | BactoTiny-86M (MLP) |
|---|---|---|---|---|
| 150 bp | 0.524 | 0.846 | 0.185 | 0.171 |
| 300 bp | 0.599 | 0.934 | 0.262 | 0.247 |
| 500 bp | 0.740 | 0.970 | 0.329 | 0.313 |
| 800 bp | 0.796 | 0.983 | 0.368 | 0.346 |
| Full-length | 0.632 | 0.986 | 0.799 | 0.781 |

Per-class report: full-length sequences (linear probe)

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Aeromonas | 0.74 | 0.74 | 0.74 | 288 |
| Bacillus | 0.68 | 0.80 | 0.73 | 600 |
| Campylobacter | 0.92 | 0.90 | 0.91 | 204 |
| Citrobacter | 0.66 | 0.42 | 0.51 | 154 |
| Cronobacter | 0.56 | 0.22 | 0.32 | 45 |
| Enterobacter | 0.51 | 0.60 | 0.55 | 454 |
| Enterococcus | 0.76 | 0.82 | 0.79 | 383 |
| Escherichia/Shigella | 0.87 | 0.80 | 0.83 | 600 |
| Klebsiella | 0.87 | 0.79 | 0.83 | 600 |
| Legionella | 0.89 | 0.84 | 0.86 | 152 |
| Listeria | 0.98 | 0.93 | 0.95 | 253 |
| Proteus | 0.76 | 0.51 | 0.61 | 68 |
| Pseudomonas | 0.80 | 0.88 | 0.84 | 600 |
| Salmonella | 0.93 | 0.95 | 0.94 | 600 |
| Serratia | 0.66 | 0.51 | 0.57 | 262 |
| Staphylococcus | 0.92 | 0.91 | 0.91 | 600 |
| Vibrio | 0.80 | 0.81 | 0.80 | 600 |
| Yersinia | 0.92 | 0.87 | 0.90 | 141 |
| Overall accuracy | | | 0.80 | 6604 |

Honest Assessment of Results

What the model does well: Listeria (F1 0.95), Salmonella (0.94), Campylobacter (0.91), Staphylococcus (0.91), and Yersinia (0.90) all score strongly. These genera have genuinely distinctive 16S signatures, and the model has learned them effectively. Full-length classification accuracy of 0.80 is meaningful for a from-scratch model with a 10-token vocabulary.

Where the model struggles: The five Enterobacteriaceae genera with low F1 scores (Cronobacter 0.32, Citrobacter 0.51, Enterobacter 0.55, Serratia 0.57, Proteus 0.61) are genuinely difficult to separate by 16S rRNA. This is a biological limitation, not only a model limitation: the 16S gene is not sufficiently variable within Enterobacteriaceae to reliably discriminate at genus level. Biochemical confirmation is required regardless of any sequence-based prediction.

Fragment-length performance: Short-fragment accuracy (0.185–0.368 at 150–800 bp) is substantially below the XGBoost k-mer baseline at equivalent lengths. XGBoost on 4-mer frequencies is a strong baseline for 16S classification and outperforms this model at every fragment length in direct comparison. The primary value of BactoTiny-Beta is therefore not competitive accuracy against k-mer methods, but rather:

  • Learned sequence representations that capture structural and evolutionary patterns beyond frequency statistics
  • A domain-specific pretraining foundation for transfer learning to other bacterial genomics tasks
  • A proof of concept that a character-level transformer can be trained from scratch on bacterial sequence data and produce biologically meaningful embeddings

The short-fragment gap relative to XGBoost reflects that (a) the model is trained on 512-token windows of full-length sequences and may benefit from longer pretraining with more fragment diversity, and (b) 4-mer frequency profiles are a very efficient encoding for short sequences where there is little structural context for a transformer to exploit.
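
To make point (b) concrete, here is a hypothetical 4-mer frequency featuriser of the kind such baselines are built on (the exact baseline pipeline is not published here):

from itertools import product
import numpy as np

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]   # 256 features
KMER_INDEX = {k: i for i, k in enumerate(KMERS)}

def kmer_profile(seq: str, k: int = 4) -> np.ndarray:
    """Normalised 4-mer frequency vector; k-mers containing N are skipped."""
    counts = np.zeros(len(KMERS))
    seq = seq.upper().replace("U", "T")
    for i in range(len(seq) - k + 1):
        idx = KMER_INDEX.get(seq[i:i + k])
        if idx is not None:
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total else counts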


Limitations

  • Context window is 512 tokens (~510 bp). Sequences longer than this are truncated at the 3′ end.
  • 16S rRNA cannot reliably distinguish Escherichia from Shigella; these are always reported as a combined group.
  • Short fragments (<300 bp) produce low-confidence predictions. Minimum recommended input: 300 bp.
  • Trained on 18 target genera only. Organisms outside this set will produce low-confidence, unreliable predictions.
  • Cronobacter and Proteus have low test support (45 and 68 sequences respectively). Per-class metrics for these groups should be interpreted with caution.
  • The Enterobacteriaceae confusion cluster (Enterobacter, Citrobacter, Serratia, Cronobacter, Proteus) reflects the fundamental limitation of 16S-based genus-level identification within this family.
  • This model has not been validated against clinical, environmental, or food-safety isolates. It is a research prototype.

Intended Use

  • Bacterial group prediction from 16S rRNA sequences as a research tool
  • Sequence embedding extraction for downstream microbiological ML tasks
  • Benchmarking and comparison of learned versus frequency-based 16S representations
  • Educational demonstration of from-scratch biological sequence modelling

Out-of-Scope Use

  • Clinical or diagnostic microbiology without reference confirmation
  • Species-level identification (model operates at family/genus-group level only)
  • Regulatory, food-safety, or public health decision-making
  • Organisms outside the 18 target genera

Usage

from transformers import BertModel, PreTrainedTokenizerFast
import torch
import numpy as np

tokenizer = PreTrainedTokenizerFast.from_pretrained("EphAsad/BactoTiny-Beta")
model     = BertModel.from_pretrained("EphAsad/BactoTiny-Beta",
                                       add_pooling_layer=False)
model.eval()

def embed(sequence: str) -> np.ndarray:
    sequence = sequence.upper().replace("U", "T")  # accept RNA-alphabet input
    # Dynamic padding: pad to the sequence length, rounded up to a
    # multiple of 8, rather than to the global 512-token maximum
    seq_len  = min(len(sequence) + 2, 512)         # +2 for [CLS] and [SEP]
    pad_len  = ((seq_len + 7) // 8) * 8
    enc = tokenizer(sequence, max_length=pad_len,
                    truncation=True, padding="max_length",
                    return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[:, 0, :].squeeze().numpy()  # CLS vector

emb = embed("AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGG")
print(emb.shape)  # (768,)
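
The same embeddings can back the sequence similarity search mentioned above. A minimal example, assuming `ref_embs` and `ref_labels` are a reference set precomputed with embed():

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# `ref_embs` / `ref_labels` are assumed precomputed; see the lead-in above
query  = embed("AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGG")
scores = [cosine(query, e) for e in ref_embs]
print(ref_labels[int(np.argmax(scores))])   # label of the nearest reference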

Citation

@misc{bactotiny2025,
  author    = {Asad, Zain},
  title     = {BactoTiny-Beta: From-scratch bacterial 16S rRNA transformer},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/EphAsad/BactoTiny-Beta}
}

Project Context

BactoTiny-Beta is part of a portfolio of applied microbiology AI projects including BactAID (hybrid bacterial identification, XGBoost + LoRA FLAN-T5), DomainEmbedder (domain-adaptive embeddings with RL routing), and FireSOP (BM25 + FAISS hybrid SOP retrieval). The unifying design philosophy across all projects is deterministic fallback at every model integration point, confidence-aware outputs, and human-in-the-loop design.
