BactoTiny-Seq-86M

A from-scratch BERT-style encoder trained on bacterial 16S rRNA sequences for sequence representation learning and genus-group classification.

⚠️ Research use only. This model is not a diagnostic device. Predictions must not be used for clinical, food-safety, or regulatory decisions without reference confirmation by an accredited laboratory.

Overview

BactoTiny-Seq-86M is an 86M-parameter BERT encoder trained from random initialisation on bacterial 16S rRNA sequences from SILVA 138.1 NR99. No weights from any existing language model were used. All sequence knowledge was learned during masked language modelling (MLM) pretraining on a curated, fragment-augmented corpus of approximately 62,000 bacterial 16S sequences (about 558,000 sequences after augmentation).

The model produces 768-dimensional CLS embeddings usable for downstream bacterial group classification, sequence similarity, or transfer learning to other microbial genomics tasks.

Note on parameter count: Standard BERT-base reaches 110M parameters largely through its 30,522-token embedding matrix (23M params). BactoTiny-Seq uses a 10-token character-level DNA vocabulary, making the embedding table negligible (~7,680 params). The full parameter budget is therefore concentrated in the 12 transformer layers. This is an intentional design choice: the model is forced to learn sequence structure rather than vocabulary statistics.
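
The arithmetic behind this note can be checked with a short sketch. It assumes the standard BERT layer layout (biases on every linear layer, two-row segment embeddings, a LayerNorm after each sub-block) together with the hidden size, layer count, FFN size and context window listed in this card. Whether the released checkpoint includes the pooler head is not stated, so `pooler` is an illustrative switch rather than a documented setting.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, ffn=3072,
                     max_positions=512, type_vocab=2, pooler=True):
    """Count weights and biases of a standard BERT encoder (assumed layout)."""
    layer_norm = 2 * hidden                      # scale + shift
    embeddings = (vocab_size * hidden            # token embedding table
                  + max_positions * hidden       # learned absolute positions
                  + type_vocab * hidden          # segment embeddings (assumed present)
                  + layer_norm)
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V, output
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler_params = hidden * hidden + hidden if pooler else 0
    return embeddings + layers * (attention + feed_forward) + pooler_params


print(f"BERT-base, 30,522 tokens: {bert_param_count(30522):,}")
print(f"10-token DNA vocabulary:  {bert_param_count(10):,}")
print(f"Token embedding table:    {10 * 768:,}")
```

Under these assumptions the count is about 109.5M for BERT-base and about 86.0M for the 10-token vocabulary (about 85.5M without the pooler), so nearly all of the difference comes from the token embedding table.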

Architecture

Parameter	Value
Total parameters	86M
Layers	12
Hidden size	768
Attention heads	12
FFN size	3072
Context window	512 tokens (~510 bp)
Vocabulary	10 tokens: A T G C N + [PAD] [UNK] [CLS] [SEP] [MASK]
Positional encoding	Learned absolute
Initialisation	Random — no pretrained weights from any existing model

Training

Setting	Value
Base data	SILVA 138.1 SSURef NR99
Sequences post-filter	62,081
Target genera	18
Pretraining objective	Masked Language Modelling (MLM, 15% masking)
Corpus augmentation	2 random fragments per sequence at each of 150/300/500/800 bp (8 fragments per sequence)
Augmented corpus size	~558,000 sequences
Effective batch size	128
Learning rate	1e-4, cosine decay with 5% linear warmup
Weight decay	0.01
Precision	bf16
Hardware	NVIDIA A100
Optimiser	AdamW
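
The learning-rate row can be read as the following schedule. This is a minimal sketch, assuming warmup starts from zero and the cosine decays to zero by the final step; neither endpoint is stated in the table. `total_steps` is a hypothetical value chosen only for the example.

```python
import math


def learning_rate(step, total_steps, peak_lr=1e-4, warmup_frac=0.05):
    """Linear warmup to peak_lr, then cosine decay (assumed to reach zero)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 10_000  # hypothetical; the real step count is not given in this card
for step in (0, 250, 500, 5_250, 10_000):
    print(step, f"{learning_rate(step, total_steps):.2e}")
```

With 5% warmup the peak of 1e-4 is reached at step 500 of 10,000 in this example, and the rate is half the peak midway through the decay phase.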

Fragment augmentation rationale: Without augmentation, the model trains exclusively on full-length sequences (~1,500 bp) truncated to 512 tokens, so it never encounters the pattern [CLS + N sequence tokens + (512-N) PAD]. At inference on partial fragments, the CLS representation then collapses. Augmenting the pretraining corpus with short fragments at the target evaluation lengths forces the model to learn useful CLS representations across all input lengths, producing the short-fragment performance documented below.
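
A minimal sketch of this kind of augmentation is shown here. It assumes two random fragments are drawn at each of the four lengths, which matches the reported corpus size (62,081 × 9 ≈ 558,000). How sequences shorter than a target length were handled is not documented, so skipping them is an assumption, and the function name `augment` is illustrative.

```python
import random

FRAGMENT_LENGTHS = (150, 300, 500, 800)  # bp, from the training table
FRAGMENTS_PER_LENGTH = 2                 # assumed to apply per length


def augment(sequence, rng):
    """Return the original sequence plus random sub-fragments of it."""
    out = [sequence]
    for length in FRAGMENT_LENGTHS:
        if len(sequence) < length:
            continue  # assumption: skip lengths the sequence cannot supply
        for _ in range(FRAGMENTS_PER_LENGTH):
            start = rng.randint(0, len(sequence) - length)  # bounds are inclusive
            out.append(sequence[start:start + length])
    return out


rng = random.Random(0)
# Synthetic stand-in for a full-length 16S sequence
full_length = "".join(rng.choice("ACGT") for _ in range(1500))
corpus = augment(full_length, rng)
print(len(corpus), sorted({len(s) for s in corpus}))
```

Each full-length input yields nine training sequences here: the original plus eight fragments, every one of which is an exact substring of the source.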

Target Genus Groups (18)

Group	Family	Notes
Aeromonas	Aeromonadaceae
Bacillus	Bacillaceae	Endospore-forming
Campylobacter	Campylobacteraceae	Microaerophile
Citrobacter	Enterobacteriaceae
Cronobacter	Enterobacteriaceae	Formerly Enterobacter sakazakii
Enterobacter	Enterobacteriaceae
Enterococcus	Enterococcaceae
Escherichia/Shigella	Enterobacteriaceae	Reported as combined group — 16S cannot distinguish
Klebsiella	Enterobacteriaceae
Legionella	Legionellaceae
Listeria	Listeriaceae	Cold-tolerant
Proteus	Enterobacteriaceae
Pseudomonas	Pseudomonadaceae
Salmonella	Enterobacteriaceae
Serratia	Enterobacteriaceae
Staphylococcus	Staphylococcaceae
Vibrio	Vibrionaceae
Yersinia	Enterobacteriaceae

Clostridium was excluded: only 2 sequences recovered from SILVA 138.1 NR99 under the applied quality filters — insufficient for training or evaluation.

Benchmark Results

All evaluations use a held-out 20% stratified test split. Fragment lengths simulate partial 16S reads from primer-limited or short-read sequencing workflows.

Full benchmark table

Fragment	LogReg (4-mer)	XGBoost (4-mer)	BactoTiny-86M (probe)	BactoTiny-86M (MLP)
150 bp	0.524	0.846	0.705	0.759
300 bp	0.599	0.934	0.814	0.846
500 bp	0.740	0.970	0.870	0.884
800 bp	0.796	0.983	0.910	0.911
Full-length	0.632	0.986	0.952	0.956

Probe: logistic regression on frozen CLS embeddings (StandardScaler + LogReg C=1.0)
MLP: 2-layer MLP head (768→512→18, GELU, dropout 0.1) trained on frozen CLS embeddings for 30 epochs
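
The linear probe can be sketched with scikit-learn as follows. Only the StandardScaler + LogReg (C=1.0) combination comes from this card. The synthetic vectors stand in for real CLS embeddings, and `max_iter`, the class count and the split sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_classes, dim = 300, 3, 768

# Synthetic stand-ins for frozen CLS embeddings: one Gaussian cluster per class
y = np.arange(n_samples) % n_classes
centres = rng.normal(size=(n_classes, dim))
X = centres[y] + rng.normal(size=(n_samples, dim))

# Probe as described above; max_iter is an assumption to ensure convergence
probe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
probe.fit(X[:240], y[:240])

predictions = probe.predict(X[240:])
accuracy = probe.score(X[240:], y[240:])
print(predictions.shape, accuracy)
```

With real data, `X` would be the stacked 768-dimensional CLS embeddings and `y` the genus-group labels. The encoder stays frozen, so only the scaler and the logistic regression are fitted.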

Per-class report — full-length sequences (linear probe)

Class	Precision	Recall	F1	Support
Aeromonas	0.99	0.97	0.98	288
Bacillus	0.99	0.98	0.99	600
Campylobacter	1.00	1.00	1.00	204
Citrobacter	0.74	0.78	0.76	154
Cronobacter	0.84	0.60	0.70	45
Enterobacter	0.76	0.84	0.80	454
Enterococcus	0.99	0.99	0.99	383
Escherichia/Shigella	0.96	0.95	0.95	600
Klebsiella	0.89	0.88	0.88	600
Legionella	1.00	1.00	1.00	152
Listeria	1.00	0.99	1.00	253
Proteus	0.98	0.93	0.95	68
Pseudomonas	0.99	0.99	0.99	600
Salmonella	0.99	0.99	0.99	600
Serratia	0.90	0.88	0.89	262
Staphylococcus	0.99	1.00	0.99	600
Vibrio	0.99	0.99	0.99	600
Yersinia	0.97	0.94	0.96	141
Overall accuracy			0.952	6604
Macro avg	0.94	0.93	0.93	6604
Weighted avg	0.95	0.95	0.95	6604
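
The two summary rows differ only in how per-class scores are averaged: the macro average weights every class equally, while the weighted average weights each class by its support. The sketch below recomputes both F1 averages from the rounded per-class values in the table, so small rounding differences from the original computation are possible.

```python
# (class, F1, support) copied from the per-class report above
REPORT = [
    ("Aeromonas", 0.98, 288), ("Bacillus", 0.99, 600),
    ("Campylobacter", 1.00, 204), ("Citrobacter", 0.76, 154),
    ("Cronobacter", 0.70, 45), ("Enterobacter", 0.80, 454),
    ("Enterococcus", 0.99, 383), ("Escherichia/Shigella", 0.95, 600),
    ("Klebsiella", 0.88, 600), ("Legionella", 1.00, 152),
    ("Listeria", 1.00, 253), ("Proteus", 0.95, 68),
    ("Pseudomonas", 0.99, 600), ("Salmonella", 0.99, 600),
    ("Serratia", 0.89, 262), ("Staphylococcus", 0.99, 600),
    ("Vibrio", 0.99, 600), ("Yersinia", 0.96, 141),
]

total_support = sum(support for _, _, support in REPORT)
macro_f1 = sum(f1 for _, f1, _ in REPORT) / len(REPORT)
weighted_f1 = sum(f1 * support for _, f1, support in REPORT) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average sits below the weighted one because the weaker Enterobacteriaceae classes, several of which have small support, count as much as the 600-sequence classes.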

Honest Assessment

Strengths:

Full-length classification at 0.952 (linear probe) is strong for a from-scratch model with a 10-token vocabulary. Campylobacter, Legionella, and Listeria achieve F1 1.00 — organisms with genuinely distinctive 16S signatures that the model has learned effectively. At 150 bp, the BactoTiny MLP (0.759) outperforms a logistic regression k-mer baseline (0.524) and closes to within 0.087 of XGBoost (0.846), which is a strong k-mer classifier. At full length, BactoTiny (0.956) is within 0.03 of XGBoost (0.986).

Limitations:

The five Enterobacteriaceae genera with lower F1, Cronobacter (0.70), Citrobacter (0.76), Enterobacter (0.80), Klebsiella (0.88), and Serratia (0.89), reflect a fundamental constraint of 16S-based identification. The 16S gene does not vary sufficiently within Enterobacteriaceae to cleanly resolve genus-level identity. This is a biological limitation that affects all 16S-based methods, including XGBoost on k-mer features, and biochemical confirmation is required regardless of prediction. Cronobacter support in the test set is low (45 sequences), so its F1 estimate carries more uncertainty than those of better-represented classes.
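
To give a rough sense of how much uncertainty a support of 45 implies, the sketch below computes a 95% Wilson score interval for Cronobacter recall. The card reports recall 0.60 on 45 sequences, which corresponds to 27 correct predictions. The Wilson interval is an illustrative choice here, not a method used in the original evaluation, and the Klebsiella count of 528 is approximate (0.88 × 600).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


crono_low, crono_high = wilson_interval(27, 45)    # Cronobacter: 27 of 45
kleb_low, kleb_high = wilson_interval(528, 600)    # Klebsiella: approx. 528 of 600

print(f"Cronobacter recall: {crono_low:.2f} to {crono_high:.2f}")
print(f"Klebsiella recall:  {kleb_low:.2f} to {kleb_high:.2f}")
```

The Cronobacter interval spans roughly 0.45 to 0.73, several times wider than the interval for a 600-sequence class, so comparisons involving this class should be read with that spread in mind.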

XGBoost on 4-mer frequencies remains the stronger classifier at all fragment lengths in direct accuracy comparison. The value of BactoTiny-Seq is not competitive accuracy against k-mer methods on this specific task, but transferable learned representations that capture structural and evolutionary patterns beyond frequency statistics — usable for embedding-based search, transfer learning, and downstream tasks where the k-mer approach does not generalise.

Limitations

Context window 512 tokens (510 bp). Sequences longer than this are truncated at the 3′ end.
16S rRNA cannot distinguish Escherichia from Shigella — always reported as a combined group.
Trained on 18 target genera only. Inputs from outside this set are still assigned to one of the 18 groups, so their predictions are unreliable. There is no out-of-distribution detection.
Cronobacter and Proteus have small test support (45 and 68 sequences). Per-class metrics for these groups carry higher uncertainty.
The Enterobacteriaceae confusion cluster reflects the fundamental limitation of 16S-based identification within this family — not a fixable model defect.
Not validated on clinical, environmental, or food-safety isolates. Research prototype only.

Intended Use

Bacterial group prediction from 16S rRNA sequences as a research tool
CLS embedding extraction for downstream microbial ML tasks
Benchmarking learned vs frequency-based 16S representations
Transfer learning foundation for other bacterial sequence tasks

Out-of-Scope Use

Clinical or diagnostic microbiology without reference confirmation
Species-level identification (family/genus-group level only)
Regulatory, food-safety, or public health decisions
Organisms outside the 18 target genera

Usage

from transformers import BertModel, PreTrainedTokenizerFast
import torch
import numpy as np

MODEL_ID  = "EphAsad/BactoTiny-Seq-86M"

tokenizer = PreTrainedTokenizerFast.from_pretrained(MODEL_ID)
model     = BertModel.from_pretrained(MODEL_ID, add_pooling_layer=False)
model.eval()

def embed(sequence: str) -> np.ndarray:
    """Returns a 768-dimensional CLS embedding for a 16S rRNA sequence."""
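    # Normalise to the DNA alphabet: upper-case, RNA "U" -> DNA "T"
    sequence = sequence.upper().replace("U", "T")
    # Dynamic padding: pad to this sequence's length, not the global max (512).
    # +2 leaves room for [CLS] and [SEP]; the length is rounded up to a multiple of 8.
    seq_len = min(len(sequence) + 2, 512)
    pad_len = ((seq_len + 7) // 8) * 8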
    enc = tokenizer(
        sequence,
        max_length  = pad_len,
        truncation  = True,
        padding     = "max_length",
        return_tensors = "pt",
    )
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[:, 0, :].squeeze().numpy()

# Example
emb = embed("AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGG")
print(emb.shape)  # (768,)
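
For the sequence-similarity use mentioned in the overview, embeddings can be compared by cosine similarity. The sketch below is one possible approach rather than a documented workflow. The helper names are hypothetical, and random vectors stand in for the output of `embed` so that the example runs without downloading the model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_by_similarity(query, references):
    """Sort named reference embeddings by similarity to the query, best first."""
    scores = {name: cosine_similarity(query, vec) for name, vec in references.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


rng = np.random.default_rng(0)
query = rng.normal(size=768)  # stand-in for embed(query_sequence)
references = {
    "near_copy": query + 0.1 * rng.normal(size=768),  # stand-in for a close relative
    "unrelated": rng.normal(size=768),
}
ranking = rank_by_similarity(query, references)
print(ranking[0][0])
```

In practice `query` and the reference vectors would come from `embed`, with the references typically precomputed once and stored.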

Citation

@misc{bactotiny2026,
  author    = {Asad, Zain},
  title     = {BactoTiny-Seq-86M: From-scratch bacterial 16S rRNA transformer},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/EphAsad/BactoTiny-Seq-86M}
}

Project Context

BactoTiny-Seq is part of a portfolio of applied microbiology AI projects developed alongside four years of production laboratory informatics work. Related projects include BactAID (hybrid bacterial identification system, XGBoost + LoRA FLAN-T5, 95.1% accuracy across 140 genera), DomainEmbedder (domain-adaptive embeddings with A2C RL routing), and FireSOP. The unifying design philosophy across all projects is deterministic fallback at every model integration point, confidence-aware outputs, and human-in-the-loop design.
