BactoTiny-Seq-86M
A from-scratch BERT-style encoder trained on bacterial 16S rRNA sequences for sequence representation learning and genus-group classification.
⚠️ Research use only. This model is not a diagnostic device. Predictions must not be used for clinical, food-safety, or regulatory decisions without reference confirmation by an accredited laboratory.
Overview
BactoTiny-Seq-86M is an 86M-parameter BERT encoder trained from random initialisation on bacterial 16S rRNA sequences from SILVA 138.1 NR99. No weights from any existing language model were used. All sequence knowledge was learned during masked language modelling (MLM) pretraining on a curated, fragment-augmented corpus of approximately 62,000 bacterial 16S sequences (about 558,000 sequences after augmentation).
The model produces 768-dimensional CLS embeddings usable for downstream bacterial group classification, sequence similarity, or transfer learning to other microbial genomics tasks.
Note on parameter count: Standard BERT-base reaches 110M parameters largely through its 30,522-token embedding matrix (23M params). BactoTiny-Seq uses a 10-token character-level DNA vocabulary, making the embedding table negligible (~7,680 params). The full parameter budget is therefore concentrated in the 12 transformer layers. This is an intentional design choice: the model is forced to learn sequence structure rather than vocabulary statistics.
Architecture
| Parameter | Value |
|---|---|
| Total parameters | 86M |
| Layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| FFN size | 3072 |
| Context window | 512 tokens (~510 bp) |
| Vocabulary | 10 tokens: A T G C N + [PAD] [UNK] [CLS] [SEP] [MASK] |
| Positional encoding | Learned absolute |
| Initialisation | Random; no pretrained weights from any existing model |
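As a rough check on the parameter-count note above, the following sketch tallies the weights of a standard BERT encoder from the values in this table. It assumes the usual BERT layout (512 learned position embeddings, two token-type embeddings, a bias on every linear layer, and LayerNorm after the embeddings and after each sub-layer). The function name `bert_param_count` is illustrative and not part of the released code.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, ffn=3072,
                     max_positions=512, type_vocab=2, with_pooler=False):
    """Count weights in a standard BERT encoder (assumed layout, see text)."""
    layer_norm = 2 * hidden  # scale + shift vectors
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V, output
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + layers * (attention + feed_forward) + pooler


dna_total = bert_param_count(vocab_size=10)
bert_base_total = bert_param_count(vocab_size=30522, with_pooler=True)
print(f"10-token vocabulary: {dna_total / 1e6:.1f}M parameters")
print(f"30,522-token vocabulary (with pooler): {bert_base_total / 1e6:.1f}M parameters")
```

Under these assumptions the 10-token configuration comes to roughly 85.5M weights, in line with the 86M reported in the table, and nearly all of the gap to BERT-base is the word-embedding matrix.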
Training
| Setting | Value |
|---|---|
| Base data | SILVA 138.1 SSURef NR99 |
| Sequences post-filter | 62,081 |
| Target genera | 18 |
| Pretraining objective | Masked Language Modelling (MLM, 15% masking) |
| Corpus augmentation | 2 random fragments per sequence at 150/300/500/800 bp |
| Augmented corpus size | ~558,000 sequences |
| Effective batch size | 128 |
| Learning rate | 1e-4, cosine decay with 5% linear warmup |
| Weight decay | 0.01 |
| Precision | bf16 |
| Hardware | NVIDIA A100 |
| Optimiser | AdamW |
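The learning-rate row in the table can be illustrated with a small sketch. It assumes the common form of this schedule: a linear ramp from zero to the peak over the first 5% of steps, then a half-cosine decay to zero. The final learning rate and the total step count are not stated in this card, so `total_steps` below is a hypothetical value and `lr_at` is an illustrative name.

```python
import math


def lr_at(step, total_steps, peak_lr=1e-4, warmup_frac=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 10_000  # hypothetical; the real step count is not given
for step in (0, 250, 500, 5_250, 10_000):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

With these settings the rate peaks at 1e-4 at step 500, is half of that midway through the decay, and reaches zero at the final step.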
Fragment augmentation rationale: Without augmentation, the model trains exclusively on full-length sequences (about 1,500 bp) truncated to 512 tokens, so it never encounters an input of the form [CLS + N sequence tokens + (512-N) PAD]. At inference on partial fragments, the CLS representation then collapses. Augmenting the pretraining corpus with short fragments at the target evaluation lengths forces the model to learn useful CLS representations across all input lengths, producing the short-fragment performance documented below.
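The augmentation described above might look like the following sketch. The exact sampling procedure is not given in this card, so this is one reading, not the author's implementation: two random contiguous fragments at each of the four lengths, with the original sequence kept. That gives 9 entries per source sequence, and 62,081 × 9 = 558,729 is roughly consistent with the reported corpus size. `sample_fragments` and the synthetic sequence are illustrative.

```python
import random

FRAGMENT_LENGTHS = (150, 300, 500, 800)  # bp, from the training table


def sample_fragments(sequence, per_length=2, rng=random):
    """Return random contiguous sub-sequences at each target length."""
    fragments = []
    for length in FRAGMENT_LENGTHS:
        if length > len(sequence):
            continue  # sequence too short for this fragment length
        for _ in range(per_length):
            start = rng.randrange(len(sequence) - length + 1)
            fragments.append(sequence[start:start + length])
    return fragments


rng = random.Random(0)
# Synthetic stand-in for a full-length 16S sequence.
full_length = "".join(rng.choice("ACGT") for _ in range(1500))
corpus = [full_length] + sample_fragments(full_length, rng=rng)
print([len(s) for s in corpus])
```

Each fragment would then be tokenised on its own, so the model sees short inputs followed by padding during pretraining.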
Target Genus Groups (18)
| Group | Family | Notes |
|---|---|---|
| Aeromonas | Aeromonadaceae | |
| Bacillus | Bacillaceae | Endospore-forming |
| Campylobacter | Campylobacteraceae | Microaerophile |
| Citrobacter | Enterobacteriaceae | |
| Cronobacter | Enterobacteriaceae | Formerly Enterobacter sakazakii |
| Enterobacter | Enterobacteriaceae | |
| Enterococcus | Enterococcaceae | |
| Escherichia/Shigella | Enterobacteriaceae | Reported as combined group; 16S cannot distinguish them |
| Klebsiella | Enterobacteriaceae | |
| Legionella | Legionellaceae | |
| Listeria | Listeriaceae | Cold-tolerant |
| Proteus | Enterobacteriaceae | |
| Pseudomonas | Pseudomonadaceae | |
| Salmonella | Enterobacteriaceae | |
| Serratia | Enterobacteriaceae | |
| Staphylococcus | Staphylococcaceae | |
| Vibrio | Vibrionaceae | |
| Yersinia | Enterobacteriaceae | |
Clostridium was excluded: only 2 sequences were recovered from SILVA 138.1 NR99 under the applied quality filters, which is insufficient for training or evaluation.
Benchmark Results
All evaluations use a held-out 20% stratified test split. Fragment lengths simulate partial 16S reads from primer-limited or short-read sequencing workflows.
Full benchmark table
| Fragment | LogReg (4-mer) | XGBoost (4-mer) | BactoTiny-86M (probe) | BactoTiny-86M (MLP) |
|---|---|---|---|---|
| 150 bp | 0.524 | 0.846 | 0.705 | 0.759 |
| 300 bp | 0.599 | 0.934 | 0.814 | 0.846 |
| 500 bp | 0.740 | 0.970 | 0.870 | 0.884 |
| 800 bp | 0.796 | 0.983 | 0.910 | 0.911 |
| Full-length | 0.632 | 0.986 | 0.952 | 0.956 |
- Probe: logistic regression on frozen CLS embeddings (StandardScaler + LogReg C=1.0)
- MLP: 2-layer MLP head (768→512→18, GELU, dropout 0.1) trained on frozen CLS embeddings for 30 epochs
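For context on the two baseline columns, a 4-mer frequency vector can be computed as in the sketch below. The card does not describe the baselines' preprocessing, so the details here are assumptions: overlapping windows, the 256 A/C/G/T 4-mers as features, windows containing any other character skipped, and counts normalised to relative frequencies. `four_mer_frequencies` is an illustrative name.

```python
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 features
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def four_mer_frequencies(sequence):
    """Relative frequency of each overlapping 4-mer; non-ACGT windows are skipped."""
    counts = [0] * len(KMERS)
    seq = sequence.upper().replace("U", "T")
    for i in range(len(seq) - 3):
        idx = KMER_INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(KMERS)


features = four_mer_frequencies("AGAGTTTGATCCTGGCTCAG")
print(len(features), round(sum(features), 6))
```

A vector like this has a fixed length regardless of read length, which is one reason k-mer baselines cope well with short fragments.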
Per-class report β full-length sequences (linear probe)
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Aeromonas | 0.99 | 0.97 | 0.98 | 288 |
| Bacillus | 0.99 | 0.98 | 0.99 | 600 |
| Campylobacter | 1.00 | 1.00 | 1.00 | 204 |
| Citrobacter | 0.74 | 0.78 | 0.76 | 154 |
| Cronobacter | 0.84 | 0.60 | 0.70 | 45 |
| Enterobacter | 0.76 | 0.84 | 0.80 | 454 |
| Enterococcus | 0.99 | 0.99 | 0.99 | 383 |
| Escherichia/Shigella | 0.96 | 0.95 | 0.95 | 600 |
| Klebsiella | 0.89 | 0.88 | 0.88 | 600 |
| Legionella | 1.00 | 1.00 | 1.00 | 152 |
| Listeria | 1.00 | 0.99 | 1.00 | 253 |
| Proteus | 0.98 | 0.93 | 0.95 | 68 |
| Pseudomonas | 0.99 | 0.99 | 0.99 | 600 |
| Salmonella | 0.99 | 0.99 | 0.99 | 600 |
| Serratia | 0.90 | 0.88 | 0.89 | 262 |
| Staphylococcus | 0.99 | 1.00 | 0.99 | 600 |
| Vibrio | 0.99 | 0.99 | 0.99 | 600 |
| Yersinia | 0.97 | 0.94 | 0.96 | 141 |
| Overall accuracy | | | 0.952 | 6604 |
| Macro avg | 0.94 | 0.93 | 0.93 | 6604 |
| Weighted avg | 0.95 | 0.95 | 0.95 | 6604 |
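The two summary rows differ only in how classes are weighted. The sketch below recomputes both F1 averages from the per-class rows above. Because the inputs are already rounded to two decimals, it can only be expected to reproduce the reported averages approximately.

```python
# (class, F1, support) copied from the per-class table above
ROWS = [
    ("Aeromonas", 0.98, 288), ("Bacillus", 0.99, 600),
    ("Campylobacter", 1.00, 204), ("Citrobacter", 0.76, 154),
    ("Cronobacter", 0.70, 45), ("Enterobacter", 0.80, 454),
    ("Enterococcus", 0.99, 383), ("Escherichia/Shigella", 0.95, 600),
    ("Klebsiella", 0.88, 600), ("Legionella", 1.00, 152),
    ("Listeria", 1.00, 253), ("Proteus", 0.95, 68),
    ("Pseudomonas", 0.99, 600), ("Salmonella", 0.99, 600),
    ("Serratia", 0.89, 262), ("Staphylococcus", 0.99, 600),
    ("Vibrio", 0.99, 600), ("Yersinia", 0.96, 141),
]

total_support = sum(n for _, _, n in ROWS)
# Macro average: every class counts equally, however few test sequences it has.
macro_f1 = sum(f1 for _, f1, _ in ROWS) / len(ROWS)
# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * n for _, f1, n in ROWS) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average is lower because small, hard classes such as Cronobacter pull it down as much as any large class would.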
Honest Assessment
Strengths:
Full-length classification at 0.952 (linear probe) is strong for a from-scratch model with a 10-token vocabulary. Campylobacter, Legionella, and Listeria achieve F1 1.00; these organisms have genuinely distinctive 16S signatures that the model has learned effectively. At 150 bp, the BactoTiny MLP (0.759) outperforms a logistic regression k-mer baseline (0.524) and closes to within 0.087 of XGBoost (0.846), which is a strong k-mer classifier. At full length, the BactoTiny MLP (0.956) is within 0.03 of XGBoost (0.986).
Limitations:
The five Enterobacteriaceae genera with lower F1 (Cronobacter 0.70, Citrobacter 0.76, Enterobacter 0.80, Klebsiella 0.88, Serratia 0.89) reflect a fundamental constraint of 16S-based identification. The 16S gene does not vary sufficiently within Enterobacteriaceae to cleanly resolve genus-level identity. This is a biological limitation that affects all 16S-based methods, including the XGBoost baseline, and biochemical confirmation is required regardless of prediction. Cronobacter support in the test set is low (45 sequences), so its F1 estimate carries more uncertainty than for better-represented classes.
XGBoost on 4-mer frequencies remains the stronger classifier at all fragment lengths in direct accuracy comparison. The value of BactoTiny-Seq is not competitive accuracy against k-mer methods on this specific task, but transferable learned representations that capture structural and evolutionary patterns beyond frequency statistics. These representations are usable for embedding-based search, transfer learning, and downstream tasks where the k-mer approach does not generalise.
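To give a feel for how much more uncertain a 45-sequence estimate is, the sketch below computes a Wilson score interval. It is applied to recall rather than F1, because recall is a simple proportion while F1 has no equally simple closed-form interval. The success counts (27 of 45, 594 of 600) are back-calculated from the rounded recalls in the per-class table and are therefore approximate.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Cronobacter: recall 0.60 on 45 test sequences, i.e. about 27 recovered.
crono_lo, crono_hi = wilson_interval(27, 45)
# A well-supported class for comparison: recall 0.99 on 600 sequences.
large_lo, large_hi = wilson_interval(594, 600)

print(f"Cronobacter recall: {crono_lo:.2f} to {crono_hi:.2f}")
print(f"600-sequence class recall: {large_lo:.3f} to {large_hi:.3f}")
```

The interval for the small class spans well over twenty percentage points, whereas the interval for a 600-sequence class is only a couple of points wide.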
Limitations
- Context window: 512 tokens (510 bp). Sequences longer than this are truncated at the 3′ end.
- 16S rRNA cannot distinguish Escherichia from Shigella β always reported as a combined group.
- Trained on 18 target genera only. Inputs from outside this set will produce unreliable, low-confidence predictions with no out-of-distribution detection.
- Cronobacter and Proteus have small test support (45 and 68 sequences). Per-class metrics for these groups carry higher uncertainty.
- The Enterobacteriaceae confusion cluster reflects the fundamental limitation of 16S-based identification within this family, not a fixable model defect.
- Not validated on clinical, environmental, or food-safety isolates. Research prototype only.
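The context-window limit in the list above can be made concrete with a sketch of character-level tokenisation. This is an approximation of what the released tokenizer is assumed to do, not its actual implementation: one token per base, [CLS] and [SEP] added around the sequence, anything outside A/T/G/C/N mapped to [UNK], and the 3′ end dropped once 510 bases are used. `to_tokens` is an illustrative helper.

```python
BASES = set("ATGCN")


def to_tokens(sequence, max_length=512, pad=False):
    """Character-level tokens with [CLS]/[SEP]; the 3' end is cut if too long."""
    # Two positions are reserved for [CLS] and [SEP], leaving 510 for bases.
    bases = sequence.upper().replace("U", "T")[: max_length - 2]
    tokens = ["[CLS]"] + [b if b in BASES else "[UNK]" for b in bases] + ["[SEP]"]
    if pad:
        tokens += ["[PAD]"] * (max_length - len(tokens))
    return tokens


print(to_tokens("acgu"))
print(len(to_tokens("A" * 1500)))  # a full-length read is cut to the window
```

For a full-length 16S sequence of about 1,500 bp, this means only roughly the first third of the gene reaches the model.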
Intended Use
- Bacterial group prediction from 16S rRNA sequences as a research tool
- CLS embedding extraction for downstream microbial ML tasks
- Benchmarking learned vs frequency-based 16S representations
- Transfer learning foundation for other bacterial sequence tasks
Out-of-Scope Use
- Clinical or diagnostic microbiology without reference confirmation
- Species-level identification (family/genus-group level only)
- Regulatory, food-safety, or public health decisions
- Organisms outside the 18 target genera
Usage
```python
from transformers import BertModel, PreTrainedTokenizerFast
import torch
import numpy as np

MODEL_ID = "EphAsad/BactoTiny-Seq-86M"

tokenizer = PreTrainedTokenizerFast.from_pretrained(MODEL_ID)
model = BertModel.from_pretrained(MODEL_ID, add_pooling_layer=False)
model.eval()


def embed(sequence: str) -> np.ndarray:
    """Returns a 768-dimensional CLS embedding for a 16S rRNA sequence."""
    sequence = sequence.upper().replace("U", "T")
    # Dynamic padding: pad to the sequence length (plus [CLS] and [SEP]),
    # rounded up to a multiple of 8, rather than to the global max (512).
    seq_len = min(len(sequence) + 2, 512)
    pad_len = ((seq_len + 7) // 8) * 8
    enc = tokenizer(
        sequence,
        max_length=pad_len,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )
    with torch.no_grad():
        out = model(**enc)
    # Position 0 of the last hidden state is the [CLS] token.
    return out.last_hidden_state[:, 0, :].squeeze().numpy()


# Example
emb = embed("AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGG")
print(emb.shape)  # (768,)
```
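Embeddings like the one produced above could be compared for sequence similarity or embedding-based search. The sketch below ranks reference vectors by cosine similarity to a query. It uses toy 3-dimensional vectors as stand-ins so that it runs without downloading the model; the reference names and helper functions are illustrative and not part of the released code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, references):
    """Return reference names ordered from most to least similar to the query."""
    return sorted(
        references,
        key=lambda name: cosine_similarity(query, references[name]),
        reverse=True,
    )


# Toy stand-ins; real vectors would be 768-dimensional CLS embeddings.
references = {
    "ref_a": [1.0, 0.0, 0.0],
    "ref_b": [0.7, 0.7, 0.0],
    "ref_c": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]
print(rank_by_similarity(query, references))
```

Whether cosine similarity on raw CLS vectors is the best metric for this model has not been evaluated in this card; it is simply a common starting point.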
Citation
```bibtex
@misc{bactotiny2026,
  author    = {Asad, Zain},
  title     = {BactoTiny-Seq-86M: From-scratch bacterial 16S rRNA transformer},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/EphAsad/BactoTiny-Seq-86M}
}
```
Project Context
BactoTiny-Seq is part of a portfolio of applied microbiology AI projects developed alongside four years of production laboratory informatics work. Related projects include BactAID (hybrid bacterial identification system, XGBoost + LoRA FLAN-T5, 95.1% accuracy across 140 genera), DomainEmbedder (domain-adaptive embeddings with A2C RL routing), and FireSOP. The unifying design philosophy across all projects is deterministic fallback at every model integration point, confidence-aware outputs, and human-in-the-loop design.