PhenoFormer
PhenoFormer is a 22M-parameter Transformer classifier for bacterial genus identification from phenotypic test results. Given a partial set of microbiological test observations (Gram stain, biochemical reactions, growth characteristics, fermentation profiles), it returns ranked genus predictions with calibrated confidence scores and, optionally, a prioritised recommendation of which tests to run next.
It is designed to reflect the real conditions of a clinical or food/water testing laboratory: you rarely have a complete biochemical profile at the point of identification. PhenoFormer was trained on simulated partial panels derived from 15 named test batteries and structured random sampling strategies, making it robust to the incomplete, in-progress data that exists during a genuine identification workflow.
Model Details
| Property | Value |
|---|---|
| Parameters | 22,288,506 |
| Architecture | Transformer encoder (custom) |
| d_model | 512 |
| Encoder layers | 8 |
| Attention heads | 8 |
| Feed-forward dim | 1,536 |
| Dropout | 0.12 |
| Vocabulary size | 1,120 tokens |
| Max sequence length | 192 tokens |
| Output classes | 122 bacterial genera |
| Best checkpoint | Epoch 12 |
| Temperature scaling | T = 1.063 (post-hoc calibration) |
| Training precision | bf16 (A100) |
| Model size | ~85 MB |
Architecture notes
Input phenotypic fields are serialised as structured token sequences (field name + value tokens), not free text. Absent fields are omitted from the sequence rather than masked to zero, which is the appropriate inductive bias for partial-observation classification: an untested field carries no evidence, whereas a zero could be read as a negative result. A [CLS]-style aggregation token produces the genus logits. At inference time the logits are divided by a learned scalar temperature (T = 1.063), fitted on the validation split to correct mild overconfidence.
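The serialisation described above can be sketched as follows. This is a minimal illustration, not the packaged tokenizer: the `FIELD=`/`VALUE=` token spelling, the `[CLS]` literal, and the alphabetical field order are all assumptions made for the example.

```python
from typing import Dict, List, Optional

CLS_TOKEN = "[CLS]"  # hypothetical spelling of the aggregation token


def serialise_observations(observations: Dict[str, Optional[str]]) -> List[str]:
    """Turn {field: value} observations into a flat token sequence.

    Fields whose value is None or empty are skipped entirely, so the
    sequence length varies with the number of completed tests.
    """
    tokens = [CLS_TOKEN]
    for field in sorted(observations):  # fixed order keeps the output deterministic
        value = observations[field]
        if value is None or value == "":
            continue  # absent test: emit nothing, not a padding or "unknown" token
        tokens.append(f"FIELD={field}")  # assumed field-name token format
        tokens.append(f"VALUE={value}")  # assumed value token format
    return tokens


example_tokens = serialise_observations(
    {"Gram Stain": "Positive", "Shape": "Cocci", "Oxidase": None}
)
print(example_tokens)
```

The untested Oxidase field contributes no tokens at all, so the encoder never sees a placeholder it could mistake for a negative result.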
Intended Use
Primary use cases
- Genus-level identification support during routine bacteriology workup, where a subset of phenotypic tests has been completed
- Differential diagnosis narrowing: the top-3 and top-5 predictions cover the true genus in 77.6% and 84.1% of cases respectively on mixed partial-panel test examples, providing a ranked shortlist for confirmatory testing
- Next-test recommendation: the packaged inference module includes a dual recommender that ranks untested fields by either confirmation weight (reinforce the top prediction) or discriminatory power (maximally separate competing candidates)
- Laboratory information system integration: the inference module is self-contained and callable with a plain dict of {field: value} observations
Not intended for
- Clinical diagnostic reporting or patient care decisions: predictions are decision support only and must be confirmed by a qualified microbiologist
- Antimicrobial susceptibility inference: the model classifies genus, not species, and carries no AMR information
- Atypical, obligate intracellular, or fastidious organisms (e.g. Rickettsia, Chlamydia, Mycoplasma): excluded from training due to insufficient records
- Subspecies or biotype discrimination: classification is genus-level only
- Fungal or parasitic identification: scope is bacteria only (note: Saccharomyces, Candida, and Cryptococcus appear in training as a minority and may produce unreliable predictions)
Training Data
The training data is the BactAID Gold Test Dataset, a curated collection of 8,360 records spanning 138 bacterial genera and 5,563 unique species/name entries. Each record contains a structured phenotypic profile compiled from reference sources including Bergey's Manual and published species descriptions.
Filtering out genera with fewer than 10 records removes 16 genera and 25 records, leaving a usable dataset of 122 genera across 8,335 records.
Dataset statistics
| Split | Records | Generated examples |
|---|---|---|
| Train (70%) | 5,825 | 205,227 |
| Validation (15%) | 1,248 | 35,434 |
| Test (15%) | 1,262 | 36,259 |
The record-level split was performed before any example generation to prevent data leakage.
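To illustrate why the order matters, here is a minimal sketch of a record-level split followed by example routing. It is unstratified and uses hypothetical names (`split_record_ids`, `assign_examples`, a `record_id` key); the actual pipeline may stratify by genus or differ in other details.

```python
import random
from typing import Dict, List, Sequence


def split_record_ids(record_ids: Sequence[int], seed: int = 0) -> Dict[str, List[int]]:
    """Shuffle record ids once and cut them 70/15/15 (unstratified sketch)."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 70 // 100
    n_val = len(ids) * 15 // 100
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


def assign_examples(examples: Sequence[Dict], splits: Dict[str, List[int]]) -> Dict[str, List[Dict]]:
    """Route every generated example to the split of its source record."""
    owner = {rid: name for name, ids in splits.items() for rid in ids}
    routed: Dict[str, List[Dict]] = {name: [] for name in splits}
    for example in examples:
        routed[owner[example["record_id"]]].append(example)
    return routed


splits = split_record_ids(range(100))
# Three partial-panel variants per record, all inheriting the record's split.
examples = [{"record_id": rid, "variant": v} for rid in range(100) for v in range(3)]
routed = assign_examples(examples, splits)
```

Because every partial view of a record follows that record into a single split, no test example shares a source profile with a training example.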
Phenotypic schema
The model uses 45 canonical phenotypic fields:
Morphology & Growth: Gram Stain, Shape, Colony Morphology, Growth Temperature, Media Grown On, Capsule, Spore Formation, Oxygen Requirement, Motility, Motility Type, Haemolysis, Haemolysis Type
Core Biochemistry: Catalase, Oxidase, Indole, Urease, Citrate, H2S, Nitrate Reduction, Methyl Red, VP, Coagulase, DNase, ONPG, NaCl Tolerant (≥6%), Lipase Test, Gelatin Hydrolysis, Esculin Hydrolysis, Arginine dihydrolase, Lysine Decarboxylase, Ornithine Decarboxylase
Fermentation: Glucose, Lactose, Sucrose, Mannitol, Maltose, Xylose, Arabinose, Rhamnose, Sorbitol, Raffinose, Inositol, Trehalose
Extended: Gas Production, TSI Pattern
Field coverage ranges from 100% (Shape) to ~35% (TSI Pattern, Gas Production). Growth Temperature was present for 98.5% of records and was parsed into four numeric features per record (low bound, high bound, range, midpoint) plus ten binary threshold flags (growth at 4, 10, 20, 25, 30, 37, 42, 45, 50, 55 °C). The original categorical token was retained alongside these expansions.
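A rough sketch of the Growth Temperature expansion is shown below. The raw string formats (`"25//37"`, `"25-37"`, `"37"`), the feature names, and the rule that a threshold flag is set when the temperature lies within the parsed bounds are assumptions for illustration; the source parser is not documented here.

```python
import re
from typing import Dict, Optional

THRESHOLDS_C = (4, 10, 20, 25, 30, 37, 42, 45, 50, 55)


def parse_growth_temperature(raw: str) -> Optional[Dict[str, float]]:
    """Expand a raw temperature string into 4 numeric features and 10 flags.

    Every number found in the string is collected and the min/max are used
    as bounds, which covers single values, ranges, and multi-number entries.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", raw)]
    if not numbers:
        return None  # treated as missing
    low, high = min(numbers), max(numbers)
    features = {
        "low": low,
        "high": high,
        "range": high - low,
        "midpoint": (low + high) / 2,
    }
    for threshold in THRESHOLDS_C:
        # Assumed flag semantics: growth is possible inside [low, high].
        features[f"grows_at_{threshold}C"] = 1.0 if low <= threshold <= high else 0.0
    return features


parsed = parse_growth_temperature("25//37")
```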
| Growth Temperature parse status | Records |
|---|---|
| Parsed (double slash format) | 8,047 |
| Parsed (range format) | 142 |
| Single number | 25 |
| Parsed (multiple numbers, min/max) | 1 |
| Missing | 120 |
Genus representation
Most-represented genera (≥100 records): Escherichia (160), Citrobacter (156), Staphylococcus (150), Bacillus (144), Listeria (131), Pseudomonas (131), Streptococcus (131), Enterobacter (130), Clostridium (126), Campylobacter (123), Enterococcus (121), Klebsiella (118), Salmonella (116), Vibrio (115), Lactobacillus (113), Cronobacter (108), Serratia (108), Leuconostoc (105).
Genus distribution is moderately imbalanced. Class-balanced sample weights (inverse square-root of class frequency) were applied during training.
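Inverse square-root weighting can be sketched as below. The normalisation step (scaling so the average per-sample weight is 1) is an assumption; the card states only the inverse square-root rule.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def inverse_sqrt_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each genus by 1 / sqrt(count), rescaled to a mean sample weight of 1."""
    counts = Counter(labels)
    raw = {genus: 1.0 / math.sqrt(count) for genus, count in counts.items()}
    # Assumed normalisation: keeps the overall loss scale unchanged.
    mean_sample_weight = sum(raw[g] * counts[g] for g in counts) / len(labels)
    return {genus: weight / mean_sample_weight for genus, weight in raw.items()}


# A 160-record genus next to a hypothetical 10-record genus.
toy_labels = ["Escherichia"] * 160 + ["RareGenus"] * 10
weights = inverse_sqrt_class_weights(toy_labels)
```

With a 16-fold difference in record count, the rarer genus receives a 4-fold larger weight, a gentler correction than plain inverse-frequency weighting would give.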
Training Procedure
Panel-aware partial phenotype generation
Rather than training on complete phenotypic profiles, each record generates multiple examples by sampling different subsets of its available fields, simulating real laboratory conditions. Four strategies were combined:
- Named panels: 15 fixed test batteries reflecting real-world workflows, including ENTEROBACTERIACEAE_LITE, GRAM_POSITIVE_COCCI_PANEL, NON_FERMENTER_PANEL, ANAEROBE_PANEL, VIBRIO_AEROMONAS_PANEL, MORPHOLOGY_CULTURE_PANEL, CORE_BIOCHEM, and workflow prefix panels at multiple completion stages
- Random partial sampling: random subsets of 2–N available fields, with Gram Stain and Shape given a 65% inclusion bias to reflect their typical priority in practice
- Dropout from full: a per-field keep-probability of 0.30–0.75, applied over 5 rounds
- Full available: the complete set of observed fields for each record
This produced 205,227 training examples from 5,825 source records after de-duplication of identical field-set combinations within the same record. The largest source categories were FULL_AVAILABLE (5,825), MORPHOLOGY_CULTURE_PANEL (5,822), EARLY_GRAM_SHAPE (5,818), EARLY_BASIC (5,804), and the five DROPOUT rounds (~5,763–5,786 each).
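Three of the four strategies above can be sketched in a few lines. Named panels are left out because their field lists are not given in this card. The interpretation of the 65% bias (each priority field is forced in with probability 0.65 before the rest of the subset is filled at random), the per-round draw of the keep-probability, and all function names are assumptions.

```python
import random
from typing import Dict, FrozenSet, List, Optional

PRIORITY_FIELDS = ("Gram Stain", "Shape")


def random_partial(available: List[str], rng: random.Random,
                   priority_p: float = 0.65, min_fields: int = 2) -> FrozenSet[str]:
    """Random subset of size 2..N, with priority fields favoured."""
    size = rng.randint(min_fields, len(available))
    chosen = {f for f in PRIORITY_FIELDS if f in available and rng.random() < priority_p}
    remaining = [f for f in available if f not in chosen]
    rng.shuffle(remaining)
    while len(chosen) < size:
        chosen.add(remaining.pop())
    return frozenset(chosen)


def dropout_from_full(available: List[str], rng: random.Random,
                      min_fields: int = 2) -> Optional[FrozenSet[str]]:
    """Keep each field with a probability drawn from [0.30, 0.75]."""
    keep_p = rng.uniform(0.30, 0.75)
    kept = frozenset(f for f in available if rng.random() < keep_p)
    return kept if len(kept) >= min_fields else None


def generate_examples(record: Dict[str, str], rng: random.Random,
                      n_random: int = 3, dropout_rounds: int = 5) -> List[Dict[str, str]]:
    """Build de-duplicated partial views of one record."""
    available = sorted(record)
    if len(available) < 2:
        return []
    subsets = {frozenset(available)}  # the "full available" view
    for _ in range(n_random):
        subsets.add(random_partial(available, rng))
    for _ in range(dropout_rounds):
        kept = dropout_from_full(available, rng)
        if kept is not None:
            subsets.add(kept)
    # Using a set of frozensets removes identical field-set combinations.
    ordered = sorted(subsets, key=lambda s: (len(s), sorted(s)))
    return [{f: record[f] for f in sorted(s)} for s in ordered]


record = {
    "Gram Stain": "Negative", "Shape": "Rods", "Oxidase": "Negative",
    "Catalase": "Positive", "Indole": "Positive", "Lactose": "Positive",
}
views = generate_examples(record, random.Random(0))
```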
Training configuration
| Parameter | Value |
|---|---|
| Optimiser | AdamW |
| Training precision | bf16 (A100) |
| AMP | Enabled (GradScaler disabled for bf16) |
| Best checkpoint | Epoch 12 of 14 |
| Early stopping score | Composite (Top-1 + Top-3 + Top-5 weighted) |
| Class weighting | Inverse square-root frequency |
| Min fields per example | 2 |
Training history
| Epoch | Train loss | Val loss | Top-1 | Top-3 | Top-5 | Macro F1 | ECE |
|---|---|---|---|---|---|---|---|
| 1 | 3.3417 | 2.0114 | 0.4629 | 0.6816 | 0.7796 | 0.4356 | 0.0669 |
| 2 | 1.8601 | 1.6958 | 0.5529 | 0.7444 | 0.8197 | 0.5508 | 0.0223 |
| 3 | 1.6152 | 1.6085 | 0.5777 | 0.7591 | 0.8303 | 0.5790 | 0.0181 |
| 4 | 1.5058 | 1.5780 | 0.5899 | 0.7621 | 0.8325 | 0.5852 | 0.0209 |
| 5 | 1.4273 | 1.5476 | 0.5912 | 0.7663 | 0.8364 | 0.5862 | 0.0256 |
| 6 | 1.3620 | 1.5350 | 0.5983 | 0.7686 | 0.8371 | 0.5935 | 0.0262 |
| 7 | 1.3011 | 1.5255 | 0.6030 | 0.7711 | 0.8395 | 0.5985 | 0.0258 |
| 8 | 1.2432 | 1.5327 | 0.6034 | 0.7733 | 0.8390 | 0.6004 | 0.0321 |
| 9 | 1.1896 | 1.5439 | 0.6058 | 0.7736 | 0.8392 | 0.6015 | 0.0358 |
| 10 | 1.1410 | 1.5509 | 0.6089 | 0.7735 | 0.8400 | 0.6029 | 0.0396 |
| 11 | 1.1016 | 1.5642 | 0.6098 | 0.7729 | 0.8398 | 0.6034 | 0.0433 |
| 12 (best) | 1.0724 | 1.5680 | 0.6110 | 0.7725 | 0.8391 | 0.6035 | 0.0421 |
| 13 | 1.0549 | 1.5794 | 0.6103 | 0.7726 | 0.8394 | 0.6028 | 0.0454 |
| 14 | 1.0460 | 1.5833 | 0.6102 | 0.7722 | 0.8392 | 0.6023 | 0.0473 |
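Epoch 12 was selected by the composite early-stopping score, not by validation loss: Top-1 and Macro F1 peak at epoch 12, whereas validation loss reaches its minimum at epoch 7 (1.5255) and rises thereafter. ECE also degraded in later epochs, from 0.0181 at epoch 3 to 0.0473 at epoch 14, as the model became overconfident without explicit calibration during training. This is corrected post hoc by temperature scaling.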
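Post-hoc temperature scaling fits a single scalar T on held-out logits by minimising negative log-likelihood. The sketch below uses a plain grid search as a stand-in for whatever optimiser was actually used, and the toy logits are invented for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows: Sequence[Sequence[float]], labels: Sequence[int],
             temperature: float) -> float:
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows: Sequence[Sequence[float]], labels: Sequence[int],
                    lo: float = 0.5, hi: float = 3.0, steps: int = 251) -> float:
    """Grid-search the temperature that minimises validation NLL."""
    candidates = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Toy validation set: four confident predictions, one of them wrong.
val_logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
val_labels = [0, 0, 0, 0]
fitted_t = fit_temperature(val_logits, val_labels)
```

Dividing logits by a positive scalar never changes their ranking, which is why calibrated and uncalibrated Top-k accuracies are identical and only ECE and NLL move.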
Evaluation
Test set performance (panel-aware partial phenotype examples, n=36,259)
| Metric | Value |
|---|---|
| Top-1 accuracy | 0.6119 |
| Top-3 accuracy | 0.7757 |
| Top-5 accuracy | 0.8411 |
| Macro F1 | 0.6069 |
| Weighted F1 | 0.6141 |
| ECE (pre-calibration) | 0.0400 |
| ECE (post temperature scaling, T=1.063) | 0.0309 |
| NLL (pre-calibration) | 1.2997 |
| NLL (post temperature scaling) | 1.2977 |
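For reference, expected calibration error (ECE) is commonly computed by binning predictions by confidence and averaging the gap between confidence and accuracy in each bin. The sketch below uses 15 equal-width bins, which is an assumption: the binning scheme behind the figures above is not stated.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float], correct: Sequence[bool],
                               n_bins: int = 15) -> float:
    """Weighted mean |confidence - accuracy| over equal-width confidence bins."""
    total = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, hit in members if hit) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Ten predictions at 0.9 confidence: 9 correct is calibrated, 5 correct is not.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```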
Comparison with XGBoost baseline (XGB v2)
| Model | Top-1 | Top-3 | Notes |
|---|---|---|---|
| XGB v2 | 0.5958 | 0.7670 | 900 estimators, hist method |
| PhenoFormer (uncalibrated) | 0.6119 | 0.7757 | Epoch 12 checkpoint |
| PhenoFormer (calibrated) | 0.6119 | 0.7757 | T = 1.063, ECE 0.031 |
The Transformer's Top-1 advantage over XGB v2 is ~1.6 pp (0.6119 vs 0.5958). Tree-based models are generally competitive on structured, sparse tabular input, so a small margin is expected. The stronger case for PhenoFormer is its probability calibration and its suitability for the next-test recommendation module.
Controlled panel performance (n=1,262 test records per panel)
| Panel | Top-1 | Top-3 | Top-5 | Macro F1 | ECE |
|---|---|---|---|---|---|
| NO_GROWTH_TEMPERATURE | 0.9667 | 0.9945 | 0.9976 | 0.9649 | 0.028 |
| FULL_AVAILABLE | 0.9580 | 0.9921 | 0.9952 | 0.9570 | 0.026 |
| NO_MEDIA | 0.9572 | 0.9913 | 0.9952 | 0.9549 | 0.026 |
| NO_MEDIA_NO_GROWTH_TEMPERATURE | 0.9532 | 0.9889 | 0.9945 | 0.9496 | 0.020 |
| NON_FERMENTER_PANEL | 0.8875 | 0.9762 | 0.9873 | 0.8763 | 0.028 |
| VIBRIO_AEROMONAS_PANEL | 0.8827 | 0.9739 | 0.9857 | 0.8677 | 0.015 |
| ANAEROBE_PANEL | 0.8685 | 0.9659 | 0.9842 | 0.8538 | 0.023 |
| MORPHOLOGY_CULTURE_PANEL | 0.8494 | 0.9509 | 0.9691 | 0.8530 | 0.032 |
| ENTEROBACTERIACEAE_LITE | 0.7052 | 0.8835 | 0.9429 | 0.6788 | 0.072 |
| CORE_BIOCHEM_PLUS_MOTILITY | 0.6450 | 0.8447 | 0.9192 | 0.6140 | 0.033 |
| CORE_BIOCHEM | 0.5610 | 0.7876 | 0.8677 | 0.5056 | 0.033 |
| SUGAR_PANEL | 0.5586 | 0.7655 | 0.8573 | 0.5195 | 0.046 |
| GRAM_POSITIVE_COCCI_PANEL | 0.4960 | 0.7068 | 0.7995 | 0.4606 | 0.126 |
| EARLY_BASIC | 0.2464 | 0.4849 | 0.6078 | 0.1583 | 0.027 |
| EARLY_GRAM_SHAPE | 0.1220 | 0.2940 | 0.3851 | 0.0418 | 0.026 |
Gated panel performance (restricted to records with sufficient fields)
| Panel | Examples | Top-1 | Top-3 | Top-5 |
|---|---|---|---|---|
| NON_FERMENTER_PANEL | 556 | 0.9029 | 0.9838 | 0.9910 |
| ANAEROBE_PANEL | 631 | 0.8891 | 0.9715 | 0.9857 |
| VIBRIO_AEROMONAS_PANEL | 285 | 0.8491 | 0.9719 | 0.9860 |
| ENTEROBACTERIACEAE_LITE | 658 | 0.7660 | 0.9149 | 0.9620 |
| GRAM_POSITIVE_COCCI_PANEL | 186 | 0.7581 | 0.9032 | 0.9462 |
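Most specialised panels improve when restricted to the records they were designed for: the Gram-positive cocci panel rises from 49.6% to 75.8% Top-1, the Enterobacteriaceae panel from 70.5% to 76.6%, and the non-fermenter and anaerobe panels gain 1.5 to 2.1 pp. VIBRIO_AEROMONAS_PANEL is the exception, falling from 88.3% to 84.9% Top-1 on its 285 gated examples.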
Confidence threshold analysis
| Threshold | Coverage | Examples | Top-1 | Top-3 | Top-5 |
|---|---|---|---|---|---|
| 0.05 | 98.4% | 35,682 | 0.621 | 0.786 | 0.852 |
| 0.10 | 91.0% | 33,010 | 0.665 | 0.832 | 0.893 |
| 0.20 | 81.8% | 29,676 | 0.724 | 0.886 | 0.933 |
| 0.30 | 75.3% | 27,285 | 0.769 | 0.916 | 0.950 |
| 0.50 | 64.7% | 23,452 | 0.840 | 0.951 | 0.970 |
| 0.70 | 54.4% | 19,712 | 0.907 | 0.970 | 0.980 |
| 0.90 | 44.3% | 16,061 | 0.961 | 0.985 | 0.990 |
| 0.95 | 37.8% | 13,716 | 0.977 | 0.991 | 0.993 |
For laboratory use, a threshold between 0.50 and 0.70 is recommended as a starting point, with lower-confidence predictions flagged for review rather than discarded. The top-5 list and next-test recommendations remain valid at any confidence level.
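Each row of the threshold table is a coverage/accuracy pair, which can be reproduced on your own validation data with a sketch like the one below. Whether the comparison is `>=` or `>` is an assumption, and the toy inputs are invented.

```python
from typing import Sequence, Tuple


def coverage_and_accuracy(confidences: Sequence[float], correct: Sequence[bool],
                          threshold: float) -> Tuple[float, float]:
    """Fraction of predictions at or above the threshold, and their Top-1 accuracy."""
    kept = [hit for conf, hit in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


def route(confidence: float, threshold: float = 0.50) -> str:
    """Flag low-confidence predictions for review instead of discarding them."""
    return "accept" if confidence >= threshold else "review"


toy_conf = [0.95, 0.80, 0.60, 0.40, 0.20]
toy_hit = [True, True, False, True, False]
coverage, accuracy = coverage_and_accuracy(toy_conf, toy_hit, threshold=0.50)
```

As a check against the table, 23,452 of 36,259 test examples at threshold 0.50 is 64.7% coverage.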
Worst-performing genera
| Genus | Top-1 | Top-3 | Top-5 | Mean fields | Primary confusion |
|---|---|---|---|---|---|
| Ralstonia | 0.217 | 0.625 | 0.827 | 13.5 | Cupriavidus |
| Comamonas | 0.234 | 0.495 | 0.698 | 13.2 | Acidovorax |
| Acidovorax | 0.272 | 0.599 | 0.672 | 13.7 | Comamonas |
| Leclercia | 0.317 | 0.537 | 0.619 | 14.6 | Kluyvera |
| Kluyvera | 0.325 | 0.563 | 0.675 | 14.5 | Leclercia |
| Cupriavidus | 0.342 | 0.603 | 0.742 | 13.6 | Ralstonia |
| Alcaligenes | 0.373 | 0.538 | 0.678 | 12.2 | - |
| Anaerococcus | 0.391 | 0.609 | 0.815 | 13.8 | Peptoniphilus |
| Cellulomonas | 0.392 | 0.608 | 0.705 | 14.5 | - |
| Massilia | 0.399 | 0.592 | 0.686 | 12.7 | Pseudomonas |
Most hard-to-classify genera are phenotypically similar to a taxonomically related genus. The Ralstonia/Cupriavidus, Comamonas/Acidovorax, and Kluyvera/Leclercia pairs share overlapping profiles in reference literature.
Most common confusions
| True genus | Predicted | Count | Notes |
|---|---|---|---|
| Salmonella | Citrobacter | 78 | Both Enterobacteriaceae; overlapping sugar profiles |
| Kingella | Eikenella | 74 | Shared fastidious GN morphology |
| Providencia | Morganella | 62 | Both Proteeae; urease/indole overlap |
| Clostridium | Clostridioides | 62 | Taxonomically recent genus boundary |
| Vibrio | Aeromonas | 59 | Oxidase-positive curved GN rods |
| Enterobacter | Citrobacter | 58 | VP/citrate distinction required |
| Clostridioides | Clostridium | 55 | Bidirectional with above |
| Escherichia | Citrobacter | 55 | Indole/citrate distinction required |
| Klebsiella | Raoultella | 55 | Historically synonymous genera |
| Ralstonia | Cupriavidus | 54 | Formerly classified as same genus |
| Streptococcus | Enterococcus | 42 | Haemolysis/bile tolerance distinction required |
The next-test recommender is specifically designed to surface the discriminating tests for exactly these pairs.
High-confidence errors
Some errors occur at very high confidence (>0.99 top-1 probability). These are cases where the available field subset is genuinely ambiguous between genera at the population level: the prediction is consistent with the observed evidence, but the true isolate is atypical for that pattern. Notable examples: Mycobacterium kansasii predicted as Corynebacterium on a 6-field morphology-only panel; Eikenella predicted as Kingella on an 8-field morphology/culture panel (reciprocally confusable genera); Leuconostoc predicted as Clostridium on an 8-field panel including gas production but without Gram stain. These cases reinforce that confidence should be read alongside the top-5 list and margin, not as a binary accept/reject signal.
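One way to read confidence alongside the margin and the top-5 list is to attach review flags to each prediction. The sketch below takes a list of (genus, probability) tuples and checks three conditions. The function name, the 0.20 margin cut-off, and the flag labels are hypothetical; the confusable pairs are those named elsewhere in this card.

```python
from typing import List, Tuple

CONFUSABLE_PAIRS = {
    frozenset(pair)
    for pair in [
        ("Ralstonia", "Cupriavidus"),
        ("Comamonas", "Acidovorax"),
        ("Kluyvera", "Leclercia"),
        ("Clostridium", "Clostridioides"),
        ("Klebsiella", "Raoultella"),
    ]
}


def review_flags(top5: List[Tuple[str, float]],
                 min_confidence: float = 0.50,
                 min_margin: float = 0.20) -> List[str]:
    """Return reasons a prediction deserves a second look (empty list if none)."""
    (first, p_first), (second, p_second) = top5[0], top5[1]
    flags = []
    if p_first < min_confidence:
        flags.append("low_confidence")
    if p_first - p_second < min_margin:  # hypothetical margin cut-off
        flags.append("low_margin")
    if frozenset((first, second)) in CONFUSABLE_PAIRS:
        flags.append("confusable_pair")
    return flags


ambiguous = review_flags([("Ralstonia", 0.48), ("Cupriavidus", 0.41), ("Pseudomonas", 0.05)])
clear = review_flags([("Staphylococcus", 0.91), ("Enterococcus", 0.03)])
```

Flags of this kind do not catch the >0.99 errors described above, where the margin is also large; those still depend on confirmatory testing.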
How to Use
Basic inference
from phenotype_inference import PhenotypeClassifier
model = PhenotypeClassifier("phenotype_tinytransformer_v1_temperature_scaled.pt")
observations = {
    "Gram Stain": "Positive",
    "Shape": "Cocci",
    "Catalase": "Positive",
    "Coagulase": "Positive",
    "Haemolysis Type": "Beta",
    "DNase": "Positive",
    "Mannitol Fermentation": "Positive",
}
result = model.predict(observations)
print(result["top_genus"]) # e.g. "Staphylococcus"
print(result["confidence"]) # calibrated probability, e.g. 0.91
print(result["top5"]) # ranked list of (genus, probability) tuples
print(result["confidence_tier"]) # "HIGH" / "MODERATE" / "LOW"
Next-test recommendation
# Confirmation mode: reinforces the top prediction
confirmation = model.recommend_next_tests(
    observations, mode="confirmation", top_n=5
)
# Discriminatory mode: maximally separates competing candidates
discriminatory = model.recommend_next_tests(
    observations, mode="discriminatory", top_n=5
)
Limitations and Risks
Genus-level scope only. Confirmatory testing (coagulase, MALDI-TOF, genomic) remains necessary for species-level clinical or regulatory decisions.
Excluded genera. The following 16 genera were excluded from training due to insufficient records: Arthrobacter, Carnobacterium, Chlamydia, Coxiella, Halomonas, Janthinobacterium, Lactococcus, Microbacterium, Mycoplasma, Ochrobactrum, Pediococcus, Plesiomonas, Propionibacterium, Rickettsia, Roseomonas, Thermococcus. The classifier has no "unknown" class, so an isolate from any of these groups will still be assigned to one of the 122 trained genera.
Phenotypically similar genus pairs. Ralstonia/Cupriavidus, Comamonas/Acidovorax, Kluyvera/Leclercia, Clostridium/Clostridioides, and Klebsiella/Raoultella are systematically confusable. Treat both candidates as live hypotheses when the model returns a low-margin prediction between such pairs.
Data sourcing. Training data was compiled from reference literature, not directly from laboratory isolate records. Real isolate behaviour can deviate from type-strain descriptions, particularly for environmental or food-associated strains.
Not a clinical diagnostic device. This model has not been validated as an in vitro diagnostic device (IVD) under EU MDR, UK MDR 2002, or any equivalent regulatory framework. It must not be used as the sole basis for clinical, public health, or food safety reporting decisions.
Citation
@misc{phenoformer2025,
author = {Asad, Zain},
title = {PhenoFormer: A Panel-Aware Transformer for Bacterial Genus Identification from Phenotypic Test Results},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/ZainAsad/PhenoFormer}},
note = {Trained on the BactAID Gold Test Dataset. Includes temperature-scaled inference and dual next-test recommendation.}
}
Related Work
- BactAID: the parent project, a hybrid bacterial identification system (XGBoost + LoRA FLAN-T5 + BART RAG) achieving 95.1% accuracy across 140 genera (Zenodo)
- PhenotypeClassifier-XGB v2: the XGBoost baseline developed alongside PhenoFormer on identical data splits
- DomainEmbedder: domain-adaptive embeddings with LoRA and A2C RL routing
Model Card Contact
Developed by Zain Asad (Applied AI Engineer | Microbiology Specialist). Raise an issue on the repository or contact via HuggingFace.