PhenoFormer

PhenoFormer is a 22M-parameter Transformer classifier for bacterial genus identification from phenotypic test results. Given a partial set of microbiological test observations (Gram stain, biochemical reactions, growth characteristics, fermentation profiles), it returns ranked genus predictions with calibrated confidence scores and, optionally, a prioritised recommendation of which tests to run next.

It is designed to reflect the real conditions of a clinical or food/water testing laboratory: you rarely have a complete biochemical profile at the point of identification. PhenoFormer was trained on simulated partial panels derived from 15 named test batteries and structured random sampling strategies, making it robust to the incomplete, in-progress data that exists during a genuine identification workflow.


Model Details

Property Value
Parameters 22,288,506
Architecture Transformer encoder (custom)
d_model 512
Encoder layers 8
Attention heads 8
Feed-forward dim 1,536
Dropout 0.12
Vocabulary size 1,120 tokens
Max sequence length 192 tokens
Output classes 122 bacterial genera
Best checkpoint Epoch 12
Temperature scaling T = 1.063 (post-hoc calibration)
Training precision bf16 (A100)
Model size ~85 MB

Architecture notes

Input phenotypic fields are serialised as structured token sequences (field name + value tokens), not free text. Absent fields are omitted rather than masked to zero, so an untested field is never confused with a negative result; this is the appropriate inductive bias for partial-observation classification. A [CLS]-style aggregation token produces the genus logits. Temperature scaling is applied at inference time using a learned scalar (T = 1.063) fitted on the validation split to correct mild overconfidence.
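As a rough illustration of the serialisation idea, the sketch below turns a {field: value} dict into a flat token list and skips absent fields entirely. The `FIELD=`/`VALUE=` token format, the `serialise` helper, and the `[CLS]` string are assumptions made for this example; the actual PhenoFormer vocabulary and tokeniser are not described in this card.

```python
from typing import Optional

# Hypothetical token format; the real PhenoFormer vocabulary is not documented here.
CLS_TOKEN = "[CLS]"


def serialise(observations: dict[str, Optional[str]], max_len: int = 192) -> list[str]:
    """Turn {field: value} into a flat token list, omitting absent fields."""
    tokens = [CLS_TOKEN]
    for field, value in observations.items():
        if value is None or value == "":
            continue  # absent field: emit nothing instead of a zero/"unknown" token
        tokens.append(f"FIELD={field}")
        tokens.append(f"VALUE={value}")
    return tokens[:max_len]  # truncate to the maximum sequence length


example = serialise({"Gram Stain": "Negative", "Oxidase": "Positive", "Indole": None})
print(example)
```

Because the untested Indole field contributes no tokens, the sequence for a three-test panel is simply shorter than the sequence for a full profile.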


Intended Use

Primary use cases

  • Genus-level identification support during routine bacteriology workup, where a subset of phenotypic tests has been completed
  • Differential diagnosis narrowing: the top-3 and top-5 predictions cover the true genus in 77.6% and 84.1% of cases respectively on mixed partial-panel test examples, providing a ranked shortlist for confirmatory testing
  • Next-test recommendation: the packaged inference module includes a dual recommender that ranks untested fields by either confirmation weight (reinforce the top prediction) or discriminatory power (maximally separate competing candidates)
  • Laboratory information system integration: the inference module is self-contained and callable with a plain dict of {field: value} observations

Not intended for

  • Clinical diagnostic reporting or patient care decisions: predictions are decision support only and must be confirmed by a qualified microbiologist
  • Antimicrobial susceptibility inference: the model classifies genus, not species, and carries no AMR information
  • Atypical, obligate intracellular, or fastidious organisms (e.g. Rickettsia, Chlamydia, Mycoplasma): excluded from training due to insufficient records
  • Subspecies or biotype discrimination: classification is genus-level only
  • Fungal or parasitic identification: scope is bacteria only (note: Saccharomyces, Candida, and Cryptococcus appear in training as a minority and may produce unreliable predictions)

Training Data

The training data is the BactAID Gold Test Dataset, a curated collection of 8,360 records spanning 138 bacterial genera and 5,563 unique species/name entries. Each record contains a structured phenotypic profile compiled from reference sources including Bergey's Manual and published species descriptions.

After filtering genera with fewer than 10 records (16 genera excluded), the usable dataset covers 122 genera across 8,335 records.

Dataset statistics

Split Records Generated examples
Train (70%) 5,825 205,227
Validation (15%) 1,248 35,434
Test (15%) 1,262 36,259

The record-level split was performed before any example generation to prevent data leakage.
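The ordering matters because every generated example inherits the genus and field values of its source record. A minimal sketch of a record-level split follows; the `split_records` helper, the seed, and the plain (non-stratified) shuffle are assumptions for illustration, since the card does not state how the split was drawn.

```python
import random


def split_records(record_ids, seed=0, train=0.70, val=0.15):
    """Shuffle record IDs once and cut them into train/val/test partitions."""
    ids = sorted(record_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


splits = split_records(range(100))
# Partial-panel examples are generated per partition afterwards, so all
# examples derived from one record stay on the same side of the split.
print({name: len(ids) for name, ids in splits.items()})
```

Splitting after example generation would instead let two partial views of the same record land in train and test, inflating test accuracy.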

Phenotypic schema

The model uses 45 canonical phenotypic fields:

Morphology & Growth: Gram Stain, Shape, Colony Morphology, Growth Temperature, Media Grown On, Capsule, Spore Formation, Oxygen Requirement, Motility, Motility Type, Haemolysis, Haemolysis Type

Core Biochemistry: Catalase, Oxidase, Indole, Urease, Citrate, H2S, Nitrate Reduction, Methyl Red, VP, Coagulase, DNase, ONPG, NaCl Tolerant (≥6%), Lipase Test, Gelatin Hydrolysis, Esculin Hydrolysis, Arginine dihydrolase, Lysine Decarboxylase, Ornithine Decarboxylase

Fermentation: Glucose, Lactose, Sucrose, Mannitol, Maltose, Xylose, Arabinose, Rhamnose, Sorbitol, Raffinose, Inositol, Trehalose

Extended: Gas Production, TSI Pattern

Field coverage ranges from 100% (Shape) to ~35% (TSI Pattern, Gas Production). Growth Temperature was present for 98.5% of records and was parsed into four numeric features per record (low bound, high bound, range, midpoint) plus ten binary threshold flags (growth at 4, 10, 20, 25, 30, 37, 42, 45, 50, 55 °C). The original categorical token was retained alongside these expansions.
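The expansion into four numeric features and ten flags might look like the sketch below. It assumes the double-slash format is a `low//high` string such as `"20//45"` and that a flag is set when the threshold lies inside that interval; both the string layout and the feature names are assumptions, not the dataset's documented encoding.

```python
THRESHOLDS_C = (4, 10, 20, 25, 30, 37, 42, 45, 50, 55)


def expand_growth_temperature(raw: str) -> dict:
    """Expand an assumed 'low//high' string into numeric features and flags."""
    low_text, high_text = raw.split("//")
    low, high = float(low_text), float(high_text)
    features = {
        "temp_low": low,
        "temp_high": high,
        "temp_range": high - low,
        "temp_mid": (low + high) / 2,
    }
    for t in THRESHOLDS_C:
        # Assumption: growth at t is flagged when t falls inside [low, high].
        features[f"grows_at_{t}C"] = int(low <= t <= high)
    return features


features = expand_growth_temperature("20//45")
print(features)
```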

Growth Temperature parse status Records
Parsed (double slash format) 8,047
Parsed (range format) 142
Single number 25
Parsed (multiple numbers, min/max) 1
Missing 120

Genus representation

Most-represented genera (≥100 records): Escherichia (160), Citrobacter (156), Staphylococcus (150), Bacillus (144), Listeria (131), Pseudomonas (131), Streptococcus (131), Enterobacter (130), Clostridium (126), Campylobacter (123), Enterococcus (121), Klebsiella (118), Salmonella (116), Vibrio (115), Lactobacillus (113), Cronobacter (108), Serratia (108), Leuconostoc (105).

Genus distribution is moderately imbalanced. Class-balanced sample weights (inverse square-root of class frequency) were applied during training.
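Inverse square-root weighting is a softer correction than plain inverse-frequency weighting: a genus with 16 times fewer records receives 4 times the weight rather than 16 times. The sketch below shows one way to compute it; the `inverse_sqrt_weights` helper and the normalisation to a mean class weight of 1 are assumptions, and `RareGenus` is a placeholder for a class at the 10-record minimum.

```python
import math
from collections import Counter


def inverse_sqrt_weights(labels):
    """Weight each class by 1/sqrt(count), scaled so the mean class weight is 1."""
    counts = Counter(labels)
    raw = {genus: 1.0 / math.sqrt(n) for genus, n in counts.items()}
    mean = sum(raw.values()) / len(raw)  # normalisation is an assumption
    return {genus: w / mean for genus, w in raw.items()}


# Escherichia's count comes from the list above; "RareGenus" is a placeholder.
labels = ["Escherichia"] * 160 + ["RareGenus"] * 10
weights = inverse_sqrt_weights(labels)
print(weights["RareGenus"] / weights["Escherichia"])  # ratio is sqrt(160 / 10)
```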


Training Procedure

Panel-aware partial phenotype generation

Rather than training on complete phenotypic profiles, each record generates multiple examples by sampling different subsets of its available fields, simulating real laboratory conditions. Four strategies were combined:

  1. Named panels: 15 fixed test batteries reflecting real-world workflows, including ENTEROBACTERIACEAE_LITE, GRAM_POSITIVE_COCCI_PANEL, NON_FERMENTER_PANEL, ANAEROBE_PANEL, VIBRIO_AEROMONAS_PANEL, MORPHOLOGY_CULTURE_PANEL, CORE_BIOCHEM, and workflow prefix panels at multiple completion stages
  2. Random partial sampling: random subsets of 2–N available fields, with Gram Stain and Shape given a 65% inclusion bias to reflect their typical priority in practice
  3. Dropout from full: keep-probability of 0.30–0.75 applied per field across 5 rounds
  4. Full available: the complete set of observed fields for each record

This produced 205,227 training examples from 5,825 source records after de-duplication of identical field-set combinations within the same record. The largest source categories were FULL_AVAILABLE (5,825), MORPHOLOGY_CULTURE_PANEL (5,822), EARLY_GRAM_SHAPE (5,818), EARLY_BASIC (5,804), and the five DROPOUT rounds (~5,763–5,786 each).
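The sampling strategies can be sketched roughly as follows. This is one reading of the description, not the training code: the function names are hypothetical, the 65% bias is interpreted as an independent inclusion draw for Gram Stain and Shape, the keep-probability is drawn once per dropout round, and named panels are left out for brevity.

```python
import random

PRIORITY_FIELDS = ("Gram Stain", "Shape")


def random_partial(fields, rng, priority_p=0.65, min_fields=2):
    """Random subset of 2..N fields, with a 65% inclusion draw for priority fields."""
    fields = list(fields)
    chosen = {f for f in PRIORITY_FIELDS if f in fields and rng.random() < priority_p}
    size = rng.randint(min(min_fields, len(fields)), len(fields))
    remaining = [f for f in fields if f not in chosen]
    rng.shuffle(remaining)
    while len(chosen) < size and remaining:
        chosen.add(remaining.pop())
    return frozenset(chosen)


def dropout_from_full(fields, rng, min_fields=2):
    """Keep each field independently; keep-probability drawn from [0.30, 0.75]."""
    keep_p = rng.uniform(0.30, 0.75)
    kept = frozenset(f for f in fields if rng.random() < keep_p)
    return kept if len(kept) >= min_fields else None  # too few fields: discard


def generate_examples(record, seed=0, n_random=3, n_dropout=5):
    rng = random.Random(seed)
    fields = list(record)
    subsets = {frozenset(fields)}  # "full available"
    subsets.update(random_partial(fields, rng) for _ in range(n_random))
    for _ in range(n_dropout):
        kept = dropout_from_full(fields, rng)
        if kept is not None:
            subsets.add(kept)
    # The set removes identical field-set combinations within the record.
    return [{f: record[f] for f in fields if f in s} for s in subsets]


record = {
    "Gram Stain": "Negative", "Shape": "Rods", "Oxidase": "Negative",
    "Indole": "Positive", "Lactose": "Positive", "Motility": "Positive",
}
examples = generate_examples(record, seed=1)
print(len(examples), "examples from one record")
```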

Training configuration

Parameter Value
Optimiser AdamW
Training precision bf16 (A100)
AMP Enabled (GradScaler disabled for bf16)
Best checkpoint Epoch 12 of 14
Early stopping score Composite (Top-1 + Top-3 + Top-5 weighted)
Class weighting Inverse square-root frequency
Min fields per example 2

Training history

Epoch Train loss Val loss Top-1 Top-3 Top-5 Macro F1 ECE
1 3.3417 2.0114 0.4629 0.6816 0.7796 0.4356 0.0669
2 1.8601 1.6958 0.5529 0.7444 0.8197 0.5508 0.0223
3 1.6152 1.6085 0.5777 0.7591 0.8303 0.5790 0.0181
4 1.5058 1.5780 0.5899 0.7621 0.8325 0.5852 0.0209
5 1.4273 1.5476 0.5912 0.7663 0.8364 0.5862 0.0256
6 1.3620 1.5350 0.5983 0.7686 0.8371 0.5935 0.0262
7 1.3011 1.5255 0.6030 0.7711 0.8395 0.5985 0.0258
8 1.2432 1.5327 0.6034 0.7733 0.8390 0.6004 0.0321
9 1.1896 1.5439 0.6058 0.7736 0.8392 0.6015 0.0358
10 1.1410 1.5509 0.6089 0.7735 0.8400 0.6029 0.0396
11 1.1016 1.5642 0.6098 0.7729 0.8398 0.6034 0.0433
12 ✓ 1.0724 1.5680 0.6110 0.7725 0.8391 0.6035 0.0421
13 1.0549 1.5794 0.6103 0.7726 0.8394 0.6028 0.0454
14 1.0460 1.5833 0.6102 0.7722 0.8392 0.6023 0.0473

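Validation loss reached its minimum at epoch 7 (1.5255) and rose slowly afterwards, while Top-1 and Macro F1 kept improving until epoch 12, the checkpoint selected by the composite early-stopping score. ECE degraded in later epochs (from 0.0181 at epoch 3 to 0.0473 at epoch 14) as the model became overconfident without explicit calibration during training; this is corrected post hoc by temperature scaling.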
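Temperature scaling divides the logits by a single scalar before the softmax. With T > 1 the distribution flattens slightly, which lowers the top probability without changing the ranking; this is why calibrated and uncalibrated Top-k accuracy are identical. A minimal sketch in plain Python, with illustrative logits rather than model output:

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.5, 1.0]  # illustrative values, not model output
raw = softmax(logits)
calibrated = softmax(logits, temperature=1.063)

print(max(raw), max(calibrated))  # the calibrated top probability is lower
```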


Evaluation

Test set performance (panel-aware partial phenotype examples, n=36,259)

Metric Value
Top-1 accuracy 0.6119
Top-3 accuracy 0.7757
Top-5 accuracy 0.8411
Macro F1 0.6069
Weighted F1 0.6141
ECE (pre-calibration) 0.0400
ECE (post temperature scaling, T=1.063) 0.0309
NLL (pre-calibration) 1.2997
NLL (post temperature scaling) 1.2977
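For reference, expected calibration error is commonly computed by binning predictions by top-1 confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below uses equal-width bins; the bin count of 15 is an assumption, since the card does not state the binning used for the figures above.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Four predictions at 0.75 confidence, three of them correct: perfectly calibrated.
print(expected_calibration_error([0.75] * 4, [1, 1, 1, 0]))
```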

Comparison with XGBoost baseline (XGB v2)

Model Top-1 Top-3 Notes
XGB v2 0.5958 0.7670 900 estimators, hist method
PhenoFormer (uncalibrated) 0.6119 0.7757 Epoch 12 checkpoint
PhenoFormer (calibrated) 0.6119 0.7757 T = 1.063, ECE 0.031

The Transformer's Top-1 advantage over XGB is ~1.6pp (0.6119 vs 0.5958). Tree-based models are generally competitive on structured, sparse tabular input, so a small margin is expected. PhenoFormer's additional justification lies in probability calibration and its suitability for the next-test recommendation module.

Controlled panel performance (n=1,262 test records per panel)

Panel Top-1 Top-3 Top-5 Macro F1 ECE
NO_GROWTH_TEMPERATURE 0.9667 0.9945 0.9976 0.9649 0.028
FULL_AVAILABLE 0.9580 0.9921 0.9952 0.9570 0.026
NO_MEDIA 0.9572 0.9913 0.9952 0.9549 0.026
NO_MEDIA_NO_GROWTH_TEMPERATURE 0.9532 0.9889 0.9945 0.9496 0.020
NON_FERMENTER_PANEL 0.8875 0.9762 0.9873 0.8763 0.028
VIBRIO_AEROMONAS_PANEL 0.8827 0.9739 0.9857 0.8677 0.015
ANAEROBE_PANEL 0.8685 0.9659 0.9842 0.8538 0.023
MORPHOLOGY_CULTURE_PANEL 0.8494 0.9509 0.9691 0.8530 0.032
ENTEROBACTERIACEAE_LITE 0.7052 0.8835 0.9429 0.6788 0.072
CORE_BIOCHEM_PLUS_MOTILITY 0.6450 0.8447 0.9192 0.6140 0.033
CORE_BIOCHEM 0.5610 0.7876 0.8677 0.5056 0.033
SUGAR_PANEL 0.5586 0.7655 0.8573 0.5195 0.046
GRAM_POSITIVE_COCCI_PANEL 0.4960 0.7068 0.7995 0.4606 0.126
EARLY_BASIC 0.2464 0.4849 0.6078 0.1583 0.027
EARLY_GRAM_SHAPE 0.1220 0.2940 0.3851 0.0418 0.026

Gated panel performance (restricted to records with sufficient fields)

Panel Examples Top-1 Top-3 Top-5
NON_FERMENTER_PANEL 556 0.9029 0.9838 0.9910
ANAEROBE_PANEL 631 0.8891 0.9715 0.9857
VIBRIO_AEROMONAS_PANEL 285 0.8491 0.9719 0.9860
ENTEROBACTERIACEAE_LITE 658 0.7660 0.9149 0.9620
GRAM_POSITIVE_COCCI_PANEL 186 0.7581 0.9032 0.9462

Most specialised panels improve substantially when restricted to the organisms they were designed for: the Gram-positive cocci panel improves from 49.6% to 75.8% Top-1, and the Enterobacteriaceae panel from 70.5% to 76.6%. The exception is VIBRIO_AEROMONAS_PANEL, whose Top-1 falls from 88.3% to 84.9% under gating.

Confidence threshold analysis

Threshold Coverage Examples Top-1 Top-3 Top-5
0.05 98.4% 35,682 0.621 0.786 0.852
0.10 91.0% 33,010 0.665 0.832 0.893
0.20 81.8% 29,676 0.724 0.886 0.933
0.30 75.3% 27,285 0.769 0.916 0.950
0.50 64.7% 23,452 0.840 0.951 0.970
0.70 54.4% 19,712 0.907 0.970 0.980
0.90 44.3% 16,061 0.961 0.985 0.990
0.95 37.8% 13,716 0.977 0.991 0.993

For laboratory use, a threshold between 0.50 and 0.70 is recommended as a starting point, with lower-confidence predictions flagged for review rather than discarded. The top-5 list and next-test recommendations remain valid at any confidence level.
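A laboratory can reproduce this kind of table on its own validation data before fixing a threshold. The sketch below assumes predictions are available as (top-1 confidence, correct?) pairs; the helper name and the toy values are illustrative only.

```python
def coverage_and_accuracy(predictions, threshold):
    """predictions: list of (top1_confidence, top1_is_correct) pairs."""
    accepted = [hit for conf, hit in predictions if conf >= threshold]
    coverage = len(accepted) / len(predictions)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, accuracy


# Toy values for illustration; not drawn from the PhenoFormer test set.
toy = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.20, False)]
coverage, accuracy = coverage_and_accuracy(toy, threshold=0.50)
print(coverage, accuracy)
```

Raising the threshold trades coverage for accuracy, which is the pattern visible in the table: each step up keeps fewer examples but a larger share of them are correct.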

Worst-performing genera

Genus Top-1 Top-3 Top-5 Mean fields Primary confusion
Ralstonia 0.217 0.625 0.827 13.5 Cupriavidus
Comamonas 0.234 0.495 0.698 13.2 Acidovorax
Acidovorax 0.272 0.599 0.672 13.7 Comamonas
Leclercia 0.317 0.537 0.619 14.6 Kluyvera
Kluyvera 0.325 0.563 0.675 14.5 Leclercia
Cupriavidus 0.342 0.603 0.742 13.6 Ralstonia
Alcaligenes 0.373 0.538 0.678 12.2 -
Anaerococcus 0.391 0.609 0.815 13.8 Peptoniphilus
Cellulomonas 0.392 0.608 0.705 14.5 -
Massilia 0.399 0.592 0.686 12.7 Pseudomonas

Most hard-to-classify genera are phenotypically similar to a taxonomically related genus. The Ralstonia/Cupriavidus, Comamonas/Acidovorax, and Kluyvera/Leclercia pairs share overlapping profiles in reference literature.

Most common confusions

True genus Predicted Count Notes
Salmonella Citrobacter 78 Both Enterobacteriaceae; overlapping sugar profiles
Kingella Eikenella 74 Shared fastidious GN morphology
Providencia Morganella 62 Both Proteeae; urease/indole overlap
Clostridium Clostridioides 62 Taxonomically recent genus boundary
Vibrio Aeromonas 59 Oxidase-positive curved GN rods
Enterobacter Citrobacter 58 VP/citrate distinction required
Clostridioides Clostridium 55 Bidirectional with above
Escherichia Citrobacter 55 Indole/citrate distinction required
Klebsiella Raoultella 55 Historically synonymous genera
Ralstonia Cupriavidus 54 Formerly classified as same genus
Streptococcus Enterococcus 42 Haemolysis/bile tolerance distinction required

The next-test recommender is specifically designed to surface the discriminating tests for exactly these pairs.
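To make the idea of discriminatory power concrete, one simple scoring rule ranks each untested field by how differently two candidate genera respond to it. The sketch below is not the packaged recommender's algorithm, which this card does not specify; the function, the placeholder genera and tests, and the positive rates are all invented for illustration.

```python
def rank_discriminating_tests(positive_rates, candidate_a, candidate_b, tested):
    """Rank untested fields by |P(positive | A) - P(positive | B)|, largest first."""
    scores = {}
    for field, rates in positive_rates.items():
        if field in tested:
            continue  # already observed, so nothing to recommend
        scores[field] = abs(rates[candidate_a] - rates[candidate_b])
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder genera, tests and rates, for illustration only.
positive_rates = {
    "test_1": {"GenusA": 0.95, "GenusB": 0.05},
    "test_2": {"GenusA": 0.60, "GenusB": 0.55},
    "test_3": {"GenusA": 0.10, "GenusB": 0.80},
}
ranking = rank_discriminating_tests(
    positive_rates, "GenusA", "GenusB", tested={"test_2"}
)
print(ranking)
```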

High-confidence errors

Some errors occur at very high confidence (>0.99 top-1 probability). These are cases where the available field subset is genuinely ambiguous between genera at the population level: the model makes the statistically optimal prediction given the observed evidence, but the true isolate is atypical for that pattern. Notable examples: Mycobacterium kansasii predicted as Corynebacterium on a 6-field morphology-only panel; Eikenella predicted as Kingella on an 8-field morphology/culture panel (reciprocally confusable genera); Leuconostoc predicted as Clostridium on an 8-field panel including gas production but without Gram stain. These cases reinforce that confidence should be read alongside the top-5 list and margin, not as a binary accept/reject signal.


How to Use

Basic inference

from phenotype_inference import PhenotypeClassifier

model = PhenotypeClassifier("phenotype_tinytransformer_v1_temperature_scaled.pt")

observations = {
    "Gram Stain": "Positive",
    "Shape": "Cocci",
    "Catalase": "Positive",
    "Coagulase": "Positive",
    "Haemolysis Type": "Beta",
    "DNase": "Positive",
    "Mannitol Fermentation": "Positive",
}

result = model.predict(observations)

print(result["top_genus"])        # e.g. "Staphylococcus"
print(result["confidence"])       # calibrated probability, e.g. 0.91
print(result["top5"])             # ranked list of (genus, probability) tuples
print(result["confidence_tier"])  # "HIGH" / "MODERATE" / "LOW"
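Downstream code will usually want to act on the result rather than print it. The sketch below flags low-confidence predictions for review and reports the margin between the top two candidates. The `triage` helper and the 0.50 cut-off are assumptions for illustration, and `sample` is a stand-in dictionary with illustrative values, shaped like the keys shown above, so the snippet runs without the model file.

```python
def triage(result, accept_at=0.50):
    """Label a prediction 'accept' or 'review' and report the top-2 margin."""
    ranked = result["top5"]
    margin = ranked[0][1] - ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    status = "accept" if result["confidence"] >= accept_at else "review"
    return {"genus": result["top_genus"], "status": status, "margin": margin}


# Stand-in for model.predict(observations); values are illustrative.
sample = {
    "top_genus": "Staphylococcus",
    "confidence": 0.91,
    "top5": [("Staphylococcus", 0.91), ("Streptococcus", 0.04)],
}
decision = triage(sample)
print(decision)
```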

Next-test recommendation

# Confirmation mode: reinforces the top prediction
confirmation = model.recommend_next_tests(
    observations, mode="confirmation", top_n=5
)

# Discriminatory mode: maximally separates competing candidates
discriminatory = model.recommend_next_tests(
    observations, mode="discriminatory", top_n=5
)

Limitations and Risks

Genus-level scope only. Confirmatory testing (coagulase, MALDI-TOF, genomic) remains necessary for species-level clinical or regulatory decisions.

Excluded genera. The following 16 genera were excluded from training due to insufficient records: Arthrobacter, Carnobacterium, Chlamydia, Coxiella, Halomonas, Janthinobacterium, Lactococcus, Microbacterium, Mycoplasma, Ochrobactrum, Pediococcus, Plesiomonas, Propionibacterium, Rickettsia, Roseomonas, Thermococcus. The model will predict one of the 122 trained genera for isolates from these groups.

Phenotypically similar genus pairs. Ralstonia/Cupriavidus, Comamonas/Acidovorax, Kluyvera/Leclercia, Clostridium/Clostridioides, and Klebsiella/Raoultella are systematically confusable. Treat both candidates as live hypotheses when the model returns a low-margin prediction between such pairs.
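One way to apply this in practice is to check the top two predictions against the listed pairs and keep both candidates when the margin is small. The helper below is a sketch: the pairs come from the paragraph above, while `live_hypotheses` and the 0.20 margin cut-off are hypothetical and should be tuned locally.

```python
CONFUSABLE_PAIRS = {
    frozenset(pair)
    for pair in [
        ("Ralstonia", "Cupriavidus"),
        ("Comamonas", "Acidovorax"),
        ("Kluyvera", "Leclercia"),
        ("Clostridium", "Clostridioides"),
        ("Klebsiella", "Raoultella"),
    ]
}


def live_hypotheses(top5, max_margin=0.20):
    """Return both genera when the top two form a known pair with a small margin."""
    (first, p1), (second, p2) = top5[0], top5[1]
    if frozenset((first, second)) in CONFUSABLE_PAIRS and p1 - p2 <= max_margin:
        return [first, second]
    return [first]


# Illustrative probabilities, not model output.
print(live_hypotheses([("Ralstonia", 0.48), ("Cupriavidus", 0.41)]))
```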

Data sourcing. Training data was compiled from reference literature, not directly from laboratory isolate records. Real isolate behaviour can deviate from type-strain descriptions, particularly for environmental or food-associated strains.

Not a clinical diagnostic device. This model has not been validated as an in vitro diagnostic device (IVD) under the EU IVDR (Regulation (EU) 2017/746), UK MDR 2002, or any equivalent regulatory framework. It must not be used as the sole basis for clinical, public health, or food safety reporting decisions.


Citation

@misc{phenoformer2025,
  author       = {Asad, Zain},
  title        = {PhenoFormer: A Panel-Aware Transformer for Bacterial Genus Identification from Phenotypic Test Results},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ZainAsad/PhenoFormer}},
  note         = {Trained on the BactAID Gold Test Dataset. Includes temperature-scaled inference and dual next-test recommendation.}
}

Related Work

  • BactAID: the parent project, a hybrid bacterial identification system (XGBoost + LoRA FLAN-T5 + BART RAG) achieving 95.1% accuracy across 140 genera. Available on Zenodo.
  • PhenotypeClassifier-XGB v2: the XGBoost baseline developed alongside PhenoFormer on identical data splits
  • DomainEmbedder: domain-adaptive embeddings with LoRA and A2C RL routing

Model Card Contact

Developed by Zain Asad (Applied AI Engineer | Microbiology Specialist). Raise an issue on the repository or contact via HuggingFace.
