PhenoFormer

PhenoFormer is a 22M-parameter Transformer classifier for bacterial genus identification from phenotypic test results. Given a partial set of microbiological test observations — Gram stain, biochemical reactions, growth characteristics, fermentation profiles — it returns ranked genus predictions with calibrated confidence scores and, optionally, a prioritised recommendation of which tests to run next.

It is designed to reflect the real conditions of a clinical or food/water testing laboratory: you rarely have a complete biochemical profile at the point of identification. PhenoFormer was trained on simulated partial panels derived from 15 named test batteries and structured random sampling strategies, making it robust to the incomplete, in-progress data that exists during a genuine identification workflow.

Model Details

Property	Value
Parameters	22,288,506
Architecture	Transformer encoder (custom)
d_model	512
Encoder layers	8
Attention heads	8
Feed-forward dim	1,536
Dropout	0.12
Vocabulary size	1,120 tokens
Max sequence length	192 tokens
Output classes	122 bacterial genera
Best checkpoint	Epoch 12
Temperature scaling	T = 1.063 (post-hoc calibration)
Training precision	bf16 (A100)
Model size	~85 MB

Architecture notes

Input phenotypic fields are serialised as structured token sequences (field name + value tokens), not free text. Absent fields are omitted from the sequence rather than masked to zero, so an untested field is never confused with a negative result; this is a natural inductive bias for partial-observation classification. A [CLS]-style aggregation token produces the genus logits. At inference time the logits are divided by a learned scalar temperature (T = 1.063), fitted on the validation split, to correct mild overconfidence.
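
The serialisation step might look roughly like the sketch below. The token spellings (`FIELD=...`, `VALUE=...`, `[CLS]`) and the function name are assumptions made for illustration; the packaged tokenizer and its 1,120-token vocabulary are not reproduced here.

```python
from typing import Dict, List, Optional

CLS_TOKEN = "[CLS]"  # aggregation token; the literal string is an assumption


def serialise_observations(observations: Dict[str, Optional[str]]) -> List[str]:
    """Turn {field: value} observations into a flat token sequence.

    Fields whose value is None or empty are skipped entirely, so the
    sequence only contains tests that have actually been run.
    """
    tokens = [CLS_TOKEN]
    for field, value in observations.items():
        if value is None or str(value).strip() == "":
            continue  # absent field: omit it, do not emit a placeholder
        tokens.append(f"FIELD={field}")
        tokens.append(f"VALUE={value}")
    return tokens


tokens = serialise_observations(
    {"Gram Stain": "Negative", "Oxidase": None, "Shape": "Rods"}
)
print(tokens)
```

Because the untested Oxidase field contributes no tokens at all, a shorter panel simply yields a shorter sequence.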

Intended Use

Primary use cases

Genus-level identification support during routine bacteriology workup, where a subset of phenotypic tests has been completed
Differential diagnosis narrowing: the top-3 and top-5 predictions cover the true genus in 77.6% and 84.1% of cases respectively on mixed partial-panel test examples, providing a ranked shortlist for confirmatory testing
Next-test recommendation: the packaged inference module includes a dual recommender that ranks untested fields by either confirmation weight (reinforce the top prediction) or discriminatory power (maximally separate competing candidates)
Laboratory information system integration: the inference module is self-contained and callable with a plain dict of {field: value} observations

Not intended for

Clinical diagnostic reporting or patient care decisions — predictions are decision support only and must be confirmed by a qualified microbiologist
Antimicrobial susceptibility inference — the model classifies genus, not species, and carries no AMR information
Atypical, obligate intracellular, or fastidious organisms (e.g. Rickettsia, Chlamydia, Mycoplasma) — excluded from training due to insufficient records
Subspecies or biotype discrimination — classification is genus-level only
Fungal or parasitic identification — scope is bacteria only (note: Saccharomyces, Candida, and Cryptococcus appear in training as a minority and may produce unreliable predictions)

Training Data

The training data is the BactAID Gold Test Dataset, a curated collection of 8,360 records spanning 138 bacterial genera and 5,563 unique species/name entries. Each record contains a structured phenotypic profile compiled from reference sources including Bergey's Manual and published species descriptions.

After filtering genera with fewer than 10 records (16 genera excluded), the usable dataset covers 122 genera across 8,335 records.

Dataset statistics

Split	Records	Generated examples
Train (70%)	5,825	205,227
Validation (15%)	1,248	35,434
Test (15%)	1,262	36,259

The record-level split was performed before any example generation to prevent data leakage.
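
As a minimal sketch of that ordering, source records can be partitioned first and examples generated per partition afterwards. The function name and the plain random shuffle are assumptions; the card does not say whether the real split was stratified by genus.

```python
import random
from typing import Dict, List, Sequence


def split_record_ids(
    record_ids: Sequence[int],
    fractions: Sequence[float] = (0.70, 0.15, 0.15),
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Shuffle record IDs once and cut them into train/val/test partitions."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * fractions[0])
    n_val = round(len(ids) * fractions[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


splits = split_record_ids(range(100))

# Partial-panel examples would be generated per split after this point, so
# every example derived from one record lands in exactly one partition.
record_to_split = {rid: name for name, ids in splits.items() for rid in ids}
print({name: len(ids) for name, ids in splits.items()})
```

Generating examples first and splitting afterwards would let different field subsets of the same record appear in both train and test.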

Phenotypic schema

The model uses 45 canonical phenotypic fields:

Morphology & Growth: Gram Stain, Shape, Colony Morphology, Growth Temperature, Media Grown On, Capsule, Spore Formation, Oxygen Requirement, Motility, Motility Type, Haemolysis, Haemolysis Type

Core Biochemistry: Catalase, Oxidase, Indole, Urease, Citrate, H2S, Nitrate Reduction, Methyl Red, VP, Coagulase, DNase, ONPG, NaCl Tolerant (≥6%), Lipase Test, Gelatin Hydrolysis, Esculin Hydrolysis, Arginine dihydrolase, Lysine Decarboxylase, Ornithine Decarboxylase

Fermentation: Glucose, Lactose, Sucrose, Mannitol, Maltose, Xylose, Arabinose, Rhamnose, Sorbitol, Raffinose, Inositol, Trehalose

Extended: Gas Production, TSI Pattern

Field coverage ranges from 100% (Shape) to ~35% (TSI Pattern, Gas Production). Growth Temperature was present for 98.5% of records and was parsed into four numeric features per record (low bound, high bound, range, midpoint) plus ten binary threshold flags (growth at 4, 10, 20, 25, 30, 37, 42, 45, 50, 55 °C). The original categorical token was retained alongside these expansions.
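
The expansion of Growth Temperature into numeric features and threshold flags can be sketched as follows. The raw string formats ("10//45", "10-45", "37"), the feature names and the inclusive range test are assumptions for illustration; only the ten threshold temperatures come from the description above.

```python
import re
from typing import Dict, Optional

THRESHOLDS_C = (4, 10, 20, 25, 30, 37, 42, 45, 50, 55)  # listed in the card


def parse_growth_temperature(raw: Optional[str]) -> Optional[Dict[str, float]]:
    """Expand a raw Growth Temperature string into numeric features and flags.

    Assumed input formats (illustrative): "10//45", "10-45" or "37".
    Negative temperatures are not handled. Returns None if no number is found.
    """
    if not raw:
        return None
    numbers = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", raw)]
    if not numbers:
        return None
    low, high = min(numbers), max(numbers)
    features = {
        "temp_low": low,
        "temp_high": high,
        "temp_range": high - low,
        "temp_mid": (low + high) / 2,
    }
    # One binary flag per threshold: 1.0 if the threshold lies within the range.
    for t in THRESHOLDS_C:
        features[f"grows_at_{t}C"] = float(low <= t <= high)
    return features


features = parse_growth_temperature("10//45")
print(features)
```

Taking the minimum and maximum of all numbers found covers the double-slash, range, single-number and multiple-number cases with one code path.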

Growth Temperature parse status	Records
Parsed (double slash format)	8,047
Parsed (range format)	142
Single number	25
Parsed (multiple numbers, min/max)	1
Missing	120

Genus representation

Most-represented genera (≥100 records): Escherichia (160), Citrobacter (156), Staphylococcus (150), Bacillus (144), Listeria (131), Pseudomonas (131), Streptococcus (131), Enterobacter (130), Clostridium (126), Campylobacter (123), Enterococcus (121), Klebsiella (118), Salmonella (116), Vibrio (115), Lactobacillus (113), Cronobacter (108), Serratia (108), Leuconostoc (105).

Genus distribution is moderately imbalanced. Class-balanced sample weights (inverse square-root of class frequency) were applied during training.
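
A minimal sketch of inverse square-root class weighting is shown below. The normalisation to a mean per-example weight of 1.0 and the function name are assumptions; the card only states the inverse square-root rule.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Sequence


def inverse_sqrt_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by 1/sqrt(count), rescaled to a mean example weight of 1."""
    counts = Counter(labels)
    raw = {genus: 1.0 / sqrt(n) for genus, n in counts.items()}
    # Rescaling is an assumption; it keeps the overall loss scale unchanged.
    total = sum(raw[genus] * counts[genus] for genus in counts)
    scale = len(labels) / total
    return {genus: weight * scale for genus, weight in raw.items()}


# Toy label list: 160 records of one genus, 10 of a hypothetical rare genus.
labels = ["Escherichia"] * 160 + ["RareGenus"] * 10
weights = inverse_sqrt_class_weights(labels)
print(weights)
```

With a 16:1 frequency ratio the rare class is up-weighted by a factor of 4 rather than 16, which is the softer correction the square root provides compared with plain inverse-frequency weighting.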

Training Procedure

Panel-aware partial phenotype generation

Rather than training on complete phenotypic profiles, each record generates multiple examples by sampling different subsets of its available fields, simulating real laboratory conditions. Four strategies were combined:

Named panels — 15 fixed test batteries reflecting real-world workflows, including ENTEROBACTERIACEAE_LITE, GRAM_POSITIVE_COCCI_PANEL, NON_FERMENTER_PANEL, ANAEROBE_PANEL, VIBRIO_AEROMONAS_PANEL, MORPHOLOGY_CULTURE_PANEL, CORE_BIOCHEM, and workflow prefix panels at multiple completion stages
Random partial sampling — random subsets of 2–N available fields, with Gram Stain and Shape given a 65% inclusion bias to reflect their typical priority in practice
Dropout from full — keep-probability of 0.30–0.75 applied per field across 5 rounds
Full available — the complete set of observed fields for each record

This produced 205,227 training examples from 5,825 source records after de-duplication of identical field-set combinations within the same record. The largest source categories were FULL_AVAILABLE (5,825), MORPHOLOGY_CULTURE_PANEL (5,822), EARLY_GRAM_SHAPE (5,818), EARLY_BASIC (5,804), and the five DROPOUT rounds (~5,763–5,786 each).
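
Three of the four strategies (random partial, dropout from full, full available) and the per-record de-duplication can be sketched as below. This is one possible reading of the description, not the training code: the function names, the number of random draws and the way the 65% bias and the keep-probability are applied are assumptions, and named panels are left out.

```python
import random
from typing import Dict, FrozenSet, List

PRIORITY_FIELDS = ("Gram Stain", "Shape")


def random_partial(fields: List[str], rng: random.Random,
                   min_fields: int = 2) -> FrozenSet[str]:
    """Random subset of size 2..N; each priority field is forced in with p=0.65."""
    k = rng.randint(min_fields, len(fields))
    chosen = {f for f in PRIORITY_FIELDS if f in fields and rng.random() < 0.65}
    remaining = [f for f in fields if f not in chosen]
    rng.shuffle(remaining)
    chosen.update(remaining[: max(0, k - len(chosen))])
    return frozenset(chosen)


def dropout_from_full(fields: List[str], rng: random.Random,
                      min_fields: int = 2) -> FrozenSet[str]:
    """Keep each field independently with a probability drawn from 0.30 to 0.75."""
    keep_p = rng.uniform(0.30, 0.75)
    kept = {f for f in fields if rng.random() < keep_p}
    if len(kept) < min_fields:  # top up so the example is still usable
        extra = [f for f in fields if f not in kept]
        rng.shuffle(extra)
        kept.update(extra[: min_fields - len(kept)])
    return frozenset(kept)


def generate_examples(record: Dict[str, str], seed: int = 0, n_random: int = 4,
                      n_dropout: int = 5, min_fields: int = 2) -> List[Dict[str, str]]:
    """Build de-duplicated partial views of one record."""
    fields = sorted(record)
    if len(fields) < min_fields:
        return []
    rng = random.Random(seed)
    subsets = {frozenset(fields)}  # the "full available" example
    for _ in range(n_random):
        subsets.add(random_partial(fields, rng, min_fields))
    for _ in range(n_dropout):
        subsets.add(dropout_from_full(fields, rng, min_fields))
    # The set removes identical field-set combinations within this record.
    ordered = sorted(subsets, key=lambda s: (len(s), sorted(s)))
    return [{f: record[f] for f in sorted(s)} for s in ordered]


# Illustrative field values only; not taken from the dataset.
record = {
    "Gram Stain": "Negative", "Shape": "Rods", "Oxidase": "Negative",
    "Catalase": "Positive", "Indole": "Positive", "Lactose": "Positive",
}
examples = generate_examples(record)
print(len(examples), "unique examples")
```

Storing each subset as a frozenset makes de-duplication a set operation, which is why the number of examples per record can be smaller than the number of sampling rounds.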

Training configuration

Parameter	Value
Optimiser	AdamW
Training precision	bf16 (A100)
AMP	Enabled (GradScaler disabled for bf16)
Best checkpoint	Epoch 12 of 14
Early stopping score	Composite (Top-1 + Top-3 + Top-5 weighted)
Class weighting	Inverse square-root frequency
Min fields per example	2

Training history

Epoch	Train loss	Val loss	Top-1	Top-3	Top-5	Macro F1	ECE
1	3.3417	2.0114	0.4629	0.6816	0.7796	0.4356	0.0669
2	1.8601	1.6958	0.5529	0.7444	0.8197	0.5508	0.0223
3	1.6152	1.6085	0.5777	0.7591	0.8303	0.5790	0.0181
4	1.5058	1.5780	0.5899	0.7621	0.8325	0.5852	0.0209
5	1.4273	1.5476	0.5912	0.7663	0.8364	0.5862	0.0256
6	1.3620	1.5350	0.5983	0.7686	0.8371	0.5935	0.0262
7	1.3011	1.5255	0.6030	0.7711	0.8395	0.5985	0.0258
8	1.2432	1.5327	0.6034	0.7733	0.8390	0.6004	0.0321
9	1.1896	1.5439	0.6058	0.7736	0.8392	0.6015	0.0358
10	1.1410	1.5509	0.6089	0.7735	0.8400	0.6029	0.0396
11	1.1016	1.5642	0.6098	0.7729	0.8398	0.6034	0.0433
12 ✓	1.0724	1.5680	0.6110	0.7725	0.8391	0.6035	0.0421
13	1.0549	1.5794	0.6103	0.7726	0.8394	0.6028	0.0454
14	1.0460	1.5833	0.6102	0.7722	0.8392	0.6023	0.0473

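Validation loss reached its minimum at epoch 7 (1.5255) and rose slowly afterwards, while Top-1 accuracy and Macro F1 continued to improve until epoch 12, the checkpoint selected by the composite early-stopping score. ECE degraded in later epochs (from 0.0181 at epoch 3 to 0.0473 at epoch 14) as the model became overconfident without explicit calibration during training; this is corrected post-hoc by temperature scaling.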
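
Post-hoc temperature scaling fits a single scalar T that minimises negative log-likelihood on held-out logits. The sketch below uses a plain grid search as a stand-in for whatever optimiser was actually used, and the toy logits are invented for illustration; only the idea of dividing logits by T comes from the card.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows: Sequence[Sequence[float]], labels: Sequence[int],
             temperature: float) -> float:
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows: Sequence[Sequence[float]], labels: Sequence[int],
                    lo: float = 0.5, hi: float = 3.0, steps: int = 251) -> float:
    """Grid-search the temperature that minimises validation NLL."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))


# Toy validation set: confident logits, but one of four predictions is wrong.
val_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
val_labels = [0, 1, 1, 2]

T_fit = fit_temperature(val_logits, val_labels)
print("fitted T:", T_fit)
print("top-1 confidence before:", max(softmax(val_logits[0], 1.0)))
print("top-1 confidence after: ", max(softmax(val_logits[0], T_fit)))
```

Dividing by T > 1 flattens the distribution without changing the ranking, which is why Top-1 and Top-3 are identical for the calibrated and uncalibrated model while ECE and NLL improve.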

Evaluation

Test set performance (panel-aware partial phenotype examples, n=36,259)

Metric	Value
Top-1 accuracy	0.6119
Top-3 accuracy	0.7757
Top-5 accuracy	0.8411
Macro F1	0.6069
Weighted F1	0.6141
ECE (pre-calibration)	0.0400
ECE (post temperature scaling, T=1.063)	0.0309
NLL (pre-calibration)	1.2997
NLL (post temperature scaling)	1.2977
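
For readers unfamiliar with ECE, the following sketch shows the usual equal-width binned estimator. The bin count of 15 is an assumption; the card does not state how many bins were used for the figures above.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 15) -> float:
    """Binned ECE: weighted mean of |confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


outcomes = [True, True, True, False]  # 75% accuracy
ece_calibrated = expected_calibration_error([0.75] * 4, outcomes)
ece_overconfident = expected_calibration_error([0.95] * 4, outcomes)
print(ece_calibrated, ece_overconfident)
```

An ECE of 0.031 therefore means that, averaged over bins, stated confidence and observed Top-1 accuracy differ by about three percentage points.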

Comparison with XGBoost baseline (XGB v2)

Model	Top-1	Top-3	Notes
XGB v2	0.5958	0.7670	900 estimators, hist method
PhenoFormer (uncalibrated)	0.6119	0.7757	Epoch 12 checkpoint
PhenoFormer (calibrated)	0.6119	0.7757	T = 1.063, ECE 0.031

The Transformer's Top-1 advantage over XGB is ~1.6 percentage points (0.6119 vs 0.5958). Tree-based models are generally competitive on structured, sparse tabular input, so a small margin is expected. The further case for PhenoFormer rests on its calibrated probabilities and its suitability for the next-test recommendation module.

Controlled panel performance (n=1,262 test records per panel)

Panel	Top-1	Top-3	Top-5	Macro F1	ECE
NO_GROWTH_TEMPERATURE	0.9667	0.9945	0.9976	0.9649	0.028
FULL_AVAILABLE	0.9580	0.9921	0.9952	0.9570	0.026
NO_MEDIA	0.9572	0.9913	0.9952	0.9549	0.026
NO_MEDIA_NO_GROWTH_TEMPERATURE	0.9532	0.9889	0.9945	0.9496	0.020
NON_FERMENTER_PANEL	0.8875	0.9762	0.9873	0.8763	0.028
VIBRIO_AEROMONAS_PANEL	0.8827	0.9739	0.9857	0.8677	0.015
ANAEROBE_PANEL	0.8685	0.9659	0.9842	0.8538	0.023
MORPHOLOGY_CULTURE_PANEL	0.8494	0.9509	0.9691	0.8530	0.032
ENTEROBACTERIACEAE_LITE	0.7052	0.8835	0.9429	0.6788	0.072
CORE_BIOCHEM_PLUS_MOTILITY	0.6450	0.8447	0.9192	0.6140	0.033
CORE_BIOCHEM	0.5610	0.7876	0.8677	0.5056	0.033
SUGAR_PANEL	0.5586	0.7655	0.8573	0.5195	0.046
GRAM_POSITIVE_COCCI_PANEL	0.4960	0.7068	0.7995	0.4606	0.126
EARLY_BASIC	0.2464	0.4849	0.6078	0.1583	0.027
EARLY_GRAM_SHAPE	0.1220	0.2940	0.3851	0.0418	0.026

Gated panel performance (restricted to records with sufficient fields)

Panel	Examples	Top-1	Top-3	Top-5
NON_FERMENTER_PANEL	556	0.9029	0.9838	0.9910
ANAEROBE_PANEL	631	0.8891	0.9715	0.9857
VIBRIO_AEROMONAS_PANEL	285	0.8491	0.9719	0.9860
ENTEROBACTERIACEAE_LITE	658	0.7660	0.9149	0.9620
GRAM_POSITIVE_COCCI_PANEL	186	0.7581	0.9032	0.9462

Most specialised panels improve when restricted to the organisms they were designed for: the Gram-positive cocci panel rises from 49.6% to 75.8% Top-1, and the Enterobacteriaceae panel from 70.5% to 76.6%. The Vibrio/Aeromonas panel is the exception, falling from 88.3% to 84.9% Top-1 on its 285 gated examples.

Confidence threshold analysis

Threshold	Coverage	Examples	Top-1	Top-3	Top-5
0.05	98.4%	35,682	0.621	0.786	0.852
0.10	91.0%	33,010	0.665	0.832	0.893
0.20	81.8%	29,676	0.724	0.886	0.933
0.30	75.3%	27,285	0.769	0.916	0.950
0.50	64.7%	23,452	0.840	0.951	0.970
0.70	54.4%	19,712	0.907	0.970	0.980
0.90	44.3%	16,061	0.961	0.985	0.990
0.95	37.8%	13,716	0.977	0.991	0.993

For laboratory use, a threshold between 0.50 and 0.70 is recommended as a starting point, with lower-confidence predictions flagged for review rather than discarded. The top-5 list and next-test recommendations remain valid at any confidence level.
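
The coverage/accuracy trade-off in the table, and the flag-rather-than-discard policy, can be sketched as follows. The function names and the "ACCEPT"/"REVIEW" labels are hypothetical and are not the packaged module's `confidence_tier` values.

```python
from typing import Sequence, Tuple


def selective_metrics(confidences: Sequence[float], correct: Sequence[bool],
                      threshold: float) -> Tuple[float, float]:
    """Return (coverage, Top-1 accuracy) over predictions at or above threshold."""
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


def triage(confidence: float, accept_at: float = 0.50) -> str:
    """Flag low-confidence predictions for review instead of dropping them."""
    return "ACCEPT" if confidence >= accept_at else "REVIEW"


# Toy predictions, invented for illustration.
confidences = [0.90, 0.60, 0.40, 0.20]
correct = [True, True, False, False]

coverage, accuracy = selective_metrics(confidences, correct, threshold=0.50)
flags = [triage(c) for c in confidences]
print(coverage, accuracy, flags)
```

Raising the threshold trades coverage for accuracy, as in the table: at 0.50 the model answers for 64.7% of test examples at 0.840 Top-1, and the rest are routed to review.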

Worst-performing genera

Genus	Top-1	Top-3	Top-5	Mean fields	Primary confusion
Ralstonia	0.217	0.625	0.827	13.5	Cupriavidus
Comamonas	0.234	0.495	0.698	13.2	Acidovorax
Acidovorax	0.272	0.599	0.672	13.7	Comamonas
Leclercia	0.317	0.537	0.619	14.6	Kluyvera
Kluyvera	0.325	0.563	0.675	14.5	Leclercia
Cupriavidus	0.342	0.603	0.742	13.6	Ralstonia
Alcaligenes	0.373	0.538	0.678	12.2	—
Anaerococcus	0.391	0.609	0.815	13.8	Peptoniphilus
Cellulomonas	0.392	0.608	0.705	14.5	—
Massilia	0.399	0.592	0.686	12.7	Pseudomonas

Most hard-to-classify genera are phenotypically similar to a taxonomically related genus. The Ralstonia/Cupriavidus, Comamonas/Acidovorax, and Kluyvera/Leclercia pairs share overlapping profiles in reference literature.

Most common confusions

True genus	Predicted	Count	Notes
Salmonella	Citrobacter	78	Both Enterobacteriaceae; overlapping sugar profiles
Kingella	Eikenella	74	Shared fastidious GN morphology
Providencia	Morganella	62	Both Proteeae; urease/indole overlap
Clostridium	Clostridioides	62	Taxonomically recent genus boundary
Vibrio	Aeromonas	59	Oxidase-positive curved GN rods
Enterobacter	Citrobacter	58	VP/citrate distinction required
Clostridioides	Clostridium	55	Bidirectional with above
Escherichia	Citrobacter	55	Indole/citrate distinction required
Klebsiella	Raoultella	55	Historically synonymous genera
Ralstonia	Cupriavidus	54	Formerly classified as same genus
Streptococcus	Enterococcus	42	Haemolysis/bile tolerance distinction required

The next-test recommender is specifically designed to surface the discriminating tests for exactly these pairs.

High-confidence errors

Some errors occur at very high confidence (>0.99 top-1 probability). These represent cases where the available field subset is genuinely ambiguous between genera at the population level — the model is making the statistically optimal prediction given the observed evidence, but the true isolate is atypical for that pattern. Notable examples: Mycobacterium kansasii predicted as Corynebacterium on a 6-field morphology-only panel; Eikenella predicted as Kingella on an 8-field morphology/culture panel (reciprocally confusable genera); Leuconostoc predicted as Clostridium on an 8-field panel including gas production but without Gram stain. These cases reinforce that confidence should be read alongside the top-5 list and margin, not as a binary accept/reject signal.
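
Reading confidence alongside the margin can be as simple as the sketch below, which works on the `(genus, probability)` tuples returned in `result["top5"]`. The helper names and the 0.10 margin cut-off are assumptions, and the probabilities are invented for illustration.

```python
from typing import List, Tuple


def top_margin(top5: List[Tuple[str, float]]) -> float:
    """Gap between the best and second-best probabilities."""
    ranked = sorted(top5, key=lambda pair: pair[1], reverse=True)
    if len(ranked) < 2:
        return ranked[0][1] if ranked else 0.0
    return ranked[0][1] - ranked[1][1]


def is_low_margin(top5: List[Tuple[str, float]], min_margin: float = 0.10) -> bool:
    """True when the top two candidates are too close to separate safely."""
    return top_margin(top5) < min_margin


# Shortlist truncated to two entries for brevity.
shortlist = [("Kingella", 0.48), ("Eikenella", 0.44)]
print(top_margin(shortlist), is_low_margin(shortlist))
```

A low margin between a known confusable pair is a cue to treat both genera as live hypotheses and run a discriminating test rather than accept the top prediction.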

How to Use

Basic inference

from phenotype_inference import PhenotypeClassifier

model = PhenotypeClassifier("phenotype_tinytransformer_v1_temperature_scaled.pt")

observations = {
    "Gram Stain": "Positive",
    "Shape": "Cocci",
    "Catalase": "Positive",
    "Coagulase": "Positive",
    "Haemolysis Type": "Beta",
    "DNase": "Positive",
    "Mannitol Fermentation": "Positive",
}

result = model.predict(observations)

print(result["top_genus"])        # e.g. "Staphylococcus"
print(result["confidence"])       # calibrated probability, e.g. 0.91
print(result["top5"])             # ranked list of (genus, probability) tuples
print(result["confidence_tier"])  # "HIGH" / "MODERATE" / "LOW"

Next-test recommendation

# Confirmation mode: reinforces the top prediction
confirmation = model.recommend_next_tests(
    observations, mode="confirmation", top_n=5
)

# Discriminatory mode: maximally separates competing candidates
discriminatory = model.recommend_next_tests(
    observations, mode="discriminatory", top_n=5
)
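
To give an intuition for discriminatory ranking, the sketch below scores each untested binary test by its expected information gain over the current candidate genera. This is not the packaged recommender: the scoring rule, the function names and the per-genus positive rates are all assumptions chosen to illustrate the idea.

```python
import math
from typing import Dict, List, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior: Dict[str, float],
                              positive_rate: Dict[str, float]) -> float:
    """Expected drop in entropy over genera after observing one binary test.

    prior:         {genus: P(genus)}, assumed to sum to 1
    positive_rate: {genus: P(test positive | genus)}
    """
    genera = list(prior)
    gain = entropy([prior[g] for g in genera])
    for positive in (True, False):
        joint = [
            prior[g] * (positive_rate[g] if positive else 1.0 - positive_rate[g])
            for g in genera
        ]
        p_outcome = sum(joint)
        if p_outcome == 0:
            continue
        posterior = [j / p_outcome for j in joint]
        gain -= p_outcome * entropy(posterior)
    return gain


def rank_tests(prior: Dict[str, float],
               rates_by_test: Dict[str, Dict[str, float]],
               top_n: int = 5) -> List[Tuple[str, float]]:
    """Sort candidate tests by expected information gain, best first."""
    scored = [(test, expected_information_gain(prior, rates))
              for test, rates in rates_by_test.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Two hypothetical, equally likely genera and two hypothetical tests.
prior = {"GenusA": 0.5, "GenusB": 0.5}
rates_by_test = {
    "test_uninformative": {"GenusA": 0.5, "GenusB": 0.5},
    "test_split": {"GenusA": 1.0, "GenusB": 0.0},
}
ranked = rank_tests(prior, rates_by_test)
print(ranked)
```

A test that is always positive for one candidate and always negative for the other removes all uncertainty, while a test with the same positive rate in both candidates removes none; confirmation mode would instead favour tests whose expected result agrees with the current top prediction.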

Limitations and Risks

Genus-level scope only. Confirmatory testing (coagulase, MALDI-TOF, genomic) remains necessary for species-level clinical or regulatory decisions.

Excluded genera. The following 16 genera were excluded from training due to insufficient records: Arthrobacter, Carnobacterium, Chlamydia, Coxiella, Halomonas, Janthinobacterium, Lactococcus, Microbacterium, Mycoplasma, Ochrobactrum, Pediococcus, Plesiomonas, Propionibacterium, Rickettsia, Roseomonas, Thermococcus. The model will predict one of the 122 trained genera for isolates from these groups.

Phenotypically similar genus pairs. Ralstonia/Cupriavidus, Comamonas/Acidovorax, Kluyvera/Leclercia, Clostridium/Clostridioides, and Klebsiella/Raoultella are systematically confusable. Treat both candidates as live hypotheses when the model returns a low-margin prediction between such pairs.

Data sourcing. Training data was compiled from reference literature, not directly from laboratory isolate records. Real isolate behaviour can deviate from type-strain descriptions, particularly for environmental or food-associated strains.

Not a clinical diagnostic device. This model has not been validated as an in vitro diagnostic device (IVD) under EU MDR, UK MDR 2002, or any equivalent regulatory framework. It must not be used as the sole basis for clinical, public health, or food safety reporting decisions.

Citation

@misc{phenoformer2025,
  author       = {Asad, Zain},
  title        = {PhenoFormer: A Panel-Aware Transformer for Bacterial Genus Identification from Phenotypic Test Results},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ZainAsad/PhenoFormer}},
  note         = {Trained on the BactAID Gold Test Dataset. Includes temperature-scaled inference and dual next-test recommendation.}
}

Related Work

BactAID — the parent project: a hybrid bacterial identification system (XGBoost + LoRA FLAN-T5 + BART RAG) achieving 95.1% accuracy across 140 genera. Zenodo
PhenotypeClassifier-XGB v2 — the XGBoost baseline developed alongside PhenoFormer on identical data splits
DomainEmbedder — domain-adaptive embeddings with LoRA and A2C RL routing

Model Card Contact

Developed by Zain Asad (Applied AI Engineer | Microbiology Specialist). Raise an issue on the repository or contact via HuggingFace.
