---
license: apache-2.0
base_model:
- google/flan-t5-base
pipeline_tag: text2text-generation
library_name: transformers
tags:
- biology
language:
- en
---

# Model Card: Core Schema Parsing LLM (Microbiology)

## Model Overview

This model is a domain-adapted sequence-to-sequence language model designed to parse free-text microbiology phenotype descriptions into a structured core schema of laboratory test results and traits.

The model is intended to augment deterministic rule-based and extended parsers by recovering fields that may be missed due to complex phrasing, implicit descriptions, or uncommon linguistic constructions. It is not designed to operate as a standalone classifier or diagnostic system.
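
As a rough illustration of the parsing task, the snippet below runs the model through the standard transformers Seq2Seq API. The repository id and the input sentence are hypothetical placeholders; the expected output follows the sorted Field: Value serialization described under Training Data.

```python
# Minimal inference sketch; the repo id and example text are placeholders.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "your-org/core-schema-parser"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Gram-negative rod, oxidase positive, grows at 42 C, lactose not fermented."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
output_ids = model.generate(**inputs, max_new_tokens=256)

# The fine-tuned model emits sorted "Field: Value" lines (see Training Data).
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```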

## Base Model

- Base architecture: google/flan-t5-base
- Model type: encoder–decoder (Seq2Seq), instruction-tuned

The FLAN-T5 base model was selected due to its strong instruction-following behaviour, stability during fine-tuning, and suitability for structured text generation tasks on limited hardware.

## Training Data

The model was fine-tuned on 8,700 curated microbiology phenotype examples, each consisting of:

- A free-text phenotype description
- A deterministic target serialization of core schema fields and values

Data preprocessing:

- The name field and all non-core schema fields were explicitly removed to prevent label leakage.
- Target outputs were serialized deterministically using sorted schema keys (Field: Value format); see the sketch after this list.
- Inputs and targets were constrained to schema-relevant content only.
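
The exact serialization helper is not published; a minimal sketch of the deterministic sorted-key Field: Value format described above could look like this:

```python
def serialize_core_schema(record: dict) -> str:
    """Render core schema fields as 'Field: Value' lines with sorted keys,
    so that identical records always serialize identically."""
    return "\n".join(f"{field}: {record[field]}" for field in sorted(record))

# Hypothetical example record:
print(serialize_core_schema({"Oxidase": "positive", "Gram stain": "negative"}))
# Gram stain: negative
# Oxidase: positive
```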

The dataset was split 80/20 into training and validation subsets.

## Training Procedure

- Epochs: 3
- Optimizer: AdamW (Hugging Face Trainer default)
- Learning rate: 1e-5
- Batching:
  - Per-device batch size: 1
  - Gradient accumulation: 8 steps (effective batch size = 8)
- Sequence lengths:
  - Max input length: 2048 tokens
  - Max output length: 2048 tokens
- Precision: bf16 on supported hardware (A100), otherwise fp16
- Stability measures:
  - Gradient checkpointing enabled
  - Gradient clipping (max_grad_norm = 1.0)
  - Warmup ratio of 0.03

The model was trained using the Hugging Face Trainer API and saved after completion of all epochs.
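
These settings map onto the standard Trainer configuration roughly as follows. This is a reconstruction from the hyperparameters listed above, not the authors' exact script; the output_dir and the bf16/fp16 switch are assumptions.

```python
import torch
from transformers import Seq2SeqTrainingArguments

# bf16 on A100-class hardware, otherwise fp16, per the Precision note above.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

args = Seq2SeqTrainingArguments(
    output_dir="core-schema-parser",    # assumed output path
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,      # effective batch size = 8
    bf16=use_bf16,
    fp16=not use_bf16,
    gradient_checkpointing=True,
    max_grad_norm=1.0,
    warmup_ratio=0.03,
)
```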

## Intended Use

This model is intended for:

- Structured parsing of microbiology phenotype text into predefined schema fields
- Use as a third-stage parser alongside rule-based and extended parsers
- Supporting downstream deterministic scoring, ranking, and retrieval systems

Not intended for:

- Standalone clinical diagnosis
- Autonomous decision-making
- Use without additional validation layers

## Integration Context

In production, the model is used as a fallback and recovery mechanism within a hybrid parsing pipeline, applied in stage order:

1. Rule-based parser (high precision)
2. Extended parser (schema-aware)
3. LLM parser (coverage and robustness)

Outputs are reconciled and validated downstream before being used for identification or explanation.
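
The reconciliation logic is implementation-specific; one plausible sketch, with hypothetical stage functions, is a precedence merge in which a field set by an earlier (higher-precision) stage is never overwritten by a later one:

```python
from typing import Callable, Dict, List

Parser = Callable[[str], Dict[str, str]]

def reconcile(text: str, stages: List[Parser]) -> Dict[str, str]:
    """Merge stage outputs; earlier (higher-precision) stages take priority."""
    merged: Dict[str, str] = {}
    for parse in stages:
        for field, value in parse(text).items():
            merged.setdefault(field, value)  # keep the first value seen
    return merged

# Usage with hypothetical stage functions:
# fields = reconcile(text, [rule_based_parse, extended_parse, llm_parse])
```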

## Limitations

- Performance depends on coverage of the training schema; the model cannot generalize beyond it.
- The model may hallucinate field values if used outside its intended constrained pipeline.
- It is sensitive to extreme deviations in input style and to unsupported terminology.

## Ethical and Safety Considerations

- The model does not provide medical advice or diagnoses.
- Outputs should always be reviewed in conjunction with deterministic logic and domain expertise.
- Training data was curated to minimize leakage and unintended inference.

## Author

Developed and fine-tuned by Zain Asad as part of the BactAI-D project.