---
tags:
language:
- en
---

### Model Card: Core Schema Parsing LLM (Microbiology)

## Model Overview

This model is a domain-adapted sequence-to-sequence language model designed to parse free-text microbiology phenotype descriptions into a structured core schema of laboratory test results and traits.

The model is intended to augment deterministic rule-based and extended parsers by recovering fields that may be missed due to complex phrasing, implicit descriptions, or uncommon linguistic constructions. It is not designed to operate as a standalone classifier or diagnostic system.

## Base Model

Base architecture: google/flan-t5-base

Model type: Encoder–decoder (Seq2Seq), instruction-tuned

The FLAN-T5 base model was selected for its strong instruction-following behaviour, stability during fine-tuning, and suitability for structured text generation tasks on limited hardware.
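
For illustration, a minimal inference sketch with the `transformers` library (the repository ID, input text, and output format shown here are assumptions, not the model's documented interface):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical repository ID; substitute the actual fine-tuned checkpoint.
MODEL_ID = "BactAI-D/core-schema-parser"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Free-text phenotype description (illustrative example).
text = "Gram-positive cocci, catalase positive, oxidase negative, no growth at 45°C."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
output_ids = model.generate(**inputs, max_new_tokens=512)

# The model is trained to emit the structured core schema as text.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```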

## Training Data

The model was fine-tuned on 8,700 curated microbiology phenotype examples, each consisting of:

…

Inputs and targets were constrained to schema-relevant content only.

The dataset was split 80/20 into training and validation subsets.
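
The 80/20 split could be reproduced along these lines (a sketch assuming the `datasets` library; the file name is a placeholder):

```python
from datasets import load_dataset

# Hypothetical file; each record holds a free-text description and its schema target.
dataset = load_dataset("json", data_files="phenotype_examples.jsonl", split="train")

# 80/20 train/validation split, seeded for reproducibility.
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
```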

## Training Procedure

- Epochs: 3
- Optimizer: AdamW (default Hugging Face Trainer)
- Learning rate: 1e-5
- Batching:
  - Per-device batch size: 1
  - Gradient accumulation: 8 (effective batch size = 8)
- Sequence lengths:
  - Max input length: 2048 tokens
  - Max output length: 2048 tokens
- Precision:
  - bf16 on supported hardware (A100), otherwise fp16
- Stability measures:
  - Gradient checkpointing enabled
  - Gradient clipping (max_grad_norm = 1.0)
  - Warmup ratio of 0.03

The model was trained using the Hugging Face Trainer API and saved after completion of all epochs.
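
A hedged sketch of how these settings map onto the Trainer API (the output directory is a placeholder, and `train_ds`/`val_ds` refer to the split sketch above, assumed already tokenized to the 2048-token input/output limits):

```python
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

args = Seq2SeqTrainingArguments(
    output_dir="core-schema-parser",      # placeholder
    num_train_epochs=3,
    learning_rate=1e-5,                   # optimizer is the Trainer default (AdamW)
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # effective batch size = 8
    gradient_checkpointing=True,
    max_grad_norm=1.0,                    # gradient clipping
    warmup_ratio=0.03,
    bf16=use_bf16,                        # bf16 on supported hardware (e.g. A100)
    fp16=torch.cuda.is_available() and not use_bf16,  # otherwise fp16
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,               # tokenized 80/20 splits from above
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model()                      # saved once after all epochs complete
```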

## Intended Use

This model is intended for:

…

It is not intended for uses such as:

- Autonomous decision-making
- Use without additional validation layers

## Integration Context

In production, the model is used as a fallback and recovery mechanism within a hybrid parsing pipeline:

- Rule-based parser (high precision)
- Extended parser (schema-aware)
- LLM parser (coverage and robustness)

Outputs are reconciled and validated downstream before being used for identification or explanation.
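
Conceptually, the fallback logic resembles the following (purely illustrative; `parse_rules`, `parse_extended`, `parse_llm`, `CORE_SCHEMA_FIELDS`, and `validate` are hypothetical stand-ins for the real pipeline components):

```python
def parse_phenotype(text: str) -> dict:
    """Hybrid parse: deterministic parsers first, LLM only as fallback."""
    record = parse_rules(text)                 # high-precision, rule-based
    for field, value in parse_extended(text).items():
        record.setdefault(field, value)        # schema-aware extended parser

    missing = [f for f in CORE_SCHEMA_FIELDS if f not in record]
    if missing:
        llm_record = parse_llm(text)           # LLM recovers missed fields only
        for field in missing:
            if field in llm_record:
                record[field] = llm_record[field]

    return validate(record)                    # downstream reconciliation/validation
```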

## Limitations

Performance depends on coverage of the training schema, and the model cannot generalize beyond it.

The model may hallucinate field values if used outside its intended constrained setting.

It is sensitive to extreme deviations in input style or unsupported terminology.

## Ethical and Safety Considerations

The model does not provide medical advice or diagnoses.

Outputs should always be reviewed in conjunction with deterministic logic and domain expertise.

Training data was curated to minimize leakage and unintended inference.

## Author

Developed and fine-tuned by Zain Asad as part of the BactAI-D project.