EphAsad committed commit 813b2fe (verified · parent: 4e4281b)

Update README.md

Files changed (1): README.md (+127 −3)
---
license: apache-2.0
base_model:
- google/flan-t5-base
pipeline_tag: feature-extraction
library_name: transformers
tags:
- biology
language:
- en
---
# Model Card: Core Schema Parsing LLM (Microbiology)

## Model Overview

This model is a domain-adapted sequence-to-sequence language model that parses free-text microbiology phenotype descriptions into a structured core schema of laboratory test results and traits.

The model is intended to augment deterministic rule-based and extended parsers by recovering fields that may be missed due to complex phrasing, implicit descriptions, or uncommon linguistic constructions. It is not designed to operate as a standalone classifier or diagnostic system.
## Base Model

- Base architecture: `google/flan-t5-base`
- Model type: encoder–decoder (Seq2Seq), instruction-tuned

The FLAN-T5 base model was selected for its strong instruction-following behaviour, stability during fine-tuning, and suitability for structured text generation tasks on limited hardware.
## Training Data

The model was fine-tuned on 8,700 curated microbiology phenotype examples, each consisting of:

- A free-text phenotype description
- A deterministic target serialization of core schema fields and values

Data preprocessing:

- The `name` field and all non-core schema fields were explicitly removed to prevent label leakage.
- Target outputs were serialized deterministically using sorted schema keys (`Field: Value` format).
- Inputs and targets were constrained to schema-relevant content only.

The dataset was split 80/20 into training and validation subsets.
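The deterministic serialization described above can be sketched as follows. The field names in the example are hypothetical; the actual keys come from the project's core schema.

```python
def serialize_core_schema(record: dict) -> str:
    """Serialize a parsed phenotype record into the `Field: Value` target
    format, with keys sorted alphabetically so that identical records
    always produce identical training targets."""
    return "\n".join(f"{key}: {record[key]}" for key in sorted(record))

# Hypothetical record with illustrative field names (not the real schema):
record = {"Gram Stain": "negative", "Catalase": "positive", "Motility": "motile"}
print(serialize_core_schema(record))
# Catalase: positive
# Gram Stain: negative
# Motility: motile
```

Sorting the keys is what makes the target deterministic: the model never has to learn an arbitrary field order.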
## Training Procedure

- Epochs: 3
- Optimizer: AdamW (Hugging Face `Trainer` default)
- Learning rate: 1e-5
- Batching:
  - Per-device batch size: 1
  - Gradient accumulation: 8 (effective batch size = 8)
- Sequence lengths:
  - Max input length: 2048 tokens
  - Max output length: 2048 tokens
- Precision: bf16 on supported hardware (A100), otherwise fp16
- Stability measures:
  - Gradient checkpointing enabled
  - Gradient clipping (`max_grad_norm = 1.0`)
  - Warmup ratio of 0.03

The model was trained using the Hugging Face `Trainer` API and saved after completion of all epochs.
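The hyperparameters above roughly correspond to the following `Seq2SeqTrainingArguments` sketch. This is a reconstruction, not the project's actual training script; `output_dir` and the bf16 capability check are assumptions.

```python
import torch
from transformers import Seq2SeqTrainingArguments

# bf16 where the hardware supports it (e.g. A100), otherwise fp16 on GPU.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-core-schema",    # placeholder path
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,       # effective batch size = 8
    gradient_checkpointing=True,
    max_grad_norm=1.0,                   # gradient clipping
    warmup_ratio=0.03,
    bf16=use_bf16,
    fp16=torch.cuda.is_available() and not use_bf16,
)
```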
## Intended Use

This model is intended for:

- Structured parsing of microbiology phenotype text into predefined schema fields
- Use as a third-stage parser alongside rule-based and extended parsers
- Supporting downstream deterministic scoring, ranking, and retrieval systems

It is not intended for:

- Standalone clinical diagnosis
- Autonomous decision-making
- Use without additional validation layers
## Integration Context

In production, the model acts as a fallback and recovery mechanism within a hybrid parsing pipeline:

1. Rule-based parser (high precision)
2. Extended parser (schema-aware)
3. LLM parser (coverage and robustness)

Outputs are reconciled and validated downstream before being used for identification or explanation.
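A minimal sketch of that downstream reconciliation step, assuming the `Field: Value` output format and a known set of core schema keys. The schema keys and the precedence rule here are illustrative assumptions, not the project's actual implementation.

```python
# Hypothetical core schema keys; the real set comes from the project schema.
CORE_SCHEMA = {"Gram Stain", "Catalase", "Motility", "Oxidase"}

def parse_llm_output(text: str) -> dict:
    """Parse `Field: Value` lines, discarding any field not in the core
    schema so hallucinated keys never reach downstream scoring."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in CORE_SCHEMA and value.strip():
            fields[key.strip()] = value.strip()
    return fields

def reconcile(rule_based: dict, llm: dict) -> dict:
    """Merge parser outputs: the high-precision rule-based parser wins on
    conflicts; the LLM parser only contributes fields it alone recovered."""
    return {**llm, **rule_based}

llm_fields = parse_llm_output("Catalase: positive\nMotility: motile\nMade-Up Field: x")
merged = reconcile({"Gram Stain": "negative"}, llm_fields)
# merged == {"Catalase": "positive", "Motility": "motile", "Gram Stain": "negative"}
```

Restricting the merge to known schema keys is one concrete way to enforce the "LLM as fallback, deterministic logic first" design described above.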
## Limitations

- Performance depends on coverage of the training schema; the model cannot generalize beyond it.
- The model may hallucinate field values if used outside its intended constrained pipeline.
- It is sensitive to extreme deviations in input style or unsupported terminology.
## Ethical and Safety Considerations

- The model does not provide medical advice or diagnoses.
- Outputs should always be reviewed in conjunction with deterministic logic and domain expertise.
- Training data was curated to minimize leakage and unintended inference.
## Author

Developed and fine-tuned by Zain Asad as part of the BactAI-D project.