EphAsad committed 9da0513 (verified) · Parent(s): 813b2fe · Update README.md
language:
- en
---
# Model Card: Core Schema Parsing LLM (Microbiology)

## Model Overview

This model is a domain-adapted sequence-to-sequence language model designed to parse free-text microbiology phenotype descriptions into a structured core schema of laboratory test results and traits.

The model is intended to augment deterministic rule-based and extended parsers by recovering fields that may be missed due to complex phrasing, implicit descriptions, or uncommon linguistic constructions. It is not designed to operate as a standalone classifier or diagnostic system.
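To make the parsing task concrete, here is a minimal post-processing sketch. The card does not specify the model's output format, so the `field=value; field=value` serialization and the field names below are illustrative assumptions only:

```python
# Sketch of post-processing generated text into a structured record.
# The "field=value; field=value" format and the field names are assumptions
# for illustration; the card does not specify the actual core schema.

def parse_core_schema(generated: str) -> dict:
    """Split a generated 'field=value; field=value' string into a dict."""
    record = {}
    for pair in generated.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip malformed fragments rather than guessing a value
        field, value = pair.split("=", 1)
        record[field.strip()] = value.strip()
    return record

example_output = "gram_stain=negative; catalase=positive; oxidase=negative"
print(parse_core_schema(example_output))
# {'gram_stain': 'negative', 'catalase': 'positive', 'oxidase': 'negative'}
```

In the production pipeline, a validation layer of this kind would sit between the model and any downstream consumer, discarding fields it cannot interpret.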
## Base Model

Base architecture: google/flan-t5-base

Model type: Encoder–decoder (Seq2Seq), instruction-tuned

The FLAN-T5 base model was selected for its strong instruction-following behaviour, stability during fine-tuning, and suitability for structured text generation tasks on limited hardware.
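Because the base model is instruction-tuned, inputs are typically prefixed with a task instruction. The wording below is a hypothetical placeholder; the actual prompt used during fine-tuning is not specified in this card:

```python
# Sketch of instruction-style prompt construction for the seq2seq model.
# The instruction text is a hypothetical placeholder, not the trained prompt.

INSTRUCTION = (
    "Extract the core schema of laboratory test results and traits "
    "from the following microbiology phenotype description:"
)

def build_prompt(description: str) -> str:
    """Prefix the free-text description with the task instruction."""
    return f"{INSTRUCTION}\n\n{description.strip()}"

prompt = build_prompt("Colonies are catalase positive and oxidase negative.")
```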
## Training Data

The model was fine-tuned on 8,700 curated microbiology phenotype examples, each consisting of:

Inputs and targets were constrained to schema-relevant content only.

The dataset was split 80/20 into training and validation subsets.
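The 80/20 split can be sketched as follows; the shuffling and seed value are assumptions, as the card does not describe how the split was performed:

```python
# Sketch of the 80/20 train/validation split described above.
# Shuffling with a fixed seed is an assumption for reproducibility.
import random

def split_dataset(examples: list, train_frac: float = 0.8, seed: int = 42):
    """Shuffle and split examples into train/validation subsets."""
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, valid = split_dataset(list(range(8700)))
print(len(train), len(valid))  # 6960 1740
```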
## Training Procedure

- Epochs: 3
- Optimizer: AdamW (Hugging Face Trainer default)
- Learning rate: 1e-5
- Batching:
  - Per-device batch size: 1
  - Gradient accumulation: 8 (effective batch size = 8)
- Sequence lengths:
  - Max input length: 2048 tokens
  - Max output length: 2048 tokens
- Precision: bf16 on supported hardware (A100), otherwise fp16
- Stability measures:
  - Gradient checkpointing enabled
  - Gradient clipping (max_grad_norm = 1.0)
  - Warmup ratio: 0.03

The model was trained using the Hugging Face Trainer API and saved after completion of all epochs.
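The hyperparameters above can be collected as a plain configuration mapping. Key names mirror common Hugging Face `Seq2SeqTrainingArguments` fields, but this is an illustrative sketch, not the exact training script:

```python
# Hyperparameters from the list above, as a plain config mapping.
# Key names echo Seq2SeqTrainingArguments fields; this is a sketch only.
training_config = {
    "num_train_epochs": 3,
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "max_input_length": 2048,
    "max_output_length": 2048,
    "gradient_checkpointing": True,
    "max_grad_norm": 1.0,
    "warmup_ratio": 0.03,
    "bf16": True,  # on supported hardware (e.g. A100); otherwise fp16
}

# Effective batch size = per-device batch size * accumulation steps.
effective_batch = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
)
print(effective_batch)  # 8
```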
## Intended Use

This model is intended for:
It is not intended for:

Autonomous decision-making

Use without additional validation layers
## Integration Context

In production, the model is used as a fallback and recovery mechanism within a hybrid parsing pipeline:

- Rule-based parser (high precision)
- Extended parser (schema-aware)
- LLM parser (coverage and robustness)

Outputs are reconciled and validated downstream before being used for identification or explanation.
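A minimal sketch of that reconciliation step, assuming dict-based field records: values from the high-precision rule-based parser take precedence, the extended parser fills remaining fields, and the LLM parser only supplies what the deterministic parsers missed. The function and field names are illustrative, not the production implementation:

```python
# Sketch of parser-output reconciliation with precedence:
# rule-based > extended > LLM. Dict-based records and field names are
# illustrative assumptions, not the production implementation.

def reconcile(rule_based: dict, extended: dict, llm: dict) -> dict:
    """Merge parser outputs, letting higher-precision parsers win."""
    merged = dict(llm)        # lowest precedence first
    merged.update(extended)   # overridden by the schema-aware parser
    merged.update(rule_based) # high-precision rule-based parser always wins
    return merged

result = reconcile(
    rule_based={"catalase": "positive"},
    extended={"oxidase": "negative"},
    llm={"catalase": "negative", "motility": "motile"},
)
print(result)
# {'catalase': 'positive', 'motility': 'motile', 'oxidase': 'negative'}
```

Note that the LLM's conflicting `catalase` value is discarded in favour of the rule-based parser's, matching the model's role as a recovery mechanism rather than an authority.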
## Limitations

Performance depends on coverage of the training schema; the model cannot generalize beyond it.

The model may hallucinate field values if used outside its intended constrained context.

It is sensitive to extreme deviations in input style or unsupported terminology.
## Ethical and Safety Considerations

The model does not provide medical advice or diagnoses.

Outputs should always be reviewed in conjunction with deterministic logic and domain expertise.

Training data was curated to minimize leakage and unintended inference.

## Author

Developed and fine-tuned by Zain Asad as part of the BactAI-D project.