praveenramesh committed on
Commit 517c04f · verified · 1 Parent(s): 9723d7e

Update README.md

Files changed (1): README.md (+106 -25)
README.md CHANGED
@@ -14,23 +14,120 @@ licence: license
  This model is a fine-tuned version of [google/medgemma-4b-it](https://huggingface.co/google/medgemma-4b-it).
  It has been trained using [TRL](https://github.com/huggingface/trl).

  ## Quick start

  ```python
  from transformers import pipeline

- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="praveenramesh/clinical_sdtm", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
- ```

- ## Training procedure

-
- This model was trained with SFT.

  ### Framework versions
 
@@ -38,20 +135,4 @@ This model was trained with SFT.
  - Transformers: 5.2.0
  - Pytorch: 2.10.0
  - Datasets: 4.5.0
- - Tokenizers: 0.22.2
-
- ## Citations
-
- Cite TRL as:
-
- ```bibtex
- @software{vonwerra2020trl,
-   title = {{TRL: Transformers Reinforcement Learning}},
-   author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
-   license = {Apache-2.0},
-   url = {https://github.com/huggingface/trl},
-   year = {2020}
- }
- ```
 
  This model is a fine-tuned version of [google/medgemma-4b-it](https://huggingface.co/google/medgemma-4b-it).
  It has been trained using [TRL](https://github.com/huggingface/trl).

+ ClinicalIntelligence/saama_gemma is a fine-tuned MedGemma model designed to transform unstructured clinical narratives, such as discharge notes, into structured, SDTM-aligned datasets (e.g., Adverse Events, Medical History, Procedures). Trained on an SME-curated dataset derived from MIMIC-III, the model treats clinical data extraction as a complex reasoning task, explicitly evaluating assertion, temporality, and causality to generate accurate, traceable JSON outputs. By learning regulatory semantics directly, it significantly outperforms base models in domain grounding and schema consistency. Users should note current limitations: context-window constraints for lengthy notes, handling of rare abbreviations, and resolution of multi-domain entities.
+
  ## Quick start

  ```python
+ import re
+
  from transformers import pipeline


+ def extract_entities(text):
+     """
+     Extracts entities, domains, and justifications from the given text
+     and returns them as a list of dictionaries.
+     """
+     # The regex pattern looks for:
+     # 1. Content inside <think> and </think> tags
+     # 2. A domain code prefixed with ~ (ignoring whitespace/special characters such as non-breaking spaces)
+     # 3. The extracted entity prefixed with ~~ (up to the next <think> tag or the end of the string)
+     pattern = r"(?s)<think>\s*(.*?)\s*</think>\s*~([A-Z0-9]+)\s*~~(.*?)(?=<think>|$)"
+
+     # Find all matches in the text
+     matches = re.findall(pattern, text)
+
+     extracted_data = []
+     for justification, domain, entity in matches:
+         extracted_data.append(
+             {
+                 "domain": domain.strip(),
+                 "extracted_entity": entity.strip(),
+                 "justification": justification.strip(),
+             }
+         )
+
+     return extracted_data
+
+
+ prefix = """Extract SDTM domain entities from: """
+
+ unstructured_text = """
+ # Dyspnea
+ # Community-acquired pneumonia:
+ Patient presenting with chest tightness and shortness of breath.
+ Labs notable for WBC 25 with neutrophilic predominance. CTA
+ negative for pulmonary embolism, but did demonstrate multifocal
+ GGOs concerning for possible pneumonia, suspected viral based on
+ imaging findings. Flu negative. Treated empirically for
+ community-acquired bacterial pneumonia, with CTX/doxy ->
+ Levaquin. Symptoms were mostly resolved at time of discharge.
+
+ # Initial concern for possibility of TB
+ The ED physician who saw her initially had started a TB rule
+ out. However, on closer history by medicine, she had essentially
+ nothing to support that diagnosis, other than being from an
+ endemic area. The patient has no known exposures to friends or
+ family members with active pulmonary TB. She has not lost any
+ weight. She denies any cough or other respiratory symptoms prior
+ to development of her current - likely viral - pneumonia. CT
+ chest was not felt by myself or the radiologist to be at all
+ suspicious for TB. Thus, after discussing the case with
+ infection control, the TB rule out was called off.
+
+ # Palpitations
+ # Paroxysmal narrow-complex tachycardia:
+ Patient has a history of occasional mild palpitations. On
+ arrival to our ED, she was given adenosine, which did not break
+ the rhythm (although no strip was obtained to allow me to review
+ what happened with adenosine administration).
+ On the floor, she was monitored on telemetry and was noted to
+ occasionally speed up from ___ to ~140. The change in rate was
+ rapid, but generally represented the HR speeding up
+ progressively over a few seconds, not an immediate change in
+ both rate and rhythm from one beat to the next. The p was not
+ overtly distinguishable from the sinus p by morphology, although
+ the PR interval shortened greatly from 140 ms to 89 ms. ___
+ tachycardia did not correlate with position or exertion; I got
+ her out of bed and had her walk around without significant rise
+ in HR on tele at that time. Most likely atrial tachycardia, but
+ ___ is considered also. In either case, these are
+ benign as long as she remains predominantly asymptomatic as she
+ is now.
+ Initial EKG noted to have diffuse repolarization changes that
+ resolved with improved rates; trop<0.01x2.
+ """
+
+ generator = pipeline(
+     "text-generation", model="ClinicalIntelligence/saama_gemma", device="cuda"
+ )
+ output = generator(
+     [{"role": "user", "content": prefix + unstructured_text}],
+     max_new_tokens=16000,
+     return_full_text=False,
+ )[0]
+ llm_output = output["generated_text"]
+ extracted_entities_list = extract_entities(llm_output)
+
+ for extracted_entity in extracted_entities_list:
+     print(extracted_entity)
+ ```
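
The parser relies on the output convention described above: a `<think>…</think>` justification, then `~DOMAIN`, then `~~entity`. The following is a minimal, self-contained illustration of that parsing logic; the sample string is fabricated for demonstration and is not real model output:

```python
import re

# Fabricated sample mimicking the model's output convention:
# a <think>…</think> justification, then ~DOMAIN, then ~~entity.
sample_output = (
    "<think>Chest tightness is reported as a present symptom.</think>"
    "~AE~~chest tightness"
    "<think>Ceftriaxone was administered empirically for pneumonia.</think>"
    "~CM~~ceftriaxone"
)

# Same pattern as in the quick-start snippet.
pattern = r"(?s)<think>\s*(.*?)\s*</think>\s*~([A-Z0-9]+)\s*~~(.*?)(?=<think>|$)"

records = [
    {
        "domain": domain.strip(),
        "extracted_entity": entity.strip(),
        "justification": justification.strip(),
    }
    for justification, domain, entity in re.findall(pattern, sample_output)
]

for record in records:
    print(record)
# Each record carries the SDTM domain code (e.g., AE, CM), the extracted
# entity text, and the model's justification for the extraction.
```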
+
+ ## Training procedure
+
+ * Training data size: 6,500 samples (grouped by uid and formatted into complete user/assistant conversational threads)
+ * Number of epochs: 5 (the model processed the entire dataset 5 times, totaling approximately 4,065 optimization steps)
+ * Effective batch size: 8 (per-device batch size of 1 combined with 8 gradient accumulation steps)
+ * LoRA rank ($r$ / adapter size): 16 (balances capturing complex, domain-specific logic against keeping the adapter lightweight)
+ * LoRA alpha: 32
+ * LoRA scaling factor: 2.0 (calculated as alpha / rank, providing a strong fine-tuning signal to enforce strict extraction formatting)
+ * Targeted layers: all linear layers (target_modules="all-linear"); adapters were applied to attention modules as well as MLP blocks to maximize instructional compliance and mimic full fine-tuning
+ * Maximum sequence length: 12,000 tokens (sufficient to handle extensive hospital course notes)
+ * Learning rate: 2e-4
+ * Precision: bfloat16, with Flash Attention 2 and gradient checkpointing enabled for memory efficiency
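
The derived quantities in the training list follow arithmetically from the stated settings; a quick sanity check (assuming the final partial batch of each epoch counts as one optimization step):

```python
import math

# Stated hyperparameters from the training-procedure list.
num_samples = 6500
num_epochs = 5
per_device_batch_size = 1
gradient_accumulation_steps = 8
lora_rank = 16
lora_alpha = 32

# Effective batch size = per-device batch size x gradient accumulation steps.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 8

# LoRA scaling factor = alpha / rank.
scaling_factor = lora_alpha / lora_rank
print(scaling_factor)  # 2.0

# Optimization steps = ceil(samples / effective batch size) x epochs.
steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 4065
```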
 
  ### Framework versions

  - Transformers: 5.2.0
  - Pytorch: 2.10.0
  - Datasets: 4.5.0
+ - Tokenizers: 0.22.2