This model is a fine-tuned version of [google/medgemma-4b-it](https://huggingface.co/google/medgemma-4b-it).
It has been trained using [TRL](https://github.com/huggingface/trl).

ClinicalIntelligence/saama_gemma is a fine-tuned MedGemma model designed to transform unstructured clinical narratives, such as discharge notes, into structured, SDTM-aligned datasets (e.g., Adverse Events, Medical History, Procedures). Trained on an SME-curated dataset derived from MIMIC-III, the model treats clinical data extraction as a complex reasoning task, explicitly evaluating assertion, temporality, and causality to generate accurate, traceable JSON outputs. By learning regulatory semantics directly, it significantly outperforms base models in domain grounding and schema consistency. Users should note current limitations regarding context window constraints for lengthy notes, rare abbreviation handling, and the resolution of multi-domain entities.
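For illustration, a single extracted record might look like the following. This is a hypothetical example, not actual model output; the field names mirror the parsing code in the Quick start below, and the domain code is a standard SDTM abbreviation:

```python
import json

# Hypothetical extracted record (illustrative only, not real model output).
record = {
    "domain": "AE",  # SDTM Adverse Events domain
    "extracted_entity": "Community-acquired pneumonia",
    "justification": (
        "Asserted diagnosis treated during the admission; "
        "onset during the current encounter supports AE rather than MH."
    ),
}
print(json.dumps(record, indent=2))
```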

## Quick start

```python
import re

from transformers import pipeline


def extract_entities(text):
    """
    Extracts entities, domains, and justifications from the given text
    and returns them as a list of dictionaries.
    """
    # The regex pattern looks for:
    # 1. Content inside <think> and </think> tags
    # 2. Domain prefixed with ~ (ignoring whitespace/special characters like non-breaking spaces)
    # 3. Extracted entity prefixed with ~~ (up to the next <think> tag or end of string)
    pattern = r"(?s)<think>\s*(.*?)\s*</think>\s*~([A-Z0-9]+)\s*~~(.*?)(?=<think>|$)"

    # Find all matches in the text
    matches = re.findall(pattern, text)

    extracted_data = []
    for justification, domain, entity in matches:
        extracted_data.append(
            {
                "domain": domain.strip(),
                "extracted_entity": entity.strip(),
                "justification": justification.strip(),
            }
        )

    return extracted_data


prefix = """Extract SDTM domain entities from: """

unstructured_text = """
# Dyspnea
# Community-acquired pneumonia:
Patient presenting with chest tightness and shortness of breath.
Labs notable for WBC 25 with neutrophilic predominance. CTA
negative for pulmonary embolism, but did demonstrate multifocal
GGOs concerning for possible pneumonia, suspected viral based on
imaging findings. Flu negative. Treated empirically for
community-acquired bacterial pneumonia, with CTX/doxy ->
Levaquin. Symptoms were mostly resolved at time of discharge.

# Initial concern for possibility of TB
The ED physician who saw her initially had started a TB rule
out. However, on closer history by medicine, she had essentially
nothing to support that diagnosis, other than being from an
endemic area. The patient has no known exposures to friends or
family members with active pulmonary TB. She has not lost any
weight. She denies any cough or other respiratory symptoms prior
to development of her current - likely viral - pneumonia. CT
chest was not felt by myself or the radiologist to be at all
suspicious for TB. Thus, after discussing the case with
infection control, the TB rule out was called off.

# Palpitations
# Paroxysmal narrow-complex tachycardia:
Patient has a history of occasional mild palpitations. On
arrival to our ED, she was given adenosine, which did not break
the rhythm (although no strip was obtained to allow me to review
what happened with adenosine administration).
On the floor, she was monitored on telemetry and was noted to
occasionally speed up from ___ to ~140. The change in rate was
rapid, but generally represented the HR speeding up
progressively over a few seconds, not an immediate change in
both rate and rhythm from one beat to the next. The p was not
overtly distinguishable from the sinus p by morphology, although
the PR interval shortened greatly from 140 ms to 89 ms. ___
tachycardia did not correlate with position or exertion; I got
her out of bed and had her walk around without significant rise
in HR on tele at that time. Most likely atrial tachycardia, but
___ is considered also. In either case, these are
benign as long as she remains predominantly asymptomatic as she
is now.
Initial EKG noted to have diffuse repolarization changes that
resolved with improved rates; trop<0.01x2.
"""

generator = pipeline(
    "text-generation", model="ClinicalIntelligence/saama_gemma", device="cuda"
)
output = generator(
    [{"role": "user", "content": prefix + unstructured_text}],
    max_new_tokens=16000,
    return_full_text=False,
)[0]
llm_output = output["generated_text"]
extracted_entities_list = extract_entities(llm_output)

for extracted_entity in extracted_entities_list:
    print(extracted_entity)
```
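As a quick sanity check, `extract_entities` can be exercised on a hand-written string in the tagged format the parser assumes (`<think>justification</think>~DOMAIN~~entity`). The sample string below is illustrative only, not actual model output:

```python
import re


def extract_entities(text):
    # Same parser as in the Quick start: justification inside <think> tags,
    # domain after ~, entity after ~~ (up to the next <think> or end of string).
    pattern = r"(?s)<think>\s*(.*?)\s*</think>\s*~([A-Z0-9]+)\s*~~(.*?)(?=<think>|$)"
    return [
        {
            "domain": domain.strip(),
            "extracted_entity": entity.strip(),
            "justification": justification.strip(),
        }
        for justification, domain, entity in re.findall(pattern, text)
    ]


# Hand-written sample in the tagged output format (not real model output).
sample = (
    "<think>Reported as a presenting symptom; asserted, not negated.</think>"
    "~AE~~Dyspnea"
    "<think>Empiric antibiotic given during the admission.</think>"
    "~CM~~Levaquin"
)

records = extract_entities(sample)
for record in records:
    print(record)
```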

## Training procedure

* Training Data Size: 6,500 samples (grouped by uid and formatted into complete user/assistant conversational threads)
* Number of Epochs: 5 (the model processed the entire dataset 5 times, totaling approximately 4,065 optimization steps)
* Effective Batch Size: 8 (per-device batch size of 1 combined with 8 gradient accumulation steps)
* LoRA Rank ($r$ / Adapter Size): 16 (provides a balance between capturing complex, domain-specific logic and maintaining a lightweight adapter)
* LoRA Alpha: 32
* LoRA Scaling Factor: 2.0 (calculated as Alpha / Rank, providing a strong fine-tuning signal to enforce strict extraction formatting)
* Targeted Layers: all linear layers (`target_modules="all-linear"`); adapters were applied to attention modules as well as MLP blocks to maximize instructional compliance and mimic full fine-tuning
* Maximum Sequence Length: 12,000 tokens (sufficient to handle extensive hospital course notes)
* Learning Rate: 2e-4
* Precision: bfloat16, with Flash Attention 2 and gradient checkpointing enabled for memory efficiency
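The reported step count and LoRA scaling factor follow directly from the listed hyperparameters. A quick arithmetic check (assuming the final partial batch of each epoch counts as a full optimization step):

```python
import math

# Hyperparameters as reported in this card
num_samples = 6500
per_device_batch = 1
grad_accum = 8
epochs = 5
lora_rank = 16
lora_alpha = 32

effective_batch = per_device_batch * grad_accum             # 1 * 8 = 8
steps_per_epoch = math.ceil(num_samples / effective_batch)  # ceil(812.5) = 813
total_steps = steps_per_epoch * epochs                      # 813 * 5 = 4065
scaling = lora_alpha / lora_rank                            # 32 / 16 = 2.0

print(effective_batch, total_steps, scaling)
```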
### Framework versions

- Transformers: 5.2.0
- Pytorch: 2.10.0
- Datasets: 4.5.0
- Tokenizers: 0.22.2