row56 committed
Commit 265e2ec · verified · 1 Parent(s): 242206b

Update README.md

Files changed (1): README.md +100 -23
README.md CHANGED
@@ -119,47 +119,124 @@ ProtoPatient/
  ### 1. Install Dependencies

  ```bash
- pip install transformers torch
  ```

  ### 2. Load the Model via Hugging Face

  ```python
- from transformers import AutoTokenizer, AutoModel
-
- repo_id = "row56/ProtoPatient"
- tokenizer = AutoTokenizer.from_pretrained(repo_id)
- model = AutoModel.from_pretrained(repo_id)
  model.eval()
-
- sample_text = "This patient presents with severe headaches and nausea..."
- inputs = tokenizer(sample_text, return_tensors="pt")
- outputs = model(**inputs)
- print("Output shape:", outputs.last_hidden_state.shape)
  ```

- ## 3. Interpreting Outputs

- For a full prototypical classification workflow, use the custom modules in `proto_model/` (e.g., `ProtoForMultiLabelClassification`) to inspect:
- - Which tokens receive high attention for each diagnosis.
- - Which prototypical patients are retrieved as similar examples.

- Using the standard `AutoModel` returns raw embeddings; the custom code is required for full label-wise attention and prototype retrieval.

- ---

- ## 4. (Optional) Hugging Face Pipelines

- Integrate the model into a pipeline for feature extraction:

  ```python
- from transformers import pipeline

- extractor = pipeline("feature-extraction", model=repo_id, tokenizer=repo_id)
- embeddings = extractor("Severe headaches and vomiting...")
- print(len(embeddings), len(embeddings[0]))  # Token-level feature vectors
  ```

  # Intended Use, Limitations & Ethical Considerations

  ## Intended Use
 
  ### 1. Install Dependencies

  ```bash
+ git clone https://huggingface.co/row56/ProtoPatient
+ cd ProtoPatient
+ pip install -e . transformers torch safetensors
+ export TOKENIZERS_PARALLELISM=false
  ```

  ### 2. Load the Model via Hugging Face

  ```python
+ import torch
+ from transformers import AutoTokenizer
+ from proto_model.configuration_proto import ProtoConfig
+ from proto_model.modeling_proto import ProtoForMultiLabelClassification
+
+ # Load & configure
+ cfg = ProtoConfig.from_pretrained("row56/ProtoPatient")
+ cfg.pretrained_model_name_or_path = "bert-base-uncased"
+ cfg.use_cuda = False
+
+ tokenizer = AutoTokenizer.from_pretrained(cfg.pretrained_model_name_or_path)
+ model = ProtoForMultiLabelClassification.from_pretrained(
+     "row56/ProtoPatient",
+     config=cfg,
+     ignore_mismatched_sizes=True
+ )
  model.eval()
+ model.cpu()
+
+ # Helper
+ def get_proto_logits(texts):
+     enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
+     batch = {
+         "input_ids": enc["input_ids"],
+         "attention_masks": enc["attention_mask"],
+         "token_type_ids": enc.get("token_type_ids", torch.zeros_like(enc["input_ids"])),
+         "tokens": [tokenizer.convert_ids_to_tokens(ids.tolist()) for ids in enc["input_ids"]]
+     }
+     with torch.no_grad():
+         logits, _ = model.proto_module(batch)
+     return logits
+
+ # Run
+ texts = [
+     "Patient shows elevated heart rate and low oxygen saturation.",
+     "No significant findings; patient is healthy."
+ ]
+ logits = get_proto_logits(texts)
+ print("Logits shape:", logits.shape)
+ print("Logits:\n", logits)
+
+ probs = torch.sigmoid(logits)
+ print("Probabilities:\n", probs)
  ```
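Because this is a multi-label classifier, each sigmoid probability above is an independent per-label score rather than a distribution over labels. A minimal, self-contained sketch of turning such probabilities into predicted label sets, assuming a 0.5 decision threshold (the threshold and the dummy logits below are illustrative, not fixed by the repo):

```python
import torch

# Dummy logits standing in for the model output above:
# 2 documents x 3 labels (hypothetical label count).
logits = torch.tensor([[2.0, -1.0, 0.3],
                       [-2.5, 0.8, -0.1]])

probs = torch.sigmoid(logits)  # per-label probabilities in (0, 1)
preds = probs > 0.5            # independent threshold per label

# Collect the indices of predicted labels for each document.
pred_label_ids = [row.nonzero().flatten().tolist() for row in preds]
print(pred_label_ids)  # → [[0, 2], [1]]
```

With a real checkpoint, the threshold would typically be tuned per label on a validation split rather than fixed at 0.5.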
 
+ ## 3. Training Data & Licenses

+ This model was trained on the MIMIC-III Clinical Database (v1.4), a large de-identified ICU dataset released under a data use agreement.

+ To obtain MIMIC-III:

+ 1. Visit https://physionet.org/content/mimiciii/1.4/
+ 2. Register for a free PhysioNet account and complete the CITI “Data or Specimens Only Research” training.
+ 3. Sign the MIMIC-III Data Use Agreement (DUA).
+ 4. Download the raw notes and run the preprocessing scripts from the paper’s repository.

+ Note: We do not redistribute MIMIC-III itself; users must obtain it directly under its license.
+
+ ## 4. Load Precomputed Training Data for Prototype Retrieval

+ After you have MIMIC-III and have applied the published preprocessing, you should produce:

+ - `data/train_embeds.npy` — NumPy array of shape (N, d) with per-example, per-class embeddings.
+ - `data/train_texts.json` — JSON array of length N of the raw admission-note strings.

+ Place those in `data/` and then:

  ```python
+ import numpy as np
+ import json

+ train_embeds = np.load("data/train_embeds.npy")  # shape (N, d)
+ with open("data/train_texts.json", "r") as f:
+     train_texts = json.load(f)  # list[str]
+
+ print(f"Loaded {train_embeds.shape[0]} embeddings of dim {train_embeds.shape[1]}")
  ```
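Since prototype retrieval indexes `train_texts` by nearest-neighbour positions in `train_embeds`, the two files must be aligned row-for-row. A small sanity-check sketch (the toy arrays below stand in for the loaded files):

```python
import numpy as np

# Toy stand-ins for the arrays loaded from data/ above.
train_embeds = np.zeros((4, 8), dtype=np.float32)       # (N, d)
train_texts = ["note a", "note b", "note c", "note d"]  # length N

# Row i of train_embeds must embed train_texts[i].
assert train_embeds.ndim == 2
assert train_embeds.shape[0] == len(train_texts), "embedding/text count mismatch"
print("aligned:", train_embeds.shape[0], "examples, dim", train_embeds.shape[1])
```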
 
+ ## 5. Interpreting Outputs & Retrieving Prototypes
+
+ ```python
+ from sklearn.neighbors import NearestNeighbors
+
+ text = "Patient has chest pain and shortness of breath."
+ enc = tokenizer([text], padding=True, truncation=True, return_tensors="pt")
+ batch = {
+     "input_ids": enc["input_ids"],
+     "attention_masks": enc["attention_mask"],
+     "token_type_ids": enc.get("token_type_ids", torch.zeros_like(enc["input_ids"])),
+     "tokens": [tokenizer.convert_ids_to_tokens(ids.tolist()) for ids in enc["input_ids"]]
+ }
+
+ with torch.no_grad():
+     logits, metadata = model.proto_module(batch)
+
+ attn_scores = metadata["attentions"][0]  # [num_labels, seq_len]
+ for label_id, scores in enumerate(attn_scores):
+     topk = sorted(zip(batch["tokens"][0], scores.tolist()),
+                   key=lambda x: -x[1])[:5]
+     print(f"Label {label_id} top tokens:", topk)
+
+ proto_vecs = model.proto_module.prototype_vectors.cpu().numpy()  # [num_labels, d]
+ nn = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(train_embeds)
+
+ for label_id, u_c in enumerate(proto_vecs):
+     dist, idx = nn.kneighbors(u_c.reshape(1, -1))
+     print(f"\nLabel {label_id} prototype (distance={dist[0][0]:.3f}):")
+     print(train_texts[idx[0][0]])
+ ```
+
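The `NearestNeighbors` lookup above carries no state beyond the training matrix, so if scikit-learn is unavailable the same one-nearest-neighbour retrieval can be done with broadcasting in plain NumPy (the toy arrays below stand in for `proto_vecs` and `train_embeds`):

```python
import numpy as np

# Toy stand-ins: 2 label prototypes and 5 training embeddings, dim 3.
proto_vecs = np.array([[0.0, 0.0, 1.0],
                       [1.0, 0.0, 0.0]])
train_embeds = np.array([[0.9, 0.1, 0.0],
                         [0.0, 0.1, 0.9],
                         [0.5, 0.5, 0.5],
                         [0.0, 1.0, 0.0],
                         [1.0, 0.0, 0.1]])

# Pairwise Euclidean distances via broadcasting: [num_labels, N]
dists = np.linalg.norm(proto_vecs[:, None, :] - train_embeds[None, :, :], axis=-1)
nearest = dists.argmin(axis=1)  # index of the closest training example per label
print(nearest)  # → [1 4]
```

`train_texts[nearest[label_id]]` then gives the same prototype text the sklearn version retrieves.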
240
  # Intended Use, Limitations & Ethical Considerations
241
 
242
  ## Intended Use