RamezCh committed on
Commit
d728ce3
verified
1 Parent(s): df9f8a6

Upload sproto model

Files changed (7)
  1. LICENSE +105 -0
  2. README.md +256 -0
  3. config.json +33 -0
  4. configuration_sproto.py +61 -0
  5. model.safetensors +3 -0
  6. modeling_sproto.py +77 -0
  7. overview.png +0 -0
LICENSE ADDED
@@ -0,0 +1,105 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity.

"You" shall mean an individual or Legal Entity exercising permissions
granted by this License.

"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.

"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on or derived from the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship.

"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner.

"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.

2. Grant of Copyright License.

Subject to the terms and conditions of this License, each Contributor
hereby grants to You a perpetual, worldwide, non-exclusive, no-charge,
royalty-free, irrevocable copyright license to reproduce, prepare
Derivative Works of, publicly display, publicly perform, sublicense,
and distribute the Work and such Derivative Works in Source or Object
form.

3. Grant of Patent License.

Subject to the terms and conditions of this License, each Contributor
hereby grants to You a perpetual, worldwide, non-exclusive, no-charge,
royalty-free, irrevocable patent license to make, have made, use,
offer to sell, sell, import, and otherwise transfer the Work.

4. Redistribution.

You may reproduce and distribute copies of the Work or Derivative
Works thereof in any medium, with or without modifications, and in
Source or Object form, provided that You meet the following conditions:

You must give any other recipients of the Work a copy of this License,
and You must cause any modified files to carry prominent notices
stating that You changed the files.

5. Submission of Contributions.

Unless You explicitly state otherwise, any Contribution intentionally
submitted for inclusion in the Work shall be under the terms of this
License.

6. Trademarks.

This License does not grant permission to use the trade names,
trademarks, service marks, or product names of the Licensor.

7. Disclaimer of Warranty.

The Work is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND.

8. Limitation of Liability.

In no event and under no legal theory shall any Contributor be liable
for damages arising from the use of the Work.

9. Accepting Warranty or Additional Liability.

While redistributing the Work, You may choose to offer support or
warranty obligations, but You may not impose such obligations on
Contributors.

END OF TERMS AND CONDITIONS
README.md ADDED
@@ -0,0 +1,256 @@
---
language: en
license: apache-2.0
library_name: transformers

pipeline_tag: text-classification
task_categories:
- text-classification

model_type: sproto
base_model: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext

datasets:
- mimic-iv

metrics:
- auroc
- pr-auc

tags:
- text-classification
- multi-label-classification
- long-tail-learning
- medical
- clinical-nlp
- interpretability
- prototypical-networks
- ehr
---

# S-Proto: Sparse Prototypical Networks for Long-Tail Clinical Diagnosis Prediction

![S-Proto](overview.png)

This repository provides **S-Proto**, a sparse and interpretable prototypical network for extreme multi-label diagnosis prediction from clinical text. The model is designed to address the long-tail distribution of clinical diagnoses while preserving faithful, prototype-based explanations.

## Interactive Demo

You can explore the model's predictions and interpretability features through our interactive web demo:
**[https://s-proto.demo.datexis.com/](https://s-proto.demo.datexis.com/)**

S-Proto was introduced in the paper:

**[Boosting Long-Tail Data Classification with Sparse Prototypical Networks](https://ecmlpkdd-storage.s3.eu-central-1.amazonaws.com/preprints/2024/lncs14947/lncs14947435.pdf)**
Alexei Figueroa*, Jens-Michalis Papaioannou*, et al.
DATEXIS, Berliner Hochschule für Technik, Feinstein Institutes, TU Munich, Leibniz University Hannover
(* equal contribution)

## Overview

Clinical outcome prediction from Electronic Health Records is characterized by extreme label imbalance. A small number of diagnoses account for most patients, while the majority of diagnoses appear rarely. Standard transformer classifiers tend to perform well on frequent diagnoses but degrade sharply in the long tail.

S-Proto addresses this problem by extending prototypical networks with:

- Multiple prototypes per diagnosis
- Sparse winner-takes-all activation
- Prototype-level interpretability
- Efficient training despite increased representational capacity

The model achieves state-of-the-art performance on MIMIC-IV diagnosis prediction, with particularly strong gains in PR-AUC for rare diagnoses, and transfers successfully to unseen clinical datasets.

## Model Architecture

S-Proto builds on **PubMedBERT** as the text encoder and introduces a sparse prototypical layer on top.

For each diagnosis label, the model learns multiple sub-networks, each consisting of:

- A label-specific attention vector
- A prototype vector representing a prototypical patient

Given an input clinical note:

1. The note is encoded using PubMedBERT
2. Token embeddings are projected into a latent space
3. Each diagnosis activates multiple candidate sub-networks
4. A winner-takes-all mechanism selects the single most relevant sub-network per diagnosis
5. Only the winning prototype contributes to the prediction and receives gradient updates

This allows S-Proto to model heterogeneous disease phenotypes while remaining sparse and efficient.

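The winner-takes-all step can be sketched with toy tensors. This is an illustrative sketch, not the repository's actual implementation; all names and sizes below are assumptions (the released config uses 1643 labels, 5 prototypes per label, and a 256-dimensional reduced space):

```python
import torch

# Toy sizes; the released config uses num_classes=1643,
# num_prototypes_per_class=5, reduce_hidden_size=256.
num_classes, num_prototypes, dim = 4, 5, 8
torch.manual_seed(0)

z = torch.randn(dim)                              # latent embedding of one note (toy)
prototypes = torch.randn(num_classes, num_prototypes, dim)

# Euclidean distance from the note to every prototype of every label;
# subtraction broadcasts z over the (classes, prototypes) grid.
dists = (prototypes - z).norm(dim=-1)             # shape: (num_classes, num_prototypes)

# Winner-takes-all: only the closest prototype of each label scores
# (and, during training, would receive gradient updates).
min_dists, max_indices = dists.min(dim=-1)
logits = -min_dists                               # closer prototype -> higher score

print(logits.shape, max_indices.shape)            # torch.Size([4]) torch.Size([4])
```

Because the minimum is taken per label, each diagnosis keeps exactly one active sub-network per input, which is what keeps the layer sparse despite the extra prototypes.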
## Intended Use

This model is intended for:

- Clinical diagnosis prediction from admission notes
- Research on long-tail learning in healthcare NLP
- Interpretable clinical decision support systems
- Analysis of disease phenotypes via learned prototypes

This model is **not intended for direct clinical deployment** without external validation, auditing, and regulatory approval.

## Inference Example

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)
model = AutoModel.from_pretrained(
    "datexis/sproto",
    trust_remote_code=True
)
model.eval()

text_input = [
    "CHIEF COMPLAINT: Right Carotid Artery Stenosis. "
    "PRESENT ILLNESS: Ms. ___ is a ___ year old woman with hyperlipidemia, "
    "cirrhosis with esophageal varices, alcoholism, COPD, left eye blindness, "
    "and right carotid stenosis status post right carotid endarterectomy."
]

inputs = tokenizer(
    text_input,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

tokens = [tokenizer.convert_ids_to_tokens(ids) for ids in inputs["input_ids"]]

with torch.no_grad():
    output = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        token_type_ids=inputs.get("token_type_ids"),
        tokens=tokens
    )

logits = output["logits"]
max_indices = output["max_indices"]
metadata = output["metadata"]

print("Inference successful")
print("Logits shape:", logits.shape)
print("Max indices:", max_indices)
print("Metadata:", metadata)
```

## Outputs

The model returns a dictionary with the following entries:

- **logits**
  Prediction scores per diagnosis label.

- **max_indices**
  Index of the winning prototype sub-network per diagnosis, corresponding to the selected prototype.

- **metadata**
  Additional information useful for analysis and interpretability.

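Since the released `config.json` sets `use_sigmoid: false`, the returned logits are pre-sigmoid scores. A minimal post-processing sketch (toy logits and a hypothetical `k`; the real model emits 1643 scores) to turn them into per-label probabilities and top-k predicted labels:

```python
import torch

logits = torch.tensor([[2.1, -0.3, 0.7]])   # toy scores for 3 labels (real model: 1643)
probs = torch.sigmoid(logits)               # independent per-label probabilities

k = 2
top = torch.topk(probs, k=k, dim=-1)        # highest-probability labels first
print(top.indices.tolist())                 # [[0, 2]]
```

Because each label is scored independently (multi-label BCE), thresholding `probs` per label is equally valid where a fixed operating point is preferred over top-k.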
## Explainability

S-Proto provides built-in faithful explanations through its prototypical structure:

- Attention vectors highlight clinically relevant tokens
- Prototype distances reflect similarity to prototypical patients
- Multiple prototypes per diagnosis capture disease subtypes and cohorts
- Faithfulness metrics remain comparable to ProtoPatient despite higher capacity

Qualitative evaluation with medical professionals confirms that learned prototypes often correspond to clinically meaningful phenotypes.

## Training

First, clone the repository:

```bash
git clone https://github.com/DATEXIS/sproto.git
cd sproto
```

Set up the environment using Poetry:

```bash
poetry install
```

Activate the virtual environment:

```bash
poetry env activate
```

Once the environment is active, you can start training by running the `train` command with the desired arguments.

Example:

```bash
train \
  --batch_size 3 \
  --pretrained_model microsoft/biomednlp-pubmedbert-base-uncased-abstract-fulltext \
  --pretrained_model_path path_to_pretrained_model.ckpt \
  --model_type MULTI_PROTO \
  --train_file training_data.csv \
  --val_file validation_data.csv \
  --test_file test_data.csv \
  --save_dir ../experiments/ \
  --gpus 1 \
  --check_val_every_n_epoch 2 \
  --num_warmup_steps 0 \
  --num_training_steps 50 \
  --max_length 512 \
  --lr_features 0.000005 \
  --lr_prototypes 0.001 \
  --lr_others 0.001 \
  --num_val_samples None \
  --use_attention True \
  --reduce_hidden_size 256 \
  --all_labels_path all_labels.pcl \
  --seed 42 \
  --label_column labels \
  --metric_opt auroc_macro \
  --train_files [] \
  --val_files [] \
  --only_test True \
  --model_name 5p \
  --store_metadata False \
  --num_prototypes_per_class 5
```

## Citation

```bibtex
@inproceedings{figueroa2024sproto,
  title={Boosting Long-Tail Data Classification with Sparse Prototypical Networks},
  author={Figueroa, Alexei and Papaioannou, Jens-Michalis and Fallon, Conor and Bekiaridou, Alexandra and Bressem, Keno and Zanos, Stavros and Gers, Felix and Nejdl, Wolfgang and Löser, Alexander},
  booktitle={Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD)},
  year={2024}
}
```

## License

This model and its associated code are released under the Apache License 2.0.

The model was trained on the MIMIC-IV dataset, which is subject to restricted access. No training data is included or redistributed with this repository. The data were accessed under a data use agreement, and no patient-identifiable information is shared.

Use of this model must comply with all applicable data governance and ethical guidelines.

### Limitations

- Extremely rare diagnoses remain challenging
- Clinical dataset biases may be reflected in predictions
- Winner-takes-all selection is fixed and not learned dynamically
- Not validated for real-world clinical deployment

### Ethical Considerations

- The model processes sensitive clinical text
- Predictions should always be reviewed by qualified professionals
- Outputs should not be used as sole evidence for clinical decisions
- Care must be taken to avoid reinforcing existing healthcare biases
config.json ADDED
@@ -0,0 +1,33 @@
{
  "attention_vector_path": null,
  "auto_map": {
    "AutoConfig": "configuration_sproto.SprotoConfig",
    "AutoModel": "modeling_sproto.SprotoModel"
  },
  "batch_size": 21,
  "dot_product": false,
  "eval_buckets": null,
  "final_layer": false,
  "label_order_path": "/pvc/shared/continual/data/icd_10_all_labels_admission_mimiciv_dia.pcl",
  "loss": "BCE",
  "lr_features": 5e-06,
  "lr_others": 0.001,
  "lr_prototypes": 0.001,
  "model_type": "sproto",
  "normalize": null,
  "num_classes": 1643,
  "num_prototypes_per_class": 5,
  "num_training_steps": 5000,
  "num_warmup_steps": 5000,
  "pretrained_model": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
  "prototype_vector_path": null,
  "reduce_hidden_size": 256,
  "save_dir": "/pvc/shared/continual/experiments/mimiciv/icd10_clinical-continual-5p-test",
  "seed": 28,
  "transformers_version": "4.25.1",
  "use_attention": true,
  "use_cuda": true,
  "use_global_attention": false,
  "use_prototype_loss": false,
  "use_sigmoid": false
}
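These config values pin down the size of the prototypical layer. A back-of-the-envelope parameter count, assuming (as the README's architecture description suggests, not confirmed by the code here) one prototype vector and one attention vector per sub-network in the 256-dimensional reduced space:

```python
# Values taken from config.json
num_classes = 1643
num_prototypes_per_class = 5
reduce_hidden_size = 256

sub_networks = num_classes * num_prototypes_per_class        # 8215 sub-networks in total
prototype_params = sub_networks * reduce_hidden_size         # 2,103,040 prototype weights
attention_params = sub_networks * reduce_hidden_size         # same count again (assumption)

print(sub_networks, prototype_params)  # 8215 2103040
```

So the prototypical head adds roughly 4M parameters on top of the ~110M-parameter PubMedBERT encoder, which is consistent with the modest safetensors size.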
configuration_sproto.py ADDED
@@ -0,0 +1,61 @@
from transformers.configuration_utils import PretrainedConfig


class SprotoConfig(PretrainedConfig):
    model_type = "sproto"

    def __init__(
        self,
        pretrained_model=None,
        num_classes=None,
        label_order_path=None,
        use_sigmoid=False,
        use_cuda=True,
        lr_prototypes=5e-2,
        lr_features=2e-6,
        lr_others=2e-2,
        num_training_steps=5000,
        num_warmup_steps=1000,
        loss="BCE",
        save_dir="output",
        use_attention=True,
        use_global_attention=False,
        dot_product=False,
        normalize=None,
        final_layer=False,
        reduce_hidden_size=None,
        use_prototype_loss=False,
        prototype_vector_path=None,
        attention_vector_path=None,
        eval_buckets=None,
        seed=7,
        num_prototypes_per_class=1,
        batch_size=10,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.pretrained_model = pretrained_model
        self.num_classes = num_classes
        self.label_order_path = label_order_path
        self.use_sigmoid = use_sigmoid
        self.use_cuda = use_cuda
        self.lr_prototypes = lr_prototypes
        self.lr_features = lr_features
        self.lr_others = lr_others
        self.num_training_steps = num_training_steps
        self.num_warmup_steps = num_warmup_steps
        self.loss = loss
        self.save_dir = save_dir
        self.use_attention = use_attention
        self.use_global_attention = use_global_attention
        self.dot_product = dot_product
        self.normalize = normalize
        self.final_layer = final_layer
        self.reduce_hidden_size = reduce_hidden_size
        self.use_prototype_loss = use_prototype_loss
        self.prototype_vector_path = prototype_vector_path
        self.attention_vector_path = attention_vector_path
        self.eval_buckets = eval_buckets
        self.seed = seed
        self.num_prototypes_per_class = num_prototypes_per_class
        self.batch_size = batch_size
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ef1e86215368bfcbb723cb3a28d2c343927f154decc09527cce2093326a07fd2
size 455575332
modeling_sproto.py ADDED
@@ -0,0 +1,77 @@
from transformers import PreTrainedModel
from sproto.model.multi_proto import MultiProtoModule
from .configuration_sproto import SprotoConfig


class SprotoModel(PreTrainedModel):
    config_class = SprotoConfig
    base_model_prefix = "sproto"

    def __init__(self, config: SprotoConfig):
        super().__init__(config)

        self.module = MultiProtoModule(
            pretrained_model=config.pretrained_model,
            num_classes=config.num_classes,
            label_order_path=config.label_order_path,
            use_sigmoid=config.use_sigmoid,
            use_cuda=config.use_cuda,
            lr_prototypes=config.lr_prototypes,
            lr_features=config.lr_features,
            lr_others=config.lr_others,
            num_training_steps=config.num_training_steps,
            num_warmup_steps=config.num_warmup_steps,
            loss=config.loss,
            save_dir=config.save_dir,
            use_attention=config.use_attention,
            use_global_attention=config.use_global_attention,
            dot_product=config.dot_product,
            normalize=config.normalize,
            final_layer=config.final_layer,
            reduce_hidden_size=config.reduce_hidden_size,
            use_prototype_loss=config.use_prototype_loss,
            prototype_vector_path=config.prototype_vector_path,
            attention_vector_path=config.attention_vector_path,
            eval_buckets=config.eval_buckets,
            seed=config.seed,
            num_prototypes_per_class=config.num_prototypes_per_class,
            batch_size=config.batch_size,
        )

        # Initialize weights and apply final processing
        self.post_init()

    def _init_weights(self, module):
        """Initialize the weights."""
        if isinstance(module, MultiProtoModule):
            # MultiProtoModule handles its own initialization or is loaded
            # from a checkpoint.
            return
        # Add other initializations if standard layers are used directly
        # in SprotoModel.

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        targets=None,
        tokens=None,
        sample_ids=None,
        **kwargs,
    ):
        batch = {
            "input_ids": input_ids,
            "attention_masks": attention_mask,
            "token_type_ids": token_type_ids,
            "targets": targets,
            "tokens": tokens,
            "sample_ids": sample_ids,
        }

        logits, max_indices, metadata = self.module(batch)

        return {
            "logits": logits,
            "max_indices": max_indices,
            "metadata": metadata,
        }
overview.png ADDED