Upload model
- README.md +199 -0
- config.json +33 -0
- configuration_graph_clip.py +67 -0
- model.safetensors +3 -0
- modeling_graph_clip.py +278 -0
README.md ADDED
@@ -0,0 +1,199 @@
---
library_name: transformers
tags: []
---

# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

[More Information Needed]

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[More Information Needed]

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]
config.json ADDED
@@ -0,0 +1,33 @@
{
  "alpha": 0.5,
  "architectures": [
    "GraphCLIPModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_graph_clip.GraphCLIPConfig",
    "AutoModel": "modeling_graph_clip.GraphCLIPModel"
  },
  "graph_config": {
    "dropout": 0.2,
    "embedding_dim": 512,
    "ffn_embedding_dim": 512,
    "hidden_size": 512,
    "model_type": "graphormer",
    "num_hidden_layers": 6
  },
  "graph_pair_type": "image",
  "initializer_factor": 1.0,
  "logit_scale_init_value": 2.6592,
  "model_type": "graph_clip",
  "pretrained_graphormer_hub_id": "helena-balabin/pretrained_graphormer_vg_image_graphs",
  "pretrained_model_name_or_path": "openai/clip-vit-base-patch32",
  "projection_dim": 512,
  "text_config": {
    "model_type": "clip_text_model"
  },
  "torch_dtype": "float32",
  "transformers_version": "4.45.2",
  "vision_config": {
    "model_type": "clip_vision_model"
  }
}
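Because `config.json` registers the custom classes under `auto_map`, the checkpoint can be loaded through the standard Auto classes once remote code is trusted. A minimal sketch, assuming a placeholder repo id (`your-username/graph-clip` is not the real repository name) and that the `nsd_compositionality` package imported by `modeling_graph_clip.py` is installed in the loading environment:

```python
from transformers import AutoConfig, AutoModel

# NOTE: placeholder repo id; substitute the actual Hub repository of this upload.
repo_id = "your-username/graph-clip"

# trust_remote_code=True lets transformers import configuration_graph_clip.py and
# modeling_graph_clip.py from the repo, as declared in the "auto_map" field above.
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

print(config.graph_pair_type)  # "image" in this checkpoint
print(config.alpha)            # 0.5
```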
configuration_graph_clip.py ADDED
@@ -0,0 +1,67 @@
"""Configuration class for the custom Graph-based CLIP model incorporating Image, Text, and Graph inputs."""

from typing import Optional, Union

from transformers import CLIPConfig
from transformers.models.deprecated.graphormer.configuration_graphormer import GraphormerConfig


class GraphCLIPConfig(CLIPConfig):
    r"""
    Configuration for GraphCLIPModel, which extends CLIP with a Graphormer encoder.

    Args:
        graph_config (`Union[dict, GraphormerConfig]`):
            Configuration (or dict) for the Graphormer graph encoder.
        graph_pair_type (`str`, *optional*, defaults to `"text"`):
            Which modality to pair against the graph in contrastive loss.
            One of `"text"` or `"image"`.
        pretrained_model_name_or_path (`str`, *optional*):
            If set, vision & text heads will be loaded from this CLIP checkpoint.
        pretrained_graphormer_hub_id (`str`, *optional*):
            If set, the Graphormer will be loaded from this HuggingFace Hub model ID.
        alpha (`float`, *optional*, defaults to 0.5):
            Weight for combining image-text and graph-pair contrastive losses.
        **kwargs:
            All remaining kwargs will be passed to the base `CLIPConfig` (e.g., `projection_dim`,
            `vision_layers`, `text_layers`, etc.).
    """

    model_type = "graph_clip"

    def __init__(
        self,
        graph_config: Union[dict, GraphormerConfig] = GraphormerConfig(
            hidden_size=512,
            embedding_dim=512,
            ffn_embedding_dim=512,
            num_hidden_layers=6,
            dropout=0.1,
        ),
        graph_pair_type: str = "text",
        pretrained_model_name_or_path: Optional[str] = None,
        pretrained_graphormer_hub_id: Optional[str] = None,
        alpha: float = 0.5,
        **kwargs,
    ):
        super().__init__(**kwargs)

        # build or assign the graph encoder config
        if isinstance(graph_config, dict):
            self.graph_config = GraphormerConfig(**graph_config)
        else:
            self.graph_config = graph_config

        # which modality to pair the graph with
        if graph_pair_type not in ("text", "image"):
            raise ValueError("`graph_pair_type` must be either 'text' or 'image'")
        self.graph_pair_type = graph_pair_type

        # if provided, load CLIP vision/text from this checkpoint
        self.pretrained_model_name_or_path = pretrained_model_name_or_path

        # if provided, load pretrained Graphormer from this HuggingFace Hub model ID
        self.pretrained_graphormer_hub_id = pretrained_graphormer_hub_id

        # alpha for the contrastive loss
        self.alpha = alpha
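For reference, a configuration equivalent to the `config.json` above can be built programmatically. A small sketch, assuming `configuration_graph_clip.py` is on the import path; the dict form of `graph_config` is converted to a `GraphormerConfig` inside `__init__`:

```python
from configuration_graph_clip import GraphCLIPConfig

# Mirror the values stored in config.json above.
config = GraphCLIPConfig(
    graph_config={
        "hidden_size": 512,
        "embedding_dim": 512,
        "ffn_embedding_dim": 512,
        "num_hidden_layers": 6,
        "dropout": 0.2,
    },
    graph_pair_type="image",
    pretrained_model_name_or_path="openai/clip-vit-base-patch32",
    pretrained_graphormer_hub_id="helena-balabin/pretrained_graphormer_vg_image_graphs",
    alpha=0.5,
    projection_dim=512,
)

# Writes a config.json like the one shown above.
config.save_pretrained("./graph_clip_checkpoint")
```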
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:949597cdaf0150ce353b0306c1be1287537b06534a57c47d1bc655f2f8f2e74e
size 663769988
modeling_graph_clip.py ADDED
@@ -0,0 +1,278 @@
"""Contrastive Learning-Based Graph, Image, and Text Model."""

from dataclasses import dataclass
from typing import Optional, Tuple, Union

import torch
import torch.nn as nn
from transformers import TrainerCallback
from transformers import CLIPModel, CLIPTextModel, CLIPVisionModel, GraphormerModel
from transformers.modeling_outputs import BaseModelOutputWithNoAttention, BaseModelOutputWithPooling, ModelOutput
from transformers.models.clip.modeling_clip import clip_loss

from nsd_compositionality.models.graph_clip_model.configuration_graph_clip import GraphCLIPConfig


class LossLoggingCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, model=None, **kwargs):  # type: ignore
        """Log losses from the model during training.

        Args:
            args: Training arguments.
            state: Training state.
            control: Control object for training.
            logs: Dictionary of logs to update.
            model: The model being trained.
        """
        if logs is None or model is None:
            return
        add = {}
        if getattr(model, "last_loss_image_text", None) is not None:
            add["loss_image_text"] = model.last_loss_image_text
        if getattr(model, "last_loss_graph_pair", None) is not None:
            add["loss_graph_pair"] = model.last_loss_graph_pair
        if add:
            logs.update(add)


@dataclass
class GraphCLIPOutput(ModelOutput):
    """
    Custom output class for GraphCLIPModel.

    Attributes:
        loss (torch.FloatTensor, optional): Loss value if return_loss is True.
        logits_image_text (torch.FloatTensor): Logits for image-text pairs.
        logits_graph_pair (torch.FloatTensor): Logits for graph-text or graph-image pairs.
        image_embeds (torch.FloatTensor): Image embeddings.
        graph_embeds (torch.FloatTensor): Graph embeddings.
        text_embeds (torch.FloatTensor): Text embeddings.
        vision_model_output (BaseModelOutputWithPooling): Output from the vision model.
        text_model_output (BaseModelOutputWithPooling): Output from the text model.
        graph_model_output (BaseModelOutputWithNoAttention): Output from the graph model.
    """

    loss: Optional[torch.FloatTensor] = None
    logits_image_text: torch.FloatTensor = None
    logits_graph_pair: torch.FloatTensor = None
    image_embeds: torch.FloatTensor = None
    graph_embeds: torch.FloatTensor = None
    text_embeds: torch.FloatTensor = None
    vision_model_output: BaseModelOutputWithPooling = None
    text_model_output: BaseModelOutputWithPooling = None
    graph_model_output: BaseModelOutputWithNoAttention = None


class GraphCLIPModel(CLIPModel):
    config_class = GraphCLIPConfig

    def __init__(self, config: GraphCLIPConfig):
        # Specify configs
        super().__init__(config)
        graph_config = config.graph_config
        self.alpha = getattr(config, "alpha", 0.5)

        # If "pretrained_model_name_or_path" is in config, load the pretrained vision and text models
        if config.pretrained_model_name_or_path:
            self.vision_model = CLIPVisionModel.from_pretrained(
                config.pretrained_model_name_or_path,
            ).vision_model
            self.text_model = CLIPTextModel.from_pretrained(
                config.pretrained_model_name_or_path,
            )

        # Initialize Graphormer model - load pretrained if specified
        if config.pretrained_graphormer_hub_id:
            self.graph_model = GraphormerModel.from_pretrained(config.pretrained_graphormer_hub_id)
        else:
            self.graph_model = GraphormerModel._from_config(graph_config)

        # Projection layer for graph embeddings
        self.graph_projection = nn.Linear(graph_config.hidden_size, config.projection_dim, bias=False)

        # Determine the graph pair type (either "text" or "image")
        self.graph_pair_type = config.graph_pair_type  # Should be "text" or "image"

        # For logging component losses
        self.last_loss_image_text: Optional[torch.Tensor] = None
        self.last_loss_graph_pair: Optional[torch.Tensor] = None

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        pixel_values: Optional[torch.FloatTensor] = None,
        graph_input: Optional[dict] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        return_loss: Optional[bool] = True,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        **kwargs,  # noqa
    ) -> Union[Tuple, GraphCLIPOutput]:
        """
        Forward pass of GraphCLIP Model with three modalities: image, graph, and text.

        Args:
            input_ids (torch.LongTensor): Tokenized text input IDs.
            pixel_values (torch.FloatTensor): Batch of images.
            graph_input (dict, optional): Dictionary of inputs for the Graphormer encoder.
            attention_mask (torch.LongTensor, optional): Attention mask for the text encoder.
            position_ids (torch.LongTensor, optional): Position IDs for text encoder.
            return_loss (bool, optional): Whether to compute the contrastive loss, default is True.
            output_attentions (bool, optional): Whether to output attentions.
            output_hidden_states (bool, optional): Whether to output hidden states.
            return_dict (bool, optional): Whether to return a ModelOutput object.
            **kwargs: Additional keyword arguments.

        Returns:
            GraphCLIPOutput: Custom output object containing logits and embeddings.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # Process images through the CLIP vision encoder
        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        image_embeds = vision_outputs[1]  # Pooled output
        image_embeds = self.visual_projection(image_embeds)

        # Process text input through CLIP text encoder
        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        text_embeds = text_outputs[1]  # Pooled output
        text_embeds = self.text_projection(text_embeds)

        # Process graph input through Graphormer (if provided)
        graph_outputs = None
        graph_embeds = None
        if graph_input is not None:
            graph_outputs = self.graph_model(
                **graph_input,
            )
            # Use the special graph token for graph representation
            graph_embeds = graph_outputs.last_hidden_state[:, 0, :]
            graph_embeds = self.graph_projection(graph_embeds)

        # Normalize the projected features
        image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
        text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
        if graph_embeds is not None:
            graph_embeds = graph_embeds / graph_embeds.norm(p=2, dim=-1, keepdim=True)

        # Compute scaled cosine similarity logits
        logit_scale = self.logit_scale.exp()
        logits_image_text = logit_scale * torch.matmul(image_embeds, text_embeds.t())

        # Compute graph pair logits based on the specified pair type (if graph input is provided)
        logits_graph_pair = None
        if graph_embeds is not None:
            if self.graph_pair_type == "text":
                logits_graph_pair = logit_scale * torch.matmul(graph_embeds, text_embeds.t())
            elif self.graph_pair_type == "image":
                logits_graph_pair = logit_scale * torch.matmul(graph_embeds, image_embeds.t())
            else:
                raise ValueError("Invalid graph_pair_type. Must be 'text' or 'image'.")

        loss = None
        if return_loss:
            # Compute contrastive loss for the specified pairs
            loss_image_text = clip_loss(logits_image_text)
            # Store for logging
            try:
                self.last_loss_image_text = loss_image_text.detach().mean()
            except Exception:
                self.last_loss_image_text = None

            if logits_graph_pair is not None:
                loss_graph_pair = clip_loss(logits_graph_pair)
                try:
                    self.last_loss_graph_pair = loss_graph_pair.detach().mean()
                except Exception:
                    self.last_loss_graph_pair = None
                loss = (1.0 - self.alpha) * loss_image_text + self.alpha * loss_graph_pair
            else:
                self.last_loss_graph_pair = None
                loss = loss_image_text

        if not return_dict:
            output = (
                logits_image_text,
                logits_graph_pair,
                image_embeds,
                graph_embeds,
                text_embeds,
                vision_outputs,
                text_outputs,
                graph_outputs,
            )
            return ((loss,) + output) if loss is not None else output

        return GraphCLIPOutput(
            loss=loss,
            logits_image_text=logits_image_text,
            logits_graph_pair=logits_graph_pair,
            image_embeds=image_embeds,
            graph_embeds=graph_embeds,
            text_embeds=text_embeds,
            vision_model_output=vision_outputs,
            text_model_output=text_outputs,
            graph_model_output=graph_outputs,
        )

    def freeze_layers(self, freeze_vision: bool = False, freeze_text: bool = False, freeze_graph: bool = False):
        """
        Freeze or unfreeze layers of the vision, text, and graph backbones.

        Args:
            freeze_vision (bool): Whether to freeze the vision backbone.
            freeze_text (bool): Whether to freeze the text backbone.
            freeze_graph (bool): Whether to freeze the graph backbone.
        """
        if freeze_vision:
            for param in self.vision_model.parameters():
                param.requires_grad = False

        if freeze_text:
            for param in self.text_model.parameters():
                param.requires_grad = False

        if freeze_graph:
            for param in self.graph_model.parameters():
                param.requires_grad = False

    def unfreeze_partial_layers(self, model_part: str, num_layers: int):
        """
        Unfreeze the last `num_layers` of a specific model part.

        Args:
            model_part (str): The part of the model to unfreeze ('vision', 'text', or 'graph').
            num_layers (int): Number of layers to unfreeze from the end.
        """
        if model_part == "vision":
            layers = list(self.vision_model.encoder.layers)
        elif model_part == "text":
            layers = list(self.text_model.text_model.encoder.layers)
        elif model_part == "graph":
            layers = list(self.graph_model.graph_encoder.layers)
        else:
            raise ValueError("Invalid model_part. Must be 'vision', 'text', or 'graph'.")

        # Freeze all layers first
        for layer in layers:
            for param in layer.parameters():
                param.requires_grad = False

        # Unfreeze the last `num_layers`
        for layer in layers[-num_layers:]:
            for param in layer.parameters():
                param.requires_grad = True
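Taken together, a fine-tuning setup with these classes might look roughly like the sketch below. It is only a sketch under stated assumptions: the `nsd_compositionality` package that these modules belong to is installed, and the dataset/collator wiring (which must supply `input_ids`, `pixel_values`, and `graph_input` batches as the `forward` signature expects) is omitted.

```python
# Hedged sketch: assumes the nsd_compositionality package is importable; data pipeline omitted.
from nsd_compositionality.models.graph_clip_model.configuration_graph_clip import GraphCLIPConfig
from nsd_compositionality.models.graph_clip_model.modeling_graph_clip import (
    GraphCLIPModel,
    LossLoggingCallback,
)

config = GraphCLIPConfig(
    graph_pair_type="image",
    pretrained_model_name_or_path="openai/clip-vit-base-patch32",
    pretrained_graphormer_hub_id="helena-balabin/pretrained_graphormer_vg_image_graphs",
    alpha=0.5,
)
model = GraphCLIPModel(config)

# Freeze all three backbones, then unfreeze only the last two encoder blocks of each.
model.freeze_layers(freeze_vision=True, freeze_text=True, freeze_graph=True)
model.unfreeze_partial_layers("vision", num_layers=2)
model.unfreeze_partial_layers("text", num_layers=2)
model.unfreeze_partial_layers("graph", num_layers=2)

# During training, passing callbacks=[LossLoggingCallback()] to a transformers Trainer copies
# model.last_loss_image_text and model.last_loss_graph_pair into the Trainer's logged metrics.
```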