Upload 9 files
- README.md +97 -31
- config.json +171 -0
- gitattributes +35 -0
- model.safetensors +3 -0
- preprocessor_config.json +27 -0
- requirements.txt +35 -0
- tokenizer.json +0 -0
- tokenizer_config.json +34 -0
- vocab.json +0 -0
README.md
CHANGED
---
license: apache-2.0
language: en
library_name: transformers
tags:
- clip
- image-classification
- multi-task-classification
- fairface
- vision
- autoeval-has-no-ethical-license
model-index:
- name: clip-face-attribute-classifier
  results:
  - task:
      type: image-classification
    dataset:
      name: FairFace
      type: fairface
    metrics:
    - type: accuracy
      value: 0.9638
      name: Gender Accuracy
    - type: accuracy
      value: 0.7322
      name: Race Accuracy
    - type: accuracy
      value: 0.5917
      name: Age Accuracy
---

# Fine-tuned CLIP Model for Face Attribute Classification

This repository contains the model **`clip-face-attribute-classifier`**, a fine-tuned version of the **[openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)** model. It has been adapted for multi-task classification of perceived age, gender, and race from facial images.

The model was trained on the **[FairFace dataset](https://github.com/joojs/fairface)**, which is designed to be balanced across these demographic categories. This model card provides a detailed look at its performance, limitations, and intended use to encourage responsible application.

## Model Description

The base model, CLIP (Contrastive Language-Image Pre-Training), learns rich visual representations by matching images to their corresponding text descriptions. This fine-tuned version repurposes the powerful vision encoder from CLIP for a specific classification task.

It takes an image as input and outputs three separate predictions for:

* **Age:** 9 categories (0-2, 3-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, more than 70)
* **Gender:** 2 categories (Male, Female)
* **Race:** 7 categories (White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, Latino_Hispanic)

## Intended Uses & Limitations

This model is intended primarily for research and analysis purposes.

### Intended Uses
* **Research on model fairness and bias:** Analyzing the model's performance differences across demographic groups.
* **Providing a public baseline:** Serving as a starting point for researchers aiming to improve performance on these specific classification tasks.
* **Educational purposes:** Demonstrating a multi-task fine-tuning approach on a vision model.

### Out-of-Scope and Prohibited Uses
This model makes predictions about sensitive demographic attributes and carries significant risks if misused. The following uses are explicitly out-of-scope and strongly discouraged:
* **Surveillance, monitoring, or tracking of individuals.**
* **Automated decision-making that impacts an individual's rights or opportunities** (e.g., loan applications, hiring decisions, insurance eligibility).
* **Inferring or assigning an individual's self-identity.** The model's predictions are based on learned visual patterns and do not reflect how a person identifies.

## How to Get Started

To use this model, you need to import its custom `MultiTaskClipVisionModel` class, as it is not a standard `AutoModel`.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# --- 0. Define the Custom Model Class ---
# You must define the model architecture to load the weights into it.
class MultiTaskClipVisionModel(nn.Module):
    def __init__(self, num_labels):
        super(MultiTaskClipVisionModel, self).__init__()
        # Load the vision part of a CLIP model
        self.vision_model = AutoModel.from_pretrained("openai/clip-vit-large-patch14").vision_model

        hidden_size = self.vision_model.config.hidden_size
        self.age_head = nn.Linear(hidden_size, num_labels['age'])
        self.gender_head = nn.Linear(hidden_size, num_labels['gender'])
        self.race_head = nn.Linear(hidden_size, num_labels['race'])

    def forward(self, pixel_values):
        outputs = self.vision_model(pixel_values=pixel_values)
        pooled_output = outputs.pooler_output
        return {
            'age': self.age_head(pooled_output),
            'gender': self.gender_head(pooled_output),
            'race': self.race_head(pooled_output),
        }

# --- 1. Configuration ---
MODEL_PATH = "syntheticbot/clip-face-attribute-classifier"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# --- 2. Define Label Mappings (must match training) ---
age_labels = ['0-2', '10-19', '20-29', '3-9', '30-39', '40-49', '50-59', '60-69', 'more than 70']
gender_labels = ['Female', 'Male']
race_labels = ['Black', 'East Asian', 'Indian', 'Latino_Hispanic', 'Middle Eastern', 'Southeast Asian', 'White']

# Use sorted lists to create a consistent mapping
id_mappings = {
    'age': {i: label for i, label in enumerate(sorted(age_labels))},
    'gender': {i: label for i, label in enumerate(sorted(gender_labels))},
    'race': {i: label for i, label in enumerate(sorted(race_labels))},
}
NUM_LABELS = { 'age': len(age_labels), 'gender': len(gender_labels), 'race': len(race_labels) }

# --- 3. Load Model and Processor ---
processor = CLIPImageProcessor.from_pretrained(MODEL_PATH)
model = MultiTaskClipVisionModel(num_labels=NUM_LABELS)
# Note: load the fine-tuned weights before predicting, e.g.:
# from safetensors.torch import load_file
# model.load_state_dict(load_file("model.safetensors"))

model.to(DEVICE)
model.eval()

# --- 4. Prediction Function ---
def predict(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(DEVICE)

    with torch.no_grad():
        logits = model(pixel_values=inputs['pixel_values'])

    predictions = {}
    for task in ['age', 'gender', 'race']:
        pred_id = torch.argmax(logits[task], dim=-1).item()
        pred_label = id_mappings[task][pred_id]
        predictions[task] = pred_label

    print(f"Predictions for {image_path}:")
    for task, label in predictions.items():
        print(f"  - {task.capitalize()}: {label}")
    return predictions

# --- 5. Run Prediction ---
predict('sample.jpg')  # Replace with the path to your image
```

## Training Details

* **Base Model:** [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)
* **Dataset:** [FairFace](https://github.com/joojs/fairface)
* **Training Procedure:** The model was fine-tuned for 5 epochs. The vision encoder was mostly frozen, with only the final 3 transformer layers unfrozen for training. A separate linear classification head was added for each task (age, gender, race). The total loss was the sum of the Cross-Entropy Loss from each of the three tasks.
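The summed multi-task objective can be sketched in plain Python; `cross_entropy`, `multi_task_loss`, and the toy logits below are illustrative names and values, not code from this repository.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example: log-sum-exp of logits minus the target logit."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def multi_task_loss(task_logits, targets):
    """Total loss = sum of per-task cross-entropy losses (age + gender + race)."""
    return sum(cross_entropy(task_logits[t], targets[t]) for t in task_logits)

# Toy logits for one image (9 age, 2 gender, 7 race classes):
logits = {
    'age': [0.1] * 9,        # uniform -> loss ln(9)
    'gender': [2.0, 0.0],    # confident, correct -> small loss
    'race': [0.5] * 7,       # uniform -> loss ln(7)
}
targets = {'age': 3, 'gender': 0, 'race': 2}
loss = multi_task_loss(logits, targets)
```

Because each head gets its own cross-entropy term, gradients for the shared vision encoder come from all three tasks at once, while each linear head is trained only by its own term.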

## Evaluation

The model was evaluated on the FairFace validation split, which contains 10,954 images.

### Performance Metrics

The following reports detail the model's performance on each task.

#### **Gender Classification (Overall Accuracy: 96.38%)**
```
              precision    recall  f1-score   support

weighted avg       0.96      0.96      0.96     10954
```

#### **Race Classification (Overall Accuracy: 73.22%)**
```
                 precision    recall  f1-score   support

          Black       0.90      0.89      0.89      1556
     East Asian       0.74      0.78      0.76      1550
         Indian       0.81      0.75      0.78      1516
Latino_Hispanic       0.58      0.62      0.60      1623
 Middle Eastern       0.69      0.57      0.62      1209
Southeast Asian       0.66      0.65      0.65      1415
          White       0.75      0.80      0.77      2085

       accuracy                           0.73     10954
      macro avg       0.73      0.72      0.73     10954
   weighted avg       0.73      0.73      0.73     10954
```

#### **Age Classification (Overall Accuracy: 59.17%)**
```
              precision    recall  f1-score   support

         0-2       0.93      0.45      0.60       199
       10-19       0.62      0.41      0.50      1181
       20-29       0.64      0.76      0.70      3300
         3-9       0.77      0.88      0.82      1356
       30-39       0.49      0.50      0.49      2330
       40-49       0.46      0.44      0.45      1353
       50-59       0.47      0.40      0.43       796
       60-69       0.45      0.32      0.38       321
more than 70       0.75      0.10      0.18       118

    accuracy                           0.59     10954
   macro avg       0.62      0.47      0.51     10954
weighted avg       0.59      0.59      0.58     10954
```
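The summary rows in these reports can be reproduced from the per-class numbers. A minimal check using the race report above (small deviations come from the per-class F1 scores being rounded to two decimals):

```python
# Per-class F1 and support from the race report, in row order (Black ... White).
f1 =      [0.89, 0.76, 0.78, 0.60, 0.62, 0.65, 0.77]
support = [1556, 1550, 1516, 1623, 1209, 1415, 2085]

# macro avg: unweighted mean over classes; weighted avg: weighted by support.
macro_f1 = sum(f1) / len(f1)
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)
```

The supports sum to 10,954 (the full validation split), and the weighted F1 comes out to 0.73, matching the report; the macro F1 lands near 0.72-0.73 depending on rounding.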

## Bias, Risks, and Limitations

* **Perceptual vs. Identity:** The model predicts perceived attributes based on visual data. These predictions are not a determination of an individual's true self-identity.
* **Performance Disparities:** The evaluation clearly shows that performance is not uniform across all categories. The model is significantly less accurate for certain racial groups (e.g., Latino_Hispanic, Middle Eastern) and older age groups. Using this model in any application will perpetuate these biases.
* **Data Representation:** While trained on FairFace, a balanced dataset, the model may still reflect societal biases present in the original pre-training data of CLIP.
* **Risk of Misclassification:** Any misclassification, particularly of sensitive attributes, can have negative social consequences. The model's moderate accuracy on age and race prediction makes this a significant risk.

### Citation

```bibtex
@inproceedings{karkkainen2021fairface,
  title={FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation},
  author={Karkkainen, Kimmo and Joo, Jungseock},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  pages={1548--1558},
  year={2021}
}
```
config.json
ADDED

```json
{
  "_name_or_path": "clip-vit-large-patch14/",
  "architectures": [
    "CLIPModel"
  ],
  "initializer_factor": 1.0,
  "logit_scale_init_value": 2.6592,
  "model_type": "clip",
  "projection_dim": 768,
  "text_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "quick_gelu",
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 77,
    "min_length": 0,
    "model_type": "clip_text_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 12,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 12,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 1,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 768,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.16.0.dev0",
    "use_bfloat16": false,
    "vocab_size": 49408
  },
  "text_config_dict": {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "projection_dim": 768
  },
  "torch_dtype": "float32",
  "transformers_version": null,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "quick_gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 224,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "clip_vision_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 768,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.16.0.dev0",
    "use_bfloat16": false
  },
  "vision_config_dict": {
    "hidden_size": 1024,
    "intermediate_size": 4096,
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768
  }
}
```
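The `vision_config` values imply the geometry of the ViT-L/14 vision tower this config describes: a 224x224 input cut into 14x14 patches gives a 16x16 patch grid, plus one class token, for 257 positions entering the 24-layer encoder. A quick sanity check:

```python
# Geometry implied by vision_config: image_size=224, patch_size=14.
image_size = 224
patch_size = 14

grid = image_size // patch_size      # patches per side
num_positions = grid * grid + 1      # patch tokens + 1 [CLS] token

print(grid, num_positions)           # 16 257
```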
gitattributes
ADDED

```text
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
```
model.safetensors
ADDED

```text
version https://git-lfs.github.com/spec/v1
oid sha256:e811883e6f247acc61a869a938b9523d1eb1d34fa3c1e882b3f033a49b8cb72d
size 1212846240
```
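The LFS pointer's `size` field gives a rough parameter count: ignoring the small safetensors header, float32 weights take 4 bytes each, so 1,212,846,240 bytes corresponds to about 303M parameters, consistent with a CLIP ViT-L/14 vision tower plus small classification heads.

```python
# Approximate parameter count from the checkpoint size (float32 = 4 bytes/param).
# This ignores the safetensors JSON header, so it is an estimate.
size_bytes = 1212846240
approx_params = size_bytes // 4

print(approx_params)   # 303211560
```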
preprocessor_config.json
ADDED

```json
{
  "crop_size": {
    "height": 224,
    "width": 224
  },
  "do_center_crop": true,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "CLIPImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "shortest_edge": 224
  }
}
```
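This config describes the standard CLIP pipeline: resize the shortest edge to 224, center-crop to 224x224, rescale by 1/255, then normalize per channel. The rescale-and-normalize arithmetic can be reproduced directly from the values above (resizing and cropping are omitted in this sketch):

```python
# Rescale + normalize step from preprocessor_config.json, per channel:
# normalized = (pixel * rescale_factor - mean[c]) / std[c]
image_mean = [0.48145466, 0.4578275, 0.40821073]
image_std = [0.26862954, 0.26130258, 0.27577711]
rescale_factor = 0.00392156862745098  # == 1/255

def normalize_pixel(value, channel):
    """Map a raw 0-255 pixel value to the model's input range for one channel."""
    scaled = value * rescale_factor            # rescale to [0, 1]
    return (scaled - image_mean[channel]) / image_std[channel]

# A fully saturated red-channel value (255) ends up around 1.93 after normalization.
print(round(normalize_pixel(255, 0), 4))
```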
requirements.txt
ADDED

```text
# This file lists the required packages for the clip-face-attribute-classifier project.
# Install them using: pip install -r requirements.txt

# --- Hugging Face Libraries ---
# Core library for models, Trainer, TrainingArguments, and processors
transformers==4.38.2
# Used for data handling and creating Dataset objects
datasets==2.18.0
# For efficient training and hardware acceleration with the Trainer
accelerate==0.27.2
# For interacting with the Hugging Face Hub (login, upload, etc.)
huggingface_hub==0.21.4

# --- Core Deep Learning Framework ---
# The fundamental deep learning library
torch==2.2.1
# Companion library for computer vision tasks in PyTorch
torchvision==0.17.1

# --- Data Handling and Metrics ---
# For reading and manipulating the .csv label files
pandas==2.2.1
# For calculating evaluation metrics like accuracy, precision, recall, and F1-score
scikit-learn==1.4.1.post1

# --- Utilities ---
# For opening and handling image files
Pillow==10.2.0
# For creating progress bars during evaluation
tqdm==4.66.2
# For loading the safer .safetensors model format
safetensors==0.4.2
```
tokenizer.json
ADDED

The diff for this file is too large to render. See raw diff.
tokenizer_config.json
ADDED

```json
{
  "unk_token": {
    "content": "<|endoftext|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "__type": "AddedToken"
  },
  "bos_token": {
    "content": "<|startoftext|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "__type": "AddedToken"
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "__type": "AddedToken"
  },
  "pad_token": "<|endoftext|>",
  "add_prefix_space": false,
  "errors": "replace",
  "do_lower_case": true,
  "name_or_path": "openai/clip-vit-base-patch32",
  "model_max_length": 77,
  "special_tokens_map_file": "./special_tokens_map.json",
  "tokenizer_class": "CLIPTokenizer"
}
```
vocab.json
ADDED

The diff for this file is too large to render. See raw diff.