COCO_no_sports_general
Model Description
COCO_no_sports_general is a causal language model based on GPT-2, fine-tuned on Florence-generated image captions for a subset of COCO. Each caption in this subset is labeled for physical activity content:
- Label 0: Not related to physical activity (e.g., indoor scenes, objects, people at rest)
- Label 1: Related to physical activity (e.g., sports, exercise)
The model has been trained on a general distribution of this data:
- Label distribution: [0.805, 0.195]
This version is designed to serve as the general model of our pipeline.
Training and Evaluation Data
- Dataset: BeyondDeepFakeDetection/COCO_no_sports
- Label schema: Binary classification of text as related to physical activity or not.
- Source: COCO, Florence
Training Procedure
Hyperparameters
- Learning rate: 2e-5
- Optimizer: AdamW, betas=(0.9, 0.999), epsilon=1e-8
- Batch size (train): 8
- Batch size (eval): 16
- Scheduler: Linear with warm-up
- Warm-up steps: 1000
- Epochs: 5
- Mixed precision: Native AMP
- Seed: 42
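As an illustration of the "Linear with warm-up" scheduler, the sketch below computes the learning rate at a given optimizer step. It assumes the common convention of a linear ramp from 0 to the peak rate over the warm-up steps, followed by a linear decay to 0 at the final step; the total of 22155 steps is taken from the last row of the training results table. The function name `linear_warmup_lr` is hypothetical and is not part of the training code.

```python
PEAK_LR = 2e-5        # learning rate listed above
WARMUP_STEPS = 1000   # warm-up steps listed above
TOTAL_STEPS = 22155   # assumption: final step reported in the results table


def linear_warmup_lr(step: int) -> float:
    """Learning rate at `step` for a linear warm-up / linear decay schedule."""
    if step < WARMUP_STEPS:
        # Ramp up linearly from 0 to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay linearly from the peak rate to 0 at the last step.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 500, 1000, 4431, 22155):
    print(f"step {step:>5}: lr = {linear_warmup_lr(step):.3e}")
```

Under these assumptions the rate peaks at step 1000, well inside the first epoch, and decays for the rest of training.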
Training Results
| Epoch | Step | Train Loss | Validation Loss |
|---|---|---|---|
| 1 | 4431 | 1.1166 | 1.0106 |
| 2 | 8862 | 0.9943 | 0.9314 |
| 3 | 13293 | 0.9394 | 0.8984 |
| 4 | 17724 | 0.9068 | 0.8823 |
| 5 | 22155 | 0.8828 | 0.8762 |
Final validation loss: 0.8762
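The validation losses can be read as perplexities. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training with transformers; if the loss was computed differently, the conversion does not apply.

```python
import math

# Validation loss per epoch, copied from the table above.
val_losses = {1: 1.0106, 2: 0.9314, 3: 0.8984, 4: 0.8823, 5: 0.8762}

# Assumption: loss is mean per-token cross-entropy in nats, so perplexity = exp(loss).
perplexities = {epoch: math.exp(loss) for epoch, loss in val_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: validation perplexity ~ {ppl:.3f}")
```

Under that assumption, the final validation loss of 0.8762 corresponds to a perplexity of roughly 2.4.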
Framework Versions
- transformers: 4.46.3
- datasets: 2.19.1
- tokenizers: 0.20.3
- pytorch: 2.1.2+cu121
Get started
To compute the joint log-probability of a phrase under this model, you can use the following code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F
import pandas as pd
from huggingface_hub import login
from tqdm import tqdm
from datasets import load_dataset

# Define variables
hf_token = ""  # Hugging Face access token; leave empty to skip login
model_name = "BeyondDeepFakeDetection/COCO_no_sports_general"
text_column = "text"
dataset = "BeyondDeepFakeDetection/COCO_no_sports"

# Log in before downloading anything from the Hub
if hf_token:
    login(token=hf_token)

# Load the tokenizer and model
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to(device)
model.eval()

def compute_log_probabilities_for_sequence(model, tokenizer, input_text):
    """Return a list of (token, log-probability) pairs for one input text."""
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(device)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)

    # Logits at position t predict the token at position t + 1,
    # so drop the last logit and the first target to align them.
    logits = outputs.logits[:, :-1, :]
    target_ids = input_ids[:, 1:]

    log_probs = F.log_softmax(logits, dim=-1)
    seq_token_logprobs = log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)

    word_probabilities = []
    for i, token_id in enumerate(target_ids[0]):
        word = tokenizer.decode([token_id.item()])
        log_prob = seq_token_logprobs[0, i].item()
        word_probabilities.append((word, log_prob))
    return word_probabilities

test_df = pd.DataFrame(load_dataset(dataset, split="train"))

results = []
for count, text in enumerate(tqdm(test_df[text_column], desc="Processing Texts")):
    word_probs = compute_log_probabilities_for_sequence(model, tokenizer, text)
    total_log_prob = sum(prob for _, prob in word_probs)
    avg_log_prob = total_log_prob / len(word_probs) if word_probs else float("-inf")
    results.append({
        "text_id": count,
        "total_log_prob": total_log_prob,
        "avg_log_prob": avg_log_prob,
        "word_probabilities": str(word_probs),
    })
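
The total log-probability above is a chain-rule sum: each token is scored by the prediction made at the previous position, and the first token is not scored because nothing precedes it. The following toy sketch reproduces that one-step shift in plain Python, without loading the model. The logits and token ids are made-up values, and `log_softmax` and `sequence_log_prob` are hypothetical helper names used only for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def sequence_log_prob(logits, token_ids):
    """Sum log p(token_t | tokens before t) for t >= 1."""
    total = 0.0
    for t in range(1, len(token_ids)):
        # The logits at position t - 1 are the prediction for token t.
        total += log_softmax(logits[t - 1])[token_ids[t]]
    return total


# Toy vocabulary of 3 tokens and a sequence of 3 token ids (made-up values).
toy_logits = [
    [0.0, 0.0, 0.0],  # uniform prediction for the token at position 1
    [0.0, 0.0, 0.0],  # uniform prediction for the token at position 2
    [5.0, 1.0, 0.2],  # prediction for a token beyond the sequence; unused
]
toy_ids = [2, 0, 1]

total = sequence_log_prob(toy_logits, toy_ids)
print(f"total log-probability: {total:.4f}")
```

Because both scored positions predict uniformly over three tokens, the total equals 2 * log(1/3), and the logits at the final position never contribute.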
Model tree for BeyondDeepFakeDetection/COCO_no_sports_general
Base model
openai-community/gpt2