COCO_no_sports_general

Model Description

COCO_no_sports_general is a causal language model based on GPT-2 and fine-tuned on Florence-generated image captions for a subset of COCO. Each caption in this subset is labeled according to whether its text describes physical activity:

  • Label 0: Not related to physical activity (e.g., indoor scenes, objects, people at rest)
  • Label 1: Related to physical activity (e.g., sports, exercise)

The model was trained on the general distribution of this data, with the following label proportions:

  • Label distribution: [0.805, 0.195]

This version is designed to serve as the general model of our pipeline.
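As a quick sanity check, the label distribution reported above can be recomputed from a list of labels. The snippet below is a minimal sketch: the function name `label_distribution` and the toy label list are illustrative assumptions, and it assumes the two reported values are ordered by label id (Label 0 first).

```python
from collections import Counter


def label_distribution(labels, num_labels=2):
    """Return the fraction of examples carrying each label, ordered by label id."""
    total = len(labels)
    if total == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return [counts.get(label, 0) / total for label in range(num_labels)]


# Toy labels (hypothetical), constructed to match the reported distribution.
toy_labels = [0] * 805 + [1] * 195
print(label_distribution(toy_labels))
```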

Training and Evaluation Data

Training Procedure

Hyperparameters

  • Learning rate: 2e-5
  • Optimizer: AdamW, betas = (0.9, 0.999), epsilon = 1e-8
  • Batch size (train): 8
  • Batch size (eval): 16
  • Scheduler: Linear with warm-up
  • Warm-up steps: 1000
  • Epochs: 5
  • Mixed precision: Native AMP
  • Seed: 42
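
To make the schedule concrete, the sketch below evaluates a linear warm-up followed by a linear decay in plain Python, following the same multiplier as `get_linear_schedule_with_warmup` in transformers. The total of 22,155 steps is taken from the training results table (5 epochs × 4,431 steps). That the decay reaches zero exactly on the last step is an assumption, and `linear_warmup_lr` is a hypothetical helper name.

```python
def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=1000, total_steps=22155):
    """Learning rate at `step`: linear warm-up, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warm-up steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 500, 1000, 11577, 22155):
    print(step, linear_warmup_lr(step))
```

The learning rate therefore peaks at 2e-5 at step 1000, shortly before the end of the first quarter of epoch 1, and decreases for the rest of training.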

Training Results

Epoch   Step    Train Loss   Validation Loss
1       4431    1.1166       1.0106
2       8862    0.9943       0.9314
3       13293   0.9394       0.8984
4       17724   0.9068       0.8823
5       22155   0.8828       0.8762

Final validation loss: 0.8762
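
If the reported losses are mean per-token cross-entropy values in nats (the usual convention for causal language model training with transformers, but an assumption here), they convert to perplexity by exponentiation. A minimal sketch:

```python
import math

# Validation losses copied from the table above, indexed by epoch.
validation_loss = {1: 1.0106, 2: 0.9314, 3: 0.8984, 4: 0.8823, 5: 0.8762}

# Perplexity = exp(mean cross-entropy); assumes the loss is measured in nats.
validation_perplexity = {
    epoch: math.exp(loss) for epoch, loss in validation_loss.items()
}

for epoch, ppl in validation_perplexity.items():
    print(f"epoch {epoch}: perplexity {ppl:.3f}")
```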

Framework Versions

  • transformers: 4.46.3
  • datasets: 2.19.1
  • tokenizers: 0.20.3
  • pytorch: 2.1.2+cu121

Get started

To compute the joint log-probability of a phrase under this model (the sum of its per-token log-probabilities), you can use the following code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F
import pandas as pd
from huggingface_hub import login
from tqdm import tqdm
from datasets import load_dataset


# Define variables
hf_token = ""
model_name = "BeyondDeepFakeDetection/COCO_no_sports_general"  # repository of this model card
text_column = "text"
dataset = "BeyondDeepFakeDetection/COCO_no_sports_general"

# Log in before downloading anything from the Hub
login(token=hf_token)

# Load the tokenizer and model
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to(device)
model.eval()


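def compute_log_probabilities_for_sequence(model, tokenizer, input_text):
    """Return (token, log-probability) pairs for every token after the first."""
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(device)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # The logits at position t predict the token at position t + 1, so drop
        # the last position and score tokens 1..T-1. The first token has no
        # preceding context and is therefore not scored.
        logits = outputs.logits[:, :-1, :]
        target_ids = input_ids[:, 1:]

        log_probs = F.log_softmax(logits, dim=-1)
        # Pick out the log-probability assigned to each actual next token.
        seq_token_logprobs = log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)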

    word_probabilities = []
    for i, token_id in enumerate(target_ids[0]):
        word = tokenizer.decode([token_id])
        log_prob = seq_token_logprobs[0, i].item()
        word_probabilities.append((word, log_prob))

    return word_probabilities


test_df = load_dataset(dataset, split="train").to_pandas()
results = []

for count, text in enumerate(tqdm(test_df[text_column], desc="Processing Texts")):
    word_probs = compute_log_probabilities_for_sequence(model, tokenizer, text)
    total_log_prob = sum(prob for _, prob in word_probs)
    avg_log_prob = total_log_prob / len(word_probs) if word_probs else float("-inf")
    results.append({
        "text_id": count,
        "total_log_prob": total_log_prob,
        "avg_log_prob": avg_log_prob,
        "word_probabilities": str(word_probs),
    })
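
The per-text scores collected above can be turned into more familiar quantities. The sketch below uses a hypothetical helper, `summarize_token_logprobs`, and made-up log-probabilities to show how the total log-probability, the per-token average, and the perplexity relate for a list of (token, log-probability) pairs like the one returned by compute_log_probabilities_for_sequence.

```python
import math


def summarize_token_logprobs(token_logprobs):
    """Summarize (token, log_prob) pairs into total, average, and perplexity."""
    if not token_logprobs:
        # Mirrors the loop above: nothing was scored, so the average is undefined.
        return {
            "total_log_prob": 0.0,
            "avg_log_prob": float("-inf"),
            "perplexity": float("inf"),
        }
    total = sum(log_prob for _, log_prob in token_logprobs)
    avg = total / len(token_logprobs)
    return {
        "total_log_prob": total,       # log of the joint probability of the scored tokens
        "avg_log_prob": avg,           # length-normalized score
        "perplexity": math.exp(-avg),  # exp of the mean negative log-likelihood
    }


# Illustrative values only; real values come from the model.
example = [(" man", math.log(0.5)), (" riding", math.log(0.25)), (" a", math.log(0.5))]
summary = summarize_token_logprobs(example)
print(summary)
```

Because the total log-probability grows more negative with length, the average (or the perplexity) is usually the better choice when comparing captions of different lengths.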
