DJ-PO Reward Model: Prompt Stereotypicality Scorer

This is the reward model from the DJ-PO (Diverse Job Prompt Optimizer) framework, designed to score text-to-image prompts based on their likelihood of generating stereotypical occupation representations.

Model Description

DJ-PO Reward Model is a lightweight PyTorch model that predicts stereotypicality scores for text prompts. It combines a frozen SentenceTransformer encoder with a trainable linear head to efficiently learn ranking relationships from human feedback.

  • Model Type: Reward Model for Prompt Scoring
  • Base Model: sentence-transformers/all-mpnet-base-v2 (frozen)
  • Architecture: SentenceTransformer + Linear Head (768 → 1)
  • Task: Predicting prompt stereotypicality (higher score = more diverse, less stereotypical)
  • Training Data: 10,800 human-ranked prompt-image pairs
  • Performance: Spearman's ρ = 0.4915 (moderate correlation)

Intended Use

Primary Use Case

This model is designed to score occupation-related text prompts before feeding them to text-to-image models. Prompts scoring below a threshold (default: 0.12) are flagged for optimization using an LLM to reduce stereotypical outputs.

Example Usage

import torch
from sentence_transformers import SentenceTransformer
import torch.nn as nn

# Define the model architecture
class RewardModel(nn.Module):
    def __init__(self, sentence_transformer_model_name='sentence-transformers/all-mpnet-base-v2', embedding_dim=768):
        super().__init__()
        # Frozen encoder: only the linear head is trained
        self.embedding_model = SentenceTransformer(sentence_transformer_model_name)
        self.fc = nn.Linear(embedding_dim, 1)
        self.dropout = nn.Dropout(p=0.1)

    def forward(self, prompts):
        # encode() returns detached embeddings; that is fine here because
        # the encoder stays frozen and gradients only flow through the head
        embeddings = self.embedding_model.encode(prompts, convert_to_tensor=True)
        scores = self.fc(self.dropout(embeddings)).squeeze(-1)
        return scores

# Load the trained weights (CPU works; pass a CUDA device for GPU inference)
model = RewardModel()
model.load_state_dict(torch.load('reward_model_listwise_3.0_main.pth', map_location=torch.device('cpu')))
model.eval()

# Score a prompt
prompt = "Generate an image of a CEO"
with torch.no_grad():
    score = model([prompt]).item()

print(f"Stereotypicality Score: {score:.4f}")
print(f"Needs Optimization: {score < 0.12}")

How It Works

  1. Input: Text prompt (e.g., "Generate an image of a nurse")
  2. Encoding: Frozen SentenceTransformer generates 768-dim embedding
  3. Scoring: Linear head + dropout produces stereotypicality score
  4. Output: Float value (higher = more diverse/less stereotypical)

Threshold: Prompts scoring below 0.12 are considered stereotypical and should be optimized.
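Applied over a batch of prompts, the threshold acts as a simple gate in front of the LLM rewriting step. A minimal sketch (the helper name is illustrative, not part of the released code; `scores` would come from the reward model shown above):

```python
THRESHOLD = 0.12  # default cutoff from this model card

def flag_for_optimization(prompts, scores, threshold=THRESHOLD):
    """Return prompts whose stereotypicality score falls strictly below
    the threshold; these are candidates for LLM-based rewriting."""
    return [p for p, s in zip(prompts, scores) if s < threshold]
```

Prompts at or above the threshold pass through unchanged, so the text-to-image model only pays the extra LLM call for flagged prompts.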

Training Details

Dataset

  • Original Data: 1,200 human-ranked prompt-image pairs
    • 40 occupations × 3 prompt types (stereotypical, neutral, diverse)
    • 10 annotators performing 20 ranking tasks each
  • Augmented Data: 10,800 total datapoints
    • 8× paraphrasing per original prompt using GPT-4o
    • Labels inherited from original human rankings

Training Configuration

  • Epochs: 12
  • Learning Rate: 1e-5
  • Optimizer: AdamW
  • Scheduler: Cosine Annealing
  • Loss Function: Listwise Ranking Loss + MSE (α=0.01)
  • Temperature: 0.5 (for temperature-scaled softmax)
  • Batch Size: Variable (grouped by occupation)
  • Device: CUDA / CPU compatible

Performance Metrics

  • Spearman's ρ: 0.4915 (final)
  • Training Progression: 0.0637 (epoch 1) → 0.4915 (epoch 12)
  • Metric Rationale: Spearman correlation chosen for focus on ranking consistency over exact score prediction
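Spearman's ρ is the Pearson correlation of rank-transformed scores, so any monotone rescaling of the predictions leaves it unchanged; that is why it suits a model judged on ordering rather than calibrated values. A minimal, tie-free illustration:

```python
import math

def _ranks(xs):
    # Rank positions of each value (no tie handling, for illustration only)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

For example, predictions that are a monotone transform of the targets (1, 8, 27 vs. 1, 2, 3) still score ρ = 1.0, even though their MSE would be large.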

Validation Results

The DJ-PO framework (using this reward model) was validated through human evaluation:

  • Study Size: 20 annotators, 200 total rankings
  • Statistical Test: Welch's t-test, t(31.34) = 7.02, p < .001
  • Effect: Images generated with DJ-PO optimization were rated as significantly less stereotypical
  • Mean Rankings:
    • Unoptimized images: M = 2.55, SD = 0.63
    • Optimized images: M = 4.45, SD = 1.03

Limitations and Biases

Known Limitations

  1. Western-Centric Perspective: Trained on annotations from Netherlands-based annotators (8 Dutch, 1 Polish, 1 Turkish)
  2. Age Bias: Annotators averaged 23.7 years old
  3. Synthetic Data Dependency: Roughly 89% of training data (9,600 of 10,800 datapoints) is GPT-4o-paraphrased rather than directly human-authored
  4. Occupation Scope: Focused on professional occupational roles
  5. Cultural Context: May not generalize to non-Western contexts
  6. Score Calibration: Threshold (0.12) determined via KDE on training distribution

Ethical Considerations

  • This model reflects the values and perspectives of a small, demographically limited annotator pool
  • "Diversity" is defined through Western European legal frameworks (Dutch equal treatment law)
  • The model should be used as part of a broader ethical AI strategy, not as a standalone solution
  • Users should validate outputs in their specific cultural and application contexts

Seven Diversification Aspects

The model was trained to recognize prompts that consider these diversity dimensions:

  1. Religion/Belief - Representation beyond dominant religions
  2. Race/Ethnic Origin - Broader ethnic and racial diversity
  3. Gender Identity/Expression - Beyond binary gender stereotypes
  4. Nationality - Global geographic representation
  5. Ability/Disability Status - Including people with diverse abilities
  6. Age/Generational Identity - Representation across age groups
  7. Body Type - Diverse body shapes and sizes

Citation

If you use this model in your research, please cite:

@inproceedings{zwakenberg2025djpo,
  title={Reducing Occupation Stereotypes in T2I models using Reward Models and Prompt Optimization},
  author={Zwakenberg, Alex and Tacoma, Sietske and Mioch, Tina and Peeters, Marieke M. M.},
  booktitle={Proceedings of [Conference Name]},
  year={2025},
  organization={University of Applied Sciences Utrecht}
}

Model Card Authors

Alex Zwakenberg (University of Applied Sciences Utrecht)

Model Card Contact

License

This model is released under the MIT License. See the LICENSE file for details.

Acknowledgments

This research was supported by:

  • SPRONG project Responsible Applied AI (grant SPR.ALG.01.024)
  • Dutch Government through Regieorgaan SIA
  • Monks for practical feedback and collaboration
  • All volunteer annotators who contributed to the human ranking dataset

Disclaimer: This model is a research prototype. While it demonstrates significant bias reduction in controlled evaluations, it does not eliminate all stereotypical outputs and should be used as part of a comprehensive ethical AI strategy.
