DJ-PO Reward Model: Prompt Stereotypicality Scorer

This is the reward model from the DJ-PO (Diverse Job Prompt Optimizer) framework, designed to score text-to-image prompts based on their likelihood of generating stereotypical occupation representations.

Model Description

DJ-PO Reward Model is a lightweight PyTorch model that predicts stereotypicality scores for text prompts. It combines a frozen SentenceTransformer encoder with a trainable linear head to efficiently learn ranking relationships from human feedback.

  • Model Type: Reward Model for Prompt Scoring
  • Base Model: sentence-transformers/all-mpnet-base-v2 (frozen)
  • Architecture: SentenceTransformer + Linear Head (768 → 1)
  • Task: Predicting prompt stereotypicality (higher score = more diverse, less stereotypical)
  • Training Data: 10,800 human-ranked prompt-image pairs
  • Performance: Spearman's ρ = 0.4915 (moderate correlation)

Intended Use

Primary Use Case

This model is designed to score occupation-related text prompts before feeding them to text-to-image models. Prompts scoring below a threshold (default: 0.12) are flagged for optimization using an LLM to reduce stereotypical outputs.

Example Usage

import torch
from sentence_transformers import SentenceTransformer
import torch.nn as nn

# Define the model architecture
class RewardModel(nn.Module):
    def __init__(self, sentence_transformer_model_name='sentence-transformers/all-mpnet-base-v2', embedding_dim=768):
        super().__init__()
        # Frozen encoder: only the linear head is trained
        self.embedding_model = SentenceTransformer(sentence_transformer_model_name)
        self.fc = nn.Linear(embedding_dim, 1)
        self.dropout = nn.Dropout(p=0.1)

    def forward(self, prompts):
        # encode() returns detached embeddings; that is fine here because
        # the encoder stays frozen and gradients only flow through the head
        embeddings = self.embedding_model.encode(prompts, convert_to_tensor=True)
        scores = self.fc(self.dropout(embeddings)).squeeze(-1)
        return scores

# Load the trained weights (CPU works; pass a CUDA device for GPU inference)
model = RewardModel()
model.load_state_dict(torch.load('reward_model_listwise_3.0_main.pth', map_location=torch.device('cpu')))
model.eval()

# Score a prompt
prompt = "Generate an image of a CEO"
with torch.no_grad():
    score = model([prompt]).item()

print(f"Stereotypicality Score: {score:.4f}")
print(f"Needs Optimization: {score < 0.12}")

How It Works

  1. Input: Text prompt (e.g., "Generate an image of a nurse")
  2. Encoding: Frozen SentenceTransformer generates 768-dim embedding
  3. Scoring: Linear head + dropout produces stereotypicality score
  4. Output: Float value (higher = more diverse/less stereotypical)

Threshold: Prompts scoring below 0.12 are considered stereotypical and should be optimized.
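Applied over a batch of prompts, the threshold acts as a simple gate in front of the LLM rewriting step. A minimal sketch (the helper name is illustrative, not part of the released code; `scores` would come from the reward model shown above):

```python
THRESHOLD = 0.12  # default cutoff from this model card

def flag_for_optimization(prompts, scores, threshold=THRESHOLD):
    """Return prompts whose stereotypicality score falls strictly below
    the threshold; these are candidates for LLM-based rewriting."""
    return [p for p, s in zip(prompts, scores) if s < threshold]
```

Prompts at or above the threshold pass through unchanged, so the text-to-image model only pays the extra LLM call for flagged prompts.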

Training Details

Dataset

  • Original Data: 1,200 human-ranked prompt-image pairs
    • 40 occupations × 3 prompt types (stereotypical, neutral, diverse)
    • 10 annotators performing 20 ranking tasks each
  • Augmented Data: 10,800 total datapoints
    • 8× paraphrasing per original prompt using GPT-4o
    • Labels inherited from original human rankings

Training Configuration

  • Epochs: 12
  • Learning Rate: 1e-5
  • Optimizer: AdamW
  • Scheduler: Cosine Annealing
  • Loss Function: Listwise Ranking Loss + MSE (α=0.01)
  • Temperature: 0.5 (for temperature-scaled softmax)
  • Batch Size: Variable (grouped by occupation)
  • Device: CUDA / CPU compatible

Performance Metrics

  • Spearman's ρ: 0.4915 (final)
  • Training Progression: 0.0637 (epoch 1) → 0.4915 (epoch 12)
  • Metric Rationale: Spearman correlation chosen for focus on ranking consistency over exact score prediction
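Spearman's ρ is the Pearson correlation of rank-transformed scores, so any monotone rescaling of the predictions leaves it unchanged; that is why it suits a model judged on ordering rather than calibrated values. A minimal, tie-free illustration:

```python
import math

def _ranks(xs):
    # Rank positions of each value (no tie handling, for illustration only)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

For example, predictions that are a monotone transform of the targets (1, 8, 27 vs. 1, 2, 3) still score ρ = 1.0, even though their MSE would be large.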

Validation Results

The DJ-PO framework (using this reward model) was validated through human evaluation:

  • Study Size: 20 annotators, 200 total rankings
  • Statistical Test: Welch's t-test, t(31.34) = 7.02, p < .001
  • Effect: Images generated with DJ-PO optimization were rated as significantly less stereotypical
  • Mean Rankings:
    • Unoptimized images: M = 2.55, SD = 0.63
    • Optimized images: M = 4.45, SD = 1.03

Limitations and Biases

Known Limitations

  1. Western-Centric Perspective: Trained on annotations from Netherlands-based annotators (8 Dutch, 1 Polish, 1 Turkish)
  2. Age Bias: Annotators averaged 23.7 years old
  3. Synthetic Data Dependency: Roughly 89% of training data (9,600 of 10,800 datapoints) is GPT-4o-paraphrased rather than directly human-authored
  4. Occupation Scope: Focused on professional occupational roles
  5. Cultural Context: May not generalize to non-Western contexts
  6. Score Calibration: Threshold (0.12) determined via KDE on training distribution

Ethical Considerations

  • This model reflects the values and perspectives of a small, demographically limited annotator pool
  • "Diversity" is defined through Western European legal frameworks (Dutch equal treatment law)
  • The model should be used as part of a broader ethical AI strategy, not as a standalone solution
  • Users should validate outputs in their specific cultural and application contexts

Seven Diversification Aspects

The model was trained to recognize prompts that consider these diversity dimensions:

  1. Religion/Belief - Representation beyond dominant religions
  2. Race/Ethnic Origin - Broader ethnic and racial diversity
  3. Gender Identity/Expression - Beyond binary gender stereotypes
  4. Nationality - Global geographic representation
  5. Ability/Disability Status - Including people with diverse abilities
  6. Age/Generational Identity - Representation across age groups
  7. Body Type - Diverse body shapes and sizes

Citation

If you use this model in your research, please cite:

@inproceedings{zwakenberg2025djpo,
  title={Reducing Occupation Stereotypes in T2I models using Reward Models and Prompt Optimization},
  author={Zwakenberg, Alex and Tacoma, Sietske and Mioch, Tina and Peeters, Marieke M. M.},
  booktitle={Proceedings of [Conference Name]},
  year={2025},
  organization={University of Applied Sciences Utrecht}
}

Model Card Authors

Alex Zwakenberg (University of Applied Sciences Utrecht)

Model Card Contact

License

This model is released under the MIT License. See the LICENSE file for details.

Acknowledgments

This research was supported by:

  • SPRONG project Responsible Applied AI (grant SPR.ALG.01.024)
  • Dutch Government through Regieorgaan SIA
  • Monks for practical feedback and collaboration
  • All volunteer annotators who contributed to the human ranking dataset

Disclaimer: This model is a research prototype. While it demonstrates significant bias reduction in controlled evaluations, it does not eliminate all stereotypical outputs and should be used as part of a comprehensive ethical AI strategy.
