# DJ-PO Reward Model: Prompt Stereotypicality Scorer
This is the reward model from the DJ-PO (Diverse Job Prompt Optimizer) framework, designed to score text-to-image prompts based on their likelihood of generating stereotypical occupation representations.
## Model Description
DJ-PO Reward Model is a lightweight PyTorch model that predicts stereotypicality scores for text prompts. It combines a frozen SentenceTransformer encoder with a trainable linear head to efficiently learn ranking relationships from human feedback.
- Model Type: Reward Model for Prompt Scoring
- Base Model: `sentence-transformers/all-mpnet-base-v2` (frozen)
- Architecture: SentenceTransformer + Linear Head (768 → 1)
- Task: Predicting prompt stereotypicality (higher score = more diverse)
- Training Data: 10,800 human-ranked prompt-image pairs
- Performance: Spearman's ρ = 0.4915 (moderate-to-strong correlation)
## Intended Use

### Primary Use Case
This model is designed to score occupation-related text prompts before feeding them to text-to-image models. Prompts scoring below a threshold (default: 0.12) are flagged for optimization using an LLM to reduce stereotypical outputs.
### Example Usage
```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Define the model architecture
class RewardModel(nn.Module):
    def __init__(self, sentence_transformer_model_name='sentence-transformers/all-mpnet-base-v2', embedding_dim=768):
        super().__init__()
        self.embedding_model = SentenceTransformer(sentence_transformer_model_name)
        self.fc = nn.Linear(embedding_dim, 1)
        self.dropout = nn.Dropout(p=0.1)

    def forward(self, prompts):
        # Frozen encoder produces 768-dim sentence embeddings
        embeddings = self.embedding_model.encode(prompts, convert_to_tensor=True)
        scores = self.fc(self.dropout(embeddings)).squeeze(-1)
        return scores

# Load the trained weights
model = RewardModel()
model.load_state_dict(torch.load('reward_model_listwise_3.0_main.pth', map_location=torch.device('cpu')))
model.eval()

# Score a prompt
prompt = "Generate an image of a CEO"
with torch.no_grad():
    score = model([prompt]).item()

print(f"Stereotypicality Score: {score:.4f}")
print(f"Needs Optimization: {score < 0.12}")
```
## How It Works
1. Input: Text prompt (e.g., "Generate an image of a nurse")
2. Encoding: Frozen SentenceTransformer generates a 768-dim embedding
3. Scoring: Linear head + dropout produces a stereotypicality score
4. Output: Float value (higher = more diverse / less stereotypical)
**Threshold:** Prompts scoring below 0.12 are considered stereotypical and should be optimized.
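This gating step can be sketched as a plain filter (the scores below are made up for illustration; only the 0.12 default comes from this card):

```python
STEREOTYPICALITY_THRESHOLD = 0.12  # KDE-calibrated default from this card

def needs_optimization(score, threshold=STEREOTYPICALITY_THRESHOLD):
    """Prompts scoring below the threshold are flagged for LLM rewriting."""
    return score < threshold

def flag_prompts(scored_prompts):
    """Return the prompts that should be sent to the prompt optimizer."""
    return [p for p, s in scored_prompts.items() if needs_optimization(s)]

# Hypothetical scores, for illustration only
scores = {
    "Generate an image of a CEO": 0.05,
    "Generate an image of a CEO, varying gender and ethnicity": 0.30,
}
print(flag_prompts(scores))  # only the first prompt falls below 0.12
```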
## Training Details

### Dataset
- Original Data: 1,200 human-ranked prompt-image pairs
  - 40 occupations × 3 prompt types (stereotypical, neutral, diverse)
  - 10 annotators performing 20 ranking tasks each
- Augmented Data: 10,800 total datapoints
  - 8× paraphrasing per original prompt using GPT-4o
  - Labels inherited from original human rankings
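The augmentation bookkeeping can be sketched as follows; the `paraphrase` callable is a stub standing in for the GPT-4o paraphrasing step, and the prompts and labels are placeholders:

```python
def augment(originals, paraphrase, n_paraphrases=8):
    """Expand each human-ranked prompt into itself plus n paraphrases,
    all inheriting the original prompt's label."""
    augmented = []
    for prompt, label in originals:
        augmented.append((prompt, label))
        for i in range(n_paraphrases):
            augmented.append((paraphrase(prompt, i), label))
    return augmented

# Stub paraphraser standing in for the GPT-4o call
stub = lambda p, i: f"{p} (variant {i})"
# Placeholder prompts; label k % 3 stands in for the 3 prompt types
originals = [(f"prompt {k}", k % 3) for k in range(1200)]
data = augment(originals, stub)
print(len(data))  # 1,200 × (1 original + 8 paraphrases) = 10,800
```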
### Training Configuration
- Epochs: 12
- Learning Rate: 1e-5
- Optimizer: AdamW
- Scheduler: Cosine Annealing
- Loss Function: Listwise Ranking Loss + MSE (α=0.01)
- Temperature: 0.5 (for temperature-scaled softmax)
- Batch Size: Variable (grouped by occupation)
- Device: CUDA / CPU compatible
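The card names a listwise ranking loss combined with MSE (α = 0.01) under a temperature-scaled softmax (T = 0.5) but does not give the exact formulation; a ListNet-style sketch of such a combined loss, in plain Python for readability, might look like:

```python
import math

def softmax(xs, temperature=0.5):
    """Temperature-scaled softmax over a list of scores."""
    m = max(x / temperature for x in xs)  # subtract max for stability
    exps = [math.exp(x / temperature - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_loss(pred_scores, true_scores, alpha=0.01, temperature=0.5):
    """ListNet-style cross-entropy between temperature-scaled score
    distributions, plus a small MSE anchor on the raw scores."""
    p_true = softmax(true_scores, temperature)
    p_pred = softmax(pred_scores, temperature)
    ce = -sum(t * math.log(p) for t, p in zip(p_true, p_pred))
    mse = sum((a - b) ** 2 for a, b in zip(pred_scores, true_scores)) / len(pred_scores)
    return ce + alpha * mse

# A perfectly matching ordering incurs a lower loss than a shuffled one
perfect = listwise_loss([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
shuffled = listwise_loss([0.9, 0.1, 0.5], [0.1, 0.5, 0.9])
print(perfect < shuffled)  # True
```

In training the same quantity would be computed with differentiable tensor ops; the list grouping ("batch size variable, grouped by occupation") corresponds to one softmax per occupation's prompt list.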
### Performance Metrics
- Spearman's ρ: 0.4915 (final)
- Training Progression: 0.0637 (epoch 1) → 0.4915 (epoch 12)
- Metric Rationale: Spearman correlation chosen for focus on ranking consistency over exact score prediction
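Spearman's ρ is the Pearson correlation of the rank-transformed scores, so it rewards getting the ordering right rather than matching exact values. A minimal self-contained implementation (using average ranks for ties, as `scipy.stats.spearmanr` does):

```python
def rank(xs):
    """1-based average ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0: same ordering
print(spearman([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0: reversed ordering
```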
### Validation Results
The DJ-PO framework (using this reward model) was validated through human evaluation:
- Study Size: 20 annotators, 200 total rankings
- Statistical Test: Welch's t-test, t(31.34) = 7.02, p < .001
- Effect: Images generated with DJ-PO optimization were rated as significantly less stereotypical
- Mean Rankings:
  - Unoptimized images: M = 2.55, SD = 0.63
  - Optimized images: M = 4.45, SD = 1.03
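For reference, Welch's t-test compares the two group means without assuming equal variances. A minimal sketch of the statistic and its Welch–Satterthwaite degrees of freedom (not the authors' analysis code; the sample values are synthetic):

```python
import math

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples
    with possibly unequal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    # Welch–Satterthwaite approximation for degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Synthetic rankings, for illustration only
t, df = welch_t([2.0, 3.0, 2.5, 2.6], [4.2, 4.6, 4.5, 4.4])
print(t, df)
```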
## Limitations and Biases

### Known Limitations
- Western-Centric Perspective: Trained on annotations from Netherlands-based annotators (8 Dutch, 1 Polish, 1 Turkish)
- Age Bias: Annotators averaged 23.7 years old
- Synthetic Data Dependency: 90% of training data is paraphrased
- Occupation Scope: Focused on professional occupational roles
- Cultural Context: May not generalize to non-Western contexts
- Score Calibration: Threshold (0.12) determined via KDE on training distribution
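The card states only that the 0.12 threshold came from a KDE over the training-score distribution. One plausible sketch (an assumption, not the authors' procedure) estimates the density with a Gaussian kernel and picks the minimum between the two score modes:

```python
import math

def gaussian_kde(samples, bandwidth=0.05):
    """Return a Gaussian kernel density estimate over the samples."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm
    return density

def threshold_between_modes(samples, lo, hi, steps=500, bandwidth=0.05):
    """Pick the score with minimum estimated density on [lo, hi],
    i.e. the valley between the two clusters of training scores."""
    density = gaussian_kde(samples, bandwidth)
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=density)

# Bimodal toy scores: a 'stereotypical' cluster near 0 and a 'diverse' one near 0.3
scores = [0.02 + 0.01 * (i % 5) for i in range(50)] + [0.28 + 0.01 * (i % 5) for i in range(50)]
cut = threshold_between_modes(scores, 0.06, 0.28)
print(round(cut, 2))  # falls in the gap between the two clusters
```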
### Ethical Considerations
- This model reflects the values and perspectives of a small, demographically limited annotator pool
- "Diversity" is defined through Western European legal frameworks (Dutch equal treatment law)
- The model should be used as part of a broader ethical AI strategy, not as a standalone solution
- Users should validate outputs in their specific cultural and application contexts
## Seven Diversification Aspects
The model was trained to recognize prompts that consider these diversity dimensions:
- Religion/Belief - Representation beyond dominant religions
- Race/Ethnic Origin - Broader ethnic and racial diversity
- Gender Identity/Expression - Beyond binary gender stereotypes
- Nationality - Global geographic representation
- Ability/Disability Status - Including people with diverse abilities
- Age/Generational Identity - Representation across age groups
- Body Type - Diverse body shapes and sizes
## Citation
If you use this model in your research, please cite:
```bibtex
@inproceedings{zwakenberg2025djpo,
  title={Reducing Occupation Stereotypes in T2I models using Reward Models and Prompt Optimization},
  author={Zwakenberg, Alex and Tacoma, Sietske and Mioch, Tina and Peeters, Marieke M. M.},
  booktitle={Proceedings of [Conference Name]},
  year={2025},
  organization={University of Applied Sciences Utrecht}
}
```
## Model Card Authors
Alex Zwakenberg (University of Applied Sciences Utrecht)
## Model Card Contact
- Email: Lex.zwakenberg@gmail.com
- GitHub: DJ-PO Framework Repository
## License
This model is released under the MIT License. See the LICENSE file for details.
## Acknowledgments
This research was supported by:
- SPRONG project Responsible Applied AI (grant SPR.ALG.01.024)
- Dutch Government through Regieorgaan SIA
- Monks for practical feedback and collaboration
- All volunteer annotators who contributed to the human ranking dataset
Disclaimer: This model is a research prototype. While it demonstrates significant bias reduction in controlled evaluations, it does not eliminate all stereotypical outputs and should be used as part of a comprehensive ethical AI strategy.