	arXiv:2505.14442v1 [cs.CL] 20 May 2025
	Creative Preference Optimization
	Mete Ismayilzada1,2, Antonio Laverghetta Jr.3, Simone A. Luchini3,
	Reet Patel3, Antoine Bosselut1, Lonneke van der Plas2, Roger Beaty3
	1EPFL, 2Università della Svizzera Italiana, 3Pennsylvania State University
	mahammad.ismayilzada@epfl.ch
	Abstract
	While Large Language Models (LLMs)
	have demonstrated impressive performance
	across natural language generation tasks,
	their ability to generate truly creative con-
	tent characterized by novelty, diversity, sur-
	prise, and quality remains limited. Existing
	methods for enhancing LLM creativity often
	focus narrowly on diversity or specific tasks,
	failing to address creativity's multifaceted nature
	in a generalizable way.
	In this work,
	we propose Creative Preference Optimization
	(CRPO), a novel alignment method that in-
	jects signals from multiple creativity dimen-
	sions into the preference optimization objec-
	tive in a modular fashion. We train and eval-
	uate creativity-augmented versions of several
	models using CRPO and MUCE, a new large-
	scale human preference dataset spanning over
	200,000 human-generated responses and rat-
	ings from more than 30 psychological creativ-
	ity assessments. Our models outperform strong
	baselines, including GPT-4o, on both auto-
	mated and human evaluations, producing more
	novel, diverse, and surprising generations while
	maintaining high output quality. Additional
	evaluations on NOVELTYBENCH further con-
	firm the generalizability of our approach. To-
	gether, our results demonstrate that directly op-
	timizing for creativity within preference frame-
	works is a promising direction for advancing
	the creative capabilities of LLMs without com-
	promising output quality.

	1 Introduction
	Large Language Models (LLMs) have made sig-
	nificant progress across a broad range of natural
	language generation tasks (Team et al., 2023; Zhao
	et al., 2025; Bubeck et al., 2023; Wei et al., 2022;
	Brown et al., 2020). However, whether LLMs exhibit
	true human-like creativity, i.e., the ability to produce
	novel (i.e., original), high-quality (i.e., useful)
	and surprising (i.e., unexpected) ideas (Simonton,
	2012; Boden, 2004) remains unclear. Research on
	the creativity of LLMs has found mixed results,
	with some reporting that LLMs are more creative
	than humans (Bellemare-Pepin et al., 2024; Zhao
	et al., 2024), others reporting that they are less cre-
	ative (Koivisto and Grassini, 2023; Chakrabarty
	et al., 2024; Ismayilzada et al., 2024b), and some
	finding LLM and human creativity to be on par
	(Stevenson et al., 2022; Góes et al., 2023; Gilhooly,
	2024).
	However, past research has also found
	that high LLM performance can be attributed
	to the artificial nature of the creativity tasks (Is-
	mayilzada et al., 2024a) commonly employed to
	evaluate LLMs such as the Alternative Uses Task
	(Guilford, 1967) or to the remarkable creativity of
	human-written texts on the web (Lu et al., 2024).
	Consequently, LLMs have been shown to often
	lack novelty and surprise in their generations (Is-
	mayilzada et al., 2024a,b; Zhang et al., 2025; Tian
	et al., 2024; Chakrabarty et al., 2024) and produce
	significantly less diverse content compared to hu-
	mans (Padmakumar and He, 2023; Anderson et al.,
	2024; Kirk et al., 2023; Xu et al., 2024; O'Mahony
	et al., 2024; Zhang et al., 2024; Wenger and Kenett,
	2025). These tendencies limit the utility of LLMs
	for creative tasks, such as story generation and cre-
	ative problem solving that often require longer re-
	sponses and out-of-the-box thinking (Tian et al.,
	2023; Huang et al., 2024; Chen et al., 2024).
	Recent research has proposed some methods for
	improving the creativity of LLMs, often targeting the diversity
	aspect alone (Wong et al., 2024; Hayati et al.,
	2023; Chung et al., 2023; Franceschelli and Mu-
	solesi, 2024; Zhang et al., 2024; Wang et al., 2024b;
	Zhou et al., 2025; Lanchantin et al., 2025; Chung
	et al., 2025) or focusing on a single creativity task
	(Tian et al., 2023; Nair et al., 2024; Summers-Stay
	et al., 2023). However, creativity is a multifaceted
	ability that also encompasses novelty, surprise, and
	quality and manifests itself in a wide range of tasks.
	Consequently, it has been argued that methods promoting
	creativity improvements should consider
	multiple dimensions of creativity together across
	several creative tasks (Ismayilzada et al., 2024a).
	Hence, the broader challenge of enhancing overall
	creativity in LLM outputs largely remains underexplored.

	[Figure 1: diagram with the labels: prompt; preferred response; set of preferred responses; Novelty, Diversity, Surprise, Quality; LM; RM; λn, λd, λs, λq; Creativity; DPO Loss.]
	Figure 1: Our preference alignment method, CRPO, improves output creativity by injecting a weighted combination
	of signals from multiple creativity dimensions.

	To this end, we propose a novel approach to di-
	rectly optimize for creativity in language model
	generation through preference learning (Ouyang
	et al., 2022; Rafailov et al., 2023). Recent works
	targeting improvement in LLM creativity have
	mainly focused on black-box techniques to elicit
	creative outputs through input-level (e.g., prompt-
	ing) (Tian et al., 2023; Mehrotra et al., 2024;
	Nair et al., 2024; Summers-Stay et al., 2023) and
	output-level strategies (e.g., creative decoding)
	(Franceschelli and Musolesi, 2024; Meister et al.,
	2023).
	However, these methods are inherently
	limited to the fixed creative capacity of language
	models and are not designed to optimize for fine-
	grained dimensions of creativity. Recently, moti-
	vated by the negative impact of the preference align-
	ment techniques on the diversity of LLM outputs
	(Padmakumar and He, 2023; Anderson et al., 2024;
	Kirk et al., 2023; O'Mahony et al., 2024; West and
	Potts, 2025), a few works have suggested directly
	modifying the preference optimization methods to
	promote output diversity (Lanchantin et al., 2025;
	Chung et al., 2025). Inspired by these approaches,
	we design a new optimization strategy that injects
	signals from multiple dimensions of creativity into
	the preference modeling objective in a modular
	fashion. Specifically, we integrate the novelty, di-
	versity, surprise and quality dimensions of creativ-
	ity into the training objective of direct preference
	optimization (DPO) (Rafailov et al., 2023), with
	a weighted composition that allows balancing each
	dimension's contribution. We call this method creative
	preference optimization (CRPO) and provide
	its conceptual illustration in Figure 1 with full de-
	tails in Section 3.
	We test the efficacy of CRPO using MUCE
	(Multitask Creativity Evaluation), our newly cu-
	rated large-scale dataset of prompt-response pairs
	annotated with human preferences across a di-
	verse range of creative tasks in multiple languages.
	While previous work has largely evaluated creativ-
	ity improvements on a narrow range of tasks like
	story generation (Chung et al., 2025; Lanchantin
	et al., 2025) or creative problem solving (Tian
	et al., 2023), MUCE enables us to test whether
	our methods truly generalize across a diverse range
	of creativity assessments. Our results show that
	Llama-3.1-8B-Instruct (AI Meta, 2024) and
	Mistral-7B-Instruct-v0.3 (Jiang et al., 2023)
	trained using CRPO outperform the same models
	trained using only supervised fine-tuning (SFT) or
	DPO without any creativity injections, as well as
	existing LLMs such as GPT-4o, generating more
	novel, diverse, and surprising outputs than all the
	baselines while maintaining high quality.
	Our main contributions are as follows:
	1. We introduce MUCE, a large-scale prefer-
	ence dataset consisting of more than 200,000
	human responses and ratings for more than
	30 creativity assessments. All tasks within
	MUCE are carefully chosen to provide valid
	measures of creativity in humans, making
	MUCE one of the largest psychologically
	valid datasets of human creativity for training preference models.
	2. We propose a novel flexible preference
	alignment method CRPO that injects sig-
	nals from several dimensions of creativ-
	ity into the existing preference optimization
	method DPO and train creativity-enhanced
	versions of Llama-3.1-8B-Instruct and
	Mistral-7B-Instruct-v0.3.
	3. We evaluate the effectiveness of our approach
	on a range of creativity tasks from MUCE, as
	well as external tasks from NOVELTYBENCH
	(Zhang et al., 2025), using both automated
	metrics and human evaluations. Our analy-
	sis shows that CRPO is a promising method
	for enhancing the creative capabilities of lan-
	guage models while maintaining quality.

	2 Related Work
	2.1 Large Language Model Creativity
	The potential of building LLM applications for
	creative industries has spurred significant research
	interest in AI creativity (Bellemare-Pepin et al.,
	2024), and many LLM tools marketed for assis-
	tance with creative tasks have been developed in
	the last few years (Wang et al., 2024b). Yet de-
	bates on whether AI is capable of true creativity are
	nearly as old as AI itself (Stein, 2014; Franceschelli
	and Musolesi, 2024; Sæbø and Brovold, 2024),
	with theoretical and philosophical arguments being
	made both for and against AI creativity (Ismay-
	ilzada et al., 2024a). Classic psychological theories
	of creativity generally agree that, for a product to
	be creative, it must be new, surprising, and valu-
	able (Boden, 2004). Creative tasks are also often
	characterized by high diversity (Padmakumar and
	He, 2023; Shypula et al., 2025), though diversity is
	only one facet of creativity (Johnson et al., 2021).
	Studies on LLM creativity have yielded conflicting
	findings: some suggest LLMs surpass human cre-
	ativity (Bellemare-Pepin et al., 2024; Zhao et al.,
	2024), others argue they fall short (Koivisto and
	Grassini, 2023; Chakrabarty et al., 2024; Ismay-
	ilzada et al., 2024b), while some conclude that
	LLM and human creativity are roughly equivalent
	(Gilhooly, 2024; Stevenson et al., 2022; Góes et al.,
	2023). Some works have suggested that LLMs
	lack novelty and surprise in their generations (Is-
	mayilzada et al., 2024a,b; Zhang et al., 2025; Tian
	et al., 2024; Chakrabarty et al., 2024) and their
	seemingly remarkable creative outputs may be in
	large part attributable to the remarkable creativity
	of human-written texts on the web (Lu et al., 2024).
	Some recent works have suggested improving the
	creativity of LLMs through prompting techniques
	(Tian et al., 2023; Mehrotra et al., 2024; Nair et al.,
	2024; Summers-Stay et al., 2023) and decoding
	strategies (Franceschelli and Musolesi, 2024; Meis-
	ter et al., 2023). In this work, we instead explore
	directly optimizing language models for creativity
	using human preferences extracted from responses
	to creativity assessments.
	2.2 Preference Learning
	Aligning LLMs to human preferences has proven
	effective in developing models that are helpful and
	useful to users, leading to the emergence of numer-
	ous preference learning methods (Gao et al., 2024;
	Ouyang et al., 2022; Rafailov et al., 2023). How-
	ever, prior work has highlighted a lack of diversity
	in LLM outputs (Anderson et al., 2024; Lanchantin
	et al., 2025; Wenger and Kenett, 2025; Padmaku-
	mar and He, 2023), with alignment often cited as
	a contributing factor (West and Potts, 2025). In
	response, recent research has explored modifica-
	tions to existing preference modeling techniques
	aimed at mitigating this reduction in diversity. One
	notable approach, Diverse Preference Optimiza-
	tion, proposes enhancing preference data creation
	by selecting preference pairs based on a diversity
	metric (Lanchantin et al., 2025). Another recent
	method introduces a modification to the optimiza-
	tion objective itself to incorporate a diversity signal
	(Chung et al., 2025). Both strategies have demon-
	strated effectiveness in promoting output diversity
	with minimal impact on output quality. However,
	as previously noted, diversity represents only one
	facet of creativity; true creativity also requires the
	capacity for novelty and surprise. In this work,
	we present a modular preference alignment frame-
	work for creativity that enables direct optimization
	across multiple dimensions of creative expression.

	3 Creative Preference Optimization
	According to its three-criterion definition, creativity
	involves the generation of novel, high-quality, and
	surprising ideas (Simonton, 2012; Boden, 2004;
	Runco and Jaeger, 2012). Moreover, creative out-
	puts tend to be highly diverse across individuals
	(Anderson et al., 2024). Therefore, to promote over-
	all creativity in LLM outputs, we propose to inject
	unsupervised metrics related to each dimension of
	creativity into the loss functions of standard pref-
	erence optimization methods. We use direct pref-
	erence optimization (DPO) (Rafailov et al., 2023)
	to illustrate our modifications to the loss function.
	Recall that in the standard formulation of DPO, a
	policy model (pθ) is directly optimized on a dataset
	of triples (x, yw, yl), where x, yw and yl refer to the model
	input (i.e., prompt), the preferred (i.e., chosen) model
	response and the dispreferred (i.e., rejected) model response,
	respectively. Using the ratio between the
	policy model's likelihood and that of the reference
	SFT model (pSFT) as an implicit reward, the training
	objective of DPO is defined as follows:
	$$\ell_{DPO} = -\log \sigma\left[\beta \log \frac{p_\theta(y_w \mid x)}{p_{SFT}(y_w \mid x)} - \beta \log \frac{p_\theta(y_l \mid x)}{p_{SFT}(y_l \mid x)}\right]$$
	$$L_{DPO} = \mathbb{E}_{(x, y_w, y_l) \sim D}\left[\ell_{DPO}\right] \tag{1}$$
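As a rough illustration of the per-example DPO term described above, the following minimal sketch computes the loss from four sequence log-likelihoods. The function names and the default `beta=0.1` are assumptions made for this example and are not taken from the paper; the placement of the minus sign follows the standard DPO formulation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_policy_w: float, logp_ref_w: float,
             logp_policy_l: float, logp_ref_l: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-likelihoods.

    *_w refers to the preferred response, *_l to the dispreferred one.
    beta=0.1 is a hypothetical default, not a value from the paper.
    """
    margin = (beta * (logp_policy_w - logp_ref_w)
              - beta * (logp_policy_l - logp_ref_l))
    return -log_sigmoid(margin)


# When the policy equals the reference model, the margin is zero
# and the loss is log(2).
loss_at_init = dpo_loss(-10.0, -10.0, -10.0, -10.0)
```

The loss falls below log(2) once the policy assigns relatively more probability to the preferred response than the reference model does.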
	A challenge with standard preference optimiza-
	tion methods is that they may significantly reduce
	the diversity of the responses LLMs generate, as the
	loss function encourages models to generate pre-
	ferred responses even if they are not very creative
	(West and Potts, 2025; Padmakumar and He, 2023;
	Anderson et al., 2024; Kirk et al., 2023; Xu et al.,
	2024; O'Mahony et al., 2024; Zhang et al., 2024;
	Wenger and Kenett, 2025). Existing approaches
	to address this in the preference optimization objective
	have centered around curating preference
	data based on various diversity metrics (Lanchantin
	et al., 2025) or incorporating extra regularization
	terms that encourage diverse generations while bal-
	ancing quality (Chung et al., 2025). For example,
	the recently proposed Diversified DPO (DDPO)
	method adds a scalar diversity term δw (i.e. diver-
	sity score of the preferred response) into the DPO
	loss (Chung et al., 2025):
	$$L_{DDPO} = \mathbb{E}_{(x, y_w, y_l) \sim D}\left[\delta_w \, \ell_{DPO}\right] \tag{2}$$
	While diversity is important for creativity, re-
	search in psychology has long established that truly
	creative responses also require novelty, surprise,
	and quality (Boden, 2004; Barron, 1955; Simon-
	ton, 2018). Therefore, we propose incorporating
	metrics for each of these, alongside diversity, into
	the preference loss in a modular structure, enabling
	the construction of different creativity models by
	combining these dimensions as needed.
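	$$L_{CDPO} = \mathbb{E}_{(x, y_w, y_l) \sim D}\left[(\lambda_d \delta_w + \lambda_n \nu_w + \lambda_s \xi_w + \lambda_q \gamma_w) \, \ell_{DPO}\right] \tag{3}$$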
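The weighted combination in the creative DPO loss can be sketched as follows, assuming the per-example DPO losses and the four creativity scores of each preferred response have already been computed. The dictionary keys and function names are hypothetical and chosen only for this illustration.

```python
DIMENSIONS = ("diversity", "novelty", "surprise", "quality")


def crpo_weight(scores: dict, weights: dict) -> float:
    """Weighted sum of the creativity scores of one preferred response.

    Dimensions missing from `weights` get an injection weight of 0.
    """
    return sum(weights.get(dim, 0.0) * scores[dim] for dim in DIMENSIONS)


def crpo_batch_loss(examples: list, weights: dict) -> float:
    """Mean over examples of (creativity weight * per-example DPO loss).

    `examples` is a list of (scores, dpo_loss) pairs.
    """
    total = sum(crpo_weight(scores, weights) * loss for scores, loss in examples)
    return total / len(examples)


# Hypothetical scores and losses, for illustration only.
batch = [
    ({"diversity": 0.5, "novelty": 0.2, "surprise": 0.9, "quality": 0.1}, 0.6),
    ({"diversity": 1.0, "novelty": 0.0, "surprise": 0.0, "quality": 0.0}, 0.4),
]

# Only the diversity weight is non-zero: this reduces to the DDPO weighting.
ddpo_like = crpo_batch_loss(batch, {"diversity": 1.0})
# All injection weights set to 1.
full = crpo_batch_loss(batch, {dim: 1.0 for dim in DIMENSIONS})
```

Keeping the weights in a dictionary mirrors the modular design: a variant that injects only novelty and quality is obtained by passing just those two keys.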
	In our proposed creative DPO loss, δw, νw, ξw
	and γw correspond to diversity, novelty, surprise
	and quality scores of the preferred response respec-
	tively and λd, λn, λs and λq are hyperparameters
	that control the effect of each score (we call them
	injection weights). In particular, when λd = 1,
	λn = 0, λs = 0 and λq = 0, we recover the DDPO
	loss. While there are multiple approaches for oper-
	ationalizing δw, νw, ξw and γw, we propose to use
	the following metrics for each:
	3.1 Diversity
	We use an inverse homogenization metric from
	Padmakumar and He (2023) similar to Chung et al.
	(2025). Specifically, given a prompt x and a set of
	(preferred) responses for x denoted as Yx, we com-
	pute the diversity score of any particular preferred
	response as the average pairwise semantic distance
	to all the other preferred responses in Yx:
	$$\delta_w = \frac{1}{|Y_x| - 1} \sum_{y_i \in Y_x \setminus \{y_w\}} \mathrm{semdis}(y_w, y_i) \tag{4}$$
	We use 1 − cos_sim(·, ·) as the semantic distance
	function (i.e., semdis(·, ·)).
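A minimal sketch of this diversity score, assuming each preferred response has already been mapped to an embedding vector (plain tuples of floats here; the toy two-dimensional vectors are purely illustrative):

```python
import math


def cos_sim(u, v) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semdis(u, v) -> float:
    """Semantic distance defined as 1 - cosine similarity."""
    return 1.0 - cos_sim(u, v)


def diversity_score(target_idx: int, embeddings: list) -> float:
    """Average distance from one preferred response to all the others."""
    others = [e for i, e in enumerate(embeddings) if i != target_idx]
    if not others:
        raise ValueError("need at least two responses for the same prompt")
    target = embeddings[target_idx]
    return sum(semdis(target, e) for e in others) / len(others)


# Toy embeddings: responses 0 and 2 are identical, response 1 is orthogonal.
toy_embeddings = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
score_first = diversity_score(0, toy_embeddings)   # (1 + 0) / 2
score_second = diversity_score(1, toy_embeddings)  # (1 + 1) / 2
```

The response that differs from both of its neighbours receives the higher score, which is the behaviour the metric is meant to reward.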
	3.2 Novelty
	We use a novelty metric similar to Karampiperis
	et al. (2014) where the novelty of a text is defined
	as the absolute difference between the average pair-
	wise semantic distances of words in the text and
	those of a reference corpus of texts. In particular,
	we define the set of preferred responses to a prompt
	x as a reference corpus (Yx) and define the novelty
	of a preferred response as follows:
	$$\nu_w = \left|\mathrm{DSI}(y_w) - \mathrm{DSI}(Y_x)\right| \tag{5}$$
	$$\mathrm{DSI}(T) = \frac{\sum_{i,j=1}^{|T|} \mathrm{semdis}(T_i, T_j)}{|T|}, \quad i \neq j \tag{6}$$
	Here T refers to a piece of text, Ti to the i-th word
	in the set of unique words in T, whose size is denoted |T|,
	and DSI(·) is divergent semantic integration, the
	average pairwise semantic distance between words in a
	text (Johnson et al., 2022).
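The novelty score can be sketched along the following lines. Several choices here are assumptions rather than details from the paper: word embeddings come from a hypothetical lookup table `embed`, the reference corpus is represented by pooling the words of all preferred responses, and DSI is normalised as a mean over distinct unordered word pairs, following the prose description of an average pairwise distance.

```python
import math
from itertools import combinations


def semdis(u, v) -> float:
    """Semantic distance defined as 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dsi(words: list, embed: dict) -> float:
    """Mean pairwise semantic distance over the unique words of a text.

    Normalising by the number of unordered pairs is an assumption.
    """
    unique = sorted(set(words))
    pairs = list(combinations(unique, 2))
    if not pairs:
        return 0.0
    return sum(semdis(embed[a], embed[b]) for a, b in pairs) / len(pairs)


def novelty(response_words: list, corpus_words: list, embed: dict) -> float:
    """Absolute difference between the DSI of a response and of the corpus."""
    return abs(dsi(response_words, embed) - dsi(corpus_words, embed))


# Hypothetical toy word vectors, for illustration only.
toy_embed = {"cat": (1.0, 0.0), "dog": (1.0, 0.0), "moon": (0.0, 1.0)}
corpus = ["cat", "dog", "moon"]
nov_far = novelty(["cat", "moon"], corpus, toy_embed)  # |1 - 2/3|
nov_near = novelty(["cat", "dog"], corpus, toy_embed)  # |0 - 2/3|
```

Because the score is an absolute difference, a response can count as novel either by being more semantically spread out than the reference corpus or by being less so.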
	3.3 Surprise
	We use Shannon surprise, the negative log-likelihood
	of the text, which has been widely
	used as a measure of surprise in prior work
	(Bunescu and Uduehi, 2022; Modirshanechi et al.,
	2022; Kuznetsova et al., 2013). More specifically,
	given a prompt x, we define the surprise of a par-
	ticular response as the exponentiated negative log-
	likelihood of the response (i.e. perplexity) condi-
	tioned on the prompt x and under some reference
	model S as follows:
	$$\xi_w = 2^{-\log P_S(y_w \mid x)} \tag{7}$$
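A small sketch of the surprise score, assuming the reference model S exposes per-token natural-log probabilities of the response given the prompt, and assuming the logarithm in the exponent is base 2. The optional per-token normalisation is an addition for this example (it matches the usual definition of perplexity) and is not stated in the text.

```python
import math


def surprise(token_logprobs: list, normalize: bool = False) -> float:
    """Exponentiated negative log2-likelihood of a response.

    token_logprobs: natural-log probabilities of each response token,
    conditioned on the prompt (assumed to come from the reference model).
    normalize: if True, divide the log-likelihood by the token count first.
    """
    log2_likelihood = sum(token_logprobs) / math.log(2)
    if normalize:
        log2_likelihood /= len(token_logprobs)
    # Note: 2.0 ** x raises OverflowError for extremely unlikely sequences,
    # so a practical implementation would probably stay in log space.
    return 2.0 ** (-log2_likelihood)


# Two tokens with probabilities 0.5 and 0.25: sequence probability is 0.125.
toy_logprobs = [math.log(0.5), math.log(0.25)]
raw = surprise(toy_logprobs)                        # 1 / 0.125
per_token = surprise(toy_logprobs, normalize=True)  # 2 ** 1.5
```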
	3.4 Quality
	Although a general quality scoring method is hard
	to define, reward models that are trained to output
	a high score to preferred answers can be used as
	a proxy (Zhang et al., 2025; Lambert et al., 2024).
	In particular, we define the quality of a preferred
	response given a prompt x as the score assigned by
	some reward model R: γw = R(yw | x).

	4 The MUCE Dataset
	To compile MUCE, we solicited data from the
	global creativity research community, specifically
	targeting researchers studying human creativity to
	obtain data from tasks known to be valid creativity
	measures. We specifically targeted datasets which
	contained complete metadata, including informa-
	tion about the task, language, and items that partici-
	pants responded to. We gathered additional data by
	performing a manual search of the Open Science
	Framework database1, and only retained data from
	peer-reviewed articles. In total, 43% of the data in
	MUCE has never been publicly released, making
	it unlikely that LLMs have seen the item-response
	combinations for the majority of our tasks.
	Every response in MUCE was rated for creativity
	by at least two raters, and in some cases up to
	75, employing a missing-raters design (Forthmann
	et al., 2025). While it is common practice to mea-
	sure creativity using multiple independent raters,
	individual raters may deliver unhelpful or noisy rat-
	ings if they did not understand the task instructions,
	had a different understanding of the rating criteria,
	or for other reasons (Forthmann et al., 2017). To
	account for this, we followed best practices for sub-
	jective scoring tasks by employing Judge Response
	Theory (Myszkowski and Storme, 2019) to check
	for raters whose ratings were uninformative in an
	information-theoretic sense. We fit JRT models to
	each task within MUCE, which gave us an infor-
	mation function for each rater across tasks. We
	then input the results from the JRT into a genetic
	algorithm (Schroeders et al., 2016), which identified
	a subset of raters per dataset that maximized
	the per-dataset rater information function.2 This
	process dropped uninformative raters from each
	dataset, enhancing the quality of the final creativity
	ratings. The individual raters' scores were aggregated
	via factor scores, as is best practice in creativity
	assessment (Silvia, 2011), and we rescaled the
	factor-transformed creativity scores into the integer
	range 10-50, as is done in prior work on automated
	creativity assessment (Organisciak et al., 2023).
	From this dataset, we create multiple data splits for
	training and testing. Full details about the dataset
	construction are in Appendix A.
	1 https://osf.io

	5 Experiments
	5.1 SFT and Preference Datasets
	While our MUCE dataset contains samples for mul-
	tiple languages, we focus on showing the effec-
	tiveness of CRPO on the English subset in this
	work and leave experiments using the full dataset
	as future work. From the base English MUCE
	dataset, we generate a preference dataset by creat-
	ing tuples of preferred and rejected responses to the
	same prompt, treating the response that received
	the higher creativity score as the preferred one. Past
	work has shown that data quality is one of the main
	factors behind preference model performance (Liu
	et al., 2024; Deng et al., 2025; Wang et al., 2024a).
	Therefore, we curate a high-quality SFT dataset
	of 5,275 samples (MUCE-SFT) and a preference
	dataset of 42,058 samples (MUCE-PREF) from
	the base MUCE, which we detail in Appendix B.
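The pairing step described above, where the higher-scored response to a prompt becomes the preferred one, might be sketched as follows. Forming all pairs per prompt and skipping ties are assumptions made for this example; the actual curation procedure is the one detailed in the paper's Appendix B. The prompt, responses, and scores are invented for illustration.

```python
from itertools import combinations


def build_preference_pairs(responses_by_prompt: dict) -> list:
    """Turn scored responses into (prompt, chosen, rejected) records.

    responses_by_prompt maps a prompt to a list of (response, score) tuples.
    Pairs with equal scores are skipped (an assumption of this sketch).
    """
    pairs = []
    for prompt, scored in responses_by_prompt.items():
        for (resp_a, score_a), (resp_b, score_b) in combinations(scored, 2):
            if score_a == score_b:
                continue
            if score_a > score_b:
                chosen, rejected = resp_a, resp_b
            else:
                chosen, rejected = resp_b, resp_a
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Hypothetical data on the 10-50 creativity scale.
toy_data = {
    "List an unusual use for a brick.": [
        ("doorstop", 18),
        ("grind it into pigment for paint", 41),
        ("paperweight", 18),
    ]
}
toy_pairs = build_preference_pairs(toy_data)
```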
	5.2 Training
	Models. As our base models, we use
	Llama-3.1-8B-Instruct (AI Meta, 2024)
	and Mistral-7B-Instruct-v0.3 (Jiang et al.,
	2023) and implement CRPO as described in
	Section 3.
	We first train our models using
	supervised fine-tuning (SFT model) for a single
	epoch on MUCE-SFT, and then apply preference
	optimization on the SFT model using CRPO and
	the MUCE-PREF dataset. We train all models using
	parameter-efficient tuning with LoRA using a
	rank of 128 and an alpha of 256 (Hu et al., 2022).
	Additional details on the training setup can be
	found in Appendix C.
	2 While ensuring that the algorithm kept at least two raters per dataset.

	[Figure 2: three panels plotting novelty, diversity, and surprise against quality. Legend: SFT, DPO, Llama-3.1-8B, Gemini-2.0, GPT-4o, Claude-3.7, CrPO-nov, CrPO-div, CrPO-sur, CrPO-nov-qua, CrPO-div-qua, CrPO-sur-qua, CrPO-qua, CrPO-nov-div-sur, CrPO-cre.]
	Figure 2: Results on the held-out evaluation suite from MUCE across all baselines and our models using
	Llama-3.1-8B-Instruct as a base model. nov, div, sur, qua, cre denote novelty, diversity, surprise, quality,
	and creativity, respectively. Results are averaged across tasks. Mistral-7B-Instruct-v0.3 results can be
	found in Appendix Figure 6.

	Creativity Injection. We compute creativity metric
	scores for each preferred response and inject
	them into the DPO objective function as described
	in Section 3. Since each metric is on a different
	scale and we would like to combine the effects
	of different injections, we normalize each
	score to a range of [0, 1] before injection. We
	vary the injection weights λd, λn, λs, λq accordingly3
	to train different suites of creative models.
	As novelty and diversity measures require
	a reference set to compute against, we adopt a
	prompt-level granularity and consider the set of
	responses for a given prompt as the reference corpus,
	similar to prior work (Chung et al., 2025).
	We use the jina-embeddings-v3 model (Sturua
	et al., 2024) to compute text embeddings for
	all metrics that rely on semantic distance. For
	surprise, we use instruction-tuned Gemma-2-27B
	(Google, 2024a) as our reference surprise model
	S. While our creativity preference dataset is already
	high-quality, we also experiment with injecting
	external quality signals to study their interaction
	with other creativity dimensions. Hence,
	for the quality measure, we employ an existing reward
	model, Skywork-Reward-Gemma-27B-v0.2
	(Liu et al., 2024), one of the top-performing
	models on RewardBench (Lambert et al., 2024), as
	our reference reward model R.
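The text states that each score is normalized to [0, 1] before injection but does not say how. One simple option, shown here purely as an assumption, is min-max scaling over the scores of a given metric:

```python
def min_max_normalize(values: list) -> list:
    """Rescale a list of scores linearly so they span [0, 1].

    Min-max scaling is an assumed choice; the paper only states that
    scores are normalized to [0, 1].
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All scores identical: no spread to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw surprise scores for three preferred responses.
normalized = min_max_normalize([2.0, 4.0, 6.0])
```

Putting every metric on the same scale keeps an injection weight of 1 comparable across novelty, diversity, surprise, and quality.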
	3 For example, to train a novelty model, we set λn = 1 and the others to 0, whereas for a novelty and quality model we set λn = 1 and λq = 1.

	5.3 Evaluation
	Tasks and Metrics. We evaluate all models
	across several dimensions of creativity on held-out
	prompts of various tasks and two held-out tasks.
	More specifically, we use 6 held-out prompts from the
	Real-Life Creative Problem Solving, Alternate Uses
	of Objects, Design Solutions, Hypothesis Generation,
	and Metaphors tasks, and 9 prompts from the two
	held-out tasks of Poems and Sentence Completion.
	For each prompt, we generate 16 responses from
	each model by varying the temperature, top-p, and
	top-k decoding parameters. Our final held-out evaluation
	suite contains 224 samples. We evaluate the
	responses on the dimensions of novelty, diversity,
	and surprise using the metrics described in Section 3.
	Additionally, to study the tradeoff between
	creativity and quality, we train a reward model
	on our preference dataset using instruction-tuned
	Gemma-2-9b (Google, 2024a) and use it to score
	the overall quality of model generations. More
	details about the evaluation setup can be found in
	Appendix D.
	Baselines. As baselines, we use the base models
	Llama-3.1-8B-Instruct (AI Meta, 2024)
	and Mistral-7B-Instruct-v0.3 (Jiang et al.,
	2023); SFT models, which are the base models
	supervised fine-tuned on MUCE-SFT; a vanilla
	DPO model trained on top of the SFT model
	using the MUCE-PREF dataset without any
	creativity injections; and three closed-source
	instruction-tuned LLMs, namely GPT-4o (OpenAI,
	2024), Claude-3.7-Sonnet (Anthropic, 2025),
	and Gemini-2.0-Flash (Google, 2024b).
	CRPO Models. We train several CRPO models
	corresponding to the different dimensions of
	creativity. More specifically, for each dimension,
	we train a model that is injected with a signal for
	the given dimension (e.g., CRPO-nov) and another
	model that is injected with a signal for both the
	given dimension and the quality dimension (e.g.,
	CRPO-nov-qua). We train the latter models to understand
	the tradeoff between other dimensions of
	creativity and quality that has been reported in
	previous research (Zhang et al., 2025; Lanchantin
	et al., 2025; Chung et al., 2025). Additionally, we
	train two creative models that inject all dimensions
	of creativity (denoted as CRPO-cre) and all except
	quality (denoted as CRPO-nov-div-sur). In
	all these experiments, the λ injection weights are set
	to 1 for simplicity. We perform a more detailed
	analysis of these hyperparameters in Section 6.1.

	[Figure 3: four panels plotting novelty, diversity, surprise, and quality against the corresponding injection weight (n, d, s, q), varied from 0.0 to 2.0.]
	Figure 3: Effect of injection weights for each dimension. Results are averaged across three seed runs.

	6 Results
	Figure 2 summarizes performance on our held-out
	evaluation suite across creativity dimensions
	for all baselines and CRPO models using
	Llama-3.1-8B-Instruct as a base. Results for
	Mistral-7B-Instruct-v0.3 can be found in Appendix
	Figure 6 and follow the same trends. First,
	we observe a clear separation between existing
	instruction-tuned LLMs and our models: while
	the former cluster around high quality but low nov-
	elty, diversity, and surprise, our models achieve
	high scores across all four dimensions. Second,
	for each creativity dimension, the model trained
	with that specific injection outperforms others on
	the same metric, confirming the effectiveness of
	targeted optimization, without a considerable drop
	in quality.
	Models that combine a creativity signal with an
	external quality signal
	(CRPO-{nov,div,sur}-qua) improve in quality
	but show reduced performance on the targeted di-
	mension, illustrating a trade-off. The same pattern
	holds when comparing the CRPO-nov-div-sur
	model to the full CRPO-cre model, further
	highlighting the balance between quality and other
	facets of creativity. Interestingly, the vanilla DPO
	model, without any creativity injections, already
	outperforms existing LLM baselines, demon-
	strating the strength of our preference dataset.
	Still, most of our creativity-optimized models
	significantly surpass DPO across all dimensions.
	Finally, the SFT model performs worst in quality
	and shows only comparable performance on other
	dimensions, reinforcing prior findings (Chung
	et al., 2025) about the limited generalizability of
	supervised fine-tuning in creative tasks, where no
	single correct answer exists.
	Overall, our results show that CRPO enhances
	multiple aspects of creativity with minimal im-
	pact on quality, offering a flexible and effective
	framework for creativity alignment in LLMs.
	6.1 Effect of Injection Weights
	While we set all injection weights to 1 for sim-
	plicity in our main evaluations, we also study the
	effect of the different injection values on the perfor-
	mance of models across dimensions. In particular,
	we vary the injection weights from 0 to 2.0 with
	an increment of 0.5 for all dimensions and report
	the averaged results across three seed runs in Fig-
	ure 3. We observe that across most dimensions,
	an injection weight of 0.5 yields the greatest per-
	formance gains, with further increases resulting in
	diminishing returns or slight performance degrada-
	tion. In terms of quality, the injection weight of
	1.0 results in the highest performance. Neverthe-
	less, any weight above 0 consistently outperforms
	the model without any injection with minimal drop
	in quality (Appendix Figure 8). We suggest tun-
	ing these values depending on the training dataset,
	underlying task, and the base model for the best
	performance.
	6.2 Human Evaluation
	In addition to automated metrics, we conduct a
	human evaluation to assess the real-world effectiveness
	of our approach. Due to the high cost
	of human studies, we focus on the overall creativity
	dimension using a single task (Sentence
	Completion), 4 prompts, 4 baselines (SFT, DPO,
	Llama-3.1-8B-Instruct, and GPT-4o), and 5
	CRPO variants (nov, div, sur, nov-div-sur,
	cre). In a blind pairwise setup, participants compared
	responses from a baseline and a CRPO
	model for creativity, unaware that the texts were
	AI-generated. A total of 320 comparisons were
	collected with balanced sampling across models.
	Additional details are in Appendix D.1.

	[Figure 4: matrix of win rates (%) for human evaluation of creativity, with our models (CrPO-cre, CrPO-nov-div-sur, CrPO-nov, CrPO-div, CrPO-sur) compared against baseline models (DPO, GPT-4o, Llama-3.1-8B, SFT). Cell values in extraction order: 50.0, 43.8, 56.2, 93.8, 56.2, 56.2, 68.8, 100.0, 37.5, 37.5, 37.5, 75.0, 68.8, 37.5, 18.8, 100.0, 43.8, 56.2, 43.8, 93.8.]
	Figure 4: Human evaluation results measured by win
	rates. Participants were asked to make a pairwise comparison
	between our models and baselines with respect
	to overall creativity.

	Figure 4 presents the win rates. The CRPO-nov-div-sur
	model consistently outperforms all
	baselines, particularly Llama-3.1-8B-Instruct,
	by a wide margin. In contrast, the full CRPO-cre
	model lags slightly, reflecting the creativity-quality
	tradeoff seen in automated evaluations. Notably,
	CRPO models achieve especially strong gains over
	SFT, reinforcing previous findings.
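To make the win-rate figures concrete, the sketch below works through the arithmetic under the assumption that the 320 comparisons are split evenly across the 5 CRPO variants and 4 baselines. Under that assumption each model pair receives 16 comparisons, so win rates move in steps of 6.25 percentage points, which would be consistent with reported values such as 43.8 and 93.8. The function names are hypothetical.

```python
N_CRPO_VARIANTS = 5
N_BASELINES = 4
TOTAL_COMPARISONS = 320


def comparisons_per_pair(total: int, n_ours: int, n_baselines: int) -> int:
    """Comparisons per (CRPO variant, baseline) pair under an even split."""
    n_pairs = n_ours * n_baselines
    if total % n_pairs != 0:
        raise ValueError("total does not divide evenly across model pairs")
    return total // n_pairs


def win_rate(wins: int, comparisons: int) -> float:
    """Percentage of pairwise comparisons won by the CRPO model."""
    return 100.0 * wins / comparisons


per_pair = comparisons_per_pair(TOTAL_COMPARISONS, N_CRPO_VARIANTS, N_BASELINES)
seven_wins = win_rate(7, per_pair)       # 7 of 16 comparisons won
fifteen_wins = win_rate(15, per_pair)    # 15 of 16 comparisons won
```

With only 16 comparisons per pair, a single judgment shifts a cell by more than six points, so small differences between cells should be read with caution.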
	[Figure 5: plot of novelty against quality (quality axis ticks from 3.00 to 5.00). Legend: SFT, DPO, Llama-3.1-8B, Gemini-2.0, GPT-4o, Claude-3.7, CrPO-nov, CrPO-div, CrPO-sur, CrPO-nov-qua, CrPO-div-qua, CrPO-sur-qua, CrPO-qua, CrPO-nov-div-sur, CrPO-cre.]
	Figure 5: Evaluation results on NOVELTYBENCH, using
	the novelty and quality metrics defined in Zhang et al.
	(2025).
	6.3 NOVELTYBENCH Evaluation
	While we demonstrate the effectiveness of our ap-
	proach on the MUCE held-out set using automated
	metrics, we also evaluate generalization on external
	benchmarks using the recently introduced NOVEL-
	TYBENCH (Zhang et al., 2025). This benchmark
	includes tasks spanning randomness, factual knowl-
	edge, creative writing, and subjectivity. Following
	the recommended evaluation setup, we benchmark
	all baselines and CRPO variants on a curated 100-prompt
	subset, using the benchmark's novelty and
	quality metrics. Full details are in Appendix D.2.
	Figure 5 shows novelty vs. quality scores across
	all models and tasks.
	As in our internal eval-
	uation, we observe a clear separation: existing
	LLM baselines cluster around lower novelty and
	variable quality, while our models consistently
	achieve high scores on both dimensions.
	Notably,
	although our models outperform SFT on novelty,
	the SFT model surprisingly achieves higher
	quality, beating both baselines by a large margin
	and our models by a smaller one. This aligns
	with findings from NOVELTYBENCH (Zhang et al.,
	2025), where smaller models like Gemma-2-2B-it
	and Llama-3.1-8B-Instruct often surpass larger
	ones in quality.
	Overall, our models set a new state-of-the-art
	on the NOVELTYBENCH leaderboard in terms of
	novelty. 4

	Conclusion
	We introduce CRPO, a flexible methodology for
	enhancing the creativity of LLMs. Leveraging a
	novel large-scale human preference dataset focused
	on creativity, we show that models aligned with
	CRPO produce generations that are not only novel,
	diverse, and surprising, but also high in quality
	on both our held-out evaluation suite and the
	external NOVELTYBENCH dataset. Human evalua-
	tions further confirm that raters consistently judge
	our model's outputs to be more creative than those
	of several strong baselines, highlighting the po-
	tential of our approach to boost LLM creativity.
	While our experiments focus on smaller models
	such as Llama-3.1-8B and an English-only dataset,
	future work could explore the scalability of CRPO
	to larger models, multilingual settings and other
	preference optimization methods.
	4 https://novelty-bench.github.io

	Limitations
	Due to constraints on both computational resources
	and budget for human studies, we were unable to
	evaluate CRPO on any languages other than En-
	glish. Multilingual creativity assessment using gen-
	erative AI remains a challenging problem and an
	active area of research (Haase et al., 2025). While
	we believe our data represents a valuable resource
	for the community, future work will need to test
	our methods in multilingual settings to ensure mul-
	tilingual generalization. These compute constraints
	also prevented us from evaluating CRPO on larger
	open-weight models, making scaling trends diffi-
	cult to predict. We retained only samples with full
	agreement for the creativity score when training our
	models. While this aligns with best practices for
	creativity measurement in psychology (Cseh and
	Jeffries, 2019), it may also mask genuine sources of
	rater disagreement that should be modeled. Finally,
	we acknowledge that, much like other datasets used
	to align LLMs, the preferences represented by our
	annotator population likely do not reflect the full
	range of human preferences, which could bias our
	models' generations (Yeh et al., 2024). We believe
	that the large-scale and multilingual nature of our
	preference data likely makes it one of the most rep-
	resentative creativity datasets currently available,
	but stress that future work should consider issues of
	bias and fairness more carefully for LLM creativity
	assessment.
	Ethical Considerations
	We emphasize that our models should not be used
	for safety-critical applications, as the relationship
	between creativity and alignment with other val-
	ues remains underexplored. Notably, our dataset
	contains responses to tests of malevolent creativity
	that are by definition unsafe for models to generate.
	We also observed qualitatively that CRPO mod-
	els were more likely to generate unsafe or toxic
	responses even to prompts that did not explicitly
	request such behaviors. We believe that our data
	is valuable for red-teaming evaluations on tasks re-
	quiring creativity, and that aligning models on these
	malevolent responses could be beneficial for under-
	standing how malicious actors might use creativity-
	enhanced models to execute unsafe goals. How-
	ever, we also acknowledge the ethical concerns that
	the release of our models and datasets would raise,
	and believe that restricting access to only those
	who have signed a license agreement is the best
	approach for balancing safety with continued sci-
	entific advancement. While we believe our results
	demonstrate how aligning LLMs with carefully de-
	signed human creativity datasets can significantly
	improve the novelty and diversity of their generations,
	it remains unclear how to optimize for
	creativity while preserving guardrails that prevent
	unsafe behavior.
	We also acknowledge the broader debates around
	the valid use of AI in social-behavioral research
	(Sun et al., 2025) and concerns surrounding AI au-
	tomation of industries requiring creativity (Wilkin-
	son, 2023) in which our work is situated. While
	the over-reliance on AI for creative tasks to the
	detriment of human welfare is a legitimate con-
	cern, AI has also been acknowledged for its poten-
	tial to enhance human creativity above and beyond
	what might be possible otherwise (de Chantal et al.,
	2025). Creativity is a vital skill for future knowl-
	edge workers to master (Forum, 2025), and we
	believe that enhancing the creativity of AI is an
	important prerequisite for developing AI systems
	capable of training humans to be more creative.
	Acknowledgements
	Mete and Lonneke gratefully acknowledge the sup-
	port of the Swiss National Science Foundation
	(grant 205121_207437: C - LING). R.E.B. is sup-
	ported by grants from the US National Science
	Foundation [DRL-1920653; DRL-240078; DUE-
	2155070].
	References
	Sergio Agnoli, Giovanni E Corazza, and Mark A
	Runco. 2016. Estimating creativity with a multiple-
	measurement approach within scientific and artistic
	domains. Creativity Research Journal, 28(2):171–176.
	AI Meta. 2024. Llama 3 model card.
	Barrett R Anderson, Jash Hemant Shah, and Max
	Kreminski. 2024. Homogenization effects of large
	language models on human creative ideation. In
	Proceedings of the 16th conference on creativity &
	cognition, pages 413–425.
	Anthropic. 2025. Claude 3.7 sonnet and claude code.
	Frank Barron. 1955. The disposition toward original-
	ity. The Journal of Abnormal and Social Psychology,
	51(3):478.
	Roger Beaty, Robert A Cortes, Simone Luchini, John D
	Patterson, Boris Forthmann, Brendan S Baker, Baptiste
	Barbot, Mariale Hardiman, and Adam Green.
	2024. The scientific creative thinking test (sctt): Re-
	liability, validity, and automated scoring. PsyArxiv
	Preprints.
	Roger E Beaty and Dan R Johnson. 2021. Automating
	creativity assessment with semdis: An open platform
	for computing semantic distance. Behavior research
	methods, 53(2):757–780.
	Antoine Bellemare-Pepin, François Lespinasse, Philipp
	Thölke, Yann Harel, Kory Mathewson, Jay A Ol-
	son, Yoshua Bengio, and Karim Jerbi. 2024. Diver-
	gent creativity in humans and large language models.
	arXiv preprint arXiv:2405.13012.
	Margaret A Boden. 2004. The creative mind: Myths
	and mechanisms. Routledge.
	Tom Brown, Benjamin Mann, Nick Ryder, Melanie
	Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
	Neelakantan, Pranav Shyam, Girish Sastry, Amanda
	Askell, and 1 others. 2020. Language models are
	few-shot learners. Advances in neural information
	processing systems, 33:1877–1901.
	Sébastien Bubeck, Varun Chandrasekaran, Ronen El-
	dan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Pe-
	ter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg,
	Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro,
	and Yi Zhang. 2023. Sparks of artificial general in-
	telligence: Early experiments with gpt-4. Preprint,
	arXiv:2303.12712.
	Razvan C. Bunescu and Oseremen O. Uduehi. 2022.
	Distribution-based measures of surprise for creative
	language: Experiments with humor and metaphor.
	Proceedings of the 3rd Workshop on Figurative Lan-
	guage Processing (FLP).
	Tuhin Chakrabarty, Philippe Laban, Divyansh Agar-
	wal, Smaranda Muresan, and Chien-Sheng Wu. 2024.
	Art or artifice? large language models and the false
	promise of creativity. In Proceedings of the 2024
	CHI Conference on Human Factors in Computing
	Systems, pages 1–34.
	Soma Chaudhuri, Alan Pickering, and Joydeep Bhat-
	tacharya. 2025. Evaluating poetry: Navigating the
	divide between aesthetical and creativity judgments.
	The Journal of Creative Behavior, 59(1):e683.
	Qi Chen, Bowen Zhang, Gang Wang, and Qi Wu.
	2024. Weak-eval-strong: Evaluating and eliciting
	lateral thinking of llms with situation puzzles. arXiv
	preprint arXiv:2410.06733.
	John Joon Young Chung, Ece Kamar, and Saleema
	Amershi. 2023. Increasing diversity while maintaining
	accuracy: Text data generation with large
	language models and human interventions. arXiv
	preprint arXiv:2306.04140.
	John Joon Young Chung, Vishakh Padmakumar, Melissa
	Roemmele, Yuqian Sun, and Max Kreminski. 2025.
	Modifying large language model post-training
	for diverse creative writing. arXiv preprint
	arXiv:2503.17126.
	Katherine N Cotter, Jean E Pretz, and James C Kaufman.
	2016. Applicant extracurricular involvement predicts
	creativity better than traditional admissions factors.
	Psychology of Aesthetics, Creativity, and the Arts,
	10(1):2.
	Genevieve M Cseh and Karl K Jeffries. 2019. A scat-
	tered cat: A critical evaluation of the consensual as-
	sessment technique for creativity research. Psychol-
	ogy of Aesthetics, Creativity, and the Arts, 13(2):159.
	Pier Luc de Chantal, Roger Beaty, Antonio Laverghetta,
	Jimmy Pronchick, John Patterson, Peter Organisciak,
	Katarzyna Potega vel Zabik, Baptiste Barbot, and
	Maciej Karwowski. 2025. Artificial intelligence en-
	hances human creativity through real-time evaluative
	feedback.
	Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang,
	and Xiangnan He. 2025. Less is more: Improving
	llm alignment via preference data selection. arXiv
	preprint arXiv:2502.14560.
	Paul V DiStefano, John D Patterson, and Roger E Beaty.
	2024. Automatic scoring of metaphor creativity with
	large language models. Creativity Research Journal,
	pages 1–15.
	Paul V DiStefano, Daniel Zeitlen, Janet Rafner, Pier-
	Luc de Chantal, Aoran Peng, Scarlett Miller, and
	Roger Beaty. 2025. Evaluating ai's ideas: The role
	of individual creativity and expertise in human-ai
	co-creativity.
	Angela Fan, Mike Lewis, and Yann Dauphin. 2018.
	Hierarchical neural story generation. In Proceedings
	of the 56th Annual Meeting of the Association for
	Computational Linguistics (Volume 1: Long Papers),
	pages 889–898, Melbourne, Australia. Association
	for Computational Linguistics.
	Li Fan, Kaixiang Zhuang, Xueyang Wang, Jingyi Zhang,
	Cheng Liu, Jing Gu, and Jiang Qiu. 2023. Explor-
	ing the behavioral and neural correlates of seman-
	tic distance in creative writing. Psychophysiology,
	60(5):e14239.
	Boris Forthmann, Benjamin Goecke, and Roger E Beaty.
	2025. Planning missing data designs for human ratings
	in creativity research: A practical guide. Creativity
	Research Journal, 37(1):167–178.
	Boris Forthmann, Heinz Holling, Nima Zandi, Anne
	Gerwig, Pınar Çelik, Martin Storme, and Todd
	Lubart. 2017. Missing creativity: The effect of cognitive
	workload on rater (dis-)agreement in subjective
	divergent-thinking scores. Thinking Skills and Creativity,
	23:129–139.
	World Economic Forum. 2025. Future of jobs report.
	Giorgio Franceschelli and Mirco Musolesi. 2024. Cre-
	ative beam search: Llm-as-a-judge for improving re-
	sponse generation. arXiv preprint arXiv:2405.00099.

	Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai,
	Zhe Yang, Liang Chen, Helan Hu, Runxin Xu,
	Qingxiu Dong, Ce Zheng, Wen Xiao, Ge Zhang,
	Daoguang Zan, Keming Lu, Bowen Yu, Dayiheng
	Liu, Zeyu Cui, Jian Yang, Lei Sha, and 5 others.
	2024. Towards a unified view of preference learn-
	ing for large language models: A survey. Preprint,
	arXiv:2409.02795.
	Ken Gilhooly. 2024. Ai vs humans in the aut: Simula-
	tions to llms. Journal of Creativity, 34(1):100071.
	Benjamin Goecke, Paul V DiStefano, Wolfgang As-
	chauer, Kurt Haim, Roger Beaty, and Boris Forth-
	mann. 2024a. Automated scoring of scientific cre-
	ativity in german. The Journal of Creative Behavior,
	58(3):321–327.
	Benjamin Goecke, Selina Weiss, and Oliver Wilhelm.
	2024b. Driving factors of individual differences in
	broad retrieval ability: Gr is more than the sum of its
	parts. Journal of Experimental Psychology: Learn-
	ing, Memory, and Cognition.
	Fabrício Góes, Piotr Sawicki, Marek Grzes, Marco
	Volpe, and Jacob Watson. 2023. Pushing gpt's creativity
	to its limits: Alternative uses and torrance
	tests. In ICCC.
	Google. 2024a. Gemma 2: Improving open language
	models at a practical size.
	Google. 2024b. Introducing gemini 2.0: our new ai
	model for the agentic era.
	Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp
	Schmid, Zachary Mueller, Sourab Mangrulkar, Marc
	Sun, and Benjamin Bossan. 2022. Accelerate: Train-
	ing and inference at scale made simple, efficient and
	adaptable. https://github.com/huggingface/accelerate.
	J.P. Guilford. 1967. The Nature of Human Intelligence.
	McGraw-Hill series in psychology. McGraw-Hill.
	Jennifer Haase, Paul H. P. Hanel, and Sebastian Pokutta.
	2025. S-dat: A multilingual, genai-driven frame-
	work for automated divergent thinking assessment.
	Preprint, arXiv:2505.09068.
	Shirley Anugrah Hayati, Minhwa Lee, Dheeraj Ra-
	jagopal, and Dongyeop Kang. 2023. How far can
	we extract diverse perspectives from large language
	models? arXiv preprint arXiv:2311.09799.
	Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023.
	Debertav3: Improving deberta using electra-style pre-
	training with gradient-disentangled embedding shar-
	ing. Preprint, arXiv:2111.09543.
	Ruizhi He, Kaixiang Zhuang, Lijun Liu, Ke Ding,
	Xi Wang, Lei Fu, Jiang Qiu, and Qunlin Chen. 2022.
	The impact of knowledge on poetry composition: An
	fmri investigation. Brain and language, 235:105202.
	Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and
	Yejin Choi. 2019. The curious case of neural text
	degeneration. arXiv preprint arXiv:1904.09751.
	Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan
	Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang,
	Weizhu Chen, and 1 others. 2022. Lora: Low-rank
	adaptation of large language models. ICLR, 1(2):3.
	Shulin Huang, Shirong Ma, Yinghui Li, Mengzuo
	Huang, Wuhe Zou, Weidong Zhang, and Haitao
	Zheng. 2024. LatEval: An interactive LLMs evalu-
	ation benchmark with incomplete information from
	lateral thinking puzzles. In Proceedings of the 2024
	Joint International Conference on Computational
	Linguistics, Language Resources and Evaluation
	(LREC-COLING 2024), pages 10186–10197, Torino,
	Italia. ELRA and ICCL.
	Mete Ismayilzada, Debjit Paul, Antoine Bosselut,
	and Lonneke van der Plas. 2024a. Creativity in
	ai: Progresses and challenges. arXiv preprint
	arXiv:2410.17218.
	Mete Ismayilzada, Claire Stevenson, and Lonneke
	van der Plas. 2024b. Evaluating creative short story
	generation in humans and large language models.
	arXiv preprint arXiv:2411.02316.
	Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men-
	sch, Chris Bamford, Devendra Singh Chaplot, Diego
	de las Casas, Florian Bressand, Gianna Lengyel, Guil-
	laume Lample, Lucile Saulnier, Lélio Renard Lavaud,
	Marie-Anne Lachaux, Pierre Stock, Teven Le Scao,
	Thibaut Lavril, Thomas Wang, Timothée Lacroix,
	and William El Sayed. 2023. Mistral 7b. Preprint,
	arXiv:2310.06825.
	Dan R Johnson, Andrew S Cuthbert, and Mara E Tynan.
	2021. The neglect of idea diversity in creative idea
	generation and evaluation. Psychology of Aesthetics,
	Creativity, and the Arts, 15(1):125.
	Dan Richard Johnson, J. Kaufman, Brendan S. Baker,
	John D. Patterson, Baptiste Barbot, Adam E. Green,
	Janet G. van Hell, Evan S. Kennedy, Grace F Sulli-
	van, Christa L. Taylor, Thomas Ward, and Roger E.
	Beaty. 2022. Divergent semantic integration (dsi):
	Extracting creativity from narratives with distributional
	semantic modeling. Behavior Research Methods,
	55:3726–3759.
	Hansika Kapoor, Hreem Mahadeshwar, Sarah Rezaei,
	Roni Reiter-Palmon, and James C Kaufman. 2024.
	The ties that bind: Low morals, high deception, and
	dark creativity. Creativity Research Journal, pages
	1–20.
	Pythagoras Karampiperis, Antonis Koukourikos, and
	Evangelia Koliopoulou. 2014. Towards machines for
	measuring creativity: The use of computational tools
	in storytelling activities. In 2014 IEEE 14th International
	Conference on Advanced Learning Technologies,
	pages 508–512.

	Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis,
	Jelena Luketina, Eric Hambro, Edward Grefenstette,
	and Roberta Raileanu. 2023. Understanding the ef-
	fects of rlhf on llm generalisation and diversity. arXiv
	preprint arXiv:2310.06452.
	Mika Koivisto and Simone Grassini. 2023. Best humans
	still outperform artificial intelligence in a creative
	divergent thinking task. Scientific reports,
	13(1):13601.
	Polina Kuznetsova, Jianfu Chen, and Yejin Choi. 2013.
	Understanding and quantifying creativity in lexical
	composition. In Conference on Empirical Methods
	in Natural Language Processing.
	Nathan Lambert, Valentina Pyatkin, Jacob Morrison,
	LJ Miranda, Bill Yuchen Lin, Khyathi Chandu,
	Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi,
	and 1 others. 2024. Rewardbench: Evaluating re-
	ward models for language modeling. arXiv preprint
	arXiv:2403.13787.
	Jack Lanchantin, Angelica Chen, Shehzaad Dhuliawala,
	Ping Yu, Jason Weston, Sainbayar Sukhbaatar, and
	Ilia Kulikov. 2025. Diverse preference optimization.
	arXiv preprint arXiv:2501.18101.
	Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Ju-
	jie He, Chaojie Wang, Shuicheng Yan, Yang Liu,
	and Yahui Zhou. 2024. Skywork-reward: Bag of
	tricks for reward modeling in llms. arXiv preprint
	arXiv:2410.18451.
	Ximing Lu, Melanie Sclar, Skyler Hallinan, Niloofar
	Mireshghallah, Jiacheng Liu, Seungju Han, Allyson
	Ettinger, Liwei Jiang, Khyathi Chandu, Nouha Dziri,
	and 1 others. 2024. Ai as humanity's salieri: Quantifying
	linguistic creativity of language models via
	systematic attribution of machine text against web
	text. arXiv preprint arXiv:2410.04265.
	Simone A Luchini, Nadine T Maliakkal, Paul V DiS-
	tefano, Antonio Laverghetta Jr, John D Patterson,
	Roger E Beaty, and Roni Reiter-Palmon. 2025. Auto-
	mated scoring of creative problem solving with large
	language models: A comparison of originality and
	quality ratings. Psychology of Aesthetics, Creativity,
	and the Arts.
	Pronita Mehrotra, Aishni Parab, and Sumit Gulwani.
	2024. Enhancing creativity in large language mod-
	els through associative thinking strategies. arXiv
	preprint arXiv:2405.06715.
	Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan
	Cotterell. 2023. Locally typical sampling. Transactions
	of the Association for Computational Linguistics,
	11:102–121.
	Alireza Modirshanechi, Johanni Brea, and Wulfram
	Gerstner. 2022. A taxonomy of surprise definitions.
	Journal of Mathematical Psychology, 110:102712.
	Nils Myszkowski and Martin Storme. 2019. Judge re-
	sponse theory? a call to upgrade our psychometrical
	account of creativity judgments. Psychology of Aes-
	thetics, Creativity, and the Arts, 13(2):167.
	Lakshmi Nair, Evana Gizzi, and Jivko Sinapov. 2024.
	Creative problem solving in large language and vi-
	sion models - what would it take? In Findings of the
	Association for Computational Linguistics: EMNLP
	2024, pages 11978–11994, Miami, Florida, USA.
	Association for Computational Linguistics.
	OpenAI. 2024. Gpt-4o system card. Preprint,
	arXiv:2410.21276.
	Peter Organisciak, Selcuk Acar, Denis Dumas, and
	Kelly Berthiaume. 2023. Beyond semantic distance:
	Automated scoring of divergent thinking greatly im-
	proves with large language models. Thinking Skills
	and Creativity, 49:101356.
	Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida,
	Carroll Wainwright, Pamela Mishkin, Chong Zhang,
	Sandhini Agarwal, Katarina Slama, Alex Ray, and 1
	others. 2022. Training language models to follow in-
	structions with human feedback. Advances in neural
	information processing systems, 35:27730–27744.
	Laura O'Mahony, Leo Grinsztajn, Hailey Schoelkopf,
	and Stella Biderman. 2024. Attributing mode col-
	lapse in the fine-tuning of large language models. In
	ICLR 2024 Workshop on Mathematical and Empiri-
	cal Understanding of Foundation Models.
	Vishakh Padmakumar and He He. 2023. Does writ-
	ing with language models reduce content diversity?
	arXiv preprint arXiv:2309.05196.
	John D Patterson, Hannah M Merseal, Dan R Johnson,
	Sergio Agnoli, Matthijs Baas, Brendan S Baker, Bap-
	tiste Barbot, Mathias Benedek, Khatereh Borhani,
	Qunlin Chen, and 1 others. 2023. Multilingual se-
	mantic distance: Automatic verbal creativity assess-
	ment in many languages. Psychology of Aesthetics,
	Creativity, and the Arts, 17(4):495.
	Corinna Perchtold-Stefan, Hansika Kapoor, James C
	Kaufman, Hreem Mahadeshwar, and Alison Fernan-
	des. 2024. Development and neuronal validation of
	the dark creativity deception battery (dcdb).
	Corinna M Perchtold-Stefan, Christian Rominger, Ilona
	Papousek, and Andreas Fink. 2023. Functional eeg
	alpha activation patterns during malevolent creativity.
	Neuroscience, 522:98–108.
	Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo-
	pher D Manning, Stefano Ermon, and Chelsea Finn.
	2023. Direct preference optimization: Your language
	model is secretly a reward model. Advances in
	Neural Information Processing Systems, 36:53728–53741.
	Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase,
	and Yuxiong He. 2020. Zero: Memory optimizations
	toward training trillion parameter models. In SC20:
	International Conference for High Performance Computing,
	Networking, Storage and Analysis, pages 1–16.
	IEEE.
	Tuval Raz, Simone Luchini, Roger Beaty, and Yoed
	Kenett. 2024. Bridging the measurement gap: A
	large language model method of assessing open-
	ended question complexity. In Proceedings of the
	Annual Meeting of the Cognitive Science Society, vol-
	ume 46.
	Mark A Runco and Garrett J Jaeger. 2012. The standard
	definition of creativity. Creativity research journal,
	24(1):92–96.
	Solve Sæbø and Helge Brovold. 2024. On the stochas-
	tics of human and artificial creativity. arXiv preprint
	arXiv:2403.06996.
	Keita Saito, Akifumi Wachi, Koki Wataoka, and
	Youhei Akimoto. 2023. Verbosity bias in preference
	labeling by large language models. Preprint,
	arXiv:2310.10076.
	Janika Saretzki, Rosalie Andrae, Boris Forthmann, and
	Mathias Benedek. 2024. Investigation of response ag-
	gregation methods in divergent thinking assessments.
	The Journal of Creative Behavior.
	Ulrich Schroeders, Oliver Wilhelm, and Gabriel Olaru.
	2016. Meta-heuristics in short scale construction:
	Ant colony optimization and genetic algorithm. PloS
	one, 11(11):e0167110.
	Alexander Shypula, Shuo Li, Botong Zhang, Vishakh
	Padmakumar, Kayo Yin, and Osbert Bastani. 2025.
	Evaluating the diversity and quality of llm generated
	content. arXiv preprint arXiv:2504.12522.
	Paul J Silvia. 2011. Subjective scoring of divergent
	thinking: Examining the reliability of unusual uses,
	instances, and consequences tasks. Thinking Skills
	and Creativity, 6(1):24–30.
	Dean Keith Simonton. 2012. Taking the us patent of-
	fice criteria seriously: A quantitative three-criterion
	creativity definition and its implications. Creativity
	research journal, 24(2-3):97–106.
	Dean Keith Simonton. 2018. Defining creativity: Don't
	we also need to define what is not creative? The
	Journal of Creative Behavior, 52(1):80–90.
	Morris I Stein. 2014. Stimulating creativity: Individual
	procedures. Academic Press.
	Claire E. Stevenson, Iris Smal, Matthijs Baas, Raoul
	Grasman, and Han L. J. van der Maas. 2022. Putting
	gpt-3's creativity to the (alternative uses) test. In
	ICCC.
	Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram,
	Michael Günther, Bo Wang, Markus Krimmel, Feng
	Wang, Georgios Mastrapas, Andreas Koukounas, An-
	dreas Koukounas, Nan Wang, and Han Xiao. 2024.
	jina-embeddings-v3: Multilingual embeddings with
	task lora. Preprint, arXiv:2409.10173.
	Douglas Summers-Stay, Stephanie M. Lukin, and
	Clare R. Voss. 2023. Brainstorm, then select: a gen-
	erative language model improves its creativity score.
	Huaman Sun, Jiaxin Pei, Minje Choi, and David Jur-
	gens. 2025. Sociodemographic prompting is not yet
	an effective approach for simulating subjective judg-
	ments with llms. In Proceedings of the 2025 Confer-
	ence of the Nations of the Americas Chapter of the
	Association for Computational Linguistics: Human
	Language Technologies (Volume 2: Short Papers),
	pages 845–854.
	Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-
	Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan
	Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican,
	and 1 others. 2023. Gemini: a family of
	highly capable multimodal models. arXiv preprint
	arXiv:2312.11805.
	Yufei Tian, Tenghao Huang, Miri Liu, Derek Jiang,
	Alexander Spangher, Muhao Chen, Jonathan May,
	and Nanyun Peng. 2024. Are large language models
	capable of generating human-level narratives? arXiv
	preprint arXiv:2407.13248.
	Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Ro-
	nan Le Bras, Raja Marjieh, Nanyun Peng, Yejin Choi,
	Thomas L Griffiths, and Faeze Brahman. 2023. Mac-
	gyver: Are large language models creative problem
	solvers? arXiv preprint arXiv:2311.09682.
	Binghai Wang, Rui Zheng, Lu Chen, Zhiheng Xi, Wei
	Shen, Yuhao Zhou, Dong Yan, Tao Gui, Qi Zhang,
	and Xuan-Jing Huang. 2024a. Reward modeling
	requires automatic adjustment based on data quality.
	In Findings of the Association for Computational
	Linguistics: EMNLP 2024, pages 4041–4064.
	Tiannan Wang, Jiamin Chen, Qingrui Jia, Shuai Wang,
	Ruoyu Fang, Huilin Wang, Zhaowei Gao, Chunzhao
	Xie, Chuou Xu, Jihong Dai, and 1 others. 2024b.
	Weaver: Foundation models for creative writing.
	arXiv preprint arXiv:2401.17268.
	Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel,
	Barret Zoph, Sebastian Borgeaud, Dani Yogatama,
	Maarten Bosma, Denny Zhou, Donald Metzler, and
	1 others. 2022. Emergent abilities of large language
	models. arXiv preprint arXiv:2206.07682.
	Selina Weiss, Benjamin Goecke, and Oliver Wilhelm.
	2024. How much retrieval ability is in originality?
	The Journal of Creative Behavior, 58(3):370–387.
	Selina Weiss, Sally Olderbak, and Oliver Wilhelm. 2023.
	Conceptualizing and measuring ability emotional cre-
	ativity. Psychology of Aesthetics, Creativity, and the
	Arts.
	Emily Wenger and Yoed Kenett. 2025. We're different,
	we're the same: Creative homogeneity across llms.
	arXiv preprint arXiv:2501.19361.
	Peter West and Christopher Potts. 2025. Base models
	beat aligned models at randomness and creativity.
	Preprint, arXiv:2505.00047.

	Alissa Wilkinson. 2023. Hollywood's writers are on
	strike. here's why that matters.
	Justin Wong, Yury Orlovskiy, Michael Luo, Sanjit A
	Seshia, and Joseph E Gonzalez. 2024. Simplestrat:
	Diversifying language model generation with stratifi-
	cation. arXiv preprint arXiv:2410.09038.
	Weijia Xu, Nebojsa Jojic, Sudha Rao, Chris Brockett,
	and Bill Dolan. 2024. Echoes in ai: Quantifying
	lack of plot diversity in llm outputs. arXiv preprint
	arXiv:2501.00273.
	Min-Hsuan Yeh, Leitian Tao, Jeffrey Wang, Xuefeng
	Du, and Yixuan Li. 2024. How reliable is human
	feedback for aligning large language models? arXiv
	preprint arXiv:2410.01957.
	Yuhua Yu, Lindsay Krebs, Mark Beeman, and Vicky T
	Lai. 2024. Exploring how generating metaphor via
	insight versus analysis affects metaphor quality and
	learning outcomes. Cognitive science, 48(8):e13488.
	Yiming Zhang, Harshita Diddee, Susan Holm, Hanchen
	Liu, Xinyue Liu, Vinay Samuel, Barry Wang, and
	Daphne Ippolito. 2025. Noveltybench: Evaluating
	creativity and diversity in language models. arXiv
	preprint arXiv:2504.05228.
	Yiming Zhang, Avi Schwarzschild, Nicholas Carlini,
	Zico Kolter, and Daphne Ippolito. 2024. Forcing
	diffuse distributions out of language models. arXiv
	preprint arXiv:2404.10859.
	Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang,
	Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
	Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen
	Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang,
	Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, and
	3 others. 2025. A survey of large language models.
	Preprint, arXiv:2303.18223.
	Yunpu Zhao, Rui Zhang, Wenyi Li, Di Huang, Jiaming
	Guo, Shaohui Peng, Yifan Hao, Yuanbo Wen, Xing
	Hu, Zidong Du, and 1 others. 2024. Assessing and
	understanding creativity in large language models.
	arXiv preprint arXiv:2401.12491.
	Kuan Lok Zhou, Jiayi Chen, Siddharth Suresh, Reuben
	Narad, Timothy T Rogers, Lalit K Jain, Robert D
	Nowak, Bob Mankoff, and Jifan Zhang. 2025. Bridg-
	ing the creativity understanding gap: Small-scale
	human alignment enables expert-level humor ranking
	in llms. arXiv preprint arXiv:2502.20356.
	Aleksandra Zielińska, Peter Organisciak, Denis Dumas,
	and Maciej Karwowski. 2023. Lost in translation?
	not for large language models: Automated
	divergent thinking scoring performance translates to
	non-english contexts. Thinking Skills and Creativity,
	50:101414.
	A MUCE Dataset
	We compiled data by means of crowdsourcing and
	data mining of the open-source data sharing plat-
	form OSF. We crowdsourced from the global cre-
	ativity research community by means of direct re-
	quests and posts on academic listservs. In our call
	for data-sharing, we requested data relating to any
	creativity responses that were provided by human
	participants and scored for creativity by human
	raters. We specifically requested that the datasets
	include scores from each rater, rather than com-
	posite creativity scores, to determine rating data
	quality for each submission. As part of our inclu-
	sion criteria, we further requested that researchers
	provide information relating to: (a) the creativity
	task, (b) the item associated with each response, (c)
	the construct that was rated, and (d) the language
	of the task. We further asked researchers to provide
	a statement on whether they agreed to making their
	data open-source. In terms of data mining through
	the OSF platform, we first searched using a series
	of relevant keywords (e.g., "creativity task",
	"originality score"). We only retained sub-datasets
	from credible sources, which were associated with
	a citable peer-reviewed article, and which included
	all the required data relating to our inclusion crite-
	ria.
	After removing responses that did not meet our
	inclusion criteria, our dataset amounted to 321,572
	human-rated, language-based creativity responses.
	We then cleaned the dataset by standardizing
	the naming of each variable except for the
	responses. Next, we removed responses that had
	been rated by fewer than 2 human judges. Duplicate
	responses were also removed by retaining a
	single exemplar for responses that appeared twice
	within a specific item and task.
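The two filtering steps above can be sketched as follows. This is an illustrative sketch only: the field names (`task`, `item`, `response`, `ratings`) and the use of exact string matching to detect duplicates are assumptions, since the source does not specify how duplicates were matched.

```python
def clean_responses(rows, min_raters=2):
    """Drop rows with fewer than `min_raters` ratings, then keep one
    exemplar per (task, item, response) triple."""
    seen = set()
    kept = []
    for row in rows:
        ratings = [r for r in row["ratings"] if r is not None]
        if len(ratings) < min_raters:
            continue  # rated by too few judges
        key = (row["task"], row["item"], row["response"])
        if key in seen:
            continue  # duplicate within the same item and task
        seen.add(key)
        kept.append(row)
    return kept

# Hypothetical toy records.
rows = [
    {"task": "AUT", "item": "brick", "response": "doorstop", "ratings": [2, 3]},
    {"task": "AUT", "item": "brick", "response": "doorstop", "ratings": [2, 2]},   # duplicate
    {"task": "AUT", "item": "brick", "response": "paperweight", "ratings": [3]},   # one rater only
    {"task": "AUT", "item": "rope", "response": "doorstop", "ratings": [4, 5, 4]}, # other item, kept
]
cleaned = clean_responses(rows)
print(len(cleaned))  # 2
```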
	To enhance the reliability of human creativity
	ratings across the numerous datasets, we optimized
	the selection of raters by applying a meta-heuristic
	algorithm. Specifically, we applied a Genetic Algo-
	rithm (Schroeders et al., 2016). The GA operates
	through iterative selection, crossover, and muta-
	tion processes, mirroring the principles of natural
	selection, in our case to identify the optimal
	subsets of raters for each dataset. In each iteration,
	candidate solutions (that is, combinations of
	raters) were evaluated based on a predefined fitness
	function that prioritized the maximization of
	empirical reliability (rxx) within a graded response
	model (GRM), in line with judge response
	theory. For sub-datasets involving decimal-based
	scales, individual ratings were rounded to the nearest
	integer value (rounding up when the decimal part
	was exactly .5) to meet the requirements of the GRM.
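The rounding rule matters in practice because some built-in rounding functions do not round halves upward. A minimal Python sketch follows, assuming non-negative rating scales; the helper name `round_half_up` is hypothetical.

```python
import math

def round_half_up(x):
    """Round to the nearest integer, sending an exact .5 upward (2.5 -> 3)."""
    return math.floor(x + 0.5)

print([round_half_up(v) for v in (1.4, 2.5, 3.5, 4.6)])  # [1, 3, 4, 5]
print(round(2.5))  # 2, because Python's built-in round sends halves to the even neighbor
```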
	Rater subsets demonstrating superior reliability
	were selected, recombined, and modified through
	random perturbations to prevent premature con-
	vergence to suboptimal solutions. This approach
	ensured that the selected raters provided consistent
	and informative judgments while reducing noise
	introduced by inconsistent or uninformative ratings.
	By automating the selection process through GA,
	we opted for maximal comparability in the selec-
	tion process across datasets. Previous research has
	demonstrated the utility of GA in psychometric op-
	timization tasks, particularly in balancing brevity
	and measurement precision while maintaining con-
	struct validity. In the present study, GA facilitated
	a systematic and data-driven refinement of rater
	selection, arguably enhancing the overall quality of
	creativity ratings.
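The rater-selection loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the fitness function here is Cronbach's alpha, swapped in as a simple stand-in for the GRM-based empirical reliability, and the names `cronbach_alpha` and `select_raters`, the population size, the number of generations, and the mutation rate are all assumptions made for the example.

```python
import random
from statistics import pvariance


def cronbach_alpha(ratings, raters):
    """Cronbach's alpha over the chosen rater columns.

    Stand-in fitness: the paper maximizes GRM-based empirical reliability instead.
    """
    k = len(raters)
    if k < 2:
        return float("-inf")  # a subset needs at least two raters
    rater_vars = [pvariance([row[r] for row in ratings]) for r in raters]
    total_var = pvariance([sum(row[r] for r in raters) for row in ratings])
    if total_var == 0:
        return float("-inf")
    return k / (k - 1) * (1 - sum(rater_vars) / total_var)


def select_raters(ratings, pop_size=30, generations=40, mutation_rate=0.1, seed=0):
    """Search for a reliable rater subset with selection, crossover, and mutation."""
    rng = random.Random(seed)
    n_raters = len(ratings[0])

    def fitness(mask):
        return cronbach_alpha(ratings, [i for i, keep in enumerate(mask) if keep])

    # Each candidate is a boolean mask over raters.
    population = [[rng.random() < 0.5 for _ in range(n_raters)] for _ in range(pop_size)]
    population[0] = [True] * n_raters  # always include the "keep every rater" baseline
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # selection: keep the fitter half
        children = [ranked[0]]                 # elitism: never lose the best mask
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_raters)   # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: random flips guard against premature convergence.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return [i for i, keep in enumerate(best) if keep]
```

Because the best mask is carried over unchanged in every generation, the returned subset is never less reliable (under the stand-in fitness) than keeping all raters.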
	After dropping uninformative raters in each
	sub-dataset, we again removed any rows left with
	fewer than 2 ratings. We then used the new rater
	subset of each sub-dataset to compute a factor score
	for every response, which served as its creativity
	score. Factor scores were estimated with a GRM
	fitted separately to each sub-dataset, yielding a
	single creativity score per response. Finally, we
	applied min-max scaling within each sub-dataset to
	map the scores onto a range of 10 to 50, in intervals
	of 1. This step ensures that every rating is a single
	token long, which lessens the burden of predicting
	multi-token labels by the LLMs.
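As a rough illustration of the rescaling step, the sketch below maps a list of factor scores from one sub-dataset onto integers between 10 and 50. The function name `scale_scores` and the handling of a constant-score sub-dataset are assumptions for the example; the source only states the target range and the step of 1.

```python
def scale_scores(scores, lo=10, hi=50):
    """Min-max scale factor scores to integers in [lo, hi]."""
    s_min, s_max = min(scores), max(scores)
    if s_max == s_min:
        # Assumption: a sub-dataset with identical scores maps to the midpoint.
        return [round((lo + hi) / 2)] * len(scores)
    return [round(lo + (s - s_min) * (hi - lo) / (s_max - s_min)) for s in scores]
```

For example, `scale_scores([-2.0, 0.0, 2.0])` places the lowest score at 10, the highest at 50, and the middle one at 30.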
	We then withheld all responses in the Spanish
	language from our final dataset and assigned them
	to an out-of-distribution-language (OOD-l) set. Re-
	sponses from the OOD-l set were not included in
	the training data of MUCE, allowing us to test
	whether the model could generalize to creative re-
	sponses in an unseen language. We selected Span-
	ish as it would allow for a fair test of generalizabil-
	ity given: (1) Spanish tends to be a high-resource
	language within the pre-training of modern LLMs,
	(2) it is similar to other Latin-root languages in our
	training data (e.g., Italian), (3) responses in Span-
	ish spanned multiple creativity tasks, and (4) the
	language spanned a limited number of responses in
	our total dataset. We further withheld all responses
	from two highly-naturalistic tasks, the Poem and
	Alternative Title Generation, and assigned these
	to an out-of-distribution task (OOD-t) set. We se-
	lected these tasks as they made up a limited portion
	of the total dataset and would provide a test of
	MUCE's performance on unseen naturalistic
	creativity tasks.
	We then randomly selected items within each
	task and assigned them to an out-of-distribution
	item (OOD-i) set. We identified candidate items
	that corresponded to 5% or less of the responses
	within a task. Then, for tasks that contained 20 or
	more total items, we randomly assigned 2 of these
	items to our OOD-i set. For tasks that contained
	fewer than 20 total items, we instead randomly as-
	signed 1 of these items to the OOD-i set. Finally,
	we split the remaining responses in our dataset
	into training, validation, and out-of-distribution
	response (OOD-r) sets according to an 80/10/10
	split. We grouped responses into unique combinations
	of sub-dataset, task, language, item, and rating
	label, then randomly assigned responses within
	each combination to each of the sets, ensuring an
	equal representation of responses associated with
	each of these variables within the training, valida-
	tion, and OOD-r sets. Table 1 contains the final
	dataset statistics for MUCE. Tables 6 and 7 contain
	the descriptions and data statistics for each task in
	MUCE. Tables 8, 9, 10, 11, and 12 list some exam-
	ple prompts and low-rated and high-rated responses
	for each task from MUCE.
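The grouped 80/10/10 split described above can be sketched as follows. This is a minimal illustration under stated assumptions: `stratified_split`, the dictionary-based row format, and the rounding of group sizes are choices made for the example and are not taken from the paper.

```python
import random
from collections import defaultdict


def stratified_split(rows, keys, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/validation/OOD-r sets within each group of `keys`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        # One group per unique combination, e.g. (sub-dataset, task, language, item, rating).
        groups[tuple(row[k] for k in keys)].append(row)
    train, dev, ood_r = [], [], []
    for members in groups.values():
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_dev = round(fractions[1] * len(members))
        train += members[:n_train]
        dev += members[n_train:n_train + n_dev]
        ood_r += members[n_train + n_dev:]  # the remainder goes to OOD-r
    return train, dev, ood_r
```

Splitting inside each group keeps every combination of the grouping variables represented in roughly the same proportion across the three sets.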
	B SFT and Preference Datasets
	Past work has shown that data quality is one of
	the main factors behind preference model perfor-
	mance (Liu et al., 2024; Deng et al., 2025; Wang
	et al., 2024a). In particular, the score margin
	(i.e., the reward margin) between the preferred and
	rejected responses may influence model performance,
	since training pairs with smaller margins are more
	likely to contain annotation noise and are harder
	to learn from. We experiment with different reward
	margins and choose a margin of 5 for the final
	experiments, as it strikes a balance between
	mitigating annotator noise and creating a dataset
	with nuanced preferences. Additionally, to ensure a
	high-quality preference dataset, we first filter the
	base MUCE dataset and keep only the samples on which
	all annotators fully agree. We then filter out all
	samples that have a rating below 20 and limit the
	number of pairings between samples to 10. This results in
	a final preference training dataset of 42,058 samples
	(MUCE-PREF). We also create a high-quality
	instruction-tuning dataset from MUCE-PREF by
	pairing the prompts with all preferred responses
	that have a rating above 30, resulting in a dataset
	of 5,275 samples (MUCE-SFT). Tables 2 and 3
	contain the statistics for these datasets.

	--- Page 16 ---

	[Figure 6 plots: three panels showing novelty, diversity, and surprise against quality. Legend: SFT, DPO, Mistral-7B, Gemini-2.0, GPT-4o, Claude-3.7, CrPO-nov, CrPO-div, CrPO-sur, CrPO-nov-qua, CrPO-div-qua, CrPO-sur-qua, CrPO-qua, CrPO-nov-div-sur, CrPO-cre.]
	Figure 6: Results on held-out evaluation suite from MUCE across all baselines and our models using
	Mistral-7B-Instruct-v0.3 as a base model. nov, div, sur, qua, cre denote novelty, diversity, surprise, quality,
	and creativity, respectively. Results are averaged across tasks.

	 | Total | Train | Dev | Test | OOD-i | OOD-l | OOD-t
	samples | 245,030 | 183,973 | 23,254 | 22,419 | 6,253 | 4,719 | 4,412
	tasks | | | | | | |
	languages | | | | | | |
	prompts | | | | | | |
	Table 1: Detailed statistics for each split of MUCE.

	Human Evaluation Instructions
	In this study, you will be presented with two
	responses to a creative task. Your job is to
	select the response that you believe is the
	most creative. Please base your judgment
	only on the creativity of the ideas, not on
	how long or detailed the response is. A
	shorter response can be more creative than
	a longer one, and vice versa. Focus on how
	original, unique, and innovative the idea
	feels to you. There are no right or wrong
	answers; we're interested in your opinion.
	Figure 7: Rater instructions for the human evaluation.
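The construction of MUCE-PREF pairs can be illustrated with the sketch below. It is not the authors' pipeline: the function name `build_preference_pairs`, the dictionary fields, the reading of "a margin of 5" as "at least 5 rating points apart", and the reading of the pairing limit as a cap of 10 pairs per response are all assumptions. The annotator-agreement filter is assumed to have been applied beforehand.

```python
from collections import defaultdict
from itertools import combinations


def build_preference_pairs(samples, margin=5, min_rating=20, max_pairs=10):
    """Build (chosen, rejected) pairs per prompt from rated responses."""
    by_prompt = defaultdict(list)
    for s in samples:
        if s["rating"] >= min_rating:  # drop samples rated below the threshold
            by_prompt[s["prompt"]].append(s)

    pairs = []
    for prompt, group in by_prompt.items():
        used = defaultdict(int)  # number of pairs each response already appears in
        group = sorted(group, key=lambda s: s["rating"], reverse=True)
        for better, worse in combinations(group, 2):
            if better["rating"] - worse["rating"] < margin:
                continue  # margin too small: likely annotation noise
            if used[better["response"]] >= max_pairs or used[worse["response"]] >= max_pairs:
                continue  # assumed cap on pairings per response
            used[better["response"]] += 1
            used[worse["response"]] += 1
            pairs.append({"prompt": prompt,
                          "chosen": better["response"],
                          "rejected": worse["response"]})
    return pairs
```

With ratings of 40, 34, 30, and 15 for one prompt, the 15-rated response is filtered out and only the pairs (40, 34) and (40, 30) survive, since 34 and 30 differ by less than the margin.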
	C Training
	We follow a training setup similar to Chung
	et al. (2025) and use Llama-3.1-8B-Instruct
	and Mistral-7B-Instruct-v0.3 (Jiang et al.,
	2023) as our base models. Using these models,
	we train an SFT, DPO and several CRPO models.
	We train all models using parameter-efficient tuning
	with LoRA (Hu et al., 2022), using a rank of 128 and
	an alpha of 256. All training was done using the
	HuggingFace TRL library5 with Accelerate
	(Gugger et al., 2022) and DeepSpeed ZeRO-2
	(Rajbhandari et al., 2020) on NVIDIA A100 GPUs with
	gradient checkpointing.
	The SFT model is trained on the MUCE-SFT dataset
	for a single epoch with a batch size of 2 per GPU,
	a gradient accumulation size of 4, and a context
	size of 1024. We use a cosine scheduler with a
	half-cycle warmup and a maximum learning rate of
	3e-5. The final model achieves 85% mean token
	accuracy on the validation set.
	The DPO and CRPO models are trained on our MUCE-PREF
	dataset, using the SFT model as a base, for a single
	epoch with a batch size of 8 per GPU, a gradient
	accumulation size of 8, and a context size of 1024.
	We use a linear scheduler with a learning rate of
	5e-6. All final models achieve over 82% reward
	accuracy on the validation set.
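To make the reported hyperparameters easier to compare, the sketch below collects them in a small data class and derives two quantities from them. The class and field names are hypothetical, the number of GPUs is not stated in the text and is therefore a free parameter, and the LoRA scaling factor assumes the standard alpha / rank convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (names are illustrative)."""
    name: str
    per_gpu_batch_size: int
    grad_accum_steps: int
    learning_rate: float
    scheduler: str
    max_length: int = 1024
    epochs: int = 1

    def sequences_per_update(self, num_gpus: int) -> int:
        # Effective batch size; num_gpus is an assumption, not given in the text.
        return self.per_gpu_batch_size * self.grad_accum_steps * num_gpus


LORA_RANK, LORA_ALPHA = 128, 256
LORA_SCALING = LORA_ALPHA / LORA_RANK  # 2.0 under the standard alpha / rank scaling

SFT = StageConfig("sft", per_gpu_batch_size=2, grad_accum_steps=4,
                  learning_rate=3e-5, scheduler="cosine")
PREF = StageConfig("dpo_or_crpo", per_gpu_batch_size=8, grad_accum_steps=8,
                   learning_rate=5e-6, scheduler="linear")
```

Per GPU, the SFT stage thus accumulates 8 sequences per optimizer update and the preference stage accumulates 64.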
	5 https://huggingface.co/docs/trl/en/index

	--- Page 17 ---

	[Figure 8 plots: three panels titled "CrPO-nov with different injection weights", "CrPO-div with different injection weights", and "CrPO-sur with different injection weights", showing novelty, diversity, and surprise, respectively, together with quality for lambda values of 0.0, 0.5, 1.0, 1.5, and 2.0.]
	Figure 8: Effect of injection weights for each dimension on the quality score. Results are averaged across three seed
	runs.
	D Evaluation
	For each prompt in our held-out evaluation suite,
	we generate a total of 16 responses for every model
	by sampling 4 responses for each of the following
	four decoding setups that induce high randomness
	using various sampling techniques (Fan et al., 2018;
	Holtzman et al., 2019):
	1. temperature = 0.7, top-p = 0.95
	2. temperature = 0.9, top-p = 0.99
	3. temperature = 0.7, top-k = 50
	4. temperature = 0.8, top-p = 0.97
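The four setups above, sampled four times each, give the 16 responses per prompt. A minimal sketch of that sampling plan follows; the key names `temperature`, `top_p`, and `top_k` follow common generation-API conventions and are assumptions here, as is the helper `sampling_plan`.

```python
DECODING_SETUPS = [
    {"temperature": 0.7, "top_p": 0.95},
    {"temperature": 0.9, "top_p": 0.99},
    {"temperature": 0.7, "top_k": 50},
    {"temperature": 0.8, "top_p": 0.97},
]
SAMPLES_PER_SETUP = 4


def sampling_plan(setups=DECODING_SETUPS, n=SAMPLES_PER_SETUP):
    """Return one decoding configuration per response to generate."""
    return [dict(setup, sample_index=i) for setup in setups for i in range(n)]
```

Each entry of the plan can then be passed to whatever generation routine is in use, one call per response.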
	Moreover, as existing instruction-tuned LLMs
	tend to produce verbose outputs (Saito et al., 2023),
	we minimize length bias by adding further
	instructions to the prompt that constrain the output
	length in terms of the number of sentences and
	words. We compute the constraint values based
	on the median number of words and sentences of
	responses per task from our training dataset. Table
	4 lists an example evaluation prompt for each task.
	Table 5 lists an example response from all models
	to a single prompt.
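The length constraint can be derived from training responses roughly as sketched below. The wording of the generated instruction mirrors the prompts in Table 4; the sentence counter (a split on `.`, `!`, and `?`) and the function names are simplifying assumptions, since the text does not say how sentences were counted.

```python
import re
from statistics import median


def count_sentences(text):
    """Crude sentence count: split on runs of terminal punctuation (assumption)."""
    parts = [p for p in re.split(r"[.!?]+", text) if p.strip()]
    return max(1, len(parts))


def length_constraint(responses):
    """Build the length instruction from the median sentence and word counts."""
    n_sent = round(median(count_sentences(r) for r in responses))
    n_words = round(median(len(r.split()) for r in responses))
    unit = "sentence" if n_sent == 1 else "sentences"
    return f"Please limit your response to {n_sent} {unit} and at most {n_words} words."
```

The resulting string is appended to the task prompt, with one constraint computed per task.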
	D.1 Human Evaluation
	Since we have multiple model responses per
	prompt, instead of choosing a response at random,
	for each prompt we choose the top 4 model responses
	as ranked by the overall automated creativity score,
	which we define as the sum of the normalized novelty,
	diversity, surprise, and quality scores. This setup
	ensures that models are compared to each other on
	their best outputs. We recruited 15 participants on
	Prolific6 to complete the study, requiring that they
	reside in the U.S. and have an approval rating of
	at least 90%. Ethics board approval was received
	from the Pennsylvania State University IRB for this
	study. We provided participants with a definition of
	creativity, and instructed them not to focus on the
	length or detail of the response when rating. Figure
	7 lists the instructions given to raters for evaluating
	creativity. We additionally included a comprehen-
	sion check where participants were quizzed about
	the task instructions, to help catch careless partici-
	pants. Raters who failed this check were excluded
	from further analysis. All raters were compensated
	with a minimum payment of $9
	per hour. Final win rates are calculated for each
	response pair based on the majority vote across
	participants. The inter-rater agreement computed
	using Krippendorff's alpha was 0.463, indicating
	moderate agreement.
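Two steps of this protocol lend themselves to a short sketch: ranking a model's responses by the summed normalized scores, and resolving each pairwise comparison by majority vote. The normalization used here is min-max over the candidate responses, which is an assumption (the text only says "normalized"), and the function names and tie handling are illustrative.

```python
from collections import Counter

DIMENSIONS = ("novelty", "diversity", "surprise", "quality")


def min_max(values):
    """Scale values to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def top_k_by_creativity(responses, k=4):
    """Rank responses by the sum of their normalized dimension scores."""
    normalized = {d: min_max([r[d] for r in responses]) for d in DIMENSIONS}
    totals = [sum(normalized[d][i] for d in DIMENSIONS) for i in range(len(responses))]
    order = sorted(range(len(responses)), key=lambda i: totals[i], reverse=True)
    return [responses[i] for i in order[:k]]


def majority_winner(votes):
    """Return the option most raters preferred, or None on a tie (assumption)."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]
```

A model's win rate is then the fraction of response pairs for which `majority_winner` returns that model.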
	D.2 NOVELTYBENCH Evaluation
	NOVELTYBENCH is a recently introduced bench-
	mark to measure how well language models can
	generate novel and high-quality answers to user re-
	quests involving subjectivity, randomness, and cre-
	ativity (Zhang et al., 2025). We use a 100-sample
	subset of their benchmark that is manually curated
	by the authors and contains four distinct categories
	where diversity and novelty are expected:
	• Randomness: prompts that involve randomizing
	over a set of options. Example: "Roll a
	make-believe 20-sided die."
	• Factual Knowledge: prompts that request
	underspecified factual information, which allow
	6 https://www.prolific.com

	--- Page 18 ---

	Task | prompts | samples
	Real-Life Creative Problem Solving | | 5,601
	Question Asking | |
	Malevolent Problems | |
	Metaphors | |
	Alternate Uses of Objects Task | | 4,388
	Design Solutions | | 1,366
	Essays | |
	Stories | | 1,498
	Consequences | 5 | 10,865
	Experiment Design | | 5,640
	Hypothesis Generation | | 5,260
	Research Questions | | 5,832
	Associations | |
	Total | | 42,058
	Table 2: MUCE-PREF training dataset details.
	many valid answers. Example: "List a capital
	city in Africa."
	• Creative Writing: prompts that involve generating
	a creative form of text, including poetry
	and story-writing. Example: "Tell me a
	riddle."
	• Subjectivity: prompts that request subjective
	answers or opinions. Example: "What's the
	best car to get in 2023?"
	Additionally, the paper proposes new metrics to
	measure novelty and quality (i.e., utility) that
	differ from ours. To compute novelty, they propose
	a method that learns to partition the output
	space into equivalence classes from human annota-
	tions. Each class represents one unique generation
	that is roughly equivalent to the others in the same
	class and different from the generations in other
	classes. They consider a functional equivalence
	that defines two generations to be different if and
	only if a user who has seen one generation would
	Task | prompts | samples
	Real-Life Creative Problem Solving | |
	Question Asking | |
	Malevolent Problems | |
	Metaphors | |
	Alternate Uses of Objects Task | |
	Design Solutions | |
	Essays | |
	Stories | |
	Consequences | 5 | 1,315
	Experiment Design | |
	Hypothesis Generation | |
	Research Questions | |
	Associations | |
	Total | | 5,275
	Table 3: MUCE-SFT training dataset details.
	likely benefit from seeing the other. To this end, the
	authors annotated 1,100 pairs of generations condi-
	tioned on prompts from NOVELTYBENCH sampled
	from a diverse set of models. From these annotated
	pairs, they used 1,000 for training and fine-tuned a
	deberta-v3-large model (He et al., 2023) to predict
	binary functional equivalence between two generations.
	With the equivalence classifier, they partition
	the output space into equivalence classes. They then
	define novelty as the distinct_k metric, that is, the
	number of equivalence classes in a partition of k
	sample generations from a language model, where c_i
	denotes the equivalence class of the i-th generation:
	$\mathrm{distinct}_k := \left|\{c_i \mid i \in [k]\}\right|$    (8)
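The partitioning behind distinct_k can be sketched as below. The trained classifier is replaced by a hypothetical `equivalent(a, b)` predicate supplied by the caller, and comparing each new generation only against the first member of every existing class is an assumption made for the sketch, not a detail confirmed by the text.

```python
def partition(generations, equivalent):
    """Assign each generation a class id using a pairwise equivalence predicate."""
    classes = []    # classes[j][0] acts as the representative of class j
    class_ids = []
    for g in generations:
        for j, members in enumerate(classes):
            if equivalent(members[0], g):
                members.append(g)
                class_ids.append(j)
                break
        else:
            # No existing class matched: open a new one.
            classes.append([g])
            class_ids.append(len(classes) - 1)
    return class_ids


def distinct_k(class_ids):
    """Number of equivalence classes among the k generations (Eq. 8)."""
    return len(set(class_ids))
```

For instance, with a case-insensitive string match standing in for the classifier, the generations "Paris", "paris", "Cairo", "PARIS", and "Lagos" fall into three classes.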
	To compute quality, they consider a model of
	user behavior that describes how users interact with
	and consume language model generations. They
	assume that the user has a patience level $p \in [0, 1]$:
	after observing each additional generation, they
	have a probability $p$ of requesting an additional
	generation from the language model and observing
	the next generation, and a probability $1 - p$ of

	--- Page 19 ---

	Task | Prompt
	Real-Life Creative Problem Solving | Come up with an original and creative solution for the following real-world problem: Clara, a junior pre-med student, is working part-time and taking a 15 hour credit load at school. ... skipped ... Please limit your response to 4 sentences and at most 75 words.
	Alternate Uses of Objects | Come up with an original and creative use for the following object: rope. Please limit your response to 1 sentence and at most 17 words.
	Design Solutions | Come up with an original and creative solution to reduce the amount of litter in public spaces and promote waste reduction and recycling. Please limit your response to 2 sentences and at most 36 words.
	Hypothesis Generation | Come up with an original and creative scientific hypothesis for the following scenario: You notice that dogs seem to like one of your friends, but cats seem to like another friend. What hypotheses do you have about why that is? Please limit your response to 1 sentence and at most 22 words.
	Metaphors | Come up with an original and creative metaphoric equivalent for the concept described below: Stomata are tiny openings or pores found on the underside of a plant leaf. They are used for gas exchange, enabling the intake of carbon dioxide and release of oxygen.. Please limit your response to 1 sentence and at most 10 words.
	Poems | Come up with an original and creative poem about the following concept: choice. Please limit your response to 5 sentences and at most 150 words.
	Sentence Completion | Finish the sentence with an original and creative ending: When I got on the school bus.... Please respond in one sentence.
	Table 4: MUCE dataset held-out item and task evaluation prompts.
	stopping interacting with the model. Then they
	compute the quality of a sequence of generations
	as the cumulative utility:
	$\mathrm{utility}_k := \frac{1 - p}{1 - p^k} \sum_{i=1}^{k} p^{i-1} \cdot \mathbb{1}\left[c_i \neq c_j, \forall j < i\right] \cdot u_i$    (9)
	To compute the utility of individual generations,
	they also use the Skywork-Reward-Gemma-2-27B-v0.2
	(Liu et al., 2024) model.
	To benchmark our models, we follow their rec-
	ommended setup for evaluation. In particular, we
	set the number of generations to 10 per model and
	the patience level to 0.8 and use their trained clas-
	sifier for output space partition.
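The cumulative utility in Eq. 9 can be computed directly from the class assignments and per-generation utilities. The sketch below assumes the class ids come from the equivalence classifier and the utilities from the reward model; the function name and argument names are illustrative, and the default patience of 0.8 matches the setup described above.

```python
def utility_k(class_ids, utilities, patience=0.8):
    """Patience-discounted utility over k generations (Eq. 9).

    Only the first generation of each equivalence class contributes;
    requires 0 <= patience < 1 and at least one generation.
    """
    k = len(class_ids)
    seen = set()
    total = 0.0
    for i, (c, u) in enumerate(zip(class_ids, utilities)):
        if c not in seen:                # indicator: c_i differs from all earlier c_j
            total += patience ** i * u   # i is 0-based, so this is p^(i-1) in Eq. 9
            seen.add(c)
    return (1 - patience) / (1 - patience ** k) * total
```

The leading factor normalizes the discount weights, so a sequence of k mutually distinct generations that all have utility 1 scores exactly 1.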

	--- Page 20 ---

	Model | Response
	Human | played hide-and-seek by forming into different objects and trying not to be found.
	Llama-3.1-8B-Instruct | At a sleepover, we transformed our host's basement into a mystical Dreamscape where we wove a tapestry of secrets, stories, and whispered promises that only the darkness could keep.
	Gemini-2.0-Flash | At a sleepover we...built a pillow fort so magnificent, it accidentally opened a portal to a realm where dreams manifested as sentient, fluffy creatures demanding to be tickled.
	Claude-3.7-Sonnet | At a sleepover we constructed an elaborate blanket fortress complete with secret passages, flashlight constellations on the ceiling, and an unspoken pact to guard each other's midnight confessions until the end of time.
	GPT-4o | At a sleepover we transformed the living room into a magical fort of pillows and fairy lights, where we whispered secrets and told stories until the first light of dawn.
	SFT | We played a game of Twister where we put out our phones to see who was the best twister.
	DPO | conducted an experiment to see if a person's personality is changed with an altered state of consciousness.
	CRPO-nov | tried to break the record for most consecutive hours without speaking and we discovered we could all hear each other's thoughts.
	CRPO-nov-qua | Participated in an experiment where we tested the effects of sleep deprivation on the human mind.
	CRPO-div | construct a space shuttle that takes us to the moon and from there we can launch our dream rockets.
	CRPO-div-qua | recreated the conditions of a 19th century underground railroad and had to map out our escape to Canada.
	CRPO-sur | Operate on each other to implant a permanent adrenaline gland.
	CRPO-sur-qua | created an underwater laboratory within our inflatable pool to collect the evidence we found of alien life.
	CRPO-qua | began to master the art of telekinesis by competitively tossing each other's pillows across the room.
	CRPO-nov-div-sur | Built a rollercoaster out of air mattresses and then did a hot-wheel car-launch into the trenches and caught the crash on GoPro cameras.
	CRPO-cre | Created an experiment to see if our dreams could be manipulated and transfer to one another.
	Table 5: Example model (and gold human) responses to the prompt "At a sleepover we ..." from the Sentence
	Completion task.

	--- Page 21 ---

	Task | Description | Dataset Sources | prompts | samples
	Real-Life Creative Problem Solving | Produce solutions for everyday problems | (Luchini et al., 2025; Kapoor et al., 2024; Saretzki et al., 2024; Perchtold-Stefan et al., 2024) | | 33,340
	Alternate Titles Generation | Produce alternative titles for widely known books or movies | (Agnoli et al., 2016) | | 2,986
	Question Asking | Produce questions about everyday objects | (Raz et al., 2024) | | 3,566
	Poems | Produce poems about everyday concepts | (Fan et al., 2023; Chaudhuri et al., 2025; He et al., 2022) | | 2,580
	Design Solutions | Produce solutions to real-world design problems | (DiStefano et al., 2025) | | 10,818
	Combining Objects | Produce combinations of everyday objects to achieve a goal | (Weiss et al., 2023) | | 4,494
	Plot Titles Generation | Produce titles for story plots | (Weiss et al., 2023; Goecke et al., 2024b; Weiss et al., 2024) | | 1,832
	Instances of Common Concepts | Produce instances related to everyday adjectives | (Organisciak et al., 2023) | | 2,474
	Experiment Design | Produce experiment designs to test scientific hypotheses | (Beaty et al., 2024; Goecke et al., 2024a) | | 4,893
	Associations | Produce word associations | (Beaty and Johnson, 2021) | | 1,004
	Emotional Trials | Produce feelings one might have in a given situation | (Weiss et al., 2023) | |
	Invent Nicknames | Produce nicknames for everyday concepts and objects | (Weiss et al., 2023) | |
	Situation Redescription | Produce redescriptions of negative situations into positive situations | (Weiss et al., 2023) | |
	Alternate Uses of Objects Task | Produce alternate uses for everyday objects | (Patterson et al., 2023; Zielińska et al., 2023; Organisciak et al., 2023) | | 88,155
	Stories | Produce short stories from three word prompts | (Luchini et al., 2025; Agnoli et al., 2016; Fan et al., 2023; He et al., 2022) | | 2,757
	Table 6: MUCE dataset details broken down by task (Part 1).

	--- Page 22 ---

	Task | Description | Dataset Sources | prompts | samples
	Malevolent Problems | Produce ideas on how to take revenge on or sabotage a wrongdoer | (Perchtold-Stefan et al., 2023; Kapoor et al., 2024; Perchtold-Stefan et al., 2024) | | 16,536
	Metaphors | Produce metaphors to describe scenarios | (DiStefano et al., 2024; Yu et al., 2024) | | 13,210
	Essays | Produce essays on a topic | (Cotter et al., 2016) | |
	Consequences | Produce possible consequences to scenarios | (Weiss et al., 2024, 2023; Goecke et al., 2024b) | | 24,874
	Sentence Completion | Produce endings to incomplete sentences | (Organisciak et al., 2023) | | 2,629
	Hypothesis Generation | Produce scientific hypotheses for specific observations | (Beaty et al., 2024; Goecke et al., 2024a) | | 18,455
	Research Questions | Produce research questions relating to scenarios | (Beaty et al., 2024; Goecke et al., 2024a) | | 5,161
	Composites | Produce composite words from a prompt word | (Weiss et al., 2023) | |
	Evoking Emotional Responses from People | Produce ways to evoke emotional responses in people as a TV producer | (Weiss et al., 2023) | |
	Emotions in Everyday Situations | Produce emotions you might feel in response to everyday situations | (Weiss et al., 2023) | |
	Table 7: MUCE dataset details broken down by task (Part 2).

	--- Page 23 ---

	Task: Real-Life Creative Problem Solving
	Example prompt: Becky is a college student who works part-time at Mark's Pizzeria. Mark, the owner of the restaurant, has treated Becky very well. He gave her a job that she needs to help pay her rent when no other business would employ her because she was arrested for shoplifting three years ago. Mark also lets Becky work around her school schedule, and has asked if she wants to be a shift manager in the summers. Becky's roommate Jim also works at the pizzeria, but Jim has been causing a lot of problems at work. He always avoids doing his job, treats customers rudely, and makes a lot of mistakes with orders. Jim recently began stealing food from the pizzeria. Two days ago the pizzeria was short-staffed, so Jim and Becky were the only employees left at closing time. Jim made 10 extra pizzas and took them home to a party he was hosting without paying for them. Becky feels like she needs to do something about Jim's behavior. However, Becky is hesitant to tell Mark about Jim because Jim is a good friend to Becky. Becky also needs Jim to have a job so he can pay his portion of their rent. Becky does not know what to do..
	Example low rating response: Morally the right thing for Becky to do would be to tell her boss. However, to be a good friend would to be not to tell on Jim. The only creative solution to this problem would to be to try and talk to Jim one on one. Give Jim the decision of whether or nt he wants Becky to inform their boss of what he has been doing. As a friend he should understand where Becky is coming from and want to take the strain off her.
	Example high rating response: Becky should first discuss this with Jim, and tell him that he needs to either pay for the pizzas or he needs to go to the boss, and admit what he has done. He will get caught in the end because eventually the ingredients will be missed. The boss may unerstand, and perhaps will allow him to work off the pizzas somehow. Maybe he could help out cleaning up around the restaurant. If Jim will not tell his boss Becky should tell him. She wouldn't necessarily have to come right out and tell on her coworker she could come up with a way for the boss to catch him at it. If he does it once Jim will more than likely do it again. She could tell the boss to check on the inventory. She could have other people who might have been at the party come tell her boss bout it. If all of that fails, she should just tell Mark about Jim stealing the pizzas.

	Table 8: MUCE dataset examples (Part 1).

	--- Page 24 ---

	Task: Question Asking
	Example prompt: pencil
	Example low rating response: How big is it?
	Example high rating response: How many great ideas have started with a pencil?

	Task: Poems
	Example prompt: childhood
	Example low rating response: Twinkle, Twinkle little star....ect
	Example high rating response: Red Rover, Red Rover Is my childhood over? I don't feel quite grown up I still laugh at "I CUP" I play slide with my sister and still call my fourth grade teacher "mister" I suppose, even still, my childhood is over even if I can still play red rover red rover

	Task: Design Solutions
	Example prompt: Develop as many design ideas as you can to reduce air pollution in cities.
	Example low rating response: Walk
	Example high rating response: use 3d printing as an innivating way of building houses as it reduces labour and

	Task: Combining Objects
	Example prompt: Paint sign
	Example low rating response: paper, ballpoint pen
	Example high rating response: beetroot juice, quark cheese

	Task: Plot Titles Generation
	Example prompt: Now spoke
	Example low rating response: A completely normal everyday life
	Example high rating response: VR glasses charger defective

	Task: Instances of Common Concepts
	Example prompt: soft
	Example low rating response: something that is not hard
	Example high rating response: a futuristic ball that turns really fuzzy and comfy at places it gets contact to

	Task: Experiment Design
	Example prompt: You think some animals have a sense of humor that humans don't usually understand. How could you test that hypothesis?
	Example low rating response: observe
	Example high rating response: tickle your dog to see how he acts when he's laughing. then, observe your dog throughout the day and note when he is laughing. you may begin to pick up on moments where he does things that are funny to him.

	Task: Associations
	Example prompt: expert
	Example low rating response: winner
	Example high rating response: ace

	Task: Emotional Trials
	Example prompt: You have a date tonight, and once again your dress didn't get ready in time at the laundry.
	Example low rating response: worried, afraid, sad
	Example high rating response: Anger, panic, anticipation

	Task: Invent Nicknames
	Example prompt: plate
	Example low rating response: porcelain
	Example high rating response: Shrunken UFO

	Table 9: MUCE dataset examples (Part 2).

	--- Page 25 ---

	Task: Alternate Uses of Objects Task
	Example prompt: knife
	Example low rating response: weapon
	Example high rating response: make up "knife characters" and create a movie

	Task: Stories
	Example prompt: petrol-diesel-pump
	Example low rating response: I needed to fuel my car before we could start the long drive. I drove to the petrol station. i went to the pump and fuel my car with diesel. new i was ready for the task ahead
	Example high rating response: Manly Merde was a truck driver looking for trouble. He pulled into the Casino in the back where the drivers go. He took a swig of whisky and walked to the petrol station, grabbed the pump and spurt diesel into the air like hydrocarbon fountain. He let out a big belly laugh and screamed, "Let the revolution begin!" And that is how the trucker wars started.

	Task: Malevolent Problems
	Example prompt: Your professor in class announces an award for the person who comes up with the best solution for a project. By chance, another student leaves their notebook behind in class. You read their ideas and believe that they are the best. You decide to turn them in as your own; however you know that if the other student submits the same solution, there will be a problem.
	Example low rating response: I will not do the above
	Example high rating response: render their notebook unreadable by dropping water at the last moment

	Task: Metaphors
	Example prompt: The hot tea is...
	Example low rating response: boiling
	Example high rating response: liquid fire

	Task: Consequences
	Example prompt: What would be the result if society no longer used money, and instead traded goods and services?
	Example low rating response: Banks would be unnecessary.
	Example high rating response: People (especially couples) would stop fighting so much about financial issues

	Task: Sentence Completion
	Example prompt: It started raining and...
	Example low rating response: I got wet
	Example high rating response: because I was covered in oil, I began to levitate, and all the witnesses called me the next coming of some sort of goddess.

	Table 10: MUCE dataset examples (Part 3).

	--- Page 26 ---

	Task: Hypothesis Generation
	Example prompt: On a field trip, you drive past a massive field with hundreds of large holes visible as far as the eye can see. What hypotheses do you have about what purpose the holes may serve?
	Example low rating response: the holes resulted over time and nature
	Example high rating response: the holes are for animals giving birth.

	Task: Essays
	Example prompt: dream project
	Example low rating response: I don't really know what carreer path I want to follow. I just want a job where I can help people and get a good pay check so I can support myt future endevors. I want to do something that no one has ever done before in a way no one has ever seen. I want to inspire a generation to work on a better future for everybody. I guess what I really want is to be remembered as an icon. i want to be someone that people look up to.
	Example high rating response: I want to go into forensic science when I gradutate. Therefore, my dream project is to discover the perfect device that can help solve every crime scene. This device would be able to analyize the crime scene and tell us exactly how many people died and how they died. It would then collect evidence samples such as blood. Next, it would use what the information it found at the crime scene to help make up questions the dectectives would ask the suspects. It would use it's technology to come up with questions that only the murderer could answer. Later on, back at the lab, it would help discover whose blood the samples belonged to. In the end, the only human power that would be needed was someone to arrest the convicted person and the people to help clean up the crime scene.

	Table 11: MUCE dataset examples (Part 4).

	--- Page 27 ---

	Task: Situation Redescription
	Example prompt: You notice how your colleague first treats another employee very kindly and then shortly afterwards starts talking negatively behind his back
	Example low rating response: It would be nice if you were older
	Example high rating response: I'll talk to them. Then I'll have to work less

	Task: Alternate Titles Generation
	Example prompt: The Betrothed
	Example low rating response: renzo and lucia
	Example high rating response: Plague, Honor and Love in Baroque Brianza

	Task: Research Questions
	Example prompt: You travel to a jungle that contains no human life and is completely unknown to the scientific community. What scientific questions could you ask about this jungle?
	Example low rating response: How many people will come with me?
	Example high rating response: Do these species share a common characterisitic that humans don't have?

	Task: Composites
	Example prompt: jitters
	Example low rating response: Exam jitters
	Example high rating response: Easter bunny missing jitters

	Task: Evoking Emotional Responses from People
	Example prompt: Describe how you would make people look down on others
	Example low rating response: I will always scream loudly
	Example high rating response: I would divide the audience into two groups and give one group a rubber glove as headgear and the other group a tiara or crown made of real gold.

	Task: Emotions in Everyday Situations
	Example prompt: You're at work. A glance at the clock tells you that you're about to finish work and start your long-awaited weekend.
	Example low rating response: I feel happy
	Example high rating response: I feel sorry for my desk chair, which is unused over the weekend and stands alone in the office.

	Table 12: MUCE dataset examples (Part 5).