arXiv:2505.14442v1 [cs.CL] 20 May 2025

Creative Preference Optimization

Mete Ismayilzada^1,2, Antonio Laverghetta Jr.^3, Simone A. Luchini^3, Reet Patel^3, Antoine Bosselut^1, Lonneke van der Plas^2, Roger Beaty^3
^1 EPFL, ^2 Università della Svizzera Italiana, ^3 Pennsylvania State University
mahammad.ismayilzada@epfl.ch
Abstract

While Large Language Models (LLMs) have demonstrated impressive performance across natural language generation tasks, their ability to generate truly creative content, characterized by novelty, diversity, surprise, and quality, remains limited. Existing methods for enhancing LLM creativity often focus narrowly on diversity or specific tasks, failing to address creativity's multifaceted nature in a generalizable way. In this work, we propose Creative Preference Optimization (CRPO), a novel alignment method that injects signals from multiple creativity dimensions into the preference optimization objective in a modular fashion. We train and evaluate creativity-augmented versions of several models using CRPO and MUCE, a new large-scale human preference dataset spanning over 200,000 human-generated responses and ratings from more than 30 psychological creativity assessments. Our models outperform strong baselines, including GPT-4o, on both automated and human evaluations, producing more novel, diverse, and surprising generations while maintaining high output quality. Additional evaluations on NOVELTYBENCH further confirm the generalizability of our approach. Together, our results demonstrate that directly optimizing for creativity within preference frameworks is a promising direction for advancing the creative capabilities of LLMs without compromising output quality.
1 Introduction

Large Language Models (LLMs) have made significant progress across a broad range of natural language generation tasks (Team et al., 2023; Zhao et al., 2025; Bubeck et al., 2023; Wei et al., 2022; Brown et al., 2020). However, whether LLMs exhibit true human-like creativity, i.e., the ability to produce novel (i.e., original), high-quality (i.e., useful), and surprising (i.e., unexpected) ideas (Simonton, 2012; Boden, 2004), remains unclear. Research on the creativity of LLMs has found mixed results, with some reporting that LLMs are more creative than humans (Bellemare-Pepin et al., 2024; Zhao et al., 2024), others reporting that they are less creative (Koivisto and Grassini, 2023; Chakrabarty et al., 2024; Ismayilzada et al., 2024b), and some finding their creativity to be on par with that of humans (Stevenson et al., 2022; Góes et al., 2023; Gilhooly, 2024). However, past research has also found that high LLM performance can be attributed to the artificial nature of the creativity tasks (Ismayilzada et al., 2024a) commonly employed to evaluate LLMs, such as the Alternative Uses Task (Guilford, 1967), or to the remarkable creativity of human-written texts on the web (Lu et al., 2024). Consequently, LLMs have been shown to often lack novelty and surprise in their generations (Ismayilzada et al., 2024a,b; Zhang et al., 2025; Tian et al., 2024; Chakrabarty et al., 2024) and to produce significantly less diverse content compared to humans (Padmakumar and He, 2023; Anderson et al., 2024; Kirk et al., 2023; Xu et al., 2024; O'Mahony et al., 2024; Zhang et al., 2024; Wenger and Kenett, 2025). These tendencies limit the utility of LLMs for creative tasks, such as story generation and creative problem solving, that often require longer responses and out-of-the-box thinking (Tian et al., 2023; Huang et al., 2024; Chen et al., 2024).

Figure 1: Our preference alignment method CRPO, which improves output creativity by injecting a weighted combination of signals from multiple creativity dimensions.

Recent research has proposed methods for improving the creativity of LLMs, often targeting the diversity aspect alone (Wong et al., 2024; Hayati et al., 2023; Chung et al., 2023; Franceschelli and Musolesi, 2024; Zhang et al., 2024; Wang et al., 2024b; Zhou et al., 2025; Lanchantin et al., 2025; Chung et al., 2025) or focusing on a single creativity task (Tian et al., 2023; Nair et al., 2024; Summers-Stay et al., 2023). However, creativity is a multifaceted ability that also encompasses novelty, surprise, and quality, and manifests itself in a wide range of tasks. Consequently, it has been argued that methods promoting creativity improvements should consider multiple dimensions of creativity together across several creative tasks (Ismayilzada et al., 2024a). Hence, the broader challenge of enhancing overall creativity in LLM outputs remains largely underexplored.
To this end, we propose a novel approach to directly optimize for creativity in language model generation through preference learning (Ouyang et al., 2022; Rafailov et al., 2023). Recent works targeting improvement in LLM creativity have mainly focused on black-box techniques to elicit creative outputs through input-level strategies (e.g., prompting) (Tian et al., 2023; Mehrotra et al., 2024; Nair et al., 2024; Summers-Stay et al., 2023) and output-level strategies (e.g., creative decoding) (Franceschelli and Musolesi, 2024; Meister et al., 2023). However, these methods are inherently limited by the fixed creative capacity of language models and are not designed to optimize for fine-grained dimensions of creativity. Recently, motivated by the negative impact of preference alignment techniques on the diversity of LLM outputs (Padmakumar and He, 2023; Anderson et al., 2024; Kirk et al., 2023; O'Mahony et al., 2024; West and Potts, 2025), a few works have suggested directly modifying preference optimization methods to promote output diversity (Lanchantin et al., 2025; Chung et al., 2025). Inspired by these approaches, we design a new optimization strategy that injects signals from multiple dimensions of creativity into the preference modeling objective in a modular fashion. Specifically, we integrate the novelty, diversity, surprise, and quality dimensions of creativity into the training objective of direct preference optimization (DPO) (Rafailov et al., 2023), with a weighted composition that allows balancing each dimension's contribution. We call this method creative preference optimization (CRPO) and provide its conceptual illustration in Figure 1, with full details in Section 3.
We test the efficacy of CRPO using MUCE (Multitask Creativity Evaluation), our newly curated large-scale dataset of prompt-response pairs annotated with human preferences across a diverse range of creative tasks in multiple languages. While previous work has largely evaluated creativity improvements on a narrow range of tasks like story generation (Chung et al., 2025; Lanchantin et al., 2025) or creative problem solving (Tian et al., 2023), MUCE enables us to test whether our methods truly generalize across a diverse range of creativity assessments. Our results show that Llama-3.1-8B-Instruct (AI Meta, 2024) and Mistral-7B-Instruct-v0.3 (Jiang et al., 2023) trained using CRPO outperform the same models trained using only supervised fine-tuning (SFT) or DPO without any creativity injections, as well as existing LLMs such as GPT-4o, generating more novel, diverse, and surprising outputs than all the baselines while maintaining high quality.

Our main contributions are as follows:

1. We introduce MUCE, a large-scale preference dataset consisting of more than 200,000 human responses and ratings for more than 30 creativity assessments. All tasks within MUCE are carefully chosen to provide valid measures of creativity in humans, making MUCE one of the largest psychologically valid datasets of human creativity for training preference models.
2. We propose CRPO, a novel, flexible preference alignment method that injects signals from several dimensions of creativity into the existing preference optimization method DPO, and train creativity-enhanced versions of Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3.

3. We evaluate the effectiveness of our approach on a range of creativity tasks from MUCE, as well as external tasks from NOVELTYBENCH (Zhang et al., 2025), using both automated metrics and human evaluations. Our analysis shows that CRPO is a promising method for enhancing the creative capabilities of language models while maintaining quality.
2 Related Work

2.1 Large Language Model Creativity
The potential of building LLM applications for creative industries has spurred significant research interest in AI creativity (Bellemare-Pepin et al., 2024), and many LLM tools marketed for assistance with creative tasks have been developed in the last few years (Wang et al., 2024b). Yet debates on whether AI is capable of true creativity are nearly as old as AI itself (Stein, 2014; Franceschelli and Musolesi, 2024; Sæbø and Brovold, 2024), with theoretical and philosophical arguments being made both for and against AI creativity (Ismayilzada et al., 2024a). Classic psychological theories of creativity generally agree that, for a product to be creative, it must be new, surprising, and valuable (Boden, 2004). Creative tasks are also often characterized by high diversity (Padmakumar and He, 2023; Shypula et al., 2025), though diversity is only one facet of creativity (Johnson et al., 2021). Studies on LLM creativity have yielded conflicting findings: some suggest LLMs surpass human creativity (Bellemare-Pepin et al., 2024; Zhao et al., 2024), others argue they fall short (Koivisto and Grassini, 2023; Chakrabarty et al., 2024; Ismayilzada et al., 2024b), while some conclude that LLM and human creativity are roughly equivalent (Gilhooly, 2024; Stevenson et al., 2022; Góes et al., 2023). Some works have suggested that LLMs lack novelty and surprise in their generations (Ismayilzada et al., 2024a,b; Zhang et al., 2025; Tian et al., 2024; Chakrabarty et al., 2024) and that their seemingly remarkable creative outputs may be in large part attributable to the remarkable creativity of human-written texts on the web (Lu et al., 2024). Some recent works have suggested improving the creativity of LLMs through prompting techniques (Tian et al., 2023; Mehrotra et al., 2024; Nair et al., 2024; Summers-Stay et al., 2023) and decoding strategies (Franceschelli and Musolesi, 2024; Meister et al., 2023). In this work, we instead explore directly optimizing language models for creativity using human preferences extracted from responses to creativity assessments.
2.2 Preference Learning
Aligning LLMs to human preferences has proven effective in developing models that are helpful and useful to users, leading to the emergence of numerous preference learning methods (Gao et al., 2024; Ouyang et al., 2022; Rafailov et al., 2023). However, prior work has highlighted a lack of diversity in LLM outputs (Anderson et al., 2024; Lanchantin et al., 2025; Wenger and Kenett, 2025; Padmakumar and He, 2023), with alignment often cited as a contributing factor (West and Potts, 2025). In response, recent research has explored modifications to existing preference modeling techniques aimed at mitigating this reduction in diversity. One notable approach, Diverse Preference Optimization, proposes enhancing preference data creation by selecting preference pairs based on a diversity metric (Lanchantin et al., 2025). Another recent method introduces a modification to the optimization objective itself to incorporate a diversity signal (Chung et al., 2025). Both strategies have demonstrated effectiveness in promoting output diversity with minimal impact on output quality. However, as previously noted, diversity represents only one facet of creativity; true creativity also requires the capacity for novelty and surprise. In this work, we present a modular preference alignment framework for creativity that enables direct optimization across multiple dimensions of creative expression.
3 Creative Preference Optimization
According to its three-criterion definition, creativity involves the generation of novel, high-quality, and surprising ideas (Simonton, 2012; Boden, 2004; Runco and Jaeger, 2012). Moreover, creative outputs tend to be highly diverse across individuals (Anderson et al., 2024). Therefore, to promote overall creativity in LLM outputs, we propose to inject unsupervised metrics related to each dimension of creativity into the loss functions of standard preference optimization methods. We use direct preference optimization (DPO) (Rafailov et al., 2023) to illustrate our modifications to the loss function. Recall that in the standard formulation of DPO, a policy model ($p_\theta$) is directly optimized on a dataset of triples $(x, y_w, y_l)$, where $x$, $y_w$, and $y_l$ refer to the model input (i.e., prompt), the preferred (i.e., chosen) model response, and the dispreferred (i.e., rejected) model response, respectively. Using the ratio between the policy model's likelihood and that of the reference SFT model ($p_{\mathrm{SFT}}$) as an implicit reward, the training objective of DPO is defined as follows:
$$\ell_{\mathrm{DPO}} = -\log \sigma\left(\beta \log \frac{p_\theta(y_w \mid x)}{p_{\mathrm{SFT}}(y_w \mid x)} - \beta \log \frac{p_\theta(y_l \mid x)}{p_{\mathrm{SFT}}(y_l \mid x)}\right)$$

$$\mathcal{L}_{\mathrm{DPO}} = \mathbb{E}_{(x, y_w, y_l) \sim D}\left[\ell_{\mathrm{DPO}}\right] \tag{1}$$
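To make the objective concrete, here is a minimal PyTorch sketch of Eq. 1 (our own illustration, not the authors' released code); it assumes the per-sequence log-likelihoods under the policy and the frozen SFT reference have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Eq. 1: -log sigma(beta * (log-ratio of y_w minus log-ratio of y_l)).

    Each tensor holds one sequence log-likelihood per (x, y_w, y_l)
    triple in the batch; the reference tensors come from the frozen
    SFT model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```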
A challenge with standard preference optimization methods is that they may significantly reduce the diversity of the responses LLMs generate, as the loss function encourages models to generate preferred responses even if they are not very creative (West and Potts, 2025; Padmakumar and He, 2023; Anderson et al., 2024; Kirk et al., 2023; Xu et al., 2024; O'Mahony et al., 2024; Zhang et al., 2024; Wenger and Kenett, 2025). Existing approaches to address this in the preference optimization objective have centered around curating preference data based on various diversity metrics (Lanchantin et al., 2025) or incorporating extra regularization terms that encourage diverse generations while balancing quality (Chung et al., 2025). For example, the recently proposed Diversified DPO (DDPO) method adds a scalar diversity term $\delta_w$ (i.e., the diversity score of the preferred response) to the DPO loss (Chung et al., 2025):
$$\mathcal{L}_{\mathrm{DDPO}} = \mathbb{E}_{(x, y_w, y_l) \sim D}\left[\delta_w\, \ell_{\mathrm{DPO}}\right] \tag{2}$$
While diversity is important for creativity, research in psychology has long established that truly creative responses also require novelty, surprise, and quality (Boden, 2004; Barron, 1955; Simonton, 2018). Therefore, we propose incorporating metrics for each of these, alongside diversity, into the preference loss in a modular structure, enabling the construction of different creativity models by combining these dimensions as needed:
$$\mathcal{L}_{\mathrm{CDPO}} = \mathbb{E}_{(x, y_w, y_l) \sim D}\left[(\lambda_d \delta_w + \lambda_n \nu_w + \lambda_s \xi_w + \lambda_q \gamma_w)\, \ell_{\mathrm{DPO}}\right] \tag{3}$$
In our proposed creative DPO loss, $\delta_w$, $\nu_w$, $\xi_w$, and $\gamma_w$ correspond to the diversity, novelty, surprise, and quality scores of the preferred response, respectively, and $\lambda_d$, $\lambda_n$, $\lambda_s$, and $\lambda_q$ are hyperparameters that control the effect of each score (we call them injection weights). In particular, when $\lambda_d = 1$, $\lambda_n = 0$, $\lambda_s = 0$, and $\lambda_q = 0$, we recover the DDPO loss. While there are multiple approaches for operationalizing $\delta_w$, $\nu_w$, $\xi_w$, and $\gamma_w$, we propose to use the following metrics for each:
3.1 Diversity
We use an inverse homogenization metric from Padmakumar and He (2023), similar to Chung et al. (2025). Specifically, given a prompt $x$ and a set of (preferred) responses for $x$ denoted as $Y_x$, we compute the diversity score of any particular preferred response as the average pairwise semantic distance to all the other preferred responses in $Y_x$:
$$\delta_w = \frac{1}{|Y_x| - 1} \sum_{y_i \in Y_x \setminus \{y_w\}} \mathrm{semdis}(y_w, y_i) \tag{4}$$
We use $1 - \mathrm{cos\_sim}(\cdot, \cdot)$ as the semantic distance function (i.e., $\mathrm{semdis}(\cdot, \cdot)$).
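As an illustration, a small sketch of Eq. 4 under the assumption that the preferred responses for a prompt have already been embedded (function and variable names are ours):

```python
import numpy as np

def semdis(a: np.ndarray, b: np.ndarray) -> float:
    """Semantic distance: 1 - cosine similarity of two embeddings."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def diversity_score(w: int, embeddings: list[np.ndarray]) -> float:
    """Eq. 4: mean semantic distance from preferred response w to all
    other preferred responses for the same prompt."""
    others = [e for i, e in enumerate(embeddings) if i != w]
    return sum(semdis(embeddings[w], e) for e in others) / len(others)
```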
3.2 Novelty
We use a novelty metric similar to Karampiperis et al. (2014), where the novelty of a text is defined as the absolute difference between the average pairwise semantic distance of words in the text and that of a reference corpus of texts. In particular, we define the set of preferred responses to a prompt $x$ as the reference corpus ($Y_x$) and define the novelty of a preferred response as follows:
$$\nu_w = \left|\, \mathrm{DSI}(y_w) - \mathrm{DSI}(Y_x) \,\right| \tag{5}$$

$$\mathrm{DSI}(T) = \frac{\sum_{i,j=1,\; i \neq j}^{|\bar{T}|} \mathrm{semdis}(\bar{T}_i, \bar{T}_j)}{|\bar{T}|} \tag{6}$$
Here $T$ refers to a piece of text, $\bar{T}_i$ to word $i$ in the set of unique words in $T$ (denoted as $\bar{T}$), and $\mathrm{DSI}(\cdot)$ is divergent semantic integration, the average pairwise semantic distance of words in a text (Johnson et al., 2022).
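A sketch of Eqs. 5-6 under the same assumption of precomputed word embeddings (averaging over word pairs is our reading of DSI as an average pairwise distance):

```python
from itertools import combinations
import numpy as np

def dsi(word_embeddings: list[np.ndarray]) -> float:
    """Divergent semantic integration (Eq. 6): average pairwise
    semantic distance over the unique words of a text."""
    dists = [1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
             for a, b in combinations(word_embeddings, 2)]
    return float(np.mean(dists))

def novelty_score(response_word_embs: list[np.ndarray],
                  reference_dsi: float) -> float:
    """Eq. 5: absolute difference between the DSI of the preferred
    response and the DSI of the reference corpus Y_x."""
    return abs(dsi(response_word_embs) - reference_dsi)
```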
3.3 Surprise
We use Shannon surprise, the negative log-likelihood of the text, which has been widely used as a measure of surprise in prior work (Bunescu and Uduehi, 2022; Modirshanechi et al., 2022; Kuznetsova et al., 2013). More specifically, given a prompt $x$, we define the surprise of a particular response as the exponentiated negative log-likelihood of the response (i.e., perplexity) conditioned on the prompt $x$ and under some reference model $S$ as follows:
$$\xi_w = 2^{-\log P_S(y_w \mid x)} \tag{7}$$
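A sketch of Eq. 7 with Hugging Face transformers (our illustration; batching, chat templates, and device placement are omitted, and we average the negative log-likelihood over response tokens, the usual perplexity convention):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def surprise_score(prompt: str, response: str, model, tokenizer) -> float:
    """Eq. 7: perplexity of the response under reference model S,
    conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits[:, t] predicts token t+1; score only the response tokens.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_logps = log_probs[torch.arange(targets.shape[0]), targets]
    resp_logps = token_logps[prompt_ids.shape[1] - 1:]
    nll_nats = -resp_logps.mean().item()
    return 2 ** (nll_nats / math.log(2))  # equals exp(nll_nats)

# Example usage (model choice is an assumption; any causal LM works):
# tok = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
# lm = AutoModelForCausalLM.from_pretrained("google/gemma-2-27b-it")
# xi_w = surprise_score("Write a poem about rain.", "Clouds spill silver...", lm, tok)
```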
3.4 Quality
Although a general quality scoring method is hard to define, reward models that are trained to assign high scores to preferred answers can be used as a proxy (Zhang et al., 2025; Lambert et al., 2024). In particular, we define the quality of a preferred response given a prompt $x$ as the score assigned by some reward model $R$: $\gamma_w = R(y_w \mid x)$.
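Putting the four scores together, a minimal sketch of the Eq. 3 objective (again our illustration, assuming the scores have been precomputed per preferred response and normalized as described in Section 5.2):

```python
import torch
import torch.nn.functional as F

def crpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              diversity, novelty, surprise, quality,
              lam_d=1.0, lam_n=1.0, lam_s=1.0, lam_q=1.0,
              beta=0.1):
    """Eq. 3: per-example DPO loss scaled by a weighted sum of the
    creativity scores of the preferred response. Setting
    (lam_d, lam_n, lam_s, lam_q) = (1, 0, 0, 0) recovers DDPO (Eq. 2).
    """
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    per_example = -F.logsigmoid(logits)  # element-wise l_DPO
    weights = (lam_d * diversity + lam_n * novelty
               + lam_s * surprise + lam_q * quality)
    return (weights * per_example).mean()
```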
4 The MUCE Dataset
To compile MUCE, we solicited data from the global creativity research community, specifically targeting researchers studying human creativity to obtain data from tasks known to be valid creativity measures. We specifically targeted datasets which contained complete metadata, including information about the task, language, and items that participants responded to. We gathered additional data by performing a manual search of the Open Science Framework database (https://osf.io), and only retained data from peer-reviewed articles. In total, 43% of the data in MUCE has never been publicly released, making it unlikely that LLMs have seen the item-response combinations for the majority of our tasks.

Every response in MUCE was rated for creativity by at least two raters, and in some cases up to 75, employing a missing-raters design (Forthmann et al., 2025). While it is common practice to measure creativity using multiple independent raters, individual raters may deliver unhelpful or noisy ratings if they did not understand the task instructions, had a different understanding of the rating criteria, or for other reasons (Forthmann et al., 2017). To account for this, we followed best practices for subjective scoring tasks by employing Judge Response Theory (JRT; Myszkowski and Storme, 2019) to check for raters whose ratings were uninformative in an information-theoretic sense. We fit JRT models to each task within MUCE, which gave us an information function for each rater across tasks. We then input the results from the JRT into a genetic algorithm (Schroeders et al., 2016) which identified a subset of raters per dataset that maximized the per-dataset rater information function (while ensuring that the algorithm kept at least two raters per dataset). This process dropped uninformative raters from each dataset, enhancing the quality of the final creativity ratings. The individual raters' scores were aggregated via factor scores, as is best practice in creativity assessment (Silvia, 2011), and we rescaled the factor-transformed creativity scores into the integer range 10-50, as is done in prior work on automated creativity assessment (Organisciak et al., 2023). From this dataset, we create multiple data splits for training and testing. Full details about the dataset construction are in Appendix A.
5 Experiments
5.1 SFT and Preference Datasets
While our MUCE dataset contains samples in multiple languages, we focus on showing the effectiveness of CRPO on the English subset in this work and leave experiments using the full dataset as future work. From the base English MUCE dataset, we generate a preference dataset by creating tuples of preferred and rejected responses to the same prompt, treating the response that received the higher creativity score as the preferred one. Past work has shown that data quality is one of the main factors behind preference model performance (Liu et al., 2024; Deng et al., 2025; Wang et al., 2024a). Therefore, we curate a high-quality SFT dataset of 5,275 samples (MUCE-SFT) and a preference dataset of 42,058 samples (MUCE-PREF) from the base MUCE dataset, which we detail in Appendix B.
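For illustration, a sketch of how such preference tuples could be built from the rated responses (the field names and tie-handling rule are our assumptions, not the exact MUCE-PREF recipe):

```python
from itertools import combinations

def build_preference_pairs(responses_by_prompt):
    """responses_by_prompt: dict mapping a prompt to a list of
    (response_text, creativity_score) pairs. The higher-rated response
    in each pair becomes the preferred one; exact ties are skipped."""
    pairs = []
    for prompt, rated in responses_by_prompt.items():
        for (resp_a, score_a), (resp_b, score_b) in combinations(rated, 2):
            if score_a == score_b:
                continue  # no clear preference
            chosen, rejected = ((resp_a, resp_b) if score_a > score_b
                                else (resp_b, resp_a))
            pairs.append({"prompt": prompt,
                          "chosen": chosen,
                          "rejected": rejected})
    return pairs
```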
5.2 Training

Models
As our base models, we use Llama-3.1-8B-Instruct (AI Meta, 2024) and Mistral-7B-Instruct-v0.3 (Jiang et al., 2023) and implement CRPO as described in Section 3. We first train our models using supervised fine-tuning (the SFT model) for a single epoch on MUCE-SFT, and then apply preference optimization on the SFT model using CRPO and the MUCE-PREF dataset. We train all models using parameter-efficient tuning with LoRA, using a rank of 128 and an alpha of 256 (Hu et al., 2022). Additional details on the training setup can be found in Appendix C.
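A sketch of the corresponding adapter configuration with the peft library (only the rank and alpha are stated in the paper; the remaining fields are illustrative assumptions):

```python
from peft import LoraConfig

# Rank 128 and alpha 256 as described above; dropout and target
# modules are assumptions for the sake of a runnable example.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```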
Figure 2: Results on the held-out evaluation suite from MUCE across all baselines and our models using Llama-3.1-8B-Instruct as a base model. nov, div, sur, qua, cre denote novelty, diversity, surprise, quality, and creativity, respectively. Results are averaged across tasks. Mistral-7B-Instruct-v0.3 results can be found in Appendix Figure 6.

Creativity Injection
We compute creativity metric scores for each preferred response and inject them into the DPO objective function as described in Section 3. Since each metric is on a different scale and we would like to combine the effects of different injections, we normalize each score to the range [0, 1] before injection. We vary the injection weights $\lambda_d$, $\lambda_n$, $\lambda_s$, $\lambda_q$ accordingly (for example, to train a novelty model, we set $\lambda_n = 1$ and the others to 0, whereas for a novelty-and-quality model we set $\lambda_n = 1$ and $\lambda_q = 1$) to train different suites of creative models. As the novelty and diversity measures require a reference set to compute against, we adopt a prompt-level granularity and consider the set of responses for a given prompt as the reference corpus, similar to prior work (Chung et al., 2025). We use the jina-embeddings-v3 model (Sturua et al., 2024) to compute text embeddings for all metrics that rely on semantic distance. For surprise, we use instruction-tuned Gemma-2-27B (Google, 2024a) as our reference surprise model $S$. While our creativity preference dataset is already high-quality, we also experiment with injecting external quality signals to study their interaction with other creativity dimensions. Hence, for the quality measure, we employ an existing reward model, Skywork-Reward-Gemma-27B-v0.2 (Liu et al., 2024), one of the top-performing models on RewardBench (Lambert et al., 2024), as our reference reward model $R$.
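A small sketch of the normalization step mentioned above (we assume simple min-max scaling over the training set; the paper states only that each score is mapped to [0, 1]):

```python
import numpy as np

def minmax_normalize(scores):
    """Rescale a creativity metric to [0, 1] so that differently scaled
    metrics (e.g., perplexity vs. cosine distance) can be combined."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores)  # degenerate case: constant metric
    return (scores - lo) / (hi - lo)
```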
5.3 Evaluation

Tasks and Metrics
We evaluate all models across several dimensions of creativity on held-out prompts of various tasks and on two held-out tasks. More specifically, we use 6 held-out prompts from the Real-Life Creative Problem Solving, Alternate Uses of Objects, Design Solutions, Hypothesis Generation, and Metaphors tasks, and 9 prompts from the two held-out tasks of Poems and Sentence Completion. For each prompt, we generate 16 responses from each model by varying the temperature, top-p, and top-k decoding parameters. Our final held-out evaluation suite contains 224 samples. We evaluate the responses on the dimensions of novelty, diversity, and surprise using the metrics described in Section 3. Additionally, to study the tradeoff between creativity and quality, we train a reward model on our preference dataset using instruction-tuned Gemma-2-9B (Google, 2024a) and use it to score the overall quality of model generations. More details about the evaluation setup can be found in Appendix D.
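For instance, one way to obtain 16 responses per prompt is a small grid over the three decoding parameters (the specific values below are our assumptions; the paper does not list them):

```python
from itertools import product

# Hypothetical 4 x 2 x 2 decoding grid: one configuration per response.
temperatures = [0.7, 0.9, 1.0, 1.1]
top_ps = [0.9, 0.95]
top_ks = [50, 100]

decoding_configs = [
    {"do_sample": True, "temperature": t, "top_p": p, "top_k": k}
    for t, p, k in product(temperatures, top_ps, top_ks)
]
assert len(decoding_configs) == 16
```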
Baselines
As baselines, we use the base models Llama-3.1-8B-Instruct (AI Meta, 2024) and Mistral-7B-Instruct-v0.3 (Jiang et al., 2023); SFT models, which are the base models supervised fine-tuned on MUCE-SFT; a vanilla DPO model trained on top of the SFT model using the MUCE-PREF dataset without any creativity injections; and three closed-source instruction-tuned LLMs, namely GPT-4o (OpenAI, 2024), Claude-3.7-Sonnet (Anthropic, 2025), and Gemini-2.0-Flash (Google, 2024b).
CRPO Models
We train several CRPO models corresponding to the different dimensions of creativity. More specifically, for each dimension, we train a model that is injected with a signal for the given dimension (e.g., CRPO-nov) and another model that is injected with signals for both the given dimension and the quality dimension (e.g., CRPO-nov-qua). We train the latter models to understand the tradeoff between the other dimensions of creativity and quality that has been reported in previous research (Zhang et al., 2025; Lanchantin et al., 2025; Chung et al., 2025). Additionally, we train two creative models, one that injects all dimensions of creativity (denoted CRPO-cre) and one that injects all except quality (denoted CRPO-nov-div-sur). In all these experiments, the $\lambda$ injection weights are set to 1 for simplicity. We perform a more detailed analysis of these hyperparameters in Section 6.1.

Figure 3: Effect of injection weights for each dimension. Results are averaged across three seed runs.
6 Results
Figure 2 summarizes performance on our held-out evaluation suite across creativity dimensions for all baselines and CRPO models using Llama-3.1-8B-Instruct as a base. Results for Mistral-7B-Instruct-v0.3 can be found in Appendix Figure 6 and follow the same trends. First, we observe a clear separation between existing instruction-tuned LLMs and our models: while the former cluster around high quality but low novelty, diversity, and surprise, our models achieve high scores across all four dimensions. Second, for each creativity dimension, the model trained with that specific injection outperforms the others on the same metric, confirming the effectiveness of targeted optimization, without a considerable drop in quality.

Models that combine a creativity signal with an external quality signal (CRPO-{nov,div,sur}-qua) improve in quality but show reduced performance on the targeted dimension, illustrating a trade-off. The same pattern holds when comparing the CRPO-nov-div-sur model to the full CRPO-cre model, further highlighting the balance between quality and the other facets of creativity. Interestingly, the vanilla DPO model, without any creativity injections, already outperforms the existing LLM baselines, demonstrating the strength of our preference dataset. Still, most of our creativity-optimized models significantly surpass DPO across all dimensions. Finally, the SFT model performs worst in quality and shows only comparable performance on the other dimensions, reinforcing prior findings (Chung et al., 2025) about the limited generalizability of supervised fine-tuning in creative tasks, where no single correct answer exists.

Overall, our results show that CRPO enhances multiple aspects of creativity with minimal impact on quality, offering a flexible and effective framework for creativity alignment in LLMs.
6.1 Effect of Injection Weights
While we set all injection weights to 1 for simplicity in our main evaluations, we also study the effect of different injection weight values on the performance of models across dimensions. In particular, we vary the injection weights from 0 to 2.0 in increments of 0.5 for all dimensions and report the results averaged across three seed runs in Figure 3. We observe that across most dimensions, an injection weight of 0.5 yields the greatest performance gains, with further increases resulting in diminishing returns or slight performance degradation. In terms of quality, an injection weight of 1.0 results in the highest performance. Nevertheless, any weight above 0 consistently outperforms the model without any injection, with minimal drop in quality (Appendix Figure 8). We suggest tuning these values depending on the training dataset, underlying task, and base model for the best performance.
6.2 Human Evaluation

In addition to automated metrics, we conduct a human evaluation to assess the real-world effectiveness of our approach. Due to the high cost of human studies, we focus on the overall creativity dimension using a single task (Sentence Completion), 4 prompts, 4 baselines (SFT, DPO, Llama-3.1-8B-Instruct, and GPT-4o), and 5 CRPO variants (nov, div, sur, nov-div-sur, cre). In a blind pairwise setup, participants compared responses from a baseline and a CRPO model for creativity, unaware that the texts were AI-generated. A total of 320 comparisons were collected with balanced sampling across models. Additional details are in Appendix D.1.

Figure 4: Human evaluation results measured by win rates (%). Participants were asked to make a pairwise comparison between our models and the baselines with respect to overall creativity.

                      DPO   GPT-4o   Llama-3.1-8B     SFT
  CrPO-cre           50.0     43.8           56.2    93.8
  CrPO-nov-div-sur   56.2     56.2           68.8   100.0
  CrPO-nov           37.5     37.5           37.5    75.0
  CrPO-div           68.8     37.5           18.8   100.0
  CrPO-sur           43.8     56.2           43.8    93.8

Figure 4 presents the win rates. The CRPO-nov-div-sur model consistently outperforms all baselines, particularly Llama-3.1-8B-Instruct, by a wide margin. In contrast, the full CRPO-cre model lags slightly, reflecting the creativity-quality tradeoff seen in the automated evaluations. Notably, CRPO models achieve especially strong gains over SFT, reinforcing our previous findings.
Figure 5: Evaluation results on NOVELTYBENCH (novelty vs. quality for all models), using the novelty and quality metrics defined in Zhang et al. (2025).
6.3 NOVELTYBENCH Evaluation

While we demonstrate the effectiveness of our approach on the MUCE held-out set using automated metrics, we also evaluate generalization on external benchmarks using the recently introduced NOVELTYBENCH (Zhang et al., 2025). This benchmark includes tasks spanning randomness, factual knowledge, creative writing, and subjectivity. Following the recommended evaluation setup, we benchmark all baselines and CRPO variants on a curated 100-prompt subset, using the benchmark's novelty and quality metrics. Full details are in Appendix D.2.

Figure 5 shows novelty vs. quality scores across all models and tasks. As in our internal evaluation, we observe a clear separation: existing LLM baselines cluster around lower novelty and variable quality, while our models consistently achieve high scores on both dimensions. Notably, although our models outperform SFT on novelty, the SFT model surprisingly achieves higher quality, beating the baselines by a large margin and our models by a smaller one. This aligns with findings from NOVELTYBENCH (Zhang et al., 2025), where smaller models like Gemma-2-2B-it and Llama-3.1-8B-Instruct often surpass larger ones in quality.

Overall, our models set a new state-of-the-art on the NOVELTYBENCH leaderboard (https://novelty-bench.github.io) in terms of novelty.
7 Conclusion
We introduce CRPO, a flexible methodology for enhancing the creativity of LLMs. Leveraging a novel large-scale human preference dataset focused on creativity, we show that models aligned with CRPO produce generations that are not only novel, diverse, and surprising, but also high in quality, on both our held-out evaluation suite and the external NOVELTYBENCH dataset. Human evaluations further confirm that raters consistently judge our models' outputs to be more creative than those of several strong baselines, highlighting the potential of our approach to boost LLM creativity. While our experiments focus on smaller models such as Llama-3.1-8B and an English-only dataset, future work could explore the scalability of CRPO to larger models, multilingual settings, and other preference optimization methods.
Limitations

Due to constraints on both computational resources and budget for human studies, we were unable to evaluate CRPO on any languages other than English. Multilingual creativity assessment using generative AI remains a challenging problem and an active area of research (Haase et al., 2025). While we believe our data represents a valuable resource for the community, future work will need to test our methods in multilingual settings to ensure multilingual generalization. These compute constraints also prevented us from evaluating CRPO on larger open-weight models, making scaling trends difficult to predict. We retained only samples with full agreement on the creativity score when training our models. While this aligns with best practices for creativity measurement in psychology (Cseh and Jeffries, 2019), it may also mask genuine sources of rater disagreement that should be modeled. Finally, we acknowledge that, much like other datasets used to align LLMs, the preferences represented by our annotator population likely do not reflect the full range of human preferences, which could bias our models' generations (Yeh et al., 2024). We believe that the large-scale and multilingual nature of our preference data likely makes it one of the most representative creativity datasets currently available, but stress that future work should consider issues of bias and fairness more carefully for LLM creativity assessment.
Ethical Considerations

We emphasize that our models should not be used for safety-critical applications, as the relationship between creativity and alignment with other values remains underexplored. Notably, our dataset contains responses to tests of malevolent creativity that are by definition unsafe for models to generate. We also observed qualitatively that CRPO models were more likely to generate unsafe or toxic responses, even to prompts that did not explicitly request such behaviors. We believe that our data is valuable for red-teaming evaluations on tasks requiring creativity, and that aligning models on these malevolent responses could be beneficial for understanding how malicious actors might use creativity-enhanced models to execute unsafe goals. However, we also acknowledge the ethical concerns that the release of our models and datasets would raise, and believe that restricting access to only those who have signed a license agreement is the best approach for balancing safety with continued scientific advancement. While we believe our results demonstrate how aligning LLMs with carefully designed human creativity datasets can significantly improve the novelty and diversity of their generations, it remains unclear how to optimize for creativity while preserving guardrails that prevent unsafe behavior.

We also acknowledge the broader debates around the valid use of AI in social-behavioral research (Sun et al., 2025) and concerns surrounding AI automation of industries requiring creativity (Wilkinson, 2023), in which our work is situated. While over-reliance on AI for creative tasks to the detriment of human welfare is a legitimate concern, AI has also been acknowledged for its potential to enhance human creativity above and beyond what might be possible otherwise (de Chantal et al., 2025). Creativity is a vital skill for future knowledge workers to master (Forum, 2025), and we believe that enhancing the creativity of AI is an important prerequisite for developing AI systems capable of training humans to be more creative.
Acknowledgements

Mete and Lonneke gratefully acknowledge the support of the Swiss National Science Foundation (grant 205121_207437: C-LING). R.E.B. is supported by grants from the US National Science Foundation [DRL-1920653; DRL-240078; DUE-2155070].
References
Sergio Agnoli, Giovanni E Corazza, and Mark A Runco. 2016. Estimating creativity with a multiple-measurement approach within scientific and artistic domains. Creativity Research Journal, 28(2):171-176.

AI Meta. 2024. Llama 3 model card.

Barrett R Anderson, Jash Hemant Shah, and Max Kreminski. 2024. Homogenization effects of large language models on human creative ideation. In Proceedings of the 16th Conference on Creativity & Cognition, pages 413-425.

Anthropic. 2025. Claude 3.7 sonnet and claude code.

Frank Barron. 1955. The disposition toward originality. The Journal of Abnormal and Social Psychology, 51(3):478.

Roger Beaty, Robert A Cortes, Simone Luchini, John D Patterson, Boris Forthmann, Brendan S Baker, Baptiste Barbot, Mariale Hardiman, and Adam Green. 2024. The scientific creative thinking test (sctt): Reliability, validity, and automated scoring. PsyArxiv Preprints.
Roger E Beaty and Dan R Johnson. 2021. Automating creativity assessment with semdis: An open platform for computing semantic distance. Behavior Research Methods, 53(2):757-780.

Antoine Bellemare-Pepin, François Lespinasse, Philipp Thölke, Yann Harel, Kory Mathewson, Jay A Olson, Yoshua Bengio, and Karim Jerbi. 2024. Divergent creativity in humans and large language models. arXiv preprint arXiv:2405.13012.

Margaret A Boden. 2004. The creative mind: Myths and mechanisms. Routledge.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. Preprint, arXiv:2303.12712.

Razvan C. Bunescu and Oseremen O. Uduehi. 2022. Distribution-based measures of surprise for creative language: Experiments with humor and metaphor. Proceedings of the 3rd Workshop on Figurative Language Processing (FLP).

Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu. 2024. Art or artifice? large language models and the false promise of creativity. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1-34.

Soma Chaudhuri, Alan Pickering, and Joydeep Bhattacharya. 2025. Evaluating poetry: Navigating the divide between aesthetical and creativity judgments. The Journal of Creative Behavior, 59(1):e683.

Qi Chen, Bowen Zhang, Gang Wang, and Qi Wu. 2024. Weak-eval-strong: Evaluating and eliciting lateral thinking of llms with situation puzzles. arXiv preprint arXiv:2410.06733.

John Joon Young Chung, Ece Kamar, and Saleema Amershi. 2023. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. arXiv preprint arXiv:2306.04140.

John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yuqian Sun, and Max Kreminski. 2025. Modifying large language model post-training for diverse creative writing. arXiv preprint arXiv:2503.17126.

Katherine N Cotter, Jean E Pretz, and James C Kaufman. 2016. Applicant extracurricular involvement predicts creativity better than traditional admissions factors. Psychology of Aesthetics, Creativity, and the Arts, 10(1):2.

Genevieve M Cseh and Karl K Jeffries. 2019. A scattered cat: A critical evaluation of the consensual assessment technique for creativity research. Psychology of Aesthetics, Creativity, and the Arts, 13(2):159.

Pier Luc de Chantal, Roger Beaty, Antonio Laverghetta, Jimmy Pronchick, John Patterson, Peter Organisciak, Katarzyna Potega vel Zabik, Baptiste Barbot, and Maciej Karwowski. 2025. Artificial intelligence enhances human creativity through real-time evaluative feedback.

Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, and Xiangnan He. 2025. Less is more: Improving llm alignment via preference data selection. arXiv preprint arXiv:2502.14560.

Paul V DiStefano, John D Patterson, and Roger E Beaty. 2024. Automatic scoring of metaphor creativity with large language models. Creativity Research Journal, pages 1-15.

Paul V DiStefano, Daniel Zeitlen, Janet Rafner, Pier-Luc de Chantal, Aoran Peng, Scarlett Miller, and Roger Beaty. 2025. Evaluating ai's ideas: The role of individual creativity and expertise in human-ai co-creativity.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889-898, Melbourne, Australia. Association for Computational Linguistics.

Li Fan, Kaixiang Zhuang, Xueyang Wang, Jingyi Zhang, Cheng Liu, Jing Gu, and Jiang Qiu. 2023. Exploring the behavioral and neural correlates of semantic distance in creative writing. Psychophysiology, 60(5):e14239.

Boris Forthmann, Benjamin Goecke, and Roger E Beaty. 2025. Planning missing data designs for human ratings in creativity research: A practical guide. Creativity Research Journal, 37(1):167-178.

Boris Forthmann, Heinz Holling, Nima Zandi, Anne Gerwig, Pınar Çelik, Martin Storme, and Todd Lubart. 2017. Missing creativity: The effect of cognitive workload on rater (dis-)agreement in subjective divergent-thinking scores. Thinking Skills and Creativity, 23:129-139.

World Economic Forum. 2025. Future of jobs report.

Giorgio Franceschelli and Mirco Musolesi. 2024. Creative beam search: Llm-as-a-judge for improving response generation. arXiv preprint arXiv:2405.00099.
Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Zhe Yang, Liang Chen, Helan Hu, Runxin Xu, Qingxiu Dong, Ce Zheng, Wen Xiao, Ge Zhang, Daoguang Zan, Keming Lu, Bowen Yu, Dayiheng Liu, Zeyu Cui, Jian Yang, Lei Sha, and 5 others. 2024. Towards a unified view of preference learning for large language models: A survey. Preprint, arXiv:2409.02795.

Ken Gilhooly. 2024. Ai vs humans in the aut: Simulations to llms. Journal of Creativity, 34(1):100071.

Benjamin Goecke, Paul V DiStefano, Wolfgang Aschauer, Kurt Haim, Roger Beaty, and Boris Forthmann. 2024a. Automated scoring of scientific creativity in german. The Journal of Creative Behavior, 58(3):321-327.

Benjamin Goecke, Selina Weiss, and Oliver Wilhelm. 2024b. Driving factors of individual differences in broad retrieval ability: Gr is more than the sum of its parts. Journal of Experimental Psychology: Learning, Memory, and Cognition.

Fabrício Góes, Piotr Sawicki, Marek Grzes, Marco Volpe, and Jacob Watson. 2023. Pushing gpt's creativity to its limits: Alternative uses and torrance tests. In ICCC.

Google. 2024a. Gemma 2: Improving open language models at a practical size.

Google. 2024b. Introducing gemini 2.0: our new ai model for the agentic era.

Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. 2022. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.

J.P. Guilford. 1967. The Nature of Human Intelligence. McGraw-Hill series in psychology. McGraw-Hill.

Jennifer Haase, Paul H. P. Hanel, and Sebastian Pokutta. 2025. S-dat: A multilingual, genai-driven framework for automated divergent thinking assessment. Preprint, arXiv:2505.09068.

Shirley Anugrah Hayati, Minhwa Lee, Dheeraj Rajagopal, and Dongyeop Kang. 2023. How far can we extract diverse perspectives from large language models? arXiv preprint arXiv:2311.09799.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. Preprint, arXiv:2111.09543.

Ruizhi He, Kaixiang Zhuang, Lijun Liu, Ke Ding, Xi Wang, Lei Fu, Jiang Qiu, and Qunlin Chen. 2022. The impact of knowledge on poetry composition: An fmri investigation. Brain and Language, 235:105202.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, and 1 others. 2022. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3.

Shulin Huang, Shirong Ma, Yinghui Li, Mengzuo Huang, Wuhe Zou, Weidong Zhang, and Haitao Zheng. 2024. LatEval: An interactive LLMs evaluation benchmark with incomplete information from lateral thinking puzzles. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10186-10197, Torino, Italia. ELRA and ICCL.

Mete Ismayilzada, Debjit Paul, Antoine Bosselut, and Lonneke van der Plas. 2024a. Creativity in ai: Progresses and challenges. arXiv preprint arXiv:2410.17218.

Mete Ismayilzada, Claire Stevenson, and Lonneke van der Plas. 2024b. Evaluating creative short story generation in humans and large language models. arXiv preprint arXiv:2411.02316.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.

Dan R Johnson, Andrew S Cuthbert, and Mara E Tynan. 2021. The neglect of idea diversity in creative idea generation and evaluation. Psychology of Aesthetics, Creativity, and the Arts, 15(1):125.

Dan Richard Johnson, J. Kaufman, Brendan S. Baker, John D. Patterson, Baptiste Barbot, Adam E. Green, Janet G. van Hell, Evan S. Kennedy, Grace F Sullivan, Christa L. Taylor, Thomas Ward, and Roger E. Beaty. 2022. Divergent semantic integration (dsi): Extracting creativity from narratives with distributional semantic modeling. Behavior Research Methods, 55:3726-3759.

Hansika Kapoor, Hreem Mahadeshwar, Sarah Rezaei, Roni Reiter-Palmon, and James C Kaufman. 2024. The ties that bind: Low morals, high deception, and dark creativity. Creativity Research Journal, pages 1-20.

Pythagoras Karampiperis, Antonis Koukourikos, and Evangelia Koliopoulou. 2014. Towards machines for measuring creativity: The use of computational tools in storytelling activities. In 2014 IEEE 14th International Conference on Advanced Learning Technologies, pages 508-512.
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. 2023. Understanding the effects of rlhf on llm generalisation and diversity. arXiv preprint arXiv:2310.06452.

Mika Koivisto and Simone Grassini. 2023. Best humans still outperform artificial intelligence in a creative divergent thinking task. Scientific Reports, 13(1):13601.

Polina Kuznetsova, Jianfu Chen, and Yejin Choi. 2013. Understanding and quantifying creativity in lexical composition. In Conference on Empirical Methods in Natural Language Processing.

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, and 1 others. 2024. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787.

Jack Lanchantin, Angelica Chen, Shehzaad Dhuliawala, Ping Yu, Jason Weston, Sainbayar Sukhbaatar, and Ilia Kulikov. 2025. Diverse preference optimization. arXiv preprint arXiv:2501.18101.

Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. 2024. Skywork-reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451.

Ximing Lu, Melanie Sclar, Skyler Hallinan, Niloofar Mireshghallah, Jiacheng Liu, Seungju Han, Allyson Ettinger, Liwei Jiang, Khyathi Chandu, Nouha Dziri, and 1 others. 2024. Ai as humanity's salieri: Quantifying linguistic creativity of language models via systematic attribution of machine text against web text. arXiv preprint arXiv:2410.04265.

Simone A Luchini, Nadine T Maliakkal, Paul V DiStefano, Antonio Laverghetta Jr, John D Patterson, Roger E Beaty, and Roni Reiter-Palmon. 2025. Automated scoring of creative problem solving with large language models: A comparison of originality and quality ratings. Psychology of Aesthetics, Creativity, and the Arts.

Pronita Mehrotra, Aishni Parab, and Sumit Gulwani. 2024. Enhancing creativity in large language models through associative thinking strategies. arXiv preprint arXiv:2405.06715.

Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. 2023. Locally typical sampling. Transactions of the Association for Computational Linguistics, 11:102-121.

Alireza Modirshanechi, Johanni Brea, and Wulfram Gerstner. 2022. A taxonomy of surprise definitions. Journal of Mathematical Psychology, 110:102712.

Nils Myszkowski and Martin Storme. 2019. Judge response theory? a call to upgrade our psychometrical account of creativity judgments. Psychology of Aesthetics, Creativity, and the Arts, 13(2):167.

Lakshmi Nair, Evana Gizzi, and Jivko Sinapov. 2024. Creative problem solving in large language and vision models - what would it take? In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11978-11994, Miami, Florida, USA. Association for Computational Linguistics.

OpenAI. 2024. Gpt-4o system card. Preprint, arXiv:2410.21276.

Peter Organisciak, Selcuk Acar, Denis Dumas, and Kelly Berthiaume. 2023. Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity, 49:101356.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744.

Laura O'Mahony, Leo Grinsztajn, Hailey Schoelkopf, and Stella Biderman. 2024. Attributing mode collapse in the fine-tuning of large language models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models.

Vishakh Padmakumar and He He. 2023. Does writing with language models reduce content diversity? arXiv preprint arXiv:2309.05196.

John D Patterson, Hannah M Merseal, Dan R Johnson, Sergio Agnoli, Matthijs Baas, Brendan S Baker, Baptiste Barbot, Mathias Benedek, Khatereh Borhani, Qunlin Chen, and 1 others. 2023. Multilingual semantic distance: Automatic verbal creativity assessment in many languages. Psychology of Aesthetics, Creativity, and the Arts, 17(4):495.

Corinna Perchtold-Stefan, Hansika Kapoor, James C Kaufman, Hreem Mahadeshwar, and Alison Fernandes. 2024. Development and neuronal validation of the dark creativity deception battery (dcdb).

Corinna M Perchtold-Stefan, Christian Rominger, Ilona Papousek, and Andreas Fink. 2023. Functional eeg alpha activation patterns during malevolent creativity. Neuroscience, 522:98-108.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728-53741.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-16. IEEE.

Tuval Raz, Simone Luchini, Roger Beaty, and Yoed Kenett. 2024. Bridging the measurement gap: A
| large language model method of assessing open- | |
| ended question complexity. In Proceedings of the | |
| Annual Meeting of the Cognitive Science Society, vol- | |
| ume 46. | |
| Mark A Runco and Garrett J Jaeger. 2012. The standard | |
| definition of creativity. Creativity research journal, | |
| 24(1):92 96. | |
| Solve S忙b酶 and Helge Brovold. 2024. On the stochas- | |
| tics of human and artificial creativity. arXiv preprint | |
| arXiv:2403.06996. | |
| Keita Saito, Akifumi Wachi, Koki Wataoka, and | |
| Youhei Akimoto. 2023. | |
| Verbosity bias in prefer- | |
| ence labeling by large language models. Preprint, | |
| arXiv:2310.10076. | |
| Janika Saretzki, Rosalie Andrae, Boris Forthmann, and | |
| Mathias Benedek. 2024. Investigation of response ag- | |
| gregation methods in divergent thinking assessments. | |
| The Journal of Creative Behavior. | |
| Ulrich Schroeders, Oliver Wilhelm, and Gabriel Olaru. | |
| 2016. Meta-heuristics in short scale construction: | |
| Ant colony optimization and genetic algorithm. PloS | |
| one, 11(11):e0167110. | |
| Alexander Shypula, Shuo Li, Botong Zhang, Vishakh | |
| Padmakumar, Kayo Yin, and Osbert Bastani. 2025. | |
| Evaluating the diversity and quality of llm generated | |
| content. arXiv preprint arXiv:2504.12522. | |
| Paul J Silvia. 2011. Subjective scoring of divergent | |
| thinking: Examining the reliability of unusual uses, | |
| instances, and consequences tasks. Thinking Skills | |
| and Creativity, 6(1):24 30. | |
| Dean Keith Simonton. 2012. Taking the us patent of- | |
| fice criteria seriously: A quantitative three-criterion | |
| creativity definition and its implications. Creativity | |
| research journal, 24(2-3):97 106. | |
| Dean Keith Simonton. 2018. Defining creativity: Don t | |
| we also need to define what is not creative? | |
| The | |
| Journal of Creative Behavior, 52(1):80 90. | |
| Morris I Stein. 2014. Stimulating creativity: Individual | |
| procedures. Academic Press. | |
| Claire E. Stevenson, Iris Smal, Matthijs Baas, Raoul | |
| Grasman, and Han L. J. van der Maas. 2022. Putting | |
| gpt-3 s creativity to the (alternative uses) test. In | |
| ICCC. | |
| Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, | |
| Michael G眉nther, Bo Wang, Markus Krimmel, Feng | |
| Wang, Georgios Mastrapas, Andreas Koukounas, An- | |
| dreas Koukounas, Nan Wang, and Han Xiao. 2024. | |
| jina-embeddings-v3: Multilingual embeddings with | |
| task lora. Preprint, arXiv:2409.10173. | |
| Douglas Summers-Stay, Stephanie M. Lukin, and | |
| Clare R. Voss. 2023. Brainstorm, then select: a gen- | |
| erative language model improves its creativity score. | |
| Huaman Sun, Jiaxin Pei, Minje Choi, and David Jur- | |
| gens. 2025. Sociodemographic prompting is not yet | |
| an effective approach for simulating subjective judg- | |
| ments with llms. In Proceedings of the 2025 Confer- | |
| ence of the Nations of the Americas Chapter of the | |
| Association for Computational Linguistics: Human | |
| Language Technologies (Volume 2: Short Papers), | |
| pages 845 854. | |
| Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- | |
| Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan | |
| Schalkwyk, Andrew M Dai, Anja Hauth, Katie Mil- | |
| lican, and 1 others. 2023. | |
| Gemini: a family of | |
| highly capable multimodal models. arXiv preprint | |
| arXiv:2312.11805. | |
| Yufei Tian, Tenghao Huang, Miri Liu, Derek Jiang, | |
| Alexander Spangher, Muhao Chen, Jonathan May, | |
| and Nanyun Peng. 2024. Are large language models | |
| capable of generating human-level narratives? arXiv | |
| preprint arXiv:2407.13248. | |
| Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Ro- | |
| nan Le Bras, Raja Marjieh, Nanyun Peng, Yejin Choi, | |
| Thomas L Griffiths, and Faeze Brahman. 2023. Mac- | |
| gyver: Are large language models creative problem | |
| solvers? arXiv preprint arXiv:2311.09682. | |
| Binghai Wang, Rui Zheng, Lu Chen, Zhiheng Xi, Wei | |
| Shen, Yuhao Zhou, Dong Yan, Tao Gui, Qi Zhang, | |
| and Xuan-Jing Huang. 2024a. | |
| Reward modeling | |
| requires automatic adjustment based on data quality. | |
| In Findings of the Association for Computational | |
| Linguistics: EMNLP 2024, pages 4041 4064. | |
| Tiannan Wang, Jiamin Chen, Qingrui Jia, Shuai Wang, | |
| Ruoyu Fang, Huilin Wang, Zhaowei Gao, Chunzhao | |
| Xie, Chuou Xu, Jihong Dai, and 1 others. 2024b. | |
| Weaver: Foundation models for creative writing. | |
| arXiv preprint arXiv:2401.17268. | |
| Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, | |
| Barret Zoph, Sebastian Borgeaud, Dani Yogatama, | |
| Maarten Bosma, Denny Zhou, Donald Metzler, and | |
| 1 others. 2022. Emergent abilities of large language | |
| models. arXiv preprint arXiv:2206.07682. | |
| Selina Weiss, Benjamin Goecke, and Oliver Wilhelm. | |
| 2024. How much retrieval ability is in originality? | |
| The Journal of Creative Behavior, 58(3):370 387. | |
| Selina Weiss, Sally Olderbak, and Oliver Wilhelm. 2023. | |
| Conceptualizing and measuring ability emotional cre- | |
| ativity. Psychology of Aesthetics, Creativity, and the | |
| Arts. | |
| Emily Wenger and Yoed Kenett. 2025. We re different, | |
| we re the same: Creative homogeneity across llms. | |
| arXiv preprint arXiv:2501.19361. | |
| Peter West and Christopher Potts. 2025. Base models | |
| beat aligned models at randomness and creativity. | |
| Preprint, arXiv:2505.00047. | |
| --- Page 14 --- | |
| Alissa Wilkinson. 2023. Hollywood s writers are on | |
| strike. here s why that matters. | |
| Justin Wong, Yury Orlovskiy, Michael Luo, Sanjit A | |
| Seshia, and Joseph E Gonzalez. 2024. Simplestrat: | |
| Diversifying language model generation with stratifi- | |
| cation. arXiv preprint arXiv:2410.09038. | |
| Weijia Xu, Nebojsa Jojic, Sudha Rao, Chris Brockett, | |
| and Bill Dolan. 2024. Echoes in ai: Quantifying | |
| lack of plot diversity in llm outputs. arXiv preprint | |
| arXiv:2501.00273. | |
| Min-Hsuan Yeh, Leitian Tao, Jeffrey Wang, Xuefeng | |
| Du, and Yixuan Li. 2024. How reliable is human | |
| feedback for aligning large language models? arXiv | |
| preprint arXiv:2410.01957. | |
| Yuhua Yu, Lindsay Krebs, Mark Beeman, and Vicky T | |
| Lai. 2024. Exploring how generating metaphor via | |
| insight versus analysis affects metaphor quality and | |
| learning outcomes. Cognitive science, 48(8):e13488. | |
| Yiming Zhang, Harshita Diddee, Susan Holm, Hanchen | |
| Liu, Xinyue Liu, Vinay Samuel, Barry Wang, and | |
| Daphne Ippolito. 2025. Noveltybench: Evaluating | |
| creativity and diversity in language models. arXiv | |
| preprint arXiv:2504.05228. | |
| Yiming Zhang, Avi Schwarzschild, Nicholas Carlini, | |
| Zico Kolter, and Daphne Ippolito. 2024. Forcing | |
| diffuse distributions out of language models. arXiv | |
| preprint arXiv:2404.10859. | |
| Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, | |
| Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen | |
| Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen | |
| Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, | |
| Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, and | |
| 3 others. 2025. A survey of large language models. | |
| Preprint, arXiv:2303.18223. | |
| Yunpu Zhao, Rui Zhang, Wenyi Li, Di Huang, Jiaming | |
| Guo, Shaohui Peng, Yifan Hao, Yuanbo Wen, Xing | |
| Hu, Zidong Du, and 1 others. 2024. Assessing and | |
| understanding creativity in large language models. | |
| arXiv preprint arXiv:2401.12491. | |
| Kuan Lok Zhou, Jiayi Chen, Siddharth Suresh, Reuben | |
| Narad, Timothy T Rogers, Lalit K Jain, Robert D | |
| Nowak, Bob Mankoff, and Jifan Zhang. 2025. Bridg- | |
| ing the creativity understanding gap: Small-scale | |
| human alignment enables expert-level humor ranking | |
| in llms. arXiv preprint arXiv:2502.20356. | |
| Aleksandra Zieli nska, Peter Organisciak, Denis Du- | |
| mas, and Maciej Karwowski. 2023. Lost in trans- | |
| lation? not for large language models: Automated | |
| divergent thinking scoring performance translates to | |
| non-english contexts. Thinking Skills and Creativity, | |
| 50:101414. | |
A MUCE Dataset
We compiled data by means of crowdsourcing and data mining of the open-source data-sharing platform OSF. We crowdsourced from the global creativity research community by means of direct requests and posts on academic listservs. In our call for data sharing, we requested data relating to any creativity responses that were provided by human participants and scored for creativity by human raters. We specifically requested that the datasets include scores from each rater, rather than composite creativity scores, so that we could determine the rating data quality of each submission. As part of our inclusion criteria, we further requested that researchers provide information relating to: (a) the creativity task, (b) the item associated with each response, (c) the construct that was rated, and (d) the language of the task. We further asked researchers to provide a statement on whether they agreed to making their data open-source. For data mining through the OSF platform, we first searched through a series of relevant keywords (e.g., "creativity task", "originality score"). We only retained sub-datasets from credible sources, i.e., those associated with a citable peer-reviewed article and containing all the data required by our inclusion criteria.
After removing responses that didn't meet our inclusion criteria, our dataset amounted to 321,572 human-rated, language-based creativity responses. We then cleaned the dataset by standardizing the naming of each variable except for the responses themselves. Next, we removed responses that had been rated by fewer than two human judges. We also removed duplicate responses, retaining a single exemplar for responses that appeared more than once within a given item and task.
To enhance the reliability of human creativity ratings across the numerous datasets, we optimized the selection of raters by applying a meta-heuristic algorithm, specifically a Genetic Algorithm (GA; Schroeders et al., 2016). The GA operates through iterative selection, crossover, and mutation processes, mirroring the principles of natural selection, in our case to identify the optimal subset of raters for each dataset. In each iteration, candidate solutions, that is, combinations of raters, were evaluated with a predefined fitness function that maximized empirical reliability (rxx) within a graded response model (GRM), in line with judge response theory. For sub-datasets involving decimal-based scales, individual ratings were rounded to the nearest integer (rounding up at .5) to meet the requirements of the GRM.
Rater subsets demonstrating superior reliability were selected, recombined, and modified through random perturbations to prevent premature convergence to suboptimal solutions. This approach ensured that the selected raters provided consistent and informative judgments while reducing the noise introduced by inconsistent or uninformative ratings. By automating the selection process through the GA, we ensured maximal comparability of the selection procedure across datasets. Previous research has demonstrated the utility of GAs in psychometric optimization tasks, particularly in balancing brevity and measurement precision while maintaining construct validity. In the present study, the GA facilitated a systematic and data-driven refinement of rater selection, arguably enhancing the overall quality of creativity ratings.
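For illustration, a minimal sketch of such a rater-selection GA is given below (Python 3.10+). The true fitness in our setup is the GRM-based empirical reliability; as a stand-in, this sketch scores a subset by its mean pairwise correlation, and all names and hyperparameters (population size, mutation rate) are illustrative, not the exact values used.

```python
import random
import statistics
from itertools import combinations

def fitness(mask, raters, ratings):
    # `ratings` maps each rater to their scores over the same responses.
    # Stand-in for GRM-based empirical reliability (r_xx): mean pairwise correlation.
    subset = [r for r, keep in zip(raters, mask) if keep]
    if len(subset) < 2:
        return float("-inf")
    pairs = list(combinations(subset, 2))
    try:
        return sum(statistics.correlation(ratings[a], ratings[b])
                   for a, b in pairs) / len(pairs)
    except statistics.StatisticsError:  # e.g., a rater with constant scores
        return float("-inf")

def select_raters(ratings, pop_size=50, generations=100,
                  mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    raters = list(ratings)
    # Each candidate solution is a binary mask over raters (True = keep).
    population = [[rng.random() < 0.5 for _ in raters] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half as parents.
        population.sort(key=lambda m: fitness(m, raters, ratings), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(raters))  # one-point crossover
            child = a[:cut] + b[cut:]
            # Random perturbation (mutation) guards against premature convergence.
            child = [not g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    best = max(population, key=lambda m: fitness(m, raters, ratings))
    return [r for r, keep in zip(raters, best) if keep]
```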
After dropping uninformative raters in each sub-dataset, we again removed any rows containing fewer than two ratings as a result of rater removal. Afterwards, we used the new rater subsets per dataset to compute factor scores for each response, which served as creativity scores. We calculated factor scores via a GRM, run separately over each sub-dataset, to derive a single creativity score for each response. Finally, we applied min-max scaling to each sub-dataset to transform the ratings into a range of 10 to 50, with intervals of 1. This step ensured that each rating constitutes only a single token, lessening the burden of predicting multi-token labels for the LLMs.
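For concreteness, a minimal numpy sketch of this scoring-and-scaling step is given below. The item parameters (discriminations `a` and ordered thresholds `b`) would come from the GRM fitted per sub-dataset; the grid and prior implement a standard-normal latent trait, and the fitting itself is omitted as an assumption of the sketch.

```python
import numpy as np

def eap_score(pattern, a, b, grid=np.linspace(-4, 4, 81)):
    """EAP factor score for one response; `pattern` holds one 0-indexed
    rating category per retained rater."""
    prior = np.exp(-0.5 * grid**2)  # standard normal prior over theta
    like = np.ones_like(grid)
    for x, a_j, b_j in zip(pattern, a, b):
        # GRM: P(X >= k | theta) curves, bracketed by P(X >= 0) = 1 and 0.
        p_ge = np.vstack([np.ones_like(grid)]
                         + [1.0 / (1.0 + np.exp(-a_j * (grid - t))) for t in b_j]
                         + [np.zeros_like(grid)])
        like *= p_ge[x] - p_ge[x + 1]  # probability of the observed category
    post = prior * like
    return float((grid * post).sum() / post.sum())

def minmax_10_50(scores):
    """Min-max scale factor scores to single-token integer ratings 10-50."""
    s = np.asarray(scores, dtype=float)
    return np.rint(10 + 40 * (s - s.min()) / (s.max() - s.min())).astype(int)
```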
We then withheld all responses in the Spanish language from our final dataset and assigned them to an out-of-distribution-language (OOD-l) set. Responses from the OOD-l set were not included in the training data of MUCE, allowing us to test whether the model could generalize to creative responses in an unseen language. We selected Spanish as it allows for a fair test of generalizability given that: (1) Spanish tends to be a high-resource language in the pre-training of modern LLMs, (2) it is similar to other Latin-root languages in our training data (e.g., Italian), (3) responses in Spanish spanned multiple creativity tasks, and (4) the language spanned a limited number of responses in our total dataset. We further withheld all responses from two highly naturalistic tasks, Poem and Alternative Title Generation, and assigned these to an out-of-distribution task (OOD-t) set. We selected these tasks as they made up a limited portion of the total dataset and provide a test of MUCE's performance on unseen naturalistic creativity tasks.
We then randomly selected items within each task and assigned them to an out-of-distribution item (OOD-i) set. We identified candidate items that corresponded to 5% or less of the responses within a task. Then, for tasks that contained 20 or more total items, we randomly assigned 2 of these items to our OOD-i set; for tasks that contained fewer than 20 total items, we instead randomly assigned 1 of these items to the OOD-i set. Finally, we split the remaining responses in our dataset into training, validation, and out-of-distribution response (OOD-r) sets according to an 80/10/10 split. We grouped responses into unique combinations of sub-dataset, task, language, item, and rating label, then randomly assigned responses within each combination to each of the sets, ensuring an equal representation of responses associated with each of these variables within the training, validation, and OOD-r sets (a sketch of this assignment closes this appendix). Table 1 contains the final dataset statistics for MUCE. Tables 6 and 7 contain the descriptions and data statistics for each task in MUCE. Tables 8, 9, 10, 11, and 12 list example prompts and low-rated and high-rated responses for each task from MUCE.
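A minimal sketch of the grouped 80/10/10 assignment described above; the record field names are illustrative assumptions, not the actual schema.

```python
import random
from collections import defaultdict

def grouped_split(records, seed=0):
    """Assign records to train/dev/OOD-r within each unique combination of
    sub-dataset, task, language, item, and rating label (80/10/10)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        key = (rec["sub_dataset"], rec["task"], rec["language"],
               rec["item"], rec["rating"])
        groups[key].append(rec)

    splits = {"train": [], "dev": [], "ood_r": []}
    for recs in groups.values():
        rng.shuffle(recs)
        n_train = round(0.8 * len(recs))
        n_dev = round(0.1 * len(recs))
        splits["train"] += recs[:n_train]
        splits["dev"] += recs[n_train:n_train + n_dev]
        splits["ood_r"] += recs[n_train + n_dev:]
    return splits
```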
B SFT and Preference Datasets
Past work has shown that data quality is one of the main factors behind preference model performance (Liu et al., 2024; Deng et al., 2025; Wang et al., 2024a). In particular, the margin in score (i.e., the reward margin) between the preferred and rejected response may influence the performance of the model, since training pairs with smaller margins are likely to contain annotation noise and be more difficult to learn. We experimented with different reward margins and chose a margin of 5 for the final experiments, as it balanced mitigating annotator noise against creating a dataset with nuanced preferences. Additionally, to ensure a high-quality preference dataset, we first filter the base MUCE dataset and select only the samples that have full agreement from all annotators. We then filter out all samples that have a rating below 20 and limit the number of pairings per sample to 10. This results in a final preference training dataset of 42,058 samples (MUCE-PREF).
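For illustration, a minimal sketch of the pairing step under these filters is given below; the field names and the pairing order are assumptions, not the exact implementation, and the agreement and rating filters are assumed to be applied beforehand.

```python
from itertools import combinations

def build_pairs(responses, margin=5, max_pairs_per_sample=10):
    """responses: dicts with "prompt", "text", "rating" (10-50 scale)."""
    pairs, used = [], {}
    ordered = sorted(responses, key=lambda r: -r["rating"])
    for a, b in combinations(ordered, 2):  # a always rated >= b
        if a["prompt"] != b["prompt"]:
            continue  # only pair responses to the same prompt
        if a["rating"] - b["rating"] < margin:
            continue  # skip noisy, small-margin pairs
        if used.get(id(a), 0) >= max_pairs_per_sample or \
           used.get(id(b), 0) >= max_pairs_per_sample:
            continue  # cap the number of pairings per sample
        pairs.append({"prompt": a["prompt"],
                      "chosen": a["text"], "rejected": b["text"]})
        used[id(a)] = used.get(id(a), 0) + 1
        used[id(b)] = used.get(id(b), 0) + 1
    return pairs
```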
[Figure 6 (three panels): quality (x-axis) plotted against novelty, diversity, and surprise (y-axes); points show SFT, DPO, Mistral-7B, Gemini-2.0, GPT-4o, Claude-3.7, and the CrPO variants (nov, div, sur, nov-qua, div-qua, sur-qua, qua, nov-div-sur, cre).]

Figure 6: Results on the held-out evaluation suite from MUCE across all baselines and our models using Mistral-7B-Instruct-v0.3 as a base model. nov, div, sur, qua, cre denote novelty, diversity, surprise, quality, and creativity, respectively. Results are averaged across tasks.
| | Total | Train | Dev | Test | OOD-i | OOD-l | OOD-t |
| --- | --- | --- | --- | --- | --- | --- | --- |
| samples | 245,030 | 183,973 | 23,254 | 22,419 | 6,253 | 4,719 | 4,412 |
| tasks | | | | | | | |
| languages | | | | | | | |
| prompts | | | | | | | |

Table 1: Detailed statistics for each split of MUCE.
Human Evaluation Instructions

"In this study, you will be presented with two responses to a creative task. Your job is to select the response that you believe is the most creative. Please base your judgment only on the creativity of the ideas, not on how long or detailed the response is. A shorter response can be more creative than a longer one, and vice versa. Focus on how original, unique, and innovative the idea feels to you. There are no right or wrong answers; we're interested in your opinion."

Figure 7: Rater instructions for the human evaluation.
We also create a high-quality instruction-tuning dataset from MUCE-PREF by pairing the prompts with all preferred responses that have a rating above 30, resulting in a dataset of 5,275 samples (MUCE-SFT). Tables 2 and 3 contain the statistics for these datasets.
C Training
We follow a training setup similar to Chung et al. (2025) and use Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3 (Jiang et al., 2023) as our base models. Using these models, we train an SFT, a DPO, and several CRPO models. We train all models using parameter-efficient tuning with LoRA, with a rank of 128 and an alpha of 256 (Hu et al., 2022). All training was done using the HuggingFace TRL library5 with Accelerate (Gugger et al., 2022) and DeepSpeed ZeRO-2 (Rajbhandari et al., 2020) on NVIDIA A100 GPUs with gradient checkpointing.
The SFT model is trained on the MUCE-SFT dataset for a single epoch with a batch size of 2 per GPU, a gradient accumulation size of 4, and a context size of 1024. We use a cosine scheduler with a half-cycle warmup and a maximum learning rate of 3e-5. The final model achieves 85% mean token accuracy on the validation set.
The DPO and CRPO models are trained using the SFT model as a base on our MUCE-PREF dataset for a single epoch with a batch size of 8 per GPU, a gradient accumulation size of 8, and a context size of 1024. We use a linear scheduler with a learning rate of 5e-6. All final models achieve over 82% reward accuracy on the validation set (a sketch of the pipeline follows).

5 https://huggingface.co/docs/trl/en/index
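As an illustration of this setup, a minimal TRL sketch of the SFT-then-DPO pipeline is given below. Exact argument names can vary across TRL versions, the datasets are placeholder stand-ins for MUCE-SFT and MUCE-PREF, and the context size (1024), the DeepSpeed ZeRO-2 / Accelerate launch configuration, and the CRPO-specific creativity-signal injection (not part of stock TRL) are all omitted.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

# Placeholder stand-ins for MUCE-SFT and MUCE-PREF.
sft_data = Dataset.from_list([{"prompt": "...", "completion": "..."}])
pref_data = Dataset.from_list([{"prompt": "...", "chosen": "...", "rejected": "..."}])

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
peft_cfg = LoraConfig(r=128, lora_alpha=256, task_type="CAUSAL_LM")

# Stage 1: SFT on prompts paired with high-rated preferred responses.
sft_trainer = SFTTrainer(
    model=model_id,
    args=SFTConfig(num_train_epochs=1, per_device_train_batch_size=2,
                   gradient_accumulation_steps=4, learning_rate=3e-5,
                   lr_scheduler_type="cosine", gradient_checkpointing=True),
    train_dataset=sft_data,
    peft_config=peft_cfg,  # LoRA rank 128, alpha 256
)
sft_trainer.train()

# Stage 2: DPO (the starting point for the CRPO variants), initialized from
# the SFT model, which already carries the LoRA adapters.
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(num_train_epochs=1, per_device_train_batch_size=8,
                   gradient_accumulation_steps=8, learning_rate=5e-6,
                   lr_scheduler_type="linear"),
    train_dataset=pref_data,
    processing_class=sft_trainer.processing_class,
)
dpo_trainer.train()
```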
[Figure 8 (three panels): quality (x-axis) plotted against novelty ("CrPO-nov with different injection weights"), diversity ("CrPO-div with different injection weights"), and surprise ("CrPO-sur with different injection weights"), for injection weights lambda in {0.0, 0.5, 1.0, 1.5, 2.0}.]

Figure 8: Effect of injection weights for each dimension on the quality score. Results are averaged across three seed runs.
D Evaluation
For each prompt in our held-out evaluation suite, we generate a total of 16 responses for every model by sampling 4 responses for each of the following four decoding setups, which induce high randomness using various sampling techniques (Fan et al., 2018; Holtzman et al., 2019); a decoding sketch follows the list:

1. temperature = 0.7, top-p = 0.95
2. temperature = 0.9, top-p = 0.99
3. temperature = 0.7, top-k = 50
4. temperature = 0.8, top-p = 0.97
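A minimal sketch of this 4-setup x 4-sample decoding scheme with the HuggingFace transformers API; the model identifier is a placeholder for any of the evaluated models.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

SETUPS = [dict(temperature=0.7, top_p=0.95),
          dict(temperature=0.9, top_p=0.99),
          dict(temperature=0.7, top_k=50),
          dict(temperature=0.8, top_p=0.97)]

def sample_responses(prompt, n_per_setup=4, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt")
    responses = []
    for setup in SETUPS:
        out = model.generate(**inputs, do_sample=True,
                             num_return_sequences=n_per_setup,
                             max_new_tokens=max_new_tokens, **setup)
        # Decode only the newly generated tokens, dropping the prompt.
        new_tokens = out[:, inputs["input_ids"].shape[1]:]
        responses += tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
    return responses  # 16 responses per prompt
```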
Moreover, as existing instruction-tuned LLMs tend to produce verbose outputs (Saito et al., 2023), in order to minimize length bias we add further instructions to the prompt constraining the output length in terms of the number of sentences and words. We compute the constraint values based on the median number of words and sentences of responses per task in our training dataset. Table 4 lists an example evaluation prompt for each task. Table 5 lists an example response from all models to a single prompt.
D.1 Human Evaluation
Since we have multiple model responses per prompt, instead of randomly choosing a response for each prompt, we choose the top 4 model responses as measured by the overall automated creativity score, which we define as the sum of the normalized novelty, diversity, surprise, and quality scores (see the sketch at the end of this subsection). This setup ensures that models are compared to each other on their best outputs. We recruited 15 participants on Prolific6 to complete the study, requiring that they reside in the U.S. and have an approval rating of at least 90%. Ethics board approval was received from the Pennsylvania State University IRB for this study. We provided participants with a definition of creativity and instructed them not to focus on the length or detail of the response when rating. Figure 7 lists the instructions given to raters for evaluating creativity. We additionally included a comprehension check in which participants were quizzed about the task instructions, to help catch careless participants. Raters who failed this check were excluded from further analysis. All raters were compensated adequately, with a minimum payment of $9 per hour. Final win rates are calculated for each response pair based on the majority vote across participants. The inter-rater agreement computed using Krippendorff's alpha was 0.463, indicating moderate agreement.
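The top-response selection can be sketched as follows. Normalizing each dimension by min-max over all generated responses is our illustrative assumption, as is the `scores` layout (one dict of raw scores per response, grouped by model).

```python
import numpy as np

DIMS = ["novelty", "diversity", "surprise", "quality"]

def top4_by_creativity(scores):
    """scores: {model_name: [ {dim: raw_score, ...}, ... ]}"""
    all_rows = np.array([[r[d] for d in DIMS]
                         for rows in scores.values() for r in rows])
    lo, hi = all_rows.min(axis=0), all_rows.max(axis=0)  # per-dimension range

    def overall(r):
        vals = np.array([r[d] for d in DIMS])
        return ((vals - lo) / (hi - lo)).sum()  # sum of normalized scores

    # Keep each model's 4 highest-scoring responses for human evaluation.
    return {m: sorted(rows, key=overall, reverse=True)[:4]
            for m, rows in scores.items()}
```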
D.2 NOVELTYBENCH Evaluation
NOVELTYBENCH is a recently introduced benchmark that measures how well language models can generate novel and high-quality answers to user requests involving subjectivity, randomness, and creativity (Zhang et al., 2025). We use a 100-sample subset of the benchmark that is manually curated by the authors and contains four distinct categories where diversity and novelty are expected:

- Randomness: prompts that involve randomizing over a set of options. Example: "Roll a make-believe 20-sided die."
- Factual Knowledge: prompts that request underspecified factual information, which allows many valid answers. Example: "List a capital city in Africa."
- Creative Writing: prompts that involve generating a creative form of text, including poetry and story-writing. Example: "Tell me a riddle."
- Subjectivity: prompts that request subjective answers or opinions. Example: "What's the best car to get in 2023?"

6 https://www.prolific.com

| Task | prompts | samples |
| --- | --- | --- |
| Real-Life Creative Problem Solving | | 5,601 |
| Question Asking | | |
| Malevolent Problems | | |
| Metaphors | | |
| Alternate Uses of Objects Task | | 4,388 |
| Design Solutions | | 1,366 |
| Essays | | |
| Stories | | 1,498 |
| Consequences | 5 | 10,865 |
| Experiment Design | | 5,640 |
| Hypothesis Generation | | 5,260 |
| Research Questions | | 5,832 |
| Associations | | |
| Total | | 42,058 |

Table 2: MUCE-PREF training dataset details.
Additionally, the paper proposes new metrics to measure novelty and quality (i.e., utility) that differ from ours. To compute novelty, the authors propose a method that learns to partition the output space into equivalence classes from human annotations. Each class represents one unique generation that is roughly equivalent to the others in the same class and different from the generations in other classes. They consider a functional equivalence that defines two generations to be different if and only if a user who has seen one generation would likely benefit from seeing the other. To this end, the authors annotated 1,100 pairs of generations conditioned on prompts from NOVELTYBENCH sampled from a diverse set of models. From these annotated pairs, they used 1,000 for training and fine-tuned a deberta-v3-large model (He et al., 2023) to predict binary functional equivalence between two generations. With the equivalence classifier, they partition the output space into equivalence classes. They then define novelty as the distinct_k metric, the number of equivalence classes in a partition of k sample generations from a language model:

$$\mathrm{distinct}_k := \left|\{\, c_i \mid i \in [k] \,\}\right| \tag{8}$$

where $c_i$ denotes the equivalence class of the $i$-th generation.

| Task | prompts | samples |
| --- | --- | --- |
| Real-Life Creative Problem Solving | | |
| Question Asking | | |
| Malevolent Problems | | |
| Metaphors | | |
| Alternate Uses of Objects Task | | |
| Design Solutions | | |
| Essays | | |
| Stories | | |
| Consequences | 5 | 1,315 |
| Experiment Design | | |
| Hypothesis Generation | | |
| Research Questions | | |
| Associations | | |
| Total | | 5,275 |

Table 3: MUCE-SFT training dataset details.
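A minimal sketch of this partition-and-count computation (Eq. 8), with a stand-in `equivalent` predicate in place of the fine-tuned DeBERTa equivalence classifier; the greedy representative-matching partition is our illustrative assumption.

```python
def distinct_k(generations, equivalent):
    """Partition generations into equivalence classes and count them."""
    classes = []  # one representative generation per equivalence class
    labels = []   # equivalence-class index of each generation
    for g in generations:
        for idx, rep in enumerate(classes):
            if equivalent(g, rep):
                labels.append(idx)
                break
        else:  # no match: g opens a new equivalence class
            labels.append(len(classes))
            classes.append(g)
    return len(classes), labels

# Example with a trivial stand-in predicate (exact string match):
k, labels = distinct_k(["a", "b", "a"], lambda x, y: x == y)  # k == 2
```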
To compute quality, they consider a model of user behavior that describes how users interact with and consume language model generations. They assume that the user has a patience level p ∈ [0, 1]: after observing each generation, the user requests an additional generation from the language model with probability p, and stops interacting with the model with probability 1 − p. They then compute the quality of a sequence of generations as the cumulative utility:

$$\mathrm{utility}_k := \frac{1-p}{1-p^{k}} \sum_{i=1}^{k} p^{\,i-1}\, \mathbb{1}\left[c_i \neq c_j,\ \forall j < i\right] u_i \tag{9}$$

| Task | Prompt |
| --- | --- |
| Real-Life Creative Problem Solving | Come up with an original and creative solution for the following real-world problem: Clara, a junior pre-med student, is working part-time and taking a 15 hour credit load at school. ... skipped ... Please limit your response to 4 sentences and at most 75 words. |
| Alternate Uses of Objects | Come up with an original and creative use for the following object: rope. Please limit your response to 1 sentence and at most 17 words. |
| Design Solutions | Come up with an original and creative solution to reduce the amount of litter in public spaces and promote waste reduction and recycling. Please limit your response to 2 sentences and at most 36 words. |
| Hypothesis Generation | Come up with an original and creative scientific hypothesis for the following scenario: You notice that dogs seem to like one of your friends, but cats seem to like another friend. What hypotheses do you have about why that is? Please limit your response to 1 sentence and at most 22 words. |
| Metaphors | Come up with an original and creative metaphoric equivalent for the concept described below: Stomata are tiny openings or pores found on the underside of a plant leaf. They are used for gas exchange, enabling the intake of carbon dioxide and release of oxygen. Please limit your response to 1 sentence and at most 10 words. |
| Poems | Come up with an original and creative poem about the following concept: choice. Please limit your response to 5 sentences and at most 150 words. |
| Sentence Completion | Finish the sentence with an original and creative ending: When I got on the school bus... Please respond in one sentence. |

Table 4: MUCE dataset held-out item and task evaluation prompts.
To compute the utility $u_i$ of individual generations, they also use the Skywork-Reward-Gemma-2-27B-v0.2 model (Liu et al., 2024).
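Eq. (9) can then be sketched as follows, with per-generation rewards `utilities` from the reward model and equivalence-class labels `labels` (e.g., as returned by the `distinct_k` sketch above).

```python
def utility_k(utilities, labels, p=0.8):
    """Cumulative utility of a sequence of generations (Eq. 9)."""
    k = len(utilities)
    total, seen = 0.0, set()
    for i, (u, c) in enumerate(zip(utilities, labels), start=1):
        # Only the first generation of each equivalence class counts,
        # discounted by the probability p**(i-1) of reaching position i.
        if c not in seen:
            total += p ** (i - 1) * u
            seen.add(c)
    return (1 - p) / (1 - p ** k) * total

# NOVELTYBENCH's recommended setup: k = 10 generations, patience p = 0.8.
```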
To benchmark our models, we follow their recommended evaluation setup. In particular, we set the number of generations to 10 per model, set the patience level to 0.8, and use their trained classifier to partition the output space.
| Model | Response |
| --- | --- |
| Human | played hide-and-seek by forming into different objects and trying not to be found. |
| Llama-3.1-8B-Instruct | At a sleepover, we transformed our host's basement into a mystical Dreamscape where we wove a tapestry of secrets, stories, and whispered promises that only the darkness could keep. |
| Gemini-2.0-Flash | At a sleepover we...built a pillow fort so magnificent, it accidentally opened a portal to a realm where dreams manifested as sentient, fluffy creatures demanding to be tickled. |
| Claude-3.7-Sonnet | At a sleepover we constructed an elaborate blanket fortress complete with secret passages, flashlight constellations on the ceiling, and an unspoken pact to guard each other's midnight confessions until the end of time. |
| GPT-4o | At a sleepover we transformed the living room into a magical fort of pillows and fairy lights, where we whispered secrets and told stories until the first light of dawn. |
| SFT | We played a game of Twister where we put out our phones to see who was the best twister. |
| DPO | conducted an experiment to see if a person's personality is changed with an altered state of consciousness. |
| CRPO-nov | tried to break the record for most consecutive hours without speaking and we discovered we could all hear each other's thoughts. |
| CRPO-nov-qua | Participated in an experiment where we tested the effects of sleep deprivation on the human mind. |
| CRPO-div | construct a space shuttle that takes us to the moon and from there we can launch our dream rockets. |
| CRPO-div-qua | recreated the conditions of a 19th century underground railroad and had to map out our escape to Canada. |
| CRPO-sur | Operate on each other to implant a permanent adrenaline gland. |
| CRPO-sur-qua | created an underwater laboratory within our inflatable pool to collect the evidence we found of alien life. |
| CRPO-qua | began to master the art of telekinesis by competitively tossing each other's pillows across the room. |
| CRPO-nov-div-sur | Built a rollercoaster out of air mattresses and then did a hot-wheel car-launch into the trenches and caught the crash on GoPro cameras. |
| CRPO-cre | Created an experiment to see if our dreams could be manipulated and transfer to one another. |

Table 5: Example model (and gold human) responses to the prompt "At a sleepover we ..." from the Sentence Completion task.
| Task | Description | Dataset Sources | prompts | samples |
| --- | --- | --- | --- | --- |
| Real-Life Creative Problem Solving | Produce solutions for everyday problems | (Luchini et al., 2025; Kapoor et al., 2024; Saretzki et al., 2024; Perchtold-Stefan et al., 2024) | | 33,340 |
| Alternate Titles Generation | Produce alternative titles for widely known books or movies | (Agnoli et al., 2016) | | 2,986 |
| Question Asking | Produce questions about everyday objects | (Raz et al., 2024) | | 3,566 |
| Poems | Produce poems about everyday concepts | (Fan et al., 2023; Chaudhuri et al., 2025; He et al., 2022) | | 2,580 |
| Design Solutions | Produce solutions to real-world design problems | (DiStefano et al., 2025) | | 10,818 |
| Combining Objects | Produce combinations of everyday objects to achieve a goal | (Weiss et al., 2023) | | 4,494 |
| Plot Titles Generation | Produce titles for story plots | (Weiss et al., 2023; Goecke et al., 2024b; Weiss et al., 2024) | | 1,832 |
| Instances of Common Concepts | Produce instances related to everyday adjectives | (Organisciak et al., 2023) | | 2,474 |
| Experiment Design | Produce experiment designs to test scientific hypotheses | (Beaty et al., 2024; Goecke et al., 2024a) | | 4,893 |
| Associations | Produce word associations | (Beaty and Johnson, 2021) | | 1,004 |
| Emotional Trials | Produce feelings one might have in a given situation | (Weiss et al., 2023) | | |
| Invent Nicknames | Produce nicknames for everyday concepts and objects | (Weiss et al., 2023) | | |
| Situation Redescription | Produce redescriptions of negative situations into positive situations | (Weiss et al., 2023) | | |
| Alternate Uses of Objects Task | Produce alternate uses for everyday objects | (Patterson et al., 2023; Zielińska et al., 2023; Organisciak et al., 2023) | | 88,155 |
| Stories | Produce short stories from three word prompts | (Luchini et al., 2025; Agnoli et al., 2016; Fan et al., 2023; He et al., 2022) | | 2,757 |

Table 6: MUCE dataset details broken down by task (Part 1).
| Task | Description | Dataset Sources | prompts | samples |
| --- | --- | --- | --- | --- |
| Malevolent Problems | Produce ideas on how to take revenge on or sabotage a wrongdoer | (Perchtold-Stefan et al., 2023; Kapoor et al., 2024; Perchtold-Stefan et al., 2024) | | 16,536 |
| Metaphors | Produce metaphors to describe scenarios | (DiStefano et al., 2024; Yu et al., 2024) | | 13,210 |
| Essays | Produce essays on a topic | (Cotter et al., 2016) | | |
| Consequences | Produce possible consequences to scenarios | (Weiss et al., 2024, 2023; Goecke et al., 2024b) | | 24,874 |
| Sentence Completion | Produce endings to incomplete sentences | (Organisciak et al., 2023) | | 2,629 |
| Hypothesis Generation | Produce scientific hypotheses for specific observations | (Beaty et al., 2024; Goecke et al., 2024a) | | 18,455 |
| Research Questions | Produce research questions relating to scenarios | (Beaty et al., 2024; Goecke et al., 2024a) | | 5,161 |
| Composites | Produce composite words from a prompt word | (Weiss et al., 2023) | | |
| Evoking Emotional Responses from People | Produce ways to evoke emotional responses in people as a TV producer | (Weiss et al., 2023) | | |
| Emotions in Everyday Situations | Produce emotions you might feel in response to everyday situations | (Weiss et al., 2023) | | |

Table 7: MUCE dataset details broken down by task (Part 2).
| Task | Example prompt | Example low-rating response | Example high-rating response |
| --- | --- | --- | --- |
| Real-Life Creative Problem Solving | Becky is a college student who works part-time at Mark's Pizzeria. Mark, the owner of the restaurant, has treated Becky very well. He gave her a job that she needs to help pay her rent when no other business would employ her because she was arrested for shoplifting three years ago. Mark also lets Becky work around her school schedule, and has asked if she wants to be a shift manager in the summers. Becky's roommate Jim also works at the pizzeria, but Jim has been causing a lot of problems at work. He always avoids doing his job, treats customers rudely, and makes a lot of mistakes with orders. Jim recently began stealing food from the pizzeria. Two days ago the pizzeria was short-staffed, so Jim and Becky were the only employees left at closing time. Jim made 10 extra pizzas and took them home to a party he was hosting without paying for them. Becky feels like she needs to do something about Jim's behavior. However, Becky is hesitant to tell Mark about Jim because Jim is a good friend to Becky. Becky also needs Jim to have a job so he can pay his portion of their rent. Becky does not know what to do. | Morally the right thing for Becky to do would be to tell her boss. However, to be a good friend would to be not to tell on Jim. The only creative solution to this problem would to be to try and talk to Jim one on one. Give Jim the decision of whether or nt he wants Becky to inform their boss of what he has been doing. As a friend he should understand where Becky is coming from and want to take the strain off her. | Becky should first discuss this with Jim, and tell him that he needs to either pay for the pizzas or he needs to go to the boss, and admit what he has done. He will get caught in the end because eventually the ingredients will be missed. The boss may unerstand, and perhaps will allow him to work off the pizzas somehow. Maybe he could help out cleaning up around the restaurant. If Jim will not tell his boss Becky should tell him. She wouldn't necessarily have to come right out and tell on her coworker she could come up with a way for the boss to catch him at it. If he does it once Jim will more than likely do it again. She could tell the boss to check on the inventory. She could have other people who might have been at the party come tell her boss bout it. If all of that fails, she should just tell Mark about Jim stealing the pizzas. |

Table 8: MUCE dataset examples (Part 1).
| Task | Example prompt | Example low-rating response | Example high-rating response |
| --- | --- | --- | --- |
| Question Asking | pencil | How big is it? | How many great ideas have started with a pencil? |
| Poems | childhood | Twinkle, Twinkle little star....ect | Red Rover, Red Rover Is my childhood over? I don't feel quite grown up I still laugh at "I CUP" I play slide with my sister and still call my fourth grade teacher "mister" I suppose, even still, my childhood is over even if I can still play red rover red rover |
| Design Solutions | Develop as many design ideas as you can to reduce air pollution in cities. | Walk | use 3d printing as an innivating way of building houses as it reduces labour and |
| Combining Objects | Paint sign | paper, ballpoint pen | beetroot juice, quark cheese |
| Plot Titles Generation | Now spoke | A completely normal everyday life | VR glasses charger defective |
| Instances of Common Concepts | soft | something that is not hard | a futuristic ball that turns really fuzzy and comfy at places it gets contact to |
| Experiment Design | You think some animals have a sense of humor that humans don't usually understand. How could you test that hypothesis? | observe | tickle your dog to see how he acts when he's laughing. then, observe your dog throughout the day and note when he is laughing. you may begin to pick up on moments where he does things that are funny to him. |
| Associations | expert | winner | ace |
| Emotional Trials | You have a date tonight, and once again your dress didn't get ready in time at the laundry. | worried, afraid, sad | Anger, panic, anticipation |
| Invent Nicknames | plate | porcelain | Shrunken UFO |

Table 9: MUCE dataset examples (Part 2).
| Task | Example prompt | Example low-rating response | Example high-rating response |
| --- | --- | --- | --- |
| Alternate Uses of Objects Task | knife | weapon | make up "knife characters" and create a movie |
| Stories | petrol-diesel-pump | I needed to fuel my car before we could start the long drive. I drove to the petrol station. i went to the pump and fuel my car with diesel. new i was ready for the task ahead | Manly Merde was a truck driver looking for trouble. He pulled into the Casino in the back where the drivers go. He took a swig of whisky and walked to the petrol station, grabbed the pump and spurt diesel into the air like hydrocarbon fountain. He let out a big belly laugh and screamed, "Let the revolution begin!" And that is how the trucker wars started. |
| Malevolent Problems | Your professor in class announces an award for the person who comes up with the best solution for a project. By chance, another student leaves their notebook behind in class. You read their ideas and believe that they are the best. You decide to turn them in as your own; however you know that if the other student submits the same solution, there will be a problem. | I will not do the above | render their notebook unreadable by dropping water at the last moment |
| Metaphors | The hot tea is... | boiling | liquid fire |
| Consequences | What would be the result if society no longer used money, and instead traded goods and services? | Banks would be unnecessary. | People (especially couples) would stop fighting so much about financial issues |
| Sentence Completion | It started raining and... | I got wet | because I was covered in oil, I began to levitate, and all the witnesses called me the next coming of some sort of goddess. |

Table 10: MUCE dataset examples (Part 3).
| Task | Example prompt | Example low-rating response | Example high-rating response |
| --- | --- | --- | --- |
| Hypothesis Generation | On a field trip, you drive past a massive field with hundreds of large holes visible as far as the eye can see. What hypotheses do you have about what purpose the holes may serve? | the holes resulted over time and nature | the holes are for animals giving birth. |
| Essays | dream project | I don't really know what carreer path I want to follow. I just want a job where I can help people and get a good pay check so I can support myt future endevors. I want to do something that no one has ever done before in a way no one has ever seen. I want to inspire a generation to work on a better future for everybody. I guess what I really want is to be remembered as an icon. i want to be someone that people look up to. | I want to go into forensic science when I gradutate. Therefore, my dream project is to discover the perfect device that can help solve every crime scene. This device would be able to analyize the crime scene and tell us exactly how many people died and how they died. It would then collect evidence samples such as blood. Next, it would use what the information it found at the crime scene to help make up questions the dectectives would ask the suspects. It would use it's technology to come up with questions that only the murderer could answer. Later on, back at the lab, it would help discover whose blood the samples belonged to. In the end, the only human power that would be needed was someone to arrest the convicted person and the people to help clean up the crime scene. |

Table 11: MUCE dataset examples (Part 4).
| Task | Example prompt | Example low-rating response | Example high-rating response |
| --- | --- | --- | --- |
| Situation Redescription | You notice how your colleague first treats another employee very kindly and then shortly afterwards starts talking negatively behind his back | It would be nice if you were older | I'll talk to them. Then I'll have to work less |
| Alternate Titles Generation | The Betrothed | renzo and lucia | Plague, Honor and Love in Baroque Brianza |
| Research Questions | You travel to a jungle that contains no human life and is completely unknown to the scientific community. What scientific questions could you ask about this jungle? | How many people will come with me? | Do these species share a common characterisitic that humans don't have? |
| Composites | jitters | Exam jitters | Easter bunny missing jitters |
| Evoking Emotional Responses from People | Describe how you would make people look down on others | I will always scream loudly | I would divide the audience into two groups and give one group a rubber glove as headgear and the other group a tiara or crown made of real gold. |
| Emotions in Everyday Situations | You're at work. A glance at the clock tells you that you're about to finish work and start your long-awaited weekend. | I feel happy | I feel sorry for my desk chair, which is unused over the weekend and stands alone in the office. |

Table 12: MUCE dataset examples (Part 5).