Using Large Language Models for Idea Generation in Innovation

Lennart Meincke
Operations, Information and Decisions, The Wharton School, University of Pennsylvania, lennart@wharton.upenn.edu

Karan Girotra
Cornell Tech and Johnson College of Business, Cornell University, girotra@cornell.edu

Gideon Nave
Marketing, The Wharton School, University of Pennsylvania, gnave@wharton.upenn.edu

Christian Terwiesch
Operations, Information and Decisions, The Wharton School, University of Pennsylvania, terwiesch@wharton.upenn.edu

Karl T. Ulrich
Operations, Information and Decisions, The Wharton School, University of Pennsylvania, ulrich@wharton.upenn.edu
This research evaluates the efficacy of large language models (LLMs) in generating new product ideas. To do so, we compare three pools of ideas for new products targeted toward college students priced at $50 or less. The first pool of ideas was created by university students in a product design course before the availability of LLMs. The second and third pools of ideas were generated by OpenAI's GPT-4 using zero-shot and few-shot prompting, respectively. We evaluated idea quality using standard market research techniques to predict average purchase intent probability. We used text mining to assess idea similarity and human raters to evaluate idea novelty. We find that AI-generated ideas outperform human-generated ideas in terms of average purchase intent, with few-shot prompting yielding slightly higher intent than zero-shot prompting. However, AI-generated ideas are perceived as less novel and exhibit higher pairwise similarity, particularly with few-shot prompting, indicating a less diverse solution landscape. When focusing on the quality of the best ideas (rather than on the average ideas), we find that AI-generated ideas are seven times more likely to rank among the top 10% of ideas, demonstrating a significant advantage over human-generated ideas. We propose that this 7:1 advantage is a conservative estimate, as it does not account for AI's greater productivity. Our findings suggest that despite some drawbacks, AI creativity presents a substantial benefit in generating high-quality ideas for new product development.

Funding: Funding was provided by the Mack Institute for Innovation Management at the Wharton School of the University of Pennsylvania.

Key words: innovation; idea generation; creativity; creative problem solving; LLM; large-scale language models; AI; artificial intelligence; ChatGPT; GPT
1. Introduction

Generative artificial intelligence (GenAI) has remarkably advanced in creating life-like images and coherent, fluent text. OpenAI's ChatGPT chatbot, based on the Generative Pre-trained Transformer (GPT) series of large language models (LLM), can equal or surpass human performance in academic examinations and tests for professional certifications (OpenAI et al. 2023). Moreover, LLMs can provide valuable professional advice in fields like software development, medicine, and law.

Despite their remarkable performance, LLMs sometimes produce text that is semantically or syntactically plausible but is, in fact, factually incorrect or nonsensical, a phenomenon often referred to as "hallucinations." This outcome is a byproduct of how LLMs are designed, as they are optimized to generate the most statistically likely sequences of words with an intentional injection of randomness. In most applications, this randomness and the associated hallucinations and inconsistencies create problems that limit the use of LLM-based solutions to low-stakes settings, or they require extensive human supervision.

But are there applications in which we can leverage the weaknesses of hallucinations and inconsistent quality and turn them into a strength? We propose that the domain of creativity and innovation provides such an application. This domain operates quite differently than most management settings, where we commonly expect to use each unit of work produced. As such, consistency is prized and is, therefore, the focus of contemporary performance management. Erratic and inconsistent behavior is to be eliminated. An airline would rather hire a pilot who executes a within-safety-margins landing 10 out of 10 times than one who makes a brilliant approach five times and an unsafe approach another five. But, when it comes to creativity and innovation, say finding a new opportunity to improve the air travel experience or launching a new aviation venture, the same airline would prefer an ideator who generates one brilliant idea and nine nonsense ideas over one who generates ten decent ideas.

The reason for this difference is that when it comes to creativity and innovation, the performance of the process is not determined by the sum or the average of all ideas created. Instead, each idea is seen as a real option that the decision maker can decide to execute (Huchzermeier and Loch 2001). Thus, the performance of the process is determined by the quality of the best idea(s) (Dahan and Mendelson 2001, Terwiesch and Xu 2008, Terwiesch and Ulrich 2009, Girotra et al. 2010). The process of innovation can thereby be thought of as a search process that generates ideas with random quality values by drawing from an underlying stochastic distribution until the cost of creating one additional draw from the distribution (e.g., creating one more product concept or building one more prototype) exceeds the marginal benefit (Weitzman 1979).

Prior research in product development and innovation has modeled various aspects of this search process, including the pros and cons of parallel search (Loch et al. 2001), the tension between sampling from very different regions of the pay-off distributions ("selectionism") versus locally improving a given project (Sommer and Loch 2004), and the need for building balanced portfolios that consist of different types of projects (Chao and Kavadias 2008).
We follow this line of research and consider a setting in which ideas of unknown quality are created, and the quality of the best few ideas determines the overall performance. This could be a setting of corporate portfolio planning in a large established organization as described by Si et al. (2022). However, to facilitate our experimental design, we focus on idea generation in the product development process for a newly formed venture. Specifically, we look for a product idea that targets the college student market and can be sold for $50 or less. This innovation challenge is similar to the study settings used in prior work (e.g., Osborn 1953, Connolly et al. 1990, Sutton and Hargadon 1996, Girotra et al. 2010) to evaluate and compare various brainstorming methods (e.g., group vs. individual; nominal groups vs. hybrid groups).

In contrast to this prior work, we consider ideas generated by humans and ideas generated by artificial intelligence (AI) in the form of OpenAI's GPT-4. As discussed above, LLMs are designed to generate new content, and in the domain of brainstorming, their stochastic (if not outright erratic) behavior might turn a bug into a feature. Thus, we hypothesize that LLMs have the potential to be excellent ideators. The purpose of this paper, therefore, is to formally test this hypothesis by comparing the performance of LLMs in generating new ideas to that of human idea generators.

Specifically, we compare three pools of ideas for new products targeted toward college students at a price of USD 50 or less. The first pool of ideas was created by students at an elite university enrolled in a course on product design before the availability of LLMs. The second pool was generated by OpenAI's GPT-4 with the same prompt as that given to the students and no other guidance (zero-shot prompting). The third pool was generated by prompting GPT-4 with the same prompt as that given to the students and a sample of highly rated ideas to enable some in-context learning (few-shot prompting). We evaluate the quality of the ideas using standard market research techniques and survey human respondents to predict an average purchase intent probability for each product, which we use as our measure of idea quality. We use text mining techniques to evaluate the similarities of ideas and rely on human raters to assess idea novelty.

This comparison between human idea generation and AI-based idea generation allows us to contribute to the innovation literature by establishing the following novel results.

First, AI-generated ideas are, on average, significantly better (average purchase intent of 0.48 relative to 0.40 for human-generated ideas), especially in the case of few-shot prompting (average purchase intent of 0.49 relative to 0.46 for zero-shot prompting), as shown in Study 1.

Second, despite this success, consumers perceive AI-generated ideas as less novel (perceived novelty of 0.36 relative to 0.41). Moreover, AI-generated ideas are more likely to overlap: text mining reveals that the average pairwise similarity of ideas is higher among AI-generated ideas and further increases when using few-shot prompting. As a result, the underlying solution landscape is less likely to be fully explored (Study 2).
Finally, we show that for a given number of ideas, the quality of the best ideas generated by AI is significantly greater than that of the best ideas generated by humans (Study 3). Specifically, we show that AI-generated ideas are seven times more likely to be among the top 10% of ideas generated in our experiment. This is significant given the context. What matters for innovation is the quality of the best idea. The objective of idea generation is to generate at least a few truly exceptional ideas. In most innovation settings, we would rather have 10 great ideas and 90 terrible ideas than 100 ideas of solid quality. Holding the number of ideas constant, we need to trade off the advantageous effect of higher average idea quality (Study 1) against the disadvantages of less novelty, more overlapping ideas, and fewer ideas that can be discovered (Study 2). Study 3 clearly establishes AI's supremacy over humans in this respect.

A quarter of a century ago, Goldenberg et al. (1999) asked whether AI-generated ideas can finally compete with human ones, long after researchers had first considered the possibility. We believe that the three studies presented in this article provide empirical support for an affirmative answer to this question. From a practical perspective, we see the 7:1 advantage of AI creativity over human creativity as a conservative estimate, as we did not credit AI for its substantially greater productivity.

The remainder of the article is organized as follows. After reviewing some recent work on GenAI and creativity (Section 2), we introduce our theoretical framework and our hypotheses (Section 3), followed by the technical set-up of our experiments (Section 4). We conducted three studies to assess the creativity of human- and AI-generated ideas. First, in Study 1 we ask human participants to rate ideas from both sources (human- and AI-generated) and compare the results (Section 5). Second, in Study 2 we use text-based analysis to calculate how many unique ideas can be created by humans and LLMs in our specific domain, and we also ask human participants to rate the novelty of ideas from both sources and compare the results (Section 6). Third, in Study 3 we look at the extremes of the idea-quality distribution to identify possible advantages for the best ideas by either humans or AI (Section 7). We conclude the paper by discussing potential limitations of our studies, their robustness to alternative specifications (Section 8), and the implications of our findings (Section 9).
2. GenAI applications to creative tasks

Research to date has demonstrated three key findings regarding AI's role in creativity and innovation. First, AI frequently matches or exceeds human performance in creative tasks. Haase and Hanel (2023) found that LLMs have reached human-level performance in divergent thinking tasks such as the Alternative Uses Task (AUT). This is supported by Hubert et al. (2024), who studied GPT-4 responses for the Consequences Task and Divergent Association Tasks, finding that AI is more creative than humans across all measured dimensions. While Koivisto and Grassini (2023) find that AI chatbots outperform average human performance in the AUT, they also note that the most exceptional human ideas still match or exceed those generated by AI.

Second, studies show that AI aids in improving creative outcomes for humans when used as a tool. Doshi and Hauser (2024) find that AI use helps humans create more creative and enjoyable short stories. However, the collective diversity decreases and stories become more similar to one another. Similarly, Jia et al. (2024) found that AI assistance boosted employee creativity in a telemarketing company when responding to customer questions, ultimately increasing sales. Zhou and Lee (2024) show that integrating text-to-image AI into creative workflows increased the number of artworks created by 25% and raised the likelihood of the works receiving favorites per view by 50%, highlighting the benefits of LLMs augmenting human workflows ("human in the loop").

Third, studies have explored human preferences for AI-generated versus human-generated creations, often finding that people prefer human involvement. For instance, Hitsuwari et al. (2023) found that survey participants cannot distinguish between AI-generated and human-generated haikus, but rated poems co-created by humans and AI as the most beautiful, with no significant preference for haikus created solely by humans or AI. Bellaiche et al. (2023) provide evidence that humans prefer human involvement in art creation by showing that participants prefer AI-generated art falsely labeled as created by humans to the same art correctly labeled as AI-generated, suggesting a bias for human involvement in the creative process. Similarly, Shank et al. (2023) find comparable results for AI-generated classical music, although no such preference was found for electronic music. However, Zlatkov et al. (2023) found no significant preference for either AI- or human-generated music overall.

Taken together, this body of research illustrates the potency of AI in creative tasks. AI not only matches human creativity but also improves human performance when used as a collaborative tool. However, at least when considering artistic outcomes, there remains a human preference for creativity that involves a human touch. This growing evidence suggests a natural next step: evaluating AI's efficacy in innovation management in general and in idea generation in particular, where artistic preferences are less important, while carefully examining potential issues such as less diverse ideas.
3. Theoretical Framework and Hypotheses

To understand GenAI's ability to tackle various creative tasks, we must first conceptualize creativity. The literature distinguishes between three dimensions of creativity. Fluency is the ability to generate many ideas or solutions to a problem. It reflects the quantity of generated ideas. Flexibility is the capacity to produce a variety of ideas or solutions, showing an ability to shift approaches or perspectives. And, originality is the ability to produce novel and unique ideas (Guilford 1967, Torrance 1968). In addition, the brainstorming literature often considers idea quality as a fourth dimension of creativity. We omit fluency as a performance metric, as comparing the number of ideas or the speed of idea generation between a computer and a human will lead to the obvious result that the computer displays greater fluency, creating more ideas per unit of time. This leaves us with idea quality, flexibility, and originality as the dimensions of comparison between humans and AI.

The atomic unit of analysis in this comparison is an idea. In the context of innovation, we define an idea as a novel match between a solution and a need. As mentioned above, across three studies we will ask students as well as GenAI to come up with new product ideas targeted toward college students that can be sold for $50 or less. To illustrate our unit of analysis, consider one of the student-generated ideas:

"Convertible High-Heel Shoe: Many prefer high-heel shoes for dress-up occasions, yet walking in high heels for more than short distances is very challenging. Might we create a stylish high-heel shoe that easily adapts to a comfortable walking configuration, say by folding down or removing a heel portion of the shoe?"

In this example, the need is the desire of some people to dress up and wear high-heeled shoes for some occasions while still walking comfortably. The proposed solution is to make the heel portion of the shoe so that it can be folded down or removed.

Idea generation, by either individuals or groups, is a process that creates a stream of ideas with varying quality levels. This stream can be the result of either human effort or the use of AI. Each of these ideas can be evaluated on a quality scale. Our quality scale is based on a purchase intent study. Kornish and Ulrich (2014) show that the best indicator of future value creation is the average purchase intent expressed by a sample of consumers in the target market. Furthermore, they show that no single individual, expert or novice, is particularly good at estimating value. Instead, a sample of expressed purchase intent from about 15 individuals in the target market is a reliable measure of idea quality.

Some ideas are likely to be brilliant (high-quality), some are horrible (low-quality), and most will be somewhere in between (medium-quality). We can think of this uncertain quality value as a random variable drawn from an underlying pay-off distribution (Weitzman 1979, Dahan and Mendelson 2001).

Recall that we chose to measure three dimensions of creativity associated with idea generation: quality, flexibility, and originality. Our first hypothesis relates to the first dimension: AI's ability to generate ideas comparable in their average quality to human-generated ideas. In other words, we focus on the mean of the underlying idea-quality distribution. We make two arguments for why GPT-4 would create ideas of higher average quality than humans. First, the training data for GPT-4 includes millions of product reviews revealing unmet user needs, social media posts of excited and frustrated customers alike, and marketing materials for countless products that have been launched more or less successfully in the past. Second, the literature reviewed in Section 2 has established that GPT-4 has tremendous creative capabilities in other domains such as music generation or story writing.

Hypothesis 1 (Idea quality): The average quality of AI-generated ideas is higher than the average quality of human-generated ideas.
Our second hypothesis relates to the other two dimensions: flexibility and originality. We first define these concepts in the context of generating ideas for new products and develop appropriate measurement scales.

There exists a vast number of possible new product ideas that differ along many dimensions. We can think of ideas as positions in a high-dimensional space. OpenAI's GPT-4 models text as multi-dimensional embedding vectors in this space, where each dimension may represent a distinct attribute or feature of the text. Such vectors have hundreds of dimensions. Similar texts will often lie close to each other, while different ones will be far apart. However, interpreting the distances and dimensions is often not straightforward given the high dimensionality.

To illustrate, consider a two-dimensional search space like the map of a territory. For example, consider the exploration of such a territory in the search for fishing spots in the ocean. The (x, y) coordinates capture the geographic locations of schools of fish. Each location has a pay-off corresponding to the amount of fish in the water. The goal of the fisherman is to find the location with the greatest fish density. In such a search process, local adjustments along a gradient of increasing fish density may increase the value of a fishing location. Yet, in rugged solution landscapes, i.e., ones that have multiple local optima, such local search is unlikely to yield the globally optimal solution.

Thereby, the ruggedness of the underlying solution landscape makes it impossible to arrive at the most valuable fishing location (idea) in the ocean (idea space) via local adjustments. Rather, a broad exploration is needed (see Sommer and Loch 2004). Without prior knowledge about the landscape, some new locations that are very different from past locations should be explored. This creates the classic trade-off between exploration and exploitation (March 1991).

With this as our backdrop, we provide two ways of operationalizing flexibility (overlap and the total number of discoverable ideas) and one way to operationalize originality (idea novelty). All three are important properties of a search process in general and of an ideation process in particular.

To explain overlap, let's return to our fishing example. To explore fishing locations in an ocean, the locations should be distinctively different from each other. Even in a rugged solution landscape, some spatial correlations in pay-offs between two adjacent coordinates are likely. In much the same way, in the world of innovation, we want our ideas to be distinct from each other. To determine how distinctly different an idea is relative to other ideas, we measure the cosine similarity of its embedding vector relative to the embedding vectors of the other ideas (following Cox et al. 2021 and Dell'Acqua et al. 2023). Section 8 provides alternative measures to this analytical choice. For a given pool of ideas produced by an idea-generation process, human or AI, we can thus randomly pull out two ideas and compute the angle between the two associated embedding vectors. The cosine of such angles will range from -1 to 1, with 1 indicating identical vectors and 0 indicating no similarity (orthogonal vectors). While negative values are possible in principle, they rarely occur in practice, as further discussed in Study 2. By performing a pairwise comparison of all ideas and averaging their similarities, we can compute the average pool similarity. Next, we define two ideas as overlapping if their cosine similarity is above θ = 0.8. That is, we count any new idea added to the pool as overlapping if its cosine similarity exceeds 0.8 compared to any of the existing ideas in the pool. Our first measure of flexibility is based on computing the distribution of pairwise cosine similarities and counting the frequency of overlaps. We discuss this and other assumptions in Section 8 and provide extensive robustness analyses, including evaluating alternative model specifications.
Next, imagine a fisherman with no memory looking for fish at random locations. Every period, this fisherman sets out and fishes, yielding an estimate for the pay-off of a specific location. How many unique fishing locations will be discovered this way? Early in the exploratory efforts, every fishing spot is unexplored territory. Yet, as this process goes on, the likelihood of overlap increases, i.e., the fisherman is more likely to revisit a location previously tested. Given our definition of overlapping ideas (cosine similarity exceeding the θ = 0.8 threshold), we can observe a stream of incoming ideas, one by one, and determine whether a new idea is unique relative to the pool of ideas created up to this point. Early on, just like in the fisherman's case, each idea is likely unique (non-overlapping with the ideas created so far). However, as the process progresses, the percentage of overlapping ideas will increase as the underlying search space gets exhausted. For a finite sequence of T ideas, we can evaluate the number of overlapping ideas, Noverlap, and thus compute the number of unique ideas, Nunique = T - Noverlap. Definitions for how we operationalize this approach are shown in Study 2.
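To make the sequential definition concrete, the following Python sketch counts overlapping and unique ideas in a stream of idea embeddings. It illustrates the θ = 0.8 rule rather than reproducing our exact analysis pipeline; the function name and the use of NumPy are choices made here for illustration.

```python
import numpy as np

def count_unique_ideas(embeddings: np.ndarray, theta: float = 0.8) -> int:
    """Return Nunique = T - Noverlap for a stream of T idea embeddings.

    An incoming idea counts as overlapping if its cosine similarity to
    any previously generated idea exceeds theta. Rows of `embeddings`
    are assumed to be in generation order.
    """
    # Normalize rows so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n_overlap = 0
    for t in range(1, len(normed)):
        # Compare idea t against all ideas generated before it.
        if np.max(normed[:t] @ normed[t]) > theta:
            n_overlap += 1
    return len(normed) - n_overlap
```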
In addition to utilizing idea overlap for computing the number of unique ideas in a finite stream of ideas, we can further estimate the total number of discoverable ideas in the search space, even if many were not part of the sequence of T ideas, i.e., the ideas have not (yet) been discovered. To do so, we use what in population ecology is known as a capture-recapture model, used here to estimate the number of unique fishing locations based on how frequently a previously visited location is revisited by a fisherman with no memory. With such a model, we simply count the incidents of an idea overlapping with a past idea. The frequency of overlap and its increasing occurrence rate over time allow for estimating the number of ideas that can be discovered (Kornish and Ulrich 2011). This provides us with our second measure of flexibility.

Next, consider originality. The search for ideas can yield ideas that are more or less novel. We measure idea novelty in the same way we measure idea quality: by directly asking potential customers to assess an idea's novelty and averaging this value. In summary, we evaluate flexibility by looking at idea overlap (which can be converted into an estimate of the number of ideas that can be discovered) and evaluate originality by directly asking consumers to rate novelty.
How will a pool of AI-generated ideas compare to human-generated ideas in terms of quality, flexibility, and originality? By their very design, GPTs are autoregressive processes. They don't plan ahead but predict one word (or token) at a time based on a context window, including the prompt and the prior words created. Such a one-word-at-a-time process is unlikely to systematically and exhaustively explore an entire solution landscape. This lack of broad exploration will be further amplified in the presence of a system prompt that illustrates the concept of ideas by providing one or multiple ideas from the past (few-shot prompting) relative to the case in which no past ideas are provided (zero-shot prompting). This should limit both the flexibility and the originality of the creative process. These arguments, taken together with existing research in other domains showing less novelty for AI-generated content versus human-generated content (Doshi and Hauser 2024), lead to the following two hypotheses:

Hypothesis 2a (Flexibility): The likelihood of two ideas overlapping is higher for a pool of AI-generated ideas than for a pool of human-generated ideas, resulting in fewer discoverable ideas.

Hypothesis 2b (Originality): The average novelty of AI-generated ideas is lower than that of human-generated ideas.
Our third hypothesis returns to the concept of idea quality. This time, however, we are not concerned with the average idea quality but instead focus on the quality of the best ideas. Rather than focusing on the quality of the single best idea (the extreme value, Dahan and Mendelson 2001), we focus on the 90th percentile of the idea-quality distribution, i.e., the top 10 percent of the ideas. We do so for two reasons.

The first reason is statistical estimation: for a single experiment like ours, there simply does not exist a test that allows us to make statistically significant statements for a single data point. Moving to the 90th percentile, we can compare the mean across larger groups of ideas (Section 8 presents our results for other percentiles).

There also exists a second, managerial reason. In many, if not most, practical settings, the assessment of idea quality is noisy, especially in the early stages of an innovation process when an idea is nothing but a title and a few words. For this reason, innovation tournaments don't just advance a single idea to the next round, but a set of the x percent most promising ideas, where x can vary widely but typically ranges between 10 and 50 percent (Terwiesch and Ulrich 2009). We therefore state:

Hypothesis 3 (Top Decile): The quality of the 90th percentile of AI-generated ideas is higher than that of the 90th percentile of human-generated ideas.
4. Experimental setup

For our experiment, we utilize three different pools of ideas, namely student-generated ideas, GPT-4-generated ideas with zero-shot prompting, and GPT-4-generated ideas with few-shot prompting. For the student pool, we rely on data collected in 2021 in a product design and innovation course at an elite university. In this course, 50 students participated in an innovation challenge to come up with ideas for a physical product marketed to college students for $50 or less (this price cap is imposed to limit the complexity of the projects in a one-semester course). The challenge was organized in a traditional innovation tournament format (Terwiesch and Ulrich 2009, 2023), in which individuals first independently generate many ideas, which are then combined into a pool of several hundred ideas and subsequently evaluated by others in the group (i.e., crowdsourced evaluations). Thus, we have access to a large set of ideas generated by humans before AI tools became widely available to enhance ideation.

Specifically, we use a pool of independently generated human ideas by randomly selecting 200 entries, each comprising a descriptive title and a paragraph of text, from the student ideas generated in these challenges in 2021 (i.e., at a time prior to the widespread availability of ChatGPT and other LLMs). The set of 200 ideas constitutes our first pool and forms the baseline for comparison with the ideas generated using LLMs. We prompt OpenAI's GPT-4 (more specifically, gpt-4-0314) with the same prompt we gave the students. No LLM yet acts entirely autonomously. Rather, they are tools used by humans to complete tasks. For this study, we aim for minimal prompt engineering, thus representing a novice user scenario. However, we acknowledge that many strategies could potentially improve LLM performance. For instance, Mihm and Schlapp (2019) show that providing feedback during ideation contests can further improve the performance of human innovators, and we expect this to hold for LLMs as well.

For our first LLM-generated idea pool, we use the system prompt to provide contextual information and subsequent user prompts to ask for ideas, ten at a time. The user prompt includes the additional request that the descriptions be 40-80 words, like the student sample.

System Prompt
You are a creative entrepreneur looking to generate new product ideas. The product will target college students in the United States. It should be a physical good, not a service or software. I'd like a product that could be sold at a retail price of less than about USD 50. The ideas are just ideas. The product need not yet exist, nor may it necessarily be clearly feasible. Number all ideas and give them a name. The name and idea are separated by a colon.
User Prompt
Please generate ten ideas as ten separate paragraphs. The idea should be expressed as a paragraph of 40-80 words.

The model used for all work covered in this paper is gpt-4-0314 with the temperature parameter at 0.7 to retain randomness and thus greater creativity. The temperature parameter controls the randomness of the output, with lower values leading to more deterministic output and higher values increasing variability. At the time of the experiment, the suggested default value for temperature was 0.7 to strike a balance between coherence and creativity, without sampling highly unlikely tokens (i.e., semantic chunks used for representational efficiency) that lead to undesirable responses.

An obstacle to using GPT-4 for generating hundreds of ideas is its finite memory, typically limited to the number of tokens the underlying LLM can consider in generating its responses. Once the number of tokens in a session exceeds the model's limit, the LLM has no memory of the first ideas generated, and subsequent ideas can become increasingly redundant. The token limit in the version of GPT-4 we had access to was about 8,000 tokens, roughly 7,000 words or approximately 80 ideas (some tokens are used for the system and user prompt and idea titles).

To generate more than the roughly 80 ideas permitted by the limited context window, we asked GPT-4 to compress the previously generated ideas into shorter summaries. These summaries were then provided to the model before generating the next batch of ideas, ensuring that the model knows the previously generated ideas while remaining within the context limits. We used the summarization prompt below, followed by the original system prompt and generated summaries, and finally, a user prompt that explicitly asks for different ideas. This constitutes our second pool of comparison.

Summarization Prompt
Aggressively compress the following ideas so that their original meaning remains but they are much shorter. You can use tags or keywords. [Ideas generated so far]

System Prompt
[Original System Prompt] Previously you generated the following ideas and should not repeat them: [Summaries]

User Prompt
[Original User Prompt] Make sure they are different from the previous ideas.
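For readers who want to reproduce this batched generation loop, the Python sketch below shows one way to wire the prompts together. It is illustrative only: it uses the current OpenAI client library (which postdates our data collection), abbreviates the system prompt shown above, and the loop structure and helper names are our own; only the model name, temperature, and prompt texts come from the setup described in this section.

```python
from openai import OpenAI  # current client, shown for illustration

client = OpenAI()
MODEL = "gpt-4-0314"  # model used in the paper (since deprecated)

SYSTEM_PROMPT = "You are a creative entrepreneur looking to generate new product ideas. ..."  # full text above
USER_PROMPT = ("Please generate ten ideas as ten separate paragraphs. "
               "The idea should be expressed as a paragraph of 40-80 words.")
SUMMARIZE_PROMPT = ("Aggressively compress the following ideas so that their original "
                    "meaning remains but they are much shorter. You can use tags or keywords. ")

def chat(messages: list[dict]) -> str:
    response = client.chat.completions.create(model=MODEL, messages=messages, temperature=0.7)
    return response.choices[0].message.content

batches, summaries = [], []
for _ in range(10):  # ten batches of ten ideas each -> 100 ideas
    system, user = SYSTEM_PROMPT, USER_PROMPT
    if summaries:  # remind the model of earlier batches via compressed summaries
        system += ("\nPreviously you generated the following ideas and should not repeat them:\n"
                   + "\n".join(summaries))
        user += " Make sure they are different from the previous ideas."
    batch = chat([{"role": "system", "content": system},
                  {"role": "user", "content": user}])
    batches.append(batch)
    # Compress this batch so the running history stays within the ~8K-token window.
    summaries.append(chat([{"role": "user", "content": SUMMARIZE_PROMPT + batch}]))
```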
For our second pool of LLM-generated ideas, we provide the LLM with examples (few-shot learning) of high-quality ideas generated by students. In particular, we appended our prompts to provide the LLM with six highly rated ideas from a separate student set that completed the same exercise and informed GPT-4 that these ideas had been well received by students in our class. We used six examples due to context window limitations at the time of the experiment, as well as drawing on previous experiments in in-context few-shot learning where too many examples can degrade performance (see Meincke and Carton 2024). This constitutes our third pool of comparison.

Good Ideas Prompt
[Original System Prompt] Here are some well-received ideas for inspiration: [Good Ideas]

Overall, we generated 100 ideas using zero-shot prompting and another 100 using few-shot prompting. The resulting average word count is 69 for GPT-4-generated ideas and 71 for GPT-4 provided with examples; the average description is 63 words long for student ideas. We compared the resulting few-shot prompted ideas to the examples provided to ensure that GPT-4 did not simply slightly modify the examples. The average pairwise cosine similarity between the six examples and the 100 generated ideas is 0.33, and the highest similarity between two ideas is 0.51. Thus, we have no reason to believe that GPT-4 repeated the provided ideas.
5. Study 1: comparing the quality of ideas generated by humans and AI

The Institutional Review Board (IRB) at the University of Pennsylvania approved the research described in this paper in May 2023, Protocol 853634. We used the online platform Prolific to recruit college-age individuals from the United States to evaluate all 400 ideas from the three pools (pool 1 with 200 ideas created by humans, pool 2 with 100 created by GPT-4 with zero-shot prompting, and pool 3 with 100 created by GPT-4 with few-shot prompting) via a purchase intent survey. We presented ideas in random order and randomized at the idea level, meaning that every survey participant could potentially see ideas from multiple sources. Each respondent evaluated an average of 40 ideas. On average, each idea was evaluated 20 times. In the summer of 2023, concerns surfaced that ChatGPT was being used to provide mTurk responses. This practice appears to have been limited to text generation tasks, not to multiple-choice tasks like our five-box purchase-intent survey. Indeed, just answering the survey question directly requires less effort than trying to deploy ChatGPT to answer the question. We thus believe that our study participants were humans.

We asked respondents to express purchase intent using the standard five-box options: definitely would not purchase, probably would not purchase, might or might not purchase, probably would purchase, and definitely would purchase. Jamieson and Bass (1989) recommend weighting the five possible responses as 0, 0.25, 0.50, 0.75, and 1.00 to develop a single measure of purchase probability, which we use as a measure of idea quality (other weightings are possible, as we discuss in Section 8). Figure 1 shows the full quality distribution of ideas generated by the three pools.
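As an illustration of this weighting scheme, the short Python sketch below converts one idea's five-box responses into the purchase-probability score we use as idea quality. The function and variable names are ours, and the sample ratings are made up.

```python
import numpy as np

# Five-box purchase-intent weights per Jamieson and Bass (1989).
WEIGHTS = {
    "definitely would not purchase": 0.00,
    "probably would not purchase": 0.25,
    "might or might not purchase": 0.50,
    "probably would purchase": 0.75,
    "definitely would purchase": 1.00,
}

def purchase_intent(responses: list[str]) -> float:
    """Average weighted purchase probability across one idea's respondents."""
    return float(np.mean([WEIGHTS[r] for r in responses]))

# Example: one idea rated by five respondents scores (0.75 + 0.50 + 1.00 + 0.25 + 0.75) / 5.
ratings = ["probably would purchase", "might or might not purchase",
           "definitely would purchase", "probably would not purchase",
           "probably would purchase"]
print(purchase_intent(ratings))  # 0.65
```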
Figure 1   Distribution of idea quality for three sets of ideas
Notes. Purchase intent is the weighted average of the five-box response scale per Jamieson and Bass (1989).
Figure 1 shows the quality (purchase probability) of ideas across the three pools. On average, GPT-4 generated ideas with greater purchase intent (46.4% with zero-shot prompting and 49.3% with few-shot prompting) than humans (40.4%). The standard deviation of the quality of ideas is comparable across the three pools. We formally test the impact of idea source on the perceived quality of product ideas via a linear mixed-effects model with purchase intent as the dependent variable. The model included two fixed effects denoting the source (humans are the baseline) and random intercepts and slopes for respondents and ideas. We find significant differences in the perceived quality of ideas as a function of their source. Ideas generated by GPT-4 with no examples (zero-shot) were rated significantly higher than human-generated ideas (β = 0.059; 95% CI [0.031, 0.088]; t(246) = 4.06, p < 0.001), and ideas generated by GPT-4 provided with positive examples (few-shot) received even higher ratings (β = 0.089; 95% CI [0.060, 0.12]; t(223) = 5.93, p < 0.001). Purchase intent is weakly significantly different between the two pools of LLM-generated ideas (β = 0.03; 95% CI [-0.01, 0.06]; t(199) = 1.892, p = 0.06). These findings indicate that LLM-generated ideas are, on average, more likely to be purchased than human-generated ideas (for additional robustness tests, see Section 8).
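A simplified version of such a model can be fit with statsmodels, as in the sketch below, which uses crossed random intercepts for respondents and ideas expressed as variance components; our actual specification also includes random slopes, which this sketch omits for brevity. The data frame layout and column names are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per rating: intent in [0, 1]; source in {"human", "zero_shot", "few_shot"};
# respondent and idea are identifiers. "ratings.csv" is a hypothetical file.
df = pd.read_csv("ratings.csv")

# Crossed random intercepts for respondents and ideas, written as variance
# components inside a single dummy group (a standard statsmodels pattern).
df["one_group"] = 1
model = smf.mixedlm(
    "intent ~ C(source, Treatment('human'))",  # humans as the baseline level
    data=df,
    groups="one_group",
    vc_formula={"respondent": "0 + C(respondent)", "idea": "0 + C(idea)"},
)
print(model.fit().summary())
```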
6. Study 2: Diversity and Novelty of Ideas

Our second study focuses on how the fraction of overlapping ideas and the resulting estimated total number of ideas the process can generate (idea flexibility, Hypothesis 2a), as well as the perceived novelty of the ideas as assessed by human raters (idea originality, Hypothesis 2b), depend on the idea source.

6.1. Overlapping Ideas
An idea-generation process creates a sequence of ideas in which each additional idea generated can be compared to the previously created ideas according to its similarity. For a pool of ideas, we can hence compute the average pairwise similarity of one idea compared to all other ideas and then compute the average overall similarity for the entire pool. We can also apply a threshold to pairwise idea similarity to identify at what point the ideas start to become more repetitive, i.e., when we are starting to exhaust the space of new ideas given a particular idea-generation process. A pool of ideas then might have a few overlapping ideas, which informs our second quantitative metric, the total number of ideas the process can generate.

To measure the diversity of the ideas, we calculate the cosine similarity of each idea relative to the rest of the set. We first calculate a vector of text embeddings for each idea. We follow the technical setup in Dell'Acqua et al. (2023) and use Google's Universal Sentence Encoder (USE) model for our idea embeddings, which is specifically optimized for semantic similarity between sentences. Table 1 shows the results.

In geometry, the cosine of the angle between vectors ranges from -1 to 1. However, when using Google USE, negative similarity is rarely encountered, since the overall text structure does not substantially differ between ideas. Ideas follow a similar pattern in terms of text length and style, often leading with the title before the idea description. In our test, a cosine similarity of 1 between two ideas thus indicates that they are very similar (their embedding vectors are aligned), whereas a cosine similarity of 0 implies orthogonal or unrelated ideas. We consider a new idea added to an idea pool to be unique if its pairwise cosine similarity compared to all previously added ideas is never greater than 0.8. Additional robustness checks using different thresholds and measures can be found in Section 8.
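The Python sketch below shows how such embeddings and pairwise similarities can be computed with the publicly available USE model; the placeholder idea texts and the threshold handling are illustrative, not our exact pipeline.

```python
import numpy as np
import tensorflow_hub as hub

# Google's Universal Sentence Encoder, as referenced in the text.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

ideas = [
    "Convertible High-Heel Shoe: a stylish heel that folds down for walking.",
    "Compact Printer: a small portable printer for dorm rooms.",
]  # placeholder idea texts
vectors = np.asarray(embed(ideas))

# Normalize rows so the matrix product yields all pairwise cosine similarities.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
sim = vectors @ vectors.T

# Average pairwise similarity over distinct pairs (upper triangle, no self-pairs).
pairs = sim[np.triu_indices(len(ideas), k=1)]
print("average pool similarity:", pairs.mean())

# Fraction of ideas overlapping with at least one other idea at theta = 0.8.
theta = 0.8
closest_other = (sim - np.eye(len(ideas))).max(axis=1)  # zero out self-similarity
print("fraction overlapping:", float((closest_other > theta).mean()))
```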
Table 1   Summary Statistics for Idea Overlap

Metric | Student Ideas | GPT-4 zero-shot | GPT-4 few-shot
N ideas | 200 | 100 | 100
Average cosine similarity of all ideas | 0.221 | 0.415 | 0.428
Fraction of ideas in pool with cosine similarity > 0.8 | 0.00 | 0.05 | 0.07

Notes. We compute the fraction as the number of ideas whose average pairwise similarity compared to all other ideas in the pool exceeds 0.8, divided by the total number of ideas in the pool.
For each pool, we compute the average pairwise similarity between all ideas. One-way ANOVA analyses show that the source has a significant effect on the cosine similarity between the three pools. The difference between all three groups is significant (η² = 0.455, 95% CI [-0.210, -0.204], F(2, 29598) = 12340.95, p < 0.001). Considering only two groups, human ideas have a significantly smaller cosine similarity than GPT-4-generated ideas (η² = 0.358, 95% CI [-0.197, -0.190], F(1, 24649) = 13715.82, p < 0.001). Zero-shot GPT-4 ideas exhibit a significantly smaller cosine similarity than few-shot GPT-4 ideas (η² = 0.004, 95% CI [-0.018, -0.010], F(1, 9898) = 44.24, p < 0.001).

Because there is no overlap among human-generated ideas under our cosine similarity threshold, the fraction of overlapping ideas is zero and the number of unique ideas infinitely large, in line with Hypothesis 2a. A larger pool of student ideas will eventually contain overlapping ideas (see Kornish and Ulrich 2014 for estimates), but based on our assumptions for similarity, the student sample contains only unique ideas. We perform a binomial test to formally estimate the significance of the differences. We find that the fraction of similar human-generated ideas (95% CI for fraction [0.0, 0.0184]) is significantly smaller than that of the zero-shot GPT-4 ideas (RD = -0.05, 95% CI [-0.093, -0.007], p < 0.001) and few-shot GPT-4 ideas (RD = -0.07, 95% CI [-0.120, -0.020], p < 0.001), supporting Hypothesis 2a. The difference between the two GPT-4 pools is not significant (RD = -0.02, 95% CI [-0.086, 0.046], p = 0.56). Our findings suggest that the human ideation process generates a greater number of distinct ideas than GPT-4. We calculate the exact numbers in the next section.
Figure 2   Distribution of cosine similarities across the three pools
Notes. Density plot of cosine similarities comparing all three pools. The dotted line shows the mean and confidence interval of the estimate for a pool used for the ANOVA. The difference between all three groups is significant (η² = 0.455, 95% CI [-0.210, -0.204], F(2, 29598) = 12340.95, p < 0.001).
6.2. Number of Discoverable Ideas
Given the fraction of unique ideas, we can estimate the number of unique ideas that could be generated by each of our three processes (pools): students, LLM (zero-shot), and LLM prompted with examples (few-shot), using the method of Kornish and Ulrich (2011). This method, which uses the capture-recapture approach to analyze the probability that the next idea in a sequence is unique, reportedly originates with Laplace (Cochran 1978) and has been adapted to wildlife ecology and other domains. For illustration, consider again fishing in a lake as a metaphor for the idea-generation process. Each idea is a catch, and the fish is released back into the lake. Sometimes, the same fish will be caught again. The more frequently an individual fish is re-caught, the smaller the estimate of the overall fish population. Thus, the probability that a fish has never been caught previously is a decreasing function of the number of ideas generated.

This probability decay is typically represented by an exponential function:

p(n) = e^(-an)     (1)

We define p(n) as the probability that the next idea is unique given that n ideas have been generated already. The expected number of unique ideas out of n generated, u(n), is the integral under this curve:

u(n) = (1/a)(1 - e^(-an))     (2)

This form of probability decay comes from a specific underlying process, with T unique ideas in total (T fish in the pond), each equally likely to be drawn. This assumption is commonly used in the Lincoln-Petersen method (Lincoln 1930), the standard model for estimating population size in the literature on wildlife ecology. The decay parameter and the total T are linked: T = 1/a. This model has only a single parameter, a, which is the inverse of the size of the opportunity space, i.e., of an estimate of the total number of unique ideas that an unlimited number of comparable idea generators, each generating an enormous number of ideas, would generate.

Given a set of ideas generated and a count of the number of unique ideas in that set, the model can be used to calculate T, an estimate of the size of the opportunity space. Using the similarity threshold of 0.8 from the cosine similarity metric, we found that 5 of the 100 ideas generated by the LLM with zero-shot prompting were essentially similar to an idea already generated (fish recaptured), and that 7 of the 100 ideas generated via few-shot prompting were redundant. Thus, u(100) is 95 in the first case and 93 in the second case. This corresponds to an estimate of T of 966 ideas (zero-shot) and of 680 ideas (few-shot), respectively.
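To see where such estimates come from, the Python sketch below solves equation (2) for the decay parameter a given the observed number of unique ideas and reports T = 1/a. This is an illustrative reconstruction of the calculation; the function name and the choice of numerical root-finder are ours.

```python
import numpy as np
from scipy.optimize import brentq

def estimate_opportunity_space(n_generated: int, n_unique: int) -> float:
    """Estimate T = 1/a from one sample, per the capture-recapture model.

    Solves u(n) = (1/a) * (1 - exp(-a * n)) = n_unique for a, following
    the logic of Kornish and Ulrich (2011), and returns T = 1/a.
    """
    def gap(a: float) -> float:
        return (1.0 / a) * (1.0 - np.exp(-a * n_generated)) - n_unique

    # u(n) is decreasing in a, so the root in this bracket is unique.
    a = brentq(gap, 1e-9, 1.0)
    return 1.0 / a

print(round(estimate_opportunity_space(100, 95)))  # ~966 ideas (zero-shot)
print(round(estimate_opportunity_space(100, 93)))  # ~680 ideas (few-shot)
```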
In our sample, human-generated ideas were all unique. Thus, as expected from our overlap calculations, and based on the estimates provided by the capture-recapture model, we find support for the second quantitative metric of Hypothesis 2a. The number of unique ideas that can be discovered is lower for both pools of AI-generated ideas than for the human idea-generation process. In addition, prompting the LLM with examples seems to further reduce the estimated number of unique ideas available to the process. We perform additional robustness checks in Section 8.

6.3. Perceived Novelty
Given that LLMs are designed to generate the statistically most plausible sequence of text based on their training data, perhaps they generate less novel ideas than humans. Novelty is not a goal expressed in the prompt used in this study for either humans or GPT-4 and is typically not a primary objective in commercial product development efforts. Still, to ensure that GPT-generated ideas are not merely lists of existing ideas, we investigate how the novelty of ideas varies between LLM-generated ideas and those generated by humans.

Based on Shibayama et al. (2021), we assessed novelty by asking respondents on Prolific the question "Relative to other products you have seen, how novel do you consider the idea for this new product?" [0: Not at all novel, 0.25: Slightly novel, 0.5: Moderately novel, 0.75: Very novel, 1: Extremely novel]. The average novelty of human-generated ideas is 40.6% (SD: 0.117), which is greater than that of zero-shot GPT-4 (36.7%, SD: 0.101) and few-shot GPT-4 (36.1%, SD: 0.111; see Figure 3).
Similar to purchase intent, we estimate a linear mixed-effects model to investigate how the idea source (human ideas, zero-shot GPT-4, and few-shot GPT-4) affects the perceived novelty of product ideas. The model includes two fixed effects denoting the source (humans are the baseline) and random intercepts and slopes for both respondents and ideas.

We find significant differences in perceived novelty between human and zero-shot GPT-4-generated ideas (β = -0.038; 95% CI [-0.066, -0.01]; t(269) = -2.67, p = 0.008) at the alpha = 0.05 threshold. Ideas generated by few-shot GPT-4 also receive significantly lower novelty ratings (β = -0.049; 95% CI [-0.078, -0.02]; t(268) = -3.4, p < 0.001) compared to human-generated ideas. These findings suggest that some LLM-generated ideas are perceived as less novel than human-generated ideas.

Perceived novelty is not significantly different between the two pools of LLM-generated ideas (β = -0.01; 95% CI [-0.039, 0.017]; t(195) = -0.757, p = 0.45). Of note, novelty does not appear to be significantly correlated with purchase intent. The correlation coefficient is slightly negative at -0.08 (95% CI [-0.176, 0.016], p = 0.12). Additional robustness checks can be found in Section 8.

Figure 3   Distribution of novelty ratings for three samples of ideas
Notes. Novelty based on mTurk assessment per Kwon, Kim, and Lee (2009).
These findings support Hypothesis 2b: AI-generated ideas are, on average, less novel than human-generated ideas. Of note, the average novelty of all ideas, irrespective of source, lies between slightly and moderately novel. While human ideas are around 0.047 points more novel, there is little reason to believe that novelty alone, i.e., being the first to think of an idea, leads to a significant financial advantage. As Terwiesch and Ulrich (2010) and others have argued, the first-mover advantage is a myth. As such, from a commercial point of view, we don't believe that the slightly lower novelty outweighs the productivity and quality benefits of LLMs.
7. Study 3: What is the quality of the best idea(s)?

Table 2 summarizes the titles of the top 40 ideas (10%) in our pool, that is, the top 40 out of the 400 ideas used.
Table 2   Top 10% of Ideas by Purchase Intent

Title | Source | Purchase Intent | Novelty
Compact Printer | GPT-4 (Few-Shot) | 0.76 | 0.55
Solar-Powered Gadget Charger | GPT-4 (Few-Shot) | 0.75 | 0.44
QuickClean Mini Vacuum | GPT-4 (Zero-Shot) | 0.75 | 0.30
Noise-Canceling Headphones | GPT-4 (Few-Shot) | 0.72 | 0.18
StudyErgo Seat Cushion | GPT-4 (Zero-Shot) | 0.72 | 0.39
Multifunctional Desk Organizer | GPT-4 (Few-Shot) | 0.71 | 0.21
Reusable Silicone Food Storage Bags | GPT-4 (Few-Shot) | 0.68 | 0.34
Portable Closet Organizer | GPT-4 (Few-Shot) | 0.67 | 0.23
Dorm Room Chef* [oven, microwave and toaster] | GPT-4 (Few-Shot) | 0.67 | 0.71
Collegiate Cookware | GPT-4 (Few-Shot) | 0.67 | 0.45
Collapsible Laundry Basket | GPT-4 (Few-Shot) | 0.65 | 0.21
On-the-Go Charging Pouch | GPT-4 (Few-Shot) | 0.65 | 0.33
GreenEats Reusable Containers | GPT-4 (Zero-Shot) | 0.65 | 0.21
HydrationStation* [bottle with filter] | GPT-4 (Zero-Shot) | 0.64 | 0.19
Reusable Shopping Bag Set | GPT-4 (Few-Shot) | 0.64 | 0.19
CollegeLife Collapsible Laundry Hamper | GPT-4 (Zero-Shot) | 0.64 | 0.26
Adaptiflex* [cord extension to fit big adapters] | Student | 0.64 | 0.44
SpaceSaver Hangers | GPT-4 (Zero-Shot) | 0.64 | 0.33
Dorm Room Air Purifier | GPT-4 (Few-Shot) | 0.63 | 0.29
Smart Power Strip | GPT-4 (Few-Shot) | 0.63 | 0.22
CampusCharger Pro | GPT-4 (Zero-Shot) | 0.63 | 0.31
Kitchen Safe Gloves | Student | 0.62 | 0.31
Nightstand Nook* [charging, cup holder] | GPT-4 (Few-Shot) | 0.62 | 0.43
Mini Steamer | GPT-4 (Few-Shot) | 0.62 | 0.41
CollegeCare First Aid Kit | GPT-4 (Zero-Shot) | 0.62 | 0.26
StudySoundProof* [soundproofing panels] | GPT-4 (Zero-Shot) | 0.62 | 0.57
FreshAir Fan | GPT-4 (Zero-Shot) | 0.62 | 0.29
StudyBuddy Lamp* [portable, usb charging] | GPT-4 (Zero-Shot) | 0.62 | 0.43
Bluetooth Signal Merger* [share music] | Student | 0.62 | 0.41
Adjustable Laptop Riser | GPT-4 (Few-Shot) | 0.62 | 0.21
EcoCharge* [solar powered charger] | GPT-4 (Zero-Shot) | 0.62 | 0.43
Smartphone Projector | Student | 0.62 | 0.57
Grocery Helper* [hook to carry multiple bags] | Student | 0.62 | 0.53
FitnessOnTheGo* [portable gym equipment] | GPT-4 (Zero-Shot) | 0.62 | 0.42
Multipurpose Fitness Equipment | GPT-4 (Few-Shot) | 0.62 | 0.37
CollegeCooker | GPT-4 (Zero-Shot) | 0.61 | 0.50
Multifunctional Wall Organizer | GPT-4 (Few-Shot) | 0.61 | 0.31
DormDoc Portable Scanner | GPT-4 (Zero-Shot) | 0.61 | 0.49
Mobile Charging Station Organizer | GPT-4 (Few-Shot) | 0.61 | 0.26
StudyMate Planner | GPT-4 (Few-Shot) | 0.61 | 0.22
DormChef Kitchen Set | GPT-4 (Zero-Shot) | 0.61 | 0.33
LaundryBuddy* [laundry basket] | GPT-4 (Zero-Shot) | 0.61 | 0.30

Notes. The asterisk (*) denotes ideas where the text in square brackets [ ] is not part of the original title and was added to clarify the idea.
Among the top 40 ideas (top decile), 35 (87.5%) were generated by GPT-4 (see Table 3). In other words, for every human idea in the top 10%, we count 7 ideas generated by GPT-4. A Chi-Square test of independence, with the null hypothesis of equal representation of all sources among the top ideas (20, 10, and 10), rejected the null hypothesis (χ² = 26.39, p < 0.001, df = 2), thus confirming Hypothesis 3.
Table 3. Best Ideas Across Pools

| | Student Ideas | GPT-4 zero-shot | GPT-4 few-shot |
|---|---|---|---|
| N Ideas | 200 | 100 | 100 |
| Average Quality of Top Decile | 0.62 | 0.64 | 0.66 |
| Average Novelty of Top Decile | 0.45 | 0.35 | 0.33 |
| Fraction of the top decile of pooled ideas from this source | 5/40 | 15/40 | 20/40 |
To better understand how the full distribution of idea quality is affected by the idea source, we use quantile regression analysis. Quantile regression (Koenker and Hallock 2001) extends traditional regression by computing the relationship between the explanatory variables (idea source) and the response variable (idea quality) at different percentiles of the data. As mentioned above, in innovation the quality of the best ideas is generally more important than the average quality; that is, we prefer a few exceptional ideas to a lot of mediocre ones. Using quantile regression, we can examine the tails of the distribution instead of the mean, allowing us to test whether GPT-4 excels at generating high-quality ideas only at specific percentiles or whether the effect holds across the entire distribution.

Our analysis follows Girotra et al. (2010). We use the average idea quality ratings as the dependent variable, and our explanatory variable is a binary variable indicating whether the idea is human-generated (baseline level) or AI-generated (GPT-4 zero-shot and GPT-4 few-shot prompting). Figure 4 shows the results. At all percentiles, GPT-4 ideas consistently outperform student ideas. The effect is especially pronounced in the upper tail of the distribution (80th percentile and above), where GPT-4 has the strongest advantage. This implies that not only does GPT-4 generate better ideas on average, but it is also especially adept at producing top-tier ideas compared to students.
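A minimal sketch of this specification with statsmodels is shown below. The data are synthetic and the column names (mean_rating, source_ai) are illustrative assumptions; the point is the per-quantile fit of MeanRating ~ SourceAI described in Appendix A.

```python
# Quantile regression of idea quality on idea source, fit separately at each
# decile (mirroring Table A.1). Synthetic data; column names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
ideas = pd.DataFrame({"source_ai": rng.integers(0, 2, n)})  # 0 = student, 1 = GPT-4
ideas["mean_rating"] = 1.5 + 0.3 * ideas["source_ai"] + rng.normal(0, 0.4, n)

for q in np.arange(0.1, 1.0, 0.1):
    fit = smf.quantreg("mean_rating ~ source_ai", ideas).fit(q=q)
    lo, hi = fit.conf_int().loc["source_ai"]
    print(f"q = {q:.1f}: beta = {fit.params['source_ai']:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```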
Figure 4. Estimated Difference in Idea Quality Ratings between AI-Generated Ideas and Human-Generated Ideas (Baseline), for Different Percentiles
8. Discussion and Limitations

In this section, we discuss conceptual limitations of our work, limitations related to our research design and data analysis, and the robustness of our analysis to a set of alternative specifications and assumptions.
Our findings indicate that GPT-4 produces higher-quality ideas that are more likely to be purchased than those generated by humans, though they are perceived as less novel. AI significantly outperforms human creativity in generating top-tier ideas, with GPT-4 ideas being seven times more likely to rank in the top 10%. Given AI's advantage in both quality and productivity, our findings have profound implications for the field of innovation management. For instance, AI can serve as a first step in brainstorming sessions, allowing organizations to rapidly explore a wide variety of ideas with minimal cost and time investment. Human ideators can also provide AI with their own interesting ideas and refine them with the help of AI. Another important implication lies in the potential shift of focus from idea generation to idea evaluation. If LLMs can reliably produce numerous high-quality ideas at very low cost, companies might allocate more resources toward assessing and refining those ideas instead of ideating from scratch. This shift could lead to the development of new tools and frameworks specifically designed to help organizations sort, rank, and prioritize AI-generated ideas, further streamlining the innovation process.
However, while the results show that GPT-4 outperforms human creativity in terms of producing top-tier ideas, the reduced novelty and increased similarity among AI-generated ideas point to a limitation. This suggests that a human in the loop is still important to drive the ideation direction and ensure that ideas are as novel as possible. Future research could explore ways to mitigate this issue by enhancing LLMs' ability to generate more diverse and creative solutions through techniques such as fine-tuning.

Investigating whether LLMs can evaluate ideas with the same rigor as human evaluators would help to further improve the ideation process. It would allow an LLM to get immediate feedback on its creations, leaving humans to focus on implementation and strategy.
8.1 Conceptual and Research Design Limitations

Conceptually, our prompting approach (i.e., a simple prompt) is not optimized for creativity or novelty. It also follows a single-ideator setup instead of approaches such as hybrid brainstorming that lead to more and better ideas (Girotra et al. 2010). A model given more specific instructions on how to ideate effectively might thus perform even better. Different prompting techniques such as Chain-of-Thought (CoT), which asks the model to reason through a problem in multiple steps instead of directly providing an answer (Wei et al. 2023), might also improve performance. Furthermore, providing the model with hundreds of good ideas, either via many-shot learning or fine-tuning, could also enhance performance. This suggests that we likely underestimate the true power of AI-based idea generation.

Second, it is possible that professional product innovators would generate better ideas than our students. However, this has not been the experience of the paper's authors, who have taught many academic courses and worked in many product development settings. Many students who participated in the innovation contests have gone on to be product innovators, sometimes based on ideas from the course tournament.
Nevertheless, we have not produced evidence that GPT-4 is better than the best product innovators working today. However, we believe we can conservatively claim that GPT-4 is better than many human product innovators working today, and probably better than the average one. Thus, at a minimum, an LLM could elevate the least capable humans to a better-than-average level of performance.
Third, GPT might be a great salesperson. As such, it is possible that the writing style (the "pitch") convinces the customers rather than the idea itself. Prior work in other domains suggests that the text generated by LLMs is not distinguishable from that generated by humans (Brown et al. 2020), though recent work has developed sophisticated measures to detect LLM-generated text (Mitchell et al. 2023, Kobak et al. 2024, Venkatraman et al. 2024). For example, Kobak et al. (2024) provide intuitions that could be used to identify LLM-generated text, such as words that are not commonly used by the majority of English speakers, like "delve." However, it is unlikely that these characteristics were known to our survey participants at the time of our experiment in May and June 2023, or that any particular idea generated by GPT-4 could easily be distinguished from those generated by our students. Future research could use LLMs to present human-generated ideas in a way that more closely mimics the presentation style of LLM-generated ideas, ensuring that the quality of the idea is not confounded by its presentation style.
Fourth, our study is set in the widely understood domain of consumer products for the college student market that cost less than $50. Presumably, there exists a lot of commentary and data about such products in the training data used by the GPT class of language models. As such, it is unclear whether our results would generalize to more specialized domains, such as surgical instruments. Organizations looking for opportunities in these specialized domains should fine-tune language models with their own proprietary data to achieve comparable or better performance.
Fifth, innovation often benefits from collaboration and is not solely focused on one ideator generating many ideas. Liu et al. (2018) show that collaborating with other innovators improves the creative process by enabling the transfer of critical skills and knowledge, particularly when those collaborations involve highly skilled innovators. Future work should investigate whether this finding extends to human-LLM interaction, and whether an LLM could help a novice human innovator become better.
8.2 Robustness

There are different ways to analyze the data. Here, we provide additional robustness checks that investigate the validity of our results under various specifications.

8.2.1 Study 1

To measure purchase intent, it is possible to use other convex weighting schemes. Ulrich and Eppinger (2007) weight "definitely would purchase" as 0.4 and "probably would purchase" as 0.2, with all other responses weighted as 0. When using this alternative set of weights, we find the same significant differences between pools.
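As a simple illustration, the sketch below computes purchase intent under the alternative Ulrich and Eppinger (2007) weights. Only the 0.4/0.2/0 weights come from the text; the exact response labels of the five-point scale are assumptions.

```python
# Alternative convex weighting of stated purchase intent (Ulrich and
# Eppinger 2007): 0.4 for "definitely would purchase", 0.2 for "probably
# would purchase", and 0 for all other responses.
ALT_WEIGHTS = {
    "definitely would purchase": 0.4,
    "probably would purchase": 0.2,
    "might or might not purchase": 0.0,    # assumed label
    "probably would not purchase": 0.0,    # assumed label
    "definitely would not purchase": 0.0,  # assumed label
}

def purchase_intent(responses, weights=ALT_WEIGHTS):
    """Average weighted purchase-intent probability across survey responses."""
    return sum(weights[r] for r in responses) / len(responses)

sample = ["definitely would purchase", "probably would purchase",
          "probably would not purchase"]
print(purchase_intent(sample))  # (0.4 + 0.2 + 0.0) / 3 = 0.2
```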
As a robustness test for our primary purchase-intent analysis using a linear mixed-effects model, we also conduct a simpler linear regression focusing on the average perceived quality of product ideas across different sources. This model aggregates individual ratings at the idea level, removing the random effects to capture the overall influence of the source on rating averages. The results confirm our previous findings and show that ideas from GPT-4 (zero-shot) are rated higher than human ones by an average of 0.256 points (95% CI [0.15, 0.37]; t = 4.602, p < 0.001), and ideas from GPT-4 (few-shot) are rated higher by an average of 0.358 points (95% CI [0.25, 0.47]; t = 6.435, p < 0.001).
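The sketch below illustrates this idea-level aggregation with statsmodels, under assumed column names and synthetic ratings; the student pool serves as the baseline, as in the paper.

```python
# Idea-level robustness check: average each idea's ratings, then regress the
# idea means on the source with plain OLS (no random effects). Synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 6000
ratings = pd.DataFrame({
    "idea_id": rng.integers(0, 400, n),
    "rating": rng.integers(1, 6, n).astype(float),  # 1-5 purchase-intent scale
})
# Assumed pool layout: ideas 0-199 student, 200-299 zero-shot, 300-399 few-shot.
ratings["source"] = np.select(
    [ratings["idea_id"] < 200, ratings["idea_id"] < 300],
    ["Student", "Zero-Shot"], default="Few-Shot")

idea_means = ratings.groupby(["idea_id", "source"], as_index=False)["rating"].mean()
ols = smf.ols("rating ~ C(source, Treatment(reference='Student'))", idea_means).fit()
print(ols.params, ols.conf_int(), sep="\n")
```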
In addition, we estimate a cumulative link mixed model (CLMM) to treat the rating outcome as a factor. We find significant differences in the perceived quality, measured as purchase intent of product ideas, between sources. Ideas generated by GPT-4 (zero-shot) receive a significantly greater average rating (β = 0.395; 95% CI [0.215, 0.575]; z = 4.31, p < 0.001). Similarly, ideas generated by GPT-4 (few-shot) receive even higher ratings (β = 0.581; 95% CI [0.400, 0.762]; z = 6.30, p < 0.001) compared to human-generated ideas. These findings suggest that LLM-generated ideas are perceived as more likely to be purchased than human-generated ideas, with the highest perceived quality attributed to few-shot GPT-4-generated ideas.
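The paper's CLMM includes random effects for ideas and respondents, a model typically fit with R's ordinal::clmm. As a simplified, fixed-effects-only analogue, the sketch below fits a cumulative-link (ordered logit) model with statsmodels on synthetic data; treat it as illustrative, not as the paper's exact specification.

```python
# Fixed-effects ordered-logit sketch of a cumulative link model: ratings are
# treated as an ordered factor with source dummies as predictors. Synthetic
# data; no random effects (unlike the paper's CLMM).
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(2)
n = 2000
source = pd.Series(rng.choice(["Student", "Zero-Shot", "Few-Shot"], n))
latent = source.map({"Student": 0.0, "Zero-Shot": 0.4, "Few-Shot": 0.6}) \
    + rng.logistic(size=n)
rating = pd.cut(latent, [-np.inf, -1, 0, 1, 2, np.inf], labels=False) + 1  # 1-5

exog = pd.get_dummies(source)[["Zero-Shot", "Few-Shot"]].astype(float)
fit = OrderedModel(rating, exog, distr="logit").fit(method="bfgs", disp=False)
print(fit.summary())  # coefficients are on the log-odds scale, as in a CLMM
```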
8.2.2 Study 2

Our chosen threshold of θ = 0.8 was established through experimentation by comparing pairs of ideas and their respective similarity scores. However, our findings are robust to other values such as 0.7 (25 and 37 overlapping ideas for zero-shot and few-shot GPT-4, respectively) and 0.75 (16 and 23 overlapping ideas). At θ = 0.85, the zero-shot GPT-4 pool features only two overlapping ideas, whereas the few-shot pool features one. Because these are extreme values that approach zero, we used 0.8 as our main threshold. We compute the pairwise similarity for an idea compared to all other ideas in the pool and calculate the average. Mean pairwise similarity is a common measure in ideation (Siangliulue et al. 2016, Cox et al. 2021) and similar text-mining tasks (Doshi and Hauser 2024), but it is not without issues, as it lacks sensitivity to highly clustered ideas. As an additional specification, we consider the per-pool collective diversity of all ideas by following the work of Cox et al. (2021) and construct a minimum spanning tree (MST), which spans all points (ideas) in space with the smallest total distance along the edges. In 2D space, an MST would be the tree that contains all points with the shortest overall length of edges. We compute the mean of all edge distances as a measure of how distributed the ideas are in the high-dimensional space. The spanning tree is constructed in high-dimensional space (512 dimensions), and its edge weights are summed and divided by the number of edges, yielding a measure that ranges from 0 (not diverse at all) to 1 (very diverse). Based on this measure, the student idea pool is the most diverse (0.53), GPT-4 zero-shot is the second most diverse (0.33), and GPT-4 few-shot is the least diverse (0.30).
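A sketch of both diversity measures follows, assuming unit-normalized 512-dimensional embeddings and cosine distance; the exact embedding model and distance convention used in the paper may differ, so treat this as illustrative.

```python
# Two diversity measures over one pool of idea embeddings: (1) mean pairwise
# cosine similarity per idea, with a threshold theta = 0.8 flagging
# overlapping (near-duplicate) idea pairs, and (2) the mean edge length of a
# minimum spanning tree over all ideas as a collective-diversity measure.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(3)
emb = rng.normal(size=(100, 512))                  # placeholder embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows

sim = emb @ emb.T                                  # cosine similarity matrix
np.fill_diagonal(sim, 0.0)
mean_pairwise_sim = sim.sum(axis=1) / (len(emb) - 1)

theta = 0.8
overlapping_pairs = np.argwhere(np.triu(sim > theta, k=1))

dist = 1.0 - sim                                   # cosine distance
np.fill_diagonal(dist, 0.0)
mst = minimum_spanning_tree(dist)                  # n - 1 edges over n ideas
mean_mst_edge = mst.sum() / (len(emb) - 1)

print(f"mean pairwise similarity: {mean_pairwise_sim.mean():.3f}")
print(f"overlapping pairs (theta = 0.8): {len(overlapping_pairs)}")
print(f"mean MST edge distance: {mean_mst_edge:.3f}")
```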
Similar to purchase intent, we also conduct a simpler linear regression focusing on the average perceived novelty of product ideas across different sources. This model aggregates individual ratings at the idea level, removing the random effects to capture the overall influence of the source on rating averages. We find that ideas from GPT-4 (zero-shot) are significantly less novel than human ones (β = -0.177; 95% CI [-0.286, -0.069]; t = -3.22, p = 0.0014). Ideas from GPT-4 (few-shot) are rated as significantly less novel than human ones (β = -0.197; 95% CI [-0.305, -0.089]; t = -3.58, p < 0.001). This simpler analysis reinforces that human ideas are more novel than AI-generated ones, even when using zero-shot prompting.
In addition, we estimate a cumulative link mixed model (CLMM) to treat the rating outcome as a factor. We find significant differences in the perceived novelty. Ideas generated by GPT-4 (zero-shot) receive a significantly lower average rating (β = -0.306; 95% CI [-0.514, -0.10]; z = -2.89, p < 0.01). Similarly, ideas generated by GPT-4 (few-shot) receive even lower ratings (β = -0.39; 95% CI [-0.60, -0.18]; z = -3.66, p < 0.001) compared to human-generated ideas. These findings suggest that LLM-generated ideas are perceived as less novel than human-generated ideas, with the lowest perceived novelty attributed to few-shot GPT-4-generated ideas.
8.2.3 Study 3

In this study, we present our results for the 90th percentile of all aggregated ideas. Table 4 shows that using other percentiles yields similar results.
Table 4. Top 5 and 15 Percent of Ideas: Pool Distributions

| | Student Ideas | GPT-4 zero-shot | GPT-4 few-shot |
|---|---|---|---|
| Average Quality of Top 5% | 0.64 | 0.67 | 0.68 |
| Fraction of the top 5% of pooled ideas from this source | 1/20 | 6/20 | 14/20 |
| Average Quality of Top 15% | 0.60 | 0.62 | 0.64 |
| Fraction of the top 15% of pooled ideas from this source | 11/60 | 22/60 | 27/60 |
9. Summary

GenAI has demonstrated remarkable advancements in creating coherent and fluent text, equaling or surpassing human performance in various academic and professional domains. In this study, we explored the ideation capabilities of OpenAI's GPT-4, a state-of-the-art large language model, in comparison to the ideation abilities of university students when generating ideas for new products targeted toward college students at a price point of $50 or less. Specifically, we make three main contributions to the literature on innovation and the role of AI.
First, GPT-4 produces high-quality ideas that are perceived as more likely to be purchased than human-generated ideas. Second, consumers perceive AI-generated ideas as less novel. Third, when considering the quality of the best ideas, AI outperforms human creativity significantly. To put these findings in context, innovation favors a few great ideas over a large number of solid ideas, and our results show that AI-generated ideas are seven times more likely than human ideas to be among the top 10% of ideas considered in our experiment. Despite the reduction in novelty, the overall AI advantage thus remains substantial.

The fact that GPT-4 is very efficient at generating ideas does not require a formal research study. Two hundred ideas can be generated by one human interacting with GPT-4 in about 15 minutes. A human working alone can generate about five ideas in 15 minutes, and humans working in groups do even worse (Girotra et al. 2010). In short, the productivity race between humans and GPT-4 is not even close. However, as we show in this article, the enormous potential of LLMs in ideation results not only from their ability to quickly and inexpensively generate ideas, but also from the remarkable quality of their output.

Importantly, hundreds of high-quality ideas can be produced at a fraction of the cost it would take humans. This previously unimaginable productivity in generating ideas may substantially reduce the importance of the idea-generation phase of innovation and shift managerial focus to the idea-evaluation phase. Can an LLM also take on the task of idea evaluation? From our viewpoint, this is a fascinating question for future research.
References

Bellaiche L, Shahi R, Turpin MH, Ragnhildstveit A, Sprockett S, Barr N, Christensen A, Seli P (2023) Humans versus AI: whether and why we prefer human-created compared to AI-created artwork. Cogn. Research 8(1):42.

Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, et al. (2020) Language Models are Few-Shot Learners. (July 22) http://arxiv.org/abs/2005.14165.

Chao RO, Kavadias S (2008) A Theoretical Framework for Managing the New Product Development Portfolio: When and How to Use Strategic Buckets. Management Science 54(5):907–921.

Cochran WG (1978) Laplace's Ratio Estimator. David HA, ed. Contributions to Survey Sampling and Applied Statistics (Academic Press), 3–10.

Connolly T, Jessup LM, Valacich JS (1990) Effects of Anonymity and Evaluative Tone on Idea Generation in Computer-Mediated Groups. Management Science 36(6):689–703.

Cox SR, Wang Y, Abdul A, Von Der Weth C, Lim BY (2021) Directed Diversity: Leveraging Language Embedding Distances for Collective Creativity in Crowd Ideation. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (ACM, Yokohama, Japan), 1–35.

Dahan E, Mendelson H (2001) An Extreme-Value Model of Concept Testing. Management Science 47(1):102–116.

Dell'Acqua F, McFowland III E, Mollick ER, Lifshitz-Assaf H, Kellogg K, Rajendran S, Krayer L, Candelon F, Lakhani KR (2023) Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. (September 15) https://papers.ssrn.com/abstract=4573321.

Doshi AR, Hauser OP (2024) Generative AI enhances individual creativity but reduces the collective diversity of novel content. Sci. Adv. 10(28):eadn5290.

Girotra K, Terwiesch C, Ulrich KT (2010) Idea Generation and the Quality of the Best Idea. Management Science 56(4):591–605.

Goldenberg J, Mazursky D, Solomon S (1999) Creative Sparks. Science 285(5433):1495–1496.

Guilford JP (1967) Creativity: Yesterday, Today and Tomorrow. Journal of Creative Behavior 1(1):3–14.

Haase J, Hanel PHP (2023) Artificial muses: Generative artificial intelligence chatbots have risen to human-level creativity. Journal of Creativity 33(3):100066.

Hitsuwari J, Ueda Y, Yun W, Nomura M (2023) Does human-AI collaboration lead to more creative art? Aesthetic evaluation of human-made and AI-generated haiku poetry. Computers in Human Behavior 139:107502.

Hubert KF, Awa KN, Zabelina DL (2024) The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks. Sci Rep 14(1):3440.

Huchzermeier A, Loch CH (2001) Project Management Under Risk: Using the Real Options Approach to Evaluate Flexibility in R&D. Management Science 47(1):85–101.

Jamieson LF, Bass FM (1989) Adjusting Stated Intention Measures to Predict Trial Purchase of New Products: A Comparison of Models and Methods. Journal of Marketing Research 26(3):336–345.

Jia N, Luo X, Fang Z, Liao C (2024) When and How Artificial Intelligence Augments Employee Creativity. AMJ 67(1):5–32.

Kobak D, González-Márquez R, Horvát EÁ, Lause J (2024) Delving into ChatGPT usage in academic writing through excess vocabulary. (July 3) http://arxiv.org/abs/2406.07016.

Koenker R, Hallock KF (2001) Quantile Regression. Journal of Economic Perspectives 15(4):143–156.

Koivisto M, Grassini S (2023) Best humans still outperform artificial intelligence in a creative divergent thinking task. Sci Rep 13(1):13601.

Kornish LJ, Ulrich KT (2011) Opportunity Spaces in Innovation: Empirical Analysis of Large Samples of Ideas. Management Science 57(1):107–128.

Kornish LJ, Ulrich KT (2014) The Importance of the Raw Idea in Innovation: Testing the Sow's Ear Hypothesis. Journal of Marketing Research 51(1):14–26.

Lincoln FC (1930) Calculating waterfowl abundance on the basis of banding returns (U.S. Dept. of Agriculture, Washington, D.C.).

Liu H, Mihm J, Sosa ME (2018) Where Do Stars Come From? The Role of Star vs. Nonstar Collaborators in Creative Settings. Organization Science 29(6):1149–1169.

Loch CH, Terwiesch C, Thomke S (2001) Parallel and Sequential Testing of Design Alternatives. Management Science 47(5):663–678.

March JG (1991) Exploration and Exploitation in Organizational Learning. Organization Science 2(1):71–87.

Meincke L, Carton A (2024) Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations. (May 26) https://papers.ssrn.com/abstract=4873537.

Mihm J, Schlapp J (2019) Sourcing Innovation: On Feedback in Contests. Management Science 65(2):559–576.

Mitchell E, Lee Y, Khazatsky A, Manning CD, Finn C (2023) DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. (July 23) http://arxiv.org/abs/2301.11305.

OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. (2024) GPT-4 Technical Report. (March 4) http://arxiv.org/abs/2303.08774.

Osborn AF (1953) Applied imagination (Scribner's, Oxford, England).

Rashidi HH, Fennell BD, Albahra S, Hu B, Gorbett T (2023) The ChatGPT conundrum: Human-generated scientific manuscripts misidentified as AI creations by AI text detection tool. Journal of Pathology Informatics 14:100342.

Shank DB, Stefanik C, Stuhlsatz C, Kacirek K, Belfi AM (2023) AI composer bias: Listeners like music less when they think it was composed by an AI. J Exp Psychol Appl 29(3):676–692.

Shibayama S, Yin D, Matsumoto K (2021) Measuring novelty in science with word embedding. Muscio A, ed. PLoS ONE 16(7):e0254034.

Si H, Kavadias S, Loch CH (2022) Managing Innovation Portfolios: From Project Selection to Portfolio Design. (March 6) https://papers.ssrn.com/abstract=4050940.

Siangliulue P, Chan J, Dow SP, Gajos KZ (2016) IdeaHound: Improving Large-scale Collaborative Ideation with Crowd-Powered Real-time Semantic Modeling. Proceedings of the 29th Annual Symposium on User Interface Software and Technology (ACM, Tokyo, Japan), 609–624.

Sommer SC, Loch CH (2004) Selectionism and Learning in Projects with Complexity and Unforeseeable Uncertainty. Management Science 50(10):1334–1347.

Sutton RI, Hargadon A (1996) Brainstorming Groups in Context: Effectiveness in a Product Design Firm. Administrative Science Quarterly 41(4):685.

Terwiesch C (2023) Let's cast a critical eye over business ideas from ChatGPT. Financial Times (March 12) https://www.ft.com/content/591ad272-6419-4f2c-9935-caff1d670f08.

Terwiesch C, Ulrich K (2023) The innovation tournament handbook: a step-by-step guide to finding exceptional solutions to any challenge (Wharton School Press, Philadelphia, PA).

Terwiesch C, Ulrich KT (2009) Innovation tournaments: creating and selecting exceptional opportunities (Harvard Business Press, Boston, MA).

Terwiesch C, Xu Y (2008) Innovation Contests, Open Innovation, and Multiagent Problem Solving. Management Science 54(9):1529–1543.

Torrance EP (1968) A Longitudinal Examination of the Fourth Grade Slump in Creativity. Gifted Child Quarterly 12(4):195–199.

Ulrich K, Eppinger S (2007) Product Design and Development (McGraw-Hill Education).

Venkatraman S, Uchendu A, Lee D (2024) GPT-who: An Information Density-based Machine-Generated Text Detector. (April 3) http://arxiv.org/abs/2310.06202.

Wang H, Zou J, Mozer M, Goyal A, Lamb A, Zhang L, Su WJ, et al. (2024) Can AI Be as Creative as Humans? (January 25) http://arxiv.org/abs/2401.01623.

Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E, Le Q, Zhou D (2023) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. (January 10) http://arxiv.org/abs/2201.11903.

Weitzman ML (1979) Optimal Search for the Best Alternative. Econometrica 47(3):641.

Zhou E, Lee D (2024) Generative artificial intelligence, human creativity, and art. Harding M, ed. PNAS Nexus 3(3):pgae052.

Zlatkov D, Ens J, Pasquier P (2023) Searching for Human Bias Against AI-Composed Music. Artificial Intelligence in Music, Sound, Art and Design: 12th International Conference, EvoMUSART 2023, Held as Part of EvoStar 2023, Brno, Czech Republic, April 12–14, 2023, Proceedings (Springer-Verlag, Berlin, Heidelberg), 308–323.
Appendix A. Quantile Regression Results

The regression model considered quantiles 0.1 to 0.9 in steps of 0.1. For each quantile, we estimated MeanRating ~ SourceAI, where SourceAI is a dummy variable indicating whether the idea source was a student (SourceAI = 0) or GPT-4 (SourceAI = 1). A positive coefficient for SourceAI indicates that ideas by GPT-4 performed better than human ideas; negative values indicate the opposite.
Table A.1. Quantile Regression Results for Quantiles 0.1 to 0.9

| Quantile | Intercept | SourceAI | Conf. Int. Low | Conf. Int. High |
|---|---|---|---|---|
| 0.1 | | 0.3 | 0.128399 | 0.471601 |
| 0.2 | 1.230768 | 0.290958 | 0.152504 | 0.429411 |
| 0.3 | 1.388859 | 0.277678 | 0.149304 | 0.406051 |
| 0.4 | 1.549946 | 0.262418 | 0.139183 | 0.385653 |
| 0.5 | 1.666666 | 0.227943 | 0.10915 | 0.346736 |
| 0.6 | 1.789528 | 0.210472 | 0.095888 | 0.325057 |
| 0.7 | 1.882355 | 0.260502 | 0.137852 | 0.383152 |
| 0.8 | 1.954555 | 0.445445 | 0.328038 | 0.562851 |
| 0.9 | 2.181822 | 0.318235 | 0.190531 | 0.445939 |

Notes. (p < 0.1).
Appendix B. Supplementary Regression Tables

Purchase Intent

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 0.40 | 0.38 – 0.43 | <0.001 |
| Source [Zero-Shot] | 0.06 | 0.03 – 0.09 | <0.001 |
| Source [Few-Shot] | 0.09 | 0.06 – 0.12 | <0.001 |

Random Effects

| σ² | 0.07 |
|---|---|
| τ00 IdeaID | 0.01 |
| τ00 RespondentID | 0.02 |
| τ11 IdeaID.SourceZero-Shot | 0.01 |
| τ11 IdeaID.SourceFew-Shot | 0.03 |
| τ11 RespondentID.SourceZero-Shot | 0.01 |
| τ11 RespondentID.SourceFew-Shot | 0.01 |
| ρ01 | -0.64, -0.97, -0.06, -0.28 |
| ICC | 0.28 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.014 / 0.290 |
Purchase Intent: Alternative Weights

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 0.08 | 0.07 – 0.08 | <0.001 |
| Source [Zero-Shot] | 0.02 | 0.01 – 0.03 | <0.001 |
| Source [Few-Shot] | 0.03 | 0.02 – 0.04 | <0.001 |

Random Effects

| σ² | 0.01 |
|---|---|
| τ00 IdeaID | 0.00 |
| τ00 RespondentID | 0.00 |
| τ11 IdeaID.SourceZero-Shot | 0.00 |
| τ11 IdeaID.SourceFew-Shot | 0.00 |
| τ11 RespondentID.SourceZero-Shot | 0.00 |
| τ11 RespondentID.SourceFew-Shot | 0.00 |
| ρ01 | 0.04, 0.48, -0.00, -0.16 |
| ICC | 0.21 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.009 / 0.215 |
Purchase Intent (Simple)

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 1.62 | 1.55 – 1.68 | <0.001 |
| Source [Zero-Shot] | 0.26 | 0.15 – 0.37 | <0.001 |
| Source [Few-Shot] | 0.36 | 0.25 – 0.47 | <0.001 |

| Observations | |
|---|---|
| R² / R² adjusted | 0.108 / 0.104 |
Purchase Intent (no weights, zero-shot baseline)

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 1.85 | 1.73 – 1.98 | <0.001 |
| Source [Student] | -0.24 | -0.35 – -0.12 | <0.001 |
| Source [Few-Shot] | 0.12 | -0.00 – 0.24 | 0.058 |

Random Effects

| σ² | 1.18 |
|---|---|
| τ00 IdeaID | 0.12 |
| τ00 RespondentID | 0.42 |
| τ11 IdeaID.SourceStudent | 0.33 |
| τ11 IdeaID.SourceFew-Shot | 0.12 |
| τ11 RespondentID.SourceStudent | 0.13 |
| τ11 RespondentID.SourceFew-Shot | 0.01 |
| ρ01 | -0.77, -0.36, -0.50, -0.99 |
| ICC | 0.31 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.014 / 0.322 |
Purchase Intent (ordered logistic regression)

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| 0\|1 | 0.25 | 0.21 – 0.29 | <0.001 |
| 1\|2 | 1.06 | 0.90 – 1.26 | 0.484 |
| 2\|3 | 3.01 | 2.53 – 3.57 | <0.001 |
| 3\|4 | 19.07 | 15.84 – 22.97 | <0.001 |
| Source [Zero-Shot] | 1.48 | 1.24 – 1.78 | <0.001 |
| Source [Few-Shot] | 1.79 | 1.49 – 2.14 | <0.001 |

Random Effects

| σ² | 3.29 |
|---|---|
| τ00 IdeaID | 0.39 |
| τ00 RespondentID | 0.92 |
| ICC | 0.28 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.014 / 0.294 |
Novelty

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 0.41 | 0.39 – 0.43 | <0.001 |
| Source [Zero-Shot] | -0.04 | -0.07 – -0.01 | 0.008 |
| Source [Few-Shot] | -0.05 | -0.08 – -0.02 | <0.001 |

Random Effects

| σ² | 0.05 |
|---|---|
| τ00 IdeaID | 0.01 |
| τ00 RespondentID | 0.01 |
| τ11 IdeaID.SourceZero-Shot | 0.02 |
| τ11 IdeaID.SourceFew-Shot | 0.03 |
| τ11 RespondentID.SourceZero-Shot | 0.01 |
| τ11 RespondentID.SourceFew-Shot | 0.01 |
| ρ01 | -0.87, -0.99, 0.14, 0.06 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.009 / NA |
Novelty (Simple)

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 1.64 | 1.58 – 1.70 | <0.001 |
| Source [Zero-Shot] | -0.18 | -0.29 – -0.07 | <0.001 |
| Source [Few-Shot] | -0.20 | -0.31 – -0.09 | <0.001 |

| Observations | |
|---|---|
| R² / R² adjusted | 0.042 / 0.037 |
Novelty (no weights, zero-shot baseline)

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 1.48 | 1.37 – 1.59 | <0.001 |
| Source [Student] | 0.15 | 0.05 – 0.26 | 0.004 |
| Source [Few-Shot] | -0.04 | -0.16 – 0.07 | 0.493 |

Random Effects

| σ² | 0.90 |
|---|---|
| τ00 IdeaID | 0.11 |
| τ00 RespondentID | 0.35 |
| τ11 IdeaID.SourceStudent | 0.33 |
| τ11 IdeaID.SourceFew-Shot | 0.44 |
| τ11 RespondentID.SourceStudent | 0.03 |
| τ11 RespondentID.SourceFew-Shot | 0.04 |
| ρ01 | -0.71, -0.98, -1.00, -0.23 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.009 / NA |
Novelty (zero-shot baseline)

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 0.37 | 0.34 – 0.40 | <0.001 |
| Source [Student] | 0.04 | 0.01 – 0.07 | 0.008 |
| Source [Few-Shot] | -0.01 | -0.04 – 0.02 | 0.449 |

Random Effects

| σ² | 0.05 |
|---|---|
| τ00 IdeaID | 0.01 |
| τ00 RespondentID | 0.02 |
| τ11 IdeaID.SourceStudent | 0.02 |
| τ11 IdeaID.SourceFew-Shot | 0.03 |
| τ11 RespondentID.SourceStudent | 0.01 |
| τ11 RespondentID.SourceFew-Shot | 0.00 |
| ρ01 | -0.74, -1.00, -0.69, -0.95 |
| ICC | 0.35 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.006 / 0.356 |
Novelty (ordered logistic regression)

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| 0\|1 | 0.16 | 0.13 – 0.19 | <0.001 |
| 1\|2 | 0.87 | 0.72 – 1.04 | 0.118 |
| 2\|3 | 4.51 | 3.76 – 5.43 | <0.001 |
| 3\|4 | 29.60 | 24.09 – 36.37 | <0.001 |
| Source [Zero-Shot] | 0.74 | 0.60 – 0.91 | 0.004 |
| Source [Few-Shot] | 0.68 | 0.55 – 0.84 | <0.001 |

Random Effects

| σ² | 3.29 |
|---|---|
| τ00 IdeaID | 0.57 |
| τ00 RespondentID | 0.91 |
| ICC | 0.31 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.006 / 0.315 |