Using Large Language Models for Idea
Generation in Innovation
Lennart Meincke
Operations, Information and Decisions, The Wharton School, University of Pennsylvania, lennart@wharton.upenn.edu
Karan Girotra
Cornell Tech and Johnson College of Business, Cornell University, girotra@cornell.edu
Gideon Nave
Marketing, The Wharton School, University of Pennsylvania, gnave@wharton.upenn.edu
Christian Terwiesch
Operations, Information and Decisions, The Wharton School, University of Pennsylvania, terwiesch@wharton.upenn.edu
Karl T. Ulrich
Operations, Information and Decisions, The Wharton School, University of Pennsylvania, ulrich@wharton.upenn.edu
This research evaluates the efficacy of large language models (LLMs) in generating new product ideas. To do so, we compare three
pools of ideas for new products targeted toward college students priced at USD 50 or less. The first pool of ideas was created by
university students in a product design course before the availability of LLMs. The second and third pools of ideas were generated
by OpenAI's GPT-4 using zero-shot and few-shot prompting, respectively. We evaluated idea quality using standard market
research techniques to predict average purchase intent probability. We used text mining to assess idea similarity and human raters
to evaluate idea novelty. We find that AI-generated ideas outperform human-generated ideas in terms of average purchase intent,
with few-shot prompting yielding slightly higher intent than zero-shot prompting. However, AI-generated ideas are perceived as
less novel and exhibit higher pairwise similarity, particularly with few-shot prompting, indicating a less diverse solution landscape.
When focusing on the quality of the best ideas (rather than on the average ideas), we find that AI-generated ideas are seven times
more likely to rank among the top 10% of ideas, demonstrating a significant advantage over human-generated ideas. We propose
that this 7:1 advantage is a conservative estimate, as it does not account for AI's greater productivity. Our findings suggest that
despite some drawbacks, AI creativity presents a substantial benefit in generating high-quality ideas for new product development.
Funding: Funding was provided by the Mack Institute for Innovation Management at the Wharton School of the University of
Pennsylvania.
Key words: innovation; idea generation; creativity; creative problem solving; LLM; large-scale language models; AI; artificial
intelligence; ChatGPT; GPT
1. Introduction
Generative artificial intelligence (GenAI) has remarkably advanced in creating life-like images and
coherent, fluent text. OpenAI's ChatGPT chatbot, based on the Generative Pre-trained Transformer (GPT)
series of large language models (LLMs), can equal or surpass human performance in academic examinations
and tests for professional certifications (OpenAI et al. 2023). Moreover, LLMs can provide valuable
professional advice in fields like software development, medicine, and law.
Despite their remarkable performance, LLMs sometimes produce text that is semantically or
syntactically plausible but is, in fact, factually incorrect or nonsensical, a phenomenon often referred to as
hallucinations. This outcome is a byproduct of how LLMs are designed, as they are optimized to generate
the most statistically likely sequences of words with an intentional injection of randomness. In most
applications, this randomness and the associated hallucinations and inconsistencies create problems that
limit the use of LLM-based solutions to low-stakes settings, or they require extensive human supervision.
But are there applications in which we can leverage the weaknesses of hallucinations and inconsistent
quality and turn them into a strength? We propose that the domain of creativity and innovation provides
such an application. This domain operates quite differently than most management settings, where we
commonly expect to use each unit of work produced. As such, consistency is prized and is, therefore, the
focus of contemporary performance management. Erratic and inconsistent behavior is to be eliminated. An
airline would rather hire a pilot who executes a within-safety-margins landing 10 out of 10 times than
one that makes a brilliant approach five times and an unsafe approach another five. But, when it comes to
creativity and innovation, say finding a new opportunity to improve the air travel experience or launching
a new aviation venture, the same airline would prefer an ideator that generates one brilliant idea and nine
nonsense ideas over one that generates ten decent ideas.
The reason for this difference is that when it comes to creativity and innovation, the performance of the
process is not determined by the sum or the average of all ideas created. Instead, each idea is seen as a real
option that the decision maker can decide to execute (Huchzermeier and Loch 2001). Thus, the performance
of the process is determined by the quality of the best idea(s) (Dahan and Mendelson 2001, Terwiesch and
Xu 2008, Terwiesch and Ulrich 2009, Girotra et al. 2010). The process of innovation thereby can be thought
of as a search process that generates ideas with random quality values by drawing from an underlying
stochastic distribution until the cost of creating one additional draw from the distribution (e.g., creating one
more product concept or building one more prototype) exceeds the marginal benefit (Weitzman 1979).
Prior research in product development and innovation has modeled various aspects of this search
process, including the pros and cons of parallel search (Loch et al. 2001), the tension between sampling
from very different regions of the pay-off distributions ("selectionism") versus locally improving a given
project (Sommer and Loch 2004), and the need for building balanced portfolios that consist of different
types of projects (Chao and Kavadias 2008).
We follow this line of research and consider a setting in which ideas of unknown quality are created,
and the quality of the best few ideas determines the overall performance. This could be a setting of corporate
portfolio planning in a large established organization as described by Si et al. (2022). However, to facilitate
our experimental design, we focus on idea generation in the product development process for a newly
formed venture. Specifically, we look for a product idea that targets the college student market and can be
sold for USD 50 or less. This innovation challenge is similar to the study settings used in prior work (e.g.,
Osborn 1953, Connolly et al. 1990, Sutton and Hargadon 1996, Girotra et al. 2010) to evaluate and compare
various brainstorming methods (e.g., group vs. individual; nominal groups vs. hybrid groups).
In contrast to this prior work, we consider ideas generated by humans and ideas generated by artificial
intelligence (AI) in the form of OpenAI's GPT-4. As discussed above, LLMs are designed to generate new
content, and in the domain of brainstorming, their stochastic (if not outright erratic) behavior might turn a
bug into a feature. Thus, we hypothesize that LLMs have the potential to be excellent ideators. The purpose
of this paper, therefore, is to formally test this hypothesis by comparing the performance of LLMs in
generating new ideas to that of human idea generators.
Specifically, we compare three pools of ideas for new products targeted toward college students at a
price of USD 50 or less. The first pool of ideas was created by students at an elite university enrolled in a
course on product design before the availability of LLMs. The second pool was generated by OpenAI's
GPT-4 with the same prompt as that given to the students and no other guidance (zero-shot prompting).
The third pool was generated by prompting GPT-4 with the same prompt as that given to the students and
a sample of highly rated ideas to enable some in-context learning (few-shot prompting). We evaluate the
quality of the ideas using standard market research techniques and survey human respondents to predict an
average purchase intent probability for each product, which we use as our measure of idea quality. We use
text mining techniques to evaluate the similarities of ideas and rely on human raters to assess idea novelty.
This comparison between human idea generation and AI-based idea generation allows us to contribute
to the innovation literature by establishing the following novel results.
First, AI-generated ideas are, on average, significantly better (average purchase intent of 0.48 relative
to 0.40 for human-generated ideas), especially in the case of few-shot prompting (average purchase intent
of 0.49 relative to 0.46 for zero-shot prompting), as shown in Study 1.
Second, despite this success, consumers perceive AI-generated ideas as less novel (perceived novelty
of 0.36 relative to 0.41). Moreover, AI-generated ideas are more likely to overlap: text mining reveals that
the average pairwise similarity of ideas is higher among AI-generated ideas and further increases when
using few-shot prompting. As a result, the underlying solution landscape is less likely to be fully explored
(Study 2).
Finally, we show that for a given number of ideas, the quality of the best ideas generated by AI is
significantly greater than that of the best ideas generated by humans (Study 3). Specifically, we show that
AI-generated ideas are seven times more likely to be among the top 10% of ideas generated in our
experiment. This is significant given the context. What matters for innovation is the quality of the best
idea. The objective of idea generation is to generate at least a few truly exceptionally great ideas. In most
innovation settings, we would rather have 10 great ideas and 90 terrible ideas than 100 ideas of solid
quality. Holding the number of ideas constant, we need to trade off the advantageous effect of higher
average idea quality (Study 1) with the disadvantages of less novelty, more overlapping ideas, and fewer
ideas that can be discovered (Study 2). Study 3 clearly establishes AI s supremacy over humans in this
respect.
A quarter of a century ago, Goldenberg et al. (1999) asked the question "Can AI-generated ideas finally
compete with human ones, long after researchers first considered the possibility?" We believe that the three
studies presented in this article provide empirical support for an affirmative answer to this question. From
a practical perspective, we see the 7:1 advantage of AI creativity over human creativity as a conservative
estimate, as we did not credit AI for its substantially greater productivity.
The remainder of the article is organized as follows. After reviewing some recent work on GenAI and
creativity (Section 2), we introduce our theoretical framework and our hypotheses (Section 3), followed by
the technical set-up of our experiments (Section 4). We conducted three studies to assess the creativity of
human- and AI-generated ideas. First, in Study 1 we ask human participants to rate ideas from both sources
(human- and AI-generated) and compare the results (Section 5). Second, in Study 2 we use text-based
analysis to calculate how many unique ideas can be created by humans and LLMs in our specific domain
as well as ask human participants to rate the novelty of ideas from both sources and compare the results
(Section 6). Third, in Study 3 we look at the extreme distributions of idea quality to identify possible
advantages for the best ideas by either humans or AI (Section 7). We conclude the paper by discussing
potential limitations of our studies, their robustness to alternative specifications (Section 8), and the
implications of our findings (Section 9).
2. GenAI applications to creative tasks
Research to date has demonstrated three key findings regarding AI's role in creativity and innovation. First,
AI frequently matches or exceeds human performance in creative tasks. Haase and Hanel (2023) found that
LLMs have reached human-level performance in divergent thinking tasks such as the Alternative Uses Task
(AUT). This is supported by Hubert et al. (2024), who studied GPT-4 responses for the Consequences Task
and Divergent Association Tasks, finding that AI is more creative than humans across all measured dimensions.
While Koivisto and Grassini (2023) find that AI chatbots outperform average human performance in the
AUT, they also note that the most exceptional human ideas still match or exceed those generated by AI.
Second, studies show that AI aids in improving creative outcomes for humans when using it as a tool.
Doshi and Hauser (2024) find that AI use helps humans to create more creative and enjoyable short stories.
However, the collective diversity decreases and stories become more similar to one another. Similarly, Jia
et al. (2024) found that AI assistance boosted employee creativity in a telemarketing company when
responding to customer questions, ultimately increasing sales. Zhou and Lee (2024) show that integrating
text-to-image AI into creative workflows increased the number of artworks created by 25% and raised the
likelihood of the works receiving favorites per view by 50%, highlighting the benefits of generative AI
augmenting human workflows ("human in the loop").
Third, studies have explored human preferences for AI-generated versus human-generated creations,
often finding that people prefer human involvement. For instance, Hitsuwari et al. (2023) found that survey
participants cannot distinguish between AI-generated and human-generated haikus, but rated poems co-
created by humans and AI as the most beautiful, with no significant preference for haikus created solely by
humans or AI. Bellaiche et al. (2023) provide evidence that humans prefer human involvement in art
creation by showing that participants prefer AI-generated art falsely labeled as created by humans to the
same art correctly labeled as AI-generated, suggesting a bias for human involvement in the creative process.
Similarly, Shank et al. (2023) find comparable results for AI-generated classical music, although no such
preference was found for electronic music. However, Zlatkov et al. (2023) found no significant preference
for either AI or human-generated music overall.
Taken together, this body of research illustrates the potency of AI in creative tasks. AI not only matches
human creativity but also improves human performance when used as a collaborative tool. However, at
least when considering artistic outcomes, there remains a human preference for creativity that involves
human touch. This growing evidence suggests a natural next step: evaluating AI's efficacy in innovation
management in general and in idea generation in particular, where artistic preferences are less important,
while carefully examining potential issues such as less diverse ideas.
3. Theoretical Framework and Hypotheses
To understand GenAI's ability to tackle various creative tasks, we must first conceptualize creativity. The
literature distinguishes between three dimensions of creativity. Fluency is the ability to generate many ideas
or solutions to a problem. It reflects the quantity of generated ideas. Flexibility is the capacity to produce a
variety of ideas or solutions, showing an ability to shift approaches or perspectives. And, originality is the
ability to produce novel and unique ideas (Guilford 1967, Torrance 1968). In addition, the brainstorming
literature often considers idea quality as a fourth dimension of creativity. We omit fluency as a performance
metric, as comparing the number of ideas or the speed of idea generation between a computer and a human
will lead to the obvious result that the computer displays greater fluency, creating more ideas per unit of
time. This leaves us with idea quality, flexibility, and originality as the dimensions of comparison between
humans and AI.
The atomic unit of analysis in this comparison is an idea. In the context of innovation, we define an idea
as a novel match between a solution and a need. As mentioned above, across three studies we will ask
students as well as GenAI to come up with new product ideas targeted toward college students that can be
sold for USD 50 or less. To illustrate our unit of analysis, consider one of the student-generated ideas:
Convertible High-Heel Shoe: Many prefer high-heel shoes for dress-up occasions, yet walking in high heels
for more than short distances is very challenging. Might we create a stylish high-heel shoe that easily
adapts to a comfortable walking configuration, say by folding down or removing a heel portion of the shoe?
In this example, the need is the desire of some people to dress up and wear high-heeled shoes for some
occasions while still walking comfortably. The proposed solution is to make the heel portion of the shoe so
that it can be folded down or removed.
Idea generation, by either individuals or groups, is a process that creates a stream of ideas with varying
quality levels. This stream can be the result of either human effort or the use of AI. Each of these ideas can
be validated on a quality scale. Our quality scale is based on a purchase intent study. Kornish and Ulrich
(2014) show that the best indicator of future value creation is the average purchase intent expressed by a
sample of consumers in the target market. Furthermore, they show that no single individual, expert or
novice, is particularly good at estimating value. Instead, a sample of expressed purchase intent from about
15 individuals in the target market is a reliable measure of idea quality.
Some ideas are likely to be brilliant (high-quality), some are horrible (low-quality), and most will be
somewhere in between (medium-quality). We can think of this uncertain quality value as a random variable
drawn from an underlying pay-off distribution (Weitzman 1979, Dahan and Mendelson 2001).
Recall that we chose to measure three dimensions of creativity associated with idea generation: quality,
flexibility, and originality. Our first hypothesis relates to the first dimension: AI's ability to generate ideas
comparable in their average quality to human-generated ideas. In other words, we focus on the mean of the
underlying idea-quality distribution. We make two arguments for why GPT-4 would create ideas of higher
average quality than humans. First, the training data for GPT-4 includes millions of product reviews
revealing unmet user needs, social media posts of excited and frustrated customers alike, and marketing
materials for countless products that have been launched more or less successfully in the past. Second, the
literature reviewed in Section 2 has established that GPT-4 has tremendous creative capabilities in other
domains such as music generation or story writing.
Hypothesis 1 (Idea quality): The average quality of AI-generated ideas is higher than the average quality
of human-generated ideas.
Our second hypothesis relates to the second two dimensions: flexibility and originality. We first define
these concepts in the context of generating ideas for new products and come up with appropriate
measurement scales.
There exists a vast number of possible new product ideas that differ along many dimensions. We can
think of ideas as positions in a high-dimensional space. OpenAI's GPT-4 models text as multi-
dimensional embedding vectors in this space, where each dimension may represent a distinct attribute or
feature of the text. Such vectors have hundreds of dimensions. Similar texts will often lie close to each other
while different ones will be far apart. However, interpreting the distances and dimensions is often not
straightforward given the high dimensionality.
To illustrate, consider a two-dimensional search space like the map of a territory. For example, consider
the exploration of such a territory in the search for fishing spots in the ocean. The (x, y) coordinates capture
the geographic locations of schools of fish. Each location has a pay-off corresponding to the amount of fish
in the water. The goal of the fisherman is to find the location with the greatest fish density. In such a search
process, local adjustments along a gradient of increasing fish density in the water via local search may
increase the value of a fishing location. Yet, in rugged solution landscapes, i.e., ones that have multiple
local optima, such local search is unlikely to yield the globally optimal solution.
Thus, the ruggedness of the underlying solution landscape makes it impossible to arrive at the most
valuable fishing location (idea) in the ocean (idea space) via local adjustments. Rather, a broad exploration
is needed (see Sommer and Loch 2004). Without prior knowledge about the landscape, some new locations
that are very different from past locations should be explored. This creates the classic trade-off between
exploration and exploitation (March 1991).
With this as our backdrop, we provide two ways of operationalizing flexibility, overlap and the total
number of discoverable ideas, and one way to operationalize originality, idea novelty. All three are
important properties of a search process in general and of an ideation process in particular.
To explain overlap, let's return to our fishing example. To explore fishing locations in an ocean, the
locations should be distinctively different from each other. Even in a rugged solution landscape, some
spatial correlations in pay-offs between two adjacent coordinates are likely. In much the same way, in the
world of innovation, we want our ideas to be distinct from each other. To determine how distinctly different
an idea is relative to other ideas, we measure the cosine similarity of its embedding vector relative to the
embedding vectors of the other ideas (following Cox et al. 2021 and Dell'Acqua et al. 2023). Section 8
provides alternative measures to this analytical choice. For a given pool of ideas produced by an idea-
generation process, human or AI, we can thus randomly pull out two ideas and compute the angle between
two associated embedding vectors. The cosine of such an angle ranges from -1 to 1, with 1 indicating
identical directions and 0 indicating orthogonal vectors (no similarity). While negative values are possible in principle,
they rarely occur in practice, as further discussed in Study 2. By performing a pairwise comparison of all
ideas and averaging their similarities, we can compute the average pool similarity. Next, we define two
ideas as overlapping if their cosine similarity is above θ = 0.8. That is, we count any new idea added to
the pool as overlapping if its cosine similarity exceeds 0.8 compared to any of the existing ideas in the pool.
Our first measure of flexibility is based on computing the distribution of pairwise cosine similarities and
counting the frequency of overlaps. We discuss this and other assumptions in Section 8 and provide
extensive robustness analyses including evaluating alternative model specifications.
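As a concrete illustration, the pairwise-similarity computation described above can be sketched in a few lines of Python. This is a minimal sketch that assumes idea embeddings are already available as numeric vectors (e.g., from an embeddings API); the θ = 0.8 threshold matches the definition in the text, and the function names are ours, not the authors'.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pool_similarity_stats(embeddings, theta=0.8):
    """Average pairwise cosine similarity of an idea pool, plus the number
    of pairs exceeding the overlap threshold theta (0.8 in the text)."""
    sims = [cosine_similarity(embeddings[i], embeddings[j])
            for i in range(len(embeddings))
            for j in range(i + 1, len(embeddings))]
    avg_similarity = sum(sims) / len(sims)
    n_overlapping_pairs = sum(1 for s in sims if s > theta)
    return avg_similarity, n_overlapping_pairs
```

Averaging over all pairs yields the average pool similarity; counting pairs above θ gives the overlap frequency.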
Next, imagine a fisherman with no memory looking for fish at random locations. Every period, this
fisherman sets out and fishes, yielding an estimate for the payoff of a specific location. How many unique
fishing locations will be discovered this way? Early in the exploratory efforts, every fishing spot is an
unexplored territory. Yet, as this process goes on, the likelihood of overlap increases, i.e., the fisherman is
more likely to revisit a location previously tested. Given our definition of overlapping ideas (cosine
similarity exceeding the θ = 0.8 threshold), we can observe a stream of incoming ideas, one by one, and
determine whether a new idea is unique relative to the pool of ideas created up to this point. Early on, just
like in the fisherman's case, each idea is likely unique (non-overlapping with the ideas created so far).
However, as the process progresses, the percentage of overlapping ideas will increase as the underlying
search space gets exhausted. For a finite sequence of T ideas, we can evaluate the number of overlapping
ideas, Noverlap, and thus compute the number of unique ideas, Nunique = T - Noverlap. Definitions for how we
operationalize this approach are shown in Study 2.
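A minimal sketch of this sequential bookkeeping follows (again assuming each idea arrives as an embedding vector; the θ = 0.8 overlap definition and the Nunique = T - Noverlap identity follow the text):

```python
import numpy as np

def count_unique_ideas(embeddings, theta=0.8):
    """Walk a stream of T idea embeddings in order; an incoming idea counts
    as overlapping if its cosine similarity to any earlier idea exceeds
    theta. Returns Nunique = T - Noverlap."""
    seen, n_overlap = [], 0
    for e in embeddings:
        e = np.asarray(e, dtype=float)
        if any(float(np.dot(e, u) / (np.linalg.norm(e) * np.linalg.norm(u))) > theta
               for u in seen):
            n_overlap += 1
        seen.append(e)
    return len(seen) - n_overlap
```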
In addition to utilizing idea overlap for computing the number of unique ideas in a finite stream of ideas,
we can further estimate the total number of discoverable ideas in the search space, even if many were not
part of the sequence of T ideas, i.e., the ideas have not (yet) been discovered. To do so, we use what in
population ecology is known as a capture-recapture model, used to estimate the number of unique fishing
locations based on how frequently a previously visited location is revisited by a fisherman with no memory.
With such a model, we simply count the incidents of an idea overlapping with a past idea. The frequency
of overlap and its increased occurrence rate over time allows for estimating the number of ideas that can
be discovered (Kornish and Ulrich 2011). This provides us with our second measure of flexibility.
Next, consider originality. The search for ideas can yield ideas that are more or less novel. We measure
idea novelty in the same way we measure idea quality: by directly asking potential customers to assess its
novelty and averaging these values. In summary, we evaluate flexibility by looking at idea overlap (which
can be converted into an estimate for the number of ideas that can be discovered) and evaluate originality
by directly asking consumers to rate novelty.
How will a pool of AI-generated ideas compare to these human-generated ideas in terms of quality,
flexibility and originality? By their very design, GPTs are autoregressive processes. They don't plan ahead
but predict one word (or token) at a time based on a context window, including the prompt and the prior
words created. Such a "one word at a time" process is unlikely to systematically and exhaustively explore
an entire solution landscape. This lack of broad exploration will be further amplified in the presence of a
system prompt that illustrates the concept of ideas by providing one or multiple ideas from the past (few-
shot prompting) relative to the case in which no past ideas are provided (zero-shot prompting). This should
limit both the flexibility and the originality of the creative process. These arguments, taken together with
existing research in other domains showing less novelty for AI-generated content versus human-generated
content (Doshi and Hauser, 2024), lead to the following two hypotheses:
Hypothesis 2a (Flexibility): The likelihood of two ideas overlapping is higher for a pool of AI-generated
ideas than for a pool of human-generated ideas, resulting in fewer discoverable ideas.
Hypothesis 2b (Originality): The average novelty of AI-generated ideas is lower than that of human-
generated ideas.
Our third hypothesis returns to the concept of idea quality. This time, however, we are not concerned
about the average idea quality but instead focus on the quality of the best ideas. Rather than focusing on the
quality of the single best idea (the extreme value, Dahan and Mendelson 2001), we focus on the 90th
percentile of the idea-quality distribution, i.e., the top 10 percent of the ideas. We do so for two reasons.
The first reason is statistical estimation: for a single experiment like ours, there simply does not exist a
test that allows us to make statistically significant statements for a single data point. Moving to the 90th
percentile, we can compare the mean across larger groups of ideas (Section 8 presents our results for other
percentiles).
There also exists a second, managerial reason. In many, if not most, practical settings, the assessment of
idea quality is noisy, especially in the early stages of an innovation process when an idea is nothing but a
title and a few words. For this reason, innovation tournaments don't just advance a single idea to the next
round, but a set of the x percent most promising ideas, where x can vary widely but typically ranges
between 10 and 50 percent (Terwiesch and Ulrich 2009). We therefore state:
Hypothesis 3 (Top Decile): The quality of the 90th percentile AI-generated ideas is higher than that of the
90th percentile human-generated ideas.
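The top-decile metric behind Hypothesis 3 can be sketched as follows (a minimal illustration; the function name is ours, and the percentile-interpolation convention may differ from the authors' analysis):

```python
import numpy as np

def top_decile_mean(qualities):
    """Mean purchase-intent quality of the ideas at or above the pool's
    90th percentile (the 'best ideas' comparison of Hypothesis 3)."""
    q = np.asarray(qualities, dtype=float)
    cutoff = np.percentile(q, 90)  # 90th-percentile quality cutoff
    return float(q[q >= cutoff].mean())
```

Comparing this statistic across the human, zero-shot, and few-shot pools is what Study 3 does at the top decile, and Section 8 repeats for other percentiles.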
4. Experimental setup
For our experiment, we utilize three different pools of ideas, namely student-generated ideas, GPT-4-
generated ideas with zero-shot prompting and GPT-4-generated ideas with few-shot prompting. For the
student pool, we rely on data collected in 2021 in a product design and innovation course at an elite
university. In this course, 50 students participated in an innovation challenge to come up with ideas for a
physical product marketed to college students for USD 50 or less (this price cap is imposed to limit the
complexity of the projects in a one-semester course). The challenge was organized in a traditional
innovation tournament format (Terwiesch and Ulrich 2009, 2023), in which individuals first independently
generate many ideas, which are then combined into a pool of several hundred ideas and subsequently
evaluated by others in the group (i.e., crowdsourced evaluations). Thus, we have access to a large set of
ideas generated by humans before AI tools became widely available to enhance ideation.
Specifically, we use a pool of independently aggregated human ideas by randomly selecting 200 entries,
each comprising a descriptive title and a paragraph of text, from the student ideas generated in these
challenges in 2021 (i.e., at a time prior to the widespread availability of ChatGPT and other LLMs). The
set of 200 ideas constitutes our first pool and forms the baseline for comparison with the ideas generated
using LLMs. We prompt Open AI s GPT-4 (more specifically, gpt-4-0314) with the same prompt we gave
the students. No LLM yet acts entirely autonomously. Rather, they are tools used by humans to complete
tasks. For this study, we aim for minimal prompt engineering, thus representing a novice user scenario.
However, we acknowledge that many strategies could potentially improve LLM performance. For instance,
Mihm and Schlapp (2019) show that providing feedback during ideation contests can further improve
performance of human innovators, and we expect this to hold for LLMs as well.
For our first LLM-generated idea pool we use the system prompt to provide contextual information and
subsequent user prompts to ask for ideas, ten at a time. The user prompt includes the additional request that
the descriptions be 40-80 words, like the student sample.
System Prompt
You are a creative entrepreneur looking to generate new product ideas. The product will target college
students in the United States. It should be a physical good, not a service or software. I'd like a product that
could be sold at a retail price of less than about USD 50. The ideas are just ideas. The product need not
yet exist, nor may it necessarily be clearly feasible. Number all ideas and give them a name. The name and
idea are separated by a colon.
User Prompt
Please generate ten ideas as ten separate paragraphs. The idea should be expressed as a paragraph of
40-80 words.
The model used for all work covered in this paper is gpt-4-0314 with the temperature parameter at 0.7
to retain randomness and thus greater creativity. The temperature parameter controls the randomness of the
output, with lower values leading to more deterministic output and higher values increasing variability. At
the time of the experiment, the suggested default value for temperature was 0.7, striking a balance between
coherence and creativity without sampling highly unlikely tokens (i.e., the sub-word chunks used for
representational efficiency) that could lead to undesirable responses.
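Temperature itself is simple to illustrate: logits are divided by the temperature before the softmax, so lower values sharpen the next-token distribution and higher values flatten it. A minimal sketch of this standard mechanism (not GPT-4's internal code):

```python
import math

def softmax_with_temperature(logits, temperature=0.7):
    """Convert raw next-token logits to sampling probabilities. Dividing
    the logits by the temperature sharpens the distribution for T < 1 and
    flattens it for T > 1."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

At temperature 0 the most likely token would always win; at 0.7 the model retains some randomness, which is exactly what ideation benefits from.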
An obstacle to using GPT-4 for generating hundreds of ideas is its finite memory, limited by the number
of tokens the underlying LLM can consider in generating its responses. Once the number of tokens in a
session exceeds the model's limit, the LLM has no memory of the first ideas generated, and subsequent
ideas can become increasingly redundant. The context limit of the version of GPT-4 we had access to was
about 8,000 tokens, roughly 7,000 words or approximately 80 ideas (some tokens are used for the system
and user prompts and idea titles).
To generate more than 80 ideas despite the limited context window, we asked GPT-4 to compress the
previously generated ideas into shorter summaries. These summaries were then provided to the model
before generating the next batch of ideas, ensuring that the model knows the previously generated ideas
while remaining within the context limits. We used the summarization prompt below, followed by the
original system prompt and the generated summaries, and finally a user prompt that explicitly asks for
different ideas. This constitutes our second pool of comparison.
Summarization Prompt
Aggressively compress the following ideas so that their original meaning remains but they are much
shorter. You can use tags or keywords: [Ideas generated so far]
System Prompt
[Original System Prompt] Previously you generated the following ideas and should not repeat them:
[Summaries]
User Prompt
[Original User Prompt] Make sure they are different from the previous ideas.
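The batch-and-summarize procedure can be sketched as a simple loop. Here the API calls are stubbed out: generate_batch and summarize are hypothetical stand-ins for the prompts quoted above, not the actual implementation.

```python
# Sketch of the batch-and-summarize ideation loop; LLM calls are stubbed.
def run_ideation(n_batches, generate_batch, summarize):
    ideas, summaries = [], []
    for _ in range(n_batches):
        # Prior summaries are injected into the system prompt so the model
        # can avoid repeating ideas while staying within the context window.
        batch = generate_batch(previous_summaries=summaries)
        ideas.extend(batch)
        summaries.append(summarize(batch))
    return ideas

# Deterministic stubs for illustration only.
counter = iter(range(100))
fake_generate = lambda previous_summaries: [f"idea-{next(counter)}" for _ in range(10)]
fake_summarize = lambda batch: f"summary of {len(batch)} ideas"

all_ideas = run_ideation(n_batches=10, generate_batch=fake_generate, summarize=fake_summarize)
print(len(all_ideas))  # 100 ideas across 10 batches
```

In the actual study, generate_batch would issue the original system and user prompts and summarize would issue the summarization prompt.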
For our second pool of LLM-generated ideas, we provide the LLM with examples (few-shot learning)
of high-quality ideas generated by students. In particular, we appended our prompts with six highly rated
ideas from a separate group of students that completed the same exercise and informed GPT-4 that these
ideas had been well received by students in our class. We used six examples because of context window
limitations at the time of the experiment and because previous experiments on in-context few-shot learning
show that too many examples can degrade performance (see Meincke and Carton 2024). This constitutes
our third pool of comparison.
Good Ideas Prompt
[Original System Prompt] Here are some well-received ideas for inspiration: [Good Ideas]
Overall, we generated 100 ideas using zero-shot prompting and another 100 using few-shot prompting.
The resulting average word count is 69 for GPT-4-generated ideas and 71 for GPT-4 provided with
examples; the average description is 63 words long for student ideas. We compared the resulting few-shot-
prompted ideas to the examples provided to ensure that GPT-4 did not simply modify the examples slightly.
The average pairwise cosine similarity between the six examples and the 100 generated ideas is 0.33, and
the highest similarity between two ideas is 0.51. Thus, we have no reason to believe that GPT-4 repeated
the provided ideas.
5. Study 1: comparing the quality of ideas generated by humans and AI
The Institutional Review Board (IRB) at the University of Pennsylvania approved the research described
in this paper in May 2023, Protocol 853634. We used the online platform Prolific to recruit college-age
individuals from the United States to evaluate all 400 ideas from the three pools (pool 1 with 200 ideas
created by humans, pool 2 with 100 created by GPT-4 with zero-shot prompting, and pool 3 with 100
created by GPT-4 with few-shot prompting) via a purchase intent survey. We presented ideas in random
order and randomized at the idea level, meaning that every survey participant could potentially see ideas
from multiple sources. Each respondent evaluated an average of 40 ideas, and each idea was evaluated
20 times on average. In the summer of 2023, concerns surfaced that ChatGPT was being used to provide
mTurk responses. This practice appears to have been limited to text-generation tasks, not to multiple-
choice tasks like our five-box purchase-intent survey. Indeed, just answering the survey question directly
requires less effort than trying to deploy ChatGPT to answer the question. We thus believe that our study
participants were humans.
We asked respondents to express purchase intent using the standard five-box options: definitely would
not purchase, probably would not purchase, might or might not purchase, probably would purchase, and
definitely would purchase. Jamieson and Bass (1989) recommend weighting the five possible responses
as 0, 0.25, 0.50, 0.75, and 1.00 to develop a single measure of purchase probability, which we use as a
measure of idea quality (other weightings are possible, as we discuss in Section 8). Figure 1 shows the
full quality distribution of ideas generated by the three pools.
Figure 1
Distribution of idea quality for three sets of ideas
Notes. Purchase intent is the weighted average of the five-box response scale per Jamieson and Bass (1989).
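As a minimal illustration, the Jamieson-Bass scheme maps each five-box response to a probability weight and averages across respondents; the sample responses below are hypothetical.

```python
# Jamieson-Bass weighting of five-box purchase-intent responses into a
# single purchase-probability score (the idea-quality measure used here).
WEIGHTS = {
    "definitely would not purchase": 0.00,
    "probably would not purchase": 0.25,
    "might or might not purchase": 0.50,
    "probably would purchase": 0.75,
    "definitely would purchase": 1.00,
}

def purchase_intent(responses):
    """Average weighted purchase probability across respondents."""
    return sum(WEIGHTS[r] for r in responses) / len(responses)

# Hypothetical responses from four raters for one idea.
sample = ["probably would purchase", "might or might not purchase",
          "definitely would purchase", "probably would not purchase"]
print(purchase_intent(sample))  # (0.75 + 0.50 + 1.00 + 0.25) / 4 = 0.625
```

The alternative convex weighting discussed in Section 8 would simply swap the values in the WEIGHTS mapping.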
Figure 1 shows the quality (purchase probability) of ideas across the three pools. On average, GPT-4
generated ideas with greater purchase intent (46.4% with zero-shot prompting and 49.3% with few-shot
prompting) than humans (40.4%). The standard deviation of the quality of ideas is comparable across the
three pools. We formally test the impact of idea source on the perceived quality of product ideas via a linear
mixed-effects model with purchase intent as the dependent variable. The model included two fixed effects
denoting source (humans are the baseline) and random intercepts and slopes for respondents and ideas. We
find significant differences in the perceived quality of ideas as a function of their source. Ideas generated
by GPT-4 with no examples (zero-shot) were rated significantly higher than human-generated ideas (β =
0.059; 95% CI [0.031, 0.088]; t(246) = 4.06, p < 0.001), and ideas generated by GPT-4 provided with
positive examples (few-shot) received even higher ratings (β = 0.089; 95% CI [0.060, 0.12]; t(223) = 5.93,
p < 0.001). Purchase intent is only weakly significantly different between the two pools of LLM-generated
ideas (β = 0.03; 95% CI [-0.01, 0.06]; t(199) = 1.892, p = 0.06). These findings indicate that LLM-generated
ideas are, on average, more likely to be purchased than human-generated ideas (for additional robustness
tests, see Section 8).
6. Study 2: Diversity and Novelty of Ideas
Our second study examines how two properties depend on the idea source: the fraction of overlapping
ideas and the resulting estimate of the total number of ideas the process can generate (idea flexibility,
hypothesis 2a), and the perceived novelty of the ideas as assessed by human raters (idea originality,
hypothesis 2b).
6.1. Overlapping Ideas  An idea-generation process creates a sequence of ideas in which each
additional idea can be compared to the previously created ideas according to its similarity. For a pool of
ideas, we can hence compute the average pairwise similarity of one idea compared to all other ideas and
then compute the average overall similarity for the entire pool. We can also apply a threshold to pairwise
idea similarity to identify at what point the ideas start to become repetitive, i.e., when we are starting to
exhaust the space of new ideas available to a particular idea-generation process. A pool of ideas might
then contain a few overlapping ideas, which informs our second quantitative metric: the total number of
ideas the process can generate.
To measure the diversity of the ideas, we calculate the cosine similarity of each idea relative to the rest
of the set. We first calculate a vector of text embeddings for each idea. We follow the technical setup in
Dell'Acqua et al. (2023) and use Google's Universal Sentence Encoder (USE) model for our idea
embeddings, which is specifically optimized for semantic similarity between sentences. Table 1 shows the
results.
In geometry, the cosine of the angle between vectors ranges from -1 to 1. However, when using Google
USE, negative similarity is rarely encountered, since the overall text structure does not substantially differ
between ideas. Ideas follow a similar pattern in terms of text length and style, often leading with the title
before the idea description. In our test, a cosine similarity of 1 between two ideas thus indicates that they
are very similar (their embedding vectors are aligned), whereas a cosine similarity of 0 implies orthogonal
or unrelated ideas. We consider a new idea added to an idea pool to be unique if its pairwise cosine similarity
compared to all previously added ideas is never greater than 0.8. Additional robustness checks using
different thresholds and measures can be found in Section 8.
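The uniqueness rule above can be sketched as follows, using toy three-dimensional vectors in place of the 512-dimensional USE embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def unique_ideas(embeddings, threshold=0.8):
    """Sequentially add ideas; an idea counts as unique if its cosine
    similarity to every previously kept idea never exceeds the threshold."""
    kept = []
    for e in embeddings:
        if all(cosine(e, k) <= threshold for k in kept):
            kept.append(e)
    return kept

# Toy 3-dimensional "embeddings" (USE vectors are 512-dimensional).
vecs = [(1.0, 0.0, 0.0), (0.99, 0.1, 0.0), (0.0, 1.0, 0.0)]
print(len(unique_ideas(vecs)))  # → 2: the second vector nearly duplicates the first
```

The fraction of overlapping ideas in a pool is then (len(embeddings) - len(unique_ideas(embeddings))) / len(embeddings).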
Table 1  Summary Statistics for Idea Overlap

                                            Student Ideas   GPT-4 zero-shot   GPT-4 few-shot
N ideas                                     200             100               100
Average cosine similarity of all ideas      0.221           0.415             0.428
Fraction of ideas in pool with
cosine similarity > 0.8                     0.00            0.05              0.07

Notes. We compute the fraction as the number of ideas whose average pairwise similarity compared to all
other ideas in the pool exceeds 0.8, divided by the total number of ideas in the pool.
For each pool, we compute the average pairwise similarity between all ideas. One-way ANOVA
analyses show that the source has a significant effect on the cosine similarity between the three pools. The
difference between all three groups is significant (η² = 0.455, 95% CI [-0.210, -0.204], F(2, 29598) =
12340.95, p < 0.001). Considering only two groups, human ideas have a significantly smaller cosine
similarity than GPT-4-generated ideas (η² = 0.358, 95% CI [-0.197, -0.190], F(1, 24649) = 13715.82, p <
0.001). Zero-shot GPT-4 ideas exhibit a significantly smaller cosine similarity than few-shot GPT-4 ideas
(η² = 0.004, 95% CI [-0.018, -0.010], F(1, 9898) = 44.24, p < 0.001).
Because there is no overlap among human-generated ideas under this cosine-similarity threshold, the
fraction of overlapping ideas would be zero and the number of unique ideas infinitely large, in line with
hypothesis 2a. A larger pool of student ideas would eventually contain overlapping ideas (see Kornish and
Ulrich 2014 for estimates), but based on our assumptions for similarity, the student sample contains only
unique ideas. We perform a binomial test to formally estimate the significance of the differences. We find
that the fraction of similar human-generated ideas (95% CI for fraction [0.0, 0.0184]) is significantly
smaller than that of the zero-shot GPT-4 ideas (RD = -0.05, 95% CI [-0.093, -0.007], p < 0.001) and few-
shot GPT-4 ideas (RD = -0.07, 95% CI [-0.120, -0.020], p < 0.001), supporting hypothesis 2a. The
difference between the two GPT-4 pools is not significant (RD = -0.02, 95% CI [-0.086, 0.046], p = 0.56).
Our findings suggest that the human ideation process generates a greater number of distinct ideas than
GPT-4. We calculate the exact numbers in the next section.
Figure 2
Distribution of cosine similarities across the three pools
Notes. Density plot of cosine similarities comparing all three pools. The dotted line shows the mean and confidence
interval of the estimate for a pool used for the ANOVA. The difference between all three groups is significant (η² =
0.455, 95% CI [-0.210, -0.204], F(2, 29598) = 12340.95, p < 0.001).
6.2. Number of Discoverable Ideas  Given the fraction of unique ideas, we can estimate the number of
unique ideas that could be generated by each of our three processes (pools): students, LLM (zero-shot),
and LLM prompted with examples (few-shot), using the method of Kornish and Ulrich (2011). This
method uses capture-recapture analysis of the probability that the next idea in a sequence is unique; it
reportedly originates with Laplace (Cochran 1978) and has been adapted to wildlife ecology and other
domains. For illustration, consider again fishing in a lake as a metaphor for the idea-generation
process.
process. Each idea is a catch, and the fish is released back into the lake. Sometimes, the same fish will be
caught again. The more frequently an individual fish is re-caught, the smaller the estimate of the overall
fish population. Thus, the probability that a fish has never been caught previously is a decreasing function
of the number of ideas generated.
This probability decay is typically represented by an exponential function:

p(n) = e^(-an)    (1)

We define p(n) as the probability that the next idea is unique given that n ideas have been generated already.
The expected number of unique ideas out of n generated, u(n), is the integral under this curve:

u(n) = (1/a)(1 - e^(-an))    (2)
This form of probability decay comes from a specific underlying process with T unique ideas in total (T
fish in the pond), each equally likely to be drawn. This assumption is commonly used in the Lincoln-
Petersen method (Lincoln 1930), the standard model for estimating population size in the wildlife ecology
literature. The decay parameter and the total T are linked: T = 1/a. This model has only a single parameter,
a, whose inverse is the size of the opportunity space, i.e., an estimate of the total number of unique ideas
that an unlimited number of comparable idea generators, each generating an enormous number of ideas,
would generate.
Given a set of ideas generated and a count of the number of unique ideas in that set, the model can be
used to calculate T, an estimate of the size of the opportunity space. Using the similarity threshold of 0.8
from the cosine similarity metric, we found that 5 of the 100 ideas generated by the LLM with zero-shot
prompting were essentially similar to an idea already generated (fish recaptured), and that 7 of the 100
ideas generated via few-shot prompting were redundant. Thus, u(100) is 0.95 in the first case and 0.93 in
the second. This corresponds to an estimate of T of 966 ideas (zero-shot) and 680 ideas (few-shot),
respectively.
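The opportunity-space size T can be recovered numerically from the observed number of unique ideas by inverting equation (2); a minimal sketch, using the 95 and 93 unique ideas out of 100 reported above:

```python
import math

def estimate_T(n, unique, lo=1e-9, hi=1.0, iters=200):
    """Solve unique = (1/a)(1 - exp(-a*n)) for the decay parameter a by
    bisection (the left side is monotone decreasing in a), then return the
    opportunity-space estimate T = 1/a. Assumes 0 < unique < n."""
    f = lambda a: (1 - math.exp(-a * n)) / a - unique
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid   # a too small: expected unique count still too high
        else:
            hi = mid
    return 1 / ((lo + hi) / 2)

print(round(estimate_T(100, 95)))  # zero-shot pool: ~966 ideas
print(round(estimate_T(100, 93)))  # few-shot pool: ~680 ideas
```

The human pool has no recaptures at this threshold, so the same calculation is undefined there (T is unbounded), consistent with the text.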
In our sample, human-generated ideas were all unique. Thus, as expected from our overlap calculations,
and based on the estimates provided by the capture-recapture model, we find support for the second
quantitative metric of hypothesis 2a. The number of unique ideas that can be discovered is lower for both
pools of AI-generated ideas than for the human idea-generation process. In addition, prompting the LLM
with examples seems to further reduce the estimated number of unique ideas available to the process. We
perform additional robustness checks in Section 8.
6.3. Perceived Novelty  Given that LLMs are designed to generate the statistically most plausible
sequence of text based on their training data, they may generate less novel ideas than humans. Novelty is
not a goal expressed in the prompt used in this study for either humans or GPT-4, and it is typically not a
primary objective in commercial product development efforts. Still, to ensure that GPT-generated ideas are
not merely lists of existing ideas, we investigate how the novelty of ideas varies between LLM-generated
ideas and those generated by humans.
Based on Shibayama et al. (2021), we assessed novelty by asking respondents on Prolific the question
"Relative to other products you have seen, how novel do you consider the idea for this new product?" [0:
Not at all novel, 0.25: Slightly novel, 0.5: Moderately novel, 0.75: Very novel, 1: Extremely novel]. The
average novelty of human-generated ideas is 40.6% (SD: 0.117), which is greater than that of zero-shot
GPT-4 (36.7%, SD: 0.101) and few-shot GPT-4 (36.1%, SD: 0.111; see Figure 3).
Similar to purchase intent, we estimate a linear mixed-effects model to investigate how the idea source
(human ideas, zero-shot GPT-4, and few-shot GPT-4) affects the perceived novelty of product ideas. The
model includes two fixed effects denoting the source (humans are the baseline) and random intercepts and
slopes for both respondents and ideas.
We find significant differences in perceived novelty between human and zero-shot GPT-4-generated
ideas (β = -0.038; 95% CI [-0.066, -0.01]; t(269) = -2.67, p = 0.008) at the alpha = 0.05 threshold. Ideas
generated by few-shot GPT-4 also receive significantly lower novelty ratings (β = -0.049; 95% CI [-0.078,
-0.02]; t(268) = -3.4, p < 0.001) compared to human-generated ideas. These findings suggest that LLM-
generated ideas are, on average, perceived as less novel than human-generated ideas.
Perceived novelty is not significantly different between the two pools of LLM-generated ideas (β =
-0.01; 95% CI [-0.039, 0.017]; t(195) = -0.757, p = 0.45). Of note, novelty does not appear to be significantly
correlated with purchase intent; the correlation coefficient is slightly negative at -0.08 (95% CI [-0.176,
0.016], p = 0.12). Additional robustness checks can be found in Section 8.
Figure 3
Distribution of novelty ratings for three samples of ideas
Notes. Novelty based on mTurk assessment per Kwon, Kim, and Lee (2009).
These findings support Hypothesis 2b: AI-generated ideas are, on average, less novel than human-
generated ideas. Of note, the average novelty of all ideas, irrespective of source, lies between slightly and
moderately novel. While human ideas are around 0.047 points more novel, there is little reason to believe
that novelty alone, i.e., being the first to think of an idea, leads to a significant financial advantage. As
Terwiesch and Ulrich (2010) and others have argued, the first-mover advantage is a myth. As such, from a
commercial point of view, we do not believe that the slightly lower novelty outweighs the productivity and
quality benefits of LLMs.
7. Study 3: What is the quality of the best idea(s)?
Table 2 summarizes the titles of the top 40 ideas (10%) in our pool, that is, the top 40 out of the 400
ideas used.
Table 2  Top 10% of Ideas by Purchase Intent

Title                                               Source             Purchase Intent  Novelty
Compact Printer                                     GPT-4 (Few-Shot)   0.76             0.55
Solar-Powered Gadget Charger                        GPT-4 (Few-Shot)   0.75             0.44
QuickClean Mini Vacuum                              GPT-4 (Zero-Shot)  0.75             0.30
Noise-Canceling Headphones                          GPT-4 (Few-Shot)   0.72             0.18
StudyErgo Seat Cushion                              GPT-4 (Zero-Shot)  0.72             0.39
Multifunctional Desk Organizer                      GPT-4 (Few-Shot)   0.71             0.21
Reusable Silicone Food Storage Bags                 GPT-4 (Few-Shot)   0.68             0.34
Portable Closet Organizer                           GPT-4 (Few-Shot)   0.67             0.23
Dorm Room Chef [oven, microwave and toaster]*       GPT-4 (Few-Shot)   0.67             0.71
Collegiate Cookware                                 GPT-4 (Few-Shot)   0.67             0.45
Collapsible Laundry Basket                          GPT-4 (Few-Shot)   0.65             0.21
On-the-Go Charging Pouch                            GPT-4 (Few-Shot)   0.65             0.33
GreenEats Reusable Containers                       GPT-4 (Zero-Shot)  0.65             0.21
HydrationStation [bottle with filter]*              GPT-4 (Zero-Shot)  0.64             0.19
Reusable Shopping Bag Set                           GPT-4 (Few-Shot)   0.64             0.19
CollegeLife Collapsible Laundry Hamper              GPT-4 (Zero-Shot)  0.64             0.26
Adaptiflex [cord extension to fit big adapters]*    Student            0.64             0.44
SpaceSaver Hangers                                  GPT-4 (Zero-Shot)  0.64             0.33
Dorm Room Air Purifier                              GPT-4 (Few-Shot)   0.63             0.29
Smart Power Strip                                   GPT-4 (Few-Shot)   0.63             0.22
CampusCharger Pro                                   GPT-4 (Zero-Shot)  0.63             0.31
Kitchen Safe Gloves                                 Student            0.62             0.31
Nightstand Nook [charging, cup holder]*             GPT-4 (Few-Shot)   0.62             0.43
Mini Steamer                                        GPT-4 (Few-Shot)   0.62             0.41
CollegeCare First Aid Kit                           GPT-4 (Zero-Shot)  0.62             0.26
StudySoundProof [soundproofing panels]*             GPT-4 (Zero-Shot)  0.62             0.57
FreshAir Fan                                        GPT-4 (Zero-Shot)  0.62             0.29
StudyBuddy Lamp [portable, usb charging]*           GPT-4 (Zero-Shot)  0.62             0.43
Bluetooth Signal Merger [share music]*              Student            0.62             0.41
Adjustable Laptop Riser                             GPT-4 (Few-Shot)   0.62             0.21
EcoCharge [solar powered charger]*                  GPT-4 (Zero-Shot)  0.62             0.43
Smartphone Projector                                Student            0.62             0.57
Grocery Helper [hook to carry multiple bags]*       Student            0.62             0.53
FitnessOnTheGo [portable gym equipment]*            GPT-4 (Zero-Shot)  0.62             0.42
Multipurpose Fitness Equipment                      GPT-4 (Few-Shot)   0.62             0.37
CollegeCooker                                       GPT-4 (Zero-Shot)  0.61             0.50
Multifunctional Wall Organizer                      GPT-4 (Few-Shot)   0.61             0.31
DormDoc Portable Scanner                            GPT-4 (Zero-Shot)  0.61             0.49
Mobile Charging Station Organizer                   GPT-4 (Few-Shot)   0.61             0.26
StudyMate Planner                                   GPT-4 (Few-Shot)   0.61             0.22
DormChef Kitchen Set                                GPT-4 (Zero-Shot)  0.61             0.33
LaundryBuddy [laundry basket]*                      GPT-4 (Zero-Shot)  0.61             0.30

Notes. The asterisk (*) denotes ideas where the text in square brackets [ ] is not part of the original title and
was added to clarify the idea.
Among the top 40 ideas (top decile), 35 (87.5%) were generated by GPT-4 (see Table 3). In other words,
for every human idea in the top 10% we count seven ideas generated by GPT-4. A chi-square test of
independence, with the null hypothesis that the top ideas are represented proportionally to pool size
(expected counts of 20, 10, and 10), rejected the null hypothesis (χ² = 26.39, df = 2, p < 0.001), thus
confirming hypothesis 3.
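A sketch of such a goodness-of-fit test, assuming expected counts proportional to pool sizes; the statistic computed under this exact specification differs slightly from the reported 26.39, which may reflect a somewhat different setup:

```python
import math

# Observed top-decile counts per source (Table 3) vs. counts expected if all
# 400 ideas were equally likely to make the top 40 (pool sizes 200/100/100).
observed = [5, 15, 20]    # student, GPT-4 zero-shot, GPT-4 few-shot
expected = [20, 10, 10]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# For df = 2, the chi-square survival function has the closed form exp(-x/2).
p_value = math.exp(-chi2 / 2)
print(chi2, p_value)  # χ² = 23.75, p well below 0.001
```

Either specification comfortably rejects equal representation of the three sources among the top ideas.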
Table 3  Best Ideas Across Pools

                                      Student Ideas   GPT-4 zero-shot   GPT-4 few-shot
N ideas                               200             100               100
Average quality of top decile         0.62            0.64              0.66
Average novelty of top decile         0.45            0.35              0.33
Fraction of the top decile of
pooled ideas from this source         5/40            15/40             20/40
To better understand how the full distribution of idea quality is affected by the idea source, we use
quantile regression analysis. Quantile regression (Koenker and Hallock 2001) extends traditional regression
by computing the relationship between explanatory variables (idea source) and the response variable (idea
quality) for different percentiles of the data. As mentioned above, in innovation the quality of the best ideas
is generally more important than the average quality; that is, we prefer a few exceptional ideas to a lot of
mediocre ones. Using quantile regression, we can examine the tails of the distribution instead of the mean,
allowing us to test whether GPT-4 excels at generating high-quality ideas only for specific percentiles or
whether the effect holds across the entire distribution.
Our analysis follows Girotra et al. (2010). We use the average idea-quality ratings as the dependent
variable, and our explanatory variable is a binary variable indicating whether the idea is human-generated
(baseline level) or AI-generated (GPT-4 zero-shot and GPT-4 few-shot prompting). Figure 4 shows the
results. For all percentiles, GPT-4 ideas consistently outperform student ideas. The effect is especially
pronounced in the upper tail of the distribution (80th percentile and above), where GPT-4 has the strongest
advantage. This implies that GPT-4 not only generates better ideas on average but is also especially adept
at producing top-tier ideas compared to students.
Figure 4
Estimated Difference in Idea Quality Ratings between AI-generated Ideas and Human-generated
ideas (baseline), for Different Percentiles
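The intuition behind reading Figure 4 can be illustrated by comparing two quality distributions percentile by percentile. This is not the Koenker-Hallock estimator itself, and the data below are synthetic stand-ins for the rating data.

```python
from statistics import quantiles

# Synthetic idea-quality scores: the "AI" pool is shifted upward, with the
# gap growing toward the top of the distribution (all values hypothetical).
human = [0.20 + 0.40 * i / 199 for i in range(200)]
ai = [h + 0.02 + 0.08 * i / 199 for i, h in enumerate(human)]

def pct(data, q):
    """q-th percentile via linear interpolation of the sorted data."""
    return quantiles(data, n=100)[q - 1]

# Quantile-level "effect" of the AI source at selected percentiles.
effects = {q: pct(ai, q) - pct(human, q) for q in (50, 80, 95)}
for q, d in effects.items():
    print(f"P{q}: AI - human = {d:+.3f}")
```

With the growing gap built into the synthetic data, the estimated difference increases toward the upper percentiles, which is the pattern the quantile regression in Figure 4 reports for the actual ratings.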
8. Discussion and Limitations
In this section, we discuss conceptual limitations of our work, limitations related to our research design,
as well as data analysis and the robustness of our analysis to a set of alternative specifications and
assumptions.
Our findings indicate that GPT-4 produces higher-quality ideas, more likely to be purchased, than humans
do, though these ideas are perceived as less novel. AI significantly outperforms human creativity in
generating top-tier ideas, with GPT-4 ideas being seven times more likely to rank in the top 10%. Given
AI's advantage in both quality and productivity, our findings have profound implications for the field of
innovation management. For instance, AI can serve as a first step in brainstorming sessions, allowing
organizations to rapidly explore a wide variety of ideas with minimal cost and time investment. Human
ideators can also provide AI with their own interesting ideas and refine them with the help of AI. Another
important implication lies in the potential shift of focus from idea generation to idea evaluation. If LLMs
can reliably produce numerous high-quality ideas at very low cost, companies might allocate more resources
toward assessing and refining those ideas instead of ideating from scratch. This shift could lead to the
development of new tools and frameworks specifically designed to help organizations sort, rank, and
prioritize AI-generated ideas, further streamlining the innovation process.
However, while the results show that GPT-4 outperforms human creativity in terms of producing top-
tier ideas, the reduced novelty and increased similarity among AI-generated ideas point to a limitation. This
suggests that a human in the loop is still important to drive the ideation direction and ensure that ideas are
as novel as possible. Future research could explore ways to mitigate this issue by enhancing LLMs' ability
to generate more diverse and creative solutions through techniques such as fine-tuning.
Investigating whether LLMs can evaluate ideas with the same rigor as human evaluators would help to
further improve the ideation process. It would allow an LLM to get immediate feedback on its creations,
leaving humans to focus on implementation and strategy.
8.1 Conceptual and Research Design Limitations  Conceptually, our prompting approach (i.e., a
simple prompt) is not optimized for creativity or novelty. It also follows a single-ideator setup instead of
approaches such as hybrid brainstorming that lead to more and better ideas (Girotra et al. 2010). A model
given more specific instructions on how to ideate effectively might thus perform even better. Different
prompting techniques such as Chain-of-Thought (CoT), which asks the model to reason through a problem
in multiple steps instead of directly providing an answer (Wei et al. 2023), might also improve performance.
Furthermore, providing the model with hundreds of good ideas, either via many-shot learning or fine-tuning,
could also enhance performance. This suggests that we likely underestimate the true power of AI-based
idea generation.
Second, it is possible that professional product innovators would generate better ideas than our students.
However, this has not been the experience of the paper's authors, who have taught many academic courses
and worked in many product development settings. Many students who participated in the innovation
contests have gone on to be product innovators, sometimes based on ideas from the course tournament.
Nevertheless, we have not produced evidence that GPT-4 is better than the best product innovators working
today. However, we believe we can conservatively claim that GPT-4 is better than many human product
innovators working today, and probably better than average. Thus, at a very minimum, an LLM could
elevate the least capable humans to a better-than-average level of performance.
Third, GPT might be a great salesperson. As such, it is possible that the writing style (the "pitch")
convinces the customers rather than the idea itself. Prior work in other domains suggests that text generated
by LLMs is not distinguishable from that generated by humans (Brown et al. 2020), though recent work has
developed sophisticated measures to detect LLM-generated text (Mitchell et al. 2023, Kobak et al. 2024,
Venkatraman et al. 2024). For example, Kobak et al. (2024) provide intuitions that could be used to identify
LLM-generated text, such as words that are not commonly used by the majority of English speakers, like
"delve". However, it is unlikely that these characteristics were known to our survey participants at the time
of our experiment in May and June 2023, or that any particular idea generated by GPT-4 could easily be
distinguished from those generated by our students. Future research could use LLMs to present human-
generated ideas in a way that more closely mimics the presentation style of LLM-generated ideas, ensuring
that the quality of the idea is not confounded by its presentation style.
Fourth, our study is set in the widely understood domain of consumer products for the college-student
market that cost less than USD 50. Presumably, there exists a lot of commentary and data about such
products in the training data used by the GPT class of language models. As such, it is unclear whether our
results would generalize to more specialized domains, such as surgical instruments. Organizations looking
for opportunities in these specialized domains should fine-tune language models with their own proprietary
data to achieve comparable or better performance.
Fifth, innovation often benefits from collaboration and is not solely focused on one ideator generating
many ideas. Liu et al. (2018) show that collaborating with other innovators improves the creative process
by enabling the transfer of critical skills and knowledge, particularly when those collaborations involve
highly skilled innovators. Future work should investigate whether this can be applied to human and LLM
interaction, and whether an LLM could help a novice human innovator become better.
8.2 Robustness  There are different ways to analyze the data. Here, we provide additional robustness
checks that investigate the validity of our results under various specifications.
8.2.1 Study 1  To measure purchase intent, it is possible to use other convex weighting schemes. Ulrich
and Eppinger (2007) weight "definitely would purchase" as 0.4 and "probably would purchase" as 0.2, with
all other responses weighted 0. When using this alternative set of weights, we find the same significant
differences between pools.
As a robustness test for our primary purchase-intent analysis using a linear mixed-effects model, we also
conduct a simpler linear regression focusing on the average perceived quality of product ideas across
different sources. This model aggregates individual ratings at the idea level, removing the random effects
to capture the overall influence of the source on rating averages. The results confirm our previous findings
and show that ideas from GPT-4 (zero-shot) are rated higher than human ones by an average of 0.256 points
(95% CI [0.15, 0.37]; t = 4.602, p < 0.001), and ideas from GPT-4 (few-shot) are rated higher by an average
of 0.358 points (95% CI [0.25, 0.47]; t = 6.435, p < 0.001).
In addition, we estimate a cumulative link mixed model (CLMM) to treat the ratings outcome as a factor.
We find significant differences in the perceived quality, measured as purchase intent of product ideas,
between sources. Ideas generated by GPT-4 (zero-shot) receive a significantly greater average rating (β =
0.395; 95% CI [0.215, 0.575]; z = 4.31, p < 0.001). Similarly, ideas generated by GPT-4 (few-shot) receive
even higher ratings (β = 0.581; 95% CI [0.400, 0.762]; z = 6.30, p < 0.001) compared to human-generated
ideas. These findings suggest that LLM-generated ideas are perceived as more likely to be purchased than
human-generated ideas, with the highest perceived quality attributed to few-shot GPT-4-generated ideas.
8.2.2 Study 2 Our chosen threshold of θ = 0.8 was established through experimentation, by inspecting pairs of ideas alongside their similarity scores. However, our findings are robust to other values, such as 0.7 (25 and 37 overlapping ideas for zero-shot and few-shot GPT-4, respectively) and 0.75 (16 and 23 overlapping ideas). At θ = 0.85, the zero-shot GPT-4 pool features only two overlapping ideas, and the few-shot pool features one. Because these are extreme values that approach zero, we used 0.8 as our main threshold.
We compute the pairwise similarity of each idea to all other ideas in the pool and take the average. Mean pairwise similarity is a common measure in ideation (Siangliulue et al. 2016, Cox et al. 2021) and similar text-mining tasks (Doshi and Hauser 2024), but it is not without issues, as it lacks sensitivity to highly clustered ideas. As an additional specification, we consider the per-pool collective diversity of all ideas: following Cox et al. (2021), we construct a minimum spanning tree (MST), which connects all points (ideas) in embedding space with the smallest total distance along its edges. In 2D space, an MST would be the tree that contains all points with the shortest overall edge length. We compute the mean of all edge distances as a measure of how distributed the ideas are in the high-dimensional space: the tree is constructed in 512 dimensions, its edge weights are summed and divided by the number of edges, yielding a value that ranges from 0 (not diverse at all) to 1 (very diverse). Based on this measure, the student idea pool is the most diverse (0.53), GPT-4 zero-shot is the second most diverse (0.33), and GPT-4 few-shot is the least diverse (0.30) pool.
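The two measures above can be sketched in a few lines. This is a minimal illustration rather than the code behind our analysis: it assumes the ideas have already been embedded as rows of a matrix (512-dimensional embeddings in the paper; tiny stand-in vectors here) and uses cosine distance throughout.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mean_pairwise_similarity(emb):
    """Average cosine similarity between all distinct pairs of ideas."""
    sims = 1.0 - squareform(pdist(emb, metric="cosine"))
    n = len(emb)
    return (sims.sum() - n) / (n * (n - 1))  # drop the n self-similarities

def mst_diversity(emb):
    """Mean edge weight of the minimum spanning tree over cosine distances.

    Caveat: identical ideas (distance exactly 0) would be treated as
    unconnected under the dense-input convention of scipy's csgraph.
    """
    dists = squareform(pdist(emb, metric="cosine"))
    mst = minimum_spanning_tree(dists)  # sparse matrix holding the n - 1 edges
    return mst.sum() / (len(emb) - 1)
```

For three mutually orthogonal embeddings (`np.eye(3)`), every pairwise cosine distance is 1, so the mean pairwise similarity is 0 and the MST diversity is 1.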
Similar to purchase intent, we also conduct a simpler linear regression focusing on the average perceived novelty of product ideas across the different sources. This model aggregates individual ratings at the idea level, removing the random effects to capture the overall influence of the source on rating averages. We find that ideas from GPT-4 (zero-shot) are significantly less novel than human ones (β = -0.177; 95% CI [-0.286, -0.069]; t = -3.22, p = 0.0014). Ideas from GPT-4 (few-shot) are also rated as significantly less novel than human ones (β = -0.197; 95% CI [-0.305, -0.089]; t = -3.58, p < 0.001). This simpler analysis reinforces that human ideas are more novel than AI-generated ones, even when using zero-shot prompting.
In addition, we estimate a cumulative link mixed model (CLMM) that treats the rating outcome as an ordered factor. We find significant differences in perceived novelty. Ideas generated by GPT-4 (zero-shot) receive a significantly lower average rating than human-generated ideas (β = -0.306; 95% CI [-0.514, -0.10]; z = -2.89, p < 0.01). Similarly, ideas generated by GPT-4 (few-shot) receive even lower ratings compared to human-generated ideas (β = -0.39; 95% CI [-0.60, -0.18]; z = -3.66, p < 0.001). These findings suggest that LLM-generated ideas are perceived as less novel than human-generated ideas, with the lowest perceived novelty attributed to few-shot GPT-4-generated ideas.
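The cumulative-link family behind these models can be illustrated without the random effects. This is an illustrative sketch, not our estimation code (a package such as R's ordinal::clmm fits the full mixed model): it only shows how a linear predictor and a set of cutpoints map to probabilities over the ordered rating categories, and how the sign of a source coefficient shifts that distribution.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function

def cumlogit_probs(eta, cutpoints):
    """Category probabilities under a cumulative-logit (proportional odds) model.

    P(Y <= k) = logistic(c_k - eta); differencing adjacent cumulative
    probabilities gives the probability of each ordered category.
    """
    eta = np.atleast_1d(np.asarray(eta, dtype=float))
    c = np.concatenate(([-np.inf], np.asarray(cutpoints, dtype=float), [np.inf]))
    cdf = expit(c[None, :] - eta[:, None])  # P(Y <= k) at every cutpoint
    return np.diff(cdf, axis=1)             # shape: (n_obs, n_categories)

# A positive source coefficient raises eta and moves probability mass
# toward higher rating categories; a negative one does the opposite.
probs = cumlogit_probs(eta=[0.0, 0.581], cutpoints=[-1.0, 0.0, 1.0])
```

The cutpoints here are arbitrary placeholders; in the fitted CLMM they are estimated thresholds between adjacent rating levels.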
8.2.3 Study 3 In this study, we present our results for the 90th percentile of all aggregated ideas. Table
4 shows that using other percentiles yields similar results.
Table 4   Top 5 and 15 Percent of Ideas: Pool Distributions

Source            Avg. Quality   Fraction of Top 5%   Avg. Quality   Fraction of Top 15%
                  of Top 5%      from This Source     of Top 15%     from This Source
Student Ideas     0.64           1/20                 0.60           11/60
GPT-4 zero-shot   0.67           6/20                 0.62           22/60
GPT-4 few-shot    0.68           14/20                0.64           27/60
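Mechanically, the pooled top-slice comparison in Table 4 amounts to ranking all ideas by predicted quality, keeping the top fraction, and counting each source's share. A minimal sketch (the score and label arrays are illustrative placeholders, not our data):

```python
import numpy as np

def top_fraction_by_source(scores, sources, pct=0.10):
    """Pool all ideas, keep the top `pct` fraction by score, and report
    what share of that top slice each source contributes."""
    scores = np.asarray(scores, dtype=float)
    sources = np.asarray(sources)
    k = max(1, int(round(pct * len(scores))))
    top_idx = np.argsort(scores)[::-1][:k]  # indices of the k best ideas
    top_sources = sources[top_idx]
    return {s: float(np.mean(top_sources == s)) for s in np.unique(sources)}
```

With 400 pooled ideas, pct=0.05 and pct=0.15 reproduce the 20- and 60-idea slices in Table 4.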
9. Summary
GenAI has demonstrated remarkable advancements in creating coherent and fluent text, equaling or surpassing human performance in various academic and professional domains. In this study, we explored the ideation capabilities of OpenAI's GPT-4, a state-of-the-art large language model, in comparison to the ideation abilities of university students when generating ideas for new products targeted toward college students at a price point of $50 or less. Specifically, we make three main contributions to the literature on innovation and the role of AI.
First, GPT-4 produces high-quality ideas that are perceived as more likely to be purchased than human-generated ideas. Second, consumers perceive AI-generated ideas as less novel. Third, when considering the quality of the best ideas, AI significantly outperforms human creativity. To put these findings in context, innovation favors a few great ideas over a large number of solid ones, and our results show that AI-generated ideas are seven times more likely than human ideas to be among the top 10% of ideas considered in our experiment. Despite the reduction in novelty, the overall AI advantage thus remains substantial.
The fact that GPT-4 is very efficient at generating ideas does not require a formal research study. Two hundred ideas can be generated by one human interacting with GPT-4 in about 15 minutes; a human working alone can generate about five ideas in 15 minutes, and humans working in groups do even worse (Girotra et al. 2010). In short, the productivity race between humans and GPT-4 is not even close. However, as we show in this article, the enormous potential of LLMs in ideation results not only from their ability to generate ideas quickly and inexpensively, but also from the remarkable quality of their output.
Importantly, hundreds of high-quality ideas can be produced at a fraction of the cost of human ideation. This previously unimaginable productivity in generating ideas may substantially reduce the importance of the idea-generation phase of innovation and shift managerial focus to the idea-evaluation phase. Can an LLM also take on the task of idea evaluation? From our viewpoint, this is a fascinating question for future research.
References
Bellaiche L, Shahi R, Turpin MH, Ragnhildstveit A, Sprockett S, Barr N, Christensen A, Seli P (2023) Humans versus AI: whether and why we prefer human-created compared to AI-created artwork. Cogn. Research 8(1):42.
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, et al. (2020) Language Models are Few-Shot Learners. (July 22) http://arxiv.org/abs/2005.14165.
Chao RO, Kavadias S (2008) A Theoretical Framework for Managing the New Product Development Portfolio: When and How to Use Strategic Buckets. Management Science 54(5):907–921.
Cochran WG (1978) Laplace's Ratio Estimator. David HA, ed. Contributions to Survey Sampling and Applied Statistics (Academic Press), 3–10.
Connolly T, Jessup LM, Valacich JS (1990) Effects of Anonymity and Evaluative Tone on Idea Generation in Computer-Mediated Groups. Management Science 36(6):689–703.
Cox SR, Wang Y, Abdul A, Von Der Weth C, Y. Lim B (2021) Directed Diversity: Leveraging Language Embedding Distances for Collective Creativity in Crowd Ideation. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (ACM, Yokohama, Japan), 1–35.
Dahan E, Mendelson H (2001) An Extreme-Value Model of Concept Testing. Management Science 47(1):102–116.
Dell'Acqua F, McFowland III E, Mollick ER, Lifshitz-Assaf H, Kellogg K, Rajendran S, Krayer L, Candelon F, Lakhani KR (2023) Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. (September 15) https://papers.ssrn.com/abstract=4573321.
Doshi AR, Hauser OP (2024) Generative AI enhances individual creativity but reduces the collective diversity of novel content. Sci. Adv. 10(28):eadn5290.
Girotra K, Terwiesch C, Ulrich KT (2010) Idea Generation and the Quality of the Best Idea. Management Science 56(4):591–605.
Goldenberg J, Mazursky D, Solomon S (1999) Creative Sparks. Science 285(5433):1495–1496.
Guilford JP (1967) Creativity: Yesterday, Today and Tomorrow. Journal of Creative Behavior 1(1):3–14.
Haase J, Hanel PHP (2023) Artificial muses: Generative artificial intelligence chatbots have risen to human-level creativity. Journal of Creativity 33(3):100066.
Hitsuwari J, Ueda Y, Yun W, Nomura M (2023) Does human–AI collaboration lead to more creative art? Aesthetic evaluation of human-made and AI-generated haiku poetry. Computers in Human Behavior 139:107502.
Hubert KF, Awa KN, Zabelina DL (2024) The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks. Sci Rep 14(1):3440.
Huchzermeier A, Loch CH (2001) Project Management Under Risk: Using the Real Options Approach to Evaluate Flexibility in R&D. Management Science 47(1):85–101.
Jamieson LF, Bass FM (1989) Adjusting Stated Intention Measures to Predict Trial Purchase of New Products: A Comparison of Models and Methods. Journal of Marketing Research 26(3):336–345.
Jia N, Luo X, Fang Z, Liao C (2024) When and How Artificial Intelligence Augments Employee Creativity. AMJ 67(1):5–32.
Kobak D, González-Márquez R, Horvát EÁ, Lause J (2024) Delving into ChatGPT usage in academic writing through excess vocabulary. (July 3) http://arxiv.org/abs/2406.07016.
Koenker R, Hallock KF (2001) Quantile Regression. Journal of Economic Perspectives 15(4):143–156.
Koivisto M, Grassini S (2023) Best humans still outperform artificial intelligence in a creative divergent thinking task. Sci Rep 13(1):13601.
Kornish LJ, Ulrich KT (2011) Opportunity Spaces in Innovation: Empirical Analysis of Large Samples of Ideas. Management Science 57(1):107–128.
Kornish LJ, Ulrich KT (2014) The Importance of the Raw Idea in Innovation: Testing the Sow's Ear Hypothesis. Journal of Marketing Research 51(1):14–26.
Lincoln FC (1930) Calculating waterfowl abundance on the basis of banding returns (U.S. Dept. of Agriculture, Washington, D.C.).
Liu H, Mihm J, Sosa ME (2018) Where Do Stars Come From? The Role of Star vs. Nonstar Collaborators in Creative Settings. Organization Science 29(6):1149–1169.
Loch CH, Terwiesch C, Thomke S (2001) Parallel and Sequential Testing of Design Alternatives. Management Science 47(5):663–678.
March JG (1991) Exploration and Exploitation in Organizational Learning. Organization Science 2(1):71–87.
Meincke L, Carton A (2024) Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations. (May 26) https://papers.ssrn.com/abstract=4873537.
Mihm J, Schlapp J (2019) Sourcing Innovation: On Feedback in Contests. Management Science 65(2):559–576.
Mitchell E, Lee Y, Khazatsky A, Manning CD, Finn C (2023) DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. (July 23) http://arxiv.org/abs/2301.11305.
OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. (2024) GPT-4 Technical Report. (March 4) http://arxiv.org/abs/2303.08774.
Osborn AF (1953) Applied imagination (Scribner's, Oxford, England).
Rashidi HH, Fennell BD, Albahra S, Hu B, Gorbett T (2023) The ChatGPT conundrum: Human-generated scientific manuscripts misidentified as AI creations by AI text detection tool. Journal of Pathology Informatics 14:100342.
Shank DB, Stefanik C, Stuhlsatz C, Kacirek K, Belfi AM (2023) AI composer bias: Listeners like music less when they think it was composed by an AI. J Exp Psychol Appl 29(3):676–692.
Shibayama S, Yin D, Matsumoto K (2021) Measuring novelty in science with word embedding. Muscio A, ed. PLoS ONE 16(7):e0254034.
Si H, Kavadias S, Loch CH (2022) Managing Innovation Portfolios: From Project Selection to Portfolio Design. (March 6) https://papers.ssrn.com/abstract=4050940.
Siangliulue P, Chan J, Dow SP, Gajos KZ (2016) IdeaHound: Improving Large-scale Collaborative Ideation with Crowd-Powered Real-time Semantic Modeling. Proceedings of the 29th Annual Symposium on User Interface Software and Technology (ACM, Tokyo, Japan), 609–624.
Sommer SC, Loch CH (2004) Selectionism and Learning in Projects with Complexity and Unforeseeable Uncertainty. Management Science 50(10):1334–1347.
Sutton RI, Hargadon A (1996) Brainstorming Groups in Context: Effectiveness in a Product Design Firm. Administrative Science Quarterly 41(4):685.
Terwiesch C (2023) Let's cast a critical eye over business ideas from ChatGPT. Financial Times (March 12) https://www.ft.com/content/591ad272-6419-4f2c-9935-caff1d670f08.
Terwiesch C, Ulrich K (2023) The Innovation Tournament Handbook: A Step-by-Step Guide to Finding Exceptional Solutions to Any Challenge (Wharton School Press, Philadelphia, PA).
Terwiesch C, Ulrich KT (2009) Innovation Tournaments: Creating and Selecting Exceptional Opportunities (Harvard Business Press, Boston, MA).
Terwiesch C, Xu Y (2008) Innovation Contests, Open Innovation, and Multiagent Problem Solving. Management Science 54(9):1529–1543.
Torrance EP (1968) A Longitudinal Examination of the Fourth Grade Slump in Creativity. Gifted Child Quarterly 12(4):195–199.
Ulrich K, Eppinger S (2007) Product Design and Development (McGraw-Hill Education).
Venkatraman S, Uchendu A, Lee D (2024) GPT-who: An Information Density-based Machine-Generated Text Detector. (April 3) http://arxiv.org/abs/2310.06202.
Wang H, Zou J, Mozer M, Goyal A, Lamb A, Zhang L, Su WJ, et al. (2024) Can AI Be as Creative as Humans? (January 25) http://arxiv.org/abs/2401.01623.
Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E, Le Q, Zhou D (2023) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. (January 10) http://arxiv.org/abs/2201.11903.
Weitzman ML (1979) Optimal Search for the Best Alternative. Econometrica 47(3):641.
Zhou E, Lee D (2024) Generative artificial intelligence, human creativity, and art. Harding M, ed. PNAS Nexus 3(3):pgae052.
Zlatkov D, Ens J, Pasquier P (2023) Searching for Human Bias Against AI-Composed Music. Artificial Intelligence in Music, Sound, Art and Design: 12th International Conference, EvoMUSART 2023, Held as Part of EvoStar 2023, Brno, Czech Republic, April 12–14, 2023, Proceedings (Springer-Verlag, Berlin, Heidelberg), 308–323.
Appendix A. Quantile Regression Results
The regression model considered quantiles 0.1 to 0.9 in steps of 0.1. For each quantile, we estimated MeanRating ~ SourceAI, where SourceAI is a dummy variable indicating whether the idea source was a student (SourceAI = 0) or GPT-4 (SourceAI = 1). A positive coefficient on SourceAI indicates that GPT-4 ideas performed better than human ideas at that quantile; a negative coefficient indicates the opposite.
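Because SourceAI is the only regressor and it is binary, each quantile regression has a closed-form solution: the fitted intercept is the q-quantile of the student group's mean ratings, and the SourceAI coefficient is the difference between the GPT-4 and student q-quantiles. A minimal sketch of this equivalence (array names are illustrative, not from our code):

```python
import numpy as np

def quantile_reg_binary(y, source_ai, q):
    """Quantile regression of y on a single binary dummy.

    Minimizing the pinball loss separates by group, so the fit reduces to
    per-group sample quantiles: intercept = q-quantile of the student
    group; slope = q-quantile of the AI group minus that intercept.
    """
    y = np.asarray(y, dtype=float)
    source_ai = np.asarray(source_ai, dtype=bool)
    intercept = np.quantile(y[~source_ai], q)
    slope = np.quantile(y[source_ai], q) - intercept
    return intercept, slope
```

A general-purpose package (e.g. statsmodels' QuantReg or R's quantreg) gives the same point estimates here, plus the confidence intervals reported in Table A.1.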
Table A.1   Quantile Regression Results for Quantiles 0.1 to 0.9

Quantile   Intercept   Source AI   Conf. Int. Low   Conf. Int. High
0.1                    0.300000    0.128399         0.471601
0.2        1.230768    0.290958    0.152504         0.429411
0.3        1.388859    0.277678    0.149304         0.406051
0.4        1.549946    0.262418    0.139183         0.385653
0.5        1.666666    0.227943    0.109150         0.346736
0.6        1.789528    0.210472    0.095888         0.325057
0.7        1.882355    0.260502    0.137852         0.383152
0.8        1.954555    0.445445    0.328038         0.562851
0.9        2.181822    0.318235    0.190531         0.445939

Notes. (p < 0.1).
Appendix B. Supplementary Regression Tables

Purchase Intent
Predictors                         Estimates   CI               p
(Intercept)                        0.40        [0.38, 0.43]     <0.001
Source [Zero-Shot]                 0.06        [0.03, 0.09]     <0.001
Source [Few-Shot]                  0.09        [0.06, 0.12]     <0.001
Random Effects
σ²                                 0.07
τ00 IdeaID                         0.01
τ00 RespondentID                   0.02
τ11 IdeaID.SourceZero-Shot         0.01
τ11 IdeaID.SourceFew-Shot          0.03
τ11 RespondentID.SourceZero-Shot   0.01
τ11 RespondentID.SourceFew-Shot    0.01
ρ01                                -0.64, -0.97, -0.06, -0.28
ICC                                0.28
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²       0.014 / 0.290
Purchase Intent (Alternative Weights)
Predictors                         Estimates   CI               p
(Intercept)                        0.08        [0.07, 0.08]     <0.001
Source [Zero-Shot]                 0.02        [0.01, 0.03]     <0.001
Source [Few-Shot]                  0.03        [0.02, 0.04]     <0.001
Random Effects
σ²                                 0.01
τ00 IdeaID                         0.00
τ00 RespondentID                   0.00
τ11 IdeaID.SourceZero-Shot         0.00
τ11 IdeaID.SourceFew-Shot          0.00
τ11 RespondentID.SourceZero-Shot   0.00
τ11 RespondentID.SourceFew-Shot    0.00
ρ01                                0.04, 0.48, -0.00, -0.16
ICC                                0.21
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²       0.009 / 0.215
Purchase Intent (Simple)
Predictors               Estimates   CI             p
(Intercept)              1.62        [1.55, 1.68]   <0.001
Source [Zero-Shot]       0.26        [0.15, 0.37]   <0.001
Source [Few-Shot]        0.36        [0.25, 0.47]   <0.001
Observations
R² / Adjusted R²         0.108 / 0.104
Purchase Intent (no weights, zero-shot baseline)
Predictors                        Estimates   CI                p
(Intercept)                       1.85        [1.73, 1.98]      <0.001
Source [Student]                  -0.24       [-0.35, -0.12]    <0.001
Source [Few-Shot]                 0.12        [-0.00, 0.24]     0.058
Random Effects
σ²                                1.18
τ00 IdeaID                        0.12
τ00 RespondentID                  0.42
τ11 IdeaID.SourceStudent          0.33
τ11 IdeaID.SourceFew-Shot         0.12
τ11 RespondentID.SourceStudent    0.13
τ11 RespondentID.SourceFew-Shot   0.01
ρ01                               -0.77, -0.36, -0.50, -0.99
ICC                               0.31
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²      0.014 / 0.322
Purchase Intent (ordered logistic regression)
Predictors             Odds Ratios   CI                p
0|1                    0.25          [0.21, 0.29]      <0.001
1|2                    1.06          [0.90, 1.26]      0.484
2|3                    3.01          [2.53, 3.57]      <0.001
3|4                    19.07         [15.84, 22.97]    <0.001
Source [Zero-Shot]     1.48          [1.24, 1.78]      <0.001
Source [Few-Shot]      1.79          [1.49, 2.14]      <0.001
Random Effects
σ²                     3.29
τ00 IdeaID             0.39
τ00 RespondentID       0.92
ICC                    0.28
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²   0.014 / 0.294
Novelty
Predictors                         Estimates   CI                p
(Intercept)                        0.41        [0.39, 0.43]      <0.001
Source [Zero-Shot]                 -0.04       [-0.07, -0.01]    0.008
Source [Few-Shot]                  -0.05       [-0.08, -0.02]    <0.001
Random Effects
σ²                                 0.05
τ00 IdeaID                         0.01
τ00 RespondentID                   0.01
τ11 IdeaID.SourceZero-Shot         0.02
τ11 IdeaID.SourceFew-Shot          0.03
τ11 RespondentID.SourceZero-Shot   0.01
τ11 RespondentID.SourceFew-Shot    0.01
ρ01                                -0.87, -0.99, 0.14, 0.06
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²       0.009 / NA
Novelty (Simple)
Predictors               Estimates   CI                p
(Intercept)              1.64        [1.58, 1.70]      <0.001
Source [Zero-Shot]       -0.18       [-0.29, -0.07]    <0.001
Source [Few-Shot]        -0.20       [-0.31, -0.09]    <0.001
Observations
R² / Adjusted R²         0.042 / 0.037
Novelty (no weights, zero-shot baseline)
Predictors                        Estimates   CI               p
(Intercept)                       1.48        [1.37, 1.59]     <0.001
Source [Student]                  0.15        [0.05, 0.26]     0.004
Source [Few-Shot]                 -0.04       [-0.16, 0.07]    0.493
Random Effects
σ²                                0.90
τ00 IdeaID                        0.11
τ00 RespondentID                  0.35
τ11 IdeaID.SourceStudent          0.33
τ11 IdeaID.SourceFew-Shot         0.44
τ11 RespondentID.SourceStudent    0.03
τ11 RespondentID.SourceFew-Shot   0.04
ρ01                               -0.71, -0.98, -1.00, -0.23
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²      0.009 / NA
Novelty (zero-shot baseline)
Predictors                        Estimates   CI               p
(Intercept)                       0.37        [0.34, 0.40]     <0.001
Source [Student]                  0.04        [0.01, 0.07]     0.008
Source [Few-Shot]                 -0.01       [-0.04, 0.02]    0.449
Random Effects
σ²                                0.05
τ00 IdeaID                        0.01
τ00 RespondentID                  0.02
τ11 IdeaID.SourceStudent          0.02
τ11 IdeaID.SourceFew-Shot         0.03
τ11 RespondentID.SourceStudent    0.01
τ11 RespondentID.SourceFew-Shot   0.00
ρ01                               -0.74, -1.00, -0.69, -0.95
ICC                               0.35
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²      0.006 / 0.356
Novelty (ordered logistic regression)
Predictors             Odds Ratios   CI                p
0|1                    0.16          [0.13, 0.19]      <0.001
1|2                    0.87          [0.72, 1.04]      0.118
2|3                    4.51          [3.76, 5.43]      <0.001
3|4                    29.60         [24.09, 36.37]    <0.001
Source [Zero-Shot]     0.74          [0.60, 0.91]      0.004
Source [Few-Shot]      0.68          [0.55, 0.84]      <0.001
Random Effects
σ²                     3.29
τ00 IdeaID             0.57
τ00 RespondentID       0.91
ICC                    0.31
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²   0.006 / 0.315