Using Large Language Models for Idea Generation in Innovation

Lennart Meincke
Operations, Information and Decisions, The Wharton School, University of Pennsylvania, lennart@wharton.upenn.edu

Karan Girotra
Cornell Tech and Johnson College of Business, Cornell University, girotra@cornell.edu

Gideon Nave
Marketing, The Wharton School, University of Pennsylvania, gnave@wharton.upenn.edu

Christian Terwiesch
Operations, Information and Decisions, The Wharton School, University of Pennsylvania, terwiesch@wharton.upenn.edu

Karl T. Ulrich
Operations, Information and Decisions, The Wharton School, University of Pennsylvania, ulrich@wharton.upenn.edu
This research evaluates the efficacy of large language models (LLMs) in generating new product ideas. To do so, we compare three pools of ideas for new products targeted toward college students priced at $50 or less. The first pool of ideas was created by university students in a product design course before the availability of LLMs. The second and third pools of ideas were generated by OpenAI's GPT-4 using zero-shot and few-shot prompting, respectively. We evaluated idea quality using standard market research techniques to predict average purchase intent probability. We used text mining to assess idea similarity and human raters to evaluate idea novelty. We find that AI-generated ideas outperform human-generated ideas in terms of average purchase intent, with few-shot prompting yielding slightly higher intent than zero-shot prompting. However, AI-generated ideas are perceived as less novel and exhibit higher pairwise similarity, particularly with few-shot prompting, indicating a less diverse solution landscape. When focusing on the quality of the best ideas (rather than on the average ideas), we find that AI-generated ideas are seven times more likely to rank among the top 10% of ideas, demonstrating a significant advantage over human-generated ideas. We propose that this 7:1 advantage is a conservative estimate, as it does not account for AI's greater productivity. Our findings suggest that despite some drawbacks, AI creativity presents a substantial benefit in generating high-quality ideas for new product development.

Funding: Funding was provided by the Mack Institute for Innovation Management at the Wharton School of the University of Pennsylvania.

Key words: innovation; idea generation; creativity; creative problem solving; LLM; large-scale language models; AI; artificial intelligence; ChatGPT; GPT
1. Introduction

Generative artificial intelligence (GenAI) has remarkably advanced in creating life-like images and coherent, fluent text. OpenAI's ChatGPT chatbot, based on the Generative Pre-trained Transformer (GPT) series of large language models (LLM), can equal or surpass human performance in academic examinations and tests for professional certifications (OpenAI et al. 2023). Moreover, LLMs can provide valuable professional advice in fields like software development, medicine, and law.

Despite their remarkable performance, LLMs sometimes produce text that is semantically or syntactically plausible but is, in fact, factually incorrect or nonsensical, a phenomenon often referred to as "hallucinations." This outcome is a byproduct of how LLMs are designed, as they are optimized to generate the most statistically likely sequences of words with an intentional injection of randomness. In most applications, this randomness and the associated hallucinations and inconsistencies create problems that limit the use of LLM-based solutions to low-stakes settings, or they require extensive human supervision.

But are there applications in which we can leverage the weaknesses of hallucinations and inconsistent quality and turn them into a strength? We propose that the domain of creativity and innovation provides such an application. This domain operates quite differently than most management settings, where we commonly expect to use each unit of work produced. As such, consistency is prized and is, therefore, the focus of contemporary performance management. Erratic and inconsistent behavior is to be eliminated. An airline would rather hire a pilot who executes a within-safety-margins landing 10 out of 10 times than one who makes a brilliant approach five times and an unsafe approach another five. But, when it comes to creativity and innovation, say finding a new opportunity to improve the air travel experience or launching a new aviation venture, the same airline would prefer an ideator who generates one brilliant idea and nine nonsense ideas over one who generates ten decent ideas.

The reason for this difference is that when it comes to creativity and innovation, the performance of the process is not determined by the sum or the average of all ideas created. Instead, each idea is seen as a real option that the decision maker can decide to execute (Huchzermeier and Loch 2001). Thus, the performance of the process is determined by the quality of the best idea(s) (Dahan and Mendelson 2001, Terwiesch and Xu 2008, Terwiesch and Ulrich 2009, Girotra et al. 2010). The process of innovation can thereby be thought of as a search process that generates ideas with random quality values by drawing from an underlying stochastic distribution until the cost of creating one additional draw from the distribution (e.g., creating one more product concept or building one more prototype) exceeds the marginal benefit (Weitzman 1979).

Prior research in product development and innovation has modeled various aspects of this search process, including the pros and cons of parallel search (Loch et al. 2001), the tension between sampling from very different regions of the pay-off distributions ("selectionism") versus locally improving a given project (Sommer and Loch 2004), and the need for building balanced portfolios that consist of different types of projects (Chao and Kavadias 2008).
We follow this line of research and consider a setting in which ideas of unknown quality are created, and the quality of the best few ideas determines the overall performance. This could be a setting of corporate portfolio planning in a large established organization as described by Si et al. (2022). However, to facilitate our experimental design, we focus on idea generation in the product development process for a newly formed venture. Specifically, we look for a product idea that targets the college student market and can be sold for $50 or less. This innovation challenge is similar to the study settings used in prior work (e.g., Osborn 1953, Connolly et al. 1990, Sutton and Hargadon 1996, Girotra et al. 2010) to evaluate and compare various brainstorming methods (e.g., group vs. individual; nominal groups vs. hybrid groups).

In contrast to this prior work, we consider ideas generated by humans and ideas generated by artificial intelligence (AI) in the form of OpenAI's GPT-4. As discussed above, LLMs are designed to generate new content, and in the domain of brainstorming, their stochastic (if not outright erratic) behavior might turn a bug into a feature. Thus, we hypothesize that LLMs have the potential to be excellent ideators. The purpose of this paper, therefore, is to formally test this hypothesis by comparing the performance of LLMs in generating new ideas to that of human idea generators.

Specifically, we compare three pools of ideas for new products targeted toward college students at a price of USD 50 or less. The first pool of ideas was created by students at an elite university enrolled in a course on product design before the availability of LLMs. The second pool was generated by OpenAI's GPT-4 with the same prompt as that given to the students and no other guidance (zero-shot prompting). The third pool was generated by prompting GPT-4 with the same prompt as that given to the students and a sample of highly rated ideas to enable some in-context learning (few-shot prompting). We evaluate the quality of the ideas using standard market research techniques and survey human respondents to predict an average purchase intent probability for each product, which we use as our measure of idea quality. We use text mining techniques to evaluate the similarities of ideas and rely on human raters to assess idea novelty.

This comparison between human idea generation and AI-based idea generation allows us to contribute to the innovation literature by establishing the following novel results.

First, AI-generated ideas are, on average, significantly better (average purchase intent of 0.48 relative to 0.40 for human-generated ideas), especially in the case of few-shot prompting (average purchase intent of 0.49 relative to 0.46 for zero-shot prompting), as shown in Study 1.

Second, despite this success, consumers perceive AI-generated ideas as less novel (perceived novelty of 0.36 relative to 0.41). Moreover, AI-generated ideas are more likely to overlap: text mining reveals that the average pairwise similarity of ideas is higher among AI-generated ideas and further increases when using few-shot prompting. As a result, the underlying solution landscape is less likely to be fully explored (Study 2).
Finally, we show that for a given number of ideas, the quality of the best ideas generated by AI is significantly greater than that of the best ideas generated by humans (Study 3). Specifically, we show that AI-generated ideas are seven times more likely to be among the top 10% of ideas generated in our experiment. This is significant given the context. What matters for innovation is the quality of the best idea. The objective of idea generation is to generate at least a few truly exceptional ideas. In most innovation settings, we would rather have 10 great ideas and 90 terrible ideas than 100 ideas of solid quality. Holding the number of ideas constant, we need to trade off the advantageous effect of higher average idea quality (Study 1) against the disadvantages of less novelty, more overlapping ideas, and fewer ideas that can be discovered (Study 2). Study 3 clearly establishes AI's supremacy over humans in this respect.

A quarter of a century ago, Goldenberg et al. (1999) asked whether AI-generated ideas can finally compete with human ones, long after researchers had first considered the possibility. We believe that the three studies presented in this article provide empirical support for an affirmative answer to this question. From a practical perspective, we see the 7:1 advantage of AI creativity over human creativity as a conservative estimate, as we did not credit AI for its substantially greater productivity.

The remainder of the article is organized as follows. After reviewing some recent work on GenAI and creativity (Section 2), we introduce our theoretical framework and our hypotheses (Section 3), followed by the technical set-up of our experiments (Section 4). We conducted three studies to assess the creativity of human- and AI-generated ideas. First, in Study 1 we ask human participants to rate ideas from both sources (human- and AI-generated) and compare the results (Section 5). Second, in Study 2 we use text-based analysis to calculate how many unique ideas can be created by humans and LLMs in our specific domain, and we also ask human participants to rate the novelty of ideas from both sources and compare the results (Section 6). Third, in Study 3 we look at the extremes of the idea-quality distribution to identify possible advantages for the best ideas by either humans or AI (Section 7). We conclude the paper by discussing potential limitations of our studies, their robustness to alternative specifications (Section 8), and the implications of our findings (Section 9).
2. GenAI applications to creative tasks

Research to date has demonstrated three key findings regarding AI's role in creativity and innovation. First, AI frequently matches or exceeds human performance in creative tasks. Haase and Hanel (2023) found that LLMs have reached human-level performance in divergent thinking tasks such as the Alternative Uses Task (AUT). This is supported by Hubert et al. (2024), who studied GPT-4 responses for the Consequences Task and Divergent Association Tasks, finding that AI is more creative than humans across all measured dimensions. While Koivisto and Grassini (2023) find that AI chatbots outperform average human performance in the AUT, they also note that the most exceptional human ideas still match or exceed those generated by AI.

Second, studies show that AI aids in improving creative outcomes for humans when used as a tool. Doshi and Hauser (2024) find that AI use helps humans create more creative and enjoyable short stories. However, the collective diversity decreases and stories become more similar to one another. Similarly, Jia et al. (2024) found that AI assistance boosted employee creativity in a telemarketing company when responding to customer questions, ultimately increasing sales. Zhou and Lee (2024) show that integrating text-to-image AI into creative workflows increased the number of artworks created by 25% and raised the likelihood of the works receiving favorites per view by 50%, highlighting the benefits of LLMs augmenting human workflows ("human in the loop").

Third, studies have explored human preferences for AI-generated versus human-generated creations, often finding that people prefer human involvement. For instance, Hitsuwari et al. (2023) found that survey participants cannot distinguish between AI-generated and human-generated haikus, but rated poems co-created by humans and AI as the most beautiful, with no significant preference for haikus created solely by humans or AI. Bellaiche et al. (2023) provide evidence that humans prefer human involvement in art creation by showing that participants prefer AI-generated art falsely labeled as created by humans to the same art correctly labeled as AI-generated, suggesting a bias for human involvement in the creative process. Similarly, Shank et al. (2023) find comparable results for AI-generated classical music, although no such preference was found for electronic music. However, Zlatkov et al. (2023) found no significant preference for either AI- or human-generated music overall.

Taken together, this body of research illustrates the potency of AI in creative tasks. AI not only matches human creativity but also improves human performance when used as a collaborative tool. However, at least when considering artistic outcomes, there remains a human preference for creativity that involves a human touch. This growing evidence suggests a natural next step: evaluating AI's efficacy in innovation management in general and in idea generation in particular, where artistic preferences are less important, while carefully examining potential issues such as less diverse ideas.
3. Theoretical Framework and Hypotheses

To understand GenAI's ability to tackle various creative tasks, we must first conceptualize creativity. The literature distinguishes between three dimensions of creativity. Fluency is the ability to generate many ideas or solutions to a problem. It reflects the quantity of generated ideas. Flexibility is the capacity to produce a variety of ideas or solutions, showing an ability to shift approaches or perspectives. And, originality is the ability to produce novel and unique ideas (Guilford 1967, Torrance 1968). In addition, the brainstorming literature often considers idea quality as a fourth dimension of creativity. We omit fluency as a performance metric, as comparing the number of ideas or the speed of idea generation between a computer and a human will lead to the obvious result that the computer displays greater fluency, creating more ideas per unit of time. This leaves us with idea quality, flexibility, and originality as the dimensions of comparison between humans and AI.

The atomic unit of analysis in this comparison is an idea. In the context of innovation, we define an idea as a novel match between a solution and a need. As mentioned above, across three studies we will ask students as well as GenAI to come up with new product ideas targeted toward college students that can be sold for $50 or less. To illustrate our unit of analysis, consider one of the student-generated ideas:

"Convertible High-Heel Shoe: Many prefer high-heel shoes for dress-up occasions, yet walking in high heels for more than short distances is very challenging. Might we create a stylish high-heel shoe that easily adapts to a comfortable walking configuration, say by folding down or removing a heel portion of the shoe?"

In this example, the need is the desire of some people to dress up and wear high-heeled shoes for some occasions while still walking comfortably. The proposed solution is to make the heel portion of the shoe so that it can be folded down or removed.

Idea generation, by either individuals or groups, is a process that creates a stream of ideas with varying quality levels. This stream can be the result of either human effort or the use of AI. Each of these ideas can be evaluated on a quality scale. Our quality scale is based on a purchase intent study. Kornish and Ulrich (2014) show that the best indicator of future value creation is the average purchase intent expressed by a sample of consumers in the target market. Furthermore, they show that no single individual, expert or novice, is particularly good at estimating value. Instead, a sample of expressed purchase intent from about 15 individuals in the target market is a reliable measure of idea quality.

Some ideas are likely to be brilliant (high-quality), some are horrible (low-quality), and most will be somewhere in between (medium-quality). We can think of this uncertain quality value as a random variable drawn from an underlying pay-off distribution (Weitzman 1979, Dahan and Mendelson 2001).

Recall that we chose to measure three dimensions of creativity associated with idea generation: quality, flexibility, and originality. Our first hypothesis relates to the first dimension: AI's ability to generate ideas comparable in their average quality to human-generated ideas. In other words, we focus on the mean of the underlying idea-quality distribution. We make two arguments for why GPT-4 would create ideas of higher average quality than humans. First, the training data for GPT-4 includes millions of product reviews revealing unmet user needs, social media posts of excited and frustrated customers alike, and marketing materials for countless products that have been launched more or less successfully in the past. Second, the literature reviewed in Section 2 has established that GPT-4 has tremendous creative capabilities in other domains such as music generation or story writing.

Hypothesis 1 (Idea quality): The average quality of AI-generated ideas is higher than the average quality of human-generated ideas.
Our second hypothesis relates to the other two dimensions: flexibility and originality. We first define these concepts in the context of generating ideas for new products and develop appropriate measurement scales.

There exists a vast number of possible new product ideas that differ along many dimensions. We can think of ideas as positions in a high-dimensional space. OpenAI's GPT-4 models text as multi-dimensional embedding vectors in this space, where each dimension may represent a distinct attribute or feature of the text. Such vectors have hundreds of dimensions. Similar texts will often lie close to each other, while different ones will be far apart. However, interpreting the distances and dimensions is often not straightforward given the high dimensionality.

To illustrate, consider a two-dimensional search space like the map of a territory. For example, consider the exploration of such a territory in the search for fishing spots in the ocean. The (x, y) coordinates capture the geographic locations of schools of fish. Each location has a pay-off corresponding to the amount of fish in the water. The goal of the fisherman is to find the location with the greatest fish density. In such a search process, local adjustments along a gradient of increasing fish density may increase the value of a fishing location. Yet, in rugged solution landscapes, i.e., ones that have multiple local optima, such local search is unlikely to yield the globally optimal solution.

Thereby, the ruggedness of the underlying solution landscape makes it impossible to arrive at the most valuable fishing location (idea) in the ocean (idea space) via local adjustments. Rather, a broad exploration is needed (see Sommer and Loch 2004). Without prior knowledge about the landscape, some new locations that are very different from past locations should be explored. This creates the classic trade-off between exploration and exploitation (March 1991).

With this as our backdrop, we provide two ways of operationalizing flexibility (overlap and the total number of discoverable ideas) and one way to operationalize originality (idea novelty). All three are important properties of a search process in general and of an ideation process in particular.

To explain overlap, let's return to our fishing example. To explore fishing locations in an ocean, the locations should be distinctively different from each other. Even in a rugged solution landscape, some spatial correlations in pay-offs between two adjacent coordinates are likely. In much the same way, in the world of innovation, we want our ideas to be distinct from each other. To determine how distinctly different an idea is relative to other ideas, we measure the cosine similarity of its embedding vector relative to the embedding vectors of the other ideas (following Cox et al. 2021 and Dell'Acqua et al. 2023). Section 8 provides alternative measures to this analytical choice. For a given pool of ideas produced by an idea-generation process, human or AI, we can thus randomly pull out two ideas and compute the angle between the two associated embedding vectors. The cosine of such angles will range from -1 to 1, with 1 indicating identical vectors and 0 indicating no similarity (orthogonal vectors). While negative values are possible in principle, they rarely occur in practice, as further discussed in Study 2. By performing a pairwise comparison of all ideas and averaging their similarities, we can compute the average pool similarity. Next, we define two ideas as overlapping if their cosine similarity is above θ = 0.8. That is, we count any new idea added to the pool as overlapping if its cosine similarity exceeds 0.8 compared to any of the existing ideas in the pool. Our first measure of flexibility is based on computing the distribution of pairwise cosine similarities and counting the frequency of overlaps. We discuss this and other assumptions in Section 8 and provide extensive robustness analyses, including evaluating alternative model specifications.
Next, imagine a fisherman with no memory looking for fish at random locations. Every period, this fisherman sets out and fishes, yielding an estimate for the pay-off of a specific location. How many unique fishing locations will be discovered this way? Early in the exploratory efforts, every fishing spot is unexplored territory. Yet, as this process goes on, the likelihood of overlap increases, i.e., the fisherman is more likely to revisit a location previously tested. Given our definition of overlapping ideas (cosine similarity exceeding the θ = 0.8 threshold), we can observe a stream of incoming ideas, one by one, and determine whether a new idea is unique relative to the pool of ideas created up to this point. Early on, just like in the fisherman's case, each idea is likely unique (non-overlapping with the ideas created so far). However, as the process progresses, the percentage of overlapping ideas will increase as the underlying search space gets exhausted. For a finite sequence of T ideas, we can evaluate the number of overlapping ideas, Noverlap, and thus compute the number of unique ideas, Nunique = T - Noverlap. Definitions for how we operationalize this approach are shown in Study 2.
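To make the sequential definition concrete, the following Python sketch counts overlapping and unique ideas in a stream of idea embeddings. It illustrates the θ = 0.8 rule rather than reproducing our exact analysis pipeline; the function name and the use of NumPy are choices made here for illustration.

```python
import numpy as np

def count_unique_ideas(embeddings: np.ndarray, theta: float = 0.8) -> int:
    """Return Nunique = T - Noverlap for a stream of T idea embeddings.

    An incoming idea counts as overlapping if its cosine similarity to
    any previously generated idea exceeds theta. Rows of `embeddings`
    are assumed to be in generation order.
    """
    # Normalize rows so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n_overlap = 0
    for t in range(1, len(normed)):
        # Compare idea t against all ideas generated before it.
        if np.max(normed[:t] @ normed[t]) > theta:
            n_overlap += 1
    return len(normed) - n_overlap
```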
In addition to utilizing idea overlap for computing the number of unique ideas in a finite stream of ideas, we can further estimate the total number of discoverable ideas in the search space, even if many were not part of the sequence of T ideas, i.e., the ideas have not (yet) been discovered. To do so, we use what in population ecology is known as a capture-recapture model, used here to estimate the number of unique fishing locations based on how frequently a previously visited location is revisited by a fisherman with no memory. With such a model, we simply count the incidents of an idea overlapping with a past idea. The frequency of overlap and its increasing occurrence rate over time allow for estimating the number of ideas that can be discovered (Kornish and Ulrich 2011). This provides us with our second measure of flexibility.

Next, consider originality. The search for ideas can yield ideas that are more or less novel. We measure idea novelty in the same way we measure idea quality: by directly asking potential customers to assess an idea's novelty and averaging this value. In summary, we evaluate flexibility by looking at idea overlap (which can be converted into an estimate of the number of ideas that can be discovered) and evaluate originality by directly asking consumers to rate novelty.
How will a pool of AI-generated ideas compare to human-generated ideas in terms of quality, flexibility, and originality? By their very design, GPTs are autoregressive processes. They don't plan ahead but predict one word (or token) at a time based on a context window, including the prompt and the prior words created. Such a one-word-at-a-time process is unlikely to systematically and exhaustively explore an entire solution landscape. This lack of broad exploration will be further amplified in the presence of a system prompt that illustrates the concept of ideas by providing one or multiple ideas from the past (few-shot prompting) relative to the case in which no past ideas are provided (zero-shot prompting). This should limit both the flexibility and the originality of the creative process. These arguments, taken together with existing research in other domains showing less novelty for AI-generated content versus human-generated content (Doshi and Hauser 2024), lead to the following two hypotheses:

Hypothesis 2a (Flexibility): The likelihood of two ideas overlapping is higher for a pool of AI-generated ideas than for a pool of human-generated ideas, resulting in fewer discoverable ideas.

Hypothesis 2b (Originality): The average novelty of AI-generated ideas is lower than that of human-generated ideas.
Our third hypothesis returns to the concept of idea quality. This time, however, we are not concerned with the average idea quality but instead focus on the quality of the best ideas. Rather than focusing on the quality of the single best idea (the extreme value, Dahan and Mendelson 2001), we focus on the 90th percentile of the idea-quality distribution, i.e., the top 10 percent of the ideas. We do so for two reasons.

The first reason is statistical estimation: for a single experiment like ours, there simply does not exist a test that allows us to make statistically significant statements for a single data point. Moving to the 90th percentile, we can compare the mean across larger groups of ideas (Section 8 presents our results for other percentiles).

There also exists a second, managerial reason. In many, if not most, practical settings, the assessment of idea quality is noisy, especially in the early stages of an innovation process when an idea is nothing but a title and a few words. For this reason, innovation tournaments don't just advance a single idea to the next round, but a set of the x percent most promising ideas, where x can vary widely but typically ranges between 10 and 50 percent (Terwiesch and Ulrich 2009). We therefore state:

Hypothesis 3 (Top Decile): The quality of the 90th percentile of AI-generated ideas is higher than that of the 90th percentile of human-generated ideas.
4. Experimental setup

For our experiment, we utilize three different pools of ideas, namely student-generated ideas, GPT-4-generated ideas with zero-shot prompting, and GPT-4-generated ideas with few-shot prompting. For the student pool, we rely on data collected in 2021 in a product design and innovation course at an elite university. In this course, 50 students participated in an innovation challenge to come up with ideas for a physical product marketed to college students for $50 or less (this price cap is imposed to limit the complexity of the projects in a one-semester course). The challenge was organized in a traditional innovation tournament format (Terwiesch and Ulrich 2009, 2023), in which individuals first independently generate many ideas, which are then combined into a pool of several hundred ideas and subsequently evaluated by others in the group (i.e., crowdsourced evaluations). Thus, we have access to a large set of ideas generated by humans before AI tools became widely available to enhance ideation.

Specifically, we use a pool of independently generated human ideas by randomly selecting 200 entries, each comprising a descriptive title and a paragraph of text, from the student ideas generated in these challenges in 2021 (i.e., at a time prior to the widespread availability of ChatGPT and other LLMs). The set of 200 ideas constitutes our first pool and forms the baseline for comparison with the ideas generated using LLMs. We prompt OpenAI's GPT-4 (more specifically, gpt-4-0314) with the same prompt we gave the students. No LLM yet acts entirely autonomously. Rather, they are tools used by humans to complete tasks. For this study, we aim for minimal prompt engineering, thus representing a novice user scenario. However, we acknowledge that many strategies could potentially improve LLM performance. For instance, Mihm and Schlapp (2019) show that providing feedback during ideation contests can further improve the performance of human innovators, and we expect this to hold for LLMs as well.

For our first LLM-generated idea pool, we use the system prompt to provide contextual information and subsequent user prompts to ask for ideas, ten at a time. The user prompt includes the additional request that the descriptions be 40-80 words, like the student sample.

System Prompt
You are a creative entrepreneur looking to generate new product ideas. The product will target college students in the United States. It should be a physical good, not a service or software. I'd like a product that could be sold at a retail price of less than about USD 50. The ideas are just ideas. The product need not yet exist, nor may it necessarily be clearly feasible. Number all ideas and give them a name. The name and idea are separated by a colon.
User Prompt
Please generate ten ideas as ten separate paragraphs. The idea should be expressed as a paragraph of 40-80 words.

The model used for all work covered in this paper is gpt-4-0314 with the temperature parameter at 0.7 to retain randomness and thus greater creativity. The temperature parameter controls the randomness of the output, with lower values leading to more deterministic output and higher values increasing variability. At the time of the experiment, the suggested default value for temperature was 0.7 to strike a balance between coherence and creativity, without sampling highly unlikely tokens (i.e., semantic chunks used for representational efficiency) that lead to undesirable responses.

An obstacle to using GPT-4 for generating hundreds of ideas is its finite memory, typically limited to the number of tokens the underlying LLM can consider in generating its responses. Once the number of tokens in a session exceeds the model's limit, the LLM has no memory of the first ideas generated, and subsequent ideas can become increasingly redundant. The token limit in the version of GPT-4 we had access to was about 8,000 tokens, roughly 7,000 words or approximately 80 ideas (some tokens are used for the system and user prompt and idea titles).

To generate more than the roughly 80 ideas permitted by the limited context window, we asked GPT-4 to compress the previously generated ideas into shorter summaries. These summaries were then provided to the model before generating the next batch of ideas, ensuring that the model knows the previously generated ideas while remaining within the context limits. We used the summarization prompt below, followed by the original system prompt and generated summaries, and finally, a user prompt that explicitly asks for different ideas. This constitutes our second pool of comparison.

Summarization Prompt
Aggressively compress the following ideas so that their original meaning remains but they are much shorter. You can use tags or keywords. [Ideas generated so far]

System Prompt
[Original System Prompt] Previously you generated the following ideas and should not repeat them: [Summaries]

User Prompt
[Original User Prompt] Make sure they are different from the previous ideas.
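For readers who want to reproduce this batched generation loop, the Python sketch below shows one way to wire the prompts together. It is illustrative only: it uses the current OpenAI client library (which postdates our data collection), abbreviates the system prompt shown above, and the loop structure and helper names are our own; only the model name, temperature, and prompt texts come from the setup described in this section.

```python
from openai import OpenAI  # current client, shown for illustration

client = OpenAI()
MODEL = "gpt-4-0314"  # model used in the paper (since deprecated)

SYSTEM_PROMPT = "You are a creative entrepreneur looking to generate new product ideas. ..."  # full text above
USER_PROMPT = ("Please generate ten ideas as ten separate paragraphs. "
               "The idea should be expressed as a paragraph of 40-80 words.")
SUMMARIZE_PROMPT = ("Aggressively compress the following ideas so that their original "
                    "meaning remains but they are much shorter. You can use tags or keywords. ")

def chat(messages: list[dict]) -> str:
    response = client.chat.completions.create(model=MODEL, messages=messages, temperature=0.7)
    return response.choices[0].message.content

batches, summaries = [], []
for _ in range(10):  # ten batches of ten ideas each -> 100 ideas
    system, user = SYSTEM_PROMPT, USER_PROMPT
    if summaries:  # remind the model of earlier batches via compressed summaries
        system += ("\nPreviously you generated the following ideas and should not repeat them:\n"
                   + "\n".join(summaries))
        user += " Make sure they are different from the previous ideas."
    batch = chat([{"role": "system", "content": system},
                  {"role": "user", "content": user}])
    batches.append(batch)
    # Compress this batch so the running history stays within the ~8K-token window.
    summaries.append(chat([{"role": "user", "content": SUMMARIZE_PROMPT + batch}]))
```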
For our second pool of LLM-generated ideas, we provide the LLM with examples (few-shot learning) of high-quality ideas generated by students. In particular, we appended our prompts to provide the LLM with six highly rated ideas from a separate student set that completed the same exercise and informed GPT-4 that these ideas had been well received by students in our class. We used six examples due to context window limitations at the time of the experiment, as well as drawing on previous experiments in in-context few-shot learning where too many examples can degrade performance (see Meincke and Carton 2024). This constitutes our third pool of comparison.

Good Ideas Prompt
[Original System Prompt] Here are some well-received ideas for inspiration: [Good Ideas]

Overall, we generated 100 ideas using zero-shot prompting and another 100 using few-shot prompting. The resulting average word count is 69 for GPT-4-generated ideas and 71 for GPT-4 provided with examples; the average description is 63 words long for student ideas. We compared the resulting few-shot prompted ideas to the examples provided to ensure that GPT-4 did not simply slightly modify the examples. The average pairwise cosine similarity between the six examples and the 100 generated ideas is 0.33, and the highest similarity between two ideas is 0.51. Thus, we have no reason to believe that GPT-4 repeated the provided ideas.
5. Study 1: comparing the quality of ideas generated by humans and AI

The Institutional Review Board (IRB) at the University of Pennsylvania approved the research described in this paper in May 2023, Protocol 853634. We used the online platform Prolific to recruit college-age individuals from the United States to evaluate all 400 ideas from the three pools (pool 1 with 200 ideas created by humans, pool 2 with 100 created by GPT-4 with zero-shot prompting, and pool 3 with 100 created by GPT-4 with few-shot prompting) via a purchase intent survey. We presented ideas in random order and randomized at the idea level, meaning that every survey participant could potentially see ideas from multiple sources. Each respondent evaluated an average of 40 ideas. On average, each idea was evaluated 20 times. In the summer of 2023, concerns surfaced that ChatGPT was being used to provide mTurk responses. This practice appears to have been limited to text generation tasks, not to multiple-choice tasks like our five-box purchase-intent survey. Indeed, just answering the survey question directly requires less effort than trying to deploy ChatGPT to answer the question. We thus believe that our study participants were humans.

We asked respondents to express purchase intent using the standard five-box options: definitely would not purchase, probably would not purchase, might or might not purchase, probably would purchase, and definitely would purchase. Jamieson and Bass (1989) recommend weighting the five possible responses as 0, 0.25, 0.50, 0.75, and 1.00 to develop a single measure of purchase probability, which we use as a measure of idea quality (other weightings are possible, as we discuss in Section 8). Figure 1 shows the full quality distribution of ideas generated by the three pools.
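As an illustration of this weighting scheme, the short Python sketch below converts one idea's five-box responses into the purchase-probability score we use as idea quality. The function and variable names are ours, and the sample ratings are made up.

```python
import numpy as np

# Five-box purchase-intent weights per Jamieson and Bass (1989).
WEIGHTS = {
    "definitely would not purchase": 0.00,
    "probably would not purchase": 0.25,
    "might or might not purchase": 0.50,
    "probably would purchase": 0.75,
    "definitely would purchase": 1.00,
}

def purchase_intent(responses: list[str]) -> float:
    """Average weighted purchase probability across one idea's respondents."""
    return float(np.mean([WEIGHTS[r] for r in responses]))

# Example: one idea rated by five respondents scores (0.75 + 0.50 + 1.00 + 0.25 + 0.75) / 5.
ratings = ["probably would purchase", "might or might not purchase",
           "definitely would purchase", "probably would not purchase",
           "probably would purchase"]
print(purchase_intent(ratings))  # 0.65
```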
Figure 1   Distribution of idea quality for three sets of ideas
Notes. Purchase intent is the weighted average of the five-box response scale per Jamieson and Bass (1989).
Figure 1 shows the quality (purchase probability) of ideas across the three pools. On average, GPT-4 generated ideas with greater purchase intent (46.4% with zero-shot prompting and 49.3% with few-shot prompting) than humans (40.4%). The standard deviation of the quality of ideas is comparable across the three pools. We formally test the impact of idea source on the perceived quality of product ideas via a linear mixed-effects model with purchase intent as the dependent variable. The model included two fixed effects denoting the source (humans are the baseline) and random intercepts and slopes for respondents and ideas. We find significant differences in the perceived quality of ideas as a function of their source. Ideas generated by GPT-4 with no examples (zero-shot) were rated significantly higher than human-generated ideas (β = 0.059; 95% CI [0.031, 0.088]; t(246) = 4.06, p < 0.001), and ideas generated by GPT-4 provided with positive examples (few-shot) received even higher ratings (β = 0.089; 95% CI [0.060, 0.12]; t(223) = 5.93, p < 0.001). Purchase intent is weakly significantly different between the two pools of LLM-generated ideas (β = 0.03; 95% CI [-0.01, 0.06]; t(199) = 1.892, p = 0.06). These findings indicate that LLM-generated ideas are, on average, more likely to be purchased than human-generated ideas (for additional robustness tests, see Section 8).
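A simplified version of such a model can be fit with statsmodels, as in the sketch below, which uses crossed random intercepts for respondents and ideas expressed as variance components; our actual specification also includes random slopes, which this sketch omits for brevity. The data frame layout and column names are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per rating: intent in [0, 1]; source in {"human", "zero_shot", "few_shot"};
# respondent and idea are identifiers. "ratings.csv" is a hypothetical file.
df = pd.read_csv("ratings.csv")

# Crossed random intercepts for respondents and ideas, written as variance
# components inside a single dummy group (a standard statsmodels pattern).
df["one_group"] = 1
model = smf.mixedlm(
    "intent ~ C(source, Treatment('human'))",  # humans as the baseline level
    data=df,
    groups="one_group",
    vc_formula={"respondent": "0 + C(respondent)", "idea": "0 + C(idea)"},
)
print(model.fit().summary())
```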
6. Study 2: Diversity and Novelty of Ideas

Our second study focuses on how the fraction of overlapping ideas and the resulting estimated total number of ideas the process can generate (idea flexibility, Hypothesis 2a), as well as the perceived novelty of the ideas as assessed by human raters (idea originality, Hypothesis 2b), depend on the idea source.

6.1. Overlapping Ideas
An idea-generation process creates a sequence of ideas in which each additional idea generated can be compared to the previously created ideas according to its similarity. For a pool of ideas, we can hence compute the average pairwise similarity of one idea compared to all other ideas and then compute the average overall similarity for the entire pool. We can also apply a threshold to pairwise idea similarity to identify at what point the ideas start to become more repetitive, i.e., when we are starting to exhaust the space of new ideas given a particular idea-generation process. A pool of ideas then might have a few overlapping ideas, which informs our second quantitative metric, the total number of ideas the process can generate.

To measure the diversity of the ideas, we calculate the cosine similarity of each idea relative to the rest of the set. We first calculate a vector of text embeddings for each idea. We follow the technical setup in Dell'Acqua et al. (2023) and use Google's Universal Sentence Encoder (USE) model for our idea embeddings, which is specifically optimized for semantic similarity between sentences. Table 1 shows the results.

In geometry, the cosine of the angle between vectors ranges from -1 to 1. However, when using Google USE, negative similarity is rarely encountered, since the overall text structure does not substantially differ between ideas. Ideas follow a similar pattern in terms of text length and style, often leading with the title before the idea description. In our test, a cosine similarity of 1 between two ideas thus indicates that they are very similar (their embedding vectors are aligned), whereas a cosine similarity of 0 implies orthogonal or unrelated ideas. We consider a new idea added to an idea pool to be unique if its pairwise cosine similarity compared to all previously added ideas is never greater than 0.8. Additional robustness checks using different thresholds and measures can be found in Section 8.
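The Python sketch below shows how such embeddings and pairwise similarities can be computed with the publicly available USE model; the placeholder idea texts and the threshold handling are illustrative, not our exact pipeline.

```python
import numpy as np
import tensorflow_hub as hub

# Google's Universal Sentence Encoder, as referenced in the text.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

ideas = [
    "Convertible High-Heel Shoe: a stylish heel that folds down for walking.",
    "Compact Printer: a small portable printer for dorm rooms.",
]  # placeholder idea texts
vectors = np.asarray(embed(ideas))

# Normalize rows so the matrix product yields all pairwise cosine similarities.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
sim = vectors @ vectors.T

# Average pairwise similarity over distinct pairs (upper triangle, no self-pairs).
pairs = sim[np.triu_indices(len(ideas), k=1)]
print("average pool similarity:", pairs.mean())

# Fraction of ideas overlapping with at least one other idea at theta = 0.8.
theta = 0.8
closest_other = (sim - np.eye(len(ideas))).max(axis=1)  # zero out self-similarity
print("fraction overlapping:", float((closest_other > theta).mean()))
```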
Table 1   Summary Statistics for Idea Overlap

Metric | Student Ideas | GPT-4 zero-shot | GPT-4 few-shot
N ideas | 200 | 100 | 100
Average cosine similarity of all ideas | 0.221 | 0.415 | 0.428
Fraction of ideas in pool with cosine similarity > 0.8 | 0.00 | 0.05 | 0.07

Notes. We compute the fraction as the number of ideas whose average pairwise similarity compared to all other ideas in the pool exceeds 0.8, divided by the total number of ideas in the pool.
For each pool, we compute the average pairwise similarity between all ideas. One-way ANOVA analyses show that the source has a significant effect on the cosine similarity between the three pools. The difference between all three groups is significant (η² = 0.455, 95% CI [-0.210, -0.204], F(2, 29598) = 12340.95, p < 0.001). Considering only two groups, human ideas have a significantly smaller cosine similarity than GPT-4-generated ideas (η² = 0.358, 95% CI [-0.197, -0.190], F(1, 24649) = 13715.82, p < 0.001). Zero-shot GPT-4 ideas exhibit a significantly smaller cosine similarity than few-shot GPT-4 ideas (η² = 0.004, 95% CI [-0.018, -0.010], F(1, 9898) = 44.24, p < 0.001).

Because there is no overlap among human-generated ideas under our cosine similarity threshold, the fraction of overlapping ideas is zero and the number of unique ideas infinitely large, in line with Hypothesis 2a. A larger pool of student ideas will eventually contain overlapping ideas (see Kornish and Ulrich 2014 for estimates), but based on our assumptions for similarity, the student sample contains only unique ideas. We perform a binomial test to formally estimate the significance of the differences. We find that the fraction of similar human-generated ideas (95% CI for fraction [0.0, 0.0184]) is significantly smaller than that of the zero-shot GPT-4 ideas (RD = -0.05, 95% CI [-0.093, -0.007], p < 0.001) and few-shot GPT-4 ideas (RD = -0.07, 95% CI [-0.120, -0.020], p < 0.001), supporting Hypothesis 2a. The difference between the two GPT-4 pools is not significant (RD = -0.02, 95% CI [-0.086, 0.046], p = 0.56). Our findings suggest that the human ideation process generates a greater number of distinct ideas than GPT-4. We calculate the exact numbers in the next section.
Figure 2   Distribution of cosine similarities across the three pools
Notes. Density plot of cosine similarities comparing all three pools. The dotted line shows the mean and confidence interval of the estimate for a pool used for the ANOVA. The difference between all three groups is significant (η² = 0.455, 95% CI [-0.210, -0.204], F(2, 29598) = 12340.95, p < 0.001).
6.2. Number of Discoverable Ideas
Given the fraction of unique ideas, we can estimate the number of unique ideas that could be generated by each of our three processes (pools): students, LLM (zero-shot), and LLM prompted with examples (few-shot), using the method of Kornish and Ulrich (2011). This method, which uses the capture-recapture approach to analyze the probability that the next idea in a sequence is unique, reportedly originates with Laplace (Cochran 1978) and has been adapted to wildlife ecology and other domains. For illustration, consider again fishing in a lake as a metaphor for the idea-generation process. Each idea is a catch, and the fish is released back into the lake. Sometimes, the same fish will be caught again. The more frequently an individual fish is re-caught, the smaller the estimate of the overall fish population. Thus, the probability that a fish has never been caught previously is a decreasing function of the number of ideas generated.

This probability decay is typically represented by an exponential function:

p(n) = e^(-an)     (1)

We define p(n) as the probability that the next idea is unique given that n ideas have been generated already. The expected number of unique ideas out of n generated, u(n), is the integral under this curve:

u(n) = (1/a)(1 - e^(-an))     (2)

This form of probability decay comes from a specific underlying process, with T unique ideas in total (T fish in the pond), each equally likely to be drawn. This assumption is commonly used in the Lincoln-Petersen method (Lincoln 1930), the standard model for estimating population size in the literature on wildlife ecology. The decay parameter and the total T are linked: T = 1/a. This model has only a single parameter, a, which is the inverse of the size of the opportunity space, i.e., of an estimate of the total number of unique ideas that an unlimited number of comparable idea generators, each generating an enormous number of ideas, would generate.

Given a set of ideas generated and a count of the number of unique ideas in that set, the model can be used to calculate T, an estimate of the size of the opportunity space. Using the similarity threshold of 0.8 from the cosine similarity metric, we found that 5 of the 100 ideas generated by the LLM with zero-shot prompting were essentially similar to an idea already generated (fish recaptured), and that 7 of the 100 ideas generated via few-shot prompting were redundant. Thus, u(100) is 95 in the first case and 93 in the second case. This corresponds to an estimate of T of 966 ideas (zero-shot) and of 680 ideas (few-shot), respectively.
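To see where such estimates come from, the Python sketch below solves equation (2) for the decay parameter a given the observed number of unique ideas and reports T = 1/a. This is an illustrative reconstruction of the calculation; the function name and the choice of numerical root-finder are ours.

```python
import numpy as np
from scipy.optimize import brentq

def estimate_opportunity_space(n_generated: int, n_unique: int) -> float:
    """Estimate T = 1/a from one sample, per the capture-recapture model.

    Solves u(n) = (1/a) * (1 - exp(-a * n)) = n_unique for a, following
    the logic of Kornish and Ulrich (2011), and returns T = 1/a.
    """
    def gap(a: float) -> float:
        return (1.0 / a) * (1.0 - np.exp(-a * n_generated)) - n_unique

    # u(n) is decreasing in a, so the root in this bracket is unique.
    a = brentq(gap, 1e-9, 1.0)
    return 1.0 / a

print(round(estimate_opportunity_space(100, 95)))  # ~966 ideas (zero-shot)
print(round(estimate_opportunity_space(100, 93)))  # ~680 ideas (few-shot)
```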
In our sample, human-generated ideas were all unique. Thus, as expected from our overlap calculations, and based on the estimates provided by the capture-recapture model, we find support for the second quantitative metric of Hypothesis 2a. The number of unique ideas that can be discovered is lower for both pools of AI-generated ideas than for the human idea-generation process. In addition, prompting the LLM with examples seems to further reduce the estimated number of unique ideas available to the process. We perform additional robustness checks in Section 8.

6.3. Perceived Novelty
Given that LLMs are designed to generate the statistically most plausible sequence of text based on their training data, perhaps they generate less novel ideas than humans. Novelty is not a goal expressed in the prompt used in this study for either humans or GPT-4 and is typically not a primary objective in commercial product development efforts. Still, to ensure that GPT-generated ideas are not merely lists of existing ideas, we investigate how the novelty of ideas varies between LLM-generated ideas and those generated by humans.

Based on Shibayama et al. (2021), we assessed novelty by asking respondents on Prolific the question "Relative to other products you have seen, how novel do you consider the idea for this new product?" [0: Not at all novel, 0.25: Slightly novel, 0.5: Moderately novel, 0.75: Very novel, 1: Extremely novel]. The average novelty of human-generated ideas is 40.6% (SD: 0.117), which is greater than that of zero-shot GPT-4 (36.7%, SD: 0.101) and few-shot GPT-4 (36.1%, SD: 0.111; see Figure 3).
Similar to purchase intent, we estimate a linear mixed-effects model to investigate how the idea source (human ideas, zero-shot GPT-4, and few-shot GPT-4) affects the perceived novelty of product ideas. The model includes two fixed effects denoting the source (humans are the baseline) and random intercepts and slopes for both respondents and ideas.

We find significant differences in perceived novelty between human and zero-shot GPT-4-generated ideas (β = -0.038; 95% CI [-0.066, -0.01]; t(269) = -2.67, p = 0.008) at the alpha = 0.05 threshold. Ideas generated by few-shot GPT-4 also receive significantly lower novelty ratings (β = -0.049; 95% CI [-0.078, -0.02]; t(268) = -3.4, p < 0.001) compared to human-generated ideas. These findings suggest that some LLM-generated ideas are perceived as less novel than human-generated ideas.

Perceived novelty is not significantly different between the two pools of LLM-generated ideas (β = -0.01; 95% CI [-0.039, 0.017]; t(195) = -0.757, p = 0.45). Of note, novelty does not appear to be significantly correlated with purchase intent. The correlation coefficient is slightly negative at -0.08 (95% CI [-0.176, 0.016], p = 0.12). Additional robustness checks can be found in Section 8.

Figure 3   Distribution of novelty ratings for three samples of ideas
Notes. Novelty based on mTurk assessment per Kwon, Kim, and Lee (2009).
These findings support Hypothesis 2b: AI-generated ideas are, on average, less novel than human-generated ideas. Of note, the average novelty of all ideas, irrespective of source, lies between slightly and moderately novel. While human ideas are around 0.047 points more novel, there is little reason to believe that novelty alone, i.e., being the first to think of an idea, leads to a significant financial advantage. As Terwiesch and Ulrich (2010) and others have argued, the first-mover advantage is a myth. As such, from a commercial point of view, we don't believe that the slightly lower novelty outweighs the productivity and quality benefits of LLMs.
7. Study 3: What is the quality of the best idea(s)?

Table 2 summarizes the titles of the top 40 ideas (10%) in our pool, that is, the top 40 out of the 400 ideas used.
Table 2   Top 10% of Ideas by Purchase Intent

Title | Source | Purchase Intent | Novelty
Compact Printer | GPT-4 (Few-Shot) | 0.76 | 0.55
Solar-Powered Gadget Charger | GPT-4 (Few-Shot) | 0.75 | 0.44
QuickClean Mini Vacuum | GPT-4 (Zero-Shot) | 0.75 | 0.30
Noise-Canceling Headphones | GPT-4 (Few-Shot) | 0.72 | 0.18
StudyErgo Seat Cushion | GPT-4 (Zero-Shot) | 0.72 | 0.39
Multifunctional Desk Organizer | GPT-4 (Few-Shot) | 0.71 | 0.21
Reusable Silicone Food Storage Bags | GPT-4 (Few-Shot) | 0.68 | 0.34
Portable Closet Organizer | GPT-4 (Few-Shot) | 0.67 | 0.23
Dorm Room Chef* [oven, microwave and toaster] | GPT-4 (Few-Shot) | 0.67 | 0.71
Collegiate Cookware | GPT-4 (Few-Shot) | 0.67 | 0.45
Collapsible Laundry Basket | GPT-4 (Few-Shot) | 0.65 | 0.21
On-the-Go Charging Pouch | GPT-4 (Few-Shot) | 0.65 | 0.33
GreenEats Reusable Containers | GPT-4 (Zero-Shot) | 0.65 | 0.21
HydrationStation* [bottle with filter] | GPT-4 (Zero-Shot) | 0.64 | 0.19
Reusable Shopping Bag Set | GPT-4 (Few-Shot) | 0.64 | 0.19
CollegeLife Collapsible Laundry Hamper | GPT-4 (Zero-Shot) | 0.64 | 0.26
Adaptiflex* [cord extension to fit big adapters] | Student | 0.64 | 0.44
SpaceSaver Hangers | GPT-4 (Zero-Shot) | 0.64 | 0.33
Dorm Room Air Purifier | GPT-4 (Few-Shot) | 0.63 | 0.29
Smart Power Strip | GPT-4 (Few-Shot) | 0.63 | 0.22
CampusCharger Pro | GPT-4 (Zero-Shot) | 0.63 | 0.31
Kitchen Safe Gloves | Student | 0.62 | 0.31
Nightstand Nook* [charging, cup holder] | GPT-4 (Few-Shot) | 0.62 | 0.43
Mini Steamer | GPT-4 (Few-Shot) | 0.62 | 0.41
CollegeCare First Aid Kit | GPT-4 (Zero-Shot) | 0.62 | 0.26
StudySoundProof* [soundproofing panels] | GPT-4 (Zero-Shot) | 0.62 | 0.57
FreshAir Fan | GPT-4 (Zero-Shot) | 0.62 | 0.29
StudyBuddy Lamp* [portable, usb charging] | GPT-4 (Zero-Shot) | 0.62 | 0.43
Bluetooth Signal Merger* [share music] | Student | 0.62 | 0.41
Adjustable Laptop Riser | GPT-4 (Few-Shot) | 0.62 | 0.21
EcoCharge* [solar powered charger] | GPT-4 (Zero-Shot) | 0.62 | 0.43
Smartphone Projector | Student | 0.62 | 0.57
Grocery Helper* [hook to carry multiple bags] | Student | 0.62 | 0.53
FitnessOnTheGo* [portable gym equipment] | GPT-4 (Zero-Shot) | 0.62 | 0.42
Multipurpose Fitness Equipment | GPT-4 (Few-Shot) | 0.62 | 0.37
CollegeCooker | GPT-4 (Zero-Shot) | 0.61 | 0.50
Multifunctional Wall Organizer | GPT-4 (Few-Shot) | 0.61 | 0.31
DormDoc Portable Scanner | GPT-4 (Zero-Shot) | 0.61 | 0.49
Mobile Charging Station Organizer | GPT-4 (Few-Shot) | 0.61 | 0.26
StudyMate Planner | GPT-4 (Few-Shot) | 0.61 | 0.22
DormChef Kitchen Set | GPT-4 (Zero-Shot) | 0.61 | 0.33
LaundryBuddy* [laundry basket] | GPT-4 (Zero-Shot) | 0.61 | 0.30

Notes. The asterisk (*) denotes ideas where the text in square brackets [ ] is not part of the original title and was added to clarify the idea.
Among the top 40 ideas (top decile), 35 (87.5%) were generated by GPT-4 (see Table 3). In other words, for every human idea in the top 10%, we count 7 ideas generated by GPT-4. A Chi-Square test of independence, with the null hypothesis of equal representation of all sources among the top ideas (20, 10, and 10), rejected the null hypothesis (χ² = 26.39, p < 0.001, df = 2), thus confirming Hypothesis 3.
Table 3. Best Ideas Across Pools

| | Student Ideas | GPT-4 zero-shot | GPT-4 few-shot |
|---|---|---|---|
| N Ideas | 200 | 100 | 100 |
| Average Quality of Top Decile | 0.62 | 0.64 | 0.66 |
| Average Novelty of Top Decile | 0.45 | 0.35 | 0.33 |
| Fraction of the top decile of pooled ideas from this source | 5/40 | 15/40 | 20/40 |
To better understand how the full distribution of idea quality is affected by the idea source, we use quantile regression analysis. Quantile regression (Koenker and Hallock 2001) extends traditional regression by computing the relationship between the explanatory variables (idea source) and the response variable (idea quality) at different percentiles of the data. As mentioned above, in innovation the quality of the best ideas is generally more important than the average quality; that is, we prefer a few exceptional ideas to a lot of mediocre ones. Using quantile regression, we can examine the tails of the distribution instead of the mean, allowing us to test whether GPT-4 excels at generating high-quality ideas only at specific percentiles or whether the effect holds across the entire distribution.

Our analysis follows Girotra et al. (2010). We use the average idea quality ratings as the dependent variable, and our explanatory variable is a binary variable indicating whether the idea is human-generated (baseline level) or AI-generated (GPT-4 zero-shot and GPT-4 few-shot prompting). Figure 4 shows the results. At all percentiles, GPT-4 ideas consistently outperform student ideas. The effect is especially pronounced in the upper tail of the distribution (80th percentile and above), where GPT-4 has the strongest advantage. This implies that not only does GPT-4 generate better ideas on average, but it is also especially adept at producing top-tier ideas compared to students.
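A minimal sketch of this specification with statsmodels is shown below. The data are synthetic and the column names (mean_rating, source_ai) are illustrative assumptions; the point is the per-quantile fit of MeanRating ~ SourceAI described in Appendix A.

```python
# Quantile regression of idea quality on idea source, fit separately at each
# decile (mirroring Table A.1). Synthetic data; column names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
ideas = pd.DataFrame({"source_ai": rng.integers(0, 2, n)})  # 0 = student, 1 = GPT-4
ideas["mean_rating"] = 1.5 + 0.3 * ideas["source_ai"] + rng.normal(0, 0.4, n)

for q in np.arange(0.1, 1.0, 0.1):
    fit = smf.quantreg("mean_rating ~ source_ai", ideas).fit(q=q)
    lo, hi = fit.conf_int().loc["source_ai"]
    print(f"q = {q:.1f}: beta = {fit.params['source_ai']:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```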
Figure 4. Estimated Difference in Idea Quality Ratings between AI-Generated Ideas and Human-Generated Ideas (Baseline), for Different Percentiles
8. Discussion and Limitations

In this section, we discuss conceptual limitations of our work, limitations related to our research design and data analysis, and the robustness of our analysis to a set of alternative specifications and assumptions.
Our findings indicate that GPT-4 produces higher-quality ideas that are more likely to be purchased than those generated by humans, though they are perceived as less novel. AI significantly outperforms human creativity in generating top-tier ideas, with GPT-4 ideas being seven times more likely to rank in the top 10%. Given AI's advantage in both quality and productivity, our findings have profound implications for the field of innovation management. For instance, AI can serve as a first step in brainstorming sessions, allowing organizations to rapidly explore a wide variety of ideas with minimal cost and time investment. Human ideators can also provide AI with their own interesting ideas and refine them with the help of AI. Another important implication lies in the potential shift of focus from idea generation to idea evaluation. If LLMs can reliably produce numerous high-quality ideas at very low cost, companies might allocate more resources toward assessing and refining those ideas instead of ideating from scratch. This shift could lead to the development of new tools and frameworks specifically designed to help organizations sort, rank, and prioritize AI-generated ideas, further streamlining the innovation process.
However, while the results show that GPT-4 outperforms human creativity in terms of producing top-tier ideas, the reduced novelty and increased similarity among AI-generated ideas point to a limitation. This suggests that a human in the loop is still important to drive the ideation direction and ensure that ideas are as novel as possible. Future research could explore ways to mitigate this issue by enhancing LLMs' ability to generate more diverse and creative solutions through techniques such as fine-tuning.

Investigating whether LLMs can evaluate ideas with the same rigor as human evaluators would help to further improve the ideation process. It would allow an LLM to get immediate feedback on its creations, leaving humans to focus on implementation and strategy.
8.1 Conceptual and Research Design Limitations

Conceptually, our prompting approach (i.e., a simple prompt) is not optimized for creativity or novelty. It also follows a single-ideator setup instead of approaches such as hybrid brainstorming that lead to more and better ideas (Girotra et al. 2010). A model given more specific instructions on how to ideate effectively might thus perform even better. Different prompting techniques such as Chain-of-Thought (CoT), which asks the model to reason through a problem in multiple steps instead of directly providing an answer (Wei et al. 2023), might also improve performance. Furthermore, providing the model with hundreds of good ideas, either via many-shot learning or fine-tuning, could also enhance performance. This suggests that we likely underestimate the true power of AI-based idea generation.

Second, it is possible that professional product innovators would generate better ideas than our students. However, this has not been the experience of the paper's authors, who have taught many academic courses and worked in many product development settings. Many students who participated in the innovation contests have gone on to be product innovators, sometimes based on ideas from the course tournament.
Nevertheless, we have not produced evidence that GPT-4 is better than the best product innovators working today. However, we believe we can conservatively claim that GPT-4 is better than many human product innovators working today, and probably better than the average one. Thus, at a minimum, an LLM could elevate the least capable humans to a better-than-average level of performance.
Third, GPT might be a great salesperson. As such, it is possible that the writing style (the "pitch") convinces the customers rather than the idea itself. Prior work in other domains suggests that the text generated by LLMs is not distinguishable from that generated by humans (Brown et al. 2020), though recent work has developed sophisticated measures to detect LLM-generated text (Mitchell et al. 2023, Kobak et al. 2024, Venkatraman et al. 2024). For example, Kobak et al. (2024) provide intuitions that could be used to identify LLM-generated text, such as words that are not commonly used by the majority of English speakers, like "delve." However, it is unlikely that these characteristics were known to our survey participants at the time of our experiment in May and June 2023, or that any particular idea generated by GPT-4 could easily be distinguished from those generated by our students. Future research could use LLMs to present human-generated ideas in a way that more closely mimics the presentation style of LLM-generated ideas, ensuring that the quality of the idea is not confounded by its presentation style.
Fourth, our study is set in the widely understood domain of consumer products for the college student market that cost less than $50. Presumably, there exists a lot of commentary and data about such products in the training data used by the GPT class of language models. As such, it is unclear whether our results would generalize to more specialized domains, such as surgical instruments. Organizations looking for opportunities in these specialized domains should fine-tune language models with their own proprietary data to achieve comparable or better performance.
Fifth, innovation often benefits from collaboration and is not solely focused on one ideator generating many ideas. Liu et al. (2018) show that collaborating with other innovators improves the creative process by enabling the transfer of critical skills and knowledge, particularly when those collaborations involve highly skilled innovators. Future work should investigate whether this finding extends to human-LLM interaction, and whether an LLM could help a novice human innovator become better.
8.2 Robustness

There are different ways to analyze the data. Here, we provide additional robustness checks that investigate the validity of our results under various specifications.

8.2.1 Study 1

To measure purchase intent, it is possible to use other convex weighting schemes. Ulrich and Eppinger (2007) weight "definitely would purchase" as 0.4 and "probably would purchase" as 0.2, with all other responses weighted as 0. When using this alternative set of weights, we find the same significant differences between pools.
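As a simple illustration, the sketch below computes purchase intent under the alternative Ulrich and Eppinger (2007) weights. Only the 0.4/0.2/0 weights come from the text; the exact response labels of the five-point scale are assumptions.

```python
# Alternative convex weighting of stated purchase intent (Ulrich and
# Eppinger 2007): 0.4 for "definitely would purchase", 0.2 for "probably
# would purchase", and 0 for all other responses.
ALT_WEIGHTS = {
    "definitely would purchase": 0.4,
    "probably would purchase": 0.2,
    "might or might not purchase": 0.0,    # assumed label
    "probably would not purchase": 0.0,    # assumed label
    "definitely would not purchase": 0.0,  # assumed label
}

def purchase_intent(responses, weights=ALT_WEIGHTS):
    """Average weighted purchase-intent probability across survey responses."""
    return sum(weights[r] for r in responses) / len(responses)

sample = ["definitely would purchase", "probably would purchase",
          "probably would not purchase"]
print(purchase_intent(sample))  # (0.4 + 0.2 + 0.0) / 3 = 0.2
```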
As a robustness test for our primary purchase-intent analysis using a linear mixed-effects model, we also conduct a simpler linear regression focusing on the average perceived quality of product ideas across different sources. This model aggregates individual ratings at the idea level, removing the random effects to capture the overall influence of the source on rating averages. The results confirm our previous findings and show that ideas from GPT-4 (zero-shot) are rated higher than human ones by an average of 0.256 points (95% CI [0.15, 0.37]; t = 4.602, p < 0.001), and ideas from GPT-4 (few-shot) are rated higher by an average of 0.358 points (95% CI [0.25, 0.47]; t = 6.435, p < 0.001).
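The sketch below illustrates this idea-level aggregation with statsmodels, under assumed column names and synthetic ratings; the student pool serves as the baseline, as in the paper.

```python
# Idea-level robustness check: average each idea's ratings, then regress the
# idea means on the source with plain OLS (no random effects). Synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 6000
ratings = pd.DataFrame({
    "idea_id": rng.integers(0, 400, n),
    "rating": rng.integers(1, 6, n).astype(float),  # 1-5 purchase-intent scale
})
# Assumed pool layout: ideas 0-199 student, 200-299 zero-shot, 300-399 few-shot.
ratings["source"] = np.select(
    [ratings["idea_id"] < 200, ratings["idea_id"] < 300],
    ["Student", "Zero-Shot"], default="Few-Shot")

idea_means = ratings.groupby(["idea_id", "source"], as_index=False)["rating"].mean()
ols = smf.ols("rating ~ C(source, Treatment(reference='Student'))", idea_means).fit()
print(ols.params, ols.conf_int(), sep="\n")
```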
In addition, we estimate a cumulative link mixed model (CLMM) to treat the rating outcome as a factor. We find significant differences in the perceived quality, measured as purchase intent of product ideas, between sources. Ideas generated by GPT-4 (zero-shot) receive a significantly greater average rating (β = 0.395; 95% CI [0.215, 0.575]; z = 4.31, p < 0.001). Similarly, ideas generated by GPT-4 (few-shot) receive even higher ratings (β = 0.581; 95% CI [0.400, 0.762]; z = 6.30, p < 0.001) compared to human-generated ideas. These findings suggest that LLM-generated ideas are perceived as more likely to be purchased than human-generated ideas, with the highest perceived quality attributed to few-shot GPT-4-generated ideas.
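The paper's CLMM includes random effects for ideas and respondents, a model typically fit with R's ordinal::clmm. As a simplified, fixed-effects-only analogue, the sketch below fits a cumulative-link (ordered logit) model with statsmodels on synthetic data; treat it as illustrative, not as the paper's exact specification.

```python
# Fixed-effects ordered-logit sketch of a cumulative link model: ratings are
# treated as an ordered factor with source dummies as predictors. Synthetic
# data; no random effects (unlike the paper's CLMM).
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(2)
n = 2000
source = pd.Series(rng.choice(["Student", "Zero-Shot", "Few-Shot"], n))
latent = source.map({"Student": 0.0, "Zero-Shot": 0.4, "Few-Shot": 0.6}) \
    + rng.logistic(size=n)
rating = pd.cut(latent, [-np.inf, -1, 0, 1, 2, np.inf], labels=False) + 1  # 1-5

exog = pd.get_dummies(source)[["Zero-Shot", "Few-Shot"]].astype(float)
fit = OrderedModel(rating, exog, distr="logit").fit(method="bfgs", disp=False)
print(fit.summary())  # coefficients are on the log-odds scale, as in a CLMM
```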
8.2.2 Study 2

Our chosen threshold of θ = 0.8 was established through experimentation by comparing pairs of ideas and their respective similarity scores. However, our findings are robust to other values such as 0.7 (25 and 37 overlapping ideas for zero-shot and few-shot GPT-4, respectively) and 0.75 (16 and 23 overlapping ideas). At θ = 0.85, the zero-shot GPT-4 pool features only two overlapping ideas, whereas the few-shot pool features one. Because these are extreme values that approach zero, we used 0.8 as our main threshold. We compute the pairwise similarity for an idea compared to all other ideas in the pool and calculate the average. Mean pairwise similarity is a common measure in ideation (Siangliulue et al. 2016, Cox et al. 2021) and similar text-mining tasks (Doshi and Hauser 2024), but it is not without issues, as it lacks sensitivity to highly clustered ideas. As an additional specification, we consider the per-pool collective diversity of all ideas by following the work of Cox et al. (2021) and construct a minimum spanning tree (MST), which spans all points (ideas) in space with the smallest total distance along the edges. In 2D space, an MST would be the tree that contains all points with the shortest overall length of edges. We compute the mean of all edge distances as a measure of how distributed the ideas are in the high-dimensional space. The spanning tree is constructed in high-dimensional space (512 dimensions), and its edge weights are summed and divided by the number of edges, yielding a measure that ranges from 0 (not diverse at all) to 1 (very diverse). Based on this measure, the student idea pool is the most diverse (0.53), GPT-4 zero-shot is the second most diverse (0.33), and GPT-4 few-shot is the least diverse (0.30).
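A sketch of both diversity measures follows, assuming unit-normalized 512-dimensional embeddings and cosine distance; the exact embedding model and distance convention used in the paper may differ, so treat this as illustrative.

```python
# Two diversity measures over one pool of idea embeddings: (1) mean pairwise
# cosine similarity per idea, with a threshold theta = 0.8 flagging
# overlapping (near-duplicate) idea pairs, and (2) the mean edge length of a
# minimum spanning tree over all ideas as a collective-diversity measure.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(3)
emb = rng.normal(size=(100, 512))                  # placeholder embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows

sim = emb @ emb.T                                  # cosine similarity matrix
np.fill_diagonal(sim, 0.0)
mean_pairwise_sim = sim.sum(axis=1) / (len(emb) - 1)

theta = 0.8
overlapping_pairs = np.argwhere(np.triu(sim > theta, k=1))

dist = 1.0 - sim                                   # cosine distance
np.fill_diagonal(dist, 0.0)
mst = minimum_spanning_tree(dist)                  # n - 1 edges over n ideas
mean_mst_edge = mst.sum() / (len(emb) - 1)

print(f"mean pairwise similarity: {mean_pairwise_sim.mean():.3f}")
print(f"overlapping pairs (theta = 0.8): {len(overlapping_pairs)}")
print(f"mean MST edge distance: {mean_mst_edge:.3f}")
```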
Similar to purchase intent, we also conduct a simpler linear regression focusing on the average perceived novelty of product ideas across different sources. This model aggregates individual ratings at the idea level, removing the random effects to capture the overall influence of the source on rating averages. We find that ideas from GPT-4 (zero-shot) are significantly less novel than human ones (β = -0.177; 95% CI [-0.286, -0.069]; t = -3.22, p = 0.0014). Ideas from GPT-4 (few-shot) are rated as significantly less novel than human ones (β = -0.197; 95% CI [-0.305, -0.089]; t = -3.58, p < 0.001). This simpler analysis reinforces that human ideas are more novel than AI-generated ones, even when using zero-shot prompting.
In addition, we estimate a cumulative link mixed model (CLMM) to treat the rating outcome as a factor. We find significant differences in the perceived novelty. Ideas generated by GPT-4 (zero-shot) receive a significantly lower average rating (β = -0.306; 95% CI [-0.514, -0.10]; z = -2.89, p < 0.01). Similarly, ideas generated by GPT-4 (few-shot) receive even lower ratings (β = -0.39; 95% CI [-0.60, -0.18]; z = -3.66, p < 0.001) compared to human-generated ideas. These findings suggest that LLM-generated ideas are perceived as less novel than human-generated ideas, with the lowest perceived novelty attributed to few-shot GPT-4-generated ideas.
8.2.3 Study 3

In this study, we present our results for the 90th percentile of all aggregated ideas. Table 4 shows that using other percentiles yields similar results.
Table 4. Top 5 and 15 Percent of Ideas: Pool Distributions

| | Student Ideas | GPT-4 zero-shot | GPT-4 few-shot |
|---|---|---|---|
| Average Quality of Top 5% | 0.64 | 0.67 | 0.68 |
| Fraction of the top 5% of pooled ideas from this source | 1/20 | 6/20 | 14/20 |
| Average Quality of Top 15% | 0.60 | 0.62 | 0.64 |
| Fraction of the top 15% of pooled ideas from this source | 11/60 | 22/60 | 27/60 |
9. Summary

GenAI has demonstrated remarkable advancements in creating coherent and fluent text, equaling or surpassing human performance in various academic and professional domains. In this study, we explored the ideation capabilities of OpenAI's GPT-4, a state-of-the-art large language model, in comparison to the ideation abilities of university students when generating ideas for new products targeted toward college students at a price point of $50 or less. Specifically, we make three main contributions to the literature on innovation and the role of AI.
First, GPT-4 produces high-quality ideas that are perceived as more likely to be purchased than human-generated ideas. Second, consumers perceive AI-generated ideas as less novel. Third, when considering the quality of the best ideas, AI outperforms human creativity significantly. To put these findings in context, innovation favors a few great ideas over a large number of solid ideas, and our results show that AI-generated ideas are seven times more likely than human ideas to be among the top 10% of ideas considered in our experiment. Despite the reduction in novelty, the overall AI advantage thus remains substantial.

The fact that GPT-4 is very efficient at generating ideas does not require a formal research study. Two hundred ideas can be generated by one human interacting with GPT-4 in about 15 minutes. A human working alone can generate about five ideas in 15 minutes, and humans working in groups do even worse (Girotra et al. 2010). In short, the productivity race between humans and GPT-4 is not even close. However, as we show in this article, the enormous potential of LLMs in ideation results not only from their ability to quickly and inexpensively generate ideas, but also from the remarkable quality of their output.

Importantly, hundreds of high-quality ideas can be produced at a fraction of the cost it would take humans. This previously unimaginable productivity in generating ideas may substantially reduce the importance of the idea-generation phase of innovation and shift managerial focus to the idea-evaluation phase. Can an LLM also take on the task of idea evaluation? From our viewpoint, this is a fascinating question for future research.
References

Bellaiche L, Shahi R, Turpin MH, Ragnhildstveit A, Sprockett S, Barr N, Christensen A, Seli P (2023) Humans versus AI: whether and why we prefer human-created compared to AI-created artwork. Cogn. Research 8(1):42.

Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, et al. (2020) Language Models are Few-Shot Learners. (July 22) http://arxiv.org/abs/2005.14165.

Chao RO, Kavadias S (2008) A Theoretical Framework for Managing the New Product Development Portfolio: When and How to Use Strategic Buckets. Management Science 54(5):907–921.

Cochran WG (1978) Laplace's Ratio Estimator. David HA, ed. Contributions to Survey Sampling and Applied Statistics (Academic Press), 3–10.

Connolly T, Jessup LM, Valacich JS (1990) Effects of Anonymity and Evaluative Tone on Idea Generation in Computer-Mediated Groups. Management Science 36(6):689–703.

Cox SR, Wang Y, Abdul A, Von Der Weth C, Lim BY (2021) Directed Diversity: Leveraging Language Embedding Distances for Collective Creativity in Crowd Ideation. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (ACM, Yokohama, Japan), 1–35.

Dahan E, Mendelson H (2001) An Extreme-Value Model of Concept Testing. Management Science 47(1):102–116.

Dell'Acqua F, McFowland III E, Mollick ER, Lifshitz-Assaf H, Kellogg K, Rajendran S, Krayer L, Candelon F, Lakhani KR (2023) Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. (September 15) https://papers.ssrn.com/abstract=4573321.

Doshi AR, Hauser OP (2024) Generative AI enhances individual creativity but reduces the collective diversity of novel content. Sci. Adv. 10(28):eadn5290.

Girotra K, Terwiesch C, Ulrich KT (2010) Idea Generation and the Quality of the Best Idea. Management Science 56(4):591–605.

Goldenberg J, Mazursky D, Solomon S (1999) Creative Sparks. Science 285(5433):1495–1496.

Guilford JP (1967) Creativity: Yesterday, Today and Tomorrow. Journal of Creative Behavior 1(1):3–14.

Haase J, Hanel PHP (2023) Artificial muses: Generative artificial intelligence chatbots have risen to human-level creativity. Journal of Creativity 33(3):100066.

Hitsuwari J, Ueda Y, Yun W, Nomura M (2023) Does human-AI collaboration lead to more creative art? Aesthetic evaluation of human-made and AI-generated haiku poetry. Computers in Human Behavior 139:107502.

Hubert KF, Awa KN, Zabelina DL (2024) The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks. Sci Rep 14(1):3440.

Huchzermeier A, Loch CH (2001) Project Management Under Risk: Using the Real Options Approach to Evaluate Flexibility in R&D. Management Science 47(1):85–101.

Jamieson LF, Bass FM (1989) Adjusting Stated Intention Measures to Predict Trial Purchase of New Products: A Comparison of Models and Methods. Journal of Marketing Research 26(3):336–345.

Jia N, Luo X, Fang Z, Liao C (2024) When and How Artificial Intelligence Augments Employee Creativity. AMJ 67(1):5–32.

Kobak D, González-Márquez R, Horvát EÁ, Lause J (2024) Delving into ChatGPT usage in academic writing through excess vocabulary. (July 3) http://arxiv.org/abs/2406.07016.

Koenker R, Hallock KF (2001) Quantile Regression. Journal of Economic Perspectives 15(4):143–156.

Koivisto M, Grassini S (2023) Best humans still outperform artificial intelligence in a creative divergent thinking task. Sci Rep 13(1):13601.

Kornish LJ, Ulrich KT (2011) Opportunity Spaces in Innovation: Empirical Analysis of Large Samples of Ideas. Management Science 57(1):107–128.

Kornish LJ, Ulrich KT (2014) The Importance of the Raw Idea in Innovation: Testing the Sow's Ear Hypothesis. Journal of Marketing Research 51(1):14–26.

Lincoln FC (1930) Calculating waterfowl abundance on the basis of banding returns (U.S. Dept. of Agriculture, Washington, D.C.).

Liu H, Mihm J, Sosa ME (2018) Where Do Stars Come From? The Role of Star vs. Nonstar Collaborators in Creative Settings. Organization Science 29(6):1149–1169.

Loch CH, Terwiesch C, Thomke S (2001) Parallel and Sequential Testing of Design Alternatives. Management Science 47(5):663–678.

March JG (1991) Exploration and Exploitation in Organizational Learning. Organization Science 2(1):71–87.

Meincke L, Carton A (2024) Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations. (May 26) https://papers.ssrn.com/abstract=4873537.

Mihm J, Schlapp J (2019) Sourcing Innovation: On Feedback in Contests. Management Science 65(2):559–576.

Mitchell E, Lee Y, Khazatsky A, Manning CD, Finn C (2023) DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. (July 23) http://arxiv.org/abs/2301.11305.

OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. (2024) GPT-4 Technical Report. (March 4) http://arxiv.org/abs/2303.08774.

Osborn AF (1953) Applied imagination (Scribner's, Oxford, England).

Rashidi HH, Fennell BD, Albahra S, Hu B, Gorbett T (2023) The ChatGPT conundrum: Human-generated scientific manuscripts misidentified as AI creations by AI text detection tool. Journal of Pathology Informatics 14:100342.

Shank DB, Stefanik C, Stuhlsatz C, Kacirek K, Belfi AM (2023) AI composer bias: Listeners like music less when they think it was composed by an AI. J Exp Psychol Appl 29(3):676–692.

Shibayama S, Yin D, Matsumoto K (2021) Measuring novelty in science with word embedding. Muscio A, ed. PLoS ONE 16(7):e0254034.

Si H, Kavadias S, Loch CH (2022) Managing Innovation Portfolios: From Project Selection to Portfolio Design. (March 6) https://papers.ssrn.com/abstract=4050940.

Siangliulue P, Chan J, Dow SP, Gajos KZ (2016) IdeaHound: Improving Large-scale Collaborative Ideation with Crowd-Powered Real-time Semantic Modeling. Proceedings of the 29th Annual Symposium on User Interface Software and Technology (ACM, Tokyo, Japan), 609–624.

Sommer SC, Loch CH (2004) Selectionism and Learning in Projects with Complexity and Unforeseeable Uncertainty. Management Science 50(10):1334–1347.

Sutton RI, Hargadon A (1996) Brainstorming Groups in Context: Effectiveness in a Product Design Firm. Administrative Science Quarterly 41(4):685.

Terwiesch C (2023) Let's cast a critical eye over business ideas from ChatGPT. Financial Times (March 12) https://www.ft.com/content/591ad272-6419-4f2c-9935-caff1d670f08.

Terwiesch C, Ulrich K (2023) The innovation tournament handbook: a step-by-step guide to finding exceptional solutions to any challenge (Wharton School Press, Philadelphia, PA).

Terwiesch C, Ulrich KT (2009) Innovation tournaments: creating and selecting exceptional opportunities (Harvard Business Press, Boston, MA).

Terwiesch C, Xu Y (2008) Innovation Contests, Open Innovation, and Multiagent Problem Solving. Management Science 54(9):1529–1543.

Torrance EP (1968) A Longitudinal Examination of the Fourth Grade Slump in Creativity. Gifted Child Quarterly 12(4):195–199.

Ulrich K, Eppinger S (2007) Product Design and Development (McGraw-Hill Education).

Venkatraman S, Uchendu A, Lee D (2024) GPT-who: An Information Density-based Machine-Generated Text Detector. (April 3) http://arxiv.org/abs/2310.06202.

Wang H, Zou J, Mozer M, Goyal A, Lamb A, Zhang L, Su WJ, et al. (2024) Can AI Be as Creative as Humans? (January 25) http://arxiv.org/abs/2401.01623.

Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E, Le Q, Zhou D (2023) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. (January 10) http://arxiv.org/abs/2201.11903.

Weitzman ML (1979) Optimal Search for the Best Alternative. Econometrica 47(3):641.

Zhou E, Lee D (2024) Generative artificial intelligence, human creativity, and art. Harding M, ed. PNAS Nexus 3(3):pgae052.

Zlatkov D, Ens J, Pasquier P (2023) Searching for Human Bias Against AI-Composed Music. Artificial Intelligence in Music, Sound, Art and Design: 12th International Conference, EvoMUSART 2023, Held as Part of EvoStar 2023, Brno, Czech Republic, April 12–14, 2023, Proceedings (Springer-Verlag, Berlin, Heidelberg), 308–323.
Appendix A. Quantile Regression Results

The regression model considered quantiles 0.1 to 0.9 in steps of 0.1. For each quantile, we estimated MeanRating ~ SourceAI, where SourceAI is a dummy variable indicating whether the idea source was a student (SourceAI = 0) or GPT-4 (SourceAI = 1). A positive coefficient for SourceAI indicates that ideas by GPT-4 performed better than human ideas; negative values indicate the opposite.
Table A.1. Quantile Regression Results for Quantiles 0.1 to 0.9

| Quantile | Intercept | SourceAI | Conf. Int. Low | Conf. Int. High |
|---|---|---|---|---|
| 0.1 | | 0.3 | 0.128399 | 0.471601 |
| 0.2 | 1.230768 | 0.290958 | 0.152504 | 0.429411 |
| 0.3 | 1.388859 | 0.277678 | 0.149304 | 0.406051 |
| 0.4 | 1.549946 | 0.262418 | 0.139183 | 0.385653 |
| 0.5 | 1.666666 | 0.227943 | 0.10915 | 0.346736 |
| 0.6 | 1.789528 | 0.210472 | 0.095888 | 0.325057 |
| 0.7 | 1.882355 | 0.260502 | 0.137852 | 0.383152 |
| 0.8 | 1.954555 | 0.445445 | 0.328038 | 0.562851 |
| 0.9 | 2.181822 | 0.318235 | 0.190531 | 0.445939 |

Notes. (p < 0.1).
Appendix B. Supplementary Regression Tables

Purchase Intent

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 0.40 | 0.38 – 0.43 | <0.001 |
| Source [Zero-Shot] | 0.06 | 0.03 – 0.09 | <0.001 |
| Source [Few-Shot] | 0.09 | 0.06 – 0.12 | <0.001 |

Random Effects

| σ² | 0.07 |
|---|---|
| τ00 IdeaID | 0.01 |
| τ00 RespondentID | 0.02 |
| τ11 IdeaID.SourceZero-Shot | 0.01 |
| τ11 IdeaID.SourceFew-Shot | 0.03 |
| τ11 RespondentID.SourceZero-Shot | 0.01 |
| τ11 RespondentID.SourceFew-Shot | 0.01 |
| ρ01 | -0.64, -0.97, -0.06, -0.28 |
| ICC | 0.28 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.014 / 0.290 |
Purchase Intent: Alternative Weights

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 0.08 | 0.07 – 0.08 | <0.001 |
| Source [Zero-Shot] | 0.02 | 0.01 – 0.03 | <0.001 |
| Source [Few-Shot] | 0.03 | 0.02 – 0.04 | <0.001 |

Random Effects

| σ² | 0.01 |
|---|---|
| τ00 IdeaID | 0.00 |
| τ00 RespondentID | 0.00 |
| τ11 IdeaID.SourceZero-Shot | 0.00 |
| τ11 IdeaID.SourceFew-Shot | 0.00 |
| τ11 RespondentID.SourceZero-Shot | 0.00 |
| τ11 RespondentID.SourceFew-Shot | 0.00 |
| ρ01 | 0.04, 0.48, -0.00, -0.16 |
| ICC | 0.21 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.009 / 0.215 |
Purchase Intent (Simple)

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 1.62 | 1.55 – 1.68 | <0.001 |
| Source [Zero-Shot] | 0.26 | 0.15 – 0.37 | <0.001 |
| Source [Few-Shot] | 0.36 | 0.25 – 0.47 | <0.001 |

| Observations | |
|---|---|
| R² / R² adjusted | 0.108 / 0.104 |
Purchase Intent (no weights, zero-shot baseline)

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 1.85 | 1.73 – 1.98 | <0.001 |
| Source [Student] | -0.24 | -0.35 – -0.12 | <0.001 |
| Source [Few-Shot] | 0.12 | -0.00 – 0.24 | 0.058 |

Random Effects

| σ² | 1.18 |
|---|---|
| τ00 IdeaID | 0.12 |
| τ00 RespondentID | 0.42 |
| τ11 IdeaID.SourceStudent | 0.33 |
| τ11 IdeaID.SourceFew-Shot | 0.12 |
| τ11 RespondentID.SourceStudent | 0.13 |
| τ11 RespondentID.SourceFew-Shot | 0.01 |
| ρ01 | -0.77, -0.36, -0.50, -0.99 |
| ICC | 0.31 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.014 / 0.322 |
Purchase Intent (ordered logistic regression)

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| 0\|1 | 0.25 | 0.21 – 0.29 | <0.001 |
| 1\|2 | 1.06 | 0.90 – 1.26 | 0.484 |
| 2\|3 | 3.01 | 2.53 – 3.57 | <0.001 |
| 3\|4 | 19.07 | 15.84 – 22.97 | <0.001 |
| Source [Zero-Shot] | 1.48 | 1.24 – 1.78 | <0.001 |
| Source [Few-Shot] | 1.79 | 1.49 – 2.14 | <0.001 |

Random Effects

| σ² | 3.29 |
|---|---|
| τ00 IdeaID | 0.39 |
| τ00 RespondentID | 0.92 |
| ICC | 0.28 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.014 / 0.294 |
Novelty

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 0.41 | 0.39 – 0.43 | <0.001 |
| Source [Zero-Shot] | -0.04 | -0.07 – -0.01 | 0.008 |
| Source [Few-Shot] | -0.05 | -0.08 – -0.02 | <0.001 |

Random Effects

| σ² | 0.05 |
|---|---|
| τ00 IdeaID | 0.01 |
| τ00 RespondentID | 0.01 |
| τ11 IdeaID.SourceZero-Shot | 0.02 |
| τ11 IdeaID.SourceFew-Shot | 0.03 |
| τ11 RespondentID.SourceZero-Shot | 0.01 |
| τ11 RespondentID.SourceFew-Shot | 0.01 |
| ρ01 | -0.87, -0.99, 0.14, 0.06 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.009 / NA |
Novelty (Simple)

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 1.64 | 1.58 – 1.70 | <0.001 |
| Source [Zero-Shot] | -0.18 | -0.29 – -0.07 | <0.001 |
| Source [Few-Shot] | -0.20 | -0.31 – -0.09 | <0.001 |

| Observations | |
|---|---|
| R² / R² adjusted | 0.042 / 0.037 |
Novelty (no weights, zero-shot baseline)

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 1.48 | 1.37 – 1.59 | <0.001 |
| Source [Student] | 0.15 | 0.05 – 0.26 | 0.004 |
| Source [Few-Shot] | -0.04 | -0.16 – 0.07 | 0.493 |

Random Effects

| σ² | 0.90 |
|---|---|
| τ00 IdeaID | 0.11 |
| τ00 RespondentID | 0.35 |
| τ11 IdeaID.SourceStudent | 0.33 |
| τ11 IdeaID.SourceFew-Shot | 0.44 |
| τ11 RespondentID.SourceStudent | 0.03 |
| τ11 RespondentID.SourceFew-Shot | 0.04 |
| ρ01 | -0.71, -0.98, -1.00, -0.23 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.009 / NA |
Novelty (zero-shot baseline)

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 0.37 | 0.34 – 0.40 | <0.001 |
| Source [Student] | 0.04 | 0.01 – 0.07 | 0.008 |
| Source [Few-Shot] | -0.01 | -0.04 – 0.02 | 0.449 |

Random Effects

| σ² | 0.05 |
|---|---|
| τ00 IdeaID | 0.01 |
| τ00 RespondentID | 0.02 |
| τ11 IdeaID.SourceStudent | 0.02 |
| τ11 IdeaID.SourceFew-Shot | 0.03 |
| τ11 RespondentID.SourceStudent | 0.01 |
| τ11 RespondentID.SourceFew-Shot | 0.00 |
| ρ01 | -0.74, -1.00, -0.69, -0.95 |
| ICC | 0.35 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.006 / 0.356 |
Novelty (ordered logistic regression)

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| 0\|1 | 0.16 | 0.13 – 0.19 | <0.001 |
| 1\|2 | 0.87 | 0.72 – 1.04 | 0.118 |
| 2\|3 | 4.51 | 3.76 – 5.43 | <0.001 |
| 3\|4 | 29.60 | 24.09 – 36.37 | <0.001 |
| Source [Zero-Shot] | 0.74 | 0.60 – 0.91 | 0.004 |
| Source [Few-Shot] | 0.68 | 0.55 – 0.84 | <0.001 |

Random Effects

| σ² | 3.29 |
|---|---|
| τ00 IdeaID | 0.57 |
| τ00 RespondentID | 0.91 |
| ICC | 0.31 |
| N RespondentID | |
| N IdeaID | |
| Observations | |
| Marginal R² / Conditional R² | 0.006 / 0.315 |