Using Large Language Models for Idea Generation in Innovation

Lennart Meincke, Operations, Information and Decisions, The Wharton School, University of Pennsylvania, lennart@wharton.upenn.edu
Karan Girotra, Cornell Tech and Johnson College of Business, Cornell University, girotra@cornell.edu
Gideon Nave, Marketing, The Wharton School, University of Pennsylvania, gnave@wharton.upenn.edu
Christian Terwiesch, Operations, Information and Decisions, The Wharton School, University of Pennsylvania, terwiesch@wharton.upenn.edu
Karl T. Ulrich, Operations, Information and Decisions, The Wharton School, University of Pennsylvania, ulrich@wharton.upenn.edu

This research evaluates the efficacy of large language models (LLMs) in generating new product ideas. To do so, we compare three pools of ideas for new products targeted toward college students and priced at USD 50 or less. The first pool of ideas was created by university students in a product design course before the availability of LLMs. The second and third pools of ideas were generated by OpenAI's GPT-4 using zero-shot and few-shot prompting, respectively. We evaluated idea quality using standard market research techniques to predict average purchase intent probability. We used text mining to assess idea similarity and human raters to evaluate idea novelty. We find that AI-generated ideas outperform human-generated ideas in terms of average purchase intent, with few-shot prompting yielding slightly higher intent than zero-shot prompting. However, AI-generated ideas are perceived as less novel and exhibit higher pairwise similarity, particularly with few-shot prompting, indicating a less diverse solution landscape. When focusing on the quality of the best ideas (rather than on the average ideas), we find that AI-generated ideas are seven times more likely to rank among the top 10% of ideas, demonstrating a significant advantage over human-generated ideas.
We propose that this 7:1 advantage is a conservative estimate, as it does not account for AI's greater productivity. Our findings suggest that despite some drawbacks, AI creativity presents a substantial benefit in generating high-quality ideas for new product development.

Funding: Funding was provided by the Mack Institute for Innovation Management at the Wharton School of the University of Pennsylvania.

Key words: innovation; idea generation; creativity; creative problem solving; LLM; large-scale language models; AI; artificial intelligence; ChatGPT; GPT

1. Introduction

Generative artificial intelligence (GenAI) has remarkably advanced in creating life-like images and coherent, fluent text. OpenAI's ChatGPT chatbot, based on the Generative Pre-trained Transformer (GPT) series of large language models (LLMs), can equal or surpass human performance in academic examinations and tests for professional certifications (OpenAI et al. 2023). Moreover, LLMs can provide valuable professional advice in fields like software development, medicine, and law. Despite their remarkable performance, LLMs sometimes produce text that is semantically or syntactically plausible but is, in fact, factually incorrect or nonsensical, a phenomenon often referred to as hallucinations. This outcome is a byproduct of how LLMs are designed, as they are optimized to generate the most statistically likely sequences of words with an intentional injection of randomness. In most applications, this randomness and the associated hallucinations and inconsistencies create problems that limit the use of LLM-based solutions to low-stakes settings, or they require extensive human supervision. But are there applications in which we can leverage the weaknesses of hallucinations and inconsistent quality and turn them into a strength? We propose that the domain of creativity and innovation provides such an application.
This domain operates quite differently from most management settings, where we commonly expect to use each unit of work produced. As such, consistency is prized and is, therefore, the focus of contemporary performance management. Erratic and inconsistent behavior is to be eliminated. An airline would rather hire a pilot who executes a within-safety-margins landing 10 out of 10 times than one who makes a brilliant approach five times and an unsafe approach the other five. But, when it comes to creativity and innovation, say finding a new opportunity to improve the air travel experience or launching a new aviation venture, the same airline would prefer an ideator who generates one brilliant idea and nine nonsense ideas over one who generates ten decent ideas. The reason for this difference is that when it comes to creativity and innovation, the performance of the process is not determined by the sum or the average of all ideas created. Instead, each idea is seen as a real option that the decision maker can decide to execute (Huchzermeier and Loch 2001). Thus, the performance of the process is determined by the quality of the best idea(s) (Dahan and Mendelson 2001, Terwiesch and Xu 2008, Terwiesch and Ulrich 2009, Girotra et al. 2010). The process of innovation can thereby be thought of as a search process that generates ideas with random quality values by drawing from an underlying stochastic distribution until the cost of creating one additional draw from the distribution (e.g., creating one more product concept or building one more prototype) exceeds the marginal benefit (Weitzman 1979). Prior research in product development and innovation has modeled various aspects of this search process, including the pros and cons of parallel search (Loch et al.
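The logic that the best draw, not the average, drives ideation performance can be illustrated with a small simulation. This is a hypothetical sketch: the 0-1 quality scale, the distribution parameters, and the function names below are our own illustrative choices, not figures from the studies.

```python
import random

def best_idea_quality(qualities):
    """Innovation-process performance: the quality of the best idea, not the average."""
    return max(qualities)

random.seed(42)  # fixed seed so this illustration is reproducible

# Hypothetical 0-1 quality draws: a consistent ideator vs. an erratic one.
consistent = [random.gauss(0.50, 0.05) for _ in range(10)]  # ten decent ideas
erratic = [random.gauss(0.30, 0.25) for _ in range(10)]     # mostly weak, occasionally brilliant

avg_consistent = sum(consistent) / len(consistent)
avg_erratic = sum(erratic) / len(erratic)

# Averages tend to favor the consistent ideator, but the search process is
# judged on the maximum, where the high-variance ideator has the advantage.
print("best consistent idea:", round(best_idea_quality(consistent), 2))
print("best erratic idea:   ", round(best_idea_quality(erratic), 2))
```

Under a max-of-draws objective, increasing the variance of the quality distribution can be as valuable as increasing its mean, which is why erratic ideators can outperform consistent ones.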
2001), the tension between sampling from very different regions of the pay-off distributions ("selectionism") versus locally improving a given project (Sommer and Loch 2004), and the need for building balanced portfolios that consist of different types of projects (Chao and Kavadias 2008). We follow this line of research and consider a setting in which ideas of unknown quality are created, and the quality of the best few ideas determines the overall performance. This could be a setting of corporate portfolio planning in a large established organization as described by Si et al. (2022). However, to facilitate our experimental design, we focus on idea generation in the product development process for a newly formed venture. Specifically, we look for a product idea that targets the college student market and can be sold for USD 50 or less. This innovation challenge is similar to the study settings used in prior work (e.g., Osborn 1953, Connolly et al. 1990, Sutton and Hargadon 1996, Girotra et al. 2010) to evaluate and compare various brainstorming methods (e.g., group vs. individual; nominal groups vs. hybrid groups). In contrast to this prior work, we consider ideas generated by humans and ideas generated by artificial intelligence (AI) in the form of OpenAI's GPT-4. As discussed above, LLMs are designed to generate new content, and in the domain of brainstorming, their stochastic (if not outright erratic) behavior might turn a bug into a feature. Thus, we hypothesize that LLMs have the potential to be excellent ideators. The purpose of this paper, therefore, is to formally test this hypothesis by comparing the performance of LLMs in generating new ideas to that of human idea generators. Specifically, we compare three pools of ideas for new products targeted toward college students at a price of USD 50 or less.
The first pool of ideas was created by students at an elite university enrolled in a course on product design before the availability of LLMs. The second pool was generated by OpenAI's GPT-4 with the same prompt as that given to the students and no other guidance (zero-shot prompting). The third pool was generated by prompting GPT-4 with the same prompt as that given to the students and a sample of highly rated ideas to enable some in-context learning (few-shot prompting). We evaluate the quality of the ideas using standard market research techniques and survey human respondents to predict an average purchase intent probability for each product, which we use as our measure of idea quality. We use text mining techniques to evaluate the similarities of ideas and rely on human raters to assess idea novelty. This comparison between human idea generation and AI-based idea generation allows us to contribute to the innovation literature by establishing the following novel results. First, AI-generated ideas are, on average, significantly better (average purchase intent of 0.48 relative to 0.40 for human-generated ideas), especially in the case of few-shot prompting (average purchase intent of 0.49 relative to 0.46 for zero-shot prompting), as shown in Study 1. Second, despite this success, consumers perceive AI-generated ideas as less novel (perceived novelty of 0.36 relative to 0.41). Moreover, AI-generated ideas are more likely to overlap: text mining reveals that the average pairwise similarity of ideas is higher among AI-generated ideas and further increases when using few-shot prompting. As a result, the underlying solution landscape is less likely to be fully explored (Study 2). Finally, we show that for a given number of ideas, the quality of the best ideas generated by AI is significantly greater than that of the best ideas generated by humans (Study 3).
Specifically, we show that AI-generated ideas are seven times more likely to be among the top 10% of ideas generated in our experiment. This is significant given the context. What matters for innovation is the quality of the best idea. The objective of idea generation is to generate at least a few truly exceptional ideas. In most innovation settings, we would rather have 10 great ideas and 90 terrible ideas than 100 ideas of solid quality. Holding the number of ideas constant, we need to trade off the advantageous effect of higher average idea quality (Study 1) with the disadvantages of less novelty, more overlapping ideas, and fewer ideas that can be discovered (Study 2). Study 3 clearly establishes AI's supremacy over humans in this respect. A quarter of a century ago, Goldenberg et al. (1999) asked whether AI-generated ideas can finally compete with human ones, long after researchers first considered the possibility. We believe that the three studies presented in this article provide empirical support for an affirmative answer to this question. From a practical perspective, we see the 7:1 advantage of AI creativity over human creativity as a conservative estimate, as we did not credit AI for its substantially greater productivity. The remainder of the article is organized as follows. After reviewing some recent work on GenAI and creativity (Section 2), we introduce our theoretical framework and our hypotheses (Section 3), followed by the technical set-up of our experiments (Section 4). We conducted three studies to assess the creativity of human- and AI-generated ideas. First, in Study 1 we ask human participants to rate ideas from both sources (human- and AI-generated) and compare the results (Section 5).
Second, in Study 2 we use text-based analysis to calculate how many unique ideas can be created by humans and LLMs in our specific domain, and we ask human participants to rate the novelty of ideas from both sources and compare the results (Section 6). Third, in Study 3 we look at the extremes of the idea quality distributions to identify possible advantages for the best ideas by either humans or AI (Section 7). We conclude the paper by discussing potential limitations of our studies, their robustness to alternative specifications (Section 8), and the implications of our findings (Section 9).

2. GenAI applications to creative tasks

Research to date has demonstrated three key findings regarding AI's role in creativity and innovation. First, AI frequently matches or exceeds human performance in creative tasks. Haase and Hanel (2023) found that LLMs have reached human-level performance in divergent thinking tasks such as the Alternative Uses Task (AUT). This is supported by Hubert et al. (2024), who studied GPT-4 responses for the Consequences Task and Divergent Association Tasks, finding that AI is more creative than humans across all measured dimensions. While Koivisto and Grassini (2023) find that AI chatbots outperform average human performance in the AUT, they also note that the most exceptional human ideas still match or exceed those generated by AI. Second, studies show that AI aids in improving creative outcomes for humans when used as a tool. Doshi and Hauser (2024) find that AI use helps humans to create more creative and enjoyable short stories. However, the collective diversity decreases and stories become more similar to one another. Similarly, Jia et al. (2024) found that AI assistance boosted employee creativity in a telemarketing company when responding to customer questions, ultimately increasing sales.
Zhou and Lee (2024) show that integrating text-to-image AI into creative workflows increased the number of artworks created by 25% and raised the likelihood of the works receiving a favorite per view by 50%, highlighting the benefits of LLMs augmenting human workflows ("human in the loop"). Third, studies have explored human preferences for AI-generated versus human-generated creations, often finding that people prefer human involvement. For instance, Hitsuwari et al. (2023) found that survey participants cannot distinguish between AI-generated and human-generated haikus, but rated poems co-created by humans and AI as the most beautiful, with no significant preference for haikus created solely by humans or AI. Bellaiche et al. (2023) provide evidence that humans prefer human involvement in art creation by showing that participants prefer AI-generated art falsely labeled as created by humans to the same art correctly labeled as AI-generated, suggesting a bias for human involvement in the creative process. Similarly, Shank et al. (2023) find comparable results for AI-generated classical music, although no such preference was found for electronic music. However, Zlatkov et al. (2023) found no significant preference for either AI- or human-generated music overall. Taken together, this body of research illustrates the potency of AI in creative tasks. AI not only matches human creativity but also improves human performance when used as a collaborative tool. However, at least when considering artistic outcomes, there remains a human preference for creativity that involves a human touch. This growing evidence suggests a natural next step: evaluating AI's efficacy in innovation management in general and in idea generation in particular, where artistic preferences are less important, while carefully examining potential issues such as less diverse ideas.

3.
Theoretical Framework and Hypotheses

To understand GenAI's ability to tackle various creative tasks, we must first conceptualize creativity. The literature distinguishes between three dimensions of creativity. Fluency is the ability to generate many ideas or solutions to a problem. It reflects the quantity of generated ideas. Flexibility is the capacity to produce a variety of ideas or solutions, showing an ability to shift approaches or perspectives. And originality is the ability to produce novel and unique ideas (Guilford 1967, Torrance 1968). In addition, the brainstorming literature often considers idea quality as a fourth dimension of creativity. We omit fluency as a performance metric, as comparing the number of ideas or the speed of idea generation between a computer and a human will lead to the obvious result that the computer displays greater fluency, creating more ideas per unit of time. This leaves us with idea quality, flexibility, and originality as the dimensions of comparison between humans and AI. The atomic unit of analysis in this comparison is an idea. In the context of innovation, we define an idea as a novel match between a solution and a need. As mentioned above, across three studies we will ask students as well as GenAI to come up with new product ideas targeted toward college students that can be sold for USD 50 or less. To illustrate our unit of analysis, consider one of the student-generated ideas: "Convertible High-Heel Shoe: Many prefer high-heel shoes for dress-up occasions, yet walking in high heels for more than short distances is very challenging. Might we create a stylish high-heel shoe that easily adapts to a comfortable walking configuration, say by folding down or removing a heel portion of the shoe?" In this example, the need is the desire of some people to dress up and wear high-heeled shoes for some occasions while still walking comfortably.
The proposed solution is to make the heel portion of the shoe so that it can be folded down or removed. Idea generation, by either individuals or groups, is a process that creates a stream of ideas with varying quality levels. This stream can be the result of either human effort or the use of AI. Each of these ideas can be validated on a quality scale. Our quality scale is based on a purchase intent study. Kornish and Ulrich (2014) show that the best indicator of future value creation is the average purchase intent expressed by a sample of consumers in the target market. Furthermore, they show that no single individual, expert or novice, is particularly good at estimating value. Instead, a sample of expressed purchase intent from about 15 individuals in the target market is a reliable measure of idea quality. Some ideas are likely to be brilliant (high quality), some are horrible (low quality), and most will be somewhere in between (medium quality). We can think of this uncertain quality value as a random variable drawn from an underlying pay-off distribution (Weitzman 1979, Dahan and Mendelson 2001). Recall that we chose to measure three dimensions of creativity associated with idea generation: quality, flexibility, and originality. Our first hypothesis relates to the first dimension: AI's ability to generate ideas comparable in their average quality to human-generated ideas. In other words, we focus on the mean of the underlying idea-quality distribution. We make two arguments for why GPT-4 would create ideas of higher average quality than humans. First, the training data for GPT-4 includes millions of product reviews revealing unmet user needs, social media posts of excited and frustrated customers alike, and marketing materials for countless products that have been launched more or less successfully in the past.
Second, the literature reviewed in Section 2 has established that GPT-4 has tremendous creative capabilities in other domains such as music generation or story writing.

Hypothesis 1 (Idea quality): The average quality of AI-generated ideas is higher than the average quality of human-generated ideas.

Our second hypothesis relates to the remaining two dimensions: flexibility and originality. We first define these concepts in the context of generating ideas for new products and come up with appropriate measurement scales. There exists a vast number of possible new product ideas that differ along many dimensions. We can think of ideas as positions in a high-dimensional space. OpenAI's GPT-4 models text as multi-dimensional embedding vectors in this space, where each dimension may represent a distinct attribute or feature of the text. Such vectors have hundreds of dimensions. Similar texts will often lie close to each other while different ones will be far apart. However, interpreting the distances and dimensions is often not straightforward given the high dimensionality. To illustrate, consider a two-dimensional search space like the map of a territory. For example, consider the exploration of such a territory in the search for fishing spots in the ocean. The (x, y) coordinates capture the geographic locations of schools of fish. Each location has a pay-off corresponding to the amount of fish in the water. The goal of the fisherman is to find the location with the greatest fish density. In such a search process, local adjustments along a gradient of increasing fish density may increase the value of a fishing location. Yet, in rugged solution landscapes, i.e., ones that have multiple local optima, such local search is unlikely to yield the globally optimal solution.
Thereby, the ruggedness of the underlying solution landscape makes it impossible to arrive at the most valuable fishing location (idea) in the ocean (idea space) via local adjustments. Rather, a broad exploration is needed (see Sommer and Loch 2004). Without prior knowledge about the landscape, some new locations that are very different from past locations should be explored. This creates the classic trade-off between exploration and exploitation (March 1991). With this as our backdrop, we provide two ways of operationalizing flexibility, overlap and the total number of discoverable ideas, and one way to operationalize originality, idea novelty. All three are important properties of a search process in general and of an ideation process in particular. To explain overlap, let's return to our fishing example. To explore fishing locations in an ocean, the locations should be distinctively different from each other. Even in a rugged solution landscape, some spatial correlations in pay-offs between two adjacent coordinates are likely. In much the same way, in the world of innovation, we want our ideas to be distinct from each other. To determine how distinctly different an idea is relative to other ideas, we measure the cosine similarity of its embedding vector relative to the embedding vectors of the other ideas (following Cox et al. 2021 and Dell'Acqua et al. 2023). Section 8 provides alternative measures to this analytical choice. For a given pool of ideas produced by an idea-generation process, human or AI, we can thus randomly pull out two ideas and compute the angle between the two associated embedding vectors. The cosine of such angles will range from -1 to 1, with 1 indicating identical vectors and 0 indicating no similarity (orthogonal). While negative values are possible in principle, they rarely occur in practice, as further discussed in Study 2.
By performing a pairwise comparison of all ideas and averaging their similarities, we can compute the average pool similarity. Next, we define two ideas as overlapping if their cosine similarity is above θ = 0.8. That is, we count any new idea added to the pool as overlapping if its cosine similarity exceeds 0.8 compared to any of the existing ideas in the pool. Our first measure of flexibility is based on computing the distribution of pairwise cosine similarities and counting the frequency of overlaps. We discuss this and other assumptions in Section 8 and provide extensive robustness analyses, including evaluating alternative model specifications. Next, imagine a fisherman with no memory looking for fish at random locations. Every period, this fisherman sets out and fishes, yielding an estimate for the pay-off of a specific location. How many unique fishing locations will be discovered this way? Early in the exploratory efforts, every fishing spot is an unexplored territory. Yet, as this process goes on, the likelihood of overlap increases, i.e., the fisherman is more likely to revisit a location previously tested. Given our definition of overlapping ideas (cosine similarity exceeding the θ = 0.8 threshold), we can observe a stream of incoming ideas, one by one, and determine whether a new idea is unique relative to the pool of ideas created up to this point. Early on, just like in the fisherman's case, each idea is likely unique (non-overlapping with the ideas created so far). However, as the process progresses, the percentage of overlapping ideas will increase as the underlying search space gets exhausted. For a finite sequence of T ideas, we can evaluate the number of overlapping ideas, N_overlap, and thus compute the number of unique ideas, N_unique = T - N_overlap. Definitions for how we operationalize this approach are shown in Study 2.
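The overlap and uniqueness measures described above can be sketched as follows. This is a minimal illustration with toy two-dimensional vectors; in the actual analysis, the inputs would be the high-dimensional OpenAI embedding vectors of the ideas, and the function names are our own:

```python
import math

THETA = 0.8  # overlap threshold on cosine similarity, as defined in the text

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors; 1 = identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_pool_similarity(embeddings):
    """Mean pairwise cosine similarity over all idea pairs in the pool."""
    pairs = [(i, j) for i in range(len(embeddings))
             for j in range(i + 1, len(embeddings))]
    return sum(cosine_similarity(embeddings[i], embeddings[j])
               for i, j in pairs) / len(pairs)

def count_unique(embeddings, theta=THETA):
    """Walk the idea stream in order; an idea counts as overlapping if its
    similarity to any earlier idea exceeds theta. Returns N_unique = T - N_overlap."""
    n_overlap = 0
    for t, e in enumerate(embeddings):
        if any(cosine_similarity(e, prev) > theta for prev in embeddings[:t]):
            n_overlap += 1
    return len(embeddings) - n_overlap
```

For example, for three ideas whose embeddings are [1, 0], [0, 1], and [1, 0.01], the third nearly duplicates the first (cosine similarity above 0.99), so the stream contains two unique ideas.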
In addition to utilizing idea overlap for computing the number of unique ideas in a finite stream of ideas, we can further estimate the total number of discoverable ideas in the search space, even if many were not part of the sequence of T ideas, i.e., the ideas have not (yet) been discovered. To do so, we use what in population ecology is known as a capture-recapture model, which can estimate the number of unique fishing locations based on how frequently a previously visited location is revisited by a fisherman with no memory. With such a model, we simply count the incidents of an idea overlapping with a past idea. The frequency of overlap and its increasing occurrence rate over time allow for estimating the number of ideas that can be discovered (Kornish and Ulrich 2011). This provides us with our second measure of flexibility. Next, consider originality. The search for ideas can yield ideas that are more or less novel. We measure idea novelty in the same way we measure idea quality: by directly asking potential customers to assess an idea's novelty and averaging this value. In summary, we evaluate flexibility by looking at idea overlap (which can be converted into an estimate for the number of ideas that can be discovered) and evaluate originality by directly asking consumers to rate novelty. How will a pool of AI-generated ideas compare to these human-generated ideas in terms of quality, flexibility, and originality? By their very design, GPTs are autoregressive processes. They don't plan ahead but predict one word (or token) at a time based on a context window, including the prompt and the prior words created. Such a one-word-at-a-time process is unlikely to systematically and exhaustively explore an entire solution landscape.
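The capture-recapture logic can be illustrated with the simplest two-sample (Lincoln-Petersen) estimator. This is an illustrative simplification with made-up numbers, not the exact specification used by Kornish and Ulrich (2011):

```python
def lincoln_petersen(n1, n2, recaptures):
    """Two-sample capture-recapture estimate of total population size:
    N_hat = n1 * n2 / m, where m is the number of items in the second
    sample that 'recapture' (overlap with) items from the first sample."""
    if recaptures == 0:
        raise ValueError("no recaptures observed; the estimate is unbounded")
    return n1 * n2 / recaptures

# Illustration: a first batch of 50 ideas, then a second batch of 40 ideas,
# of which 10 overlap (cosine similarity > 0.8) with the first batch.
estimated_discoverable_ideas = lincoln_petersen(50, 40, 10)  # -> 200.0
```

Intuitively, the rarer the recaptures, the larger the implied pool of discoverable ideas; frequent recaptures signal a nearly exhausted search space.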
This lack of broad exploration will be further amplified in the presence of a system prompt that illustrates the concept of ideas by providing one or multiple ideas from the past (few-shot prompting) relative to the case in which no past ideas are provided (zero-shot prompting). This should limit both the flexibility and the originality of the creative process. These arguments, taken together with existing research in other domains showing less novelty for AI-generated content versus human-generated content (Doshi and Hauser 2024), lead to the following two hypotheses:

Hypothesis 2a (Flexibility): The likelihood of two ideas overlapping is higher for a pool of AI-generated ideas than for a pool of human-generated ideas, resulting in fewer discoverable ideas.

Hypothesis 2b (Originality): The average novelty of AI-generated ideas is lower than that of human-generated ideas.

Our third hypothesis returns to the concept of idea quality. This time, however, we are not concerned with the average idea quality but instead focus on the quality of the best ideas. Rather than focusing on the quality of the single best idea (the extreme value, Dahan and Mendelson 2001), we focus on the 90th percentile of the idea quality distribution, i.e., the top 10 percent of the ideas. We do so for two reasons. The first reason is statistical estimation: for a single experiment like ours, there simply does not exist a test that allows us to make statistically significant statements for a single data point. Moving to the 90th percentile, we can compare the mean across larger groups of ideas (Section 8 presents our results for other percentiles). There also exists a second, managerial reason. In many, if not most, practical settings, the assessment of idea quality is noisy, especially in the early stages of an innovation process when an idea is nothing but a title and a few words.
For this reason, innovation tournaments don't just advance a single idea to the next round, but a set of the top x percent of the most promising ideas, where x can vary widely but typically ranges between 10 and 50 percent (Terwiesch and Ulrich 2009). We therefore state:

Hypothesis 3 (Top Decile): The quality of the 90th percentile of AI-generated ideas is higher than that of the 90th percentile of human-generated ideas.

4. Experimental setup

For our experiment, we utilize three different pools of ideas, namely student-generated ideas, GPT-4-generated ideas with zero-shot prompting, and GPT-4-generated ideas with few-shot prompting. For the student pool, we rely on data collected in 2021 in a product design and innovation course at an elite university. In this course, 50 students participated in an innovation challenge to come up with ideas for a physical product marketed to college students for USD 50 or less (this price cap is imposed to limit the complexity of the projects in a one-semester course). The challenge was organized in a traditional innovation tournament format (Terwiesch and Ulrich 2009, 2023), in which individuals first independently generate many ideas, which are then combined into a pool of several hundred ideas and subsequently evaluated by others in the group (i.e., crowdsourced evaluations). Thus, we have access to a large set of ideas generated by humans before AI tools became widely available to enhance ideation. Specifically, we use a pool of independently aggregated human ideas by randomly selecting 200 entries, each comprising a descriptive title and a paragraph of text, from the student ideas generated in these challenges in 2021 (i.e., at a time prior to the widespread availability of ChatGPT and other LLMs). The set of 200 ideas constitutes our first pool and forms the baseline for comparison with the ideas generated using LLMs. We prompt OpenAI's GPT-4 (more specifically, gpt-4-0314) with the same prompt we gave the students.
No LLM yet acts entirely autonomously. Rather, LLMs are tools used by humans to complete tasks. For this study, we aim for minimal prompt engineering, thus representing a novice user scenario. However, we acknowledge that many strategies could potentially improve LLM performance. For instance, Mihm and Schlapp (2019) show that providing feedback during ideation contests can further improve the performance of human innovators, and we expect this to hold for LLMs as well. For our first LLM-generated idea pool, we use the system prompt to provide contextual information and subsequent user prompts to ask for ideas, ten at a time. The user prompt includes the additional request that the descriptions be 40-80 words, like the student sample.

System Prompt: You are a creative entrepreneur looking to generate new product ideas. The product will target college students in the United States. It should be a physical good, not a service or software. I'd like a product that could be sold at a retail price of less than about USD 50. The ideas are just ideas. The product need not yet exist, nor may it necessarily be clearly feasible. Number all ideas and give them a name. The name and idea are separated by a colon.

User Prompt: Please generate ten ideas as ten separate paragraphs. The idea should be expressed as a paragraph of 40-80 words.

The model used for all work covered in this paper is gpt-4-0314 with the temperature parameter at 0.7 to retain randomness and thus greater creativity. The temperature parameter controls the randomness of the output, with lower values leading to more deterministic output and higher values increasing variability. At the time of the experiment, the suggested default value for temperature was 0.7 to strike a balance between coherence and creativity, without possibly sampling highly unlikely tokens (i.e., semantic chunks used for representational efficiency) that lead to undesirable responses.
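The effect of the temperature parameter can be seen in a small softmax sketch. The logit values below are hypothetical next-token scores chosen only for illustration:

```python
import math

def token_probabilities(logits, temperature):
    """Softmax with temperature: p_i is proportional to exp(logit_i / T).
    Lower T concentrates probability on the most likely token; higher T
    spreads probability toward less likely tokens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                  # hypothetical next-token scores
cold = token_probabilities(logits, 0.2)   # near-deterministic sampling
warm = token_probabilities(logits, 0.7)   # the setting used in this paper
# At T = 0.2 the top token takes almost all the probability mass; at T = 0.7
# lower-ranked (more surprising) tokens keep a meaningful chance of being sampled.
```

This is why a moderate temperature preserves the "intentional injection of randomness" discussed in the introduction without frequently sampling highly unlikely tokens.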
An obstacle to using GPT-4 for generating hundreds of ideas is its finite memory, typically limited to the number of tokens the underlying LLM can consider in generating its responses. Once the number of tokens in a session exceeds the model's limit, the LLM has no memory of the first ideas generated, and subsequent ideas can become increasingly redundant. The context window of the version of GPT-4 we had access to was about 8,000 tokens, roughly 7,000 words or approximately 80 ideas (some tokens are used for the system and user prompts and idea titles). To generate more than the roughly 80 ideas permitted by the limited context window, we asked GPT-4 to compress the previously generated ideas into shorter summaries. These summaries were then provided to the model before generating the next batch of ideas, ensuring that the model knows the previously generated ideas while remaining within the context limits. We used the below summarization prompt, followed by the original system prompt and generated summaries, and finally, a user prompt that explicitly asks for different ideas. This constitutes our second pool of comparison.

Summarization Prompt: Aggressively compress the following ideas so that their original meaning remains but they are much shorter. You can use tags or keywords.: [Ideas generated so far]

System Prompt: [Original System Prompt] Previously you generated the following ideas and should not repeat them: [Summaries]

User Prompt: [Original User Prompt] Make sure they are different from the previous ideas.

For our second pool of LLM-generated ideas, we provide the LLM with examples (few-shot learning) of high-quality ideas generated by students. In particular, we appended six highly rated ideas from a separate student set that completed the same exercise to our prompts and informed GPT-4 that these ideas had been well received by students in our class.
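The prompt-chaining step described above can be sketched as a pure message-building function. This is an assumption-laden sketch: the function name and abridged prompt strings are ours, the messages follow the standard chat role convention, and the actual API call to gpt-4-0314 (and the summarization round itself) is omitted:

```python
# Abridged stand-ins for the full prompts quoted in the text.
SYSTEM_PROMPT = "You are a creative entrepreneur looking to generate new product ideas. ..."
USER_PROMPT = "Please generate ten ideas as ten separate paragraphs. ..."

def build_batch_messages(summaries):
    """Assemble the chat messages for the next batch of ten ideas. If compressed
    summaries of earlier ideas exist, they are injected into the system prompt so
    the model can avoid repeats while staying inside the context window."""
    system = SYSTEM_PROMPT
    user = USER_PROMPT
    if summaries:
        system += (" Previously you generated the following ideas and "
                   "should not repeat them: " + summaries)
        user += " Make sure they are different from the previous ideas."
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```

The first batch is requested with empty summaries; after each batch, the accumulated ideas are compressed with the summarization prompt and fed back in on the next call.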
We used six examples due to context window limitations at the time of the experiment, as well as drawing on previous experiments in in-context few-shot learning showing that too many examples can degrade performance (see Meincke and Carton 2024). This constitutes our third pool of comparison.

Good Ideas Prompt: [Original System Prompt] Here are some well received ideas for inspiration: [Good Ideas]

Overall, we generated 100 ideas using zero-shot prompting and another 100 using few-shot prompting. The resulting average word count is 69 for GPT-4-generated ideas and 71 for GPT-4 provided with examples. The average description is 63 words long for student ideas. We compared the resulting few-shot prompted ideas to the examples provided to ensure that GPT-4 did not simply slightly modify the examples. The average pairwise cosine similarity between the six examples and the 100 generated ideas is 0.33, and the highest similarity between two ideas is 0.51. Thus, we have no reason to believe that GPT-4 repeated the provided ideas.

5. Study 1: Comparing the Quality of Ideas Generated by Humans and AI

The Institutional Review Board (IRB) at the University of Pennsylvania approved the research described in this paper in May 2023, Protocol 853634. We used the online platform Prolific to recruit college-age individuals from the United States to evaluate all 400 ideas from the three pools (pool 1 with 200 ideas created by humans, pool 2 with 100 created by GPT-4 with zero-shot prompting, and pool 3 with 100 created by GPT-4 with few-shot prompting) via a purchase intent survey. We presented ideas in random order and randomized at the idea level, meaning that every survey participant could potentially see ideas from multiple sources. Each respondent evaluated an average of 40 ideas. On average, each idea was evaluated 20 times. In the summer of 2023, concerns surfaced that ChatGPT was being used to provide mTurk responses.
This practice appears to have been limited to text generation tasks, not to multiple-choice tasks like our five-box purchase-intent survey. Indeed, just answering the survey question directly requires less effort than trying to deploy ChatGPT to answer the question. We thus believe that our study participants were humans. We asked respondents to express purchase intent using the standard five-box options: definitely would not purchase, probably would not purchase, might or might not purchase, probably would purchase, and definitely would purchase. Jameson and Bass (1989) recommend weighting the five possible responses as 0, 0.25, 0.50, 0.75, and 1.00 to develop a single measure of purchase probability, which we use as a measure of idea quality (other weightings are possible, as we discuss in Section 8). Figure 1 shows the full quality distribution of ideas generated by the three pools.

Figure 1: Distribution of idea quality for three sets of ideas. Notes. Purchase intent is the weighted average of the five-box response scale per Jameson and Bass (1989).

Figure 1 shows the quality (purchase probability) of ideas across the three pools. On average, GPT-4 generated ideas with greater purchase intent (46.4% with zero-shot prompting and 49.3% with few-shot prompting) than humans (40.4%). The standard deviation of the quality of ideas is comparable between the three pools. We formally test the impact of idea source on the perceived quality of product ideas via a linear mixed-effects model with purchase intent as the dependent variable. The model included two fixed effects denoting source (humans are the baseline) and random intercepts and slopes for respondents and ideas. We find significant differences in the perceived quality of ideas as a function of their source.
Ideas generated by GPT-4 with no examples (zero-shot) were rated significantly higher than human-generated ideas (β = 0.059; 95% CI [0.031, 0.088]; t(246) = 4.06, p < 0.001), and ideas generated by GPT-4 provided with positive examples (few-shot) received even higher ratings (β = 0.089; 95% CI [0.060, 0.12]; t(223) = 5.93, p < 0.001). Purchase intent is weakly significantly different between the two pools of LLM-generated ideas (β = 0.03; 95% CI [-0.01, 0.06]; t(199) = 1.892, p = 0.06). These findings indicate that LLM-generated ideas are, on average, more likely to be purchased than human-generated ideas (for additional robustness tests, see Section 8).

6. Study 2: Diversity and Novelty of Ideas

Our second study focuses on how the fraction of overlapping ideas and the resulting estimated total number of ideas the process can generate (idea flexibility, hypothesis 2a), as well as the perceived novelty of the ideas as assessed by human raters (idea originality, hypothesis 2b), depend on the idea source.

6.1. Overlapping Ideas

An idea-generation process creates a sequence of ideas in which each additional idea generated can be compared to the previously created ideas according to its similarity. For a pool of ideas, we can hence compute the average pairwise similarity of one idea compared to all other ideas and then compute the average overall similarity for the entire pool. We can also apply a threshold to pairwise idea similarity to identify at what point the ideas start to become more repetitive, i.e., when we are starting to exhaust the space of new ideas given a particular idea-generation process. A pool of ideas then might have a few overlapping ideas, which informs our second quantitative metric: the total number of ideas the process can generate. To measure the diversity of the ideas, we calculate the cosine similarity of each idea relative to the rest of the set. We first calculate a vector of text embeddings for each idea. We follow the technical setup in Dell'Acqua et al.
(2023) and use Google's Universal Sentence Encoder (USE) model for our idea embeddings, which is specifically optimized for semantic similarity between sentences. Table 1 shows the results. In geometry, the cosine of the angle between two vectors ranges from -1 to 1. However, when using Google USE, negative similarity is rarely encountered, since the overall text structure does not substantially differ between ideas. Ideas follow a similar pattern in terms of text length and style, often leading with the title before the idea description. In our test, a cosine similarity of 1 between two ideas thus indicates that they are very similar (their embedding vectors are aligned), whereas a cosine similarity of 0 implies orthogonal or unrelated ideas. We consider a new idea added to an idea pool to be unique if its pairwise cosine similarity compared to all previously added ideas is never greater than 0.8. Additional robustness checks using different thresholds and measures can be found in Section 8.

Table 1: Summary Statistics for Idea Overlap

| | Student Ideas | GPT-4 zero-shot | GPT-4 few-shot |
| N Ideas | 200 | 100 | 100 |
| Average cosine similarity of all ideas | 0.221 | 0.415 | 0.428 |
| Fraction of ideas in pool with cosine similarity > 0.8 | 0.00 | 0.05 | 0.07 |

Notes. We compute the fraction as the number of ideas whose average pairwise similarity compared to all other ideas in the pool exceeds 0.8, divided by the total number of ideas in the pool.

For each pool, we compute the average pairwise similarity between all ideas. One-way ANOVA analyses show that the source has a significant effect on the cosine similarity between the three pools. The difference between all three groups is significant (η² = 0.455, 95% CI [-0.210, -0.204], F(2, 29598) = 12340.95, p < 0.001). Considering only two groups, human ideas have a significantly smaller cosine similarity than GPT-4-generated ideas (η² = 0.358, 95% CI [-0.197, -0.190], F(1, 24649) = 13715.82, p < 0.001).
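The similarity screen described above (pairwise cosine similarity with a 0.8 uniqueness threshold) can be sketched in a few lines. Toy 3-dimensional vectors stand in here for the 512-dimensional USE embeddings used in the study; the function names are illustrative.

```python
import math

# Sketch of the uniqueness screen: pairwise cosine similarity over
# embedding vectors; an idea counts as unique only if it never exceeds
# the 0.8 threshold against previously added ideas.

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(embs: list[list[float]]) -> float:
    """Average cosine similarity over all pairs of ideas in a pool."""
    pairs = [(i, j) for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return sum(cosine(embs[i], embs[j]) for i, j in pairs) / len(pairs)

def is_unique(new_emb: list[float], pool_embs: list[list[float]],
              threshold: float = 0.8) -> bool:
    """True if the new idea never exceeds the similarity threshold."""
    return all(cosine(new_emb, e) <= threshold for e in pool_embs)

pool = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two orthogonal toy "ideas"
near_dup = [0.9, 0.1, 0.0]                 # similarity ~0.99 to the first idea
unique_idea = [0.0, 0.0, 1.0]              # orthogonal to both pool ideas
```

With real USE embeddings, `pool` would hold one 512-dimensional vector per idea description.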
Zero-shot GPT-4 ideas exhibit a significantly smaller cosine similarity than few-shot GPT-4 ideas (η² = 0.004, 95% CI [-0.018, -0.010], F(1, 9898) = 44.24, p < 0.001). Because there is no overlap among human-generated ideas under our cosine similarity threshold, the fraction of overlapping ideas would be zero and the estimated number of unique ideas infinitely large, in line with hypothesis 2a. A larger pool of student ideas will eventually contain overlapping ideas (see Kornish and Ulrich 2014 for estimates), but based on our assumptions for similarity, the student sample only contains unique ideas. We perform a binomial test to formally estimate the significance of the differences. We find that the fraction of similar human-generated ideas (95% CI for fraction [0.0, 0.0184]) is significantly smaller than that of the zero-shot GPT-4 ideas (RD = -0.05, 95% CI [-0.093, -0.007], p < 0.001) and the few-shot GPT-4 ideas (RD = -0.07, 95% CI [-0.120, -0.020], p < 0.001), supporting hypothesis 2a. The difference between the two GPT-4 pools is not significant (RD = -0.02, 95% CI [-0.086, 0.046], p = 0.56). Our findings suggest that the human ideation process generates a greater number of distinct ideas than GPT-4. We calculate the exact numbers in the next section.

Figure 2: Distribution of cosine similarities across the three pools. Notes. Density plot of cosine similarities comparing all three pools. The dotted line shows the mean and confidence interval of the estimate for a pool used for the ANOVA. The difference between all three groups is significant (η² = 0.455, 95% CI [-0.210, -0.204], F(2, 29598) = 12340.95, p < 0.001).

6.2. Number of Discoverable Ideas

Given the fraction of unique ideas, we can estimate the number of unique ideas that could be generated by each of our three processes (pools): students, LLM (zero-shot), and LLM prompted with examples (few-shot), using the method of Kornish and Ulrich (2011).
This capture-recapture method, which analyzes the probability that the next idea in a sequence is unique, reportedly originates with Laplace (Cochran 1978) and has since been adapted to wildlife ecology and other domains. For illustration, consider fishing in a lake as a metaphor for the idea-generation process. Each idea is a catch, and the fish is released back into the lake. Sometimes, the same fish will be caught again. The more frequently an individual fish is re-caught, the smaller the estimate of the overall fish population. Thus, the probability that a fish has never been caught previously is a decreasing function of the number of ideas generated. This probability decay is typically represented by an exponential function:

p(n) = e^{-an}    (1)

We define p(n) as the probability that the next idea is unique given that n ideas have been generated already. The expected number of unique ideas out of n generated, u(n), is the integral under this curve:

u(n) = (1/a)(1 - e^{-an})    (2)

This form of probability decay comes from a specific underlying process, with T unique ideas total (T fish in the pond), each equally likely to be drawn. This assumption is commonly used in the Lincoln-Peterson method (Lincoln 1930), the standard model for estimating population size in the literature on wildlife ecology. The decay parameter and the total T are linked: T = 1/a. This model has only a single parameter, a, which is the inverse of the size of the opportunity space, i.e., an estimate of the total number of unique ideas that an unlimited number of comparable idea generators, each generating an enormous number of ideas, would generate. Given a set of ideas generated and a count of the number of unique ideas in that set, the model can be used to calculate T, an estimate of the size of the opportunity space.
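Substituting T = 1/a into equation (2) gives u(n) = T(1 - e^{-n/T}), which is monotone increasing in T, so the opportunity-space size can be recovered from an observed unique count by simple bisection. A minimal sketch in pure Python (no fitting library assumed; the function names are ours):

```python
import math

def expected_unique(T: float, n: int) -> float:
    """u(n) = T * (1 - exp(-n/T)): expected unique ideas after n draws
    from an opportunity space of T equally likely ideas."""
    return T * (1.0 - math.exp(-n / T))

def estimate_T(n: int, unique: int, lo: float = 1.0, hi: float = 1e7) -> float:
    """Bisect for the opportunity-space size T such that u(n) = unique.
    Valid because expected_unique is increasing in T."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if expected_unique(mid, n) < unique:
            lo = mid  # too few uniques predicted: the pool must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0

T_zero = estimate_T(100, 95)  # ~966 discoverable ideas (zero-shot pool)
T_few = estimate_T(100, 93)   # ~680 discoverable ideas (few-shot pool)
```

Plugging in the unique counts reported in the next section (95 and 93 of 100) reproduces the paper's estimates of roughly 966 and 680 discoverable ideas.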
Using the similarity threshold of 0.8 from the cosine similarity metric, we found that 5 of the 100 ideas generated by the LLM with zero-shot prompting were essentially similar to an idea already generated (fish recaptured), and that 7 of the 100 ideas generated via few-shot prompting were redundant. Thus, u(100) is 95 in the first case and 93 in the second case. This corresponds to an estimate of T of 966 ideas (zero-shot) and of 680 ideas (few-shot), respectively. In our sample, human-generated ideas were all unique. Thus, as expected from our overlap calculations, and based on the estimates provided by the capture-recapture model, we find support for the second quantitative metric of hypothesis 2a. The number of unique ideas that can be discovered is lower for both pools of AI-generated ideas than for the human idea-generation process. In addition, prompting the LLM with examples seems to further reduce the estimated number of unique ideas available to the process. We perform additional robustness checks in Section 8.

6.3. Perceived Novelty

Given that LLMs are designed to generate the statistically most plausible sequence of text based on their training data, perhaps they generate less novel ideas than humans. Novelty is not a goal expressed in the prompt used in this study for either humans or GPT-4 and is typically not a primary objective in commercial product development efforts. Still, to ensure that GPT-generated ideas are not merely lists of existing ideas, we investigate how the novelty of ideas varies between LLM-generated ideas and those generated by humans. Based on Shibayama et al. (2021), we assessed novelty by asking respondents on Prolific the question "Relative to other products you have seen, how novel do you consider the idea for this new product?" [0: Not at all novel, 0.25: Slightly novel, 0.5: Moderately novel, 0.75: Very novel, 1: Extremely novel].
The average novelty of human-generated ideas is 40.6% (SD: 0.117), which is greater than that of zero-shot GPT-4 (36.7%, SD: 0.101) and few-shot GPT-4 (36.1%, SD: 0.111; see Figure 3). Similar to purchase intent, we estimate a linear mixed-effects model to investigate how the idea source (human ideas, zero-shot GPT-4, and few-shot GPT-4) affects the perceived novelty of product ideas. The model includes two fixed effects denoting the source (humans are the baseline) and random intercepts and slopes for both respondents and ideas. We find significant differences in perceived novelty between human and zero-shot GPT-4-generated ideas (β = -0.038; 95% CI [-0.066, -0.01]; t(269) = -2.67, p = 0.008) at the α = 0.05 threshold. Ideas generated by few-shot GPT-4 also receive significantly lower novelty ratings (β = -0.049; 95% CI [-0.078, -0.02]; t(268) = -3.4, p < 0.001) compared to human-generated ideas. These findings suggest that LLM-generated ideas are, on average, perceived as less novel than human-generated ideas. Perceived novelty is not significantly different between the two pools of LLM-generated ideas (β = -0.01; 95% CI [-0.039, 0.017]; t(195) = -0.757, p = 0.45). Of note, novelty does not appear to be significantly correlated with purchase intent. The correlation coefficient is slightly negative at -0.08 (95% CI [-0.176, 0.016], p = 0.12). Additional robustness checks can be found in Section 8.

Figure 3: Distribution of novelty ratings for three samples of ideas. Notes. Novelty based on mTurk assessment per Kwon, Kim, and Lee (2009).

These findings support Hypothesis 2b: AI-generated ideas are, on average, less novel than human-generated ideas. Of note, the average novelty of all ideas, irrespective of source, lies between slightly and moderately novel. While human ideas are around 0.047 points more novel, there is little reason to believe that novelty alone, i.e., being the first to think of an idea, leads to a significant financial advantage.
As Terwiesch and Ulrich (2010) and others have argued, the first-mover advantage is a myth. As such, from a commercial point of view, we don't believe that the slightly lower novelty outweighs the productivity and quality benefits of LLMs.

7. Study 3: What Is the Quality of the Best Idea(s)?

Table 2 summarizes the titles of the top 40 ideas (10%) in our pool, that is, the top 40 of the 400 ideas evaluated.

Table 2: Top 10% of Ideas by Purchase Intent

| Title | Source | Purchase Intent | Novelty |
| Compact Printer | GPT-4 (Few-Shot) | 0.76 | 0.55 |
| Solar-Powered Gadget Charger | GPT-4 (Few-Shot) | 0.75 | 0.44 |
| QuickClean Mini Vacuum | GPT-4 (Zero-Shot) | 0.75 | 0.30 |
| Noise-Canceling Headphones | GPT-4 (Few-Shot) | 0.72 | 0.18 |
| StudyErgo Seat Cushion | GPT-4 (Zero-Shot) | 0.72 | 0.39 |
| Multifunctional Desk Organizer | GPT-4 (Few-Shot) | 0.71 | 0.21 |
| Reusable Silicone Food Storage Bags | GPT-4 (Few-Shot) | 0.68 | 0.34 |
| Portable Closet Organizer | GPT-4 (Few-Shot) | 0.67 | 0.23 |
| Dorm Room Chef [oven, microwave and toaster]* | GPT-4 (Few-Shot) | 0.67 | 0.71 |
| Collegiate Cookware | GPT-4 (Few-Shot) | 0.67 | 0.45 |
| Collapsible Laundry Basket | GPT-4 (Few-Shot) | 0.65 | 0.21 |
| On-the-Go Charging Pouch | GPT-4 (Few-Shot) | 0.65 | 0.33 |
| GreenEats Reusable Containers | GPT-4 (Zero-Shot) | 0.65 | 0.21 |
| HydrationStation [bottle with filter]* | GPT-4 (Zero-Shot) | 0.64 | 0.19 |
| Reusable Shopping Bag Set | GPT-4 (Few-Shot) | 0.64 | 0.19 |
| CollegeLife Collapsible Laundry Hamper | GPT-4 (Zero-Shot) | 0.64 | 0.26 |
| Adaptiflex [cord extension to fit big adapters]* | Student | 0.64 | 0.44 |
| SpaceSaver Hangers | GPT-4 (Zero-Shot) | 0.64 | 0.33 |
| Dorm Room Air Purifier | GPT-4 (Few-Shot) | 0.63 | 0.29 |
| Smart Power Strip | GPT-4 (Few-Shot) | 0.63 | 0.22 |
| CampusCharger Pro | GPT-4 (Zero-Shot) | 0.63 | 0.31 |
| Kitchen Safe Gloves | Student | 0.62 | 0.31 |
| Nightstand Nook [charging, cup holder]* | GPT-4 (Few-Shot) | 0.62 | 0.43 |
| Mini Steamer | GPT-4 (Few-Shot) | 0.62 | 0.41 |
| CollegeCare First Aid Kit | GPT-4 (Zero-Shot) | 0.62 | 0.26 |
| StudySoundProof [soundproofing panels]* | GPT-4 (Zero-Shot) | 0.62 | 0.57 |
| FreshAir Fan | GPT-4 (Zero-Shot) | 0.62 | 0.29 |
| StudyBuddy Lamp [portable, usb charging]* | GPT-4 (Zero-Shot) | 0.62 | 0.43 |
| Bluetooth Signal Merger [share music]* | Student | 0.62 | 0.41 |
| Adjustable Laptop Riser | GPT-4 (Few-Shot) | 0.62 | 0.21 |
| EcoCharge [solar powered charger]* | GPT-4 (Zero-Shot) | 0.62 | 0.43 |
| Smartphone Projector | Student | 0.62 | 0.57 |
| Grocery Helper [hook to carry multiple bags]* | Student | 0.62 | 0.53 |
| FitnessOnTheGo [portable gym equipment]* | GPT-4 (Zero-Shot) | 0.62 | 0.42 |
| Multipurpose Fitness Equipment | GPT-4 (Few-Shot) | 0.62 | 0.37 |
| CollegeCooker | GPT-4 (Zero-Shot) | 0.61 | 0.50 |
| Multifunctional Wall Organizer | GPT-4 (Few-Shot) | 0.61 | 0.31 |
| DormDoc Portable Scanner | GPT-4 (Zero-Shot) | 0.61 | 0.49 |
| Mobile Charging Station Organizer | GPT-4 (Few-Shot) | 0.61 | 0.26 |
| StudyMate Planner | GPT-4 (Few-Shot) | 0.61 | 0.22 |
| DormChef Kitchen Set | GPT-4 (Zero-Shot) | 0.61 | 0.33 |
| LaundryBuddy [laundry basket]* | GPT-4 (Zero-Shot) | 0.61 | 0.30 |

Notes. The asterisk (*) denotes ideas where the text in square brackets [ ] is not part of the original title and was added to clarify the idea.

Among the top 40 ideas (top decile), 35 (87.5%) were generated by GPT-4 (see Table 3). In other words, for every human idea in the top 10%, we count 7 ideas generated by GPT-4. A Chi-Square test of independence, with the null hypothesis that sources are represented among the top ideas in proportion to their pool sizes (expected counts 20, 10, and 10), rejected the null hypothesis (χ² = 26.39, df = 2, p < 0.001), thus confirming hypothesis 3.

Table 3: Best Ideas Across Pools

| | Student Ideas | GPT-4 zero-shot | GPT-4 few-shot |
| N Ideas | 200 | 100 | 100 |
| Average Quality of Top Decile | 0.62 | 0.64 | 0.66 |
| Average Novelty of Top Decile | 0.45 | 0.35 | 0.33 |
| Fraction of the top decile of pooled ideas from this source | 5/40 | 15/40 | 20/40 |

To better understand how the full distribution of idea qualities is affected by the idea source, we use quantile regression analysis. Quantile regression (Koenker and Hallock 2001) extends traditional regression by computing the relationship between explanatory variables (idea source) and the response variable (idea quality) for different percentiles of the data.
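The quantile comparison can be illustrated without a regression package: for each percentile, compare the empirical quantile of an AI pool against the human pool. The quality scores below are synthetic and purely illustrative (drawn from normal distributions with shifted means), not the study's data.

```python
import random

# Illustrative quantile-by-quantile comparison of two idea pools.
# Synthetic scores only: the AI pool is shifted upward, loosely mimicking
# the pattern the paper reports in Figure 4.

def quantile(xs: list[float], q: float) -> float:
    """Empirical q-quantile (nearest-rank) of a sample, 0 <= q < 1."""
    ys = sorted(xs)
    idx = min(len(ys) - 1, int(q * len(ys)))
    return ys[idx]

random.seed(0)
human = [random.gauss(0.40, 0.08) for _ in range(200)]  # hypothetical pool
ai = [random.gauss(0.47, 0.08) for _ in range(200)]     # hypothetical pool

for q in (0.10, 0.50, 0.80, 0.95):
    diff = quantile(ai, q) - quantile(human, q)
    print(f"{int(q * 100)}th percentile: AI - human = {diff:+.3f}")
```

A full quantile regression additionally estimates standard errors per percentile; this sketch only shows the point comparison the figure plots.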
As mentioned above, in innovation, the quality of the best ideas is generally more important than the average quality. That is, we prefer a few exceptional ideas to a lot of mediocre ones. Using quantile regression, we can examine the tails of the distribution instead of the mean, allowing us to test whether GPT-4 excels at generating high-quality ideas only for specific percentiles or whether the effect holds across the entire distribution. Our analysis follows Girotra et al. (2010). We use the average idea quality ratings as the dependent variable, and our explanatory variable indicates whether the idea is human-generated (baseline level) or AI-generated (GPT-4 zero-shot or GPT-4 few-shot prompting). Figure 4 shows the results. For all percentiles, GPT-4 ideas consistently outperform student ideas. The effect is especially pronounced in the upper tail of the distribution (80th percentile and above), where GPT-4 has the strongest advantage. This implies that not only does GPT-4 generate better ideas on average, but it is also especially adept at producing top-tier ideas compared to students.

Figure 4: Estimated Difference in Idea Quality Ratings between AI-Generated Ideas and Human-Generated Ideas (Baseline), for Different Percentiles

8. Discussion and Limitations

In this section, we discuss conceptual limitations of our work, limitations related to our research design and data analysis, and the robustness of our analysis to a set of alternative specifications and assumptions. Our findings indicate that GPT-4 produces higher-quality ideas that are more likely to be purchased than human-generated ideas, though they are perceived as less novel. AI significantly outperforms human creativity in generating top-tier ideas, with GPT-4 ideas being seven times more likely to rank in the top 10%. Given AI's advantage in both quality and productivity, our findings have profound implications for the field of innovation management.
For instance, AI can serve as a first step in brainstorming sessions, allowing organizations to rapidly explore a wide variety of ideas with minimal cost and time investment. Human ideators can also provide AI with their own interesting ideas and refine them with the help of AI. Another important implication lies in the potential shift of focus from idea generation to idea evaluation. If LLMs can reliably produce numerous high-quality ideas at very low cost, companies might allocate more resources toward assessing and refining those ideas instead of ideating from scratch. This shift could lead to the development of new tools and frameworks specifically designed to help organizations sort, rank, and prioritize AI-generated ideas, further streamlining the innovation process. However, while the results show that GPT-4 outperforms human creativity in terms of producing top-tier ideas, the reduced novelty and increased similarity among AI-generated ideas point to a limitation. This suggests that a human in the loop is still important to drive the ideation direction and ensure that ideas are as novel as possible. Future research could explore ways to mitigate this issue by enhancing LLMs' ability to generate more diverse and creative solutions through techniques such as fine-tuning. Investigating whether LLMs can evaluate ideas with the same rigor as human evaluators would help to further improve the ideation process. It would allow an LLM to get immediate feedback on its creations, leaving humans to focus on implementation and strategy.

8.1 Conceptual and Research Design Limitations

Conceptually, our prompting approach (i.e., a simple prompt) is not optimized for creativity or novelty. It also follows a single-ideator setup instead of approaches such as hybrid brainstorming that lead to more and better ideas (Girotra et al. 2010). A model given more specific instructions on how to ideate effectively might thus perform even better.
Different prompting techniques such as Chain-of-Thought (CoT), which asks the model to reason through a problem in multiple steps instead of directly providing an answer (Wei et al. 2023), might also improve performance. Furthermore, providing the model with hundreds of good ideas, either via many-shot learning or fine-tuning, could also enhance performance. This suggests that we likely underestimate the true power of AI-based idea generation. Second, it is possible that professional product innovators would generate better ideas than our students. However, this has not been the experience of the paper's authors, who have taught many academic courses and worked in many product development settings. Many students who participated in the innovation contests have gone on to be product innovators, sometimes based on ideas from the course tournament. Nevertheless, we have not produced evidence that GPT-4 is better than the best product innovators working today. However, we believe that we can claim conservatively that GPT-4 is better than many human product innovators working today and probably better than average. Thus, at a very minimum, an LLM could elevate the least capable humans to a better-than-average level of performance. Third, GPT might be a great salesperson. As such, it is possible that the writing style (the "pitch") convinces the customers rather than the idea itself. Prior work in other domains suggests that text generated by LLMs is not distinguishable from that generated by humans (Brown et al. 2020), though recent work has developed sophisticated measures to detect LLM-generated text (Mitchell et al. 2023, Kobak et al. 2024, Venkatraman et al. 2024). For example, Kobak et al. (2024) provide intuitions that could be used to identify LLM-generated text, such as words that are not commonly used by the majority of English speakers, like "delve".
However, it is unlikely that these characteristics were known to our survey participants at the time of our experiment in May and June 2023, or that any particular idea generated by GPT-4 could easily be distinguished from those generated by our students. Future research could use LLMs to present human-generated ideas in a way that more closely mimics the presentation style of LLM-generated ideas, ensuring that the quality of the idea is not confounded by its presentation style. Fourth, our study is set in the widely understood domain of consumer products for the college student market that cost less than USD 50. Presumably, there exists a lot of commentary and data about such products in the training data used by the GPT class of language models. As such, it is unclear whether our results would generalize to more specialized domains, such as surgical instruments. Organizations looking for opportunities in these specialized domains should fine-tune language models with their own proprietary data to achieve comparable or better performance. Fifth, innovation often benefits from collaboration and is not solely focused on one ideator generating many ideas. Liu et al. (2018) show that collaborating with other innovators improves the creative process by enabling the transfer of critical skills and knowledge, particularly when those collaborations involve highly skilled innovators. Future work should investigate whether this can be applied to human and LLM interaction, and whether an LLM could help a novice human innovator become better.

8.2 Robustness

There are different ways to analyze the data. Here, we provide additional robustness checks that investigate the validity of our results under various specifications.

8.2.1 Study 1

To measure purchase intent, it is possible to use other convex weighting schemes. Ulrich and Eppinger (2007) weigh "definitely would purchase" as 0.4 and "probably would purchase" as 0.2, with all other responses rated as 0.
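Both weighting schemes, the Jameson-Bass weights used in Study 1 and the Ulrich-Eppinger top-two-box variant just described, reduce to a weighted average over the five-box responses. A minimal sketch with hypothetical rating counts (the ratings below are illustrative, not study data):

```python
# Two convex weighting schemes for the five-box purchase-intent scale:
# Jameson-Bass (1989):   0, 0.25, 0.50, 0.75, 1.00
# Ulrich-Eppinger (2007): 0, 0,    0,    0.2,  0.4  (top-two-box variant)
BOXES = [
    "definitely would not purchase",
    "probably would not purchase",
    "might or might not purchase",
    "probably would purchase",
    "definitely would purchase",
]
JAMESON_BASS = dict(zip(BOXES, [0.0, 0.25, 0.50, 0.75, 1.00]))
ULRICH_EPPINGER = dict(zip(BOXES, [0.0, 0.0, 0.0, 0.2, 0.4]))

def score(responses: list[str], weights: dict) -> float:
    """Average weighted purchase probability for one idea across raters."""
    return sum(weights[r] for r in responses) / len(responses)

# Hypothetical ratings for one idea (each idea had roughly 20 raters):
ratings = (["probably would purchase"] * 10
           + ["might or might not purchase"] * 10)
print(score(ratings, JAMESON_BASS))               # 0.625
print(round(score(ratings, ULRICH_EPPINGER), 3))  # 0.1
```

Because both schemes are monotone in the response scale, an idea ranking computed under one tends to be preserved under the other, which is consistent with the robustness result reported next.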
When using this alternative set of weights, we find the same significant differences between pools. As a robustness test for our primary purchase-intent analysis using a linear mixed-effects model, we also conduct a simpler linear regression focusing on the average perceived quality of product ideas across different sources. This model aggregates individual ratings at the idea level, removing the random effects to capture the overall influence of the source on rating averages. The results confirm our previous findings and show that ideas from GPT-4 (zero-shot) are rated higher than human ones by an average of 0.256 points (95% CI [0.15, 0.37]; t = 4.602, p < 0.001), and ideas from GPT-4 (few-shot) are rated higher by an average of 0.358 points (95% CI [0.25, 0.47]; t = 6.435, p < 0.001). In addition, we estimate a cumulative link mixed model (CLMM) to treat the rating outcome as a factor. We find significant differences in the perceived quality, measured as purchase intent of product ideas, between sources. Ideas generated by GPT-4 (zero-shot) receive a significantly greater average rating (β = 0.395; 95% CI [0.215, 0.575]; z = 4.31, p < 0.001). Similarly, ideas generated by GPT-4 (few-shot) receive even higher ratings (β = 0.581; 95% CI [0.400, 0.762]; z = 6.30, p < 0.001) compared to human-generated ideas. These findings suggest that LLM-generated ideas are perceived as more likely to be purchased than human-generated ideas, with the highest perceived quality attributed to few-shot GPT-4-generated ideas.

8.2.2 Study 2

Our chosen threshold of θ = 0.8 was established through experimentation by comparing pairs of ideas and their respective similarity scores. However, our findings are robust to other values such as 0.7 (25 and 37 overlapping ideas for zero-shot and few-shot GPT-4, respectively) and 0.75 (16 and 23 overlapping ideas). At θ = 0.85, the zero-shot GPT-4 pool only features two overlapping ideas, whereas the few-shot pool features one.
Because these are extreme values that approach zero, we used 0.8 as our main threshold. We compute the pairwise similarity for an idea compared to all other ideas in the pool and calculate the average. Mean pairwise similarity is a common measure in ideation (Siangliulue et al. 2016, Cox et al. 2021) and similar text-mining tasks (Doshi and Hauser 2024), but it is not without issues, as it lacks sensitivity to highly clustered ideas. As an additional specification, we consider the per-pool collective diversity of all ideas by following the work of Cox et al. (2021) and construct a minimum spanning tree (MST), which spans all points (ideas) in space with the smallest total distance along the edges. In 2D space, an MST would be the tree that contains all points with the shortest overall length of edges. We compute the mean of all edge distances as a measure of how distributed the ideas are in the high-dimensional space. The spanning tree is constructed in high-dimensional space (512 dimensions), its edge weights summed and divided by the number of edges, resulting in a range from 0 (not diverse at all) to 1 (very diverse). Based on this measure, the student idea pool is the most diverse (0.53), GPT-4 zero-shot is the second most diverse (0.33), and GPT-4 few-shot is the least diverse (0.30) pool. Similar to purchase intent, we also conduct a simpler linear regression focusing on the average perceived novelty of product ideas across different sources. This model aggregates individual ratings at the idea level, removing the random effects to capture the overall influence of the source on rating averages. We find that ideas from GPT-4 (zero-shot) are significantly less novel than human ones (β = -0.177; 95% CI [-0.286, -0.069]; t = -3.22, p = 0.0014). Ideas from GPT-4 (few-shot) are rated as significantly less novel than human ones (β = -0.197; 95% CI [-0.305, -0.089]; t = -3.58, p < 0.001).
This simpler analysis reinforces that human ideas are more novel than AI-generated ones, even when using zero-shot prompting. In addition, we estimate a cumulative link mixed model (CLMM) to treat the rating outcome as a factor. We find significant differences in perceived novelty. Ideas generated by GPT-4 (zero-shot) receive a significantly lower average rating (β = -0.306; 95% CI [-0.514, -0.1]; z = -2.89, p < 0.01). Similarly, ideas generated by GPT-4 (few-shot) receive even lower ratings (β = -0.39; 95% CI [-0.6, -0.18]; z = -3.66, p < 0.001) compared to human-generated ideas. These findings suggest that LLM-generated ideas are perceived as less novel than human-generated ideas, with the lowest perceived novelty attributed to few-shot GPT-4-generated ideas.

8.2.3 Study 3

In this study, we present our results for the 90th percentile of all aggregated ideas. Table 4 shows that using other percentiles yields similar results.

Table 4: Top 5% and 15% of Ideas Pool Distributions

| | Student Ideas | GPT-4 zero-shot | GPT-4 few-shot |
| Average Quality of Top 5% | 0.64 | 0.67 | 0.68 |
| Fraction of the top 5% of pooled ideas from this source | 1/20 | 6/20 | 14/20 |
| Average Quality of Top 15% | 0.60 | 0.62 | 0.64 |
| Fraction of the top 15% of pooled ideas from this source | 11/60 | 22/60 | 27/60 |

9. Summary

GenAI has demonstrated remarkable advancements in creating coherent and fluent text, equaling or surpassing human performance in various academic and professional domains. In this study, we explored the ideation capabilities of OpenAI's GPT-4, a state-of-the-art large language model, in comparison to the ideation abilities of university students when generating ideas for new products targeted toward college students at a price point of USD 50 or less. Specifically, we make three main contributions to the literature on innovation and the role of AI. First, GPT-4 produces high-quality ideas that are perceived as more likely to be purchased than human-
Second, consumers perceive AI-generated ideas as less novel. Third, when considering the quality of the best ideas, AI significantly outperforms human creativity. To put these findings in context, innovation favors a few great ideas over a large number of solid ideas, and our results show that AI-generated ideas are seven times more likely than human ideas to be among the top 10% of ideas considered in our experiment. Despite the reduction in novelty, the overall AI advantage thus remains substantial.

The fact that GPT-4 is very efficient at generating ideas does not require a formal research study. Two hundred ideas can be generated by one human interacting with GPT-4 in about 15 minutes. A human working alone can generate about five ideas in 15 minutes, and humans working in groups do even worse (Girotra et al. 2010). In short, the productivity race between humans and GPT-4 is not even close. However, as we show in this article, the enormous potential of LLMs in ideation results not only from their ability to generate ideas quickly and inexpensively, but also from the remarkable quality of their output. Importantly, hundreds of high-quality ideas can be produced at a fraction of the cost it would take humans. This previously unimaginable productivity in generating ideas may substantially reduce the importance of the idea-generation phase of innovation and shift managerial focus to the idea-evaluation phase. Can an LLM also take on the task of idea evaluation? From our viewpoint, this is a fascinating question for future research.

References

Bellaiche L, Shahi R, Turpin MH, Ragnhildstveit A, Sprockett S, Barr N, Christensen A, Seli P (2023) Humans versus AI: whether and why we prefer human-created compared to AI-created artwork. Cogn. Research 8(1):42.

Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, et al. (2020) Language Models are Few-Shot Learners. (July 22) http://arxiv.org/abs/2005.14165.
Chao RO, Kavadias S (2008) A Theoretical Framework for Managing the New Product Development Portfolio: When and How to Use Strategic Buckets. Management Science 54(5):907-921.

Cochran WG (1978) Laplace's Ratio Estimator. David HA, ed. Contributions to Survey Sampling and Applied Statistics (Academic Press), 3-10.

Connolly T, Jessup LM, Valacich JS (1990) Effects of Anonymity and Evaluative Tone on Idea Generation in Computer-Mediated Groups. Management Science 36(6):689-703.

Cox SR, Wang Y, Abdul A, Von Der Weth C, Lim BY (2021) Directed Diversity: Leveraging Language Embedding Distances for Collective Creativity in Crowd Ideation. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (ACM, Yokohama, Japan), 1-35.

Dahan E, Mendelson H (2001) An Extreme-Value Model of Concept Testing. Management Science 47(1):102-116.

Dell'Acqua F, McFowland III E, Mollick ER, Lifshitz-Assaf H, Kellogg K, Rajendran S, Krayer L, Candelon F, Lakhani KR (2023) Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. (September 15) https://papers.ssrn.com/abstract=4573321.

Doshi AR, Hauser OP (2024) Generative AI enhances individual creativity but reduces the collective diversity of novel content. Sci. Adv. 10(28):eadn5290.

Girotra K, Terwiesch C, Ulrich KT (2010) Idea Generation and the Quality of the Best Idea. Management Science 56(4):591-605.

Goldenberg J, Mazursky D, Solomon S (1999) Creative Sparks. Science 285(5433):1495-1496.

Guilford JP (1967) Creativity: Yesterday, Today and Tomorrow. Journal of Creative Behavior 1(1):3-14.

Haase J, Hanel PHP (2023) Artificial muses: Generative artificial intelligence chatbots have risen to human-level creativity. Journal of Creativity 33(3):100066.

Hitsuwari J, Ueda Y, Yun W, Nomura M (2023) Does human-AI collaboration lead to more creative art? Aesthetic evaluation of human-made and AI-generated haiku poetry.
Computers in Human Behavior 139:107502.

Hubert KF, Awa KN, Zabelina DL (2024) The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks. Sci Rep 14(1):3440.

Huchzermeier A, Loch CH (2001) Project Management Under Risk: Using the Real Options Approach to Evaluate Flexibility in R&D. Management Science 47(1):85-101.

Jamieson LF, Bass FM (1989) Adjusting Stated Intention Measures to Predict Trial Purchase of New Products: A Comparison of Models and Methods. Journal of Marketing Research 26(3):336-345.

Jia N, Luo X, Fang Z, Liao C (2024) When and How Artificial Intelligence Augments Employee Creativity. AMJ 67(1):5-32.

Kobak D, González-Márquez R, Horvát EÁ, Lause J (2024) Delving into ChatGPT usage in academic writing through excess vocabulary. (July 3) http://arxiv.org/abs/2406.07016.

Koenker R, Hallock KF (2001) Quantile Regression. Journal of Economic Perspectives 15(4):143-156.

Koivisto M, Grassini S (2023) Best humans still outperform artificial intelligence in a creative divergent thinking task. Sci Rep 13(1):13601.

Kornish LJ, Ulrich KT (2011) Opportunity Spaces in Innovation: Empirical Analysis of Large Samples of Ideas. Management Science 57(1):107-128.

Kornish LJ, Ulrich KT (2014) The Importance of the Raw Idea in Innovation: Testing the Sow's Ear Hypothesis. Journal of Marketing Research 51(1):14-26.

Lincoln FC (1930) Calculating waterfowl abundance on the basis of banding returns (U.S. Dept. of Agriculture, Washington, D.C.).

Liu H, Mihm J, Sosa ME (2018) Where Do Stars Come From? The Role of Star vs. Nonstar Collaborators in Creative Settings. Organization Science 29(6):1149-1169.

Loch CH, Terwiesch C, Thomke S (2001) Parallel and Sequential Testing of Design Alternatives. Management Science 47(5):663-678.

March JG (1991) Exploration and Exploitation in Organizational Learning. Organization Science 2(1):71-87.
Meincke L, Carton A (2024) Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations. (May 26) https://papers.ssrn.com/abstract=4873537.

Mihm J, Schlapp J (2019) Sourcing Innovation: On Feedback in Contests. Management Science 65(2):559-576.

Mitchell E, Lee Y, Khazatsky A, Manning CD, Finn C (2023) DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. (July 23) http://arxiv.org/abs/2301.11305.

OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. (2024) GPT-4 Technical Report. (March 4) http://arxiv.org/abs/2303.08774.

Osborn AF (1953) Applied imagination (Scribner, Oxford, England).

Rashidi HH, Fennell BD, Albahra S, Hu B, Gorbett T (2023) The ChatGPT conundrum: Human-generated scientific manuscripts misidentified as AI creations by AI text detection tool. Journal of Pathology Informatics 14:100342.

Shank DB, Stefanik C, Stuhlsatz C, Kacirek K, Belfi AM (2023) AI composer bias: Listeners like music less when they think it was composed by an AI. J Exp Psychol Appl 29(3):676-692.

Shibayama S, Yin D, Matsumoto K (2021) Measuring novelty in science with word embedding. Muscio A, ed. PLoS ONE 16(7):e0254034.

Si H, Kavadias S, Loch CH (2022) Managing Innovation Portfolios: From Project Selection to Portfolio Design. (March 6) https://papers.ssrn.com/abstract=4050940.

Siangliulue P, Chan J, Dow SP, Gajos KZ (2016) IdeaHound: Improving Large-scale Collaborative Ideation with Crowd-Powered Real-time Semantic Modeling. Proceedings of the 29th Annual Symposium on User Interface Software and Technology (ACM, Tokyo, Japan), 609-624.

Sommer SC, Loch CH (2004) Selectionism and Learning in Projects with Complexity and Unforeseeable Uncertainty. Management Science 50(10):1334-1347.

Sutton RI, Hargadon A (1996) Brainstorming Groups in Context: Effectiveness in a Product Design Firm. Administrative Science Quarterly 41(4):685.
Terwiesch C (2023) Let's cast a critical eye over business ideas from ChatGPT. Financial Times (March 12) https://www.ft.com/content/591ad272-6419-4f2c-9935-caff1d670f08.

Terwiesch C, Ulrich K (2023) The innovation tournament handbook: a step-by-step guide to finding exceptional solutions to any challenge (Wharton School Press, Philadelphia, PA).

Terwiesch C, Ulrich KT (2009) Innovation tournaments: creating and selecting exceptional opportunities (Harvard Business Press, Boston, MA).

Terwiesch C, Xu Y (2008) Innovation Contests, Open Innovation, and Multiagent Problem Solving. Management Science 54(9):1529-1543.

Torrance EP (1968) A Longitudinal Examination of the Fourth Grade Slump in Creativity. Gifted Child Quarterly 12(4):195-199.

Ulrich K, Eppinger S (2007) Product Design and Development (McGraw-Hill Education).

Venkatraman S, Uchendu A, Lee D (2024) GPT-who: An Information Density-based Machine-Generated Text Detector. (April 3) http://arxiv.org/abs/2310.06202.

Wang H, Zou J, Mozer M, Goyal A, Lamb A, Zhang L, Su WJ, et al. (2024) Can AI Be as Creative as Humans? (January 25) http://arxiv.org/abs/2401.01623.

Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E, Le Q, Zhou D (2023) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. (January 10) http://arxiv.org/abs/2201.11903.

Weitzman ML (1979) Optimal Search for the Best Alternative. Econometrica 47(3):641.

Zhou E, Lee D (2024) Generative artificial intelligence, human creativity, and art. Harding M, ed. PNAS Nexus 3(3):pgae052.

Zlatkov D, Ens J, Pasquier P (2023) Searching for Human Bias Against AI-Composed Music. Artificial Intelligence in Music, Sound, Art and Design: 12th International Conference, EvoMUSART 2023, Held as Part of EvoStar 2023, Brno, Czech Republic, April 12-14, 2023, Proceedings (Springer-Verlag, Berlin, Heidelberg), 308-323.

Appendix A.
Quantile Regression Results

The regression model considered quantiles 0.1 to 0.9 in steps of 0.1. For each quantile, it estimated MeanRating ~ SourceAI. SourceAI is a dummy variable that indicates whether the idea source was a student (SourceAI = 0) or GPT-4 (SourceAI = 1). A positive coefficient on SourceAI indicates that ideas by GPT-4 performed better than human ideas; negative values indicate the opposite.

Table A.1. Quantile Regression Results for Quantiles 0.1 to 0.9

Quantile   Intercept   SourceAI   Conf. Int. Low   Conf. Int. High
0.1        –           0.300000   0.128399         0.471601
0.2        1.230768    0.290958   0.152504         0.429411
0.3        1.388859    0.277678   0.149304         0.406051
0.4        1.549946    0.262418   0.139183         0.385653
0.5        1.666666    0.227943   0.109150         0.346736
0.6        1.789528    0.210472   0.095888         0.325057
0.7        1.882355    0.260502   0.137852         0.383152
0.8        1.954555    0.445445   0.328038         0.562851
0.9        2.181822    0.318235   0.190531         0.445939

Notes. (p < 0.1).

Appendix B. Supplementary Regression Tables

Purchase Intent

Predictors              Estimates   CI              p
(Intercept)             0.40        [0.38, 0.43]    <0.001
Source [Zero-Shot]      0.06        [0.03, 0.09]    <0.001
Source [Few-Shot]       0.09        [0.06, 0.12]    <0.001
Random Effects
σ²                                  0.07
τ00 IdeaID                          0.01
τ00 RespondentID                    0.02
τ11 IdeaID.SourceZero-Shot          0.01
τ11 IdeaID.SourceFew-Shot           0.03
τ11 RespondentID.SourceZero-Shot    0.01
τ11 RespondentID.SourceFew-Shot     0.01
ρ01                                 -0.64, -0.97, -0.06, -0.28
ICC                                 0.28
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²        0.014 / 0.290

Purchase Intent (Alternative Weights)

Predictors              Estimates   CI              p
(Intercept)             0.08        [0.07, 0.08]    <0.001
Source [Zero-Shot]      0.02        [0.01, 0.03]    <0.001
Source [Few-Shot]       0.03        [0.02, 0.04]    <0.001
Random Effects
σ²                                  0.01
τ00 IdeaID                          0.00
τ00 RespondentID                    0.00
τ11 IdeaID.SourceZero-Shot          0.00
τ11 IdeaID.SourceFew-Shot           0.00
τ11 RespondentID.SourceZero-Shot    0.00
τ11 RespondentID.SourceFew-Shot     0.00
ρ01                                 0.04, 0.48, -0.00, -0.16
ICC                                 0.21
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²        0.009 / 0.215

Purchase Intent (Simple)

Predictors              Estimates   CI              p
(Intercept)             1.62        [1.55, 1.68]    <0.001
Source [Zero-Shot]      0.26        [0.15, 0.37]    <0.001
Source [Few-Shot]       0.36        [0.25, 0.47]    <0.001
Observations
R² / R² adjusted        0.108 / 0.104

Purchase Intent (no weights, zero-shot baseline)

Predictors              Estimates   CI              p
(Intercept)             1.85        [1.73, 1.98]    <0.001
Source [Student]        -0.24       [-0.35, -0.12]  <0.001
Source [Few-Shot]       0.12        [-0.00, 0.24]   0.058
Random Effects
σ²                                  1.18
τ00 IdeaID                          0.12
τ00 RespondentID                    0.42
τ11 IdeaID.SourceStudent            0.33
τ11 IdeaID.SourceFew-Shot           0.12
τ11 RespondentID.SourceStudent      0.13
τ11 RespondentID.SourceFew-Shot     0.01
ρ01                                 -0.77, -0.36, -0.50, -0.99
ICC                                 0.31
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²        0.014 / 0.322

Purchase Intent (ordered logistic regression)

Predictors              Odds Ratios   CI              p
0|1                     0.25          [0.21, 0.29]    <0.001
1|2                     1.06          [0.90, 1.26]    0.484
2|3                     3.01          [2.53, 3.57]    <0.001
3|4                     19.07         [15.84, 22.97]  <0.001
Source [Zero-Shot]      1.48          [1.24, 1.78]    <0.001
Source [Few-Shot]       1.79          [1.49, 2.14]    <0.001
Random Effects
σ²                                    3.29
τ00 IdeaID                            0.39
τ00 RespondentID                      0.92
ICC                                   0.28
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²          0.014 / 0.294

Novelty

Predictors              Estimates   CI              p
(Intercept)             0.41        [0.39, 0.43]    <0.001
Source [Zero-Shot]      -0.04       [-0.07, -0.01]  0.008
Source [Few-Shot]       -0.05       [-0.08, -0.02]  <0.001
Random Effects
σ²                                  0.05
τ00 IdeaID                          0.01
τ00 RespondentID                    0.01
τ11 IdeaID.SourceZero-Shot          0.02
τ11 IdeaID.SourceFew-Shot           0.03
τ11 RespondentID.SourceZero-Shot    0.01
τ11 RespondentID.SourceFew-Shot     0.01
ρ01                                 -0.87, -0.99, 0.14, 0.06
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²        0.009 / NA

Novelty (Simple)

Predictors              Estimates   CI              p
(Intercept)             1.64        [1.58, 1.70]    <0.001
Source [Zero-Shot]      -0.18       [-0.29, -0.07]  <0.001
Source [Few-Shot]       -0.20       [-0.31, -0.09]  <0.001
Observations
R² / R² adjusted        0.042 / 0.037

Novelty (no weights, zero-shot baseline)

Predictors              Estimates   CI              p
(Intercept)             1.48        [1.37, 1.59]    <0.001
Source [Student]        0.15        [0.05, 0.26]    0.004
Source [Few-Shot]       -0.04       [-0.16, 0.07]   0.493
Random Effects
σ²                                  0.90
τ00 IdeaID                          0.11
τ00 RespondentID                    0.35
τ11 IdeaID.SourceStudent            0.33
τ11 IdeaID.SourceFew-Shot           0.44
τ11 RespondentID.SourceStudent      0.03
τ11 RespondentID.SourceFew-Shot     0.04
ρ01                                 -0.71, -0.98, -1.00, -0.23
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²        0.009 / NA

Novelty (zero-shot baseline)

Predictors              Estimates   CI              p
(Intercept)             0.37        [0.34, 0.40]    <0.001
Source [Student]        0.04        [0.01, 0.07]    0.008
Source [Few-Shot]       -0.01       [-0.04, 0.02]   0.449
Random Effects
σ²                                  0.05
τ00 IdeaID                          0.01
τ00 RespondentID                    0.02
τ11 IdeaID.SourceStudent            0.02
τ11 IdeaID.SourceFew-Shot           0.03
τ11 RespondentID.SourceStudent      0.01
τ11 RespondentID.SourceFew-Shot     0.00
ρ01                                 -0.74, -1.00, -0.69, -0.95
ICC                                 0.35
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²        0.006 / 0.356

Novelty (ordered logistic regression)

Predictors              Odds Ratios   CI              p
0|1                     0.16          [0.13, 0.19]    <0.001
1|2                     0.87          [0.72, 1.04]    0.118
2|3                     4.51          [3.76, 5.43]    <0.001
3|4                     29.60         [24.09, 36.37]  <0.001
Source [Zero-Shot]      0.74          [0.60, 0.91]    0.004
Source [Few-Shot]       0.68          [0.55, 0.84]    <0.001
Random Effects
σ²                                    3.29
τ00 IdeaID                            0.57
τ00 RespondentID                      0.91
ICC                                   0.31
N RespondentID
N IdeaID
Observations
Marginal R² / Conditional R²          0.006 / 0.315
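For readers who wish to reproduce the shape of the quantile-regression analysis reported in Table A.1, the specification can be sketched as follows with statsmodels. The data frame here is simulated purely for illustration; the column names `MeanRating` and `SourceAI` mirror the description in Appendix A, and nothing else is taken from the study's pipeline.

```python
# Illustrative sketch of the Table A.1 quantile regressions, on simulated
# data (NOT the study's data): MeanRating ~ SourceAI at quantiles 0.1-0.9.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({"SourceAI": rng.integers(0, 2, size=n)})  # 0 = student, 1 = GPT-4
df["MeanRating"] = 1.6 + 0.3 * df["SourceAI"] + rng.normal(scale=0.5, size=n)

rows = []
for q in np.arange(0.1, 1.0, 0.1):
    fit = smf.quantreg("MeanRating ~ SourceAI", df).fit(q=q)
    lo, hi = fit.conf_int().loc["SourceAI"]   # CI bounds for the AI dummy
    rows.append({"Quantile": round(q, 1),
                 "Intercept": fit.params["Intercept"],
                 "SourceAI": fit.params["SourceAI"],
                 "CI low": lo, "CI high": hi})
print(pd.DataFrame(rows).round(3))
```

Each fitted row corresponds to one row of Table A.1: the intercept is the estimated rating quantile for student ideas, and the SourceAI coefficient is the shift at that quantile for GPT-4 ideas.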