Paperbot6 / abstracts.md
Each entry below lists, in order: Title, Link, Citation, Abstract, Filename.
The Labor Market Effects of Generative Artificial Intelligence https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5136877 Hartley, J., Jolevski, F., Melo, V., & Moore, B. (2024). The Labor Market Effects of Generative Artificial Intelligence. Available at SSRN 5136877. In this paper we develop a new survey analyzing Generative AI use in the labor market to assist in measuring the economic effects of Generative AI. We find, consistent with other surveys, that Generative AI tools like large language models (LLMs) are most commonly used in the labor force by younger individuals, more highly educated individuals, higher income individuals, and those in particular industries such as customer service, marketing, and information technology. Overall, we find that LLM adoption at work among U.S. survey respondents above 18 has increased rapidly from 30.1% as of December 2024 to 43.2% as of March/April 2025. We also estimate Generative AI use at the intensive margin, its efficiency gains, and its use in job search, and seek to examine the effects of LLMs on productivity and the labor market using a number of additional datasets. These results have several implications for policymakers, businesses, and researchers navigating the evolving landscape shaped by the integration of Generative AI into the global economy. The Labor Market Effects of Generativ.txt
The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas https://arxiv.org/abs/2506.20803 Si, C., Hashimoto, T., & Yang, D. (2025). The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas. arXiv preprint arXiv:2506.20803. Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel; it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes. The Ideation Execution Gap Executio.txt
Generative AI Can Harm Learning https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4895486 Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, Ö., & Mariman, R. (2024). Generative AI Can Harm Learning. The Wharton School Research Paper. Available at SSRN: https://ssrn.com/abstract=4895486. Generative artificial intelligence (AI) is poised to revolutionize how humans work, and has already demonstrated promise in significantly improving human productivity. However, a key remaining question is how generative AI affects learning, namely, how humans acquire new skills as they perform tasks. This kind of skill learning is critical to long-term productivity gains, especially in domains where generative AI is fallible and human experts must check its outputs. We study the impact of generative AI, specifically OpenAI's GPT-4, on human learning in the context of math classes at a high school. In a field experiment involving nearly a thousand students, we deployed and evaluated two GPT-based tutors, one that mimics a standard ChatGPT interface (called GPT Base) and one with prompts designed to safeguard learning (called GPT Tutor). These tutors comprise about 15% of the curriculum in each of three grades. Consistent with prior work, our results show that access to GPT-4 significantly improves performance (48% improvement for GPT Base and 127% for GPT Tutor). However, we additionally find that when access is subsequently taken away, students actually perform worse than those who never had access (17% reduction for GPT Base). That is, access to GPT-4 can harm educational outcomes. These negative learning effects are largely mitigated by the safeguards included in GPT Tutor. Our results suggest that students attempt to use GPT-4 as a "crutch" during practice problem sessions, and when successful, perform worse on their own. Thus, to maintain long-term productivity, we must be cautious when deploying generative AI to ensure humans continue to learn critical skills. Generative AI Can Harm Learning.txt
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond https://arxiv.org/abs/2405.03520 Zhu, Z., Wang, X., Zhao, W., Min, C., Deng, N., Dou, M., ... & Huang, G. (2024). Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond. arXiv preprint arXiv:2405.03520. General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems. Recently, the emergence of the Sora model has attracted significant attention due to its remarkable simulation capabilities, which exhibit an incipient comprehension of physical laws. In this survey, we embark on a comprehensive exploration of the latest advancements in world models. Our analysis navigates through the forefront of generative methodologies in video generation, where world models stand as pivotal constructs facilitating the synthesis of highly realistic visual content. Additionally, we scrutinize the burgeoning field of autonomous-driving world models, meticulously delineating their indispensable role in reshaping transportation and urban mobility. Furthermore, we delve into the intricacies inherent in world models deployed within autonomous agents, shedding light on their profound significance in enabling intelligent interactions within dynamic environmental contexts. Finally, we examine challenges and limitations of world models, and discuss their potential future directions. We hope this survey can serve as a foundational reference for the research community and inspire continued innovation. This survey will be regularly updated at: this https URL Is Sora a World Simulator A Compreh.txt
Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task https://arxiv.org/abs/2506.08872 Kosmyna, N., Hauptmann, E., Yuan, Y. T., Situ, J., Liao, X.-H., Beresnitzky, A. V., ... & Maes, P. (2025). Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. arXiv preprint arXiv:2506.08872. This study explores the neural and behavioral consequences of LLM-assisted essay writing. Participants were divided into three groups: LLM, Search Engine, and Brain-only (no tools). Each completed three sessions under the same condition. In a fourth session, LLM users were reassigned to Brain-only group (LLM-to-Brain), and Brain-only users were reassigned to LLM condition (Brain-to-LLM). A total of 54 participants took part in Sessions 1-3, with 18 completing session 4. We used electroencephalography (EEG) to assess cognitive load during essay writing, and analyzed essays using NLP, as well as scoring essays with the help from human teachers and an AI judge. Across groups, NERs, n-gram patterns, and topic ontology showed within-group homogeneity. EEG revealed significant differences in brain connectivity: Brain-only participants exhibited the strongest, most distributed networks; Search Engine users showed moderate engagement; and LLM users displayed the weakest connectivity. Cognitive activity scaled down in relation to external tool use. In session 4, LLM-to-Brain participants showed reduced alpha and beta connectivity, indicating under-engagement. Brain-to-LLM users exhibited higher memory recall and activation of occipito-parietal and prefrontal areas, similar to Search Engine users. Self-reported ownership of essays was the lowest in the LLM group and the highest in the Brain-only group. LLM users also struggled to accurately quote their own work. While LLMs offer immediate convenience, our findings highlight potential cognitive costs. Over four months, LLM users consistently underperformed at neural, linguistic, and behavioral levels. These results raise concerns about the long-term educational implications of LLM reliance and underscore the need for deeper inquiry into AI's role in learning. Your Brain on ChatGPT Accumulation o.txt
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions https://arxiv.org/abs/2506.09038 Kirichenko, P., Ibrahim, M., Chaudhuri, K., & Bell, S. J. (2025). AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions. arXiv preprint arXiv:2506.09038. For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly. Real-world user queries, which can be underspecified, ill-posed, or fundamentally unanswerable, require LLMs to reason about uncertainty and selectively abstain -- i.e., refuse to answer definitively. However, abstention remains understudied, without a systematic evaluation framework for modern LLMs. In this work, we introduce AbstentionBench, a large-scale benchmark for holistically evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information. Evaluating 20 frontier LLMs reveals abstention is an unsolved problem, and one where scaling models is of little use. While recent reasoning LLMs have shown impressive results in complex problem solving, surprisingly, we find that reasoning fine-tuning degrades abstention (by 24% on average), even for math and science domains on which reasoning models are explicitly trained. We find that while a carefully crafted system prompt can boost abstention in practice, it does not resolve models' fundamental inability to reason about uncertainty. We release AbstentionBench to foster research into advancing LLM reliability. AbstentionBench Reasoning LLMs Fail o.txt
How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use: A Longitudinal Randomized Controlled Study dam-prod2.media.mit.edu/x/2025/03/21/Randomized_Control_Study_on_Chatbot_Psychosocial_Effect.pdf Fang, C. M., Liu, A. R., Danry, V., Lee, E., Chan, S. W. T., Pataranutaporn, P., ... & Agarwal, S. (2025). How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use: A Longitudinal Randomized Controlled Study. arXiv preprint arXiv:2503.17473. AI chatbots, especially those with voice capabilities, have become increasingly human-like, with more users seeking emotional support and companionship from them. Concerns are rising about how such interactions might impact users' loneliness and socialization with real people. We conducted a four-week randomized, controlled, IRB-approved experiment (n=981, >300K messages) to investigate how AI chatbot interaction modes (text, neutral voice, and engaging voice) and conversation types (open-ended, non-personal, and personal) influence psychosocial outcomes such as loneliness, social interaction with real people, emotional dependence on AI and problematic AI usage. Results showed that while voice-based chatbots initially appeared beneficial in mitigating loneliness and dependence compared with text-based chatbots, these advantages diminished at high usage levels, especially with a neutral-voice chatbot. Conversation type also shaped outcomes: personal topics slightly increased loneliness but tended to lower emotional dependence compared with open-ended conversations, whereas non-personal topics were associated with greater dependence among heavy users. Overall, higher daily usage—across all modalities and conversation types—correlated with higher loneliness, dependence, and problematic use, and lower socialization. Exploratory analyses revealed that those with stronger emotional attachment tendencies and higher trust in the AI chatbot tended to experience greater loneliness and emotional dependence, respectively. These findings underscore the complex interplay between chatbot design choices (e.g., voice expressiveness) and user behaviors (e.g., conversation content, usage frequency). We highlight the need for further research on whether chatbots' ability to manage emotional content without fostering dependence or replacing human relationships benefits overall well-being. How AI and Human Behaviors Shape Psyc.txt
AI Companions Reduce Loneliness https://arxiv.org/pdf/2407.19096 De Freitas, J., Uğuralp, A. K., Uğuralp, Z., & Puntoni, S. (2024). AI Companions Reduce Loneliness. Harvard Business Working Paper No. 24-078. Available at SSRN 4893097. Chatbots are now able to engage in sophisticated conversations with consumers in the domain of relationships, providing a potential coping solution to widescale societal loneliness. Behavioral research provides little insight into whether these applications are effective at alleviating loneliness. We address this question by focusing on "AI companions": applications designed to provide consumers with synthetic interaction partners. Studies 1 and 2 find suggestive evidence that consumers use AI companions to alleviate loneliness, by employing a novel methodology for fine-tuning large language models (LLMs) to detect loneliness in conversations and reviews. Study 3 finds that AI companions successfully alleviate loneliness on par only with interacting with another person, and more than other activities such as watching YouTube videos. Moreover, consumers underestimate the degree to which AI companions improve their loneliness. Study 4 uses a longitudinal design and finds that an AI companion consistently reduces loneliness over the course of a week. Study 5 provides evidence that both the chatbot's performance and, especially, whether it makes users feel heard, explain reductions in loneliness. Study 6 provides an additional robustness check for the loneliness-alleviating benefits of AI companions. AI Companions Reduce Loneliness.txt
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs https://arxiv.org/abs/2502.08640 Mazeika, M., Yin, X., Tamirisa, R., Lim, J., Lee, B. W., Ren, R., ... & Hendrycks, D. (2025). Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs. arXiv preprint arXiv:2502.08640. As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations. Utility Engineering Analyzing and Contro.txt
Automation of Systematic Reviews with Large Language Models https://www.medrxiv.org/content/10.1101/2025.06.13.25329541v1 Cao, C., Arora, R., Cento, P., Manta, K., Farahani, E., Cecere, M., ... & Bobrovitz, N. (2025). Automation of Systematic Reviews with Large Language Models. medRxiv. https://doi.org/10.1101/2025.06.13.25329541. Systematic reviews (SRs) inform evidence-based decision making. Yet, they take over a year to complete, are prone to human error, and face challenges with reproducibility; limiting access to timely and reliable information. We developed otto-SR, an end-to-end agentic workflow using large language models (LLMs) to support and automate the SR workflow from initial search to analysis. We found that otto-SR outperformed traditional dual human workflows in SR screening (otto-SR: 96.7% sensitivity, 97.9% specificity; human: 81.7% sensitivity, 98.1% specificity) and data extraction (otto-SR: 93.1% accuracy; human: 79.7% accuracy). Using otto-SR, we reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, representing approximately 12 work-years of traditional systematic review work. Across Cochrane reviews, otto-SR incorrectly excluded a median of 0 studies (IQR 0 to 0.25), and found a median of 2.0 (IQR 1 to 6.5) eligible studies likely missed by the original authors. Meta-analyses revealed that otto-SR generated newly statistically significant conclusions in 2 reviews and negated significance in 1 review. These findings demonstrate that LLMs can autonomously conduct and update systematic reviews with superhuman performance, laying the foundation for automated, scalable, and reliable evidence synthesis. Automation of Systematic Reviews with.txt
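The screening comparison above is stated in terms of sensitivity and specificity. As a reader's aid, here is a minimal Python sketch of the standard definitions behind those figures (this is not otto-SR's code, which the abstract does not include):

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard definitions behind reported screening figures.

    sensitivity: share of truly eligible studies the screener included
    specificity: share of truly ineligible studies the screener excluded
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

Under these definitions, otto-SR's higher sensitivity at essentially equal specificity means it missed fewer eligible studies than dual human screening without admitting more ineligible ones.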
Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce https://arxiv.org/pdf/2506.06576 Shao, Y., Zope, H., Jiang, Y., Pei, J., Nguyen, D., Brynjolfsson, E., & Yang, D. (2025). Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce. arXiv preprint arXiv:2506.06576. The rapid rise of compound AI systems (a.k.a. AI agents) is reshaping the labor market, raising concerns about job displacement, diminished human agency, and overreliance on automation. Yet, we lack a systematic understanding of the evolving landscape. In this paper, we address this gap by introducing a novel auditing framework to assess which occupational tasks workers want AI agents to automate or augment, and how those desires align with the current technological capabilities. Our framework features an audio-enhanced mini-interview to capture nuanced worker desires and introduces the Human Agency Scale (HAS) as a shared language to quantify the preferred level of human involvement. Using this framework, we construct the WORKBank database, building on the U.S. Department of Labor's O*NET database, to capture preferences from 1,500 domain workers and capability assessments from AI experts across over 844 tasks spanning 104 occupations. Jointly considering the desire and technological capability divides tasks in WORKBank into four zones: Automation "Green Light" Zone, Automation "Red Light" Zone, R&D Opportunity Zone, and Low Priority Zone. This highlights critical mismatches and opportunities for AI agent development. Moving beyond a simple automate-or-not dichotomy, our results reveal diverse HAS profiles across occupations, reflecting heterogeneous expectations for human involvement. Moreover, our study offers early signals of how AI agent integration may reshape the core human competencies, shifting from information-focused skills to interpersonal ones. These findings underscore the importance of aligning AI agent development with human desires and preparing workers for evolving workplace dynamics. Future of Work with AI Agents Audit.txt
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions https://arxiv.org/abs/2505.18878 Huang, K.-H., Prabhakar, A., Thorat, O., Agarwal, D., Choubey, P. K., Mao, Y., ... & Wu, C.-S. (2025). CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions. arXiv preprint arXiv:2505.18878. While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition. CRMArena-Pro Holistic Assessment of LLM.txt
From Tool to Teammate: A Randomized Controlled Trial of Clinician-AI Collaborative Workflows for Diagnosis https://www.medrxiv.org/content/10.1101/2025.06.07.25329176v1 Everett, S. S., Bunning, B. J., Jain, P., Lopez, I., Agarwal, A., Desai, M., ... & Horvitz, E. (2025). From Tool to Teammate: A Randomized Controlled Trial of Clinician-AI Collaborative Workflows for Diagnosis. medRxiv. https://doi.org/10.1101/2025.06.07.25329176. Early studies of large language models (LLMs) in clinical settings have largely treated artificial intelligence (AI) as a tool rather than an active collaborator. As LLMs now demonstrate expert-level diagnostic performance, the focus shifts from whether AI can offer valuable suggestions to how it can be effectively integrated into physicians' diagnostic workflows. We conducted a randomized controlled trial (n=70 clinicians) to evaluate the value of employing a custom GPT system designed to engage collaboratively with clinicians on diagnostic reasoning challenges. The collaborative design began with independent diagnostic assessments from both the clinician and the AI. These were then combined in an AI-generated synthesis that integrated the two perspectives, highlighting points of agreement and disagreement and offering commentary on each. We evaluated two workflow variants: one where the AI provided an initial opinion (AI-first), and another where it followed the clinician's assessment (AI-second). Clinicians using either collaborative workflow outperformed those using traditional tools, achieving average accuracies of 85% (AI-first) and 82% (AI-second), compared to 75% with traditional resources (p < 0.0004 and p < 0.00001; mean differences = 9.8% and 6.8%; 95% CI = 4.6%–15% and 4.0%–9.6%). Performance did not differ significantly between workflows or from the AI-alone score of 90%. These results underscore the value of collaborative AI systems that complement clinician expertise and foster effective coordination between human and machine reasoning in diagnostic decision-making. From Tool to Teammate A Randomized.txt
Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5285532 Meincke, L., Mollick, E. R., Mollick, L., & Shapiro, D. (2025). Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting. Available at SSRN 5285532. This is the second in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate Chain-of-Thought (CoT) prompting, a technique that encourages a large language model (LLM) to "think step by step" (Wei et al., 2022). CoT is a widely adopted method for improving reasoning tasks; however, our findings reveal a more nuanced picture of its effectiveness. We demonstrate two things: The effectiveness of Chain-of-Thought prompting can vary greatly depending on the type of task and model. For non-reasoning models, CoT generally improves average performance by a small amount, particularly if the model does not inherently engage in step-by-step processing by default. However, CoT can introduce more variability in answers, sometimes triggering occasional errors in questions the model would otherwise get right. We also found that many recent models perform some form of CoT reasoning even if not asked; for these models, a request to perform CoT had little impact. Performing CoT generally requires far more tokens (increasing cost and time) than direct answers. For models designed with explicit reasoning capabilities, CoT prompting often results in only marginal, if any, gains in answer accuracy. However, it significantly increases the time and tokens needed to generate a response. Taken together, this suggests that a simple CoT prompt is generally still a useful tool for boosting average performance in non-reasoning models, especially older or smaller models that may not engage in CoT reasoning by default. However, the gains must be weighed against increased response times and potential decreases in perfect accuracy due to more variability in answers. For dedicated reasoning models, the added benefits of explicit CoT prompting appear negligible and may not justify the substantial increase in processing time. Prompting Science Report 2 The Decrea.txt
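As an illustration of the trade-off this report describes, the sketch below asks the same question directly and with a CoT instruction, then compares completion-token counts. This is illustrative only, not the report's test harness; it assumes the openai v1 Python SDK, an OPENAI_API_KEY in the environment, and a placeholder model name:

```python
# Illustrative sketch: direct answer vs. "think step by step" (CoT).
# Assumptions: openai>=1.0 SDK installed; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")

for label, prefix in [
    ("direct", "Give only the final answer. "),
    ("cot", "Think step by step, then give the final answer. "),
]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prefix + question}],
    )
    # CoT responses typically consume far more completion tokens,
    # which is the cost/latency trade-off the report quantifies.
    print(label, resp.usage.completion_tokens)
```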
Prompting Science Report 1: Prompt Engineering is Complicated and Contingent https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5165270 Meincke, L., Mollick, E. R., Mollick, L., & Shapiro, D. (2025). Prompting Science Report 1: Prompt Engineering is Complicated and Contingent. Available at SSRN 5165270. This is the first of a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we demonstrate two things: - There is no single standard for measuring whether a Large Language Model (LLM) passes a benchmark, and that choosing a standard has a big impact on how well the LLM does on that benchmark. The standard you choose will depend on your goals for using an LLM in a particular case. - It is hard to know in advance whether a particular prompting approach will help or harm the LLM's ability to answer any particular question. Specifically, we find that sometimes being polite to the LLM helps performance, and sometimes it lowers performance. We also find that constraining the AI's answers helps performance in some cases, though it may lower performance in other cases. Taken together, this suggests that benchmarking AI performance is not one-size-fits-all, and also that particular prompting formulas or approaches, like being polite to the AI, are not universally valuable. Prompting Science Report 1 Prompt En.txt
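To make the report's first point concrete, the sketch below shows three common grading standards applied to the same set of repeated samples; the standards and the toy example are illustrative, not the report's own evaluation code:

```python
from collections import Counter

def pass_at_least_once(samples: list[str], answer: str) -> bool:
    # Lenient: credit if any of the sampled answers is correct.
    return any(s == answer for s in samples)

def majority_vote(samples: list[str], answer: str) -> bool:
    # Moderate: credit only if the modal answer is correct.
    return Counter(samples).most_common(1)[0][0] == answer

def all_correct(samples: list[str], answer: str) -> bool:
    # Strict: credit only if every sampled answer is correct.
    return all(s == answer for s in samples)

# The same model outputs can pass one standard and fail another:
samples = ["C", "B", "B"]  # three sampled answers to one question
print(pass_at_least_once(samples, "C"))  # True
print(majority_vote(samples, "C"))       # False
print(all_correct(samples, "C"))         # False
```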
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity https://machinelearning.apple.com/research/illusion-of-thinking Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple Machine Learning Research. Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their computational capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established and conventional coding/mathematical benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces' structure and quality. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs "think". Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counter-intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference conditions, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs provides benefits, and (3) high-complexity tasks where both models experience complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths, limitations, and ultimately raising crucial questions about their true reasoning capabilities. The Illusion of Thinking Understandin.txt
Creative Preference Optimization https://arxiv.org/abs/2505.14442 Ismayilzada, M., Laverghetta Jr., A., Luchini, S. A., Patel, R., Bosselut, A., van der Plas, L., & Beaty, R. (2025). Creative Preference Optimization. arXiv preprint arXiv:2505.14442. While Large Language Models (LLMs) have demonstrated impressive performance across natural language generation tasks, their ability to generate truly creative content, characterized by novelty, diversity, surprise, and quality, remains limited. Existing methods for enhancing LLM creativity often focus narrowly on diversity or specific tasks, failing to address creativity's multifaceted nature in a generalizable way. In this work, we propose Creative Preference Optimization (CrPO), a novel alignment method that injects signals from multiple creativity dimensions into the preference optimization objective in a modular fashion. We train and evaluate creativity-augmented versions of several models using CrPO and MuCE, a new large-scale human preference dataset spanning over 200,000 human-generated responses and ratings from more than 30 psychological creativity assessments. Our models outperform strong baselines, including GPT-4o, on both automated and human evaluations, producing more novel, diverse, and surprising generations while maintaining high output quality. Additional evaluations on NoveltyBench further confirm the generalizability of our approach. Together, our results demonstrate that directly optimizing for creativity within preference frameworks is a promising direction for advancing the creative capabilities of LLMs without compromising output quality. Creative Preference Optimization.txt
Has the Creativity of Large Language Models Peaked? An Analysis of Inter- and Intra-LLM Variability https://arxiv.org/pdf/2504.12320 Haase, J., Hanel, P. H. P., & Pokutta, S. (2025). Has the Creativity of Large Language Models Peaked? An Analysis of Inter- and Intra-LLM Variability. arXiv preprint arXiv:2504.12320. Following the widespread adoption of ChatGPT in early 2023, numerous studies reported that large language models (LLMs) can match or even surpass human performance in creative tasks. However, it remains unclear whether LLMs have become more creative over time, and how consistent their creative output is. In this study, we evaluated 14 widely used LLMs—including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek—across two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). Contrary to expectations, we found no evidence of increased creative performance over the past 18–24 months, with GPT-4 performing worse than in previous studies. For the more widely used AUT, all models performed on average better than the average human, with GPT-4o and o3-mini performing best. However, only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks. Beyond inter-model differences, we document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below-average to original. This variability has important implications for both creativity research and practical applications. Ignoring such variability risks misjudging the creative potential of LLMs, either inflating or underestimating their capabilities. The choice of prompts affected LLMs differently. Our findings underscore the need for more nuanced evaluation frameworks and highlight the importance of model selection, prompt design, and repeated assessment when using Generative AI (GenAI) tools in creative contexts. Has the Creativity of Large-Language M.txt
Learning to Reason without External Rewards https://arxiv.org/abs/2505.19590 Zhao, X., Kang, Z., Feng, A., Levine, S., & Song, D. (2025). Learning to Reason without External Rewards. arXiv preprint arXiv:2505.19590. Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at this https URL Learning to Reason without External Rewa.txt
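The abstract describes self-certainty only informally. Below is a minimal sketch, assuming self-certainty is the per-token KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, averaged over the response; the exact definition should be checked against the paper and its released code:

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Assumed self-certainty reward for one sampled response.

    logits: (seq_len, vocab_size) logits at each generated position.
    Per position, KL(U || p) = -log V - (1/V) * sum_j log p_j, where U is
    uniform over a vocabulary of size V; we average over positions.
    """
    vocab_size = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    kl_per_token = -math.log(vocab_size) - log_p.mean(dim=-1)
    return kl_per_token.mean()  # 0 for a uniform p; larger when p is peaked

# In an RLIF loop, this scalar would stand in for the verifiable reward
# that GRPO normally derives from gold answers or test cases.
```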
Harnessing the Universal Geometry of Embeddings https://arxiv.org/pdf/2505.12540 Jha, R., Zhang, C., Shmatikov, V., & Morris, J. X. (2025). Harnessing the Universal Geometry of Embeddings. arXiv preprint arXiv:2505.12540. We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for security. An adversary with access to a database of only embedding vectors can extract sensitive information about underlying documents, sufficient for classification and attribute inference. Harnessing the Universal Geometry of.txt
From Chalkboards to Chatbots: Evaluating the Impact of Generative AI on Learning Outcomes in Nigeria https://documents.worldbank.org/en/publication/documents-reports/documentdetail/099548105192529324 De Simone, M. E., Tiberti, F. H., Barron Rodriguez, M. R., Manolio, F. A., Mosuro, W., & Dikoru, E. J. (2025). From Chalkboards to Chatbots: Evaluating the Impact of Generative AI on Learning Outcomes in Nigeria. Policy Research Working Paper No. WPS11125. World Bank Group. This study evaluates the impact of a program leveraging large language models for virtual tutoring in secondary education in Nigeria. Using a randomized controlled trial, the program deployed Microsoft Copilot (powered by GPT-4) to support first-year senior secondary students in English language learning over six weeks. The intervention demonstrated a significant improvement of 0.31 standard deviation on an assessment that included English topics aligned with the Nigerian curriculum, knowledge of artificial intelligence and digital skills. The effect on English, the main outcome of interest, was 0.23 standard deviations. Cost-effectiveness analysis revealed substantial learning gains, equating to 1.5 to 2 years of 'business-as-usual' schooling, situating the intervention among some of the most cost-effective programs to improve learning outcomes. An analysis of heterogeneous effects shows that while the program benefits students across the baseline ability distribution, the largest effects are for female students, and those with higher initial academic performance. The findings highlight that artificial intelligence-powered tutoring, when designed and used properly, can have transformative impacts in the education sector in low-resource settings. From Chalkboards to Chatbots Evaluatin.txt
Generalization bias in large language model summarization of scientific research https://royalsocietypublishing.org/doi/epdf/10.1098/rsos.241776 Peters, U., & Chin-Yee, B. (2025). Generalization bias in large language model summarization of scientific research. Royal Society Open Science, 12(4), 241776. Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4900 LLM-generated summaries to their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26–73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (odds ratio = 4.85, 95% CI [3.06, 7.70], p < 0.001). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy. Generalization bias in LLM summarizat.txt
When GenAI increases inequality: evidence from a university debating competition https://poid.lse.ac.uk/PUBLICATIONS/abstract.asp?index=10951 Roldan, A. (2024). When GenAI increases inequality: Evidence from a university debating competition. POID Working Paper No. POIDWP096. London School of Economics and Political Science. This paper evaluates the impact of Generative Artificial Intelligence (GenAI) on productivity and work inequality. I run a Randomized Controlled Trial in a university debating competition, in which I randomly assign GenAI support to students to prepare a series of one-on-one debates. This novel setting allows me to measure productivity improvements in a task involving critical thinking and evaluate its impact on high cognitive and social skills. Contrary to most early findings in the GenAI literature, I find that high ability students benefit significantly more from GenAI than their lower ability counterparts. Analysis of mechanisms suggests that high ability students are more effective at extracting and using the information provided by GenAI. They also experience larger improvements in their perception of time needed to prepare debates when allowed to use GenAI. I suggest a possible explanation to reconcile these results with previous findings: when tasks require higher-order skills and unpredictable interactions, and answers cannot be copy-pasted from the AI, high ability workers are likely to benefit more from GenAI. When GenAI increases inequality Eviden.txt
The Uneven Impact of Generative AI on Entrepreneurial Performance https://www.hbs.edu/ris/Publication%20Files/24-042_9ebd2f26-e292-404c-b858-3e883f0e11c0.pdf Otis, N., Clarke, R., Delecourt, S., Holtz, D., & Koning, R. (2024). The Uneven Impact of Generative AI on Entrepreneurial Performance. Available at SSRN 4671369. There is a growing belief that scalable and low-cost AI assistance can improve firm decision-making and economic performance. However, running a business involves a myriad of open-ended problems, making it hard to generalize from recent studies showing that generative AI improves performance on well-defined writing tasks. In our five-month field experiment with 640 Kenyan entrepreneurs, we assessed the impact of AI-generated advice on small business revenues and profits. Participants were randomly assigned to a control group that received a standard business guide or to a treatment group that received a GPT-4 powered AI business mentor via WhatsApp. While we find no average treatment effect, this is because the causal effect of generative AI access varied with the baseline business performance of the entrepreneur: high performers benefited by just over 20% from AI advice, whereas low performers did roughly 10% worse with AI assistance. Exploratory analysis of the WhatsApp interaction logs shows that both groups sought the AI mentor's advice, but that low performers did worse because they sought help on much more challenging business tasks. These findings highlight how the tasks selected by firms and entrepreneurs for AI assistance fundamentally shape who will benefit from generative AI. The Uneven Impact of Gen.txt
Large Language Models Are More Persuasive Than Incentivized Human Persuaders https://arxiv.org/abs/2505.09662 Schoenegger, P., Salvi, F., Liu, J., Nan, X., Debnath, R., Fasolo, B., ... & Salatiello, A. (2025). Large Language Models Are More Persuasive Than Incentivized Human Persuaders. arXiv preprint arXiv:2505.09662. We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward correct or incorrect answers. We find that LLM persuaders achieved significantly higher compliance with their directional persuasion attempts than incentivized human persuaders, demonstrating superior persuasive capabilities in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. We also find that LLM persuaders significantly increased quiz takers' accuracy, leading to higher earnings, when steering quiz takers toward correct answers, and significantly decreased their accuracy, leading to lower earnings, when steering them toward incorrect answers. Overall, our findings suggest that AI's persuasion capabilities already exceed those of humans that have real-money bonuses tied to performance. Our findings of increasingly capable AI persuaders thus underscore the urgency of emerging alignment and governance frameworks. Large Language Models Are More Persuasiv.txt
Use of GPT-4 to Diagnose Complex Clinical Cases https://ai.nejm.org/doi/full/10.1056/AIp2300031 Eriksen, A. V., Möller, S., & Ryg, J. (2024). Use of GPT-4 to Diagnose Complex Clinical Cases. NEJM AI, 1(1). We assessed the performance of the newly released AI GPT-4 in diagnosing complex medical case challenges and compared the success rate to that of medical-journal readers. GPT-4 correctly diagnosed 57% of cases, outperforming 99.98% of simulated human readers generated from online answers. We highlight the potential for AI to be a powerful supportive tool for diagnosis; however, further improvements, validation, and addressing of ethical considerations are needed before clinical implementation. (No funding was obtained for this study.) Use of GPT-4 to Diagnose Complex Cli.txt
Using Large Language Models for Idea Generation in Innovation https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4526071 Meincke, L., Girotra, K., Nave, G., Terwiesch, C., & Ulrich, K. T. (2024). Using Large Language Models for Idea Generation in Innovation. The Wharton School Research Paper. Available at SSRN 4526071. This research evaluates the efficacy of large language models (LLMs) in generating new product ideas. To do so, we compare three pools of ideas for new products targeted toward college students and priced at $50 or less. The first pool of ideas was created by university students in a product design course before the availability of LLMs. The second and third pools of ideas were generated by OpenAI's GPT-4 using zero-shot and few-shot prompting, respectively. We evaluated idea quality using standard market research techniques to predict average purchase intent probability. We used text mining to assess idea similarity and human raters to evaluate idea novelty. We find that AI-generated ideas outperform human-generated ideas in terms of average purchase intent, with few-shot prompting yielding slightly higher intent than zero-shot prompting. However, AI-generated ideas are perceived as less novel and exhibit higher pairwise similarity, particularly with few-shot prompting, indicating a less diverse solution landscape. When focusing on the quality of the best ideas (rather than on the average ideas), we find that AI-generated ideas are seven times more likely to rank among the top 10% of ideas, demonstrating a significant advantage over human-generated ideas. We propose that this 7:1 advantage is a conservative estimate, as it does not account for AI's greater productivity. Our findings suggest that despite some drawbacks, AI creativity presents a substantial benefit in generating high-quality ideas for new product development. Using Large Language Models for Idea.txt
Testing theory of mind in large language models and humans https://www.nature.com/articles/s41562-024-01882-z Strachan, J. W. A., Albergo, D., Borghini, G., Pansardi, O., Scaliti, E., Gupta, S., ... & Becchio, C. (2024). Testing theory of mind in large language models and humans. Nature Human Behaviour, 8, 1285-1295. At the core of what defines us as humans is the concept of theory of mind: the ability to track other people's mental states. The recent development of large language models (LLMs) such as ChatGPT has led to intense debate about the possibility that these models exhibit behaviour that is indistinguishable from human behaviour in theory of mind tasks. Here we compare human and LLM performance on a comprehensive battery of measurements that aim to measure different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas. We tested two families of LLMs (GPT and LLaMA2) repeatedly against these measures and compared their performance with that of a sample of 1,907 human participants. Across the battery of theory of mind tests, we found that GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas. Faux pas, however, was the only test where LLaMA2 outperformed humans. Follow-up manipulations of the belief likelihood revealed that the superiority of LLaMA2 was illusory, possibly reflecting a bias towards attributing ignorance. By contrast, the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference. These findings not only demonstrate that LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences. Testing theory of mind in large langua.txt
Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7282.pdf Dell'Acqua, F., McFowland III, E., Mollick, E. R., Lifshitz-Assaf, H., Kellogg, K., Rajendran, S., ... & Lakhani, K. R. (2023). Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. Harvard Business School Technology & Operations Mgt. Unit Working Paper No. 24-013. Available at SSRN 4573321. The public release of Large Language Models (LLMs) has sparked tremendous interest in how humans will use Artificial Intelligence (AI) to accomplish a variety of tasks. In our study conducted with Boston Consulting Group, a global management consulting firm, we examine the performance implications of AI on realistic, complex, and knowledge-intensive tasks. The pre-registered experiment involved 758 consultants comprising about 7% of the individual contributor-level consultants at the company. After establishing a performance baseline on a similar task, subjects were randomly assigned to one of three conditions: no AI access, GPT-4 AI access, or GPT-4 AI access with a prompt engineering overview. We suggest that the capabilities of AI create a "jagged technological frontier" where some tasks are easily done by AI, while others, though seemingly similar in difficulty level, are outside the current capability of AI. For each one of a set of 18 realistic consulting tasks within the frontier of AI capabilities, consultants using AI were significantly more productive (they completed 12.2% more tasks on average, and completed tasks 25.1% more quickly), and produced significantly higher quality results (more than 40% higher quality compared to a control group). Consultants across the skills distribution benefited significantly from having AI augmentation, with those below the average performance threshold increasing by 43% and those above increasing by 17% compared to their own scores. For a task selected to be outside the frontier, however, consultants using AI were 19 percentage points less likely to produce correct solutions compared to those without AI. Further, our analysis shows the emergence of two distinctive patterns of successful AI use by humans along a spectrum of human-AI integration. One set of consultants acted as "Centaurs," like the mythical half-horse/half-human creature, dividing and delegating their solution-creation activities to the AI or to themselves. Another set of consultants acted more like "Cyborgs," completely integrating their task flow with the AI and continually interacting with the technology. Use of GPT-4 to Diagnose Complex Clinical CasesNavigating the Jagged Technological Frontier Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality.txt
Introducing HealthBench: An evaluation for AI systems and human health https://openai.com/index/healthbench/ OpenAI. (2025). Introducing HealthBench: An evaluation for AI systems and human health. OpenAI. Improving human health will be one of the defining impacts of AGI. If developed and deployed effectively, large language models have the potential to expand access to health information, support clinicians in delivering high-quality care, and help people advocate for their health and that of their communities. To get there, we need to ensure models are useful and safe. Evaluations are essential to understanding how models perform in health settings. Significant efforts have already been made across academia and industry, yet many existing evaluations do not reflect realistic scenarios, lack rigorous validation against expert medical opinion, or leave no room for state-of-the-art models to improve. Today, we're introducing HealthBench: a new benchmark designed to better measure capabilities of AI systems for health. Built in partnership with 262 physicians who have practiced in 60 countries, HealthBench includes 5,000 realistic health conversations, each with a custom physician-created rubric to grade model responses. Introducing HealthBench An evaluation for AI systems and human health.txt
Large Language Models, Small Labor Market Effects https://www.nber.org/papers/w33777 Humlum, A., & Vestergaard, E. (2025). Large Language Models, Small Labor Market Effects. NBER Working Paper No. 33777. We examine the labor market effects of AI chatbots using two large-scale adoption surveys (late 2023 and 2024) covering 11 exposed occupations (25,000 workers, 7,000 workplaces), linked to matched employer-employee data in Denmark. AI chatbots are now widespread—most employers encourage their use, many deploy in-house models, and training initiatives are common. These firm-led investments boost adoption, narrow demographic gaps in take-up, enhance workplace utility, and create new job tasks. Yet, despite substantial investments, economic impacts remain minimal. Using difference-in-differences and employer policies as quasi-experimental variation, we estimate precise zeros: AI chatbots have had no significant impact on earnings or recorded hours in any occupation, with confidence intervals ruling out effects larger than 1%. Modest productivity gains (average time savings of 3%), combined with weak wage pass-through, help explain these limited labor market effects. Our findings challenge narratives of imminent labor market transformation due to Generative AI. Large Language Models Small Labor M.txt
The effect of ChatGPT on students' learning performance, learning perception, and higher-order thinking: insights from a meta-analysis https://www.nature.com/articles/s41599-025-04787-y Wang, J., & Fan, W. (2025). The effect of ChatGPT on students' learning performance, learning perception, and higher-order thinking: insights from a meta-analysis. Humanities and Social Sciences Communications, 12, 621. As a new type of artificial intelligence, ChatGPT is becoming widely used in learning. However, academic consensus regarding its efficacy remains elusive. This study aimed to assess the effectiveness of ChatGPT in improving students' learning performance, learning perception, and higher-order thinking through a meta-analysis of 51 research studies published between November 2022 and February 2025. The results indicate that ChatGPT has a large positive impact on improving learning performance (g = 0.867) and a moderately positive impact on enhancing learning perception (g = 0.456) and fostering higher-order thinking (g = 0.457). The impact of ChatGPT on learning performance was moderated by type of course (QB = 64.249, P < 0.001), learning model (QB = 76.220, P < 0.001), and duration (QB = 55.998, P < 0.001); its effect on learning perception was moderated by duration (QB = 19.839, P < 0.001); and its influence on the development of higher-order thinking was moderated by type of course (QB = 7.811, P < 0.05) and the role played by ChatGPT (QB = 4.872, P < 0.05). This study suggests that: (1) appropriate learning scaffolds or educational frameworks (e.g., Bloom's taxonomy) should be provided when using ChatGPT to develop students' higher-order thinking; (2) the broad use of ChatGPT at various grade levels and in different types of courses should be encouraged to support diverse learning needs; (3) ChatGPT should be actively integrated into different learning modes to enhance student learning, especially in problem-based learning; (4) continuous use of ChatGPT should be ensured to support student learning, with a recommended duration of 4–8 weeks for more stable effects; (5) ChatGPT should be flexibly integrated into teaching as an intelligent tutor, learning partner, and educational tool. Finally, due to the limited sample size for learning perception and higher-order thinking, and the moderately positive effect, future studies with expanded scope should further explore how to use ChatGPT more effectively to cultivate students' learning perception and higher-order thinking. The effect of ChatGPT on students lea.txt
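The effect sizes above (g) are Hedges' g. For orientation, here is a minimal sketch of the conventional formula, not the meta-analysis's own code:

```python
import math

def hedges_g(mean_t: float, mean_c: float, sd_t: float, sd_c: float,
             n_t: int, n_c: int) -> float:
    """Hedges' g between a treatment and a control group."""
    # Pooled standard deviation across the two groups
    pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                       / (n_t + n_c - 2))
    d = (mean_t - mean_c) / pooled      # Cohen's d
    j = 1 - 3 / (4 * (n_t + n_c) - 9)   # small-sample correction factor
    return j * d
```

By the usual conventions, g near 0.2 is small, 0.5 moderate, and 0.8 large, which is why the study reads g = 0.867 as a large effect and g around 0.46 as moderate.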
How Deep Do Large Language Models Internalize Scientific Literature and Citation Practices? https://arxiv.org/abs/2504.02767 Algaba, A., Holst, V., Tori, F., Mobini, M., Verbeken, B., Wenmackers, S., & Ginis, V. (2025). How Deep Do Large Language Models Internalize Scientific Literature and Citation Practices? arXiv preprint arXiv:2504.02767. The spread of scientific knowledge depends on how researchers discover and cite previous work. The adoption of large language models (LLMs) in the scientific research process introduces a new layer to these citation practices. However, it remains unclear to what extent LLMs align with human citation practices, how they perform across domains, and how they may influence citation dynamics. Here, we show that LLMs systematically reinforce the Matthew effect in citations by consistently favoring highly cited papers when generating references. This pattern persists across scientific domains despite significant field-specific variations in existence rates, which refer to the proportion of generated references that match existing records in external bibliometric databases. Analyzing 274,951 references generated by GPT-4o for 10,000 papers, we find that LLM recommendations diverge from traditional citation patterns by preferring more recent references with shorter titles and fewer authors. Emphasizing their content-level relevance, the generated references are semantically aligned with the content of each paper at levels comparable to the ground truth references and display similar network effects while reducing author self-citations. These findings illustrate how LLMs may reshape citation practices and influence the trajectory of scientific discovery by reflecting and amplifying established trends. As LLMs become more integrated into the scientific research process, it is important to understand their role in shaping how scientific communities discover and build upon prior work. HOW DEEP DO LARGE LANGUAGE MODELS.txt
When Pigs Get Sick: Multi-Agent AI for Swine Disease Detection https://arxiv.org/abs/2503.15204 Mairittha, T., Sawanglok, T., Raden, P., & Treesuk, S. (2025). When Pigs Get Sick: Multi-Agent AI for Swine Disease Detection. arXiv preprint arXiv:2503.15204. Swine disease surveillance is critical to the sustainability of global agriculture, yet its effectiveness is frequently undermined by limited veterinary resources, delayed identification of cases, and variability in diagnostic accuracy. To overcome these barriers, we introduce a novel AI-powered, multi-agent diagnostic system that leverages Retrieval-Augmented Generation (RAG) to deliver timely, evidence-based disease detection and clinical guidance. By automatically classifying user inputs into either Knowledge Retrieval Queries or Symptom-Based Diagnostic Queries, the system ensures targeted information retrieval and facilitates precise diagnostic reasoning. An adaptive questioning protocol systematically collects relevant clinical signs, while a confidence-weighted decision fusion mechanism integrates multiple diagnostic hypotheses to generate robust disease predictions and treatment recommendations. Comprehensive evaluations encompassing query classification, disease diagnosis, and knowledge retrieval demonstrate that the system achieves high accuracy, rapid response times, and consistent reliability. By providing a scalable, AI-driven diagnostic framework, this approach enhances veterinary decision-making, advances sustainable livestock management practices, and contributes substantively to the realization of global food security. WHEN PIGS GET SICK MULTI-AGENT AI.txt
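The confidence-weighted decision fusion step invites a short sketch. Below is one plausible reading, weighted voting over each agent's ranked hypotheses, with hypothetical agents and diseases; it is not the authors' implementation:

```python
from collections import defaultdict

def fuse(agent_hypotheses):
    """Sum each agent's confidence per disease, then rank by total weight."""
    scores = defaultdict(float)
    for hypotheses in agent_hypotheses:      # one list per diagnostic agent
        for disease, confidence in hypotheses:
            scores[disease] += confidence
    total = sum(scores.values())
    return sorted(((d, w / total) for d, w in scores.items()),
                  key=lambda dw: dw[1], reverse=True)

# Hypothetical outputs from three agents for a feverish, coughing pig.
agents = [
    [("PRRS", 0.6), ("Swine influenza", 0.3)],
    [("Swine influenza", 0.7), ("PRRS", 0.2)],
    [("PRRS", 0.5), ("Classical swine fever", 0.1)],
]
for disease, weight in fuse(agents):
    print(f"{disease}: {weight:.2f}")  # normalized to pseudo-probabilities
```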
Underreporting of AI use: The role of social desirability bias https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5232910 Ling, Y., & Imas, A. (2025). Underreporting of AI use: The role of social desirability bias. Available at SSRN 5232910. The integration of artificial intelligence (AI) into work and educational settings is rapidly increasing, yet accurately gauging its adoption remains a challenge. The majority of research uses self-reported surveys. The resulting estimates vary widely, sometimes differing by as much as 40 percentage points in the same setting. This paper studies whether social desirability bias—the tendency to answer surveys in a way that would be viewed favorably by an outside party—can potentially explain this discrepancy. We collect data on AI use in a large representative sample of university students. We assess the potential for social desirability bias using a common tool from psychology, indirect questioning: all students report both their own AI use and the use of their peers. The data reveals a significant gap, with approximately 60% of students reporting using AI themselves compared to 90% of their peers. In a follow-up study, natural language processing reveals social desirability bias as a key driver of the gap between own and others' AI use: students are hesitant to admit AI use due to negative perceptions. This suggests that using self-reports may underestimate the actual prevalence of AI in settings where social desirability bias plays a role, such as education.
Towards Conversational Diagnostic Artificial Intelligence https://www.nature.com/articles/s41586-025-08866-7 Tu, T., Schaekermann, M., Palepu, A., Saab, K., Freyberg, J., Tanno, R., ... & Natarajan, V. (2025). Towards conversational diagnostic artificial intelligence. Nature, 642, 442-450. At the heart of medicine lies physician–patient dialogue, where skillful history-taking enables effective diagnosis, management and enduring trust. Artificial intelligence (AI) systems capable of diagnostic dialogue could increase accessibility and quality of care. However, approximating clinicians' expertise is an outstanding challenge. Here we introduce AMIE (Articulate Medical Intelligence Explorer), a large language model (LLM)-based AI system optimized for diagnostic dialogue. AMIE uses a self-play-based simulated environment with automated feedback for scaling learning across disease conditions, specialties and contexts. We designed a framework for evaluating clinically meaningful axes of performance, including history-taking, diagnostic accuracy, management, communication skills and empathy. We compared AMIE's performance to that of primary care physicians in a randomized, double-blind crossover study of text-based consultations with validated patient-actors similar to an objective structured clinical examination. The study included 159 case scenarios from providers in Canada, the United Kingdom and India, 20 primary care physicians compared to AMIE, and evaluations by specialist physicians and patient-actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 30 out of 32 axes according to the specialist physicians and 25 out of 26 axes according to the patient-actors. Our research has several limitations and should be interpreted with caution. Clinicians used synchronous text chat, which permits large-scale LLM–patient interactions, but this is unfamiliar in clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.
Demand for LLMs: Descriptive Evidence on Substitution, Market Expansion, and Multihoming https://arxiv.org/abs/2504.15440 Fradkin, A. (2025). Demand for LLMs: Descriptive Evidence on Substitution, Market Expansion, and Multihoming. arXiv preprint arXiv:2504.15440. This paper documents three stylized facts about the demand for Large Language Models (LLMs) using data from OpenRouter, a prominent LLM marketplace. First, new models experience rapid initial adoption that stabilizes within weeks. Second, model releases differ substantially in whether they primarily attract new users or substitute demand from competing models. Third, multi-homing—using multiple models simultaneously—is common among apps. These findings suggest significant horizontal and vertical differentiation in the LLM market, implying opportunities for providers to maintain demand and pricing power despite rapid technological advances.
The Leaderboard Illusion https://arxiv.org/abs/2504.20879 Singh, S., Nan, Y., Wang, A., D'Souza, D., Kapoor, S., Üstün, A., ... & Hooker, S. (2025). The Leaderboard Illusion. arXiv preprint arXiv:2504.20879. Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.
Effective and Scalable Math Support: Evidence on the Impact of an AI-Tutor on Math Achievement in Ghana https://arxiv.org/abs/2402.09809 Henkel, O., Horne-Robinson, H., Kozhakhmetova, N., & Lee, A. (2024). Effective and Scalable Math Support: Evidence on the Impact of an AI-Tutor on Math Achievement in Ghana. arXiv preprint arXiv:2402.09809. This study evaluates the impact of Rori, an AI-powered conversational math tutor accessible via WhatsApp, on the math performance of approximately 1,000 students in grades 3-9 across 11 schools in Ghana. Each school was assigned to a treatment group or control group; the students in the control group continued their regular math instruction, while students in the treatment group engaged with Rori for two 30-minute sessions per week over 8 months in addition to regular math instruction. We find that the math growth scores were substantially higher for the treatment group with an effect size of 0.37, and that the results were statistically significant (p < 0.001). The fact that Rori works with basic mobile devices on low-bandwidth data networks gives the intervention strong potential to support personalized learning in other low- and middle-income countries (LMICs), where laptop ownership and high-speed internet, prerequisites for many video-centered learning platforms, remain extremely limited. While the results should be interpreted judiciously, as they only report on year 1 of the intervention, and future research is necessary to better understand which conditions are necessary for successful implementation, they do suggest that chat-based tutoring solutions leveraging artificial intelligence could offer a cost-effective approach to enhancing learning outcomes for millions of students globally.
Instructors as Innovators: a Future-focused Approach to New AI Learning Opportunities, With Prompts https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4802463 Mollick, E. R., & Mollick, L. (2024). Instructors as Innovators: a Future-focused Approach to New AI Learning Opportunities, With Prompts. The Wharton School Research Paper. Available at SSRN 4802463. This paper explores how instructors can leverage generative AI to create personalized learning experiences for students that transform teaching and learning. We present a range of AI-based exercises that enable novel forms of practice and application including simulations, mentoring, coaching, and co-creation. For each type of exercise, we provide prompts that instructors can customize, along with guidance on classroom implementation, assessment, and risks to consider. We also provide blueprints, prompts that help instructors create their own original prompts. Instructors can leverage their content and pedagogical expertise to design these experiences, putting them in the role of builders and innovators. We argue that this instructor-driven approach has the potential to democratize the development of educational technology by enabling individual instructors to create AI exercises and tools tailored to their students' needs. While the exercises in this paper are a starting point, not definitive solutions, they demonstrate AI's potential to expand what is possible in education.
AI Tutoring Outperforms Active Learning https://www.researchgate.net/publication/380587627_AI_Tutoring_Outperforms_Active_Learning Kestin, G., Miller, K., Klales, A., Milbourne, T., & Ponti, G. (2024). AI Tutoring Outperforms Active Learning. Research Square. https://doi.org/10.21203/rs.3.rs-4243877/v1. Advances in generative artificial intelligence (GAI) show great potential for improving education. Yet little is known about how this new technology should be used and how effective it can be. Here we report a randomized, controlled study measuring college students' learning and their perceptions when content is presented through an AI-powered tutor compared with an active learning class. The AI tutor was developed with the same pedagogical best practices as the lectures. We find that students learn more than twice as much in less time when using an AI tutor, compared with the active learning class. They also feel more engaged and more motivated. These findings offer empirical evidence for the efficacy of a widely accessible AI-powered pedagogy in significantly enhancing learning outcomes, presenting a compelling case for its broad adoption in learning environments.
Can AI Change Your View? Evidence from a Large-Scale Online Field Experiment https://regmedia.co.uk/2025/04/29/supplied_can_ai_change_your_view.pdf ? Motivation. Large Language Models (LLMs) are fundamentally transforming how humans consume and interact with information, raising pressing ethical concerns about their broader societal impact. Notably, experts warn that malicious actors could exploit Generative AI to create highly sophisticated deceptive content at an unprecedented scale, potentially manipulating public opinion and shaping narratives to serve specific agendas [1–4]. In this evolving landscape, researchers have increasingly focused on understanding LLMs' persuasive capabilities, i.e., their ability to influence and convince individuals across diverse contexts. Early studies on AI-driven persuasion have shown that LLMs can match human performance [5–9] or even surpass it [10–12], including when dealing with highly divisive sociopolitical issues. Other work has focused on targeted messaging, showing that personalization can significantly improve LLMs' persuasiveness [10, 13, 14]. Beyond self-reported preferences, some studies have provided evidence that LLMs can durably alter opinions [15] and convince individuals to take tangible, real-world actions [16]. Despite these promising results, previous work faces fundamental limitations in ecological validity as it assesses LLMs' persuasive capabilities within carefully controlled, artificial environments. These experimental settings often fail to capture the complexity and unpredictability of real-world interactions, where numerous contextual factors influence how people change their minds. Moreover, many of these studies rely on online experiments involving crowdworkers—individuals who receive financial compensation and are aware of being observed—potentially introducing a range of biases [17–19]. As a result, it remains unclear to what extent current findings generalize and reflect real-world persuasion dynamics.
Competitive Programming with Large Reasoning Models https://arxiv.org/abs/2502.06807 El-Kishky, A., Wei, A., Saraiva, A., Minaiev, B., Selsam, D., Dohan, D., ... & Zhou, W. (2025). Competitive Programming with Large Reasoning Models. arXiv preprint arXiv:2502.06807. We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming.
Generative Artificial Intelligence and Evaluating Strategic Decisions https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4714776 Doshi, A. R., Bell, J. J., Mirzayev, E., & Vanneste, B. (2025). Generative Artificial Intelligence and Evaluating Strategic Decisions. Strategic Management Journal, 46(3). Strategic decisions are uncertain and often irreversible. Hence, predicting the value of alternatives is important for strategic decision making. We investigate the use of generative artificial intelligence (AI) in evaluating strategic alternatives using business models generated by AI (study 1) or submitted to a competition (study 2). Each study uses a sample of 60 business models and examines agreement in business model rankings made by large language models (LLMs) and those by human experts. We consider multiple LLMs, assumed LLM roles, and prompts. We find that generative AI often produces evaluations that are inconsistent and biased. However, when aggregating evaluations, AI rankings tend to resemble those of human experts. This study highlights the value of generative AI in evaluating strategic decisions.
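The aggregation result is easy to illustrate: average several noisy LLM evaluations per business model, then check rank agreement with expert scores. A toy sketch (each real study used 60 business models; the scores below are invented):

```python
from statistics import mean
from scipy.stats import spearmanr

# Hypothetical scores: three LLM evaluation runs per business model.
llm_runs = {
    "model_A": [6.0, 7.5, 5.5],
    "model_B": [8.0, 7.0, 8.5],
    "model_C": [4.0, 5.0, 4.5],
}
expert_scores = {"model_A": 6.5, "model_B": 9.0, "model_C": 5.0}

names = sorted(llm_runs)
aggregated = [mean(llm_runs[n]) for n in names]  # averaging tames noisy single runs
experts = [expert_scores[n] for n in names]

rho, p_value = spearmanr(aggregated, experts)    # agreement between the rankings
print(f"Spearman rho = {rho:.2f}")
```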
AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162111 Schwarcz, D., Manning, S., Barry, P., Cleveland, D. R., Prescott, J. J., & Rich, B. (2025). AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice. Available at SSRN 5162111. Generative AI is set to transform the legal profession, but its full impact remains uncertain. While AI models like GPT-4 improve the efficiency with which legal work can be completed, they can at times make up cases and "hallucinate" facts, thereby undermining legal judgment, particularly in complex tasks handled by skilled lawyers. This article examines two emerging AI innovations that may mitigate these lingering issues: Retrieval Augmented Generation (RAG), which grounds AI-powered analysis in legal sources, and AI reasoning models, which structure complex reasoning before generating output. We conducted the first randomized controlled trial assessing these technologies, assigning upper-level law students to complete six legal tasks using a RAG-powered legal AI tool (Vincent AI), an AI reasoning model (OpenAI's o1-preview), or no AI. We find that both AI tools significantly enhanced legal work quality, a marked contrast with previous research examining older large language models like GPT-4. Moreover, we find that these models maintain the efficiency benefits associated with use of older AI technologies. Our findings show that AI assistance significantly boosts productivity in five out of six tested legal tasks, with Vincent yielding statistically significant gains of approximately 38% to 115% and o1-preview increasing productivity by 34% to 140%, with particularly strong effects in complex tasks like drafting persuasive letters and analyzing complaints. Notably, o1-preview improved the analytical depth of participants' work product but resulted in some hallucinations, whereas Vincent AI-aided participants produced roughly the same amount of hallucinations as participants who did not use AI at all. These findings suggest that integrating domain-specific RAG capabilities with reasoning models could yield synergistic improvements, shaping the next generation of AI-powered legal tools and the future of lawyering more generally.
ChatGPT's role in alleviating anxiety in total knee arthroplasty consent process: a randomized controlled trial pilot study https://journals.lww.com/international-journal-of-surgery/fulltext/2025/03000/chatgpt_s_role_in_alleviating_anxiety_in_total.20.aspx Gan, W., Ouyang, J., She, G., Xue, Z., Zhu, L., Lin, A., ... & Zheng, X. (2025). ChatGPT's role in alleviating anxiety in total knee arthroplasty consent process: a randomized controlled trial pilot study. International Journal of Surgery, 111(3), 2546-2557. Background: Recent advancements in artificial intelligence (AI) like ChatGPT have expanded possibilities for patient education, yet its impact on perioperative anxiety in total knee arthroplasty (TKA) patients remains unexplored. Methods: In this single-blind, randomized controlled pilot study from April to July 2023, 60 patients were randomly allocated using sealed envelopes to either ChatGPT-assisted or traditional surgeon-led informed consent groups. In the ChatGPT group, physicians used ChatGPT 4.0 to provide standardized, comprehensive responses to patient queries during the consent process, while maintaining their role in interpreting and contextualizing the information. Outcomes were measured using Hospital Anxiety and Depression Scales (HADS), Perioperative Apprehension Scale-7 (PAS-7), Visual Analogue Scales for Anxiety and Pain (VAS-A, VAS-P), Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and satisfaction questionnaires. Results: Of 55 patients completing the study, the ChatGPT group showed significantly lower anxiety scores after informed consent (HADS-A: 10.48 ± 3.84 vs 12.75 ± 4.12, P = .04, Power = .67; PAS-7: 12.44 ± 3.70 vs 14.64 ± 2.11, P = .01, Power = .85; VAS-A: 5.40 ± 1.89 vs 6.71 ± 2.27, P = .02, Power = .75) and on the fifth postoperative day (HADS-A: 8.33 ± 3.20 vs 10.71 ± 3.83, P = .01, Power = .79; VAS-A: 3.41 ± 1.58 vs 4.64 ± 1.70, P = .008, Power = .85). The ChatGPT group also reported higher satisfaction with preoperative education (4.22 ± 0.51 vs 3.43 ± 0.84, P < .001, Power = .99) and overall hospitalization experience (4.11 ± 0.65 vs 3.46 ± 0.69, P = .001, Power = .97). No significant differences were found in depression scores, knee function, or pain levels. Conclusions: ChatGPT-assisted informed consent effectively reduced perioperative anxiety and improved patient satisfaction in TKA patients. While these preliminary findings are promising, larger studies are needed to validate these results and explore broader applications of AI in preoperative patient education.
Rewarding Chatbots for Real-World Engagement with Millions of Users https://arxiv.org/abs/2303.06135 Irvine, R., Boubert, D., Raina, V., Liusie, A., Zhu, Z., Mudupalli, V., ... & Beauchamp, W. (2023). Rewarding Chatbots for Real-World Engagement with Millions of Users. arXiv preprint arXiv:2303.06135. The emergence of pretrained large language models has led to the deployment of a range of social chatbots for chitchat. Although these chatbots demonstrate language ability and fluency, they are not guaranteed to be engaging and can struggle to retain users. This work investigates the development of social chatbots that prioritize user engagement to enhance retention, specifically examining the use of human feedback to efficiently develop highly engaging chatbots. The proposed approach uses automatic pseudo-labels collected from user interactions to train a reward model that can be used to reject low-scoring sample responses generated by the chatbot model at inference time. Intuitive evaluation metrics, such as mean conversation length (MCL), are introduced as proxies to measure the level of engagement of deployed chatbots. A/B testing on groups of 10,000 new daily chatbot users on the Chai Research platform shows that this approach increases the MCL by up to 70%, which translates to a more than 30% increase in user retention for a GPT-J 6B model. Future work aims to use the reward model to realise a data fly-wheel, where the latest user conversations can be used to alternately fine-tune the language model and the reward model.
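The inference-time rejection idea reduces to best-of-n sampling against a reward model. A sketch with stand-in components (the toy scorer and canned responses below are assumptions, not the Chai system):

```python
import random

def reward_model(response: str) -> float:
    """Stand-in scorer; the paper trains one on engagement pseudo-labels."""
    return float(len(response.split()))  # toy proxy: longer, chattier replies

def sample_candidates(prompt: str, n: int) -> list[str]:
    """Stand-in for drawing n responses from the chatbot language model."""
    stock = ["Hi!", "Hey, how was your day?",
             "Interesting. What happened next?", "Tell me more about that!"]
    return random.sample(stock, k=min(n, len(stock)))

def respond(prompt: str, n: int = 4) -> str:
    """Best-of-n: sample candidates, keep only the highest-reward reply."""
    return max(sample_candidates(prompt, n), key=reward_model)

print(respond("I had a rough day at work."))
```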
The Art of Audience Engagement: LLM-Based Thin-Slicing of Scientific Talks https://arxiv.org/pdf/2504.10768 Schmälzle, R., Lim, S., Du, Y., & Bente, G. (2025). The Art of Audience Engagement: LLM-Based Thin-Slicing of Scientific Talks. arXiv preprint arXiv:2504.10768. This paper examines the thin-slicing approach – the ability to make accurate judgments based on minimal information – in the context of scientific presentations. Drawing on research from nonverbal communication and personality psychology, we show that brief excerpts (thin slices) reliably predict overall presentation quality. Using a novel corpus of over one hundred real-life science talks, we employ Large Language Models (LLMs) to evaluate transcripts of full presentations and their thin slices. By correlating LLM-based evaluations of short excerpts with full-talk assessments, we determine how much information is needed for accurate predictions. Our results demonstrate that LLM-based evaluations align closely with human ratings, proving their validity, reliability, and efficiency. Critically, even very short excerpts (< 10% of a talk) strongly predict overall evaluations. This suggests that the first moments of a presentation convey relevant information that is used in quality evaluations and can shape lasting impressions. The findings are robust across different LLMs and prompting strategies. This work extends thin-slicing research to public speaking and connects theories of impression formation to LLMs and current research on AI communication. We discuss implications for communication and social cognition research on message reception. Lastly, we suggest an LLM-based thin-slicing framework as a scalable feedback tool to enhance human communication.
Artificial intelligence and dichotomania https://www.cambridge.org/core/journals/judgment-and-decision-making/article/artificial-intelligence-and-dichotomania/0421D2310727D73FAB47069FD1620AA1 McShane, B. B., Gal, D., & Duhachek, A. (2025). Artificial intelligence and dichotomania. Judgment and Decision Making, 20, e23. Large language models (LLMs) such as ChatGPT, Gemini, and Claude are increasingly being used in aid or place of human judgment and decision making. Indeed, academic researchers are increasingly using LLMs as a research tool. In this paper, we examine whether LLMs, like academic researchers, fall prey to a particularly common human error in interpreting statistical results, namely 'dichotomania' that results from the dichotomization of statistical results into the categories 'statistically significant' and 'statistically nonsignificant'. We find that ChatGPT, Gemini, and Claude fall prey to dichotomania at the 0.05 and 0.10 thresholds commonly used to declare 'statistical significance'. In addition, prompt engineering with principles taken from an American Statistical Association Statement on Statistical Significance and P-values intended as a corrective to human errors does not mitigate this and arguably exacerbates it. Further, more recent and larger versions of these models do not necessarily perform better. Finally, these models sometimes provide interpretations that are not only incorrect but also highly erratic.
Measuring Human Leadership Skills with AI Agents https://www.nber.org/papers/w33662 Weidmann, B., Xu, Y., & Deming, D. J. (2025). Measuring Human Leadership Skills with AI Agents. NBER Working Paper No. 33662. We show that leadership skill with artificially intelligent (AI) agents predicts leadership skill with human groups. In a large pre-registered lab experiment, human leaders worked with AI agents to solve problems. Their performance on this "AI leadership test" was strongly correlated (ρ=0.81) with their causal impact as leaders of human teams, which we estimate by repeatedly randomly assigning leaders to groups of human followers and measuring team performance. Successful leaders of both humans and AI agents ask more questions and engage in more conversational turn-taking; they score higher on measures of social intelligence, fluid intelligence, and decision-making skill, but do not differ in gender, age, ethnicity or education. Our findings indicate that AI agents can be effective proxies for human participants in social experiments, which greatly simplifies the measurement of leadership and teamwork skills.
The power of generative marketing: Can generative AI create superhuman visual marketing content? https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4597899 Hartmann, J., Exner, Y., & Domdey, S. (2024). The power of generative marketing: Can generative AI create superhuman visual marketing content? International Journal of Research in Marketing, Forthcoming. Generative AI's capacity to create photorealistic images has the potential to augment human creativity and disrupt the economics of visual marketing content production. This research systematically compares the performance of AI-generated to human-made marketing images across important marketing dimensions. First, we prompt seven state-of-the-art generative text-to-image models (DALL-E 3, Midjourney v6, Firefly 2, Imagen 2, Imagine, Realistic Vision, and Stable Diffusion XL Turbo) to create 10,320 synthetic marketing images, using 2,400 real-world, human-made images as input. 254,400 human evaluations of these images show that AI-generated marketing imagery can surpass human-made images in quality, realism, and aesthetics. Second, we give identical creative briefings to commissioned human freelancers and the AI models, showing that the best synthetic images also excel in ad creativity, ad attitudes, and prompt following. Third, a field study with more than 173,000 impressions demonstrates that AI-generated banner ads can compete with professional human-made stock photography, achieving an up to 50% higher click-through rate than a human-made image. Collectively, our findings suggest that the paradigm shift brought about by generative AI can help advertisers produce marketing content not only faster and orders of magnitude cheaper but also at superhuman effectiveness levels with important implications for firms, consumers, and policymakers. To facilitate future research on AI-generated marketing imagery, we release "GenImageNet" that contains all of our synthetic images and their human ratings.
Medical Hallucinations in Foundation Models and Their Impact on Healthcare https://arxiv.org/abs/2503.05777 Kim, Y., Jeong, H., Chen, S., Li, S. S., Lu, M., Alhamoud, K., ... & Breazeal, C. (2025). Medical Hallucinations in Foundation Models and Their Impact on Healthcare. arXiv preprint arXiv:2503.05777. Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using a medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety.
Large Language Models Pass the Turing Test https://arxiv.org/pdf/2503.23674 Jones, C. R., & Bergen, B. K. (2025). Large Language Models Pass the Turing Test. arXiv preprint arXiv:2503.23674. We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5-minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time—not significantly more or less often than the humans they were being compared to—while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.
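The "significantly more often than chance" claims here amount to binomial tests on win counts. A sketch with an assumed number of trials (the study's exact counts are not reproduced):

```python
from scipy.stats import binomtest

# Hypothetical: the model is judged "human" 146 times in 200 interrogations
# (a 73% win rate); chance in a two-option judgment is 50%.
result = binomtest(k=146, n=200, p=0.5, alternative="greater")
print(f"win rate = {146 / 200:.0%}, p = {result.pvalue:.2e}")
```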
Prompting Science Report 1: Prompt Engineering is Complicated and Contingent https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5165270 Meincke, L., Mollick, E. R., Mollick, L., & Shapiro, D. (2025). Prompting Science Report 1: Prompt Engineering is Complicated and Contingent. Available at SSRN 5165270. This is the first of a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we demonstrate two things: - There is no single standard for measuring whether a Large Language Model (LLM) passes a benchmark, and that choosing a standard has a big impact on how well the LLM does on that benchmark. The standard you choose will depend on your goals for using an LLM in a particular case. - It is hard to know in advance whether a particular prompting approach will help or harm the LLM's ability to answer any particular question. Specifically, we find that sometimes being polite to the LLM helps performance, and sometimes it lowers performance. We also find that constraining the AI's answers helps performance in some cases, though it may lower performance in other cases. Taken together, this suggests that benchmarking AI performance is not one-size-fits-all, and also that particular prompting formulas or approaches, like being polite to the AI, are not universally valuable. Prompting Science Report 1 Prompt En.txt
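The report's first point, that the choice of grading standard moves the score, is easy to make concrete. A toy comparison of three common standards applied to the same sampled answers (data invented for illustration):

```python
from collections import Counter

# Five sampled answers per question, plus the ground truth (all invented).
samples = {
    "q1": (["5", "4", "4", "4", "4"], "4"),
    "q2": (["7", "9", "7", "8", "7"], "8"),
    "q3": (["a", "b", "a", "a", "c"], "a"),
}

def first_sample(answers, truth):   # strict: grade only the first attempt
    return answers[0] == truth

def majority_vote(answers, truth):  # consensus across all samples
    return Counter(answers).most_common(1)[0][0] == truth

def any_correct(answers, truth):    # lenient: credit if any sample matches
    return truth in answers

for standard in (first_sample, majority_vote, any_correct):
    accuracy = sum(standard(a, t) for a, t in samples.values()) / len(samples)
    print(f"{standard.__name__}: {accuracy:.0%}")  # 33%, 67%, 100% on this toy set
```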
Creative Preference Optimization https://arxiv.org/abs/2505.14442 Ismayilzada, M., Laverghetta Jr., A., Luchini, S. A., Patel, R., Bosselut, A., van der Plas, L., & Beaty, R. (2025). Creative Preference Optimization. arXiv preprint arXiv:2505.14442. While Large Language Models (LLMs) have demonstrated impressive performance across natural language generation tasks, their ability to generate truly creative content-characterized by novelty, diversity, surprise, and quality-remains limited. Existing methods for enhancing LLM creativity often focus narrowly on diversity or specific tasks, failing to address creativity's multifaceted nature in a generalizable way. In this work, we propose Creative Preference Optimization (CrPO), a novel alignment method that injects signals from multiple creativity dimensions into the preference optimization objective in a modular fashion. We train and evaluate creativity-augmented versions of several models using CrPO and MuCE, a new large-scale human preference dataset spanning over 200,000 human-generated responses and ratings from more than 30 psychological creativity assessments. Our models outperform strong baselines, including GPT-4o, on both automated and human evaluations, producing more novel, diverse, and surprising generations while maintaining high output quality. Additional evaluations on NoveltyBench further confirm the generalizability of our approach. Together, our results demonstrate that directly optimizing for creativity within preference frameworks is a promising direction for advancing the creative capabilities of LLMs without compromising output quality. Creative Preference Optimization.txt
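One modular way to fold several creativity dimensions into a preference objective is to score candidates per dimension, combine with weights, and emit (chosen, rejected) pairs for a DPO-style trainer. The weights and data below are illustrative assumptions, not the paper's recipe:

```python
# Illustrative sketch of building multi-dimensional creativity preference
# pairs; dimension names and weights are assumptions, not CrPO's exact setup.

WEIGHTS = {"novelty": 0.3, "diversity": 0.2, "surprise": 0.2, "quality": 0.3}

def creativity_score(ratings: dict[str, float]) -> float:
    """Weighted sum over creativity dimensions (each rated 0-1)."""
    return sum(WEIGHTS[dim] * ratings[dim] for dim in WEIGHTS)

def preference_pairs(prompt: str, candidates: list[dict]) -> list[dict]:
    """Rank candidates by combined score; pair every higher with every lower."""
    ranked = sorted(candidates, key=lambda c: creativity_score(c["ratings"]),
                    reverse=True)
    return [{"prompt": prompt, "chosen": ranked[i]["text"], "rejected": ranked[j]["text"]}
            for i in range(len(ranked)) for j in range(i + 1, len(ranked))]

candidates = [
    {"text": "A clock that runs on regret.",
     "ratings": {"novelty": 0.9, "diversity": 0.7, "surprise": 0.9, "quality": 0.6}},
    {"text": "A clock that tells the time.",
     "ratings": {"novelty": 0.1, "diversity": 0.2, "surprise": 0.1, "quality": 0.9}},
]
for pair in preference_pairs("Invent an unusual product.", candidates):
    print(pair["chosen"], ">", pair["rejected"])
```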